# Pandas

## Installation

In [155]:
#pip install pandas --upgrade --user


## Importing convention

In [2]:
import pandas as pd
import numpy as np

# What is Pandas?

Pandas can be thought of as an enhanced version of numpy arrays: rows and columns can be identified with labels instead of just simple integer indices.

There are **three** main pandas elements we **need** to understand.
1. Pandas Series
2. Pandas DataFrame
3. Index
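
As a rough sketch of how these three pieces relate to each other and to a numpy array (the names `heights` and `people` and the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# A plain numpy array: elements are reachable only by integer position.
arr = np.array([177, 175])

# A Series: the same values, plus a label for each one.
heights = pd.Series([177, 175], index=["JOAO", "ANDRE"])

# A DataFrame: several Series sharing one row index.
people = pd.DataFrame({"altura": heights,
                       "idade": pd.Series({"JOAO": 25, "ANDRE": 28})})

# The Index: the object that holds the row (or column) labels.
row_labels = people.index

print(arr[0], heights["JOAO"], people.loc["JOAO", "idade"], list(row_labels))
```

Each element is covered in its own section of this notebook.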

# The Pandas Series

A pandas series is a one-dimensional (**1-D**) indexed array.

In [3]:
pd.Series()

Series([], dtype: float64)

## Creating a pandas Series from a list

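You'll start to recognize a pandas Series by its visual layout: the index on the left, the values on the right, and the `dtype` at the bottom. Compare it with a numpy array built from the same list: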

In [7]:
np.array([5,8,3])

array([5, 8, 3])

In [10]:
pd.Series([5,8,3])

0    5
1    8
2    3
dtype: int64

`dtype` means the `data type` of what is inside your pandas Series.

In [14]:
# As in lists, you don't need to have all of the same type inside a pandas series
pd.Series(['Valor nao computado',2, 3])

0    Valor nao computado
1                      2
2                      3
dtype: object

When you see `dtype: object`, it usually means the `Series` holds at least one `str` (or a mix of Python types).
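
A small sketch of how to check this yourself: the `dtype` describes the Series as a whole, while each element keeps its own Python type. The variable names are illustrative only.

```python
import pandas as pd

mixed = pd.Series(["Valor nao computado", 2, 3])

# The Series as a whole is stored with the generic `object` dtype...
print(mixed.dtype)

# ...but each element keeps its own Python type.
element_types = [type(value).__name__ for value in mixed]
print(element_types)

# An all-integer Series gets a numeric dtype instead.
numbers = pd.Series([5, 8, 3])
print(numbers.dtype)
```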

In [15]:
data = pd.Series([10,23,3,43,25,136])

In [16]:
data

0     10
1     23
2      3
3     43
4     25
5    136
dtype: int64

In [17]:
type(data)

pandas.core.series.Series

So the `type` of `data` is `pandas.core.series.Series`, and the data stored inside the Series has dtype `int64`.

## Accessing elements 

Elements can be accessed by index and sliced, much like in a numpy array, although negative indices behave differently.

In [18]:
data

0     10
1     23
2      3
3     43
4     25
5    136
dtype: int64

In [19]:
data[0]

10

In [167]:
data[3:]

3     43
4     25
5    136
dtype: int64

In [20]:
data[-1]

KeyError: -1

In [21]:
{'Pedro':2}['Raiana']

KeyError: 'Raiana'
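
The two `KeyError`s above have the same cause: with an integer index, `data[-1]` is treated as a label lookup, and `-1` is not one of the labels. A minimal sketch of position-based access using `.iloc`, a standard pandas accessor that is introduced for DataFrames later in this notebook:

```python
import numpy as np
import pandas as pd

values = [10, 23, 3, 43, 25, 136]
arr = np.array(values)
data = pd.Series(values)

# numpy: -1 always means "last position".
last_from_array = arr[-1]

# pandas: the labels are 0..5, so -1 is not a valid label.
has_label_minus_one = -1 in data.index

# .iloc always works by position, so negative positions are fine.
last_from_series = data.iloc[-1]

print(last_from_array, has_label_minus_one, last_from_series)
```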

In short: a pandas Series can be thought of as a 1-D numpy array.

### What is the difference then? Numpy array vs Pandas Series

Mostly the index notation.

Numpy arrays only have the **implicit** index given by each element's position. By adding an **explicit** index, pandas Series are much more flexible. For example:

## The index doesn't need to be numeric

In [23]:
pd.Series([5,8,2] , index =['Isabela', 'Rebeca','Carlos'])

Isabela    5
Rebeca     8
Carlos     2
dtype: int64

In [25]:
my_series = pd.Series(data=[1,2,3,5,7,9], index=['Andre','Joao','Andre','Vampirão','Tieko','Satiko']) #index argument
my_series

Andre       1
Joao        2
Andre       3
Vampirão    5
Tieko       7
Satiko      9
dtype: int64

## The index doesn't need to be in sequence

In [41]:
data = pd.Series(data=[1,2,3,4], 
                 index=[1,7,4313,19])

In [33]:
data_reverse = pd.Series(data=[1,2,3,4], index = list(range(4,0,-1)))
data_reverse

4    1
3    2
2    3
1    4
dtype: int64

In [42]:
data

1       1
7       2
4313    3
19      4
dtype: int64

### Then how can I access these pandas series?

In [43]:
my_series

Andre       1
Joao        2
Andre       3
Vampirão    5
Tieko       7
Satiko      9
dtype: int64

In [44]:
my_series[1]

2

In [45]:
my_series['Joao']

2

In [50]:
my_series['Andre']

Andre    1
Andre    3
dtype: int64

In [179]:
my_series['Andre'].mean()

2.0

**NOTE:** One can think of a pandas Series, then, as a kind of dictionary in which the index labels are the keys and the entries are the values. Unlike dictionary keys, index labels may repeat, as `'Andre'` does here.
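
A short sketch of the dictionary analogy, using a small made-up Series named `ages`:

```python
import pandas as pd

ages = pd.Series({"JOAO": 25, "ANDRE": 28})

# Membership tests look at the index, just like dict keys.
print("JOAO" in ages)   # True
print(25 in ages)       # False: 25 is a value, not a key

# keys() and items() mirror the dict interface.
print(list(ages.keys()))
print(list(ages.items()))

# Unlike a dict, a Series also supports vectorized arithmetic.
print((ages + 1).to_dict())
```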

In [51]:
my_series

Andre       1
Joao        2
Andre       3
Vampirão    5
Tieko       7
Satiko      9
dtype: int64

In [52]:
my_series.keys()

Index(['Andre', 'Joao', 'Andre', 'Vampirão', 'Tieko', 'Satiko'], dtype='object')

In [53]:
my_series.values

array([1, 2, 3, 5, 7, 9], dtype=int64)

In [54]:
my_series * 3

Andre        3
Joao         6
Andre        9
Vampirão    15
Tieko       21
Satiko      27
dtype: int64

## Creating a pandas series from a dict.

In [55]:
my_dict = {'JOAO': 20, 
           'ANDRE':10}

In [56]:
my_dict

{'JOAO': 20, 'ANDRE': 10}

In [57]:
pd.Series(my_dict)

JOAO     20
ANDRE    10
dtype: int64

But what about > 1-D?


# Pandas DataFrame


Pandas DataFrames can be thought of as a generalization of **2-D** numpy arrays. Again, they add flexibility: both the rows (index) and the columns carry labels.

In [59]:
pd.DataFrame()

In [60]:
type(pd.DataFrame())

pandas.core.frame.DataFrame

## A pandas DataFrame can be thought of as a group of pandas Series

In [61]:
my_dict = {'JOAO': 25, 
           'ANDRE':28}

data = pd.Series(my_dict)

In [62]:
data

JOAO     25
ANDRE    28
dtype: int64

In [63]:
another_dict = {'JOAO': 177,'ANDRE': 175}

data_2 = pd.Series(another_dict)

In [64]:
data_2

JOAO     177
ANDRE    175
dtype: int64

In [65]:
new_dict = {'Idade':data, 'Altura' : data_2}

In [66]:
new_dict

{'Idade': JOAO     25
 ANDRE    28
 dtype: int64,
 'Altura': JOAO     177
 ANDRE    175
 dtype: int64}

In [67]:
pd.DataFrame(new_dict)

Unnamed: 0,Idade,Altura
JOAO,25,177
ANDRE,28,175


# Create dataframe as a collection of Series

In [68]:
{'idade':data, 'altura':data_2}

{'idade': JOAO     25
 ANDRE    28
 dtype: int64,
 'altura': JOAO     177
 ANDRE    175
 dtype: int64}

In [84]:
my_dataframe = pd.DataFrame({'idade':data, 'altura':data_2})
my_dataframe 

Unnamed: 0,idade,altura
JOAO,25,177
ANDRE,28,175


In [70]:
pd.DataFrame([[3,5,2],[5,8,9]],index = ['Primeiro','Segundo'],columns=['A','B','C'])

Unnamed: 0,A,B,C
Primeiro,3,5,2
Segundo,5,8,9


In [71]:
matriz = np.random.randint(1,10,(5,8))
pd.DataFrame(matriz)

Unnamed: 0,0,1,2,3,4,5,6,7
0,7,1,2,5,2,7,6,6
1,8,4,2,1,6,2,3,9
2,1,1,9,9,2,3,3,6
3,3,7,5,7,6,6,8,8
4,4,8,3,7,8,5,9,8


**NOTE:** So a DataFrame can be thought of as a dictionary in which the `keys` are the `column names` and the `values` are the `pandas Series` themselves.
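
A minimal sketch of this dictionary view, rebuilding a frame like `my_dataframe` under the illustrative name `people`:

```python
import pandas as pd

people = pd.DataFrame(
    {"idade": [25, 28], "altura": [177, 175]},
    index=["JOAO", "ANDRE"],
)

# The "keys" of a DataFrame are its column names.
print(list(people.keys()))
print("idade" in people)

# Iterating with items() yields (column name, Series) pairs.
column_types = {name: type(column).__name__ for name, column in people.items()}
print(column_types)
```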

In [72]:
my_dataframe

Unnamed: 0,idade,altura
JOAO,25,177
ANDRE,28,175


In [85]:
my_dataframe['idade'] 

JOAO     25
ANDRE    28
Name: idade, dtype: int64

In [86]:
my_dataframe.idade

JOAO     25
ANDRE    28
Name: idade, dtype: int64

# `Access` methods: accessing DataFrame rows and columns

These are the recommended ways to access data in a DataFrame. You can specify both row and column, or only the row.

## `dataframe.loc[row_name, col_name]`

In [87]:
my_dataframe

Unnamed: 0,idade,altura
JOAO,25,177
ANDRE,28,175


In [92]:
my_dataframe['idade']['JOAO']

26

In [101]:
my_dataframe.loc['JOAO', 'idade']

26

In [96]:
my_dataframe.loc[:,['idade']]

Unnamed: 0,idade
JOAO,26
ANDRE,28


In [97]:
my_dataframe.loc['ANDRE', 'altura']

175

In [213]:
my_dataframe

Unnamed: 0,idade,altura
JOAO,25,177
ANDRE,28,175


In [103]:
my_dataframe.loc['JOAO':,:]

Unnamed: 0,idade,altura
JOAO,26,177
ANDRE,28,175


## `dataframe.iloc[row_number, col_number]`

In [110]:
my_dataframe

Unnamed: 0,idade,altura
JOAO,26,177
ANDRE,28,175


In [118]:
my_dataframe.iloc[0, :]

idade      26
altura    177
Name: JOAO, dtype: int64

In [113]:
my_dataframe.iloc[1, 1]

175

In [114]:
my_dataframe.iloc[-1, 1]

175

What is the difference between selecting a column via `my_dataframe['idade']` and via `my_dataframe.loc[:, 'idade']`?

In [116]:
#my_dataframe['idade']
my_dataframe.loc[:, 'idade']

JOAO     26
ANDRE    28
Name: idade, dtype: int64
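
One way to explore the question above, as a sketch on a made-up copy of the frame called `people`: for reading, both forms give the same Series. The practical difference shows up when writing, where a single `.loc` call names the row and the column together.

```python
import pandas as pd

people = pd.DataFrame(
    {"idade": [25, 28], "altura": [177, 175]},
    index=["JOAO", "ANDRE"],
)

# Reading: both expressions return the same Series.
same_result = people["idade"].equals(people.loc[:, "idade"])
print(same_result)

# Writing: a single .loc call addresses row and column at once,
# so the assignment reaches the DataFrame itself.
people.loc["JOAO", "idade"] = 26
print(people.loc["JOAO", "idade"])
```

The pandas documentation advises against the chained form `people["idade"]["JOAO"] = ...` for assignment, because it performs two separate lookups.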

# Creating dataframes

## From a list in 1-D

In [220]:
my_list = [1,2,3]

In [221]:
np.array(my_list)

array([1, 2, 3])

In [222]:
np.array(my_list).shape

(3,)

In [223]:
pd.Series(my_list)

0    1
1    2
2    3
dtype: int64

In [224]:
pd.DataFrame(data=my_list)

Unnamed: 0,0
0,1
1,2
2,3


In [225]:
pd.DataFrame(data=my_list, columns=['notas'], index=['Andre','Maria','Joao'])

Unnamed: 0,notas
Andre,1
Maria,2
Joao,3


## From a list in > 1-D (let's remember numpy arrays here!)

In [119]:
my_list = [[1,2,3],
           [-5,-6,-7]]

In [120]:
np.array(my_list)

array([[ 1,  2,  3],
       [-5, -6, -7]])

In [121]:
np.array(my_list).shape

(2, 3)

In [122]:
df = pd.DataFrame(data=my_list, columns=['idade','peso','altura'])
df

Unnamed: 0,idade,peso,altura
0,1,2,3
1,-5,-6,-7


In [230]:
df.shape

(2, 3)

## From a dictionary composed of lists

In [123]:
pd.DataFrame({'ironhack_students': ['a','b','c'],
              'NOTA':[10, 10, None]})

Unnamed: 0,ironhack_students,NOTA
0,a,10.0
1,b,10.0
2,c,
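
The `NOTA` column shows `10.0` rather than `10` because the missing value is stored as `NaN`, which is a float. A small sketch of inspecting this (the variable name `grades` is illustrative):

```python
import pandas as pd

grades = pd.DataFrame({"ironhack_students": ["a", "b", "c"],
                       "NOTA": [10, 10, None]})

# None in a numeric column becomes NaN, so the column is stored as float.
print(grades["NOTA"].dtype)

# isna() flags the missing entries.
missing_count = int(grades["NOTA"].isna().sum())
print(missing_count)

# Aggregations skip NaN by default.
print(grades["NOTA"].mean())
```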


## From a numpy array

In [124]:
a = np.random.random(size=(5, 4))
a

array([[1.09769905e-01, 8.23560212e-01, 4.72655129e-01, 1.09701124e-01],
       [2.62275024e-01, 4.88557230e-01, 4.26120865e-01, 9.20615012e-02],
       [2.18933952e-01, 6.04818926e-01, 5.22181674e-01, 4.77412298e-01],
       [6.66762515e-01, 9.19310672e-01, 6.49391597e-04, 7.92251592e-01],
       [2.77998483e-01, 3.88348172e-01, 1.63666990e-01, 2.90784435e-01]])

In [125]:
data = pd.DataFrame(a, columns=['altura', 'peso', 'idade', 'largura'])

In [126]:
data

Unnamed: 0,altura,peso,idade,largura
0,0.10977,0.82356,0.472655,0.109701
1,0.262275,0.488557,0.426121,0.092062
2,0.218934,0.604819,0.522182,0.477412
3,0.666763,0.919311,0.000649,0.792252
4,0.277998,0.388348,0.163667,0.290784


### Accessing rows and columns:

#### `.loc`

Remember: `.loc` takes `[row_name, column_name]`.

In [128]:
data

Unnamed: 0,altura,peso,idade,largura
0,0.10977,0.82356,0.472655,0.109701
1,0.262275,0.488557,0.426121,0.092062
2,0.218934,0.604819,0.522182,0.477412
3,0.666763,0.919311,0.000649,0.792252
4,0.277998,0.388348,0.163667,0.290784


In [129]:
# get third row and column `peso`
data.loc[2,'peso']

0.60481892625221

In [132]:
# get entire third row
#data.loc[2,['altura','peso','idade','largura']]
#data.loc[2,:]
data.loc[[2],:]

Unnamed: 0,altura,peso,idade,largura
2,0.218934,0.604819,0.522182,0.477412


In [136]:
# get entire `idade` column
data.loc[:,['idade']]

Unnamed: 0,idade
0,0.472655
1,0.426121
2,0.522182
3,0.000649
4,0.163667


In [138]:
# get all rows from column `peso` up to `largura`
#data.loc[:,'peso':'largura']
#data.loc[:,['peso','idade','largura']]
data.loc[:,'peso':'largura']

Unnamed: 0,peso,idade,largura
0,0.82356,0.472655,0.109701
1,0.488557,0.426121,0.092062
2,0.604819,0.522182,0.477412
3,0.919311,0.000649,0.792252
4,0.388348,0.163667,0.290784


In [106]:
data.loc[:3:2, 'peso':]

Unnamed: 0,peso,idade,largura
0,0.034074,0.022657,0.095354
2,0.234287,0.496448,0.310047


#### `.iloc`

In [140]:
data.iloc[0:4, :]
data.loc[0:3,:]

Unnamed: 0,altura,peso,idade,largura
0,0.10977,0.82356,0.472655,0.109701
1,0.262275,0.488557,0.426121,0.092062
2,0.218934,0.604819,0.522182,0.477412
3,0.666763,0.919311,0.000649,0.792252
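
The two lines in the cell above select the same rows even though the slices look different: `.iloc` excludes the stop position, while `.loc` includes the stop label. A sketch with a deterministic table (the values are made up so the result is reproducible):

```python
import numpy as np
import pandas as pd

table = pd.DataFrame(np.arange(20).reshape(5, 4),
                     columns=["altura", "peso", "idade", "largura"])

# .iloc slices by position and excludes the stop, like Python lists.
by_position = table.iloc[0:4, 1:4]

# .loc slices by label and includes the stop label.
by_label = table.loc[0:3, "peso":"largura"]

print(by_position.equals(by_label))
print(by_label.shape)
```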


In [108]:
data.iloc[0:4, 1:4]
data.loc[0:3,'peso':'largura']

Unnamed: 0,peso,idade,largura
0,0.034074,0.022657,0.095354
1,0.520528,0.004815,0.976007
2,0.234287,0.496448,0.310047
3,0.732967,0.733775,0.321665


In [109]:
data

Unnamed: 0,altura,peso,idade,largura
0,0.205258,0.034074,0.022657,0.095354
1,0.512434,0.520528,0.004815,0.976007
2,0.024348,0.234287,0.496448,0.310047
3,0.587896,0.732967,0.733775,0.321665
4,0.511667,0.723901,0.651273,0.335406


In [110]:
data.iloc[-1, :]

altura     0.511667
peso       0.723901
idade      0.651273
largura    0.335406
Name: 4, dtype: float64

## Math operations

In [141]:
data = np.random.random(size=(8, 4))

In [142]:
df = pd.DataFrame(data, columns=['Andre','Maria','Joao','Vamp'])

In [237]:
#df['Nova'] = ['oi']*8

In [143]:
df

Unnamed: 0,Andre,Maria,Joao,Vamp
0,0.292172,0.552395,0.714622,0.282644
1,0.283373,0.56928,0.042794,0.990881
2,0.846541,0.389849,0.177649,0.817783
3,0.873443,0.311711,0.84564,0.817717
4,0.982753,0.236437,0.66192,0.739851
5,0.537415,0.170579,0.381069,0.796863
6,0.97167,0.844352,0.035876,0.206618
7,0.297253,0.10107,0.172202,0.852029


In [144]:
data

array([[0.29217199, 0.55239546, 0.71462167, 0.28264447],
       [0.28337271, 0.56928015, 0.04279394, 0.99088118],
       [0.84654087, 0.38984875, 0.177649  , 0.81778316],
       [0.87344349, 0.31171149, 0.84564011, 0.81771652],
       [0.98275342, 0.23643716, 0.66192   , 0.73985064],
       [0.53741506, 0.17057919, 0.38106875, 0.79686316],
       [0.97166951, 0.84435176, 0.03587626, 0.2066184 ],
       [0.29725277, 0.10107015, 0.17220193, 0.85202894]])

In [148]:
data.mean(axis=0)

array([0.63557748, 0.39695926, 0.37897146, 0.68804831])

In [149]:
df.mean()

Andre    0.635577
Maria    0.396959
Joao     0.378971
Vamp     0.688048
dtype: float64

In [245]:
df.mean(axis=1)

0    0.609521
1    0.599466
2    0.411960
3    0.213729
4    0.549319
5    0.702126
6    0.464654
7    0.636557
dtype: float64

In [150]:
df.std(axis=1)

0    0.210547
1    0.407635
2    0.328474
3    0.267912
4    0.310840
5    0.263905
6    0.462485
7    0.340718
dtype: float64
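
The `axis` argument decides which dimension is collapsed. A tiny sketch with hand-picked numbers (the frame `scores` is made up) makes the direction easy to verify by eye:

```python
import pandas as pd

scores = pd.DataFrame({"Andre": [1.0, 3.0], "Maria": [2.0, 6.0]})

# axis=0 (the default) collapses the rows: one result per column.
per_column = scores.mean(axis=0)

# axis=1 collapses the columns: one result per row.
per_row = scores.mean(axis=1)

print(per_column.to_dict())
print(per_row.to_dict())
```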

In [151]:
df

Unnamed: 0,Andre,Maria,Joao,Vamp
0,0.292172,0.552395,0.714622,0.282644
1,0.283373,0.56928,0.042794,0.990881
2,0.846541,0.389849,0.177649,0.817783
3,0.873443,0.311711,0.84564,0.817717
4,0.982753,0.236437,0.66192,0.739851
5,0.537415,0.170579,0.381069,0.796863
6,0.97167,0.844352,0.035876,0.206618
7,0.297253,0.10107,0.172202,0.852029


In [154]:
df.describe()

Unnamed: 0,Andre,Maria,Joao,Vamp
count,8.0,8.0,8.0,8.0
mean,0.635577,0.396959,0.378971,0.688048
std,0.316375,0.246663,0.321718,0.283573
min,0.283373,0.10107,0.035876,0.206618
25%,0.295983,0.219973,0.13985,0.625549
50%,0.691978,0.35078,0.279359,0.80729
75%,0.898,0.556617,0.675095,0.826345
max,0.982753,0.844352,0.84564,0.990881


In [248]:
df.T

Unnamed: 0,0,1,2,3,4,5,6,7
Andre,0.614172,0.747136,0.561699,0.172714,0.347979,0.480251,0.201474,0.754445
Maria,0.84028,0.001302,0.91054,0.21756,0.813017,0.784102,0.693475,0.569147
Joao,0.249989,0.650419,0.090623,0.06888,0.316574,0.895065,0.14349,0.989829
Vamp,0.733642,0.999007,0.084976,0.395762,0.719704,0.649087,0.82018,0.232808


In [155]:
df['Valido'] = np.random.choice(['Não Valido','Valido'],size=8)

In [161]:
df

Unnamed: 0,Andre,Maria,Joao,Vamp,Valido
0,0.292172,0.552395,0.714622,0.282644,Valido
1,0.283373,0.56928,0.042794,0.990881,Valido
2,0.846541,0.389849,0.177649,0.817783,Valido
3,0.873443,0.311711,0.84564,0.817717,Valido
4,0.982753,0.236437,0.66192,0.739851,Não Valido
5,0.537415,0.170579,0.381069,0.796863,Valido
6,0.97167,0.844352,0.035876,0.206618,Não Valido
7,0.297253,0.10107,0.172202,0.852029,Valido


In [165]:
df['Nulos'] = np.nan

In [166]:
df

Unnamed: 0,Andre,Maria,Joao,Vamp,Valido,Nulos
0,0.292172,0.552395,0.714622,0.282644,Valido,
1,0.283373,0.56928,0.042794,0.990881,Valido,
2,0.846541,0.389849,0.177649,0.817783,Valido,
3,0.873443,0.311711,0.84564,0.817717,Valido,
4,0.982753,0.236437,0.66192,0.739851,Não Valido,
5,0.537415,0.170579,0.381069,0.796863,Valido,
6,0.97167,0.844352,0.035876,0.206618,Não Valido,
7,0.297253,0.10107,0.172202,0.852029,Valido,


In [167]:
df.describe()

Unnamed: 0,Andre,Maria,Joao,Vamp,Nulos
count,8.0,8.0,8.0,8.0,0.0
mean,0.635577,0.396959,0.378971,0.688048,
std,0.316375,0.246663,0.321718,0.283573,
min,0.283373,0.10107,0.035876,0.206618,
25%,0.295983,0.219973,0.13985,0.625549,
50%,0.691978,0.35078,0.279359,0.80729,
75%,0.898,0.556617,0.675095,0.826345,
max,0.982753,0.844352,0.84564,0.990881,


In [168]:
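# numeric_only=True skips the text column `Valido`, which cannot be added to floats
df['Total'] = df.sum(axis=1, numeric_only=True)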
df


Unnamed: 0,Andre,Maria,Joao,Vamp,Valido,Nulos,Total
0,0.292172,0.552395,0.714622,0.282644,Valido,,1.841834
1,0.283373,0.56928,0.042794,0.990881,Valido,,1.886328
2,0.846541,0.389849,0.177649,0.817783,Valido,,2.231822
3,0.873443,0.311711,0.84564,0.817717,Valido,,2.848512
4,0.982753,0.236437,0.66192,0.739851,Não Valido,,2.620961
5,0.537415,0.170579,0.381069,0.796863,Valido,,1.885926
6,0.97167,0.844352,0.035876,0.206618,Não Valido,,2.058516
7,0.297253,0.10107,0.172202,0.852029,Valido,,1.422554


In [128]:
df['Total'] = df['Andre'] +df['Maria']+df['Joao']+ df['Vamp']
df

Unnamed: 0,Andre,Maria,Joao,Vamp,Valido,Total
0,0.665483,0.293242,0.284131,0.383812,Valido,1.626668
1,0.625281,0.686599,0.251624,0.663646,Não Valido,2.227149
2,0.252446,0.157559,0.644287,0.801472,Não Valido,1.855765
3,0.655291,0.316513,0.253294,0.660937,Valido,1.886035
4,0.751603,0.01913,0.102802,0.792507,Valido,1.666042
5,0.513695,0.681081,0.582158,0.708621,Valido,2.485555
6,0.279895,0.519153,0.802368,0.843675,Valido,2.445091
7,0.182951,0.867378,0.134075,0.520207,Não Valido,1.704611


In [169]:
df.sum(axis=0)

Andre                                               5.08462
Maria                                              3.175674
Joao                                               3.031772
Vamp                                               5.504386
Valido    ValidoValidoValidoValidoNão ValidoValidoNão Va...
Nulos                                                   0.0
Total                                             16.796452
dtype: object
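
The `Valido` entry above is the concatenation of every string in that column, which is rarely what is wanted. A sketch of restricting the aggregation to numeric columns, on a small made-up frame:

```python
import pandas as pd

frame = pd.DataFrame({"Andre": [0.5, 1.5],
                      "Maria": [1.0, 2.0],
                      "Valido": ["Valido", "Não Valido"]})

# numeric_only=True leaves text columns out of the aggregation.
column_totals = frame.sum(axis=0, numeric_only=True)
row_totals = frame.sum(axis=1, numeric_only=True)

# select_dtypes is another way to keep only the numeric columns.
numeric_part = frame.select_dtypes(include="number")

print(column_totals.to_dict())
print(row_totals.tolist())
print(list(numeric_part.columns))
```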

In [170]:
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
pd.concat([df, df.sum(axis=0).to_frame().T], ignore_index=True)

Unnamed: 0,Andre,Maria,Joao,Vamp,Valido,Nulos,Total
0,0.292172,0.552395,0.714622,0.282644,Valido,,1.841834
1,0.283373,0.56928,0.042794,0.990881,Valido,,1.886328
2,0.846541,0.389849,0.177649,0.817783,Valido,,2.231822
3,0.873443,0.311711,0.84564,0.817717,Valido,,2.848512
4,0.982753,0.236437,0.66192,0.739851,Não Valido,,2.620961
5,0.537415,0.170579,0.381069,0.796863,Valido,,1.885926
6,0.97167,0.844352,0.035876,0.206618,Não Valido,,2.058516
7,0.297253,0.10107,0.172202,0.852029,Valido,,1.422554
8,5.08462,3.175674,3.031772,5.504386,ValidoValidoValidoValidoNão ValidoValidoNão Va...,0.0,16.796452


In [171]:
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df = pd.concat([df, df.sum(axis=0).to_frame().T], ignore_index=True)

In [172]:
df

Unnamed: 0,Andre,Maria,Joao,Vamp,Valido,Nulos,Total
0,0.292172,0.552395,0.714622,0.282644,Valido,,1.841834
1,0.283373,0.56928,0.042794,0.990881,Valido,,1.886328
2,0.846541,0.389849,0.177649,0.817783,Valido,,2.231822
3,0.873443,0.311711,0.84564,0.817717,Valido,,2.848512
4,0.982753,0.236437,0.66192,0.739851,Não Valido,,2.620961
5,0.537415,0.170579,0.381069,0.796863,Valido,,1.885926
6,0.97167,0.844352,0.035876,0.206618,Não Valido,,2.058516
7,0.297253,0.10107,0.172202,0.852029,Valido,,1.422554
8,5.08462,3.175674,3.031772,5.504386,ValidoValidoValidoValidoNão ValidoValidoNão Va...,0.0,16.796452


In [174]:
df.index=[0,1,2,3,4,5,6,7,'Total']

In [173]:
df.rename({8:'Total'})

Unnamed: 0,Andre,Maria,Joao,Vamp,Valido,Nulos,Total
0,0.292172,0.552395,0.714622,0.282644,Valido,,1.841834
1,0.283373,0.56928,0.042794,0.990881,Valido,,1.886328
2,0.846541,0.389849,0.177649,0.817783,Valido,,2.231822
3,0.873443,0.311711,0.84564,0.817717,Valido,,2.848512
4,0.982753,0.236437,0.66192,0.739851,Não Valido,,2.620961
5,0.537415,0.170579,0.381069,0.796863,Valido,,1.885926
6,0.97167,0.844352,0.035876,0.206618,Não Valido,,2.058516
7,0.297253,0.10107,0.172202,0.852029,Valido,,1.422554
Total,5.08462,3.175674,3.031772,5.504386,ValidoValidoValidoValidoNão ValidoValidoNão Va...,0.0,16.796452


In [175]:
df

Unnamed: 0,Andre,Maria,Joao,Vamp,Valido,Nulos,Total
0,0.292172,0.552395,0.714622,0.282644,Valido,,1.841834
1,0.283373,0.56928,0.042794,0.990881,Valido,,1.886328
2,0.846541,0.389849,0.177649,0.817783,Valido,,2.231822
3,0.873443,0.311711,0.84564,0.817717,Valido,,2.848512
4,0.982753,0.236437,0.66192,0.739851,Não Valido,,2.620961
5,0.537415,0.170579,0.381069,0.796863,Valido,,1.885926
6,0.97167,0.844352,0.035876,0.206618,Não Valido,,2.058516
7,0.297253,0.10107,0.172202,0.852029,Valido,,1.422554
Total,5.08462,3.175674,3.031772,5.504386,ValidoValidoValidoValidoNão ValidoValidoNão Va...,0.0,16.796452
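
The steps above (summing the columns, adding the result as a new row, relabelling that row) can be sketched in one go with `pd.concat`. The small frame here is made up for illustration and contains only numeric columns.

```python
import pandas as pd

frame = pd.DataFrame({"Andre": [1, 2], "Maria": [3, 4]})

# Sum each column, then turn the resulting Series into a one-row DataFrame.
totals = frame.sum(axis=0).to_frame().T
totals.index = ["Total"]

# Stack the original rows and the totals row.
with_totals = pd.concat([frame, totals])

print(with_totals)
```

Giving the totals row its label before concatenating avoids having to rewrite the whole index afterwards.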


# Pandas Index

In [176]:
df.index

Index([0, 1, 2, 3, 4, 5, 6, 7, 'Total'], dtype='object')

In [177]:
pd.Index([1,2,3])

Int64Index([1, 2, 3], dtype='int64')

In [178]:
df_2 = pd.DataFrame(data)

In [179]:
df_2

Unnamed: 0,0,1,2,3
0,0.292172,0.552395,0.714622,0.282644
1,0.283373,0.56928,0.042794,0.990881
2,0.846541,0.389849,0.177649,0.817783
3,0.873443,0.311711,0.84564,0.817717
4,0.982753,0.236437,0.66192,0.739851
5,0.537415,0.170579,0.381069,0.796863
6,0.97167,0.844352,0.035876,0.206618
7,0.297253,0.10107,0.172202,0.852029


In [180]:
df_2.index

RangeIndex(start=0, stop=8, step=1)

In [181]:
df_2.index.values

array([0, 1, 2, 3, 4, 5, 6, 7], dtype=int64)

In [182]:
 df.columns

Index(['Andre', 'Maria', 'Joao', 'Vamp', 'Valido', 'Nulos', 'Total'], dtype='object')

In [183]:
df_2.columns

RangeIndex(start=0, stop=4, step=1)
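
A brief sketch of what an `Index` object can and cannot do, using made-up labels: it behaves like an array for reading, but it cannot be modified element by element.

```python
import pandas as pd

labels = pd.Index(["JOAO", "ANDRE", "MARIA"])

# An Index supports positional access and slicing, like an array.
first = labels[0]
tail = list(labels[1:])

# It is immutable: assigning to a position raises TypeError.
try:
    labels[0] = "PEDRO"
    was_mutated = True
except TypeError:
    was_mutated = False

# A DataFrame built without labels gets a RangeIndex by default.
default_index = pd.DataFrame([[1, 2], [3, 4]]).index

print(first, tail, was_mutated, type(default_index).__name__)
```

To change labels, assign a whole new index (as done with `df.index = [...]` above) or use `rename`.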

# Rename columns

In [185]:
df_2.columns

RangeIndex(start=0, stop=4, step=1)

In [190]:
df_2.columns = ['Oi','Tchau','Opa Tudo bem?','Otimo']

In [187]:
df_2

Unnamed: 0,Oi,Tchau,Opa Tudo bem?,Otimo
0,0.292172,0.552395,0.714622,0.282644
1,0.283373,0.56928,0.042794,0.990881
2,0.846541,0.389849,0.177649,0.817783
3,0.873443,0.311711,0.84564,0.817717
4,0.982753,0.236437,0.66192,0.739851
5,0.537415,0.170579,0.381069,0.796863
6,0.97167,0.844352,0.035876,0.206618
7,0.297253,0.10107,0.172202,0.852029


In [None]:
# Make all column names lower case, without spaces or special characters

In [None]:
df_2.columns = [x.lower() for x in df_2.columns]

In [198]:
[x.lower() if not ' ' in x else '_'.join(x.split()).lower().replace('?','') for x in df_2.columns ]

['oi', 'tchau', 'opa_tudo_bem', 'otimo']

In [202]:
[col.replace('?','').replace(' ','_').lower() for col in df_2.columns]

['oi', 'tchau', 'opa_tudo_bem', 'otimo']

In [207]:
import re
# raw string so that \w reaches the regex engine without an invalid-escape warning
df_2.columns = [re.sub(' ', '_', re.sub(r'[^\w ]', '', col.lower())) for col in df_2.columns]
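
The same clean-up can be written as a small named function, which is easier to read and reuse than a nested one-liner. This is only a sketch: `clean_column_name` is a hypothetical helper, not part of pandas.

```python
import re
import pandas as pd

def clean_column_name(name):
    """Lower-case, drop special characters, and replace spaces with underscores."""
    lowered = name.lower()
    without_special = re.sub(r"[^\w ]", "", lowered)
    return without_special.strip().replace(" ", "_")

frame = pd.DataFrame([[1, 2, 3, 4]],
                     columns=["Oi", "Tchau", "Opa Tudo bem?", "Otimo"])
frame.columns = [clean_column_name(col) for col in frame.columns]

print(list(frame.columns))
```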

In [208]:
df_2 = df_2.rename(columns = {'oi':'coluna 1'})
df_2

Unnamed: 0,coluna 1,tchau,opa_tudo_bem,otimo
0,0.292172,0.552395,0.714622,0.282644
1,0.283373,0.56928,0.042794,0.990881
2,0.846541,0.389849,0.177649,0.817783
3,0.873443,0.311711,0.84564,0.817717
4,0.982753,0.236437,0.66192,0.739851
5,0.537415,0.170579,0.381069,0.796863
6,0.97167,0.844352,0.035876,0.206618
7,0.297253,0.10107,0.172202,0.852029


In [211]:
df_2.rename({'oi':'coluna 1'},axis=1)

Unnamed: 0,coluna 1,tchau,opa_tudo_bem,otimo
0,0.292172,0.552395,0.714622,0.282644
1,0.283373,0.56928,0.042794,0.990881
2,0.846541,0.389849,0.177649,0.817783
3,0.873443,0.311711,0.84564,0.817717
4,0.982753,0.236437,0.66192,0.739851
5,0.537415,0.170579,0.381069,0.796863
6,0.97167,0.844352,0.035876,0.206618
7,0.297253,0.10107,0.172202,0.852029
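
Two details of `rename` are worth noting, sketched here on a small made-up frame: it returns a new DataFrame rather than changing the original, and keys that match no column are ignored by default.

```python
import pandas as pd

frame = pd.DataFrame([[1, 2]], columns=["oi", "tchau"])

# rename returns a new DataFrame; the original is left untouched.
renamed = frame.rename(columns={"oi": "coluna 1"})
print(list(frame.columns))
print(list(renamed.columns))

# Keys that match no column are silently ignored by default.
unchanged = frame.rename(columns={"nao_existe": "x"})
print(list(unchanged.columns))
```

This is why the second `rename` call above produces no error even though `'oi'` had already been renamed.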


# Summary
* import pandas as pd
* pd.Series(list_of_values, index=list_of_labels) - a single column
* pd.Series(dictionary)
* series['label'] - access by index label
* series[position] - access by position (prefer series.iloc[position], which is always positional)
* pd.DataFrame(dict_of_series) - a table
* pd.DataFrame(dict_of_lists)
* pd.DataFrame(list_of_lists)
* pd.DataFrame(array_2d)
* dataframe['column_name']['row_name']
* dataframe.column_name.row_name
* dataframe.loc['row_name', 'column_name'] - accepts slices
* dataframe.iloc[row_position, column_position] - accepts slices
* dataframe.math_operation() - defaults to axis=0 (one result per column)
* dataframe.math_operation(axis=1) - one result per row
* dataframe.describe()
* dataframe.T.describe()
* dataframe.columns / dataframe.index - return an Index object with the column/row names
