# Introduction to Data Science

## Pandas

Install pandas with either conda or pip:

conda install pandas
pip install pandas

### Series

A pandas `Series` is a data structure built on top of the NumPy array. It stores a sequence of values, each paired with an index label.
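As a minimal sketch of that relationship, the snippet below builds a small Series and looks at its two parts separately. The values and labels are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical values and labels, used only for illustration.
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

values = s.to_numpy()     # the underlying data as a NumPy array
labels = list(s.index)    # the index labels paired with each value

print(type(values))
print(labels)
print(s["b"])             # label-based lookup
```

Keeping the data in a NumPy array is what lets a Series support vectorized arithmetic, while the index adds label-based access on top.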


In [1]:
import numpy as np
import pandas as pd

## My first command

In [2]:
my_list = [1,2,3,4]
pd.Series(data=my_list)

0    1
1    2
2    3
3    4
dtype: int64

In [3]:
labels = ['1º','2º','3º', '4º']
pd.Series(data=my_list, index=labels)

1º    1
2º    2
3º    3
4º    4
dtype: int64

In [4]:
dictionary = {'a':10,'b':20,'c':30}
pd.Series(dictionary)

a    10
b    20
c    30
dtype: int64

In [5]:
my_list = [1,2,5,4]
cities = ['Goiânia', 'Brasília','Catalão', 'São Paulo']

serie = pd.Series(my_list, cities)     

In [6]:
serie

Goiânia      1
Brasília     2
Catalão      5
São Paulo    4
dtype: int64

In [7]:
serie['Brasília']

2

In [8]:
my_list = [2,3,1,5]
cities = ['Goiânia', 'Brasília','Maceió', 'São Paulo']

serie1 = pd.Series(my_list, cities)  


In [9]:
serie + serie1

Brasília     5.0
Catalão      NaN
Goiânia      3.0
Maceió       NaN
São Paulo    9.0
dtype: float64
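The sum above aligns the two Series by label, so a city present in only one of them ends up as `NaN`, and the result becomes `float64`. If a missing label should count as zero instead, one option is `Series.add` with `fill_value`. The sketch below rebuilds the same two Series so that it runs on its own.

```python
import pandas as pd

serie = pd.Series([1, 2, 5, 4], index=["Goiânia", "Brasília", "Catalão", "São Paulo"])
serie1 = pd.Series([2, 3, 1, 5], index=["Goiânia", "Brasília", "Maceió", "São Paulo"])

# Plain addition: labels found in only one Series become NaN.
total = serie + serie1

# add() with fill_value treats a missing label as 0 before adding.
total_filled = serie.add(serie1, fill_value=0)
print(total_filled)
```

Whether zero is the right default depends on what the values mean, so treat `fill_value=0` as an assumption rather than a general rule.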

### Dataframes

The DataFrame is the core structure of pandas, inspired by R's data frames (and very similar to an Excel spreadsheet!).
It behaves like several Series grouped together under a shared index.
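To make the "several Series grouped together" idea concrete, here is a small sketch that assembles a DataFrame from two Series and then pulls one column back out. The names `w`, `x` and `frame` and their values are hypothetical.

```python
import pandas as pd

# Hypothetical columns, used only for illustration.
w = pd.Series([1.0, 2.0, 3.0], index=["A", "B", "C"])
x = pd.Series([4.0, 5.0, 6.0], index=["A", "B", "C"])

# Each dict key becomes a column name; rows are aligned on the shared index.
frame = pd.DataFrame({"W": w, "X": x})

column = frame["W"]   # selecting one column returns a Series again
print(frame)
```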

In [6]:
from numpy.random import randn
np.random.seed(101)

In [8]:
df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())

In [9]:
df

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509


**File input and output**


In [13]:
df.to_csv('exemplo.csv',index=False)

In [32]:
df_csv = pd.read_csv('exemplo.csv')
print(df_csv)

          W         X         Y         Z
0  2.706850  0.628133  0.907969  0.503826
1  0.651118 -0.319318 -0.848077  0.605965
2 -2.018168  0.740122  0.528813 -0.589001
3  0.188695 -0.758872 -0.933237  0.955057
4  0.190794  1.978757  2.605967  0.683509
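Because the file was written with `index=False`, the row labels A to E are lost and `read_csv` falls back to a default 0 to 4 index. If the labels matter, one possible approach is to write the index and read it back with `index_col=0`. The frame and file name below are hypothetical, and a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile

import pandas as pd

# Small hypothetical frame with labelled rows.
frame = pd.DataFrame({"W": [1.5, 2.5], "X": [3.5, 4.5]}, index=["A", "B"])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "round_trip.csv")   # hypothetical file name
    frame.to_csv(path)                           # the index is written as the first column
    restored = pd.read_csv(path, index_col=0)    # use that column as the index again

print(restored)
```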


In [15]:
df.to_excel('arquivo_excel.xlsx', sheet_name='Nome')

In [33]:
df_excel = pd.read_excel('arquivo_excel.xlsx', sheet_name='Nome')
print(df_excel)

   Unnamed: 0         W         X         Y         Z
0           A  2.706850  0.628133  0.907969  0.503826
1           B  0.651118 -0.319318 -0.848077  0.605965
2           C -2.018168  0.740122  0.528813 -0.589001
3           D  0.188695 -0.758872 -0.933237  0.955057
4           E  0.190794  1.978757  2.605967  0.683509


In [29]:
df['W']

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

In [18]:
df[['W','Z']]


          W         Z
A  2.706850  0.503826
B  0.651118  0.605965
C -2.018168 -0.589001
D  0.188695  0.955057
E  0.190794  0.683509


In [19]:
type(df['W'])

pandas.core.series.Series

**Adding columns**

In [20]:
df['new'] = df['W'] + df['Y']

In [21]:
df

          W         X         Y         Z       new
A  2.706850  0.628133  0.907969  0.503826  3.614819
B  0.651118 -0.319318 -0.848077  0.605965 -0.196959
C -2.018168  0.740122  0.528813 -0.589001 -1.489355
D  0.188695 -0.758872 -0.933237  0.955057 -0.744542
E  0.190794  1.978757  2.605967  0.683509  2.796762


**Removing columns**

In [22]:
sem = df.drop('new',axis=1)  # drop returns a new DataFrame; df itself is unchanged
sem

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509


In [23]:
df

          W         X         Y         Z       new
A  2.706850  0.628133  0.907969  0.503826  3.614819
B  0.651118 -0.319318 -0.848077  0.605965 -0.196959
C -2.018168  0.740122  0.528813 -0.589001 -1.489355
D  0.188695 -0.758872 -0.933237  0.955057 -0.744542
E  0.190794  1.978757  2.605967  0.683509  2.796762


In [24]:
df.drop('new',axis=1,inplace=True)

In [25]:
df


          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509
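The cells above show that `drop` leaves the original DataFrame untouched unless `inplace=True` is passed. A common alternative, sketched below with hypothetical data, is to reassign the result; `columns="new"` is another way of writing `'new', axis=1`.

```python
import pandas as pd

frame = pd.DataFrame({"W": [1, 2], "Y": [3, 4]})   # hypothetical data
frame["new"] = frame["W"] + frame["Y"]

without_new = frame.drop(columns="new")   # returns a copy; frame is unchanged
still_there = "new" in frame.columns      # True: the original still has the column

frame = frame.drop(columns="new")         # reassigning instead of inplace=True
print(frame.columns.tolist())
```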


**Selecting elements**

In [31]:
df.loc['A']

W    2.706850
X    0.628133
Y    0.907969
Z    0.503826
Name: A, dtype: float64

In [None]:
df.loc['B','Y']
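`loc` selects by label. Its position-based counterpart is `iloc`, and both accept lists to return a smaller DataFrame. The sketch below uses a hypothetical frame to show the same cell reached both ways.

```python
import pandas as pd

# Hypothetical data with labelled rows.
frame = pd.DataFrame(
    {"W": [1, 2, 3], "Y": [4, 5, 6]},
    index=["A", "B", "C"],
)

by_label = frame.loc["B", "Y"]        # row label, column label
by_position = frame.iloc[1, 1]        # row position, column position
sub = frame.loc[["A", "C"], ["W"]]    # lists of labels give a smaller DataFrame

print(by_label, by_position)
print(sub)
```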

**Conditional selection**

In [None]:
df


In [None]:
df>0

In [None]:
df[df>0]

In [None]:
df

In [None]:
df[df['W']>0]

In [None]:
df[(df['W']>0) & (df['Y'] > 1)]
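The comparisons above produce boolean masks, and indexing with a mask keeps only the rows where it is `True`. Masks are combined with `&` (and), `|` (or) and `~` (not). The parentheses are required because these operators bind more tightly than `>`. A small sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical data, used only for illustration.
frame = pd.DataFrame(
    {"W": [2.0, -1.0, 0.5, -3.0], "Y": [1.5, 2.0, -0.5, 0.0]},
    index=["A", "B", "C", "D"],
)

mask = (frame["W"] > 0) & (frame["Y"] > 1)            # True only where both hold
both = frame[mask]

either = frame[(frame["W"] > 0) | (frame["Y"] > 1)]   # at least one condition holds
not_positive = frame[~(frame["W"] > 0)]               # ~ negates a mask

print(both)
print(either)
print(not_positive)
```

Python's `and` and `or` keywords do not work here, since they try to reduce a whole Series to a single truth value.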


**Handling missing data**

In [None]:
df = pd.DataFrame({'A':[1,2,np.nan],
                  'B':[5,np.nan,np.nan],
                  'C':[1,2,3]})
df

In [None]:
df.dropna()

In [None]:
df.dropna(axis=1)

In [None]:
df.fillna(value='valor top')

In [None]:
df['A'].fillna(value=df['A'].mean())
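None of the calls above modify `df`; each returns a new object. Two variations that may be useful are sketched here: `dropna(thresh=...)` keeps rows with a minimum number of non-missing values, and passing `df.mean()` to `fillna` fills every numeric column with its own mean. The name `df_nan` is hypothetical, used to avoid overwriting `df`.

```python
import numpy as np
import pandas as pd

df_nan = pd.DataFrame({"A": [1, 2, np.nan],
                       "B": [5, np.nan, np.nan],
                       "C": [1, 2, 3]})

# Keep only rows that have at least 2 non-missing values.
at_least_two = df_nan.dropna(thresh=2)

# Fill each column with its own mean; the original frame is left as it was.
filled = df_nan.fillna(df_nan.mean())

print(at_least_two)
print(filled)
```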

**Grouping data**

In [34]:
data = {'Linguagem':['Python','Python','R','R','Java','Java'],
       'Funcionário':['Sara','Maria','Ana','Juliana','Giovana','André'],
       'Entregas':[1000,2000,340,124,500,350]}

In [35]:
df = pd.DataFrame(data)

In [36]:
df

   Linguagem  Funcionário  Entregas
0     Python         Sara      1000
1     Python        Maria      2000
2          R          Ana       340
3          R      Juliana       124
4       Java      Giovana       500
5       Java        André       350


In [37]:
por_linguagem = df.groupby('Linguagem')


In [38]:
por_linguagem

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000208EA6701D0>

In [39]:
por_linguagem.mean(numeric_only=True)  # skip the text column 'Funcionário'

           Entregas
Linguagem
Java          425.0
Python       1500.0
R             232.0


In [10]:
por_linguagem.std(numeric_only=True)

In [None]:
por_linguagem.min()

In [None]:
por_linguagem.count()

In [None]:
por_linguagem.describe()

In [42]:
por_linguagem['Linguagem'].unique()

Linguagem
Java        [Java]
Python    [Python]
R              [R]
Name: Linguagem, dtype: object

In [46]:
por_linguagem['Linguagem'].nunique()

Linguagem
Java      1
Python    1
R         1
Name: Linguagem, dtype: int64

In [47]:
df['Linguagem'].value_counts()

R         2
Java      2
Python    2
Name: Linguagem, dtype: int64
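The individual calls above (`mean`, `min`, `count`, and so on) can also be requested together with `agg`. The sketch below rebuilds the same data so that it runs on its own. The names `df_deliveries` and `summary` are hypothetical.

```python
import pandas as pd

data = {'Linguagem': ['Python', 'Python', 'R', 'R', 'Java', 'Java'],
        'Funcionário': ['Sara', 'Maria', 'Ana', 'Juliana', 'Giovana', 'André'],
        'Entregas': [1000, 2000, 340, 124, 500, 350]}
df_deliveries = pd.DataFrame(data)

# Select the numeric column first, then compute several statistics per group.
summary = df_deliveries.groupby('Linguagem')['Entregas'].agg(['mean', 'min', 'max', 'count'])
print(summary)
```

Selecting `'Entregas'` before aggregating keeps the text column `'Funcionário'` out of the calculation, which avoids errors from statistics that are only defined for numbers.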