# Data Manipulation with Pandas I

## Importing the data

In this example we will work with data from ShoeFly.com, a fictional online shoe store. Pandas can load data from a CSV (comma-separated values) file. This data represents purchases made at ShoeFly.com. Let's look at the first 10 rows of our dataset:

In [3]:
import pandas as pd
df = pd.read_csv('shoefly_orders.csv')

In [5]:
df.head(10)

,id,first_name,last_name,email,shoe_type,shoe_material,shoe_color
0,54791,Rebecca,Lindsay,RebeccaLindsay57@hotmail.com,clogs,faux-leather,black
1,53450,Emily,Joyce,EmilyJoyce25@gmail.com,ballet flats,faux-leather,navy
2,91987,Joyce,Waller,Joyce.Waller@gmail.com,sandals,fabric,black
3,14437,Justin,Erickson,Justin.Erickson@outlook.com,clogs,faux-leather,red
4,79357,Andrew,Banks,AB4318@gmail.com,boots,leather,brown
5,52386,Julie,Marsh,JulieMarsh59@gmail.com,sandals,fabric,black
6,20487,Thomas,Jensen,TJ5470@gmail.com,clogs,fabric,navy
7,76971,Janice,Hicks,Janice.Hicks@gmail.com,clogs,faux-leather,navy
8,21586,Gabriel,Porter,GabrielPorter24@gmail.com,clogs,leather,brown
9,62083,Frances,Palmer,FrancesPalmer50@gmail.com,wedges,leather,white
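As a minimal sketch of the loading step, the snippet below reads the same kind of data from an in-memory string instead of a file. The `csv_text` and `sample` names are hypothetical, and the three rows are copied from the output above; `pd.read_csv` accepts a file path or any file-like object.

```python
import io

import pandas as pd

# Hypothetical stand-in for shoefly_orders.csv (three rows copied from the output above)
csv_text = """id,first_name,last_name,email,shoe_type,shoe_material,shoe_color
54791,Rebecca,Lindsay,RebeccaLindsay57@hotmail.com,clogs,faux-leather,black
53450,Emily,Joyce,EmilyJoyce25@gmail.com,ballet flats,faux-leather,navy
91987,Joyce,Waller,Joyce.Waller@gmail.com,sandals,fabric,black
"""

# read_csv takes a path or a file-like object such as StringIO
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head(2))  # first two rows only
print(sample.shape)    # (number of rows, number of columns)
```

Calling `head()` with no argument returns the first five rows.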


Note that the dataframe has its own sequence of row labels, the index, even though the data already has another identifier, the `id` column.

We could make that `id` the index with the following command:

In [6]:
# solution
df.set_index('id')

id,first_name,last_name,email,shoe_type,shoe_material,shoe_color
54791,Rebecca,Lindsay,RebeccaLindsay57@hotmail.com,clogs,faux-leather,black
53450,Emily,Joyce,EmilyJoyce25@gmail.com,ballet flats,faux-leather,navy
91987,Joyce,Waller,Joyce.Waller@gmail.com,sandals,fabric,black
14437,Justin,Erickson,Justin.Erickson@outlook.com,clogs,faux-leather,red
79357,Andrew,Banks,AB4318@gmail.com,boots,leather,brown
52386,Julie,Marsh,JulieMarsh59@gmail.com,sandals,fabric,black
20487,Thomas,Jensen,TJ5470@gmail.com,clogs,fabric,navy
76971,Janice,Hicks,Janice.Hicks@gmail.com,clogs,faux-leather,navy
21586,Gabriel,Porter,GabrielPorter24@gmail.com,clogs,leather,brown
62083,Frances,Palmer,FrancesPalmer50@gmail.com,wedges,leather,white
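One detail worth illustrating: `set_index` returns a new dataframe and leaves the original untouched unless the result is assigned. The sketch below uses a hypothetical three-row `orders` frame built from values in the table above.

```python
import pandas as pd

# Hypothetical mini version of the orders data
orders = pd.DataFrame({
    "id": [54791, 53450, 91987],
    "first_name": ["Rebecca", "Emily", "Joyce"],
    "shoe_type": ["clogs", "ballet flats", "sandals"],
})

by_id = orders.set_index("id")  # new DataFrame; `orders` is not modified

print(orders.index.tolist())           # the original keeps its 0, 1, 2 index
print(by_id.loc[53450, "first_name"])  # rows can now be looked up by id
```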


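To restore the default numeric index, call `reset_index()`. Since `set_index` above returned a new dataframe and `df` itself was never reassigned, `df` still has its original index here: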

In [7]:
df.reset_index()

,index,id,first_name,last_name,email,shoe_type,shoe_material,shoe_color
0,0,54791,Rebecca,Lindsay,RebeccaLindsay57@hotmail.com,clogs,faux-leather,black
1,1,53450,Emily,Joyce,EmilyJoyce25@gmail.com,ballet flats,faux-leather,navy
2,2,91987,Joyce,Waller,Joyce.Waller@gmail.com,sandals,fabric,black
3,3,14437,Justin,Erickson,Justin.Erickson@outlook.com,clogs,faux-leather,red
4,4,79357,Andrew,Banks,AB4318@gmail.com,boots,leather,brown
5,5,52386,Julie,Marsh,JulieMarsh59@gmail.com,sandals,fabric,black
6,6,20487,Thomas,Jensen,TJ5470@gmail.com,clogs,fabric,navy
7,7,76971,Janice,Hicks,Janice.Hicks@gmail.com,clogs,faux-leather,navy
8,8,21586,Gabriel,Porter,GabrielPorter24@gmail.com,clogs,leather,brown
9,9,62083,Frances,Palmer,FrancesPalmer50@gmail.com,wedges,leather,white


Note that the old index came back as an extra `index` column. If you do not need it, pass `drop=True` (and use `inplace=True` if you do not want to assign the result to a new variable).

In [8]:
df.reset_index(drop=True, inplace=True)
df.head()

,id,first_name,last_name,email,shoe_type,shoe_material,shoe_color
0,54791,Rebecca,Lindsay,RebeccaLindsay57@hotmail.com,clogs,faux-leather,black
1,53450,Emily,Joyce,EmilyJoyce25@gmail.com,ballet flats,faux-leather,navy
2,91987,Joyce,Waller,Joyce.Waller@gmail.com,sandals,fabric,black
3,14437,Justin,Erickson,Justin.Erickson@outlook.com,clogs,faux-leather,red
4,79357,Andrew,Banks,AB4318@gmail.com,boots,leather,brown
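To make the effect of `drop` concrete, here is a small sketch on a hypothetical frame that is first indexed by `id`. Without `drop`, the old index returns as a regular column; with `drop=True`, it is discarded.

```python
import pandas as pd

# Hypothetical mini frame indexed by id
by_id = pd.DataFrame({
    "id": [54791, 53450, 91987],
    "first_name": ["Rebecca", "Emily", "Joyce"],
}).set_index("id")

restored = by_id.reset_index()          # id becomes a regular column again
dropped = by_id.reset_index(drop=True)  # id is thrown away

print(restored.columns.tolist())
print(dropped.columns.tolist())
```

When the index holds real data, as `id` does here, `drop=True` loses that data, so it fits best when the index is only a leftover row counter.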


In [9]:
df.to_excel('shoefly_orders.xlsx', index=False)

If you do not know the column names to query, you can get them with the following command:

In [10]:
# solution
df.columns

Index(['id', 'first_name', 'last_name', 'email', 'shoe_type', 'shoe_material',
       'shoe_color'],
      dtype='object')

To get complete, detailed information about the dataframe (for devs!), run one of the following commands:

In [36]:
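# solution
# df.info()    # column dtypes and non-null counts
df.describe()  # summary statistics for the numeric columns
# df.shape     # (rows, columns)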

,id
count,20.0
mean,54687.4
std,27431.097628
min,14437.0
25%,32536.25
50%,52918.0
75%,77567.5
max,98602.0
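The three inspection calls answer different questions, which the sketch below shows side by side on a hypothetical three-row frame. It also shows why the summary above has only an `id` column: by default `describe()` covers numeric columns only.

```python
import pandas as pd

# Hypothetical mini version of the orders data
orders = pd.DataFrame({
    "id": [54791, 53450, 91987],
    "first_name": ["Rebecca", "Emily", "Joyce"],
    "shoe_color": ["black", "navy", "black"],
})

n_rows, n_cols = orders.shape  # tuple: (rows, columns)
orders.info()                  # prints dtypes and non-null counts
summary = orders.describe()    # numeric columns only by default

print(n_rows, n_cols)
print(summary.columns.tolist())
```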


## Extracting the first pieces of information

If you only want the e-mails of all customers, select that column:

In [21]:
# solution
emails = df[['email']]  # double brackets return a DataFrame; df.email or df['email'] returns a Series
type(emails)

pandas.core.frame.DataFrame
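The bracket style decides the type of the result. A short sketch on a hypothetical two-row frame:

```python
import pandas as pd

# Hypothetical mini frame
orders = pd.DataFrame({
    "first_name": ["Rebecca", "Emily"],
    "email": ["RebeccaLindsay57@hotmail.com", "EmilyJoyce25@gmail.com"],
})

as_series = orders["email"]   # single brackets: one-dimensional Series
as_frame = orders[["email"]]  # list of column names: DataFrame with one column

print(type(as_series).__name__)
print(type(as_frame).__name__)
```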

Or select first name, last name and e-mail, which makes more sense:

In [23]:
# solution
customers_and_emails = df[['first_name', 'last_name', 'email']]
customers_and_emails.head()

,first_name,last_name,email
0,Rebecca,Lindsay,RebeccaLindsay57@hotmail.com
1,Emily,Joyce,EmilyJoyce25@gmail.com
2,Joyce,Waller,Joyce.Waller@gmail.com
3,Justin,Erickson,Justin.Erickson@outlook.com
4,Andrew,Banks,AB4318@gmail.com


We can select everyone who ordered black sandals.

In [None]:
df[(df.shoe_type == 'sandals') & (df.shoe_color == 'black')]  # & means "and", | means "or"

,id,first_name,last_name,email,shoe_type,shoe_material,shoe_color
2,91987,Joyce,Waller,Joyce.Waller@gmail.com,sandals,fabric,black
5,52386,Julie,Marsh,JulieMarsh59@gmail.com,sandals,fabric,black
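The difference between `&` and `|` is easier to see with named masks. The sketch below uses a handful of rows from the dataset; the `is_sandal` and `is_black` names are hypothetical. Each comparison needs its own parentheses when written inline, because `&` and `|` bind more tightly than `==`.

```python
import pandas as pd

# A few rows from the orders data
orders = pd.DataFrame({
    "id": [54791, 91987, 52386, 33862, 39888],
    "shoe_type": ["clogs", "sandals", "sandals", "sandals", "boots"],
    "shoe_color": ["black", "black", "black", "red", "black"],
})

is_sandal = orders["shoe_type"] == "sandals"
is_black = orders["shoe_color"] == "black"

both = orders[is_sandal & is_black]    # sandals AND black
either = orders[is_sandal | is_black]  # sandals OR black

print(both["id"].tolist())
print(len(either))
```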


Or who ordered any white or black shoes?

In [None]:
# df[(df.shoe_color == 'white') | (df.shoe_color == 'black')]
df[df.shoe_color.isin(['white', 'black'])]

,id,first_name,last_name,email,shoe_type,shoe_material,shoe_color
0,54791,Rebecca,Lindsay,RebeccaLindsay57@hotmail.com,clogs,faux-leather,black
2,91987,Joyce,Waller,Joyce.Waller@gmail.com,sandals,fabric,black
5,52386,Julie,Marsh,JulieMarsh59@gmail.com,sandals,fabric,black
9,62083,Frances,Palmer,FrancesPalmer50@gmail.com,wedges,leather,white
12,45832,Susan,Dennis,SusanDennis58@gmail.com,ballet flats,fabric,white
14,73431,Rebecca,Charles,Rebecca.Charles@gmail.com,boots,faux-leather,white
16,39888,Vincent,Stephenson,VS4753@outlook.com,boots,leather,black
17,35961,Roy,Tillman,RoyTillman20@gmail.com,boots,leather,white
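`isin` builds the same boolean mask as a chain of `|` comparisons, and the `~` operator inverts it. A minimal sketch on a hypothetical `colors` Series:

```python
import pandas as pd

# Hypothetical column of shoe colors
colors = pd.Series(["black", "navy", "white", "red"], name="shoe_color")

mask = colors.isin(["white", "black"])  # True where the value is in the list

print(colors[mask].tolist())   # white or black
print(colors[~mask].tolist())  # everything else
```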


Let's see what Susan Dennis ordered.

In [30]:
df[(df.first_name == 'Susan') & (df.last_name == 'Dennis')]

,id,first_name,last_name,email,shoe_type,shoe_material,shoe_color
12,45832,Susan,Dennis,SusanDennis58@gmail.com,ballet flats,fabric,white


You can also select rows and columns at once with `loc` (by label or boolean mask) or `iloc` (by integer position):

In [31]:
df.loc[(df.shoe_type == 'sandals') & (df.shoe_color == 'black'), ['id', 'shoe_material' ]]

,id,shoe_material
2,91987,fabric
5,52386,fabric


In [35]:
# df.iloc[1,4]
print(df.iloc[[0,2,5], 2])

0    Lindsay
2     Waller
5      Marsh
Name: last_name, dtype: object
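With the default 0, 1, 2, ... index, labels and positions coincide, which hides the difference between `loc` and `iloc`. The sketch below uses a hypothetical frame indexed by `id`, where the two no longer match.

```python
import pandas as pd

# Hypothetical mini frame whose index labels are order ids
by_id = pd.DataFrame(
    {
        "first_name": ["Rebecca", "Emily", "Joyce"],
        "last_name": ["Lindsay", "Joyce", "Waller"],
    },
    index=pd.Index([54791, 53450, 91987], name="id"),
)

by_label = by_id.loc[91987, "last_name"]  # row label and column name
by_position = by_id.iloc[2, 1]            # third row, second column

print(by_label, by_position)
```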


To create new columns whose values depend on a condition, we can use a lambda function. Consider this example: many of our customers want to buy vegan shoes (shoes made from materials that do not come from animals). Add a new column called `shoe_source` that is `vegan` if the material is not `leather` and `animal` otherwise.

In [37]:
df['shoe_source'] = df.shoe_material.apply(lambda x: 'vegan' if x != 'leather' else 'animal')
df.head()

,id,first_name,last_name,email,shoe_type,shoe_material,shoe_color,shoe_source
0,54791,Rebecca,Lindsay,RebeccaLindsay57@hotmail.com,clogs,faux-leather,black,vegan
1,53450,Emily,Joyce,EmilyJoyce25@gmail.com,ballet flats,faux-leather,navy,vegan
2,91987,Joyce,Waller,Joyce.Waller@gmail.com,sandals,fabric,black,vegan
3,14437,Justin,Erickson,Justin.Erickson@outlook.com,clogs,faux-leather,red,vegan
4,79357,Andrew,Banks,AB4318@gmail.com,boots,leather,brown,animal
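For a simple two-way condition like this one, a vectorized alternative is `numpy.where`. This is a different technique from the lambda used above, shown only for comparison on a hypothetical three-row frame.

```python
import numpy as np
import pandas as pd

# Hypothetical mini frame with one material per row
orders = pd.DataFrame({"shoe_material": ["faux-leather", "fabric", "leather"]})

# Row-by-row version, same logic as the lambda above
via_apply = orders["shoe_material"].apply(
    lambda x: "vegan" if x != "leather" else "animal"
)

# Vectorized version: the whole column is compared in one step
via_where = np.where(orders["shoe_material"] != "leather", "vegan", "animal")

print(via_apply.tolist())
print(list(via_where))
```

Both produce the same labels; `apply` with a lambda remains the more flexible choice when the rule does not reduce to a single comparison.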


## Working with Strings

Much like Python's built-in string methods, pandas lets you get the length, convert to lowercase or uppercase, replace, check whether a value contains a substring, or split, always through the `str` accessor:

In [None]:
# solution
# df.first_name.str.len()
# df.shoe_type.str.lower()
# df.shoe_material.str.replace('fabric', 'cotton')
# df.email.str.contains('gmail')
df.email.str.split('@')  # split each e-mail into [user, domain]

0     [RebeccaLindsay57, hotmail.com]
1           [EmilyJoyce25, gmail.com]
2           [Joyce.Waller, gmail.com]
3      [Justin.Erickson, outlook.com]
4                 [AB4318, gmail.com]
5           [JulieMarsh59, gmail.com]
6                 [TJ5470, gmail.com]
7           [Janice.Hicks, gmail.com]
8        [GabrielPorter24, gmail.com]
9        [FrancesPalmer50, gmail.com]
10         [JessicaHale25, gmail.com]
11      [LawrenceParker44, gmail.com]
12         [SusanDennis58, gmail.com]
13                [DO2680, gmail.com]
14       [Rebecca.Charles, gmail.com]
15              [JC2072, hotmail.com]
16              [VS4753, outlook.com]
17          [RoyTillman20, gmail.com]
18       [Thomas.Roberson, gmail.com]
19         [ANewton1977, outlook.com]
Name: email, dtype: object
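The `str` accessor can be chained, which is useful after a split. As a sketch on a hypothetical `emails` Series with three addresses from the dataset, `.str[1]` picks the second element of each split result, here the domain.

```python
import pandas as pd

# Three e-mail addresses from the orders data
emails = pd.Series([
    "RebeccaLindsay57@hotmail.com",
    "EmilyJoyce25@gmail.com",
    "AB4318@gmail.com",
])

lengths = emails.str.len()               # number of characters per value
is_gmail = emails.str.contains("gmail")  # boolean mask
domains = emails.str.split("@").str[1]   # second piece of each split

print(lengths.tolist())
print(is_gmail.tolist())
print(domains.tolist())
```

By contrast, `str.strip('@')` only removes `@` characters at the start or end of each value, so it leaves these addresses unchanged.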

## Exercises

In [56]:
# Exercise 1
# Open the file "Cancer_Data.csv" and count how many rows of the dataframe have diagnosis == 'B'.
cd = pd.read_csv('Cancer_Data.csv')
# cd.info()
# cd.head(10)
# total_rows = len(cd['diagnosis'])
# print(total_rows)
# Keep only the rows whose diagnosis is 'B', then count them
cd_diagnosis_b = cd[cd['diagnosis'] == 'B']
cd_diagnosis_b.index.size


357
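The same count can be obtained without building a filtered dataframe. The sketch below uses a hypothetical `diagnosis` Series with made-up values, not the real file: summing a boolean mask counts its `True` entries, and `value_counts()` reports every category at once.

```python
import pandas as pd

# Made-up diagnosis values, for illustration only
diagnosis = pd.Series(["M", "B", "B", "M", "B"], name="diagnosis")

count_b = (diagnosis == "B").sum()  # True counts as 1, False as 0
counts = diagnosis.value_counts()   # one count per distinct value

print(count_b)
print(counts["M"])
```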

In [58]:
# Exercise 2
# Replace the values in the diagnosis column according to the rule:
# "B" becomes "benign cancer"
# "M" becomes "malignant cancer"
cd['diagnosis'] = cd['diagnosis'].replace({'B': 'benign cancer', 'M': 'malignant cancer'})
cd.head()


,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,malignant cancer,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,malignant cancer,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,malignant cancer,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,malignant cancer,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,malignant cancer,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,
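A dictionary can also be passed to `Series.map`, but the two methods treat unmapped values differently. The sketch below uses a hypothetical Series with one unexpected value, `"?"`, to show the difference.

```python
import pandas as pd

# Hypothetical values, including one that is not in the mapping
codes = pd.Series(["B", "M", "?"])
mapping = {"B": "benign cancer", "M": "malignant cancer"}

replaced = codes.replace(mapping)  # values not in the dict are kept as-is
mapped = codes.map(mapping)        # values not in the dict become NaN

print(replaced.tolist())
print(mapped.isna().tolist())
```

`replace` is the safer choice when the column may contain values outside the mapping; `map` makes such values stand out as missing.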


In [61]:
# Exercise 3
# Filter the IDs of the "malignant cancer" rows with a radius greater than 25.
# cd[cd['diagnosis'] == 'B']
cd_m_25 = cd[(cd['diagnosis'] == 'malignant cancer') & (cd['radius_mean'] > 25)]
cd_m_25


,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
82,8611555,malignant cancer,25.22,24.91,171.5,1878.0,0.1063,0.2665,0.3339,0.1845,...,33.62,211.7,2562.0,0.1573,0.6076,0.6476,0.2867,0.2355,0.1051,
180,873592,malignant cancer,27.22,21.87,182.1,2250.0,0.1094,0.1914,0.2871,0.1878,...,32.85,220.8,3216.0,0.1472,0.4034,0.534,0.2688,0.2856,0.08082,
212,8810703,malignant cancer,28.11,18.47,188.5,2499.0,0.1142,0.1516,0.3201,0.1595,...,18.47,188.5,2499.0,0.1142,0.1516,0.3201,0.1595,0.1648,0.05525,
352,899987,malignant cancer,25.73,17.46,174.2,2010.0,0.1149,0.2363,0.3368,0.1913,...,23.58,229.3,3234.0,0.153,0.5937,0.6451,0.2756,0.369,0.08815,
461,911296202,malignant cancer,27.42,26.27,186.9,2501.0,0.1084,0.1988,0.3635,0.1689,...,31.37,251.2,4254.0,0.1357,0.4256,0.6833,0.2625,0.2641,0.07427,
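Since the exercise asks for the IDs, one way to return only that column is to pass the mask and the column name to `loc` together. The sketch below uses a hypothetical three-row `cd_sample` frame with values taken from the outputs above.

```python
import pandas as pd

# Hypothetical sample with three rows from the cancer data
cd_sample = pd.DataFrame({
    "id": [8611555, 873592, 842302],
    "diagnosis": ["malignant cancer", "malignant cancer", "malignant cancer"],
    "radius_mean": [25.22, 27.22, 17.99],
})

mask = (cd_sample["diagnosis"] == "malignant cancer") & (cd_sample["radius_mean"] > 25)
ids = cd_sample.loc[mask, "id"]  # rows that match the mask, id column only

print(ids.tolist())
```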
