In [1]:
import numpy as np
import pandas as pd

Class exercise statement: <br>
Ask the user for a list of their favorite foods, then convert it into a NumPy object. Once converted, remove every food whose name starts with a vowel.

In [3]:
comidas = np.array(['maçã', 'uva', 'morango', 'melancia', 'acelga'])
comidas

array(['maçã', 'uva', 'morango', 'melancia', 'acelga'], dtype='<U8')

In [None]:
[item[0] for item in comidas ]

In [15]:
comidas[np.char.endswith(comidas, 'a')]

array(['uva', 'melancia', 'acelga'], dtype='<U8')

In [14]:
[item for item in comidas if item[0] not in ['a','e','i','o','u']]

['maçã', 'morango', 'melancia']

In [16]:
prefixes = ["xyz", "abc"]
"abcde".startswith(tuple(prefixes))

True

In [18]:
prefixes = ["a", "e", "i", "o", "u"]

[item for item in comidas if not item.startswith(tuple(prefixes))]

['maçã', 'morango', 'melancia']
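
The list comprehensions above return plain Python lists. As a sketch of a NumPy-only alternative (the names `first_letters`, `starts_with_vowel` and `sem_vogais` are illustrative, and only lowercase, unaccented vowels are considered), a boolean mask can be built from the first letter of each item:

```python
import numpy as np

comidas = np.array(['maçã', 'uva', 'morango', 'melancia', 'acelga'])

# Casting to a 1-character string dtype truncates each entry to its first letter.
first_letters = comidas.astype('<U1')

# Boolean mask: True where the first letter is a lowercase, unaccented vowel.
starts_with_vowel = np.isin(first_letters, list('aeiou'))

# Keep only the items that do NOT start with a vowel; the result is still an ndarray.
sem_vogais = comidas[~starts_with_vowel]
print(sem_vogais)
```

Unlike the list comprehension, the result stays a NumPy array, which is what the exercise statement asks for.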

# Class 3 - Data cleaning and transformation

In today's class, we will explore the following topics in Python:

- 1) Data Cleaning (dropna, fillna, isnull, notnull, replace, duplicated, drop_duplicates)
- 2) Data Transformation (cut, qcut, get_dummies)

________

### Objectives

Show how to identify null values, discuss better ways of handling them, and work with dummy variables.

________

### Skills to be developed in this class

By the end of the class the student should:

- Know how to identify null values and how to handle them.
- Recognize when to use dummy variables and how to create them.
____
____
____

In [3]:
import pandas as pd 
import numpy as np

In [22]:
df = pd.read_csv("data/titanic.csv")

## Data Cleaning: Missing Data

### `.isna()` and `.isnull()`
Return a boolean DataFrame indicating whether each position holds a null value. `pd.isnull()` is an alias for `pd.isna()` in pandas, as can be seen in the <a href="https://github.com/pandas-dev/pandas/blob/0409521665bd436a10aea7e06336066bf07ff057/pandas/core/dtypes/missing.py#L109">pandas source code</a>. The pandas DataFrame is modeled on R's data frames, where null and NA have distinct meanings.
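
As a quick sanity check of the alias claim, the sketch below compares the two top-level functions and the two DataFrame methods on a small made-up frame (`exemplo` is illustrative data, not the Titanic dataset):

```python
import numpy as np
import pandas as pd

# The two top-level names should point to the very same function object.
same_function = pd.isnull is pd.isna

# The DataFrame methods should also give identical boolean frames.
exemplo = pd.DataFrame({"a": [1.0, np.nan], "b": ["x", None]})
same_result = exemplo.isna().equals(exemplo.isnull())

print(same_function, same_result)
```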

In [23]:
pd.isnull

<function pandas.core.dtypes.missing.isna(obj)>

Both `.isna()` and `.isnull()` return a boolean DataFrame:

In [26]:
df.isna()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,False,False,False,False,False,False,False,False,False,False,True,False
1,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,True,False
3,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...
886,False,False,False,False,False,False,False,False,False,False,True,False
887,False,False,False,False,False,False,False,False,False,False,False,False
888,False,False,False,False,False,True,False,False,False,False,True,False
889,False,False,False,False,False,False,False,False,False,False,False,False


To get the number of NaNs in each column, we can sum the True values:

In [27]:
df.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

### `.notnull()`

Returns a DataFrame of True/False values, with True for non-null elements.

In [29]:
df.notnull().sum()

PassengerId    891
Survived       891
Pclass         891
Name           891
Sex            891
Age            714
SibSp          891
Parch          891
Ticket         891
Fare           891
Cabin          204
Embarked       889
dtype: int64

### `.dropna()`

Drops rows containing nulls, considering the whole DataFrame or only the columns given in the `subset` argument. It is one of the methods that accepts the `inplace` parameter.
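
Besides `subset`, `dropna` also accepts `how` and `thresh`. The sketch below, on a small made-up frame (`exemplo` and the result names are illustrative), shows how each option changes which rows survive:

```python
import numpy as np
import pandas as pd

exemplo = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, np.nan, 6.0, 8.0],
    "c": ["x", None, "z", "w"],
})

any_null = exemplo.dropna()             # drop rows with at least one null (default how="any")
all_null = exemplo.dropna(how="all")    # drop only rows where every value is null
only_a = exemplo.dropna(subset=["a"])   # look for nulls only in column "a"
min_two = exemplo.dropna(thresh=2)      # keep rows with at least 2 non-null values

print(any_null.index.tolist(), all_null.index.tolist(),
      only_a.index.tolist(), min_two.index.tolist())
```

None of these calls changes `exemplo` itself, since `inplace` defaults to False.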


In [34]:
df.shape

(891, 12)

In [33]:
df.dropna(subset=['Age','Embarked']).shape

(712, 12)

In [35]:
df.dropna(subset=['Age','Embarked']).isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          529
Embarked         0
dtype: int64

### `.fillna()`

Fills null values with the specified values.
Useful parameters:

   * method {‘bfill’, ‘ffill’, None}, default None <br>
       Fills NaNs by propagating the last valid value forward (ffill) or by using the next valid observation (bfill).
   * axis {0 for ‘index’, 1 for ‘columns’} <br>
       Axis along which to fill the nulls
   * inplace {boolean}, default False
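
Recent pandas releases (2.1 onward, according to the release notes) deprecate the `method` argument of `fillna` in favor of the dedicated `.ffill()` and `.bfill()` methods. A minimal sketch of the equivalent calls, on a made-up series (`serie` and the result names are illustrative):

```python
import numpy as np
import pandas as pd

serie = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0])

forward = serie.ffill()                 # propagate the last valid value forward
forward_limited = serie.ffill(limit=1)  # fill at most 1 consecutive NaN
backward = serie.bfill()                # use the next valid value

print(pd.DataFrame({"original": serie, "ffill": forward,
                    "ffill_limit": forward_limited, "bfill": backward}))
```

Note that a leading NaN has nothing before it, so forward fill leaves it untouched.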


In [36]:
fill = pd.DataFrame({'original': [np.nan,1,2,np.nan,4,5,np.nan,np.nan, np.nan, 9, 10, np.nan]})
fill

Unnamed: 0,original
0,
1,1.0
2,2.0
3,
4,4.0
5,5.0
6,
7,
8,
9,9.0


In [38]:
fill.original.mean()

5.166666666666667

In [42]:
fill['cte'] = fill.original.fillna(999)
fill['media'] = fill.original.fillna(fill.original.mean())
fill['ffill'] = fill.original.fillna(method='ffill')
fill['ffill_limit'] = fill.original.fillna(method='ffill', limit=2)
fill['bfill'] = fill.original.fillna(method='bfill')

fill

Unnamed: 0,original,cte,media,ffill,ffill_limit,bfill
0,,999.0,5.166667,,,1.0
1,1.0,1.0,1.0,1.0,1.0,1.0
2,2.0,2.0,2.0,2.0,2.0,2.0
3,,999.0,5.166667,2.0,2.0,4.0
4,4.0,4.0,4.0,4.0,4.0,4.0
5,5.0,5.0,5.0,5.0,5.0,5.0
6,,999.0,5.166667,5.0,5.0,9.0
7,,999.0,5.166667,5.0,5.0,9.0
8,,999.0,5.166667,5.0,,9.0
9,9.0,9.0,9.0,9.0,9.0,9.0


### Back to the Titanic dataset...
Which approaches could we use to fill the nulls in Age? <br>
What about the null cabins? <br>
And the null ports of embarkation? <br>
What problems might we run into with these approaches? <br>
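
One possible answer for Age, offered only as a sketch and not as the recommended solution, is to fill each missing age with the median age of the passenger's class. The frame below is made-up data that only mimics the `Pclass` and `Age` columns:

```python
import numpy as np
import pandas as pd

passengers = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, np.nan, 50.0, 20.0, 30.0, np.nan],
})

# Median age of each class, broadcast back to the original row order.
class_median = passengers.groupby("Pclass")["Age"].transform("median")

# Fill each missing age with the median of that passenger's class.
passengers["Age_filled"] = passengers["Age"].fillna(class_median)
print(passengers)
```

A trade-off worth discussing: any imputation of this kind shrinks the variance of the column, and a single global mean would do so even more.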


_________
_________
**Exercise:** Sort by the Pclass and Fare columns, in that order and ascending, and fill the nulls using the forward fill method.

In [50]:
df['new_cabin'] = df.sort_values(["Pclass", "Fare"])['Cabin'].fillna(method="ffill")

263      B94
633      B94
806      A36
815     B102
822     B102
       ...  
201    F E69
324    F E69
792    F E69
846    F E69
863    F E69
Name: Cabin, Length: 891, dtype: object

_________
_________

### `.interpolate()`
Interpolates the missing values according to the chosen method.
[Documentation](https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.Series.interpolate.html)
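
To make the default behavior concrete, here is a small sketch on a made-up series (`s`, `linear` and `limited` are illustrative names). With the default `method="linear"`, each gap is filled with equally spaced values between its neighbors, and `limit` caps how many consecutive NaNs are filled:

```python
import numpy as np
import pandas as pd

s = pd.Series([0.0, np.nan, np.nan, 3.0, np.nan, 5.0])

linear = s.interpolate()          # default: linear interpolation, treating values as equally spaced
limited = s.interpolate(limit=1)  # fill at most 1 consecutive NaN per gap

print(pd.DataFrame({"original": s, "linear": linear, "limited": limited}))
```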

In [51]:
s = pd.DataFrame([0, 1, np.nan, 3, 4, np.nan, 400], columns=['original'])
s['linear'] = s.original.interpolate()
s['segundo_grau'] = s.original.interpolate(method="polynomial", order=2)
s

Unnamed: 0,original,linear,segundo_grau
0,0.0,0.0,0.0
1,1.0,1.0,1.0
2,,2.0,5.367521
3,3.0,3.0,3.0
4,4.0,4.0,4.0
5,,202.0,129.598291
6,400.0,400.0,400.0


### `.replace()`

Replaces elements within the DataFrame. It is one of the methods that accepts the `inplace` parameter.


In [61]:
# for a single replacement
df.replace({'h':'male'}, inplace=True)

In [62]:
df.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


In [64]:
df.Sex.replace({'male':'H', 'female':'M'})

0      H
1      M
2      M
3      M
4      H
      ..
886    H
887    M
888    M
889    H
890    H
Name: Sex, Length: 891, dtype: object

In [65]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [66]:
# for more than one replacement
df.replace({'Sex':{'male':'H', 'female':'M'}, 'Name':{'Braund, Mr. Owen Harris':'fera'}})

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,fera,H,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",M,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",M,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",M,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",H,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",H,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",M,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",M,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",H,26.0,0,0,111369,30.0000,C148,C


___________________
___________________

**Exercise**: Replace the abbreviated port-of-embarkation values with their full names 
(C = Cherbourg; Q = Queenstown; S = Southampton) and fill the missing data with the most frequent port.

In [None]:
# Answer

In [4]:
embarked = df['Embarked'].replace(
    {'C': 'Cherbourg', 'Q': 'Queenstown', 'S': 'Southampton'}
)
# mode() returns a Series, so take its first element to get a scalar fill value
df['Embarked'] = embarked.fillna(embarked.mode()[0])

In [76]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,Southampton
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,Cherbourg
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,Southampton
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,Southampton
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,Southampton


In [77]:
df.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         0
dtype: int64

___________________
___________________


## Data Cleaning: Duplicated Data
### `.duplicated()` and `.drop_duplicates()`

`.duplicated()` returns a boolean Series indicating whether each row is a duplicate of an earlier row, while `.drop_duplicates()` removes the duplicated rows. <br>
Parameters of `.drop_duplicates()`:

   * `subset` - selects the columns used when comparing rows for duplicates
   * `keep` - {‘first’, ‘last’, False} (default ‘first’) Determines which duplicate to keep; `False` drops all of them
   * `inplace` - (default False) If False, returns a copy of the DataFrame with the changes; if True, modifies the DataFrame itself
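
The effect of `keep` is easiest to see on a tiny frame. The sketch below uses made-up data (`orders` and its columns are illustrative) and compares the three options:

```python
import pandas as pd

orders = pd.DataFrame({
    "customer": ["ana", "bia", "ana", "ana"],
    "amount": [10, 20, 10, 30],
})

# Whole-row comparison: only row 2 repeats an earlier row (row 0).
full_row_dups = orders.duplicated()

# keep=False flags every member of a duplicated group, not just the repeats.
all_members = orders.duplicated(subset=["customer"], keep=False)

first = orders.drop_duplicates(subset=["customer"], keep="first")
last = orders.drop_duplicates(subset=["customer"], keep="last")
none = orders.drop_duplicates(subset=["customer"], keep=False)

print(first.index.tolist(), last.index.tolist(), none.index.tolist())
```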



In [82]:
# let's create a DataFrame that contains duplicated rows
df_dup = df.copy()

df_dup.loc[891, :] = df_dup.loc[0, :]
df_dup.loc[892, :] = df_dup.loc[1, :]
df_dup.loc[893, :] = df_dup.loc[1, :]

print(df.shape, df_dup.shape)

(891, 12) (894, 12)


In [84]:
df_dup

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1.0,0.0,3.0,"Braund, Mr. Owen Harris",male,22.0,1.0,0.0,A/5 21171,7.2500,,Southampton
1,2.0,1.0,1.0,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1.0,0.0,PC 17599,71.2833,C85,Cherbourg
2,3.0,1.0,3.0,"Heikkinen, Miss. Laina",female,26.0,0.0,0.0,STON/O2. 3101282,7.9250,,Southampton
3,4.0,1.0,1.0,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1.0,0.0,113803,53.1000,C123,Southampton
4,5.0,0.0,3.0,"Allen, Mr. William Henry",male,35.0,0.0,0.0,373450,8.0500,,Southampton
...,...,...,...,...,...,...,...,...,...,...,...,...
889,890.0,1.0,1.0,"Behr, Mr. Karl Howell",male,26.0,0.0,0.0,111369,30.0000,C148,Cherbourg
890,891.0,0.0,3.0,"Dooley, Mr. Patrick",male,32.0,0.0,0.0,370376,7.7500,,Queenstown
891,1.0,0.0,3.0,"Braund, Mr. Owen Harris",male,22.0,1.0,0.0,A/5 21171,7.2500,,Southampton
892,2.0,1.0,1.0,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1.0,0.0,PC 17599,71.2833,C85,Cherbourg


In [83]:
df_dup.duplicated()

0      False
1      False
2      False
3      False
4      False
       ...  
889    False
890    False
891     True
892     True
893     True
Length: 894, dtype: bool

In [90]:
df_dup.drop_duplicates(subset=['Age', 'Embarked']).shape

(167, 12)

In [86]:
print(df_dup.shape)
print(df_dup.drop_duplicates().shape)

(894, 12)
(891, 12)


In [94]:
%%timeit
df_dup.drop_duplicates(subset=['PassengerId'])

2.25 ms ± 159 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [95]:
%%timeit
df_dup.sort_values('PassengerId').drop_duplicates(subset=['PassengerId'])

4.33 ms ± 377 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


_______________________
_______________________
**Exercise:** Remove the duplicates considering the 'brand' and 'style' columns, keeping the last value.

In [91]:
df_exerc = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, np.nan, 3.5, 15, 5]
})

In [92]:
df_exerc

Unnamed: 0,brand,style,rating
0,Yum Yum,cup,4.0
1,Yum Yum,cup,4.0
2,Yum Yum,cup,
3,Indomie,cup,3.5
4,Indomie,pack,15.0
5,Indomie,pack,5.0


In [96]:
df_exerc.drop_duplicates(subset=['brand','style'], inplace=True, keep='last')

In [97]:
df_exerc

Unnamed: 0,brand,style,rating
2,Yum Yum,cup,
3,Indomie,cup,3.5
5,Indomie,pack,5.0


__________
__________


## Data transformation

### `.get_dummies()`

#### categorical variables
Categorical variables represent groups or classes within our data. They come in two types:
* ordinal: they have a meaningful order. For example, for income we could have: upper class > middle class > lower class  
* nominal: they have no valid order. For example: sex and postal code (CEP).

<img src="variaveis_categoricas.jpeg" style="width: 500px">

Dummies are variables whose values are 1 or 0 for each observation. The `pd.get_dummies()` function converts categorical variables into numeric ones by giving each category its own column.
<br>
<br>
<a href="https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html">Parameters:</a> <br>
`pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)`
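
A small sketch of how `drop_first` and `dummy_na` change the result, on made-up data (`people` and the result names are illustrative). `dtype=int` is passed so the columns hold 0/1 regardless of the pandas version:

```python
import pandas as pd

people = pd.DataFrame({"sex": ["male", "female", None, "female"]})

# One column per category; the row with a missing value gets all zeros.
full = pd.get_dummies(people, columns=["sex"], dtype=int)

# drop_first removes the first category (alphabetical order), avoiding redundant columns.
reduced = pd.get_dummies(people, columns=["sex"], drop_first=True, dtype=int)

# dummy_na adds an extra indicator column for missing values.
with_na = pd.get_dummies(people, columns=["sex"], dummy_na=True, dtype=int)

print(full.columns.tolist(), reduced.columns.tolist(), with_na.columns.tolist())
```

With `drop_first=True`, the dropped category is still recoverable: a row of all zeros means the first category (or a missing value, which is why `dummy_na` can matter).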

In [100]:
pd.get_dummies(df, columns=['Sex', 'Embarked'], drop_first=True)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,Sex_male,Embarked_Queenstown,Embarked_Southampton,Embarked_mode
0,1,0,3,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.2500,,1,0,1,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 17599,71.2833,C85,0,0,0,0
2,3,1,3,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 3101282,7.9250,,0,0,1,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,113803,53.1000,C123,0,0,1,0
4,5,0,3,"Allen, Mr. William Henry",35.0,0,0,373450,8.0500,,1,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",27.0,0,0,211536,13.0000,,1,0,1,0
887,888,1,1,"Graham, Miss. Margaret Edith",19.0,0,0,112053,30.0000,B42,0,0,1,0
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",,1,2,W./C. 6607,23.4500,,0,0,1,0
889,890,1,1,"Behr, Mr. Karl Howell",26.0,0,0,111369,30.0000,C148,1,0,0,0


## Further Study Material

### `.cut()`
`pd.cut()` splits the range of the data into bins and computes which bin each row of the DataFrame belongs to. It chooses bins that are evenly spaced over the values themselves, not over the frequency of those values.  <br>
It is widely used to turn continuous variables into categorical ones. For example, we can convert the numeric age into groups such as child, young, adult and elderly.
<br><br>
<a href='https://pandas.pydata.org/docs/reference/api/pandas.cut.html'>Parameters:</a> <br>
`pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise', ordered=True)`

When given a number of groups, pd.cut() chooses bins of equal width:


In [None]:
df['cut_bins'] = pd.cut(df.Age, 4)
df.head()

In [None]:
df.cut_bins.value_counts()

We can pass the group names and turn the numeric variable directly into a categorical one


In [None]:
df['cut_classes'] = pd.cut(df.Age, 4, labels=["jovens", "adultos", "meia-idade", "idosos"])
df.head()

In [None]:
df.cut_bins.unique()

In [None]:
df.cut_classes.value_counts()

We can also pass a list with the start and end values of the bins:

In [None]:
pd.cut(df.Age, [0,20,60,80]).unique()

In [None]:
df.Age.describe()
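
Since the cells above are shown without their output, here is a self-contained sketch of `pd.cut` with explicit bin edges on made-up ages (`ages`, `groups`, `outside` and the labels are illustrative). By default each bin is open on the left and closed on the right, so values outside the edges, including the lowest edge itself, become NaN:

```python
import pandas as pd

ages = pd.Series([2, 15, 25, 40, 59, 61, 80])

# Bins are (0, 20], (20, 60], (60, 80]; one label per bin.
groups = pd.cut(ages, bins=[0, 20, 60, 80], labels=["young", "adult", "elderly"])
print(groups.tolist())

# 0 sits on the open left edge and 90 is above the last edge: both become NaN.
outside = pd.cut(pd.Series([0, 90]), bins=[0, 20, 60, 80])
print(outside.isna().tolist())
```

Passing `include_lowest=True` would make the first bin include its left edge.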

<a href='https://towardsdatascience.com/discretisation-using-decision-trees-21910483fa4b'>Discretization using decision trees</a>

### `.qcut()`
`pd.qcut()` is used when we want to discretize our data into quantiles. When given a number of groups, `pd.qcut()` chooses the bins so that each group holds (approximately) the same number of values.

#### `.cut()` x `.qcut()`
   * `.cut()` creates **equal-width bins**, but the **frequency** of samples in each bin is **unequal**
   * `.qcut()` creates **bins of unequal width**, but the **frequency** of samples in each bin is **equal**.

<br>

[Parameters](https://pandas.pydata.org/docs/reference/api/pandas.qcut.html)
<br>
`pandas.qcut(x, q, labels=None, retbins=False, precision=3, duplicates='raise')`
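
The contrast is clearest on skewed data. In the sketch below (made-up values with one large outlier; all names are illustrative), `pd.cut` puts almost everything in the first bin, while `pd.qcut` splits the values evenly:

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

# Two equal-width bins over the range 1..100: the outlier sits alone in the second bin.
equal_width = pd.cut(values, 2).value_counts().sort_index()

# Two quantile-based bins: the split point is the median, so each bin gets 4 values.
equal_freq = pd.qcut(values, 2).value_counts().sort_index()

print(equal_width.tolist(), equal_freq.tolist())
```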

In [None]:
pd.cut(df.Age, 4).value_counts()

In [None]:
pd.qcut(df.Age, 4).value_counts()

In [None]:
pd.qcut(df.Age, 4).value_counts()/df.Age.notnull().sum()

## Exercises

1. Convert the countries in the DataFrame below into dummy variables, dropping the first column and using 'Country' as the prefix for the column names.

In [3]:
ids = [11, 22, 33, 44, 55, 66, 77]
countries = ['Brasil', 'Argentina', 'Peru', 'Bolívia', 'Uruguai']
sales = [42, 53, 37, 45, 20]

df = pd.DataFrame(list(zip(ids, countries, sales)),
                  columns=['Ids', 'Countries', 'Sales'])
df.head()

Unnamed: 0,Ids,Countries,Sales
0,11,Brasil,42
1,22,Argentina,53
2,33,Peru,37
3,44,Bolívia,45
4,55,Uruguai,20


In [6]:
pd.get_dummies(df, columns=['Countries'], prefix='Country', drop_first=True)

Unnamed: 0,Ids,Sales,Country_Bolívia,Country_Brasil,Country_Peru,Country_Uruguai
0,11,42,0,1,0,0
1,22,53,0,0,0,0
2,33,37,0,0,1,0
3,44,45,1,0,0,0
4,55,20,0,0,0,1


2. Consider the data in the file "german_credit.csv", which contains data on loans granted by a bank.<br> <br>
a. Are there nulls in this DataFrame? If so, fill them however the group decides.

   b. Convert the categorical variables into numeric ones.

In [3]:
df2 = pd.read_csv('./data/german_credit.csv', sep=',')
df2.head()

Unnamed: 0.1,Unnamed: 0,Age,Sex,Job,Housing,Saving accounts,Checking account,Credit amount,Duration,Purpose
0,0,67,male,2,own,,little,1169,6,radio/TV
1,1,22,female,2,own,little,moderate,5951,48,radio/TV
2,2,49,male,1,own,little,,2096,12,education
3,3,45,male,2,free,little,little,7882,42,furniture/equipment
4,4,53,male,2,free,little,little,4870,24,car


In [4]:
df2.isna().sum()

Unnamed: 0            0
Age                   0
Sex                   0
Job                   0
Housing               0
Saving accounts     183
Checking account    394
Credit amount         0
Duration              0
Purpose               0
dtype: int64

In [5]:
df2.dtypes

Unnamed: 0           int64
Age                  int64
Sex                 object
Job                  int64
Housing             object
Saving accounts     object
Checking account    object
Credit amount        int64
Duration             int64
Purpose             object
dtype: object

In [8]:
df2.shape

(1000, 10)

In [10]:
df2['Saving accounts'].unique()

array([nan, 'little', 'quite rich', 'rich', 'moderate'], dtype=object)

In [12]:
df2['Checking account'].unique()

array(['little', 'moderate', nan, 'rich'], dtype=object)

In [25]:
df2['Checking account'].mode()

0    little
dtype: object

In [28]:
changes = {'Saving accounts': 'no savings', 'Checking account': df2['Checking account'].mode(dropna=True)[0]}
df2.fillna(value=changes).isna().sum()
# df2.fillna(['Saving accounts', 'Checking accounts'], [df2.'Savings_accounts'].mean(), df2.'Checking accounts'.mean())

Unnamed: 0          0
Age                 0
Sex                 0
Job                 0
Housing             0
Saving accounts     0
Checking account    0
Credit amount       0
Duration            0
Purpose             0
dtype: int64

In [None]:
# fillna with mode() needs [0]: mode() returns a Series, and we want a scalar fill value
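
The reason behind that note can be sketched on a made-up series (`s`, `mode_series`, `wrong` and `right` are illustrative names). When `fillna` receives a Series, it aligns on the index, and `mode()` returns a Series whose only label is 0, so nulls at other positions are left untouched:

```python
import pandas as pd

s = pd.Series(["little", None, "little", None, "rich"])

mode_series = s.mode()             # a Series with a single entry at index 0

wrong = s.fillna(mode_series)      # aligned by index: rows 1 and 3 have no match
right = s.fillna(mode_series[0])   # scalar: every null is filled

print(wrong.isna().sum(), right.isna().sum())
```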

3. For this exercise we will use the dataset https://archive.ics.uci.edu/ml/datasets/automobile which contains the specifications of car models from 1985.
https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data

In [24]:
# Use the following list as the dataset's column names
headers = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration",
           "num_doors", "body_style", "drive_wheels", "engine_location",
           "wheel_base", "length", "width", "height", "curb_weight",
           "engine_type", "num_cylinders", "engine_size", "fuel_system",
           "bore", "stroke", "compression_ratio", "horsepower", "peak_rpm",
           "city_mpg", "highway_mpg", "price"]

a. Does this dataset have NaNs? If so, the group needs to agree on how to handle them.

In [25]:
df3 = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data', sep=',', names=headers)
df3.head()

Unnamed: 0,symboling,normalized_losses,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,wheel_base,...,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450


In [26]:
df3.isnull().sum()

symboling            0
normalized_losses    0
make                 0
fuel_type            0
aspiration           0
num_doors            0
body_style           0
drive_wheels         0
engine_location      0
wheel_base           0
length               0
width                0
height               0
curb_weight          0
engine_type          0
num_cylinders        0
engine_size          0
fuel_system          0
bore                 0
stroke               0
compression_ratio    0
horsepower           0
peak_rpm             0
city_mpg             0
highway_mpg          0
price                0
dtype: int64

Apparently there are no NaNs, but what are the "?" values in the normalized_losses column?
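
Placeholders like "?" can also be caught at load time. The sketch below uses the `na_values` parameter of `pd.read_csv` on a tiny inline sample (the `raw` string only mimics three columns of the dataset; it is not the real file):

```python
import io
import pandas as pd

# Tiny inline sample standing in for the real file.
raw = "symboling,normalized_losses,make\n3,?,alfa-romero\n2,164,audi\n"

# Treat "?" as a missing value while parsing.
cars = pd.read_csv(io.StringIO(raw), na_values="?")

print(cars.isna().sum())
print(cars.dtypes)
```

A side benefit is that the column is parsed as numeric right away instead of as `object`.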

In [27]:
df3.dtypes

symboling              int64
normalized_losses     object
make                  object
fuel_type             object
aspiration            object
num_doors             object
body_style            object
drive_wheels          object
engine_location       object
wheel_base           float64
length               float64
width                float64
height               float64
curb_weight            int64
engine_type           object
num_cylinders         object
engine_size            int64
fuel_system           object
bore                  object
stroke                object
compression_ratio    float64
horsepower            object
peak_rpm              object
city_mpg               int64
highway_mpg            int64
price                 object
dtype: object

In [28]:
df3.replace('?', np.nan, inplace=True)

In [29]:
df3.isna().sum()

symboling             0
normalized_losses    41
make                  0
fuel_type             0
aspiration            0
num_doors             2
body_style            0
drive_wheels          0
engine_location       0
wheel_base            0
length                0
width                 0
height                0
curb_weight           0
engine_type           0
num_cylinders         0
engine_size           0
fuel_system           0
bore                  4
stroke                4
compression_ratio     0
horsepower            2
peak_rpm              2
city_mpg              0
highway_mpg           0
price                 4
dtype: int64

In [30]:
df3.dropna(inplace=True)
df3.isna().sum()

symboling            0
normalized_losses    0
make                 0
fuel_type            0
aspiration           0
num_doors            0
body_style           0
drive_wheels         0
engine_location      0
wheel_base           0
length               0
width                0
height               0
curb_weight          0
engine_type          0
num_cylinders        0
engine_size          0
fuel_system          0
bore                 0
stroke               0
compression_ratio    0
horsepower           0
peak_rpm             0
city_mpg             0
highway_mpg          0
price                0
dtype: int64

b. Does this dataset have duplicated rows?

In [31]:
df3.duplicated().sum()

0

c. The city and highway fuel consumption of the cars, given by the "city_mpg" and "highway_mpg" columns respectively, is expressed in mpg (miles per gallon). Convert these columns to consumption in L/100km, given that the conversion is:

L/100km = 235 / mpg 
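
The constant 235 can be recovered from the unit definitions. The sketch below assumes US gallons (the exercise does not say which gallon is meant), and `mpg_to_l_per_100km` is a hypothetical helper name:

```python
KM_PER_MILE = 1.609344               # international mile
LITERS_PER_US_GALLON = 3.785411784   # assumption: US liquid gallon

# mpg = miles / gallon, so gallons per mile = 1 / mpg.
# Liters per 100 km = 100 * (liters per gallon) / (km per mile) / mpg.
factor = 100 * LITERS_PER_US_GALLON / KM_PER_MILE


def mpg_to_l_per_100km(mpg):
    """Convert fuel economy in miles per gallon to consumption in L/100km."""
    return factor / mpg


print(round(factor, 3))
print(round(mpg_to_l_per_100km(24), 2))
```

Note that the relation is a division, not a multiplication: a higher mpg means a lower L/100km.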

In [32]:
df3_copy=df3.copy()
df3_copy.head()

Unnamed: 0,symboling,normalized_losses,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,wheel_base,...,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450
6,1,158,audi,gas,std,four,sedan,fwd,front,105.8,...,136,mpfi,3.19,3.4,8.5,110,5500,19,25,17710
8,1,158,audi,gas,turbo,four,sedan,fwd,front,105.8,...,131,mpfi,3.13,3.4,8.3,140,5500,17,20,23875
10,2,192,bmw,gas,std,two,sedan,rwd,front,101.2,...,108,mpfi,3.5,2.8,8.8,101,5800,23,29,16430


In [33]:
df3_copy[['city_mpg','highway_mpg']] = 235 / df3_copy[['city_mpg','highway_mpg']]

In [34]:
df3_copy.head()

Unnamed: 0,symboling,normalized_losses,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,wheel_base,...,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,9.791667,7.833333,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,13.055556,10.681818,17450
6,1,158,audi,gas,std,four,sedan,fwd,front,105.8,...,136,mpfi,3.19,3.4,8.5,110,5500,12.368421,9.400000,17710
8,1,158,audi,gas,turbo,four,sedan,fwd,front,105.8,...,131,mpfi,3.13,3.4,8.3,140,5500,13.823529,11.750000,23875
10,2,192,bmw,gas,std,two,sedan,rwd,front,101.2,...,108,mpfi,3.5,2.8,8.8,101,5800,10.217391,8.103448,16430


In [39]:
df3.rename(columns={'city_mpg':'city_L/100km', 'highway_mpg':'highway_L/100km'}, inplace=True)
df3.head()

Unnamed: 0,symboling,normalized_losses,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,wheel_base,...,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_L/100km,highway_L/100km,price
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450
6,1,158,audi,gas,std,four,sedan,fwd,front,105.8,...,136,mpfi,3.19,3.4,8.5,110,5500,19,25,17710
8,1,158,audi,gas,turbo,four,sedan,fwd,front,105.8,...,131,mpfi,3.13,3.4,8.3,140,5500,17,20,23875
10,2,192,bmw,gas,std,two,sedan,rwd,front,101.2,...,108,mpfi,3.5,2.8,8.8,101,5800,23,29,16430


In [42]:
df3[['city_L/100km', 'highway_L/100km']] = 235 / df3[['city_L/100km', 'highway_L/100km']]

In [43]:
df3.head()

Unnamed: 0,symboling,normalized_losses,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,wheel_base,...,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_L/100km,highway_L/100km,price
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,9.791667,7.833333,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,13.055556,10.681818,17450
6,1,158,audi,gas,std,four,sedan,fwd,front,105.8,...,136,mpfi,3.19,3.4,8.5,110,5500,12.368421,9.400000,17710
8,1,158,audi,gas,turbo,four,sedan,fwd,front,105.8,...,131,mpfi,3.13,3.4,8.3,140,5500,13.823529,11.750000,23875
10,2,192,bmw,gas,std,two,sedan,rwd,front,101.2,...,108,mpfi,3.5,2.8,8.8,101,5800,10.217391,8.103448,16430


d. Convert the "body_style" and "drive_wheels" columns into dummy variables.

In [44]:
pd.get_dummies(df3, columns=['body_style', 'drive_wheels'], drop_first=True)

Unnamed: 0,symboling,normalized_losses,make,fuel_type,aspiration,num_doors,engine_location,wheel_base,length,width,...,peak_rpm,city_L/100km,highway_L/100km,price,body_style_hardtop,body_style_hatchback,body_style_sedan,body_style_wagon,drive_wheels_fwd,drive_wheels_rwd
3,2,164,audi,gas,std,four,front,99.8,176.6,66.2,...,5500,9.791667,7.833333,13950,0,0,1,0,1,0
4,2,164,audi,gas,std,four,front,99.4,176.6,66.4,...,5500,13.055556,10.681818,17450,0,0,1,0,0,0
6,1,158,audi,gas,std,four,front,105.8,192.7,71.4,...,5500,12.368421,9.400000,17710,0,0,1,0,1,0
8,1,158,audi,gas,turbo,four,front,105.8,192.7,71.4,...,5500,13.823529,11.750000,23875,0,0,1,0,1,0
10,2,192,bmw,gas,std,two,front,101.2,176.8,64.8,...,5800,10.217391,8.103448,16430,0,0,1,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
200,-1,95,volvo,gas,std,four,front,109.1,188.8,68.9,...,5400,10.217391,8.392857,16845,0,0,1,0,0,1
201,-1,95,volvo,gas,turbo,four,front,109.1,188.8,68.8,...,5300,12.368421,9.400000,19045,0,0,1,0,0,1
202,-1,95,volvo,gas,std,four,front,109.1,188.8,68.9,...,5500,13.055556,10.217391,21485,0,0,1,0,0,1
203,-1,95,volvo,diesel,turbo,four,front,109.1,188.8,68.9,...,4800,9.038462,8.703704,22470,0,0,1,0,0,1



## References:
