**AI & Big Data**

Prof. Miguel Bozer da Silva - miguel.bozer@senaisp.edu.br

---

In [2]:
# Import the libraries used in this notebook
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# One Hot Encoding

## Task #1: Loading the data

In [3]:
carros = pd.read_csv('/content/Used_fiat_500_in_Italy_dataset.csv', sep=',')

In [4]:
carros.head()

Unnamed: 0,model,engine_power,transmission,age_in_days,km,previous_owners,lat,lon,price
0,pop,69,manual,4474,56779,2,45.071079,7.46403,4490
1,lounge,69,manual,2708,160000,1,45.069679,7.70492,4500
2,lounge,69,automatic,3470,170000,2,45.514599,9.28434,4500
3,sport,69,manual,3288,132000,2,41.903221,12.49565,4700
4,sport,69,manual,3712,124490,2,45.532661,9.03892,4790


## Task #2: Fixing the data

In [5]:
carros.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 380 entries, 0 to 379
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   model            380 non-null    object 
 1   engine_power     380 non-null    int64  
 2   transmission     380 non-null    object 
 3   age_in_days      380 non-null    int64  
 4   km               380 non-null    int64  
 5   previous_owners  380 non-null    int64  
 6   lat              380 non-null    float64
 7   lon              380 non-null    float64
 8   price            380 non-null    int64  
dtypes: float64(2), int64(5), object(2)
memory usage: 26.8+ KB


In [6]:
carros.shape

(380, 9)

In [7]:
carros.isnull().sum()  # any nonzero count means that column has null values

Unnamed: 0,0
model,0
engine_power,0
transmission,0
age_in_days,0
km,0
previous_owners,0
lat,0
lon,0
price,0


Let's explore the columns of type `object` so we can apply *One Hot Encoding* or *Label Encoding* to them:

In [8]:
carros['model'].unique()

array(['pop', 'lounge', 'sport', 'star'], dtype=object)

In [9]:
carros['transmission'].unique()

array(['manual', 'automatic'], dtype=object)

The `model` and `transmission` columns contain text, which must be converted to numbers.

Let's start with the `transmission` column, which has only two possible values. We will use the `map` method: if the car is manual the value is replaced by 0, and if the car is automatic the value is replaced by 1:
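Before encoding, it can help to list every text column together with its number of distinct values. A minimal sketch on a toy `DataFrame` (the `demo` frame and its values are illustrative assumptions, not the real dataset):

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative)
demo = pd.DataFrame({
    'model': ['pop', 'lounge', 'sport', 'star'],
    'transmission': ['manual', 'manual', 'automatic', 'manual'],
    'km': [56779, 160000, 170000, 132000],
})

# Keep only the non-numeric columns and count their distinct values
text_columns = demo.select_dtypes(exclude='number')
n_categories = text_columns.nunique()
print(n_categories)
```

A column with two distinct values is a natural candidate for a 0/1 mapping, while columns with more unordered categories usually call for One Hot Encoding.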

In [10]:
carros['transmission']=carros['transmission'].map({'manual':0, 'automatic':1})

In [11]:
carros.head()

Unnamed: 0,model,engine_power,transmission,age_in_days,km,previous_owners,lat,lon,price
0,pop,69,0,4474,56779,2,45.071079,7.46403,4490
1,lounge,69,0,2708,160000,1,45.069679,7.70492,4500
2,lounge,69,1,3470,170000,2,45.514599,9.28434,4500
3,sport,69,0,3288,132000,2,41.903221,12.49565,4700
4,sport,69,0,3712,124490,2,45.532661,9.03892,4790
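One detail of `map` worth keeping in mind: any value that is not a key of the dictionary becomes `NaN`. The sketch below illustrates this on a toy `Series` (the data and the `mapping` name are assumptions made for the example), so counting nulls right after a mapping is a cheap sanity check:

```python
import pandas as pd

# Toy data (hypothetical): one value is capitalized differently on purpose
transmission = pd.Series(['manual', 'automatic', 'Manual'])

mapping = {'manual': 0, 'automatic': 1}
encoded = transmission.map(mapping)

# Values that are not keys of the dictionary become NaN
n_unmapped = int(encoded.isnull().sum())
print(encoded.tolist())
print(n_unmapped)

# Normalizing the text before mapping avoids the mismatch
encoded_clean = transmission.str.lower().map(mapping)
print(encoded_clean.tolist())
```

For the same reason, running a `map` cell twice on the same column replaces every value with `NaN`, because the already-encoded numbers are no longer keys of the dictionary.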


Now let's apply One Hot Encoding to the *model* column to turn its text values into columns:

In [12]:
modelos = pd.get_dummies(carros['model'], prefix = 'modelo', dtype=int)

In [13]:
modelos.head()

Unnamed: 0,modelo_lounge,modelo_pop,modelo_sport,modelo_star
0,0,1,0,0
1,1,0,0,0
2,1,0,0,0
3,0,0,1,0
4,0,0,1,0
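Because each row has exactly one of these columns set to 1, one column is always implied by the others. Some models (linear regression, for example) work better when that redundant column is removed. A hedged sketch on toy data, using the `drop_first` parameter of `pd.get_dummies`:

```python
import pandas as pd

# Toy column with the same four categories seen in the dataset
model = pd.Series(['pop', 'lounge', 'sport', 'star', 'pop'], name='model')

full = pd.get_dummies(model, prefix='modelo', dtype=int)
reduced = pd.get_dummies(model, prefix='modelo', dtype=int, drop_first=True)

# Every row has exactly one active column in the full encoding
print(full.sum(axis=1).tolist())
# drop_first removes the first category in sorted order (here 'lounge')
print(reduced.columns.tolist())
```

Whether to drop a column depends on the model that will be trained; the rest of this notebook keeps all four.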


This creates 4 new binary columns that indicate the vehicle model. Now let's build a new `DataFrame` by joining the `carros` and `modelos` `DataFrames`:

In [14]:
carros_corrigidos = pd.concat([carros,modelos], axis = 1) # axis=1 joins side by side (columns); axis=0 stacks one below the other (rows)

In [15]:
carros_corrigidos.head()

Unnamed: 0,model,engine_power,transmission,age_in_days,km,previous_owners,lat,lon,price,modelo_lounge,modelo_pop,modelo_sport,modelo_star
0,pop,69,0,4474,56779,2,45.071079,7.46403,4490,0,1,0,0
1,lounge,69,0,2708,160000,1,45.069679,7.70492,4500,1,0,0,0
2,lounge,69,1,3470,170000,2,45.514599,9.28434,4500,1,0,0,0
3,sport,69,0,3288,132000,2,41.903221,12.49565,4700,0,0,1,0
4,sport,69,0,3712,124490,2,45.532661,9.03892,4790,0,0,1,0


Thinking ahead to a *Machine Learning* model, the *model* column can be dropped: its information is already carried by the new binary columns, and the text column itself would not be used to train the model.

In [16]:
carros_corrigidos.drop(columns=['model'], inplace=True)

In [17]:
carros_corrigidos.head()

Unnamed: 0,engine_power,transmission,age_in_days,km,previous_owners,lat,lon,price,modelo_lounge,modelo_pop,modelo_sport,modelo_star
0,69,0,4474,56779,2,45.071079,7.46403,4490,0,1,0,0
1,69,0,2708,160000,1,45.069679,7.70492,4500,1,0,0,0
2,69,1,3470,170000,2,45.514599,9.28434,4500,1,0,0,0
3,69,0,3288,132000,2,41.903221,12.49565,4700,0,0,1,0
4,69,0,3712,124490,2,45.532661,9.03892,4790,0,0,1,0
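The encode, concatenate, and drop steps can also be combined into a single call by passing the whole `DataFrame` to `pd.get_dummies` with the `columns` parameter. A minimal sketch, using a toy frame (`carros_demo` and its values are illustrative assumptions):

```python
import pandas as pd

# Toy frame standing in for `carros` (values are illustrative)
carros_demo = pd.DataFrame({
    'model': ['pop', 'lounge', 'sport'],
    'km': [56779, 160000, 132000],
})

# columns= encodes the listed columns and removes the originals in one call
encoded = pd.get_dummies(carros_demo, columns=['model'], prefix='modelo', dtype=int)
print(encoded.columns.tolist())
```

The step-by-step version above makes each transformation visible, which is useful while learning; the single call is more compact once the idea is clear.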


# Label Encoding

## Task #1: Loading the data

In [19]:
titanic=pd.read_csv('/content/titanic.csv', sep=';')
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,Dead,Third Class,"Kelly, Mr. James",male,345.0,0,0,330911,78292.0,,Q
1,893,Alive,Third Class,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,Dead,Second Class,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,96875.0,,Q
3,895,Dead,Third Class,"Wirz, Mr. Albert",male,27.0,0,0,315154,86625.0,,S
4,896,Alive,Third Class,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,122875.0,,S


## Task #2: Fixing the data

In [20]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Survived     418 non-null    object 
 2   Pclass       418 non-null    object 
 3   Name         418 non-null    object 
 4   Sex          418 non-null    object 
 5   Age          332 non-null    float64
 6   SibSp        418 non-null    int64  
 7   Parch        418 non-null    int64  
 8   Ticket       418 non-null    object 
 9   Fare         417 non-null    float64
 10  Cabin        91 non-null     object 
 11  Embarked     418 non-null    object 
dtypes: float64(2), int64(3), object(7)
memory usage: 39.3+ KB


Let's explore the columns of type `object` so we can apply *Label Encoding*:

In [22]:
titanic['Pclass'].unique()

array(['Third Class', 'Second Class', 'First Class'], dtype=object)

In [24]:
titanic['Survived'].unique()

array(['Dead', 'Alive'], dtype=object)

Now let's apply Label Encoding to the Pclass column (and then to the Survived column):

In [25]:
titanic['Pclass']=titanic['Pclass'].map({'Third Class':3, 'Second Class':2, 'First Class':1})

In [26]:
titanic['Pclass'].unique()

array([3, 2, 1])

In [28]:
titanic['Survived']= titanic['Survived'].map({'Dead':0, 'Alive':1})

In [29]:
titanic['Survived'].unique()

array([0, 1])
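The dictionary passed to `map` fixes the order of the labels explicitly (First Class = 1, Second Class = 2, Third Class = 3). As an alternative sketch, the same order can be declared with an ordered `pd.Categorical`; the toy `pclass` Series and the `order` list below are assumptions made for the example:

```python
import pandas as pd

pclass = pd.Series(['Third Class', 'Second Class', 'First Class', 'Third Class'])

# Explicit order: First < Second < Third
order = ['First Class', 'Second Class', 'Third Class']
# .codes starts at 0, so 1 is added to match the 1/2/3 labels
codes = pd.Categorical(pclass, categories=order, ordered=True).codes + 1

# Same result as the dictionary-based mapping
mapped = pclass.map({'First Class': 1, 'Second Class': 2, 'Third Class': 3})
print(list(codes))
print(mapped.tolist())
```

Declaring the order matters: if the categories were left to be inferred, they would be sorted alphabetically, which only happens to match the intended order in this particular case.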

# Exercises

## Exercise 1)

For the Titanic dataset, replace the text in the Pclass and Sex columns using Label Encoding, and use One Hot Encoding for the Embarked column.

In [34]:
# Pclass was already label-encoded in cell In [25]; mapping it again would turn the integers into NaN
# The Sex values are lowercase in the data, so the dictionary keys must be lowercase too
titanic['Sex'] = titanic['Sex'].map({'female':0, 'male':1})

In [43]:
# One Hot Encoding is applied directly to the original text values of Embarked
embarq = pd.get_dummies(titanic['Embarked'], prefix = 'embarq', dtype=int)
titanic_embarqued = pd.concat([titanic,embarq], axis = 1)
titanic_embarqued.drop(columns=['Embarked'], inplace=True)

In [44]:
titanic_embarqued.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,embarq_C,embarq_Q,embarq_S
0,892,0,3,"Kelly, Mr. James",1,345.0,0,0,330911,78292.0,,0,1,0
1,893,1,3,"Wilkes, Mrs. James (Ellen Needs)",0,47.0,1,0,363272,7.0,,0,0,1
2,894,0,2,"Myles, Mr. Thomas Francis",1,62.0,0,0,240276,96875.0,,0,1,0
3,895,0,3,"Wirz, Mr. Albert",1,27.0,0,0,315154,86625.0,,0,0,1
4,896,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",0,22.0,1,1,3101298,122875.0,,0,0,1


## Exercise 2)

For the nursey.csv dataset (available on Google Classroom), apply Label Encoding or One Hot Encoding to the following columns:

Label Encoding:
* Child's Nursery (has_nurs): (1) proper, (2) less proper, (3) improper, (4) critical, (5) very critical;

One Hot Encoding:
* Social conditions (social): (1) non-problematic, (2) slightly problematic, (3) problematic;


In [None]:
# Your code here