
# Handling categorical variables

In machine learning, many models cannot handle categorical variables directly. It is therefore important to know the main encoding methods and how to apply them.
<center><img src="https://resources.workable.com/wp-content/uploads/2016/01/category-manager-640x230.jpg" width="70%"></center>


In this lesson we will see how to use `LabelEncoder` and `OneHotEncoder`. Beyond that, I will show you some situations where numeric columns are actually categorical variables.

To illustrate these techniques, I will use the breast cancer dataset made available by the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/breast+cancer).

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

df = pd.read_csv("https://raw.githubusercontent.com/carlosfab/dsnp2/master/datasets/breast-cancer.data", header=None,
                 names=["class", "age", "menopause", "tumor_size",
                        "inv_nodes", "nodes-caps", "deg_malig", "breast",
                        "breast_quad", "irradiat"])

df.head()

Unnamed: 0,class,age,menopause,tumor_size,inv_nodes,nodes-caps,deg_malig,breast,breast_quad,irradiat
0,no-recurrence-events,30-39,premeno,30-34,0-2,no,3,left,left_low,no
1,no-recurrence-events,40-49,premeno,20-24,0-2,no,2,right,right_up,no
2,no-recurrence-events,40-49,premeno,20-24,0-2,no,2,left,left_low,no
3,no-recurrence-events,60-69,ge40,15-19,0-2,no,2,right,left_up,no
4,no-recurrence-events,40-49,premeno,0-4,0-2,no,2,right,right_low,no


In [None]:
X = df.drop('class', axis=1)
y = df['class']

X_train, X_test, y_train, y_test = train_test_split(X, y)

## Label encoding

For label encoding, we assign a number to each category. For example:

* No tumor = `0`
* Benign tumor = `1`
* Malignant tumor = `2`
* Inconclusive = `3`
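
The mapping in the list above can be sketched with a plain dictionary and `Series.map`. The `diagnosis` values are made-up illustration data, not part of the breast cancer dataset. An explicit dictionary is used because `LabelEncoder` assigns codes in sorted order of the labels, so it would not necessarily reproduce the numbering listed above.

```python
import pandas as pd

# Hypothetical mapping that mirrors the list above
mapping = {
    "No tumor": 0,
    "Benign tumor": 1,
    "Malignant tumor": 2,
    "Inconclusive": 3,
}

# Made-up example values, for illustration only
diagnosis = pd.Series(
    ["Benign tumor", "No tumor", "Inconclusive", "Malignant tumor", "No tumor"]
)

# Replace each category with its assigned number
encoded = diagnosis.map(mapping)
print(encoded.tolist())
```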


In [None]:
# y_train before encoding
y_train

222       recurrence-events
213       recurrence-events
152    no-recurrence-events
109    no-recurrence-events
89     no-recurrence-events
               ...         
130    no-recurrence-events
187    no-recurrence-events
99     no-recurrence-events
53     no-recurrence-events
263       recurrence-events
Name: class, Length: 214, dtype: object

In [None]:
# y_test before encoding
y_test

118    no-recurrence-events
83     no-recurrence-events
217       recurrence-events
127    no-recurrence-events
147    no-recurrence-events
               ...         
174    no-recurrence-events
215       recurrence-events
234       recurrence-events
82     no-recurrence-events
167    no-recurrence-events
Name: class, Length: 72, dtype: object

In [None]:
# encoding the target variable
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(y_train)
y_train = le.transform(y_train)
y_test = le.transform(y_test)
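
The encoder is fitted on the training labels only and then reused for the test labels. A minimal sketch of what that implies, using the two class names from this dataset plus a hypothetical `"unknown-label"` value: labels that were not seen during `fit` are rejected by `transform`.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
# fit learns the set of classes (stored in sorted order)
encoder.fit(["no-recurrence-events", "recurrence-events"])

# transform maps each label to its position in encoder.classes_
codes = encoder.transform(["recurrence-events", "no-recurrence-events"])

# "unknown-label" is a hypothetical value that was not present during fit
try:
    encoder.transform(["unknown-label"])
    unseen_rejected = False
except ValueError:
    unseen_rejected = True

print(codes, unseen_rejected)
```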

In [None]:
# y_train after encoding
y_train

array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1,
       1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1])

In [None]:
# y_test after encoding
y_test

array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 0, 0])

In [None]:
# inspecting the classes learned during fit
le.classes_

array(['no-recurrence-events', 'recurrence-events'], dtype=object)

In [None]:
# recovering the original labels from the codes
le.inverse_transform(y_train)[:5]

array(['recurrence-events', 'recurrence-events', 'no-recurrence-events',
       'no-recurrence-events', 'no-recurrence-events'], dtype=object)

## One-hot encoding

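What if the order does not necessarily represent a real scale of importance? Label encoding would impose an ordering the categories do not have. One-hot encoding avoids this by giving each category its own binary column.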

<center><img alt="Colaboratory logo" width="45%" src="https://raw.githubusercontent.com/carlosfab/dsnp2/master/img/encoding.png"></center>


In [None]:
# X_train before OneHotEncoder
X_train

Unnamed: 0,age,menopause,tumor_size,inv_nodes,nodes-caps,deg_malig,breast,breast_quad,irradiat
222,60-69,ge40,25-29,0-2,no,3,left,right_low,yes
213,50-59,premeno,25-29,0-2,no,1,right,left_up,no
152,50-59,ge40,35-39,15-17,no,3,left,left_low,no
109,60-69,ge40,30-34,0-2,no,1,right,left_up,no
89,40-49,premeno,40-44,0-2,no,1,right,left_up,no
...,...,...,...,...,...,...,...,...,...
130,40-49,premeno,35-39,9-11,yes,2,right,right_up,yes
187,60-69,ge40,15-19,0-2,no,2,left,left_up,yes
99,30-39,premeno,25-29,0-2,no,2,left,left_low,no
53,70-79,ge40,20-24,0-2,no,3,left,left_up,no


In [None]:
from sklearn.preprocessing import OneHotEncoder

# use a separate name so the fitted LabelEncoder (`le`) is not overwritten
ohe = OneHotEncoder()
ohe.fit(X_train)
X_train_enc = ohe.transform(X_train)

In [None]:
X_train_enc

<214x42 sparse matrix of type '<class 'numpy.float64'>'
	with 1926 stored elements in Compressed Sparse Row format>

In [None]:
X_train_enc.toarray()

array([[0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 1., ..., 0., 1., 0.],
       [0., 0., 1., ..., 0., 1., 0.],
       ...,
       [1., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 1., ..., 0., 1., 0.]])
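
To make the columns of the encoded matrix easier to read, here is a small sketch on toy data. The `train` and `test` frames are invented for illustration and only borrow column names from the dataset. It also passes `handle_unknown="ignore"`, an option the cells above do not use, so that a category that appears only in the test data is encoded as all zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frames, invented for illustration
train = pd.DataFrame({
    "breast": ["left", "right", "left"],
    "menopause": ["premeno", "ge40", "ge40"],
})
test = pd.DataFrame({
    "breast": ["right", "left"],
    "menopause": ["lt40", "premeno"],  # "lt40" never appears in train
})

# handle_unknown="ignore" encodes unseen categories as all zeros
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

train_enc = encoder.transform(train).toarray()
test_enc = encoder.transform(test).toarray()

# One list of learned categories per input column, in column order
categories = [list(c) for c in encoder.categories_]
print(categories)
print(test_enc)
```

Each output column corresponds to one entry of `categories`, read left to right, so the first test row has no active `menopause` column.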

## Dummy variables

A *dummy variable* takes the value 0 or 1 to indicate the absence or presence of a given category. Unlike label encoding, where each category is assigned a numeric value, here we build a kind of sparse matrix in which each category gets its own column, with 0 indicating absence and 1 indicating presence.

In [None]:
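# one-hot encode only the listed columns; the KeyError below appears when `df`
# no longer has them, e.g. after `df` is reassigned in the manual-mapping section
pd.get_dummies(df, columns=['menopause', 'breast'])

KeyError: ignored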

In [None]:
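# encode every object-typed column; this output was produced with `df` bound to
# the employee DataFrame from the manual-mapping section
pd.get_dummies(df)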

Unnamed: 0,Age,Duration_2,Duration_3,Duration_5,Feeling_Bad,Feeling_Good,Feeling_Normal
0,22,0,1,0,0,1,0
1,25,0,0,1,0,0,1
2,23,1,0,0,1,0,0


In [None]:
df_enc = pd.get_dummies(df)
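
By default, `pd.get_dummies` only encodes columns with object, string, or category dtype, so a numeric column that is really categorical is left untouched. A minimal sketch on a toy frame (the values are invented; `deg_malig` and `breast` only borrow column names from the dataset) shows how listing the column in `columns=` forces it to be encoded as well.

```python
import pandas as pd

# Toy frame: deg_malig is stored as integers but is really a category
toy = pd.DataFrame({
    "deg_malig": [3, 2, 2, 1],
    "breast": ["left", "right", "left", "right"],
})

# Default behaviour: only the object column "breast" is expanded
default_enc = pd.get_dummies(toy)

# Listing the columns explicitly also expands the numeric one
explicit_enc = pd.get_dummies(toy, columns=["deg_malig", "breast"])

print(list(default_enc.columns))
print(list(explicit_enc.columns))
```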

## Manual mapping

In [None]:
import pandas as pd

# small example DataFrame for the manual mapping (this reassigns `df`)
employee = {
    'Age': [22, 25, 23],
    'Duration': ['3', '5', '2'],
    'Feeling': ['Good', 'Normal', 'Bad'],
}
df = pd.DataFrame(employee)
print(df)

   Age Duration Feeling
0   22        3    Good
1   25        5  Normal
2   23        2     Bad


In [None]:
# avoid naming the mapping `dict`, which would shadow the built-in type
feeling_map = {"Good": 2, "Normal": 1, "Bad": 0}
df2 = df.replace({"Feeling": feeling_map})
print(df2)

   Age Duration  Feeling
0   22        3        2
1   25        5        1
2   23        2        0
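
An alternative sketch of the same manual mapping uses `Series.map`. The difference is that a value missing from the dictionary becomes `NaN` instead of staying as text, which makes unmapped categories easy to spot. The `survey` frame and the extra `"Great"` value are invented for illustration.

```python
import pandas as pd

# Same ordering as above: Bad < Normal < Good
feeling_order = {"Bad": 0, "Normal": 1, "Good": 2}

# "Great" is a hypothetical value that the mapping does not cover
survey = pd.DataFrame({"Feeling": ["Good", "Normal", "Bad", "Great"]})

# map returns NaN for any value that is not a key of the dictionary
survey["Feeling_code"] = survey["Feeling"].map(feeling_order)

# List the categories that still need a code
unmapped = survey.loc[survey["Feeling_code"].isna(), "Feeling"].tolist()
print(survey)
print(unmapped)
```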
