<a href="https://colab.research.google.com/github/stepsbtw/Machine-Learning/blob/main/feature_encoding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Encoding Categorical Features
- Label Encoding
- Dummy Encoding (One Hot)
- Ordinal Encoding
- Frequency Encoding
- Target Encoding

In [2]:
import pandas as pd

df_banking_train = pd.read_csv("https://raw.githubusercontent.com/AILAB-CEFET-RJ/cic1205/refs/heads/main/data/banking/train.csv", sep=';')
df_banking_test = pd.read_csv("https://raw.githubusercontent.com/AILAB-CEFET-RJ/cic1205/refs/heads/main/data/banking/test.csv", sep=';')
df_banking_train

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45206,51,technician,married,tertiary,no,825,no,no,cellular,17,nov,977,3,-1,0,unknown,yes
45207,71,retired,divorced,primary,no,1729,no,no,cellular,17,nov,456,2,-1,0,unknown,yes
45208,72,retired,married,secondary,no,5715,no,no,cellular,17,nov,1127,5,184,3,success,yes
45209,57,blue-collar,married,secondary,no,668,no,no,telephone,17,nov,508,4,-1,0,unknown,no


In [3]:
print(df_banking_train["job"].unique())

['management' 'technician' 'entrepreneur' 'blue-collar' 'unknown'
 'retired' 'admin.' 'services' 'self-employed' 'unemployed' 'housemaid'
 'student']


In [4]:
print(df_banking_test["marital"].unique())

['married' 'single' 'divorced']


In [5]:
print(df_banking_train.education.unique())

['tertiary' 'secondary' 'unknown' 'primary']


In [6]:
print(df_banking_train.contact.unique())

['unknown' 'cellular' 'telephone']


In [7]:
print(df_banking_train.month.unique())

['may' 'jun' 'jul' 'aug' 'oct' 'nov' 'dec' 'jan' 'feb' 'mar' 'apr' 'sep']


In [8]:
print(df_banking_train.poutcome.unique())

['unknown' 'failure' 'other' 'success']


## Label Encoding
Each class is assigned an integer.

- It should only be used to encode the categories of the **target** variable.

In [9]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame(df_banking_train)
X = df_banking_train.drop(columns=['y'])
y = df_banking_train['y']
print(y.head())

label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)

print(y)

# now do the train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

0    no
1    no
2    no
3    no
4    no
Name: y, dtype: object
[0 0 0 ... 1 0 0]
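
To check which integer was assigned to which class, the fitted encoder can be inspected. The sketch below uses a small toy list as a stand-in for the `y` column (the list itself is an assumption, not data from the notebook) and relies on the `classes_` attribute and `inverse_transform` method of `LabelEncoder`.

```python
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the target column (hypothetical values)
labels = ["no", "yes", "no", "no", "yes"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the original labels; the label at position i is encoded as i
label_to_int = {str(label): i for i, label in enumerate(encoder.classes_)}

# inverse_transform maps the integers back to the original labels
decoded = encoder.inverse_transform(encoded)

print(label_to_int)
print(list(decoded))
```

Because the classes are sorted before integers are assigned, `'no'` comes before `'yes'` here.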


## Dummy Encoding (one hot)

Converts categorical variables into dummy (binary) variables, with 0 or 1 indicating the absence or presence of a category.

1. Identify categorical variables whose categories have no apparent order.

2. **Dummy variables**: For each unique category, create a new binary feature indicating whether the example belongs to that category.

3. **Drop a reference category**: In many cases one of the dummy variables has to be dropped to avoid the "dummy variable trap", where the dummy columns are perfectly collinear (for every row they sum to 1). This is done by dropping one category and using it as the reference. All the others are interpreted relative to it.
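
The collinearity behind step 3 can be checked directly. This is a minimal sketch on a toy `Color` series (the data and variable names are illustrative assumptions): with every dummy kept, each row sums to 1, and after dropping one column the dropped category is still recoverable as the all-zeros row.

```python
import pandas as pd

colors = pd.Series(["Red", "Blue", "Green", "Red", "Green"], name="Color")

# One column per category
full = pd.get_dummies(colors, dtype=int)

# drop_first=True drops the first category in sorted order ("Blue")
reduced = pd.get_dummies(colors, drop_first=True, dtype=int)

# With all dummies kept, every row sums to 1, so any one column
# is fully determined by the others
row_sums = full.sum(axis=1)

# The reference category is implied by a row of all zeros
recovered_blue = (reduced.sum(axis=1) == 0).astype(int)

print(row_sums.tolist())
print(list(reduced.columns))
```

No information is lost by dropping the reference column, which is why it is safe to do for models that are sensitive to collinear inputs.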

In [10]:
import pandas as pd
data = {'Gender': ['Male', 'Female', 'Male', 'Female', 'Male']}
df = pd.DataFrame(data)

# Binary encoding
binary_encoded = pd.get_dummies(df['Gender'], drop_first=True)

binary_encoded

Unnamed: 0,Male
0,True
1,False
2,True
3,False
4,True


In [11]:
data = {'Color': ['Red', 'Blue', 'Green', 'Red', 'Green']}
df = pd.DataFrame(data)

# One-hot encoding
one_hot_encoded = pd.get_dummies(df['Color'])

one_hot_encoded

Unnamed: 0,Blue,Green,Red
0,False,False,True
1,True,False,False
2,False,True,False
3,False,False,True
4,False,True,False


In [12]:
import pandas as pd

# `columns=` is only used when a DataFrame is passed; for a single Series it is ignored
df_encoded = pd.get_dummies(df_banking_train['contact'], drop_first=True)
print(df_encoded)

       telephone  unknown
0          False     True
1          False     True
2          False     True
3          False     True
4          False     True
...          ...      ...
45206      False    False
45207      False    False
45208      False    False
45209       True    False
45210      False    False

[45211 rows x 2 columns]


## Ordinal Encoding

Used for categorical variables that have a specific order: each category is assigned a numeric value that preserves that order.

1. The `OrdinalEncoder` maps each unique category to a numeric value based on the specified order. The mapping is usually defined by the user or inferred (when inferred, scikit-learn uses the sorted order of the unique values, which may not match the intended ranking).

2. Encoding: replace each category with its numeric label.

In [13]:
data = {'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small']}
df = pd.DataFrame(data)

# Ordinal encoding
size_mapping = {'Small': 1, 'Medium': 2, 'Large': 3}
df['Size_encoded'] = df['Size'].map(size_mapping)

df

Unnamed: 0,Size,Size_encoded
0,Small,1
1,Medium,2
2,Large,3
3,Medium,2
4,Small,1


In [14]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

selected_columns = ['marital', 'month', 'education']

print(X_train[selected_columns].head())

# category order for each selected column
categories_order = [
    ["single", "married", "divorced"],  #'marital'
    ["jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec"],  # 'month'
    ["unknown","primary", "secondary", "tertiary"]  # 'education'
]

ct = ColumnTransformer(
     [("enc", OrdinalEncoder(categories=categories_order), selected_columns)], remainder="passthrough"
     )
X_train_encoded = ct.fit_transform(X_train)

# new column names (encoded first)
encoded_feature_names = selected_columns
passthrough_feature_names = [col for col in X_train.columns if col not in selected_columns]
new_column_names = encoded_feature_names + passthrough_feature_names

X_train_encoded = pd.DataFrame(X_train_encoded, columns=new_column_names)

print()
print(X_train_encoded[selected_columns].head())

        marital month  education
10747    single   jun   tertiary
26054   married   nov  secondary
9125    married   jun  secondary
41659  divorced   oct   tertiary
4443    married   may  secondary

  marital month education
0     0.0   5.0       3.0
1     1.0  10.0       2.0
2     1.0   5.0       2.0
3     2.0   9.0       3.0
4     1.0   4.0       2.0
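
When the fitted encoder is later applied to a test set, a category that is missing from the specified order makes `OrdinalEncoder` raise an error by default. One possible way to handle this, sketched here on toy data, is the `handle_unknown="use_encoded_value"` option together with `unknown_value`. The `"doctorate"` level and the choice of `-1` are assumptions made for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({"education": ["primary", "tertiary", "secondary", "unknown"]})
# "doctorate" is a hypothetical category that never appears in training
test = pd.DataFrame({"education": ["secondary", "doctorate"]})

encoder = OrdinalEncoder(
    categories=[["unknown", "primary", "secondary", "tertiary"]],
    handle_unknown="use_encoded_value",  # do not raise on unseen categories
    unknown_value=-1,                    # code assigned to unseen categories
)
encoder.fit(train)

test_encoded = encoder.transform(test)
print(test_encoded)
```

Whether `-1` is a sensible code depends on the model; for tree-based models it simply places unseen values below every known level.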


## Frequency Encoding

Replaces each category with its frequency in the dataset, either as an absolute count or as a relative proportion. It can be useful when how often a category occurs is related to the target.

- Suited to high-cardinality categorical variables
- Can be informative for tree-based models
- An alternative when one-hot encoding would create an excessive number of features

In [15]:
import pandas as pd

data = {'Color': ['Red', 'Blue', 'Green', 'Red', 'Green']}
df = pd.DataFrame(data)

frequency_encoded = df['Color'].value_counts(normalize=True)

print(frequency_encoded)

Color
Red      0.4
Green    0.4
Blue     0.2
Name: proportion, dtype: float64
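
The frequencies computed above still have to be mapped back onto the column to obtain the encoded feature. A minimal sketch on the same toy data (the `Color_freq` column name is an assumption):

```python
import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Red", "Green"]})

# Relative frequency of each category
relative_freq = df["Color"].value_counts(normalize=True)

# Replace each category with its frequency
df["Color_freq"] = df["Color"].map(relative_freq)

print(df)
```

Note that `Red` and `Green` receive the same value (0.4), so the model can no longer tell them apart. This collision is an inherent limitation of frequency encoding.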


In [16]:
import pandas as pd
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

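class FrequencyEncoder(BaseEstimator, TransformerMixin):
    """Replace each category with its absolute count in the training data."""

    def __init__(self, columns=None):
        # Only constructor parameters are stored here (scikit-learn convention)
        self.columns = columns

    def fit(self, X, y=None):
        """Compute the frequency of each category in the specified columns."""
        if self.columns is None:
            # Default to every object/category column
            self.columns_ = list(X.select_dtypes(include=['object', 'category']).columns)
        else:
            self.columns_ = list(self.columns)

        # Learned state gets a trailing underscore and is rebuilt on every fit
        self.frequency_maps_ = {
            col: X[col].value_counts().to_dict() for col in self.columns_
        }
        return self

    def transform(self, X):
        """Apply frequency encoding to the specified columns."""
        X = X.copy()
        for col in self.columns_:
            # Categories not seen during fit get a count of 0
            X[col] = X[col].map(self.frequency_maps_[col]).fillna(0)
        return X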

selected_columns = ['job', 'marital', 'month']

# Pipeline: frequency encoder + model.
pipeline = Pipeline([
    ('freq_encoder', FrequencyEncoder(columns=selected_columns)),  # Frequency encoding
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))  # Model
])

pipeline.fit(X_train[selected_columns], y_train)
y_pred = pipeline.predict(X_test[selected_columns])

print("Transformed Test Set:")
print(pipeline.named_steps['freq_encoder'].transform(X_test[selected_columns]))

Transformed Test Set:
        job  marital  month
3776   6863    19100   9558
9928   2907     8942   3799
33409   649     8942   2043
31885  6573    19100   2043
15738  6573    19100   4830
...     ...      ...    ...
9016   5300     8942   3799
380    6863    19100   9558
7713   3634    19100   9558
12188   649    19100   3799
28550  6573     3605    989

[13564 rows x 3 columns]


## Target Encoding

Uses information from the target variable to assign numeric values to the categories.

1. Compute the target mean **per category**: for each category, the `Target Encoder` computes the mean of the target (with the class variable encoded numerically, e.g. 0, 1, 2).

2. Assign the target mean to the category: the numeric value assigned to each category is the target mean for that category.

3. Encode: replace each category with its computed mean.

The idea is to capture the relationship between the categorical variable and the target variable, which is useful for predictive models. Be careful not to cause data leakage: if the means are computed from the same rows that are later used for training or evaluation, the resulting performance estimates can be optimistic.
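
One common way to reduce this leakage on the training set is out-of-fold target encoding: each row is encoded with category means computed on the other folds only, so a row never sees its own target value. The sketch below is an illustration under stated assumptions, not the method used by any particular library. The helper name `out_of_fold_target_encode`, the toy `city`/`y` data, and the fallback to the global mean for unseen categories are all hypothetical choices.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def out_of_fold_target_encode(categories, target, n_splits=5, seed=42):
    """Encode each row with category means computed on the other folds only."""
    categories = categories.reset_index(drop=True)
    target = target.reset_index(drop=True)
    global_mean = target.mean()
    encoded = pd.Series(np.nan, index=categories.index, dtype=float)

    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in folds.split(categories):
        # Category means from the fitting folds (the encoded fold is excluded)
        fold_means = target.iloc[fit_idx].groupby(categories.iloc[fit_idx]).mean()
        encoded.iloc[enc_idx] = categories.iloc[enc_idx].map(fold_means).to_numpy()

    # Categories absent from the fitting folds fall back to the global mean
    return encoded.fillna(global_mean)


# Toy data (hypothetical)
toy = pd.DataFrame({
    "city": ["A", "A", "A", "B", "B", "B", "C", "C", "C", "C"],
    "y":    [1, 0, 1, 0, 0, 1, 1, 1, 0, 1],
})
toy["city_encoded"] = out_of_fold_target_encode(toy["city"], toy["y"])
print(toy)
```

At prediction time, test rows would typically be encoded with the means computed on the full training set, since their targets play no part in the encoding.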

In [17]:
# It can be useful for categorical variables with high cardinality.

data = {'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Chicago'],
        'Salary': [80000, 75000, 70000, 85000, 72000]}
df = pd.DataFrame(data)

# Target encoding
city_means = df.groupby('City')['Salary'].mean()
df['City_encoded'] = df['City'].map(city_means)

print(df)

          City  Salary  City_encoded
0     New York   80000       82500.0
1  Los Angeles   75000       75000.0
2      Chicago   70000       71000.0
3     New York   85000       82500.0
4      Chicago   72000       71000.0


### The [category_encoders](https://pypi.org/project/category-encoders/) library
It provides a robust implementation of the `Target Encoder`.

In [18]:
pip install category_encoders

Collecting category_encoders
  Downloading category_encoders-2.8.1-py3-none-any.whl.metadata (7.9 kB)
Downloading category_encoders-2.8.1-py3-none-any.whl (85 kB)
Installing collected packages: category_encoders
Successfully installed category_encoders-2.8.1


In [20]:
import pandas as pd
from sklearn.model_selection import train_test_split
from category_encoders import TargetEncoder

# training set: features and target
df = df_banking_train.copy()
X_train = df.drop(columns=['y'])
# note: 'no' is mapped to 1 here, so the encoded values are roughly the share of 'no' per category
# (.map avoids the downcasting FutureWarning that .replace triggers)
y_train = df['y'].map({'yes': 0, 'no': 1})

# test set: features and target
df = df_banking_test.copy()
X_test = df.drop(columns=['y'])
y_test = df['y'].map({'yes': 0, 'no': 1})

# initialize the encoder
encoder = TargetEncoder(cols=['job', 'marital'])

# fit on the training data only, then apply the learned means to the test set
X_train_encoded = encoder.fit_transform(X_train, y_train)
X_test_encoded = encoder.transform(X_test)

# Display part of the encoded data
print("Encoded Training Data:")
X_train_encoded



Encoded Training Data:


Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome
0,58,0.862444,0.898765,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown
1,44,0.889430,0.850508,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown
2,33,0.917283,0.898765,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown
3,47,0.927250,0.898765,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown
4,33,0.881944,0.850508,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45206,51,0.889430,0.898765,tertiary,no,825,no,no,cellular,17,nov,977,3,-1,0,unknown
45207,71,0.772085,0.880545,primary,no,1729,no,no,cellular,17,nov,456,2,-1,0,unknown
45208,72,0.772085,0.898765,secondary,no,5715,no,no,cellular,17,nov,1127,5,184,3,success
45209,57,0.927250,0.898765,secondary,no,668,no,no,telephone,17,nov,508,4,-1,0,unknown


## Summary

| Encoding | Good for | High cardinality? | Example |
|----------|----------|-------------------|---------|
| **One-Hot Encoding** | Nominal variables with few categories | ❌ No (creates too many columns) | `pd.get_dummies(data)` |
| **Ordinal Encoding** | Ordered categories (e.g., "low" < "medium" < "high") | ✅ Yes | `OrdinalEncoder(categories=[["low", "medium", "high"]])` |
| **Frequency Encoding** | High-cardinality variables | ✅ Yes | `df[col] = df[col].map(df[col].value_counts())` |
| **Target Encoding** | High-cardinality features strongly related to the target | ✅ Yes (risk of data leakage) | `df[col] = df.groupby(col)['target'].transform('mean')` |
