# Feature Engineering
Feature engineering is a key step in data science and machine learning: selecting, creating, and transforming variables (features) from raw data. The goal is to improve model performance by building more informative and relevant representations of the data. Done well, it can lead to more accurate models, better generalization, and deeper insights.


In [71]:
import pandas as pd

In [72]:
baseLimpa = pd.read_excel("ChavesClientes.xlsx")
baseLimpa.head()

Unnamed: 0,ID,ChaveSituacao,ClassRisco,CatCliente,Pagamento
0,1,32FC,Ccinza,Basic-Alpha,1
1,2,25MV,AAmarelo,Black,1
2,3,27MV,B-Amarelo,Basic-Beta,1
3,4,26FD,BAmarelo,Black,0
4,5,26FD,C-Amarelo,Black,0


In [73]:
baseLimpa = pd.read_excel("ChavesClientesLimpo.xlsx")
baseLimpa.head()

Unnamed: 0,ChaveSituacao,ClassRisco,CatCliente,Pagamento,Idade,Genero,EstadoCivil,Categoria,CatVIP,Risco
0,32FC,Ccinza,Basic-Alpha,1,32,F,C,Basic,Alpha,C
1,25MV,AAmarelo,Black,1,25,M,V,Black,Comum,A
2,27MV,B-Amarelo,Basic-Beta,1,27,M,V,Basic,Beta,B-
3,26FD,BPreto,Black,0,26,F,D,Black,Comum,B
4,26FD,C-Amarelo,Black,0,26,F,D,Black,Comum,C-


**We can drop the columns we are not going to use**

In [74]:
baseLimpa = baseLimpa.drop(['ChaveSituacao','ClassRisco','CatCliente'],axis=1)

**And inspect the resulting DataFrame**

In [75]:
baseLimpa.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Pagamento    20 non-null     int64 
 1   Idade        20 non-null     int64 
 2   Genero       20 non-null     object
 3   EstadoCivil  20 non-null     object
 4   Categoria    20 non-null     object
 5   CatVIP       20 non-null     object
 6   Risco        20 non-null     object
dtypes: int64(2), object(5)
memory usage: 1.2+ KB


### Passing this data to a model such as Linear Regression raises the following error

In [76]:
# Selecting the values of X and y
X = baseLimpa[['Idade','Genero','EstadoCivil','Categoria','CatVIP','Risco']]
y = baseLimpa.Pagamento

from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(X, y)

reg.score(X,y)

ValueError: could not convert string to float: 'F'
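
The error comes from the text columns. A minimal sketch of how to list them before fitting, using a hypothetical four-row miniature of the base (the `df` and `non_numeric` names are illustrative, not from the original notebook):

```python
import pandas as pd

# Hypothetical miniature of the cleaned base (values copied from the first rows shown above)
df = pd.DataFrame({
    "Pagamento": [1, 1, 1, 0],
    "Idade": [32, 25, 27, 26],
    "Genero": ["F", "M", "M", "F"],
    "Risco": ["C", "A", "B-", "B"],
})

# Columns that are not numeric and therefore cannot go into the model as they are
non_numeric = df.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```

Every column in that list needs some kind of encoding before the model can use it.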

### That is why we need to treat the data before feeding it to the model

In [78]:
baseLimpa.head(10)

Unnamed: 0,Pagamento,Idade,Genero,EstadoCivil,Categoria,CatVIP,Risco
0,1,32,F,C,Basic,Alpha,C
1,1,25,M,V,Black,Comum,A
2,1,27,M,V,Basic,Beta,B-
3,0,26,F,D,Black,Comum,B
4,0,26,F,D,Black,Comum,C-
5,0,28,F,C,Platinum,Alpha,C-
6,1,27,M,D,Platinum,Beta,A-
7,0,31,M,D,Basic,Comum,C-
8,1,28,F,S,Black,Comum,A-
9,1,31,M,V,Platinum,Comum,C+


**One Hot Encoding handles values that have no order relationship among them**
- https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

In [79]:
# Importing and using OneHotEncoder for the 'Genero' and 'EstadoCivil' columns
from sklearn.preprocessing import OneHotEncoder


In [80]:
ohe = OneHotEncoder()
ohe_transform = ohe.fit_transform(baseLimpa[['Genero','EstadoCivil']])

In [81]:
# Feature names
# (ohe.get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
ohe.get_feature_names_out()

array(['Genero_F', 'Genero_M', 'EstadoCivil_C', 'EstadoCivil_D',
       'EstadoCivil_S', 'EstadoCivil_V'], dtype=object)

In [82]:
# Array of values
ohe_transform.toarray()

array([[1., 0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0., 1.],
       [0., 1., 0., 0., 0., 1.],
       [1., 0., 0., 1., 0., 0.],
       [1., 0., 0., 1., 0., 0.],
       [1., 0., 1., 0., 0., 0.],
       [0., 1., 0., 1., 0., 0.],
       [0., 1., 0., 1., 0., 0.],
       [1., 0., 0., 0., 1., 0.],
       [0., 1., 0., 0., 0., 1.],
       [0., 1., 0., 0., 0., 1.],
       [1., 0., 1., 0., 0., 0.],
       [1., 0., 1., 0., 0., 0.],
       [0., 1., 1., 0., 0., 0.],
       [0., 1., 1., 0., 0., 0.],
       [0., 1., 1., 0., 0., 0.],
       [1., 0., 0., 1., 0., 0.],
       [0., 1., 0., 1., 0., 0.],
       [0., 1., 0., 1., 0., 0.],
       [1., 0., 0., 0., 0., 1.]])

In [83]:
# Turning this data into a DataFrame
df_ohe = pd.DataFrame(ohe_transform.toarray())
df_ohe.columns = ohe.get_feature_names_out()
df_ohe.head()

Unnamed: 0,Genero_F,Genero_M,EstadoCivil_C,EstadoCivil_D,EstadoCivil_S,EstadoCivil_V
0,1.0,0.0,1.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0,0.0,1.0
2,0.0,1.0,0.0,0.0,0.0,1.0
3,1.0,0.0,0.0,1.0,0.0,0.0
4,1.0,0.0,0.0,1.0,0.0,0.0


In [84]:
# To finish, we can concatenate the two DataFrames
baseLimpa = pd.concat([baseLimpa,df_ohe],axis=1)
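
`pd.concat(..., axis=1)` aligns rows by index. That works here because both frames have the default 0..19 index. If the base had been filtered or re-indexed first, the rows would no longer line up. A minimal sketch of a safer pattern, with hypothetical data and names:

```python
import pandas as pd

# Hypothetical frame whose index is not 0..n-1 (for example, after filtering rows)
base = pd.DataFrame({"Idade": [32, 25, 27]}, index=[10, 11, 12])
encoded = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

# Reusing the original index keeps the rows aligned when concatenating
df_ohe = pd.DataFrame(encoded, columns=["Genero_F", "Genero_M"], index=base.index)
combined = pd.concat([base, df_ohe], axis=1)
print(combined.shape)
```

Without `index=base.index`, the result would have six rows, half of them filled with `NaN`.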

In [85]:
baseLimpa.head(20)

Unnamed: 0,Pagamento,Idade,Genero,EstadoCivil,Categoria,CatVIP,Risco,Genero_F,Genero_M,EstadoCivil_C,EstadoCivil_D,EstadoCivil_S,EstadoCivil_V
0,1,32,F,C,Basic,Alpha,C,1.0,0.0,1.0,0.0,0.0,0.0
1,1,25,M,V,Black,Comum,A,0.0,1.0,0.0,0.0,0.0,1.0
2,1,27,M,V,Basic,Beta,B-,0.0,1.0,0.0,0.0,0.0,1.0
3,0,26,F,D,Black,Comum,B,1.0,0.0,0.0,1.0,0.0,0.0
4,0,26,F,D,Black,Comum,C-,1.0,0.0,0.0,1.0,0.0,0.0
5,0,28,F,C,Platinum,Alpha,C-,1.0,0.0,1.0,0.0,0.0,0.0
6,1,27,M,D,Platinum,Beta,A-,0.0,1.0,0.0,1.0,0.0,0.0
7,0,31,M,D,Basic,Comum,C-,0.0,1.0,0.0,1.0,0.0,0.0
8,1,28,F,S,Black,Comum,A-,1.0,0.0,0.0,0.0,1.0,0.0
9,1,31,M,V,Platinum,Comum,C+,0.0,1.0,0.0,0.0,0.0,1.0


**If the values do have an order relationship, we can use Ordinal Encoding**
- https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html#sklearn.preprocessing.OrdinalEncoder

In [86]:
# Understanding the values in the "Categoria" column
baseLimpa.Categoria.value_counts()

Categoria
Black       7
Platinum    7
Basic       6
Name: count, dtype: int64

In [87]:
baseLimpa.Categoria.reshape(-1, 1)

AttributeError: 'Series' object has no attribute 'reshape'

In [88]:
baseLimpa.Categoria.values.reshape(-1, 1)

array([['Basic'],
       ['Black'],
       ['Basic'],
       ['Black'],
       ['Black'],
       ['Platinum'],
       ['Platinum'],
       ['Basic'],
       ['Black'],
       ['Platinum'],
       ['Basic'],
       ['Basic'],
       ['Basic'],
       ['Platinum'],
       ['Black'],
       ['Platinum'],
       ['Black'],
       ['Black'],
       ['Platinum'],
       ['Platinum']], dtype=object)
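
The encoders expect 2-D input, which is why the Series has to be reshaped. A possible alternative, sketched below on a hypothetical three-row frame, is to select the column with double brackets, which already returns a one-column DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"Categoria": ["Basic", "Black", "Platinum"]})

# NumPy route: 1-D array reshaped to one column
as_array = df["Categoria"].to_numpy().reshape(-1, 1)

# pandas route: double brackets keep the data 2-D
as_frame = df[["Categoria"]]

print(as_array.shape, as_frame.shape)
```

Both shapes are `(3, 1)`, and either object can be passed to `fit_transform`.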

In [89]:
# Importing and using OrdinalEncoder for the 'Categoria' column
# (with no `categories` argument the order is alphabetical: Basic=0, Black=1, Platinum=2)
from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder()
oe_transform = oe.fit_transform(baseLimpa.Categoria.values.reshape(-1, 1))

In [90]:
oe_transform

array([[0.],
       [1.],
       [0.],
       [1.],
       [1.],
       [2.],
       [2.],
       [0.],
       [1.],
       [2.],
       [0.],
       [0.],
       [0.],
       [2.],
       [1.],
       [2.],
       [1.],
       [1.],
       [2.],
       [2.]])
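
The codes above follow alphabetical order, which may or may not match the real ranking of the categories. A sketch of how to inspect the learned order and, if needed, impose one. The order `Basic < Platinum < Black` below is only an assumption for illustration; the original notebook does not state the ranking:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"Categoria": ["Basic", "Black", "Platinum", "Black"]})

# Default behaviour: categories are sorted alphabetically
default_enc = OrdinalEncoder().fit(df[["Categoria"]])
print(default_enc.categories_)

# Assumed business order, for illustration only
explicit_enc = OrdinalEncoder(categories=[["Basic", "Platinum", "Black"]])
codes = explicit_enc.fit_transform(df[["Categoria"]])
print(codes.ravel())
```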

In [91]:
# And we can add this column
baseLimpa['NrCategoria'] = oe_transform

In [92]:
# Viewing the base
baseLimpa.head(20)

Unnamed: 0,Pagamento,Idade,Genero,EstadoCivil,Categoria,CatVIP,Risco,Genero_F,Genero_M,EstadoCivil_C,EstadoCivil_D,EstadoCivil_S,EstadoCivil_V,NrCategoria
0,1,32,F,C,Basic,Alpha,C,1.0,0.0,1.0,0.0,0.0,0.0,0.0
1,1,25,M,V,Black,Comum,A,0.0,1.0,0.0,0.0,0.0,1.0,1.0
2,1,27,M,V,Basic,Beta,B-,0.0,1.0,0.0,0.0,0.0,1.0,0.0
3,0,26,F,D,Black,Comum,B,1.0,0.0,0.0,1.0,0.0,0.0,1.0
4,0,26,F,D,Black,Comum,C-,1.0,0.0,0.0,1.0,0.0,0.0,1.0
5,0,28,F,C,Platinum,Alpha,C-,1.0,0.0,1.0,0.0,0.0,0.0,2.0
6,1,27,M,D,Platinum,Beta,A-,0.0,1.0,0.0,1.0,0.0,0.0,2.0
7,0,31,M,D,Basic,Comum,C-,0.0,1.0,0.0,1.0,0.0,0.0,0.0
8,1,28,F,S,Black,Comum,A-,1.0,0.0,0.0,0.0,1.0,0.0,1.0
9,1,31,M,V,Platinum,Comum,C+,0.0,1.0,0.0,0.0,0.0,1.0,2.0


In [93]:
# Doing the same for the Risco column, this time passing the order explicitly
oe = OrdinalEncoder(categories=[['C-','C','C+','B-','B','B+','A-','A','A+']])
oe_transform_risco = oe.fit_transform(baseLimpa.Risco.values.reshape(-1, 1))
oe_transform_risco

array([[1.],
       [7.],
       [3.],
       [4.],
       [0.],
       [0.],
       [6.],
       [0.],
       [6.],
       [2.],
       [7.],
       [0.],
       [4.],
       [6.],
       [1.],
       [3.],
       [6.],
       [0.],
       [1.],
       [7.]])
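
One way to confirm that the explicit order was applied as intended is to round-trip a few values through `inverse_transform`. A minimal sketch, using four risk labels taken from the base (the variable names are illustrative):

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

risk_order = ["C-", "C", "C+", "B-", "B", "B+", "A-", "A", "A+"]
enc = OrdinalEncoder(categories=[risk_order])

values = np.array([["C"], ["A"], ["B-"], ["C-"]], dtype=object)
codes = enc.fit_transform(values)          # position of each label in risk_order
restored = enc.inverse_transform(codes)    # back to the original labels

print(codes.ravel(), restored.ravel())
```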

In [94]:
baseLimpa['NrRisco'] = oe_transform_risco

In [95]:
baseLimpa.head(20)

Unnamed: 0,Pagamento,Idade,Genero,EstadoCivil,Categoria,CatVIP,Risco,Genero_F,Genero_M,EstadoCivil_C,EstadoCivil_D,EstadoCivil_S,EstadoCivil_V,NrCategoria,NrRisco
0,1,32,F,C,Basic,Alpha,C,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
1,1,25,M,V,Black,Comum,A,0.0,1.0,0.0,0.0,0.0,1.0,1.0,7.0
2,1,27,M,V,Basic,Beta,B-,0.0,1.0,0.0,0.0,0.0,1.0,0.0,3.0
3,0,26,F,D,Black,Comum,B,1.0,0.0,0.0,1.0,0.0,0.0,1.0,4.0
4,0,26,F,D,Black,Comum,C-,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
5,0,28,F,C,Platinum,Alpha,C-,1.0,0.0,1.0,0.0,0.0,0.0,2.0,0.0
6,1,27,M,D,Platinum,Beta,A-,0.0,1.0,0.0,1.0,0.0,0.0,2.0,6.0
7,0,31,M,D,Basic,Comum,C-,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0
8,1,28,F,S,Black,Comum,A-,1.0,0.0,0.0,0.0,1.0,0.0,1.0,6.0
9,1,31,M,V,Platinum,Comum,C+,0.0,1.0,0.0,0.0,0.0,1.0,2.0,2.0


**Finally, we can write functions to transform columns, for example reducing CatVIP to a flag that only says whether the customer is VIP**

In [96]:
# Creating a function that checks whether the customer is VIP
def define_VIP(valor):
    # 'Alpha' and 'Beta' are the VIP tiers; anything else (e.g. 'Comum') is not
    return 1 if valor in ('Alpha', 'Beta') else 0

In [97]:
# Applying this function to the 'CatVIP' column
baseLimpa['NrVIP'] = baseLimpa.CatVIP.apply(define_VIP)
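
For a simple membership test like this one, a vectorized alternative to `apply` is `Series.isin`. A sketch on a hypothetical four-row frame:

```python
import pandas as pd

df = pd.DataFrame({"CatVIP": ["Alpha", "Comum", "Beta", "Comum"]})

# True where the value is one of the VIP tiers, then cast to 0/1
df["NrVIP"] = df["CatVIP"].isin(["Alpha", "Beta"]).astype(int)
print(df["NrVIP"].tolist())
```

A custom function with `apply` remains the more flexible option when the rule is more involved than a list lookup.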

In [98]:
baseLimpa.head(20)

Unnamed: 0,Pagamento,Idade,Genero,EstadoCivil,Categoria,CatVIP,Risco,Genero_F,Genero_M,EstadoCivil_C,EstadoCivil_D,EstadoCivil_S,EstadoCivil_V,NrCategoria,NrRisco,NrVIP
0,1,32,F,C,Basic,Alpha,C,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1
1,1,25,M,V,Black,Comum,A,0.0,1.0,0.0,0.0,0.0,1.0,1.0,7.0,0
2,1,27,M,V,Basic,Beta,B-,0.0,1.0,0.0,0.0,0.0,1.0,0.0,3.0,1
3,0,26,F,D,Black,Comum,B,1.0,0.0,0.0,1.0,0.0,0.0,1.0,4.0,0
4,0,26,F,D,Black,Comum,C-,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0
5,0,28,F,C,Platinum,Alpha,C-,1.0,0.0,1.0,0.0,0.0,0.0,2.0,0.0,1
6,1,27,M,D,Platinum,Beta,A-,0.0,1.0,0.0,1.0,0.0,0.0,2.0,6.0,1
7,0,31,M,D,Basic,Comum,C-,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0
8,1,28,F,S,Black,Comum,A-,1.0,0.0,0.0,0.0,1.0,0.0,1.0,6.0,0
9,1,31,M,V,Platinum,Comum,C+,0.0,1.0,0.0,0.0,0.0,1.0,2.0,2.0,0


**Dropping the columns we no longer need**

In [103]:
# Dropping the original text columns again (this is what a base ready for the model looks like)

baseLimpa = baseLimpa.drop(['Genero','EstadoCivil','Categoria','CatVIP','Risco'],axis=1)
baseLimpa.head(20)

Unnamed: 0,Pagamento,Idade,Genero_F,Genero_M,EstadoCivil_C,EstadoCivil_D,EstadoCivil_S,EstadoCivil_V,NrCategoria,NrRisco,NrVIP
0,1,32,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1
1,1,25,0.0,1.0,0.0,0.0,0.0,1.0,1.0,7.0,0
2,1,27,0.0,1.0,0.0,0.0,0.0,1.0,0.0,3.0,1
3,0,26,1.0,0.0,0.0,1.0,0.0,0.0,1.0,4.0,0
4,0,26,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0
5,0,28,1.0,0.0,1.0,0.0,0.0,0.0,2.0,0.0,1
6,1,27,0.0,1.0,0.0,1.0,0.0,0.0,2.0,6.0,1
7,0,31,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0
8,1,28,1.0,0.0,0.0,0.0,1.0,0.0,1.0,6.0,0
9,1,31,0.0,1.0,0.0,0.0,0.0,1.0,2.0,2.0,0


### Using the data in a Linear Regression model again

In [104]:
# Selecting the values of X and y
X = baseLimpa.drop('Pagamento',axis=1)
y = baseLimpa.Pagamento

from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(X, y)

reg.score(X,y)  # R² computed on the same rows used for fitting

0.6197521275369209

In [48]:
# Selecting the values of X and y (using only Idade, for comparison)
X = baseLimpa.Idade.values.reshape(-1,1)
y = baseLimpa.Pagamento

from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(X, y)

reg.score(X,y)

0.03832882093751644
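
The encoding steps above can also be bundled with the model, so that the same transformations are applied whenever new data arrives. The sketch below is one possible arrangement, not the notebook's own method: `ColumnTransformer` and `Pipeline` do not appear in the original, the data is only the first ten rows shown earlier, and the `CatVIP` flag is left out for brevity.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# First ten rows of the cleaned base, copied from the output shown earlier
df = pd.DataFrame({
    "Pagamento": [1, 1, 1, 0, 0, 0, 1, 0, 1, 1],
    "Idade": [32, 25, 27, 26, 26, 28, 27, 31, 28, 31],
    "Genero": list("FMMFFFMMFM"),
    "EstadoCivil": list("CVVDDCDDSV"),
    "Categoria": ["Basic", "Black", "Basic", "Black", "Black",
                  "Platinum", "Platinum", "Basic", "Black", "Platinum"],
    "Risco": ["C", "A", "B-", "B", "C-", "C-", "A-", "C-", "A-", "C+"],
})

risk_order = ["C-", "C", "C+", "B-", "B", "B+", "A-", "A", "A+"]

preprocess = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(), ["Genero", "EstadoCivil"]),
        ("categoria", OrdinalEncoder(), ["Categoria"]),  # alphabetical order, as in the notebook
        ("risco", OrdinalEncoder(categories=[risk_order]), ["Risco"]),
    ],
    remainder="passthrough",  # Idade goes through unchanged
)

model = Pipeline([("preprocess", preprocess), ("reg", LinearRegression())])

X = df.drop("Pagamento", axis=1)
y = df["Pagamento"]
model.fit(X, y)

# 6 one-hot columns + NrCategoria + NrRisco + Idade
X_encoded = model.named_steps["preprocess"].transform(X)
print(X_encoded.shape)
```

With this layout, `model.predict` accepts the raw text columns directly, and the encoders fitted on the training rows are reused for any new rows.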