# Model selection

Before starting, let's redo all the data preparation covered in lessons 2 and 3.

In [12]:
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.model_selection import train_test_split

# splitting data
data = pd.read_csv('./marketing_investimento.csv')
target = 'aderencia_investimento'
x = data.drop(target, axis=1)
y = data[target] 

# transformer
categorical_cols = ['estado_civil', 'escolaridade', 'inadimplencia', 'fez_emprestimo']
one_hot = make_column_transformer(
    (
        OneHotEncoder(drop='if_binary'),
        categorical_cols
    ),
    remainder='passthrough',
    sparse_threshold=0
)

# encoding x
x = one_hot.fit_transform(x)

# encoding y
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)

# splitting train and test data
x_train, x_test, y_train, y_test = train_test_split(x, y, stratify=y, random_state=5)

## Normalizing the data

Some algorithms may give more weight to a variable because of the scale of its values rather than because of its importance for classifying the target variable. For example, in a dataset with the columns **age** and **salary**, the algorithm may give more decision weight to salary simply because its values are on a larger scale than age, not because salary is a more important variable than age.

In these cases, we need to transform the data so that all variables are on the same scale, so that the algorithm is not misled by numeric ranges that differ between variables. One possible transformation is **normalization**, and one way to do it is the min-max strategy, given by the formula below:

$X_{sc} = \frac{X - X_{min}}{X_{max} - X_{min}} $
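
To make the formula concrete, here is a minimal sketch that applies it by hand to a single column. The ages are made-up values for illustration, not rows from `marketing_investimento.csv`.

```python
# Min-max scaling of a single column, written out by hand.
# The ages below are made-up values, not taken from the dataset.
ages = [18, 25, 40, 62]

x_min = min(ages)
x_max = max(ages)

# X_sc = (X - X_min) / (X_max - X_min)
ages_scaled = [(x - x_min) / (x_max - x_min) for x in ages]

print(ages_scaled)
```

The smallest value maps to 0, the largest to 1, and everything else falls in between while keeping its relative position.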

In [13]:
from sklearn.preprocessing import MinMaxScaler

normalize = MinMaxScaler()
# fit (find the min and max of each column) and transform (apply min-max scaling with those values)
x_train_norm = normalize.fit_transform(x_train)

In [14]:
# x_train_norm is a NumPy array, so to view it as a DataFrame we wrap it in pd.DataFrame
pd.DataFrame(x_train_norm)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.205882,0.065564,0.123734,0.032258
1,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.147059,0.045792,0.396527,0.032258
2,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.338235,0.076036,0.335022,0.000000
3,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.573529,0.062866,0.315123,0.000000
4,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.338235,0.148380,0.065847,0.129032
...,...,...,...,...,...,...,...,...,...,...,...,...
946,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.191176,0.044265,0.246382,0.129032
947,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.205882,0.028043,0.275687,0.032258
948,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.323529,0.042952,0.024964,0.129032
949,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.176471,0.042810,0.023878,0.000000


Let's also normalize the test data. Since we already called `fit` earlier, we can now call the `transform` method directly.

(I have doubts about this step. The fit was done on the training data. What guarantees that the range of the test data is contained in the range of the training data? What happens if a value is greater than the maximum or smaller than the minimum?)
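
On that question, nothing guarantees it. `transform` reuses the minimum and maximum learned from the training data, so a test value outside that range lands below 0 or above 1. The sketch below reproduces this by hand with made-up numbers; the `scale` helper is hypothetical and only mimics the formula above. In scikit-learn, `MinMaxScaler` leaves such values as they are by default, and recent versions accept `clip=True` to force them back into the range.

```python
# Made-up values for one feature; not taken from the dataset.
train = [20.0, 35.0, 50.0]
test = [10.0, 35.0, 65.0]  # 10 is below the training min, 65 is above the training max

# "fit": learn the min and max from the training data only
x_min, x_max = min(train), max(train)


def scale(values):
    # "transform": reuse the training min/max, whatever the input is
    return [(v - x_min) / (x_max - x_min) for v in values]


train_scaled = scale(train)
test_scaled = scale(test)

print(train_scaled)
print(test_scaled)
```

Fitting only on the training data is still the right call: using the test set to compute the minimum and maximum would leak information about the test data into the preprocessing.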

In [15]:
x_test_norm = normalize.transform(x_test)

pd.DataFrame(x_test_norm)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.338235,0.036917,0.028944,0.096774
1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.411765,0.124667,0.084660,0.096774
2,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.191176,0.066877,0.569465,0.000000
3,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.220588,0.101629,0.202967,0.032258
4,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.529412,0.077456,0.123010,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...
312,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.294118,0.067339,0.185239,0.000000
313,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.205882,0.086827,0.180897,0.032258
314,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.455882,0.078698,0.121201,0.032258
315,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.338235,0.111391,0.067656,1.000000


## KNN

The KNN (k-nearest neighbors) algorithm is based on computing distances between the records in the dataset: to classify a record, it looks for the records closest to it (its neighbors).

Because it relies on distance calculations, this algorithm is affected by the scale of the variables, so the data must be transformed before using it. We did that transformation (the normalization) in the previous steps, so we can now feed the normalized data to the model.
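
A small sketch of why scale matters for distances. The three people and the feature ranges (ages 18 to 70, salaries 1000 to 10000) are assumptions made up for illustration; the hypothetical `scale` helper applies min-max scaling with those ranges.

```python
import math

# Made-up people described as (age, salary); not taken from the dataset.
query = (30, 5000)
a = (31, 9000)  # almost the same age, very different salary
b = (65, 5100)  # very different age, almost the same salary

# Raw scale: the salary difference dominates the Euclidean distance
raw_a = math.dist(query, a)
raw_b = math.dist(query, b)
print(raw_a, raw_b)


def scale(person):
    # Min-max scaling, assuming ages span 18..70 and salaries span 1000..10000
    age, salary = person
    return ((age - 18) / (70 - 18), (salary - 1000) / (10000 - 1000))


scaled_a = math.dist(scale(query), scale(a))
scaled_b = math.dist(scale(query), scale(b))
print(scaled_a, scaled_b)
```

On the raw values, `b` is the nearest neighbor because a 35-year age gap counts for almost nothing next to a 4000 salary gap. After scaling, both features contribute on comparable terms and `a` becomes the nearest neighbor.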

In [16]:
from sklearn.neighbors import KNeighborsClassifier

# initializing the classifier with default values
knn = KNeighborsClassifier()

In [17]:
# training and evaluating
knn.fit(x_train_norm, y_train)
knn.score(x_test_norm, y_test)

0.6876971608832808

In [18]:
# comparing with non-normalized data
knn_non_normalized = KNeighborsClassifier()
knn_non_normalized.fit(x_train, y_train)
knn_non_normalized.score(x_test, y_test)

0.6750788643533123

## Choosing and saving the best model

At the end of a machine learning project, we should compare the models' results and choose the one with the best performance.

We can store the model in a serialized pickle file so that it can be used in production, that is, on real-world data, to meet the needs of the problem being solved.

Going back to the previous models:

In [19]:
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier

# dummy
dummy = DummyClassifier()
dummy.fit(x_train, y_train)

# decision tree
tree = DecisionTreeClassifier(random_state=5, max_depth=3)
tree.fit(x_train, y_train)

In [20]:
print(f'Dummy accuracy: {dummy.score(x_test, y_test)}')
print(f'Decision tree accuracy: {tree.score(x_test, y_test)}')
print(f'KNN accuracy: {knn.score(x_test_norm, y_test)}')

Dummy accuracy: 0.6025236593059937
Decision tree accuracy: 0.7160883280757098
KNN accuracy: 0.6876971608832808


We can now save the best classifier (the decision tree) to a .pkl (pickle) file so that it can be used in other projects. Let's also save the one-hot encoder we created, so that new data can be transformed into the format the classifier expects.

pickle is a Python module that serializes and deserializes Python objects: it turns an object into a byte stream that can be saved to a file, and that file can later be opened and converted back into a Python object. It is similar to what we do in JS with `JSON.stringify` and `JSON.parse`.

From the theory section of the Alura course: "the binary format generated by pickle is platform independent, which means a file can be created on one operating system and loaded on another without compatibility problems. It is worth noting that different Python versions can be a problem: objects serialized in one specific version may not load correctly in another. It is therefore very important to know which versions of the language and of the libraries are used in the project, so that they can be replicated on the system where the model will be used."

In [21]:
import pickle

# create pkl file and open it in write (w) binary (b) mode
with open('one_hot_encoder.pkl', 'wb') as myFile:
    # serializing one_hot and saving it to the pkl file
    pickle.dump(one_hot, myFile)

In [22]:
with open('decision_tree_model.pkl', 'wb') as myFile:
    pickle.dump(tree, myFile)
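
Loading is the reverse step: open the file in read-binary mode and call `pickle.load`. The sketch below is a minimal round trip that uses a plain dictionary as a stand-in for the fitted objects and a temporary file, so it runs without the dataset; `stand_in_model` and its file name are hypothetical.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted object; in the notebook this would be `tree` or `one_hot`.
stand_in_model = {'name': 'decision_tree', 'max_depth': 3, 'random_state': 5}

path = os.path.join(tempfile.mkdtemp(), 'stand_in_model.pkl')

# write (w) binary (b) mode: serialize the object to the file
with open(path, 'wb') as file:
    pickle.dump(stand_in_model, file)

# read (r) binary (b) mode: rebuild the object from the file
with open(path, 'rb') as file:
    loaded_model = pickle.load(file)

print(loaded_model == stand_in_model)
```

With the files saved above, the same pattern would open `one_hot_encoder.pkl` and `decision_tree_model.pkl`; the loaded encoder's `transform` would then prepare new rows before they are passed to the loaded tree's `predict`.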