# **Hands-on Clustering**

---

Clustering with sklearn

Let's start analyzing clustering algorithms using the metrics covered in the previous classes.

To do so, we will use 2 different clustering algorithms. Both implementations come from sklearn. They are:


*   <a href = https://scikit-learn.org/stable/modules/clustering.html#k-means> K-means </a>
*   <a href = https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AffinityPropagation.html#sklearn.cluster.AffinityPropagation> Affinity Propagation </a>

These methods are widely used in day-to-day clustering tasks. More information about these models can be found in the linked sklearn documentation.


We will use the **cmc.csv** dataset, which can be found on OpenML at: https://www.openml.org/d/23. This dataset describes the problem of predicting the contraceptive method chosen by a specific group of women from the 1987 National Indonesia Contraceptive Prevalence Survey. It is traditionally applied to classification problems.

Here, since we are dealing with an unsupervised learning problem, specifically a clustering task, we will leave out the label information and use it later for evaluation only.


### *Importing* libraries


In [None]:
import pandas as pd #library for data manipulation
import numpy as np #library for working with vectors and matrices
import matplotlib.pyplot as plt #library for plotting charts

In [None]:
#giving Colab access to the files on Drive
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


### Loading the data

In [2]:
#reading the csv that contains the dataset and storing it in a df
df = pd.read_csv('/content/gdrive/My Drive/IGTI/Aulas praticas/cmc.csv')

In [None]:
#printing the first 5 rows of the df to confirm
df.head(5)

Unnamed: 0,Wifes_age,Wifes_education,Husbands_education,Number_of_children_ever_born,Wifes_religion,Wifes_now_working%3F,Husbands_occupation,Standard-of-living_index,Media_exposure,Contraceptive_method_used
0,24,2,3,3,1,1,2,3,0,1
1,45,1,3,10,1,1,3,4,0,1
2,43,2,3,7,1,1,3,4,0,1
3,42,3,2,9,1,1,3,3,0,1
4,36,3,3,8,1,1,3,2,0,1


*   Wifes_age: numerical 
*   Wifes_education: categorical 1=low, 2, 3, 4=high
*   Husbands_education: categorical 1=low, 2, 3, 4=high
*   Number_of_children_ever_born: numerical
*   Wifes_religion: binary 0=Non-Islam, 1=Islam
*   Wifes_now_working: binary 0=Yes, 1=No
*   Husbands_occupation: categorical 1, 2, 3, 4
*   Standard-of-living_index: categorical 1=low, 2, 3, 4=high
*   Media_exposure: binary 0=Good, 1=Not good
*   Contraceptive_method_used: (class attribute) 1=No-use 2=Long-term 3=Short-term

In [None]:
# Checking the number of samples (rows) and columns (9 features plus the label) of the dataset.
print('Samples and Features:', df.shape)

Samples and Features: (1473, 10)


In [None]:
# Listing the names of the features
df.columns

Index(['Wifes_age', 'Wifes_education', 'Husbands_education',
       'Number_of_children_ever_born', 'Wifes_religion',
       'Wifes_now_working%3F', 'Husbands_occupation',
       'Standard-of-living_index', 'Media_exposure',
       'Contraceptive_method_used'],
      dtype='object')

### Preprocessing

Let's map the original classes to 0, 1 and 2.

In [None]:
# ENCODING THE CLASS 'Contraceptive_method_used'

#creating a dictionary for the mapping
name_to_class = {
    1: 0,
    2: 1,
    3: 2
}

#replacing the original class values with the mapped ones
df['Contraceptive_method_used'] = df['Contraceptive_method_used'].map(name_to_class)

#check
df.head(5)

Unnamed: 0,Wifes_age,Wifes_education,Husbands_education,Number_of_children_ever_born,Wifes_religion,Wifes_now_working%3F,Husbands_occupation,Standard-of-living_index,Media_exposure,Contraceptive_method_used
0,24,2,3,3,1,1,2,3,0,0
1,45,1,3,10,1,1,3,4,0,0
2,43,2,3,7,1,1,3,4,0,0
3,42,3,2,9,1,1,3,3,0,0
4,36,3,3,8,1,1,3,2,0,0


**The dataset has 4 categorical features and 3 binary features, so we apply get_dummies to each of them.**

We will use a process called binarization, also known as One-Hot Encoding.

pandas provides the function <a href = https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html > get_dummies()</a>, which performs this transformation directly.

Let's apply this function to the categorical columns of the dataset.
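
As a minimal sketch of what the encoding does, the example below uses a small hypothetical frame (`toy`, with made-up `age` and `education` columns, not the real dataset). It illustrates that integer-coded columns are only expanded when they are named in `columns`.

```python
import pandas as pd

# Hypothetical toy frame: integer-coded columns, similar in spirit to the dataset.
toy = pd.DataFrame({"age": [24, 45, 43], "education": [2, 1, 2]})

# Without `columns`, integer columns are treated as numeric and left untouched.
unchanged = pd.get_dummies(toy)

# Naming the column explicitly creates one indicator column per category value.
encoded = pd.get_dummies(toy, columns=["education"], dtype=int)

print(list(unchanged.columns))
print(list(encoded.columns))
```

Each category value becomes its own 0/1 column, so an ordinal code such as `education` no longer implies a numeric distance between categories.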


In [None]:
# one-hot encoding the features
#df2 = pd.get_dummies(df)                                # called this way nothing is encoded, because every column here is integer-typed
df1 = pd.get_dummies(df, columns = ['Wifes_education', 'Husbands_education',
       'Wifes_religion', 'Wifes_now_working%3F', 'Husbands_occupation',
       'Standard-of-living_index', 'Media_exposure'] )   # pass only the categorical and binary columns.

# viewing the result
df1.head(5)
#df2.head(5)

Unnamed: 0,Wifes_age,Number_of_children_ever_born,Contraceptive_method_used,Wifes_education_1,Wifes_education_2,Wifes_education_3,Wifes_education_4,Husbands_education_1,Husbands_education_2,Husbands_education_3,Husbands_education_4,Wifes_religion_0,Wifes_religion_1,Wifes_now_working%3F_0,Wifes_now_working%3F_1,Husbands_occupation_1,Husbands_occupation_2,Husbands_occupation_3,Husbands_occupation_4,Standard-of-living_index_1,Standard-of-living_index_2,Standard-of-living_index_3,Standard-of-living_index_4,Media_exposure_0,Media_exposure_1
0,24,3,0,0,1,0,0,0,0,1,0,0,1,0,1,0,1,0,0,0,0,1,0,1,0
1,45,10,0,1,0,0,0,0,0,1,0,0,1,0,1,0,0,1,0,0,0,0,1,1,0
2,43,7,0,0,1,0,0,0,0,1,0,0,1,0,1,0,0,1,0,0,0,0,1,1,0
3,42,9,0,0,0,1,0,0,1,0,0,0,1,0,1,0,0,1,0,0,0,1,0,1,0
4,36,8,0,0,0,1,0,0,0,1,0,0,1,0,1,0,0,1,0,0,1,0,0,1,0


Another important preprocessing step is checking for missing data.

Let's do this for the original df and for df1 after the one-hot encoding
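
A more direct check than reading the `count` row of `describe()` is to count missing entries per column. The sketch below assumes a small hypothetical frame (`toy`) with one missing value, purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a single missing entry in "age".
toy = pd.DataFrame({"age": [24, np.nan, 43], "children": [3, 10, 7]})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# True if at least one value is missing anywhere in the frame.
has_missing = bool(toy.isna().any().any())

print(missing_per_column)
print(has_missing)
```

On the notebook's data, the same two expressions applied to `df` and `df1` should report zero everywhere if the dataset is complete.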


In [None]:
# Analyzing the summary of the dataset
df.describe()

# every column has count == 1473, so this dataset has no missing data

Unnamed: 0,Wifes_age,Wifes_education,Husbands_education,Number_of_children_ever_born,Wifes_religion,Wifes_now_working%3F,Husbands_occupation,Standard-of-living_index,Media_exposure,Contraceptive_method_used
count,1473.0,1473.0,1473.0,1473.0,1473.0,1473.0,1473.0,1473.0,1473.0,1473.0
mean,32.538357,2.958588,3.429735,3.261371,0.850645,0.749491,2.137814,3.133741,0.073999,0.919891
std,8.227245,1.014994,0.816349,2.358549,0.356559,0.433453,0.864857,0.976161,0.261858,0.876376
min,16.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0
25%,26.0,2.0,3.0,1.0,1.0,0.0,1.0,3.0,0.0,0.0
50%,32.0,3.0,4.0,3.0,1.0,1.0,2.0,3.0,0.0,1.0
75%,39.0,4.0,4.0,4.0,1.0,1.0,3.0,4.0,0.0,2.0
max,49.0,4.0,4.0,16.0,1.0,1.0,4.0,4.0,1.0,2.0


In [None]:
# Analyzing the summary of df1
df1.describe()

Unnamed: 0,Wifes_age,Number_of_children_ever_born,Contraceptive_method_used,Wifes_education_1,Wifes_education_2,Wifes_education_3,Wifes_education_4,Husbands_education_1,Husbands_education_2,Husbands_education_3,Husbands_education_4,Wifes_religion_0,Wifes_religion_1,Wifes_now_working%3F_0,Wifes_now_working%3F_1,Husbands_occupation_1,Husbands_occupation_2,Husbands_occupation_3,Husbands_occupation_4,Standard-of-living_index_1,Standard-of-living_index_2,Standard-of-living_index_3,Standard-of-living_index_4,Media_exposure_0,Media_exposure_1
count,1473.0,1473.0,1473.0,1473.0,1473.0,1473.0,1473.0,1473.0,1473.0,1473.0,1473.0,1473.0,1473.0,1473.0,1473.0,1473.0,1473.0,1473.0,1473.0,1473.0,1473.0,1473.0,1473.0,1473.0,1473.0
mean,32.538357,3.261371,0.919891,0.103191,0.226748,0.278344,0.391718,0.029871,0.120842,0.238968,0.610319,0.149355,0.850645,0.250509,0.749491,0.295995,0.288527,0.397149,0.01833,0.087576,0.155465,0.2926,0.464358,0.926001,0.073999
std,8.227245,2.358549,0.876376,0.304311,0.418871,0.448336,0.4883,0.170289,0.326054,0.426598,0.487843,0.356559,0.356559,0.433453,0.433453,0.456644,0.453231,0.489473,0.134187,0.282774,0.36247,0.455111,0.498897,0.261858,0.261858
min,16.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,26.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
50%,32.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
75%,39.0,4.0,2.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0
max,49.0,16.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


The models implemented in sklearn take one or more arrays as input. We therefore need to transform the original df so that it can be modeled correctly.

To do this, we will separate the labels from the samples, store the feature names (since arrays do not keep them), and then remove the label column from the original df. Next, we will convert the df to an array using numpy!

Again, we will do this for both dfs we created!
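
The steps just described can be sketched on a small hypothetical frame (`toy`, with an invented `target` column). This is only one possible way to do it; here the feature names are collected after the label column is dropped, so the list contains features only.

```python
import pandas as pd

# Hypothetical frame: two features and one label column.
toy = pd.DataFrame({"age": [24, 45, 43], "children": [3, 10, 7], "target": [0, 0, 1]})

# 1) keep the labels aside as an array
y = toy["target"].to_numpy()

# 2) drop the label column and remember the feature order
features = toy.drop(columns="target")
feature_names = list(features.columns)

# 3) convert the remaining features to a 2-D array for sklearn
X = features.to_numpy()

print(X.shape, y.shape, feature_names)
```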

In [None]:
# storing the labels in an array
labels = np.array(df['Contraceptive_method_used'])   # this is the target class, kept aside for evaluation

# saving the column order (at this point the list still includes the label column)
feature_list = list(df.columns)

In [None]:
# removing the label column from the original df
df = df.drop('Contraceptive_method_used', axis = 1)

# check
df.columns

Index(['Wifes_age', 'Wifes_education', 'Husbands_education',
       'Number_of_children_ever_born', 'Wifes_religion',
       'Wifes_now_working%3F', 'Husbands_occupation',
       'Standard-of-living_index', 'Media_exposure'],
      dtype='object')

In [None]:
# converting the df to an array
data = np.array(df)

In [None]:
#repeating the process for df1
labels1 = np.array(df1['Contraceptive_method_used'])
feature_list1 = list(df1.columns)

df1 = df1.drop('Contraceptive_method_used', axis = 1)
df1.columns

Index(['Wifes_age', 'Number_of_children_ever_born', 'Wifes_education_1',
       'Wifes_education_2', 'Wifes_education_3', 'Wifes_education_4',
       'Husbands_education_1', 'Husbands_education_2', 'Husbands_education_3',
       'Husbands_education_4', 'Wifes_religion_0', 'Wifes_religion_1',
       'Wifes_now_working%3F_0', 'Wifes_now_working%3F_1',
       'Husbands_occupation_1', 'Husbands_occupation_2',
       'Husbands_occupation_3', 'Husbands_occupation_4',
       'Standard-of-living_index_1', 'Standard-of-living_index_2',
       'Standard-of-living_index_3', 'Standard-of-living_index_4',
       'Media_exposure_0', 'Media_exposure_1'],
      dtype='object')

In [None]:
data1 = np.array(df1)

Now we are almost ready for the modeling itself!

We just need to set aside part of our data so that we can evaluate the models we are going to train. sklearn has a function for this: <a href = http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html> train_test_split</a>.

In [None]:
# import train_test_split from scikit-learn
from sklearn.model_selection import train_test_split

# applying train_test_split to separate the training and
# test sets according to a chosen split percentage.
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, test_size = 0.25, random_state = 42)

#repeating the process for data1
train_data1, test_data1, train_labels1, test_labels1 = train_test_split(data1, labels1, test_size = 0.25, random_state = 42)
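
Because both calls use the same `random_state` and the same number of rows, they should select the same rows for the test set. The sketch below illustrates this with two hypothetical arrays (`X_a`, `X_b`) whose values encode the row index, so the selected rows can be compared.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical arrays with the same 20 rows but different numbers of columns.
X_a = np.arange(40).reshape(20, 2)   # row i starts with the value 2*i
X_b = np.arange(60).reshape(20, 3)   # row i starts with the value 3*i
y = np.arange(20) % 3

_, test_a, _, y_test_a = train_test_split(X_a, y, test_size=0.25, random_state=42)
_, test_b, _, y_test_b = train_test_split(X_b, y, test_size=0.25, random_state=42)

# Recover the original row indices from the first column of each test set.
rows_a = test_a[:, 0] // 2
rows_b = test_b[:, 0] // 3

print(np.array_equal(rows_a, rows_b), np.array_equal(y_test_a, y_test_b))
```

An alternative that makes the pairing explicit is a single call with all arrays, `train_test_split(X_a, X_b, y, ...)`, which returns a train and test piece for each input in order.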



### Baseline: comparing with a random model!

As discussed in the theory classes, we can create a basis of comparison for the models we want to evaluate.

In classification, the baseline can be built from a random model. Here, one of the three classes (0, 1 or 2) is assigned at random to every test sample. We can think of the baseline model as nothing more than a random guess at the class of each sample.

In [None]:
# creating the baseline
baseline_preds = np.random.choice([0,1,2], size = len(test_labels))

print(baseline_preds)

[0 0 1 1 2 2 2 0 2 1 1 0 0 2 1 0 1 2 0 0 0 1 2 1 1 0 1 2 1 1 0 2 2 2 0 2 2
 0 0 1 2 2 1 2 2 1 2 1 0 0 0 0 0 2 0 1 2 2 2 2 1 1 0 2 2 2 2 2 1 1 1 1 1 1
 1 1 2 2 0 2 1 0 2 2 2 1 0 1 1 1 0 0 2 2 0 1 1 1 1 1 0 2 1 0 0 1 0 1 0 1 1
 0 2 0 2 1 2 2 0 2 1 2 2 1 2 1 2 0 0 2 0 0 0 2 2 2 2 1 2 2 0 1 1 2 1 1 2 0
 1 2 0 1 2 2 0 1 2 1 1 2 2 1 0 0 0 2 1 0 2 1 0 2 0 0 0 0 0 1 0 2 1 0 1 0 2
 0 0 0 0 2 0 0 2 1 0 0 0 0 0 1 2 0 0 1 0 2 1 1 2 1 1 1 1 2 0 2 1 0 2 2 0 2
 1 1 1 2 0 0 2 0 2 0 2 0 2 2 1 2 1 2 2 2 0 2 0 1 1 0 1 1 1 0 0 0 0 0 2 1 0
 2 2 1 2 0 1 1 0 0 1 0 2 0 0 2 0 2 1 2 2 2 2 0 1 2 1 0 0 0 2 0 2 2 1 2 2 2
 2 2 1 0 1 2 0 1 0 0 2 1 0 2 1 1 2 1 0 2 2 2 1 2 1 1 2 0 2 1 2 2 2 0 2 0 2
 0 0 1 1 2 2 1 2 1 0 0 0 0 0 1 0 1 2 0 0 0 1 1 2 1 2 1 2 0 1 1 2 0 1 2 0]
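
The baseline above changes on every run because the global random state is not seeded. A reproducible variant can be sketched with a seeded generator. The seed value and the name `n_test` are assumptions for illustration; 369 is the test-set size implied by a 25% split of 1473 rows.

```python
import numpy as np

n_test = 369                       # assumed size of the test split
rng = np.random.default_rng(42)    # seeded generator: same guesses on every run

# One random class (0, 1 or 2) per test sample.
baseline = rng.choice([0, 1, 2], size=n_test)

print(baseline[:10])
```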


sklearn has several metrics already implemented! :D

Shall we try them out?

In [None]:
# import the metrics module
from sklearn import metrics
from sklearn.metrics import cluster

# Evaluating the baseline!
# silhouette and Davies-Bouldin use only the data and the assigned clusters;
# the contingency matrix and mutual information compare against the true labels of the test set
print('Silhouette Coefficient\n', metrics.silhouette_score(test_data, baseline_preds))
print('\nDavies-Bouldin Score\n', metrics.davies_bouldin_score(test_data, baseline_preds))

print('\nContingency Matrix\n', metrics.cluster.contingency_matrix(test_labels, baseline_preds))
print('\nMutual information\n', metrics.mutual_info_score(test_labels, baseline_preds))

Silhouette Coefficient
 -0.03966727172363017

Davies-Bouldin Score
 28.344652318501037

Contingency Matrix
 [[57 51 51]
 [32 24 31]
 [34 40 49]]

Mutual information
 0.004769656429835761
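
To help read these numbers, the sketch below compares a good grouping with a random one on hypothetical, well-separated synthetic data (`make_blobs`, not the contraceptive dataset). The silhouette coefficient lies in [-1, 1] and higher is better; the Davies-Bouldin score is non-negative and lower is better; mutual information is non-negative and higher means more agreement with the reference labels.

```python
import numpy as np
from sklearn import metrics
from sklearn.datasets import make_blobs

# Hypothetical data: three well-separated groups in 2-D.
X, y_true = make_blobs(n_samples=150, centers=[[0, 0], [10, 10], [-10, 10]],
                       cluster_std=1.0, random_state=0)

# A random assignment, playing the role of the baseline.
rng = np.random.default_rng(0)
y_random = rng.choice([0, 1, 2], size=len(y_true))

sil_good = metrics.silhouette_score(X, y_true)
sil_rand = metrics.silhouette_score(X, y_random)
db_good = metrics.davies_bouldin_score(X, y_true)
db_rand = metrics.davies_bouldin_score(X, y_random)
mi_good = metrics.mutual_info_score(y_true, y_true)
mi_rand = metrics.mutual_info_score(y_true, y_random)

print('silhouette     good vs random:', sil_good, sil_rand)
print('Davies-Bouldin good vs random:', db_good, db_rand)
print('mutual info    good vs random:', mi_good, mi_rand)
```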


Now that we have evaluated our baseline, we have reference values for each metric, so we can build other clustering models and compare their results against them!

Tip: if a model cannot beat the baseline (higher silhouette and mutual information, lower Davies-Bouldin), we may need to rethink our approach.

# K-means

Let's apply the first clustering model: k-means!

In [None]:
# import the KMeans model
from sklearn.cluster import KMeans

clustering = KMeans(n_clusters = 3, random_state = 42)
 
# fitting the model on the training data
clustering.fit(train_data);   # no labels are passed here: clustering is unsupervised

In [None]:
# applying the fitted model to assign a cluster
# to every sample in the test set
predictions1_labels = clustering.predict(test_data)

# Showing the first 10 true labels next to their assigned clusters
# (cluster IDs are arbitrary, so equal numbers do not by themselves mean a correct match)
p = pd.DataFrame({'Real': test_labels, 'Predicted': predictions1_labels})  
p.head(10)

Unnamed: 0,Real,Predicted
0,2,1
1,0,0
2,1,1
3,0,0
4,0,1
5,1,1
6,1,1
7,0,1
8,0,0
9,0,1


Now that we have built the k-means model and applied it to the test set, we can evaluate it.

In [None]:
#evaluating the model

print('Silhouette Coefficient\n', metrics.silhouette_score(test_data, predictions1_labels))
print('\nDavies-Bouldin Score\n', metrics.davies_bouldin_score(test_data, predictions1_labels))

print('\nContingency Matrix\n', metrics.cluster.contingency_matrix(test_labels, predictions1_labels))
print('\nMutual information\n', metrics.mutual_info_score(test_labels, predictions1_labels))

Silhouette Coefficient
 0.44398567322815014

Davies-Bouldin Score
 0.7497426781124604

Contingency Matrix
 [[52 67 40]
 [41 28 18]
 [49 65  9]]

Mutual information
 0.030770654164656755


Let's now test how K-means behaves with df1

In [None]:
#setting the hyperparameters
clustering1 = KMeans(n_clusters = 3, random_state = 42)
 
# fitting the model on the training data
clustering1.fit(train_data1);

#clustering1 = KMeans(n_clusters = 3, random_state = 42).fit(train_data1)  ## works the same way, but in a single statement

# applying the fitted model to assign a cluster
# to every sample in the test set
predictions11_labels = clustering1.predict(test_data1)


# Showing the first 10 true labels next to their assigned clusters
p = pd.DataFrame({'Real': test_labels1, 'Predicted': predictions11_labels})  
p.head(10)

Unnamed: 0,Real,Predicted
0,2,0
1,0,2
2,1,0
3,0,2
4,0,0
5,1,0
6,1,0
7,0,0
8,0,2
9,0,0


In [None]:
# Evaluating the model with df1
print('Silhouette Coefficient\n', metrics.silhouette_score(test_data1, predictions11_labels))
print('\nDavies-Bouldin Score\n', metrics.davies_bouldin_score(test_data1, predictions11_labels))

print('\nContingency Matrix\n', metrics.cluster.contingency_matrix(test_labels1, predictions11_labels))
print('\nMutual information\n', metrics.mutual_info_score(test_labels1, predictions11_labels))


Silhouette Coefficient
 0.4464717655880375

Davies-Bouldin Score
 0.742946874787857

Contingency Matrix
 [[67 40 52]
 [28 18 41]
 [65  9 49]]

Mutual information
 0.030770654164656755
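
The two K-means runs give the same partition up to a renumbering of the clusters (the contingency matrices have the same columns in a different order). One possible explanation, offered here only as a hypothesis, is that K-means uses Euclidean distance, so wide-range columns such as `Wifes_age` outweigh the 0/1 indicator columns. The sketch below uses hypothetical data (an age-like column and a binary flag) and shows how standardization could be added in front of K-means with a pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: a wide-range column next to a 0/1 indicator.
age = rng.integers(16, 50, size=200)
flag = rng.integers(0, 2, size=200)
X = np.column_stack([age, flag]).astype(float)

# K-means on the raw features: distances are dominated by the age column.
raw = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
age_gap = abs(raw.cluster_centers_[0, 0] - raw.cluster_centers_[1, 0])

# Same model, but every column is rescaled to zero mean and unit variance first.
scaled = make_pipeline(StandardScaler(),
                       KMeans(n_clusters=2, n_init=10, random_state=42)).fit(X)
scaled_labels = scaled.named_steps["kmeans"].labels_

print('gap between raw cluster centers along age:', age_gap)
print('cluster sizes after scaling:', np.bincount(scaled_labels))
```

Whether scaling improves the clustering of the real dataset is an empirical question that this sketch does not answer.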


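Compared with the baseline, K-means is clearly better on every metric: the silhouette coefficient goes from about -0.04 to 0.44, the Davies-Bouldin score drops from about 28.3 to 0.75, and the mutual information rises from about 0.005 to 0.031. The mutual information is still small, though, so the clusters match the true classes only weakly.

Let's take a look at another clustering algorithm!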

# Affinity Propagation

Let's now look at how AffinityPropagation behaves

In [None]:
#import the model
from sklearn.cluster import AffinityPropagation

#instantiating and fitting with the default hyperparameters
clustering = AffinityPropagation().fit(train_data)
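
Unlike K-means, AffinityPropagation does not take the number of clusters as input: it chooses exemplars itself, and the `preference` parameter (by default the median of the input similarities) influences how many it picks. The sketch below is an illustration on hypothetical synthetic data; the values `-500`, `damping=0.9` and `max_iter=500` are arbitrary choices, not recommendations.

```python
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

# Hypothetical data: three compact groups in 2-D.
X, _ = make_blobs(n_samples=90, centers=[[0, 0], [8, 8], [-8, 8]],
                  cluster_std=0.7, random_state=0)

counts = {}
for preference in (None, -500):
    model = AffinityPropagation(preference=preference, damping=0.9,
                                max_iter=500, random_state=0).fit(X)
    # The number of clusters is the number of exemplars the algorithm selected.
    counts[str(preference)] = len(model.cluster_centers_indices_)

print(counts)
```

Lower (more negative) preferences generally lead to fewer exemplars, which is one way to bring the number of clusters closer to the number of classes.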


In [None]:
# applying the fitted model to assign a cluster to every test sample
predictions2_labels = clustering.predict(test_data)

# Showing the first 10 true labels next to their assigned clusters
p = pd.DataFrame({'Real': test_labels, 'Predicted': predictions2_labels})  
p.head(10)

Unnamed: 0,Real,Predicted
0,2,10
1,0,18
2,1,3
3,0,10
4,0,26
5,1,3
6,1,3
7,0,16
8,0,24
9,0,3


In [None]:
#evaluating the model
print('Silhouette Coefficient\n', metrics.silhouette_score(test_data, predictions2_labels))
print('\nDavies-Bouldin Score\n', metrics.davies_bouldin_score(test_data, predictions2_labels))

print('\nContingency Matrix\n', metrics.cluster.contingency_matrix(test_labels, predictions2_labels))
print('\nMutual information\n', metrics.mutual_info_score(test_labels, predictions2_labels))


Silhouette Coefficient
 0.12826562350527615

Davies-Bouldin Score
 1.4118372108257398

Contingency Matrix
 [[ 2  9  2 10  5  5  3  2  3  8  4  7  4 10  5  5 18  1  5  0  2  1 20  5
   6  3  4  6  4]
 [ 2  3  5 10  4  0  1  2  0  5  1  1  0  5  0  6  3  2  3  5  0  6  6  6
   2  1  0  2  6]
 [ 3  1 16 10  2  0  0  0  0 10  9  0  2  2  3  1  7  0  7  6  0  2  4  6
  11  0 10  0 11]]

Mutual information
 0.21377471672715767
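
The contingency matrix above has 29 columns, so this model is being compared with 3-cluster results while using many more clusters. Plain mutual information tends to grow with the number of clusters even when the assignment is random, so a chance-corrected variant such as `adjusted_mutual_info_score` may be a fairer basis for comparison. The sketch below uses purely random, hypothetical labels to illustrate the effect.

```python
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score, mutual_info_score

rng = np.random.default_rng(0)

# Hypothetical 3-class labels and two random cluster assignments.
y_true = rng.integers(0, 3, size=369)
few = rng.integers(0, 3, size=369)     # random assignment into 3 clusters
many = rng.integers(0, 29, size=369)   # random assignment into 29 clusters

mi_few = mutual_info_score(y_true, few)
mi_many = mutual_info_score(y_true, many)
ami_few = adjusted_mutual_info_score(y_true, few)
ami_many = adjusted_mutual_info_score(y_true, many)

print('MI  (3 vs 29 random clusters):', mi_few, mi_many)
print('AMI (3 vs 29 random clusters):', ami_few, ami_many)
```

The adjusted score stays close to zero for random assignments regardless of the number of clusters, which makes models with different cluster counts easier to compare.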


Let's now look at how AffinityPropagation behaves on df1, the one-hot encoded dataset

In [None]:
# fitting the model on the training data
clustering1 = AffinityPropagation().fit(train_data1)
# applying the fitted model to assign a cluster
# to every sample in the test set
predictions21_labels = clustering1.predict(test_data1)


# Showing the first 10 true labels next to their assigned clusters
p = pd.DataFrame({'Real': test_labels1, 'Predicted': predictions21_labels})  
p.head(10)

Unnamed: 0,Real,Predicted
0,2,28
1,0,22
2,1,25
3,0,2
4,0,21
5,1,7
6,1,25
7,0,6
8,0,11
9,0,7


In [None]:
#evaluating the model for df1
# note: silhouette and Davies-Bouldin are computed here on test_data (the original features);
# pass test_data1 instead to score the clusters in the one-hot space the model was fitted on
print('Silhouette Coefficient\n', metrics.silhouette_score(test_data, predictions21_labels))
print('\nDavies-Bouldin Score\n', metrics.davies_bouldin_score(test_data, predictions21_labels))


print('\nContingency Matrix\n', metrics.cluster.contingency_matrix(test_labels, predictions21_labels))
print('\nMutual information\n', metrics.mutual_info_score(test_labels, predictions21_labels))


Silhouette Coefficient
 0.1052024314178308

Davies-Bouldin Score
 1.6920982902650616

Contingency Matrix
 [[ 5  3  9  7  5  4  5 10  0 12  1  1  2  7  4  1  3  8  3  2  9 11  4  4
   7 12  4 13  3]
 [ 2  9  3  2  4  4  0  9  2  8  0  2  4  3  0  2  0  3  2  0  5  0  2 11
   0  5  0  5  0]
 [ 2  4 12 13  1  8  3  7  3 10  0  5  6  1  2  0  1  3  1  0  5  9  7  4
   0  7  0  0  9]]

Mutual information
 0.1830622794957576


# In the next class we will start talking about model validation techniques!

# See you then!