# *Cross validation* I: Training, Validation, and Test

### Index <a name="topo"></a>
- 1. [Introduction](#1)
- 2. [Loading the dataset](#2)
- 3. [Training, Validation, and Test Sets](#3)
- 4. [Computing the CCP alphas](#4)
- 5. [Finding the best tree](#5)
- 6. [Evaluating the best tree](#6)
- 7. [Hook for the next lesson](#7)


### 1. Introduction <a name="1"></a>
[Back to index](#topo)

The hook from the previous lesson:

- Did we just "get lucky" that the test set showed this performance?
- Would we get the same performance on a different dataset?
- How can we obtain a more "reliable" measure of this algorithm's performance?

In the previous lesson, the test set was used to tune the model, so the accuracy measured on it is likely optimistic: when the model is applied to a broader dataset, we should not expect to see exactly the same accuracy.

As a first attempt at solving this problem, we will set aside a *holdout* test set that is used neither to develop the model nor to choose its hyperparameters. Only at the end will we evaluate the quality of the model on this set.
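
The idea of a three-way split can be sketched as follows. This is a minimal illustration on synthetic data, not the dataset of this lesson; the 60/20/20 proportions and the variable names `X_dev`, `X_valid`, and so on are assumptions chosen for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the dataset used in this lesson).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = rng.integers(0, 2, size=1000)

# Step 1: set aside 20% as the holdout test set; it is not touched again
# until the final evaluation.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y
)

# Step 2: split the remainder into training (model fitting) and
# validation (hyperparameter choice). 25% of 80% is 20% of the total.
X_train, X_valid, y_train, y_valid = train_test_split(
    X_dev, y_dev, test_size=0.25, random_state=0, stratify=y_dev
)

print(len(X_train), len(X_valid), len(X_test))
```

Hyperparameters would then be compared on `X_valid`, and `X_test` would be scored exactly once, after the final model has been chosen.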

### 2. Loading the dataset<a name="2"></a>
[Back to index](#topo)

In this lesson we load the dataset already prepared in the previous lesson, with the missing values of the ```sex``` variable filled in.

In [5]:
import pandas            as pd
import numpy             as np
import seaborn           as sns
import matplotlib.pyplot as plt
from sklearn.tree            import DecisionTreeClassifier
from sklearn.metrics         import accuracy_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split, GridSearchCV

In [2]:
# Feature names: each line of features.txt has the form "<number> <name>"
with open('features.txt', 'r') as f:
    features = pd.Series([line.strip() for line in f])

feature_names = features.str.extract(r'\d+\s+(.*)')[0].tolist()

# Subject identifiers and activity labels for the test and training sets
with open('test/subject_test.txt', 'r') as f:
    subject_test = pd.Series([int(line.strip()) for line in f])

with open('test/y_test.txt', 'r') as f:
    y_test = pd.Series([int(line.strip()) for line in f])

with open('train/y_train.txt', 'r') as f:
    y_train = pd.Series([int(line.strip()) for line in f])

with open('train/subject_train.txt', 'r') as f:
    subject_train = pd.Series([int(line.strip()) for line in f])

# Only these three features are used by the model
selected_columns = ['tBodyAcc-mean()-X', 'tBodyAcc-mean()-Y', 'tBodyAcc-mean()-Z']

# Columns in the data files are separated by runs of whitespace
X_train = pd.read_csv('train/X_train.txt', sep=r'\s+', header=None)
X_train.columns = feature_names
X_train.insert(0, 'subject', subject_train)
# Copy so that replacing the index below does not touch X_train
X_train_selected = X_train[selected_columns].copy()

X_test = pd.read_csv('test/X_test.txt', sep=r'\s+', header=None)
X_test.columns = feature_names
X_test.insert(0, 'subject', subject_test)
X_test_selected = X_test[selected_columns].copy()

# Index each row by (original order, subject)
X_test_selected.index = pd.MultiIndex.from_arrays(
    [X_test.index, X_test['subject']],
    names=['order', 'subject']
)

X_train_selected.index = pd.MultiIndex.from_arrays(
    [X_train.index, X_train['subject']],
    names=['order', 'subject']
)

# Cost-complexity pruning path of the unpruned tree, computed on the training data
caminho = DecisionTreeClassifier(random_state=2360873).cost_complexity_pruning_path(X_train_selected, y_train)



In [3]:
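# Candidate alphas and the total leaf impurity of the pruned tree at each one
ccp_alphas, impurities = caminho.ccp_alphas, caminho.impurities
# Drop negative values (possible from floating-point error) and remove duplicates
ccp_alphas = np.unique(ccp_alphas[ccp_alphas >= 0])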
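
To make the role of `ccp_alpha` concrete, the sketch below fits one tree per alpha on a small synthetic dataset and counts the leaves that survive pruning. The data, the `random_state` values, and the names `alphas` and `leaf_counts` are assumptions for illustration only; the point is that larger alphas should yield smaller trees.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the dataset used in this lesson).
X, y = make_classification(n_samples=300, n_features=5, n_informative=3, random_state=0)

# Compute the pruning path, drop negative values (possible from
# floating-point error), and deduplicate.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
alphas = np.unique(path.ccp_alphas[path.ccp_alphas >= 0])

# Fit one tree per alpha and record how many leaves survive pruning.
leaf_counts = [
    DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y).get_n_leaves()
    for alpha in alphas
]

# Print every 5th pair to keep the output short.
for alpha, n_leaves in zip(alphas[::5], leaf_counts[::5]):
    print(f"ccp_alpha={alpha:.5f} -> {n_leaves} leaves")
```

Each alpha on the path therefore corresponds to one candidate tree size, which is why the path is a natural grid for the search that follows.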

In [None]:
classifier = DecisionTreeClassifier(random_state=2360873)
grid_params = {'ccp_alpha':ccp_alphas}
grid_params


In [7]:
grid = GridSearchCV(estimator=classifier, param_grid = grid_params, cv=15,verbose=100)
grid.fit(X_train_selected, y_train)

Fitting 15 folds for each of 820 candidates, totalling 12300 fits
[CV 1/15; 1/820] START ccp_alpha=0.0............................................
[CV 1/15; 1/820] END .............ccp_alpha=0.0;, score=0.350 total time=   0.0s
[CV 2/15; 1/820] START ccp_alpha=0.0............................................
[CV 2/15; 1/820] END .............ccp_alpha=0.0;, score=0.312 total time=   0.0s
[CV 3/15; 1/820] START ccp_alpha=0.0............................................
[CV 3/15; 1/820] END .............ccp_alpha=0.0;, score=0.408 total time=   0.0s
[CV 4/15; 1/820] START ccp_alpha=0.0............................................
[CV 4/15; 1/820] END .............ccp_alpha=0.0;, score=0.369 total time=   0.0s
[CV 5/15; 1/820] START ccp_alpha=0.0............................................
[CV 5/15; 1/820] END .............ccp_alpha=0.0;, score=0.390 total time=   0.0s
[CV 6/15; 1/820] START ccp_alpha=0.0............................................
[CV 6/15; 1/820] END .............ccp_alpha
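
The log above reports 820 candidates times 15 folds, that is, 12300 fits. When many neighbouring alphas produce nearly identical trees, one possible way to cut the cost is to thin the grid before searching. The sketch below does this on synthetic data; the step size of 5, the 5 folds, and the name `thinned_alphas` are hypothetical choices, and thinning may skip the exact best alpha.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the dataset used in this lesson).
X, y = make_classification(n_samples=300, n_features=5, n_informative=3, random_state=0)

path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
all_alphas = np.unique(path.ccp_alphas[path.ccp_alphas >= 0])

# Keep every 5th alpha (hypothetical step size) to shrink the search.
step = 5
thinned_alphas = all_alphas[::step]

n_folds = 5
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"ccp_alpha": thinned_alphas},
    cv=n_folds,
)
grid.fit(X, y)

n_candidates = len(grid.cv_results_["params"])
print(f"{len(all_alphas)} alphas thinned to {n_candidates} candidates")
print(f"cross-validation fits: {n_candidates * n_folds}")
```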

In [8]:
grid

In [9]:
resultados = pd.DataFrame(grid.cv_results_)

In [10]:
resultados.head(10)

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_ccp_alpha | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | ... | split8_test_score | split9_test_score | split10_test_score | split11_test_score | split12_test_score | split13_test_score | split14_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.027491 | 0.000903 | 0.001284 | 0.000417 | 0.0 | {'ccp_alpha': 0.0} | 0.350305 | 0.311609 | 0.408163 | 0.369388 | ... | 0.446939 | 0.42449 | 0.4 | 0.442857 | 0.402041 | 0.381633 | 0.406122 | 0.389162 | 0.035621 | 811 |
| 1 | 0.028705 | 0.002216 | 0.001271 | 0.000449 | 9.1e-05 | {'ccp_alpha': 9.067827348567283e-05} | 0.350305 | 0.311609 | 0.408163 | 0.369388 | ... | 0.446939 | 0.42449 | 0.4 | 0.442857 | 0.402041 | 0.381633 | 0.406122 | 0.389162 | 0.035621 | 811 |
| 2 | 0.027882 | 0.000978 | 0.0014 | 0.000611 | 0.0001 | {'ccp_alpha': 9.974610083424014e-05} | 0.352342 | 0.309572 | 0.406122 | 0.369388 | ... | 0.45102 | 0.42449 | 0.4 | 0.442857 | 0.404082 | 0.381633 | 0.404082 | 0.389162 | 0.036353 | 809 |
| 3 | 0.027499 | 0.00073 | 0.001267 | 0.000442 | 0.000102 | {'ccp_alpha': 0.00010201305767138194} | 0.352342 | 0.309572 | 0.406122 | 0.369388 | ... | 0.45102 | 0.42449 | 0.4 | 0.442857 | 0.404082 | 0.381633 | 0.404082 | 0.389162 | 0.036353 | 809 |
| 4 | 0.028058 | 0.001544 | 0.001376 | 0.000516 | 0.000107 | {'ccp_alpha': 0.00010687082232240014} | 0.352342 | 0.309572 | 0.406122 | 0.369388 | ... | 0.45102 | 0.42449 | 0.4 | 0.442857 | 0.404082 | 0.381633 | 0.404082 | 0.389298 | 0.036168 | 807 |
| 5 | 0.028138 | 0.001098 | 0.0011 | 0.000272 | 0.000109 | {'ccp_alpha': 0.00010881392818280738} | 0.352342 | 0.309572 | 0.406122 | 0.369388 | ... | 0.45102 | 0.42449 | 0.4 | 0.442857 | 0.404082 | 0.381633 | 0.404082 | 0.389298 | 0.036168 | 807 |
| 6 | 0.028681 | 0.001179 | 0.001237 | 0.000404 | 0.000111 | {'ccp_alpha': 0.00011108088501994928} | 0.352342 | 0.309572 | 0.408163 | 0.37551 | ... | 0.457143 | 0.426531 | 0.4 | 0.442857 | 0.408163 | 0.383673 | 0.404082 | 0.390794 | 0.037004 | 802 |
| 7 | 0.028087 | 0.000966 | 0.001503 | 0.000484 | 0.000113 | {'ccp_alpha': 0.00011334784185709103} | 0.352342 | 0.309572 | 0.408163 | 0.37551 | ... | 0.457143 | 0.426531 | 0.4 | 0.442857 | 0.408163 | 0.383673 | 0.404082 | 0.390794 | 0.037004 | 802 |
| 8 | 0.027926 | 0.000772 | 0.0012 | 0.0004 | 0.000113 | {'ccp_alpha': 0.00011334784185709105} | 0.352342 | 0.309572 | 0.408163 | 0.37551 | ... | 0.457143 | 0.426531 | 0.4 | 0.442857 | 0.408163 | 0.383673 | 0.404082 | 0.390794 | 0.037004 | 802 |
| 9 | 0.028236 | 0.000973 | 0.0012 | 0.0004 | 0.000115 | {'ccp_alpha': 0.00011485914641518556} | 0.352342 | 0.309572 | 0.408163 | 0.37551 | ... | 0.457143 | 0.426531 | 0.4 | 0.442857 | 0.408163 | 0.383673 | 0.404082 | 0.390794 | 0.037004 | 802 |


In [11]:
grid.best_index_

752

In [15]:
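# Read the best alpha from the fitted search rather than by column position
melhor_ccp = grid.best_params_['ccp_alpha']
clf = DecisionTreeClassifier(random_state=2360873, ccp_alpha=melhor_ccp).fit(X_train_selected, y_train)
# Accuracy on the test set, which took no part in choosing ccp_alpha
clf.score(X_test_selected, y_test)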

0.44146589752290466
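
To judge whether a holdout score like the one above is in line with what cross-validation predicted, it can help to print both numbers side by side together with a confusion matrix. The sketch below does this on synthetic data; the small alpha grid and all variable names are hypothetical, and it relies on `GridSearchCV` refitting the best model on the full training set, which is its default behaviour (`refit=True`).

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the dataset used in this lesson).
X, y = make_classification(n_samples=400, n_features=5, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Hypothetical small alpha grid, chosen only to keep the example fast.
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"ccp_alpha": [0.0, 0.005, 0.01, 0.02, 0.05]},
    cv=5,
)
grid.fit(X_train, y_train)

# With the default refit=True, best_estimator_ is already retrained on the
# full training set using the best ccp_alpha.
y_pred = grid.best_estimator_.predict(X_test)

cv_accuracy = grid.best_score_                  # mean accuracy over the validation folds
test_accuracy = accuracy_score(y_test, y_pred)  # accuracy on the untouched holdout set
cm = confusion_matrix(y_test, y_pred)           # rows: true class, columns: predicted class

print(f"cross-validated accuracy: {cv_accuracy:.3f}")
print(f"holdout test accuracy:    {test_accuracy:.3f}")
print(cm)
```

A holdout accuracy far below the cross-validated one would suggest that the search overfitted the training folds; similar values give more confidence in the estimate.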