# *Cross-validation* I: Training, Validation, and Test

### Index <a name="topo"></a>
- 1. [Introduction](#1)
- 2. [Loading the dataset](#2)
- 3. [Training, validation, and test sets](#3)
- 4. [Computing the CCP alphas](#4)
- 5. [Finding the best tree](#5)
- 6. [Evaluating the best tree](#6)
- 7. [Lead-in to the next lesson](#7)


### 1. Introduction <a name="1"></a>
[Back to index](#topo)

The previous lesson left us with these questions:

- Did we just "get lucky" that the test set showed this performance?
- Would we get the same performance on a different dataset?
- How can we obtain a more "reliable" measure of this algorithm's performance?

In the previous lesson the test set was used to tune the model, so the accuracy measured on it is likely optimistic: when we apply the model to a broader population, we should not expect to reproduce exactly that accuracy.

As a first attempt at solving this problem, we will set aside a *holdout* test set that is used neither to fit the model nor to choose the hyperparameters. Only at the very end will we evaluate the quality of the model on it.
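
The holdout idea described above can be sketched as two successive splits. This is only an illustration on synthetic data: the array shapes, the 60/20/20 proportions, and the variable names (`X_dev`, `X_holdout`, and so on) are assumptions, not values taken from this lesson.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): 1000 rows, 3 features, 2 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = rng.integers(0, 2, size=1000)

# Step 1: set aside 20% as the holdout test set; it stays untouched until the end.
X_dev, X_holdout, y_dev, y_holdout = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y
)

# Step 2: split the remaining 80% into training (60% of the total)
# and validation (20% of the total), used for hyperparameter choices.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_dev, y_dev, test_size=0.25, random_state=0, stratify=y_dev
)

print(len(X_tr), len(X_val), len(X_holdout))  # 600 200 200
```

The model is fitted on the training part, hyperparameters are compared on the validation part, and the holdout part is scored once, after every decision has been made.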

### 2. Loading the dataset<a name="2"></a>
[Back to index](#topo)

In this lesson we load the dataset already prepared in the previous lesson, with the missing values of the ```sex``` variable filled in.

In [1]:
import pandas            as pd
import numpy             as np
import seaborn           as sns
import matplotlib.pyplot as plt
from sklearn.tree            import DecisionTreeClassifier
from sklearn.metrics         import accuracy_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split, GridSearchCV

In [2]:
# Feature names: each line of features.txt is "<number> <name>"; keep only the name.
with open('features.txt', 'r') as f:
    features = pd.Series([line.strip() for line in f])

feature_names = features.str.extract(r'\d+\s+(.*)')[0].tolist()

# Subject identifiers and target labels: one integer per line.
with open('test/subject_test.txt', 'r') as f:
    subject_test = pd.Series([int(line.strip()) for line in f])

with open('test/y_test.txt', 'r') as f:
    y_test = pd.Series([int(line.strip()) for line in f])

with open('train/y_train.txt', 'r') as f:
    y_train = pd.Series([int(line.strip()) for line in f])

with open('train/subject_train.txt', 'r') as f:
    subject_train = pd.Series([int(line.strip()) for line in f])


# Only these three features are used in the model.
selected_columns = ['tBodyAcc-mean()-X', 'tBodyAcc-mean()-Y', 'tBodyAcc-mean()-Z']


# sep=r'\s+' replaces delim_whitespace=True, which is deprecated in recent pandas versions.
X_train = pd.read_csv('train/X_train.txt', sep=r'\s+', header=None)
X_train.columns = feature_names
X_train.insert(0, 'subject', subject_train)
X_train_selected = X_train[selected_columns]


X_test = pd.read_csv('test/X_test.txt', sep=r'\s+', header=None)
X_test.columns = feature_names
X_test.insert(0, 'subject', subject_test)
X_test_selected = X_test[selected_columns]


# Index every row by (row order, subject) so the subject stays attached to each sample.
X_test_selected.index = pd.MultiIndex.from_arrays(
    [X_test.index, X_test['subject']],
    names=['order', 'subject']
)

X_train_selected.index = pd.MultiIndex.from_arrays(
    [X_train.index, X_train['subject']],
    names=['order', 'subject']
)


# Cost-complexity pruning path of a tree limited to depth 4 and at least 20 samples per leaf.
caminho = DecisionTreeClassifier(
    random_state=2360873, min_samples_leaf=20, max_depth=4
).cost_complexity_pruning_path(X_train_selected, y_train)


In [3]:
ccp_alphas, impurities = caminho.ccp_alphas, caminho.impurities
# Keep only distinct, non-negative alphas (ccp_alpha must be >= 0).
ccp_alphas = np.unique(ccp_alphas[ccp_alphas >= 0])
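
To make the role of these alphas more concrete, here is a minimal sketch on synthetic data (the dataset from `make_classification` and all names in it are assumptions for illustration only). It computes a pruning path, applies the same non-negative filter as above, and refits one tree per alpha to show that larger alphas give smaller trees.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical), not the dataset used in this lesson.
X, y = make_classification(
    n_samples=500, n_features=5, n_informative=3, random_state=0
)

# The path lists the effective alphas at which successive subtrees are pruned away,
# together with the total leaf impurity of the tree at each step.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
alphas = np.unique(path.ccp_alphas[path.ccp_alphas >= 0])

# Refit one tree per alpha and record its size.
leaves = [
    DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X, y).get_n_leaves()
    for a in alphas
]

print("smallest alpha:", alphas[0], "leaves:", leaves[0])
print("largest alpha: ", alphas[-1], "leaves:", leaves[-1])
```

The largest alpha on the path typically prunes the tree down to a single node, which is why the interesting candidates are usually the intermediate values.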

In [4]:
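# Note: unlike the tree used for the pruning path, this one has no max_depth or min_samples_leaf limit.
classifier = DecisionTreeClassifier(random_state=2360873)
# Candidate ccp_alpha values for the grid search.
grid_params = {'ccp_alpha': ccp_alphas}
grid_params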


{'ccp_alpha': array([0.        , 0.00099256, 0.00176395, 0.00218766, 0.00240709,
        0.00287647, 0.00290352, 0.00440039, 0.00473006, 0.0052179 ,
        0.00739929, 0.01053951, 0.03459538])}

In [5]:
# 15-fold cross-validation for every candidate alpha, using the training data only.
grid = GridSearchCV(estimator=classifier, param_grid=grid_params, cv=15, verbose=100)
grid.fit(X_train_selected, y_train)

Fitting 15 folds for each of 13 candidates, totalling 195 fits
[CV 1/15; 1/13] START ccp_alpha=0.0.............................................
[CV 1/15; 1/13] END ..............ccp_alpha=0.0;, score=0.350 total time=   0.0s
[CV 2/15; 1/13] START ccp_alpha=0.0.............................................
[CV 2/15; 1/13] END ..............ccp_alpha=0.0;, score=0.312 total time=   0.0s
[CV 3/15; 1/13] START ccp_alpha=0.0.............................................
[CV 3/15; 1/13] END ..............ccp_alpha=0.0;, score=0.408 total time=   0.0s
[CV 4/15; 1/13] START ccp_alpha=0.0.............................................
[CV 4/15; 1/13] END ..............ccp_alpha=0.0;, score=0.369 total time=   0.0s
[CV 5/15; 1/13] START ccp_alpha=0.0.............................................
[CV 5/15; 1/13] END ..............ccp_alpha=0.0;, score=0.390 total time=   0.0s
[CV 6/15; 1/13] START ccp_alpha=0.0.............................................
[... output truncated ...]

In [6]:
grid

In [7]:
resultados = pd.DataFrame(grid.cv_results_)

In [8]:
resultados.head(10)

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_ccp_alpha | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | ... | split8_test_score | split9_test_score | split10_test_score | split11_test_score | split12_test_score | split13_test_score | split14_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.029744 | 0.004463 | 0.001469 | 0.000537 | 0.0 | {'ccp_alpha': 0.0} | 0.350305 | 0.311609 | 0.408163 | 0.369388 | ... | 0.446939 | 0.42449 | 0.4 | 0.442857 | 0.402041 | 0.381633 | 0.406122 | 0.389162 | 0.035621 | 7 |
| 1 | 0.033974 | 0.000794 | 0.001296 | 0.000333 | 0.000993 | {'ccp_alpha': 0.0009925578776407706} | 0.448065 | 0.362525 | 0.479592 | 0.44898 | ... | 0.479592 | 0.459184 | 0.385714 | 0.495918 | 0.506122 | 0.463265 | 0.491837 | 0.443971 | 0.043867 | 1 |
| 2 | 0.034788 | 0.00335 | 0.001305 | 0.000423 | 0.001764 | {'ccp_alpha': 0.0017639483015495644} | 0.433809 | 0.356415 | 0.469388 | 0.408163 | ... | 0.418367 | 0.42449 | 0.346939 | 0.457143 | 0.487755 | 0.444898 | 0.465306 | 0.421117 | 0.040007 | 2 |
| 3 | 0.034124 | 0.001615 | 0.001563 | 0.00043 | 0.002188 | {'ccp_alpha': 0.00218765586899472} | 0.411405 | 0.348269 | 0.467347 | 0.406122 | ... | 0.414286 | 0.436735 | 0.342857 | 0.455102 | 0.471429 | 0.436735 | 0.436735 | 0.414318 | 0.04029 | 3 |
| 4 | 0.03553 | 0.005116 | 0.00143 | 0.00036 | 0.002407 | {'ccp_alpha': 0.0024070945610311922} | 0.401222 | 0.354379 | 0.467347 | 0.395918 | ... | 0.414286 | 0.432653 | 0.342857 | 0.455102 | 0.467347 | 0.434694 | 0.432653 | 0.412414 | 0.039235 | 4 |
| 5 | 0.038341 | 0.004511 | 0.001469 | 0.000373 | 0.002876 | {'ccp_alpha': 0.0028764717667856283} | 0.413442 | 0.386965 | 0.463265 | 0.387755 | ... | 0.414286 | 0.416327 | 0.35102 | 0.438776 | 0.469388 | 0.430612 | 0.428571 | 0.412 | 0.03431 | 6 |
| 6 | 0.034055 | 0.001249 | 0.001547 | 0.000345 | 0.002904 | {'ccp_alpha': 0.0029035238969861467} | 0.413442 | 0.393075 | 0.463265 | 0.387755 | ... | 0.414286 | 0.416327 | 0.344898 | 0.438776 | 0.469388 | 0.430612 | 0.428571 | 0.412407 | 0.035058 | 5 |
| 7 | 0.035078 | 0.00217 | 0.001335 | 0.000495 | 0.0044 | {'ccp_alpha': 0.004400394494202681} | 0.415479 | 0.350305 | 0.406122 | 0.379592 | ... | 0.383673 | 0.383673 | 0.330612 | 0.395918 | 0.442857 | 0.387755 | 0.389796 | 0.380848 | 0.034624 | 8 |
| 8 | 0.034662 | 0.002555 | 0.001355 | 0.000299 | 0.00473 | {'ccp_alpha': 0.004730061257202697} | 0.409369 | 0.350305 | 0.406122 | 0.365306 | ... | 0.393878 | 0.355102 | 0.314286 | 0.395918 | 0.434694 | 0.387755 | 0.397959 | 0.376631 | 0.035728 | 9 |
| 9 | 0.035419 | 0.002778 | 0.001258 | 0.000366 | 0.005218 | {'ccp_alpha': 0.005217899399032111} | 0.407332 | 0.327902 | 0.4 | 0.361224 | ... | 0.379592 | 0.357143 | 0.314286 | 0.406122 | 0.416327 | 0.402041 | 0.385714 | 0.371873 | 0.035554 | 10 |


In [9]:
grid.best_index_

1
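
`best_index_` is the row of `cv_results_` whose `rank_test_score` is 1. The sketch below, on synthetic data with an arbitrary hypothetical alpha grid, shows how that row relates to the `best_params_` and `best_score_` attributes of a fitted `GridSearchCV`.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data and an arbitrary alpha grid (both hypothetical).
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, random_state=0
)
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={'ccp_alpha': [0.0, 0.005, 0.01, 0.02]},
    cv=5,
)
search.fit(X, y)

results = pd.DataFrame(search.cv_results_)
best_row = results.loc[search.best_index_]

# The winning row carries the same information as the convenience attributes.
print(best_row['params'], best_row['mean_test_score'])
print(search.best_params_, search.best_score_)
```

Reading the winner through `best_params_` avoids depending on the position of a column inside the results table.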

In [10]:
# Refit a tree on the full training set with the best alpha, then score it on the test set.
melhor_ccp = grid.best_params_['ccp_alpha']
clf = DecisionTreeClassifier(random_state=2360873, ccp_alpha=melhor_ccp).fit(X_train_selected, y_train)
clf.score(X_test_selected, y_test)

0.43807261621988464
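
A single accuracy figure does not show which classes are being confused with each other. As a possible next step, the sketch below uses `confusion_matrix` (already imported in this notebook) on synthetic three-class data; the data, the `ccp_alpha` value, and all variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class stand-in data (hypothetical).
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=4, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# ccp_alpha=0.005 is an arbitrary value chosen only for this example.
model = DecisionTreeClassifier(random_state=0, ccp_alpha=0.005).fit(X_tr, y_tr)
y_pred = model.predict(X_te)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, y_pred)
acc = accuracy_score(y_te, y_pred)

print(cm)
# Accuracy is the share of samples on the diagonal of the matrix.
print(acc, cm.trace() / cm.sum())
```

Off-diagonal cells with large counts point to the pairs of classes that the tree separates poorly.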