420-A52-SF - Supervised Learning Algorithms - Winter 2020 - Technical Specialization in Artificial Intelligence<br/>
MIT License - Copyright (c) 2020 Mikaël Swawola
<br/>
![Lab - Ensembles](static/19-tp-banner.png)
<br/>
**Objective:** this lab session puts several ensemble techniques into practice. The dataset used is **Titanic**.

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

## Exercise 1 - Loading and preparing the data

In [2]:
import pandas as pd

In [3]:
titanic = pd.read_csv('../../data/titanic_train.csv', index_col='PassengerId')

In [4]:
titanic.head()

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [5]:
titanic['Age'].isna().sum()

177

In [6]:
titanic['imp_age'] = titanic['Age'].isna()
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].mean())
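
The cell above keeps a boolean flag, `imp_age`, recording which ages were missing before they are filled with the mean. A minimal sketch of the same idea on a toy column follows; the values are made up for illustration and are not taken from the Titanic file.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with two missing ages
df = pd.DataFrame({'Age': [20.0, np.nan, 40.0, np.nan]})

# Record which rows were missing *before* filling them
df['imp_age'] = df['Age'].isna()

# Replace missing values with the mean of the observed ages (here 30.0)
df['Age'] = df['Age'].fillna(df['Age'].mean())

n_imputed = int(df['imp_age'].sum())
filled_values = df['Age'].tolist()
```

Keeping the flag lets a model learn whether the fact that an age was missing carries information of its own.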

In [7]:
# Embarked
titanic = pd.get_dummies(titanic, columns=['Embarked'], prefix = ['emb'], drop_first=True, dummy_na=True)

In [8]:
# Sex
titanic['Sex'] = (titanic['Sex'] == 'female').astype(int)

In [9]:
titanic.columns

Index(['Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket',
       'Fare', 'Cabin', 'imp_age', 'emb_Q', 'emb_S', 'emb_nan'],
      dtype='object')
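
The column list contains `emb_Q`, `emb_S` and `emb_nan` but no `emb_C`: with `drop_first=True` the first category is dropped and serves as the reference level, while `dummy_na=True` adds a column for missing values. A small sketch on made-up values shows the effect.

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values) covering the three ports and a missing entry
toy = pd.DataFrame({'Embarked': ['S', 'C', 'Q', np.nan]})

encoded = pd.get_dummies(toy, columns=['Embarked'], prefix=['emb'],
                         drop_first=True, dummy_na=True)

encoded_columns = list(encoded.columns)

# The row for port 'C' (the dropped reference level) has no indicator set
reference_row_sum = int(encoded.iloc[1].sum())
```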

In [10]:
X_train = titanic[['Age', 'Sex', 'Pclass', 'SibSp', 'Parch', 'Fare', 'emb_Q', 'emb_S', 'imp_age']]
y_train = titanic['Survived']

#### Checking the proportion of positive (Survived) and negative (Died) classes

In [11]:
y_train.sum()/len(y_train)

0.3838383838383838

#### Importing a few libraries

In [12]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from scipy.stats import loguniform  # SciPy >= 1.4; the sklearn.utils.fixes backport is not available in recent scikit-learn releases
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

## Exercise 2 - Logistic regression

In [13]:
from sklearn.linear_model import LogisticRegression

[class sklearn.linear_model.LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', max_iter=100, multi_class='auto', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

In [14]:
# Grid
parameters = {'C':[1, 2, 3],
              'l1_ratio':[0, 0.5, 1]}

# Logistic regression
clf_logreg = LogisticRegression(penalty='elasticnet',
                                max_iter=10000,
                                solver='saga',
                                random_state=2020,
                                n_jobs=-1)

# Grid search with cross-validation
clf_logreg_grid = GridSearchCV(clf_logreg, parameters, cv=5, scoring="roc_auc", verbose=1, n_jobs=-1)

# Fit on a resampled subset of the training set
ratio = 0.5
Xs, ys = resample(X_train, y_train, n_samples=int(ratio*len(X_train)), stratify=y_train)

clf_logreg_grid.fit(Xs, ys)

Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  45 out of  45 | elapsed:    9.2s finished


GridSearchCV(cv=5, error_score=nan,
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=10000, multi_class='auto',
                                          n_jobs=-1, penalty='elasticnet',
                                          random_state=2020, solver='saga',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='deprecated', n_jobs=-1,
             param_grid={'C': [1, 2, 3], 'l1_ratio': [0, 0.5, 1]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='roc_auc', verbose=1)

In [15]:
clf_logreg_grid.best_score_

0.8484941855530093
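
Because `scoring="roc_auc"` is used, `best_score_` is the mean ROC AUC over the 5 validation folds for the best parameter combination. ROC AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A minimal pure-Python sketch of that definition follows, on made-up labels and scores; the helper name `pairwise_auc` is hypothetical.

```python
def pairwise_auc(y_true, y_score):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count for half
    return wins / (len(pos) * len(neg))

# Three of the four (positive, negative) pairs are ordered correctly
auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```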

In [16]:
history = {}
history['LogReg'] = {'CV': clf_logreg_grid.best_score_}
history['LogReg']['CV']

0.8484941855530093
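
A note on the sampling step: `resample` draws with replacement by default (`replace=True`), so the half-size sample can contain the same passenger several times, and since no `random_state` is passed the sample changes between runs. Duplicated rows can land on both sides of a cross-validation split, which may make the scores optimistic. A hedged sketch of a reproducible draw without replacement, on toy arrays rather than the Titanic data:

```python
import numpy as np
from sklearn.utils import resample

# Toy data: 10 rows with a distinct first column, 6 negatives and 4 positives
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 6 + [1] * 4)

# Default behaviour: bootstrap sample, rows may repeat
Xb, yb = resample(X, y, n_samples=5, stratify=y, random_state=2020)

# Subsample without replacement: every drawn row is distinct
Xs, ys = resample(X, y, n_samples=5, stratify=y, replace=False, random_state=2020)

n_unique_boot = len(np.unique(Xb[:, 0]))
n_unique_sub = len(np.unique(Xs[:, 0]))
```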

## Exercise 3 - K-nearest neighbors

In [17]:
from sklearn.neighbors import KNeighborsClassifier

[class sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None, **kwargs)](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)

In [18]:
# Grid
parameters = {'n_neighbors':[1, 2, 4, 8, 16, 32, 64, 128],
              'p':[1, 2],
              'weights': ['uniform','distance']}

# K-nearest neighbors
clf_knn = KNeighborsClassifier(n_jobs=-1)

# Grid search with cross-validation
clf_knn_grid = GridSearchCV(clf_knn, parameters, cv=5, scoring="roc_auc", verbose=1, n_jobs=-1)

# Fit on a resampled subset of the training set
ratio = 0.5
Xs, ys = resample(X_train, y_train, n_samples=int(ratio*len(X_train)), stratify=y_train)
clf_knn_grid.fit(Xs, ys)

Fitting 5 folds for each of 32 candidates, totalling 160 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  56 tasks      | elapsed:    0.4s
[Parallel(n_jobs=-1)]: Done 160 out of 160 | elapsed:    0.9s finished


GridSearchCV(cv=5, error_score=nan,
             estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                            metric='minkowski',
                                            metric_params=None, n_jobs=-1,
                                            n_neighbors=5, p=2,
                                            weights='uniform'),
             iid='deprecated', n_jobs=-1,
             param_grid={'n_neighbors': [1, 2, 4, 8, 16, 32, 64, 128],
                         'p': [1, 2], 'weights': ['uniform', 'distance']},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='roc_auc', verbose=1)

In [19]:
clf_knn_grid.best_score_

0.8760812607871431

In [20]:
history['KNN'] = {'CV': clf_knn_grid.best_score_}
history['KNN']['CV']

0.8760812607871431
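
K-nearest neighbors relies on distances, so an unscaled feature with a large range, such as `Fare`, can dominate the neighbor search. One possible variant, not part of the original lab, standardizes the features inside a pipeline so that the scaler is refit on each training fold. The synthetic data below stands in for the Titanic features, and the reduced grid is only illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the Titanic features)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=2020)

# Scaling happens inside the pipeline, so each CV fold fits its own scaler
knn_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# make_pipeline names each step after its lowercased class name
param_grid = {'kneighborsclassifier__n_neighbors': [1, 4, 16],
              'kneighborsclassifier__weights': ['uniform', 'distance']}

knn_search = GridSearchCV(knn_pipe, param_grid, cv=5, scoring='roc_auc')
knn_search.fit(X_demo, y_demo)

best_auc = knn_search.best_score_
best_params = knn_search.best_params_
```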

## Exercise 4 - Decision trees

In [21]:
from sklearn.tree import DecisionTreeClassifier

[class sklearn.tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, class_weight=None, presort='deprecated', ccp_alpha=0.0)](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)

In [22]:
# Parameter distributions
distributions = dict(
    criterion=['gini', 'entropy'],
    ccp_alpha=loguniform(1e-3, 1e3),
    max_depth=randint(2, 128))

# Estimator
clf_tree = DecisionTreeClassifier(random_state=2020)

# Randomized search with cross-validation
clf_tree_rnd = RandomizedSearchCV(clf_tree, distributions, n_iter=10000, cv=5, scoring="roc_auc", verbose=1, n_jobs=-1, random_state=2020)

# Fit on a resampled subset of the training set
ratio = 0.5
Xs, ys = resample(X_train, y_train, n_samples=int(ratio*len(X_train)), stratify=y_train)
clf_tree_rnd.fit(Xs, ys)

Fitting 5 folds for each of 10000 candidates, totalling 50000 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  58 tasks      | elapsed:    0.2s
[Parallel(n_jobs=-1)]: Done 2160 tasks      | elapsed:    4.0s
[Parallel(n_jobs=-1)]: Done 6160 tasks      | elapsed:   10.7s
[Parallel(n_jobs=-1)]: Done 11760 tasks      | elapsed:   20.5s
[Parallel(n_jobs=-1)]: Done 18960 tasks      | elapsed:   32.8s
[Parallel(n_jobs=-1)]: Done 27760 tasks      | elapsed:   47.8s
[Parallel(n_jobs=-1)]: Done 38160 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 49860 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done 50000 out of 50000 | elapsed:  1.4min finished


RandomizedSearchCV(cv=5, error_score=nan,
                   estimator=DecisionTreeClassifier(ccp_alpha=0.0,
                                                    class_weight=None,
                                                    criterion='gini',
                                                    max_depth=None,
                                                    max_features=None,
                                                    max_leaf_nodes=None,
                                                    min_impurity_decrease=0.0,
                                                    min_impurity_split=None,
                                                    min_samples_leaf=1,
                                                    min_samples_split=2,
                                                    min_weight_fraction_leaf=0.0,
                                                    presort='deprecated',
                                                    random_state=2020,
          

In [23]:
clf_tree_rnd.best_score_

0.8906269982740571

In [24]:
history['Tree'] = {'CV': clf_tree_rnd.best_score_}
history['Tree']['CV']

0.8906269982740571
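
The search range `loguniform(1e-3, 1e3)` for `ccp_alpha` is wide. For two classes, Gini impurity never exceeds 0.5 and entropy never exceeds 1, so any `ccp_alpha` above 1 prunes the tree down to its root, and such a tree can only reach a ROC AUC of 0.5. Roughly half of the log-uniform draws fall in that region. A small sketch on synthetic data (not the Titanic features) illustrates both points.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the Titanic features)
X_demo, y_demo = make_classification(n_samples=200, random_state=2020)

# Unpruned tree versus a tree pruned with a large complexity penalty
full_tree = DecisionTreeClassifier(random_state=2020).fit(X_demo, y_demo)
pruned_tree = DecisionTreeClassifier(ccp_alpha=1.0, random_state=2020).fit(X_demo, y_demo)

n_leaves_full = full_tree.get_n_leaves()
n_leaves_pruned = pruned_tree.get_n_leaves()   # a single leaf: the root

# Share of sampled ccp_alpha values that would collapse the tree
draws = loguniform(1e-3, 1e3).rvs(size=1000, random_state=2020)
share_above_one = float((draws >= 1.0).mean())
```

Narrowing the upper bound, for example to 1, would spend more of the 10000 iterations on candidates that can actually differ.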

## Exercise 5 - SVM

In [25]:
from sklearn.svm import SVC

[class sklearn.svm.SVC(C=1.0, kernel='rbf', degree=3, gamma='scale', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', break_ties=False, random_state=None)](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)

In [26]:
# Grid
parameters = {'kernel':['linear', 'poly', 'rbf'],
              'degree':[1, 2, 3]}

# Support vector classifier
clf_svc = SVC(probability=True,
              random_state=2020)

# Grid search with cross-validation
clf_svc_grid = GridSearchCV(clf_svc, parameters, cv=5, scoring="roc_auc", verbose=1, n_jobs=-1)

# Fit on a resampled subset of the training set
ratio = 0.5
Xs, ys = resample(X_train, y_train, n_samples=int(ratio*len(X_train)), stratify=y_train)
clf_svc_grid.fit(Xs, ys)

Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  45 | elapsed:  1.4min remaining:   41.0s
[Parallel(n_jobs=-1)]: Done  45 out of  45 | elapsed:  2.8min finished


GridSearchCV(cv=5, error_score=nan,
             estimator=SVC(C=1.0, break_ties=False, cache_size=200,
                           class_weight=None, coef0=0.0,
                           decision_function_shape='ovr', degree=3,
                           gamma='scale', kernel='rbf', max_iter=-1,
                           probability=True, random_state=2020, shrinking=True,
                           tol=0.001, verbose=False),
             iid='deprecated', n_jobs=-1,
             param_grid={'degree': [1, 2, 3],
                         'kernel': ['linear', 'poly', 'rbf']},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='roc_auc', verbose=1)

In [27]:
clf_svc_grid.best_score_

0.8422986164162634

In [28]:
history['SVM'] = {'CV': clf_svc_grid.best_score_}
history['SVM']['CV']

0.8422986164162634
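
In the SVM grid, `degree` is only used by the `poly` kernel, so the `linear` and `rbf` candidates are each fitted three times with the same effective settings. One way to avoid the repeated fits is to pass a list of grids, which `GridSearchCV` also accepts. The sketch below uses `ParameterGrid` only to count the candidates.

```python
from sklearn.model_selection import ParameterGrid

# Grid as written in the lab: every kernel is crossed with every degree
original_grid = {'kernel': ['linear', 'poly', 'rbf'],
                 'degree': [1, 2, 3]}

# Alternative: vary degree only for the polynomial kernel
compact_grid = [{'kernel': ['linear', 'rbf']},
                {'kernel': ['poly'], 'degree': [1, 2, 3]}]

n_original = len(ParameterGrid(original_grid))   # 3 kernels x 3 degrees
n_compact = len(ParameterGrid(compact_grid))     # 2 + 3 candidates
```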

In [29]:
history

{'LogReg': {'CV': 0.8484941855530093},
 'KNN': {'CV': 0.8760812607871431},
 'Tree': {'CV': 0.8906269982740571},
 'SVM': {'CV': 0.8422986164162634}}

## Exercise 6 - VotingClassifier

In [30]:
from sklearn.ensemble import VotingClassifier

[class sklearn.ensemble.VotingClassifier(estimators, voting='hard', weights=None, n_jobs=None, flatten_transform=True)](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html#sklearn.ensemble.VotingClassifier)

In [31]:
estimators=[
    ('lr', clf_logreg_grid.best_estimator_),
    ('knn', clf_knn_grid.best_estimator_),
    ('tree', clf_tree_rnd.best_estimator_),
    ('svc', clf_svc_grid.best_estimator_)]

clf_vote = VotingClassifier(estimators=estimators, voting='soft', n_jobs=-1)

clf_vote.fit(X_train, y_train)

VotingClassifier(estimators=[('lr',
                              LogisticRegression(C=2, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=0, max_iter=10000,
                                                 multi_class='auto', n_jobs=-1,
                                                 penalty='elasticnet',
                                                 random_state=2020,
                                                 solver='saga', tol=0.0001,
                                                 verbose=0, warm_start=False)),
                             ('knn',
                              KNeighborsClassifier(algorithm='auto',
                                                   leaf_size=30,
                                                   metric='minkowski'...
                                            

In [32]:
cv_score = cross_val_score(clf_vote, X_train, y_train, cv=5, scoring="roc_auc", verbose=1, n_jobs=-1)
cv_score.mean()

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:   13.2s remaining:   19.8s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  1.3min finished


0.8581965221088617
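
With `voting='soft'`, the ensemble averages the class probabilities predicted by each base estimator and picks the class with the highest mean, whereas `voting='hard'` counts predicted labels. The arithmetic for a single passenger can be sketched as follows; the four probabilities are invented for illustration and chosen so that the two rules disagree.

```python
import numpy as np

# Hypothetical P(Survived = 1) from the four base models for one passenger
probas = np.array([0.45, 0.45, 0.95, 0.45])

# Soft vote: average the probabilities, then threshold
soft_score = probas.mean()
soft_label = int(soft_score >= 0.5)

# Hard vote: each model casts a label, the majority wins
hard_votes = (probas >= 0.5).astype(int)
hard_label = int(hard_votes.sum() > len(hard_votes) / 2)
```

Here one very confident model outweighs three mildly negative ones under soft voting, while hard voting follows the majority.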

## Exercise 7 - Stacking

In [34]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

#### Splitting the data

In [35]:
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, train_size=0.7, random_state=2020)

In [36]:
X_stack = np.c_[
    clf_logreg_grid.best_estimator_.predict_proba(X_val)[:,1],
    clf_knn_grid.best_estimator_.predict_proba(X_val)[:,1],
    clf_tree_rnd.best_estimator_.predict_proba(X_val)[:,1],
    clf_svc_grid.best_estimator_.predict_proba(X_val)[:,1]
]

In [37]:
X_stack.shape

(268, 4)

In [39]:
# Grid
parameters = {'C':[1, 2, 3],
              'l1_ratio':[0, 0.5, 1]}

# Logistic regression (meta-model)
clf_meta = LogisticRegression(penalty='elasticnet',
                              max_iter=10000,
                              solver='saga',
                              random_state=2020,
                              n_jobs=-1)

# Grid search with cross-validation
clf_meta_grid = GridSearchCV(clf_meta, parameters, cv=5, scoring="roc_auc", verbose=1, n_jobs=-1)

clf_meta_grid.fit(X_stack, y_val)

Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  45 out of  45 | elapsed:    3.0s finished


GridSearchCV(cv=5, error_score=nan,
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=10000, multi_class='auto',
                                          n_jobs=-1, penalty='elasticnet',
                                          random_state=2020, solver='saga',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='deprecated', n_jobs=-1,
             param_grid={'C': [1, 2, 3], 'l1_ratio': [0, 0.5, 1]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='roc_auc', verbose=1)

In [40]:
clf_meta_grid.best_score_

0.9172767420751292
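
This score may be optimistic: the base estimators were fitted on resamples drawn from the full training set before the split, so they have likely already seen a share of the validation rows whose predictions feed the meta-model. A possible leak-free alternative, not used in the original lab, is scikit-learn's `StackingClassifier`, which builds the meta-features from out-of-fold predictions. The sketch below uses synthetic data and two arbitrary base models as stand-ins.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the Titanic features)
X_demo, y_demo = make_classification(n_samples=200, random_state=2020)

# Arbitrary base models chosen for illustration
base_models = [('knn', KNeighborsClassifier()),
               ('tree', DecisionTreeClassifier(max_depth=3, random_state=2020))]

# Inner cv=5: the meta-model is trained on out-of-fold probabilities
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression(),
                           stack_method='predict_proba',
                           cv=5)

# Outer cross-validation gives an estimate for the whole stacked model
stack_auc = cross_val_score(stack, X_demo, y_demo, cv=5, scoring='roc_auc').mean()
```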