# Machine learning on the Titanic

We will build a machine learning model that predicts whether a Titanic passenger survived.

## Importing the data

In [1]:
import pandas as pd
import numpy as np

In [2]:
titanic = pd.read_csv("./data/titanic_train.csv")

In [3]:
titanic.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Preparing the data

In [4]:
# display the column names
titanic.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

We extract the explanatory columns (x) and the target column (y).

In [5]:
y = titanic["Survived"]
x = titanic[['Pclass','Sex', 'Age','Fare']]

We will encode the text data as numbers and impute the missing values.

In [6]:
# encoders to transform the data
from sklearn.preprocessing import OneHotEncoder,LabelEncoder
# imputer for the missing values
from sklearn.impute import SimpleImputer

In [7]:
# create the encoder object
encode_sex = LabelEncoder()
# apply the transformation (labels are sorted: female -> 0, male -> 1)
x["Sex"]=encode_sex.fit_transform(x["Sex"])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.


In [8]:
# create the imputer object (mean imputation by default)
impute_age = SimpleImputer()
# apply the transformation
x["Age"]=impute_age.fit_transform(np.array(x["Age"]).reshape(-1, 1))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.
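
The two warnings above are pandas' `SettingWithCopyWarning`: `x` was created by selecting columns from `titanic`, so pandas cannot tell whether the assignment targets a copy or the original frame. One common way to avoid the warning is to take an explicit copy before modifying columns. A minimal sketch on a small hand-made frame (the `toy` values are illustrative, not the Titanic file):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder

# Illustrative stand-in for the Titanic frame (made-up values)
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 1],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, np.nan, 54.0],
    "Fare": [7.25, 71.28, 7.92, 51.86],
})

# .copy() makes x_toy an independent frame, so column assignment is unambiguous
x_toy = toy[["Pclass", "Sex", "Age", "Fare"]].copy()

# LabelEncoder sorts the labels: "female" -> 0, "male" -> 1
x_toy["Sex"] = LabelEncoder().fit_transform(x_toy["Sex"])

# SimpleImputer replaces missing values with the column mean by default
x_toy["Age"] = SimpleImputer().fit_transform(x_toy[["Age"]]).ravel()

print(x_toy)
```

In the notebook itself, the equivalent change would be to append `.copy()` to the line that builds `x`.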


In [10]:
x.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 4 columns):
Pclass    891 non-null int64
Sex       891 non-null int32
Age       891 non-null float64
Fare      891 non-null float64
dtypes: float64(2), int32(1), int64(1)
memory usage: 24.4 KB


All columns are now numeric, with no missing values.

## Train / test split

In [11]:
from sklearn.model_selection import train_test_split

In [12]:
x_train, x_test, y_train, y_test = train_test_split(x, y, stratify=y)

In [13]:
y.value_counts(normalize=True)

0    0.616162
1    0.383838
Name: Survived, dtype: float64

In [14]:
y_train.value_counts(normalize=True)

0    0.616766
1    0.383234
Name: Survived, dtype: float64
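
The two outputs above show nearly identical class proportions in `y` and `y_train`, which is what `stratify=y` is meant to guarantee. A small sketch on synthetic labels, assuming a fixed `random_state` (not used in the original call) to make the split reproducible:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 rows with about 38% positives (illustrative only)
rng = np.random.default_rng(0)
y_demo = pd.Series((rng.random(1000) < 0.38).astype(int))
x_demo = pd.DataFrame({"feature": rng.normal(size=1000)})

# By default 25% of the rows go to the test set
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, stratify=y_demo, random_state=0
)

# The share of positives is nearly the same in all three sets
print(y_demo.mean(), y_tr.mean(), y_te.mean())
```

Without `stratify`, the proportions in a small test set can drift by a few points, which makes accuracy figures harder to compare between runs.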

## Machine learning models

We will load and fit the machine learning models.

In [16]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

In [17]:
# create the model objects
modele_logit = LogisticRegression()
modele_rf = RandomForestClassifier(n_estimators=100)

In [18]:
# fit with the .fit(...) method
modele_logit.fit(x_train,y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [19]:
# display the logistic regression coefficients
pd.DataFrame(modele_logit.coef_.T,index=x.columns)

,0
Pclass,-0.925113
Sex,-2.384653
Age,-0.020877
Fare,0.003222
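
One way to read these coefficients: the exponential of a logistic regression coefficient is an odds ratio. Since `LabelEncoder` sorts labels alphabetically, `female` is coded 0 and `male` 1, so the negative `Sex` coefficient means lower odds of survival for men. A sketch using the values printed above (they will differ from run to run, since the split is random):

```python
import math

# Coefficients copied from the output above
coefficients = {
    "Pclass": -0.925113,
    "Sex": -2.384653,
    "Age": -0.020877,
    "Fare": 0.003222,
}

# exp(coefficient) = multiplicative change in the odds of survival
# for a one-unit increase of the feature, other features held fixed
odds_ratios = {name: math.exp(value) for name, value in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: {ratio:.3f}")
```

With these values the odds ratio for `Sex` is roughly 0.09: the model estimates the odds of survival for men at about one eleventh of those for women, other features being equal. The features are not standardized here, so the magnitudes of the coefficients are not directly comparable across columns.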


In [20]:
modele_rf.fit(x_train,y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [21]:
# display the importance of each column in the trees built by the forest
pd.DataFrame(modele_rf.feature_importances_,index=x.columns)

,0
Pclass,0.111328
Sex,0.29399
Age,0.280225
Fare,0.314456
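
The importances of a scikit-learn random forest are normalized to sum to 1, so each value can be read as a share. A sketch that sorts the values printed above (again specific to this run):

```python
import pandas as pd

# Importances copied from the output above
importances = pd.Series(
    {"Pclass": 0.111328, "Sex": 0.29399, "Age": 0.280225, "Fare": 0.314456}
)

# Sort from most to least important and check the total
print(importances.sort_values(ascending=False))
print("total:", round(importances.sum(), 3))
```

In the notebook, the same view could be obtained with `pd.Series(modele_rf.feature_importances_, index=x.columns).sort_values(ascending=False)`.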


## Model validation

We use metrics computed on the test set to assess the quality of the models.

In [23]:
# evaluate with .predict(...)
from sklearn.metrics import accuracy_score
print("Accuracy of the logit model:", accuracy_score(y_test, modele_logit.predict(x_test)))
print("Accuracy of the RF model:", accuracy_score(y_test, modele_rf.predict(x_test)))


Accuracy of the logit model: 0.7713004484304933
Accuracy of the RF model: 0.7892376681614349


In [25]:
from sklearn.metrics import confusion_matrix
print("Confusion matrix for the logit model:", confusion_matrix(y_test, modele_logit.predict(x_test)), sep="\n")
print("Confusion matrix for the RF model:", confusion_matrix(y_test, modele_rf.predict(x_test)), sep="\n")


Confusion matrix for the logit model:
[[118  19]
 [ 32  54]]
Confusion matrix for the RF model:
[[122  15]
 [ 32  54]]
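
The accuracy reported earlier can be recovered from these matrices: in scikit-learn's layout, rows are the true classes and columns the predicted classes, so the diagonal holds the correctly classified passengers. A sketch for the logit matrix, which also derives precision and recall for the survivors (class 1):

```python
# Confusion matrix of the logit model, copied from the output above
# rows = true class, columns = predicted class
matrix = [[118, 19],
          [32, 54]]

tn, fp = matrix[0]
fn, tp = matrix[1]
total = tn + fp + fn + tp

accuracy = (tn + tp) / total   # share of correct predictions
precision = tp / (tp + fp)     # predicted survivors who did survive
recall = tp / (tp + fn)        # actual survivors found by the model

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

The accuracy matches the 0.7713 returned by `accuracy_score`. Both models miss the same number of survivors (32); the random forest does better only by raising fewer false alarms (15 against 19).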


## Hyperparameter tuning

We use grid search to tune the hyperparameters of the random forest.

In [26]:
from sklearn.model_selection import GridSearchCV

We therefore need to define the hyperparameter values to test. To do so,
we use a dictionary of hyperparameters, for example:

In [27]:
dico_param= {"max_depth":[3,5,7,10, None], "n_estimators":[10,20,50,100,1000]}
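
Grid search fits one model per combination in this dictionary and per cross-validation fold, so the cost grows quickly. A quick sketch that counts the combinations with the standard library (the fold count of 5 is the one passed to `GridSearchCV` in this notebook):

```python
from itertools import product

dico_param = {"max_depth": [3, 5, 7, 10, None],
              "n_estimators": [10, 20, 50, 100, 1000]}

# Every (max_depth, n_estimators) pair is one candidate model
combinations = list(product(*dico_param.values()))
n_folds = 5

print(len(combinations), "candidates")
print(len(combinations) * n_folds, "fits during cross-validation")
```

That is 25 candidates and 125 fits, plus one final refit of the best candidate on the full training set.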

We will again use accuracy to score the models, and five-fold
cross-validation to validate the results.
The new object is defined as follows:

In [28]:
recherche_hyper = GridSearchCV(RandomForestClassifier(), 
                               dico_param, 
                               scoring="accuracy",cv=5)

In [29]:
recherche_hyper.fit(x_train,y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'max_depth': [3, 5, 7, 10, None], 'n_estimators': [10, 20, 50, 100, 1000]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=0)

In [30]:
# the best combination is:
recherche_hyper.best_params_

{'max_depth': 7, 'n_estimators': 50}
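
With `refit=True` (the default, visible in the output above), `GridSearchCV` retrains the best combination on the whole training set and exposes it as `best_estimator_`. The cross-validated score is computed on the training data only, so it is worth checking the held-out test set too. A self-contained sketch on synthetic data, with a deliberately small grid; the dataset and the grid values are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification problem (stand-in for the Titanic data)
X_demo, y_demo = make_classification(n_samples=300, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"max_depth": [3, None], "n_estimators": [10, 50]},
    scoring="accuracy",
    cv=5,
)
search.fit(X_tr, y_tr)

# Mean cross-validated accuracy of the best combination
print("best CV accuracy:", search.best_score_)

# best_estimator_ is already refitted on all of X_tr
best_model = search.best_estimator_
print("test accuracy:", accuracy_score(y_te, best_model.predict(X_te)))
```

In the notebook, the corresponding check would use `recherche_hyper.best_estimator_` with `x_test` and `y_test`.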

In [31]:
# the full set of results is available here:
pd.DataFrame(recherche_hyper.cv_results_)



,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_max_depth,param_n_estimators,params,split0_test_score,split1_test_score,split2_test_score,...,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,split3_train_score,split4_train_score,mean_train_score,std_train_score
0,0.019042,0.006438,0.004613,0.001874,3.0,10,"{'max_depth': 3, 'n_estimators': 10}",0.859259,0.813433,0.781955,...,0.827844,0.028834,9,0.842402,0.835206,0.861682,0.816822,0.829907,0.837204,0.014825
1,0.028568,0.005115,0.00386,0.002089,3.0,20,"{'max_depth': 3, 'n_estimators': 20}",0.844444,0.80597,0.721805,...,0.811377,0.047857,22,0.831144,0.829588,0.835514,0.826168,0.814953,0.827474,0.006943
2,0.075827,0.006061,0.008642,0.003002,3.0,50,"{'max_depth': 3, 'n_estimators': 50}",0.874074,0.80597,0.75188,...,0.823353,0.041685,12,0.849906,0.840824,0.852336,0.842991,0.848598,0.846931,0.004328
3,0.14695,0.004673,0.01352,0.001495,3.0,100,"{'max_depth': 3, 'n_estimators': 100}",0.844444,0.791045,0.774436,...,0.814371,0.027712,21,0.842402,0.831461,0.856075,0.826168,0.842991,0.839819,0.010365
4,1.478071,0.011556,0.130022,0.002256,3.0,1000,"{'max_depth': 3, 'n_estimators': 1000}",0.851852,0.80597,0.759398,...,0.817365,0.032803,16,0.855535,0.837079,0.854206,0.835514,0.852336,0.846934,0.008759
5,0.018882,0.001049,0.003638,0.000805,5.0,10,"{'max_depth': 5, 'n_estimators': 10}",0.837037,0.783582,0.759398,...,0.80988,0.03359,24,0.870544,0.885768,0.88785,0.852336,0.846729,0.868646,0.016804
6,0.036247,0.002062,0.005496,0.001856,5.0,20,"{'max_depth': 5, 'n_estimators': 20}",0.866667,0.813433,0.75188,...,0.830838,0.043898,6,0.874296,0.876404,0.882243,0.857944,0.856075,0.869393,0.010457
7,0.080839,0.003593,0.008436,0.001881,5.0,50,"{'max_depth': 5, 'n_estimators': 50}",0.851852,0.813433,0.759398,...,0.829341,0.039108,7,0.874296,0.88015,0.880374,0.865421,0.878505,0.875749,0.005606
8,0.162997,0.009353,0.01472,0.003363,5.0,100,"{'max_depth': 5, 'n_estimators': 100}",0.859259,0.828358,0.759398,...,0.832335,0.038424,5,0.870544,0.870787,0.885981,0.869159,0.86729,0.872752,0.006731
9,1.631042,0.053028,0.141542,0.010206,5.0,1000,"{'max_depth': 5, 'n_estimators': 1000}",0.866667,0.828358,0.759398,...,0.833832,0.039286,2,0.878049,0.872659,0.88785,0.869159,0.874766,0.876497,0.00637
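
The full table is wide; a narrower view sorted by rank is often easier to read. A sketch on a hand-copied subset of the rows above (three rows, with the values as displayed):

```python
import pandas as pd

# Three rows copied from the table above
results = pd.DataFrame({
    "param_max_depth": [3, 5, 5],
    "param_n_estimators": [10, 100, 1000],
    "mean_test_score": [0.827844, 0.832335, 0.833832],
    "std_test_score": [0.028834, 0.038424, 0.039286],
    "rank_test_score": [9, 5, 2],
})

# Keep the columns that matter for model selection, best rank first
print(results.sort_values("rank_test_score").to_string(index=False))
```

On the real object, the same sort would be `pd.DataFrame(recherche_hyper.cv_results_).sort_values("rank_test_score")`. In the rows shown, the standard deviations across folds (about 0.03 to 0.05) are larger than the gaps between mean scores, so the ranking of neighbouring combinations is not very stable.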
