# Titanic Data Analysis

## 1. Data visualization with Power BI
![dashboard](https://dochub.com/heiko11/bDa8NX3RdX5jmAER2zA6Ey/bi-png)

## 2. Data analysis with Python machine learning libraries

### Approach:
*   Define a measurable objective:
> Objective: predict whether a passenger survived.\
> Target metrics: an F1 score of 50% and a recall of 70%.\
> Precision: maximizing it minimizes the number of false positives.\
> Recall (sensitivity): maximizing it minimizes the number of false negatives.\
> F1 score: the harmonic mean of precision and recall.

*   EDA (Exploratory Data Analysis)
*   Data preparation
*   Modeling
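
As a rough illustration of the metrics defined above, the sketch below computes precision, recall and F1 on a small hand-made set of labels. The `y_true` and `y_pred` lists are invented for the example and are not taken from the Titanic data.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Invented labels: 1 = survived, 0 = did not survive
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Here there are 3 true positives, 1 false positive and 1 false negative
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

print(precision, recall, f1)
```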

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sb
import matplotlib.pyplot as plt
import scipy.stats as st
import statistics
from sklearn.preprocessing import scale
import warnings

%matplotlib inline


pd.options.display.max_columns = None
warnings.filterwarnings('ignore')
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# Load the data
df = pd.read_csv('/kaggle/input/titanicdataset-traincsv/train.csv')
df = df.drop(columns = ['PassengerId'])

In [None]:
df.columns

### Analyzing the shape of the data

In [None]:
data = df.copy()

In [None]:
data.isna().sum()

In this dataset, 177 values are missing in the 'Age' column, 687 in 'Cabin' and 2 in 'Embarked'.

In [None]:
data.dtypes.value_counts()

In [None]:
sb.heatmap(data.isna(), cbar=False)

In [None]:
(data.isna().sum()/data.shape[0]).sort_values()

For the **Cabin** variable, more than 77% of the values are missing, as are about 20% of the values in the **Age** column.

In [None]:
# Replace the missing 'Age' values with the median
data['Age'] = data['Age'].fillna(data['Age'].median())

The missing values of the 'Age' variable are replaced with the median.
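
A minimal sketch of median imputation on an invented series of ages (not the Titanic column). It uses `fillna`, which returns a new Series rather than modifying a slice in place.

```python
import numpy as np
import pandas as pd

# Invented ages with two missing entries
ages = pd.Series([22.0, np.nan, 38.0, 26.0, np.nan])

# median() skips NaN by default, so it is computed on 22, 38 and 26
filled = ages.fillna(ages.median())

print(filled.tolist())
```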

In [None]:
data = data.drop(columns=['Cabin','Name', 'Ticket']) # drop the columns we will not use

In [None]:
data.isna().sum()

In [None]:
data.dtypes

### Content analysis

### Histograms of the quantitative variables

In [None]:
for col in data.select_dtypes('float64'):
    plt.figure()
    sb.histplot(data[col], kde=True)  # distplot is deprecated in recent seaborn releases

In [None]:
for col in data.select_dtypes('object'):
    print(col, data[col].unique())

In [None]:
for col in data.select_dtypes('object'):
    plt.figure()
    data[col].value_counts().plot.pie()

### Target/variable relationships

#### Creating the subsets: survivors and non-survivors

In [None]:
survecu = data[data['Survived'] == 1]
nonsurvecu = data[data['Survived'] ==  0]

In [None]:
sb.countplot(x ='Age', hue = 'Survived', data = data)

In [None]:
sb.countplot(x ='Fare', hue = 'Survived', data = data)

In [None]:
pd.crosstab(data['Survived'], data['Embarked'])/data.shape[0]

Passengers who embarked at Southampton and did not survive make up 48% of all passengers, while those who embarked there and survived make up only 24%.
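
Dividing a crosstab by the total number of rows gives joint frequencies, that is, shares of all passengers. To read the table as "among passengers who embarked at a given port", each column has to be normalized separately. The sketch below contrasts the two readings on an invented toy frame, using the `normalize` argument of `pd.crosstab`.

```python
import pandas as pd

# Invented toy data, not the Titanic file
toy = pd.DataFrame({
    'Survived': [0, 0, 1, 0, 1, 1, 0, 1],
    'Embarked': ['S', 'S', 'S', 'C', 'C', 'Q', 'S', 'S'],
})

# Joint frequencies: every cell is a share of all passengers
joint = pd.crosstab(toy['Survived'], toy['Embarked'], normalize='all')

# Conditional frequencies: every column sums to 1
by_port = pd.crosstab(toy['Survived'], toy['Embarked'], normalize='columns')

print(joint)
print(by_port)
```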

In [None]:
pd.crosstab(data['Survived'], data['Sex'])/data.shape[0]

According to this frequency table, men who did not survive account for about half of all passengers. Mortality among women was also low.

In [None]:
pd.crosstab(data['Survived'], data['Parch'])/data.shape[0]

In [None]:
pd.crosstab(data['Survived'], data['Pclass'])/data.shape[0]

According to this frequency table, most third-class passengers did not survive.

In [None]:
pd.crosstab(data['Survived'], data['SibSp'])/data.shape[0]

In [None]:
for col in ['Sex','Embarked','Pclass','Parch','SibSp']:
    plt.figure()
    sb.heatmap(pd.crosstab(data['Survived'], data[col]), annot=True, fmt = 'd')

## Data preparation
- Goal: put the data into a format suitable for ML
  -  Train/Test
  - Encoding
  - Cleaning NaN values

### Train/test split, encoding, cleaning

In [None]:
df = pd.read_csv('/kaggle/input/titanicdataset-traincsv/train.csv')
df = df.drop(columns = ['PassengerId'])

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Replace the missing 'Age' values with the median
df['Age'] = df['Age'].fillna(df['Age'].median())

In [None]:
# Replace the missing 'Embarked' values with the most frequent port;
# mode() returns a Series, so take its first element
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

We set aside part of the data to test our model (***test_set***) and use the rest to train it (***training_set***).

*   The ***training set*** is used by the learning algorithm to train the model.
*   The ***testing set*** measures the error of the final model on data it has never seen.

The data are split in the following proportions: *80% for the training set and 20% for the testing set*.
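
A minimal sketch of an 80/20 split on invented data. The `stratify` argument is an optional addition, not something this notebook uses: it keeps the class proportions the same in both subsets, which can help when the classes are imbalanced.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Invented data: 100 rows, 40 of them in the positive class
X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 40 + [0] * 60)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

print(X_tr.shape, X_te.shape)    # 80 rows for training, 20 for testing
print(y_tr.mean(), y_te.mean())  # positive rate in each subset
```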


In [None]:
trainset, testset = train_test_split(df, test_size = 0.2, random_state = 0)

In [None]:
trainset['Survived'].value_counts()

In [None]:
testset.shape

In [None]:
df = df.drop(columns=['Name','Cabin','Ticket'])

### Encoding

In [None]:
for col in df.select_dtypes('object').columns:
    print(col)

We define an encodage function, which replaces the values of the Embarked and Sex variables with numeric codes, and a preprocessing function that prepares the data.

In [None]:
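def encodage(df):
    # Work on a copy so the caller's DataFrame is left untouched
    df = df.copy()
    # Numeric codes for the Embarked and Sex categories
    code  = {'S':1, 
             'C':2, 
             'Q':3, 
             'female':1,
             'male':2}
    # Values absent from the mapping (e.g. in Name or Ticket) become NaN
    for col in df.select_dtypes('object').columns:
        df[col] = df[col].map(code)  
        
    return df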
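
Because a single dictionary is applied to every text column, a value from an unrelated column can pick up a code by accident (a name equal to 'C', for instance). One possible variant, sketched here on invented data with a hypothetical `codes_by_column` dictionary, keeps one mapping per column and leaves the other columns untouched.

```python
import pandas as pd

# Invented toy data, not the Titanic file
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'Q'],
    'Name': ['A', 'B', 'C'],
})

# Hypothetical per-column mappings, using the same codes as encodage
codes_by_column = {
    'Sex': {'female': 1, 'male': 2},
    'Embarked': {'S': 1, 'C': 2, 'Q': 3},
}

encoded = toy.copy()
for col, mapping in codes_by_column.items():
    encoded[col] = encoded[col].map(mapping)

print(encoded)
```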

In [None]:
def preprocessing(df):
    
    df = encodage(df)
    
    # Separate the features from the target and drop the unused columns
    X = df.drop(columns=['Survived','Cabin','Name','Ticket'])
    y = df['Survived']
    print(y.value_counts())
    return X,y

In [None]:
X_train, y_train = preprocessing(trainset)

In [None]:
X_test, y_test = preprocessing(testset)

In [None]:
X_train.isna().sum()

In [None]:
# Replace any remaining missing 'Age' values with the median
X_train['Age'] = X_train['Age'].fillna(X_train['Age'].median())

In [None]:
# Replace any remaining missing 'Embarked' values with 1, the code for 'S'
X_train['Embarked'] = X_train['Embarked'].fillna(1)

### Modeling
We set up several candidate machine learning models in order to choose one for our problem.

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

In [None]:
preprocessor = make_pipeline(PolynomialFeatures(2, include_bias=False) ,SelectKBest(f_classif, k = 4))
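
To make the effect of this preprocessing step concrete, the sketch below applies the same two transformers to synthetic data standing in for the Titanic features. The names `X_demo`, `y_demo` and `demo_preprocessor` are invented for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data standing in for the Titanic features
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Degree-2 expansion: 5 original columns + 5 squares + 10 pairwise products
poly = PolynomialFeatures(2, include_bias=False)
X_poly = poly.fit_transform(X_demo)
print(X_poly.shape)

# The selector then keeps the 4 columns with the highest ANOVA F-score
demo_preprocessor = make_pipeline(PolynomialFeatures(2, include_bias=False),
                                  SelectKBest(f_classif, k=4))
X_selected = demo_preprocessor.fit_transform(X_demo, y_demo)
print(X_selected.shape)
```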

In [None]:
RandomForest = make_pipeline(preprocessor, RandomForestClassifier(random_state=0))
AdaBoost = make_pipeline(preprocessor, AdaBoostClassifier(random_state=0))
SVM = make_pipeline(preprocessor, StandardScaler(), SVC(random_state=0))
KNN = make_pipeline(preprocessor, StandardScaler(), KNeighborsClassifier())
LogReg = make_pipeline(preprocessor, StandardScaler(), LogisticRegression())

In [None]:
list_model = {'RandomForest': RandomForest, 'AdaBoost': AdaBoost, 
              'SVM': SVM, 'KNN': KNN, 'LogisticReg' : LogReg}
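
One possible way to compare candidates before touching the test set is cross-validation on the training data. The sketch below does this on synthetic data with two simplified stand-in pipelines; `candidates`, `X_demo` and `y_demo` are hypothetical names, and the scores it prints say nothing about the Titanic models.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the training set
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

candidates = {
    'KNN': make_pipeline(StandardScaler(), KNeighborsClassifier()),
    'LogisticReg': make_pipeline(StandardScaler(), LogisticRegression()),
}

# Mean F1 score over 4 folds for each candidate
mean_f1 = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=4, scoring='f1')
    mean_f1[name] = scores.mean()
    print(name, round(mean_f1[name], 3))
```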

In [None]:
from sklearn.metrics import f1_score, classification_report, confusion_matrix
from sklearn.model_selection import learning_curve

In [None]:
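def evaluation(model):
    # Fit on the training set, then report performance on the test set
    model.fit(X_train,y_train)
    ypred = model.predict(X_test)
    print(confusion_matrix(y_test,ypred))
    print(classification_report(y_test,ypred))
    
    # Learning curves: mean F1 score over 4 folds for growing training sizes
    N, train_score, val_score = learning_curve(model, X_train, y_train,
                                               cv = 4, scoring = 'f1',
                                               train_sizes = np.linspace(0.1,1,10))
    plt.figure(figsize=(12,8))
    plt.plot(N, train_score.mean(axis = 1), label = 'train score')
    plt.plot(N, val_score.mean(axis = 1), label = 'validation score')
    plt.legend()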

### Evaluation procedure

In [None]:
for name, model in list_model.items():
    print(name)
    evaluation(model)

### Optimization

In [None]:
SVM

In [None]:
hyper_param = {'svc__gamma': [1e-3, 1e-4], 
               'svc__C': [1,10,100,1000],
              'pipeline__polynomialfeatures__degree': [2,3,4]}
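
The keys of this dictionary follow the step names that `make_pipeline` generates automatically. The sketch below rebuilds a pipeline with the same structure under the invented name `svm_demo` and lists those names, which is one way to check that a key such as `pipeline__polynomialfeatures__degree` is valid.

```python
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import SVC

inner = make_pipeline(PolynomialFeatures(2, include_bias=False),
                      SelectKBest(f_classif, k=4))
svm_demo = make_pipeline(inner, StandardScaler(), SVC(random_state=0))

# Step names are the lowercased class names of each step
step_names = list(svm_demo.named_steps)
print(step_names)

# Nested parameters are addressed as <step>__<parameter>
param_names = svm_demo.get_params().keys()
print('svc__C' in param_names)
print('pipeline__polynomialfeatures__degree' in param_names)
```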

In [None]:
grid = RandomizedSearchCV(SVM, hyper_param, scoring='recall', cv=4, n_iter=10)

In [None]:
grid.fit(X_train,y_train)

In [None]:
print(grid.best_params_)
ypred = grid.predict(X_test)
print(classification_report(y_test,ypred))

In [None]:
evaluation(grid.best_estimator_)

### Precision-recall curve

In [None]:
from sklearn.metrics import precision_recall_curve

In [None]:
precision, recall, threshold = precision_recall_curve(y_test,grid.best_estimator_.decision_function(X_test))
plt.plot(threshold,precision[:-1], label = 'precision')
plt.plot(threshold,recall[:-1], label = 'recall')
plt.legend()
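
The curve can also be used to pick a decision threshold that meets the 70% recall target stated at the start. The sketch below shows one way to do this on invented labels and scores; `target_recall`, `chosen` and the data are assumptions made for the illustration, not results from the Titanic model.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, recall_score

# Invented labels and decision scores
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])
scores = np.array([-1.2, -0.8, -0.5, -0.3, 0.1, 0.4, 0.5, 0.9, 1.3, -2.0])

precision, recall, thresholds = precision_recall_curve(y_true, scores)

# recall[:-1] lines up with thresholds; keep the highest threshold
# whose recall still reaches the target
target_recall = 0.70
idx = np.where(recall[:-1] >= target_recall)[0][-1]
chosen = thresholds[idx]

# Predict the positive class whenever the score reaches the threshold
y_pred = (scores >= chosen).astype(int)
print(chosen, recall_score(y_true, y_pred))
```

Raising the threshold generally trades recall for precision, so taking the highest threshold that still meets the recall target keeps precision as high as that target allows.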

In [None]:
LogReg

In [None]:
model_final = LogisticRegression()

In [None]:
model_final.fit(X_train,y_train)

In [None]:
param = {'tol' :np.linspace(0.00001,1,5), 'C': [1.0,2,3,4,5]}

In [None]:
Grid = RandomizedSearchCV(model_final, param, cv = 4, n_iter=10)

In [None]:
Grid.fit(X_train,y_train)

In [None]:
print(Grid.best_params_)
ypredL = Grid.predict(X_test)
print(classification_report(y_test,ypredL))

In [None]:
confusion_matrix(y_test,ypredL)

In [None]:
confusion_matrix(y_test,ypred)

In [None]:
ypredk = KNN.predict(X_test)

In [None]:
confusion_matrix(y_test,ypredk)

In [None]:
evaluation(model_final)

In [None]:
print(model_final.predict(X_test))