# Projet 7 : Implémentez un modèle de scoring

# Notebook de la partie modélisation (du prétraitement à la prédiction)

# SALMA CHAFAI

**Mission:**

* Vous êtes Data Scientist au sein d'une société financière, nommée "Prêt à dépenser", qui propose des crédits à la consommation pour des personnes ayant peu ou pas du tout d'historique de prêt.
    
* L’entreprise souhaite mettre en œuvre un outil de “scoring crédit” pour calculer la probabilité qu’un client rembourse son crédit, puis classifie la demande en crédit accordé ou refusé. Elle souhaite donc développer un algorithme de classification en s’appuyant sur des sources de données variées (données comportementales, données provenant d'autres institutions financières, etc.).

* De plus, les chargés de relation client ont fait remonter le fait que les clients sont de plus en plus demandeurs de transparence vis-à-vis des décisions d’octroi de crédit. Cette demande de transparence des clients va tout à fait dans le sens des valeurs que l’entreprise veut incarner.

* "Prêt à dépenser" décide donc de développer un dashboard interactif pour que les chargés de relation client puissent à la fois expliquer de façon la plus transparente possible les décisions d’octroi de crédit, mais également permettre à leurs clients de disposer de leurs informations personnelles et de les explorer facilement. 

**Résumé de notre mission:**

 * 1- Construire un modèle de scoring qui donnera une prédiction sur la probabilité de faillite d'un client de façon automatique.

 * 2- Construire un dashboard interactif à destination des gestionnaires de la relation client permettant d'interpréter les prédictions faites par le modèle, et d’améliorer la connaissance client des chargés de relation client.

## A) Importation des bibliothèques nécessaires

### 1- Les bibliothèques usuelles et les bibliothèques de visualisation

In [1]:
 # ça nous permet d'importer numpy avec son nom np et matplotlib.pyplot as plt
%pylab inline 

# data
import pandas as pd
import numpy as np
import scipy 

# visualisation
import seaborn as sns
import missingno as msn
import matplotlib.pyplot as plt

# Librairie plotly pour les graphiques intéractives
import plotly.graph_objects as go
import plotly as plo
import plotly.express as px
from plotly.offline import download_plotlyjs, init_notebook_mode,iplot
from plotly.subplots import make_subplots

Populating the interactive namespace from numpy and matplotlib


In [2]:
import scipy.stats as stats

### 2- Bibliothèques du ML

In [40]:
# Imputation
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import KNNImputer
from sklearn.impute import SimpleImputer
from sklearn.impute import IterativeImputer


# outliers
from sklearn.ensemble import IsolationForest

# Preprocessing
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, KBinsDiscretizer, QuantileTransformer
from sklearn.compose import make_column_transformer, ColumnTransformer, make_column_selector
from category_encoders import TargetEncoder

# Feature engineering
from sklearn.pipeline import make_pipeline
from imblearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_regression, f_classif
from sklearn.preprocessing import PolynomialFeatures, StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA

import lime #LIME package
import lime.lime_tabular #the type of LIIME analysis we’ll do
import shap #SHAP package
import time #some of the routines take a while so we monitor the time
import os #needed to use Environment Variables in Domino

# Equilibrer les classes
from imblearn.over_sampling import SMOTE

# Modèles
from lightgbm import LGBMClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier,BaggingClassifier,GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn import kernel_ridge
from xgboost.sklearn import XGBClassifier

# Evaluation 
from sklearn.metrics import f1_score, mean_squared_error, mean_absolute_error, confusion_matrix, classification_report, fbeta_score
from sklearn.metrics import roc_auc_score, roc_curve, accuracy_score, precision_score, recall_score, plot_confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay, make_scorer
from sklearn.model_selection import learning_curve
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.model_selection import KFold, cross_val_score,StratifiedKFold
from sklearn.model_selection import train_test_split

# pickle est un package qu'on utilise pour mettre dedans ou pour générer notre modèle pour le déployer dans une application
import pickle
import dill

In [4]:
import gc
import time
from contextlib import contextmanager
from lightgbm import LGBMClassifier
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

## B) Jeu de données

In [5]:
# Jeu du données
df = pd.read_csv('C:/Users/salma/OneDrive/Bureau/Projet7/Data/data_clean_vf.csv')
df.drop(columns = 'Unnamed: 0', inplace=True)
df.head()

Unnamed: 0,SK_ID_CURR,TARGET,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,...,INSTAL_AMT_INSTALMENT_MEAN,INSTAL_AMT_INSTALMENT_SUM,INSTAL_AMT_PAYMENT_MIN,INSTAL_AMT_PAYMENT_MAX,INSTAL_AMT_PAYMENT_MEAN,INSTAL_AMT_PAYMENT_SUM,INSTAL_DAYS_ENTRY_PAYMENT_MAX,INSTAL_DAYS_ENTRY_PAYMENT_MEAN,INSTAL_DAYS_ENTRY_PAYMENT_SUM,INSTAL_COUNT
0,100002,1,0,0,0,0,202500.0,406597.5,24700.5,351000.0,...,11559.247,219625.69,9251.775,53093.746,11559.247,219625.69,-49.0,-315.5,-5993.0,19.0
1,100003,0,1,0,1,0,270000.0,1293502.5,35698.5,1129500.0,...,64754.586,1618864.6,6662.97,560835.4,64754.586,1618864.6,-544.0,-1385.0,-34633.0,25.0
2,100004,0,0,1,0,0,67500.0,135000.0,6750.0,135000.0,...,7096.155,21288.465,5357.25,10573.965,7096.155,21288.465,-727.0,-761.5,-2285.0,3.0
3,100006,0,1,0,0,0,135000.0,312682.5,29686.5,297000.0,...,62947.09,1007153.44,2482.92,691786.9,62947.09,1007153.44,-12.0,-271.5,-4346.0,16.0
4,100007,0,0,0,0,0,121500.0,513000.0,21865.5,513000.0,...,12666.444,835985.3,0.18,22678.785,12214.061,806128.0,-14.0,-1032.0,-68128.0,66.0


In [6]:
#Taille du jeu de données
df.shape

(307507, 624)

In [7]:
((df.isna().sum()/df.shape[0])*100).sort_values(ascending=False)

SK_ID_CURR                             0.0
PREV_CODE_REJECT_REASON_VERIF_MEAN     0.0
PREV_NAME_PAYMENT_TYPE_nan_MEAN        0.0
PREV_CODE_REJECT_REASON_CLIENT_MEAN    0.0
PREV_CODE_REJECT_REASON_HC_MEAN        0.0
                                      ... 
ORGANIZATION_TYPE_Trade: type 7        0.0
ORGANIZATION_TYPE_Transport: type 1    0.0
ORGANIZATION_TYPE_Transport: type 2    0.0
ORGANIZATION_TYPE_Transport: type 3    0.0
INSTAL_COUNT                           0.0
Length: 624, dtype: float64

In [8]:
def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df

In [9]:
df_reduce = reduce_mem_usage(df)

Memory usage of dataframe is 1463.96 MB
Memory usage after optimization is: 412.91 MB
Decreased by 71.8%


### 1- On prend un échantillon du jeu de données

In [10]:
# On prend un échantillon de 200 clients pour la partie test du dashboard
df_sub = df_reduce.sample(200)
#df_sub.drop(columns='TARGET', inplace=True)
df_sub.head()

Unnamed: 0,SK_ID_CURR,TARGET,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,...,INSTAL_AMT_INSTALMENT_MEAN,INSTAL_AMT_INSTALMENT_SUM,INSTAL_AMT_PAYMENT_MIN,INSTAL_AMT_PAYMENT_MAX,INSTAL_AMT_PAYMENT_MEAN,INSTAL_AMT_PAYMENT_SUM,INSTAL_DAYS_ENTRY_PAYMENT_MAX,INSTAL_DAYS_ENTRY_PAYMENT_MEAN,INSTAL_DAYS_ENTRY_PAYMENT_SUM,INSTAL_COUNT
70537,181833,0,1,0,0,0,360000.0,1928304.0,79578.0,1800000.0,...,49032.132812,3530314.0,27.0,902486.8125,45371.347656,3266737.0,-190.0,-1114.0,-80190.0,72.0
203889,336380,0,1,0,0,0,135000.0,354276.0,19912.5,292500.0,...,6731.884277,888608.7,2.07,58712.941406,6728.795898,888201.1,-16.0,-1490.0,-196680.0,132.0
3347,103912,0,0,0,0,1,315000.0,545040.0,35617.5,450000.0,...,22335.443359,469044.3,538.380005,22365.404297,14895.289062,312801.1,-976.0,-1164.0,-24448.0,21.0
128439,248989,0,0,1,0,3,306000.0,1305000.0,38155.5,1305000.0,...,20891.949219,1483328.0,37.395,414450.0,18686.765625,1326760.0,-10.0,-933.5,-66271.0,71.0
16326,119040,0,0,0,0,2,148500.0,312768.0,22374.0,270000.0,...,5263.145996,131578.7,164.294998,5591.790039,5064.461914,126611.5,-1687.0,-2078.0,-51942.0,25.0


In [11]:
df_sub.to_csv('echantillon.csv', index=False)

In [12]:
unique(df_sub["TARGET"].values.tolist())

array([0, 1])

## C) Modélisation 

### 1- Séparation du data set en train et test set

In [13]:
deleted_cols = ['TARGET','SK_ID_CURR']

In [14]:
X = df_reduce.drop(columns=deleted_cols)
y = df_reduce['TARGET']

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X,y, 
                                                    test_size=0.2,
                                                   random_state=42)

In [16]:
y_train.value_counts()

0    226201
1     19804
Name: TARGET, dtype: int64

In [17]:
X.shape

(307507, 622)

In [18]:
y_test.value_counts()

0    56481
1     5021
Name: TARGET, dtype: int64

### 2- Sélection des features

In [19]:
# Le jeu de données "HomeCredit_columns_description"
col_description = pd.read_csv("C:/Users/salma/OneDrive/Bureau/Projet7/Data/Data_kaggle/HomeCredit_columns_description_encode.csv", sep=',', encoding="utf8")

In [20]:
# Create and fit selector
selector = SelectKBest(f_classif, k=20)
selector.fit(X, y)
# Get columns to keep and create new dataframe with those only
cols = selector.get_support(indices=True)
X_new = X.iloc[:,cols]

Features [254 259 275 284 363 371 374 400 405 410 420 428 433 448 462 468 472 481
 493 499 517 594] are constant.
invalid value encountered in true_divide


In [21]:
X_new.head()

Unnamed: 0,DAYS_BIRTH,DAYS_EMPLOYED,REGION_RATING_CLIENT,REGION_RATING_CLIENT_W_CITY,EXT_SOURCE_1,EXT_SOURCE_2,EXT_SOURCE_3,DAYS_EMPLOYED_PERC,NAME_INCOME_TYPE_Working,NAME_EDUCATION_TYPE_Higher education,BURO_DAYS_CREDIT_MIN,BURO_DAYS_CREDIT_MEAN,BURO_DAYS_CREDIT_UPDATE_MEAN,BURO_CREDIT_ACTIVE_Active_MEAN,BURO_CREDIT_ACTIVE_Closed_MEAN,PREV_NAME_CONTRACT_STATUS_Approved_MEAN,PREV_NAME_CONTRACT_STATUS_Refused_MEAN,PREV_CODE_REJECT_REASON_SCOFR_MEAN,PREV_CODE_REJECT_REASON_XAP_MEAN,PREV_NAME_PRODUCT_TYPE_walk-in_MEAN
0,-9461,-637.0,2,2,0.083008,0.262939,0.139404,0.067322,1.0,0.0,-1437.0,-874.0,-500.0,0.25,0.75,1.0,0.0,0.0,1.0,0.0
1,-16765,-1188.0,1,1,0.311279,0.62207,0.535156,0.070862,0.0,1.0,-2586.0,-1401.0,-816.0,0.25,0.75,1.0,0.0,0.0,1.0,0.0
2,-19046,-225.0,2,2,0.505859,0.556152,0.729492,0.01181,1.0,0.0,-1326.0,-867.0,-532.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0
3,-19005,-3040.0,2,2,0.505859,0.650391,0.535156,0.159912,1.0,0.0,-1827.0,-1051.0,-481.75,0.375,0.625,0.555664,0.111084,0.0,0.888672,0.0
4,-19932,-3038.0,2,2,0.505859,0.322754,0.535156,0.152466,1.0,0.0,-1149.0,-1149.0,-783.0,0.0,1.0,1.0,0.0,0.0,1.0,0.166626


In [22]:
features_train = X_new.columns.tolist()
len(features_train)

20

In [23]:
l_description = []
cols_desc = col_description['Row'].tolist()
for feature in features_train : 
    if feature in cols_desc :
        ligne = col_description[col_description['Row'] == feature]['Description'].iloc[0]
        l_description.append(ligne)
    else : 
        l_description.append('Pas de description')

In [24]:
# Enregistrer les colonnes et leurs descriptions
pickle.dump(features_train, open('features_selected.pkl', 'wb'))
pickle.dump(l_description, open('features_description.pkl', 'wb'))

### 3- Modélisation

#### (a) Division du dataset en train et test set

In [25]:
X_train, X_test, y_train, y_test = train_test_split(X_new,y, 
                                                    test_size=0.2,
                                                   random_state=42,
                                                   stratify=y)

#### (b) Création de la fonction coût métier

In [26]:
def fct_cout(y_test, y_pred, cout_fp=1, coup_fn=10):
    
    TN, FP, FN, TP = confusion_matrix(y_test, y_pred).ravel()
    resultat = (cout_fp*FP + coup_fn*FN)/(TN + coup_fn*FN + TP + cout_fp*FP)
    return resultat

On transforme la fct_cout en un score en utilisant la fonction make_scorer.

In [45]:
score_cout = make_scorer(fct_cout, greater_is_better=False)

#### (c) Fonctions d'entraînement des modèles

In [28]:
def entrainement_model(model, params,X_train,X_test,y_train, y_test):
    
    """Cette fonction renvoie le meilleur modèle et les meilleurs hyperparamètres
    obtenues par cross validation avec GridSearchCv"""
    
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    
    
    grid_model = GridSearchCV(model, params, cv=5, scoring=score_cout)
    
    t0 = time.time()
    grid_model.fit(X_train,y_train)
    tf = time.time()
    print("Temps d'exécution avec gridSearchCv : {:.4f} seconds".format(tf - t0))
    
        
    

    
    best_model = grid_model.best_estimator_
    best_params = grid_model.best_params_
    
    for mean, std, params in zip(grid_model.cv_results_['mean_test_score'],
                                grid_model.cv_results_['std_test_score'],
                                grid_model.cv_results_['params']):
        
        print("\tscore coût = %0.3f (+/-%0.3f) for %s" % (mean,std*2,params))
    

    
    return best_model, best_params

#### (d) Fonctions d'évaluation des modèles

In [29]:
def evaluation_model(model, X_train,X_test,y_train, y_test, smote_method = False):
    
    """Cette fonction entraîne le modèle et affiche les scores du modèle"""
    
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    
    if smote_method:
        oversampler=SMOTE(random_state=0)
        X_train,y_train=oversampler.fit_resample(X_train,y_train)
    
    t0 = time.time()
    
    model.fit(X_train, y_train)
    tf = time.time()
    tps_ex = tf - t0
    print("Temps d'exécution du meilleur modèle : {:.4f} seconds".format(tps_ex))
    
    
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)
    
    acc_train = accuracy_score(y_train, y_pred_train)
    acc_test = accuracy_score(y_test, y_pred_test)
    
    roc_train = roc_auc_score(y_train, y_pred_train)
    roc_test = roc_auc_score(y_test, y_pred_test)
    
    beta_score_train = fbeta_score(y_train, y_pred_train, beta=10)
    beta_score_test = fbeta_score(y_test, y_pred_test, beta=10)
    
    
    
    
    print("Accuracy train : ", acc_train)
    print("Accuracy test : ", acc_test)
    print("Roc Accuracy train : ", roc_train)
    print("Roc Accuracy test : ", roc_test)
    print("Beta score train : ", beta_score_train)
    print("Beta score test : ", beta_score_test)
    print(confusion_matrix(y_test, y_pred_test))
    print(classification_report(y_test, y_pred_test))
    
    return acc_train, acc_test, roc_train, roc_test, beta_score_train, beta_score_test, tps_ex

In [30]:
# Les modèles
model_lr = LogisticRegression(random_state=0, class_weight="balanced")
model_knn = KNeighborsClassifier()
model_rf = RandomForestClassifier(random_state=0, class_weight="balanced")
model_lgb = LGBMClassifier(random_state=0, class_weight="balanced")
model_xgb = XGBClassifier(random_state=0, class_weight="balanced")

In [31]:
params_lr = {'C': [0.1, 1, 10, 100]}

#params_knn = {'n_neighbors' : [3,5,7]}
 

params_rf = { 
    'n_estimators': [100, 200],#[int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)],
    'max_depth' : [2, 5, 10], #[int(x) for x in np.linspace(10, 110, num = 11)],
    'min_samples_split' : [2, 5, 10]
}



params_lgb = {'learning_rate': [0.1, 0.01, 0.001],
              'max_depth': [3, 4, 5]}

params_xgb = {'learning_rate': [0.1, 0.01, 0.001],
              'max_depth': [3, 4, 5]}

#### (a) Pour la régression logistique

In [46]:
best_lr, best_param_lr = entrainement_model(model_lr, params_lr,X_train,X_test,y_train, y_test)
final_model_lr = LogisticRegression(**best_param_lr, random_state=0, class_weight="balanced")
acc_train_lr, acc_test_lr, roc_train_lr, roc_test_lr, beta_score_train_lr, beta_score_test_lr, tps_ex_lr = \
evaluation_model(final_model_lr,X_train,X_test,y_train, y_test)

Temps d'exécution avec gridSearchCv : 46.4549 seconds
	score coût = -0.453 (+/-0.005) for {'C': 0.1}
	score coût = -0.453 (+/-0.005) for {'C': 1}
	score coût = -0.453 (+/-0.005) for {'C': 10}
	score coût = -0.453 (+/-0.005) for {'C': 100}
Temps d'exécution du meilleur modèle : 2.5627 seconds
Accuracy train :  0.6828316497632162
Accuracy test :  0.6849370752170661
Roc Accuracy train :  0.6715778312834204
Roc Accuracy test :  0.6769715564356691
Beta score train :  0.6376691959211847
Beta score test :  0.646740348572091
[[38811 17726]
 [ 1651  3314]]
              precision    recall  f1-score   support

           0       0.96      0.69      0.80     56537
           1       0.16      0.67      0.25      4965

    accuracy                           0.68     61502
   macro avg       0.56      0.68      0.53     61502
weighted avg       0.89      0.68      0.76     61502



#### (b) Pour la forêt aléatoire 

In [33]:
best_rf, best_param_rf = entrainement_model(model_rf, params_rf,X_train,X_test,y_train, y_test)
final_model_rf = RandomForestClassifier(**best_param_rf, random_state=0, class_weight="balanced")
acc_train_rf, acc_test_rf, roc_train_rf, roc_test_rf, beta_score_train_rf, beta_score_test_rf, tps_ex_rf = \
evaluation_model(final_model_rf, X_train,X_test,y_train, y_test)

Temps d'exécution avec gridSearchCv : 4121.8280 seconds
	score coût = -0.480 (+/-0.009) for {'max_depth': 2, 'min_samples_split': 2, 'n_estimators': 100}
	score coût = -0.477 (+/-0.006) for {'max_depth': 2, 'min_samples_split': 2, 'n_estimators': 200}
	score coût = -0.480 (+/-0.009) for {'max_depth': 2, 'min_samples_split': 5, 'n_estimators': 100}
	score coût = -0.477 (+/-0.006) for {'max_depth': 2, 'min_samples_split': 5, 'n_estimators': 200}
	score coût = -0.480 (+/-0.009) for {'max_depth': 2, 'min_samples_split': 10, 'n_estimators': 100}
	score coût = -0.477 (+/-0.006) for {'max_depth': 2, 'min_samples_split': 10, 'n_estimators': 200}
	score coût = -0.457 (+/-0.005) for {'max_depth': 5, 'min_samples_split': 2, 'n_estimators': 100}
	score coût = -0.457 (+/-0.005) for {'max_depth': 5, 'min_samples_split': 2, 'n_estimators': 200}
	score coût = -0.457 (+/-0.005) for {'max_depth': 5, 'min_samples_split': 5, 'n_estimators': 100}
	score coût = -0.457 (+/-0.005) for {'max_depth': 5, 'min_sa

#### (c) Le modèle lightgbm

In [34]:
best_lgb, best_param_lgb = entrainement_model(model_lgb, params_lgb,X_train,X_test,y_train, y_test)
final_model_lgb = LGBMClassifier(**best_param_lgb, random_state=0, class_weight="balanced")
acc_train_lgb, acc_test_lgb, roc_train_lgb, roc_test_lgb, beta_score_train_lgb, beta_score_test_lgb, tps_ex_lgb = \
evaluation_model(final_model_lgb, X_train,X_test,y_train, y_test)

Temps d'exécution avec gridSearchCv : 138.4898 seconds
	score coût = -0.450 (+/-0.004) for {'learning_rate': 0.1, 'max_depth': 3}
	score coût = -0.447 (+/-0.003) for {'learning_rate': 0.1, 'max_depth': 4}
	score coût = -0.446 (+/-0.005) for {'learning_rate': 0.1, 'max_depth': 5}
	score coût = -0.471 (+/-0.006) for {'learning_rate': 0.01, 'max_depth': 3}
	score coût = -0.467 (+/-0.006) for {'learning_rate': 0.01, 'max_depth': 4}
	score coût = -0.463 (+/-0.007) for {'learning_rate': 0.01, 'max_depth': 5}
	score coût = -0.480 (+/-0.023) for {'learning_rate': 0.001, 'max_depth': 3}
	score coût = -0.477 (+/-0.018) for {'learning_rate': 0.001, 'max_depth': 4}
	score coût = -0.467 (+/-0.013) for {'learning_rate': 0.001, 'max_depth': 5}
Temps d'exécution du meilleur modèle : 3.6142 seconds
Accuracy train :  0.6920875591959513
Accuracy test :  0.6888719066046632
Roc Accuracy train :  0.6968446084051136
Roc Accuracy test :  0.6823268887127023
Beta score train :  0.6808181691686896
Beta score tes

#### (d) Le modèle XgBoost

In [42]:
best_xgb, best_param_xgb = entrainement_model(model_xgb, params_xgb,X_train,X_test,y_train, y_test)
final_model_xgb = XGBClassifier(**best_param_xgb, random_state=0, class_weight="balanced")
acc_train_xgb, acc_test_xgb, roc_train_xgb, roc_test_xgb, beta_score_train_xgb, beta_score_test_xgb, tps_ex_xgb = \
evaluation_model(final_model_xgb, X_train,X_test,y_train, y_test, smote_method=True)



Parameters: { "class_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "class_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "class_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "class_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "class_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "class_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "class_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "class_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "class_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "class_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "class_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "class_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "class_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "class_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "class_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "class_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "class_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "class_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "class_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "class_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "class_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "class_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "class_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "class_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "class_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "class_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "class_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "class_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "class_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "class_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "class_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "class_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "class_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "class_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "class_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "class_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "class_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "class_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "class_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "class_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "class_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "class_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "class_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "class_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "class_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "class_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


Temps d'exécution avec gridSearchCv : 633.4388 seconds
	score coût = -0.466 (+/-0.001) for {'learning_rate': 0.1, 'max_depth': 3}
	score coût = -0.465 (+/-0.000) for {'learning_rate': 0.1, 'max_depth': 4}
	score coût = -0.465 (+/-0.000) for {'learning_rate': 0.1, 'max_depth': 5}
	score coût = -0.468 (+/-0.000) for {'learning_rate': 0.01, 'max_depth': 3}
	score coût = -0.468 (+/-0.000) for {'learning_rate': 0.01, 'max_depth': 4}
	score coût = -0.467 (+/-0.000) for {'learning_rate': 0.01, 'max_depth': 5}
	score coût = -0.468 (+/-0.000) for {'learning_rate': 0.001, 'max_depth': 3}
	score coût = -0.467 (+/-0.000) for {'learning_rate': 0.001, 'max_depth': 4}
	score coût = -0.46



Parameters: { "class_weight" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


Temps d'exécution du meilleur modèle : 67.9952 seconds
Accuracy train :  0.8239602909637622
Accuracy test :  0.7945757861532958
Roc Accuracy train :  0.8239602909637622
Roc Accuracy test :  0.6299542541284401
Beta score train :  0.8215438633730442
Beta score test :  0.4276572102856582
[[46715  9822]
 [ 2812  2153]]
              precision    recall  f1-score   support

           0       0.94      0.83      0.88     56537
           1       0.18      0.43      0.25      4965

    accuracy                           0.79     61502
   macro avg       0.56      0.63      0.57     61502
weighted avg       0.88      0.79      0.83     61502



### 4- Tableau comparatif des modèles

In [37]:
df_comparaison = pd.DataFrame(columns = ['Modèle', 'best_paramètres', 'Accuracy train', 'Accuracy test', 
                                         'ROC train','ROC test', 'Beta_score train',
                                         'Beta_score test',"temps_exec"])

name_models = ["Régression logistique","Forêt aléatoire","LightGBM","XGBoost"]
params_models = [best_param_lr, best_param_rf, best_param_lgb,  best_param_xgb]
l_acc_train = [acc_train_lr, acc_train_rf,acc_train_lgb,acc_train_xgb]
l_acc_test = [acc_test_lr, acc_test_rf,acc_test_lgb, acc_test_xgb]
l_roc_train = [roc_train_lr, roc_train_rf,roc_train_lgb, roc_train_xgb]
l_roc_test = [roc_test_lr, roc_test_rf,roc_test_lgb,roc_test_xgb]
l_betasc_train = [beta_score_train_lr, beta_score_train_rf,beta_score_train_lgb,beta_score_train_xgb]
l_betasc_test = [beta_score_test_lr, beta_score_test_rf,beta_score_test_lgb,beta_score_test_xgb]
l_tps_ex = [tps_ex_lr, tps_ex_rf, tps_ex_lgb, tps_ex_xgb]

# Résultast
df_comparaison['Modèle'] = name_models
df_comparaison['best_paramètres'] = params_models
df_comparaison['Accuracy train'] = l_acc_train
df_comparaison['Accuracy test'] = l_acc_test
df_comparaison['ROC train'] = l_roc_train
df_comparaison['ROC test'] = l_roc_test
df_comparaison['Beta_score train'] = l_betasc_train
df_comparaison['Beta_score test'] = l_betasc_test
df_comparaison['temps_exec'] = l_tps_ex

In [38]:
df_comparaison

Unnamed: 0,Modèle,best_paramètres,Accuracy train,Accuracy test,ROC train,ROC test,Beta_score train,Beta_score test,temps_exec
0,Régression logistique,{'C': 100},0.682832,0.684937,0.671578,0.676972,0.637669,0.64674,1.833568
1,Forêt aléatoire,"{'max_depth': 10, 'min_samples_split': 2, 'n_e...",0.73577,0.725667,0.717182,0.672761,0.677161,0.594267,96.880283
2,LightGBM,"{'learning_rate': 0.1, 'max_depth': 5}",0.692088,0.688872,0.696845,0.682327,0.680818,0.653788,3.614241
3,XGBoost,"{'learning_rate': 0.1, 'max_depth': 5}",0.920201,0.919547,0.508958,0.506581,0.01871,0.014236,22.633902


### 5- Modèle final

On choisit comme modèle final le model LightGBM

#### (a) Définition du seuil de proba pour la fonction coût personnalisée

In [76]:
def pred_seuil(model, X_test, y_test) :
    
    y_proba = model.predict_proba(X_test)[:,1]
    best_seuil = 0
    best_cost = float('inf')
    
    list_seuils = np.linspace(start=0, stop=1, num=100)

    for seuil in list_seuils:
        y_pred = (y_proba >= seuil).astype(int)
        #print("y pred : ", y_pred)
        total_cost = f1_score(y_test, y_pred)
        if total_cost < best_cost:
            best_seuil = seuil 
            best_cost = total_cost
            
        
            
    y_pred_f = (y_proba >= best_seuil).astype(int)
    print("prediction : ", y_pred_f)
            
    return best_seuil


In [66]:
def evaluation_model_final(model, X_train, y_train, X_test, y_test):
    
    """Cette fonction donne les scores pour le model final choisi"""
    
    
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)
    
    acc_train = accuracy_score(y_train, y_pred_train)
    acc_test = accuracy_score(y_test, y_pred_test)
    
    roc_train = roc_auc_score(y_train, y_pred_train)
    roc_test = roc_auc_score(y_test, y_pred_test)
    
    
    beta_score_train = fbeta_score(y_train, y_pred_train, beta=10)
    beta_score_test = fbeta_score(y_test, y_pred_test, beta=10)
    
    #sc_cout = score_cout(model, y_test, y_pred_test)
    

    
    
    
    
    print("Accuracy train : ", acc_train)
    print("Accuracy test : ", acc_test)
    print("Roc Accuracy train : ", roc_train)
    print("Roc Accuracy test : ", roc_test)
    print("Beta score train : ", beta_score_train)
    print("Beta score test : ", beta_score_test)
    #print("Score de la fonction personnalisée :", sc_cout)
    print(confusion_matrix(y_test, y_pred_test))
    print(classification_report(y_test, y_pred_test))    
    



#### (b) On met le modèle final dans un Pipeline

In [59]:
model_f = best_lgb

In [60]:
smt = SMOTE(random_state=42)
scaler = StandardScaler()

In [61]:
pipeline = Pipeline([
                     ('scaler', scaler),
                    ('model', model_f)
                          ])

In [62]:
pipeline.fit(X_train, y_train)

Pipeline(steps=[('scaler', StandardScaler()),
                ('model',
                 LGBMClassifier(class_weight='balanced', max_depth=5,
                                random_state=0))])

In [77]:
res = pred_seuil(pipeline, X_test, y_test)
res 

prediction :  [0 0 0 ... 0 0 0]


0.9595959595959597

In [64]:
pipeline.predict_proba(X_test)[:,1]

array([0.44376605, 0.38042285, 0.58473601, ..., 0.21261666, 0.52236856,
       0.5989343 ])

In [None]:
evaluation_model_final(pipeline,X_train, y_train, X_test,y_test)

In [None]:
# Enregistrer le model logistic regression
#pickle.dump(pipeline_lgb, open('model_credit_rl.pkl', 'wb'))
pickle.dump(pipeline, open('model_credit.pkl', 'wb'))

### 5- Interprétabilité en utilisant shap

#### (a) Interprétabilité globale

In [None]:
pipeline = pickle.load(open("model_credit.pkl","rb"))

In [None]:
# On définit la liste des colonnes
all_features = list(X_new.columns)

In [None]:
# Créer un explicateur SHAP
explainer = shap.TreeExplainer(pipeline.named_steps['model'], X_train)


In [None]:
dill.dump(explainer, open('shap_explainer.dill', 'wb'))

In [None]:
explainer = dill.load(open("shap_explainer.dill","rb"))

In [None]:
# Calculer les valeurs SHAP pour les prédictions de test
shap_values = explainer.shap_values(X_test.values)

In [None]:
shap.summary_plot(shap_values, X_test, feature_names=all_features)

#### (b) Interprétabilité locale Pour un client 

In [None]:
X_test_sample = df[df["SK_ID_CURR"] == 137021][all_features]
X_test_sample

In [None]:
len(X_test_sample[all_features].columns.tolist())

In [None]:
pred_proba = pipeline.predict_proba(X_test_sample)
pred_proba[0][1]

In [None]:
pipeline.predict(X_test_sample)

In [None]:
shap_values = explainer.shap_values(X_test_sample[all_features])
shap_values[0]

In [None]:
shap.plots._waterfall.waterfall_legacy(explainer.expected_value,shap_values[0],feature_names = X_test_sample.columns) 

## D) MlFlow

In [None]:
import mlflow

In [None]:
from mlflow.models.signature import infer_signature

In [None]:
signature = infer_signature(X_train, y_train)

In [None]:
mlflow.sklearn.save_model(pipeline, 'mlflow_model', signature=signature)