# Project 7: Implement a scoring model

# Modeling notebook (from preprocessing to prediction)

# SALMA CHAFAI

**Mission:**

* You are a Data Scientist at a financial company called "Prêt à dépenser", which offers consumer credit to people with little or no credit history.

* The company wants to build a “credit scoring” tool that computes the probability that a client will repay their loan, then classifies the application as granted or refused. It therefore wants to develop a classification algorithm based on varied data sources (behavioral data, data from other financial institutions, etc.).

* In addition, client relationship managers have reported that clients increasingly ask for transparency about credit-granting decisions. This demand for transparency is fully in line with the values the company wants to embody.

* "Prêt à dépenser" has therefore decided to develop an interactive dashboard so that client relationship managers can explain credit-granting decisions as transparently as possible, and so that clients can access and easily explore their personal information.

**Summary of our mission:**

 * 1- Build a scoring model that automatically predicts the probability that a client defaults.

 * 2- Build an interactive dashboard for client relationship managers that helps them interpret the model's predictions and improves their knowledge of clients.

## A) Importing the required libraries

### 1- Standard and visualization libraries

In [1]:
# %pylab inline loads numpy as np and matplotlib.pyplot as plt into the namespace
%pylab inline 

# data
import pandas as pd
import numpy as np
import scipy 

# visualisation
import seaborn as sns
import missingno as msn
import matplotlib.pyplot as plt

# Plotly library for interactive charts
import plotly.graph_objects as go
import plotly as plo
import plotly.express as px
from plotly.offline import download_plotlyjs, init_notebook_mode,iplot
from plotly.subplots import make_subplots

Populating the interactive namespace from numpy and matplotlib


### 2- Machine learning libraries

In [40]:
# Imputation
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import KNNImputer
from sklearn.impute import SimpleImputer
from sklearn.impute import IterativeImputer




# outliers
from sklearn.ensemble import IsolationForest

# Preprocessing
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, KBinsDiscretizer, QuantileTransformer
from sklearn.compose import make_column_transformer, ColumnTransformer, make_column_selector
from category_encoders import TargetEncoder

# Feature engineering
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.feature_selection import SelectKBest, f_regression, f_classif
from sklearn.preprocessing import PolynomialFeatures, StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA

import lime #LIME package
import lime.lime_tabular #the type of LIME analysis we’ll do
import shap #SHAP package
import time #some of the routines take a while so we monitor the time
import os #needed to use Environment Variables in Domino

# Class balancing
from imblearn.over_sampling import SMOTE

# Models

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier,BaggingClassifier,GradientBoostingClassifier
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn import kernel_ridge
import xgboost as xgb
from xgboost.sklearn import XGBRegressor

# Evaluation 
from sklearn.metrics import f1_score, r2_score, mean_squared_error, mean_absolute_error, confusion_matrix, classification_report
from sklearn.metrics import roc_auc_score, roc_curve, accuracy_score, precision_score, recall_score
from sklearn.model_selection import learning_curve
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.model_selection import KFold, cross_val_score,StratifiedKFold
from sklearn.model_selection import train_test_split

# pickle is used to serialize the trained model so that it can be deployed in an application
import pickle

In [3]:
import gc
import time
from contextlib import contextmanager
from lightgbm import LGBMClassifier
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

## B) Dataset

In [4]:
# Dataset
df = pd.read_csv('data_clean.csv')
df.drop(columns = 'Unnamed: 0', inplace=True)
df.head()

Unnamed: 0,SK_ID_CURR,TARGET,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,...,CC_NAME_CONTRACT_STATUS_Signed_MAX,CC_NAME_CONTRACT_STATUS_Signed_MEAN,CC_NAME_CONTRACT_STATUS_Signed_SUM,CC_NAME_CONTRACT_STATUS_Signed_VAR,CC_NAME_CONTRACT_STATUS_nan_MIN,CC_NAME_CONTRACT_STATUS_nan_MAX,CC_NAME_CONTRACT_STATUS_nan_MEAN,CC_NAME_CONTRACT_STATUS_nan_SUM,CC_NAME_CONTRACT_STATUS_nan_VAR,CC_COUNT
0,100002,1,0,0,0,0,202500.0,406597.5,24700.5,351000.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,22.0
1,100003,0,1,0,1,0,270000.0,1293502.5,35698.5,1129500.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,22.0
2,100004,0,0,1,0,0,67500.0,135000.0,6750.0,135000.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,22.0
3,100006,0,1,0,0,0,135000.0,312682.5,29686.5,297000.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0
4,100007,0,0,0,0,0,121500.0,513000.0,21865.5,513000.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,22.0


In [5]:
# Dataset size
df.shape

(307507, 797)

In [6]:
((df.isna().sum()/df.shape[0])*100).sort_values(ascending=False)

SK_ID_CURR                                     0.0
PREV_NAME_SELLER_INDUSTRY_MLM partners_MEAN    0.0
PREV_NAME_SELLER_INDUSTRY_XNA_MEAN             0.0
PREV_NAME_SELLER_INDUSTRY_nan_MEAN             0.0
PREV_NAME_YIELD_GROUP_XNA_MEAN                 0.0
                                              ... 
BURO_AMT_ANNUITY_MEAN                          0.0
BURO_CNT_CREDIT_PROLONG_SUM                    0.0
BURO_MONTHS_BALANCE_MIN_MIN                    0.0
BURO_MONTHS_BALANCE_MAX_MAX                    0.0
CC_COUNT                                       0.0
Length: 797, dtype: float64

In [7]:
for col in list(df.columns):
    print('Col', col)
    print('Le max', df[col].max())
    print('Le min', df[col].min())
    print('-'*100)

Col SK_ID_CURR
Le max 456255
Le min 100002
----------------------------------------------------------------------------------------------------
Col TARGET
Le max 1
Le min 0
----------------------------------------------------------------------------------------------------
Col CODE_GENDER
Le max 1
Le min 0
----------------------------------------------------------------------------------------------------
Col FLAG_OWN_CAR
Le max 1
Le min 0
----------------------------------------------------------------------------------------------------
Col FLAG_OWN_REALTY
Le max 1
Le min 0
----------------------------------------------------------------------------------------------------
Col CNT_CHILDREN
Le max 19
Le min 0
----------------------------------------------------------------------------------------------------
Col AMT_INCOME_TOTAL
Le max 117000000.0
Le min 25650.0
----------------------------------------------------------------------------------------------------
Col AMT_CREDIT
Le max 4

Le min 0.0
----------------------------------------------------------------------------------------------------
Col LIVINGAREA_MEDI
Le max 1.0
Le min 0.0
----------------------------------------------------------------------------------------------------
Col NONLIVINGAPARTMENTS_MEDI
Le max 1.0
Le min 0.0
----------------------------------------------------------------------------------------------------
Col NONLIVINGAREA_MEDI
Le max 1.0
Le min 0.0
----------------------------------------------------------------------------------------------------
Col TOTALAREA_MODE
Le max 1.0
Le min 0.0
----------------------------------------------------------------------------------------------------
Col OBS_30_CNT_SOCIAL_CIRCLE
Le max 348.0
Le min 0.0
----------------------------------------------------------------------------------------------------
Col DEF_30_CNT_SOCIAL_CIRCLE
Le max 34.0
Le min 0.0
---------------------------------------------------------------------------------------------------

Le min 0
----------------------------------------------------------------------------------------------------
Col WEEKDAY_APPR_PROCESS_START_SATURDAY
Le max 1
Le min 0
----------------------------------------------------------------------------------------------------
Col WEEKDAY_APPR_PROCESS_START_SUNDAY
Le max 1
Le min 0
----------------------------------------------------------------------------------------------------
Col WEEKDAY_APPR_PROCESS_START_THURSDAY
Le max 1
Le min 0
----------------------------------------------------------------------------------------------------
Col WEEKDAY_APPR_PROCESS_START_TUESDAY
Le max 1
Le min 0
----------------------------------------------------------------------------------------------------
Col WEEKDAY_APPR_PROCESS_START_WEDNESDAY
Le max 1
Le min 0
----------------------------------------------------------------------------------------------------
Col ORGANIZATION_TYPE_Advertising
Le max 1
Le min 0
---------------------------------------------

Le min 0.0002238846153846
----------------------------------------------------------------------------------------------------
Col PAYMENT_RATE
Le max 0.1581142857142857
Le min 0.0167896919664934
----------------------------------------------------------------------------------------------------
Col BURO_DAYS_CREDIT_MIN
Le max 0.0
Le min -2922.0
----------------------------------------------------------------------------------------------------
Col BURO_DAYS_CREDIT_MAX
Le max 0.0
Le min -2922.0
----------------------------------------------------------------------------------------------------
Col BURO_DAYS_CREDIT_MEAN
Le max 0.0
Le min -2922.0
----------------------------------------------------------------------------------------------------
Col BURO_DAYS_CREDIT_VAR
Le max 4173160.5
Le min 0.0
----------------------------------------------------------------------------------------------------
Col BURO_DAYS_CREDIT_ENDDATE_MIN
Le max 31198.0
Le min -42060.0
----------------------------

Le max 1.0
Le min 0.0
----------------------------------------------------------------------------------------------------
Col BURO_STATUS_1_MEAN_MEAN
Le max 0.8333333333333334
Le min 0.0
----------------------------------------------------------------------------------------------------
Col BURO_STATUS_2_MEAN_MEAN
Le max 0.2407407407407407
Le min 0.0
----------------------------------------------------------------------------------------------------
Col BURO_STATUS_3_MEAN_MEAN
Le max 0.4
Le min 0.0
----------------------------------------------------------------------------------------------------
Col BURO_STATUS_4_MEAN_MEAN
Le max 0.1666666666666666
Le min 0.0
----------------------------------------------------------------------------------------------------
Col BURO_STATUS_5_MEAN_MEAN
Le max 1.0
Le min 0.0
----------------------------------------------------------------------------------------------------
Col BURO_STATUS_C_MEAN_MEAN
Le max 1.0
Le min 0.0
---------------------------

Le min -34750.35
----------------------------------------------------------------------------------------------------
Col CLOSED_AMT_ANNUITY_MAX
Le max 59586682.5
Le min 0.0
----------------------------------------------------------------------------------------------------
Col CLOSED_AMT_ANNUITY_MEAN
Le max 54562657.5
Le min 0.0
----------------------------------------------------------------------------------------------------
Col CLOSED_CNT_CREDIT_PROLONG_SUM
Le max 6.0
Le min 0.0
----------------------------------------------------------------------------------------------------
Col CLOSED_MONTHS_BALANCE_MIN_MIN
Le max 0.0
Le min -96.0
----------------------------------------------------------------------------------------------------
Col CLOSED_MONTHS_BALANCE_MAX_MAX
Le max 0.0
Le min -94.0
----------------------------------------------------------------------------------------------------
Col CLOSED_MONTHS_BALANCE_SIZE_MEAN
Le max 97.0
Le min 1.0
---------------------------------

Le max 1.0
Le min 0.1111111111111111
----------------------------------------------------------------------------------------------------
Col PREV_FLAG_LAST_APPL_PER_CONTRACT_nan_MEAN
Le max 0.0
Le min 0.0
----------------------------------------------------------------------------------------------------
Col PREV_NAME_CASH_LOAN_PURPOSE_Building a house or an annex_MEAN
Le max 1.0
Le min 0.0
----------------------------------------------------------------------------------------------------
Col PREV_NAME_CASH_LOAN_PURPOSE_Business development_MEAN
Le max 1.0
Le min 0.0
----------------------------------------------------------------------------------------------------
Col PREV_NAME_CASH_LOAN_PURPOSE_Buying a garage_MEAN
Le max 1.0
Le min 0.0
----------------------------------------------------------------------------------------------------
Col PREV_NAME_CASH_LOAN_PURPOSE_Buying a holiday home / land_MEAN
Le max 1.0
Le min 0.0
-----------------------------------------------------------

Le max 1.0
Le min 0.0
----------------------------------------------------------------------------------------------------
Col PREV_NAME_TYPE_SUITE_Spouse, partner_MEAN
Le max 1.0
Le min 0.0
----------------------------------------------------------------------------------------------------
Col PREV_NAME_TYPE_SUITE_Unaccompanied_MEAN
Le max 1.0
Le min 0.0
----------------------------------------------------------------------------------------------------
Col PREV_NAME_TYPE_SUITE_nan_MEAN
Le max 0.0
Le min 0.0
----------------------------------------------------------------------------------------------------
Col PREV_NAME_CLIENT_TYPE_New_MEAN
Le max 1.0
Le min 0.0
----------------------------------------------------------------------------------------------------
Col PREV_NAME_CLIENT_TYPE_Refreshed_MEAN
Le max 1.0
Le min 0.0
----------------------------------------------------------------------------------------------------
Col PREV_NAME_CLIENT_TYPE_Repeater_MEAN
Le max 1.0
Le min 0.0


Le min 0.0
----------------------------------------------------------------------------------------------------
Col PREV_NAME_YIELD_GROUP_XNA_MEAN
Le max 1.0
Le min 0.0
----------------------------------------------------------------------------------------------------
Col PREV_NAME_YIELD_GROUP_high_MEAN
Le max 1.0
Le min 0.0
----------------------------------------------------------------------------------------------------
Col PREV_NAME_YIELD_GROUP_low_action_MEAN
Le max 1.0
Le min 0.0
----------------------------------------------------------------------------------------------------
Col PREV_NAME_YIELD_GROUP_low_normal_MEAN
Le max 1.0
Le min 0.0
----------------------------------------------------------------------------------------------------
Col PREV_NAME_YIELD_GROUP_middle_MEAN
Le max 1.0
Le min 0.0
----------------------------------------------------------------------------------------------------
Col PREV_NAME_YIELD_GROUP_nan_MEAN
Le max 0.0
Le min 0.0
-----------------------

Le max -2.0
Le min -2922.0
----------------------------------------------------------------------------------------------------
Col REFUSED_CNT_PAYMENT_MEAN
Le max 84.0
Le min 0.0
----------------------------------------------------------------------------------------------------
Col REFUSED_CNT_PAYMENT_SUM
Le max 1728.0
Le min 0.0
----------------------------------------------------------------------------------------------------
Col POS_MONTHS_BALANCE_MAX
Le max -1.0
Le min -96.0
----------------------------------------------------------------------------------------------------
Col POS_MONTHS_BALANCE_MEAN
Le max -1.0
Le min -96.0
----------------------------------------------------------------------------------------------------
Col POS_MONTHS_BALANCE_SIZE
Le max 295.0
Le min 1.0
----------------------------------------------------------------------------------------------------
Col POS_SK_DPD_MAX
Le max 4231.0
Le min 0.0
-------------------------------------------------------------

Le min 0.0
----------------------------------------------------------------------------------------------------
Col CC_AMT_PAYMENT_CURRENT_MAX
Le max 4289207.445
Le min 0.0
----------------------------------------------------------------------------------------------------
Col CC_AMT_PAYMENT_CURRENT_MEAN
Le max 1125000.0
Le min 0.0
----------------------------------------------------------------------------------------------------
Col CC_AMT_PAYMENT_CURRENT_SUM
Le max 19347594.795
Le min 0.0
----------------------------------------------------------------------------------------------------
Col CC_AMT_PAYMENT_CURRENT_VAR
Le max 969732777814.8174
Le min 0.0
----------------------------------------------------------------------------------------------------
Col CC_AMT_PAYMENT_TOTAL_CURRENT_MIN
Le max 1125000.0
Le min 0.0
----------------------------------------------------------------------------------------------------
Col CC_AMT_PAYMENT_TOTAL_CURRENT_MAX
Le max 4278315.69
Le min 0.0
--

Le min 0.0
----------------------------------------------------------------------------------------------------
Col CC_NAME_CONTRACT_STATUS_Refused_VAR
Le max 0.0138888888888888
Le min 0.0
----------------------------------------------------------------------------------------------------
Col CC_NAME_CONTRACT_STATUS_Sent proposal_MIN
Le max 0.0
Le min 0.0
----------------------------------------------------------------------------------------------------
Col CC_NAME_CONTRACT_STATUS_Sent proposal_MAX
Le max 1.0
Le min 0.0
----------------------------------------------------------------------------------------------------
Col CC_NAME_CONTRACT_STATUS_Sent proposal_MEAN
Le max 0.024390243902439
Le min 0.0
----------------------------------------------------------------------------------------------------
Col CC_NAME_CONTRACT_STATUS_Sent proposal_SUM
Le max 1.0
Le min 0.0
----------------------------------------------------------------------------------------------------
Col CC_NAME_CONTRAC

In [8]:
list_col_with_inf = ['PREV_APP_CREDIT_PERC_MAX', 'INSTAL_PAYMENT_PERC_MAX',
                     'INSTAL_PAYMENT_PERC_MEAN', 'INSTAL_PAYMENT_PERC_SUM',
                     'REFUSED_APP_CREDIT_PERC_MAX',
                     ]


df.drop(columns=list_col_with_inf, inplace=True)
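The list of columns above is hard-coded. As a minimal sketch, the columns containing infinite values could instead be detected from the data itself; the frame `demo` below is a hypothetical stand-in for `df`, with made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df (values are made up for illustration).
demo = pd.DataFrame({
    "INSTAL_PAYMENT_PERC_MAX": [1.0, np.inf, 0.5],
    "AMT_CREDIT": [406597.5, 1293502.5, 135000.0],
    "REFUSED_APP_CREDIT_PERC_MAX": [-np.inf, 1.2, 0.9],
})

# np.isinf flags both +inf and -inf; restrict to numeric columns first.
numeric = demo.select_dtypes(include="number")
mask = np.isinf(numeric.to_numpy()).any(axis=0)
cols_with_inf = numeric.columns[mask].tolist()
print(cols_with_inf)

# Same effect as a hard-coded drop, but driven by the data.
demo_clean = demo.drop(columns=cols_with_inf)
```

Dropping is only one option; replacing infinities with NaN and imputing them would keep the columns at the cost of an extra preprocessing step.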

In [9]:
def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df

In [10]:
df_reduce = reduce_mem_usage(df)

Memory usage of dataframe is 1858.11 MB
Memory usage after optimization is: 511.15 MB
Decreased by 72.5%
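Downcasting floats is lossy: `float16` keeps only about three significant decimal digits, so a column whose values merely fit within the `float16` range can still be altered. A minimal sketch for measuring that loss before accepting a downcast (the helper `max_downcast_error` is hypothetical, not part of the notebook, and the values are made up):

```python
import numpy as np
import pandas as pd

def max_downcast_error(values: pd.Series, dtype) -> float:
    """Largest absolute difference introduced by casting `values` to `dtype`."""
    original = values.astype(np.float64)
    roundtrip = values.astype(dtype).astype(np.float64)
    return float((original - roundtrip).abs().max())

# Made-up values: both fit in the float16 range, but 2049 is not representable.
s = pd.Series([2049.0, 0.5])
print(max_downcast_error(s, np.float16))  # float16 stores 2049 as 2048
print(max_downcast_error(s, np.float32))
```

A check of this kind could be used to fall back to `float32` whenever the error exceeds a tolerance chosen for the feature.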


## C) Modeling

### 1- Splitting the dataset into train and test sets

In [15]:
X = df_reduce.drop(columns='TARGET')
y = df_reduce['TARGET']

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X,y, 
                                                    test_size=0.2,
                                                   random_state=42)

In [24]:
y_train.value_counts()

0    226201
1     19804
Name: TARGET, dtype: int64

In [23]:
y_test.value_counts()

0    56481
1     5021
Name: TARGET, dtype: int64
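With roughly 8% positives, a purely random split lets the class ratio drift slightly between train and test. A minimal sketch, on synthetic labels rather than the real `TARGET` column, of how the `stratify` argument of `train_test_split` keeps that ratio the same in both parts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (8% positives); not the real TARGET column.
rng = np.random.default_rng(42)
y_demo = np.array([1] * 80 + [0] * 920)
X_demo = rng.normal(size=(1000, 3))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Share of positives in each part.
print(y_tr.mean(), y_te.mean())
```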

In [25]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

### 2- Models

### (a) Models without class balancing

In [45]:
model1 = make_pipeline(SelectKBest(f_classif, k=20),
                      RandomForestClassifier(class_weight='balanced',random_state=0))

In [46]:
model2 = make_pipeline(SelectKBest(f_classif, k=20),
                      LogisticRegression(class_weight='balanced'))

In [38]:
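def evaluation_model(model, X_train,X_test,y_train, y_test):
    
    # SMOTE oversampling; note that it reads the global X_train_scaled,
    # not the X_train argument
    oversampler=SMOTE(random_state=0)
    X_train_sm,y_train_sm=oversampler.fit_resample(X_train_scaled,y_train)
    # The model is fit on the original training data, not on the resampled data
    model.fit(X_train,y_train)
    
    # Train predictions use the resampled data, test predictions use X_test
    y_pred_train = model.predict(X_train_sm)
    y_pred_test = model.predict(X_test)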
    
    acc_train = accuracy_score(y_train_sm, y_pred_train)
    acc_test = accuracy_score(y_test, y_pred_test)
    
    print("Accuracy train : ", acc_train)
    print("Accuracy test : ", acc_test)
    print(confusion_matrix(y_test, y_pred_test))
    print(classification_report(y_test, y_pred_test))
    
    #N, train_score, val_score = learning_curve(model, X_train, y_train,
    #                                          cv=4, scoring='f1',
    #                                          train_sizes=np.linspace(0.1, 1, 10))
    
    
    #plt.figure(figsize=(12, 8))
    #plt.plot(N, train_score.mean(axis=1), label='train score')
    #plt.plot(N, val_score.mean(axis=1), label='validation score')
    #plt.legend()
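In `evaluation_model` above, the resampled data is not what the model is trained on. A minimal sketch of the flow that oversampling usually follows (resample the training set only, fit on the resampled data, score on untouched data) is given below. To stay within scikit-learn, plain random oversampling via `sklearn.utils.resample` stands in for SMOTE here, the data is synthetic, and `oversample_minority` is a hypothetical helper:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

def oversample_minority(X, y, random_state=0):
    """Duplicate class-1 rows at random until both classes have equal size."""
    X_min, X_maj = X[y == 1], X[y == 0]
    X_min_up = resample(X_min, replace=True, n_samples=len(X_maj),
                        random_state=random_state)
    X_bal = np.vstack([X_maj, X_min_up])
    y_bal = np.concatenate([np.zeros(len(X_maj), dtype=int),
                            np.ones(len(X_min_up), dtype=int)])
    return X_bal, y_bal

# Synthetic imbalanced data (about 10% positives).
X_demo, y_demo = make_classification(n_samples=2000, n_features=10,
                                     weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          stratify=y_demo, random_state=0)

# Resample the training part only, then fit on the resampled data.
X_bal, y_bal = oversample_minority(X_tr, y_tr)
model = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)

# Evaluate on data that was never resampled.
print("recall on test:", recall_score(y_te, model.predict(X_te)))
```

With SMOTE itself, the same ordering applies: `fit_resample` on the training split, `fit` on its output, and metrics on the original test split.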

In [47]:
evaluation_model(model1, X_train,X_test,y_train, y_test)

Features [277 282 298 307 395 403 406 432 437 442 452 460 465 480 494 500 504 513
 525 531 549 625 740 745 755 765 770 775 785 786 787 788 789] are constant.
invalid value encountered in true_divide
X does not have valid feature names, but SelectKBest was fitted with feature names


Accuracy train :  0.582837830071485
Accuracy test :  0.9183278592566095
[[56407    74]
 [ 4949    72]]
              precision    recall  f1-score   support

           0       0.92      1.00      0.96     56481
           1       0.49      0.01      0.03      5021

    accuracy                           0.92     61502
   macro avg       0.71      0.51      0.49     61502
weighted avg       0.88      0.92      0.88     61502



In [48]:
evaluation_model(model2, X_train_scaled,X_test_scaled,y_train, y_test)

Features [277 282 298 307 395 403 406 432 437 442 452 460 465 480 494 500 504 513
 525 531 549 625 740 745 755 765 770 775 785 786 787 788 789] are constant.
invalid value encountered in true_divide


Accuracy train :  0.6788144172660598
Accuracy test :  0.6860264706838802
[[38918 17563]
 [ 1747  3274]]
              precision    recall  f1-score   support

           0       0.96      0.69      0.80     56481
           1       0.16      0.65      0.25      5021

    accuracy                           0.69     61502
   macro avg       0.56      0.67      0.53     61502
weighted avg       0.89      0.69      0.76     61502



In [44]:
list(df_reduce.columns)[277]

'BURO_CREDIT_ACTIVE_Sold_MEAN'
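The "Features [...] are constant" warning is raised by `f_classif`, which cannot score a column with zero variance. One way to avoid it, sketched here on synthetic data, is to drop constant columns with `VarianceThreshold` before `SelectKBest`:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
X_demo[:, 3] = 0.0                      # a constant column, like the flagged ones
y_demo = (X_demo[:, 0] > 0).astype(int)

pipe = make_pipeline(
    VarianceThreshold(threshold=0.0),   # removes zero-variance columns
    SelectKBest(f_classif, k=2),
    LogisticRegression(class_weight="balanced"),
)
pipe.fit(X_demo, y_demo)

# Boolean mask of the columns that survive the variance filter.
kept = pipe.named_steps["variancethreshold"].get_support()
print(kept)
```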

**Alternative: select the 15 best features, then re-evaluate the models**

In [57]:
# Create and fit selector
selector = SelectKBest(f_classif, k=15)
selector.fit(X, y)
# Get columns to keep and create new dataframe with those only
cols = selector.get_support(indices=True)
X_new = X.iloc[:,cols]

Features [277 282 298 307 395 403 406 432 437 442 452 460 465 480 494 500 504 513
 525 531 549 625 740 745 755 765 770 775 785 786 787 788 789] are constant.
invalid value encountered in true_divide
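Fitting the selector on the full `X, y` lets the future test rows influence which features are kept. A minimal sketch of the same selection fitted on the training split only, with the selected column names recovered through `get_support` (the frame below is synthetic and its column names are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(500, 6)),
                      columns=[f"feat_{i}" for i in range(6)])
# The label depends on feat_0 and feat_1 only.
y_demo = ((X_demo["feat_0"] + X_demo["feat_1"]) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          random_state=42, stratify=y_demo)

# Fit on the training split only, then apply the same columns to both splits.
selector = SelectKBest(f_classif, k=2).fit(X_tr, y_tr)
selected = X_tr.columns[selector.get_support()]
X_tr_new, X_te_new = X_tr[selected], X_te[selected]
print(list(selected))
```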


In [66]:
X_train, X_test, y_train, y_test = train_test_split(X_new,y, 
                                                    test_size=0.2,
                                                   random_state=42)

In [67]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [68]:
evaluation_model(LogisticRegression(class_weight='balanced'), X_train_scaled,X_test_scaled,y_train, y_test)

Accuracy train :  0.6710934080751191
Accuracy test :  0.6832785925660955
[[38755 17726]
 [ 1753  3268]]
              precision    recall  f1-score   support

           0       0.96      0.69      0.80     56481
           1       0.16      0.65      0.25      5021

    accuracy                           0.68     61502
   macro avg       0.56      0.67      0.53     61502
weighted avg       0.89      0.68      0.75     61502



In [69]:
evaluation_model(RandomForestClassifier(class_weight='balanced',random_state=0), X_train_scaled,X_test_scaled,y_train, y_test)


Accuracy train :  0.6168009867330383
Accuracy test :  0.9182790803551104
[[56434    47]
 [ 4979    42]]
              precision    recall  f1-score   support

           0       0.92      1.00      0.96     56481
           1       0.47      0.01      0.02      5021

    accuracy                           0.92     61502
   macro avg       0.70      0.50      0.49     61502
weighted avg       0.88      0.92      0.88     61502



In [None]:
#pip install imblearn

In [None]:
from imblearn.over_sampling import SMOTE

In [None]:
oversampler=SMOTE(random_state=0)
os_features,os_labels=oversampler.fit_resample(X_train_scaled,y_train)

In [None]:
clf=RandomForestClassifier(random_state=0)
clf.fit(os_features,os_labels)

In [None]:
predictions=clf.predict(X_test_scaled)

In [None]:
predictions_tr=clf.predict(X_train_scaled)

In [None]:
accuracy_score(y_test, predictions)

In [None]:
accuracy_score(y_train, predictions_tr)

#### (a) Logistic regression

In [None]:
clf1 = LogisticRegression(class_weight='balanced')
clf1.fit(X_train_scaled, y_train)

In [None]:
pred_y_lr = clf1.predict(X_test_scaled)
pred_y_lr_train = clf1.predict(X_train_scaled)

#### (b) Random forest

In [None]:
clf2 = RandomForestClassifier(class_weight='balanced')
clf2.fit(X_train, y_train)

In [None]:
pred_y_foret = clf2.predict(X_test)
pred_y_foret_train = clf2.predict(X_train)

### 3- Evaluation

In [None]:
def evaluation_model(model, X_train,X_test,y_train, y_test):
    model.fit(X_train,y_train)
    
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)
    
    acc_train = accuracy_score(y_train, y_pred_train)
    acc_test = accuracy_score(y_test, y_pred_test)
    
    print("Accuracy train : ", acc_train)
    print("Accuracy test : ", acc_test)
    print(confusion_matrix(y_test, y_pred_test))
    print(classification_report(y_test, y_pred_test))
    
    N, train_score, val_score = learning_curve(model, X_train, y_train,
                                              cv=4, scoring='f1',
                                               train_sizes=np.linspace(0.1, 1, 10))
    
    
    plt.figure(figsize=(12, 8))
    plt.plot(N, train_score.mean(axis=1), label='train score')
    plt.plot(N, val_score.mean(axis=1), label='validation score')
    plt.legend()

In [None]:
#model1 = RandomForestClassifier(class_weight='balanced')
evaluation_model(clf1,  X_train_scaled,X_test_scaled,y_train, y_test)

In [None]:
# clf2 was fit on X_train, so the feature names must come from X_train
features = X_train.columns
importances = clf2.feature_importances_
indices = np.argsort(importances)

# customized number 
num_features = 10 

plt.figure(figsize=(10,100))
plt.title('Feature Importances')

# only plot the customized number of features
plt.barh(range(num_features), importances[indices[-num_features:]], color='b', align='center')
plt.yticks(range(num_features), [features[i] for i in indices[-num_features:]])
plt.xlabel('Relative Importance')
plt.show()

In [None]:
pd.DataFrame(clf2.feature_importances_, index=X_train.columns).plot.bar(figsize=(12, 8))

In [None]:
explainer = shap.TreeExplainer(clf2)
shap_values = explainer.shap_values(X_test)

In [37]:
pip install joblib

Note: you may need to restart the kernel to use updated packages.


#### (a) Accuracy

In [None]:
acc_lr_tr = accuracy_score(y_train, pred_y_lr_train)
acc_lr_tr

In [None]:
acc_lr = accuracy_score(y_test, pred_y_lr)
acc_lr

In [None]:
acc_foret_tr = accuracy_score(y_train, pred_y_foret_train)
acc_foret_tr

In [None]:
acc_foret = accuracy_score(y_test, pred_y_foret)
acc_foret

#### (b) Recall

In [None]:
rec_lr = recall_score(y_test, pred_y_lr)
rec_lr

In [None]:
rec_foret = recall_score(y_test, pred_y_foret)
rec_foret

#### (c) Precision

In [None]:
prec_lr = precision_score(y_test, pred_y_lr)
prec_lr

In [None]:
prec_foret = precision_score(y_test, pred_y_foret)
prec_foret

#### (d) F1 score

In [None]:
f1_lr = f1_score(y_test, pred_y_lr)
f1_lr

In [None]:
f1_foret = f1_score(y_test, pred_y_foret)
f1_foret
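The four metrics above can be gathered in one table to compare models side by side. A minimal sketch; `metrics_row` is a hypothetical helper and the label vectors are made up:

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def metrics_row(y_true, y_pred):
    """Return accuracy plus recall, precision and F1 for the positive class."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
    }

# Made-up labels and predictions for two models.
y_true = [0, 0, 0, 0, 1, 1]
preds = {
    "logistic_regression": [0, 1, 0, 0, 1, 1],
    "random_forest": [0, 0, 0, 0, 0, 1],
}
# One row per model, one column per metric.
table = pd.DataFrame({name: metrics_row(y_true, p) for name, p in preds.items()}).T
print(table)
```

In the notebook, `preds` would hold `pred_y_lr` and `pred_y_foret`, and `y_true` would be `y_test`.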

## D) Final model

In [None]:
# Model choice (defined first, because the pipeline below needs it)
# set the tolerance to a large value to make the example faster
model_f = LogisticRegression(max_iter=10000, tol=0.1)

pca = PCA()
# Define a Standard Scaler to normalize inputs
scaler = StandardScaler()

pipe = Pipeline(steps=[("scaler", scaler), ("pca", pca), ("logistic", model_f)])

In [None]:
# Parameters of pipelines can be set using '__' separated parameter names:
param_grid = {
    "pca__n_components": [5, 15, 30, 45, 60],
    "logistic__C": np.logspace(-4, 4, 4),
}
gsv = GridSearchCV(pipe, param_grid, n_jobs=2, cv=5)
# Fit on the training split; the pipeline applies its own scaler
gsv.fit(X_train, y_train)
print("Best parameter (CV score=%0.3f):" % gsv.best_score_)
print(gsv.best_params_)
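`GridSearchCV` scores classifiers by accuracy unless told otherwise, which favors the majority class on imbalanced data. A minimal sketch of a scaler, PCA and logistic regression pipeline searched with `scoring="roc_auc"` and stratified folds; the data is synthetic and the grid is shrunk so that it runs quickly:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (about 10% positives).
X_demo, y_demo = make_classification(n_samples=600, n_features=12,
                                     weights=[0.9, 0.1], random_state=0)

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("logistic", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
param_grid = {
    "pca__n_components": [3, 6],          # must not exceed the number of features
    "logistic__C": np.logspace(-2, 2, 3),
}
# Stratified folds keep the positive rate similar in every fold.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=cv)
search.fit(X_demo, y_demo)

print("Best parameters (CV ROC AUC=%0.3f):" % search.best_score_)
print(search.best_params_)
```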

In [None]:
# Save the logistic regression model
with open('prevision_credit.pkl', 'wb') as f:
    pickle.dump(clf1, f)
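Only the classifier is saved above, yet `clf1` was trained on scaled inputs, so an application loading it would also need the fitted scaler. A minimal sketch that pickles scaler and model together as one pipeline and reloads it; the data and the file name `demo_model.pkl` are hypothetical:

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Scaler and classifier travel together, so raw inputs can be passed at prediction time.
model = make_pipeline(StandardScaler(), LogisticRegression(class_weight="balanced"))
model.fit(X_demo, y_demo)

path = os.path.join(tempfile.gettempdir(), "demo_model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

with open(path, "rb") as f:
    loaded = pickle.load(f)

# Positive-class probabilities from the reloaded pipeline.
print(loaded.predict_proba(X_demo[:3])[:, 1])
```

A pickle should be loaded with the same scikit-learn version that produced it, and only from a trusted source.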

In [None]:
import streamlit as st