# I. Importing Libraries

## Problem statement :

Bank XYZ has seen many customers close their accounts or switch to competitor banks over the past couple of quarters. This has put a large dent in quarterly revenues and might drastically affect annual revenues for the ongoing financial year, causing the stock to plunge and market cap to fall by X %. A team of business, product, engineering and data science people has been put together to arrest this slide.

__Objective__ : Can we build a model to predict, with reasonable accuracy, the customers who are going to churn in the near future? Being able to accurately estimate when they are going to churn would be an added bonus.

__Definition of churn__ : A customer who has closed all their active accounts with the bank is said to have churned. Churn can be defined in other ways as well, based on the context of the problem. For example, a customer who has not transacted for 6 months or 1 year can also be treated as having churned, depending on the business requirements.

__Product Manager's perspective :__  

(1) Business goal : Arrest decrease in revenues or loss of active customers of bank

(2) Identify data source : There are different sources of data. Some of these could be transactional systems, event-based logs, data warehouses (MySQL DBs, Redshift/AWS), data lakes, NoSQL DBs.

(3) Audit for data quality : De-duplication of events/transactions, complete or partial absence of data for chunks of time in between, obscuring PII (personally identifiable information) data

(4) Business and Data-related metrics : Tracking these metrics over time, probably through some intuitive visualizations
    
    (i) Business metrics : Churn rate (month-on-month, weekly/quarterly), trend of avg. number of products per customer, 
        percentage of dormant customers, other such descriptive metrics
    
    (ii) Data-related metrics : F1-score, Recall, Precision
         Recall = TP/(TP + FN) 
         Precision = TP/(TP + FP)
         F1-score = Harmonic mean of Recall and Precision
         where, TP = True Positive, FP = False Positive and FN = False Negative

(5) Prediction model output format : These models don't require deployment. Instead, we can run them periodically (monthly/quarterly), and the list of customers along with their propensity to churn can be shared with the business (Sales/Marketing) or Product team.

* Business metrics : If we take the Recall target as __70%__, which means correctly identifying 70% of the customers who are going to churn in the near future, and assume that business intervention (offers, getting in touch with customers etc.) saves 50% of those identified customers from churning, we can expect at least a __35%__ improvement in Churn Rate (0.70 × 0.50 = 0.35).
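
The 35% figure is the product of the recall target and the assumed save rate. A minimal sketch of that arithmetic follows; the count of 1,000 churners is a hypothetical number used only for illustration.

```python
# Illustration of the business-impact arithmetic described above.
recall_target = 0.70   # share of true churners the model flags (from the text)
save_rate = 0.50       # share of flagged churners retained by intervention (from the text)
churners = 1000        # hypothetical number of customers who would otherwise churn

flagged = recall_target * churners      # churners the model identifies
saved = save_rate * flagged             # identified churners who are retained
churn_reduction = saved / churners      # relative improvement in churn rate

print(f"Flagged: {flagged:.0f}, saved: {saved:.0f}, churn reduction: {churn_reduction:.0%}")
```

The result does not depend on the hypothetical count: any number of churners gives the same 35% relative reduction.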

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [3]:
## Get multiple outputs in the same cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Ignore all warnings
import warnings
warnings.filterwarnings('ignore')
warnings.filterwarnings(action='ignore', category=DeprecationWarning)

In [4]:
## Display all rows and columns of a dataframe instead of a truncated version
from IPython.display import display
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# II. Importing Data and Descriptive Statistical Analysis

In [5]:
df = pd.read_csv('../input/BankingCustomerData.csv')

In [6]:
df.head()

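| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |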


In [7]:
df.shape

(10000, 14)

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           10000 non-null  int64  
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB


In [9]:
continuous_variables = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary', 'Surname_enc', 
                        'Balance_per_product', 'Balance_by_est_salary', 'Tenure_age_ratio', 'AgeSurname_mean_churn']
categorical_variables = ['Gender', 'HasCrCard', 'IsActiveMember', 'Country_France', 'Country_Germany', 'Country_Spain']

In [10]:
continuous_variables        
categorical_variables

['CreditScore',
 'Age',
 'Tenure',
 'Balance',
 'NumOfProducts',
 'EstimatedSalary',
 'Surname_enc',
 'Balance_per_product',
 'Balance_by_est_salary',
 'Tenure_age_ratio',
 'AgeSurname_mean_churn']

['Gender',
 'HasCrCard',
 'IsActiveMember',
 'Country_France',
 'Country_Germany',
 'Country_Spain']

### Separating out train-test-valid sets

Since this is the only data available to us, we keep aside a holdout/test set to evaluate the chosen model at the very end and estimate its performance on unseen data.

A validation set is also created, which we'll use to evaluate and tune our baseline models.

In [11]:
## Separating out different columns into various categories as defined above
target_variable = ['Exited']
cols_to_remove = ['RowNumber', 'CustomerId']

# Tenure and NumOfProducts are ordinal variables. 
continuous_features = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary']

# HasCrCard and IsActiveMember are actually binary categorical variables.
categorical_features = ['Surname', 'Geography', 'Gender', 'HasCrCard', 'IsActiveMember']

In [12]:
## Separating out target variable and removing the non-essential columns
y = df[target_variable].values
df.drop(cols_to_remove, axis=1, inplace=True)

In [13]:
from sklearn.model_selection import train_test_split

## Keeping aside a test/holdout set
df_train_val, df_test, y_train_val, y_test = train_test_split(df, y.ravel(), test_size = 0.2, random_state = 42)

## Splitting into train and validation set
df_train, df_val, y_train, y_val = train_test_split(df_train_val, y_train_val, test_size = 0.12, random_state = 42)

In [14]:
df_train.shape, df_val.shape, df_test.shape, y_train.shape, y_val.shape, y_test.shape
np.mean(y_train), np.mean(y_val), np.mean(y_test)

((7040, 12), (960, 12), (2000, 12), (7040,), (960,), (2000,))

(0.20738636363636365, 0.19166666666666668, 0.1965)
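
The churn rates above differ slightly across the three sets (about 0.207, 0.192 and 0.197) because the splits are purely random. One option is the `stratify` argument of `train_test_split`, which keeps the class proportions nearly identical in every split. The sketch below uses synthetic data, and all `toy_*` names are hypothetical stand-ins rather than part of this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the bank data: 10,000 rows with a roughly 20% positive rate.
rng = np.random.default_rng(42)
toy = pd.DataFrame({"feature": rng.normal(size=10_000)})
toy_y = (rng.random(10_000) < 0.2).astype(int)

# Same 80/20 and 88/12 proportions as above, but stratified on the target.
toy_train_val, toy_test, y_tv, y_te = train_test_split(
    toy, toy_y, test_size=0.2, random_state=42, stratify=toy_y)
toy_train, toy_val, y_tr, y_va = train_test_split(
    toy_train_val, y_tv, test_size=0.12, random_state=42, stratify=y_tv)

print(len(toy_train), len(toy_val), len(toy_test))
print(round(y_tr.mean(), 3), round(y_va.mean(), 3), round(y_te.mean(), 3))
```

With stratification the three printed rates should agree to within about one sample per split, which makes validation and test metrics easier to compare.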

### Spot-checking various ML algorithms

__Steps__ :

- Automate data preparation and model run through Pipelines

- Model Zoo : List of all models to compare/spot-check

- Evaluate using k-fold Cross validation framework

#### Automating data preparation and model run through Pipelines

__Base class for all estimators in scikit-learn.__

All estimators should specify all the parameters that can be set at the class level in their `__init__` as explicit keyword arguments (no `*args` or `**kwargs`).
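
The explicit-keyword rule is what lets scikit-learn read parameters back through `get_params()` and rebuild an estimator with `clone()`, which cross-validation relies on. Below is a minimal sketch with a hypothetical toy transformer, `ColumnDropper`, which is not part of this notebook.

```python
from sklearn.base import BaseEstimator, TransformerMixin, clone

class ColumnDropper(BaseEstimator, TransformerMixin):
    """Hypothetical toy transformer: drops the listed columns from a DataFrame."""

    def __init__(self, cols=None):
        # Store the argument unchanged; BaseEstimator reads it back via get_params().
        self.cols = cols

    def fit(self, X, y=None):
        # Nothing to learn for this toy transformer.
        return self

    def transform(self, X):
        return X.drop(columns=self.cols or [])

dropper = ColumnDropper(cols=["RowNumber"])
print(dropper.get_params())   # parameters discovered from the __init__ signature
print(clone(dropper).cols)    # clone() rebuilds the estimator from those parameters
```

Because `TransformerMixin` supplies `fit_transform`, a transformer only needs its own `fit_transform` when fitting and transforming must differ, as with the leave-one-out encoding below.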

In [15]:
from sklearn.base import BaseEstimator, TransformerMixin

In [16]:
class CategoricalEncoder(BaseEstimator, TransformerMixin):
    """ 
    Encodes categorical columns using LabelEncoding, OneHotEncoding and TargetEncoding.
    LabelEncoding is used for binary categorical columns
    OneHotEncoding is used for columns with <= 10 distinct values
    TargetEncoding is used for columns with higher cardinality (>10 distinct values)
    
    """

    def __init__(self, cols = None, label_encoder_cols = None, onehot_encoder_cols = None, target_encoding_cols = None, 
                 reduce_df = False):
        """
        
        Parameters
        ----------
        cols : list of str
            Columns to encode.  Default is to one-hot/target/label encode all categorical columns in the DataFrame.
        reduce_df : bool
            Whether to use reduced degrees of freedom for encoding
            (that is, add N-1 one-hot columns for a column with N 
            categories). E.g. for a column with categories A, B, 
            and C: When reduce_df is True, A=[1, 0], B=[0, 1],
            and C=[0, 0].  When reduce_df is False, A=[1, 0, 0], 
            B=[0, 1, 0], and C=[0, 0, 1]
            Default = False
        
        """
        
        if isinstance(cols, str):
            self.cols = [cols]
        else :
            self.cols = cols
        
        if isinstance(label_encoder_cols, str):
            self.label_encoder_cols = [label_encoder_cols]
        else :
            self.label_encoder_cols = label_encoder_cols
        
        if isinstance(onehot_encoder_cols, str):
            self.onehot_encoder_cols = [onehot_encoder_cols]
        else :
            self.onehot_encoder_cols = onehot_encoder_cols
        
        if isinstance(target_encoding_cols, str):
            self.target_encoding_cols = [target_encoding_cols]
        else :
            self.target_encoding_cols = target_encoding_cols
        
        self.reduce_df = reduce_df
    
    
    def fit(self, X, y):
        """Fit label/one-hot/target encoder to X and y
        
        Parameters
        ----------
        X : pandas DataFrame, shape [n_samples, n_columns]
            DataFrame containing columns to encode
        y : numpy array, shape = [n_samples]
            Target values.
            
        Returns
        -------
        self : encoder
            Returns self.
        """
        
        # Encode all categorical cols by default
        if self.cols is None:
            self.cols = [c for c in X if str(X[c].dtype)=='object']

        # Check columns are in X
        for col in self.cols:
            if col not in X:
                raise ValueError('Column \''+col+'\' not in X')
        
        # Separating out lcols, ohecols and tcols
        if self.label_encoder_cols is None:
            self.label_encoder_cols = [c for c in self.cols if X[c].nunique() <= 2]
        
        if self.onehot_encoder_cols is None:
            self.onehot_encoder_cols = [c for c in self.cols if ((X[c].nunique() > 2) & (X[c].nunique() <= 10))]
        
        if self.target_encoding_cols is None:
            self.target_encoding_cols = [c for c in self.cols if X[c].nunique() > 10]
        
        
        ## Create Label Encoding mapping
        self.label_encoder_maps = dict()
        for col in self.label_encoder_cols:
            self.label_encoder_maps[col] = dict(zip(X[col].values, X[col].astype('category').cat.codes.values))
        
        
        ## Create OneHot Encoding mapping
        self.onehot_encoder_maps = dict() #dict to store map for each column
        for col in self.onehot_encoder_cols:
            self.onehot_encoder_maps[col] = []
            uniques = X[col].unique()
            for unique in uniques:
                self.onehot_encoder_maps[col].append(unique)
            if self.reduce_df:
                del self.onehot_encoder_maps[col][-1]
        
        
        ## Create Target Encoding mapping
        self.global_target_mean = y.mean().round(2)
        self.sum_count = dict()
        for col in self.target_encoding_cols:
            self.sum_count[col] = dict()
            uniques = X[col].unique()
            for unique in uniques:
                ix = X[col]==unique
                self.sum_count[col][unique] = (y[ix].sum(),ix.sum())
        
        
        ## Return the fit object
        return self
    
    
    def transform(self, X, y=None):
        """Perform label/one-hot/target encoding transformation.
        
        Parameters
        ----------
        X : pandas DataFrame, shape [n_samples, n_columns]
            DataFrame containing columns to label encode
            
        Returns
        -------
        pandas DataFrame
            Input DataFrame with transformed columns
        """
        
        Xo = X.copy()
        ## Perform label encoding transformation
        for col, lmap in self.label_encoder_maps.items():
            # Map the column
            Xo[col] = Xo[col].map(lmap).fillna(-1) ## Filling new values with -1
        
        
        ## Perform one-hot encoding transformation
        for col, vals in self.onehot_encoder_maps.items():
            for val in vals:
                new_col = col+'_'+str(val)
                Xo[new_col] = (Xo[col]==val).astype('uint8')
            del Xo[col]
        
        
        ## Perform LOO target encoding transformation
        # Use normal target encoding if this is test data
        if y is None:
            for col in self.sum_count:
                vals = np.full(X.shape[0], np.nan)
                for cat, sum_count in self.sum_count[col].items():
                    vals[X[col]==cat] = (sum_count[0]/sum_count[1]).round(2)
                # Filling new values by global target mean
                Xo[col] = np.where(np.isnan(vals), self.global_target_mean, vals)

        # LOO target encode each column
        else:
            for col in self.sum_count:
                vals = np.full(X.shape[0], np.nan)
                for cat, sum_count in self.sum_count[col].items():
                    ix = X[col]==cat
                    if sum_count[1] > 1:
                        vals[ix] = ((sum_count[0]-y[ix].reshape(-1,))/(sum_count[1]-1)).round(2)
                    else :
                        vals[ix] = ((y.sum() - y[ix])/(X.shape[0] - 1)).round(2) # Catering to the case where a particular 
                                                                                 # category level occurs only once in the dataset
                
                # Filling new values by global target mean
                Xo[col] = np.where(np.isnan(vals), self.global_target_mean, vals)
        
        
        ## Return encoded DataFrame
        return Xo
    
    
    def fit_transform(self, X, y=None):
        """Fit and transform the data via label/one-hot/target encoding.
        
        Parameters
        ----------
        X : pandas DataFrame, shape [n_samples, n_columns]
            DataFrame containing columns to encode
        y : numpy array, shape = [n_samples]
            Target values (required!).

        Returns
        -------
        pandas DataFrame
            Input DataFrame with transformed columns
        """
        
        return self.fit(X, y).transform(X, y)
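
The leave-one-out (LOO) target encoding used on training data can be easier to follow on a tiny example. The sketch below reproduces the same idea with pandas `groupby` on hypothetical toy rows. It is an illustration of the technique, not a replacement for the class above, and it skips the rounding to two decimals.

```python
import pandas as pd

# Toy data: hypothetical surnames and churn labels, for illustration only.
toy = pd.DataFrame({
    "Surname": ["Hill", "Hill", "Hill", "Onio", "Onio", "Boni"],
    "Exited":  [1, 0, 1, 0, 1, 1],
})

grp = toy.groupby("Surname")["Exited"]
cat_sum = grp.transform("sum")      # per-row sum of the target within the row's category
cat_count = grp.transform("count")  # per-row size of the row's category

# Plain target encoding: the category mean (used when no y is passed to transform).
toy["te_plain"] = cat_sum / cat_count

# Leave-one-out: remove the row's own label from its category statistics.
loo = (cat_sum - toy["Exited"]) / (cat_count - 1)

# Categories seen only once fall back to the mean target of all other rows.
global_loo = (toy["Exited"].sum() - toy["Exited"]) / (len(toy) - 1)
toy["te_loo"] = loo.where(cat_count > 1, global_loo)

print(toy)
```

Excluding each row's own label keeps the encoded feature from leaking that row's target into the model during training, which matters most for rare surnames.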
    


In [17]:
class AddFeatures(BaseEstimator):
    """
    Add new, engineered features using original categorical and numerical features of the DataFrame
    """
    
    def __init__(self, eps = 1e-6):
        """
        Parameters
        ----------
        eps : A small value to avoid divide by zero error. Default value is 0.000001
        """
        
        self.eps = eps
    
    
    def fit(self, X, y=None):
        return self
    
    
    def transform(self, X):
        """
        Parameters
        ----------
        X : pandas DataFrame, shape [n_samples, n_columns]
            DataFrame containing base columns using which new interaction-based features can be engineered
        """
        Xo = X.copy()
    
        # Add 4 new columns - Balance_per_product, Balance_by_est_salary, Tenure_age_ratio, AgeSurname_mean_churn
        Xo['Balance_per_product'] = Xo.Balance/(Xo.NumOfProducts + self.eps)
        Xo['Balance_by_est_salary'] = Xo.Balance/(Xo.EstimatedSalary + self.eps)
        Xo['Tenure_age_ratio'] = Xo.Tenure/(Xo.Age + self.eps)
        # Surname is expected to be numeric here (target-encoded by CategoricalEncoder earlier in the pipeline)
        Xo['AgeSurname_mean_churn'] = np.sqrt(Xo.Age) * Xo.Surname
        
        ## Returning the updated dataframe
        return Xo
    
    
    def fit_transform(self, X, y=None):
        """
        Parameters
        ----------
        X : pandas DataFrame, shape [n_samples, n_columns]
            DataFrame containing base columns using which new interaction-based features can be engineered
        """
        return self.fit(X,y).transform(X)
    
    

In [18]:
class CustomScaler(BaseEstimator, TransformerMixin):
    """
    A custom standard scaler class with the ability to apply scaling on selected columns
    """
    
    def __init__(self, scaling_cols = None):
        """
        Parameters
        ----------
        scaling_cols : list of str
            Columns on which to perform scaling and normalization. Default is to scale all numerical columns
        
        """
        self.scaling_cols = scaling_cols
    
    
    def fit(self, X, y=None):
        """
        Parameters
        ----------
        X : pandas DataFrame, shape [n_samples, n_columns]
            DataFrame containing columns to scale
        """
        
        # Scaling all non-categorical columns if user doesn't provide the list of columns to scale
        if self.scaling_cols is None:
            self.scaling_cols = [c for c in X if ((str(X[c].dtype).find('float') != -1) or (str(X[c].dtype).find('int') != -1))]
        
     
        ## Create mapping corresponding to scaling and normalization
        self.scaling_maps = dict()
        for col in self.scaling_cols:
            self.scaling_maps[col] = dict()
            self.scaling_maps[col]['mean'] = np.mean(X[col].values).round(2)
            self.scaling_maps[col]['std_dev'] = np.std(X[col].values).round(2)
        
        # Return fit object
        return self
    
    
    def transform(self, X):
        """
        Parameters
        ----------
        X : pandas DataFrame, shape [n_samples, n_columns]
            DataFrame containing columns to scale
        """
        Xo = X.copy()
        
        ## Map transformation to respective columns
        for col in self.scaling_cols:
            Xo[col] = (Xo[col] - self.scaling_maps[col]['mean']) / self.scaling_maps[col]['std_dev']
        
        
        # Return scaled and normalized DataFrame
        return Xo
    
    
    def fit_transform(self, X, y=None):
        """
        Parameters
        ----------
        X : pandas DataFrame, shape [n_samples, n_columns]
            DataFrame containing columns to scale
        """
        # Fit and return transformed dataframe
        return self.fit(X).transform(X)
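
For comparison, the column-wise z-score computed by `CustomScaler` is the same transformation that scikit-learn's `StandardScaler` applies, apart from the rounding of the fitted mean and standard deviation to two decimals in the class above. The sketch below checks this on a toy frame whose values are borrowed from the first rows of `df.head()`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame built from the first five rows shown earlier; only two columns are scaled.
toy = pd.DataFrame({
    "Age": [42, 41, 42, 39, 43],
    "Balance": [0.0, 83807.86, 159660.8, 0.0, 125510.82],
    "HasCrCard": [1, 0, 1, 0, 1],
})
cols = ["Age", "Balance"]

# Manual z-score with the population standard deviation (ddof=0, as np.std uses by default).
manual = (toy[cols] - toy[cols].mean()) / toy[cols].std(ddof=0)

# StandardScaler uses the same mean and ddof=0 standard deviation.
sk = StandardScaler().fit_transform(toy[cols])

print(np.allclose(manual.to_numpy(), sk))
```

A custom class is still convenient here because it returns a DataFrame with column names intact, which the later pipeline steps rely on.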
    
    

### Pipeline for Decision Tree Classifier

In [19]:
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

## Importing relevant metrics
from sklearn.metrics import roc_auc_score, f1_score, recall_score, confusion_matrix, classification_report

In [20]:
X = df_train.drop(columns = ['Exited'])
X_val = df_val.drop(columns = ['Exited'])

## Scaling only continuous columns
columns_to_scale = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary', 'Surname', 
                    'Balance_per_product', 'Balance_by_est_salary', 'Tenure_age_ratio', 'AgeSurname_mean_churn']

## Assigning class weights
weights_dict = {0 : 1, 1 : 4}

decision_tree_classifier = DecisionTreeClassifier(criterion = 'entropy', class_weight = weights_dict, max_depth = 4, max_features = None
                            , min_samples_split = 25, min_samples_leaf = 15)
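
The 1:4 class weights roughly mirror the class imbalance. The training set has 7040 rows with a churn rate of about 0.2074, that is, 1460 churners and 5580 non-churners. The sketch below compares the hand-picked weights with what scikit-learn's `'balanced'` heuristic would give for those counts; the label array is reconstructed from the counts only.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Labels reconstructed from the training-set class counts (5580 stayed, 1460 churned).
y_counts = np.array([0] * 5580 + [1] * 1460)

# 'balanced' assigns n_samples / (n_classes * class_count) to each class.
balanced = compute_class_weight(class_weight="balanced",
                                classes=np.array([0, 1]), y=y_counts)
ratio = balanced[1] / balanced[0]

print(balanced)          # per-class weights under the 'balanced' heuristic
print(round(ratio, 2))   # relative weight of a churner versus a non-churner
```

The ratio equals 5580/1460, so a weight of 4 on the churn class is slightly more aggressive than exact balancing and should push the tree a little further towards recall.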

In [21]:
dt_model = Pipeline(steps = [ ('Feature_Encoding', CategoricalEncoder()),
                              ('Feature_Extraction', AddFeatures()),
                              ('Feature_Scaling', CustomScaler(columns_to_scale)),
                              ('classifiers', decision_tree_classifier)
                            ]
                   )

# Fit pipeline with training data
dt_model.fit(X,y_train)

In [22]:
y_dt_train_predicted = dt_model.predict(X)
y_dt_validation_predicted = dt_model.predict(X_val)

# summarize the fit of the model
print('Classification report for the training data \n',classification_report(y_train, y_dt_train_predicted))
print('Classification report for the validation data \n',classification_report(y_val, y_dt_validation_predicted))
print('Confusion matrix for the train data- \n',confusion_matrix(y_train, y_dt_train_predicted))
print('\nConfusion matrix for the validation data- \n',confusion_matrix(y_val, y_dt_validation_predicted))

Classification report for the training data 
               precision    recall  f1-score   support

           0       0.92      0.76      0.83      5580
           1       0.44      0.73      0.55      1460

    accuracy                           0.75      7040
   macro avg       0.68      0.75      0.69      7040
weighted avg       0.82      0.75      0.77      7040

Classification report for the validation data 
               precision    recall  f1-score   support

           0       0.93      0.79      0.85       776
           1       0.45      0.74      0.56       184

    accuracy                           0.78       960
   macro avg       0.69      0.77      0.71       960
weighted avg       0.84      0.78      0.80       960

Confusion matrix for the train data- 
 [[4242 1338]
 [ 392 1068]]

Confusion matrix for the validation data- 
 [[610 166]
 [ 47 137]]
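
The class-1 row of the validation report can be recomputed from the validation confusion matrix with the Recall, Precision and F1 definitions given at the start. A short sketch:

```python
import numpy as np

# Validation confusion matrix printed above; rows are actual, columns are predicted.
cm = np.array([[610, 166],
               [ 47, 137]])
tn, fp, fn, tp = cm.ravel()

recall = tp / (tp + fn)                              # TP / (TP + FN)
precision = tp / (tp + fp)                           # TP / (TP + FP)
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print(f"recall={recall:.2f}, precision={precision:.2f}, f1={f1:.2f}")
# prints: recall=0.74, precision=0.45, f1=0.56
```

These match the validation report: the tree finds about three in four churners, but fewer than half of the customers it flags actually churn.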


### Pipeline for RandomForest, LGBM, XGB, Naive Bayes (Gaussian/Multinomial), kNN

In [23]:
## Preparing data and a few common model parameters
X = df_train.drop(columns = ['Exited'])
y = y_train.ravel()

weights_dict = {0 : 1, 1 : 4}

weight = 4

In [24]:
## Importing the models to be tried out
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB, ComplementNB, BernoulliNB
from sklearn.model_selection import cross_val_score, KFold

In [25]:
## Preparing a list of models to try out in the spot-checking process
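def classification_models(models_dict = None):
    # Avoid a mutable default argument: create a fresh dict on each call
    if models_dict is None:
        models_dict = dict()
    # Tree models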
    for n_trees in [21, 1001]:
        models_dict['RF_' + str(n_trees)] = RandomForestClassifier(n_estimators = n_trees, n_jobs = -1, criterion = 'entropy'
                                                              , class_weight = weights_dict, max_depth = 6, max_features = 0.6
                                                              , min_samples_split = 30, min_samples_leaf = 20)
        
        models_dict['LGBM_' + str(n_trees)] = LGBMClassifier(boosting_type='dart', num_leaves=31, max_depth= 6, 
                                                             learning_rate=0.1, n_estimators=n_trees, class_weight=weights_dict, 
                                                             min_child_samples=20, colsample_bytree=0.6, reg_alpha=0.3, 
                                                             reg_lambda=1.0, n_jobs=-1, importance_type = 'gain')
        
        models_dict['XGB_' + str(n_trees)] = XGBClassifier(objective='binary:logistic', n_estimators = n_trees, max_depth = 6
                                                      , learning_rate = 0.03, n_jobs = -1, colsample_bytree = 0.6
                                                      , reg_alpha = 0.3, reg_lambda = 0.1, scale_pos_weight = weight)
        
        models_dict['ETC_' + str(n_trees)] = ExtraTreesClassifier(n_estimators=n_trees, criterion = 'entropy', max_depth = 6
                                                            , max_features = 0.6, n_jobs = -1, class_weight = weights_dict
                                                            , min_samples_split = 30, min_samples_leaf = 20)
    
    # kNN models
    for n in [3,5,11]:
        models_dict['KNN_' + str(n)] = KNeighborsClassifier(n_neighbors=n)
    
    # Naive-Bayes models
    models_dict['gauss_nb'] = GaussianNB()
    models_dict['multi_nb'] = MultinomialNB()
    models_dict['compl_nb'] = ComplementNB()
    models_dict['bern_nb'] = BernoulliNB()
    
    return models_dict

In [26]:
## Automation of data preparation and model run through pipelines
def make_pipeline(model):
    '''
    Creates pipeline for the model passed as the argument. Uses standard scaling only in case of kNN models. 
    Ignores scaling step for tree/Naive Bayes models
    '''
    
    if isinstance(model, KNeighborsClassifier):
        pipe =  Pipeline(steps = [('Feature_Encoding', CategoricalEncoder()),
                                  ('Feature_Extraction', AddFeatures()),
                                  ('Feature_Scaling', CustomScaler(columns_to_scale)),
                                  ('classifiers', model)
                             ])
    else :
        pipe =  Pipeline(steps = [('Feature_Encoding', CategoricalEncoder()),
                                  ('Feature_Extraction', AddFeatures()),
                                  ('classifiers', model)
                                ])
    
    
    return pipe


In [27]:
## Run/Evaluate all 15 models using 5-fold cross-validation (an integer cv uses stratified folds for classifiers)
def evaluate_classification_models(X, y, models, folds = 5, metric = 'recall'):
    results = dict()
    for name, model in models.items():
        # Evaluate model through automated pipelines
        pipeline = make_pipeline(model)
        scores = cross_val_score(pipeline, X, y, cv = folds, scoring = metric, n_jobs = -1)
        
        # Store results of the evaluated model
        results[name] = scores
        mu, sigma = np.mean(scores), np.std(scores)
        # Printing individual model results
        print('Model {}: mean = {}, std_dev = {}'.format(name, mu, sigma))
    
    return results
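
The cross-validation call can be tried in isolation. The sketch below runs `cross_val_score` with an explicit `StratifiedKFold` on synthetic data from `make_classification`. The data, the single decision tree and the `*_toy` names are stand-ins chosen for illustration, not the notebook's pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the churn data (roughly 80/20 classes).
X_toy, y_toy = make_classification(n_samples=1000, n_features=8,
                                   weights=[0.8, 0.2], random_state=42)

clf = DecisionTreeClassifier(max_depth=4, class_weight={0: 1, 1: 4}, random_state=42)

# Spelling out the splitter makes the fold strategy explicit and allows
# shuffling with a fixed seed for reproducible folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(clf, X_toy, y_toy, cv=cv, scoring="recall")

print(scores.shape, np.mean(scores), np.std(scores))
```

Passing a full pipeline instead of a bare classifier, as the function above does, means the encoders are refit inside every fold, so no statistics from the held-out fold leak into training.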

In [28]:
# Spot-checking in action
models = classification_models()
print('Recall metric')
results_recall = evaluate_classification_models(X, y , models, metric = 'recall')
print('F1-score metric')
results_f1 = evaluate_classification_models(X, y , models, metric = 'f1')

Recall metric
Model RF_21: mean = 0.7431506849315068, std_dev = 0.03256127212056086
Model LGBM_21: mean = 0.7602739726027397, std_dev = 0.02748274335306698
Model XGB_21: mean = 0.7410958904109589, std_dev = 0.020387501460461945
Model ETC_21: mean = 0.7493150684931507, std_dev = 0.033160872429890985
Model RF_1001: mean = 0.7410958904109589, std_dev = 0.0287018175887082
Model LGBM_1001: mean = 0.6020547945205479, std_dev = 0.02013279240643707
Model XGB_1001: mean = 0.6068493150684932, std_dev = 0.018556461896087777
Model ETC_1001: mean = 0.7561643835616438, std_dev = 0.028848547518702302
Model KNN_3: mean = 0.4232876712328767, std_dev = 0.011583242825539568
Model KNN_5: mean = 0.4013698630136986, std_dev = 0.013595502220054258
Model KNN_11: mean = 0.347945205479452, std_dev = 0.017408582228957303
Model gauss_nb: mean = 0.04452054794520548, std_dev = 0.060375070879558936
Model multi_nb: mean = 0.5410958904109588, std_dev = 0.022613115094820772
Model compl_nb: mean = 0.5410958904109588, st

### Let us try LGBM Model with HyperParameter Tuning

In [29]:
lgbm_model_01 = LGBMClassifier(boosting_type='dart', num_leaves=45, max_depth= 6,
                               learning_rate=0.1, n_estimators=90, class_weight={0 : 1, 1 : 3}, 
                               min_child_samples=20, colsample_bytree=0.6, reg_alpha=0.3, 
                               reg_lambda=1.0, n_jobs=-1, importance_type = 'gain', force_col_wise=True)

In [30]:
lgbm_model = Pipeline(steps = [ ('Feature_Encoding', CategoricalEncoder()),
                              ('Feature_Extraction', AddFeatures()),
                              # ('Feature_Scaling', CustomScaler(columns_to_scale)),
                              ('classifiers', lgbm_model_01)
                            ]
                   )

# Fit pipeline with training data
lgbm_model.fit(X,y_train)

[LightGBM] [Info] Number of positive: 1460, number of negative: 5580
[LightGBM] [Info] Total Bins 1921
[LightGBM] [Info] Number of data points in the train set: 7040, number of used features: 17
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.439759 -> initscore=-0.242140
[LightGBM] [Info] Start training from score -0.242140
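
The `pavg` and `initscore` values in the log are consistent with a class-weighted share of positives and its log-odds, given the `{0 : 1, 1 : 3}` weights passed to the classifier. The sketch below reproduces the two numbers. It is a reading of the log output, not a description of LightGBM internals.

```python
import math

# Class counts from the log above and the class weights passed to LGBMClassifier.
n_pos, n_neg = 1460, 5580
w_pos, w_neg = 3, 1

# Weighted share of positives, and its log-odds.
pavg = (w_pos * n_pos) / (w_pos * n_pos + w_neg * n_neg)
init_score = math.log(pavg / (1 - pavg))

print(f"pavg={pavg:.6f}, initscore={init_score:.6f}")
# prints: pavg=0.439759, initscore=-0.242140
```

Without the weights the share of positives would be 1460/7040, about 0.207, so the weighting starts boosting from a much less negative score.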


In [31]:
y_lgbm_train_predicted = lgbm_model.predict(X)
y_lgbm_validation_predicted = lgbm_model.predict(X_val)

# summarize the fit of the model
print('Classification report for the training data \n',classification_report(y_train, y_lgbm_train_predicted))
print('Classification report for the validation data \n',classification_report(y_val, y_lgbm_validation_predicted))
print('Confusion matrix for the train data- \n',confusion_matrix(y_train, y_lgbm_train_predicted))
print('\nConfusion matrix for the validation data- \n',confusion_matrix(y_val, y_lgbm_validation_predicted))

Classification report for the training data 
               precision    recall  f1-score   support

           0       0.92      0.88      0.90      5580
           1       0.61      0.72      0.66      1460

    accuracy                           0.84      7040
   macro avg       0.76      0.80      0.78      7040
weighted avg       0.86      0.84      0.85      7040

Classification report for the validation data 
               precision    recall  f1-score   support

           0       0.93      0.87      0.90       776
           1       0.56      0.71      0.63       184

    accuracy                           0.84       960
   macro avg       0.74      0.79      0.76       960
weighted avg       0.86      0.84      0.85       960

Confusion matrix for the train data- 
 [[4902  678]
 [ 414 1046]]

Confusion matrix for the validation data- 
 [[674 102]
 [ 53 131]]


### Random Forest Ensemble Technique

In [33]:
RF_model=RandomForestClassifier(max_depth=7,max_features=5,min_samples_leaf=10,min_samples_split=25,n_estimators=40,
                                class_weight='balanced',random_state=1)

rf_pipeline_model = Pipeline(steps = [ ('Feature_Encoding', CategoricalEncoder()),
                                       ('Feature_Extraction', AddFeatures()),
                                       # ('Feature_Scaling', CustomScaler(columns_to_scale)),
                                       ('classifiers', RF_model)
                                     ]
                            )

# Fit pipeline with training data
rf_pipeline_model.fit(X,y_train)


y_rf_train_predicted = rf_pipeline_model.predict(X)
y_rf_validation_predicted = rf_pipeline_model.predict(X_val)

# summarize the fit of the model
print('\n\nClassification report for the training data \n',classification_report(y_train, y_rf_train_predicted))
print('Classification report for the validation data \n',classification_report(y_val, y_rf_validation_predicted))
print('Confusion matrix for the train data- \n',confusion_matrix(y_train, y_rf_train_predicted))
print('\nConfusion matrix for the validation data- \n',confusion_matrix(y_val, y_rf_validation_predicted))



Classification report for the training data 
               precision    recall  f1-score   support

           0       0.93      0.85      0.89      5580
           1       0.56      0.75      0.64      1460

    accuracy                           0.83      7040
   macro avg       0.75      0.80      0.76      7040
weighted avg       0.85      0.83      0.84      7040

Classification report for the validation data 
               precision    recall  f1-score   support

           0       0.94      0.84      0.89       776
           1       0.54      0.76      0.63       184

    accuracy                           0.83       960
   macro avg       0.74      0.80      0.76       960
weighted avg       0.86      0.83      0.84       960

Confusion matrix for the train data- 
 [[4729  851]
 [ 366 1094]]

Confusion matrix for the validation data- 
 [[655 121]
 [ 44 140]]


### Decision Tree Classification Tuned with Best Params from Basic Models

In [36]:
dt_model = DecisionTreeClassifier(class_weight ='balanced', criterion = 'entropy', max_depth= 10, 
                                  min_samples_leaf= 24, min_samples_split = 9)

dt_pipeline_model = Pipeline(steps = [ ('Feature_Encoding', CategoricalEncoder()),
                                       ('Feature_Extraction', AddFeatures()),
                                       # ('Feature_Scaling', CustomScaler(columns_to_scale)),
                                       ('classifiers', dt_model)
                                     ]
                            )

# Fit pipeline with training data
dt_pipeline_model.fit(X,y_train)

y_dt_train_predicted = dt_pipeline_model.predict(X)
y_dt_validation_predicted = dt_pipeline_model.predict(X_val)

# summarize the fit of the model
print('Classification report for the training data \n',classification_report(y_train, y_dt_train_predicted))
print('Classification report for the validation data \n',classification_report(y_val, y_dt_validation_predicted))
print('Confusion matrix for the train data- \n',confusion_matrix(y_train, y_dt_train_predicted))
print('\nConfusion matrix for the validation data- \n',confusion_matrix(y_val, y_dt_validation_predicted))

Classification report for the training data 
               precision    recall  f1-score   support

           0       0.95      0.79      0.86      5580
           1       0.51      0.83      0.64      1460

    accuracy                           0.80      7040
   macro avg       0.73      0.81      0.75      7040
weighted avg       0.86      0.80      0.82      7040

Classification report for the validation data 
               precision    recall  f1-score   support

           0       0.93      0.77      0.84       776
           1       0.43      0.74      0.55       184

    accuracy                           0.76       960
   macro avg       0.68      0.76      0.69       960
weighted avg       0.83      0.76      0.78       960

Confusion matrix for the train data- 
 [[4431 1149]
 [ 243 1217]]

Confusion matrix for the validation data- 
 [[597 179]
 [ 47 137]]


### Considering All the Metrics, Random Forest Seems to be the Best Model w.r.t. Validation Recall
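
The comparison behind this choice can be re-derived from the validation confusion matrices printed above. The sketch below is a minimal, dependency-free recomputation; the helper name `class1_metrics` is introduced here for illustration and is not part of the notebook. It assumes scikit-learn's confusion-matrix layout, `[[TN, FP], [FN, TP]]`.

```python
def class1_metrics(cm):
    """Return (precision, recall, f1) for the churn class of a 2x2 confusion matrix.

    cm is assumed to follow scikit-learn's layout: [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Validation confusion matrices copied from the outputs above
rf_val_cm = [[655, 121], [44, 140]]
dt_val_cm = [[597, 179], [47, 137]]

for name, cm in [("Random Forest", rf_val_cm), ("Decision Tree", dt_val_cm)]:
    p, r, f1 = class1_metrics(cm)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

The rounded values agree with the class-1 rows of the validation reports: recall 0.76 vs. 0.74 and precision 0.54 vs. 0.43 in favour of the random forest. The decision tree has the higher training recall (0.83 vs. 0.75) but loses more of it on validation, which suggests it overfits more than the random forest does.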

### Saving the Random Forest Pipeline Model

In [37]:
import pickle
import os

filename = '../output/BankingCustomerChurnPrediction.sav'
os.makedirs(os.path.dirname(filename), exist_ok=True)
with open(filename, 'wb') as f:
    pickle.dump(rf_pipeline_model, f)

### Loading the Saved Model and Predicting with Test Data

In [38]:
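import pickle
# The pipeline was saved with pickle.dump, so load it with pickle for consistency
with open('../output/BankingCustomerChurnPrediction.sav', 'rb') as f:
    model = pickle.load(f)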
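
A saved artifact is easiest to trust when it is written and read with the same library and then checked after reloading. The sketch below illustrates that round trip with the standard-library `pickle` module only. The `model_state` dictionary and the `predict` helper are hypothetical stand-ins for the fitted pipeline, and the file is written to a temporary directory rather than the notebook's output path.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted model: just a decision threshold
model_state = {"threshold": 0.5}


def predict(state, scores):
    """Label a score 1 when it reaches the stored threshold, else 0."""
    return [int(s >= state["threshold"]) for s in scores]


scores = [0.18, 0.79, 0.50]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.sav")

    # Write and read with the same library
    with open(path, "wb") as f:
        pickle.dump(model_state, f)
    with open(path, "rb") as f:
        restored_state = pickle.load(f)

# The reloaded object should reproduce the original predictions exactly
same_predictions = predict(restored_state, scores) == predict(model_state, scores)
print("Reloaded model matches original:", same_predictions)
```

The same check applies to the real pipeline: predicting on a few validation rows before saving and again after loading should give identical results. Pickled files should only be loaded from trusted sources, since unpickling can execute arbitrary code.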

In [39]:
X_test = df_test.drop(columns=['Exited'])
X_test.shape
y_test.shape

(2000, 11)

(2000,)

In [40]:
y_test_predicted = model.predict(X_test)
print('Classification report for the test data \n',classification_report(y_test, y_test_predicted))
print('Confusion matrix for the test data- \n',confusion_matrix(y_test, y_test_predicted))

Classification report for the test data 
               precision    recall  f1-score   support

           0       0.93      0.84      0.88      1607
           1       0.52      0.74      0.61       393

    accuracy                           0.82      2000
   macro avg       0.73      0.79      0.75      2000
weighted avg       0.85      0.82      0.83      2000

Confusion matrix for the test data- 
 [[1342  265]
 [ 103  290]]


In [45]:
# Adding predictions and their probabilities in the original test dataframe
test_probs = model.predict_proba(X_test)[:,1]
df_test['Predictions'] = y_test_predicted
df_test['Prediction_Probabilities'] = test_probs

In [46]:
df_test.sample(20)

| Unnamed: 0 | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | Predictions | Prediction_Probabilities |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7420 | Kuo | 753 | Germany | Female | 38 | 9 | 151766.71 | 1 | 1 | 1 | 180829.99 | 0 | 0 | 0.496695 |
| 6196 | Bailey | 698 | France | Male | 29 | 5 | 95167.55 | 1 | 1 | 1 | 152723.23 | 0 | 0 | 0.182923 |
| 9783 | Ajuluchukwu | 601 | Germany | Female | 49 | 4 | 96252.98 | 2 | 1 | 0 | 104263.82 | 0 | 1 | 0.790875 |
| 2199 | Piazza | 762 | France | Male | 29 | 6 | 141389.06 | 1 | 1 | 0 | 54122.89 | 0 | 0 | 0.271795 |
| 9547 | McFarland | 626 | France | Female | 34 | 3 | 0.0 | 2 | 1 | 1 | 37870.29 | 0 | 0 | 0.106517 |
| 8134 | Shah | 577 | France | Male | 41 | 6 | 0.0 | 1 | 1 | 1 | 167621.18 | 0 | 0 | 0.463627 |
| 9275 | Carslaw | 427 | Germany | Male | 42 | 1 | 75681.52 | 1 | 1 | 1 | 57098.0 | 0 | 0 | 0.436842 |
| 3396 | Knowles | 581 | France | Male | 71 | 4 | 0.0 | 2 | 1 | 1 | 197562.08 | 0 | 0 | 0.169676 |
| 7487 | McGuffog | 651 | France | Female | 56 | 4 | 0.0 | 1 | 0 | 0 | 84383.22 | 1 | 1 | 0.918367 |
| 603 | Burke | 566 | France | Male | 30 | 5 | 0.0 | 1 | 1 | 0 | 54926.51 | 1 | 0 | 0.374872 |
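
In the sample above, the `Predictions` column is consistent with a 0.5 cut-off applied to `Prediction_Probabilities`. The sketch below reproduces that mapping on the ten displayed rows and shows how a lower cut-off would trade precision for recall. The helper names `apply_threshold` and `recall_precision` and the alternative cut-off of 0.35 are illustrative assumptions, not part of the notebook, and ten rows are far too few to choose a threshold from.

```python
# (Exited, Predictions, Prediction_Probabilities) copied from the sample rows above
rows = [
    (0, 0, 0.496695), (0, 0, 0.182923), (0, 1, 0.790875), (0, 0, 0.271795),
    (0, 0, 0.106517), (0, 0, 0.463627), (0, 0, 0.436842), (0, 0, 0.169676),
    (1, 1, 0.918367), (1, 0, 0.374872),
]
actual = [r[0] for r in rows]
stored_preds = [r[1] for r in rows]
probs = [r[2] for r in rows]


def apply_threshold(probabilities, threshold):
    """Flag a customer as churn (1) when the probability reaches the threshold."""
    return [int(p >= threshold) for p in probabilities]


def recall_precision(y_true, y_pred):
    """Return (recall, precision) for class 1; 0.0 when a denominator is zero."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


# A 0.5 cut-off reproduces the stored Predictions column for these rows
print(apply_threshold(probs, 0.5) == stored_preds)

for threshold in (0.5, 0.35):  # 0.35 is an arbitrary illustrative value
    recall, precision = recall_precision(actual, apply_threshold(probs, threshold))
    print(f"threshold={threshold}: recall={recall:.2f} precision={precision:.2f}")
```

On these rows, lowering the cut-off catches the one missed churner (Burke, 0.374872) but also flags three more non-churners, which is the usual precision cost of pushing recall up. Any real threshold choice would need to be made on the full validation set.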
