# Telecom Churn Analysis and Prediction using SMOTE data

* [1.  Import and Clean data](#import-and-clean-data)
    * [1.1.  Delete `customerid` column](#delete-%60customerid%60-column)
    * [1.2.  Data Munging](#data-munging)
        * [1.2.1.  Checking for null values in the dataset](#checking-for-null-values-in-the-dataset)
        * [1.2.2.  Making labels concise](#making-labels-concise)
        * [1.2.3.  Column Type Casting and Imputation](#column-type-casting-and-imputation)
            * [1.2.3.1.  Cast `TotalCharges` column to `float`](#cast-%60totalcharges%60-column-to-%60float%60)
            * [1.2.3.2.  Search for categorical columns and cast them to `pd.Categorical`](#search-for-categorical-columns-and-cast-them-to-%60pd.categorical%60)
        * [1.2.4.  Reordering Columns](#reordering-columns)
* [2.  Correlations in the data](#correlations-in-the-data)
    * [2.1.  Correlation between Quantitative variables](#correlation-between-quantitative-variables)
    * [2.2.  Correlation between Qualitative/ Categorical variables](#correlation-between-qualitative/-categorical-variables)
* [3.  Data Preprocessing](#data-preprocessing)
    * [3.1.  Train-Test split](#train-test-split)
    * [3.2. Oversample Training Data (SMOTE-NC)](#oversample-training-data-%28smote-nc%29)
    * [3.3.  One-hot Encoding and Standardization](#one-hot-encoding-and-standardization)
* [4.  Data Modeling](#data-modeling)
    * [4.1.  Utility Functions](#utility-functions)
    * [4.2. Naive Bayes](#naive-bayes)
    * [4.3.  Logistic Regression](#logistic-regression)
    * [4.4.  K-Nearest Neighbors](#k-nearest-neighbors)
    * [4.5.  Decision Tree](#decision-tree)
    * [4.6.  Decision Trees with Bagging](#decision-trees-with-bagging)
    * [4.7.  Random Forests](#random-forests)
    * [4.8.  Decision Trees with AdaBoost](#decision-trees-with-adaboost)
    * [4.9.  Linear SVC](#linear-svc)
    * [4.10.  SVM with RBF kernel](#svm-with-rbf-kernel)
    * [4.11.  XGBoost](#xgboost)
    * [4.12.  CatBoost](#catboost)
* [5.  Model Comparison](#model-comparison)
    * [5.1.  Evaluation Metrics](#evaluation-metrics)
    * [5.2. 2 ROC and PR Curves](#2-roc-and-pr-curves)
* [6.  Further Analysis](#further-analysis)

In [None]:
import os, sys

import numpy as np
from scipy.stats import chi2_contingency
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline


# Included changes to make the kernel run as a jupyter notebook on windows without the need to make any changes
kaggle_data_folder = os.path.join('kaggle', 'input') if sys.platform == 'win32' else os.path.join(os.path.sep, 'kaggle', 'input')
file_ext = ".csv"
files = []
for dirname, _, filenames in os.walk(kaggle_data_folder):
    for filename in filenames:
        if filename.endswith(file_ext):
            files.append(os.path.join(dirname, filename))

print(files)
# Any results you write to the current directory are saved as output.

<a id="import-and-clean-data"></a>
# 1.  Import and Clean data

In [None]:
df = pd.read_csv(files[0])
df.head(2)

<a id="delete-%60customerid%60-column"></a>
## 1.1.  Delete `customerid` column
Since the `customerID` column carries no information that helps predict customer churn, we can delete it.

In [None]:
df.drop(labels=['customerID'], axis=1, inplace=True)
df.head(2)

<a id="data-munging"></a>
## 1.2.  Data Munging

<a id="checking-for-null-values-in-the-dataset"></a>
### 1.2.1.  Checking for null values in the dataset

In [None]:
df.info()

So far we don't see any null values. However, a few will appear in the `TotalCharges` column once it is cast to `float64`.
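A minimal sketch of why nulls can stay hidden in a text column, assuming the raw file stores missing charges as blank strings (the toy values below are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy column: charges stored as text, with blank strings
# standing in for missing values (an assumption about the raw file).
raw = pd.Series(['29.85', ' ', '108.15', ' '])

# The blanks are ordinary strings, so the null count does not flag them.
print(raw.isnull().sum())

# Entries that cannot be parsed as numbers are coerced to NaN.
numeric = pd.to_numeric(raw, errors='coerce')
print(numeric.isnull().sum())
print(numeric.dtype)
```

With `errors='coerce'`, unparseable entries become `NaN` instead of raising an error, which is what exposes the missing values after the cast.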

<a id="making-labels-concise"></a>
### 1.2.2.  Making labels concise
Let's make the categorical labels more concise. For instance, we will convert the categorical label `'Bank transfer (automatic)'` to `'Bank transfer'` to make it easier to access (and display) during visualization.

In [None]:
## Shorten the Labels
value_mapper = {'Female': 'F', 'Male': 'M', 'Yes': 'Y', 'No': 'N',
                'No phone service': 'No phone', 'Fiber optic': 'Fiber',
                'No internet service': 'No internet', 'Month-to-month': 'Monthly',
                'Bank transfer (automatic)': 'Bank transfer',
                'Credit card (automatic)': 'Credit card',
                'One year': '1 yr', 'Two year': '2 yr'}
df.replace(to_replace=value_mapper, inplace=True)
# Another method
# df = df.applymap(lambda v: value_mapper[v] if v in value_mapper.keys() else v)

Let's also change column labels from `TitleCase` to `lowercase` to ease access.

In [None]:
df.columns = [label.lower() for label in df.columns]
df.head(10).T

<a id="column-type-casting-and-imputation"></a>
### 1.2.3.  Column Type Casting and Imputation
Pandas did not infer the intended data type for several columns. For instance, the `TotalCharges` column is recognized as `object` instead of `float`. Similarly, all the categorical columns were cast to the `object` type instead of `pd.Categorical`.

<a id="cast-%60totalcharges%60-column-to-%60float%60"></a>
#### 1.2.3.1.  Cast `TotalCharges` column to `float`

In [None]:
df['totalcharges'] = pd.to_numeric(df['totalcharges'], errors='coerce')
df['totalcharges'].head()

In [None]:
df.info()

Here we see that the `totalcharges` has 11 missing values. Let's see the complete data corresponding to these customers.

In [None]:
df[np.isnan(df['totalcharges'])]

Note also that the `tenure` column is 0 for these entries even though the `monthlycharges` column is not empty. Let's see if there are any other 0 values in the `tenure` column.

In [None]:
df[df['tenure'] == 0].index

No other rows have a `tenure` of 0. Let's delete the rows with missing `totalcharges`, which are exactly the rows where `tenure` is 0.

In [None]:
df.drop(labels=df[df['tenure'] == 0].index, axis=0, inplace=True)
df[df['tenure'] == 0].index

In [None]:
df.info()

<a id="search-for-categorical-columns-and-cast-them-to-%60pd.categorical%60"></a>
#### 1.2.3.2.  Search for categorical columns and cast them to `pd.Categorical`
We need to manually identify categorical columns in the data before casting them to `pd.Categorical`. Casting categorical columns from the detected *object* type to *categorical* will ease visualization.

In [None]:
def summarize_categoricals(df, show_levels=False):
    """
        Display uniqueness in each column
    """
    data = [[df[c].unique(), len(df[c].unique()), df[c].isnull().sum()] for c in df.columns]
    df_temp = pd.DataFrame(data, index=df.columns,
                           columns=['Levels', 'No. of Levels', 'No. of Missing Values'])
    return df_temp.iloc[:, 0 if show_levels else 1:]


def find_categorical(df, cutoff=10):
    """
        Function to find categorical columns in the dataframe.
    """
    cat_cols = []
    for col in df.columns:
        if len(df[col].unique()) <= cutoff:
            cat_cols.append(col)
    return cat_cols


def to_categorical(columns, df):
    """
        Converts the columns passed in `columns` to categorical datatype
    """
    for col in columns:
        df[col] = df[col].astype('category')
    return df

In [None]:
summarize_categoricals(df, show_levels=True)

In [None]:
df = to_categorical(find_categorical(df), df)
df.info()

<a id="reordering-columns"></a>
### 1.2.4.  Reordering Columns

In [None]:
new_order = list(df.columns)
new_order.insert(16, new_order.pop(4))
df = df[new_order]
df.head(2)

<a id="correlations-in-the-data"></a>
# 2.  Correlations in the data

In [None]:
df.describe().T

<a id="correlation-between-quantitative-variables"></a>
## 2.1.  Correlation between Quantitative variables

In [None]:
sns.heatmap(data=df[['tenure', 'monthlycharges', 'totalcharges']].corr(),
            annot=True, cmap='coolwarm');

***Inference:*** As evident from the correlation matrix and regplots, since ***'totalcharges'*** is the total monthly charges over the tenure of a customer, ***'totalcharges'*** is highly correlated with ***'monthlycharges'*** and ***'tenure'***.
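The reasoning behind this inference can be illustrated on synthetic data. The sketch below assumes `totalcharges` is exactly `tenure * monthlycharges`, which is only an approximation of the real dataset; the toy frame and its value ranges are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical customers: tenure in months and a monthly charge
toy = pd.DataFrame({
    'tenure': rng.integers(1, 73, size=500),
    'monthlycharges': rng.uniform(20, 120, size=500),
})

# Assumption: total charges are the monthly charge accumulated over the tenure
toy['totalcharges'] = toy['tenure'] * toy['monthlycharges']

# Pearson correlation of every column with totalcharges
corr = toy.corr()
print(corr['totalcharges'])
```

Even though `tenure` and `monthlycharges` are drawn independently here, both correlate strongly with their product.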

In [None]:
# Pass x and y as keywords: recent seaborn releases no longer accept them positionally
sns.lmplot(x='monthlycharges', y='totalcharges', data=df, hue='churn',
           scatter_kws={'alpha': 0.1})
fig = sns.lmplot(x='tenure', y='totalcharges', data=df, hue='churn',
                 scatter_kws={'alpha': 0.1})
fig.set_xlabels('tenure (in months)');

<a id="correlation-between-qualitative/-categorical-variables"></a>
## 2.2.  Correlation between Qualitative/ Categorical variables
Cramér's V is more appropriate than Pearson correlation for measuring the association between two nominal variables. Here, the bias-corrected Cramér's V statistic is implemented.
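For reference, the bias-corrected statistic computed by `cramers_corrected_stat` can be written as follows, where $n$ is the number of observations, $r$ and $k$ are the numbers of rows and columns of the contingency table, and $\chi^2$ is the chi-squared statistic of that table. The tilde notation is an assumption chosen for this write-up.

```latex
\varphi^2 = \frac{\chi^2}{n}, \qquad
\tilde{\varphi}^2 = \max\!\left(0,\; \varphi^2 - \frac{(k-1)(r-1)}{n-1}\right)

\tilde{r} = r - \frac{(r-1)^2}{n-1}, \qquad
\tilde{k} = k - \frac{(k-1)^2}{n-1}

\tilde{V} = \sqrt{\frac{\tilde{\varphi}^2}{\min\left(\tilde{k}-1,\; \tilde{r}-1\right)}}
```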

In [None]:
def cramers_corrected_stat(contingency_table):
    """
        Computes corrected Cramer's V statistic for categorical-categorical association
    """
    chi2 = chi2_contingency(contingency_table)[0]
    n = contingency_table.sum().sum()
    phi2 = chi2/n
    
    r, k = contingency_table.shape
    r_corrected = r - (((r-1)**2)/(n-1))
    k_corrected = k - (((k-1)**2)/(n-1))
    phi2_corrected = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    
    return (phi2_corrected / min( (k_corrected-1), (r_corrected-1)))**0.5

In [None]:
def categorical_corr_matrix(df):
    """
        Computes corrected Cramer's V statistic between
        all the categorical variables in the dataframe
    """
    df = df.select_dtypes(include='category')
    cols = df.columns
    n = len(cols)
    corr_matrix = pd.DataFrame(np.zeros(shape=(n, n)), index=cols, columns=cols)
    
    for col1 in cols:
        for col2 in cols:
            if col1 == col2:
                corr_matrix.loc[col1, col2] = 1
                break
            df_crosstab = pd.crosstab(df[col1], df[col2], dropna=False)
            corr_matrix.loc[col1, col2] = cramers_corrected_stat(df_crosstab)
    
    # Flip and add to get full correlation matrix
    corr_matrix += np.tril(corr_matrix, k=-1).T
    return corr_matrix

In [None]:
fig, ax = plt.subplots(figsize=(15, 10))
sns.heatmap(categorical_corr_matrix(df), annot=True, cmap='coolwarm', 
            cbar_kws={'aspect': 50}, square=True, ax=ax)
plt.xticks(rotation=60);

***Inference:*** There is some correlation between *'phone service'* and *'multiple lines'* since those who don't have a phone service cannot have multiple lines. So, knowing that a particular customer is not subscribed to phone service we can infer that the customer doesn't have multiple lines. Similarly, there is also a correlation between *'internet service'* and *'online security', 'online backup', 'device protection', 'streaming tv'* and *'streaming movies'*

<a id="data-preprocessing"></a>
# 3.  Data Preprocessing
Data needs to be one-hot-encoded before applying machine learning models.
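As a small illustration of one-hot encoding with a dropped reference level, the sketch below uses `pd.get_dummies` on a hypothetical toy column. This is a stand-in for illustration only; the notebook itself uses scikit-learn's `OneHotEncoder(drop='first')` later on.

```python
import pandas as pd

# Hypothetical toy frame with a single categorical feature
toy = pd.DataFrame({'contract': ['Monthly', '1 yr', '2 yr', 'Monthly']})

# One indicator column per level, dropping the first level as the reference
encoded = pd.get_dummies(toy, columns=['contract'], drop_first=True, dtype=int)
print(encoded)
```

Dropping one level avoids a redundant column: a row with zeros in every remaining indicator belongs to the dropped level.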

In [None]:
x = df.iloc[:, :-1]
y = df['churn']

categorical_columns = list(x.select_dtypes(include='category').columns)
numeric_columns = list(x.select_dtypes(exclude='category').columns)

<a id="train-test-split"></a>
## 3.1.  Train-Test split
The CatBoost classifier does not require any kind of preprocessing, while Naive Bayes requires a different kind of preprocessing. Therefore, we will use raw/unmodified data (`x_train_cat, x_test_cat, y_train_cat, y_test_cat`) for CatBoost and preprocessed data (`x_train, x_test, y_train, y_test`) for all other classifiers. For Naive Bayes, we will start from the raw data (`x_train_cat, x_test_cat, y_train_cat, y_test_cat`) and preprocess it as required in the Naive Bayes section.

In [None]:
from sklearn.model_selection import train_test_split

data_splits = train_test_split(x, y, test_size=0.25, random_state=0,
                               shuffle=True, stratify=y)
x_train, x_test, y_train, y_test = data_splits


# For CatBoost and Naive Bayes
data_splits = train_test_split(x, y, test_size=0.25, random_state=0,
                               shuffle=True, stratify=y)
x_train_cat, x_test_cat, y_train_cat, y_test_cat = data_splits


# Save the non-scaled version of monthlycharges and totalcharges to compare classifiers
x_test_charges = np.array(x_test[['monthlycharges', 'totalcharges']], copy=True)

list(map(lambda x: x.shape, [x, y, x_train, x_test, y_train, y_test]))

<a id="oversample-training-data-%28smote-nc%29"></a>
## 3.2. Oversample Training Data (SMOTE-NC)
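SMOTE is an oversampling method that balances an imbalanced dataset by generating synthetic minority-class samples, each interpolated between an existing minority sample and one of its nearest minority-class neighbors, rather than by duplicating existing rows. SMOTE-NC (Synthetic Minority Over-sampling TEchnique for Nominal and Continuous features) extends this to data that mixes numerical and categorical features. Note that only the training data is oversampled; the test data is left untouched.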
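The core SMOTE step for numeric features is usually described as interpolation between a minority sample and one of its nearest minority-class neighbours. The NumPy sketch below illustrates only that step on hypothetical data; `smote_sample` is a name invented for this example and is not the `imblearn` implementation. For categorical features, SMOTE-NC instead assigns the most frequent category among the neighbours.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority-class samples with two numeric features
minority = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.5], [1.2, 2.2]])


def smote_sample(samples, k=2, rng=rng):
    """Create one synthetic point between a random sample
    and one of its k nearest neighbours."""
    i = rng.integers(len(samples))
    dists = np.linalg.norm(samples - samples[i], axis=1)
    neighbours = np.argsort(dists)[1:k + 1]   # position 0 is the sample itself
    j = rng.choice(neighbours)
    lam = rng.random()                        # interpolation factor in [0, 1)
    return samples[i] + lam * (samples[j] - samples[i])


synthetic = np.array([smote_sample(minority) for _ in range(5)])
print(synthetic)
```

Every synthetic point lies on the line segment between two real minority samples, so the new rows stay inside the region already occupied by the minority class.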

In [None]:
from imblearn.over_sampling import SMOTENC

smote = SMOTENC(categorical_features=(x_train.dtypes == "category").values,
                random_state=42)

x_train, y_train = smote.fit_resample(x_train, y_train)

x_train_cat, y_train_cat = smote.fit_resample(x_train_cat, y_train_cat)

In [None]:
pd.Series(y_train).value_counts()

In [None]:
sns.countplot(x=y_train);

<a id="one-hot-encoding-and-standardization"></a>
## 3.3.  One-hot Encoding and Standardization
We need to standardize the continuous (quantitative) features before applying machine learning models. Without standardization, features whose variance is orders of magnitude larger than that of others may dominate the fitting process and prevent the model from learning correctly from the lower-variance features. <br/>
There is no need to standardize categorical variables.

***Also we need to standardize the data only after performing train-test split because if we standardize before splitting then there is a chance for some information leak from the test set into the train set. We always want the test set to be completely new to the ML models. [Read more](https://scikit-learn.org/stable/modules/compose.html#columntransformer-for-heterogeneous-data)***
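A minimal NumPy sketch of leak-free standardization, assuming a single hypothetical numeric feature: the mean and standard deviation are estimated on the training split only and then reused for the test split. This mirrors the fit-on-train, transform-on-test pattern, but it is not the notebook's `ColumnTransformer` code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical numeric feature, split 75/25 into train and test
charges = rng.uniform(20, 120, size=(100, 1))
train, test = charges[:75], charges[75:]

# Learn the scaling parameters from the training split only
mean, std = train.mean(axis=0), train.std(axis=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std   # reuse the training statistics

print(train_scaled.mean(), train_scaled.std())
print(test_scaled.mean(), test_scaled.std())
```

The scaled test split will generally not have exactly zero mean and unit variance, which is expected: its statistics were never seen when the scaler was fitted.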

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, LabelEncoder

categorical_columns = list(x.select_dtypes(include='category').columns)


## Column Transformer
transformers = [('one_hot_encoder',
                  OneHotEncoder(drop='first',dtype='int'),
                  categorical_columns),
                ('standard_scaler', StandardScaler(), numeric_columns)]
x_trans = ColumnTransformer(transformers, remainder='passthrough')

## Applying Column Transformer
x_train = x_trans.fit_transform(x_train)
x_test = x_trans.transform(x_test)

## Label encoding
y_trans = LabelEncoder()
y_train = y_trans.fit_transform(y_train)
y_test = y_trans.transform(y_test)


## Save feature names after one-hot encoding for feature importances plots
## (`get_feature_names_out` replaces `get_feature_names`, which was removed in scikit-learn 1.2)
feature_names = list(x_trans.named_transformers_['one_hot_encoder']
                            .get_feature_names_out(categorical_columns))
feature_names = feature_names + numeric_columns

<a id="data-modeling"></a>
# 4.  Data Modeling
Since the dataset is imbalanced we will be using class-weighted/ cost-sensitive learning. In cost-sensitive learning, a weighted cost function is used. Therefore, misclassifying a sample from the minority class will cost the classifiers more than misclassifying a sample from the majority class. In most of the Sklearn classifiers, cost-sensitive learning can be enabled by setting `class_weight='balanced'`.
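To make the `'balanced'` setting concrete, the sketch below reproduces the weighting formula given in the scikit-learn documentation, `n_samples / (n_classes * np.bincount(y))`, on a hypothetical label vector.

```python
import numpy as np

# Hypothetical imbalanced labels: six negatives, two positives
y = np.array([0] * 6 + [1] * 2)

counts = np.bincount(y)                      # samples per class
weights = len(y) / (len(counts) * counts)    # 'balanced' class weights

print(dict(enumerate(weights)))
```

The rarer class receives the larger weight, so each of its misclassifications contributes more to the loss.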

<a id="utility-functions"></a>
## 4.1.  Utility Functions

In [None]:
import timeit
import pickle
import sys
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score, \
                            precision_recall_curve, roc_curve, accuracy_score
from sklearn.exceptions import NotFittedError

In [None]:
def confusion_plot(matrix, labels=None):
    """ Display binary confusion matrix as a Seaborn heatmap """
    
    labels = labels if labels else ['Negative (0)', 'Positive (1)']
    
    fig, ax = plt.subplots(nrows=1, ncols=1)
    sns.heatmap(data=matrix, cmap='Blues', annot=True, fmt='d',
                xticklabels=labels, yticklabels=labels, ax=ax)
    ax.set_xlabel('PREDICTED')
    ax.set_ylabel('ACTUAL')
    ax.set_title('Confusion Matrix')
    plt.close()
    
    return fig

In [None]:
def roc_plot(y_true, y_probs, label, compare=False, ax=None):
    """ Plot Receiver Operating Characteristic (ROC) curve 
        Set `compare=True` to use this function to compare classifiers. """
    
    fpr, tpr, thresh = roc_curve(y_true, y_probs)
    auc = round(roc_auc_score(y_true, y_probs), 2)
    
    fig, axis = (None, ax) if ax else plt.subplots(nrows=1, ncols=1)
    label = ' '.join([label, f'({auc})']) if compare else None
    sns.lineplot(x=fpr, y=tpr, ax=axis, label=label)
    
    if compare:
        axis.legend(title='Classifier (AUC)', loc='lower right')
    else:
        axis.text(0.72, 0.05, f'AUC = { auc }', fontsize=12,
                  bbox=dict(facecolor='green', alpha=0.4, pad=5))
            
        # Plot No-Info classifier
        axis.fill_between(fpr, fpr, tpr, alpha=0.3, edgecolor='g',
                          linestyle='--', linewidth=2)
        
    axis.set_xlim(0, 1)
    axis.set_ylim(0, 1)
    axis.set_title('ROC Curve')
    axis.set_xlabel('False Positive Rate [FPR]\n(1 - Specificity)')
    axis.set_ylabel('True Positive Rate [TPR]\n(Sensitivity or Recall)')
    
    plt.close()
    
    return axis if ax else fig

In [None]:
def precision_recall_plot(y_true, y_probs, label, compare=False, ax=None):
    """ Plot Precision-Recall curve.
        Set `compare=True` to use this function to compare classifiers. """
    
    p, r, thresh = precision_recall_curve(y_true, y_probs)
    p, r, thresh = list(p), list(r), list(thresh)
    p.pop()
    r.pop()
    
    fig, axis = (None, ax) if ax else plt.subplots(nrows=1, ncols=1)
    
    if compare:
        sns.lineplot(x=r, y=p, ax=axis, label=label)
        axis.set_xlabel('Recall')
        axis.set_ylabel('Precision')
        axis.legend(loc='lower left')
    else:
        sns.lineplot(x=thresh, y=p, label='Precision', ax=axis)
        axis.set_xlabel('Threshold')
        axis.set_ylabel('Precision')
        axis.legend(loc='lower left')

        axis_twin = axis.twinx()
        sns.lineplot(x=thresh, y=r, color='limegreen', label='Recall', ax=axis_twin)
        axis_twin.set_ylabel('Recall')
        axis_twin.set_ylim(0, 1)
        axis_twin.legend(bbox_to_anchor=(0.24, 0.18))
    
    axis.set_xlim(0, 1)
    axis.set_ylim(0, 1)
    axis.set_title('Precision Vs Recall')
    
    plt.close()
    
    return axis if ax else fig

In [None]:
def feature_importance_plot(importances, feature_labels, ax=None):
    fig, axis = (None, ax) if ax else plt.subplots(nrows=1, ncols=1, figsize=(5, 10))
    sns.barplot(x=importances, y=feature_labels, ax=axis)
    axis.set_title('Feature Importance Measures')
    
    plt.close()
    
    return axis if ax else fig

In [None]:
def train_clf(clf, x_train, y_train, sample_weight=None, refit=False):
    train_time = 0
    
    try:
        if refit:
            raise NotFittedError
        y_pred_train = clf.predict(x_train)
    except NotFittedError:
        start = timeit.default_timer()
        
        if sample_weight is not None:
            clf.fit(x_train, y_train, sample_weight=sample_weight)
        else:
            clf.fit(x_train, y_train)
        
        end = timeit.default_timer()
        train_time = end - start
        
        y_pred_train = clf.predict(x_train)
    
    train_acc = accuracy_score(y_train, y_pred_train)
    return clf, y_pred_train, train_acc, train_time

In [None]:
def model_memory_size(clf):
    return sys.getsizeof(pickle.dumps(clf))

In [None]:
def report(clf, x_train, y_train, x_test, y_test, sample_weight=None,
           refit=False, importance_plot=False, confusion_labels=None,
           feature_labels=None, verbose=True):
    """ Trains the passed classifier if not already trained and reports
        various metrics of the trained classifier """
    
    dump = dict()
    
    ## Train if not already trained
    clf, train_predictions, train_acc, train_time = train_clf(
        clf, x_train, y_train, sample_weight=sample_weight, refit=refit)
    ## Testing
    start = timeit.default_timer()
    test_predictions = clf.predict(x_test)
    end = timeit.default_timer()
    test_time = end - start
    
    test_acc = accuracy_score(y_test, test_predictions)
    y_probs = clf.predict_proba(x_test)[:, 1]
    
    roc_auc = roc_auc_score(y_test, y_probs)
    
    
    ## Model Memory
    model_mem = round(model_memory_size(clf) / 1024, 2)
    
    print(clf)
    print("\n=============================> TRAIN-TEST DETAILS <======================================")
    
    ## Metrics
    print(f"Train Size: {x_train.shape[0]} samples")
    print(f" Test Size: {x_test.shape[0]} samples")
    print("------------------------------------------")
    print(f"Training Time: {round(train_time, 3)} seconds")
    print(f" Testing Time: {round(test_time, 3)} seconds")
    print("------------------------------------------")
    print("Train Accuracy: ", train_acc)
    print(" Test Accuracy: ", test_acc)
    print("------------------------------------------")
    print(" Area Under ROC: ", roc_auc)
    print("------------------------------------------")
    print(f"Model Memory Size: {model_mem} kB")
    print("\n=============================> CLASSIFICATION REPORT <===================================")
    
    ## Classification Report
    clf_rep = classification_report(y_test, test_predictions, output_dict=True)
    
    print(classification_report(y_test, test_predictions,
                                target_names=confusion_labels))
    
    
    if verbose:
        print("\n================================> CONFUSION MATRIX <=====================================")
    
        ## Confusion Matrix HeatMap
        display(confusion_plot(confusion_matrix(y_test, test_predictions),
                               labels=confusion_labels))
        print("\n=======================================> PLOTS <=========================================")


        ## Variable importance plot
        fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(14, 10))
        roc_axes = axes[0, 0]
        pr_axes = axes[0, 1]
        importances = None

        if importance_plot:
            if not feature_labels:
                raise RuntimeError("'feature_labels' argument not passed "
                                   "when 'importance_plot' is True")

            try:
                importances = pd.Series(clf.feature_importances_,
                                        index=feature_labels) \
                                .sort_values(ascending=False)
            except AttributeError:
                try:
                    importances = pd.Series(clf.coef_.ravel(),
                                            index=feature_labels) \
                                    .sort_values(ascending=False)
                except AttributeError:
                    pass

            if importances is not None:
                # Modifying grid
                grid_spec = axes[0, 0].get_gridspec()
                for ax in axes[:, 0]:
                    ax.remove()   # remove first column axes
                large_axs = fig.add_subplot(grid_spec[0:, 0])

                # Plot importance curve
                feature_importance_plot(importances=importances.values,
                                        feature_labels=importances.index,
                                        ax=large_axs)
                large_axs.axvline(x=0)

                # Axis for ROC and PR curve
                roc_axes = axes[0, 1]
                pr_axes = axes[1, 1]
            else:
                # remove second row axes
                for ax in axes[1, :]:
                    ax.remove()
        else:
            # remove second row axes
            for ax in axes[1, :]:
                ax.remove()


        ## ROC and Precision-Recall curves
        clf_name = clf.__class__.__name__
        roc_plot(y_test, y_probs, clf_name, ax=roc_axes)
        precision_recall_plot(y_test, y_probs, clf_name, ax=pr_axes)

        fig.subplots_adjust(wspace=5)
        fig.tight_layout()
        display(fig)
    
    ## Dump to report_dict
    dump = dict(clf=clf, train_acc=train_acc, train_time=train_time,
                train_predictions=train_predictions, test_acc=test_acc,
                test_time=test_time, test_predictions=test_predictions,
                test_probs=y_probs, report=clf_rep, roc_auc=roc_auc,
                model_memory=model_mem)
    
    return clf, dump

In [None]:
def compare_models(y_test=None, clf_reports=[], labels=[]):
    """ Compare evaluation metrics for the True Positive class [1] of 
        binary classifiers passed in the argument and plot ROC and PR curves.
        
        Arguments:
        ---------
        y_test: to plot ROC and Precision-Recall curves
        
        Returns:
        -------
        compare_table: pandas DataFrame containing evaluated metrics
                  fig: `matplotlib` figure object with ROC and PR curves """

    
    ## Classifier Labels
    default_names = [rep['clf'].__class__.__name__ for rep in clf_reports]
    clf_names =  labels if len(labels) == len(clf_reports) else default_names
    
    
    ## Compare Table
    table = dict()
    index = ['Train Accuracy', 'Test Accuracy', 'Overfitting', 'ROC Area',
             'Precision', 'Recall', 'F1-score', 'Support']
    for i in range(len(clf_reports)):
        train_acc = round(clf_reports[i]['train_acc'], 3)
        test_acc = round(clf_reports[i]['test_acc'], 3)
        clf_probs = clf_reports[i]['test_probs']
        roc_auc = clf_reports[i]['roc_auc']
        
        # Get metrics of True Positive class from sklearn classification_report
        true_positive_metrics = list(clf_reports[i]['report']["1"].values())
        
        table[clf_names[i]] = [train_acc, test_acc,
                               test_acc < train_acc, roc_auc] + true_positive_metrics
    
    table = pd.DataFrame(data=table, index=index)
    
    
    ## Compare Plots
    fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
    
    # ROC and Precision-Recall
    for i in range(len(clf_reports)):
        clf_probs = clf_reports[i]['test_probs']
        roc_plot(y_test, clf_probs, label=clf_names[i],
                 compare=True, ax=axes[0])
        precision_recall_plot(y_test, clf_probs, label=clf_names[i],
                              compare=True, ax=axes[1])
    # Plot No-Info classifier
    axes[0].plot([0,1], [0,1], linestyle='--', color='green')
        
    fig.tight_layout()
    plt.close()
    
    return table.T, fig

<a id="naive-bayes"></a>
## 4.2. Naive Bayes
The fundamental assumption made by Naive Bayes regarding the data is ***class conditional independence of features***. Sklearn provides different variants of Naive Bayes depending on whether the features follow a categorical (CategoricalNB), normal (GaussianNB), Bernoulli (BernoulliNB), or multinomial (MultinomialNB) distribution.
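Under this assumption the posterior factorizes over the individual features. Writing $y$ for the class and $x_1, \dots, x_n$ for the feature values (notation chosen for this write-up), the classifier picks the class with the largest product of the prior and the per-feature likelihoods:

```latex
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

The variants listed above differ only in how each $P(x_i \mid y)$ is modeled.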

In [None]:
from sklearn.naive_bayes import CategoricalNB, GaussianNB 
from sklearn.preprocessing import KBinsDiscretizer, OrdinalEncoder

confusion_lbs = ['No Churn', 'Churn']

## Discretize 'monthlycharges' and 'totalcharges' into 12 bins
kbn = KBinsDiscretizer(n_bins=12, encode='ordinal')
ode = OrdinalEncoder(dtype=np.int64)
nb_trans = [('ordinal', ode, categorical_columns),
            ('kbn', kbn, numeric_columns[1:])]
nb_col_trans = ColumnTransformer(nb_trans, remainder='passthrough')

## Applying Column Transformer
x_train_nb = nb_col_trans.fit_transform(x_train_cat)
x_test_nb = nb_col_trans.transform(x_test_cat)

nb_clf = CategoricalNB()

nb_clf, nb_report = report(nb_clf, x_train_nb, y_train,
                           x_test_nb, y_test, refit=True,
                           confusion_labels=confusion_lbs)

<a id="logistic-regression"></a>
## 4.3.  Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegressionCV

logit_cv = LogisticRegressionCV(Cs=10, class_weight='balanced', cv=5, dual=False,
                                fit_intercept=True, intercept_scaling=1.0, l1_ratios=None,
                                max_iter=500, multi_class='auto', n_jobs=None,
                                penalty='l1', random_state=0, refit=True,
                                scoring='f1', solver='liblinear', tol=0.0001,
                                verbose=0)

logit_cv, logit_report = report(logit_cv, x_train, y_train,
                                x_test, y_test, refit=True,
                                importance_plot=True,
                                feature_labels=feature_names,
                                confusion_labels=confusion_lbs)

<a id="k-nearest-neighbors"></a>
## 4.4.  K-Nearest Neighbors
The KNN estimator in scikit-learn does not provide a way to pass class weights, so cost-sensitive/class-weighted learning is not available for it.

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=91, p=1,
                           weights='uniform', n_jobs=-1)

knn, knn_report = report(knn, x_train, y_train,
                         x_test, y_test,
                         importance_plot=True,
                         feature_labels=feature_names,
                         confusion_labels=confusion_lbs)

<a id="decision-tree"></a>
## 4.5.  Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

decision_tree = DecisionTreeClassifier(class_weight='balanced',
                                       criterion='entropy',
                                       max_depth=3,
                                       random_state=0)

decision_tree, decision_tree_report = report(decision_tree, x_train, y_train,
                                             x_test, y_test,
                                             importance_plot=True,
                                             feature_labels=feature_names,
                                             confusion_labels=confusion_lbs)

<a id="decision-trees-with-bagging"></a>
## 4.6.  Decision Trees with Bagging

In [None]:
from sklearn.ensemble import BaggingClassifier

bagging_dtree = DecisionTreeClassifier(max_depth=2, class_weight='balanced',
                                       criterion='entropy', random_state=0)

# `estimator` replaces `base_estimator`, which was removed in scikit-learn 1.4
bagging_clf = BaggingClassifier(estimator=bagging_dtree,
                                max_samples=110, n_estimators=80,
                                max_features=15, n_jobs=-1,
                                random_state=0)

bagging_clf, bagging_clf_report = report(bagging_clf, x_train, y_train,
                                         x_test, y_test,
                                         feature_labels=feature_names,
                                         confusion_labels=confusion_lbs)

<a id="random-forests"></a>
## 4.7.  Random Forests

In [None]:
from sklearn.ensemble import RandomForestClassifier

random_forest = RandomForestClassifier(class_weight='balanced', criterion='entropy',
                                       max_depth=1, max_samples=2000, n_estimators=100,
                                       n_jobs=-1, random_state=0)

random_forest, random_forest_report = report(random_forest, x_train, y_train,
                                             x_test, y_test,
                                             importance_plot=True,
                                             feature_labels=feature_names,
                                             confusion_labels=confusion_lbs)

<a id="decision-trees-with-adaboost"></a>
## 4.8.  Decision Trees with AdaBoost
The default base estimator for `AdaBoostClassifier` is `DecisionTreeClassifier(max_depth=1)`

In [None]:
from sklearn.ensemble import AdaBoostClassifier

boosting_dtree = DecisionTreeClassifier(class_weight='balanced',
                                        criterion='entropy',
                                        max_depth=1, random_state=0)
# Note: scikit-learn >= 1.2 renames `base_estimator` to `estimator`
adaboot = AdaBoostClassifier(base_estimator=boosting_dtree,
                             n_estimators=285, learning_rate=0.1,
                             random_state=0)

adaboot, adaboot_report = report(adaboot, x_train, y_train,
                                 x_test, y_test,
                                 importance_plot=True,
                                 feature_labels=feature_names,
                                 confusion_labels=confusion_lbs)

<a id="linear-svc"></a>
## 4.9.  Linear SVC

In [None]:
from sklearn.svm import SVC

linear_svc = SVC(kernel='linear', probability=True,
                 class_weight='balanced', random_state=0)

linear_svc, linear_svc_report = report(linear_svc, x_train, y_train,
                                       x_test, y_test,
                                       importance_plot=True,
                                       feature_labels=feature_names,
                                       confusion_labels=confusion_lbs)

<a id="svm-with-rbf-kernel"></a>
## 4.10.  SVM with RBF kernel

In [None]:
rbf_svc = SVC(C=0.3, kernel='rbf', probability=True,
              class_weight='balanced', random_state=0)

rbf_svc, rbf_svc_report = report(rbf_svc, x_train, y_train,
                                 x_test, y_test,
                                 importance_plot=True,
                                 feature_labels=feature_names,
                                 confusion_labels=confusion_lbs)

<a id="xgboost"></a>
## 4.11.  XGBoost

In [None]:
from xgboost import XGBClassifier

# Ratio of negative to positive training samples, used as `scale_pos_weight`
cls_weight = (y_train.shape[0] - np.sum(y_train)) / np.sum(y_train)

xgb_clf = XGBClassifier(learning_rate=0.01, random_state=0,
                        scale_pos_weight=cls_weight, n_jobs=-1)
xgb_clf.fit(x_train, y_train);

xgb_clf, xgb_report = report(xgb_clf, x_train, y_train,
                             x_test, y_test,
                             importance_plot=True,
                             feature_labels=feature_names,
                             confusion_labels=confusion_lbs)
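The `scale_pos_weight` value used above is the ratio of negative to positive training labels. The minimal sketch below reproduces that arithmetic on a small, hypothetical label vector (`y_labels` is made up for illustration). For a label vector that is already balanced, the same formula gives 1 and the weight has no effect.

```python
import numpy as np

# Hypothetical binary labels: 1 = churn, 0 = no churn
y_labels = np.array([0, 0, 0, 0, 0, 0, 1, 1])

n_pos = np.sum(y_labels)             # number of positive samples
n_neg = y_labels.shape[0] - n_pos    # number of negative samples

# Same formula as `cls_weight` in the cell above
pos_weight = n_neg / n_pos
print(pos_weight)  # 3.0
```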

<a id="catboost"></a>
## 4.12.  CatBoost
CatBoost performs better without one-hot encoding because it encodes categorical features internally, using a scheme similar to Leave One Out Encoding (LOOE). We can therefore pass the DataFrame with its raw categorical columns directly to the CatBoost classifier.
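To illustrate the idea, here is a minimal sketch of plain leave-one-out target encoding in pandas. It is not CatBoost's implementation: CatBoost uses ordered target statistics with a prior, which differ in detail. The toy DataFrame and its column names are assumptions made for the example.

```python
import pandas as pd

# Toy data; column names and values are made up for illustration
toy = pd.DataFrame({
    'contract': ['monthly', 'monthly', 'monthly', 'yearly', 'yearly'],
    'churn':    [1, 0, 1, 0, 0],
})

grp = toy.groupby('contract')['churn']
cat_sum = grp.transform('sum')      # target sum within each category
cat_count = grp.transform('count')  # number of rows within each category

# Leave-one-out: mean target of the *other* rows in the same category
toy['contract_loo'] = (cat_sum - toy['churn']) / (cat_count - 1)
print(toy)
```

Each row's own label is excluded from its encoding, which limits target leakage. A category that occurs only once would divide by zero here, so a practical encoder also needs a fallback such as the global mean.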

In [None]:
from catboost import CatBoostClassifier

# `cat_features` names the columns that CatBoost encodes internally

catboost_clf = CatBoostClassifier(cat_features=categorical_columns,
                                  l2_leaf_reg=120, depth=6,
                                  auto_class_weights='Balanced',
                                  iterations=200, learning_rate=0.16,
                                  use_best_model=True,
                                  early_stopping_rounds=150,
                                  eval_metric='F1', random_state=0)

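# Note: `eval_set` is the training data itself, so `use_best_model` and
# early stopping are driven by the training F1, not by a held-out set
catboost_clf.fit(x_train_cat, y_train,
                 eval_set=(x_train_cat, y_train),
                 verbose=False)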


f_labels = categorical_columns+numeric_columns
catboost_clf, catboost_report = report(catboost_clf, x_train_cat, y_train,
                                       x_test_cat, y_test,
                                       importance_plot=True,
                                       feature_labels=f_labels,
                                       confusion_labels=confusion_lbs)

<a id="model-comparison"></a>
# 5.  Model Comparison
Since the input data formats for Naive Bayes and CatBoost differ from those of the other models, we add their reports to the comparison manually.

In [None]:
report_list = [nb_report, logit_report, knn_report, decision_tree_report,
               bagging_clf_report, random_forest_report, adaboot_report,
               xgb_report, linear_svc_report, rbf_svc_report, catboost_report]
clf_labels = [rep['clf'].__class__.__name__ for rep in report_list]
# Both SVMs are instances of `SVC`, so relabel them by kernel
clf_labels[-3], clf_labels[-2] = 'Linear SVC', 'RBF SVC'

<a id="evaluation-metrics"></a>
## 5.1.  Evaluation Metrics

In [None]:
compare_table, compare_plot = compare_models(y_test, clf_reports=report_list, labels=clf_labels)

compare_table.sort_values(by=['Overfitting'])

***Inference:*** Among the classifiers that do not overfit, the Random Forest classifier has the highest recall, while Logistic Regression has the highest F1-score. Random Forests also rank best on Revenue Retained (defined in Section 6), but they suffer from low precision.

<a id="2-roc-and-pr-curves"></a>
## 5.2.  ROC and PR Curves

In [None]:
compare_plot

<a id="further-analysis"></a>
# 6.  Further Analysis
We define an additional evaluation metric, Percentage Monthly Revenue Retained, as

\begin{align}
\text {Revenue Retained (Monthly)} \%=\frac{\sum_{i=1}^{n_{\text {test}}} y_{\text {test}}^{(i)} \times y_{\text {pred}}^{(i)} \times \text {monthlycharges}^{(i)}}{\sum_{i=1}^{n_{\text {test}}} y_{\text {test}}^{(i)} \times \text {monthlycharges}^{(i)}} \times 100
\end{align}

It is a revenue-weighted recall score and can be viewed as the business equivalent of recall. It represents the share of churn revenue that a model retains (or saves) through its correct churn predictions, i.e., its True Positives.
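The formula can be checked on a small example. The sketch below uses a hypothetical helper, `revenue_retained_pct`, and made-up labels and charges. It only illustrates the definition and is separate from the notebook's own computation in the next cell.

```python
import numpy as np

def revenue_retained_pct(y_true, y_pred, charges):
    """Revenue-weighted recall in percent (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    charges = np.asarray(charges, dtype=float)
    # Revenue of churners that the model correctly flagged (true positives)
    tp_revenue = np.sum(y_true * y_pred * charges)
    # Revenue of all actual churners
    churn_revenue = np.sum(y_true * charges)
    return 100.0 * tp_revenue / churn_revenue

# Made-up example: three churners, two of them predicted correctly
y_true = [1, 1, 0, 1]
y_pred = [1, 0, 1, 1]
charges = [80.0, 20.0, 50.0, 100.0]

pct = revenue_retained_pct(y_true, y_pred, charges)
print(pct)  # 90.0
```

Plain recall in this example is 2/3, but the missed churner pays only 20 of the 200 units of churn revenue, so the revenue-weighted score is 90%.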

In [None]:
df_charges = []
new_cols = ['Revenue Retained (Monthly) %', 'Revenue Retained (Total) %']

# Revenue of the correctly predicted churners, per model and charge column
for rep in report_list:
    true_positives = (y_test * rep['test_predictions']).reshape(-1, 1)
    tp_revenue = (x_test_charges * true_positives).sum(axis=0)
    df_charges.append(tp_revenue)

revenue_saved = pd.DataFrame(df_charges, index=clf_labels,
                             columns=new_cols)

compare_table_rev = pd.concat([compare_table, revenue_saved], axis=1)

# True-positive revenue as a percentage of the total churn revenue
total_churn_revenue = (x_test_charges * y_test.reshape(-1, 1)).sum(axis=0)
# Select the revenue columns by name rather than by hard-coded position
compare_table_rev[new_cols] = (compare_table_rev[new_cols] / total_churn_revenue) * 100

In [None]:
compare_table_rev

In [None]:
select_cols = ['Overfitting', 'F1-score'] + new_cols
compare_table_rev[select_cols].sort_values(by=['Overfitting', 'Revenue Retained (Monthly) %'],
                                           ascending=[True, False])

In [None]:
compare_table_rev[select_cols].sort_values(by=['Overfitting', 'F1-score'],
                                           ascending=[True, False])