# Telecom Churn Analysis

To reduce customer churn, telecom companies need to predict which customers are at high risk of churning.
In this analysis, we analyse customer-level data from a leading telecom firm, build predictive models to identify customers at high risk of churn, and identify the main indicators of churn.

## Step 1: Reading and Understanding the Data

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('whitegrid')
plt.style.use('seaborn-deep')
plt.rcParams['font.family'] = 'sans-serif'
plt.rcParams['font.serif'] = 'Ubuntu'
plt.rcParams['font.monospace'] = 'Ubuntu Mono'
plt.rcParams['font.size'] = 10
plt.rcParams['axes.labelsize'] = 12
plt.rcParams['axes.titlesize'] = 12
plt.rcParams['xtick.labelsize'] = 8
plt.rcParams['ytick.labelsize'] = 8
plt.rcParams['legend.fontsize'] = 12
plt.rcParams['figure.titlesize'] = 14
plt.rcParams['figure.figsize'] = (12, 8)

pd.options.mode.chained_assignment = None
pd.options.display.float_format = '{:.2f}'.format
pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 300)
pd.set_option('display.width', 400)
import warnings
warnings.filterwarnings('ignore')
import sklearn.metrics as skm
import sklearn.model_selection as skms
import sklearn.preprocessing as skp
import sklearn.feature_selection as skfs
import sklearn.linear_model as sklm
import sklearn.decomposition as skd
import sklearn.ensemble as ske
import sklearn.tree as skt
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn import svm
import xgboost as xgb
import random
seed = 12
np.random.seed(seed)

from datetime import date

In [None]:
# important functions
def datasetShape(df):
    rows, cols = df.shape
    print("The dataframe has",rows,"rows and",cols,"columns.")
    
# select numerical and categorical features
def divideFeatures(df):
    numerical_features = df.select_dtypes(include=[np.number])
    categorical_features = df.select_dtypes(include=['object'])
    return numerical_features, categorical_features

def calc_missing(df):
    missing = df.isna().sum().sort_values(ascending=False)
    missing = missing[missing != 0]
    missing_perc = missing/df.shape[0]*100
    return missing, missing_perc

def plotCorrelation(cols, df, figsize=(20,10)):
    plt.figure(figsize=figsize)
    sns.heatmap(df[cols].corr(), cmap=sns.diverging_palette(20, 220, n=200), annot=True, center = 0)
    plt.show()
    
def plotAUC(y_true, y_pred_proba):
    fpr, tpr, threshold = skm.roc_curve(y_true, y_pred_proba[:,1])
    roc_auc = skm.auc(fpr, tpr)
    plt.figure(figsize=(6,6))
    plt.title('Receiver Operating Characteristic')
    plt.plot(fpr, tpr, 'b', label = 'AUC = %0.4f' % roc_auc)
    plt.legend(loc = 'lower right')
    plt.plot([0, 1], [0, 1],'r--')
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.show()

def display_scores(y_true, y_pred, y_pred_proba, plot=False):
    cfm = skm.confusion_matrix(y_true, y_pred)
    tn, fp, fn, tp = cfm.ravel()
    print(f"Accuracy Score: {round(skm.accuracy_score(y_true, y_pred)*100,4)}%")
    print(f"Sensitivity/Recall/TPR: {round(tp/(tp+fn)*100,4)}%")
    print(f"FPR: {round(fp/(tn+fp)*100,4)}%")
    print(f"Specificity: {round(tn/(tn+fp)*100,4)}%")
    print(f"Precision: {round(tp/(tp+fp)*100,4)}%")
    print(f"F1 Score: {round(skm.f1_score(y_true, y_pred)*100,4)}%")
    print(f"Prediction AUC Score: {round(skm.roc_auc_score(y_true, y_pred)*100,4)}%")
    print(f"Mean Square Error: {round(skm.mean_squared_error(y_true, y_pred),10)}")
    if plot:
        plotAUC(y_true, y_pred_proba)

# function for prediction metrics
def displayPredictionMetrics(model, X_train, y_train, showROC=False):
    y_pred_proba = model.predict_proba(X_train)
    y_pred = model.predict(X_train)
    display_scores(y_train, y_pred, y_pred_proba, showROC)
    
def plot_avgMonthlyCalls(data,calltype,colList):
    # create a color palette
    palette = plt.get_cmap('Set1')
    
    fig, ax = plt.subplots(figsize=(8,4))
    ax.plot(data[colList].mean())
    ax.set_xticklabels(['Jun','Jul','Aug'])

    # Add titles
    plt.title("Avg. "+calltype+" MOU  V/S Month", loc='left', fontsize=12, fontweight=0, color='orange')
    plt.xlabel("Month")
    plt.ylabel("Avg. "+calltype+" MOU")
    plt.show()
    
def plot_byChurnMou(data,colList, calltype):
    fig, ax = plt.subplots(figsize=(7,4))
    df=data.groupby(['churn'])[colList].mean().T
    plt.plot(df)
    ax.set_xticklabels(['Jun','Jul','Aug'])
    ## Add legend
    plt.legend(['Non-Churn', 'Churn'])
    # Add titles
    plt.title("Avg. "+calltype+" MOU  V/S Month", loc='left', fontsize=12, fontweight=0, color='orange')
    plt.xlabel("Month")
    plt.ylabel("Avg. "+calltype+" MOU")
    
def plot_byChurn(data, col):
    # per month churn vs Non-Churn
    fig, ax = plt.subplots(figsize=(7,4))
    colList = list(data.filter(regex=(col)).columns)
    colList = colList[:3]
    plt.plot(data.groupby('churn')[colList].mean().T)
    ax.set_xticklabels(['Jun','Jul','Aug'])
    ## Add legend
    plt.legend(['Non-Churn', 'Churn'])
    # Add titles
    plt.title( str(col) +" V/S Month", loc='left', fontsize=12, fontweight=0, color='orange')
    plt.xlabel("Month")
    plt.ylabel(col)
    plt.show()
    # Numeric stats for per month churn vs Non-Churn
    return data.groupby('churn')[colList].mean()

In [None]:
data_file = "/kaggle/input/telecom-churn-dataset/telecom_churn_data.csv"
telecom = pd.read_csv(data_file)
telecom.head()

In [None]:
# check dataset shape
datasetShape(telecom)

In [None]:
# check for duplicates
if(len(telecom) == len(telecom.mobile_number.unique())):
    print("No duplicates found!!")
else:
    print("Duplicates found")

## Step 2: Data Cleaning

In [None]:
# drop the mobile number
telecom.drop('mobile_number', axis=1, inplace=True)

# drop the duplicate rows
telecom.drop_duplicates(inplace=True)
datasetShape(telecom)

In [None]:
# remove all columns and rows having no values
telecom.dropna(axis=1, how="all", inplace=True)
telecom.dropna(axis=0, how="all", inplace=True)
datasetShape(telecom)

In [None]:
telecom.head()

In [None]:
print("Verifying percentage of NaN values in remaining columns")
print(telecom.isnull().mean().round(4) * 100)

In [None]:
# remove columns with more than 50% null values (keep columns that are at least 50% non-null)
telecom.dropna(thresh=int(np.ceil(telecom.shape[0]*0.5)), axis=1, inplace=True)
datasetShape(telecom)

### Drop Irrelevant Features

In [None]:
# divide features
numerical_features, categorical_features = divideFeatures(telecom)
telecom.head()

In [None]:
categorical_features.info()

**All object-type features are dates, so we drop every object-type column.**

In [None]:
featuresToDrop = categorical_features.columns.to_list()
telecom.drop(featuresToDrop, axis=1, inplace=True)
datasetShape(telecom)

### Missing Value Imputation

In [None]:
# plot missing values

missing, missing_perc = calc_missing(telecom)
missing.plot(kind='bar',figsize=(30,12))
plt.title('Missing Values')
plt.show()

In [None]:
# missing values with percentage
pd.concat([missing, missing_perc], axis=1, keys=['Total','Percent']).T

In [None]:
telecom[missing.index].head()

In [None]:
telecom[missing.index].describe()

**All of these columns are numerical and contain many outliers, and each has less than 8% missing values, so we impute the missing values with the column median.**
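
The preference for the median over the mean can be illustrated on a toy column. The sketch below uses a hypothetical `mou` column (not taken from the dataset) with one extreme value and two gaps; it only aims to show why the median is the more robust fill value when outliers are present.

```python
import numpy as np
import pandas as pd

# Hypothetical usage column: one extreme outlier and two missing values
toy = pd.DataFrame({"mou": [10.0, 12.0, 11.0, np.nan, 13.0, 5000.0, np.nan]})

mean_fill = toy["mou"].mean()      # pulled far upwards by the single outlier
median_fill = toy["mou"].median()  # stays close to typical usage

# Fill the gaps with the median, as the notebook does column by column
imputed = toy["mou"].fillna(median_fill)

print(f"mean fill value:   {mean_fill:.1f}")
print(f"median fill value: {median_fill:.1f}")
```

Filling with the mean here would assign an untypically large value to the two missing rows, while the median keeps them in line with the bulk of the data.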

In [None]:
colsToImputeWithMedian = missing.index.to_list()
for col in colsToImputeWithMedian:
    telecom.loc[telecom[col].isna(), col] = telecom[col].median()

In [None]:
missing, missing_perc = calc_missing(telecom)
print("Any Missing Values?",missing.values)

All missing values have been treated.

### Drop single valued features

In [None]:
singleValuedFeatures = []
for x in telecom.columns.to_list():
    if len(telecom[x].unique()) == 1:
        singleValuedFeatures.append(x)

In [None]:
telecom.drop(singleValuedFeatures, axis=1, inplace=True)
datasetShape(telecom)

### Derive New Features

**New column: average total recharge amount over months 6 and 7**

In [None]:
telecom['avg_amt_m6m7'] = (telecom['total_rech_amt_6']+telecom['total_rech_amt_7'])/2

**Tagging churned customers using total call and data usage in month 9**

In [None]:
# mapping churn as 1 and not churn as 0
telecom['churn'] = ((telecom['total_ic_mou_9']==0) & (telecom['total_og_mou_9']==0) & (telecom['vol_2g_mb_9']==0) & (telecom['vol_3g_mb_9']==0)).map({True:1,False:0})
telecom['churn'].value_counts()/telecom.shape[0]*100
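
As a small illustration of the tagging rule above, the sketch below applies the same condition to three made-up customers. The column names match the notebook; the values are invented for the example.

```python
import pandas as pd

# Three hypothetical customers with month-9 usage
toy = pd.DataFrame({
    "total_ic_mou_9": [0.0, 12.5, 0.0],
    "total_og_mou_9": [0.0, 0.0, 0.0],
    "vol_2g_mb_9":    [0.0, 0.0, 3.2],
    "vol_3g_mb_9":    [0.0, 0.0, 0.0],
})

usage_cols = ["total_ic_mou_9", "total_og_mou_9", "vol_2g_mb_9", "vol_3g_mb_9"]

# A customer is tagged as churned (1) only if every usage column is zero
toy["churn"] = (toy[usage_cols] == 0).all(axis=1).astype(int)
print(toy)
```

Only the first customer is tagged as churned: the second still receives calls and the third still uses 2G data.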

### Filter High Value Customers

In [None]:
# filter high-value customers: avg_amt_m6m7 at or above the 70th percentile
telecom = telecom[telecom["avg_amt_m6m7"] >= telecom['avg_amt_m6m7'].quantile(0.7)]
telecom["avg_amt_m6m7"].describe()
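
The percentile filter can be checked on a toy frame. This is a minimal sketch with ten invented recharge averages; `quantile(0.7)` uses pandas' default linear interpolation, which is an assumption worth keeping in mind when comparing cutoffs across tools.

```python
import pandas as pd

# Ten hypothetical customers with average recharge amounts 100, 200, ..., 1000
toy = pd.DataFrame({"avg_amt_m6m7": [100 * i for i in range(1, 11)]})

# 70th percentile of the average recharge amount
cutoff = toy["avg_amt_m6m7"].quantile(0.7)

# Keep customers at or above the cutoff, mirroring the notebook's filter
high_value = toy[toy["avg_amt_m6m7"] >= cutoff]

print("cutoff:", cutoff)
print("customers kept:", len(high_value), "of", len(toy))
```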

### Drop columns for the churn phase (month 9)

In [None]:
remainingCols = [x for x in telecom.columns.to_list() if '_9' not in x]
remainingCols.remove('sep_vbc_3g')
telecom = telecom[remainingCols]
datasetShape(telecom)

## Step 3: Data Visualization - EDA

In [None]:
numerical_features, categorical_features = divideFeatures(telecom)
telecom.head()

### Univariate Analysis

In [None]:
# check the share of churners in the data using countplot
plt.figure(figsize=(16,3))
sns.countplot(y="churn", data=telecom)
plt.show()
telecom["churn"].value_counts()/telecom.shape[0]*100

In [None]:
# plot aon with histplot
aon_bins = [0, 365, 730, 1095, 1460, 1825, 2190, 2555, 2920, 3285, 3650, 5015]
bucket_l = ["Year "+str(x) for x in range(1,12)]
aon_bins_data = pd.cut(telecom.aon, aon_bins, labels=bucket_l)
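
To make the bucketing concrete, the sketch below runs the same bins and labels on four invented `aon` values. Note that `pd.cut` uses right-closed intervals by default, so 365 days still falls in "Year 1", and that the last bucket spans 3650 to 5015 days rather than a single year.

```python
import pandas as pd

aon_bins = [0, 365, 730, 1095, 1460, 1825, 2190, 2555, 2920, 3285, 3650, 5015]
bucket_l = ["Year " + str(x) for x in range(1, 12)]

# Hypothetical ages on network, in days
toy_aon = pd.Series([200, 365, 366, 4000])

# Intervals are (0, 365], (365, 730], ... so the right edge belongs to the bucket
toy_buckets = pd.cut(toy_aon, aon_bins, labels=bucket_l)
print(toy_buckets.tolist())
```

Values outside the outermost edges (for example an `aon` of 0 or above 5015) would come back as NaN.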

In [None]:
aon_bins_data.hist()
plt.show()

**Most customers joined the network within the last 4 years.**

In [None]:
# Plotting avg. total monthly incoming MOU by month
ic_col = telecom.filter(regex ='total_ic_mou').columns
plot_avgMonthlyCalls(telecom, calltype='incoming', colList=ic_col)

In [None]:
# Plotting avg. total monthly outgoing MOU by month
og_col = telecom.filter(regex ='total_og_mou').columns
plot_avgMonthlyCalls(telecom, calltype='outgoing', colList=og_col)

In [None]:
# boxplots of numerical features for outlier detection

fig = plt.figure(figsize=(16,100))
for i in range(len(numerical_features.columns)):
    fig.add_subplot(34, 4, i+1)
    sns.boxplot(y=numerical_features.iloc[:,i])
plt.tight_layout()
plt.show()

There are outliers in many features. These outliers will be treated in the Data Preparation step.

In [None]:
# histograms for categorical data

cat_features = ['monthly_2g_6','monthly_2g_7','monthly_2g_8','monthly_3g_6','monthly_3g_7','monthly_3g_8', 'churn']
fig = plt.figure(figsize=(16,10))
for i in range(len(cat_features)):
    fig.add_subplot(3, 3, i+1)
    telecom[cat_features].iloc[:,i].hist()
    plt.xlabel(cat_features[i])
plt.tight_layout()
plt.show()

Some patterns are visible in the categorical data plotted above; they will help in identifying useful features.

### Bivariate Analysis

In [None]:
# graph for incoming and outgoing in month 6,7,8 by churn
ic_col = ['total_ic_mou_6','total_ic_mou_7', 'total_ic_mou_8']
og_col = ['total_og_mou_6','total_og_mou_7', 'total_og_mou_8']
plot_byChurnMou(telecom, ic_col, 'Incoming')
plot_byChurnMou(telecom, og_col, 'Outgoing')

In [None]:
# graph for total recharge amount in month 6,7,8 by churn
plot_byChurn(telecom,'total_rech_amt')

In [None]:
# graph for arpu in month 6,7,8 by churn
plot_byChurn(telecom,'arpu')

In [None]:
colsToPlot = ['arpu_6','arpu_7','arpu_8','onnet_mou_6','onnet_mou_7','onnet_mou_8','offnet_mou_6','offnet_mou_7','offnet_mou_8','loc_og_mou_6','loc_og_mou_7','loc_og_mou_8','total_og_mou_6','total_og_mou_7','total_og_mou_8','total_ic_mou_6','total_ic_mou_7','total_ic_mou_8','avg_amt_m6m7','aon']
# plot scatter plots for all major numerical features

sns.pairplot(telecom[colsToPlot], height=3)
plt.show()

There are some correlated features in the dataset. We will use these features in model building.

### MultiVariate Analysis

In [None]:
# correlation within outgoing minutes of usage for month 6
og_mou_6 = telecom.columns[telecom.columns.str.contains('.*_og_.*mou_6',regex=True)]
plotCorrelation(og_mou_6, telecom)

In [None]:
# correlation within outgoing minutes of usage for month 7
og_mou_7 = telecom.columns[telecom.columns.str.contains('.*_og_.*mou_7',regex=True)]
plotCorrelation(og_mou_7, telecom)

In [None]:
# correlation within incoming minutes of usage for month 6
ic_mou_6 = telecom.columns[telecom.columns.str.contains('.*_ic_.*mou_6',regex=True)]
plotCorrelation(ic_mou_6, telecom)

In [None]:
# correlation within incoming minutes of usage for month 7
ic_mou_7 = telecom.columns[telecom.columns.str.contains('.*_ic_.*mou_7',regex=True)]
plotCorrelation(ic_mou_7, telecom)

Correlation will be used for feature selection.

## Step 4: Data Preparation

### Outlier Treatment

Many numerical features are skewed. We will take the log of these feature values using `np.log1p()`.

In [None]:
# Checking outliers at 25%,50%,75%,90%,95% and 99%
telecom.describe(percentiles=[.25,.5,.75,.90,.95,.99])

**All the features have outliers. We will not remove any outlier records; instead, we will treat the skewed features.**

In [None]:
# plot sample skewed feature
plt.figure(figsize=(16,4))
sns.histplot(telecom.loc_og_mou_7, kde=True)
plt.show()

In [None]:
# extract all skewed features
temp_numerical_features, temp_categorical_features = divideFeatures(telecom)
# remove categorical features stored as int
temp_numerical_features.drop(cat_features, axis=1, inplace=True)
temp_numerical_features.drop(['arpu_6','arpu_7','arpu_8'], axis=1, inplace=True)
skewed_features = temp_numerical_features.apply(lambda x: x.skew()).sort_values(ascending=False)

In [None]:
# transform skewed features
for feat in skewed_features.index:
    # features whose skewness exceeds 0.5 are transformed
    if skewed_features.loc[feat] > 0.5:
        telecom[feat] = np.log1p(telecom[feat])
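
The effect of the transform can be sketched on synthetic data. The example below draws a right-skewed sample (a lognormal distribution, chosen here purely for illustration) and compares the skewness before and after `np.log1p`; `np.expm1` is the exact inverse if original units are needed later.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed feature (assumption: lognormal stands in for usage data)
rng = np.random.default_rng(12)
toy = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=1000))

skew_before = toy.skew()

# log1p computes log(1 + x), so zero usage maps to zero
transformed = np.log1p(toy)
skew_after = transformed.skew()

# expm1 undoes log1p
restored = np.expm1(transformed)

print(f"skew before: {skew_before:.2f}")
print(f"skew after:  {skew_after:.2f}")
```

`np.log1p` is only defined for values greater than -1, which is one reason to exclude columns that can be negative (such as the `arpu_*` columns dropped from the skew check above).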

In [None]:
# plot sample treated feature
plt.figure(figsize=(16,4))
sns.histplot(telecom.loc_og_mou_7, kde=True)
plt.show()

In [None]:
# check for heavily imbalanced categorical features
cat_features.remove('churn')
def getCategoricalSkewed(categories, threshold):
    tempSkewedFeatures = []
    for feat in categories:
        for featValuePerc in list(telecom[feat].value_counts()/telecom.shape[0]):
            if featValuePerc > threshold:
                tempSkewedFeatures.append(feat)
    return list(set(tempSkewedFeatures))

# display all categorical skewed features which have value_counts > 90%
categoricalSkewed = getCategoricalSkewed(cat_features, .90)
if len(categoricalSkewed) > 0:
    for feat in categoricalSkewed:
        print(f"Value distribution (%) of {feat}:")
        print(telecom[feat].value_counts()/len(telecom)*100)
else:
    print("No Categorical Skewed variables")

### Derive New Features

We will create new features by averaging each feature over months 6 and 7.

In [None]:
newFeaturesAdded = []
newFeatures = telecom.filter(regex='_6|_7').columns.str[:-2]
for idx, col in enumerate(newFeatures.unique()):
    newF = "avg_"+col+"_avg67"
    newFeaturesAdded.append(newF)
    telecom[newF] = (telecom[col+"_6"]  + telecom[col+"_7"])/ 2
datasetShape(telecom)
telecom[newFeaturesAdded].head()

### Split Train-Test Data

In [None]:
# shuffle samples
df_shuffle = telecom.sample(frac=1, random_state=seed).reset_index(drop=True)

In [None]:
df_y = df_shuffle.pop('churn')
df_X = df_shuffle.copy()

# split into train and test
X_train, X_test, y_train, y_test = skms.train_test_split(df_X, df_y, train_size=0.75, random_state=seed)
print(f"Train set has {X_train.shape[0]} records out of {len(df_shuffle)} which is {round(X_train.shape[0]/len(df_shuffle)*100)}%")
print(f"Test set has {X_test.shape[0]} records out of {len(df_shuffle)} which is {round(X_test.shape[0]/len(df_shuffle)*100)}%")

### Feature Scaling

In [None]:
scaler = skp.StandardScaler()
numerical_features, categorical_features = divideFeatures(X_train)

# fit the scaler on the training data only and scale all numerical variables
X_train[numerical_features.columns] = scaler.fit_transform(X_train[numerical_features.columns])

# scale test data with transform()
X_test[numerical_features.columns] = scaler.transform(X_test[numerical_features.columns])

# view sample data
X_train.describe()
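
The fit-on-train, transform-on-test pattern can be written out by hand. This is a NumPy-only sketch on synthetic data; it mirrors what `StandardScaler` computes (column mean and population standard deviation) rather than calling it.

```python
import numpy as np

rng = np.random.default_rng(12)
train = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # synthetic training features
test = rng.normal(loc=50.0, scale=10.0, size=(50, 3))    # synthetic test features

# Statistics are learned from the training set only
mu = train.mean(axis=0)
sigma = train.std(axis=0)  # ddof=0, the convention StandardScaler follows

train_scaled = (train - mu) / sigma
# The test set reuses the training statistics, so no test information leaks in
test_scaled = (test - mu) / sigma

print("train mean after scaling:", train_scaled.mean(axis=0).round(6))
print("test mean after scaling: ", test_scaled.mean(axis=0).round(3))
```

The scaled test columns are close to, but not exactly, zero mean and unit variance, which is expected because their statistics were never used.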

## Model A - High Performance Model Using PCA

### Find Principal Components 

In [None]:
pca = skd.PCA(random_state=seed)
pca.fit(X_train)

#### Components from the PCA

In [None]:
pca.components_

Looking at the explained variance ratio for each component

In [None]:
pca.explained_variance_ratio_

Making a scree plot for the explained variance

In [None]:
var_cumu = np.cumsum(pca.explained_variance_ratio_)

In [None]:
fig = plt.figure(figsize=[12,8])
plt.vlines(x=65, ymax=1, ymin=0, colors="r", linestyles="--")
plt.hlines(y=0.95, xmax=150, xmin=0, colors="g", linestyles="--")
plt.plot(var_cumu)
plt.ylabel("Cumulative variance explained")
plt.show()

**65 components explain about 95% of the variance in the data.**
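
Rather than reading the component count off the plot, it can be computed from the cumulative variance. The sketch below does this with NumPy on synthetic data (three latent factors mixed into ten columns, an assumption made only for the example); with the notebook's fitted `pca`, the same two last lines would apply to `pca.explained_variance_ratio_`.

```python
import numpy as np

# Synthetic data: 3 latent factors spread over 10 correlated columns plus small noise
rng = np.random.default_rng(12)
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(300, 10))

# PCA variance ratios from the singular values of the centred data
Xc = X - X.mean(axis=0)
singular_values = np.linalg.svd(Xc, compute_uv=False)
explained_variance_ratio = singular_values**2 / np.sum(singular_values**2)

# Smallest number of components whose cumulative variance reaches 95%
var_cumu = np.cumsum(explained_variance_ratio)
n_components_95 = int(np.argmax(var_cumu >= 0.95) + 1)

print("components needed for 95% variance:", n_components_95)
```

scikit-learn's `PCA` can also pick this number itself when `n_components` is given as a fraction between 0 and 1 with `svd_solver='full'`.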

In [None]:
pca_final = skd.IncrementalPCA(n_components=65)
df_train_pca = pca_final.fit_transform(X_train)

Plotting the heatmap of the correlation matrix

In [None]:
corrmat = np.corrcoef(df_train_pca.transpose())
plt.figure(figsize=[40,20])
sns.heatmap(corrmat, annot=True, center = 0)
plt.show()

**The principal components are nearly uncorrelated, so multicollinearity has been removed from the dataset and the models should be more stable.**

Applying the transformation on the test set

In [None]:
df_test_pca = pca_final.transform(X_test)
datasetShape(df_test_pca)

## Step 5: Data Modelling

**Using class_weight='balanced' in Learning Algorithms to handle class imbalance problem.**
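
For reference, the weights implied by `class_weight='balanced'` follow the formula `n_samples / (n_classes * count_per_class)` given in the scikit-learn documentation. The sketch below evaluates it on an invented 90/10 label split and also computes the negative-to-positive ratio that the notebook later passes to XGBoost as `scale_pos_weight`.

```python
import numpy as np

# Hypothetical labels: 90 non-churners (0) and 10 churners (1)
y_toy = np.array([0] * 90 + [1] * 10)

classes, counts = np.unique(y_toy, return_counts=True)

# 'balanced' weights: n_samples / (n_classes * count of each class)
balanced_weights = len(y_toy) / (len(classes) * counts)
class_weight = dict(zip(classes.tolist(), balanced_weights.tolist()))

# Ratio of negatives to positives, the quantity used for scale_pos_weight
scale_pos_weight = counts[0] / counts[1]

print("class weights:", class_weight)
print("scale_pos_weight:", scale_pos_weight)
```

The minority class receives a proportionally larger weight, so each misclassified churner costs more during training.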

**Using scoring='recall' for every GridSearch to tune the hyperparameters in order to improve the sensitivity.**
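
A toy confusion matrix shows why recall, rather than accuracy, is the tuning target on imbalanced churn data. The labels and predictions below are invented for the illustration.

```python
import numpy as np

# Hypothetical ground truth (1 = churn) and model predictions
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # churners caught
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # churners missed
tn = int(np.sum((y_true == 0) & (y_pred == 0)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))

sensitivity = tp / (tp + fn)   # recall / true positive rate
specificity = tn / (tn + fp)
accuracy = (tp + tn) / len(y_true)

# Baseline that predicts "no churn" for everyone
baseline_pred = np.zeros_like(y_true)
baseline_accuracy = float(np.mean(baseline_pred == y_true))
baseline_sensitivity = float(np.sum((y_true == 1) & (baseline_pred == 1)) / np.sum(y_true == 1))

print(f"model:    accuracy={accuracy:.2f}, sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
print(f"baseline: accuracy={baseline_accuracy:.2f}, sensitivity={baseline_sensitivity:.2f}")
```

The do-nothing baseline still scores a respectable accuracy while catching no churners at all, which is the failure mode that recall-based tuning guards against.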

### Model 1 - Random Forest Classifier Model

**Initial Model Building**

In [None]:
# fit model
rfc = ske.RandomForestClassifier(class_weight='balanced', n_jobs=-1, random_state=seed)
rfc.fit(df_train_pca,y_train)

In [None]:
# find train prediction metrics
displayPredictionMetrics(rfc, df_train_pca, y_train)

In [None]:
# find test prediction metrics
displayPredictionMetrics(rfc, df_test_pca, y_test, True)

**HyperParameter Tuning**

In [None]:
def tune_rfc_hyperparameter(model, parameters,x_train,y_train,n_folds = 5):
    
    m = skms.GridSearchCV(model, parameters, cv=n_folds, n_jobs=-1, scoring="recall", return_train_score=True, verbose=3)
    m.fit(x_train, y_train)
    scores = m.cv_results_

    # name of the (first) hyperparameter being tuned
    hyperparameter = next(iter(parameters))

    print('We can get the sensitivity of',m.best_score_,'using',m.best_params_)
    
    # plotting recall (sensitivity) for each parameter value
    plt.figure(figsize=(16,5))
    plt.plot(scores["param_"+hyperparameter], scores["mean_train_score"], label="training recall")
    plt.plot(scores["param_"+hyperparameter], scores["mean_test_score"], label="validation recall")
    plt.xlabel(hyperparameter)
    plt.ylabel("Recall")
    plt.legend()
    plt.show()

In [None]:
rf_hyper_init = ske.RandomForestClassifier(class_weight='balanced', random_state=seed)

In [None]:
# tuning max_depth
parameters = {'max_depth': range(2, 30, 5)}
tune_rfc_hyperparameter(rf_hyper_init, parameters,df_train_pca,y_train)

In [None]:
# tuning n_estimators
parameters = {'n_estimators': [100, 200, 300]}
tune_rfc_hyperparameter(rf_hyper_init, parameters,df_train_pca,y_train)

In [None]:
# tuning max_features
parameters = {'max_features': [40, 50, 60]}
tune_rfc_hyperparameter(rf_hyper_init, parameters,df_train_pca,y_train)

In [None]:
# tuning min_samples_leaf
parameters = {'min_samples_leaf': range(10, 80, 15)}
tune_rfc_hyperparameter(rf_hyper_init, parameters,df_train_pca,y_train)

In [None]:
# tuning min_samples_split
parameters = {'min_samples_split': range(20, 110, 20)}
tune_rfc_hyperparameter(rf_hyper_init, parameters,df_train_pca,y_train)

In [None]:
# tuning all final hyperparameters for rfc

parameters = {
    'max_depth': [2, 7, 12],
    'n_estimators': [100],
    'max_features': [55],
    'min_samples_leaf': [50,70],
    'min_samples_split': [100]}

rfc_hyper = ske.RandomForestClassifier(class_weight='balanced', random_state=seed)

# cross validation
model_cv_rfc_hyper = skms.GridSearchCV(estimator = rfc_hyper, n_jobs=-1, param_grid = parameters, 
                             scoring= 'recall', cv = 3, return_train_score=True, verbose = 3)            
model_cv_rfc_hyper.fit(df_train_pca, y_train)

In [None]:
# display all final tuned hyper parameters for rfc
print('We can get the sensitivity of',model_cv_rfc_hyper.best_score_,'using',model_cv_rfc_hyper.best_params_)

`We will evaluate this classifier in Model Evaluation step.`

### Model 2 - Support Vector Machine Model

**Initial Model Building**

In [None]:
y_train_svc = y_train.map({1:1, 0:-1})
y_test_svc = y_test.map({1:1, 0:-1})

In [None]:
# fit model
svc = svm.SVC(class_weight='balanced', probability=True, kernel='rbf')
svc.fit(df_train_pca,y_train_svc)

In [None]:
# find train prediction metrics
displayPredictionMetrics(svc, df_train_pca, y_train_svc)

In [None]:
# find test prediction metrics
displayPredictionMetrics(svc, df_test_pca, y_test_svc, True)

**HyperParameter Tuning**

In [None]:
def plot_svc_hyperparameters(scores, param):
    gamma = scores[scores['param_gamma']==param]
    plt.plot(gamma["param_C"], gamma["mean_test_score"])
    plt.plot(gamma["param_C"], gamma["mean_train_score"])
    plt.xlabel('C')
    plt.ylabel('Recall')
    plt.title("Gamma="+str(param))
    plt.ylim([0, 1])
    plt.legend(['test_score', 'training_score'])
    plt.xscale('log')

In [None]:
# tuning all final hyperparameters for svc
gamma = [1e-1,1e-2, 1e-3, 1e-4]
C = [1, 10, 100, 1000]
parameters = {'gamma': gamma, 'C': C}
svc_hyper = svm.SVC(class_weight='balanced', probability=True, kernel='rbf')

# cross validation
model_cv_svc_hyper = skms.GridSearchCV(estimator = svc_hyper, n_jobs=-1, param_grid = parameters, 
                             scoring= 'recall', cv = 3, return_train_score=True, verbose = 3)            
model_cv_svc_hyper.fit(df_train_pca, y_train_svc)

In [None]:
# display cv scores and plot recall vs C for each gamma
svc_scores = pd.DataFrame(model_cv_svc_hyper.cv_results_)
svc_scores['param_C'] = svc_scores['param_C'].astype(float)

plt.figure(figsize=(16,5))

for x, g in enumerate(gamma):
    plt.subplot(1, 4, x+1)
    plot_svc_hyperparameters(svc_scores, g)
plt.show()

In [None]:
# display all final tuned hyper parameters for svc
print('We can get the sensitivity of',model_cv_svc_hyper.best_score_,'using',model_cv_svc_hyper.best_params_)

`We will evaluate this classifier in Model Evaluation step.`

### Model 3 - Gradient Boosting Model

**Initial Model Building**

In [None]:
# fit model
xgbc = xgb.XGBClassifier(scale_pos_weight=(y_train.value_counts()[0]/y_train.value_counts()[1]), n_jobs=-1)
xgbc.fit(df_train_pca,y_train)

In [None]:
# find train prediction metrics
displayPredictionMetrics(xgbc, df_train_pca, y_train)

In [None]:
# find test prediction metrics
displayPredictionMetrics(xgbc, df_test_pca, y_test, True)

**HyperParameter Tuning**

In [None]:
def plot_xgb_hyperparameters(scores, param):
    lr = scores[scores['param_subsample']==param]
    plt.plot(lr["param_learning_rate"], lr["mean_test_score"])
    plt.plot(lr["param_learning_rate"], lr["mean_train_score"])
    plt.xlabel('learning_rate')
    plt.ylabel('Recall')
    plt.title("Subsample="+str(param))
    plt.ylim([0.4, 1])
    plt.legend(['test_score', 'training_score'])
    plt.xscale('log')

In [None]:
# tuning all final hyperparameters for xgb
learning_rate = [0.1,0.2,0.3]
subsample = [0.3,0.4,0.5]
parameters = {'learning_rate': learning_rate, 'subsample': subsample}
xgb_hyper = xgb.XGBClassifier(scale_pos_weight=(y_train.value_counts()[0]/y_train.value_counts()[1]))

# cross validation
model_cv_xgb_hyper = skms.GridSearchCV(estimator = xgb_hyper, n_jobs=-1, param_grid = parameters, 
                             scoring= 'recall', cv = 3, return_train_score=True, verbose = 3)
model_cv_xgb_hyper.fit(df_train_pca, y_train)

In [None]:
# display cv scores and plot recall vs learning_rate for each subsample value
xgb_scores = pd.DataFrame(model_cv_xgb_hyper.cv_results_)
xgb_scores['param_learning_rate'] = xgb_scores['param_learning_rate'].astype(float)

plt.figure(figsize=(16,5))

for x, s in enumerate(subsample):
    plt.subplot(1, 3, x+1)
    plot_xgb_hyperparameters(xgb_scores, s)
plt.show()

In [None]:
# display all final tuned hyper parameters for xgb
print('We can get the sensitivity of',model_cv_xgb_hyper.best_score_,'using',model_cv_xgb_hyper.best_params_)

`We will evaluate this classifier in Model Evaluation step.`

## Step 6: Model Evaluation

### Model 1 - Random Forest Classifier Model Evaluation

In [None]:
print("Evaluating RFC Model with best parameters on Train Dataset",model_cv_rfc_hyper.best_params_,"\n")
displayPredictionMetrics(model_cv_rfc_hyper, df_train_pca, y_train)

In [None]:
print("Evaluating RFC Model with best parameters on Test Dataset",model_cv_rfc_hyper.best_params_,"\n")
displayPredictionMetrics(model_cv_rfc_hyper, df_test_pca, y_test, True)

### Model 2 - Support Vector Machine Model Evaluation

In [None]:
print("Evaluating SVC Model with best parameters on Train Dataset",model_cv_svc_hyper.best_params_,"\n")
displayPredictionMetrics(model_cv_svc_hyper, df_train_pca, y_train_svc)

In [None]:
print("Evaluating SVC Model with best parameters on Test Dataset",model_cv_svc_hyper.best_params_,"\n")
displayPredictionMetrics(model_cv_svc_hyper, df_test_pca, y_test_svc, True)

### Model 3 - XGBoost Model Evaluation

In [None]:
print("Evaluating XGB Model with best parameters on Train Dataset",model_cv_xgb_hyper.best_params_,"\n")
displayPredictionMetrics(model_cv_xgb_hyper, df_train_pca, y_train)

In [None]:
print("Evaluating XGB Model with best parameters on Test Dataset",model_cv_xgb_hyper.best_params_,"\n")
displayPredictionMetrics(model_cv_xgb_hyper, df_test_pca, y_test, True)

## High Performance Model Observations

`The sensitivity score of RandomForestClassifier model for Train is 74.8341% and Test is 77.918%`

`The sensitivity score of SupportVectorClassifier model for Train is 84.9923% and Test is 84.2271%`

`The sensitivity score of XGBClassifier model for Train is 96.0184% and Test is 71.1356%`

| Classifier              | Train Sensitivity | Test Sensitivity |
|-------------------------|-------------------|------------------|
| RandomForestClassifier  | 74.8341%          | 77.918%          |
| SupportVectorClassifier | 84.9923%          | **84.2271%**     |
| XGBClassifier           | 96.0184%          | 71.1356%         |

**The SupportVectorClassifier model is stable and performs well, with a test sensitivity of 84.2271%. It can be used in a production environment.**
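
The stability argument can be made explicit by computing the train-test gap from the sensitivities reported in the table above. This is a small sketch over those reported numbers only; it does not re-run any model.

```python
import pandas as pd

# Sensitivities (%) copied from the observations table
results = pd.DataFrame(
    {"train": [74.8341, 84.9923, 96.0184],
     "test": [77.918, 84.2271, 71.1356]},
    index=["RandomForestClassifier", "SupportVectorClassifier", "XGBClassifier"],
)

# A large positive gap suggests the model overfits the training data
results["gap"] = results["train"] - results["test"]

print(results.sort_values("test", ascending=False))
```

On these figures the XGBClassifier shows by far the largest gap, while the SupportVectorClassifier combines the highest test sensitivity with a gap of under one percentage point.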

## Model B - Feature Importance Model Without PCA

## Step 5: Data Modelling

### Logistic Regression Model with all Features

In [None]:
# find stats for model using all features
lr = sklm.LogisticRegression(random_state=seed, class_weight='balanced', n_jobs=-1)
lr.fit(X_train,y_train)
displayPredictionMetrics(lr, X_train, y_train)

### RFE Feature Selection

In [None]:
lr_rfe = sklm.LogisticRegression(random_state=seed, class_weight='balanced', n_jobs=-1, max_iter=500, verbose=1)
rfe = skfs.RFE(lr_rfe, n_features_to_select=30)
rfe.fit(X_train, y_train)

In [None]:
rfeCols = X_train.columns[rfe.support_]
X_train_rfe = X_train[rfeCols].copy()
X_test_rfe = X_test[rfeCols].copy()
print("Selected features by RFE are",list(rfeCols))

In [None]:
# check metrics by building model with RFE selected features
lr = sklm.LogisticRegression(random_state=seed, class_weight='balanced', n_jobs=-1)
lr.fit(X_train_rfe,y_train)
displayPredictionMetrics(lr, X_train_rfe, y_train)

#### Manual Feature Selection

Now we will proceed with **manual feature selection** by building the model using the **statsmodels** library and checking **p-values & VIF**
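
As a reminder of what VIF measures, the sketch below computes it from first principles: each feature is regressed on the others and VIF = 1 / (1 - R²). The helper `vif_by_hand` and the synthetic columns are hypothetical and exist only for this illustration; unlike the notebook's `variance_inflation_factor` call, which uses the design matrix as given, this version adds an intercept, so the numbers may differ slightly.

```python
import numpy as np

def vif_by_hand(X):
    """Return one VIF per column of the 2-D array X (hypothetical helper)."""
    n_rows, n_cols = X.shape
    vifs = []
    for i in range(n_cols):
        target = X[:, i]
        others = np.delete(X, i, axis=1)
        # Regress column i on the remaining columns plus an intercept
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1.0 - resid.var() / target.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic features: x2 is almost a copy of x1, x3 is independent
rng = np.random.default_rng(12)
x1 = rng.normal(size=500)
x2 = x1 + 0.1 * rng.normal(size=500)
x3 = rng.normal(size=500)

vifs = vif_by_hand(np.column_stack([x1, x2, x3]))
print(dict(zip(["x1", "x2", "x3"], vifs.round(2))))
```

The two near-duplicate columns get very large VIFs while the independent one stays close to 1, which is the pattern that motivates dropping one feature at a time in the cells that follow.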

In [None]:
lr_glm = sm.GLM(y_train, X_train_rfe, family = sm.families.Binomial())
lr_glm_model = lr_glm.fit()

In [None]:
# function to drop the mentioned feature variable, build the model and print the summary and VIF
def displaySummary(dropFeature=None):
    # update variable from global scope
    global X_train_rfe, y_train
    
    # dropping variable with very high VIF
    if dropFeature is not None:
        X_train_rfe.drop(dropFeature, axis=1, inplace=True)
        X_test_rfe.drop(dropFeature, axis=1, inplace=True)
        print(f"Removed feature: {dropFeature} \n")
        
    # print model metrics for current available features
    lr = sklm.LogisticRegression(random_state=seed, class_weight='balanced', n_jobs=-1)
    lr.fit(X_train_rfe, y_train)
    displayPredictionMetrics(lr, X_train_rfe, y_train)

    # fit the logistic regression as a binomial GLM
    lr_glm = sm.GLM(y_train, X_train_rfe, family = sm.families.Binomial())
    lr_glm_model = lr_glm.fit()
    
    # check the summary of the fitted GLM
    print(lr_glm_model.summary())
    
    # Calculate the VIFs for the new model after removing constant
    if 'const' in X_train_rfe.columns:
        X_train_rfe = X_train_rfe.drop(['const'], axis=1)
    vif = pd.DataFrame()
    X = X_train_rfe
    vif['Features'] = X.columns
    vif['VIF'] = [round(variance_inflation_factor(X.values, i), 2) for i in range(X.shape[1])]
    vif = vif.sort_values(by = "VIF", ascending = False).set_index('Features')
    print(vif)

In [None]:
displaySummary()

In [None]:
displaySummary('arpu_7')

In [None]:
displaySummary('monthly_3g_8')

In [None]:
displaySummary('avg_vol_3g_mb_avg67')

In [None]:
displaySummary('arpu_8')

In [None]:
displaySummary('loc_ic_mou_8')

In [None]:
displaySummary('loc_og_mou_8')

In [None]:
displaySummary('avg_loc_ic_t2m_mou_avg67')

In [None]:
displaySummary('total_rech_amt_8')

In [None]:
displaySummary('std_og_t2m_mou_6')

In [None]:
displaySummary('onnet_mou_7')

In [None]:
displaySummary('total_og_mou_8')

In [None]:
displaySummary('onnet_mou_8')

In [None]:
displaySummary('roam_og_mou_7')

In [None]:
displaySummary('avg_max_rech_amt_avg67')

In [None]:
displaySummary('loc_og_mou_7')

In [None]:
displaySummary('vol_3g_mb_8')

***All the features' p-values and VIFs are now acceptable. <br>This set of features is fine for proceeding to model selection.***

### Correlation Heatmap

In [None]:
# plot correlation among final selected features
plt.figure(figsize = (20, 10))
sns.heatmap(X_train_rfe.corr(), annot = True, linewidths=.2, cmap="YlGnBu")
plt.show()

### Random Forest Classifier Model Building

In [None]:
# fit model
rfc_nopca = ske.RandomForestClassifier(class_weight='balanced', n_jobs=-1, random_state=seed)
rfc_nopca.fit(X_train_rfe,y_train)

In [None]:
# find train prediction metrics
displayPredictionMetrics(rfc_nopca, X_train_rfe, y_train)

In [None]:
# find test prediction metrics
displayPredictionMetrics(rfc_nopca, X_test_rfe, y_test, True)

**HyperParameter Tuning**

In [None]:
rf_hyper_init_nopca = ske.RandomForestClassifier(class_weight='balanced', random_state=seed)

In [None]:
# tuning max_depth
parameters = {'max_depth': range(2, 30, 5)}
tune_rfc_hyperparameter(rf_hyper_init_nopca, parameters, X_train_rfe, y_train)

In [None]:
# tuning n_estimators
parameters = {'n_estimators': range(100, 800, 200)}
tune_rfc_hyperparameter(rf_hyper_init_nopca, parameters, X_train_rfe, y_train)

In [None]:
# tuning max_features
parameters = {'max_features': [8, 10, 12, 14]}
tune_rfc_hyperparameter(rf_hyper_init_nopca, parameters, X_train_rfe, y_train)

In [None]:
# tuning min_samples_leaf
parameters = {'min_samples_leaf': range(1, 50, 10)}
tune_rfc_hyperparameter(rf_hyper_init_nopca, parameters, X_train_rfe, y_train)

In [None]:
# tuning min_samples_split
parameters = {'min_samples_split': range(10, 100, 10)}
tune_rfc_hyperparameter(rf_hyper_init_nopca, parameters, X_train_rfe, y_train)

In [None]:
# tuning all final hyperparameters for rfc

parameters = {
    'max_depth': [7],
    'n_estimators': [100],
    'max_features': [12],
    'min_samples_leaf': [41],
    'min_samples_split': [70,90]}

rfc_hyper_nopca = ske.RandomForestClassifier(class_weight='balanced', random_state=seed)

# cross validation
model_cv_rfc_hyper_nopca = skms.GridSearchCV(estimator = rfc_hyper_nopca, n_jobs=-1, param_grid = parameters, 
                             scoring= 'recall', cv = 3, return_train_score=True, verbose = 3)            
model_cv_rfc_hyper_nopca.fit(X_train_rfe, y_train)

In [None]:
# display all final tuned hyper parameters for rfc
print('We can get the sensitivity of',model_cv_rfc_hyper_nopca.best_score_,'using',model_cv_rfc_hyper_nopca.best_params_)

## Step 6: Model Evaluation

In [None]:
rfc_interpretable = ske.RandomForestClassifier(class_weight='balanced', random_state=seed, max_depth=7, max_features=12, 
                          min_samples_leaf=41, min_samples_split=90, n_estimators=100)
rfc_interpretable.fit(X_train_rfe, y_train)

In [None]:
print("Evaluating RFC Model with best parameters on Train Dataset",model_cv_rfc_hyper_nopca.best_params_,"\n")
displayPredictionMetrics(rfc_interpretable, X_train_rfe, y_train)

In [None]:
print("Evaluating RFC Model with best parameters on Test Dataset",model_cv_rfc_hyper_nopca.best_params_,"\n")
displayPredictionMetrics(rfc_interpretable, X_test_rfe, y_test, True)

### Find Important Features

In [None]:
# rfc_interpretable model.feature_importances_
feature_importances = pd.DataFrame(rfc_interpretable.feature_importances_,
                                   index = X_train_rfe.columns, columns=['importance']).sort_values('importance', ascending=False)

In [None]:
print("Top 10 features by importance:")
feature_importances[:10]

## Interpretable Model Recommendations

**Only a small share of high-value customers, around 8-10%, have churned.**

**Outgoing calls on roaming in the 8th month are a strong indicator of churn behaviour, so it is worth checking whether there is any issue with calling while roaming.**

**Total incoming calls in the 8th month are the strongest indicator of churn behaviour.**

**Customers who joined in the last 4 years are more likely to churn and should be targeted for retention with special schemes.**

**The last-day recharge amount is also important for predicting churn behaviour.**

**Using RFC with the selected set of features, we achieve 79.9% sensitivity in predicting churn behaviour.**

***The top 10 important features are:*** <br>
<blockquote><b>total_ic_mou_8, roam_og_mou_8, last_day_rch_amt_8, loc_og_t2m_mou_8, loc_ic_t2m_mou_8, vol_2g_mb_8, aon, max_rech_amt_8, std_og_mou_8, avg_amt_m6m7</b></blockquote>

# The END