# Telecom Churn Case Study

### 1. Data Understanding and Cleaning

Let's first have a look at the dataset and understand the size, attribute names etc.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from sklearn import linear_model
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from pprint import pprint
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
import xgboost as xgb
from xgboost import XGBClassifier
from xgboost import plot_importance
from sklearn.decomposition import PCA
from sklearn.decomposition import IncrementalPCA
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report, confusion_matrix, roc_auc_score
import os

# hide warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Importing the telecom customer churn dataset
telecom = pd.read_csv("../input/telecom-customer/Telecom_customer churn.csv")

In [None]:
# summary of the dataset: 99999 rows, 226 columns
telecom.info()

In [None]:
telecom.shape

In [None]:
telecom.head()

In [None]:
# Checking columns which have missing values

telecom.isnull().mean().sort_values(ascending=False)

In [None]:
# columns with more than 50% missing values

column_missing_data = telecom.loc[:,telecom.isnull().mean() > 0.5 ]
print("Number of columns with missing data {}".format(len(column_missing_data.columns)))
column_missing_data.columns

In [None]:
# Dropping columns with more than 50% missing values
telecom = telecom.loc[:, telecom.isnull().mean() <= .5]
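
A minimal sketch of this threshold filter on a small made-up frame. The column names `a`, `b`, and `c` are illustrative assumptions and do not come from the dataset:

```python
import numpy as np
import pandas as pd

# Toy frame: 'b' is 75% missing, 'c' is exactly 50% missing
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [np.nan, np.nan, 5.0, 6.0],
})

missing_share = toy.isnull().mean()        # fraction of NaN per column
kept = toy.loc[:, missing_share <= 0.5]    # keep columns at or below 50% missing

print(missing_share.to_dict())
print(list(kept.columns))
```

Note that a column which is exactly 50% missing survives a `<= 0.5` filter.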

In [None]:
telecom.shape

In [None]:
# Dropping date columns since there will be no time series analysis

date_cols = telecom.columns[telecom.columns.str.contains(pat = 'date')]
telecom = telecom.drop(date_cols, axis = 1) 

In [None]:
# Dropping Mobile Number

telecom = telecom.drop('mobile_number', axis = 1) 

In [None]:
telecom.shape

In [None]:
# Checking percentage of missing values in dataset

telecom.isnull().mean().sort_values(ascending=False)

In [None]:
# All the column names with missing values
telecom.loc[:,telecom.isnull().mean() > 0].columns

In [None]:
# Plotting missing values
plt.figure(figsize=(20, 5))
sns.heatmap(telecom.isnull())

In [None]:
# Remove Columns which have only 1 unique Value

col_list = telecom.columns[telecom.nunique() == 1]
telecom = telecom.drop(col_list, axis = 1)
telecom.shape

In [None]:
telecom.describe()

In [None]:
# Storing column names before imputing

col_name = telecom.columns
col_name

#### Since most of the columns contain outliers, we impute missing values with the median

In [None]:
# Imputing median values using SimpleImputer
from sklearn.impute import SimpleImputer

median_imputer = SimpleImputer(strategy='median')
median_imputer.fit(telecom)
telecom = median_imputer.transform(telecom)

In [None]:
telecom= pd.DataFrame(telecom)
telecom.columns = col_name
telecom.head()
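
As a possible alternative (not what this notebook runs), median imputation can also be done directly in pandas, which keeps the column names and the index without rebuilding the frame. The sketch below uses a made-up frame with hypothetical columns `x` and `y`:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per column and a non-default index
toy = pd.DataFrame(
    {"x": [1.0, np.nan, 3.0, 100.0], "y": [np.nan, 2.0, 4.0, 6.0]},
    index=[10, 11, 12, 13],
)

# DataFrame.median() skips NaN by default; fillna aligns the medians by column name
filled = toy.fillna(toy.median())
print(filled)
```

The median of `x` is taken over the three observed values, so the outlier 100 does not pull the imputed value upward the way a mean would.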

In [None]:
plt.figure(figsize=(20, 5))
sns.heatmap(telecom.isnull())

All missing values have been imputed.

In [None]:
# Renaming columns

telecom.rename(columns={'jun_vbc_3g': 'vbc_3g_6', 
                        'jul_vbc_3g': 'vbc_3g_7', 
                        'aug_vbc_3g': 'vbc_3g_8', 
                        'sep_vbc_3g': 'vbc_3g_9'}, inplace=True)

### Deriving total recharge amount for month 6 and 7

In [None]:
total_rech_amt_6_7 = telecom[['total_rech_amt_6','total_rech_amt_7']].sum(axis=1)

In [None]:
# Selecting top 30 percent subscribers for churn prediction

p70 = np.percentile(total_rech_amt_6_7, 70.0)

# .copy() so that columns added later do not modify a view of telecom
tele_top30 = telecom[total_rech_amt_6_7 > p70].copy()
tele_top30.shape
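
To make the percentile cut concrete, here is a small sketch on ten made-up customers. The recharge amounts are invented for illustration; only the column names mirror the notebook:

```python
import numpy as np
import pandas as pd

# Ten hypothetical customers with recharge amounts 100, 200, ..., 1000 in month 6
toy = pd.DataFrame({
    "total_rech_amt_6": [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000],
    "total_rech_amt_7": [0] * 10,
})

total = toy[["total_rech_amt_6", "total_rech_amt_7"]].sum(axis=1)
p70 = np.percentile(total, 70.0)           # linear interpolation between ranks
high_value = toy[total > p70].copy()       # strictly above the 70th percentile

print(p70, len(high_value))
```

Because the comparison is strict, customers whose total equals the 70th percentile are excluded.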

### Deriving churn flag using month 9 data

In [None]:
tele_top30['total_usage_9'] = tele_top30[['total_ic_mou_9','total_og_mou_9','vol_2g_mb_9','vol_3g_mb_9']].sum(axis=1)

In [None]:
# A customer with zero usage in month 9 is flagged as churned
tele_top30['churn'] = (tele_top30['total_usage_9'] == 0).astype(int)

In [None]:
tele_top30.head()

### Dropping month 9 columns 

In [None]:
mon_9_cols = tele_top30.columns[tele_top30.columns.str.contains(pat = '_9')]
mon_9_cols

In [None]:
tele_top30.drop(mon_9_cols, axis=1, inplace = True)
tele_top30.shape

## Deriving New columns

In [None]:
# Converting age on network to years from days

tele_top30['aon_yr'] = round(tele_top30['aon']/365,2)
tele_top30.drop('aon',axis=1,inplace=True)

### Deriving new total columns by combining month 6 and 7

In [None]:
col_list = tele_top30.columns[tele_top30.columns.str.contains('_6|_7')]
len(col_list)

In [None]:
unique_col_list = col_list.str[:-2].unique()
len(unique_col_list)

In [None]:
unique_col_list

In [None]:
for col in unique_col_list:
    col_new_name = col+"_6_7"
    col_6_name = col+"_6"
    col_7_name = col+"_7"
    tele_top30[col_new_name] = tele_top30[[col_6_name,col_7_name]].sum(axis=1)

In [None]:
tele_top30.shape

In [None]:
tele_top30.drop(col_list, axis=1, inplace=True)
tele_top30.shape

In [None]:
tele_top30.head()

In [None]:
tele_top30.describe()

We see the presence of outliers, so we cap the values.

In [None]:
tele_top30['churn'].describe()

The churn column is the target and is imbalanced. We will not cap it; we will balance the dataset later.

In [None]:
# Storing churn data in new dataframe
Churn = pd.DataFrame(tele_top30['churn'])

# Dropping churn column from tele_top30 before capping operation
tele_top30 = tele_top30.drop(['churn'], axis=1)

In [None]:
# Deriving the 25th and 75th percentiles

Q1=tele_top30.quantile(0.25)
Q3=tele_top30.quantile(0.75)

# Deriving Inter Quartile Range
IQR=Q3-Q1

# Deriving the upper limit and lower limit
LL = Q1 - 3*IQR
UL = Q3 + 3*IQR 

In [None]:
# Capping the data using Upper Limit and Lower Limit

tele_top30 = tele_top30.clip(LL,UL,axis=1)
print(tele_top30.shape)
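
The capping rule can be checked on a tiny made-up frame. The values below are invented for illustration; the `3 * IQR` fences follow the cells above:

```python
import pandas as pd

# Toy frame: 'mou' has one extreme value, 'arpu' has none
toy = pd.DataFrame({
    "mou": [10.0, 12.0, 11.0, 13.0, 500.0],
    "arpu": [1.0, 2.0, 3.0, 4.0, 5.0],
})

q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1

lower = q1 - 3 * iqr    # per-column lower fence
upper = q3 + 3 * iqr    # per-column upper fence

# axis=1 aligns the fence Series with the columns
capped = toy.clip(lower, upper, axis=1)
print(capped)
```

Here the `mou` fences are 5 and 19, so the value 500 is pulled down to 19 while `arpu` is left unchanged.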

In [None]:
tele_top30.describe()

After capping, many columns have only one unique value, so we drop these columns.

In [None]:
# Removing columns which have only one value after capping operation

col_list = tele_top30.columns[tele_top30.nunique() == 1]
tele_top30 = tele_top30.drop(col_list, axis = 1)
tele_top30.shape

In [None]:
# Adding churn column to tele_top30

tele_top30 = pd.concat([tele_top30,Churn], axis=1)
tele_top30.shape

In [None]:
# Plotting the correlation matrix using seaborn heatmap

corr_mat = tele_top30.corr()
plt.figure(figsize=(20, 10))
sns.heatmap(corr_mat)

We can see high correlation between the features for months 6, 7 and 8.

In [None]:
# Finding the pairs of most correlated features

abs(corr_mat).unstack().sort_values(ascending = False).drop_duplicates().head(10)

In [None]:
# Plotting the jointplot to check correlation

sns.jointplot(x = 'total_rech_amt_6_7', y = 'arpu_6_7', data=tele_top30, kind='reg')

In [None]:
# Plotting the jointplot to check correlation

sns.jointplot(x = 'total_rech_amt_8', y = 'arpu_8', data=tele_top30, kind='reg', color = [255/255,152/255,150/255])

In [None]:
#Finding highest correlated features with churn

corr_tgt = abs(corr_mat["churn"]).sort_values(ascending = False)
top_features = corr_tgt.loc[((corr_tgt > 0.2) & (corr_tgt != 1))]
top_features

In [None]:
# Plotting absolute correlation value of churn with all other variables

plt.figure(figsize=(20,5))
corr_tgt.sort_values(ascending = False).plot(kind='bar')

In [None]:
# Checking the imbalance in churn feature

tele_top30['churn'].value_counts()*100.0 /len(tele_top30)

In [None]:
plt.figure(figsize=(3, 4))
sns.countplot(x='churn', data=tele_top30)
plt.title('Churn distribution')
plt.show()

The dataset is highly imbalanced.

# Model Building

In [None]:
Interpretable_Model_df = tele_top30

In [None]:
y = Interpretable_Model_df.pop('churn')
X = Interpretable_Model_df
X.shape

In [None]:
X_cols = X.columns

### Scaling

In [None]:
# Scaling the data using standard scaler

scaler = StandardScaler()
X = scaler.fit_transform(X)

In [None]:
# Creating the test train split

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.7, test_size = 0.3, random_state = 100)
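
The cells above fit the scaler on the full feature matrix before splitting, so the test rows contribute to the scaler's mean and variance. A commonly used alternative, sketched here on synthetic data with hypothetical names (`X_demo`, `scaler_demo`), splits first and fits the scaler on the training rows only. This is an illustration, not the procedure that produced the scores reported in this notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and churn labels
rng = np.random.default_rng(100)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = rng.integers(0, 2, size=200)

# Split first ...
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=100
)

# ... then learn the scaling statistics from the training rows only
scaler_demo = StandardScaler().fit(X_tr)
X_tr_s = scaler_demo.transform(X_tr)
X_te_s = scaler_demo.transform(X_te)   # test rows reuse the training statistics

print(X_tr_s.mean(axis=0).round(6), X_te_s.mean(axis=0).round(3))
```

The scaled training columns have mean zero by construction, while the test columns are only approximately centred, which is the expected behaviour.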

### Balancing

In [None]:
# Balancing the dataSet using SMOTE method

from imblearn.over_sampling import SMOTE

sm = SMOTE(sampling_strategy='auto', random_state=100)
X_train_bal, y_train_bal = sm.fit_resample(X_train, y_train)

In [None]:
print(X_train_bal.shape)
print(y_train_bal.shape)
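
For intuition, the idea behind SMOTE is to create synthetic minority rows by interpolating between a minority row and one of its nearest minority neighbours. The function below, `smote_like`, is a hypothetical and heavily simplified NumPy illustration of that idea; it is not the imbalanced-learn implementation used above:

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=100):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority rows
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)               # a row is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]  # k nearest neighbours per row
    base = rng.integers(0, len(X_min), size=n_new)             # starting rows
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour each
    gap = rng.random((n_new, 1))                   # interpolation weight in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Four made-up minority points at the corners of the unit square
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like(X_min, n_new=6)
print(synthetic)
```

Every synthetic point lies on a segment between two existing minority points, which is why oversampling is applied to the training split only.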

In [None]:
plt.figure(figsize=(3, 4))
sns.countplot(x=y_train_bal)
plt.title('Churn distribution')
plt.show()

The training data is now balanced.

# 1. Interpretable Models - Without PCA

### Feature Selection using Lasso Logistic Regression

In [None]:
from sklearn.feature_selection import SelectFromModel

C = [100, 10, 1, 0.5, 0.1, 0.01, 0.001]

for c in C:
    lassoclf = LogisticRegression(penalty='l1', solver='liblinear', C=c).fit(X_train_bal, y_train_bal)
    model = SelectFromModel(lassoclf, prefit=True)
    X_lasso = model.transform(X_train_bal)
    print('C Value - ',c, ' selects',X_lasso.shape[1],' no. of Features')
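
The loop above relies on the fact that a smaller `C` means stronger L1 regularization, which drives more coefficients to zero. The sketch below shows the effect on synthetic data generated with `make_classification`; the data and the names `X_demo`, `y_demo`, and `kept_per_c` are assumptions for illustration. It counts coefficients with absolute value above `1e-5`, which is the default threshold `SelectFromModel` applies to L1-penalized estimators:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 features, of which 5 are informative
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=5, random_state=100
)
X_demo = StandardScaler().fit_transform(X_demo)

kept_per_c = {}
for c in [100, 1, 0.01, 0.001]:
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=c)
    clf.fit(X_demo, y_demo)
    # Number of features whose coefficient survives the L1 penalty
    kept_per_c[c] = int(np.sum(np.abs(clf.coef_[0]) > 1e-5))

print(kept_per_c)
```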
    

#### Selecting c = 0.001 to have 18 important features

In [None]:
lassoclf = LogisticRegression(penalty='l1', solver='liblinear', C=.001).fit(X_train_bal, y_train_bal)
model = SelectFromModel(lassoclf, prefit=True)
X_train_lasso = model.transform(X_train_bal)
pos = model.get_support(indices=True)
selected_features = list(Interpretable_Model_df.columns[pos])
print(selected_features)

In [None]:
X_train_lasso = pd.DataFrame(X_train_lasso)
X_train_lasso.columns = selected_features
X_train_lasso

## 1.1 Interpretable Model 1 - Logistic Regression

In [None]:
# Defining common code

def print_all_scores(y_test, test_prediction, y_train, train_prediction):
    print('Precision on test set:\t'+str(round(precision_score(y_test,test_prediction) *100,2))+"%")
    print('Recall on test set:\t'+str(round(recall_score(y_test,test_prediction) *100,2))+"%")
    print("Training Accuracy: "+str(round(accuracy_score(y_train,train_prediction) *100,2))+"%")
    print("Test Accuracy: "+str(round(accuracy_score(y_test,test_prediction) *100,2))+"%")

### Logistic Regression Base Model - Default parameters

In [None]:
# Creating a base logistic regression model
lr = LogisticRegression(random_state=100)

# Looking at the parameters used by our base model
print('Parameters currently in use:\n')
pprint(lr.get_params())

In [None]:
# fit the model
lr.fit(X_train_lasso, y_train_bal)

# Predicting values
X_test_lasso = pd.DataFrame(data=X_test).iloc[:, pos]
X_test_lasso.columns = selected_features
predictions = lr.predict(X_test_lasso)
train_pred = lr.predict(X_train_lasso)

In [None]:
# Accuracy, precision, recall/sensitivity of the model
print_all_scores(y_test,predictions, y_train_bal, train_pred)

### Interpretation of scores

1. Precision on test set:	31.29%
2. Recall on test set:	78.9%
3. Training Accuracy: 83.51%
4. Test Accuracy: 82.39%


- Precision is low and recall is good: the model tends to flag customers as churners even when they are unlikely to churn (precision = actual churners / total predicted churners). Because so many customers are flagged, the model captures most of the customers who are likely to churn.
- The training and test accuracies are close, which suggests that the model is not overfitted

<b> Verdict: Many customers who are unlikely to churn would receive offers, which would lead to a loss of revenue for the organization </b>
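
To make the relationship between these scores concrete, the sketch below computes them from confusion-matrix counts. The counts are made up for illustration and are not the counts behind the scores above:

```python
# Hypothetical confusion-matrix counts for 1000 test customers
tp = 80    # churners correctly flagged
fp = 170   # non-churners flagged as churners
fn = 20    # churners the model missed
tn = 730   # non-churners correctly left alone

precision = tp / (tp + fp)                   # share of flagged customers who churn
recall = tp / (tp + fn)                      # share of churners who are flagged
accuracy = (tp + tn) / (tp + fp + fn + tn)   # share of all customers classified correctly

print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
```

With these counts the model finds 80% of churners, yet only 32% of the customers it flags actually churn, which is the low-precision, high-recall pattern described above.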

### Logistic Regression - Hyperparameter Tuning

#### Creating a hypertuned logistic regression model

In [None]:
# Initialising logistic Regression
log_reg = LogisticRegression(random_state = 100)

# Creating hyper parameter grid
parameter_grid = {'solver': ['newton-cg', 'lbfgs','liblinear','sag'],
                  'penalty': ['l1', 'l2', 'elasticnet', 'none'],
                  'C': [100, 10, 1.0, 0.1, 0.01]}

gs = GridSearchCV(estimator=log_reg, param_grid=parameter_grid, n_jobs=-1, cv=3, scoring='accuracy', error_score=0)

In [None]:
# Fitting the model
grid_result = gs.fit(X_train_lasso, y_train_bal)

# Finding the best model
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

**Fitting the final model with the best parameters obtained from grid search.**

In [None]:
# Initialising hyper tuned logistic Regression
log_reg_ht = LogisticRegression(C= 1.0, penalty= 'l2', solver= 'liblinear', random_state = 100)

# Fitting the model
log_reg_ht.fit(X_train_lasso, y_train_bal)

In [None]:
# Predicting the labels
train_pred = log_reg_ht.predict(X_train_lasso)
test_pred = log_reg_ht.predict(X_test_lasso)

In [None]:
# Accuracy, precision, recall/sensitivity of the model
print_all_scores(y_test,test_pred,y_train_bal,train_pred)

### Interpretation of scores

Hyperparameter tuning improves precision and test accuracy, at the cost of some recall

1. Precision on test set:	36.69%
2. Recall on test set:	72.39%
3. Training Accuracy: 83.51%
4. Test Accuracy: 86.18%


- Precision is still low and recall is good: the model tends to flag customers as churners even when they are unlikely to churn (precision = actual churners / total predicted churners). Because so many customers are flagged, the model captures most of the customers who are likely to churn.
- Test accuracy is slightly above training accuracy, so the model does not appear to be overfitted

<b> Verdict: Many customers who are unlikely to churn would receive offers, which would lead to a loss of revenue for the organization </b>

In [None]:
# To get the weights of all the variables
weights = pd.Series(log_reg_ht.coef_[0],
                 index=selected_features)
weights.sort_values(ascending = False).plot(kind = 'bar')

## 1.2 Interpretable Model 2 - Random Forest

### Random Forest Base Model - Default Parameters

In [None]:
# Running the random forest with default parameters.
rfc = RandomForestClassifier(random_state = 100)

# Looking at the parameters used by our base model
print('Parameters currently in use:\n')
pprint(rfc.get_params())

In [None]:
# fit the model
rfc.fit(X_train_lasso,y_train_bal)

# Making predictions
predictions = rfc.predict(X_test_lasso)
train_pred = rfc.predict(X_train_lasso)

In [None]:
# Accuracy, precision, recall/sensitivity of the model
print_all_scores(y_test,predictions,y_train_bal,train_pred)

### Interpretation of scores

Even with default parameters, the random forest model improves significantly on logistic regression

1. Precision on test set:	43.19%
2. Recall on test set:	72.02%
3. Training Accuracy: 87.78%
4. Test Accuracy: 88.88%


- Precision is better than with logistic regression and recall is good too. This model would identify most of the customers who are about to churn, and it is relatively less likely to misclassify non-churn customers, a mistake that leads to a loss of revenue
- The training and test accuracies are close, so the model does not appear to be overfitted, which is a common risk with tree-based models

<b> Verdict: Some customers who are unlikely to churn would receive offers, which would cost the organization some revenue. Improving precision would make the model more beneficial. This model also needs more computational resources and time, so it cannot make predictions as quickly </b>

### Random Forest - Hyperparameter Tuning

#### Creating a hyperparameter grid

In [None]:
max_depth = [int(x) for x in np.linspace(10, 50, num = 5)]
max_depth.append(None)

# Create the random parameter grid
parameter_grid = {'n_estimators': [int(x) for x in np.linspace(start = 200, stop = 1000, num = 5)],
                  'max_features': ['sqrt'],  # 'auto' is equivalent to 'sqrt' for classifiers and is rejected by recent scikit-learn releases
                  'max_depth': max_depth,
                  'min_samples_split': [100, 500, 1000],
                  'min_samples_leaf': [50, 250, 500],
                  'bootstrap': [True, False]}

pprint(parameter_grid)

# Searching across different combinations for best model parameters
rf_random = RandomizedSearchCV(estimator = rfc, param_distributions = parameter_grid, n_iter = 100, 
                               cv = 3, verbose=2, random_state=100, n_jobs = -1)

In [None]:
# Fit the random search model
rf_random.fit(X_train_lasso, y_train_bal)

# Finding the best parameters
rf_random.best_params_

#### Building the model around the random parameter obtained

In [None]:
# Create the parameter grid based on the results of random search 
param_grid = {
    'max_depth': [15, 20, 25],
    'min_samples_leaf': [40, 50, 60],      # range(40, 50, 60) would yield only 40
    'min_samples_split': [80, 100, 120],   # range(80, 100, 120) would yield only 80
    'n_estimators': [800, 1000, 1200], 
    'max_features': ['sqrt'],
    'bootstrap': [False]
}

# Create a base model
rf = RandomForestClassifier(random_state = 100)

# Instantiate the grid search model
grid_search = GridSearchCV(estimator = rf, param_grid = param_grid, cv = 3, n_jobs = -1,verbose = 1)

In [None]:
# Fitting the model
grid_result = grid_search.fit(X_train_lasso, y_train_bal)

# Finding the best model
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

**Fitting the final model with the best parameters obtained from grid search.**

In [None]:
# Model with the best hyperparameters

rfc = RandomForestClassifier(bootstrap=False,
                             max_depth=20,
                             min_samples_leaf=40, 
                             min_samples_split=80,
                             max_features='sqrt',
                             n_estimators=800)

# Fit
rfc.fit(X_train_lasso, y_train_bal)

In [None]:
# Predict
train_pred = rfc.predict(X_train_lasso)
test_pred = rfc.predict(X_test_lasso)

In [None]:
# Accuracy, precision, recall/sensitivity of the model
print_all_scores(y_test,test_pred,y_train_bal,train_pred)

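### Interpretation of scores

There is no significant improvement in the model: the test-set precision, recall and accuracy below match the base model, and only the training accuracy rises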

1. Precision on test set:	43.19%
2. Recall on test set:	72.02%
3. Training Accuracy: 92.83%
4. Test Accuracy: 88.88%


- Precision is low and recall is good: the model tends to flag customers as churners even when they are unlikely to churn. Because so many customers are flagged, the model captures most of the customers who are likely to churn.
- Training accuracy is about four points above test accuracy, so any overfitting appears to be limited

<b> Verdict: Many customers who are unlikely to churn would receive offers, which would lead to some loss of revenue for the organization </b>

In [None]:
# To get the weights of all the variables
weights = pd.Series(rfc.feature_importances_,
                 index=selected_features)
weights.sort_values(ascending = False).plot(kind = 'bar')

## 1.3 Interpretable Model 3 - Using XGBoost

### XGBoost Base Model - Default Parameters

In [None]:
# fit model on training data with default hyperparameters
xgb = XGBClassifier()

# Looking at the parameters used by our base model
print('Parameters currently in use:\n')
pprint(xgb.get_params())

In [None]:
# Fitting the model
xgb.fit(X_train_lasso,y_train_bal)

# Making predictions
predictions = xgb.predict(X_test_lasso)
train_pred = xgb.predict(X_train_lasso)

In [None]:
# Accuracy, precision, recall/sensitivity of the model
print_all_scores(y_test,predictions,y_train_bal,train_pred)

In [None]:
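# AUC Score, computed from the predicted churn probabilities rather than hard labels
print("AUC Score on test set:\t" +str(round(roc_auc_score(y_test, xgb.predict_proba(X_test_lasso)[:, 1]) *100,2)))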
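
AUC measures how well a model ranks churners above non-churners, so it is normally computed from predicted probabilities. The sketch below, on six made-up customers, shows how thresholding the probabilities into hard labels first discards ranking information and changes the score:

```python
from sklearn.metrics import roc_auc_score

# Made-up true labels and predicted churn probabilities
y_true = [0, 0, 0, 1, 1, 1]
proba = [0.1, 0.2, 0.6, 0.4, 0.7, 0.9]

# Hard labels at the usual 0.5 threshold
hard_labels = [int(p >= 0.5) for p in proba]

auc_from_proba = roc_auc_score(y_true, proba)         # uses the full ranking
auc_from_labels = roc_auc_score(y_true, hard_labels)  # ranking collapsed to 0/1

print(round(auc_from_proba, 3), round(auc_from_labels, 3))
```

With probabilities, 8 of the 9 churner/non-churner pairs are ranked correctly. With hard labels, ties between equal labels count as half, and the score drops to 2/3.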

### Interpretation of scores

1. Precision on test set:	62.38%
2. Recall on test set:	54.11%
3. Training Accuracy: 83.51%
4. Test Accuracy: 92.39%


- Precision is better but recall is poor: the model tends to predict that customers will not churn even when they are likely to churn. This model does not suit our objective, because failing to flag customers who are about to churn would impact the business more.
- Test accuracy is higher than training accuracy, so the model does not appear to be overfitted

<b> Verdict: This model is not suitable, as we would miss nearly half (about 46%) of the customers who are likely to churn </b>

### XGBoost - Hyperparameter Tuning

In [None]:
# hyperparameter tuning with XGBoost

# specify range of hyperparameters
param_grid = {'learning_rate': [0.2, 0.6], 
             'subsample': [0.3, 0.6, 0.9]}          


# specify model
xgb_ht = XGBClassifier(max_depth=2, n_estimators=200)

# set up GridSearchCV()
gs = GridSearchCV(estimator = xgb_ht, param_grid = param_grid, scoring= 'roc_auc', 
                        cv = 3, verbose = 1, return_train_score=True, n_jobs = -1)     

In [None]:
# Fitting the model
grid_result = gs.fit(X_train_lasso,y_train_bal) 

# Finding the best model
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

#### Fitting the final model with the best parameters obtained from grid search.

In [None]:
# Model with the best hyperparameters
xgb_ht = XGBClassifier(max_depth=2, n_estimators=200, learning_rate = 0.6, subsample = 0.9)

# Fit
xgb_ht.fit(X_train_lasso, y_train_bal)

In [None]:
# Predict
train_pred = xgb_ht.predict(X_train_lasso)
test_pred = xgb_ht.predict(X_test_lasso)

In [None]:
# Accuracy, precision, recall/sensitivity of the model
print_all_scores(y_test,test_pred,y_train_bal,train_pred)

### Interpretation of scores

This model shows no significant improvement apart from the training accuracy

1. Precision on test set:	62.38%
2. Recall on test set:	54.11%
3. Training Accuracy: 95.92%
4. Test Accuracy: 92.88%


- Precision is better but recall is poor: the model tends to predict that customers will not churn even when they are likely to churn. This model does not suit our objective, because failing to flag customers who are about to churn would impact the business more.
- Training accuracy is now about three points above test accuracy; the gap is small, so the model does not appear to be badly overfitted

<b> Verdict: This model is not suitable, as we would miss nearly half (about 46%) of the customers who are likely to churn </b>

In [None]:
# To get the weights of all the variables
weights = pd.Series(xgb_ht.feature_importances_,
                 index=selected_features)
weights.sort_values(ascending = False).plot(kind = 'bar')

# 2. High Accuracy Models - Using PCA

In [None]:
from sklearn import metrics

def draw_roc( y_test_churn, y_pred_churn ):
    fpr, tpr, thresholds = metrics.roc_curve(  y_test_churn, y_pred_churn,
                                              drop_intermediate = False )
    auc_score = metrics.roc_auc_score(  y_test_churn, y_pred_churn )
    print("ROC score: {}".format(auc_score))
    plt.figure(figsize=(6, 6))
    plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score )
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc="lower right")
    plt.show()

    return fpr, tpr, thresholds

### Manually finding the ideal number of components

In [None]:
# Running pca with default parameters.
pca = PCA(random_state=100)

# Fitting the model
pca.fit(X_train_bal)

#### Using a scree plot to identify the number of components

In [None]:
# cumulative variance
var_cumu = np.cumsum(pca.explained_variance_ratio_)

# code for Scree plot
fig = plt.figure(figsize=[12,8])
plt.vlines(x=30, ymax=1, ymin=0, colors="r", linestyles="--")
plt.hlines(y=0.95, xmax=30, xmin=0, colors="g", linestyles="--")
plt.plot(var_cumu)
plt.ylabel("Cumulative variance explained")
plt.show()
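
The component count can also be read off the cumulative variance numerically instead of from the plot. A minimal sketch with made-up variance ratios (in the notebook these would come from `pca.explained_variance_ratio_`):

```python
import numpy as np

# Hypothetical explained-variance ratios for five components
ratios = np.array([0.5, 0.25, 0.15, 0.06, 0.04])
cumulative = np.cumsum(ratios)

# Index of the first component at which cumulative variance reaches 95%, plus one
n_components = int(np.argmax(cumulative >= 0.95) + 1)

print(cumulative.round(2), n_components)
```

scikit-learn's `PCA` also accepts a fraction such as `n_components=0.95` and then keeps the smallest number of components that explains that share of the variance.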

From the graph we can infer that 30 components would be ideal

#### Using incremental PCA

In [None]:
# Initializing the PCA model
pca_inc = IncrementalPCA(n_components=30)

# Fitting the model
df_train_pca_inc = pca_inc.fit_transform(X_train_bal)

# Looking at the shape
df_train_pca_inc.shape

In [None]:
df_train_pca_inc

## Verifying that no correlation exists after PCA

In [None]:
# Plotting correlation

corrmat = np.corrcoef(df_train_pca_inc.transpose())
plt.figure(figsize=[15,5])
sns.heatmap(corrmat)
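
Besides the heatmap, the same check can be done numerically by looking at the largest off-diagonal correlation. The sketch below uses synthetic data and a plain SVD-based PCA in NumPy rather than the `IncrementalPCA` object from this notebook; all names are illustrative assumptions:

```python
import numpy as np

# Synthetic data: the first two columns are strongly correlated
rng = np.random.default_rng(100)
base = rng.normal(size=(500, 1))
X_demo = np.hstack([
    base,
    2 * base + 0.1 * rng.normal(size=(500, 1)),
    rng.normal(size=(500, 1)),
])

# PCA via SVD of the centred data
X_centered = X_demo - X_demo.mean(axis=0)
_, _, vt = np.linalg.svd(X_centered, full_matrices=False)
scores = X_centered @ vt.T            # principal-component scores

corr = np.corrcoef(scores.T)
off_diag = corr - np.diag(np.diag(corr))
max_off = np.abs(off_diag).max()      # largest correlation between two components

print(abs(np.corrcoef(X_demo.T)[0, 1]).round(3), max_off)
```

The original columns are highly correlated, while the component scores are uncorrelated up to floating-point error.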

In [None]:
# Applying the transformation on test

df_test_pca_inc = pca_inc.transform(X_test)
df_test_pca_inc.shape

## 2.1 High Accuracy Model 1 - Using Logistic Regression

### Logistic Regression Base Model - Default parameters

In [None]:
# Creating a base logistic regression model
lr = LogisticRegression(random_state=100)

# Looking at the parameters used by our base model
print('Parameters currently in use:\n')
pprint(lr.get_params())

In [None]:
# fit the model
lr.fit(df_train_pca_inc, y_train_bal)

# Predicting values
predictions = lr.predict(df_test_pca_inc)
train_pred = lr.predict(df_train_pca_inc)

In [None]:
# Accuracy, precision, recall/sensitivity of the model
print_all_scores(y_test,predictions,y_train_bal,train_pred)

### Logistic Regression - Hyperparameter Tuning

#### Creating a hypertuned logistic regression model

In [None]:
# Initialising logistic Regression
log_reg = LogisticRegression(random_state = 100)

# Creating hyper parameter grid
parameter_grid = {'solver': ['newton-cg', 'lbfgs','liblinear','sag'],
                  'penalty': ['l1', 'l2', 'elasticnet', 'none'],
                  'C': [100, 10, 1.0, 0.1, 0.01]}

gs = GridSearchCV(estimator=log_reg, param_grid=parameter_grid, n_jobs=-1, cv=3, scoring='accuracy', error_score=0)

In [None]:
# Fitting the model
grid_result = gs.fit(df_train_pca_inc, y_train_bal)

# Finding the best model
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

**Fitting the final model with the best parameters obtained from grid search.**

In [None]:
# Initialising hyper tuned logistic Regression
log_reg_ht = LogisticRegression(C= 0.1, penalty= 'l2', solver= 'newton-cg', random_state = 100)

# Fitting the model
log_reg_ht.fit(df_train_pca_inc, y_train_bal)

In [None]:
# Predicting the labels
train_pred = log_reg_ht.predict(df_train_pca_inc)
test_pred = log_reg_ht.predict(df_test_pca_inc)

In [None]:
# Accuracy, precision, recall/sensitivity of the model
print_all_scores(y_test,test_pred,y_train_bal,train_pred)

### Interpretation of scores for Logistic Regression (default and hyperparameter-tuned)

1. Precision on test set:	32.18%
2. Recall on test set:	79.63%
3. Training Accuracy: 83.59%
4. Test Accuracy: 82.94% 


- Precision is low and recall is good: the model tends to flag customers as churners even when they are unlikely to churn (precision = actual churners / total predicted churners). Because so many customers are flagged, the model captures most of the customers who are likely to churn.
- The training and test accuracies are close, which suggests that the model is not overfitted

<b> Verdict: Many customers who are unlikely to churn would receive offers, which would lead to a loss of revenue for the organization </b>

## 2.2 High Accuracy Model 2 - Random Forest

### Random Forest Base Model - Default Parameters

In [None]:
# Running the random forest with default parameters.
rfc = RandomForestClassifier(random_state = 100)

# Looking at the parameters used by our base model
print('Parameters currently in use:\n')
pprint(rfc.get_params())

In [None]:
# fit the model
rfc.fit(df_train_pca_inc,y_train_bal)

# Making predictions
predictions = rfc.predict(df_test_pca_inc)
train_pred = rfc.predict(df_train_pca_inc)

In [None]:
# Accuracy, precision, recall/sensitivity of the model
print_all_scores(y_test,predictions,y_train_bal,train_pred)

### Random Forest - Hyperparameter Tuning

## Note: Random Forest hyperparameter tuning takes more than 1 hour to run

#### Creating a hyperparameter grid

In [None]:
max_depth = [int(x) for x in np.linspace(10, 30, num = 3)]

# Create the random parameter grid
parameter_grid = {'n_estimators': [int(x) for x in np.linspace(start = 600, stop = 1000, num = 5)],
                  'max_features': ['sqrt'],  # 'auto' is equivalent to 'sqrt' for classifiers and is rejected by recent scikit-learn releases
                  'max_depth': max_depth,
                  'min_samples_split': [500, 1000],
                  'min_samples_leaf': [250, 500],
                  'bootstrap': [True, False]}

pprint(parameter_grid)

# Searching across different combinations for best model parameters
rf_random = RandomizedSearchCV(estimator = rfc, param_distributions = parameter_grid, n_iter = 50, 
                               cv = 3, verbose=2, random_state=100, n_jobs = -1)

In [None]:
# Fit the random search model
rf_random.fit(df_train_pca_inc, y_train_bal)

# Finding the best parameters
rf_random.best_params_

#### Building the model around the random parameter obtained

In [None]:
# Create the parameter grid based on the results of random search 
param_grid = {
    'max_depth': [15, 20, 25],
    'min_samples_leaf': range(200, 300, 50),
    'min_samples_split': range(400, 600, 100),
    'n_estimators': [800, 1000, 1200], 
    'max_features': ['sqrt'],  # same behaviour as the older 'auto' for classifiers
    'bootstrap': [False]
}

# Create a base model
rf = RandomForestClassifier(random_state = 100)

# Instantiate the grid search model
grid_search = GridSearchCV(estimator = rf, param_grid = param_grid, cv = 3, n_jobs = -1,verbose = 1)
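
Before launching a grid search it helps to count how many models it will fit. The sketch below mirrors the sizes of the grid above in a hypothetical dictionary `param_grid_demo` and expands the `range` objects to show the values they actually produce:

```python
from math import prod

# Same value lists as the grid above; range(start, stop, step) excludes the stop value
param_grid_demo = {
    "max_depth": [15, 20, 25],
    "min_samples_leaf": list(range(200, 300, 50)),    # [200, 250]
    "min_samples_split": list(range(400, 600, 100)),  # [400, 500]
    "n_estimators": [800, 1000, 1200],
    "max_features": ["sqrt"],
    "bootstrap": [False],
}

n_candidates = prod(len(values) for values in param_grid_demo.values())
n_fits = n_candidates * 3   # cv = 3 folds per candidate

print(param_grid_demo["min_samples_leaf"], n_candidates, n_fits)
```

With forests of 800 to 1200 trees, 108 fits explains most of the long running time noted above.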

In [None]:
# Fitting the model
grid_result = grid_search.fit(df_train_pca_inc, y_train_bal)

# Finding the best model
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

In [None]:
# Model with the best hyperparameters

rfc = RandomForestClassifier(bootstrap=False,
                             max_depth=20,
                             min_samples_leaf=200, 
                             min_samples_split=400,
                             max_features='sqrt',
                             n_estimators=800)

# Fit
rfc.fit(df_train_pca_inc, y_train_bal)

In [None]:
# Predict
train_pred = rfc.predict(df_train_pca_inc)
test_pred = rfc.predict(df_test_pca_inc)

In [None]:
# Accuracy, precision, recall/sensitivity of the model
print_all_scores(y_test,test_pred,y_train_bal,train_pred)

### Interpretation of scores for Random Forest (default and hyperparameter-tuned model)


1. Precision on test set:	43.19%
2. Recall on test set:	72.02%
3. Training Accuracy: 92.83%
4. Test Accuracy: 88.88%


- Precision is low and recall is good: the model tends to flag customers as churners even when they are unlikely to churn. Because so many customers are flagged, the model captures most of the customers who are likely to churn.
- Training accuracy is about four points above test accuracy, so any overfitting appears to be limited

<b> Verdict: Many customers who are unlikely to churn would receive offers, though that is better than losing the customer </b>

## 2.3 High Accuracy Model 3 - Using XGBoost

### XGBoost Base Model - Default Parameters

In [None]:
# fit model on training data with default hyperparameters
xgb = XGBClassifier()

# Looking at the parameters used by our base model
print('Parameters currently in use:\n')
pprint(xgb.get_params())

In [None]:
# Fitting the model
xgb.fit(df_train_pca_inc,y_train_bal)

# Making predictions
predictions = xgb.predict(df_test_pca_inc)
train_pred = xgb.predict(df_train_pca_inc)

In [None]:
# Accuracy, precision, recall/sensitivity of the model
print_all_scores(y_test,predictions,y_train_bal,train_pred)

In [None]:
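# AUC Score, computed from the predicted churn probabilities rather than hard labels
print("AUC Score on test set:\t" +str(round(roc_auc_score(y_test, xgb.predict_proba(df_test_pca_inc)[:, 1]) *100,2)))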

### XGBoost - Hyperparameter Tuning

In [None]:
# hyperparameter tuning with XGBoost

# specify range of hyperparameters
param_grid = {'learning_rate': [0.2, 0.6], 
             'subsample': [0.3, 0.6, 0.9]}          


# specify model
xgb_ht = XGBClassifier(max_depth=2, n_estimators=200)

# set up GridSearchCV()
gs = GridSearchCV(estimator = xgb_ht, param_grid = param_grid, scoring= 'roc_auc', 
                        cv = 3, verbose = 1, return_train_score=True, n_jobs = -1)     

In [None]:
# Fitting the model
grid_result = gs.fit(df_train_pca_inc,y_train_bal) 

# Finding the best model
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

In [None]:
# Model with the best hyperparameters
xgb_ht = XGBClassifier(max_depth=2, n_estimators=200, learning_rate = 0.6, subsample = 0.9)

# Fit
xgb_ht.fit(df_train_pca_inc, y_train_bal)

In [None]:
# Predict
train_pred = xgb_ht.predict(df_train_pca_inc)
test_pred = xgb_ht.predict(df_test_pca_inc)

In [None]:
# Accuracy, precision, recall/sensitivity of the model
print_all_scores(y_test,test_pred,y_train_bal,train_pred)

### Interpretation of scores

This model shows no significant improvement apart from the training accuracy

1. Precision on test set:	62.38%
2. Recall on test set:	54.11%
3. Training Accuracy: 95.92%
4. Test Accuracy: 92.88%


- Precision is better but recall is poor: the model tends to predict that customers will not churn even when they are likely to churn. This model does not suit our objective, because failing to flag customers who are about to churn would impact the business more.
- Training accuracy is about three points above test accuracy; the gap is small, so the model does not appear to be badly overfitted

<b> Verdict: This model is not suitable, as we would miss nearly half (about 46%) of the customers who are likely to churn </b>

## Conclusion:

Based on the models' performance, the model built with <b><i>random forest using PCA</i></b> helps us identify about 70% of the customers who are about to churn. It also flags some customers who would not actually churn, but this does little harm: giving offers to them still helps keep revenue up, and is preferable to losing customers.

PCA also reduces the number of dimensions, which in turn reduces model-building time.

## Significant features that would help in identifying the churn

### As per the model the following features are important

1. arpu_8
2. loc_og_t2t_mou_8
3. loc_og_t2m_mou_8
4. loc_og_t2f_mou_8
5. spl_og_mou_8
6. total_og_mou_8
7. loc_ic_t2f_mou_8
8. std_ic_t2f_mou_8
9. total_ic_mou_8
10. ic_others_8
11. total_rech_num_8
12. max_rech_amt_8
13. last_day_rch_amt_8
14. vol_2g_mb_8
15. aon_yr
16. arpu_6_7
17. roam_og_mou_6_7
18. std_og_mou_6_7

### According to EDA

1. arpu_8          
2. total_rech_amt_8
3. total_ic_mou_8  
4. total_og_mou_8  

The features listed above are the ones most strongly correlated with churn.