### Business Objective of Telecom Churn:
Build two machine learning models:

**Model 1**: Predicts whether a high-value customer will churn in the churn phase, so that the company can act in time, for example by offering special plans or discounts on recharges.

**Model 2**: Identifies the variables that are strong predictors of churn. These variables may also indicate why customers choose to switch to other networks.

# Data Understanding

In [None]:
# Importing important packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


# Importing ML Packages
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.decomposition import IncrementalPCA
from sklearn.ensemble import RandomForestClassifier
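from sklearn.metrics import roc_auc_score,confusion_matrix,precision_score,recall_score,precision_recall_curve
# plot_roc_curve was removed in scikit-learn 1.2; fall back to its replacement under the same name
try:
    from sklearn.metrics import plot_roc_curve
except ImportError:
    from sklearn.metrics import RocCurveDisplay
    plot_roc_curve = RocCurveDisplay.from_estimator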
from sklearn.model_selection import GridSearchCV

In [None]:
# Reading Dataset as Churn
churn = pd.read_csv('../input/telecom-churn-prediction-classification/telecom_churn_data.csv')

In [None]:
# How you doin?
churn.info()

In [None]:
# Sneak Peek
churn.head(10)


*   Some columns seem to contain only 0.
*   Some hold a single fixed value, such as circle id.
*   Mobile number is an identifier and seems to carry no useful information.

We will resolve all of this under Sanctity Checks.



According to the problem statement, there are **three phases of customer behaviour during churn**, divided as follows:
- Good phase :   months **6** and **7**
- Action phase : month **8**
- Churn phase :  month **9**

So we need to predict the churn value (1 or 0) in the last month, month 9. We will create a target variable called **"churned"** based on month 9, where 1 indicates that the customer churned and 0 indicates that the customer stayed.

According to the problem statement, a customer has churned if all of the following variables are recorded as 0, and has not churned otherwise.
- total_ic_mou_9
- total_og_mou_9
- vol_2g_mb_9
- vol_3g_mb_9
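
As a minimal sketch of this rule, the snippet below applies it to three made-up rows (the values are illustrative, not taken from the dataset). A customer is flagged as churned only when all four month-9 usage columns are zero.

```python
import pandas as pd

# Toy rows (hypothetical values) with the four month-9 usage columns
toy = pd.DataFrame({
    "total_ic_mou_9": [0.0, 12.5, 0.0],
    "total_og_mou_9": [0.0, 0.0, 0.0],
    "vol_2g_mb_9":    [0.0, 0.0, 3.2],
    "vol_3g_mb_9":    [0.0, 0.0, 0.0],
})

usage_cols = ["total_ic_mou_9", "total_og_mou_9", "vol_2g_mb_9", "vol_3g_mb_9"]

# Usage is non-negative, so "all four are 0" is the same as "their sum is 0"
toy["churned"] = (toy[usage_cols].sum(axis=1) == 0).astype(int)
print(toy["churned"].tolist())
```

Only the first toy customer has no incoming, outgoing, 2G or 3G usage, so only that row is marked as churned.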



#### Finding High Value Customers

High Value Customers: customers whose average total recharge amount over the Good phase (months 6 and 7) is greater than or equal to the 70th percentile of that average.

- Avg_recharge_amount_6and7 = Total_recharge_amount_6and7/2
- Total_recharge_amount_6and7 = Total_recharge_amount_6 + Total_recharge_amount_7
- Total_recharge_amount_6 = total_rech_amt_6 + (av_rech_amt_data_6*total_rech_data_6)
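
The formulas above can be sketched on a tiny made-up table. The five rows and their values are hypothetical, and the month-7 total is assumed to be built the same way as the month-6 total.

```python
import numpy as np
import pandas as pd

# Toy recharge data (hypothetical values); NaN means no data recharge that month
toy = pd.DataFrame({
    "total_rech_amt_6":   [100, 500, 300, 900, 50],
    "av_rech_amt_data_6": [np.nan, 100, 50, 200, np.nan],
    "total_rech_data_6":  [np.nan, 2, 1, 3, np.nan],
    "total_rech_amt_7":   [120, 400, 350, 800, 0],
    "av_rech_amt_data_7": [np.nan, 100, np.nan, 150, np.nan],
    "total_rech_data_7":  [np.nan, 1, np.nan, 2, np.nan],
}).fillna(0)

# Total recharge = call recharge + (average data recharge * number of data recharges)
total_6 = toy["total_rech_amt_6"] + toy["av_rech_amt_data_6"] * toy["total_rech_data_6"]
total_7 = toy["total_rech_amt_7"] + toy["av_rech_amt_data_7"] * toy["total_rech_data_7"]
avg_6_7 = (total_6 + total_7) / 2

# Keep customers at or above the 70th percentile of the good-phase average
cutoff = np.percentile(avg_6_7, 70)
high_value = toy[avg_6_7 >= cutoff]
print(cutoff, high_value.index.tolist())
```

Computing the cutoff from the data, rather than typing it in, keeps the filter correct if the input file changes.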

In [None]:
# Who's not there?
churn[["total_rech_amt_6","av_rech_amt_data_6","total_rech_data_6",
       "total_rech_amt_7","av_rech_amt_data_7","total_rech_data_7"]].isnull().sum()

**NaN values in the above columns actually represent 0, so we can impute them with 0s**

In [None]:
# Fill the empty seats!
rech_cols = ["total_rech_amt_6","av_rech_amt_data_6","total_rech_data_6",
             "total_rech_amt_7","av_rech_amt_data_7","total_rech_data_7"]
churn[rech_cols] = churn[rech_cols].fillna(value=0)

In [None]:
# Compute the Average total recharge amount of each customer in 6 and 7 month 
Total_recharge_amount_6 = churn["total_rech_amt_6"]+churn["av_rech_amt_data_6"]*churn["total_rech_data_6"]
Total_recharge_amount_7 = churn["total_rech_amt_7"]+churn["av_rech_amt_data_7"]*churn["total_rech_data_7"]
Total_recharge_amount_6and7 = Total_recharge_amount_6 + Total_recharge_amount_7
Avg_recharge_amount_6and7 = Total_recharge_amount_6and7/2

In [None]:
# Find the 70th percentile of the Average total recharge amount of each customer in 6 and 7 months
np.percentile(Avg_recharge_amount_6and7,70)

In [None]:
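# VIP's lounge ONLY
# Use the computed 70th-percentile cutoff (about 478 on this data) instead of a typed-in number
high_value_cutoff = np.percentile(Avg_recharge_amount_6and7, 70)
churn = churn[Avg_recharge_amount_6and7 >= high_value_cutoff]
churn.head()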

In [None]:
# How many left?
churn.shape

In [None]:
# Attendance
churn = churn.reset_index(drop=True)
churn.head()

In [None]:
# Anybody lost?
pd.DataFrame(churn.isnull().sum()).sort_values(by = 0, ascending = False)

After observing the dataset, we can divide the columns into 2 types:
- Numeric Columns 
- Date Columns 

As per the data observed :  
**NaN values in Numeric columns can be imputed with 0s**

**A NaN in a date column means the customer has not recharged in that month. Instead of imputing the date columns, we will create a new derived variable: the number of days between the customer's last recharge in months 6, 7 and 8 and the last date of month 8.** 



In [None]:
# What type are they?
obj_types = list(churn.columns[churn.dtypes == 'object'])


In [None]:
obj_types

In [None]:
# Correct format is the best format
for i in obj_types:
  churn[i] = pd.to_datetime(churn[i], format = '%m/%d/%Y')


# Derived Metrics

Define a variable called "recency": the number of days between the customer's last recharge and the last date of month 8.
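
A minimal sketch of the recency calculation on two made-up customers is shown here. The dates and the year are hypothetical; `NaT` stands for a month with no recharge.

```python
import pandas as pd

# Toy last-recharge dates (hypothetical); NaT means no recharge in that month
toy = pd.DataFrame({
    "date_of_last_rech_6": [pd.Timestamp("2014-06-25"), pd.Timestamp("2014-06-10")],
    "date_of_last_rech_7": [pd.Timestamp("2014-07-30"), pd.NaT],
    "date_of_last_rech_8": [pd.NaT, pd.Timestamp("2014-08-28")],
})

month_8_end = pd.Timestamp("2014-08-31")

# Row-wise latest date; missing dates are skipped rather than imputed
last_rech = toy.max(axis=1)

# Days between the last recharge and the end of month 8
recency = (month_8_end - last_rech).dt.days
print(recency.tolist())
```

The first customer last recharged on 30 July, so recency is 32 days; the second recharged on 28 August, so recency is 3 days.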

In [None]:
# Get last date of recharge (row-wise maximum across the six date columns)
last_rech_date = churn[['date_of_last_rech_6','date_of_last_rech_7',
                        'date_of_last_rech_8','date_of_last_rech_data_6',
                        'date_of_last_rech_data_7','date_of_last_rech_data_8']].max(axis=1)

In [None]:
# fill the missing values of last date of month 8
churn['last_date_of_month_8'] = churn['last_date_of_month_8'].fillna(churn['last_date_of_month_8'][0])

In [None]:
# Create recency variable
churn["recency"]=churn["last_date_of_month_8"]-last_rech_date
churn["recency"]=churn["recency"].dt.days

In [None]:
#remove all date columns 
churn.drop(obj_types, inplace = True, axis = 1)

In [None]:
# Rename vbc variables 
churn=churn.rename(columns={'aug_vbc_3g':'vbc_3g_8','jul_vbc_3g':'vbc_3g_7','jun_vbc_3g':'vbc_3g_6'})

# Null Value Treatment

In [None]:
# Get total missing values in the dataframe
churn.isnull().sum().sum()

The fb_user and night_pck_user columns contain 0s, 1s and blanks. The blanks may themselves show a pattern against the target variable, so we will impute the missing values in these columns with -1. 

1: service availed

0: service not availed

-1: Not available
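
To illustrate why the blanks are kept as their own category, here is a small sketch on made-up rows (the values are hypothetical). After filling blanks with -1, the churn rate can be compared across the three categories.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): a service flag with blanks, plus a churn label
toy = pd.DataFrame({
    "fb_user_8": [1.0, np.nan, 0.0, np.nan, 1.0, np.nan],
    "churned":   [0,   1,      0,   1,      0,   0],
})

# Keep "not available" as its own category instead of merging it into 0
toy["fb_user_8"] = toy["fb_user_8"].fillna(-1).astype(int)

# Share of churners within each category
churn_rate_by_flag = toy.groupby("fb_user_8")["churned"].mean()
print(churn_rate_by_flag)
```

If the blanks had been filled with 0, the difference between "service not availed" and "not available" would no longer be visible to the model.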

In [None]:
# Replacing 'who knows?' with -1
flag_cols = ["fb_user_6","fb_user_7","fb_user_8","night_pck_user_6", "night_pck_user_7","night_pck_user_8"]
# Assign back instead of fillna(inplace=True) on a column selection, which newer pandas may not apply to churn
churn[flag_cols] = churn[flag_cols].fillna(-1)

churn[flag_cols].isnull().sum()

In [None]:
# Making the HouseFull
num_types = list(churn.columns[churn.dtypes != 'object'])
# Assign back so the fill is applied to churn itself
churn[num_types] = churn[num_types].fillna(0)

In [None]:
# HouseFull!
churn.isnull().sum().sum()

#### Create Target variable from month 9 :



In [None]:
# All hail to the Target!
churn["churned"] = (churn[["total_ic_mou_9","total_og_mou_9","vol_2g_mb_9","vol_3g_mb_9"]].sum(axis=1)==0).astype(int)
churn.head()

Here **"churned"** is the **target variable** where

- **1** means the customer has **churned**
- **0** means the customer has **not churned** 

#### Finding churn rate :

In [None]:
# Find churn rate (percentage of customers with churned == 1)
churn_rate = churn["churned"].mean()*100
churn_rate

**The churn rate is quite low** 

The dataset is therefore imbalanced. We will handle this during the model building phase.
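
One common way to handle imbalance is class weighting, which the models later in this notebook request through `class_weight="balanced"`. As a sketch, scikit-learn documents the balanced weights as `n_samples / (n_classes * np.bincount(y))`. The 90/10 split below is a made-up example, not the dataset's actual churn rate.

```python
import numpy as np

# Toy labels (hypothetical): 90 non-churners and 10 churners
y_toy = np.array([0] * 90 + [1] * 10)

counts = np.bincount(y_toy)                      # samples per class
weights = len(y_toy) / (len(counts) * counts)    # "balanced" class weights
print(weights)
```

The rare class receives the larger weight, so errors on churners count more during training.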

Since target is based on the month 9, **we should drop all the columns that are related to month 9**
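
Keeping any month-9 column would leak the target into the features. A small sketch of the selection logic is shown below, using a short illustrative subset of column names rather than the full dataset.

```python
# Illustrative subset of column names following the dataset's month-suffix pattern
columns = ["arpu_8", "arpu_9", "total_og_mou_9", "vbc_3g_8", "sep_vbc_3g", "aon"]

# Month-9 columns end in "_9"; sep_vbc_3g is a September (month 9) column without that suffix
leaky = [c for c in columns if c.endswith("_9") or c == "sep_vbc_3g"]
kept = [c for c in columns if c not in leaky]
print(leaky, kept)
```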

In [None]:
# Relatives of the Target ...
last_mnth_att = [i for i in churn.columns.tolist() if i.endswith('_9')]

In [None]:
# ... OUT!
churn.drop(last_mnth_att, inplace = True, axis = 1)
churn.drop('sep_vbc_3g', axis = 1, inplace = True)

# Sanctity Checks

In [None]:
# No variance? No place!
# A column with a single distinct value carries no information
empty_cols = [i for i in churn.columns if churn[i].nunique() == 1]

In [None]:
for i in empty_cols:
  print(churn[i].value_counts())
  print('-------------------')

In [None]:
# C ya another time
churn.drop(empty_cols, axis = 1, inplace = True)
churn.drop('mobile_number', axis = 1, inplace = True)

In [None]:
churn.shape

# EDA

**Since most of the variables are numeric, we can use box plots and histograms to compare their distributions for churned and non-churned customers**

In [None]:
# Analyse arpu (average revenue per user) with respect to all months 
churn.groupby("churned")[["arpu_6","arpu_7","arpu_8"]].median()

In [None]:
#plot the average revenue per user in each month 
fig = plt.figure(figsize=[18,6])
i = 1
for j in ['arpu_6', 'arpu_7', 'arpu_8']:
  plt.subplot(1,3,i)
  plt.ylim([0,5000])
  sns.boxplot(y=j,data = churn,x= "churned")
  plt.xlabel('Churn')
  plt.ylabel('ARPU')
  i = i+1



**From the above table and plot, we can say that for customers who churned, the average revenue per user dropped significantly in the 8th month**

In [None]:
# Analyze MOU (Minutes of usage voice calls )
fig = plt.figure(figsize=[18,6])
i = 1
for j in ['loc_og_mou_6','loc_og_mou_7','loc_og_mou_8']:
  plt.subplot(1,3,i)
  plt.ylim([0,5000])
  sns.boxplot(y=j,data = churn,x= "churned")
  plt.xlabel('Churn')
  plt.ylabel('Minutes of Usage')
  
  i = i+1




**Local outgoing minutes of usage are lower for churned customers in all three months**

In [None]:
# Analyze recency (number of days since the last recharge was done)
fig = plt.figure(figsize=[7,6])
sns.boxplot(y="recency",data = churn,x= "churned")
plt.xlabel('Churn')
plt.ylabel('Recency')
plt.title('Churn vs Recency')
plt.show()
  

**From the above plot we can conclude that recency is higher (more days since the last recharge) for churned customers**

In [None]:
# VBC vs Churn
fig = plt.figure(figsize=[15,6])
i = 1
for j in ['vbc_3g_6', 'vbc_3g_7', 'vbc_3g_8']:
  plt.subplot(1,3,i)
  sns.barplot(y=j,data = churn,x= "churned")
  plt.xlabel('Churn')
  plt.ylabel('VBC')  
  i = i+1
  





**VBC (volume-based cost) is lower for churned customers in all three months**

In [None]:
# Analyze AON column (Age of network)
fig = plt.figure(figsize=[7,6])
sns.boxplot(y="aon",data = churn,x= "churned")
plt.xlabel('Churn')
plt.ylabel('Age on Network')
plt.title('AON vs Churn')
plt.show()  

**From the above plot we can say that the Age on Network (i.e. the number of days since the customer joined the telecom network) is significantly lower for churned customers, which means newly joined customers are more likely to churn**

In [None]:
# Analyze VOL (Volume of 2g)

fig = plt.figure(figsize=[18,6])
i = 1
for j in ['vol_2g_mb_6','vol_2g_mb_7','vol_2g_mb_8']:
  plt.subplot(1,3,i)
  sns.boxplot(y=j,data = churn,x= "churned")
  plt.xlabel('Churn')
  plt.ylabel('2G Consumption in mb')  
  i = i+1


In [None]:
# Analyze VOL (Volume of 3g data used)

fig = plt.figure(figsize=[18,6])
i = 1
for j in ['vol_3g_mb_6','vol_3g_mb_7','vol_3g_mb_8']:
  plt.subplot(1,3,i)
  sns.boxplot(y=j,data = churn,x= "churned")
  plt.xlabel('Churn')
  plt.ylabel('3G Consumption in mb')  
  i = i+1


**From the above plots we can conclude that the volume of data used on both 2G and 3G drops significantly in the last two months for churned customers**

In [None]:
churn.groupby("churned")[["fb_user_6","fb_user_7","fb_user_8","night_pck_user_6","night_pck_user_7","night_pck_user_8"]].sum()

**Among churned customers, -1s (blanks) are the majority in the fb_user columns**

In [None]:
# Analyze fb_user (service flag: 1, 0 or -1 for blank)

fig = plt.figure(figsize=[18,6])
i = 1
for j in ["fb_user_6","fb_user_7","fb_user_8"]:
  plt.subplot(1,3,i)
  sns.countplot(x=j,data = churn,hue= "churned")
  plt.xlabel('FB_USER')
  plt.ylabel('Count')  
  i = i+1


**For churned users, the 1s in the fb_user columns are decreasing and the blanks (-1s) are increasing**

1: service availed

0: service not availed

-1: Not available

In [None]:
# Analyze night_pck_user (service flag: 1, 0 or -1 for blank)
fig = plt.figure(figsize=[18,6])
i = 1
for j in ["night_pck_user_6","night_pck_user_7","night_pck_user_8"]:
  plt.subplot(1,3,i)
  sns.countplot(x=j,data = churn,hue= "churned")
  plt.xlabel('Night_Pack_User')
  plt.ylabel('Frequency')  
  i = i+1


**For churned users, the blanks (-1s) in the night_pck_user columns are increasing**

1: service availed

0: service not availed

-1: Not available


# Data Pre-processing 

In [None]:
# All bow to continuous data 
dummy_var_columns = churn[["fb_user_6","fb_user_7","fb_user_8","night_pck_user_6","night_pck_user_7","night_pck_user_8"]].astype(object)
# dtype=int keeps the dummies numeric (newer pandas returns booleans by default)
dummy_df = pd.get_dummies(dummy_var_columns,drop_first=True,dtype=int)
churn=pd.concat([churn,dummy_df],axis=1)


In [None]:
# drop encoded variables 
churn=churn.drop(["fb_user_6","fb_user_7","fb_user_8","night_pck_user_6","night_pck_user_7","night_pck_user_8"],axis=1)

In [None]:
# Keeping the Guest Separate 
y = churn.pop('churned')
X = churn

# Outlier Treatment

In [None]:
# Out of Scale, Out of Data!
# Cap each feature at the 1.5*IQR fences. Note: a column whose IQR is 0 collapses to a single value.
for i in X.columns.to_list():
  q1, q3 = churn[i].quantile(0.25), churn[i].quantile(0.75)
  iqr = q3 - q1
  l_bound = q1 - (1.5*iqr)
  u_bound = q3 + (1.5*iqr)
  churn[i] = churn[i].clip(lower=l_bound, upper=u_bound)


In [None]:
# Surprise checks!
import random
# random.sample picks 8 distinct columns (random.choices could repeat a column)
sns.boxplot(data = churn[random.sample(churn.columns.to_list(), k = 8)], orient = 'h')
plt.ylabel('Features')
plt.title('After Outlier Treatment')
plt.show()

In [None]:
# Split train test data 
X_train,X_test,y_train,y_test = train_test_split(X,y, train_size=0.7,test_size=0.3,random_state=100,stratify = y)

In [None]:
# How many where?

for i in [X_train,X_test,y_train,y_test]:
  print(i.shape)

In [None]:
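# Scaling is the Best Policy!
scalar = StandardScaler()
# Fit on train only, then rebuild both frames so the scaled floats keep the column names and row index
X_train = pd.DataFrame(scalar.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
X_test = pd.DataFrame(scalar.transform(X_test), columns=X_test.columns, index=X_test.index)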

In [None]:
# All set?
X_train.describe()

# Model 1


#### PCA

In [None]:
# Welcome PCA
pca = PCA(random_state=42)

In [None]:
# Fit PCA on the guests
pca.fit(X_train)

In [None]:
# Total number of components
X_train.shape, len(pca.components_)

**Since there are 158 features in X_train, PCA returns 158 principal components**

In [None]:
# Who explains how much? 
pca.explained_variance_ratio_

#### Scree Plot

In [None]:
# Lets find the total best
cum_var = np.cumsum(pca.explained_variance_ratio_)
fig = plt.figure(figsize=[12,8])
plt.vlines(x=60, ymax=1, ymin=0, colors="r", linestyles="--")
plt.hlines(y=0.95, xmax=150, xmin=0, colors="g", linestyles="--")
plt.plot(cum_var)
plt.ylabel("Cumulative variance explained")
plt.show()

From the above scree plot we can say that **60 principal components explain 95% of the variance (information) in the dataset** 
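
Reading the number of components off the plot can also be done numerically. The helper below is a hypothetical sketch (its name and the toy ratios are not from the notebook): it returns the smallest number of leading components whose cumulative explained variance reaches a target.

```python
import numpy as np

def components_needed(explained_variance_ratio, target=0.95):
    """Smallest number of leading components whose cumulative ratio reaches target."""
    cum_var = np.cumsum(explained_variance_ratio)
    # searchsorted gives the first index with cum_var >= target; +1 converts it to a count
    return int(np.searchsorted(cum_var, target) + 1)

# Toy ratios (hypothetical), sorted from largest to smallest as PCA returns them
ratios = np.array([0.60, 0.30, 0.06, 0.04])
n_components = components_needed(ratios)
print(n_components)
```

On the fitted object from this notebook, the same helper could be called as `components_needed(pca.explained_variance_ratio_)`.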

In [None]:
# Seal the Deal!
pca_final = IncrementalPCA(60)

In [None]:
# Transform Again
X_train_pca = pca_final.fit_transform(X_train)
X_test_pca = pca_final.transform(X_test)

X_train_pca.shape, X_test_pca.shape

In [None]:
# Anybody related?
plt.figure(figsize=[10,10])
sns.heatmap(pd.DataFrame(X_train_pca).corr())
plt.show()

**From the above figure we can conclude that all the PCs are uncorrelated to each other as the black color indicates nearly 0 correlation**
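
The same conclusion can be checked numerically rather than by eye. The sketch below uses NumPy's SVD as a stand-in for exact PCA on made-up data (it does not use the notebook's `IncrementalPCA` object, which is an approximation) and measures the largest off-diagonal correlation between component scores.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy correlated features (hypothetical data): column 1 depends on column 0
X = rng.normal(size=(200, 3))
X[:, 1] += 0.8 * X[:, 0]

# Center the data; the rows of Vt are the principal axes
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

# Project onto the principal axes to get the component scores
scores = Xc @ Vt.T

# Largest absolute correlation between two different components
corr = np.corrcoef(scores, rowvar=False)
max_off_diag = np.abs(corr - np.eye(corr.shape[0])).max()
print(max_off_diag)
```

For the notebook's data, the analogous check would apply `np.corrcoef(X_train_pca, rowvar=False)` and inspect the off-diagonal entries.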

#### Model Building

**As per the case study, we need to build 2 models: Model 1 predicts the churned cases, and Model 2 finds the features that drive churn.**


We will use a random forest to build a model that correctly predicts churn.

In [None]:
# Report Card
results_df = pd.DataFrame(index= ['basic', 'best'], columns = ['recall_train', 'recall_test'])

In [None]:
# Afforestation
rf_basic = RandomForestClassifier(random_state=42,n_estimators=100,n_jobs=-1,
                                  class_weight="balanced",max_depth=10,
                                  min_samples_split=50,min_samples_leaf=30,
                                  oob_score=True)

In [None]:
# Fit the model 
rf_basic.fit(X_train_pca,y_train)

In [None]:
# Calculate OOB score
rf_basic.oob_score_

In [None]:
# Find predicted probabilities and predicted values based on basic model for train data
y_train_pred_proba = rf_basic.predict_proba(X_train_pca)
y_train_pred = rf_basic.predict(X_train_pca)

In [None]:
# Plot roc curve for train data
plot_roc_curve(rf_basic,X_train_pca,y_train)
plt.show()

In [None]:
# Find area under the roc curve for train data
area_under_roc_curve = roc_auc_score(y_train,y_train_pred_proba[:,1])
area_under_roc_curve

In [None]:
# Calculate the default confusion matrix
confusion_train_for_basic = confusion_matrix(y_train,y_train_pred)
confusion_train_for_basic

Since our data is imbalanced, accuracy is not a good metric.

As per the business objective, the main goal is to identify the customers who will churn and stop them from doing so by giving them attractive offers. 

It is acceptable if a few customers are misclassified as churners and receive offers: that costs the company less than losing customers who actually churn.

Hence, the trade-off is to limit false negatives at the cost of more false positives. 
This means we can take **recall** as our evaluation metric. 
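
To make the trade-off concrete, here is a small sketch that computes recall and precision by hand from made-up labels and predictions (the ten values are illustrative only).

```python
import numpy as np

# Toy labels and predictions (hypothetical): 1 = churned, 0 = stayed
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

tp = int(((y_true == 1) & (y_pred == 1)).sum())  # churners caught
fn = int(((y_true == 1) & (y_pred == 0)).sum())  # churners missed
fp = int(((y_true == 0) & (y_pred == 1)).sum())  # loyal customers flagged

recall = tp / (tp + fn)      # share of actual churners that were caught
precision = tp / (tp + fp)   # share of flagged customers who actually churn
print(recall, precision)
```

Here one churner is missed (a false negative) while two loyal customers receive an unnecessary offer (false positives). Optimizing recall accepts more of the latter to reduce the former.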


In [None]:
# Find recall and precision for train set
recall_basic_train = recall_score(y_train,y_train_pred)
precision_basic_train  = precision_score(y_train,y_train_pred)
print(recall_basic_train)
print(precision_basic_train)

In [None]:
# Find predicted probabilities and predicted values based on basic model
y_test_pred_proba = rf_basic.predict_proba(X_test_pca)
y_test_pred = rf_basic.predict(X_test_pca)

In [None]:
# Plot roc curve
plot_roc_curve(rf_basic,X_test_pca,y_test)
plt.show()

In [None]:
#Find area under the roc curve for test data
area_under_roc_curve = roc_auc_score(y_test,y_test_pred_proba[:,1])
area_under_roc_curve

In [None]:
# calculate the default confusion matrix for test data
confusion_test_for_basic = confusion_matrix(y_test,y_test_pred)
confusion_test_for_basic

In [None]:
# Find recall and precision for test set
recall_basic_test = recall_score(y_test,y_test_pred)
precision_basic_test  = precision_score(y_test,y_test_pred)
print(recall_basic_test)
print(precision_basic_test)

In [None]:
# Storing train and test recall of the basic model
results_df.loc['basic'] = [recall_basic_train, recall_basic_test]

In [None]:
results_df.loc['basic']

**We can observe that precision and recall drop significantly from train to test. Let's try to improve the model.**


#### Hyperparameter tuning 

In [None]:
# Create a basic rf
rf1 = RandomForestClassifier(random_state=42, n_jobs=-1,class_weight="balanced")

**Round1 tuning :**

In [None]:
# Create params
params = {
    'max_depth': [10,20],
    'min_samples_leaf':[50,100],
    'min_samples_split':[100,150]
       
}

# Create grid search
grid_search = GridSearchCV(estimator=rf1, param_grid=params, 
                          cv=3,verbose=1,n_jobs=-1,scoring = 'recall')

grid_search.fit(X_train_pca,y_train)

In [None]:
# Best score for the gridsearch
grid_search.best_score_

In [None]:
# Best estimator
grid_search.best_estimator_

#### Round2 Tuning: 


In [None]:
#Create params for second round
params = {
    'max_depth': [5,10,15],
    'min_samples_leaf':[75,100,125],
    'min_samples_split':[75,100,125]
       
}

grid_search = GridSearchCV(estimator=rf1, param_grid=params, 
                          cv=3,verbose=1,n_jobs=-1, scoring = 'recall')

grid_search.fit(X_train_pca,y_train)

In [None]:
grid_search.best_score_

In [None]:
grid_search.best_estimator_

#### Round3 Tuning: 

In [None]:
rf2 = RandomForestClassifier(random_state=42, n_jobs=-1,class_weight="balanced", 
                             max_depth=5,min_samples_leaf=75, min_samples_split=75)

In [None]:
params = {
    'n_estimators': [200,250,300],
    # 'auto' was an alias of 'sqrt' for classifiers and is rejected by scikit-learn 1.3+
    'max_features': ['sqrt','log2']
}

grid_search = GridSearchCV(estimator=rf2, param_grid=params, 
                          cv=3,verbose=1,n_jobs=-1, scoring = 'recall')

grid_search.fit(X_train_pca,y_train)

In [None]:
grid_search.best_score_

In [None]:
grid_search.best_estimator_

In [None]:
# Create best Random Forest model with the best estimated parameters 
rf_best = RandomForestClassifier(random_state=42, n_jobs=-1,class_weight="balanced", max_depth=5,min_samples_leaf=75, min_samples_split=75)

In [None]:
# fit the model 
rf_best.fit(X_train_pca,y_train)

In [None]:
# Find predicted probabilities and predicted values based on best model for train data
y_train_pred_proba = rf_best.predict_proba(X_train_pca)
y_train_pred = rf_best.predict(X_train_pca)

In [None]:
# Plot roc curve for train data
plot_roc_curve(rf_best,X_train_pca,y_train)
plt.show()

In [None]:
#Find area under the roc curve for train data
area_under_roc_curve = roc_auc_score(y_train,y_train_pred_proba[:,1])
area_under_roc_curve

In [None]:
# calculate the default confusion matrix
confusion_train_for_best = confusion_matrix(y_train,y_train_pred)
confusion_train_for_best

In [None]:
# Find recall and precision for train set
recall_best_train = recall_score(y_train,y_train_pred)
precision_best_train  = precision_score(y_train,y_train_pred)
print(recall_best_train)
print(precision_best_train)

In [None]:
# Find predicted probabilities and predicted values based on best model for test data
y_test_pred_proba = rf_best.predict_proba(X_test_pca)
y_test_pred = rf_best.predict(X_test_pca)

In [None]:
# Plot roc curve for test data
plot_roc_curve(rf_best,X_test_pca,y_test)
plt.show()

In [None]:
#Find area under the roc curve for test data
area_under_roc_curve = roc_auc_score(y_test,y_test_pred_proba[:,1])
area_under_roc_curve

In [None]:
# calculate the default confusion matrix for test data
confusion_test_for_best = confusion_matrix(y_test,y_test_pred)
confusion_test_for_best

In [None]:
# Find recall and precision for test set
recall_best_test = recall_score(y_test,y_test_pred)
precision_best_test  = precision_score(y_test,y_test_pred)
print(recall_best_test)
print(precision_best_test)

In [None]:
results_df.loc['best'] = [recall_best_train, recall_best_test]

In [None]:
# Final Results
results_df

From the above dataframe, we can see that tuning brought a significant improvement. 

We can conclude that the overfitting has been taken care of.

#### Finding best probability cutoff value to get best recall value

In [None]:
# Create dataframe for final predicted train data
y_train_pred_final = pd.DataFrame()
y_train_pred_final["churn"] = y_train
y_train_pred_final["pred_churn_proba"] = y_train_pred_proba[:,1]

In [None]:
p, r, thresholds = precision_recall_curve(y_train_pred_final.churn,y_train_pred_final.pred_churn_proba)

In [None]:
plt.xlabel('Probability thresholds')
plt.ylabel('recall')
plt.vlines(x=0.35, ymax=1, ymin=0, colors="b", linestyles="--")
plt.hlines(y=0.95, xmax=1, xmin=0, colors="g", linestyles="--")
plt.plot(thresholds, r[:-1], "r-")
plt.show()

**From the above graph we can observe that a probability cutoff of 0.35 achieves 95% recall on the training data**
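
The cutoff can also be found programmatically instead of being read off the graph. The helper below is a hypothetical sketch (its name and the toy inputs are not from the notebook): it returns the largest cutoff that still reaches a target recall, where `target_recall` is assumed to lie in (0, 1].

```python
import math
import numpy as np

def threshold_for_recall(y_true, proba, target_recall):
    """Largest cutoff t such that predicting 1 when proba >= t reaches target_recall."""
    # Probabilities of the actual positives, highest first
    pos = np.sort(proba[y_true == 1])[::-1]
    # Number of positives that must be caught to reach the target
    k = math.ceil(target_recall * len(pos))
    # The k-th highest positive probability is the largest cutoff that catches k of them
    return pos[k - 1]

# Toy labels and predicted probabilities (hypothetical)
y = np.array([1, 1, 1, 1, 0, 0])
proba = np.array([0.9, 0.8, 0.4, 0.3, 0.6, 0.1])

cutoff = threshold_for_recall(y, proba, 0.75)
recall_at_cutoff = ((proba >= cutoff) & (y == 1)).sum() / (y == 1).sum()
print(cutoff, recall_at_cutoff)
```

A cutoff chosen on the training data should still be validated on the test data, as the following cells do.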

In [None]:
# Create dataframe for final predicted test data
y_test_pred_final = pd.DataFrame()
y_test_pred_final["churn"] = y_test
y_test_pred_final["pred_churn_proba"] = y_test_pred_proba[:,1]

In [None]:
# Assign class labels 1s and 0s based on probability cutoff 0.35
y_test_pred_final["pred_churn"] = (y_test_pred_final["pred_churn_proba"] >= 0.35).astype(int)

In [None]:
#Find confusion matrix
confusion_matrix(y_test_pred_final.churn,y_test_pred_final.pred_churn)

In [None]:
recall_score(y_test_pred_final.churn,y_test_pred_final.pred_churn)

Hence, we can recommend the rf_best model to predict which customers are going to churn.

We can also note that the recall score is 0.92, which means the model captures most of the actual churners.  

# Model 2

## Logistic Regression

PCA is not helpful for deriving important features, because each principal component mixes many original variables. We have to choose a model that is easy to interpret. Hence, we will build another model to achieve this goal.

In [None]:
# Back to Pavilion
from sklearn.linear_model import LogisticRegression
lg = LogisticRegression(max_iter=1000, class_weight='balanced')
lg.fit(X_train, y_train)
y_pred = lg.predict(X_train)
recall_score(y_train, y_pred)

## Checking Assumptions

In [None]:
# Attention
sns.regplot(x = y_pred, y = y_train, line_kws = {'color' : 'black'})
plt.title('Checking for Linearity')
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.show()

In [None]:
# Normality
res = y_train - y_pred
plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
sns.histplot(x = res, kde = True, color = 'black')
plt.title('Checking for Normality')
plt.xlabel('Residuals')
plt.subplot(1,2,2)
# Predicted values go on x and residuals on y, matching the axis labels below
sns.regplot(x = y_pred, y = res, color = 'black')
plt.title('Checking for Constant Variance')
plt.ylabel('Residuals')
plt.xlabel('Predicted Values')
plt.show()

We have more than 100 features, and making business decisions on that many features is not practical. 

Hence, we have to come up with ways to select features. 

Let's start with the well-known RFE (recursive feature elimination).
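
As a small illustration of how RFE behaves, the sketch below runs it on synthetic data from `make_classification` (the sample size, feature counts and `random_state` are arbitrary choices, not settings from this notebook). RFE repeatedly fits the estimator and drops the weakest feature until the requested number remains.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data (hypothetical): 10 features, of which 3 are informative
X_toy, y_toy = make_classification(n_samples=300, n_features=10, n_informative=3,
                                   n_redundant=0, random_state=0)

# Drop one feature per round (step=1) until 3 features remain
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3, step=1)
selector.fit(X_toy, y_toy)

print(selector.support_)   # boolean mask of the kept features
print(selector.ranking_)   # rank 1 marks a kept feature; larger ranks were dropped earlier
```

The boolean mask in `support_` is what the notebook uses next to subset the training and test columns.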

## RFE

In [None]:
from sklearn.feature_selection import RFE
lg.fit(X_train, y_train)
rfe = RFE(lg, n_features_to_select=50).fit(X_train, y_train)           


In [None]:
# Top performing Features
# .copy() gives independent frames, so later in-place column drops are safe
X_train = X_train.loc[:, rfe.support_].copy()
X_test = X_test.loc[:, rfe.support_].copy()

## Checking for Multicollinearity

In [None]:
# Personal Relationship Prohibited
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = pd.DataFrame()
vif['features'] = X_train.columns
vif['vif'] = [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]
vif.sort_values(by = 'vif', ascending=False).head(20)


We see VIF values greater than 5, which indicates high multicollinearity among the features. 

Let's build a function to solve this.
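
For intuition, the VIF of feature j is 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on all the other features. The NumPy-only sketch below implements that definition with an explicit intercept on made-up data. It is an illustration, not the statsmodels routine used in this notebook, and the function name is hypothetical.

```python
import numpy as np

def vif_scores(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the other columns."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Design matrix: intercept plus every column except j
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1 - resid.var() / target.var()
        out[j] = 1 / (1 - r2)
    return out

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.05 * rng.normal(size=200)   # near-duplicate of a

vif = vif_scores(np.column_stack([a, b, c]))
print(vif)
```

The near-duplicate pair `a` and `c` gets very large VIF values, while the independent column `b` stays close to 1. Dropping either `a` or `c` would bring the remaining values down.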

In [None]:
# Action Plan
def vifs(train, threshold = 5):
  # Repeatedly drop the feature with the highest VIF until none exceeds the threshold
  while True:
    vif = pd.Series([variance_inflation_factor(train.values, i) for i in range(train.shape[1])],
                    index = train.columns)
    if not vif.max() > threshold:
      break
    train.drop(vif.idxmax(), axis = 1, inplace = True)
  return train.columns

In [None]:
# The Survivors
print(vifs(X_train))

In [None]:
X_test = X_test[X_train.columns.tolist()]

In [None]:
# Report Card v2
results_2 = pd.DataFrame(index = ['log_reg', 'ridge'], columns = ['train_recall', 'test_recall'])

In [None]:
# Model with the Survivors
lg2 = LogisticRegression(class_weight={1:0.95, 0:0.05})
lg2.fit(X_train, y_train)
recall_score(y_test, lg2.predict(X_test))

Let's tune it further

In [None]:
# Scaling that extra mile
from sklearn.model_selection import KFold
params = [{'C':list(range(1,20)), 
           'max_iter': list(np.arange(1000,10000,1000))}]
kf = KFold(n_splits = 10, shuffle = True, random_state = 100)
gsv = GridSearchCV(
    estimator = lg2,
    param_grid = params,
    scoring = 'recall',
    cv = kf,  # use the 10-fold splitter defined above
    verbose = 1
)
gsv.fit(X_train, y_train)

In [None]:
# The toppers
gsv.best_params_,gsv.best_score_

In [None]:
# Using the Toppers
lg2 = LogisticRegression(C = 2, max_iter=1000, class_weight={1:0.95, 0:0.05})
lg2.fit(X_train, y_train)
lg_test_recall = recall_score(y_test, lg2.predict(X_test))
lg_train_recall = recall_score(y_train, lg2.predict(X_train))

In [None]:
# PTA
results_2.loc['log_reg'] = [lg_train_recall, lg_test_recall]
results_2.loc['log_reg']

After tuning Logistic Regression, we achieved a recall of 0.88 on test and 0.89 on train, which is good. 

Let's try Ridge classification, as it is more suitable for multicollinearity than Lasso. 

## Ridge Classification

In [None]:
# Regularise like Ridge
from sklearn.linear_model import RidgeClassifier,RidgeClassifierCV
ridge = RidgeClassifier(class_weight={1:0.95, 0:0.05})
ridge.fit(X_train, y_train)
y_pred2 = ridge.predict(X_train)
recall_score(y_train, y_pred2)

Let's tune it further.

In [None]:
# Always Room for Improvement
kf = KFold(n_splits = 10)
params = [{'alpha': list(np.arange(1,100,5)) }]
gsc = GridSearchCV(
    estimator = ridge,
    param_grid = params,
    scoring = 'recall',
    cv = kf,
    verbose = 1
)
gsc.fit(X_train, y_train)

In [None]:
# The Improvers...
gsc.best_params_,gsc.best_score_

In [None]:
# ... In Action
ridge2 = RidgeClassifier(alpha = gsc.best_params_['alpha'], class_weight={1:0.95, 0:0.05})
ridge2.fit(X_train,y_train)
y_pred_test = ridge2.predict(X_test)
ridge_test = recall_score(y_test, y_pred_test)
ridge_train = recall_score(y_train, ridge2.predict(X_train))

In [None]:
# PTA 2
results_2.loc['ridge'] = [ridge_train, ridge_test]
results_2

From the above results, we can see that ridge is performing considerably better than logistic regression. Hence, we will derive the top features from ridge model. 

In [None]:
# The Top Features
# Signed ridge coefficients, one per remaining feature
imp_featrs = pd.DataFrame({'features': X_train.columns,
                           'importance': ridge2.coef_.ravel()})

In [None]:
# Top 20 overall
# Rank by absolute coefficient and keep only the 20 largest
top_20_overall = pd.DataFrame({'features': X_train.columns,
                               'importance': np.abs(ridge2.coef_).ravel()}
                              ).sort_values(by = 'importance', ascending= False).head(20)

Hence, the top influencing features are: 

In [None]:
# *Flying Confetti*
plt.figure(figsize = (10,10))
sns.barplot(y = top_20_overall.features, x = top_20_overall.importance)

The top 5 features are:


In [None]:
top_20_overall.features[:5]

# Results

From Model 1, we saw that Random forest with PCA was able to predict churn with recall score of 0.92.

From Model 2, we saw that **last day of recharge in month 8**, **the recency**  and **total outgoing minutes of usage in month 8** are the top 3 features. 

To prevent customers from churning, the company can roll out the following changes:

*  Send reminders to customers to recharge every month
*  Offer attractive plans to customers who have not used the service or recharged recently.





The top features alone do not say how they affect churn. We might want to know whether each one affects it positively or negatively. Let's look at them.

In [None]:
neg_influencers = imp_featrs[imp_featrs.importance > 0].sort_values(by = 'importance', ascending = False).features[:20]
pos_influencers = imp_featrs[imp_featrs.importance < 0].sort_values(by = 'importance').features[:20]

If a coefficient is negative, the feature is inversely related to the target: as the feature increases, the predicted churn decreases. 

That is what we want: the target should go down, that is, change from 1 to 0. 

Hence, positive coefficients increase churn and negative coefficients decrease churn. 
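
A short sketch of this reading of the coefficients follows. The feature names and coefficient values are hypothetical and chosen only for illustration; they are not results from this notebook.

```python
import pandas as pd

# Hypothetical coefficients for illustration only; not results from this notebook
coefs = pd.Series({
    "recency": 0.40,
    "total_og_mou_8": -0.35,
    "aon": -0.10,
    "roam_og_mou_8": 0.15,
})

# Overall influence: rank by absolute value, regardless of direction
ranked = coefs.abs().sort_values(ascending=False)

# Direction: positive coefficients push towards churn, negative ones away from it
increase_churn = coefs[coefs > 0].sort_values(ascending=False)
decrease_churn = coefs[coefs < 0].sort_values()

print(ranked.index.tolist())
print(increase_churn.index.tolist())
print(decrease_churn.index.tolist())
```

Because the features were standardized before fitting, coefficient magnitudes are comparable across features, which is what makes this ranking meaningful.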

In [None]:
pos_influencers[:5]

The above features decrease churn, which means they help retain customers. The company needs to roll out offers that encourage users to recharge more 3G data, or reduce costs so that calls from other operators cost less. 

In [None]:
neg_influencers[:5]

These features are increasing churn. 

In [None]:
import warnings
warnings.filterwarnings('ignore')