<div class="list-group" id="list-tab" role="tablist">
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:black; border:0; color:tomato' role="tab" aria-controls="home"><center>Telecom Churn Prediction Indian and Southeast Asian market</center></h2>


# <font color='black'>
![telecom](https://www.pinclipart.com/picdir/big/171-1716993_since-b2c-companies-will-at-times-provide-a.png)

## *Business Problem Overview*
In the telecom industry, customers can choose from multiple service providers and actively switch from one operator to another. In this highly competitive market, the industry experiences an average annual churn rate of 15-25%. Since it costs 5-10 times more to acquire a new customer than to retain an existing one, customer retention has become even more important than customer acquisition.
In this project, you will analyse customer-level data of a leading telecom firm, build predictive models to identify high-value customers at high risk of churn, and identify the main indicators of churn.

## *Business objective*
The business objective is to predict the churn in the last (i.e. the ninth) month using the features/data from the first three months. To do this task well, understanding the typical customer behaviour during churn will be helpful.
### *Understanding Customer Behaviour During Churn*
Customers usually do not decide to switch to another competitor instantly, but rather over a period of time (this is especially applicable to high-value customers). In churn prediction, we assume that there are three phases of the customer lifecycle:

The ‘good’ phase: In this phase, the customer is happy with the service and behaves as usual.

The ‘action’ phase: The customer experience starts to sour in this phase, e.g. he/she gets a compelling offer from a competitor, faces unjust charges, becomes unhappy with service quality etc. In this phase, the customer usually shows different behaviour than in the ‘good’ months. It is crucial to identify high-churn-risk customers in this phase, since some corrective actions can still be taken at this point (such as matching the competitor’s offer or improving the service quality).

The ‘churn’ phase: In this phase, the customer is said to have churned. You define churn based on this phase. Also, it is important to note that at the time of prediction (i.e. the action months), this data is not available to you for prediction. Thus, after tagging churn as 1/0 based on this phase, you discard all data corresponding to this phase.

In this case, since you are working over a four-month window, the first two months are the ‘good’ phase, the third month is the ‘action’ phase, while the fourth month is the ‘churn’ phase.

## *About dataset:*
Dataset contains customer-level information for a span of four consecutive months - June, July, August and September. The months are encoded as 6, 7, 8 and 9, respectively.
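
The month encoding and the phase split described above can be sketched as a small lookup. This is only an illustration: the names `PHASES` and `phase_of` are hypothetical helpers, not part of the dataset or of the notebook code, and the sketch assumes that month-specific columns end in `_6`, `_7`, `_8` or `_9`.

```python
# Phase -> month numbers, following the four-month window described above.
# PHASES and phase_of are hypothetical names used only for this illustration.
PHASES = {
    "good": (6, 7),
    "action": (8,),
    "churn": (9,),
}

def phase_of(column_name):
    """Return the lifecycle phase implied by a column's month suffix, or None."""
    suffix = column_name.rsplit("_", 1)[-1]
    if not suffix.isdigit():
        return None  # no numeric month suffix, e.g. 'aon'
    month = int(suffix)
    for phase, months in PHASES.items():
        if month in months:
            return phase
    return None

print(phase_of("arpu_6"))          # good
print(phase_of("total_ic_mou_8"))  # action
print(phase_of("vol_3g_mb_9"))     # churn
print(phase_of("aon"))             # None
```

Columns whose month is written as a name rather than a number (for example `jun_vbc_3g`) are not covered by this suffix rule and return `None`.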



# Importing Libraries

In [None]:
# hide warnings
import warnings
warnings.filterwarnings("ignore")
# Import Libraries
import sys,joblib
import six
sys.modules['sklearn.externals.six'] = six
sys.modules['sklearn.externals.joblib'] = joblib
import numpy as np 
import pandas as pd
import re
import matplotlib.pyplot as plt 
import seaborn as sns
sns.set_context("talk", font_scale = 0.65, rc={"grid.linewidth": 5})
pd.set_option('display.max_columns', 300)
pd.set_option('display.max_rows', 400)
from sklearn.linear_model import LogisticRegression,LinearRegression,LassoCV,Lasso,Ridge,LogisticRegressionCV
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA,IncrementalPCA
from sklearn.model_selection import GridSearchCV,cross_val_score,KFold,StratifiedKFold,RandomizedSearchCV
from sklearn.metrics import confusion_matrix,classification_report,accuracy_score,precision_score,recall_score
from sklearn.metrics import precision_recall_curve,roc_auc_score,roc_curve
from imblearn.over_sampling import SMOTE,RandomOverSampler,ADASYN
from sklearn.preprocessing import StandardScaler,MinMaxScaler,QuantileTransformer
from scipy.stats import skew
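from fancyimpute import KNN
# IterativeImputer is still experimental in scikit-learn and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.impute import KNNImputer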
from sklearn.naive_bayes import MultinomialNB,BernoulliNB,GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier,AdaBoostClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm


In [None]:
telecom = pd.read_csv('../input/telecom-churn-dataset/telecom_churn_data.csv')
telecom.head()

In [None]:
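#Checking the shape
print(telecom.shape)
print('\n')
# Checking Dataset Info (info() prints its report itself and returns None;
# show_counts replaces the deprecated null_counts argument)
telecom.info(verbose=True,show_counts=True)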

In [None]:
#Summary and checking outliers
telecom.describe(percentiles=[0.25,0.5,0.75,0.99])

In [None]:
#Function to check percentage of null values present in dataset 
def calnullpercentage(df):
    missing_num= df[df.columns].isna().sum().sort_values(ascending=False)
    missing_perc= (df[df.columns].isna().sum()/len(df)*100).sort_values(ascending=False)
    missing= pd.concat([missing_num,missing_perc],keys=['Total','Percentage'],axis=1)
    missing= missing[missing['Percentage']>0]
    return missing

In [None]:
# check the %age of null values
calnullpercentage(telecom)


In [None]:
# Number of columns having null values
print(len(calnullpercentage(telecom)))

`Out of 226 columns 166 have null values`

In [None]:
#Checking categorical and numerical columns 
telecom.select_dtypes(include='object').head(3)

## Filter high-value customers
We need to predict churn only for the high-value customers. Define high-value customers as follows: Those who have recharged with an amount more than or equal to X, where X is the 70th percentile of the average recharge amount in the first two months (the good phase).

In [None]:
# Deriving new columns for total recharge amount data for 6 and 7th month
telecom['tot_rech_amt_data_6'] = telecom['total_rech_data_6'] * telecom['av_rech_amt_data_6']
telecom['tot_rech_amt_data_7'] = telecom['total_rech_data_7'] * telecom['av_rech_amt_data_7']
# Deriving new columns for total amount spent during  6 and 7th month
telecom['tot_amt_6'] = telecom[['total_rech_amt_6','tot_rech_amt_data_6']].sum(axis=1)
telecom['tot_amt_7'] = telecom[['total_rech_amt_7','tot_rech_amt_data_7']].sum(axis=1)
#first two months average
telecom['avg_amt_6_7'] = telecom[['tot_amt_6','tot_amt_7']].mean(axis=1)
# Filtering customers based on percentile having goodphase_avg more than or equal to cutoff of 70th percentile
telecom=telecom.loc[(telecom['avg_amt_6_7'] >= np.percentile(telecom['avg_amt_6_7'], 70))]
telecom.shape

`The filter keeps customers whose average recharge amount in the first two months is more than or equal to X, where X is the 70th percentile; this gives about 30k rows. Using strictly more than (>) would give about 29.9k rows, but >= follows the problem statement.`
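
The difference between `>=` and `>` at the percentile cutoff can be seen on a tiny example. The amounts below are made up for illustration and are not taken from the dataset; they are chosen so that two customers sit exactly on the cutoff.

```python
import numpy as np
import pandas as pd

# Hypothetical average recharge amounts for ten customers.
avg_amt = pd.Series([100, 200, 300, 400, 500, 600, 700, 700, 900, 1000])

cutoff = np.percentile(avg_amt, 70)    # 70th percentile (linear interpolation)
n_ge = int((avg_amt >= cutoff).sum())  # customers kept with >=
n_gt = int((avg_amt > cutoff).sum())   # customers kept with >
print(cutoff, n_ge, n_gt)              # 700.0 4 2
```

Customers whose value equals the cutoff are kept by `>=` and dropped by `>`, which is why the two filters return slightly different row counts.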

In [None]:
# Deriving new columns for total recharge amount data for 8 and 9th month
telecom['tot_rech_amt_data_8'] = telecom['total_rech_data_8'] * telecom['av_rech_amt_data_8']
telecom['tot_rech_amt_data_9'] = telecom['total_rech_data_9'] * telecom['av_rech_amt_data_9']
# Deriving new columns for total amount spent during  8 and 9th month
telecom['tot_amt_8'] = telecom[['total_rech_amt_8','tot_rech_amt_data_8']].sum(axis=1)
telecom['tot_amt_9'] = telecom[['total_rech_amt_9','tot_rech_amt_data_9']].sum(axis=1)

In [None]:
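#Finding categorical columns where dtype is float but the columns hold only 0/1 values (plus NaN)
catg= []
for col in telecom.columns:
    # unique() counts NaN as a value, so a 0/1 flag has 2 or 3 distinct entries.
    # Note: `== 2 | 3` would be evaluated as `== (2 | 3)`, i.e. `== 3`.
    if len(telecom[col].unique()) in (2, 3):
        catg.append(col)
# Converting into categorical or object type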
telecom[catg]=telecom[catg].apply(lambda x:x.astype('object'))
col_tmp=['total_rech_num_6','total_rech_num_7','total_rech_num_8','total_rech_num_9','total_rech_data_6',\
        'total_rech_data_7','total_rech_data_8','total_rech_data_9']
telecom[col_tmp]=telecom[col_tmp].apply(lambda x:x.astype('object'))
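
A note on testing for "two or three distinct values" in Python: `|` is the bitwise OR operator and binds more tightly than `==`, so an expression of the form `n == 2 | 3` does not mean "n is 2 or 3". The short sketch below (with a made-up value for `n_unique`) shows the difference.

```python
n_unique = 2  # hypothetical number of distinct values in a column

# `|` is bitwise OR: 0b10 | 0b11 == 0b11, so `2 | 3` is simply 3.
print(2 | 3)               # 3

# Because `|` binds tighter than `==`, this compares n_unique with 3.
print(n_unique == 2 | 3)   # False

# A membership test states "two or three distinct values" directly.
print(n_unique in (2, 3))  # True
```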

In [None]:
x=['tot_amt_8','total_rech_amt_8','tot_rech_amt_data_8','total_rech_data_8','av_rech_amt_data_8']
plt.figure(figsize=(8,5))
sns.heatmap(telecom[x].corr(),annot=True,cmap='viridis_r')

#### Dropping redundant columns, since we have already created derived features from them and the derived features reflect the same information.

In [None]:
telecom.drop(['tot_rech_amt_data_6','tot_rech_amt_data_7','tot_rech_amt_data_8',\
              'tot_rech_amt_data_9'],inplace=True,axis=1)


## Identifying CHURN CUSTOMERS 

Now tag the churned customers (churn=1, else 0) based on the fourth month as follows: Those who have not made any calls (either incoming or outgoing) AND have not used mobile internet even once in the churn phase. The attributes you need to use to tag churners are:

- total_ic_mou_9
- total_og_mou_9
- vol_2g_mb_9
- vol_3g_mb_9

After tagging churners, remove all the attributes corresponding to the churn phase (all attributes having ‘ _9’, etc. in their names).

In [None]:
# Where summation of columns = 0 then churn =1  else 0
telecom['churn']= np.where(telecom[['total_ic_mou_9','total_og_mou_9','vol_2g_mb_9',\
                                    'vol_3g_mb_9']].sum(axis=1)== 0,1,0)
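
The tagging rule can be checked on a few made-up rows. This is a sketch with hypothetical values, not rows from the dataset. Since usage values are non-negative, a row sum of zero is equivalent to "all four values are zero"; note that `sum(axis=1)` skips NaN by default, so a missing value is treated like zero usage.

```python
import numpy as np
import pandas as pd

usage_cols = ["total_ic_mou_9", "total_og_mou_9", "vol_2g_mb_9", "vol_3g_mb_9"]
toy = pd.DataFrame(
    [[0.0, 0.0, 0.0, 0.0],      # no calls, no data -> churn
     [0.0, 12.5, 0.0, 0.0],     # made outgoing calls -> not churn
     [0.0, 0.0, np.nan, 3.2],   # used 3G data -> not churn
     [0.0, 0.0, np.nan, 0.0]],  # NaN is skipped by sum(), so this counts as churn
    columns=usage_cols,
)
toy["churn"] = np.where(toy[usage_cols].sum(axis=1) == 0, 1, 0)
print(toy["churn"].tolist())  # [1, 0, 0, 1]
```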

In [None]:
# Removing all features having ‘ _9’, etc. in their names
telecom.drop(telecom.filter(regex='_9|sep', axis = 1).columns, axis=1,inplace=True)
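
`DataFrame.filter(regex=...)` keeps every column whose name matches the pattern anywhere in the string (it uses `re.search`). The sketch below applies the same pattern to a short list of column names to show which ones would be dropped. The name `sep_vbc_3g` is an assumption based on the naming of `jun_vbc_3g`, `jul_vbc_3g` and `aug_vbc_3g`.

```python
import re

pattern = re.compile(r"_9|sep")
# Column names in the style of this dataset; sep_vbc_3g is assumed, see above.
columns = ["total_ic_mou_9", "vol_3g_mb_9", "sep_vbc_3g", "aug_vbc_3g", "arpu_8", "aon"]

dropped = [c for c in columns if pattern.search(c)]
kept = [c for c in columns if not pattern.search(c)]
print(dropped)  # ['total_ic_mou_9', 'vol_3g_mb_9', 'sep_vbc_3g']
print(kept)     # ['aug_vbc_3g', 'arpu_8', 'aon']
```

Because the pattern is not anchored, `_9` would also match a name that merely contains `_9` in the middle, so it is worth checking the list of dropped columns once.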

In [None]:
## Churn Percentage
pd.DataFrame(round(telecom['churn'].value_counts(normalize=True)*100,2))

`Approximately 92% of customers have not churned and 8% have churned. The classes are clearly imbalanced; we will deal with this later.`

<b>`For each feature, we count how often each value occurs. If the most frequent value accounts for more than 95% of the rows (*count_max / len(df) * 100 > 95*), the feature is dropped: its value is almost the same for every instance, so it will not help the learning process or the prediction.`</b>
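
As a small illustration of this rule, the sketch below computes the share of the most frequent value for each column of a made-up frame and flags columns above the 95% threshold. The frame `toy` and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical frame: `flag` is 0 in 97 of 100 rows, `usage` is different in every row.
toy = pd.DataFrame({
    "flag": [0] * 97 + [1] * 3,
    "usage": list(range(100)),
})

# Share (in percent of all rows) of the single most frequent value per column.
dominant_share = {
    col: toy[col].value_counts().iloc[0] / len(toy) * 100 for col in toy.columns
}
near_constant = [col for col, share in dominant_share.items() if share > 95]
print(near_constant)  # ['flag']
```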

In [None]:
telecom.shape

In [None]:
def redundant_feature(df):
    redundant = []
    for i in df.columns:
        counts = df[i].value_counts()
        count_max = counts.iloc[0]
        if count_max / len(df) * 100 > 95:
            redundant.append(i)
    redundant = list(redundant)
    return redundant



In [None]:
print('Before dropping Redundant features: ',telecom.shape)
redundant_features = redundant_feature(telecom)
telecom = telecom.drop(redundant_features, axis=1)
print('After dropping Redundant features: ',telecom.shape)

`Function to impute NaN values with 0 where the percentage of missing values is > 40%. The reason for the 40% cutoff is that for these columns NaN can be replaced with 0 (for example, fb_user_7: NaN means Facebook was not used; av_rech_amt_data_8: NaN means no recharge was done; similarly for the other columns).`

In [None]:
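#Function to impute NaN with 0
def imputeNaN(df,col_name):
    for col in col_name:
        # assign the result back instead of calling fillna(inplace=True) on a column selection
        df[col] = df[col].fillna(0)
# columns with more than 40% missing values
null_pct = calnullpercentage(telecom)
col_40 = null_pct[null_pct['Percentage']>40].index
#call function
imputeNaN(telecom,col_40)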

In [None]:
# checking %age of null values
calnullpercentage(telecom)

`As the missing-value table above and the zero counts below show, a large share of the values in the columns with missing values is zero. Most of these NaN values could be imputed with 0 on the assumption that the customer did not use the service (local incoming, special outgoing, etc.). That assumption does not help much, however, because most values in these columns are already 0 and carry the same information. So the missing values in these columns are imputed instead.`

In [None]:
pd.DataFrame((telecom[calnullpercentage(telecom).index]==0).sum()).head(10)

In [None]:
imput_col= list(set(calnullpercentage(telecom).index)-set(('date_of_last_rech_6',\
                                                      'date_of_last_rech_7','date_of_last_rech_8')))
knn_imp =KNNImputer()
telecom[imput_col] = knn_imp.fit_transform(telecom[imput_col])

In [None]:
# checking %age of null values
calnullpercentage(telecom)

`Let's fill the remaining missing date values with 0, since no date is available for these rows. We will handle these rows later in a more efficient way.`

In [None]:
telecom.fillna(0,inplace=True)

In [None]:
# checking %age of null values
calnullpercentage(telecom)

In [None]:
telecom.shape

`Dropping the rows with missing values instead would remove 1838 rows, or about 6% of the data.`

In [None]:
# Checking missing value percentage if any
calnullpercentage(telecom) #no missing value

In [None]:
telecom.head(3)

In [None]:
# no duplicate mobile number
len(telecom['mobile_number'].unique())

In [None]:
#Dropping mobile number since it doesn't help in modelling and prediction 
telecom.drop('mobile_number',inplace=True,axis=1)

# Data Visualisation and EDA

#### Function to show values on bar plot

In [None]:
def showvalues(ax,m=None):
    for p in ax.patches:
        ax.annotate("%.1f" % p.get_height(), (p.get_x() + p.get_width() / 2., p.get_height()),\
                    ha='center', va='center', fontsize=14, color='k', rotation=0, xytext=(0, 7),\
                    textcoords='offset points',fontweight='light',alpha=0.9) 

In [None]:
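# Correlation between numerical variables
plt.figure(figsize=(20,18))
# restrict to numeric columns; non-numeric columns such as the dates cannot be correlated
ax=sns.heatmap(telecom.select_dtypes(include='number').corr())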
plt.setp(ax.get_xticklabels(), visible=False)
plt.setp(ax.get_yticklabels(), visible=False)
ax.tick_params(top=False, bottom=False, left=False, right=False, labelleft=False, labelbottom=False)

`As we can see from the graph, correlation is present between features. We will take care of correlated features later using techniques like PCA, t-SNE or any  other suitable technique for this problem`
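
Before reaching for a technique such as PCA, it can help to list which feature pairs are strongly correlated. The sketch below does this on synthetic data; the frame `toy`, its columns and the 0.8 threshold are assumptions chosen for illustration and are not taken from this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.01, size=200),  # nearly a rescaled copy of `a`
    "c": rng.normal(size=200),                       # independent noise
})

corr = toy.corr().abs()
# Keep only the part above the diagonal so that each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high_pairs = pairs[pairs > 0.8].index.tolist()
print(high_pairs)  # [('a', 'b')]
```

On the real data the same steps could be applied to the numeric columns of `telecom`, with a threshold chosen to suit the problem.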

In [None]:
# Function to plot distribution plot for months(6,7 and 8) for churn and non churn customers. 
# Also, plotting box plot and strip plot for the same.
def dist_box_plot(df,col1,col2,col3):    
    fig, axes = plt.subplots(nrows=1, ncols=4,figsize=(20, 4))
    ax = sns.distplot(df[df['churn']==1][col1], bins = 40, ax = axes[0], kde = False,\
                      color='#2980B9',hist_kws={"alpha": 1})
    ax.set_title('Churn',fontweight='bold',size=20)
    ax = sns.distplot(df[df['churn']==0][col1], bins = 40, ax = axes[1], kde = False,\
                     color='#E67E22',hist_kws={"alpha": 1})
    ax.set_title('Non-Churn',fontweight='bold',size=20)
    ax = sns.distplot(df[df['churn']==1][col2], bins = 40, ax = axes[2], kde = False,\
                     color='#2980B9',hist_kws={"alpha": 1})
    ax.set_title('Churn',fontweight='bold',size=20)
    ax = sns.distplot(df[df['churn']==0][col2], bins = 40, ax = axes[3], kde = False,\
                     color='#E67E22',hist_kws={"alpha": 1})
    ax.set_title('Non-Churn',fontweight='bold',size=20)
    fig, axes = plt.subplots(nrows=1, ncols=2,figsize=(20, 4))
    ax=sns.boxplot(x='churn', y=col1, data=df,ax=axes[0])
    ax=sns.stripplot(x='churn', y=col1, data=df, jitter=True, edgecolor="gray",ax=axes[0])
    ax.yaxis.label.set_visible(False)
    ax.set_title(col1,fontweight='bold',size=20)
    ax=sns.boxplot(x='churn', y=col2, data=df,ax=axes[1])
    ax=sns.stripplot(x='churn', y=col2, data=df, jitter=True, edgecolor="gray",ax=axes[1])
    ax.yaxis.label.set_visible(False)
    ax.set_title(col2,fontweight='bold',size=20)
    fig, axes = plt.subplots(nrows=1, ncols=3,figsize=(20, 4))
    ax = sns.distplot(df[df['churn']==1][col3], bins = 40, ax = axes[0], kde = False,\
                     color='#2980B9',hist_kws={"alpha": 1})
    ax.set_title('Churn',fontweight='bold',size=20)
    ax = sns.distplot(df[df['churn']==0][col3],bins=40, ax = axes[1], kde = False,\
                     color='#E67E22',hist_kws={"alpha": 1})
    ax.set_title('Non-Churn',fontweight='bold',size=20)
    ax=sns.boxplot(x='churn', y=col3, data=df,ax=axes[2])
    ax=sns.stripplot(x='churn', y=col3, data=df, jitter=True, edgecolor="gray",ax=axes[2])
    ax.yaxis.label.set_visible(False)
    ax.set_title(col3,fontweight='bold',size=20)
    plt.show()

## arpu (Average Revenue per user)

In [None]:
dist_box_plot(telecom,'arpu_6','arpu_7','arpu_8')

`As we can see, average revenue per user decreases for churn customers in the 8th month. There are also many outliers in revenue, as some customers may be using more data and recharging frequently.`

### onnet_mou (Minutes of usage for all kind of calls within the same operator network)

In [None]:
dist_box_plot(telecom,'onnet_mou_6','onnet_mou_7','onnet_mou_8')

`We can clearly see that minutes of usage for all kinds of calls within the same operator network decrease for churn customers. Some customers also have very high minutes of usage (outliers are present).`

### offnet_mou (Minutes of usage for all kind of calls outside the operator T network)

In [None]:
dist_box_plot(telecom,'offnet_mou_6','offnet_mou_7','offnet_mou_8')

`Similarly, offnet minutes of usage also decrease for churn customers in the 8th month. Compared to the 6th and 7th months, there are no very high minutes of usage in the 8th month, as the points in the strip (jitter) plot are condensed.`

### roam_ic_mou (Minutes of usage on roaming incoming calls)

In [None]:
dist_box_plot(telecom,'roam_ic_mou_6','roam_ic_mou_7','roam_ic_mou_8')

`Compared to other parameters, customers appear to use fewer services while roaming. The strip (jitter) plot for the 8th month (churn customers) also shows a slight decrease in mou.`

### loc_og_t2t_mou (Minutes of usage within same operator on local outgoing calls)

In [None]:
dist_box_plot(telecom,'loc_og_t2t_mou_6','loc_og_t2t_mou_7','loc_og_t2t_mou_8')

`Local outgoing calls within the same operator decreased in the 8th month for churn customers.`

### loc_og_t2f_mou  (Minutes of usage from operator T to fixed line T on local outgoing calls)

In [None]:
dist_box_plot(telecom,'loc_og_t2f_mou_6','loc_og_t2f_mou_7','loc_og_t2f_mou_8')

`Minutes of usage from operator T to fixed line T on local outgoing calls are clearly very low, which suggests customers prefer calling mobiles. There are also no outliers in the 8th month graph for churn customers, which shows a decrease in minutes of usage.`

### std_og_t2t_mou  (Minutes of usage within same operator on STD outgoing calls)

In [None]:
dist_box_plot(telecom,'std_og_t2t_mou_6','std_og_t2t_mou_7','std_og_t2t_mou_8')

`Significant decrease in minutes of usage within the same operator on STD outgoing calls for churn customers in the 8th month.`

### std_og_mou  (Minutes of usage on STD outgoing calls)

In [None]:
dist_box_plot(telecom,'std_og_mou_6','std_og_mou_7','std_og_mou_8')

`STD outgoing calls decreased in the 8th month for churn customers.`

### spl_og_mou  (Minutes of usage  on Special calls)

In [None]:
dist_box_plot(telecom,'spl_og_mou_6','spl_og_mou_7','spl_og_mou_8')

`Fewer minutes of usage on special calls for churn customers in the 8th month. Some customers have high minutes of usage.`

### loc_ic_t2t_mou  (Minutes of usage within same operator on local incoming calls)

In [None]:
dist_box_plot(telecom,'loc_ic_t2t_mou_6','loc_ic_t2t_mou_7','loc_ic_t2t_mou_8')

`Local incoming minutes of usage are reduced in the 8th month. There are huge outliers for non-churn customers.`

### loc_ic_t2f_mou  (Minutes of usage from Operator T to fixed lines of T on local incoming calls)

In [None]:
dist_box_plot(telecom,'loc_ic_t2f_mou_6','loc_ic_t2f_mou_7','loc_ic_t2f_mou_8')

`Only a few customers have fixed-line usage. Fixed-line mou also decreased for churn customers in the 8th month.`

### total_ic_mou  (Total Minutes of usage on incoming calls)

In [None]:
dist_box_plot(telecom,'total_ic_mou_6','total_ic_mou_7','total_ic_mou_8')

`Significant drop in total minutes of usage on incoming calls in the 8th month for churn customers. High outliers are present for non-churn customers.`

### total_rech_amt (Total Recharge Amount)

In [None]:
dist_box_plot(telecom,'total_rech_amt_6','total_rech_amt_7','total_rech_amt_8')

`The total recharge amount distribution increases from the 6th to the 7th month and then decreases in the 8th month for churn customers.`

### max_rech_amt (Max Recharge Amount)

In [None]:
dist_box_plot(telecom,'max_rech_amt_6','max_rech_amt_7','max_rech_amt_8')

`Max recharge amount appears to decrease in the 8th month for churn customers.`

### last_day_rch_amt (Last day Recharge Amount)

In [None]:
dist_box_plot(telecom,'last_day_rch_amt_6','last_day_rch_amt_7','last_day_rch_amt_8')

`Last day recharge amount decreased in the 8th month for churn customers.`

### max_rech_data (Max Recharge Data)

In [None]:
dist_box_plot(telecom,'max_rech_data_6','max_rech_data_7','max_rech_data_8')

`Max recharge data is reduced for churn customers in the 8th month. Many customers appear to use a lot of data, as there are huge outliers.`

### av_rech_amt_data (Average recharge amount data)

In [None]:
dist_box_plot(telecom,'av_rech_amt_data_6','av_rech_amt_data_7','av_rech_amt_data_8')

`Average recharge amount data is decreased in 8th month for churn customers.`

### vol_2g_mb (2g data volume in Mb)

In [None]:
dist_box_plot(telecom,'vol_2g_mb_6','vol_2g_mb_7','vol_2g_mb_8')

`2g data volume in Mb decreased in 8th month for churn customers.`

### vol_3g_mb (3g data volume in Mb)

In [None]:
dist_box_plot(telecom,'vol_3g_mb_6','vol_3g_mb_7','vol_3g_mb_8')

`We can see the same trend here: 3g data volume in Mb decreased in the 8th month for churn customers.`

### arpu_3g (Avg Revenue per user from 3g data)

In [None]:
dist_box_plot(telecom,'arpu_3g_6','arpu_3g_7','arpu_3g_8')

`Significant drop in Avg Revenue per user from 3g data in 8th month for churn customers`

### arpu_2g (Avg Revenue per user from 2g data)

In [None]:
dist_box_plot(telecom,'arpu_2g_6','arpu_2g_7','arpu_2g_8')

`Similarly, there is a significant drop in Avg Revenue per user for 2g data for churn customers.`

### month_vbc (Volume based cost - when no specific scheme is purchased and the customer pays as per usage)

In [None]:
dist_box_plot(telecom,'jun_vbc_3g','jul_vbc_3g','aug_vbc_3g')

`Volume based cost (when no specific scheme is purchased and the customer pays as per usage) decreased in the 8th month for churn customers.`

### tot_amt (Total amount)

In [None]:
dist_box_plot(telecom,'tot_amt_6','tot_amt_7','tot_amt_8')

`Huge decrease in total amount for churn customers. There are many outliers in total amount; the churn customers count increased from the 6th to the 8th month while the amount decreased.`

### total_rech_num (Total no of recharges)

In [None]:
dist_box_plot(telecom,'total_rech_num_6','total_rech_num_7','total_rech_num_8')

### aon(Age on network - number of days the customer is using the operator T network)

In [None]:
plt.figure(figsize=(8,6))
ax=sns.distplot(telecom['aon'],color='#FFC30F',hist_kws={"alpha": 0.7})
ax.set_title('Age on Network',fontweight='bold',size=20)
plt.show()

`For most of the customers Age on Network is around 800-900 days. There are outliers too as the maximum value is 4321.`

## Categorical columns

### Night pack user

In [None]:
plt.figure(figsize=(10,6))
color=['#1ABC9C','#E74C3C']
fig, axes = plt.subplots(nrows=1, ncols=3,figsize=(20, 8))
df_temp= pd.DataFrame(telecom.groupby(['night_pck_user_6','churn']).count()['arpu_6']).reset_index()
ax= sns.barplot(x=df_temp['night_pck_user_6'],y=df_temp['arpu_6'],hue=df_temp['churn'],palette=color,ax=axes[0])
showvalues(ax)
ax.set_title('night_pck_user_6',fontweight='bold',size=20)
df_temp= pd.DataFrame(telecom.groupby(['night_pck_user_7','churn']).count()['arpu_6']).reset_index()
ax= sns.barplot(x=df_temp['night_pck_user_7'],y=df_temp['arpu_6'],hue=df_temp['churn'],palette=color,ax=axes[1])
showvalues(ax)
ax.set_title('night_pck_user_7',fontweight='bold',size=20)
df_temp= pd.DataFrame(telecom.groupby(['night_pck_user_8','churn']).count()['arpu_6']).reset_index()
ax= sns.barplot(x=df_temp['night_pck_user_8'],y=df_temp['arpu_6'],hue=df_temp['churn'],palette=color,ax=axes[2])
showvalues(ax)
ax.set_title('night_pck_user_8',fontweight='bold',size=20)
plt.show()

`Note that the bars show customer counts per group (arpu_6 is only used as the column to count), not ARPU. The number of churn customers without the night pack increases from the 6th to the 8th month, while the number of churn customers with the night pack decreases. This means most churn customers have stopped using the night pack.`

### Fb User

In [None]:
plt.figure(figsize=(10,6))
color=['#1ABC9C','#E74C3C']
fig, axes = plt.subplots(nrows=1, ncols=3,figsize=(20, 8))
df_temp= pd.DataFrame(telecom.groupby(['fb_user_6','churn']).count()['arpu_6']).reset_index()
ax= sns.barplot(x=df_temp['fb_user_6'],y=df_temp['arpu_6'],hue=df_temp['churn'],palette=color,ax=axes[0])
showvalues(ax)
ax.set_title('fb_user_6',fontweight='bold',size=20)
df_temp= pd.DataFrame(telecom.groupby(['fb_user_7','churn']).count()['arpu_6']).reset_index()
ax= sns.barplot(x=df_temp['fb_user_7'],y=df_temp['arpu_6'],hue=df_temp['churn'],palette=color,ax=axes[1])
showvalues(ax)
ax.set_title('fb_user_7',fontweight='bold',size=20)
df_temp= pd.DataFrame(telecom.groupby(['fb_user_8','churn']).count()['arpu_6']).reset_index()
ax= sns.barplot(x=df_temp['fb_user_8'],y=df_temp['arpu_6'],hue=df_temp['churn'],palette=color,ax=axes[2])
showvalues(ax)
ax.set_title('fb_user_8',fontweight='bold',size=20)
plt.show()

`Note that the bars show customer counts per group (arpu_6 is only used as the column to count), not ARPU. There is a significant drop in the 8th month in the number of churn customers who are availing the fb_user facility, while the number of churn customers who are not availing it increases.`

In [None]:
# Function to plot average minutes of usage per month for churn and non-churn customers
def plot_Churn_Mou(colList,calltype):
    fig, ax = plt.subplots(figsize=(7,4))
    df=telecom.groupby(['churn'])[colList].mean().T
    ax.plot(range(len(colList)), df.values)
    # one tick per plotted month (the columns cover months 6, 7 and 8 only)
    ax.set_xticks(range(len(colList)))
    ax.set_xticklabels(['Jun','Jul','Aug'][:len(colList)])
    plt.legend(['Non-Churn', 'Churn'])
    plt.title("Avg. "+calltype+" MOU  V/S Month", loc='left', fontsize=12, fontweight=0, color='#F1948A')
    plt.xlabel("Month")
    plt.ylabel("Avg. "+calltype+" MOU")

In [None]:
# Function to plot the monthly average of any feature for churn and non-churn customers
def plot_Churn(data,col):
    # per month churn vs Non-Churn
    fig, ax = plt.subplots(figsize=(7,4))
    # first three matching columns, i.e. months 6, 7 and 8
    colList=list(data.filter(regex=(col)).columns)[:3]
    ax.plot(range(len(colList)), data.groupby('churn')[colList].mean().T.values)
    # one tick per plotted month
    ax.set_xticks(range(len(colList)))
    ax.set_xticklabels(['Jun','Jul','Aug'][:len(colList)])
    plt.legend(['Non-Churn', 'Churn'])
    plt.title( str(col) +" V/S Month", loc='left', fontsize=12, fontweight=0, color='#F1948A')
    plt.xlabel("Month")
    plt.ylabel(col)
    plt.show()

In [None]:
ic_mou = ['total_ic_mou_6','total_ic_mou_7','total_ic_mou_8']
og_mou = ['total_og_mou_6','total_og_mou_7','total_og_mou_8']
plot_Churn_Mou(ic_mou,'Incoming')
plot_Churn_Mou(og_mou,'Outgoing')

`Significant drop in total incoming and total outgoing minutes of usage for churn customers, whereas for non-churn customers they are increasing.`

# Data Preparation

## Feature Engineering

`Creating derived features: tot_og_to_ic_mou_6, tot_og_to_ic_mou_7, tot_og_to_ic_mou_8. These features hold the ratio (= total_og_mou / (total_ic_mou + 1)) for each month; 1 is added to the denominator to avoid division by zero. They combine both incoming and outgoing information and should be a better predictor of churn.`

In [None]:
# Creating new features which are ratio of total outgoing call with total incoming calls minutes of usage.
for i in range(6,9):
    telecom['tot_og_to_ic_mou_'+str(i)] = (telecom['total_og_mou_'+str(i)])/(telecom['total_ic_mou_'+str(i)]+1)
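
The effect of adding 1 to the denominator can be seen on a few made-up rows (hypothetical values, not taken from the dataset). Without the `+ 1`, the second row would give infinity and the third would give NaN.

```python
import pandas as pd

toy = pd.DataFrame({
    "total_og_mou_6": [120.0, 50.0, 0.0],
    "total_ic_mou_6": [60.0, 0.0, 0.0],
})

# Adding 1 to the denominator keeps the ratio finite when incoming usage is zero.
toy["tot_og_to_ic_mou_6"] = toy["total_og_mou_6"] / (toy["total_ic_mou_6"] + 1)
ratios = toy["tot_og_to_ic_mou_6"].round(3).tolist()
print(ratios)  # [1.967, 50.0, 0.0]
```

The price of this fix is a small bias: 120 / 61 is about 1.967 rather than the exact ratio 2.0, which matters little when usage values are large.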

`Creating derived features: loc_og_to_ic_mou_6, loc_og_to_ic_mou_7, loc_og_to_ic_mou_8. These features hold the ratio (= loc_og_mou / (loc_ic_mou + 1)) for each month. They combine incoming and outgoing information for local calls and should be a better predictor of churn.`

In [None]:
# Creating new features which are ratio of local outgoing call with local incoming calls minutes of usage
for i in range(6,9):
    telecom['loc_og_to_ic_mou_'+str(i)] = (telecom['loc_og_mou_'+str(i)])/(telecom['loc_ic_mou_'+str(i)]+1)

`Creating derived features: roam_og_to_ic_mou_6, roam_og_to_ic_mou_7, roam_og_to_ic_mou_8. These features hold the ratio (= roam_og_mou / (roam_ic_mou + 1)) for each month. They combine incoming and outgoing information for roaming calls and should be a better predictor of churn.`

In [None]:
# Creating new features which are ratio of roaming outgoing call with roaming incoming calls minutes of usage
for i in range(6,9):
    telecom['roam_og_to_ic_mou_'+str(i)] = (telecom['roam_og_mou_'+str(i)])/(telecom['roam_ic_mou_'+str(i)]+1)

`Creating derived features: spl_og_to_ic_mou_6, spl_og_to_ic_mou_7, spl_og_to_ic_mou_8. These features hold the ratio (= spl_og_mou / (spl_ic_mou + 1)) for each month. They combine incoming and outgoing information for special calls and should be a better predictor of churn.`

In [None]:
# Creating new features which are ratio of Special outgoing call with Special incoming calls minutes of usage
for i in range(6,9):
    telecom['spl_og_to_ic_mou_'+str(i)] = (telecom['spl_og_mou_'+str(i)])/(telecom['spl_ic_mou_'+str(i)]+1)

`Creating derived features: std_og_to_ic_mou_6, std_og_to_ic_mou_7, std_og_to_ic_mou_8. These features hold the ratio (= std_og_mou / (std_ic_mou + 1)) for each month. They combine incoming and outgoing information for STD calls and should be a better predictor of churn.`

In [None]:
# Creating new features which are ratio of std outgoing call with std incoming calls minutes of usage
for i in range(6,9):
    telecom['std_og_to_ic_mou_'+str(i)] = (telecom['std_og_mou_'+str(i)])/(telecom['std_ic_mou_'+str(i)]+1)

`Creating derived features: isd_og_to_ic_mou_6, isd_og_to_ic_mou_7, isd_og_to_ic_mou_8. These features hold the ratio (= isd_og_mou / (isd_ic_mou + 1)) for each month. They combine incoming and outgoing information for ISD calls and should be a better predictor of churn.`

In [None]:
# Creating new features which are ratio of isd outgoing call with isd incoming calls minutes of usage
for i in range(6,9):
    telecom['isd_og_to_ic_mou_'+str(i)] = (telecom['isd_og_mou_'+str(i)])/(telecom['isd_ic_mou_'+str(i)]+1)

### tot_og_to_ic_mou( Total outgoing mou to incoming mou)

In [None]:
plot_Churn(telecom,'tot_og_to_ic_mou')

`The ratio of outgoing to incoming minutes of usage drops for churn customers from June onwards, i.e. their outgoing usage falls relative to their incoming usage.`

### loc_og_to_ic_mou( Local outgoing mou to incoming mou)

In [None]:
plot_Churn(telecom,'loc_og_to_ic_mou')

`Ratio is getting dropped for churn customers`

### std_og_to_ic_mou( Std outgoing mou to incoming mou)

In [None]:
plot_Churn(telecom,'std_og_to_ic_mou')

### total_amount (Total amount)

In [None]:
plot_Churn(telecom,'tot_amt')

`Total amount spent tends to decrease for the churn customers.`

`Creating derived features: total_vol_data - Combining vol_2g_mb and vol_3g_mb. These features will combine the 2g data usage and 3g data usage and should be a better predictor of churn.`

In [None]:
# Creating new features combining 2g and 3g
for i in range(6,9):
    telecom['total_vol_'+str(i)] = (telecom['vol_2g_mb_'+str(i)])+(telecom['vol_3g_mb_'+str(i)])

In [None]:
plot_Churn(telecom,'total_vol')

`Total data volume for churn customers decreases as we go from Jun to Aug.`

`Creating derived features: total_arpu - Combining arpu_2g and arpu_3g. These features will combine the average revenue per user from 2g and 3g data and should be a better predictor of churn.`

In [None]:
# Creating new features combining average revenue per user from 2g and 3g
for i in range(6,9):
    telecom['total_arpu_'+str(i)] = (telecom['arpu_2g_'+str(i)])+(telecom['arpu_3g_'+str(i)])

In [None]:
date_cols=['date_of_last_rech_data_6','date_of_last_rech_data_7','date_of_last_rech_data_8','date_of_last_rech_6','date_of_last_rech_7',\
         'date_of_last_rech_8']
for col in date_cols:
    telecom[col] = pd.to_datetime(telecom[col],format='%m/%d/%Y',errors='coerce')

In [None]:
telecom['date_of_last_rech_dow_6'] = telecom['date_of_last_rech_6'].dt.dayofweek.astype(str)
telecom['date_of_last_rech_dow_7'] = telecom['date_of_last_rech_7'].dt.dayofweek.astype(str)
telecom['date_of_last_rech_dow_8'] = telecom['date_of_last_rech_8'].dt.dayofweek.astype(str)
telecom['date_of_last_rech_data_dow_6'] = telecom['date_of_last_rech_data_6'].dt.dayofweek.fillna(7).astype(int).astype(str)
telecom['date_of_last_rech_data_dow_7'] = telecom['date_of_last_rech_data_7'].dt.dayofweek.fillna(7).astype(int).astype(str)
telecom['date_of_last_rech_data_dow_8'] = telecom['date_of_last_rech_data_8'].dt.dayofweek.fillna(7).astype(int).astype(str)

In [None]:
# recent recharge date 
cols = ['date_of_last_rech_6','date_of_last_rech_7','date_of_last_rech_8']
telecom['last_rech_date'] = telecom[cols].max(axis=1)
# number of days from the most recent recharge date to the last day of August (8th month)
telecom['days_since_last_rech'] = (pd.to_datetime('2014-08-31', format='%Y-%m-%d')
                                   - telecom['last_rech_date']).dt.days
# subtract it from 3 to add higher weightage for values present in all the columns. 
# len(cols) = 3,  means present in all columns, 0 means not present in any column
telecom['rech_weightage'] = len(cols) - (telecom[cols].isnull().sum(axis=1))
telecom.drop(['last_rech_date','date_of_last_rech_6','date_of_last_rech_7','date_of_last_rech_8'], axis=1, inplace=True)

# recent recharge date 
cols = ['date_of_last_rech_data_6','date_of_last_rech_data_7','date_of_last_rech_data_8']
telecom['last_rech_data_date'] = telecom[cols].max(axis=1)
# number of days from the most recent data recharge date to the last day of August (8th month)
telecom['days_since_last_data_rech'] = (pd.to_datetime('2014-08-31', format='%Y-%m-%d')
                                        - telecom['last_rech_data_date']).dt.days
telecom['days_since_last_data_rech'] = telecom['days_since_last_data_rech'].fillna(0)
# subtract it from 3 to add higher weightage for values present in all the columns. 
# len(cols) = 3,  means present in all columns, 0 means not present in any column
telecom['rech_data_weightage'] = len(cols) - (telecom[cols].isnull().sum(axis=1))
telecom.drop(['last_rech_data_date','date_of_last_rech_data_6','date_of_last_rech_data_7','date_of_last_rech_data_8'], axis=1, inplace=True)
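
To make the two recharge features concrete, here is a small worked example on made-up dates. It is a sketch that assumes the same `%m/%d/%Y` date format and the same 31 August 2014 reference date as the cell above; the toy values are not taken from the dataset.

```python
import pandas as pd

# Toy recharge dates (made up); None marks a month with no recharge
toy = pd.DataFrame({
    "date_of_last_rech_6": ["6/28/2014", "6/15/2014", None],
    "date_of_last_rech_7": ["7/30/2014", None, None],
    "date_of_last_rech_8": ["8/25/2014", None, None],
})
cols = list(toy.columns)
for c in cols:
    toy[c] = pd.to_datetime(toy[c], format="%m/%d/%Y", errors="coerce")

# Most recent recharge across the three months (NaT if the customer never recharged)
last_rech = toy[cols].max(axis=1)
# Whole days between that recharge and the end of August
toy["days_since_last_rech"] = (pd.Timestamp("2014-08-31") - last_rech).dt.days
# Number of months (0 to 3) in which a recharge date is present
toy["rech_weightage"] = len(cols) - toy[cols].isnull().sum(axis=1)
print(toy[["days_since_last_rech", "rech_weightage"]])
```

The third customer never recharged, so `days_since_last_rech` stays missing and `rech_weightage` is 0; how to fill that missing value is a modelling choice.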

In [None]:
catg =['night_pck_user_6','monthly_2g_6','sachet_2g_6','monthly_3g_6','sachet_3g_6','fb_user_6',\
       'night_pck_user_7','monthly_2g_7','sachet_2g_7','monthly_3g_7','sachet_3g_7','fb_user_7',\
       'date_of_last_rech_dow_6','date_of_last_rech_dow_7','date_of_last_rech_data_dow_6','date_of_last_rech_data_dow_7',\
       'date_of_last_rech_dow_8','date_of_last_rech_data_dow_8','night_pck_user_8','monthly_2g_8','sachet_2g_8',\
       'monthly_3g_8','sachet_3g_8','fb_user_8']
catg1 =['night_pck_user_6','fb_user_6','night_pck_user_7','fb_user_7',\
       'date_of_last_rech_dow_6','date_of_last_rech_dow_7','date_of_last_rech_data_dow_6','date_of_last_rech_data_dow_7',\
       'date_of_last_rech_dow_8','date_of_last_rech_data_dow_8','night_pck_user_8','monthly_2g_8','sachet_2g_8',\
       'monthly_3g_8','sachet_3g_8','fb_user_8']
num_col = list(set(telecom.columns).difference(set(catg)))

#### Creating new features: average of each column over the 6th and 7th months

In [None]:
# base column names: strip the trailing '_6' / '_7' suffix (last two characters) from each column name
col_list = telecom.select_dtypes(include=['float64',\
                                          'int64']).filter(regex='_6|_7').drop(catg[:12],axis=1).drop(['avg_amt_6_7','og_others_6'],axis=1).columns.str[:-2]
for idx, col in enumerate(col_list.unique()):
    col_name = 'avg67_'+col # lets create the column name dynamically
    col6 = col+'_6'
    col7 = col+'_7'
    telecom[col_name] = round((telecom[col6]  + telecom[col7])/ 2,2)

In [None]:
## Difference between each 8th-month column and the 6th/7th-month average calculated above
# Columns are paired by base name rather than by position, so column order cannot misalign them
col_list1 = telecom.select_dtypes(include=['float64', 'int64']).filter(regex='avg67_').columns
for col1 in col_list1:
    base = col1[len('avg67_'):]
    col2 = base + '_8'
    if col2 in telecom.columns:
        telecom[base + '_avgdiff8'] = telecom[col2] - telecom[col1]
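
As a quick illustration of the two derived families (`avg67_*` and `*_avgdiff8`), the sketch below computes them for two base features on made-up numbers. The values are invented; only the naming pattern follows the notebook.

```python
import pandas as pd

# Toy monthly values (made up) for two base features
toy = pd.DataFrame({
    "arpu_6": [100.0, 50.0], "arpu_7": [120.0, 70.0], "arpu_8": [60.0, 65.0],
    "total_ic_mou_6": [30.0, 10.0], "total_ic_mou_7": [50.0, 10.0], "total_ic_mou_8": [5.0, 12.0],
})

for base in ["arpu", "total_ic_mou"]:
    # Average of months 6 and 7
    toy[f"avg67_{base}"] = ((toy[f"{base}_6"] + toy[f"{base}_7"]) / 2).round(2)
    # Month 8 relative to that average: negative means usage or revenue fell in month 8
    toy[f"{base}_avgdiff8"] = toy[f"{base}_8"] - toy[f"avg67_{base}"]

print(toy[["avg67_arpu", "arpu_avgdiff8", "avg67_total_ic_mou", "total_ic_mou_avgdiff8"]])
```

The first toy customer shows a sharp drop in month 8 on both features, which is the kind of pattern these difference columns are meant to expose.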
    

In [None]:
#Dropping those columns since we already created features from it(dropping because of multicollinearity)
col_list = telecom.select_dtypes(include=['float64',\
                'int64']).filter(regex='_6|_7').drop(catg[:12],axis=1).drop(['avg_amt_6_7','og_others_6'],axis=1).columns
telecom.drop(col_list,axis=1,inplace=True)
num_col = list(set(telecom.columns).difference(set(catg)))

#### Creating dummy variables for categorical variables

In [None]:
dummy_df=pd.get_dummies(telecom[catg],drop_first=True)
telecom=pd.concat([telecom,dummy_df],axis=1)
telecom= telecom.drop(catg,axis=1)

##  Handling Skewness

In [None]:
# Lets find out if numerical predictor variables are largely skewed or not
telecom_numerical=telecom[num_col]
skew_features = telecom_numerical.apply(lambda x: skew(x)).sort_values(ascending=False)
high_skew = skew_features[skew_features > 0.5]
skew_index = high_skew.index
skewness = pd.DataFrame({'Skew' :high_skew})
pd.DataFrame(skew_features,columns=['Skewness']).head(10)

`As we can see from skewness values, predictor variables are highly skewed. We need to take care of skewness.`

In [None]:
#Removing Skewness
num_col.remove('churn')
qntle_trnsfrm=QuantileTransformer()
telecom[num_col]= qntle_trnsfrm.fit_transform(telecom[num_col])
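
To see what the quantile transform does to a skewed column, here is a small sketch on synthetic data. The exponential sample is an assumption standing in for a usage column, and `n_quantiles=100` is chosen only to stay below the toy row count.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(42)
# Synthetic right-skewed feature standing in for a usage column (made-up data)
toy = pd.DataFrame({"usage": rng.exponential(scale=100.0, size=1000)})
skew_before = toy["usage"].skew()

# Default output distribution is uniform on [0, 1]
qt = QuantileTransformer(n_quantiles=100, random_state=42)
toy["usage_qt"] = qt.fit_transform(toy[["usage"]]).ravel()
skew_after = toy["usage_qt"].skew()

print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

Because the transform maps values to their quantiles, the result is close to symmetric whatever the original shape, at the cost of discarding the original distances between values.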

## Handling Imbalanced Dataset

`Several resampling techniques are available for imbalanced classes, for example SMOTE and ADASYN (Adaptive Synthetic sampling). SMOTE creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours.
ADASYN works in a similar way but generates more synthetic samples around minority points that have many majority-class neighbours, i.e. the points that are harder to learn. This notebook uses SMOTE, applied to the training split only.`

In [None]:
X=telecom.drop('churn',axis=1) #Independent/predictor variable
y=telecom['churn'] #output variables
randm_state= 42 # fixing random state to use same state wherever applicable

In [None]:
# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=randm_state)
print(X_train.shape)
print(X_test.shape)

`Oversampling the minority class of the training set with SMOTE.`

In [None]:
smt = SMOTE(random_state=randm_state)
X_train,y_train = smt.fit_resample(X_train,y_train)
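
For intuition about what SMOTE generates, the sketch below implements only its core interpolation step in NumPy. It is a simplified illustration, not the imbalanced-learn implementation: the function name `smote_like_oversample`, the brute-force neighbour search and the toy data are all assumptions.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Simplified SMOTE-style interpolation between minority points (illustrative only)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    synthetic = np.empty((n_new, X_min.shape[1]))
    for row in range(n_new):
        i = rng.integers(len(X_min))   # random minority sample
        j = rng.choice(neighbours[i])  # one of its k nearest minority neighbours
        gap = rng.random()             # position along the segment, in [0, 1)
        synthetic[row] = X_min[i] + gap * (X_min[j] - X_min[i])
    return synthetic

X_minority = np.random.default_rng(1).normal(size=(20, 2))  # made-up minority-class points
X_new = smote_like_oversample(X_minority, n_new=30, k=5)
print(X_new.shape)
```

Every synthetic point lies on a line segment between two existing minority points, so the new samples stay inside the region the minority class already occupies.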


### Scaling

In [None]:
# scaling using StandardScaler scaler
scaler=StandardScaler()
X_train=scaler.fit_transform(X_train)
X_test=scaler.transform(X_test)
X=scaler.fit_transform(X)

### Applying PCA

In [None]:
# We are going with PCA 0.9 variance or 90% information
pca=PCA(0.9)
df_train=pca.fit_transform(X_train)
df_train.shape

In [None]:
# Lets see correlation between these new PCA  features
plt.figure(figsize=(10,8))
sns.heatmap(pd.DataFrame(df_train).corr())

`All principal components are uncorrelated with each other`
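
Both points above (a float `n_components` keeps enough components for that share of variance, and the resulting components are uncorrelated) can be checked on synthetic data. The data-generating setup below is made up purely for the check.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Made-up data: 10 observed features driven by 3 latent factors plus noise
latent = rng.normal(size=(500, 3))
X_toy = latent @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(500, 10))

pca_toy = PCA(0.9)  # keep the fewest components explaining at least 90% of the variance
Z = pca_toy.fit_transform(X_toy)
cum_var = np.cumsum(pca_toy.explained_variance_ratio_)
print(Z.shape[1], "components, cumulative variance:", cum_var.round(3))

# Off-diagonal correlations between component scores should be numerically zero
corr = np.atleast_2d(np.corrcoef(Z, rowvar=False))
off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
max_off_diag = float(np.abs(off_diag).max()) if off_diag.size else 0.0
print("largest off-diagonal correlation:", max_off_diag)
```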

In [None]:
# initializing PCA with n_components=70
pca=PCA(n_components=70,random_state=randm_state,svd_solver='randomized')
X_train_pca=pca.fit_transform(X_train)
X_test_pca=pca.transform(X_test)

In [None]:
#transforming X also using PCA, since we will use it during cross validation
pca=PCA(n_components=70,random_state=42,svd_solver='randomized')
X_pca=pca.fit_transform(X)

In [None]:
# Function to compute evaluation metrics from actual and predicted labels
def model_metrics(actual, predicted):
    confusion = confusion_matrix(actual, predicted)
    TP=confusion[1,1] #True positives
    TN= confusion[0,0] #True Negatives
    FP=confusion[0,1] #False Positives
    FN= confusion[1,0] #False Negatives
    acc_score = round(accuracy_score(actual, predicted),2) #accuracy score
    rcl_score=round(recall_score(actual, predicted),2) #recall score
    roc_score = round(roc_auc_score(actual, predicted),2) # roc_auc score (computed from hard labels, not probabilities)
    fpr = round(FP/float(TN+FP),2) #False Positive Rate
    specificity = round(TN/float(TN+FP),2) #Specificity (True Negative Rate)
    metrics_df = pd.DataFrame(data=[[acc_score,roc_score,fpr,\
                                     specificity,rcl_score,TP,\
                                     TN,FP,FN,]],columns=['accuracy','roc_auc','fpr','specificity','recall_score',\
                                                          'true_positive','true_negative',\
                                                          'false_positive','false_negative'],index=['score'])
    return metrics_df
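
A worked example of these metrics on ten made-up labels may help when reading the tables that follow. It also shows one property worth keeping in mind: when ROC AUC is computed from hard 0/1 predictions, as in the function above, it equals the mean of recall and specificity.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Made-up labels: 4 churners (1) and 6 non-churners (0)
actual    = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
predicted = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# scikit-learn's layout for binary labels is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()
recall = tp / (tp + fn)
specificity = tn / (tn + fp)
fpr = fp / (tn + fp)

# With hard 0/1 predictions, ROC AUC reduces to the mean of recall and specificity
auc_from_labels = roc_auc_score(actual, predicted)
print(recall, specificity, fpr, auc_from_labels)
```

Passing predicted probabilities instead of labels to `roc_auc_score` would give a threshold-independent AUC, which is usually the more informative number.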

# Data Modelling 

In [None]:
# Function to predict class labels based on model predicted probabilty and cutoff/threshold for assigning labels
def predictChurnlabeloncutoff(model,X,y,threshold=0.5):
    pred_probs = model.predict_proba(X)[:,1]
    pred_df= pd.DataFrame({'churn':y, 'churn_Prob':pred_probs})
    # Creating new column 'predicted' with 1 if Churn_Prob>threshold else 0
    pred_df['predicted'] = pred_df.churn_Prob.map( lambda x: 1 if x > threshold else 0)
    return pred_df

In [None]:
def optimal_cutoff(df):
    # Candidate probability cutoffs: 0.0, 0.1, ..., 0.9
    cutoffs = [float(x)/10 for x in range(10)]
    # One column of predicted labels per cutoff
    for i in cutoffs:
        df[i] = df.churn_Prob.map(lambda x: 1 if x > i else 0)
    # Accuracy, recall (sensitivity) and specificity for each cutoff
    cutoff_df = pd.DataFrame(columns=['prob', 'accuracy', 'recall', 'specificity'])
    for i in cutoffs:
        # Confusion matrix layout: [[TN, FP], [FN, TP]]
        cm1 = confusion_matrix(df.churn, df[i])
        total1 = cm1.sum()
        accuracy = (cm1[0,0] + cm1[1,1]) / total1
        speci = cm1[0,0] / (cm1[0,0] + cm1[0,1])
        sensi = cm1[1,1] / (cm1[1,0] + cm1[1,1])
        cutoff_df.loc[i] = [i, accuracy, sensi, speci]
    cutoff_df.plot.line(x='prob', y=['accuracy', 'recall', 'specificity'], figsize=(8,6))
    plt.show()
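
The cutoff sweep is easier to follow on a tiny example. The sketch below returns the table instead of plotting it; `cutoff_table` is a hypothetical helper and the six labels and probabilities are made up.

```python
import numpy as np
import pandas as pd

def cutoff_table(y_true, y_prob, cutoffs=None):
    """Accuracy, recall and specificity at each probability cutoff (hypothetical helper)."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    if cutoffs is None:
        cutoffs = [x / 10 for x in range(10)]  # 0.0, 0.1, ..., 0.9
    rows = []
    for c in cutoffs:
        pred = (y_prob > c).astype(int)
        tp = int(((pred == 1) & (y_true == 1)).sum())
        tn = int(((pred == 0) & (y_true == 0)).sum())
        fp = int(((pred == 1) & (y_true == 0)).sum())
        fn = int(((pred == 0) & (y_true == 1)).sum())
        rows.append({
            "prob": c,
            "accuracy": (tp + tn) / len(y_true),
            "recall": tp / (tp + fn),
            "specificity": tn / (tn + fp),
        })
    return pd.DataFrame(rows)

y_true = [0, 0, 0, 0, 1, 1]
y_prob = [0.05, 0.15, 0.35, 0.55, 0.45, 0.85]
table = cutoff_table(y_true, y_prob)
print(table)
```

In this toy case a cutoff of 0.4 keeps recall at 1.0 with specificity 0.75, while 0.5 halves recall for no gain in specificity, which is the kind of trade-off the plot is used to read off.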

### Note: We want to predict customers who are likely to churn (the positive class), so we want to maximise recall (sensitivity). At the same time, accuracy and ROC AUC should not drop much while recall is maximised.

## Baseline Models (default parameters, without hyperparameter tuning)

# Naive Bayes Classifier

In [None]:
gnb=GaussianNB()
gnb.fit(X_train_pca,y_train)
y_test_pred=gnb.predict(X_test_pca)
print('Classification Report:\n')
print(classification_report(y_test,y_test_pred))
print('Confusion Matrix:\n')
print(confusion_matrix(y_test,y_test_pred))
model_metrics(y_test,y_test_pred)

# Logistic Regression (Default parameters)

In [None]:
log_reg= LogisticRegression(random_state=randm_state,class_weight='balanced')
log_reg.fit(X_train_pca,y_train) #fitting model
y_test_pred=log_reg.predict(X_test_pca) #model prediction
print(classification_report(y_test,y_test_pred))
model_metrics(y_test,y_test_pred)

# Logistic Regression (Tuned Hyperparameters)

In [None]:
fold=StratifiedKFold(random_state=randm_state,shuffle=True,n_splits=5) #stratified kfold for cross validation

In [None]:
params={'penalty':['l1'],'C':list(np.power(10.0, np.arange(-2, 3))),'solver':('saga','liblinear'),\
                                          'class_weight':['balanced']}
#we are using scoring metrics Recall
log_regcv=GridSearchCV(LogisticRegression(random_state=randm_state,max_iter=1000),param_grid=params,cv=fold,scoring='recall',\
                       verbose=1,return_train_score=True)
log_regcv.fit(X_train_pca,y_train)

In [None]:
print(log_regcv.best_params_)
print(log_regcv.best_score_)
print(log_regcv.best_estimator_)
# test set prediction using tuned model
log_regcv=log_regcv.best_estimator_


In [None]:
log_regcv.fit(X_train_pca,y_train) #fitting best models
df_cutoff=predictChurnlabeloncutoff(log_regcv,X_test_pca,y_test)
optimal_cutoff(df_cutoff)

In [None]:
threshold=0.25 # from the above graph(recall, accuracy and specificity have good score)
y_test_pred= predictChurnlabeloncutoff(log_regcv,X_test_pca,y_test,threshold).predicted
print('Classification Report:\n')
print(classification_report(y_test,y_test_pred))
print('Confusion Matrix:\n')
print(confusion_matrix(y_test,y_test_pred))
model_metrics(y_test,y_test_pred)

# KNN (Using Default Parameters)

In [None]:
knn=KNeighborsClassifier()
knn.fit(X_train_pca,y_train)
y_test_pred = knn.predict(X_test_pca)
print('Classification Report:\n')
print(classification_report(y_test,y_test_pred))
print('Confusion Matrix:\n')
print(confusion_matrix(y_test,y_test_pred))
model_metrics(y_test,y_test_pred)

# KNN (Tuned Hyperparameters)

In [None]:
fold=StratifiedKFold(random_state=randm_state,shuffle=True,n_splits=3) #stratified kfold for cross validation
#params={'n_neighbors': list(range(1,6)), 'p':[1,2],'weights':['uniform', 'distance']}
# Setting params to the tuned values, since the full grid (30 fits) takes around 3 hours with KNN
params={'n_neighbors':[3], 'p':[2]} # best values found by running GridSearchCV on the parameter grid above
#we are using scoring metrics Recall
knn_cv=GridSearchCV(KNeighborsClassifier(),param_grid=params,cv=fold,scoring='recall',\
                       verbose=1,return_train_score=True)
knn_cv.fit(X_train_pca,y_train)

In [None]:
print(knn_cv.best_params_)
print(knn_cv.best_score_)
print(knn_cv.best_estimator_)
# test set prediction using tuned model
knn_cv=knn_cv.best_estimator_ # best parameters value for knn

In [None]:
knn_cv.fit(X_train_pca,y_train) #fitting best models
df_cutoff=predictChurnlabeloncutoff(knn_cv,X_test_pca,y_test)
optimal_cutoff(df_cutoff)


In [None]:
threshold=0.45 #from graph, at 0.45 we can have good score for recall(objective to maximize recall) with decent score for accuracy and specificity.
y_test_pred= predictChurnlabeloncutoff(knn_cv,X_test_pca,y_test,threshold).predicted
print('Classification Report:\n')
print(classification_report(y_test,y_test_pred))
print('Confusion Matrix:\n')
print(confusion_matrix(y_test,y_test_pred))
model_metrics(y_test,y_test_pred)

# Random Forest(Default parameters)

In [None]:
rfc=RandomForestClassifier(random_state=randm_state)
rfc.fit(X_train_pca,y_train)
y_test_pred= rfc.predict(X_test_pca)
model_metrics(y_test,y_test_pred)

# Random Forest(Tuned Hyperparameters)

`Tuning all hyperparameters at once would take a long time, because the number of possible combinations for GridSearchCV becomes very large. Hence, only one or two hyperparameters are tuned at a time.`
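
The staged approach can be written as a loop in which each stage searches one hyperparameter and carries the winner forward. This is a sketch on synthetic data with deliberately tiny grids; the data, grid values and variable names are assumptions, not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced data standing in for the PCA-transformed training set
X_toy, y_toy = make_classification(n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=42)
cv_toy = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# Each stage searches one hyperparameter while earlier choices stay fixed (toy grids)
stages = [
    {"n_estimators": [10, 20]},
    {"max_features": [2, 4]},
    {"min_samples_leaf": [5, 20]},
]
chosen = {}
for grid in stages:
    search = GridSearchCV(
        RandomForestClassifier(random_state=42, class_weight="balanced", **chosen),
        param_grid=grid, cv=cv_toy, scoring="recall",
    )
    search.fit(X_toy, y_toy)
    chosen.update(search.best_params_)  # carry the winner into the next stage

print(chosen)
```

A staged search fits far fewer models than the full grid, but it cannot find combinations that only work well together. The cells below also pick some values by reading the train/test curves rather than taking `best_params_` directly.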

### Tuning n_estimators and criterion

In [None]:
fold=StratifiedKFold(random_state=randm_state,shuffle=True,n_splits=3)
params={'n_estimators': range(50,150,30)}
#we are using scoring metrics Recall
rfc_cv=GridSearchCV(RandomForestClassifier(random_state=randm_state,class_weight='balanced'),param_grid=params,cv=fold,scoring='recall',\
                       verbose=1,return_train_score=True)
rfc_cv.fit(X_train_pca,y_train)

In [None]:
cv_results = pd.DataFrame(rfc_cv.cv_results_)
plt.plot(cv_results['param_n_estimators'],cv_results['mean_test_score'],label='test')
plt.plot(cv_results['param_n_estimators'],cv_results['mean_train_score'],label='train')
plt.legend()


`Taking 80 as n_estimators.`

In [None]:
params={'criterion': ["gini", "entropy"]}
#we are using scoring metrics Recall
rfc_cv=GridSearchCV(RandomForestClassifier(random_state=randm_state,class_weight='balanced',n_estimators=80),param_grid=params,cv=fold,scoring='recall',\
                       verbose=1,return_train_score=True)
rfc_cv.fit(X_train_pca,y_train)

In [None]:
rfc_cv.best_params_

### Tuning max_features

In [None]:
fold=StratifiedKFold(random_state=randm_state,shuffle=True,n_splits=3)
params={'max_features': [5, 10, 15, 20, 25]}
#we are using scoring metrics Recall
rfc_cv=GridSearchCV(RandomForestClassifier(random_state=randm_state,class_weight='balanced',\
                                           n_estimators=80,criterion='entropy'),param_grid=params,cv=fold,scoring='recall',\
                       verbose=1,return_train_score=True)
rfc_cv.fit(X_train_pca,y_train)

In [None]:
cv_results = pd.DataFrame(rfc_cv.cv_results_)
plt.plot(cv_results['param_max_features'],cv_results['mean_test_score'],label='test')
plt.plot(cv_results['param_max_features'],cv_results['mean_train_score'],label='train')
plt.legend()


`Taking max_features = 15, based on the plot above.`

### Tuning minimum samples per leaf

In [None]:
params={'min_samples_leaf': range(50, 200, 50)}
#we are using scoring metrics Recall
rfc_cv=GridSearchCV(RandomForestClassifier(random_state=randm_state,class_weight='balanced',\
                                           n_estimators=80,criterion='entropy',max_features=15),param_grid=params,cv=fold,scoring='recall',\
                       verbose=1,return_train_score=True)
rfc_cv.fit(X_train_pca,y_train)

In [None]:
print(rfc_cv.best_params_)
print(rfc_cv.best_score_)

### Tuning minimum samples per split

In [None]:
params={'min_samples_split': range(50, 200, 50)}
#we are using scoring metrics Recall
rfc_cv=GridSearchCV(RandomForestClassifier(random_state=randm_state,class_weight='balanced',\
                                           n_estimators=80,criterion='entropy',max_features=15,min_samples_leaf=50,\
                                          ),param_grid=params,cv=fold,scoring='recall',\
                       verbose=1,return_train_score=True)
rfc_cv.fit(X_train_pca,y_train)

In [None]:
cv_results = pd.DataFrame(rfc_cv.cv_results_)
plt.plot(cv_results['param_min_samples_split'],cv_results['mean_test_score'],label='test')
plt.plot(cv_results['param_min_samples_split'],cv_results['mean_train_score'],label='train')
plt.legend()


`Going with min_samples_split = 100 to limit overfitting.`

# Tuned Random Forest 

In [None]:
rfc=RandomForestClassifier(random_state=randm_state,class_weight='balanced',\
                                           n_estimators=80,criterion='entropy',max_features=15,min_samples_leaf=50,\
                                          min_samples_split=100)
rfc.fit(X_train_pca,y_train)

In [None]:
#rfc.fit(X_train_pca,y_train) #fitting best models
df_cutoff=predictChurnlabeloncutoff(rfc,X_train_pca,y_train)
optimal_cutoff(df_cutoff)


In [None]:
# Recall matters most here, so going with 0.34: in the plot above, recall at 0.34 is slightly higher than accuracy and specificity
threshold=0.34 # from the graph, 0.34 gives good recall, accuracy and specificity
y_test_pred= predictChurnlabeloncutoff(rfc,X_test_pca,y_test,threshold).predicted
print('Classification Report:\n')
print(classification_report(y_test,y_test_pred))
print('Confusion Matrix:\n')
print(confusion_matrix(y_test,y_test_pred))
model_metrics(y_test,y_test_pred)

# SVM(default parameters)

In [None]:
svc= SVC(random_state=randm_state,class_weight='balanced')
svc.fit(X_train_pca,y_train) #fitting model
y_test_pred=svc.predict(X_test_pca) #model prediction
print(classification_report(y_test,y_test_pred))
model_metrics(y_test,y_test_pred)

### SVM(Tuned Hyperparameters)

In [None]:
fold=StratifiedKFold(random_state=randm_state,shuffle=True,n_splits=3)
params={'C':list(np.power(10.0, np.arange(-1, 2)))}
#we are using scoring metrics Recall
svc_cv=GridSearchCV(SVC(random_state=randm_state,class_weight='balanced'),param_grid=params,cv=fold,scoring='recall',\
                       verbose=1,return_train_score=True)
svc_cv.fit(X_train_pca,y_train)

In [None]:
print(svc_cv.best_params_)
print(svc_cv.best_score_)

In [None]:
params={'kernel' :['poly', 'rbf']}
#we are using scoring metrics Recall
svc_cv=GridSearchCV(SVC(random_state=randm_state,class_weight='balanced',C=10),param_grid=params,cv=fold,scoring='recall',\
                       verbose=1,return_train_score=True)
svc_cv.fit(X_train_pca,y_train)

# Tuned SVC

In [None]:
svc= SVC(random_state=randm_state,class_weight='balanced',C=0.1,kernel='rbf',probability=True)
svc.fit(X_train_pca,y_train) #fitting model
y_train_pred=svc.predict_proba(X_train_pca) #model prediction

In [None]:
# We are going slightly towards recall score, reason mentioned at starting of data modelling.
threshold=0.4 #from graph, at 0.4 we can have good score for recall, accuracy and specificity.
y_test_pred= predictChurnlabeloncutoff(svc,X_test_pca,y_test,threshold).predicted
print('Classification Report:\n')
print(classification_report(y_test,y_test_pred))
print('Confusion Matrix:\n')
print(confusion_matrix(y_test,y_test_pred))
model_metrics(y_test,y_test_pred)

# Gradient Boosting

In [None]:
gbc=GradientBoostingClassifier()
gbc.fit(X_train_pca,y_train)
y_train_pred = gbc.predict_proba(X_train_pca)

In [None]:
df_cutoff=predictChurnlabeloncutoff(gbc,X_train_pca,y_train)
optimal_cutoff(df_cutoff)


In [None]:
# We are going slightly towards recall score, reason mentioned at starting of data modelling.
threshold=0.3 #from graph, at 0.3 we can have good score for recall, accuracy and specificity.
y_test_pred= predictChurnlabeloncutoff(gbc,X_test_pca,y_test,threshold).predicted
print('Classification Report:\n')
print(classification_report(y_test,y_test_pred))
print('Confusion Matrix:\n')
print(confusion_matrix(y_test,y_test_pred))
model_metrics(y_test,y_test_pred)

# XGBoost Model (Tuned)

In [None]:
xgb=XGBClassifier(random_state=randm_state ,max_depth = 3 , learning_rate=0.01,\
                  n_estimators=100,objective='binary:logistic')
xgb.fit(X_train_pca,y_train)
y_train_pred = xgb.predict_proba(X_train_pca)


In [None]:
df_cutoff=predictChurnlabeloncutoff(xgb,X_train_pca,y_train) #optimal cutoff
optimal_cutoff(df_cutoff)


In [None]:
# We are going slightly towards recall score, reason mentioned at starting of data modelling.
threshold=0.42 #from graph, at 0.42 we can have good score for recall, accuracy and specificity
y_test_pred= predictChurnlabeloncutoff(xgb,X_test_pca,y_test,threshold).predicted
print('Classification Report:\n')
print(classification_report(y_test,y_test_pred))
print('Confusion Matrix:\n')
print(confusion_matrix(y_test,y_test_pred))
model_metrics(y_test,y_test_pred)

`Conclusion of different tuned models:`
- Naive Bayes baseline model with PCA and SMOTE
    - Recall : 40%
- Logistic Regression PCA and SMOTE
    - Recall : 83%
    - Accuracy : 82%
- KNN with PCA and SMOTE   
    - Recall : 86%
    - Accuracy : 71%
- Random Forest with PCA and SMOTE   
    - Recall : 81%
    - Accuracy : 81%
- SVC with PCA and SMOTE   
    - Recall : 68%
- Gradient Boosting with PCA and SMOTE   
    - Recall : 82%
    - Accuracy : 82%
- XGBoost with PCA and SMOTE   
    - Recall : 88%
    - Accuracy : 73%

`The XGBoost and K-nearest-neighbours classifiers have the highest recall (88% and 86% respectively). However, PCA transforms the original attributes into principal components that are orthogonal and uncorrelated with each other, so the features of these models cannot be interpreted directly by the business. An interpretable model is implemented later in this notebook.`

`The next models are built with the main objective of identifying important predictor attributes that help the business understand the indicators of churn. Logistic regression and a random forest classifier are good choices for identifying important variables. For logistic regression, multicollinearity must be handled first.`

# Interpretable model for Business understanding: Logistic Regression

In [None]:
X=telecom.drop('churn',axis=1) # Feature variables for new interpretable models
y=telecom['churn'] 
print(X.shape)

In [None]:
df_train, df_test = train_test_split(telecom, train_size = 0.7, test_size = 0.3, random_state = randm_state)
print(df_train.shape)
print(df_test.shape)

In [None]:
y_train = df_train['churn']
X_train = df_train.drop('churn',axis=1)
y_test = df_test['churn']
X_test = df_test.drop('churn',axis=1)


In [None]:
# Scaling
scaler = StandardScaler()
colums = list(X_train.columns)
X_train[colums] = scaler.fit_transform(X_train[colums])
X_test[colums] = scaler.transform(X_test[colums])

In [None]:
log_reg=LogisticRegression(C=0.01, class_weight='balanced', max_iter=1000, penalty='l1',
                   random_state=42, solver='liblinear')
# Running RFE with the number of selected variables equal to 20
rfe = RFE(log_reg, n_features_to_select=20)             # running RFE
rfe = rfe.fit(X_train, y_train)
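
For reference, here is a small self-contained RFE run on synthetic data, showing how `support_` and `ranking_` relate. The data and the choice of four features are made up for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data (made up): 10 features, of which 3 are informative
X_toy, y_toy = make_classification(n_samples=400, n_features=10, n_informative=3,
                                   n_redundant=0, random_state=42)

estimator = LogisticRegression(C=0.1, penalty="l1", solver="liblinear",
                               class_weight="balanced", random_state=42)
# n_features_to_select is passed by keyword; newer scikit-learn releases require this
selector = RFE(estimator, n_features_to_select=4).fit(X_toy, y_toy)

print("kept feature indices:", [i for i, keep in enumerate(selector.support_) if keep])
print("ranking (1 = kept):", selector.ranking_.tolist())
```

RFE repeatedly drops the feature with the smallest absolute coefficient and refits, so the kept features all have rank 1 and the others are ranked by how late they were eliminated.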


In [None]:
pd.DataFrame(list(zip(X_train.columns,rfe.support_,rfe.ranking_)),columns=['column_name','Include','feature_rank'])


In [None]:
col = X_train.columns[rfe.support_]
X_train_rfe = X_train[col]

In [None]:
# statsmodels OLS (a linear probability model on the 0/1 target), used here as a quick screen for significant variables
X_train_rfe_new = sm.add_constant(X_train_rfe)
log_reg_sm = sm.OLS(y_train,X_train_rfe_new).fit() 
print(log_reg_sm.summary())

In [None]:
# lets drop avg67_loc_og_t2f_mou since it has high p value
X_train_rfe_new=X_train_rfe.drop('avg67_loc_og_t2f_mou',axis=1)

In [None]:
vif = pd.DataFrame()
X = X_train_rfe_new
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
# statsmodels OLS (a linear probability model on the 0/1 target), refitted after dropping the variable above
X_train_rfe_new = sm.add_constant(X_train_rfe_new)
log_reg_sm = sm.OLS(y_train,X_train_rfe_new).fit() 
print(log_reg_sm.summary())

In [None]:
# lets drop vol_2g_mb_avgdiff8 since it has high p value
X_train_rfe_new = X_train_rfe_new.drop('vol_2g_mb_avgdiff8',axis=1)
 

In [None]:
vif = pd.DataFrame()
X = X_train_rfe_new.drop(['const'], axis=1)
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
# lets drop avg67_total_rech_amt since it has high VIF value
X_train_rfe_new=X_train_rfe_new.drop('avg67_total_rech_amt',axis=1)

In [None]:
vif = pd.DataFrame()
X = X_train_rfe_new.drop(['const'], axis=1)
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
# lets drop av_rech_amt_data_8 since it has high VIF value
X_train_rfe_new=X_train_rfe_new.drop('av_rech_amt_data_8',axis=1)

In [None]:
vif = pd.DataFrame()
X = X_train_rfe_new.drop(['const'], axis=1)
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

`All VIF values are now below 2, so multicollinearity between the feature variables has been taken care of. Let's model and predict using these variables.`
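
The VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the other features. The sketch below computes it from that definition with NumPy only, to show why a near-duplicate column gets a large value. `vif_table` is a hypothetical helper written for this illustration; the cells above use statsmodels instead.

```python
import numpy as np
import pandas as pd

def vif_table(X):
    """VIF per column via 1 / (1 - R^2), regressing each column on the others (hypothetical helper)."""
    values = X.to_numpy(dtype=float)
    rows = []
    for i, name in enumerate(X.columns):
        target = values[:, i]
        others = np.delete(values, i, axis=1)
        design = np.column_stack([np.ones(len(others)), others])  # intercept + other features
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1 - resid.var() / target.var()
        rows.append({"Features": name, "VIF": round(1 / (1 - r2), 2)})
    return pd.DataFrame(rows).sort_values("VIF", ascending=False, ignore_index=True)

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=500), rng.normal(size=500)
# x3 nearly duplicates x1, so both should show a high VIF
toy = pd.DataFrame({"x1": x1, "x2": x2, "x3": x1 + 0.1 * rng.normal(size=500)})
vif_toy = vif_table(toy)
print(vif_toy)
```

Wrapping the calculation in a function like this would also remove the need to repeat the same VIF cell after every dropped variable.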

In [None]:
X_train=X_train[X.columns]
X_test=X_test[X.columns]

In [None]:
log_reg=LogisticRegression(C=0.01, class_weight='balanced', max_iter=1000, penalty='l1',
                   random_state=42, solver='liblinear')
log_reg.fit(X_train,y_train)

In [None]:
#optimal cutoff
df_cutoff=predictChurnlabeloncutoff(log_reg,X_train,y_train)
optimal_cutoff(df_cutoff)


In [None]:
#As we can see cutoff=0.45 given good recall score as well as accuracy and specificity.
threshold=0.45 #from graph, at 0.45 we can have good score for recall(objective to maximize recall) with decent score for accuracy and specificity.
y_test_pred= predictChurnlabeloncutoff(log_reg,X_test,y_test,threshold).predicted
print('Classification Report:\n')
print(classification_report(y_test,y_test_pred))
print('Confusion Matrix:\n')
print(confusion_matrix(y_test,y_test_pred))
model_metrics(y_test,y_test_pred)

In [None]:
feat_imp=np.abs(log_reg.coef_).ravel() # absolute coefficients as a 1-D array (coef_ has shape (1, n_features))
len(X_train.columns)

In [None]:
cmap = plt.get_cmap('Spectral')
colors = [cmap(i) for i in np.linspace(0, 1, 16)]
plt.figure(figsize=(16,16))
wedges, labels, autopct = plt.pie(feat_imp, labeldistance=1.02,labels=X_train.columns, autopct='%1.1f%%', shadow=False, colors=colors)
for lab in labels:
    lab.set_fontsize(15)
plt.rcParams['font.size'] = 12
plt.rcParams['font.weight'] = 'bold'
centre_circle = plt.Circle((0,0),0.20,fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
plt.axis('equal')  
plt.tight_layout()
plt.title('Logistic Regression Feature Importance',fontweight='bold',size=20)
plt.show()
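
The percentages in the pie chart are each feature's share of the summed absolute coefficients. A minimal sketch of that calculation on made-up coefficients follows; the values and the two placeholder feature names are assumptions.

```python
import numpy as np
import pandas as pd

# Made-up coefficients shaped like LogisticRegression.coef_ for a binary model: (1, n_features)
coef = np.array([[-1.2, 0.3, 0.0, 0.5]])
features = ["total_ic_mou_8", "aon", "feature_c", "feature_d"]  # last two names are placeholders

importance = pd.Series(np.abs(coef).ravel(), index=features)
share = (100 * importance / importance.sum()).sort_values(ascending=False)
print(share.round(1))
```

Taking the absolute value discards the sign, so the chart shows how strongly a feature moves the prediction but not in which direction. Comparing magnitudes across features is only meaningful because the inputs were standardised first.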

`Top 10 features the business should consider when identifying customers who are likely to churn, and on which business decisions can be based. The features are listed in decreasing order of importance:`
- total_ic_mou_8 (total incoming minutes of usage for 8th month)
- days_since_last_rech (days since last recharge)
- roam_og_to_ic_mou_8 (roaming outgoing to incoming ratio of minutes of usage for 8th month)
- total_vol_8 (total volume of data usage (2g + 3g) for 8th month)
- last_day_rch_amt_8 (last day recharge amount for 8th month)
- avg67_arpu (average revenue per user, averaged over 6th and 7th month)
- tot_og_to_ic_mou_8 (total outgoing to incoming minutes of usage for 8th month)
- total_ic_mou_avgdiff8 (total incoming minutes of usage: 8th month minus the 6th and 7th month average)
- aon (age on network)
- avg67_max_rech_amt (average of 6th and 7th month maximum recharge amount)

`From logistic regression using 16 feature variables I got:`
- Recall Score= 87%
- Accuracy = 85%
- ROC AUC= 85%
- Specificity= 84%

# Random Forest Interpretable model 

In [None]:
y_train = df_train['churn']
X_train = df_train.drop('churn',axis=1)
y_test = df_test['churn']
X_test = df_test.drop('churn',axis=1)
# Scaling
scaler = StandardScaler()
colums = list(X_train.columns)
X_train[colums] = scaler.fit_transform(X_train[colums])
X_test[colums] = scaler.transform(X_test[colums])

In [None]:
rfc=RandomForestClassifier(random_state=randm_state,class_weight='balanced',\
                                           n_estimators=80,criterion='entropy',max_features=15,min_samples_leaf=50,\
                                          min_samples_split=100)
rfc.fit(X_train,y_train)

In [None]:
#optimal cutoff
df_cutoff=predictChurnlabeloncutoff(rfc,X_train,y_train)
optimal_cutoff(df_cutoff)


In [None]:
#As we can see cutoff=0.38 given good recall score as well as accuracy and specificity.
threshold=0.38 #from graph, at 0.38 we can have good score for recall(objective to maximize recall) with decent score for accuracy and specificity.
y_test_pred= predictChurnlabeloncutoff(rfc,X_test,y_test,threshold).predicted
print('Classification Report:\n')
print(classification_report(y_test,y_test_pred))
print('Confusion Matrix:\n')
print(confusion_matrix(y_test,y_test_pred))
model_metrics(y_test,y_test_pred)

In [None]:
# Check the feature importance score for each feature
feat_imp_df = pd.DataFrame({'Feature':X_train.columns, 'Score':rfc.feature_importances_})
feat_imp_df = feat_imp_df.sort_values('Score', ascending=False).reset_index() # Order features by score
feat_imp_df.head(20)

In [None]:
m=list(feat_imp_df.Feature[:16])
n=list(feat_imp_df.Score[:16])

In [None]:
cmap = plt.get_cmap('Spectral')
colors = [cmap(i) for i in np.linspace(0, 1, 16)]
plt.figure(figsize=(16,16))
wedges1, labels1, autopct1 = plt.pie(list(feat_imp_df.Score[:15]), labeldistance=1.02,\
                                  labels=list(feat_imp_df.Feature[:15]), autopct='%1.1f%%', shadow=False, colors=colors)
for lab in labels1:  # use the label objects returned by plt.pie above
    lab.set_fontsize(15)
plt.rcParams['font.size'] = 12
plt.rcParams['font.weight'] = 'bold'
centre_circle = plt.Circle((0,0),0.20,fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
plt.axis('equal')  
plt.tight_layout()
plt.title('Random Forest Feature Importance',fontweight='bold',size=20)
plt.show()
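
One thing to keep in mind when reading the pie chart: only the top features are plotted, so the percentages are shares of the plotted importances rather than of the model's total importance. This assumes the wedges are normalised to sum to 100%, which I believe is the default in recent Matplotlib releases. The sketch below shows that renormalisation in plain Python. The helper name, feature names and scores are hypothetical.

```python
def top_k_shares(features, scores, k):
    """Rank features by score and express the top k as percentages of their own sum."""
    ranked = sorted(zip(features, scores), key=lambda fs: fs[1], reverse=True)[:k]
    total = sum(score for _, score in ranked)
    return [(name, score, 100 * score / total) for name, score in ranked]


# Hypothetical importances that sum to 1 across all features.
features = ['feat_a', 'feat_b', 'feat_c', 'feat_d']
scores = [0.40, 0.30, 0.20, 0.10]

shares = top_k_shares(features, scores, k=3)
for name, score, pct in shares:
    print(f'{name}: importance={score:.2f}, share of top 3={pct:.1f}%')
```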

`Top 10 feature variables that can help the business decide which customers who are likely to churn should be offered additional benefits or discounts:`
- total_og_mou_8 (total outgoing minutes of usage for the 8th month)
- total_ic_mou_8 (total incoming minutes of usage for the 8th month)
- tot_amt_avgdiff8 (difference between the 6th/7th-month average total amount and the 8th month)
- ot_og_to_ic_mou_8 (outgoing to incoming minutes of usage ratio for the 8th month)
- last_day_rch_amt_8 (last day recharge amount for the 8th month)
- tot_amt_8 (total amount for the 8th month)
- roam_og_to_ic_mou_8 (roaming outgoing to incoming minutes of usage ratio for the 8th month)
- arpu_avgdiff8 (difference between the 6th/7th-month average revenue per user and the 8th month)
- arpu_8 (average revenue per user for the 8th month)
- days_since_last_rech (days since last recharge)
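
To make the `*_avgdiff8` features in the list above concrete, here is a minimal sketch of how such a value could be computed for one customer. The function name and the sample amounts are hypothetical. The sign convention (month 8 minus the month 6/7 average) is an assumption, since the notebook's feature-engineering code may subtract in the other direction.

```python
def avg_diff_from_month8(value_6, value_7, value_8):
    """Month-8 value minus the average of months 6 and 7 (sign convention assumed)."""
    return value_8 - (value_6 + value_7) / 2


# Hypothetical total recharge amounts for one customer in months 6, 7 and 8.
drop = avg_diff_from_month8(500.0, 700.0, 200.0)
print(drop)
```

Under this convention a large negative value means the customer spent far less in the 8th month than in the two preceding months, which fits the 'action' phase described earlier.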

`From the random forest classifier using 16 feature variables I got:`
- Recall Score= 86%
- Accuracy = 86%
- ROC AUC= 86%
- Specificity= 86%

`Both logistic regression and the random forest classifier give us similar predictor variables for deciding how to reduce or prevent churn among customers who are likely to leave. The telecom company can introduce promotional offers for those customers. All our models were built for high-value customers, so business teams can now base their decisions on these most important predictor variables. Because we used different models and different feature-selection techniques, the resulting predictor variables are similar but their order differs slightly. Hence we recommend that the telecom company treat these feature variables as strong indicators when managing customer churn.`

<div class="list-group" id="list-tab" role="tablist">
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:black; border:0; color:tomato' role="tab" aria-controls="home"><center>Thank You:)</center></h2>
