### Introduction:

Telecom companies focus more on retaining customers than on acquiring customers as it costs 5-10 times more to acquire a new customer than to retain an existing one.

It is observed that 80% of revenue comes from 20% of the customers. They are named High Value Customers. Hence, more emphasis is placed on high value customers.

### Data Provided:

The dataset contains customer usage data over a period of 4 months.
These months are grouped as:
1. Good phase - 6th and 7th month
2. Action phase - 8th month
3. Churn phase - 9th month

### Business Objectives:

1. Build a predictive model that can be used to know if customers are going to churn or not in the future.
2. Identify the factors (variables) that are helpful in predicting churn and customer behavior.

### Approach:

1. Data Understanding & Cleaning
2. EDA
3. Data preparation for model building
4. Handle Class Imbalance
5. Dimensionality Reduction using PCA
6. Classification models to predict Churn
7. Model Evaluation
8. Interpretable model creation to identify strong predictors of churn.
9. Summary


In [None]:
#importing necessary packages and data ( The Data in csv format is downloaded )
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
import plotly.express as px
import plotly.figure_factory as ff

from sklearn.metrics import r2_score
from sklearn.preprocessing import scale 
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
from sklearn.linear_model import Ridge, Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

###  Data Understanding & Cleaning

In [None]:
#reading the data
df_telecom = pd.read_csv('../input/telecom-churn-data')
df_telecom.head()

In [None]:
#alter the pandas for better visualization
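# use the fully qualified option names; the short forms can match more than one option in newer pandas
pd.set_option('display.max_columns', 400)
pd.set_option('display.max_rows', 400)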

In [None]:
#check the size of the data
df_telecom.shape

In [None]:
#validating the data
df_telecom.describe()

In [None]:
#validating the column name and type
df_telecom.info(verbose=1)

In [None]:
#Check the percentage of null values in the columns
df_telecom.isnull().sum() * 100 / len(df_telecom)

In [None]:
# missingno is used to identify missing values in the dataset 

import missingno as msno

msno.matrix(df_telecom)

## Data Preparation

As per the assignment, we are already provided with the key data preparation steps, so let us follow them:

    1. Derive new features
    2. Filter high-value customers
    3. Tag churners and remove attributes of the churn phase

1. Derive new features: let us first create a feature that tags whether a customer has churned. As per the assignment document, churned customers are tagged (churn=1, else 0) based on the fourth month as follows: those who have not made any calls (either incoming or outgoing) AND have not used mobile internet even once in the churn phase. The attributes needed to tag churners are:

    total_ic_mou_9

    total_og_mou_9

    vol_2g_mb_9

    vol_3g_mb_9

In [None]:
#Before performing this operation let us check whether the data has null values or not
df_telecom[['total_ic_mou_9','total_og_mou_9','vol_2g_mb_9','vol_3g_mb_9']].isnull().sum() * 100 / len(df_telecom[['total_ic_mou_9','total_og_mou_9','vol_2g_mb_9','vol_3g_mb_9']])

In [None]:
# As the data is not null let us proceed to create the column and tag the user is churned or not
df_telecom['Churn'] =  ((df_telecom['total_ic_mou_9']==0.0) & (df_telecom['total_og_mou_9']==0.0) & (df_telecom['vol_2g_mb_9']==0.0) & (df_telecom['vol_3g_mb_9']==0.0)).astype(int)

In [None]:
df_telecom.head()

In [None]:
# Count churn and non churn data in percentage
df_telecom.Churn.value_counts(1)*100

In [None]:
#let us create column for total data recharged amount
#total data recharge amount = total_rech_data *av_rech_amt_data
#let us check if the data is null or not
df_telecom[['av_rech_amt_data_6','av_rech_amt_data_7','total_rech_data_6','total_rech_data_7','av_rech_amt_data_8','av_rech_amt_data_9','total_rech_data_8','total_rech_data_9']].isnull().sum() * 100 / len(df_telecom[['av_rech_amt_data_6','av_rech_amt_data_7','total_rech_data_6','total_rech_data_7','av_rech_amt_data_8','av_rech_amt_data_9','total_rech_data_8','total_rech_data_9']])

In [None]:
# There are null values in  recharge amount we can impute them with '0'
df_telecom[['av_rech_amt_data_6','av_rech_amt_data_7','total_rech_data_6','total_rech_data_7','av_rech_amt_data_8','av_rech_amt_data_9','total_rech_data_8','total_rech_data_9']]=df_telecom[['av_rech_amt_data_6','av_rech_amt_data_7','total_rech_data_6','total_rech_data_7','av_rech_amt_data_8','av_rech_amt_data_9','total_rech_data_8','total_rech_data_9']].fillna(0, axis=1)

In [None]:
#revalidation
df_telecom[['av_rech_amt_data_6','av_rech_amt_data_7','total_rech_data_6','total_rech_data_7','av_rech_amt_data_8','av_rech_amt_data_9','total_rech_data_8','total_rech_data_9']].isnull().sum() * 100 / len(df_telecom[['av_rech_amt_data_6','av_rech_amt_data_7','total_rech_data_6','total_rech_data_7','av_rech_amt_data_8','av_rech_amt_data_9','total_rech_data_8','total_rech_data_9']])

In [None]:
#creating column for total data recharged amount
for month in [6,7,8,9]:
    df_telecom['total_data_recharged_amt_'+str(month)] = df_telecom['total_rech_data_'+str(month)] * df_telecom['av_rech_amt_data_'+str(month)]

In [None]:
df_telecom.head()

2. Filter high-value customers

    As mentioned in the assignment document, we need to predict churn only for the high-value customers. High-value customers are defined as follows: those who have recharged with an amount greater than or equal to X, where X is the 70th percentile of the average recharge amount in the first two months (the good phase).
    
    The first two months are the ‘good’ phase, the third month is the ‘action’ phase, while the fourth month is the ‘churn’ phase.
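
To make the definition concrete, here is a minimal, dependency-free sketch on made-up numbers. The customer IDs and amounts are hypothetical, and `percentile` is an illustrative helper (not part of the notebook) that interpolates linearly between ranks, which is assumed here to mirror the default behaviour of pandas' `quantile`.

```python
# Toy illustration of the high-value filter (all numbers are made up).
def percentile(values, pct):
    """Return the pct-th percentile (0-100) using linear interpolation between ranks."""
    ordered = sorted(values)
    pos = pct * (len(ordered) - 1) / 100
    lower = int(pos)
    frac = pos - lower
    if lower + 1 < len(ordered):
        return ordered[lower] + frac * (ordered[lower + 1] - ordered[lower])
    return ordered[lower]

# hypothetical average good-phase recharge amount per customer
avg_good_phase_rech = {
    "c1": 100, "c2": 200, "c3": 300, "c4": 400, "c5": 500, "c6": 600,
    "c7": 700, "c8": 800, "c9": 900, "c10": 1000, "c11": 1100,
}

# X is the 70th percentile; customers at or above X are kept
threshold = percentile(avg_good_phase_rech.values(), 70)
high_value = [cust for cust, amt in avg_good_phase_rech.items() if amt >= threshold]
print(threshold, high_value)
```

Because the comparison is "greater than or equal to", the customer sitting exactly on the threshold is kept, so roughly the top 30% of customers remain.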

In [None]:
df_telecom['ph_good_rech'] = (df_telecom['total_rech_amt_6'] + df_telecom['total_rech_amt_7']+df_telecom['total_data_recharged_amt_6']+df_telecom['total_data_recharged_amt_7'])/2
df_telecom['ph_action_rech'] = df_telecom['total_rech_amt_8']+df_telecom['total_data_recharged_amt_8']
df_telecom['ph_churn_rech'] = df_telecom['total_rech_amt_9']+df_telecom['total_data_recharged_amt_9']

In [None]:
#to get the high value customer
print('70th percentile average recharge amount :' ,df_telecom['ph_good_rech'].quantile(.70))

In [None]:
# let us filter out the high-value customers (copy, so that later in-place edits act on an independent dataframe)
df_high_value = df_telecom.loc[df_telecom.ph_good_rech >= df_telecom['ph_good_rech'].quantile(.70)].copy()

In [None]:
# we are left with  around 30k records which is almost the same as the value mentioned in the description (29.9k records)
df_high_value.shape


3. Tag churners and remove attributes of the churn phase
    After tagging churners, remove all the attributes corresponding to the churn phase (all attributes having ‘ _9’, etc. in their names)

In [None]:
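#we have already tagged the churners; let us now drop the columns corresponding to the churn phase
#ph_churn_rech is derived from month-9 recharge data, so it is dropped as well to avoid leaking the target
churn_phase_cols = [col for col in df_high_value.columns if '_9' in col] + ['ph_churn_rech']
df_high_value.drop(churn_phase_cols, axis=1, inplace=True)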

In [None]:
#validate the columns
df_high_value.columns

In [None]:
#count churn and non churn data in %
df_high_value.Churn.value_counts(1)*100

#### 8.13 % of high value customers have churned. There is a class imbalance. We need to treat this using SMOTE later when necessary.

##### As we have marked our churn and non churn customers we can remove the columns belonging to churn phase as those are not available in real time.

## Data Cleansing and missing value population

In [None]:
#check for unique values
df_high_value.nunique().sort_values()

In [None]:
#let us check the columns with unique values
unique_dataset=pd.DataFrame(df_high_value.nunique())

In [None]:
# Validating the data with only constant value
df_high_value[unique_dataset[unique_dataset[0]==1].index].describe()

In [None]:
#let us drop these columns
df_high_value.drop(columns=unique_dataset[unique_dataset[0]==1].index,inplace=True)

In [None]:
# Validation
df_high_value.nunique().sort_values()

In [None]:
#validating the columns with only 2 unique values
df_high_value[unique_dataset[unique_dataset[0]==2].index].isnull().sum()

In [None]:
# lets us find categorical columns
# From data dictionary these columns seemed to be categorical.
# Let us find out the unique values in them

In [None]:
for col in unique_dataset[unique_dataset[0]==2].index:
    print(df_high_value[col].value_counts())

In [None]:
#let us populate the null data with -1 to indicate that the night pack and fb usage of the user is unknown
df_high_value[unique_dataset[unique_dataset[0]==2].index]=df_high_value[unique_dataset[unique_dataset[0]==2].index].fillna(-1, axis=1)

In [None]:
#validation 
df_high_value.nunique().sort_values()

In [None]:
#let us drop the mobile_number column as all the values are unique
df_high_value.drop(['mobile_number'],axis=1,inplace=True)

In [None]:
#Check the percentage of null values in the columns
null_dataset=pd.DataFrame(df_high_value.isnull().sum() * 100 / len(df_high_value))

In [None]:
#let us check the columns with more than 40% null data
null_dataset[null_dataset[0]>40]

In [None]:
df_high_value[null_dataset[null_dataset[0]>40].index].info()

In [None]:
# For the categorical data we can populate the mode
# Let us take the list of categorical data
df_high_value[null_dataset[null_dataset[0]>0].index].select_dtypes(include='object').columns

In [None]:
# Imputing missing values with mode
for col in df_high_value[null_dataset[null_dataset[0]>0].index].select_dtypes(include='object').columns:
    print(col)
    print(df_high_value[col].mode()[0])
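    # assign back instead of filling in place on a column selection, which may not update the dataframe
    df_high_value[col] = df_high_value[col].fillna(df_high_value[col].mode()[0])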

In [None]:
# Let us convert the datatype for these columns
for col in df_high_value[null_dataset[null_dataset[0]>0].index].select_dtypes(include='object').columns:
    df_high_value[col]=pd.to_datetime(df_high_value[col])

In [None]:
#Now let us identify the number of days between the last recharges in months 6-7 and 7-8
df_high_value['ndays_bw_rech_6_7']=(df_high_value['date_of_last_rech_7']-df_high_value['date_of_last_rech_6']).dt.days
df_high_value['ndays_bw_rech_7_8']=(df_high_value['date_of_last_rech_8']-df_high_value['date_of_last_rech_7']).dt.days

In [None]:
# Verify the data set post column creation
df_high_value.head()

In [None]:
#let us describe the float columns that have more than 40% null data
df_high_value[null_dataset[null_dataset[0]>40].index].select_dtypes(include='float64').describe()

In [None]:
#count_rech 2g/3g is handled in total_rech_data itself for each month, so we can drop this data
#validation:
df_high_value[['total_rech_data_6','total_rech_data_7','total_rech_data_8','count_rech_2g_6','count_rech_2g_7','count_rech_2g_8','count_rech_3g_6','count_rech_3g_7','count_rech_3g_8']]

In [None]:
#let us drop these count_rech 2g/3g column 
df_high_value.drop(columns=['count_rech_2g_6','count_rech_2g_7','count_rech_2g_8','count_rech_3g_6','count_rech_3g_7','count_rech_3g_8'],inplace=True)

In [None]:
# arpu : Average revenue per user
# let us populate 0 for all the null values
df_high_value[['arpu_3g_6','arpu_3g_7','arpu_3g_8','arpu_2g_6','arpu_2g_7','arpu_2g_8']]=df_high_value[['arpu_3g_6','arpu_3g_7','arpu_3g_8','arpu_2g_6','arpu_2g_7','arpu_2g_8']].fillna(0, axis=1)

In [None]:
# let us check if the max_rech_data has any 0 value or not
df_high_value[(df_high_value['max_rech_data_6']==0) | (df_high_value['max_rech_data_7']==0) | (df_high_value['max_rech_data_8']==0)]

In [None]:
# imputing with 0 as minimum recharge amount is 1. So null values means no recharge or zero recharge

df_high_value[['max_rech_data_6','max_rech_data_7','max_rech_data_8']]=df_high_value[['max_rech_data_6','max_rech_data_7','max_rech_data_8']].fillna(0,axis=1)

In [None]:
#Check the percentage of null values in the columns
df_high_value.isnull().sum() * 100 / len(df_high_value)

In [None]:
#all other data are less than 5%, let us drop the rows
#before that let us take the count
df_high_value.shape

In [None]:
#dropping the rows with na values
df_high_value.dropna(inplace=True)

In [None]:
df_high_value.shape

In [None]:
28504/30001
#5% data is dropped

In [None]:
#Check the percentage of null values in the columns
df_high_value.isnull().sum() * 100 / len(df_high_value)

In [None]:
# Finished cleaning the data
#let us check churn count
#count churn and non churn data in %
df_high_value.Churn.value_counts(1)*100

In [None]:
df_high_value.shape

#### We have treated all missing values from the data set. We are left with 28504 records and 161 columns 

In [None]:
#count churn and non churn data in number
df_high_value.Churn.value_counts()

## EDA

### Univariate Analysis

In [None]:
# Categorical and Numerical 

numerical_feats = df_high_value.dtypes[(df_high_value.dtypes =='float64') | (df_high_value.dtypes =='int64')].index
print("Number of Numerical features: ", len(numerical_feats))

categorical_feats = df_high_value.dtypes[df_high_value.dtypes == "object"].index
print("Number of Categorical features: ", len(categorical_feats))
#this is not helpful

In [None]:
#let us analyze the churn value
sns.histplot(data=df_high_value, x="Churn", hue="Churn")

*__Inference :__* 

    1.The data is highly imbalanced.

In [None]:
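# Let us define a function that creates univariate plots
def uni(col):
    
    if col.dtype == np.int64 or col.dtype == np.float64:
        # distplot is deprecated in recent seaborn; histplot with a KDE overlay is the closest replacement
        sns.histplot(col, kde=True, stat="density")
        
    
    elif col.dtype == 'category':
        # pass the series by keyword so that it is used as the x variable
        sns.countplot(x=col)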

In [None]:
uni(df_high_value.arpu_6)

In [None]:
uni(df_high_value.arpu_7)

In [None]:
uni(df_high_value.arpu_8)

### Inference : Average revenue per user is skewed to the right


In [None]:
uni(df_high_value.total_og_mou_6)

In [None]:
uni(df_high_value.total_og_mou_7)

In [None]:
uni(df_high_value.total_og_mou_8)

### Inference: Total outgoing calls data distribution is skewed. Most of the values are lying around 1000

In [None]:
uni(df_high_value.fb_user_6)

In [None]:
uni(df_high_value.fb_user_7)

In [None]:
uni(df_high_value.fb_user_8)

### Inference : During 8th month the number of users who didn't use facebook has increased

### Bivariate analysis

In [None]:
#let us analyze the churn data based on age on network
sns.histplot(data=df_high_value, x="aon", hue="Churn")

*__Inference :__* 

    1.The churn rate is low if age on network is high.

In [None]:
#let us analyze the churn data based on recharge frequency
GBR=[col for col in df_high_value.columns if 'ndays_bw_rech' in col]
nr_rows = 2
nr_cols = 2

fig, axs = plt.subplots(nr_rows, nr_cols, figsize=(nr_cols*7,nr_rows*5))

for r in range(0,nr_rows):
    for c in range(0,nr_cols):  
        i = r*nr_cols+c
        if i < len(GBR):
            sns.boxplot(y=GBR[i], x=df_high_value.Churn, data=df_high_value, ax = axs[r][c],hue="Churn",showfliers=False)
    
plt.tight_layout()    
plt.show() 

In [None]:
#let us analyze the average revenue per user (ARPU) in each month
arpu=['arpu_6','arpu_7','arpu_8']
nr_rows = 2
nr_cols = 2

fig, axs = plt.subplots(nr_rows, nr_cols, figsize=(nr_cols*7,nr_rows*5))

for r in range(0,nr_rows):
    for c in range(0,nr_cols):  
        i = r*nr_cols+c
        if i < len(arpu):
            sns.boxplot(y=arpu[i], x=df_high_value.Churn, data=df_high_value, ax = axs[r][c],hue="Churn",showfliers=False)
    
plt.tight_layout()    
plt.show() 

*__Inference :__* 

    1.ARPU drops in action phase in churned customers

In [None]:
#analysis on -  On Network Minutes of Usage and Off Network Minutes of Usage :
net_mou=[col for col in df_high_value.columns if 'onnet_mou' in col]+[col for col in df_high_value.columns if 'offnet_mou' in col]
nr_rows = 2
nr_cols = 3
fig, axs = plt.subplots(nr_rows, nr_cols, figsize=(nr_cols*7,nr_rows*5))

for r in range(0,nr_rows):
    for c in range(0,nr_cols):  
        i = r*nr_cols+c
        if i < len(net_mou):
            sns.boxplot(y=net_mou[i], x=df_high_value.Churn, data=df_high_value, ax = axs[r][c],hue="Churn",showfliers=False)
    
plt.tight_layout()    
plt.show() 

*__Inference :__* 

    1.On network minutes usage drops in action phase for churned customers.
    2.Off network minutes usage drops in action phase for churned customers.

In [None]:
#analysis on outgoing and incoming minutes of usage operator wise
#T2T    	Operator T to T, i.e. within same operator (mobile to mobile)
#T2M    	Operator T to other operator mobile
#T2O    	Operator T to other operator fixed line
#T2F    	Operator T to fixed lines of T
#T2C    	Operator T to its own call center

operator_wise=[col for col in df_high_value.columns if 'og_t' in col]+[col for col in df_high_value.columns if 'ic_t' in col]
nr_rows = 13
nr_cols = 3

fig, axs = plt.subplots(nr_rows, nr_cols, figsize=(nr_cols*7,nr_rows*5))

for r in range(0,nr_rows):
    for c in range(0,nr_cols):  
        i = r*nr_cols+c
        if i < len(operator_wise):
            sns.boxplot(y=operator_wise[i], x=df_high_value.Churn, data=df_high_value, ax = axs[r][c],hue="Churn",showfliers=False)
    
plt.tight_layout()    
plt.show() 

*__Inference :__* 

    1.Operator T to T's incoming/outgoing calls - standard and local minutes of usage drops in action phase for churned customers
    2.Operator T to Other's incoming/outgoing calls - standard and local minutes of usage drops in action phase for churned customers
    3.Operator T to Fixed line's incoming/outgoing calls - local minutes of usage drops in action phase for churned customers
    4.Operator T to Own call center's outgoing calls - local minutes of usage drops in action phase for churned customers 

In [None]:
#analysis on total incoming/outgoing value:
total=[col for col in df_high_value.columns if 'total_ic_mou' in col]+[col for col in df_high_value.columns if 'total_og_mou' in col]
nr_rows = 2
nr_cols = 3

fig, axs = plt.subplots(nr_rows, nr_cols, figsize=(nr_cols*7,nr_rows*5))

for r in range(0,nr_rows):
    for c in range(0,nr_cols):  
        i = r*nr_cols+c
        if i < len(total):
            sns.boxplot(y=total[i], x=df_high_value.Churn, data=df_high_value, ax = axs[r][c],hue="Churn",showfliers=False)
    
plt.tight_layout()    
plt.show() 

*__Inference :__* 

    1.Total minutes of usage for incoming/outgoing calls dropped in action phase for churned customers

In [None]:
#analysis on total number of recharge value:
total_rech=[col for col in df_high_value.columns if 'rech_num' in col]
nr_rows = 2
nr_cols = 2

fig, axs = plt.subplots(nr_rows, nr_cols, figsize=(nr_cols*7,nr_rows*5))

for r in range(0,nr_rows):
    for c in range(0,nr_cols):  
        i = r*nr_cols+c
        if i < len(total_rech):
            sns.boxplot(y=total_rech[i], x=df_high_value.Churn, data=df_high_value, ax = axs[r][c],hue="Churn",showfliers=False)
    
plt.tight_layout()    
plt.show() 

*__Inference :__* 

    1.Total number of recharges dropped in action phase for churned customers

In [None]:
#analysis on amount value:
amount=[col for col in df_high_value.columns if '_amt' in col]
nr_rows = 5
nr_cols = 3

fig, axs = plt.subplots(nr_rows, nr_cols, figsize=(nr_cols*7,nr_rows*5))

for r in range(0,nr_rows):
    for c in range(0,nr_cols):  
        i = r*nr_cols+c
        if i < len(amount):
            sns.boxplot(y=amount[i], x=df_high_value.Churn, data=df_high_value, ax = axs[r][c],hue="Churn",showfliers=False)
    
plt.tight_layout()    
plt.show() 

*__Inference :__* 

    1.Total call recharge amount drops in action phase in churned customers
    2.Maximum recharge amount drops in action phase in churned customers
    3.Last day recharge amount drops in action phase in churned customers
    4.Total data recharge amount drops in action phase in churned customers 

In [None]:
#the above data is not helpful
#let us go for 2g/3g related column:
g_column=[col for col in df_high_value.columns if ('_2g' )  in col] + [col for col in df_high_value.columns if ('_3g' )  in col]
nr_rows = 9
nr_cols = 3

fig, axs = plt.subplots(nr_rows, nr_cols, figsize=(nr_cols*7,nr_rows*5))

for r in range(0,nr_rows):
    for c in range(0,nr_cols):  
        i = r*nr_cols+c
        if i < len(g_column):
            sns.boxplot(y=g_column[i], x=df_high_value.Churn, data=df_high_value, ax = axs[r][c],hue="Churn",showfliers=False)
    
plt.tight_layout()    
plt.show() 

*__Inference :__* 

    1.2g/3g data usage drops in action phase for churned customers.
    2.revenue generated by 2g/3g usage also drops in action phase for churned customers.
    3.usage and revenue dropped for sachet package in good phase for churned customers.

In [None]:
#Let us validate the monthly 2g/3g package and sachet package
package=[col for col in df_high_value.columns if ('monthly' )  in col] + [col for col in df_high_value.columns if ('sachet' )  in col]
nr_rows = 3
nr_cols = 3

fig, axs = plt.subplots(nr_rows, nr_cols, figsize=(nr_cols*7,nr_rows*5))

for r in range(0,nr_rows):
    for c in range(0,nr_cols):  
        i = r*nr_cols+c
        if i < len(package):
            sns.barplot(y=package[i], x=df_high_value.Churn, data=df_high_value, ax = axs[r][c],hue="Churn")
    
plt.tight_layout()    
plt.show()

*__Inference :__* 

    1.monthly package has dropped in action phase for churned customers, network availability/ package cost might be an issue

### Multivariate Analysis

In [None]:
#creating dummy variables
#check for unique values
df_high_value.nunique().sort_values()

In [None]:
#the columns we can consider for dummy creation
unique_dataset=pd.DataFrame(df_high_value.nunique())
dummy=list(unique_dataset[unique_dataset[0]==3].index)
dummy

In [None]:
#creating dummy variables from categorical variable
for col in dummy:
    dummies = pd.get_dummies(df_high_value[col])
    dummies = dummies.add_prefix(f'{col}_')
    df_high_value = pd.concat([df_high_value, dummies], axis = 1)
    df_high_value.drop([col], axis = 1, inplace = True)
    
#the dummy variables are created; let us now drop the dummy columns generated for the -1 (unknown) category
df_high_value.drop([col for col in df_high_value.columns if '-1' in col],axis=1,inplace=True)

In [None]:
#validating the dataframe
df_high_value.head()

In [None]:
#let us check for correlation factor 
#plotting the heat map
plt.figure(figsize=[30,30])
sns.heatmap(df_high_value.corr(), annot=True)

*__Inference :__* 

    1.many columns have high correlation value

## Data Preparation for model:

In [None]:
#check the size before proceeding
df_ml_high_value=df_high_value.copy()
df_ml_high_value.shape

In [None]:
#as timestamp is causing issue during data scale let us convert it to float 
df_ml_high_value[df_ml_high_value.select_dtypes(include='datetime64').columns]=df_ml_high_value[df_ml_high_value.select_dtypes(include='datetime64').columns].values.astype('float64')

In [None]:
#let us split the data
y = df_ml_high_value.pop('Churn')
X = df_ml_high_value

In [None]:
#let us split the data for train and test
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size = 0.3,random_state = 1)

In [None]:
scaler = StandardScaler()
#let us scale the data
np.random.seed(0)
X_train[:]=scaler.fit_transform(X_train[:])
X_test[:]=scaler.transform(X_test[:])

In [None]:
print('size')
print(X_train.shape)
print(X_test.shape)

### Treating class imbalance

In [None]:
#let us use SMOTE as suggested in the discussion call
from imblearn.over_sampling import SMOTE

upsampler = SMOTE(random_state=42)  # fixed seed so that the resampled training set is reproducible
X_train,y_train = upsampler.fit_resample(X_train,y_train)

In [None]:
#validation
print('size')
print(X_train.shape)
print(y_train.shape)
#let us analyze the churn value
sns.histplot(data=y_train)
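
For intuition about what the oversampling step does, the sketch below shows the interpolation idea behind SMOTE using only the standard library. It is a simplified illustration, not imblearn's implementation: the function name `smote_like_sample`, the toy points and the choice of `k` are all hypothetical.

```python
import random

def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point between a minority point and one of its k nearest minority neighbours."""
    rng = rng or random.Random()
    base = rng.choice(minority)
    # rank the remaining minority points by squared Euclidean distance to the base point
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # random position along the segment base -> neighbour
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]

# hypothetical minority-class points (e.g. two scaled features of churned customers)
minority = [[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]]
synthetic = smote_like_sample(minority, k=2, rng=random.Random(0))
print(synthetic)
```

Each synthetic row lies on a line segment between two real minority rows, which is why the resampling is applied to the training data only: the test set should contain only real customers.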

### Reducing features -  PCA.

In [None]:
#let us Reduce the number of variables using PCA as suggested 
from sklearn.decomposition import PCA

pca = PCA(random_state=42)
pca.fit(X_train)

In [None]:
pca.components_

In [None]:
#explained variance ratio for each component
pca.explained_variance_ratio_

In [None]:
var_cumu = np.cumsum(pca.explained_variance_ratio_)
var_cumu

In [None]:
fig = plt.figure(figsize=[12,8])
#plt.vlines(x=75, ymax=1, ymin=0, colors="r", linestyles="--")
plt.hlines(y=0.95, xmax=150, xmin=0, colors="g", linestyles="--")
plt.plot(var_cumu)
plt.ylabel("Cumulative variance explained")
plt.grid()
plt.show()
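
Rather than reading the number of components off the plot, it can also be computed from the cumulative explained variance. The sketch below is a dependency-free illustration on hypothetical ratios; `components_for_variance` is an illustrative helper, not part of the notebook, and it is not claimed that this is how the 75 components used in the next step were chosen.

```python
from itertools import accumulate

def components_for_variance(explained_variance_ratio, target=0.95):
    """Smallest number of leading components whose cumulative ratio reaches the target."""
    running_totals = accumulate(explained_variance_ratio)
    for n_components, cumulative in enumerate(running_totals, start=1):
        if cumulative >= target:
            return n_components
    # target never reached: all components are needed
    return len(explained_variance_ratio)

# hypothetical explained-variance ratios, sorted in decreasing order as PCA returns them
ratios = [0.5, 0.25, 0.125, 0.0625, 0.0625]
print(components_for_variance(ratios, target=0.9))
```

With NumPy the same lookup could be written as `int(np.argmax(var_cumu >= 0.95)) + 1`, assuming the threshold is actually reached somewhere in `var_cumu`.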

In [None]:
from sklearn.decomposition import IncrementalPCA
pca_final = IncrementalPCA(n_components=75)
df_X_train_pca = pca_final.fit_transform(X_train)
#let us check the size
df_X_train_pca.shape

In [None]:
corrmat = np.corrcoef(df_X_train_pca.transpose())
corrmat.shape

In [None]:
#plotting the heat map
plt.figure(figsize=[30,30])
sns.heatmap(corrmat, annot=True)

In [None]:
#let us change the same for test data 
df_X_test_pca = pca_final.transform(X_test)
df_X_test_pca.shape

## Model Building - PCA Data

### Logistic regression model

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
logistic_learner_pca = LogisticRegression(C=1e9)
logistic_model_pca=logistic_learner_pca.fit(df_X_train_pca, y_train)
logistic_model_pca.get_params()

In [None]:
#let us make the prediction on test set
pred_probs_test = logistic_model_pca.predict_proba(df_X_test_pca)

In [None]:
from sklearn import metrics
"{:2.3}".format(metrics.roc_auc_score(y_test, pred_probs_test[:,1]))

In [None]:
#to create confusion matrix, let us make the prediction
pred_probs_train = logistic_model_pca.predict_proba(df_X_train_pca)
pred_probs_train

In [None]:
y_train.shape
pred_probs_train.shape

##### Creating a dataframe with the actual churn flag and the predicted probabilities

In [None]:
y_train_pred_final = pd.DataFrame({'Churn':y_train.values, 'Churn_Prob':pred_probs_train[:,1]})
y_train_pred_final['ID'] = y_train.index
y_train_pred_final.head()

##### Creating new column 'predicted' with 1 if Churn_Prob > 0.5 else 0

In [None]:
y_train_pred_final['predicted'] = y_train_pred_final.Churn_Prob.map(lambda x: 1 if x > 0.5 else 0)

# Let's see the head
y_train_pred_final.head()

In [None]:
# Confusion matrix 
confusion = metrics.confusion_matrix(y_train_pred_final.Churn, y_train_pred_final.predicted )
print(confusion)

In [None]:
# Let's check the overall accuracy.
print(metrics.accuracy_score(y_train_pred_final.Churn, y_train_pred_final.predicted))

In [None]:
TP = confusion[1,1] # true positive 
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives

In [None]:
# Let's see the sensitivity of our logistic regression model
TP / float(TP+FN)

In [None]:
# Let us calculate specificity
TN / float(TN+FP)

In [None]:
# Calculate false positive rate - predicting churn when the customer has not churned
FP/ float(TN+FP)

In [None]:
# positive predictive value 
TP / float(TP+FP)

In [None]:
# Negative predictive value
TN / float(TN+ FN)
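
The same ratios are recomputed several times in this notebook, so they could be wrapped in a small helper. The following is a minimal sketch; the function name `confusion_metrics` and the example counts are hypothetical and are not results from this notebook.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Return common rates derived from the four confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),                # recall / true positive rate
        "specificity": tn / (tn + fp),                # true negative rate
        "false_positive_rate": fp / (tn + fp),        # 1 - specificity
        "precision": tp / (tp + fp),                  # positive predictive value
        "negative_predictive_value": tn / (tn + fn),
    }

# hypothetical counts for illustration only
example = confusion_metrics(tp=80, tn=90, fp=10, fn=20)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

In the notebook it would be called with the values read from the confusion matrix, for example `confusion_metrics(TP, TN, FP, FN)`.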

In [None]:
#let us create the ROC curve
def draw_roc( actual, probs ):
    fpr, tpr, thresholds = metrics.roc_curve( actual, probs,
                                              drop_intermediate = False )
    auc_score = metrics.roc_auc_score( actual, probs )
    plt.figure(figsize=(5, 5))
    plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score )
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc="lower right")
    plt.show()

    return None

In [None]:
draw_roc(y_train_pred_final.Churn, y_train_pred_final.Churn_Prob)

#### Finding Optimal Cutoff Point

In [None]:
# Let's create columns with different probability cutoffs 
numbers = [float(x)/10 for x in range(10)]
for i in numbers:
    y_train_pred_final[i]= y_train_pred_final.Churn_Prob.map(lambda x: 1 if x > i else 0)
y_train_pred_final.head()

In [None]:
# Now let's calculate accuracy sensitivity and specificity for various probability cutoffs.
cutoff_df = pd.DataFrame( columns = ['prob','accuracy','sensi','speci'])
from sklearn.metrics import confusion_matrix

# TP = confusion[1,1] # true positive 
# TN = confusion[0,0] # true negatives
# FP = confusion[0,1] # false positives
# FN = confusion[1,0] # false negatives

num = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
for i in num:
    cm1 = metrics.confusion_matrix(y_train_pred_final.Churn, y_train_pred_final[i] )
    total1=sum(sum(cm1))
    accuracy = (cm1[0,0]+cm1[1,1])/total1
    
    speci = cm1[0,0]/(cm1[0,0]+cm1[0,1])
    sensi = cm1[1,1]/(cm1[1,0]+cm1[1,1])
    cutoff_df.loc[i] =[ i ,accuracy,sensi,speci]
print(cutoff_df)

In [None]:
# Let's plot accuracy sensitivity and specificity for various probabilities.
cutoff_df.plot.line(x='prob', y=['accuracy','sensi','speci'])
plt.grid()
plt.show()

#### From the curve above, 0.56 is the optimum point to take it as a cutoff probability.
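
The cutoff above is read from the plot. As a cross-check, the point where sensitivity and specificity meet can also be searched on a finer grid. The sketch below is a dependency-free illustration on hypothetical labels and probabilities; `balanced_cutoff` is an illustrative helper, and it assumes both classes are present in `y_true`.

```python
def balanced_cutoff(y_true, probs, steps=100):
    """Return the cutoff in [0, 1) where |sensitivity - specificity| is smallest."""
    best_cutoff, best_gap = None, float("inf")
    for k in range(steps):
        cutoff = k / steps
        # a row is predicted as churn when its probability exceeds the cutoff
        tp = sum(1 for y, p in zip(y_true, probs) if y == 1 and p > cutoff)
        fn = sum(1 for y, p in zip(y_true, probs) if y == 1 and p <= cutoff)
        tn = sum(1 for y, p in zip(y_true, probs) if y == 0 and p <= cutoff)
        fp = sum(1 for y, p in zip(y_true, probs) if y == 0 and p > cutoff)
        sensi = tp / (tp + fn)
        speci = tn / (tn + fp)
        gap = abs(sensi - speci)
        if gap < best_gap:  # strict comparison keeps the lowest cutoff on ties
            best_cutoff, best_gap = cutoff, gap
    return best_cutoff

# hypothetical labels and predicted churn probabilities
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
probs = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]
print(balanced_cutoff(y_true, probs, steps=10))
```

In the notebook the inputs would be `y_train_pred_final.Churn` and `y_train_pred_final.Churn_Prob`; if recall matters more than balance, a lower cutoff than the crossing point may be preferable.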

In [None]:
y_train_pred_final['final_predicted'] = y_train_pred_final.Churn_Prob.map( lambda x: 1 if x > 0.56 else 0)

y_train_pred_final.head()

In [None]:
# Confusion matrix 
confusion = metrics.confusion_matrix(y_train_pred_final.Churn, y_train_pred_final.final_predicted )
print(confusion)

In [None]:
# Let's check the overall accuracy.
print(metrics.accuracy_score(y_train_pred_final.Churn, y_train_pred_final.final_predicted))

#### Precision and Recall

In [None]:
#precision
confusion[1,1]/(confusion[0,1]+confusion[1,1])

In [None]:
#recall
confusion[1,1]/(confusion[1,0]+confusion[1,1])

In [None]:
#precision recall tradeoff
from sklearn.metrics import precision_recall_curve
p, r, thresholds = precision_recall_curve(y_train_pred_final.Churn, y_train_pred_final.Churn_Prob)

In [None]:
plt.plot(thresholds, p[:-1], "g-")
plt.plot(thresholds, r[:-1], "r-")
plt.show()

In [None]:
#let us make the prediction on test set
y_test_pred_final = logistic_model_pca.predict_proba(df_X_test_pca)
y_test_pred_final

In [None]:
#preparing data  for confusion matrix
y_pred_final = pd.DataFrame({'Churn':y_test.values, 'Churn_Prob':y_test_pred_final[:,1]})
y_pred_final['ID'] = y_test.index
y_pred_final.head()

In [None]:


y_pred_final['predicted'] = y_pred_final.Churn_Prob.map( lambda x: 1 if x > 0.56 else 0)

y_pred_final.head()

In [None]:
# Confusion matrix 
confusion = metrics.confusion_matrix(y_pred_final.Churn, y_pred_final.predicted )
print(confusion)

In [None]:
# Let's check the overall accuracy.
print(metrics.accuracy_score(y_pred_final.Churn, y_pred_final.predicted))

In [None]:
TP = confusion[1,1] # true positive 
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives

In [None]:
# Let's see the sensitivity of our logistic regression model
TP / float(TP+FN)

*sensitivity of regression model* : 0.82

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_pred_final.Churn, y_pred_final.predicted))

### Logistic regression model - hyperparameter tuning

In [None]:
from sklearn.model_selection import GridSearchCV
logistic_learner_pca_tune = LogisticRegression(penalty='l2')

In [None]:
log_params = {'C':[0.1, 0.2,0.3, 0.4, 0.5, 1, 2]}
folds=5
log_model_cv = GridSearchCV(estimator = logistic_learner_pca_tune, 
                        param_grid = log_params, 
                        scoring  = 'recall', 
                        cv = folds, 
                        return_train_score=True,
                        verbose = 1)  
log_model_cv.fit(df_X_train_pca,y_train)

In [None]:
log_model_cv.best_estimator_

In [None]:
#let us increase the param range and check
log_params = {'C':[2,4,10,50,100,200,300]}
folds=5
log_model_cv = GridSearchCV(estimator = logistic_learner_pca_tune, 
                        param_grid = log_params, 
                        scoring  = 'recall', 
                        cv = folds, 
                        return_train_score=True,
                        verbose = 1)  
log_model_cv.fit(df_X_train_pca,y_train)

In [None]:
log_model_cv.best_estimator_

In [None]:
#let us narrow the param range and check
log_params = {'C':[40,50,55,60]}
folds=5
log_model_cv = GridSearchCV(estimator = logistic_learner_pca_tune, 
                        param_grid = log_params, 
                        scoring  = 'recall', 
                        cv = folds, 
                        return_train_score=True,
                        verbose = 1)  
log_model_cv.fit(df_X_train_pca,y_train)

In [None]:
log_model_cv.best_estimator_

In [None]:
#let us check every integer value of C from 45 to 54
log_params = {'C':[x for x in range(45,55)]}
folds=5
log_model_cv = GridSearchCV(estimator = logistic_learner_pca_tune, 
                        param_grid = log_params, 
                        scoring  = 'recall', 
                        cv = folds, 
                        return_train_score=True,
                        verbose = 1)  
log_model_cv.fit(df_X_train_pca,y_train)

In [None]:
log_model_cv.best_estimator_
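
The searches above narrow `C` by hand in several passes. One possible alternative, sketched here under the assumption that `C` is best explored on a logarithmic scale, is to generate a coarse grid with one value per decade and then a finer grid around the best value. `coarse_grid` and `refine_around` are hypothetical helpers, not part of scikit-learn.

```python
import numpy as np

def coarse_grid(low_exp=-2, high_exp=3):
    """One value of C per decade, e.g. 0.01, 0.1, ..., 1000."""
    return np.logspace(low_exp, high_exp, num=high_exp - low_exp + 1)

def refine_around(best_c, factor=2.0, num=5):
    """Geometrically spaced values from best_c / factor to best_c * factor."""
    return np.geomspace(best_c / factor, best_c * factor, num=num)

coarse = coarse_grid()
fine = refine_around(46)      # 46 is the value this notebook settles on below
print(coarse)
print(np.round(fine, 2))
```

Either array could be passed to the grid search as `{'C': list(coarse)}` or `{'C': list(fine)}`.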

In [None]:
#let us create a model with tuned parameters
logistic_learner_pca_tuned = LogisticRegression(penalty='l2',C=46)

In [None]:
logistic_model_pca_tuned=logistic_learner_pca_tuned.fit(df_X_train_pca, y_train)
logistic_model_pca_tuned.get_params()

In [None]:
#let us make the prediction on test set
pred_probs_test = logistic_model_pca_tuned.predict_proba(df_X_test_pca)
"{:2.4}".format(metrics.roc_auc_score(y_test, pred_probs_test[:,1]))

In [None]:
#to create confusion matrix, let us make the prediction
pred_probs_tuned_train = logistic_model_pca_tuned.predict_proba(df_X_train_pca)

y_train_pred_tuned_final = pd.DataFrame({'Churn':y_train.values, 'Churn_Prob':pred_probs_tuned_train[:,1]})
y_train_pred_tuned_final['ID'] = y_train.index
y_train_pred_tuned_final.head()

y_train_pred_tuned_final['predicted'] = y_train_pred_tuned_final.Churn_Prob.map(lambda x: 1 if x > 0.56 else 0)

# Let's see the head
y_train_pred_tuned_final.head()

In [None]:
#to create confusion matrix, let us make the prediction
pred_probs_tuned_test = logistic_model_pca_tuned.predict_proba(df_X_test_pca)

y_test_pred_tuned_final = pd.DataFrame({'Churn':y_test.values, 'Churn_Prob':pred_probs_tuned_test[:,1]})
y_test_pred_tuned_final['ID'] = y_test.index
y_test_pred_tuned_final.head()

y_test_pred_tuned_final['predicted'] = y_test_pred_tuned_final.Churn_Prob.map(lambda x: 1 if x > 0.56 else 0)

# Let's see the head
y_test_pred_tuned_final.head()

In [None]:
# Confusion matrix 
confusion = metrics.confusion_matrix(y_train_pred_tuned_final.Churn, y_train_pred_tuned_final.predicted )
print(confusion)

In [None]:
# Confusion matrix 
confusion_test = metrics.confusion_matrix(y_test_pred_tuned_final.Churn, y_test_pred_tuned_final.predicted )
print(confusion_test)

In [None]:
# Let's check the overall accuracy.
print(metrics.accuracy_score(y_train_pred_tuned_final.Churn, y_train_pred_tuned_final.predicted))

TP = confusion[1,1] # true positive 
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives

In [None]:
# Let's check the overall accuracy in test data
print(metrics.accuracy_score(y_test_pred_tuned_final.Churn, y_test_pred_tuned_final.predicted))

TP_test = confusion_test[1,1] # true positive 
TN_test = confusion_test[0,0] # true negatives
FP_test = confusion_test[0,1] # false positives
FN_test = confusion_test[1,0] # false negatives

In [None]:
# Let's see the sensitivity of the tuned logistic regression model on the train data
sens = TP / float(TP+FN)
sens

In [None]:
# Let's see the sensitivity of the tuned logistic regression model on the test data
sens_test = TP_test / float(TP_test+FN_test)
sens_test

In [None]:
# Let us calculate specificity
TN / float(TN+FP)

In [None]:
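# pass the predicted probabilities so the ROC curve covers every threshold
draw_roc(y_train_pred_tuned_final.Churn, y_train_pred_tuned_final.Churn_Prob)

In [None]:
# pass the predicted probabilities so the ROC curve covers every threshold
draw_roc(y_test_pred_tuned_final.Churn, y_test_pred_tuned_final.Churn_Prob)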

In [None]:
print(classification_report(y_train_pred_tuned_final.Churn, y_train_pred_tuned_final.predicted))

In [None]:
#Logistic regression tuned model:
print('Accuracy :',metrics.accuracy_score(y_test_pred_tuned_final.Churn, y_test_pred_tuned_final.predicted))
print('Sensitivity :',sens_test)

### Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

RF=RandomForestClassifier()
RFG=RF.fit(df_X_train_pca,y_train)

In [None]:
y_train_pred=RFG.predict(df_X_train_pca)
y_train_pred

In [None]:
#preparing data  for confusion matrix
y_train_pred_df = pd.DataFrame({'Churn':y_train.values, 'Predicted':y_train_pred})
y_train_pred_df['ID'] = y_train.index
y_train_pred_df.head()

In [None]:
# Confusion matrix 
confusion = metrics.confusion_matrix(y_train_pred_df.Churn, y_train_pred_df.Predicted )
print('Confusion Matrix:')
print(confusion)

In [None]:
#ROC Score
"{:2.2}".format(metrics.roc_auc_score(y_train_pred_df.Churn, y_train_pred_df.Predicted ))
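
The score above is computed from hard 0/1 predictions, so it reflects a single threshold rather than the model's full ranking of customers. The sketch below illustrates the difference on toy data, using a hypothetical `pairwise_auc` function that computes ROC AUC as the fraction of correctly ordered (churner, non-churner) pairs. For the random forest, the probabilities would come from `RFG.predict_proba(df_X_train_pca)[:, 1]`.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """ROC AUC as the fraction of (churner, non-churner) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))

# Toy data, for illustration only.
y_toy = [0, 0, 1, 1, 0, 1]
prob_toy = [0.10, 0.60, 0.55, 0.80, 0.30, 0.70]
label_toy = [1 if p > 0.5 else 0 for p in prob_toy]

auc_from_probs = pairwise_auc(y_toy, prob_toy)
auc_from_labels = pairwise_auc(y_toy, label_toy)
print(auc_from_probs, auc_from_labels)
```

On this toy data the two values differ because thresholding turns many distinct probabilities into ties.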

In [None]:
#let us predict on test data
y_test_pred=RFG.predict(df_X_test_pca)
y_test_pred

In [None]:
#preparing data  for confusion matrix
y_test_pred_df = pd.DataFrame({'Churn':y_test.values, 'Predicted':y_test_pred})
y_test_pred_df['ID'] = y_test.index
y_test_pred_df.head()

In [None]:
draw_roc(y_test_pred_df.Churn, y_test_pred_df.Predicted)

In [None]:
# Confusion matrix 
confusion = metrics.confusion_matrix(y_test_pred_df.Churn, y_test_pred_df.Predicted )
print('Confusion Matrix:')
print(confusion)

In [None]:
"{:2.2}".format(metrics.roc_auc_score(y_test_pred_df.Churn, y_test_pred_df.Predicted ))

In [None]:
TP = confusion[1,1] # true positive 
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives

In [None]:
# Let's see the sensitivity of our RF model
TP / float(TP+FN)

In [None]:
print(classification_report(y_test_pred_df.Churn, y_test_pred_df.Predicted))

### Random Forest Classifier - Parameter Tuning

In [None]:
#for parameter tuning let us start with depth
rf = RandomForestClassifier()
parameters = {'max_depth': range(1, 30,5)}

In [None]:
rf_cv = GridSearchCV(rf, parameters, cv=5, scoring='recall',return_train_score=True)
rf_cv.fit(df_X_train_pca,y_train)

In [None]:
scores = rf_cv.cv_results_
# plotting recall (the scoring metric used in the grid search) against max_depth
plt.figure()
plt.plot(scores["param_max_depth"],
         scores["mean_train_score"],
         label="training recall")
plt.plot(scores["param_max_depth"],
         scores["mean_test_score"],
         label="cross-validation recall")
plt.xlabel("max_depth")
plt.ylabel("Recall")
plt.legend()
plt.show()
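
Alongside the plot, the same `cv_results_` entries can be read as a small ranked table. The sketch below assumes a dictionary shaped like `rf_cv.cv_results_`; `summarize_cv` is a hypothetical helper and the scores are made up. Note that `mean_test_score` is the mean score over the cross-validation folds, not a score on the held-out test set.

```python
def summarize_cv(cv_results, param_name):
    """List (param value, mean train score, mean CV score), best CV score first."""
    rows = list(zip(
        cv_results[f"param_{param_name}"],
        cv_results["mean_train_score"],
        cv_results["mean_test_score"],
    ))
    return sorted(rows, key=lambda row: row[2], reverse=True)

# Stand-in for rf_cv.cv_results_ with made-up scores, for illustration only.
fake_results = {
    "param_max_depth": [1, 6, 11, 16],
    "mean_train_score": [0.70, 0.85, 0.95, 0.99],
    "mean_test_score": [0.69, 0.80, 0.83, 0.82],
}
ranked = summarize_cv(fake_results, "max_depth")
for value, train, test in ranked:
    print(f"max_depth={value}  train recall={train:.2f}  CV recall={test:.2f}")
```

A widening gap between the training and cross-validation columns is a sign of overfitting.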

In [None]:
# we can take max depth between 10-15
#let us check for estimators
parameters = {'n_estimators': range(100, 250, 25)}
rf_cv = GridSearchCV(rf, parameters, cv=5, scoring='recall',return_train_score=True)
rf_cv.fit(df_X_train_pca,y_train)

In [None]:
scores = rf_cv.cv_results_
# plotting recall (the scoring metric used in the grid search) against n_estimators
plt.figure()
plt.plot(scores["param_n_estimators"],
         scores["mean_train_score"],
         label="training recall")
plt.plot(scores["param_n_estimators"],
         scores["mean_test_score"],
         label="cross-validation recall")
plt.xlabel("n_estimators")
plt.ylabel("Recall")
plt.legend()
plt.show()

In [None]:
#let us check for estimators by increasing the range
parameters = {'n_estimators': range(250, 300, 25)}
rf_cv = GridSearchCV(rf, parameters, cv=5, scoring='recall',return_train_score=True)
rf_cv.fit(df_X_train_pca,y_train)

In [None]:
scores = rf_cv.cv_results_
# plotting recall (the scoring metric used in the grid search) against n_estimators
plt.figure()
plt.plot(scores["param_n_estimators"],
         scores["mean_train_score"],
         label="training recall")
plt.plot(scores["param_n_estimators"],
         scores["mean_test_score"],
         label="cross-validation recall")
plt.xlabel("n_estimators")
plt.ylabel("Recall")
plt.legend()
plt.show()

In [None]:
#let us use n_estimators=200; next, let us check min_samples_leaf
parameters = {'min_samples_leaf': range(1, 100, 10)}
rf_cv = GridSearchCV(rf, parameters, cv=5, scoring='recall',return_train_score=True)
rf_cv.fit(df_X_train_pca,y_train)

In [None]:
scores = rf_cv.cv_results_
# plotting recall (the scoring metric used in the grid search) against min_samples_leaf
plt.figure()
plt.plot(scores["param_min_samples_leaf"],
         scores["mean_train_score"],
         label="training recall")
plt.plot(scores["param_min_samples_leaf"],
         scores["mean_test_score"],
         label="cross-validation recall")
plt.xlabel("min_samples_leaf")
plt.ylabel("Recall")
plt.legend()
plt.show()

In [None]:
#let us consider min_samples_leaf in the range 10-20
#next, let us check min_samples_split (it must be at least 2, so the range starts at 2)
parameters = {'min_samples_split': range(2, 50, 10)}
rf_cv = GridSearchCV(rf, parameters, cv=5, scoring='recall',return_train_score=True)
rf_cv.fit(df_X_train_pca,y_train)

In [None]:
scores = rf_cv.cv_results_
# plotting recall (the scoring metric used in the grid search) against min_samples_split
plt.figure()
plt.plot(scores["param_min_samples_split"],
         scores["mean_train_score"],
         label="training recall")
plt.plot(scores["param_min_samples_split"],
         scores["mean_test_score"],
         label="cross-validation recall")
plt.xlabel("min_samples_split")
plt.ylabel("Recall")
plt.legend()
plt.show()

In [None]:
#let us consider min_samples_split in the range 10 to 30 for better recall
#using the results of the grid searches above, let us search over the combined grid

parameters = {
    'max_depth': [8,10,14,16],
    'n_estimators': [200],
    'min_samples_leaf': [10],
    'min_samples_split': [10,20,30]
}
rf_cv = GridSearchCV(rf, parameters, cv=5, scoring='recall',return_train_score=True)
rf_cv.fit(df_X_train_pca,y_train)

In [None]:
rf_cv.best_params_

In [None]:
#let us create a new model with tuned parameters
rf_tuned = RandomForestClassifier(max_depth= 16,min_samples_leaf= 10,min_samples_split= 10,n_estimators= 200)
rf_tuned.fit(df_X_train_pca,y_train)

In [None]:
#let us predict on test data
y_test_pred_rf_tuned=rf_tuned.predict(df_X_test_pca)
y_test_pred_rf_tuned

In [None]:
#preparing data  for confusion matrix
y_test_pred_rf_tuned_df = pd.DataFrame({'Churn':y_test.values, 'Predicted':y_test_pred_rf_tuned})
y_test_pred_rf_tuned_df['ID'] = y_test.index
y_test_pred_rf_tuned_df.head()

In [None]:
draw_roc(y_test_pred_rf_tuned_df.Churn, y_test_pred_rf_tuned_df.Predicted)

In [None]:
# Confusion matrix 
confusion_rf_tuned = metrics.confusion_matrix(y_test_pred_rf_tuned_df.Churn, y_test_pred_rf_tuned_df.Predicted )
print('Confusion Matrix:')
print(confusion_rf_tuned)

In [None]:
# Let's check the overall accuracy.
print(metrics.accuracy_score(y_test_pred_rf_tuned_df.Churn, y_test_pred_rf_tuned_df.Predicted))

TP = confusion_rf_tuned[1,1] # true positive 
TN = confusion_rf_tuned[0,0] # true negatives
FP = confusion_rf_tuned[0,1] # false positives
FN = confusion_rf_tuned[1,0] # false negatives

In [None]:
# Let's see the sensitivity of our RF model
TP / float(TP+FN)

In [None]:
print(classification_report(y_test_pred_rf_tuned_df.Churn, y_test_pred_rf_tuned_df.Predicted))

### XGBoost

In [None]:
import xgboost as xgb

GBDT=xgb.XGBClassifier()
GBDTG=GBDT.fit(df_X_train_pca,y_train)

y_train_GBDTG_pred=GBDTG.predict(df_X_train_pca)

In [None]:
#preparing data  for confusion matrix
y_train_GBDTG_pred_df = pd.DataFrame({'Churn':y_train.values, 'Predicted':y_train_GBDTG_pred})
y_train_GBDTG_pred_df['ID'] = y_train.index
y_train_GBDTG_pred_df.head()

In [None]:
# Confusion matrix 
confusion = metrics.confusion_matrix(y_train_GBDTG_pred_df.Churn, y_train_GBDTG_pred_df.Predicted )
print('Confusion Matrix:')
print(confusion)

In [None]:
#ROC Score
"{:2.2}".format(metrics.roc_auc_score(y_train_GBDTG_pred_df.Churn, y_train_GBDTG_pred_df.Predicted ))

In [None]:
#let us predict on test data
y_test_GBDTG_pred=GBDTG.predict(df_X_test_pca)
y_test_GBDTG_pred

In [None]:
#preparing data  for confusion matrix
y_test_GBDTG_pred_df = pd.DataFrame({'Churn':y_test.values, 'Predicted':y_test_GBDTG_pred})
y_test_GBDTG_pred_df['ID'] = y_test.index
y_test_GBDTG_pred_df.head()

In [None]:
draw_roc(y_test_GBDTG_pred_df.Churn, y_test_GBDTG_pred_df.Predicted)

In [None]:
# Confusion matrix 
confusion = metrics.confusion_matrix(y_test_GBDTG_pred_df.Churn, y_test_GBDTG_pred_df.Predicted )
print('Confusion Matrix:')
print(confusion)

In [None]:
"{:2.2}".format(metrics.roc_auc_score(y_test_GBDTG_pred_df.Churn, y_test_GBDTG_pred_df.Predicted))

In [None]:
TP = confusion[1,1] # true positive 
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives
# Let's see the sensitivity of our XGBoost model
TP / float(TP+FN)

In [None]:
print(classification_report(y_test_GBDTG_pred_df.Churn, y_test_GBDTG_pred_df.Predicted))

### XGBoost - HyperParameter tuning

In [None]:
import xgboost as xgb
parameters={'learning_rate':[0.01,0.02,0.03,0.04,0.05]}

GBDT=xgb.XGBClassifier()

GBDTG=GridSearchCV(GBDT,parameters,cv=3,scoring='recall',return_train_score=True,n_jobs=4,verbose=6)
GBDTG.fit(df_X_train_pca,y_train)

In [None]:
scores = GBDTG.cv_results_
# plotting recall (the scoring metric used in the grid search) against learning_rate
plt.figure()
plt.plot(scores["param_learning_rate"],
         scores["mean_train_score"],
         label="training recall")
plt.plot(scores["param_learning_rate"],
         scores["mean_test_score"],
         label="cross-validation recall")
plt.xlabel("learning_rate")
plt.ylabel("Recall")
plt.legend()
plt.show()

In [None]:
parameters={'learning_rate':[0.05,0.1,0.5]}

GBDT=xgb.XGBClassifier()

GBDTG=GridSearchCV(GBDT,parameters,cv=3,scoring='recall',return_train_score=True,n_jobs=4,verbose=6)
GBDTG.fit(df_X_train_pca,y_train)

In [None]:
scores = GBDTG.cv_results_
# plotting recall (the scoring metric used in the grid search) against learning_rate
plt.figure()
plt.plot(scores["param_learning_rate"],
         scores["mean_train_score"],
         label="training recall")
plt.plot(scores["param_learning_rate"],
         scores["mean_test_score"],
         label="cross-validation recall")
plt.xlabel("learning_rate")
plt.ylabel("Recall")
plt.legend()
plt.show()

In [None]:
parameters={'learning_rate':[0.1,0.13,0.15]}

GBDT=xgb.XGBClassifier()

GBDTG=GridSearchCV(GBDT,parameters,cv=3,scoring='recall',return_train_score=True,n_jobs=4,verbose=6)
GBDTG.fit(df_X_train_pca,y_train)

In [None]:
scores = GBDTG.cv_results_
# plotting recall (the scoring metric used in the grid search) against learning_rate
plt.figure()
plt.plot(scores["param_learning_rate"],
         scores["mean_train_score"],
         label="training recall")
plt.plot(scores["param_learning_rate"],
         scores["mean_test_score"],
         label="cross-validation recall")
plt.xlabel("learning_rate")
plt.ylabel("Recall")
plt.legend()
plt.show()

In [None]:
#let us select learning_rate= 0.13
GBDT_tuned=xgb.XGBClassifier(learning_rate= 0.13)
GBDT_tuned=GBDT_tuned.fit(df_X_train_pca,y_train)

y_train_GBDTG_tuned_pred=GBDT_tuned.predict(df_X_train_pca)

In [None]:
y_test_GBDTG_tuned_pred=GBDT_tuned.predict(df_X_test_pca)
#preparing data  for confusion matrix
y_test_GBDTG_tuned_pred_df = pd.DataFrame({'Churn':y_test.values, 'Predicted':y_test_GBDTG_tuned_pred})
y_test_GBDTG_tuned_pred_df['ID'] = y_test.index
y_test_GBDTG_tuned_pred_df.head()

In [None]:
# Confusion matrix 
confusion = metrics.confusion_matrix(y_test_GBDTG_tuned_pred_df.Churn, y_test_GBDTG_tuned_pred_df.Predicted )
print('Confusion Matrix:')
print(confusion)

In [None]:
#ROC Score
"{:2.2}".format(metrics.roc_auc_score(y_test_GBDTG_tuned_pred_df.Churn, y_test_GBDTG_tuned_pred_df.Predicted ))

In [None]:
TP = confusion[1,1] # true positive 
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives
# Let's see the sensitivity of our tuned XGBoost model
TP / float(TP+FN)

In [None]:
print(classification_report(y_test_GBDTG_tuned_pred_df.Churn, y_test_GBDTG_tuned_pred_df.Predicted))

## Identifying important predictors

In [None]:
# let us use logistic regression to identify the important predictors
import statsmodels.api as sm
# Logistic regression model
logm1 = sm.GLM(y_train,(sm.add_constant(X_train)), family = sm.families.Binomial())
logm1.fit().summary()

In [None]:
X_train_rfe=X_train.copy()
X_test_rfe=X_test.copy()

In [None]:
# Recursive Feature Elimination (RFE) for Feature Selection in Python
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
logreg = LogisticRegression()
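# n_features_to_select must be passed by keyword in recent scikit-learn versions
rfe = RFE(logreg, n_features_to_select=15)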
rfe = rfe.fit(X_train_rfe, y_train)

In [None]:
list(zip(X_train_rfe.columns,rfe.support_,rfe.ranking_))

In [None]:
col = X_train_rfe.columns[rfe.support_]
col

In [None]:
# Creating X_train dataframe with RFE selected variables
X_train_rfe = X_train[col]
X_train_rfe.head()

In [None]:
# Logistic regression model
logm1 = sm.GLM(y_train,(sm.add_constant(X_train_rfe)), family = sm.families.Binomial())
logm1.fit().summary()

In [None]:
# Calculate the VIFs for the model
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = pd.DataFrame()
X = X_train_rfe
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif
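
For reference, the VIF of feature j is 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on the remaining features. The NumPy sketch below computes it on synthetic data with an intercept in each auxiliary regression; `vif_with_intercept` is a hypothetical helper. As far as I know, `variance_inflation_factor` regresses on the columns exactly as passed and adds no constant, so its values can differ from these when the features are not centred.

```python
import numpy as np

def vif_with_intercept(X):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the others plus an intercept."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        r_squared = 1.0 - residual.var() / target.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

# Synthetic data, for illustration only: c is nearly a + b, d is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + b + rng.normal(scale=0.1, size=500)
d = rng.normal(size=500)
vifs = vif_with_intercept(np.column_stack([a, b, c, d]))
print(np.round(vifs, 2))
```

The three collinear columns get large VIFs while the independent column stays close to 1, which is the pattern used in the cells that follow to decide which variable to drop.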

In [None]:
#dropping the variable for better analysis
X_train_rfe=X_train_rfe.drop(['total_og_mou_8'],axis=1)

In [None]:
# Logistic regression model
logm1 = sm.GLM(y_train,(sm.add_constant(X_train_rfe)), family = sm.families.Binomial())
logm1.fit().summary()

In [None]:
# Calculate the VIFs for the model
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = pd.DataFrame()
X = X_train_rfe
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
#dropping the variable for better analysis
X_train_rfe=X_train_rfe.drop(['offnet_mou_8'],axis=1)

In [None]:
# Logistic regression model
logm1 = sm.GLM(y_train,(sm.add_constant(X_train_rfe)), family = sm.families.Binomial())
logm1.fit().summary()

In [None]:
# Calculate the VIFs for the model
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = pd.DataFrame()
X = X_train_rfe
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
logm1.fit().params

In [None]:
imp_feat = pd.DataFrame({'Feature':logm1.fit().params.index, 'Coeff':abs(logm1.fit().params.values)})
imp_feat.head(15)

In [None]:
#dropping the constant term for better analysis
imp_feat=imp_feat.drop(0)

In [None]:
#validating the data frame
imp_feat.head(20)

In [None]:
# Plot the absolute coefficients of the selected features as a bar chart
fig = plt.figure(figsize=[20,10])
sns.barplot(data=imp_feat.sort_values(['Coeff']),y='Feature',x='Coeff')
fig.suptitle('Important predictors', fontsize = 20)                  # Plot heading 
plt.xlabel('Coefficients', fontsize = 18)                         # X-label

In [None]:
#the top key features for prediction are
imp_feat.sort_values(['Coeff'],ascending=False).Feature
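
Taking the absolute value of the coefficients ranks features by strength but drops their direction. A possible variant, sketched below with made-up coefficients, keeps the signed value next to the magnitude so that it stays visible whether a higher feature value is associated with higher or lower churn odds. It assumes the fitted parameters are a pandas Series indexed by term name with the intercept called `const`, as `sm.add_constant` names it.

```python
import pandas as pd

# Made-up coefficients, for illustration only; feature names are taken from this notebook.
params = pd.Series({
    "const": -1.20,
    "total_ic_mou_8": -0.90,
    "roam_og_mou_8": 0.35,
    "arpu_8": -0.50,
})

signed = (
    params.drop("const")        # the intercept is not a predictor
    .rename("Coeff")
    .rename_axis("Feature")
    .reset_index()
)
signed["AbsCoeff"] = signed["Coeff"].abs()
signed = signed.sort_values("AbsCoeff", ascending=False).reset_index(drop=True)
print(signed)
```

Comparing magnitudes in this way is only meaningful when the features are on a common scale.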

# Results of various models
| Model Name | Accuracy | Sensitivity |
| --- | --- | --- |
| Default Logistic Reg | 85% | 81% |
| Default Random forest | 96% | 34% |
| Default XGBoost | 97% | 50% |
| Tuned Logistic Reg | 85% | 81% |
| Tuned Random forest | 91% | 51% |
| Tuned XGBoost | 76% | 57% |

### Model Summary:
Of the logistic regression, random forest, and XGBoost models, logistic regression identifies churners best: it has the highest sensitivity (81%) at an accuracy of 85%.
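
The table above shows the tree-based models reaching higher accuracy than logistic regression with much lower sensitivity. A short worked example, using a hypothetical class balance rather than this dataset's figures, shows why accuracy alone can mislead when churners are a small minority.

```python
# Hypothetical class balance, for illustration only (not this dataset's figures).
n_customers = 10_000
n_churners = 800            # assumed 8% churn rate

# A "model" that predicts non-churn for every customer.
tp, fn = 0, n_churners
tn, fp = n_customers - n_churners, 0

accuracy = (tp + tn) / n_customers
sensitivity = tp / (tp + fn)
print(f"accuracy={accuracy:.2%}  sensitivity={sensitivity:.2%}")
```

Under these assumptions the do-nothing model is 92% accurate yet finds no churners, which is why sensitivity is the metric to watch for a retention use case.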

## Inference from EDA
1. Churn rate is low if age on network is high.
2. ARPU drops in the action phase for churned customers.
3. On-network and off-network minutes of usage drop in the action phase for churned customers.
4. Operator T to T incoming/outgoing calls: standard and local minutes of usage drop in the action phase for churned customers.
5. Operator T to other operators' incoming/outgoing calls: standard and local minutes of usage drop in the action phase for churned customers.
6. Operator T to fixed line incoming/outgoing calls: local minutes of usage drop in the action phase for churned customers.
7. Operator T to own call center outgoing calls: local minutes of usage drop in the action phase for churned customers.
8. Total minutes of usage for incoming/outgoing calls drop in the action phase for churned customers.
9. Total number of recharges drops in the action phase for churned customers.
10. Total call recharge amount drops in the action phase for churned customers.
11. Last day recharge amount drops in the action phase for churned customers.
12. Total data recharge amount drops in the action phase for churned customers.
13. 2G/3G data usage drops in the action phase for churned customers.
14. Revenue generated by 2G/3G usage also drops in the action phase for churned customers.
15. Usage and revenue dropped for the sachet package in the good phase for churned customers.
16. Monthly package has dropped in the action phase for churned customers; network availability or package cost might be an issue.

### From Logistic Regression, the most important features are:
1. ph_churn_rech -> The average data and call usage of the customer during the churn phase.
2. total_ic_mou_8 -> Total incoming minutes of usage of the customer during the action phase.
3. arpu_8 -> Average revenue per customer during the action phase.
4. roam_og_mou_8 -> Outgoing roaming minutes of usage of the customer.
5. spl_ic_mou_8, followed by arpu_7, loc_og_t2m_mou_8, sep_vbc_3g, onnet_mou_8, total_ic_mou_7, max_rech_data_8, arpu_6, and std_og_t2m_mou_8.

# Business Insights:
1. The above-mentioned columns are very strong predictors of churn.
2. Most of the customers who churn have very low usage of calls and data during the churn phase, so we can give those customers special offers on STD and ISD rates to retain them.
3. The action phase is critical for retaining a customer, so during the action phase we can reduce roaming charges to retain customers.
4. Customers who recharge with higher amounts are high value customers by definition, so offering them better deals on higher data or call packages than competitors do would help retain more of them.