# Brief overview of the Telecom churn assignment

The dataset contains customer-level information on prepaid customers in the Indian and Southeast Asian markets for a span of four consecutive months: June, July, August and September.

a) June and July are the good phase.

b) August is the action phase. 

c) September is the churn phase.


### A few bases on which churn can be predicted:
Revenue-based churn : customers who generate little revenue and may not make a dent if they churn. Still, check what percentage of customers fall into this group and determine whether there is a pattern. Derive total/average/median revenue values to analyze further.

Usage-based churn : customers who are not using the services. Determine how long they have been inactive to estimate how likely they are to churn.

High-value churn : In the Indian and Southeast Asian markets, approximately 80% of revenue comes from the top 20% of customers (called high-value customers). Thus, if we can reduce churn among high-value customers, we can reduce significant revenue leakage.

Time-based churn :
- Good phase : June, July. Churn cannot be identified yet.
- Action phase : August. Customers at high risk of churn.
- Churn phase : September. Data not available at prediction time. Thus, after tagging churn as 1/0 based on this phase, discard all data corresponding to this phase.

### Data preparation:
Derive new features : Combine a few columns to identify patterns and get more meaning from the dataset.

Identify high-value customers : Keep only customers who have recharged with an amount greater than or equal to X, where X is the 70th percentile of the average recharge amount in the first two months (the good phase).

Tag churners and remove attributes of the churn phase : Based on the data and call usage in the churn phase, identify churned customers, then remove all the attributes corresponding to the churn phase.
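The preparation steps above can be sketched on a toy frame. The column names follow the dataset's `_6`/`_7`/`_9` suffix convention, but the `toy` frame and all its values are made up for illustration, and the churn test assumes usage values are never negative.

```python
import pandas as pd

# Toy data: five customers, made-up values
toy = pd.DataFrame({
    "total_rech_amt_6": [100, 500, 50, 900, 300],
    "total_rech_amt_7": [120, 700, 30, 1100, 100],
    "total_ic_mou_9":   [10.0, 0.0, 5.0, 0.0, 2.0],
    "total_og_mou_9":   [20.0, 0.0, 0.0, 0.0, 1.0],
    "vol_2g_mb_9":      [0.0, 0.0, 0.0, 0.0, 0.0],
    "vol_3g_mb_9":      [0.0, 0.0, 0.0, 3.5, 0.0],
})

# 1. High-value customers: average good-phase recharge >= 70th percentile
avg_rech_good = (toy["total_rech_amt_6"] + toy["total_rech_amt_7"]) / 2
cutoff = avg_rech_good.quantile(0.70)
high_value = toy[avg_rech_good >= cutoff].copy()

# 2. Churn tag: no calls and no data usage at all in the churn phase
usage_cols = ["total_ic_mou_9", "total_og_mou_9", "vol_2g_mb_9", "vol_3g_mb_9"]
high_value["churn"] = (high_value[usage_cols].sum(axis=1) == 0).astype(int)

# 3. Drop every churn-phase attribute so it cannot leak into the features
churn_phase_cols = [c for c in high_value.columns if c.endswith("_9")]
high_value = high_value.drop(columns=churn_phase_cols)
print(high_value)
```

Taking the `.copy()` after filtering avoids pandas warnings when the `churn` column is assigned later.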

### Why are we doing this:

Implement 3 models and come up with the following points: 
- Predict whether a high-value customer will churn or not in the near future (i.e. the churn phase). Knowing this, the company can take action, such as providing special plans or discounts on recharges.
- Identify important variables that are strong predictors of churn

In [1]:
#import all the libraries at one place. 

import numpy as np
import pandas as pd
import warnings

from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import IncrementalPCA
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from IPython.display import Image  
from sklearn.tree import export_graphviz
import pydotplus, graphviz
from sklearn.ensemble import RandomForestClassifier
from statsmodels.stats.outliers_influence import variance_inflation_factor
from io import StringIO  # replaces sklearn.externals.six.StringIO, which newer scikit-learn no longer ships
from sklearn.metrics import classification_report,confusion_matrix

warnings.filterwarnings('ignore')
pd.set_option('display.max_columns',100)
pd.set_option('display.max_rows',100)



Import the data. Set the file path to the location of the CSV before running.

In [2]:
data = pd.read_csv('telecom_churn_data.csv')


Let's understand the structure of our dataframe

In [None]:
data.shape

In [None]:
data.head()

In [None]:
# check for the data types
data.dtypes

A few date-related columns have the object dtype; all the rest are either int64 or float64.

In [None]:
# overall picture of the data present in dataframe
data.agg(['count', 'size', 'nunique'])

In [None]:
#Check for null values
nullData = round((data.isnull().sum()/len(data.index)*100),2).sort_values(ascending=False)

#Check how many columns having more than 70% of null values
nullData.loc[lambda x : x>70]

We can see that a good number of columns consist mostly of null entries. These need to be either imputed or set to 0 before we can proceed with model building.

### Checking for Outliers 

In [None]:
data.describe(percentiles=[0.05,0.1,0.25,0.5,0.75,0.9,0.95,0.99])

Can't make out much from this amount of information. Let's eliminate unwanted data, filter on the top 30% of customers who contribute most of the revenue, and then process the data further.

### Data preparation

Let's remove some columns that are not necessary and then proceed with the next steps.

In [None]:
# make a real copy of the data (plain assignment would only alias it) and process the copy further
CleanedData = data.copy()
CleanedData.describe()

Removing a few columns such as:
- mobile number and circle ID, which won't add any value as they are only identifiers
- date columns, as no derived columns or patterns are drawn from them

In [None]:
# drop identifier and date columns, plus columns whose mean, std, min and max are all the same (constant columns)
CleanedData = CleanedData.drop(['mobile_number','circle_id','loc_og_t2o_mou','std_og_t2o_mou','loc_ic_t2o_mou'
                  ,'last_date_of_month_6','last_date_of_month_7','last_date_of_month_8','last_date_of_month_9'
                  ,'date_of_last_rech_6','date_of_last_rech_7','date_of_last_rech_8','date_of_last_rech_9'
                  ,'date_of_last_rech_data_6','date_of_last_rech_data_7','date_of_last_rech_data_8','date_of_last_rech_data_9'
                               ], axis=1)

In [None]:
# Adding column name with % of null entries against the column

# arpu_3g_6	74.85
# arpu_2g_6	74.85
# arpu_3g_7	74.43
# arpu_2g_7	74.43
# arpu_2g_9	74.08
# arpu_3g_9	74.08
# arpu_3g_8	73.66
# arpu_2g_8	73.66

CleanedData.drop(['arpu_3g_6','arpu_3g_7','arpu_3g_8','arpu_3g_9'],axis=1,inplace=True)
CleanedData.drop(['arpu_2g_6','arpu_2g_7','arpu_2g_8','arpu_2g_9'],axis=1,inplace=True)

In [None]:
# Adding column name with % of null entries against the column

# max_rech_data_6	74.85
# max_rech_data_7	74.43
# max_rech_data_9	74.08
# max_rech_data_8	73.66

CleanedData.drop(['max_rech_data_6','max_rech_data_7','max_rech_data_8','max_rech_data_9'],axis=1,inplace=True)

In [None]:
# Adding column name with % of null entries against the column

# night_pck_user_6	74.85
# night_pck_user_7	74.43
# night_pck_user_9	74.08
# night_pck_user_8	73.66

CleanedData.drop(['night_pck_user_6','night_pck_user_7','night_pck_user_8','night_pck_user_9'],axis=1,inplace=True)

In [None]:
# Adding column name with % of null entries against the column

# count_rech_2g_6	74.85
# count_rech_3g_6	74.85
# count_rech_2g_7	74.43
# count_rech_3g_7	74.43
# count_rech_3g_9	74.08
# count_rech_2g_9	74.08
# count_rech_3g_8	73.66
# count_rech_2g_8	73.66

CleanedData.drop(['count_rech_2g_6','count_rech_2g_7','count_rech_2g_8','count_rech_2g_9'],axis=1,inplace=True)

In [None]:
# Adding column name with % of null entries against the column

# av_rech_amt_data_6	74.85
# av_rech_amt_data_7	74.43
# av_rech_amt_data_9	74.08
# av_rech_amt_data_8	73.66

#Fill empty entries with 0
CleanedData[['av_rech_amt_data_6','av_rech_amt_data_7','av_rech_amt_data_8','av_rech_amt_data_9']] = CleanedData[['av_rech_amt_data_6','av_rech_amt_data_7','av_rech_amt_data_8','av_rech_amt_data_9']].fillna(0)

In [None]:
# Adding column name with % of null entries against the column

# count_rech_2g_6	74.85
# count_rech_3g_6	74.85
# count_rech_2g_7	74.43
# count_rech_3g_7	74.43
# count_rech_3g_9	74.08
# count_rech_2g_9	74.08
# count_rech_3g_8	73.66
# count_rech_2g_8	73.66

CleanedData.drop(['count_rech_3g_6','count_rech_3g_7','count_rech_3g_8','count_rech_3g_9'],axis=1,inplace=True)

In [None]:
# Adding column name with % of null entries against the column

# total_rech_data_6	74.85
# total_rech_data_7	74.43
# total_rech_data_9	74.08
# total_rech_data_8	73.66

CleanedData[['total_rech_data_6','total_rech_data_7','total_rech_data_8','total_rech_data_9']] = CleanedData[['total_rech_data_6','total_rech_data_7','total_rech_data_8','total_rech_data_9']].fillna(0)

In [None]:
# Removing few more columns with null entries

CleanedData.drop(['std_og_t2c_mou_6','std_og_t2c_mou_7','std_og_t2c_mou_8','std_og_t2c_mou_9'],axis=1,inplace=True)
CleanedData.drop(['std_ic_t2o_mou_6','std_ic_t2o_mou_7','std_ic_t2o_mou_8','std_ic_t2o_mou_9'],axis=1,inplace=True)

In [None]:
# Identify mode value for categorical column like FB User before imputing

CleanedData[['fb_user_6','fb_user_7','fb_user_8','fb_user_9']].mode()

In [None]:
#Based on the mode value, impute 1 for empty entries in the facebook user columns

CleanedData[['fb_user_6','fb_user_7','fb_user_8','fb_user_9']]=CleanedData[['fb_user_6','fb_user_7','fb_user_8','fb_user_9']].fillna(1)

In [None]:
CleanedData.shape

In [None]:
#Check again for null values
nullData = round((CleanedData.isnull().sum()/len(CleanedData.index)*100),2).sort_values(ascending=False)

#Check how many columns having more than 10% of null values
nullData.loc[lambda x : x>10]

No column has more than 10% null values now.

In [None]:
#print to see how the data looks so far
CleanedData.head(4)

In [None]:
CleanedData.isnull().sum().sort_values(ascending=False)

We still have a few null/missing values that need to be processed. Going by the column definitions, the remaining columns with null/empty values seem important and could be essential factors for the model, so let's impute them rather than drop those columns.

## Impute values for missing entries:

#### All the variables with missing values are continuous, so we impute them instead of deleting rows. Because churned customers make up only about 5-10% of the data, deleting rows with missing entries could remove churners and negatively impact the model.

#### Let us use some imputers to impute the data. We can use SimpleImputer for univariate imputation and IterativeImputer for multivariate imputation
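The difference between the two imputers can be seen on made-up data. In this sketch `toy`, `missing_idx` and `truth` are hypothetical names; column `b` is built to depend on column `a`, which is the situation where a multivariate imputer is expected to help.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({"a": x, "b": 2 * x + rng.normal(scale=0.1, size=200)})

# Knock out 20 values of column b and remember the true values
missing_idx = rng.choice(200, size=20, replace=False)
truth = toy.loc[missing_idx, "b"].copy()
toy.loc[missing_idx, "b"] = np.nan

median_filled = pd.DataFrame(
    SimpleImputer(strategy="median").fit_transform(toy), columns=toy.columns)
iter_filled = pd.DataFrame(
    IterativeImputer(random_state=0).fit_transform(toy), columns=toy.columns)

# Mean absolute error on the knocked-out cells
err_median = (median_filled.loc[missing_idx, "b"] - truth).abs().mean()
err_iter = (iter_filled.loc[missing_idx, "b"] - truth).abs().mean()
print(f"median: {err_median:.3f}  iterative: {err_iter:.3f}")
```

The median imputer fills every gap in a column with the same value, while the iterative imputer predicts each gap from the other columns, at the cost of a much longer run time on wide data.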

In [None]:
imp = SimpleImputer(strategy="median")

In [None]:
data_cleaned=pd.DataFrame(imp.fit_transform(CleanedData),columns=CleanedData.columns)

In [None]:
imputer = IterativeImputer(random_state=0)

In [None]:
# multivariate imputation with the IterativeImputer defined above
data_cleaned_bayesian=pd.DataFrame(imputer.fit_transform(CleanedData),columns=CleanedData.columns)

In [None]:
df=round((data_cleaned.isnull().sum()/len(data_cleaned.index)*100),2).sort_values(ascending=False)
df[df>0]

In [None]:
data.columns

In [None]:
#print the data to see how the data processed looks so far
data_cleaned

In [None]:
data_cleaned_bayesian.isnull().sum().sort_values(ascending=False)

In [None]:
#verify if there are any more null values that needs to be imputed

df=round((data_cleaned_bayesian.isnull().sum()/len(data_cleaned_bayesian.index)*100),2).sort_values(ascending=False)
df[df>0]

#### Let's derive few new columns based off the existing columns to add better meaning to the dataset.

In [None]:
# Getting average of total recharge during the good phase
data_cleaned_bayesian['sum_total_recharge_good'] = (data_cleaned_bayesian.total_rech_amt_6 + data_cleaned_bayesian.total_rech_amt_7)/2

In [None]:
# Get percentile data of the average good-phase recharge derived above
data_cleaned_bayesian['sum_total_recharge_good'].describe(percentiles=[0.25,0.50,0.75,0.90,0.95,0.99])

#### We can create new "variance" variables that capture the difference between the good phase and the action phase for various columns. A change in customer behaviour between these phases can be a good indicator of churn.
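Note that these "variance" columns are plain differences (good-phase mean minus action-phase value), not statistical variances. The pattern can be captured once in a small helper; `good_vs_action_delta` is a hypothetical name and the numbers are made up.

```python
import pandas as pd

def good_vs_action_delta(df, base):
    """Mean of months 6 and 7 minus month 8 for columns named <base>_6/_7/_8."""
    good = (df[f"{base}_6"] + df[f"{base}_7"]) / 2
    return good - df[f"{base}_8"]

toy = pd.DataFrame({
    "total_og_mou_6": [100.0, 200.0],
    "total_og_mou_7": [120.0, 180.0],
    "total_og_mou_8": [110.0, 20.0],
})
toy["total_og_mou_variance"] = good_vs_action_delta(toy, "total_og_mou")
print(toy["total_og_mou_variance"].tolist())
```

A large positive value means usage dropped sharply in the action phase, as for the second toy customer.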

In [None]:
# Good-phase average of vbc_3g minus the action-phase (August) value
data_cleaned_bayesian['vbc_3g_variance'] = (data_cleaned_bayesian.jun_vbc_3g + data_cleaned_bayesian.jul_vbc_3g)/2 - data_cleaned_bayesian.aug_vbc_3g

In [None]:
# Get percentile data of the vbc_3g variance derived above
data_cleaned_bayesian['vbc_3g_variance'].describe(percentiles=[0.25,0.50,0.75,0.90,0.95,0.99])

In [None]:
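# Getting the top 30% of customers (average good-phase recharge >= 70th percentile), the high-value generators for the company
# .copy() gives an independent frame so later column assignments do not hit a view of the original
high_value_data=data_cleaned_bayesian[data_cleaned_bayesian.sum_total_recharge_good >= data_cleaned_bayesian.sum_total_recharge_good.quantile(.70)].copy()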

In [None]:
# print the shape of the data processed so far
high_value_data.shape

In [None]:
# lets review how the data looks like so far
high_value_data.head(5)

#### Let's tackle the CHURN column now. A customer who still uses either mobile data or call services in the 9th month (the churn phase) has not churned. If all of these usage columns are 0, the customer has churned.

#### First, print the churn-phase columns to take stock of the current data

In [None]:
churn_columns = [col for col in high_value_data.columns if '_9' in col]
churn_columns

In [None]:
# Set the churn column based on the mobile data and call record data
high_value_data['churn'] = np.where((((high_value_data.total_ic_mou_9 == 0) & (high_value_data.total_og_mou_9 == 0)) & ((high_value_data.vol_3g_mb_9 == 0) & (high_value_data.vol_2g_mb_9 == 0))),1,0)

In [None]:
# Now that the churn column is populated, drop all the columns corresponding to the churn phase
# so that churn-phase information cannot leak into the features
high_value_data.drop(churn_columns,axis=1,inplace=True)
high_value_data.drop(['sep_vbc_3g'],axis=1,inplace=True)

In [None]:
high_value_data[high_value_data['churn'] ==1]

In [None]:
# Lets print the shape of the dataframe processed so far
high_value_data.shape

#### Deriving some more variables for our analysis: the average of the 6th and 7th months minus the 8th-month value, stored as a "variance" column, to determine whether there is any pattern.

In [None]:
# capturing avg of incoming mou and deriving variance value

high_value_data['total_ic_mou_good'] = (high_value_data.total_ic_mou_6 + high_value_data.total_ic_mou_7)/2
high_value_data['total_ic_mou_variance']=high_value_data.total_ic_mou_good - high_value_data.total_ic_mou_8

In [None]:
# capturing avg of outgoing mou and deriving variance value

high_value_data['total_og_mou_good'] = (high_value_data.total_og_mou_6 + high_value_data.total_og_mou_7)/2
high_value_data['total_og_mou_variance']=high_value_data.total_og_mou_good - high_value_data.total_og_mou_8

In [None]:
# capturing avg of 3g mb volume data and deriving variance value

high_value_data['vol_3g_mb_good'] = (high_value_data.vol_3g_mb_6 + high_value_data.vol_3g_mb_7)/2
high_value_data['vol_3g_mb_variance']=high_value_data.vol_3g_mb_good - high_value_data.vol_3g_mb_8

In [None]:
# capturing avg of 2g mb volume data and deriving variance value

high_value_data['vol_2g_mb_good'] = (high_value_data.vol_2g_mb_6 + high_value_data.vol_2g_mb_7)/2
high_value_data['vol_2g_mb_variance']=high_value_data.vol_2g_mb_good - high_value_data.vol_2g_mb_8

In [None]:
# capturing both 3g and 2g variance value as one parameter and dropping actual variance column to avoid multi-collinearity

high_value_data['vol_data_mb_variance'] = high_value_data['vol_3g_mb_variance'] + high_value_data['vol_2g_mb_variance']
high_value_data.drop(['vol_3g_mb_variance','vol_2g_mb_variance'],axis=1,inplace=True)

In [None]:
# capturing only t2c mou variance as outgoing data already captured in another parameter

high_value_data['loc_og_t2c_mou_variance'] = (high_value_data.loc_og_t2c_mou_6 + high_value_data.loc_og_t2c_mou_7)/2 - high_value_data.loc_og_t2c_mou_8

In [None]:
# capturing total recharge variance

high_value_data['sum_total_recharge_variance']=high_value_data.sum_total_recharge_good - high_value_data.total_rech_amt_8

In [None]:
# capturing roaming mou variance

high_value_data['roam_mou_variance'] = (high_value_data.roam_ic_mou_6 + high_value_data.roam_ic_mou_7 + high_value_data.roam_og_mou_6 + high_value_data.roam_og_mou_7)/2 - ( high_value_data.roam_ic_mou_8 + high_value_data.roam_og_mou_8)

In [None]:
plt.figure(figsize=(16, 6))
sns.scatterplot(x='aon', y='roam_mou_variance',hue='churn', data=high_value_data)

#### We can see that customers who are new to the network fall more often into the churn category.

In [None]:
plt.figure(figsize=(14, 5))
sns.scatterplot(x='total_ic_mou_good', y='total_og_mou_good',hue='churn', data=high_value_data)

#### Plotting customers' incoming against outgoing minutes of usage shows no obvious pattern for churn.

In [None]:
plt.figure(figsize=(20, 15))
sns.scatterplot(x='aon', y='vol_2g_mb_good',hue='churn', data=high_value_data)

In [None]:
plt.figure(figsize=(20, 15))
sns.scatterplot(x='aon', y='vol_3g_mb_good',hue='churn', data=high_value_data)

#### We can see that customers who are new to the network use more mobile data (3g and 2g) and also show a higher percentage of churn. With longer time on the network, both mobile data usage and the churn rate go down.

In [None]:
plt.figure(figsize=(16, 6)) 

# plotting sum_total_recharge_variance across aon
sns.barplot(x='aon', y='sum_total_recharge_variance', data=high_value_data)
plt.show()

#### We can see that recharge variance shows no pattern with age on network; spikes continue throughout.

In [None]:
sns.set_style("whitegrid")

plt.figure(figsize=(16, 8))
plt.title('Total incoming mou details')
sns.boxplot( x='total_ic_mou_6',orient='h',  data=high_value_data)

In [None]:
high_value_data['sum_total_recharge_good'].describe(percentiles=[0.25,0.50,0.75,0.90,0.95,0.99])

In [None]:
sns.set_style("whitegrid")
plt.figure(figsize=(16, 8))
plt.title('Total recharge data')
sns.boxplot( x='sum_total_recharge_good',orient='h',  data=high_value_data)

#### There are a lot of outliers, so let us do some outlier treatment. As we used sum_total_recharge_good to select high-value customers, we can use the same column to make sure there are no outliers among them.

In [None]:
high_value_data=high_value_data[high_value_data.sum_total_recharge_good < high_value_data.sum_total_recharge_good.quantile(.99)]

In [None]:
sns.set_style("whitegrid")

plt.figure(figsize=(16, 8))
plt.title('Total recharge data')
sns.boxplot( x='sum_total_recharge_good',orient='h',  data=high_value_data)

#### There are still some outliers. We will cut at the 95th percentile, as that eliminates them.

In [None]:
high_value_data=high_value_data[high_value_data.sum_total_recharge_good < high_value_data.sum_total_recharge_good.quantile(.95)]
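Filtering rows by quantile removes customers outright, and applying the 99% filter followed by the 95% filter removes about 6% of the high-value rows in total, some of which may be churners. Capping is a common alternative that keeps every customer. A minimal sketch on made-up values (the series `s` is hypothetical):

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(1, 101, dtype=float))  # made-up values 1..100

# Trimming: drop rows at or above the 95th percentile (what the filter above does)
trimmed = s[s < s.quantile(0.95)]

# Capping: keep every row but limit extreme values to the 95th percentile
capped = s.clip(upper=s.quantile(0.95))

print(len(trimmed), len(capped), capped.max())
```

Which of the two is appropriate depends on whether the extreme recharge amounts are data errors or genuine heavy spenders.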

In [None]:
sns.set_style("whitegrid")

plt.figure(figsize=(16, 8))
plt.title('Total outgoing mou details')
sns.boxplot( x='total_og_mou_6',orient='h',  data=high_value_data)

In [None]:
# Print churn data derived so far
high_value_data[high_value_data['churn']==1]

In [None]:
# Let's see the correlation matrix 
plt.figure(figsize = (20,10)) # Size of the figure
sns.heatmap(high_value_data.corr(),annot = True)

In [None]:
# take the churn variable as y and remove it from actual data set to create X and y.
y = high_value_data['churn']

X = high_value_data.drop(['churn'],axis=1)

In [None]:
scaler = StandardScaler()
columns = X.columns

X[columns] = scaler.fit_transform(X[columns])
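The cell above fits the scaler on all rows before the train/test split, so test rows influence the mean and standard deviation used for scaling. A common alternative, sketched here on made-up data (`X_toy` and `y_toy` are hypothetical), is to split first and fit the scaler on the training rows only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_toy = rng.integers(0, 2, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.7, random_state=100)

scaler = StandardScaler().fit(X_tr)    # statistics come from the training rows only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)   # test rows reuse the training mean and std
```

With a dataset of this size the effect on the scores is probably small, but the split-first order keeps the test set untouched by any fitting step.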

## Modelling : Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(class_weight = 'balanced')
from sklearn.feature_selection import RFE
rfe = RFE(logreg, n_features_to_select=12)         # running RFE with 12 variables as output (keyword required in current scikit-learn)
rfe = rfe.fit(X,y)
print(rfe.support_)           # Printing the boolean results
print(rfe.ranking_)   

#### As the features in the dataset are highly correlated, let us narrow them down to around 5-10 important variables by checking correlation (VIF) and p-values.

In [None]:
# ranking the variables.
list(zip(X.columns,rfe.support_,rfe.ranking_))

In [None]:
columns_needed_automated = X.columns[rfe.support_]

In [None]:
columns_needed_automated
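How `support_` and `ranking_` relate can be checked on synthetic data. This is a sketch, not the notebook's data: `X_toy`, `y_toy` and the choice of 4 features are made up.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X_toy, y_toy = make_classification(n_samples=300, n_features=10, n_informative=3,
                                   n_redundant=0, random_state=0)

selector = RFE(LogisticRegression(class_weight="balanced", max_iter=1000),
               n_features_to_select=4)
selector.fit(X_toy, y_toy)

print(selector.support_)   # True for the 4 retained features
print(selector.ranking_)   # 1 for retained features, larger = eliminated earlier
```

RFE repeatedly refits the estimator and drops the weakest feature, so the ranking reflects the order of elimination rather than a standalone importance score.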

In [None]:
# split into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    train_size=0.7,
                                                    test_size = 0.3, random_state=100)

In [None]:
# Let's run the model using the selected variables
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
logsk = LogisticRegression(C=1e9,class_weight='balanced')
logsk.fit(X_train[columns_needed_automated], y_train)

In [None]:
# .copy() so the in-place drops below act on an independent frame
X_train_rfe=X_train[columns_needed_automated].copy()

#### Check for p-value and VIF

In [None]:
import statsmodels.api as sm

#Comparing the model with StatsModels
logm4 = sm.GLM(y_train,(sm.add_constant(X_train_rfe)), family = sm.families.Binomial())
modres = logm4.fit()
modres.summary()

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train_rfe.columns
vif['VIF'] = [variance_inflation_factor(X_train_rfe.values, i) for i in range(X_train_rfe.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif
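The VIF for feature j is 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on the remaining features. Since the VIF cell is rerun after each dropped column, it can be wrapped in a helper. The sketch below uses a hypothetical `vif_table` function built on scikit-learn's `LinearRegression` instead of statsmodels. It fits an intercept in each auxiliary regression, whereas `variance_inflation_factor` regresses on exactly the columns it is given, so the two can differ when columns are not centred.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def vif_table(df):
    """Return features sorted by VIF_j = 1 / (1 - R_j^2)."""
    rows = []
    for col in df.columns:
        others = df.drop(columns=col)
        r2 = LinearRegression().fit(others, df[col]).score(others, df[col])
        rows.append((col, 1.0 / (1.0 - r2)))
    out = pd.DataFrame(rows, columns=["Features", "VIF"])
    return out.sort_values("VIF", ascending=False).reset_index(drop=True)

# Made-up data: b is nearly a copy of a, c is independent
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=500),
    "c": rng.normal(size=500),
})
print(vif_table(toy).round(2))
```

Features `a` and `b` get a high VIF because each is almost fully explained by the other, while `c` stays close to 1.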

In [None]:
X_train_rfe.drop(['total_ic_mou_8'],axis=1,inplace=True)

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train_rfe.columns
vif['VIF'] = [variance_inflation_factor(X_train_rfe.values, i) for i in range(X_train_rfe.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
X_train_rfe.drop(['total_ic_mou_good'],axis=1,inplace=True)

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train_rfe.columns
vif['VIF'] = [variance_inflation_factor(X_train_rfe.values, i) for i in range(X_train_rfe.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
import statsmodels.api as sm

#Comparing the model with StatsModels
logm4 = sm.GLM(y_train,(sm.add_constant(X_train_rfe)), family = sm.families.Binomial())

modres = logm4.fit()
modres.summary()

In [None]:
X_train_rfe.drop(['total_ic_mou_6'],axis=1,inplace=True)

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train_rfe.columns
vif['VIF'] = [variance_inflation_factor(X_train_rfe.values, i) for i in range(X_train_rfe.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
X_train_rfe.drop(['total_og_mou_8'],axis=1,inplace=True)

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train_rfe.columns
vif['VIF'] = [variance_inflation_factor(X_train_rfe.values, i) for i in range(X_train_rfe.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
import statsmodels.api as sm

#Comparing the model with StatsModels
logm4 = sm.GLM(y_train,(sm.add_constant(X_train_rfe)), family = sm.families.Binomial())

modres = logm4.fit()
modres.summary()

In [None]:
X_train_rfe.drop(['og_others_8'],axis=1,inplace=True)

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train_rfe.columns
vif['VIF'] = [variance_inflation_factor(X_train_rfe.values, i) for i in range(X_train_rfe.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
import statsmodels.api as sm

#Comparing the model with StatsModels
logm4 = sm.GLM(y_train,(sm.add_constant(X_train_rfe)), family = sm.families.Binomial())
modres = logm4.fit()
modres.summary()

#### After removing the variables that are highly correlated or have a high p-value, we are left with 7 variables.

In [None]:
columns_needed_automated=X_train_rfe.columns
columns_needed_automated

### From the above columns we can see that, according to logistic regression, the average recharge amount for data and the minutes of usage for incoming calls and for local, STD and ISD outgoing calls are important.
### Importantly, all of these variables come from the 8th month (the action month), which tells us that this month's data is very important for predicting churn.

In [None]:
X_test[columns_needed_automated].shape

In [None]:
# Let's run the model using the selected variables
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
logsk = LogisticRegression(C=1e9,class_weight='balanced')
logsk.fit(X_train[columns_needed_automated], y_train)


In [None]:
# Predicted probabilities
y_pred = logsk.predict_proba(X_test[columns_needed_automated])
# Converting y_pred, which is an array, to a dataframe
y_pred_df = pd.DataFrame(y_pred)
# Keeping only column 1, the predicted probability of churn
y_pred_1 = y_pred_df.iloc[:,[1]]
# Let's see the head
y_pred_1.head()

In [None]:
# Converting y_test to dataframe
y_test_df = pd.DataFrame(y_test)
y_test_df.head()

In [None]:
y_test_df.shape

In [None]:
y_pred_1.shape

In [None]:
# Keeping the original row index as CustID
y_test_df['CustID'] = y_test_df.index

# Removing index for both dataframes to append them side by side 
y_pred_1.reset_index(drop=True, inplace=True)

y_test_df.reset_index(drop=True, inplace=True)

# Appending y_test_df and y_pred_1
y_pred_final = pd.concat([y_test_df,y_pred_1],axis=1)

# Renaming the column 
y_pred_final= y_pred_final.rename(columns={ 1 : 'Churn_Prob'})
y_pred_final


In [None]:
# Creating new column 'predicted' with 1 if Churn_Prob>0.5 else 0
y_pred_final['predicted'] = y_pred_final.Churn_Prob.map( lambda x: 1 if x > 0.5 else 0)

# Let's see the head
y_pred_final.head()

In [None]:
from sklearn import metrics

In [None]:
# Confusion matrix 
confusion = metrics.confusion_matrix( y_pred_final.churn, y_pred_final.predicted )
confusion

In [None]:
#Let's check the overall accuracy.
metrics.accuracy_score(y_pred_final.churn, y_pred_final.predicted)

In [None]:
print(classification_report(y_pred_final.churn, y_pred_final.predicted))
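Sensitivity, which the conclusions rely on, is the recall of class 1 in the report above. It can also be read directly off the confusion matrix. A small sketch with made-up labels and predictions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])   # made-up labels
y_hat  = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])   # made-up predictions

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

sensitivity = tp / (tp + fn)   # share of churners caught (recall of class 1)
specificity = tn / (tn + fp)   # share of non-churners correctly left alone
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(sensitivity, specificity, accuracy)
```

Because churners are a small minority, accuracy can look good even when sensitivity is poor, which is why the two are reported separately.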

In [None]:
def draw_roc( actual, probs ):
    fpr, tpr, thresholds = metrics.roc_curve( actual, probs,
                                              drop_intermediate = False )
    auc_score = metrics.roc_auc_score( actual, probs )
    plt.figure(figsize=(6, 6))
    plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score )
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc="lower right")
    plt.show()

    return fpr, tpr, thresholds

In [None]:
# the ROC curve needs the predicted probabilities, not the 0/1 predictions
draw_roc(y_pred_final.churn, y_pred_final.Churn_Prob)
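`roc_curve` ranks customers by score, so it is meant to receive predicted probabilities rather than 0/1 predictions. A small sketch on made-up scores shows how much information the hard labels throw away; `probs` and `labels` are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
probs  = np.array([0.10, 0.30, 0.45, 0.60, 0.40, 0.55, 0.70, 0.90])  # made-up scores
labels = (probs > 0.5).astype(int)                                    # 0.5 cut-off

auc_probs = roc_auc_score(y_true, probs)     # uses the full ranking
auc_labels = roc_auc_score(y_true, labels)   # only one operating point survives

n_points_probs = len(roc_curve(y_true, probs)[0])
n_points_labels = len(roc_curve(y_true, labels)[0])
print(auc_probs, auc_labels, n_points_probs, n_points_labels)
```

With hard labels the curve collapses to a single corner between (0, 0) and (1, 1), so both the plot and the area describe one threshold instead of the whole model.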

In [None]:

"{:2.2f}".format(metrics.roc_auc_score(y_pred_final.churn, y_pred_final.Churn_Prob))

## PCA model implementation:

In [None]:
pca = PCA(random_state=42)

In [None]:
pca.fit(X_train)

In [None]:
pca.components_

In [None]:
pca.explained_variance_ratio_

In [None]:
var_cumu = np.cumsum(pca.explained_variance_ratio_)

In [None]:
fig = plt.figure(figsize=[16,12])
plt.vlines(x=15, ymax=1, ymin=0, colors="r", linestyles="--")
plt.hlines(y=0.95, xmax=30, xmin=0, colors="g", linestyles="--")
plt.plot(var_cumu)
plt.ylabel("Cumulative variance explained")
plt.show()
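Instead of reading the component count off the plot, it can be computed from the cumulative variance. The sketch below uses made-up data with three underlying factors; `X_toy` and the `_toy` names are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
latent = rng.normal(size=(500, 3))             # 3 underlying factors
mixing = rng.normal(size=(3, 20))
X_toy = latent @ mixing + 0.05 * rng.normal(size=(500, 20))  # 20 correlated columns

pca_toy = PCA(random_state=42).fit(X_toy)
var_cumu_toy = np.cumsum(pca_toy.explained_variance_ratio_)

# First index where cumulative variance reaches 95%, converted to a count
n_components_95 = int(np.argmax(var_cumu_toy >= 0.95) + 1)
print(n_components_95)

# Passing a float asks PCA to make the same choice itself
pca_auto = PCA(n_components=0.95, svd_solver="full").fit(X_toy)
print(pca_auto.n_components_)
```

The float form is the same mechanism as `PCA(0.9)` used later in this notebook, with a 90% target instead of 95%.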

In [None]:
pca_final = IncrementalPCA(n_components=16)

In [None]:
df_train_pca = pca_final.fit_transform(X_train)

In [None]:
df_train_pca.shape

In [None]:
corrmat = np.corrcoef(df_train_pca.transpose())

In [None]:
corrmat.shape

#### Lets draw heatmap to get better picture of the data

In [None]:
plt.figure(figsize=[15,15])
sns.heatmap(corrmat, annot=True)

In [None]:
df_test_pca = pca_final.transform(X_test)
df_test_pca.shape

In [None]:
learner_pca = LogisticRegression(class_weight='balanced')

In [None]:
model_pca = learner_pca.fit(df_train_pca, y_train)

In [None]:
pred_probs_test = model_pca.predict_proba(df_test_pca)

In [None]:
"{:2.2}".format(metrics.roc_auc_score(y_test, pred_probs_test[:,1]))

In [None]:
pca_again = PCA(0.9)

In [None]:
df_train_pca2 = pca_again.fit_transform(X_train)

In [None]:
df_train_pca2.shape


In [None]:
learner_pca2 = LogisticRegression(class_weight='balanced')

In [None]:
model_pca2 = learner_pca2.fit(df_train_pca2, y_train)

In [None]:
df_test_pca2 = pca_again.transform(X_test)

In [None]:
df_test_pca2.shape

In [None]:
pred_probs_test2 = model_pca2.predict_proba(df_test_pca2)[:,1]

In [None]:
pred_probs_test2

In [None]:
"{:2.2}".format(metrics.roc_auc_score(y_test, pred_probs_test2))

## As we can see, with PCA comparable results can be obtained with very little manual work and execution time. This approach is preferred when processing a huge dataset.

## Decision Tree modelling:

In [None]:
# Create the parameter grid 
param_grid = {
    'max_depth': range(5, 15, 5),
    'min_samples_leaf': range(50, 150, 50),
    'min_samples_split': range(50, 150, 50),
    'criterion': ["entropy", "gini"]
}

n_folds = 5

# Instantiate the grid search model
dtree = DecisionTreeClassifier(class_weight ='balanced')
grid_search = GridSearchCV(estimator = dtree, param_grid = param_grid, 
                          cv = n_folds, verbose = 1)

# Fit the grid search to the data
grid_search.fit(X_train,y_train)

In [None]:
# cv results
cv_results = pd.DataFrame(grid_search.cv_results_)
cv_results

In [None]:
# printing the optimal accuracy score and hyperparameters
print("best accuracy", grid_search.best_score_)
print(grid_search.best_estimator_)
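`GridSearchCV` ranks candidates by accuracy for classifiers unless `scoring` is set. Since the conclusions prioritise sensitivity, one option is to search on recall instead. A sketch on synthetic, imbalanced data (all names and grid values here are made up):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Imbalanced toy data: roughly 10% positives, standing in for churners
X_toy, y_toy = make_classification(n_samples=600, n_features=8,
                                   weights=[0.9, 0.1], random_state=0)

search = GridSearchCV(
    estimator=DecisionTreeClassifier(class_weight="balanced", random_state=0),
    param_grid={"max_depth": [3, 5], "min_samples_leaf": [10, 30]},
    scoring="recall",   # rank candidates by sensitivity instead of accuracy
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_toy, y_toy)
print(search.best_params_, round(search.best_score_, 3))
```

With `scoring="recall"`, `best_score_` is the mean cross-validated recall, so it should not be labelled as accuracy when printed.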

In [None]:
# model with optimal hyperparameters
clf_gini = DecisionTreeClassifier(criterion = 'entropy', 
                                  class_weight ='balanced',
                                  random_state = 100,
                                  max_depth=5, 
                                  min_samples_leaf=50,
                                  min_samples_split=50)
clf_gini.fit(X_train, y_train)

In [None]:
X_train.shape

In [None]:
# Feature names, in the same column order the tree was trained on
features = list(X_train.columns)
features

In [None]:
# accuracy score
clf_gini.score(X_train,y_train)

In [None]:
# accuracy score
clf_gini.score(X_test,y_test)

In [None]:
# plotting the tree
dot_data = StringIO()  
export_graphviz(clf_gini, out_file=dot_data,feature_names=features,filled=True,rounded=True)

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
Image(graph.create_png())
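If graphviz or pydotplus is not installed, the same tree can be inspected as plain text with `sklearn.tree.export_text`. A sketch on synthetic data, with hypothetical feature names:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X_toy, y_toy = make_classification(n_samples=200, n_features=4, n_informative=2,
                                   n_redundant=0, random_state=0)
names = ["f0", "f1", "f2", "f3"]   # hypothetical feature names

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_toy, y_toy)
rules = export_text(tree, feature_names=names)
print(rules)
```

Each indented line is one split or leaf, which makes the rules easy to paste into a report.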

In [None]:
# classification metrics

y_pred = clf_gini.predict(X_test)
print(classification_report(y_test, y_pred))

In [None]:
# confusion matrix
print(confusion_matrix(y_test,y_pred))

## Random Forest modelling:

In [None]:
# Create the parameter grid based on the results of random search 
param_grid = {
    'max_depth': [5, 15],
    'min_samples_leaf': range(30, 150, 50),
    'min_samples_split': range(30, 150, 50),
    'n_estimators': [20,50,100,200], 
    'max_features': [5, 10]
}
# Create a base model
rf = RandomForestClassifier(class_weight='balanced')
# Instantiate the grid search model
grid_search = GridSearchCV(estimator = rf, param_grid = param_grid, 
                          cv = 5, n_jobs = -1,verbose = 1)

In [None]:
# Fit the grid search to the data
grid_search.fit(X_train, y_train)

In [None]:
print('We can get accuracy of',grid_search.best_score_,'using',grid_search.best_params_)

In [None]:
# model with the best hyperparameters
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(bootstrap=True,
                             class_weight='balanced',
                             max_depth=15,
                             min_samples_leaf=30, 
                             min_samples_split=30,
                             max_features=10,
                             n_estimators=20)

In [None]:
rfc.fit(X_train,y_train)

In [None]:
# predict
predictions = rfc.predict(X_test)

In [None]:
# evaluation metrics
from sklearn.metrics import classification_report,confusion_matrix
print(classification_report(y_test,predictions))

In [None]:
print(confusion_matrix(y_test,predictions))
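One of the stated goals is to identify strong predictors of churn. A fitted random forest exposes `feature_importances_` for this. A sketch on synthetic data, where the column names are hypothetical and only the first two columns carry signal by construction:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# shuffle=False keeps the informative columns first
X_toy, y_toy = make_classification(n_samples=400, n_features=6, n_informative=2,
                                   n_redundant=0, n_clusters_per_class=1,
                                   shuffle=False, random_state=0)
cols = [f"feat_{i}" for i in range(6)]   # hypothetical names

forest = RandomForestClassifier(n_estimators=50, class_weight="balanced",
                                random_state=0)
forest.fit(X_toy, y_toy)

importances = pd.Series(forest.feature_importances_, index=cols)
importances = importances.sort_values(ascending=False)
print(importances.round(3))
```

These impurity-based importances sum to 1 and tend to be split among correlated features, so they are best read as a ranking rather than as exact effect sizes.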

# Models tried: 

### Logistic Regression : Accuracy - 75%, Sensitivity - 85%
### PCA with Logistic Regression : Accuracy - 87%
### Decision Tree : Accuracy - 83%, Sensitivity - 85%
### Random Forest : Accuracy - 92%, Sensitivity - 76%

### Conclusions:

1. Logistic Regression : The logistic regression model takes a lot of time to run. Its accuracy is only decent, but it does well on sensitivity (recall).
Given that it costs 5-10 times more to acquire a new customer than to retain an existing one, customer retention has become even more important than customer acquisition. Hence recall is vital: correctly flagging churners matters more than correctly flagging non-churners.
The worst thing about this model is its handling of correlation: a lot of features had to be removed MANUALLY because they were correlated.
Accuracy is also important and should be more than 75%, so any model we choose should do better than logistic regression in terms of accuracy and sensitivity.

2. Random Forest : The accuracy is very good at 92%, but the model takes a lot of time to run and its sensitivity is lower than that of logistic regression, so it identifies churned customers less well than the other models. We can use it where sensitivity is not the priority and high accuracy is. These numbers were obtained after trying many combinations of the hyperparameters.

3. Decision Tree : It performs really well on both the training and test data sets (which suggests the model is NOT overfitting), with an accuracy of 83% and a sensitivity of 85%. Run time was also good, and the tree exposes the features behind its decisions, which the company can use to design plans for its customers. As it is not overfitting, the model could be reused.

4. PCA : The best part about PCA is its performance: it completes in just seconds. If we want a reusable model whose outcome is used to decide how many people may churn, it could be used, as it does not overfit and gives a consistent model.
The scree plot also tells us how many components to choose.

##### Observations :

We can use either PCA with Logistic Regression or a Decision Tree, based on the need. If the need is a good, consistent and fast model to predict churn, PCA is the best.
If the aim is to alter the prepaid plans, and we therefore need the features that drive churn, we can go for the Decision Tree.

##### Recommendations:
1. Offer good plans with special incoming or outgoing minutes of usage.

2. Offer better recharge plans with more value for money. For example, if a customer recharges with Rs 100, give talktime or data worth Rs 100, because max_recharge_data and total_recharge_amt are strong indicators of churn.
There might be a competitor offering better value for money. Revise the recharge plans.

3. Devise lucrative data plans. The volume of data used (2G or 3G) is a good indicator. Very few people use 3G, possibly because of cost, yet it still ended up as a good indicator. Better 3G plans would make people shift from 2G to 3G.

4. Offer better STD plans.

5. Offer better roaming plans, with lower charges on incoming and outgoing calls.

6. Offer better t2t on-net plans. Charges within the same network might be lowered, considering Airtel was the top network provider in India and more t2t calls would have been made.

7. Customers new to the network are more likely to churn. Provide attractive offers to such customers to dissuade them from churning.
    
    