## Business Problem Overview

In the telecom industry, customers are able to choose from multiple service providers and actively switch from one operator to another. In this highly competitive market, the telecommunications industry experiences an average of 15-25% annual churn rate. Given the fact that it costs 5-10 times more to acquire a new customer than to retain an existing one, customer retention has now become even more important than customer acquisition.

For many incumbent operators, retaining highly profitable customers is the number one business goal.

To reduce customer churn, telecom companies need to predict which customers are at high risk of churn.

In this project, you will analyse customer-level data of a leading telecom firm, build predictive models to identify customers at high risk of churn and identify the main indicators of churn.

## Understanding and Defining Churn

There are two main models of payment in the telecom industry - postpaid (customers pay a monthly/annual bill after using the services) and prepaid (customers pay/recharge with a certain amount in advance and then use the services).

In the postpaid model, when customers want to switch to another operator, they usually inform the existing operator to terminate the services, and you directly know that this is an instance of churn.

However, in the prepaid model, customers who want to switch to another network can simply stop using the services without any notice, and it is hard to know whether someone has actually churned or is simply not using the services temporarily (e.g. someone may be on a trip abroad for a month or two and then intend to resume using the services again).

Thus, churn prediction is usually more critical (and non-trivial) for prepaid customers, and the term ‘churn’ should be defined carefully. Also, prepaid is the most common model in India and Southeast Asia, while postpaid is more common in Europe and North America.

This project is based on the Indian and Southeast Asian market.

## Definitions of Churn

There are various ways to define churn, such as:

Revenue-based churn: Customers who have not utilised any revenue-generating facilities such as mobile internet, outgoing calls, SMS etc. over a given period of time. One could also use aggregate metrics such as ‘customers who have generated less than INR 4 per month in total/average/median revenue’. 
The main shortcoming of this definition is that there are customers who only receive calls/SMSes from their wage-earning counterparts, i.e. they don’t generate revenue but use the services. For example, many users in rural areas only receive calls from their wage-earning siblings in urban areas. 

Usage-based churn: Customers who have not done any usage, either incoming or outgoing - in terms of calls, internet etc. over a period of time. 
A potential shortcoming of this definition is that by the time a customer has stopped using the services for a while, it may be too late to take any corrective action to retain them. For example, if you define churn based on a ‘two-months zero usage’ period, predicting churn could be useless, since by that time the customer would already have switched to another operator.

In this project, we will use the usage-based definition to define churn.
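To make the difference between the two definitions concrete, here is a minimal sketch on a toy table. The column names, the four customers and the use of INR 4 as a cut-off on a single month are illustrative assumptions, not part of the project dataset.

```python
import pandas as pd

# Toy month-level usage for four hypothetical prepaid customers.
usage = pd.DataFrame({
    "customer": ["a", "b", "c", "d"],
    "incoming_min": [120.0, 45.0, 0.0, 0.0],
    "outgoing_min": [80.0, 0.0, 0.0, 0.0],
    "data_mb": [500.0, 0.0, 0.0, 0.0],
    "revenue_inr": [199.0, 0.0, 0.0, 2.0],
})

# Revenue-based: revenue below an illustrative INR 4 threshold.
usage["revenue_churn"] = (usage["revenue_inr"] < 4).astype(int)

# Usage-based: no incoming calls, no outgoing calls and no data at all.
no_usage = (usage[["incoming_min", "outgoing_min", "data_mb"]] == 0).all(axis=1)
usage["usage_churn"] = no_usage.astype(int)

print(usage[["customer", "revenue_churn", "usage_churn"]])
```

Customer `b` only receives calls, so the revenue-based rule tags them as churned while the usage-based rule does not. This is the shortcoming of the revenue-based definition described above.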

## High Value Churn

In the Indian and Southeast Asian market, approximately 80% of revenue comes from the top 20% of customers (called high-value customers). Thus, if we can reduce churn among the high-value customers, we will be able to reduce significant revenue leakage.

In this project, we will define high-value customers based on a certain metric (mentioned later below) and predict churn only on high-value customers.

## Understanding the dataset

The dataset contains customer-level information for a span of four consecutive months - June, July, August and September. The months are encoded as 6, 7, 8 and 9, respectively. 

The business objective is to predict the churn in the last (i.e. the ninth) month using the data (features) from the first three months. To do this task well, understanding the typical customer behaviour during churn will be helpful.

## Understanding Customer Behaviour During Churn

Customers usually do not decide to switch to a competitor instantly, but rather over a period of time (this is especially applicable to high-value customers). In churn prediction, we assume that there are three phases of the customer lifecycle:

The ‘good’ phase: In this phase, the customer is happy with the service and behaves as usual.

The ‘action’ phase: The customer experience starts to sour in this phase, e.g. he/she gets a compelling offer from a competitor, faces unjust charges, becomes unhappy with service quality etc. In this phase, the customer usually behaves differently than in the ‘good’ months. It is crucial to identify high-churn-risk customers in this phase, since corrective actions can still be taken at this point (such as matching the competitor’s offer or improving the service quality).

The ‘churn’ phase: In this phase, the customer is said to have churned. You define churn based on this phase. Also, it is important to note that at the time of prediction (i.e. the action months), this data is not available to you for prediction. Thus, after tagging churn as 1/0 based on this phase, you discard all data corresponding to this phase.

 

In this case, since you are working over a four-month window, the first two months are the ‘good’ phase, the third month is the ‘action’ phase, while the fourth month is the ‘churn’ phase.
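The mapping from month suffix to phase can be sketched as below. `PHASE_BY_MONTH` and `phase_of` are hypothetical helper names introduced only for illustration. The column names follow the `_6` to `_9` suffix convention of the dataset.

```python
# Month suffix -> lifecycle phase, following the four-month window described above.
PHASE_BY_MONTH = {"6": "good", "7": "good", "8": "action", "9": "churn"}

def phase_of(column: str) -> str:
    """Return the phase for a column such as 'arpu_8', or 'static' if it has no month suffix."""
    suffix = column.rsplit("_", 1)[-1]
    return PHASE_BY_MONTH.get(suffix, "static")

columns = ["arpu_6", "arpu_7", "arpu_8", "total_ic_mou_9", "aon"]

# Features may come from the good and action phases (and static attributes);
# churn-phase columns are only used to build the label and are then discarded.
feature_cols = [c for c in columns if phase_of(c) != "churn"]
label_cols = [c for c in columns if phase_of(c) == "churn"]

print(feature_cols)
print(label_cols)
```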

## Imports


In [None]:
#pip install -U scikit-learn

In [None]:
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Importing Pandas and NumPy
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
import numpy as np

# Importing visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
#statsmodels
import statsmodels
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [None]:
#sklearn
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
# Importing classification report and confusion matrix from sklearn metrics
from sklearn.metrics import classification_report,confusion_matrix, accuracy_score
# Imputer from sklearn.impute 
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
# RFE import
from sklearn.feature_selection import RFE

## Reading and Understanding the Data


In [None]:
# Importing all datasets
churn_data = pd.read_csv("../input/telecom-churn-dataset/telecom_churn_data.csv")
churn_data.head()

In [None]:
churn_data.shape

In [None]:
churn_data.info()

In [None]:
# Checking Null values%
round(100*(churn_data.isnull().sum()/len(churn_data.index)),2)

## Data Preparation and Cleaning


### Filter high-value customers

We need to predict churn only for the high-value customers. 
Define high-value customers as follows: Those who have recharged with an amount more than or equal to X, where X is the 70th percentile of the average recharge amount in the first two months (the good phase).

In [None]:
churn_data['avg_total_rech_amt_6_7'] = churn_data[['total_rech_amt_6', 'total_rech_amt_7']].mean(axis=1)

In [None]:
churn_data = churn_data[churn_data.avg_total_rech_amt_6_7 >= churn_data.avg_total_rech_amt_6_7.quantile(.70)]

In [None]:
# Average columns will be added later
churn_data.drop(['avg_total_rech_amt_6_7'],inplace=True,axis=1)
churn_data.shape

### Tag churners

Now tag the churned customers (churn=1, else 0) based on the fourth month as follows: Those who have not made any calls (either incoming or outgoing) AND have not used mobile internet even once in the churn phase. The attributes you need to use to tag churners are:

total_ic_mou_9
total_og_mou_9
vol_2g_mb_9
vol_3g_mb_9

In [None]:
churn_data['churn'] = np.where(  (churn_data.total_ic_mou_9 == 0)
                               & (churn_data.total_og_mou_9 == 0)
                               & (churn_data.vol_2g_mb_9 == 0)
                               & (churn_data.vol_3g_mb_9 == 0), 1, 0)

In [None]:
churn_data['churn'].value_counts()

In [None]:
churn_percent = (sum(churn_data['churn'])/len(churn_data.index))*100
print(churn_percent)

plt.figure(figsize=(12, 6))
colors = ["#3791D7", "#D72626"]
labels = "Loyal Customers", "Churn Customers"
churn_data["churn"].value_counts().plot.pie(explode=[0,0.2], autopct='%1.2f%%', shadow=True, colors=colors, labels=labels, fontsize=12, startangle=70)
plt.ylabel('% of Customers', fontsize=14)
plt.show()

In [None]:
churn_data.shape

### Rename a few of the attributes for consistency

In [None]:
# Rename the vbc columns for consistency with other columns
churn_data.rename(columns = {'jun_vbc_3g':'vbc_3g_6' , 'jul_vbc_3g':'vbc_3g_7', 
                             'aug_vbc_3g':'vbc_3g_8' , 'sep_vbc_3g':'vbc_3g_9' }, inplace = True)

# Rename the last_day_rch_amt columns to last_day_rech_amt for consistency
churn_data.rename(columns = {'last_day_rch_amt_6':'last_day_rech_amt_6' , 'last_day_rch_amt_7':'last_day_rech_amt_7',
                             'last_day_rch_amt_8':'last_day_rech_amt_8' , 'last_day_rch_amt_9':'last_day_rech_amt_9'},
                  inplace = True)

### Remove all the attributes corresponding to the churn phase

After tagging churners, remove all the attributes corresponding to the churn phase (all attributes whose names end in ‘_9’).

In [None]:
churn_data = churn_data.loc[:,~churn_data.columns.str.endswith('_9')]

churn_data.shape

### Removing high NULL value features/columns

Drop features/columns that have more than 60% NULL values

In [None]:
# Checking Null values%
round(100*(churn_data.isnull().sum()/len(churn_data.index)),2)

In [None]:
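# Drop features/columns that have more than 60% NULL values
# (dropna's thresh counts non-null values, and newer pandas rejects how= together with thresh=)
null_fraction = churn_data.isnull().mean()
churn_data = churn_data.loc[:, null_fraction <= 0.6]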
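One detail worth keeping in mind for this step: `dropna(thresh=k)` keeps a column only if it has at least `k` non-null values, so a threshold of 60% of the rows removes every column with more than 40% nulls, not 60%. The sketch below contrasts the two rules on a toy frame. The column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "mostly_null": [1.0] + [np.nan] * 9,     # 90% null
    "half_null": [1.0] * 5 + [np.nan] * 5,   # 50% null
    "complete": list(range(10)),             # 0% null
})

null_frac = toy.isnull().mean()

# Rule as worded in the text: drop columns with more than 60% nulls.
kept_60 = toy.loc[:, null_frac <= 0.6]

# thresh counts NON-null values: requiring 6 of 10 non-null values
# drops every column with more than 40% nulls.
kept_thresh = toy.dropna(axis=1, thresh=6)

print(list(kept_60.columns))
print(list(kept_thresh.columns))
```

On the toy frame the two rules disagree about `half_null`. On a real dataset they agree only when no column has a null share between 40% and 60%.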

### Drop date columns

In [None]:
# Drop the date columns as they don't seem to hold significance
date_cols = [col for col in churn_data.columns if 'date' in col]
churn_data = churn_data.drop(columns=date_cols)

churn_data.shape

### Drop columns which have only one unique value

In [None]:
# Drop features that have only one unique value
single_value_cols = [col for col in churn_data.columns if churn_data[col].nunique() == 1]
print(single_value_cols)
churn_data = churn_data.drop(columns=single_value_cols)

churn_data.shape

In [None]:
# No duplicate mobile number rows
print(churn_data.mobile_number.nunique())
# Drop mobile number column
churn_data.drop('mobile_number',inplace=True,axis=1)

### Outlier treatment

In [None]:
churn_data.describe(percentiles=[.99,.95,.9,.75,.25,.1,.05,.01])

In [None]:
for col in churn_data.columns:
    if(col != "churn"):
        Q90 = churn_data[col].quantile(0.9)  
        churn_data[col] = np.clip(churn_data[col], 0, Q90)
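The cell above caps every feature to the range from 0 to its 90th percentile. A minimal sketch of the same capping on a toy series (values invented for illustration) shows the effect on a negative value and on an extreme outlier. It uses `Series.clip`, which behaves like `np.clip` here.

```python
import pandas as pd

# Toy feature with one negative value and one extreme outlier.
s = pd.Series([-5.0, 0.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 1000.0])

q90 = s.quantile(0.9)                 # 90th percentile of the toy values
capped = s.clip(lower=0, upper=q90)   # floor at 0, cap at the 90th percentile

print(q90)
print(capped.tolist())
```

Note that this flattens the top 10% of every column to a single value, which is a fairly aggressive choice. A higher percentile would keep more of the tail.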

In [None]:
churn_data.describe(percentiles=[.99,.95,.9,.75,.25,.1,.05,.01])

### Impute missing values

In [None]:
# Checking Null values%
round(100*(churn_data.isnull().sum()/len(churn_data.index)),2)

In [None]:
# All remaining missing values are in the mou (minutes of usage) columns;
# treat a missing value as no usage and impute with 0
churn_data = churn_data.fillna(0)

In [None]:
# Checking Null values%
round(100*(churn_data.isnull().sum()/len(churn_data.index)),2)

### Derived Features along with Univariate and Bivariate Analysis

#### Derive average value columns of 6th and 7th months (good phase) and diff value columns

Also drop the 6th and 7th months columns since they are depicted by a corresponding single column.

For example derive avg_arpu_6_7 from arpu_6 and arpu_7 and then drop arpu_6 and arpu_7.

Diff column = arpu_8 - avg_arpu_6_7
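As a small illustration of the averaging and the diff described above, here is a sketch on a toy frame. The values are invented and the `diff_arpu` column name is a hypothetical choice.

```python
import pandas as pd

toy = pd.DataFrame({
    "arpu_6": [100.0, 300.0],
    "arpu_7": [200.0, 300.0],
    "arpu_8": [150.0, 0.0],
})

# Average of the two good-phase months.
toy["avg_arpu_6_7"] = toy[["arpu_6", "arpu_7"]].mean(axis=1)

# Change in the action month relative to the good-phase average
# (hypothetical column name).
toy["diff_arpu"] = toy["arpu_8"] - toy["avg_arpu_6_7"]

# The monthly columns are now represented by the average.
toy = toy.drop(columns=["arpu_6", "arpu_7"])

print(toy)
```

A strongly negative diff, as in the second row, is the kind of drop in the action month that the later analysis looks for.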

In [None]:
# Snapshot the column list first, since the loop adds and drops columns
for first_col in list(churn_data.columns):
    if first_col.endswith("_6"):
        second_col = first_col[:-2] + "_7"
        avg_col = "avg_" + first_col + "_7"
        # Replace the two good-phase columns with their average
        churn_data[avg_col] = churn_data[[first_col, second_col]].mean(axis=1)
        churn_data.drop([first_col, second_col], inplace=True, axis=1)

In [None]:
churn_data.describe()

In [None]:
print(churn_data.shape)

In [None]:
#Identifying customers who use roaming
churn_data["roaming_user"] = np.where( (churn_data.avg_roam_ic_mou_6_7 != 0) |                                       
                                       (churn_data.avg_roam_og_mou_6_7 != 0) |
                                       (churn_data.roam_ic_mou_8 != 0) |                                       
                                       (churn_data.roam_og_mou_8 != 0),
                                       1,0
                                    )

#Identifying customers who use std
churn_data["std_user"] = np.where( (churn_data.avg_std_ic_mou_6_7 != 0) |                                       
                                   (churn_data.avg_std_og_mou_6_7 != 0) |
                                   (churn_data.std_ic_mou_8 != 0) |                                       
                                   (churn_data.std_og_mou_8 != 0),
                                   1,0)

#Identifying customers who use internet
churn_data["internet_user"] = np.where(  (churn_data.avg_vol_2g_mb_6_7 != 0) |
                                          (churn_data.avg_vol_3g_mb_6_7 != 0) |
                                          (churn_data.avg_sachet_2g_6_7 !=0) |                                          
                                          (churn_data.avg_vbc_3g_6_7 !=0) |
                                          (churn_data.vol_2g_mb_8 != 0) |
                                          (churn_data.vol_3g_mb_8 != 0) |
                                          (churn_data.sachet_2g_8 !=0)  |
                                          (churn_data.vbc_3g_8 !=0),
                                          1,0
                                       )

df_roaming_user = churn_data.loc[(churn_data["roaming_user"] == 1),:]
df_std_user = churn_data.loc[(churn_data["std_user"] == 1),:]
df_internet_user = churn_data.loc[(churn_data["internet_user"] == 1),:]

#### Analyze roaming mou columns

In [None]:
print(churn_data.groupby("churn").roaming_user.value_counts())

df_churn = churn_data.loc[(churn_data["churn"] == 1),:]
df_non_churn = churn_data.loc[(churn_data["churn"] == 0),:]

plt.figure(figsize=(14,8))
plt.subplot(1, 2, 1)
labels = "Roaming Users", "Non-Roaming Users"
df_churn.roaming_user.value_counts().plot.pie(labels = labels, autopct="%1.2f%%")
plt.ylabel("Partition of Churn users")

plt.subplot(1, 2, 2)
labels = "Non-Roaming Users", "Roaming Users"
df_non_churn.roaming_user.value_counts().plot.pie(labels = labels, autopct="%1.2f%%")
plt.ylabel("Partition of Non Churn users")
plt.show()

Inference ->

Among the churn users, around 69% are roaming users.

Whereas among the non churn users, around 37% are roaming users.
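The pie charts in this section pass hard-coded label tuples, which only match the slices if the labels happen to be in the same order as the output of `value_counts()` (most frequent value first). A more defensive sketch, assuming a hypothetical `name_by_value` mapping, derives the labels from the index instead:

```python
import pandas as pd

flags = pd.Series([1, 1, 1, 0, 1, 0, 1])  # toy roaming_user flags
name_by_value = {1: "Roaming Users", 0: "Non-Roaming Users"}  # hypothetical mapping

counts = flags.value_counts()                        # sorted by frequency, not by value
labels = [name_by_value[v] for v in counts.index]    # labels follow the slice order
shares = (counts / counts.sum() * 100).round(2)

print(labels)
print(shares)
```

The resulting `labels` list can then be passed to `counts.plot.pie(labels=labels, autopct="%1.2f%%")`, so the labels stay correct regardless of which class is larger.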

#### Analyze STD Usage minutes

In [None]:
cond = (((churn_data.avg_std_ic_mou_6_7 != 0) &
         (churn_data.std_ic_mou_8 == 0)) | \
        ((churn_data.avg_std_og_mou_6_7 != 0) &
         (churn_data.std_og_mou_8 == 0))
        )

churn_data['std_churn_b'] = np.where(cond, 1, 0)
print(churn_data.groupby(["churn"]).std_churn_b.value_counts())

In [None]:
print(churn_data.groupby("churn").std_churn_b.value_counts())

df_churn = churn_data.loc[(churn_data["churn"] == 1),:]
df_non_churn = churn_data.loc[(churn_data["churn"] == 0),:]

plt.figure(figsize=(14,8))
plt.subplot(1, 2, 1)
labels = "STD Users", "OK"
df_churn.std_churn_b.value_counts().plot.pie(labels = labels, autopct="%1.2f%%")
plt.ylabel("Partition of Churn users")

plt.subplot(1, 2, 2)
labels = "OK", "STD Users"
df_non_churn.std_churn_b.value_counts().plot.pie(labels = labels, autopct="%1.2f%%")
plt.ylabel("Partition of Non Churn users")
plt.show()

Inference ->

Among the churn users, STD usage dropped to 0 in month 8 (after being non-zero in months 6 and 7) for around 56%.

Among the non churn users, this happened for only about 15%.

#### Analyze aon (age on network) feature

In [None]:
# Boxplot of churned customers vs non-churned customers based on no of days they are with the network.
print(churn_data.groupby("churn").aon.describe())
sns.boxplot(x=churn_data.churn, y=churn_data.aon)

In [None]:
churn_data['customer_new'] = churn_data["aon"].apply(lambda x : 1 if x<1000 else 0)

In [None]:
df_churn = churn_data.loc[(churn_data["churn"] == 1),:]
df_non_churn = churn_data.loc[(churn_data["churn"] == 0),:]
print(churn_data.groupby("churn").customer_new.value_counts())

plt.figure(figsize=(12,6))
plt.subplot(1, 2, 1)
labels = "New Customers","Old Customers"
df_churn.customer_new.value_counts().plot.pie(labels = labels, autopct="%1.2f%%")
plt.ylabel("Partition of Churn users")

plt.subplot(1, 2, 2)
labels = "Old Customers","New Customers"
df_non_churn.customer_new.value_counts().plot.pie(labels = labels, autopct="%1.2f%%")
plt.ylabel("Partition of Non Churn users")
plt.show()

Inference ->

Among the churn users, almost 74% are new customers whose aon is less than ~1000 days.

Whereas in non churn users, around 51% are new customers whose aon is less than ~1000 days.

Drop aon, as the derived customer_new column will be used for modeling.

In [None]:
churn_data.drop(['aon'], axis = 1, inplace = True)

#### Analyze arpu (average revenue per user) column

In [None]:
plt.figure(figsize=(12,5))
plt.subplot(1,2,1)
sns.boxplot(x=churn_data.churn,y=churn_data.avg_arpu_6_7)
plt.subplot(1,2,2)
sns.boxplot(x=churn_data.churn,y=churn_data.arpu_8)
plt.show()
print(churn_data.groupby("churn").avg_arpu_6_7.describe())
print(churn_data.groupby("churn").arpu_8.describe())

In [None]:
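# histplot with a KDE overlay replaces the deprecated sns.distplot
plt.figure(figsize=(12,5))
plt.subplot(2,2,1)
sns.histplot(df_non_churn.avg_arpu_6_7, kde=True, stat="density")
plt.subplot(2,2,2)
sns.histplot(df_non_churn.arpu_8, kde=True, stat="density")
plt.subplot(2,2,3)
sns.histplot(df_churn.avg_arpu_6_7, kde=True, stat="density")
plt.subplot(2,2,4)
sns.histplot(df_churn.arpu_8, kde=True, stat="density")
plt.show()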

All four distributions are right-skewed.
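The visual impression of skew can be backed by a number. A minimal sketch using pandas' `Series.skew()` on invented values is shown below. For the notebook's data, the same call could be applied to columns such as `avg_arpu_6_7`.

```python
import pandas as pd

# Toy right-skewed sample: most values are small, one is very large.
toy = pd.Series([1, 1, 2, 2, 3, 3, 4, 50], dtype=float)

skewness = toy.skew()  # positive values indicate a longer right tail

print(skewness)
print(toy.mean(), toy.median())
```

A positive skewness together with a mean above the median is the usual signature of a right-skewed distribution.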

#### Analyze all the recharge columns

In [None]:
plt.figure(figsize=(14, 6))
plt.subplot(1,2,1)
ax1 = sns.boxplot(x="churn",y="avg_total_rech_num_6_7", data = churn_data)
plt.subplot(1,2,2)
ax1 = sns.boxplot(x="churn",y="total_rech_num_8", data = churn_data)
plt.show()
print(churn_data.groupby("churn").avg_total_rech_num_6_7.describe())
print(churn_data.groupby("churn").total_rech_num_8.describe())

In [None]:
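# histplot with a KDE overlay replaces the deprecated sns.distplot
plt.figure(figsize=(12,5))
plt.subplot(2,2,1)
sns.histplot(df_non_churn.avg_total_rech_num_6_7, kde=True, stat="density")
plt.subplot(2,2,2)
sns.histplot(df_non_churn.total_rech_num_8, kde=True, stat="density")
plt.subplot(2,2,3)
sns.histplot(df_churn.avg_total_rech_num_6_7, kde=True, stat="density")
plt.subplot(2,2,4)
sns.histplot(df_churn.total_rech_num_8, kde=True, stat="density")
plt.show()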

In [None]:
plt.figure(figsize=(14, 6))
plt.subplot(1,2,1)
ax1 = sns.boxplot(x="churn",y="avg_total_rech_amt_6_7", data = churn_data)
plt.subplot(1,2,2)
ax1 = sns.boxplot(x="churn",y="total_rech_amt_8", data = churn_data)
plt.show()
print(churn_data.groupby("churn").avg_total_rech_amt_6_7.describe())
print(churn_data.groupby("churn").total_rech_amt_8.describe())

In [None]:
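# histplot with a KDE overlay replaces the deprecated sns.distplot
plt.figure(figsize=(12,5))
plt.subplot(2,2,1)
sns.histplot(df_non_churn.avg_total_rech_amt_6_7, kde=True, stat="density")
plt.subplot(2,2,2)
sns.histplot(df_non_churn.total_rech_amt_8, kde=True, stat="density")
plt.subplot(2,2,3)
sns.histplot(df_churn.avg_total_rech_amt_6_7, kde=True, stat="density")
plt.subplot(2,2,4)
sns.histplot(df_churn.total_rech_amt_8, kde=True, stat="density")
plt.show()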

In [None]:
plt.figure(figsize=(14, 6))
plt.subplot(1,2,1)
ax1 = sns.boxplot(x="churn",y="avg_max_rech_amt_6_7", data = churn_data)
plt.subplot(1,2,2)
ax1 = sns.boxplot(x="churn",y="max_rech_amt_8", data = churn_data)
plt.show()
print(churn_data.groupby("churn").avg_max_rech_amt_6_7.describe())
print(churn_data.groupby("churn").max_rech_amt_8.describe())

In [None]:
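# histplot with a KDE overlay replaces the deprecated sns.distplot
plt.figure(figsize=(12,5))
plt.subplot(2,2,1)
sns.histplot(df_non_churn.avg_max_rech_amt_6_7, kde=True, stat="density")
plt.subplot(2,2,2)
sns.histplot(df_non_churn.max_rech_amt_8, kde=True, stat="density")
plt.subplot(2,2,3)
sns.histplot(df_churn.avg_max_rech_amt_6_7, kde=True, stat="density")
plt.subplot(2,2,4)
sns.histplot(df_churn.max_rech_amt_8, kde=True, stat="density")
plt.show()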

In [None]:
plt.figure(figsize=(14, 6))
plt.subplot(1,2,1)
ax1 = sns.boxplot(x="churn",y="avg_last_day_rech_amt_6_7", data = churn_data)
plt.subplot(1,2,2)
ax1 = sns.boxplot(x="churn",y="last_day_rech_amt_8", data = churn_data)
plt.show()
print(churn_data.groupby("churn").avg_last_day_rech_amt_6_7.describe())
print(churn_data.groupby("churn").last_day_rech_amt_8.describe())

In [None]:
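# histplot with a KDE overlay replaces the deprecated sns.distplot
plt.figure(figsize=(12,5))
plt.subplot(2,2,1)
sns.histplot(df_non_churn.avg_last_day_rech_amt_6_7, kde=True, stat="density")
plt.subplot(2,2,2)
sns.histplot(df_non_churn.last_day_rech_amt_8, kde=True, stat="density")
plt.subplot(2,2,3)
sns.histplot(df_churn.avg_last_day_rech_amt_6_7, kde=True, stat="density")
plt.subplot(2,2,4)
sns.histplot(df_churn.last_day_rech_amt_8, kde=True, stat="density")
plt.show()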

In [None]:
cond = (  ((churn_data['avg_last_day_rech_amt_6_7']!=0) & (churn_data['last_day_rech_amt_8']==0)) | \
          ((churn_data['avg_max_rech_amt_6_7']!=0) & (churn_data['max_rech_amt_8']==0)) | \
          ((churn_data['avg_total_rech_amt_6_7']!=0) & (churn_data['total_rech_amt_8']==0))
       )

churn_data['rech_churn_b'] = np.where(cond, 1, 0)
print(churn_data.groupby(["churn"]).rech_churn_b.value_counts())

In [None]:
print(churn_data.groupby("churn").rech_churn_b.value_counts())

df_churn = churn_data.loc[(churn_data["churn"] == 1),:]
df_non_churn = churn_data.loc[(churn_data["churn"] == 0),:]

plt.figure(figsize=(14,8))
plt.subplot(1, 2, 1)
labels = "Recharge", "OK"
df_churn.rech_churn_b.value_counts().plot.pie(labels = labels, autopct="%1.2f%%")
plt.ylabel("Partition of Churn users")

plt.subplot(1, 2, 2)
labels = "OK", "Recharge"
df_non_churn.rech_churn_b.value_counts().plot.pie(labels = labels, autopct="%1.2f%%")
plt.ylabel("Partition of Non Churn users")
plt.show()

Inference ->

Among the churn users, around 70% show at least one recharge metric dropping to 0 in month 8 after being non-zero in months 6 and 7.

Among the non churn users, this happens for only about 20%.
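The STD, recharge and internet flags in this notebook all follow the same pattern: a metric that was non-zero on average in months 6 and 7 becomes zero in month 8. That pattern could be factored into one helper, sketched below. `dropped_to_zero` is a hypothetical function name and the toy values are invented.

```python
import pandas as pd

def dropped_to_zero(df, pairs):
    """Flag rows where any (good-phase average, month-8) pair goes from non-zero to zero."""
    cond = pd.Series(False, index=df.index)
    for good_col, action_col in pairs:
        cond |= (df[good_col] != 0) & (df[action_col] == 0)
    return cond.astype(int)

toy = pd.DataFrame({
    "avg_total_rech_amt_6_7": [500.0, 400.0, 0.0],
    "total_rech_amt_8": [450.0, 0.0, 0.0],
    "avg_max_rech_amt_6_7": [100.0, 100.0, 0.0],
    "max_rech_amt_8": [100.0, 0.0, 0.0],
})

toy["rech_churn_b"] = dropped_to_zero(toy, [
    ("avg_total_rech_amt_6_7", "total_rech_amt_8"),
    ("avg_max_rech_amt_6_7", "max_rech_amt_8"),
])

print(toy["rech_churn_b"].tolist())
```

Only the second customer is flagged: the first keeps recharging, and the third never recharged in the good phase, so there is no drop to detect.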

#### Analyze Internet Usage columns

In [None]:
# Draw all four boxplots on a single 2x2 figure
plt.figure(figsize=(14, 10))
plt.subplot(2,2,1)
sns.boxplot(x="churn",y="avg_vol_2g_mb_6_7", data = df_internet_user)
plt.subplot(2,2,2)
sns.boxplot(x="churn",y="vol_2g_mb_8", data = df_internet_user)
plt.subplot(2,2,3)
sns.boxplot(x="churn",y="avg_vol_3g_mb_6_7", data = df_internet_user)
plt.subplot(2,2,4)
sns.boxplot(x="churn",y="vol_3g_mb_8", data = df_internet_user)
plt.show()

In [None]:
plt.figure(figsize=(14, 6))
plt.subplot(2,2,1)
sns.boxplot(x="churn",y="avg_monthly_3g_6_7", data = df_internet_user)
plt.subplot(2,2,2)
sns.boxplot(x="churn",y="monthly_3g_8", data = df_internet_user)
plt.subplot(2,2,3)
sns.boxplot(x="churn",y="avg_monthly_2g_6_7", data = df_internet_user)
plt.subplot(2,2,4)
sns.boxplot(x="churn",y="monthly_2g_8", data = df_internet_user)
plt.show()

In [None]:
plt.figure(figsize=(14, 6))
plt.subplot(2,2,1)
sns.boxplot(x="churn",y="avg_sachet_3g_6_7", data = df_internet_user)
plt.subplot(2,2,2)
sns.boxplot(x="churn",y="sachet_3g_8", data = df_internet_user)
plt.subplot(2,2,3)
sns.boxplot(x="churn",y="avg_sachet_2g_6_7", data = df_internet_user)
plt.subplot(2,2,4)
sns.boxplot(x="churn",y="sachet_2g_8", data = df_internet_user)
plt.show()

In [None]:
plt.figure(figsize=(14, 6))
plt.subplot(1,2,1)
sns.boxplot(x="churn",y="avg_vbc_3g_6_7", data = df_internet_user)
plt.subplot(1,2,2)
sns.boxplot(x="churn",y="vbc_3g_8", data = df_internet_user)
plt.show()

In [None]:
cond = ( ((churn_data['avg_vol_3g_mb_6_7']!=0) & (churn_data['vol_3g_mb_8']==0)) | \
         ((churn_data['avg_sachet_2g_6_7']!=0) & (churn_data['sachet_2g_8']==0)) | \
         ((churn_data['avg_monthly_3g_6_7']!=0) & (churn_data['monthly_3g_8']==0)) | \
         ((churn_data['avg_vbc_3g_6_7']!=0) & (churn_data['vbc_3g_8']==0)) | \
         ((churn_data['avg_vol_2g_mb_6_7']!=0) & (churn_data['vol_2g_mb_8']==0)))

churn_data['internet_churn_b'] = np.where(cond, 1, 0)
print(churn_data.groupby(["churn","internet_user"]).internet_churn_b.value_counts())

In [None]:
df_internet_churn = churn_data.loc[(churn_data["churn"] == 1) & (churn_data["internet_user"] == 1),:]
df_internet_non_churn = churn_data.loc[(churn_data["churn"] == 0) & (churn_data["internet_user"] == 1),:]

plt.figure(figsize=(14,8))
plt.subplot(1, 2, 1)
labels = "Change in internet plans","OK"
df_internet_churn.internet_churn_b.value_counts().plot.pie(labels = labels, autopct="%1.2f%%")
plt.ylabel("Partition of Churn Internet users")

plt.subplot(1, 2, 2)
labels = "OK","Change in internet plans"
df_internet_non_churn.internet_churn_b.value_counts().plot.pie(labels = labels, autopct="%1.2f%%")
plt.ylabel("Partition of Non Churn Internet users")
plt.show()

Inference ->

Among the churn internet users, around 75% show at least one internet usage metric dropping to 0 in month 8 after being non-zero in months 6 and 7.

Among the non churn internet users, this happens for about 40%.

In [None]:
# Dropping a few insignificant features
churn_data.drop(['avg_monthly_2g_6_7','monthly_2g_8',
                 'avg_sachet_3g_6_7','sachet_3g_8'], axis = 1, inplace = True)

#### Analysis of minutes of usage columns

In [None]:
print(churn_data.groupby("churn").avg_og_others_6_7.describe())
print(churn_data.groupby("churn").og_others_8.describe())
print(churn_data.groupby("churn").avg_ic_others_6_7.describe())
print(churn_data.groupby("churn").ic_others_8.describe())

In [None]:
# Dropping as most values are 0
churn_data.drop(['avg_og_others_6_7','og_others_8'],inplace=True,axis=1)
churn_data.drop(['avg_ic_others_6_7','ic_others_8'],inplace=True,axis=1)

In [None]:
print(churn_data.groupby("churn").avg_spl_ic_mou_6_7.describe())
print(churn_data.groupby("churn").spl_ic_mou_8.describe())
print(churn_data.groupby("churn").avg_spl_og_mou_6_7.describe())
print(churn_data.groupby("churn").spl_og_mou_8.describe())
plt.figure(figsize=(14, 6))
plt.subplot(2,2,1)
sns.boxplot(x="churn",y="avg_spl_ic_mou_6_7", data = churn_data)
plt.subplot(2,2,2)
sns.boxplot(x="churn",y="spl_ic_mou_8", data = churn_data)
plt.subplot(2,2,3)
sns.boxplot(x="churn",y="avg_spl_og_mou_6_7", data = churn_data)
plt.subplot(2,2,4)
sns.boxplot(x="churn",y="spl_og_mou_8", data = churn_data)
plt.show()

In [None]:
# Dropping as most values are 0; and both churn and non churn users show a drop in spl og
churn_data.drop(['avg_spl_ic_mou_6_7','spl_ic_mou_8'],inplace=True,axis=1)

In [None]:
print(churn_data.groupby("churn").avg_loc_og_t2c_mou_6_7.describe())
print(churn_data.groupby("churn").loc_og_t2c_mou_8.describe())
plt.figure(figsize=(12, 5))
plt.subplot(1,2,1)
sns.boxplot(x="churn",y="avg_loc_og_t2c_mou_6_7", data = churn_data)
plt.subplot(1,2,2)
sns.boxplot(x="churn",y="loc_og_t2c_mou_8", data = churn_data)
plt.show()

In [None]:
churn_data.drop(['avg_loc_og_t2c_mou_6_7','loc_og_t2c_mou_8'],inplace=True,axis=1)

In [None]:
print(churn_data.groupby("churn").avg_onnet_mou_6_7.describe())
print(churn_data.groupby("churn").onnet_mou_8.describe())
print(churn_data.groupby("churn").avg_offnet_mou_6_7.describe())
print(churn_data.groupby("churn").offnet_mou_8.describe())
plt.figure(figsize=(14, 6))
plt.subplot(2,2,1)
sns.boxplot(x="churn",y="avg_onnet_mou_6_7", data = churn_data)
plt.subplot(2,2,2)
sns.boxplot(x="churn",y="onnet_mou_8", data = churn_data)
plt.subplot(2,2,3)
sns.boxplot(x="churn",y="avg_offnet_mou_6_7", data = churn_data)
plt.subplot(2,2,4)
sns.boxplot(x="churn",y="offnet_mou_8", data = churn_data)
plt.show()

In [None]:
print(churn_data.groupby("churn").avg_loc_ic_t2t_mou_6_7.describe())
print(churn_data.groupby("churn").loc_ic_t2t_mou_8.describe())
print(churn_data.groupby("churn").avg_loc_ic_t2m_mou_6_7.describe())
print(churn_data.groupby("churn").loc_ic_t2m_mou_8.describe())
print(churn_data.groupby("churn").avg_loc_ic_t2f_mou_6_7.describe())
print(churn_data.groupby("churn").loc_ic_t2f_mou_8.describe())
print(churn_data.groupby("churn").avg_loc_ic_mou_6_7.describe())
print(churn_data.groupby("churn").loc_ic_mou_8.describe())

In [None]:
plt.figure(figsize=(14, 6))
plt.subplot(1,2,1)
sns.boxplot(x="churn",y="avg_loc_ic_mou_6_7", data = churn_data)
plt.subplot(1,2,2)
sns.boxplot(x="churn",y="loc_ic_mou_8", data = churn_data)
plt.show()

In [None]:
print(churn_data.groupby("churn").avg_loc_og_t2t_mou_6_7.describe())
print(churn_data.groupby("churn").loc_og_t2t_mou_8.describe())
print(churn_data.groupby("churn").avg_loc_og_t2m_mou_6_7.describe())
print(churn_data.groupby("churn").loc_og_t2m_mou_8.describe())
print(churn_data.groupby("churn").avg_loc_og_t2f_mou_6_7.describe())
print(churn_data.groupby("churn").loc_og_t2f_mou_8.describe())
print(churn_data.groupby("churn").avg_loc_og_mou_6_7.describe())
print(churn_data.groupby("churn").loc_og_mou_8.describe())

In [None]:
plt.figure(figsize=(14, 6))
plt.subplot(1,2,1)
sns.boxplot(x="churn",y="avg_loc_og_mou_6_7", data = churn_data)
plt.subplot(1,2,2)
sns.boxplot(x="churn",y="loc_og_mou_8", data = churn_data)
plt.show()

In [None]:
print(churn_data.groupby("churn").avg_std_ic_t2t_mou_6_7.describe())
print(churn_data.groupby("churn").std_ic_t2t_mou_8.describe())
print(churn_data.groupby("churn").avg_std_ic_t2m_mou_6_7.describe())
print(churn_data.groupby("churn").std_ic_t2m_mou_8.describe())
print(churn_data.groupby("churn").avg_std_ic_t2f_mou_6_7.describe())
print(churn_data.groupby("churn").std_ic_t2f_mou_8.describe())
print(churn_data.groupby("churn").avg_std_ic_mou_6_7.describe())
print(churn_data.groupby("churn").std_ic_mou_8.describe())

In [None]:
plt.figure(figsize=(14, 6))
plt.subplot(1,2,1)
sns.boxplot(x="churn",y="avg_std_ic_mou_6_7", data = churn_data)
plt.subplot(1,2,2)
sns.boxplot(x="churn",y="std_ic_mou_8", data = churn_data)
plt.show()

In [None]:
print(churn_data.groupby("churn").avg_std_og_t2t_mou_6_7.describe())
print(churn_data.groupby("churn").std_og_t2t_mou_8.describe())
print(churn_data.groupby("churn").avg_std_og_t2m_mou_6_7.describe())
print(churn_data.groupby("churn").std_og_t2m_mou_8.describe())
print(churn_data.groupby("churn").avg_std_og_t2f_mou_6_7.describe())
print(churn_data.groupby("churn").std_og_t2f_mou_8.describe())
print(churn_data.groupby("churn").avg_std_og_mou_6_7.describe())
print(churn_data.groupby("churn").std_og_mou_8.describe())

In [None]:
plt.figure(figsize=(14, 6))
plt.subplot(1,2,1)
sns.boxplot(x="churn",y="avg_std_og_mou_6_7", data = churn_data)
plt.subplot(1,2,2)
sns.boxplot(x="churn",y="std_og_mou_8", data = churn_data)
plt.show()

In [None]:
print(churn_data.groupby("churn").avg_isd_og_mou_6_7.describe())
print(churn_data.groupby("churn").isd_og_mou_8.describe())
print(churn_data.groupby("churn").avg_isd_ic_mou_6_7.describe())
print(churn_data.groupby("churn").isd_ic_mou_8.describe())

In [None]:
plt.figure(figsize=(14, 6))
plt.subplot(2,2,1)
sns.boxplot(x="churn",y="avg_isd_ic_mou_6_7", data = churn_data)
plt.subplot(2,2,2)
sns.boxplot(x="churn",y="isd_ic_mou_8", data = churn_data)
plt.subplot(2,2,3)
sns.boxplot(x="churn",y="avg_isd_og_mou_6_7", data = churn_data)
plt.subplot(2,2,4)
sns.boxplot(x="churn",y="isd_og_mou_8", data = churn_data)
plt.show()

In [None]:
churn_data.drop(['avg_isd_og_mou_6_7','isd_og_mou_8'],inplace=True,axis=1)
churn_data.drop(['avg_isd_ic_mou_6_7','isd_ic_mou_8'],inplace=True,axis=1)

In [None]:
print(churn_data.groupby("churn").avg_total_ic_mou_6_7.describe())
print(churn_data.groupby("churn").total_ic_mou_8.describe())

In [None]:
print(churn_data.groupby("churn").avg_total_og_mou_6_7.describe())
print(churn_data.groupby("churn").total_og_mou_8.describe())

In [None]:
plt.figure(figsize=(14, 6))
plt.subplot(2,2,1)
sns.boxplot(x="churn",y="avg_total_ic_mou_6_7", data = churn_data)
plt.subplot(2,2,2)
sns.boxplot(x="churn",y="total_ic_mou_8", data = churn_data)
plt.subplot(2,2,3)
sns.boxplot(x="churn",y="avg_total_og_mou_6_7", data = churn_data)
plt.subplot(2,2,4)
sns.boxplot(x="churn",y="total_og_mou_8", data = churn_data)
plt.show()

In [None]:
# Dropping roaming mou columns as we already derived roaming_user
roam_cols = [col for col in churn_data.columns if "roam_" in col]
churn_data.drop(roam_cols, axis=1, inplace=True)

# Dropping std_user column as we already derived std_churn_b
churn_data.drop(['std_user'],inplace=True,axis=1)

#### Check your derived variables

In [None]:
churn_data.describe()

In [None]:
churn_data_pca = churn_data.copy()
churn_data_pca.shape

### Checking the correlation between the features

In [None]:
df_mou = pd.DataFrame()
for col in churn_data.columns:
    if ("mou_6_7" in col):
        df_mou[col] = churn_data[col]

plt.figure(figsize=(20,10))
sns.heatmap(df_mou.corr(), annot = True, cmap="YlGnBu")

In [None]:
# Dropping highly correlated features
churn_data.drop(['avg_loc_ic_mou_6_7'], axis=1, inplace=True)
churn_data.drop(['avg_loc_og_mou_6_7'], axis=1, inplace=True)
churn_data.drop(['avg_std_ic_mou_6_7'], axis=1, inplace=True)
churn_data.drop(['avg_total_ic_mou_6_7'], axis=1, inplace=True)

In [None]:
df_mou = pd.DataFrame()
for col in churn_data.columns:
    if ("mou_8" in col):
        df_mou[col] = churn_data[col]

plt.figure(figsize=(20,10))
sns.heatmap(df_mou.corr(), annot = True, cmap="YlGnBu")

In [None]:
# Dropping highly correlated features
churn_data.drop(['loc_og_mou_8'], axis=1, inplace=True)
churn_data.drop(['loc_ic_mou_8'], axis=1, inplace=True)
churn_data.drop(['std_ic_mou_8'], axis=1, inplace=True)
churn_data.drop(['total_ic_mou_8'], axis=1, inplace=True)

In [None]:
df_mou = pd.DataFrame()
for col in churn_data.columns:
    if ("mou" in col):
        df_mou[col] = churn_data[col]

plt.figure(figsize=(20,10))
sns.heatmap(df_mou.corr(), annot = True, cmap="YlGnBu")

In [None]:
churn_data.drop(['onnet_mou_8'], axis=1, inplace=True)
churn_data.drop(['avg_onnet_mou_6_7'], axis=1, inplace=True)

In [None]:
df_others = pd.DataFrame()
for col in churn_data.columns:
    if ("mou" not in col):
        df_others[col] = churn_data[col]

plt.figure(figsize=(20,10))
sns.heatmap(df_others.corr(), annot = True, cmap="YlGnBu")

In [None]:
# Highly Correlated with arpu
churn_data.drop(['total_rech_amt_8'], axis=1, inplace=True)
churn_data.drop(['avg_total_rech_amt_6_7'], axis=1, inplace=True)

In [None]:
df_others = pd.DataFrame()
for col in churn_data.columns:
    if ("mou" not in col):
        df_others[col] = churn_data[col]

plt.figure(figsize=(20,10))
sns.heatmap(df_others.corr(), annot = True, cmap="YlGnBu")
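
The heatmaps above are read by eye to decide which columns to drop. As a complement, a small helper can list the column pairs whose absolute correlation exceeds a threshold. This is a minimal sketch on toy data: the function name `high_corr_pairs`, the 0.8 threshold and the toy columns are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

def high_corr_pairs(df, threshold=0.8):
    """Return (col_a, col_b, corr) for column pairs with |corr| above threshold."""
    corr = df.corr()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            value = float(corr.iloc[i, j])
            if abs(value) > threshold:
                pairs.append((cols[i], cols[j], round(value, 3)))
    # Strongest correlations first
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)

# Toy data: "b" is an exact multiple of "a"; "c" is only weakly related to both.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})
pairs = high_corr_pairs(toy, threshold=0.8)
print(pairs)
```

On the real data this could be called as `high_corr_pairs(df_mou)` before choosing which member of each pair to drop.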

## Data Modeling

Build models to predict churn. The predictive model serves two purposes:

1. Predict whether a high-value customer will churn in the near future (i.e. the churn phase). Knowing this, the company can take action such as offering special plans or discounts on recharge.

2. Identify the variables that are strong predictors of churn. These variables may also indicate why customers choose to switch to other networks.

### Splitting Data into Training and Test Sets

In [None]:
print(churn_data.shape)

In [None]:
churn_data.head()

In [None]:
from sklearn.model_selection import train_test_split

# Putting feature variable to X
X = churn_data.drop(['churn'],axis=1)

# Putting response variable to y
y = churn_data['churn']

y.head()

In [None]:
# Splitting the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X,y,train_size=0.7,test_size=0.3,random_state=100,stratify=y)

### Normalize the dataset

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

category_cols = ['roaming_user','customer_new','internet_user','internet_churn_b','std_churn_b','rech_churn_b']  

X_train_category = X_train[category_cols]
X_train = X_train.drop(category_cols,axis=1)

X_test_category = X_test[category_cols]
X_test = X_test.drop(category_cols,axis=1)

numeric_cols = X_train.columns

# Apply fit_transform on train data
X_train[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])

# Apply transform on test data
X_test[numeric_cols]  = scaler.transform(X_test[numeric_cols])

# Concatenate numerical transformed data and categorical data columns
X_train = pd.concat([X_train_category, X_train], axis=1)
X_test = pd.concat([X_test_category, X_test], axis=1)

print(X_train.shape)
print(X_test.shape)
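
The cell above fits the scaler on the training split only and reuses those statistics on the test split, which keeps test information out of the preprocessing. The same arithmetic can be sketched with numpy on toy numbers. The arrays below are invented for illustration; `StandardScaler` divides by the population standard deviation, which is also numpy's default (`ddof=0`).

```python
import numpy as np

# Toy single-feature data, invented for illustration only.
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

# Statistics come from the training split only.
train_mean = train.mean(axis=0)
train_std = train.std(axis=0)  # ddof=0, the population standard deviation

# The test split is transformed with the training statistics.
train_scaled = (train - train_mean) / train_std
test_scaled = (test - train_mean) / train_std

print(train_scaled.ravel())
print(test_scaled.ravel())
```

The scaled test value lies outside the range of the scaled training values, which is expected: the test split is not re-centered on its own mean.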

In [None]:
X_test.head()

### Common functions for all models

In [None]:
def df_predictions(y, y_pred, prob_boundary):
    y_df = pd.DataFrame(y)
    y_df['CustID'] = y_df.index

    y_pred_df = pd.DataFrame(y_pred)

    # Removing index for both dataframes to append them side by side 
    y_df.reset_index(drop=True, inplace=True)
    y_pred_df.reset_index(drop=True, inplace=True)

    y_pred_final = pd.concat([y_df,y_pred_df],axis=1)

    # Renaming the column
    y_pred_final= y_pred_final.rename(columns={ 0 : 'churn_Prob'})

    # Rearranging the columns
    #y_pred_final = y_pred_final.reindex_axis(['CustID','churn','churn_Prob'], axis=1)

    # Creating new column 'predicted' with 1 if churn_Prob>prob_boundary else 0
    y_pred_final['churn_predicted'] = y_pred_final.churn_Prob.map( lambda x: 1 if x >prob_boundary else 0)
   
    return y_pred_final

In [None]:
def model_eval(y_pred_final):
    # Confusion matrix 
    confusion = metrics.confusion_matrix(y_pred_final.churn, y_pred_final.churn_predicted)
    TP = confusion[1,1] # true positive 
    TN = confusion[0,0] # true negatives
    FP = confusion[0,1] # false positives
    FN = confusion[1,0] # false negatives
    print("Confusion Matrix -> ")
    print("# Predicted","\t","notchurn","\t","churn")
    print("# Actual")
    print("# not_churn\t",confusion[0,0],"\t\t",confusion[0,1])        
    print("# churn\t\t",confusion[1,0],"\t\t",confusion[1,1])
    # Let's check the report of our model
    print("\nClassification Report -> ")
    print(classification_report(y_pred_final.churn, y_pred_final.churn_predicted))    
    print("\nSensitivity (True Positive rate OR Recall of Churn Label) -> ")
    print(round(TP / (TP+FN),2))
    print("\nSpecificity (True Negative rate OR Recall of Non Churn Label) -> ")
    print(round(TN / (TN+FP),2))
    auc_score = metrics.roc_auc_score( y_pred_final.churn, y_pred_final.churn_Prob)
    print("\nAUC Score -> ", auc_score)    
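
As a worked check of the two ratios printed by `model_eval`, the sketch below computes sensitivity and specificity from toy labels in plain Python. The function name and the toy labels are illustrative assumptions, not part of the notebook.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels where 1 = churn."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)  # recall of the churn class
    specificity = tn / (tn + fp)  # recall of the non-churn class
    return sensitivity, specificity

# Toy labels: 4 churners (3 caught), 6 non-churners (4 correctly left alone).
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

sens, spec = sensitivity_specificity(y_true_toy, y_pred_toy)
print(round(sens, 2), round(spec, 2))  # 0.75 0.67
```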

In [None]:
#Draws the ROC
def draw_roc( actual, probs ):
    fpr, tpr, thresholds = metrics.roc_curve( actual, probs,
                                              drop_intermediate = False )
    auc_score = metrics.roc_auc_score( actual, probs )
    plt.figure(figsize=(5, 5))
    plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score )
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic')
    plt.legend(loc="lower right")
    plt.show()

    return None

In [None]:
def plot_cutoff_df(y_train_pred_final):
    # Let's create columns with different probability cutoffs 
    numbers = [float(x)/10 for x in range(10)]
    for i in numbers:
        y_train_pred_final[i]= y_train_pred_final.churn_Prob.map(lambda x: 1 if x > i else 0)
    # Now let's calculate accuracy sensitivity and specificity for various probability cutoffs.
    cutoff_df = pd.DataFrame( columns = ['prob','accuracy','sensi','speci'])
    # Confusion matrix layout: cm1[0,0] = TN, cm1[0,1] = FP, cm1[1,0] = FN, cm1[1,1] = TP
    for i in numbers:
        cm1 = metrics.confusion_matrix(y_train_pred_final.churn, y_train_pred_final[i] )
        total1=sum(sum(cm1))
        accuracy = (cm1[0,0]+cm1[1,1])/total1    
        speci = cm1[0,0]/(cm1[0,0]+cm1[0,1])
        sensi = cm1[1,1]/(cm1[1,0]+cm1[1,1])
        cutoff_df.loc[i] =[ i ,accuracy,sensi,speci]
    print(cutoff_df)

    # Let's plot accuracy sensitivity and specificity for various probabilities.
    cutoff_df.plot.line(x='prob', y=['accuracy','sensi','speci'])
    plt.show()
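
The plot from `plot_cutoff_df` is read by eye to find where the sensitivity and specificity curves meet. The sketch below does the same search numerically on toy probabilities, picking the cutoff with the smallest gap between the two rates. The helper names `rates` and `balanced_cutoff` and the toy data are assumptions for illustration.

```python
def rates(y_true, probs, cutoff):
    """Sensitivity and specificity when predicting churn for prob > cutoff."""
    pred = [1 if p > cutoff else 0 for p in probs]
    tp = sum(1 for t, p in zip(y_true, pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

def balanced_cutoff(y_true, probs, cutoffs):
    """Cutoff with the smallest |sensitivity - specificity| (first one on ties)."""
    def gap(cutoff):
        sensi, speci = rates(y_true, probs, cutoff)
        return abs(sensi - speci)
    return min(cutoffs, key=gap)

# Toy data: 5 non-churners followed by 5 churners.
y_true_toy = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
probs_toy = [0.05, 0.15, 0.25, 0.45, 0.65, 0.35, 0.55, 0.75, 0.85, 0.95]
cutoffs = [x / 10 for x in range(1, 10)]

best = balanced_cutoff(y_true_toy, probs_toy, cutoffs)
print(best, rates(y_true_toy, probs_toy, best))
```

A finer grid of cutoffs (for example steps of 0.01) would be needed to land on values such as 0.45.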

In [None]:
#Prints VIF of all the features
def calculateVif(x_train_sm): 
    vif = pd.DataFrame()
    vif['Features'] = x_train_sm.columns
    vif['VIF'] = [variance_inflation_factor(x_train_sm.values, i) for i in range(x_train_sm.shape[1])]
    vif['VIF'] = round(vif['VIF'], 2)
    vif = vif.sort_values(by = "VIF", ascending = False)
    print(vif)
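
For reference, the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the others; equivalently, the VIFs are the diagonal of the inverse of the correlation matrix. The sketch below uses only that textbook identity with numpy on synthetic data. It is not the statsmodels routine used in `calculateVif`, so the numbers may differ when columns are not centered; the function name and the data are assumptions for illustration.

```python
import numpy as np

def vif_from_correlation(X):
    """VIF per column: diagonal of the inverse of the correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

# Synthetic data: x3 is nearly a copy of x1, x2 is independent of both.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + 0.1 * rng.normal(size=500)
X_toy = np.column_stack([x1, x2, x3])

vif_toy = vif_from_correlation(X_toy)
print(np.round(vif_toy, 2))
```

The near-duplicate pair gets a large VIF while the independent column stays close to 1, which is the pattern that motivates dropping the highest-VIF column one at a time.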

In [None]:
# Fits a statsmodels GLM (logistic regression), prints its summary and the VIFs, and returns the fitted result
def getLogisticRegStatSummary(X_train_sm):
    logm2 = sm.GLM(y_train, X_train_sm, family = sm.families.Binomial())
    res = logm2.fit()
    print(res.summary())
    calculateVif(X_train_sm)
    return res

### Data Modeling using Logistic Regression without PCA

In [None]:
#Initialising the model
logreg = LogisticRegression(class_weight='balanced')

In [None]:
rfe = RFE(logreg, n_features_to_select=12)
rfe = rfe.fit(X_train, y_train)

In [None]:
col = X_train.columns[rfe.support_]

In [None]:
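## Building the model by backward elimination: start from the RFE-selected columns and drop high-VIF features one at a time
#X_train_sm = sm.add_constant(X_train[col])
# Copy so that the in-place drops below do not act on a slice of X_train
X_train_sm = X_train[col].copy()
X_train_sm.shape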

In [None]:
#res = getLogisticRegStatSummary(X_train_sm)
res = logreg.fit(X_train_sm,y_train)
calculateVif(X_train_sm)

In [None]:
#Lets drop the column with highest VIF
X_train_sm.drop("internet_user", axis=1, inplace=True)
#res = getLogisticRegStatSummary(X_train_sm)
res = logreg.fit(X_train_sm,y_train)
calculateVif(X_train_sm)

In [None]:
#Lets drop the column with highest VIF
X_train_sm.drop("loc_og_t2m_mou_8", axis=1, inplace=True)
#res = getLogisticRegStatSummary(X_train_sm)
res = logreg.fit(X_train_sm,y_train)
calculateVif(X_train_sm)

In [None]:
#Co-efficients of the attributes.
pd_coefficients = pd.concat([pd.DataFrame(X_train_sm.columns), pd.DataFrame(np.transpose(logreg.coef_))], axis=1)
pd_coefficients.columns = ["Feature", "Co-efficient"]
pd_coefficients
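
Logistic regression coefficients are on the log-odds scale, so exponentiating one gives an odds ratio; for the standardized numeric features this is the change in odds per one standard deviation. The sketch below ranks a few made-up coefficients by magnitude and converts them. The feature names and values are hypothetical and are not results from this notebook.

```python
import math

def rank_by_odds_ratio(coefs):
    """Return (feature, coefficient, odds_ratio) sorted by |coefficient|, largest first."""
    rows = [(name, c, math.exp(c)) for name, c in coefs.items()]
    return sorted(rows, key=lambda r: abs(r[1]), reverse=True)

# Hypothetical coefficients, for illustration only.
toy_coefs = {"feature_a": 0.7, "feature_b": -1.2, "feature_c": 0.1}

ranked = rank_by_odds_ratio(toy_coefs)
for name, coef, odds in ranked:
    print(f"{name}: coef={coef:+.2f}, odds ratio={odds:.2f}")
```

An odds ratio above 1 means higher values of the feature raise the odds of churn; below 1 means they lower it.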

##### Model Evaluation for Logistic Regression on Train Data

In [None]:
## Stop dropping features here, since the remaining VIFs are in an acceptable range (p-values would require the commented-out statsmodels fit)
#y_train_pred = res.predict(X_train_sm).values.reshape(-1)
y_train_pred = res.predict_proba(X_train_sm)[:,1]
y_train_pred[:10]

In [None]:
y_train_pred_final = df_predictions(y_train, y_train_pred, 0.5)

# Let's see the head
print(y_train_pred_final[y_train_pred_final['churn']==0].head())
print(y_train_pred_final[y_train_pred_final['churn']==1].head())

draw_roc(y_train_pred_final.churn, y_train_pred_final.churn_Prob)

In [None]:
# Let's plot accuracy sensitivity and specificity for various probabilities.
plot_cutoff_df(y_train_pred_final)

##### Based on the above graph, decide the probability cutoff for the churn label

In [None]:
final_cutoff = 0.45
y_train_pred_final['churn_predicted'] = y_train_pred_final.churn_Prob.map( lambda x: 1 if x > final_cutoff else 0)
y_train_pred_final.head()

In [None]:
model_eval(y_train_pred_final)

##### Model Evaluation for Logistic Regression on Test Data

In [None]:
X_train_sm.columns

In [None]:
#Predicting on test data 
#X_test_sm = sm.add_constant(X_test)
X_test_sm = X_test[X_train_sm.columns]
#y_test_pred = res.predict(X_test).values.reshape(-1)
y_test_pred = res.predict_proba(X_test_sm)[:,1]
y_test_pred

In [None]:
y_test_pred_final = df_predictions(y_test, y_test_pred, final_cutoff)

# Let's see the head
print(y_test_pred_final[y_test_pred_final['churn']==0].head())
print(y_test_pred_final[y_test_pred_final['churn']==1].head())

In [None]:
model_eval(y_test_pred_final)

### Data Modeling using PCA

In [None]:
from sklearn.decomposition import PCA
from sklearn.decomposition import IncrementalPCA
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline

In [None]:
# create folds
folds = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 0)

Logistic_grid = GridSearchCV(
    Pipeline([
        ('reduce_dim', PCA()),
        ('classify', LogisticRegression(class_weight='balanced'))
        ]),
    param_grid=[
        {
            'reduce_dim__n_components': range(20,40,4),
            'classify__C': np.logspace(-4, 4, 4)
        }
    ],
    cv=folds, scoring='roc_auc')

In [None]:
Logistic_grid.fit(X_train, y_train)

In [None]:
print("PCA ",Logistic_grid.best_estimator_.named_steps['classify'])
print("\nBest params ",Logistic_grid.best_params_)
print("\nBest score (CV score=%0.3f):" % Logistic_grid.best_score_)

#### Making a scree plot for the explained variance and components

In [None]:
var_cumu = np.cumsum(Logistic_grid.best_estimator_.named_steps['reduce_dim'].explained_variance_ratio_)
fig = plt.figure(figsize=[12,8])
plt.vlines(x=36, ymax=1, ymin=0, colors="r", linestyles="--")
plt.hlines(y=0.96, xmax=45, xmin=0, colors="g", linestyles="--")
plt.plot(var_cumu)
plt.ylabel("Cumulative variance explained")
plt.show()
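
The scree plot is used to read off how many components are needed to reach a target share of variance (the green line sits at 0.96). The same lookup can be done numerically. This is a minimal sketch on made-up variance ratios; the function name and the toy ratios are assumptions for illustration.

```python
import numpy as np

def components_for_variance(explained_variance_ratio, target=0.96):
    """Smallest number of leading components whose cumulative variance reaches target."""
    cumulative = np.cumsum(explained_variance_ratio)
    # Index of the first cumulative value >= target
    idx = int(np.searchsorted(cumulative, target))
    if idx >= len(cumulative):
        raise ValueError("target variance is not reached by the available components")
    return idx + 1  # convert 0-based index to a component count

# Made-up ratios that sum to 1; cumulative: 0.5, 0.75, 0.875, 0.9375, 0.96875, 1.0
toy_ratios = np.array([0.5, 0.25, 0.125, 0.0625, 0.03125, 0.03125])

n_96 = components_for_variance(toy_ratios, target=0.96)
n_90 = components_for_variance(toy_ratios, target=0.90)
print(n_96, n_90)  # 5 4
```

With a fitted PCA object, the input would be its `explained_variance_ratio_` attribute.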

#### Based on the above scree plot, choose the number of PCA components "n"

In [None]:
pca = IncrementalPCA(n_components=36)
df_train_pca = pca.fit_transform(X_train)
df_test_pca = pca.transform(X_test)

In [None]:
print(df_train_pca.shape)
print(df_test_pca.shape)
print(pca.components_)
pca.explained_variance_ratio_

In [None]:
corrmat = np.corrcoef(df_train_pca.transpose())
print(corrmat.shape)
plt.figure(figsize=[15,15])
sns.heatmap(corrmat, annot=True)

#### Apply Logistic Regression on PCA components

In [None]:
learner_pca = LogisticRegression(class_weight='balanced')
model_pca = learner_pca.fit(df_train_pca, y_train)

##### Model Evaluation for PCA+Logistic Regression on Train Data

In [None]:
y_train_pred = model_pca.predict_proba(df_train_pca)[:,1]

y_train_pred_final = df_predictions(y_train, y_train_pred, 0.5)

draw_roc(y_train_pred_final.churn, y_train_pred_final.churn_Prob)

In [None]:
# Let's plot accuracy sensitivity and specificity for various probabilities.
plot_cutoff_df(y_train_pred_final)

##### Based on the above graph, decide the probability cutoff for the churn label

In [None]:
final_cutoff = 0.45
y_train_pred_final['churn_predicted'] = y_train_pred_final.churn_Prob.map( lambda x: 1 if x > final_cutoff else 0)
# Let's see the head
print(y_train_pred_final[y_train_pred_final['churn']==0].head())
print(y_train_pred_final[y_train_pred_final['churn']==1].head())

In [None]:
model_eval(y_train_pred_final)

##### Model Evaluation for PCA+Logistic Regression on Test Data

In [None]:
y_test_pred = model_pca.predict_proba(df_test_pca)[:,1]
y_test_pred

In [None]:
"{:2.2}".format(metrics.roc_auc_score(y_test, y_test_pred))

In [None]:
y_test_pred_final = df_predictions(y_test, y_test_pred, final_cutoff)

# Let's see the head
print(y_test_pred_final[y_test_pred_final['churn']==0].head())
print(y_test_pred_final[y_test_pred_final['churn']==1].head())

In [None]:
model_eval(y_test_pred_final)

#### Apply Random Forest on PCA components

In [None]:
# Importing random forest classifier from sklearn library
from sklearn.ensemble import RandomForestClassifier
#from imblearn.ensemble import BalancedRandomForestClassifier

In [None]:
# create folds
folds = StratifiedKFold(n_splits = 3, shuffle = True, random_state = 0)

rf_grid = GridSearchCV(
    Pipeline([
        ('reduce_dim', PCA()),
        ('classify', RandomForestClassifier(class_weight='balanced_subsample'))
        ]),
    param_grid=[
        {
            'reduce_dim__n_components': range(20,40,4),
            'classify__max_depth': [6,8],
            'classify__min_samples_leaf': range(200,400,100),
            'classify__min_samples_split': range(300,500,100),
            'classify__n_estimators': [10,20,30],
            'classify__criterion':["gini","entropy"]
        }
    ],
    cv=folds, scoring='roc_auc')

In [None]:
rf_grid.fit(X_train, y_train)

In [None]:
print("PCA ",rf_grid.best_estimator_.named_steps['classify'])
print("\n",rf_grid.best_params_)
print("\nBest score (CV ROC AUC=%0.3f):" % rf_grid.best_score_)

##### Fitting the final model with the best parameters obtained from grid search.

In [None]:
pca = IncrementalPCA(n_components=36)
df_train_pca = pca.fit_transform(X_train)
df_test_pca = pca.transform(X_test)

In [None]:
rfc = RandomForestClassifier(bootstrap=True,
                             class_weight='balanced_subsample',
                             max_depth=6,
                             criterion='gini',
                             min_samples_leaf=200, 
                             min_samples_split=400,
                             n_estimators=30)

In [None]:
# fit
rfc.fit(df_train_pca,y_train)

##### Model Evaluation for PCA+Random Forest on Train Data

In [None]:
y_train_pred = rfc.predict_proba(df_train_pca)[:,1]

y_train_pred_final = df_predictions(y_train, y_train_pred, 0.5)

draw_roc(y_train_pred_final.churn, y_train_pred_final.churn_Prob)

In [None]:
# Let's plot accuracy sensitivity and specificity for various probabilities.
plot_cutoff_df(y_train_pred_final)

##### Based on the above graph, decide the probability cutoff for the churn label

In [None]:
final_cutoff = 0.4
y_train_pred_final['churn_predicted'] = y_train_pred_final.churn_Prob.map( lambda x: 1 if x > final_cutoff else 0)
# Let's see the head
print(y_train_pred_final[y_train_pred_final['churn']==0].head())
print(y_train_pred_final[y_train_pred_final['churn']==1].head())

In [None]:
model_eval(y_train_pred_final)

##### Model Evaluation for PCA+Random Forest on test data

In [None]:
y_test_pred = rfc.predict_proba(df_test_pca)[:, 1]
y_test_pred

##### Create a dataframe to make predictions for PCA+Random Forest

In [None]:
y_test_pred_final = df_predictions(y_test, y_test_pred, final_cutoff)

# Let's see the head
print(y_test_pred_final[y_test_pred_final['churn']==0].head())
print(y_test_pred_final[y_test_pred_final['churn']==1].head())

In [None]:
model_eval(y_test_pred_final)

#### Apply XGBoost on PCA components

In [None]:
# Installing XGBoost: uncomment the line below if it is not already installed
#pip install xgboost

In [None]:
#Importing xgboost package
import xgboost as xgb

In [None]:
# create folds
folds = StratifiedKFold(n_splits = 3, shuffle = True, random_state = 0)

xgBoost_grid = GridSearchCV(
    Pipeline([
        ('reduce_dim', PCA()),
        ('classify', xgb.XGBClassifier(random_state=42, scale_pos_weight = 11))
        ]),
    param_grid=[
        {
         'reduce_dim__n_components': range(20,40,4),  
         'classify__objective':['binary:logistic'],
         'classify__learning_rate': [0.001,0.05,0.1, 10], 
         'classify__max_depth': [2,3],
         'classify__min_child_weight': [35],
         'classify__subsample': [0.8],
         'classify__colsample_bytree': [0.7],
         'classify__n_estimators': [35]}
    ],
    cv=folds, scoring='roc_auc')
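
The value `scale_pos_weight = 11` above is hard-coded. XGBoost's documentation suggests the ratio of negative to positive instances as a typical starting value, so it can be derived from the training labels instead. The sketch below shows the arithmetic on toy labels; the function name and the 110:10 split are assumptions for illustration, and the notebook's actual ratio should be computed from `y_train`.

```python
def negative_to_positive_ratio(labels):
    """Number of negative labels divided by number of positive labels (1 = churn)."""
    positives = sum(1 for v in labels if v == 1)
    negatives = len(labels) - positives
    if positives == 0:
        raise ValueError("no positive labels present")
    return negatives / positives

# Toy labels: 110 non-churners and 10 churners.
toy_labels = [0] * 110 + [1] * 10

ratio = negative_to_positive_ratio(toy_labels)
print(ratio)  # 11.0
```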


In [None]:
xgBoost_grid.fit(X_train, y_train)

In [None]:
print("PCA ",xgBoost_grid.best_estimator_.named_steps['classify'])
print("\n",xgBoost_grid.best_params_)
print("\nBest score (CV ROC AUC=%0.3f):" % xgBoost_grid.best_score_)

#### Fitting the model with best parameters obtained from grid search

In [None]:
pca = IncrementalPCA(n_components=36)
df_train_pca = pca.fit_transform(X_train)
df_test_pca = pca.transform(X_test)

In [None]:
xgb_model = xgb.XGBClassifier(objective = 'binary:logistic',
              learning_rate= 0.1, 
              max_depth= 3,
              min_child_weight= 35,
              subsample= 0.8,
              colsample_bytree= 0.7,
              n_estimators= 35,
              random_state= 42,
              scale_pos_weight = 11)

In [None]:
xgb_model.fit(df_train_pca,y_train)

##### Model Evaluation for PCA+XGBoost on Train Data

In [None]:
y_train_pred = xgb_model.predict_proba(df_train_pca)[:,1]

y_train_pred_final = df_predictions(y_train, y_train_pred, 0.5)

draw_roc(y_train_pred_final.churn, y_train_pred_final.churn_Prob)

In [None]:
# Let's plot accuracy sensitivity and specificity for various probabilities.
plot_cutoff_df(y_train_pred_final)

##### Based on the above graph, decide the probability cutoff for the churn label

In [None]:
final_cutoff = 0.45
y_train_pred_final['churn_predicted'] = y_train_pred_final.churn_Prob.map( lambda x: 1 if x > final_cutoff else 0)
# Let's see the head
print(y_train_pred_final[y_train_pred_final['churn']==0].head())
print(y_train_pred_final[y_train_pred_final['churn']==1].head())

In [None]:
model_eval(y_train_pred_final)

##### Model Evaluation for PCA+XGBoost on test data

In [None]:
y_test_pred = xgb_model.predict_proba(df_test_pca)[:, 1]
y_test_pred

##### Create a dataframe to make predictions for PCA+XGBoost

In [None]:
y_test_pred_final = df_predictions(y_test, y_test_pred, final_cutoff)

# Let's see the head
print(y_test_pred_final[y_test_pred_final['churn']==0].head())
print(y_test_pred_final[y_test_pred_final['churn']==1].head())

In [None]:
model_eval(y_test_pred_final)

In [None]:
pd_coefficients

## Recommend strategies to manage customer churn


#### Potential reasons for churn

1. Roaming and STD users are more likely to churn. This was clearly highlighted in the EDA, and the logistic regression model converged to include these features with positive coefficients.

2. Reduced internet volume and reduced use of internet plans are among the main reasons users churn. Reduced or zero usage in the 8th month compared with the previous 2 months is also a cause for concern.

3. New customers are more likely to churn. Customers whose aon (number of days with the network) is low churn more than customers who have been with the network for a long time. This observation was captured in the EDA.

4. Reduced or zero recharge values in the 8th month compared with the previous 2 months are also a cause for concern.

5. Reduced minutes of usage across all categories are also a cause for concern.

6. The internet volume attributes have negative coefficients in the logistic regression, indicating that a decrease in internet consumption in the 8th month contributes to churn.



#### Strategies to retain customers

1. As seen above, STD users and roaming users are among the groups most likely to churn, so pricing and offers for STD and roaming need more focus.
2. It is clear from the models and the EDA that new customers tend to churn more, so new customers should receive more attention and better offers.
3. 2G and 3G volume consumption both drop in the action month for customers who churn, indicating dissatisfaction with the internet services offered to them. These at-risk customers can be offered a better internet plan to encourage them to stay.