## Problem Statement:
- In the telecom industry, customers are able to choose from multiple service providers and actively switch from one operator to another. In this highly competitive market, the telecommunications industry experiences an average of 15-25% annual churn rate. Given the fact that it costs 5-10 times more to acquire a new customer than to retain an existing one, customer retention has now become even more important than customer acquisition.<br><br>
- For many incumbent operators, retaining highly profitable customers is the number one business goal.<br><br>
- To reduce customer churn, telecom companies need to predict which customers are at high risk of churn.<br><br>
- In this project, we will analyse customer-level data of a leading telecom firm, build predictive models to identify customers at high risk of churn and identify the main indicators of churn.

 

# Step 1 Reading the data

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn
import statsmodels as sm
import cufflinks
cufflinks.go_offline()
cufflinks.set_config_file(world_readable=True, theme='pearl')

from IPython.display import Image  
from six import StringIO  
from sklearn.tree import export_graphviz
import pydotplus, graphviz
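# Use the full option names; the short form 'max_columns' is ambiguous in recent pandas versions
pd.set_option('display.max_columns',500)
pd.set_option('display.max_rows',200)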

import warnings
warnings.filterwarnings('ignore')

In [None]:
# Import the data in data variable
data=pd.read_csv('telecom_churn_data.csv')
data.head()

In [None]:
data.info()

In [None]:
# Statistical description of the data
data.describe()

In [None]:
# Shape of the data
data.shape

**Insights and Observations**
- About 1 lakh (100,000) rows and 226 columns
- Some columns contain only 0 values, e.g. `loc_og_t2o_mou`; these can be removed while cleaning the data
- Some columns, such as the total data recharge columns, have a minimum value of 1. Null values there likely mean that the customer did not recharge, so they can be imputed with 0

# Step 2 Cleaning the data

## Null value check and treatment

In [None]:
data.isnull().sum()

**Observations**

In the recharge variables where the minimum value is 1, we can impute missing values with 0, since a missing value indicates that the customer didn't recharge in that month

In [None]:
# create a list of recharge columns where we will impute missing values with zeroes
z_impute = ['total_rech_data_6', 'total_rech_data_7', 'total_rech_data_8', 'total_rech_data_9',
        'av_rech_amt_data_6', 'av_rech_amt_data_7', 'av_rech_amt_data_8', 'av_rech_amt_data_9',
        'max_rech_data_6', 'max_rech_data_7', 'max_rech_data_8', 'max_rech_data_9']

# impute missing values with 0
data[z_impute] = data[z_impute].fillna(0)

In [None]:
# let's make sure values are imputed correctly:
# percentage of missing rows per column (should be 0 for every column)
print("Missing value ratio:\n")
print(data[z_impute].isnull().sum()*100/data.shape[0])

In [None]:
# Columns where more than 30 percent of the values are null
nullperce=data.isnull().sum()/data.shape[0]
nullgreaterthan30=nullperce[nullperce>0.3]
nullgreaterthan30

In [None]:
print(nullgreaterthan30)

## Insights:
- We have 40 features where the share of null values is greater than 30 percent.
- We drop these features and keep only the remaining ones, since imputing so many values may introduce bias. Features with 30 percent or fewer null values are treated in later steps.
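The threshold rule described above can be sketched on a toy frame. The frame `toy` and the helper `split_by_null_share` are hypothetical names used only for illustration; they are not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative): "b" is 75% missing, "c" is 25% missing
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [1.0, np.nan, 3.0, 4.0],
})

def split_by_null_share(df, threshold=0.3):
    """Return (kept, dropped) column lists based on the share of missing rows."""
    null_share = df.isnull().mean()  # fraction of missing rows per column
    dropped = null_share[null_share > threshold].index.tolist()
    kept = null_share[null_share <= threshold].index.tolist()
    return kept, dropped

kept_cols, dropped_cols = split_by_null_share(toy, threshold=0.3)
print(kept_cols, dropped_cols)
```

Using `isnull().mean()` gives the missing share per column directly, which avoids dividing by the wrong axis of `shape`.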

In [None]:
# Columns with 30 percent or fewer null values are kept for further analysis
nulllessthan30=nullperce[nullperce<=0.3]
nulllessthan30

In [None]:
# Take a copy so that later column assignments do not operate on a view of data
data2=data[nulllessthan30.index].copy()
data2.head()

## Insights:
- In the two cells above we computed the null value percentage per feature and created a new dataframe `data2`, which holds only the features whose null value percentage is 30 or less.

In [None]:
# Find the columns having 2 or fewer unique values (unique() counts NaN as a value)
remcol = []
for col in list(data2.columns):
    uniqueValues = data2[col].unique()
    if len(uniqueValues) < 3:
        remcol.append(col)
remcol

## Insights:
- We remove the columns listed above because each of them carries (almost) the same value for every row, so they add no information for further analysis.
- As an example, `circle_id` is the same for the entire dataset. Keeping this type of column may affect the performance of the model.
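As a rough illustration of the check above, the same rule can be written with `nunique(dropna=False)`. The toy frame is invented for the example. Note that the rule `< 3` would also flag a genuine binary column, so the resulting list is worth inspecting before dropping anything.

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values only)
toy = pd.DataFrame({
    "circle_id": [109, 109, 109, 109],   # a single value everywhere
    "flag": [0.0, np.nan, 0.0, 0.0],     # one value plus missing
    "arpu": [10.5, 20.0, 35.2, 8.1],     # genuinely varying
})

# nunique(dropna=False) counts NaN as its own value, like Series.unique() does
n_unique = toy.nunique(dropna=False)
low_card_cols = n_unique[n_unique < 3].index.tolist()
print(low_card_cols)
```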

## Removing columns having very low variance

In [None]:
# Mobile number is not a useful feature and thus it makes sense to remove it along with zero variance columns

remcol.append('mobile_number')
print(remcol)
print("Before removing the columns shape of df {}".format(data2.shape))

data2=data2.drop(columns=remcol)

print("After removing the columns shape of df {}".format(data2.shape))

In [None]:
data2.shape

## Treating the null values
- Let's check the remaining columns and fix their null values

In [None]:
# Columns that still contain null values (each has 30% or fewer nulls)
null_values = data2.isnull().sum()
list_null_col = list(null_values[null_values>0].index)

In [None]:
# Since we are not doing time series forecasting, all the date columns can be removed.
# The remaining non-numeric columns are assumed to be the date columns.

df_date = data2.select_dtypes(exclude=['float64','int64'])

print("Before removing the date columns shape of df {}".format(data2.shape))

data2=data2.drop(columns=df_date.columns)

print("After removing the date columns shape of df {}".format(data2.shape))

In [None]:
# Fill the remaining null values with the column mean, as all remaining variables are numerical
data2.fillna(data2.mean(),inplace=True)

In [None]:
# Verify that no null values remain
data2.isnull().sum()

In [None]:
data2.head()

In [None]:
data2.shape

## Filtering High value customers
- Those who have recharged with an amount more than or equal to X, where X is the 70th percentile of the average recharge amount in the first two months (the good phase).

In [None]:
# Take the average recharge amount of months 6 and 7 and store it in a new column called average_rech_amt_6_7

data2['average_rech_amt_6_7']=(data2['total_rech_amt_6']+data2['total_rech_amt_7'])/2

In [None]:
# Check the quantile values, since customers at or above the 70th percentile are treated as high value customers
np.quantile(data2['average_rech_amt_6_7'],[0,0.5,0.7,1])

**Insights**:
- The 70th percentile value is about 368.5, which is the cutoff applied in the next cell

In [None]:
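# Keep customers at or above the 70th percentile; copy so that new columns can be added safely
data3=data2[data2['average_rech_amt_6_7']>=368.5].copy()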
data3.head()

In [None]:
data3.shape

## Insights:
- We now have 30011 rows of data, obtained by keeping only the high value customers.
- Let's continue our analysis with this data
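The cutoff can also be computed rather than typed in by hand. The following is a minimal sketch on invented numbers; the column names mirror the notebook, while `toy`, `cutoff` and `high_value` are hypothetical.

```python
import pandas as pd

# Ten illustrative customers with identical recharge amounts in months 6 and 7
amounts = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
toy = pd.DataFrame({"total_rech_amt_6": amounts, "total_rech_amt_7": amounts})

# Average recharge amount over the two "good phase" months
toy["average_rech_amt_6_7"] = (toy["total_rech_amt_6"] + toy["total_rech_amt_7"]) / 2

# 70th percentile of the average, then keep customers at or above it
cutoff = toy["average_rech_amt_6_7"].quantile(0.70)
high_value = toy[toy["average_rech_amt_6_7"] >= cutoff].copy()
print(cutoff, len(high_value))
```

Deriving the threshold from the data keeps the filter and the reported percentile in agreement if the input file changes.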

## Identify churners where all of the following fields are 0
- `total_ic_mou_9`
- `total_og_mou_9`
- `vol_2g_mb_9`
- `vol_3g_mb_9`

In [None]:
data3['total_ic_mou_9'].head()

In [None]:
data3['total_og_mou_9'].head()

In [None]:
data3['vol_2g_mb_9'].head()

In [None]:
data3['vol_3g_mb_9'].head()

### When all of the above fields are 0, we tag the customer as a churner

In [None]:
data3['churn']=((data3['total_ic_mou_9']==0.00) & (data3['total_og_mou_9']==0.00 ) & (data3['vol_2g_mb_9']==0.00) & (data3['vol_3g_mb_9']==0))

In [None]:
data3.head()

In [None]:
# funcc is a reusable helper that converts True/False to numerical format (1 = churn, 0 = non-churn)
def funcc(inp):
    return 1 if inp else 0

In [None]:
data3['churn']=data3['churn'].apply(funcc)
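The tagging rule (churn only when all four September usage fields are zero) can be sketched in one vectorized step. The toy values are invented; the column names are the ones used above.

```python
import pandas as pd

# Three illustrative customers
toy = pd.DataFrame({
    "total_ic_mou_9": [0.0, 12.5, 0.0],
    "total_og_mou_9": [0.0, 0.0, 0.0],
    "vol_2g_mb_9":    [0.0, 0.0, 3.2],
    "vol_3g_mb_9":    [0.0, 0.0, 0.0],
})

usage_cols = ["total_ic_mou_9", "total_og_mou_9", "vol_2g_mb_9", "vol_3g_mb_9"]

# A customer is a churner only if every usage column is exactly 0
toy["churn"] = (toy[usage_cols] == 0).all(axis=1).astype(int)
print(toy["churn"].tolist())
```

Only the first customer is tagged: the second still receives calls and the third still uses 2G data.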

In [None]:
temp = data3['churn'].value_counts()
df_1 = pd.DataFrame({'labels': temp.index,'values': temp.values})
df_1.iplot(kind='pie',labels='labels',values='values', title="% Data Imbalance") 

**Insights and Observations**
   - The churn rate of ~8.6% is low, so the classes are imbalanced and a model would be biased towards the majority class. We therefore need a class-balancing technique such as SMOTE on the training data before fitting the model

### Removing the '_9' and September columns after preparing the churn column

In [None]:
data3.shape

In [None]:
# Split the columns into those belonging to month 9 and the rest
eliminatecolumns=[i for i in data3.columns if '_9' in i]
listofcolumns2=[i for i in data3.columns if '_9' not in i]

In [None]:
print(len(listofcolumns2))

In [None]:
print(len(eliminatecolumns))

## Insights:
- We remove the columns that hold month 9 data; 43 columns match this condition and can be removed

In [None]:
# Keep only the columns that are required for further analysis (copy so the frame can be modified in place)
data4=data3[listofcolumns2].copy()
data4.shape

In [None]:
# Removing variable calculated in Sept 
data4.drop('sep_vbc_3g',axis=1,inplace=True)

## Insights:
- We remove the variable calculated in September, the last month of the data, which is the month we are making predictions for

# Step 3 Visualizing and Deriving variables

In [None]:
# List of columns left in dataset post cleaning
list(data4.columns)

In [None]:
# Visualizing the variables through box plot
def pltbox(r,c,columns):
    for i,col in zip(range(1,(r*c)+1),columns):
        plt.subplot(r,c,i)
        sns.boxplot(y = col,x='churn', data = data4, hue='churn',palette=("Set3"),showfliers=False)
        plt.tight_layout(pad=1.0)

In [None]:
# Plot the mean of each variable across the three months, split by churn status
def plot_mean_bar_chart(df,columns_list):
    df_0 = df[df.churn==0].filter(columns_list)
    df_1 = df[df.churn==1].filter(columns_list)

    # Use lists for the index labels; sets are unordered and not a valid index input
    mean_df_0 = pd.DataFrame([df_0.mean()],index=['Non Churn'])
    mean_df_1 = pd.DataFrame([df_1.mean()],index=['Churn'])

    frames = [mean_df_0, mean_df_1]
    mean_bar = pd.concat(frames)

    mean_bar.T.plot.line(figsize=(10,5),rot=0)
    plt.show()
    
    return mean_bar
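The table that `plot_mean_bar_chart` returns is a per-class mean for each month. A minimal sketch of the same computation with `groupby`, on invented ARPU values:

```python
import pandas as pd

# Four illustrative customers: two non-churners (0) and two churners (1)
toy = pd.DataFrame({
    "churn":  [0, 0, 1, 1],
    "arpu_6": [100.0, 200.0, 300.0, 500.0],
    "arpu_7": [110.0, 190.0, 250.0, 350.0],
    "arpu_8": [120.0, 180.0, 50.0, 150.0],
})

# Mean of each monthly column per churn class, with readable row labels
monthly_means = (
    toy.groupby("churn")[["arpu_6", "arpu_7", "arpu_8"]]
    .mean()
    .rename(index={0: "Non Churn", 1: "Churn"})
)
print(monthly_means)
```

In this toy data the churners' mean falls from 400 to 100 across the three months while the non-churners stay flat, which is the kind of pattern the line plots below are meant to reveal.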

## Total recharge number and amount

In [None]:
# Plotting the distribution
plt.figure(figsize=(15, 5))
pltbox(2,3,['total_rech_num_6','total_rech_num_7','total_rech_num_8','total_rech_amt_6','total_rech_amt_7','total_rech_amt_8'])
plt.show()

In [None]:
# plotting the mean across the months
plt.figure(figsize=(15, 5))
plot_mean_bar_chart(data4,['total_rech_num_6','total_rech_num_7','total_rech_num_8'])


In [None]:
# plotting the mean across the months
plt.figure(figsize=(15, 5))
plot_mean_bar_chart(data4,['total_rech_amt_6','total_rech_amt_7','total_rech_amt_8'])


**Insights**

The total recharge amount and the number of recharges drop from June to August for churned users

## Average revenue per user

In [None]:
# Plotting the distribution
plt.figure(figsize=(15, 5))
pltbox(1,3,['arpu_6','arpu_7','arpu_8'])
plt.show()

In [None]:
# plotting the mean across the months
plt.figure(figsize=(15, 5))
plot_mean_bar_chart(data4,['arpu_6','arpu_7','arpu_8'])


**Insights**

- There are negative values in these three variables; they can be brought to the same scale with the standard scaler
- Initially, the average revenue per user is much higher for churners than for non-churners, but it drops month by month and becomes very low in August

## Onnet v.s Offnet calls

In [None]:
# Plotting the distribution
plt.figure(figsize=(15, 5))
pltbox(2,3,['onnet_mou_6','onnet_mou_7','onnet_mou_8','offnet_mou_6','offnet_mou_7','offnet_mou_8'])
plt.show()

In [None]:
# plotting the mean across the months
plt.figure(figsize=(15, 5))
plot_mean_bar_chart(data4,['onnet_mou_6','onnet_mou_7','onnet_mou_8'])


In [None]:
# plotting the mean across the months
plt.figure(figsize=(15, 5))
plot_mean_bar_chart(data4,['offnet_mou_6','offnet_mou_7','offnet_mou_8'])


**Insights**

The minutes of usage decrease more steeply in August for both within-network and outside-network calls

## Roaming Incoming v.s Outgoing calls

In [None]:
# Plotting the distribution
plt.figure(figsize=(15, 5))
pltbox(2,3,['roam_ic_mou_6','roam_ic_mou_7','roam_ic_mou_8','roam_og_mou_6','roam_og_mou_7','roam_og_mou_8'])
plt.show()

In [None]:
# plotting the mean across the months
plt.figure(figsize=(15, 5))
plot_mean_bar_chart(data4,['roam_ic_mou_6','roam_ic_mou_7','roam_ic_mou_8'])


In [None]:
# plotting the mean across the months
plt.figure(figsize=(15, 5))
plot_mean_bar_chart(data4,['roam_og_mou_6','roam_og_mou_7','roam_og_mou_8'])


**Insights**

The roaming minutes of usage for churned users are much higher than for non-churners. The mean remains roughly constant throughout the three-month period. This could mean that many churners have moved out of state or travel most of the time

## Outgoing calls

### Outgoing calls within network v.s outside network

In [None]:
# Plotting the distribution
plt.figure(figsize=(15, 5))
pltbox(2,3,['loc_og_t2t_mou_6',
 'loc_og_t2t_mou_7',
 'loc_og_t2t_mou_8',
 'loc_og_t2m_mou_6',
 'loc_og_t2m_mou_7',
 'loc_og_t2m_mou_8'])
plt.show()

In [None]:
# plotting the mean across the months
plt.figure(figsize=(15, 5))
plot_mean_bar_chart(data4,['loc_og_t2t_mou_6',
 'loc_og_t2t_mou_7',
 'loc_og_t2t_mou_8'])


In [None]:
# plotting the mean across the months
plt.figure(figsize=(15, 5))
plot_mean_bar_chart(data4,['loc_og_t2m_mou_6',
 'loc_og_t2m_mou_7',
 'loc_og_t2m_mou_8'])


**Insights**

Outgoing calls within the network and outside the network reduce drastically by August for churned users

### Outgoing calls to fixed lines of T v.s its own call center

In [None]:
# Plotting the distribution
plt.figure(figsize=(15, 5))
pltbox(2,3,['loc_og_t2f_mou_6',
 'loc_og_t2f_mou_7',
 'loc_og_t2f_mou_8',
 'loc_og_t2c_mou_6',
 'loc_og_t2c_mou_7',
 'loc_og_t2c_mou_8'])
plt.show()

In [None]:
# plotting the mean across the months
plt.figure(figsize=(15, 5))
plot_mean_bar_chart(data4,['loc_og_t2f_mou_6',
 'loc_og_t2f_mou_7',
 'loc_og_t2f_mou_8'])


In [None]:
# plotting the mean across the months
plt.figure(figsize=(15, 5))
plot_mean_bar_chart(data4,['loc_og_t2c_mou_6',
 'loc_og_t2c_mou_7',
 'loc_og_t2c_mou_8'])


**Insights**

Outgoing calls to the network's call centre increased drastically in July and then reduced in August for churned users. This could be because these customers weren't happy with the services provided and had decided to leave the network

### Outgoing local calls - within same telecom circle

In [None]:
# Plotting the distribution
plt.figure(figsize=(15, 5))
pltbox(2,3,['loc_og_mou_6',
 'loc_og_mou_7',
 'loc_og_mou_8'])
plt.show()

In [None]:
# plotting the mean across the months
plt.figure(figsize=(15, 5))
plot_mean_bar_chart(data4,['loc_og_mou_6',
 'loc_og_mou_7',
 'loc_og_mou_8'])


**Insights**

Local outgoing calls reduce drastically by August for churned users

### Outgoing STD calls - within same telecom circle (within n/w v.s outside n/w)

In [None]:
# Plotting the distribution
plt.figure(figsize=(15, 5))
pltbox(2,3,['std_og_t2t_mou_6',
 'std_og_t2t_mou_7',
 'std_og_t2t_mou_8',
 'std_og_t2m_mou_6',
 'std_og_t2m_mou_7',
 'std_og_t2m_mou_8'])
plt.show()

In [None]:
# plotting the mean across the months
plt.figure(figsize=(15, 5))
plot_mean_bar_chart(data4,['std_og_t2t_mou_6',
 'std_og_t2t_mou_7',
 'std_og_t2t_mou_8'
 ])


In [None]:
# plotting the mean across the months
plt.figure(figsize=(15, 5))
plot_mean_bar_chart(data4,['std_og_t2m_mou_6',
 'std_og_t2m_mou_7',
 'std_og_t2m_mou_8'
 ])


**Insights**

STD outgoing calls within the network and outside the network reduce drastically by August for churned users

### Outgoing STD calls 

In [None]:
# Plotting the distribution
plt.figure(figsize=(15, 5))
pltbox(2,3,['std_og_t2f_mou_6',
 'std_og_t2f_mou_7',
 'std_og_t2f_mou_8',
 'std_og_mou_6',
 'std_og_mou_7',
 'std_og_mou_8',
 ])
plt.show()

In [None]:
# plotting the mean across the months
plt.figure(figsize=(15, 5))
plot_mean_bar_chart(data4,['std_og_t2f_mou_6',
 'std_og_t2f_mou_7',
 'std_og_t2f_mou_8'
 ])


In [None]:
# plotting the mean across the months
plt.figure(figsize=(15, 5))
plot_mean_bar_chart(data4,['std_og_mou_6',
 'std_og_mou_7',
 'std_og_mou_8',
 ])

**Insights**

STD outgoing calls reduce drastically from July to August for churned users

### ISD v.s Special outgoing calls

In [None]:
# Plotting the distribution
plt.figure(figsize=(15, 5))
pltbox(2,3,['isd_og_mou_6',
 'isd_og_mou_7',
 'isd_og_mou_8',
 'spl_og_mou_6',
 'spl_og_mou_7',
 'spl_og_mou_8'])
plt.show()

In [None]:
# plotting the mean across the months
plt.figure(figsize=(15, 5))
plot_mean_bar_chart(data4,['isd_og_mou_6',
 'isd_og_mou_7',
 'isd_og_mou_8',
 ])


In [None]:
# plotting the mean across the months
plt.figure(figsize=(15, 5))
plot_mean_bar_chart(data4,[ 'spl_og_mou_6',
 'spl_og_mou_7',
 'spl_og_mou_8'
 ])

**Insights**

ISD and special outgoing calls reduce drastically from July to August for churned users

### Others and total outgoing mou calls

In [None]:
# Plotting the distribution
plt.figure(figsize=(15, 5))
pltbox(2,3,['og_others_6',
 'og_others_7',
 'og_others_8',
 'total_og_mou_6',
 'total_og_mou_7',
 'total_og_mou_8'])
plt.show()

In [None]:
# plotting the mean across the months
plt.figure(figsize=(15, 5))
plot_mean_bar_chart(data4,['og_others_6',
 'og_others_7',
 'og_others_8'])


In [None]:
# plotting the mean across the months
plt.figure(figsize=(15, 5))
plot_mean_bar_chart(data4,['total_og_mou_6',
 'total_og_mou_7',
 'total_og_mou_8'])

**Insights**

Total outgoing calls reduce drastically from July to August for churned users

## Incoming calls

### Incoming calls within network v.s outside network

In [None]:
# Plotting the distribution
plt.figure(figsize=(15, 5))
pltbox(2,3,['loc_ic_t2t_mou_6',
 'loc_ic_t2t_mou_7',
 'loc_ic_t2t_mou_8',
 'loc_ic_t2m_mou_6',
 'loc_ic_t2m_mou_7',
 'loc_ic_t2m_mou_8'])
plt.show()

In [None]:
# plotting the mean across the months
plt.figure(figsize=(15, 5))
plot_mean_bar_chart(data4,['loc_ic_t2t_mou_6',
 'loc_ic_t2t_mou_7',
 'loc_ic_t2t_mou_8'])


In [None]:
# plotting the mean across the months
plt.figure(figsize=(15, 5))
plot_mean_bar_chart(data4,['loc_ic_t2m_mou_6',
 'loc_ic_t2m_mou_7',
 'loc_ic_t2m_mou_8'])


**Insights**

Incoming calls from within the network and from outside the network reduce drastically by August for churned users

### Incoming calls from fixed lines of T

In [None]:
# Plotting the distribution
plt.figure(figsize=(15, 5))
pltbox(2,3,['loc_ic_t2f_mou_6',
 'loc_ic_t2f_mou_7',
 'loc_ic_t2f_mou_8'])
plt.show()

In [None]:
# plotting the mean across the months
plt.figure(figsize=(15, 5))
plot_mean_bar_chart(data4,['loc_ic_t2f_mou_6',
 'loc_ic_t2f_mou_7',
 'loc_ic_t2f_mou_8'])


### Incoming local calls - within same telecom circle

In [None]:
# Plotting the distribution
plt.figure(figsize=(15, 5))
pltbox(2,3,['loc_ic_mou_6',
 'loc_ic_mou_7',
 'loc_ic_mou_8'])
plt.show()

In [None]:
# plotting the mean across the months
plt.figure(figsize=(15, 5))
plot_mean_bar_chart(data4,['loc_ic_mou_6',
 'loc_ic_mou_7',
 'loc_ic_mou_8'])


**Insights**

Local incoming calls reduce drastically by August for churned users

### Incoming STD calls - within same telecom circle (within n/w v.s outside n/w)

In [None]:
# Plotting the distribution
plt.figure(figsize=(15, 5))
pltbox(2,3,['std_ic_t2t_mou_6',
 'std_ic_t2t_mou_7',
 'std_ic_t2t_mou_8',
 'std_ic_t2m_mou_6',
 'std_ic_t2m_mou_7',
 'std_ic_t2m_mou_8'])
plt.show()

In [None]:
# plotting the mean across the months
plt.figure(figsize=(15, 5))
plot_mean_bar_chart(data4,['std_ic_t2t_mou_6',
 'std_ic_t2t_mou_7',
 'std_ic_t2t_mou_8'
 ])


In [None]:
# plotting the mean across the months
plt.figure(figsize=(15, 5))
plot_mean_bar_chart(data4,['std_ic_t2m_mou_6',
 'std_ic_t2m_mou_7',
 'std_ic_t2m_mou_8'
 ])


**Insights**

STD incoming calls from within the network and from outside the network reduce drastically by August for churned users

### Incoming STD calls

In [None]:
# Plotting the distribution
plt.figure(figsize=(15, 5))
pltbox(2,3,['std_ic_t2f_mou_6',
 'std_ic_t2f_mou_7',
 'std_ic_t2f_mou_8',
 'std_ic_mou_6',
 'std_ic_mou_7',
 'std_ic_mou_8',
 ])
plt.show()

In [None]:
# plotting the mean across the months
plt.figure(figsize=(15, 5))
plot_mean_bar_chart(data4,['std_ic_t2f_mou_6',
 'std_ic_t2f_mou_7',
 'std_ic_t2f_mou_8'
 ])


In [None]:
# plotting the mean across the months
plt.figure(figsize=(15, 5))
plot_mean_bar_chart(data4,['std_ic_mou_6',
 'std_ic_mou_7',
 'std_ic_mou_8',
 ])

**Insights**

STD incoming calls reduce drastically from July to August for churned users

### ISD v.s Special incoming calls

In [None]:
# Plotting the distribution
plt.figure(figsize=(15, 5))
pltbox(2,3,['isd_ic_mou_6',
 'isd_ic_mou_7',
 'isd_ic_mou_8',
 'spl_ic_mou_6',
 'spl_ic_mou_7',
 'spl_ic_mou_8'])
plt.show()

In [None]:
# plotting the mean across the months
plt.figure(figsize=(15, 5))
plot_mean_bar_chart(data4,['isd_ic_mou_6',
 'isd_ic_mou_7',
 'isd_ic_mou_8',
 ])


In [None]:
# plotting the mean across the months
plt.figure(figsize=(15, 5))
plot_mean_bar_chart(data4,[ 'spl_ic_mou_6',
 'spl_ic_mou_7',
 'spl_ic_mou_8'
 ])

**Insights**

ISD and special incoming calls reduce drastically from July to August for churned users

### Others and total incoming mou calls

In [None]:
# Plotting the distribution
plt.figure(figsize=(15, 5))
pltbox(2,3,['ic_others_6',
 'ic_others_7',
 'ic_others_8',
 'total_ic_mou_6',
 'total_ic_mou_7',
 'total_ic_mou_8'])
plt.show()

In [None]:
# plotting the mean across the months
plt.figure(figsize=(15, 5))
plot_mean_bar_chart(data4,['ic_others_6',
 'ic_others_7',
 'ic_others_8'])


In [None]:
# plotting the mean across the months
plt.figure(figsize=(15, 5))
plot_mean_bar_chart(data4,['total_ic_mou_6',
 'total_ic_mou_7',
 'total_ic_mou_8'])

**Insights**

Total incoming calls reduce drastically from July to August for churned users

## Max recharge v.s last day recharge amount

In [None]:
# Plotting the distribution
plt.figure(figsize=(15, 5))
pltbox(2,3,['max_rech_amt_6',
 'max_rech_amt_7',
 'max_rech_amt_8','last_day_rch_amt_6',
 'last_day_rch_amt_7',
 'last_day_rch_amt_8'])
plt.show()

In [None]:
# plotting the mean across the months
plt.figure(figsize=(15, 5))
plot_mean_bar_chart(data4,['max_rech_amt_6',
 'max_rech_amt_7',
 'max_rech_amt_8'])


In [None]:
# plotting the mean across the months
plt.figure(figsize=(15, 5))
plot_mean_bar_chart(data4,['last_day_rch_amt_6',
 'last_day_rch_amt_7',
 'last_day_rch_amt_8'])


**Insights**

The max and last-day recharge amounts reduce from July to August

## 2G / 3G packs

### Volume based 2g v.s 3g plans

In [None]:
# Plotting the distribution
plt.figure(figsize=(15, 5))
pltbox(2,3,['vol_2g_mb_6',
 'vol_2g_mb_7',
 'vol_2g_mb_8',
 'vol_3g_mb_6',
 'vol_3g_mb_7',
 'vol_3g_mb_8'])
plt.show()

In [None]:
# plotting the mean across the months
plt.figure(figsize=(15, 5))
plot_mean_bar_chart(data4,['vol_2g_mb_6',
 'vol_2g_mb_7',
 'vol_2g_mb_8'])


In [None]:
# plotting the mean across the months
plt.figure(figsize=(15, 5))
plot_mean_bar_chart(data4,['vol_3g_mb_6',
 'vol_3g_mb_7',
 'vol_3g_mb_8'])

**Insights**

The volume of 2g/3g mobile data usage reduces for churned customers

### monthly v.s sachet 2G plans

In [None]:
# The values in the monthly/sachet plan columns appear to be categories (months/days) chosen by different users
data4.monthly_2g_6.value_counts()

In [None]:
# Changing the data type for the categorical variables

#data4 = data4.astype({"monthly_2g_6":'object',"monthly_2g_7":'object',"monthly_2g_8":'object',"sachet_2g_6":'object',"sachet_2g_7":'object',"sachet_2g_8":'object'})
#data4[['monthly_2g_6',
# 'monthly_2g_7',
 #'monthly_2g_8']].info()

In [None]:
# Function to plot multiple bar charts
def pltvar(r,c,columns,rot=45):
    for i,col in zip(range(1,(r*c)+1),columns):
        plt.subplot(r,c,i)
        plt.title('Distribution of Categories in '+col+ ' Feature',size=10,color='Green')
        sns.countplot(x=col,hue='churn',data=data4)
        plt.xticks(rotation=rot)
        plt.tight_layout(pad=1.0)

In [None]:
plt.figure(figsize=(15,10))
pltvar(2,2,['monthly_2g_6',
 'monthly_2g_7',
 'monthly_2g_8'])

In [None]:
plt.figure(figsize=(15,10))
pltvar(2,2,['sachet_2g_6',
 'sachet_2g_7',
 'sachet_2g_8'])

**Insights**

The monthly/sachet plan usage is lowest among churned customers across all three months

### monthly v.s sachet 3g plans

In [None]:
data4.monthly_3g_6.value_counts()

In [None]:
# Changing the data type for the categorical variables

#data4 = data4.astype({"monthly_3g_6":'object',"monthly_3g_7":'object',"monthly_3g_8":'object',"sachet_3g_6":'object',"sachet_3g_7":'object',"sachet_3g_8":'object'})
#data4[['monthly_3g_6',
# 'monthly_3g_7',
# 'monthly_3g_8']].info()

In [None]:
plt.figure(figsize=(15,10))
pltvar(2,2,['monthly_3g_6',
 'monthly_3g_7',
 'monthly_3g_8'])

In [None]:
plt.figure(figsize=(15,10))
pltvar(2,2,['sachet_3g_6',
 'sachet_3g_7',
 'sachet_3g_8'])

**Insights**

The monthly/sachet plan usage reduces for churned customers from July to August

## Age on network

In [None]:
# Plotting the distribution
plt.figure(figsize=(15, 5))
pltbox(2,3,['aon'])
plt.show()

**Insights**

Churned users have spent a much shorter time on the network than non-churners

In [None]:
# plotting the mean across the months
plt.figure(figsize=(15, 5))
plot_mean_bar_chart(data4,['monthly_3g_6',
 'monthly_3g_7',
 'monthly_3g_8'])


In [None]:
# plotting the mean across the months
plt.figure(figsize=(15, 5))
plot_mean_bar_chart(data4,['sachet_3g_6',
 'sachet_3g_7',
 'sachet_3g_8'])

**Insights**

The mean monthly/sachet 3G plan usage reduces for churned customers from July to August

## Volume based cost 3g plans

In [None]:
# Plotting the distribution
plt.figure(figsize=(15, 5))
pltbox(2,3,['jun_vbc_3g','jul_vbc_3g','aug_vbc_3g'])  # chronological order: June, July, August
plt.show()

In [None]:
# plotting the mean across the months
plt.figure(figsize=(15, 5))
plot_mean_bar_chart(data4,['jun_vbc_3g','jul_vbc_3g','aug_vbc_3g'])  # chronological order: June, July, August


**Insights**

Volume-based cost for 3G reduces from June to August for churned users

In [None]:
data4.shape

## Outlier Treatment
- Many variables have outliers, as the box plots above show.
- They can be capped.

### Integer columns

In [None]:
int_columns=data4.select_dtypes(include='int64').columns
int_columns

In [None]:
# Exclude the target column from outlier treatment
int_columns=[col for col in int_columns if col != 'churn']
int_columns

In [None]:
quantiles=[0.0,0.05,0.1,0.2,0.9,0.95,0.99,1.0]
for col in int_columns:
    print("quantile values :",col)
    quantile_values=np.quantile(data4[col],quantiles)
    # Print each quantile level next to its value
    for q,value in zip(quantiles,quantile_values):
        print(q,'----', value)
    print()

## Insights and observations:

- The cell above shows that outliers are present in the majority of the columns: there is a big difference between the 99th percentile and the 100th percentile values


## Capping outliers

In [None]:

for col in int_columns:
    # The 0th percentile is the minimum, so in effect only the upper tail is capped at the 99th percentile
    lower, upper = data4[col].quantile([0.0,0.99]).values
    data4[col] = np.clip(data4[col], lower, upper)  # Replace the original feature with its capped version
data4.head()

In [None]:
quantiles=[0.0,0.05,0.1,0.2,0.9,0.95,0.99,1.0]
for col in int_columns:
    print("quantile values After capping :",col)
    quantile_values=np.quantile(data4[col],quantiles)
    # Print each quantile level next to its value
    for q,value in zip(quantiles,quantile_values):
        print(q,'----', value)
    print()

## Insights:
- The cell above shows that the values have been capped correctly: the 100th percentile now equals the 99th percentile value
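The capping step can be illustrated on a single toy series with one extreme value. This is a minimal sketch with invented data, using `Series.clip` in place of `np.clip`; both give the same result here.

```python
import numpy as np
import pandas as pd

# 100 illustrative values: 1..99 plus one extreme outlier
values = pd.Series(np.arange(1, 101, dtype=float))
values.iloc[-1] = 10_000.0

# Lower bound is the minimum (0th percentile), upper bound is the 99th percentile
lower, upper = values.quantile([0.0, 0.99])
capped = values.clip(lower=lower, upper=upper)

print(values.max(), capped.max())
```

Only the outlier changes; every other value lies inside the bounds and is left untouched.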

### Float variables

In [None]:
## Float

float_columns=data4.select_dtypes(include='float64').columns
float_columns

In [None]:
quantiles=[0.0,0.05,0.1,0.2,0.9,0.95,0.99,1.0]
for col in float_columns:
    print("quantile values :",col)
    quantile_values=np.quantile(data4[col],quantiles)
    # Print each quantile level next to its value
    for q,value in zip(quantiles,quantile_values):
        print(q,'----', value)
    print()

In [None]:

for col in float_columns:
    # The 0th percentile is the minimum, so in effect only the upper tail is capped at the 99th percentile
    lower, upper = data4[col].quantile([0.0,0.99]).values
    data4[col] = np.clip(data4[col], lower, upper)  # Replace the original feature with its capped version
data4.head()

In [None]:
quantiles=[0.0,0.05,0.1,0.2,0.9,0.95,0.99,1.0]
for col in float_columns:
    print("quantile values After capping :",col)
    quantile_values=np.quantile(data4[col],quantiles)
    # Print each quantile level next to its value
    for q,value in zip(quantiles,quantile_values):
        print(q,'----', value)
    print()

## Insights:
- The cell above shows that the values have been capped correctly
- We have now treated all the columns with integer and float datatypes

## Derived Variables
- Difference between the recharge/calls/mou in month 8 and the average of months 6 & 7
- These variables could carry important information about how behaviour differs between customers who are loyal and the ones who plan to churn

In [None]:
data4.shape

In [None]:
# Base names of the monthly variables for which a difference feature is derived
diff_bases = ['arpu','onnet_mou','offnet_mou','roam_ic_mou','roam_og_mou','loc_og_mou','std_og_mou',
              'isd_og_mou','spl_og_mou','total_og_mou','loc_ic_mou','std_ic_mou','isd_ic_mou',
              'spl_ic_mou','total_ic_mou','total_rech_num','total_rech_amt','max_rech_amt',
              'vol_2g_mb','vol_3g_mb']

# <base>_diff = month 8 value minus the average of months 6 and 7
for base in diff_bases:
    data4[base+'_diff'] = data4[base+'_8'] - ((data4[base+'_6'] + data4[base+'_7'])/2)

In [None]:
#inspecting one of the derived variables

data4[['arpu_diff','arpu_6','arpu_7','arpu_8']].head()

In [None]:
# Dropping the variables which have been used to create new variables 
rem_col_1 = []
col = ['arpu','onnet_mou','offnet_mou','roam_ic_mou','roam_og_mou','loc_og_mou','std_og_mou','isd_og_mou','spl_og_mou'
       ,'total_og_mou','loc_ic_mou','std_ic_mou','isd_ic_mou','spl_ic_mou','total_ic_mou','total_rech_num'
       ,'total_rech_amt','max_rech_amt','vol_2g_mb','vol_3g_mb']
for i in col:
    rem_col_1.append(i+'_6')
    rem_col_1.append(i+'_7')
    rem_col_1.append(i+'_8')

data5 = data4.drop(rem_col_1,axis=1)

In [None]:
data4.shape

In [None]:
data5.shape

In [None]:
# Defining numerical variables
df_numerical = data5.select_dtypes(exclude=['object'])
df_numerical.shape
df_refined = df_numerical.copy()
df_refined.shape

## Test-train split and Feature scaling

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Putting feature variable to X
X = df_refined.drop(['churn'], axis=1)

X.head()

In [None]:
# Putting response variable to y
y = df_refined['churn']

y.head()

In [None]:
# Splitting the data into train and test (75%/25%), stratified on the churn label
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.25,random_state=42,stratify=y)

### Feature Scaling in Training set

In [None]:
df_numerical_rem_churn = df_numerical.drop('churn',axis=1)
df_numerical_rem_churn.shape

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
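# Scaling the numerical variables: fit the scaler on the training set only,
# then apply the same transformation to the test set
scaler = StandardScaler()
num_cols = list(df_numerical_rem_churn.columns)

X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])

X_train.head()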
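
Standardisation learns a mean and standard deviation from the training data, and those same two numbers should then be reused for any data the model sees later. A minimal sketch with made-up values and a hypothetical helper `standardize`, using only the standard library:

```python
from statistics import mean, pstdev

# Made-up values for a single feature (illustrative only)
train = [10.0, 20.0, 30.0]
test = [40.0]

# Statistics are learned from the training data only
mu = mean(train)
sigma = pstdev(train)  # population standard deviation, as StandardScaler uses

def standardize(values):
    """Scale values with the training mean and standard deviation."""
    return [(v - mu) / sigma for v in values]

train_scaled = standardize(train)
test_scaled = standardize(test)  # reuses mu and sigma, no refitting on test data
print(train_scaled, test_scaled)
```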

## SMOTE technique to increase samples of churned customers compared to not churned

In [None]:
import imblearn
from imblearn.combine import SMOTETomek

In [None]:
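from imblearn.over_sampling import SMOTE
# sampling_strategy=0.5 oversamples churners to half the number of non-churners
smt = SMOTE(sampling_strategy=0.5, random_state=42)
X_train_SMOTE, y_train_SMOTE = smt.fit_resample(X_train, y_train)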

In [None]:
# checking the churn rate before and after applying SMOTE
churn_1 = (sum(y_train)/len(y_train))*100
churn_2 = (sum(y_train_SMOTE)/len(y_train_SMOTE))*100
print("Before SMOTE the number of records is {} and the churn rate is {}".format(len(y_train),round(churn_1,2)))
print("After SMOTE the number of records is {} and the churn rate is {}".format(len(y_train_SMOTE),round(churn_2,2)))
print('Number of records in X_train before SMOTE {}'.format(len(X_train)))
print('Number of records in X_train after SMOTE {}'.format(len(X_train_SMOTE)))

In [None]:
X_train.shape

In [None]:
X_train_SMOTE.shape

**Insights and Observations**

Upsampling raises the share of churn cases in the training set from about 8% to a churn-to-non-churn ratio of 0.5 (roughly one third of all training records). The models can now be built on the resampled training set.
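
The ratio of 0.5 passed to SMOTE is the target number of minority samples divided by the number of majority samples, so the resulting churn share follows from simple arithmetic. A minimal sketch with hypothetical class counts (the real counts come from `y_train`):

```python
# Hypothetical class counts before resampling (illustrative only)
n_non_churn = 9200
n_churn = 800

ratio = 0.5  # target minority / majority ratio

# After oversampling, the minority class has about ratio * majority samples
n_churn_after = int(ratio * n_non_churn)

share_before = n_churn / (n_churn + n_non_churn)
share_after = n_churn_after / (n_churn_after + n_non_churn)
print(round(share_before, 3), round(share_after, 3))
```

With a ratio of 0.5 the churn share ends up near one third; a ratio of 1.0 would be needed for a 50/50 split.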

# Step 4 Model Building and PCA

## PCA
- The training set has a large number of features, which can be reduced using PCA

In [None]:
from sklearn.decomposition import PCA

In [None]:
# Initialise PCA without any component size initially
pca=PCA(random_state=42) 

In [None]:
pca.fit(X_train_SMOTE)

In [None]:
pca.components_

In [None]:
pca.components_.shape

In [None]:
var_cumu = np.cumsum(pca.explained_variance_ratio_)
var_cumu

In [None]:
fig = plt.figure(figsize=[12,8])
#plt.vlines(x=60, ymax=1, ymin=0, colors="r", linestyles="--")
plt.hlines(y=0.95, xmin=0,xmax=120, colors="g", linestyles="--")
plt.plot(var_cumu)
plt.ylabel("Cumulative variance explained")
plt.show()
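
The number of components needed for a given share of variance can also be read directly from the cumulative sum instead of from the plot. A minimal sketch on made-up explained-variance ratios (in the notebook these would come from `pca.explained_variance_ratio_`):

```python
from itertools import accumulate

# Made-up explained-variance ratios, sorted in decreasing order (illustrative only)
ratios = [0.40, 0.25, 0.15, 0.10, 0.06, 0.04]

cumulative = list(accumulate(ratios))

# Smallest number of components whose cumulative variance reaches 95%
n_components = next(k for k, c in enumerate(cumulative, start=1) if c >= 0.95)
print(n_components)
```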

In [None]:
from sklearn.decomposition import IncrementalPCA

In [None]:
pca_final = IncrementalPCA(n_components=65)

In [None]:
X_train_SMOTE.shape

In [None]:
df_train_pca = pca_final.fit_transform(X_train_SMOTE)

In [None]:
df_train_pca.shape

### Applying PCA on test data set

In [None]:
df_test_pca = pca_final.transform(X_test)
df_test_pca.shape

## Running first model - Logistic Regression

### Applying logistic regression on our principal components

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

In [None]:
learner_pca = LogisticRegression()

In [None]:
model_pca = learner_pca.fit(df_train_pca, y_train_SMOTE)

## Insights:
- The AUC on the training set is 91 percent
- A high AUC (area under the ROC curve) indicates a good model, since it means the model reaches a high TPR at a low FPR
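
For reference, the AUC can be read as the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner. A minimal sketch of that definition on made-up labels and scores, with a hypothetical helper `roc_auc` (the notebook itself uses `metrics.roc_auc_score`):

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels (1 = churn) and predicted scores (illustrative only)
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(labels, scores)
print(auc)
```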

### Making predictions on train set

In [None]:
pred_probs_train =model_pca.predict_proba(df_train_pca)

In [None]:
# Calculating the roc_auc_score
"ROC AUC Score for training set is {:2.2}".format(metrics.roc_auc_score(y_train_SMOTE, pred_probs_train[:,1]))

## Insights:
- The AUC on the test data is 85 percent
- This AUC (area under the ROC curve) still indicates a good model, since a high AUC means a high TPR at a low FPR

### Making predictions on the test set

In [None]:
pred_probs_test = model_pca.predict_proba(df_test_pca)

In [None]:
"ROC AUC Score for test set is {:2.2}".format(metrics.roc_auc_score(y_test, pred_probs_test[:,1]))

**Insights and Observations**

The model gives reasonable results with 65 principal components and does not seem to overfit the training set. However, let's check how the results change when we let PCA choose the number of components needed to explain 95% of the variance.

## Apply PCA to explain 95 percent of the variance

In [None]:
pca_again = PCA(0.95)

In [None]:
# PCA on X_train
df_train_pca2 = pca_again.fit_transform(X_train_SMOTE)

In [None]:
df_train_pca2.shape

Following it up with a logistic regression model

In [None]:
learner_pca2 = LogisticRegression()

In [None]:
# Logistic regression on training set
model_pca2 = learner_pca2.fit(df_train_pca2, y_train_SMOTE)
model_pca2

In [None]:
# Transforming the x test based on PCA
df_test_pca2 = pca_again.transform(X_test)
df_test_pca2

In [None]:
df_test_pca2.shape

### Making predictions on train set

In [None]:
# Using the model to predict target variable - df_train_pca2
pred_probs_train2 =model_pca2.predict_proba(df_train_pca2)[:,1]

In [None]:
# Calculating the roc_auc_score
"ROC AUC Score for training set is {:2.2}".format(metrics.roc_auc_score(y_train_SMOTE, pred_probs_train2))

### Making predictions on the test set

In [None]:
# Using the model to predict target variable - df_test_pca2
pred_probs_test2 = model_pca2.predict_proba(df_test_pca2)[:,1]

In [None]:
"ROC AUC Score for test set is {:2.2}".format(metrics.roc_auc_score(y_test, pred_probs_test2))

## Insights:
- The results are the same whether the number of PCA components is selected manually or automatically
- Let's understand the results better by validating them

In [None]:
y_train_logistic_pred = pd.DataFrame({'Churn_Prob':pred_probs_train2})
y_train_logistic_pred


### Finding Optimal Cutoff point

In [None]:
# Let's create columns with different probability cutoffs 
numbers = [float(x)/10 for x in range(10)]
for i in numbers:
    y_train_logistic_pred[i]= y_train_logistic_pred.Churn_Prob.map(lambda x: 1 if x > i else 0)
y_train_logistic_pred.head()

In [None]:
# Now let's calculate accuracy sensitivity and specificity for various probability cutoffs.
cutoff_df = pd.DataFrame( columns = ['prob','accuracy','sensi','speci'])
from sklearn.metrics import confusion_matrix

# TP = confusion[1,1] # true positive 
# TN = confusion[0,0] # true negatives
# FP = confusion[0,1] # false positives
# FN = confusion[1,0] # false negatives

num = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
for i in num:
    cm1 = metrics.confusion_matrix(y_train_SMOTE, y_train_logistic_pred[i])
    total1=sum(sum(cm1))
    accuracy = (cm1[0,0]+cm1[1,1])/total1    
    speci = cm1[0,0]/(cm1[0,0]+cm1[0,1])
    sensi = cm1[1,1]/(cm1[1,0]+cm1[1,1])
    cutoff_df.loc[i] =[ i ,accuracy,sensi,speci]
print(cutoff_df)

In [None]:
# Let's plot accuracy sensitivity and specificity for various probabilities.
cutoff_df.plot.line(x='prob', y=['accuracy','sensi','speci'])
plt.show()

**Insights and Observations**

Sensitivity and specificity balance each other at a probability of about 0.5, which could be used as the cutoff to separate churn from non-churn cases.
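
The balance point between sensitivity and specificity can also be located programmatically rather than read off the plot. A minimal sketch on made-up labels and probabilities, with a hypothetical helper `sens_spec` (the notebook's own values live in `cutoff_df`):

```python
# Made-up labels (1 = churn) and predicted probabilities (illustrative only)
labels = [1, 1, 1, 0, 0, 0, 0, 1]
probs = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1, 0.2]

def sens_spec(labels, probs, cutoff):
    """Return (sensitivity, specificity) when predicting churn for prob > cutoff."""
    tp = sum(1 for y, p in zip(labels, probs) if y == 1 and p > cutoff)
    fn = sum(1 for y, p in zip(labels, probs) if y == 1 and p <= cutoff)
    tn = sum(1 for y, p in zip(labels, probs) if y == 0 and p <= cutoff)
    fp = sum(1 for y, p in zip(labels, probs) if y == 0 and p > cutoff)
    return tp / (tp + fn), tn / (tn + fp)

def gap(cutoff):
    """Absolute difference between sensitivity and specificity at a cutoff."""
    sensi, speci = sens_spec(labels, probs, cutoff)
    return abs(sensi - speci)

cutoffs = [i / 10 for i in range(1, 10)]
# The cutoff where sensitivity and specificity are closest to each other
best_cutoff = min(cutoffs, key=gap)
print(best_cutoff, sens_spec(labels, probs, best_cutoff))
```

Lowering the cutoff below this balance point raises sensitivity (recall) at the cost of specificity, which can be a reasonable trade when missing a churner is the more expensive error.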

In [None]:
pred_probs_test2

In [None]:
pred_probs_test3=pred_probs_test2.copy()
pred_probs_test3

In [None]:
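# Apply a 0.35 cutoff to the test-set probabilities; a cutoff below 0.5 favours recall over specificity
pred_probs_test3=(pred_probs_test3>=0.35)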
pred_probs_test3

In [None]:
pred_probs_test3=pred_probs_test3.astype('int')
pred_probs_test3

In [None]:
print(metrics.classification_report(y_test, pred_probs_test3))

In [None]:
print(metrics.confusion_matrix(y_test, pred_probs_test3))

In [None]:
print(metrics.accuracy_score(y_test, pred_probs_test3))

In [None]:
print(metrics.recall_score(y_test, pred_probs_test3))

In [None]:
print(metrics.precision_score(y_test, pred_probs_test3))

**Observations**

- Logistic regression on the PCA components gives a recall of 64% and an accuracy of 87%
- Next, let's fit a decision tree on the same PCA components

## Decision tree

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix,classification_report

In [None]:
dc=DecisionTreeClassifier(random_state=42)

In [None]:
dc.fit(df_train_pca2,y_train_SMOTE)

In [None]:
y_pred=dc.predict(df_test_pca2)
y_pred

In [None]:
print(classification_report(y_test,y_pred))

In [None]:
print(confusion_matrix(y_test,y_pred))

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
# Create the parameter grid based on the results of random search 
params = {
    'max_depth': [2, 3, 5, 10, 20,25],
    'min_samples_leaf': [5, 10, 20, 50, 100],
    'criterion': ["gini", "entropy"]
}

from sklearn.model_selection import StratifiedKFold
folds=StratifiedKFold(n_splits=5,shuffle=True,random_state=42)

In [None]:
# Instantiate the grid search model
grid_search = GridSearchCV(estimator=dc, 
                           param_grid=params, 
                           cv=folds, n_jobs=-1, verbose=1, scoring = "accuracy")
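
Grid search fits one model per parameter combination per fold, so the cost grows multiplicatively with every added option. A quick sketch that counts the fits for the decision-tree grid above (the final refit on the full training data is not included in the count):

```python
from itertools import product

# Same grid as defined above for the decision tree
params = {
    'max_depth': [2, 3, 5, 10, 20, 25],
    'min_samples_leaf': [5, 10, 20, 50, 100],
    'criterion': ["gini", "entropy"],
}
n_splits = 5  # number of stratified folds

# Every combination of parameter values is one candidate model
n_candidates = len(list(product(*params.values())))
n_fits = n_candidates * n_splits
print(n_candidates, n_fits)
```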

In [None]:
%%time
grid_search.fit(df_train_pca2, y_train_SMOTE)

In [None]:
grid_search.cv_results_

In [None]:
score_df = pd.DataFrame(grid_search.cv_results_)
score_df.head()

In [None]:
grid_search.best_estimator_

In [None]:
dt_best = grid_search.best_estimator_
dt_best

In [None]:
dc=DecisionTreeClassifier(criterion='entropy', max_depth=25, min_samples_leaf=5,random_state=42)

In [None]:
dc.fit(df_train_pca2,y_train_SMOTE)

In [None]:
y_pred=dc.predict(df_test_pca2)
y_pred

In [None]:
print(classification_report(y_test,y_pred))

In [None]:
print(confusion_matrix(y_test,y_pred))

In [None]:
print(metrics.accuracy_score(y_test,y_pred))

In [None]:
print(metrics.recall_score(y_test,y_pred))

In [None]:
print(metrics.precision_score(y_test,y_pred))

**Insights and Observations**
- The recall is 28% and the accuracy is 86%; the recall is much lower than what we obtained with logistic regression. We concentrate on recall rather than accuracy because we want to know how many of the customers who actually churned were correctly identified by the model
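
Why accuracy alone is misleading on imbalanced data can be seen with a trivial classifier that never predicts churn. A minimal sketch on made-up labels with an assumed churn rate of 8% (illustrative, not the notebook's test set):

```python
# Made-up ground truth: 8 churners out of 100 customers (illustrative only)
y_true = [1] * 8 + [0] * 92
# A trivial model that predicts "no churn" for everyone
y_pred = [0] * 100

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

accuracy = correct / len(y_true)
recall = true_pos / sum(y_true)
print(accuracy, recall)  # high accuracy, yet no churner is caught
```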

## Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

#### Grid search for hyper-parameter tuning

In [None]:
classifier_rf = RandomForestClassifier(random_state=42, n_jobs=-1)

In [None]:
classifier_rf.fit(df_train_pca2,y_train_SMOTE)

In [None]:
# Create the parameter grid based on the results of random search 
params = {
    'max_depth': [1, 2, 5, 10, 20],
    'min_samples_leaf': [5, 10, 20, 50, 100],
    'max_features': [2,3,4],
    'n_estimators': [10, 30, 50, 100, 200]
}

In [None]:
# Instantiate the grid search model
grid_search = GridSearchCV(estimator=classifier_rf, param_grid=params, 
                          cv=4, n_jobs=-1, verbose=1, scoring = "accuracy")

In [None]:
%%time
# Fit on the training split only so the test set stays unseen during tuning
grid_search.fit(X_train, y_train)

In [None]:
rf_best = grid_search.best_estimator_

In [None]:
rf_best

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score

### Evaluating model on training and testing set

In [None]:
# Random forest can cope with class imbalance, so the inputs here are not SMOTE- or PCA-treated
# On training set
print("Train Accuracy :", accuracy_score(y_train, rf_best.predict(X_train)))
print("Train Confusion Matrix:")
print(confusion_matrix(y_train, rf_best.predict(X_train)))
# On testing set
print("-"*50)
print("Test Accuracy :", accuracy_score(y_test, rf_best.predict(X_test)))
print("Test Confusion Matrix:")
print(confusion_matrix(y_test, rf_best.predict(X_test)))

**Insights and Observations**
- The recall and accuracy are clearly very poor, so the best-performing model remains logistic regression on the PCA components

### Variable importance in RandomForest

In [None]:
classifier_rf = RandomForestClassifier(random_state=42, n_jobs=-1, max_depth=20, n_estimators=200, oob_score=True)

In [None]:
classifier_rf.fit(X_train_SMOTE, y_train_SMOTE)

In [None]:
classifier_rf.feature_importances_

In [None]:
imp_df = pd.DataFrame({
    "Varname": X_train.columns,
    "Imp": classifier_rf.feature_importances_
})

In [None]:
imp_df.sort_values(by="Imp", ascending=False,inplace=True)

# Step 5 Model to define the relationship between features and the target variable

Using logistic regression and feature importance from random forest to identify useful features

In [None]:
# Extract the top 'n' features by random forest importance
# (imp_df is already sorted in descending order of importance)
top_n = 30
top_features = imp_df.Varname.iloc[:top_n]
top_features

## Using Logistic Regression on the above important features

In [None]:
interpret_learner = LogisticRegression()

In [None]:
X_train_int = X_train_SMOTE[top_features]
X_test_int = X_test[top_features]
print(X_train_int.shape)
print(X_test_int.shape)

In [None]:
model_interpret = interpret_learner.fit(X_train_int , y_train_SMOTE)

### Making predictions on train set

In [None]:
pred_probs_train =model_interpret.predict_proba(X_train_int)

In [None]:
# Calculating the roc_auc_score
"ROC AUC Score for training set is {:2.2}".format(metrics.roc_auc_score(y_train_SMOTE, pred_probs_train[:,1]))

### Making predictions on the test set

In [None]:
pred_probs_test = model_interpret.predict_proba(X_test_int)

In [None]:
"ROC AUC Score for test set is {:2.2}".format(metrics.roc_auc_score(y_test, pred_probs_test[:,1]))

## Extract the coefficients from the logistic model

In [None]:
# Coefficients of the interpretable logistic model, one column per feature
logistic_features = list(X_train_int.columns)
coefficients = pd.DataFrame(model_interpret.coef_.round(3), columns=logistic_features)

# coefficients.T.sort_values(by=0, ascending=False)  # optional: view sorted by value
coefficients
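
A logistic regression coefficient is easier to read as an odds ratio: exponentiating it gives the multiplicative change in the odds of churn for a one-unit increase in the feature (one standard deviation here, since the features were standardised). A minimal sketch with hypothetical feature names and coefficient values, not the fitted ones:

```python
import math

# Hypothetical coefficients (illustrative only, not taken from model_interpret)
coefs = {"feature_a": -0.7, "feature_b": 0.4}

# exp(coefficient) = factor by which the odds of churn are multiplied
odds_ratios = {name: math.exp(value) for name, value in coefs.items()}

for name, ratio in odds_ratios.items():
    direction = "lowers" if ratio < 1 else "raises"
    print(f"{name}: odds ratio {ratio:.2f} ({direction} the odds of churn)")
```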

**Business Insights**

- The telecom company should review the rates it offers when a customer is roaming or calling outside India (ISD). Higher rates push customers to consider switching to a cheaper network. There is a drastic drop in calls made while roaming or outside the country from June to August.

- The company can monitor when a customer reduces calls within or outside the network. Such a drop likely means the customer is trying to cut expenses and may be planning to move to another network. It could act as a trigger for a salesperson to reach out to the customer, resolve any pending issues, or offer discounts.

- An increase in incoming calls from other networks may mean that salespeople from competing operators are trying to poach the customer with attractive deals. This could also act as a trigger to get in touch with the customer.

- Another trigger could be a reduction in the recharge amount over consecutive months.

- The company should also pay attention to new customers, as churn is higher among those who have recently joined.


