### Telco Customer Churn Analysis 

Customer churn - or attrition - measures the number of clients who discontinue a service (cellphone plan, bank account, SaaS application...) or stop buying products (retail, e-commerce...) in a given time period. This Telecom customer churn dataset is taken from the [IBM Watson Sample datasets](https://www.ibm.com/communities/analytics/watson-analytics-blog/guide-to-sample-datasets/).

This dataset contains 7043 records in total. Each record corresponds to a unique customer, identified by the feature customerID. The target column for classification is Churn, which indicates whether the customer churned or not. The dataset contains 21 columns in total, described below:

__customerID__ - Customer ID uniquely identifying the record of a customer

__gender__ - Customer gender (female, male)

__SeniorCitizen__ - Whether the customer is a senior citizen or not (1, 0)

__Partner__ - Whether the customer has a partner or not (Yes, No)

__Dependents__ - Whether the customer has dependents or not (Yes, No)

__tenure__ - Number of months the customer has stayed with the company

__PhoneService__ - Whether the customer has a phone service or not (Yes, No)

__MultipleLines__ - Whether the customer has multiple lines or not (Yes, No, No phone service)

__InternetService__ - Customer’s internet service provider (DSL, Fiber optic, No)

__OnlineSecurity__ - Whether the customer has online security or not (Yes, No, No internet service)

__OnlineBackup__ - Whether the customer has online backup or not (Yes, No, No internet service)

__DeviceProtection__ - Whether the customer has device protection or not (Yes, No, No internet service)

__TechSupport__ - Whether the customer has tech support or not (Yes, No, No internet service)

__StreamingTV__ - Whether the customer has streaming TV or not (Yes, No, No internet service)

__StreamingMovies__ - Whether the customer has streaming movies or not (Yes, No, No internet service)

__Contract__ - The contract term of the customer (Month-to-month, One year, Two year)

__PaperlessBilling__ - Whether the customer has paperless billing or not (Yes, No)

__PaymentMethod__ - The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))

__MonthlyCharges__ - The amount charged to the customer monthly

__TotalCharges__ - The total amount charged to the customer

__Churn__ - Whether the customer churned or not (Yes or No)


#### Updates

1. Updated with common function for model evaluation and changed parameter grid for random forest.

### Import libraries and dataset

In [None]:
#import basic libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go
from plotly import tools
import plotly.figure_factory as ff
init_notebook_mode(connected=True)

plt.style.use('fivethirtyeight')


In [None]:
#import the dataset
data = pd.read_csv('../input/WA_Fn-UseC_-Telco-Customer-Churn.csv')
#snapshot of data
data.head()

In [None]:
#column datatypes
data.info()

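#### Going by the data description, we have 4 numeric variables (SeniorCitizen, tenure, MonthlyCharges, TotalCharges) and 17 categorical variables (including the target). Note that TotalCharges is loaded as a text column and is converted to numeric in the feature engineering section.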

### Exploratory Data Analysis

In [None]:
#counts of customer churn cases vs not churn in dataset
target= data['Churn'].value_counts()
levels = ['No','Yes']
trace = go.Pie(labels=target.index,values=target.values,
               marker=dict(colors=('orange','green')))
layout = dict(title="Telco Customer Churn Ratio", margin=dict(l=150), width=500, height=500)
figdata = [trace]
fig = go.Figure(data=figdata, layout=layout)
iplot(fig)
#print target class counts
print(target)


The pie chart shows that 26.5% of the customers in the dataset churned.
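
The churn share can also be computed directly from the class counts instead of being read off the pie chart. A minimal sketch, using a small made-up `toy` frame as a stand-in for the notebook's `data` (the values are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `data` frame (values are made up).
toy = pd.DataFrame({"Churn": ["No", "No", "No", "Yes", "No", "Yes", "No", "No"]})

# normalize=True returns class proportions instead of raw counts.
churn_share = toy["Churn"].value_counts(normalize=True)
print(churn_share.round(3))
```

On the real dataset, `data['Churn'].value_counts(normalize=True)` should give the same share as the pie chart.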

In [None]:
#Helper: bar chart of churn vs. not-churn counts for a two-level column
def bar_plot(col,data,barmode='group',width=800,height=600,color1='orange',color2='purple'):
    values = list(data[col].value_counts().keys())
    if sorted(values) == [0,1]:
        #relabel 0/1 flags (e.g. SeniorCitizen) as No/Yes; this updates the caller's DataFrame
        data[col] = data[col].replace({0:'No',1:'Yes'})
        values = list(data[col].value_counts().keys())
    #churn counts for each of the two levels of the column
    tr1 = data[data[col]==values[0]]['Churn'].value_counts().to_dict()
    tr2 = data[data[col]==values[1]]['Churn'].value_counts().to_dict()
    #.get(...,0) avoids a KeyError if a level has no customers in one class
    trace1 = go.Bar(y=[tr1.get('No',0), tr2.get('No',0)], name="Not Churn", x=values, marker=dict(color=color1))
    trace2 = go.Bar(y=[tr1.get('Yes',0), tr2.get('Yes',0)], name="Churn", x=values, marker=dict(color=color2))
    traces = [trace1, trace2]
    layout = go.Layout(
        barmode=barmode,xaxis = dict(title=col),yaxis=dict(title='Count'),
    title='Effect of '+ col + ' on Customer Churn',width=width,height=height)
    fig = go.Figure(data=traces, layout=layout)
    iplot(fig)


In [None]:
#Comparison of churn between male and female
bar_plot('gender',data)

There is almost no difference in churn ratio between male and female customers.

In [None]:
#Let's visualize the churn ratio for senior citizens
bar_plot('SeniorCitizen',data,barmode='stack',width=600,height=400,color1='orange',color2='green')

The churn rate for senior citizens is significantly higher than for non-senior citizens.

In [None]:
#let's visualize the impact of having partner on customer churn
bar_plot('Partner',data,barmode='stack',width=600,height=400,color1='blue',color2='pink')

From the bar chart above, we can see that the churn ratio for people with a partner is lower than for those without one.

In [None]:
#effect of having dependents on churn
bar_plot('Dependents',data,barmode='stack',width=600,height=400)

People without dependents have a higher churn ratio than those with dependents.
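
The comparisons above are read off bar heights, which is hard to do when the groups differ in size. A minimal sketch of computing the churn ratio per group instead; `churn_rate_by` is a hypothetical helper (not part of the original notebook) and `toy` is made-up data standing in for `data`:

```python
import pandas as pd

# Hypothetical toy data; the real notebook would pass `data` instead.
toy = pd.DataFrame({
    "Partner": ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"],
    "Churn":   ["No",  "No",  "No",  "Yes", "Yes", "Yes", "No", "No"],
})

def churn_rate_by(df, col):
    """Return the share of churned customers within each level of `col`."""
    # normalize="index" makes every row sum to 1, so the 'Yes' column is the churn rate.
    table = pd.crosstab(df[col], df["Churn"], normalize="index")
    return table["Yes"].sort_values(ascending=False)

rates = churn_rate_by(toy, "Partner")
print(rates)
```

The same call with `'gender'`, `'SeniorCitizen'` or `'Dependents'` would put a number on each of the observations above.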

In [None]:
#effect of phone service on churn
bar_plot('PhoneService',data)

The first observation from the graph above is that most people have phone service. Those who don't are probably using only the telecom company's internet service. Also, people with phone service have a higher churn ratio.

In [None]:
#let's check effect of PaperlessBilling
bar_plot('PaperlessBilling',data)

People with paperless billing have a higher churn ratio than those without it. Also, more people have paperless billing. It's good that people prefer eco-friendly bills!

In [None]:
# values = list(data['gender'].value_counts().keys())
# tr1 = data[data['gender']==values[0]]['Churn'].value_counts().to_dict()
# tr1['No']

In [None]:
#counts of contract types
fig = plt.gcf()
fig.set_size_inches( 7, 5)
plt.title('Counts of contract types')
#pass the column by keyword; newer seaborn versions no longer accept it positionally
sns.countplot(x='Contract', data=data)


Most customers are on a month-to-month contract.

In [None]:
#Use of different Internet service lines
fig = plt.gcf()
fig.set_size_inches( 7, 5)
plt.title('Counts of different internet service lines')
sns.countplot(x='InternetService', data=data)

Most of the people who have internet service use fiber optic lines.

In [None]:
#Churn ratio with respect to internet service type
fig = plt.gcf()
plt.title('Churn ratio with respect to internet service type')
fig.set_size_inches( 8, 6)
sns.countplot(x='InternetService', hue='Churn', data=data)

People using a fiber optic line for internet have a higher churn ratio than those with a DSL line.

In [None]:
#counts of different bill payment methods using pie chart
target= data['PaymentMethod'].value_counts()
levels = ['Electronic check','Mailed check','Bank transfer','Credit card']
trace = go.Pie(labels=target.index,values=target.values
               )
layout = dict(title="Telco Customer Payment Method", margin=dict(l=50), width=800, height=500)
figdata = [trace]
fig = go.Figure(data=figdata, layout=layout)
iplot(fig)

Around one third of customers use Electronic check to pay their telecom bills.

In [None]:
#Churn ratio analysis for different bill payment methods
fig = plt.gcf()
fig.set_size_inches( 12, 8)
plt.title('Churn ratio analysis for different bill payment methods')
sns.countplot(x='PaymentMethod', hue='Churn', data=data)

It is clear that the churn ratio for people paying by Electronic check is much higher than for the other payment methods.

In [None]:
data['OnlineSecurity'].value_counts()

In the OnlineSecurity column above, we can replace "No internet service" with "No". The same applies to other columns such as OnlineBackup, DeviceProtection, TechSupport, StreamingTV and StreamingMovies.

In [None]:
internet_features = ['OnlineSecurity','OnlineBackup' ,'DeviceProtection' ,
                     'TechSupport' ,'StreamingTV' ,'StreamingMovies','InternetService']

In [None]:
#replace No internet service with No
data[internet_features]=data[internet_features].replace('No internet service','No')

In [None]:
#let's verify it
data['OnlineSecurity'].value_counts()

In [None]:
#churn ratio for column Online security
bar_plot('OnlineSecurity',data)

Clearly, the churn ratio for customers without the online security feature is higher than for those who have it.

In [None]:
#Churn ratio for StreamingTV
bar_plot('StreamingTV',data)

The churn ratio for people with the Streaming TV service is higher than for those without it.

In [None]:
#churn ratio for people having StreamingMovie service
bar_plot('StreamingMovies',data)

As with StreamingTV, the churn ratio for people with the StreamingMovies service is higher. It seems these two streaming services may be factors in churn.

In [None]:
#Churn ratio for feature tech support
bar_plot('TechSupport',data)

People without tech support have a higher churn ratio than those with it. Maybe the telecom company could offer a discount on tech support charges so that more customers take it up.

In [None]:
#churn ratio for column onlinebackup
bar_plot('OnlineBackup',data)

People who don't have the online backup service have a higher churn ratio than people who have it.

In [None]:
plt.figure(figsize=(10,6))
ax = sns.boxplot(x='Churn', y = 'tenure', data=data)
ax.set_title('Effect of Tenure length on Churn', fontsize=18)
ax.set_ylabel('Tenure', fontsize = 15)
ax.set_xlabel('Churn', fontsize = 15)

It seems that customers churn more in the initial period: in the box plot, the median tenure of churned customers is roughly 10 months, well below that of retained customers. The company should put some extra focus on newer customers.

In [None]:
plt.figure(figsize=(10,6))
ax = sns.boxplot(x='Churn', y = 'MonthlyCharges', data=data)
ax.set_title('Effect of Monthly Charges on Churn', fontsize=18)
ax.set_ylabel('Charges', fontsize = 15)
ax.set_xlabel('Churn', fontsize = 15)

Customers who churn pay higher monthly charges than customers who do not churn.
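
The two box plots can be backed up with group medians. A minimal sketch on made-up numbers (the `toy` frame is hypothetical and only mirrors the column names of `data`):

```python
import pandas as pd

# Hypothetical values; only the column names match the real dataset.
toy = pd.DataFrame({
    "Churn": ["Yes", "Yes", "Yes", "No", "No", "No"],
    "tenure": [2, 10, 15, 30, 40, 60],
    "MonthlyCharges": [80.0, 75.0, 95.0, 55.0, 65.0, 20.0],
})

# Median tenure and monthly charges for churned vs. retained customers.
summary = toy.groupby("Churn")[["tenure", "MonthlyCharges"]].median()
print(summary)
```

Running the same `groupby` on `data` gives the medians that the box plots draw as the centre lines.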

### Feature engineering

In [None]:
# Converting Total Charges to a numerical data type.
data.TotalCharges = pd.to_numeric(data.TotalCharges, errors='coerce')

In [None]:
#Let's check for nulls first
nulls = data.isnull().sum()
nulls[nulls > 0]

In [None]:
#impute missing values with 0
data.fillna(0,inplace=True)
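
To see why nulls appear only after the conversion, here is a minimal sketch of how `errors='coerce'` treats a blank entry. The `raw` series is made up and assumes that the unparseable values are whitespace-only strings:

```python
import pandas as pd

# Hypothetical raw column with one blank entry, as it might be read from the CSV.
raw = pd.Series(["29.85", " ", "1889.5"], name="TotalCharges")

# errors="coerce" turns values that cannot be parsed into NaN instead of raising.
converted = pd.to_numeric(raw, errors="coerce")
n_missing = int(converted.isnull().sum())

# Same imputation as in the notebook: replace the missing values with 0.
filled = converted.fillna(0)
print(n_missing, filled.tolist())
```

Filling with 0 is one possible choice; checking the `tenure` of the affected rows is a quick way to judge whether it is a sensible one.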

In [None]:
#new feature - Internet(Yes- have internet service, No- do not have internet service)
data['Internet'] = data['InternetService'].apply(lambda x : x if x=='No' else 'Yes')

In [None]:
data['Internet'].value_counts()

In [None]:
data['MultipleLines'].value_counts()

In [None]:
#replace No phone service with No (assign back rather than using inplace on a column selection)
data['MultipleLines'] = data['MultipleLines'].replace('No phone service','No')

In [None]:
#features (X) and target (y)
y = data['Churn'].map({'Yes':1,'No':0})
X = data.drop(labels=['Churn','customerID'],axis=1).copy()

In [None]:
#find list of categorical columns for encoding
cat_cols = []
for column in X.columns:
    if column not in ['tenure','MonthlyCharges','TotalCharges']:
        cat_cols.append(column)

In [None]:
#Convert categorical columns to binary
X= pd.get_dummies(X,columns=cat_cols)


In [None]:
#preview of the data after conversion of categorical features
X.head()
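
As a small illustration of what `pd.get_dummies` does to a categorical column, here is a sketch on a hypothetical three-row frame. The `drop_first=True` variant is shown only as an option; it is not what the notebook uses:

```python
import pandas as pd

# Hypothetical mini-frame with one numeric and one categorical column.
toy = pd.DataFrame({
    "tenure": [1, 24, 60],
    "Contract": ["Month-to-month", "One year", "Two year"],
})

# One indicator column per category level; numeric columns are left untouched.
encoded = pd.get_dummies(toy, columns=["Contract"])
print(list(encoded.columns))

# Optional: drop the first level to avoid perfectly redundant indicator columns.
encoded_dropped = pd.get_dummies(toy, columns=["Contract"], drop_first=True)
print(list(encoded_dropped.columns))
```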

### Modelling

In [None]:
#import ML models and metrics
from sklearn.metrics import f1_score,roc_curve
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

In [None]:
#create separate train and test splits for validation
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
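
Because only about a quarter of the customers churn, one option (not used in this notebook) is to stratify the split so that train and test keep the same churn share. A minimal sketch on synthetic labels; `X_demo` and `y_demo` are made-up stand-ins for `X` and `y`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 1000 rows, 25% positive class.
X_demo = np.arange(2000).reshape(1000, 2)
y_demo = np.array([1] * 250 + [0] * 750)

# stratify=y_demo keeps the class proportions (almost) identical in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101, stratify=y_demo
)
print(y_tr.mean(), y_te.mean())
```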

In [None]:
#fit the model; return accuracy, ROC-AUC and churn probabilities on the test split
def evaluate_model(model):
    model.fit(X_train,y_train)
    prediction_test = model.predict(X_test)
    accuracy = metrics.accuracy_score(y_test, prediction_test)
    #ROC-AUC is computed from scores, not hard 0/1 labels, so use the probability of class 1
    proba_test = model.predict_proba(X_test)[:,1]
    rocauc = metrics.roc_auc_score(y_test, proba_test)
    return accuracy,rocauc,proba_test
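
A note on the ROC-AUC score: it is meant to be computed from scores or probabilities. When it is given hard 0/1 predictions, the curve has a single threshold and the result equals the balanced accuracy. A minimal sketch on made-up numbers (`y_true` and `proba` are hypothetical):

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Hypothetical labels and predicted churn probabilities.
y_true = np.array([0, 0, 0, 1, 1, 1])
proba = np.array([0.10, 0.40, 0.45, 0.35, 0.80, 0.90])

# Hard predictions at the usual 0.5 threshold, as model.predict would return.
hard = (proba >= 0.5).astype(int)

auc_from_proba = roc_auc_score(y_true, proba)   # uses the full ranking
auc_from_labels = roc_auc_score(y_true, hard)   # single-threshold curve
bal_acc = balanced_accuracy_score(y_true, hard)
print(auc_from_proba, auc_from_labels, bal_acc)
```

For models that expose `predict_proba`, passing `model.predict_proba(X_test)[:, 1]` to `roc_auc_score` and `roc_curve` gives the threshold-independent score.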

#### Logistic regression

In [None]:
# Running logistic regression model
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(C=0.1)
acc,rocauc,testpred_lr  = evaluate_model(lr)
print('Logistic Regression...')
print('Accuracy score :',acc)
print('ROC-AUC score :',rocauc)
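
`StandardScaler` is imported above but not used. Since regularised logistic regression is sensitive to feature scale (for example `TotalCharges` versus 0/1 indicator columns), one possible variation is to scale inside a pipeline. This is a sketch on synthetic data, not the notebook's model; `X_demo`, `y_demo` and `max_iter=1000` are assumptions made for the example:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the encoded Telco features.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.73, 0.27], random_state=0
)
X_demo[:, 0] *= 1000  # mimic one large-valued column such as TotalCharges

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101
)

# The scaler is fitted on the training split only, then applied to the test split.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(C=0.1, max_iter=1000))
scaled_lr.fit(X_tr, y_tr)
demo_accuracy = scaled_lr.score(X_te, y_te)
print(demo_accuracy)
```

Because the pipeline has `fit`, `predict` and `predict_proba`, it can be passed to `evaluate_model` in place of `lr`.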

#### Random forest Classifier

In [None]:
#evaluate_model fits the model, so no separate fit call is needed
rf = RandomForestClassifier()
acc,rocauc,testpred_rf  = evaluate_model(rf)
print('Random Forest...')
print('Accuracy score :',acc)
print('ROC-AUC score :',rocauc)

#### So far, logistic regression is performing better than random forest (with default parameters). Let's try tuning the random forest parameters using random search.

##### Parameter Tuning for random Forest

In [None]:
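#set up search grid
#Number of trees in the forest
n_estimators=range(50,100)
# Number of features to consider at every split
# ('auto' was removed in scikit-learn 1.3; for classifiers it was an alias of 'sqrt',
#  so 'log2' is used here as the second option)
max_features = ['sqrt', 'log2']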
# Maximum number of levels in tree
max_depth = range(4,8)
# Minimum number of samples required to split a node
min_samples_split = range(2,6)
# Minimum number of samples required at each leaf node
min_samples_leaf = range(1,5)
# Method of selecting samples for training each tree
bootstrap = [True, False]
#criterion
criterion=['gini','entropy']
#create the random grid
random_grid = {'n_estimators':n_estimators,
              'max_features':max_features,
              'max_depth':max_depth,
              'min_samples_split':min_samples_split,
              'min_samples_leaf':min_samples_leaf,
              'bootstrap':bootstrap,
              'criterion':criterion}
print(random_grid)

In [None]:
# Use the random grid to search for best hyperparameters
# First create the base model to tune
rf = RandomForestClassifier(random_state=2018)

# Random search of parameters, using 3 fold cross validation, 
# search across 100 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, cv = 3, verbose=2, n_iter=100,random_state=42, n_jobs = -1)

# Fit the random search model
rf_random.fit(X_train,y_train)

In [None]:
#best params
params = rf_random.best_params_
params
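
By default `RandomizedSearchCV` ranks candidates with the estimator's own score, which is accuracy for classifiers. If ROC-AUC is the metric of interest, it can be requested with `scoring='roc_auc'`. A deliberately small sketch on synthetic data (`X_demo`, `y_demo` and `small_grid` are assumptions for the example, not the notebook's grid):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic, imbalanced stand-in for X_train / y_train (not the Telco data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, weights=[0.73, 0.27], random_state=0
)

# A reduced grid so the example finishes quickly.
small_grid = {
    "n_estimators": list(range(10, 30)),
    "max_depth": list(range(4, 8)),
    "min_samples_leaf": list(range(1, 5)),
}

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=2018),
    param_distributions=small_grid,
    n_iter=5,            # only 5 sampled combinations in this sketch
    cv=3,
    scoring="roc_auc",   # rank candidates by cross-validated ROC-AUC
    random_state=42,
)
search.fit(X_demo, y_demo)
print(search.best_params_)
print(round(search.best_score_, 3))
```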

In [None]:
#evaluate_model fits the model, so no separate fit call is needed
rfc = RandomForestClassifier(**params,random_state=42)
acc,rocauc,testpred_rfc  = evaluate_model(rfc)
print('Random Forest Optimized...')
print('Accuracy score :',acc)
print('ROC-AUC score :',rocauc)

##### Feature Importance - random forest

In [None]:
indices = np.argsort(rfc.feature_importances_)[::-1]
indices = indices[:45]

# Visualise these with a barplot
plt.subplots(figsize=(20, 15))
g = sns.barplot(y=X.columns[indices], x = rfc.feature_importances_[indices], orient='h')
g.set_xlabel("Relative importance",fontsize=12)
g.set_ylabel("Features",fontsize=12)
g.tick_params(labelsize=9)
g.set_title("RandomForest feature importance");
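
After one-hot encoding, the importance of one original feature is spread over several indicator columns. A rough way to compare original features is to sum the importances per column prefix. The sketch below uses made-up importance values and assumes that original column names contain no underscore, so the text before the first `_` identifies the source column:

```python
import pandas as pd

# Hypothetical importances indexed by encoded column name (values sum to 1).
importances = pd.Series(
    [0.30, 0.20, 0.15, 0.05, 0.20, 0.10],
    index=[
        "tenure",
        "MonthlyCharges",
        "Contract_Month-to-month",
        "Contract_One year",
        "Contract_Two year",
        "gender_Male",
    ],
)

# Group encoded columns by the original column name and add up their importances.
grouped = (
    importances.groupby(lambda name: name.split("_")[0])
    .sum()
    .sort_values(ascending=False)
)
print(grouped)
```

With the fitted model, the input series could be built as `pd.Series(rfc.feature_importances_, index=X.columns)`.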

#### ROC curve for both models

In [None]:
#we define a plot_multiple_roc to visualise all the model curves together

def plot_multiple_roc(y_preds, y_test, model_names):
    
    fig, ax = plt.subplots(figsize=(8, 8))
    
    
    for i in range (0, len(y_preds)):
        false_positive_rate, true_positive_rate, threshold = roc_curve(y_test, y_preds[i])
        label = ""
        if len(model_names) > i:
            label = model_names[i]
        ax.plot(false_positive_rate, true_positive_rate, label=label)
    ax.plot([0, 1], [0, 1], 'k--', linewidth=.5)
    ax.grid(True)
    
    ax.set(title='ROC Curves for telecom customer churn problem',
           xlabel = 'False positive Rate', ylabel = 'True positive rate')
        
    if len(model_names) > 0:
        plt.legend(loc=4)

In [None]:
validation_probs_fs = []
validation_probs_fs.append(testpred_lr)
validation_probs_fs.append(testpred_rfc)

In [None]:
all_models_names = ['logistic reg', 'Random_forest']
plot_multiple_roc(validation_probs_fs, y_test, all_models_names)

We can see that the ROC AUC for logistic regression is higher than for random forest. So, __Logistic regression__ is the winner here!

### Conclusion

From the accuracy scores of both models, we can see that __Logistic regression__ outperforms the random forest. Some of the key takeaways from EDA and data analysis are:
1. There is almost no difference in churn ratio between male and female customers.
2. The churn ratio for senior citizens is significantly higher. The company should focus on the specific needs of senior citizens.
3. People with paperless billing have a higher churn ratio! Maybe the people who have their bills delivered in hard copy (paper) are loyal customers of the company.
4. People paying bills via Electronic check have a higher churn ratio. The company should focus on them and ask whether they face any difficulties in paying bills via electronic check.
5. People without TechSupport from the telecom company have a higher churn rate. If the company charges high rates for TechSupport, it should consider offering discounts so that more customers can take up the service.
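
Since the comparison above rests on accuracy, it helps to keep the class imbalance in mind: a model that always predicts "No" already gets an accuracy equal to the share of non-churners while finding no churners at all. A minimal sketch with synthetic labels that mirror the roughly 26.5% churn share:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels with the same class balance as the dataset (26.5% churn).
y_true = np.array([0] * 735 + [1] * 265)

# A trivial "model" that never predicts churn.
always_no = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, always_no)
baseline_recall = recall_score(y_true, always_no)
print(baseline_accuracy, baseline_recall)
```

Any accuracy score reported for the models is best read against this baseline, alongside the recall on the churn class.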