# Figuring Out Which Customers May Leave - Churn Analysis

### A Notebook By Aneesh Chopra

## Dataset Context 

Each row represents a customer, and each column contains one of the customer’s attributes, as described in the column metadata.

The data set includes information about:

* Customers who left within the last month – the column is called Churn
* Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
* Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges
* Demographic info about customers – gender, age range, and if they have partners and dependents

## Features/Column/Attributes Explanation

* **customerID** - Customer ID
* **gender** - Whether the customer is male or female
* **SeniorCitizen** - Whether the customer is a senior citizen or not (1, 0)
* **Partner** - Whether the customer has a partner or not (Yes, No)
* **Dependents** - Whether the customer has dependents or not (Yes, No)
* **tenure** - Number of months the customer has stayed with the company
* **PhoneService** - Whether the customer has a phone service or not (Yes, No)
* **MultipleLines** - Whether the customer has multiple lines or not (Yes, No, No phone service)
* **InternetService** - Customer’s internet service provider (DSL, Fiber optic, No)
* **OnlineSecurity** - Whether the customer has online security or not (Yes, No, No internet service)
* **OnlineBackup** - Whether the customer has online backup or not (Yes, No, No internet service)
* **DeviceProtection** - Whether the customer has device protection or not (Yes, No, No internet service)
* **TechSupport** - Whether the customer has tech support or not (Yes, No, No internet service)
* **StreamingTV** - Whether the customer has streaming TV or not (Yes, No, No internet service)
* **StreamingMovies** - Whether the customer has streaming movies or not (Yes, No, No internet service)
* **Contract** - The contract term of the customer (Month-to-month, One year, Two year)
* **PaperlessBilling** - Whether the customer has paperless billing or not (Yes, No)
* **PaymentMethod** - The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
* **MonthlyCharges** - The amount charged to the customer monthly
* **TotalCharges** - The total amount charged to the customer
* **Churn** - Whether the customer churned or not (Yes or No)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

## Reading in the Dataset

In [None]:
org=pd.read_csv("/kaggle/input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv")
org.head()

## Data Cleaning Process

- First, we will make a copy of the dataset and work on that, so that we have the original dataset to fall back on in case of any mistakes
- At first glance, the customerID column seems redundant: the index already gives each customer a unique identifier, and the ID carries no additional information
- For standardization purposes, you can divide every cleaning operation into **Define**, **Code** and **Test** sections

Let me show you an example.

### Cleaning Operation 1

### Define
- Remove the customerID column since it is redundant

### Code

In [None]:
#making a copy of the original dataset
df=org.copy()

#Dropping the customerID column 
df.drop('customerID', axis=1, inplace= True)

### Test

- Always test that your operations have worked after every cleaning step
- You can use **assert** statements for testing
- If the condition in an **assert** statement is false, it raises an AssertionError
- If it is true, the rest of the code in the cell is executed without a problem
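
The behaviour of **assert** described above can be seen on a small example. This is only a sketch on a toy DataFrame (the values and the name `toy` are made up for illustration); it also shows the optional message that can follow the condition.

```python
import pandas as pd

# Toy frame standing in for the real dataset (hypothetical values)
toy = pd.DataFrame({"customerID": ["a1", "b2"], "tenure": [1, 34]})
toy = toy.drop(columns="customerID")

# Passes silently: the message after the comma is only shown on failure
assert "customerID" not in toy.columns, "customerID should have been dropped"

# A failing assertion raises AssertionError; we catch it here to inspect the message
try:
    assert "tenure" not in toy.columns, "tenure is still present"
except AssertionError as err:
    failure_message = str(err)

print(failure_message)
```

Adding a message makes a failed test much easier to diagnose than a bare `assert`.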


In [None]:
assert 'customerID' not in df.columns

- See, no error is thrown, which means customerID is no longer present in our dataset and we have successfully completed our first cleaning operation
- Let's continue toying with the data to see what other cleaning operations we need to perform

- **describe()** and **info()** are two of the most basic and useful methods for assessing our data and finding faults such as missing values, incorrect datatypes, outliers, etc.

In [None]:
df.info()

- There seems to be no missing data as of now, so let's continue

In [None]:
df.describe()

- According to the data context given to us, there are supposed to be 4 numeric columns
- **TotalCharges** is missing from the summary, so let's try to understand why
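
One plausible reason, sketched below on toy data, is that a few entries are not valid numbers. The blank string used here is an assumption for illustration, not something read from the dataset. A single such entry is enough to make pandas store the whole column as text, and `describe()` summarizes only numeric columns by default.

```python
import pandas as pd

# Toy column mimicking the issue: one entry is a blank string (hypothetical values)
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "108.15"]})

# The column is stored as text, not as float, because of the blank entry
dtype_before = toy["TotalCharges"].dtype

# errors="coerce" turns anything unparseable into NaN instead of raising
converted = pd.to_numeric(toy["TotalCharges"], errors="coerce")

print(dtype_before, converted.dtype, int(converted.isna().sum()))
```

After the conversion the column is `float64` and the unparseable entry shows up as a missing value.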

### Cleaning Operation 2

### Define

- Converting **TotalCharges** Column into numeric 

### Code

In [None]:
#Converting all possible values in the TotalCharges column into numeric and turning the rest into missing values (NaN)
df['TotalCharges']=pd.to_numeric(df['TotalCharges'], errors='coerce')

#Checking if there are any null values now
df.isnull().sum().sum()

In [None]:
df[df.isnull().any(axis=1)]

- There are now 11 missing values; we will take care of them in the next cleaning operation
- All the rows with missing values have tenure=0, which means this is the customer's first month. In this case, we can substitute the missing TotalCharges values with the MonthlyCharges values
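
The claim that every missing value belongs to a zero-tenure customer can itself be tested with assertions. The sketch below uses a toy frame with made-up values; on the real data the same two checks would be run against `df`.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): two brand-new customers without TotalCharges
toy = pd.DataFrame({
    "tenure": [0, 12, 0, 5],
    "MonthlyCharges": [52.55, 20.25, 80.85, 70.0],
    "TotalCharges": [np.nan, 243.0, np.nan, 350.0],
})

missing = toy["TotalCharges"].isna()

# Every row with a missing TotalCharges should be a brand-new customer
assert (toy.loc[missing, "tenure"] == 0).all()

# And no zero-tenure customer should already have a TotalCharges value
assert toy.loc[toy["tenure"] == 0, "TotalCharges"].isna().all()

print(int(missing.sum()), "rows missing, all with tenure == 0")
```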

### Test

In [None]:
assert df['TotalCharges'].dtype=='float64'

- The column has been successfully converted into a numerical column

- Comparing the predicted charges (**tenure * MonthlyCharges**, computed in the Feature Engineering section) with the actual **TotalCharges** shows a mean difference of about **3%** and a median difference of **1.995%**
- Thus, these predicted values seem like a sensible basis for imputing our missing values

### Cleaning Operation 3

### Define

- Impute missing values with MonthlyCharges

### Code

In [None]:
df.loc[(pd.isnull(df.TotalCharges)), 'TotalCharges'] = df.MonthlyCharges

### Test

In [None]:
assert len(df.loc[(pd.isnull(df.TotalCharges))])==0

- We have successfully imputed the values

- Now let's make sure every categorical column contains only the values listed in the column descriptions

In [None]:
#extracting all categorical columns
categorical_columns=df.select_dtypes(include='object').columns.tolist()
print(categorical_columns)

In [None]:
#Number of unique values allowed in each categorical column, according to the details provided to us
x=[2,2,2,2,3,3,3,3,3,3,3,3,3,2,4,2]
y=[]

#Number of unique values actually present in each column
for i in categorical_columns:
    y.append(df[i].nunique())

#Comparing expected (left) and actual (right) counts
for i,j in zip(x,y):
    print(i,j)

- All columns have the expected number of unique values
- It looks like no more cleaning needs to be done, so let's move on to exploration
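
Matching counts do not guarantee matching values: a column could have the right number of categories while one of them is misspelled. A stricter check, sketched here on toy data, compares the values themselves against the allowed sets from the column descriptions. The dictionary `allowed` and the deliberate typo are illustrative assumptions.

```python
import pandas as pd

# Hypothetical subset of the allowed values, taken from the column descriptions
allowed = {
    "Contract": {"Month-to-month", "One year", "Two year"},
    "InternetService": {"DSL", "Fiber optic", "No"},
}

toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Two year", "one year"],  # lowercase typo on purpose
    "InternetService": ["DSL", "No", "Fiber optic"],
})

# For each column, collect the values that are not in its allowed set
unexpected = {
    col: sorted(set(toy[col].unique()) - values)
    for col, values in allowed.items()
}
# Keep only the columns that actually contain something unexpected
unexpected = {col: vals for col, vals in unexpected.items() if vals}

print(unexpected)
```

Here both columns have exactly 3 unique values, so a count-based check would pass, yet the value check flags the typo in `Contract`.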

## Exploratory Data Analysis

In [None]:
# Summarize our dataset 
print ("Rows     : " ,df.shape[0])
print ("Columns  : " ,df.shape[1])
print ("\nFeatures : \n" ,df.columns.tolist())
print ("\nMissing values :  ", df.isnull().sum().values.sum())
print ("\nUnique values :  \n",df.nunique())

### 1-D Exploration
- Let's get the class label (customer churn) breakdown of the dataset given to us

In [None]:
df['Churn'].value_counts()/df.shape[0]

In [None]:
labels = df['Churn'].value_counts(sort = True).index
sizes = df['Churn'].value_counts(sort = True)

colors = ["green","red"]
explode = (0.05,0)  # explode 1st slice
 
plt.figure(figsize=(7,7))
# Plot
plt.pie(sizes, explode=explode, labels=labels, colors=colors, autopct='%1.1f%%', shadow=True, startangle=90,)

plt.title('Customer Churn Breakdown')
plt.show()

Side note: try to avoid pie charts if possible. It’s a common visualization joke that pie charts are the worst.

- Only use them when there are 2-4 unique values. The more categories there are, the harder it is to understand what the pie chart is trying to convey
- Pie charts can also be used to manipulate or hide facts. For example, the pie chart shows the distribution of customer churn, but it doesn't tell us how believable that distribution is, since we cannot see the sample size behind it
- This wouldn't be the case with a bar plot, since the counts would be shown on one of its axes
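
One way to address the sample-size concern is to always report the raw counts next to the percentages. The sketch below does this on a toy series (made-up labels standing in for `df['Churn']`; the name `summary` is hypothetical).

```python
import pandas as pd

# Toy labels standing in for df['Churn'] (hypothetical values)
churn = pd.Series(["No", "No", "Yes", "No", "Yes", "No", "No", "No"], name="Churn")

counts = churn.value_counts()
summary = pd.DataFrame({
    "count": counts,                                      # absolute sample size
    "share_pct": (counts / counts.sum() * 100).round(1),  # what the pie chart shows
})

print(summary)
```

Calling `summary["count"].plot.bar()` on such a table would give the bar-plot alternative mentioned above.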

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(20, 5))
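# distplot is deprecated in recent seaborn versions; histplot with kde=True is the closest replacement
sns.histplot(df["tenure"], kde=True, stat="density", color="skyblue", ax=axes[0])
sns.histplot(df['MonthlyCharges'], kde=True, stat="density", color='orange', ax=axes[1])
sns.histplot(df['TotalCharges'], kde=True, stat="density", color='green', ax=axes[2])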
fig.suptitle('Histogram of Numerical Columns')

#### Observations

* The tenure distribution is roughly uniform except for two peaks at the extremes, which suggests that there are at least 2 customer segments:
 1. Loyalists: customers who have remained with the telco for >70 months
 2. Newcomers: customers who have only just started using the telco's services


* No clear observations can be made from the MonthlyCharges histogram, but the telco seems to offer a base plan of about $20, which many customers are using

* The TotalCharges distribution resembles a long-tail distribution: as TotalCharges increases, the number of customers keeps decreasing

In [None]:
fig, axes = plt.subplots(3, 2, figsize=(20, 20))
# Pass the column as x= explicitly; recent seaborn versions no longer accept it positionally
fig1=sns.countplot( x=df["gender"] ,ax=axes[0,0])
fig2=sns.countplot( x=df["SeniorCitizen"] , ax=axes[0,1])
fig3=sns.countplot( x=df["Contract"] , ax=axes[2,0])
fig4=sns.countplot( x=df["PaymentMethod"] , ax=axes[2,1])
fig5=sns.countplot( x=df["Partner"] , ax=axes[1,0])
fig6=sns.countplot( x=df["Dependents"] , ax=axes[1,1])

figures=[fig1,fig2,fig3,fig4,fig5,fig6]

for graph in figures:
    graph.set_xticklabels(graph.get_xticklabels(),rotation=90)
    
    for p in graph.patches:
        height = p.get_height()
        graph.text(p.get_x()+p.get_width()/2., height,height ,ha="center")


fig.suptitle('')

#### Observations

- I only plotted the categorical features which seem important to me at the moment
- The gender and Partner features are evenly distributed across the dataset, with each unique value being almost equally represented
- Customers who have dependents are in the minority, as are customers who are senior citizens
- A large majority of customers are tied to the telco's services on a month-to-month basis, which gives them a lot of flexibility to move around and try out competitors
- A lot of customers also prefer electronic check as their payment method, maybe due to the ease of the process. No other inference can be made from this alone

### 2-D Plots

#### A) Numerical Columns vs Churn

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12, 7))
g = sns.violinplot(x="Churn", y = "MonthlyCharges",data = df, palette = "Pastel1",ax=axes[0])
g = sns.violinplot(x="Churn", y = "tenure",data = df, palette = "Pastel1",ax=axes[1])


#### Observations

- Many customers on the base plan of about $20 seem to stick with the telco's services, whereas a large number of customers with higher MonthlyCharges, in the 60-120 range, have left

- There is no observable tenure pattern among customers who have stayed, but newcomers make up a large portion of the customers who have churned


#### B) Categorical Columns vs Churn

In [None]:
fig, axes = plt.subplots(3, 2, figsize=(20, 20))
fig1=sns.countplot( x=df["Churn"],hue=df["gender"] ,ax=axes[0,0])
fig2=sns.countplot( x=df["Churn"],hue=df["SeniorCitizen"] , ax=axes[0,1])
fig3=sns.countplot( x=df["Churn"],hue=df["Contract"] , ax=axes[2,0])
fig4=sns.countplot( x=df["Churn"],hue=df["PaymentMethod"] , ax=axes[2,1])
fig5=sns.countplot( x=df["Churn"],hue=df["Partner"] , ax=axes[1,0])
fig6=sns.countplot( x=df["Churn"],hue=df["Dependents"] , ax=axes[1,1])

figures=[fig1,fig2,fig3,fig4,fig5,fig6]

for graph in figures:
    graph.set_xticklabels(graph.get_xticklabels(),rotation=90)
    
    for p in graph.patches:
        height = p.get_height()
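        # Annotate each bar with its share of the whole dataset (len(df) instead of a hard-coded row count)
        graph.text(p.get_x()+p.get_width()/2., height,round((height/len(df))*100,2) ,ha="center")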


fig.suptitle('')

#### Observations
- Gender doesn't seem to play a role in customer churn, as the distribution remains the same in both cases
- We see a shift in the distribution of the Partner category between customers who left and customers who stayed.
1. Among customers who stayed, there are slightly more customers who have partners (38.8%) than those who don't (34.66%)
2. Among customers who left, those who don't have partners (17.04%) are almost twice as numerous as those who do (9.5%)
- We already know that the majority of customers have a month-to-month contract, but there is also a huge difference in the ratios among churned customers, where month-to-month contract customers take up a huge chunk
- The case is similar for PaymentMethod, where electronic check plays the role that month-to-month plays for Contract
- Using these bar plots and annotations, we can see the absolute percentage each bar represents in the whole dataset. To know the relative percentages, or the probability of churn given each value of a variable, we need a better alternative, in this case a cross-tabulation
- A cross-tab will, in a sense, help us calculate probabilities such as **P(Churn = 'Yes'|Contract='Month-to-Month')**, which play a key role in the Naive Bayes classifier as well as in understanding the impact a variable has on the outcome
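
As a minimal sketch of that idea, `pd.crosstab` with `normalize="index"` divides each row by its row total, so every row can be read as a conditional distribution of Churn given that contract type. The toy data below is made up for illustration.

```python
import pandas as pd

# Toy data (hypothetical values): 4 month-to-month and 4 two-year customers
toy = pd.DataFrame({
    "Contract": ["Month-to-month"] * 4 + ["Two year"] * 4,
    "Churn": ["Yes", "Yes", "No", "No", "No", "No", "No", "Yes"],
})

# normalize="index" divides each row by its row total,
# so each row reads as P(Churn | Contract)
cond = pd.crosstab(toy["Contract"], toy["Churn"], normalize="index")

# P(Churn = 'Yes' | Contract = 'Month-to-month') on the toy data
p_churn_m2m = cond.loc["Month-to-month", "Yes"]

print(cond)
print(p_churn_m2m)
```

On this toy data, half of the month-to-month customers churn, compared with a quarter of the two-year customers.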

In [None]:
#First create a dataset which only has categorical columns for the cross-tab

#Method 1: Dropping all numerical columns MANUALLY
df_cat=df.drop(['MonthlyCharges', 'TotalCharges', 'tenure'], axis=1)
print(df_cat.shape)

#Method 2: Automatically recognise categorical columns as those with fewer than 6 unique values
cat_cols = df.nunique()[df.nunique() < 6].keys().tolist()
df_cat=df[cat_cols]
print(df_cat.shape)


In [None]:
summary = pd.concat([pd.crosstab(df_cat[x], df_cat.Churn) for x in df_cat.columns[:-1]], keys=df_cat.columns[:-1])
summary['Churn_Percentage'] = summary['Yes'] / (summary['No'] + summary['Yes'])

In [None]:
#Lets check cases where more than 1/3rd of the customers have left
summary[summary['Churn_Percentage']>0.33]

### Data Preprocessing and Feature Engineering

- In the data preprocessing phase, we transform our data into a format our model can train on. This involves encoding the categorical values and scaling the numerical values

- In the feature engineering section, we will try to build new features or modify existing ones, which in turn should help our model perform better

- Generally, doing this requires domain knowledge as well as an understanding of common feature engineering techniques
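
To make "encoding and scaling" concrete, here is a small sketch on a toy frame with made-up values, written with plain pandas rather than scikit-learn. It maps a binary column to 0/1, one-hot encodes a multi-valued column, and standardizes a numerical column using the population standard deviation, which is the same formula `StandardScaler` applies.

```python
import pandas as pd

toy = pd.DataFrame({
    "Partner": ["Yes", "No", "Yes", "No"],
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "tenure": [2.0, 4.0, 6.0, 8.0],
})

# Binary column -> 0/1
toy["Partner"] = (toy["Partner"] == "Yes").astype(int)

# Multi-valued column -> one indicator column per value
toy = pd.get_dummies(toy, columns=["Contract"])

# Standardization: subtract the mean, divide by the population standard deviation
toy["tenure"] = (toy["tenure"] - toy["tenure"].mean()) / toy["tenure"].std(ddof=0)

print(toy)
```

After scaling, the column has mean 0 and standard deviation 1, so features measured on very different scales become comparable.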


#### A) New Features

- We could multiply the **tenure** column by **MonthlyCharges** and compare the predicted values (tenure * MonthlyCharges) with the actual **TotalCharges** in the dataset. If the values differ by much, it means the customer's plan changed at some point.

1. If **tenure * MonthlyCharges** == **TotalCharges** -> Consistent Customer: he/she is probably satisfied with the service being provided so far
2. If **tenure * MonthlyCharges** < **TotalCharges** -> Profitable Customer: a customer would only increase his/her plan if he/she requires more services and/or his/her income level has increased
3. If **tenure * MonthlyCharges** > **TotalCharges** -> Declining Customer: the customer's income level has probably decreased, or he/she is dissatisfied with certain services of the telco and is trying out a competitor's services

- These are mere assumptions I have made to explain the reasons behind the changes. I could very easily be wrong, but let's check whether we can find anything new about our data
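
The three cases above could be turned into a categorical feature along these lines. This is only a sketch: the 5% tolerance (`tol`), the column name `segment` and the toy values are assumptions introduced for illustration, since exact equality between floating-point charges would almost never hold.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "tenure": [10, 10, 10],
    "MonthlyCharges": [50.0, 50.0, 50.0],
    "TotalCharges": [500.0, 650.0, 400.0],
})

predicted = toy["tenure"] * toy["MonthlyCharges"]

# Hypothetical tolerance: treat differences within 5% as "equal"
tol = 0.05
rel_diff = (predicted - toy["TotalCharges"]) / toy["TotalCharges"]

# Conditions are checked in order; the first one that holds wins
toy["segment"] = np.select(
    [rel_diff.abs() <= tol, rel_diff < 0, rel_diff > 0],
    ["Consistent", "Profitable", "Declining"],
    default="Unknown",
)

print(toy["segment"].tolist())
```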

In [None]:
#Creating a predicted values column
df['PCharges']=df['MonthlyCharges']*df['tenure']
#Creating a column with the signed percentage difference between predicted and actual values
df['PDifference']=(((df['PCharges']-df['TotalCharges'])/df['TotalCharges'])*100)



In [None]:
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
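# distplot is deprecated in recent seaborn versions; histplot with kde=True is the closest replacement
sns.histplot( df[df['Churn']=="No"]["PDifference"] , kde=True, stat="density", color="green",ax=axes[0])
sns.histplot( df[df['Churn']=="Yes"]["PDifference"] , kde=True, stat="density", color="red",ax=axes[1])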
df['PDifference'].describe()

In [None]:
df.drop(["PCharges"],axis=1,inplace=True)

In [None]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

target_col = ["Churn"]

#numerical columns
num_cols = [x for x in df.columns if x not in cat_cols + target_col]

#Binary columns with 2 values
bin_cols = df.nunique()[df.nunique() == 2].keys().tolist()

#Columns more than 2 values
multi_cols = [i for i in cat_cols if i not in bin_cols]

#Label encoding Binary columns
le = LabelEncoder()
for i in bin_cols :
    df[i] = le.fit_transform(df[i])
    
#One-hot encoding the multi-value columns (one indicator column per value)
df = pd.get_dummies(data = df, columns = multi_cols )
df.head()

In [None]:
#Scaling Numerical columns
std = StandardScaler()

# Scale data
scaled = std.fit_transform(df[num_cols])
scaled = pd.DataFrame(scaled,columns=num_cols)

#Dropping original values and merging scaled values for numerical columns
df_telcom_og = df.copy()
df = df.drop(columns = num_cols)
df = df.merge(scaled, left_index=True, right_index=True, how = "left")

df.head()

## Modelling 

In [None]:
from sklearn.model_selection import train_test_split

# We remove the label values from our training data
X = df.drop(['Churn'], axis=1).values

# We assigned those label values to our Y dataset
y = df['Churn'].values

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=109)

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn import metrics

print("Gaussian Naive Bayes Classifier Results")
#Create a Gaussian Classifier
gnb = GaussianNB()

#Train the model using the training sets
gnb.fit(X_train, y_train)

#Predict the response for test dataset
y_pred = gnb.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(metrics.accuracy_score(y_test,y_pred))

In [None]:
from sklearn.tree import DecisionTreeClassifier

parameters = {'max_depth':[1, 5, 10, 50],'min_samples_split':[5, 10, 100, 500]}
dec = DecisionTreeClassifier()
clf = GridSearchCV(dec, parameters, cv=3, scoring='accuracy',return_train_score=True)
clf.fit(X_train, y_train)
results = pd.DataFrame.from_dict(clf.cv_results_)
results_sort = results.sort_values(['mean_test_score'])
results_sort.tail()

In [None]:
print("Decision Tree Classifier Results")
#Create a Decision Tree Classifier with the parameters chosen from the grid search
dec = DecisionTreeClassifier(max_depth=5,min_samples_split=500)

#Train the model using the training sets
dec.fit(X_train, y_train)

#Predict the response for test dataset
y_pred = dec.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(metrics.accuracy_score(y_test,y_pred))

In [None]:
from xgboost.sklearn import XGBClassifier

parameters = {'max_depth':[1, 5, 10, 50],'min_child_weight':range(1,6,2)}
dec = XGBClassifier()
clf = GridSearchCV(dec, parameters, cv=3, scoring='accuracy',return_train_score=True)
clf.fit(X_train, y_train)
results = pd.DataFrame.from_dict(clf.cv_results_)
results_sort = results.sort_values(['mean_test_score'])
results_sort.tail()

In [None]:
print("XGBoost Classifier Results")
#Create an XGBoost Classifier with the parameters chosen from the grid search
dec = XGBClassifier(max_depth=1,min_child_weight=3)

#Train the model using the training sets
dec.fit(X_train, y_train)

#Predict the response for test dataset
y_pred = dec.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(metrics.accuracy_score(y_test,y_pred))

- Our accuracy peaks at 0.80265 with the XGBoost classifier
- Though the Naive Bayes classifier only reached .6994 accuracy, it was better at catching the churned customers, since it had great recall but poor precision
- Our main objective is to get better at predicting customers who leave/churn. Thus, we should try to create an ensemble of these two models that predicts the customers who leave better while maintaining high accuracy
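
The trade-off between recall, precision and accuracy is easier to see with numbers. The counts below are hypothetical (they are not the notebook's results) and are laid out the way `confusion_matrix` prints them for labels 0 and 1: actual classes in rows, predicted classes in columns.

```python
# Hypothetical confusion-matrix counts (not the notebook's results)
tn, fp = 70, 30   # actually stayed:  predicted stayed / predicted churned
fn, tp = 5, 45    # actually churned: predicted stayed / predicted churned

recall = tp / (tp + fn)        # share of actual churners that were caught
precision = tp / (tp + fp)     # share of churn predictions that were right
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"recall={recall:.2f} precision={precision:.2f} accuracy={accuracy:.2f}")
```

A model like this catches 90% of churners but is right only 60% of the time when it predicts churn, which is the "great recall but poor precision" pattern described above.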

In [None]:
from sklearn.calibration import CalibratedClassifierCV

gnb=CalibratedClassifierCV(gnb,method='isotonic')
gnb.fit(X_train,y_train)



In [None]:
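final_pred=[]
for dp in X_test:
    dp=dp.reshape(1, -1)
    #Class predictions (0 or 1) from each model
    gnb_prob=gnb.predict(dp)
    xgb_prob=dec.predict(dp)
    if gnb_prob[0]!=xgb_prob[0]:
        if gnb_prob[0]==0:
            #Only XGBoost predicts churn: predict churn
            final_pred.append(1)
        else:
            #Only Naive Bayes predicts churn: compare its churn probability
            #with XGBoost's probability of staying
            prob_1=gnb.predict_proba(dp)[0][1]
            prob_0=dec.predict_proba(dp)[0][0]
            if prob_1>=prob_0:
                final_pred.append(1)
            else:
                final_pred.append(0)
    else:
        #Both models agree
        final_pred.append(gnb_prob[0])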
    

In [None]:
y_pred=np.asarray(final_pred)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(metrics.accuracy_score(y_test,y_pred))


- As you can see, we were only able to increase our accuracy by about .001, but we were able to correctly predict more of the customers who left, at the cost of wrongly flagging more customers who stayed
- This happened because I built the ensemble so that a higher loss is associated with failing to predict customers who left
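
The same decision rule can also be written without a Python loop. The function below is a sketch of that rule using NumPy arrays; `combine` and its four argument names are hypothetical and the inputs are toy values, not model outputs from this notebook.

```python
import numpy as np

def combine(gnb_pred, xgb_pred, gnb_p1, xgb_p0):
    """Combine two binary predictions, leaning towards predicting churn (1).

    gnb_pred, xgb_pred: class predictions of the two models (0 or 1)
    gnb_p1: first model's probability of churn
    xgb_p0: second model's probability of staying
    """
    gnb_pred, xgb_pred = np.asarray(gnb_pred), np.asarray(xgb_pred)
    gnb_p1, xgb_p0 = np.asarray(gnb_p1), np.asarray(xgb_p0)

    out = gnb_pred.copy()                    # agreement: keep the shared label
    disagree = gnb_pred != xgb_pred
    out[disagree & (gnb_pred == 0)] = 1      # only the second model says churn -> churn
    contested = disagree & (gnb_pred == 1)   # only the first model says churn
    out[contested] = (gnb_p1[contested] >= xgb_p0[contested]).astype(int)
    return out

result = combine(
    gnb_pred=[0, 1, 0, 1, 1],
    xgb_pred=[0, 1, 1, 0, 0],
    gnb_p1=[0.1, 0.9, 0.4, 0.8, 0.55],
    xgb_p0=[0.9, 0.2, 0.3, 0.6, 0.7],
)
print(result.tolist())
```

In three of the four possible combinations the rule outputs churn whenever at least one model predicts it, which is what pushes recall up at the expense of precision.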


### Conclusion

- A large majority of customers are tied to the telco's services on a month-to-month basis, which gives them a lot of flexibility to move around and try out competitors

- The TotalCharges distribution resembles a long-tail distribution: as TotalCharges increases, the number of customers keeps decreasing

- The tenure distribution is roughly uniform except for two peaks at the extremes, which suggests that there are at least 2 customer segments:
 1. Loyalists: customers who have remained with the telco for >70 months
 2. Newcomers: customers who have only just started using the telco's services


- Gender doesn't seem to play a role in customer churn, as the distribution remains the same in both cases

- We see a shift in the distribution of the Partner category between customers who left and customers who stayed.
 1. Among customers who stayed, there are slightly more customers who have partners (38.8%) than those who don't (34.66%)
 2. Among customers who left, those who don't have partners (17.04%) are almost twice as numerous as those who do (9.5%)


- We were able to build an ensemble model with 80.36% accuracy using the XGBoost and Naive Bayes classifiers. To improve the model further, we could acquire more data or more features, or build an even more complex model, but we would have to take care not to overfit
