# Telecom Churn Modelling

![](https://drive.google.com/file/d/1brxIhmq4Fo7fLbR1jmT6W4oyFTVo5Jeh/view?usp=sharing) 

# Table of Contents
1. Exploratory Data Analysis (EDA).
2. Visual Data Analysis (VDA).
3. Data Preprocessing.
4. Splitting Data and Scaling it.
5. Building Benchmark model.
6. Building and Evaluating models.
7. Model Validation.
8. Optimization and Hyper-parameter tuning.

In [None]:
# importing libraries

# Warning
import warnings
warnings.filterwarnings("ignore")

# for Data
import pandas as pd
import numpy as np
import math

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style(style='whitegrid')
sns.set(font_scale=1.5);
%matplotlib inline


# Preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler; std_scaler = StandardScaler()

# Splitting
from sklearn.model_selection import train_test_split


# Models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier as RandomForest
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier


# Metrics
from sklearn.metrics import fbeta_score, accuracy_score, make_scorer, classification_report
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score
from sklearn.model_selection import GridSearchCV


In [None]:
# reading dataset
data = pd.read_csv('../input/WA_Fn-UseC_-Telco-Customer-Churn.csv')

# 1. Exploratory Data Analysis (EDA)

#### A glance at the dataset

In [None]:
data.head()

#### Dataset Information

In [None]:
# information about the dataset structure
data.info()

- The dataset contains 21 columns and 7043 rows.
- `data.info()` reports no null values to impute.
- 18 columns have the `object` dtype, which indicates that a lot of labeling and encoding will be needed during the preprocessing step.

#### Fixing `TotalCharges` datatype

In [None]:
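# convert_objects was removed from pandas; to_numeric with errors='coerce' turns unparseable entries into NaN
data['TotalCharges'] = pd.to_numeric(data['TotalCharges'], errors='coerce')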
data['TotalCharges'].dtype
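
A numeric conversion like this can quietly introduce missing values when a column holds entries that are not numbers. Here is a minimal sketch on a toy column (the values and the blank entry are assumptions for illustration, not rows from the dataset) showing how to count what was coerced to NaN and one possible way to handle it.

```python
import pandas as pd

# Toy stand-in for a charges column stored as strings, with one blank
# entry (assumed data, for illustration only).
toy = pd.DataFrame({"TotalCharges": ["29.85", "1889.5", " ", "108.15"]})

# errors="coerce" turns anything that cannot be parsed into NaN
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# Count how many values were lost in the conversion
n_missing = int(toy["TotalCharges"].isna().sum())
print(toy["TotalCharges"].dtype, n_missing)

# One possible follow-up: drop the rows that could not be parsed
toy_clean = toy.dropna(subset=["TotalCharges"])
```

Whether dropping or imputing is the better choice depends on how many rows are affected, so it is worth checking the count on the real column first.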

#### Getting Statistical insights

In [None]:
# Numerical features stats
data.describe()

`SeniorCitizen` is a binary categorical feature, so its summary statistics are not meaningful.


In [None]:
data.describe(include=['object'])

#### Preparing data before visual data exploration

In the `MultipleLines` feature, a customer who has no phone service cannot have multiple lines, so I will replace the 'No phone service' values with just 'No'. Similarly, in `OnlineSecurity`, `OnlineBackup`, `DeviceProtection`, `TechSupport`, `StreamingTV` and `StreamingMovies`, 'No internet service' will be replaced by 'No'.

In [None]:
for dataset in [data]:
    dataset['MultipleLines'] = dataset['MultipleLines'].replace({'No phone service':'No'})
    dataset['OnlineSecurity'] = dataset['OnlineSecurity'].replace({'No internet service':'No'})
    dataset['DeviceProtection'] = dataset['DeviceProtection'].replace({'No internet service':'No'})
    dataset['TechSupport'] = dataset['TechSupport'].replace({'No internet service':'No'})
    dataset['StreamingTV'] = dataset['StreamingTV'].replace({'No internet service':'No'})
    dataset['OnlineBackup'] = dataset['OnlineBackup'].replace({'No internet service':'No'})
    dataset['StreamingMovies'] = dataset['StreamingMovies'].replace({'No internet service':'No'})
    
print ("Number of unique values in each column\n")
for col_name in data.columns:
 print(col_name,": " ,data[col_name].nunique())
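
The same replacement can also be written as one call over a list of columns. This is only a sketch on a toy frame (the three columns and their values are assumed for illustration); `service_cols` and `no_service_labels` are names I made up here.

```python
import pandas as pd

# Toy frame with the two "no service" labels (assumed data)
toy = pd.DataFrame({
    "MultipleLines": ["No phone service", "Yes", "No"],
    "OnlineSecurity": ["No internet service", "No", "Yes"],
    "StreamingTV": ["Yes", "No internet service", "No"],
})

# Both labels collapse to a plain "No"
no_service_labels = {"No phone service": "No", "No internet service": "No"}
service_cols = ["MultipleLines", "OnlineSecurity", "StreamingTV"]

# DataFrame.replace applies the mapping to every selected column at once
toy[service_cols] = toy[service_cols].replace(no_service_labels)
print(toy)
```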

# 2. Visual Data Analysis (VDA)

In [None]:
# count of customers churn

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20,7))
ax = data['Churn'].value_counts().plot(kind='pie',autopct='%.1f%%', ax=axes[0])
ax.set_title('Number of customer churn', fontsize=15)


ax = sns.countplot(y='Churn', data=data, ax=axes[1]);
for i,j in enumerate(data["Churn"].value_counts().values) : 
    ax.text(.1,i,j,fontsize = 20,color = "k")

ax.set_title('Number of customer churn', fontsize=15)


###### It seems like the dataset is partially imbalanced
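
To put a number on the imbalance, the class shares can be computed directly. This sketch uses a toy `Churn` column (assumed values); the share of the majority class is also the accuracy that a model which always predicts "No" would get, which is a useful floor to keep in mind later.

```python
import pandas as pd

# Toy target column (assumed data, not the real class distribution)
churn = pd.Series(["No", "No", "No", "Yes", "No", "Yes", "No", "No"], name="Churn")

# Share of each class
class_share = churn.value_counts(normalize=True)

# Accuracy of a trivial model that always predicts the majority class
majority_baseline = class_share.max()
print(class_share)
print(majority_baseline)
```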

In [None]:
fig, ax = plt.subplots(figsize=(15,9))
sns.violinplot(x="gender", y="tenure", hue='Churn', data=data, split=True, bw=0.05 , palette='husl', ax=ax)
plt.title('Churn by gender ')
plt.show()

In [None]:
# factorplot was renamed to catplot (and its size argument to height) in newer seaborn releases
g = sns.catplot(x="InternetService", y="tenure", hue="Churn", col="gender", data=data, kind="swarm", dodge=True, palette='husl', height=8, aspect=.9, s=8)

##### Customers who have Fiber optic internet service are more likely to churn, which is surprising because fiber optic implies high internet quality and speed. My assumption is that Fiber optic customers pay high monthly charges and that customers with high charges are more likely to churn, so I will investigate this in the following graphs.

In [None]:
fig, ax = plt.subplots(figsize=(12, 5))

ax = sns.distplot(data[data['Churn']=='Yes']['MonthlyCharges'],label='Churn', bins= 10, kde=True)
#ax = sns.distplot(data[data['Churn']=='No']['MonthlyCharges'],label='Not Churn', bins= 18, kde=False)

ax.legend()
ax.set_title('Monthly Charges distribution for Churn customers')
plt.show()

##### Customers have a higher probability of churning when monthly charges are high, which agrees with my hypothesis. The last part of my hypothesis is to show that Fiber optic customers pay high monthly charges compared to other internet services, which eventually leads to their churn.

In [None]:
fig, ax = plt.subplots(figsize=(20,12))
# newer seaborn releases require x, y and hue to be passed as keywords
ax = sns.stripplot(x='InternetService', y='MonthlyCharges', hue='Churn', data=data,
                   palette="husl", size=15, marker="D",
                   edgecolor="red", alpha=.30)

##### This supports my hypothesis: as you can see, Fiber optic is the most expensive internet service, and customers who have high monthly charges have a higher tendency to leave the company. This is why customers who subscribe to the Fiber optic service are more likely to leave the company even though they get high-quality, high-speed internet.

###### This observation is very important for the company: it could target this customer segment with offers that reduce the monthly charges on the fiber optic service, or suggest DSL, for example, to customers who can't afford fiber optic in the long run.
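
The plots can be backed up with a small table of average monthly charges and churn rate per internet service. Here is a minimal sketch on toy rows (the numbers are made up for illustration, so the output says nothing about the real dataset); the column names match the dataset, while `churned`, `mean_monthly` and `churn_rate` are helper names I introduced.

```python
import pandas as pd

# Toy rows (assumed data) with the same column names as the dataset
toy = pd.DataFrame({
    "InternetService": ["Fiber optic", "Fiber optic", "Fiber optic", "DSL", "DSL", "No"],
    "MonthlyCharges": [95.0, 89.0, 101.0, 55.0, 49.0, 20.0],
    "Churn": ["Yes", "Yes", "No", "No", "Yes", "No"],
})

# Boolean churn flag so that its mean is the churn rate
summary = (
    toy.assign(churned=toy["Churn"].eq("Yes"))
       .groupby("InternetService")
       .agg(mean_monthly=("MonthlyCharges", "mean"),
            churn_rate=("churned", "mean"))
)
print(summary)
```

Running the same aggregation on `data` would show whether the service with the highest average charge is also the one with the highest churn rate.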

In [None]:
FacetGrid = sns.FacetGrid(data, hue='Churn', aspect=4)
FacetGrid.map(sns.kdeplot, 'tenure', shade=True)
FacetGrid.set(xlim=(0, data['tenure'].max()))
FacetGrid.add_legend()

##### Customers tend to churn early in their subscription (at low tenure), so the company may need to target new customers with marketing campaigns and offers.

In [None]:
FacetGrid = sns.FacetGrid(data, hue='Churn', aspect=4)
FacetGrid.map(sns.kdeplot, 'TotalCharges', shade=True)
FacetGrid.set(xlim=(0, data['TotalCharges'].max()))
FacetGrid.add_legend()

##### Customers with low total charges are more likely to churn.

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(16, 7))

# count of churn customers by payment method
ax = sns.countplot(y='PaymentMethod', hue='Churn', data=data, palette="husl", ax=axes[0]);

ax.set_yticklabels(ax.get_yticklabels(), rotation=30, ha="right")
# horizontal bars: counts are on the x-axis, payment methods on the y-axis
ax.set_xlabel('Number of Customers', fontsize = 12)
ax.set_ylabel('Payment Method', fontsize = 12)

ax.set_title('Count of churn customers by Payment Method', fontsize=15)





# count of customers by payment method
ax = sns.countplot(x='PaymentMethod', data=data, palette="husl", ax=axes[1]);

ax.set_xticklabels(ax.get_xticklabels(), rotation=30, ha="right")
ax.set_xlabel('Payment Method', fontsize = 12)
ax.set_ylabel('Number of Customers', fontsize = 12)

ax.set_title('Count customers by Payment Method', fontsize=15)


##### It's obvious that customers who pay by Electronic check have a high tendency to churn

In [None]:
fig, axes = plt.subplots(nrows=5, ncols=3, figsize=(26, 24))

plot = ['gender', 'SeniorCitizen', 'Partner', 'Dependents',
        'PhoneService', 'MultipleLines', 'InternetService', 'TechSupport',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'Contract',
       'StreamingTV', 'StreamingMovies', 'PaperlessBilling']
index = 0
for i in range(5):
    for j in range(3):

        ax = sns.countplot(x=plot[index], hue='Churn', data=data, palette="husl", ax=axes[i,j]);
        index+=1



- __gender__: Women are only slightly more likely to churn than men, so I guess gender has little impact on churn.
- __SeniorCitizen__: Seniors are more likely to churn.
- __Partner__: Customers who have no partner are more likely to churn.
- __Dependents__: Having dependents makes customers less likely to churn.
- __MultipleLines__: Having multiple lines makes customers more likely to churn.
- __Contract__: Customers on Month-to-month contracts are more likely to churn.
- __PaperlessBilling__: A lot of customers who have paperless billing have churned, so the company needs to investigate its system for the problem.
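
Count plots make it hard to compare groups of different sizes, so a row-normalised cross-tabulation is a handy companion to the observations above. This is a sketch on toy rows (assumed data); on the real frame the same call would be `pd.crosstab(data['Contract'], data['Churn'], normalize='index')`.

```python
import pandas as pd

# Toy rows (assumed data) for one categorical feature and the target
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "Month-to-month",
                 "One year", "One year", "Two year"],
    "Churn": ["Yes", "Yes", "No", "No", "Yes", "No"],
})

# normalize="index" makes each row sum to 1, so the "Yes" column is the churn rate
rates = pd.crosstab(toy["Contract"], toy["Churn"], normalize="index")
print(rates)
```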

# 3. Data Preprocessing

#### Removing `customerID` as it doesn't contribute to the prediction of churn.

In [None]:
df = data.copy()

In [None]:
df = df.drop('customerID', axis=1)

##### Getting dummy variables for categorical columns of more than two values

In [None]:
# Getting dummy variables for these columns
# using drop_first=True in order to avoid the dummy variables trap

df = pd.get_dummies(data = df,columns = ['InternetService', 'Contract', 'PaymentMethod'], drop_first=True )

#### Encoding categorical variables

In [None]:
enc = LabelEncoder()
df = df.apply(enc.fit_transform)
df.head()
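
Applying `LabelEncoder` to every column also re-codes the numeric columns (each distinct value is replaced by its rank). An alternative, sketched below on a toy frame with assumed values, is to encode only the non-numeric columns so that numeric features keep their original values. This is a variation for comparison, not what the cell above does, and it would change the model results reported later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame (assumed data) mixing categorical and numeric columns
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Male"],
    "Partner": ["Yes", "No", "Yes"],
    "tenure": [1, 34, 2],
    "MonthlyCharges": [29.85, 56.95, 53.85],
})

encoded = toy.copy()

# Pick out the columns that are not numeric
cat_cols = [c for c in encoded.columns
            if not pd.api.types.is_numeric_dtype(encoded[c])]

# A fresh encoder per column; classes are sorted, so "Female" -> 0, "Male" -> 1
for col in cat_cols:
    encoded[col] = LabelEncoder().fit_transform(encoded[col])

print(encoded)
```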

# 4. Splitting dataset and scaling it

In [None]:
X = df.drop('Churn', axis=1)
y= df['Churn']

##### Scaling features

In [None]:
X = std_scaler.fit_transform(X)

##### splitting dataset into training and testing

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.25,random_state=42)
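
Fitting the scaler on the full feature matrix before splitting lets the test rows influence the scaling statistics. A common alternative is to split first and fit the scaler on the training part only. Here is a minimal sketch on random toy data (the arrays, `stratify` and the variable names are my assumptions, not part of the notebook above).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random toy features and an imbalanced binary target (assumed data)
rng = np.random.default_rng(42)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_toy = (rng.random(200) < 0.27).astype(int)

# Split first; stratify keeps the class ratio similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

# Fit the scaler on the training part only, then reuse it on the test part
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0).round(6), X_te_scaled.mean(axis=0).round(3))
```

The training columns end up with mean 0, while the test columns are only close to 0, which is expected because the test data played no part in fitting.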

==========================================================================================================
===========================================================================================================

# 5. Building Benchmark model (Logistic regression)

In [None]:
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)
acc_log_reg = log_reg.score(X_test, y_test)*100
print("{:.2f}".format(acc_log_reg))

##### Evaluating our benchmark model

##### Logistic Regression Classification report

In [None]:
print (classification_report(y_test, y_pred))

#### Confusion matrix

In [None]:
lr_conf = confusion_matrix(y_test, y_pred)
sns.heatmap(lr_conf, annot=True, fmt="d")
plt.show()

##### ROC Curve/Score

In [None]:
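# use the predicted probability of the positive class so the curve covers every threshold
y_proba = log_reg.predict_proba(X_test)[:, 1]
false_positive_rate, true_positive_rate, threshold = roc_curve(y_test, y_proba)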

# plotting them against each other
def plot_roc_curve(false_positive_rate, true_positive_rate, label=None):
    plt.plot(false_positive_rate, true_positive_rate, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'r', linewidth=4)
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate (FPR)', fontsize=16)
    plt.ylabel('True Positive Rate (TPR)', fontsize=16)

plt.figure(figsize=(14, 7))
plot_roc_curve(false_positive_rate, true_positive_rate)
plt.show()

In [None]:
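# AUC is computed from predicted probabilities rather than hard 0/1 labels
lr_auc = roc_auc_score(y_test, log_reg.predict_proba(X_test)[:, 1])
print("Roc score: {:.2f}".format(lr_auc*100), "%")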
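
ROC AUC measures how well a model ranks churners above non-churners, so it is normally computed from scores or probabilities. Passing hard 0/1 predictions collapses the curve to a single threshold and usually gives a different number. A small sketch with made-up labels and scores (all values assumed for illustration):

```python
from sklearn.metrics import roc_auc_score

# Made-up true labels and predicted probabilities (assumed data)
y_true = [0, 0, 0, 1, 1, 1]
scores = [0.10, 0.20, 0.45, 0.40, 0.48, 0.90]

# Hard labels obtained by cutting the scores at 0.5
labels = [int(s >= 0.5) for s in scores]

auc_from_scores = roc_auc_score(y_true, scores)
auc_from_labels = roc_auc_score(y_true, labels)

print(round(auc_from_scores, 3), round(auc_from_labels, 3))
```

Here the scores rank 8 of the 9 churner/non-churner pairs correctly, while the thresholded labels tie on most pairs and lose that information.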

# 6. Building And Evaluating Models

In [None]:
#Random Forest
rf = RandomForest(random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
acc_rf = rf.score(X_test, y_test) * 100
print("{:.2f}".format(acc_rf))

In [None]:
RF_conf = confusion_matrix(y_test, y_pred_rf)
sns.heatmap(RF_conf, annot=True, fmt="d")
plt.show()

In [None]:
# Ada Boost
AdaBoost = AdaBoostClassifier(random_state=42)
AdaBoost.fit(X_train, y_train)
y_pred = AdaBoost.predict(X_test)
acc_AdaBoost = AdaBoost.score(X_test, y_test) * 100
print("{:.2f}".format(acc_AdaBoost))

In [None]:
Ada_conf = confusion_matrix(y_test, y_pred)
sns.heatmap(Ada_conf, annot=True, fmt="d")
plt.show()

In [None]:
# LGBM Classifier
lgbm = LGBMClassifier(random_state=42)
lgbm.fit(X_train, y_train)
y_pred = lgbm.predict(X_test)
acc_lgbm = lgbm.score(X_test, y_test) * 100
print("{:.2f}".format(acc_lgbm))

In [None]:
LGBM_conf = confusion_matrix(y_test, y_pred)
sns.heatmap(LGBM_conf, annot=True, fmt="d")
plt.show()

In [None]:
# XGBoost
xg = XGBClassifier(random_state=42)
xg.fit(X_train, y_train)
y_pred = xg.predict(X_test)
acc_xg = xg.score(X_test, y_test) * 100
print("{:.2f}".format(acc_xg))

In [None]:
XG_conf = confusion_matrix(y_test, y_pred)
sns.heatmap(XG_conf, annot=True, fmt="d")
plt.show()

# 7. Validating Model

#### Choosing the best model

The two common error types in churn prediction:

   > **False Negative (Type II error): Failing to identify a customer who has a high propensity to unsubscribe.**

From a business perspective, this is the __least desirable error__ as the customer is very likely to quit/cancel/abandon the business, thus adversely affecting its revenue.

   > **False Positive (Type I error): Classifying a good, satisfied customer as one likely to churn.**

From a business perspective, this is __acceptable__ as it does not impact revenue.

Any predictive algorithm going into production should therefore be **the one with the fewest false negatives.**

Source: [Building Predictive Models for Customer Churn in Telecom](https://medium.com/@Experfy/building-predictive-models-for-customer-churn-in-telecom-4864d759ebf8)


A false negative in our confusion matrix is a customer who is going to churn (1) but whom the model has predicted as not going to churn (0), so the false negatives are in the **bottom-left** cell of our confusion matrix.

### Ada Boost has the fewest false negatives (224), so it will be our final model.
Ada Boost has beaten our benchmark model, which has 232 false negatives.
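
Instead of reading the bottom-left cell off the heatmap, the four cells can be unpacked directly from the confusion matrix, and recall on the churn class summarises the same information as a rate. A minimal sketch with made-up labels (assumed data; `y_hat` is just an illustrative name):

```python
from sklearn.metrics import confusion_matrix, recall_score

# Made-up true labels and predictions, 1 = churn (assumed data)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_hat = [1, 0, 1, 0, 0, 1, 0, 0]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

# Recall on the positive class = tp / (tp + fn); fewer missed churners means higher recall
churn_recall = recall_score(y_true, y_hat)

print(tn, fp, fn, tp, churn_recall)
```

Comparing `fn` (or recall) across the models above gives the same ranking as comparing the bottom-left cells by eye.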

# 8. Optimization - Hyper Parameter tuning

In [None]:
clf = AdaBoostClassifier(random_state=42)

param_grid = {"n_estimators": [50, 100, 300, 500],\
              "learning_rate" : [1, 0.0001, 0.5]}

grid_obj = GridSearchCV(clf, param_grid=param_grid, cv=10)


grid_fit = grid_obj.fit(X_train, y_train)

print("Best parameter: ", grid_obj.best_params_)

# Get the estimator/ clf
best_clf = grid_fit.best_estimator_

grid_y_pred = best_clf.predict(X_test)

print("Optimal accuracy score on the testing data: {:.2f}".format(accuracy_score(y_test, grid_y_pred)*100))


It seems like the default parameters achieve the highest performance.
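
By default `GridSearchCV` picks the parameters with the best cross-validated accuracy. Since the model was chosen for missing the fewest churners, one option is to let the search optimise recall on the churn class instead. Below is a rough sketch on synthetic data (the dataset from `make_classification`, the small grid and `cv=3` are all assumptions chosen to keep it quick, not tuned values).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced binary problem (assumed data, roughly 27% positives)
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, weights=[0.73, 0.27], random_state=42
)

# Small illustrative grid
param_grid = {"n_estimators": [10, 30], "learning_rate": [0.5, 1.0]}

# scoring="recall" ranks candidates by recall on the positive class
search = GridSearchCV(
    AdaBoostClassifier(random_state=42),
    param_grid=param_grid,
    scoring="recall",
    cv=3,
)
search.fit(X_toy, y_toy)

print(search.best_params_, round(search.best_score_, 3))
```

Optimising recall alone can push a model towards predicting churn too often, so it is worth checking precision or the full classification report for the selected parameters as well.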

##### Ada Boost Classification report

In [None]:
print (classification_report(y_test, grid_y_pred))

##### Confusion matrix 

In [None]:
Grid_conf = confusion_matrix(y_test, grid_y_pred)
sns.heatmap(Grid_conf, annot=True, fmt="d")
plt.show()

##### ROC Curve/Score

In [None]:
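# use the predicted probability of the positive class so the curve covers every threshold
grid_y_proba = best_clf.predict_proba(X_test)[:, 1]
false_positive_rate, true_positive_rate, threshold = roc_curve(y_test, grid_y_proba)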

# plotting them against each other
def plot_roc_curve(false_positive_rate, true_positive_rate, label=None):
    plt.plot(false_positive_rate, true_positive_rate, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'r', linewidth=4)
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate (FPR)', fontsize=16)
    plt.ylabel('True Positive Rate (TPR)', fontsize=16)

plt.figure(figsize=(14, 7))
plot_roc_curve(false_positive_rate, true_positive_rate)
plt.show()

In [None]:
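# AUC is computed from predicted probabilities rather than hard 0/1 labels
grid_auc = roc_auc_score(y_test, best_clf.predict_proba(X_test)[:, 1])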
print("Roc score: {:.2f}".format((grid_auc)*100), "%")

> Please let me know if you have any suggestions to further improve my kernel, and upvote it if you have found it useful. :)