# Prediction of Customer Churn

## Important: instructions

- The sheet is structured in **4 steps**:
    1. Understanding and manipulating the data
    2. Data visualization
    3. Implementing machine learning models (note: use more than one algorithm)
    4. Evaluating the models and concluding with the best one

### Importing the data

In [None]:
# Libraries used throughout the notebook:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from collections import defaultdict
from sklearn import metrics
from pylab import rcParams

import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

### Understanding the data

In [None]:
import os
print(os.listdir('../input/telco-customer-churn'))

In [None]:
data = pd.read_csv('../input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv')

In [None]:
data.head()

In [None]:
data.info()

In [None]:
data.isnull().sum()

In [None]:
data.dtypes

In [None]:
data.describe()

### Data Manipulation

In [None]:
for item in data.columns:
    print(item)
    print (data[item].unique())

In [None]:
data.drop(['customerID'], axis=1, inplace=True)

#### Convert all Yes/No values to 1s and 0s so our classifiers can use this data.

In [None]:
# Assign the result back to the column: calling replace(..., inplace=True) on
# data[col] can leave the DataFrame unchanged in recent pandas versions.
data["gender"] = data["gender"].replace(['Female', 'Male'], [0, 1])

# Plain yes/no columns; the "No internet service" / "No phone service" values
# that some of them contain are handled in the next cell.
yes_no_columns = ["Partner", "Dependents", "PhoneService", "PaperlessBilling",
                  "Churn", "MultipleLines", "OnlineSecurity", "OnlineBackup",
                  "DeviceProtection", "TechSupport", "StreamingTV", "StreamingMovies"]
for col in yes_no_columns:
    data[col] = data[col].replace(['No', 'Yes'], [0, 1])

# Ordinal codes for the multi-level columns
data["InternetService"] = data["InternetService"].replace(['No', 'DSL', 'Fiber optic'], [0, 1, 2])
data["Contract"] = data["Contract"].replace(['Month-to-month', 'One year', 'Two year'], [0, 1, 2])

# One-hot encode the payment method
data = pd.get_dummies(data=data, columns=['PaymentMethod'])

In [None]:
columns_to_convert = ['MultipleLines', 
                      'OnlineSecurity', 
                      'OnlineBackup', 
                      'DeviceProtection', 
                      'TechSupport',
                      'StreamingTV',
                     'StreamingMovies']

for item in columns_to_convert:
    # Treat "no service" answers as 0 and assign the result back to the column
    data[item] = data[item].replace(['No internet service', 'No phone service'], 0)
data.head()
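
The same encoding can also be written as a single dictionary lookup. The sketch below is only an illustration on a made-up two-column frame (`toy`, `yes_no_map` and `encoded` are hypothetical names, not part of the notebook); it relies on `Series.map`, which returns NaN for any value missing from the mapping, so unexpected categories become visible instead of slipping through.

```python
import pandas as pd

# Toy frame with made-up values that mimic two of the columns above
toy = pd.DataFrame({
    "Partner": ["Yes", "No", "Yes"],
    "OnlineSecurity": ["No", "No internet service", "Yes"],
})

# One mapping covers the plain answers and the "no service" answers
yes_no_map = {"No": 0, "Yes": 1, "No internet service": 0, "No phone service": 0}

encoded = toy.copy()
for col in encoded.columns:
    # Values that are not keys of yes_no_map would turn into NaN here
    encoded[col] = encoded[col].map(yes_no_map)

print(encoded)
```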

In [None]:
# TotalCharges is still an object column because some entries are blank strings.
# Convert it to float; values that cannot be parsed become NaN.
data['TotalCharges'] = pd.to_numeric(data['TotalCharges'], errors='coerce')

# Replace the resulting missing values with 0
data = data.fillna(value=0)
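
Before filling with 0, it can be worth counting how many entries were actually blank. A minimal sketch on made-up values (the series `raw` is hypothetical, not taken from the dataset), using `pd.to_numeric` with `errors="coerce"`:

```python
import pandas as pd

# Made-up values: the blank string stands in for a missing charge
raw = pd.Series(["29.85", " ", "1889.5"])

# errors="coerce" turns anything that cannot be parsed into NaN
numeric = pd.to_numeric(raw, errors="coerce")

# Count the blanks before they are overwritten
n_missing = int(numeric.isna().sum())
filled = numeric.fillna(0)

print(n_missing)
print(filled.tolist())
```

Whether 0 is a sensible fill value depends on why the charge is blank, so the count is a useful sanity check.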

In [None]:
data.dtypes

In [None]:
data.groupby('Churn').size() / len(data)  # Fraction of customers in each churn class
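
An equivalent way to get the class shares is `value_counts(normalize=True)`. A small sketch on a made-up label series (`churn` here is hypothetical data, not the notebook's column):

```python
import pandas as pd

# Made-up labels: three customers stayed (0), one churned (1)
churn = pd.Series([0, 0, 0, 1], name="Churn")

# normalize=True returns fractions instead of raw counts
share = churn.value_counts(normalize=True).sort_index()
print(share)
```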

### Data Visualization

In [None]:
data.hist(bins=50, figsize=(20,15));

In [None]:
corr = data.corr()
corr

In [None]:
# Pass the column by name; recent seaborn versions expect keyword arguments here
sns.countplot(x='Churn', data=data)

In [None]:
# Data to plot
labels =data['Churn'].value_counts(sort = True).index
sizes = data['Churn'].value_counts(sort = True)


colors = ["whitesmoke","red"]
explode = (0.1,0)  # explode 1st slice
 
rcParams['figure.figsize'] = 8,8
# Plot
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=270,)

plt.title('Percent of churn in customer')
plt.show()

In [None]:
sns.countplot(x='SeniorCitizen',data=data,hue='Churn')

In [None]:
plt.scatter(x='MonthlyCharges',y='TotalCharges',alpha=0.1, data=data)

In [None]:
# Plot the correlation matrix: the darker a box, the more strongly the two features are correlated
plt.figure(figsize=(12,10))
corr = data.apply(lambda x: pd.factorize(x)[0]).corr()
ax = sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns, linewidths=.2, cmap='Blues')

Churn rate is a health indicator for subscription-based companies. The ability to identify customers who aren’t happy with the provided solutions allows businesses to learn about weak points in the product or pricing plan, operational issues, and customer preferences and expectations, and so to proactively reduce the reasons for churn.

It’s important to define the data sources and the observation period in order to have a full picture of the history of customer interaction. Selecting the most significant features for a model influences its predictive performance: the higher the quality of the dataset, the more precise the forecasts.

Companies with a large customer base and numerous offerings would benefit from customer segmentation. The number and choice of ML models may also depend on segmentation results. Data scientists also need to monitor deployed models, and revise and adapt features to maintain the desired level of prediction accuracy.

### Implement Machine Learning Models

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score, accuracy_score, precision_score, recall_score

In [None]:
data["Churn"] = data["Churn"].astype(int)
Y = data["Churn"].values
X = data.drop(labels = ["Churn"],axis = 1)
# Create Train & Test Data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=101)
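
Because churners are the minority class, one option is to stratify the split so that train and test keep the same class ratio. This is a variation on the split above, not what the notebook does; the sketch uses made-up data (`X_demo`, `y_demo`) and the `stratify` argument of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 25% positives
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 25 + [0] * 75)

# stratify=y_demo keeps the 25/75 class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=101, stratify=y_demo
)

print(y_tr.mean(), y_te.mean())
```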

### Model Evaluation

#### LogisticRegression

In [None]:
# Running logistic regression model
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
result = model.fit(X_train, y_train)
from sklearn import metrics
prediction_test = model.predict(X_test)
# Print the prediction accuracy
print (metrics.accuracy_score(y_test, prediction_test))

#### RandomForestClassifier

In [None]:
# max_features="sqrt" is what "auto" meant for classifiers; "auto" was removed in newer scikit-learn
model_rf = RandomForestClassifier(n_estimators=1000, oob_score=True, n_jobs=-1,
                                  random_state=50, max_features="sqrt",
                                  max_leaf_nodes=30)
model_rf.fit(X_train, y_train)

# Make predictions
prediction_test = model_rf.predict(X_test)
print (metrics.accuracy_score(y_test, prediction_test))

#### SupportVectorClassifier

In [None]:
# Use a separate variable instead of attaching the SVC to the previous model object
model_svm = SVC(kernel='linear')
model_svm.fit(X_train, y_train)
preds = model_svm.predict(X_test)
metrics.accuracy_score(y_test, preds)

#### XGBClassifier

In [None]:
from xgboost import XGBClassifier
model = XGBClassifier()
model.fit(X_train, y_train)
preds = model.predict(X_test)
metrics.accuracy_score(y_test, preds)

#### AdaBoostClassifier

In [None]:
# AdaBoost Algorithm
from sklearn.ensemble import AdaBoostClassifier
model = AdaBoostClassifier()
# n_estimators = 50 (default value)
# estimator = DecisionTreeClassifier(max_depth=1) (default value; the parameter
# was called base_estimator before scikit-learn 1.2)
model.fit(X_train,y_train)
preds = model.predict(X_test)
metrics.accuracy_score(y_test, preds)
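
The models above are each scored on a single train/test split. As a hedged sketch of a more stable comparison, the loop below runs 5-fold cross-validation for a few of the same classifier types. It uses synthetic data from `make_classification`, and `X_demo`, `y_demo`, `candidates` and `scores` are illustrative names; the hyperparameters are placeholders, not tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the prepared feature matrix and churn labels
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
}

# Mean accuracy over 5 folds for each candidate
scores = {
    name: cross_val_score(est, X_demo, y_demo, cv=5, scoring="accuracy").mean()
    for name, est in candidates.items()
}

# Print the candidates from best to worst
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```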

#### Confusion matrix

In [None]:
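# Create the confusion matrix. Note that preds holds the predictions of the
# most recently fitted model (AdaBoost when the cells are run in order).
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, preds))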

In [None]:
import itertools

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, preds)
np.set_printoptions(precision=2)
class_names = ['Not churned','churned']
# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, normalize=True,
                      title='Normalized confusion matrix')

plt.show()

from sklearn.metrics import classification_report
eval_metrics = classification_report(y_test, preds, target_names=class_names)
print(eval_metrics)
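
The figures in the classification report follow directly from the four cells of the confusion matrix, which scikit-learn lays out as `[[TN, FP], [FN, TP]]` for labels 0 and 1. A minimal sketch with made-up counts (the numbers and the helper `binary_metrics` are hypothetical, not results from this notebook):

```python
def binary_metrics(tn, fp, fn, tp):
    """Return accuracy, precision, recall and F1 for the positive class."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts for illustration only
result = binary_metrics(tn=90, fp=10, fn=20, tp=30)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

With imbalanced classes, recall on the churned class is often more informative than accuracy, since a model that predicts "not churned" for everyone can still score a high accuracy.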

### Final Conclusions

Finally, this task allowed us to identify the parameters that influence the departure of a client. It also allowed us to develop a predictive model that will help the company target, more easily and quickly, the people who are likely to leave.

Logistic regression reaches a score of about 0.80, which is quite reasonable; optimizing the parameters didn't lead to a better score. More complex models such as Random Forest or Gradient Boosting are worth trying as well.
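
The parameter optimization mentioned above is not shown in this notebook. One common way to do it is a grid search over the regularization strength `C`. The sketch below assumes synthetic data from `make_classification` and an arbitrary grid; `X_demo`, `y_demo` and `param_grid` are illustrative names and values, not the settings that were actually tried.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the prepared churn features and labels
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Arbitrary grid of inverse regularization strengths
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```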

> **Footnotes:**
> I'm not a stats major, so please do let me know in the comments if you feel that I've left out any important technique or if there was any mistake in the content.
> 
> Do leave a comment/upvote :)