# 1. Introduction

In most businesses, losing a client is more expensive than gaining a new one. That's why companies try to retain their customers for as long as they can. The KPI used to measure how many customers leave a company is called the [churn rate](https://www.investopedia.com/terms/c/churnrate.asp), also known as the attrition rate. Investopedia defines it as:  

> The churn rate, also known as the rate of attrition or customer churn, is the rate at which customers stop doing business with an entity. It is most commonly expressed as the percentage of service subscribers who discontinue their subscriptions within a given time period. 
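
As a minimal sketch of the definition above, the churn rate for one period can be computed as the share of customers lost during that period. The function name `churn_rate` and the subscriber numbers are hypothetical and serve only as an illustration.

```python
def churn_rate(customers_at_start: int, customers_lost: int) -> float:
    """Return the churn rate for one period as a percentage."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return customers_lost / customers_at_start * 100


# Hypothetical example: 1,000 subscribers at the start of the month, 50 of them cancel
monthly_churn = churn_rate(customers_at_start=1000, customers_lost=50)
print("Monthly churn rate: {:.1f}%".format(monthly_churn))
```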

This KPI is of crucial importance in the telecommunication industry, which includes Internet providers, television providers (e.g. Netflix) and telephone providers (e.g. O2, Vodafone, Orange).  
However, it is widely used by any company whose business is based on subscriptions, e.g.: 
* audiobook services - [Audible](https://www.audible.com/), [Storytel](https://www.storytel.com/pl/pl/)
* meal kit delivery - [HelloFresh](https://www.hellofreshgroup.com/en/)  

For such companies, and especially for their marketing and customer retention departments, it is crucial to target subscribers who are at high risk of churning. This is where machine learning comes into play.

# 2. Reading and Cleaning Data

In [None]:
# importing basic libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

print("pandas version: {}".format(pd.__version__))
print("numpy version: {}".format(np.__version__))
print("seaborn version: {}".format(sns.__version__))

In [None]:
data = pd.read_csv("../input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv")

In [None]:
data.head()

In [None]:
# inspiration code: https://www.kaggle.com/dwin183287/covid-19-world-vaccination
fig=plt.figure(figsize=(5,2),facecolor='white')

ax0=fig.add_subplot(1,1,1)
ax0.text(0.75,1,"Key figures",color='black',fontsize=28, fontweight='bold', fontfamily='monospace',ha='center')

ax0.text(0,0.4,"{}".format(data.shape[0]),color='gold',fontsize=25, fontweight='bold', fontfamily='monospace',ha='center')
ax0.text(0,0.001,"Number of rows \nin the dataset",color='dimgrey',fontsize=17, fontweight='light', fontfamily='monospace',ha='center')

ax0.text(0.75,0.4,"{}".format(data.shape[1]),color='gold',fontsize=25, fontweight='bold', fontfamily='monospace',ha='center')
ax0.text(0.75,0.001,"Number of features \nin the dataset",color='dimgrey',fontsize=17, fontweight='light', fontfamily='monospace',ha='center')

a = round(data[data["Churn"]=="Yes"].shape[0]/data.shape[0]*100, 2)
ax0.text(1.5,0.4,"{}%".format(a),color='gold',fontsize=25, fontweight='bold', fontfamily='monospace',ha='center')
ax0.text(1.5,0.001,"Customers \nchurned",color='dimgrey',fontsize=17, fontweight='light', fontfamily='monospace',ha='center')

ax0.set_yticklabels('')
ax0.tick_params(axis='y',length=0)
ax0.tick_params(axis='x',length=0)
ax0.set_xticklabels('')

for direction in ['top','right','left','bottom']:
    ax0.spines[direction].set_visible(False)

Columns and their types are shown below:

In [None]:
data.dtypes

A first glance at the dataset reveals a large number of columns with binary (yes/no) values. Also, the column "TotalCharges" is of object type although it should be numerical. This will be corrected.

In [None]:
data["TotalCharges"] = pd.to_numeric(data["TotalCharges"], errors='coerce')

In [None]:
data.isnull().sum()

There are 11 missing values in the TotalCharges column. This is a negligibly small number, so these rows can be removed. Also, the column "customerID" does not carry any useful information, so it can be removed as well.
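
To illustrate where such missing values can come from, here is a small sketch on a toy column. It assumes that the unparseable entries are blank strings, which is an assumption made for this example; the toy values are made up.

```python
import pandas as pd

# Toy column mimicking TotalCharges: numbers stored as text, one blank entry (assumption)
raw = pd.Series(["29.85", "1889.5", " ", "108.15"], name="TotalCharges")

# errors="coerce" turns every value that cannot be parsed as a number into NaN
converted = pd.to_numeric(raw, errors="coerce")
print("missing after conversion: {}".format(converted.isnull().sum()))

# dropping the NaN keeps only the rows with a valid number
print(converted.dropna().tolist())
```

Without `errors="coerce"` the conversion would raise an error on the blank entry instead of marking it as missing.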

In [None]:
# Removing missing values 
data.dropna(inplace = True)

# Remove customer IDs from the data set
data_clean = data.drop(['customerID'], axis=1)

# 2. Data Exploration

### **Summary of this section:**

CUSTOMERS PROFILES:
* There are similar numbers of men and women, and no significant difference in churn rate between them.
* Only ca. 16% of customers are seniors, who churn more often than others (42% vs. 24%).
* There are similar numbers of customers with and without partners. People without a partner churn more often (33% vs. 20%).
* About 70% of customers do not have dependents. They churn more often than those who have them (31% vs. 16%).

SERVICES:
* Over 90% of customers have a phone service, but there is no significant difference in churn rates between the two groups.
* About 42% of customers have multiple phone lines, 48% have a single line and 10% have no phone service. There are no significant differences in churn between these groups.
* Customers with a fiber optic connection churn more often (43%) than those with DSL (19%) or no Internet (7%).

SECURITY:
* About half of customers do not have Online Security. These customers churn most often (42%). The situation is similar for customers without Tech Support (also ca. half of them), whose churn rate is also 42%.
* Customers without Online Backup have a 40% churn rate and, similarly, those without Device Protection have a 39% churn rate.

STREAMING:
* Among customers with Internet service, those who do not stream TV or movies churn more often than those who do. Customers without Internet churn least often.

CONTRACTS:
* Customers with month-to-month contracts churn most often (43%), compared with one-year contracts (12%) and two-year contracts (3%).
* Customers paying by electronic check churn most often (45%).
* Customers with Paperless Billing churn more often (34%).
* Customer tenure spans from 1 month to 72 months, and these two groups are the biggest ones.
* The longer the tenure, the lower the churn rate.
* People with higher monthly charges churn more often.
* There is no significant difference in churn between customers with low and high total charges.
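
The percentages in this summary are churn rates within each category. As a rough sketch of how such a number can be computed without a plot, the share of churned customers can be taken per group. The toy DataFrame below is hypothetical and only mimics the `Contract` and `Churn` columns.

```python
import pandas as pd

# Hypothetical toy data, only to illustrate the calculation
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "One year",
                 "Two year", "Month-to-month", "One year"],
    "Churn": ["Yes", "No", "No", "No", "Yes", "Yes"],
})

# share of churned customers within each category, in percent
churn_by_contract = (
    toy["Churn"].eq("Yes")          # True where the customer churned
    .groupby(toy["Contract"])       # group the flags by contract type
    .mean()                         # mean of a boolean = share of True
    .mul(100)
    .round(1)
)
print(churn_by_contract)
```

Because the result is indexed by category name, each percentage stays attached to its own label.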

In [None]:
def pie_bar(column):
    """
    Create a figure with a pie chart (share of customers per category) on the left
    and a stacked bar chart (churn share per category) on the right.
    """
    fig, ax = plt.subplots(1,2, figsize=(15,5))

    # `data` is used throughout so that relabelled columns (e.g. SeniorCitizen) show up in the plots;
    # after dropna() it holds the same rows as data_clean

    # part 1: pie chart on the left side

    # labels are taken from the index of the counts, so every slice is paired with its own category
    tmp1 = data[column].value_counts(normalize=True).mul(100)
    ax[0].text(0, 1.2,"% of customers by {}".format(column) ,fontsize=12, fontweight='bold', fontfamily='monospace',ha='center')
    ax[0].pie(tmp1.values, labels=tmp1.index.astype(str), autopct='%1.1f%%', shadow=True, startangle=90, textprops=dict(size=14),
              colors = ["#fad796","#91deff","#cb96fa","#8f98ff"])
    ax[0].axis('equal')

    # part 2: stacked bar chart on the right side

    # creation of a cross table - normalized along rows (share of churn yes/no within each category)
    tmp2 = pd.crosstab(index = data[column], columns = data['Churn'], normalize = 'index').mul(100)

    # labels are taken from the cross table index, so every bar is paired with its own category
    labels = tmp2.index.astype(str)

    ax[1].bar(labels, tmp2['No'], 0.35, label='No churn', color="#1ed686")
    ax[1].bar(labels, tmp2['Yes'], 0.35, label='Churn', color="#f75931", bottom=tmp2['No'])

    # annotate every bar segment with its percentage
    for p in ax[1].patches:
        width, height = p.get_width(), p.get_height()
        ax[1].annotate('{:.2f}%'.format(height), (p.get_x()+.25*width, p.get_y()+.4*height),
                       color = 'black',
                       weight = 'bold',
                       size = 14)

    ax[1].set_ylabel('Percentage')
    ax[1].text(0.5, 110,"Count by churn and {}".format(column) ,fontsize=12, fontweight='bold', fontfamily='monospace',ha='center')
    ax[1].legend()

    plt.show()

In [None]:
pie_bar("gender")

In [None]:
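# relabel 0/1 as No/Yes for plotting; plain assignment is used because a chained inplace replace may not modify the DataFrame in newer pandas versions
data["SeniorCitizen"] = data["SeniorCitizen"].replace({0: "No", 1: "Yes"})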
pie_bar("SeniorCitizen")

In [None]:
pie_bar("Partner")

In [None]:
pie_bar("Dependents")

In [None]:
pie_bar("PhoneService")

In [None]:
pie_bar("MultipleLines")

In [None]:
pie_bar("InternetService")

In [None]:
pie_bar("OnlineSecurity")

In [None]:
pie_bar("OnlineBackup")

In [None]:
pie_bar("DeviceProtection")

In [None]:
pie_bar("TechSupport")

In [None]:
pie_bar("StreamingTV")

In [None]:
pie_bar("StreamingMovies")

In [None]:
pie_bar("Contract")

In [None]:
pie_bar("PaperlessBilling")

In [None]:
pie_bar("PaymentMethod")

In [None]:
ax = sns.histplot(data_clean['tenure'], stat='count', bins=72)
ax.set_ylabel('# of Customers')
ax.set_xlabel('Tenure (months)')
ax.set_title('# of Customers by their tenure')
plt.show()

In [None]:
tenure_churn = pd.crosstab(index = data_clean["tenure"], columns = data_clean['Churn'], values = data_clean['Churn'],
                               aggfunc = len,
                               normalize = 'index').mul(100)

In [None]:
sns.scatterplot(data=tenure_churn, x=tenure_churn.index, y="Yes")
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(10,5))
# fill=True replaces the deprecated shade=True
ax = sns.kdeplot(data_clean["MonthlyCharges"][(data_clean["Churn"] == 'No')], color="Green", fill = True, edgecolor='black')
ax = sns.kdeplot(data_clean["MonthlyCharges"][(data_clean["Churn"] == 'Yes')], ax=ax, color="Red", fill = True, edgecolor='black')

ax.legend(["Not Churn","Churn"], loc='upper right')
ax.set_ylabel('Density')
ax.set_xlabel('Monthly Charges')
ax.set_title('Distribution of monthly charges')
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(10,5))
# fill=True replaces the deprecated shade=True
ax = sns.kdeplot(data_clean["TotalCharges"][(data_clean["Churn"] == 'No')], color="Green", fill = True)
ax = sns.kdeplot(data_clean["TotalCharges"][(data_clean["Churn"] == 'Yes')], ax=ax, color="Red", fill = True)

ax.legend(["Not Churn","Churn"], loc='upper right')
ax.set_ylabel('Density')
ax.set_xlabel('Total Charges')
ax.set_title('Distribution of total charges')
plt.show()

# 3. Modelling

## 3.1 Preparing the dataset for modelling

In [None]:
data_clean.head()

I will encode the categorical variables in a couple of steps, as I like to have more control over the naming of columns. You can use pd.get_dummies() directly if you want.
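
For comparison, the direct route mentioned above could look roughly like the sketch below. The toy DataFrame is hypothetical; `drop_first=True` is one possible way to avoid redundant dummy columns and is not the approach taken in this notebook.

```python
import pandas as pd

# Hypothetical toy data with one binary, one multi-level and one numeric column
toy = pd.DataFrame({
    "Partner": ["Yes", "No", "Yes"],
    "Contract": ["Month-to-month", "One year", "Two year"],
    "MonthlyCharges": [29.85, 56.95, 42.30],
})

# drop_first=True drops the first (alphabetical) level of every categorical column;
# numeric columns are passed through unchanged
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
print(encoded.columns.tolist())
```

The drawback is that pandas decides which level is dropped and how the columns are named, which is the control the step-by-step approach keeps.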

In [None]:
# selecting binary columns 
cols_yes_no = ["Partner","Dependents","PhoneService","PaperlessBilling","Churn"]
data_subset1 = data_clean[cols_yes_no].copy()
data_subset2 = data_clean.drop(columns=cols_yes_no)

#creating dummies
data_subset2 = pd.get_dummies(data_subset2)

#removing logical correlation
dummies_drop = ["MultipleLines_No phone service","InternetService_No","Contract_Two year","PaymentMethod_Mailed check",
                "gender_Female", "StreamingTV_No internet service", "OnlineSecurity_No internet service"]
data_subset2 = data_subset2.drop(columns=dummies_drop)

data_subset2.head()

In [None]:
# encoding yes/no binary variables
# astype(int) makes sure the result is numeric and not left as object dtype
data_subset1 = data_subset1.replace({"Yes": 1, "No": 0}).astype(int)
data_subset1.head()

In [None]:
data_model = pd.concat([data_subset1, data_subset2], axis=1)
data_model.head()

In [None]:
X = data_model.drop(["Churn"], axis=1)
X.head()

In [None]:
y = data_model["Churn"]
y.head()

In [None]:
from sklearn.metrics import roc_auc_score,roc_curve

from sklearn.ensemble import RandomForestClassifier,ExtraTreesClassifier,AdaBoostClassifier,GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier,ExtraTreeClassifier
import xgboost as xgb
from sklearn.feature_selection import SelectFromModel

## 3.2 Initial evaluation and selection of the ML models:

In [None]:
names = ["ExtraTC","DecisionTC","RandomFC","AdaBC","XGB","GradientBC"]
clfs = [
ExtraTreesClassifier(n_estimators=1000, max_depth=5, class_weight='balanced'),
DecisionTreeClassifier(max_depth=5),
RandomForestClassifier(n_estimators=500, max_depth=5, class_weight='balanced'),
AdaBoostClassifier(n_estimators=100),
xgb.XGBClassifier(n_estimators=100, nthread=-1, max_depth=5, use_label_encoder=False),
GradientBoostingClassifier(n_estimators=200,max_depth=5)
]

plt.figure()

for name, clf in zip(names, clfs):
    clf.fit(X,y)
    # note: the models are scored on the same data they were fitted on, so these are training AUCs
    y_proba = clf.predict_proba(X)[:,1]
    print("Roc AUC: {}".format(name), roc_auc_score(y, y_proba))
    fpr, tpr, thresholds = roc_curve(y, y_proba)
    plt.plot(fpr, tpr, label=name)

plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')
plt.legend(loc='best')
plt.savefig('1.png')
plt.show()

In the previous step the hyperparameters were chosen arbitrarily and the models were scored on the same data they were trained on, so there is a risk that some models overfit while others underperform. In the next step the Random Forest Classifier and XGB will be optimised and their performance compared.
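
One way to make the overfitting risk visible is to compare the AUC on the training data with the AUC on data the model has not seen. The sketch below is not part of the original analysis: it uses a synthetic dataset from `make_classification` as a stand-in for `X` and `y`, and the hyperparameters are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y (assumption: imbalanced binary target, similar to churn)
X_demo, y_demo = make_classification(n_samples=1000, n_features=20,
                                     weights=[0.73, 0.27], random_state=0)

# hold out 25% of the rows; stratify keeps the class shares equal in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0)

model = RandomForestClassifier(n_estimators=100, max_depth=5,
                               class_weight="balanced", random_state=0)
model.fit(X_train, y_train)

# AUC on the training part vs. AUC on the held-out part
auc_train = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
auc_test = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print("Train AUC: {:.3f}, test AUC: {:.3f}".format(auc_train, auc_test))
```

A large gap between the two values would suggest overfitting. The cross-validated search in the next section addresses the same concern by scoring every candidate on held-out folds.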

## 3.3 Random Forest Classifier

In [None]:
from sklearn.model_selection import cross_val_score, RandomizedSearchCV

random_forest = RandomForestClassifier()

params = {"n_estimators":np.arange(500,2500,100),
          "max_depth":[4,5,6,7,8,9,10]}

clf = RandomizedSearchCV(random_forest,params, n_iter=30, scoring="roc_auc", cv=4)
rand_search = clf.fit(X,y)

In [None]:
pd.DataFrame(rand_search.cv_results_).sort_values(by="rank_test_score")[["rank_test_score","params","mean_test_score"]]

### Under construction