# Customer Churn
 
Customer churn refers to the number of customers of a given company that stop using its products or services during a certain time frame. One can calculate the churn rate by dividing the number of customers lost during that time period (say, a quarter) by the number of customers at the beginning of that period. For example, if you start the quarter with 400 customers and end with 380, the churn rate is 5%, because you lost 20 of your 400 customers.
For Business Intelligence, this is one of the most important metrics to look at, since nowadays losing clients is very easy compared to retaining the existing ones. Companies should aim for a churn rate that is as close to 0% as possible. To do this, the company has to be on top of its churn rate at all times and treat it as a top priority.
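
As a quick sanity check of the definition above, here is a minimal sketch of the churn rate calculation. The function name `churn_rate` is my own choice for illustration, and the example assumes that no new customers joined during the quarter.

```python
def churn_rate(customers_at_start: int, customers_lost: int) -> float:
    """Fraction of the starting customer base that was lost during the period."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return customers_lost / customers_at_start

# Example from the text: 400 customers at the start of the quarter, 380 at the end
# (assumption: nobody joined during the quarter, so 20 customers were lost)
start, end = 400, 380
rate = churn_rate(start, start - end)
print(f"Quarterly churn rate: {rate:.1%}")
```
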

## 3 Ways to Reduce Customer Churn
1. Focus the attention on the best customers. Rather than simply offering incentives to customers who are considering churning, it could be even more beneficial to pool the resources into the loyal, profitable customers.
2. Analyze churn as it occurs. Use the churned customers as a means of understanding why customers are leaving. Analyze how and when churn occurs in a customer's lifetime with the company, and use that data to put preemptive measures into place.
3. Show the customers that you care. Instead of waiting to connect with the customers until they reach out to you, try a more proactive approach. Communicate all the perks you offer and show them you care about their experience, and they will be more likely to stick around.

In this project, I will use several tools from Survival Analysis to focus on a customer retention program of the Telco company (https://www.telco.com/company-profile). Each row represents a customer, and each column contains one of the customer's attributes, as described in the column metadata.
The data set includes information about:
Customers who left within the last month: the column is called Churn
Services that each customer has signed up for: phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
Customer's account information: how long they've been a customer, contract, payment method, paperless billing, monthly charges, and total charges
Demographic info about customers: gender, age range, and if they have partners and dependents

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# Reading the dataset
dataframe = pd.read_csv('../input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv')
# I have renamed some columns for consistency
dataframe = dataframe.rename(columns={'customerID': 'Client', 'gender': 'Gender', 'tenure': 'Tenure'})
dataframe['SeniorCitizen'] = dataframe['SeniorCitizen'].map({0: 'No', 1: 'Yes'})
dataframe['Churn'] = dataframe['Churn'].map({'Yes': 1, 'No': 0})
# TotalCharges contains blank strings for some customers, so I convert them to NaN
dataframe['TotalCharges'] = dataframe['TotalCharges'].replace(' ', np.nan)
dataframe['TotalCharges'] = pd.to_numeric(dataframe['TotalCharges'])
dataframe

The data set has 7043 rows and 21 columns. In the next cell I will run a quick report to get a global statistical view of the data and look for potential caveats. I will build this report with the Pandas-Profiling library, which can be installed following the steps in https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/pages/installation.html.
The variable description is:
'Client': A unique ID that identifies each customer
'Gender': The customer’s gender (Male, Female)
'SeniorCitizen': Indicates if the customer is 65 or older (Yes, No)
'Partner': Indicates if the customer is married (Yes, No)
'Dependents': Indicates if the customer lives with any children, parents, etc. (Yes, No)
'Tenure': The total number of months that the customer has been with the company
'PhoneService': Indicates if the customer subscribes to home phone service with the company (Yes, No)
'MultipleLines': Customer subscribed to multiple telephone lines with the company: Yes, No, No phone service
'InternetService': Customer subscribed to Internet service with the company (No, DSL, Fiber Optic, Cable)
'OnlineSecurity': Customer subscribed to an additional online security service (Yes, No)
'OnlineBackup': Customer subscribed to an additional online backup service (Yes, No)
'DeviceProtection': Customer subscribed to an additional device protection plan (Yes, No)
'TechSupport': Customer subscribed to an additional technical support plan (Yes, No)
'StreamingTV': Customer uses their Internet service to stream television programming (Yes, No)
'StreamingMovies': Customer uses their Internet service to stream movies (Yes, No)
'Contract': Indicates the customer’s current contract type (Month-to-Month, One Year, Two Year)
'PaperlessBilling': Customer has chosen paperless billing (Yes, No)
'PaymentMethod': Customer pays their bill (Bank Withdrawal, Credit Card, Mailed Check)
'MonthlyCharges': Customer's current total monthly charge for all their services from the company.
'TotalCharges': Customer's total charges, calculated to the end of the quarter specified above.
'Churn': The customer left/remained with the company 1/0.

In [None]:
from pandas_profiling import ProfileReport

profile = ProfileReport(dataframe, title='Pandas Profiling Report')
profile

1. I see 11 missing values: I will remove the corresponding rows from the data set
2. I see no duplicated rows
3. The variable TotalCharges is uniformly distributed
4. There are binary, categorical and continuous variables
5. The outcome variables are Tenure and Churn

# Data Analytics: univariable analysis

In this subsection I will build a function that draws a combined plot: the distribution of values of each variable (bar plot) together with the total number of customers that churned in each class (line plot). This will help me understand what the risk factors for churn are. A multivariable analysis, which takes variable interactions into account, will complete this study in a later section.
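
The plotting function below aggregates the target by class with 'count', 'sum' and 'mean'. The following sketch shows what those three numbers look like on a small toy table. The data here is made up for illustration and is not taken from the Telco data set; the renamed columns are my own labels.

```python
import pandas as pd

# Toy data (hypothetical): one row per customer
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "Month-to-month",
                 "One year", "One year", "Two year"],
    "Churn":    [1, 1, 0, 1, 0, 0],
})

# count = class size, sum = number of churned customers, mean = churn rate of the class
stats = toy.groupby("Contract")["Churn"].agg(["count", "sum", "mean"])
stats = stats.rename(columns={"count": "customers", "sum": "churned", "mean": "churn_rate"})
print(stats)
```

The plots in this section display the class size and the number of churned customers. The churn rate per class is worth a look too, since it allows comparing classes of very different sizes.
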

In [None]:
#Data visualization
import matplotlib    #Importing Matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
matplotlib.rc('font', size=16)                #Use big fonts and big plots
plt.rcParams['figure.figsize'] = (10.0,10.0)    
matplotlib.rc('figure', facecolor='white')

import seaborn as sns #Importing Seaborn

def Data_Analytics(df,colname,targetname):
    ### This function compares the target value across the classes of a given category
    ### in the case of binary classification.
    
    ## Arguments:
    # df: is a data frame.
    # colname: is a string. The column name to be evaluated.
    # targetname: is a string. The column name of the target variable.
    
    # calculate aggregate stats: class size, number of churned customers and churn rate
    df_cate = df.groupby([colname])[targetname].agg(['count', 'sum', 'mean'])
    df_cate.reset_index(inplace=True)
    
    # line plot (left axis): number of churned customers per class
    f, ax = plt.subplots(figsize=(20, 8))
    sns.lineplot(x=colname, y="sum", data=df_cate, color="b", ax=ax)
    plt.xticks(size=18,rotation=90)
    plt.yticks(size=20,rotation=0)
    
    for tl in ax.get_yticklabels():
        tl.set_color('b')

    # bar plot (right axis): number of customers per class
    ax2 = ax.twinx()
    sns.barplot(x=colname, y="count", data=df_cate,
                ax=ax2,alpha=0.5)

# The gender of the client?

In [None]:
Data_Analytics(dataframe,'Gender','Churn')

The graphic above shows the distribution of the Gender variable as a bar plot. The blue line represents the total number of customers from each class that churned. There are slightly more men than women among the customers. Women churned the most.

# The client has a partner?

In [None]:
Data_Analytics(dataframe,'Partner','Churn')

There are slightly more single customers than partnered ones. Customers without a partner churned the most.

# The client has dependents?

In [None]:
Data_Analytics(dataframe,'Dependents','Churn')

There are more customers without dependents than with dependents. Of the two groups, customers without dependents churned the most.

# The months the customers stayed with the company?

In [None]:
Data_Analytics(dataframe,'Tenure','Churn')

The graphic above shows, as a bar plot, how many months the customers stayed with the company. The largest number of churned customers left during the first month.

# The customer having a phone service?

In [None]:
Data_Analytics(dataframe,'PhoneService','Churn')

Most of the customers subscribed to a home phone service with the company, and they account for most of the churned customers.

# The customer having a multiple line services?

In [None]:
Data_Analytics(dataframe,'MultipleLines','Churn')

More than 3000 customers don't have multiple line services, while around 3000 have. They are the ones that churned the most.

# Which internet service does the customer have?

In [None]:
Data_Analytics(dataframe,'InternetService','Churn')

Most customers subscribed to Fiber Optic. They churned the most, followed by the ones that subscribed to DSL.

# Online security?

In [None]:
Data_Analytics(dataframe,'OnlineSecurity','Churn')

Customers that didn't subscribe to an additional online security service churned the most.

# Online Backup

In [None]:
Data_Analytics(dataframe,'OnlineBackup','Churn')

Customers that didn't subscribe to an additional online backup service churned the most.

# Device Protection

In [None]:
Data_Analytics(dataframe,'DeviceProtection','Churn')

Customers that didn't subscribe to an additional device protection plan churned the most.

# Tech Support

In [None]:
Data_Analytics(dataframe,'TechSupport','Churn')

Customers that didn't subscribe to an additional technical support plan churned the most.

# Streaming TV

In [None]:
Data_Analytics(dataframe,'StreamingTV','Churn')

Customers that don't use their Internet service to stream television programming churned the most, followed by the ones who do.

# Streaming Movies

In [None]:
Data_Analytics(dataframe,'StreamingMovies','Churn')

Customers that don't use their Internet service to stream movies churned the most, followed by the ones who do.

# Contract

In [None]:
Data_Analytics(dataframe,'Contract','Churn')

Most of the customers have a Month-to-Month contract, followed by the ones with a Two-Year contract. Month-to-Month customers churned the most, followed by the ones with a One-Year contract.

# Paperless Billing

In [None]:
Data_Analytics(dataframe,'PaperlessBilling','Churn')

Customers that have chosen paperless billing churned the most.

# Payment Method

In [None]:
Data_Analytics(dataframe,'PaymentMethod','Churn')

Customers that pay by Electronic Check churned the most.

# Missing data
I will remove the customers that have missing values in any variable.

In [None]:
dataframe = dataframe.dropna()
dataframe
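
The cleaning steps used so far (turning blank strings into NaN and then dropping incomplete rows) can be checked on a tiny toy table. This is only a sketch with made-up values. It uses the `errors="coerce"` option of `pd.to_numeric` as an alternative to replacing the blanks by hand.

```python
import pandas as pd

# Toy data (hypothetical): one customer has a blank TotalCharges entry
toy = pd.DataFrame({"Tenure": [0, 12, 34],
                    "TotalCharges": [" ", "350.5", "1889.5"]})

# Entries that cannot be parsed as numbers become NaN
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
n_missing = int(toy["TotalCharges"].isna().sum())

# Drop every row that has at least one missing value
cleaned = toy.dropna()
print(f"missing values: {n_missing}, rows kept: {len(cleaned)}")
```
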

# Censored data

One thing to keep in mind is that the variable 'Tenure', which measures the time in months that the customer has been with the company, reaches its maximum at 72 months, and customers like '2234-XADUH' haven't churned yet. I see as well that customers like '7590-VHVEG', '5575-GNVDE', '7795-CFOCW' and many others hadn't churned at 1, 34 and 45 months. For all these customers the time of churn is unknown: we only know that it is longer than the observed Tenure, either because they were still customers when the data was collected or because they were lost to follow-up (for example, they stopped using the services of the company without registering the end of the contract, or they died).
This is known as censoring, and it complicates prediction because one cannot apply classical Machine Learning models directly, only those adapted to Survival Analysis.
In the next section I will use a classical and very powerful model for such a task, known as Cox's Proportional Hazards Model.
The next function censors the data set at a given time of study, for example at three, five or seven years (expressed in months, like Tenure). Feel free to use it to stop the study at a given 'censor' time and make predictions up to that moment in time.

In [None]:
def time_censoring(df,timeline,censor,event):
    # Inputs: a dataframe
    #         the name of the time-to-event column
    #         the time at which to censor (same units as the time-to-event column)
    #         the name of the event column
    
    # Makes a copy of the input dataframe to avoid overwriting values
    data_frame = df.copy()
    
    # Censors the time-to-event column: times are capped at the censoring time
    data_frame[timeline] = data_frame[timeline].clip(upper=censor)
    
    # Censors the event column: events at or after the censoring time are not observed
    data_frame[event] = np.where(data_frame[event] == 0, 0,
                                 np.where(data_frame[timeline] >= censor, 0, 1)) 
    
    # Returns the censored dataframe
    return data_frame
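
To see what censoring at a fixed time does, here is a small self-contained sketch on toy data. The helper `censor_at` is a hypothetical, compact re-statement of the rule used by time_censoring above (times are capped at the censoring time and events at or after it become unobserved), and the four toy customers are made up.

```python
import numpy as np
import pandas as pd

def censor_at(df, time_col, event_col, censor):
    """Censor a survival data set at time `censor` without modifying the input."""
    out = df.copy()
    not_observed = out[time_col] >= censor             # event (if any) happens at or after `censor`
    out[time_col] = out[time_col].clip(upper=censor)   # cap the observed time
    out[event_col] = np.where(not_observed, 0, out[event_col])
    return out

# Toy data (hypothetical): tenure in months and churn indicator
toy = pd.DataFrame({"Tenure": [5, 20, 40, 60], "Churn": [1, 0, 1, 0]})
censored = censor_at(toy, "Tenure", "Churn", censor=36)
print(censored)
```

The customer who churned at month 40 becomes a censored observation at month 36, while the customer who churned at month 5 is unaffected.
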

# Churn Risk Modelling using the Cox model

In this section I will split the data set into training and testing subsamples. The idea is to fit a Cox Proportional Hazards model on the training subsample and later validate it on the test subsample. The model I am developing is a multivariable Cox model: categorical variables are split into their classes and the model evaluates the risk factor of each class. I will fit the model and check the main risk factors on the training subsample. Then I will validate the model on the test subsample by computing the concordance index, which generalizes the ROC-AUC to censored data.

I will install the lifelines library, which contains the Cox model and many others.

In [None]:
!pip3 install lifelines

## Train-Test-Split 
I will split the dataset into 80% for training and 20% for testing.

In [None]:
from sklearn.model_selection import train_test_split

to_train, to_tests = train_test_split(dataframe, test_size=0.2)

## Cox Model check assumptions
The Cox model doesn't cope well with collinear variables, which means that correlated variables should be eliminated before training the model, or a penalizer like Ridge, Lasso or ElasticNet should be used. I will use only the Ridge regularization scheme, which corresponds to an $l_1$ ratio of $0$ (the argument l1_ratio=0 below), with a penalization factor p=0.01 (feel free to change it if you need to).

In [None]:
from lifelines import CoxPHFitter

# Ridge regularization Cox Proportional Hazards model
formula = 'Gender+SeniorCitizen+Partner+Dependents+PhoneService+MultipleLines+InternetService+OnlineSecurity+OnlineBackup+DeviceProtection+TechSupport+StreamingTV+StreamingMovies+Contract+PaperlessBilling+PaymentMethod+MonthlyCharges+TotalCharges'
model = CoxPHFitter(penalizer=0.01, l1_ratio=0)
model = model.fit(to_train.drop("Client",axis=1), 'Tenure', event_col='Churn',formula=formula)
model.print_summary()

I see that the concordance index is 92%, which is excellent. The main risk factors are identified in the column coef or exp(coef): factors with coef>0, or equivalently exp(coef)>1, increase the risk that the customer churns, while factors with coef<0 reduce it. The risk factors coincide on average with the ones from the univariable analysis.
1. The advantage of this modelling procedure is that any time we get the data of a new customer, we can predict whether the customer will churn and estimate the number of months before the churn, so the company has time to prepare a Business Plan for the customer.
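
To make the reading of coef and exp(coef) concrete, the following sketch converts a few coefficients into hazard ratios. The coefficient values and factor names are hypothetical and chosen only for illustration; they are not taken from the fitted model.

```python
import math

# Hypothetical coefficients (NOT the output of the fitted model)
coefs = {"factor_A": 0.40, "factor_B": 0.00, "factor_C": -0.70}

hazard_ratios = {}
for name, coef in coefs.items():
    hr = math.exp(coef)              # hazard ratio = exp(coef)
    hazard_ratios[name] = hr
    change = (hr - 1.0) * 100.0      # % change of the churn hazard relative to the baseline
    print(f"{name}: coef={coef:+.2f}, hazard ratio={hr:.2f}, hazard change={change:+.0f}%")
```

A hazard ratio above 1 means the factor raises the churn hazard, a ratio below 1 means it lowers it, and a ratio of exactly 1 means no effect.
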

The graphic below shows the Hazard Ratios of every variable and independent factor, along with the 95% confidence intervals.

In [None]:
model.plot(hazard_ratios=True)
plt.xlabel('Hazard Ratios (95% CI)')

## Validation with the test subsample
I will now compute the concordance index on the test subsample, using the predictions of the model on this subsample.

In [None]:
from lifelines.utils import concordance_index
C_index = concordance_index(to_tests['Tenure'],-model.predict_partial_hazard(to_tests.drop('Client',axis=1)),to_tests['Churn'])
print('The concordance of the Cox model on the test subsample is: ', round(C_index*100,2),'%')

The concordance index for the test subsample is 92.46%, which is excellent and close to the value obtained on the training subsample, so there is no sign of overfitting.
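
For intuition about what the concordance index measures, here is a minimal pure-Python sketch of the idea: among all comparable pairs of customers, it counts how often the one with the higher predicted risk churned first. The function `harrell_c_index` and the toy data are my own illustration. It skips pairs with tied times for simplicity, so it is not a replacement for the lifelines implementation. Note that it takes risk scores directly, whereas the lifelines function expects scores that grow with survival time, which is why the cell above passes the negative partial hazard.

```python
def harrell_c_index(times, risks, events):
    """Concordance index for right-censored data (pairs with tied times are skipped)."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue                       # a censored customer cannot be the earlier churner
        for j in range(n):
            if times[i] < times[j]:        # i churned before j churned or was censored
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0      # the higher-risk customer churned first
                elif risks[i] == risks[j]:
                    concordant += 0.5      # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy data (hypothetical): tenure in months, predicted risk score, churn indicator
times  = [2, 10, 30, 50, 72]
risks  = [0.9, 0.7, 0.8, 0.2, 0.1]
events = [1, 1, 0, 1, 0]
c_index = harrell_c_index(times, risks, events)
print(f"C-index on the toy data: {c_index:.3f}")
```

In this toy example 7 of the 8 comparable pairs are ordered correctly. A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering.
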

## Survival Probabilities
I will now plot the probability over time that customers in the test subsample stay with the company.

In [None]:
# Predicted survival curves: one column per customer in the test subsample
churn_clients = pd.DataFrame(model.predict_survival_function(to_tests))
# Plot three customers; each label uses the customer's row index in the data set
for col, color in zip(churn_clients.columns[[0, 3, 4]], ['b', 'r', 'g']):
    churn_clients[col].plot(color=color, label=f'Customer {col}')
plt.axhline(0.5, color='k', linestyle='--', label='Threshold=0.5')
plt.ylim(0,1)
plt.xlim(0,75)
plt.xlabel('Timeline')
plt.ylabel('Retain probability')
plt.legend(loc='best')

The x-axis shows the Tenure variable, which measures how many months the customer stayed with the company. The y-axis shows the Survival Probability: the probability that the customer remains with the company.
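
The 0.5 threshold in the plot can be turned into a number: the first month at which a customer's survival probability drops to 0.5 or below, which is a rough estimate of the median time to churn. The sketch below does this on a hypothetical survival curve; the helper name `first_time_below` and the values are assumptions for illustration. I believe lifelines also offers a predict_median method on the fitted model for the same purpose.

```python
# Hypothetical survival curve: probability of still being a customer at each month
timeline = [0, 12, 24, 36, 48, 60, 72]
survival = [1.00, 0.85, 0.70, 0.55, 0.45, 0.30, 0.20]

def first_time_below(timeline, survival, threshold=0.5):
    """Return the first time at which the survival probability is at or below `threshold`."""
    for t, s in zip(timeline, survival):
        if s <= threshold:
            return t
    return None  # the curve never reaches the threshold within the observed timeline

median_time = first_time_below(timeline, survival)
print(f"Survival drops to 0.5 or below at month {median_time}")
```

A customer whose curve crosses the threshold early is a natural candidate for a retention action before that month.
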

# Machine Learning risk evaluation models

To know more about this topic, please check my repo at:
[https://github.com/elopezfune/Customers_Churn](https://github.com/elopezfune/Customers_Churn)