## Introduction

This document presents:
 * An initial analysis of customer churn of a given telecom company
 * Insights and possible actions the client can take to improve the scenario
 * Recommendations of further hypotheses and improvements over the model
 
**Executive Summary**
Here, we present some of the partial conclusions and what the business could do with this new information.

*Tenure and longer contracts*
These variables help reduce churn. With this information in hand, the sales and customer success departments could push longer contracts to clients. Each month that a client stays increases the chance of the client staying yet another month.

*More comments on Tenure*
Churn is very high after the first month, and we have two main hypotheses for that:
 * Our client (the telecom) does not have a good screening process for accepting or rejecting customers. This is an opportunity for yet another project: risk modelling for new customer acceptance.
 * The onboarding process is poor (it may take too long to install the service in the customer's house, the product may be hard to use, etc.)
 
*Monthly Charges*
Cheaper plans are associated with lower churn. We could investigate further to find out how life-time value changes when the price of a given service plan is decreased. From that, we could derive new monthly charges that optimize the life-time value of the client.
The second use of this insight is more direct. If a customer wants to end their contract with the telecom, offering a discount for a limited time is a good practice. The chance of churn decreases, and even when the discount is over, the chance of churn is smaller because of the increase in tenure.

*Phone Service and InternetService*
InternetService is associated with higher churn (see the Lasso analysis), and PhoneService has no detectable effect (see the chi-squared analysis). Our hypotheses are that customers don't care much about PhoneService and that our InternetService is poor.
The telecom could survey clients about PhoneService and InternetService to test these hypotheses. If they turn out to be true, restricting the PhoneService offer to a niche group and improving InternetService could increase our profits.

*The Model*
Note: the model has not achieved the desired results and cannot be used by the business as is. Possible improvements are discussed in this document. In this section, we outline what could be done with a good prediction model.
One of the main uses of the model would be to automate customer service.
For instance, since decreases in monthly charges reduce churn, the clients with the highest probability of churn could receive automatic discounts or coupons.
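
As a minimal sketch of that automation, assuming a fitted classifier that outputs one churn probability per customer, the snippet below selects the customers whose predicted probability reaches a threshold. The customer IDs, the probabilities, the `0.7` threshold and the `select_for_discount` helper are all hypothetical and only illustrate the idea.

```python
import pandas as pd


def select_for_discount(scores: pd.DataFrame, threshold: float = 0.7) -> pd.DataFrame:
    """Return customers whose predicted churn probability is at least
    `threshold`, sorted from highest to lowest risk (hypothetical helper)."""
    at_risk = scores.loc[scores["churn_probability"] >= threshold]
    return at_risk.sort_values("churn_probability", ascending=False)


# Hypothetical model output, not taken from the real data
scores = pd.DataFrame({
    "customerID": ["A-001", "A-002", "A-003", "A-004"],
    "churn_probability": [0.15, 0.82, 0.71, 0.40],
})

targets = select_for_discount(scores)
print(targets)
```

The threshold would in practice be chosen from the cost of a discount versus the value of a retained customer.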

# First glance at the data

The first step of our analysis is to get a better notion of what we have on our hands.
This will lead us to the first insights on how to clean the data and which models might give good results.

In [None]:
import pandas as pd
raw_data = pd.read_csv("../input/WA_Fn-UseC_-Telco-Customer-Churn.csv")
print(raw_data.dtypes)
pd.set_option('display.max_columns', None)
raw_data.head()

We then have 20 features (customerID is only an index): 19 independent variables (or input variables) and one dependent variable, named 'Churn'.

Here, I want to point out three things:
* Most of the data is categorical, which suggests to me that linear or logistic regression would not work well out of the box. These regressions are a more natural fit for problems with continuous features, and categorical inputs must first be encoded.
* There are some obvious relationships among the features. For instance, there is a PhoneService feature, and the MultipleLines feature has a 'No phone service' category. Later in the analysis, this insight will help us eliminate some variables.
* Some numerical variables were converted automatically to int or float types, which indicates the absence of NaNs (not the same as saying that the data was 100% correctly filled in). The same cannot be said for TotalCharges.
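
To make the last point concrete, here is a small sketch of how one might locate the entries that prevent a column from being read as numeric. It assumes the problem is blank or otherwise non-numeric strings; the toy values below are hypothetical stand-ins for the TotalCharges column.

```python
import pandas as pd

# Hypothetical stand-in for TotalCharges: read as strings because one entry is blank
toy = pd.DataFrame({"TotalCharges": ["29.85", "1889.5", " ", "108.15"]})
print(toy.dtypes)  # a string dtype, not float64

# errors='coerce' turns anything that cannot be parsed into NaN
converted = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# Rows whose original text could not be parsed as a number
bad_rows = toy.loc[converted.isna(), "TotalCharges"]
print(bad_rows)
```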

So, moving on...

# Data Quality

Here, we want to do some exploratory data analysis. Our objective is to get a deeper understanding of the data. More specifically, we have the following tasks:
* Find NaNs and assess their impact on the data
* Plot some distributions and correlations to see which data to use and how to use it when modelling

We start with gender. In the gender feature, I would expect to find a distribution similar to the general population (roughly half male and half female).

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Share of each gender value in the data
gender_shares = raw_data['gender'].value_counts(normalize=True)

plt.bar(gender_shares.index, gender_shares.values)
plt.show()

Seems right, and no NaN, very good.
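
As a quick numeric cross-check of the plot, the shares and any missing values can also be listed directly. This is only a sketch on a hypothetical sample standing in for `raw_data['gender']`; one missing value is planted on purpose to show how it would surface.

```python
import pandas as pd

# Hypothetical sample, with one missing value planted on purpose
gender = pd.Series(["Male", "Female", "Female", "Male", None, "Male"])

# normalize=True returns shares instead of counts;
# dropna=False keeps missing values as their own category
shares = gender.value_counts(normalize=True, dropna=False)
print(shares)

# Number of missing entries
n_missing = int(gender.isna().sum())
print("Missing values:", n_missing)
```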

In [None]:
# Let's get rid of code repetition before continuing
def plot_series_bar(series, series_name, ax):
    """Draw the share of each distinct value of `series` as a bar chart on `ax`."""
    shares = series.value_counts(normalize=True)
    ax.bar(shares.index, shares.values)
    ax.set_title(series_name)


fig, axes = plt.subplots(ncols=5, sharex=False, sharey=True, figsize=(15,5))
for ax, column in zip(axes, ['SeniorCitizen', 'Partner', 'Dependents','PhoneService', 'MultipleLines']):
    plot_series_bar(raw_data[column], column, ax)
plt.show()

All the data is complete, which is very good.

The percentage of senior citizens is similar to that of the national population (around 15%). A proportion slightly higher than the national average is expected here, since we probably don't have minors as clients.

I thought we would not have complete data for partner or dependents, as it seems odd for a telecom company to ask for that. So I guess that some 'No' values may actually be NaNs, but the quantity should be small, since at least the proportion with a partner is also close to the share of the general population that is married (around 40%).

Continuing with the plot...

In [None]:
fig, axes = plt.subplots(ncols=7, sharex=False, sharey=True, figsize=(15,5))
i = 0
for column in ['InternetService', 'OnlineSecurity', 'OnlineBackup','DeviceProtection', 'TechSupport', 
               'StreamingTV', 'StreamingMovies']:
    plot_series_bar(raw_data[column], column, axes[i])
    i = i + 1
plt.show()

In [None]:
fig, axes = plt.subplots(ncols=4, sharex=False, sharey=True, figsize=(15,5))
i = 0
for column in ['Contract', 'PaperlessBilling', 'PaymentMethod', 'Churn']:
    plot_series_bar(raw_data[column], column, axes[i])
    i = i + 1
plt.show()

The "No internet service" and "No phone service" values match across all graphs, a good sign for data quality.
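
The same consistency can be checked row by row instead of by eye. The sketch below uses a hypothetical toy frame with the same column names; on the real data one would loop over all internet-dependent columns.

```python
import pandas as pd

# Hypothetical rows using the same column names as the dataset
toy = pd.DataFrame({
    "InternetService": ["DSL", "No", "Fiber optic", "No"],
    "OnlineSecurity": ["Yes", "No internet service", "No", "No internet service"],
    "StreamingTV": ["No", "No internet service", "Yes", "No internet service"],
})

no_internet = toy["InternetService"] == "No"
results = {}
for column in ["OnlineSecurity", "StreamingTV"]:
    flagged = toy[column] == "No internet service"
    # The two boolean masks must agree on every row
    results[column] = bool((flagged == no_internet).all())

print(results)
```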

Below, we continue to evaluate data quality, but we also start to check the variables' correlations, mainly with Churn.

In [None]:
import numpy as np
binwidth = 1
tenure_ser = raw_data['tenure']
plt.hist([raw_data.loc[raw_data['Churn']=='Yes']['tenure'], raw_data.loc[raw_data['Churn']=='No']['tenure']],
         bins=np.arange(min(tenure_ser), max(tenure_ser) + binwidth, binwidth), histtype='barstacked',
        label=['Has Churned', 'Has not'])
plt.legend(prop={'size': 10})
plt.show()

We can conclude two things from this tenure graph:
 * There seems to be a strong inverse relationship between tenure and churn
 * There are a lot of clients accumulated at a tenure of 72

The first one is rather expected: the longer you stay as a client, the higher the probability that you stay one more month. The second seems to indicate that 72 actually represents clients with 72 or more months with the company.

To deal with this distortion, we have two alternatives:
 * Create a dummy variable that indicates that the client has a tenure of 72, so the model can more easily figure out whether it changes anything
 * Transform the variable into a categorical variable

I would test both if I had the time, but here we will implement the second alternative because it is more commonly applied and several models respond well to it.
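
Both alternatives are short to express in pandas. The sketch below uses hypothetical tenures, and the explicit one-year bin edges are an assumption made for readability (equal-width bins computed by `pd.cut` would differ slightly).

```python
import pandas as pd

# Hypothetical tenures in months; 72 is assumed to be the cap
toy = pd.DataFrame({"tenure": [1, 5, 13, 30, 60, 72, 72]})

# Alternative 1: a dummy variable flagging the capped value
toy["tenure_is_72"] = (toy["tenure"] == 72).astype(int)

# Alternative 2: bin into one-year categories (0, 12], (12, 24], ..., (60, 72]
edges = list(range(0, 73, 12))
toy["tenure_year"] = pd.cut(toy["tenure"], bins=edges, labels=[1, 2, 3, 4, 5, 6])

print(toy)
```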

In [None]:
import numpy as np

def plot_hist_with_churn(df, series_name, binwidth, ax):
    ser = df[series_name]
    ax.hist([df.loc[df['Churn']=='Yes'][series_name], df.loc[df['Churn']=='No'][series_name]],
         bins=np.arange(min(ser), max(ser) + binwidth, binwidth), histtype='barstacked',
        label=['Has Churned', 'Has not'])
    ax.legend(prop={'size': 10})

fig, ax = plt.subplots()
plot_hist_with_churn(raw_data, 'MonthlyCharges', 1, ax)
plt.show()

As already stated in the introduction, the graph above shows one of our main insights: cheaper plans are associated with lower churn. There is also a low percentage of churned clients in high-paying plans, probably very satisfied clients. With more time, I would try to find out who these very satisfied clients are by correlating this with the demographic and service features.
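
To put numbers on what the histogram shows, one could compute the churn rate per price band. The rows and the band edges below are hypothetical and only demonstrate the calculation.

```python
import pandas as pd

# Hypothetical rows, not taken from the real data
toy = pd.DataFrame({
    "MonthlyCharges": [20.0, 25.0, 45.0, 55.0, 80.0, 95.0, 100.0, 110.0],
    "Churn": ["No", "No", "No", "Yes", "Yes", "Yes", "No", "Yes"],
})

# Assign each customer to a price band (band edges are an assumption)
bands = pd.cut(toy["MonthlyCharges"], bins=[0, 40, 80, 120])

# The mean of a boolean series is the share of True values, i.e. the churn rate
churn_rate = (toy["Churn"] == "Yes").groupby(bands, observed=True).mean()
print(churn_rate)
```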

For the total charges plot, we first need to deal with the NaNs.

In [None]:
# Entries that cannot be parsed as numbers become NaN with errors='coerce'
total_charges_ser = pd.to_numeric(raw_data['TotalCharges'], errors='coerce')
raw_data['TotalCharges'] = total_charges_ser
# How many NaNs do we have in TotalCharges?
print(total_charges_ser.isnull().sum())

Eleven out of more than 7,000 samples is pretty insignificant. We can drop these eleven rows if we decide that TotalCharges is an important variable to use in the model. If TotalCharges turns out not to be so important, we don't need to be concerned with that.
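
Dropping the affected rows is a one-liner with `dropna`. The sketch below runs on a hypothetical three-row frame and also prints the share of missing rows, which is the number that justifies dropping them.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing TotalCharges value
toy = pd.DataFrame({
    "customerID": ["A-001", "A-002", "A-003"],
    "TotalCharges": [29.85, np.nan, 108.15],
})

n_missing = int(toy["TotalCharges"].isna().sum())
print(f"{n_missing} missing of {len(toy)} rows ({n_missing / len(toy):.1%})")

# Keep only the rows where TotalCharges is present
cleaned = toy.dropna(subset=["TotalCharges"])
print(cleaned)
```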

In [None]:
without_nan = raw_data.loc[total_charges_ser.notnull()].infer_objects()
# Histogram of TotalCharges
without_nan['TotalCharges'].hist()

# Boxplot of TotalCharges split by churn status
without_nan.boxplot('TotalCharges', by='Churn')

So, people who have paid more over time tend to churn less (the main hypothesis is that they have been with the telecom longer). There are, though, several outliers among the churned clients, maybe another sign that high-paying plans are not good for churn.

In [None]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

def print_anova(series_name, data):
    model = ols(series_name+' ~ Churn', data=data).fit()
    anova_table = sm.stats.anova_lm(model, typ=2)
    print(anova_table)
    print('\n')

print_anova('TotalCharges', without_nan) 
print_anova('MonthlyCharges', without_nan)
print_anova('tenure', without_nan)


Above, we used ANOVA as a measure of association. The p-value (last column) is the probability of observing a difference between the churned and non-churned groups at least this large if the variable had no relation to churn. Since the p-values are very low, we can conclude that these variables are associated with churn.
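
For intuition, the same kind of test can be run on two small groups with `scipy.stats.f_oneway`, which performs a one-way ANOVA directly on the samples. The tenures below are hypothetical, and this is a swapped-in shortcut for illustration, not the statsmodels call used above.

```python
from scipy import stats

# Hypothetical tenures (months) for churned and retained customers
tenure_churned = [1, 2, 3, 5, 8, 10]
tenure_stayed = [24, 36, 40, 48, 60, 72]

# One-way ANOVA: do the group means differ more than chance would explain?
f_stat, p_value = stats.f_oneway(tenure_churned, tenure_stayed)
print(f"F = {f_stat:.2f}, p = {p_value:.4g}")
```

A small p-value here says only that the group means differ; it does not say which way the causality runs.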

Below, we present a chi-squared analysis, again calculating the p-value to get a notion of association.

In [None]:
import scipy.stats as stats

def print_cross_table(df, series_name):
    tab = pd.crosstab(df.Churn, df[series_name], margins = True)
    tab.columns = ["Not "+series_name, series_name, "row_totals"]
    # crosstab sorts the Churn labels, so the 'No' row comes before the 'Yes' row
    tab.index = ["Has not", "Has Churned", "col_totals"]
    print(tab)
    print("\n")
    return tab
    
tab = print_cross_table(without_nan,'SeniorCitizen')
# Test only the 2x2 block of observed counts, leaving the margins out
print(stats.chi2_contingency(observed= tab.iloc[0:2,0:2]))
print("\n")
tab = print_cross_table(without_nan,'PhoneService')
print(stats.chi2_contingency(observed= tab.iloc[0:2,0:2]))

SeniorCitizen is significantly associated with churn, but PhoneService seems to make no difference (p-value of 0.35, bigger than 0.05, so we cannot reject the null hypothesis).
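
The mechanics of the test are easier to follow on a tiny example. The counts below are hypothetical; the point is that the contingency table is built without margins, so only observed cell counts go into `chi2_contingency`.

```python
import pandas as pd
from scipy import stats

# Hypothetical data: 30 churned and 70 retained customers
toy = pd.DataFrame({
    "Churn": ["Yes"] * 30 + ["No"] * 70,
    "SeniorCitizen": [1] * 20 + [0] * 10 + [1] * 15 + [0] * 55,
})

# No margins: the test expects only the observed cell counts
observed = pd.crosstab(toy["Churn"], toy["SeniorCitizen"])
print(observed)

chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4g}, dof = {dof}")
```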

# Model

First, the preprocessing.
We create dummy variables for the categorical data.
We also split tenure and monthly charges into categories.
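
As a compact illustration of these two steps, the sketch below bins one continuous column and one-hot encodes it together with a categorical column in a single `get_dummies` call. The toy rows are hypothetical, and using `prefix_sep='-'` to reproduce the `feature-category` naming is an assumption about the desired column names.

```python
import pandas as pd

# Hypothetical rows using two of the dataset's column names
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "MonthlyCharges": [20.0, 50.0, 80.0, 110.0],
})

# Split the continuous column into two equal-width categories
toy["MonthlyCharges"] = pd.cut(toy["MonthlyCharges"], 2, labels=["1", "2"])

# One-hot encode both columns; drop_first removes one reference category each
encoded = pd.get_dummies(toy, columns=["Contract", "MonthlyCharges"],
                         prefix_sep="-", drop_first=True)
print(encoded.columns.tolist())
```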

In [None]:
cat_data =  pd.DataFrame()
cat_data['tenure'] = pd.cut(without_nan['tenure'], 6, labels=['1','2','3','4','5','6']) #cat of one year
cat_data['MonthlyCharges'] = pd.cut(without_nan['MonthlyCharges'], 5, labels=['1','2','3','4','5'])

multi_categorical_columns = ['MultipleLines','InternetService','OnlineSecurity','OnlineBackup','DeviceProtection',
                       'TechSupport','StreamingTV','StreamingMovies','Contract','PaymentMethod']

simple_categorical_columns = ['gender','SeniorCitizen','Partner','Dependents','PhoneService','PaperlessBilling','Churn']

preproc_data =  pd.DataFrame()

for category in multi_categorical_columns:
    dummies = pd.get_dummies(without_nan[category], drop_first=True)
    # These categories duplicate PhoneService / InternetService;
    # errors='ignore' skips the ones a given feature does not have
    dummies = dummies.drop(columns=['No phone service', 'No internet service'], errors='ignore')
    dummies.columns = [category + '-' + col_name for col_name in dummies.columns]
    preproc_data = pd.concat([preproc_data, dummies], axis=1)


for category in ['tenure','MonthlyCharges']:
    dummies = pd.get_dummies(cat_data[category], drop_first=True)
    dummies.columns = [category + '-' + col_name for col_name in dummies.columns]
    preproc_data = pd.concat([preproc_data, dummies], axis=1)

for category in simple_categorical_columns:
    dummies = pd.get_dummies(without_nan[category], drop_first=True)
    # str() is needed because SeniorCitizen categories are integers
    dummies.columns = [category + '-' + str(col_name) for col_name in dummies.columns]
    preproc_data = pd.concat([preproc_data, dummies], axis=1)

preproc_data['TotalCharges'] = without_nan['TotalCharges']

In [None]:
from sklearn.linear_model import Lasso
from sklearn import preprocessing

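# Standardize every column (including the 0/1 target, which Lasso treats as a continuous response)
scaler = preprocessing.StandardScaler().fit(preproc_data)
norm_data = pd.DataFrame(data=scaler.transform(preproc_data), columns=preproc_data.columns)

ys = []
xs = []

independent = norm_data.drop(columns=['Churn-Yes'])
dependent = norm_data['Churn-Yes']

# Fit a Lasso for alpha = 0.01 ... 0.19 and record the coefficients at each step
for alpha in range(1,20):
    # The data is already standardized above; the `normalize` argument was removed
    # from recent scikit-learn releases, and max_iter must be an integer
    lassoreg = Lasso(alpha=alpha/1e2, max_iter=10000)

    lassoreg.fit(independent,dependent)
    
    ys.append(lassoreg.coef_.tolist())
    
    xs.append(alpha/1e2)

# One line per feature: coefficient value as a function of alpha
handles = plt.plot(xs, ys)
plt.legend(handles=handles, labels=independent.columns.tolist(), bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()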

In [None]:
# `normalize` was removed from recent scikit-learn releases; max_iter must be an integer
lassoreg = Lasso(alpha=0.05, max_iter=10000)
lassoreg.fit(independent,dependent)
coefs = np.abs(lassoreg.coef_.tolist())
# Indices of the ten coefficients with the largest absolute value
best_ten = sorted(range(len(coefs)), key=lambda k: coefs[k], reverse=True)[0:10]
best_ten_feats = []
for i in best_ten:
    print(independent.columns[i] + ' with coef: ' + str(lassoreg.coef_[i]))
    best_ten_feats.append(independent.columns[i])

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Get the best features, add back the missing categories of the best features and append the dependent variable
feature_cols = best_ten_feats + ['PaymentMethod-Credit card (automatic)', 'PaymentMethod-Mailed check',
                                 'tenure-2','tenure-4','tenure-5','tenure-6']
# dict.fromkeys drops repeated names (while keeping order) in case a hard-coded column is already in the top ten
model_data = preproc_data[list(dict.fromkeys(feature_cols)) + ['Churn-Yes']]

rf = RandomForestClassifier(n_estimators = 100, random_state = 12)
# By default, 25% of the data is test data; the seed is fixed so the split is reproducible
train_features, test_features = train_test_split(model_data, random_state=12)

x_train = train_features.drop(columns=['Churn-Yes'])
y_train = (train_features['Churn-Yes']>0.5)

x_test = test_features.drop(columns=['Churn-Yes'])
y_test = (test_features['Churn-Yes']>0.5)

rf.fit(x_train, y_train)

# Use the forest's predict method on the test data
predictions = rf.predict(x_test)

true_positives = ( (predictions==1) & (y_test==1) )

# Calculate and display precision and recall
print('Precision:', round(100*(true_positives.sum()/(predictions==1).sum()), 2), '%.')
print('Recall:', round(100*(true_positives.sum()/(y_test==1).sum()), 2), '%.')
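
The precision and recall computed by hand above can be cross-checked against `sklearn.metrics`. The labels and predictions below are hypothetical; on the notebook's data one would pass `y_test` and `predictions` instead.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical true labels and predictions (1 = churned)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

# Precision: share of predicted churners that really churned
precision = precision_score(y_true, y_pred)
# Recall: share of real churners that the model caught
recall = recall_score(y_true, y_pred)

print(f"Precision: {precision:.2%}")
print(f"Recall: {recall:.2%}")
```

Recall is true positives divided by all actual positives; one minus that quantity is the miss rate, which is a different metric.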

In [None]:
import xgboost as xgb

# specify parameters via map
# 'verbosity': 0 replaces the deprecated 'silent' flag
param = {'max_depth':5, 'eta':1, 'verbosity':0, 'objective':'binary:logistic' }
num_round = 10

xgtrain = xgb.DMatrix(x_train.values, label=y_train.values)
xgtest = xgb.DMatrix(x_test.values)

bst = xgb.train(param, xgtrain, num_round)
# make prediction (a churn probability for each test row)
predictions = bst.predict(xgtest)

true_positives = ( (predictions>0.5) & (y_test==True) )

# Calculate and display precision and recall
print('Precision:', round(100*(true_positives.sum()/(predictions>0.5).sum()), 2), '%.')
# Recall is TP / actual positives (1 minus that ratio would be the miss rate)
print('Recall:', round(100*(true_positives.sum()/(y_test==True).sum()), 2), '%.')

Trying a Neural Network 

In [None]:
from sklearn.neural_network import MLPClassifier

train_data, test_data = train_test_split(preproc_data) #by default, 25% of the data is test data

x_train = train_data.drop(columns=['Churn-Yes'])
y_train = (train_data['Churn-Yes']>0.5)

x_test = test_data.drop(columns=['Churn-Yes'])
y_test = (test_data['Churn-Yes']>0.5)

scaler = preprocessing.StandardScaler()
scaler.fit(x_train)

x_train = scaler.transform(x_train)  
# apply same transformation to test data
x_test = scaler.transform(x_test)  

clf = MLPClassifier(solver='adam', alpha=1e-7,
                    hidden_layer_sizes=(10, 10, 2), random_state=1, max_iter=200000)

clf.fit(x_train, y_train)

# make prediction
predictions = clf.predict(x_test)

true_positives = ( (predictions>0.5) & (y_test==True) )

# Calculate and display precision and recall
print('Precision:', round(100*(true_positives.sum()/(predictions>0.5).sum()), 2), '%.')
# Recall is TP / actual positives (1 minus that ratio would be the miss rate)
print('Recall:', round(100*(true_positives.sum()/(y_test==True).sum()), 2), '%.')