# Customer Churn

---


## Context
Predict behavior to retain customers. You can analyze all relevant customer data and develop focused customer retention programs.

## Content
Each row represents a customer, and each column contains one of the customer's attributes, as described in the column metadata.

The data set includes information about:

* Customers who left within the last month – the column is called Churn
* Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
* Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges
* Demographic info about customers – gender, age range, and if they have partners and dependents

## Inspiration
To explore this type of model and learn more about the subject.

## First insight

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split

In [None]:
from sklearn.metrics import f1_score, classification_report

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC

In [None]:
import lightgbm as lgbm
import xgboost as xgb

In [None]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

pd.set_option('display.max_columns', 100)

In [None]:
df = pd.read_csv(r"../input/WA_Fn-UseC_-Telco-Customer-Churn.csv")
df.head()

In [None]:
df.shape

The dataset contains about 7,000 customers, each described by a customer ID, 19 features, and the target.

**Features** are the following:
- `customerID`: a unique ID for each customer
- `gender`: the gender of the customer
- `SeniorCitizen`: whether the customer is a senior (i.e. older than 65) or not
- `Partner`: whether the customer has a partner or not
- `Dependents`: whether the customer has people to take care of or not
- `tenure`: the number of months the customer has stayed
- `PhoneService`: whether the customer has a phone service or not
- `MultipleLines`: whether the customer has multiple telephonic lines or not
- `InternetService`: the kind of internet services the customer has (DSL, Fiber optic, no)
- `OnlineSecurity`: what online security the customer has (Yes, No, No internet service)
- `OnlineBackup`: whether the customer has online backup file system (Yes, No, No internet service)
- `DeviceProtection`: Whether the customer has device protection or not (Yes, No, No internet service)
- `TechSupport`: whether the customer has tech support or not (Yes, No, No internet service)
- `StreamingTV`: whether the customer has a streaming TV device (e.g. a TV box) or not (Yes, No, No internet service)
- `StreamingMovies`: whether the customer uses streaming movies (e.g. VOD) or not (Yes, No, No internet service)
- `Contract`: the contract term of the customer (Month-to-month, One year, Two year)
- `PaperlessBilling`: Whether the customer has electronic billing or not (Yes, No)
- `PaymentMethod`: payment method of the customer (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
- `MonthlyCharges`: the amount charged to the customer monthly
- `TotalCharges`: the total amount the customer paid

And the **Target** :
- `Churn`: whether the customer left or not (Yes, No)

As you can see, many features are categorical with more than 2 values. You will have to handle this.

Take time to make a proper and complete EDA: this will help you build a better model.

---

# Exploratory Data Analysis

Global information on the dataset (null values, types, ...)

In [None]:
df.info()

Number of columns of each type

In [None]:
df.dtypes.value_counts()

Number of unique values in each categorical (object) column

In [None]:
df.select_dtypes('object').apply(pd.Series.nunique, axis = 0)

## Target information

In [None]:
df['Churn'].value_counts()

In [None]:
df['Churn'].str.replace('No', '0').str.replace('Yes', '1').astype(int).plot.hist()

Basic statistics on the numerical columns

In [None]:
df.describe()

## Basic cleaning

In [None]:
df.duplicated().sum()

In [None]:
df.isnull().sum()

In [None]:
df = df.drop(columns=['customerID'])

No missing or duplicated rows. The customer ID is irrelevant and can be dropped.

## Dealing with abnormal values

The 'TotalCharges' column has an object type, but it is supposed to contain only numerical values. Let's dig a little deeper:

In [None]:
# For the record: how to strip a non-digit prefix from a string column
#test = pd.Series(["U$ 192.01"])
#test.str.replace(r'^[^\d]*', '', regex=True).astype(float)

#df.TotalCharges = df.TotalCharges.str.replace(r'^[^\d]*', '', regex=True)

In [None]:
df.iloc[0, df.columns.get_loc("TotalCharges")]

In [None]:
float(df.iloc[0, df.columns.get_loc("TotalCharges")])

In [None]:
df.iloc[488, df.columns.get_loc("TotalCharges")]

In [None]:
len(df[df['TotalCharges'] == ' '])

Drop the strange/missing values (the pandas function `to_numeric` could also have been used):

In [None]:
# replace blank strings with NaN so that they can be dropped
df.TotalCharges = df.TotalCharges.replace(" ",np.nan)

# drop missing values - side note: it represents only 11 out of 7043 rows which is not significant...
df = df.dropna()

# now we can convert the column type
df.TotalCharges = df.TotalCharges.astype('float')

df.shape
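
As mentioned above, `pd.to_numeric` is an alternative to the replace / dropna / astype sequence. The sketch below runs on a small made-up Series standing in for `TotalCharges` (the values are illustrative, not taken from the dataset): `errors="coerce"` turns anything that cannot be parsed, including the blank strings, into NaN.

```python
import pandas as pd

# Toy stand-in for the TotalCharges column: numeric strings plus blank entries.
raw = pd.Series(["29.85", "1889.5", " ", "108.15", " "], name="TotalCharges")

# errors="coerce" converts unparseable entries to NaN instead of raising.
converted = pd.to_numeric(raw, errors="coerce")

print(converted.isna().sum())        # number of entries that could not be parsed
print(converted.dropna().tolist())   # the remaining numeric values
```

On the real frame this would amount to assigning `pd.to_numeric(df['TotalCharges'], errors='coerce')` back to the column and then calling `dropna()`.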

In [None]:
num_feat = df.select_dtypes(include=['float', 'int']).columns.tolist()
num_feat.remove('SeniorCitizen')    # SeniorCitizen is only a boolean
num_feat

In [None]:
sns.pairplot(data=df[num_feat])
plt.show()

Plot the distribution of those features, with and without the distinction between customers who churned and those who stayed

In [None]:
plt.figure(figsize=(16, 10))

# top row: overall distributions (histplot replaces the deprecated distplot)
plt.subplot(2, 3, 1)
sns.histplot(df['tenure'], kde=True, stat='density')
plt.title('tenure')

plt.subplot(2, 3, 2)
sns.histplot(df['MonthlyCharges'], kde=True, stat='density')
plt.title('MonthlyCharges')

plt.subplot(2, 3, 3)
sns.histplot(df['TotalCharges'], kde=True, stat='density')
plt.title('TotalCharges')

# bottom row: densities split by churn (fill replaces the deprecated shade)
plt.subplot(2, 3, 4)
sns.kdeplot(df.loc[df['Churn'] == 'No', 'tenure'], fill=True, label='Churn == 0')
sns.kdeplot(df.loc[df['Churn'] == 'Yes', 'tenure'], fill=True, label='Churn == 1')
plt.legend()

plt.subplot(2, 3, 5)
sns.kdeplot(df.loc[df['Churn'] == 'No', 'MonthlyCharges'], fill=True, label='Churn == 0')
sns.kdeplot(df.loc[df['Churn'] == 'Yes', 'MonthlyCharges'], fill=True, label='Churn == 1')
plt.legend()

plt.subplot(2, 3, 6)
sns.kdeplot(df.loc[df['Churn'] == 'No', 'TotalCharges'], fill=True, label='Churn == 0')
sns.kdeplot(df.loc[df['Churn'] == 'Yes', 'TotalCharges'], fill=True, label='Churn == 1')
plt.legend()


Are there any correlations?

In [None]:
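# restrict to numeric columns: recent pandas versions raise on object columns
corr = df.select_dtypes('number').corr()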
corr

In [None]:
# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)  # np.bool was removed from NumPy
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(6, 4))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5}, annot=True)

In [None]:
for c in num_feat:
    plt.figure(figsize=(12, 1))
    sns.boxplot(x=df[c])
    plt.title(c)
    plt.show()

In [None]:
cat_features = df.select_dtypes('object').columns.tolist()
cat_features

Plot the count of different categories for the other features (with text)

In [None]:
plt.figure(figsize=(16, 20))
plt.subplots_adjust(hspace=0.4)

for i in range(len(cat_features)):
    plt.subplot(6, 3, i+1)
    sns.countplot(x=df[cat_features[i]])
    #plt.title(cat_features[i])

plt.show()

Same plot but with the distinction between customers who churn

In [None]:
cat_features.remove('Churn')

plt.figure(figsize=(16, 20))
plt.subplots_adjust(hspace=0.4)

for i in range(len(cat_features)):
    plt.subplot(6, 3, i+1)
    sns.countplot(data=df, x=cat_features[i], hue='Churn')
    #plt.title(cat_features[i])

plt.show()

---

# Data Preparation & Feature engineering

Target creation

In [None]:
y = df.Churn.str.replace('No', '0').str.replace('Yes', '1').astype(int)

One-hot encoding of categorical features

In [None]:
X = pd.get_dummies(data=df, columns=cat_features, drop_first=True)
X = X.drop(columns=['Churn'])
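
To make the encoding concrete, here is a minimal sketch on a made-up two-column frame (the rows are illustrative, not taken from the dataset). `pd.get_dummies` creates one indicator column per category. With `drop_first=True`, the first category of each feature (in sorted order) is dropped, since it is implied when all the remaining indicators are zero.

```python
import pandas as pd

# Toy frame with one three-level and one two-level categorical feature.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "Partner": ["Yes", "No", "No", "Yes"],
})

# One indicator column per category, minus the first category of each feature.
encoded = pd.get_dummies(toy, columns=["Contract", "Partner"], drop_first=True)

print(encoded.columns.tolist())
# ['Contract_One year', 'Contract_Two year', 'Partner_Yes']
```

A row with both `Contract_*` indicators at zero is therefore a month-to-month contract.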

In [None]:
X.shape, y.shape

Feature creation

* In this case, adding features from another dataset is difficult, because no additional information is provided with the CSV file we're using.
* All columns except `customerID` are relevant, so all of them are kept.
* We can combine features to create new ones: dividing `TotalCharges` by `tenure` gives an average charge per month. Comparing this value to `MonthlyCharges` gives an idea of how the charges evolved over time.

In [None]:
X['average_charges'] = X['TotalCharges'] / X['tenure']
X.loc[X['tenure'] == 0, 'average_charges'] = X['MonthlyCharges']
X.head()
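
The behaviour of this feature can be checked on a few made-up rows (the values are illustrative). The sketch below also computes the comparison with the monthly charges described above. The column name `charge_trend` is a hypothetical addition for illustration and is not used elsewhere in this notebook.

```python
import pandas as pd

# Three toy customers, the last one with a tenure of 0 months.
toy = pd.DataFrame({
    "tenure": [1, 12, 0],
    "MonthlyCharges": [30.0, 50.0, 20.0],
    "TotalCharges": [30.0, 660.0, 0.0],
})

# Average charge per month; 0 / 0 yields NaN, which is patched on the next line.
toy["average_charges"] = toy["TotalCharges"] / toy["tenure"]
toy.loc[toy["tenure"] == 0, "average_charges"] = toy["MonthlyCharges"]

# Hypothetical extra column: negative means the customer now pays less per month
# than their historical average, positive means they pay more.
toy["charge_trend"] = toy["MonthlyCharges"] - toy["average_charges"]

print(toy)
```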

Scaling data

In [None]:
num_feat.append('average_charges')
scaler = MinMaxScaler()
X[num_feat] = scaler.fit_transform(X[num_feat])

In [None]:
X.head()
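
Note that the scaler above is fitted on the whole dataset, before the train/test split. A common variation, sketched here on synthetic data under the assumption that the test set should stay unseen during preprocessing, fits the scaler on the training portion only and reuses its statistics on the test portion.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the numerical features and the target.
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 100, size=(50, 2))
y_toy = rng.integers(0, 2, size=50)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

scaler = MinMaxScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # min and max learned from the training rows only
X_te_scaled = scaler.transform(X_te)      # test rows reuse the training min and max

print(X_tr_scaled.min(axis=0), X_tr_scaled.max(axis=0))
```

With this approach the scaled test values may fall slightly outside [0, 1], which is expected.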

Splitting train and test sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
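
Since the target is imbalanced, one option is to stratify the split so that both sets keep the same churn rate, and to fix `random_state` so that scores are comparable between runs. This is a sketch on synthetic labels, not the split used in the rest of this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 75 negatives, 25 positives.
y_toy = np.array([0] * 75 + [1] * 25)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# Both sets keep the original 25% share of positives.
print(y_tr.mean(), y_te.mean())
```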

Feature importances

In [None]:
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(X, y)

In [None]:
feature_importances = pd.DataFrame(rnd_clf.feature_importances_, index = X.columns,
                                    columns=['importance']).sort_values('importance', ascending=False)
feature_importances[:10]

In [None]:
plt.figure(figsize=(8, 10))
sns.barplot(x="importance", y=feature_importances.index, data=feature_importances)
plt.show()

---

# Baselines

In [None]:
# f1_score binary by default
def get_f1_scores(clf, model_name):
    y_train_pred, y_pred = clf.predict(X_train), clf.predict(X_test)
    print(model_name, f'\t - Training F1 score = {f1_score(y_train, y_train_pred) * 100:.2f}% / Test F1 score = {f1_score(y_test, y_pred)  * 100:.2f}%')
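
As a reminder of what this score measures, the F1 score is the harmonic mean of precision and recall computed on the positive class (here, churn). The toy labels below are made up to check the formula by hand against `f1_score`.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels: 2 true positives, 1 false positive, 2 false negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_hat = [1, 0, 1, 0, 1, 0, 0, 0]

precision = precision_score(y_true, y_hat)  # TP / (TP + FP) = 2 / 3
recall = recall_score(y_true, y_hat)        # TP / (TP + FN) = 2 / 4

# Harmonic mean of precision and recall.
f1_manual = 2 * precision * recall / (precision + recall)

print(f1_manual, f1_score(y_true, y_hat))   # both equal 4 / 7
```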

In [None]:
model_list = [RandomForestClassifier(),
    LogisticRegression(),
    SVC(),
    LinearSVC(),
    SGDClassifier(),
    lgbm.LGBMClassifier(),
    xgb.XGBClassifier()
             ]

In [None]:
model_names = [str(m)[:str(m).index('(')] for m in model_list]

In [None]:
for model, name in zip(model_list, model_names):
    model.fit(X_train, y_train)
    get_f1_scores(model, name)

The first model, the random forest classifier, is clearly overfitting the training set and does not generalize. The other models do not give good results and are probably underfitting. So let's tune them!
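
A gap between training and validation scores can also be estimated with `cross_val_score`, which is imported at the top of this notebook. The sketch below uses synthetic data from `make_classification` as a stand-in for the churn features, so the numbers it prints say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced and noisy binary problem (illustrative only).
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, weights=[0.75, 0.25], flip_y=0.1, random_state=0
)

clf = RandomForestClassifier(n_estimators=50, random_state=0)

# F1 on held-out folds: an estimate of how well the model generalizes.
cv_f1 = cross_val_score(clf, X_toy, y_toy, cv=5, scoring="f1")

# F1 on the very data the model was trained on.
train_f1 = f1_score(y_toy, clf.fit(X_toy, y_toy).predict(X_toy))

print(f"train F1 = {train_f1:.2f} / cross-validated F1 = {cv_f1.mean():.2f}")
```

A training score far above the cross-validated one is the usual sign of overfitting.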

---

# Training other models more carefully

## Random forest with weighted classes

In [None]:
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
get_f1_scores(rfc, 'RandomForest')

In [None]:
y.sum(), len(y) - y.sum()

In [None]:
# give the larger weight to the minority class (churn = 1), as in the LGBM cell below
rfc = RandomForestClassifier(class_weight={0:1869, 1:5174})
rfc.fit(X_train, y_train)
get_f1_scores(rfc, 'RandomForest weighted')

The improvement is not significant...
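
For comparison, scikit-learn can derive class weights from the label counts instead of a hand-written dictionary. With `class_weight="balanced"`, each class receives `n_samples / (n_classes * class_count)`, so the rarer class gets the larger weight. The sketch below rebuilds a label vector from the class counts used in this section and only computes the weights; it is not a claim about which weighting works best here.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Label vector rebuilt from the class counts used in this section.
y_toy = np.array([0] * 5174 + [1] * 1869)

weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_toy)

# n_samples / (n_classes * class_count), one value per class
print(dict(zip([0, 1], weights.round(3))))
```

Passing `class_weight="balanced"` directly to `RandomForestClassifier` applies the same rule.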

## LGBM with weighted classes

In [None]:
lgbm_w = lgbm.LGBMClassifier(n_jobs = -1, class_weight={0:1869, 1:5174})
lgbm_w.fit(X_train, y_train)
get_f1_scores(lgbm_w, 'LGBM weighted')

## XGB with ratio

In [None]:
ratio = ((len(y) - y.sum()) - y.sum()) / y.sum()
ratio
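
As a side note, the XGBoost documentation suggests `sum(negative instances) / sum(positive instances)` as a typical starting value for `scale_pos_weight`. The expression used above subtracts the positives in the numerator, so it equals that value minus one. The sketch below compares both on a label vector rebuilt from the class counts used in this section. Which value works better is an empirical question.

```python
import numpy as np

# Label vector rebuilt from the class counts used in this section.
y_toy = np.array([0] * 5174 + [1] * 1869)

n_pos = y_toy.sum()
n_neg = len(y_toy) - n_pos

ratio_notebook = (n_neg - n_pos) / n_pos  # expression used in this notebook
ratio_typical = n_neg / n_pos             # typical value suggested by the XGBoost docs

print(round(ratio_notebook, 3), round(ratio_typical, 3))
```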

In [None]:
xgb_model = xgb.XGBClassifier(objective="binary:logistic", scale_pos_weight=ratio)
xgb_model.fit(X_train, y_train)
get_f1_scores(xgb_model, 'XGB with ratio')

That's a little better.

## AdaBoost

In [None]:
abc = AdaBoostClassifier()
abc.fit(X_train, y_train)
get_f1_scores(abc, 'Adaboost')

---

# Using GridSearchCV & combining the best models

With XGB

In [None]:
print(classification_report(y_test, xgb_model.predict(X_test)))

Let's use a grid search with 5-fold cross-validation to tune the hyperparameters

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
params = {'learning_rate':[0.175, 0.167, 0.165, 0.163, 0.17], 
          'max_depth':[1, 2, 3],
          'scale_pos_weight':[1.70, 1.73, 1.76, 1.79]}
clf_grid = GridSearchCV(xgb.XGBClassifier(), param_grid=params, cv=5, scoring='f1', n_jobs=-1, verbose=1)
clf_grid.fit(X_train, y_train)

In [None]:
clf_grid.best_score_

In [None]:
clf_grid.best_params_
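
Beyond `best_score_` and `best_params_`, a fitted `GridSearchCV` exposes the full results table in `cv_results_` and the refitted winner in `best_estimator_`. To stay self-contained and quick to run, the sketch below uses synthetic data and a logistic regression instead of XGBoost; the grid over `C` is an arbitrary choice for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary problem (illustrative only).
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

grid = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="f1",
)
grid.fit(X_toy, y_toy)

# One row per parameter combination, with the mean and spread of the fold scores.
results = pd.DataFrame(grid.cv_results_)[
    ["param_C", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(results.sort_values("rank_test_score"))

# The best combination, already refitted on all the data passed to fit().
best_model = grid.best_estimator_
```

Looking at `std_test_score` helps to judge whether the differences between neighbouring grid values are larger than the fold-to-fold noise.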

With a LogisticRegression

In [None]:
# all other arguments are left at their defaults (multi_class is deprecated in recent scikit-learn)
lr = LogisticRegression(solver='liblinear')

In [None]:
lr.fit(X_train, y_train)
get_f1_scores(lr, 'Logistic Reg')

Now we can try to combine the best models

In [None]:
xgb_model = xgb.XGBClassifier(objective="binary:logistic", learning_rate=0.167, max_depth=2, scale_pos_weight=1.73)
xgb_model.fit(X_train, y_train)
get_f1_scores(xgb_model, 'XGB with ratio')

In [None]:
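y_pred_lr = lr.predict_proba(X_test)
# class probabilities from the tuned XGB model, used in the average below
y_pred_xgb = xgb_model.predict_proba(X_test)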

In [None]:
lgbm_w = lgbm.LGBMClassifier(n_jobs = -1, class_weight={0:1869, 1:5174})
lgbm_w.fit(X_train, y_train)
y_pred_lgbm = lgbm_w.predict_proba(X_test)

In [None]:
# y_pred with predict_proba returns 2 cols, one for each class
y_pred_xgb[:5, 1]

In [None]:
y_pred_lgbm[:5, 1]

In [None]:
test = np.vstack((y_pred_lgbm[:5, 1], y_pred_xgb[:5, 1]))
test

In [None]:
np.mean(test, axis=0)

In [None]:
y_pred_mean = np.mean(np.vstack((y_pred_lgbm[:, 1], y_pred_xgb[:, 1])), axis=0)
y_pred_mean[:5]

In [None]:
# threshold the averaged probabilities at 0.5 (a value of exactly 0.5 counts as churn)
y_pred_mean = (y_pred_mean >= 0.5).astype(int)
y_pred_mean[:5]

In [None]:
print(f'F1 score of models combined on the test dataset = {f1_score(y_test, y_pred_mean)  * 100:.2f}%')
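
Averaging predicted probabilities by hand, as done above, is what scikit-learn calls soft voting. The sketch below shows the equivalent with `VotingClassifier(voting="soft")`. To stay self-contained it uses synthetic data and two scikit-learn models (a logistic regression and a random forest) in place of the LGBM and XGB models of this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem (illustrative only).
X_toy, y_toy = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

voter = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(solver="liblinear")),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    voting="soft",  # average the predicted class probabilities
)
voter.fit(X_tr, y_tr)

# The ensemble probabilities are the unweighted mean of the members' probabilities.
manual_mean = np.mean([est.predict_proba(X_te) for est in voter.estimators_], axis=0)
print(np.allclose(manual_mean, voter.predict_proba(X_te)))

print(f"F1 score of the soft-voting ensemble = {f1_score(y_te, voter.predict(X_te)) * 100:.2f}%")
```

The `weights` argument of `VotingClassifier` can give more influence to the stronger model if a plain average is not good enough.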