Read Dataset

In [None]:
import pandas as pd

df = pd.read_csv('../input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv')
# Keep an untouched copy for the plots; df itself is label-encoded in place further down.
visu_df = df.copy()

Display important information about the dataset

In [None]:
df.head(5)

In [None]:
df.info()

In [None]:
df.describe().transpose()

Quick check of the distribution of our classes.

In [None]:
df['Churn'].value_counts().plot(kind='barh')

The plot shows a clear imbalance between the Churn and No Churn classes.
Because of this, we need to apply an under- or oversampling technique before building good multivariate models.
Later in this example I apply a simple undersampling method so that both classes have the same number of samples.
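
To put a number on the imbalance rather than reading it off the bar chart, the class shares and the majority/minority ratio can be computed directly. This is a minimal sketch on made-up labels (the toy `churn` series stands in for `df['Churn']`; its proportions are an assumption, not the real ones).

```python
import pandas as pd

# Toy labels standing in for df['Churn']; the proportions are made up.
churn = pd.Series(['No'] * 7 + ['Yes'] * 3, name='Churn')

counts = churn.value_counts()                # absolute number of rows per class
shares = churn.value_counts(normalize=True)  # fraction of rows per class
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(shares.round(2))
print(f"majority/minority ratio: {imbalance_ratio:.2f}")
```

On the real data the same three lines would run on `df['Churn']` before it is encoded.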

In [None]:
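def encode_to_numerical_data(raw_data):
    # Replace every text (object) column with integer codes.
    # Note: this modifies raw_data in place and returns the same object.
    for col in raw_data.columns:
        if raw_data[col].dtype == 'object':
            raw_data[col] = factorization(raw_data, col)
    return raw_data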

In [None]:
def factorization(raw_data, col):
    # pd.factorize returns (codes, uniques); keep only the integer codes.
    return pd.factorize(raw_data[col])[0]

In [None]:
encoded_data = encode_to_numerical_data(df)
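
To see what the encoding step does to a single column, here is a small sketch of `pd.factorize` on a hypothetical toy column (the `contract` series is made up for illustration). Codes are assigned in order of first appearance, so they carry no meaningful ranking.

```python
import pandas as pd

# Hypothetical toy column standing in for a categorical feature.
contract = pd.Series(['Month-to-month', 'One year', 'Month-to-month', 'Two year'])

codes, uniques = pd.factorize(contract)
print(codes)          # integer code per row, in order of first appearance
print(list(uniques))  # the category each code maps back to
```

Note that `encode_to_numerical_data` writes the codes back into the frame it receives, so after the call `df` and `encoded_data` are the same encoded object; the plots rely on `visu_df`, the copy taken earlier.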

In [None]:

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(30, 20))
# Correlate the encoded features only; the target and the ID column are excluded.
x_matrix = encoded_data.drop(['Churn', 'customerID'], axis=1)
corr_map = x_matrix.corr()
sns.heatmap(corr_map, vmax=.8, square=True, annot=True, fmt='.2f', cmap="summer")

According to this heatmap, there is a significant correlation among Internet Service, Online Security, Online Backup, Device Protection, Tech Support, Streaming TV and Streaming Movies, as well as between Contract and Tenure.
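
Reading the strongest pairs off a large heatmap is error-prone, so one option is to list them from the correlation matrix itself. The sketch below uses a hypothetical toy frame (`toy`, with made-up columns `a`, `b`, `c`); in the notebook the same steps would apply to `corr_map`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy features; in the notebook this would be x_matrix.
toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 11],
    'c': [5, 3, 4, 1, 2],
})

corr = toy.corr()
# Keep the strict upper triangle so each pair appears once and the diagonal is dropped.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()
# Rank the pairs by absolute correlation, strongest first.
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

Keep in mind that correlations between factorized categorical columns depend on the arbitrary integer codes, so they are only a rough guide.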

In [None]:
fig, axes = plt.subplots(1, 2, sharex=True, figsize=(20, 10))
fig.suptitle('Summary')

sns.countplot(ax=axes[0], x="Churn", hue="InternetService", data=visu_df)
sns.countplot(ax=axes[1], x="Churn", hue="PhoneService", data=visu_df)

**Image on the right:**
The majority of customers (churned or not) have the Phone Service; only a small minority do not.

**Image on the left:**
In the No Churn category: DSL is the most common product, only slightly ahead of fiber optic.
In the Churn category: churn is concentrated among fiber optic customers, which suggests the company should pay closer attention to this product and treat it as a warning sign, because it is a major churn factor.
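
The count plots show absolute numbers, which mix group size with churn behaviour. A row-normalised crosstab gives the churn rate per category instead. This is a sketch on hypothetical toy rows (the `toy` frame and its counts are made up; the real input would be `visu_df`).

```python
import pandas as pd

# Hypothetical toy rows standing in for visu_df; the counts are made up.
toy = pd.DataFrame({
    'InternetService': ['DSL', 'DSL', 'DSL',
                        'Fiber optic', 'Fiber optic', 'Fiber optic', 'Fiber optic',
                        'No'],
    'Churn': ['No', 'No', 'Yes',
              'Yes', 'Yes', 'No', 'No',
              'No'],
})

# Row-normalised crosstab: each row sums to 1, so the 'Yes' column is the churn rate.
rates = pd.crosstab(toy['InternetService'], toy['Churn'], normalize='index')
print(rates.round(2))
```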


In [None]:
fig, axes = plt.subplots(4, 2, sharex=True, figsize=(20, 10))
fig.suptitle('Summary')
sns.barplot(ax=axes[0, 0], x="tenure", y="Contract", hue="gender", data=visu_df, orient="h")
sns.barplot(ax=axes[0, 1], x="tenure", y="Contract", hue="PaymentMethod", data=visu_df, orient="h")
sns.barplot(ax=axes[1, 0], x="tenure", y="StreamingMovies", hue="gender", data=visu_df, orient="h")
sns.barplot(ax=axes[1, 1], x="tenure", y="StreamingMovies", hue="Partner", data=visu_df, orient="h")
sns.barplot(ax=axes[2, 0], x="MonthlyCharges", y="InternetService", hue="StreamingTV", data=visu_df, orient="h")
sns.barplot(ax=axes[2, 1], x="tenure", y="OnlineSecurity", hue="DeviceProtection", data=visu_df, orient="h")
sns.barplot(ax=axes[3, 0], x="tenure", y="OnlineSecurity", hue="InternetService", data=visu_df, orient="h")
sns.barplot(ax=axes[3, 1], x="tenure", y="Contract", hue="PaperlessBilling", data=visu_df, orient="h")

From left to right, top to bottom:
1. No significant information can be drawn from the Contract, Gender and Tenure features; males and females behave the same way.
2. Payment methods: the favorite means of payment are electronic check, bank transfer and credit card; mailed check is the least used across all contract types.
3. No significant information can be drawn from the Streaming Movies, Gender and Tenure features; males and females behave the same way.
4. Streaming Movies: most of the customers who use this service have a partner.
5. Fiber optic is expensive. (I guess this is why customers are leaving this product.)
6. Some people have device protection without online security (odd; the company should tell them that it is not necessary and offer them a more useful service instead, in order to gain customers' trust :))
7. Internet Service customers with long tenure tend to subscribe to Online Security.
8. Long tenure is associated with paperless billing (the company should prioritize this billing option).
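
By default `sns.barplot` draws the mean of the numeric variable within each group, so the bar lengths behind these observations can be reproduced as a plain table. A minimal sketch for the last plot, on hypothetical toy rows (the `toy` frame and its numbers are made up):

```python
import pandas as pd

# Hypothetical toy rows standing in for visu_df; the numbers are made up.
toy = pd.DataFrame({
    'Contract': ['Month-to-month', 'Month-to-month', 'Two year', 'Two year', 'Two year'],
    'PaperlessBilling': ['Yes', 'No', 'Yes', 'Yes', 'No'],
    'tenure': [2, 4, 60, 70, 50],
})

# Mean tenure per (Contract, PaperlessBilling) group: one value per bar.
bar_lengths = toy.groupby(['Contract', 'PaperlessBilling'])['tenure'].mean().unstack()
print(bar_lengths)
```

Having the numbers next to the plot makes it easier to check whether a difference between two bars is large enough to act on.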



Data partition


In [None]:
from sklearn.preprocessing import StandardScaler
import numpy as np
from sklearn.model_selection import train_test_split


x_matrix = encoded_data.drop(['Churn', 'customerID'], axis=1).values
y_labels = encoded_data['Churn'].values

# Data standardization
x_matrix = StandardScaler().fit_transform(x_matrix)

# Undersampling (SMOTE was tested but did not help)
churn_args = np.flatnonzero(y_labels == 1)
notChurn_args = np.flatnonzero(y_labels == 0)

# Keep all churn rows plus the same number of non-churn rows (the first ones in file order),
# so both classes end up the same size and no row is duplicated.
kept_notChurn = notChurn_args[:len(churn_args)]
x_reduced = np.vstack((x_matrix[kept_notChurn], x_matrix[churn_args]))
y_reduced = np.concatenate((y_labels[kept_notChurn], y_labels[churn_args]))

X_train, X_test, y_train, y_test = train_test_split(x_reduced, y_reduced)

print(np.shape(X_train))
print(np.shape(y_train))

**As stated before, the dataset is imbalanced, so under- or oversampling should be used to deal with it.
SMOTE and several other techniques were tried, but the simple undersampling used here performed well.
Since we have enough data, simple undersampling is enough :)**
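
For reference, here is a sketch of a random variant of the same idea: draw the kept majority rows at random (with a fixed seed) instead of taking them in file order. The helper name `undersample` and its `seed` parameter are hypothetical, and this is not the exact procedure used in this notebook.

```python
import numpy as np

def undersample(X, y, seed=0):
    """Randomly keep the same number of rows from every class (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()  # size of the smallest class
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    rng.shuffle(keep)  # avoid leaving the rows grouped by class
    return X[keep], y[keep]

# Toy data: 7 rows of class 0 and 3 rows of class 1.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0] * 7 + [1] * 3)

X_bal, y_bal = undersample(X_toy, y_toy)
print(X_bal.shape, np.bincount(y_bal))
```

Random selection avoids any bias that might come from the order of rows in the file, at the cost of results that depend on the seed.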

GridSearchCV for best model selection

In [None]:
import joblib
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from xgboost import XGBClassifier

In [None]:
models = ['ADB', 'GBC', 'RF', 'XGB', 'SVC']

In [None]:
clfs = [
    AdaBoostClassifier(),
    GradientBoostingClassifier(),
    RandomForestClassifier(n_jobs=-1),
    XGBClassifier(),
    SVC(probability=True)
]

In [None]:
params = {
    models[0]: {'learning_rate': [1, 0.01], 'n_estimators': [500, 1000]},
    models[1]: {'learning_rate': [0.01], 'n_estimators': [500, 1000], 'max_depth': [3],
                'min_samples_split': [2], 'min_samples_leaf': [2]},
    models[2]: {'n_estimators': [500, 1000, 1500], 'criterion': ['gini'], 'min_samples_split': [2],
                'min_samples_leaf': [4]},
    models[3]: {},  # empty grid: XGBoost is only evaluated with its default hyperparameters
    models[4]: {'C': [0.01, 1, 10, 100], 'gamma': [1, 0.1], 'kernel': ['rbf', 'linear']},
}
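
Before launching the search it helps to know how many fits each grid implies. The sketch below counts them with scikit-learn's `ParameterGrid`, copying three of the grids above into a hypothetical `grids` dict and assuming the 10-fold setting used in the search cell; the final refit on the full training set is not counted.

```python
from sklearn.model_selection import ParameterGrid

# Three of the grids from the notebook, copied here so the example is self-contained.
grids = {
    'ADB': {'learning_rate': [1, 0.01], 'n_estimators': [500, 1000]},
    'XGB': {},
    'SVC': {'C': [0.01, 1, 10, 100], 'gamma': [1, 0.1], 'kernel': ['rbf', 'linear']},
}
cv = 10  # number of folds, as in the GridSearchCV call

# One fit per parameter combination and per fold; an empty grid counts as one combination.
fits = {name: len(ParameterGrid(grid)) * cv for name, grid in grids.items()}
print(fits)
```

With 500 to 1500 estimators per candidate, these counts are a quick way to estimate run time before committing to a grid.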

In [None]:
for name, estimator in zip(models, clfs):
    print("Performing : " + name)
    clf = GridSearchCV(estimator, params[name], n_jobs=-1, cv=10)

    clf.fit(X_train, y_train)

    print("best params: " + str(clf.best_params_))
    print("best scores: " + str(clf.best_score_))
    y_pred = clf.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print("Accuracy: {:.4%}".format(acc))

    # save the model to disk
    if acc >= .8 and clf.best_score_ >= .8:
        joblib.dump(clf, './' + name + str(clf.best_score_))
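
A saved search object can be reloaded later with `joblib.load` and used for prediction directly. The round trip below is a minimal sketch on synthetic data (the data set, the small SVC grid and the file name `SVC_demo.joblib` are all made up for illustration).

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the churn data.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

search = GridSearchCV(SVC(), {'C': [0.1, 1]}, cv=3)
search.fit(X_tr, y_tr)

# Dump the fitted search to a temporary file and load it back.
path = os.path.join(tempfile.mkdtemp(), 'SVC_demo.joblib')
joblib.dump(search, path)
restored = joblib.load(path)

# The restored object predicts exactly like the original one.
same = np.array_equal(restored.predict(X_te), search.predict(X_te))
print(same, restored.best_params_)
```

The file should be loaded with the same scikit-learn version that wrote it, since pickled estimators are not guaranteed to be compatible across versions.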