# Telco customer churn - binary classification problem
A classic machine learning problem: will a customer churn or not? Let's do some analysis, preprocessing and feature engineering, then apply an XGBoost classifier and a TensorFlow neural network to our data to predict churn.

![](https://osclasspoint.com/images/customer-churn.png)

# Load libraries
Nothing extraordinary will be used: numpy, pandas, sklearn, matplotlib, seaborn, xgboost and tensorflow.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, RobustScaler, LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.utils import resample

%matplotlib inline
pd.options.display.max_columns = 500

import warnings
warnings.filterwarnings('ignore')

# Load data
Load the data using pandas. We have just one dataset here, which makes things easier.

In [None]:
df = pd.read_csv('/kaggle/input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv')

## Explore our data
First check the top rows, then the column formats and missing values.

In [None]:
df.head()

In [None]:
df.info()

It seems we have no null values, which is great and saves us some time. However, TotalCharges has the wrong type (object). Fix this by converting the field to float (float64) and filling the missing values generated during the conversion with 0.

In [None]:
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
df['TotalCharges'] = df['TotalCharges'].fillna(value=0)

df['tenure'] = df['tenure'].astype('float64')
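
To see what the conversion does, here is a minimal sketch on a toy series. The values, including the blank entry, are made up for illustration and are only an assumption about what the unparseable rows look like.

```python
import pandas as pd

# Toy stand-in for the TotalCharges column: numbers stored as text,
# with one blank entry (hypothetical values, not taken from the dataset)
raw = pd.Series(['29.85', '1889.5', ' ', '108.15'])

# errors='coerce' turns anything that cannot be parsed into NaN
parsed = pd.to_numeric(raw, errors='coerce')
n_bad = int(parsed.isna().sum())

# Replace the NaN values with 0, as in the cell above
cleaned = parsed.fillna(0)
print(n_bad, cleaned.tolist())
```

Counting the NaN values before filling them is a cheap way to check how many rows were affected by the conversion.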

Drop customerID, as it is not a relevant field for the analysis.

In [None]:
df.drop('customerID', axis=1, inplace=True)

Split our fields into categorical and numerical so we can do EDA & preprocessing faster. Churn, our target variable, will not be included in the categorical fields.

In [None]:
col_cat = df.select_dtypes(include='object').drop('Churn', axis=1).columns.tolist()
col_num = df.select_dtypes(exclude='object').columns.tolist()

# Exploratory data analysis

For our categorical fields, check how many unique values each column has, so we can decide whether feature engineering (merging values in case there are too many of them) is needed.
You will see we have 2-4 unique values, which is ideal.

In [None]:
for c in col_cat:
    print('Column {} unique values: {}'.format(c, len(df[c].unique())))
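
The same check can be done without a loop. This is a small sketch on a toy frame; the column values are hypothetical and the column list is written out by hand here.

```python
import pandas as pd

# Toy frame with two categorical columns (hypothetical values)
toy = pd.DataFrame({
    'gender': ['Male', 'Female', 'Male'],
    'Contract': ['Month-to-month', 'One year', 'Two year'],
})

# nunique() counts the distinct values in each column in one call
counts = toy[['gender', 'Contract']].nunique()
print(counts)
```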

Take a look at the distribution of Churn across all categorical variables. This is a really nice view: you can see, for example, that gender shows no visible relation to Churn, while Contract is strongly related to it. Customers on month-to-month contracts are much more likely to churn than customers on one-year and two-year contracts. That's an interesting fact, and it suggests the company could reduce churn by making one- and two-year contracts more attractive!

In [None]:
plt.figure(figsize=(20,20))
for i,c in enumerate(col_cat):
    plt.subplot(5,4,i+1)
    # pass the column as x= explicitly; recent seaborn versions reject a positional Series here
    sns.countplot(x=df[c], hue=df['Churn'])
    plt.title(c)
    plt.xlabel('')
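
To put a number on what the count plots show, one option is a row-normalised crosstab. This is only a sketch on toy data (the rows below are made up); on the real frame you would pass `df['Contract']` and `df['Churn']` instead.

```python
import pandas as pd

# Toy data (hypothetical rows, not taken from the dataset)
toy = pd.DataFrame({
    'Contract': ['Month-to-month', 'Month-to-month', 'One year',
                 'Two year', 'Month-to-month', 'Two year'],
    'Churn': ['Yes', 'No', 'No', 'No', 'Yes', 'No'],
})

# normalize='index' gives the share of Yes/No within each contract type
churn_share = pd.crosstab(toy['Contract'], toy['Churn'], normalize='index')
print(churn_share)
```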

Check out the distribution of our numerical features. We again want to find interesting relations in the data.
It seems tenure is correlated with Churn.

In [None]:
plt.figure(figsize=(20,5))
for i,c in enumerate(['tenure', 'MonthlyCharges', 'TotalCharges']):
    plt.subplot(1,3,i+1)
    # kdeplot replaces the deprecated distplot(..., hist=False)
    sns.kdeplot(df.loc[df['Churn'] == 'No', c], color='blue', linewidth=2, label='No')
    sns.kdeplot(df.loc[df['Churn'] == 'Yes', c], color='orange', linewidth=2, label='Yes')
    plt.title(c)
    plt.legend()

Let's also try a violin plot.

In [None]:
plt.figure(figsize=(20,5))
for i,c in enumerate(col_num):
    plt.subplot(1,4,i+1)
    sns.violinplot(x=df['Churn'], y=df[c])
    plt.title(c)

# Data preprocessing
We've completed our quick and simple EDA. It's time to cook our data to the taste of machine learning algorithms, which like numerical data, not text data :)

First, one-hot encode our categorical features.

In [None]:
df.head()

In [None]:
dfT = pd.get_dummies(df, columns=col_cat)
dfT.head()

Now do a simple label encoding of our target variable Churn.

In [None]:
dfT['Churn'] = dfT['Churn'].map(lambda x: 1 if x == 'Yes' else 0)
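
Here is a small sketch of what the two encoding steps produce, on a toy frame with hypothetical values. Note that it maps with a dict instead of a lambda, which is a stylistic choice and not what the cell above does.

```python
import pandas as pd

# Toy frame: one categorical feature, one numeric feature, and the target
toy = pd.DataFrame({
    'Contract': ['Month-to-month', 'One year', 'Two year'],
    'tenure': [1.0, 12.0, 24.0],
    'Churn': ['Yes', 'No', 'No'],
})

# One-hot encoding: each Contract value becomes its own indicator column
encoded = pd.get_dummies(toy, columns=['Contract'])

# Label encoding of the target: Yes -> 1, No -> 0
encoded['Churn'] = encoded['Churn'].map({'Yes': 1, 'No': 0})
print(encoded.columns.tolist())
```

With a dict, any value other than Yes/No becomes NaN instead of silently turning into 0, which makes unexpected labels easier to spot.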

## Balanced or imbalanced?
Check whether our dataset is balanced or imbalanced and whether any action is needed. You will find that the data is imbalanced, so we will use the resample function to upsample the minority group.

In [None]:
plt.figure(figsize=(5, 5))
sns.countplot(x=dfT['Churn'])
plt.title('Imbalanced dataset, it seems ratio is 2:5 for Yes:No')
plt.show()

Divide our data into 2 groups, majority (0) and minority (1), and create a new dataset by upsampling the minority group.

In [None]:
minority = dfT[dfT.Churn==1]
majority = dfT[dfT.Churn==0]

minority_upsample = resample(minority, replace=True, n_samples=majority.shape[0])
dfT = pd.concat([minority_upsample, majority], axis=0)
dfT = dfT.sample(frac=1).reset_index(drop=True)

Do a quick check of how it looked before and after balancing.

In [None]:
plt.figure(figsize=(10, 5))
plt.subplot(1,2,1)
sns.countplot(x=df['Churn'])
plt.title('Imbalanced dataset')

plt.subplot(1,2,2)
sns.countplot(x=dfT['Churn'])
plt.title('Balanced dataset')
plt.show()

## Time to scale!
Many ML algorithms are sensitive to features that are not on the same scale, neural networks in particular. You can check that the deep net (at the end of this kernel) gets much lower accuracy on unscaled data... accuracy can drop by as much as 10%! I will use the robust scaler, which handles outliers nicely, but the standard scaler might work well too.

In [None]:
rs = RobustScaler()
num_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']
# RobustScaler centres and scales each column independently, so one call is enough
dfT[num_cols] = rs.fit_transform(dfT[num_cols])
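
With its default settings, RobustScaler subtracts the median and divides by the interquartile range, which is why a single extreme value does not squash the rest of the column. Here is a minimal check on a toy array (the numbers are made up):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy column with one obvious outlier (hypothetical values)
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0]).reshape(-1, 1)

scaled = RobustScaler().fit_transform(x)

# The same thing by hand: (x - median) / (Q3 - Q1)
q1, med, q3 = np.percentile(x, [25, 50, 75])
manual = (x - med) / (q3 - q1)
print(scaled.ravel())
```

The four ordinary values stay within [-1, 0.5], and only the outlier ends up far from zero.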

## Data Split
Split our data into train & test partitions. The train partition will be used to train the ML model, and the test partition will be used to validate its performance. 80% goes to train, 20% goes to test. It could also be 70:30 or 60:40.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(dfT.drop('Churn', axis=1).values, dfT['Churn'].values, test_size=0.2)
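
One caveat about the order of steps: the minority class was upsampled before the split, so copies of the same customer can land in both partitions, which tends to make test scores look better than they would on truly unseen customers. Here is a minimal sketch of an alternative order, splitting first and upsampling only the training partition. The toy frame, `stratify` and `random_state` are assumptions for illustration, not part of this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy imbalanced data: 20 churners (1) and 80 non-churners (0); hypothetical values
rng = np.random.default_rng(0)
toy = pd.DataFrame({'x': rng.normal(size=100), 'Churn': [1] * 20 + [0] * 80})

# 1) Split first, keeping the class ratio in both partitions
train, test = train_test_split(toy, test_size=0.2, stratify=toy['Churn'], random_state=0)

# 2) Upsample the minority class in the training partition only
minority = train[train['Churn'] == 1]
majority = train[train['Churn'] == 0]
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=0)
train_bal = pd.concat([minority_up, majority]).sample(frac=1, random_state=0)

print(train_bal['Churn'].value_counts().to_dict())
print(test['Churn'].value_counts().to_dict())
```

The test partition keeps the original class ratio, so the reported metrics reflect the imbalance the model would face in practice.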

# Modeling
Our first try will be XGBoost. We could also try Random Forest or LightGBM, but I don't expect them to do noticeably better than XGBoost here, so the second choice will be a deep neural network consisting of multiple layers.

## XGBoost
Let's start with the popular XGB Classifier and check its performance.

In [None]:
xg = XGBClassifier()
xg.fit(X_train, y_train)
y_test_hat_xg = xg.predict(X_test)

In [None]:
print(classification_report(y_test, y_test_hat_xg))
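
The precision and recall in the report come straight from the confusion matrix (already imported at the top, but not used so far). Here is a small sketch on made-up labels showing how they relate; on the real data you would pass `y_test` and `y_test_hat_xg`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels and predictions, for illustration only
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# For binary 0/1 labels, ravel() yields tn, fp, fn, tp in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # share of predicted churners that really churned
recall = tp / (tp + fn)     # share of real churners that were caught
print(tn, fp, fn, tp, precision, recall)
```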

Not bad! We got really good precision as well as recall and F1 score! Yes, you are right, I should try some hyperparameter tuning, but for now let's keep this notebook simple. You may find hyperparameter optimization in my other kernels ;)

## Deep neural networks
Yes, this should be fun. A simple net? No, we will use something more complex... Let's do it!

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

We use a sequential model with multiple dense & dropout layers.

In [None]:
model = Sequential()

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.1))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.1))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.1))

model.add(Dense(512, activation='relu'))
model.add(Dropout(0.2))

model.add(Dense(256, activation='relu'))
model.add(Dropout(0.25))

model.add(Dense(128, activation='relu'))
model.add(Dropout(0.45))

model.add(Dense(1, activation='sigmoid'))

reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.3, verbose=1,patience=10, min_lr=0.0000000001)
early_stopping_cb = EarlyStopping(patience=10, restore_best_weights=True)

model.compile(optimizer='Adam', loss='binary_crossentropy', metrics=['accuracy'])

history = model.fit(x=X_train, y=y_train, batch_size=128, epochs=100, validation_data=(X_test, y_test), callbacks=[early_stopping_cb, reduce_lr])


The model is trained. You might see it overfitting during training, so increasing dropout should help... yes, I've tried it, and once I solved the overfitting problem, accuracy on the test data decreased :) OK, let's predict our test data and compare it to the actuals.

In [None]:
y_test_hat_tf = model.predict(X_test)

The output of the prediction is probabilities, so let's convert them into 0/1.

In [None]:
# predict() returns shape (n, 1); threshold at 0.5 and flatten to a 1-D array of 0/1
y_test_hat_tf2 = (y_test_hat_tf > 0.5).astype(int).ravel()

And finally, check out the classification report!

In [None]:
print(classification_report(y_test, y_test_hat_tf2))

What do you think? It seems XGBoost is slightly better, but the net almost caught up ;)

That's it, feel free to post your comments ;)

## Thanks for checking my notebook! If you liked it, make sure to vote for it!