# XGBoost for churn prediction with the Telco dataset

### This code was taken from [this webinar](https://www.youtube.com/watch?v=GrJP9FLV3FE)

This notebook contains:

* Data pre-processing
* Handling missing values and column names (for drawing the tree)
* Building an XGBoost model
* GridSearchCV for parameter tuning

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import xgboost as xgb

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import balanced_accuracy_score, roc_auc_score, make_scorer, confusion_matrix, plot_confusion_matrix

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


# Data preprocessing

In [None]:
# Loading the data
raw_data = pd.read_csv('/kaggle/input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv')
raw_data.head()

This dataset is a bit different from the one used in the webinar. The original dataset had 33 columns, some of which had to be deleted. This one seems to be partly cleaned already, and some of the columns used for the model in the webinar are missing. We can still drop the Customer ID column.

For the sake of the example, we will also check the values in each column to make sure everything is okay.

In [None]:
# Dropping the Customer ID column
raw_data.drop('customerID', axis=1, inplace=True)

In [None]:
# Unique values of each column
for column in raw_data.columns:
    print(column)
    print(raw_data[column].unique())
    print()

In [None]:
# Data types
raw_data.dtypes

## About the dtypes:

Most of the dtypes are correct, but TotalCharges should be float64 rather than object. Let's try to cast it.

In [None]:
# Converting the TotalCharges column
# (left commented out because it raises a ValueError on this data)
#raw_data['TotalCharges'] = pd.to_numeric(raw_data['TotalCharges'])

There are some blank values in the TotalCharges column, preventing us from converting the column to numeric. Let's investigate why.

In [None]:
# Locating the rows with blank values for TotalCharges
raw_data.loc[raw_data['TotalCharges'] == ' ']
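If the offending value were not known to be a single space, one way to find the entries that block the conversion is `pd.to_numeric` with `errors='coerce'`. The sketch below uses a small made-up series (`charges` is a hypothetical stand-in for the TotalCharges column, not the real data):

```python
import pandas as pd

# Toy stand-in for the TotalCharges column (values are made up for illustration)
charges = pd.Series(['29.85', ' ', '108.15', ' '])

# errors='coerce' turns anything that cannot be parsed into NaN instead of raising
converted = pd.to_numeric(charges, errors='coerce')

# The entries that became NaN are the ones that block the plain conversion
problem_rows = charges[converted.isna()]
print(problem_rows)
```

On the real data, the same filter would be applied to `raw_data['TotalCharges']`.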

The reason for those blank values is that those clients have just signed up for the services, as we can see in the tenure column (0). So we can set these blank values to 0, because the clients haven't been charged yet.

In [None]:
# Assigning 0 to the blank values 
raw_data.loc[(raw_data['TotalCharges'] == ' '), 'TotalCharges'] = 0

In [None]:
# Now we can convert the column to a numeric dtype
raw_data['TotalCharges'] = pd.to_numeric(raw_data['TotalCharges'])
raw_data.dtypes

## Mapping

One last thing: just for the sake of the example, let's map the values of the columns so that they match the ones used in the webinar.

In [None]:
raw_data['SeniorCitizen'] = raw_data['SeniorCitizen'].map({0: 'No', 1: 'Yes'})
raw_data['Churn'] = raw_data['Churn'].map({'No': 0, 'Yes': 1})
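One thing to keep in mind with `Series.map` and a dictionary is that any value missing from the dictionary silently becomes NaN. A quick check along these lines can catch that; the toy series below is made up, with a deliberately misspelled label:

```python
import pandas as pd

# Toy labels; the last one is intentionally not a key of the mapping
churn = pd.Series(['No', 'Yes', 'No', 'yes'])

mapped = churn.map({'No': 0, 'Yes': 1})

# Values that were not covered by the mapping end up as NaN
unmapped = churn[mapped.isna()]
print(unmapped.tolist())
```

An empty result for the real columns would confirm that every value was mapped.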

# Model building

In this section we will:

* Separate the inputs and targets (independent and dependent variables)
* Split training and testing sets
* Build our XGBoost model
* Use GridSearchCV for parameter tuning

In [None]:
# Splitting inputs and targets
X = raw_data.drop('Churn', axis=1).copy()
X.head()

In [None]:
y = raw_data['Churn'].copy()
y.head()

In [None]:
# Checking the unique values of y
y.unique()

## One-hot encoding

We will use one-hot encoding for the categorical columns.

In [None]:
# One-hot encoding
X_encoded = pd.get_dummies(X, columns=['gender',
                                       'SeniorCitizen',
                                       'Partner',
                                       'Dependents', 
                                       'PhoneService',
                                       'MultipleLines',
                                       'InternetService',
                                       'OnlineSecurity',
                                       'OnlineBackup',
                                       'DeviceProtection',
                                       'TechSupport',
                                       'StreamingTV',
                                       'StreamingMovies',
                                       'Contract',
                                       'PaperlessBilling',
                                       'PaymentMethod'
                                      ])
X_encoded.head()

#gender               object
#SeniorCitizen         int64
#Partner              object
#Dependents           object
#tenure                int64
#PhoneService         object
#MultipleLines        object
#InternetService      object
#OnlineSecurity       object
#OnlineBackup         object
#DeviceProtection     object
#TechSupport          object
#StreamingTV          object
#StreamingMovies      object
#Contract             object
#PaperlessBilling     object
#PaymentMethod        object
#MonthlyCharges      float64
#TotalCharges        float64
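To make the effect of `pd.get_dummies` concrete, here is a minimal sketch on a made-up frame with one numeric and one categorical column (the values are illustrative only). Each category becomes its own indicator column, and columns not listed in `columns` are left untouched:

```python
import pandas as pd

# Toy frame: one numeric column and one categorical column
toy = pd.DataFrame({
    'tenure': [1, 34, 2],
    'Contract': ['Month-to-month', 'One year', 'Month-to-month'],
})

# Only the listed column is expanded into indicator columns
encoded = pd.get_dummies(toy, columns=['Contract'])
print(encoded.columns.tolist())
```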

## XGBoost model

First of all, let's observe that this data is imbalanced by dividing the number of people who left the company by the total number of people in the dataset.

In [None]:
sum(y)/len(y)

Only about 27% of the customers left the company. Because of this imbalance, we will split the data into training and testing sets using stratification, so that both sets keep the same percentage of customers who left.

In [None]:
# Splitting the data into training and testing using stratification
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, random_state=42, stratify=y)

In [None]:
# Now let's verify that using stratify worked as expected
print(sum(y_train)/len(y_train))
print(sum(y_test)/len(y_test))
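The same check can be tried on synthetic labels to see what `stratify` does in isolation. This is a small sketch with made-up data (100 positives and 300 negatives), not the Telco set:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 25% positives
y_toy = np.array([1] * 100 + [0] * 300)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# Same call pattern as above: default test size, fixed seed, stratified on the labels
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=42, stratify=y_toy)

# Both shares should be close to the overall positive rate
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, the two shares can drift apart, which matters more the rarer the positive class is.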

Now let's build the preliminary model. Instead of determining the optimal number of trees with cross-validation, we will use early stopping: XGBoost stops adding trees once the evaluation metric on the evaluation set has not improved for a given number of rounds (10 here).

In [None]:
# XGBoost
clf_xgb = xgb.XGBClassifier(objective='binary:logistic', seed=42)
clf_xgb.fit(X_train,
            y_train,
            verbose=True,
            early_stopping_rounds=10,
            eval_metric='aucpr',
            eval_set=[(X_test, y_test)])
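The early-stopping rule can be illustrated without XGBoost. The function below is a simplified sketch of the idea, not XGBoost's internal implementation; `best_round_with_early_stopping`, `patience`, and the score list are hypothetical names and values chosen for the example:

```python
def best_round_with_early_stopping(scores, patience):
    """Scan per-round validation scores (higher is better) and stop once
    the best score has not been beaten for `patience` consecutive rounds.

    Returns (best_round, best_score, rounds_run).
    """
    best_score = float('-inf')
    best_round = -1
    for round_idx, score in enumerate(scores):
        if score > best_score:
            # New best: remember it and keep going
            best_score = score
            best_round = round_idx
        elif round_idx - best_round >= patience:
            # No improvement for `patience` rounds: stop here
            return best_round, best_score, round_idx + 1
    return best_round, best_score, len(scores)


# Made-up validation scores, one per boosting round
scores = [0.60, 0.63, 0.65, 0.64, 0.65, 0.64, 0.63, 0.62]
best_round, best_score, rounds_run = best_round_with_early_stopping(scores, patience=3)
print(best_round, best_score, rounds_run)
```

In the cell above, `early_stopping_rounds=10` plays the role of `patience`, and `aucpr` on `eval_set` provides the scores.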

In [None]:
plot_confusion_matrix(clf_xgb, 
                      X_test, 
                      y_test, 
                      values_format='d', 
                      display_labels=["Did not leave", "Left"])
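To read the plot, it helps to know how scikit-learn lays out a binary confusion matrix: rows are true labels, columns are predicted labels, so the flattened order is TN, FP, FN, TP. A minimal sketch with made-up labels (1 stands for "Left"):

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels (0 = did not leave, 1 = left)
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 0, 0]

cm = confusion_matrix(y_true, y_pred)

# For binary labels, ravel() yields the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()

# Share of customers who actually left that the model flagged
recall = tp / (tp + fn)
print(cm)
print(recall)
```

Note that `plot_confusion_matrix` was removed in scikit-learn 1.2. On newer versions, `ConfusionMatrixDisplay.from_estimator` takes the same estimator, data, and `display_labels` arguments and should serve as a replacement.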

# Parameter tuning with GridSearchCV

Let's try to improve the churn prediction using GridSearchCV.

In [None]:
# GridSearchCV
# param_grid = {
#     'max_depth': [3, 4, 5],
#     'learning_rate': [0.01, 0.05, 0.1, 0.3, 0.5],
#     'gamma': [0, 0.25, 1.0],
#     'reg_lambda': [0, 1.0, 10.0],
#     'scale_pos_weight': [1, 3, 5]
# }

# param_grid = {
#     'max_depth': [1, 2, 3],
#     'learning_rate': [0.07, 0.075, 0.08],
#     'gamma': [0.9, 1.0, 1.1],
#     'reg_lambda': [9.0, 10.0, 11.0],
#     'scale_pos_weight': [3]
# }

# optimal_params = GridSearchCV(
#     estimator=xgb.XGBClassifier(objective='binary:logistic', seed=42, subsample=0.9, colsample_bytree=0.5),
#     param_grid=param_grid,
#     scoring='roc_auc',
#     verbose=2,
#     n_jobs=10,
#     cv=3
# )

# optimal_params.fit(X_train,
#                    y_train,
#                    early_stopping_rounds=10,
#                    eval_metric='auc',
#                    eval_set=[(X_test, y_test)],
#                    verbose=False)

# print(optimal_params.best_params_)

I ran it on my PC because it was taking too long here, and the output was:

> {'gamma': 1.0, 'learning_rate': 0.07, 'max_depth': 3, 'reg_lambda': 10.0, 'scale_pos_weight': 3}
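A rough count of the work involved helps explain the long run time. The sketch below counts the parameter combinations in the first commented-out grid and multiplies by the number of cross-validation folds (`cv=3`). It counts cross-validation fits only and ignores the final refit on the full training set:

```python
from itertools import product

# First grid from the commented-out cell above
param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.01, 0.05, 0.1, 0.3, 0.5],
    'gamma': [0, 0.25, 1.0],
    'reg_lambda': [0, 1.0, 10.0],
    'scale_pos_weight': [1, 3, 5],
}

# Every combination of one value per parameter is a candidate model
n_combinations = len(list(product(*param_grid.values())))

cv_folds = 3
total_fits = n_combinations * cv_folds
print(n_combinations, total_fits)
```

The second, narrower grid has 81 combinations, which brings the count down to 243 cross-validation fits.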

So let's see how much improvement we can make with those parameters.

In [None]:
# XGBoost
clf_xgb = xgb.XGBClassifier(
    seed=42,
    objective='binary:logistic',
    gamma=1.0,
    learning_rate=0.07,
    max_depth=3,
    reg_lambda=10,
    scale_pos_weight=3,
    subsample=0.9,
    colsample_bytree=0.5
)

clf_xgb.fit(
    X_train,
    y_train,
    verbose=True,
    early_stopping_rounds=10,
    eval_metric='aucpr',
    eval_set=[(X_test, y_test)]
)

In [None]:
plot_confusion_matrix(clf_xgb, 
                      X_test, 
                      y_test, 
                      values_format='d', 
                      display_labels=["Did not leave", "Left"])

## Results

After tuning the parameters with GridSearchCV, the share of churned customers in the test set that the model correctly identifies improved from *51.8%* (242 out of 467) to *82.7%* (386 out of 467).
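The percentages can be reproduced from the counts. This sketch assumes that 467 is the number of customers in the test set who actually left, which makes the ratio the recall on the "Left" class:

```python
# Counts quoted in the results above
churned_in_test = 467   # assumed: customers in the test set who actually left
caught_before = 242     # correctly flagged by the preliminary model
caught_after = 386      # correctly flagged by the tuned model

recall_before = caught_before / churned_in_test
recall_after = caught_after / churned_in_test

print(f"{recall_before:.1%} -> {recall_after:.1%}")
```

Recall alone does not show how many customers were wrongly flagged as leaving, so the top-right cell of each confusion matrix is worth comparing as well.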