XGBoost is a useful ML method when you don't want to sacrifice classification accuracy but still want a model that is fairly easy to understand and interpret.

# STEP 1: **IMPORTING THE DATA**
            
# STEP 2: **MISSING DATA**
            - identifying the missing data
            - dealing with the missing data
            
# STEP 3: **FORMATTING THE DATA FOR XGBOOST**
            - splitting the data into dependent and independent variables
            - One hot encoding
            - converting all columns to int, float or bool

# STEP 4: **BUILDING A PRELIMINARY XGBOOST MODEL**

# STEP 5: **OPTIMIZING PARAMETERS WITH CROSS VALIDATION AND GridSearchCV()**
            - optimizing the learning rate, tree depth, number of trees, gamma (for pruning) and lambda (for regularization)

In [None]:
# importing the required modules

import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score, roc_auc_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix, plot_confusion_matrix

In [None]:
# importing the data

df = pd.read_csv('../input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv')

In [None]:
df.head()

This is customer data from a telco company; the Churn column records whether each customer left.

In [None]:
df['customerID'].count()

There are a total of 7043 data rows 

In [None]:
len(df['customerID'].unique())

So all 7043 customer IDs are unique. The column only identifies a row and carries no information for predicting churn, so we drop it.

In [None]:
df.drop('customerID', axis= 1, inplace = True)

In [None]:
df.columns

In [None]:
df['MultipleLines'].unique()

In [None]:
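miss = []
for var in df.columns:
    # collect every column that contains at least one NaN
    if df[var].isnull().any():
        miss.append(var)

In [None]:
print(miss)

Observation from the above 2 commands:
 The list is empty, so no column contains NaN values. Missing data can still hide behind a surrogate value such as a blank string, which isnull() does not detect.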

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
df.head()

In [None]:
df.columns = df.columns.str.replace(' ', '_')
df.head()

# MISSING DATA PART 1: IDENTIFYING MISSING DATA

The biggest part of a data science project is ensuring that the data are correctly formatted and fixing it when it is not. The first part of this process is identifying **missing data**.

Missing data is simply a blank space, or a surrogate value like NA, that indicates that we failed to collect data for one of the features. For example: if we forgot to ask someone's age, or forgot to write it down, then we would have a blank space in the dataset for that person's age.

One thing that is relatively unique about XGBoost is that it has default behaviour for missing data. So all we have to do is identify missing values and make sure they are set to 0.

In this section, we will focus on identifying missing values in the dataset.
First, let's see what sort of data is in each column.

In [None]:
df.dtypes

In [None]:
print(df['TotalCharges'].unique())
print(len(df['TotalCharges'].unique()))

In [None]:
#df['TotalCharges']= pd.to_numeric(df['TotalCharges'])
# throws error 

There is a blank space (' ') in the TotalCharges column, which is why the conversion fails. So we have to deal with that.
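
One way to locate such entries without triggering the error is sketched below. It is a minimal example on a toy Series (the values are made up, not taken from the dataset) and relies on `errors="coerce"`, which makes `pd.to_numeric` return NaN for anything it cannot parse instead of raising.

```python
import pandas as pd

# Toy stand-in for the TotalCharges column: numbers stored as strings,
# plus one blank entry like the ones in the real data.
charges = pd.Series(["29.85", "1889.5", " ", "108.15"])

# errors="coerce" turns every value that cannot be parsed into NaN.
parsed = pd.to_numeric(charges, errors="coerce")

# The rows that became NaN are exactly the ones that failed to parse.
bad_rows = charges[parsed.isna()]
print(bad_rows.tolist())
```

The same idea applied to `df['TotalCharges']` would list the offending rows before we decide how to handle them.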

# MISSING DATA PART 2: DEALING WITH MISSING DATA, XGBoost Style

One thing that is relatively unique about **XGBoost** is that it determines default behaviour for missing data. So all we have to do is identify missing values and make sure they are set to 0.

However, before we do that, let's see how many rows are missing data. If it's a lot, then we might have a bigger problem than XGBoost can deal with on its own. If it's not that many, we can just set them to 0.

In [None]:
len(df.loc[df['TotalCharges'] == ' '])

Since only 11 rows are affected, we can inspect them manually.

In [None]:
df.loc[df['TotalCharges'] == ' ']

Observations:

    1) We see that all 11 people with TotalCharges == ' ' have just signed up, because tenure is 0 for all of them.
    
    2) All of these people also have Churn == 'No'.
    
So we have a few choices here:
    - We can set TotalCharges = 0 for these 11 people, or we can remove them from the dataset. Let's try setting TotalCharges = 0.

In [None]:
df.loc[(df['TotalCharges'] == ' '), 'TotalCharges'] = 0

Let's verify that we modified the TotalCharges column correctly by looking at everyone who has tenure = 0.

In [None]:
df.loc[df['tenure'] == 0]

NOTE: The TotalCharges column still has the object data type, which is a problem because XGBoost only accepts int, float or boolean columns.

To fix this, we convert the column to a numeric type with the to_numeric() function from pandas.

In [None]:
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'])

In [None]:
df.dtypes

Now that we have dealt with the missing data, we replace every white space (' ') in the string values with an underscore (_), just as we did for the column names.

In [None]:
df.replace(' ', '_', regex= True, inplace= True)
df.head()

# DATA FORMATTING : PART 1

- Formatting the data for an XGBoost model.

**STEPS**

    1. Split the data into 2 parts:
        a. the columns of data that we will use to make classifications
        b. the column of data that we want to predict
        

In [None]:
df.columns

In [None]:
X = df.drop('Churn', axis = 1).copy() 
# ALTERNATE (works because Churn is the last column):
# X = df.iloc[:, :-1]

X.head()

In [None]:
y = df['Churn'].copy()
y.head()

# DATA FORMATTING : PART 2

ONE-HOT ENCODING

After splitting the data into 2 parts, we observe the variables in X.

In [None]:
X.dtypes

In [None]:
print(df.columns)
for x in df.columns:
    print(df[x].unique())

So SeniorCitizen, tenure, MonthlyCharges and TotalCharges are all int64 or float64, which is what an XGBoost model requires.
However, we still need to take care of the categorical data. XGBoost is good at handling continuous data, but it does not natively support categorical data such as Contract, which contains 3 different categories. To use categorical data with XGBoost, we have to use One-Hot Encoding to convert a column of categorical data into multiple columns of binary values.

**Question: Why not treat categorical data as continuous data by coding the 3 categories of the Contract column as 3 integer values (say 1, 2, 3)?**
Answer: 
If we coded the three categories as 1, 2 and 3, XGBoost would treat them as ordered numbers. Because a tree splits on a threshold, it would be more likely to group categories with neighbouring codes (1 with 2, or 2 with 3) than to group 1 with 3, even though that order is arbitrary. With One-Hot Encoding, each category is treated separately.
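
The point can be checked with a few lines of plain Python. The sketch below is only an illustration: the integer codes are an arbitrary assignment, and `threshold_groupings` is a hypothetical helper that lists every group a single `code < t` split can send to the left.

```python
# Arbitrary integer codes for the three Contract categories.
codes = {"Month-to-month": 1, "One_year": 2, "Two_year": 3}

def threshold_groupings(codes):
    """Return every left-hand group that a single 'code < t' split can produce."""
    values = sorted(codes.values())
    groupings = set()
    # Candidate thresholds sit halfway between consecutive codes.
    for low, high in zip(values, values[1:]):
        t = (low + high) / 2
        left = frozenset(name for name, c in codes.items() if c < t)
        groupings.add(left)
    return groupings

groupings = threshold_groupings(codes)
for left in sorted(groupings, key=len):
    print(sorted(left))
```

No single split can put Month-to-month and Two_year on one side with One_year alone on the other, which is the restriction that One-Hot Encoding removes.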

# **ONE-HOT ENCODING**

There are 2 popular ways to implement OHE :
1. ColumnTransformer() from sklearn
2. get_dummies() from pandas

Both methods have their own pros and cons.

ColumnTransformer() has a very useful feature: it creates a persistent function that can validate data that we may get in the future. Example: if my XGBoost model uses a categorical variable favoriteColor that has red, blue and green as options, then ColumnTransformer() can remember those options. Later on, when the model is being used in a production system and someone says their favorite color is orange, ColumnTransformer() can throw an error or handle the situation in another way. The downside of ColumnTransformer() is that it turns our data into an array and loses all the column names, making it harder to verify that our columns are used the way we intended.

In contrast, get_dummies() leaves our data in a dataframe and retains the column names. However, it does not have the persistent behaviour of ColumnTransformer().
I will be using the get_dummies() method in this case.
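
If get_dummies() is used and new data arrives later, the missing persistence has to be handled by hand. The sketch below shows one possible approach, assuming we keep the encoded training columns around and force new data into the same layout with `reindex`; the tiny frames reuse the favoriteColor example and are made up for illustration.

```python
import pandas as pd

train = pd.DataFrame({"favoriteColor": ["red", "blue", "green"]})
new = pd.DataFrame({"favoriteColor": ["blue", "orange"]})

train_encoded = pd.get_dummies(train, columns=["favoriteColor"], dtype=int)
new_encoded = pd.get_dummies(new, columns=["favoriteColor"], dtype=int)

# Categories that never appeared in training (here "orange").
unseen = sorted(set(new_encoded.columns) - set(train_encoded.columns))

# Force the new data into the training layout: columns missing from the
# new data are added as 0, and unseen columns are dropped.
new_aligned = new_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(unseen)
print(list(new_aligned.columns))
```

Checking `unseen` before aligning gives a chance to raise an error instead of silently encoding an unknown category as all zeros.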

In [None]:
pd.get_dummies(X, columns= ['Contract']).head()

Note that this is not how we would encode the same data for linear or logistic regression. There, one category is usually dropped so that the dummy columns are not perfectly collinear. The full encoding with one column per category is not suitable for those models, but it works great for trees.

Now let's apply OHE via get_dummies() to all of the categorical columns and save the result.

In [None]:
X_encoded = pd.get_dummies(X, columns= ['gender', 
                                        'Partner', 
                                        'Dependents', 
                                        'PhoneService', 
                                        'MultipleLines', 
                                        'InternetService',
                                        'OnlineSecurity',
                                        'OnlineBackup',
                                        'DeviceProtection',
                                        'TechSupport',
                                        'StreamingTV',
                                        'StreamingMovies',
                                        'Contract',
                                        'PaperlessBilling',
                                        'PaymentMethod'])

X_encoded.head()

X_encoded now has 45 columns, compared with the 19 columns of X (the raw file had 21, including customerID and Churn). Next we have to verify with unique() that y only contains 1s and 0s.

In [None]:
y.unique()

In [None]:
# REPLACING YESs WITH 1s and NOs with 0s
y = y.str.replace('Yes', '1')
y = y.str.replace('No', '0')
y.unique()

XGBoost chooses the split for data that gives the best value for **Gain**.
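
As a rough illustration of what Gain measures, the sketch below uses the simplified form that is common in tutorials, without the constant factor of 1/2 from the XGBoost paper: each node gets a similarity score G² / (H + lambda), where G and H are the sums of the gradients and hessians of the logistic loss, and Gain is the children's similarity minus the parent's. The toy labels and the starting probability of 0.5 are assumptions made for the example.

```python
def similarity(labels, prob, reg_lambda):
    """Similarity score G^2 / (H + lambda) for one node under logistic loss."""
    g = sum(prob - y for y in labels)      # sum of gradients (p - y)
    h = len(labels) * prob * (1 - prob)    # sum of hessians p * (1 - p)
    return g * g / (h + reg_lambda)

def split_gain(left, right, prob=0.5, reg_lambda=0.0):
    """Gain = similarity(left) + similarity(right) - similarity(parent)."""
    parent = left + right
    return (similarity(left, prob, reg_lambda)
            + similarity(right, prob, reg_lambda)
            - similarity(parent, prob, reg_lambda))

left = [0, 0, 0, 1]   # toy labels that fall on the left of a candidate split
right = [1, 1]        # toy labels that fall on the right

gain_plain = split_gain(left, right, reg_lambda=0.0)
gain_reg = split_gain(left, right, reg_lambda=10.0)
gamma = 0.25

print(gain_plain, gain_plain - gamma > 0)   # is the split kept without regularization?
print(gain_reg, gain_reg - gamma > 0)       # is it kept with reg_lambda = 10?
```

In this toy case a larger reg_lambda shrinks the gain until it drops below gamma, so the same split would be pruned. This is how the two parameters work together.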

QUESTION: Doesn't XGBoost consume a lot of memory keeping track of all the 1s and 0s?

ANSWER: No. XGBoost uses sparse matrices, so it only keeps track of the 1s and does not allocate any memory for the 0s.

# **BUILD A PRELIMINARY XGBoost MODEL**

Now we need to split the data into training and test sets. First, let's check how imbalanced the data is by dividing the number of people who left the company by the total number of people in the dataset.

In [None]:
y = pd.to_numeric(y)

In [None]:
sum(y)/len(y)

So we see only around 26.5 % of the people in the dataset left the company. Due to this, when we split the data using train_test_split, we split using stratification in order to maintain the same percentage of people who left the company in both the training as well as the testing set.

In [None]:
X_train, X_test, y_train, y_test= train_test_split(X_encoded, y, random_state = 42, stratify = y)

Let's verify whether stratification worked or not.

In [None]:
print(sum(y_train)/len(y_train))
print(sum(y_test)/len(y_test))

The output above shows that the stratification worked: y_train and y_test contain the same percentage of people that left the company. Let's build the preliminary model.

NOTE:

Instead of determining the optimal number of trees with cross validation, we will be using **early stopping** to stop building trees when they no longer improve the situation.
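
The idea behind early stopping can be sketched in plain Python. This is a simplified illustration, not XGBoost's implementation: `scores` is a made-up list of evaluation scores, one per boosting round (higher is better), and `patience` plays the role of `early_stopping_rounds`.

```python
def best_round_with_early_stopping(scores, patience):
    """Return (index of the best round, number of rounds actually run)."""
    best_index = 0
    for i, score in enumerate(scores):
        if score > scores[best_index]:
            best_index = i                 # new best score: remember the round
        elif i - best_index >= patience:
            return best_index, i + 1       # no improvement for `patience` rounds
    return best_index, len(scores)

# Hypothetical evaluation scores for eight boosting rounds.
scores = [0.60, 0.63, 0.65, 0.64, 0.65, 0.64, 0.63, 0.66]
best_index, rounds_run = best_round_with_early_stopping(scores, patience=3)
print(best_index, rounds_run)
```

Training stops as soon as the score has failed to improve for `patience` consecutive rounds, so any later improvement (such as the last value in the list) is never seen.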

In [None]:
xgb_clf = xgb.XGBClassifier(objective='binary:logistic', missing= None, seed= 42)
xgb_clf.fit(X_train,
            y_train,
            verbose= True,
            early_stopping_rounds= 10,
            eval_metric= 'aucpr',
            eval_set= [(X_test, y_test)])

The objective is 'binary:logistic' because XGBoost uses a logistic-regression-style loss to evaluate how well it classifies the observations. The missing argument specifies which value in the data stands for a missing entry. None is already the default here, so passing missing= None explicitly is unnecessary. With missing= None, XGBoost treats numpy.nan (NumPy's not-a-number) as the missing value. When the data is stored as a sparse matrix, entries that are not stored are handled as missing as well, so no memory has to be allocated for them.

The model is trained on the training dataset, but the decision on how many trees to build (the early stopping mechanism) is made on the testing dataset.

Now that we have built the **XGBoost model** for classification, let's see how it performs on the testing dataset by running the testing dataset down the model and drawing a confusion matrix.

In [None]:
plot_confusion_matrix(xgb_clf, 
                      X_test,
                      y_test,
                      values_format= 'd',
                      display_labels=["Did not leave", "Left"])

In the confusion matrix, we see that of the 1294 people that did not leave the company, 1155 (89.26 %) were correctly classified, and of the 467 people that left the company, only 225 (48.2 %) were correctly classified. That is not impressive for the people that left, and part of the cause is that our data is imbalanced. Since people leaving costs the company money, we want the model to catch more of the customers who leave. Let's try to improve the predictions using **Cross Validation** to optimize the parameters.
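
The percentages above can be reproduced from the four counts. The sketch below recomputes them and adds overall and balanced accuracy, which shows how overall accuracy hides the weak result on the minority class.

```python
# Counts read off the confusion matrix described above.
stayed_total, stayed_correct = 1294, 1155
left_total, left_correct = 467, 225

specificity = stayed_correct / stayed_total    # "Did not leave" correctly classified
recall = left_correct / left_total             # "Left" correctly classified
accuracy = (stayed_correct + left_correct) / (stayed_total + left_total)
balanced_accuracy = (specificity + recall) / 2  # mean of the two per-class rates

print(f"specificity:       {specificity:.2%}")
print(f"recall:            {recall:.2%}")
print(f"accuracy:          {accuracy:.2%}")
print(f"balanced accuracy: {balanced_accuracy:.2%}")
```

Because the majority class dominates the overall accuracy, balanced accuracy or recall for the "Left" class is the more informative number here.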

The good news is that **XGBoost** has a parameter called *scale_pos_weight* that helps with imbalanced data. It adds a penalty for incorrectly classifying the minority class (in this case the people that left the company). Increasing that penalty pushes the trees to correctly classify more of those people in order to reduce the penalty.
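
The XGBoost documentation suggests the ratio of negative to positive instances as a typical starting value for scale_pos_weight. The sketch below computes that ratio from the test-set counts quoted earlier (1294 stayed, 467 left), used here as a stand-in for the training counts since stratification keeps the class ratio the same. The helper name is made up for the example.

```python
def suggested_scale_pos_weight(n_negative, n_positive):
    """Heuristic starting point: number of negatives divided by number of positives."""
    return n_negative / n_positive

weight = suggested_scale_pos_weight(n_negative=1294, n_positive=467)
print(round(weight, 2))
```

The result is close to 3, so a search over values around that number is a reasonable place to start.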


# **Optimize Parameters using Cross Validation and GridSearch()**

**XGBoost** has a lot of *hyperparameters* that we have to configure manually because XGBoost does not determine them itself, including *max_depth*, the maximum tree depth; *learning_rate*, the learning rate, or ***eta***; *gamma*, the parameter that encourages pruning; and *reg_lambda*, the regularization parameter lambda. Let's try to find the optimal values for these parameters to improve the performance on the **Testing Dataset**.

**NOTE**: Since we have many hyperparameters to optimize, we will use *GridSearchCV()*. We specify a bunch of potential values for the hyperparameters and *GridSearchCV* tests all possible combinations of the parameters for us.
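
To get a feel for the cost of an exhaustive search, the sketch below counts the combinations for a grid with five hyperparameters and three candidate values each, evaluated with 3-fold cross validation. The values mirror the first round of the search in this notebook; only the counting is shown, no model is trained.

```python
from itertools import product

param_grid = {
    "max_depth": [3, 4, 5],
    "learning_rate": [0.1, 0.01, 0.05],
    "gamma": [0, 0.25, 1.0],
    "reg_lambda": [0, 1.0, 10.0],
    "scale_pos_weight": [1, 3, 5],
}
cv_folds = 3

# Every combination of one value per hyperparameter.
combinations = [dict(zip(param_grid, values))
                for values in product(*param_grid.values())]
n_fits = len(combinations) * cv_folds   # one model per combination and fold

print(len(combinations), n_fits)
```

The number of fits grows multiplicatively with every added parameter value, which is why the search is run in rounds over smaller grids.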

In [None]:
# NOTE: When data is imbalanced, the XGBoost manual says:
# If you care only about the overall performance metric (AUC) of your predictions
# -> Balance the positive and negative weights via scale_pos_weight
# -> Use AUC for evaluation.
# To keep the search short, we run GridSearchCV() sequentially on subsets of the parameter options
# rather than on all of them at once.

## ROUND 1

param_grid= {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.1, 0.01, 0.05],
    'gamma': [0, 0.25, 1.0],
    'reg_lambda': [0, 1.0, 10.0],
    'scale_pos_weight': [1, 3, 5]
}

## ROUND 2 (this grid overrides the ROUND 1 grid above, so only one round runs at a time)

param_grid= {
    'max_depth': [4],
    'learning_rate': [0.1, 0.5, 1],
    'gamma': [0.25],
    'reg_lambda': [10.0, 20, 100],
    'scale_pos_weight': [3]
}


# To speed up Cross Validation and to reduce overfitting, each tree is built on a random subset of the data:
# subsample= 0.9 uses a randomly selected 90 % of the rows per tree, and colsample_bytree= 0.5 uses a
# randomly selected 50 % of the columns per tree. Other than that, we score with AUC and use only
# 3-fold (rather than 10-fold) Cross Validation.

optimal_params= GridSearchCV(
    estimator= xgb.XGBClassifier(objective= 'binary:logistic',
                                 seed= 42,
                                 subsample= 0.9,
                                 colsample_bytree= 0.5),
    param_grid= param_grid,
    scoring= 'roc_auc',
    verbose= 0,
    n_jobs= 10,
    cv= 3
)

optimal_params.fit(X_train,
                   y_train,
                   early_stopping_rounds= 10,
                   eval_metric= 'auc',
                   eval_set= [(X_test, y_test)],
                   verbose= False)

print(optimal_params.best_params_)


# **Building, Evaluating, Drawing and Interpreting the Optimized XGBoost Model**

In [None]:
xgb_clf= xgb.XGBClassifier(seed= 42,
                           objective= 'binary:logistic',
                           gamma= 0.25,
                           learning_rate= 0.1,
                           max_depth= 4,
                           reg_lambda= 10,
                           scale_pos_weight= 3,
                           subsample= 0.9,
                           colsample_bytree= 0.5)

xgb_clf.fit(X_train,
            y_train,
            verbose= True,
            early_stopping_rounds= 10,
            eval_metric= 'aucpr',
            eval_set= [(X_test, y_test)])

Now let's check whether the optimized **XGBoost** model does better by plotting another confusion matrix.

In [None]:
plot_confusion_matrix(xgb_clf,
                      X_test,
                      y_test,
                      values_format= 'd',
                      display_labels= ["Did not leave", "Left"])

So we can conclude that the optimized **XGBoost** model does a much better job at classifying the people that left the company. Out of the 467 people that left, 380 (81.37 %) were correctly identified. Before optimization, only 48.2 % were correctly identified.

NOTE:

However, this improvement came at the cost of not being able to correctly classify as many people that did not leave the company. With the optimized model, out of the 1294 people that didn't leave the company, only 932 (72 %) were correctly classified. That said, this trade-off may be better for the company, since the people that leave take their money with them and increase the company's financial losses. From the company's perspective, it is better to identify such people before they leave and take the necessary steps to keep as many of them as possible, reducing the (expected) losses.
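
The trade-off can be put into numbers using only the counts quoted in the two confusion matrices. The sketch below compares, for both models, how many leavers were caught and how many customers who stayed were wrongly flagged as leaving; `summarize` is a helper written for this example.

```python
def summarize(stayed_correct, left_correct, stayed_total=1294, left_total=467):
    """Recall, false alarms and precision for the 'Left' class."""
    false_alarms = stayed_total - stayed_correct   # stayers flagged as leaving
    flagged = left_correct + false_alarms          # everyone predicted to leave
    return {
        "recall": left_correct / left_total,
        "false_alarms": false_alarms,
        "precision": left_correct / flagged,
    }

before = summarize(stayed_correct=1155, left_correct=225)
after = summarize(stayed_correct=932, left_correct=380)

for name, result in (("preliminary", before), ("optimized", after)):
    print(name, {k: round(v, 3) for k, v in result.items()})
```

The optimized model catches more of the people who leave, but a larger share of its "Left" predictions are false alarms. Whether that is acceptable depends on how costly a retention effort is compared with a lost customer.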

# **Drawing and Interpreting the XGBoost Tree**

To see values such as gain and cover for each node of the first tree, we build a model with only one tree. Otherwise the reported values are aggregated over all of the trees.

In [None]:
xgb_clf= xgb.XGBClassifier(seed= 42,
                           objective= 'binary:logistic',
                           learning_rate= 0.1,
                           max_depth= 4,
                           reg_lambda= 10,
                           scale_pos_weight= 3,
                           subsample= 0.9,
                           colsample_bytree= 0.5,
                           n_estimators= 1)
# n_estimators set to 1 so that we can get gain, cover etc.
xgb_clf.fit(X_train, y_train)

Now let's print the weight, gain, cover etc for the tree.

- weight = the number of times a feature is used in a branch or root across all trees
- gain = the average gain across all splits the feature is used in
- cover = the average coverage across all splits the feature is used in
- total_gain = the total gain across all splits the feature is used in
- total_cover = the total coverage across all splits the feature is used in

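**NOTE**: With only 1 tree, gain = total_gain and cover = total_cover for every feature that is used in exactly one split. In general, gain = total_gain / weight and cover = total_cover / weight.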
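
The relationship between these importance types can be illustrated without XGBoost. The sketch below aggregates a hypothetical list of splits (the feature names come from this dataset, the gain values are invented) into weight, total_gain and average gain.

```python
from collections import defaultdict

# Hypothetical (feature, gain) pairs, one per split in a single small tree.
splits = [
    ("Contract_Month-to-month", 120.0),
    ("tenure", 40.0),
    ("tenure", 20.0),
    ("MonthlyCharges", 15.0),
]

weight = defaultdict(int)
total_gain = defaultdict(float)
for feature, gain in splits:
    weight[feature] += 1          # how often the feature is used to split
    total_gain[feature] += gain   # sum of the gains of those splits

# The average gain is the total gain divided by the number of splits.
average_gain = {f: total_gain[f] / weight[f] for f in weight}

for f in weight:
    print(f, weight[f], total_gain[f], average_gain[f])
```

A feature that splits only once has identical average and total values, while a feature used twice (tenure in this toy list) does not.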

In [None]:
bst= xgb_clf.get_booster()

for importance_type in ('weight', 'gain', 'cover', 'total_gain', 'total_cover'):
    print('%s: ' % importance_type, bst.get_score(importance_type= importance_type))
    
    
node_params= {'shape': 'box',  # makes the node fancy
              'style': 'filled, rounded',
              'fillcolor': '#78cbe1'
             }

leaf_params= {'shape': 'box',
              'style': 'filled',
              'fillcolor': '#e48038'}

# NOTE: num_trees is NOT the number of trees to plot, but the specific tree that we are going to plot
# The default value is 0, but let's set it just to show it since it is counter-intuitive.
# xgb.to_graphviz(xgb_clf, num_trees= 0, size= "10, 10")

xgb.to_graphviz(xgb_clf, num_trees= 0, size= "10, 10",
                condition_node_params= node_params,
                leaf_node_params= leaf_params)

# TO SAVE THE FIGURE (in jupyter notebook):
# graph_data= xgb.to_graphviz(xgb_clf, num_trees= 0, size= "10, 10",
#                 condition_node_params= node_params,
#                 leaf_node_params= leaf_params)
# graph_data.view(filename= 'insert arbitrary file name as required')

Let's discuss how to interpret the XGBoost tree. In each node, we have:

    - The variable (column name) and the threshold for splitting the observations. For example: in the tree's root, we use Contract_Month-to-month to split the observations. All the observations with Contract_Month-to-month < 1 go to the LEFT and all the observations with a value >= 1 go to the RIGHT.
    
    - Each branch says either YES or NO, and some branches also say MISSING:
        -> **yes** and **no** refer to whether the condition in the node above is **true** or **not**. If it is true, follow **yes**,
            otherwise **no**.
        -> **missing** marks the default branch that an observation follows when its value for that variable is missing.
      
    - **leaf** tells us the output value for each leaf.
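
With the binary:logistic objective, the leaf outputs are on the log-odds scale, so they have to pass through the logistic (sigmoid) function to become probabilities. The sketch below shows that conversion under the assumption of a starting margin of 0, which corresponds to an initial prediction of 0.5; the actual starting value depends on base_score and on the XGBoost version. The leaf values are invented.

```python
import math

def leaf_to_probability(leaf_values, base_margin=0.0):
    """Sum the leaf outputs of all trees and map the total log-odds to a probability."""
    margin = base_margin + sum(leaf_values)
    return 1.0 / (1.0 + math.exp(-margin))

p_neutral = leaf_to_probability([0.0])        # no evidence either way
p_leave = leaf_to_probability([0.2, 0.1])     # two trees both pushing towards "Left"
p_stay = leaf_to_probability([-0.3])          # one tree pushing towards "Did not leave"

print(p_neutral, round(p_leave, 3), round(p_stay, 3))
```

A positive leaf value pushes the predicted probability of leaving above the starting point and a negative one pushes it below.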
    
# **SUMMARY**:

1. Loaded the Data from a File
2. Identified and Dealt with the Missing Data
3. Formatted the Data for **XGBoost** using OHE (One-Hot Encoding)
4. Built an **XGBoost** Model for classification
5. Optimized the **XGBoost Parameters** with Cross Validation and GridSearchCV()
6. Built, Drew, Interpreted and Evaluated the Optimized XGBoost Model