### Introduction

Today, we explore the problem of class imbalance and try out a technique called XGBoost. We will use a dataset of COVID patients released by the Mexican government, and attempt to predict whether a patient has passed away based on some features (TBC). More information about the dataset can be found [here](https://www.kaggle.com/tanmoyx/covid19-patient-precondition-dataset).

#### Class imbalance

Class imbalance typically occurs in classification problems when we have a skewed dataset. This means that we have a majority class (a class with tonnes of data) and a minority class (a class with very little data). Feeding our classification model such data will result in a skewed/biased model. For example, suppose we have a dataset like this:

<img src='https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2019/11/Scatter-Plot-of-a-Binary-Classification-Problem-with-a-1-to-1000-Class-Imbalance.png' width=500 />

Our model will cunningly learn that it does very well on the accuracy metric if it just predicts 0 for every single sample, and this is exactly the scenario we are trying to avoid.

A decision tree is really just an algorithm that partitions the data in a certain way. With class imbalance, it can find a poor partition because there is too little minority data, and it can also be over-confident about each partition it creates.

#### Tackling class imbalance

We therefore introduce a few ways of tackling the statistical bias in decision trees:

1. Introduce different evaluation metrics - precision, recall, accuracy etc. such that we get a better reflection and understanding of how well our model is actually performing

2. Oversampling and undersampling

3. Other methods of classification such as XGBoost which we will explore

#### 1. Evaluation metrics 

We will introduce a few evaluation metrics so that you have more choices and methods to evaluate how well your model is actually performing.

* Accuracy
<br>
$$
Accuracy = \frac{Correct\:Predictions}{Total\:Data}
$$

* Precision
<br>
$$
Precision = \frac{True\:Positive}{Predicted\:Positive}
$$

* Recall
<br>
$$
Recall = \frac{True\:Positive}{Actual\:Positive}
$$

* False Positive Rate
<br>
$$
False\:Positive\:Rate = \frac{False\:Positive}{False\:Positive + True\:Negative}
$$

* False Negative Rate
<br>
$$
False\:Negative\:Rate = \frac{False\:Negative}{False\:Negative + True\:Positive}
$$

* F1
<br>
$$
F1 = 2*\frac{Precision*Recall}{Precision+Recall}
$$


Suppose we have some cancer data where 98% of the patients do not have cancer (represented by 0), and 2% of the patients do have cancer (represented by 1).

Let us investigate an extreme model, one that always predicts 0.

| Actual (rows) / Predicted (columns) | 0 | 1 |
|---|---|---|
| 0 | 98 | 0 |
| 1 | 2 | 0 |

<br>
This model does relatively well on accuracy, with a score of 0.98. However, once we introduce recall, it scores 0, and we find that this model is actually really bad!

Note that precision and recall generally work better when the minority class is represented as positive (1). If the minority class is 0 and our model always predicts 1, precision and recall will both look really good even though the model never identifies the minority class. In that case we have to fall back on the false positive and false negative rates.
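To make the always-0 example concrete, here is a minimal sketch that computes the metrics defined above from a 2x2 confusion matrix. The helper name `metrics_from_confusion` is made up for illustration, and returning 0 when a denominator is zero is an assumed convention (precision is strictly undefined when nothing is predicted positive).

```python
def metrics_from_confusion(mat):
    """Compute the metrics defined above from a 2x2 confusion matrix.

    Rows are actual classes and columns are predicted classes,
    with class 1 treated as positive.
    """
    tn, fp = mat[0]
    fn, tp = mat[1]

    def safe_div(num, den):
        # Assumed convention: return 0.0 when the denominator is zero.
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "false_positive_rate": safe_div(fp, fp + tn),
        "false_negative_rate": safe_div(fn, fn + tp),
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


# The always-predict-0 model from the cancer example.
always_zero = [[98, 0],
               [2, 0]]
scores = metrics_from_confusion(always_zero)
for name, value in scores.items():
    print(f"{name}: {value:.2f}")
```

The accuracy looks excellent, while recall and the false negative rate expose that every cancer patient is missed.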

#### 2. Resampling

In resampling, there are 2 ways to go about doing things:

* **Undersampling**

In this technique, we take only a portion of the majority class and all data points in the minority class to build our classification algorithm, ensuring a fairer spread of data. However, the key problem here is that we will lose some precious data! One way to mitigate this is to use **ensembling techniques**.

* **Oversampling**

In this method, we do the reverse of undersampling. Instead of selecting less majority class data, we create synthetic data to add to the minority class, in a way beefing it up to a level comparable to the majority class.

<img src='https://miro.medium.com/max/725/0*FeIp1t4uEcW5LmSM.png' width=700/>
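As a rough illustration of the two ideas, the sketch below resamples a made-up toy dataset with NumPy only. The data, the seed and the variable names are assumptions for illustration. The oversampling half simply duplicates minority rows at random, which is the simplest stand-in for generating synthetic samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 95 majority samples (label 0) and 5 minority samples (label 1).
X = rng.normal(size=(100, 2))
y = np.array([0] * 95 + [1] * 5)

majority_idx = np.flatnonzero(y == 0)
minority_idx = np.flatnonzero(y == 1)

# Undersampling: keep every minority sample and an equally sized
# random subset of the majority class.
kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
under_idx = np.concatenate([kept_majority, minority_idx])
X_under, y_under = X[under_idx], y[under_idx]

# Random oversampling: keep every sample and redraw minority samples
# (with replacement) until both classes have the same size.
n_extra = len(majority_idx) - len(minority_idx)
extra_minority = rng.choice(minority_idx, size=n_extra, replace=True)
over_idx = np.concatenate([majority_idx, minority_idx, extra_minority])
X_over, y_over = X[over_idx], y[over_idx]

print(np.bincount(y_under))  # class counts after undersampling
print(np.bincount(y_over))   # class counts after oversampling
```

Undersampling ends with few rows in total, which is the data loss mentioned above. Oversampling keeps everything but repeats minority rows.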

In [None]:
from sklearn.metrics import confusion_matrix, plot_confusion_matrix
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

In [None]:
data = pd.read_csv('../input/covid19-patient-precondition-dataset/covid.csv')

In [None]:
death = pd.DataFrame(data['date_died'])
age = pd.DataFrame(data['age'])
death.rename({'date_died':'Deceased'}, axis=1, inplace=True)
# Keep the '9999-99-99' placeholder as-is; any other value (an actual date) becomes 1 (deceased)
death.where(death == '9999-99-99', 1, inplace=True)

In [None]:
# Replace the '9999-99-99' placeholder with 0 (not deceased)
death.where(death != '9999-99-99', 0, inplace=True)
death = death.astype('int64')

In [None]:
#Let's see if our age correlates with death
f = plt.figure(figsize=(12,4))
sns.boxplot(x='age', y='Deceased', data=pd.concat([age,death], axis=1), orient = 'h')

So, it does seem like the older you are, the more lethal COVID is. Let's now build a decision tree and a confusion matrix to see what it looks like!

In [None]:
X_train, X_test, y_train, y_test = train_test_split(age, death, test_size = 0.2)
print(type(X_train))

In [None]:
tree = DecisionTreeClassifier(max_depth=3, min_samples_split=10)
tree.fit(X_train, y_train)
f = plt.figure(figsize=(12,12))
plot_tree(tree, feature_names=['Age'], class_names=['Alive', 'Deceased'], filled=True)

In [None]:
#Checking the accuracy of the model
print(tree.score(X_test, y_test))
mat = confusion_matrix(y_test, tree.predict(X_test))
plot_confusion_matrix(tree,X_test, y_test)
print("The False Negative Rate is: {:.02%}".format(mat[1][0]/(mat[1][0]+mat[1][1])))

The accuracy is great! However, if we turn to the confusion matrix, we see a big problem: we have a huge number of false negatives, and this is actually the most important indicator! If our model tells us a patient will survive, we are going to pay less attention to them, and this will significantly compromise their chance of living. We thus work at reducing the **false negative rate** in this model.

We see the problem of class imbalance here! Since the Infection Fatality Rate (IFR) of COVID is extremely low in general, we have very little data on those who passed away, and thus the model ends up learning to predict that everyone is going to live on happily despite their COVID infections. Essentially, the problem is that we do not have enough data on COVID deaths, and thus we produced a biased model.

However, we have to consider carefully the use of upsampling in this scenario. Consider instead the case where we are predicting whether a house has a balcony based on its sale price. There we can upsample because it is essentially akin to collecting more samples of houses with balconies, which lets our model see a better distribution of sale price in the two classes. For COVID, the distribution of the classes is very skewed to begin with, so by artificially creating and injecting new death samples we may be inflating the death rate and creating a model that instead over-estimates the lethality of the illness!

A good direction to go may instead be to include more factors, which would make our decision tree more robust. We will nonetheless try out upsampling with SMOTE today just to get a sense of how it works.

#### Synthetic Minority Oversampling Technique(SMOTE)

In SMOTE, we make use of the concept of interpolation - a method of constructing new data points within the range of a discrete set of known data points. 

SMOTE works by first choosing a random example from the minority class. Then the k nearest neighbours of that example are found (typically k=5). One of these neighbours is selected at random, and a synthetic example is created at a randomly selected point between the two examples in feature space.

It is worth noting that SMOTE can be applied together with other resampling techniques, and does not have to be a standalone application.
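The interpolation step described above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the `imblearn` implementation: the function name `smote_sketch`, the brute-force neighbour search and the toy points are all made up for this example.

```python
import numpy as np


def smote_sketch(X_minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    k = min(k, n - 1)  # a point cannot have more neighbours than there are other points

    # Pairwise Euclidean distances between all minority samples.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for row in range(n_new):
        i = rng.integers(n)            # random minority example
        j = rng.choice(neighbours[i])  # one of its k nearest neighbours
        gap = rng.random()             # position along the segment, in [0, 1)
        synthetic[row] = X_minority[i] + gap * (X_minority[j] - X_minority[i])
    return synthetic


# Hypothetical minority-class points in a 2D feature space.
X_min = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                  [2.5, 2.5], [3.0, 1.5], [3.5, 3.0]])
new_points = smote_sketch(X_min, n_new=4)
print(new_points)
```

Every synthetic point lies on a line segment between two real minority points, so it always stays within the range of the existing minority data.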


<p float="up">
  <img src="https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2019/10/Scatter-Plot-of-Imbalanced-Binary-Classification-Problem.png" width="300" />
  <img src="https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2019/10/Scatter-Plot-of-Imbalanced-Dataset-Transformed-by-SMOTE-and-Random-Undersampling.png" width="300" /> 
</p>

In [None]:
X_resampled, y_resampled = SMOTE(sampling_strategy=0.6).fit_resample(X_train, y_train)

new_tree = DecisionTreeClassifier(max_depth=3, min_samples_split=10)
new_tree.fit(X_resampled, y_resampled)
f = plt.figure(figsize=(12,12))
plot_tree(new_tree, feature_names=['Age'], class_names=['Alive', 'Deceased'], filled=True)

In [None]:
print(new_tree.score(X_test, y_test))
new_mat = confusion_matrix(y_test, new_tree.predict(X_test))
plot_confusion_matrix(new_tree,X_test, y_test)
print("The False Negative Rate is: {:.02%}".format(new_mat[1][0]/(new_mat[1][0]+new_mat[1][1])))

The **accuracy** of our model has declined, but this is a trade-off I am completely willing to make! We would much rather have some false positives in this case than false negatives that can lead to lives being lost.

The **important thing** to note here is that using a single variable for a decision tree is really not ideal. We cannot hope to generate a complex model that makes good predictions, and most techniques are not going to help because we are fundamentally looking at a model that is too simple. It is like giving all the necessary resources to a 3-year-old and expecting him/her to build a nuclear reactor!

#### Gradient Boost

Before diving into Extreme Gradient Boost, we must first look at how gradient boosting is used in classification tasks. This section essentially implements and explains what StatQuest goes through in the video [here](https://www.youtube.com/watch?v=jxuNLH5dXCs).

Let us first create a synthetic dataset and go on from there.

In [None]:
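# Build the frame from a plain list so that 'Age' stays numeric
# (wrapping the rows in np.array would turn every value into a string)
Data = pd.DataFrame([['Yes',12,'Blue','Yes'],['Yes',87,'Green','Yes'],
                     ['No',44,'Blue','No'],['Yes',19,'Red','No'],
                     ['No',32,'Green','Yes'],['No',14,'Blue','Yes']],
                    columns = ['Likes Popcorn','Age','Favourite Colour','Loves Troll 2'])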
Data

Let our task be predicting whether a person will love Troll 2. Gradient boosting (GB) starts by making an initial prediction. Since this is classification, we must first quantify that prediction somehow.

1. First we compute the **odds** of the positive class:

$$
odds = \frac{P_{positive}}{1-P_{positive}}
$$

2. The initial prediction, which we call the **classification**, is the logarithm of the odds:

$$
classification = \log(odds)
$$

3. Then we have the **probability**, i.e. the chance of a person liking the movie:

$$
probability\;of\;individual\;liking\;movie = \frac{e^{classification}}{1+e^{classification}}
$$

In our dataset 4 of the 6 people love the movie, so the odds are $4/2 = 2$. Using the formulas above, we get log(odds) = $\log(2) \approx 0.7$ initially, and the probability of liking the movie is $2/3 \approx 0.7$ as well. Once we have made the predictions, we move on to find the error, which is very similar to what we see in logistic regression:

$$
Residual = (Observed - Predicted),\;Observed = [Yes:1,\;No:0]
$$
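The numbers above can be reproduced with a short calculation. This is a minimal sketch on the six toy labels. Note that the text rounds both log(odds) and the probability to 0.7, so the unrounded residuals here differ slightly from the 0.3 and -0.7 used in the next cell.

```python
import math

# Observed labels for "Loves Troll 2" in the toy dataset (Yes = 1, No = 0).
observed = [1, 1, 0, 0, 1, 1]

p_positive = sum(observed) / len(observed)   # 4 out of 6
odds = p_positive / (1 - p_positive)         # about 2
log_odds = math.log(odds)                    # the initial "classification"
probability = math.exp(log_odds) / (1 + math.exp(log_odds))

# Residual = Observed - Predicted, using the same prediction for every sample.
residuals = [obs - probability for obs in observed]

print(round(log_odds, 2), round(probability, 2))
print([round(r, 2) for r in residuals])
```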

In [None]:
Data = pd.concat([Data, pd.DataFrame([[0.3],[0.3],[-0.7],[-0.7],[0.3],[0.3]],columns=['Residual'])], axis=1)
Data

Our tree is now in fact a regression tree: it is fitted to the residuals and tries to adjust itself so that the residual is minimised. Hence the name gradient boost: in a way, we are sliding down the gradient curve bit by bit until we reach the point where we can predict most items correctly.

We will first construct a **regression tree**, where we split the examples up by their residual. We will achieve a tree like this:

<img src='https://i.imgur.com/9ETHA7x.png' width=300>

For the specifics of why the tree is constructed as such, refer to the appendix **below**.

Just like in any other form of learning, we must be able to adjust our classifications to make them more accurate. However, we **cannot** directly add the residual to the classification. This is because $residual=P_{actual} - P_{pred}$ is a deviation in terms of probability, whereas our initial classification value of **0.7** is the logarithm of the odds. Therefore we use a transformation given by the equation below:

$$
\Delta Classification = \frac{\sum Residual_{i}}{\sum{P_{i}^{prev} * (1 - P_{i}^{prev})}}
$$

Take the leftmost node for instance. We predicted a probability of 0.7 for it, and it has a residual of -0.7, so $\Delta Classification = (-0.7)/(0.7*(1-0.7)) = -3.3$, and our **new classification** of this sample will be

$$
0.7\;(initial\;classification) + 0.8\;(learning\;rate) \times (-3.3)\;(\Delta\;classification) = -1.94
$$

Our new probability will in turn be 

$$
P_{pred} = \frac{e^{-1.94}}{1 + e^{-1.94}} = 0.126
$$
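The update for the leftmost leaf can be checked with a small sketch. The helper names `leaf_delta` and `sigmoid` are made up for illustration. The worked example above rounds $\Delta Classification$ to -3.3 before continuing, so the unrounded values printed here come out slightly different from -1.94 and 0.126.

```python
import math


def leaf_delta(residuals, prev_probs):
    """Log-odds update for one leaf: sum of residuals over sum of p * (1 - p)."""
    numerator = sum(residuals)
    denominator = sum(p * (1 - p) for p in prev_probs)
    return numerator / denominator


def sigmoid(log_odds):
    """Convert a log(odds) value back into a probability."""
    return math.exp(log_odds) / (1 + math.exp(log_odds))


initial_log_odds = 0.7  # rounded value used in the text
learning_rate = 0.8

# The leftmost leaf in the worked example holds one sample:
# previous probability 0.7 and residual -0.7.
delta = leaf_delta([-0.7], [0.7])
new_log_odds = initial_log_odds + learning_rate * delta
new_probability = sigmoid(new_log_odds)

print(round(delta, 2), round(new_log_odds, 2), round(new_probability, 3))
```

When a leaf holds several samples, the same function sums their residuals and their $P * (1 - P)$ terms, exactly as in the equation for $\Delta Classification$.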


In [None]:
Data.insert(4, 'Pred Prob', pd.DataFrame([[0.9],[0.5],[0.5],[0.1],[0.9],[0.9]]))
Data['Residual'] = pd.DataFrame([[0.1],[0.5],[-0.5],[-0.1],[0.1],[0.1]])
Data

You can see that the residual is getting smaller and predictions are getting better! This process can be repeated until the predictions stop improving, or until it reaches the maximum number of trees we have specified.
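To show the repetition, here is a deliberately small sketch of the whole loop. It makes several simplifying assumptions that are not part of the walkthrough above: only the `Age` column is used, each "tree" is a single-split stump chosen by squared error (the helper `best_stump` is hypothetical), and the number of rounds is fixed at 5.

```python
import numpy as np


def best_stump(x, residual):
    """Return the threshold on x that minimises the squared error of the residuals."""
    values = np.unique(x)
    thresholds = (values[:-1] + values[1:]) / 2  # midpoints between sorted values

    def sse(t):
        left, right = residual[x <= t], residual[x > t]
        return ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()

    return min(thresholds, key=sse)


age = np.array([12, 87, 44, 19, 32, 14], dtype=float)
y = np.array([1, 1, 0, 0, 1, 1])  # Loves Troll 2 (Yes = 1, No = 0)
learning_rate = 0.8
n_trees = 5

# Start every sample at the log(odds) of the positive class.
log_odds = np.full(len(y), np.log(y.mean() / (1 - y.mean())))
history = []  # mean absolute residual before each round, plus the final value
for _ in range(n_trees):
    prob = 1 / (1 + np.exp(-log_odds))
    residual = y - prob
    history.append(np.abs(residual).mean())

    # Fit a one-split regression stump to the residuals, then update each leaf.
    threshold = best_stump(age, residual)
    for in_leaf in (age <= threshold, age > threshold):
        delta = residual[in_leaf].sum() / (prob[in_leaf] * (1 - prob[in_leaf])).sum()
        log_odds[in_leaf] += learning_rate * delta

prob = 1 / (1 + np.exp(-log_odds))
history.append(np.abs(y - prob).mean())
print(np.round(history, 3))
```

Printing `history` lets you watch how the average residual changes as more stumps are added.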

#### eXtreme Gradient Boost

In XGBoost, we apply a largely similar theory to gradient boost but with a few modifications.

1. Our initial prediction, instead of being computed from the formulas above, is usually a fixed default value, such as 0.5, and we go on from there.

2. Once we obtain the residual values, instead of splitting with our usual regression tree methods, we use a formula to calculate the **similarity score** of the values we put inside the same node, and we choose the split that gives the greatest gain in similarity score. The formula used is:

$$
Similarity\;Score = \frac{(\sum Residual_{i})^{2}}{\sum{(P_{i}^{prev} * (1 - P_{i}^{prev}))}+\lambda}
$$

The output of each leaf is then given by the formula

$$
Output = \frac{\sum Residual_{i}}{\sum{(P_{i}^{prev} * (1 - P_{i}^{prev}))}+\lambda}
$$

where $\lambda$ is our regularisation parameter; we will talk about the use of $\lambda$ later.

<img src='https://i.imgur.com/eDaBneB.png' width= 600>

3. We then calculate the gain, i.e. the increase in similarity score obtained by splitting a node into its children, and we use a chosen threshold $\gamma$ as our gauge for pruning. If the gain is lower than our $\gamma$, the branch is eliminated and the leaves are retracted.

4. Once we have our tree, we have to adjust our original value. Note that in this case our original value started as a probability, so to adjust it we must first convert it into a classification, which is given by

$$
Classification = \log\left(\frac{P}{1-P}\right)
$$

We then take the value obtained from the formula above and add to it the output of the leaf each sample falls into (scaled by the learning rate). This gives the new classification, which we convert back into the new probability and therefore the new residual. We then repeat the process until we reach the maximum number of trees or until the predictions stop improving.
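The four steps above can be traced on a tiny example. The sketch below is only an illustration under stated assumptions: the four residuals, the candidate split, $\lambda = 0$, $\gamma = 0.2$ and a learning rate of 0.3 are all made-up values, and the gain is taken to be the children's similarity scores minus the parent's.

```python
import math


def similarity(residuals, prev_probs, lam=0.0):
    """Similarity score: (sum of residuals)^2 / (sum of p * (1 - p) + lambda)."""
    cover = sum(p * (1 - p) for p in prev_probs)
    return sum(residuals) ** 2 / (cover + lam)


def leaf_output(residuals, prev_probs, lam=0.0):
    """Leaf output: sum of residuals / (sum of p * (1 - p) + lambda)."""
    cover = sum(p * (1 - p) for p in prev_probs)
    return sum(residuals) / (cover + lam)


def log_odds(p):
    """Convert a probability into a classification (log of the odds)."""
    return math.log(p / (1 - p))


def sigmoid(z):
    """Convert a classification (log of the odds) back into a probability."""
    return 1 / (1 + math.exp(-z))


# Hypothetical node: four samples, all starting from an initial probability of 0.5.
prev_probs = [0.5, 0.5, 0.5, 0.5]
residuals = [-0.5, 0.5, 0.5, -0.5]  # observed minus 0.5

# Candidate split: the first sample goes left, the remaining three go right.
left, right = slice(0, 1), slice(1, 4)
root_sim = similarity(residuals, prev_probs)
left_sim = similarity(residuals[left], prev_probs[left])
right_sim = similarity(residuals[right], prev_probs[right])
gain = left_sim + right_sim - root_sim

gamma = 0.2                # hypothetical pruning threshold
keep_split = gain > gamma  # the branch is pruned when the gain is lower than gamma

# Update the prediction of the sample that lands in the left leaf.
eta = 0.3                  # hypothetical learning rate
new_log_odds = log_odds(0.5) + eta * leaf_output(residuals[left], prev_probs[left])
new_probability = sigmoid(new_log_odds)

print(root_sim, left_sim, round(right_sim, 3), round(gain, 3), keep_split)
print(round(new_log_odds, 2), round(new_probability, 3))
```

Raising `lam` shrinks both the similarity scores and the leaf outputs, which makes splits harder to justify and each update smaller.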

#### Appendix