# Modeling Overview (Classification)

This is a set of personal notes on popular classification models used in data science, kept for future reference. The goal is to discuss the main features of a number of classification algorithms without going into too much detail, and to demonstrate how they are implemented in Python. Since this is intended for my own personal use, the questions and confusions I address will be unique to my own background and experience. Nonetheless, I hope that others studying to enter data science will find this a convenient reference in their own data science journeys. Questions and comments are always appreciated! For those who find this notebook useful, I also recommend referring to my notes on [regression algorithms](https://www.kaggle.com/michaelgeracie/modelling-overview-regression) and [clustering algorithms](https://www.kaggle.com/michaelgeracie/modelling-overview-clustering).

We will use the following conventions throughout. The training data consists of $m$ feature vectors $\mathbf x_j \in \mathbb R^n$, where $j \in \{ 1 , \dots , m \}$. We will sometimes denote feature vectors using the bold face vector notation just given, and sometimes use components. The $i$th feature of the $j$th training sample is then $x_{ij} = (\mathbf x_j )_i$ where $i \in \{ 1 , \dots , n\}$. The features may be real valued or categorical, and if the $i$th feature is categorical, that component will be labelled by some subset $ C \subseteq \mathbb Z$ of the integers.

To date we have covered the following algorithms. The treatment of each is not exhaustive and I hope to come back later to address important aspects of these models that I'm glossing over now. 

## 1) Naive Bayes
## 2) Logistic Regression
## 3) Decision Trees
## 4) Random Forests
## 5) $k$-Nearest Neighbors
## 6) Support Vector Classifiers
## 7) XGBoost
## 8) AdaBoost

A special thanks to Ken Jee whose notebook [Titanic Project Example](https://www.kaggle.com/kenjee/titanic-project-example) has helped many learners, including myself, begin their data science journeys by collecting a number of applied classification algorithms in one place. This is the starting point for this notebook, which will hopefully expand as I encounter more "in the wild".

The following is the standard code to start up a notebook

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # for creating plots

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

Now let's load some packages that we will be using throughout

In [None]:
#we import some preprocessing methods for train/test split and feature encoding, as well as an accuracy checker
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.metrics import accuracy_score

# <div id="naive">Naive Bayes</div>

## Summary

The principal source I used here was the Wikipedia article on [naive Bayes classifiers](https://en.wikipedia.org/wiki/Naive_Bayes_classifier).

Naive Bayes is a conditional probability model. That is, given a feature vector $\mathbf x$ it produces the conditional probability $P(y|\mathbf x)$ that the dependent variable $y$ has a given value, given that the feature vector is $\mathbf x$. To turn this into a classifier, we simply take the value of $y$ which maximizes $P(y|\mathbf x)$ .

The most straightforward thing we could do is to  estimate this probability directly from the training data as follows
\begin{align}
    P(y | \mathbf x ) = \frac{\text{no of samples with feature vector $\mathbf x$ and dependent variable $y$}}{\text{no of samples with feature vector $\mathbf x$}}
\end{align}

In practice, this is rarely feasible for the following reasons:

* We might want to make a prediction based on a novel collection of features $\mathbf x$ that isn't in the training data. For example, when a continuous $x_i$ takes a new value, or when $x_i$ and $x_j$ come in a pair not present in the training data.
* Even if $\mathbf x$ is in the training data, the numbers of such samples may be small and we might not expect the above estimate to be a good approximation to the true population value of $P(y | \mathbf x )$

These issues are especially likely if the number of features is large or if the features may take a large number of values. In particular, if there are continuous features we will almost always be querying $\mathbf x$'s that do not appear in the training data.

The idea behind a Naive Bayes model is to get around this by using Bayes' theorem to invert the conditional probability

\begin{align}
    P(y | \mathbf x) = \frac{P(\mathbf x | y) P(y)}{P(\mathbf x)}
\end{align}

and then calculate the numerator using the simplifying assumption that the various features in $\mathbf x$ are independent (conditional on $y$)

\begin{align}
    P(\mathbf x | y ) = P ( x_1 | y ) \cdots P ( x_n | y ) .
\end{align}

We then have

\begin{align}
    P(y | \mathbf x) \propto P ( x_1 | y ) \cdots P ( x_n | y ) P(y)
\end{align}

where we have thrown out the $y$-independent denominator $P(\mathbf x)$, which does not affect which $y$ maximizes the expression. The right hand side is less likely to have the problems mentioned above, since the estimate of each factor is based on a larger number of samples from the training data.
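To make the factorization concrete, here is a minimal sketch that estimates each factor from counts and normalizes at the end. The toy data and the helper name `naive_bayes_posterior` are made up for illustration; they are not part of any library.

```python
import numpy as np

# Hypothetical toy data: two binary features (columns) and a binary label
X = np.array([[0, 1], [0, 0], [1, 1], [0, 1],
              [1, 0], [1, 1], [0, 0], [1, 0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

def naive_bayes_posterior(X, y, x_query):
    """Estimate P(y | x_query) assuming the features are independent given y."""
    classes = np.unique(y)
    scores = []
    for c in classes:
        X_c = X[y == c]                      # training samples in class c
        prior = np.mean(y == c)              # estimate of P(y = c)
        # each P(x_i | y = c) is estimated separately from all class-c samples
        likelihoods = [np.mean(X_c[:, i] == x_query[i]) for i in range(X.shape[1])]
        scores.append(prior * np.prod(likelihoods))
    scores = np.array(scores)
    # dividing by the sum restores the y-independent constant we dropped
    return classes, scores / scores.sum()

classes, post = naive_bayes_posterior(X, y, np.array([0, 1]))
print(dict(zip(classes.tolist(), post.tolist())))
```

Each factor here is estimated from all four samples of a class, whereas the direct estimate of $P(y|\mathbf x)$ would rest only on the two samples with exactly this feature vector.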

One can test for dependence between two continuous variables by evaluating the correlation

\begin{align}
    \text{corr}^{(y)} (X_1,X_2) = \frac{E^{(y)}[(X_1 - \mu^{(y)}_{X_1})(X_2 - \mu^{(y)}_{X_2})]}{\sigma^{(y)}_{X_1} \sigma^{(y)}_{X_2}}
\end{align}

with the training data. Here the superscripts indicate that everything is conditional on $Y=y$. This is valued in $[-1,1]$, and if $X_1$ and $X_2$ are independent, the correlation must be zero (though the converse does not hold, so a vanishing correlation does not establish independence). Furthermore, the correlation is $\pm 1$ iff $X_1$ and $X_2$ are linearly related.
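As a rough sketch of how this check could be run, the snippet below computes the correlation separately within each class using `np.corrcoef`. The synthetic data and the helper name `conditional_corr` are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: x2 is a noisy linear function of x1, so the two are dependent
labels = rng.integers(0, 2, size=500)
x1 = rng.normal(size=500)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=500)

def conditional_corr(a, b, labels):
    """Pearson correlation of a and b, computed separately within each class."""
    return {int(c): float(np.corrcoef(a[labels == c], b[labels == c])[0, 1])
            for c in np.unique(labels)}

corr = conditional_corr(x1, x2, labels)
print(corr)

# an exactly linear relation gives correlation 1 (up to rounding)
exact = float(np.corrcoef(x1, 2 * x1 + 1)[0, 1])
print(exact)
```

A correlation far from zero within a class, as here, signals that the naive independence assumption is violated for that pair of features.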

Later: Correlation measures for two categorical variables and for one continuous and one categorical variable. It looks like a good discussion can be found [here](https://datascience.stackexchange.com/questions/893/how-to-get-correlation-between-two-categorical-variable-and-a-categorical-variab)

### Notes:
* Highly accurate when the independence assumption is justified and few of the estimated probabilities are near zero, but independence is a big assumption.
* The classifier may be trained in time linear in the number of features and samples.
* When the estimated probability of a feature value in a class is 0, some regularization (smoothing) is required, since a single zero factor sets the whole product to zero. Even when the estimate is merely very small, the model is highly sensitive to its value.
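One common form of the regularization mentioned in the last note is additive (Laplace) smoothing, which adds a pseudo-count `alpha` to every category before normalizing. The sketch below uses made-up counts and a hypothetical helper `smoothed_probs`.

```python
import numpy as np

def smoothed_probs(counts, alpha=1.0):
    """Additive smoothing: (count + alpha) / (total + alpha * number_of_categories)."""
    counts = np.asarray(counts, dtype=float)
    return (counts + alpha) / (counts.sum() + alpha * len(counts))

# hypothetical counts of three feature values within one class; the third never occurs
counts = [30, 10, 0]

raw = smoothed_probs(counts, alpha=0.0)     # plain relative frequencies
smooth = smoothed_probs(counts, alpha=1.0)  # Laplace smoothing

print('unsmoothed:', raw)
print('smoothed:  ', smooth)
```

Without smoothing the unseen value gets probability exactly zero, which would veto the class regardless of the other features. With `alpha=1` it gets a small but nonzero probability.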

### Example

As a simple example, let's use some data from the Titanic Kaggle competition, and try to use Naive Bayes to predict whether or not a passenger has survived based entirely off of what class ticket he purchased. We then have classes

$$C_\text{perished} = 0 , \qquad  C_\text{survived} = 1,$$

and we wish to calculate

$$P ( C_k | x ) \propto P(x | C_k) P(C_k)$$

for each $x \in \{1,2,3\}$.

In the following code cell we implement naive Bayes by hand to illustrate how it works

In [None]:
#read data and isolate only data on class and survival
#there are no missing values
titanic_data = pd.read_csv('/kaggle/input/titanic/train.csv')
X = titanic_data.Pclass.to_numpy().reshape(-1,1)
y = titanic_data.Survived.to_numpy()

#split into training and validation data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,random_state=0)

#now let's try to calculate the parameters of the model from the training data
#we first calculate the $P(C_k)$'s
#1 means survived, 0 means perished
p_c1 = y_train.mean()
p_c0 = 1 - p_c1

print('P(C0) = ',p_c0)
print('P(C1) = ',p_c1)
print()

#now calculate the $P(x|C_k)$'s
#first we give the training sets conditional on the survival status
X_train_c0 = X_train[y_train == 0]
X_train_c1 = X_train[y_train == 1]

#and then extract conditional probabilities from it
p_1_given_c0 = np.mean(X_train_c0 == 1)
p_2_given_c0 = np.mean(X_train_c0 == 2)
p_3_given_c0 = np.mean(X_train_c0 == 3)

p_1_given_c1 = np.mean(X_train_c1 == 1)
p_2_given_c1 = np.mean(X_train_c1 == 2)
p_3_given_c1 = np.mean(X_train_c1 == 3)

#display relevant calculations
print('P(x=1|C0) = ',p_1_given_c0)
print('P(x=2|C0) = ',p_2_given_c0)
print('P(x=3|C0) = ',p_3_given_c0)
print()
print('P(x=1|C1) = ',p_1_given_c1)
print('P(x=2|C1) = ',p_2_given_c1)
print('P(x=3|C1) = ',p_3_given_c1)
print()
print('Input x = 1: ',
      '\n\tP(C_0 | x=1) ~ ',p_1_given_c0*p_c0,
      '\n\tP(C_1 | x=1) ~ ',p_1_given_c1*p_c1)
print('Input x = 2: ',
      '\n\tP(C_0 | x=2) ~ ',p_2_given_c0*p_c0,
      '\n\tP(C_1 | x=2) ~ ',p_2_given_c1*p_c1)
print('Input x = 3: ',
      '\n\tP(C_0 | x=3) ~ ',p_3_given_c0*p_c0,
      '\n\tP(C_1 | x=3) ~ ',p_3_given_c1*p_c1)

Hence we predict that a first class passenger survives, but second and third class passengers perish.

Now, we implement this using sklearn in the code below. We see that it exactly matches our by-hand calculation.

In [None]:
#training the model and predicting
#the alpha parameter is an additive smoothing parameter for the case that some feature value never appears with some class in the training data
#we set this to a tiny value (effectively zero) since this does not happen here
from sklearn.naive_bayes import CategoricalNB
clf = CategoricalNB(alpha=1.0e-10)
clf.fit(X_train,y_train)

#the parameters of the model
print('Probabilities p(C0) and p(C1): ')
print('\t',[np.exp(x) for x in clf.class_log_prior_.tolist()])
print('Probabilities of class 1, 2, or 3 given C0: ')
print('\t',[np.exp(x) for x in clf.feature_log_prob_[0][0].tolist()])
print('Probabilities of class 1, 2, or 3 given C1: ')
print('\t',[np.exp(x) for x in clf.feature_log_prob_[0][1].tolist()])

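The first entry in each of the last two lists corresponds to the category $x = 0$: `CategoricalNB` assumes each feature is encoded by the integers $0, 1, 2, \dots$, and since `Pclass` takes the values 1, 2, 3, category 0 never occurs and its probability is essentially zero. Let's also make some sample predictions and see how the model did for illustration's sake. Below we run the model on the validation data, displaying the predicted probability that an individual survives, the prediction itself, and whether or not that person did. We then evaluate the overall accuracy.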

In [None]:
#predicting on some sample data
y_pred = clf.predict(X_test)
predictions = pd.DataFrame({'class':X_test.flatten(),'predict_proba':clf.predict_proba(X_test)[:,1],'predict':y_pred,'survived':y_test})
print('Some sample predictions: ')
print(predictions)
print()

#evaluating the accuracy
print('Accuracy: ',accuracy_score(y_test,y_pred))

## Gaussian Naive Bayes

The above example was performed without any assumptions on the form of the conditional distributions $P(x_i|C_k)$. Rather, we calculated these conditional probabilities using the training data. This was easy to do since the feature data was discrete with few classes. However, in many cases this will not be so easy. For instance, suppose we are given the training data below (this example is pulled from [here](https://chrisalbon.com/machine_learning/naive_bayes/naive_bayes_classifier_from_scratch/)) and wish to use a naive Bayes classifier to predict the gender of a person of given height, weight, and foot size. Note these are not anticipated to be statistically independent, so naive Bayes may be a poor model, but it works for instructional purposes. Here 0 means male and 1 female

In [None]:
# data entry
data = pd.DataFrame()
data['Height'] = [6,5.92,5.58,5.92,5,5.5,5.42,5.75]
data['Weight'] = [180,190,170,165,100,150,130,150]
data['Foot_Size'] = [12,11,12,10,6,8,7,9]
data['Gender'] = [0,0,0,0,1,1,1,1]

# View the data
data

Since the features are continuous, we need some model for $P(x_i |C_k)$ in order to get its value for any possible $x_i$. In this case, it's reasonable to assume all features are gaussian distributed and select their means and variances according to the training data. We are then postulating

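$$p(x_i | C_k) = \frac{1}{\sqrt{2 \pi \sigma^2_{ki}}}e^{- \frac{(x_i - \mu_{ki})^2}{2 \sigma^2_{ki}}}$$

where $\mu_{ki}$ and $\sigma^2_{ki}$ are the mean and Bessel corrected variance from the training data for the feature $x_i$ conditioned on $C_k$. Note that `GaussianNB` in `sklearn` estimates the variance without Bessel's correction (dividing by the number of samples) and adds a small smoothing term, so its fitted parameters differ slightly from these on small data sets. Of course, for other problems, different probability distributions will be relevant.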
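As a small numeric check of this formula, the sketch below evaluates the Gaussian densities and the unnormalized posterior for a single query point. It re-enters the training data from above so that it runs on its own. The query point (height 6, weight 130, foot size 8) and the helper name `gaussian_nb_scores` are chosen here for illustration.

```python
import numpy as np

# the training data from above: columns are height, weight, foot size
X = np.array([[6.00, 180, 12], [5.92, 190, 11], [5.58, 170, 12], [5.92, 165, 10],
              [5.00, 100, 6], [5.50, 150, 8], [5.42, 130, 7], [5.75, 150, 9]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # 0 = male, 1 = female

def gaussian_nb_scores(X, y, x_query):
    """Unnormalized posteriors P(C_k) * prod_i p(x_i | C_k) with Gaussian factors."""
    scores = {}
    for c in np.unique(y):
        X_c = X[y == c]
        mu = X_c.mean(axis=0)
        var = X_c.var(axis=0, ddof=1)   # Bessel corrected variance, as in the text
        dens = np.exp(-(x_query - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        scores[int(c)] = float(np.mean(y == c) * np.prod(dens))
    return scores

scores = gaussian_nb_scores(X, y, np.array([6.0, 130.0, 8.0]))
prediction = max(scores, key=scores.get)
print(scores, '-> predicted class', prediction)
```

The class with the larger score is the prediction. Dividing each score by their sum would give the posterior probabilities.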

We implement this procedure using `sklearn` in the example below rather than carrying out the full procedure by hand.

In [None]:
#setting up the model
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()

#formatting the training data
X_train = data[['Height','Weight','Foot_Size']]
y_train = [0,0,0,0,1,1,1,1]

#fitting the model
clf.fit(X_train,y_train)

#how to retrieve properties of the model
print('Summary of parameters of our model: ')
print('\tClasses to predict: ', clf.classes_)
print('\tPrior on each class: ', clf.class_prior_)
print('\tMeans for feature in each class: ')
print(clf.theta_)
#var_ holds variances, not standard deviations (it was called sigma_ before scikit-learn 1.0)
print('\tVariances for feature in each class: ')
print(clf.var_)

We don't really have enough data to perform a good train/test split, so just for illustrative purposes let's put in some data by hand and see how to retrieve predictions.

In [None]:
#predicting on some sample data
print('Some sample predictions: ')
test_data = pd.DataFrame({'Height':[5.67,5],'Weight':[145,110],'Foot_Size':[12,7]})
test_results = pd.DataFrame({'gender':clf.predict(test_data)})
print(test_data.join(test_results))

# <div id='logit'>Logistic Regression</div>


Logistic regression models are used to make binary categorical predictions $Y \in \{0,1\}$. The basic idea is we model $Y$ as a Bernoulli random variable with some probability $p$ for 'success' 1

$$P(y)=p^y(1-p)^{1-y} .$$

The parameter $p$ will then depend on the features $x_i$. These independent variables may be either continuous or categorical, but if they are categorical they should be binary. Our goal is to then find some reasonable function $p(x_i)$. A good starting point is

$$\ln \left( \frac{p}{1-p} \right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n$$

that is, the log-odds are linear in the features. $p$ as a function of the log-odds is shown below

In [None]:
logit = np.arange(-5.0, 5.0, 0.01)
p = 1/(1+np.exp(-logit))
plt.plot(logit, p)

plt.xlabel('logit(p)')
plt.ylabel('p')
plt.grid(True)
plt.show()

This is a reasonable formula since both the left and right hand sides range over all of $\mathbb R$. However, the linearity assumption is a big one. We can solve for $p$

$$p = \frac{1}{1 + e^{- \beta^T x}},$$

where we have denoted $x= ( 1 ~ x_1 ~\dots~ x_n)^T$ and $\beta = ( \beta_0 ~ \beta_1 ~ \dots ~ \beta_n)^T$.
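For completeness, here are the intermediate steps in solving for $p$, writing $z = \beta^T x$ for the log-odds (the symbol $z$ is introduced only for this derivation):

\begin{align}
    \ln \left( \frac{p}{1-p} \right) = z
    \;\Longrightarrow\; \frac{p}{1-p} = e^{z}
    \;\Longrightarrow\; p \left( 1 + e^{z} \right) = e^{z}
    \;\Longrightarrow\; p = \frac{e^{z}}{1 + e^{z}} = \frac{1}{1 + e^{-z}} .
\end{align}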

A choice of $\beta$ that best fits the data is given by maximizing the log-likelihood (equivalently, minimizing the negative log-likelihood) of the set of observations provided in the training set. See the [documentation](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression) for more details along with a discussion of regularization techniques to prevent overfitting. These are important, but we will add a discussion of them at a later date.
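To illustrate what the fit is doing, here is a minimal sketch that minimizes the negative log-likelihood with Newton's method on a tiny data set. This is not necessarily the solver `sklearn` uses. The toy data and the function names are assumptions made for illustration.

```python
import numpy as np

# hypothetical, non-separable toy data: one feature, binary outcome
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0, 0, 1, 0, 1, 1])
X = np.column_stack([np.ones_like(x), x])   # prepend a column of 1s for beta_0

def neg_log_likelihood(beta, X, y):
    """-sum_j [y_j ln p_j + (1 - y_j) ln(1 - p_j)], written in terms of z = X beta."""
    z = X @ beta
    return float(np.sum(np.log1p(np.exp(z)) - y * z))

def fit_logistic_newton(X, y, n_iter=25):
    """Unregularized maximum likelihood fit via Newton's method."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (p - y)                        # gradient of the negative log-likelihood
        hess = X.T @ (X * (p * (1 - p))[:, None])   # Hessian of the negative log-likelihood
        beta = beta - np.linalg.solve(hess, grad)
    return beta

beta_hat = fit_logistic_newton(X, y)
print('beta:', beta_hat)
print('negative log-likelihood:', neg_log_likelihood(beta_hat, X, y))
```

At the fitted $\beta$ the gradient $X^T (p - y)$ vanishes, which is the condition for a maximum of the likelihood. If the classes were perfectly separable, no finite maximizer would exist, which is one reason regularization matters in practice.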

## Example
We work through the example found [here](https://en.wikipedia.org/wiki/Logistic_regression#Examples). In this example we are given data on how long 20 students study for a test and whether they passed or not. We fit a logistic regression model, which allows us to obtain the probability that a student passes as a function of how long they study. Finally, we show how to retrieve the relevant parameters of the model

In [None]:
# data entry
data = pd.DataFrame()
data['Hours_Studied'] = [0.50,0.75,1.00,1.25,1.50,1.75,1.75,2.00,2.25,2.50,2.75,3.00,3.25,3.50,4.00,4.25,4.50,4.75,5.00,5.50]
data['Pass'] = [0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1]

#setting up the model
from sklearn.linear_model import LogisticRegression
#penalty sets the regularization used when finding an MLE for the betas; we found that regularization messed with the results too much
#we could alternatively keep regularization but set the regularization parameter C to be very large
model = LogisticRegression(penalty=None) #use penalty='none' on scikit-learn versions older than 1.2

#formatting the training data
X_train = data[['Hours_Studied']]
y_train = [0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1]

#fitting the model
model.fit(X_train,y_train)

#how to retrieve properties of the model
print('Summary of parameters of our model: ')
print('\tIntercept: ', model.intercept_)
print('\tCoefficients: ', model.coef_)

Let's take a look at how the model assigns probabilities versus hours studied.

In [None]:
x = np.arange(0.0, 6.0, 0.01).reshape(-1,1)
logit = model.decision_function(x)
p = 1/(1+np.exp(-logit))
plt.plot(x, p)
plt.plot(X_train,y_train,'o')

plt.xlabel('Hours Studied')
plt.ylabel('Prob of Success')
plt.grid(True)
plt.show()

Note that the model finds a probability distribution function that fits the data best. If we want to actually have a classifier, we need to set a threshold probability, above which we predict success, and below which we predict failure. It's easy to retrieve this from `model.decision_function()`, which is simply the function $\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n$. Selecting 1 when this is positive and 0 when it is negative gives a classifier with threshold probability $ 1 / 2$. To get another threshold probability we simply need to shift the intercept $\beta_0$. The method `model.predict()` does this without any shift. Here's a demonstration of how to retrieve some predictions on sample data.

In [None]:
#predicting on some sample data

print('Some sample predictions: ')
test_data = pd.DataFrame({'Hours_Studied':[1,2,3,4,5]})
test_probs = pd.DataFrame({'Pass_Probability':model.predict_proba(test_data)[:,1]})
test_results = pd.DataFrame({'Pass_Prediction':model.predict(test_data)})
print(test_data.join(test_probs).join(test_results).set_index('Hours_Studied'))

# <div id='tree'>Decision Trees</div>

A decision tree is essentially a flow chart where each node represents a binary choice based off some feature. Each branch coming from a node represents the decision taken. The terminal nodes are called leaves and label which category a given input is predicted to be in. Example taken from [here](https://towardsai.net/p/programming/decision-trees-explained-with-a-practical-example-fe47872d3b53)

<img src="https://cdn-images-1.medium.com/max/824/0*J2l5dvJ2jqRwGDfG.png" width="400px">

See [here](https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart) for a more detailed discussion of the decision tree algorithm used by `sklearn`.

In `sklearn`, decision trees are constructed from a data set via the CART (Classification and Regression Tree) algorithm which we now outline. In words, this is a "top-down" algorithm that iteratively produces a binary splitting at each node, starting from the top or "root" node containing all data, so as to maximize the amount of information gained at each node.

Though this algorithm works for an arbitrary finite classification problem, we will focus on binary classification.

## The Algorithm

Let $x_{ij}$ for $i = 1 , . . . , n$ and $j=1,...,m$ be a set of $n$ numerical features for $m$ (training) data points. For binary categorical features $i$, $x_{ij}$ is a 0 or a 1 for all data points $j$. Let $y_j$ for $j=1,...,m$ be the dependent binary categorical variable for each sample, again encoded as a 0 or 1. The procedure is to go through the data and select features $i$ and "splits" in that feature $\theta_i$ such that there is maximum "information gain" in the split according to some measure.

More precisely, suppose we are given a binary tree. If $Q \subseteq \{ 1,...,m\}$ is a subset of samples present at some leaf, and $\theta = (i,\theta_i)$ is a given split, then we create two branches, one leading to $Q_\text{left} (\theta)$ and the other to $Q_\text{right}(\theta)$ where

$$Q_\text{left} (\theta) = \{ j \in Q | x_{ij} \leq \theta_i \} ,$$
$$Q_\text{right} (\theta)= \{ j \in Q | x_{ij} > \theta_i \} .$$

We then evaluate "gain", or the weighted average

$$G(Q,\theta)= H(Q) - \frac{|Q_\text{left} (\theta)|}{|Q|} H (Q_\text{left} (\theta)) - \frac{|Q_\text{right} (\theta)|}{|Q|} H (Q_\text{right} (\theta))$$

of some attribute selection measure (ASM) $H(Q)$, discussed below. We iterate through all splittings (we will not discuss the details of how these are chosen), choose the splitting $\theta = (i , \theta_i)$ that maximizes $G(Q,\theta)$, and create a new tree where the leaf $Q$ is replaced by this binary node.

To create the tree, we start with a single "root" node $Q = \{ 1,...,m\}$ that contains the entire data set and apply the above algorithm iteratively until one of the following holds:

* we run out of features
* every leaf is homogeneous, i.e. the data contained in each leaf is all of a single class
* a pruning condition has been reached (to be discussed)
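The search for a single split can be sketched as follows. This is an illustration under stated assumptions, not the `sklearn` implementation: the impurity $H$ is taken to be the Gini index (defined in the next section), candidate thresholds are assumed to be the midpoints between consecutive distinct feature values, and the array `X` is stored with one row per sample, so `X[j, i]` plays the role of $x_{ij}$.

```python
import numpy as np

def gini(labels):
    """Gini index of a set of 0/1 labels (taken to be 0 for an empty set)."""
    if len(labels) == 0:
        return 0.0
    p = np.bincount(labels, minlength=2) / len(labels)
    return float(1.0 - np.sum(p ** 2))

def gain(x, y, threshold, impurity=gini):
    """G(Q, theta): impurity of the node minus the weighted impurity of its children."""
    left, right = y[x <= threshold], y[x > threshold]
    w_left, w_right = len(left) / len(y), len(right) / len(y)
    return impurity(y) - w_left * impurity(left) - w_right * impurity(right)

def best_split(X, y, impurity=gini):
    """Return (feature index, threshold, gain) of the split with the largest gain."""
    best = (None, None, -np.inf)
    for i in range(X.shape[1]):
        values = np.unique(X[:, i])
        # assumed candidate thresholds: midpoints between consecutive distinct values
        for t in (values[:-1] + values[1:]) / 2:
            g = gain(X[:, i], y, t, impurity)
            if g > best[2]:
                best = (i, float(t), g)
    return best

# hypothetical node with four samples and two features
X = np.array([[2.0, 0.0], [3.0, 1.0], [10.0, 0.0], [11.0, 1.0]])
y = np.array([0, 0, 1, 1])
split = best_split(X, y)
print(split)
```

Growing the full tree amounts to applying `best_split` recursively to the left and right subsets until one of the stopping conditions above is met.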

## Attribute Selection Measures (ASM)

Let's discuss the purity/impurity measure $H(Q)$ used to select the splitting at each node.


### The Gini index
The Gini index of a node $Q$ is defined to be

$$H(Q) = \sum_{k=0,1} p(k|Q) ( 1 - p(k|Q) )= 1 - p(0|Q)^2 - p(1|Q)^2$$

where $k$ labels the classification outcome and $p(k|Q)$ is the probability that an observation falling in the node $Q$ has classification outcome $k$

$$p(k|Q) = \frac{\text{# of training observations at node $Q$ that are of type $k$}}{\text{# of training observations in the node $Q$}}  .$$

More precisely, it is the estimation of the probability from the training data.


The Gini index is a measure of how homogeneous the node is, i.e. it is a function of the training data that takes value 0 when a node contains observations all with a single outcome, and is a maximum when the probability of being in a given category is $ 1 /2$. Splits in the decision tree are chosen to maximize the gain $G(Q,\theta)$, or equivalently to minimize the weighted average impurity of the two child nodes.

In [None]:
p_0 = np.arange(0, 1.0, 0.01)
g = 1 - p_0 ** 2 - (1 - p_0) ** 2
plt.plot(p_0, g)

plt.xlabel('p_0')
plt.ylabel('Gini index')
plt.grid(True)
plt.show()

### The entropy

The entropy of a node $Q$ is defined to be

$$ H(Q) = - \sum_{k=0,1} p ( k | Q ) \log_2 p ( k | Q ) .$$

Again, we create decision nodes so as to maximize the gain $G(Q,\theta)$. The entropy has the same features as the Gini index that made it a good measure of homogeneity, or "information". It does, however, have strong theoretical advantages which we will not discuss here.

In [None]:
p_0 = np.arange(0.001, 1.0, 0.01)
E = - p_0 * np.log2(p_0) - (1-p_0) * np.log2(1-p_0)
plt.plot(p_0, E)

plt.xlabel('p_0')
plt.ylabel('E')
plt.grid(True)
plt.show()

## Pruning 
With large data sets and enough features it is possible for the decision tree to overfit the data. In the most extreme case, the tree may simply memorize the training set with one leaf per training observation $(x_{ij}, y_j)$. Hence we will want to "prune" the tree by removing branches. This is easily done with the `DecisionTreeClassifier` class of `sklearn` by passing it arguments that give the algorithm an earlier stopping condition. Some examples of these are

* `min_samples_leaf`: A split will only be considered if it leaves at least `min_samples_leaf` in each of the left and right branches. The default value is 1.
* `min_samples_split`: The minimum number of samples required at a node for the algorithm to consider splitting it. The default is 2.
* `min_impurity_decrease`: The algorithm will only split a node if the impurity of the node is decreased by at least this amount. Default is 0.
* `max_features`: Considers at most this number of features when creating a split. These features are randomly selected from the set of remaining features. Default is `None`.
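To see the effect of one of these parameters, the sketch below fits two trees to a synthetic data set, one with the defaults and one with `min_samples_leaf=10`, and compares their sizes. The data set produced by `make_classification` and the particular parameter values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# hypothetical noisy data set (flip_y randomly flips a fraction of the labels)
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           flip_y=0.1, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(min_samples_leaf=10, random_state=0).fit(X, y)

for name, t in [('default', full), ('min_samples_leaf=10', pruned)]:
    print(name, '| leaves:', t.get_n_leaves(), '| depth:', t.get_depth(),
          '| training accuracy:', t.score(X, y))
```

The default tree keeps splitting until it fits the noisy training labels, while the constrained tree stops earlier and ends up with far fewer leaves. Which setting generalizes better has to be checked on held-out data.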

## Example

In this example we take data on weather conditions and whether or not a scheduled tennis match was held that day. We show how to encode the data numerically so that the model can be trained, how to train the model, and make some sample predictions. This example was found [here](https://towardsdatascience.com/understanding-decision-tree-classification-with-scikit-learn-2ddf272731bd). First read the data.

In [None]:
#reading the example data on tennis matches
df = pd.read_csv('/kaggle/input/play-tennis/play_tennis.csv')
print('Shape: ', df.shape)
print('Sample of the training data:')
print(df.head())

Now we prepare the data, encoding the categorical data numerically since the decision tree classifier requires numerical inputs for making splits.

In [None]:
#break the data up into features and dependent variable
indep_vars = ['outlook','temp','humidity','wind']
X = df[indep_vars]
y = df['play']

#perform a train/test split
#unfortunately the dataset is pretty small so there isn't much room for validation
X_train, X_test, y_train, y_test = train_test_split(X,y,train_size=9,random_state=0)

#encode the categorical data numerically
#sparse_output and get_feature_names_out require scikit-learn 1.2 or later (older versions: sparse, get_feature_names)
X_encoder = OneHotEncoder(sparse_output = False)
X_train_enc = pd.DataFrame(X_encoder.fit_transform(X_train),
                           columns = X_encoder.get_feature_names_out(indep_vars) )
X_test_enc = pd.DataFrame(X_encoder.transform(X_test),
                           columns = X_encoder.get_feature_names_out(indep_vars) )

y_encoder = LabelEncoder()
y_train_enc = pd.DataFrame(y_encoder.fit_transform(y_train),columns = ['play'])
y_test_enc = pd.DataFrame(y_encoder.transform(y_test),columns = ['play'])

Now that the data is encoded we train the model and make some predictions that we check against the test data. Note that we have decided to prune the tree, as the dataset is quite small and a decision tree classifier can easily memorize it. It's worth playing around with this parameter. For example, if we use `min_samples_leaf=1`, i.e. no pruning, one sees below that the tree simply memorizes the dataset. Moreover, the accuracy is 0.0 when applied to the test data! However, with `min_samples_leaf=3` we get a tree with only a single decision node, but an accuracy of 0.6 on the test data.

In [None]:
#building and fitting the model
#criterion is the attribute selection measure discussed above
#without some pruning the model simply memorizes the dataset
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(criterion='entropy',min_samples_leaf=3)
tree.fit(X_train_enc,y_train_enc)

#predicting whether or not we play using the decision tree
#since we used pruning, the leaves do not have homogeneous outputs
#hence, predict_proba tells you the probability of falling into each class once you've followed your way through the tree
y_pred_proba = pd.DataFrame(tree.predict_proba(X_test_enc)[:,1],columns=['play_proba'])
y_pred = pd.DataFrame(y_encoder.inverse_transform(tree.predict(X_test_enc)),columns=['prediction'])

#display results
X_test = pd.DataFrame(X_encoder.inverse_transform(X_test_enc),columns=indep_vars)
y_test = pd.DataFrame(y_encoder.inverse_transform(np.ravel(y_test_enc)),columns=['play'])
accuracy = accuracy_score(y_test,y_pred)

print('Predictions:')
print(X_test.join(y_pred_proba).join(y_pred).join(y_test))
print()
print('Accuracy: ',accuracy)

Here we demonstrate how to graphically display the decision tree using the `plot_tree` function. If the tree is grown without pruning (`min_samples_leaf=1`), all of the leaves are "pure", that is, all the training observations belonging to a particular leaf are of the same type. For large data sets this is a sign of overfitting and we may want to prune the tree, as we have done here.

In [None]:
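#as well as displaying them
from sklearn.tree import plot_tree

fig = plt.figure(figsize=(10,8))
#class_names must be listed in the order of the encoded labels, which is the order stored in y_encoder.classes_
#get_feature_names_out requires scikit-learn 1.0 or later (older versions: get_feature_names)
plot_tree(tree,
          feature_names=list(X_encoder.get_feature_names_out(indep_vars)),
          class_names=[str(c) for c in y_encoder.classes_],
          filled=True)
plt.show()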

# Random Forests

A random forest classifier is an ensemble classifier constructed from a large number of randomly generated decision tree classifiers. This randomness greatly reduces variance in the predictions at a slight cost in bias and so is a good way to prevent overfitting. Randomness is introduced to the model in two ways:

1) Each tree is trained only on a randomly selected subset of the training data, sampled with replacement. This is called "bootstrapping". Sub-sample size is controlled with the `max_samples` parameter if `bootstrap=True`. This can be an integer number of samples, or a float between 0 and 1, in which case it is interpreted as a fraction of samples to use from the training data.

2) In training each tree, only use a randomly selected subset of features of size `max_features` when deciding on how to split a node.

The predictions of the individual decision trees are then combined by averaging the class probabilities that each tree assigns to the input. The class with the largest averaged probability (the mode of the resulting probability distribution) is selected as the prediction. A more detailed discussion of random forests can be found [here](https://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees).
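The averaging step can be checked directly against `sklearn`: the sketch below averages the per-tree probabilities stored in `estimators_` and compares the result with the forest's own output. The synthetic data and parameter values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# hypothetical data set
X, y = make_classification(n_samples=200, n_features=5, n_informative=3, random_state=0)

# an odd number of trees avoids exact ties between the two classes in this setup
forest = RandomForestClassifier(n_estimators=25, max_samples=0.5, max_features=2,
                                random_state=0).fit(X, y)

# class probabilities from each individual tree, shape (n_trees, n_samples, n_classes)
tree_probas = np.stack([t.predict_proba(X) for t in forest.estimators_])

# average over trees, then pick the class with the largest averaged probability
avg_proba = tree_probas.mean(axis=0)
manual_pred = forest.classes_[np.argmax(avg_proba, axis=1)]

print('matches predict_proba:', np.allclose(avg_proba, forest.predict_proba(X)))
print('matches predict:      ', np.array_equal(manual_pred, forest.predict(X)))
```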

## Example

Let's demonstrate how to use `RandomForestClassifier` from `sklearn`. Random forests are best used when there are large amounts of data and a danger of overfitting, so let's use the Titanic dataset again and try to predict survival using both bootstrapping and `max_features`. The number of trees generated is `n_estimators`.

In [None]:
from sklearn.ensemble import RandomForestClassifier

#reading the titanic data
#since the point of this is to provide a simple example, we only keep those features that don't need much engineering
#for same reason, don't do any imputing and just drop all rows with null values
df = pd.read_csv('/kaggle/input/titanic/train.csv')
df = df[['Age','SibSp','Parch','Fare','Pclass','Sex','Embarked','Survived']].dropna()
X_num = df[['Age','SibSp','Parch','Fare']]
X_cat = df[['Pclass','Sex','Embarked']]

#to plug this into a random forest classifier, which requires numeric inputs, we need to do some feature encoding
#sparse_output and get_feature_names_out require scikit-learn 1.2 or later (older versions: sparse, get_feature_names)
X_encoder = OneHotEncoder(sparse_output=False)
X_cat_enc = pd.DataFrame(X_encoder.fit_transform(X_cat),
                           columns = X_encoder.get_feature_names_out(['Pclass','Sex','Embarked']) )

#combine the categorical features with the numerical ones to create a single dataframe of training data
#recall the index is meaningful after having dropped na's, so we need to make them consistent before joining
X_cat_enc.index = X_num.index
X = X_num.join(X_cat_enc)
y = np.ravel(df[['Survived']])

#perform a train/test split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=0)

#training a random forest classifier with n_estimators=100 randomly generated trees
#note bootstrap = True by default
model = RandomForestClassifier(n_estimators=100,max_samples=0.2,max_features=3,random_state=0)
model.fit(X_train,y_train)

#now let's see how accurate the predictions are
y_pred = pd.DataFrame(model.predict(X_test),columns = ['Survived'])
score = accuracy_score(y_pred,y_test)
print('Percentage correct in our model: ',score * 100 , '%')

# $k$-Nearest Neighbors

For $k$-nearest neighbors, the features may be continuous or categorical, in which case the feature is encoded by integers. The dependent variable $y$ is categorical and can lie in some set $C = \{ 1, \dots , N \}$. Now suppose we are given a data point with feature vector $\mathbf x$. The $k$-nearest neighbors algorithm classifies this data point by comparing it with samples in the training data that are closest to it in some sense. When the features are continuous, we typically use the euclidean metric to measure distance. When they are categorical, a good measure is the Hamming distance, which simply adds up how many features are different. Given $k$, the algorithm computes which $k$ training points $\mathbf x_{j_a}$, $a = 1,...,k$ are the closest to $\mathbf x$ in $\mathbb R^n$ and then makes a decision as to which class the test sample lies in by taking a weighted sum

$$P(y= p | \mathbf x) = \sum_{a=1}^k w_a I_{p}(\mathbf x_{j_a}) .$$

Here $I_{p} (\mathbf x_j )$ is an indicator function telling whether or not a given point $\mathbf x_j$ from the training set has $y_j = p$ and $w_a$ is a probability measure so that $\sum_{a=1}^k w_a = 1$. This then gives a probability that the point $\mathbf x$ has $y = p$. The classifier then returns the category with the greatest probability.

There are many different weightings one may choose. The two most common are the uniform weighting where $w_a = 1/k$ for all $a$ and the distance weighting where $w_a$ is proportional to the inverse of the euclidean distance of $\mathbf x_{j_a}$ from $\mathbf x$.
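To make the weighted vote concrete, here is a minimal from-scratch sketch of the formula for $P(y = p | \mathbf x)$ with both weightings. The helper `knn_class_probabilities` and the toy data are hypothetical (they are not part of `sklearn`), and I assume the classes are encoded as $0, \dots, N-1$ as in the iris example rather than $1, \dots, N$.

```python
import numpy as np

def knn_class_probabilities(X_train, y_train, x, k=3, weights="uniform", n_classes=None):
    """Return P(y = p | x) for every class p using the k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    if n_classes is None:
        n_classes = int(y_train.max()) + 1

    # euclidean distance from x to every training point
    dist = np.linalg.norm(X_train - np.asarray(x, dtype=float), axis=1)
    nearest = np.argsort(dist)[:k]

    if weights == "uniform":
        w = np.full(k, 1.0 / k)
    else:
        # inverse-distance weights, normalized so that they sum to one
        # (the small constant guards against division by zero; an assumption of this sketch)
        inv = 1.0 / (dist[nearest] + 1e-12)
        w = inv / inv.sum()

    # the weighted sum of indicator functions
    proba = np.zeros(n_classes)
    for w_a, j in zip(w, nearest):
        proba[y_train[j]] += w_a
    return proba

# toy data: two classes in the plane
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [3.0, 3.0], [3.0, 4.0]])
y_toy = np.array([0, 0, 1, 1, 1])

p_uniform = knn_class_probabilities(X_toy, y_toy, [0.2, 0.1], k=3, weights="uniform")
p_distance = knn_class_probabilities(X_toy, y_toy, [0.2, 0.1], k=3, weights="distance")
print("uniform weighting: ", p_uniform)
print("distance weighting:", p_distance)
```

With the distance weighting, the very close neighbor at the origin dominates the vote, so the probability of class 0 is larger than with the uniform weighting.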

The distance measure is also highly customizable. For example, scaling the coordinates can make certain features more or less important. With categorical variables using the Hamming distance, we may want to weight certain features more than others. In more sophisticated applications we may want to "learn" the best distance measure for a model, but a good starting point is to scale all variables to have mean 0 and unit variance so the model weights them approximately equally.
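As a sketch of the "mean 0 and unit variance" starting point, one option is to chain a `StandardScaler` with the classifier in a pipeline. The choice of all four iris features, $k=5$ and the distance weighting is mine, purely for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_iris, y_iris, test_size=0.2, random_state=0)

# the scaler is fit on the training data only and then applied to any data passed to the pipeline
scaled_knn = make_pipeline(StandardScaler(),
                           KNeighborsClassifier(n_neighbors=5, weights="distance"))
scaled_knn.fit(X_tr, y_tr)

scaled_accuracy = scaled_knn.score(X_te, y_te)
print("Accuracy with standardized features:", 100 * scaled_accuracy, "%")
```

Putting the scaler inside the pipeline means the test set never influences the means and variances used for scaling.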

Our source for details on the $k$-nearest neighbors algorithm can be found [here](https://scikit-learn.org/stable/modules/neighbors.html#).
We will work through an example with continuous features and three-category classification originally found [here](https://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html#sphx-glr-auto-examples-neighbors-plot-classification-py).

## Example

In this example, we demonstrate how to use the `KNeighborsClassifier` from `sklearn` on the iris dataset provided by `sklearn`. The iris dataset gives the sepal length, sepal width, petal length, and petal width of 150 irises and which species of iris that flower is from. The data set only includes data from three species, Setosa, Versicolour, and Virginica, encoded as 0, 1, and 2 respectively. For ease of presentation, we will only use two of these features. We begin by loading the data

In [None]:
#import the data from the sklearn prepackaged datasets
from sklearn import datasets
iris = datasets.load_iris()

#package this into a dataframe with column labels to keep track of what information is what
iris_features = pd.DataFrame(iris.data,
                                columns = ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width'])
iris_target = pd.DataFrame(iris.target,
                            columns = ['Species'])
print('The iris dataset:')
print(iris_features.join(iris_target))

Now let's prepare our data for modelling

In [None]:
#We intend to use only the first two features for our model
X = iris.data[:,:2]
y = iris.target

#creating a train/test split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=0)

Now let's create our $k$-nearest neighbors model. One can play around with $k$ as well as the weighting to see how this affects the model. We find it particularly easy to see the behavior of the model in the upcoming graphic when we take $k=1$, a nearest neighbor model.

In [None]:
#import the relevant method from sklearn
from sklearn.neighbors import KNeighborsClassifier

#create and fit the model
clf = KNeighborsClassifier(n_neighbors=1, weights='uniform')
clf.fit(X_train, y_train)

#make predictions with the model and compare them to the actual result
y_pred = clf.predict(X_test)
y_pred_proba = clf.predict_proba(X_test)

#printing the results
model_validation = np.concatenate((X_test,y_pred.reshape(-1,1),y_test.reshape(-1,1)),axis=1)
results = pd.DataFrame(model_validation,columns=['Sepal_Length','Sepal_Width','Species_Predicted','Species']).iloc[-10:]
print('Comparing predictions made by the model to the true result:')
print(results)
print()

#accuracy evaluation
accuracy = accuracy_score(y_pred,y_test)
print('Accuracy: ', 100 * accuracy , '%')

Here we demonstrate the probability that the model associates to each sample being in a particular class. It's worth playing around with $k$ and the weighting to see what this does to the probabilities (for instance, for $k=1$, the probability will always be 0 or 1). In the end, the model selects the class with the greatest probability

In [None]:
#a flat list of column names (a nested list would create a MultiIndex)
pd.DataFrame(clf.predict_proba(X_test),columns=['prob_0','prob_1','prob_2']).iloc[-10:]

A very efficient way to demonstrate the behavior of the model is to graph the domains that are mapped to a particular category. We do that here, superimposed with the training data. Play around with the parameters of the model. The behavior is particularly clear when `n_neighbors=1`.

In [None]:
#NB: Much of this plotting code was lifted from a very helpful example which I found online but have since lost the link to.
#If anyone could point out the source I would really appreciate it!

#we will represent iris species with color, this requires a ListedColormap
from matplotlib.colors import ListedColormap
cmap_light = ListedColormap(['orange', 'cyan', 'cornflowerblue'])
cmap_bold = ListedColormap(['darkorange', 'c', 'darkblue'])

#generating the domain we will plot and making predictions for every point in a mesh of that domain
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, .02),
                     np.arange(y_min, y_max, .02))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

#now let's make the plot, to do so we first have to reshape Z, which is just a 1d array, to match the shape of xx and yy
#then plot the color coded mesh
Z = Z.reshape(xx.shape)
plt.figure(figsize=(8, 6), dpi=100)
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

#also plot the training points to see the behavior
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cmap_bold,
                edgecolor='k', s=20)
plt.xlim(xx.min(), xx.max())
plt.xlabel('Sepal Length')
plt.ylim(yy.min(), yy.max())
plt.ylabel('Sepal Width')
plt.title("Iris Species Classification")

plt.show()

# Support Vector Classifier

Support vector classifiers work by viewing the training set $\mathbf x_j$, $j=1,...,m$ as points in $\mathbb R^n$ (where $n$ is the number of features), and trying to divide regions associated with different categories with "domain walls". In the simplest case, categories are separable by an $(n-1)$-dimensional hyperplane. We then choose the hyperplane that has the greatest distance between itself and the nearest training points of any class. This plane is called the "maximal-margin hyperplane" and the region bounded by parallel planes passing through the closest points is called the "margin". The plane is determined only by the closest training examples, which are called the "support vectors". In this discussion we will focus on binary classification so that there is only one maximal-margin hyperplane. The image below is taken from `sklearn`'s discussion of support vector machines, which provides an excellent overview of the method and can be found [here](https://scikit-learn.org/stable/modules/svm.html#classification).

<img src="https://scikit-learn.org/stable/_images/sphx_glr_plot_separating_hyperplane_0011.png">


If the classes cannot be separated by a hyperplane, but can be to a good approximation, we still use this method, but the margin will be allowed to contain some points from the training set. The maximal-margin hyperplane is then defined by some "cost function", which is minimized by the optimal plane. This cost function should induce a greater cost when the margin is small (we want to separate the classes as well as possible) as well as a greater cost the deeper the training set impinges on the marginal region.

The above strategy constructs a binary classifier. When building a classifier to predict more than two classes, one may use either a "one-vs-one" or "ovo" approach or a "one-versus-rest" or "ovr" approach. In the ovo approach a decision boundary is created from data for each pair of categories. In the ovr approach, a decision boundary is created comparing each category to everything that is not in that category. I haven't taken the time yet to figure out what happens when a point is in the positive result region for multiple or no categories.
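One way to see the difference between the two approaches is to count decision functions: with $N$ classes, ovo has one per pair of classes, $N(N-1)/2$, while ovr has one per class, $N$. The sketch below uses a synthetic four-class dataset (hypothetical data from `make_blobs`, chosen because for three classes the two counts coincide) and `SVC`'s `decision_function_shape` parameter.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# a synthetic four-class problem in the plane
X_blobs, y_blobs = make_blobs(n_samples=200, centers=4, random_state=0)

svc_ovo = SVC(kernel="linear", decision_function_shape="ovo").fit(X_blobs, y_blobs)
svc_ovr = SVC(kernel="linear", decision_function_shape="ovr").fit(X_blobs, y_blobs)

# one column per pair of classes: 4 * 3 / 2 = 6
ovo_shape = svc_ovo.decision_function(X_blobs).shape
# one column per class: 4
ovr_shape = svc_ovr.decision_function(X_blobs).shape
print("ovo decision function shape:", ovo_shape)
print("ovr decision function shape:", ovr_shape)
```

Note that, according to the `sklearn` documentation, `SVC` always trains one-vs-one classifiers internally, and `decision_function_shape` only controls the shape of the decision function that is returned.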

## Linear support vector classifiers

To formulate the optimization problem, let the two categories be represented by $y= \pm 1$ and let the hyperplane in question be defined by the equation

$$ \mathbf w^T \mathbf x + b = 0$$

where $\mathbf w \in \mathbb R^n$ is the (un-normalized) normal vector to the plane. The marginal region is bounded by parallel hyperplanes


$$ \mathbf w^T \mathbf x + b = 1$$
$$ \mathbf w^T \mathbf x + b = -1$$

The first equation is satisfied by support vectors $\mathbf x_j$ with $y_j =1$ and the second by support vectors with $y_j = -1$. Note that this parameterization fixes the normalization of $b$ and $\mathbf w$, which are left unfixed by the first equation. The width of the marginal region is then $\frac{2}{||\mathbf w||}$.

<img src="https://upload.wikimedia.org/wikipedia/commons/7/72/SVM_margin.png" style="background-color:white" width = "500">

Now a training point $\mathbf x_j$ of class $y_j$ lies on, or has crossed, the plane bounding its side of the marginal region iff

\begin{align}
    y_j ( \mathbf w^T \mathbf x_j + b ) = 1 - \zeta_j
\end{align}

for some $\zeta_j \geq 0$. The distance it impinges is given by $\frac{\zeta_j}{|| \mathbf w ||}$.

Hence we seek to minimize

\begin{align}
    C \sum_{j=1}^m \zeta_j + \frac 1 2 || \mathbf w ||^2 ,\nonumber \\
    \text{subject to} ~~ y_j ( \mathbf w^T \mathbf x_j + b ) \geq 1 - \zeta_j , \nonumber \\
    \text{where} ~~ \zeta_j \geq 0 
\end{align}

over $\mathbf w, b, \zeta_j$. Here $C$ is some constant controlling the relative cost of keeping the regions well separated and of having impinging points. This is the so-called "primal problem".
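To connect these symbols to `sklearn`, here is a minimal sketch that fits a linear `SVC` on a synthetic binary problem and then evaluates the slack variables $\zeta_j$, the margin width $2 / ||\mathbf w||$ and the primal objective by hand. The data from `make_blobs` and the choice $C = 1$ are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# two overlapping blobs, relabelled so that y = +1 or -1
X_bin, y01 = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)
y_bin = 2 * y01 - 1

C = 1.0
svc = SVC(kernel="linear", C=C).fit(X_bin, y_bin)
w, b = svc.coef_[0], svc.intercept_[0]

# slack variables: zero for points outside the margin, positive for impinging points
zeta = np.maximum(0.0, 1.0 - y_bin * (X_bin @ w + b))

margin_width = 2.0 / np.linalg.norm(w)
primal_objective = C * zeta.sum() + 0.5 * w @ w

print("margin width:", margin_width)
print("number of impinging points:", int((zeta > 0).sum()))
print("primal objective:", primal_objective)
```

Increasing $C$ makes impinging points more expensive, so one would expect a narrower margin with fewer points inside it.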

## The primal problem

The above is a reasonable approach if the categories are approximately separated by a good hyperplane, that is, if the minimum of the cost function is a small number. If this is not the case, we need to use some non-linear dividing region. This is accomplished by mapping the feature space into some other, possibly higher dimensional space $\phi: \mathbb R^n \rightarrow \mathbb R^p$ where the dividing region is better approximated as linear and then carrying out the above algorithm there. The map $\phi$ is called a feature map, and should be a non-linear map. We then seek to minimize

\begin{align}
    C \sum_{j=1}^m \zeta_j + \frac 1 2 || \mathbf w ||_2^2 ,\nonumber \\
    \text{subject to} ~~ y_j ( \mathbf w^T \phi (\mathbf x_j) + b ) \geq 1 - \zeta_j , \nonumber \\
    \text{where} ~~ \zeta_j \geq 0 
\end{align}

over $\mathbf w, b, \zeta_j$.

## The dual problem

For computational purposes it is apparently convenient to switch to the dual problem. For the details on how this works we refer to the "Constraints, Lagrange Multipliers, and Duality" note (those interested can message me). In the dual problem, rather than seeking to minimize a constrained lagrangian, we seek to maximize the dual lagrangian, which is a function of the lagrange multipliers that implement the constraints

\begin{align}
    q(\alpha) = \sum_{j=1}^m \alpha_j - \frac 1 2 \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y_i y_j \mathbf x^T_i \mathbf x_j .
\end{align}

Since the lagrange multipliers implement inequality rather than equality constraints, there are some constraints that still come along for the ride in the dual problem. These are

\begin{align}
    0 \leq \alpha_j \leq C ,
    \qquad \qquad
    \sum_{j=1}^m \alpha_j y_j = 0 .
\end{align}

Non-zero $\alpha_j$'s correspond to support vectors, that is, vectors at or within the marginal planes.


Once the maximum is found, the maximal-margin hyperplane can be retrieved via

\begin{align}
    \mathbf w = \sum_{j=1}^m \alpha_j y_j \mathbf x_j
\end{align}

and

\begin{align}
	b = y_j  - \mathbf w^T \mathbf x_j.
\end{align}

The equation for $b$ holds for any $j$ such that $0 < \alpha_j < C$, that is, a support vector lying exactly on one of the marginal planes.
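These relations can be checked numerically on a fitted model. In the sketch below (synthetic data from `make_blobs`, my own choice), I use the fact that `SVC` exposes the products $\alpha_j y_j$ for the support vectors as `dual_coef_`, the support vectors themselves as `support_vectors_`, and their indices in the training set as `support_`.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X_bin, y01 = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)
y_bin = 2 * y01 - 1

C = 1.0
svc = SVC(kernel="linear", C=C).fit(X_bin, y_bin)

# dual_coef_ stores alpha_j * y_j for the support vectors only (alpha_j = 0 elsewhere)
alpha_times_y = svc.dual_coef_[0]
support_vectors = svc.support_vectors_
alpha = np.abs(alpha_times_y)

# w = sum_j alpha_j y_j x_j
w_from_dual = alpha_times_y @ support_vectors
print("w from the dual variables:", w_from_dual)
print("w reported by sklearn:    ", svc.coef_[0])
print("sum_j alpha_j y_j:        ", alpha_times_y.sum())

# b from a support vector with 0 < alpha_j < C, if there is one
on_margin = np.flatnonzero(alpha < C - 1e-8)
if on_margin.size > 0:
    j = on_margin[0]
    y_j = y_bin[svc.support_[j]]
    b_from_dual = y_j - w_from_dual @ support_vectors[j]
    print("b from the dual variables:", b_from_dual)
    print("b reported by sklearn:    ", svc.intercept_[0])
```

The two values of $b$ should agree only up to the solver's stopping tolerance, so small differences in the last few digits are expected.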

## The kernel trick

From the dual problem, it is clear that the problem only depends on the so-called Gram matrix, the matrix of inner products $\mathbf x^T_i \mathbf x_j$. This is nice because there is (apparently) a large computational cost to computing dot products, and we never have to compute more than $m^2$ of them. This is also a useful point of generalization for building non-linear support vector classifiers, that is, classifiers without hyperplane boundaries.

The way this is done is the so-called kernel trick, where the Gram matrix is replaced in the dual problem with some symmetric "kernel" $K(\mathbf x_i , \mathbf x_j )$. We then solve the same dual problem and classify a new point $\mathbf x$ via
\begin{align}
    y = \text{sign} \left( \sum_{j=1}^m \alpha_j y_j K ( \mathbf x_j , \mathbf x ) + b \right)
\end{align}

where

\begin{align}
	b = y_j  - \sum_{k=1}^m \alpha_k y_k K ( \mathbf x_k , \mathbf x_j ).
\end{align}

for some support vector $j$.

The idea behind the kernel is that it is an inner product in an auxiliary feature space $K(\mathbf x_i , \mathbf x_j ) = \phi ( \mathbf x_i )^T \phi ( \mathbf x_j)$ where $\phi$ is a non-linear map into a possibly higher-dimensional space. If a map can be found that better separates the classes linearly, we then carry out the linear support vector algorithm there. A large computational savings comes from realizing we do not need to perform this map explicitly, we only need to know $K$ itself. Some common kernels built into `sklearn` are:

* `linear`: $K(\mathbf x_i , \mathbf x_j ) = \mathbf x_i^T \mathbf x_j$
* `poly`: $K(\mathbf x_i , \mathbf x_j ) = (\gamma \mathbf x_i^T \mathbf x_j + r )^d$
* `rbf`: $K(\mathbf x_i , \mathbf x_j ) = \exp \left(- \gamma || \mathbf x_i - \mathbf x_j ||^2_2 \right)$
* `sigmoid`: $K(\mathbf x_i , \mathbf x_j ) = \tanh \left(\gamma \mathbf x_i^T \mathbf x_j + r \right)$

Here $\gamma$ is always set by parameter `gamma`, $r$ by `coef0` and $d$ by `degree`. Different kernels will be good at creating classification boundaries of different shapes, but we have not investigated how they perform yet.
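The claim that we never need the feature map itself can be checked in a small case. For the polynomial kernel with $\gamma = 1$, $r = 1$, $d = 2$ on $\mathbb R^2$, an explicit feature map into $\mathbb R^6$ is known in closed form; the sketch below compares the Gram matrix computed in that six-dimensional space with the one computed directly from the kernel. The helper `phi_quadratic` and the random points are my own illustrative choices; `polynomial_kernel` and `rbf_kernel` are helpers from `sklearn.metrics.pairwise`.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel, rbf_kernel

def phi_quadratic(x):
    """Explicit feature map for the kernel (x^T z + 1)^2 on R^2."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 ** 2, x2 ** 2, s * x1 * x2])

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 2))

# Gram matrix computed in the six-dimensional feature space ...
Phi = np.array([phi_quadratic(a) for a in A])
gram_explicit = Phi @ Phi.T
# ... and the same matrix computed directly from the kernel, never leaving R^2
gram_kernel = polynomial_kernel(A, degree=2, gamma=1.0, coef0=1.0)

# the rbf kernel written out by hand, compared with sklearn's helper
gamma = 0.5
sq_dist = ((A[:, None, :] - A[None, :, :]) ** 2).sum(axis=-1)
rbf_by_hand = np.exp(-gamma * sq_dist)
rbf_sklearn = rbf_kernel(A, gamma=gamma)

print("polynomial kernel matches explicit map:", np.allclose(gram_explicit, gram_kernel))
print("rbf kernel matches hand computation:   ", np.allclose(rbf_by_hand, rbf_sklearn))
```

For the `rbf` kernel the corresponding feature space is infinite-dimensional, so there the kernel is the only practical way to compute the inner products.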


## Example

We again use the iris dataset and try to build a classifier using a support vector classifier that predicts the species of an iris based off of only the first two features, the sepal length and sepal width. Below we load the data

In [None]:
#import the data from the sklearn prepackaged datasets
from sklearn import datasets
iris = datasets.load_iris()

#package this into a dataframe with column labels to keep track of what information is what
iris_features = pd.DataFrame(iris.data,
                                columns = ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width'])
iris_target = pd.DataFrame(iris.target,
                            columns = ['Species'])
print('Sample of the iris dataset')
print(iris_features.join(iris_target).head())


We create a support vector classifier and train it on this data. The kernel is specified to be linear so that we use the linear classification discussed above. We will return to this later to better understand the non-linear kernels, but for now, one can play with some of the other options. It's also worth playing around with the relative cost parameter $C$ introduced above. Since we have a tripartite classification problem, we need to specify how the multi-class problem is handled. We set `decision_function_shape='ovr'`, the "one-versus-rest" convention discussed above (this is also the default). Note that, according to the `sklearn` documentation, `SVC` always trains one-vs-one classifiers internally, and this option only controls the shape of the decision function that is returned.

First let's get the data ready

In [None]:
#We intend to use only the first two features for our model
X = iris.data[:,:2]
y = iris.target

#creating a train/test split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=0)

And now train the model and make predictions

In [None]:
from sklearn.svm import SVC

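#gamma is only used by the 'rbf', 'poly' and 'sigmoid' kernels, coef0 by 'poly' and 'sigmoid', and degree by 'poly';
#they are ignored for kernel='linear' and are listed here to make it easy to experiment with the other kernels
clf = SVC(C=1.0,kernel='linear',decision_function_shape='ovr',gamma=1.0,coef0=0.5,degree=4)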
clf.fit(X_train,y_train)

#make predictions with the model and compare them to the actual result
y_pred = clf.predict(X_test)

#printing the results
model_validation = np.concatenate((X_test,y_pred.reshape(-1,1),y_test.reshape(-1,1)),axis=1)
results = pd.DataFrame(model_validation,columns=['Sepal_Length','Sepal_Width','Species_Predicted','Species']).iloc[-10:]
print('Comparing predictions made by the model to the true result:')
print(results)
print()

#accuracy evaluation
accuracy = accuracy_score(y_pred,y_test)
print('Accuracy: ', 100 * accuracy , '%')

Finally, let's display the behavior of the model graphically as before

In [None]:
#we will represent iris species with color, this requires a ListedColormap
from matplotlib.colors import ListedColormap
cmap_light = ListedColormap(['orange', 'cyan', 'cornflowerblue'])
cmap_bold = ListedColormap(['darkorange', 'c', 'darkblue'])

#generating the domain we will plot and making predictions for every point in a mesh of that domain
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, .02),
                     np.arange(y_min, y_max, .02))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

#now let's make the plot, to do so we first have to reshape Z, which is just a 1d array, to match the shape of xx and yy
#then plot the color coded mesh
Z = Z.reshape(xx.shape)
plt.figure(figsize=(8, 6), dpi=100)
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

#also plot the training points
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cmap_bold,
                edgecolor='k', s=20)
plt.xlim(xx.min(), xx.max())
plt.xlabel('Sepal Length')
plt.ylim(yy.min(), yy.max())
plt.ylabel('Sepal Width')
plt.title("Iris Species Classification")

plt.show()

# XGBoost Classification

For more details on this algorithm see [here](https://xgboost.readthedocs.io/en/latest/tutorials/model.html). XGBoost and regular gradient boost are essentially the same algorithm, but XGBoost has regularization, more optimizations, and built in cross-validation. For a list of differences see [here](https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/).

As usual, suppose we are given a training set $\mathbf x_{j}$ with dependent variable $y_j \in \{ 0 , 1 \}$ where $j = 1 , \dots , m$ and the $\mathbf x$'s lie in an $n$-dimensional feature space. In this algorithm, our goal is to create a rule that takes a given feature vector $\mathbf x$ and outputs a probability $p_\mathbf{x} \in [0,1]$ that the result is a "success", that is, $y=1$.

Our plan is to build many trees $T^{(k)}$. Each tree will produce a numerical output for a given feature vector $\mathbf x$, which we denote $f^{(k)}(\mathbf x)$. The tree ensemble will be built iteratively with each tree correcting the result for $p_\mathbf{x}$ produced by all the previous trees. As such, each tree is not individually interpretable as it is in, say, a random forest. Each tree adds its own numerical contribution to the log-odds of a positive result

\begin{align}
    l = \ln \left( \frac p {1-p} \right).
\end{align}

We add these contributions to log-odds rather than the probability since this is valued in $\mathbb R$, not $[0,1]$, so we don't have to worry about leaving a certain region.
Thus we get a sequence of predictions

\begin{align}
    l^{(0)}_\mathbf{x} &= 0 , \nonumber \\
    l^{(1)}_\mathbf{x} &= l^{(0)}_\mathbf{x} + \epsilon f^{(1)} ( \mathbf x), \nonumber \\
    &\qquad \cdots \nonumber \\
    l^{(k)}_\mathbf{x} &= l^{(k-1)}_\mathbf{x} + \epsilon f^{(k)}(\mathbf x ) = \epsilon \sum_{a = 1}^k f^{(a)} ( \mathbf x )
\end{align}

with better and better accuracy (we will discuss how to measure accuracy in a bit). 

## The procedure

The initial tree $T^{(0)}$ is just a root, sending $\mathbf x \rightarrow f^{(0)}$ (a constant typically set to 0) for any input feature vector. Now suppose we are given an ensemble of trees $\{ T^{(0)}, \dots , T^{(k)} \}$ and let $p^{(k)}_j$ be its output probability of success for each feature vector $\mathbf x_j$. We evaluate our success with an "objective function"

$$
    O = \sum_{j=1}^m L ( y_j , p^{(k)}_j) + \sum_{a=1}^k \Omega ( T^{(a)} )
$$

and try to choose the parameters for our new tree $T^{(k+1)}$ so that its contribution to $O$ is minimized. Here $L$ is a loss function that captures how well the model matches the training data and $\Omega$ is a "regularization function" to prevent over-fitting.

## Loss and regularization
The loss function used for classification problems is

\begin{align}
    L(y,p) &= - \ln ( p^y (1-p)^{1-y}) .
\end{align}

This is the negative log-likelihood of the Bernoulli parameter $p$ given a measurement of the dependent variable $y$. Hence if $p$ seems likely given the observation $y$, the loss is lower; if $p$ seems unlikely given $y$, the loss is higher. We take the log since then the errors found in multiple trials add up, that is, the log-likelihood of many observations is just the sum of the log-likelihoods of each observation individually.

A good regularization function will be larger for "more complicated" trees, so that its cost in the objective function prevents over-fitting.
Let $t$ be the number of leaves in a tree $T$ and let $w_b$ be the numerical outputs of each leaf with $b$ labelling the leaves. The regularization term used is

\begin{align}
    \Omega ( T ) = \gamma t + \frac 1 2 \lambda \sum_{b=1}^t w^2_b
\end{align}

for some parameters $\gamma$ and $\lambda$ which must be specified. Adding this to the objective function makes the model tend to prefer trees with a smaller number of leaves and with smaller leaf weights.

## Building trees

At the end of the day though, for speed and simplicity, `XGBoost` minimizes the second-order Taylor approximation to the objective function. Let the objective function at step $k$ be

\begin{align}
    O = \sum_{a=1}^k O^{(a)}
\end{align}

with

\begin{align}
    O^{(a)} = \sum_{b=1}^{t^{(a)}} \left( G^{(a)}_b w^{(a)}_b + \frac 1 2 ( H^{(a)}_b + \lambda ) (w^{(a)}_b)^2 \right) + \gamma t^{(a)}
\end{align}

being the contribution from the $a$th tree. Here $G_b$ and $H_b$ are the Taylor coefficients of the objective function introduced above. They are associated to each leaf of a tree and take the values

\begin{align}
    G^{(a)}_b = - \sum_j' ( y_j - p^{(a-1)}_j) ,
    &&H^{(a)}_b = \sum_j'  p^{(a-1)}_j ( 1 - p^{(a-1)}_j ).
\end{align}

Here the prime denotes that we sum over $(\mathbf x_j, y_j)$ in the training set that lie in this leaf.
In other words, the titular gradient $G^{(a)}_b$ is minus the sum of the "residuals" $y_j - p^{(a-1)}_j$ of the training data points that lie in the $b$th leaf.
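As a sanity check on these coefficients, the per-sample terms $-(y_j - p_j)$ and $p_j(1-p_j)$ should be the first and second derivatives of the loss $L$ with respect to the log-odds $l$, where $p = 1/(1+e^{-l})$. The sketch below compares them with finite differences; the labels and log-odds are made-up numbers.

```python
import numpy as np

def sigmoid(l):
    return 1.0 / (1.0 + np.exp(-l))

def log_loss_from_log_odds(y, l):
    """L(y, p) = -ln(p^y (1-p)^(1-y)) with p = sigmoid(l)."""
    p = sigmoid(l)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1.0, 0.0, 1.0, 0.0])
l = np.array([0.3, -1.2, -0.5, 2.0])   # hypothetical current log-odds
p = sigmoid(l)

# per-sample Taylor coefficients; G_b and H_b are their sums over the samples in leaf b
grad_analytic = -(y - p)
hess_analytic = p * (1 - p)

# central finite differences in the log-odds
h = 1e-4
loss_plus = log_loss_from_log_odds(y, l + h)
loss_zero = log_loss_from_log_odds(y, l)
loss_minus = log_loss_from_log_odds(y, l - h)
grad_numeric = (loss_plus - loss_minus) / (2 * h)
hess_numeric = (loss_plus - 2 * loss_zero + loss_minus) / h ** 2

print("gradient matches:", np.allclose(grad_analytic, grad_numeric, atol=1e-5))
print("hessian matches: ", np.allclose(hess_analytic, hess_numeric, atol=1e-5))
```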

Now let's turn to building the trees in the ensemble. For now, consider a tree $T^{(a)}$ with a given structure of nodes and branches. We can minimize its "score" $O^{(a)}$ by selecting weights

\begin{align}
    w^{(a)}_b = - \frac{G_b}{H_b + \lambda}
\end{align}

in which case the tree has a score

\begin{align}
    O^{(a)} = - \frac 1 2 \sum_{b=1}^{t^{(a)}} \frac{G_b^2}{H_b + \lambda} + \gamma t^{(a)}
\end{align}

To actually select a good tree, we take the top-down approach discussed above in the construction of decision trees. Starting from the root, we work our way downward, trying to split a leaf into a node with two branches. We introduce a new splitting if the "gain" is greater than zero

\begin{align}
    \text{Gain} &= \text{Objective function before split} - \text{Objective function after split}\nonumber \\
        &= \frac 1 2 \left( \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda }\right) - \gamma
\end{align}

where $L$ and $R$ denote the left and right leaves that you get after the splitting. In the presence of many possible splittings of a leaf, we take the one with the greatest gain. Note that the effect of $\gamma$ is to give a bias to the gain function, so that larger values of $\gamma$ tend to prune the tree and prevent over-fitting. Similarly, larger values of $\lambda$ decrease the positive contributions to the gain and so discourage adding new branches.
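Here is a minimal sketch of the leaf weight and gain formulas on a toy leaf. The helper functions and the numbers are hypothetical and only meant to show the arithmetic, not how `XGBoost` organizes the computation internally.

```python
import numpy as np

def leaf_weight(G, H, lam):
    """Optimal output of a leaf with gradient sum G and hessian sum H."""
    return -G / (H + lam)

def leaf_score(G, H, lam):
    """Contribution of a single leaf to the objective at the optimal weight (gamma not included)."""
    return -0.5 * G ** 2 / (H + lam)

def split_gain(G_L, H_L, G_R, H_R, lam, gamma):
    return 0.5 * (G_L ** 2 / (H_L + lam) + G_R ** 2 / (H_R + lam)
                  - (G_L + G_R) ** 2 / (H_L + H_R + lam)) - gamma

# toy leaf: four samples, all currently predicted with p = 0.5 (log-odds 0)
y = np.array([1.0, 1.0, 0.0, 0.0])
p = np.full(4, 0.5)
g = -(y - p)
h = p * (1 - p)

# candidate split: the first two samples go left, the last two go right
left = np.array([True, True, False, False])
G_L, H_L = g[left].sum(), h[left].sum()
G_R, H_R = g[~left].sum(), h[~left].sum()

lam, gamma = 1.0, 0.1
gain = split_gain(G_L, H_L, G_R, H_R, lam, gamma)
w_L = leaf_weight(G_L, H_L, lam)
w_R = leaf_weight(G_R, H_R, lam)

print("gain of the split:", gain)
print("left and right leaf weights:", w_L, w_R)
```

The split cleanly separates the two classes, so the gain is positive and the two new leaves push the log-odds in opposite directions. Raising `gamma` above roughly 0.67 would make the gain negative and the split would be rejected.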

Apparently, XGBoost only accounts for $\gamma$ after the tree is made, not at each step. Hence the tree is constructed as if $\gamma$ were zero, and then one goes back and looks at all of the lowest decision nodes. If the gain is greater than $\gamma$, that decision node is maintained, otherwise it is removed. We then proceed up the tree in this way. This is quite different from pruning at each step: we might have a split at the root node, for instance, whose gain does not exceed $\gamma$; however, if we never reach the root node in this bottom-up procedure, the split is maintained.

## Example

Now let's use `XGBoost` to produce such a model. The example is taken from [here](https://machinelearningmastery.com/develop-first-xgboost-model-python-scikit-learn/). The dataset we use is some health data for females of Pima Indian ancestry and whether or not they have developed diabetes. Let's take a quick look.

In [None]:
#load the dataset and take a look
dataset = pd.read_csv('../input/pima-indians-diabetes-database/diabetes.csv')
print(dataset.head())

Fortunately there is no data cleaning to do. Let's create an `XGBClassifier` then and train it. A sample of important parameters for the model is (for others see [here](https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn)):

* `n_estimators` is the number of trees generated in the ensemble
* `learning_rate` is $\epsilon$ from the discussion above
* `reg_lambda` is the regularization parameter $\lambda$
* `min_split_loss` is the regularization parameter $\gamma$ (also available under the name `gamma`)

In [None]:
#loading the necessary methods
from xgboost import XGBClassifier

#separate out the features and the dependent variable
X = dataset.iloc[:,:-1]
y = dataset.iloc[:,-1]

#split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,random_state=0)

#now create the model and train it
#eval_set lists the datasets on which the metrics are evaluated after every boosting round (here the training and test sets)
#eval_metric then specifies which measures of error we want to keep track of
#note: recent xgboost releases expect eval_metric in the constructor rather than in fit
clf = XGBClassifier(n_estimators=100,learning_rate=0.1,reg_lambda=1,min_split_loss=0,
                    eval_metric=['logloss','error'])
eval_set=[(X_train,y_train),(X_test,y_test)]
clf.fit(X_train,y_train,eval_set=eval_set,verbose=False)

#now lets make some predictions on the test set and see how they fare
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy on the test set data:',100*accuracy,'%')

To get some more insight on how the model develops as we add more trees, we display plots of the error and logloss (or negative log-likelihood). The latter is just the $L$ function introduced above, averaged over the samples and evaluated at each stage in the process. Note that `clf.evals_result()` is a dictionary whose first key, `'validation_0'` or `'validation_1'`, indicates whether we are looking at the training or test set, and whose second key is one of the loss functions from `eval_metric`. Given these keys, the values are simply those of the given loss function at each round.

In [None]:
# retrieve performance metrics
results = clf.evals_result()
epochs = len(results['validation_0']['logloss'])
x_axis = range(0, epochs)
# plot log loss
fig, ax = plt.subplots()
ax.plot(x_axis, results['validation_0']['logloss'], label='Train')
ax.plot(x_axis, results['validation_1']['logloss'], label='Test')
ax.legend()
plt.xlabel('Rounds')
plt.ylabel('Log Loss')
plt.title('XGBoost Log Loss')
plt.show()
# plot error
fig, ax = plt.subplots()
ax.plot(x_axis, results['validation_0']['error'], label='Train')
ax.plot(x_axis, results['validation_1']['error'], label='Test')
ax.legend()
plt.xlabel('Rounds')
plt.ylabel('Error')
plt.title('XGBoost Error')
plt.show()

It looks like the model has overfit a bit and we should stop after about 20 rounds.
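Rather than reading the stopping point off the plot, one could pick the round with the smallest test log loss from the results dictionary. Below is a minimal sketch; `best_round` is a hypothetical helper (not part of `xgboost`), and to keep the example self-contained it is applied to a synthetic dictionary with made-up numbers in the same nested format as `clf.evals_result()`.

```python
import numpy as np

def best_round(evals_result, dataset="validation_1", metric="logloss"):
    """Index of the boosting round with the smallest value of `metric` on `dataset`."""
    return int(np.argmin(evals_result[dataset][metric]))

# made-up curves: the training loss keeps falling while the test loss turns around
rounds = np.arange(100)
fake_results = {
    "validation_0": {"logloss": list(0.7 * np.exp(-rounds / 30.0))},
    "validation_1": {"logloss": list(0.45 + 0.25 * np.exp(-rounds / 8.0) + 0.001 * rounds)},
}

stop_at = best_round(fake_results)
print("test log loss is smallest after round", stop_at)
```

With the real model one would call `best_round(clf.evals_result())` and, for instance, retrain with `n_estimators` set to about that many rounds. Choosing the stopping point on the test set does leak information from it, so a separate validation set would be the cleaner option.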

# AdaBoost

For a good discussion of the mathematics behind AdaBoost, see [here](https://en.wikipedia.org/wiki/AdaBoost). [This](https://medium.com/analytics-vidhya/add-power-to-your-model-with-adaboost-algorithm-ff3951c8de0) is also a good source. The original paper with the SAMME algorithm used by `sklearn` can be found [here](https://web.stanford.edu/~hastie/Papers/samme.pdf).

Like XGBoost, AdaBoost is a boosting algorithm: it creates a sequence of so-called "weak" classifiers whose ensemble is used to make a prediction, with each successive classifier chosen to correct the results of the ensemble in the best possible way. These weak classifiers may be of any type; however, they are often depth=1 decision trees, or stumps (this is the `sklearn` default). What differs from XGBoost is the way in which their performance is evaluated and added to the ensemble. The main idea is that if the ensemble misclassifies a training example, the next tree should put more weight on that example. Hence computational time is spent focussing on learning patterns in the data that have not been learned yet (though it seems to me this would also encourage over-fitting and oversensitivity to outliers).

Now let's give the details. Suppose we have training data $\{(\mathbf x_j , y_j )\}$ where $y_j \in \{ 1 , - 1 \}$ is a binary classification. In the first step we create a decision stump with the lowest weighted Gini index

\begin{align}
    \frac{| Q_L |}{|Q|} G(Q_L ) + \frac{|Q_R|}{|Q|} G(Q_R) .
\end{align}
Here $Q$ is the collection of data points in a given node/leaf, $L$ and $R$ denote the two leaves in the stump, and $G$ is the Gini index of a single node. We let
\begin{align}
    C_1 ( \mathbf x) = \alpha_1 k_1 (\mathbf x)
\end{align}
where $k_1$ is this decision stump and $\alpha_1$ is a constant we show how to get later (it doesn't matter right now as this will just be an overall scale). A point is classified as $+1$ if $C_1 \geq 0$ and $-1$ otherwise.



Now suppose at the $m$th step we are given $m-1$ decision stumps $k_a$, $a=1,...,m-1$ with weights $\alpha_a$. Importantly, the $k_a$'s are interpreted as classifiers: they return $\pm 1$, not a probability that the result is a $+1$. Given a feature vector $\mathbf x$, the ensemble makes a weighted vote of these classifiers

\begin{align}
    C_{m-1} (\mathbf x) = \alpha_1 k_1 (\mathbf x) + \cdots + \alpha_{m-1} k_{m-1} ( \mathbf x )
\end{align}

and returns the sign of $C_{m-1} (\mathbf x)$.


Our goal is to add to the ensemble an $m$th classifier $k_m$ with weight $\alpha_m$ so that

\begin{align}
    C_m ( \mathbf x ) = C_{m-1} ( \mathbf x ) + \alpha_m k_m ( \mathbf x ) .
\end{align}


We will discuss how to create the stump in a moment, but once we have the stump, the weight $\alpha_m$ will be chosen to minimize some loss function, where the loss function is chosen to put greater weight on the points in the training data that were misclassified at the $(m-1)$th step. A loss function that accomplishes this is the exponential loss

\begin{align}
    L(y_j, C_m) = \sum_{j=1}^{n_\text{samples}} e^{- y_j C_m (\mathbf x_j) }
\end{align}

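Each term in the sum is larger when $\mathbf x_j$ is misclassified (the exponent $-y_j C_m (\mathbf x_j)$ is then positive) and indeed is sensitive to "how badly" it's been misclassified. (The phrase "amount of say", which appears in some presentations of AdaBoost, refers to the weight $\alpha_a$ with which each stump enters the vote.) Let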

\begin{align}
    w^{(m)}_j = e^{-y_j C_{m-1} ( \mathbf x_j)} = w^{(m-1)}_j e^{-y_j \alpha_{m-1} k_{m-1} ( \mathbf x_j)}.
\end{align}

This parameterizes how badly the $(m-1)$th instance of the classifier mis-classifies the $j$th training sample. Rearranging the loss function, we have

\begin{align}
    L(y_j, C_m) &= \sum_{j=1}^{n_\text{samples}} w^{(m)}_j e^{- y_j \alpha_m k_m (\mathbf x_j) } \nonumber \\
        &= e^{- \alpha_m } \sum_{j=1}^{n_\text{samples}} w^{(m)}_j + (e^{\alpha_m} - e^{- \alpha_m} ) \sum_{y_j \neq k_m (\mathbf x_j )} w^{(m)}_j
\end{align}


Let's introduce the "weighted error rate" of $k_m$

\begin{align}
    \epsilon_m = \sum_{y_j \neq k_m (\mathbf x_j)} w^{(m)}_j \big / \sum_j w^{(m)}_j .
\end{align}

Then given a fixed stump $k_m$, the loss function is minimized as a function of $\alpha_m$ by

\begin{align}
    \alpha_m = \frac 1 2 \ln \left( \frac{1-\epsilon_m}{\epsilon_m} \right) .
\end{align}
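To see where this comes from, write $W = \sum_j w^{(m)}_j$ (a shorthand introduced only for this step). The rearranged loss is then $L = W \left[ (1 - \epsilon_m) e^{-\alpha_m} + \epsilon_m e^{\alpha_m} \right]$, and setting its derivative with respect to $\alpha_m$ to zero gives

\begin{align}
    \frac{\partial L}{\partial \alpha_m} &= W \left[ - (1 - \epsilon_m) e^{-\alpha_m} + \epsilon_m e^{\alpha_m} \right] = 0 \nonumber \\
    \implies \quad e^{2 \alpha_m} &= \frac{1 - \epsilon_m}{\epsilon_m} .
\end{align}

The second derivative $W \left[ (1 - \epsilon_m) e^{-\alpha_m} + \epsilon_m e^{\alpha_m} \right]$ is positive, so this stationary point is a minimum.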

Note that stumps with higher accuracy (smaller $\epsilon_m$) are given greater weight. The `sklearn` implementation with the `SAMME` algorithm replaces the $1/2$ out front with a learning rate parameter $l$. This can be accomplished in the loss function by putting an overall $l$ factor in the exponential. See [here](https://stats.stackexchange.com/questions/82323/shrinkage-parameter-in-adaboost?noredirect=1&lq=1).
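A small worked example may help make the formulas for $\epsilon_m$ and $\alpha_m$ concrete. The labels and stump predictions are made up, the weights start out uniform, and the $1/2$ prefactor from the derivation is used rather than a learning rate.

```python
import numpy as np

y = np.array([+1, +1, -1, -1, +1])       # true labels (made up)
k_m = np.array([+1, +1, -1, -1, -1])     # stump predictions; only the last sample is wrong
w = np.ones(5)                           # current weights w_j^{(m)}

wrong = y != k_m
eps = w[wrong].sum() / w.sum()           # weighted error rate of the stump
alpha = 0.5 * np.log((1 - eps) / eps)    # weight given to the stump

w_next = w * np.exp(-y * alpha * k_m)    # weights w_j^{(m+1)} for the next step
eps_next = w_next[wrong].sum() / w_next.sum()

print(eps, alpha)
print(w_next)
print(eps_next)
```

Here $\epsilon_m = 0.2$ and $\alpha_m = \ln 2 \approx 0.693$. The weight of the misclassified sample doubles while the others are halved, and the weighted error of $k_m$ measured with the new weights is exactly $1/2$, so the next stump cannot do well by simply repeating $k_m$.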

In choosing $\alpha_m$, the model already gives greater weight to samples that were misclassified by $C_{m-1}$ since each sample enters the loss function with prefactor $w^{(m)}_j = e^{- y_j C_{m-1} ( \mathbf x_j)}$. However, apparently the SAMME algorithm used in `sklearn`'s implementation of AdaBoost goes beyond this, emphasizing the misclassified data points in training each individual classifier $k_m$ by weighting the training samples with weight $w^{(m)}_j$. This goes beyond simple gradient boosting, i.e. the logic is not merely the minimization of a loss function when combining the individual classifiers. Instead, each individual classifier used in the boosting process itself focuses more on the samples misclassified in the previous step.


The stumps are selected so that the split has the lowest Gini index. In computing the split, we should give more weight to samples that $C_{m-1}$ misclassified. One way to do this is to calculate a weighted Gini index using the weights $w^{(m)}_j$. Another, apparently used in some AdaBoost implementations, is to sample from the original training set with probabilities $\propto w^{(m)}_j$, creating a new training set of the same size, and then use the unweighted Gini index. As far as I can tell, `sklearn`'s `AdaBoostClassifier` takes the first route, passing the weights to the base estimator's `fit` method through `sample_weight`.
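Both options can be sketched with plain `numpy`. This is a brute-force illustration under my own assumptions, not how any particular library does it: the helper names `weighted_gini` and `best_stump` and the toy data are made up, labels are taken to be $\pm 1$, and candidate thresholds are midpoints between consecutive feature values.

```python
import numpy as np

def weighted_gini(y, w):
    """Gini impurity of labels y in {-1, +1} with sample weights w."""
    total = w.sum()
    if total == 0:
        return 0.0
    p_plus = w[y == 1].sum() / total
    return 1.0 - p_plus**2 - (1.0 - p_plus)**2

def best_stump(X, y, w):
    """Return (feature, threshold, left label, right label) of the split
    with the lowest weighted Gini index, or None if no split exists."""
    best, best_score = None, np.inf
    for i in range(X.shape[1]):
        values = np.unique(X[:, i])
        for t in (values[:-1] + values[1:]) / 2:
            left = X[:, i] <= t
            # impurity of the two leaves, averaged by their total weight
            score = (w[left].sum() * weighted_gini(y[left], w[left])
                     + w[~left].sum() * weighted_gini(y[~left], w[~left])) / w.sum()
            if score < best_score:
                # each leaf predicts its weighted majority label
                label_left = 1 if w[left] @ y[left] >= 0 else -1
                label_right = 1 if w[~left] @ y[~left] >= 0 else -1
                best, best_score = (i, t, label_left, label_right), score
    return best

# toy data: one feature, seven samples
X = np.arange(1.0, 8.0).reshape(-1, 1)
y = np.array([+1, +1, +1, -1, +1, -1, -1])

uniform = np.ones(7)
boosted = np.array([1.0, 1.0, 1.0, 1.0, 5.0, 1.0, 1.0])   # sample at x=5 upweighted

stump_uniform = best_stump(X, y, uniform)
stump_weighted = best_stump(X, y, boosted)
print(stump_uniform)
print(stump_weighted)

# resampling alternative: draw a new training set with probabilities
# proportional to the weights, then use the unweighted Gini index
rng = np.random.default_rng(0)
idx = rng.choice(len(y), size=len(y), p=boosted / boosted.sum())
print(best_stump(X[idx], y[idx], np.ones(len(idx))))
```

With uniform weights the best threshold is $3.5$. Upweighting the sample at $x = 5$ moves it to $5.5$, so that the heavily weighted sample ends up on the correct side of the split.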

What we have just described is known as the SAMME algorithm. `sklearn`'s implementation also allows for using the SAMME.R algorithm, which is described in detail [here](https://web.stanford.edu/~hastie/Papers/samme.pdf). The SAMME.R algorithm tends to converge to higher accuracy models more quickly by using weak estimators that output a probability rather than simply a classification. However, we have not had time to figure out the details.

## Example

Again we train the model on the Pima Indians data set. To see how the two AdaBoost algorithms compare, we run the following code.

In [None]:
#imports needed for this example
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

#load the dataset
dataset = pd.read_csv('../input/pima-indians-diabetes-database/diabetes.csv')

#separate out the features and the dependent variable
X = dataset.iloc[:,:-1]
y = dataset.iloc[:,-1]

#split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,random_state=0)

#hyperparameters shared by both models
n_estimators = 110
learning_rate = 1.0
random_state = 0

#the two boosting algorithms to compare
#note: 'SAMME.R' is deprecated in recent scikit-learn releases, so this comparison may need an older version
algs = ['SAMME', 'SAMME.R']
models = [None]*len(algs)

#train one model per algorithm and report its error rate on the test set
for i, alg in enumerate(algs):
    models[i] = AdaBoostClassifier(n_estimators=n_estimators,learning_rate=learning_rate,algorithm=alg,random_state=random_state)
    models[i].fit(X_train,y_train)
    
    y_pred = models[i].predict(X_test)
    error_rate = 1 - accuracy_score(y_test,y_pred)
    print('{} Error Rate: '.format(alg), error_rate)

Next we use `staged_predict` to track the training and test error after each boosting iteration.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

for alg, model in zip(algs, models):
    #staged_predict yields the ensemble's predictions after each boosting iteration
    error_train = [1 - accuracy_score(y_train,y_pred_train) for y_pred_train in model.staged_predict(X_train)]
    error_test = [1 - accuracy_score(y_test,y_pred_test) for y_pred_test in model.staged_predict(X_test)]

    #boosting can stop before n_estimators iterations, so count the ones actually run
    iterations = np.arange(len(error_train))
    plt.plot(iterations,error_train,label='train error')
    plt.plot(iterations,error_test,label='test error')
    
    plt.xlabel('iterations')
    plt.ylabel('error rate')
    plt.title('{} validation'.format(alg))
    plt.legend()
    plt.grid(True)
    
    plt.show()