# Gradient Boosting for Classification

# Algorithm

$Input$: Data $\{(x_{i}, y_{i})\}_{i=1}^{N}$ and a differentiable $Loss$ $Function$ $L(y_{i},F(x))$.

$Step - [1]$: Initialize model with a constant value.

\begin{equation}
F_{0}(x) = \underset{\gamma}{\mathrm{argmin}}\sum_{i=1}^{N}L({y}_i,{\gamma})
\end{equation}

$Step-[2]$: for $m = 1$ to $M$:

$[A]$ -  Compute 

\begin{equation}
r_{im} = -[\frac{\partial L({y}_i,F({x}_i))}{\partial F({x}_i)}]_{F(x) = F_{m-1}(x)}
\end{equation}

$[B]$ - Fit a regression tree to the $r_{im}$ values and create terminal regions $R_{jm}$ for $j = 1, \dots, j_{m}$.

$[C]$ - For $j = 1, \dots, j_{m}$, compute:

\begin{equation}
\gamma_{jm} = \underset{\gamma}{\mathrm{argmin}}\sum_{x_{i}\in{R_{jm}}}L({y}_i,F_{m-1}({x}_i) + {\gamma})
\end{equation}

${[D]}$ - Update.

\begin{equation}
F_{m}(x) = F_{m-1}(x) + {\mu}\sum_{j=1}^{j_{m}}{\gamma_{jm}}I(x\in{R_{jm}})
\end{equation}

$Step-[3]$: Output $F_{M}(x)$.
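The steps above can be sketched in plain Python. This is only a minimal illustration under several assumptions that the algorithm statement does not fix: the loss is the negative log(likelihood) with $F(x)$ as the predicted log(odds), each tree is a depth-1 stump on a single numeric feature, and the $argmin$ in step $[C]$ is approximated by one Newton step (a common shortcut, not the exact minimizer). The names `fit_stump`, `gradient_boost`, `n_rounds` and `learning_rate` (standing in for $\mu$) are hypothetical.

```python
import math

def sigmoid(z):
    """Convert a log(odds) value into a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def sse(values):
    """Sum of squared errors around the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def fit_stump(xs, residuals):
    """Step [B]: depth-1 regression tree on one feature.
    Assumes xs holds at least two distinct values."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2.0
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        score = sse(left) + sse(right)
        if best is None or score < best[0]:
            best = (score, threshold)
    return best[1]

def gradient_boost(xs, ys, n_rounds=5, learning_rate=0.8):
    # Step 1: constant initial prediction. For this loss the minimizer
    # is the log(odds) of the positive class (assumes both classes occur).
    p0 = sum(ys) / len(ys)
    f0 = math.log(p0 / (1 - p0))
    F = [f0] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        # Step [A]: pseudo-residuals; for this loss they are observed - predicted.
        probs = [sigmoid(f) for f in F]
        residuals = [y - p for y, p in zip(ys, probs)]
        # Step [B]: fit a stump to the residuals; its two leaves are the regions.
        threshold = fit_stump(xs, residuals)
        # Step [C]: one output value per leaf (Newton-step approximation).
        gammas = {}
        for is_left in (True, False):
            idx = [i for i, x in enumerate(xs) if (x < threshold) == is_left]
            numerator = sum(residuals[i] for i in idx)
            denominator = sum(probs[i] * (1 - probs[i]) for i in idx)
            gammas[is_left] = numerator / denominator
        # Step [D]: update the predictions, scaled by the learning rate.
        F = [f + learning_rate * gammas[x < threshold] for f, x in zip(F, xs)]
        stumps.append((threshold, gammas))
    # Step 3: return the initial value, the fitted stumps, and the final log(odds).
    return f0, stumps, F

# Age and Loves Troll 2 (yes = 1, no = 0) from the sample dataset in this notebook.
f0, stumps, F = gradient_boost([12, 87, 44], [1, 1, 0])
print([round(sigmoid(f), 3) for f in F])
```

With only three rows and one feature, the sketch is meant to show the order of the steps, not to produce a useful model.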

# Let's start from the Input

The input is $Data$ $\{(x_{i}, y_{i})\}_{i=1}^{N}$ and a differentiable $Loss$ $Function$. This describes, in an abstract way, the $Training$ $Dataset$ and the method we will use to evaluate how well the $Model$ fits it.

Here we create a small sample dataset to explain the math of this algorithm in detail.

In [2]:
import pandas as pd

In [3]:
data = pd.DataFrame()

In [4]:
data['Like Popcorn'] = ['yes','no','no']

In [5]:
data['Age'] = [12,87,44]

In [6]:
data['favorite'] = ['Blue','Green','Blue']

In [7]:
data['Loves Troll_2']=['yes','yes','no']

In [8]:
data.head()

|   | Like Popcorn | Age | favorite | Loves Troll_2 |
|---|--------------|-----|----------|---------------|
| 0 | yes          | 12  | Blue     | yes           |
| 1 | no           | 87  | Green    | yes           |
| 2 | no           | 44  | Blue     | no            |


This table is the $Training$ $Dataset$.

$x_{i}$ refers to a row of measurements that we will use to $predict$ whether someone $Loves$ $Troll2$,

and $y_{i}$ refers to whether or not someone $Loves$ $Troll2$.
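To make the $(x_{i}, y_{i})$ notation concrete, here is a small sketch that splits the sample data into measurements and labels. It uses plain Python lists that copy the DataFrame columns, and it assumes the encoding yes = 1, no = 0 for $y_{i}$; the variable names are illustrative.

```python
# Columns copied from the sample DataFrame, as plain Python lists.
like_popcorn = ['yes', 'no', 'no']
age = [12, 87, 44]
favorite = ['Blue', 'Green', 'Blue']
loves_troll_2 = ['yes', 'yes', 'no']

# x_i: one row of measurements per person.
x = list(zip(like_popcorn, age, favorite))

# y_i: 1 if the person loves Troll 2, 0 otherwise (assumed encoding).
y = [1 if answer == 'yes' else 0 for answer in loves_troll_2]

for i, (x_i, y_i) in enumerate(zip(x, y), start=1):
    print(f"x_{i} = {x_i}, y_{i} = {y_i}")
```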

Now we need a differentiable $Loss$ $Function$ that works for $classification$.

I think the easiest way to understand the most commonly used $Loss$ $Function$ for classification is to show how it works on a graph.

The $red$ $dot$, with a $probability$ $of$ $Loving$ $Troll2$ $= 0$, represents the one person who does not $Love$ $Troll2$,

and the $Blue$ $Dots$, with a $probability$ $of$ $Loving$ $Troll2 = 1$, represent the two people who $Love$ $Troll2$.

In other words, the $Red$ and $Blue$ dots are the $observed$ values.

We can draw a dotted line to represent the $predicted$ $probability$ that someone $Loves$ $Troll2$.

In this example, I have set the $predicted$ $probability$ to $0.67$.

Now, just as we do for $Logistic$ $Regression$, we can calculate the $log(likelihood)$ of the data given the predicted probability.

The log(likelihood) of the observed data given the prediction is equal to:

\begin{equation}
\sum_{i=1}^N\left[y_{i}{\log(p)} + (1-y_{i}){\log(1-p)}\right]
\end{equation}

$p$ refers to the predicted probability, which is $0.67$ in this example,

and $y_{i}$ refers to the observed values for $Loves$ $Troll2$.

For the two people who $Love$ $Troll2$, $y_{i} = 1$, which means the term ${(1-y_{i}){\log}(1-p)}$ will be $0$,

leaving just ${\log(p)}$.

In contrast, for the one person who does not $Love$ $Troll2$, $y_{i}=0$, which means the term $y_{i}{\log(p)}$ will be zero,

leaving just ${\log(1-p)}$.

Now we use the summation to calculate the $log(likelihood)$ of all three $observed$ values.

We will start by calculating the $log(likelihood)$ for the first person.

Because this person $Loves$ $Troll2$, $y_{1}=1$.

\begin{equation}
y_{1}*{\log(p)} + (1-y_{1})*{\log(1-p)}
\end{equation}

\begin{equation}
1*{\log(p)} + (1-1)*{\log(1-p)}
\end{equation}

\begin{equation}
1*{\log(p)}  = {\log(0.67)}
\end{equation}

So the $log(likelihood)$ for the first person, given the predicted probability, is ${\log(0.67)}$.

Now let's calculate the $log(likelihood)$ for the second person, who also $Loves$ $Troll2$, so ${y_{2} = 1}$.

\begin{equation}
y_{2}*{\log(p)} + (1-y_{2})*{\log(1-p)}
\end{equation}

\begin{equation}
1*{\log(p)} + (1-1)*{\log(1-p)}
\end{equation}

\begin{equation}
1*{\log(p)}  = {\log(0.67)}
\end{equation}

The same steps apply to the third person, who does not $Love$ $Troll2$, so $y_{3} = 0$.

\begin{equation}
y_{3}*{\log(p)} + (1-y_{3})*{\log(1-p)}
\end{equation}

\begin{equation}
0*{\log(p)} + (1-0)*{\log(1-p)}
\end{equation}

\begin{equation}
1*{\log(1-p)} = {\log(1-0.67)}
\end{equation}

\begin{equation}
{\log(1-0.67)} = {\log(0.33)}
\end{equation}
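The three per-person calculations above can be checked numerically. The sketch below assumes natural logarithms (the text does not state the base) and uses the observed values $1, 1, 0$ with the predicted probability $0.67$; `log_likelihood_term` is an illustrative name.

```python
import math

p = 0.67              # predicted probability from the example
observed = [1, 1, 0]  # y_1, y_2, y_3

def log_likelihood_term(y_i, p):
    """One person's contribution: y_i*log(p) + (1 - y_i)*log(1 - p)."""
    return y_i * math.log(p) + (1 - y_i) * math.log(1 - p)

terms = [log_likelihood_term(y_i, p) for y_i in observed]
total = sum(terms)

for i, term in enumerate(terms, start=1):
    print(f"person {i}: {term:.4f}")
print(f"log(likelihood) of all three: {total:.4f}")
```

The first two terms equal $\log(0.67)$ and the third equals $\log(0.33)$, matching the derivation.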

Note: the better the prediction, the larger the $log(likelihood)$, and this is why, when doing $Logistic$ $Regression$, the goal is to maximize the $log(likelihood)$.

That means that if we want to use the $log(likelihood)$ as a $loss$ $function$, where smaller values represent better-fitting models, then we need to multiply the $log(likelihood)$ by $-1$.

\begin{equation}
-\sum_{i=1}^N\left[y_{i}{\log(p)} + (1-y_{i}){\log(1-p)}\right]
\end{equation}

Since a $loss$ $function$ sometimes deals with only one sample at a time, we get rid of the summation.

\begin{equation}
-[y*{\log(p)} + (1-y)*{\log(1-p)}]
\end{equation}

To make it easier to read, we will replace $y$ with $observed$.

\begin{equation}
-[observed*{\log(p)} + (1-observed)*{\log(1-p)}]
\end{equation}
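As a quick illustration of "smaller loss means better fit", the sketch below evaluates the per-sample loss for the three observed values at a few predicted probabilities. The values $0.5$ and $0.9$ are arbitrary comparison points chosen for this example, natural logarithms are assumed, and the function names are illustrative.

```python
import math

def negative_log_likelihood(observed, p):
    """Per-sample loss: -[observed*log(p) + (1 - observed)*log(1 - p)]."""
    return -(observed * math.log(p) + (1 - observed) * math.log(1 - p))

observed_values = [1, 1, 0]

def total_loss(p):
    """Sum of the per-sample losses over the three people."""
    return sum(negative_log_likelihood(obs, p) for obs in observed_values)

for p in (0.5, 0.67, 0.9):
    print(f"p = {p}: total loss = {total_loss(p):.4f}")
```

Of these three candidates, $p = 0.67$ gives the smallest total loss, which fits the fact that two of the three people love Troll 2.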

Now we need to transform the $negative$ $log(likelihood)$ so that it is a function of the $predicted$ $log(odds)$ instead of the predicted probability $p$.

We also need to simplify it.

\begin{equation}
-[observed*{\log(p)} + (1-observed)*{\log(1-p)}]
\end{equation}

\begin{equation}
-observed*{\log(p)} - (1-observed)*{\log(1-p)}
\end{equation}

\begin{equation}
-observed*{\log(p)} - {\log(1-p)} +observed*{\log(1-p)} 
\end{equation}

\begin{equation}
-observed*{\log(p)} +observed*{\log(1-p)}  - {\log(1-p)}
\end{equation}

\begin{equation}
-observed*[{\log(\frac{p}{1-p})}]  - {\log(1-p)}
\end{equation}

\begin{equation}
{\log(\frac{p}{1-p})} = log(odds)
\end{equation}
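The algebra above can be sanity-checked numerically: the original loss and the rearranged form, with $\log(\frac{p}{1-p})$ written as the log(odds), should agree for any observed value and any probability strictly between 0 and 1. The test probabilities below are arbitrary, natural logarithms are assumed, and the function names are illustrative.

```python
import math

def loss_from_probability(observed, p):
    """Original form: -[observed*log(p) + (1 - observed)*log(1 - p)]."""
    return -(observed * math.log(p) + (1 - observed) * math.log(1 - p))

def loss_rearranged(observed, p):
    """Rearranged form: -observed*log(odds) - log(1 - p)."""
    log_odds = math.log(p / (1 - p))
    return -observed * log_odds - math.log(1 - p)

for observed in (0, 1):
    for p in (0.1, 0.33, 0.67, 0.9):
        a = loss_from_probability(observed, p)
        b = loss_rearranged(observed, p)
        print(f"observed={observed}, p={p}: {a:.6f} vs {b:.6f}")
```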