The intuition behind gradient boosting is that we start with some initial model, usually a very simple one such as an untuned tree, and then use its predictions to fit an additive correction term (the "bias"). Since the final prediction has the form $\hat{y}+bias$, we are still approximating our function, but by learning the bias term rather than by tuning coefficients.

#### Intuitive description of the algorithm
- The starting model, on top of which we are going to optimize the bias, could be anything, even a constant such as the mean of the target values. Plugging in $X$ gives the initial predictions $\hat{y}$;
- We calculate the residuals $e=y-\hat{y}$ and build a "weak learner" for the bias term, for example a decision tree that predicts the residuals. Note that in a regression problem, if more than one value is left in a leaf node, we take their average to yield the prediction. To generalize better, we multiply the predicted residuals by a learning rate $\eta$;
- We calculate new predictions $\hat{y}+\eta\times e_1$, where $e_1$ is the weak learner's prediction of the residuals, and then repeat the process until we reach $M$ iterations or the desired quality.
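
The steps above can be sketched in a few lines. This is a minimal illustration rather than the implementation built later in this notebook: the function names `fit_residual_boosting` and `predict_residual_boosting`, the synthetic sine data, and the default hyperparameters are all assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_residual_boosting(X, y, n_rounds=50, eta=0.1, max_depth=2):
    """Hypothetical helper: fit shallow trees to residuals, starting from mean(y)."""
    init = float(np.mean(y))                 # starting model: a constant
    y_hat = np.full(len(y), init)            # initial predictions
    trees = []
    for _ in range(n_rounds):
        residuals = y - y_hat                # e = y - y_hat
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)               # weak learner predicts the residuals
        y_hat = y_hat + eta * tree.predict(X)  # y_hat + eta * e_1
        trees.append(tree)
    return init, trees

def predict_residual_boosting(X, init, trees, eta=0.1):
    """Sum the constant start and every tree's scaled correction."""
    y_hat = np.full(X.shape[0], init)
    for tree in trees:
        y_hat = y_hat + eta * tree.predict(X)
    return y_hat

# Synthetic data, assumed purely for the demonstration
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

init, trees = fit_residual_boosting(X_demo, y_demo)
mse_start = np.mean((y_demo - init) ** 2)
mse_end = np.mean((y_demo - predict_residual_boosting(X_demo, init, trees)) ** 2)
print(mse_start, mse_end)
```

On the training data the error after boosting should be lower than the error of the constant model, because each round removes part of the remaining residual. The same `eta` has to be used for fitting and for prediction.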

#### Mathematically stricter description
- We want to find a function $\hat{f}(x)$ such that $y\approx\hat{f}(x)$. That is, $\hat{f}(x)=\arg\min_{f(x)}L(y,f(x))$, where $L(y,f)$ is some **differentiable** loss function
- Since the space of candidate functions $f(x)$ is infinite, we restrict the problem to a parametric family of functions $f(x,\theta), \theta\in R^d$ (for example, decision trees) and look for the best $\theta$ instead of some "function": $\hat{\theta}=\arg\min_{\theta}E_{x,y}[L(y,f(x,\theta))]$, where $E$ is the expected value, estimated in practice by the mean over the training set.
    > **Important reminder:** just as when training a neural network, we find the best parameters by gradient descent with a step size $\eta\in(0,1]$, i.e. the final parameter vector is the initial value plus a sum of negative gradient steps. Therefore, for better intuition we can write $\hat{\theta}=\sum_{t=0}^{T}\hat{\theta}_t$, where $\hat{\theta}_0$ is the initial value and $\hat{\theta}_t=-\eta\nabla_{\theta}L_{\theta}(\hat{\theta}_0+\dots+\hat{\theta}_{t-1})$ for $t\geq 1$. Here $L_{\theta}(\theta)=\sum^N_{i=1}L(y_i,f(x_i,\theta))$ is the loss summed over the $N$ training examples.
- So far it may seem that GBMs are nothing more than a fancier way to describe regular linear models. To clear up the confusion, note that GBMs approximate a function as a sum of incremental improvements, **each being a function**, i.e. $\hat{f}(x)=\sum_{i=0}^M\hat{f_i}(x)$, where each $\hat{f_i}(x)$ is restricted to some function family, $\hat{f_i}(x)=\rho_i h(x,\theta_i)$. So at every step of finding another function (from now on we will also refer to them as "models") we need to select not only its parameters $\theta_i$ but also an optimal coefficient $\rho_i\in R$.
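
To make one such step concrete, here is a hedged sketch of a single stagewise update for the square loss: fit $h$ to the negative gradient of the loss with respect to the current predictions, then choose $\rho$ by minimizing the loss along that direction. The names `boosting_step`, `square_loss`, and `square_loss_gradient`, as well as the toy data, are assumptions for illustration only. The closed-form $\rho$ used here is specific to the square loss; other losses would generally need a numerical line search.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def square_loss(y, f):
    return 0.5 * np.sum((y - f) ** 2)

def square_loss_gradient(y, f):
    # Derivative of 0.5 * (y - f)^2 with respect to the prediction f
    return -(y - f)

def boosting_step(X, y, f_prev, max_depth=2):
    """Hypothetical helper: one step f_prev -> f_prev + rho * h(x, theta)."""
    # Pseudo-residuals: the negative gradient at the current predictions
    pseudo_residuals = -square_loss_gradient(y, f_prev)
    h = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
    h.fit(X, pseudo_residuals)
    h_pred = h.predict(X)
    # Closed-form minimizer of square_loss(y, f_prev + rho * h_pred) over rho
    denom = np.dot(h_pred, h_pred)
    rho = np.dot(pseudo_residuals, h_pred) / denom if denom > 0 else 0.0
    return h, rho, f_prev + rho * h_pred

# Toy data, assumed purely for the demonstration
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0])

f0 = np.full(len(y_toy), y_toy.mean())      # f_0: constant model
h1, rho1, f1 = boosting_step(X_toy, y_toy, f0)
print(square_loss(y_toy, f0), square_loss(y_toy, f1), rho1)
```

With the square loss and a least-squares tree, $\rho$ should come out at approximately 1, because the tree's leaf means are already the best least-squares fit to the pseudo-residuals. The coefficient becomes informative for losses where the fitted tree and the loss disagree on the step length.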


In [None]:
import numpy as np
from sklearn.metrics import accuracy_score

class Loss(object):
    def loss(self, y_true, y_pred):
        raise NotImplementedError()

    def gradient(self, y, y_pred):
        raise NotImplementedError()

    def acc(self, y, y_pred):
        return 0

class SquareLoss(Loss):
    def __init__(self): pass

    def loss(self, y, y_pred):
        return 0.5 * np.power((y - y_pred), 2)

    def gradient(self, y, y_pred):
        return -(y - y_pred)

class CrossEntropy(Loss):
    def __init__(self): pass

    def loss(self, y, p):
        # Clip probabilities to avoid log(0)
        p = np.clip(p, 1e-15, 1 - 1e-15)
        return - y * np.log(p) - (1 - y) * np.log(1 - p)

    def acc(self, y, p):
        return accuracy_score(np.argmax(y, axis=1), np.argmax(p, axis=1))

    def gradient(self, y, p):
        # Avoid division by zero
        p = np.clip(p, 1e-15, 1 - 1e-15)
        return - (y / p) + (1 - y) / (1 - p)

In [None]:
from sklearn.tree import DecisionTreeRegressor
from tqdm import tqdm

class gbm():
    
    def __init__(self, eta, m, min_samples, min_impurity,
                 max_depth, is_reg) -> None:
        self.eta = eta
        self.m = m
        self.min_samples = min_samples
        self.min_impurity = min_impurity
        self.max_depth = max_depth
        self.is_reg = is_reg

        self