# Exploring the `fit` method of the `GBDT` class
First, we walk through the binary classification case.

In [1]:
# open('data/credit.data.csv').read()

In [2]:
from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"

In [3]:
from random import sample
from gbdt.data import DataSet
from gbdt.model import GBDT
from gbdt.model import BinomialDeviance

In [4]:
dataset = DataSet('data/credit.data.csv')
dataset.get_label_size()

2

#### The initial properties

In [5]:
max_iter = 20
sample_rate = 0.8
learn_rate = 0.5
max_depth = 7
loss_type = 'binary-classification'
split_points = 0

trees = dict()

#### Loss function: Binary classification

In [6]:
loss = BinomialDeviance(n_classes=dataset.get_label_size())

In [7]:
f = dict()

In [8]:
loss.initialize(f, dataset)
len(f.keys())
# f is a dict with 653 key-value pairs: the keys are the integers 1 to 653, and every value is initialised to 0

653

In [9]:
train_data = dataset.get_instances_idset()
# train_data is a set containing the IDs 1 to 653

Binary classification:

The residual is $Residual_i = \frac{2y_i}{1+\exp(2y_i f_i)}$, which is the negative gradient of the log loss (binomial deviance) with respect to $f_i$.
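
To make the residual formula concrete, here is a minimal sketch in plain Python. The helper name `binomial_residual` is hypothetical and is not part of the `gbdt` package; it only evaluates the formula above for a single label `y` in {-1, 1} and a single score `f`.

```python
from math import exp


def binomial_residual(y, f):
    """Evaluate 2y / (1 + exp(2yf)) for a label y in {-1, 1} and a score f."""
    return 2.0 * y / (1.0 + exp(2.0 * y * f))


# With every score initialised to 0, the residual equals the label itself.
print(binomial_residual(1, 0.0))   # 1.0
print(binomial_residual(-1, 0.0))  # -1.0

# A confident, correct score leaves a residual close to 0,
# while a confident, wrong score leaves a residual close to 2y.
print(binomial_residual(1, 3.0))
print(binomial_residual(1, -3.0))
```

Since the scores start at 0 in this walkthrough, the first tree is effectively fitted to the labels themselves.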


Why does the loss take this form? See:
- [Scikit Binomial Deviance Loss Function](https://stats.stackexchange.com/questions/157870/scikit-binomial-deviance-loss-function)
- [How to derive bernoulli deviance](https://stats.stackexchange.com/questions/208331/how-to-derive-bernoulli-deviance)
- It is called the binomial negative log-likelihood loss: https://pdfs.semanticscholar.org/7efc/245d8ad4cbd6489e3dca6688264bf4f83579.pdf
- In ESL (p. 346), it is called __binomial deviance__ or __binomial negative log-likelihood__
- [GBDT model](https://www.jianshu.com/p/0bc32c8e4ca8) (in Chinese)
- [How is the residual computed when GBDT trains a classifier?](https://blog.csdn.net/mmc2015/article/details/52398488) (in Chinese)
- Friedman's paper: [Greedy Function Approximation: A Gradient Boosting Machine](http://docs.salford-systems.com/GreedyFuncApproxSS.pdf)
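
As a quick check on these references, the residual can be derived as the negative gradient of the negative binomial log-likelihood $L(y, F) = \log(1+\exp(-2yF))$ with $y \in \{-1, 1\}$. The derivation below is a sketch that uses only the chain rule, then multiplies numerator and denominator by $\exp(2yF)$:

$$
\begin{aligned}
\frac{\partial L}{\partial F}
  &= \frac{-2y\,\exp(-2yF)}{1+\exp(-2yF)}
   = \frac{-2y}{1+\exp(2yF)}, \\
-\left.\frac{\partial L}{\partial F}\right|_{y=y_i,\,F=f_i}
  &= \frac{2y_i}{1+\exp(2y_i f_i)} = Residual_i .
\end{aligned}
$$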

The negative binomial log-likelihood loss is $L(y, F) = \log(1+ \exp(-2yF)),~~~ y\in\{-1,1\}$. The coefficient of 2 appears because the labels here are -1 and 1. If the label is 0 and 1, 

In [10]:
iter = 1 # for iter in range(1, max_iter+1):
subset = train_data
print('the size of initial subset is', len(subset))
if 0 < sample_rate < 1:
    # random.sample needs a sequence; passing a set raises TypeError from Python 3.11
    subset = sample(sorted(subset), int(len(subset)*sample_rate))
    print('the size of sampled subset is', len(subset), ', only 80% of the initial size')
residual = loss.compute_residual(dataset, subset, f) # residual is a dict, in which keys are the sampled IDs 
print(len(residual))

leaf_nodes = []
targets = residual


the size of initial subset is 653
the size of sampled subset is 522 , only 80% of the initial size
522
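
The sampling and residual steps of the cell above can be reproduced on toy data without the `gbdt` package. This is a sketch under stated assumptions: the labels are made up, the seed is arbitrary, and the residual is computed directly from the formula quoted earlier rather than through `BinomialDeviance`.

```python
from math import exp
from random import Random

# Hypothetical toy labels keyed by instance ID (not the credit data).
labels = {i: (1 if i % 2 == 0 else -1) for i in range(1, 11)}
f = {i: 0.0 for i in labels}   # scores start at 0, as in the walkthrough
sample_rate = 0.8

rng = Random(0)                # arbitrary fixed seed for repeatability
ids = sorted(labels)           # sample() needs a sequence, not a set
subset = rng.sample(ids, int(len(ids) * sample_rate))

# Residual 2y / (1 + exp(2yf)) for each sampled ID.
residual = {
    i: 2.0 * labels[i] / (1.0 + exp(2.0 * labels[i] * f[i]))
    for i in subset
}

print(len(residual))                                   # 8
print(all(residual[i] == labels[i] for i in subset))   # True
```

Only the sampled IDs receive a residual, so the tree built in this iteration sees 80% of the instances.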
