# Gradient Boosting

We will use the `boston` dataset from `sklearn.datasets`.

We hold out 25% of the objects for quality control, which gives `X_train`, `y_train` and `X_test`, `y_test`.

We implement a simple version of gradient boosting over regression trees for the squared loss function.

In [33]:
from sklearn import ensemble, cross_validation, metrics, tree

import numpy as np

## Loading and exploring the data

In [26]:
from sklearn.datasets import load_boston
boston = load_boston()
print 'data:', boston.data.shape, '\n', 'target:', boston.target.shape

data: (506L, 13L) 
target: (506L,)


In [23]:
print boston

{'data': array([[  6.32000000e-03,   1.80000000e+01,   2.31000000e+00, ...,
          1.53000000e+01,   3.96900000e+02,   4.98000000e+00],
       [  2.73100000e-02,   0.00000000e+00,   7.07000000e+00, ...,
          1.78000000e+01,   3.96900000e+02,   9.14000000e+00],
       [  2.72900000e-02,   0.00000000e+00,   7.07000000e+00, ...,
          1.78000000e+01,   3.92830000e+02,   4.03000000e+00],
       ..., 
       [  6.07600000e-02,   0.00000000e+00,   1.19300000e+01, ...,
          2.10000000e+01,   3.96900000e+02,   5.64000000e+00],
       [  1.09590000e-01,   0.00000000e+00,   1.19300000e+01, ...,
          2.10000000e+01,   3.93450000e+02,   6.48000000e+00],
       [  4.74100000e-02,   0.00000000e+00,   1.19300000e+01, ...,
          2.10000000e+01,   3.96900000e+02,   7.88000000e+00]]), 'feature_names': array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT'], 
        18.9,  15. ,  18.9,  21.7,  20.4,  18.2,  19.9,  23.1,  17

In [12]:
print boston.data.shape

(506L, 13L)


In [13]:
boston.feature_names

array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT'], 
      dtype='|S7')

In [21]:
print boston.DESCR

Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
      

### Splitting the data for analysis

In [29]:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(boston.data, 
                                                                     boston.target, 
                                                                     test_size = 0.25,
                                                                     random_state = 1)

In [30]:
# test
X_train.shape

(379L, 13L)
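
The `cross_validation` module used above was deprecated in scikit-learn 0.18 and removed in 0.20; in newer releases `train_test_split` lives in `sklearn.model_selection`. The sketch below shows the same split with the newer import. The zero-filled arrays are placeholders (an assumption for illustration) that only mimic the shape of `boston.data` and `boston.target`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as boston.data / boston.target
# (hypothetical stand-ins, not the real dataset).
X = np.zeros((506, 13))
y = np.zeros(506)

# Same arguments as in the cell above: 25% of the objects go to the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

print(X_train.shape)
print(X_test.shape)
```

With 506 objects and `test_size=0.25`, the training part has 379 rows, which matches the shape printed above.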

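For the squared loss $L(Y, z) = (z - Y)^2$, the derivative with respect to the prediction $z$, taken at the current ensemble $A_{n-1}$, is

$$L'_z\bigl(Y_i, A_{n-1}(x_i)\bigr) = 2\,\bigl(A_{n-1}(x_i) - Y_i\bigr) \;\longrightarrow\; A_{n-1}(x_i) - Y_i,$$

where the constant factor 2 is dropped. Each new tree is fit to the negative of this quantity, that is, to the residual $Y_i - A_{n-1}(x_i)$.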
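
As a quick sanity check of the derivative above, the sketch below compares the analytic gradient of the squared loss with a central finite difference. The function names and the two numbers `y_i` and `z_i` are hypothetical and chosen only for illustration.

```python
def squared_loss(y, z):
    # Squared loss for one object: y is the target, z is the prediction.
    return (z - y) ** 2

def squared_loss_grad(y, z):
    # Analytic derivative of the squared loss with respect to z.
    return 2.0 * (z - y)

y_i, z_i = 21.7, 18.9  # hypothetical target and current ensemble prediction
eps = 1e-6

# Central finite difference as an independent estimate of the derivative.
numeric = (squared_loss(y_i, z_i + eps) - squared_loss(y_i, z_i - eps)) / (2 * eps)
analytic = squared_loss_grad(y_i, z_i)

# The negative gradient, with the factor 2 dropped, is the residual.
residual = y_i - z_i

print(numeric)
print(analytic)
print(residual)
```

The residual is the quantity the next tree is trained to predict.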

## Part 1

We build N = 50 trees sequentially in a loop, following the gradient boosting procedure. Quality is measured with `mean_squared_error` from `sklearn.metrics`.

We start with a list for the `DecisionTreeRegressor` objects (they serve as the base algorithms) and an array for the real numbers (the coefficients in front of the base algorithms).

In [128]:
def gbm_predict(X):
    # For each object x: weighted sum of the predictions of all base algorithms.
    return [sum([coeff * algo.predict([x])[0] for algo, coeff in zip(base_algorithms_list, coefficients_list)])
            for x in X]
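
`gbm_predict` calls `predict` once per object and per tree, which is slow on larger datasets. A possible vectorized variant calls `predict` once per tree on the whole matrix. In the sketch below, `ConstantModel` and `gbm_predict_vectorized` are hypothetical names, and the constant models merely stand in for fitted trees so that the example runs on its own.

```python
import numpy as np

class ConstantModel(object):
    """Hypothetical stand-in for a fitted tree: always predicts one value."""
    def __init__(self, value):
        self.value = value

    def predict(self, X):
        return np.full(len(X), self.value)

base_algorithms_list = [ConstantModel(10.0), ConstantModel(2.0), ConstantModel(-1.0)]
coefficients_list = [0.9, 0.9, 0.9]

def gbm_predict(X):
    # Per-object version, as in the cell above.
    return [sum([coeff * algo.predict([x])[0]
                 for algo, coeff in zip(base_algorithms_list, coefficients_list)])
            for x in X]

def gbm_predict_vectorized(X):
    # One predict() call per tree on the whole matrix; returns a NumPy array.
    return sum(coeff * algo.predict(X)
               for algo, coeff in zip(base_algorithms_list, coefficients_list))

X = np.zeros((4, 13))
print(gbm_predict(X))
print(gbm_predict_vectorized(X))
```

Both functions should give the same numbers; the vectorized one returns an array instead of a list.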

In [129]:
base_algorithms_list = []

# A single estimator object, reused on every iteration.
DT_Regressor = tree.DecisionTreeRegressor(max_depth=5, random_state=42)

N = 50
coefficients_list = np.full(N, 0.9)  # the same coefficient, 0.9, for every tree
b = np.zeros(len(y_train))           # running sum of the trees' predictions on X_train

for i in range(N):
    # fit() returns the estimator itself, so every entry appended here is the
    # same DT_Regressor object, which ends up holding only the last fitted tree.
    base_algorithms_list.append(DT_Regressor.fit(X_train, y_train - coefficients_list[i] * b))
    b += DT_Regressor.predict(X_train)
    


In [130]:
gbm_predict(X_train)

[-0.0085389294452014822,
 -0.0085389294452014822,
 -0.053125777035471777,
 0.046549086086900004,
 -0.0085389294452014822,
 -0.053125777035471777,
 -0.0085389294452014822,
 0.046549086086900004,
 0.046549086086900004,
 -0.0085389294452014822,
 -0.0085389294452014822,
 0.046549086086900004,
 -0.20664818011525182,
 0.17138918249475221,
 -0.0085389294452014822,
 -0.0085389294452014822,
 -0.053125777035471777,
 -0.053125777035471777,
 0.046549086086900004,
 0.046549086086900004,
 -0.0085389294452014822,
 0.046549086086900004,
 -0.0085389294452014822,
 -0.0085389294452014822,
 -0.0085389294452014822,
 -0.0085389294452014822,
 0.046549086086900004,
 -0.0085389294452014822,
 -0.053125777035471777,
 -0.0085389294452014822,
 0.046549086086900004,
 0.046549086086900004,
 -0.0085389294452014822,
 -0.053125777035471777,
 -0.0078632717986515319,
 0.33811608096550899,
 -0.0085389294452014822,
 -0.0085389294452014822,
 0.046549086086900004,
 -0.053125777035471777,
 -0.0085389294452014822,
 -0.00853892

In [131]:
y_predict = gbm_predict(X_test)

In [132]:
RMSE = metrics.mean_squared_error(y_test, y_predict) ** 0.5

In [133]:
RMSE

25.139426072591505
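
An RMSE of about 25 is large for targets that are mostly in the range of 15 to 25, and the ensemble predictions printed above are all close to zero. A likely cause is that the loop appends the return value of `DT_Regressor.fit(...)`, and `fit` returns the estimator itself, so the list holds 50 references to one tree that was last fit on the final residuals. The sketch below shows one possible correction: a fresh tree on every iteration. It is an illustration under stated assumptions rather than a rerun of the notebook: it uses synthetic data from `make_regression` (`load_boston` was removed from scikit-learn in version 1.2) and the newer `sklearn.model_selection` import, so its numbers are not comparable with the RMSE above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; this is not the Boston dataset).
X, y = make_regression(n_samples=500, n_features=13, noise=10.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

N = 50
coefficients_list = np.full(N, 0.9)
base_algorithms_list = []

def gbm_predict(X):
    # Weighted sum of the predictions of all trees fitted so far.
    pred = np.zeros(len(X))
    for algo, coeff in zip(base_algorithms_list, coefficients_list):
        pred += coeff * algo.predict(X)
    return pred

for i in range(N):
    # Residuals of the current ensemble: the negative gradient of the
    # squared loss, up to the constant factor 2.
    residuals = y_train - gbm_predict(X_train)
    # A new tree per iteration, so earlier trees are not overwritten.
    tree_i = DecisionTreeRegressor(max_depth=5, random_state=42)
    tree_i.fit(X_train, residuals)
    base_algorithms_list.append(tree_i)

rmse_train = mean_squared_error(y_train, gbm_predict(X_train)) ** 0.5
rmse_test = mean_squared_error(y_test, gbm_predict(X_test)) ** 0.5
print("train RMSE: %.3f, test RMSE: %.3f" % (rmse_train, rmse_test))
```

The only change to the boosting logic is that each iteration creates its own `DecisionTreeRegressor`; the residual target and the constant coefficient 0.9 follow the original loop.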