**Stochastic Processes, Applied Track. Seminar 1.**

**Artem Kupriyanov**

In [298]:
import numpy as np
import pandas as pd
import scipy.stats as sps
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

# Linear Regression in Python

In `scikit-learn`, linear regression is implemented in the class `sklearn.linear_model.LinearRegression`.

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

In [400]:
from sklearn import datasets
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split  # replaces the removed sklearn.cross_validation module

$$ MAE = \frac{1}{n}\sum\limits_{i=1}^{n} \left|y\_true[i] - y\_pred[i]\right|$$
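
As a quick check of the formula, the sketch below computes MAE by hand with NumPy and compares it with `mean_absolute_error`. The arrays `y_true` and `y_pred` are toy values invented for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Toy values chosen for illustration only (not taken from the dataset).
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

# MAE by the formula: the mean of the absolute residuals.
mae_manual = np.mean(np.abs(y_true - y_pred))
mae_sklearn = mean_absolute_error(y_true, y_pred)

print(mae_manual, mae_sklearn)
```

The absolute residuals are 0.5, 0, 1.5 and 1, so both values equal 3 / 4 = 0.75.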

## Dataset description

In [401]:
boston = datasets.load_boston()
print(boston.DESCR)

Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
      

In [402]:
X = boston.data
y = boston.target
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    train_size=0.8, 
                                                    random_state=42)
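
To see what the split produces, here is a minimal sketch on placeholder data with the same shape as the dataset (506 objects, 13 features). The names `X_demo`, `y_demo` and the random values are assumptions for illustration, and the import assumes a scikit-learn version in which `train_test_split` lives in `sklearn.model_selection`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as the dataset (506 objects, 13 features).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(506, 13))
y_demo = rng.normal(size=506)

# train_size=0.8 keeps 80% of the rows for training; the rest go to the test set.
# random_state fixes the shuffle so the split is reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=42
)

print(X_tr.shape, X_te.shape)
```

With 506 rows this gives 404 training objects and 102 test objects.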

In [403]:
print(X_train.shape)

(404, 13)


In [404]:
pd.DataFrame(X_train).head(10)

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15.0234 | 0.0 | 18.1 | 0.0 | 0.614 | 5.304 | 97.3 | 2.1007 | 24.0 | 666.0 | 20.2 | 349.48 | 24.91 |
| 1 | 0.62739 | 0.0 | 8.14 | 0.0 | 0.538 | 5.834 | 56.5 | 4.4986 | 4.0 | 307.0 | 21.0 | 395.62 | 8.47 |
| 2 | 0.03466 | 35.0 | 6.06 | 0.0 | 0.4379 | 6.031 | 23.3 | 6.6407 | 1.0 | 304.0 | 16.9 | 362.25 | 7.83 |
| 3 | 7.05042 | 0.0 | 18.1 | 0.0 | 0.614 | 6.103 | 85.1 | 2.0218 | 24.0 | 666.0 | 20.2 | 2.52 | 23.29 |
| 4 | 0.7258 | 0.0 | 8.14 | 0.0 | 0.538 | 5.727 | 69.5 | 3.7965 | 4.0 | 307.0 | 21.0 | 390.95 | 11.28 |
| 5 | 0.19186 | 0.0 | 7.38 | 0.0 | 0.493 | 6.431 | 14.7 | 5.4159 | 5.0 | 287.0 | 19.6 | 393.68 | 5.08 |
| 6 | 0.03961 | 0.0 | 5.19 | 0.0 | 0.515 | 6.037 | 34.5 | 5.9853 | 5.0 | 224.0 | 20.2 | 396.9 | 8.01 |
| 7 | 0.02055 | 85.0 | 0.74 | 0.0 | 0.41 | 6.383 | 35.7 | 9.1876 | 2.0 | 313.0 | 17.3 | 396.9 | 5.77 |
| 8 | 15.1772 | 0.0 | 18.1 | 0.0 | 0.74 | 6.152 | 100.0 | 1.9142 | 24.0 | 666.0 | 20.2 | 9.32 | 26.45 |
| 9 | 14.4383 | 0.0 | 18.1 | 0.0 | 0.597 | 6.852 | 100.0 | 1.4655 | 24.0 | 666.0 | 20.2 | 179.36 | 19.78 |


## Linear Regression

The objective being minimized is $$Q = || Y -  X\theta||_2^2 \rightarrow \min\limits_{\theta}$$

Import the linear regression class:
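
The sketch below shows that `LinearRegression` solves this least-squares problem: the solution from `np.linalg.lstsq` on a design matrix with an added column of ones coincides with the fitted intercept and coefficients. The synthetic data and the names `X_demo`, `y_demo`, `theta_true` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: a known linear relationship plus a small amount of noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
theta_true = np.array([1.5, -2.0, 0.5])
y_demo = X_demo @ theta_true + 4.0 + 0.1 * rng.normal(size=100)

# Add a column of ones so that the intercept becomes part of theta.
X_aug = np.column_stack([np.ones(len(X_demo)), X_demo])
theta_hat, *_ = np.linalg.lstsq(X_aug, y_demo, rcond=None)

model = LinearRegression().fit(X_demo, y_demo)

print(theta_hat)                     # [intercept, coefficients] from least squares
print(model.intercept_, model.coef_) # the same numbers from scikit-learn
```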

In [405]:
from sklearn.linear_model import LinearRegression

In [406]:
regressor = LinearRegression()

Fit the model (this minimizes the objective above):

In [407]:
regressor.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

After fitting, the attribute `coef_` holds the estimate $\widehat{\theta}$ (the intercept is stored separately in `intercept_`):
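
As an illustration of how the fitted parameters are used, the following sketch reproduces `predict` by hand as `X @ coef_ + intercept_`. The data and the names `X_demo`, `y_demo` are invented for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data, for illustration only.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(50, 4))
y_demo = X_demo @ np.array([2.0, 0.0, -1.0, 3.0]) + 1.0

model = LinearRegression().fit(X_demo, y_demo)

# One coefficient per feature; the intercept is a separate scalar.
print(model.coef_.shape, model.intercept_)

# predict() is the linear map defined by coef_ and intercept_.
manual = X_demo @ model.coef_ + model.intercept_
print(np.allclose(manual, model.predict(X_demo)))
```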

In [408]:
print(len(regressor.coef_), regressor.coef_, sep='\n\n')

13

[ -1.12463481e-01   3.00810168e-02   4.07309919e-02   2.78676719e+00
  -1.72406347e+01   4.43248784e+00  -6.23998173e-03  -1.44848504e+00
   2.62113793e-01  -1.06390978e-02  -9.16398679e-01   1.24516469e-02
  -5.09349120e-01]


Predictions on new objects:

In [409]:
y_pred = regressor.predict(X_test)

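Now evaluate the metric on the test set.
Mean absolute error is chosen because it is expressed in the same units as the target (the median home value, which in this dataset is measured in thousands of dollars), so it shows directly how far off the predictions are on average: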

In [410]:
mean_absolute_error(y_test, y_pred)

3.1915089722747783

## Lasso Regression

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html#sklearn.linear_model.Lasso

The objective being minimized is $$Q = \dfrac{1}{2 n\_samples} || Y -  X\theta||_2^2 +  \alpha \sum\limits_{j = 1}^{d}|\theta_j| \rightarrow \min\limits_{\theta},$$ where $d$ is the number of features.

$\alpha$ is the regularization coefficient.

Lasso encourages feature selection: it drives the weights of uninformative features to exactly zero.
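
The selection effect can be seen on a small synthetic example, sketched below under the assumption that only the first two of five features influence the target. The data, the names `X_demo`, `y_demo` and the value `alpha=0.5` are chosen for illustration only.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
# Only the first two features influence the target; the other three are pure noise.
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 0.1 * rng.normal(size=200)

ols = LinearRegression().fit(X_demo, y_demo)
lasso = Lasso(alpha=0.5).fit(X_demo, y_demo)

print("OLS:  ", np.round(ols.coef_, 3))
print("Lasso:", np.round(lasso.coef_, 3))
# Indices of features whose weight Lasso set exactly to zero.
print("zeroed features:", np.flatnonzero(lasso.coef_ == 0))
```

Ordinary least squares assigns small but nonzero weights to the noise features, while Lasso sets them to zero and shrinks the two informative weights.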

In [411]:
from sklearn.linear_model import Lasso

In [412]:
lasso_regressor = Lasso(alpha=0.4)

Fit the model (this minimizes the objective above):

In [413]:
lasso_regressor.fit(X_train, y_train)

Lasso(alpha=0.4, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)

Predictions on new objects:

In [269]:
y_pred = lasso_regressor.predict(X_test)

In [414]:
mean_absolute_error(y_test, y_pred)

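3.1915089722747783

Note that this value is identical to the linear regression MAE. The prediction cell above was executed earlier in the session (`In [269]`, before `In [409]`), so `y_pred` still held the `LinearRegression` predictions when the metric was computed. Rerun the prediction cell to obtain the Lasso error.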

## Ridge Regression

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html#sklearn.linear_model.Ridge

The objective being minimized is $$Q = || Y -  X\theta||^2_2 + \alpha||\theta||^2_2 \rightarrow \min\limits_{\theta}$$

$\alpha$ is the regularization coefficient.

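The modified least-squares solution (ignoring the intercept) is

$$\widehat{\theta}_{\alpha} = (X^TX + \alpha I)^{-1}X^TY$$

Ridge regression shrinks the feature weights toward zero but, unlike Lasso, generally does not set them exactly to zero.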
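
The closed-form solution can be checked numerically. The sketch below compares it with `Ridge` fitted with `fit_intercept=False`, since the formula has no intercept term. The synthetic data and the names `X_demo`, `y_demo`, `theta_closed`, `theta_ols` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = X_demo @ np.array([1.0, -2.0, 0.0, 0.5]) + 0.1 * rng.normal(size=100)

alpha = 2.0
# Closed-form solution (X^T X + alpha I)^{-1} X^T Y, computed via a linear solve
# rather than an explicit matrix inverse.
theta_closed = np.linalg.solve(
    X_demo.T @ X_demo + alpha * np.eye(X_demo.shape[1]),
    X_demo.T @ y_demo,
)

# fit_intercept=False so that the model matches the formula exactly.
ridge = Ridge(alpha=alpha, fit_intercept=False).fit(X_demo, y_demo)

# Unregularized least-squares solution, for comparing the norms of the weights.
theta_ols = np.linalg.lstsq(X_demo, y_demo, rcond=None)[0]

print(theta_closed, ridge.coef_)
print(np.linalg.norm(theta_closed), np.linalg.norm(theta_ols))
```

The ridge weight vector has a smaller norm than the unregularized one, which is the shrinkage described above.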

In [415]:
from sklearn.linear_model import Ridge

In [416]:
ridge_regressor = Ridge(alpha=2)

Fit the model (this minimizes the objective above):

In [417]:
ridge_regressor.fit(X_train, y_train)

Ridge(alpha=2, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

Predictions on new objects:

In [398]:
y_pred = ridge_regressor.predict(X_test)

In [419]:
mean_absolute_error(y_test, y_pred)

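3.1915089722747783

As with Lasso, this value is identical to the linear regression MAE. The prediction cell above was executed earlier in the session (`In [398]`, before `In [409]`), so `y_pred` still held the `LinearRegression` predictions when the metric was computed. Rerun the prediction cell to obtain the Ridge error.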
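
To compare the three models side by side, it can be convenient to fit, predict and score each one inside a single loop, so that every score is computed from that model's own predictions. The sketch below does this on synthetic stand-in data generated with `make_regression`. The names `X_demo`, `y_demo`, `models` and `scores` are assumptions for illustration, and the resulting numbers say nothing about the Boston dataset.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with the same shape as the dataset used above.
X_demo, y_demo = make_regression(
    n_samples=506, n_features=13, noise=10.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=42
)

models = {
    "linear": LinearRegression(),
    "lasso": Lasso(alpha=0.4),
    "ridge": Ridge(alpha=2),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # Predict and score in the same step, so no stale predictions are reused.
    scores[name] = mean_absolute_error(y_te, model.predict(X_te))
    print(name, scores[name])
```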