# Machine Learning in Python

by [Piotr Migdał](http://p.migdal.pl/)

Inkubator Uniwersytetu Warszawskiego

## 4. Random Forest Regression

In simple words: a random forest that performs a regression task. But what exactly is a random forest?

For now, let's stick to the definition from the scikit-learn documentation:

_A random forest is a meta estimator that fits a number of classifying decision trees on various sub-samples of the dataset and use averaging to improve the **predictive accuracy** and control **over-fitting**._

more: http://blog.yhat.com/posts/random-forests-in-python.html

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.model_selection import cross_val_score, cross_val_predict

%matplotlib inline

### Decision Trees

Before we look at the whole forest, let's consider a single tree. :)

Decision trees are a powerful tool that is easy to understand and visualize.

![source: wikipedia.org](https://upload.wikimedia.org/wikipedia/commons/f/f3/CART_tree_titanic_survivors.png)

See also:

* [A visual introduction to machine learning](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/)
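To make the idea concrete, here is a minimal sketch of a single regression tree. The data is synthetic (a noisy step function invented for illustration, not the bicycle dataset), and the names `X_toy`, `y_toy` and `tree_rules` are placeholders. The tree learns a handful of "feature <= threshold" questions, which `export_text` prints as plain text.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic data (an assumption for illustration): a noisy step in one feature.
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 1))
y_toy = np.where(X_toy[:, 0] < 5, 10.0, 20.0) + rng.normal(0, 0.5, size=200)

# A shallow tree: each internal node asks one "feature <= threshold" question.
tree = DecisionTreeRegressor(max_depth=2, random_state=0)
tree.fit(X_toy, y_toy)

# Print the learned rules as plain text.
tree_rules = export_text(tree, feature_names=['x'])
print(tree_rules)
```

The first split should land close to `x = 5`, where the step in the toy data sits.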

### Random Forest

In principle, many decision trees (the forest) are fitted to randomly sampled subsets of the data (the random part), and the final prediction combines the outputs of all trees: a majority vote for classification, an average for regression.

Materials:
 - http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
 - https://www.youtube.com/watch?v=zvUOpbgtW3c
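The "random" and "forest" parts can be sketched by hand. This is a simplified illustration on synthetic data, not scikit-learn's exact implementation: each tree is fitted to a bootstrap sample (rows drawn with replacement), and the predictions are averaged. A real random forest can additionally restrict which features are considered at each split. The last lines check that a fitted `RandomForestRegressor` predicts the mean of the trees in its `estimators_` attribute.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): a noisy sine curve.
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(300, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(0, 0.3, size=300)

n_trees = 25
trees = []
for i in range(n_trees):
    # "Random": each tree sees a bootstrap sample of the rows.
    idx = rng.integers(0, len(X_toy), size=len(X_toy))
    tree = DecisionTreeRegressor(random_state=i)
    tree.fit(X_toy[idx], y_toy[idx])
    trees.append(tree)

# "Forest": for regression, the prediction is the mean over all trees.
X_new = np.linspace(0, 10, 50).reshape(-1, 1)
per_tree = np.array([t.predict(X_new) for t in trees])  # shape (n_trees, 50)
forest_pred = per_tree.mean(axis=0)

# The library version: its prediction is the average of its fitted trees.
rf = RandomForestRegressor(n_estimators=n_trees, random_state=0).fit(X_toy, y_toy)
manual_mean = np.mean([t.predict(X_new) for t in rf.estimators_], axis=0)
print(np.allclose(manual_mean, rf.predict(X_new)))
```

Averaging is what tames the individual trees: a single deep tree follows the noise closely, while the mean over many trees is smoother.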

In [None]:
bicycles_weather_data = pd.read_csv("data/dane_zsumowane_z_pogoda.csv", index_col=0)

In [None]:
cols = ['temp_avg', 'temp_min', 'temp_max', 'snieg', 'deszcz']
street = 'Banacha'
bicycles_weather_subset = bicycles_weather_data[cols + [street]]
bicycles_weather_subset = bicycles_weather_subset.dropna()

In [None]:
X = bicycles_weather_subset[cols]
y = bicycles_weather_subset[street]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [None]:
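# n_estimators - number of trees in the forest
# criterion - the function that measures the quality of a split:
#   'squared_error' (mean squared error) or 'absolute_error' (mean absolute error);
#   scikit-learn versions before 1.0 called these 'mse' and 'mae'

rforest = RandomForestRegressor(n_estimators=10, criterion='squared_error')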

Compare the code below with Linear Regression. It's basically the same!

In [None]:
rforest.fit(X_train, y_train)

## Model evaluation

In [None]:
rforest.score(X_train, y_train)

In [None]:
rforest.score(X_test, y_test)

In [None]:
y_pred = rforest.predict(X_test)

http://scikit-learn.org/stable/modules/model_evaluation.html

In [None]:
print("MAE: {:.2f}".format(metrics.mean_absolute_error(y_test, y_pred)))
print("RMSE: {:.2f}".format(np.sqrt(metrics.mean_squared_error(y_test, y_pred))))
print("R^2: {:.2f}".format(metrics.r2_score(y_test, y_pred)))

What is $R^2$?

$$
R^2 = 1 - \frac{SS_{\mathrm{resid}}}{SS_{\mathrm{total}}}
$$

[Coefficient of determination](https://en.wikipedia.org/wiki/Coefficient_of_determination)
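The formula can be checked by hand on a few made-up numbers (the arrays `y_true` and `y_hat` below are invented for illustration). Here $SS_{\mathrm{resid}}$ is the sum of squared prediction errors and $SS_{\mathrm{total}}$ is the sum of squared deviations from the mean of the true values. The `score` method of scikit-learn regressors returns this same quantity.

```python
import numpy as np
from sklearn import metrics

# Made-up values for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 10.0])

ss_resid = np.sum((y_true - y_hat) ** 2)          # 0.25 + 0.25 + 0 + 1 = 1.5
ss_total = np.sum((y_true - y_true.mean()) ** 2)  # 9 + 1 + 1 + 9 = 20
r2_manual = 1 - ss_resid / ss_total               # 1 - 1.5 / 20 = 0.925

print("manual R^2:  {:.3f}".format(r2_manual))
print("sklearn R^2: {:.3f}".format(metrics.r2_score(y_true, y_hat)))
```

A model that always predicts the mean of `y_true` gets $R^2 = 0$, a perfect model gets $R^2 = 1$, and a model worse than the mean gets a negative value.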

In [None]:
pd.Series(rforest.feature_importances_, index=cols).plot(kind='barh')

In [None]:
plt.plot(y_test, y_pred, '.')

In [None]:
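# 'squared_error' was called 'mse' in scikit-learn versions before 1.0
rforest = RandomForestRegressor(n_estimators=100, criterion='squared_error')
# Note: the target here is square-root transformed, unlike in the cells below
scores = cross_val_score(rforest, X, np.sqrt(y), cv=5)
print("Cross-validation score: {:.2f} +- {:.2f}".format(scores.mean(), scores.std()))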

In [None]:
scores

In [None]:
plt.plot(y, cross_val_predict(rforest, X, y, cv=5), '.')

In [None]:
metrics.r2_score(y, cross_val_predict(rforest, X, y, cv=5))

See also:

* http://scikit-learn.org/stable/modules/cross_validation.html
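What `cross_val_score` does with `cv=5` can be sketched as an explicit loop. This is an illustration on synthetic data (the names `X_toy`, `y_toy` and `model` are placeholders): the data is split into 5 consecutive folds, a fresh copy of the model is fitted on 4 of them, and its $R^2$ is computed on the remaining one. Fixing `random_state` makes the manual loop reproduce the library result.

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 2))
y_toy = 3 * X_toy[:, 0] - X_toy[:, 1] + rng.normal(0, 1, size=200)

model = RandomForestRegressor(n_estimators=20, random_state=0)

manual_scores = []
for train_idx, test_idx in KFold(n_splits=5).split(X_toy):
    fold_model = clone(model)  # unfitted copy with the same parameters
    fold_model.fit(X_toy[train_idx], y_toy[train_idx])
    # score() returns R^2 on the held-out fold
    manual_scores.append(fold_model.score(X_toy[test_idx], y_toy[test_idx]))
manual_scores = np.array(manual_scores)

library_scores = cross_val_score(model, X_toy, y_toy, cv=5)
print(manual_scores)
print(library_scores)
```

Every row is used for testing exactly once, which is why the cross-validated score is a steadier estimate than a single train/test split.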