# Machine Learning in Python

by [Piotr Migdał](http://p.migdal.pl/) & [Dominik Krzemiński](https://github.com/dokato/)

for El Passion, 2017

## 4. Random Forest Regression

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
plt.style.use('ggplot')
sns.set_style('whitegrid')

%matplotlib inline

In [None]:
bicycles_weather_data = pd.read_csv("data/bicycles_weather.csv", index_col=0)

## Validating model performance

The two most common approaches:

- test/train split

- cross-validation
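The two approaches can be compared side by side. The sketch below uses synthetic stand-in data (an assumption, not the bicycle dataset) and a plain `LinearRegression`; the names `x_demo`, `y_demo`, `split_score` and `cv_scores` are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (assumption): a noisy linear relation
# between a single feature and the target.
rng = np.random.RandomState(0)
x_demo = rng.uniform(-5, 30, size=(200, 1))
y_demo = 40 * x_demo[:, 0] + rng.normal(0, 50, size=200)

# Test/train split: one fit, one score on the held-out 20%.
xd_train, xd_test, yd_train, yd_test = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=0)
split_score = LinearRegression().fit(xd_train, yd_train).score(xd_test, yd_test)

# Cross-validation: 5 fits, each scored on a different held-out fold.
cv_scores = cross_val_score(LinearRegression(), x_demo, y_demo, cv=5)

print("Split R^2: {:.2f}".format(split_score))
print("CV R^2: {:.2f} +/- {:.2f}".format(cv_scores.mean(), cv_scores.std()))
```

A single split is cheaper, while cross-validation gives several scores and therefore a rough idea of how much the result depends on which rows end up in the test set.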

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
street = 'Banacha'
bicycles_weather_subset = bicycles_weather_data[['temp', street]]
bicycles_weather_subset = bicycles_weather_subset.dropna()

In [None]:
x = bicycles_weather_subset['temp'].to_frame()
y = bicycles_weather_subset[street].to_frame()

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

## Random Forest Regression

Before we look at the whole forest, let's consider a single tree :)

Decision trees are a powerful tool that is easy to understand and visualize.

![source: wikipedia.org](https://upload.wikimedia.org/wikipedia/commons/f/f3/CART_tree_titanic_survivors.png)
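To see what a single tree looks like for regression, here is a minimal sketch that fits a shallow `DecisionTreeRegressor` and prints its rules as text. The data is synthetic (an assumption, not the bicycle dataset), and `temp_demo`, `count_demo` and `tree` are illustrative names.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic stand-in data (assumption): counts that grow with temperature.
rng = np.random.RandomState(0)
temp_demo = rng.uniform(-5, 30, size=(200, 1))
count_demo = 40 * temp_demo[:, 0] + rng.normal(0, 50, size=200)

# A depth-2 tree asks at most two yes/no questions about 'temp',
# so it can output at most four different values.
tree = DecisionTreeRegressor(max_depth=2, random_state=0)
tree.fit(temp_demo, count_demo)

# Print the learned thresholds and leaf values as plain text.
print(export_text(tree, feature_names=['temp']))
```

Each leaf predicts the mean target of the training rows that fall into it, which is why a single tree produces a step-like prediction curve.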

## Random Forest

Scikit-learn definition:

_A random forest is a meta estimator that fits a number of classifying decision trees on various sub-samples of the dataset and use averaging to improve the **predictive accuracy** and control **over-fitting**._

In principle, this means that many decision trees (the forest) are fitted to random samples of the data (hence random), and the final prediction combines the outputs of all trees: an average for regression, a majority vote for classification.
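For regression, combining the trees means averaging their predictions. The sketch below checks this on synthetic data (an assumption, not the bicycle dataset) by averaging the fitted trees in `estimators_` by hand; `x_demo`, `y_demo`, `x_new` and `forest` are illustrative names.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption).
rng = np.random.RandomState(0)
x_demo = rng.uniform(-5, 30, size=(200, 1))
y_demo = 40 * x_demo[:, 0] + rng.normal(0, 50, size=200)

forest = RandomForestRegressor(n_estimators=10, random_state=0)
forest.fit(x_demo, y_demo)

# Ask every fitted tree separately, then average by hand.
x_new = np.array([[0.0], [10.0], [25.0]])
per_tree = np.array([t.predict(x_new) for t in forest.estimators_])  # shape (10, 3)
manual_average = per_tree.mean(axis=0)

print(manual_average)
print(forest.predict(x_new))  # should match the manual average
```

Individual trees disagree with each other because each one sees a different bootstrap sample; averaging smooths those differences out.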

Materials:
 - http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
 - https://www.youtube.com/watch?v=zvUOpbgtW3c

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
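# n_estimators - number of trees in the forest
# criterion - function that measures the quality of a split:
#   'squared_error' (MSE) or 'absolute_error' (MAE) in scikit-learn >= 1.0,
#   'mse' or 'mae' in older versions

rforest = RandomForestRegressor(n_estimators=10, criterion='squared_error')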

Compare the code below with Linear Regression. It's basically the same!

In [None]:
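# y is a one-column DataFrame; pass it as a 1-D array to avoid a DataConversionWarning
rforest.fit(x_train, y_train.values.ravel())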

In [None]:
y_pred = rforest.predict(x_test)

## Model evaluation

http://scikit-learn.org/stable/modules/model_evaluation.html

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

In [None]:
print("MAE: {:.2f}".format(mean_absolute_error(y_test, y_pred)))
print("MSE: {:.2f}".format(mean_squared_error(y_test, y_pred)))
print("R^2: {:.2f}".format(r2_score(y_test, y_pred)))

What is $R^2$?

$$
R^2 = 1 - \frac{SS_{\mathrm{resid}}}{SS_{\mathrm{total}}}
$$

[Coefficient of determination](https://en.wikipedia.org/wiki/Coefficient_of_determination)
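Here $SS_{\mathrm{resid}}$ is the sum of squared differences between true and predicted values, and $SS_{\mathrm{total}}$ is the sum of squared differences between true values and their mean. A minimal sketch on made-up numbers (the arrays `y_true_demo` and `y_pred_demo` are illustrative, not results from the model above) computes it by hand and compares it with `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up values for illustration only.
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.5, 5.5, 7.0, 10.0])

# Residual sum of squares: how far predictions are from the truth.
ss_resid = np.sum((y_true_demo - y_pred_demo) ** 2)
# Total sum of squares: how far the truth is from its own mean.
ss_total = np.sum((y_true_demo - y_true_demo.mean()) ** 2)

r2_manual = 1 - ss_resid / ss_total

print("manual R^2:  {:.3f}".format(r2_manual))
print("sklearn R^2: {:.3f}".format(r2_score(y_true_demo, y_pred_demo)))
```

A model that always predicts the mean of the true values gets $R^2 = 0$, a perfect model gets $R^2 = 1$, and a model worse than the mean gets a negative score.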