# Random Forest
Src: https://towardsdatascience.com/random-forest-in-python-24d0893d51c0

Random Forest is an ensemble of Decision Trees, generally trained using the Bagging method (sampling the training set with replacement) or sometimes Pasting (sampling without replacement).

Random Forest introduces extra randomness when growing trees: instead of searching for the very best feature when splitting a node, it searches for the best feature **among a random subset of features**.

## The algorithm
1. Assume the number of cases in the training set is N. Then a **sample of N cases** is drawn at random, **with replacement**, and used as the training set for growing one tree.

2. If there are M input variables (features), a number **m < M** is specified such that at each node, **m variables are selected at random out of the M**. The best split on these m is used to split the node. The value of m is held constant while we grow the forest.

3. Each tree is grown to the largest extent possible and there is **no pruning**.

4. Predict new data by **aggregating the predictions** of all the trees in the forest (i.e., majority vote for classification, average for regression).
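The four steps above can be sketched by hand on top of scikit-learn's `DecisionTreeRegressor`. This is a minimal illustration for the regression case, not how `RandomForestRegressor` is implemented internally; the helper names `fit_forest` and `predict_forest`, the synthetic data, and the choice of 25 trees with m = 2 are all assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_forest(X, y, n_trees=25, m=None, seed=0):
    """Fit n_trees unpruned trees, each on its own bootstrap sample (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n_cases = X.shape[0]
    trees = []
    for _ in range(n_trees):
        # Step 1: draw N row indices at random, with replacement.
        rows = rng.integers(0, n_cases, size=n_cases)
        # Step 2: max_features=m makes the tree consider only m randomly
        # chosen features at each split.
        # Step 3: max_depth is left at its default (None), so the tree is
        # grown fully and never pruned.
        tree = DecisionTreeRegressor(
            max_features=m, random_state=int(rng.integers(1_000_000))
        )
        tree.fit(X[rows], y[rows])
        trees.append(tree)
    return trees


def predict_forest(trees, X):
    """Step 4: average the per-tree predictions (regression case)."""
    return np.mean([tree.predict(X) for tree in trees], axis=0)


# Synthetic data, invented for this illustration only.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 6))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

forest = fit_forest(X_demo[:150], y_demo[:150], n_trees=25, m=2)
demo_pred = predict_forest(forest, X_demo[150:])
demo_mae = np.mean(np.abs(demo_pred - y_demo[150:]))
print("Hand-rolled forest MAE on held-out rows:", round(demo_mae, 2))
```

For classification, the averaging in `predict_forest` would be replaced by a majority vote over the trees' predicted classes.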

In [1]:
import pandas as pd
import numpy as np

Taking a look at the dataset

The problem: **predicting the max temperature for tomorrow** in our city using one year of past weather data

Notes:
- **friend**: your friend’s prediction, a random number between 20 below and 20 above the average
- **actual**: the measured max temperature

In [4]:
df = pd.read_csv('temps.csv')
df = df.drop(columns=["forecast_noaa", "forecast_acc", "forecast_under"])
df = pd.get_dummies(df)
df.head()

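|   | year | month | day | temp_2 | temp_1 | average | actual | friend | week_Fri | week_Mon | week_Sat | week_Sun | week_Thurs | week_Tues | week_Wed |
|---|------|-------|-----|--------|--------|---------|--------|--------|----------|----------|----------|----------|------------|-----------|----------|
| 0 | 2016 | 1 | 1 | 45 | 45 | 45.6 | 45 | 29 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2016 | 1 | 2 | 44 | 45 | 45.7 | 44 | 61 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 2016 | 1 | 3 | 45 | 44 | 45.8 | 41 | 56 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 2016 | 1 | 4 | 44 | 41 | 45.9 | 40 | 53 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 2016 | 1 | 5 | 41 | 40 | 46.0 | 44 | 41 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |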
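The `week_*` columns in the output above are produced by `pd.get_dummies`, which one-hot encodes every text column and leaves numeric columns untouched. A small sketch of that behaviour follows; the `toy` frame and its values are invented for illustration, and only the `week` column name mirrors the dataset. Recent pandas versions return booleans by default, so `dtype=int` is passed here to get the 0/1 values shown above.

```python
import pandas as pd

# Toy frame, invented for illustration only.
toy = pd.DataFrame({"temp_1": [45, 44, 41], "week": ["Fri", "Sat", "Mon"]})

# Each distinct value of `week` becomes its own 0/1 indicator column,
# named <column>_<value>; the numeric column is passed through unchanged.
encoded = pd.get_dummies(toy, dtype=int)
print(list(encoded.columns))
print(encoded)
```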


In [5]:
X = df.drop(columns="actual")
y = df["actual"]

Splitting test/train set

In [6]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 42)

Creating a baseline

In [7]:
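# Baseline: use the `average` column as the prediction for each test row
baseline = X_test['average']
# Absolute error of the baseline against the measured temperature
baseline_errors = abs(baseline - y_test)
print('Average baseline error: ', round(np.mean(baseline_errors), 2))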

Average baseline error:  5.06


Training the model

In [8]:
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators = 1000, random_state = 42)
rf.fit(X_train, y_train);

Making predictions

In [11]:
y_pred = rf.predict(X_test)
model_errors = abs(y_pred - y_test)
print('Mean Absolute Error:', round(np.mean(model_errors), 2))

Mean Absolute Error: 3.83


Calculating accuracy

In [12]:
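# Absolute percentage error for each test sample
mape = 100 * (model_errors / y_test)

# "Accuracy" here means 100 minus the mean absolute percentage error (MAPE)
accuracy = 100 - np.mean(mape)
print('Accuracy:', round(accuracy, 2), '%')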

Accuracy: 93.98 %
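The "accuracy" figure above is 100 minus the mean absolute percentage error (MAPE), not accuracy in the classification sense. A worked example on three made-up values (all names and numbers below are assumptions for illustration) shows the arithmetic:

```python
import numpy as np

# Made-up values for illustration only.
toy_actual = np.array([50.0, 60.0, 80.0])
toy_predicted = np.array([45.0, 63.0, 80.0])

toy_abs_errors = np.abs(toy_predicted - toy_actual)      # 5, 3, 0
toy_pct_errors = 100 * toy_abs_errors / toy_actual       # 10%, 5%, 0%
toy_mape = np.mean(toy_pct_errors)                       # (10 + 5 + 0) / 3
toy_accuracy = 100 - toy_mape

print("MAPE:", round(toy_mape, 2), "%")
print("Accuracy:", round(toy_accuracy, 2), "%")
```

Because each error is divided by the actual value, this measure is undefined when an actual value is zero and becomes unstable when actual values are close to zero.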


In [13]:
from sklearn import metrics
print("Root Mean Squared Error: " + str(np.sqrt(metrics.mean_squared_error(y_test,y_pred))))
print("Score: "+ str(rf.score(X_test,y_test))) # R^2 (coefficient of determination)

Root Mean Squared Error: 5.039812940106445
Score: 0.8173586859802093
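To make the two numbers above concrete, the sketch below computes RMSE and R^2 by hand on a tiny made-up example and checks them against scikit-learn's `mean_squared_error` and `r2_score`. The arrays `y_true` and `y_hat` are invented for illustration and are unrelated to the temperature data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up values for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 9.0])

residuals = y_true - y_hat                        # 1, 0, -1, 0

# RMSE: square root of the mean squared residual.
rmse = np.sqrt(np.mean(residuals ** 2))           # sqrt(2 / 4)

# R^2: 1 minus (residual sum of squares / total sum of squares).
ss_res = np.sum(residuals ** 2)                   # 2
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # 9 + 1 + 1 + 9 = 20
r2 = 1 - ss_res / ss_tot                          # 1 - 2/20

print("RMSE by hand:", rmse, "| sklearn:", np.sqrt(mean_squared_error(y_true, y_hat)))
print("R^2 by hand:", r2, "| sklearn:", r2_score(y_true, y_hat))
```

RMSE is never smaller than the mean absolute error on the same predictions, because squaring weights large errors more heavily. The gap between the RMSE of about 5.04 and the MAE of 3.83 reported above is consistent with a handful of relatively large misses.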


Visualizing a tree

In [15]:
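from sklearn.tree import export_graphviz
import pydot  # write_png also needs the Graphviz `dot` executable installed
# Pick one tree out of the forest
tree = rf.estimators_[5]
# Export the tree in Graphviz DOT format
export_graphviz(tree, out_file = 'tree.dot', feature_names = list(X.columns), rounded = True, precision = 1)
# Load the DOT file and render it to a PNG image
(graph, ) = pydot.graph_from_dot_file('tree.dot')
graph.write_png('tree.png')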