Random forest regression is an ensemble learning method that trains multiple decision trees, each on a random sample of the data and with a random subset of features considered when splitting. The trees are built independently of one another, so they can be trained in parallel. The predictions of the individual trees are then averaged to give the final random forest prediction.
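The averaging step can be checked directly. The sketch below is only an illustration under stated assumptions: it uses synthetic data from `make_regression` rather than the petrol dataset, and `max_features=0.5` is chosen here just to make the feature subsampling explicit. It predicts with each fitted tree in `estimators_` and compares the mean of those predictions with the forest's own output.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption for illustration; not the petrol dataset).
X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

# max_features=0.5 makes each split consider a random half of the features.
forest = RandomForestRegressor(n_estimators=10, max_features=0.5, random_state=0)
forest.fit(X, y)

# Predict with each fitted tree, then average across trees.
per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

# Compare the manual average with the forest's own prediction.
print(np.allclose(manual_average, forest.predict(X)))
```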

We use random forest regression to predict the petrol consumption of a U.S. state from features of that state: the petrol tax (cents), the average income (dollars), the amount of paved highways (miles), and the proportion of the population with a driver’s license.

https://medium.com/@theclickreader/random-forest-regression-explained-with-implementation-in-python-3dad88caf165

In [162]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor


In [163]:
# load dataset
df = pd.read_csv('./data/petrol_consumption.csv').dropna()
df.head()


   Petrol_tax  Average_income  Paved_Highways  Population_Driver_licence(%)  Petrol_Consumption
0         9.0            3571            1976                         0.525                 541
1         9.0            4092            1250                         0.572                 524
2         9.0            3865            1586                         0.580                 561
3         7.5            4870            2351                         0.529                 414
4         8.0            4399             431                         0.544                 410


In [164]:
x = df.drop('Petrol_Consumption', axis=1)
y = df[['Petrol_Consumption']]
print(x.head())
print(y.head())


   Petrol_tax  Average_income  Paved_Highways  Population_Driver_licence(%)
0         9.0            3571            1976                         0.525
1         9.0            4092            1250                         0.572
2         9.0            3865            1586                         0.580
3         7.5            4870            2351                         0.529
4         8.0            4399             431                         0.544
   Petrol_Consumption
0                 541
1                 524
2                 561
3                 414
4                 410


In [169]:
# split dataset into train and test data
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=12)


In [166]:
# determine baseline error
# baseline: always predict the mean petrol consumption of the whole dataset
avg_consumption = df['Petrol_Consumption'].mean()
baseline_pred = np.full((len(y_test), 1), avg_consumption)
mean_squared_error(y_test, baseline_pred)


21243.71223958333
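The baseline above averages over the whole dataset, which includes the test rows. One possible variant, sketched here as an assumption rather than as part of the original analysis, is scikit-learn's `DummyRegressor`, which learns the mean from the training targets only. The data below is synthetic stand-in data (`X_demo` and `y_demo` are hypothetical names); in the notebook, `x_train`, `y_train`, `x_test` and `y_test` would take their place.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the petrol dataset).
rng = np.random.default_rng(12)
X_demo = rng.normal(size=(48, 4))
y_demo = 500 + 50 * X_demo[:, 0] + rng.normal(scale=20, size=48)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=12)

# strategy="mean" always predicts the mean of the training targets.
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
baseline_mse = mean_squared_error(y_te, baseline.predict(X_te))
print(baseline_mse)
```

Because the dummy model never sees the test targets, its error is measured under the same conditions as the model it is compared with.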

In [167]:
# random forest regression model using 10 random decision trees
randomForestModel = RandomForestRegressor(n_estimators=10, random_state=12)
# fit model
randomForestModel.fit(x_train, y_train.values.ravel())

# calculate model loss using mean squared error
y_pred = randomForestModel.predict(x_test)
error = mean_squared_error(y_test, y_pred)
error


16447.090833333335

Random forest regression reduces the mean squared error by ~23% relative to the baseline.
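The ~23% figure follows from the two error values reported above. The short sketch below reproduces the arithmetic and also converts both errors to root mean squared error, which is expressed in the same units as the target; the variable names are chosen here for illustration.

```python
import math

baseline_mse = 21243.71223958333   # baseline output reported above
forest_mse = 16447.090833333335    # random forest output reported above

# Fraction of the baseline error removed by the model.
relative_reduction = (baseline_mse - forest_mse) / baseline_mse
print(f"MSE reduction: {relative_reduction:.1%}")

# RMSE is in the same units as the target variable.
baseline_rmse = math.sqrt(baseline_mse)
forest_rmse = math.sqrt(forest_mse)
print(f"RMSE: baseline {baseline_rmse:.1f}, forest {forest_rmse:.1f}")
```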

In [168]:
# see how predicted values for test dataset compare to ground truth
test_df = pd.DataFrame(y_test).rename(columns={"Petrol_Consumption": 'y_test'})
test_df['y_pred'] = y_pred
test_df


    y_test  y_pred
26     577   572.6
44     782   584.3
7      467   488.4
39     968   622.9
36     640   601.8
46     610   625.7
20     649   573.7
29     534   521.2
19     640   726.6
8      464   467.2
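To see where the model misses most, the residuals can be computed per row. The sketch below copies the ten rows displayed above by hand into a hypothetical `shown` DataFrame, so it runs without the CSV; in the notebook, the same two column operations could be applied to `test_df` directly.

```python
import pandas as pd

# The ten rows displayed above, copied by hand (index = original row label).
shown = pd.DataFrame(
    {
        "y_test": [577, 782, 467, 968, 640, 610, 649, 534, 640, 464],
        "y_pred": [572.6, 584.3, 488.4, 622.9, 601.8,
                   625.7, 573.7, 521.2, 726.6, 467.2],
    },
    index=[26, 44, 7, 39, 36, 46, 20, 29, 19, 8],
)

# Positive residual: the model predicted too low.
shown["residual"] = shown["y_test"] - shown["y_pred"]
shown["abs_error"] = shown["residual"].abs()

# Rows with the largest absolute errors first.
print(shown.sort_values("abs_error", ascending=False).head(3))
```

In these rows the two largest errors belong to the states with the highest consumption (968 and 782), both of which the forest underpredicts.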
