# __Applying Random Forest__

Let's examine how to build a random forest regression model.

## Step 1: Import Required Libraries and Read the Dataset

- Import pandas and NumPy libraries
- Read the dataset and display the head
- Check the dataset information


In [33]:
import pandas as pd
import numpy as np

In [34]:
dataset = pd.read_csv('petrol_consumption.csv')

In [35]:
dataset.head()

|   | Petrol_tax | Average_income | Paved_Highways | Population_Driver_licence(%) | Petrol_Consumption |
|---|---|---|---|---|---|
| 0 | 9.0 | 3571 | 1976 | 0.525 | 541 |
| 1 | 9.0 | 4092 | 1250 | 0.572 | 524 |
| 2 | 9.0 | 3865 | 1586 | 0.58 | 561 |
| 3 | 7.5 | 4870 | 2351 | 0.529 | 414 |
| 4 | 8.0 | 4399 | 431 | 0.544 | 410 |


__Observation__
- Here, you can see the first few rows of the dataset.

We will predict Petrol_Consumption from the other four attributes.

In [36]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48 entries, 0 to 47
Data columns (total 5 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Petrol_tax                    48 non-null     float64
 1   Average_income                48 non-null     int64  
 2   Paved_Highways                48 non-null     int64  
 3   Population_Driver_licence(%)  48 non-null     float64
 4   Petrol_Consumption            48 non-null     int64  
dtypes: float64(2), int64(3)
memory usage: 2.0 KB


__Observation__
- All columns are numeric, and there are no missing values.
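
The same two checks can also be made programmatically rather than read off the info() output. A minimal sketch, assuming a small hypothetical DataFrame `df` that stands in for the petrol dataset (only three columns and three rows are reproduced here):

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical stand-in for the petrol dataset (a few values from the head above).
df = pd.DataFrame({
    "Petrol_tax": [9.0, 9.0, 9.0],
    "Average_income": [3571, 4092, 3865],
    "Petrol_Consumption": [541, 524, 561],
})

# Count missing values per column; all zeros means there is nothing to impute.
missing_per_column = df.isna().sum()

# True only if every column has a numeric dtype.
all_numeric = all(is_numeric_dtype(dtype) for dtype in df.dtypes)

print(missing_per_column)
print("All numeric:", all_numeric)
```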

## Step 2: Prepare the Data

- Create the feature matrix X (the first four columns) and the target vector y (Petrol_Consumption).


In [37]:
X = dataset.iloc[:, 0:4].values
y = dataset.iloc[:, 4].values
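
Positional slicing with iloc picks the wrong columns without warning if the column order in the file ever changes. As an alternative sketch (not the approach used in this notebook), the same split can be made by column name. The small DataFrame `df` below is a hypothetical stand-in for `dataset`, built from the five rows shown earlier:

```python
import pandas as pd

# Hypothetical stand-in for `dataset`, using the first five rows.
df = pd.DataFrame({
    "Petrol_tax": [9.0, 9.0, 9.0, 7.5, 8.0],
    "Average_income": [3571, 4092, 3865, 4870, 4399],
    "Paved_Highways": [1976, 1250, 1586, 2351, 431],
    "Population_Driver_licence(%)": [0.525, 0.572, 0.58, 0.529, 0.544],
    "Petrol_Consumption": [541, 524, 561, 414, 410],
})

target = "Petrol_Consumption"
X = df.drop(columns=[target]).values  # every column except the target
y = df[target].values                  # the target column only

print(X.shape, y.shape)
```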

## Step 3: Split the Data into Training and Testing Sets

- Use train_test_split from sklearn.model_selection to split the data


In [38]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
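
To see what test_size=0.2 means for a dataset of this size, the sketch below splits 48 rows of placeholder data (random numbers, not the petrol data) and checks the resulting shapes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(48, 4))  # placeholder features, same shape as the petrol data
y = rng.normal(size=48)       # placeholder target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 20% of 48 is 9.6, which scikit-learn rounds up to 10 test rows,
# leaving 38 rows for training.
print(X_train.shape, X_test.shape)
```

With only 10 test rows, any metric computed on the test set rests on very few samples and should be read with that in mind.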

## Step 4: Standardize the Data

- Standardize the data using StandardScaler from sklearn.preprocessing


In [39]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
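
The scaler is fitted on the training set only and then reused on the test set, so no test-set statistics leak into training. The sketch below illustrates what fit_transform and transform do on tiny hypothetical arrays:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: three training rows and one test row.
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, 400.0]])

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)  # learns mean and std from the training rows
X_test_scaled = sc.transform(X_test)        # reuses the training mean and std

print(X_train_scaled.mean(axis=0))  # approximately 0 for each column
print(X_train_scaled.std(axis=0))   # approximately 1 for each column
print(X_test_scaled)                # test values need not fall in the training range
```

Tree-based models split on thresholds, so standardization typically has little effect on a random forest. The step is harmless here and becomes useful if the same pipeline is later reused with a scale-sensitive model.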

## Step 5: Train the RandomForestRegressor

- Import RandomForestRegressor from sklearn.ensemble
- Create a regressor object, fit it with the training data, and make predictions on the test data


In [40]:
from sklearn.ensemble import RandomForestRegressor

regressor = RandomForestRegressor(n_estimators=20, random_state=0)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
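
Once fitted, a random forest exposes feature_importances_, which can help show which inputs drive the predictions. A minimal sketch on synthetic data (an assumption for illustration: the target is constructed to depend almost entirely on the first feature, so that feature should dominate):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                    # synthetic features
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=200)   # target driven by feature 0

regressor = RandomForestRegressor(n_estimators=20, random_state=0)
regressor.fit(X, y)

# One non-negative value per feature; the values sum to 1.
importances = regressor.feature_importances_
print(importances)
```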

## Step 6: Evaluate the Performance of the RandomForestRegressor

- Calculate the Mean Absolute Error, Mean Squared Error, and Root Mean Squared Error using metrics from sklearn


In [41]:
from sklearn import metrics

y_train_pred = regressor.predict(X_train)
train_mse = metrics.mean_squared_error(y_train, y_train_pred)
print(f'Train MAE: {metrics.mean_absolute_error(y_train, y_train_pred):.2f}')
print(f'Train MSE: {train_mse:.2f}')
# RMSE is the square root of the MSE
print(f'Train RMSE: {np.sqrt(train_mse):.2f}')

Train MAE: 19.99
Train MSE: 859.27
Train RMSE: 29.31


In [42]:
test_mse = metrics.mean_squared_error(y_test, y_pred)
print(f'Test MAE: {metrics.mean_absolute_error(y_test, y_pred):.2f}')
print(f'Test MSE: {test_mse:.2f}')
# RMSE is the square root of the MSE
print(f'Test RMSE: {np.sqrt(test_mse):.2f}')

Test MAE: 51.77
Test MSE: 4216.17
Test RMSE: 64.93
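
For reference, the three error metrics relate to each other as follows: MAE is the mean of the absolute errors, MSE is the mean of the squared errors, and RMSE is the square root of the MSE. R² is a separate score, available as metrics.r2_score. Note also that np.sqrt(a, b) treats its second positional argument as an output buffer, so it returns the element-wise square root of `a` rather than an error metric. The sketch below checks the definitions on small hypothetical arrays:

```python
import numpy as np
from sklearn import metrics

# Hypothetical true values and predictions.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_hat = np.array([2.0, 5.0, 4.0, 8.0])

mae = metrics.mean_absolute_error(y_true, y_hat)
mse = metrics.mean_squared_error(y_true, y_hat)
rmse = np.sqrt(mse)                    # RMSE is a single number
r2 = metrics.r2_score(y_true, y_hat)   # R squared is not the same as MSE

# The same quantities computed directly from their definitions.
errors = y_true - y_hat
mae_manual = np.mean(np.abs(errors))
mse_manual = np.mean(errors ** 2)
r2_manual = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mae, mse, rmse, r2)
```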


__Observation__

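- The training error is much lower than the test error (an MAE of about 20 versus about 52), which suggests that the model is overfitting the training data. Note that this comparison comes from a single train/test split, not from cross-validation.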
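
A single 80/20 split of 48 rows leaves only 10 test rows, so the test error is itself a noisy estimate. k-fold cross-validation averages over several splits and can give a steadier picture. A minimal sketch using cross_val_score, assuming synthetic data from make_regression as a stand-in for the petrol dataset:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in with the same shape as the petrol data (48 rows, 4 features).
X, y = make_regression(n_samples=48, n_features=4, noise=10.0, random_state=0)

regressor = RandomForestRegressor(n_estimators=20, random_state=0)

# scikit-learn maximizes scores, so MAE is reported as a negative number.
scores = cross_val_score(
    regressor, X, y, cv=5, scoring="neg_mean_absolute_error"
)
mae_per_fold = -scores

print("MAE per fold:", mae_per_fold)
print("Mean MAE:", mae_per_fold.mean())
```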

For a classification problem, replace RandomForestRegressor with RandomForestClassifier (also in sklearn.ensemble) and evaluate the model with classification metrics such as accuracy instead of MAE, MSE, and RMSE.
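
As a rough sketch of that swap, the example below trains a RandomForestClassifier on synthetic data from make_classification. The data and the choice of accuracy as the metric are assumptions for illustration; the workflow otherwise mirrors the regression steps in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (labels are 0 or 1).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

classifier = RandomForestClassifier(n_estimators=20, random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

# Accuracy is the fraction of test rows whose label was predicted correctly.
accuracy = accuracy_score(y_test, y_pred)
print("Test accuracy:", accuracy)
```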