#### Use the petrol_consumption dataset. Your task is to predict gas consumption (in millions of gallons) in 48 US states from the petrol tax (in cents), per capita income (in dollars), paved highways (in miles), and the proportion of the population holding a driver's license. Build the regression model with a Random Forest Regressor and analyze its predictive ability.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import r2_score

In [2]:
data = pd.read_csv('https://raw.githubusercontent.com/rahul96rajan/sample_datasets/master/petrol_consumption.csv')
data.head()

| | Petrol_tax | Average_income | Paved_Highways | Population_Driver_licence(%) | Petrol_Consumption |
|---|---|---|---|---|---|
| 0 | 9.0 | 3571 | 1976 | 0.525 | 541 |
| 1 | 9.0 | 4092 | 1250 | 0.572 | 524 |
| 2 | 9.0 | 3865 | 1586 | 0.58 | 561 |
| 3 | 7.5 | 4870 | 2351 | 0.529 | 414 |
| 4 | 8.0 | 4399 | 431 | 0.544 | 410 |


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48 entries, 0 to 47
Data columns (total 5 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Petrol_tax                    48 non-null     float64
 1   Average_income                48 non-null     int64  
 2   Paved_Highways                48 non-null     int64  
 3   Population_Driver_licence(%)  48 non-null     float64
 4   Petrol_Consumption            48 non-null     int64  
dtypes: float64(2), int64(3)
memory usage: 2.0 KB


In [4]:
X = data.drop('Petrol_Consumption', axis=1)
y = data['Petrol_Consumption']

In [31]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [32]:
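# Base forest: random_state makes runs repeatable, n_jobs=-1 uses all CPU cores.
rfr = RandomForestRegressor(random_state=0, n_jobs=-1)
# Hyperparameter grid: forest size, features tried per split, leaf cap per tree.
params = {'n_estimators': [50, 100, 200],
          'max_features': [2, 3],
          'max_leaf_nodes': [3, 4, 5]}

# Exhaustive search over the grid, scored by R^2 with 4-fold cross-validation.
gs = GridSearchCV(estimator=rfr, param_grid=params, scoring='r2', cv=4)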

In [33]:
gs.fit(X_train, y_train)

GridSearchCV(cv=4, estimator=RandomForestRegressor(n_jobs=-1, random_state=0),
             param_grid={'max_features': [2, 3], 'max_leaf_nodes': [3, 4, 5],
                         'n_estimators': [50, 100, 200]},
             scoring='r2')
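
To get a feel for how much work this search does, the grid can be enumerated by hand. The sketch below uses only the standard library and copies the `params` dictionary and `cv=4` from the cells above; the helper name `count_fits` is made up for illustration and is not part of scikit-learn.

```python
from itertools import product

# Same grid as in the notebook cell above.
params = {
    'n_estimators': [50, 100, 200],
    'max_features': [2, 3],
    'max_leaf_nodes': [3, 4, 5],
}


def count_fits(param_grid, cv):
    """Return (number of candidates, number of cross-validation fits)."""
    # Every combination of one value per hyperparameter is one candidate.
    candidates = list(product(*param_grid.values()))
    # Each candidate is trained once per fold.
    return len(candidates), len(candidates) * cv


n_candidates, n_fits = count_fits(params, cv=4)
print(n_candidates, n_fits)  # 18 72
```

With `refit=True` (the `GridSearchCV` default), one more fit of the best candidate on all of `X_train` follows, which is why `gs.predict` can be called directly afterwards.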

In [34]:
print(gs.best_params_)
print(gs.best_estimator_)

{'max_features': 2, 'max_leaf_nodes': 5, 'n_estimators': 200}
RandomForestRegressor(max_features=2, max_leaf_nodes=5, n_estimators=200,
                      n_jobs=-1, random_state=0)
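
One way to see what the tuned forest relies on is to rank its impurity-based feature importances. The sketch below assumes a fitted estimator that exposes `feature_importances_` (as `RandomForestRegressor` does after fitting); `rank_importances` is a hypothetical helper written for this notebook, not a scikit-learn function.

```python
def rank_importances(model, feature_names):
    """Pair each feature name with its importance, largest first.

    `model` is assumed to be a fitted estimator with a
    `feature_importances_` attribute, e.g. gs.best_estimator_.
    """
    importances = [float(v) for v in model.feature_importances_]
    names = list(feature_names)
    if len(importances) != len(names):
        raise ValueError("one name per feature is required")
    # Sort (name, importance) pairs by importance, descending.
    return sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
```

In this notebook it could be called as `rank_importances(gs.best_estimator_, X.columns)`. With so few training rows the ranking is only a rough indication and may shift with a different `random_state`.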


In [35]:
y_pred_test = gs.predict(X_test)
y_pred_train = gs.predict(X_train)

In [36]:
print('R-sq. Score(Train): ', r2_score(y_train, y_pred_train))
print('R-sq. Score(Test): ', r2_score(y_test, y_pred_test))

R-sq. Score(Train):  0.8200947849791886
R-sq. Score(Test):  0.6995448939468427
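
R² says how much variance is explained but not how far off the predictions are in the target's own units (millions of gallons). A minimal sketch of a complementary summary follows, assuming only NumPy; `error_summary` is a hypothetical helper name, and the R² it computes follows the usual 1 − SS_res / SS_tot definition.

```python
import numpy as np


def error_summary(y_true, y_pred):
    """Return MAE, RMSE and R^2 for one set of predictions.

    Assumes y_true is not constant (otherwise SS_tot is zero).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mae = np.mean(np.abs(residuals))          # mean absolute error
    rmse = np.sqrt(np.mean(residuals ** 2))   # root mean squared error
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {'mae': float(mae), 'rmse': float(rmse), 'r2': float(1.0 - ss_res / ss_tot)}
```

Calling `error_summary(y_test, y_pred_test)` and `error_summary(y_train, y_pred_train)` would put the two splits side by side; MAE and RMSE come out in the same units as `Petrol_Consumption`.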


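### <u>Note</u>: The dataset is quite small, with only 48 samples in total.

With `test_size=0.2`, the test set holds just 10 samples (38 remain for training), so the test R² above depends heavily on which rows land in the split. The gap between the train score (0.82) and the test score (0.70) should be read with that in mind.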
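
Because a single split of so few rows is noisy, repeated K-fold cross-validation may give a steadier picture of predictive ability. The sketch below is one possible approach, not part of the original analysis: `repeated_cv_r2` is a hypothetical helper, the hyperparameters are copied from `gs.best_params_` above, and the demo data is synthetic so the block runs without downloading the CSV. Since those hyperparameters were tuned on the training split, scores computed over all 48 rows would be slightly optimistic; nested cross-validation would be stricter.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score


def repeated_cv_r2(X, y, n_splits=4, n_repeats=5, seed=0):
    """Score a forest with repeated K-fold CV; return the per-fold R^2 array."""
    # Hyperparameters copied from gs.best_params_ in the notebook above.
    model = RandomForestRegressor(
        n_estimators=200, max_features=2, max_leaf_nodes=5, random_state=seed
    )
    # Each repeat reshuffles the rows before splitting into n_splits folds.
    cv = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=seed)
    return cross_val_score(model, X, y, scoring='r2', cv=cv)


if __name__ == '__main__':
    # Synthetic stand-in (assumption) for the 48 x 4 feature matrix.
    rng = np.random.default_rng(0)
    X_demo = rng.normal(size=(48, 4))
    y_demo = X_demo @ np.array([3.0, -2.0, 1.0, 0.5]) + rng.normal(scale=0.5, size=48)
    scores = repeated_cv_r2(X_demo, y_demo)
    print(f'R^2 mean={scores.mean():.3f}, std={scores.std():.3f}, n={scores.size}')
```

In the notebook, `repeated_cv_r2(X, y)` would return 20 fold scores; their spread indicates how much a single train/test split can move the reported R².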