**THIS FILE CONTAINS THE CODE FOR MODELLING AND HYPERPARAMETER TUNING OF THE RANDOM-FOREST REGRESSOR.**

**Taking necessary files from drive**

In [None]:
from google.colab import drive

In [None]:
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pickle
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from tqdm import tqdm
import sklearn

In [None]:
from sklearn.preprocessing import LabelEncoder,MinMaxScaler
from sklearn.model_selection import StratifiedKFold,KFold,RandomizedSearchCV,GridSearchCV
from sklearn.model_selection import cross_val_score,train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from scipy.stats import uniform,randint
import lightgbm as lgb
from lightgbm import LGBMRegressor
import xgboost

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Load the pickled training DataFrame; the with-block closes the file handle.
with open('/content/drive/MyDrive/Project Energy Consumption/df_tr_red_final_modified.txt','rb') as file:
    df_tr_red_final=pickle.load(file)

In [None]:
df_tr_red_final.reset_index(inplace=True)

In [None]:
df_tr_red_final.drop(['index','timestamp'],axis=1,inplace=True)

In [None]:
df_tr_red_final.drop('level_0',axis=1,inplace=True)

**TARGET TRANSFORMATION**

1.   AS THE METRIC IS RMSLE, I TAKE THE LOG1P OF THE METER READINGS AND THEN USE RMSE AS THE EVALUATION METRIC. RMSE ON THE LOG1P-TRANSFORMED TARGET IS THE SAME QUANTITY AS RMSLE ON THE ORIGINAL READINGS.



In [None]:
y_tr=np.log1p(df_tr_red_final['meter_reading'])
df_tr_red_final.drop('meter_reading',axis=1,inplace=True)
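
To see why RMSE on the log1p-transformed target matches RMSLE on the raw readings, here is a minimal sketch in plain Python. The helper names `rmse` and `rmsle` and the sample readings are hypothetical, chosen only for illustration; they are not taken from the notebook's data.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error on raw (untransformed) values."""
    return math.sqrt(
        sum((math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred))
        / len(y_true)
    )

# Hypothetical meter readings and predictions (illustration only).
readings = [0.0, 12.5, 250.0, 1800.0]
predictions = [1.0, 10.0, 300.0, 1500.0]

# Transform both to log space, as the notebook does for the target.
log_readings = [math.log1p(v) for v in readings]
log_predictions = [math.log1p(v) for v in predictions]

# RMSE in log space is the same quantity as RMSLE on the raw values.
rmse_in_log_space = rmse(log_readings, log_predictions)
rmsle_on_raw = rmsle(readings, predictions)

# Predictions made in log space are mapped back with expm1.
recovered = [math.expm1(v) for v in log_predictions]
```

The same back-transform applies to the model's output: anything predicted from `y_tr` is in log1p space and would need `np.expm1` to return to the meter-reading scale.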

**DROPPING THE FEATURES WHICH ARE NOT IMPORTANT**

In [None]:
df_tr_red_final.drop(['cloud_coverage','sea_level_pressure','wind_direction','wind_speed',
                      'is_summer_month','is_pub_holiday'],axis=1,inplace=True)

**DIVIDING THE DATA INTO TRAIN AND TEST**

In [None]:
X_train,X_test,y_train,y_test=train_test_split(df_tr_red_final,y_tr,test_size=0.2,random_state=0)

**HYPERPARAMETER TUNING**

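1.   HERE I AM DOING THE HYPERPARAMETER TUNING USING RANDOMIZED SEARCH CV, PARALLELISED ACROSS THE AVAILABLE CPU CORES (N_JOBS=-1). SCIKIT-LEARN'S RANDOM FOREST RUNS ON THE CPU, NOT THE GPU.

2.   IT SAMPLES 5 OF THE 20 POSSIBLE PARAMETER COMBINATIONS AND KEEPS THE ONE WITH THE BEST 3-FOLD CROSS-VALIDATED RMSE.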
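
The sampling idea behind a randomized search over a discrete grid can be sketched in plain Python. This is an illustration only, not scikit-learn's implementation: the helper `sample_candidates` is hypothetical, and the candidates it draws for a given seed will generally differ from those `RandomizedSearchCV` draws.

```python
import itertools
import random

def sample_candidates(grid, n_iter, seed):
    """Draw n_iter distinct parameter combinations from a discrete grid."""
    names = sorted(grid)
    all_combinations = [
        dict(zip(names, values))
        for values in itertools.product(*(grid[name] for name in names))
    ]
    rng = random.Random(seed)
    # Sampling without replacement, so no combination is tried twice.
    return all_combinations, rng.sample(all_combinations, n_iter)

grid = {'n_estimators': [20, 40, 60, 80, 100], 'max_depth': [3, 5, 7, 9]}
all_combinations, candidates = sample_candidates(grid, n_iter=5, seed=0)

n_folds = 3
total_fits = len(candidates) * n_folds  # one fit per candidate per fold

print(len(all_combinations), "combinations in the grid")
print(len(candidates), "sampled candidates,", total_fits, "fits in total")
```

Since only a quarter of the grid is evaluated, the reported best parameters are the best among the sampled candidates, not necessarily the best in the whole grid.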



In [None]:
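# Base estimator; n_jobs=-1 lets each forest use all CPU cores.
rf_reg=RandomForestRegressor(n_jobs=-1)
# Discrete search space: 5 x 4 = 20 possible combinations.
params={'n_estimators':[20,40,60,80,100],
        'max_depth':[3,5,7,9]}
# Sample 5 combinations and score each with 3-fold CV on negated RMSE.
random_clf=RandomizedSearchCV(rf_reg,params,scoring='neg_root_mean_squared_error',n_jobs=-1,cv=3,verbose=15,n_iter=5,random_state=0)
random_clf.fit(X_train,y_train)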

Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed: 23.9min
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed: 48.6min
[Parallel(n_jobs=-1)]: Done   3 tasks      | elapsed: 73.3min
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed: 131.0min
[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed: 131.4min
[Parallel(n_jobs=-1)]: Done   6 tasks      | elapsed: 133.3min
[Parallel(n_jobs=-1)]: Done   7 tasks      | elapsed: 214.8min
[Parallel(n_jobs=-1)]: Done   8 tasks      | elapsed: 241.9min
[Parallel(n_jobs=-1)]: Done  10 out of  15 | elapsed: 299.7min remaining: 149.8min
[Parallel(n_jobs=-1)]: Done  12 out of  15 | elapsed: 319.5min remaining: 79.9min
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed: 326.7min finished


RandomizedSearchCV(cv=3, error_score=nan,
                   estimator=RandomForestRegressor(bootstrap=True,
                                                   ccp_alpha=0.0,
                                                   criterion='mse',
                                                   max_depth=None,
                                                   max_features='auto',
                                                   max_leaf_nodes=None,
                                                   max_samples=None,
                                                   min_impurity_decrease=0.0,
                                                   min_impurity_split=None,
                                                   min_samples_leaf=1,
                                                   min_samples_split=2,
                                                   min_weight_fraction_leaf=0.0,
                                                   n_estimators=100, n_jobs=-1,
                   

**BEST PARAMS**

In [None]:
random_clf.best_params_

{'max_depth': 9, 'n_estimators': 100}

**BEST SCORE**

In [None]:
random_clf.best_score_

-1.4844006487932297

**PREDICTION ON THE TEST SET USING THE BEST PARAMS FOUND FROM HYPERPARAMETER TUNING**

In [None]:
test_pred=random_clf.predict(X_test)
test_score=np.sqrt(mean_squared_error(y_test,test_pred))

In [None]:
test_score

1.485072964644452
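
Because the search was scored with `neg_root_mean_squared_error`, `best_score_` is the negated RMSE, so it needs a sign flip before it can be compared with the test RMSE. The sketch below uses the two values printed above; the variable names are hypothetical, and reading the log-space RMSE as a multiplicative factor on `1 + meter_reading` is an interpretation added here, not something the notebook states.

```python
import math

best_score = -1.4844006487932297   # random_clf.best_score_ (negated RMSE)
test_rmse = 1.485072964644452      # RMSE on the held-out 20% split

cv_rmse = -best_score              # flip the sign to get a plain RMSE
gap = test_rmse - cv_rmse          # difference between hold-out and CV error

# The target is log1p(meter_reading), so an RMSE of r in log space
# corresponds to a typical multiplicative error of about exp(r)
# on (1 + meter_reading).
typical_factor = math.exp(test_rmse)

print(f"CV RMSE:   {cv_rmse:.4f}")
print(f"Test RMSE: {test_rmse:.4f}")
print(f"Gap:       {gap:.4f}")
print(f"Typical multiplicative error: x{typical_factor:.2f}")
```

The two errors differ by less than 0.001, which suggests the cross-validated estimate and the hold-out estimate agree closely for this configuration.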

**FITTING THE MODEL WITH BEST PARAMS ON THE FINAL TRAINING SET**

In [None]:
rf_reg_acc=RandomForestRegressor(max_depth=9,n_estimators=100,n_jobs=-1)

In [None]:
rf_model=rf_reg_acc.fit(df_tr_red_final,y_tr)

**STORING THE BEST MODEL AS A PICKLE FILE**

In [None]:
filename='rf_model_modified.txt'

In [None]:
# The with-block flushes and closes the file once the model is written.
with open(filename,'wb') as my_file:
    pickle.dump(rf_model,my_file)
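
To reuse the stored model later, the file can be read back with `pickle.load`. The sketch below shows the write/read round trip in a temporary directory, using a small dictionary as a hypothetical stand-in for the fitted forest so that it runs without the training data; the variable names are assumptions made for illustration.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the fitted model so the sketch runs anywhere.
stand_in_model = {'max_depth': 9, 'n_estimators': 100}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'rf_model_modified.txt')

    # Write the object; the with-block closes the file on exit.
    with open(path, 'wb') as out_file:
        pickle.dump(stand_in_model, out_file)

    # Read it back the same way.
    with open(path, 'rb') as in_file:
        loaded_model = pickle.load(in_file)

print(loaded_model == stand_in_model)
```

With the real `rf_model`, the loaded object would expose `predict`, and its predictions would be in log1p space, so `np.expm1` would be needed to map them back to meter readings.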