**This file contains the base models (XGBoost and CatBoost) for ensembling**

Importing necessary files from drive

In [None]:
from google.colab import drive

In [None]:
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pickle
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from tqdm import tqdm
import sklearn

In [None]:
from sklearn.preprocessing import LabelEncoder,MinMaxScaler
from sklearn.model_selection import StratifiedKFold,KFold
import lightgbm as lgb
from lightgbm import LGBMRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import mean_squared_error
import xgboost as xgb
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from scipy.stats import uniform,randint
from sklearn.model_selection import train_test_split

In [None]:
from xgboost import XGBRegressor

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Load the pickled training frame; the context manager closes the file afterwards.
with open('/content/drive/MyDrive/Project Energy Consumption/df_tr_red_final_modified.txt','rb') as file:
    df_tr_red_final=pickle.load(file)

In [None]:
df_tr_red_final.reset_index(inplace=True)

In [None]:
df_tr_red_final.drop(['index','timestamp'],axis=1,inplace=True)

In [None]:
df_tr_red_final.drop('level_0',axis=1,inplace=True)

**Target Transformation**

1.   Here I take `log1p` of the meter readings and evaluate my base models on RMSE. Because the target is log-transformed, RMSE on the transformed target is equivalent to RMSLE on the original readings (the metric we are evaluated on).
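
The equivalence between RMSE on the `log1p` scale and RMSLE on the raw scale can be checked numerically. The snippet below is a minimal sketch; the readings and predictions are made-up values, and the helper names `rmse` and `rmsle` are illustrative rather than taken from this notebook.

```python
import numpy as np

def rmse(a, b):
    """Root mean squared error between two arrays."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sqrt(np.mean((a - b) ** 2)))

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error on the original scale."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Hypothetical meter readings and predictions on the original scale.
y_true = np.array([0.0, 12.5, 130.0, 980.0])
y_pred = np.array([1.0, 10.0, 150.0, 900.0])

# RMSE computed on log1p-transformed values is the RMSLE of the raw values.
rmse_on_log_scale = rmse(np.log1p(y_true), np.log1p(y_pred))
rmsle_on_raw_scale = rmsle(y_true, y_pred)

# np.expm1 inverts np.log1p, so predictions can be mapped back to raw units.
recovered = np.expm1(np.log1p(y_pred))
```

Predictions made by a model trained on the transformed target are on the `log1p` scale, so `np.expm1` is needed before reporting them as meter readings.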



In [None]:
y_tr=np.log1p(df_tr_red_final['meter_reading'])
df_tr_red_final.drop('meter_reading',axis=1,inplace=True)

**Dropping the features which are not important**

In [None]:
df_tr_red_final.drop(['cloud_coverage','sea_level_pressure','wind_direction','wind_speed',
                      'is_summer_month','is_pub_holiday'],axis=1,inplace=True)

**Custom Ensembling**

1.   First I split the training data 80-20. I then split that 80% portion 50-50. From the first half I draw samples with replacement; each base model trains on one such sample and predicts on the other half.
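
The splitting scheme described above can be sketched on a small synthetic frame to make the resulting sizes concrete. This is an illustration only: the data, the column names, and the names `X_d1`, `X_d2`, `sample_X`, and `sample_y` are assumptions, and the target is looked up by index here instead of being sampled separately.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature frame and the log-transformed target.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["f1", "f2", "f3"])
y = pd.Series(rng.normal(size=1000), name="target")

# Step 1: 80-20 split into training data and a held-out test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Step 2: split the 80% portion into two halves, D1 and D2.
X_d1, X_d2, y_d1, y_d2 = train_test_split(
    X_train, y_train, test_size=0.5, random_state=0
)

# Step 3: draw a bootstrap sample (with replacement) from D1 only.
sample_X = X_d1.sample(frac=0.8, replace=True, random_state=0)
sample_y = y_d1.loc[sample_X.index]

# A base model would be fit on (sample_X, sample_y) and predict on X_d2 and X_test.
sizes = {
    "test": len(X_test),
    "d1": len(X_d1),
    "d2": len(X_d2),
    "sample": len(sample_X),
}
```

Because D2 never overlaps with D1, the base-model predictions on D2 are out-of-sample, which is what makes them usable as training input for a meta model.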



In [None]:
X_train,X_test,y_train,y_test=train_test_split(df_tr_red_final,y_tr,test_size=0.2,random_state=0)

In [None]:
X_train_d1,X_train_d2,y_train_d1,y_train_d2=train_test_split(X_train,y_train,test_size=0.5,random_state=0)

**Sampling with replacement. Setting `random_state` makes the results reproducible**

In [None]:
s1_d1=X_train_d1.sample(frac=0.8,replace=True,random_state=0)
y1_d1=y_train_d1.sample(frac=0.8,replace=True,random_state=0)

In [None]:
s2_d1=X_train_d1.sample(frac=0.8,replace=True,random_state=1)
y2_d1=y_train_d1.sample(frac=0.8,replace=True,random_state=1)
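
Sampling the features and the target separately with the same `random_state` relies on both objects having the same length and row order. A slightly more defensive variant, sketched below on toy data, samples the features once and then looks up the matching targets by index label. The frame contents and the names `s1` and `y1` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical D1 features/target with a shuffled, non-contiguous index,
# similar to what an earlier train_test_split leaves behind.
idx = [17, 3, 42, 8, 25, 11, 30, 5, 19, 2]
X_d1 = pd.DataFrame(
    {"f1": np.arange(10) * 1.0, "f2": np.arange(10) * 2.0}, index=idx
)
y_d1 = pd.Series(np.arange(10) * 10.0, index=idx, name="target")

# Draw the bootstrap sample once, on the features only.
s1 = X_d1.sample(frac=0.8, replace=True, random_state=0)

# Look up the targets by the sampled index labels, so each row keeps its own target.
y1 = y_d1.loc[s1.index]

# Sanity check: in this toy data, target == 10 * f1 for every row.
rows_match = bool((y1.to_numpy() == 10.0 * s1["f1"].to_numpy()).all())
```

This lookup assumes the index labels of D1 are unique, which holds when the index comes from a `reset_index` followed by splitting.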

**Hyperparameter tuning for XGBoost (base model)**

In [None]:
x_cfl=XGBRegressor(tree_method='gpu_hist')
params={'n_estimators':[300,500,1000,1500,2000],
        'learning_rate':[0.01,0.03,0.05,0.1],
        'max_depth':[3,5,7,9],
        'colsample_bytree':[0.5,0.8,0.9,1]}
random_xgb=RandomizedSearchCV(x_cfl,params,scoring='neg_root_mean_squared_error',n_jobs=-1,cv=3,verbose=10,random_state=1,n_iter=10)
random_xgb.fit(s1_d1,y1_d1)

Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:  3.1min
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:  6.5min
[Parallel(n_jobs=-1)]: Done  27 out of  30 | elapsed: 17.7min remaining:  2.0min
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed: 19.2min finished




RandomizedSearchCV(cv=3, error_score=nan,
                   estimator=XGBRegressor(base_score=0.5, booster='gbtree',
                                          colsample_bylevel=1,
                                          colsample_bynode=1,
                                          colsample_bytree=1, gamma=0,
                                          importance_type='gain',
                                          learning_rate=0.1, max_delta_step=0,
                                          max_depth=3, min_child_weight=1,
                                          missing=None, n_estimators=100,
                                          n_jobs=1, nthread=None,
                                          objective='reg:linear',
                                          random_state=0, reg_alpha=...
                                          tree_method='gpu_hist', verbosity=1),
                   iid='deprecated', n_iter=10, n_jobs=-1,
                   param_distributions={'colsampl

**Getting the best params from the randomized search above**

In [None]:
random_xgb.best_params_

{'colsample_bytree': 0.8,
 'learning_rate': 0.1,
 'max_depth': 7,
 'n_estimators': 2000}

In [None]:
random_xgb.best_score_

-0.7204432686169943
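
The score is negative because scikit-learn's `neg_root_mean_squared_error` scorer negates RMSE so that higher is always better. A small sketch of how the number can be read is shown below; the variable names are illustrative, and the ratio is only a rough intuition for the size of the error, not an exact statistic.

```python
import math

# Value reported by the search above (negated RMSE on the log1p scale).
best_score = -0.7204432686169943

# Flip the sign to get the cross-validated RMSE on the log1p scale, i.e. the RMSLE.
cv_rmsle = -best_score

# An error of e on the log1p scale means (1 + prediction) and (1 + actual)
# differ by a factor of exp(e), which gives a rough feel for the error size.
typical_ratio = math.exp(cv_rmsle)

print(f"CV RMSLE: {cv_rmsle:.4f}, approximate ratio: {typical_ratio:.2f}")
```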

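**Fitting the model on the sampled data**

Note that the model below uses `n_estimators=500` and `max_depth=9`, which differ from the search's best values (`n_estimators=2000`, `max_depth=7`); `learning_rate` and `colsample_bytree` match the search result.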

In [None]:
xgb_model_s1=XGBRegressor(n_estimators=500,learning_rate=0.1,max_depth=9,colsample_bytree=0.8,tree_method='gpu_hist')
xgb_model_s1.fit(s1_d1,y1_d1)



XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.8, gamma=0,
             importance_type='gain', learning_rate=0.1, max_delta_step=0,
             max_depth=9, min_child_weight=1, missing=None, n_estimators=500,
             n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
             silent=None, subsample=1, tree_method='gpu_hist', verbosity=1)

**Important Points**

1.   After saving the model, I predict on the other 50% of the data with my base model (XGBoost) and convert the predictions into a dataframe. This dataframe serves as input for my meta model, with the target taken from that same 50% (the ground truth).
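
One detail worth care when pairing predictions with the ground truth: `pd.DataFrame(array)` gets a fresh 0..n-1 index, while the targets keep their original row labels after splitting. The sketch below, on made-up values, shows one way to keep the two aligned. The names `X_d2`, `y_d2`, `base_pred`, and `meta_train` are assumptions for illustration, and `base_pred` stands in for the output of a fitted model's `predict`.

```python
import numpy as np
import pandas as pd

# Hypothetical second-half features and targets with a non-default index.
X_d2 = pd.DataFrame({"f1": [0.1, 0.4, 0.9]}, index=[12, 7, 30])
y_d2 = pd.Series([1.0, 2.0, 3.0], index=[12, 7, 30], name="target")

# Stand-in for base_model.predict(X_d2): one value per row of X_d2.
base_pred = np.array([1.1, 1.8, 3.2])

# Reusing X_d2's index keeps the predictions aligned with the ground truth.
pred_df = pd.DataFrame({"s1_pred": base_pred}, index=X_d2.index)

# Joining on the index pairs each prediction with its own target.
meta_train = pred_df.join(y_d2)

# Without index=..., the frame gets a 0..n-1 index and the join finds no matches.
misaligned = pd.DataFrame({"s1_pred": base_pred}).join(y_d2)
```

If the frames are instead combined positionally (for example via `.to_numpy()`), the default index is harmless; the mismatch only matters for index-based operations such as `join` or `concat(axis=1)`.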



In [None]:
filename='xgb_model_ensemble.txt'
# Write the fitted model inside a context manager so the file is flushed and closed.
with open(filename,'wb') as model_1:
    pickle.dump(xgb_model_s1,model_1)

In [None]:
s1_pred=xgb_model_s1.predict(X_train_d2)

In [None]:
s1_test_pred=xgb_model_s1.predict(X_test)

In [None]:
s1_pred_df=pd.DataFrame(s1_pred,columns=['s1_pred'])

In [None]:
s1_test_df=pd.DataFrame(s1_test_pred,columns=['s1_test_pred'])

In [None]:
filename='s1_pred_df.txt'
# The context manager closes the file once the dump is complete.
with open(filename,'wb') as my_file_1:
    pickle.dump(s1_pred_df,my_file_1)

In [None]:
filename='s1_test_df.txt'
with open(filename,'wb') as my_file_2:
    pickle.dump(s1_test_df,my_file_2)

**From here on I build the second base model (CatBoost).**

In [None]:
pip install catboost

Collecting catboost
  Downloading https://files.pythonhosted.org/packages/20/37/bc4e0ddc30c07a96482abf1de7ed1ca54e59bba2026a33bca6d2ef286e5b/catboost-0.24.4-cp36-none-manylinux1_x86_64.whl (65.7MB)
Installing collected packages: catboost
Successfully installed catboost-0.24.4


In [None]:
from catboost import CatBoostRegressor

**Hyperparameter tuning for the CatBoost model**

In [None]:
params={'max_depth':[3,5,7,9,11,13,15],
'n_estimators':[300,500,800,1000,1200,1500],
'learning_rate':[0.1,0.01,0.03,0.05]}
cat_reg=CatBoostRegressor()
random_cat=RandomizedSearchCV(cat_reg,params,scoring='neg_root_mean_squared_error',n_jobs=-1,cv=3,verbose=1,random_state=1,n_iter=8)
random_cat.fit(s2_d1,y2_d1) 

Fitting 3 folds for each of 8 candidates, totalling 24 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  24 out of  24 | elapsed: 305.6min finished


0:	learn: 1.9809273	total: 4.11s	remaining: 1h 22m 13s
1:	learn: 1.8975109	total: 7.1s	remaining: 1h 10m 52s
2:	learn: 1.8259093	total: 10.2s	remaining: 1h 7m 36s
3:	learn: 1.7638732	total: 13.1s	remaining: 1h 5m 10s
4:	learn: 1.7084046	total: 15.9s	remaining: 1h 3m 23s
5:	learn: 1.6589281	total: 19s	remaining: 1h 2m 55s
6:	learn: 1.6160326	total: 22s	remaining: 1h 2m 32s
7:	learn: 1.5784450	total: 24.9s	remaining: 1h 1m 47s
8:	learn: 1.5470223	total: 27.7s	remaining: 1h 1m 8s
9:	learn: 1.5171361	total: 30.7s	remaining: 1h 56s
10:	learn: 1.4932159	total: 33.9s	remaining: 1h 1m 8s
11:	learn: 1.4711221	total: 37.5s	remaining: 1h 1m 53s
12:	learn: 1.4524577	total: 40.5s	remaining: 1h 1m 34s
13:	learn: 1.4375892	total: 43.3s	remaining: 1h 1m 9s
14:	learn: 1.4233547	total: 46.5s	remaining: 1h 1m 16s
15:	learn: 1.4096439	total: 49.7s	remaining: 1h 1m 18s
16:	learn: 1.3993247	total: 52.9s	remaining: 1h 1m 18s
17:	learn: 1.3890958	total: 56.1s	remaining: 1h 1m 23s
18:	learn: 1.3758981	total: 5

RandomizedSearchCV(cv=3, error_score=nan,
                   estimator=<catboost.core.CatBoostRegressor object at 0x7f156e85aa90>,
                   iid='deprecated', n_iter=8, n_jobs=-1,
                   param_distributions={'learning_rate': [0.1, 0.01, 0.03,
                                                          0.05],
                                        'max_depth': [3, 5, 7, 9, 11, 13, 15],
                                        'n_estimators': [300, 500, 800, 1000,
                                                         1200, 1500]},
                   pre_dispatch='2*n_jobs', random_state=1, refit=True,
                   return_train_score=False,
                   scoring='neg_root_mean_squared_error', verbose=1)

**Finding the best params using the Randomized Search CV**

In [None]:
random_cat.best_params_

{'learning_rate': 0.1, 'max_depth': 15, 'n_estimators': 1200}

In [None]:
random_cat.best_score_

-0.6304033085743497

**Fitting the model with the best params on the sampled data**

In [None]:
catboost_reg_s2=CatBoostRegressor(learning_rate=0.1,max_depth=15,n_estimators=1200)
catboost_reg_s2.fit(s2_d1,y2_d1)

0:	learn: 1.9809273	total: 3.74s	remaining: 1h 14m 43s
1:	learn: 1.8975109	total: 6.42s	remaining: 1h 4m 5s
2:	learn: 1.8259093	total: 9.13s	remaining: 1h 40s
3:	learn: 1.7638732	total: 11.9s	remaining: 59m 5s
4:	learn: 1.7084046	total: 14.5s	remaining: 57m 42s
5:	learn: 1.6589281	total: 17.2s	remaining: 56m 54s
6:	learn: 1.6160326	total: 19.9s	remaining: 56m 30s
7:	learn: 1.5784450	total: 22.6s	remaining: 56m 7s
8:	learn: 1.5470223	total: 25.3s	remaining: 55m 47s
9:	learn: 1.5171361	total: 29.2s	remaining: 57m 49s
10:	learn: 1.4932159	total: 33s	remaining: 59m 28s
11:	learn: 1.4711221	total: 35.9s	remaining: 59m 9s
12:	learn: 1.4524577	total: 38.5s	remaining: 58m 31s
13:	learn: 1.4375892	total: 41.1s	remaining: 58m
14:	learn: 1.4233547	total: 43.9s	remaining: 57m 51s
15:	learn: 1.4096439	total: 48.2s	remaining: 59m 26s
16:	learn: 1.3993247	total: 51.1s	remaining: 59m 13s
17:	learn: 1.3890958	total: 53.7s	remaining: 58m 48s
18:	learn: 1.3758981	total: 56.4s	remaining: 58m 25s
19:	learn

<catboost.core.CatBoostRegressor at 0x7f5426476dd8>

**Important points**

1.   After saving the model, I predict on the other 50% of the data and convert the predictions into a dataframe, which will serve as input for my meta model.
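
As a rough sketch of that hand-off, the predictions of both base models can be placed side by side, one column per model, and passed to a meta model. The example below uses made-up numbers and a plain least-squares blend as a stand-in; the actual meta model is not part of this notebook, and the names `meta_X`, `weights`, and `blend` are illustrative.

```python
import numpy as np
import pandas as pd

def rmse(a, b):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(np.mean((np.asarray(a) - np.asarray(b)) ** 2)))

# Hypothetical ground truth for the second half (log1p scale).
y_d2 = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=[4, 9, 1, 7, 3], name="target")

# Stand-ins for the two base models' predictions on the same rows.
s1_pred = np.array([1.2, 1.9, 3.1, 3.8, 5.3])
s2_pred = np.array([0.9, 2.2, 2.8, 4.1, 4.9])

# One column per base model, indexed like the ground truth.
meta_X = pd.DataFrame({"s1_pred": s1_pred, "s2_pred": s2_pred}, index=y_d2.index)

# Stand-in meta model: least-squares weights for blending the two columns.
weights, *_ = np.linalg.lstsq(meta_X.to_numpy(), y_d2.to_numpy(), rcond=None)
blend = meta_X.to_numpy() @ weights

# On the data it was fit on, the blend cannot be worse than either input alone.
blend_rmse = rmse(blend, y_d2)
base_rmses = (rmse(s1_pred, y_d2), rmse(s2_pred, y_d2))
```

The comparison here is made on the fitting data only; a fair assessment of any meta model would use the held-out 20% test split.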



In [None]:
filename='catboost_model_ensemble.txt'
# Write the fitted model inside a context manager so the file is flushed and closed.
with open(filename,'wb') as model_2:
    pickle.dump(catboost_reg_s2,model_2)

In [None]:
from google.colab import files
# Call download() without rebinding `files`, so the module name is not shadowed.
files.download('/content/catboost_model_ensemble.txt')


In [None]:
s2_predict=catboost_reg_s2.predict(X_train_d2)

In [None]:
s2_predict_df=pd.DataFrame(s2_predict,columns=['s2_predict'])

In [None]:
s2_predict_test=catboost_reg_s2.predict(X_test)

In [None]:
s2_predict_test_df=pd.DataFrame(s2_predict_test,columns=['s2_predict_test'])

In [None]:
filename='s2_predict_df.txt'
# The context manager closes the file once the dump is complete.
with open(filename,'wb') as my_file_3:
    pickle.dump(s2_predict_df,my_file_3)

In [None]:
filename='s2_predict_test_df.txt'
with open(filename,'wb') as my_file_4:
    pickle.dump(s2_predict_test_df,my_file_4)

**End of Notebook**