# **Accuracy comparison of energy predictions for different building types in Phoenix with three machine learning models**

<font size=4>Dou Zhenyu A0213310E</font>

**<font size=5>1. Introduction</font>**

This notebook uses the energy consumption data of buildings in Phoenix from the 'Building Data Genome Project 1' dataset and computes the average energy consumption per square meter for each building type.

Based on these averages, three machine learning models (LightGBM, Random Forest and XGBoost) are trained to predict the energy consumption of each building type. The notebook then compares the accuracy of the three models and draws some conclusions.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from pandas_profiling import ProfileReport

import matplotlib.pyplot as plt
from matplotlib import dates as md
import seaborn as sns
import plotly.graph_objs as go
import plotly
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import cufflinks as cf
cf.set_config_file(offline=True)
import os
import lightgbm as lgb
import xgboost as xgb
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
from sklearn.ensemble import RandomForestRegressor

import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.tsa.api as smt





 **<font size=5>2. Data filtering and cleaning</font>**

The first step is to filter the buildings in Phoenix. As the preliminary look at the metadata below shows, a city can have several weather files, possibly because its buildings are located in different areas. To simplify the analysis, this notebook selects the weather file that covers the largest share of Phoenix buildings and analyzes the buildings assigned to it.

**<font size=4>2.1 Weather data and schedule data filtering</font>**

In [None]:
df_meta = pd.read_csv('/kaggle/input/building-data-genome-project-v1/meta_open.csv')
df_meta_Phoenix = df_meta[df_meta['timezone']=='America/Phoenix']
df_meta_Phoenix

In [None]:
df_meta_Phoenix.pivot_table(index='timezone',columns='newweatherfilename', values='uid', aggfunc='count').plot.bar(stacked=True, figsize=(8,5))

In [None]:
df_meta_Phoenix.pivot_table(index='timezone',columns='annualschedule', values='uid', aggfunc='count').plot.bar(stacked=True, figsize=(8,5))

**<font size=4>2.2 Filtering buildings and averaging by building type</font>**

The charts above show that all Phoenix buildings use weather0 and schedule2. This notebook therefore keeps the energy consumption data of the buildings assigned to weather0. To reduce the effect of differences between individual buildings, the consumption is normalized per square meter and then averaged over all buildings of the same type.

In [None]:
list_Phoenix = df_meta_Phoenix['uid'].to_list()
list_Phoenix

In [None]:
df_powermeter = pd.read_csv('/kaggle/input/building-data-genome-project-v1/temp_open_utc_complete.csv', index_col='timestamp', parse_dates=True)
df_powermeter.index = df_powermeter.index.tz_localize(None)
df_powermeter = df_powermeter/df_meta.set_index('uid').loc[df_powermeter.columns, 'sqm']
df_powermeter_Phoenix =  df_powermeter[list_Phoenix].dropna(how='all')
df_powermeter_Phoenix

In [None]:
# Column positions of each building type in df_powermeter_Phoenix.
# The slices assume the columns are grouped by type in exactly this order.
type_slices = {
    'Office': slice(0, 22),
    'PrimClass': slice(22, 24),
    'UnivClass': slice(24, 54),
    'UnivDorm': slice(54, 67),
    'UnivLab': slice(67, 96),
}

# The row-wise mean over the buildings of a type gives one average series per type.
# Working on the result of mean() avoids assigning into a slice of the original frame.
df_Phoenix_avarage = pd.concat(
    {name: df_powermeter_Phoenix.iloc[:, cols].mean(axis=1) for name, cols in type_slices.items()},
    axis=1,
)
df_Phoenix_avarage
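
The fixed column positions above only work while the columns stay in this exact order. As a hedged alternative, the sketch below groups the meters by a type label taken from a metadata table. The toy tables and the `building_type` column are assumptions made for illustration; the corresponding column in `meta_open.csv` would need to be checked before applying this to the real data.

```python
import pandas as pd

# Toy metadata: one row per building (names and values are made up)
toy_meta = pd.DataFrame(
    {
        "uid": ["A1", "A2", "B1"],
        "building_type": ["Office", "Office", "UnivLab"],  # hypothetical column
        "sqm": [100.0, 200.0, 50.0],
    }
).set_index("uid")

# Toy hourly meter readings, one column per building
toy_index = pd.date_range("2015-01-01", periods=3, freq=pd.Timedelta(hours=1))
toy_load = pd.DataFrame(
    {"A1": [100.0, 200.0, 300.0], "A2": [200.0, 200.0, 200.0], "B1": [50.0, 100.0, 150.0]},
    index=toy_index,
)

# Normalise each column by its floor area (the Series index aligns with the column names)
toy_per_sqm = toy_load / toy_meta.loc[toy_load.columns, "sqm"]

# Average the per-sqm series of all buildings that share a type
type_of_column = toy_meta.loc[toy_per_sqm.columns, "building_type"]
toy_average = toy_per_sqm.T.groupby(type_of_column).mean().T
print(toy_average)
```

Grouping by a label keeps the averaging correct even if buildings are added, removed or reordered.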

 **<font size=5>3. Prediction modeling</font>**

For the modelling part, this notebook uses LightGBM, Random Forest and XGBoost as prediction models, with time, temperature and schedule as training features. Three months of data (April, August and December) serve as test data and the remaining nine months as training data. After prediction, R-squared and MAPE (mean absolute percentage error) are used to assess the accuracy of each model and to compare the models.
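
As a rough illustration of this split, the sketch below separates a toy hourly series by calendar month. The frame `toy` and its values are placeholders, not the notebook's data.

```python
import numpy as np
import pandas as pd

# Toy hourly series covering one non-leap year (values are placeholders)
toy_index = pd.date_range("2015-01-01", "2015-12-31 23:00", freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({"load_meas": np.arange(len(toy_index), dtype=float)}, index=toy_index)

test_months = [4, 8, 12]  # April, August, December
is_test = toy.index.month.isin(test_months)

# Every row belongs to exactly one of the two sets
train = toy.loc[~is_test]
test = toy.loc[is_test]

print(len(train), len(test))
```

Defining the test months once and negating the mask avoids listing the nine training months by hand.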

 **<font size=4>3.1 Pre-processing features</font>**

According to part 2, the weather file for Phoenix is weather0 and the schedule file is schedule2. The next step is to pre-process both for the prediction models.

3.1.1 Weather data processing

In [None]:
df_weather = pd.read_csv('/kaggle/input/building-data-genome-project-v1/weather0.csv', index_col='timestamp', parse_dates=True)
df_weather = df_weather.select_dtypes(['int', 'float'])

for col in df_weather.columns:
    df_weather.loc[df_weather[col]<-100, col] = np.nan
    
# Forward-fill the gaps; the result must be assigned back, otherwise it is discarded
df_weather = df_weather.ffill()

df_weather = df_weather.reset_index().drop_duplicates(subset=['timestamp'])

df_weather = df_weather.set_index('timestamp').resample('1H').mean()

df_weather.loc[:, df_weather.columns.str.contains('TemperatureC')].iplot()
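
The cleaning steps in this cell can be followed on a small example. The sketch below applies the same sequence to a toy series; the sentinel value -9999 is an assumption about how missing readings might be coded, since the notebook itself only filters values below -100.

```python
import numpy as np
import pandas as pd

# Toy weather readings at irregular times; -9999 is an assumed missing-value sentinel
toy_times = pd.to_datetime(
    ["2015-01-01 00:10", "2015-01-01 00:40", "2015-01-01 01:20", "2015-01-01 01:20", "2015-01-01 02:30"]
)
toy_weather = pd.DataFrame({"TemperatureC": [10.0, -9999.0, 12.0, 12.0, 14.0]}, index=toy_times)
toy_weather.index.name = "timestamp"

# 1. Treat physically impossible readings as missing
toy_weather = toy_weather.mask(toy_weather < -100)

# 2. Carry the last valid reading forward (the result must be assigned back)
toy_weather = toy_weather.ffill()

# 3. Drop repeated timestamps, then average onto a regular hourly grid
toy_weather = toy_weather[~toy_weather.index.duplicated(keep="first")]
toy_hourly = toy_weather.resample(pd.Timedelta(hours=1)).mean()
print(toy_hourly)
```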

3.1.2 Schedule data processing

In [None]:
df_schedule = pd.read_csv('/kaggle/input/building-data-genome-project-v1/schedule2.csv', header=None)
df_schedule = df_schedule.rename(columns={0:'date',1:'date_type'})
df_schedule['date'] = pd.to_datetime(df_schedule['date'])
df_schedule_encode = df_schedule.copy()
df_schedule_encode['date_type'] = LabelEncoder().fit_transform(df_schedule_encode['date_type'])
df_schedule_encode.set_index('date').iplot()
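
The model loops later attach this daily label to every hourly row. A minimal sketch of that join on toy data follows. The label names are invented, and `Series.map` is used here instead of the notebook's `merge` because it keeps the hourly index without any extra step.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy daily schedule (label names are invented for this example)
toy_schedule = pd.DataFrame(
    {
        "date": pd.to_datetime(["2015-01-01", "2015-01-02"]),
        "date_type": ["Holiday", "Regular"],
    }
)
# LabelEncoder assigns integers to the sorted unique labels
toy_schedule["date_type"] = LabelEncoder().fit_transform(toy_schedule["date_type"])

# Toy hourly rows spanning both days
toy_index = pd.date_range("2015-01-01 22:00", periods=4, freq=pd.Timedelta(hours=1))
toy_hourly = pd.DataFrame({"load_meas": [1.0, 2.0, 3.0, 4.0]}, index=toy_index)

# Look up each row's calendar day in the schedule; the hourly index is untouched
day_to_type = toy_schedule.set_index("date")["date_type"]
days = pd.Series(toy_hourly.index.normalize(), index=toy_hourly.index)
toy_hourly["date_type"] = days.map(day_to_type)
print(toy_hourly)
```

A day that is missing from the schedule would show up as NaN in this version rather than silently dropping rows.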

 **<font size=4>3.2 LightGBM model</font>**

[LightGBM](https://lightgbm.readthedocs.io/en/latest/) uses histogram-based algorithms, which bucket continuous feature (attribute) values into discrete bins. This speeds up training and reduces memory usage. (Ref: https://lightgbm.readthedocs.io/en/latest/)
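
To make the binning idea concrete, the sketch below buckets a continuous feature into a few quantile-based bins with NumPy. It illustrates the general idea only and is not LightGBM's actual binning routine; the bin count and the data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
temperature = rng.uniform(0.0, 45.0, size=1000)  # toy continuous feature

n_bins = 8  # arbitrary; far fewer than the number of distinct values
# Interior bin edges at equally spaced quantiles of the data
edges = np.quantile(temperature, np.linspace(0, 1, n_bins + 1)[1:-1])

# Replace each value by the index of the bin it falls into (0 .. n_bins - 1)
bin_index = np.digitize(temperature, edges)

print(np.unique(bin_index))    # the discrete values a split search would scan
print(np.bincount(bin_index))  # number of samples per bin
```

A tree split then only needs to consider the bin boundaries as candidate thresholds rather than every distinct value.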

The following code cells draw heavily on the Kaggle notebook by Fu Chun: https://www.kaggle.com/patrick0302/load-prediction-for-bdg1-0

In [None]:
R_light=[]
M_light=[]

In [None]:
for name in ['Office','PrimClass','UnivClass','UnivDorm','UnivLab']:
    
    print(name)

    # Prepare data    
    
    df_temp = df_Phoenix_avarage[[name]].copy()
    df_temp = df_temp.dropna()

    # Add time features
    df_temp['weekday'] = df_temp.index.weekday
    df_temp['hour'] = df_temp.index.hour
    df_temp['date'] = pd.to_datetime(df_temp.index.date)

    # Add temperature features
    df_temp = df_temp.rename(columns={name: 'load_meas'})
    df_temp = df_temp.merge(df_weather.loc[:, df_weather.columns.str.contains('TemperatureC')], left_index=True, right_index=True)

    # Add schedule features
    index = df_temp.index.copy()
    df_temp = df_temp.merge(df_schedule_encode, on='date')
    df_temp.index = index

    # Split data into training data and testing data
    traindata = df_temp.loc[df_temp.index.month.isin([1,2,3,5,6,7,9,10,11])].dropna().copy()
    testdata = df_temp.loc[df_temp.index.month.isin([4,8,12])].copy()

    train_labels = traindata['load_meas']
    test_labels = testdata['load_meas']

    train_features = traindata.drop(['load_meas', 'date'], axis=1)
    test_features = testdata.drop(['load_meas', 'date'], axis=1)
    
    # Instantiate model 
    LGB_model = lgb.LGBMRegressor()
    
    # Train the model on training data
    LGB_model.fit(train_features, train_labels)

    testdata['load_pred'] = LGB_model.predict(test_features)
    df_temp.loc[testdata.index, 'load_pred'] = testdata['load_pred']
    
    # Calculate the absolute errors
    errors = abs(testdata['load_pred'] - test_labels)

    RSQUARED = r2_score(testdata.dropna()['load_meas'], testdata.dropna()['load_pred'])
    MAPE = errors/test_labels
    MAPE = MAPE.loc[MAPE!=np.inf]
    MAPE = MAPE.loc[MAPE!=-np.inf]
    MAPE = MAPE.dropna().mean()*100
    
    # Visualization
    print("R SQUARED: "+str(round(RSQUARED,3)))
    print("MAPE: "+str(round(MAPE,1))+'%')
    testdata[['load_meas', 'load_pred']].reset_index(drop=True).iplot()
    
    #Summary
    R_light.append(RSQUARED)
    M_light.append(MAPE)    
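
The MAPE steps in the loop above can be collected into one small function. The sketch below mirrors that calculation; the name `mape_percent` is introduced here for illustration, and it divides by the absolute measured value, which gives the same result for the positive loads used in this notebook.

```python
import numpy as np
import pandas as pd

def mape_percent(measured: pd.Series, predicted: pd.Series) -> float:
    """Mean absolute percentage error in percent, ignoring undefined ratios."""
    ratio = (predicted - measured).abs() / measured.abs()
    # A zero measurement gives inf (or NaN for 0/0); drop those rows before averaging
    ratio = ratio.replace([np.inf, -np.inf], np.nan).dropna()
    return float(ratio.mean() * 100)

# Toy values; the third measurement is zero and is therefore excluded
measured = pd.Series([10.0, 20.0, 0.0, 40.0])
predicted = pd.Series([11.0, 18.0, 1.0, 40.0])
print(mape_percent(measured, predicted))
```

Because rows with a zero measurement are dropped, MAPE can look better than it should for building types with many zero readings, so it is worth reading together with R-squared.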

 **<font size=4>3.3 Random forest model</font>**

[Random forests](https://en.wikipedia.org/wiki/Random_forest) or random decision forests are an ensemble learning method for classification, regression and other tasks that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean/average prediction (regression) of the individual trees. (Ref: https://en.wikipedia.org/wiki/Random_forest)
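
The averaging step for regression can be checked directly in scikit-learn. The sketch below fits a small forest on toy data and compares the mean of the individual trees' predictions with the forest's own output; the data and settings are arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 24, size=(200, 1))  # toy feature, e.g. hour of day
y = np.sin(X[:, 0] / 24 * 2 * np.pi) + rng.normal(0, 0.1, 200)

forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)

# Prediction of every individual tree, shape (n_trees, n_samples)
per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])

# The forest's regression output is the mean over its trees
forest_prediction = forest.predict(X)
print(np.allclose(per_tree.mean(axis=0), forest_prediction))
```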

In [None]:
R_random=[]
M_random=[]

In [None]:
for name in ['Office','PrimClass','UnivClass','UnivDorm','UnivLab']:
    
    print(name)

    # Prepare data    
    
    df_temp = df_Phoenix_avarage[[name]].copy()
    df_temp = df_temp.dropna()

    # Add time features
    df_temp['weekday'] = df_temp.index.weekday
    df_temp['hour'] = df_temp.index.hour
    df_temp['date'] = pd.to_datetime(df_temp.index.date)

    # Add temperature features
    df_temp = df_temp.rename(columns={name: 'load_meas'})
    df_temp = df_temp.merge(df_weather.loc[:, df_weather.columns.str.contains('TemperatureC')], left_index=True, right_index=True)

    # Add schedule features
    index = df_temp.index
    df_temp = df_temp.merge(df_schedule_encode, on='date')
    df_temp.index=index

    # Split data into training data and testing data
    traindata = df_temp.loc[df_temp.index.month.isin([1,2,3,5,6,7,9,10,11])].dropna().copy()
    testdata = df_temp.loc[df_temp.index.month.isin([4,8,12])].copy()

    train_labels = traindata['load_meas'].fillna(0)
    test_labels = testdata['load_meas'].fillna(0)

    train_features = traindata.drop(['load_meas', 'date'], axis=1).fillna(0)
    test_features = testdata.drop(['load_meas', 'date'], axis=1).fillna(0)

    # Instantiate model 
    rf = RandomForestRegressor()
    
    # Train the model on training data
    rf.fit(train_features, train_labels);
    
    testdata['load_pred'] = rf.predict(test_features)
    df_temp.loc[testdata.index, 'load_pred'] = testdata['load_pred']
    
    # Calculate the absolute errors
    errors = abs(testdata['load_pred'] - test_labels)

    RSQUARED = r2_score(testdata.dropna()['load_meas'], testdata.dropna()['load_pred'])
    MAPE = errors/test_labels
    MAPE = MAPE.loc[MAPE!=np.inf]
    MAPE = MAPE.loc[MAPE!=-np.inf]
    MAPE = MAPE.dropna().mean()*100
    
    #Visualization
    print("R SQUARED: "+str(round(RSQUARED,3)))
    print("MAPE: "+str(round(MAPE,1))+'%')
    testdata[['load_meas', 'load_pred']].reset_index(drop=True).iplot()
    
    #Summary
    R_random.append(RSQUARED)
    M_random.append(MAPE)  


 **<font size=4>3.4 XGBoost model</font>**

[XGBoost](https://xgboost.readthedocs.io/en/latest/) is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides parallel tree boosting (also known as GBDT, GBM) that solves many data science problems in a fast and accurate way. The same code runs on major distributed environments (Hadoop, SGE, MPI) and can solve problems beyond billions of examples. (Ref: https://xgboost.readthedocs.io/en/latest/)
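
The idea behind the Gradient Boosting framework can be sketched in a few lines: each new tree is fitted to the errors left by the trees before it. The example below is plain gradient boosting for squared error built from shallow scikit-learn trees. It is a teaching sketch, not XGBoost's implementation, which adds regularisation, second-order gradient information and system-level optimisations; the learning rate, depth and data are arbitrary.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 24, size=(300, 1))
y = np.sin(X[:, 0] / 24 * 2 * np.pi) + rng.normal(0, 0.1, 300)

learning_rate = 0.3
prediction = np.full_like(y, y.mean())  # start from a constant model
baseline_mse = np.mean((y - prediction) ** 2)

trees = []
for _ in range(50):
    # For squared error, the negative gradient is the current residual
    residual = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residual)
    # Add a damped correction from the new tree to the running prediction
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

boosted_mse = np.mean((y - prediction) ** 2)
print(baseline_mse, boosted_mse)
```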

In [None]:
R_xg=[] 
M_xg=[] 

In [None]:
for name in ['Office','PrimClass','UnivClass','UnivDorm','UnivLab']:
    
    print(name)

    # Prepare data  
    
    df_temp = df_Phoenix_avarage[[name]].copy()
    df_temp = df_temp.dropna()

    # Add time features
    df_temp['weekday'] = df_temp.index.weekday
    df_temp['hour'] = df_temp.index.hour
    df_temp['date'] = pd.to_datetime(df_temp.index.date)

    # Add temperature features
    df_temp = df_temp.rename(columns={name: 'load_meas'})
    df_temp = df_temp.merge(df_weather.loc[:, df_weather.columns.str.contains('TemperatureC')], left_index=True, right_index=True)

    # Add schedule features
    index = df_temp.index
    df_temp = df_temp.merge(df_schedule_encode, on='date')
    df_temp.index=index

    # Split data into training data and testing data
    traindata = df_temp.loc[df_temp.index.month.isin([1,2,3,5,6,7,9,10,11])].dropna().copy()
    testdata = df_temp.loc[df_temp.index.month.isin([4,8,12])].copy()

    train_labels = traindata['load_meas'].fillna(0)
    test_labels = testdata['load_meas'].fillna(0)

    train_features = traindata.drop(['load_meas', 'date'], axis=1).fillna(0)
    test_features = testdata.drop(['load_meas', 'date'], axis=1).fillna(0)

    # Instantiate model 
    xg_reg = xgb.XGBRegressor()
    
    # Train the model on training data
    xg_reg.fit(train_features, train_labels);
    
    testdata['load_pred'] = xg_reg.predict(test_features)
    df_temp.loc[testdata.index, 'load_pred'] = testdata['load_pred']
    
    # Calculate the absolute errors
    errors = abs(testdata['load_pred'] - test_labels)

    RSQUARED = r2_score(testdata.dropna()['load_meas'], testdata.dropna()['load_pred'])
    MAPE = errors/test_labels
    MAPE = MAPE.loc[MAPE!=np.inf]
    MAPE = MAPE.loc[MAPE!=-np.inf]
    MAPE = MAPE.dropna().mean()*100
    
    #Visualization
    print("R SQUARED: "+str(round(RSQUARED,3)))
    print("MAPE: "+str(round(MAPE,1))+'%')
    testdata[['load_meas', 'load_pred']].reset_index(drop=True).iplot()
    
    #Summary
    R_xg.append(RSQUARED)
    M_xg.append(MAPE)  
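
The three loops above repeat the same split, fit and scoring steps. One possible way to share them is a single function that accepts any regressor with `fit` and `predict`. The sketch below assumes a prepared feature frame with a `load_meas` column and runs on toy data; `evaluate_model` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

def evaluate_model(model, frame, target="load_meas", test_months=(4, 8, 12)):
    """Fit `model` on the training months and return (R2, MAPE in %) on the test months."""
    frame = frame.dropna()
    is_test = frame.index.month.isin(test_months)
    train, test = frame.loc[~is_test], frame.loc[is_test]

    model.fit(train.drop(columns=target), train[target])
    predicted = model.predict(test.drop(columns=target))

    # Same MAPE steps as in the loops: drop undefined ratios, then average
    ratio = np.abs(predicted - test[target]) / test[target].abs()
    ratio = ratio.replace([np.inf, -np.inf], np.nan).dropna()
    return r2_score(test[target], predicted), float(ratio.mean() * 100)

# Toy hourly data with a daily cycle (placeholder values, not the notebook's data)
rng = np.random.default_rng(0)
toy_index = pd.date_range("2015-01-01", "2015-12-31 23:00", freq=pd.Timedelta(hours=1))
toy = pd.DataFrame(index=toy_index)
toy["hour"] = toy.index.hour
toy["weekday"] = toy.index.weekday
toy["load_meas"] = 10 + 5 * np.sin(toy["hour"] / 24 * 2 * np.pi) + rng.normal(0, 0.5, len(toy))

# Any regressor with fit/predict can be passed in, e.g. lgb.LGBMRegressor() or xgb.XGBRegressor()
r2, mape = evaluate_model(RandomForestRegressor(n_estimators=20, random_state=0), toy)
print(round(r2, 3), round(mape, 1))
```

Running all models through one function also guarantees that they are scored on exactly the same rows.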

 **<font size=5>4. Accuracy comparison</font>**

The R-squared and MAPE values of the three models, collected in the loops above, are summarized in the bar charts below.

In [None]:
# For RSQUARED
R = {'Type':['Office','PrimClass','UnivClass','UnivDorm','UnivLab'],'LightGBM':R_light,'Randomforest':R_random,'XGBoost':R_xg}
df_R=pd.DataFrame(R)
df_R=df_R.set_index('Type')
df_R.iplot(kind='bar',title='R-SQUARED',yaxis_title='R-SQUARED',xaxis_title='Building Type')

In [None]:
# For MAPE
M = {'Type':['Office','PrimClass','UnivClass','UnivDorm','UnivLab'],'LightGBM':M_light,'Randomforest':M_random,'XGBoost':M_xg}
df_M=pd.DataFrame(M)
df_M=df_M.set_index('Type')
df_M.iplot(kind='bar',title='MAPE',yaxis_title='MAPE(%)',xaxis_title='Building Type')
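
Besides the bar charts, the best model per building type can be read off a score table programmatically. The sketch below shows the idea on a table shaped like `df_R`; the numbers and type names are placeholders for illustration only and are not results from this notebook.

```python
import pandas as pd

# Placeholder scores for illustration only; they are NOT results from this notebook
toy_r2 = pd.DataFrame(
    {"LightGBM": [0.90, 0.40], "Randomforest": [0.85, 0.45], "XGBoost": [0.88, 0.30]},
    index=pd.Index(["TypeA", "TypeB"], name="Type"),
)

# Highest R-squared per building type (for a MAPE table, idxmin would be used instead)
best_model = toy_r2.idxmax(axis=1)
# Gap between the best and the worst model for each type
spread = toy_r2.max(axis=1) - toy_r2.min(axis=1)

print(pd.DataFrame({"best_model": best_model, "spread": spread.round(2)}))
```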

 **<font size=5>5. Conclusion</font>**

From the analysis above, this notebook concludes that:  
1. LightGBM and XGBoost are more accurate than Random Forest and also run faster.  
2. Offices and labs are easier to predict than classrooms and dormitories, which may be due to their more regular schedules.  
3. The error for Primary Class buildings is the least acceptable, which may be because the dataset contains only two buildings of this type.