# ASHRAE – Great Energy Predictor III
By Mehul Haria, Raymond Huang, Naureen Pethani, Shelina Khoja, Kinnari Patel

### Abstract:
We use the dataset from ASHRAE – Great Energy Predictor III (How much energy will a building consume?). The goal is to develop models from ASHRAE’s 2016 data in order to better understand metered building energy usage for four meter types: chilled water, electric, hot water, and steam. The data comes from over 1,000 buildings over a one-year timeframe. The main method chosen to solve the problem is linear regression, with decision tree and random forest regressors applied for comparison.
    

### Introduction:
As the impact of climate change is increasingly felt, organizations are looking for ways to lower their energy footprint. Our organization (Greentech Inc) has developed technology to provide highly efficient heating and cooling systems that operate at much lower energy consumption.

We have partnered with governments across the world to retrofit the highest energy users with significant subsidies from our government partners. Highest energy users are defined as those with the highest energy consumption as determined by their energy meter readings.

By starting with the highest energy consumption users, we can prioritize our resources to provide the greatest impact, ultimately resulting in reduced energy consumption and a more environmentally friendly solution.

    
    
### Background:
The datasets presented in (Kaggle ASHRAE Energy Predictor III dataset) are related to ASHRAE – Great Energy Predictor III.

For the purpose of this lab assignment, we treat the task as a linear regression problem using the ASHRAE Energy Predictor III dataset. The full library of datasets and their descriptions is located here: (Kaggle ASHRAE Energy Predictor III dataset).

### Objective

The objective of this article is to provide a reliable and feasible algorithm to predict how much energy a building will consume. The train dataset contains our target variable, meter_reading, with a float datatype, so the task can be solved with linear regression methods. The following methodology is used:

Linear regression will be applied to the problem as follows:
• Put all relevant variables in the model
• Leave the irrelevant variables out
• Check linearity
• Check regression assumptions:
– Residuals have a mean of zero
– Normality of errors
– Linearity of variables
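
The assumption checks listed above can be sketched in a few lines. This is a minimal illustration on synthetic data, not on the ASHRAE tables; the names `X_demo`, `y_demo` and `model_demo` are hypothetical, and the Shapiro-Wilk test is only one of several possible normality checks.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption: a truly linear signal plus Gaussian noise)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 2))
y_demo = 3.0 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=500)

model_demo = LinearRegression().fit(X_demo, y_demo)
fitted = model_demo.predict(X_demo)
residuals = y_demo - fitted

# 1) Residuals average to (numerically) zero when an intercept is fitted
residual_mean = residuals.mean()

# 2) Normality of errors: a large p-value gives no evidence against normality
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# 3) Linearity: residuals should show no linear relationship with the fitted values
fitted_corr = np.corrcoef(fitted, residuals)[0, 1]

print(f"mean residual        = {residual_mean:.2e}")
print(f"Shapiro-Wilk p-value = {shapiro_p:.3f}")
print(f"corr(fitted, resid)  = {fitted_corr:.2e}")
```

On real data a residual-versus-fitted plot is usually more informative than the single correlation value, since curvature can hide behind a zero correlation.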

### Outline
1.Data Understanding

2.Data Preparation

2.1 Merging tables

2.2 Dropping columns and filling null values for columns: 'air_temperature', 'wind_speed', 'precip_depth_1_hr', 'cloud_coverage'

3.Data Modeling

3.1 Linear Modeling and optimization

3.2 Decision Tree modeling

3.3 Random Forest modeling
    
    
    

In [None]:
%matplotlib inline
import pandas as pd
import seaborn as sns
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.model_selection  import train_test_split
import numpy as np
from scipy.stats import norm # for scientific Computing
from scipy import stats, integrate
import matplotlib.pyplot as plt


### 1.Data Understanding

Reading the datasets

The dataset presented in (Kaggle ASHRAE Energy Predictor III dataset) consists of five primary files.
They are described below, with the variable names included:
train.csv with (20216100, 4) data

building_id - Foreign key for the building metadata.
meter - The meter id code. Read as {0: electricity, 1: chilledwater, 2: steam, 3: hotwater}. Not every building has all meter types.
timestamp - When the measurement was taken
meter_reading - The target variable. Energy consumption in kWh (or equivalent). Note that this is real data with measurement error, which we expect will impose a baseline level of modeling error. 

building_metadata.csv with (1449, 6) data

site_id - Foreign key for the weather files.
building_id - Foreign key for train.csv
primary_use - Indicator of the primary category of activities for the building based on EnergyStar property type definitions
square_feet - Gross floor area of the building
year_built - Year building was opened
floor_count - Number of floors of the building

weather_[train/test].csv with (139773, 8) / (277243, 8) data

Weather data from a meteorological station as close as possible to the site.
site_id
air_temperature - Degrees Celsius
cloud_coverage - Portion of the sky covered in clouds, in oktas
dew_temperature - Degrees Celsius
precip_depth_1_hr - Millimeters
sea_level_pressure - Millibar/hectopascals
wind_direction - Compass direction (0-360)
wind_speed - Meters per second

test.csv with (41697600, 4) data

row_id - Row id for your submission file
building_id - Building id code
meter - The meter id code
timestamp - Timestamps for the test data period

sample_submission.csv with (41697600, 2) data

A valid sample submission.


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:


# Load the competition files
ASHRAE_train = pd.read_csv('/kaggle/input/ashrae-energy-prediction/train.csv')
ASHRAE_test = pd.read_csv('/kaggle/input/ashrae-energy-prediction/test.csv')
weather_train = pd.read_csv('/kaggle/input/ashrae-energy-prediction/weather_train.csv')
weather_test = pd.read_csv('/kaggle/input/ashrae-energy-prediction/weather_test.csv')
building_meta = pd.read_csv('/kaggle/input/ashrae-energy-prediction/building_metadata.csv')

To perform the analysis, certain Python libraries were used; the code above loads and initializes them. We have about 41 million rows to predict with the built model.
We combine three datasets, train.csv, building_metadata.csv and weather_train.csv, using the foreign keys building_id and (site_id, timestamp) respectively. So we are dealing with big datasets here (about 20 and 40 million rows).
To save memory, we use a function taken from a popular Kaggle notebook that reduces the memory usage of the datasets.
After memory reduction, the datatype of building_id changes from int64 to int16. Overall memory usage is roughly halved, which improves speed and performance.
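
The effect of downcasting a single column can be illustrated as follows. This is a small sketch on a made-up frame, assuming building ids in the range 0 to 1448; it uses `pd.to_numeric(..., downcast="integer")` instead of the helper function used in this notebook, and the name `ids` is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the building_id column, stored as int64
ids = pd.DataFrame({"building_id": np.arange(1449, dtype=np.int64)})
bytes_before = ids["building_id"].memory_usage(index=False)

# Downcast to the smallest signed integer type that can hold every value
ids["building_id"] = pd.to_numeric(ids["building_id"], downcast="integer")
bytes_after = ids["building_id"].memory_usage(index=False)

print(ids["building_id"].dtype, bytes_before, bytes_after)
```

An int16 value takes 2 bytes instead of 8, so this one column shrinks to a quarter of its size; the overall saving depends on the mix of column types in the table.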


In [None]:
ASHRAE_train.info()


In [None]:
weather_train.info()


So let us downcast the data types to reduce memory usage, using the function defined below.

In [None]:
## Function to reduce the DF size
def reduce_memory_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df


In [None]:
reduce_memory_usage(building_meta)
reduce_memory_usage(weather_train)
reduce_memory_usage(ASHRAE_train)

reduce_memory_usage(weather_test)
reduce_memory_usage(ASHRAE_test)


Data attributes summary

A quick view of the attribute statistics is presented in the table below. For each numeric attribute in the dataset, the table shows the count, mean, standard deviation, min, max, and the quartile values.

Based on the counts, year_built, floor_count, cloud_coverage, precip_depth_1_hr, sea_level_pressure and wind_direction are all missing a significant amount of data.


In [None]:
ASHRAE_train.describe()


Preview of ASHRAE_train data

Summary statistics of ASHRAE_train and building_meta are presented below. For each numeric attribute, the table shows the count, mean, standard deviation, min, max, and the quartile values.

In [None]:
print('Size of the building dataset is', building_meta.shape)
print('Size of the weather_train dataset is', weather_train.shape)
print('Size of the train dataset is', ASHRAE_train.shape)

In [None]:
ASHRAE_train.describe()

In [None]:
building_meta.describe()

Checking unique elements in primary use column within building_meta table.

In [None]:
primary_use_numbersOfUniqueValue = building_meta['primary_use'].nunique()
 
print('Number of unique values in column "primary_use" of the building_meta : ')
print(primary_use_numbersOfUniqueValue)
primary_use_element = building_meta['primary_use'].unique()
 
print('Unique element in column "primary_use" of the building_meta : ')
print(primary_use_element)

In [None]:
print('Columns of the building dataset is', building_meta.columns)
print('Columns of the weather_train dataset is', weather_train.columns)
print('Columns of the train dataset is', ASHRAE_train.columns)

The train dataset connects to building_meta through building_id, while building_meta can be merged with weather_train on site_id.

So building_id is the primary key of the building_meta table, and site_id is its foreign key to the weather tables.

In the train dataset, building_id is a foreign key to building_meta, while a weather_train row is identified by site_id together with timestamp.
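
The key relationships described above can be checked on a toy example. The three small frames below are made up for illustration (the values are not taken from the ASHRAE files); `validate="many_to_one"` makes pandas raise an error if the right-hand key is not unique, which is one way to confirm the assumed primary keys.

```python
import pandas as pd

# Hypothetical miniature versions of the three tables
buildings = pd.DataFrame({"building_id": [0, 1], "site_id": [0, 1], "square_feet": [7432, 2720]})
readings = pd.DataFrame({
    "building_id": [0, 0, 1],
    "timestamp": ["2016-01-01 00:00:00", "2016-01-01 01:00:00", "2016-01-01 00:00:00"],
    "meter_reading": [10.0, 12.0, 5.0],
})
weather = pd.DataFrame({
    "site_id": [0, 0, 1],
    "timestamp": ["2016-01-01 00:00:00", "2016-01-01 01:00:00", "2016-01-01 00:00:00"],
    "air_temperature": [25.0, 24.4, 3.8],
})

merged = (
    readings
    # many readings per building, one metadata row per building
    .merge(buildings, on="building_id", how="left", validate="many_to_one")
    # many readings per (site, hour), one weather row per (site, hour)
    .merge(weather, on=["site_id", "timestamp"], how="left", validate="many_to_one")
)
print(merged)
```

Starting from the readings table keeps exactly one output row per meter reading, so the row count before and after the merges should match.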

Here, a heatmap was used to check for missing values in building_meta. We found that year_built and floor_count are missing a lot of data and thus may not be useful for this analysis.

In [None]:
fig, ax = plt.subplots(figsize=(15,7))
sns.heatmap(building_meta.isnull(), yticklabels=False,cmap='viridis')

In [None]:

print("Percentage of missing values in the building_meta dataset")
building_meta.isna().sum()/len(building_meta)*100

In [None]:

print("Percentage of missing values in the train dataset")
ASHRAE_train.isna().sum()/len(ASHRAE_train)*100

In [None]:

print("Percentage of missing values in the weather_train dataset")
weather_train.isna().sum()/len(weather_train)*100

## 2.0 Data Preparation

### 2.1 Merging tables for analysis

In [None]:
# Attach the building metadata to every meter reading via building_id
BuildingTrainMerge = building_meta.merge(ASHRAE_train, on='building_id', how='left')
BuildingTrainMerge.shape

In [None]:
# Attach the hourly weather of each site via (site_id, timestamp)
BTW_train = BuildingTrainMerge.merge(weather_train, on=['site_id', 'timestamp'], how='left')
BTW_train.shape

In [None]:
BTW_train.columns

In [None]:
print("Percentage of missing values in the BTW_train dataset")
BTW_train.isna().sum()/len(BTW_train)*100

In [None]:
BTW_train.hist('sea_level_pressure')
BTW_train[['sea_level_pressure']].describe()

In [None]:
BTW_train.hist('cloud_coverage')
BTW_train[['cloud_coverage']].describe()

In [None]:
BTW_train.hist('precip_depth_1_hr')
BTW_train[['precip_depth_1_hr']].describe()

In [None]:
BTW_train.hist('wind_speed')
BTW_train[['wind_speed']].describe()

In [None]:
BTW_train.hist(column='air_temperature')
BTW_train[['air_temperature']].describe()

We wanted to take a closer look at sea_level_pressure, cloud_coverage, air_temperature, wind_speed and precip_depth_1_hr to get a better idea of how the values are spread out. precip_depth_1_hr is heavily skewed, while cloud_coverage and sea_level_pressure have a relatively small range between their max and min. Air temperature and wind speed are fairly normally distributed. For now, the decision was made to keep the data.

In [None]:
sns.boxplot(x = 'meter', y = 'meter_reading', data = BTW_train)

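Defining outliers: a value counts as an outlier if it lies more than 3 × IQR below the first quartile or above the third quartile.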

In [None]:
def outlier_function(df, col_name):
    '''Detect outliers in one column using a conservative 3 * IQR rule.

    Computes the first and third quartiles and the interquartile range (IQR),
    then flags values more than 3 * IQR below Q1 or above Q3.
    Returns the lower limit, the upper limit and the number of outliers.
    '''
    values = df[col_name].to_numpy()
    first_quartile = np.percentile(values, 25)
    third_quartile = np.percentile(values, 75)
    IQR = third_quartile - first_quartile

    upper_limit = third_quartile + (3 * IQR)
    lower_limit = first_quartile - (3 * IQR)

    # Vectorized count instead of a Python loop over millions of values
    outlier_count = int(((values < lower_limit) | (values > upper_limit)).sum())
    return lower_limit, upper_limit, outlier_count

In [None]:
print("{} percent of {} are outliers."
      .format((
              (100 * outlier_function(BTW_train, 'meter_reading')[2])
               / len(BTW_train['meter_reading'])),
              'meter_reading'))
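
To make the 3 × IQR rule concrete, here is a small worked example on ten made-up values (an assumption for illustration, not ASHRAE readings). It recomputes the limits inline rather than calling the notebook's function, so it runs on its own.

```python
import numpy as np

# Nine ordinary values and one extreme value
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 1000.0])

q1, q3 = np.percentile(values, [25, 75])   # 3.25 and 7.75 with linear interpolation
iqr = q3 - q1                              # 4.5
lower = q1 - 3 * iqr                       # -10.25
upper = q3 + 3 * iqr                       # 21.25

n_outliers = int(((values < lower) | (values > upper)).sum())
print(lower, upper, n_outliers)
```

Only the value 1000.0 falls outside the limits. The factor 3 is stricter than the more common 1.5, so fewer points are flagged.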

In [None]:
# Distribution of log(meter_reading + 1) for each meter type
plt.figure(figsize=(12,10))

#list of different meters
meters = sorted(BTW_train['meter'].unique().tolist())

# plot meter_reading distribution for each meter
for meter_type in meters:
    subset = BTW_train[BTW_train['meter'] == meter_type]
    sns.kdeplot(np.log1p(subset["meter_reading"]), 
                label=meter_type, linewidth=2)

# set title, legends and labels
plt.ylabel("Density")
plt.xlabel("log(meter_reading + 1)")
plt.legend(['electricity', 'chilled water', 'steam', 'hot water'])
plt.title("Density of Logarithm(Meter Reading + 1) Among Different Meters", size=14)

In [None]:
BTW_train.columns

In [None]:
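# Correlations are only defined for numeric columns, so select those first
corrmat = BTW_train.select_dtypes(include='number').corr()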
fig,ax=plt.subplots(figsize=(12,10))
sns.heatmap(corrmat,annot=True,annot_kws={'size': 12})


### 2.2 Dropping columns and filling null value





In [None]:
BTW_train = BTW_train.drop(columns=['year_built', 'floor_count', 'wind_direction', 'dew_temperature'])
# Parse the timestamp and derive calendar features
BTW_train['timestamp'] = pd.to_datetime(BTW_train['timestamp'])
BTW_train['Month'] = BTW_train['timestamp'].dt.month
BTW_train['Day'] = BTW_train['timestamp'].dt.day


In [None]:
# Aggregate hourly rows to one row per meter, building and calendar day
BTW_train = BTW_train.groupby(['meter', 'building_id', 'primary_use', 'Month', 'Day']).agg({
    'meter_reading': 'sum',
    'air_temperature': 'mean',
    'wind_speed': 'mean',
    'precip_depth_1_hr': 'mean',
    'cloud_coverage': 'mean',
    'square_feet': 'mean',
})

In [None]:
BTW_train.columns

In [None]:
BTW_train = BTW_train.reset_index()

In [None]:
BTW_train.describe()

Cast the weather columns to float32 before filling the NA values, so that the filled columns have a consistent data type for modeling.

In [None]:
BTW_train['wind_speed'] = BTW_train['wind_speed'].astype('float32')
BTW_train['air_temperature'] = BTW_train['air_temperature'].astype('float32')
BTW_train['precip_depth_1_hr'] = BTW_train['precip_depth_1_hr'].astype('float32')
BTW_train['cloud_coverage'] = BTW_train['cloud_coverage'].astype('float32')

In [None]:
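# Forward/backward fill the sparser columns (plain assignment instead of fillna(method=..., inplace=True))
BTW_train['precip_depth_1_hr'] = BTW_train['precip_depth_1_hr'].ffill()
BTW_train['cloud_coverage'] = BTW_train['cloud_coverage'].bfill()

# Replace the remaining gaps in wind speed and air temperature with the column mean
BTW_train['wind_speed'] = BTW_train['wind_speed'].fillna(BTW_train['wind_speed'].mean())
BTW_train['air_temperature'] = BTW_train['air_temperature'].fillna(BTW_train['air_temperature'].mean())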
BTW_train.isnull().sum()
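
The three fill strategies used above behave differently at the edges of a column, which a tiny made-up series shows. This is only an illustration; the series `s` is hypothetical and not part of the dataset.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])

forward = s.ffill()               # copies the previous value; a leading gap stays NaN
backward = s.bfill()              # copies the next value; a trailing gap stays NaN
mean_filled = s.fillna(s.mean())  # every gap becomes the column mean (2.0 here)

print(pd.DataFrame({"original": s, "ffill": forward, "bfill": backward, "mean": mean_filled}))
```

Because forward and backward fill can leave a gap at one end, checking `isnull().sum()` afterwards, as the cell above does, is a sensible safeguard. In a table sorted by meter and building, these fills can also carry values across building boundaries, which may or may not be acceptable.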

In [None]:
BTW_train.shape

In [None]:
BTW_train.dtypes

In [None]:
BTW_train.columns

## 3.0 Data Modeling

845701 records are available for modeling.

### 3.1 Linear Regression


Here the column 'primary_use' is one-hot encoded with the get_dummies function.
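
What get_dummies produces can be seen on a four-row toy frame (hypothetical values). The second call uses `drop_first=True`, an option this notebook does not use; it is shown only as an assumption-labelled alternative that drops one reference category so the dummy columns are not perfectly collinear with the intercept.

```python
import pandas as pd

toy = pd.DataFrame({
    "primary_use": ["Education", "Office", "Retail", "Office"],
    "square_feet": [100, 200, 300, 400],
})

# One indicator column per category, as done for BTW_train
encoded = pd.get_dummies(toy, columns=["primary_use"])

# Alternative (not used in this notebook): drop the first category as the reference level
encoded_ref = pd.get_dummies(toy, columns=["primary_use"], drop_first=True)

print(list(encoded.columns))
print(list(encoded_ref.columns))
```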

In [None]:

BTW_linearR = pd.get_dummies(BTW_train, columns=['primary_use'])

In [None]:
BTW_linearR.columns

In [None]:
X =BTW_linearR[['building_id', 'meter', 'air_temperature', 'wind_speed', 'precip_depth_1_hr', 'cloud_coverage',
       'square_feet', 'primary_use_Education', 'primary_use_Entertainment/public assembly',
       'primary_use_Food sales and service', 'primary_use_Healthcare',
       'primary_use_Lodging/residential',
       'primary_use_Manufacturing/industrial', 'primary_use_Office',
       'primary_use_Other', 'primary_use_Parking',
       'primary_use_Public services', 'primary_use_Religious worship',
       'primary_use_Retail', 'primary_use_Services',
       'primary_use_Technology/science', 'primary_use_Utility',
       'primary_use_Warehouse/storage', 'Month', 'Day']]

# Create target variable
y = BTW_linearR['meter_reading']

# Train, test, split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = .20, random_state= 0)

In [None]:
# Fit
# Import model
from sklearn.linear_model import LinearRegression

# Create linear regression object
regressor = LinearRegression()

# Fit model to training data
regressor.fit(X_train,y_train)

In [None]:
# Predicting test set results
y_pred = regressor.predict(X_test)

In [None]:
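# For regressors, score() returns the coefficient of determination R^2
print('Accuracy (R^2 on the test set): {:.4f}'.format(regressor.score(X_test, y_test)))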

In [None]:
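# Calculate R squared (reported here via the explained variance score,
# which equals R^2 only when the residuals average to zero)
print('R^2 =',metrics.explained_variance_score(y_test,y_pred))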
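
The explained variance score and R^2 are closely related but not identical. The sketch below uses four made-up targets and predictions that are all off by a constant, a deliberately extreme case chosen to make the difference visible.

```python
import numpy as np
from sklearn import metrics

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_biased = y_true + 1.0  # every prediction is too high by exactly 1

# Explained variance ignores a constant offset in the errors
ev = metrics.explained_variance_score(y_true, y_biased)

# R^2 penalizes the offset: 1 - SS_res / SS_tot = 1 - 4 / 5
r2 = metrics.r2_score(y_true, y_biased)

print(ev, r2)
```

For a least-squares fit with an intercept the two scores coincide on the training data, and on a held-out set they are usually very close, so the values reported in this notebook are unlikely to be affected much.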

In [None]:
cdf = pd.DataFrame(data = regressor.coef_, index = X.columns, columns = ['Coefficients'])
cdf

In [None]:
cdf.Coefficients.nlargest(10).plot(kind='barh')

In [None]:
import statsmodels.api as sm
from scipy import stats
X =BTW_linearR[['building_id', 'meter', 'air_temperature', 'wind_speed', 'precip_depth_1_hr', 'cloud_coverage',
       'square_feet', 'primary_use_Education', 'primary_use_Entertainment/public assembly',
       'primary_use_Food sales and service', 'primary_use_Healthcare',
       'primary_use_Lodging/residential',
       'primary_use_Manufacturing/industrial', 'primary_use_Office',
       'primary_use_Other', 'primary_use_Parking',
       'primary_use_Public services', 'primary_use_Religious worship',
       'primary_use_Retail', 'primary_use_Services',
       'primary_use_Technology/science', 'primary_use_Utility',
       'primary_use_Warehouse/storage', 'Month', 'Day']]

# Create target variable
y = BTW_linearR['meter_reading']
 
 
 
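# Cast the dummy columns to float so that OLS receives a purely numeric design matrix
X2 = sm.add_constant(X.astype(float))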
est = sm.OLS(y, X2)
est2 = est.fit()
print(est2.summary())

The p-values of these variables suggest that they are acceptable to use in the model, as the null hypothesis of a zero coefficient is likely rejected for them.
However, looking at the split at the first node of the decision tree (Section 3.2), 95% of the data fell on one side of the square feet criterion. The R2 value was then representative of only 5% of the data, skewed towards the higher square feet values. The weight given to “meter” is understandable, as the box plot shown earlier indicated that the meter type influences the meter readings, specifically for higher values.

Because of the high skew towards larger square feet values, we don’t believe that the decision tree model is an accurate model to use, despite its higher R2 value.


The linear model suggests that primary use has a great impact on the meter reading. However, it comes with a low accuracy score.


### 3.1 Linear Modeling with only a few important features

In [None]:
K =BTW_linearR[['meter','wind_speed', 'cloud_coverage',
                'primary_use_Education','primary_use_Entertainment/public assembly', 'primary_use_Healthcare',
       'primary_use_Manufacturing/industrial', 'primary_use_Office',
       'primary_use_Other', 'primary_use_Parking','primary_use_Religious worship',
       'primary_use_Retail','primary_use_Technology/science', 'primary_use_Utility', 'Month']]

# Create target variable
y = BTW_linearR['meter_reading']

In [None]:
# Train, test, split (split first so the model is not fitted on rows it is later tested on)
from sklearn.model_selection import train_test_split
K_train, K_test, y_train, y_test = train_test_split(K,y, test_size = .20, random_state= 0)

In [None]:
lm = LinearRegression()

# Fit model to training data
lm.fit(K_train, y_train)

In [None]:
print('Accuracy (R^2 on the test set): {:.4f}'.format(lm.score(K_test, y_test)))

In [None]:
y_pred = lm.predict(K_test)

In [None]:
print('R^2 =',metrics.explained_variance_score(y_test,y_pred))

In [None]:
lm.score(K,y)

In [None]:
regressor.score(X_train,y_train)

In [None]:
cdf1 = pd.DataFrame(data = lm.coef_, index = K.columns, columns = ['Coefficients'])

In [None]:
cdf1.Coefficients.nlargest(10).plot(kind='barh')

Restricting the linear model to these few features did not make any big difference to the score.

### Model 3.2 Decision Tree

In [None]:
XD =BTW_linearR[['building_id', 'meter', 'air_temperature', 'wind_speed', 'precip_depth_1_hr', 'cloud_coverage',
       'square_feet', 'primary_use_Education', 'primary_use_Entertainment/public assembly',
       'primary_use_Food sales and service', 'primary_use_Healthcare',
       'primary_use_Lodging/residential',
       'primary_use_Manufacturing/industrial', 'primary_use_Office',
       'primary_use_Other', 'primary_use_Parking',
       'primary_use_Public services', 'primary_use_Religious worship',
       'primary_use_Retail', 'primary_use_Services',
       'primary_use_Technology/science', 'primary_use_Utility',
       'primary_use_Warehouse/storage', 'Month', 'Day']]

# Create target variable
YD = BTW_linearR['meter_reading']

# Train, test, split
from sklearn.model_selection import train_test_split
XD_train,XD_test, YD_train, YD_test = train_test_split(XD,YD, test_size = .20, random_state= 0)

In [None]:
from sklearn.tree import DecisionTreeRegressor
regr_depth2 = DecisionTreeRegressor(max_depth=2)
regr_depth5 = DecisionTreeRegressor(max_depth=5)
regr_depth2.fit(XD_train, YD_train)
regr_depth5.fit(XD_train, YD_train)

In [None]:
y_1 = regr_depth2.predict(XD_test)
y_2 = regr_depth5.predict(XD_test)

In [None]:
# Compare against YD_test, the target split that belongs to XD_test
df=pd.DataFrame({'Actual':YD_test, 'Predicted':y_1})
df.head()

In [None]:
print('Mean Absolute Error:', metrics.mean_absolute_error(YD_test, y_1))
print('Mean Squared Error:', metrics.mean_squared_error(YD_test, y_1))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(YD_test, y_1)))

In [None]:
# Calculate R squared (reported via the explained variance score)
print('R^2 =',metrics.explained_variance_score(YD_test,y_1))

For the depth 2 decision tree, R2 was obtained at 0.147

In [None]:
df=pd.DataFrame({'Actual':YD_test, 'Predicted':y_2})
df.head()

In [None]:
print('Mean Absolute Error:', metrics.mean_absolute_error(YD_test, y_2))
print('Mean Squared Error:', metrics.mean_squared_error(YD_test, y_2))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(YD_test, y_2)))

In [None]:
print('R^2 =',metrics.explained_variance_score(YD_test,y_2))

For the depth 5 decision tree, R2 was obtained at 0.723
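
The effect of tree depth on the score can be reproduced on synthetic data. The sketch below is an illustration under stated assumptions: a single made-up feature with a noisy sine-shaped target, not the ASHRAE features, and hypothetical names such as `X_syn` and `depth_scores`. It reports both training and test R^2 so that the two can be compared.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear target (assumption for illustration only)
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 10, size=(2000, 1))
y_syn = np.sin(X_syn[:, 0]) + rng.normal(scale=0.1, size=2000)

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

depth_scores = {}
for depth in (2, 5):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    # (training R^2, test R^2) for this depth
    depth_scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))

for depth, (train_r2, test_r2) in depth_scores.items():
    print(f"max_depth={depth}: train R^2 = {train_r2:.3f}, test R^2 = {test_r2:.3f}")
```

A depth 2 tree has at most 4 leaves and a depth 5 tree at most 32, so the deeper tree can follow the curve much more closely. Comparing training and test scores at each depth is a simple way to notice when extra depth starts to overfit.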

In [None]:
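# Compare predictions with actual readings; points on the diagonal are perfect predictions
plt.scatter(YD_test, y_1, color="blue", label="max_depth=2", s=5, alpha=0.5)
plt.scatter(YD_test, y_2, color="green", label="max_depth=5", s=5, alpha=0.5)
plt.xlabel("actual meter_reading")
plt.ylabel("predicted meter_reading")
plt.title("Decision Tree Regression")
plt.legend()
plt.show()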

In [None]:
print('Accuracy (R^2 on the training set): {:.4f}'.format(regr_depth2.score(XD_train, YD_train)))

In [None]:
print('Accuracy (R^2 on the training set): {:.4f}'.format(regr_depth5.score(XD_train, YD_train)))

In [None]:
yd_pred = regr_depth5.predict(XD_test)

In [None]:
yd_pred

In [None]:
print('XD 19',XD.columns[19], 'XD 6',XD.columns[6],'X0',XD.columns[0],'X1',XD.columns[1],'X23',XD.columns[23],'X19',XD.columns[19],'X2',XD.columns[2])

In the decision tree, the branch with square feet under 333282.5, meter type electricity, month from January to June, and air temperature under 4.488 leads to 81 sample points with a meter reading of 276450653 and 63 sample points with a meter reading of 68649752.

In [None]:
YD.describe()

In [None]:
XD.columns

In [None]:
feat_importancesDT = pd.Series(regr_depth5.feature_importances_, index=XD.columns)
feat_importancesDT.nlargest(10).plot(kind='barh')

In general, decision tree modeling increased the accuracy score from 0.0012-0.0014 for linear modeling to 0.147 or 0.723, depending on the depth of the tree. The decision tree also suggests that month, air temperature, meter type, and square feet play an important role in the model.

### Model 3.3 Random Forest model

In [None]:
import sklearn.ensemble as ske
import matplotlib.pyplot as plt
RFR = ske.RandomForestRegressor()


In [None]:
RFR.fit(XD,YD)

In [None]:
RFR.score(XD,YD)

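Random forest gave the highest score for this modeling, with an accuracy score of 0.97. Note that RFR was fitted and scored on the same data (XD, YD), so this is a training score rather than a held-out test score.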
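
The gap between a training score and a held-out score for a random forest can be illustrated on synthetic data. This is a sketch under stated assumptions (made-up features, a noisy linear target, 50 trees, hypothetical variable names), not a re-run of the model above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data: only the first feature carries signal, the rest is noise
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(600, 5))
y_syn = 2.0 * X_syn[:, 0] + rng.normal(scale=1.0, size=600)

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

train_r2 = forest.score(X_tr, y_tr)  # scored on the rows the forest was fitted on
test_r2 = forest.score(X_te, y_te)   # scored on rows the forest has never seen

print(f"train R^2 = {train_r2:.3f}, test R^2 = {test_r2:.3f}")
```

Fully grown trees can nearly memorize their training rows, so the training score tends to be optimistic. Fitting on `XD_train` and scoring on `XD_test` would give the comparable held-out figure for this notebook's model.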

In [None]:
YR_pred = RFR.predict(XD_test)

In [None]:
feat_importancesRFR = pd.Series(RFR.feature_importances_, index=XD.columns)
feat_importancesRFR.nlargest(10).plot(kind='barh')

In [None]:
#pip install pydot

Random forest visualization

## 4.0 Conclusion
The ASHRAE energy dataset was explored using different regression methods. For each method, a model was developed to predict the meter readings, representing energy consumption, using characteristics of the building, its use, its location, and the weather.
To begin with, a linear regression model was applied and achieved a low accuracy score and R2 score, both below 0.01. This suggested that a linear regression model was an extremely poor model for predicting energy usage.

Then a decision tree was applied at different depths. At a depth of 2, the R2 value increased to almost 15%, but at a depth of 5, the R2 jumped to over 82%. At the same time, the accuracy score changed from 0.147 to 0.723 as the tree depth increased from 2 to 5. This suggested a strong fit for our “Decision Tree Regression” model. However, a closer look at the data showed that the results were skewed towards high square feet values, which represented only a small percentage of the overall data.

On the basis of the decision tree results, a random forest regression was applied, which gave an accuracy score of 96%, higher than the previous models. This can be explained by the fact that a random forest combines several decision trees: the random forest model had more trees and more depth available for splitting.

Ultimately, a random forest regression model was implemented that predicted energy use with a high accuracy score. The more important variables accounted for seasonality (month), site location (air_temperature), and characteristics of the building itself (square feet and meter).

Our conclusion is that a random forest model is a strong model for predicting energy readings, and that more exploration needs to be done on the variables that this model considered more important.

If anyone is interested in a visualization of the decision tree, let us know; the code did not work here because a separate package needs to be installed.



## 5.0 Future Work

1.Developing a way to deal with the memory issue and with the visualization of the random forest model. Batch processing may be a good approach here.

2.Applying confusion matrices to further analyze the relationship between accuracy score and precision.

3.Investigating different ways of replacing null values and their consequences.
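
As a starting point for the batch processing idea in item 1, the sketch below aggregates a CSV in chunks so that only one chunk is held in memory at a time. It is a minimal illustration under stated assumptions: the five-row CSV text is made up, `chunksize=2` is chosen only to force several chunks, and the names `csv_text`, `parts` and `daily` are hypothetical.

```python
import io
import pandas as pd

# Made-up miniature of train.csv
csv_text = """building_id,meter,timestamp,meter_reading
0,0,2016-01-01 00:00:00,1.0
0,0,2016-01-01 01:00:00,2.0
0,0,2016-01-02 00:00:00,4.0
1,0,2016-01-01 00:00:00,8.0
1,0,2016-01-01 01:00:00,16.0
"""

parts = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # The first 10 characters of the timestamp are the calendar day
    chunk["day"] = chunk["timestamp"].str.slice(0, 10)
    # Partial daily sums for the rows in this chunk only
    parts.append(chunk.groupby(["building_id", "meter", "day"])["meter_reading"].sum())

# A day can be split across chunks, so the partial sums are summed once more
daily = pd.concat(parts).groupby(level=["building_id", "meter", "day"]).sum()
print(daily)
```

This works because a sum of partial sums equals the overall sum; statistics such as the mean would need the partial sums and counts to be combined instead.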