# Energy consumption prediction models for efficient, environment-friendly metered building energy usage

### Objective:
To build a model that predicts how much energy a building would consume without retrofits, so that the performance of existing retrofits can be assessed in support of efficient, environment-friendly metered building energy usage.

### Motivation:

This proposition is not my own; it was devised by the American Society of Heating, Refrigerating and Air-Conditioning Engineers (ASHRAE). I was captivated by the idea after coming across the competition on Kaggle's website.

The competition is currently live and ends on December 19, 2019. Its timeline fits the curriculum of the Data Incubator program, which helps me stay motivated and focused on the topic.

### Introduction:

Metered building energy usage helps us consume fuel and water wisely while inspiring us to improve building efficiencies to reduce costs and emissions by installing and/or improving retrofit components of the buildings.

Under the pay-for-performance financing, the building owner makes payments based on the difference between their real energy consumption and what they would have used without any retrofits. The latter values have to come from a model, which forms the objective of this project. 
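
As a small illustration of the pay-for-performance idea, the sketch below computes retrofit savings as the gap between a counterfactual estimate and the metered usage. All numbers and variable names are hypothetical and are not taken from the dataset; in practice the counterfactual values would come from the model this project aims to build.

```python
import numpy as np

# Hypothetical hourly values in kWh; none of these numbers come from the dataset.
actual_kwh = np.array([42.0, 40.5, 39.0, 45.0])          # metered usage after the retrofit
counterfactual_kwh = np.array([50.0, 49.0, 47.5, 52.0])  # model estimate without the retrofit

# Savings attributed to the retrofit: counterfactual minus actual, hour by hour.
hourly_savings = counterfactual_kwh - actual_kwh
total_savings = float(hourly_savings.sum())
print(hourly_savings, total_savings)
```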

Current methods of estimation are fragmented and do not scale well. Some assume a specific meter type, or don't work with different building types. Building counterfactual models to estimate the usage seems the only way to assess the energy consumption.

Through this project, I intend to work toward building accurate models of metered building energy usage in the following areas: chilled water, electric, hot water, and steam meters.

The dataset is publicly available to registered Kaggle users and is hosted by ASHRAE on kaggle.com. The data consists of hourly meter readings taken over a three-year period from more than 1000 buildings located at different sites worldwide.

### Files:
> #### train.csv
- contains building_id, meter, timestamp, meter_reading
- building_id is the unique id referenced in other data files to enable merging of data
- the meter has 4 codes, namely, {0: electricity, 1: chilledwater, 2: steam, 3: hotwater}. Not every building has all meter types.
- timestamp is the time the measurement was taken
- meter_reading is the target variable, i.e. the energy consumption in kWh (or equivalent units).

> #### building_metadata.csv
- contains site_id, building_id, primary_use, square_feet, year_built, floor_count
- site_id is the unique id for the area where the building is located, and enables merging of weather data
- building_id is the unique id of the building, enables merging of train and test data
- primary_use is the indicator of the primary category of activities of the building based on the EnergyStar property type definitions
- square_feet is the gross floor area of the building
- year_built is the year the building was opened
- floor_count is the number of floors of the building

> #### weather_train/test.csv
- contains site_id, timestamp, air_temperature, cloud_coverage, dew_temperature, precip_depth_1_hr, sea_level_pressure, wind_direction, wind_speed
- site_id connects with the building metadata file
- air_temperature and dew_temperature in degrees Celsius
- precip_depth_1_hr in millimeters
- sea_level_pressure in millibars (hectopascals)
- cloud_coverage is the portion of the sky covered in clouds
- wind_direction is the compass direction i.e. in 0 to 360 degrees
- wind_speed in meters/second

> #### test.csv
The submission files use row numbers for ID codes in order to save space on the file uploads. test.csv has no feature data; it exists so you can get your predictions into the correct order. Here the row_id refers to the row id for your submission file
- building_id - Building id code
- meter - The meter id code
- timestamp - Timestamps for the test data period

> #### sample_submission.csv
A valid sample submission.

All floats in the solution file were truncated to four decimal places; we recommend you do the same to save space on your file upload.
There are gaps in some of the meter readings for both the train and test sets. Gaps in the test set are not revealed or scored. 
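
The keys described above suggest how the files fit together: `building_id` links meter readings to building metadata, and `site_id` plus `timestamp` link to the weather data. The following is a minimal sketch on tiny hypothetical frames that only mimic the column layout; the values and the frame names `train`, `building`, and `weather` are made up for illustration.

```python
import pandas as pd

# Tiny hypothetical frames that mimic the column layout described above.
stamps = pd.to_datetime(["2016-01-01 00:00", "2016-01-01 01:00", "2016-01-01 00:00"])
train = pd.DataFrame({
    "building_id": [0, 0, 1],
    "meter": [0, 0, 1],
    "timestamp": stamps,
    "meter_reading": [10.0, 12.5, 3.2],
})
building = pd.DataFrame({
    "site_id": [0, 1],
    "building_id": [0, 1],
    "primary_use": ["Education", "Office"],
    "square_feet": [7432, 2720],
})
weather = pd.DataFrame({
    "site_id": [0, 0, 1],
    "timestamp": stamps,
    "air_temperature": [25.0, 24.4, 3.8],
})

# Left joins keep every meter reading even when metadata or weather is missing.
merged = (
    train
    .merge(building, on="building_id", how="left")
    .merge(weather, on=["site_id", "timestamp"], how="left")
)
print(merged)
```

Left joins are used here so that gaps in the weather data show up as missing values rather than silently dropping meter readings.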


### Evaluation Metric:

ASHRAE evaluates the model using the `Root Mean Squared Logarithmic Error (RMSLE)` metric.

The RMSLE is calculated as:

$ \epsilon = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \left( \log(p_i + 1) - \log(a_i + 1) \right)^2 } $

Where:

- $\epsilon$ is the RMSLE value (score),
- $n$ is the total number of observations in the (public/private) data set,
- $p_i$ is your prediction of the target for observation $i$,
- $a_i$ is the actual target for observation $i$, and
- $\log(x)$ is the natural logarithm of $x$.
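
The metric can be computed in a few lines of NumPy. This is a minimal sketch of the formula above, not the competition's official scoring code; the function name `rmsle` is my own choice, and `np.log1p(x)` is used as a numerically stable way to evaluate log(x + 1).

```python
import numpy as np

def rmsle(predicted, actual):
    """Root Mean Squared Logarithmic Error for non-negative inputs."""
    predicted = np.asarray(predicted, dtype=np.float64)
    actual = np.asarray(actual, dtype=np.float64)
    # log1p(x) computes log(x + 1), matching the terms inside the sum.
    squared_log_diff = (np.log1p(predicted) - np.log1p(actual)) ** 2
    return float(np.sqrt(np.mean(squared_log_diff)))

# Hypothetical predictions and targets, for illustration only.
print(rmsle([10.0, 0.0, 250.0], [12.0, 0.0, 240.0]))
```

Because the error is measured on log(x + 1), a model trained on `np.log1p(meter_reading)` with an ordinary RMSE objective optimizes the same quantity.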

### Preprocessing / Exploratory Data Analysis:

Since the data is already cleaned and in a format that Python can process directly, it needs little preprocessing. However, the default storage is not memory-efficient, so I looked at ways to reduce the data size by assigning each feature column a more compact data type, taking care that the narrower types do not affect the later analysis.
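
To show why dtype choice matters, the sketch below builds a small hypothetical frame with pandas' default 64-bit types and compares its memory footprint before and after downcasting. It uses `pd.to_numeric(..., downcast="integer")` and an explicit cast to `float32`, which is one possible approach and not necessarily the one applied later in this notebook; the column values are synthetic.

```python
import numpy as np
import pandas as pd

# Hypothetical frame stored with 64-bit integers and floats.
df = pd.DataFrame({
    "building_id": np.arange(1000, dtype=np.int64),
    "meter": np.tile(np.array([0, 1, 2, 3], dtype=np.int64), 250),
    "air_temperature": np.linspace(-10.0, 35.0, 1000),
})
before = df.memory_usage(deep=True).sum()

small = df.copy()
# Integer downcasting is lossless: pandas picks the narrowest type that fits.
small["building_id"] = pd.to_numeric(small["building_id"], downcast="integer")
small["meter"] = pd.to_numeric(small["meter"], downcast="integer")
# float32 keeps roughly 7 significant digits, which may or may not be enough.
small["air_temperature"] = small["air_temperature"].astype(np.float32)
after = small.memory_usage(deep=True).sum()

print(small.dtypes)
print(f"{before} bytes -> {after} bytes")
```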

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import gc, math

from sklearn.metrics import mean_squared_error
# pip install lightgbm
import lightgbm as lgb
from sklearn.model_selection import KFold, StratifiedKFold, GroupKFold
import tqdm
from pandas.tseries.holiday import USFederalHolidayCalendar as us_cal
from sklearn.preprocessing import LabelEncoder

In [None]:
sns.set(rc={'figure.figsize':(11,8)})
sns.set(style="whitegrid")

In [None]:
%%time
metadata_ = pd.read_csv('building_metadata.csv')
train_df = pd.read_csv('train.csv', parse_dates=['timestamp'])
test_df = pd.read_csv('test.csv', parse_dates=['timestamp'])
weather_train_df = pd.read_csv('weather_train.csv', parse_dates=['timestamp'])
weather_test_df = pd.read_csv('weather_test.csv', parse_dates=['timestamp'])

In [None]:
weather_train_df.shape, weather_test_df.shape

In [None]:
weather_train_df['site_id'].unique(), weather_test_df['site_id'].unique()

In [None]:
weather = pd.concat([weather_train_df,weather_test_df],ignore_index=True)

temp_skeleton = weather[['site_id', 'timestamp', 'air_temperature']].drop_duplicates(subset=['site_id', 'timestamp']).sort_values(by=['site_id', 'timestamp']).copy()

# calculate ranks of hourly temperatures within date/site_id chunks
temp_skeleton['temp_rank'] = temp_skeleton.groupby(['site_id', temp_skeleton.timestamp.dt.date])['air_temperature'].rank('average')

# create a dataframe of site_ids (0-16) x mean hour rank of temperature within day (0-23)
df_2d = temp_skeleton.groupby(['site_id', temp_skeleton.timestamp.dt.hour])['temp_rank'].mean().unstack(level=1)

# Subtract 14 from the column index (hour) of the temperature peak to get the timestamp alignment gap
# (this assumes the daily temperature peak occurs at 14:00 local time).
site_ids_offsets = pd.Series(df_2d.values.argmax(axis=1) - 14)
site_ids_offsets.index.name = 'site_id'

def timestamp_align(df):
    df['offset'] = df.site_id.map(site_ids_offsets)
    df['timestamp_aligned'] = (df.timestamp - pd.to_timedelta(df.offset, unit='h'))
    df['timestamp'] = df['timestamp_aligned']
    del df['timestamp_aligned']
    return df

In [None]:
weather_train_df = timestamp_align(weather_train_df)
weather_test_df = timestamp_align(weather_test_df)
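
The alignment logic can be checked on synthetic data. The sketch below is a toy reconstruction under stated assumptions: a single hypothetical site whose temperature follows a clean daily cosine peaking at 19:00 in the recorded timestamps, and the same assumption as the code above that the true local peak is at 14:00. The names `toy`, `mean_rank`, and `offsets` are illustrative only.

```python
import numpy as np
import pandas as pd

# One week of hourly data for a hypothetical site whose recorded peak is at 19:00.
hours = np.arange(24 * 7)
timestamps = pd.Timestamp("2016-01-01") + pd.to_timedelta(hours, unit="h")
toy = pd.DataFrame({
    "site_id": 0,
    "timestamp": timestamps,
    "air_temperature": 20 + 5 * np.cos(2 * np.pi * (hours % 24 - 19) / 24),
})

# Rank temperatures within each (site, day), then average the rank per hour of day.
toy["temp_rank"] = (
    toy.groupby(["site_id", toy.timestamp.dt.date])["air_temperature"].rank(method="average")
)
mean_rank = toy.groupby(["site_id", toy.timestamp.dt.hour])["temp_rank"].mean().unstack(level=1)

# Offset = hour with the highest mean rank minus the assumed local peak hour (14).
offsets = pd.Series(mean_rank.values.argmax(axis=1) - 14, index=mean_rank.index)

# Shift the timestamps so that the peak lands at 14:00.
toy["timestamp_local"] = toy["timestamp"] - pd.to_timedelta(toy["site_id"].map(offsets), unit="h")
print(offsets)
```

Unlike the cell above, this sketch indexes the offsets by the actual `site_id` values instead of relying on site ids being consecutive integers starting at 0.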

In [None]:
del weather, df_2d, temp_skeleton, site_ids_offsets

In [None]:
weather_train_df.isna().sum()

In [None]:
weather_test_df.isna().sum()

In [None]:
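# Fill missing weather values by interpolating within each site.
# group_keys=False keeps the original row index instead of prepending site_id as an index level.
weather_train_df = weather_train_df.groupby('site_id', group_keys=False).apply(lambda group: group.interpolate(limit_direction='both'))
weather_test_df = weather_test_df.groupby('site_id', group_keys=False).apply(lambda group: group.interpolate(limit_direction='both'))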
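
The effect of per-site interpolation is easier to see on a toy frame. This sketch uses hypothetical values and a column-wise `transform`, a slightly different call from the `apply` above, chosen here only because it keeps the example short; the point is that gaps are filled from neighbours within the same site and never across sites.

```python
import numpy as np
import pandas as pd

# Two hypothetical sites, each with one missing temperature.
toy = pd.DataFrame({
    "site_id": [0, 0, 0, 1, 1, 1],
    "air_temperature": [np.nan, 10.0, 12.0, 30.0, np.nan, 34.0],
})

# Interpolate within each site; limit_direction='both' also fills leading/trailing gaps.
toy["air_temperature"] = (
    toy.groupby("site_id")["air_temperature"]
       .transform(lambda s: s.interpolate(limit_direction="both"))
)
print(toy)
```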

In [None]:
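## Function to reduce the memory usage
def reduce_mem_usage(df, verbose=True):
    """Downcast numeric columns to the narrowest dtype whose range holds their values.

    Note: float16 and float32 keep fewer significant digits than float64,
    so downcast float columns may lose precision.
    """
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2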
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose:
        print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df