# ASHRAE - Great Energy Predictor III Competition
This notebook contains Exploratory Data Analysis and starter code for the Great Energy Predictor III Competition.

**tl;dr**:
- We are trying to predict energy consumption for 1449 buildings. The value we are trying to predict is the `meter_reading`.
- Each building can have multiple meters, identified by the `meter` id code: `{0: electricity, 1: chilledwater, 2: steam, 3: hotwater}`. Not every building has all meter types.
- We are given:
    1. Historic meter reading data by timestamp for the building (`train.csv`)
    2. Building metadata including the building use, square footage, and year built (`building_metadata.csv`). This data does not change between the training and test sets.
    3. Weather data including precipitation, cloud coverage, air temperature (`air_temperature`), and more (`weather_train.csv` and `weather_test.csv`)
- We are also provided CSVs to be used for submission:
    1. `test.csv` which contains the meter, building id and timestamp we will be predicting for
    2. `sample_submission.csv` which shows the expected submission format for the future readings we need to predict
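
The three data sources listed above are typically combined into one table before modeling. The following is a minimal sketch on made-up rows, not the notebook's own pipeline. It assumes the meter readings join to the building metadata on `building_id`, and that the weather files carry `site_id` and `timestamp` columns to join on; all values and the `*_demo` names are hypothetical.

```python
import pandas as pd

# Tiny synthetic stand-ins for train.csv, building_metadata.csv and
# weather_train.csv. All values are invented for illustration.
timestamps = pd.to_datetime(
    ["2016-01-01 00:00", "2016-01-01 01:00", "2016-01-01 00:00"]
)
train_demo = pd.DataFrame({
    "building_id": [0, 0, 1],
    "meter": [0, 0, 1],
    "timestamp": timestamps,
    "meter_reading": [10.0, 12.0, 3.5],
})
bmd_demo = pd.DataFrame({
    "building_id": [0, 1],
    "site_id": [0, 1],
    "primary_use": ["Education", "Office"],
    "square_feet": [7432, 2720],
})
weather_demo = pd.DataFrame({
    "site_id": [0, 0, 1],  # assumed join key shared with the metadata
    "timestamp": timestamps,
    "air_temperature": [25.0, 24.4, 3.8],
})

# Left joins keep every meter reading even if metadata or weather is missing.
merged = (
    train_demo
    .merge(bmd_demo, on="building_id", how="left")
    .merge(weather_demo, on=["site_id", "timestamp"], how="left")
)
print(merged)
```

Left joins are used so that a missing weather hour shows up as `NaN` rather than silently dropping the meter reading.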

**In summary from the competition description:**
*In this competition, you’ll develop accurate predictions of metered building energy usage in the following areas: chilled water, electric, natural gas, hot water, and steam meters. The data comes from over 1,000 buildings over a three-year timeframe.*

## Reading in Data and preprocessing

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

train = pd.read_csv('../input/ashrae-energy-prediction/train.csv')
test = pd.read_csv('../input/ashrae-energy-prediction/test.csv')
weather_te = pd.read_csv('../input/ashrae-energy-prediction/weather_test.csv')
weather_tr = pd.read_csv('../input/ashrae-energy-prediction/weather_train.csv')
bmd = pd.read_csv('../input/ashrae-energy-prediction/building_metadata.csv')

# Set timestamps
train['timestamp'] = pd.to_datetime(train['timestamp'])
test['timestamp'] = pd.to_datetime(test['timestamp'])
weather_tr['timestamp'] = pd.to_datetime(weather_tr['timestamp'])
weather_te['timestamp'] = pd.to_datetime(weather_te['timestamp'])

sns.set(style="whitegrid")
sns.set_color_codes("pastel")

## Time Series Competition

It's important to note that the data provided is `time series` in nature. We are given one year of data (2016) and are asked to predict 2 years of meter readings.

Per the description:

**This competition challenges you to build these counterfactual models across four energy types based on historic usage rates and observed weather. The dataset includes three years of hourly meter readings from over one thousand buildings at several different sites around the world.**

In [None]:
meter_mapping = {0: 'electricity', 1: 'chilledwater', 2: 'steam', 3: 'hotwater'}
train['meter_type'] = train['meter'].map(meter_mapping)
test['meter_type'] = test['meter'].map(meter_mapping)

In [None]:
train.groupby(['timestamp','meter_type'])['meter_reading'] \
    .median() \
    .reset_index().set_index('timestamp') \
    .groupby('meter_type')['meter_reading'] \
    .plot(figsize=(15, 5), title='Median Meter Reading by Meter Type (Training Set)')
plt.legend()
plt.show()

In [None]:
train['train'] = 1
test['train'] = 0
tt = pd.concat([train, test], axis=0, sort=True)

tt.groupby(['timestamp','meter_type'])['meter_reading'] \
    .median() \
    .reset_index().set_index('timestamp') \
    .groupby('meter_type')['meter_reading'] \
    .plot(figsize=(15, 5), title='Median Meter Reading by Meter Type (train and test timeframe)')
plt.legend()
plt.show()

## Evaluation Metric

We will be evaluated by the metric `Root Mean Squared Logarithmic Error` (RMSLE).

The RMSLE is calculated as:

$ \epsilon = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \left( \log(p_i + 1) - \log(a_i + 1) \right)^2 } $

Where:

- $\epsilon$ is the RMSLE value (score),
- $n$ is the total number of observations in the (public/private) data set,
- $p_i$ is your prediction of the target,
- $a_i$ is the actual target for $i$, and
- $\log(x)$ is the natural logarithm of $x$.

Understanding and optimizing your predictions for this evaluation metric is paramount for this competition.

As mentioned in this discussion thread (https://www.kaggle.com/questions-and-answers/60012), the metric can be calculated with scikit-learn as:

```
from sklearn.metrics import mean_squared_log_error
np.sqrt(mean_squared_log_error( y_test, predictions ))
```
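
The formula above can also be written directly in NumPy, which makes one useful property visible: RMSLE is simply the RMSE of `log1p`-transformed values. The sketch below uses made-up numbers, and the helper name `rmsle` is our own rather than a library function.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error, following the formula above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) computes log(x + 1) accurately, including for x near zero
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Made-up readings and predictions, for illustration only
actual = np.array([0.0, 10.0, 100.0])
predicted = np.array([0.0, 12.0, 80.0])
print(rmsle(actual, predicted))

# expm1 inverts log1p, so a model fit on log1p(meter_reading) can have its
# predictions mapped back to the original scale.
round_trip = np.expm1(np.log1p(actual))
```

Because of this equivalence, one common approach is to train on `np.log1p(meter_reading)` with an ordinary squared-error loss and apply `np.expm1` to the predictions before submitting.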

## Evaluating the Target Variable

As always, we will start by looking at the target variable. Since we have meter data for 1000+ buildings, we will look at it by meter type.
- Electricity meters are 3x more common than the next meter type (chilledwater)
- Steam has much larger average meter values than the rest (13882 average reading)

In [None]:
pd.DataFrame(train.groupby('meter_type')['meter_reading'] \
                 .describe() \
                 .astype(int)) \
                 .sort_values('count',
                              ascending=False)

## Plotting the distribution of the target.
The first thing we notice here is the extremely skewed distribution, caused by a few very large values.

In [None]:
train['meter_reading'].plot(kind='hist',
                        bins=50,
                        figsize=(15, 2),
                       title='Distribution of Target Variable (meter_reading)')
plt.show()

After removing the high values, we get a better idea of the distribution of values. We may want to create different models for different buildings.

In [None]:
train.query('meter_reading < 5000')['meter_reading'] \
    .plot(kind='hist',
          figsize=(15, 3),
          title='Distribution of meter_reading, excluding values greater than 5000',
          bins=200)
plt.show()
train.query('meter_reading < 500')['meter_reading'] \
    .plot(kind='hist',
          figsize=(15, 3),
          title='Distribution of meter_reading, excluding values greater than 500',
         bins=200)
plt.show()
train.query('meter_reading < 100')['meter_reading'] \
    .plot(kind='hist',
          figsize=(15, 3),
          title='Distribution of meter_reading, excluding values greater than 100',
         bins=100)
plt.show()
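
Instead of cutting off large values at fixed thresholds, a skewed target like this is often inspected on a `log1p` scale, which also matches the RMSLE metric. The sketch below uses synthetic log-normal data as a stand-in for `meter_reading` (an assumption, not the real distribution) and compares skewness before and after the transform.

```python
import numpy as np
import pandas as pd

# Synthetic, heavily right-skewed stand-in for meter_reading (assumption:
# log-normal values; the real data will differ).
rng = np.random.default_rng(0)
readings = pd.Series(rng.lognormal(mean=3.0, sigma=2.0, size=10_000),
                     name="meter_reading")

# log1p compresses the long right tail while keeping zeros at zero
log_readings = np.log1p(readings)

raw_skew = readings.skew()
log_skew = log_readings.skew()
print(f"skew before log1p: {raw_skew:.2f}, after: {log_skew:.2f}")
```

On the real data the same idea would be `np.log1p(train['meter_reading']).plot(kind='hist', bins=100)`.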

## Target for a single building with Multiple Meters. Viewing over Time.
On inspection of the data over time, we can see that this data is very messy. There appear to be periods when the values drop to zero.

In [None]:
train.query('building_id == 0 and meter == 0') \
    .set_index('timestamp')['meter_reading'].plot(figsize=(15, 3),
                                                 title='Building 0 - Meter 0')

plt.show()
train.query('building_id == 753').set_index('timestamp').groupby('meter')['meter_reading'].plot(figsize=(15, 3),
                                                 title='Building 753 - Meters 0-3')
plt.show()
train.query('building_id == 1322').set_index('timestamp').groupby('meter')['meter_reading'].plot(figsize=(15, 3),
                                                 title='Building 1322 - Meters 0-3')
plt.show()
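
The zero-valued stretches seen in these plots can be quantified per building and meter. The following is a minimal sketch on made-up rows; the helper `longest_zero_run` is a hypothetical name of our own, and it assumes readings are already sorted by timestamp within each group.

```python
import pandas as pd

def longest_zero_run(readings: pd.Series) -> int:
    """Length of the longest consecutive stretch of zero readings."""
    is_zero = readings.eq(0)
    # A new run id starts whenever the zero/non-zero state changes
    run_id = is_zero.ne(is_zero.shift()).cumsum()
    # Summing booleans per run gives the run length for zero runs, 0 otherwise
    run_lengths = is_zero.groupby(run_id).sum()
    return int(run_lengths.max()) if len(run_lengths) else 0

# Made-up readings: building 0 has a three-hour outage, building 1 has none
demo = pd.DataFrame({
    "building_id": [0] * 7 + [1] * 3,
    "meter": [0] * 10,
    "meter_reading": [5, 0, 0, 0, 7, 0, 2, 1, 2, 3],
})

zero_runs = (
    demo.groupby(["building_id", "meter"])["meter_reading"]
        .apply(longest_zero_run)
)
print(zero_runs)
```

Long runs are candidates for closer inspection; whether they are outages, sensor faults, or genuine zero usage cannot be decided from the run length alone.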

# Using Building Metadata

In [None]:
# First take a look at the building metadata
bmd.describe()

- We see that there is a spike in buildings that were built in the year 1976
- 774 buildings have no year built information

In [None]:
bmd.groupby('year_built')['site_id'] \
    .count() \
    .plot(figsize=(15, 5),
          style='.-',
          title='Building Meta Data - Count by Year Built')
plt.show()
print('{} Buildings have no year data.'.format(np.sum(bmd['year_built'].isna())))

## Building Primary Use
- Education is the most common type of building use, with office second.
- There is a steep drop off in number of buildings after Lodging.

In [None]:
bmd.groupby('primary_use') \
    .count()['site_id'] \
    .sort_values() \
    .plot(kind='barh',
          figsize=(15, 5),
          title='Count of Buildings by Primary Use')
plt.show()

In [None]:
# Aggregate some meter reading stats
meter_reading_stats = train.groupby('building_id')['meter_reading'].agg(['mean','max','min']).reset_index()
bmd_with_stats = pd.merge(bmd, meter_reading_stats, on=['building_id']).rename(columns={'mean':'mean_meter_reading',
                                                                       'max':'max_meter_reading',
                                                                       'min':'min_meter_reading'})

## Building Type and Meter Reading 

In [None]:
import warnings
warnings.filterwarnings("ignore", category=RuntimeWarning) 
sns.pairplot(bmd_with_stats.dropna(),
             vars=['mean_meter_reading','min_meter_reading',
                   'max_meter_reading','square_feet','year_built'],
             hue='primary_use')
plt.show()

# Time Series Impact on Energy Consumption

In [None]:
train['Weekday'] = train['timestamp'].dt.weekday
train['Weekday_Name'] = train['timestamp'].dt.day_name()  # dt.weekday_name was removed in pandas 1.0
train['Month'] = train['timestamp'].dt.month
train['DayofYear'] = train['timestamp'].dt.dayofyear
train['Hour'] = train['timestamp'].dt.hour

In order to properly visualize the data, we can normalize the meter reading by type. This allows us to compare how the time series features impact each meter reading type on the same scale. The normalized value expresses each reading as the number of standard deviations it lies above or below its meter type's average.

In [None]:
train['normalized_meter_reading_type'] = \
    train.groupby('meter_type')['meter_reading'] \
        .transform(lambda x: (x - x.mean()) / x.std())
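
To make the effect of this per-type normalization concrete, here is a small worked example on made-up numbers (the values and the `demo` frame are hypothetical). Two meter types on very different scales end up directly comparable.

```python
import pandas as pd

# Made-up readings: steam values are 100x larger than electricity values
demo = pd.DataFrame({
    "meter_type": ["electricity"] * 3 + ["steam"] * 3,
    "meter_reading": [10.0, 20.0, 30.0, 1000.0, 2000.0, 3000.0],
})

# Same z-score transform as above: subtract the group mean and divide by
# the group (sample) standard deviation
demo["normalized"] = (
    demo.groupby("meter_type")["meter_reading"]
        .transform(lambda x: (x - x.mean()) / x.std())
)
print(demo)
```

Both groups map to -1, 0 and 1 here, since each has its own mean and standard deviation removed.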

In [None]:
fig, ax = plt.subplots(figsize=(15, 5))
sns.barplot(data=train.groupby(['Weekday_Name','meter_type'])['normalized_meter_reading_type'].mean().reset_index(),
            x='Weekday_Name',
            y='normalized_meter_reading_type',
            hue='meter_type',
            ax=ax)
plt.title('Day of Week vs. Normalized Meter Reading')
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(15, 5))
sns.barplot(data=train.groupby(['Month','meter_type'])['normalized_meter_reading_type'].mean().reset_index(),
            x='Month',
            y='normalized_meter_reading_type',
            hue='meter_type',
            ax=ax)
plt.title('Month vs. Normalized Meter Reading')
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(15, 5))
sns.barplot(data=train.groupby(['Hour','meter_type'])['normalized_meter_reading_type'].mean().reset_index(),
            x='Hour',
            y='normalized_meter_reading_type',
            hue='meter_type',
            ax=ax)
plt.title('Hour within Day vs. Normalized Meter Reading')
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(15, 5))
sns.lineplot(data=train.groupby(['DayofYear','meter_type'])['normalized_meter_reading_type'].mean().reset_index(),
            x='DayofYear',
            y='normalized_meter_reading_type',
            hue='meter_type',
            ax=ax)
# plt.title('Day of Year vs. Normalized Meter Reading')
plt.show()

## Next: weather data (check back soon)