# Introduction

The aim of this notebook is to analyse outlier meter readings that can distort averages.

To reduce noise, the readings are grouped by day.

In [None]:
import numpy as np
import pandas as pd
import plotly.express as px
from IPython.display import display
from ashrae_utils import reduce_mem_usage

# Prepare data

In [None]:
data_path = '../input/ashrae-energy-prediction/'

## train

In [None]:
X_train = pd.read_csv(data_path + 'train.csv', engine='python')
X_train, na_list = reduce_mem_usage(X_train)
X_train['timestamp'] = pd.to_datetime(X_train['timestamp'], format='%Y-%m-%d %H:%M:%S')
X_train['meter'] = pd.Categorical(X_train['meter']).rename_categories({0: 'electricity', 1: 'chilledwater', 2: 'steam', 3: 'hotwater'})

In [None]:
X_train.head()

## building metadata

In [None]:
building_metadata = pd.read_csv(data_path + 'building_metadata.csv', engine='python')
building_metadata, na_list = reduce_mem_usage(building_metadata)

In [None]:
building_metadata.head()

# Group data on a daily basis

In [None]:
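daily_train = X_train
daily_train['date'] = daily_train['timestamp'].dt.date
# Sum only the readings: the timestamp column cannot be summed meaningfully
# and recent pandas versions raise an error when asked to.
daily_train = daily_train.groupby(['date', 'building_id', 'meter'])[['meter_reading']].sum()
daily_train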

# Aggregate the data across buildings

In [None]:
daily_train_agg = daily_train.groupby(['date', 'meter']).agg(['sum', 'mean', 'idxmax', 'max'])
daily_train_agg = daily_train_agg.reset_index()
level_0 = daily_train_agg.columns.droplevel(0)
level_1 = daily_train_agg.columns.droplevel(1)
level_0 = ['' if x == '' else '-' + x for x in level_0]
daily_train_agg.columns = level_1 + level_0
# rename_axis returns a new frame, so the result has to be assigned back.
daily_train_agg = daily_train_agg.rename_axis(None, axis=1)
daily_train_agg.head()

In [None]:
fig_total = px.line(daily_train_agg, x='date', y='meter_reading-sum', color='meter', render_mode='svg')
fig_total.update_layout(title='Total kWh per energy aspect')
fig_total.show()

The daily sum, split by energy aspect, shows some aberrant values.

In [None]:
fig_maximum = px.line(daily_train_agg, x='date', y='meter_reading-max', color='meter', render_mode='svg')
fig_maximum.update_layout(title='Maximum kWh value per energy aspect')
fig_maximum.show()

The maximum value for each day and energy aspect shows that a single building (per day and energy aspect) causes the aberrant peaks.

# Identifying outliers

In [None]:
daily_train_agg['building_id_max'] = [x[1] for x in daily_train_agg['meter_reading-idxmax']]
daily_train_agg.head()

In [None]:
def show_building(building, energy_aspects=None):
    fig = px.line(daily_train.loc[(slice(None), building, slice(None)), :].reset_index(),
                  x='date',
                  y='meter_reading',
                  color='meter',
                  render_mode='svg')
    if energy_aspects:
        # Hide every trace whose meter type was not requested; it stays available in the legend.
        # Matching on the trace name avoids relying on the order of the traces.
        for trace in fig.data:
            if trace.name not in energy_aspects:
                trace.visible = 'legendonly'
    fig.update_layout(title='Building ID: {}'.format(building))
    fig.show()
    display(building_metadata[building_metadata['building_id']==building])

## Electricity

In [None]:
print('Number of days that a building has the maximum electricity consumption of all the buildings:\n')
print(daily_train_agg[daily_train_agg['meter'] == 'electricity']['building_id_max'].value_counts())

Only 6 buildings account for the daily electricity maxima.

In [None]:
daily_train_electricity = daily_train_agg[daily_train_agg['meter']=='electricity'].copy()
daily_train_electricity['building_id_max'] = pd.Categorical(daily_train_electricity['building_id_max'])
fig_daily_electricity = px.scatter(daily_train_electricity,
                                   x='date',
                                   y='meter_reading-max',
                                   color='building_id_max',
                                   render_mode='svg')
fig_daily_electricity.update_layout(title='Maximum consumption values for the day and energy aspect')
fig_daily_electricity.show()

In [None]:
show_building(803, ['electricity'])

This building is classified as an educational centre.
Therefore, it should not show industrial-scale electricity consumption.

The electricity consumption averages about 120 000 kWh per day for 180 000 ft².

This is 0.67 kWh/ft²/day or 243 kWh/ft²/year,
whereas the typical consumption is about 20 kWh/ft²/year.

This building consumes roughly 12 times more electricity than a typical one.
Maybe the meter, or the software that reads it, is not configured correctly.
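
The arithmetic above can be reproduced with a small helper. This is only a sketch: the function name `specific_consumption` is hypothetical, and the 20 kWh/ft²/year reference is the typical value quoted in the text, not something derived from the data.

```python
def specific_consumption(daily_kwh, area_sqft, typical_kwh_per_sqft_year=20.0):
    """Return (kWh/ft2/day, kWh/ft2/year, ratio to the typical yearly value)."""
    per_day = daily_kwh / area_sqft
    per_year = per_day * 365
    ratio = per_year / typical_kwh_per_sqft_year
    return per_day, per_year, ratio


# Building 803: about 120 000 kWh per day for 180 000 ft2 (figures from the text).
per_day, per_year, ratio = specific_consumption(120_000, 180_000)
print(f'{per_day:.2f} kWh/ft2/day, {per_year:.0f} kWh/ft2/year, {ratio:.1f}x typical')
```

The same call with the daily average and floor area of any other building gives the figures used in the sections below.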

In [None]:
show_building(801, ['electricity'])

This building is classified as an educational centre.
Therefore, it should not show industrial-scale electricity consumption.

The electricity consumption averages about 110 000 kWh per day for 500 000 ft².

This is 0.22 kWh/ft²/day or 80 kWh/ft²/year,
whereas the typical consumption is about 20 kWh/ft²/year.

This building consumes 4 times more electricity than a typical one.
Maybe the meter, or the software that reads it, is not configured correctly.

In [None]:
show_building(799, ['electricity'])

This building has a very erratic behaviour with
some periods of no data,
a period of increasing consumption,
and two periods of stable consumption with a weekly pattern.

The first of the two periods averages 75 000 kWh per day for 500 000 ft².
The second averages 175 000 kWh per day.

This is 0.15-0.35 kWh/ft²/day or 55-130 kWh/ft²/year,
whereas the typical consumption is about 20 kWh/ft²/year.

Both specific consumptions are between 3 and 7 times the typical value.

This centre could be new. In that case, the consumption in the summer period would be due to offices and equipment,
and classes would begin in mid-September. Something changed in November.

More probably, the building is not new: a building of half a million square feet is not built in one year. Instead, meters have been added progressively, and some of them are not well configured.

In [None]:
show_building(1088, ['electricity'])

This building has a steady period in the cold part of the year and a very noisy period in the hot part of the year.

Maybe the total is the sum of several meters, and the meter that measures the HVAC is not well configured.

In [None]:
show_building(993, ['electricity'])

This building has a very steady daily consumption with a weekly pattern, and without a seasonal pattern.

It has some holiday pauses in July and December, and some peaks.

In [None]:
show_building(794, ['electricity'])

This building has a very steady daily consumption with a weekly pattern, and without a seasonal pattern.

It has an average of 95 000 kWh for 750 000 ft² each day.

This is 0.13 kWh/ft²/day or about 46 kWh/ft²/year,
whereas the typical consumption is about 20 kWh/ft²/year.
The specific consumption is more than twice the typical value.

Some of its peaks are the maximum across all the buildings on a single day.

## Chilledwater

In [None]:
print('Number of days that a building has the maximum chilledwater consumption of all the buildings:\n')
print(daily_train_agg[daily_train_agg['meter'] == 'chilledwater']['building_id_max'].value_counts())

Only 10 buildings account for the daily chilledwater maxima.

In [None]:
daily_train_chilledwater = daily_train_agg[daily_train_agg['meter']=='chilledwater'].copy()
daily_train_chilledwater['building_id_max'] = pd.Categorical(daily_train_chilledwater['building_id_max'])
fig_daily_chilledwater = px.scatter(daily_train_chilledwater,
                                    x='date',
                                    y='meter_reading-max',  
                                    color='building_id_max', 
                                    render_mode='svg')
fig_daily_chilledwater.update_layout(title='Maximum consumption values for the day and energy aspect')
fig_daily_chilledwater.show()

Only buildings 778 and 1088 have aberrant values.

In [None]:
show_building(778, ['chilledwater'])

The maximum consumption of the non-aberrant buildings is about 700 000 kWh.
This building consumes 25 times that amount,
and only during a span of two months.

The measurement is probably wrong.

In [None]:
show_building(1088, ['chilledwater'])

This building has a typical chilledwater consumption during the hot part of the year,
with a maximum of 100 000 kWh.
In some other short periods it has very high peaks, about a hundred times larger.

## Steam

In [None]:
print('Number of days that a building has the maximum steam consumption of all the buildings:\n')
print(daily_train_agg[daily_train_agg['meter'] == 'steam']['building_id_max'].value_counts())

Only 4 buildings account for the daily steam maxima.

In [None]:
daily_train_steam = daily_train_agg[daily_train_agg['meter']=='steam'].copy()
daily_train_steam['building_id_max'] = pd.Categorical(daily_train_steam['building_id_max'])
fig_daily_steam = px.scatter(daily_train_steam,
                             x='date',
                             y='meter_reading-max',
                             color='building_id_max',
                             render_mode='svg')
fig_daily_steam.update_layout(title='Maximum consumption values for the day and energy aspect')
fig_daily_steam.show()

Building 1099 has very large, unrealistic values: 450 000 000 kWh.

Buildings 1168 and 1197 have a large consumption: 3 000 000 kWh.

Building 1148 has values above 1 000 000 kWh.

In [None]:
show_building(1099, ['steam'])

When the value is not gargantuan, it is below 50 000 kWh.

As the difference is three to four orders of magnitude,
this is probably a unit misconfiguration, such as Wh values recorded as kWh.
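
One way to quantify such a gap is to compare the largest reading with the median of the positive readings on a log10 scale. The sketch below is illustrative only: `magnitude_gap` is a hypothetical helper and the sample readings are made up, loosely following the values mentioned above.

```python
import math
from statistics import median


def magnitude_gap(values):
    """Orders of magnitude between the largest value and the median positive value."""
    positive = [v for v in values if v > 0]
    if not positive:
        return 0.0
    return math.log10(max(positive) / median(positive))


# Made-up daily readings: mostly below 50 000 kWh, with one gargantuan value.
readings = [40_000, 45_000, 50_000, 48_000, 450_000_000]
gap = magnitude_gap(readings)
print(f'{gap:.1f} orders of magnitude')
```

A gap close to 3 would be compatible with a factor of 1000, such as Wh against kWh, but on its own it does not prove which unit was misconfigured.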

In [None]:
show_building(1168, ['steam'])

A daily average of 1 000 000 kWh is too large even for a 500 000 ft² office.

In [None]:
show_building(1197, ['steam'])

This is too much steam energy for a building that is not especially large.

In [None]:
show_building(1148, ['steam'])

The consumption is large even for a very large office.

## Hotwater

In [None]:
print('Number of days that a building has the maximum hotwater consumption of all the buildings:\n')
print(daily_train_agg[daily_train_agg['meter'] == 'hotwater']['building_id_max'].value_counts())

Only 7 buildings account for the daily hotwater maxima,
and in practice just two of them.

In [None]:
daily_train_hotwater = daily_train_agg[daily_train_agg['meter']=='hotwater'].copy()
daily_train_hotwater['building_id_max'] = pd.Categorical(daily_train_hotwater['building_id_max'])
fig_daily_hotwater = px.scatter(daily_train_hotwater,
                                x='date',
                                y='meter_reading-max',
                                color='building_id_max',
                                render_mode='svg')
fig_daily_hotwater.update_layout(title='Maximum consumption values for the day and energy aspect')
fig_daily_hotwater.show()

In [None]:
show_building(1021, ['hotwater'])

The values are very large.

In [None]:
show_building(1331, ['hotwater'])

This is a very large consumption for an educational building.

# Conclusion

Looking only at the buildings that consume more than the others,
it can be seen that there are many measurement scale errors.

The error could be:

- The meter is not configured correctly, e.g. a wrong primary to secondary ratio for voltage or current.
- The units are not configured correctly in the software, e.g. MJ/kg for steam.
- The decimal digits are not configured correctly in the software.
- A power variable is used instead of an energy one.

The measurement could come from a single meter, or from the sum of several of them.

Some changes over time, such as values going to zero or a change of scale,
indicate that some buildings have more than one meter.
One error in one meter is enough to make the overall measurement garbage.

This notebook has only analysed the outliers that influence the maximum consumption on a daily basis.
This is only the tip of the iceberg.
A thorough analysis should be done to detect and correct these outliers.
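
As one possible starting point for such an analysis, readings of a single building and energy aspect could be flagged with a modified z-score based on the median absolute deviation (MAD). This technique is not used in this notebook; the helper name `mad_outliers`, the sample values, and the commonly used threshold of 3.5 are assumptions for illustration.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Flag values whose modified z-score (based on the MAD) exceeds threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All deviations are (almost) identical; nothing can be flagged this way.
        return [False] * len(values)
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score
    # when the data is normally distributed.
    return [abs(0.6745 * (v - med) / mad) > threshold for v in values]


# Made-up daily readings for one building and one energy aspect.
daily = [100, 110, 95, 105, 100, 5000]
flags = mad_outliers(daily)
print(flags)
```

Because the median and the MAD are barely affected by a few extreme values, the spike is flagged without hiding itself, which a mean and standard deviation based score might not achieve.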

A solution to avoid scale errors is to normalize the values from 0 to 1, for each building and for each energy aspect.
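
A minimal sketch of that normalization with pandas follows. The helper name `min_max_by_group` and the toy frame are hypothetical; the column names `building_id`, `meter` and `meter_reading` are the ones used in this notebook.

```python
import pandas as pd


def min_max_by_group(df, value_col='meter_reading', keys=('building_id', 'meter')):
    """Scale value_col to the range [0, 1] within each group defined by keys."""
    grouped = df.groupby(list(keys), observed=True)[value_col]
    lo = grouped.transform('min')
    hi = grouped.transform('max')
    span = hi - lo
    # Avoid dividing by zero for constant series; they are mapped to 0 below.
    scaled = (df[value_col] - lo) / span.where(span != 0)
    return scaled.mask(span == 0, 0.0)


# Tiny made-up frame, only to illustrate the call.
toy = pd.DataFrame({
    'building_id': [1, 1, 1, 2, 2],
    'meter': ['electricity'] * 5,
    'meter_reading': [10.0, 20.0, 30.0, 1000.0, 3000.0],
})
toy['meter_reading_scaled'] = min_max_by_group(toy)
print(toy)
```

Note that min-max scaling removes scale differences between buildings, but a single spike inside one building still compresses the rest of that building's values towards 0, so outliers within a series would need to be handled first.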