# ASHRAE - Great Energy Predictor III


**IMPORTANT NOTE!!** The inline graphics make the content frame over 16 MB of data, so it may take a while to load and render fully. Please have patience (I hope your browser renders this bit quickly at least :)


## Heatmap EDA

This notebook shows all the meter readings over time in a 2D heatmap (one per building), as a simple exploration of the time series structure of the train/test sets.

Each plot shows a year of data: a pixel represents one hour, with a row of pixels as one week (24 &ast; 7 = 168 hours), with ~52 rows, one per week of the year.

The plots are saved to png files and can be downloaded from the **Output Files** section (left), for closer inspection.

This gives a good overview of the interactions between hourly, weekly and long-term seasonality. Many patterns are present: strong seasonal variation, weekly cycles, spikes, and intricate patterns that would be hard to spot on a simple line plot.

The plots for every single building are shown at the end, sorted by site. They act as a map of the entire training set, illustrating the kinds of *conditionality* needed in the models, e.g. quiet Thursdays, busy summers, shorter usage-hours at weekends...

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import gc, os, sys
import matplotlib.pyplot as plt
import cv2
from IPython.display import Image, display, HTML
import calendar

np.seterr(divide='ignore', invalid='ignore')

DTYPE = {
    'building_id': ('int16'),
    'meter': ('int8'),
    'meter_reading': ('float32')
}

INPUT = '../input/ashrae-energy-prediction'

In [2]:
buildings = pd.read_csv(f'{INPUT}/building_metadata.csv', index_col='building_id')
buildings.shape

In [3]:
train = pd.read_csv(f'{INPUT}/train.csv', parse_dates=['timestamp'], dtype=DTYPE)
train.shape

In [4]:
weather = pd.read_csv(f'{INPUT}/weather_train.csv', parse_dates=['timestamp'])
weather.shape

One-liners "R" us: the first heatmap is a count of building types at each site.

In [5]:
buildings.groupby(['primary_use','site_id']).size().unstack().fillna(0).style.background_gradient(axis=None)

Create (x, y) coordinates for every row of data: x is the hour within the week and y is the week of the year, so each row of pixels is one week. In pandas, `dayofweek` 0 is *Monday*. (Note that all of train is in 2016.)

In [6]:
train.timestamp.dt.year.value_counts()

#### Note: pandas reports the first days of the training set (1-3 January 2016) as ISO week 53 of the previous year, 2015. Taking the week number mod 53 is an ugly fix that turns those times into 'week 0', so week of year runs 0..52 inclusive.

In [7]:
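# dt.weekofyear was removed in pandas 2.0; isocalendar().week gives the same ISO week number
train.timestamp.dt.isocalendar().week.min(), train.timestamp.dt.isocalendar().week.max()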
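
To make the week-53 wrap concrete, here is a minimal sketch on three hand-picked timestamps (hypothetical samples, not read from the dataset). It assumes a pandas version that provides `dt.isocalendar()`, which replaces the older `dt.weekofyear` accessor.

```python
import pandas as pd

# Hypothetical sample timestamps chosen for illustration only
ts = pd.Series(pd.to_datetime([
    '2016-01-01 00:00',  # Friday, ISO week 53 of 2015
    '2016-01-04 00:00',  # Monday, ISO week 1 of 2016
    '2016-12-31 23:00',  # Saturday, ISO week 52 of 2016
]))

week = ts.dt.isocalendar().week.astype('int64')
x = ts.dt.dayofweek * 24 + ts.dt.hour   # hour within the week, 0..167
y = week % 53                           # week 53 wraps around to row 0

coords = list(zip(x.tolist(), y.tolist()))
print(coords)
```

The first timestamp lands in row 0 rather than row 53, which is what keeps the plot height at 53 rows.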

In [8]:
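def add_xy(df):
    dt = df.timestamp.dt
    # x: hour within the week (0..167), Monday 00:00 is 0
    df['x'] = ((dt.dayofweek * 24) + dt.hour).astype('int16')
    # y: ISO week number, with week 53 (start of January) wrapped to 0
    # (isocalendar() replaces dt.weekofyear, which was removed in pandas 2.0)
    df['y'] = (dt.isocalendar().week % 53).astype('int8')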

In [9]:
add_xy(train)
add_xy(weather)
train.x.max(), train.y.max()

In [10]:
WIDTH = train.x.max() + 1
HEIGHT = train.y.max() + 1
WIDTH, HEIGHT

# Grayscale plots

One per building / meter.

In [11]:
PLOTS_GRAY = 'plots_grayscale'
os.makedirs(PLOTS_GRAY, exist_ok=True)

In [12]:
def normalize(c):
    return np.nan_to_num(c / c.max())

def log_normalize(c):
    return normalize(np.log1p(c))

# WARNING! This stretches each color channel independently
#  - it loses relative scale between them
def log_normalize_chan(c):
    return np.dstack([log_normalize(a.T) for a in c.T])

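def write_img(fname, img):
    # Scale [0, 1] floats to 8-bit without modifying the caller's array
    img = (img * 255).astype('uint8')
    if img.ndim == 3:
        # OpenCV expects BGR channel order
        img = cv2.cvtColor(img, cv2.COLOR_RGB2BGR)
    cv2.imwrite(fname, img)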

Simple counting code.

Using [`np.add.at`][1] is not strictly necessary here.
It ensures that results are accumulated for elements that are indexed more than once, so this code is more generically useful.
However, all buildings have at most one observation per hour, so simple indexing `cc[df.y, df.x] += df.meter_reading` would also work.


[1]: https://docs.scipy.org/doc/numpy/reference/generated/numpy.ufunc.at.html
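
The difference between `np.add.at` and plain fancy-index `+=` only shows up when an index repeats. A small sketch with made-up indices and values (none of these come from the dataset):

```python
import numpy as np

rows = np.array([0, 0, 1])      # the cell (0, 2) is indexed twice
cols = np.array([2, 2, 3])
vals = np.array([1.0, 2.0, 5.0])

buffered = np.zeros((2, 4))
buffered[rows, cols] += vals    # buffered: the last write to (0, 2) wins

accumulated = np.zeros((2, 4))
np.add.at(accumulated, (rows, cols), vals)  # unbuffered: repeated cells are summed

print(buffered[0, 2], accumulated[0, 2])
```

With at most one observation per hour there are no repeated cells, so both forms would give the same image here.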

In [13]:
for (m, b), df in train.groupby(['meter', 'building_id']):
    cc = np.zeros((HEIGHT, WIDTH), dtype=float)
    np.add.at(cc, (df.y, df.x), df.meter_reading)
    write_img(f'{PLOTS_GRAY}/{m}_{b}_log.png', log_normalize(cc))

Sneak preview - main display is later...

In [14]:
display(Image(f'{PLOTS_GRAY}/1_161_log.png', width=WIDTH*4))

Over 2000 plots :)

In [15]:
!ls -1 $PLOTS_GRAY | wc -l

# RGB Plots

One per building - three meters in <font color=red>red</font>, <font color=green>green</font> and <font color=blue>blue</font> channels. Each channel is log transformed and stretched separately - so some information about absolute energy usage per channel is lost.
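
As a rough illustration of what "stretched separately" loses, the sketch below normalizes two hypothetical meters whose readings differ by a factor of 100. The helper mirrors the idea of the notebook's `log_normalize` but is re-declared here so the block stands alone.

```python
import numpy as np

def log_normalize(c):
    # log1p, then divide by the channel's own maximum
    logged = np.log1p(c)
    return logged / logged.max()

electricity = np.array([[10.0, 100.0]])   # hypothetical small meter
steam = np.array([[1000.0, 10000.0]])     # hypothetical meter, 100x larger

e = log_normalize(electricity)
s = log_normalize(steam)
print(e.max(), s.max())
```

Both channels peak at full brightness, so the colors show the shape of each meter's usage over time, not which meter uses more energy.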

In [16]:
PLOTS_RGB = 'plots_rgb'
os.makedirs(PLOTS_RGB, exist_ok=True)

In [17]:
METERS = ['electricity', 'chilledwater', 'steam', 'hotwater']
M_DICT = {k:i for i, k in enumerate(METERS)}

train.meter.value_counts()

Values are: {0: electricity, 1: chilledwater, 2: steam, 3: hotwater}

The simplest decision is to ignore hotwater (the least frequent meter) and plot the others:

 - <font color=red>red</font> = **electricity**
 - <font color=green>green</font> = **chilledwater**
 - <font color=blue>blue</font> = **steam**

The brightness of each color channel indicates the energy usage, whilst black could indicate zero energy usage or (much more likely) missing data.
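
The plots in this notebook do not separate the two meanings of black. One possible way to tell them apart, sketched here on a tiny made-up grid (the arrays and sizes are assumptions for illustration), is to accumulate an observation count alongside the readings:

```python
import numpy as np

HEIGHT, WIDTH = 2, 3                     # tiny grid for illustration
y = np.array([0, 0, 1])                  # hypothetical week rows
x = np.array([0, 1, 2])                  # hypothetical hour columns
reading = np.array([5.0, 0.0, 7.0])      # includes one genuine zero reading

total = np.zeros((HEIGHT, WIDTH))
count = np.zeros((HEIGHT, WIDTH), dtype=int)
np.add.at(total, (y, x), reading)
np.add.at(count, (y, x), 1)

missing = count == 0                     # no observation at all
true_zero = (count > 0) & (total == 0)   # observed, but the reading was zero
print(missing.sum(), true_zero.sum())
```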

In [18]:
rgb_pngs = {}
for b, df in train.query('meter<3').groupby('building_id'):
    cc = np.zeros((HEIGHT, WIDTH, 3), dtype=float)
    np.add.at(cc, (df.y, df.x, df.meter), df.meter_reading)
    png = f'{PLOTS_RGB}/{b}_log.png'
    rgb_pngs[b] = png
    write_img(png, log_normalize_chan(cc))

Over 1400 plots (well, one for each building... :)

In [19]:
!ls -1 $PLOTS_RGB | wc -l

# Matplotlib Display

Show some plots selected for interesting features...

Many will contain 5, 6 or 7 clearly visible daily peaks. The exact start and end of those peaks can change between buildings, though they are quite correlated for buildings in the same site. Similarly, the length of peak usage per day can vary within a building, from days that look dormant to days that are just shorter than a 'usual' day.

In [20]:
# Find month offsets
month_y_min = train.groupby(train.timestamp.dt.month).y.min()
ylabels = np.asarray(calendar.month_abbr)[month_y_min.index]
yticks = month_y_min.values

# Tick to mark days - each 24 hours
days = list(calendar.day_abbr)
xlabels = days
xticks = np.arange(7) * 24
WEATHER_COLS = ['air_temperature', 'precip_depth_1_hr', 'sea_level_pressure']

def count_gray(df):
    cc = np.zeros((HEIGHT, WIDTH), dtype=float)
    np.add.at(cc, (df.y, df.x), df.meter_reading)
    return cc

def count_rgb(df):
    cc = np.zeros((HEIGHT, WIDTH, 3), dtype=float)
    np.add.at(cc, (df.y, df.x, df.meter), df.meter_reading)
    return cc

# use rank transform of selected columns (fork to try others!)
def weather_rgb(df):
    cc = np.zeros((HEIGHT, WIDTH, 3), dtype=float)
    for i, c in enumerate(WEATHER_COLS):
        cc[df.y, df.x, i] = df[c].rank(pct=True)
    return cc

def detail_list(series):
    # Render the non-null metadata fields as a well-formed HTML list
    items = ''.join(f'<li><i>{k}</i>: <b>{v}</b></li>'
                    for k, v in series.dropna().items())
    return f'<ul>{items}</ul>'

def display_plot(plotdata, title):
    fig, ax = plt.subplots(figsize=(14, 6))
    c = ax.imshow(plotdata)
    # pass minor by keyword: newer matplotlib treats the 2nd positional arg as labels
    ax.set_xticks(xticks, minor=False)
    ax.set_xticklabels(xlabels)
    ax.set_yticks(yticks)
    ax.set_yticklabels(ylabels)
    ax.set_title(title)
    plt.tight_layout()
    plt.show()

def show_plot(src_df, building_id, comment):
    df = src_df.query(f'(building_id=={building_id}) and (meter<3)')
    display(HTML(f'<h1 id="b{building_id}">Building {building_id}</h1>'))
    display(HTML(detail_list(buildings.loc[building_id])))
    display(df.groupby('meter').meter_reading.agg(['count', 'mean', 'max']))
    display(HTML(f'<br/>{comment}'))
    p = log_normalize_chan(count_rgb(df))
    display_plot(p, f'Building {building_id}')

COMMENTS = {
    647: "Some have a lot of missing data",
    675: "Nice intricate pattern",
    677: "<a target='_blank' href='https://www.youtube.com/watch?v=ciz_C3xiuN0'>(Thursday) Here's Why I Did Not Go to Work Today</a> (I didn't know this song - Google suggested it.)",
    182: "Single hour spikes <b>much</b> higher than the mean (see stats above) - <a target='_blank' href='https://www.youtube.com/watch?v=qAkZT_4vL_Y'>what's he <i>building</i> in there?</a> (See comments! &darr;)",
    822: "Looks more like an on/off pattern (constant usage for one day, no hourly variation), and reminds me of a <a target='_blank' href='https://www.kaggle.com/jtrotman/eda-talkingdata-temporal-click-count-plots'>ten minute switching pattern in chinese mobile advert click patterns</a> :) ",
    751: "Remember, blue is <i>steam</i>: Some activity at weekends, but weekends have different seasonal pattern",
    1017: "Abrupt drop in electricity, switches to chilledwater (green)?",
    1063: "Monthly pattern in weekends?",
    747: "Who likes <a target='_blank' href='https://www.google.com/search?q=fruit+salad+sweet&source=lnms&tbm=isch'>Fruit Salads?</a>",
    1247: "Interesting interaction between hour of day and position in year - phase shifts over time.",
    1355: "Looks like four or more separate regimes - longer days in the later half (that start one hour later), and missing data around March/April that is common to most site 15 buildings.",
}

In [21]:
# Add this back if you want separate weather plots
# for site, df in weather.groupby('site_id'):
#     display_plot(weather_rgb(df), f'Weather at site {site}')

In [22]:
for building_id, comment in COMMENTS.items():
    show_plot(train, building_id, comment)

# BY SITE

Enough commenting on individual patterns: just see for yourself - here they **ALL** are :D

**Note**: they are sorted by **primary_use**, then PNG file size - so more complicated plots (harder to compress &rarr; higher file size) come first.
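
Why file size works as a complexity proxy: PNG stores pixel data with DEFLATE compression, so repetitive images shrink far more than noisy ones. The sketch below uses `zlib` directly on two synthetic arrays (both invented for illustration). Real PNG sizes will differ because PNG also applies per-row filtering and adds headers, but the ordering should be similar.

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
flat = np.full((53, 168), 128, dtype=np.uint8)                 # constant image
noisy = rng.integers(0, 256, size=(53, 168), dtype=np.uint8)   # random image

flat_size = len(zlib.compress(flat.tobytes()))
noisy_size = len(zlib.compress(noisy.tobytes()))
print(flat_size, noisy_size)
```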

This clearly shows sites tend to have missing observations (black areas) at the same time, e.g. site 15 as shown above, and in site 0, nearly all buildings have missing observations for January/February at the top of each plot.

Weather for each site is displayed first. Notice how much noisier plots from the natural world look compared to the robotic world of power consumption (and also different missing data patterns).

Colors for weather:

 - <font color=red>red</font> = **air_temperature**
 - <font color=green>green</font> = **precip_depth_1_hr**
 - <font color=blue>blue</font> = **sea_level_pressure**
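
The weather channels are rank-transformed rather than log-scaled. A minimal sketch of what `rank(pct=True)` does, on a few hypothetical temperature values (not taken from the weather file):

```python
import pandas as pd

temps = pd.Series([-5.0, 0.0, 12.5, 30.0, None])   # hypothetical readings
ranked = temps.rank(pct=True)
print(ranked.tolist())
```

Each value is replaced by its rank divided by the number of non-missing values, so the output is spread evenly over (0, 1] whatever the original units, and missing readings stay missing.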
 
 
## Tips

Apple Mac users can run "Digital Color Meter" to see the RGB components of pixels in these plots to examine exact values. (Windows &amp; Linux must have equivalents &mdash; if anyone knows please add a comment.)

macOS also has an extremely useful full-screen zoom feature that is as simple as **&lt;control&gt; + scroll gesture**.

If this does not work it can be enabled in the Accessibility settings:

    System Preferences
      >> Accessibility
        >> Zoom
          >> "Use scroll gesture with modifier keys to zoom"


In [23]:
import base64

buildings['png'] = pd.Series(rgb_pngs)
buildings['png_size'] = pd.Series(rgb_pngs).map(os.path.getsize)

IMG_PER_ROW = 4

def img_src(fname):
    with open(fname, 'rb') as f:
        return base64.b64encode(f.read()).decode('utf-8')

# Return HTML for a table of plots
def make_table(df, per_row, inline=True):
    src = ""
    for i, (bid, row) in enumerate(df.iterrows()):
        if (i%per_row) == 0:
            if i:
                src += "</tr>"
            src += "<tr>"
        if inline:
            dat = f"data:image/png;base64,{img_src(row.png)}"
        else:
            dat = row.png
        src += (f'<td><img width={WIDTH} height={HEIGHT} src="{dat}">'
                f'<br/>[{bid}] <i>{row.primary_use}</i></td>')
    src += "</tr>"
    return f"<table>{src}</table>"

# Write single HTML table with all plots
with open('pngs_sorted_by_size.html', 'w') as f:
    table_src = make_table(buildings.sort_values('png_size'), per_row=8, inline=False)
    print(f"<html><head><title>ASHRAE Plots</title></head>"
          f"<body>{table_src}</body></html>\n", file=f)

# Generate output section per site
for site, df in buildings.sort_values(['primary_use', 'png_size'], ascending=[True, False]).groupby('site_id'):
    display(HTML(f'<h1 id="s{site}">Site {site}</h1>'))
    
    display(HTML(f'<h2>Weather</h2>'))
    display_plot(weather_rgb(weather.loc[weather.site_id==site]), f'Weather site {site}')
    display(weather.loc[weather.site_id == site].agg({
                WEATHER_COLS[0]: ['min', 'mean', 'max'],
                WEATHER_COLS[1]: ['min', 'mean', 'max'],
                WEATHER_COLS[2]: ['min', 'mean', 'max']
            }).T.round(1)
    )
    
    display(HTML(f'<h2>Stats</h2>'))
    display(
        df.groupby('primary_use').agg({
            'square_feet': ['count', 'mean'],
            'year_built': ['min', 'median', 'max'],
            'floor_count': ['min', 'median', 'max']
        }).T.dropna(how='all').T.fillna('').style.background_gradient(axis=0)
    )
    
    display(HTML(f'<h2>Buildings</h2>'))
    src = make_table(df, IMG_PER_ROW, inline=True)
    src += "<br/><hr/>"
    display(HTML(src))

# Test Set Structure Mini EDA

I was going to plot the missing/present patterns in the test set, but it turns out there is only one pattern: the test set is a perfect grid of every hour in 2017-2018, for all buildings, with selected meters per building...

In [24]:
test = pd.read_csv(f'{INPUT}/test.csv', usecols=['building_id', 'meter'], dtype=DTYPE)
test.shape

We expect hourly observations, so two years is:

In [25]:
2*365*24
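
The same figure can be cross-checked by counting hourly timestamps directly. This sketch assumes the test period runs from the first hour of 2017 to the last hour of 2018, as described above. Neither year is a leap year.

```python
import pandas as pd

# Assumed test period: 2017-01-01 00:00 through 2018-12-31 23:00, hourly
hours = pd.date_range('2017-01-01 00:00', '2018-12-31 23:00',
                      freq=pd.Timedelta(hours=1))
n_hours = len(hours)
print(n_hours)
```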

Buildings are present in multiples of 17520:

In [26]:
test.groupby(['building_id']).size().value_counts().sort_index()

The test set is simply 2380 sets of `['building_id', 'meter']` :

In [27]:
test.groupby(['building_id', 'meter']).size().value_counts()

However, there *will* be missing data in the test set, so some submission rows will be *public*, some will be *private* and some will be *ignored*. Scores are only computed for rows marked public or private.

See this old [Home Depot Product Search Relevance thread][1] for an example of a [Kaggle solution file][2], containing the ground truth (`Relevance`) and the `Usage` column :)

(Side note: the Home Depot competition, like many others, had a different type of *ignored* test set data - rows that are added just to make leaderboard probing harder.)

***UPDATE:***

It has now been announced [the set of rows that are *Ignored* will change before the deadline][3]. Sites with (test set era) 2017-2018 data publicly available will not be included in the private LB. 


 [1]: https://www.kaggle.com/c/home-depot-product-search-relevance/discussion/20587
 [2]: https://storage.googleapis.com/kaggle-forum-message-attachments/117897/4154/solution.csv
 [3]: https://www.kaggle.com/c/ashrae-energy-prediction/discussion/117357

# Clean Up

Compress the generated pngs - Kaggle does not allow more than 500 output files.

    -bd : disable progress indicator
    -mmt[N] : set number of CPU threads
    -sdel : delete files after compression

In [28]:
!7z a -bd -mmt4 -sdel {PLOTS_GRAY}.zip {PLOTS_GRAY} >>compress.log

In [29]:
!7z a -bd -mmt4 -sdel {PLOTS_RGB}.zip {PLOTS_RGB} >>compress.log