
### Problem Description

from https://www.drivendata.org/competitions/55/schneider-cold-start/page/111/

"The objective of this competition is to forecast energy consumption from varying amounts of "cold start" data, and little other building information. That means that for each building in the test set you are given a small amount of data and then asked to predict into the future."



In short, the task is to predict energy usage at several time scales for a wide variety of buildings and energy use profiles. We are given some metadata about each building, a short history of its energy usage, and (for some buildings) the outside temperature.

![test](images/mlscheme.png)

### First Look at Data

In [2]:
import pandas as pd
import numpy as np

In [3]:
# building metadata
meta = pd.read_csv('data/meta.csv')
# building energy consumption
consumption = pd.read_csv('data/consumption_train.csv')

In [4]:
# test data
test = pd.read_csv('data/cold_start_test.csv')

submission_format = pd.read_csv('data/submission_format.csv')

The metadata is keyed by series_id and will be used in conjunction with the train, test, and submission data sets.
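
As a rough sketch of how the metadata could be attached to the consumption rows, the cell below does a left join on series_id. The two small frames are toy stand-ins with made-up values, not the real files, and the names meta_toy and consumption_toy are hypothetical.

```python
import pandas as pd

# Toy stand-ins for meta and consumption; values are illustrative only.
meta_toy = pd.DataFrame({
    "series_id": [1, 2],
    "surface": ["large", "x-small"],
    "base_temperature": ["low", "high"],
})
consumption_toy = pd.DataFrame({
    "series_id": [1, 1, 2],
    "timestamp": ["2014-12-24 00:00:00", "2014-12-24 01:00:00", "2014-12-24 00:00:00"],
    "consumption": [100.0, 110.0, 50.0],
})

# Left join: keep every consumption row and attach that building's metadata.
# validate="many_to_one" raises if series_id is duplicated in the metadata.
merged = consumption_toy.merge(
    meta_toy, on="series_id", how="left", validate="many_to_one"
)
print(merged)
```

The same call should work on the real frames, assuming series_id is unique in meta.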

In [4]:
meta.head(3)

Unnamed: 0,series_id,surface,base_temperature,monday_is_day_off,tuesday_is_day_off,wednesday_is_day_off,thursday_is_day_off,friday_is_day_off,saturday_is_day_off,sunday_is_day_off
0,100003,x-large,low,False,False,False,False,False,True,True
1,100004,x-large,low,False,False,False,False,False,True,True
2,100006,x-small,low,False,False,False,False,False,True,True


In [5]:
meta.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1383 entries, 0 to 1382
Data columns (total 10 columns):
series_id               1383 non-null int64
surface                 1383 non-null object
base_temperature        1383 non-null object
monday_is_day_off       1383 non-null bool
tuesday_is_day_off      1383 non-null bool
wednesday_is_day_off    1383 non-null bool
thursday_is_day_off     1383 non-null bool
friday_is_day_off       1383 non-null bool
saturday_is_day_off     1383 non-null bool
sunday_is_day_off       1383 non-null bool
dtypes: bool(7), int64(1), object(2)
memory usage: 41.9+ KB


The consumption data is the training data set for any predictive model we create.

In [6]:
consumption.head(3)

Unnamed: 0.1,Unnamed: 0,series_id,timestamp,consumption,temperature
0,0,103088,2014-12-24 00:00:00,101842.233424,
1,1,103088,2014-12-24 01:00:00,105878.048906,
2,2,103088,2014-12-24 02:00:00,91619.105008,


The timestamp column looks like a datetime, but it is actually stored as a string (the object dtype in the info() output below).

In [7]:
consumption.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 509376 entries, 0 to 509375
Data columns (total 5 columns):
Unnamed: 0     509376 non-null int64
series_id      509376 non-null int64
timestamp      509376 non-null object
consumption    509376 non-null float64
temperature    280687 non-null float64
dtypes: float64(2), int64(2), object(1)
memory usage: 19.4+ MB
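
Since timestamp is stored as a string, it would need converting before any date arithmetic. The cell below is a minimal sketch of that conversion on a toy frame (the name toy is hypothetical); the format string assumes every value follows the pattern seen in the head() output above.

```python
import pandas as pd

# Toy frame with timestamps in the same string format as the real data.
toy = pd.DataFrame({"timestamp": ["2014-12-24 00:00:00", "2014-12-24 01:00:00"]})

# Convert the strings to datetime64 values.
toy["timestamp"] = pd.to_datetime(toy["timestamp"], format="%Y-%m-%d %H:%M:%S")

# Once converted, the .dt accessor exposes calendar fields.
hours = toy["timestamp"].dt.hour              # hour of day, 0-23
day_of_week = toy["timestamp"].dt.dayofweek   # Monday=0 ... Sunday=6
print(toy.dtypes)
```

An alternative is to pass parse_dates=['timestamp'] to pd.read_csv so the column is parsed at load time.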


In [6]:
test.head(3)

Unnamed: 0.1,Unnamed: 0,series_id,timestamp,consumption,temperature
0,0,102781,2013-02-27 00:00:00,15295.740389,17.0
1,1,102781,2013-02-27 01:00:00,15163.209562,18.25
2,2,102781,2013-02-27 02:00:00,15022.264079,18.0


In [8]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 111984 entries, 0 to 111983
Data columns (total 5 columns):
Unnamed: 0     111984 non-null int64
series_id      111984 non-null int64
timestamp      111984 non-null object
consumption    111984 non-null float64
temperature    67068 non-null float64
dtypes: float64(2), int64(2), object(1)
memory usage: 4.3+ MB


In [8]:
submission_format.head(10)

Unnamed: 0,pred_id,series_id,timestamp,temperature,consumption,prediction_window
0,0,102781,2013-03-03 00:00:00,19.93125,0.0,daily
1,1,102781,2013-03-04 00:00:00,20.034375,0.0,daily
2,2,102781,2013-03-05 00:00:00,19.189583,0.0,daily
3,3,102781,2013-03-06 00:00:00,18.397917,0.0,daily
4,4,102781,2013-03-07 00:00:00,20.7625,0.0,daily
5,5,102781,2013-03-08 00:00:00,19.8,0.0,daily
6,6,102781,2013-03-09 00:00:00,20.466667,0.0,daily
7,7,103342,2013-06-26 00:00:00,10.486607,0.0,weekly
8,8,103342,2013-07-03 00:00:00,10.006548,0.0,weekly
9,9,102969,2013-12-15 00:00:00,20.214583,0.0,daily


In [9]:
submission_format.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7529 entries, 0 to 7528
Data columns (total 6 columns):
pred_id              7529 non-null int64
series_id            7529 non-null int64
timestamp            7529 non-null object
temperature          4579 non-null float64
consumption          7529 non-null float64
prediction_window    7529 non-null object
dtypes: float64(2), int64(2), object(2)
memory usage: 353.0+ KB


We are expected to make predictions on the time scale indicated in the submission_format data, using the temperature and consumption data. Let's take a look at all the data associated with series_id 102781 (the first series id of the submission_format data).

In [9]:
meta[meta.series_id==102781]

Unnamed: 0,series_id,surface,base_temperature,monday_is_day_off,tuesday_is_day_off,wednesday_is_day_off,thursday_is_day_off,friday_is_day_off,saturday_is_day_off,sunday_is_day_off
1042,102781,large,low,False,False,False,False,False,True,True


In [12]:
test[test.series_id==102781].head(5)

Unnamed: 0.1,Unnamed: 0,series_id,timestamp,consumption,temperature
0,0,102781,2013-02-27 00:00:00,15295.740389,17.0
1,1,102781,2013-02-27 01:00:00,15163.209562,18.25
2,2,102781,2013-02-27 02:00:00,15022.264079,18.0
3,3,102781,2013-02-27 03:00:00,15370.420458,17.0
4,4,102781,2013-02-27 04:00:00,15303.103213,16.9


In [13]:
test[test.series_id==102781].tail(5)

Unnamed: 0.1,Unnamed: 0,series_id,timestamp,consumption,temperature
91,91,102781,2013-03-02 19:00:00,16595.804694,21.0
92,92,102781,2013-03-02 20:00:00,18299.772472,20.0
93,93,102781,2013-03-02 21:00:00,15130.602771,19.0
94,94,102781,2013-03-02 22:00:00,14411.149709,18.8
95,95,102781,2013-03-02 23:00:00,14486.88161,19.0


So there are 96 consecutive hours (4 full days, rows 0 through 95) of temperature and energy consumption data for this building, and we need to predict 7 days of consumption after this time series data (2013-03-03 through 2013-03-09 in submission_format). We also know it is a large building with a low base temperature and that it is off on the weekends.
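
The length of the cold start window and of the forecast horizon can be checked with a little timestamp arithmetic. This sketch uses only the first and last timestamps shown in the outputs above for series 102781; the variable names are hypothetical.

```python
import pandas as pd

# First and last hourly timestamps of the cold start data for series 102781.
first_obs = pd.Timestamp("2013-02-27 00:00:00")
last_obs = pd.Timestamp("2013-03-02 23:00:00")

# Both endpoints are included, hence the +1.
n_hours = int((last_obs - first_obs) / pd.Timedelta(hours=1)) + 1

# First and last daily timestamps to predict for the same series.
first_pred = pd.Timestamp("2013-03-03 00:00:00")
last_pred = pd.Timestamp("2013-03-09 00:00:00")
n_days = (last_pred - first_pred).days + 1

print(n_hours, n_days)  # 96 7
```

On the real frames, test.groupby('series_id').size() would give the number of cold start rows for every building at once.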

It is also important to note that much of the temperature data is missing in the training, test, and submission data:

In [32]:
consumption.temperature.isnull().mean()

0.4489591186078653

In [31]:
test.temperature.isnull().mean()

0.40109301328761254

In [30]:
submission_format.temperature.isnull().mean()

0.39181830256342143
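
The overall fractions above do not say whether the gaps are spread across all buildings or concentrated in a few. One way to check would be to compute the missing share per series, as in this sketch on a toy frame (made-up values, hypothetical names); nothing here is a result from the real data.

```python
import numpy as np
import pandas as pd

# Toy frame: series 1 has full temperature data, series 2 has none,
# and series 3 is missing half of its readings.
toy = pd.DataFrame({
    "series_id": [1, 1, 2, 2, 3, 3],
    "temperature": [17.0, 18.25, np.nan, np.nan, 18.0, np.nan],
})

# Fraction of missing temperature readings for each series.
missing_share = toy["temperature"].isnull().groupby(toy["series_id"]).mean()

# Series with no temperature readings at all.
fully_missing = missing_share[missing_share == 1.0].index.tolist()
print(missing_share)
print(fully_missing)
```

If most series turn out to be either fully present or fully missing, temperature could be treated as an optional feature per building rather than imputed row by row.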

That's all for the first look at the data. In the next notebook I'll do some deeper analysis.