# Structured and time series data - half hourly dataset

Based on the methodology taken by the third-place result in the Rossmann Kaggle competition, as detailed in Guo and Berkhahn's [Entity Embeddings of Categorical Variables](https://arxiv.org/abs/1604.06737). See the fastai notebook rossman.ipynb.

The motivation behind exploring this architecture is its relevance to real-world applications. Most data used for day-to-day decision making in industry is structured and/or time-series data. Here we explore the end-to-end process of using neural networks on practical structured data problems.

Here we will use a Root Mean Square Error loss function to be consistent with the other forecasting models used for this project.

NB: this notebook was divided into two parts so that the second part could be fine-tuned more iteratively.

In this part we pre-process the data for direct import into 4_6_forecast_NN_hh.


In [1]:
%matplotlib inline
%reload_ext autoreload
%autoreload 2

In [2]:
from fastai.structured import *
from fastai.column_data import *
import feather as ftr
from datetime import timedelta
import random
import heapq

In [3]:
pd.set_option('display.max_columns', None)
pd.options.mode.chained_assignment = None  # default='warn'

In [4]:
# These are the usual ipython objects
ipython_vars = ['In', 'Out', 'exit', 'quit', 'get_ipython', 'ipython_vars']

In [5]:
# Get a sorted list of the objects and their sizes
#sorted([(x, sys.getsizeof(globals().get(x))) for x in dir() if not x.startswith('_') and x not in sys.modules and x not in ipython_vars], key=lambda x: x[1], reverse=True)

In [6]:
PATH='../input/merged_data/'

In [7]:
from IPython.display import HTML, display

Get day counts from this smaller dataset

In [7]:
en = pd.read_csv(f'{PATH}energy_only.csv')

In [8]:
en.head(n=2)

Unnamed: 0,lclid,day,energy_kwh_hh,str_time
0,MAC000002,2012-12-16,0.147,09:30:00
1,MAC000002,2012-12-16,0.11,10:00:00


In [9]:
#get a dict of key (mac) and value (row count)
d = en['lclid'].value_counts().to_dict()

In [10]:
len(en), len(en)/48

(168501120, 3510440.0)

In [11]:
del en

In [22]:
#(arbitrary) minimum number of sample points we want within each household
min_pts = 650*48

In [29]:
def keep_ids(d, min_pts):
    #list of MACs to keep from dataset
    keep_list = []
    for k, v in d.items():
        #want only households with at least min_pts rows of data within this window
        if v >= min_pts:
            keep_list.append(k)
    return keep_list

In [None]:
keep_list=keep_ids(d, min_pts)

In [24]:
len(d), len(keep_list)

(5566, 2812)
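The same filter can also be written directly against the `value_counts()` result, without the intermediate dict. The following is a minimal sketch on made-up data; the toy IDs and the threshold `min_pts_toy` are assumptions for illustration, not values from this dataset.

```python
import pandas as pd

# Toy stand-in for the half-hourly table; IDs and row counts are made up.
toy = pd.DataFrame({'lclid': ['MAC_A'] * 5 + ['MAC_B'] * 2 + ['MAC_C'] * 3})
min_pts_toy = 3  # hypothetical threshold, for illustration only

# Row count per household, then keep the IDs that meet the threshold.
counts = toy['lclid'].value_counts()
keep_list_toy = counts[counts >= min_pts_toy].index.tolist()
print(sorted(keep_list_toy))
```

This should select the same households as `keep_ids`, since both apply a `>=` comparison to the per-household row count.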


39.6GB file takes some time to read

In [12]:
hh = pd.read_csv(f'{PATH}hh_weather_bank_final.csv')

In [13]:
hh.head(n=2)

Unnamed: 0.1,Unnamed: 0,LCLid,energy(kWh/hh),dayYear,dayMonth,dayWeek,dayDay,dayDayofweek,dayDayofyear,dayIs_month_end,dayIs_month_start,dayIs_quarter_end,dayIs_quarter_start,dayIs_year_end,dayIs_year_start,dayElapsed,delta_minutes,visibility,windBearing,temperature,dewPoint,pressure,apparentTemperature,windSpeed,precipType,humidity,summary,day_time,is_bank_holiday,Bank_holiday_Boxing_Day,Bank_holiday_Christmas_Day,Bank_holiday_Early_May_bank_holiday,Bank_holiday_Easter_Monday,Bank_holiday_Good_Friday,Bank_holiday_New_Years_Day,Bank_holiday_New_Years_Day_substitute_day,Bank_holiday_Queens_Diamond_Jubilee_extra_bank_holiday,Bank_holiday_Spring_bank_holiday,Bank_holiday_Spring_bank_holiday_substitute_day,Bank_holiday_Summer_bank_holiday,energy_csum
0,0,MAC000002,,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,0,11.76,234.0,13.61,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy,2012-10-12,False,0,0,0,0,0,0,0,0,0,0,0,
1,1,MAC000003,0.166,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,0,11.76,234.0,13.61,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy,2012-10-12,False,0,0,0,0,0,0,0,0,0,0,0,3284.479


In [16]:
hh['day_time'] = pd.to_datetime(hh['day_time'], format = '%Y-%m-%d')

In [14]:
#1/July/2012 is around the start of the high number of households with data
start_date=datetime.datetime(year=2012,month=7,day=1)
end_date=datetime.datetime(year=2014,month=2,day=1)

In [15]:
delta = end_date - start_date
print (delta.days)

580


In [18]:
# Data subset
hh_df_subset=hh[(hh["day_time"]>=start_date) & (hh["day_time"]<end_date)]

In [19]:
#get a dict of key (mac) and value (row count)
d = hh_df_subset['LCLid'].value_counts().to_dict()

In [20]:
len(d)

5566

In [26]:
#let's find households with data across the entire date range (less 4 as we know there is a 4-step gap in Sept 2013 across the dataset)
min_pts = (580*48)-4

In [27]:
keep_list=keep_ids(d, min_pts)

In [28]:
len(keep_list)

3676

In [29]:
#free some ram
del hh

In [32]:
hh_df_subset.to_csv(f'{PATH}hh_weather_bank_final_20120701-20140201.csv')

Let's set a start and end date and keep only the LCLids that fully cover the specified time bracket.

Originally I had planned to use as much of the dataset as possible, but I have run out of time and need to subset significantly to be able to finish forecasting by the submission date.

Below I create two datasets for deep learning: a 'short fat' dataset of 3676 households with 580 days of data,

and a longer, skinnier dataset of ~700 days and ~600 households.

In [33]:
#keep only households with 580 days data 
hh_df_subset = hh_df_subset[hh_df_subset['LCLid'].isin(keep_list)]

In [34]:
hh_df_subset.to_csv(f'{PATH}hh_weather_bank_final_20120701-20140201.csv')

In [35]:
hh_df_subset.head(n=2)

Unnamed: 0.1,Unnamed: 0,LCLid,energy(kWh/hh),dayYear,dayMonth,dayWeek,dayDay,dayDayofweek,dayDayofyear,dayIs_month_end,dayIs_month_start,dayIs_quarter_end,dayIs_quarter_start,dayIs_year_end,dayIs_year_start,dayElapsed,delta_minutes,visibility,windBearing,temperature,dewPoint,pressure,apparentTemperature,windSpeed,precipType,humidity,summary,day_time,is_bank_holiday,Bank_holiday_Boxing_Day,Bank_holiday_Christmas_Day,Bank_holiday_Early_May_bank_holiday,Bank_holiday_Easter_Monday,Bank_holiday_Good_Friday,Bank_holiday_New_Years_Day,Bank_holiday_New_Years_Day_substitute_day,Bank_holiday_Queens_Diamond_Jubilee_extra_bank_holiday,Bank_holiday_Spring_bank_holiday,Bank_holiday_Spring_bank_holiday_substitute_day,Bank_holiday_Summer_bank_holiday,energy_csum
1,1,MAC000003,0.166,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,0,11.76,234.0,13.61,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy,2012-10-12,False,0,0,0,0,0,0,0,0,0,0,0,3284.479
2,2,MAC000004,0.0,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,0,11.76,234.0,13.61,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy,2012-10-12,False,0,0,0,0,0,0,0,0,0,0,0,248.781


In [None]:
#min(df['some_property'])
#max(df['some_property'])

In [9]:
hh = pd.read_csv(f'{PATH}hh_weather_bank_final_650_days.csv')

In [10]:
hh.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,LCLid,energy(kWh/hh),dayYear,dayMonth,dayWeek,dayDay,dayDayofweek,dayDayofyear,dayIs_month_end,dayIs_month_start,dayIs_quarter_end,dayIs_quarter_start,dayIs_year_end,dayIs_year_start,dayElapsed,delta_minutes,visibility,windBearing,temperature,dewPoint,pressure,apparentTemperature,windSpeed,precipType,humidity,summary,day_time,is_bank_holiday,Bank_holiday_Boxing_Day,Bank_holiday_Christmas_Day,Bank_holiday_Early_May_bank_holiday,Bank_holiday_Easter_Monday,Bank_holiday_Good_Friday,Bank_holiday_New_Years_Day,Bank_holiday_New_Years_Day_substitute_day,Bank_holiday_Queens_Diamond_Jubilee_extra_bank_holiday,Bank_holiday_Spring_bank_holiday,Bank_holiday_Spring_bank_holiday_substitute_day,Bank_holiday_Summer_bank_holiday,energy_csum
0,1,1,MAC000003,0.166,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,0,11.76,234.0,13.61,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy,2012-10-12,False,0,0,0,0,0,0,0,0,0,0,0,3284.479
1,2,2,MAC000004,0.0,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,0,11.76,234.0,13.61,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy,2012-10-12,False,0,0,0,0,0,0,0,0,0,0,0,248.781
2,4,4,MAC000006,0.033,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,0,11.76,234.0,13.61,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy,2012-10-12,False,0,0,0,0,0,0,0,0,0,0,0,703.783
3,11,11,MAC000015,0.089,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,0,11.76,234.0,13.61,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy,2012-10-12,False,0,0,0,0,0,0,0,0,0,0,0,3801.481999
4,13,13,MAC000017,0.851,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,0,11.76,234.0,13.61,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy,2012-10-12,False,0,0,0,0,0,0,0,0,0,0,0,1917.111001


In [36]:
hh_df_subset.drop(columns=['Unnamed: 0','energy_csum'],inplace=True)

In [12]:
#we don't want these columns; need to read the bank holiday data back in and join...

In [37]:
cols = [c for c in hh_df_subset.columns if c.lower()[:4] != 'bank']

In [38]:
hh_df_subset=hh_df_subset[cols]

In [40]:
hh_df_subset.head()

Unnamed: 0,LCLid,energy(kWh/hh),dayYear,dayMonth,dayWeek,dayDay,dayDayofweek,dayDayofyear,dayIs_month_end,dayIs_month_start,dayIs_quarter_end,dayIs_quarter_start,dayIs_year_end,dayIs_year_start,dayElapsed,delta_minutes,visibility,windBearing,temperature,dewPoint,pressure,apparentTemperature,windSpeed,precipType,humidity,summary,day_time,is_bank_holiday
1,MAC000003,0.166,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,0,11.76,234.0,13.61,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy,2012-10-12,False
2,MAC000004,0.0,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,0,11.76,234.0,13.61,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy,2012-10-12,False
3,MAC000005,0.03,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,0,11.76,234.0,13.61,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy,2012-10-12,False
4,MAC000006,0.033,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,0,11.76,234.0,13.61,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy,2012-10-12,False
10,MAC000013,0.075,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,0,11.76,234.0,13.61,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy,2012-10-12,False


In [None]:
display(DataFrameSummary(hh).summary())

## Data Cleaning / Feature Engineering

In [30]:
len(hh)

95209184

In [41]:
bh = pd.read_csv(f'{PATH}uk_bank_holidays.csv')

In [42]:
bh.head(n=2)

Unnamed: 0,Bank holidays,Type
0,2012-12-26,Boxing Day
1,2012-12-25,Christmas Day


`join_df` is a function for joining tables on specific fields. By default, we'll be doing a left outer join of `right` on the `left` argument using the given fields for each table.

Pandas does joins using the `merge` method. The `suffixes` argument describes the naming convention for duplicate fields. We've elected to leave the duplicate field names on the left untouched, and append a "\_y" to those on the right.

In [98]:
def join_df(left, right, left_on, right_on=None, suffix='_y'):
    if right_on is None: right_on = left_on
    return left.merge(right, how='left', left_on=left_on, right_on=right_on, 
                      suffixes=("", suffix))
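To make the suffix behaviour concrete, here is a small sketch that calls `join_df` on two made-up tables sharing a `Type` column. The table contents are assumptions for illustration only; `join_df` is repeated so that the example runs on its own.

```python
import pandas as pd

def join_df(left, right, left_on, right_on=None, suffix='_y'):
    if right_on is None: right_on = left_on
    return left.merge(right, how='left', left_on=left_on, right_on=right_on,
                      suffixes=("", suffix))

# Made-up tables: both have a 'Type' column, so the right one gets the suffix.
readings = pd.DataFrame({
    'day_time': pd.to_datetime(['2012-12-25', '2012-12-27']),
    'Type': ['reading', 'reading'],
})
holidays = pd.DataFrame({
    'day_time': pd.to_datetime(['2012-12-25', '2012-12-26']),
    'Type': ['Christmas Day', 'Boxing Day'],
})

joined = join_df(readings, holidays, 'day_time')
print(joined)
```

Because this is a left join, every row of `readings` is kept, rows with no matching holiday get a missing value in `Type_y`, and the holiday with no matching reading is dropped.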

The following extracts particular date fields from a complete datetime for the purpose of constructing categoricals.

You should *always* consider this feature extraction step when working with date-time. Without expanding your date-time into these additional fields, you can't capture any trend/cyclical behavior as a function of time at any of these granularities. We'll add to every table with a date field.
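As a rough illustration of this kind of expansion, the sketch below derives a few date parts with the pandas `.dt` accessor. The column names mirror those seen in the tables above, but this is not necessarily how those columns were produced; `half_hour` is a hypothetical extra column added here only to show an intra-day feature.

```python
import pandas as pd

toy = pd.DataFrame({
    'day_time': pd.to_datetime(['2012-12-31 23:30:00', '2013-01-01 00:00:00'])
})

dt = toy['day_time'].dt
toy['dayYear'] = dt.year
toy['dayMonth'] = dt.month
toy['dayDayofweek'] = dt.dayofweek           # Monday=0 ... Sunday=6
toy['dayIs_year_start'] = dt.is_year_start
# Hypothetical half-hour slot index within the day, 0..47
toy['half_hour'] = dt.hour * 2 + dt.minute // 30
print(toy)
```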

In [44]:
bh.rename(columns={'Bank holidays': 'day_time'}, inplace=True)

In [46]:
bh['day_time'] = pd.to_datetime(bh['day_time'], format = '%Y-%m-%d')

In [47]:
hh_df_subset = join_df(hh_df_subset, bh, "day_time")

In [48]:
hh_df_subset.head(n=2)

Unnamed: 0,LCLid,energy(kWh/hh),dayYear,dayMonth,dayWeek,dayDay,dayDayofweek,dayDayofyear,dayIs_month_end,dayIs_month_start,dayIs_quarter_end,dayIs_quarter_start,dayIs_year_end,dayIs_year_start,dayElapsed,delta_minutes,visibility,windBearing,temperature,dewPoint,pressure,apparentTemperature,windSpeed,precipType,humidity,summary,day_time,is_bank_holiday,Type
0,MAC000003,0.166,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,0,11.76,234.0,13.61,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy,2012-10-12,False,
1,MAC000004,0.0,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,0,11.76,234.0,13.61,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy,2012-10-12,False,


In [50]:
hh_df_subset.rename(columns={'Type': 'bank_holiday'}, inplace=True)

### Save here

In [51]:
hh_df_subset.to_csv(f'{PATH}hh_weather_bank_prep_20120701-20140201.csv')

In [52]:
#get a dict of key (mac) and value (row count)
d = hh_df_subset['LCLid'].value_counts().to_dict()

In [53]:
len(d)

3676

In [56]:
macs = list(d.keys())

Randomly sample a subset so we can move on from data wrangling and do some forecasting on a reasonably sized dataset...
We can return to here and sample a larger dataset if we have time

In [57]:
keep_list = random.sample(macs, 500)

In [58]:
hh_df_subset = hh_df_subset.loc[hh_df_subset['LCLid'].isin(keep_list)]
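If we want to be able to come back and draw exactly the same households again, one option is to sample from a seeded generator. This is a sketch only: the seed value and the toy ID list are arbitrary assumptions, and the sampling above was not seeded.

```python
import random

# Made-up household IDs standing in for `macs`.
macs_toy = [f'MAC{i:06d}' for i in range(20)]

# A dedicated generator with a fixed (arbitrary) seed gives a repeatable draw.
rng = random.Random(42)
sample_a = rng.sample(macs_toy, 5)

# Re-creating the generator with the same seed reproduces the same sample.
sample_b = random.Random(42).sample(macs_toy, 5)
print(sample_a == sample_b)
```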

### Save 500 LCLid date truncated dataset

In [59]:
hh_df_subset.to_csv(f'{PATH}hh_weather_bank_500_20120701-20140201.csv')

The resultant 500 household dataset is only 2.7GB - far quicker to work with than 40-60GB

### Create a second sub dataset: 606 LCLid 750 day dataset

The code below creates a longer dataset. Re-read the following file first:

In [None]:
hh = pd.read_csv(f'{PATH}hh_weather_bank_final.csv')

In [33]:
#(arbitrary) minimum number of sample points we want within each household
min_pts = 750*48

In [34]:
#get a dict of key (mac) and value (row count)
d = hh['LCLid'].value_counts().to_dict()

In [35]:
#list of MACs to keep from dataset: households with at least min_pts rows
keep_list = keep_ids(d, min_pts)

In [36]:
len(d), len(keep_list)

(2812, 606)

In [38]:
#keep only households with > 750 days data - this helps with RAM and also better training data
hh = hh[hh['LCLid'].isin(keep_list)]

### Save 606 LCLid, 750 day long dataset

In [39]:
hh.to_csv(f'{PATH}hh_weather_bank_final_750_days.csv')

### Read in

In [58]:
hh = pd.read_csv(f'{PATH}hh_weather_bank_final_750_days.csv', low_memory=False)

In [59]:
hh.tail()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,LCLid,energy(kWh/hh),dayYear,dayMonth,dayWeek,dayDay,dayDayofweek,dayDayofyear,dayIs_month_end,dayIs_month_start,dayIs_quarter_end,dayIs_quarter_start,dayIs_year_end,dayIs_year_start,dayElapsed,delta_minutes,visibility,windBearing,temperature,dewPoint,pressure,apparentTemperature,windSpeed,precipType,humidity,summary,day_time,is_bank_holiday,bank_holiday
23030755,95209179,168479803,168479803,MAC000267,0.932,2011,12,49,5,0,339,False,False,False,False,False,False,1323043200,-447870,11.37,232.0,6.045,4.57,1018.405,3.255,3.765,rain,0.905,Clear,2011-12-05 23:30:00,False,
23030756,95209180,168479804,168479804,MAC000268,0.176,2011,12,49,5,0,339,False,False,False,False,False,False,1323043200,-447870,11.37,232.0,6.045,4.57,1018.405,3.255,3.765,rain,0.905,Clear,2011-12-05 23:30:00,False,
23030757,95209181,168479805,168479805,MAC000269,0.024,2011,12,49,5,0,339,False,False,False,False,False,False,1323043200,-447870,11.37,232.0,6.045,4.57,1018.405,3.255,3.765,rain,0.905,Clear,2011-12-05 23:30:00,False,
23030758,95209182,168479806,168479806,MAC000270,0.595,2011,12,49,5,0,339,False,False,False,False,False,False,1323043200,-447870,11.37,232.0,6.045,4.57,1018.405,3.255,3.765,rain,0.905,Clear,2011-12-05 23:30:00,False,
23030759,95209183,168479807,168479807,MAC000271,0.149,2011,12,49,5,0,339,False,False,False,False,False,False,1323043200,-447870,11.37,232.0,6.045,4.57,1018.405,3.255,3.765,rain,0.905,Clear,2011-12-05 23:30:00,False,


In [74]:
hh['day_time'] = pd.to_datetime(hh['day_time'], format='%Y-%m-%d %H:%M:%S')

We now want to ensure that we have a dataset with all households having the same start and end dates and the same number of samples.

In [75]:
def get_stats(group):
    return {'min': group.min(), 'max': group.max(), 'count': group.count()}

In [76]:
gb_dt = hh['day_time'].groupby(hh['LCLid']).apply(get_stats).unstack()

In [77]:
gb_dt.head()

Unnamed: 0_level_0,count,max,min
LCLid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
MAC000006,36524,2014-02-28 23:30:00,2012-01-30 00:00:00
MAC000015,39164,2014-02-28 23:30:00,2011-12-06 00:00:00
MAC000017,39164,2014-02-28 23:30:00,2011-12-06 00:00:00
MAC000018,39116,2014-02-28 23:30:00,2011-12-07 00:00:00
MAC000019,39116,2014-02-28 23:30:00,2011-12-07 00:00:00
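A similar per-household summary can be obtained with a single `agg` call instead of `apply` plus `unstack`. This is a sketch on made-up timestamps, offered as an alternative rather than as the method used above.

```python
import pandas as pd

toy = pd.DataFrame({
    'LCLid': ['A', 'A', 'B'],
    'day_time': pd.to_datetime(['2012-02-05 00:00', '2012-02-05 00:30',
                                '2012-02-06 00:00']),
})

# One row per household, one column per statistic.
stats = toy.groupby('LCLid')['day_time'].agg(['min', 'max', 'count'])
print(stats)
```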


Let's tweak the range and numbers.

In [78]:
gb_dt = gb_dt.reset_index()

In [79]:
min_series = gb_dt['min']
max_series = gb_dt['max']

In [80]:
min_series.min(), min_series.max()

(Timestamp('2011-11-23 00:00:00'), Timestamp('2012-02-09 00:00:00'))

In [81]:
max_series.min(), max_series.max()

(Timestamp('2014-01-27 23:30:00'), Timestamp('2014-02-28 23:30:00'))

In [82]:
un_max = set(max_series.tolist())
un_min = set(min_series.tolist())

In [83]:
nlesser_items = heapq.nsmallest(6, un_max)
ngreater_items = heapq.nlargest(6, un_min)

In [84]:
nlesser_items

[Timestamp('2014-01-27 23:30:00'),
 Timestamp('2014-02-01 23:30:00'),
 Timestamp('2014-02-04 23:30:00'),
 Timestamp('2014-02-10 23:30:00'),
 Timestamp('2014-02-12 23:30:00'),
 Timestamp('2014-02-17 23:30:00')]

In [85]:
ngreater_items

[Timestamp('2012-02-09 00:00:00'),
 Timestamp('2012-02-08 00:00:00'),
 Timestamp('2012-02-07 00:00:00'),
 Timestamp('2012-02-06 00:00:00'),
 Timestamp('2012-02-04 00:00:00'),
 Timestamp('2012-02-03 00:00:00')]

Let's start on a Sunday (05/Feb/2012) and end on a Sunday (09/Feb/2014). The end date is treated as exclusive, so the window covers a whole number of weeks.

In [86]:
start_date=datetime.datetime(year=2012,month=2,day=5)
end_date=datetime.datetime(year=2014,month=2,day=9)

In [87]:
delta = end_date - start_date
print (delta.days)

735
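As a quick sanity check on these dates, the sketch below confirms the weekdays and works out how many half-hourly samples a fully covered household should have. It uses only the standard library; treating `end` as an exclusive bound is the assumption here.

```python
from datetime import datetime

start = datetime(2012, 2, 5)
end = datetime(2014, 2, 9)   # treated as an exclusive upper bound

n_days = (end - start).days
samples_per_household = n_days * 48   # 48 half-hour readings per day

# weekday(): Monday=0 ... Sunday=6
print(start.weekday(), end.weekday(), n_days, samples_per_household)
```

A household with complete coverage would therefore have 35280 rows before allowing for any known gaps in the data.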


In [88]:
# Data subset
df=hh[(hh["day_time"]>=start_date) & (hh["day_time"]<end_date)]

In [89]:
#get a dict of key (mac) and value (row count)
d = df['LCLid'].value_counts().to_dict()

In [90]:
#let's find households with data across the entire date range (less 4 as we know there is a 4-step gap in Sept 2013 across the dataset)
min_pts = (735*48)-4

In [91]:
keep_list=keep_ids(d, min_pts)

In [92]:
#keep only households with 735 days data - this will be one of the two forecasting datasets we will use
df = df[df['LCLid'].isin(keep_list)]

Number of households in this second sub-dataset

In [93]:
len(keep_list)

544

In [94]:
df.head(n=2)

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,LCLid,energy(kWh/hh),dayYear,dayMonth,dayWeek,dayDay,dayDayofweek,dayDayofyear,dayIs_month_end,dayIs_month_start,dayIs_quarter_end,dayIs_quarter_start,dayIs_year_end,dayIs_year_start,dayElapsed,delta_minutes,visibility,windBearing,temperature,dewPoint,pressure,apparentTemperature,windSpeed,precipType,humidity,summary,day_time,is_bank_holiday,bank_holiday
0,2,4,4,MAC000006,0.033,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,0,11.76,234.0,13.61,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy,2012-10-12,False,
1,3,11,11,MAC000015,0.089,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,0,11.76,234.0,13.61,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy,2012-10-12,False,


In [95]:
len(d)

606

In [100]:
df.drop(columns=['Unnamed: 0','Unnamed: 0.1','Unnamed: 0.1.1'], inplace=True)

Create a column with only date part so we can merge with daily data

In [101]:
df['day']= pd.DatetimeIndex(df['day_time']).normalize()

In [102]:
df.tail()

Unnamed: 0,LCLid,energy(kWh/hh),dayYear,dayMonth,dayWeek,dayDay,dayDayofweek,dayDayofyear,dayIs_month_end,dayIs_month_start,dayIs_quarter_end,dayIs_quarter_start,dayIs_year_end,dayIs_year_start,dayElapsed,delta_minutes,visibility,windBearing,temperature,dewPoint,pressure,apparentTemperature,windSpeed,precipType,humidity,summary,day_time,is_bank_holiday,bank_holiday,day
22096681,MAC005218,0.026,2012,2,7,19,6,50,False,False,False,False,False,False,1329609600,-338430,13.045,231.0,6.9,3.53,1020.405,6.9,2.575,snow,0.79,Clear,2012-02-19 23:30:00,False,,2012-02-19
22096682,MAC005219,1.036,2012,2,7,19,6,50,False,False,False,False,False,False,1329609600,-338430,13.045,231.0,6.9,3.53,1020.405,6.9,2.575,snow,0.79,Clear,2012-02-19 23:30:00,False,,2012-02-19
22096683,MAC005220,0.659,2012,2,7,19,6,50,False,False,False,False,False,False,1329609600,-338430,13.045,231.0,6.9,3.53,1020.405,6.9,2.575,snow,0.79,Clear,2012-02-19 23:30:00,False,,2012-02-19
22096684,MAC005221,0.023,2012,2,7,19,6,50,False,False,False,False,False,False,1329609600,-338430,13.045,231.0,6.9,3.53,1020.405,6.9,2.575,snow,0.79,Clear,2012-02-19 23:30:00,False,,2012-02-19
22096727,MAC005555,0.088,2012,2,7,19,6,50,False,False,False,False,False,False,1329609600,-338430,13.045,231.0,6.9,3.53,1020.405,6.9,2.575,snow,0.79,Clear,2012-02-19 23:30:00,False,,2012-02-19


### Save before we merge 

In [108]:
df.to_csv(f'{PATH}hh_prep_544_ids_735_days.csv')

Bring in daylight times

In [109]:
daily = pd.read_feather(f'{PATH}df_daily_cat.feather')

In [110]:
daily.head(n=2)

Unnamed: 0,day,LCLid,energy_median,energy_mean,energy_max,energy_count,energy_std,energy_sum,energy_min,stdorToU,Acorn,Acorn_grouped,temperatureMax,temperatureMaxTime,windBearing,icon,dewPoint,temperatureMinTime,cloudCover,windSpeed,pressure,apparentTemperatureMinTime,apparentTemperatureHigh,precipType,visibility,humidity,apparentTemperatureHighTime,apparentTemperatureLow,apparentTemperatureMax,uvIndex,sunsetTime,temperatureLow,temperatureMin,temperatureHigh,sunriseTime,temperatureHighTime,uvIndexTime,summary,temperatureLowTime,apparentTemperatureMin,apparentTemperatureMaxTime,apparentTemperatureLowTime,moonPhase,dayYear,dayMonth,dayWeek,dayDay,dayDayofweek,dayDayofyear,dayIs_month_end,dayIs_month_start,dayIs_quarter_end,dayIs_quarter_start,dayIs_year_end,dayIs_year_start,dayElapsed,daylightMins,AfterbankHoliday,BeforebankHoliday,bankHoliday_bw,bankHoliday_fw
0,2014-02-27,MAC000002,0.218,0.427458,1.35,48,0.406681,20.518,0.08,Std,ACORN-A,Affluent,10.31,2014-02-27 14:00:00,224,partly-cloudy-day,3.08,2014-02-27 23:00:00,0.32,4.14,1007.02,2014-02-27 22:00:00,10.31,rain,12.04,0.74,2014-02-27 14:00:00,0.82,10.31,2.0,2014-02-27 17:37:35,3.43,3.93,10.31,2014-02-27 06:51:45,2014-02-27 14:00:00,2014-02-27 12:00:00,Partly cloudy until evening.,2014-02-28 02:00:00,1.41,2014-02-27 14:00:00,2014-02-28 02:00:00,0.93,2014,2,9,27,3,58,False,False,False,False,False,False,1393459200,645.0,0,0,0.0,0.0
1,2014-02-26,MAC000002,0.1515,0.256833,1.028,48,0.196095,12.328,0.076,Std,ACORN-A,Affluent,11.29,2014-02-26 13:00:00,227,partly-cloudy-day,2.74,2014-02-26 07:00:00,0.26,3.82,1012.73,2014-02-26 07:00:00,11.29,rain,13.0,0.73,2014-02-26 13:00:00,3.03,11.29,2.0,2014-02-26 17:35:49,6.01,4.17,11.29,2014-02-26 06:53:52,2014-02-26 13:00:00,2014-02-26 12:00:00,Partly cloudy throughout the day.,2014-02-27 00:00:00,1.67,2014-02-26 13:00:00,2014-02-27 00:00:00,0.9,2014,2,9,26,2,57,False,False,False,False,False,False,1393372800,641.0,0,0,0.0,0.0


Potentially useful fields; let's add them to our hh df:

'cloudCover','uvIndex','moonPhase','sunriseTime','sunsetTime'

In [111]:
daily = daily[daily['LCLid'].isin(keep_list)]

In [112]:
daily = daily[['day', 'LCLid', 'cloudCover','uvIndex','moonPhase','sunriseTime','sunsetTime']]

In [113]:
daily.head(n=2)

Unnamed: 0,day,LCLid,cloudCover,uvIndex,moonPhase,sunriseTime,sunsetTime
2521,2014-02-27,MAC000006,0.32,2.0,0.93,2014-02-27 06:51:45,2014-02-27 17:37:35
2522,2014-02-26,MAC000006,0.26,2.0,0.9,2014-02-26 06:53:52,2014-02-26 17:35:49


Merge the dataframes

In [114]:
df = pd.merge(df, daily, on=['day', 'LCLid'], how='left')
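The sketch below reproduces this many-to-one merge on made-up data, including the step of deriving the `day` key from `day_time`. Passing `validate='m:1'` is an addition not used above: it asks pandas to raise if the daily table has duplicate `(day, LCLid)` keys, which would otherwise silently multiply the half-hourly rows.

```python
import pandas as pd

# Made-up half-hourly rows for one household.
hh_toy = pd.DataFrame({
    'LCLid': ['A', 'A', 'A'],
    'day_time': pd.to_datetime(['2012-02-05 00:00', '2012-02-05 00:30',
                                '2012-02-06 00:00']),
})
hh_toy['day'] = hh_toy['day_time'].dt.normalize()   # strip the time of day

# Made-up daily row; note there is no entry for 2012-02-06.
daily_toy = pd.DataFrame({
    'LCLid': ['A'],
    'day': pd.to_datetime(['2012-02-05']),
    'cloudCover': [0.85],
})

merged = pd.merge(hh_toy, daily_toy, on=['day', 'LCLid'], how='left',
                  validate='m:1')
print(merged)
```

Each half-hourly row picks up the daily value for its date, and rows with no matching daily record are kept with a missing `cloudCover`.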

In [115]:
df.tail()

Unnamed: 0,LCLid,energy(kWh/hh),dayYear,dayMonth,dayWeek,dayDay,dayDayofweek,dayDayofyear,dayIs_month_end,dayIs_month_start,dayIs_quarter_end,dayIs_quarter_start,dayIs_year_end,dayIs_year_start,dayElapsed,delta_minutes,visibility,windBearing,temperature,dewPoint,pressure,apparentTemperature,windSpeed,precipType,humidity,summary,day_time,is_bank_holiday,bank_holiday,day,cloudCover,uvIndex,moonPhase,sunriseTime,sunsetTime
19190139,MAC005218,0.026,2012,2,7,19,6,50,False,False,False,False,False,False,1329609600,-338430,13.045,231.0,6.9,3.53,1020.405,6.9,2.575,snow,0.79,Clear,2012-02-19 23:30:00,False,,2012-02-19,0.17,1.0,0.92,2012-02-19 07:09:14,2012-02-19 17:22:21
19190140,MAC005219,1.036,2012,2,7,19,6,50,False,False,False,False,False,False,1329609600,-338430,13.045,231.0,6.9,3.53,1020.405,6.9,2.575,snow,0.79,Clear,2012-02-19 23:30:00,False,,2012-02-19,0.17,1.0,0.92,2012-02-19 07:09:14,2012-02-19 17:22:21
19190141,MAC005220,0.659,2012,2,7,19,6,50,False,False,False,False,False,False,1329609600,-338430,13.045,231.0,6.9,3.53,1020.405,6.9,2.575,snow,0.79,Clear,2012-02-19 23:30:00,False,,2012-02-19,0.17,1.0,0.92,2012-02-19 07:09:14,2012-02-19 17:22:21
19190142,MAC005221,0.023,2012,2,7,19,6,50,False,False,False,False,False,False,1329609600,-338430,13.045,231.0,6.9,3.53,1020.405,6.9,2.575,snow,0.79,Clear,2012-02-19 23:30:00,False,,2012-02-19,0.17,1.0,0.92,2012-02-19 07:09:14,2012-02-19 17:22:21
19190143,MAC005555,0.088,2012,2,7,19,6,50,False,False,False,False,False,False,1329609600,-338430,13.045,231.0,6.9,3.53,1020.405,6.9,2.575,snow,0.79,Clear,2012-02-19 23:30:00,False,,2012-02-19,0.17,1.0,0.92,2012-02-19 07:09:14,2012-02-19 17:22:21


In [116]:
df.dtypes

LCLid                          object
energy(kWh/hh)                float64
dayYear                         int64
dayMonth                        int64
dayWeek                         int64
dayDay                          int64
dayDayofweek                    int64
dayDayofyear                    int64
dayIs_month_end                  bool
dayIs_month_start                bool
dayIs_quarter_end                bool
dayIs_quarter_start              bool
dayIs_year_end                   bool
dayIs_year_start                 bool
dayElapsed                      int64
delta_minutes                   int64
visibility                    float64
windBearing                   float64
temperature                   float64
dewPoint                      float64
pressure                      float64
apparentTemperature           float64
windSpeed                     float64
precipType                     object
humidity                      float64
summary                        object
day_time    

In [117]:
#re-save
df.to_csv(f'{PATH}hh_prep_544_ids_735_days.csv')

In [9]:
df = pd.read_csv(f'{PATH}hh_prep_544_ids_735_days.csv')


In [16]:
#day_time comes back from the CSV as a string, so parse it before the subtractions that follow
df['day_time'] = pd.to_datetime(df['day_time'], format='%Y-%m-%d %H:%M:%S')
df['sunriseTime'] = pd.to_datetime(df['sunriseTime'], format='%Y-%m-%d %H:%M:%S')
df['sunsetTime'] = pd.to_datetime(df['sunsetTime'], format='%Y-%m-%d %H:%M:%S')

In [17]:
df.head()

Unnamed: 0.1,Unnamed: 0,LCLid,energy(kWh/hh),dayYear,dayMonth,dayWeek,dayDay,dayDayofweek,dayDayofyear,dayIs_month_end,dayIs_month_start,dayIs_quarter_end,dayIs_quarter_start,dayIs_year_end,dayIs_year_start,dayElapsed,delta_minutes,visibility,windBearing,temperature,dewPoint,pressure,apparentTemperature,windSpeed,precipType,humidity,summary,day_time,is_bank_holiday,bank_holiday,day,cloudCover,uvIndex,moonPhase,sunriseTime,sunsetTime
18798464,18798464,MAC000006,0.042,2012,2,5,5,6,36,False,False,False,False,False,False,1328400000,-360000,1.32,160.0,-0.12,-0.22,1024.21,-4.68,4.35,snow,0.99,Foggy,2012-02-05 00:00:00,False,,2012-02-05,0.85,1.0,0.42,2012-02-05 07:34:45,2012-02-05 16:56:39
18799008,18799008,MAC000006,0.015,2012,2,5,5,6,36,False,False,False,False,False,False,1328400000,-359970,1.37,160.0,0.05,-0.2,1023.655,-4.28,4.085,snow,0.98,Foggy,2012-02-05 00:30:00,False,,2012-02-05,0.85,1.0,0.42,2012-02-05 07:34:45,2012-02-05 16:56:39
18799552,18799552,MAC000006,0.029,2012,2,5,5,6,36,False,False,False,False,False,False,1328400000,-359940,1.42,160.0,0.22,-0.18,1023.1,-3.88,3.82,snow,0.97,Foggy,2012-02-05 01:00:00,False,,2012-02-05,0.85,1.0,0.42,2012-02-05 07:34:45,2012-02-05 16:56:39
18800096,18800096,MAC000006,0.036,2012,2,5,5,6,36,False,False,False,False,False,False,1328400000,-359910,1.7,155.0,0.24,0.0,1022.965,-3.575,3.46,snow,0.98,Foggy,2012-02-05 01:30:00,False,,2012-02-05,0.85,1.0,0.42,2012-02-05 07:34:45,2012-02-05 16:56:39
18800640,18800640,MAC000006,0.015,2012,2,5,5,6,36,False,False,False,False,False,False,1328400000,-359880,1.98,150.0,0.26,0.18,1022.83,-3.27,3.1,snow,0.99,Foggy,2012-02-05 02:00:00,False,,2012-02-05,0.85,1.0,0.42,2012-02-05 07:34:45,2012-02-05 16:56:39


In [18]:
#create columns tracking the time relative to sunrise and sunset
df['from_sunrise']=df['sunriseTime']-df['day_time']
df['to_sunset']=df['day_time'] - df['sunsetTime']

In [23]:
df['from_sunrise'] = df['from_sunrise'].astype('timedelta64[m]')
df['to_sunset'] = df['to_sunset'].astype('timedelta64[m]')
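The behaviour of casting a timedelta column with `astype('timedelta64[m]')` has varied across pandas versions, so here is a sketch of an alternative that goes through `total_seconds()` instead. The floor division is an assumption chosen to match the whole-minute values shown in the outputs of this notebook; the toy timestamps reuse the sunrise and sunset times for 2012-02-05 seen above.

```python
import pandas as pd

toy = pd.DataFrame({
    'day_time': pd.to_datetime(['2012-02-05 00:30:00', '2012-02-05 12:00:00']),
    'sunriseTime': pd.to_datetime(['2012-02-05 07:34:45'] * 2),
    'sunsetTime': pd.to_datetime(['2012-02-05 16:56:39'] * 2),
})

# Whole minutes, rounded down (floor), as floats.
toy['from_sunrise'] = (toy['sunriseTime'] - toy['day_time']).dt.total_seconds() // 60
toy['to_sunset'] = (toy['day_time'] - toy['sunsetTime']).dt.total_seconds() // 60
print(toy[['day_time', 'from_sunrise', 'to_sunset']])
```

With this sign convention, `from_sunrise` is positive before sunrise and negative after it, while `to_sunset` is negative until sunset has passed.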

In [25]:
df.dtypes

Unnamed: 0                      int64
LCLid                          object
energy(kWh/hh)                float64
dayYear                         int64
dayMonth                        int64
dayWeek                         int64
dayDay                          int64
dayDayofweek                    int64
dayDayofyear                    int64
dayIs_month_end                  bool
dayIs_month_start                bool
dayIs_quarter_end                bool
dayIs_quarter_start              bool
dayIs_year_end                   bool
dayIs_year_start                 bool
dayElapsed                      int64
delta_minutes                   int64
visibility                    float64
windBearing                   float64
temperature                   float64
dewPoint                      float64
pressure                      float64
apparentTemperature           float64
windSpeed                     float64
precipType                     object
humidity                      float64
summary     

In [26]:
df.drop(columns=['Unnamed: 0','sunriseTime','sunsetTime'],inplace=True)

## Durations

It is common when working with time series data to extract data that explains relationships across rows as opposed to columns, e.g.:
* Running averages
* Time until next event
* Time since last event

This is often difficult to do with most table manipulation frameworks, since they are designed to work with relationships across columns. As such, we've written a helper to handle this type of data.

We'll define a function `get_elapsed` for cumulative counting across a sorted dataframe. Given a particular field `fld` to monitor, this function tracks the time since the last occurrence of that field. When the field is seen again, the counter is reset to zero.

Upon initialization, this results in missing datetime values (NaT) until the field is first encountered. The counter is also reset every time a new household (`LCLid`) is seen. We'll see how to use this shortly.

In [27]:
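def get_elapsed(fld, pre):
    #operates on the global df, which must already be sorted by LCLid then day_time
    #the gap is reduced to whole days ('D') and then expressed in minutes ('m')
    one_minute = np.timedelta64(1, 'm')
    last_date = np.datetime64('NaT')
    last_hh = 0
    res = []

    for s,v,d in zip(df.LCLid.values, df[fld].values, df.day_time.values):
        #new household: forget the last event seen
        if s != last_hh:
            last_date = np.datetime64('NaT')
            last_hh = s
        #event row: restart the counter from this timestamp
        if v: last_date = d
        res.append(((d-last_date).astype('timedelta64[D]') / one_minute))
    df[pre+fld] = res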

We'll be applying this to a subset of columns:

In [28]:
columns = ["day_time", "LCLid", "is_bank_holiday"]

Let's walk through an example.

Say we're looking at bank holidays. We'll first sort by LCLid, then by day_time, and then call `get_elapsed('is_bank_holiday', 'After')`.
This will:
* Be applied to every row of the dataframe in order of LCLid and day_time
* Add to the dataframe the time elapsed since seeing a bank holiday (whole days, expressed in minutes)

If we sort in the other direction, this will count the time until the next holiday.
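To make the idea concrete, here is a vectorised sketch of "time since the last event, per household" on a tiny made-up table. It is not the `get_elapsed` function used in this notebook: the helper name `elapsed_since_event` is hypothetical, and it returns plain days rather than whole days expressed in minutes.

```python
import numpy as np
import pandas as pd

def elapsed_since_event(frame, group_col, time_col, flag_col):
    """Days since the most recent row where flag_col is True, per group.

    Rows before the first event in a group get NaN.
    """
    frame = frame.sort_values([group_col, time_col])
    # Keep the timestamp only on event rows, then carry it forward per group.
    event_time = frame[time_col].where(frame[flag_col])
    last_event = event_time.groupby(frame[group_col]).ffill()
    return (frame[time_col] - last_event) / np.timedelta64(1, 'D')

toy = pd.DataFrame({
    'LCLid': ['A'] * 4 + ['B'] * 3,
    'day_time': pd.to_datetime(['2012-12-24', '2012-12-25', '2012-12-26',
                                '2012-12-27', '2012-12-25', '2012-12-26',
                                '2012-12-27']),
    'is_bank_holiday': [False, True, True, False, False, False, True],
})
toy['After'] = elapsed_since_event(toy, 'LCLid', 'day_time', 'is_bank_holiday')
print(toy)
```

Household A has no value until its first holiday, then counts up from zero; household B starts again with no value, which illustrates the reset at each new `LCLid`.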

In [29]:
df.dtypes

LCLid                          object
energy(kWh/hh)                float64
dayYear                         int64
dayMonth                        int64
dayWeek                         int64
dayDay                          int64
dayDayofweek                    int64
dayDayofyear                    int64
dayIs_month_end                  bool
dayIs_month_start                bool
dayIs_quarter_end                bool
dayIs_quarter_start              bool
dayIs_year_end                   bool
dayIs_year_start                 bool
dayElapsed                      int64
delta_minutes                   int64
visibility                    float64
windBearing                   float64
temperature                   float64
dewPoint                      float64
pressure                      float64
apparentTemperature           float64
windSpeed                     float64
precipType                     object
humidity                      float64
summary                        object
day_time    

In [30]:
fld = 'is_bank_holiday'
df = df.sort_values(['LCLid', 'day_time'])
get_elapsed(fld, 'After')
df = df.sort_values(['LCLid', 'day_time'], ascending=[True, False])
get_elapsed(fld, 'Before')

In [31]:
df.tail(n=2)

Unnamed: 0,LCLid,energy(kWh/hh),dayYear,dayMonth,dayWeek,dayDay,dayDayofweek,dayDayofyear,dayIs_month_end,dayIs_month_start,dayIs_quarter_end,dayIs_quarter_start,dayIs_year_end,dayIs_year_start,dayElapsed,delta_minutes,visibility,windBearing,temperature,dewPoint,pressure,apparentTemperature,windSpeed,precipType,humidity,summary,day_time,is_bank_holiday,bank_holiday,day,cloudCover,uvIndex,moonPhase,from_sunrise,to_sunset,Afteris_bank_holiday,Beforeis_bank_holiday
18799551,MAC005555,0.065,2012,2,5,5,6,36,False,False,False,False,False,False,1328400000,-359970,1.37,160.0,0.05,-0.2,1023.655,-4.28,4.085,snow,0.98,Foggy,2012-02-05 00:30:00,False,,2012-02-05,0.85,1.0,0.42,424.0,-987.0,,-87840.0
18799007,MAC005555,0.095,2012,2,5,5,6,36,False,False,False,False,False,False,1328400000,-360000,1.32,160.0,-0.12,-0.22,1024.21,-4.68,4.35,snow,0.99,Foggy,2012-02-05 00:00:00,False,,2012-02-05,0.85,1.0,0.42,454.0,-1017.0,,-87840.0


We're going to set the active index to `day_time`.

In [32]:
df = df.set_index("day_time")

Then set null values from elapsed field calculations to 0.

In [33]:
columns = ['is_bank_holiday']

In [34]:
for o in ['Before', 'After']:
    for p in columns:
        a = o+p
        df[a] = df[a].fillna(0).astype(int)

In [35]:
df.head(n=2)

Unnamed: 0_level_0,LCLid,energy(kWh/hh),dayYear,dayMonth,dayWeek,dayDay,dayDayofweek,dayDayofyear,dayIs_month_end,dayIs_month_start,dayIs_quarter_end,dayIs_quarter_start,dayIs_year_end,dayIs_year_start,dayElapsed,delta_minutes,visibility,windBearing,temperature,dewPoint,pressure,apparentTemperature,windSpeed,precipType,humidity,summary,is_bank_holiday,bank_holiday,day,cloudCover,uvIndex,moonPhase,from_sunrise,to_sunset,Afteris_bank_holiday,Beforeis_bank_holiday
day_time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1
2014-02-08 23:30:00,MAC000006,0.077,2014,2,6,8,5,39,False,False,False,False,False,False,1391817600,698370,10.6,202.5,8.39,4.175,979.175,5.07,6.85,rain,0.76,Breezy,False,,2014-02-08,0.47,1.0,0.3,-962.0,386.0,54720,0
2014-02-08 23:00:00,MAC000006,0.055,2014,2,6,8,5,39,False,False,False,False,False,False,1391817600,698340,11.27,221.0,7.49,0.66,977.87,2.93,9.85,rain,0.62,Breezy,False,,2014-02-08,0.47,1.0,0.3,-932.0,356.0,53280,0


### Save

In [36]:
df.to_csv(f'{PATH}hh_bank_544_ids_735_days.csv')

In [70]:
df = pd.read_csv(f'{PATH}hh_bank_544_ids_735_days.csv')


In [71]:
df.head(n=2)

Unnamed: 0,day_time,LCLid,energy(kWh/hh),dayYear,dayMonth,dayWeek,dayDay,dayDayofweek,dayDayofyear,dayIs_month_end,dayIs_month_start,dayIs_quarter_end,dayIs_quarter_start,dayIs_year_end,dayIs_year_start,dayElapsed,delta_minutes,visibility,windBearing,temperature,dewPoint,pressure,apparentTemperature,windSpeed,precipType,humidity,summary,is_bank_holiday,bank_holiday,day,cloudCover,uvIndex,moonPhase,from_sunrise,to_sunset,Afteris_bank_holiday,Beforeis_bank_holiday
0,2014-02-08 23:30:00,MAC000006,0.077,2014,2,6,8,5,39,False,False,False,False,False,False,1391817600,698370,10.6,202.5,8.39,4.175,979.175,5.07,6.85,rain,0.76,Breezy,False,,2014-02-08,0.47,1.0,0.3,-962.0,386.0,54720,0
1,2014-02-08 23:00:00,MAC000006,0.055,2014,2,6,8,5,39,False,False,False,False,False,False,1391817600,698340,11.27,221.0,7.49,0.66,977.87,2.93,9.85,rain,0.62,Breezy,False,,2014-02-08,0.47,1.0,0.3,-932.0,356.0,53280,0


Next we'll demonstrate window functions in pandas to calculate rolling quantities.

Here we're sorting by date (`sort_index()`) and counting the number of events of interest (`sum()`) defined in `columns` over the preceding week (`rolling(336)`, i.e. 7 days of 48 half-hourly readings), grouped by household `LCLid` (`groupby()`). We then do the same in the opposite direction to count events in the following week.

In [37]:
bwd = df[['LCLid']+columns].sort_index().groupby("LCLid").rolling(336, min_periods=1).sum()

In [38]:
fwd = df[['LCLid']+columns].sort_index(ascending=False
                                      ).groupby("LCLid").rolling(336, min_periods=1).sum()

In [39]:
fwd.head(n=2)

Unnamed: 0_level_0,Unnamed: 1_level_0,is_bank_holiday
LCLid,day_time,Unnamed: 2_level_1
MAC000006,2014-02-08 23:30:00,0.0
MAC000006,2014-02-08 23:00:00,0.0
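
To see what the two windows count, here is a minimal sketch on a single toy household. The frame `toy`, the names `bwd_toy` and `fwd_toy`, and the window of 3 rows are assumptions chosen to keep the example small; the notebook itself uses 336 rows.

```python
import pandas as pd

idx = pd.date_range("2014-01-01", periods=6, freq="30min")
toy = pd.DataFrame({
    "LCLid": ["A"] * 6,
    "is_bank_holiday": [1, 1, 0, 0, 0, 1],
}, index=idx)

window = 3  # three half-hourly rows here; the notebook uses 336 (7 days x 48)

# Backward-looking: each row counts events in itself and the previous window-1 rows.
bwd_toy = (toy.sort_index()
              .groupby("LCLid")["is_bank_holiday"]
              .rolling(window, min_periods=1).sum())

# Forward-looking: reverse the time order first, so the "previous" rows
# seen by rolling() are in fact later timestamps.
fwd_toy = (toy.sort_index(ascending=False)
              .groupby("LCLid")["is_bank_holiday"]
              .rolling(window, min_periods=1).sum())

print(bwd_toy)
print(fwd_toy)
```

With `min_periods=1` the first rows of each household still get a value, computed over however many rows are available so far.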


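Next we want to turn the `LCLid` and `day_time` index levels created by the grouped window function back into ordinary columns, so that we can merge on them.

Many pandas methods offer an `inplace=True` option for this kind of change. It saves rebinding the variable, although it does not necessarily save time or memory.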

In [40]:
#bwd.drop('LCLid',1,inplace=True)
bwd.reset_index(inplace=True)

In [41]:
#fwd.drop('LCLid',1,inplace=True)
fwd.reset_index(inplace=True)

In [42]:
fwd.head(n=2)

Unnamed: 0,LCLid,day_time,is_bank_holiday
0,MAC000006,2014-02-08 23:30:00,0.0
1,MAC000006,2014-02-08 23:00:00,0.0


In [81]:
df.reset_index(inplace=True)

Now we'll merge these values onto the df.

In [44]:
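# Left-join the backward counts; the clashing column from bwd becomes is_bank_holiday_bw
df = df.merge(bwd, 'left', ['day_time', 'LCLid'], suffixes=['', '_bw'])
# NB the forward counts are merged into a separate frame `hh`, so `df` itself only gains the _bw column
hh = df.merge(fwd, 'left', ['day_time', 'LCLid'], suffixes=['', '_fw'])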
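
How the `suffixes` argument resolves the clashing `is_bank_holiday` column can be seen in a small sketch. The frames `left` and `rolled` are hypothetical stand-ins for `df` and `bwd`.

```python
import pandas as pd

left = pd.DataFrame({
    "day_time": pd.to_datetime(["2014-01-01 00:00", "2014-01-01 00:30"]),
    "LCLid": ["A", "A"],
    "is_bank_holiday": [False, True],
})
rolled = pd.DataFrame({
    "day_time": pd.to_datetime(["2014-01-01 00:00", "2014-01-01 00:30"]),
    "LCLid": ["A", "A"],
    "is_bank_holiday": [0.0, 1.0],
})

# Left columns keep their names (suffix ''); clashing right columns get '_bw'.
merged = left.merge(rolled, how="left", on=["day_time", "LCLid"], suffixes=("", "_bw"))
print(merged.columns.tolist())
```

Because the original flag keeps its name, it can be dropped afterwards without touching the rolled count.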

In [45]:
df.drop(columns=columns, inplace=True)

In [46]:
df.head(n=2)

Unnamed: 0,day_time,LCLid,energy(kWh/hh),dayYear,dayMonth,dayWeek,dayDay,dayDayofweek,dayDayofyear,dayIs_month_end,dayIs_month_start,dayIs_quarter_end,dayIs_quarter_start,dayIs_year_end,dayIs_year_start,dayElapsed,delta_minutes,visibility,windBearing,temperature,dewPoint,pressure,apparentTemperature,windSpeed,precipType,humidity,summary,bank_holiday,day,cloudCover,uvIndex,moonPhase,from_sunrise,to_sunset,Afteris_bank_holiday,Beforeis_bank_holiday,is_bank_holiday_bw
0,2014-02-08 23:30:00,MAC000006,0.077,2014,2,6,8,5,39,False,False,False,False,False,False,1391817600,698370,10.6,202.5,8.39,4.175,979.175,5.07,6.85,rain,0.76,Breezy,,2014-02-08,0.47,1.0,0.3,-962.0,386.0,54720,0,0.0
1,2014-02-08 23:00:00,MAC000006,0.055,2014,2,6,8,5,39,False,False,False,False,False,False,1391817600,698340,11.27,221.0,7.49,0.66,977.87,2.93,9.85,rain,0.62,Breezy,,2014-02-08,0.47,1.0,0.3,-932.0,356.0,53280,0,0.0


In [49]:
# Check dates
start_date=datetime.datetime(year=2012,month=2,day=5)
end_date=datetime.datetime(year=2014,month=2,day=9)

In [76]:
df = df.set_index("day_time")

In [None]:
df.head()

In [51]:
# Note date_range is inclusive of the end date
ref_date_range = pd.date_range('2012-2-5 00:00:00', '2014-2-8 23:30:00', freq='30Min')

ref_df = pd.DataFrame(np.random.randint(1, 20, (ref_date_range.shape[0], 1)))
ref_df.index = ref_date_range  

# check for missing datetimeindex values based on reference index (with all values)
missing_dates = ref_df.index[~ref_df.index.isin(df.index)]

In [52]:
missing_dates

DatetimeIndex(['2013-09-09 23:00:00', '2013-09-09 23:30:00',
               '2013-09-10 00:00:00', '2013-09-10 00:30:00'],
              dtype='datetime64[ns]', freq='30T')
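
An equivalent way to list the gaps, shown here as a sketch on a short hypothetical range rather than the notebook's data, is `DatetimeIndex.difference`, which avoids building a dummy reference dataframe. The names `expected`, `observed` and `missing` are illustrative.

```python
import pandas as pd

expected = pd.date_range("2013-09-09 22:00", "2013-09-10 01:00", freq="30min")

# Simulate an observed index with a gap from 23:00 to 00:30 inclusive.
gap_start = pd.Timestamp("2013-09-09 23:00")
gap_end = pd.Timestamp("2013-09-10 00:30")
observed = expected[(expected < gap_start) | (expected > gap_end)]

# Timestamps that should be present but are not.
missing = expected.difference(observed)
print(missing)
```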

In [53]:
#from start to end non inclusive of end
start_missing=datetime.datetime(year=2013,month=9,day=9,hour=23,minute=0)
end_missing=datetime.datetime(year=2013,month=9,day=10, hour=1, minute=0)


The experimental `fill_missing` function below is intended to correct the missing values.

Parameters:
* `df` (dataframe): the data with missing values
* `quad_features` (list): features for quadratic interpolation
* `linear_features` (list): features for linear interpolation
* `categorical_features` (list): features for nearest interpolation
* `start_blank`, `end_blank`: start (inclusive) and end (exclusive) of the blank period

Returns: a dataframe with the corrected data.

Experimental: this needs more work, so for now we just use nearest-value filling instead.

<pre>
def fill_missing(df, quad_features, linear_features, categorical_features,
                        start_blank, end_blank):
    start_blank = pd.to_datetime(start_blank)
    end_blank = pd.to_datetime(end_blank)
    blank_period = (df['day_time'] >= start_blank) & (df['day_time'] < end_blank)
    df.loc[~blank_period, quad_features] = df[quad_features].interpolate(
        method='quadratic')
    df.loc[~blank_period, linear_features] = df[linear_features].interpolate(
        method='linear')
    df.loc[~blank_period, categorical_features] = df[categorical_features].interpolate(
        method='nearest')
    return df
    
quad_features = ['temperature', 'visibility', 'dewPoint', 'pressure', 'apparentTemperature', 'windSpeed', 'humidity','cloudCover','energy(kWh/hh)']
linear_features = ['windBearing', 'moonPhase','Afteris_bank_holiday','Beforeis_bank_holiday',
                  'delta_minutes','from_sunrise','to_sunset']
categorical_features = ['LCLid', 'dayYear','dayMonth','dayWeek','dayDay','dayDayofweek','dayDayofyear','dayIs_month_end',
                       'dayIs_month_start', 'dayIs_quarter_end', 'dayIs_quarter_start', 'dayIs_year_end', 'dayIs_year_start',
                       'dayElapsed', 'bank_holiday','day']
df = fill_missing(df, quad_features=quad_features, linear_features=linear_features, 
                                  categorical_features=categorical_features, start_blank=start_date, end_blank=end_date)
</pre>


In [75]:
df.head()

Unnamed: 0,day_time,LCLid,energy(kWh/hh),dayYear,dayMonth,dayWeek,dayDay,dayDayofweek,dayDayofyear,dayIs_month_end,dayIs_month_start,dayIs_quarter_end,dayIs_quarter_start,dayIs_year_end,dayIs_year_start,dayElapsed,delta_minutes,visibility,windBearing,temperature,dewPoint,pressure,apparentTemperature,windSpeed,precipType,humidity,summary,is_bank_holiday,bank_holiday,day,cloudCover,uvIndex,moonPhase,from_sunrise,to_sunset,Afteris_bank_holiday,Beforeis_bank_holiday
0,2014-02-08 23:30:00,MAC000006,0.077,2014,2,6,8,5,39,False,False,False,False,False,False,1391817600,698370,10.6,202.5,8.39,4.175,979.175,5.07,6.85,rain,0.76,Breezy,False,,2014-02-08,0.47,1.0,0.3,-962.0,386.0,54720,0
1,2014-02-08 23:00:00,MAC000006,0.055,2014,2,6,8,5,39,False,False,False,False,False,False,1391817600,698340,11.27,221.0,7.49,0.66,977.87,2.93,9.85,rain,0.62,Breezy,False,,2014-02-08,0.47,1.0,0.3,-932.0,356.0,53280,0
2,2014-02-08 22:30:00,MAC000006,0.08,2014,2,6,8,5,39,False,False,False,False,False,False,1391817600,698310,11.26,220.5,7.52,1.15,977.785,2.975,9.805,rain,0.64,Breezy,False,,2014-02-08,0.47,1.0,0.3,-902.0,326.0,53280,0
3,2014-02-08 22:00:00,MAC000006,0.074,2014,2,6,8,5,39,False,False,False,False,False,False,1391817600,698280,11.25,220.0,7.55,1.64,977.7,3.02,9.76,rain,0.66,Breezy,False,,2014-02-08,0.47,1.0,0.3,-872.0,296.0,53280,0
4,2014-02-08 21:30:00,MAC000006,0.056,2014,2,6,8,5,39,False,False,False,False,False,False,1391817600,698250,11.815,218.5,7.365,2.4,977.51,2.99,8.97,rain,0.71,Clear,False,,2014-02-08,0.47,1.0,0.3,-842.0,266.0,53280,0


In [82]:
df['day_time'] = pd.to_datetime(df['day_time'], format='%Y-%m-%d %H:%M:%S')

In [83]:
full_idx = pd.date_range(start=df['day_time'].min(), end=df['day_time'].max(), freq='30T')

In [84]:
full_idx[0]

Timestamp('2012-02-05 00:00:00', freq='30T')

In [86]:
df = df.set_index("day_time")

In [87]:
df = (
    df
    .groupby('LCLid', as_index=False)  
    .apply(lambda group: group.reindex(full_idx, method='nearest'))  
    .reset_index(level=0, drop=True)  
    .sort_index()  
)
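
The effect of reindexing each household onto the full half-hourly index can be checked on a toy frame. This is a sketch under assumptions: `toy`, `full` and `filled` are hypothetical names, and it concatenates the groups explicitly instead of using `groupby().apply()`, so it is a variation on the cell above rather than a copy of it.

```python
import pandas as pd

full = pd.date_range("2014-01-01 00:00", periods=5, freq="30min")

# Household A is missing the 01:00 and 01:30 readings; household B is complete.
toy = pd.DataFrame({
    "LCLid": ["A", "A", "A", "B", "B", "B", "B", "B"],
    "energy": [0.1, 0.2, 0.5, 1.0, 2.0, 3.0, 4.0, 5.0],
}, index=full[[0, 1, 4]].append(full))

# Reindex each household separately so gaps are filled from its own nearest reading.
filled = pd.concat(
    [group.reindex(full, method="nearest") for _, group in toy.groupby("LCLid")]
).sort_index()
print(filled)
```

Each missing row copies every column, including `LCLid`, from the closest existing timestamp of the same household, so nearest-value filling is only reasonable for short gaps.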

In [89]:
# check for missing datetimeindex values based on reference index (with all values)
missing_dates = ref_df.index[~ref_df.index.isin(df.index)]

In [90]:
missing_dates

DatetimeIndex([], dtype='datetime64[ns]', freq='30T')

In [91]:
#check if more than necessary in data
inv_missing_dates = df.index[~df.index.isin(ref_df.index)]
print('frac missing: {0}, inv_missing_dates: {1}'.format(len(inv_missing_dates)/len(df), inv_missing_dates))

frac missing: 0.0, inv_missing_dates: DatetimeIndex([], dtype='datetime64[ns]', freq=None)


In [92]:
df.reset_index(inplace=True)

In [93]:
len(df)

19192320

In [94]:
df.to_csv(f'{PATH}hh_bank_544_interpol_735_days.csv')

### Add Acorn data

In [95]:
informations_households = pd.read_csv(f'{PATH}informations_households.csv')

In [96]:
informations_households.head(n=2)

Unnamed: 0,LCLid,stdorToU,Acorn,Acorn_grouped,file
0,MAC005492,ToU,ACORN-,ACORN-,block_0
1,MAC001074,ToU,ACORN-,ACORN-,block_0


In [97]:
informations_households.drop(columns=['file'],inplace=True)

In [99]:
df = join_df(df, informations_households, "LCLid")
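
`join_df` is a helper from the fastai library. Assuming it behaves like a left merge on the key column (an assumption about the helper, not something shown in this notebook), the join and a check for unmatched households can be sketched in plain pandas. The frames below and the id `MAC999999` are made up for illustration.

```python
import pandas as pd

readings = pd.DataFrame({
    "LCLid": ["MAC000006", "MAC005178", "MAC999999"],  # last id is hypothetical
    "energy": [0.042, 0.561, 0.3],
})
households = pd.DataFrame({
    "LCLid": ["MAC000006", "MAC005178"],
    "Acorn_grouped": ["Adversity", "Affluent"],
})

# Left join: every reading is kept; households without metadata get NaN.
joined = readings.merge(households, how="left", on="LCLid", suffixes=("", "_y"))

# Count readings whose household had no match in the metadata table.
unmatched = joined["Acorn_grouped"].isnull().sum()
print(joined)
print("unmatched rows:", unmatched)
```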

#### Cold felt

This feature captures how cold it feels by combining wind speed and temperature. See the Data Challenge JDS 2018 winning code for the source of the code snippet.

The formula is given by

$$
T_{felt} = (A - T) \cdot \sqrt{V},
$$

where $T$ is the temperature, $V$ is the wind speed, and $A$ is a threshold temperature such that, when $T$ is greater than $A$, the wind has no real impact on consumption. In the following, we fit $A$ so that the correlation between $T_{felt}$ and the target $y$ (energy consumption) is at its maximum.

Experimental: needs more QC to establish why the maximum correlation comes out as NaN.

<pre>
from scipy.stats import pearsonr
from scipy.optimize import minimize

def find_param_temp(temp, df):
    """Computes the correlation between the target and the 
    new feature for a given temperature.
    """
    t_wind = (temp-df['temperature']) * np.sqrt(df['windSpeed'])
    print(t_wind.head(n=2))
    id_null = t_wind.isnull()
    corr = pearsonr(df.loc[~id_null, 'energy(kWh/hh)'], t_wind[~id_null])[0]
    return -corr

#train = data[data['type'] == 'train'].copy()
#test = data[data['type'] == 'test'].copy()

result = minimize(find_param_temp, 15., args=(df,))
temp = result.x[0]
print('Max correlation: %.2f for nominal temperature of %.2f' % (- result.fun, temp))
df['wind_chill'] = (temp-df['temperature']) * np.sqrt(df['windSpeed'])
</pre>
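
The fitting step can be tried on synthetic data where the right answer is known. The sketch below swaps `scipy.optimize.minimize` for a plain grid search over candidate values of $A$, and masks missing values in both the feature and the target before computing the correlation. Missing target values are one way a correlation can come out as NaN, but this has not been confirmed as the cause in this notebook. All names (`toy`, `felt_corr`, `true_a`, `best_a`) and the synthetic data are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
n = 2000
toy = pd.DataFrame({
    "temperature": rng.uniform(-5, 25, n),
    "windSpeed": rng.uniform(0, 10, n),
})

# Synthetic target built from the formula with a known threshold, plus noise.
true_a = 15.0
toy["energy"] = (true_a - toy["temperature"]) * np.sqrt(toy["windSpeed"]) + rng.normal(0, 1, n)
toy.loc[::50, "energy"] = np.nan  # inject some missing target values

def felt_corr(a, frame):
    """Pearson correlation between T_felt (for threshold a) and the target, ignoring NaNs."""
    felt = (a - frame["temperature"]) * np.sqrt(frame["windSpeed"])
    ok = felt.notnull() & frame["energy"].notnull()
    return np.corrcoef(felt[ok], frame.loc[ok, "energy"])[0, 1]

# Grid search over candidate thresholds instead of a numerical optimiser.
candidates = np.arange(0.0, 30.5, 0.5)
scores = [felt_corr(a, toy) for a in candidates]
best_a = candidates[int(np.argmax(scores))]
print("best A:", best_a, "correlation:", max(scores))
```

A grid is coarser than an optimiser, but it makes the shape of the correlation curve easy to inspect when the fit misbehaves.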

In [101]:
df.drop(columns=['wind_chill'],inplace=True)

In [102]:
df.head(n=2)

Unnamed: 0,index,LCLid,energy(kWh/hh),dayYear,dayMonth,dayWeek,dayDay,dayDayofweek,dayDayofyear,dayIs_month_end,dayIs_month_start,dayIs_quarter_end,dayIs_quarter_start,dayIs_year_end,dayIs_year_start,dayElapsed,delta_minutes,visibility,windBearing,temperature,dewPoint,pressure,apparentTemperature,windSpeed,precipType,humidity,summary,is_bank_holiday,bank_holiday,day,cloudCover,uvIndex,moonPhase,from_sunrise,to_sunset,Afteris_bank_holiday,Beforeis_bank_holiday,stdorToU,Acorn,Acorn_grouped
0,2012-02-05,MAC000006,0.042,2012,2,5,5,6,36,False,False,False,False,False,False,1328400000,-360000,1.32,160.0,-0.12,-0.22,1024.21,-4.68,4.35,snow,0.99,Foggy,False,,2012-02-05,0.85,1.0,0.42,454.0,-1017.0,0,-87840,Std,ACORN-Q,Adversity
1,2012-02-05,MAC005178,0.561,2012,2,5,5,6,36,False,False,False,False,False,False,1328400000,-360000,1.32,160.0,-0.12,-0.22,1024.21,-4.68,4.35,snow,0.99,Foggy,False,,2012-02-05,0.85,1.0,0.42,454.0,-1017.0,0,-87840,Std,ACORN-E,Affluent


In [112]:
df.rename(columns={'index': 'day_time'}, inplace=True)

In [104]:
df.to_csv(f'{PATH}hh_final_544_ids_735_days.csv')

In [None]:
df = pd.read_csv(f'{PATH}hh_final_544_ids_735_days.csv')

We now have our final set of engineered features.