# Structured and time series data 

Based on the methodology taken by the third place result in the Rossman Kaggle competition as detailed in Guo/Berkhahn's [Entity Embeddings of Categorical Variables](https://arxiv.org/abs/1604.06737). See fastai rossman.ipynb

The motivation behind exploring this architecture is it's relevance to real-world application. Most data used for decision making day-to-day in industry is structured and/or time-series data. Here we explore the end-to-end process of using neural networks with practical structured data problems.

As per the Kaggle competition we will use Root Mean Square Percentage Error (RMSPE). The RMSPE is calculated as

![title](../images/RMSPE.png)

where y_i denotes the energy used for a single household on a single day and yhat_i denotes the corresponding prediction. Any day and household with 0 energy use is ignored in scoring.


In [1]:
%matplotlib inline
%reload_ext autoreload
%autoreload 2

In [2]:
from fastai.structured import *
from fastai.column_data import *


In [3]:
pd.set_option('display.max_columns', None)

In [4]:
PATH='../input/merged_data/'

In [5]:
from IPython.display import HTML, display


The following returns summarized aggregate information to each table accross each field.

In [10]:
daily_48hh = pd.read_csv(f'{PATH}daily_all_48hh.csv')

Here we read in only daly data where a household has 48 half hourly measurements

In [11]:
daily_48hh.head()

Unnamed: 0.1,Unnamed: 0,LCLid,day,energy_median,energy_mean,energy_max,energy_count,energy_std,energy_sum,energy_min
0,1,MAC000118,2011-12-15,0.114,0.183542,0.641,48,0.156322,8.81,0.061
1,2,MAC000118,2011-12-16,0.151,0.20675,1.029,48,0.190122,9.924,0.061
2,3,MAC000118,2011-12-17,0.146,0.218854,0.677,48,0.174688,10.505,0.056
3,4,MAC000118,2011-12-18,0.182,0.274417,1.012,48,0.235795,13.172,0.061
4,5,MAC000118,2011-12-19,0.1175,0.19475,0.695,48,0.164503,9.348,0.057


In [13]:
daily_48hh.drop(columns=['Unnamed: 0'],inplace=True)

In [14]:
display(DataFrameSummary(daily_48hh).summary())

Unnamed: 0,LCLid,day,energy_median,energy_mean,energy_max,energy_count,energy_std,energy_sum,energy_min
count,,,3.46935e+06,3.46935e+06,3.46935e+06,3.46935e+06,3.46935e+06,3.46935e+06,3.46935e+06
mean,,,0.158577,0.211765,0.836838,48,0.172691,10.1647,0.0592012
std,,,0.169372,0.190229,0.668409,0,0.153189,9.13101,0.0849863
min,,,0,0,0,48,0,0,0
25%,,,0.067,0.0982292,0.348,48,0.0691159,4.715,0.02
50%,,,0.1145,0.163417,0.691,48,0.132795,7.844,0.039
75%,,,0.191,0.262542,1.13,48,0.229319,12.602,0.071
max,,,6.9055,6.92825,10.761,48,3.34731,332.556,6.394
counts,3469352,3469352,3469352,3469352,3469352,3469352,3469352,3469352,3469352
uniques,5560,827,10337,396529,6419,1,3245602,399606,2078


In [16]:
weather = pd.read_csv(f'{PATH}weather_daily_darksky.csv')

In [27]:
weather[['day', 'time']] = weather['sunriseTime'].str.split(' ', n=1, expand=True)

In [30]:
weather.head(n=2)

Unnamed: 0,temperatureMax,temperatureMaxTime,windBearing,icon,dewPoint,temperatureMinTime,cloudCover,windSpeed,pressure,apparentTemperatureMinTime,apparentTemperatureHigh,precipType,visibility,humidity,apparentTemperatureHighTime,apparentTemperatureLow,apparentTemperatureMax,uvIndex,time,sunsetTime,temperatureLow,temperatureMin,temperatureHigh,sunriseTime,temperatureHighTime,uvIndexTime,summary,temperatureLowTime,apparentTemperatureMin,apparentTemperatureMaxTime,apparentTemperatureLowTime,moonPhase,day
0,11.96,2011-11-11 23:00:00,123,fog,9.4,2011-11-11 07:00:00,0.79,3.88,1016.08,2011-11-11 07:00:00,10.87,rain,3.3,0.95,2011-11-11 19:00:00,10.87,11.96,1.0,07:12:14,2011-11-11 16:19:21,10.87,8.85,10.87,2011-11-11 07:12:14,2011-11-11 19:00:00,2011-11-11 11:00:00,Foggy until afternoon.,2011-11-11 19:00:00,6.48,2011-11-11 23:00:00,2011-11-11 19:00:00,0.52,2011-11-11
1,8.59,2011-12-11 14:00:00,198,partly-cloudy-day,4.49,2011-12-11 01:00:00,0.56,3.94,1007.71,2011-12-11 02:00:00,5.62,rain,12.09,0.88,2011-12-11 19:00:00,-0.64,5.72,1.0,07:57:02,2011-12-11 15:52:53,3.09,2.48,8.59,2011-12-11 07:57:02,2011-12-11 14:00:00,2011-12-11 12:00:00,Partly cloudy throughout the day.,2011-12-12 07:00:00,0.11,2011-12-11 20:00:00,2011-12-12 08:00:00,0.53,2011-12-11


In [None]:
#convert time string to timestamp
hh_weather['time'] = pd.to_timedelta(hh_weather['time'])

Bank Holidays

In [23]:
bank_holidays = pd.read_csv(f'{PATH}uk_bank_holidays.csv')

In [24]:
bank_holidays = bank_holidays.rename(columns={'Bank holidays': 'day'})


In [25]:
bank_holidays.head()

Unnamed: 0,day,Type
0,2012-12-26,Boxing Day
1,2012-12-25,Christmas Day
2,2012-08-27,Summer bank holiday
3,2012-05-06,Queen?s Diamond Jubilee (extra bank holiday)
4,2012-04-06,Spring bank holiday (substitute day)


In [18]:
acorn_details = pd.read_csv(f'{PATH}acorn_details.csv')

In [19]:
acorn_details.head()

Unnamed: 0,MAIN CATEGORIES,CATEGORIES,REFERENCE,ACORN-A,ACORN-B,ACORN-C,ACORN-D,ACORN-E,ACORN-F,ACORN-G,ACORN-H,ACORN-I,ACORN-J,ACORN-K,ACORN-L,ACORN-M,ACORN-N,ACORN-O,ACORN-P,ACORN-Q
0,POPULATION,Age,Age 0-4,77.0,83.0,72.0,100.0,120.0,77.0,97.0,97.0,63.0,119.0,67.0,114.0,113.0,89.0,123.0,138.0,133.0
1,POPULATION,Age,Age 5-17,117.0,109.0,87.0,69.0,94.0,95.0,102.0,106.0,67.0,95.0,64.0,108.0,116.0,86.0,89.0,136.0,106.0
2,POPULATION,Age,Age 18-24,64.0,73.0,67.0,107.0,100.0,71.0,83.0,89.0,62.0,104.0,459.0,97.0,96.0,86.0,117.0,109.0,110.0
3,POPULATION,Age,Age 25-34,52.0,63.0,62.0,197.0,151.0,66.0,90.0,88.0,63.0,132.0,145.0,109.0,96.0,90.0,140.0,120.0,120.0
4,POPULATION,Age,Age 35-49,102.0,105.0,91.0,124.0,118.0,93.0,102.0,103.0,76.0,111.0,67.0,99.0,98.0,90.0,102.0,103.0,100.0


In [21]:
informations_households = pd.read_csv(f'{PATH}informations_households.csv')

In [22]:
informations_households.head()

Unnamed: 0,LCLid,stdorToU,Acorn,Acorn_grouped,file
0,MAC005492,ToU,ACORN-,ACORN-,block_0
1,MAC001074,ToU,ACORN-,ACORN-,block_0
2,MAC000002,Std,ACORN-A,Affluent,block_0
3,MAC003613,Std,ACORN-A,Affluent,block_0
4,MAC003597,Std,ACORN-A,Affluent,block_0


## Data Cleaning / Feature Engineering

In [15]:
len(daily_48hh)

3469352

`join_df` is a function for joining tables on specific fields. By default, we'll be doing a left outer join of `right` on the `left` argument using the given fields for each table.

Pandas does joins using the `merge` method. The `suffixes` argument describes the naming convention for duplicate fields. We've elected to leave the duplicate field names on the left untouched, and append a "\_y" to those on the right.

In [32]:
def join_df(left, right, left_on, right_on=None, suffix='_y'):
    if right_on is None: right_on = left_on
    return left.merge(right, how='left', left_on=left_on, right_on=right_on, 
                      suffixes=("", suffix))

Join weather/state names.

In [33]:
df = join_df(daily_48hh, informations_households, "LCLid")

In [34]:
df.head(n=2)

Unnamed: 0,LCLid,day,energy_median,energy_mean,energy_max,energy_count,energy_std,energy_sum,energy_min,stdorToU,Acorn,Acorn_grouped,file
0,MAC000118,2011-12-15,0.114,0.183542,0.641,48,0.156322,8.81,0.061,Std,ACORN-G,Comfortable,block_60
1,MAC000118,2011-12-16,0.151,0.20675,1.029,48,0.190122,9.924,0.061,Std,ACORN-G,Comfortable,block_60


In [35]:
df.drop(columns=['file'],inplace=True)

In [36]:
df = join_df(df, weather, "day")

In [37]:
df = join_df(df, bank_holidays, "day")

In [38]:
df.head(n=2)

Unnamed: 0,LCLid,day,energy_median,energy_mean,energy_max,energy_count,energy_std,energy_sum,energy_min,stdorToU,Acorn,Acorn_grouped,temperatureMax,temperatureMaxTime,windBearing,icon,dewPoint,temperatureMinTime,cloudCover,windSpeed,pressure,apparentTemperatureMinTime,apparentTemperatureHigh,precipType,visibility,humidity,apparentTemperatureHighTime,apparentTemperatureLow,apparentTemperatureMax,uvIndex,time,sunsetTime,temperatureLow,temperatureMin,temperatureHigh,sunriseTime,temperatureHighTime,uvIndexTime,summary,temperatureLowTime,apparentTemperatureMin,apparentTemperatureMaxTime,apparentTemperatureLowTime,moonPhase,Type
0,MAC000118,2011-12-15,0.114,0.183542,0.641,48,0.156322,8.81,0.061,Std,ACORN-G,Comfortable,7.97,2011-12-15 14:00:00,234,wind,2.41,2011-12-15 00:00:00,0.42,4.71,996.75,2011-12-15 00:00:00,4.25,rain,12.79,0.77,2011-12-15 14:00:00,-2.58,4.38,1.0,08:00:46,2011-12-15 15:52:48,1.8,4.08,7.97,2011-12-15 08:00:46,2011-12-15 14:00:00,2011-12-15 11:00:00,Partly cloudy throughout the day and breezy in...,2011-12-16 08:00:00,1.07,2011-12-15 21:00:00,2011-12-16 08:00:00,0.66,
1,MAC000118,2011-12-16,0.151,0.20675,1.029,48,0.190122,9.924,0.061,Std,ACORN-G,Comfortable,4.68,2011-12-16 00:00:00,315,partly-cloudy-day,1.6,2011-12-16 08:00:00,0.7,3.71,988.1,2011-12-16 10:00:00,0.23,rain,10.96,0.88,2011-12-16 15:00:00,-3.56,0.99,1.0,08:01:35,2011-12-16 15:52:56,0.24,1.8,4.53,2011-12-16 08:01:35,2011-12-16 15:00:00,2011-12-16 11:00:00,Mostly cloudy throughout the day.,2011-12-17 08:00:00,-2.65,2011-12-16 00:00:00,2011-12-17 08:00:00,0.7,


In [39]:
df = df.rename(columns={'Type': 'bankHoliday'})

The following extracts particular date fields from a complete datetime for the purpose of constructing categoricals.

You should *always* consider this feature extraction step when working with date-time. Without expanding your date-time into these additional fields, you can't capture any trend/cyclical behavior as a function of time at any of these granularities. We'll add to every table with a date field.

In [40]:
add_datepart(df, "day", drop=False)

In [41]:
df.head(n=2)

Unnamed: 0,LCLid,day,energy_median,energy_mean,energy_max,energy_count,energy_std,energy_sum,energy_min,stdorToU,Acorn,Acorn_grouped,temperatureMax,temperatureMaxTime,windBearing,icon,dewPoint,temperatureMinTime,cloudCover,windSpeed,pressure,apparentTemperatureMinTime,apparentTemperatureHigh,precipType,visibility,humidity,apparentTemperatureHighTime,apparentTemperatureLow,apparentTemperatureMax,uvIndex,time,sunsetTime,temperatureLow,temperatureMin,temperatureHigh,sunriseTime,temperatureHighTime,uvIndexTime,summary,temperatureLowTime,apparentTemperatureMin,apparentTemperatureMaxTime,apparentTemperatureLowTime,moonPhase,bankHoliday,dayYear,dayMonth,dayWeek,dayDay,dayDayofweek,dayDayofyear,dayIs_month_end,dayIs_month_start,dayIs_quarter_end,dayIs_quarter_start,dayIs_year_end,dayIs_year_start,dayElapsed
0,MAC000118,2011-12-15,0.114,0.183542,0.641,48,0.156322,8.81,0.061,Std,ACORN-G,Comfortable,7.97,2011-12-15 14:00:00,234,wind,2.41,2011-12-15 00:00:00,0.42,4.71,996.75,2011-12-15 00:00:00,4.25,rain,12.79,0.77,2011-12-15 14:00:00,-2.58,4.38,1.0,08:00:46,2011-12-15 15:52:48,1.8,4.08,7.97,2011-12-15 08:00:46,2011-12-15 14:00:00,2011-12-15 11:00:00,Partly cloudy throughout the day and breezy in...,2011-12-16 08:00:00,1.07,2011-12-15 21:00:00,2011-12-16 08:00:00,0.66,,2011,12,50,15,3,349,False,False,False,False,False,False,1323907200
1,MAC000118,2011-12-16,0.151,0.20675,1.029,48,0.190122,9.924,0.061,Std,ACORN-G,Comfortable,4.68,2011-12-16 00:00:00,315,partly-cloudy-day,1.6,2011-12-16 08:00:00,0.7,3.71,988.1,2011-12-16 10:00:00,0.23,rain,10.96,0.88,2011-12-16 15:00:00,-3.56,0.99,1.0,08:01:35,2011-12-16 15:52:56,0.24,1.8,4.53,2011-12-16 08:01:35,2011-12-16 15:00:00,2011-12-16 11:00:00,Mostly cloudy throughout the day.,2011-12-17 08:00:00,-2.65,2011-12-16 00:00:00,2011-12-17 08:00:00,0.7,,2011,12,50,16,4,350,False,False,False,False,False,False,1323993600


In [44]:
#convert time strings to timestamp
df['sunsetTime'] = pd.to_datetime(df['sunsetTime'], format = '%Y-%m-%d %H:%M:%S')
df['sunriseTime'] = pd.to_datetime(df['sunriseTime'], format = '%Y-%m-%d %H:%M:%S')

In [46]:
df['daylightMins'] =(df['sunsetTime']-df['sunriseTime']).astype('timedelta64[m]')

In [47]:
df.drop(columns=['time'],inplace=True)
df.head(n=2)

Unnamed: 0,LCLid,day,energy_median,energy_mean,energy_max,energy_count,energy_std,energy_sum,energy_min,stdorToU,Acorn,Acorn_grouped,temperatureMax,temperatureMaxTime,windBearing,icon,dewPoint,temperatureMinTime,cloudCover,windSpeed,pressure,apparentTemperatureMinTime,apparentTemperatureHigh,precipType,visibility,humidity,apparentTemperatureHighTime,apparentTemperatureLow,apparentTemperatureMax,uvIndex,sunsetTime,temperatureLow,temperatureMin,temperatureHigh,sunriseTime,temperatureHighTime,uvIndexTime,summary,temperatureLowTime,apparentTemperatureMin,apparentTemperatureMaxTime,apparentTemperatureLowTime,moonPhase,bankHoliday,dayYear,dayMonth,dayWeek,dayDay,dayDayofweek,dayDayofyear,dayIs_month_end,dayIs_month_start,dayIs_quarter_end,dayIs_quarter_start,dayIs_year_end,dayIs_year_start,dayElapsed,daylightMins
0,MAC000118,2011-12-15,0.114,0.183542,0.641,48,0.156322,8.81,0.061,Std,ACORN-G,Comfortable,7.97,2011-12-15 14:00:00,234,wind,2.41,2011-12-15 00:00:00,0.42,4.71,996.75,2011-12-15 00:00:00,4.25,rain,12.79,0.77,2011-12-15 14:00:00,-2.58,4.38,1.0,2011-12-15 15:52:48,1.8,4.08,7.97,2011-12-15 08:00:46,2011-12-15 14:00:00,2011-12-15 11:00:00,Partly cloudy throughout the day and breezy in...,2011-12-16 08:00:00,1.07,2011-12-15 21:00:00,2011-12-16 08:00:00,0.66,,2011,12,50,15,3,349,False,False,False,False,False,False,1323907200,472.0
1,MAC000118,2011-12-16,0.151,0.20675,1.029,48,0.190122,9.924,0.061,Std,ACORN-G,Comfortable,4.68,2011-12-16 00:00:00,315,partly-cloudy-day,1.6,2011-12-16 08:00:00,0.7,3.71,988.1,2011-12-16 10:00:00,0.23,rain,10.96,0.88,2011-12-16 15:00:00,-3.56,0.99,1.0,2011-12-16 15:52:56,0.24,1.8,4.53,2011-12-16 08:01:35,2011-12-16 15:00:00,2011-12-16 11:00:00,Mostly cloudy throughout the day.,2011-12-17 08:00:00,-2.65,2011-12-16 00:00:00,2011-12-17 08:00:00,0.7,,2011,12,50,16,4,350,False,False,False,False,False,False,1323993600,471.0


Next we'll fill in missing values to avoid complications with `NA`'s. `NA` (not available) is how Pandas indicates missing values; many models have problems when missing values are present, so it's always important to think about how to deal with them. In these cases, we are picking an arbitrary *signal value* that doesn't otherwise appear in the data.

In [32]:
#df['nn'] = df.nn.fillna(1900).astype(np.int32)


In [39]:
joined.to_feather(f'{PATH}joined')
joined_test.to_feather(f'{PATH}joined_test')

## Durations

It is common when working with time series data to extract data that explains relationships across rows as opposed to columns, e.g.:
* Running averages
* Time until next event
* Time since last event

This is often difficult to do with most table manipulation frameworks, since they are designed to work with relationships across columns. As such, we've created a class to handle this type of data.

We'll define a function `get_elapsed` for cumulative counting across a sorted dataframe. Given a particular field `fld` to monitor, this function will start tracking time since the last occurrence of that field. When the field is seen again, the counter is set to zero.

Upon initialization, this will result in datetime na's until the field is encountered. This is reset every time a new store is seen. We'll see how to use this shortly.

In [52]:
def get_elapsed(fld, pre):
    day1 = np.timedelta64(1, 'D')
    print(f'day1: {str(day1)}')
    last_date = np.datetime64()
    last_hh = 0
    res = []

    for s,v,d in zip(df.LCLid.values,df[fld].values, df.day.values):
        if s != last_hh:
            last_date = np.datetime64()
            last_hh = s
        if v: last_date = d
        res.append(((d-last_date).astype('timedelta64[D]') / day1))
    df[pre+fld] = res

We'll be applying this to a subset of columns:

In [53]:
columns = ["day", "LCLid", "bankHoliday"]

Let's walk through an example.

Say we're looking at Bank Holiday. We'll first sort by LCLid, then day, and then call `add_elapsed('BankHoliday', 'After')`:
This will apply to each row with Bank Holiday:
* A applied to every row of the dataframe in order of LCLid and day
* Will add to the dataframe the days since seeing a Bank Holiday
* If we sort in the other direction, this will count the days until another holiday.

In [55]:
fld = 'bankHoliday'
df = df.sort_values(['LCLid', 'day'])
get_elapsed(fld, 'After')
df = df.sort_values(['LCLid', 'day'], ascending=[True, False])
get_elapsed(fld, 'Before')

day1: 1 days
day1: 1 days


In [None]:
df.head(n=2)

We'll do this for two more fields.

We're going to set the active index to Date.

In [56]:
df = df.set_index("day")

Then set null values from elapsed field calculations to 0.

In [57]:
columns = ['bankHoliday']

In [58]:
for o in ['Before', 'After']:
    for p in columns:
        a = o+p
        df[a] = df[a].fillna(0).astype(int)

In [59]:
df.head(n=2)

Unnamed: 0_level_0,LCLid,energy_median,energy_mean,energy_max,energy_count,energy_std,energy_sum,energy_min,stdorToU,Acorn,Acorn_grouped,temperatureMax,temperatureMaxTime,windBearing,icon,dewPoint,temperatureMinTime,cloudCover,windSpeed,pressure,apparentTemperatureMinTime,apparentTemperatureHigh,precipType,visibility,humidity,apparentTemperatureHighTime,apparentTemperatureLow,apparentTemperatureMax,uvIndex,sunsetTime,temperatureLow,temperatureMin,temperatureHigh,sunriseTime,temperatureHighTime,uvIndexTime,summary,temperatureLowTime,apparentTemperatureMin,apparentTemperatureMaxTime,apparentTemperatureLowTime,moonPhase,bankHoliday,dayYear,dayMonth,dayWeek,dayDay,dayDayofweek,dayDayofyear,dayIs_month_end,dayIs_month_start,dayIs_quarter_end,dayIs_quarter_start,dayIs_year_end,dayIs_year_start,dayElapsed,daylightMins,AfterbankHoliday,BeforebankHoliday
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1
2014-02-27,MAC000002,0.218,0.427458,1.35,48,0.406681,20.518,0.08,Std,ACORN-A,Affluent,10.31,2014-02-27 14:00:00,224,partly-cloudy-day,3.08,2014-02-27 23:00:00,0.32,4.14,1007.02,2014-02-27 22:00:00,10.31,rain,12.04,0.74,2014-02-27 14:00:00,0.82,10.31,2.0,2014-02-27 17:37:35,3.43,3.93,10.31,2014-02-27 06:51:45,2014-02-27 14:00:00,2014-02-27 12:00:00,Partly cloudy until evening.,2014-02-28 02:00:00,1.41,2014-02-27 14:00:00,2014-02-28 02:00:00,0.93,,2014,2,9,27,3,58,False,False,False,False,False,False,1393459200,645.0,0,0
2014-02-26,MAC000002,0.1515,0.256833,1.028,48,0.196095,12.328,0.076,Std,ACORN-A,Affluent,11.29,2014-02-26 13:00:00,227,partly-cloudy-day,2.74,2014-02-26 07:00:00,0.26,3.82,1012.73,2014-02-26 07:00:00,11.29,rain,13.0,0.73,2014-02-26 13:00:00,3.03,11.29,2.0,2014-02-26 17:35:49,6.01,4.17,11.29,2014-02-26 06:53:52,2014-02-26 13:00:00,2014-02-26 12:00:00,Partly cloudy throughout the day.,2014-02-27 00:00:00,1.67,2014-02-26 13:00:00,2014-02-27 00:00:00,0.9,,2014,2,9,26,2,57,False,False,False,False,False,False,1393372800,641.0,0,0


In [61]:
#replace characters in bank holiday
df['bankHoliday'] = df['bankHoliday'].str.replace('?','')

Next we'll demonstrate window functions in pandas to calculate rolling quantities.

Here we're sorting by date (`sort_index()`) and counting the number of events of interest (`sum()`) defined in `columns` in the following week (`rolling()`), grouped by Store (`groupby()`). We do the same in the opposite direction.

For now lets just create a boolean for holiday or not, can revisit to refine later on

In [64]:
df['bankHoliday'].fillna(False, inplace=True)

In [65]:
mask = df.bankHoliday != False
df.loc[mask, 'bankHoliday'] = True

In [66]:
bwd = df[['LCLid']+columns].sort_index().groupby("LCLid").rolling(7, min_periods=1).sum()

In [67]:
fwd = df[['LCLid']+columns].sort_index(ascending=False
                                      ).groupby("LCLid").rolling(7, min_periods=1).sum()

In [68]:
fwd.head(n=2)

Unnamed: 0_level_0,Unnamed: 1_level_0,bankHoliday
LCLid,day,Unnamed: 2_level_1
MAC000002,2014-02-27,0.0
MAC000002,2014-02-26,0.0


Next we want to drop the LCLid indices grouped together in the window function.

Often in pandas, there is an option to do this in place. This is time and memory efficient when working with large datasets.

In [72]:
#bwd.drop('LCLid',1,inplace=True)
bwd.reset_index(inplace=True)

In [71]:
#fwd.drop('LCLid',1,inplace=True)
fwd.reset_index(inplace=True)

In [73]:
fwd.head(n=2)

Unnamed: 0,LCLid,day,bankHoliday
0,MAC000002,2014-02-27,0.0
1,MAC000002,2014-02-26,0.0


In [74]:
df.reset_index(inplace=True)

In [75]:
df.head(n=2)

Unnamed: 0,day,LCLid,energy_median,energy_mean,energy_max,energy_count,energy_std,energy_sum,energy_min,stdorToU,Acorn,Acorn_grouped,temperatureMax,temperatureMaxTime,windBearing,icon,dewPoint,temperatureMinTime,cloudCover,windSpeed,pressure,apparentTemperatureMinTime,apparentTemperatureHigh,precipType,visibility,humidity,apparentTemperatureHighTime,apparentTemperatureLow,apparentTemperatureMax,uvIndex,sunsetTime,temperatureLow,temperatureMin,temperatureHigh,sunriseTime,temperatureHighTime,uvIndexTime,summary,temperatureLowTime,apparentTemperatureMin,apparentTemperatureMaxTime,apparentTemperatureLowTime,moonPhase,bankHoliday,dayYear,dayMonth,dayWeek,dayDay,dayDayofweek,dayDayofyear,dayIs_month_end,dayIs_month_start,dayIs_quarter_end,dayIs_quarter_start,dayIs_year_end,dayIs_year_start,dayElapsed,daylightMins,AfterbankHoliday,BeforebankHoliday
0,2014-02-27,MAC000002,0.218,0.427458,1.35,48,0.406681,20.518,0.08,Std,ACORN-A,Affluent,10.31,2014-02-27 14:00:00,224,partly-cloudy-day,3.08,2014-02-27 23:00:00,0.32,4.14,1007.02,2014-02-27 22:00:00,10.31,rain,12.04,0.74,2014-02-27 14:00:00,0.82,10.31,2.0,2014-02-27 17:37:35,3.43,3.93,10.31,2014-02-27 06:51:45,2014-02-27 14:00:00,2014-02-27 12:00:00,Partly cloudy until evening.,2014-02-28 02:00:00,1.41,2014-02-27 14:00:00,2014-02-28 02:00:00,0.93,False,2014,2,9,27,3,58,False,False,False,False,False,False,1393459200,645.0,0,0
1,2014-02-26,MAC000002,0.1515,0.256833,1.028,48,0.196095,12.328,0.076,Std,ACORN-A,Affluent,11.29,2014-02-26 13:00:00,227,partly-cloudy-day,2.74,2014-02-26 07:00:00,0.26,3.82,1012.73,2014-02-26 07:00:00,11.29,rain,13.0,0.73,2014-02-26 13:00:00,3.03,11.29,2.0,2014-02-26 17:35:49,6.01,4.17,11.29,2014-02-26 06:53:52,2014-02-26 13:00:00,2014-02-26 12:00:00,Partly cloudy throughout the day.,2014-02-27 00:00:00,1.67,2014-02-26 13:00:00,2014-02-27 00:00:00,0.9,False,2014,2,9,26,2,57,False,False,False,False,False,False,1393372800,641.0,0,0


Now we'll merge these values onto the df.

In [76]:
df = df.merge(bwd, 'left', ['day', 'LCLid'], suffixes=['', '_bw'])
df = df.merge(fwd, 'left', ['day', 'LCLid'], suffixes=['', '_fw'])

In [77]:
df.drop(columns,1,inplace=True)

In [78]:
df.head(n=2)

Unnamed: 0,day,LCLid,energy_median,energy_mean,energy_max,energy_count,energy_std,energy_sum,energy_min,stdorToU,Acorn,Acorn_grouped,temperatureMax,temperatureMaxTime,windBearing,icon,dewPoint,temperatureMinTime,cloudCover,windSpeed,pressure,apparentTemperatureMinTime,apparentTemperatureHigh,precipType,visibility,humidity,apparentTemperatureHighTime,apparentTemperatureLow,apparentTemperatureMax,uvIndex,sunsetTime,temperatureLow,temperatureMin,temperatureHigh,sunriseTime,temperatureHighTime,uvIndexTime,summary,temperatureLowTime,apparentTemperatureMin,apparentTemperatureMaxTime,apparentTemperatureLowTime,moonPhase,dayYear,dayMonth,dayWeek,dayDay,dayDayofweek,dayDayofyear,dayIs_month_end,dayIs_month_start,dayIs_quarter_end,dayIs_quarter_start,dayIs_year_end,dayIs_year_start,dayElapsed,daylightMins,AfterbankHoliday,BeforebankHoliday,bankHoliday_bw,bankHoliday_fw
0,2014-02-27,MAC000002,0.218,0.427458,1.35,48,0.406681,20.518,0.08,Std,ACORN-A,Affluent,10.31,2014-02-27 14:00:00,224,partly-cloudy-day,3.08,2014-02-27 23:00:00,0.32,4.14,1007.02,2014-02-27 22:00:00,10.31,rain,12.04,0.74,2014-02-27 14:00:00,0.82,10.31,2.0,2014-02-27 17:37:35,3.43,3.93,10.31,2014-02-27 06:51:45,2014-02-27 14:00:00,2014-02-27 12:00:00,Partly cloudy until evening.,2014-02-28 02:00:00,1.41,2014-02-27 14:00:00,2014-02-28 02:00:00,0.93,2014,2,9,27,3,58,False,False,False,False,False,False,1393459200,645.0,0,0,0.0,0.0
1,2014-02-26,MAC000002,0.1515,0.256833,1.028,48,0.196095,12.328,0.076,Std,ACORN-A,Affluent,11.29,2014-02-26 13:00:00,227,partly-cloudy-day,2.74,2014-02-26 07:00:00,0.26,3.82,1012.73,2014-02-26 07:00:00,11.29,rain,13.0,0.73,2014-02-26 13:00:00,3.03,11.29,2.0,2014-02-26 17:35:49,6.01,4.17,11.29,2014-02-26 06:53:52,2014-02-26 13:00:00,2014-02-26 12:00:00,Partly cloudy throughout the day.,2014-02-27 00:00:00,1.67,2014-02-26 13:00:00,2014-02-27 00:00:00,0.9,2014,2,9,26,2,57,False,False,False,False,False,False,1393372800,641.0,0,0,0.0,0.0


In [None]:
df.to_feather(f'{PATH}df_daily_cat.feather')

In [16]:
df = pd.read_feather(f'{PATH}df_daily_cat.feather')

In [17]:
#convert time strings to timestamp
#try with and without these columns - initially will try without
df.drop(columns=['temperatureMaxTime', 'temperatureMinTime',
                'apparentTemperatureMinTime','apparentTemperatureHighTime',
                'sunsetTime','sunriseTime','uvIndexTime',
                'temperatureHighTime','apparentTemperatureMaxTime',
                'apparentTemperatureLowTime','temperatureLowTime'], inplace=True)
'''
df['temperatureMaxTime'] = pd.to_datetime(df['temperatureMaxTime'], format = '%Y-%m-%d %H:%M:%S')
df['temperatureMinTime'] = pd.to_datetime(df['temperatureMinTime'], format = '%Y-%m-%d %H:%M:%S')
df['apparentTemperatureMinTime'] = pd.to_datetime(df['apparentTemperatureMinTime'], format = '%Y-%m-%d %H:%M:%S')
df['apparentTemperatureHighTime'] = pd.to_datetime(df['apparentTemperatureHighTime'], format = '%Y-%m-%d %H:%M:%S')
df['sunsetTime'] = pd.to_datetime(df['sunsetTime'], format = '%Y-%m-%d %H:%M:%S')
df['sunriseTime'] = pd.to_datetime(df['sunriseTime'], format = '%Y-%m-%d %H:%M:%S')
df['uvIndexTime'] = pd.to_datetime(df['uvIndexTime'], format = '%Y-%m-%d %H:%M:%S')
df['temperatureHighTime'] = pd.to_datetime(df['temperatureHighTime'], format = '%Y-%m-%d %H:%M:%S')
df['apparentTemperatureMaxTime'] = pd.to_datetime(df['apparentTemperatureMaxTime'], format = '%Y-%m-%d %H:%M:%S')
df['apparentTemperatureLowTime'] = pd.to_datetime(df['apparentTemperatureLowTime'], format = '%Y-%m-%d %H:%M:%S')
'''


"\ndf['temperatureMaxTime'] = pd.to_datetime(df['temperatureMaxTime'], format = '%Y-%m-%d %H:%M:%S')\ndf['temperatureMinTime'] = pd.to_datetime(df['temperatureMinTime'], format = '%Y-%m-%d %H:%M:%S')\ndf['apparentTemperatureMinTime'] = pd.to_datetime(df['apparentTemperatureMinTime'], format = '%Y-%m-%d %H:%M:%S')\ndf['apparentTemperatureHighTime'] = pd.to_datetime(df['apparentTemperatureHighTime'], format = '%Y-%m-%d %H:%M:%S')\ndf['sunsetTime'] = pd.to_datetime(df['sunsetTime'], format = '%Y-%m-%d %H:%M:%S')\ndf['sunriseTime'] = pd.to_datetime(df['sunriseTime'], format = '%Y-%m-%d %H:%M:%S')\ndf['uvIndexTime'] = pd.to_datetime(df['uvIndexTime'], format = '%Y-%m-%d %H:%M:%S')\ndf['temperatureHighTime'] = pd.to_datetime(df['temperatureHighTime'], format = '%Y-%m-%d %H:%M:%S')\ndf['apparentTemperatureMaxTime'] = pd.to_datetime(df['apparentTemperatureMaxTime'], format = '%Y-%m-%d %H:%M:%S')\ndf['apparentTemperatureLowTime'] = pd.to_datetime(df['apparentTemperatureLowTime'], format = '%Y-%

It's usually a good idea to back up large tables of extracted / wrangled features before you join them onto another one, that way you can go back to it easily if you need to make changes to it.

In [18]:
df.to_feather(f'{PATH}df_daily_cat_no_dates.feather')

### Read in pre proc data

In [28]:
df = pd.read_feather(f'{PATH}df_daily_cat_no_dates.feather')

In [29]:
df.dtypes

day                        datetime64[ns]
LCLid                              object
energy_median                     float64
energy_mean                       float64
energy_max                        float64
energy_count                        int64
energy_std                        float64
energy_sum                        float64
energy_min                        float64
stdorToU                           object
Acorn                              object
Acorn_grouped                      object
temperatureMax                    float64
windBearing                         int64
icon                               object
dewPoint                          float64
cloudCover                        float64
windSpeed                         float64
pressure                          float64
apparentTemperatureHigh           float64
precipType                         object
visibility                        float64
humidity                          float64
apparentTemperatureLow            

The authors also removed all instances where the store had zero sale / was closed. We speculate that this may have cost them a higher standing in the competition. One reason this may be the case is that a little exploratory data analysis reveals that there are often periods where stores are closed, typically for refurbishment. Before and after these periods, there are naturally spikes in sales that one might expect. By ommitting this data from their training, the authors gave up the ability to leverage information about these periods to predict this otherwise volatile behavior.

We'll back this up as well.

We now have our final set of engineered features.

While these steps were explicitly outlined in the paper, these are all fairly typical feature engineering steps for dealing with time series data and are practical in any similar setting.

## Create features

In [9]:
#joined = pd.read_feather(f'{PATH}joined')
#joined_test = pd.read_feather(f'{PATH}joined_test')

In [27]:
#df.head().T.head(40)

Now that we've engineered all our features, we need to convert to input compatible with a neural network.

This includes converting categorical variables into contiguous integers or one-hot encodings, normalizing continuous features to standard normal, etc...

### T/T split

In [30]:
df['day'].min()

Timestamp('2011-11-24 00:00:00')

In [31]:
df['day'].max()

Timestamp('2014-02-27 00:00:00')

One issue with this dataset is that the household meter measurements dont all start and end on the same date

Here we will use the period starting from 1/Feb/2014 as the test datset. Households which end memter measurements before the start of 1/Feb/2014 will be removed.

We will then use the period 1/Jan/2014 to 1/Feb/2014 as a validation dataset

In [32]:
split_date = pd.datetime(2014,2,1)

df_train = df.loc[df['day'] <= split_date]
df_test = df.loc[df['day'] > split_date]

In [33]:
len(df), len(df_train), len(df_test)

(3469352, 3339707, 129645)

In [34]:
df_train.head(n=2)

Unnamed: 0,day,LCLid,energy_median,energy_mean,energy_max,energy_count,energy_std,energy_sum,energy_min,stdorToU,Acorn,Acorn_grouped,temperatureMax,windBearing,icon,dewPoint,cloudCover,windSpeed,pressure,apparentTemperatureHigh,precipType,visibility,humidity,apparentTemperatureLow,apparentTemperatureMax,uvIndex,temperatureLow,temperatureMin,temperatureHigh,summary,apparentTemperatureMin,moonPhase,dayYear,dayMonth,dayWeek,dayDay,dayDayofweek,dayDayofyear,dayIs_month_end,dayIs_month_start,dayIs_quarter_end,dayIs_quarter_start,dayIs_year_end,dayIs_year_start,dayElapsed,daylightMins,AfterbankHoliday,BeforebankHoliday,bankHoliday_bw,bankHoliday_fw
26,2014-02-01,MAC000002,0.13,0.238021,0.799,48,0.19576,11.425,0.076,Std,ACORN-A,Affluent,9.72,217,wind,3.18,0.19,6.97,990.08,4.27,rain,11.6,0.76,2.81,6.86,1.0,6.24,4.83,8.32,Partly cloudy until evening and breezy overnight.,1.1,0.06,2014,2,5,1,5,32,False,True,False,False,False,False,1391212800,549.0,0,0,0.0,0.0
27,2014-01-31,MAC000002,0.39,0.449458,1.627,48,0.443706,21.574,0.075,Std,ACORN-A,Affluent,8.83,177,wind,3.93,0.73,4.74,998.51,3.13,rain,7.08,0.91,1.1,5.27,1.0,4.83,1.97,7.08,Overcast throughout the day and breezy startin...,0.29,0.03,2014,1,5,31,4,31,True,False,False,False,False,False,1391126400,546.0,0,0,0.0,0.0


In [35]:
cat_vars = ['LCLid', 'dayWeek', 'dayYear', 'dayMonth', 'dayDay', 'dayDayofyear', 'stdorToU','Acorn','Acorn_grouped',
            'icon','precipType','summary']

contin_vars = ['energy_median','energy_mean','energy_max','energy_count',
               'energy_std','energy_sum','energy_min','temperatureMax',
               'windBearing','dewPoint','cloudCover','windSpeed','pressure',
               'apparentTemperatureHigh','visibility','humidity','apparentTemperatureLow',
               'apparentTemperatureMax','uvIndex','temperatureLow','temperatureMin',
               'temperatureHigh','apparentTemperatureMin','moonPhase','dayElapsed','daylightMins']

n = len(df); n

3469352

In [18]:
dep = 'energy_sum'
df_train = df_train[cat_vars+contin_vars+[dep, 'day']].copy()

In [19]:
df_test[dep] = 0
df_test = df_test[cat_vars+contin_vars+[dep, 'day', 'LCLid']].copy()

In [36]:
df_test.head(n=2)

Unnamed: 0,day,LCLid,energy_median,energy_mean,energy_max,energy_count,energy_std,energy_sum,energy_min,stdorToU,Acorn,Acorn_grouped,temperatureMax,windBearing,icon,dewPoint,cloudCover,windSpeed,pressure,apparentTemperatureHigh,precipType,visibility,humidity,apparentTemperatureLow,apparentTemperatureMax,uvIndex,temperatureLow,temperatureMin,temperatureHigh,summary,apparentTemperatureMin,moonPhase,dayYear,dayMonth,dayWeek,dayDay,dayDayofweek,dayDayofyear,dayIs_month_end,dayIs_month_start,dayIs_quarter_end,dayIs_quarter_start,dayIs_year_end,dayIs_year_start,dayElapsed,daylightMins,AfterbankHoliday,BeforebankHoliday,bankHoliday_bw,bankHoliday_fw
0,2014-02-27,MAC000002,0.218,0.427458,1.35,48,0.406681,20.518,0.08,Std,ACORN-A,Affluent,10.31,224,partly-cloudy-day,3.08,0.32,4.14,1007.02,10.31,rain,12.04,0.74,0.82,10.31,2.0,3.43,3.93,10.31,Partly cloudy until evening.,1.41,0.93,2014,2,9,27,3,58,False,False,False,False,False,False,1393459200,645.0,0,0,0.0,0.0
1,2014-02-26,MAC000002,0.1515,0.256833,1.028,48,0.196095,12.328,0.076,Std,ACORN-A,Affluent,11.29,227,partly-cloudy-day,2.74,0.26,3.82,1012.73,11.29,rain,13.0,0.73,3.03,11.29,2.0,6.01,4.17,11.29,Partly cloudy throughout the day.,1.67,0.9,2014,2,9,26,2,57,False,False,False,False,False,False,1393372800,641.0,0,0,0.0,0.0


In [38]:
#for v in cat_vars: df_train[v] = df_train[v].astype('category').cat.as_ordered()
for v in cat_vars: 
    print(v)
    df_train[v] = df_train[v].astype('category')

LCLid


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.


dayWeek
dayYear
dayMonth
dayDay
dayDayofyear
stdorToU
Acorn
Acorn_grouped
icon
precipType
summary


In [25]:
df_train.drop(columns=['day'],inplace=True)

In [39]:
df_train.dtypes

day                        datetime64[ns]
LCLid                            category
energy_median                     float64
energy_mean                       float64
energy_max                        float64
energy_count                        int64
energy_std                        float64
energy_sum                        float64
energy_min                        float64
stdorToU                         category
Acorn                            category
Acorn_grouped                    category
temperatureMax                    float64
windBearing                         int64
icon                             category
dewPoint                          float64
cloudCover                        float64
windSpeed                         float64
pressure                          float64
apparentTemperatureHigh           float64
precipType                       category
visibility                        float64
humidity                          float64
apparentTemperatureLow            

In [None]:
#apply_cats rather tha train_cats

In [40]:
apply_cats(df_test, df_train)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  df[n] = pd.Categorical(c, categories=trn[n].cat.categories, ordered=True)


In [41]:
for v in contin_vars:
    df_train[v] = df_train[v].fillna(0).astype('float32')
    df_test[v] = df_test[v].fillna(0).astype('float32')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until


We're going to run on a sample.

In [42]:
idxs = get_cv_idxs(n, val_pct=150000/n)
df_samp = df_train.iloc[idxs].set_index("day")
samp_size = len(df_samp); samp_size

IndexError: positional indexers are out-of-bounds

To run on the full dataset, use this instead:

In [None]:
samp_size = n
df_samp = df_test.set_index("day")

We can now process our data...

In [None]:
df_samp.head(2)

In [None]:
#keep track of the mapper (44:15)

In [None]:
df, y, nas, mapper = proc_df(df_samp, 'energy_sum', do_scale=True)
yl = np.log(y)

In [None]:
df.head(2)

In [None]:
df_test = df_test.set_index("day")

In [None]:
#apply the same mapper to the test set

In [None]:
df_test, _, nas, mapper = proc_df(df_test, 'day', do_scale=True, skip_flds=['LCLid'],
                                  mapper=mapper, na_dict=nas)

In [None]:
df.head(100)

In time series data, cross-validation is not random. Instead, our holdout data is generally the most recent data, as it would be in real application. This issue is discussed in detail in [this post](http://www.fast.ai/2017/11/13/validation-sets/) on our web site.

One approach is to take the last 25% of rows (sorted by date) as our validation set.

In [None]:
train_ratio = 0.75
# train_ratio = 0.9
train_size = int(samp_size * train_ratio); train_size
val_idx = list(range(train_size, len(df)))

An even better option for picking a validation set is using the exact same length of time period as the test set uses - this is implemented here:

In [None]:
#val_idx = np.flatnonzero(
#    (df.index<=datetime.datetime(2014,9,17)) & (df.index>=datetime.datetime(2014,8,1)))

In [None]:
#val_idx=[0]

## DL

We're ready to put together our models.

Root-mean-squared percent error is the metric Kaggle used for this competition.

In [None]:
def inv_y(a): return np.exp(a)

def exp_rmspe(y_pred, targ):
    targ = inv_y(targ)
    pct_var = (targ - inv_y(y_pred))/targ
    return math.sqrt((pct_var**2).mean())

max_log_y = np.max(yl)
y_range = (0, max_log_y*1.2)

We can create a ModelData object directly from out data frame.

pass in the test dataframe in th usuak way

In [None]:
md = ColumnarModelData.from_data_frame(PATH, val_idx, df, yl.astype(np.float32), cat_flds=cat_vars, bs=128,
                                       test_df=df_test)

Some categorical variables have a lot more levels than others. Store, in particular, has over a thousand!

In [None]:
cat_sz = [(c, len(joined_samp[c].cat.categories)+1) for c in cat_vars]

In [None]:
cat_sz

We use the *cardinality* of each variable (that is, its number of unique values) to decide how large to make its *embeddings*. Each level will be associated with a vector with length defined as below.

In [None]:
emb_szs = [(c, min(50, (c+1)//2)) for _,c in cat_sz]

In [None]:
emb_szs

In [None]:
m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars),
                   0.04, 1, [1000,500], [0.001,0.01], y_range=y_range)
m.summary()

In [None]:
lr = 1e-3
m.lr_find()

In [None]:
m.sched.plot(100)

### Sample

In [None]:
m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars),
                   0.04, 1, [1000,500], [0.001,0.01], y_range=y_range)
lr = 1e-3

In [None]:
m.fit(lr, 3, metrics=[exp_rmspe])

In [None]:
m.fit(lr, 5, metrics=[exp_rmspe], cycle_len=1)

In [None]:
m.fit(lr, 2, metrics=[exp_rmspe], cycle_len=4)

### All

In [None]:
m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars),
                   0.04, 1, [1000,500], [0.001,0.01], y_range=y_range)
lr = 1e-3

In [None]:
m.fit(lr, 1, metrics=[exp_rmspe])

In [None]:
m.fit(lr, 3, metrics=[exp_rmspe])

In [None]:
m.fit(lr, 3, metrics=[exp_rmspe], cycle_len=1)

### Test

In [None]:
m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars),
                   0.04, 1, [1000,500], [0.001,0.01], y_range=y_range)
lr = 1e-3

In [None]:
m.fit(lr, 3, metrics=[exp_rmspe])

In [None]:
m.fit(lr, 3, metrics=[exp_rmspe], cycle_len=1)

In [None]:
m.save('val0')

In [None]:
m.load('val0')

In [None]:
x,y=m.predict_with_targs()

In [None]:
exp_rmspe(x,y)

In [None]:
pred_test=m.predict(True)

In [None]:
pred_test = np.exp(pred_test)

In [None]:
joined_test['energy_sum']=pred_test

In [None]:
csv_fn=f'{PATH}tmp/sub.csv'

In [None]:
joined_test[['LCLid','energy_sum']].to_csv(csv_fn, index=False)

In [None]:
FileLink(csv_fn)

## RF

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
((val,trn), (y_val,y_trn)) = split_by_idx(val_idx, df.values, yl)

In [None]:
m = RandomForestRegressor(n_estimators=40, max_features=0.99, min_samples_leaf=2,
                          n_jobs=-1, oob_score=True)
m.fit(trn, y_trn);

In [None]:
preds = m.predict(val)
m.score(trn, y_trn), m.score(val, y_val), m.oob_score_, exp_rmspe(preds, y_val)