# Challenge 1 Data: Smart meters in London

## Objectives 

### Overall

The objective of this project is to curate information about monitoring electricity consumption patterns using statistical and machine learning methods. 

A further aim is to facilitate collaboration between statisticians, computer scientists, data scientists and the wider community.

### Challenge specific objectives 1

https://github.com/cobleg/Hack-A-Gig/wiki

Derive insight from smart meter data. 

Use this insight to generate a seven-day-ahead forecast for each consumer. 

Provide advice to consumers on why their demand for electricity varies over the forecast week.



### Challenge specific objectives 2

https://github.com/cobleg/Hack-A-Gig/wiki/Challenge-1-Data:-Smart-meters-in-London

Develop an understanding of the consumption patterns contained in the data.

Relate the consumption patterns to statistically distinct groups of consumers.

Create a scalable forecasting process to accurately predict consumption for each group and individual household for up to a week ahead.

### Research questions

Are there any obvious seasonal (or cyclical) patterns? If so, are the cycles daily, weekly, annual, etc.?

What is the ratio of average to peak demand (known as the load factor in the power industry)?

Are any trends evident, such as growth or contraction?

The degree of correlation across consumers.

How large is the ratio of the signal to noise? That is, are the identified consumption patterns highly predictable?
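As a rough illustration of two of these questions, the sketch below computes a load factor (taken here as average demand divided by peak demand, the usual power-industry definition) and the correlation between consumers. The household names and readings are made up for the example and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical half-hourly readings (kWh per half hour) for two households.
readings = pd.DataFrame({
    "household_a": [0.10, 0.20, 0.40, 0.10, 0.20],
    "household_b": [0.20, 0.40, 0.80, 0.20, 0.40],
})

# Load factor per household: average demand divided by peak demand.
load_factor = readings.mean() / readings.max()

# Pairwise Pearson correlation of consumption across households.
correlation = readings.corr()

print(load_factor)
print(correlation)
```

A load factor close to 1 indicates flat demand, while a low value indicates a pronounced peak.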

## Prepare data for exploratory analysis

Link data files to facilitate exploration of relationships between variables.

Identify and flag missing values and outliers.

Create a time index and binary variables for the time of day, week and year.

Create a tidy data set (for information see: https://vita.had.co.nz/papers/tidy-data.pdf).
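The time-index step could look something like the sketch below. It assumes the half-hourly layout (`LCLid`, `tstp`, `energy(kWh/hh)`); the derived column names (`hh_slot`, `day_of_week`, `is_weekend`, `month`) are illustrative choices, not part of the dataset.

```python
import pandas as pd

# Small illustrative sample in the half-hourly layout.
df = pd.DataFrame({
    "LCLid": ["MAC000041"] * 3,
    "tstp": ["2011-12-09 23:30:00", "2011-12-10 00:00:00", "2011-12-10 00:30:00"],
    "energy(kWh/hh)": [0.283, 0.178, 0.082],
})

# Parse the timestamp once, then derive calendar features from it.
df["tstp"] = pd.to_datetime(df["tstp"])
df["hh_slot"] = df["tstp"].dt.hour * 2 + df["tstp"].dt.minute // 30  # 0..47
df["day_of_week"] = df["tstp"].dt.dayofweek                          # Monday = 0
df["is_weekend"] = df["day_of_week"] >= 5                            # binary flag
df["month"] = df["tstp"].dt.month

print(df)
```

Each row stays one observation and each column one variable, which keeps the result tidy.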

## Notes

We are using a fastai conda env to run this notebook.

You may also need to run the following if you get this pandas error: AttributeError: module 'pandas.core.common' has no attribute 'is_numeric_dtype'
    
    pip install -e git+https://github.com/mouradmourafiq/pandas-summary#egg=pandas-summary
    
This notebook regularly used well over 128 GB of RAM, at which point swap space was used, slowing down processing markedly.

Hence I have regularly saved intermediate results, so that the notebook can be restarted from where it left off with the memory cleared.

The only file format in which I could write and read back files larger than 2 GB was .csv, which is slow and large.
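One option that may reduce memory pressure (not something this notebook does) is to give `pd.read_csv` explicit dtypes and to treat the literal string `Null` as missing, so the energy column loads as a float rather than as Python strings. Writing with `index=False` also avoids accumulating `Unnamed: 0` columns on each save and reload. The sketch below uses an in-memory stand-in for the real file.

```python
import io
import pandas as pd

# In-memory stand-in for halfhourly_all.csv (rows are illustrative).
sample_csv = io.StringIO(
    "LCLid,tstp,energy(kWh/hh)\n"
    "MAC000041,2011-12-08 10:30:00.0000000,0.126\n"
    "MAC000041,2012-12-18 15:13:43.0000000,Null\n"
    "MAC000042,2011-12-08 11:00:00.0000000,0.12\n"
)

df = pd.read_csv(
    sample_csv,
    dtype={"LCLid": "category", "energy(kWh/hh)": "float32"},
    na_values=["Null"],  # 'Null' is not one of pandas' default NA strings
)

# Save without the index so no extra 'Unnamed: 0' column appears on reload.
out = io.StringIO()
df.to_csv(out, index=False)

print(df.dtypes)
```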

## Suggested TODOs

Create a table of summary statistics (i.e. min, mean, median, max, std. dev., range, the coefficient of variation) for both the time-series and structural variables.

Create line plots for time-series data over the following intervals: 24 hours of the day; days of the week; months of the year.

Create scatter plots of temperature, wind etc. versus electricity consumption.

Create other visualisations of the data to understand the impact of consumer segmentation on electricity consumption.

Create clustering visualisations to show interesting patterns in the data.

Apply any unsupervised learning methods to better understand the patterns in the data.
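A minimal sketch of the summary-statistics table from the first item, assuming a frame with `energy_sum` and `energy_max` columns as in the daily data. The values are illustrative only.

```python
import pandas as pd

# Illustrative daily values; the real input would be the daily_all table.
daily = pd.DataFrame({
    "energy_sum": [10.699, 11.301, 7.815, 12.569],
    "energy_max": [1.071, 0.744, 0.688, 1.128],
})

# One row per variable, one column per statistic.
summary = daily.agg(["min", "mean", "median", "max", "std"]).T
summary["range"] = summary["max"] - summary["min"]
summary["cv"] = summary["std"] / summary["mean"]  # coefficient of variation

print(summary)
```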

In [8]:
from fastai.structured import *
from fastai.column_data import *

In [9]:
import time

In [10]:
pd.set_option('display.max_columns', None)

In [18]:
INPUT_PATH='../input/'
PATH='../input/merged_data/'

Dataset background information can be found here:
    
http://jmdaignan.com/2018/01/28/Londonsmartmeter/

Kaggle link:

https://www.kaggle.com/jeanmidev/smart-meters-in-london/home
        

### HHBlock data

* The hhblock_dataset contains one row per household per day, with that day's half-hourly readings transposed into columns (as an array); for example, the hh_0 column is the consumption between 00:00 and 00:30
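The column naming can be mapped back to clock time with a small helper. The function below is a hypothetical convenience for illustration, assuming slot `n` starts `30 * n` minutes after midnight as the hh_0 example suggests.

```python
def hh_column_to_interval(column: str) -> str:
    """Map a column name such as 'hh_17' to its half-hour interval."""
    slot = int(column.split("_")[1])
    if not 0 <= slot <= 47:
        raise ValueError(f"slot out of range: {column}")
    start = slot * 30                  # minutes after midnight
    end = (start + 30) % (24 * 60)     # wraps to 00:00 for the last slot
    return f"{start // 60:02d}:{start % 60:02d}-{end // 60:02d}:{end % 60:02d}"


print(hh_column_to_interval("hh_0"))
print(hh_column_to_interval("hh_47"))
```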

In [4]:
#merge datasets
#df = pd.concat(map(pd.read_csv, glob.glob(os.path.join('../input/hhblock_dataset/', "block_*.csv"))))
#df.to_csv('../input/merged_data/hhblock_all.csv')

### Daily data

* The daily_dataset contains daily information on the consumption of the households

(Uncomment relevant block as required)



In [5]:
#concatenate all daily data
#daily_all = pd.concat(map(pd.read_csv, glob.glob(os.path.join('../input/daily_dataset/', "block_*.csv"))))
#daily_all.to_csv('../input/merged_data/daily_all.csv')

* Concatenate the daily data, keeping only days with a full set of 48 half-hourly samples

In [21]:
#list_ = []
#daily_data_folder = f'{INPUT_PATH}daily_dataset/'
#for file_ in os.listdir(daily_data_folder):
#    df = pd.read_csv(f'{INPUT_PATH}daily_dataset/{file_}')
#    df=df[df["energy_count"]==48].dropna()
#    list_.append(df)
#df = pd.concat(list_)
#df.to_csv('../input/merged_data/daily_all_48hh.csv')

### Half hourly data

* LCLid: the household id

* tstp: the timestamp of the measurement

* energy(kWh/hh): the energy consumed in the past 30 minutes, in kWh

### informations_households

* LCLid: the household id

* stdorToU: the kind of tariff applied (ToU, the dynamic time-of-use tariff that varies with the day, or Std, the classic fixed tariff)

* Acorn: the associated ACORN group, which categorises the household

* Acorn_grouped: a broader classification that merges several ACORN groups

* file: the name of the file, within the different zip files, that holds the household's data

### acorn_details

* contains index values for multiple parameters relative to the national average (which has an index of 100)

https://acorn.caci.co.uk/downloads/Acorn-User-guide.pdf

Acorn is a segmentation tool which categorises the UK’s population into demographic types. Acorn segments households, postcodes and neighbourhoods into 6 categories, 18 groups and 62 types.

In [6]:
#concatenate all 1/2 hourly data
#halfhourly_all = pd.concat(map(pd.read_csv, glob.glob(os.path.join('../input/halfhourly_dataset/', "block_*.csv"))))
#halfhourly_all.to_csv('../input/merged_data/halfhourly_all.csv')

In [40]:
table_names = ['daily_all', 'halfhourly_all', 'hhblock_all', 'acorn_details', 
               'informations_households', 'uk_bank_holidays', 
               'weather_daily_darksky', 'weather_hourly_darksky']

In [41]:
tables = [pd.read_csv(f'{PATH}{fname}.csv', low_memory=False) for fname in table_names]
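Because later cells refer to tables by position (for example `tables[3]` for acorn_details), a dictionary keyed by table name may be easier to follow. The sketch below shows the idea with in-memory stand-ins for two of the files. `tables_by_name` and `fake_files` are hypothetical names, and the CSV contents are illustrative.

```python
import io
import pandas as pd

# In-memory stand-ins for two of the CSV files under PATH.
fake_files = {
    "acorn_details": (
        "MAIN CATEGORIES,CATEGORIES,REFERENCE,ACORN-A\n"
        "POPULATION,Age,Age 0-4,77.0\n"
    ),
    "uk_bank_holidays": "Bank holidays,Type\n2012-12-26,Boxing Day\n",
}

# Key each DataFrame by its table name instead of its list position.
tables_by_name = {
    name: pd.read_csv(io.StringIO(text)) for name, text in fake_files.items()
}

acorn_details = tables_by_name["acorn_details"]
print(acorn_details.head())
```

With the real files, the dictionary comprehension would iterate over `table_names` and read `f'{PATH}{name}.csv'` instead.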

In [42]:
from IPython.display import HTML, display

In [43]:
for t, name in zip(tables, table_names): 
    print(name)
    display(t.head(n=2))

daily_all


Unnamed: 0.1,Unnamed: 0,LCLid,day,energy_median,energy_mean,energy_max,energy_count,energy_std,energy_sum,energy_min
0,0,MAC000041,2011-12-08,0.295,0.396259,1.071,27,0.285051,10.699,0.119
1,1,MAC000041,2011-12-09,0.204,0.235437,0.744,48,0.184686,11.301,0.023


halfhourly_all


Unnamed: 0.1,Unnamed: 0,LCLid,tstp,energy(kWh/hh)
0,0,MAC000041,2011-12-08 10:30:00.0000000,0.126
1,1,MAC000041,2011-12-08 11:00:00.0000000,0.12


hhblock_all


Unnamed: 0.1,Unnamed: 0,LCLid,day,hh_0,hh_1,hh_2,hh_3,hh_4,hh_5,hh_6,hh_7,hh_8,hh_9,hh_10,hh_11,hh_12,hh_13,hh_14,hh_15,hh_16,hh_17,hh_18,hh_19,hh_20,hh_21,hh_22,hh_23,hh_24,hh_25,hh_26,hh_27,hh_28,hh_29,hh_30,hh_31,hh_32,hh_33,hh_34,hh_35,hh_36,hh_37,hh_38,hh_39,hh_40,hh_41,hh_42,hh_43,hh_44,hh_45,hh_46,hh_47
0,0,MAC000041,2011-12-09,0.07,0.062,0.063,0.062,0.062,0.062,0.062,0.061,0.061,0.062,0.06,0.061,0.06,0.06,0.077,0.13,0.127,0.239,0.112,0.097,0.06,0.169,0.023,0.068,0.354,0.163,0.481,0.744,0.643,0.53,0.471,0.612,0.478,0.262,0.368,0.315,0.296,0.291,0.313,0.424,0.361,0.324,0.389,0.331,0.309,0.292,0.297,0.283
1,1,MAC000041,2011-12-10,0.178,0.082,0.063,0.063,0.062,0.063,0.062,0.062,0.063,0.062,0.061,0.053,0.024,0.024,0.081,0.267,0.136,0.369,0.208,0.077,0.17,0.065,0.068,0.062,0.074,0.069,0.172,0.089,0.692,0.299,0.24,0.158,1.038,0.723,0.265,0.255,0.239,0.275,0.44,0.317,0.383,0.347,0.403,0.216,0.212,0.223,0.186,0.189


acorn_details


Unnamed: 0,MAIN CATEGORIES,CATEGORIES,REFERENCE,ACORN-A,ACORN-B,ACORN-C,ACORN-D,ACORN-E,ACORN-F,ACORN-G,ACORN-H,ACORN-I,ACORN-J,ACORN-K,ACORN-L,ACORN-M,ACORN-N,ACORN-O,ACORN-P,ACORN-Q
0,POPULATION,Age,Age 0-4,77.0,83.0,72.0,100.0,120.0,77.0,97.0,97.0,63.0,119.0,67.0,114.0,113.0,89.0,123.0,138.0,133.0
1,POPULATION,Age,Age 5-17,117.0,109.0,87.0,69.0,94.0,95.0,102.0,106.0,67.0,95.0,64.0,108.0,116.0,86.0,89.0,136.0,106.0


informations_households


Unnamed: 0,LCLid,stdorToU,Acorn,Acorn_grouped,file
0,MAC005492,ToU,ACORN-,ACORN-,block_0
1,MAC001074,ToU,ACORN-,ACORN-,block_0


uk_bank_holidays


Unnamed: 0,Bank holidays,Type
0,2012-12-26,Boxing Day
1,2012-12-25,Christmas Day


weather_daily_darksky


Unnamed: 0,temperatureMax,temperatureMaxTime,windBearing,icon,dewPoint,temperatureMinTime,cloudCover,windSpeed,pressure,apparentTemperatureMinTime,apparentTemperatureHigh,precipType,visibility,humidity,apparentTemperatureHighTime,apparentTemperatureLow,apparentTemperatureMax,uvIndex,time,sunsetTime,temperatureLow,temperatureMin,temperatureHigh,sunriseTime,temperatureHighTime,uvIndexTime,summary,temperatureLowTime,apparentTemperatureMin,apparentTemperatureMaxTime,apparentTemperatureLowTime,moonPhase
0,11.96,2011-11-11 23:00:00,123,fog,9.4,2011-11-11 07:00:00,0.79,3.88,1016.08,2011-11-11 07:00:00,10.87,rain,3.3,0.95,2011-11-11 19:00:00,10.87,11.96,1.0,2011-11-11 00:00:00,2011-11-11 16:19:21,10.87,8.85,10.87,2011-11-11 07:12:14,2011-11-11 19:00:00,2011-11-11 11:00:00,Foggy until afternoon.,2011-11-11 19:00:00,6.48,2011-11-11 23:00:00,2011-11-11 19:00:00,0.52
1,8.59,2011-12-11 14:00:00,198,partly-cloudy-day,4.49,2011-12-11 01:00:00,0.56,3.94,1007.71,2011-12-11 02:00:00,5.62,rain,12.09,0.88,2011-12-11 19:00:00,-0.64,5.72,1.0,2011-12-11 00:00:00,2011-12-11 15:52:53,3.09,2.48,8.59,2011-12-11 07:57:02,2011-12-11 14:00:00,2011-12-11 12:00:00,Partly cloudy throughout the day.,2011-12-12 07:00:00,0.11,2011-12-11 20:00:00,2011-12-12 08:00:00,0.53


weather_hourly_darksky


Unnamed: 0,visibility,windBearing,temperature,time,dewPoint,pressure,apparentTemperature,windSpeed,precipType,icon,humidity,summary
0,5.97,104,10.24,2011-11-11 00:00:00,8.86,1016.76,10.24,2.77,rain,partly-cloudy-night,0.91,Partly Cloudy
1,4.88,99,9.76,2011-11-11 01:00:00,8.83,1016.63,8.24,2.95,rain,partly-cloudy-night,0.94,Partly Cloudy


In [44]:
for t, name in zip(tables, table_names): 
    print(name)
    display(DataFrameSummary(t).summary())

daily_all


Unnamed: 0.1,Unnamed: 0,LCLid,day,energy_median,energy_mean,energy_max,energy_count,energy_std,energy_sum,energy_min
count,3.51043e+06,,,3.5104e+06,3.5104e+06,3.5104e+06,3.51043e+06,3.4991e+06,3.5104e+06,3.5104e+06
mean,15791.6,,,0.158739,0.21173,0.834521,47.8036,0.172667,10.1241,0.0596258
std,9203.25,,,0.170186,0.190846,0.668316,2.81098,0.153121,9.12879,0.0870131
min,0,,,0,0,0,0,0,0,0
25%,7835,,,0.067,0.0980833,0.346,48,0.0691163,4.682,0.02
50%,15723,,,0.1145,0.163292,0.688,48,0.132791,7.815,0.039
75%,23629,,,0.191,0.262458,1.128,48,0.229312,12.569,0.071
max,36167,,,6.9705,6.92825,10.761,48,4.02457,332.556,6.524
counts,3510433,3510433,3510433,3510403,3510403,3510403,3510433,3499102,3510403,3510403
uniques,36168,5566,829,10437,421337,6425,44,3275190,401153,2149


halfhourly_all


Unnamed: 0.1,Unnamed: 0,LCLid,tstp,energy(kWh/hh)
count,1.67817e+08,,,
mean,754967,,,
std,439993,,,
min,0,,,
25%,374591,,,
50%,751675,,,
75%,1.12964e+06,,,
max,1.73057e+06,,,
counts,167817021,167817021,167817021,167817021
uniques,1730575,5566,40405,9611


hhblock_all


Unnamed: 0.1,Unnamed: 0,LCLid,day,hh_0,hh_1,hh_2,hh_3,hh_4,hh_5,hh_6,hh_7,hh_8,hh_9,hh_10,hh_11,hh_12,hh_13,hh_14,hh_15,hh_16,hh_17,hh_18,hh_19,hh_20,hh_21,hh_22,hh_23,hh_24,hh_25,hh_26,hh_27,hh_28,hh_29,hh_30,hh_31,hh_32,hh_33,hh_34,hh_35,hh_36,hh_37,hh_38,hh_39,hh_40,hh_41,hh_42,hh_43,hh_44,hh_45,hh_46,hh_47
count,3.46935e+06,,,3.46935e+06,3.46935e+06,3.46935e+06,3.46935e+06,3.46935e+06,3.46935e+06,3.46935e+06,3.46935e+06,3.46935e+06,3.46935e+06,3.46935e+06,3.46935e+06,3.46935e+06,3.46935e+06,3.46935e+06,3.46935e+06,3.46935e+06,3.46935e+06,3.46935e+06,3.46935e+06,3.46935e+06,3.46935e+06,3.46935e+06,3.46935e+06,3.46935e+06,3.46933e+06,3.46935e+06,3.46935e+06,3.46935e+06,3.46935e+06,3.46389e+06,3.46935e+06,3.46935e+06,3.46935e+06,3.46935e+06,3.46935e+06,3.46935e+06,3.46935e+06,3.46935e+06,3.46935e+06,3.46935e+06,3.46935e+06,3.46935e+06,3.46935e+06,3.46935e+06,3.46935e+06,3.46935e+06,3.46935e+06
mean,15607.6,,,0.179307,0.169293,0.151992,0.138064,0.127924,0.121697,0.117061,0.114255,0.113251,0.115754,0.120018,0.131474,0.146741,0.167772,0.18483,0.200603,0.21065,0.216295,0.216013,0.215544,0.214407,0.213523,0.213521,0.214799,0.215783,0.216988,0.215491,0.213231,0.210763,0.210271,0.212412,0.21997,0.231929,0.25091,0.270309,0.292199,0.306149,0.318025,0.321005,0.320821,0.3158,0.30992,0.300074,0.287665,0.266337,0.242233,0.21441,0.187216
std,9096.51,,,0.308812,0.329103,0.298032,0.265524,0.237337,0.219654,0.207623,0.196807,0.190258,0.192065,0.193695,0.207926,0.225033,0.24687,0.263282,0.276338,0.287956,0.294516,0.296167,0.295753,0.297309,0.296928,0.298446,0.298324,0.29837,0.298545,0.296088,0.292482,0.289571,0.28724,0.285779,0.291625,0.301533,0.318541,0.334514,0.352467,0.361617,0.368874,0.368126,0.363025,0.352158,0.34236,0.330115,0.31898,0.302796,0.284976,0.264942,0.241926
min,0,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
25%,7744,,,0.052,0.048,0.046,0.044,0.042,0.042,0.041,0.041,0.04,0.041,0.041,0.043,0.046,0.05,0.055,0.059,0.061,0.062,0.061,0.06,0.059,0.058,0.058,0.058,0.058,0.059,0.059,0.059,0.059,0.06,0.061,0.063,0.067,0.072,0.078,0.085,0.091,0.097,0.101,0.105,0.108,0.109,0.108,0.103,0.092,0.079,0.067,0.058
50%,15539,,,0.098,0.088,0.081,0.077,0.074,0.072,0.071,0.07,0.07,0.071,0.072,0.077,0.084,0.095,0.106,0.115,0.121,0.124,0.122,0.121,0.119,0.117,0.116,0.117,0.118,0.12,0.119,0.118,0.118,0.119,0.121,0.127,0.135,0.147,0.16,0.174,0.186,0.195,0.201,0.205,0.207,0.207,0.203,0.195,0.178,0.158,0.134,0.113
75%,23353,,,0.191,0.168,0.15,0.138,0.13,0.125,0.122,0.12,0.119,0.122,0.127,0.139,0.157,0.183,0.207,0.228,0.24,0.246,0.244,0.242,0.239,0.238,0.237,0.239,0.24,0.242,0.239,0.237,0.234,0.234,0.238,0.248,0.265,0.291,0.319,0.35,0.371,0.388,0.39,0.389,0.382,0.375,0.363,0.349,0.323,0.294,0.258,0.221
max,35800,,,7.272,8.717,8.025,8.75,8.414,8.591,7.357,7.676,7.581,7.568,7.273,7.608,8.892,8.812,9.166,8.782,9.71,9.65,9.106,9.112,8.848,8.136,9.568,7.731,9.294,8.171,7.556,8.875,8.797,8.659,8.425,8.105,9.944,10.528,10.761,8.631,8.702,9.679,8.833,9.141,8.998,9.189,8.539,9.257,7.819,8.051,7.769,8.411
counts,3469352,3469352,3469352,3469352,3469352,3469352,3469352,3469352,3469352,3469352,3469352,3469352,3469352,3469352,3469352,3469352,3469352,3469352,3469352,3469352,3469352,3469352,3469350,3469352,3469352,3469352,3469352,3469352,3469331,3469350,3469352,3469352,3469352,3463892,3469352,3469352,3469352,3469352,3469352,3469351,3469352,3469352,3469352,3469352,3469352,3469352,3469352,3469352,3469352,3469352,3469352
uniques,35801,5560,827,5569,5763,5442,5099,4785,4493,4358,4093,3829,3780,3689,3705,3769,3729,3963,3985,4120,4182,4223,4134,4203,4118,4123,4091,4042,4077,4036,3999,3996,3981,3906,3950,4076,4188,4248,4383,4409,4434,4456,4409,4366,4381,4307,4261,4090,3974,3820,3639


acorn_details


Unnamed: 0,MAIN CATEGORIES,CATEGORIES,REFERENCE,ACORN-A,ACORN-B,ACORN-C,ACORN-D,ACORN-E,ACORN-F,ACORN-G,ACORN-H,ACORN-I,ACORN-J,ACORN-K,ACORN-L,ACORN-M,ACORN-N,ACORN-O,ACORN-P,ACORN-Q
count,,,,826,826,826,826,826,826,826,826,826,826,826,826,826,826,826,826,826
mean,,,,131.313,110.86,100.081,136.858,117.895,95.5745,101.444,97.2989,87.0285,104.217,127.483,93.7242,91.4103,79.9124,95.5793,100.141,90.8554
std,,,,201.448,42.464,30.0995,97.7408,35.7688,33.6367,21.799,18.2292,30.3378,19.924,97.4282,22.177,22.9096,33.9952,25.9358,37.2103,37.634
min,,,,12,0.957011,0.281968,2,21,0,0.791419,1.15545,6.36326,16.0507,17,0.393546,0.714857,2,11,9,1
25%,,,,87,94,86,93.0922,99,81,94.1381,91,70,97,85,86,82,60.2535,86,82.25,71.25
50%,,,,104,107,100,121,117,98,102,99,88,105,109,95,93,74,96,96,87
75%,,,,128,122,113,154,135,108,109,105,101.75,115,144,102,101,93.1584,104,109,101
max,,,,3795,419,272,1159.03,286,462,295,192,410,197,1821,280,161,295,252,389,326
counts,826,826,826,826,826,826,826,826,826,826,826,826,826,826,826,826,826,826,826,826
uniques,15,84,632,237,194,167,256,191,159,136,121,159,130,247,134,140,166,155,169,173


informations_households


Unnamed: 0,LCLid,stdorToU,Acorn,Acorn_grouped,file
count,5566,5566,5566,5566,5566
unique,5566,2,19,5,112
top,MAC000488,Std,ACORN-E,Affluent,block_48
freq,1,4443,1567,2192,50
counts,5566,5566,5566,5566,5566
uniques,5566,2,19,5,112
missing,0,0,0,0,0
missing_perc,0%,0%,0%,0%,0%
types,unique,bool,categorical,categorical,categorical


uk_bank_holidays


Unnamed: 0,Bank holidays,Type
count,25,25
unique,25,11
top,2013-05-27,Boxing Day
freq,1,3
counts,25,25
uniques,25,11
missing,0,0
missing_perc,0%,0%
types,unique,categorical


weather_daily_darksky


Unnamed: 0,temperatureMax,temperatureMaxTime,windBearing,icon,dewPoint,temperatureMinTime,cloudCover,windSpeed,pressure,apparentTemperatureMinTime,apparentTemperatureHigh,precipType,visibility,humidity,apparentTemperatureHighTime,apparentTemperatureLow,apparentTemperatureMax,uvIndex,time,sunsetTime,temperatureLow,temperatureMin,temperatureHigh,sunriseTime,temperatureHighTime,uvIndexTime,summary,temperatureLowTime,apparentTemperatureMin,apparentTemperatureMaxTime,apparentTemperatureLowTime,moonPhase
count,882,,882,,882,,881,882,882,,882,,882,882,,882,882,881,,,882,882,882,,,,,,882,,,882
mean,13.6601,,195.703,,6.53003,,0.477605,3.5818,1014.13,,12.7239,,11.1671,0.781871,,6.08505,12.9295,2.54257,,,7.70984,7.41416,13.5424,,,,,,5.73804,,,0.50093
std,6.18274,,89.3408,,4.83088,,0.193514,1.69401,11.073,,7.27917,,2.46611,0.0953482,,6.03197,7.10543,1.83298,,,4.871,4.88885,6.2602,,,,,,6.04875,,,0.287022
min,-0.06,,0,,-7.84,,0,0.2,979.25,,-6.46,,1.48,0.43,,-8.88,-4.11,0,,,-5.64,-5.64,-0.81,,,,,,-8.88,,,0
25%,9.5025,,120.5,,3.18,,0.35,2.37,1007.43,,7.0325,,10.3275,0.72,,1.5225,7.3325,1,,,3.99,3.705,9.2125,,,,,,1.105,,,0.26
50%,12.625,,219,,6.38,,0.47,3.44,1014.62,,12.47,,11.97,0.79,,5.315,12.625,2,,,7.54,7.1,12.47,,,,,,4.885,,,0.5
75%,17.92,,255,,10.0575,,0.6,4.5775,1021.75,,17.91,,12.83,0.86,,11.4675,17.92,4,,,11.4675,11.2775,17.91,,,,,,11.2775,,,0.75
max,32.4,,359,,17.77,,1,9.96,1040.92,,32.42,,15.34,0.98,,20.54,32.42,7,,,20.54,20.54,32.4,,,,,,20.54,,,0.99
counts,882,882,882,882,882,882,881,882,882,882,882,882,882,882,882,882,882,881,882,882,882,882,882,882,882,881,882,882,882,882,882,882
uniques,711,882,304,6,687,882,96,466,802,882,731,2,387,49,882,718,728,8,882,882,696,694,715,882,882,881,88,882,706,882,882,100


weather_hourly_darksky


Unnamed: 0,visibility,windBearing,temperature,time,dewPoint,pressure,apparentTemperature,windSpeed,precipType,icon,humidity,summary
count,21165,21165,21165,,21165,21152,21165,21165,,,21165,
mean,11.1665,195.686,10.4715,,6.5305,1014.13,9.23034,3.90522,,,0.781829,
std,3.09934,90.6295,5.7819,,5.04197,11.3883,6.94092,2.02685,,,0.140369,
min,0.18,0,-5.64,,-9.98,975.74,-8.88,0.04,,,0.23,
25%,10.12,121,6.47,,2.82,1007.43,3.9,2.42,,,0.7,
50%,12.26,217,9.93,,6.57,1014.78,9.36,3.68,,,0.81,
75%,13.08,256,14.31,,10.33,1022.05,14.32,5.07,,,0.89,
max,16.09,359,32.4,,19.88,1043.32,32.42,14.8,,,1,
counts,21165,21165,21165,21165,21165,21152,21165,21165,21165,21165,21165,21165
uniques,953,360,2803,21165,2398,4988,3124,1095,2,7,78,13


In [45]:
#acorn_details
tables[3].head().T.head(4)


Unnamed: 0,0,1,2,3,4
MAIN CATEGORIES,POPULATION,POPULATION,POPULATION,POPULATION,POPULATION
CATEGORIES,Age,Age,Age,Age,Age
REFERENCE,Age 0-4,Age 5-17,Age 18-24,Age 25-34,Age 35-49
ACORN-A,77,117,64,52,102


In [46]:
#modify the cell below to copy tables[1] if you have already loaded all the tables above

In [11]:
#halfhourly_all
#hh_vertical = tables[1]
hh_vertical = pd.read_csv(f'{PATH}halfhourly_all.csv', low_memory=False)
hh_vertical.head(n=2)

Unnamed: 0.1,Unnamed: 0,LCLid,tstp,energy(kWh/hh)
0,0,MAC000041,2011-12-08 10:30:00.0000000,0.126
1,1,MAC000041,2011-12-08 11:00:00.0000000,0.12


In [12]:
hh_vertical.dtypes

Unnamed: 0         int64
LCLid             object
tstp              object
energy(kWh/hh)    object
dtype: object

**Format time string**

The next two string-formatting steps take a long time and use a lot of RAM (>64 GB).

In [14]:
hh_vertical[['day', 'time']] = hh_vertical['tstp'].str.split(' ', n=1, expand=True)


In [15]:
hh_vertical.head(n=2)

Unnamed: 0.1,Unnamed: 0,LCLid,tstp,energy(kWh/hh),day,time
0,0,MAC000041,2011-12-08 10:30:00.0000000,0.126,2011-12-08,10:30:00.0000000
1,1,MAC000041,2011-12-08 11:00:00.0000000,0.12,2011-12-08,11:00:00.0000000


Remove the fractional seconds from the time string

In [16]:
hh_vertical['time'] = hh_vertical['time'].str.split(".").str[0]
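An alternative to the two string splits is to parse `tstp` once with `pd.to_datetime` and format the day and time from the result. This is only a sketch on a two-row sample; whether it is faster or lighter on RAM for the full 167-million-row table has not been checked here.

```python
import pandas as pd

# Two illustrative rows using the timestamp format found in the data.
hh_vertical = pd.DataFrame({
    "LCLid": ["MAC000041", "MAC000041"],
    "tstp": ["2011-12-08 10:30:00.0000000", "2012-12-18 15:13:43.0000000"],
})

# Parse once, then derive the same 'day' and 'time' strings as above.
parsed = pd.to_datetime(hh_vertical["tstp"])
hh_vertical["day"] = parsed.dt.strftime("%Y-%m-%d")
hh_vertical["time"] = parsed.dt.strftime("%H:%M:%S")

print(hh_vertical)
```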

In [20]:
#build the list of valid half-hourly times ('00:00:00', '00:30:00', ..., '23:30:00')
#any other value in the time column is anomalous
retain_list = [f'{hour:02d}:{minute}:00' for hour in range(24) for minute in ('00', '30')]
    
#list_of_values = [3,6]

#y = df[df['A'] in list_of_values]

In [None]:
#LCLid, energy(kWh/hh)

Better to save the result, as we don't want to re-run this.

In [17]:
#error when saving as feather: TypeError: Cannot convert pyarrow.lib.ChunkedArray to pyarrow.lib.Array
#tables[1].to_feather('../input/merged_data/halfhourly_all_day_time.feather')
hh_vertical.to_csv('../input/merged_data/halfhourly_all_day_time.csv')

In [18]:
hh_vertical = pd.read_csv('../input/merged_data/halfhourly_all_day_time.csv', low_memory=False)

In [6]:
hh_vertical.head(n=2)

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,LCLid,tstp,energy(kWh/hh),day,time
0,0,0,MAC000041,2011-12-08 10:30:00.0000000,0.126,2011-12-08,10:30:00
1,1,1,MAC000041,2011-12-08 11:00:00.0000000,0.12,2011-12-08,11:00:00


In [21]:
hh_time_series = hh_vertical['time']

In [22]:
hh_time_series.unique()

array(['10:30:00', '11:00:00', '11:30:00', '12:00:00', '12:30:00', '13:00:00', '13:30:00', '14:00:00',
       '14:30:00', '15:00:00', '15:30:00', '16:00:00', '16:30:00', '17:00:00', '17:30:00', '18:00:00',
       '18:30:00', '19:00:00', '19:30:00', '20:00:00', '20:30:00', '21:00:00', '21:30:00', '22:00:00',
       '22:30:00', '23:00:00', '23:30:00', '00:00:00', '00:30:00', '01:00:00', '01:30:00', '02:00:00',
       '02:30:00', '03:00:00', '03:30:00', '04:00:00', '04:30:00', '05:00:00', '05:30:00', '06:00:00',
       '06:30:00', '07:00:00', '07:30:00', '08:00:00', '08:30:00', '09:00:00', '09:30:00', '10:00:00',
       '15:13:43', '15:13:39', '15:15:47', '15:15:48', '15:16:39', '15:16:40', '15:16:41', '15:17:18',
       '15:18:10', '15:18:50', '15:18:51', '15:18:52', '15:19:42', '15:18:30', '15:19:00', '15:19:01',
       '15:19:02', '15:19:09', '15:22:50', '15:22:52', '15:22:53', '15:23:52', '15:22:19', '15:15:59',
       '15:16:00', '15:14:14', '15:14:16', '15:14:17', '15:14:20', '15:14

In [23]:
bad_times = ['15:13:43', '15:13:39', '15:15:47', '15:15:48', '15:16:39', '15:16:40', '15:16:41', '15:17:18',
       '15:18:10', '15:18:50', '15:18:51', '15:18:52', '15:19:42', '15:18:30', '15:19:00', '15:19:01',
       '15:19:02', '15:19:09', '15:22:50', '15:22:52', '15:22:53', '15:23:52', '15:22:19', '15:15:59',
       '15:16:00', '15:14:14', '15:14:16', '15:14:17', '15:14:20', '15:14:21', '15:14:22', '15:14:23',
       '15:14:29', '15:14:34', '15:20:27', '15:20:28', '15:13:27', '15:13:28', '15:13:35', '15:13:36',
       '15:15:40', '15:15:41', '15:15:42', '15:15:44', '15:16:06', '15:16:49', '15:16:50', '15:17:07',
       '15:17:09', '15:17:10', '15:23:27', '15:24:11', '15:24:23', '15:24:26', '15:22:05', '15:19:35',
       '15:19:37', '15:19:38', '15:19:40', '15:15:23', '15:15:26', '15:15:27', '15:15:32', '15:15:33',
       '15:13:56', '15:14:06', '15:14:09', '15:22:41', '15:22:46', '15:22:47', '15:22:49', '15:15:49',
       '15:15:50', '15:15:51', '15:15:52', '15:14:54', '15:15:00', '15:13:37', '15:13:38', '15:13:40',
       '15:23:12', '15:23:18', '15:23:20', '15:15:53', '15:15:54', '15:15:55', '15:15:56', '15:15:57',
       '15:15:58', '15:16:01', '15:16:02', '15:14:10', '15:14:11', '15:14:12', '15:14:19', '15:14:30',
       '15:14:33', '15:15:46', '15:16:04', '15:16:16', '15:16:38', '15:16:44', '15:18:20', '15:18:54',
       '15:19:16', '15:19:19', '15:19:20', '15:19:43', '15:19:46', '15:19:52', '15:19:53', '15:20:12',
       '15:20:21', '15:20:22', '15:20:26', '15:21:17', '15:21:20', '15:21:23', '15:21:30', '15:16:59',
       '15:22:22', '15:22:33', '15:18:35', '15:18:58', '15:23:19', '15:24:03', '15:24:04', '15:24:06',
       '15:24:08', '15:24:16', '15:24:21', '15:24:24', '15:14:02', '15:20:29', '15:20:35', '15:14:51',
       '15:16:27', '15:16:35', '15:13:42', '15:13:50', '15:13:51', '15:15:11', '15:16:11', '15:16:13',
       '09:33:01', '15:20:47', '15:20:57', '15:20:59', '15:21:36', '15:21:39', '15:21:46', '15:16:55',
       '15:22:27', '15:21:59', '15:22:57', '15:23:04', '15:23:09', '15:23:32', '15:23:39', '15:23:49',
       '15:23:57', '15:24:22', '15:22:18', '15:20:03', '15:15:24', '15:15:31', '15:14:07', '15:14:13',
       '15:14:28', '15:20:41', '15:14:47', '15:15:01', '15:15:03', '15:16:25', '15:16:31', '15:13:41',
       '15:15:37', '15:17:11', '15:17:12', '15:17:13', '15:17:15', '15:18:24', '15:18:25', '15:18:41',
       '15:18:42', '15:18:43', '15:18:44', '15:20:43', '15:20:45', '15:20:46', '15:21:07', '15:21:56',
       '15:21:57', '15:22:00', '15:22:02', '15:18:59', '15:19:08', '15:15:20', '15:15:21', '15:15:30',
       '15:14:45', '15:14:46', '15:13:49', '15:13:26', '15:15:13', '15:15:38', '15:17:47', '15:19:12',
       '15:19:55', '15:20:13', '15:20:18', '15:20:19', '15:20:52', '15:20:54', '15:20:56', '15:21:02',
       '15:21:13', '15:21:25', '15:21:26', '15:21:32', '15:16:53', '15:17:01', '15:24:02', '15:24:12',
       '15:24:13', '15:24:14', '15:24:19', '15:19:33', '15:19:34', '15:14:38', '15:16:30', '15:16:33',
       '15:16:34', '15:15:16', '15:16:08', '15:16:17', '15:16:18', '15:17:21', '15:17:38', '15:18:09',
       '15:18:11', '15:18:16', '15:18:48', '15:19:18', '15:17:22', '15:17:30', '15:18:38', '15:21:50',
       '15:21:53', '15:22:01', '15:19:04', '15:23:03', '15:23:50', '15:22:12', '15:22:13', '15:19:30',
       '15:20:37', '15:14:59', '15:16:26', '15:16:32', '12:32:40', '15:15:12', '15:15:17', '15:15:39',
       '15:17:46', '15:18:49', '15:18:53', '15:19:10', '15:19:14', '15:20:24', '15:20:49', '15:21:22',
       '15:21:24', '15:18:29', '15:23:13', '15:23:14', '15:24:10', '15:24:15', '15:24:17', '15:24:18',
       '15:24:20', '15:24:25', '15:13:48', '15:15:10', '15:15:36', '15:15:45', '15:18:47', '15:19:21',
       '15:19:24', '15:19:44', '15:19:45', '15:20:44', '15:14:37', '15:14:50', '15:14:55', '15:14:56',
       '15:15:02', '15:16:20', '15:16:22', '15:13:53', '15:13:29', '15:13:30', '15:15:05', '15:22:04',
       '15:18:55', '15:19:56', '15:19:58', '15:19:59', '15:20:00', '15:20:01', '15:20:04', '15:20:05',
       '15:20:06', '15:15:22', '15:15:35', '15:15:43', '15:16:14', '15:19:15', '15:19:22', '15:19:51',
       '15:19:54', '15:20:14', '15:20:16', '15:20:50', '15:20:53', '15:20:55', '15:21:03', '15:21:11',
       '15:20:36', '15:13:52', '15:15:04', '15:19:05', '15:19:06', '15:19:07', '15:22:51', '15:14:24',
       '15:14:25', '15:14:26', '15:22:24', '15:14:27', '15:14:31', '15:14:32', '15:16:03', '15:16:07',
       '15:20:20', '15:20:25', '15:20:48', '15:21:08', '15:21:09', '15:21:10', '15:14:35', '15:14:36',
       '15:14:39', '15:14:43', '15:14:48', '15:16:29', '15:16:09', '15:16:42', '15:16:46', '15:16:48',
       '15:22:54', '15:22:55', '15:22:56', '15:23:16', '15:23:17', '15:23:25', '15:23:29', '15:23:30',
       '15:23:44', '15:23:45', '15:23:51', '15:23:53', '15:23:55', '15:23:58', '15:24:09', '15:22:16',
       '15:22:17', '15:19:57', '15:20:10', '15:16:21', '15:16:28', '15:16:10', '15:17:16', '15:21:38',
       '15:22:34', '15:17:54', '15:18:07', '15:22:59', '15:23:02', '15:23:15', '15:23:36', '15:24:05',
       '15:14:01', '15:22:36', '15:24:01', '15:22:06', '15:22:07', '15:22:08', '15:22:09', '15:22:10',
       '15:19:26', '15:19:27', '15:19:36', '15:22:38', '15:22:39', '15:17:43', '15:18:23', '15:20:17',
       '15:21:05', '15:21:19', '15:21:28', '15:21:35', '15:21:47', '15:16:56', '15:17:33', '15:17:34',
       '15:18:33', '15:21:54', '15:23:07', '15:23:31', '15:23:33', '15:23:41', '15:23:46', '15:23:54',
       '15:23:56', '15:24:07', '15:15:19', '15:22:40', '15:20:40', '12:37:26', '15:15:14', '15:17:19',
       '15:17:37', '15:17:40', '15:17:48', '15:18:19', '15:18:21', '15:18:46', '15:19:11', '15:17:35',
       '15:17:36', '15:22:29', '15:19:25', '15:14:08', '15:20:30', '15:20:31', '12:32:41', '15:16:43',
       '15:18:12', '15:18:18', '15:21:34', '15:21:37', '15:17:02', '15:17:03', '15:17:28', '15:17:29',
       '15:17:31', '15:22:20', '15:22:23', '15:22:26', '15:22:28', '15:22:32', '15:23:28', '15:23:34',
       '15:24:27', '15:23:35', '15:23:42', '15:23:43', '15:23:47', '15:13:47', '15:19:23', '15:19:47',
       '15:19:50', '15:20:15', '15:20:51', '15:21:00', '15:21:16', '15:21:48', '15:16:52', '15:15:34',
       '15:17:14', '15:17:41', '15:17:42', '15:17:45', '15:22:15', '15:20:09', '15:15:08', '15:19:41',
       '15:19:49', '15:21:18', '15:21:45', '15:21:49', '15:16:51', '15:17:04', '15:17:05', '15:17:06',
       '15:17:23', '15:22:21', '15:22:25', '15:22:30', '15:17:53', '15:14:57', '15:14:58', '15:15:18',
       '15:16:15', '15:17:49', '15:20:23', '15:21:01', '15:21:14', '15:20:42', '15:13:32', '15:16:37',
       '15:17:08', '15:17:20', '15:21:12', '15:21:33', '15:17:55', '15:17:57', '15:17:58', '15:18:36',
       '15:21:51', '15:22:11', '15:16:23', '15:19:13', '15:21:21', '15:22:35', '15:17:50', '15:18:13',
       '15:18:45', '15:21:06', '15:20:33', '15:13:46', '15:19:17', '15:17:32', '15:13:44', '15:13:45',
       '15:13:54', '15:23:48', '15:13:57', '15:13:58', '15:13:59', '15:14:00', '15:14:03', '12:32:39',
       '15:19:48', '15:20:58', '15:16:24', '15:16:45', '15:16:47', '15:20:02', '15:15:25', '15:15:29',
       '15:13:55', '15:13:31', '15:13:33', '15:16:19', '15:16:36', '15:15:28', '15:17:39', '15:21:27',
       '15:21:40', '15:21:41', '15:17:17', '15:18:05', '18:14:54', '15:18:57', '15:23:26', '15:15:06',
       '15:15:07', '15:17:59', '15:18:00', '15:18:03', '15:18:26', '15:18:27', '15:18:28', '15:14:52',
       '15:14:53', '15:18:17', '15:18:08', '15:18:32', '15:18:34', '15:19:31', '15:20:38', '15:14:44',
       '15:16:12', '15:17:44', '15:23:59', '15:20:07', '15:18:15', '15:18:22', '15:18:40', '15:22:03',
       '12:54:31', '15:14:42', '15:22:14', '15:19:28', '15:19:29', '15:19:32', '15:20:08', '15:20:11',
       '15:18:14', '15:18:04', '15:23:37', '15:23:38', '15:22:44', '15:13:34', '15:14:05', '15:22:37',
       '15:22:42', '15:22:43', '15:22:45', '15:22:48', '15:18:56', '15:22:58', '15:23:00', '15:23:22',
       '15:14:04', '15:18:37', '15:18:39', '15:21:52', '15:14:15', '15:14:18', '15:14:40', '15:23:08',
       '15:23:11', '15:23:23', '15:14:49', '15:21:29', '15:16:54', '15:16:57', '15:17:27', '15:17:56',
       '15:18:06', '15:24:00', '18:22:38', '15:18:31', '15:20:32', '15:20:34', '15:14:41', '15:17:51',
       '15:21:43', '15:16:58', '15:17:25', '15:23:40', '15:15:09', '15:21:55', '12:37:27', '15:17:24',
       '15:21:58', '15:19:03', '15:19:39', '15:18:01', '15:21:44', '15:17:26', '18:20:32', '15:15:15',
       '15:23:10', '15:20:39', '15:23:01', '15:23:06', '15:23:24', '12:37:28', '15:21:42', '15:17:52',
       '15:18:02', '15:22:31', '15:16:05', '15:23:05', '15:17:00', '18:15:40', '15:23:21', '18:24:09',
       '13:15:05', '18:26:48', '12:32:42', '18:23:02', '15:21:31', '18:19:44']
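Rather than pasting the anomalous values in by hand, the same list could be derived by excluding the 48 valid half-hourly times. The sketch below assumes the `time` column holds `HH:MM:SS` strings, as above. The sample rows are illustrative and `valid_times` is a name introduced for the example.

```python
import pandas as pd

# The 48 valid half-hourly times: '00:00:00', '00:30:00', ..., '23:30:00'.
valid_times = [f"{h:02d}:{m:02d}:00" for h in range(24) for m in (0, 30)]

# Illustrative sample with two off-grid timestamps.
hh_vertical = pd.DataFrame({
    "LCLid": ["MAC000041", "MAC000041", "MAC000268", "MAC000268"],
    "time": ["10:30:00", "15:13:43", "11:00:00", "15:13:39"],
    "energy(kWh/hh)": ["0.126", "Null", "0.12", "Null"],
})

# Rows whose time is not on the half-hour grid.
off_grid = ~hh_vertical["time"].isin(valid_times)
bad_times = hh_vertical.loc[off_grid, "time"].unique().tolist()

print(bad_times)
```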

In [25]:
#get bad rows
df_bad = hh_vertical.loc[hh_vertical['time'].isin(bad_times)]

In [27]:
#problem data seems to be not just associated with one LCLid
#note how energy is Null for these anomalous times
df_bad.head(10)

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,LCLid,tstp,energy(kWh/hh),day,time
18040,18040,18040,MAC000041,2012-12-18 15:13:43.0000000,Null,2012-12-18,15:13:43
57040,57040,57040,MAC000042,2012-12-18 15:13:43.0000000,Null,2012-12-18,15:13:43
96173,96173,96173,MAC000268,2012-12-18 15:13:39.0000000,Null,2012-12-18,15:13:39
135307,135307,135307,MAC000269,2012-12-18 15:13:39.0000000,Null,2012-12-18,15:13:39
168977,168977,168977,MAC000527,2012-12-18 15:15:47.0000000,Null,2012-12-18,15:15:47
202549,202549,202549,MAC000534,2012-12-18 15:15:48.0000000,Null,2012-12-18,15:15:48
218807,218807,218807,MAC000702,2012-12-18 15:16:39.0000000,Null,2012-12-18,15:16:39
251326,251326,251326,MAC000703,2012-12-18 15:16:39.0000000,Null,2012-12-18,15:16:39
283660,283660,283660,MAC000711,2012-12-18 15:16:40.0000000,Null,2012-12-18,15:16:40
315222,315222,315222,MAC000714,2012-12-18 15:16:41.0000000,Null,2012-12-18,15:16:41


In [31]:
#drop these bad rows
hh_vertical = hh_vertical.loc[hh_vertical['energy(kWh/hh)'] != 'Null']
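
The filter above relies on the string 'Null' marking every anomalous reading. Below is a minimal sketch of the same clean-up on a toy frame. It also converts the remaining readings to floats and checks that the surviving times sit on the half-hour grid. The `toy` frame and its values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for hh_vertical (values invented for illustration)
toy = pd.DataFrame({
    'LCLid': ['MAC000041', 'MAC000041', 'MAC000042'],
    'time': ['15:00:00', '15:13:43', '15:30:00'],
    'energy(kWh/hh)': ['0.126', 'Null', '0.35'],
})

# Keep only rows with a real reading, as in the cell above
clean = toy.loc[toy['energy(kWh/hh)'] != 'Null'].copy()

# The readings are text in this toy frame; convert them to floats
clean['energy(kWh/hh)'] = pd.to_numeric(clean['energy(kWh/hh)'])

# Check that every remaining time is a multiple of 30 minutes (1800 seconds)
on_grid = pd.to_timedelta(clean['time']).dt.total_seconds() % 1800 == 0
print(clean)
print(on_grid.all())
```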

In [None]:
hh_vertical.to_csv('../input/merged_data/halfhourly_all_day_time_clean.csv')

In [32]:
hh_time_series = hh_vertical['time']
hh_time_series.unique()

array(['10:30:00', '11:00:00', '11:30:00', '12:00:00', '12:30:00', '13:00:00', '13:30:00', '14:00:00',
       '14:30:00', '15:00:00', '15:30:00', '16:00:00', '16:30:00', '17:00:00', '17:30:00', '18:00:00',
       '18:30:00', '19:00:00', '19:30:00', '20:00:00', '20:30:00', '21:00:00', '21:30:00', '22:00:00',
       '22:30:00', '23:00:00', '23:30:00', '00:00:00', '00:30:00', '01:00:00', '01:30:00', '02:00:00',
       '02:30:00', '03:00:00', '03:30:00', '04:00:00', '04:30:00', '05:00:00', '05:30:00', '06:00:00',
       '06:30:00', '07:00:00', '07:30:00', '08:00:00', '08:30:00', '09:00:00', '09:30:00', '10:00:00'],
      dtype=object)

**Pivot the table to have times as columns**

In [33]:
hh = hh_vertical.pivot_table(index=['LCLid', 'day'], columns='time',
                     values='energy(kWh/hh)', aggfunc='first').reset_index()

An alternative that gives the same layout when each (LCLid, day, time) combination is unique:
<pre>
tables[1].set_index(['LCLid', 'day', 'time'])['energy(kWh/hh)'].unstack().reset_index()
</pre>
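
To see what the pivot does, here is a small sketch on an invented three-row frame. It runs both `pivot_table` and `set_index(...).unstack()`; the frame and its values are made up, only the column names follow the notebook.

```python
import pandas as pd

# Invented long-format readings: one row per household, day and time
toy = pd.DataFrame({
    'LCLid': ['MAC000002'] * 3,
    'day': ['2012-10-13', '2012-10-13', '2012-10-14'],
    'time': ['00:00:00', '00:30:00', '00:00:00'],
    'energy(kWh/hh)': [0.263, 0.269, 0.262],
})

# One row per (LCLid, day), one column per time of day
wide = toy.pivot_table(index=['LCLid', 'day'], columns='time',
                       values='energy(kWh/hh)', aggfunc='first').reset_index()

# Equivalent reshaping via a MultiIndex; requires unique (LCLid, day, time) keys
wide_alt = (toy.set_index(['LCLid', 'day', 'time'])['energy(kWh/hh)']
               .unstack().reset_index())
print(wide)
```

The missing 00:30:00 reading on the second day shows up as NaN in the wide table.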

### Save and read in half hourly pivoted data

In [34]:
hh.to_csv('../input/merged_data/halfhourly_all_pivoted.csv')

In [5]:
hh = pd.read_csv('../input/merged_data/halfhourly_all_pivoted.csv', low_memory=False)

In [59]:
hh.head()

time,LCLid,day,00:00:00,00:30:00,01:00:00,01:30:00,02:00:00,02:30:00,03:00:00,03:30:00,04:00:00,04:30:00,05:00:00,05:30:00,06:00:00,06:30:00,07:00:00,07:30:00,08:00:00,08:30:00,09:00:00,09:30:00,10:00:00,10:30:00,11:00:00,11:30:00,12:00:00,12:30:00,13:00:00,13:30:00,14:00:00,14:30:00,15:00:00,15:30:00,16:00:00,16:30:00,17:00:00,17:30:00,18:00:00,18:30:00,19:00:00,19:30:00,20:00:00,20:30:00,21:00:00,21:30:00,22:00:00,22:30:00,23:00:00,23:30:00,LCL_day_uid
0,MAC000002,2012-10-12,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.143,0.663,0.256,0.155,0.199,0.125,0.165,0.14,0.148,0.154,0.137,0.493,0.354,0.228,0.195,0.527,0.886,0.198,0.243,0.193,0.342,0.27,0.325,0.269,0.29,MAC000002_2012-10-12
1,MAC000002,2012-10-13,0.263,0.269,0.275,0.256,0.211,0.136,0.161,0.119,0.167,0.109,0.168,0.107,0.166,0.117,0.157,0.126,0.146,0.106,0.135,0.191,0.915,0.933,0.122,0.138,0.076,0.133,0.076,0.133,0.085,0.263,0.134,0.235,0.124,0.184,0.23,0.176,0.388,0.26,0.918,0.278,0.267,0.239,0.23,0.233,0.235,0.188,0.259,0.25,MAC000002_2012-10-13
2,MAC000002,2012-10-14,0.262,0.166,0.226,0.088,0.126,0.082,0.123,0.083,0.12,0.079,0.121,0.075,0.124,0.073,0.125,0.07,0.13,0.108,0.196,0.346,0.524,0.076,0.129,0.667,0.23,0.22,0.163,0.091,0.17,0.11,0.11,0.121,0.099,0.157,0.093,0.371,0.386,1.085,1.075,0.956,0.821,0.745,0.712,0.511,0.231,0.21,0.278,0.159,MAC000002_2012-10-14
3,MAC000002,2012-10-15,0.192,0.097,0.141,0.083,0.132,0.07,0.13,0.074,0.124,0.078,0.118,0.082,0.112,0.087,0.106,0.14,0.12,1.075,0.146,0.123,0.082,0.127,0.077,0.551,0.149,0.129,0.075,0.13,0.075,0.129,0.075,0.128,0.166,0.194,0.695,0.26,0.227,0.255,1.164,0.249,0.225,0.258,0.26,0.334,0.299,0.236,0.241,0.237,MAC000002_2012-10-15
4,MAC000002,2012-10-16,0.237,0.237,0.193,0.118,0.098,0.107,0.094,0.109,0.091,0.105,0.091,0.104,0.092,0.103,0.093,0.101,0.144,0.1,0.408,0.102,0.1,0.116,0.354,0.146,0.19,0.991,0.31,0.121,0.113,0.094,0.119,0.087,0.13,0.238,0.204,0.284,0.447,0.266,0.966,0.172,0.192,0.228,0.203,0.211,0.188,0.213,0.157,0.202,MAC000002_2012-10-16


<pre>
tables[0] is 'daily_all'
</pre>

In [36]:
mem_test=tables[0].memory_usage(index=True).sum()
print("daily_all dataset uses ",mem_test/ 1024**2," MB")

daily_all dataset uses  1339.112548828125  MB


In [47]:
#print(hh.dtypes)

Create a column to split data frames on

In [48]:
tables[0]['LCL_day_uid'] = tables[0]['LCLid'] + '_' + tables[0]['day']


In [49]:
hh['LCL_day_uid'] = hh['LCLid'] + '_' + hh['day'] 

In [50]:
LCL_day_uid_0 = tables[0]['LCL_day_uid']
LCL_day_uid_2 = hh['LCL_day_uid']

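41k more rows in 'daily_all' than 'hhblock_all'; the unique LCL_day_uid counts below differ by only 30 between 'daily_all' and the pivoted half-hourly table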

In [51]:
n_daily, n_hh = len(set(LCL_day_uid_0)), len(set(LCL_day_uid_2))
n_daily, n_hh, n_daily - n_hh

(3510433, 3510403, 30)

In [52]:
#missing = list(set(LCL_day_uid_0) - set(LCL_day_uid_2))
#missing
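
The commented-out set difference above would list the keys present in 'daily_all' but absent from the pivoted table. A small sketch of that check follows, using invented keys. On the real data the inputs would be `LCL_day_uid_0` and `LCL_day_uid_2`.

```python
import pandas as pd

# Invented keys standing in for LCL_day_uid_0 and LCL_day_uid_2
uid_daily = pd.Series(['MAC000002_2012-10-12',
                       'MAC000002_2012-10-13',
                       'MAC000041_2011-12-08'])
uid_hh = pd.Series(['MAC000002_2012-10-12', 'MAC000041_2011-12-08'])

# Keys that appear in the daily table but not in the half-hourly one
missing = sorted(set(uid_daily) - set(uid_hh))

# Same answer as a boolean mask, which can also be used to filter the dataframe
missing_mask = ~uid_daily.isin(uid_hh)
print(missing)
print(uid_daily[missing_mask].tolist())
```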

### Join data

join_df is a function for joining tables on specific fields. By default it does a left outer join of the right table onto the left table, using the given fields for each table.

Pandas does joins using the merge method. The suffixes argument describes the naming convention for duplicate fields. We've elected to leave the duplicate field names on the left untouched, and append a "_y" to those on the right.

**Very memory intensive**

Combining the full dataframes uses far more than 64 GB of RAM; a workaround is to split them into chunks and then join each chunk pair

In [53]:
def join_df(left, right, left_on, right_on=None, suffix='_y'):
    if right_on is None: right_on = left_on
    return left.merge(right, how='left', left_on=left_on, right_on=right_on, 
                      suffixes=("", suffix))
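
Here is a short usage sketch for join_df on two invented frames. It shows that all left rows are kept and that a clashing column name on the right receives the "_y" suffix. The function is repeated so the example runs on its own; the frames and values are made up.

```python
import pandas as pd

def join_df(left, right, left_on, right_on=None, suffix='_y'):
    if right_on is None: right_on = left_on
    return left.merge(right, how='left', left_on=left_on, right_on=right_on,
                      suffixes=("", suffix))

# Invented daily summary and half-hourly tables sharing a key column
daily = pd.DataFrame({'LCL_day_uid': ['a_1', 'a_2', 'b_1'],
                      'LCLid': ['a', 'a', 'b'],
                      'energy_sum': [10.7, 11.3, 7.5]})
half_hourly = pd.DataFrame({'LCL_day_uid': ['a_1', 'b_1'],
                            'LCLid': ['a', 'b'],
                            '00:00:00': [0.07, 0.35]})

# 'a_2' has no match on the right, so its right-hand columns are NaN
joined = join_df(daily, half_hourly, 'LCL_day_uid')
print(joined)
```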

In [54]:
hh.head(n=2)

time,LCLid,day,00:00:00,00:30:00,01:00:00,01:30:00,02:00:00,02:30:00,03:00:00,03:30:00,04:00:00,04:30:00,05:00:00,05:30:00,06:00:00,06:30:00,07:00:00,07:30:00,08:00:00,08:30:00,09:00:00,09:30:00,10:00:00,10:30:00,11:00:00,11:30:00,12:00:00,12:30:00,13:00:00,13:30:00,14:00:00,14:30:00,15:00:00,15:30:00,16:00:00,16:30:00,17:00:00,17:30:00,18:00:00,18:30:00,19:00:00,19:30:00,20:00:00,20:30:00,21:00:00,21:30:00,22:00:00,22:30:00,23:00:00,23:30:00,LCL_day_uid
0,MAC000002,2012-10-12,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.143,0.663,0.256,0.155,0.199,0.125,0.165,0.14,0.148,0.154,0.137,0.493,0.354,0.228,0.195,0.527,0.886,0.198,0.243,0.193,0.342,0.27,0.325,0.269,0.29,MAC000002_2012-10-12
1,MAC000002,2012-10-13,0.263,0.269,0.275,0.256,0.211,0.136,0.161,0.119,0.167,0.109,0.168,0.107,0.166,0.117,0.157,0.126,0.146,0.106,0.135,0.191,0.915,0.933,0.122,0.138,0.076,0.133,0.076,0.133,0.085,0.263,0.134,0.235,0.124,0.184,0.23,0.176,0.388,0.26,0.918,0.278,0.267,0.239,0.23,0.233,0.235,0.188,0.259,0.25,MAC000002_2012-10-13


In [55]:
tables[0].head(n=2)

Unnamed: 0.1,Unnamed: 0,LCLid,day,energy_median,energy_mean,energy_max,energy_count,energy_std,energy_sum,energy_min,LCL_day_uid
0,0,MAC000041,2011-12-08,0.295,0.396259,1.071,27,0.285051,10.699,0.119,MAC000041_2011-12-08
1,1,MAC000041,2011-12-09,0.204,0.235437,0.744,48,0.184686,11.301,0.023,MAC000041_2011-12-09


In [56]:
hh_combined = pd.merge(tables[0], hh, on='LCL_day_uid', how='outer')

In [57]:
hh_combined.head(n=2)

Unnamed: 0.1,Unnamed: 0,LCLid_x,day_x,energy_median,energy_mean,energy_max,energy_count,energy_std,energy_sum,energy_min,LCL_day_uid,LCLid_y,day_y,00:00:00,00:30:00,01:00:00,01:30:00,02:00:00,02:30:00,03:00:00,03:30:00,04:00:00,04:30:00,05:00:00,05:30:00,06:00:00,06:30:00,07:00:00,07:30:00,08:00:00,08:30:00,09:00:00,09:30:00,10:00:00,10:30:00,11:00:00,11:30:00,12:00:00,12:30:00,13:00:00,13:30:00,14:00:00,14:30:00,15:00:00,15:30:00,16:00:00,16:30:00,17:00:00,17:30:00,18:00:00,18:30:00,19:00:00,19:30:00,20:00:00,20:30:00,21:00:00,21:30:00,22:00:00,22:30:00,23:00:00,23:30:00
0,0,MAC000041,2011-12-08,0.295,0.396259,1.071,27,0.285051,10.699,0.119,MAC000041_2011-12-08,MAC000041,2011-12-08,,,,,,,,,,,,,,,,,,,,,,0.126,0.12,0.119,0.425,0.154,0.551,0.264,1.071,1.015,0.957,0.676,0.266,0.319,0.243,0.341,0.295,0.849,0.53,0.539,0.176,0.196,0.303,0.26,0.317,0.272,0.162,0.153
1,1,MAC000041,2011-12-09,0.204,0.235437,0.744,48,0.184686,11.301,0.023,MAC000041_2011-12-09,MAC000041,2011-12-09,0.07,0.062,0.063,0.062,0.062,0.062,0.062,0.061,0.061,0.062,0.06,0.061,0.06,0.06,0.077,0.13,0.127,0.239,0.112,0.097,0.06,0.169,0.023,0.068,0.354,0.163,0.481,0.744,0.643,0.53,0.471,0.612,0.478,0.262,0.368,0.315,0.296,0.291,0.313,0.424,0.361,0.324,0.389,0.331,0.309,0.292,0.297,0.283


### Save file

In [58]:
hh_combined.to_feather('../input/merged_data/halfhourly_combined.feather')

In [60]:
len(hh_combined)

3510433

### Read file

In [None]:
hh_combined = pd.read_feather('../input/merged_data/halfhourly_combined.feather')

In [61]:
hh_combined.head(n=2)

Unnamed: 0.1,Unnamed: 0,LCLid_x,day_x,energy_median,energy_mean,energy_max,energy_count,energy_std,energy_sum,energy_min,LCL_day_uid,LCLid_y,day_y,00:00:00,00:30:00,01:00:00,01:30:00,02:00:00,02:30:00,03:00:00,03:30:00,04:00:00,04:30:00,05:00:00,05:30:00,06:00:00,06:30:00,07:00:00,07:30:00,08:00:00,08:30:00,09:00:00,09:30:00,10:00:00,10:30:00,11:00:00,11:30:00,12:00:00,12:30:00,13:00:00,13:30:00,14:00:00,14:30:00,15:00:00,15:30:00,16:00:00,16:30:00,17:00:00,17:30:00,18:00:00,18:30:00,19:00:00,19:30:00,20:00:00,20:30:00,21:00:00,21:30:00,22:00:00,22:30:00,23:00:00,23:30:00
0,0,MAC000041,2011-12-08,0.295,0.396259,1.071,27,0.285051,10.699,0.119,MAC000041_2011-12-08,MAC000041,2011-12-08,,,,,,,,,,,,,,,,,,,,,,0.126,0.12,0.119,0.425,0.154,0.551,0.264,1.071,1.015,0.957,0.676,0.266,0.319,0.243,0.341,0.295,0.849,0.53,0.539,0.176,0.196,0.303,0.26,0.317,0.272,0.162,0.153
1,1,MAC000041,2011-12-09,0.204,0.235437,0.744,48,0.184686,11.301,0.023,MAC000041_2011-12-09,MAC000041,2011-12-09,0.07,0.062,0.063,0.062,0.062,0.062,0.062,0.061,0.061,0.062,0.06,0.061,0.06,0.06,0.077,0.13,0.127,0.239,0.112,0.097,0.06,0.169,0.023,0.068,0.354,0.163,0.481,0.744,0.643,0.53,0.471,0.612,0.478,0.262,0.368,0.315,0.296,0.291,0.313,0.424,0.361,0.324,0.389,0.331,0.309,0.292,0.297,0.283


In [62]:
len(tables[0]), len(tables[2]), len(hh_combined)

(3510433, 3469352, 3510433)

**Join UK Bank Holidays**

In [63]:
bh = tables[5].rename(columns={'Bank holidays': 'day'})
hh_combined.rename(columns={'day_x': 'day'}, inplace=True)

In [64]:
hh_combined = pd.merge(hh_combined, bh, on='day', how='outer')

In [65]:
hh_combined.rename(columns={'Type': 'Bank_holiday'}, inplace=True)

In [66]:
hh_combined.head(n=2)

Unnamed: 0.1,Unnamed: 0,LCLid_x,day,energy_median,energy_mean,energy_max,energy_count,energy_std,energy_sum,energy_min,LCL_day_uid,LCLid_y,day_y,00:00:00,00:30:00,01:00:00,01:30:00,02:00:00,02:30:00,03:00:00,03:30:00,04:00:00,04:30:00,05:00:00,05:30:00,06:00:00,06:30:00,07:00:00,07:30:00,08:00:00,08:30:00,09:00:00,09:30:00,10:00:00,10:30:00,11:00:00,11:30:00,12:00:00,12:30:00,13:00:00,13:30:00,14:00:00,14:30:00,15:00:00,15:30:00,16:00:00,16:30:00,17:00:00,17:30:00,18:00:00,18:30:00,19:00:00,19:30:00,20:00:00,20:30:00,21:00:00,21:30:00,22:00:00,22:30:00,23:00:00,23:30:00,Bank_holiday
0,0.0,MAC000041,2011-12-08,0.295,0.396259,1.071,27.0,0.285051,10.699,0.119,MAC000041_2011-12-08,MAC000041,2011-12-08,,,,,,,,,,,,,,,,,,,,,,0.126,0.12,0.119,0.425,0.154,0.551,0.264,1.071,1.015,0.957,0.676,0.266,0.319,0.243,0.341,0.295,0.849,0.53,0.539,0.176,0.196,0.303,0.26,0.317,0.272,0.162,0.153,
1,814.0,MAC000042,2011-12-08,0.2435,0.288077,0.806,26.0,0.153612,7.49,0.081,MAC000042_2011-12-08,MAC000042,2011-12-08,,,,,,,,,,,,,,,,,,,,,,,0.35,0.3,0.212,0.31,0.247,0.284,0.183,0.168,0.163,0.221,0.223,0.202,0.197,0.238,0.3,0.24,0.381,0.623,0.349,0.336,0.423,0.806,0.366,0.185,0.102,0.081,


In [67]:
hh_combined.rename(columns={'LCLid_x': 'LCLid'}, inplace=True)

### Save merged 1/2 hourly power and holiday data

In [68]:
hh_combined.to_feather('../input/merged_data/halfhourly_combined_bank.feather')

**read file**

In [None]:
hh_combined = pd.read_feather('../input/merged_data/halfhourly_combined_bank.feather')

## Weather data

In [34]:
tables[7][['day', 'time']] = tables[7]['time'].str.split(' ', n=1, expand=True)

In [35]:
tables[7].drop(columns=['icon'], inplace=True)

Make a copy of the hourly weather data to use with the half-hourly power signal

In [40]:
tables[7].to_feather('../input/merged_data/weather_halfhourly_darksky_temp.feather')

In [24]:
hh_weather = pd.read_feather('../input/merged_data/weather_halfhourly_darksky_temp.feather')

In [25]:
hh_weather.head()

Unnamed: 0,visibility,windBearing,temperature,time,dewPoint,pressure,apparentTemperature,windSpeed,precipType,humidity,summary,day,time_plus_half
0,5.97,104,10.24,00:00:00,8.86,1016.76,10.24,2.77,rain,0.91,Partly Cloudy,2011-11-11,00:00:00
1,4.88,99,9.76,01:00:00,8.83,1016.63,8.24,2.95,rain,0.94,Partly Cloudy,2011-11-11,01:00:00
2,3.7,98,9.46,02:00:00,8.79,1016.36,7.76,3.17,rain,0.96,Partly Cloudy,2011-11-11,02:00:00
3,3.12,99,9.23,03:00:00,8.63,1016.28,7.44,3.25,rain,0.96,Foggy,2011-11-11,03:00:00
4,1.85,111,9.26,04:00:00,9.21,1015.98,7.24,3.7,rain,1.0,Foggy,2011-11-11,04:00:00


**Interpolate every half hour linearly**

In [26]:
hh_weather = hh_weather.reindex(np.arange(len(hh_weather.index) * 2) / 2).interpolate().reset_index(drop=True)
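
The one-liner above reindexes onto the positions 0, 0.5, 1, 1.5, and so on, which inserts an empty row after every hourly row. `interpolate()` then fills the numeric gaps linearly. A toy sketch with invented readings follows. It assumes a default integer index and uses numeric columns only, since text columns are handled by the forward fill later on.

```python
import numpy as np
import pandas as pd

# Invented hourly readings
hourly = pd.DataFrame({'temperature': [10.0, 9.0, 8.5],
                       'humidity': [0.90, 0.94, 0.96]})

# Labels 0, 0.5, 1, 1.5, ...: the .5 labels do not exist yet, so they become NaN rows
half_index = np.arange(len(hourly.index) * 2) / 2

# Linear interpolation fills each NaN row with the midpoint of its neighbours
half_hourly = hourly.reindex(half_index).interpolate().reset_index(drop=True)
print(half_hourly)
```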

In [27]:
hh_weather.head()

Unnamed: 0,visibility,windBearing,temperature,time,dewPoint,pressure,apparentTemperature,windSpeed,precipType,humidity,summary,day,time_plus_half
0,5.97,104.0,10.24,00:00:00,8.86,1016.76,10.24,2.77,rain,0.91,Partly Cloudy,2011-11-11,00:00:00
1,5.425,101.5,10.0,,8.845,1016.695,9.24,2.86,,0.925,,,
2,4.88,99.0,9.76,01:00:00,8.83,1016.63,8.24,2.95,rain,0.94,Partly Cloudy,2011-11-11,01:00:00
3,4.29,98.5,9.61,,8.81,1016.495,8.0,3.06,,0.95,,,
4,3.7,98.0,9.46,02:00:00,8.79,1016.36,7.76,3.17,rain,0.96,Partly Cloudy,2011-11-11,02:00:00


In [28]:
#fill NaN with previous value
hh_weather = hh_weather.ffill()

In [29]:
#convert time string to timedelta
hh_weather['time'] = pd.to_timedelta(hh_weather['time'])

In [30]:
# create series by flooring by hour, then adding 30 minutes
s = hh_weather['time'].dt.floor('h') + pd.Timedelta(minutes=30)

In [31]:
# assign new series conditional on index
hh_weather['time'] = np.where(hh_weather.index % 2, s, hh_weather['time'])
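
After the forward fill, each inserted row carries the time of the hour before it, so the odd-numbered rows need relabelling to the half hour. A compact sketch of that step on invented times is shown here. It uses `Series.where` in place of `np.where`, as one way to keep the result a timedelta Series.

```python
import numpy as np
import pandas as pd

# Times after interpolation and forward fill: each hour appears twice
times = pd.to_timedelta(pd.Series(['00:00:00', '00:00:00', '01:00:00', '01:00:00']))

# Floor to the hour and add 30 minutes
shifted = times.dt.floor('h') + pd.Timedelta(minutes=30)

# Keep the original time on even rows, use the shifted time on odd rows
is_odd = np.arange(len(times)) % 2 == 1
relabelled = times.where(~is_odd, shifted)
print(relabelled)
```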

In [35]:
#convert timedelta to datetime so it can be saved in feather format
hh_weather['time'] = pd.to_datetime(hh_weather['time'])
#hh_weather.drop(columns=['time_plus_half'], inplace=True)

### Save 1/2 hr interpolated weather data

In [36]:
hh_weather.to_feather('../input/merged_data/hh_weather_interpolated.feather')

### Read in interpolated weather data

In [69]:
hh_weather = pd.read_feather('../input/merged_data/hh_weather_interpolated.feather')
hh_weather['time'] = hh_weather['time'].dt.time

In [70]:
hh_weather.head()

Unnamed: 0,visibility,windBearing,temperature,time,dewPoint,pressure,apparentTemperature,windSpeed,precipType,humidity,summary,day
0,5.97,104.0,10.24,00:00:00,8.86,1016.76,10.24,2.77,rain,0.91,Partly Cloudy,2011-11-11
1,5.425,101.5,10.0,00:30:00,8.845,1016.695,9.24,2.86,rain,0.925,Partly Cloudy,2011-11-11
2,4.88,99.0,9.76,01:00:00,8.83,1016.63,8.24,2.95,rain,0.94,Partly Cloudy,2011-11-11
3,4.29,98.5,9.61,01:30:00,8.81,1016.495,8.0,3.06,rain,0.95,Partly Cloudy,2011-11-11
4,3.7,98.0,9.46,02:00:00,8.79,1016.36,7.76,3.17,rain,0.96,Partly Cloudy,2011-11-11


## Reformat power data with time vertically

In [5]:
#read in the merged power,holiday dataset
hh_combined = pd.read_feather('../input/merged_data/halfhourly_combined_bank.feather')

In [6]:
hh_combined.head(n=2)

Unnamed: 0.1,Unnamed: 0,LCLid,day,energy_median,energy_mean,energy_max,energy_count,energy_std,energy_sum,energy_min,LCL_day_uid,LCLid_y,day_y,00:00:00,00:30:00,01:00:00,01:30:00,02:00:00,02:30:00,03:00:00,03:30:00,04:00:00,04:30:00,05:00:00,05:30:00,06:00:00,06:30:00,07:00:00,07:30:00,08:00:00,08:30:00,09:00:00,09:30:00,10:00:00,10:30:00,11:00:00,11:30:00,12:00:00,12:30:00,13:00:00,13:30:00,14:00:00,14:30:00,15:00:00,15:30:00,16:00:00,16:30:00,17:00:00,17:30:00,18:00:00,18:30:00,19:00:00,19:30:00,20:00:00,20:30:00,21:00:00,21:30:00,22:00:00,22:30:00,23:00:00,23:30:00,Bank_holiday
0,0.0,MAC000041,2011-12-08,0.295,0.396259,1.071,27.0,0.285051,10.699,0.119,MAC000041_2011-12-08,MAC000041,2011-12-08,,,,,,,,,,,,,,,,,,,,,,0.126,0.12,0.119,0.425,0.154,0.551,0.264,1.071,1.015,0.957,0.676,0.266,0.319,0.243,0.341,0.295,0.849,0.53,0.539,0.176,0.196,0.303,0.26,0.317,0.272,0.162,0.153,
1,814.0,MAC000042,2011-12-08,0.2435,0.288077,0.806,26.0,0.153612,7.49,0.081,MAC000042_2011-12-08,MAC000042,2011-12-08,,,,,,,,,,,,,,,,,,,,,,,0.35,0.3,0.212,0.31,0.247,0.284,0.183,0.168,0.163,0.221,0.223,0.202,0.197,0.238,0.3,0.24,0.381,0.623,0.349,0.336,0.423,0.806,0.366,0.185,0.102,0.081,


In [7]:
hh_combined.drop(columns=['Unnamed: 0', 'LCLid_y', 'day_y'], inplace=True)

In [8]:
hh_combined.head(n=2)

Unnamed: 0,LCLid,day,energy_median,energy_mean,energy_max,energy_count,energy_std,energy_sum,energy_min,LCL_day_uid,00:00:00,00:30:00,01:00:00,01:30:00,02:00:00,02:30:00,03:00:00,03:30:00,04:00:00,04:30:00,05:00:00,05:30:00,06:00:00,06:30:00,07:00:00,07:30:00,08:00:00,08:30:00,09:00:00,09:30:00,10:00:00,10:30:00,11:00:00,11:30:00,12:00:00,12:30:00,13:00:00,13:30:00,14:00:00,14:30:00,15:00:00,15:30:00,16:00:00,16:30:00,17:00:00,17:30:00,18:00:00,18:30:00,19:00:00,19:30:00,20:00:00,20:30:00,21:00:00,21:30:00,22:00:00,22:30:00,23:00:00,23:30:00,Bank_holiday
0,MAC000041,2011-12-08,0.295,0.396259,1.071,27.0,0.285051,10.699,0.119,MAC000041_2011-12-08,,,,,,,,,,,,,,,,,,,,,,0.126,0.12,0.119,0.425,0.154,0.551,0.264,1.071,1.015,0.957,0.676,0.266,0.319,0.243,0.341,0.295,0.849,0.53,0.539,0.176,0.196,0.303,0.26,0.317,0.272,0.162,0.153,
1,MAC000042,2011-12-08,0.2435,0.288077,0.806,26.0,0.153612,7.49,0.081,MAC000042_2011-12-08,,,,,,,,,,,,,,,,,,,,,,,0.35,0.3,0.212,0.31,0.247,0.284,0.183,0.168,0.163,0.221,0.223,0.202,0.197,0.238,0.3,0.24,0.381,0.623,0.349,0.336,0.423,0.806,0.366,0.185,0.102,0.081,


**Full melt version**

In [12]:
#If the line below is run at the start of notebook execution, RAM is not an issue

#However, use the chunking version further down if this cannot be run

In [9]:
hh_melted = hh_combined.melt(id_vars=['LCLid', 'day', 'energy_median', 'energy_mean', 'energy_max', 'energy_count', 'energy_std', 'energy_sum', 'energy_min', 'LCL_day_uid', 'Bank_holiday'],
                            var_name='time', value_name='energy(kWh/hh)')

In [10]:
hh_melted.to_csv('../input/merged_data/hh_melted_all.csv')

In [11]:
len(hh_melted)

168501120

In [14]:
hh_melted = hh_melted.sort_values(by=['LCLid', 'day', 'time'])

In [15]:
hh_melted.reset_index(drop=True, inplace=True)

In [16]:
hh_melted.head()

Unnamed: 0,LCLid,day,energy_median,energy_mean,energy_max,energy_count,energy_std,energy_sum,energy_min,LCL_day_uid,Bank_holiday,time,energy(kWh/hh)
0,MAC000002,2012-10-12,0.1385,0.154304,0.886,46.0,0.196034,7.098,0.0,MAC000002_2012-10-12,,00:00:00,
1,MAC000002,2012-10-12,0.1385,0.154304,0.886,46.0,0.196034,7.098,0.0,MAC000002_2012-10-12,,00:30:00,0.0
2,MAC000002,2012-10-12,0.1385,0.154304,0.886,46.0,0.196034,7.098,0.0,MAC000002_2012-10-12,,01:00:00,0.0
3,MAC000002,2012-10-12,0.1385,0.154304,0.886,46.0,0.196034,7.098,0.0,MAC000002_2012-10-12,,01:30:00,0.0
4,MAC000002,2012-10-12,0.1385,0.154304,0.886,46.0,0.196034,7.098,0.0,MAC000002_2012-10-12,,02:00:00,0.0


### Save melted time column dataset

This took more than 30 minutes on my machine (the file is 26.1 GB)

In [17]:
hh_melted.to_csv('../input/merged_data/hh_melted_all.csv')

**Chunking version - Melt in chunks, then combine chunks at end**

In [10]:
#divide into chunks, save each processed chunk as we go
NUM_CHUNKS=10
chunks = np.array_split(hh_combined, NUM_CHUNKS)

In [11]:
chunks[0].head()

Unnamed: 0,LCLid,day,energy_median,energy_mean,energy_max,energy_count,energy_std,energy_sum,energy_min,LCL_day_uid,00:00:00,00:30:00,01:00:00,01:30:00,02:00:00,02:30:00,03:00:00,03:30:00,04:00:00,04:30:00,05:00:00,05:30:00,06:00:00,06:30:00,07:00:00,07:30:00,08:00:00,08:30:00,09:00:00,09:30:00,10:00:00,10:30:00,11:00:00,11:30:00,12:00:00,12:30:00,13:00:00,13:30:00,14:00:00,14:30:00,15:00:00,15:30:00,16:00:00,16:30:00,17:00:00,17:30:00,18:00:00,18:30:00,19:00:00,19:30:00,20:00:00,20:30:00,21:00:00,21:30:00,22:00:00,22:30:00,23:00:00,23:30:00,Bank_holiday
0,MAC000041,2011-12-08,0.295,0.396259,1.071,27.0,0.285051,10.699,0.119,MAC000041_2011-12-08,,,,,,,,,,,,,,,,,,,,,,0.126,0.12,0.119,0.425,0.154,0.551,0.264,1.071,1.015,0.957,0.676,0.266,0.319,0.243,0.341,0.295,0.849,0.53,0.539,0.176,0.196,0.303,0.26,0.317,0.272,0.162,0.153,
1,MAC000042,2011-12-08,0.2435,0.288077,0.806,26.0,0.153612,7.49,0.081,MAC000042_2011-12-08,,,,,,,,,,,,,,,,,,,,,,,0.35,0.3,0.212,0.31,0.247,0.284,0.183,0.168,0.163,0.221,0.223,0.202,0.197,0.238,0.3,0.24,0.381,0.623,0.349,0.336,0.423,0.806,0.366,0.185,0.102,0.081,
2,MAC000268,2011-12-08,0.0795,0.093625,0.354,48.0,0.067913,4.494,0.006,MAC000268_2011-12-08,0.132,0.084,0.106,0.106,0.056,0.071,0.057,0.052,0.054,0.01,0.04,0.014,0.013,0.04,0.006,0.025,0.029,0.007,0.051,0.052,0.077,0.122,0.073,0.109,0.069,0.031,0.057,0.05,0.04,0.103,0.08,0.112,0.16,0.354,0.185,0.171,0.118,0.092,0.115,0.079,0.127,0.156,0.117,0.251,0.221,0.169,0.131,0.12,
3,MAC000269,2011-12-08,0.0395,0.063417,0.18,48.0,0.05062,3.044,0.005,MAC000269_2011-12-08,0.025,0.03,0.019,0.016,0.029,0.027,0.012,0.03,0.031,0.009,0.025,0.03,0.016,0.019,0.03,0.157,0.18,0.061,0.066,0.135,0.032,0.032,0.148,0.01,0.06,0.078,0.005,0.039,0.091,0.072,0.044,0.113,0.12,0.032,0.053,0.178,0.067,0.135,0.112,0.073,0.142,0.154,0.129,0.04,0.045,0.037,0.017,0.039,
4,MAC000156,2011-12-08,0.2505,0.259833,0.543,48.0,0.124164,12.472,0.118,MAC000156_2011-12-08,0.417,0.378,0.401,0.543,0.394,0.152,0.12,0.149,0.133,0.13,0.118,0.126,0.134,0.171,0.161,0.171,0.189,0.38,0.156,0.154,0.462,0.255,0.161,0.172,0.13,0.149,0.149,0.132,0.128,0.143,0.253,0.248,0.194,0.287,0.322,0.454,0.494,0.446,0.416,0.346,0.364,0.304,0.315,0.381,0.417,0.257,0.257,0.259,


In [12]:
#test this is going to work first
c_melted = pd.melt(chunks[0], id_vars=['LCLid', 'day', 'energy_median', 'energy_mean', 'energy_max', 'energy_count', 'energy_std', 'energy_sum', 'energy_min', 'LCL_day_uid', 'Bank_holiday'],
        var_name='time', value_name='energy(kWh/hh)')

In [14]:
c_melted = c_melted.sort_values(by=['day', 'time'])
c_melted.head()

Unnamed: 0,LCLid,day,energy_median,energy_mean,energy_max,energy_count,energy_std,energy_sum,energy_min,LCL_day_uid,Bank_holiday,time,energy(kWh/hh)
0,MAC000041,2011-12-08,0.295,0.396259,1.071,27.0,0.285051,10.699,0.119,MAC000041_2011-12-08,,00:00:00,
1,MAC000042,2011-12-08,0.2435,0.288077,0.806,26.0,0.153612,7.49,0.081,MAC000042_2011-12-08,,00:00:00,
2,MAC000268,2011-12-08,0.0795,0.093625,0.354,48.0,0.067913,4.494,0.006,MAC000268_2011-12-08,,00:00:00,0.132
3,MAC000269,2011-12-08,0.0395,0.063417,0.18,48.0,0.05062,3.044,0.005,MAC000269_2011-12-08,,00:00:00,0.025
4,MAC000156,2011-12-08,0.2505,0.259833,0.543,48.0,0.124164,12.472,0.118,MAC000156_2011-12-08,,00:00:00,0.417


In [15]:
for i, c in enumerate(chunks):
    print(f'melting chunk {i} of {NUM_CHUNKS}')
    c_melted = pd.melt(c, id_vars=['LCLid', 'day', 'energy_median', 'energy_mean', 'energy_max', 'energy_count', 'energy_std', 'energy_sum', 'energy_min', 'LCL_day_uid', 'Bank_holiday'],
        var_name='time', value_name='energy(kWh/hh)')
    if i == 0:
        print(c_melted.head(n=2))
    c_melted.to_feather(f'../input/merged_data/hh_melted_{i}.feather')
    c_melted = None

melting chunk 0 of 10
       LCLid         day  energy_median  energy_mean  energy_max  \
0  MAC000041  2011-12-08         0.2950     0.396259       1.071   
1  MAC000042  2011-12-08         0.2435     0.288077       0.806   

   energy_count  energy_std  energy_sum  energy_min           LCL_day_uid  \
0          27.0    0.285051      10.699       0.119  MAC000041_2011-12-08   
1          26.0    0.153612       7.490       0.081  MAC000042_2011-12-08   

  Bank_holiday      time energy(kWh/hh)  
0         None  00:00:00           None  
1         None  00:00:00           None  
melting chunk 1 of 10
melting chunk 2 of 10
melting chunk 3 of 10
melting chunk 4 of 10
melting chunk 5 of 10
melting chunk 6 of 10
melting chunk 7 of 10
melting chunk 8 of 10
melting chunk 9 of 10
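
The chunked approach can be tried end to end on a small invented frame. This sketch makes two simplifications that are assumptions of the example, not what the notebook does: it slices with `iloc` instead of `np.array_split`, and it keeps the melted chunks in a list instead of writing feather files. It then checks that the result matches a single full melt.

```python
import pandas as pd

# Invented wide table: one row per household-day, one column per time
wide = pd.DataFrame({
    'LCLid': ['a', 'b', 'c', 'd', 'e'],
    'day': ['2011-12-08'] * 5,
    '00:00:00': [0.1, 0.2, 0.3, 0.4, 0.5],
    '00:30:00': [0.6, 0.7, 0.8, 0.9, 1.0],
})
id_vars = ['LCLid', 'day']
NUM_CHUNKS = 2
chunk_size = -(-len(wide) // NUM_CHUNKS)  # ceiling division

# Melt each slice of rows separately
melted_chunks = []
for start in range(0, len(wide), chunk_size):
    chunk = wide.iloc[start:start + chunk_size]
    melted_chunks.append(chunk.melt(id_vars=id_vars, var_name='time',
                                    value_name='energy(kWh/hh)'))

# Combine the chunks once, then sort so the comparison is order-independent
sort_cols = ['LCLid', 'day', 'time']
chunked = (pd.concat(melted_chunks, ignore_index=True)
             .sort_values(sort_cols).reset_index(drop=True))
full = (wide.melt(id_vars=id_vars, var_name='time', value_name='energy(kWh/hh)')
            .sort_values(sort_cols).reset_index(drop=True))
print(chunked.equals(full))
```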


In [17]:
frames = []
for i in range(NUM_CHUNKS):
    print(f'reading in saved chunk {i} of {NUM_CHUNKS}')
    frames.append(pd.read_feather(f'../input/merged_data/hh_melted_{i}.feather'))
#concatenate once rather than growing hh_melted inside the loop
hh_melted = pd.concat(frames, ignore_index=True)
hh_melted.head()

reading in saved chunk 0 of 10
reading in saved chunk 1 of 10
reading in saved chunk 2 of 10
reading in saved chunk 3 of 10
reading in saved chunk 4 of 10
reading in saved chunk 5 of 10
reading in saved chunk 6 of 10
reading in saved chunk 7 of 10
reading in saved chunk 8 of 10
reading in saved chunk 9 of 10


Unnamed: 0,LCLid,day,energy_median,energy_mean,energy_max,energy_count,energy_std,energy_sum,energy_min,LCL_day_uid,Bank_holiday,time,energy(kWh/hh)
0,MAC000041,2011-12-08,0.295,0.396259,1.071,27.0,0.285051,10.699,0.119,MAC000041_2011-12-08,,00:00:00,
1,MAC000042,2011-12-08,0.2435,0.288077,0.806,26.0,0.153612,7.49,0.081,MAC000042_2011-12-08,,00:00:00,
2,MAC000268,2011-12-08,0.0795,0.093625,0.354,48.0,0.067913,4.494,0.006,MAC000268_2011-12-08,,00:00:00,0.132
3,MAC000269,2011-12-08,0.0395,0.063417,0.18,48.0,0.05062,3.044,0.005,MAC000269_2011-12-08,,00:00:00,0.025
4,MAC000156,2011-12-08,0.2505,0.259833,0.543,48.0,0.124164,12.472,0.118,MAC000156_2011-12-08,,00:00:00,0.417


In [18]:
hh_melted = hh_melted.sort_values(by=['LCLid', 'day', 'time'])
hh_melted.head()

Unnamed: 0,LCLid,day,energy_median,energy_mean,energy_max,energy_count,energy_std,energy_sum,energy_min,LCL_day_uid,Bank_holiday,time,energy(kWh/hh)
33828546,MAC000002,2012-10-12,0.1385,0.154304,0.886,46.0,0.196034,7.098,0.0,MAC000002_2012-10-12,,00:00:00,
34179590,MAC000002,2012-10-12,0.1385,0.154304,0.886,46.0,0.196034,7.098,0.0,MAC000002_2012-10-12,,00:30:00,0.0
34530634,MAC000002,2012-10-12,0.1385,0.154304,0.886,46.0,0.196034,7.098,0.0,MAC000002_2012-10-12,,01:00:00,0.0
34881678,MAC000002,2012-10-12,0.1385,0.154304,0.886,46.0,0.196034,7.098,0.0,MAC000002_2012-10-12,,01:30:00,0.0
35232722,MAC000002,2012-10-12,0.1385,0.154304,0.886,46.0,0.196034,7.098,0.0,MAC000002_2012-10-12,,02:00:00,0.0


In [21]:
hh_melted.reset_index(inplace=True)

In [23]:
hh_melted.drop(columns=['index'],inplace=True)

In [13]:
#saving as feather raised: TypeError: Cannot convert pyarrow.lib.ChunkedArray to pyarrow.lib.Array (file may be too big?)
#save as csv instead

#hh_melted.to_csv(f'../input/merged_data/hh_melted_all.csv')

### Read in Melted data

In [5]:
hh_melted = pd.read_csv('../input/merged_data/hh_melted_date_all.csv')

In [6]:
hh_melted.head()

Unnamed: 0.1,Unnamed: 0,LCLid,day,energy_median,energy_mean,energy_max,energy_count,energy_std,energy_sum,energy_min,LCL_day_uid,Bank_holiday,time,energy(kWh/hh),dayYear,dayMonth,dayWeek,dayDay,dayDayofweek,dayDayofyear,dayIs_month_end,dayIs_month_start,dayIs_quarter_end,dayIs_quarter_start,dayIs_year_end,dayIs_year_start,dayElapsed
0,0,MAC000002,2012-10-12,0.1385,0.154304,0.886,46.0,0.196034,7.098,0.0,MAC000002_2012-10-12,,1900-01-01 00:00:00,,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000
1,1,MAC000002,2012-10-12,0.1385,0.154304,0.886,46.0,0.196034,7.098,0.0,MAC000002_2012-10-12,,1900-01-01 00:30:00,0.0,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000
2,2,MAC000002,2012-10-12,0.1385,0.154304,0.886,46.0,0.196034,7.098,0.0,MAC000002_2012-10-12,,1900-01-01 01:00:00,0.0,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000
3,3,MAC000002,2012-10-12,0.1385,0.154304,0.886,46.0,0.196034,7.098,0.0,MAC000002_2012-10-12,,1900-01-01 01:30:00,0.0,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000
4,4,MAC000002,2012-10-12,0.1385,0.154304,0.886,46.0,0.196034,7.098,0.0,MAC000002_2012-10-12,,1900-01-01 02:00:00,0.0,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000


In [7]:
hh_melted.dtypes

Unnamed: 0               int64
LCLid                   object
day                     object
energy_median          float64
energy_mean            float64
energy_max             float64
energy_count           float64
energy_std             float64
energy_sum             float64
energy_min             float64
LCL_day_uid             object
Bank_holiday            object
time                    object
energy(kWh/hh)         float64
dayYear                  int64
dayMonth                 int64
dayWeek                  int64
dayDay                   int64
dayDayofweek             int64
dayDayofyear             int64
dayIs_month_end           bool
dayIs_month_start         bool
dayIs_quarter_end         bool
dayIs_quarter_start       bool
dayIs_year_end            bool
dayIs_year_start          bool
dayElapsed               int64
dtype: object

### Format times


In [19]:
add_datepart(hh_melted, "day", drop=False)
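
add_datepart comes from the fastai library and expands a date column into columns such as dayYear, dayMonth and dayDayofweek, as seen in the output above. For readers without fastai, here is a rough pandas-only sketch of a few of those columns. It is an approximation for illustration, not fastai's implementation, and the helper name `add_basic_dateparts` is invented.

```python
import pandas as pd

def add_basic_dateparts(df, col, prefix=None):
    """Add a handful of date-part columns derived from df[col] (hypothetical helper)."""
    prefix = col if prefix is None else prefix
    dates = pd.to_datetime(df[col])
    df[prefix + 'Year'] = dates.dt.year
    df[prefix + 'Month'] = dates.dt.month
    df[prefix + 'Day'] = dates.dt.day
    df[prefix + 'Dayofweek'] = dates.dt.dayofweek   # Monday is 0
    df[prefix + 'Dayofyear'] = dates.dt.dayofyear
    df[prefix + 'Is_month_end'] = dates.dt.is_month_end
    return df

# The original 'day' column is kept, mirroring drop=False
toy = add_basic_dateparts(pd.DataFrame({'day': ['2012-10-12', '2012-10-31']}), 'day')
print(toy)
```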

In [None]:
#recreate the full datetime so we can compute a time delta from the start

In [8]:
hh_melted[['temp_day', 'str_time']] = hh_melted['time'].str.split(' ', n=1, expand=True)

In [9]:
hh_melted['time'] = pd.to_datetime(hh_melted['str_time'], format='%H:%M:%S')

In [10]:
hh_melted.head(n=2)

Unnamed: 0.1,Unnamed: 0,LCLid,day,energy_median,energy_mean,energy_max,energy_count,energy_std,energy_sum,energy_min,LCL_day_uid,Bank_holiday,time,energy(kWh/hh),dayYear,dayMonth,dayWeek,dayDay,dayDayofweek,dayDayofyear,dayIs_month_end,dayIs_month_start,dayIs_quarter_end,dayIs_quarter_start,dayIs_year_end,dayIs_year_start,dayElapsed,temp_day,str_time
0,0,MAC000002,2012-10-12,0.1385,0.154304,0.886,46.0,0.196034,7.098,0.0,MAC000002_2012-10-12,,1900-01-01 00:00:00,,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,1900-01-01,00:00:00
1,1,MAC000002,2012-10-12,0.1385,0.154304,0.886,46.0,0.196034,7.098,0.0,MAC000002_2012-10-12,,1900-01-01 00:30:00,0.0,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,1900-01-01,00:30:00


In [11]:
#reference start (midnight on 2012-10-12) for calculating time deltas
start_daytime = '2012-10-12-00:00:00' 
format = '%Y-%m-%d-%H:%M:%S'

In [12]:
def start_timedelta(x):
    tdelta = x - datetime.datetime.strptime(start_daytime, format)
    return tdelta

In [13]:
acquisition_times = hh_melted['str_time']

In [14]:
#acquisition_times = acquisition_times.apply(lambda x: x.strftime('%H:%M:%S'))
acquisition_times[:10]

0    00:00:00
1    00:30:00
2    01:00:00
3    01:30:00
4    02:00:00
5    02:30:00
6    03:00:00
7    03:30:00
8    04:00:00
9    04:30:00
Name: str_time, dtype: object

In [15]:
days = hh_melted['day']
#days = days.apply(lambda x: x.strftime('%Y-%m-%d'))
days[:10]

0    2012-10-12
1    2012-10-12
2    2012-10-12
3    2012-10-12
4    2012-10-12
5    2012-10-12
6    2012-10-12
7    2012-10-12
8    2012-10-12
9    2012-10-12
Name: day, dtype: object

In [16]:
day_time_str = ["{0}-{1}".format(a, b) for a, b in zip(days, acquisition_times)]

In [17]:
day_time_str[:10]

['2012-10-12-00:00:00',
 '2012-10-12-00:30:00',
 '2012-10-12-01:00:00',
 '2012-10-12-01:30:00',
 '2012-10-12-02:00:00',
 '2012-10-12-02:30:00',
 '2012-10-12-03:00:00',
 '2012-10-12-03:30:00',
 '2012-10-12-04:00:00',
 '2012-10-12-04:30:00']

In [18]:
hh_melted['day_time'] = pd.to_datetime(day_time_str, format='%Y-%m-%d-%H:%M:%S')
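
Building the combined strings in a Python list comprehension can be slow at this size (the melted frame has about 168 million rows). A possible vectorised alternative, sketched on invented data, concatenates the two string columns in pandas and parses the result in one call. The column names follow the notebook; the values are made up.

```python
import pandas as pd

toy = pd.DataFrame({'day': ['2012-10-12', '2012-10-12', '2012-10-13'],
                    'str_time': ['00:00:00', '00:30:00', '01:00:00']})

# Element-wise string concatenation, then a single parse with an explicit format
toy['day_time'] = pd.to_datetime(toy['day'] + '-' + toy['str_time'],
                                 format='%Y-%m-%d-%H:%M:%S')
print(toy['day_time'])
```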

In [19]:
dt = hh_melted['day_time']


In [20]:
print(dt[:10])

0   2012-10-12 00:00:00
1   2012-10-12 00:30:00
2   2012-10-12 01:00:00
3   2012-10-12 01:30:00
4   2012-10-12 02:00:00
5   2012-10-12 02:30:00
6   2012-10-12 03:00:00
7   2012-10-12 03:30:00
8   2012-10-12 04:00:00
9   2012-10-12 04:30:00
Name: day_time, dtype: datetime64[ns]


In [21]:
#the line below took ~3 hours to run

In [22]:
dts = dt.apply(start_timedelta)

In [23]:
start_dt = datetime.datetime.strptime(start_daytime, format)

In [24]:
start_as_series = pd.Series(start_dt for _ in range(len(dt)))

In [25]:
delta_minutes = (hh_melted['day_time']-start_as_series).astype('timedelta64[m]')

In [26]:
delta_minutes = delta_minutes.astype(int)

In [27]:
hh_melted['delta_minutes'] = delta_minutes
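
The minute offsets can also be computed without `apply` or a helper Series, since pandas subtracts a scalar timestamp from a datetime column element-wise. Here is a hedged sketch on invented timestamps, using floor division by a one-minute `Timedelta` to obtain integer minutes.

```python
import pandas as pd

# Reference start, matching start_daytime in the notebook
start_dt = pd.Timestamp('2012-10-12 00:00:00')

# Invented timestamps standing in for hh_melted['day_time']
day_time = pd.Series(pd.to_datetime(['2012-10-12 00:00:00',
                                     '2012-10-12 00:30:00',
                                     '2012-10-13 01:00:00']))

# Timedelta // Timedelta gives whole minutes as integers
delta_minutes = (day_time - start_dt) // pd.Timedelta(minutes=1)
print(delta_minutes.tolist())
```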

### save file

In [28]:
hh_melted.to_csv('../input/merged_data/hh_melted_date_all_deltas.csv')

### read file

In [5]:
hh_melted = pd.read_csv('../input/merged_data/hh_melted_date_all_deltas.csv')

In [6]:
hh_melted.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,LCLid,day,energy_median,energy_mean,energy_max,energy_count,energy_std,energy_sum,energy_min,LCL_day_uid,Bank_holiday,time,energy(kWh/hh),dayYear,dayMonth,dayWeek,dayDay,dayDayofweek,dayDayofyear,dayIs_month_end,dayIs_month_start,dayIs_quarter_end,dayIs_quarter_start,dayIs_year_end,dayIs_year_start,dayElapsed,temp_day,str_time,day_time,delta_minutes
0,0,0,MAC000002,2012-10-12,0.1385,0.154304,0.886,46.0,0.196034,7.098,0.0,MAC000002_2012-10-12,,1900-01-01 00:00:00,,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,1900-01-01,00:00:00,2012-10-12 00:00:00,0
1,1,1,MAC000002,2012-10-12,0.1385,0.154304,0.886,46.0,0.196034,7.098,0.0,MAC000002_2012-10-12,,1900-01-01 00:30:00,0.0,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,1900-01-01,00:30:00,2012-10-12 00:30:00,30
2,2,2,MAC000002,2012-10-12,0.1385,0.154304,0.886,46.0,0.196034,7.098,0.0,MAC000002_2012-10-12,,1900-01-01 01:00:00,0.0,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,1900-01-01,01:00:00,2012-10-12 01:00:00,60
3,3,3,MAC000002,2012-10-12,0.1385,0.154304,0.886,46.0,0.196034,7.098,0.0,MAC000002_2012-10-12,,1900-01-01 01:30:00,0.0,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,1900-01-01,01:30:00,2012-10-12 01:30:00,90
4,4,4,MAC000002,2012-10-12,0.1385,0.154304,0.886,46.0,0.196034,7.098,0.0,MAC000002_2012-10-12,,1900-01-01 02:00:00,0.0,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,1900-01-01,02:00:00,2012-10-12 02:00:00,120


Drop the columns we don't need

In [7]:
hh_melted.drop(columns=['energy_median','energy_mean','energy_max','energy_count','energy_std','energy_sum','energy_min', 'Unnamed: 0','Unnamed: 0.1','temp_day','time'], inplace=True)

In [8]:
#save before we do the merge
hh_melted.to_csv(f'../input/merged_data/hh_melted_hh_only.csv')

The dataset is getting really big; see data_wrangling_2 for code using dask and postgres.

### Combine Power and Weather Data

In [9]:
hh_weather = pd.read_feather('../input/merged_data/hh_weather_interpolated.feather')
hh_weather['time'] = hh_weather['time'].dt.time


In [13]:
hh_weather.head()

Unnamed: 0,visibility,windBearing,temperature,time,dewPoint,pressure,apparentTemperature,windSpeed,precipType,humidity,summary,day
0,5.97,104.0,10.24,00:00:00,8.86,1016.76,10.24,2.77,rain,0.91,Partly Cloudy,2011-11-11
1,5.425,101.5,10.0,00:30:00,8.845,1016.695,9.24,2.86,rain,0.925,Partly Cloudy,2011-11-11
2,4.88,99.0,9.76,01:00:00,8.83,1016.63,8.24,2.95,rain,0.94,Partly Cloudy,2011-11-11
3,4.29,98.5,9.61,01:30:00,8.81,1016.495,8.0,3.06,rain,0.95,Partly Cloudy,2011-11-11
4,3.7,98.0,9.46,02:00:00,8.79,1016.36,7.76,3.17,rain,0.96,Partly Cloudy,2011-11-11


In [14]:
hh_weather.dtypes

visibility             float64
windBearing            float64
temperature            float64
time                    object
dewPoint               float64
pressure               float64
apparentTemperature    float64
windSpeed              float64
precipType              object
humidity               float64
summary                 object
day                     object
dtype: object

In [15]:
#create a day_time column to join with hh data

In [10]:
hh_weather['time_str'] = hh_weather['time'].apply(str)

In [11]:
day_time_str = ["{0} {1}".format(a, b) for a, b in zip(hh_weather['day'], hh_weather['time_str'])]

In [12]:
hh_weather['day_time'] = pd.to_datetime(day_time_str, format='%Y-%m-%d %H:%M:%S')

In [13]:
hh_weather['day_time_str'] = day_time_str

In [14]:
hh_weather['day_time_str'][0]

'2011-11-11 00:00:00'

In [15]:
#both columns need to be same type, creating str columns to join on

In [16]:
hh_melted['day_time_str'] = hh_melted['day_time'].apply(str)

In [17]:
hh_melted['day_time_str'][0]

'2012-10-12 00:00:00'

In [18]:
start = time.time()
hh_combined = pd.merge(hh_melted, hh_weather, on='day_time_str', how='inner')
end = time.time()
print(f'hh data and weather merge took: {end-start} seconds')

hh data and weather merge took: 458.5537848472595 seconds
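
The join above uses string keys. As a sketch of an alternative, the merge can be done directly on `datetime64` columns, with `validate='many_to_one'` checking that each timestamp appears only once on the weather side. The frames below are toy data with made-up values, and no timing claim is made for the full dataset.

```python
import pandas as pd

# Toy half-hourly readings (hypothetical values).
power = pd.DataFrame({
    'LCLid': ['MAC000002', 'MAC000003', 'MAC000002'],
    'day_time': pd.to_datetime(['2012-10-12 00:00', '2012-10-12 00:00',
                                '2012-10-12 00:30']),
    'energy(kWh/hh)': [0.1, 0.166, 0.2],
})

# Toy weather table with one row per timestamp.
weather = pd.DataFrame({
    'day_time': pd.to_datetime(['2012-10-12 00:00', '2012-10-12 00:30']),
    'temperature': [13.61, 13.4],
})

# Many meter readings map to one weather row; validate raises a MergeError
# if the weather table contains duplicate timestamps.
combined = pd.merge(power, weather, on='day_time', how='inner',
                    validate='many_to_one')
print(combined)
```

Keeping only the join key in common between the two frames also avoids the `_x`/`_y` suffixed duplicates that are dropped further down.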


In [21]:
hh_combined.head()


Unnamed: 0,LCLid,day_x,LCL_day_uid,Bank_holiday,energy(kWh/hh),dayYear,dayMonth,dayWeek,dayDay,dayDayofweek,dayDayofyear,dayIs_month_end,dayIs_month_start,dayIs_quarter_end,dayIs_quarter_start,dayIs_year_end,dayIs_year_start,dayElapsed,str_time,day_time_x,delta_minutes,day_time_str,visibility,windBearing,temperature,time,dewPoint,pressure,apparentTemperature,windSpeed,precipType,humidity,summary,day_y,time_str,day_time_y
0,MAC000002,2012-10-12,MAC000002_2012-10-12,,,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,00:00:00,2012-10-12 00:00:00,0,2012-10-12 00:00:00,11.76,234.0,13.61,00:00:00,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy,2012-10-12,00:00:00,2012-10-12
1,MAC000003,2012-10-12,MAC000003_2012-10-12,,0.166,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,00:00:00,2012-10-12 00:00:00,0,2012-10-12 00:00:00,11.76,234.0,13.61,00:00:00,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy,2012-10-12,00:00:00,2012-10-12
2,MAC000004,2012-10-12,MAC000004_2012-10-12,,0.0,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,00:00:00,2012-10-12 00:00:00,0,2012-10-12 00:00:00,11.76,234.0,13.61,00:00:00,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy,2012-10-12,00:00:00,2012-10-12
3,MAC000005,2012-10-12,MAC000005_2012-10-12,,0.03,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,00:00:00,2012-10-12 00:00:00,0,2012-10-12 00:00:00,11.76,234.0,13.61,00:00:00,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy,2012-10-12,00:00:00,2012-10-12
4,MAC000006,2012-10-12,MAC000006_2012-10-12,,0.033,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,00:00:00,2012-10-12 00:00:00,0,2012-10-12 00:00:00,11.76,234.0,13.61,00:00:00,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy,2012-10-12,00:00:00,2012-10-12


In [19]:
#reading/writing csv is really slow, but it works. feather has issues with very large files.
#TODO: test the performance of the parquet format

In [20]:
hh_combined.to_csv(f'../input/merged_data/hh_combined_weather_join.csv')

In [24]:
hh_melted = None
hh_weather = None

In [26]:
sys.getsizeof(hh_combined)


156095542398
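
A per-column breakdown can show where that memory goes. The sketch below uses `DataFrame.memory_usage(deep=True)` on toy data and converts repeated string columns to the `category` dtype, which is one common way to shrink them. Whether this would bring the real frame down to a workable size has not been checked here.

```python
import pandas as pd

# Toy frame with heavily repeated strings, similar in spirit to LCLid/summary.
toy = pd.DataFrame({
    'LCLid': ['MAC000002', 'MAC000003'] * 1000,
    'summary': ['Mostly Cloudy', 'Partly Cloudy'] * 1000,
    'temperature': [13.61, 10.24] * 1000,
})

# Bytes used per column, including the Python string objects.
before = toy.memory_usage(deep=True)

# Store each distinct string once and keep small integer codes per row.
for col in ['LCLid', 'summary']:
    toy[col] = toy[col].astype('category')

after = toy.memory_usage(deep=True)
print(before.sum(), after.sum())
```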

In [27]:
hh_combined.drop(columns=['LCL_day_uid','day_time_x','day_time_str','day_y','time_str','day_time_y'], inplace=True)

In [28]:
hh_combined.head()

Unnamed: 0,LCLid,day_x,Bank_holiday,energy(kWh/hh),dayYear,dayMonth,dayWeek,dayDay,dayDayofweek,dayDayofyear,dayIs_month_end,dayIs_month_start,dayIs_quarter_end,dayIs_quarter_start,dayIs_year_end,dayIs_year_start,dayElapsed,str_time,delta_minutes,visibility,windBearing,temperature,time,dewPoint,pressure,apparentTemperature,windSpeed,precipType,humidity,summary
0,MAC000002,2012-10-12,,,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,00:00:00,0,11.76,234.0,13.61,00:00:00,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy
1,MAC000003,2012-10-12,,0.166,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,00:00:00,0,11.76,234.0,13.61,00:00:00,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy
2,MAC000004,2012-10-12,,0.0,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,00:00:00,0,11.76,234.0,13.61,00:00:00,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy
3,MAC000005,2012-10-12,,0.03,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,00:00:00,0,11.76,234.0,13.61,00:00:00,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy
4,MAC000006,2012-10-12,,0.033,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,00:00:00,0,11.76,234.0,13.61,00:00:00,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy


### Save data

In [29]:
hh_combined.to_csv(f'../input/merged_data/hh_combined_weather_join_clean.csv')

In [30]:
hh_combined.to_parquet(f'../input/merged_data/hh_combined_weather_join_clean.parquet')

### Read data

In [7]:
hh_combined = pd.read_parquet(f'{PATH}hh_combined_weather_join_clean.parquet')

In [8]:
hh_combined.head()

Unnamed: 0,LCLid,day_x,Bank_holiday,energy(kWh/hh),dayYear,dayMonth,dayWeek,dayDay,dayDayofweek,dayDayofyear,dayIs_month_end,dayIs_month_start,dayIs_quarter_end,dayIs_quarter_start,dayIs_year_end,dayIs_year_start,dayElapsed,str_time,delta_minutes,visibility,windBearing,temperature,time,dewPoint,pressure,apparentTemperature,windSpeed,precipType,humidity,summary
0,MAC000002,2012-10-12,,,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,00:00:00,0,11.76,234.0,13.61,00:00:00,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy
1,MAC000003,2012-10-12,,0.166,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,00:00:00,0,11.76,234.0,13.61,00:00:00,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy
2,MAC000004,2012-10-12,,0.0,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,00:00:00,0,11.76,234.0,13.61,00:00:00,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy
3,MAC000005,2012-10-12,,0.03,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,00:00:00,0,11.76,234.0,13.61,00:00:00,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy
4,MAC000006,2012-10-12,,0.033,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,00:00:00,0,11.76,234.0,13.61,00:00:00,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy


In [9]:
len(hh_combined.index)

168479808

In [10]:
day_time_str = ["{0} {1}".format(a, b) for a, b in zip(hh_combined['day_x'], hh_combined['time'])]

In [11]:
day_time_str[0]

'2012-10-12 00:00:00'

In [12]:
#hh_combined['day_time'] = pd.to_datetime(day_time_str, format='%Y-%m-%d %H:%M:%S')
hh_combined['day_time'] = day_time_str

In [13]:
hh_combined['day_time'][:10]

0    2012-10-12 00:00:00
1    2012-10-12 00:00:00
2    2012-10-12 00:00:00
3    2012-10-12 00:00:00
4    2012-10-12 00:00:00
5    2012-10-12 00:00:00
6    2012-10-12 00:00:00
7    2012-10-12 00:00:00
8    2012-10-12 00:00:00
9    2012-10-12 00:00:00
Name: day_time, dtype: object

In [14]:
hh_combined.head()

Unnamed: 0,LCLid,day_x,Bank_holiday,energy(kWh/hh),dayYear,dayMonth,dayWeek,dayDay,dayDayofweek,dayDayofyear,dayIs_month_end,dayIs_month_start,dayIs_quarter_end,dayIs_quarter_start,dayIs_year_end,dayIs_year_start,dayElapsed,str_time,delta_minutes,visibility,windBearing,temperature,time,dewPoint,pressure,apparentTemperature,windSpeed,precipType,humidity,summary,day_time
0,MAC000002,2012-10-12,,,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,00:00:00,0,11.76,234.0,13.61,00:00:00,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy,2012-10-12 00:00:00
1,MAC000003,2012-10-12,,0.166,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,00:00:00,0,11.76,234.0,13.61,00:00:00,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy,2012-10-12 00:00:00
2,MAC000004,2012-10-12,,0.0,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,00:00:00,0,11.76,234.0,13.61,00:00:00,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy,2012-10-12 00:00:00
3,MAC000005,2012-10-12,,0.03,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,00:00:00,0,11.76,234.0,13.61,00:00:00,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy,2012-10-12 00:00:00
4,MAC000006,2012-10-12,,0.033,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,00:00:00,0,11.76,234.0,13.61,00:00:00,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy,2012-10-12 00:00:00


In [15]:
hh_combined.drop(columns=['day_x','str_time','time'],inplace=True)

### Save

In [16]:
#save in case run out of memory - but having issues with loading parquet when gets over ~1.1GB - 
#ArrowIOError: Arrow error: Invalid: BinaryArray cannot contain more than 2147483646 bytes, have 2147483664
#hh_combined.to_parquet(f'{PATH}hh_combined_clean.parquet')
#so we save as .csv...
hh_combined.to_csv(f'{PATH}hh_combined_clean.csv')

### Load

In [5]:
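#pd.datetime was deprecated and later removed from pandas; use the standard-library datetime module (import datetime)
dateparse = lambda x: datetime.datetime.strptime(x, '%Y-%m-%d %H:%M:%S')

hh_combined = pd.read_csv(f'{PATH}hh_combined_clean.csv', parse_dates=['day_time'], date_parser=dateparse)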



In [18]:
#comment out if just loaded
hh_combined['day_time'] = pd.to_datetime(hh_combined['day_time'], format='%Y-%m-%d %H:%M:%S')

In [6]:
hh_combined['day_time'][0]

Timestamp('2012-10-12 00:00:00')

In [7]:
hh_combined.dtypes

Unnamed: 0                      int64
LCLid                          object
Bank_holiday                   object
energy(kWh/hh)                float64
dayYear                         int64
dayMonth                        int64
dayWeek                         int64
dayDay                          int64
dayDayofweek                    int64
dayDayofyear                    int64
dayIs_month_end                  bool
dayIs_month_start                bool
dayIs_quarter_end                bool
dayIs_quarter_start              bool
dayIs_year_end                   bool
dayIs_year_start                 bool
dayElapsed                      int64
delta_minutes                   int64
visibility                    float64
windBearing                   float64
temperature                   float64
dewPoint                      float64
pressure                      float64
apparentTemperature           float64
windSpeed                     float64
precipType                     object
humidity    

In [8]:
#list unique bank holiday values
hh_combined.Bank_holiday.unique()

array([nan, 'Christmas Day', 'Boxing Day', 'New Year?s Day', 'Easter Monday', 'Good Friday',
       'Spring bank holiday', 'Early May bank holiday', 'Summer bank holiday',
       'Spring bank holiday (substitute day)', 'Queen?s Diamond Jubilee (extra bank holiday)',
       'New Year?s Day (substitute day)'], dtype=object)

In [9]:
#replace characters in bank holiday
#regex=False treats '?', '(' and ')' as literal characters rather than regex syntax
hh_combined['Bank_holiday'] = hh_combined['Bank_holiday'].str.replace('?','', regex=False)
hh_combined['Bank_holiday'] = hh_combined['Bank_holiday'].str.replace('(','', regex=False)
hh_combined['Bank_holiday'] = hh_combined['Bank_holiday'].str.replace(')','', regex=False)
hh_combined['Bank_holiday'] = hh_combined['Bank_holiday'].str.replace(' ','_', regex=False)


In [10]:
hh_combined.head()

Unnamed: 0.1,Unnamed: 0,LCLid,Bank_holiday,energy(kWh/hh),dayYear,dayMonth,dayWeek,dayDay,dayDayofweek,dayDayofyear,dayIs_month_end,dayIs_month_start,dayIs_quarter_end,dayIs_quarter_start,dayIs_year_end,dayIs_year_start,dayElapsed,delta_minutes,visibility,windBearing,temperature,dewPoint,pressure,apparentTemperature,windSpeed,precipType,humidity,summary,day_time
0,0,MAC000002,,,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,0,11.76,234.0,13.61,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy,2012-10-12
1,1,MAC000003,,0.166,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,0,11.76,234.0,13.61,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy,2012-10-12
2,2,MAC000004,,0.0,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,0,11.76,234.0,13.61,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy,2012-10-12
3,3,MAC000005,,0.03,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,0,11.76,234.0,13.61,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy,2012-10-12
4,4,MAC000006,,0.033,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,0,11.76,234.0,13.61,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy,2012-10-12


In [11]:
hh_combined.to_csv(f'{PATH}hh_combined_bank_clean.csv')

In [20]:
hh_combined['is_bank_holiday'] = hh_combined['Bank_holiday'].notnull()

In [23]:
hh_combined.head(2)

Unnamed: 0.1,Unnamed: 0,LCLid,Bank_holiday,energy(kWh/hh),dayYear,dayMonth,dayWeek,dayDay,dayDayofweek,dayDayofyear,dayIs_month_end,dayIs_month_start,dayIs_quarter_end,dayIs_quarter_start,dayIs_year_end,dayIs_year_start,dayElapsed,delta_minutes,visibility,windBearing,temperature,dewPoint,pressure,apparentTemperature,windSpeed,precipType,humidity,summary,day_time,is_bank_holiday
0,0,MAC000002,,,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,0,11.76,234.0,13.61,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy,2012-10-12,False
1,1,MAC000003,,0.166,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,0,11.76,234.0,13.61,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy,2012-10-12,False


In [24]:
hh_combined = pd.get_dummies(hh_combined, columns=["Bank_holiday"])
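
To make the bank-holiday steps easier to follow, here is a small self-contained sketch of the whole sequence on toy data: literal character replacement, the boolean flag, then one-hot encoding. The three sample values are taken from the `unique()` output above; the `toy` frame itself is hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Bank_holiday': [
    np.nan, 'New Year?s Day', 'Spring bank holiday (substitute day)']})

# Strip the awkward characters literally (regex=False) and replace spaces.
cleaned = toy['Bank_holiday']
for old, new in [('?', ''), ('(', ''), (')', ''), (' ', '_')]:
    cleaned = cleaned.str.replace(old, new, regex=False)
toy['Bank_holiday'] = cleaned

# Flag rows that have any bank holiday name at all.
toy['is_bank_holiday'] = toy['Bank_holiday'].notnull()

# One indicator column per holiday name; NaN rows get all-zero indicators.
dummies = pd.get_dummies(toy, columns=['Bank_holiday'])
print(dummies.columns.tolist())
```

Note that `get_dummies` drops the original `Bank_holiday` column, which is why `is_bank_holiday` is created first.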

In [38]:
hh_combined.head()

Unnamed: 0,LCLid,energy(kWh/hh),dayYear,dayMonth,dayWeek,dayDay,dayDayofweek,dayDayofyear,dayIs_month_end,dayIs_month_start,dayIs_quarter_end,dayIs_quarter_start,dayIs_year_end,dayIs_year_start,dayElapsed,delta_minutes,visibility,windBearing,temperature,dewPoint,pressure,apparentTemperature,windSpeed,precipType,humidity,summary,day_time,is_bank_holiday,Bank_holiday_Boxing_Day,Bank_holiday_Christmas_Day,Bank_holiday_Early_May_bank_holiday,Bank_holiday_Easter_Monday,Bank_holiday_Good_Friday,Bank_holiday_New_Years_Day,Bank_holiday_New_Years_Day_substitute_day,Bank_holiday_Queens_Diamond_Jubilee_extra_bank_holiday,Bank_holiday_Spring_bank_holiday,Bank_holiday_Spring_bank_holiday_substitute_day,Bank_holiday_Summer_bank_holiday,energy_csum
0,MAC000002,,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,0,11.76,234.0,13.61,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy,2012-10-12,False,0,0,0,0,0,0,0,0,0,0,0,
1,MAC000003,0.166,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,0,11.76,234.0,13.61,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy,2012-10-12,False,0,0,0,0,0,0,0,0,0,0,0,3284.479
2,MAC000004,0.0,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,0,11.76,234.0,13.61,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy,2012-10-12,False,0,0,0,0,0,0,0,0,0,0,0,248.781
3,MAC000005,0.03,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,0,11.76,234.0,13.61,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy,2012-10-12,False,0,0,0,0,0,0,0,0,0,0,0,516.081
4,MAC000006,0.033,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,0,11.76,234.0,13.61,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy,2012-10-12,False,0,0,0,0,0,0,0,0,0,0,0,703.783


In [26]:
hh_combined.drop(columns=['Unnamed: 0'],inplace=True)

In [None]:
#df.sort_values(['LCLid','day_time'],ascending=False).groupby('job')
#df.groupby('id')['x'].cumsum()

### Final Save

In [39]:
hh_combined.to_csv(f'{PATH}hh_weather_bank_final.csv')

In [None]:
#hh_gp = hh_combined.sort_values(['LCLid','day_time'],ascending=True).groupby('LCLid')

In [None]:
#hh_gp.head()  #hh_gp is only defined if the commented-out cell above is run

In [27]:
hh_combined['energy_csum'] = hh_combined.sort_values(['LCLid','day_time','delta_minutes'],ascending=True).groupby(['LCLid'])['energy(kWh/hh)'].cumsum()
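
The per-household running total relies on two pandas behaviours: the result of `cumsum` keeps the original row index, so it aligns back onto the unsorted frame, and missing readings stay `NaN` without resetting the total. A toy sketch with hypothetical household ids `A` and `B`:

```python
import numpy as np
import pandas as pd

# Deliberately unsorted toy readings with one missing value.
toy = pd.DataFrame({
    'LCLid': ['B', 'A', 'A', 'B', 'A'],
    'day_time': pd.to_datetime(['2012-10-12 00:30', '2012-10-12 01:00',
                                '2012-10-12 00:00', '2012-10-12 00:00',
                                '2012-10-12 00:30']),
    'energy(kWh/hh)': [0.2, 0.3, np.nan, 0.1, 0.5],
})

# Sort within household by time, accumulate, and let index alignment
# place each running total back on its original row.
toy['energy_csum'] = (toy.sort_values(['LCLid', 'day_time'])
                         .groupby('LCLid')['energy(kWh/hh)']
                         .cumsum())
print(toy.sort_values(['LCLid', 'day_time']))
```

This is consistent with the first row of the output below, where a missing reading for `MAC000002` gives a missing `energy_csum`.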

In [28]:
hh_combined.head()

Unnamed: 0,LCLid,energy(kWh/hh),dayYear,dayMonth,dayWeek,dayDay,dayDayofweek,dayDayofyear,dayIs_month_end,dayIs_month_start,dayIs_quarter_end,dayIs_quarter_start,dayIs_year_end,dayIs_year_start,dayElapsed,delta_minutes,visibility,windBearing,temperature,dewPoint,pressure,apparentTemperature,windSpeed,precipType,humidity,summary,day_time,is_bank_holiday,Bank_holiday_Boxing_Day,Bank_holiday_Christmas_Day,Bank_holiday_Early_May_bank_holiday,Bank_holiday_Easter_Monday,Bank_holiday_Good_Friday,Bank_holiday_New_Years_Day,Bank_holiday_New_Years_Day_substitute_day,Bank_holiday_Queens_Diamond_Jubilee_extra_bank_holiday,Bank_holiday_Spring_bank_holiday,Bank_holiday_Spring_bank_holiday_substitute_day,Bank_holiday_Summer_bank_holiday,energy_csum
0,MAC000002,,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,0,11.76,234.0,13.61,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy,2012-10-12,False,0,0,0,0,0,0,0,0,0,0,0,
1,MAC000003,0.166,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,0,11.76,234.0,13.61,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy,2012-10-12,False,0,0,0,0,0,0,0,0,0,0,0,3284.479
2,MAC000004,0.0,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,0,11.76,234.0,13.61,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy,2012-10-12,False,0,0,0,0,0,0,0,0,0,0,0,248.781
3,MAC000005,0.03,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,0,11.76,234.0,13.61,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy,2012-10-12,False,0,0,0,0,0,0,0,0,0,0,0,516.081
4,MAC000006,0.033,2012,10,41,12,4,286,False,False,False,False,False,False,1350000000,0,11.76,234.0,13.61,12.21,999.47,13.61,5.4,rain,0.91,Mostly Cloudy,2012-10-12,False,0,0,0,0,0,0,0,0,0,0,0,703.783


In [None]:
#hh_combined.loc[:,'delta_minutes'] /= 30 

### Save each LCLid for individual forecasting

In [30]:
f = lambda x: x.to_csv(f'{PATH}/LCLid/{x.name.lower()}.csv', index=False)
hh_combined.groupby('LCLid').apply(f)
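
An equivalent way to write one file per household is to iterate over the groups explicitly, which makes the file name logic a little easier to read and debug. This is only a sketch: the temporary directory stands in for the notebook's `LCLid` output folder and the readings are toy values.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({
    'LCLid': ['MAC000002', 'MAC000002', 'MAC000003'],
    'energy(kWh/hh)': [0.1, 0.2, 0.166],
})

# Hypothetical output folder; the notebook writes under PATH instead.
out_dir = tempfile.mkdtemp()

written = []
for lclid, group in toy.groupby('LCLid'):
    # One CSV per household, named after the lower-cased id.
    path = os.path.join(out_dir, f'{lclid.lower()}.csv')
    group.to_csv(path, index=False)
    written.append(path)

print(written)
```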

### Save all LCLids in each ACORN group for group forecasting

In [32]:
ih = pd.read_csv(f'{PATH}informations_households.csv')

In [34]:
ih.head()

Unnamed: 0,LCLid,stdorToU,Acorn,Acorn_grouped,file
0,MAC005492,ToU,ACORN-,ACORN-,block_0
1,MAC001074,ToU,ACORN-,ACORN-,block_0
2,MAC000002,Std,ACORN-A,Affluent,block_0
3,MAC003613,Std,ACORN-A,Affluent,block_0
4,MAC003597,Std,ACORN-A,Affluent,block_0


In [35]:
acorns = ih.Acorn.unique()

In [36]:
#get LCL's by acorn
acorn_dict = {}
for acorn in acorns:
    lcl = ih[ih['Acorn']==acorn]['LCLid']
    acorn_dict[acorn] = lcl.tolist()
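
The same mapping from ACORN code to household ids can be built in one pass with `groupby`. The sketch below uses the first few rows of `ih.head()` as toy input; `ih_toy` is a hypothetical name.

```python
import pandas as pd

ih_toy = pd.DataFrame({
    'LCLid': ['MAC005492', 'MAC001074', 'MAC000002', 'MAC003613'],
    'Acorn': ['ACORN-', 'ACORN-', 'ACORN-A', 'ACORN-A'],
})

# Collect the household ids of each ACORN code into a list, keyed by code.
acorn_dict = ih_toy.groupby('Acorn')['LCLid'].apply(list).to_dict()
print(acorn_dict)
```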

In [37]:
for k,v in acorn_dict.items():
    print(f'Saving all households in acorn: {k}')
    new_df = hh_combined[hh_combined['LCLid'].isin(v)]
    new_df.to_csv(f'{PATH}/Acorns/{k.lower()}.csv', index=False)

Saving all households in acorn: {k}
Saving all households in acorn: {k}
Saving all households in acorn: {k}
Saving all households in acorn: {k}
Saving all households in acorn: {k}
Saving all households in acorn: {k}
Saving all households in acorn: {k}
Saving all households in acorn: {k}
Saving all households in acorn: {k}
Saving all households in acorn: {k}
Saving all households in acorn: {k}
Saving all households in acorn: {k}
Saving all households in acorn: {k}
Saving all households in acorn: {k}
Saving all households in acorn: {k}
Saving all households in acorn: {k}
Saving all households in acorn: {k}
Saving all households in acorn: {k}
Saving all households in acorn: {k}


### Read in (part of) the final data

In [5]:
fields = ['LCLid','energy(kWh/hh)','day_time','delta_minutes']

hh_selected = pd.read_csv(f'{PATH}hh_weather_bank_final.csv', usecols=fields)
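
When reading only part of a large CSV, `usecols` can be combined with `dtype` and `parse_dates` so the selected columns arrive in compact, ready-to-use types. A sketch on an in-memory CSV with made-up rows follows; the choice of `category` for `LCLid` is an assumption, not something the notebook does.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for hh_weather_bank_final.csv.
csv_text = (
    'LCLid,energy(kWh/hh),day_time,delta_minutes,temperature\n'
    'MAC000002,,2012-10-12 00:00:00,0,13.61\n'
    'MAC000003,0.166,2012-10-12 00:00:00,0,13.61\n'
)

fields = ['LCLid', 'energy(kWh/hh)', 'day_time', 'delta_minutes']

# Read only the listed columns, store ids as categories, parse timestamps.
sel = pd.read_csv(io.StringIO(csv_text), usecols=fields,
                  dtype={'LCLid': 'category'}, parse_dates=['day_time'])
print(sel.dtypes)
```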

Check how many measurements we have from each household.

In [6]:
lclid_counts = hh_selected.groupby(['LCLid']).size().reset_index(name='counts')

In [7]:
from collections import Counter
Counter(lclid_counts['counts'].tolist())

Counter({24236: 36,
         35516: 15,
         31772: 78,
         30620: 40,
         36524: 12,
         25100: 37,
         26108: 2,
         25292: 16,
         23756: 36,
         24716: 31,
         29660: 8,
         24092: 16,
         39164: 6,
         19584: 2,
         39116: 19,
         32204: 7,
         39068: 14,
         23856: 1,
         39020: 15,
         31436: 77,
         27696: 1,
         38972: 14,
         24288: 1,
         38828: 8,
         25776: 1,
         38876: 13,
         31628: 62,
         38396: 15,
         34172: 34,
         38060: 14,
         38732: 15,
         38780: 7,
         19872: 4,
         35084: 12,
         29184: 1,
         19104: 2,
         20880: 2,
         21072: 2,
         31484: 70,
         38684: 7,
         20592: 4,
         30812: 29,
         39788: 11,
         20304: 2,
         25392: 1,
         39740: 10,
         22848: 1,
         38204: 2,
         39692: 6,
         30576: 1,
         39644: 6,
     