# Challenge 1 Data: Smart meters in London

## Objectives 

### Overall

The objective of this project is to curate information about monitoring electricity consumption patterns using statistical and machine learning methods. 

A further aim is to facilitate collaboration between statisticians, computer scientists, data scientists and the wider community.

### Challenge specific objectives 1

https://github.com/cobleg/Hack-A-Gig/wiki

Derive insight from smart meter data. 

Use this insight to generate a seven day ahead forecast for each consumer. 

Provide advice to consumers on why their demand for electricity varies over the forecast week.



### Challenge specific objectives 2

https://github.com/cobleg/Hack-A-Gig/wiki/Challenge-1-Data:-Smart-meters-in-London

Develop an understanding of the consumption patterns contained in the data.

Relate the consumption patterns to statistically distinct groups of consumers.

Create a scalable forecasting process to accurately predict consumption for each group and individual household for up to a week ahead.

### Research questions

Are there any obvious seasonal (or cyclical) patterns? If so, are the cycles daily, weekly, annual, etc.?

The ratio of average to peak demand, known as load factor in the power industry.

Any trends evident - growth or contraction.

The degree of correlation across consumers.

How large is the ratio of the signal to noise? That is, are the identified consumption patterns highly predictable?
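As a rough illustration of how some of these questions could be quantified, the sketch below computes a load factor (taken here as average demand divided by peak demand), the correlation across consumers, and a lag-48 autocorrelation as a crude indicator of how strong the daily cycle is relative to noise. The readings are synthetic and the household names `MAC_A` and `MAC_B` are hypothetical; nothing here comes from the real data.

```python
import numpy as np
import pandas as pd

# Synthetic half-hourly readings for two hypothetical households (not real data).
idx = pd.date_range("2013-01-01", periods=48 * 14, freq="30min")
hours = idx.hour.to_numpy()
daily_cycle = 0.3 + 0.2 * np.sin(2 * np.pi * hours / 24)
rng = np.random.default_rng(0)
wide = pd.DataFrame(
    {
        "MAC_A": daily_cycle + rng.normal(0, 0.02, len(idx)),
        "MAC_B": daily_cycle + rng.normal(0, 0.02, len(idx)),
    },
    index=idx,
)

# Load factor per household: average demand divided by peak demand.
load_factor = wide.mean() / wide.max()

# Degree of correlation across consumers.
cross_corr = wide.corr()

# Autocorrelation at a lag of 48 half-hours (one day): values close to 1
# suggest a strong, repeatable daily pattern relative to the noise.
daily_autocorr = wide.apply(lambda s: s.autocorr(lag=48))
```

On real data the same calculations would start from a table with one column per household and a timestamp index.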

## Prepare data for exploratory analysis

Link data files to facilitate exploration of relationships between variables.

Identify and flag missing values and outliers.
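One possible way to flag such values, sketched on a tiny made-up table. The 1.5 × IQR rule used for outliers is an assumption chosen for illustration; the column names `energy`, `is_missing` and `is_outlier` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up readings with one gap and one suspiciously large value.
readings = pd.DataFrame(
    {
        "LCLid": ["MAC_A"] * 6,
        "energy": [0.10, 0.12, np.nan, 0.11, 5.00, 0.13],
    }
)

# Flag missing values rather than dropping them, so the gap stays visible.
readings["is_missing"] = readings["energy"].isna()

# Flag outliers with the interquartile-range rule (an assumed convention).
q1, q3 = readings["energy"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
readings["is_outlier"] = (readings["energy"] < lower) | (readings["energy"] > upper)
```

Keeping the flags as columns lets later steps decide whether to exclude, impute or model these rows.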

Create a time index and binary variables for the time of day, week and year.
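A minimal sketch of this step, assuming a `tstp` timestamp column like the one in the half-hourly data. The specific indicators (evening peak between 17:00 and 20:59, winter as December to February) and the two-row sample are assumptions made for illustration.

```python
import pandas as pd

# Two made-up readings: a Tuesday evening and a Saturday night.
half_hourly = pd.DataFrame(
    {
        "tstp": ["2012-12-25 18:30:00", "2012-12-29 03:00:00"],
        "energy": [0.42, 0.07],
    }
)
half_hourly["tstp"] = pd.to_datetime(half_hourly["tstp"])
ts = half_hourly["tstp"]

# Half-hour slot of the day, 0..47 (matches the hh_0..hh_47 naming).
half_hourly["hh_slot"] = ts.dt.hour * 2 + ts.dt.minute // 30

# Binary variables for time of day, week and year (thresholds are assumptions).
half_hourly["is_evening_peak"] = ts.dt.hour.between(17, 20)
half_hourly["is_weekend"] = ts.dt.dayofweek >= 5
half_hourly["is_winter"] = ts.dt.month.isin([12, 1, 2])

# Bank holidays could come from the uk_bank_holidays table; two dates shown here.
bank_holidays = pd.to_datetime(["2012-12-25", "2012-12-26"])
half_hourly["is_bank_holiday"] = ts.dt.normalize().isin(bank_holidays)

# Use the timestamp as the time index.
half_hourly = half_hourly.set_index("tstp").sort_index()
```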

Create a tidy data set (for information see: https://vita.had.co.nz/papers/tidy-data.pdf).
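For instance, a wide table with one column per half-hour (the `hh_0` ... `hh_47` layout) could be reshaped into a tidy one with a single observation per row. This is only a sketch with three slots; the output column names `tstp` and `energy_kwh` are assumptions.

```python
import pandas as pd

# One household-day in wide form, truncated to three half-hour slots.
wide = pd.DataFrame(
    {
        "LCLid": ["MAC000041"],
        "day": ["2011-12-09"],
        "hh_0": [0.070],
        "hh_1": [0.062],
        "hh_2": [0.063],
    }
)

# Melt the hh_* columns into (slot, value) pairs: one row per observation.
long = wide.melt(id_vars=["LCLid", "day"], var_name="slot", value_name="energy_kwh")
long["slot"] = long["slot"].str.replace("hh_", "", regex=False).astype(int)

# Rebuild the timestamp at the start of each half-hour slot.
long["tstp"] = pd.to_datetime(long["day"]) + pd.to_timedelta(long["slot"] * 30, unit="min")

tidy = (
    long[["LCLid", "tstp", "energy_kwh"]]
    .sort_values(["LCLid", "tstp"])
    .reset_index(drop=True)
)
```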

## Explore the data

Create a table of summary statistics (i.e. min, mean, median, max, std. dev., range, the coefficient of variation) for both the time-series and structural variables.
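A summary table along these lines might be built as follows. The four sample rows are made up for illustration, and the coefficient of variation is taken as standard deviation divided by mean.

```python
import pandas as pd

# A few made-up daily values, using column names from the daily data.
daily = pd.DataFrame(
    {
        "energy_sum": [10.699, 11.301, 7.815, 12.569],
        "energy_max": [1.071, 0.744, 0.688, 1.128],
    }
)

# One row per variable, one column per statistic.
stats = daily.agg(["min", "mean", "median", "max", "std"]).T
stats["range"] = stats["max"] - stats["min"]
stats["cv"] = stats["std"] / stats["mean"]  # coefficient of variation
```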

Create line plots for time-series data over the following intervals: 24 hours of the day; days of the week; months of the year.
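The series behind such plots could be produced by grouping on parts of the timestamp. The sketch below uses a synthetic year of readings with an assumed evening uplift; each resulting series can then be drawn with its `.plot()` method.

```python
import numpy as np
import pandas as pd

# Synthetic year of half-hourly readings with higher demand from 17:00 to 21:59.
idx = pd.date_range("2013-01-01", "2013-12-31 23:30", freq="30min")
hours = idx.hour.to_numpy()
evening = (hours >= 17) & (hours <= 21)
energy = pd.Series(0.2 + 0.1 * evening, index=idx)

# Average profiles over the three intervals of interest.
by_hour = energy.groupby(energy.index.hour).mean()          # 24 hours of the day
by_weekday = energy.groupby(energy.index.dayofweek).mean()  # days of the week
by_month = energy.groupby(energy.index.month).mean()        # months of the year
```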

Create scatter plots of temperature, wind etc. versus electricity consumption.

Create other visualisations of the data to understand the impact of consumer segmentation on electricity consumption.

Create clustering visualisations to show interesting patterns in the data.

Apply any unsupervised learning methods to better understand the patterns in the data.
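As one example of an unsupervised method, the sketch below groups synthetic daily load profiles with a small hand-written k-means. It is not the method used in this notebook; the farthest-point initialisation is a simplification chosen to keep the example deterministic, and a library implementation would normally be preferable.

```python
import numpy as np

def farthest_point_init(x, k):
    """Pick the first row, then repeatedly the row farthest from chosen centres."""
    centres = [x[0]]
    for _ in range(1, k):
        dist = np.min([np.linalg.norm(x - c, axis=1) for c in centres], axis=0)
        centres.append(x[dist.argmax()])
    return np.array(centres)

def assign(x, centres):
    """Index of the nearest centre for every row of x."""
    dists = np.linalg.norm(x[:, None, :] - centres[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(x, k, n_iter=20):
    centres = farthest_point_init(x, k)
    for _ in range(n_iter):
        labels = assign(x, centres)
        for j in range(k):
            members = x[labels == j]
            if len(members):
                centres[j] = members.mean(axis=0)
    return assign(x, centres), centres

# Synthetic 48-slot profiles: five morning-peak and five evening-peak households.
slots = np.arange(48)
morning = np.exp(-(((slots - 16) / 4) ** 2))
evening = np.exp(-(((slots - 38) / 4) ** 2))
rng = np.random.default_rng(0)
profiles = np.vstack(
    [morning + rng.normal(0, 0.05, 48) for _ in range(5)]
    + [evening + rng.normal(0, 0.05, 48) for _ in range(5)]
)

labels, centres = kmeans(profiles, k=2)
```

The cluster centres are themselves average daily profiles, so they can be plotted directly to inspect what distinguishes each group.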

In [1]:
from fastai.structured import *
from fastai.column_data import *

In [2]:
PATH='../input/merged_data/'

Dataset background information can be found here:
    
http://jmdaignan.com/2018/01/28/Londonsmartmeter/

Kaggle link:

https://www.kaggle.com/jeanmidev/smart-meters-in-london/home
        

### HHBlock data

* The hhblock_dataset contains one row per household per day, with that day's half-hourly readings transposed into columns (as an array); for example, the hh_0 column is the consumption between 00:00 and 00:30

In [3]:
#merge datasets
#df = pd.concat(map(pd.read_csv, glob.glob(os.path.join('../input/hhblock_dataset/', "block_*.csv"))))
#df.to_csv('../input/merged_data/hhblock_all.csv')
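A self-contained sketch of this merge step is shown below. It writes two tiny stand-in block files to a temporary directory instead of reading `../input/hhblock_dataset/`, and it passes `index=False` to `to_csv`: by default `to_csv` also writes the row index, which reappears as an `Unnamed: 0` column when the file is read back.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Stand-ins for the per-block CSV files (made-up content).
    for i in range(2):
        block = pd.DataFrame(
            {"LCLid": [f"MAC{i:06d}"], "day": ["2011-12-09"], "hh_0": [0.07]}
        )
        block.to_csv(os.path.join(tmp, f"block_{i}.csv"), index=False)

    # Read every block file and stack them into one table.
    paths = sorted(glob.glob(os.path.join(tmp, "block_*.csv")))
    merged = pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)

    # index=False keeps the row index out of the merged file.
    out_path = os.path.join(tmp, "hhblock_all.csv")
    merged.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)
```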

### Daily data

* The daily_dataset contains daily information on the consumption of the households



In [4]:
#concatenate all daily data
#daily_all = pd.concat(map(pd.read_csv, glob.glob(os.path.join('../input/daily_dataset/', "block_*.csv"))))
#daily_all.to_csv('../input/merged_data/daily_all.csv')

### Half hourly data

* LCLid: the household id

* tstp: the timestamp of the measurement

* energy(kWh/hh): the energy consumed in the past 30 minutes, in kWh

### informations_households

* LCLid: the household id

* stdorToU: the kind of tariff applied (ToU is the dynamic time-of-use tariff, which varies by day; Std is the classic fixed tariff)

* Acorn: the ACORN group associated with the household

* Acorn_grouped: a coarser classification that merges several ACORN groups

* file: the name of the file, within the zip archives, that holds the household's data

### acorn_details

* Contains index values for multiple parameters, expressed relative to the national average (which has an index of 100)

https://acorn.caci.co.uk/downloads/Acorn-User-guide.pdf

Acorn is a segmentation tool which categorises the UK’s population into demographic types. Acorn segments households, postcodes and neighbourhoods into 6 categories, 18 groups and 62 types.

In [5]:
#concatenate all 1/2 hourly data
#halfhourly_all = pd.concat(map(pd.read_csv, glob.glob(os.path.join('../input/halfhourly_dataset/', "block_*.csv"))))
#halfhourly_all.to_csv('../input/merged_data/halfhourly_all.csv')

In [6]:
table_names = ['daily_all', 'halfhourly_all', 'hhblock_all', 'acorn_details', 
               'informations_households', 'uk_bank_holidays', 
               'weather_daily_darksky', 'weather_hourly_darksky']

In [7]:
tables = [pd.read_csv(f'{PATH}{fname}.csv', low_memory=False) for fname in table_names]

In [8]:
from IPython.display import HTML, display

In [9]:
for t, name in zip(tables, table_names): 
    print(name)
    display(t.head(n=2))

daily_all


Unnamed: 0.1,Unnamed: 0,LCLid,day,energy_median,energy_mean,energy_max,energy_count,energy_std,energy_sum,energy_min
0,0,MAC000041,2011-12-08,0.295,0.396259,1.071,27,0.285051,10.699,0.119
1,1,MAC000041,2011-12-09,0.204,0.235437,0.744,48,0.184686,11.301,0.023


halfhourly_all


Unnamed: 0.1,Unnamed: 0,LCLid,tstp,energy(kWh/hh)
0,0,MAC000041,2011-12-08 10:30:00.0000000,0.126
1,1,MAC000041,2011-12-08 11:00:00.0000000,0.12


hhblock_all


Unnamed: 0.1,Unnamed: 0,LCLid,day,hh_0,hh_1,hh_2,hh_3,hh_4,hh_5,hh_6,...,hh_38,hh_39,hh_40,hh_41,hh_42,hh_43,hh_44,hh_45,hh_46,hh_47
0,0,MAC000041,2011-12-09,0.07,0.062,0.063,0.062,0.062,0.062,0.062,...,0.313,0.424,0.361,0.324,0.389,0.331,0.309,0.292,0.297,0.283
1,1,MAC000041,2011-12-10,0.178,0.082,0.063,0.063,0.062,0.063,0.062,...,0.44,0.317,0.383,0.347,0.403,0.216,0.212,0.223,0.186,0.189


acorn_details


Unnamed: 0,MAIN CATEGORIES,CATEGORIES,REFERENCE,ACORN-A,ACORN-B,ACORN-C,ACORN-D,ACORN-E,ACORN-F,ACORN-G,ACORN-H,ACORN-I,ACORN-J,ACORN-K,ACORN-L,ACORN-M,ACORN-N,ACORN-O,ACORN-P,ACORN-Q
0,POPULATION,Age,Age 0-4,77.0,83.0,72.0,100.0,120.0,77.0,97.0,97.0,63.0,119.0,67.0,114.0,113.0,89.0,123.0,138.0,133.0
1,POPULATION,Age,Age 5-17,117.0,109.0,87.0,69.0,94.0,95.0,102.0,106.0,67.0,95.0,64.0,108.0,116.0,86.0,89.0,136.0,106.0


informations_households


Unnamed: 0,LCLid,stdorToU,Acorn,Acorn_grouped,file
0,MAC005492,ToU,ACORN-,ACORN-,block_0
1,MAC001074,ToU,ACORN-,ACORN-,block_0


uk_bank_holidays


Unnamed: 0,Bank holidays,Type
0,2012-12-26,Boxing Day
1,2012-12-25,Christmas Day


weather_daily_darksky


Unnamed: 0,temperatureMax,temperatureMaxTime,windBearing,icon,dewPoint,temperatureMinTime,cloudCover,windSpeed,pressure,apparentTemperatureMinTime,...,temperatureHigh,sunriseTime,temperatureHighTime,uvIndexTime,summary,temperatureLowTime,apparentTemperatureMin,apparentTemperatureMaxTime,apparentTemperatureLowTime,moonPhase
0,11.96,2011-11-11 23:00:00,123,fog,9.4,2011-11-11 07:00:00,0.79,3.88,1016.08,2011-11-11 07:00:00,...,10.87,2011-11-11 07:12:14,2011-11-11 19:00:00,2011-11-11 11:00:00,Foggy until afternoon.,2011-11-11 19:00:00,6.48,2011-11-11 23:00:00,2011-11-11 19:00:00,0.52
1,8.59,2011-12-11 14:00:00,198,partly-cloudy-day,4.49,2011-12-11 01:00:00,0.56,3.94,1007.71,2011-12-11 02:00:00,...,8.59,2011-12-11 07:57:02,2011-12-11 14:00:00,2011-12-11 12:00:00,Partly cloudy throughout the day.,2011-12-12 07:00:00,0.11,2011-12-11 20:00:00,2011-12-12 08:00:00,0.53


weather_hourly_darksky


Unnamed: 0,visibility,windBearing,temperature,time,dewPoint,pressure,apparentTemperature,windSpeed,precipType,icon,humidity,summary
0,5.97,104,10.24,2011-11-11 00:00:00,8.86,1016.76,10.24,2.77,rain,partly-cloudy-night,0.91,Partly Cloudy
1,4.88,99,9.76,2011-11-11 01:00:00,8.83,1016.63,8.24,2.95,rain,partly-cloudy-night,0.94,Partly Cloudy


In [10]:
for t, name in zip(tables, table_names): 
    print(name)
    display(DataFrameSummary(t).summary())

daily_all


Unnamed: 0.1,Unnamed: 0,LCLid,day,energy_median,energy_mean,energy_max,energy_count,energy_std,energy_sum,energy_min
count,3.51043e+06,,,3.5104e+06,3.5104e+06,3.5104e+06,3.51043e+06,3.4991e+06,3.5104e+06,3.5104e+06
mean,15791.6,,,0.158739,0.21173,0.834521,47.8036,0.172667,10.1241,0.0596258
std,9203.25,,,0.170186,0.190846,0.668316,2.81098,0.153121,9.12879,0.0870131
min,0,,,0,0,0,0,0,0,0
25%,7835,,,0.067,0.0980833,0.346,48,0.0691163,4.682,0.02
50%,15723,,,0.1145,0.163292,0.688,48,0.132791,7.815,0.039
75%,23629,,,0.191,0.262458,1.128,48,0.229312,12.569,0.071
max,36167,,,6.9705,6.92825,10.761,48,4.02457,332.556,6.524
counts,3510433,3510433,3510433,3510403,3510403,3510403,3510433,3499102,3510403,3510403
uniques,36168,5566,829,10437,421337,6425,44,3275190,401153,2149


halfhourly_all


Unnamed: 0.1,Unnamed: 0,LCLid,tstp,energy(kWh/hh)
count,1.67817e+08,,,
mean,754967,,,
std,439993,,,
min,0,,,
25%,374591,,,
50%,751675,,,
75%,1.12964e+06,,,
max,1.73057e+06,,,
counts,167817021,167817021,167817021,167817021
uniques,1730575,5566,40405,9611


hhblock_all


Unnamed: 0.1,Unnamed: 0,LCLid,day,hh_0,hh_1,hh_2,hh_3,hh_4,hh_5,hh_6,...,hh_38,hh_39,hh_40,hh_41,hh_42,hh_43,hh_44,hh_45,hh_46,hh_47
count,3.46935e+06,,,3.46935e+06,3.46935e+06,3.46935e+06,3.46935e+06,3.46935e+06,3.46935e+06,3.46935e+06,...,3.46935e+06,3.46935e+06,3.46935e+06,3.46935e+06,3.46935e+06,3.46935e+06,3.46935e+06,3.46935e+06,3.46935e+06,3.46935e+06
mean,15607.6,,,0.179307,0.169293,0.151992,0.138064,0.127924,0.121697,0.117061,...,0.321005,0.320821,0.3158,0.30992,0.300074,0.287665,0.266337,0.242233,0.21441,0.187216
std,9096.51,,,0.308812,0.329103,0.298032,0.265524,0.237337,0.219654,0.207623,...,0.368126,0.363025,0.352158,0.34236,0.330115,0.31898,0.302796,0.284976,0.264942,0.241926
min,0,,,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
25%,7744,,,0.052,0.048,0.046,0.044,0.042,0.042,0.041,...,0.101,0.105,0.108,0.109,0.108,0.103,0.092,0.079,0.067,0.058
50%,15539,,,0.098,0.088,0.081,0.077,0.074,0.072,0.071,...,0.201,0.205,0.207,0.207,0.203,0.195,0.178,0.158,0.134,0.113
75%,23353,,,0.191,0.168,0.15,0.138,0.13,0.125,0.122,...,0.39,0.389,0.382,0.375,0.363,0.349,0.323,0.294,0.258,0.221
max,35800,,,7.272,8.717,8.025,8.75,8.414,8.591,7.357,...,8.833,9.141,8.998,9.189,8.539,9.257,7.819,8.051,7.769,8.411
counts,3469352,3469352,3469352,3469352,3469352,3469352,3469352,3469352,3469352,3469352,...,3469352,3469352,3469352,3469352,3469352,3469352,3469352,3469352,3469352,3469352
uniques,35801,5560,827,5569,5763,5442,5099,4785,4493,4358,...,4456,4409,4366,4381,4307,4261,4090,3974,3820,3639


acorn_details


Unnamed: 0,MAIN CATEGORIES,CATEGORIES,REFERENCE,ACORN-A,ACORN-B,ACORN-C,ACORN-D,ACORN-E,ACORN-F,ACORN-G,ACORN-H,ACORN-I,ACORN-J,ACORN-K,ACORN-L,ACORN-M,ACORN-N,ACORN-O,ACORN-P,ACORN-Q
count,,,,826,826,826,826,826,826,826,826,826,826,826,826,826,826,826,826,826
mean,,,,131.313,110.86,100.081,136.858,117.895,95.5745,101.444,97.2989,87.0285,104.217,127.483,93.7242,91.4103,79.9124,95.5793,100.141,90.8554
std,,,,201.448,42.464,30.0995,97.7408,35.7688,33.6367,21.799,18.2292,30.3378,19.924,97.4282,22.177,22.9096,33.9952,25.9358,37.2103,37.634
min,,,,12,0.957011,0.281968,2,21,0,0.791419,1.15545,6.36326,16.0507,17,0.393546,0.714857,2,11,9,1
25%,,,,87,94,86,93.0922,99,81,94.1381,91,70,97,85,86,82,60.2535,86,82.25,71.25
50%,,,,104,107,100,121,117,98,102,99,88,105,109,95,93,74,96,96,87
75%,,,,128,122,113,154,135,108,109,105,101.75,115,144,102,101,93.1584,104,109,101
max,,,,3795,419,272,1159.03,286,462,295,192,410,197,1821,280,161,295,252,389,326
counts,826,826,826,826,826,826,826,826,826,826,826,826,826,826,826,826,826,826,826,826
uniques,15,84,632,237,194,167,256,191,159,136,121,159,130,247,134,140,166,155,169,173


informations_households


Unnamed: 0,LCLid,stdorToU,Acorn,Acorn_grouped,file
count,5566,5566,5566,5566,5566
unique,5566,2,19,5,112
top,MAC002452,Std,ACORN-E,Affluent,block_82
freq,1,4443,1567,2192,50
counts,5566,5566,5566,5566,5566
uniques,5566,2,19,5,112
missing,0,0,0,0,0
missing_perc,0%,0%,0%,0%,0%
types,unique,bool,categorical,categorical,categorical


uk_bank_holidays


Unnamed: 0,Bank holidays,Type
count,25,25
unique,25,11
top,2014-12-25,Christmas Day
freq,1,3
counts,25,25
uniques,25,11
missing,0,0
missing_perc,0%,0%
types,unique,categorical


weather_daily_darksky


Unnamed: 0,temperatureMax,temperatureMaxTime,windBearing,icon,dewPoint,temperatureMinTime,cloudCover,windSpeed,pressure,apparentTemperatureMinTime,...,temperatureHigh,sunriseTime,temperatureHighTime,uvIndexTime,summary,temperatureLowTime,apparentTemperatureMin,apparentTemperatureMaxTime,apparentTemperatureLowTime,moonPhase
count,882,,882,,882,,881,882,882,,...,882,,,,,,882,,,882
mean,13.6601,,195.703,,6.53003,,0.477605,3.5818,1014.13,,...,13.5424,,,,,,5.73804,,,0.50093
std,6.18274,,89.3408,,4.83088,,0.193514,1.69401,11.073,,...,6.2602,,,,,,6.04875,,,0.287022
min,-0.06,,0,,-7.84,,0,0.2,979.25,,...,-0.81,,,,,,-8.88,,,0
25%,9.5025,,120.5,,3.18,,0.35,2.37,1007.43,,...,9.2125,,,,,,1.105,,,0.26
50%,12.625,,219,,6.38,,0.47,3.44,1014.62,,...,12.47,,,,,,4.885,,,0.5
75%,17.92,,255,,10.0575,,0.6,4.5775,1021.75,,...,17.91,,,,,,11.2775,,,0.75
max,32.4,,359,,17.77,,1,9.96,1040.92,,...,32.4,,,,,,20.54,,,0.99
counts,882,882,882,882,882,882,881,882,882,882,...,882,882,882,881,882,882,882,882,882,882
uniques,711,882,304,6,687,882,96,466,802,882,...,715,882,882,881,88,882,706,882,882,100


weather_hourly_darksky


Unnamed: 0,visibility,windBearing,temperature,time,dewPoint,pressure,apparentTemperature,windSpeed,precipType,icon,humidity,summary
count,21165,21165,21165,,21165,21152,21165,21165,,,21165,
mean,11.1665,195.686,10.4715,,6.5305,1014.13,9.23034,3.90522,,,0.781829,
std,3.09934,90.6295,5.7819,,5.04197,11.3883,6.94092,2.02685,,,0.140369,
min,0.18,0,-5.64,,-9.98,975.74,-8.88,0.04,,,0.23,
25%,10.12,121,6.47,,2.82,1007.43,3.9,2.42,,,0.7,
50%,12.26,217,9.93,,6.57,1014.78,9.36,3.68,,,0.81,
75%,13.08,256,14.31,,10.33,1022.05,14.32,5.07,,,0.89,
max,16.09,359,32.4,,19.88,1043.32,32.42,14.8,,,1,
counts,21165,21165,21165,21165,21165,21152,21165,21165,21165,21165,21165,21165
uniques,953,360,2803,21165,2398,4988,3124,1095,2,7,78,13
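In the halfhourly_all summary above, the energy(kWh/hh) column has no numeric statistics, which suggests pandas read it as text rather than as numbers. One possible cause, assumed here rather than verified, is non-numeric placeholder strings in the raw files. A hedged way to handle that is to coerce the column and count what fails to parse; the `energy_kwh` name and the sample values are illustrative.

```python
import pandas as pd

# Made-up sample: two valid readings and one non-numeric placeholder.
raw = pd.DataFrame({"energy(kWh/hh)": ["0.126", "0.12", "Null"]})

# Coerce to numbers; anything unparseable becomes NaN instead of raising.
raw["energy_kwh"] = pd.to_numeric(raw["energy(kWh/hh)"], errors="coerce")

# Count the values that could not be parsed so they can be inspected.
n_bad = int(raw["energy_kwh"].isna().sum())
```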


### Join data

join_df is a function for joining tables on specific fields. By default, it performs a left outer join of the right table onto the left table, using the given fields for each table.

Pandas does joins using the merge method. The suffixes argument describes the naming convention for duplicate fields. We've elected to leave the duplicate field names on the left untouched, and append a "_y" to those on the right.

**This is pretty memory hungry**

In [11]:
def join_df(left, right, left_on, right_on=None, suffix='_y'):
    if right_on is None: right_on = left_on
    return left.merge(right, how='left', left_on=left_on, right_on=right_on, 
                      suffixes=("", suffix))
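A small usage sketch of join_df on made-up tables, showing the left-join behaviour and the suffix applied to a duplicated column. The `source` column and the `MAC999999` id are hypothetical and exist only to trigger those two cases.

```python
import pandas as pd

def join_df(left, right, left_on, right_on=None, suffix='_y'):
    if right_on is None: right_on = left_on
    return left.merge(right, how='left', left_on=left_on, right_on=right_on,
                      suffixes=("", suffix))

# Left table: readings, including a household missing from the right table.
readings = pd.DataFrame({
    "LCLid": ["MAC000041", "MAC000041", "MAC999999"],
    "energy": [0.126, 0.12, 0.3],
    "source": ["meter", "meter", "meter"],
})

# Right table: one row per household, with a clashing "source" column.
households = pd.DataFrame({
    "LCLid": ["MAC000041"],
    "stdorToU": ["Std"],
    "source": ["info"],
})

joined = join_df(readings, households, "LCLid")
```

All three left rows are kept; the unmatched household gets NaN in the right-hand columns, and the right table's `source` arrives as `source_y`.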

In [15]:
# tables[0] is daily_all and tables[2] is hhblock_all (see table_names)
halfhourly_combined = join_df(tables[0], tables[2], "LCLid", "LCLid") 

MemoryError: 
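A likely reason for the MemoryError is that LCLid is not unique in either table: both hold one row per household per day, so joining on LCLid alone pairs every row of a household on the left with every row of the same household on the right. With roughly 630 and 620 rows per household (a rough estimate from the counts and unique LCLid values in the summaries above), that is several hundred thousand rows per household and on the order of billions overall. The sketch below reproduces the effect on a tiny made-up example and shows one possible alternative, joining on both LCLid and day; whether that is the intended join is an assumption.

```python
import pandas as pd

# Three days for one household in each table (made-up values).
days = pd.date_range("2011-12-09", periods=3).strftime("%Y-%m-%d").tolist()
daily = pd.DataFrame({"LCLid": "MAC000041", "day": days, "energy_sum": [11.3, 9.8, 10.1]})
hhblock = pd.DataFrame({"LCLid": "MAC000041", "day": days, "hh_0": [0.07, 0.178, 0.09]})

# Row count to expect when joining on LCLid alone: product of group sizes.
rows_if_joined_on_id = int(
    (daily.groupby("LCLid").size() * hhblock.groupby("LCLid").size()).sum()
)

# Joining on LCLid alone is many-to-many: 3 x 3 = 9 rows here.
by_id = daily.merge(hhblock, on="LCLid", how="left")

# Joining on the composite key keeps one row per household-day;
# validate raises if either side has duplicate keys.
by_id_day = daily.merge(hhblock, on=["LCLid", "day"], how="left", validate="one_to_one")
```

Estimating the output size from the group sizes before running a large join is a cheap way to catch this kind of blow-up.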

In [None]:
halfhourly_combined.head(n=2)

In [None]:
# tables[4] is informations_households
halfhourly_combined = join_df(halfhourly_combined, tables[4], "LCLid", "LCLid")

In [None]:
halfhourly_combined.to_csv('../input/merged_data/halfhourly_combined.csv')