Load the electricity readings for every evaluation period. This creates one np.memmap per period. A memmap behaves like a regular NumPy array but is backed by a file on disk rather than held in memory, which makes it possible to create and work with arrays that do not fit into RAM. Each np.memmap stores the readings of one home per column, so it has as many columns as there are homes in use. Additional information, such as which home is in which column, is stored in a separate file.
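As a minimal sketch of the memmap round trip described here, the following creates a small disk-backed array, writes one column, and reopens it. The file name and sizes are made up for illustration and are not the ones used in this notebook.

```python
import os
import tempfile

import numpy as np

# Hypothetical sizes and file name, for illustration only
n_rows, n_homes = 5, 3
path = os.path.join(tempfile.mkdtemp(), 'demo_readings.npy')

# mode='w+' creates (or overwrites) the file; new files start out zero-filled
mm = np.memmap(path, dtype='float32', mode='w+', shape=(n_rows, n_homes))
mm[:, 1] = np.arange(n_rows)  # readings of the home stored in column 1
mm.flush()                    # write pending changes to disk
del mm

# The file holds raw array bytes only, so dtype and shape must be given again
mm_read = np.memmap(path, dtype='float32', mode='r', shape=(n_rows, n_homes))
col_sum = float(mm_read[:, 1].sum())
```

Because np.memmap writes no header, the `.npy` suffix does not make the file readable with np.load; the dtype and shape have to be tracked separately.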

In [1]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [2]:
import sys
sys.path.insert(0, '../src')

import pickle
import datetime

import numpy as np
import pandas as pd

import seaborn as sns

from pathlib import Path
from multiprocessing import Pool
from functools import partial

from IdealDataInterface import IdealDataInterface

from utils import treatment_control, load_mains
from config import SENSOR_DATA_FOLDER, CACHE_FOLDER, CPU_HIGH_MEMMORY, CPU_LOW_MEMMORY
from config import EVALUATION_PERIOD, FFILL_LIMIT, READING_FREQ

In [3]:
# Run plotting styles
%run -i '../src/sns_styles.py'

cmap = sns.color_palette()

In [4]:
df_group = treatment_control()

df_group.tail()

Unnamed: 0,homeid,group,start_date,end_date
263,331,treatment,2018-05-15,2018-06-30
264,332,treatment,2018-05-15,2018-06-30
265,334,control,2018-05-15,2018-06-30
266,335,treatment,2018-05-15,2018-06-30
267,333,control,2018-05-15,2018-06-30


In [5]:
homeid_control = df_group.loc[df_group['group'] == 'control','homeid']
homeid_treatment = df_group.loc[df_group['group'] == 'treatment','homeid']
homeid_enhanced = df_group.loc[df_group['group'] == 'enhanced','homeid']

print('Found {} homes in the control group'.format(len(homeid_control)))
print('Found {} homes in the treatment group'.format(len(homeid_treatment)))
print('Found {} homes in the enhanced group'.format(len(homeid_enhanced)))

Found 107 homes in the control group
Found 107 homes in the treatment group
Found 39 homes in the enhanced group


In [6]:
# Create and define the folder to store computation results
fpath = CACHE_FOLDER / Path('sampling_cache/')

# Create the folder (and any missing parent folders) if it does not exist yet
fpath.mkdir(parents=True, exist_ok=True)

Periods can differ in length, so each np.memmap may have a different number of rows. The shape of every memmap is therefore pre-computed and stored below, because np.memmap needs the shape to open the files again.
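To illustrate how the row count follows from the period length, here is a small sketch with two invented periods and a one-second frequency. The period names, dates, frequency and number of homes are assumptions standing in for EVALUATION_PERIOD, READING_FREQ and homeids.

```python
import pandas as pd

# Hypothetical periods (start and end are both meant to be included)
periods = {
    'short': ('2018-05-15 00:00:00', '2018-05-15 00:00:59'),
    'long': ('2018-05-15 00:00:00', '2018-05-15 00:09:59'),
}
n_homes = 4  # hypothetical number of homes (columns)

demo_shapes = {}
for name, (start, end) in periods.items():
    # date_range includes both endpoints by default
    idx = pd.date_range(start=start, end=end, freq='s')
    demo_shapes[name] = (len(idx), n_homes)
```

One minute at one-second resolution gives 60 rows and ten minutes give 600, so the two memmaps need different shapes even though they share the same columns.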

In [7]:
# Specify the homeids which should be loaded
homeids = list(homeid_control) + list(homeid_treatment)

# Pre-compute the index for each period defined in config.py
date_ranges = dict()
for period, (start_date, end_date) in EVALUATION_PERIOD.items():
    # Assemble the time index. closed=None keeps both endpoints, so EVALUATION_PERIOD
    # must define start_date as the first timestamp to include and end_date as the
    # last timestamp to include.
    # (On pandas versions that no longer accept `closed`, inclusive='both' is the equivalent.)
    index = pd.date_range(start=start_date, end=end_date, freq=READING_FREQ, closed=None)
    
    # Store the index
    date_ranges[period] = index


# Compute the shape each memmap must have
shapes = { period:(len(date_ranges[period]),len(homeids)) for period in EVALUATION_PERIOD.keys() }

In [8]:
# Filenames for the memmaps
fname = lambda s: fpath / Path('mmap_readings_period_{}.npy'.format(s))
    
# Store all the additional information to disk
print('Dumping the additional information..')
with open(fpath / Path('mmap_supplement.pkl'), 'wb') as f:
    pickle.dump((homeids, date_ranges, shapes), f)
print('Done.')

Dumping the additional information..
Done.


Load the data for each home and place the readings into the respective memmaps.
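The alignment step used when filling the memmaps can be tried on a toy series first. This sketch assumes that load_mains returns a named pandas Series with a unique DatetimeIndex; the series name and values below are invented.

```python
import numpy as np
import pandas as pd

# Fixed target index: six one-second steps (hypothetical period)
index = pd.date_range('2018-05-15 00:00:00', periods=6, freq='s')

# Toy readings: one before the period, two inside it
ts = pd.Series(
    [120.0, 135.0, 99.0],
    index=pd.to_datetime([
        '2018-05-14 23:59:59',
        '2018-05-15 00:00:01',
        '2018-05-15 00:00:04',
    ]),
    name='demo_mains',  # hypothetical name; join() needs a named Series
)

# Left join onto an all-NaN frame: every timestamp of the period is kept
joined = pd.DataFrame(np.nan, index=index, columns=['missing']).join(ts, how='left')
del joined['missing']
joined = joined.squeeze()

# With unique timestamps, reindexing yields the same values
reindexed = ts.reindex(index)
```

Both results have exactly one value per timestamp of the period, with NaN wherever no reading exists, which is what allows a whole column of the memmap to be assigned in one step.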

In [9]:
# Create the memmaps. THIS WILL OVERWRITE EXISTING FILES!!!
mmaps = { p:np.memmap(fname(p), dtype='float32', mode='w+', shape=shapes[p]) for p in EVALUATION_PERIOD.keys() }

# Load the readings and put them into the memmap
print('Loading the readings..')
for col, homeid in enumerate(homeids):
    # Load the mains electricity readings
    ts = load_mains(homeid)
    
    # Iterate over each period and put the data into the memmap
    for p in EVALUATION_PERIOD.keys():
        # Get the pre-computed index
        index = date_ranges[p]
        
        # Limit the data to the current evaluation period. This will be done by joining the
        # readings to an all missing DataFrame as this will ensure that we'll always have
        # at least NaNs in the final DataFrame.
        tsr = pd.DataFrame(np.nan, index=index, columns=['missing']).join(ts, how='left')
        del tsr['missing']
        tsr = tsr.squeeze()
        
        # Store the result in the memmap
        mmaps[p][:,col] = tsr.values
        
# Make sure all data is written to disk
for mm in mmaps.values():
    mm.flush()
    
print('Done.')

Loading the readings..
Done.
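To read the data back in a later session, the supplement has to be loaded first, since it carries the shapes and column order. The following is a self-contained sketch on toy data; the home IDs, the period name 'P1' and the temporary folder are assumptions, and only the file name pattern mirrors the notebook.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd

tmp = tempfile.mkdtemp()
mmap_path = os.path.join(tmp, 'mmap_readings_period_P1.npy')
supp_path = os.path.join(tmp, 'mmap_supplement.pkl')

# Toy stand-ins for homeids, date_ranges and shapes (hypothetical values)
homeids = [61, 62]
date_ranges = {'P1': pd.date_range('2018-05-15', periods=4, freq='s')}
shapes = {'P1': (4, 2)}

# Write a toy memmap and its supplement
mm = np.memmap(mmap_path, dtype='float32', mode='w+', shape=shapes['P1'])
mm[:] = np.arange(8, dtype='float32').reshape(4, 2)
mm.flush()
del mm
with open(supp_path, 'wb') as f:
    pickle.dump((homeids, date_ranges, shapes), f)

# Later session: load the supplement, then reopen the memmap read-only
with open(supp_path, 'rb') as f:
    homeids_r, date_ranges_r, shapes_r = pickle.load(f)
readings = np.memmap(mmap_path, dtype='float32', mode='r', shape=shapes_r['P1'])

# Optional labelling; for large arrays this may copy the data into memory
df = pd.DataFrame(np.asarray(readings), index=date_ranges_r['P1'], columns=homeids_r)
```

Opening with mode='r' protects the cached readings from accidental writes. Working on slices of `readings` directly, rather than wrapping everything in a DataFrame, keeps memory use low for the full-size arrays.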
