Assuming the Central Limit Theorem applies and we can bootstrap energy estimates for a day, we need an estimate of the 'true' distribution. For this, we take the readings from each period as the estimate. This notebook asks how many examples per time bin are available for different values of the window_width parameter (the more examples, the better).
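
To make the setup concrete, here is a minimal sketch of the kind of bootstrap being considered. It is not the implementation used in this project. It assumes that each time bin of the day has a pool of historical power readings and that one daily energy estimate is formed by drawing one reading per bin and summing. The function name `bootstrap_daily_energy`, the hourly bin length and the synthetic data are all hypothetical.

```python
import numpy as np

def bootstrap_daily_energy(pools, bin_seconds, n_boot, rng):
    """Draw n_boot bootstrap estimates of daily energy in Wh.

    Hypothetical helper. `pools` is a list of 1-D arrays, one per time bin,
    holding power readings in W. Every pool must be non-empty.
    """
    estimates = np.zeros(n_boot)
    for pool in pools:
        # Resample one reading per bin for every bootstrap replicate
        draws = rng.choice(pool, size=n_boot, replace=True)
        estimates += draws * bin_seconds / 3600.0  # W * s -> Wh
    return estimates

rng = np.random.default_rng(0)
# Synthetic data: 24 hourly bins with 40 made-up readings each
pools = [rng.gamma(shape=2.0, scale=100.0, size=40) for _ in range(24)]
est = bootstrap_daily_energy(pools, bin_seconds=3600, n_boot=1000, rng=rng)
print(est.mean(), est.std())
```

A bin with very few readings gives the resampling little to draw from, which is why the number of examples per time bin matters.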

In [1]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [2]:
import sys
sys.path.insert(0, '../src')

import pickle
import datetime

import numpy as np
import pandas as pd

import seaborn as sns

from pathlib import Path
from multiprocessing import Pool
from functools import partial

from IdealDataInterface import IdealDataInterface

from config import SENSOR_DATA_FOLDER, CACHE_FOLDER
from config import EVALUATION_PERIOD, CPU_HIGH_MEMMORY, CPU_LOW_MEMMORY

from utils import load_cached_data
from sampling import data_to_sample_array, compute_sample_sizes

In [3]:
# Run plotting styles
%run -i '../src/sns_styles.py'

cmap = sns.color_palette()

In [4]:
# Store the result in here
sample_sizes = dict()

# Compute the available sample size if the following seconds around the center are included
# The sample size for 0 seconds will always be included.
seconds = np.array([0,1,2,3,4,5,15,30,60,5*60])

# Compute the available sample sizes

In [5]:
func = partial(compute_sample_sizes, seconds=seconds)

print('Computing the sampling sizes. This may take a while..')

for period in EVALUATION_PERIOD.keys():
    # Load the data, only keep homes with enough data (see config.py)
    df = load_cached_data(period, full=False)

    # Compute the sample sizes
    with Pool(processes=CPU_LOW_MEMMORY) as pool:
        sample_sizes[period] = pool.map(func, [ df.loc[:,c] for c in df.columns ])

print('Done.')

Computing the sampling sizes. This may take a while..
Done.
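
The implementation of `compute_sample_sizes` is not shown in this notebook. The sketch below is a hypothetical stand-in, written under the assumption that it counts, for every second of the day, the valid readings that fall within ±s seconds of that time point, pooled over all days. The function name `window_sample_counts`, the wrap-around at midnight and the synthetic series are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

def window_sample_counts(series, seconds):
    """Hypothetical stand-in for compute_sample_sizes.

    `series` holds 1 Hz readings indexed by timestamp, NaN where missing.
    Returns {s: array of length 86400} with the number of valid readings
    within +-s seconds of each second of the day, pooled over all days.
    """
    valid = series.dropna()
    idx = valid.index
    sec_of_day = (idx.hour * 3600 + idx.minute * 60 + idx.second).to_numpy()
    # Valid readings per second of the day (window of 0 seconds)
    base = np.bincount(sec_of_day, minlength=86400)
    counts = {}
    for s in seconds:
        s = int(s)
        total = np.zeros_like(base)
        # Sum the 2*s + 1 neighbouring bins; np.roll wraps around midnight (assumption)
        for shift in range(-s, s + 1):
            total += np.roll(base, shift)
        counts[s] = total
    return counts

# Synthetic example: three full days at 1 Hz with one missing reading
idx = pd.date_range("2020-01-01", periods=3 * 86400, freq="s")
readings = pd.Series(1.0, index=idx)
readings.iloc[10] = np.nan
counts = window_sample_counts(readings, seconds=[0, 1, 2])
print(counts[0][10], counts[1][10], counts[2][10])
```

The result is keyed by window size, mirroring how `sample_sizes` is indexed as `i[s]` in the cells that follow.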


# Absolute minimum of samples to draw from

This looks at the worst case: the smallest sample size at any time point for any home during that period. If a value of zero appears, there is at least one home with at least one time point for which the estimate cannot be computed.

In [6]:
# For each period and window size, take the minimum over all homes and all time points
dat = { p:[ min([ i[s].min() for i in sample_sizes[p] ]) for s in seconds ] for p in EVALUATION_PERIOD.keys() }

df_sampling_min = pd.DataFrame(dat, index=seconds)

df_sampling_min

seconds,P1_1,P1_2,P2_1,P2_2,P3
0,43,35,29,15,26
1,129,106,89,45,78
2,215,178,150,77,130
3,301,250,214,109,182
4,387,325,280,141,234
5,473,404,347,173,286
15,1359,1242,1026,503,831
30,2712,2510,2060,1017,1642
60,5384,5061,4135,2253,3358
300,28048,25639,21764,13087,17737
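
As a plausibility check on these numbers, assume that a window of ±s seconds pools the readings of the 2s + 1 adjacent one-second bins (an assumption about `compute_sample_sizes`, whose implementation is not shown here). Under that assumption, the minimum for window s can never be smaller than (2s + 1) times the minimum for s = 0. The sketch below copies the values from the table by hand and checks this lower bound.

```python
import numpy as np
import pandas as pd

seconds = np.array([0, 1, 2, 3, 4, 5, 15, 30, 60, 300])

# Values copied from the table of absolute minima
table = pd.DataFrame({
    "P1_1": [43, 129, 215, 301, 387, 473, 1359, 2712, 5384, 28048],
    "P1_2": [35, 106, 178, 250, 325, 404, 1242, 2510, 5061, 25639],
    "P2_1": [29, 89, 150, 214, 280, 347, 1026, 2060, 4135, 21764],
    "P2_2": [15, 45, 77, 109, 141, 173, 503, 1017, 2253, 13087],
    "P3":   [26, 78, 130, 182, 234, 286, 831, 1642, 3358, 17737],
}, index=seconds)

# Number of one-second bins covered by a window of +-s seconds
window_bins = 2 * seconds + 1

# Lower bound: (2s + 1) times the minimum observed for s = 0
lower_bound = pd.DataFrame(
    np.outer(window_bins, table.loc[0].to_numpy()),
    index=seconds, columns=table.columns,
)

print((table >= lower_bound).all().all())  # True
print((table / lower_bound).round(2))
```

For small windows the ratio stays close to 1, so the worst time point drives the result. For wider windows the ratio grows because neighbouring bins with more readings are pooled in.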


# 1st percentile of samples to draw from

Instead of the absolute minimum, check the minimum of the 1st percentile per home. The 1st percentile is computed for each home (across the time points of the day), and the minimum of those values across homes is displayed.

In [7]:
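# For each period and window size, take the 1st percentile per home, then the minimum over homes
dat = { p:[ min([ np.percentile(i[s], 1) for i in sample_sizes[p] ]) for s in seconds ] for p in EVALUATION_PERIOD.keys() }

# Note: despite the name, this DataFrame holds the 1st percentile (see np.percentile(..., 1) above)
df_sampling_5_percentile = pd.DataFrame(dat, index=seconds)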

df_sampling_5_percentile

seconds,P1_1,P1_2,P2_1,P2_2,P3
0,47.0,40.0,35.0,23.0,30.0
1,141.0,122.0,105.0,69.0,89.0
2,235.0,205.0,176.0,114.0,148.0
3,329.0,288.0,247.0,160.0,207.0
4,422.0,372.0,318.0,206.0,266.0
5,516.0,456.0,389.0,252.0,326.0
15,1452.99,1318.0,1109.0,715.0,919.0
30,2853.0,2616.0,2197.0,1408.0,1815.0
60,5680.0,5206.99,4391.99,2749.99,3629.99
300,28569.99,25939.0,22094.0,13614.0,18366.0
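
The following toy example illustrates why the 1st percentile is less pessimistic than the absolute minimum. The data is synthetic and the variable names are hypothetical: three homes with 1000 time points each, one of which has a single time point without any samples. It also shows that `np.percentile` interpolates linearly between data points by default, which is why non-integer values can appear in the table even though sample counts are integers.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic sample counts: 3 homes, 1000 time points each, values in [40, 60)
homes = [rng.integers(40, 60, size=1000) for _ in range(3)]
homes[0][0] = 0  # a single time point with no samples at all

# Absolute minimum over all homes and time points
abs_min = min(h.min() for h in homes)

# 1st percentile per home, then the minimum over homes
p1_min = min(np.percentile(h, 1) for h in homes)

print(abs_min, p1_min)

# Default linear interpolation yields non-integer results for integer input
print(np.percentile([10, 20, 30], 1))  # approximately 10.2
```

A single empty bin drives the absolute minimum to zero, while the 1st percentile ignores it because fewer than 1% of the time points are affected.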
