The Household Power Consumption dataset is a multivariate time series that describes the electricity consumption of a single household over four years. The data was collected between December 2006 and November 2010, with one observation of household power consumption recorded every minute. The series is composed of seven variables (besides the date and time):

* global active power: The total active power consumed by the household (kilowatts).
* global reactive power: The total reactive power consumed by the household (kilowatts).
* voltage: Average voltage (volts).
* global intensity: Average current intensity (amps).
* sub metering 1: Active energy for kitchen (watt-hours of active energy).
* sub metering 2: Active energy for laundry (watt-hours of active energy).
* sub metering 3: Active energy for climate control systems (watt-hours of active energy).

In [65]:
import numpy  as np
import pandas as pd

# load the raw semicolon-delimited file, combining the date and time columns into one datetime index
dataset = pd.read_csv('./Downloads/data/household_power_consumption.txt',
                      sep=';', header=0, low_memory=False,
                      infer_datetime_format=True,
                      parse_dates={'datetime': [0, 1]},
                      index_col=['datetime'])


First, we mark all missing values, which are indicated with a ‘?’ character, as NaN, which is a float. This lets us work with the data as one array of floating point values rather than mixed types, which are less efficient. We then process the original data and save it to a CSV file for the rest of the project.

In [67]:
# mark all missing values
dataset.replace('?', np.nan, inplace=True)
# make dataset numeric
dataset = dataset.astype('float32')

We also need to fill in the missing values now that they have been marked. A very simple approach is to copy the observation from the same time the day before. We can implement this in a function named fill_missing() that takes the NumPy array of the data and copies values from exactly 24 hours earlier.

In [70]:
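# fill missing values with a value at the same time one day ago
def fill_missing(values):
    # number of per-minute observations in one day
    one_day = 60 * 24
    for row in range(values.shape[0]):
        for col in range(values.shape[1]):
            if np.isnan(values[row, col]):
                # note: for rows in the first day, row - one_day is negative and
                # indexes from the end of the array
                values[row, col] = values[row - one_day, col]

# fill missing (modifies the array in place; this relies on dataset.values
# being a writable view of the float32 data)
fill_missing(dataset.values)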
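
The nested loop above visits every cell, which can be slow on roughly two million rows. As a hedged alternative, the same "copy from one day earlier" idea can be sketched with `DataFrame.shift()`. The helper name `fill_missing_shifted` and the toy data are assumptions made for illustration, not part of the original notebook. Unlike the loop, this sketch leaves a NaN in place when no earlier value exists instead of wrapping around to the end of the array.

```python
import numpy as np
import pandas as pd

def fill_missing_shifted(frame, periods):
    """Fill NaNs with the value `periods` rows earlier (hypothetical helper)."""
    filled = frame.copy()
    # Repeat so that gaps longer than `periods` rows are filled
    # from values that were themselves filled in an earlier pass.
    while filled.isna().to_numpy().any():
        candidate = filled.fillna(filled.shift(periods))
        if candidate.equals(filled):
            break  # remaining NaNs have no earlier value to copy from
        filled = candidate
    return filled

# Toy series with 3 observations per "day" instead of 60 * 24.
toy = pd.DataFrame({"power": [1.0, 2.0, 3.0, np.nan, 5.0, 6.0, np.nan, np.nan, 9.0]})
filled = fill_missing_shifted(toy, periods=3)
print(filled["power"].tolist())
```

For the real data, `periods` would be `60 * 24`, matching `one_day` in `fill_missing()`.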

In [72]:
dataset.isna().sum()

Global_active_power      0
Global_reactive_power    0
Voltage                  0
Global_intensity         0
Sub_metering_1           0
Sub_metering_2           0
Sub_metering_3           0
dtype: int64

In [74]:
# add a column for the remainder of sub metering:
# total active energy per minute (watt-hours) minus the three sub-metered circuits
values = dataset.values
dataset['sub_metering_4'] = (values[:,0] * 1000 / 60) - (values[:,4] + values[:,5] + values[:,6])
# save updated dataset
dataset.to_csv('./Downloads/data/household_power_consumption.csv')
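
To make the unit conversion behind `sub_metering_4` concrete, here is a small worked example using the first observation of the dataset (2006-12-16 17:24:00). Global active power is in kilowatts averaged over one minute, so multiplying by 1000 and dividing by 60 gives watt-hours for that minute, the same unit as the sub-metering columns. The variable names are illustrative only.

```python
# Values for the first observation of the dataset (2006-12-16 17:24:00).
global_active_power_kw = 4.216
sub_1, sub_2, sub_3 = 0.0, 1.0, 17.0  # watt-hours

# kW averaged over one minute -> watt-hours consumed in that minute
total_wh = global_active_power_kw * 1000 / 60

# energy not accounted for by the three sub-metered circuits
remainder_wh = total_wh - (sub_1 + sub_2 + sub_3)

print(round(total_wh, 4), round(remainder_wh, 4))
```

The remainder of about 52.27 watt-hours agrees with the `sub_metering_4` value stored for that row.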

**Given recent power consumption**, what is the expected power consumption for the week ahead? This requires that a predictive model forecast the total active power for each day over the next seven days.   

Technically, this framing of the problem is referred to as a multi-step time series forecasting problem, given the multiple forecast steps. A model that makes use of multiple input variables may be referred to as a multivariate multi-step time series forecasting model.

It would be useful to **downsample the per-minute observations** of power consumption to **daily totals**. This is not required, but makes sense, given that we are interested in total power per day. We can achieve this easily using the resample() function on the Pandas DataFrame. Calling this function with the argument ‘D’ allows the loaded data indexed by date-time to be grouped by day (see all offset aliases). We can then calculate the sum of all observations for each day and create a new dataset of daily power consumption data for each of the eight variables. 

In [76]:
# resample minute data to total for each day for the power usage dataset

# load the new file
dataset = pd.read_csv('./Downloads/data/household_power_consumption.csv', header=0,
                      infer_datetime_format=True, parse_dates=['datetime'],
                      index_col=['datetime'])

dataset.head()

| datetime | Global_active_power | Global_reactive_power | Voltage | Global_intensity | Sub_metering_1 | Sub_metering_2 | Sub_metering_3 | sub_metering_4 |
|---|---|---|---|---|---|---|---|---|
| 2006-12-16 17:24:00 | 4.216 | 0.418 | 234.84 | 18.4 | 0.0 | 1.0 | 17.0 | 52.26667 |
| 2006-12-16 17:25:00 | 5.36 | 0.436 | 233.63 | 23.0 | 0.0 | 1.0 | 16.0 | 72.333336 |
| 2006-12-16 17:26:00 | 5.374 | 0.498 | 233.29 | 23.0 | 0.0 | 2.0 | 17.0 | 70.566666 |
| 2006-12-16 17:27:00 | 5.388 | 0.502 | 233.74 | 23.0 | 0.0 | 1.0 | 17.0 | 71.8 |
| 2006-12-16 17:28:00 | 3.666 | 0.528 | 235.68 | 15.8 | 0.0 | 1.0 | 17.0 | 43.1 |


In [77]:
# resample data to daily
daily_groups = dataset.resample('D')
daily_data = daily_groups.sum()
# summarize
print(daily_data.shape)
print(daily_data.head())
# save 

daily_data.to_csv('./Downloads/data/household_power_consumption_days.csv')
daily_data.head()

(1442, 8)
            Global_active_power  Global_reactive_power    Voltage  \
datetime                                                            
2006-12-16             1209.176                 34.922   93552.53   
2006-12-17             3390.460                226.006  345725.32   
2006-12-18             2203.826                161.792  347373.64   
2006-12-19             1666.194                150.942  348479.01   
2006-12-20             2225.748                160.998  348923.61   

            Global_intensity  Sub_metering_1  Sub_metering_2  Sub_metering_3  \
datetime                                                                       
2006-12-16            5180.8             0.0           546.0          4926.0   
2006-12-17           14398.6          2033.0          4187.0         13341.0   
2006-12-18            9247.2          1063.0          2621.0         14018.0   
2006-12-19            7094.0           839.0          7602.0          6197.0   
2006-12-20            9313

| datetime | Global_active_power | Global_reactive_power | Voltage | Global_intensity | Sub_metering_1 | Sub_metering_2 | Sub_metering_3 | sub_metering_4 |
|---|---|---|---|---|---|---|---|---|
| 2006-12-16 | 1209.176 | 34.922 | 93552.53 | 5180.8 | 0.0 | 546.0 | 4926.0 | 14680.933319 |
| 2006-12-17 | 3390.46 | 226.006 | 345725.32 | 14398.6 | 2033.0 | 4187.0 | 13341.0 | 36946.666732 |
| 2006-12-18 | 2203.826 | 161.792 | 347373.64 | 9247.2 | 1063.0 | 2621.0 | 14018.0 | 19028.433281 |
| 2006-12-19 | 1666.194 | 150.942 | 348479.01 | 7094.0 | 839.0 | 7602.0 | 6197.0 | 13131.900043 |
| 2006-12-20 | 2225.748 | 160.998 | 348923.61 | 9313.0 | 0.0 | 2648.0 | 14063.0 | 20384.800011 |
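
To see what `resample('D')` followed by `sum()` does, the following toy sketch groups four per-minute rows that straddle midnight. It also shows, purely as an assumption about what a reader might prefer, how `agg()` could apply a mean to `Voltage` while still summing power: the notebook itself sums every column, which is why the daily `Voltage` figures are in the hundreds of thousands. The toy values are invented for illustration.

```python
import pandas as pd

# Four per-minute observations: two before midnight, two after.
index = pd.date_range("2006-12-16 23:58:00", periods=4, freq="min")
toy = pd.DataFrame(
    {
        "Global_active_power": [1.0, 2.0, 3.0, 4.0],
        "Voltage": [240.0, 242.0, 238.0, 240.0],
    },
    index=index,
)

# Same approach as the notebook: sum every column per calendar day.
daily_sum = toy.resample("D").sum()

# Hypothetical variant: sum the power but average the voltage.
daily_mixed = toy.resample("D").agg({"Global_active_power": "sum", "Voltage": "mean"})

print(daily_sum)
print(daily_mixed)
```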


### Select evaluation metric

A forecast will be comprised of seven values, one for each day of the week ahead. It is common with multi-step forecasting problems to evaluate each forecasted time step separately. This is helpful for a few reasons:   
* To comment on the skill at a specific lead time (e.g. +1 day vs +3 days).
* To contrast models based on their skill at different lead times (e.g. models good at +1 day vs models good at +5 days).

The units of the total power are kilowatts, and it would be useful to have an error metric in the same units. Both Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) fit this bill, although RMSE is more commonly used and will be adopted in this tutorial. Because the errors are squared before averaging, RMSE penalizes large forecast errors more heavily than MAE does. The performance metric for this problem will be the RMSE for each lead time from day 1 to day 7.

The function evaluate_forecasts() below implements this behavior and returns the performance of a model based on multiple seven-day forecasts: one overall RMSE and a list of per-day RMSE scores.

In [80]:
from math import sqrt
from sklearn.metrics import mean_squared_error

# evaluate one or more weekly forecasts against expected values
def evaluate_forecasts(actual, predicted):
    scores = list()
    # calculate an RMSE score for each day
    for i in range(actual.shape[1]):
        # calculate mse
        mse = mean_squared_error(actual[:, i], predicted[:, i])
        # calculate rmse
        rmse = sqrt(mse)
        # store
        scores.append(rmse)
    # calculate overall RMSE
    s = 0
    for row in range(actual.shape[0]):
        for col in range(actual.shape[1]):
            s += (actual[row, col] - predicted[row, col])**2
    score = sqrt(s / (actual.shape[0] * actual.shape[1]))
    return score, scores
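
As a quick sanity check of the metric, here is a minimal NumPy-only sketch that computes the same two quantities (overall RMSE and per-lead-time RMSE) on a tiny invented example. The function name `rmse_per_day_and_overall` is hypothetical, and the toy arrays use three lead times rather than seven to keep the numbers easy to follow.

```python
import numpy as np

def rmse_per_day_and_overall(actual, predicted):
    """Return (overall RMSE, per-column RMSE) for arrays of shape [weeks, days]."""
    errors = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    per_day = np.sqrt(np.mean(errors ** 2, axis=0))  # one score per lead time
    overall = np.sqrt(np.mean(errors ** 2))          # all forecasts pooled
    return overall, per_day

# Two "weeks" of forecasts with three lead times each (toy data).
actual = np.array([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]])
predicted = np.array([[1.0, 2.0, 5.0], [2.0, 3.0, 4.0]])

overall, per_day = rmse_per_day_and_overall(actual, predicted)
print(overall, per_day)
```

Only one forecast is wrong, by 2 units at the third lead time, so the per-day scores are zero for the first two lead times and sqrt(4 / 2) for the third, while the overall score is sqrt(4 / 6).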


We will use the first three years of data for training predictive models and the final year for evaluating models. The data in a given dataset will be divided into standard weeks. These are weeks that begin on a Sunday and end on a Saturday.   

We will split the data into standard weeks, working backwards from the test dataset. The final year of the data is 2010, and the first Sunday of 2010 was January 3rd. The data ends in mid-November 2010, and the closest final Saturday in the data is November 20th. This gives 46 weeks of test data.
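
The calendar claims above can be checked with the standard library. This short sketch confirms the weekdays of the two boundary dates and the number of whole weeks between them; the variable names are illustrative.

```python
from datetime import date

first_test_day = date(2010, 1, 3)
last_test_day = date(2010, 11, 20)

# date.weekday() returns 0 for Monday through 6 for Sunday.
starts_on_sunday = first_test_day.weekday() == 6
ends_on_saturday = last_test_day.weekday() == 5

# Inclusive day count between the two dates, and the number of whole weeks.
n_days = (last_test_day - first_test_day).days + 1
n_weeks = n_days // 7

print(starts_on_sunday, ends_on_saturday, n_days, n_weeks)
```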

In [83]:

from numpy import array, split

# split the daily dataset into train/test sets
def split_dataset(data):
    # split into standard weeks
    train, test = data[1:-328], data[-328:-6]
    # restructure into windows of weekly data
    train = array(split(train, len(train) // 7))
    test = array(split(test, len(test) // 7))
    return train, test

# load the daily file
dataset = pd.read_csv('./Downloads/data/household_power_consumption_days.csv', header=0,
                      infer_datetime_format=True, parse_dates=['datetime'],
                      index_col=['datetime'])
train, test = split_dataset(dataset.values)
# validate train data
print(train.shape)
print(train[0, 0, 0], train[-1, -1, 0])
# validate test
print(test.shape)
print(test[0, 0, 0], test[-1, -1, 0])

(159, 7, 8)
3390.46 1309.2679999999998
(46, 7, 8)
2083.4539999999984 2197.006000000004
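
The slice offsets `1`, `-328`, and `-6` are easy to misread, so here is a hedged sketch that applies the same slices to a plain range of dates instead of the data. It assumes the daily file has 1442 rows starting on 2006-12-16, as the printed shape and first row earlier in this section indicate; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# One entry per daily row: 1442 consecutive days starting 2006-12-16.
days = pd.date_range("2006-12-16", periods=1442, freq="D")

# Same slices as split_dataset(): drop the first (partial) day,
# hold out the last 328 days, and discard the final 6 days.
train_days, test_days = days[1:-328], days[-328:-6]

# Group into standard Sunday-to-Saturday weeks.
train_weeks = np.array(np.split(train_days.to_numpy(), len(train_days) // 7))
test_weeks = np.array(np.split(test_days.to_numpy(), len(test_days) // 7))

print(train_days[0].date(), train_days[-1].date(), train_weeks.shape)
print(test_days[0].date(), test_days[-1].date(), test_weeks.shape)
```

Under these assumptions the training span runs from Sunday 2006-12-17 to Saturday 2010-01-02 (159 weeks) and the test span from Sunday 2010-01-03 to Saturday 2010-11-20 (46 weeks), which matches the shapes printed above.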
