# EDA and Preprocessing, UMASS

Author: Scott Yip \
Date: 28 June 2022

The abstract and information for this dataset can be accessed [here](https://traces.cs.umass.edu/index.php/smart/smart). In particular, we use the apartment dataset, which contains the energy consumption data for 114 single-family apartments over the period 2014-2016.

A dated paper by Barker et al. (2012) explains the original purpose of the data and the method used to collect it. The paper can be accessed [here](https://lass.cs.umass.edu/papers/pdf/sustkdd12-smart.pdf). However, the paper describes only the original house data (which we do not utilise), not the apartment data.

It is worth noting that the power data collected is in Watts (not Kilowatts).
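
As a rough illustration of why the unit matters, the sketch below converts power readings in Watts to energy in kWh. The values are made up, and it assumes each reading is the average power over a 15-minute interval, which is an assumption for illustration rather than something the dataset documentation confirms.

```python
import pandas as pd

# Hypothetical readings: average power in Watts over 15-minute intervals.
power_w = pd.Series(
    [400.0, 1200.0, 800.0, 0.0],
    index=pd.date_range("2015-09-01 00:00", periods=4, freq="15min"),
)

interval_hours = 15 / 60  # assumed sampling interval, in hours

# Energy (kWh) = power (W) * duration (h) / 1000
energy_kwh = power_w * interval_hours / 1000

# Total energy over the hour, in kWh
total_kwh = energy_kwh.sum()
print(total_kwh)
```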


## 1. Intro and baseline processing

Let's perform some very quick EDA to get an idea of this dataset.

Nb: `dask` is imported alongside `pandas` in case we need to avoid out-of-memory errors. `dask` allows on-disk processing for larger-than-memory computations (which may occur). The cells below load the data with `pandas`.

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import datetime
import seaborn as sns
import matplotlib.pyplot as plt
import os
import dask.dataframe as dd

In [2]:
# Set folder parameters
dataset_folder = "../../dataset/raw/UMASS_apartment/"
output_folder = "../../dataset/interim/"

In [3]:
# Collect the csv file names once, then build their full paths
names = [i for i in os.listdir(dataset_folder) if i.endswith('csv')]
all_files = [dataset_folder + i for i in names]

print('Number of csv files: {}'.format(len(all_files)))

Number of csv files: 100


In [4]:
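# Read each file and tag its rows with the source file name.
# NB: the column is named 'kwh', although the readings are in Watts per the note above.
frames = []

for f, n in zip(all_files, names):
    df_ = pd.read_csv(f, header=None, names=['timestamp', 'kwh'])
    df_['house'] = n
    frames.append(df_)

# Concatenate once rather than growing the frame inside the loop
df = pd.concat(frames)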

In [5]:
df.timestamp = pd.to_datetime(df.timestamp, format = '%Y-%m-%d %H:%M:%S')

Check for missing values:

In [6]:
df.kwh.isna().sum()

0

Check the sampling frequency:

In [7]:
# df.groupby('house').diff(periods = 1).timestamp.value_counts()
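
The frequency check can be sketched on toy data as follows. The frame and its values are hypothetical; the idea is to take the time difference between consecutive readings within each house and count how often each interval occurs.

```python
import pandas as pd

# Toy data (hypothetical): house A is sampled every 15 minutes, house B has one gap.
toy = pd.DataFrame({
    "house": ["A"] * 4 + ["B"] * 3,
    "timestamp": pd.to_datetime([
        "2015-09-01 00:00", "2015-09-01 00:15", "2015-09-01 00:30", "2015-09-01 00:45",
        "2015-09-01 00:00", "2015-09-01 00:15", "2015-09-01 00:45",
    ]),
})

# Time difference between consecutive readings within each house.
deltas = (
    toy.sort_values(["house", "timestamp"])
    .groupby("house")["timestamp"]
    .diff()
)

# Count each interval; the NaT on the first row of every house is dropped.
interval_counts = deltas.value_counts()
print(interval_counts)
```

A single dominant interval suggests a regular sampling rate; any other interval points to gaps, duplicates, or a change in frequency.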

Average duplicate readings (same timestamp and house):

In [8]:
df = df.groupby(['timestamp', 'house']).mean().reset_index()
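
To make the effect of this step concrete, here is a minimal sketch on made-up data: two readings share a timestamp and house, and grouping on both keys collapses them to their mean.

```python
import pandas as pd

# Toy data (hypothetical): the first two rows are duplicates of the same reading slot.
toy = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2015-09-01 00:00", "2015-09-01 00:00", "2015-09-01 00:15"]
    ),
    "house": ["A", "A", "A"],
    "kwh": [100.0, 300.0, 50.0],
})

# Number of rows that repeat an earlier (timestamp, house) pair.
n_dupes = toy.duplicated(subset=["timestamp", "house"]).sum()

# Collapse duplicates to their mean, leaving one row per (timestamp, house).
deduped = toy.groupby(["timestamp", "house"], as_index=False)["kwh"].mean()
print(n_dupes)
print(deduped)
```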

In [9]:
# df.groupby('house').diff(periods = 1).timestamp.value_counts()

Find the frequency change:

In [10]:
# change_indexer = df.diff(periods = 1).timestamp == df.diff(periods = 1).timestamp.value_counts().index[-1]
# change_indexer = change_indexer[change_indexer].index.values[0]

In [11]:
# df.loc[(change_indexer - 3):(change_indexer + 3)]

The sampling frequency appears to change on December 15, so we use the data from September 1 to October 30.
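
One way to locate such a change, sketched here on a made-up single-house series, is to flag rows whose interval differs from the preceding interval. The timestamps and the switch from 1-minute to 15-minute readings are purely illustrative and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical series: 1-minute readings that switch to 15-minute readings.
toy = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2000-01-01 23:58", "2000-01-01 23:59", "2000-01-02 00:00",
        "2000-01-02 00:15", "2000-01-02 00:30",
    ]),
})

# Interval between each reading and the previous one.
step = toy["timestamp"].diff()
previous_step = step.shift()

# A change point is a row whose interval differs from the preceding interval
# (rows without two known intervals are ignored).
changed = step.ne(previous_step) & step.notna() & previous_step.notna()
change_rows = toy.loc[changed, "timestamp"]
print(change_rows)
```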

In [12]:
df['date'] = df.timestamp.dt.normalize()

In [13]:
keep_date_range = pd.date_range(start = '2015-09-01', end = '2015-10-30', freq = '1D')

In [14]:
def check_all_dates_in_range(subset_df):
    return all(keep_date_range.isin(subset_df.date))

In [15]:
# check to ensure all dates present as required
all_date_consec_checker = df.groupby('house').apply(check_all_dates_in_range)

In [16]:
all_date_consec_checker.sum()

100

In [17]:
len(all_date_consec_checker)

100
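
For reference, the same coverage check can be written without `apply` by counting the distinct required dates per house. This is a sketch on toy data, not the check used above; note that a house with none of the required dates would simply be absent from the result.

```python
import pandas as pd

# Hypothetical required window and toy data: house B is missing 2015-09-02.
required = pd.date_range("2015-09-01", "2015-09-03", freq="D")

toy = pd.DataFrame({
    "house": ["A", "A", "A", "B", "B"],
    "date": pd.to_datetime([
        "2015-09-01", "2015-09-02", "2015-09-03",
        "2015-09-01", "2015-09-03",
    ]),
})

# Count the distinct required dates each house has, then compare to the window length.
in_window = toy[toy["date"].isin(required)]
coverage = in_window.groupby("house")["date"].nunique() == len(required)

# Houses that lack at least one required date.
incomplete = coverage[~coverage].index.tolist()
print(incomplete)
```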

In [18]:
# pull dates
df = df[(df.date.isin(keep_date_range))]

Resample to a regular 15-minute grid to ensure no readings are missing:

In [19]:
# '15min' is equivalent to the '15T' alias, which newer pandas versions deprecate
df = df.set_index('timestamp').groupby('house').\
    resample('15min', origin='start').asfreq().reset_index('timestamp').reset_index(drop=True)

In [20]:
df.kwh.isna().sum()

0
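
The reason resampling with `asfreq()` exposes gaps can be seen in a small single-house sketch (made-up values): any slot on the regular 15-minute grid without a reading appears as a `NaN` row, which the missing-value count then picks up.

```python
import pandas as pd

# Toy series (hypothetical): the 00:30 reading is absent.
toy = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2015-09-01 00:00", "2015-09-01 00:15", "2015-09-01 00:45"]
    ),
    "kwh": [100.0, 120.0, 90.0],
}).set_index("timestamp")

# Reindex onto a regular 15-minute grid without filling or aggregating.
regular = toy.resample("15min").asfreq()

# Missing slots show up as NaN.
n_missing = regular["kwh"].isna().sum()
print(regular)
print(n_missing)
```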

Normalize:

**If you decide to normalise here, save the outputs as `umass_train` and `umass_test`; otherwise, append `_unnormal` to the file names.**

In [21]:
# def normalize_daily_load_profiles(subset_df):
    
#     subset_df.kwh = (subset_df.kwh - subset_df.kwh.min()) / (subset_df.kwh.max() - subset_df.kwh.min())
    
#     return subset_df

In [22]:
# df = df.groupby(['house', 'date']).apply(normalize_daily_load_profiles)
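
If normalisation is enabled, the per-house, per-day min-max scaling can also be written with `transform`, as in this sketch on toy data. The column `kwh_norm` and the convention of mapping flat profiles (where max equals min) to 0.0 are assumptions for illustration; the commented-out function above does not handle that case.

```python
import pandas as pd

# Toy data (hypothetical): house B has a flat profile on this day.
toy = pd.DataFrame({
    "house": ["A", "A", "A", "B", "B"],
    "date": pd.to_datetime(["2015-09-01"] * 5),
    "kwh": [100.0, 200.0, 300.0, 50.0, 50.0],
})

grouped = toy.groupby(["house", "date"])["kwh"]
daily_min = grouped.transform("min")
daily_range = grouped.transform("max") - daily_min

# Min-max scale each daily profile to [0, 1]. A zero range is masked to NaN
# to avoid dividing by zero, and the resulting NaN is mapped to 0.0 (assumed convention).
toy["kwh_norm"] = ((toy["kwh"] - daily_min) / daily_range.where(daily_range > 0)).fillna(0.0)
print(toy)
```

Note that `fillna(0.0)` would also overwrite genuinely missing readings, so this sketch presumes the missing-value check has already passed.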

Drop superfluous columns:

In [23]:
df = df.drop('date', axis = 1)

Hold out 10% of houses for testing and keep the remaining 90% for training:

In [24]:
train_k = int(len(all_files) * .9)
houses_train = np.random.choice(df.house.unique(), train_k, replace = False)
houses_test = np.setdiff1d(df.house.unique(), houses_train)

In [25]:
houses_test

array(['Apt12_2015.csv', 'Apt14_2015.csv', 'Apt29_2015.csv',
       'Apt2_2015.csv', 'Apt34_2015.csv', 'Apt49_2015.csv',
       'Apt65_2015.csv', 'Apt74_2015.csv', 'Apt89_2015.csv',
       'Apt8_2015.csv'], dtype=object)
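
Because the split above is drawn without a fixed seed, rerunning the notebook will generally select different test houses. A minimal sketch of a reproducible variant, assuming a seeded NumPy generator is acceptable, is shown below; the seed value and the ten file names are arbitrary and hypothetical.

```python
import numpy as np

# Hypothetical house identifiers following the file-name pattern seen above.
houses = np.array([f"Apt{i}_2015.csv" for i in range(1, 11)])

rng = np.random.default_rng(seed=0)  # arbitrary seed, chosen only for illustration

# Sample 90% of houses for training without replacement; the rest form the test set.
train_k = int(len(houses) * 0.9)
houses_train = rng.choice(houses, size=train_k, replace=False)
houses_test = np.setdiff1d(houses, houses_train)

print(houses_test)
```

With the same seed and the same input order, the selection is identical on every run.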

Save

In [26]:
df_train = df[df.house.isin(houses_train)]
df_test = df[df.house.isin(houses_test)]

In [27]:
df_train.to_csv(output_folder + 'umass_train_unnormal.csv', index=False)
df_test.to_csv(output_folder + 'umass_test_unnormal.csv', index=False)