# DRED Dataset EDA and Preprocessing

## EDA

Let's perform some EDA to get an idea of this dataset.

In [1]:
import numpy as np
import pandas as pd
import datetime
import seaborn as sns
import matplotlib.pyplot as plt

In [3]:
dataset_folder = "../../dataset/raw/DRED/"
output_folder = "../../dataset/interim/"
dred_file = "DRED_Aggregated_data.csv"
dred_appliance_file = "DRED_Appliance_data.csv"

# Actua

In [3]:
df_dred = pd.read_csv(dataset_folder + dred_file)

In [4]:
df_dred.head(4)

Unnamed: 0.1,Unnamed: 0,1
0,,mains
1,,
2,2015-07-05 00:00:00+02:00,
3,2015-07-05 00:00:01+02:00,


In [5]:
df_dred.shape

(13302001, 2)

In [6]:
df_dred.dtypes

Unnamed: 0    object
1             object
dtype: object

Let's do a quick cleanup of the columns and column names, then convert the `timestamp` column from `object` to a datetime dtype. The format string follows `strftime()` conventions and matches the timestamps shown in the `head()` output above.

In [7]:
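# Drop the first two rows, which hold header residue rather than readings.
# .copy() avoids SettingWithCopyWarning on the column assignments below.
df_dred = df_dred[2:].copy()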

In [8]:
df_dred.columns = ['timestamp', 'Mains']

In [9]:
# Strip the trailing six-character UTC offset (e.g. "+02:00")
df_dred.timestamp = df_dred.timestamp.str.slice(stop=-6)

In [10]:
df_dred.head(4)

Unnamed: 0,timestamp,Mains
2,2015-07-05 00:00:00,
3,2015-07-05 00:00:01,
4,2015-07-05 00:00:02,
5,2015-07-05 00:00:03,


In [11]:
# The timestamps use dashes (2015-07-05), so the format must use dashes too
df_dred['timestamp'] = pd.to_datetime(df_dred['timestamp'], format="%Y-%m-%d %H:%M:%S")

In [12]:
df_dred.timestamp.min()

Timestamp('2015-07-05 00:00:00')

In [13]:
df_dred.timestamp.max()

Timestamp('2015-12-05 22:59:58')
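
Slicing off the last six characters discards the UTC offset. As an alternative sketch (not what this notebook does), the offset could be parsed instead of dropped. The sample strings below are made up for illustration, including the assumed `+01:00` winter offset, and `utc=True` is used so that mixed offsets end up on a single timeline.

```python
import pandas as pd

# Hypothetical sample in the same shape as the raw timestamp column.
raw = pd.Series([
    "2015-07-05 00:00:00+02:00",
    "2015-07-05 00:00:01+02:00",
    "2015-11-01 00:00:00+01:00",  # assumed winter-time offset
])

# Parse with the offset kept, normalising everything to UTC.
parsed_utc = pd.to_datetime(raw, utc=True)

# Drop the timezone afterwards if naive timestamps are needed downstream.
naive_utc = parsed_utc.dt.tz_localize(None)
print(naive_utc)
```

Note that the resulting naive timestamps are in UTC rather than local time, so any date ranges chosen later would need to be interpreted accordingly.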

##### Missing Data

Check the count and percentage of missing data.

In [14]:
df_dred.isna().sum()

timestamp        0
Mains        39605
dtype: int64

In [15]:
(df_dred.isna().sum() / df_dred.isna().count()) * 100

timestamp    0.000000
Mains        0.297737
dtype: float64

Great, only a tiny fraction is missing. Let's look at the `.diff()` of the timestamps where `Mains` is missing. If almost every difference is 1 second, with a single large jump, then the missing values form just two contiguous blocks. We also saw earlier that the missing values sit right at the start and the end of the dataset.

In [16]:
df_dred.timestamp[df_dred.Mains.isna()].diff().value_counts()

0 days 00:00:01      39603
153 days 11:59:55        1
Name: timestamp, dtype: int64
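
The same reasoning can be made explicit by labelling each contiguous run of missing values. This is a minimal sketch on a made-up toy frame (the `toy`, `run_id` and `na_runs` names are illustrative, not part of the notebook): a new run starts whenever the missing/not-missing flag changes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series: NaNs at the start and end, values in the middle.
toy = pd.DataFrame({
    "timestamp": pd.date_range("2015-07-05", periods=8, freq="1s"),
    "Mains": [np.nan, np.nan, 310.0, 305.0, 298.0, 301.0, np.nan, np.nan],
})

is_na = toy["Mains"].isna()
# A new run starts whenever the missing/not-missing flag changes.
run_id = (is_na != is_na.shift()).cumsum()

# Summarise only the missing runs: first/last timestamp and length.
na_runs = (
    toy[is_na]
    .groupby(run_id[is_na])["timestamp"]
    .agg(["min", "max", "count"])
)
print(na_runs)
```

On the real data, two rows in such a summary would confirm that the gaps are confined to two blocks.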

In [17]:
check_na = df_dred.set_index('timestamp').isna()

In [18]:
check_na.Mains = check_na.Mains.astype(int)

In [19]:
# check_na.plot()

Great. Let's just remove them, since none of them fall in the middle of the dataset.

In [20]:
df_dred = df_dred.dropna().reset_index(drop=True)

In [21]:
len(df_dred)

13262394

##### Average and Resample

In [22]:
df_dred = df_dred.set_index('timestamp').resample('1s', origin='start').asfreq().reset_index()

In [23]:
len(df_dred)

13262394
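
The length is unchanged, which suggests the series already had one row per second. To illustrate why, here is a small sketch on made-up data: `resample('1s').asfreq()` inserts a `NaN` row for every missing second, so a gap would show up as a longer frame.

```python
import pandas as pd

# Hypothetical toy data with one missing second (00:00:02).
toy = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2015-07-05 00:00:00",
        "2015-07-05 00:00:01",
        "2015-07-05 00:00:03",
    ]),
    "Mains": [310.0, 305.0, 298.0],
})

regular = toy.set_index("timestamp").resample("1s").asfreq().reset_index()

# One row is inserted for the missing second, with NaN in Mains.
print(len(toy), len(regular))
print(regular["Mains"].isna().sum())
```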

##### Convert to `kwh`

In [24]:
df_dred.dtypes

timestamp    datetime64[ns]
Mains                object
dtype: object

In [25]:
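# Mains is still an object column, so cast to numeric before scaling by 1/1000.
# Note: if Mains is in watts, this yields kW (power) rather than kWh (energy);
# the column name `kwh` is kept as in the original.
df_dred['kwh'] = pd.to_numeric(df_dred.Mains) / 1000
df_dred = df_dred.drop(columns='Mains')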

##### Pull date range

In [26]:
train_range = pd.date_range(start = '2015-09-01', end = '2015-10-30', freq = '1D')
test_range = pd.date_range(start = '2015-10-31', end = '2015-11-05', freq = '1D')
total_range = pd.date_range(start = '2015-09-01', end = '2015-11-05', freq = '1D')

In [27]:
df_dred['date'] = df_dred.timestamp.dt.normalize()

In [28]:
len(df_dred[df_dred.date.isin(train_range)])

5184000

In [29]:
len(df_dred[df_dred.date.isin(test_range)])

518400
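
These counts can be sanity-checked against the date ranges: with one sample per second, each full day contributes 86,400 rows. A quick sketch of the arithmetic (`SECONDS_PER_DAY` and the `expected_*` names are introduced here for illustration):

```python
import pandas as pd

SECONDS_PER_DAY = 24 * 60 * 60

train_range = pd.date_range(start="2015-09-01", end="2015-10-30", freq="1D")
test_range = pd.date_range(start="2015-10-31", end="2015-11-05", freq="1D")

# Full days in each range times samples per day.
expected_train_rows = len(train_range) * SECONDS_PER_DAY
expected_test_rows = len(test_range) * SECONDS_PER_DAY
print(expected_train_rows, expected_test_rows)
```

Both values match the lengths above, so every day in the two ranges appears to be complete.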

In [30]:
df_dred = df_dred[df_dred.date.isin(total_range)]

Coincidentally, this is the same number of rows as our `cern_train_v2`, so we can use it.

At this point we need to work out `wide_freq`. I say we just build a few and see how they compare. With one sample per second:
* 1 minute wide = 60 wide
* 15 minute wide = 15 * 60 = 900 wide
* 30 minute wide = 30 * 60 = 1800 wide
* 60 minute wide = 60 * 60 = 3600 wide
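
The widths above can be tried out with a small reshaping helper. This is only a sketch: it assumes `wide_freq` means the number of one-second samples per row, and `to_wide` is a hypothetical helper, not a function from this project.

```python
import numpy as np

def to_wide(values, width):
    """Reshape a 1-D array into rows of `width` samples, dropping any remainder."""
    values = np.asarray(values)
    n_rows = len(values) // width
    return values[: n_rows * width].reshape(n_rows, width)

# Candidate widths from the list above, in samples (one sample per second).
widths = {"1min": 60, "15min": 15 * 60, "30min": 30 * 60, "60min": 60 * 60}

# Stand-in for one day of the kwh column.
one_day = np.arange(24 * 60 * 60, dtype=float)

shapes = {name: to_wide(one_day, w).shape for name, w in widths.items()}
print(shapes)
```

Since every candidate width divides 86,400 evenly, no samples are dropped for whole days of data.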

##### Save

In [33]:
df_train_ = df_dred[df_dred.date.isin(train_range)]
df_test_ = df_dred[df_dred.date.isin(test_range)]

df_train_ = df_train_.drop(columns='date')
df_test_ = df_test_.drop(columns='date')

# df_train_.to_csv(output_folder + 'dred_train_.csv', index = False)
# df_test_.to_csv(output_folder + 'dred_test_.csv', index = False)