# Electricity Demand in Victoria, Australia 

In this notebook we will prepare and store the electricity demand dataset found [here](https://github.com/tidyverts/tsibbledata/tree/master/data-raw/vic_elec/VIC2015).

**Citation:**

Godahewa, Rakshitha, Bergmeir, Christoph, Webb, Geoff, Hyndman, Rob, & Montero-Manso, Pablo. (2021). Australian Electricity Demand Dataset (Version 1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.4659727

**Description of data:**

Data set description from [here](https://rdrr.io/cran/tsibbledata/man/vic_elec.html): "This data is for operational demand, which is the demand met by local scheduled generating units, semi-scheduled generating units, and non-scheduled intermittent generating units of aggregate capacity larger than 30 MW, and by generation imports to the region. The operational demand excludes the demand met by non-scheduled non-intermittent generating units, non-scheduled intermittent generating units of aggregate capacity smaller than 30 MW, exempt generation (e.g. rooftop solar, gas tri-generation, very small wind farms, etc), and demand of local scheduled loads. It also excludes some very large industrial users (such as mines or smelters)."

The dataset has a 30-minute granularity and runs from 1 January 2002 to 28 February 2015.

## Download the data with pandas

In [20]:
import pandas as pd
import numpy as np

In [34]:
# Electricity demand.
url = "https://raw.githubusercontent.com/tidyverts/tsibbledata/master/data-raw/vic_elec/VIC2015/demand.csv"
demand = pd.read_csv(url)

# Temperature of Melbourne (BOM site 086071).
url = "https://raw.githubusercontent.com/tidyverts/tsibbledata/master/data-raw/vic_elec/VIC2015/temperature.csv"
temp = pd.read_csv(url)

df = demand.merge(temp, on=['Date','Period'], how='left')

In [35]:
df.head()

Unnamed: 0,Date,Period,OperationalLessIndustrial,Industrial,Temp
0,37257,1,3535.867064,1086.132936,32.6
1,37257,2,3383.499028,1088.500972,32.6
2,37257,3,3655.527552,1084.472448,32.6
3,37257,4,3510.446636,1085.553364,32.6
4,37257,5,3294.697156,1081.302844,32.6
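The left join above keeps every demand row, so a demand row without a matching temperature ends up with a null `Temp`. A minimal sketch of how such a join could be checked is shown below. The toy frames and their values are made up for illustration and are not taken from the dataset; `validate` and `indicator` are standard arguments of `DataFrame.merge`.

```python
import pandas as pd

# Toy stand-ins for the demand and temperature tables (values are made up).
demand_toy = pd.DataFrame({
    "Date": [37257, 37257, 37258],
    "Period": [1, 2, 1],
    "OperationalLessIndustrial": [3500.0, 3400.0, 3600.0],
})
temp_toy = pd.DataFrame({
    "Date": [37257, 37257],
    "Period": [1, 2],
    "Temp": [32.6, 32.6],
})

# validate="one_to_one" raises a MergeError if either side has duplicate keys.
# indicator=True adds a `_merge` column recording where each row came from.
merged = demand_toy.merge(
    temp_toy,
    on=["Date", "Period"],
    how="left",
    validate="one_to_one",
    indicator=True,
)

# Rows present in the demand table only, i.e. with no temperature reading.
unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), len(unmatched))
```

Whether the real tables satisfy the one-to-one assumption is not verified in this notebook, so treat the check as optional.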


In [36]:
# Public holidays in Australia
url = "https://raw.githubusercontent.com/tidyverts/tsibbledata/master/data-raw/vic_elec/VIC2015/holidays.txt"

# We will convert this into a dummy variable later
# with 1 if public holiday else 0.
# We will join this with the rest of the dataset later.
holidays = pd.read_csv(url, header=None, parse_dates=[0])
holidays.columns = ['date']

In [37]:
holidays.head()

Unnamed: 0,date
0,2000-01-01
1,2000-01-26
2,2000-03-13
3,2000-04-21
4,2000-04-24


We will only use the `OperationalLessIndustrial` demand, so let's drop `Industrial`.

In [38]:
df.drop(columns=['Industrial'], inplace=True)

The dates are integers representing the number of days since an origin date. The origin date for this dataset is "1899-12-30", as determined from [here](https://github.com/tidyverts/tsibbledata/blob/master/data-raw/vic_elec/vic_elec.R) and [here](https://robjhyndman.com/hyndsight/electrictsibbles/). The `Period` integers refer to 30 minute intervals in a 24 hour day, so there are 48 periods for each day, numbered from 1.
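As a worked example of this encoding, the sketch below converts a single `Date` and `Period` pair by hand. The helper name `to_timestamp` is hypothetical and exists only for illustration; the values 37257, 1 and 48 are the first day and the first and last periods of a day in the data.

```python
import pandas as pd

# Origin date stated for this dataset.
ORIGIN = pd.Timestamp("1899-12-30")


def to_timestamp(date_int: int, period: int) -> pd.Timestamp:
    """Convert a day count and a 1-based half-hour period to a timestamp."""
    day = ORIGIN + pd.Timedelta(days=date_int)
    # Period 1 starts at 00:00, period 2 at 00:30, ..., period 48 at 23:30.
    return day + pd.Timedelta(minutes=30 * (period - 1))


first = to_timestamp(37257, 1)
last = to_timestamp(37257, 48)
print(first)  # 2002-01-01 00:00:00
print(last)   # 2002-01-01 23:30:00
```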



Let's extract the date and date-time.

In [39]:
# Convert the integer Date (days since the origin) to a datetime.
df["date"] = pd.Timestamp("1899-12-30") + pd.to_timedelta(df["Date"], unit="D")

# Build the timestamp from the 1-based Period (30 minute intervals).
df["date_time"] = df["date"] + pd.to_timedelta((df["Period"] - 1) * 30, unit="min")

In [40]:
df.head()

Unnamed: 0,Date,Period,OperationalLessIndustrial,Temp,date,date_time
0,37257,1,3535.867064,32.6,2002-01-01,2002-01-01 00:00:00
1,37257,2,3383.499028,32.6,2002-01-01,2002-01-01 00:30:00
2,37257,3,3655.527552,32.6,2002-01-01,2002-01-01 01:00:00
3,37257,4,3510.446636,32.6,2002-01-01,2002-01-01 01:30:00
4,37257,5,3294.697156,32.6,2002-01-01,2002-01-01 02:00:00


Drop the rows that contain null values.

In [41]:
df.dropna(inplace=True)

Next, let's add the public holiday flag. After that, we check that the time interval between rows is still the same.

In [42]:
# `holidays` only contains dates of public holidays.
# So create a column `is_holiday` and set it to 1.
holidays['is_holiday'] = 1
df = df.merge(holidays, on=['date'], how='left')
# The left join will create NaNs for is_holiday when 
# it is not a public holiday. Use this to define when
# it is not a public holiday. Fill nulls with zero.
df['is_holiday'] = df['is_holiday'].fillna(0).astype(int)
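The same flag can also be built without a join. The sketch below uses `Series.isin` on toy data; the frames `toy` and `toy_holidays` and their dates are made up for illustration. It is an alternative to the merge, not the method used in this notebook.

```python
import pandas as pd

# Toy half-hourly rows: two on a holiday, one on a normal day (made-up dates).
toy = pd.DataFrame(
    {"date": pd.to_datetime(["2002-01-01", "2002-01-01", "2002-01-02"])}
)
toy_holidays = pd.DataFrame(
    {"date": pd.to_datetime(["2002-01-01", "2002-01-28"])}
)

# isin returns True where the date appears in the holiday list;
# casting to int gives the 1/0 dummy variable.
toy["is_holiday"] = toy["date"].isin(toy_holidays["date"]).astype(int)
print(toy["is_holiday"].tolist())  # [1, 1, 0]
```

One difference in design: if the holiday list contained a duplicated date, a left join would duplicate the matching rows, while `isin` leaves the row count unchanged.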

In [44]:
display(df.head()) # 1st of Jan is a public holiday
display(df.tail()) # 28th of Feb is not a public holiday

Unnamed: 0,Date,Period,OperationalLessIndustrial,Temp,date,date_time,is_holiday
0,37257,1,3535.867064,32.6,2002-01-01,2002-01-01 00:00:00,1
1,37257,2,3383.499028,32.6,2002-01-01,2002-01-01 00:30:00,1
2,37257,3,3655.527552,32.6,2002-01-01,2002-01-01 01:00:00,1
3,37257,4,3510.446636,32.6,2002-01-01,2002-01-01 01:30:00,1
4,37257,5,3294.697156,32.6,2002-01-01,2002-01-01 02:00:00,1


Unnamed: 0,Date,Period,OperationalLessIndustrial,Temp,date,date_time,is_holiday
230731,42063,44,4094.95736,18.9,2015-02-28,2015-02-28 21:30:00,0
230732,42063,45,4058.531582,18.9,2015-02-28,2015-02-28 22:00:00,0
230733,42063,46,4051.524334,18.9,2015-02-28,2015-02-28 22:30:00,0
230734,42063,47,4274.237836,18.9,2015-02-28,2015-02-28 23:00:00,0
230735,42063,48,4245.130916,18.9,2015-02-28,2015-02-28 23:30:00,0


In [45]:
df["date_time"].diff().value_counts()

0 days 00:30:00    230735
Name: date_time, dtype: int64
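A single value in the output above means every consecutive pair of rows is 30 minutes apart. If more than one value had appeared, the missing timestamps could be located by comparing against a complete range. A minimal sketch on toy data follows; the three timestamps are made up, with a deliberate gap at 01:00.

```python
import pandas as pd

# Toy timestamps with one half-hour slot missing (01:00).
stamps = pd.Series(pd.to_datetime([
    "2002-01-01 00:00",
    "2002-01-01 00:30",
    "2002-01-01 01:30",
]))

# Every slot we would expect between the first and last timestamp.
expected = pd.date_range(stamps.min(), stamps.max(), freq="30min")

# Slots in the expected range that do not appear in the data.
missing = expected.difference(pd.DatetimeIndex(stamps))
print(list(missing))
```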

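We now keep only the timestamp, the electricity demand, the temperature, and the holiday flag, and resample to hourly. Demand is summed over each hour, temperature is averaged, and the holiday flag takes the minimum.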

In [46]:
# Select the columns we need and rename them.
timeseries = df[["date_time", "OperationalLessIndustrial", "Temp", "is_holiday"]].rename(
    columns={"OperationalLessIndustrial": "demand", "Temp": "temperature"}
)

# Resample to hourly.
# Note: pandas 2.2+ deprecates the "H" alias in favour of "h".
timeseries = timeseries.set_index("date_time").resample("H").agg({
    "demand": "sum",
    "temperature": "mean",
    "is_holiday": "min",
})
timeseries.head()

date_time,demand,temperature,is_holiday
2002-01-01 00:00:00,6919.366092,32.6,1
2002-01-01 01:00:00,7165.974188,32.6,1
2002-01-01 02:00:00,6406.542994,32.6,1
2002-01-01 03:00:00,5815.537828,32.6,1
2002-01-01 04:00:00,5497.732922,32.6,1
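To make the aggregation concrete, the sketch below resamples four made-up half-hourly rows to two hourly rows, with one aggregation per column. The toy values are not from the dataset, and `"60min"` is used as the frequency string only because it is accepted across pandas versions.

```python
import pandas as pd

# Four half-hourly rows covering two hours (made-up values).
idx = pd.date_range("2002-01-01 00:00", periods=4, freq="30min")
toy = pd.DataFrame(
    {
        "demand": [10.0, 20.0, 30.0, 40.0],
        "temperature": [20.0, 22.0, 24.0, 26.0],
        "is_holiday": [1, 1, 1, 0],
    },
    index=idx,
)

# One aggregation per column: total demand, mean temperature,
# and the minimum of the holiday flag within each hour.
hourly = toy.resample("60min").agg(
    {"demand": "sum", "temperature": "mean", "is_holiday": "min"}
)
print(hourly)
```

Here the first hour has demand 30.0, temperature 21.0 and flag 1; the second has demand 70.0, temperature 25.0 and flag 0, because the minimum is 0 whenever any half-hour in the bin is not a holiday.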


Sanity check: the time intervals should still be uniform after resampling.

In [47]:
timeseries.reset_index()['date_time'].diff().value_counts()

0 days 01:00:00    115367
Name: date_time, dtype: int64

Check range.

In [48]:
print(timeseries.index.min())
print(timeseries.index.max())

2002-01-01 00:00:00
2015-02-28 23:00:00


Save the time series in the `Datasets` folder.

In [12]:
timeseries.to_csv("../Datasets/victoria_electricity_demand.csv")
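When the saved file is read back, the index has to be parsed as datetimes again, since CSV stores it as text. A minimal sketch of the round trip is shown below. It writes to an in-memory buffer instead of the real path so that it runs on its own, and the toy values are made up.

```python
import io

import pandas as pd

# Toy hourly series with a named datetime index, like `timeseries`.
idx = pd.date_range("2002-01-01", periods=3, freq="60min", name="date_time")
toy = pd.DataFrame({"demand": [1.0, 2.0, 3.0]}, index=idx)

# Write to an in-memory buffer in place of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# Restore the datetime index when loading.
restored = pd.read_csv(buffer, index_col="date_time", parse_dates=["date_time"])
print(type(restored.index).__name__)  # DatetimeIndex
```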