# Cleaning Energy Consumption Data
This Jupyter Notebook cleans and preprocesses energy consumption data from the London Kaggle dataset. Unlike Paul's data, the Kaggle data is recorded in 30-minute intervals.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

alldata = pd.read_csv("archive/halfhourly_dataset/halfhourly_dataset/block_0.csv")

In [2]:
print(alldata)

             LCLid                         tstp energy(kWh/hh)
0        MAC000002  2012-10-12 00:30:00.0000000             0 
1        MAC000002  2012-10-12 01:00:00.0000000             0 
2        MAC000002  2012-10-12 01:30:00.0000000             0 
3        MAC000002  2012-10-12 02:00:00.0000000             0 
4        MAC000002  2012-10-12 02:30:00.0000000             0 
...            ...                          ...            ...
1222665  MAC005492  2014-02-27 22:00:00.0000000         0.182 
1222666  MAC005492  2014-02-27 22:30:00.0000000         0.122 
1222667  MAC005492  2014-02-27 23:00:00.0000000          0.14 
1222668  MAC005492  2014-02-27 23:30:00.0000000         0.192 
1222669  MAC005492  2014-02-28 00:00:00.0000000         0.088 

[1222670 rows x 3 columns]


## Data Cleaning
Rows where the energy consumption value is the string "Null" are removed, and the energy consumption column is converted to a numeric type. The timestamp column is converted to a datetime type. The energy consumption values are then multiplied by 1000 to convert them from kilowatt-hours to watt-hours per half-hour interval.

In [3]:
# drop rows whose reading is the string "Null"; .copy() avoids a SettingWithCopyWarning below
alldata = alldata[alldata["energy(kWh/hh)"] != "Null"].copy()
# set energy consumption data to numeric type
alldata["energy"] = alldata["energy(kWh/hh)"].astype("float64")
# parse the timestamp strings as datetimes
alldata["tstp"] = pd.to_datetime(alldata["tstp"])
# convert the readings from kWh to Wh
alldata["energy"] = alldata["energy"] * 1000

In [4]:
print(alldata)

             LCLid                tstp energy(kWh/hh)  energy
0        MAC000002 2012-10-12 00:30:00             0      0.0
1        MAC000002 2012-10-12 01:00:00             0      0.0
2        MAC000002 2012-10-12 01:30:00             0      0.0
3        MAC000002 2012-10-12 02:00:00             0      0.0
4        MAC000002 2012-10-12 02:30:00             0      0.0
...            ...                 ...            ...     ...
1222665  MAC005492 2014-02-27 22:00:00         0.182    182.0
1222666  MAC005492 2014-02-27 22:30:00         0.122    122.0
1222667  MAC005492 2014-02-27 23:00:00          0.14    140.0
1222668  MAC005492 2014-02-27 23:30:00         0.192    192.0
1222669  MAC005492 2014-02-28 00:00:00         0.088     88.0

[1222620 rows x 4 columns]
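
The row count drops from 1222670 to 1222620, so 50 rows carried the "Null" marker. As an alternative to filtering on the literal string, the same cleaning can be sketched with `pd.to_numeric(..., errors="coerce")`, which turns any unparseable reading into NaN before dropping it. This is only a sketch: the helper name `clean_halfhourly` and the toy values are made up for illustration and are not part of the notebook.

```python
import pandas as pd

# Toy frame with the same column names as the raw file; the values are invented.
raw = pd.DataFrame({
    "LCLid": ["MAC000002"] * 4,
    "tstp": ["2012-10-12 00:30:00", "2012-10-12 01:00:00",
             "2012-10-12 01:30:00", "2012-10-12 02:00:00"],
    "energy(kWh/hh)": ["0.5", "Null", "0.25", "0"],
})

def clean_halfhourly(df):
    """Hypothetical helper: coerce readings to Wh and drop unparseable rows."""
    out = df.copy()
    # Non-numeric strings such as "Null" become NaN instead of raising.
    out["energy"] = pd.to_numeric(out["energy(kWh/hh)"], errors="coerce") * 1000
    out["tstp"] = pd.to_datetime(out["tstp"])
    # Remove the rows whose reading could not be parsed.
    out = out.dropna(subset=["energy"])
    return out.drop(columns=["energy(kWh/hh)"])

cleaned = clean_halfhourly(raw)
print(cleaned)
```

Coercion also catches malformed readings other than "Null", which may or may not be desirable; the explicit string filter used in the notebook fails loudly on anything unexpected.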


In [5]:
print(alldata.columns)

Index(['LCLid', 'tstp', 'energy(kWh/hh)', 'energy'], dtype='object')


In [6]:
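# keep only the converted energy column (Wh)
df1 = alldata.drop(columns=["energy(kWh/hh)"])
# view indexed by household ID; the original row number is kept as the "index" column
df2 = df1.reset_index().set_index("LCLid")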

In [7]:
print(df2)

             index                tstp  energy
LCLid                                         
MAC000002        0 2012-10-12 00:30:00     0.0
MAC000002        1 2012-10-12 01:00:00     0.0
MAC000002        2 2012-10-12 01:30:00     0.0
MAC000002        3 2012-10-12 02:00:00     0.0
MAC000002        4 2012-10-12 02:30:00     0.0
...            ...                 ...     ...
MAC005492  1222665 2014-02-27 22:00:00   182.0
MAC005492  1222666 2014-02-27 22:30:00   122.0
MAC005492  1222667 2014-02-27 23:00:00   140.0
MAC005492  1222668 2014-02-27 23:30:00   192.0
MAC005492  1222669 2014-02-28 00:00:00    88.0

[1222620 rows x 3 columns]
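
Because rows were dropped during cleaning, the series for a household may no longer be evenly spaced. A minimal sketch of a check for this, assuming the cleaned columns `LCLid` and `tstp` from above; the helper `find_gaps` and the toy readings are hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy readings: household "A" is missing its 01:30 reading.
readings = pd.DataFrame({
    "LCLid": ["A", "A", "A", "B", "B"],
    "tstp": pd.to_datetime([
        "2012-10-12 00:30:00", "2012-10-12 01:00:00", "2012-10-12 02:00:00",
        "2012-10-12 00:30:00", "2012-10-12 01:00:00",
    ]),
    "energy": [0.0, 0.0, 182.0, 122.0, 140.0],
})

def find_gaps(df, expected=pd.Timedelta(minutes=30)):
    """Hypothetical helper: rows whose step from the previous reading is not `expected`."""
    ordered = df.sort_values(["LCLid", "tstp"])
    # Time since the previous reading of the same household (NaT for the first one).
    step = ordered.groupby("LCLid")["tstp"].diff()
    return ordered[step.notna() & (step != expected)]

gaps = find_gaps(readings)
print(gaps)
```

Each returned row marks the first reading after an irregular step, so an empty result means every household is sampled at exactly 30-minute intervals.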


In [8]:
df1.to_csv("30_minutes.csv", index=True)
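
With `index=True`, the file keeps the original row labels, which have gaps where the "Null" rows were dropped, and the timestamps are written as plain text. A sketch of reading such a file back, using an in-memory buffer and invented sample rows in place of `30_minutes.csv`:

```python
import io
import pandas as pd

# Invented sample with a non-contiguous index, as left behind by dropping rows.
sample = pd.DataFrame({
    "LCLid": ["MAC000002", "MAC000002"],
    "tstp": pd.to_datetime(["2012-10-12 00:30:00", "2012-10-12 01:00:00"]),
    "energy": [0.0, 182.0],
}, index=[0, 5])

# Round-trip through CSV text; a file path would work the same way.
buffer = io.StringIO()
sample.to_csv(buffer, index=True)
buffer.seek(0)

# index_col=0 restores the saved row labels; parse_dates restores the datetime dtype.
restored = pd.read_csv(buffer, index_col=0, parse_dates=["tstp"])
print(restored.dtypes)
```

Without `parse_dates`, the `tstp` column would come back as strings and the datetime conversion would have to be repeated.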