# Energy Disaggregation using Random Forest Regression
`recorded_data.csv` contains the activities that Paul recorded in a building, along with the corresponding date and time. Each row corresponds to one activity. `data_30_minutes.csv` contains the energy consumption data for Paul's house, measured every 30 minutes.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from hmmlearn import hmm
from nilmtk.legacy.disaggregate import fhmm_exact
from nilmtk import MeterGroup, ElecMeter
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
import fhmm_model as fhmm

In [2]:
df = pd.read_csv("recorded_data.csv")
df_data = pd.read_csv("data_30_minutes.csv")
print(df)

           date      time          datetime         Activity
0    12/01/2023  19:30:00  12/01/2023 19:30             Bath
1    12/01/2023  19:28:00  12/01/2023 19:28       Dishwasher
2    12/01/2023  19:40:00  12/01/2023 19:40  Washing Machine
3    13/01/2023  06:55:00  13/01/2023 06:55           Kettle
4    13/01/2023  07:30:00  13/01/2023 07:30           Kettle
..          ...       ...               ...              ...
236  02/02/2023  16:27:00  02/02/2023 16:27           Kettle
237  02/02/2023  16:55:00  02/02/2023 16:55           Kettle
238  02/02/2023  20:55:00  02/02/2023 20:55           Kettle
239  03/02/2023  13:41:00  03/02/2023 13:41           Kettle
240  03/02/2023  16:05:00  03/02/2023 16:05           Kettle

[241 rows x 4 columns]


## Data Cleaning
While sorting through the dataframes, I realized that the appliance data needed to be organized in a specific way for the NILMTK toolkit. To achieve this, I created a state vector for each recorded activity, indicating which appliance was in use at that time. The process involved extracting the unique activity values, creating the activity state vectors, and applying those vectors to the dataframe. The dataframes were then sorted by index and merged on the 'datetime' column, any missing values were filled using linear interpolation, and the index of the resulting dataframe was reset. The cleaned data was then ready for further analysis.
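
As a rough illustration of the merge and interpolation steps described above, the following sketch uses two small made-up dataframes. The readings and the single `Kettle` event are hypothetical and are not taken from the CSV files; the point is only to show what value an event row receives when it falls between two half-hourly readings.

```python
import pandas as pd

# Toy half-hourly energy readings (made-up values)
energy = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["2023-01-13 06:30", "2023-01-13 07:00", "2023-01-13 07:30"]
    ),
    "energy": [100.0, 400.0, 200.0],
})

# Toy activity log with one event between two readings
events = pd.DataFrame({
    "datetime": pd.to_datetime(["2023-01-13 06:45"]),
    "Activity": ["Kettle"],
})

# Outer merge keeps every timestamp from both frames
merged = pd.merge(events, energy, on="datetime", how="outer")
merged = merged.sort_values("datetime").reset_index(drop=True)

# Fill the missing energy value of the event row by linear interpolation
merged["energy"] = merged["energy"].interpolate()

kettle_energy = merged.loc[merged["Activity"] == "Kettle", "energy"].iloc[0]
print(merged)
print("Interpolated energy at the Kettle event:", kettle_energy)
```

With the default `method="linear"`, pandas treats the rows as equally spaced rather than weighting by the actual time gaps, so the interpolated value depends on how many rows sit between two readings.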

In [3]:
# Extract the unique activity names
appliances = df['Activity'].unique()

# Build a one-hot state vector for the activity recorded in a row
def create_appliance_vector(row, appliances):
    """Return a list with 1 at the position of the row's activity and 0 elsewhere."""
    return [1 if row['Activity'] == appliance else 0 for appliance in appliances]

# Apply the vectors to the dataframe
df['Appliance_states'] = df.apply(lambda row: create_appliance_vector(row, appliances), axis=1)
print(df)

           date      time          datetime         Activity  \
0    12/01/2023  19:30:00  12/01/2023 19:30             Bath   
1    12/01/2023  19:28:00  12/01/2023 19:28       Dishwasher   
2    12/01/2023  19:40:00  12/01/2023 19:40  Washing Machine   
3    13/01/2023  06:55:00  13/01/2023 06:55           Kettle   
4    13/01/2023  07:30:00  13/01/2023 07:30           Kettle   
..          ...       ...               ...              ...   
236  02/02/2023  16:27:00  02/02/2023 16:27           Kettle   
237  02/02/2023  16:55:00  02/02/2023 16:55           Kettle   
238  02/02/2023  20:55:00  02/02/2023 20:55           Kettle   
239  03/02/2023  13:41:00  03/02/2023 13:41           Kettle   
240  03/02/2023  16:05:00  03/02/2023 16:05           Kettle   

                                      Appliance_states  
0    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...  
1    [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...  
2    [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...  
3  
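
The same kind of one-hot state vectors could probably be produced without a row-wise `apply`. The sketch below uses `pd.get_dummies` on a small hypothetical activity log (the four rows are made up) and reorders the columns to match the order of first appearance, as `unique()` does.

```python
import pandas as pd

# Hypothetical activity log, standing in for the 'Activity' column
log = pd.DataFrame({"Activity": ["Bath", "Kettle", "Kettle", "Shower"]})

# One column per activity, 1 where the row's activity matches
one_hot = pd.get_dummies(log["Activity"], dtype=int)

# get_dummies sorts the columns; restore order of first appearance
one_hot = one_hot[list(log["Activity"].unique())]

print(one_hot)
```

This gives one named column per appliance directly, rather than a list stored in a single column.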

In [4]:
df_appliance = df.sort_index()
df_energy = df_data.sort_index()
print(df_appliance)
print(df_energy)

           date      time          datetime         Activity  \
0    12/01/2023  19:30:00  12/01/2023 19:30             Bath   
1    12/01/2023  19:28:00  12/01/2023 19:28       Dishwasher   
2    12/01/2023  19:40:00  12/01/2023 19:40  Washing Machine   
3    13/01/2023  06:55:00  13/01/2023 06:55           Kettle   
4    13/01/2023  07:30:00  13/01/2023 07:30           Kettle   
..          ...       ...               ...              ...   
236  02/02/2023  16:27:00  02/02/2023 16:27           Kettle   
237  02/02/2023  16:55:00  02/02/2023 16:55           Kettle   
238  02/02/2023  20:55:00  02/02/2023 20:55           Kettle   
239  03/02/2023  13:41:00  03/02/2023 13:41           Kettle   
240  03/02/2023  16:05:00  03/02/2023 16:05           Kettle   

                                      Appliance_states  
0    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...  
1    [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...  
2    [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...  
3  

In [5]:
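# Parse the timestamp columns. Note: the activity log is day-first (dd/mm/yyyy).
# Without dayfirst=True, ambiguous dates such as 03/02/2023 are read month-first,
# which is why the output below shows 2023-03-02 for that date.
df_appliance['datetime'] = pd.to_datetime(df_appliance['datetime'])
df_energy['tstp'] = pd.to_datetime(df_energy['tstp'])
# Convert to UTC and drop the timezone so both columns are timezone-naive
df_energy['tstp'] = df_energy['tstp'].dt.tz_convert(None)
# Rename the 'tstp' column of df_energy to 'datetime' for merging purposes
df2 = df_energy.rename(columns={'tstp': 'datetime'})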

# Merge the two dataframes based on the 'datetime' column
merged_df = pd.merge(df_appliance, df2, on='datetime', how='outer')

# Sort the merged dataframe by 'datetime'
merged_df = merged_df.sort_values(by='datetime')

# Interpolate missing values using linear interpolation
interpolated_df = merged_df.interpolate()

# Reset the index of the final dataframe
interpolated_df = interpolated_df.reset_index(drop=True)

# Keep only the rows that correspond to a recorded activity
interpolated_df = interpolated_df[interpolated_df['Activity'].notna()]
print(interpolated_df)

              date      time            datetime         Activity  \
136534  13/01/2023  06:55:00 2023-01-13 06:55:00           Kettle   
136855  13/01/2023  07:30:00 2023-01-13 07:30:00           Kettle   
137177  13/01/2023  08:05:00 2023-01-13 08:05:00    Office Heater   
138043  13/01/2023  09:40:00 2023-01-13 09:40:00           Shower   
138090  13/01/2023  09:45:00 2023-01-13 09:45:00           Kettle   
...            ...       ...                 ...              ...   
799699  03/02/2023  13:41:00 2023-03-02 13:41:00           Kettle   
800582  03/02/2023  16:05:00 2023-03-02 16:05:00           Kettle   
928089  12/01/2023  19:28:00 2023-12-01 19:28:00       Dishwasher   
928090  12/01/2023  19:30:00 2023-12-01 19:30:00             Bath   
928091  12/01/2023  19:40:00 2023-12-01 19:40:00  Washing Machine   

                                         Appliance_states  Unnamed: 0  energy  
136534  [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...     88683.5  3310.0  
136855  [0,
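
The output above shows `03/02/2023` parsed as `2023-03-02` and `12/01/2023` as `2023-12-01`, although the log runs from 12 January to 3 February. A minimal sketch of how day-first parsing could be requested is given below; the two sample strings are taken from the log, while the variable names are only illustrative.

```python
import pandas as pd

ambiguous = "03/02/2023 13:41"

# Default parsing reads an ambiguous date month-first (2 March)
month_first = pd.to_datetime(ambiguous)

# dayfirst=True reads it as 3 February
day_first = pd.to_datetime(ambiguous, dayfirst=True)

# For a whole column, an explicit format avoids any guessing
log = pd.Series(["03/02/2023 13:41", "13/01/2023 06:55"])
parsed = pd.to_datetime(log, format="%d/%m/%Y %H:%M")

print(month_first, day_first)
print(parsed)
```

With day-first parsing, the activity timestamps would all fall in January and February 2023.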

In [6]:
# Create a new dataframe from the values in 'Appliance_states' (one column per appliance)
appliance_states_df = pd.DataFrame(interpolated_df['Appliance_states'].tolist(), index=interpolated_df.index)
# Use placeholder column names (appliance_1, appliance_2, ...); these still need to be mapped to the actual appliance names
appliance_states_df.columns = [f'appliance_{i+1}' for i in range(len(appliance_states_df.columns))]


In [7]:
df_cleaned = pd.concat([interpolated_df, appliance_states_df], axis=1)
df_cleaned.drop('Appliance_states', axis=1, inplace=True)


In [8]:
print(df_cleaned)

              date      time            datetime         Activity  Unnamed: 0  \
136534  13/01/2023  06:55:00 2023-01-13 06:55:00           Kettle     88683.5   
136855  13/01/2023  07:30:00 2023-01-13 07:30:00           Kettle     88891.5   
137177  13/01/2023  08:05:00 2023-01-13 08:05:00    Office Heater     89100.5   
138043  13/01/2023  09:40:00 2023-01-13 09:40:00           Shower     89661.5   
138090  13/01/2023  09:45:00 2023-01-13 09:45:00           Kettle     89691.5   
...            ...       ...                 ...              ...         ...   
799699  03/02/2023  13:41:00 2023-03-02 13:41:00           Kettle   1009319.5   
800582  03/02/2023  16:05:00 2023-03-02 16:05:00           Kettle   1010116.5   
928089  12/01/2023  19:28:00 2023-12-01 19:28:00       Dishwasher   1125700.0   
928090  12/01/2023  19:30:00 2023-12-01 19:30:00             Bath   1125700.0   
928091  12/01/2023  19:40:00 2023-12-01 19:40:00  Washing Machine   1125700.0   

        energy  appliance_1

## Random Forest Regression
The Random Forest Regressor model was created using the scikit-learn library and trained on the cleaned dataset. The model was then used to make predictions on the validation set and on the entire dataset. The Mean Squared Error (MSE) and Mean Absolute Error (MAE) were calculated to evaluate the model's performance.

In [9]:
# Extract the features: the last len(appliances) columns of the cleaned dataframe,
# i.e. the one-hot appliance columns appliance_1 ... appliance_N
feature_windows = df_cleaned.iloc[:, -len(appliances):]

# Extract the labels (the on/off indicator of each appliance) from the cleaned dataframe.
# Note: these are the same columns as the features above.
appliance_columns = [f'appliance_{i+1}' for i in range(len(appliances))]
labels = df_cleaned[appliance_columns].values

# Split the feature windows and labels into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(feature_windows, labels, test_size=0.2, random_state=42)
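
In the cell above, the features and the labels are read from the same appliance columns. One possible alternative, sketched below on a made-up frame, is to build the features from the aggregate reading and the time of day, and to keep the appliance columns as labels only. The feature names `hour` and `dayofweek` and the toy values are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for df_cleaned (values are made up)
toy = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["2023-01-13 06:55", "2023-01-13 09:40", "2023-01-14 07:30"]
    ),
    "energy": [3310.0, 5200.0, 2900.0],
    "appliance_1": [1, 0, 1],
    "appliance_2": [0, 1, 0],
})

# Labels: the one-hot appliance columns
label_columns = [c for c in toy.columns if c.startswith("appliance_")]
y = toy[label_columns]

# Features: aggregate energy plus simple time-of-day information
X = pd.DataFrame({
    "energy": toy["energy"],
    "hour": toy["datetime"].dt.hour + toy["datetime"].dt.minute / 60,
    "dayofweek": toy["datetime"].dt.dayofweek,
})

print(X)
print(y)
```

Keeping `X` and `y` disjoint means the validation errors reflect how well the aggregate signal predicts appliance use, rather than how well the model copies its inputs.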


In [10]:
# Create the Random Forest Regressor model
model = RandomForestRegressor(n_estimators=100, random_state=42)

# Train the model on the training data
model.fit(X_train, y_train)


In [11]:
# Make predictions on the validation set
y_pred = model.predict(X_val)

# Calculate and print the evaluation metrics
mse = mean_squared_error(y_val, y_pred)
mae = mean_absolute_error(y_val, y_pred)
print("Mean Squared Error:", mse)
print("Mean Absolute Error:", mae)


Mean Squared Error: 0.0017324705882352943
Mean Absolute Error: 0.004141176470588236
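
To put these numbers in context, it may help to compute the same metrics for a trivial baseline. The sketch below assumes one-hot labels over 17 appliances (the number of appliances listed in the final output) and a predictor that always outputs zero; the five random rows are made up. The metrics are computed as plain element-wise means, which is what the default uniform averaging over outputs amounts to when every output has the same number of samples.

```python
import numpy as np

n_rows, n_appliances = 5, 17

# Made-up one-hot labels: exactly one active appliance per row
rng = np.random.default_rng(0)
active = rng.integers(0, n_appliances, size=n_rows)
y_true = np.zeros((n_rows, n_appliances))
y_true[np.arange(n_rows), active] = 1

# Trivial baseline that predicts "off" for every appliance
y_zero = np.zeros_like(y_true)

mse = np.mean((y_true - y_zero) ** 2)
mae = np.mean(np.abs(y_true - y_zero))

print("Baseline MSE:", mse)
print("Baseline MAE:", mae)
```

For one-hot rows, this baseline gives an MSE and MAE of 1/17 (about 0.059) each, so errors on sparse indicator targets are small in absolute terms even for an uninformative model.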


In [12]:
# Make predictions for the entire dataset
y_pred_all = model.predict(feature_windows)


In [13]:
# Sum the predictions for each appliance over all rows; this total is reported below as its energy consumption
disaggregated_energy = y_pred_all.sum(axis=0)

# Print the disaggregated energy for each appliance
for i, energy in enumerate(disaggregated_energy):
    print(f"Energy consumption of {appliances[i]}: {energy:.2f}")



Energy consumption of Bath: 5.20
Energy consumption of Dishwasher: 9.05
Energy consumption of Washing Machine: 6.00
Energy consumption of Kettle: 126.00
Energy consumption of Office Heater: 3.32
Energy consumption of Shower: 21.00
Energy consumption of Gas Hob: 34.00
Energy consumption of Toaster: 9.03
Energy consumption of Coffee Grinder: 14.00
Energy consumption of Fan Oven: 2.58
Energy consumption of Tumble Dryer: 6.06
Energy consumption of Hoover: 1.15
Energy consumption of Xmas Lights: 1.34
Energy consumption of Oven: 5.87
Energy consumption of Kitchen Heater: 4.13
Energy consumption of Grill: 1.27
Energy consumption of Heater: 0.00
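
Since the predictions are on/off indicators, summing them yields something close to a count of events per appliance rather than an energy figure. One simple and admittedly crude way to turn indicators into energy, sketched below with made-up numbers, is to attribute each row's meter reading to the appliance that was active in that row. The column names and readings are hypothetical.

```python
import pandas as pd

# Made-up on/off indicators for two appliances over three rows
states = pd.DataFrame({"Kettle": [1, 0, 1], "Shower": [0, 1, 0]})

# Made-up aggregate reading for the same three rows
energy = pd.Series([300.0, 900.0, 250.0], name="energy")

# Summing indicators counts events per appliance
event_counts = states.sum(axis=0)

# Weighting each indicator by the row's reading attributes energy instead
attributed = states.mul(energy, axis=0).sum(axis=0)

print(event_counts)
print(attributed)
```

This attribution assigns the whole reading to the logged appliance and ignores any background load, so it is an upper bound at best.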


## Performance
The model's performance was interesting: the mean errors are low, but some of the predicted energy-usage values simply look wrong. The low errors should be read with caution, because the features and the labels in the code above are taken from the same appliance columns. The implausible values are due to limited resources: the small amount of labelled data, the model used, and the preprocessing method used to vectorise the times at which appliances are used, since we have no data on how long each appliance was in use.
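
Because only start times are logged, one rough workaround is to assume that each event affects only the half-hour interval it falls in. The sketch below floors three Kettle timestamps from the log to a 30-minute slot. Whether a meter reading is stamped at the start or the end of its interval is not stated in the data, so the choice of `floor` is an assumption.

```python
import pandas as pd

# Three Kettle events from the activity log
events = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["2023-01-13 06:55", "2023-01-13 07:30", "2023-01-13 09:45"]
    ),
    "Activity": ["Kettle", "Kettle", "Kettle"],
})

# Assign each event to the start of its 30-minute interval
events["slot"] = events["datetime"].dt.floor("30min")

# Number of distinct half-hour slots in which each appliance was logged
active_slots = events.groupby("Activity")["slot"].nunique()

print(events)
print(active_slots)
```

Events aligned this way can be joined to the half-hourly readings on exact timestamps, without interpolating the energy values.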