Notebook showing how the "duration-of-compressor-being-on" model was trained, given duration of the door being open, and temperature difference between setpoint (ie., internal temperature) and roomTemp

Caveats: 

1. Setpoint was considered to be internal temperature based on the analysis for the current uuid

2. There is a script file which automates the training of the models for each of the 3 uuid files. This notebook is meant to be seen as illustration

3. A lot of the code (data prep and wrangling is repeated) which could have been made in to a function, but I ran out of time

In [1]:
import pandas as pd
import numpy as np
import config
from datetime import datetime
import math
import pickle

In [2]:
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import LeaveOneOut
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import train_test_split
from sklearn import metrics

In [3]:
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import matplotlib.pyplot as plt


urllib3 (1.24.1) or chardet (2.3.0) doesn't match a supported version!



Let us go through how the setpoints change over the usage of the 3 devices

In [4]:
file_name = '09ac4a10-7e8e-40f3-a327-1f93a5cf2383.pickle'
df = pd.read_pickle(file_name)
#df.describe()

In [5]:
# flag the instances the door opened
df['door_toggle'] =  df['door'] - df['door'].shift(periods=1, fill_value =0)
#print(np.unique(df['door_toggle'],return_counts=True))
# make a smoothing filter to remove the times when the door is closed, and then suddenly opened again (over each samples)
convolution_filter = np.ones_like(np.arange(4))/4
df['door_opened_error'] = np.convolve(abs(df['door_toggle']), convolution_filter, mode='same')
df.loc[df['door_opened_error']==0.5,'door_toggle']=0
# now, door_toggle has only +1 and -1 toggle, specifically one after the other. +1 denotes door open, -1 denotes door closed
#print("-----")
#print("after filtering:")
#print(np.unique(df['door_toggle'],return_counts=True))

# flag the instances the compressor turned on
df['compressor_toggle'] =  df['compressor'] -  df['compressor'].shift(periods=1, fill_value =0)

In [6]:
# calculate the the duration the door was open (using same logic as above)
df.loc[df['door_toggle']!=0,'door_open_duration']= df.loc[df['door_toggle']!=0,'timestamp'] - df.loc[df['door_toggle']!=0,'timestamp'].shift(periods=1)

In [7]:
#df[df['compressor_toggle']!=0].head()
# see if door is always open when compressor is toggled
# print(np.unique(df.loc[df['compressor_toggle']==1,'door'], return_counts=True))
##(array([1], dtype=int64), array([414], dtype=int64))
# Door is always 1 when compressor is toggled on

In [8]:
# calculate the the duration the door was open
df.loc[df['compressor_toggle']!=0,'compressor_on_duration']= df.loc[df['compressor_toggle']!=0,'timestamp'] - df.loc[df['compressor_toggle']!=0,'timestamp'].shift(periods=1)
df[df['compressor_toggle']!=0].head()

Unnamed: 0,compressor,door,roomTemp,setpoint,temp,timestamp,door_toggle,door_opened_error,compressor_toggle,door_open_duration,compressor_on_duration
3196,1,1,70.672697,37,40.139183,2019-01-01 13:19:00,0,0.25,1,NaT,NaT
3224,0,0,70.491869,37,37.0,2019-01-01 13:26:00,0,0.0,-1,NaT,00:07:00
4636,1,1,62.021042,37,40.17712,2019-01-01 19:19:00,0,0.0,1,NaT,05:53:00
4646,0,0,61.991146,37,37.0,2019-01-01 19:21:30,0,0.0,-1,NaT,00:02:30
4718,1,1,61.796863,37,40.163242,2019-01-01 19:39:30,0,0.0,1,NaT,00:18:00


In [9]:
# let's look into the toggling activity, by keeping only the rows where something was toggled
toggle_df = df.loc[np.logical_or(df['compressor_toggle']!=0,df['door_toggle']!=0),:].reset_index()

In [10]:
# verify if the temperature inside the fridge and setpoint is the same when the door is opened. 
# this could affect accuracy
toggle_df['result']= toggle_df.loc[toggle_df['door_toggle']==1,'setpoint'] - toggle_df.loc[toggle_df['door_toggle']==1,'temp']

#print(toggle_df[toggle_df['door_toggle']==1].shape) 
#door was opened 421 times

#print(toggle_df.loc[np.logical_and(toggle_df['door_toggle']==1,toggle_df['result']==0),:].shape)
# temp was equal to setpoint 414 times

# toggle_df.loc[np.logical_and(toggle_df['door_toggle']==1,toggle_df['result']!=0),:]
# shows that door was shut and opened while compressor was still running. Let's remove those
toggle_df = toggle_df.drop(toggle_df.loc[np.logical_and(toggle_df['door_toggle']==1,toggle_df['result']!=0),:].index)

#clean up the data by removing the unnecessary columns
toggle_df = toggle_df.drop(columns=['result','index'])

In [11]:
# get time when door was opened, and the temperature at that time
toggle_df.loc[toggle_df['door_toggle']==1,'door_activated_time'] = toggle_df.loc[toggle_df['door_toggle']==1,'timestamp']
toggle_df.loc[toggle_df['door_toggle']==1,'door_activated_roomTemp'] = toggle_df.loc[toggle_df['door_toggle']==1,'roomTemp']

In [12]:
# forward fill those columns, so that the values are propogated further, thereby resulting in a simple subtraction to find difference in times and temperatures
toggle_df['door_activated_time'] = toggle_df['door_activated_time'].fillna(method='ffill') 
toggle_df['door_activated_roomTemp'] = toggle_df['door_activated_roomTemp'].fillna(method='ffill') 
toggle_df['door_open_duration'] = toggle_df['door_open_duration'].fillna(method='ffill') 
toggle_df['delta_temp'] = toggle_df['door_activated_roomTemp'] - toggle_df['setpoint']

prepare the final dataframe to model the compressor being switched on

In [13]:
final_df = toggle_df.loc[toggle_df['compressor_toggle']==-1,['timestamp','delta_temp',
                                                             'door_open_duration',
                                                             'compressor_on_duration']].reset_index(drop=True)
#final_df['dates'] = final_df['timestamp'].dt.date
#final_df['hour'] = final_df['timestamp'].dt.hour
final_df['door_open_duration'] = final_df['door_open_duration'].dt.total_seconds()
final_df['compressor_on_duration'] = final_df['compressor_on_duration'].dt.total_seconds()
final_df.tail()

Unnamed: 0,timestamp,delta_temp,door_open_duration,compressor_on_duration
409,2019-04-30 13:35:00,33.374335,120.0,240.0
410,2019-04-30 13:44:45,33.136451,150.0,285.0
411,2019-04-30 14:18:00,32.209785,120.0,240.0
412,2019-04-30 19:48:00,24.747237,105.0,150.0
413,2019-04-30 20:11:30,24.587966,270.0,360.0


In [14]:
from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(final_df[['door_open_duration','delta_temp']], final_df['compressor_on_duration'])
reg.score(final_df[['door_open_duration','delta_temp']], final_df['compressor_on_duration'])

0.9035166562407311

In [15]:
# Split the data into training and testing sets
train_features, test_features, train_labels, test_labels = train_test_split(final_df[['door_open_duration','delta_temp']], 
                                                                            final_df['compressor_on_duration'], 
                                                                            test_size = 0.2,
                                                                            random_state = 4)

In [16]:
print('Training Features Shape:', train_features.shape)
print('Training Labels Shape:', train_labels.shape)
print('Testing Features Shape:', test_features.shape)
print('Testing Labels Shape:', test_labels.shape)

Training Features Shape: (331, 2)
Training Labels Shape: (331,)
Testing Features Shape: (83, 2)
Testing Labels Shape: (83,)


In [17]:
from sklearn.preprocessing import StandardScaler
train_features = StandardScaler().fit_transform(train_features)
test_features = StandardScaler().fit_transform(test_features)

In [18]:
# Instantiate model with 1000 decision trees
rf_reg = RandomForestRegressor(n_estimators = 1000)
# Train the model on training data
rf_reg.fit(train_features, train_labels)
# Use the forest's predict method on the test data
rf_reg_predictions = rf_reg.predict(test_features)
# Performance metrics
rf_reg_errors = abs(rf_reg_predictions - test_labels)
# print performance evaluations
print('Mean Absolute Error:', metrics.mean_absolute_error(test_labels, rf_reg_predictions))
print('Mean Squared Error:', metrics.mean_squared_error(test_labels, rf_reg_predictions))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(test_labels, rf_reg_predictions)))
print('R2 value:', rf_reg.score(test_features, test_labels))

Mean Absolute Error: 17.68807228915663
Mean Squared Error: 1665.2193668674704
Root Mean Squared Error: 40.80709946648341
R2 value: 0.8806266412101857


In [19]:
# Perform Grid-Search

search = GridSearchCV(
    estimator=RandomForestRegressor(),
    param_grid={
        'max_depth': [5,10,20],
        'n_estimators': [100, 500, 1000, 2000],
    },
    cv=5,n_jobs=-1)

search_result = search.fit(train_features, train_features)
best_params = search_result.best_params_

rf_search = RandomForestRegressor(max_depth=best_params["max_depth"], 
                            n_estimators=best_params["n_estimators"])

# Train the model on training data
rf_search.fit(train_features, train_labels)

# Use the forest's predict method on the test data
rf_search_predictions = rf_search.predict(test_features)
# Performance metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(test_labels, rf_search_predictions))
print('Mean Squared Error:', metrics.mean_squared_error(test_labels, rf_search_predictions))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(test_labels, rf_search_predictions)))
print('R2 value:', rf_search.score(test_features, test_labels))

Mean Absolute Error: 17.239156626506027
Mean Squared Error: 1626.8321385542172
Root Mean Squared Error: 40.33400722162648
R2 value: 0.883378478277157


Looks like linear regression might be a better fit than the random forests. Would have done more visualizations and verifications if there were more time

If there was more time, I would have also added a loop to go through and pick the best amongst regression, random forests or neural nets

In [20]:
pickle.dump(grid_result, open(str(file_name[:-7]+'_reg_params.pickle'), 'wb'))

NameError: name 'grid_result' is not defined