# **Usage Agent**

The **Usage Agent** estimates the probability that a particular device will be used on the following day. Within the general recommendation framework, this estimate is used to limit unnecessary recommendations that could irritate the user: whenever a device is unlikely to be used on the next day (estimated likelihood below a certain threshold), no recommendation is made.
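
The gating rule can be sketched in a few lines. This is only an illustration: the function name `should_recommend` and the default threshold of 0.5 are assumptions, not values taken from the recommendation framework.

```python
def should_recommend(usage_probability: float, threshold: float = 0.5) -> bool:
    """Return True only if the predicted usage probability reaches the threshold."""
    # Hypothetical helper: the actual threshold is chosen in the recommendation framework.
    return usage_probability >= threshold


# A device with a low predicted usage likelihood gets no recommendation.
for prob in (0.12, 0.50, 0.87):
    print(prob, should_recommend(prob))
```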

In the present notebook, we will describe how these probabilities are estimated in detail and define the **Usage Agent** class that will be integrated into the recommendation agent.

The Usage Agent applies an ML algorithm to features extracted from the household's electricity consumption data in order to predict the likelihood that a device is used on the next day. For instance, on a given day t-1, it uses all consumption data available up to day t-1 to predict device usage on day t. The features fall into three categories:
1. Whether activity has been detected in the house in the preceding days (activity detected by electricity consumption).
2. Whether the device to be predicted has been used in the previous days.
3. Time dummies.
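
As a rough illustration of these three feature groups, the sketch below builds them with pandas on a toy daily series. The values are made up, and the definition of `active_last_2_days` (activity on at least one of the two preceding days) is an assumption; the actual features are produced by the Preparation Agent.

```python
import pandas as pd

# Toy daily data (hypothetical values), indexed by day.
days = pd.date_range("2022-11-21", periods=6, freq="D")
toy = pd.DataFrame(
    {"activity": [1, 0, 0, 1, 1, 0], "573_usage": [1, 0, 0, 0, 1, 0]},
    index=days,
)

# 1. Activity in the preceding days.
toy["activity_lag_1"] = toy["activity"].shift(1)
toy["activity_lag_2"] = toy["activity"].shift(2)
# Assumed definition: active on at least one of the two preceding days.
toy["active_last_2_days"] = (
    toy[["activity_lag_1", "activity_lag_2"]].max(axis=1).fillna(0).astype(int)
)

# 2. Usage of the device to be predicted in the preceding days.
toy["573_usage_lag_1"] = toy["573_usage"].shift(1)
toy["573_usage_lag_2"] = toy["573_usage"].shift(2)

# 3. Time dummies for the day of the week.
day_names = pd.Series(toy.index.day_name(), index=toy.index)
toy = toy.join(pd.get_dummies(day_names, prefix="day_name"))

print(toy[["573_usage", "573_usage_lag_1", "active_last_2_days"]])
```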

Given the limited number of observations for each household, we will need to restrict the complexity of the ML-Algorithm in use. This is the reason why we will use a logit model with a limited number of features.
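
To make the model concrete: a logit model maps a linear combination of the features to a probability through the logistic function. The sketch below shows this mapping for a model without an intercept term. The coefficient values are hypothetical and only serve to illustrate the computation.

```python
import math


def logit_probability(features, coefficients):
    """P(usage = 1 | features) under a logit model without intercept."""
    linear_index = sum(x * b for x, b in zip(features, coefficients))
    return 1.0 / (1.0 + math.exp(-linear_index))


# Hypothetical coefficients for [usage_lag_1, usage_lag_2, active_last_2_days].
beta = [0.8, 0.3, 1.2]
print(logit_probability([1, 0, 1], beta))  # linear index 0.8 + 1.2 = 2.0
print(logit_probability([0, 0, 0], beta))  # linear index 0.0
```

With only three coefficients to estimate, such a model remains usable even when a household provides just a few weeks of daily observations.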

## **1. Load And Preprocess Data**

This part's only purpose is to load the data used in the Usage Agent. This process is described in detail in the Preparation Agent. 

**Note: When running the notebook with a household other than Household 1, you may need to adapt some parameters.**

### **1.1 Initialize And Load Python Scripts**

In [1]:
# Please change the paths below to the locations of your code and database.
import os
import sqlite3
import pandas as pd
import numpy as np
os.chdir('/Users/sofyakonchakova/Desktop/HU_Berlin/SIS/codes/')
from helper_functions import Helper
from PreparationAgent import Preparation_Agent
os.chdir('/Users/sofyakonchakova/Desktop/')
dbfile = 'home-assistant_Chris.db'
helper = Helper()

In [2]:
# TODO: move this function into the helper functions
def export_sql(file=dbfile):
    with sqlite3.connect(file) as con:
        cur = con.cursor()
        cur.execute("SELECT * FROM states")
        states = cur.fetchall()
    from_states_db = []
    for result in states:
        result = list(result)
        from_states_db.append(result)
    columns = ["state_id","entity_id","state","attributes","event_id","last_changed","last_updated","old_state_id","attributes_id","context_id","context_user_id","context_parent_id","origin_idx"]
    states_df = pd.DataFrame(from_states_db, columns = columns)

    with sqlite3.connect(file) as con:
        cur = con.cursor()
        cur.execute("SELECT * FROM state_attributes")
        state_attributes = cur.fetchall()
    from_state_attributes_db = []
    for result in state_attributes:
        result = list(result)
        from_state_attributes_db.append(result)
    columns = ["attributes_id","hash","shared_attributes"]
    state_attributes_df = pd.DataFrame(from_state_attributes_db, columns = columns)

    output = pd.merge(states_df, state_attributes_df, how= "left", on = 'attributes_id')
    return output
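
The same export can be written more compactly with `pandas.read_sql_query`, which reads the column names from the database instead of hard-coding them. The following is a minimal sketch under that assumption; the function name `export_sql_compact` and the in-memory toy tables are illustrative and not part of the helper functions.

```python
import sqlite3

import pandas as pd


def export_sql_compact(con):
    """Left-join `states` with `state_attributes` using pandas' SQL reader."""
    states_df = pd.read_sql_query("SELECT * FROM states", con)
    attributes_df = pd.read_sql_query("SELECT * FROM state_attributes", con)
    return pd.merge(states_df, attributes_df, how="left", on="attributes_id")


# Tiny in-memory stand-in for the Home Assistant database (hypothetical rows).
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE states (state_id INTEGER, entity_id TEXT, state TEXT, attributes_id INTEGER)"
)
con.execute(
    "CREATE TABLE state_attributes (attributes_id INTEGER, hash INTEGER, shared_attributes TEXT)"
)
con.executemany(
    "INSERT INTO states VALUES (?, ?, ?, ?)",
    [(1, "sensor.a", "0.4", 10), (2, "sensor.b", "1.2", 11)],
)
con.execute("INSERT INTO state_attributes VALUES (10, 123, '{}')")

merged = export_sql_compact(con)
con.close()
print(merged)
```

Because the join is a left join, states without a matching `attributes_id` are kept and receive missing values in the attribute columns.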

In [3]:
prep = Preparation_Agent(export_sql(dbfile))

In [4]:
# load household data

active_appliances = [573,579,603, 605]

### **1.2 Set Parameters For Pre-processing Step**

In [5]:
truncation_params = {
    'features': 'all', 
    'factor': 1.5, 
    'verbose': 0
}

scale_params = {
    'features': 'all', 
    'kind': 'MinMax', 
    'verbose': 0
}

aggregate_params = {
    'resample_param': '60T'
}
aggregate_params24_H = {
    'resample_param': '24H'
}


activity_params = {
    'active_appliances': [573,579,603, 605],
    'threshold': .15
}

time_params = {
    'features': ['hour', 'day_name']
}

activity_lag_params = {
    'features': ['activity'],
    'lags': [24, 48, 72]
}

shiftable_devices = {573,579}

device = {
    'threshold' : .15}

activity_pipe_params = {
    'truncate': truncation_params,
    'scale': scale_params,
    'activity': activity_params,
    'aggregate_hour': aggregate_params,
    'aggregate_day': aggregate_params24_H,
    'time': time_params,
    'activity_lag': activity_lag_params,
    'shiftable_devices' : shiftable_devices,
    'device': device
}



### **1.3 Pre-process Data For Input In Device_Usage Agent**

In [6]:
# calling the preparation pipeline
prep = Preparation_Agent(export_sql(dbfile))
df = prep.pipeline_usage(prep.input, activity_pipe_params)

#display all potential variables for predicting device usage likelihood
df.head()

last_updated,activity,579_usage,573_usage,periods_since_last_activity,periods_since_last_579_usage,periods_since_last_573_usage,hour,activity_lag_1,activity_lag_2,activity_lag_3,...,573_usage_lag_1,573_usage_lag_2,573_usage_lag_3,active_last_2_days,day_name_Monday,day_name_Saturday,day_name_Sunday,day_name_Thursday,day_name_Tuesday,day_name_Wednesday
2022-11-21,1,0,1,0.0,0.0,0.0,0,0.0,0.0,0.0,...,0.0,0.0,0.0,0,1,0,0,0,0,0
2022-11-22,0,0,0,1.0,2.0,1.0,0,1.0,0.0,0.0,...,1.0,0.0,0.0,1,0,0,0,0,1,0
2022-11-23,0,0,0,2.0,3.0,2.0,0,0.0,1.0,0.0,...,0.0,1.0,0.0,1,0,0,0,0,0,1
2022-11-24,0,0,0,3.0,4.0,3.0,0,0.0,0.0,1.0,...,0.0,0.0,1.0,0,0,0,0,1,0,0
2022-11-25,0,0,0,4.0,5.0,4.0,0,0.0,0.0,0.0,...,0.0,0.0,0.0,0,0,0,0,0,0,0


## **2. Constructing the Usage Agent**

### **2.1 Initialize Agent**

First, we define the **Usage Agent class**. It takes as input the data generated by the prep.pipeline_usage function called above and the identifier of the device for which predictions should be made (e.g. the "Washing Machine" or the "Dishwasher").

In [7]:
class Usage_Agent:
    import pandas as pd

    def __init__(self, input_df, device):
        self.input = input_df
        self.device = device

Here we initialize the agent for the device "Dishwasher" (passed as device ID 573):

In [8]:
import pandas as pd
Usage_Agent_i = Usage_Agent(df, 573) 

### **2.2 Train_test_split function**

The number of data points available to make a prediction for day t increases by one each time t increases by one. Therefore, we define a custom train_test_split function that automatically puts all data available until day t-1 (inclusive) into the training set. The data for day t (the prediction day) goes into the test set.

To limit over-fitting, the function also restricts the features used to train the model. Here these are the following:

1. Indicator of device usage at day t-1.
2. Indicator of device usage at day t-2.
3. Indicator of activity in the household in the past two days.


In [9]:
# date: the day of prediction
# train_start: the first day of the training period
def train_test_split(self, df, date, train_start='2022-11-22'):
    # restrict the number of variables
    target = str(self.device) + '_usage'
    select_vars = [target, target + '_lag_1', target + '_lag_2', 'active_last_2_days']
    df = df[select_vars]
    # split train and test
    X_train = df.loc[train_start:date, df.columns != target]
    y_train = df.loc[train_start:date, df.columns == target]
    X_test  = df.loc[date, df.columns != target]
    y_test  = df.loc[date, df.columns == target]
    return X_train, y_train, X_test, y_test

# add to Usage agent
setattr(Usage_Agent, 'train_test_split', train_test_split)
del train_test_split 
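
Because label-based `.loc` slicing includes both endpoints, `df.loc[train_start:date]` also contains the row for the prediction day itself. If the training set should end on day t-1, as described above, one possible variant is sketched below. The function name `strict_train_test_split` and the toy data are hypothetical, and the sketch assumes a daily `DatetimeIndex`.

```python
import pandas as pd


def strict_train_test_split(df, target, date, train_start):
    """Train on [train_start, date - 1 day]; test on `date` only."""
    date = pd.to_datetime(date)
    train_start = pd.to_datetime(train_start)
    train_end = date - pd.Timedelta(days=1)
    features = [col for col in df.columns if col != target]
    X_train = df.loc[train_start:train_end, features]
    y_train = df.loc[train_start:train_end, target]
    X_test = df.loc[[date], features]
    y_test = df.loc[[date], target]
    return X_train, y_train, X_test, y_test


# Toy frame with the same column layout as the selected variables (hypothetical values).
idx = pd.date_range("2022-11-21", periods=5, freq="D", name="last_updated")
toy = pd.DataFrame(
    {
        "573_usage": [1, 0, 0, 1, 0],
        "573_usage_lag_1": [0, 1, 0, 0, 1],
        "573_usage_lag_2": [0, 0, 1, 0, 0],
        "active_last_2_days": [0, 1, 1, 0, 1],
    },
    index=idx,
)
X_train, y_train, X_test, y_test = strict_train_test_split(
    toy, "573_usage", "2022-11-25", "2022-11-21"
)
print(X_train.index.max(), X_test.index[0])
```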

Output:

In [10]:
X_train, y_train, X_test, y_test = Usage_Agent_i.train_test_split(df, "2022-11-30", train_start='2012-11-22')

In [11]:
X_train

last_updated,573_usage_lag_1,573_usage_lag_2,active_last_2_days
2022-11-21,0.0,0.0,0
2022-11-22,1.0,0.0,1
2022-11-23,0.0,1.0,1
2022-11-24,0.0,0.0,0
2022-11-25,0.0,0.0,0
2022-11-26,0.0,0.0,0
2022-11-27,0.0,0.0,0
2022-11-28,0.0,0.0,0
2022-11-29,0.0,0.0,0
2022-11-30,0.0,0.0,1


In [12]:
X_test

573_usage_lag_1       0.0
573_usage_lag_2       0.0
active_last_2_days    1.0
Name: 2022-11-30 00:00:00, dtype: float64

In [13]:
y_train

last_updated,573_usage
2022-11-21,1
2022-11-22,0
2022-11-23,0
2022-11-24,0
2022-11-25,0
2022-11-26,0
2022-11-27,0
2022-11-28,0
2022-11-29,0
2022-11-30,1


In [14]:
y_test

573_usage    1.0
Name: 2022-11-30 00:00:00, dtype: float64

### **2.3 Fitting Models**

Now that we have the function to perform the split-sampling we can fit the model on training data. For that purpose, we define a Logit-fitting function as follows:

In [15]:
def fit_smLogit(self, X, y):
    import statsmodels.api as sm
    return sm.Logit(y, X).fit(disp=False)

# add to Usage agent
setattr(Usage_Agent, 'fit_smLogit', fit_smLogit)
del fit_smLogit 

def fit(self, X, y, model_type):
    if model_type == 'logit':
        model = self.fit_smLogit(X, y)
    else:
        raise ValueError('Unknown model type.')
    return model

# add to Usage agent
setattr(Usage_Agent, 'fit', fit)
del fit

Using this function on the training split, we can train our first model:

In [22]:
usage = Usage_Agent(df, 573)
model = usage.fit(X_train, y_train, 'logit')
print(model.summary())

                           Logit Regression Results                           
Dep. Variable:              573_usage   No. Observations:                   10
Model:                          Logit   Df Residuals:                        7
Method:                           MLE   Df Model:                            2
Date:                Tue, 13 Dec 2022   Pseudo R-squ.:                 0.03037
Time:                        17:09:49   Log-Likelihood:                -4.8520
converged:                      False   LL-Null:                       -5.0040
Covariance Type:            nonrobust   LLR p-value:                    0.8590
                         coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------
573_usage_lag_1      -54.7482   8.78e+06  -6.24e-06      1.000   -1.72e+07    1.72e+07
573_usage_lag_2      -54.7482   8.78e+06  -6.24e-06      1.000   -1.72e+07    1.72e+07
active_last_2_days  



In [23]:
y_train

last_updated,573_usage
2022-11-21,1
2022-11-22,0
2022-11-23,0
2022-11-24,0
2022-11-25,0
2022-11-26,0
2022-11-27,0
2022-11-28,0
2022-11-29,0
2022-11-30,1
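
A fit with `converged: False`, very large coefficients and enormous standard errors typically indicates (quasi-)separation: for some feature value, the target always takes the same value, so the likelihood keeps improving as the coefficient grows without bound. The following sketch checks this on the training values shown above; `constant_outcome_features` is a hypothetical helper, not part of the agent.

```python
import pandas as pd

# Training data as displayed in the cells above.
X = pd.DataFrame(
    {
        "573_usage_lag_1": [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
        "573_usage_lag_2": [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
        "active_last_2_days": [0, 1, 1, 0, 0, 0, 0, 0, 0, 1],
    }
)
y = pd.Series([1, 0, 0, 0, 0, 0, 0, 0, 0, 1], name="573_usage")


def constant_outcome_features(X, y):
    """Return binary features for which y takes a single value whenever the feature is 1."""
    flagged = []
    for col in X.columns:
        outcomes = y[X[col] == 1]
        if len(outcomes) > 0 and outcomes.nunique() == 1:
            flagged.append(col)
    return flagged


print(constant_outcome_features(X, y))
```

Both usage lags are flagged here: each equals 1 on a single day, and the device was not used on that day, which is consistent with the diverging negative coefficients in the summary.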


Once the model is fitted to the training data, a prediction can be made for the test day. The prediction function is defined as follows:

In [24]:
def predict(self, model, X):
    import numpy as np
    from statsmodels.discrete.discrete_model import BinaryResultsWrapper
    X = np.array(X)

    # only fitted statsmodels binary-outcome models (e.g. Logit) are supported
    if isinstance(model, BinaryResultsWrapper):
        y_hat = model.predict(X)
    else:
        raise ValueError('Unknown model type.')
    return y_hat

# add to Usage agent
setattr(Usage_Agent, 'predict', predict)
del predict

In [25]:
#compute prediction at day t (see date used for split sampling)
import numpy as np
y_hat = usage.predict(model, X_test)
y_hat

array([1.])

### **2.4 Pipeline**

Finally, we wrap all the previously defined functions into the **pipeline** function. This allows us to generate a prediction by simply providing:
* the pre-processed usage data
* the prediction date
* the model type (limited to logit for now)
* the date at which the model has started to train


In [26]:
def pipeline(self, df, date, model_type, train_start):
    X_train, y_train, X_test, y_test = self.train_test_split(df, date, train_start)

    # fit model
    model = self.fit(X_train, y_train, model_type)

    # predict
    return self.predict(model, X_test)

# add to Usage agent
setattr(Usage_Agent, 'pipeline', pipeline)
del pipeline

A prediction for 2022-11-30 based on the data starting on 2022-11-21 can finally be made for the device with which we initialized the class (here: "Dishwasher"):

In [27]:
date = "2022-11-30"
train_start = '2022-11-21'
usage.pipeline(df, date, 'logit', train_start)



array([1.])

### **2.5 Model Evaluation**

Finally, we want to assess the accuracy of our model before using it in the Recommendation Agent. 

A drawback of our approach is that conventional model evaluation techniques cannot be applied directly. We retrain the model for each day to account for newly available information. Hence, each day has its own train and test sets and its own performance metric computed on them. Therefore, we created our own evaluation function.

For each day in a given prediction period, the evaluation function builds and fits a model and predicts the target. For each day and fitted model, it calculates a performance metric on the training data. We chose the Area Under the Receiver Operating Characteristic Curve (AUC) as the performance metric for our binary classification task. Because the test data for each model consists only of the day to be predicted, we calculate the test AUC over the predicted usage probabilities of all days, after all days have been predicted. To summarize the train AUC in one score, we average all calculated train AUC scores (note: this approach is the same as for the activity predictions).
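
The reason for pooling the test predictions is that the AUC compares the scores of positive and negative observations, so it cannot be computed on a single test day. The sketch below illustrates this with a plain pairwise implementation of the AUC. The function name `pairwise_auc` and the pooled values are hypothetical; the agent itself relies on `sklearn.metrics.roc_auc_score`.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUC is undefined when only one class is present.")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical one-day-ahead predictions, pooled over five prediction days.
y_true_pooled = [0, 1, 0, 1, 0]
y_hat_pooled = [0.2, 0.7, 0.4, 0.3, 0.1]
print(pairwise_auc(y_true_pooled, y_hat_pooled))  # 5 of 6 pairs ranked correctly
```

The same restriction applies to the train AUC: a training window in which the device was never used, or used every day, contains only one class.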

In [30]:
def auc(self, y_true, y_hat):
    import sklearn.metrics
    return sklearn.metrics.roc_auc_score(y_true, y_hat)

# add to Usage agent
setattr(Usage_Agent, 'auc', auc)
del auc

def evaluate(self, df, model_type, train_start, predict_start='2022-11-21', predict_end=-1):
    import pandas as pd
    import numpy as np
    dates = pd.DataFrame(df.index)
    dates = dates.set_index(df.index)['last_updated']
    predict_start = pd.to_datetime(predict_start)
    predict_end = pd.to_datetime(dates.iloc[predict_end]) if type(predict_end) == int else pd.to_datetime(predict_end)
    dates = dates.loc[predict_start:predict_end]
    y_true = []
    y_hat_train = {}
    y_hat_test = []
    auc_train_dict = {}
    auc_test = []

    for date in dates.index:
        # train test split
        X_train, y_train, X_test, y_test = self.train_test_split(df, date, train_start)

        # fit model
        model = self.fit(X_train, y_train, model_type)

        # predict
        y_hat_train.update({date: self.predict(model, X_train)})
        y_hat_test += list(self.predict(model, X_test))

        # evaluate train data
        auc_train_dict.update({date: self.auc(y_train, list(y_hat_train.values())[-1])})
        
        y_true += list(y_test)
    
    auc_test = self.auc(y_true, y_hat_test)
    auc_train = np.mean(list(auc_train_dict.values()))

    return auc_train, auc_test, auc_train_dict


# add to Usage agent
setattr(Usage_Agent, 'evaluate', evaluate)
del evaluate

Finally, we can evaluate the simple logit model for the "Dishwasher", for instance for all predictions from 2022-11-21 onwards.

In [31]:
auc_train, auc_test, auc_train_dict = usage.evaluate(df, "logit", '2022-12-01', predict_start='2022-11-21', predict_end= -1)
print("mean_auc_on_train = "+ str(auc_train) + " | test_auc = " + str(auc_test))

ValueError: zero-size array to reduction operation maximum which has no identity

The call above does not return scores: it raises a `ValueError`, most likely because `train_start` ('2022-12-01') lies after `predict_start` ('2022-11-21'), so the training set is empty. Overall, the model's performance is quite disappointing. A low accuracy is not surprising given the small amount of data available. However, there must be potential for improvement. A first step in that direction would be a proper feature selection methodology that takes different devices and households into account. Moreover, model accuracy decreased considerably after the pre-processing pipeline methodology was changed. The model therefore seems to be sensitive to the way we detect the devices' activity. As a next step, we should investigate how and why these pre-processing steps affect the model's performance.

## **Appendix A1: Complete Usage Agent Class**

In [23]:
class Usage_Agent:
    import pandas as pd

    def __init__(self, input_df, device):
        self.input = input_df
        self.device = device

    # train test split
    # -------------------------------------------------------------------------------------------
    def train_test_split(self, df, date, train_start="2013-11-01"):
        # the device may be passed as an integer ID, so convert it to a string
        target = str(self.device) + "_usage"
        select_vars = [
            target,
            target + "_lag_1",
            target + "_lag_2",
            "active_last_2_days",
        ]
        df = df[select_vars]
        X_train = df.loc[train_start:date, df.columns != target]
        y_train = df.loc[train_start:date, df.columns == target]
        X_test = df.loc[date, df.columns != target]
        y_test = df.loc[date, df.columns == target]
        return X_train, y_train, X_test, y_test
    
    # model training and evaluation
    # -------------------------------------------------------------------------------------------
    def fit_smLogit(self, X, y):
        import statsmodels.api as sm

        return sm.Logit(y, X).fit(disp=False)

    def fit(self, X, y, model_type):
        if model_type == "logit":
            model = self.fit_smLogit(X, y)
        else:
            raise ValueError("Unknown model type.")
        return model

    def predict(self, model, X):
        import numpy as np
        from statsmodels.discrete.discrete_model import BinaryResultsWrapper

        X = np.array(X)

        # only fitted statsmodels binary-outcome models (e.g. Logit) are supported
        if isinstance(model, BinaryResultsWrapper):
            y_hat = model.predict(X)
        else:
            raise ValueError("Unknown model type.")
        return y_hat

    
    def auc(self, y_true, y_hat):
        import sklearn.metrics
        return sklearn.metrics.roc_auc_score(y_true, y_hat)
    
    def evaluate(
        self, df, model_type, train_start, predict_start="2014-01-01", predict_end=-1, return_errors=False
    ):
        import pandas as pd
        import numpy as np
        from tqdm import tqdm

        dates = pd.DataFrame(df.index)
        dates = dates.set_index(df.index)["last_updated"]
        predict_start = pd.to_datetime(predict_start)
        predict_end = (
            pd.to_datetime(dates.iloc[predict_end])
            if type(predict_end) == int
            else pd.to_datetime(predict_end)
        )
        dates = dates.loc[predict_start:predict_end]
        y_true = []
        y_hat_train = {}
        y_hat_test = []
        auc_train_dict = {}
        auc_test = []

        errors = {}  # initialized once so that failures from every date are kept
        for date in tqdm(dates.index):
            try:
                X_train, y_train, X_test, y_test = self.train_test_split(
                    df, date, train_start
                )
                # fit model
                model = self.fit(X_train, y_train, model_type)
                # predict
                y_hat_train.update({date: self.predict(model, X_train)})
                y_hat_test += list(self.predict(model, X_test))
                # evaluate train data
                auc_train_dict.update(
                    {date: self.auc(y_train, list(y_hat_train.values())[-1])}
                )
                y_true += list(y_test)
            except Exception as e:
                errors[date] = e

        auc_test = self.auc(y_true, y_hat_test)
        auc_train = np.mean(list(auc_train_dict.values()))

        if return_errors:
            return auc_train, auc_test, auc_train_dict, errors
        else:
            return auc_train, auc_test, auc_train_dict
        
    # pipeline function: predicting device usage
    # -------------------------------------------------------------------------------------------        
    def pipeline(self, df, date, model_type, train_start):
        X_train, y_train, X_test, y_test = self.train_test_split(df, date, train_start)
        model = self.fit(X_train, y_train, model_type)
        return self.predict(model, X_test)