# Introduction to Machine Learning for the Built Environment - Unsupervised Learning using Clustering and Supervised Prediction using Regression

- Created by Clayton Miller - clayton@nus.edu.sg - miller.clayton@gmail.com

This notebook is an introduction to the machine learning concepts of clustering and prediction using regression. We will use the Building Data Genome Project data set to analyze electrical meter data from non-residential buildings.

## The Scikit Learn Machine Learning Library

In this series of videos, we will learn a new library called Scikit-Learn that includes various Machine Learning Models:

### https://scikit-learn.org/stable/

![alt text](https://raw.githubusercontent.com/buds-lab/the-building-data-genome-project/master/docs/edx-graphics/EDX-ML-ScikitLearn-2.png)



## Scikit-Learn Cheat Sheet

A handy flow chart, published as open source by the scikit-learn community, is available at: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

![alt text](https://raw.githubusercontent.com/buds-lab/the-building-data-genome-project/master/docs/edx-graphics/EDX-ML-ScikitLearn-1.png)


## Using the Building Data Genome Project Data Set for Clustering and Regression Prediction

Let's use the electrical meter data to create clusters of typical load profiles for analysis. First we can load our conventional packages:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib

Next let's load all the packages we will need for the clustering and regression analysis

In [None]:
import sklearn
from sklearn import metrics
from sklearn.neighbors import KNeighborsRegressor

from scipy.cluster.vq import kmeans, vq, whiten
from scipy.spatial.distance import cdist
import numpy as np
from datetime import datetime

# Using Unsupervised Learning to Cluster Daily Load Profiles

The first thing we will use the library for is to analyze the daily load profiles from an electrical meter

Let's load an example meter data file and do a clustering analysis of the data. We will be following the tutorial found on the [Data-Driven Building](https://cargocollective.com/buildingdata/DayFilter-Unsupervised-Pattern-Filtering) blog.



## Load meter data from the electricity data set as an example

In [None]:
elec_all_data = pd.read_csv("../input/buildingdatagenomeproject2/electricity_cleaned.csv", index_col='timestamp', parse_dates=True)

In [None]:
elec_all_data.head()

In [None]:
buildingname = 'Panther_office_Hannah'
office_example = pd.DataFrame(elec_all_data[buildingname].truncate(before='2017-01-01'))

In [None]:
office_example.plot(alpha=0.5, figsize=(15, 5))
plt.title("Electricity Consumption")
plt.xlabel("Time Range")
plt.ylabel("kWh Electricity Consumption Visualization");

Let's zoom in on a smaller time range to see more detailed patterns:

In [None]:
office_example.truncate(before='2017-02-01', after='2017-02-14').plot(figsize=(15,5))
plt.title("Electricity Consumption")
plt.xlabel("Time Range")
plt.ylabel("kWh Electricity Consumption Visualization");

## Conventional Daily Profile Analysis - Weekday vs. Weekend

It appears that there is some standard weekday vs. weekend behaviour and a few basic types of daily patterns.

Let's first do it the conventional way by looking at all the daily profiles. We'll pivot to get a DataFrame that can be plotted the way we need.

In [None]:
office_example['Date'] = office_example.index.map(lambda t: t.date())
office_example['Time'] = office_example.index.map(lambda t: t.time())

In [None]:
office_example.head()

In [None]:
office_example_pivot = pd.pivot_table(office_example, values=buildingname, index='Date', columns='Time')

In [None]:
office_example_pivot.head()
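
The pivot step above can be sketched on a tiny synthetic series. The data here are hypothetical (two days of made-up hourly readings, not Building Data Genome values); the point is only to show that each row of the pivoted frame becomes one day and each column one time of day.

```python
import numpy as np
import pandas as pd

# Two days of synthetic hourly readings (hypothetical values, not BDG data)
index = pd.date_range("2017-01-01", periods=48, freq=pd.Timedelta(hours=1))
meter = pd.DataFrame({"kWh": np.arange(48, dtype=float)}, index=index)

# Split the timestamp into a calendar date and a time of day
meter["Date"] = meter.index.date
meter["Time"] = meter.index.time

# One row per day, one column per time of day
profiles = pd.pivot_table(meter, values="kWh", index="Date", columns="Time")
print(profiles.shape)
```

Transposing this frame (`profiles.T`) puts time of day on the index, which is why the plotting cells call `.T.plot(...)` to draw one line per day.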

In [None]:
office_example_pivot.T.plot(legend=False, figsize=(15,5), color='k', alpha=0.1, xticks=np.arange(0, 86400, 10800))
plt.title("Electrical Meter Data - Daily Profiles")
plt.xlabel("Daily Time Frame")
plt.ylabel("kWh Electricity");

Looks like we have quite a few pretty common patterns and a few outlier patterns where we have some consumption in the early morning and late night hours.

How can we divide this dataset up according to conventional wisdom? The first obvious choice is to divide between weekdays and weekends.

Let's look at weekdays first:

In [None]:
office_example['Weekday'] = office_example.index.map(lambda t: t.date().weekday())

In [None]:
office_example.head()

In [None]:
office_example_pivot_weekday = pd.pivot_table(office_example[(office_example.Weekday < 5)], values=buildingname, index='Date', columns='Time')
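
As a small illustration of the weekday filter: pandas numbers the days of the week from 0 (Monday) to 6 (Sunday), so values below 5 are weekdays and values of 5 or 6 are the weekend. The dates below are a hypothetical two-week range chosen only for this sketch.

```python
import pandas as pd

# Fourteen consecutive days starting on a Monday (2017-01-02)
days = pd.date_range("2017-01-02", periods=14, freq=pd.Timedelta(days=1))

is_weekday = days.dayofweek < 5   # Monday=0 ... Friday=4
weekdays = days[is_weekday]
weekends = days[~is_weekday]      # Saturday=5, Sunday=6

print(len(weekdays), len(weekends))
```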

In [None]:
office_example_pivot_weekday.T.plot(legend=False, figsize=(15,5), color='k', alpha=0.1, xticks=np.arange(0, 86400, 10800))
plt.title("Electrical Meter Data - Weekday Daily Profiles")
plt.xlabel("Daily Time Frame")
plt.ylabel("kWh Electricity");

It can be noticed that there are still quite a few anomalous-looking daily profiles that are not characterized only by the day of the week -- this can be due to holidays, weird schedules, or actually deviant behaviour.



## Manual identification of clusters

There also seems to be varying levels of consumption throughout the course of a year. This is likely because of weather effects or schedule changes. 

These could be considered "clusters" of behavior.

Let's try weekend:

In [None]:
office_example_pivot_weekend = pd.pivot_table(office_example[(office_example.Weekday >= 5)], values=buildingname, index='Date', columns='Time')
office_example_pivot_weekend.T.plot(legend=False, figsize=(15,5), color='k', alpha=0.1, xticks=np.arange(0, 86400, 10800))
plt.title("Electrical Meter Data - Weekend Daily Profiles")
plt.xlabel("Daily Time Frame")
plt.ylabel("kWh Electricity");

Weekends have a lower baseline level of consumption, with only small amounts of additional consumption during daytime hours.

## k-Means Clustering of Daily Load Profiles

Let's reload the dataframe to start over so we can do the k-means process

In [None]:
buildingname = 'Panther_office_Hannah'
office_example = pd.DataFrame(elec_all_data[buildingname].truncate(before='2017-01-01'))

In [None]:
office_example.head()

In [None]:
office_example_norm = (office_example - office_example.mean()) / (office_example.max() - office_example.min()) 

office_example['Time'] = office_example.index.map(lambda t: t.time())
office_example['Date'] = office_example.index.map(lambda t: t.date())
office_example_norm['Time'] = office_example_norm.index.map(lambda t: t.time())
office_example_norm['Date'] = office_example_norm.index.map(lambda t: t.date())
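
The normalization used above subtracts the mean and divides by the range (maximum minus minimum), so the result is centered on zero and spans a width of exactly 1. A minimal sketch with hypothetical readings:

```python
import pandas as pd

# Hypothetical kWh readings, used only to illustrate the formula
load = pd.Series([10.0, 20.0, 30.0, 40.0])

# Mean normalization: center on zero, scale by the range
load_norm = (load - load.mean()) / (load.max() - load.min())
print(load_norm.tolist())
```

Scaling this way lets the clustering focus on the shape of each daily profile relative to the building's typical range rather than on absolute kWh values.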

In [None]:
office_example.head()

In [None]:
dailyblocks = pd.pivot_table(office_example, values=buildingname, index='Date', columns='Time', aggfunc='mean')
dailyblocks_norm = pd.pivot_table(office_example_norm, values=buildingname, index='Date', columns='Time', aggfunc='mean')

In [None]:
dailyblocks_norm.head()

## The Clustering Model

Unsupervised models do not need labeled training data, but we do need to indicate how many clusters we would like the model to extract. In this case we will use 4.

In [None]:
dailyblocksmatrix_norm = dailyblocks_norm.dropna().to_numpy()
centers, _ = kmeans(dailyblocksmatrix_norm, 4, iter=10000)
cluster, _ = vq(dailyblocksmatrix_norm, centers)
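
To see what `kmeans` and `vq` return, here is a minimal sketch on synthetic profiles. The two groups of "days" are made up for illustration (low days near 1.0, high days near 5.0), and two clusters are requested instead of the four used in the notebook.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

rng = np.random.default_rng(0)
# Two synthetic groups of 24-hour profiles (hypothetical data)
low_days = 1.0 + 0.1 * rng.standard_normal((20, 24))
high_days = 5.0 + 0.1 * rng.standard_normal((20, 24))
profiles = np.vstack([low_days, high_days])

# kmeans returns the cluster centers and the mean distance to the nearest center
centers, distortion = kmeans(profiles, 2)
# vq assigns each profile to its nearest center
labels, distances = vq(profiles, centers)

print(centers.shape, labels[:3], labels[-3:])
```

The label numbers themselves are arbitrary: which group is called 0 and which is called 1 can change from run to run, which is why the notebook reorders the cluster numbers afterwards.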

In [None]:
clusterdf = pd.DataFrame(cluster, columns=['ClusterNo'])

In [None]:
dailyclusters = pd.concat([dailyblocks.dropna().reset_index(), clusterdf], axis=1) 

In [None]:
dailyclusters.head()

Notice the last column is the cluster number assigned by the k-means process. We'll first reorder the clustering numbers so that the greatest consuming clusters have the highest numbers:

In [None]:
x = dailyclusters.drop(columns='Date').groupby('ClusterNo').mean().sum(axis=1).sort_values()
x = pd.DataFrame(x.reset_index())
x['ClusterNo2'] = x.index
x = x.set_index('ClusterNo')
x = x.drop([0], axis=1)
dailyclusters = dailyclusters.merge(x, how='outer', left_on='ClusterNo', right_index=True)
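
The reordering idea can also be sketched in a more compact form. This is an alternative illustration, not the notebook's exact method: the `daily_total` column and its values are hypothetical, and a dictionary lookup stands in for the merge used above.

```python
import pandas as pd

# Hypothetical daily totals (kWh) with the arbitrary labels k-means assigned
days = pd.DataFrame({
    "ClusterNo": [0, 2, 1, 0, 2, 1],
    "daily_total": [900.0, 300.0, 600.0, 950.0, 280.0, 620.0],
})

# Mean consumption per original label, sorted from lowest to highest
cluster_means = days.groupby("ClusterNo")["daily_total"].mean().sort_values()

# Map each original label to its rank in that ordering
relabel = {old: new for new, old in enumerate(cluster_means.index)}
days["ClusterNo2"] = days["ClusterNo"].map(relabel)
print(days)
```

After the mapping, cluster 0 is always the lowest-consuming group and the highest number is the highest-consuming group, which keeps plots comparable between runs.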

In [None]:
dailyclusters = dailyclusters.drop(['ClusterNo'],axis=1)
dailyclusters = dailyclusters.set_index(['ClusterNo2','Date']).T.sort_index()

In [None]:
dailyclusters.head()

Now we have a dataframe with each of the clusters hierarchically divided -- let's visualize the clusters. First, let's look at all the profiles at once divided according to cluster:

In [None]:
clusterlist = list(dailyclusters.columns.get_level_values(0).unique())
matplotlib.rcParams['figure.figsize'] = 20, 7

styles2 = ['LightSkyBlue', 'b','LightGreen', 'g','LightCoral','r','SandyBrown','Orange','Plum','Purple','Gold','b']
fig, ax = plt.subplots()
for col, style in zip(clusterlist, styles2):
    dailyclusters[col].plot(ax=ax, legend=False, style=style, alpha=0.1, xticks=np.arange(0, 86400, 10800))

ax.set_ylabel('Total Daily Profile')
ax.set_xlabel('Time of Day');

## Aggregate visualizations of the clusters

Now, let's aggregate and visualize the clusters as they exist across the time range:



In [None]:
def timestampcombine(date,time):
    pydatetime = datetime.combine(date, time)
    return pydatetime

In [None]:
def ClusterUnstacker(df):
    df = df.unstack().reset_index()
    df['timestampstring'] = pd.to_datetime(df.Date.astype("str") + " " + df.level_2.astype("str"))
    #pd.to_datetime(df.Date  df.level_2) #map(timestampcombine, )
    df = df.dropna()
    return df

In [None]:
dailyclusters.unstack().reset_index().head()

In [None]:
dfclusterunstacked = ClusterUnstacker(dailyclusters)
dfclusterunstackedpivoted = pd.pivot_table(dfclusterunstacked, values=0, index='timestampstring', columns='ClusterNo2')

In [None]:
clusteravgplot = dfclusterunstackedpivoted.resample('D').sum().replace(0, np.nan).plot(style="^",markersize=15)
clusteravgplot.set_ylabel('Daily Totals kWh')
clusteravgplot.set_xlabel('Date');

In [None]:
dfclusterunstackedpivoted['Time'] = dfclusterunstackedpivoted.index.map(lambda t: t.time())
dailyprofile = dfclusterunstackedpivoted.groupby('Time').mean().plot(figsize=(20,7),linewidth=3, xticks=np.arange(0, 86400, 10800))
dailyprofile.set_ylabel('Average Daily Profile kWh')
dailyprofile.set_xlabel('Time of Day')
dailyprofile.legend(loc='center left', bbox_to_anchor=(1, 0.5), title='Cluster')

In [None]:
def DayvsClusterMaker(df):
    df.index = df.timestampstring
    df['Weekday'] = df.index.map(lambda t: t.date().weekday())
    df['Date'] = df.index.map(lambda t: t.date())
    df['Time'] = df.index.map(lambda t: t.time())
    DayVsCluster = df[['ClusterNo2', 0, 'Weekday']].resample('D').mean().reset_index(drop=True)
    DayVsCluster = pd.pivot_table(DayVsCluster, values=0, index='ClusterNo2', columns='Weekday', aggfunc='count')
    DayVsCluster.columns = ['Mon','Tue','Wed','Thur','Fri','Sat','Sun']
    return DayVsCluster.T

In [None]:
DayVsCluster = DayvsClusterMaker(dfclusterunstacked)
DayVsClusterplot1 = DayVsCluster.plot(figsize=(20,7),kind='bar',stacked=True)
DayVsClusterplot1.set_ylabel('Number of Days in Each Cluster')
DayVsClusterplot1.set_xlabel('Day of the Week')
DayVsClusterplot1.legend(loc='center left', bbox_to_anchor=(1, 0.5), title='Cluster')

In [None]:
DayVsClusterplot2 = DayVsCluster.T.plot(figsize=(20,7),kind='bar',stacked=True, color=['b','g','r','c','m','y','k']) #, color=colors2
DayVsClusterplot2.set_ylabel('Number of Days in Each Cluster')
DayVsClusterplot2.set_xlabel('Cluster Number')
DayVsClusterplot2.legend(loc='center left', bbox_to_anchor=(1, 0.5))
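
The day-of-week versus cluster count table can also be produced with `pd.crosstab`. The sketch below is an alternative under stated assumptions: the cluster labels are hypothetical (cluster 1 on weekdays, cluster 0 on weekends) and there is already one row per day.

```python
import pandas as pd

dates = pd.date_range("2017-01-02", periods=14, freq=pd.Timedelta(days=1))
# Hypothetical cluster labels: cluster 1 on weekdays, cluster 0 on weekends
labels = [1 if d.dayofweek < 5 else 0 for d in dates]
daily = pd.DataFrame({"Weekday": dates.dayofweek, "ClusterNo2": labels})

# Rows are days of the week (0=Monday), columns are clusters, cells are day counts
counts = pd.crosstab(daily["Weekday"], daily["ClusterNo2"])
print(counts)
```

Calling `counts.plot(kind='bar', stacked=True)` on such a table gives the same kind of stacked bar chart as the cells above.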

# Electricity Prediction using Regression for Measurement and Verification

Prediction is a common machine learning (ML) technique used on building energy consumption data. This process is valuable for anomaly detection, load profile-based building control, and measurement and verification (M&V) procedures.

The graphic below comes from the International Performance Measurement and Verification Protocol (IPMVP) and shows how prediction can be used for M&V to calculate how much energy **would have** been consumed if an energy savings intervention had not been implemented.



## Prediction for Measurement and Verification

![alt text](https://raw.githubusercontent.com/buds-lab/the-building-data-genome-project/master/docs/edx-graphics/EDX-ML-ScikitLearn-3.png)

There is an open publication that gives more information on how prediction in this realm can be approached: https://www.mdpi.com/2504-4990/1/3/56

There is an entire Kaggle Machine Learning competition also focused on this application: https://www.kaggle.com/c/ashrae-energy-prediction
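
The M&V idea described above reduces to simple arithmetic: the savings are the difference between what the model predicts the building would have used and what it actually used. The numbers below are hypothetical and only illustrate the calculation.

```python
import numpy as np

# Hypothetical daily energy use (kWh) measured after an intervention
actual_post = np.array([80.0, 85.0, 78.0, 82.0])
# Hypothetical model output: predicted use if nothing had been changed
predicted_baseline = np.array([100.0, 104.0, 98.0, 101.0])

# Avoided energy is the gap between the predicted baseline and the actual use
avoided_energy = predicted_baseline - actual_post
total_savings = avoided_energy.sum()
percent_savings = 100 * total_savings / predicted_baseline.sum()
print(total_savings, round(percent_savings, 1))
```

The quality of the savings estimate depends directly on how accurate the baseline prediction is, which is why the rest of this section focuses on building and evaluating a prediction model.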



## Load electricity data and weather data

First we can load the data from the BDG in the same way as in our previous weather influence analysis notebook from the Construction Phase videos.

In [None]:
elec_all_data.info()

In [None]:
buildingname = 'Panther_office_Hannah'

In [None]:
office_example_prediction_data = pd.DataFrame(elec_all_data[buildingname].truncate(before='2017-01-01')).ffill()

In [None]:
office_example_prediction_data.info()

In [None]:
office_example_prediction_data.plot()

In [None]:
weather_data = pd.read_csv("../input/buildingdatagenomeproject2/weather.csv", index_col='timestamp', parse_dates=True)

In [None]:
weather_data_site = weather_data[weather_data.site_id == 'Panther'].truncate(before='2017-01-01')

In [None]:
weather_data_site.info()

In [None]:
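# Drop the text column so that only numeric weather variables are averaged
weather_hourly = weather_data_site.drop(columns='site_id').resample("H").mean()
# Values at or below -40 are treated as sensor errors and become NaN
weather_hourly_nooutlier = weather_hourly[weather_hourly > -40]
# Fill the resulting gaps with the last valid reading
weather_hourly_nooutlier_nogaps = weather_hourly_nooutlier.ffill()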
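
The outlier and gap handling can be seen on a short synthetic series. The readings are hypothetical, with `-99.0` standing in for a sensor error code; `Series.where` is used here and behaves like the boolean-mask indexing in the cell above.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2017-01-01", periods=6, freq=pd.Timedelta(hours=1))
# Hypothetical readings; -99.0 stands in for a sensor error code
temps = pd.Series([5.0, 5.5, -99.0, np.nan, 6.5, 7.0], index=index, name="airTemperature")

# Keep plausible values, turn the rest into NaN, then carry the last valid reading forward
temps_clean = temps.where(temps > -40).ffill()
print(temps_clean.tolist())
```

Forward filling is a simple choice that works for short gaps; long gaps would repeat a stale value, so it is worth checking how many consecutive values were filled.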

In [None]:
temperature = weather_hourly_nooutlier_nogaps["airTemperature"]

In [None]:
temperature.plot()

## Create Train and Test Datasets for Supervised Learning

With **supervised learning**, the model is given a set of data that will be used to **train** the model to predict a specific objective. In this case, we will use a few simple time series features as well as outdoor air temperature to predict how much energy a building uses.

For this demonstration, we will use three months of data from April, May, and June to predict July.

In [None]:
training_months = [4,5,6]
test_months = [7]

We can divide the data set by using the `DatetimeIndex` of the data frame and the `.isin` method to extract the months for the model:

In [None]:
trainingdata = office_example_prediction_data[office_example_prediction_data.index.month.isin(training_months)]
testdata = office_example_prediction_data[office_example_prediction_data.index.month.isin(test_months)]
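
A minimal sketch of the month-based split on a synthetic daily series (hypothetical data, one value per day of 2017) shows how many rows land in each set:

```python
import numpy as np
import pandas as pd

index = pd.date_range("2017-01-01", "2017-12-31", freq=pd.Timedelta(days=1))
data = pd.DataFrame({"kWh": np.ones(len(index))}, index=index)

# Boolean masks built from the month number of each timestamp
train = data[data.index.month.isin([4, 5, 6])]
test = data[data.index.month.isin([7])]
print(len(train), len(test))
```

Because the test month comes after all training months, the model is evaluated on a period it has never seen, which mirrors how a baseline model is used in practice.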

In [None]:
trainingdata.info()

In [None]:
testdata.info()

We can extract the training input data **features** that will go into the model and the training **label** data, which is what we are aiming to predict.

## Encoding Categorical Variables 

We use the pandas `get_dummies()` function to change the temporal variables of *time of day* and *day of week* into categories that the model can use more effectively. This process is known as [encoding](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/).

In [None]:
train_features = pd.concat([pd.get_dummies(trainingdata.index.hour),
                            pd.get_dummies(trainingdata.index.dayofweek),
                            pd.DataFrame(temperature[temperature.index.month.isin(training_months)].values)], axis=1).dropna()


In [None]:
train_features.head()
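
To make the encoding step concrete, the sketch below one-hot encodes the hour of day for two synthetic days. The `prefix` and `dtype` arguments are optional extras added here for readability; the notebook's cell uses the defaults.

```python
import pandas as pd

# Two days of hourly timestamps starting at midnight (hypothetical range)
index = pd.date_range("2017-04-03", periods=48, freq=pd.Timedelta(hours=1))

# One column per hour of day; each row has a single 1 marking its hour
hour_dummies = pd.get_dummies(index.hour, prefix="hour", dtype=float)
print(hour_dummies.shape)
print(hour_dummies.iloc[5].idxmax())
```

Encoding the hour as 24 separate columns avoids telling the model that hour 23 is "far" from hour 0, which a single numeric hour column would imply.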

## Train a K-Neighbor Regressor Model

This model was chosen after following the process in the cheat sheet until a model that worked and provided good results was found.

In [None]:
model = KNeighborsRegressor().fit(np.array(train_features), np.array(trainingdata.values));
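
How a k-nearest-neighbors regressor arrives at a prediction can be seen on a toy problem. The data are hypothetical (load rising linearly with temperature), and `n_neighbors=3` is chosen for the sketch; the cell above relies on the scikit-learn default of 5.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical training data: load rises linearly with outdoor temperature
temperature = np.arange(10.0, 30.0).reshape(-1, 1)  # 20 samples, 1 feature
load = 2.0 * temperature.ravel() + 50.0

model = KNeighborsRegressor(n_neighbors=3).fit(temperature, load)

# The prediction is the average load of the 3 closest training temperatures
prediction = model.predict(np.array([[20.0]]))[0]
print(prediction)
```

Since the prediction is an average of nearby training samples, the model cannot extrapolate beyond the range of loads it saw during training, which matters if the test period is much hotter or colder than the training months.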


In [None]:
test_features = np.array(pd.concat([pd.get_dummies(testdata.index.hour),
                                    pd.get_dummies(testdata.index.dayofweek),
                                    pd.DataFrame(temperature[temperature.index.month.isin(test_months)].values)], axis=1).dropna())



## Use the Model to predict for the *Test* period

Next, the model is given the `test_features` for the period we want to predict. We can then merge the predictions with the actual values and see how the model did.

In [None]:
predictions = model.predict(test_features)

In [None]:
predicted_vs_actual = pd.concat([testdata, pd.DataFrame(predictions, index=testdata.index)], axis=1)

In [None]:
predicted_vs_actual.columns = ["Actual", "Predicted"]

In [None]:
predicted_vs_actual.head()

In [None]:
predicted_vs_actual.plot()

In [None]:
trainingdata.columns = ["Actual"]

In [None]:
predicted_vs_actual_plus_training = pd.concat([trainingdata, predicted_vs_actual], sort=True)

In [None]:
predicted_vs_actual_plus_training.plot()

## Regression evaluation metrics

To understand quantitatively how the model performed, we can use evaluation metrics that compare its predictions to reality.

In this situation, let's use the error metric [Mean Absolute Percentage Error (MAPE)](https://en.wikipedia.org/wiki/Mean_absolute_percentage_error) 

In [None]:
# Calculate the absolute errors
errors = abs(predicted_vs_actual['Predicted'] - predicted_vs_actual['Actual'])
# Calculate the mean absolute percentage error (MAPE), in percent
MAPE = 100 * np.mean((errors / predicted_vs_actual['Actual']))

In [None]:
MAPE
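
The MAPE calculation can be wrapped in a small helper for reuse. The function name `mape` and the example values are hypothetical and serve only to show the arithmetic.

```python
import numpy as np

def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    # Each absolute error is expressed relative to the actual value
    return 100 * np.mean(np.abs(predicted - actual) / np.abs(actual))

# Errors of 10%, 10%, and 0% average to about 6.67%
example = mape([100.0, 200.0, 400.0], [110.0, 180.0, 400.0])
print(round(example, 2))
```

MAPE is undefined when any actual value is zero, so it is worth checking the meter data for zero readings before relying on it. scikit-learn also provides `sklearn.metrics.mean_absolute_percentage_error`, which returns a fraction rather than a percentage.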