# **SARIMAX & Feature clustering**

The following notebook experiments with K-means feature clustering on weather data. This was a small test project of mine that I wanted to upload.

Outside conditions are one of the leading drivers of energy consumption. However, weather datasets often contain many features, which increases the dimensionality of predictive ML models. To reduce the number of features with minimal loss of information, I decided to experiment a bit with K-means clustering.


In [None]:
#Library importing
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import statsmodels.api as sm
from scipy import stats
import itertools

from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.metrics import mean_squared_error

import datetime
import os
import math
import gc

# **Data extraction**

The data used for this notebook is separated into different files/blocks and needs to be assembled into a data frame first. 

In [None]:
def data_extraction(path):
    """Read every CSV block in `path` and stack them into one data frame."""
    frames = []

    for file_name in os.listdir(path):
        df_temp = pd.read_csv(os.path.join(path, file_name))
        frames.append(df_temp[["LCLid", "day", "energy_sum"]])

    # DataFrame.append was removed in pandas 2.0; concatenate once instead
    return pd.concat(frames, ignore_index=True)

In [None]:
path = "../input/smart-meters-in-london/daily_dataset/daily_dataset"
df = data_extraction(path)

del path

The core metric we will try to predict is the mean energy consumption per household. It isn't given in the data, so we need to calculate it: we take the daily sum of energy consumption and divide it by the number of unique households reporting on the same day. It's important to count unique households per day, as the number varies from day to day.

In [None]:
### Energy per Household###

energy = df.groupby("day")[["energy_sum"]].sum()
count_of_house = df.groupby("day")[["LCLid"]].nunique()

df_energy = energy.merge(count_of_house, on="day").reset_index()

df_energy["energy_per_household"] = df_energy["energy_sum"] / df_energy["LCLid"]
df_energy["day"] = pd.to_datetime(df_energy["day"])

del energy, count_of_house

gc.collect()
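
To make the calculation concrete, here is a minimal sketch on a made-up toy frame (the household IDs, dates and values are hypothetical, not taken from the dataset). It only assumes the same three columns used above.

```python
import pandas as pd

# Hypothetical toy readings: two households report on day 1, one on day 2
toy = pd.DataFrame({
    "LCLid": ["A", "B", "A"],
    "day": ["2014-01-01", "2014-01-01", "2014-01-02"],
    "energy_sum": [10.0, 6.0, 9.0],
})

# Daily total and number of unique households reporting that day
daily = toy.groupby("day").agg(
    energy_sum=("energy_sum", "sum"),
    households=("LCLid", "nunique"),
)
daily["energy_per_household"] = daily["energy_sum"] / daily["households"]
print(daily)
```

Day 1 has a total of 16 over two households, while day 2 has 9 over a single household, which is why dividing by a fixed household count would distort the metric.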

Next, we import the weather and holiday datasets. Holidays might be an interesting feature to look into further, as their impact may differ depending on whether the meter is installed in a household or a business building. The data we are using originates from households, so we might see an increase in consumption on holidays.

In [None]:
#Weather and holiday data
weather_df = pd.read_csv("../input/smart-meters-in-london/weather_daily_darksky.csv")
holiday_df = pd.read_csv("../input/smart-meters-in-london/uk_bank_holidays.csv")

# **Clustering**

The first step is to prepare the weather dataset for clustering. We will do this by filtering out some features and creating a new data frame. It is also important to convert the datatype of the "time" column into datetime.

In [None]:
# .copy() avoids a SettingWithCopyWarning when the "time" column is converted below
weather_df = weather_df[["temperatureMax",
                         "windBearing",
                         "dewPoint",
                         "cloudCover",
                         "windSpeed",
                         "pressure",
                         "time",
                         "humidity"]].copy()

weather_df["time"] = pd.to_datetime(weather_df["time"])

weather_df.dropna(inplace=True)

Looking at the correlations gives a general idea of which weather features can be clustered together. This step was a bit experimental, and this is the best version I got. I also decided not to include temperature in the clustering, as the column is too important and I would rather not lose information from it.

In [None]:
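# The datetime column is excluded so that only numeric features are correlated
weather_df.drop(columns="time").corr()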

An important step at this point is to scale the data. This ensures that all columns are weighted equally in the K-means clustering step, which relies on distances between points.

In [None]:
scaler = MinMaxScaler()
weather_scaled = scaler.fit_transform(weather_df[["cloudCover","humidity","windSpeed"]]).astype("float64")
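
As a small illustration of what the scaler does, the sketch below applies `MinMaxScaler` to a made-up array (the values are hypothetical). Each column is rescaled independently to the range 0 to 1, so a feature measured in large units no longer dominates the distances.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical values: column 0 lives in [0, 1], column 1 in much larger units
toy = np.array([[0.2, 1.0],
                [0.5, 5.0],
                [0.8, 9.0]])

# Per column: (x - column_min) / (column_max - column_min)
scaled = MinMaxScaler().fit_transform(toy)
print(scaled)
```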

In [None]:
kmeans_kwargs = {"init": "k-means++",
                 "n_init": 10,
                 "max_iter": 450,
                 "random_state": 42}

def clustering(df):
    """
    Fits K-means for k = 2..14 and scores each fit with the silhouette score
    """
    sc = []

    for k in range(2, 15):
        kmeans = KMeans(n_clusters=k, **kmeans_kwargs)
        kmeans.fit(df)  # only the labels are needed, not the transformed distances
        score = silhouette_score(df, kmeans.labels_)
        sc.append(score)

    return sc

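The optimal number of clusters can be estimated by plotting the silhouette score against k. The **silhouette coefficient** measures how similar an object is to its own cluster compared with the other clusters. It ranges from -1 to 1, and higher values indicate better-separated clusters, which makes it a useful tool for choosing the number of clusters.

I looked at the "L" part of the curve to pick the number of clusters. Note that this elbow heuristic is more commonly applied to inertia plots, while for the silhouette score the usual choice is a k at or near the highest score.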

In [None]:
sc = clustering(weather_scaled)
plt.plot(range(2, 15), sc)
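
For comparison, here is a minimal sketch of silhouette-based selection on synthetic data, not the weather data. The three blob centers, the noise scale and the range of k are all assumptions chosen so that the right answer is known in advance.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated 2-D blobs (hypothetical centers)
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
points = np.vstack([c + rng.normal(scale=0.3, size=(50, 2)) for c in centers])

# Score each candidate k with the mean silhouette coefficient
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

# Pick the k with the highest score
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On data this clean the highest score should coincide with the true number of blobs. Real weather features are far less tidy, so the curve is usually flatter and the choice more of a judgment call.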

In [None]:
#Creating the KMeans model that will be used and dropping the unused weather data
kmeans = KMeans(n_clusters=5, **kmeans_kwargs)
weather_df["Clusters"]= kmeans.fit(weather_scaled).labels_

to_drop = ["windBearing","dewPoint","cloudCover","windSpeed","pressure","humidity"]

weather_df.drop(to_drop,axis=1,inplace=True)

# **Additional data preparation**

This portion contains some additional data preparation before we can go on to SARIMAX predictions.

The holiday column will be binary coded, with 1 representing a date that is a holiday and 0 representing a date without a holiday.

In [None]:
holiday_df["Bank holidays"] = pd.to_datetime(holiday_df["Bank holidays"])

In [None]:
df_energy = df_energy.merge(weather_df, left_on="day",right_on="time")
final_df = df_energy.merge(holiday_df, left_on = "day",right_on = "Bank holidays",how = 'left')
final_df["holiday_id"] = np.where(final_df['Bank holidays'].isna(),0,1)
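
The left merge is what makes the binary coding work: days without a match in the holiday table get a missing value, which is then mapped to 0. A minimal sketch with made-up dates (the rows below are hypothetical, only the column names follow the holiday file):

```python
import numpy as np
import pandas as pd

# Hypothetical days and holiday table
days = pd.DataFrame({"day": pd.to_datetime(["2013-12-24", "2013-12-25", "2013-12-26"])})
holidays = pd.DataFrame({
    "Bank holidays": pd.to_datetime(["2013-12-25", "2013-12-26"]),
    "Type": ["Christmas Day", "Boxing Day"],
})

# how="left" keeps every day; non-holidays get NaT in "Bank holidays"
merged = days.merge(holidays, left_on="day", right_on="Bank holidays", how="left")
merged["holiday_id"] = np.where(merged["Bank holidays"].isna(), 0, 1)
print(merged[["day", "holiday_id"]])
```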

In [None]:
final_df.head()

In [None]:
to_drop = ["energy_sum","LCLid","time","Bank holidays","Type"]

final_df.drop(to_drop, axis=1, inplace=True)

In [None]:
final_df.head()

Finalizing the data to be used and splitting it into train/test portions.

In [None]:
final_df.index = pd.DatetimeIndex(final_df["day"]).to_period("D")

model_data = final_df[["energy_per_household","temperatureMax","Clusters","holiday_id"]]


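# The last 30 days are held out for testing
train = model_data.iloc[0:len(model_data)-30]
# Copy so that prediction columns can be added later without a SettingWithCopyWarning
test = model_data.iloc[len(train):len(model_data)].copy()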

del model_data

In [None]:
train.head()

In [None]:
###SARIMAX###

#Constructs all possible parameter combinations.
p = d = q = range(0,2)
pdq = list(itertools.product(p,d,q))

seasonal_pdq = [(x[0],x[1],x[2],12) for x in list(itertools.product(p,d,q))]
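
To get a feel for the size of this grid, the short sketch below rebuilds it and counts the combinations. It mirrors the cell above, and the seasonal period of 12 is copied from it as-is.

```python
import itertools

# Each of p, d, q can be 0 or 1
p = d = q = range(0, 2)
pdq = list(itertools.product(p, d, q))

# Same grid for the seasonal part, with the seasonal period fixed at 12
seasonal_pdq = [(P, D, Q, 12) for P, D, Q in itertools.product(p, d, q)]

# Every order is paired with every seasonal order
n_fits = len(pdq) * len(seasonal_pdq)
print(len(pdq), len(seasonal_pdq), n_fits)
```

The number of model fits grows multiplicatively with the range of each parameter, so widening the ranges quickly makes a brute-force search expensive.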

# **SARIMAX**

We will use SARIMAX to predict the mean consumption per household. However, before doing that we need to find the optimal pdq combination for the model. I will use a brute-force method for this, as the dataset isn't that large.

In [None]:
def sarimax_function(endog,exog,pdq,s_pdq):

    """
    The function uses a brute force approach to apply all possible pdq combinations and evaluate the model
    """

    result_list = []
    for param in pdq:
        for s_param in s_pdq:

            model = sm.tsa.statespace.SARIMAX(endog=endog, exog=exog, order=param,
                                              seasonal_order=s_param,
                                              enforce_invertibility=False,
                                              enforce_stationarity=True)

            results = model.fit()
            result_list.append([param, s_param, results.aic])
            #print("ARIMA Parameters: {} x: {}. AIC: {}".format(param,s_param,results.aic))

    # Note: `results` is the last model fitted, not necessarily the one with the lowest AIC
    return result_list, results

When using a SARIMAX predictor, we need to define the endog and exog variables to run the model. In simple terms:

* The endog variable is the target (response) variable of the model.
* The exog variables are the independent variables used to explain the endog variable.

In [None]:
endog = train["energy_per_household"]
exog = train[["Clusters","holiday_id","temperatureMax"]]

In [None]:
result_list,results = sarimax_function(endog,exog,pdq,seasonal_pdq)

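The results of the test indicate the optimal pdq combination based on AIC (Akaike Information Criterion), which can be written as $\mathrm{AIC} = \ln(s_m^2) + 2m/T$, where $s_m^2$ is the estimated residual variance of a model with $m$ parameters and $T$ is the number of observations. Lower values are better. As a model selection tool, AIC has some limitations: it only provides a relative evaluation, so it can rank models fitted to the same data but says nothing about how good the best one is in absolute terms.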

In [None]:
results_dataframe = pd.DataFrame(result_list, columns=["pdq","s_pdq","aic"]).sort_values(by="aic")
results_dataframe.head()
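
To illustrate how the variance form of AIC trades fit against complexity, here is a minimal sketch on synthetic data. The helper `aic_normalized` is a hypothetical name of mine, and the polynomial fits only stand in for competing models. Note that statsmodels reports `results.aic` on the log-likelihood scale, so its values will not match this formula numerically.

```python
import numpy as np

def aic_normalized(residuals, n_params):
    """Hypothetical helper: AIC = ln(s_m^2) + 2m/T for a model's residuals."""
    T = len(residuals)
    s2 = np.sum(residuals ** 2) / T  # residual variance estimate
    return np.log(s2) + 2 * n_params / T

# Synthetic data that is truly linear, plus a little noise
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 200)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.size)

# Compare a constant model (degree 0) with a linear model (degree 1)
aic = {}
for degree in (0, 1):
    coefs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coefs, x)
    aic[degree] = aic_normalized(resid, degree + 1)

print(aic)
```

The linear model pays a slightly larger penalty term but reduces the residual variance by far more, so it ends up with the lower AIC.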

# **Prediction**

We first need to generate a model based on the information we gathered in this notebook and "train" it on the training portion of the data.

In [None]:
model = sm.tsa.statespace.SARIMAX(endog=endog,exog=exog, order=(1, 1, 1), seasonal_order=(1, 0, 1, 12),
            enforce_invertibility=False,enforce_stationarity=True).fit()

print(model.summary().tables[1])

Defining the test exog variables

In [None]:
exog = test[["Clusters","holiday_id","temperatureMax"]]

Predicting and storing the data in a data frame for comparison.

In [None]:
predict = model.predict(start = len(train),end = len(train)+len(test)-1,
                            exog = test[["Clusters","holiday_id","temperatureMax"]])

test["prediction"] = predict.values

We will evaluate the prediction using MAE (mean absolute error) and MSE (mean squared error) to get a general idea of how good the model is.

In [None]:
test["diff"] = test["energy_per_household"] - test["prediction"]
results = mean_squared_error(test["energy_per_household"],test["prediction"])
print(results)

In [None]:
# MAE is the mean of the absolute errors; signed errors would cancel each other out
MAE = test["diff"].abs().mean()
print(MAE)
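
As a quick worked example of the difference between the signed mean error, MAE and MSE, the sketch below uses made-up numbers (not model output). Positive and negative errors cancel in the signed mean, which is why the absolute value matters for MAE.

```python
import numpy as np

# Hypothetical actual and predicted values
actual = np.array([10.0, 10.0, 10.0, 10.0])
predicted = np.array([12.0, 8.0, 11.0, 9.0])

diff = actual - predicted          # [-2, 2, -1, 1]

mean_error = diff.mean()           # signed errors cancel out
mae = np.abs(diff).mean()          # mean absolute error
mse = (diff ** 2).mean()           # mean squared error, penalizes large misses more

print(mean_error, mae, mse)
```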


The results are generally pretty good. However, I noticed that one day is an outlier, so we will also take a look at the metrics without it.

In [None]:
copy_test = test.copy()

In [None]:
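# Sort by absolute error and keep the result, so the outlier ends up in the last row
copy_test = copy_test.sort_values(by="diff", key=abs)
copy_test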

In [None]:
### Results without the outlier ###

results = mean_squared_error(copy_test.iloc[:-1,:]["energy_per_household"],copy_test.iloc[:-1,:]["prediction"])
print(results)

In [None]:
MAE = copy_test.iloc[:-1,:]["diff"].abs().mean()
print(MAE)

The same metrics without the outlier look a lot better!

# **Conclusion**

This was a small project on clustering, and I can see myself using this in specific situations. The speed we gain by doing this might not outweigh the small decrease in accuracy when predicting consumption, but it's an interesting alternative for budgeting larger-scale portfolios.