This notebook applies k-means clustering to hourly electricity consumption for January 2016. The data is available on Kaggle under the ASHRAE Energy Prediction competition. I am interested in finding usage patterns, and possibly different types of consumers, based on hourly consumption.

In [None]:
# import packages.

%matplotlib inline
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import silhouette_score


In [None]:
## data path
train_path = "../input/ashrae-energy-prediction/train.csv"
df = pd.read_csv(train_path)
df.head()

In [None]:
## usage is in kBTU: convert meter_reading to kWh (1 kBTU is about 0.2931 kWh)
df.loc[:,'meter_reading'] *= 0.2931 
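
The 0.2931 factor can be sanity-checked from the unit definitions. This is a minimal sketch, assuming the International Table BTU (1055.05585262 J) and 1 kWh = 3.6 MJ; the constant and function names are illustrative and not part of the notebook.

```python
# Assumption: International Table BTU; other BTU definitions differ slightly.
JOULES_PER_BTU = 1055.05585262
JOULES_PER_KWH = 3.6e6

# 1 kBTU = 1000 BTU, expressed in kWh.
KBTU_TO_KWH = 1000 * JOULES_PER_BTU / JOULES_PER_KWH


def kbtu_to_kwh(kbtu):
    """Convert an energy value from kBTU to kWh."""
    return kbtu * KBTU_TO_KWH


print(round(KBTU_TO_KWH, 4))
```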

Data description
- building_id: identifier of the building the reading belongs to.
- meter: type of meter: 0 for electricity, 1 for chilled water, 2 for steam, 3 for hot water.
- timestamp: time at which the reading was registered.
- meter_reading: usage per hour.

For the purpose of this analysis, I will only look at one month of electricity usage (January 2016).

In [None]:
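## sample data: keep electricity readings (meter == 0) for January 2016
## timestamps are ISO-formatted strings, so comparing them as strings orders them correctly;
## the half-open upper bound keeps every hour of 31 January
df = df[(df['meter'] == 0) & (df['timestamp'] < "2016-02-01 00:00:00")]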
#df.head

In [None]:
## feature engineering
## extract hour, day of year and day name from the timestamp column
ts = pd.to_datetime(df['timestamp'])  # parse once and reuse
df = df.assign(hour=ts.dt.hour,
               day=ts.dt.dayofyear,
               day_name=ts.dt.day_name()
              ).drop(columns='meter')  # the positional axis argument was removed in pandas 2.0

#'day': pd.to_datetime(df['timestamp']).dt.day,

In [None]:
df.describe()

In [None]:
df[(df['meter_reading'] >=8626)]

In [None]:
## drop duplicates if there is any
df = df.drop_duplicates()

In [None]:
sns.boxplot(x=df['meter_reading'])

The boxplot displays values that could be considered outliers. However, they might be genuine usage from the same buildings. Let's check that hypothesis by seeing whether they are one-off values (perhaps one hour of the day) or consistent values across the whole day.

In [None]:
# let's assume one building can consume 500 kWh in an hour.
high_val = df[(df['meter_reading'] >= 500)]
high_val.head()

In [None]:
high_val['building_id'].unique()

In [None]:
# plot high_val: big electricity users.
high_chart = sns.lineplot(x="hour",
                         y="meter_reading",
                         data=high_val
                         )
# set the title separately so high_chart stays the Axes object
high_chart.set_title('hourly distribution for usage of 500 kWh and above')
plt.show()

high_val has 35 unique building_ids. Looking at the plot, those values are not outliers: they are spread throughout the day like normal usage. Those buildings might belong to companies with constant processes, such as factories.

Let's split the data into weekends and weekdays: usage patterns might be different.

In [None]:
# split df into weekdays and weekends
## weekdays: df1
df1 = df[(df['day_name'] != 'Saturday') & (df['day_name'] != 'Sunday')]

# weekends: df2
df2 = df[(df['day_name'] == 'Saturday') | (df['day_name'] == 'Sunday')]
# df2.head()


In [None]:
df1.head()

In [None]:
df2.head()

In [None]:
## pivot df1 and df2 on the hour column: I want to cluster usage per hour of the day,
## so each row becomes one building-day and each column one hour (0-23).
df1 = pd.pivot_table(df1, values = 'meter_reading', index=["building_id", 'day'], columns = 'hour').reset_index()
df1 = df1.drop(columns='day')

## keep rows with no zero reading in any of the 24 hours (NaN counts as non-zero in this test).
df1 = df1[df1.iloc[:,1:25].ne(0).sum(axis=1) > 23 ]

# keep rows with fewer than 10 missing (NaN) hourly values
df1 = df1[df1.isnull().sum(axis=1) < 10]

# fill NaN with row mean.
df1.iloc[:,1:25] = df1.iloc[:,1:25].T.fillna(df1.iloc[:,1:25].mean(axis=1)).T

## compute the percentage of usage per hour per day.
#df1.iloc[:,1:25] = 100 * df1.iloc[:,1:25].div(df1.iloc[:,1:25].sum(axis=1), axis=0) # 25776 * 25
#df1 = df1.dropna(0)

In [None]:
# df2: same steps as for df1
df2 = pd.pivot_table(df2, values = 'meter_reading', index=["building_id", "day"], columns = 'hour').reset_index()
df2 = df2.drop(columns='day')

## keep rows with no zero reading in any of the 24 hours (NaN counts as non-zero in this test).
df2 = df2[df2.iloc[:,1:25].ne(0).sum(axis=1) > 23 ]

# keep rows with fewer than 10 missing (NaN) hourly values
df2 = df2[df2.isnull().sum(axis=1) < 10]

# fill NaN with row mean.
df2.iloc[:,1:25] = df2.iloc[:,1:25].T.fillna(df2.iloc[:,1:25].mean(axis=1)).T

## compute the percentage of usage per hour per day.
#df2.loc[:,1:25] = 100 * df2.iloc[:,1:25].div(df2.iloc[:,1:25].sum(axis=1), axis=0) # 11349  * 25

The script above does the following:

- pivots on the hour column: I am interested in visualizing the distribution of usage per hour while identifying missing values.
- removes all building-days that contain a zero reading in any hour; I treat these as wrong data.
- removes all building-days with 10 or more missing values out of 24 hours.
- fills the remaining missing values with the mean of that row (one building on one day).
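
To make these steps concrete, here is a small sketch on made-up data. It uses 4 "hours" instead of 24 and a scaled-down missing-value threshold of 2; the `toy`, `wide` and `hours` names are illustrative only. It fills gaps column by column with `fillna`, which should give the same result as the transpose trick used in the cells above.

```python
import pandas as pd

# Toy long-format readings: 3 building-days, 4 "hours" instead of 24 (illustrative only).
toy = pd.DataFrame({
    "building_id":   [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
    "day":           [1] * 11,
    "hour":          [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 3],
    "meter_reading": [5.0, 6.0, 7.0, 6.0, 0.0, 2.0, 3.0, 2.0, 4.0, 8.0, 6.0],
})

# One row per building-day, one column per hour; building 3 has no reading for hour 2.
wide = pd.pivot_table(toy, values="meter_reading",
                      index=["building_id", "day"], columns="hour").reset_index()
wide = wide.drop(columns="day")
hours = wide.columns[1:]  # the hour columns, here 0..3

# Drop building-days containing a zero reading (building 2).
wide = wide[wide[hours].ne(0).all(axis=1)]

# Drop building-days with too many gaps (threshold scaled down for the toy data).
wide = wide[wide[hours].isnull().sum(axis=1) < 2].copy()

# Fill the remaining gaps with the mean of the same row.
row_means = wide[hours].mean(axis=1)
wide[hours] = wide[hours].apply(lambda col: col.fillna(row_means))

print(wide)
```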

In [None]:
#plot df1: weekdays
plt.style.use('seaborn-v0_8')  # the 'seaborn' style was renamed in matplotlib 3.6
df1.iloc[:,1:25].T.plot(figsize=(16,8), legend=False, color='blue', alpha=0.01)
plt.xlabel("hour of the day")
plt.ylabel("usage (in kWh)")
plt.title("Weekdays hourly usage")
plt.show()

In [None]:
#plot df2: weekends
plt.style.use('seaborn-v0_8')  # the 'seaborn' style was renamed in matplotlib 3.6
df2.iloc[:,1:25].T.plot(figsize=(16,8), legend=False, color='blue', alpha = 0.01)
plt.xlabel("hour of the day")
plt.ylabel("usage (in kWh)")
plt.title("Weekends hourly usage")
plt.show()

## Possible clusters correspond to categories of users (big: possibly companies with running processes; small: households or regular consumers)

From both plots, we can observe:

- weekdays: there seem to be 4 to 6 distinct groups: below 500, and 500 and more
- weekends: there seem to be 4 to 6 groups (clusters): below 400, between 600 and 800, 1000 and 1400, above 1200

We can visually see that this data can be clustered by type of user: low, medium, high and more. However, the plots alone cannot tell us the optimum number of clusters k, that is, the k beyond which the error stops improving meaningfully. One way to find it is the Elbow method, which consists of:

- Running k-means clustering for a range of k values (for example 2 to 15); for each k, the sum of squared errors (SSE) is calculated.
- Plotting a line chart with k on the x axis and SSE on the y axis. If the line resembles an arm, the optimum k is at the elbow.
- The idea is to select the smallest k after which SSE no longer drops substantially, since SSE generally keeps shrinking as k grows.
 

Modelling: k-means clustering.

What is k-means clustering and how does it work?
It is an unsupervised machine learning algorithm that groups data points by similarity, with the number of groups (clusters) given by k. It follows these steps:

- Choose the number of clusters k and pick k initial centroids.
- Assign each data point to the closest centroid by calculating its Euclidean distance to each centroid.
- Determine the new centroid of each cluster by computing the average of the points assigned to it.
- Repeat steps 2 and 3 until none of the cluster assignments changes.
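
The steps above can be sketched in plain NumPy. This is an illustrative implementation of the basic loop on synthetic 2-D points, not what scikit-learn's `KMeans` does internally (it uses a smarter initialisation and several restarts). The `kmeans_sketch` function and its parameters are hypothetical names.

```python
import numpy as np


def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Basic k-means loop: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: initialise centroids with k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Step 2: Euclidean distance from every point to every centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # step 4: no assignment changed, so stop
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids, labels


# Two well-separated blobs of synthetic points.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(10, 0.5, (50, 2))])
centroids, labels = kmeans_sketch(X, k=2)
print(np.sort(centroids[:, 0]).round(1))
```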

In [None]:
## weekdays
distortions = []
K1 = range(1,10)
for k in K1:
    kmeanModel1 = KMeans(n_clusters=k, n_init=10, random_state=42)  # fixed seed for reproducible runs
    kmeanModel1.fit(df1.iloc[:, 1:25])
    distortions.append(kmeanModel1.inertia_)

# plot elbow line
plt.figure(figsize=(16,8))
plt.plot(K1, distortions)
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k (weekdays)')
plt.show()

The elbow plot indicates that the optimum k is between 3 and 4. Beyond 4, SSE improves only marginally. On this occasion, I will go for k = 4.
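
When the elbow is ambiguous, as it is here between 3 and 4, the silhouette score can serve as a second opinion; `silhouette_score` is already imported at the top of the notebook. The sketch below runs on synthetic blobs rather than the ASHRAE data, so the numbers it prints say nothing about the real clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the hourly profiles: three blobs in 24 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 1.0, size=(100, 24)) for loc in (0, 20, 40)])

scores = {}
for k in range(2, 7):  # the silhouette score is undefined for k = 1
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest average silhouette is the best-separated clustering.
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On the full pivoted data the pairwise distances can be expensive to compute; the `sample_size` argument of `silhouette_score` scores a random subset instead.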

In [None]:
# weekdays model
kmeans1 = KMeans(n_clusters=4, n_init=10, random_state=42).fit(df1.iloc[:, 1:25])
centroids1 = kmeans1.cluster_centers_
print(centroids1)

In [None]:
## plot all weekdays centroids.
plt.style.use('seaborn-v0_8')  # the 'seaborn' style was renamed in matplotlib 3.6
ax = sns.lineplot(x='hour', y='usage', marker="o", 
                  hue="index",palette=["C0", "C1", "C2", "C3"],
                  data=pd.melt(pd.DataFrame(centroids1).reset_index(), 
                               id_vars="index", var_name="hour", value_name="usage")).set_title('clustering weekdays')
plt.legend(title = 'Cluster', loc= 'upper right', labels = ['low', 'high', 'very high', 'medium'])  
plt.show()                

- cluster low: could be residential buildings where usage is very low.
- cluster medium: could be small businesses.
- cluster high and very high: could be businesses with processes running at specific hours of the day. Both have similar patterns: usage starts to increase at around 5am, peaks around 7am, stays constant until 3pm and then decreases.
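
One caveat: k-means numbers its clusters arbitrarily, so a hard-coded legend order may not match the centroids on another run. A possible way to derive the names from the centroids themselves is to rank them by mean usage. The `name_clusters` helper and the centroid values below are made up for illustration.

```python
import numpy as np


def name_clusters(centroids, names=("low", "medium", "high", "very high")):
    """Map each cluster index to a name, ranked by the centroid's mean usage."""
    order = np.argsort(centroids.mean(axis=1))  # cluster indices, lowest usage first
    return {int(cluster): names[rank] for rank, cluster in enumerate(order)}


# Made-up centroids (4 clusters x 3 hours) in an arbitrary order, as k-means might return them.
toy_centroids = np.array([
    [900.0, 950.0, 920.0],
    [20.0, 25.0, 22.0],
    [300.0, 320.0, 310.0],
    [80.0, 90.0, 85.0],
])
print(name_clusters(toy_centroids))
```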

In [None]:
## weekends
distortions = []
K2 = range(1,10)
for k in K2:
    kmeanModel2 = KMeans(n_clusters=k, n_init=10, random_state=42)  # fixed seed for reproducible runs
    kmeanModel2.fit(df2.iloc[:, 1:25])
    distortions.append(kmeanModel2.inertia_)

# plot elbow line
plt.figure(figsize=(16,8))
plt.plot(K2, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k (weekends)')
plt.show()

In [None]:
# weekends model with k = 4
kmeans2 = KMeans(n_clusters=4, n_init=10, random_state=42).fit(df2.iloc[:, 1:25])
centroids = kmeans2.cluster_centers_
print(centroids)

In [None]:
## plot all centroids.
plt.style.use('seaborn-v0_8')  # the 'seaborn' style was renamed in matplotlib 3.6
ax = sns.lineplot(x='hour', y='usage', marker="o", 
                  hue="index",palette=["C0", "C1", "C2", "C3"],
                  data=pd.melt(pd.DataFrame(centroids).reset_index(), 
                               id_vars="index", var_name="hour", value_name="usage")).set_title('clustering weekends')
plt.legend(title = 'Cluster', loc= 'center right', labels = ['low', 'high', 'medium', 'very high'])  
plt.show() 

Weekend clusters are similar to weekday clusters; however, the high and very high user groups consume less electricity during weekends, and their lines are flat as opposed to weekdays.

Further work: cluster within each category; more patterns such as early-morning, mid-morning, afternoon and night-owl users might be uncovered.
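
One possible starting point for that follow-up is the percentage-of-daily-usage idea left commented out in the pivot cells: rescaling each profile to shares of its daily total removes the size of the consumer, so clustering would respond to the shape of the day instead. A minimal sketch on made-up profiles (the `to_daily_shares` name is hypothetical, and it assumes no row sums to zero):

```python
import numpy as np


def to_daily_shares(profiles):
    """Rescale each row so that its values sum to 100 (percent of the day's usage)."""
    profiles = np.asarray(profiles, dtype=float)
    totals = profiles.sum(axis=1, keepdims=True)  # assumed non-zero for every row
    return 100 * profiles / totals


# Two made-up profiles with the same shape but a 10x difference in scale.
small = np.array([1.0, 1.0, 2.0, 4.0])
large = 10 * small
shares = to_daily_shares(np.vstack([small, large]))
print(shares)
```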