# Can we automatically find the optimal number of clusters for K-means?

In the module "BPS5229 Data Science for the Built Environment", we learned how to use K-means to cluster the daily profiles of power meters.

However, the number of clusters for K-means must be specified by the user, who usually has to adjust it manually many times to find the optimal one.

So, this notebook uses the **knee method (also called the elbow method)**, based on the **inertia value** of each clustering result, to find the optimal number of clusters.

The result of this notebook, a useful Python function, could *help users automatically find an initial optimal number of clusters.*

![Elbow method for clustering](https://www.datanovia.com/en/wp-content/uploads/dn-tutorials/004-cluster-validation/figures/015-determining-the-optimal-number-of-clusters-k-means-optimal-clusters-wss-silhouette-1.png)
Reference: https://www.datanovia.com/en/lessons/determining-the-optimal-number-of-clusters-3-must-know-methods/

Let's start by installing the `kneed` package and importing the packages we need.

In [None]:
!pip install kneed

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
import time

import datetime as datetime

from sklearn.preprocessing import LabelEncoder
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

from kneed import KneeLocator

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objs as go
import plotly
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import cufflinks as cf
cf.set_config_file(offline=True)

# Clustering function

Here are the main functions we'll use in this notebook. They help us find the optimal number of clusters and produce a dataframe with cluster labels.

To find the optimal cluster number, there are three basic steps:
1. Normalize all variables in the dataframe (the functions below use scikit-learn's `StandardScaler`; a simple `MinMaxScaler` helper is also defined but not used)
2. Calculate `inertia` for each cluster number in the search range we set
3. Plot `inertia` (y-axis) against `cluster number` (x-axis), and find the optimal cluster number via the **elbow method**

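Step 3 can be pictured without any library. The sketch below is a simplified stand-in for what `KneeLocator` does, not its actual algorithm: it draws a straight line between the first and last points of the inertia curve and picks the `k` that falls farthest below that line. The function name `find_elbow` and the inertia values are made up for illustration.

```python
def find_elbow(ks, inertias):
    """Return the k whose point lies farthest below the straight line
    joining the first and last points of a decreasing, convex curve.

    Assumes ks has at least two distinct values, sorted in ascending order.
    """
    x0, y0 = ks[0], inertias[0]
    x1, y1 = ks[-1], inertias[-1]
    best_k, best_gap = ks[0], 0.0
    for k, y in zip(ks, inertias):
        # Height of the straight line (chord) at this k, by linear interpolation
        chord_y = y0 + (y1 - y0) * (k - x0) / (x1 - x0)
        # Vertical gap between the chord and the curve
        gap = chord_y - y
        if gap > best_gap:
            best_k, best_gap = k, gap
    return best_k


# Hypothetical inertia values for k = 2..8 (not taken from the dataset)
ks = [2, 3, 4, 5, 6, 7, 8]
inertias = [1000, 600, 350, 180, 160, 150, 140]

print(find_elbow(ks, inertias))  # prints 5
```

The curve drops steeply up to `k = 5` and flattens afterwards, so the largest gap to the chord sits at 5.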

In [None]:
def MinMaxScaler(data):
    return (data-np.min(data))/(np.max(data)-np.min(data))

def Kmeans_clustering(df, clusterNum, max_iter=1000, n_jobs=-1):
    '''
    Function for doing kmeans clustering.
    Inputs should include at least the dataframe of variables and cluster number you want.
    '''
    # Normalize the dataframe
    scaler = StandardScaler()
    scaler.fit(df) 
    df_std = pd.DataFrame(data=scaler.transform(df), columns=df.columns, index=df.index)
    
    # Kmeans clustering
    km_model = KMeans(n_clusters=clusterNum, max_iter=max_iter, random_state=666)
    km_model = km_model.fit(df_std)
    
    clusterdf= pd.DataFrame(data=km_model.labels_, columns=['ClusterNo'])
    clusterdf.index = df.index
    
    return clusterdf

def Kmeans_bestClusterNum(df, range_min, range_max, max_iter=1000, n_jobs=-1):
    '''
    Function for finding optimal number of kmeans clustering.
    Inputs should include at least the dataframe of variables, and search range of cluster number (min & max cluster number).
    '''    
    
    # Normalize the dataframe
    scaler = StandardScaler()
    scaler.fit(df) 
    df_std = pd.DataFrame(data=scaler.transform(df), columns=df.columns, index=df.index)       
    
    # Calculate inertia for each cluster number in the search range
    sum_of_squared_distances = [] #Inertia of all clustering results
    ks = range(range_min,range_max+1)
    for k in ks:
        kmeans_fit = KMeans(n_clusters = k, max_iter=max_iter, random_state=666).fit(df_std)
        cluster_labels = kmeans_fit.labels_
        sum_of_squared_distances.append(kmeans_fit.inertia_)
        
    # Use kneed package to locate the elbow / knee of the curve line
    kn = KneeLocator(list(ks), sum_of_squared_distances, S=1.0, curve='convex', direction='decreasing')  
    
    # Plot the result of finding optimal cluster number
    plt.xlabel('k')
    plt.ylabel('sum_of_squared_distances')
    plt.title('The Elbow Method showing the optimal k')
    plt.plot(ks, sum_of_squared_distances, 'bx-')
    plt.vlines(kn.knee, plt.ylim()[0], plt.ylim()[1], linestyles='dashed')
    plt.show()
    
    print('Optimal clustering number:'+str(kn.knee))
    print('----------------------------')    
    
    return kn.knee

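For reference, `inertia_` in scikit-learn is the sum of squared distances from each sample to its closest cluster centre. Below is a minimal pure-Python sketch of that definition; the helper name `inertia` and the toy points are assumptions for illustration, not part of the notebook's data.

```python
def inertia(points, centroids):
    """Sum of squared Euclidean distances from each point to its nearest centroid."""
    total = 0.0
    for p in points:
        # Squared distance to every centroid; keep the smallest one
        total += min(
            sum((a - b) ** 2 for a, b in zip(p, c))
            for c in centroids
        )
    return total


# Four toy points forming two obvious groups
points = [(0, 0), (0, 2), (10, 0), (10, 2)]

inertia_k1 = inertia(points, [(5, 1)])           # one centroid in the middle
inertia_k2 = inertia(points, [(0, 1), (10, 1)])  # one centroid per group

print(inertia_k1, inertia_k2)  # prints 104.0 4.0
```

Going from one centroid to two cuts the inertia from 104 to 4. Inertia keeps falling as `k` grows, which is why we look for the elbow of the curve rather than its minimum.
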
# Load dataset

Let's load the dataset of power meters!

In [None]:
path_file = r'/kaggle/input/create-pickle-for-dataset/'

In [None]:
df_data = pd.read_pickle(os.path.join(path_file, 'df_merged.pickle.gz'))
df_data

# Preprocess the dataset (power meter data at 1-hour interval)

Before running K-means, let's sum the demand from all power meters to keep this demo simple.

We also resample the data from 1-min intervals to hourly means (`resample('H')` in the code below) to save some memory.


In [None]:
# Keep only the columns whose names contain 'kW'
df_powerMeter = df_data.loc[:, df_data.columns.str.contains('kW')].copy()

# Sum up demands of all power meters
df_powerMeter = df_powerMeter.sum(axis=1).rename('total_demand')
df_powerMeter = df_powerMeter.resample('H').mean()
df_powerMeter

In [None]:
df_powerMeter.iplot()

Because our goal is to cluster daily load profiles, we need to reshape the data from a single `time series` into a table with the hour of day as rows and the date as columns.

In [None]:
# Reshape the dataframe
df_temp = df_powerMeter.reset_index().copy()
df_temp['date'] = df_temp['Date'].dt.date    
df_temp['hour'] = df_temp['Date'].dt.hour
df_temp_pivot = df_temp.pivot_table(index='hour', columns='date')
df_temp_pivot

Here's the plot of all daily load profiles. There appear to be at least two groups: (1) a workday group and (2) a holiday group.

In [None]:
df_temp_pivot.plot(figsize=(15,5),color='black',alpha=0.1,legend=False)

# Clustering & visualizations

Let's search for the optimal number of clusters in the range 2 to 10.

In [None]:
df_PM_temp = df_temp_pivot.copy()
df_PM_temp = df_PM_temp.T

bestClusterNum_dept = Kmeans_bestClusterNum(df=df_PM_temp.fillna(0), range_min=2, range_max=10, max_iter=10000, n_jobs=-1)

### Hooray! We got an optimal cluster number of 5 here!

Then we use this optimal number for the K-means clustering below and see what happens:

In [None]:
df_PM_temp['ClusterNo'] = Kmeans_clustering(df=df_PM_temp.fillna(0), clusterNum=bestClusterNum_dept, max_iter=10000, n_jobs=-1)

for ClusterNo in df_PM_temp['ClusterNo'].sort_values().unique():
    df_plot = df_PM_temp[df_PM_temp['ClusterNo']==ClusterNo].T.drop('ClusterNo')
    print('Cluster No.: ' + str(ClusterNo))    
    print('Number of days: ' + str(len(df_plot.T)))
    df_plot.plot(figsize=(15,5),color='black',alpha=0.1,legend=False, ylim=(0, 1000))
    plt.show()
    print('-----------------------------------------------------------------------------------')

Groups 0, 3 and 4 look like `workday groups` with typical occupancy-driven patterns, while groups 1 and 2 are probably `holiday groups` with flat trends.
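
One way to check this kind of interpretation is to count how many weekdays and weekend days land in each cluster. The sketch below uses only the standard library; the `labels` dictionary is hypothetical and stands in for the date-to-cluster mapping held in `df_PM_temp['ClusterNo']`, and `day_type_counts` is an illustrative helper name.

```python
from collections import Counter
from datetime import date

# Hypothetical cluster label per day (not the notebook's actual result)
labels = {
    date(2018, 1, 1): 0,  # Monday
    date(2018, 1, 2): 0,
    date(2018, 1, 3): 0,
    date(2018, 1, 4): 0,
    date(2018, 1, 5): 0,  # Friday
    date(2018, 1, 6): 1,  # Saturday
    date(2018, 1, 7): 1,  # Sunday
    date(2018, 1, 8): 0,  # Monday
}


def day_type_counts(labels):
    """Count weekdays and weekend days per cluster."""
    counts = {}
    for day, cluster in labels.items():
        # date.weekday() returns 0 for Monday up to 6 for Sunday
        day_type = "weekend" if day.weekday() >= 5 else "weekday"
        counts.setdefault(cluster, Counter())[day_type] += 1
    return counts


counts = day_type_counts(labels)
for cluster in sorted(counts):
    print(cluster, dict(counts[cluster]))
```

Note that this only separates weekends from weekdays. Public holidays that fall on a weekday would need a separate holiday calendar.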

In [None]:
plt.figure(figsize=(15,6))
ax = sns.lineplot(x="hour", y="value", hue="ClusterNo",
                  data=df_PM_temp.melt(id_vars='ClusterNo'))

In [None]:
df_temp = df_temp.merge(df_PM_temp.reset_index()[['date', 'ClusterNo']], on='date')
df_temp = df_temp.pivot_table(columns='ClusterNo', index='Date', values='total_demand')
df_temp.loc['2018'].iplot()

The result seems nice and reasonable, but it's just one case for demonstrating the methodology.

Although further analysis and discussion are needed to settle on the optimal number of clusters, the elbow method can provide a useful reference value for an initial K-means clustering.