This notebook uses the energy efficiency dataset to explore two clustering techniques: K-means and Expectation Maximisation.

K-means clustering starts by randomly selecting a few points as centroids and assigns each data point to the cluster of its nearest centroid. Each centroid is then recomputed as the mean of the data points assigned to it, and the two steps repeat until the assignments stop changing.
https://en.wikipedia.org/wiki/K-means_clustering
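
The assign-then-update loop described above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not the scikit-learn implementation used later in the notebook; the function name `kmeans_sketch` and the toy blobs are assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=20, seed=0):
    """Plain Lloyd iteration: assign points to the nearest centroid, then recompute means."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance of every point to every centroid, shape (n_samples, k).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy data: two well-separated blobs (synthetic, not the energy dataset).
rng = np.random.default_rng(1)
X_toy = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
toy_labels, toy_centroids = kmeans_sketch(X_toy, k=2)
print(toy_centroids)
```

The result depends on the random starting centroids, which is why the notebook's `KMeans` calls use several restarts (`n_init=10`).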

Expectation maximisation clustering uses Gaussian mixture models. It alternates between estimating the latent variables (which mixture component generated each data point) and updating the model parameters to maximise the likelihood of the data. The data points can then be grouped into clusters according to the estimated component distributions.
https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm
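
As a rough illustration of the two alternating steps, the sketch below runs EM for a two-component Gaussian mixture on one-dimensional synthetic data. The function name `em_gmm_1d`, the crude initialisation and the toy data are assumptions for the example; the notebook itself relies on scikit-learn's `GaussianMixture`.

```python
import numpy as np

def em_gmm_1d(x, n_iter=50):
    """EM for a two-component 1-D Gaussian mixture (illustrative only)."""
    # Crude initial guesses: means at the data extremes, shared variance, equal weights.
    mu = np.array([x.min(), x.max()], dtype=float)
    var = np.array([x.var(), x.var()], dtype=float)
    weight = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point, shape (n, 2).
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = weight * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances from the responsibilities.
        nk = resp.sum(axis=0)
        weight = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return weight, mu, var, resp

# Synthetic data: two Gaussians centred at 0 and 6.
rng = np.random.default_rng(0)
x_toy = np.concatenate([rng.normal(0, 1, 200), rng.normal(6, 1, 200)])
weight, mu, var, resp = em_gmm_1d(x_toy)
# Hard cluster labels: the component with the larger responsibility.
hard_labels = resp.argmax(axis=1)
print(mu, weight)
```

Unlike K-means, each point receives a soft membership (`resp`) before being assigned to its most likely component.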

The data source is from https://archive.ics.uci.edu/ml/datasets/Energy+efficiency

# Library Loading

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.decomposition import PCA
from plotnine import *
from matplotlib import gridspec
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn import metrics
from sklearn.metrics import pairwise_distances
from yellowbrick.cluster import SilhouetteVisualizer
import matplotlib.cm as cm
from sklearn.cluster import DBSCAN

# Data Loading 

This section will load the data into the notebook to look at the data structure. 

In [None]:
energy_df=pd.read_csv(r'../input/eergy-efficiency-dataset/ENB2012_data.csv')
energy_df.head()

In [None]:
energy_df.columns=["relative_compactness","surface_area","wall_area","roof_area","overall_height","orientation",
                   "glazing_area","glazing_area_dist","heating_load","cooling_load"]

# Data Exploration & Transformation

This section explores the data using histograms, scatter plots and a correlation matrix. Data cleansing and transformation are carried out where necessary.

In [None]:
energy_df.describe()

In [None]:
energy_df.loc[energy_df["glazing_area"]==0].describe()

In [None]:
energy_df.loc[energy_df["glazing_area_dist"]==0].describe()

In [None]:
energy_df.hist(figsize=(15,15))
plt.show()

The distribution of heating load is heavily right-skewed. Therefore, a log transformation is applied to heating load to make its distribution more symmetric.

In [None]:
energy_df["log_heating_load"]=np.log(energy_df["heating_load"])
energy_df["log_heating_load"].hist(bins=6)
plt.show()

After the log transformation, the distribution is less skewed but is bimodal, with two distinct peaks.
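
The effect of a log transformation on skewness can also be checked numerically instead of by eye. The sketch below uses a synthetic right-skewed sample as a stand-in (the name `toy_load` and the log-normal parameters are assumptions, not the real heating load values) and compares the sample skewness before and after taking logs.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample standing in for a load-like variable.
rng = np.random.default_rng(0)
toy_load = pd.Series(rng.lognormal(mean=3.0, sigma=0.5, size=1000))

# Sample skewness: values near 0 indicate a roughly symmetric distribution.
skew_before = toy_load.skew()
skew_after = np.log(toy_load).skew()
print(f"skew before log: {skew_before:.2f}, after log: {skew_after:.2f}")
```

The same two calls could be applied to `energy_df["heating_load"]` and `energy_df["log_heating_load"]`. Note that skewness alone will not reveal bimodality, so the histogram is still worth inspecting.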

In [None]:
sns.pairplot(energy_df)
plt.show()

In [None]:
corr = energy_df.corr()
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
f, ax = plt.subplots(figsize=(12, 10))
cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(corr, mask=mask,cmap=cmap, vmax=.9, center=0, square=True, linewidths=.5, annot=True,cbar_kws={"shrink": .5})
plt.show()

Relative compactness is highly correlated with surface area, roof area and overall height.

Heating load is highly correlated with cooling load, which suggests that only one of them is needed as the dependent variable.
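
Highly correlated pairs can also be listed programmatically instead of being read off the heatmap. The helper below is a sketch; the function name `high_corr_pairs` and the 0.8 threshold are assumptions chosen for illustration, and the toy frame stands in for `energy_df`.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.8):
    """Return feature pairs whose absolute Pearson correlation is at least `threshold`."""
    corr = df.corr()
    # Keep only the strict upper triangle so each pair appears once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    # NaN entries (masked cells) compare as False and are dropped here.
    pairs = pairs[pairs.abs() >= threshold]
    return pairs.sort_values(key=np.abs, ascending=False)

# Toy frame: "b" is almost a linear function of "a", "c" is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})
pairs = high_corr_pairs(toy)
print(pairs)
```

Calling `high_corr_pairs(energy_df)` would list the strongly related columns of the real data under the same threshold.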

In [None]:
energy_df_f=energy_df.copy()
energy_df_f.drop(["heating_load","cooling_load"],axis=1,inplace=True)
#energy_df_f.drop(["log_heating_load","cooling_load"],axis=1,inplace=True)

In [None]:
energy_df_f.columns

# Using Raw Data for K-Means Clustering

This section applies K-Means clustering to the data. To determine the optimal number of clusters, Calinski-Harabasz (CH) and Silhouette scores are used.

## Searching for Optimal Number of Clusters Using Calinski-Harabasz (CH) and Silhouette Scores

In [None]:
SH_score=[]
CH_score=[]
print("Silhouette analysis based on number of clusters:")
fig, ax = plt.subplots(3, 2, figsize=(15,8))
for i in [2,3,4,5,6,7]:
    clusterer = KMeans(n_clusters=i, init='k-means++',n_init=10, max_iter=100,random_state=48)
    Kmean_label = clusterer.fit_predict(energy_df_f)
    CH_temp=metrics.calinski_harabasz_score(energy_df_f, Kmean_label)
    CH_score.append(CH_temp)
    
    q, mod = divmod(i, 2)
    visualizer = SilhouetteVisualizer(clusterer, colors='yellowbrick', ax=ax[q-1][mod])
    visualizer.fit(energy_df_f)
    SH_score.append(visualizer.silhouette_score_)


Looking at the figures above, 3 to 5 clusters appear to be reasonable choices, as every cluster exceeds the average Silhouette score. With 3 clusters, cluster 2 is the largest of the three, while with 4 and 5 clusters the cluster sizes are fairly uniform except for the last cluster.

In [None]:
cluster_score=list(zip(SH_score,CH_score))
cluster_score=pd.DataFrame(cluster_score)
cluster_score.columns=["Silhouette_score","CH_score"]
cluster_score.index=cluster_score.index+2

In [None]:
fig,ax_cs=plt.subplots(1,2,figsize=(10,5))
sns.lineplot(data=cluster_score.iloc[:,0],ax=ax_cs[0])
sns.lineplot(data=cluster_score.iloc[:,1],ax=ax_cs[1])
plt.show()

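Based on the figures above, 3 clusters is chosen as the number of clusters. Both metrics are higher-is-better: a higher Silhouette score means data points lie closer to their own cluster than to neighbouring clusters, and a higher CH score means larger between-cluster dispersion relative to within-cluster dispersion. The smallest values would therefore indicate the weakest, not the clearest, separation, so the curves should be checked for their maxima before settling on 3.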
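
To make the direction of the two metrics concrete, the sketch below scores K-Means on synthetic data with three well-separated blobs. The blob centres and variable names are assumptions for the example; on such data both scores peak at the true number of clusters, which illustrates that higher values indicate better clustering.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score, silhouette_score

# Three well-separated synthetic blobs (not the energy dataset).
X_toy, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 0], [0, 10]],
                      cluster_std=1.0, random_state=0)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_toy)
    scores[k] = (silhouette_score(X_toy, labels),
                 calinski_harabasz_score(X_toy, labels))

# Pick the k with the highest value of each score.
best_sil = max(scores, key=lambda k: scores[k][0])
best_ch = max(scores, key=lambda k: scores[k][1])
print(best_sil, best_ch)
```

The same `max(...)` selection could be applied to the `SH_score` and `CH_score` lists collected above to read off the preferred number of clusters.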

## Visualising Data Based on 3 Clusters

In [None]:
clusterer_best = KMeans(n_clusters=3, init='k-means++',n_init=10, max_iter=100,random_state=48)
Kmean_label_best = clusterer_best.fit_predict(energy_df_f)
energy_df_f["Cluster label"]=Kmean_label_best

In [None]:
fig,ax=plt.subplots(3,3,figsize=(15,8),sharex=True)
for i,col in enumerate(energy_df_f.columns[:-1]):
    q, mod = divmod(i, 3)
    sns.scatterplot(data=energy_df_f,y=col,x=energy_df_f.columns[-1],ax=ax[q][mod])

Based on the visualisation above, there are clear differences between clusters in terms of relative compactness, surface area, wall area, roof area, overall height and log_heating_load. 

In [None]:
# String aggregation names avoid the pandas FutureWarning raised for NumPy callables.
(energy_df_f.loc[:,['relative_compactness', 'surface_area', 'wall_area', 'Cluster label']]
 .groupby("Cluster label").agg(["mean","median","std"]))

Cluster 1 has the highest relative compactness and the lowest surface area of the three clusters, while cluster 2 has the lowest wall area, in terms of both mean and median.

In [None]:
# String aggregation names avoid the pandas FutureWarning raised for NumPy callables.
(energy_df_f.loc[:,[ 'roof_area','overall_height', 'log_heating_load','Cluster label']]
 .groupby("Cluster label").agg(["mean","median","std"]))

Cluster 1 has the lowest roof area and the highest overall height and log heating load of the three clusters.

# Using Raw Data for Expectation Maximisation (EM) Clustering

This section applies EM clustering to the data. To determine the optimal number of clusters, the same metric scores as in the previous section are used.

However, the Silhouette visualiser from the Yellowbrick library does not support EM clustering. So, the Silhouette plot is drawn only for the selected number of clusters.

## Searching for Optimal Number of Clusters Using Calinski-Harabasz (CH) and Silhouette Scores

In [None]:
energy_df_em_f=energy_df_f.iloc[:,:-1].copy()

In [None]:
SH_EM_score=[]
CH_EM_score=[]
#print("Silhouette analysis based on number of clusters:")
#fig, ax = plt.subplots(3, 2, figsize=(15,8))
for i in [2,3,4,5,6,7]:
    em_clusterer = GaussianMixture(n_components=i,max_iter=100,n_init=10,init_params='kmeans',random_state=48)
    em_label = em_clusterer.fit_predict(energy_df_em_f)
    CH_temp=metrics.calinski_harabasz_score(energy_df_em_f, em_label)
    CH_EM_score.append(CH_temp)
    SH_temp=metrics.silhouette_score(energy_df_em_f, em_label)
    SH_EM_score.append(SH_temp)
    

In [None]:
cluster_score_em=list(zip(SH_EM_score,CH_EM_score))
cluster_score_em=pd.DataFrame(cluster_score_em)
cluster_score_em.columns=["Silhouette_score","CH_score"]
cluster_score_em.index=cluster_score_em.index+2

In [None]:
fig,ax_cs=plt.subplots(1,2,figsize=(10,5))
sns.lineplot(data=cluster_score_em.iloc[:,0],ax=ax_cs[0])
sns.lineplot(data=cluster_score_em.iloc[:,1],ax=ax_cs[1])
plt.show()

Based on the graphs above, 3 clusters is again chosen as the number of clusters. Both the Silhouette score and the CH score are higher-is-better, so the smallest values would indicate the weakest separation between clusters; the curves should be checked for their maxima before settling on 3.

## Visualising Data Based on 3 Clusters

In [None]:
clusterer_em_best = GaussianMixture(n_components=3,max_iter=100,n_init=10,init_params='kmeans',random_state=48)
em_label_best = clusterer_em_best.fit_predict(energy_df_em_f)
# energy_df_em_f already excludes the cluster label, so score on the same features the model was fitted on.
em_silhouette_values = metrics.silhouette_samples(energy_df_em_f, em_label_best)
em_silhouette_avg = metrics.silhouette_score(energy_df_em_f, em_label_best)
y_lower = 0
fig, ax1 = plt.subplots(1, 1)
for i in np.unique(em_label_best):
    # Aggregate the silhouette scores for samples belonging to
    # cluster i, and sort them
    ith_cluster_silhouette_values = \
        em_silhouette_values[em_label_best == i]

    ith_cluster_silhouette_values.sort()

    size_cluster_i = ith_cluster_silhouette_values.shape[0]
    y_upper = y_lower + size_cluster_i

    color = cm.nipy_spectral(float(i) / (max(np.unique(em_label_best))+1))
    ax1.fill_betweenx(np.arange(y_lower, y_upper),
                      0, ith_cluster_silhouette_values,
                      facecolor=color, edgecolor=color, alpha=0.7)

    # Label the silhouette plots with their cluster numbers at the middle
    ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

    # Compute the new y_lower for next plot
    y_lower = y_upper + 10  # 10 for the 0 samples

ax1.set_title("The silhouette plot for the various clusters.")
ax1.set_xlabel("The silhouette coefficient values")
#ax1.set_ylabel("Cluster label")

# The vertical line for average silhouette score of all the values
ax1.axvline(x=em_silhouette_avg, color="red", linestyle="--")

#ax1.set_yticks([])  # Clear the yaxis labels / ticks
ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])
plt.show()

Looking at the size of each cluster, cluster 0 is the smallest; the other two clusters (labels 1 and 2) are larger.

In [None]:
energy_df_em_f["Cluster label"]=em_label_best
fig,ax=plt.subplots(3,3,figsize=(15,8),sharex=True)
for i,col in enumerate(energy_df_em_f.columns[:-1]):
    q, mod = divmod(i, 3)
    sns.scatterplot(data=energy_df_em_f,y=col,x=energy_df_em_f.columns[-1],ax=ax[q][mod])

In contrast to the K-Means result, for the features other than orientation, glazing area and glazing area distribution, cluster 1 has the lowest values, while clusters 0 and 2 have higher values.

In [None]:
# String aggregation names avoid the pandas FutureWarning raised for NumPy callables.
(energy_df_em_f.loc[:,['relative_compactness', 'surface_area', 'wall_area', 'Cluster label']]
 .groupby("Cluster label").agg(["mean","median","std"]))

Cluster 2 contains buildings with high relative compactness and small surface area, but the same mean wall area as cluster 1. Cluster 0 contains buildings with larger wall areas than clusters 1 and 2.

In [None]:
# String aggregation names avoid the pandas FutureWarning raised for NumPy callables.
(energy_df_em_f.loc[:,[ 'roof_area','overall_height', 'log_heating_load','Cluster label']]
 .groupby("Cluster label").agg(["mean","median","std"]))

Cluster 1 contains buildings with a larger roof area but lower height and lower heating load (log). Cluster 0 contains buildings with a larger heating load (log) than clusters 1 and 2.

In K-Means clustering, cluster 1 contains buildings that are tall, with high relative compactness and high heating load but smaller surface area and roof area, while in EM clustering, cluster 1 contains buildings that are low in height and heating load but large in roof area and surface area.
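
Cluster numbers are arbitrary, so "cluster 1" from K-Means need not correspond to "cluster 1" from EM. One way to compare two labelings directly is a contingency table together with the adjusted Rand index, which ignores how the clusters are numbered. The sketch below demonstrates this on synthetic blobs; the data and variable names are assumptions for the example.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

# Three well-separated synthetic blobs (not the energy dataset).
X_toy, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 0], [0, 10]],
                      cluster_std=1.0, random_state=0)

km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_toy)
gm_labels = GaussianMixture(n_components=3, n_init=10, random_state=0).fit_predict(X_toy)

# Rows: K-Means labels, columns: EM labels. Large cells show which clusters match.
table = pd.crosstab(pd.Series(km_labels, name="kmeans"),
                    pd.Series(gm_labels, name="em"))
# 1.0 means identical partitions up to renumbering; values near 0 mean chance-level agreement.
ari = adjusted_rand_score(km_labels, gm_labels)
print(table)
print(ari)
```

Applied to `Kmean_label_best` and `em_label_best`, the same two calls would show how far the K-Means and EM partitions of the energy data actually agree.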

In conclusion, the mean values of relative compactness, surface area, roof area, overall height and wall area differ between clusters, indicating that these features might influence the heating load a building requires to maintain a warm indoor environment.