<hr style="border:0.2px solid black"> </hr>

<figure>
  <IMG SRC="img/ntnu_logo.png" WIDTH=250 ALIGN="right">
</figure>

**<ins>Course:</ins>** TVM4174 - Hydroinformatics for Smart Water Systems

# <ins>Example:</ins> Data Mining - Clustering and Classification Methods and Tools
    
*Developed by David B. Steffelbauer*

<hr style="border:0.2px solid black"> </hr>

    
## Data Mining Example: Smart Meter Clustering

## Useful packages and predefined functions

First, we import some packages that will help us analyse the smart meter data. You already know most of them; nevertheless, here is a list with links to their documentation:
- [numpy](https://numpy.org/): a library adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays 
- [matplotlib](https://matplotlib.org/): a library for creating static, animated, and interactive visualizations in Python
- [pandas](https://pandas.pydata.org/): Python data analysis library
- [seaborn](https://seaborn.pydata.org/): statistical data analysis library, which provides a high-level interface for drawing attractive and informative statistical graphics

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
import numpy as np
sns.set_style('darkgrid')
colors = sns.color_palette('magma', 3)
from cycler import cycler
mpl.rcParams['axes.prop_cycle'] = cycler(color=colors)
pd.plotting.register_matplotlib_converters()
from mpl_toolkits.mplot3d import Axes3D

In [None]:
def plot_patterns(patterns, ax=None, color=None, legend=False, alpha=0.5):
    """Plotting routine for patterns
    
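    :param patterns: pandas DataFrame with the patterns to be plotted, one column per pattern
    :param ax: matplotlib axes object; a new figure is created if None
    :param color: color of the patterns
    :param legend: if True, a legend is printed
    :param alpha: transparency of the plotted lines
    :return: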
    """
    
    time = patterns.index
    y = patterns.values
    
    if ax is None:
        fig, ax = plt.subplots()
        
    ax.plot(time, y, color=color, alpha=alpha)
#     ax.plot(time, y.mean(axis=1).values, color='k', linestyle='--', linewidth=2)
    if legend:
        ax.legend(patterns.columns, fontsize=16)
    ax.set_xlim((time[0], time[-1]))
    

In [None]:
def plot_clusters(data, labels=None, centers=None, cs='w'):
    """Plotting routine for 2-dimensional figures containing clusters with their centers and 
    colored datapoints according to their labels
    
    :param data: Clustering input data, either numpy array or pandas DataFrame,
        columns are features, rows are samples
    :param labels: Clustering output data, either list, numpy vector or pandas Series,
        with the same length as the rows of data
    :param centers: Array containing the cluster centers' x and y coordinate
    :param cs: face color of the datapoints, when labels are not specified
    :return:
    """
    
    if isinstance(data, pd.DataFrame):
        data = data.values
    if labels is not None:
        n_clusters = len(set(labels))
        colors = sns.color_palette('plasma', n_clusters)
        cs = [colors[x] for x in labels]

    plt.scatter(data[:, 0], data[:, 1], s=100, alpha=0.7, c=cs, edgecolors='k')

    if centers is not None:
        for center in centers:
            plt.scatter(*center, marker='s', s=100, edgecolor='k', linewidths=4, c='None')

    plt.xlabel(r'$x_1$', fontsize=18)
    plt.ylabel(r'$x_2$', fontsize=18)
    plt.xticks(fontsize=14)
    plt.yticks(fontsize=14)
    plt.axis('equal')
    

## Problem statement

<img src="img/Problem_Statement.png" width=800 height=800/>


The CEO of L-town has sleepless nights because he is worried about the unpredictability of the water demand in his water distribution system (WDS). Although forecasting tools based on time series analysis algorithms were developed for his WDS, the forecasts are still not good enough for him. He sees a lot of potential in leak detection, optimal pump scheduling and pressure management once the variability of the demand in the system is better known. 
Especially worrisome are the high peaks during the morning, which might drive the WDS to its limits. Are these peaks caused by single users, or are they generated by simultaneous water usage across a big group of similar users?

Inspired by news articles on how big data is transforming the water industry, he decides to install Smart Water Meters (SWM) to gain insights into the water use behaviour of his customers. SWM are devices that are capable of measuring and transmitting customers' water consumption in (near) real-time to the Water Utility (WU).

However, although the SWM provide a lot of data to the WU, the WU staff still struggles to extract relevant knowledge from the huge amount of data. This is where data mining enters the stage...

## The dataset

The WU was already capable of extracting demand patterns from the data for each SWM. Demand patterns represent the average water use of a property over the course of a day. 
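
The notebook does not show how the WU derived the patterns. As a minimal sketch, assuming raw readings at 5-minute resolution and using made-up meter names and random values, a daily pattern could be obtained by averaging all readings that share the same time of day:

```python
import numpy as np
import pandas as pd

# Hypothetical raw readings: 7 days at 5-minute resolution for two made-up meters.
rng = np.random.default_rng(0)
index = pd.date_range('2021-01-01', periods=7 * 288, freq='5min')
raw = pd.DataFrame(rng.random((len(index), 2)),
                   index=index, columns=['meter_a', 'meter_b'])

# Average all readings that share the same time of day (288 slots per day).
daily_pattern = raw.groupby(raw.index.time).mean()

print(daily_pattern.shape)
```

The result has one row per time of day and one column per meter, which is the same layout as the demand pattern file used below.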

The demand pattern data is stored in a CSV file in the `data` folder with the filename `demand_pattern.csv`. 

We use the pandas [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) function to load the data from the text file:

In [None]:
filename = 'data/demand_pattern.csv'
pattern = pd.read_csv(filename, index_col=0, parse_dates=[0], header=0)

pattern

The data consists of an index of datetime objects and one column per SWM, named after the meter. 

We can plot all the data in a 3D plot to visualize it at once:

In [None]:
fig = plt.figure(figsize=(12,9))
ax = fig.add_subplot(111, projection='3d')

x = np.arange(len(pattern.index))
y = np.arange(pattern.shape[1])
X, Y = np.meshgrid(x, y)
Z = pattern.values
Z = Z.T

t_format = [x.strftime('%H:%M') for x in pattern.index]

colors = sns.color_palette('magma_r', Z.shape[0])
for ii in range(Z.shape[0]):
    ax.plot(X[ii, :], Y[ii, :], Z[ii, :], color=colors[ii])

plt.xlim((x[0], x[-1]))
plt.ylim((y[0], y[-1]))

ax.xaxis.set_ticks(x[::12 * 3])
ax.xaxis.set_ticklabels(t_format[::12 * 3])

ax.yaxis.set_ticks(y[::15])
ax.yaxis.set_ticklabels(pattern.columns[::15])

plt.xticks(rotation=-45, fontsize=14)
plt.yticks(fontsize=14)
ax.zaxis.set_tick_params(labelsize=14)

ax.view_init(20, -120)

Or we can pick out certain sensor numbers and compare the data for those SWM.

In [None]:
sensors = ['n3', 'n7', 'n8']

plot_patterns(pattern[sensors], legend=True, alpha=0.9)

## Standardization

K-means clustering is "isotropic" in all directions of space and therefore tends to produce more or less round (rather than elongated) clusters. Because it relies on Euclidean distances, leaving the variances unequal lets features with larger variance dominate the distance, so clusters tend to be separated along those features.

Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn; they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.

In practice we often ignore the shape of the distribution and just transform the data to center it by removing the mean value of each feature, then scale it by dividing non-constant features by their standard deviation.

$$ \mathbf{x}^{\prime} = \frac{\mathbf{x} - \overline{\mathbf{x}}}{\sigma_\mathbf{x}} $$

In [None]:
def standardize(df):
    """Standardize each column of df to zero mean and unit standard deviation."""
    return (df - df.mean()) / df.std()

So let's standardize the data and compare it to the original...

In [None]:
s_pattern = standardize(pattern)

fig, axs = plt.subplots(1,2, figsize=(12, 3))

pattern[sensors].plot(ax=axs[0])
axs[0].set_title('Original Data')

s_pattern[sensors].plot(ax=axs[1])
axs[1].set_title('Standardized Data')
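
As a cross-check, the same transformation is available in `scikit-learn` as `StandardScaler`. The sketch below uses made-up data (not the smart meter patterns). One detail to be aware of: `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas pandas' `std()` defaults to the sample standard deviation (`ddof=1`), so the results of the `standardize` function above differ from it by a constant factor close to 1.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up data: 100 time steps for three hypothetical meters with different scales.
rng = np.random.default_rng(1)
toy = pd.DataFrame(
    rng.normal(loc=[1.0, 5.0, 20.0], scale=[0.5, 2.0, 10.0], size=(100, 3)),
    columns=['a', 'b', 'c'])

# Manual standardization with the population standard deviation (ddof=0),
# which is the convention StandardScaler uses.
manual = (toy - toy.mean()) / toy.std(ddof=0)

scaled = StandardScaler().fit_transform(toy)

print(np.allclose(manual.values, scaled))
```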

## Some `scikit-learn` basics

Now it's time to get our hands on the [`scikit-learn`](https://scikit-learn.org/stable/) package, which is an open source machine learning library that supports **supervised** and **unsupervised** learning. It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities. 

Scikit-learn provides dozens of built-in machine learning algorithms and models, called estimators. Each estimator can be fitted to some data using its fit method.

The fit method generally accepts 2 inputs:

- The samples matrix (or design matrix) $X$. The size of $X$ is typically (n_samples, n_features), which means that samples are represented as rows and features are represented as columns.

- The target values $y$, which are real numbers for regression tasks, or integers for classification (or any other discrete set of values). For unsupervised learning tasks, $y$ does not need to be specified. $y$ is usually a 1d array where the $i$-th entry corresponds to the target of the $i$-th sample (row) of $X$.

Both $X$ and $y$ are usually expected to be numpy arrays or equivalent array-like data types (e.g. `pandas` DataFrames), though some estimators work with other formats such as sparse matrices.

Once the estimator is fitted, it can be used for predicting target values of new data (e.g. for classification). 
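
The fit/predict workflow can be sketched on a tiny made-up dataset. The choice of `KNeighborsClassifier` here is only for illustration; any other classifier in `scikit-learn` follows the same pattern.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up training data: 6 samples (rows) with 2 features (columns).
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_train = np.array([0, 0, 0, 1, 1, 1])  # one integer class label per sample

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)  # supervised learning: fit receives X and y

# Predict the class labels of two new samples.
X_new = np.array([[0.05, 0.05], [1.05, 0.95]])
print(clf.predict(X_new))
```

For unsupervised estimators such as the clustering algorithm used later in this notebook, `fit` is called with the samples matrix only.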

For further information, have a look at ["An introduction to machine learning with scikit-learn"](https://scikit-learn.org/stable/tutorial/basic/tutorial.html).

Here is also a [link to a Python Cheat Sheet for `scikit-learn`](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Scikit_Learn_Cheat_Sheet_Python.pdf).

## Feature reduction through Principal Component Analysis

Let's have another look at the pattern data to see how it fits into the data table structure that we discussed previously. After transposing, each SWM is a sample (row) and each time step is a feature (column):

In [None]:
s_pattern.T

In [None]:
# Principal Component Analysis
from sklearn.decomposition import PCA

dimensions = 2

col_names = [f'x{i}' for i in range(1, dimensions+1)]

pca = PCA(dimensions).fit(s_pattern.T)  # PCA algorithm to reduce data to 2 dimensions
X = pca.transform(s_pattern.T)  # Map the data into the 2 dimensional space

X = pd.DataFrame(X, columns=col_names, index=s_pattern.columns)

X

In [None]:
plot_clusters(X)
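
To judge how much information is lost by keeping only two dimensions, the fitted `PCA` object exposes the attribute `explained_variance_ratio_`. The following sketch uses made-up data that is built from two hidden factors plus a little noise, so two components are expected to capture almost all of the variance; for the smart meter data the numbers will differ.

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up samples matrix: 50 samples with 10 features,
# generated from 2 hidden factors plus small noise.
rng = np.random.default_rng(2)
hidden = rng.normal(size=(50, 2))
mixing = rng.normal(size=(2, 10))
samples = hidden @ mixing + 0.05 * rng.normal(size=(50, 10))

pca_toy = PCA(n_components=2).fit(samples)
reduced = pca_toy.transform(samples)

print(reduced.shape)                            # samples x components
print(pca_toy.explained_variance_ratio_)        # share of variance per component
print(pca_toy.explained_variance_ratio_.sum())  # total share that is kept
```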

## *k*-Means Clustering with `scikit-learn`

<img src="img/kmeans_scheme.png" width=800 height=800/>

From now on, we will focus on the *k*-Means algorithm in `scikit-learn`. First, we import the `KMeans` class. Here is a [link](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans) to the API documentation of `KMeans`.

As introduced in the previous notebooks, we can also have a quick look at the documentation with the `help` command:

In [None]:
from sklearn.cluster import KMeans

help(KMeans)

The first input parameter when we create a `KMeans` object is the number of clusters $k$ (`n_clusters` keyword). 

After the `KMeans` estimator is initialized, we use its `fit` method to apply it to the data $X$. After that, we can extract the cluster centers and the labels from the attributes `cluster_centers_` and `labels_`, respectively.
Furthermore, the `inertia_` attribute gives us the sum of squared distances of the samples to their closest cluster center ($WCSS$ - *within-cluster sum of squares*), which is the objective function of the *k*-Means problem that we want to minimize:

$$
\sum_{i=1}^{k} \sum_{\mathbf{p} \in \mathbf{S}_i} || \mathbf{p} - \mathbf{\mu}_i ||^2 \ ,
$$
where $\mathbf{S} = \{S_1, S_2, \ldots, S_k\}$ is the partitioning of the samples $\mathbf{p}$ into $k$ clusters, and $\mathbf{\mu}_i$ is the centroid (center of mass) of the cluster $S_i$.
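
To connect the formula with the `scikit-learn` attributes, the double sum can be evaluated by hand and compared with `inertia_`. This is a sketch on made-up 2-D points; the variable names are chosen for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D data with two groups.
rng = np.random.default_rng(4)
points = np.vstack([rng.normal(0.0, 0.5, size=(30, 2)),
                    rng.normal(4.0, 0.5, size=(30, 2))])

km_toy = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Sum over clusters i and over all points p in S_i of ||p - mu_i||^2.
wcss_manual = 0.0
for i, mu in enumerate(km_toy.cluster_centers_):
    members = points[km_toy.labels_ == i]
    wcss_manual += np.sum((members - mu) ** 2)

# Both values are expected to agree up to floating-point error.
print(wcss_manual, km_toy.inertia_)
```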

In [None]:
k = 2

km = KMeans(k).fit(X)

labels = km.labels_
centers = km.cluster_centers_
wcss = km.inertia_

plot_clusters(X, labels=labels, centers=centers)

print(f'Number of clusters = {k} => WCSS = {wcss:.2f}')
labels
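
The initial cluster centers are chosen randomly, so repeated runs can number the clusters differently or, on difficult data, end in different solutions. A minimal sketch on made-up data shows how the `random_state` and `n_init` parameters of `KMeans` can be used to make a run repeatable; the values 0 and 10 are arbitrary choices.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D data with two well-separated groups of 20 points each.
rng = np.random.default_rng(3)
points = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
                    rng.normal(5.0, 0.3, size=(20, 2))])

# n_clusters sets k; n_init is the number of random restarts,
# random_state fixes the seed of the random initialization.
km_a = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
km_b = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# With the same seed both runs produce the same labels.
print(np.array_equal(km_a.labels_, km_b.labels_))
```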

In [None]:
# Plot the standardized patterns of each cluster in a separate subplot
colors = sns.color_palette('magma', k)
fig = plt.figure(figsize=(12, k * 4))
for ii in range(k):
    idx = s_pattern.columns[labels == ii]  # meters assigned to cluster ii
    subset = s_pattern[idx]
    ax = fig.add_subplot(k + 1, 1, ii + 1)
    plot_patterns(subset, ax=ax, color=colors[ii])
    plt.title(f'Number of Patterns={len(idx)}', fontsize=16)
plt.show()

## The ideal number of clusters - Elbow method

How should we decide on the optimal number of clusters in the data, that is, which value we should choose for $k$?

One possibility is visual inspection, but that is highly subjective. Furthermore, the data you work with will often have many dimensions, which makes it difficult to visualize. As a consequence, the optimal number of clusters $k$ is no longer obvious. Fortunately, there is a way to estimate it quantitatively: the elbow method.

To decide on the number of clusters, we train multiple models using different values of $k$ and store the value of the WCSS (`inertia_` attribute) every time.

Of course, this value gets smaller the more cluster centers we use, so we look for the $k$ beyond which adding another cluster reduces the WCSS only slightly (the "elbow" of the curve).

In [None]:
wcss = []
c_numbers = np.arange(2, 16)

for k in c_numbers:
    
    km = KMeans(n_clusters=k).fit(X)
    
    wcss.append(km.inertia_)
    
    print(f'{k} Clusters -> WCSS={km.inertia_}')

wcss = pd.Series(wcss, index=c_numbers)

In [None]:
wcss.plot(marker='o', color='k', linewidth=2)
plt.xlabel('k', fontsize=18);
plt.ylabel('WCSS', fontsize=18);
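
Reading the elbow off the plot is still a visual judgment. One possible heuristic, shown here only as a sketch, picks the point of the rescaled WCSS curve that lies farthest below the straight line joining its first and last point. The helper `find_elbow` is hypothetical (it is not part of `scikit-learn`), and the data consists of three made-up blobs, for which an elbow at $k = 3$ is expected.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans


def find_elbow(wcss):
    """Hypothetical helper: return the k whose point lies farthest below the
    chord joining the first and last point of the rescaled WCSS curve."""
    k = wcss.index.to_numpy(dtype=float)
    w = wcss.to_numpy(dtype=float)
    # Rescale both axes to [0, 1] so that neither axis dominates.
    k_n = (k - k[0]) / (k[-1] - k[0])
    w_n = (w - w[-1]) / (w[0] - w[-1])
    # The chord runs from (0, 1) to (1, 0), i.e. along k_n + w_n = 1.
    gap = 1.0 - (k_n + w_n)
    return wcss.index[int(np.argmax(gap))]


# Made-up data: three well-separated blobs of 40 points each.
rng = np.random.default_rng(5)
blob_centers = np.array([[0.0, 0.0], [6.0, 0.0], [3.0, 6.0]])
points = np.vstack([c + rng.normal(0.0, 0.4, size=(40, 2)) for c in blob_centers])

ks = np.arange(1, 9)
wcss_toy = pd.Series(
    [KMeans(n_clusters=k, n_init=10, random_state=0).fit(points).inertia_ for k in ks],
    index=ks)

print(find_elbow(wcss_toy))
```

On real data the curve is often smoother than in this example, so the result of such a heuristic should be checked against the plot rather than trusted blindly.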

In [None]:
labels

In [None]:
data = pattern.T

In [None]:
data

In [None]:
data.index[labels==1]

In [None]:
# Attach the cluster labels from the k = 2 run above and store the labelled patterns
data['label'] = labels

data.to_csv('data/labelled_data.csv')