# Unsupervised learning example: Clustering daily consumptions

In unsupervised learning, the classic task is **cluster analysis** in which hidden patterns or groups are found in the data. Most of the time unsupervised learning tasks have an *open solution*, so you have to interpret the results and check if they make sense.

**Objective:** This example uses 15-minute electricity consumption data recorded over one year for households in Austin, USA. The objective is to find the optimal number of clusters to group the different daily consumption patterns of one household throughout the year. The data contains multiple households, so one must be selected (id=9922).

**Context:** This example presents an unsupervised learning problem in which different clustering algorithms and evaluation metrics are used and compared.

### Before we start:

* The file **15minute_data_austin_.csv** contains the input dataset for this example (attributes).

## **1. Import libraries and data**

In [None]:
# Import libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import warnings

# To suppress all warnings
warnings.filterwarnings("ignore")

# Select needed columns
columns_to_use = ['dataid', 'local_15min', 'grid']
df_activepower = pd.read_csv('Data/data_austin.csv', sep=';', usecols=columns_to_use)
df_activepower

## **2. Understanding the data**

It is necessary to visualize and understand the data we are going to work with, as well as to know its characteristics.

1. How much data is there? How many attributes are there in the data?  
2. What do they mean?
3. Is there any missing data?
4. Statistical summary of the input data set.

In [None]:
# Dimension of input data (rows x columns)
df_activepower.shape

In [None]:
# Let's see what the data looks like
df_activepower.head()

In [None]:
df_activepower.dtypes

**2. What do they mean?** 

* **[dataid]**: numeric identifier of each household
* **[local_15min]**: local date and time of each 15-minute reading
* **[grid]**: power consumed in each period [kW]

In [None]:
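# Select the household with id = 9922 for this example.
# .copy() makes an independent DataFrame, so adding columns below does not
# trigger pandas' SettingWithCopyWarning.
df_household = df_activepower.loc[df_activepower['dataid'] == 9922].copy()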

# Transform local_15min to datetime format with .to_datetime()
df_household['datetime'] = pd.to_datetime(df_household['local_15min'], format='%d/%m/%Y %H:%M')
print(df_household)

In [None]:
df_household.dtypes

In [None]:
df_household.head()

In [None]:
# Remove household dataid and local_15min column
df_household = df_household.drop(['dataid', 'local_15min'], axis=1)
df_household

In [None]:
# Convert the column 'datetime' to index.
df_household = df_household.set_index('datetime')
df_household

In [None]:
# Show the new row x column data dimensions
df_household.shape

In [None]:
# Check whether there is any categorical data to be transformed
df_household.dtypes

**3. Is any data missing?** Check whether any values are missing by counting the empty cells in each column.
In this case, no data is missing in the input data set (there are no *NaN* values).

In [None]:
df_household.isna().sum()

**4. Summary statistics of the input data set:** Descriptive statistics summarize the characteristics of the input data set through the following measures: number of observations (count), mean (mean), standard deviation (std), minimum value (min), maximum value (max) and the quartiles (25%, 50%, 75%).

In [None]:
# Evaluate the nature of the data with descriptive statistics.
df_household.describe()

## **3. Visualize the data**

A visual way to understand the input data. 
1. Histogram
2. Density curve
3. Boxplots

**1. Histogram**

Graphical representation of each of the attributes in the form of bars, where the surface of the bar is proportional to the frequency of the values represented.

In [None]:
histogram = df_household.hist(xlabelsize=10, ylabelsize=10, bins=300, figsize=(10, 10))

**2. Density curve**

Visualizes the distribution of the data. It is a variant of the histogram that smooths out noise, so it is better suited to determining the shape of an attribute's distribution. Peaks in the density curve show where values are most concentrated.

In [None]:
density = df_household.plot(kind='kde', legend=True, layout=(1, 1), figsize=(10, 10),
                        fontsize=16, stacked=True) 

**3. Boxplots** 

The boxplot allows us to identify outliers and compare distributions. In addition, we know how 50% of the values are distributed (inside the box). 
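
As a minimal sketch of how outliers are flagged, assuming the usual boxplot convention (whiskers at 1.5 times the interquartile range, which is the seaborn default), the bounds can be computed by hand. The readings below are made-up values, not taken from the dataset.

```python
import numpy as np

# Hypothetical power readings in kW (made-up values, not from the dataset)
values = np.array([0.2, 0.3, 0.3, 0.4, 0.5, 0.6, 0.8, 1.0, 4.5])

# The box spans the first to the third quartile (the middle 50% of the values)
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Points farther than 1.5 * IQR from the box are drawn as outliers
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = values[(values < lower_bound) | (values > upper_bound)]
print(outliers)
```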

In [None]:
sns.set(style="whitegrid")
ax = sns.boxplot(x=df_household["grid"])

## **4. Prepare the data**

1. Data cleaning and restructuring
2. Transform


**1. Data cleaning and restructuring**

Resample the data to obtain the average hourly power. Each row will represent one hour.
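
To illustrate what the hourly averaging does, here is a small sketch on a made-up series of eight 15-minute readings. The names `toy` and `toy_hourly` and the values are hypothetical; the frequency is written as `"60min"` only to keep the example independent of the pandas version.

```python
import pandas as pd

# Hypothetical 15-minute readings covering two hours (made-up values)
index = pd.date_range("2018-01-01 00:00", periods=8, freq="15min")
toy = pd.DataFrame({"grid": [1.0, 2.0, 3.0, 4.0, 2.0, 2.0, 2.0, 2.0]}, index=index)

# Each hourly bin averages the four 15-minute readings that fall inside it
toy_hourly = toy.resample("60min").mean()
print(toy_hourly)
```

The first hour averages 1, 2, 3 and 4 kW, and the second hour averages four readings of 2 kW.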

In [None]:
df_household_hourly = df_household.resample('H').mean()
df_household_hourly

In [None]:
# Create a new column with the hour
df_household_hourly['hour'] = df_household_hourly.index.hour
df_household_hourly

In [None]:
# The new index now contains only the date (the time of day is kept in the 'hour' column).
df_household_hourly.index = df_household_hourly.index.date
df_household_hourly.head()

In [None]:
# Pivot the table: one row per day, one column per hour (0-23)
df_household_pivot = df_household_hourly.pivot(columns='hour')
# Drop days with missing hours so that every daily profile has 24 values
df_household_pivot = df_household_pivot.dropna()

df_household_pivot.head()
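
The reshaping above can be sketched on a tiny made-up table with two days and three hours per day. The names `toy` and `toy_pivot` and the values are hypothetical; the steps mirror the cells above.

```python
import pandas as pd

# Hypothetical hourly table: two days, three hours each (made-up values)
index = pd.date_range("2018-01-01 00:00", periods=3, freq="60min").append(
    pd.date_range("2018-01-02 00:00", periods=3, freq="60min"))
toy = pd.DataFrame({"grid": [0.5, 0.7, 0.6, 1.0, 1.2, 1.1]}, index=index)

# Same preparation as above: keep the hour as a column, the date as index
toy["hour"] = toy.index.hour
toy.index = toy.index.date

# Rows become days, columns become (variable, hour) pairs
toy_pivot = toy.pivot(columns="hour")
print(toy_pivot)
```

Each row of `toy_pivot` is one daily profile, which is the unit that the clustering algorithm will later group.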

### Plot the transformed data
Each line shows the hourly consumption for one day of the year. 

In [None]:
# Hourly consumption. Dataframe has to be transposed.
df_household_pivot.T.plot(figsize=(18, 8), title='Daily Consumption', legend=False, color='blue', alpha=0.04)

**2. Transformation**

The data is scaled with *MinMaxScaler()*, which rescales each column (here, each hour of the day) to the range [0, 1].
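
As a rough sketch of what min-max scaling computes with its default settings, each column is rescaled independently as (x - min) / (max - min). The array below is made up for illustration and is not the notebook's data.

```python
import numpy as np

# Hypothetical feature matrix: 3 samples, 2 features (made-up values)
X_toy = np.array([[1.0, 10.0],
                  [2.0, 30.0],
                  [3.0, 20.0]])

# Column-wise minimum and maximum
col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)

# Every column now runs from 0 (its minimum) to 1 (its maximum)
X_toy_scaled = (X_toy - col_min) / (col_max - col_min)
print(X_toy_scaled)
```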

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X = df_household_pivot.values.copy()
X_scaled = pd.DataFrame(scaler.fit_transform(X))
X_scaled.head()

## **5. Unsupervised Learning Model Building: Data Clustering using K-means**

The data are grouped using the [K-Means](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) algorithm, and the grouping is evaluated with the [silhouette_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html) metric. K-means must be told in advance how many clusters to form. The algorithm is therefore run for several candidate numbers of clusters and the results are compared using silhouette_score; the number of clusters with the highest score is taken as the optimal one.

### Optimal number of clusters: Silhouette Coefficient
The Silhouette Coefficient is used, where the best value is 1 and the worst value is -1. Values close to 0 indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster, since a different cluster is more similar. Check documentation here: [sklearn.metrics.silhouette_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html)
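
For a single sample, the coefficient is s = (b - a) / max(a, b), where a is the mean distance to the other points of its own cluster and b is the mean distance to the points of the nearest other cluster. The sketch below works this out by hand for one point, using made-up one-dimensional values; `silhouette_score` reports the mean of this quantity over all samples.

```python
import numpy as np

# Hypothetical 1-D points already split into two clusters (made-up values)
cluster_a = np.array([1.0, 2.0, 3.0])
cluster_b = np.array([8.0, 9.0])

point = cluster_a[0]          # evaluate the sample located at 1.0
others_same = cluster_a[1:]   # remaining members of its own cluster

a = np.abs(others_same - point).mean()  # mean intra-cluster distance
b = np.abs(cluster_b - point).mean()    # mean distance to the other cluster
s = (b - a) / max(a, b)
print(a, b, s)
```

Here a = 1.5 and b = 7.5, so s = 0.8: the point is much closer to its own cluster than to the other one.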

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

silhouette_scores = []

# Evaluate the K-means algorithm for a range of [2,15] clusters. 
n_cluster_list = np.arange(2, 16).astype(int)

In [None]:
# Iteration to evaluate K-means for different numbers of clusters (n_clusters)
for n_cluster in n_cluster_list:
    kmeans = KMeans(n_clusters=n_cluster, random_state=0)
    cluster_found = kmeans.fit_predict(X_scaled)
    # Mean silhouette over all days for this number of clusters
    silhouette_scores.append(silhouette_score(X_scaled, cluster_found))

In [None]:
silhouette_scores

In [None]:
silhouette_metric = pd.DataFrame(index=n_cluster_list, columns=['silhouette_score'], data=silhouette_scores)
plt.plot(silhouette_metric, marker='o') 
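
Besides reading the peak off the plot, the best candidate can be picked programmatically. The sketch below uses made-up scores, not results from this dataset; in the notebook the corresponding names would be `n_cluster_list` and `silhouette_scores`.

```python
import numpy as np

# Hypothetical scores for k = 2..6 (made-up values, not results from this dataset)
k_values = np.arange(2, 7)
scores = [0.31, 0.42, 0.38, 0.29, 0.25]

# The candidate with the highest mean silhouette is taken as the best one
best_k = int(k_values[np.argmax(scores)])
print(best_k)
```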

In [None]:
# Train the K-means for the optimal number of clusters given the result of the Silhouette method.


kmeans = KMeans(n_clusters= , random_state=1990)  # write here the optimal number of clusters
cluster_found = kmeans.fit_predict(X_scaled)
cluster_found_sr = pd.Series(cluster_found, name='cluster')


In [None]:

# Create a MultiIndex of the form (date, cluster to which the day belongs)
df_household_pivot_clusters = df_household_pivot.set_index(cluster_found_sr, append=True)
df_household_pivot_clusters.index

In [None]:
df_household_pivot_clusters.head(20)

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(14,5))
color_list = ['blue',  'red']
cluster_values = sorted(df_household_pivot_clusters.index.get_level_values('cluster').unique())
print(cluster_values)

for cluster, color in zip(cluster_values, color_list):
    # plot every daily profile of this cluster
    df_household_pivot_clusters.xs(cluster, level=1).T.plot(ax=ax, legend=False, alpha=0.05, color=color)
    # plot the mean consumption of each cluster
    df_household_pivot_clusters.xs(cluster, level=1).mean().plot(ax=ax, color=color, legend=False, alpha=0.8, ls='--')

ax.set_ylabel('Average hourly power [kW]')
ax.set_xlabel('Hours')

Looking at the graph above, K-means has found clusters with the following characteristics:
* One of the clusters concentrates the highest consumption patterns with the highest consumption peaks.
* The other concentrates a lower average hourly power consumption.

## Validating results with Dimensionality Reduction (PCA)
Principal Component Analysis (PCA) is a statistical method that projects high-dimensional data onto a few directions (the principal components) chosen to retain as much of the variance as possible. Here the number of features is reduced from 24 (one per hour) to 2.
One way to validate the results of the clustering algorithm is with dimensionality reduction techniques. Note that PCA knows nothing about the groups found by K-means.
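
How much information a 2-D projection keeps can be checked through the share of variance explained by the first two components (scikit-learn exposes this as `explained_variance_ratio_` on a fitted `PCA`). The sketch below reproduces the idea with a plain NumPy SVD on synthetic data; the data and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 100 samples, 5 features driven mostly by 2 latent factors
latent = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 5))
X_toy = latent @ mixing + 0.05 * rng.normal(size=(100, 5))

# PCA works on centered data; the rows of `components` are the principal directions
X_centered = X_toy - X_toy.mean(axis=0)
_, singular_values, components = np.linalg.svd(X_centered, full_matrices=False)

# Share of the total variance carried by each component
explained_variance_ratio = singular_values**2 / np.sum(singular_values**2)

# Coordinates of every sample on the first two components
projection = X_centered @ components[:2].T
print(explained_variance_ratio[:2].sum())
```

When the first two ratios sum to a value close to 1, distances in the 2-D scatter plot are a faithful picture of distances in the original space; when they do not, the plot should be read with more caution.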

In [None]:
from sklearn.decomposition import PCA
import matplotlib.colors

pca = PCA(n_components=2, random_state=1990)
results_pca = pca.fit_transform(X_scaled)
# Discrete colormap: cluster label 0 -> color_list[0], label 1 -> color_list[1]
cmap = matplotlib.colors.ListedColormap(color_list)

plt.scatter(results_pca[:, 0], results_pca[:, 1],
            c=df_household_pivot_clusters.index.get_level_values('cluster'),
            cmap=cmap,
            alpha=0.4,
            )
plt.show()

In [None]:
results_pca

In the graph above, each point represents a daily consumption profile. PCA approximately preserves the distances between points, to the extent that the first two components capture most of the variance, so points that are close together have similar daily consumption profiles.

The points are colored using the K-means results in order to evaluate the clustering. The fact that most points of the same color (blue or red) lie close together is a good indication that the clustering is meaningful.

## EXERCISE: Try the Elbow method and see if it is similar.
Elbow Method [example](https://localcoder.org/scikit-learn-k-means-elbow-criterion)
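
As a starting hint, the Elbow method plots the within-cluster sum of squared distances (exposed by scikit-learn's `KMeans` as `inertia_`) against the number of clusters and looks for the point where the curve bends. The sketch below only shows how that quantity is computed for one given labeling, using made-up points and hand-assigned labels.

```python
import numpy as np

# Hypothetical 2-D points with a hand-assigned labeling (made-up values)
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])

# Sum of squared distances from every point to the centroid of its cluster
inertia = 0.0
for k in np.unique(labels):
    members = points[labels == k]
    centroid = members.mean(axis=0)
    inertia += ((members - centroid) ** 2).sum()
print(inertia)
```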

In [None]:
# Let's use the same scaled dataset to compare the Silhouette and Elbow methods
X_scaled