<a href="https://colab.research.google.com/github/ramonVDAKKER/teaching-data-science-emas/blob/develop/notebooks/demo_clustering_with_k_means.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Cluster Analysis | $k$-means

### Import Packages

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

### Load dataset

House Prices dataset. Source: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data?select=train.csv

In [None]:
data_url = "https://raw.githubusercontent.com/ankita1112/House-Prices-Advanced-Regression/master/train.csv"
data_df = pd.read_csv(data_url)
pd.options.display.max_columns = 100
data_df.head(5)

# $k$-means

Consider a scatter plot of two variables. The question arises whether there are <i>clusters</i> of observations. The rough, general idea is to find groups of observations (clusters) such that the 'variation' within each cluster is small and the 'variation' between clusters is large.

In [None]:
plt.scatter(data_df["1stFlrSF"], data_df["2ndFlrSF"])

$k$-means is a specific method to obtain clusters. Applying the algorithm requires a choice for the number of clusters $k$.

In [None]:
obj_kmeans = KMeans(n_clusters=4).fit(data_df[["1stFlrSF", "2ndFlrSF"]])

As output we obtain the <i>cluster centroids</i>, i.e. the means of the observations in each cluster, which are depicted in red in the figure below.

In [None]:
plt.scatter(data_df["1stFlrSF"], data_df["2ndFlrSF"])
# Plot all centroids in one call: column 0 is 1stFlrSF, column 1 is 2ndFlrSF
centroids = obj_kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1], s=50, c="r", marker="s")
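
The statement that each centroid is the mean of the observations in its cluster can be checked numerically. The following is a minimal sketch on synthetic data; `toy_df`, `toy_kmeans` and the column names `x` and `y` are hypothetical stand-ins for the house-price columns, and `n_init` and `random_state` are set only to make the toy run reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the two floor-area columns (an assumption, not the real data).
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    "x": np.concatenate([rng.normal(0, 1, 50), rng.normal(10, 1, 50)]),
    "y": np.concatenate([rng.normal(0, 1, 50), rng.normal(10, 1, 50)]),
})

toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_df)

# Average the observations within each cluster label.
group_means = toy_df.groupby(toy_kmeans.labels_).mean()

# Compare the per-cluster averages with the fitted centroids.
centroids_are_means = np.allclose(group_means.to_numpy(), toy_kmeans.cluster_centers_)
print(centroids_are_means)
```

The same comparison should carry over to `obj_kmeans` and `data_df[["1stFlrSF", "2ndFlrSF"]]`, provided the algorithm has fully converged.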

Each observation is assigned to the 'closest' centroid. Closeness is measured by the Euclidean distance, i.e. the length of the straight line between the observation and the centroid.

In [None]:
obj_kmeans.labels_
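
The assignment rule can be reproduced by hand: compute the Euclidean distance from every observation to every centroid and pick the nearest one. The sketch below does this on synthetic data; the names `points`, `fitted`, `distances` and `nearest` are illustrative and do not come from the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

# Three synthetic, well-separated groups of points (an assumption for illustration).
rng = np.random.default_rng(1)
points = np.vstack([
    rng.normal(loc=[0, 0], scale=1.0, size=(40, 2)),
    rng.normal(loc=[8, 8], scale=1.0, size=(40, 2)),
    rng.normal(loc=[0, 8], scale=1.0, size=(40, 2)),
])
fitted = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)

# Euclidean distance from every point to every centroid: shape (n_points, n_clusters).
diffs = points[:, np.newaxis, :] - fitted.cluster_centers_[np.newaxis, :, :]
distances = np.sqrt((diffs ** 2).sum(axis=2))

# Each point gets the index of its nearest centroid.
nearest = distances.argmin(axis=1)
print(np.array_equal(nearest, fitted.labels_))
```

Broadcasting keeps the distance computation free of explicit loops, which is convenient for datasets of this size.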

Let us visualize these labels in the scatter plot.

In [None]:
# Map each cluster label to a colour and draw all points in a single call
label_colors = {0: "g", 1: "r", 2: "b", 3: "c"}
point_colors = [label_colors[label] for label in obj_kmeans.labels_]
plt.scatter(data_df["1stFlrSF"], data_df["2ndFlrSF"], c=point_colors)

The $k$-means algorithm is only guaranteed to converge to a local optimum, and which one it reaches depends on the random initialization of the centroids. As a result, the outcome of $k$-means on a fixed dataset is not deterministic.

In [None]:
print("Cluster means:")
print(obj_kmeans.cluster_centers_)
obj_kmeans2 = KMeans(n_clusters=4).fit(data_df[["1stFlrSF", "2ndFlrSF"]])
print("\nCluster means after retraining k-means on same dataset:")
print(obj_kmeans2.cluster_centers_)
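
When reproducible results are needed, the randomness can be pinned by passing a fixed integer `random_state` to `KMeans`. The sketch below uses synthetic data; the seed value 42 and the names `sample`, `first` and `second` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data without clear cluster structure (an assumption for illustration).
rng = np.random.default_rng(2)
sample = rng.uniform(0, 100, size=(200, 2))

# Two independent fits that share the same seed for the centroid initialization.
first = KMeans(n_clusters=4, n_init=10, random_state=42).fit(sample)
second = KMeans(n_clusters=4, n_init=10, random_state=42).fit(sample)

# With an identical seed, both fits should yield the same centroids.
same_centroids = np.allclose(first.cluster_centers_, second.cluster_centers_)
print(same_centroids)
```

The `n_init` parameter controls how many random initializations are tried; scikit-learn keeps the run with the lowest inertia, which reduces the impact of an unlucky start.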

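But how to select the number of clusters $k$? A popular technique is the elbow method: compute the 'fit' of $k$-means as a function of $k$ and select the $k$ at which the curve bends. Here the fit is measured by the inertia, i.e. the sum of squared distances of the observations to their closest centroid.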

In [None]:
error = []
K = 20  # max number of clusters to be considered
ks = range(1, K + 1)  # include K itself
for k in ks:
    obj_kmeans = KMeans(n_clusters=k).fit(data_df[["1stFlrSF", "2ndFlrSF"]])
    # inertia_: sum of squared distances to the closest centroid
    error.append(obj_kmeans.inertia_)
plt.plot(ks, error, marker="*")
plt.xlabel("Number of clusters")
plt.ylabel("Error (inertia)")
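
The error plotted on the vertical axis is the `inertia_` attribute, which scikit-learn defines as the sum of squared distances of the observations to their closest centroid. As a minimal sketch on synthetic data (the names `cloud`, `model` and `manual_inertia` are illustrative), the quantity can be recomputed by hand:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two synthetic groups of points (an assumption for illustration).
rng = np.random.default_rng(3)
cloud = np.vstack([
    rng.normal(loc=[0, 0], scale=1.0, size=(60, 2)),
    rng.normal(loc=[6, 6], scale=1.0, size=(60, 2)),
])
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(cloud)

# Look up, for every point, the centroid of the cluster it was assigned to.
own_centroids = model.cluster_centers_[model.labels_]

# Sum of squared Euclidean distances to those centroids.
manual_inertia = ((cloud - own_centroids) ** 2).sum()
print(np.isclose(manual_inertia, model.inertia_))
```

Because adding clusters can only bring observations closer to some centroid, the best achievable inertia decreases as $k$ grows, which is why the elbow rather than the minimum is used to choose $k$.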

Although clustering can already be useful for 2-dimensional data, it becomes especially useful in higher dimensions, where the data can no longer be inspected in a single scatter plot. Interpretation of the results also becomes more difficult, but inspecting the centroids is a good first step.

In [None]:
names_cols = ["1stFlrSF", "2ndFlrSF", "TotalBsmtSF", "SalePrice"]
obj_kmeans = KMeans(n_clusters=4).fit(data_df[names_cols])
pd.DataFrame(obj_kmeans.cluster_centers_, columns=names_cols)
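
When reading a table of centroids, it may help to also see how many observations fall into each cluster. The following sketch adds such a count on synthetic data; the column names `a`, `b`, `c` and the names `toy`, `km`, `summary` and `n_obs` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic three-dimensional data with one small and one large group (an assumption).
rng = np.random.default_rng(4)
cols = ["a", "b", "c"]  # hypothetical column names
toy = pd.DataFrame(
    np.vstack([
        rng.normal(0, 1, size=(30, 3)),
        rng.normal(10, 1, size=(70, 3)),
    ]),
    columns=cols,
)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy[cols])

# One row per cluster: the centroid coordinates plus the number of assigned observations.
summary = pd.DataFrame(km.cluster_centers_, columns=cols)
summary["n_obs"] = np.bincount(km.labels_, minlength=km.n_clusters)
print(summary)
```

A centroid that represents only a handful of observations deserves a different reading than one that summarizes most of the dataset.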