In [None]:
import numpy as np
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt

from sklearn import datasets
from sklearn.model_selection import train_test_split
# new import statements
from sklearn.cluster import KMeans

# Unsupervised Machine Learning: Clustering

- In classification (supervised), we try to find boundaries/rules to separate points according to pre-determined labels.
- In clustering (unsupervised), the algorithm chooses the labels. The goal is to choose labels so that similar rows get the same label.

### K-Means Clustering

- K: number of clusters:
    - 3-Means => 3 clusters
    - 4-Means => 4 clusters, and so on
- Means: we will find centroids (aka means aka averages) to create clusters

#### Iterative algorithm for K-Means

Animation of the iterative K-Means algorithm: https://www.youtube.com/watch?v=5I3Ei69I40s

In [None]:
# Generate random data
x, y = datasets.make_blobs(n_samples=100, centers=3, cluster_std=1.2, random_state=3)
df = pd.DataFrame(x, columns=["x0", "x1"])
df.head()

In [None]:
def km_scatter(df, **kwargs):
    """
    Produces a scatter plot with x0 on the x-axis and x1 on the y-axis.
    It can also plot the centroids for clusters.
    Columns used:
        x0 => x-axis
        x1 => y-axis
        label => marker type (optional; "?" is used when the column is missing)
    """
    ax = kwargs.pop("ax", None)
    if "label" not in df.columns:
        return df.plot.scatter(x="x0", y="x1", marker="$?$", ax=ax, **kwargs)

    for marker in set(df["label"]):
        sub_df = df[df["label"] == marker]
        ax = sub_df.plot.scatter(x="x0", y="x1", marker=marker, ax=ax, **kwargs)
    return ax

ax = km_scatter(df, s=100, c="0.7")

### Hard Problem

Finding the best answer. What is the answer? Determining the centroids of the clusters.

### Easier Problem

Take a random answer and make it a little better. Then repeat!
Downside? If randomization leads to a very bad initial choice of centroids, the result can be a bad clustering (for example, a centroid that never attracts any points, which effectively leaves fewer clusters).
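
To see the downside concretely, the sketch below runs scikit-learn's `KMeans` several times on the same blobs, each time from a single random start (`init="random"`, `n_init=1`), and records the resulting inertia (the sum of squared distances from points to their closest centroid). The seed values are arbitrary choices for illustration; a noticeably larger inertia for some seed would point to a worse final clustering.

```python
from sklearn import datasets
from sklearn.cluster import KMeans

# Same synthetic data as in the notes
x, _ = datasets.make_blobs(n_samples=100, centers=3, cluster_std=1.2, random_state=3)

inertias = {}
for seed in range(5):  # seeds are arbitrary, for illustration only
    # init="random" picks the starting centroids at random from the data points;
    # n_init=1 turns off the usual "try several starts and keep the best" safeguard
    model = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed)
    model.fit(x)
    inertias[seed] = model.inertia_

print(inertias)
```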

In [None]:
clusters = np.random.uniform(-5, 5, size=(3, 2))
clusters = pd.DataFrame(clusters, columns=["x0", "x1"])
clusters["label"] = ["o", "+", "x"]

ax = km_scatter(df, s=100, c="0.7")
km_scatter(clusters, s=200, c="red", ax=ax)

Two variables for us to deal with:
1. clusters: contains location of centroids and a label for them
2. df: contains the actual data points

In [None]:
clusters

In [None]:
df.head()

In [None]:
class KM:
    def __init__(self, df, clusters):
        # We make copies because we are going to keep changing the dataframe to 
        # identify better clusters
        pass
        
    def plot(self):
        pass
        
    def assign_points(self):
        """
        compute the Euclidean distance between each point and each centroid,
        then assign each point to its closest centroid
        """
        pass
    
    def update_centers(self):
        """
        update centroids by taking mean of the points that are nearest to that
        particular centroid
        """
        pass

"""
High-level algorithm:
1. Start with random locations for centroids.
2. Iterate over each data point:
    1. Find the distance (Euclidean distance) between the current data point and each centroid.
    2. Find the minimum of those distances and the corresponding label.
    3. Assign the current data point to the closest cluster centroid label.
3. Once all points are assigned, compute a new centroid for each cluster. Iterate over
   each cluster:
    1. Extract the subset of data points that got assigned to the current cluster label.
    2. Compute the mean of all the assigned data points.
    3. Update the cluster centroid.
4. Repeat steps 2 and 3 many times (iterative improvement).
"""

# Creating object instance
km = KM(df, clusters)
km.plot()

# for i in range(10):
#     km.assign_points()
#     km.update_centers()

# km.plot()
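
The high-level algorithm above can be sketched as follows. This is one possible way to fill in the skeleton, not the official solution: the class name `KMSketch`, the fixed seed, and the decision to leave an empty cluster's centroid where it is are all assumptions made for illustration, and the `plot` method is omitted.

```python
import numpy as np
import pandas as pd
from sklearn import datasets


class KMSketch:
    """Hypothetical, minimal take on the KM skeleton (assign + update only)."""

    def __init__(self, df, clusters):
        # Copy so that the caller's DataFrames are not modified
        self.df = df[["x0", "x1"]].copy()
        self.clusters = clusters.copy()
        self.df["label"] = None

    def assign_points(self):
        points = self.df[["x0", "x1"]].to_numpy(dtype=float)         # shape (n, 2)
        centers = self.clusters[["x0", "x1"]].to_numpy(dtype=float)  # shape (k, 2)
        # Euclidean distance from every point to every centroid, shape (n, k)
        diffs = points[:, None, :] - centers[None, :, :]
        dists = np.sqrt((diffs ** 2).sum(axis=2))
        # Each point takes the label of its closest centroid
        nearest = dists.argmin(axis=1)
        self.df["label"] = self.clusters["label"].to_numpy()[nearest]

    def update_centers(self):
        for idx in self.clusters.index:
            label = self.clusters.at[idx, "label"]
            members = self.df[self.df["label"] == label]
            if len(members) == 0:
                continue  # assumption: an empty cluster keeps its old centroid
            self.clusters.at[idx, "x0"] = members["x0"].mean()
            self.clusters.at[idx, "x1"] = members["x1"].mean()


# Same synthetic data as in the notes
x, _ = datasets.make_blobs(n_samples=100, centers=3, cluster_std=1.2, random_state=3)
df = pd.DataFrame(x, columns=["x0", "x1"])

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
clusters = pd.DataFrame(rng.uniform(-5, 5, size=(3, 2)), columns=["x0", "x1"])
clusters["label"] = ["o", "+", "x"]

km_sketch = KMSketch(df, clusters)
for _ in range(10):
    km_sketch.assign_points()
    km_sketch.update_centers()

print(km_sketch.clusters)
print(km_sketch.df["label"].value_counts())
```

Vectorizing the distance computation with broadcasting replaces the explicit loop over data points in step 2; the result is the same assignment, computed for all points at once.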

### `sklearn KMeans`

- import statement:
```python
from sklearn.cluster import KMeans
```
- documentation: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

**Instantiation:**
`KMeans(n_clusters=<num>, n_init=<num>, max_iter=<num>)`
- `n_clusters`: number of clusters to be formed
- `n_init`: number of initial random seeds to try (to avoid downside of bad initial random choices)
- `max_iter`: maximum number of iterations for a single K-means run (single starting seed)

In [None]:
km_cluster = ???
km_cluster

In [None]:
df.head()

**Methods:**
1. `fit`: find good centroids
2. `transform`: give me the distances from each point to each centroid
3. `predict`: give me the chosen group labels

**Attributes:**
- `<km object>.cluster_centers_`: coordinates of cluster centers
- `<km object>.inertia_`: sum of squared distances of samples to their closest cluster center
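
A minimal sketch of how these methods and attributes fit together on the blob data from earlier. The variable name `model` and the `n_init` and `random_state` values are illustrative assumptions, not values prescribed by these notes.

```python
import pandas as pd
from sklearn import datasets
from sklearn.cluster import KMeans

x, _ = datasets.make_blobs(n_samples=100, centers=3, cluster_std=1.2, random_state=3)
df = pd.DataFrame(x, columns=["x0", "x1"])

# n_init and random_state are illustrative choices
model = KMeans(n_clusters=3, n_init=10, random_state=0)
model.fit(df)                      # find good centroids

centers = model.cluster_centers_   # one row per cluster, one column per feature
distances = model.transform(df)    # one row per point, one column per centroid
labels = model.predict(df)         # index of the closest centroid for each point

print(centers.shape, distances.shape, labels.shape)
# The predicted label should be the column with the smallest distance in each row
print((labels == distances.argmin(axis=1)).all())
```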

In [None]:
# `fit`: find good centroids
km_cluster.???
# coordinates of cluster centers
km_cluster.???

**Observation:** 3 rows (because we have 3 clusters), and 2 columns (because the df had 2 columns).

In [None]:
# `transform`: give me the distances from each point to each centroid
km_cluster.???

**Observations**: Each row corresponds to a row in df. The 3 columns hold the distances to the 3 centroids.

In [None]:
# `predict`: give me the chosen group labels
km_cluster.???

### How many clusters do we need?

- metric: `<km object>.inertia_`: sum of squared distances of samples to their closest cluster center

In [None]:
km_cluster.???

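**Observation**: for a given number of clusters, we want "inertia" to be as small as possible. Inertia generally keeps shrinking as clusters are added, so we cannot pick `n_clusters` simply by minimizing it.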
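
To make the definition concrete, the sketch below recomputes inertia by hand from the `transform` distances and compares it with the `inertia_` attribute. The names `model` and `manual_inertia` and the `n_init` and `random_state` values are assumptions for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans

x, _ = datasets.make_blobs(n_samples=100, centers=3, cluster_std=1.2, random_state=3)
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(x)

# Distance from each point to its closest centroid
closest = model.transform(x).min(axis=1)
# Inertia: sum of the squared closest distances
manual_inertia = (closest ** 2).sum()

print(manual_inertia, model.inertia_)
print(np.isclose(manual_inertia, model.inertia_))
```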

### Elbow plot to determine `n_clusters`

In [None]:
s = pd.Series(dtype=float)

for num_clusters in range(1, 11):
    ???
s

In [None]:
ax = s.plot.line(figsize=(6, 4))
ax.set_ylabel("Inertia")
ax.set_xlabel("Number of clusters")

**Observation**: there is an "elbow" around `n_clusters=3`.
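
One way to read the elbow numerically rather than by eye is to look at how much inertia drops each time a cluster is added. The sketch below is an assumption-laden illustration (the `n_init` and `random_state` values and the name `inertia_by_k` are not from these notes): large drops before the elbow and small drops after it are what the plot shows visually.

```python
import pandas as pd
from sklearn import datasets
from sklearn.cluster import KMeans

x, _ = datasets.make_blobs(n_samples=100, centers=3, cluster_std=1.2, random_state=3)
df = pd.DataFrame(x, columns=["x0", "x1"])

inertia_by_k = {}
for num_clusters in range(1, 11):
    # n_init and random_state are illustrative choices
    model = KMeans(n_clusters=num_clusters, n_init=10, random_state=0)
    model.fit(df)
    inertia_by_k[num_clusters] = model.inertia_

s = pd.Series(inertia_by_k, dtype=float)
# Reduction in inertia gained by going from k-1 to k clusters
drops = -s.diff()
print(pd.DataFrame({"inertia": s, "drop": drops}))
```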

#### Will we always have a clear "elbow"?

- Let's generate uniform random data

In [None]:
df2 = pd.DataFrame(np.random.uniform(0, 10, (100, 2)))
df2.head()

In [None]:
df2.plot.scatter(x=0, y=1)

In [None]:
s = pd.Series(dtype=float)

for num_clusters in range(1, 11):
    km = KMeans(n_clusters=num_clusters)
    km.fit(df2)
    s.at[num_clusters] = km.inertia_

ax = s.plot.line(figsize=(6, 4))
ax.set_ylabel("Inertia")
ax.set_xlabel("Number of clusters")