<a href="https://colab.research.google.com/github/smnieee/ml_workshop/blob/master/Intro_to_Unsupervised_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Unsupervised Learning

IEEE Southern Minnesota Section

Machine Learning Workshop: Session 4

April 19, 2021

## Introduction

In previous workshops, we covered supervised learning methods, e.g. linear regression. The crux of these lessons was the assumption that there was an underlying structure or model to the data being analyzed. 

This notebook will work through unsupervised learning. We will begin with a simple clustering example.

# Overview of k-Means Clustering


Clustering is a set of methods for grouping data. Data are partitioned into groups based on similarities within the data. We do not know the similarities a priori; instead, we start from the data and look for them.

## Download and Unzip the Data

Data Website:
http://cs.joensuu.fi/sipu/datasets/


Data Citation:
P. Fränti, R. Mariescu-Istodor, and C. Zhong, "XNN graph," IAPR Joint Int. Workshop on Structural, Syntactic, and Statistical Pattern Recognition, Merida, Mexico, LNCS 10029, 207-217, November 2016.

In [None]:
!wget http://cs.joensuu.fi/sipu/datasets/g2-txt.zip
!unzip "g2-txt.zip";

## Read in Data and Convert to Numpy Array

The data in the text file is space separated. It needs to be read in and converted to a `numpy` array.

In [None]:
import csv
import numpy as np

dlist = []

with open("g2-2-40.txt", 'r') as c:
  creader = csv.reader(c, delimiter=" ", skipinitialspace=True)

  for row in creader:
    dlist.append([float(r) for r in row])

dlist = np.array(dlist)
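
As an alternative sketch, `np.loadtxt` can parse whitespace-separated numeric text directly. The example below uses a small in-memory sample (made-up values, with a layout assumed to resemble the downloaded file) in place of `g2-2-40.txt` so that it runs on its own; passing the file name instead should produce the same kind of array.

```python
import io
import numpy as np

# Hypothetical sample standing in for the file contents: two columns
# separated by a variable number of spaces (values are made up).
sample = io.StringIO("  598   612\n  401   389\n  605   590\n")

# With the default delimiter, loadtxt splits on any run of whitespace
points = np.loadtxt(sample)

print(points.shape)  # (3, 2)
```

The manual `csv` loop above makes each parsing step visible, which suits a workshop; `np.loadtxt` is simply a shorter route to the same array.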

### View the Data

In [None]:
import matplotlib.pyplot as plt

plt.scatter(dlist[:,0], dlist[:,1])


## Implement the k-Means Algorithm

The k-means algorithm is rather simple. To build an understanding of the process, we will implement it with basic code before using more developed modules.

The algorithm is really just a few steps:
1. Pick the number of centroids, k.
2. Randomly place k centroids.
3. Repeat until converged:
   1. Assign each point to the closest centroid.
   2. Recompute each centroid as the mean of all points assigned to it.
4. Assess the results.
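
The steps above can be sketched as a single function. This is a minimal illustration rather than the implementation used in this notebook: the name `kmeans_sketch`, the parameters `max_iter`, `tol`, and `seed`, the convergence test (stop once no centroid coordinate moves by more than `tol`), and the handling of empty clusters are all assumptions made for the example. The demo data is synthetic.

```python
import numpy as np

def kmeans_sketch(points, k, max_iter=100, tol=1e-6, seed=0):
    """Basic k-means; returns (centroids, labels from the last assignment)."""
    rng = np.random.default_rng(seed)
    # Step 2: place k centroids at random inside the data's bounding box
    centroids = rng.uniform(points.min(axis=0), points.max(axis=0),
                            size=(k, points.shape[1]))
    for _ in range(max_iter):
        # Step 3.1: squared distance from every point to every centroid
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 3.2: move each centroid to the mean of its assigned points
        new_centroids = centroids.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                new_centroids[j] = members.mean(axis=0)
        # Converged once no centroid coordinate moves by more than tol
        shift = np.abs(new_centroids - centroids).max()
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, labels

# Two synthetic blobs (made-up data for illustration only)
rng = np.random.default_rng(1)
demo = np.vstack((rng.normal(0, 1, (50, 2)), rng.normal(10, 1, (50, 2))))
centers, labels = kmeans_sketch(demo, k=2)
print(centers)
```

Because the starting centroids are random, different seeds can lead to different final centroids, which is one reason to rerun the algorithm from several starting points.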


In [None]:
k = 2

In [None]:
xmin, xmax = np.min(dlist[:,0]), np.max(dlist[:,0])
ymin, ymax = np.min(dlist[:,1]), np.max(dlist[:,1])

# Draw k random starting centroids inside the bounding box of the data
rng = np.random.default_rng()
centroids = rng.uniform([xmin, ymin], [xmax, ymax], (k, 2))


plt.subplots(figsize=(16,9))
plt.scatter(dlist[:,0], dlist[:,1])
plt.scatter(centroids[:,0], centroids[:,1], marker='o', c='red', s=200)

display(centroids)

In [None]:
centers_list = [centroids[i].copy() for i in range(k)]

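# Run a fixed number of update steps (10) rather than testing for convergence
for n in range(10):
  cnt = np.zeros(k)
  cent_sum = np.zeros_like(centroids)
  for d in dlist:
    # Squared Euclidean distance from this point to every centroid
    diffs = [np.sum(np.square(d - c)) for c in centroids]
    i = np.argmin(diffs)
    cnt[i] += 1
    cent_sum[i] += d

  for m in range(k):
    # Leave a centroid where it is if no points were assigned to it
    if cnt[m] > 0:
      centroids[m] = cent_sum[m] / cnt[m]
    centers_list[m] = np.vstack((centers_list[m], centroids[m]))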


In [None]:
plt.subplots(figsize=(16,12))
plt.scatter(dlist[:,0], dlist[:,1])
for center in centers_list:
  plt.scatter(center[:,0], center[:,1], marker='o', edgecolors='black', s=200)

### Plot the Convergence

In [None]:
fig, ax = plt.subplots(2,2, figsize=(16,9))

ax[0,0].plot(centers_list[0][:,0])
ax[0,0].set_title('X-Values for Center 1')
ax[1,0].plot(centers_list[0][:,1])
ax[1,0].set_title('Y-Values for Center 1')

ax[0,1].plot(centers_list[1][:,0])
ax[0,1].set_title('X-Values for Center 2')
ax[1,1].plot(centers_list[1][:,1])
ax[1,1].set_title('Y-Values for Center 2')

### Plot the Final Labels

Now show the final groupings based on the final centroids.

In [None]:
final_labels = np.zeros(len(dlist))

for n,d in enumerate(dlist):
  diffs = [np.sum(np.square(d - c)) for c in centroids]
  i = np.argmin(diffs)
  final_labels[n] = i

plt.subplots(figsize=(16,9))
plt.scatter(dlist[:,0], dlist[:,1], c=final_labels, edgecolors='black')

### Cross Check with 2D Histogram

In [None]:
plt.subplots(figsize=(16,9))
plt.hist2d(dlist[:,0], dlist[:,1]);

# K-Means with Scikit-Learn

Of course, we don't have to implement this manually. There are packages that support clustering. Specifically, we will again use scikit-learn to investigate the data.


Reference:
https://realpython.com/k-means-clustering-python/

## Preprocessing Data

Most data processing packages expect the data to fall within certain ranges of values. Therefore, our data needs to be preprocessed to 'look like' the expected inputs of the tools.

Specifically, our data needs to be scaled so that it has a mean of 0 and a standard deviation of 1 for each dimension. The scikit-learn package has a `StandardScaler` class that will do this.
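
The scaling itself is simple arithmetic, and a small `numpy` sketch may make it concrete. The points below are made up for illustration. The sketch uses the population standard deviation (`ddof=0`), which is understood to match the default behavior of `StandardScaler`.

```python
import numpy as np

# Made-up 2-D points whose columns have very different scales
points = np.array([[1.0, 100.0],
                   [2.0, 300.0],
                   [3.0, 500.0]])

# Subtract each column's mean, then divide by each column's standard
# deviation (np.std uses the population form, ddof=0, by default)
mean = points.mean(axis=0)
std = points.std(axis=0)
scaled = (points - mean) / std

print(scaled.mean(axis=0))  # approximately [0, 0]
print(scaled.std(axis=0))   # approximately [1, 1]

# Undo the scaling to return to the original units
restored = scaled * std + mean
```

The last step is the same idea as `inverse_transform`: results computed on scaled data can be mapped back to the original units.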

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_data = scaler.fit_transform(dlist)

plt.subplots(figsize=(16,12))
plt.scatter(scaled_data[:,0], scaled_data[:,1])

## Create a k-means Class from scikit-learn

The k-means estimator can be created by passing some setup parameters to the `KMeans` class.

In [None]:
from sklearn.cluster import KMeans
kmeans = KMeans(
    init = "random",
    n_init = 10,
    max_iter = 10,
    n_clusters = k
)

## Run k-means Fit using Estimator

In [None]:
kmeans.fit(scaled_data)

In [None]:
# The number of iterations run (capped at max_iter)
kmeans.n_iter_

In [None]:
# The centroids found
scaler.inverse_transform(kmeans.cluster_centers_)

In [None]:
# The lowest SSE (sum of squared errors) found across the n_init runs
kmeans.inertia_

In [None]:
plt.subplots(figsize=(16,12))
plt.scatter(scaled_data[:,0], scaled_data[:,1], c=kmeans.labels_, edgecolors='gray')

## Find the Best Number of Clusters

For our data, the number of clusters was decided simply by looking at a plot of the data. Most of the time, it won't be that simple. There are more systematic methods for choosing the number of clusters, such as comparing the SSE across several values of k.
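
The SSE that scikit-learn reports as `inertia_` is the sum of squared distances from each point to its assigned cluster center. A minimal `numpy` sketch of that calculation follows; the helper name `sse` and the four sample points are hypothetical and chosen only so the arithmetic is easy to check by hand.

```python
import numpy as np

def sse(points, centers, labels):
    """Sum of squared distances from each point to its assigned center."""
    diffs = points - centers[labels]  # offset of each point from its own center
    return float(np.sum(diffs ** 2))

# Made-up example: four points, two centers, each point 1 unit from its center
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centers = np.array([[0.0, 1.0], [10.0, 1.0]])
labels = np.array([0, 0, 1, 1])

print(sse(points, centers, labels))  # 4.0
```

SSE can only stay the same or shrink as more centers are added, so the useful signal is where the improvement levels off rather than where SSE is smallest.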

In [None]:
# Find the SSE for various numbers of clusters

errs = []
centers_list = []

for ktest in range(1,10):
  kmeans = KMeans(
    init = "random",
    n_init = 10,
    max_iter = 10,
    n_clusters = ktest
  )

  kmeans.fit(scaled_data)
  errs.append(kmeans.inertia_)
  centers_list.append(scaler.inverse_transform(kmeans.cluster_centers_))

  print(f"Number of Clusters: {ktest}")
  print(f"SSE: {kmeans.inertia_}")
  print(f"Centers: {centers_list[-1]}\n")

In [None]:
# Plot the Error versus Number of Clusters
plt.subplots(figsize=(16,12))
plt.plot(range(1,10), errs)
plt.title("SSE versus Number of Clusters")

### Silhouette Coefficients

The silhouette coefficient is an alternative measure of clustering quality. Instead of SSE, it looks at how well each point fits in its assigned cluster _and_ how poorly it fits in the others. Values range from -1 to 1, with larger values indicating better-separated clusters.

Comparing silhouette coefficients across values of k is another way to choose the number of clusters.
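
To make the definition concrete, here is a minimal sketch of the calculation using Euclidean distance. For each point, `a` is the mean distance to the other points in its own cluster, `b` is the smallest mean distance to the points of any other cluster, and the coefficient is `(b - a) / max(a, b)`. The function name `silhouette_sketch` and the sample points are hypothetical, and the sketch assumes at least two clusters and no duplicate points.

```python
import numpy as np

def silhouette_sketch(points, labels):
    """Mean silhouette coefficient using Euclidean distance."""
    # Pairwise distances between all points
    dists = np.sqrt(((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2))
    scores = np.zeros(len(points))
    for i in range(len(points)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # a point alone in its cluster scores 0 by convention
        # a: mean distance to the other points in the same cluster
        # (the distance to itself is 0, so divide by n_same - 1)
        a = dists[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the points of any other cluster
        b = min(dists[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

# Made-up example: two tight pairs of points, far apart
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good = silhouette_sketch(points, np.array([0, 0, 1, 1]))
bad = silhouette_sketch(points, np.array([0, 1, 0, 1]))
print(good, bad)
```

Grouping the nearby points together gives a score close to 1, while pairing each point with a distant one gives a negative score.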

In [None]:
from sklearn.metrics import silhouette_score

# Find the silhouette coefficient for various numbers of clusters

coeffs = []
centers_list = []

for ktest in range(2,10):
  kmeans = KMeans(
    init = "random",
    n_init = 10,
    max_iter = 10,
    n_clusters = ktest
  )

  kmeans.fit(scaled_data)
  score = silhouette_score(scaled_data, kmeans.labels_)
  coeffs.append(score)
  centers_list.append(scaler.inverse_transform(kmeans.cluster_centers_))

  print(f"Number of Clusters: {ktest}")
  print(f"Silhouette Score: {coeffs[-1]}")
  print(f"Centers: {centers_list[-1]}\n")

In [None]:
# Plot the Score versus Number of Clusters
plt.subplots(figsize=(16,12))
plt.plot(range(2,10), coeffs)
plt.title("Silhouette Scores")

In [None]:
kmeans = KMeans(
  init = "random",
  n_init = 10,
  max_iter = 10,
  n_clusters = 4
)

kmeans.fit(scaled_data)

plt.subplots(figsize=(16,12))
plt.scatter(scaled_data[:,0], scaled_data[:,1], c=kmeans.labels_, 
            edgecolors='gray'
)

plt.title("Labeled Data with K=4")