## Lecture 16: Clustering and Classification

Today we start Ch. 5 of the textbook. The topics we will cover from Ch. 5 are:
* 5.1 Feature Selection and Data Mining
* 5.2 Supervised vs. Unsupervised Learning
* 5.3 $k$-means clustering
* 5.6 Supervised Learning and Linear Discriminants
* 5.7 Support Vector Machines (SVM)

Today, we will explore how to reduce high-dimensional data to a lower-dimensional (low-rank) representation and how to select features for building models.

In [45]:
import numpy as np
import os

import matplotlib.pyplot as plt
from matplotlib import rc

plt.rcParams['xtick.labelsize']=16      # change the tick label size for x axis
plt.rcParams['ytick.labelsize']=16      # change the tick label size for y axis
plt.rcParams['axes.linewidth']=1        # change the line width of the axis
plt.rcParams['xtick.major.width'] = 3   # change the tick line width of x axis
plt.rcParams['ytick.major.width'] = 3   # change the tick line width of y axis
rc('text', usetex=False)                # disable LaTeX rendering in plots
rc('font',**{'family':'DejaVu Sans'})   # set the font of the plot to be DejaVu Sans

In [46]:
from google.colab import drive
drive.mount('/content/drive')


### 5.1 Feature Selection and Data Mining

#### Dataset 1: Fisher iris data set
The Fisher iris data set contains 150 measurements over three varieties, including 50 measurements each of Iris setosa, I. versicolor, and I. virginica. Each flower has four measurements: sepal length, sepal width, petal length, and petal width.

In [None]:
path = "/content/drive/MyDrive/ME491"
data1_path = os.path.join(path, "data/fisheriris.mat")

In [8]:
from scipy import io

fisheriris_mat = io.loadmat(data1_path)
meas = fisheriris_mat['meas']
species = fisheriris_mat['species']

##### **Exercise 1**

Look at the two variables `meas` and `species`. Try to understand what they mean and how you could visualize them.

The Fisher iris dataset is easy to visualize and understand because its four features were chosen *a priori* from biological knowledge of the flowers. The features are already selected, so there is little left for us to do.

Now we turn to a dataset where we need to use what we learned in Ch. 1 to identify features.

#### Dataset 2: dogs and cats
An image database of 80 dogs and 80 cats.

The data for each cat and dog is the $64\times64$ pixel space of the image. Thus each image has $4096$ measurements, in contrast to the four measurements for each example in the iris data set.
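The plotting cells in this lecture turn each 4096-entry column back into a picture with `np.reshape(column, (m, n)).T`. Here is a minimal sketch of why the transpose is needed. It assumes the images were flattened column by column, as is typical for MATLAB-exported `.mat` files but not confirmed here. The tiny array `img` is a made-up stand-in for a real image.

```python
import numpy as np

# Hypothetical 2x3 "image" standing in for a 64x64 picture.
n_rows, n_cols = 2, 3
img = np.arange(n_rows * n_cols).reshape(n_rows, n_cols)

# Column-major (MATLAB-style) flattening: stack the columns into one vector.
column = img.flatten(order="F")

# NumPy fills a reshaped array row by row, so a plain reshape gives the
# transposed image; the trailing .T restores the original orientation.
restored = np.reshape(column, (n_cols, n_rows)).T

print(column.shape)                    # (6,)
print(np.array_equal(restored, img))   # True
```

For the square $64\times64$ images the two shape arguments coincide, which is why the order of `m` and `n` does not matter in the cells of this lecture.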

The end goal is to select a finite set of features that can help us distinguish between a dog and a cat image.

(Side note: if you want a laugh, image search "dog or blueberry muffin")

In [21]:
dog_path = os.path.join(path, "data/dogData.mat")
cat_path = os.path.join(path, "data/catData.mat")
dogdata_mat = io.loadmat(dog_path)
catdata_mat = io.loadmat(cat_path)

In [22]:
dog = dogdata_mat['dog']
cat = catdata_mat['cat']

##### **Exercise 2**

Adapt the code we used to draw people's faces to draw the first 36 dogs and the first 36 cats in two separate cells.

```
# Plot the first image of each of the first 36 people
allPersons = np.zeros((n*6,m*6))
count = 0

for j in range(6):
  for k in range(6):
    allPersons[j*n:(j+1)*n, k*m:(k+1)*m] = np.reshape(faces[:,np.sum(nfaces[:count])],(m,n)).T
    count += 1

img = plt.imshow(allPersons)
img.set_cmap('gray')
plt.axis('off')
```

##### Feature detection

Now that we have a general understanding of the data, we want to extract features from it. As with eigenfaces, we will use PCA to find the dominant features.
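Before applying this to the real images, the two PCA steps (subtract the mean image, then take the SVD) can be sketched on a small synthetic matrix. The random `data` array and its $16\times10$ size are assumptions made purely for illustration. The real matrix would have one 4096-pixel image per column.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the data matrix: 16 "pixels" by 10 "images",
# one image per column.
data = rng.normal(size=(16, 10))

# Subtract the mean image from every column. Broadcasting against a column
# vector gives the same result as tiling the mean across all columns.
avg = data.mean(axis=1)
X = data - avg[:, None]
X_tiled = data - np.tile(avg, (data.shape[1], 1)).T

# Economy SVD: columns of U are the features, S holds the singular values,
# and VT holds the per-image weights.
U, S, VT = np.linalg.svd(X, full_matrices=False)

print(U.shape, S.shape, VT.shape)            # (16, 10) (10,) (10, 10)
print(np.allclose(X, X_tiled))               # True
print(np.allclose(U @ np.diag(S) @ VT, X))   # True
```

Either form of the mean subtraction works. Broadcasting simply avoids building the tiled copy of the mean.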

In [28]:
# we are going to use the same set of coordinates to
# describe both dogs and cats
DC = np.concatenate((dog, cat), axis = 1)

# PCA
avgAnimal = np.mean(DC, axis = 1)
X = DC - np.tile(avgAnimal, (DC.shape[1], 1)).T
U, S, VT = np.linalg.svd(X, full_matrices = False)

In [None]:
# Let's look at the average animal
n = 64  # image height in pixels (each image is 64 x 64)
m = 64  # image width in pixels
plt.imshow(np.reshape(avgAnimal, (m,n)).T, cmap="Greys_r")
plt.axis('off')

In [None]:
# Now let's plot the first 10 animal features (columns of U)

i = 2
j = 5

eigenanimal = np.zeros((n*i, m*j))
count = 0

for ii in range(i):
  for jj in range(j):
    eigenanimal[ii*n:(ii+1)*n, jj*m:(jj+1)*m] = np.reshape(U[:,count],(m,n)).T
    count += 1

img = plt.imshow(eigenanimal, vmin = -1e-2, vmax = 1e-2, cmap="Greys_r")
plt.axis('off')

In [38]:
dog_w_path = os.path.join(path, "data/dogData_w.mat")
cat_w_path = os.path.join(path, "data/catData_w.mat")
dogwdata_mat = io.loadmat(dog_w_path)
catwdata_mat = io.loadmat(cat_w_path)
dog_w = dogwdata_mat['dog_wave']
cat_w = catwdata_mat['cat_wave']

In [None]:
# Now we want to plot the first 36 dogs in the wavelet representation (32 x 32 per image)
n = 32
m = 32
alldogs_w = np.zeros((n*6,m*6))
count = 0

for j in range(6):
  for k in range(6):
    alldogs_w[j*n:(j+1)*n, k*m:(k+1)*m] = np.reshape(dog_w[:,count],(m,n)).T
    count += 1

img = plt.imshow(alldogs_w)
img.set_cmap('gray')
plt.axis('off')

In [41]:
DC_w = np.concatenate((dog_w, cat_w), axis = 1)

# PCA
avgAnimal_w = np.mean(DC_w, axis = 1)
Xw = DC_w - np.tile(avgAnimal_w, (DC_w.shape[1],1)).T
Uw, Sw, VTw = np.linalg.svd(Xw, full_matrices = False)

In [None]:
# Let's look at the average animal
plt.imshow(np.reshape(avgAnimal_w, (m,n)).T, cmap="Greys_r")
plt.axis('off')

In [None]:
# Now let's plot the first 10 animal features in wavelet space (columns of Uw)

i = 2
j = 5

eigenanimal_w = np.zeros((n*i, m*j))
count = 0

for ii in range(i):
  for jj in range(j):
    eigenanimal_w[ii*n:(ii+1)*n, jj*m:(jj+1)*m] = np.reshape(Uw[:,count],(m,n)).T
    count += 1

img = plt.imshow(eigenanimal_w, vmin = -1e-2, vmax = 1e-2, cmap="coolwarm")
plt.axis('off')

When we discussed PCA in Ch. 1, we did not spend much time on the meaning of the $V$ matrix. Here, we will use the $V$ matrix to perform feature engineering.

The importance of each feature to an individual image is given by the $V$ matrix in the SVD. Specifically, column $j$ of $V$ (row `j` of `VT` in the code) holds the loading, or weighting, of the $j$-th feature onto each image, with one entry per image.
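A minimal sketch of what "loading" means in practice, using a small synthetic matrix in place of the animal images (the random data and the choice of image index `k` are assumptions for illustration only). Scaling the entries of `VT` that belong to one image by the singular values gives the weights that rebuild that image from the feature columns of `U`.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(16, 10))            # synthetic: 16 pixels, 10 images
X = X - X.mean(axis=1, keepdims=True)    # mean-centered, as in the PCA cell

U, S, VT = np.linalg.svd(X, full_matrices=False)

k = 3                                    # pick one image (one column of X)
weights = S * VT[:, k]                   # weight of every feature for image k

# Image k is the weighted sum of the feature columns of U.
print(np.allclose(U @ weights, X[:, k]))     # True

# The same weights come from projecting image k onto the features.
print(np.allclose(U.T @ X[:, k], weights))   # True
```

Reading `VT` along a row instead, as in `VT[j, :]`, gives the weight of feature `j` for every image. The histograms in this lecture compare that quantity between dogs and cats.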

We can now look at how the entries of $V$ are distributed for dogs and for cats, for the first four PCA modes in both image space and wavelet space.

In [None]:
xbin = np.linspace(-0.25, 0.25, 20)
xbin_edges = np.append(xbin, xbin[-1]+(xbin[1]-xbin[0])) - (xbin[1]-xbin[0])/2

fig, axs = plt.subplots(4,2)
fig.tight_layout(h_pad=0, w_pad=2)
fig.set_size_inches(6, 8)
for j in range(4):
  pdf1 = np.histogram(VT[j,:80], bins=xbin_edges)[0]
  pdf2 = np.histogram(VT[j,80:], bins=xbin_edges)[0]
  axs[j,0].plot(xbin, pdf1, label = "dogs")
  axs[j,0].plot(xbin, pdf2, label = "cats")
  axs[j,0].legend()
  axs[j,0].set_ylabel('PCA'+str(j+1), fontsize = 18)

  pdf1 = np.histogram(VTw[j,:80], bins=xbin_edges)[0]
  pdf2 = np.histogram(VTw[j,80:], bins=xbin_edges)[0]
  axs[j,1].plot(xbin, pdf1, label = "dogs")
  axs[j,1].plot(xbin, pdf2, label = "cats")
  axs[j,1].legend()

axs[0,0].set_title("image space", fontsize = 18)
axs[0,1].set_title("wavelet space", fontsize = 18)
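The bin construction in the cell above is compact, so here is a small sketch of what it does. The 20 values in `xbin` are bin centers, and `xbin_edges` holds the 21 matching edges. The uniform random `samples` are a synthetic stand-in for one row of `VT` restricted to one class, not real data.

```python
import numpy as np

# Same construction as in the plotting cell: 20 bin centers ...
xbin = np.linspace(-0.25, 0.25, 20)
width = xbin[1] - xbin[0]
# ... and 21 edges: each center shifted down by half a bin width,
# plus one extra edge to close the last bin.
xbin_edges = np.append(xbin, xbin[-1] + width) - width / 2

# Synthetic stand-in for 80 loadings that all fall inside the bin range.
rng = np.random.default_rng(2)
samples = rng.uniform(-0.2, 0.2, size=80)

counts = np.histogram(samples, bins=xbin_edges)[0]

print(len(xbin), len(xbin_edges))   # 20 21
print(counts.sum())                 # 80
# Each center sits midway between its two edges.
print(np.allclose((xbin_edges[:-1] + xbin_edges[1:]) / 2, xbin))  # True
```

Because the counts are plotted against the centers, each point of the curve sits in the middle of the bin it summarizes.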

All dog and cat images projected onto the first three PCA coordinates (dogs in red, cats in blue), in image space (top) and wavelet space (bottom).

In [None]:
fig = plt.figure()
ax1 = fig.add_subplot(211, projection='3d')
ax1.scatter(VT[0,:80],VT[1,:80],VT[2,:80],c='r',marker='o',s=20)
ax1.scatter(VT[0,80:],VT[1,80:],VT[2,80:],c='b',marker='o',s=20)

ax2 = fig.add_subplot(212, projection='3d')
ax2.scatter(VTw[0,:80],VTw[1,:80],VTw[2,:80],c='r',marker='o',s=20)
ax2.scatter(VTw[0,80:],VTw[1,80:],VTw[2,80:],c='b',marker='o',s=20)

plt.show()