In [None]:
import sklearn

# Make sure we can see all of the model details.
sklearn.set_config(print_changed_only=False)

# Intro to image classification with Scikit-Learn

What about when the records are images? How can we make a machine learning model from those?

This notebook only uses `scikit-learn`; later we'll solve this problem with neural networks too.

We're also going to cheat a bit by loading pre-processed NumPy arrays. Getting images into this format is not too hard, but it can be a little fiddly if they are all very different from each other.
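As a rough sketch of what that pre-processing involves, assuming every image has already been resized to 32 &times; 32 greyscale: stack the 2D arrays, then flatten each one into a row. The random `images` list below is a stand-in for real image data, not the fossil dataset.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for a list of 32 x 32 greyscale images (hypothetical data).
images = [rng.random((32, 32)) for _ in range(10)]

# Stack into (n_images, 32, 32), then flatten each image into one row.
X_demo = np.stack(images).reshape(len(images), -1)

print(X_demo.shape)
```

Each row can be turned back into a picture with `reshape(32, 32)`, which is what we do whenever we plot a record.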

## Load `X` and `y`

In [None]:
import numpy as np

X = np.load('../data/fossils/X.npy')
y = np.load('../data/fossils/y.npy')

In [None]:
X.shape, y.shape

We have 3 classes:

In [None]:
np.unique(y)

Each row in `X` is an image of size 32 &times; 32 pixels:

In [None]:
import matplotlib.pyplot as plt

plt.imshow(X[190].reshape(32, 32))

## Split the data into train and validation sets

### EXERCISE

Split the data so that 15% of the images go into a **validation** set called `X_val` and `y_val`.

In [None]:
from sklearn.model_selection import train_test_split

# YOUR CODE HERE


In [None]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.15, random_state=42)

X_train.shape, X_val.shape
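With only a few classes, it can be worth keeping the class proportions the same in both parts of the split. Here is a minimal sketch using the `stratify` argument of `train_test_split`; the arrays and the labels `'a'`, `'b'`, `'c'` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Made-up data: 200 flattened images with imbalanced labels.
X_demo = rng.random((200, 1024))
y_demo = np.repeat(['a', 'b', 'c'], [100, 60, 40])

# stratify=y_demo asks for the same class mix in both parts.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.15, stratify=y_demo, random_state=42
)

print(X_tr.shape, X_va.shape)
print(np.unique(y_va, return_counts=True))
```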

## Augmentation

We'd really like a lot of data (especially if we're going to train a neural net!). It seems like it should help to increase the size of the dataset... but without having to collect more examples. 

For example, let's flip one of the training images left to right:

In [None]:
img = X_train[1].reshape(32,32)

flipped = np.flip(img, axis=1)

plt.imshow(flipped)

In [None]:
from scipy.ndimage import zoom

cropped = zoom(flipped, 1.1)

cropped = cropped[1:-2, 1:-2]

plt.imshow(cropped)

<div class="alert alert-success">
<h3>Exercise</h3>

- Write a function to randomly flip and crop each record in `X_train`. (It's okay to use a loop for this.)
- Add your new flipped records to `X_train`, and their labels to `y_train`.
</div>
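If you get stuck, here is one possible approach, sketched on random stand-in data rather than the real `X_train`. The helper name `augment`, the 50% flip probability and the random crop offset are all choices made for this sketch, not requirements of the exercise.

```python
import numpy as np
from scipy.ndimage import zoom

def augment(row, rng, shape=(32, 32)):
    """Randomly flip one flattened image left-right, then zoom and crop it."""
    img = row.reshape(shape)
    if rng.random() < 0.5:
        img = np.flip(img, axis=1)
    zoomed = zoom(img, 1.1)  # Slightly larger than the input.
    # Pick a random crop window with the original size.
    max_offset = zoomed.shape[0] - shape[0]
    i, j = rng.integers(0, max_offset + 1, size=2)
    cropped = zoomed[i:i + shape[0], j:j + shape[1]]
    return cropped.ravel()

rng = np.random.default_rng(42)
X_demo = rng.random((20, 1024))       # Stand-in for X_train.
y_demo = rng.integers(0, 3, size=20)  # Stand-in for y_train.

# Augment every record, then append the new records and their labels.
X_new = np.array([augment(row, rng) for row in X_demo])
X_aug = np.vstack([X_demo, X_new])
y_aug = np.concatenate([y_demo, y_demo])

print(X_aug.shape, y_aug.shape)
```

Only the training data is augmented here; the validation set should stay untouched so that it still measures performance on real images.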

In [None]:
# YOUR CODE HERE



## Train a 'shallow' model

Even in an image classification task, you should start with a shallow learning model. This will give you something to beat with a neural network (if you can!).

### EXERCISE

Implement a random forest classifier and predict the labels for the validation set. You should get performance around 65% weighted average F1.

In [None]:
from sklearn.ensemble import RandomForestClassifier

# YOUR CODE HERE (about 4 lines of code)


In [None]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(random_state=42)

clf.fit(X_train, y_train)

In [None]:
y_pred = clf.predict(X_val)

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_val, y_pred))
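The 'weighted avg' row of the report is the per-class score averaged with the class supports as weights. A small sketch with made-up labels shows the arithmetic for F1, using `f1_score` from `sklearn.metrics`:

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up true and predicted labels for six samples.
y_true_demo = np.array(['a', 'a', 'b', 'b', 'c', 'c'])
y_pred_demo = np.array(['a', 'b', 'b', 'b', 'c', 'a'])

# F1 for each class, in sorted label order.
per_class = f1_score(y_true_demo, y_pred_demo, average=None)

# Support is the number of true samples in each class.
_, support = np.unique(y_true_demo, return_counts=True)

weighted = f1_score(y_true_demo, y_pred_demo, average='weighted')
by_hand = np.average(per_class, weights=support)

print(per_class, weighted, by_hand)
```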

## Looking more closely at validation

Here's the first validation example:

In [None]:
plt.imshow(X_val[0].reshape(32, 32))

The true label:

In [None]:
y_val[0]

The prediction:

In [None]:
y_pred[0]

Wrong! (Note: you'll need `random_state=42` in both the train/validation split and the classifier to be sure of reproducing this result.)

Let's look at the probabilities:

In [None]:
y_prob = clf.predict_proba(X_val)

y_prob[0]

In [None]:
clf.classes_

So the classifier's second guess would have been correct.
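To read off the first and second guesses programmatically, we can sort each row of probabilities. This is a sketch with invented class names and probabilities, standing in for `clf.classes_` and `y_prob`:

```python
import numpy as np

# Hypothetical class labels and predict_proba-style output for two samples.
classes = np.array(['class_a', 'class_b', 'class_c'])
y_prob_demo = np.array([[0.50, 0.35, 0.15],
                        [0.10, 0.30, 0.60]])

# Sort each row by descending probability and keep the two best columns.
top2 = np.argsort(y_prob_demo, axis=1)[:, ::-1][:, :2]

print(classes[top2])
```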

Let's look at how we did on several examples. To use my visualization function, we need integer-encoded labels rather than the original class labels:

In [None]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(y_train)

# Encode both the train and val labels.
y_train_enc = encoder.transform(y_train)
y_val_enc = encoder.transform(y_val)

y_val_enc

**You may need to copy the `utils.py` file from the `master` folder to the `notebooks` folder.**

In [None]:
import utils

utils.visualize(X_val, y_val_enc, y_prob,
                ncols=5, nrows=3, shape=(32, 32),
                classes=clf.classes_, cutoff=0.5
               )

## Convolution

Convolutional networks replace the dense layer's weight matrix with small kernels, and the matrix multiplication step with convolution.

Let's see what convolution can do to an image.

In [None]:
plt.imshow(img)

In [None]:
kernel = np.array([[-1, 0, 1],   # Sobel edge detector
                   [-2, 0, 2],
                   [-1, 0, 1]])

plt.imshow(kernel)

In [None]:
from scipy.signal import convolve2d

attr = convolve2d(img, kernel, mode='valid')

plt.imshow(attr)
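To see what `convolve2d` is doing under the hood, here is a deliberately naive version written with loops, checked against SciPy on a random stand-in image. The function name `conv2d_valid` is made up for this sketch.

```python
import numpy as np
from scipy.signal import convolve2d

def conv2d_valid(image, kernel):
    """Naive 'valid' 2D convolution: flip the kernel, then slide and sum."""
    flipped = kernel[::-1, ::-1]
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            # Multiply the patch by the flipped kernel and add it all up.
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * flipped)
    return out

rng = np.random.default_rng(0)
image = rng.random((32, 32))  # Stand-in for a fossil image.
kernel = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]])

result = conv2d_valid(image, kernel)
print(result.shape)
print(np.allclose(result, convolve2d(image, kernel, mode='valid')))
```

With `mode='valid'` the kernel never hangs over the edge of the image, so a 3 &times; 3 kernel trims one pixel from each side of the output.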

Here's a nice resource on ConvNets: https://cs231n.github.io/convolutional-networks/

---
**Usually we'll stop this notebook here.**

---

## Dimensionality reduction

In high-dimensional datasets (i.e. ones with a lot of features), sometimes it helps to reduce the number of dimensions.

### Principal component analysis (PCA)

Let's try PCA; it works just like any other `sklearn` model, except that it's **unsupervised** so the `fit` step does not need to see the labels.

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)

pca.fit(X_train)

X_train_2 = pca.transform(X_train)

In [None]:
plt.scatter(*X_train_2.T, c=y_train_enc)
plt.colorbar()
plt.show()

In [None]:
encoder.classes_

We can look at the components themselves. They are directions in our original feature space -- so they can be interpreted as feature vectors.

We can call these **'eigenfossils'**.

In [None]:
no_ticks = {'xticks': [], 'yticks': []}
fig, axs = plt.subplots(ncols=2, figsize=(8, 4), subplot_kw=no_ticks)
for i, (ax, comp) in enumerate(zip(axs, pca.components_)):
    ax.imshow(comp.reshape(32, 32))
    ax.set_title(f"Component {i}")

It's a bit more interesting with more components:

In [None]:
pca = PCA(n_components=50).fit(X_train)

fig, axs = plt.subplots(3, 5, figsize=(15, 10), subplot_kw={'xticks': [], 'yticks': []})
for i, (ax, comp) in enumerate(zip(axs.ravel(), pca.components_)):
    ax.imshow(comp.reshape(32, 32))
    ax.set_title(f"Component {i}")
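Two properties of the fitted components are worth checking: they are orthonormal directions in pixel space, and each successive one explains a smaller share of the variance. A sketch on random stand-in data (so the numbers themselves mean nothing for the fossils):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_demo = rng.random((300, 1024))  # Stand-in for X_train.

pca_demo = PCA(n_components=50, svd_solver='full').fit(X_demo)

# One row per component, one column per pixel.
print(pca_demo.components_.shape)

# Orthonormal rows: the Gram matrix should be the identity.
gram = pca_demo.components_ @ pca_demo.components_.T
print(np.allclose(gram, np.eye(50), atol=1e-6))

# Running total of the fraction of variance explained.
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
print(cumulative[[0, 9, 49]])
```

On real images the cumulative curve usually climbs much faster than on random noise, which is why a modest number of components can be enough.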

### t-distributed stochastic neighbour embedding (t-SNE)

We can also try t-SNE, which typically does better than PCA for visualizing a dataset in two dimensions:

In [None]:
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, random_state=42)

X_train_tsne = tsne.fit_transform(X_train)

In [None]:
plt.scatter(*X_train_tsne.T, c=y_train_enc)

## Train the model on reduced data

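We'll just use PCA here: a fitted PCA can project the unseen validation data onto the same components, whereas t-SNE does not preserve global distances and scikit-learn's implementation only embeds the data it was fitted on.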
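A quick check of the two estimators' interfaces, on random stand-in data, shows the practical difference when new records arrive:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X_fit = rng.random((100, 64))  # Stand-in for training data.
X_new = rng.random((10, 64))   # Stand-in for validation data.

# PCA learns a projection once, then applies it to any new data.
pca_demo = PCA(n_components=2).fit(X_fit)
print(pca_demo.transform(X_new).shape)

# scikit-learn's TSNE offers fit_transform but no separate transform.
print(hasattr(TSNE, 'fit_transform'), hasattr(TSNE, 'transform'))
```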

### EXERCISE

- Create a PCA decomposition with 50 components and transform `X_train` and `X_val`.
- Train a new model on the transformed data, and validate on the transformed validation data.
- Do you get a better result than before?

**Stretch goal:** put the PCA transformer and the estimator into a pipeline and use cross-validation grid-search to find the optimal number of principal components to use.

In [None]:

# YOUR CODE HERE


In [None]:
pca = PCA(n_components=50)
pca.fit(X_train)
X_train_50 = pca.transform(X_train)
X_val_50 = pca.transform(X_val)

In [None]:
clf = RandomForestClassifier()
clf.fit(X_train_50, y_train)
y_pred = clf.predict(X_val_50)
print(classification_report(y_val, y_pred))

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

pca = PCA()
rfc = RandomForestClassifier(min_samples_leaf=3)
pipe = Pipeline(steps=[('pca', pca), ('rfc', rfc)])

param_grid = {
    'pca__n_components': np.logspace(0.5, 3, 6, dtype=int),
    'rfc__max_depth': [3, 5, 7, 9],
}

cv = GridSearchCV(pipe, param_grid, n_jobs=6, verbose=5)
cv.fit(X_train, y_train)

In [None]:
cv.best_params_

In [None]:
y_pred = cv.predict(X_val)
print(classification_report(y_val, y_pred))

Conclusion: it's not much better than the original model. Oh well!

## One more thing...

It's fun to play with adding principal components:

In [None]:
pca = PCA(n_components=200, whiten=True)
pca.fit(X_train)
X_train_pca = pca.transform(X_train)
X_val_pca = pca.transform(X_val)

In [None]:
from ipywidgets import interact

@interact(instance=(0, 498, 1), components=(1, 201, 5))
def show(instance, components):
    img = (X_train_pca[instance] * pca.components_.T).T
    _, ax = plt.subplots(figsize=(8, 8), subplot_kw=no_ticks)
    im = np.sum(img[:components], axis=0)
    ax.imshow(im.reshape(32, 32))
    ax.set_title(f"First {components} components of instance {instance}")
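The widget above gives a qualitative picture. For a quantitative version of the same idea, here is a sketch that rebuilds a record from its first `k` components and adds the mean back in. It assumes an unwhitened PCA on random stand-in data, and the helper name `reconstruct` is made up.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_demo = rng.random((100, 64))  # Stand-in data, not the fossils.

# No whitening in this sketch, so scores map straight back to pixels.
pca_demo = PCA(n_components=20, svd_solver='full').fit(X_demo)
scores = pca_demo.transform(X_demo)

def reconstruct(scores_row, k):
    """Rebuild one record from its first k principal components."""
    return pca_demo.mean_ + scores_row[:k] @ pca_demo.components_[:k]

# Using every fitted component matches sklearn's own inverse_transform.
full = reconstruct(scores[0], 20)
print(np.allclose(full, pca_demo.inverse_transform(scores[:1])[0]))

# The reconstruction error shrinks as components are added.
errors = [np.linalg.norm(X_demo[0] - reconstruct(scores[0], k))
          for k in (1, 5, 20)]
print(errors)
```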

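## Dimensionality reduction on a handwritten digits dataset

Just for fun, let's compare PCA with t-SNE on scikit-learn's built-in handwritten digits dataset. Note that this is not MNIST itself: `load_digits` returns a much smaller set of 8 &times; 8 images from the UCI optical digits collection.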

In [None]:
from sklearn.datasets import load_digits

digits = load_digits()

digits.data.shape

In [None]:
plt.imshow(digits.data[234].reshape(8, 8))

### EXERCISE

Can you adapt the code above to make (1) the 2-component PCA decomposition and (2) the 2-component t-SNE manifold? Then try crossplotting the 2 components for each decomposition, as we did before. Which one is better?

Give it a try before you scroll down for the solution.

In [None]:

# YOUR CODE HERE


In [None]:
pca = PCA(n_components=2).fit(digits.data)
digits_pca = pca.transform(digits.data)

tsne = TSNE(n_components=2, random_state=42)
digits_tsne = tsne.fit_transform(digits.data)

In [None]:
fig, axs = plt.subplots(ncols=2, figsize=(12, 6), subplot_kw=no_ticks)
axs[0].set_title('PCA')
axs[0].scatter(*digits_pca.T, c=digits.target)
axs[1].set_title('t-SNE')
axs[1].scatter(*digits_tsne.T, c=digits.target)
plt.show()

&copy; 2020 Agile Scientific, licensed CC-BY