# A Siamese network for fruit classification

## Instructions - please read carefully

Do not share this code or your solution with others.  For example, don't make it available on the web.  Also, don't share the data set I prepared.

In the code below, a Siamese network is built and used to predict the type of a fruit from its image.  
With a Siamese network we're able to do this well even for fruit not seen in training!

1. Review the Siamese network lecture and read and understand the code below (this step is important).

2. Insert your name below the title.

3. Download the data file I prepared and unzip it:  https://drive.google.com/file/d/1fVZuFdlcx4K2vqLDYkwcFbrdcHIeVfrS/view?usp=sharing

4. Add your own code or comments below where you see the problem prompts.

5. Do not add any imports except as needed to read the data file in Colab.  See the instructor if you'd like an exception.

6. Run your code from top to bottom before submitting.

In [None]:
from pathlib import Path
import shutil
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, ParameterGrid
from sklearn.manifold import TSNE

import tensorflow as tf
from tensorflow.keras import models, layers, Input, Model
from tensorflow.keras.layers import Lambda
from tensorflow.keras.callbacks import EarlyStopping
import tensorflow.keras.backend as K
from tensorflow.keras.preprocessing.image import load_img

from IPython.display import display, HTML

In [None]:
# display options
pd.set_option('display.max_columns', 600)
pd.options.display.width = 120
pd.options.display.max_colwidth = 50
display(HTML("<style>.container { width:100% !important; }</style>"))

In [None]:
def plot_metric(history, metric='loss'):
    """ Plot training and test values for a metric. """

    val_metric = 'val_'+metric
    plt.plot(history.history[metric])
    plt.plot(history.history[val_metric])
    plt.title('model '+metric)
    plt.ylabel(metric)
    plt.xlabel('epoch')
    plt.legend(['train', 'test'])
    plt.show();

Seeding NumPy's random number generator will help with replicability, but does not control all aspects of randomness in the code.

In [None]:
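# a fixed integer is needed for replicability; seed() with no argument reseeds from fresh entropy
# (the value 0 is an arbitrary choice)
np.random.seed(0)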

### Read the small fruit data set

#### Problem 1.  Modify the following cell so that data can be read from `data_dir`

In [None]:
# if you're running TensorFlow locally, you'll want to use something like this:
# data_dir = Path("C:/Users/Glenn/Google Drive/CSUMB/courses/CST463-adv-machine-learning/datasets/fruit-360/fruit-360-small")

# if you're using Colab, you'll want to use something like this:
# from google.colab import drive
# drive.mount('/content/drive')
# data_dir=Path('/content/drive/MyDrive/fruit-360-small/fruit-360-small')

In [None]:
img_shape = (100, 100, 3)

def read_subset(subset_name, data_dir):
    """ Read the images in a subset directory; labels come from the file names. """
    subset_dir = data_dir / subset_name
    # list the files once, in a fixed order
    pics = sorted(subset_dir.glob('*.jpg'))
    num_pics = len(pics)
    print(num_pics)
    X = np.zeros((num_pics, img_shape[0], img_shape[1], img_shape[2]))
    y = np.empty(num_pics, dtype='object')
    for i, pic in enumerate(pics):
        # the fruit name is the part of the file name before the first '-'
        y[i] = pic.name.split('-')[0]
        # convert the PIL image to an array before storing it
        X[i] = np.asarray(load_img(pic))
    return X, y

In [None]:
X_train, y_train = read_subset('train', data_dir)
X_test, y_test  = read_subset('test', data_dir)

In [None]:
# sanity check
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

In [None]:
# sanity check
print(y_train[:3])

### Preprocess the data

In [None]:
# from integers in [0,255] to floats in [0,1]
X_train = X_train.astype('float32') / 255
X_test  = X_test.astype('float32') / 255

In [None]:
X_train.shape

In [None]:
plt.imshow(X_train[0]);

In [None]:
plt.imshow(X_train[1000]);

### Build a batch generator for a Siamese network

A training example for this Siamese network is two photos, plus a 0/1 label indicating whether the photos are of the same kind of fruit.

A label value of 0 means the fruit type is the same.

For flexibility, we use a batch generator instead of creating a fixed training set.

In [None]:
def make_batch(X, y, batch_size):
    """ Return (X1, X2), matching, where X1 and X2 are NumPy arrays containing rows from X.

    The first half of the rows in X1 and X2 will be random rows of X associated with
    a randomly-selected y value.  The second half of the rows in X1 and X2 will
    be random rows of X.  'matching' is a 1D array such that
    matching[i] is 0 if X1[i] and X2[i] have the same y value, else is 1.
    """

    n = batch_size // 2

    # pick a random class in y, call it class a
    y_a = np.random.choice(y)

    # index values in X
    X_indexes = np.arange(X.shape[0])

    # pick n*2 random rows of X associated with class a
    # pick n*2 random rows of X
    arows = np.random.choice(X_indexes[y == y_a], size=n*2, replace=True)
    brows = np.random.choice(X_indexes,           size=n*2, replace=True)

    # create the batch
    X1 = X[np.concatenate([arows[:n],
                           brows[:n]])]
    X2 = X[np.concatenate([arows[n:],
                           brows[n:]])]
    matching = np.concatenate([np.full(n, 0), (y[brows[:n]] != y[brows[n:]]).astype(int)])

    return (X1, X2), matching


def batch_generator(X, y, batch_size=32):
    while True:
        yield make_batch(X, y, batch_size)

In [None]:
# basic test of make_batch()
(X1, X2), matching = make_batch(X_train, y_train, 32)
print(X1.shape)
print(X2.shape)
print(matching.shape)
print(matching.mean())
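The last value printed by the test above is the fraction of pairs labeled 1 (different fruit).  Because the first half of each batch is always labeled 0 and the second half is made of random pairs, that fraction is usually below 0.5.  The sketch below estimates the expected value, assuming an even batch size and rows drawn as in `make_batch()`.  The helper name `expected_label_mean` and the toy labels are hypothetical and are not part of the assignment code.

```python
import numpy as np

def expected_label_mean(y):
    """Expected mean of 'matching' for a batch built like make_batch()."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    # probability that two independently drawn rows share a class
    p_same = np.sum(p ** 2)
    # first half of the batch is always 0; second half is 1 when the classes differ
    return 0.5 * (1.0 - p_same)

# toy labels: four equally common classes
y_toy = np.array(['apple'] * 10 + ['banana'] * 10 + ['cherry'] * 10 + ['kiwi'] * 10)
print(expected_label_mean(y_toy))  # 0.375
```

With many equally common classes the value approaches 0.5, so the batches are close to balanced but lean slightly toward "same" pairs.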

### Create the embedding model

#### Problem 2.  Define the convolutional embedding model.

The embedding model is a convolutional model that inputs a photo and outputs an "embedding" (also called an "encoding") of the photo.  

You can think of the embedding model as doing dimensionality reduction, but it is doing it in a way that photos of the same kind of fruit are clustered together in the lower-dimensional space.

Note that you have to define the embedding size (you might want to try a number between 16 and 128) and the convolutional model (you might want to start with a basic model containing only Conv2D and MaxPooling2D layers).

<br>![](https://drive.google.com/uc?id=1ggncdb7v-z6PnurAjPTI8-4l4uOwOFBd)

In [None]:
K.clear_session()  # delete old models
act_fun = 'relu'
embedding_size = None      # DEFINE THE EMBEDDING SIZE

pool_size = 2
conv_size = 3

inputs = Input(img_shape)

# YOUR CONVOLUTIONAL LAYERS GO HERE

x = layers.Flatten()(x)
encoded = layers.Dense(embedding_size, activation=act_fun)(x)
embedding_model = Model(inputs, encoded)

In [None]:
embedding_model.summary()

### Create the Siamese model

The Siamese network takes two photos as input, creates embeddings for them, computes the distance between the embeddings, then passes the distance through a one-unit dense layer with a sigmoid activation so that the result is between 0 and 1.  This normalized distance value is the output.

The goal of training is for the distance to be small if the input images are of the same fruit type, and for the distance to be large if the input images are of different fruit types.

<br>![](https://drive.google.com/uc?id=1ghBOScyj342jBMZFj0RkZdEaJzZeebAH)
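
As a rough numeric illustration of the final normalization step, the sketch below applies a sigmoid to a weighted distance.  The weight `w` and bias `b` are made-up values standing in for parameters the network would learn, and the distances are hypothetical; none of these numbers come from a trained model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical distances between pairs of embeddings
distances = np.array([0.0, 0.5, 1.0, 2.0, 5.0])

# made-up weight and bias standing in for learned parameters
w, b = 2.0, -2.0

# larger distances map to outputs closer to 1, smaller ones to outputs closer to 0
outputs = sigmoid(w * distances + b)
print(outputs)
```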

#### Distance functions

Lots of different distance functions can be used in a Siamese network.  Here are two.

In [None]:
def euclidean_distance(vects):
    x, y = vects
    sum_square = K.sum(K.square(x - y), axis=1, keepdims=True)
    return K.sqrt(K.maximum(sum_square, K.epsilon()))

# from Andrew Ng's video:
# https://www.youtube.com/watch?v=0NSLgoEtdnw&ab_channel=DeepLearningAI
def abs_difference(vects):
    x, y = vects
    return K.abs(x - y)
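
The two distance functions produce outputs of different shapes, which can be checked with a small NumPy sketch.  This is an illustration on made-up embeddings, not the Keras code itself; `eps` is an assumed stand-in for `K.epsilon()`.

```python
import numpy as np

# two toy batches of embeddings (batch of 2, embedding size 2)
a = np.array([[0.0, 3.0], [1.0, 1.0]])
b = np.array([[4.0, 0.0], [1.0, 1.0]])

eps = 1e-7  # assumed stand-in for K.epsilon()

# Euclidean distance: one number per pair, shape (batch, 1)
euclid = np.sqrt(np.maximum(np.sum(np.square(a - b), axis=1, keepdims=True), eps))

# absolute difference: one number per embedding dimension, shape (batch, embedding_size)
absdiff = np.abs(a - b)

print(euclid.shape, absdiff.shape)  # (2, 1) (2, 2)
print(euclid[0, 0])                 # 5.0
```

With the absolute difference, the comparison is still a vector, so any reduction to a single number has to happen in a later layer.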

#### Problem 3. Select a distance function by modifying the following cell.

In [None]:
# dist_fun = euclidean_distance
# dist_fun = abs_difference

#### Siamese network

In [None]:
# create the two heads
input0 = Input(img_shape, name='input0')
input1 = Input(img_shape, name='input1')
sub0 = embedding_model(input0)
sub1 = embedding_model(input1)

# compute the distance between the embeddings,
# and send to sigmoid function
merge_layer = Lambda(dist_fun)([sub0, sub1])

dense_layer = layers.Dense(1, activation='sigmoid')(merge_layer)

model = Model([input0, input1], dense_layer)

In [None]:
model.summary()

### Compile and train the Siamese network

The output of the Siamese network is a value between 0 and 1, and the target value is 0 or 1.  This is just like in a binary classification problem.

We can use binary crossentropy loss as the loss function, but contrastive loss is also popular.

Contrastive loss is not built into Keras, so we need to define a custom loss function.

#### Contrastive loss function

In [None]:
# this code from https://keras.io/examples/vision/siamese_contrastive/
def closs(margin=1):
    """Provides 'contrastive_loss' an enclosing scope with variable 'margin'.

    Arguments:
        margin: Number, defines the baseline for distance for which pairs
                should be classified as dissimilar.

    Returns:
        'contrastive_loss' function with data ('margin') attached.
    """

    def contrastive_loss(y_true, y_pred):
        """Calculates the contrastive loss.

        Contrastive loss = mean( (1-true_value) * square(prediction) +
                                 true_value * square( max(margin-prediction, 0) ))

        Arguments:
            y_true: List of labels, each label is of type float32.
            y_pred: List of predictions of same length as of y_true,
                    each prediction is of type float32.

        Returns:
            A tensor containing contrastive loss as floating point value.
        """

        square_pred = tf.math.square(y_pred)
        margin_square = tf.math.square(tf.math.maximum(margin - (y_pred), 0))
        y_truef = K.cast(y_true, 'float32')
        return tf.math.reduce_mean(
            (1 - y_truef) * square_pred + (y_truef) * margin_square
        )

    return contrastive_loss
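
To make the formula concrete, here is a small NumPy sketch that evaluates the same expression on four made-up predictions.  The function name `contrastive_loss_np` and the toy values are hypothetical; the TensorFlow function above is what is used for training.

```python
import numpy as np

def contrastive_loss_np(y_true, y_pred, margin=1.0):
    """NumPy version of the contrastive loss formula, for illustration only."""
    y_true = np.asarray(y_true, dtype='float64')
    y_pred = np.asarray(y_pred, dtype='float64')
    # label 0 (same fruit): penalize any distance
    square_pred = np.square(y_pred)
    # label 1 (different fruit): penalize only distances smaller than the margin
    margin_square = np.square(np.maximum(margin - y_pred, 0))
    return np.mean((1 - y_true) * square_pred + y_true * margin_square)

y_true = [0, 0, 1, 1]
y_pred = [0.1, 0.9, 0.2, 0.8]
# per-pair terms: 0.01, 0.81, 0.64, 0.04
print(contrastive_loss_np(y_true, y_pred))  # approximately 0.375
```

The second and third pairs dominate the loss: one is a matching pair predicted far apart, the other a non-matching pair predicted close together.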

In [None]:
early_stopping = EarlyStopping(patience=5, monitor='val_loss')

#### Problem 4.  Select a loss function by modifying the following cell.

In [None]:
# loss_fun = 'binary_crossentropy'
# loss_fun = closs(margin=1.0)    # contrastive loss

In [None]:
model.compile(optimizer='rmsprop', loss=loss_fun, metrics=['accuracy'])

history = model.fit(batch_generator(X_train, y_train, batch_size=64), steps_per_epoch=200, epochs=50,
                    validation_data=batch_generator(X_test, y_test, batch_size=32), validation_steps=75,
                    callbacks=[early_stopping])

In [None]:
plot_metric(history)

### Plot the embeddings of the fruit images

After training the Siamese network, we can use the trained embedding model to map fruit images to their embeddings.

The point of using the Siamese network is to obtain an embedding space in which images of the same fruit type are clustered together.

Was this clustering successful?  We can't visualize the embedding space directly, but we can use dimensionality reduction to map points in the embedding space to 2 dimensions.

In [None]:
def plot_embeddings(embedding_model, X, y, npoints=300):
    """ Plot embeddings in 2D. """

    # only use a randomly-chosen subset of the images
    npoints = min(X.shape[0], npoints)
    idx = np.random.choice(X.shape[0], size=npoints, replace=False)

    # map the images to points in the embedding space
    X_emb = embedding_model.predict(X[idx])

    # map the points in the embedding space to 2 dimensions
    reduced = TSNE(n_components=2).fit_transform(X_emb)

    # plot the points in 2D
    plt.figure(figsize=(8,8))
    sns.scatterplot(x=reduced[:, 0], y=reduced[:, 1], s=60, hue=y[idx])
    plt.legend(bbox_to_anchor=(1.05, 1), borderaxespad=0, prop={'size': 13})

In [None]:
plot_embeddings(embedding_model, X_test, y_test)

### Compute fruit prediction accuracy

Now we can build a fruit classifier.  

We can use any kind of classifier to do it:
- Map the training examples to points in the embedding space.  Each point has a label giving the fruit type.
- Train the classifier using these embeddings.
- To classify an image, map it to the embedding space, then make a prediction for that point using the trained classifier.

The confusing part about this is that there are two distinct training sets:
- One training set to train the Siamese network (training examples are pairs of images)
- Another training set to train the classifier (training examples are points in the embedding space)

Think about this: the fruit types in the two training sets can be different.  We can train the Siamese network on 20 different kinds of fruits, but then train the classifier on 100 different types of fruits.
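
The general embed-then-classify idea can be sketched on toy data.  Everything in this sketch is a hypothetical stand-in: `make_toy_images` produces tiny colored squares instead of fruit photos, `toy_embed` (mean color per channel) plays the role of a trained embedding model, and a hand-written 1-nearest-neighbor rule plays the role of the classifier.  It is meant to show the flow of data, not a solution to the problems below.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_toy_images(color, n):
    """n tiny 4x4 RGB 'images' near one color (stand-in for fruit photos)."""
    noise = rng.uniform(-0.05, 0.05, size=(n, 4, 4, 3))
    return np.clip(np.asarray(color) + noise, 0.0, 1.0)

def toy_embed(X):
    """Stand-in for a trained embedding model: mean color per channel."""
    return X.mean(axis=(1, 2))

colors = {'red': [0.9, 0.1, 0.1], 'green': [0.1, 0.9, 0.1], 'blue': [0.1, 0.1, 0.9]}

# labeled examples used to fit the classifier, and new examples to classify
X_fit = np.concatenate([make_toy_images(c, 5) for c in colors.values()])
y_fit = np.repeat(list(colors), 5)
X_new = np.concatenate([make_toy_images(c, 3) for c in colors.values()])
y_new = np.repeat(list(colors), 3)

# map both sets to the embedding space
E_fit, E_new = toy_embed(X_fit), toy_embed(X_new)

# 1-nearest-neighbor: each new point takes the label of its closest fitted point
dists = np.linalg.norm(E_new[:, None, :] - E_fit[None, :, :], axis=2)
y_pred = y_fit[dists.argmin(axis=1)]

accuracy = np.mean(y_pred == y_new)
print(accuracy)  # 1.0
```

Note that the embedding function never sees the labels used by the classifier, which is why new classes can be added at the classifier stage without changing the embedding.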

#### Create training and test data for the fruit classifier.

The images in X_train were used to train the Siamese network.

We can't use the images in X_train to train the classifier, because we want to use the classifier on fruits never seen by the Siamese network.  This will make it easy to extend the classifier to work with new fruits.  We won't need to re-train the Siamese network.

To get training data for the classifier, we will break up X_test into training and test parts.

In [None]:
X_train_knn, X_test_knn, y_train_knn, y_test_knn = train_test_split(X_test, y_test, test_size=0.30, random_state=42)

In [None]:
# sanity check
print(X_train_knn.shape)
print(y_train_knn.shape)
print(X_test_knn.shape)
print(y_test_knn.shape)

In [None]:
# It would be a little better to do the train/test split in a way that
# every fruit was represented with the same number of images in the training set.

plt.figure(figsize=(8,4))
plt.title('Counts of images in training set by fruit type')
pd.Series(y_train_knn).value_counts().plot.bar();

#### Problem 5.  Create training and test data for a KNN classifier.   

Your code will modify `X_train_knn` and `X_test_knn`.  Don't forget that scaling is normally used with KNN.

This problem will require some thinking.  Make sure you have understood the discussion under 'Compute fruit prediction accuracy' above.

In [None]:
# YOUR CODE HERE

#### Problem 6. Create and train a KNN classifier.

You should use KNeighborsClassifier, and your trained classifier should have variable name 'clf'.

In [None]:
# YOUR CODE HERE

#### Compute baseline and test accuracy

In [None]:
# baseline: always predict the most common label in the test set
labels, label_counts = np.unique(y_test_knn, return_counts=True)
most_common_label = labels[label_counts.argmax()]
baseline_accuracy = np.mean(y_test_knn == most_common_label)

test_accuracy = clf.score(X_test_knn, y_test_knn)

print('baseline accuracy: {:0.2f}'.format(baseline_accuracy))
print('test accuracy: {:0.2f}'.format(test_accuracy))

I achieve a test accuracy of around 0.9 without tuning the code.
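
For reference, a majority-class baseline is the accuracy obtained by always predicting the most common label.  A minimal sketch on made-up labels (the array `y_toy` is hypothetical):

```python
import numpy as np

y_toy = np.array(['pear', 'apple', 'apple', 'apple', 'plum'], dtype=object)

# np.unique returns the sorted distinct labels and how often each occurs
labels, counts = np.unique(y_toy, return_counts=True)
most_common = labels[counts.argmax()]

# fraction of examples that the constant prediction gets right
baseline = np.mean(y_toy == most_common)
print(most_common, baseline)  # apple 0.6
```

The index returned by `argmax()` refers to a position in `labels`, not a position in the original label array.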

#### Problem 7.  Tune the system.

Try to achieve the best test accuracy you can on fruit prediction.  

Here are the things you're allowed to tweak:
- The embedding model, including the number and types of layers, the sizes of the layers, and activation functions.  You can also use dropout layers, batch normalization, and L1/L2 regularization.
- The optimizer, batch size, number of epochs, steps/epoch, and early stopping parameters.  You can initialize the learning rate and use learning rate scheduling.
- The batch generator.

You may want to use random or grid search for tuning.

Do not add new imports, change X_train, y_train, X_test, y_test, modify the random seed, modify the test/train splits, etc.

If in doubt about what you can change, ask the instructor.

### Conclusions


#### Problem 8.  Add conclusions.  Discuss the main things you learned from your work.

YOUR TEXT HERE