# Assignment 2: CNNs and Transfer Learning
Name(s): Scott Charles

Note: the report with the following headings can be completed in this notebook, or in a separate document.

## 1. Dataset Description
A description of the dataset and the class you chose to predict, including:
1. Biases and limitations of the dataset
2. Class imbalance
3. A summary of your impressions of the dataset

## 2. Basic Model
A description of the model you built from scratch, including:
1. The architecture of the model
2. The loss function and optimizer you used
3. The metrics you used to evaluate the model
4. A discussion of how you approached building, training, and refining the model

## 3. Transfer Learning Model
A description of the transfer learning model, including:
1. A reference to the pre-trained model you used
2. Why you chose that model
3. A discussion of how you approached adding and training the new layers

## 4. Discussion/Conclusion
A discussion/conclusion section describing:
1. How the two models compared in terms of performance and ease of creation
2. Challenges, advantages, and limitations of each approach
3. Which you would choose if you were deploying this model in a production environment
4. Any other thoughts or observations you have about the process

# CNN Model from Scratch

In [None]:
# Import the things
# Note that you will need to pip install tensorflow-datasets

import tensorflow as tf
import tensorflow_datasets as tfds
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Data Loading
The handy `tfds.load` function can't fetch this dataset by itself due to a [known issue](https://github.com/tensorflow/datasets/issues/1482), so follow these steps to get the data (~1.5GB):

### Option 1: On your local machine

1. Download the files from these 5 URLs:
    | URL | Size (bytes) | Checksum | Filename |
    | --- | ------------ | -------- | -------- |
    | https://drive.google.com/uc?export=download&id=0B7EVK8r0v71pY0NSMzRuSXJEVkk | 2836386 | fc955bcb3ef8fbdf7d5640d9a8693a8431b5f2ee291a5c1449a1549e7e073fe7 | list_eval_partition.txt |
    | https://drive.google.com/uc?export=download&id=0B7EVK8r0v71pZjFTYXZWM3FlRnM | 1443490838 | 46fb89443c578308acf364d7d379fe1b9efb793042c0af734b6112e4fd3a8c74 | img_align_celeba.zip |
    | https://drive.google.com/uc?export=download&id=0B7EVK8r0v71pblRyaVFSWGxPY0U | 26721026 | f0e5da289d5ccf75ffe8811132694922b60f2af59256ed362afa03fefba324d0 | list_attr_celeba.txt |
    | https://drive.google.com/uc?export=download&id=0B7EVK8r0v71pd0FJY3Blby1HUTQ | 12156055 | 6c02a87569907f6db2ba99019085697596730e8129f67a3d61659f198c48d43b | list_landmarks_align_celeba.txt |
    | https://drive.google.com/uc?export=download&id=1_ee_0u7vcNLOfNLegJRHmolfH5ICW-XS | 3424458 | c6143857c3e2630ac2da9f782e9c1232e5e59be993a9d44e8a7916c78a6158c0 | identity_CelebA.txt |
2. Move the files to `~/tensorflow_datasets/downloads/manual/`

At this stage your directory structure should look like this:
```bash
~
└── tensorflow_datasets
    └── downloads
        └── manual
            ├── identity_CelebA.txt
            ├── img_align_celeba.zip
            ├── list_attr_celeba.txt
            ├── list_eval_partition.txt
            └── list_landmarks_align_celeba.txt
```
If there are already things in the `manual` directory, that's fine. Just add the files to it.
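
Since the table above lists a size and checksum for each file, it can be worth checking the downloads before loading them. The following is a minimal sketch, assuming the checksums are SHA-256 hex digests (they are 64 hex characters long, which is consistent with that). The helper names `sha256_of` and `verify_download` are made up for this example, and the demo runs on a throwaway temporary file rather than the real data.

```python
import hashlib
import os
import tempfile

def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_download(path, expected_size, expected_sha256):
    """Check a file's size in bytes and its SHA-256 digest (hypothetical helper)."""
    return (os.path.getsize(path) == expected_size
            and sha256_of(path) == expected_sha256)

# Demo on a small temporary file so the sketch runs without the real downloads
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    demo_path = tmp.name

demo_ok = verify_download(demo_path, 5, hashlib.sha256(b"hello").hexdigest())
demo_bad = verify_download(demo_path, 5, "0" * 64)  # wrong digest on purpose
os.remove(demo_path)

print(f"matching digest: {demo_ok}, wrong digest: {demo_bad}")
```

For the real files, pass the path inside `~/tensorflow_datasets/downloads/manual/` (expanded with `os.path.expanduser`) together with the size and checksum from the matching table row.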

### Option 2: On Google Colab
1. Download the same 5 files to your local machine
2. Upload the files to your Google Drive in a folder named `tensorflow_datasets/downloads/manual/`. The `tensorflow_datasets` folder should be in the root of your Google Drive (unless you want to change the paths in the code below).

Run the following cell to mount your Google Drive and configure the data directory.

In [None]:
try:
    from google.colab import drive
    COLAB = True
    gdrive_root = '/content/gdrive'
    drive.mount(gdrive_root)
    data_dir = gdrive_root + '/My Drive/tensorflow_datasets/'
except ImportError:  # google.colab is only available when running on Colab
    COLAB = False
    data_dir = '~/tensorflow_datasets/'

In [None]:
# Finally we can load the data, but this will still take a while the first time
ds_train = tfds.load('celeb_a', split='train[:80%]', shuffle_files=True, data_dir=data_dir)
ds_val = tfds.load('celeb_a', split='train[80%:]', shuffle_files=True, data_dir=data_dir)
ds_test = tfds.load('celeb_a', split='test', shuffle_files=True, data_dir=data_dir)

## Data exploration
Feel free to explore the data in other ways as well. These cells are provided to give some idea of how to interact with the TensorFlow Dataset format.

The examples shown here illustrate the "Eyeglasses" attribute - pick a **different one** for your assignment.

In [None]:
# Choose a different attribute for your project!
ATTRIBUTE = 'Eyeglasses'

# look at some examples
fig, axes = plt.subplots(1, 5, figsize=(12, 3))
for example, ax in zip(ds_train.take(5), axes):
    image, label = example["image"], example["attributes"][ATTRIBUTE]
    ax.imshow(image)
    ax.set_title(f'{ATTRIBUTE}: {label}')
    ax.axis('off')

In [None]:
# Check out class balance of a random sample
label_count = 0
SHUF_BUF = 1024
random_sample = 1000

# ds.shuffle fills a buffer of N records (1024 in this case) and draws from it at random,
# so this is a sample of the first couple of thousand records, not of the whole dataset
for record in ds_train.shuffle(SHUF_BUF).take(random_sample):
    label_count += int(record["attributes"][ATTRIBUTE])

print(f'{label_count} of {random_sample} have {ATTRIBUTE} = True')

## Preprocessing
I recommend using the following functions to preprocess and sample the data. This is a large and unbalanced dataset, so the `get_balanced_data` function creates a subsample with an equal number of positive and negative examples.
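
To see why the imbalance matters, here is a small NumPy sketch. The 6.5% positive rate is a made-up number for illustration, not a measured CelebA statistic, and the index-based undersampling is a simplified stand-in for what `get_balanced_data` does with two filtered streams. It shows that a "classifier" that always predicts negative looks accurate on unbalanced labels but drops to 50% on a balanced subsample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical positive rate, for illustration only
POSITIVE_RATE = 0.065
labels = rng.random(10_000) < POSITIVE_RATE

# A useless model that always predicts "negative"
always_negative = np.zeros_like(labels)
baseline_accuracy = np.mean(always_negative == labels)

# Undersample the negatives so both classes have the same count
pos_idx = np.flatnonzero(labels)
neg_idx = rng.choice(np.flatnonzero(~labels), size=len(pos_idx), replace=False)
balanced_labels = labels[np.concatenate([pos_idx, neg_idx])]
balanced_baseline = np.mean(~balanced_labels)

print(f'always-negative accuracy, unbalanced: {baseline_accuracy:.3f}')
print(f'always-negative accuracy, balanced:   {balanced_baseline:.3f}')
```

On the balanced subsample, accuracy above 50% means the model has learned something beyond the class frequencies.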

In [None]:
# To keep things manageable I recommend downsampling to 128x128
# This will distort the images, but that doesn't matter very much
IMAGE_SIZE = 128

def preprocess_input_dict(feat_dict):
    """
    Separates the image and label from the feature dictionary.
    """
    image = feat_dict['image']
    label = feat_dict['attributes'][ATTRIBUTE]

    # Resize and normalize image.
    image = tf.cast(image, tf.float32)
    image = tf.image.resize(image, [IMAGE_SIZE, IMAGE_SIZE])
    image /= 255.0

    return (image, label)

def get_balanced_data(ds, batch_size):
    """
    Returns a shuffled, batched dataset with roughly equal numbers of
    positive and negative examples, sampled from the given dataset.
    Maps each batch to the preprocessing function.
    """
    pos_ds = ds.filter(lambda d: d["attributes"][ATTRIBUTE] == True)
    neg_ds = ds.filter(lambda d: d["attributes"][ATTRIBUTE] == False)
    balanced = tf.data.Dataset.sample_from_datasets(
        [pos_ds, neg_ds], 
        weights=[0.5, 0.5], 
        stop_on_empty_dataset=True)
    balanced = balanced.shuffle(SHUF_BUF).batch(batch_size)
    return balanced.map(preprocess_input_dict)


## Sample model creation
Since we haven't talked about working with the weird [`tf.data.Dataset`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) format, I've included a sample model creation process. It uses the `get_balanced_data` function to create a balanced dataset and then trains a simple model on it. It's not a good model (although it behaves much better than I expected), but hopefully it gives you a starting point for building your own models.
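
One quirk of the format is that a dataset is a lazy pipeline, so its size is often unknown until you iterate over it. The toy sketch below uses a range of integers as a stand-in for the real data: after a `filter`, TensorFlow can no longer report how many elements remain, which is also true of the filtered datasets built by `get_balanced_data`.

```python
import tensorflow as tf

# Toy stand-in for a real dataset: the integers 0..9
numbers = tf.data.Dataset.range(10)
evens = numbers.filter(lambda x: x % 2 == 0)

# The size is known before filtering but not after
known = int(numbers.cardinality().numpy())
unknown = int(evens.cardinality().numpy())
evens_are_unknown = unknown == tf.data.UNKNOWN_CARDINALITY

# The only way to find the size is to iterate over the dataset
counted = sum(1 for _ in evens)

print(f'before filter: {known}, after filter unknown: {evens_are_unknown}, counted: {counted}')
```

This is why Keras cannot show a total step count during the first epoch of training on such a dataset.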

In [None]:
# examples of calling the get_balanced_data function
# Note that the resulting objects are not pandas dataframes or even numpy arrays,
# but tensorflow datasets that have indeterminate size
BATCH_SIZE = 32
train = get_balanced_data(ds_train, BATCH_SIZE)
val = get_balanced_data(ds_val, BATCH_SIZE)
test = get_balanced_data(ds_test, BATCH_SIZE)

In [None]:
# There's a whole bunch of metrics you could look at - which ones make the most sense?
metrics = [
    tf.keras.metrics.BinaryAccuracy(name='accuracy'),
    # add any other metrics here
]

model = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(IMAGE_SIZE, IMAGE_SIZE, 3)),
    tf.keras.layers.Conv2D(16, 3, padding='same', activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.summary()

In [None]:
# Feel free to explore different optimizers, number of epochs, etc
model.compile(
    optimizer=tf.keras.optimizers.Adam(0.001),
    loss='BinaryCrossentropy',
    metrics=metrics)

history = model.fit(
    train,
    epochs=5, 
    validation_data=val,
    )
    

In [None]:
# plot the history
hist = pd.DataFrame(history.history)
plt.plot(hist['accuracy'], label='train')
plt.plot(hist['val_accuracy'], label='val')
plt.legend()

### Model Evaluation
Finally, here are some functions you can use to help evaluate the model.
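
As a quick reference for reading a confusion matrix, the sketch below derives accuracy, precision, recall, and F1 from a 2x2 matrix. The counts are invented for illustration and are not results from any model in this notebook. The layout (rows are true classes, columns are predicted classes, negative class first) follows scikit-learn's `confusion_matrix` convention.

```python
import numpy as np

# Hypothetical counts: rows = true class, columns = predicted class,
# ordered [negative, positive]
cm = np.array([[90, 10],
               [5, 45]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f'accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}')
```

Keras offers `tf.keras.metrics.Precision` and `tf.keras.metrics.Recall`, which can be added to the `metrics` list to track the same quantities during training.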

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay

def plot_confusion(model):
    y_pred = []
    y_true = []

    # iterate over the test dataset one batch at a time
    for image_batch, label_batch in test:
        y_true.extend(label_batch.numpy())
        preds = model.predict(image_batch, verbose=0)
        # predictions have shape (batch, 1), so flatten before thresholding
        y_pred.extend(preds.ravel() > 0.5)

    y_true = np.array(y_true)
    y_pred = np.array(y_pred)
    cm = confusion_matrix(y_true, y_pred)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=[f'Not {ATTRIBUTE}', ATTRIBUTE])
    disp.plot()

plot_confusion(model)

# you can also call model.evaluate to calculate all the metrics at once
test_metrics = model.evaluate(test)
print(test_metrics)

# Transfer Learning Model
Building and training a transfer learning model is much like doing it from scratch, but before calling model.compile you need to freeze the layers of the pre-trained model. This is done by setting the `trainable` attribute of the layers to `False`. You can then add new layers to the model and train them as usual.

Basic process:
1. Choose a pre-trained model (I recommend sticking to the [Keras Applications models](https://www.tensorflow.org/api_docs/python/tf/keras/applications) to keep it simple)
2. Load the model and set `trainable=False` for all layers
3. Add new layers to the model to finalize the architecture
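
A quick way to confirm that step 2 worked is to count trainable and frozen parameters. The sketch below uses a tiny made-up network as a stand-in for a pre-trained base (so nothing has to be downloaded); the layer sizes and the `count_params` helper are hypothetical, and a real base would come from `tf.keras.applications`.

```python
import numpy as np
import tensorflow as tf

# Tiny stand-in for a pre-trained base model (hypothetical layer sizes)
inputs = tf.keras.Input(shape=(8, 8, 3))
x = tf.keras.layers.Conv2D(4, 3)(inputs)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
base = tf.keras.Model(inputs, x)

# Freeze every layer in the base
base.trainable = False

# Add a new classification head on top of the frozen base
output = tf.keras.layers.Dense(1, activation='sigmoid')(base.output)
model = tf.keras.Model(inputs=base.input, outputs=output)

def count_params(weights):
    """Total number of scalar parameters in a list of weight variables."""
    return int(sum(np.prod(tuple(w.shape)) for w in weights))

n_trainable = count_params(model.trainable_weights)
n_frozen = count_params(model.non_trainable_weights)

print(f'trainable: {n_trainable}, frozen: {n_frozen}')
```

Only the new `Dense` head (4 weights plus 1 bias) should count as trainable here, while the convolution's 112 parameters stay frozen. `model.summary()` reports the same split for a real model.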

In [None]:
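# Here's an example of how to use a pre-trained model
# Note: the ImageNet weights for VGG16 were trained on inputs prepared with
# tf.keras.applications.vgg16.preprocess_input (0-255 pixels, BGR, mean-centred),
# not the [0, 1] scaling used above, so matching that preprocessing may help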
base_model = tf.keras.applications.VGG16(
    include_top=False, 
    weights="imagenet",
    input_shape=(IMAGE_SIZE, IMAGE_SIZE, 3)
)

# freeze the base model
base_model.trainable = False

# Add on the classification "head"
flatten_layer = tf.keras.layers.Flatten()(base_model.output)
dense_layer = tf.keras.layers.Dense(64, activation='relu')(flatten_layer)
output = tf.keras.layers.Dense(1, activation='sigmoid')(dense_layer)
transfer_model = tf.keras.Model(inputs=base_model.input, outputs=output)

# compile the same way
transfer_model.compile(
    optimizer=tf.keras.optimizers.Adam(0.001),
    loss='BinaryCrossentropy',
    metrics=metrics
)

transfer_model.summary()

In [None]:
# train as usual
transfer_history = transfer_model.fit(
    train,
    epochs=5, 
    validation_data=val,
)