# Lab Sheet 10: Cats and dogs image classification with Deep Learning: the effect of artificially augmented data and larger datasets (not needed for the coursework)

This notebook is based on an exercise published by Google. Implement the training and move the whole training into the cloud. You can also try to use the whole large dataset.

This lab addresses the use of **artificially augmented data** and larger datasets on deep learning models. In addition, there is further practice in **using machine learning in the cloud**.

```
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
```

# Cat vs. Dog Image Classification


## 1) Loading Image Files and Reducing Overfitting with Dropout and Data Augmentation

In this notebook we will build a model to classify cats vs. dogs, and improve accuracy by employing a couple of strategies to reduce overfitting: **data augmentation** and **dropout**.

We will follow these steps:

1. Explore how data augmentation works by making random transformations to training images.
2. Add data augmentation to our data preprocessing.
3. Add dropout to the convnet.
4. Retrain the model and evaluate loss and accuracy.

Let's get started!

## 2) Exploring Data Augmentation

Let's get familiar with the concept of **data augmentation**, an essential way to fight overfitting for computer vision models.

In order to make the most of our few training examples, we will "augment" them via a number of random transformations, so that at training time, **our model will never see the exact same picture twice**. This helps prevent overfitting and helps the model generalize better.

This can be done by configuring a number of random transformations to be performed on the images read by our `ImageDataGenerator` instance. Let's get started with an example:

In [None]:
!pip install tensorflow

In [None]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest'
)

These are just a few of the options available (for more, see the [Keras documentation](https://keras.io/preprocessing/image/)). Let's quickly go over what we just wrote:

- `rotation_range` is a value in degrees (0–180), a range within which to randomly rotate pictures.
- `width_shift_range` and `height_shift_range` are ranges (as a fraction of total width or height) within which to randomly translate pictures horizontally or vertically.
- `shear_range` is for randomly applying shearing transformations.
- `zoom_range` is for randomly zooming inside pictures.
- `horizontal_flip` is for randomly flipping half of the images horizontally. This is relevant when there are no assumptions of horizontal asymmetry (e.g. real-world pictures).
- `fill_mode` is the strategy used for filling in newly created pixels, which can appear after a rotation or a width/height shift.
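
To make `fill_mode='nearest'` and `horizontal_flip` more concrete, here is a minimal pure-Python sketch on a tiny 2D "image" stored as a list of rows. The helper names `shift_right_nearest` and `flip_horizontal` are made up for this illustration, and the code is not how Keras implements these transformations; it only shows the idea behind them.

```python
def shift_right_nearest(image, pixels):
    """Shift every row right by `pixels` (0 <= pixels < width).

    The gap that opens on the left is filled with the nearest
    original pixel, i.e. the first value of each row.
    """
    shifted = []
    for row in image:
        kept = row[:len(row) - pixels]            # pixels that stay in view
        shifted.append([row[0]] * pixels + kept)  # fill the gap with the edge value
    return shifted


def flip_horizontal(image):
    """Mirror every row left to right."""
    return [row[::-1] for row in image]


tiny_image = [[1, 2, 3, 4],
              [5, 6, 7, 8]]

shifted_image = shift_right_nearest(tiny_image, 1)
flipped_image = flip_horizontal(tiny_image)
print(shifted_image)  # [[1, 1, 2, 3], [5, 5, 6, 7]]
print(flipped_image)  # [[4, 3, 2, 1], [8, 7, 6, 5]]
```

In the real generator, the shift amount and whether to flip are drawn at random for every image, which is why each variant looks different.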

Let's take a look at our augmented images. First let's set up our example files, as in Exercise 1.


**NOTE:** The 2,000 images used in this exercise are excerpted from the ["Dogs vs. Cats" dataset](https://www.kaggle.com/c/dogs-vs-cats/data) available on Kaggle, which contains 25,000 images. Here, we use a subset of the full dataset to decrease training time for educational purposes.

In [None]:
!wget --no-check-certificate \
   https://storage.googleapis.com/mledu-datasets/cats_and_dogs_filtered.zip -O \
   /tmp/cats_and_dogs_filtered.zip

In [None]:
import os
import zipfile

local_zip = '/tmp/cats_and_dogs_filtered.zip'
# The context manager closes the archive even if extraction fails
with zipfile.ZipFile(local_zip, 'r') as zip_ref:
    zip_ref.extractall('/tmp')

base_dir = '/tmp/cats_and_dogs_filtered'
train_dir = os.path.join(base_dir, 'train')
validation_dir = os.path.join(base_dir, 'validation')

# Directory with our training cat pictures
train_cats_dir = os.path.join(train_dir, 'cats')

# Directory with our training dog pictures
train_dogs_dir = os.path.join(train_dir, 'dogs')

# Directory with our validation cat pictures
validation_cats_dir = os.path.join(validation_dir, 'cats')

# Directory with our validation dog pictures
validation_dogs_dir = os.path.join(validation_dir, 'dogs')

train_cat_fnames = os.listdir(train_cats_dir)
train_dog_fnames = os.listdir(train_dogs_dir)

Next, let's apply the `datagen` transformations to a cat image from the training set to produce five random variants. Rerun the cell a few times to see fresh batches of random variants.

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt

from tensorflow.keras.preprocessing.image import array_to_img, img_to_array, load_img

img_path = os.path.join(train_cats_dir, train_cat_fnames[2])
img = load_img(img_path, target_size=(150, 150))  # this is a PIL image
x = img_to_array(img)  # Numpy array with shape (150, 150, 3)
x = x.reshape((1,) + x.shape)  # Numpy array with shape (1, 150, 150, 3)

# The .flow() command below generates batches of randomly transformed images
# It will loop indefinitely, so we need to `break` the loop at some point!
i = 0
for batch in datagen.flow(x, batch_size=1):
  plt.figure(i)
  imgplot = plt.imshow(array_to_img(batch[0]))
  i += 1
  if i % 5 == 0:
    break

## 3) Add Data Augmentation to the Preprocessing Step

Now let's add our data-augmentation transformations from [**Exploring Data Augmentation**](#scrollTo=E3sSwzshfSpE) to our data preprocessing configuration:

In [None]:
# Adding rescale, rotation_range, width_shift_range, height_shift_range,
# shear_range, zoom_range, and horizontal flip to our ImageDataGenerator
train_datagen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,)

# Note that the validation data should not be augmented!
test_datagen = ImageDataGenerator(rescale=1./255)

# Flow training images in batches of 20 using train_datagen generator
train_generator = train_datagen.flow_from_directory(
        train_dir,  # This is the source directory for training images
        target_size=(150, 150),  # All images will be resized to 150x150
        batch_size=20,
        # Since we use binary_crossentropy loss, we need binary labels
        class_mode='binary')

# Flow validation images in batches of 20 using test_datagen generator
validation_generator = test_datagen.flow_from_directory(
        validation_dir,
        target_size=(150, 150),
        batch_size=20,
        class_mode='binary')
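
With `class_mode='binary'`, the labels come from the subdirectory names, which are mapped to integers in alphanumeric order. The following sketch mimics that mapping on a temporary directory; `infer_class_indices` is a hypothetical helper written for this illustration, not part of Keras.

```python
import os
import tempfile


def infer_class_indices(directory):
    """Map each class subdirectory to an integer label in alphanumeric order."""
    class_names = sorted(
        name for name in os.listdir(directory)
        if os.path.isdir(os.path.join(directory, name))
    )
    return {name: index for index, name in enumerate(class_names)}


# Build a throwaway directory with the same class folders as the dataset
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("dogs", "cats"):
        os.makedirs(os.path.join(demo_dir, name))
    demo_indices = infer_class_indices(demo_dir)

print(demo_indices)  # {'cats': 0, 'dogs': 1}
```

On the real generator, `train_generator.class_indices` shows the mapping that is actually in use.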

If we train a new network using this data augmentation configuration, our network will never see the same input twice. However, the inputs that it sees are still heavily intercorrelated, because they are all derived from the same 2,000 original images, so this might not be quite enough to completely get rid of overfitting.

## 4) Adding Dropout

Another popular strategy for fighting overfitting is to use **dropout**.

**TIP:** To learn more about dropout, see [Training Neural Networks](https://developers.google.com/machine-learning/crash-course/training-neural-networks/video-lecture) in [Machine Learning Crash Course](https://developers.google.com/machine-learning/crash-course/).
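
As a rough illustration of what a dropout layer does, here is a minimal pure-Python sketch of "inverted" dropout on a list of activations. The function name `dropout` and the example values are assumptions made for this sketch; the Keras layer works on tensors, but follows the same idea of zeroing units at random during training and scaling the survivors by `1 / (1 - rate)`.

```python
import random


def dropout(activations, rate, training, rng):
    """Zero each activation with probability `rate` and rescale the rest.

    At inference time (training=False) the input is returned unchanged.
    """
    if not training or rate == 0.0:
        return list(activations)
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else value * keep_scale
            for value in activations]


rng = random.Random(0)  # seeded so the sketch is repeatable
hidden = [0.5, 1.0, 1.5, 2.0]

train_out = dropout(hidden, rate=0.5, training=True, rng=rng)
infer_out = dropout(hidden, rate=0.5, training=False, rng=rng)
print(train_out)
print(infer_out)
```

Because a different random subset of units is dropped for every batch, the network cannot rely on any single unit, which tends to reduce overfitting.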

Let's reconfigure our convnet architecture from Exercise 1 to add some dropout, right before the final classification layer:

In [None]:
from tensorflow.keras import layers
from tensorflow.keras import Model
from tensorflow.keras.optimizers import RMSprop

# Our input feature map is 150x150x3: 150x150 for the image pixels, and 3 for
# the three color channels: R, G, and B
img_input = layers.Input(shape=(150, 150, 3))

# First convolution extracts 16 filters that are 3x3
# Convolution is followed by max-pooling layer with a 2x2 window
x = layers.Conv2D(16, 3, activation='relu')(img_input)
x = layers.MaxPooling2D(2)(x)

# Second convolution extracts 32 filters that are 3x3
# Convolution is followed by max-pooling layer with a 2x2 window
x = layers.Conv2D(32, 3, activation='relu')(x)
x = layers.MaxPooling2D(2)(x)

# Third convolution extracts 64 filters that are 3x3
# Convolution is followed by max-pooling layer with a 2x2 window
x = layers.Conv2D(64, 3, activation='relu')(x)
x = layers.MaxPooling2D(2)(x)

# Flatten feature map to a 1-dim tensor
x = layers.Flatten()(x)

# Create a fully connected layer with ReLU activation and 512 hidden units
x = layers.Dense(512, activation='relu')(x)

# Add a dropout rate of 0.5
x = layers.Dropout(0.5)(x)

# Create output layer with a single node and sigmoid activation
output = layers.Dense(1, activation='sigmoid')(x)

# Configure and compile the model
model = Model(img_input, output)
model.compile(loss='binary_crossentropy',
              optimizer=RMSprop(learning_rate=0.001),
              metrics=['acc'])
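
To see how the feature map shrinks through the network above, the arithmetic can be sketched in plain Python. This assumes the Keras defaults used in the cell (no padding and stride 1 for the convolutions, non-overlapping 2x2 pooling); the helper names are made up for the illustration.

```python
def conv_output_size(size, kernel=3):
    """Spatial size after an unpadded convolution with stride 1."""
    return size - kernel + 1


def pool_output_size(size, window=2):
    """Spatial size after non-overlapping max-pooling."""
    return size // window


size = 150
sizes = [size]
for _ in range(3):  # three conv + pool stages
    size = pool_output_size(conv_output_size(size))
    sizes.append(size)

flattened = size * size * 64          # 64 filters in the last conv layer
dense_params = flattened * 512 + 512  # weights plus biases of the Dense(512) layer

print(sizes)         # [150, 74, 36, 17]
print(flattened)     # 18496
print(dense_params)  # 9470464
```

Calling `model.summary()` on the compiled model is a quick way to check these numbers against the actual layers.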

## 5) Retrain the Model

With data augmentation and dropout in place, let's retrain our convnet model. This time, let's train on all 2,000 images available, for 30 epochs, and validate on all 1,000 test images. (This may take a few minutes to run.) See if you can write the code yourself. The [Keras API reference](https://keras.io/models/sequential/) might come in handy.
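
As a hint for the exercise, the number of batches needed to go through each dataset once follows from the batch size of 20 used in the generators. The sketch below only does that arithmetic; `steps_for` is a hypothetical helper, and the results are values you could pass as `steps_per_epoch` and `validation_steps` when fitting the model.

```python
import math


def steps_for(num_images, batch_size):
    """Number of batches needed to cover all images once."""
    return math.ceil(num_images / batch_size)


train_steps = steps_for(2000, 20)
validation_steps = steps_for(1000, 20)
print(train_steps, validation_steps)  # 100 50
```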


In [None]:
#>>> WRITE CODE TO TRAIN THE MODEL ON ALL 2000 IMAGES FOR 30 EPOCHS, AND VALIDATE  ON ALL 1,000 TEST IMAGES

Note that with data augmentation in place, the 2,000 training images are randomly transformed each time a new training epoch runs, which means that the model will never see the same image twice during training.

## 6) Evaluate the Results

Let's evaluate the results of model training with data augmentation and dropout:

In [None]:
# Retrieve a list of accuracy results on training and test data
# sets for each training epoch
acc = history.history['acc']
val_acc = history.history['val_acc']

# Retrieve a list of loss results on training and test data
# sets for each training epoch
loss = history.history['loss']
val_loss = history.history['val_loss']

# Get number of epochs
epochs = range(len(acc))

# Plot training and validation accuracy per epoch
plt.plot(epochs, acc)
plt.plot(epochs, val_acc)
plt.title('Training and validation accuracy')
plt.legend(['training', 'validation'], loc='lower right')
plt.figure()

# Plot training and validation loss per epoch
plt.plot(epochs, loss)
plt.plot(epochs, val_loss)
plt.title('Training and validation loss')
plt.legend(['training', 'validation'], loc='upper right')
plt.show()
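
Alongside the plots, a small numeric summary can help when judging overfitting. The sketch below averages the last few epochs and reports the gap between training and validation accuracy. The helper `summarize_history` is hypothetical, and the numbers are made up purely to show the call; with a real run you would pass `acc` and `val_acc` from the cell above.

```python
def summarize_history(acc, val_acc, last_n=5):
    """Mean training/validation accuracy over the last epochs, and their gap."""
    n = min(last_n, len(acc))
    train_mean = sum(acc[-n:]) / n
    val_mean = sum(val_acc[-n:]) / n
    return train_mean, val_mean, train_mean - val_mean


# Made-up values for illustration only, not results from this lab
demo_acc = [0.55, 0.62, 0.68, 0.71]
demo_val_acc = [0.54, 0.60, 0.66, 0.70]

train_mean, val_mean, gap = summarize_history(demo_acc, demo_val_acc, last_n=2)
print(f"train={train_mean:.3f} val={val_mean:.3f} gap={gap:.3f}")
```

A large positive gap that keeps growing over the epochs is a typical sign of overfitting; a small gap suggests the regularisation is working.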

Interpret the results in terms of overfitting and accuracy improvements and the learning process.

*write your answer here*

## 7) AI in the cloud
Port the training into the Google Cloud AI Platform. Using the same implementation patterns as before, this is a larger task. You can explore using different machine types and can try using the full dataset.

We start with the usual set-up and mount our Google Drive.

In [None]:
from google.colab import drive
drive.mount('/content/drive')
%cd "/content/drive/MyDrive"
!mkdir -p BD/lab10

Then we need to create the directory for our code package for the cloud. It needs to contain a file `__init__.py`, which can be empty. You can use the command line tool `touch` to create it. Look [here](https://www.man7.org/linux/man-pages/man1/touch.1.html) for how it works.

In [None]:
%cd "/content/drive/MyDrive/BD/lab10"
!mkdir trainer
>>> ### use !touch to create a file __init__.py in the trainer directory
!ls -lh trainer
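
If you want to check the package layout from Python as well, a sketch like the following can be used. It runs in a temporary directory so it does not touch your Drive; `is_python_package` is a hypothetical helper, and `Path.touch()` is the Python counterpart of the `touch` command.

```python
import tempfile
from pathlib import Path


def is_python_package(directory):
    """Return True if the directory contains an __init__.py file."""
    return (Path(directory) / "__init__.py").is_file()


with tempfile.TemporaryDirectory() as tmp:
    trainer = Path(tmp) / "trainer"
    trainer.mkdir()
    before = is_python_package(trainer)  # no __init__.py yet
    (trainer / "__init__.py").touch()    # create an empty file
    after = is_python_package(trainer)

print(before, after)  # False True
```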

Then we authenticate with the Google Cloud.

In [None]:
import sys
if 'google.colab' in sys.modules:
    from google.colab import auth
    auth.authenticate_user()

Then we need to specify our project and store the name in a Python variable (which can be used in the shell code -- that starts with '!' -- by prepending a '\$' sign like this: '$PROJECT').

In [None]:
PROJECT = 'my-project' ### USE YOUR OWN PROJECT ID HERE! ###
>>>!gcloud config set project ### SET THE PROJECT NAME ###
REGION = 'us-central1' # for use later with ai-platform
!gcloud config list  # show some information

Next, we make sure we have a storage bucket to save our results in. We can't use it for the training data in this lab, because the ImageDataGenerator doesn't read from cloud storage.

In [None]:
BUCKET = 'gs://{}-storage'.format(PROJECT)
!gsutil # make the bucket

Then we need to combine all relevant code into a cell and write it into a file that can be run with the AI Platform.

In [None]:
#%%writefile trainer/task.py
>>>### Copy the code from above here and adapt it to run here.

First, we can run the code directly from the cell (not writing it to a file) and it should behave as above.

Then you can write it to a file (uncommenting the first line of the cell) and then use `%run trainer/task.py` in a code cell. The behaviour should be the same as before. The most common problem is not including all code, which will now become noticeable as the file is executed outside the notebook.

In [None]:
# try it here

Next, we can test the training locally using the `local train` mode of the gcloud tool. You can see [here](https://cloud.google.com/sdk/gcloud/reference/ai-platform/local/train) how to set the missing parameters.

In [None]:
TRAINER_PACKAGE_PATH="trainer"
MAIN_TRAINER_MODULE="trainer.task"
BUCKET = "gs://bd-labs-test-storage"
PACKAGE_STAGING_PATH=BUCKET
import datetime
NOW=datetime.datetime.now().strftime("%y%m%d_%H%M")
JOB_NAME = "bd_labs_job_"+NOW
JOB_DIR=BUCKET+'/jobs/'+JOB_NAME

!gcloud ai-platform local train \
>>># ... set the --job-dir, --package-path, and --module-name here
    -- --config standard_gpu --batch-size 32
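
Everything after the bare `--` in the command above is handed to `trainer.task` as command-line arguments, so your `task.py` needs to accept them. Below is a minimal sketch of how that could look with `argparse`. The default values and the bucket path in the demo call are assumptions for illustration, and `--job-dir` is included on the assumption that gcloud forwards it to the trainer module.

```python
import argparse


def parse_args(argv=None):
    """Parse the arguments passed to the trainer module."""
    parser = argparse.ArgumentParser(description="Cats vs. dogs trainer (sketch)")
    parser.add_argument("--job-dir", default="",
                        help="Location for job output, e.g. a bucket path")
    parser.add_argument("--config", default="standard",
                        help="Free-form label for the machine configuration")
    parser.add_argument("--batch-size", type=int, default=20,
                        help="Batch size for the data generators")
    # parse_known_args ignores any extra arguments the platform may add
    args, _unknown = parser.parse_known_args(argv)
    return args


# Demo call with a hypothetical bucket path; in task.py you would call parse_args()
args = parse_args(["--job-dir", "gs://example-bucket/jobs/demo",
                   "--config", "standard_gpu",
                   "--batch-size", "32"])
print(args.job_dir, args.config, args.batch_size)
```

The parsed `args.batch_size` could then replace the hard-coded `batch_size=20` in the generators.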

If that worked out, we can next move to the AI Platform, but need some modifications first.

You need to download the data to the machine where the training runs, to make it accessible to the AI Platform, and adapt your code accordingly. You can use the code below to achieve that.

Test the code again locally after the modification.


In [None]:
### We need to copy the data over to the AI Platform Machine first, because the
# ImageDataGenerator.flow_from_directory can't read from a Bucket.
# Using 'subprocess', as in the Coursework notebook, for getting and unzipping the data.

# This needs to go into your code, you should be able to work out where.
import subprocess
proc = subprocess.run(["wget","--no-check-certificate", "https://storage.googleapis.com/mledu-datasets/cats_and_dogs_filtered.zip",
                       "-O", "/tmp/cats_and_dogs_filtered.zip"], stderr=subprocess.PIPE)
print("wget returned: " + str(proc.returncode))
print(str(proc.stderr))
local_zip = '/tmp/cats_and_dogs_filtered.zip'
# after this the code from above will work on external machines.
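
The cell above prints the return code of `wget` but carries on even if the download failed. On a remote machine it can be easier to debug if the job stops right away. The following is a minimal sketch of such a fail-fast wrapper; `run_checked` is a hypothetical helper, and the demo uses the Python interpreter instead of `wget` so that it runs without network access.

```python
import subprocess
import sys


def run_checked(command):
    """Run a command, return its stderr text, and raise if it exits non-zero."""
    proc = subprocess.run(command, stderr=subprocess.PIPE, text=True)
    if proc.returncode != 0:
        raise RuntimeError(
            f"{command[0]} failed with code {proc.returncode}: {proc.stderr}")
    return proc.stderr


# A command that succeeds
run_checked([sys.executable, "-c", "print('fine')"])

# A command that fails with exit code 3
try:
    run_checked([sys.executable, "-c", "import sys; sys.exit(3)"])
    failed = False
except RuntimeError as error:
    failed = True
    print("caught:", error)
```

In `task.py`, the same helper could wrap the `wget` call so that a failed download shows up clearly in the job logs.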

Finally, train in the cloud using `gcloud ai-platform jobs submit training`. You can look up [here](https://cloud.google.com/sdk/gcloud/reference/ai-platform/jobs/submit/training#--parameter-server-count) how to set the parameters. Next, have a look at the [Cloud console](https://console.cloud.google.com/ai-platform/jobs) to monitor job execution. (Unfortunately, it takes several minutes for the jobs to be started, so try this only after all steps above were successful.)

In [None]:
TRAINER_PACKAGE_PATH="trainer"
MAIN_TRAINER_MODULE="trainer.task"
import datetime
NOW=datetime.datetime.now().strftime("%y%m%d_%H%M")
JOB_NAME = "bd_labs_job_"+NOW
BUCKET = "gs://bd-labs-test-storage"
PACKAGE_STAGING_PATH=BUCKET
JOB_DIR=BUCKET+'/jobs/'+JOB_NAME

!gcloud ai-platform jobs submit training $JOB_NAME \
# set --staging-bucket --job-dir --region --package-path --module-name
    --runtime-version 2.3 \
    --python-version 3.7 \
    --scale-tier custom \
    --master-machine-type standard_gpu
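
The job name and job directory used above follow a simple pattern: a fixed prefix plus a timestamp. The sketch below rebuilds that pattern with a fixed time so the result is repeatable, and adds a sanity check on the name. The check assumes that job names must start with a letter and contain only letters, digits and underscores; the helper names and the bucket are hypothetical.

```python
import datetime
import re


def make_job_name(prefix, now):
    """Build a job name from a prefix and a timestamp, as in the cell above."""
    return prefix + now.strftime("%y%m%d_%H%M")


def looks_like_valid_job_name(name):
    """Assumed rule: starts with a letter, then letters, digits or underscores."""
    return re.fullmatch(r"[A-Za-z][A-Za-z0-9_]*", name) is not None


fixed_time = datetime.datetime(2021, 3, 15, 14, 30)  # fixed for a repeatable demo
job_name = make_job_name("bd_labs_job_", fixed_time)
job_dir = "gs://example-bucket" + "/jobs/" + job_name  # hypothetical bucket

print(job_name)  # bd_labs_job_210315_1430
print(job_dir)   # gs://example-bucket/jobs/bd_labs_job_210315_1430
print(looks_like_valid_job_name(job_name))  # True
```

Because the timestamp only has minute resolution, submitting two jobs within the same minute would produce the same name.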