I've been curious about TPUs since they were announced on Kaggle, and this seems like a good opportunity to learn about them. It's also a good chance to learn a little about TFRecords and TensorFlow's data API.

As always, let's import the required libraries.

In [None]:
import numpy as np
import pandas as pd 
import tensorflow as tf
import matplotlib.pyplot as plt

from kaggle_datasets import KaggleDatasets

In [None]:
!pip install -q efficientnet
import efficientnet.tfkeras as efn

## Set up the TPU

As explained in notebooks such as this [one](https://www.kaggle.com/ryanholbrook/create-your-first-submission), a TPU is a dedicated accelerator for training neural networks, and the one available here exposes eight cores to a single model. By using a TPU we replicate the model eight times, split each batch of images into eight parts and train the replicas on those parts at the same time. In theory this gives roughly eight times the throughput of a single core, although in practice the input pipeline and the communication between cores eat into that.

To use a TPU we need a strategy. My understanding of a strategy is that it is like a set of instructions telling TensorFlow how to replicate the model and assign these replicas to the eight TPU cores. As I understand it, the strategy also combines the gradients from the replicas after every step so that all eight copies stay identical, which means no separate step is needed to reconstruct the model at the end. Anyway, here is some code to form that strategy.

In [None]:
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()  
    print('Running on TPU ', tpu.master())
except ValueError:
    tpu = None

if tpu:
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
else:
    strategy = tf.distribute.get_strategy() 

print("REPLICAS: ", strategy.num_replicas_in_sync)
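
To make the idea of synchronised replicas more concrete, here is a minimal NumPy sketch. It is not how TensorFlow implements a strategy; the toy linear model, the `mse_gradient` helper and the data are all made up for illustration. It only shows that averaging the gradients from eight equal shards of a batch gives the same gradient as the whole batch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear model y = x @ w with a mean-squared-error loss (hypothetical data).
toy_x = rng.normal(size=(128, 4))   # one global batch of 128 examples
toy_y = rng.normal(size=(128,))
toy_w = np.zeros(4)                 # every replica starts from the same weights


def mse_gradient(x_part, y_part, w):
    """Gradient of mean((x @ w - y) ** 2) with respect to w."""
    error = x_part @ w - y_part
    return 2.0 * x_part.T @ error / len(y_part)


# Split the global batch into 8 equal shards, one per replica.
x_shards = np.split(toy_x, 8)
y_shards = np.split(toy_y, 8)

# Each replica computes a gradient on its own shard only.
replica_grads = [mse_gradient(xs, ys, toy_w) for xs, ys in zip(x_shards, y_shards)]

# Averaging the replica gradients reproduces the full-batch gradient,
# so applying the average everywhere keeps all replicas identical.
averaged_grad = np.mean(replica_grads, axis=0)
full_batch_grad = mse_gradient(toy_x, toy_y, toy_w)

print(np.allclose(averaged_grad, full_batch_grad))
```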

The dataset also needs to be stored close to the TPU during training, so a little extra logic is needed to get its path. This line of code returns the location where the data is kept, which I believe is a Google Cloud Storage (GCS) bucket.

In [None]:
GCS_DS_PATH = KaggleDatasets().get_gcs_path('tpu-getting-started')
print(GCS_DS_PATH)

## Define Hyper-parameters

To make it easier to manage the model's hyper-parameters I'll define them here as global variables.

In [None]:
IMAGE_SIZE = 512
EPOCHS = 35
BATCH_SIZE = 16 * strategy.num_replicas_in_sync

NUM_TRAINING_IMAGES = 12753
NUM_TEST_IMAGES = 7382
STEPS_PER_EPOCH = NUM_TRAINING_IMAGES // BATCH_SIZE
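
As a quick check of the arithmetic above, this sketch works through the numbers by hand. The image count and per-replica batch size are copied from the cell above; the replica count of 8 is an assumption that holds when the notebook is running on the eight-core TPU.

```python
# Values copied from the hyper-parameter cell; 8 replicas is an assumption
# that matches a TPU with eight cores.
num_replicas = 8
per_replica_batch = 16
num_training_images = 12753

global_batch = per_replica_batch * num_replicas        # images per training step
steps_per_epoch = num_training_images // global_batch  # whole batches per epoch
images_seen = steps_per_epoch * global_batch           # images used in one epoch
images_left_over = num_training_images - images_seen   # remainder of the division

print(global_batch, steps_per_epoch, images_seen, images_left_over)
```

Because the training pipeline repeats the dataset, the left-over images are not lost; they simply roll into the next epoch.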

## Data pipeline

As mentioned before, the data for this challenge is stored in TFRecords rather than the usual CSV or JSON files that many Kaggle datasets use. Luckily the TensorFlow data API can easily read TFRecords and set up a pipeline to feed the images into the model (for training, validation or testing). I'll start this pipeline by reading in the TFRecords.

In [None]:
train_data = tf.data.TFRecordDataset(
    tf.io.gfile.glob(GCS_DS_PATH + '/tfrecords-jpeg-' + str(IMAGE_SIZE) + 'x' + str(IMAGE_SIZE) + '/train/*.tfrec'),
    num_parallel_reads = tf.data.experimental.AUTOTUNE
)

As this is the training data pipeline, it doesn't matter in what order the images are fed to the model. This is handy because the files can then be read in parallel and each record passed on as soon as it arrives, rather than waiting for the records to come back in order. The config below tells the pipeline to ignore ordering, which speeds up training.

In [None]:
# disable order and increase speed
ignore_order = tf.data.Options()
ignore_order.experimental_deterministic = False 
train_data = train_data.with_options(ignore_order)

Next the images and labels need extracting from each record. A couple of helper functions are needed here to parse a record, extract the image and label from it and decode the JPEG image into a 3D float32 tensor with pixel values scaled to the range 0 to 1.

In [None]:
def read_labeled_tfrecord(example):
    tfrec_format = {
        "image": tf.io.FixedLenFeature([], tf.string),
        "class": tf.io.FixedLenFeature([], tf.int64), 
    }
    
    example = tf.io.parse_single_example(example, tfrec_format)
    image = decode_image(example['image'])
    label = tf.cast(example['class'], tf.int32)
    
    # returns a single (image, label) pair; map() applies this to every record
    return image, label 


def decode_image(image_data):
    image = tf.image.decode_jpeg(image_data, channels=3)
    image = tf.cast(image, tf.float32) / 255.0  
    image = tf.reshape(image, [IMAGE_SIZE, IMAGE_SIZE, 3])
    
    return image
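
To see what the parsing step does without touching the real dataset, here is a small round-trip sketch. The 8x8 grey image, the label 7 and the names `TOY_SIZE`, `toy_record`, `toy_image` and `toy_label` are all made up; only the two field names and their types are taken from the helper above.

```python
import numpy as np
import tensorflow as tf

TOY_SIZE = 8  # tiny stand-in for IMAGE_SIZE

# Build one serialized record with the same two fields the parser expects.
toy_pixels = np.full((TOY_SIZE, TOY_SIZE, 3), 200, dtype=np.uint8)
toy_jpeg = tf.io.encode_jpeg(toy_pixels).numpy()
toy_record = tf.train.Example(features=tf.train.Features(feature={
    "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[toy_jpeg])),
    "class": tf.train.Feature(int64_list=tf.train.Int64List(value=[7])),
})).SerializeToString()

# Parse it back with the same feature description as read_labeled_tfrecord.
toy_parsed = tf.io.parse_single_example(toy_record, {
    "image": tf.io.FixedLenFeature([], tf.string),
    "class": tf.io.FixedLenFeature([], tf.int64),
})

# Decode the JPEG bytes and scale the pixels to the range 0 to 1.
toy_image = tf.cast(tf.image.decode_jpeg(toy_parsed["image"], channels=3), tf.float32) / 255.0
toy_label = tf.cast(toy_parsed["class"], tf.int32)

print(toy_image.shape, toy_image.dtype, int(toy_label))
```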

Then add those helper functions to the pipeline.

In [None]:
# parse each record, decode its image and return an (image, label) pair
train_data = train_data.map(read_labeled_tfrecord, num_parallel_calls=tf.data.experimental.AUTOTUNE)

If an image classification model is to be trained well it needs to see a wide variety of the subject matter. Usually the training dataset alone isn't enough. To reduce overfitting, augmentation can be used to edit the images on the fly as they are fed into the model. This enables the model to see flowers from different angles and in different lighting.

The TensorFlow data API provides some handy functions to do this. They are limited in functionality compared to, say, the Keras image data generator or the albumentations library, but for now they will do.

In [None]:
def augment(image, label):
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_flip_up_down(image)
    
    image = tf.image.random_brightness(image, max_delta=0.5)
    image = tf.image.random_saturation(image, lower=0.2, upper=0.5)
    
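    # note: the crop size equals the image size, so these two calls
    # currently leave the image unchanged
    image = tf.image.random_crop(image, size=[IMAGE_SIZE, IMAGE_SIZE, 3])
    image = tf.image.resize_with_crop_or_pad(image, IMAGE_SIZE, IMAGE_SIZE)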
    
    return image, label

In [None]:
train_data = train_data.map(augment, num_parallel_calls=tf.data.experimental.AUTOTUNE)
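
The sketch below is a NumPy stand-in for two of the augmentations, not the TensorFlow functions themselves. It uses a made-up 4x4 image to show a left-right flip and a brightness shift, and why clipping can be worth considering afterwards: as far as I can tell, adding a brightness delta to a float image can push values outside the range 0 to 1.

```python
import numpy as np

rng = np.random.default_rng(42)

# Fake image with float values in [0, 1), shaped (height, width, channels).
toy_image = rng.random((4, 4, 3)).astype(np.float32)

# Left-right flip: reverse the width axis.
flipped = toy_image[:, ::-1, :]

# Brightness shift: add one random delta to every pixel.
delta = rng.uniform(-0.5, 0.5)
brightened = flipped + delta

# Clip back into the valid range (this step is my addition, not part of augment).
clipped = np.clip(brightened, 0.0, 1.0)

print(brightened.min(), brightened.max())
print(clipped.min(), clipped.max())
```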

Finally add some config to the pipeline to help with training. Repeat makes the pipeline go back to the start of the dataset once it has been through all of the images. Shuffle randomises the order of the images so that the model learns the patterns in them rather than the sequence they arrive in, batch determines how many images are fed to the model in each step, and prefetch prepares the next batch while the current one is being trained on.

In [None]:
train_data = train_data.repeat()
train_data = train_data.shuffle(2048)
train_data = train_data.batch(BATCH_SIZE)
train_data = train_data.prefetch(tf.data.experimental.AUTOTUNE)
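
The number passed to shuffle is a buffer size rather than a guarantee of a full shuffle. The pure-Python sketch below is a simplified model of how such a buffer behaves, not TensorFlow's implementation; `buffered_shuffle` is a hypothetical helper written for this illustration.

```python
import random


def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in a random order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(item)
            continue
        # Emit a random element and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: emit whatever is left in a random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(20), buffer_size=5))
print(shuffled)
```

With a buffer of 5, the element at output position k can only come from the first k + 5 inputs, so a small buffer gives only a local shuffle. That is the reason for choosing a fairly large buffer such as 2048.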

To see what the pipeline is inputting into the model let's have a look at the first five images in the first batch.

In [None]:
fig, axes = plt.subplots(1, 5, figsize=(15, 5))

for images, labels in train_data.take(1):
    for i in range(5):
        axes[i].set_title('Label: {0}'.format(labels[i]))
        axes[i].imshow(images[i])

And quickly put together a validation pipeline. Luckily Kaggle has pre-split the dataset, so it is simply a case of adjusting the path to the images and leaving out the augmentation, repeat and shuffle steps.

In [None]:
val_data = tf.data.TFRecordDataset(
    tf.io.gfile.glob(GCS_DS_PATH + '/tfrecords-jpeg-' + str(IMAGE_SIZE) + 'x' + str(IMAGE_SIZE) + '/val/*.tfrec'),
    num_parallel_reads = tf.data.experimental.AUTOTUNE
)

val_data = val_data.with_options(ignore_order)

val_data = val_data.map(read_labeled_tfrecord, num_parallel_calls = tf.data.experimental.AUTOTUNE)
val_data = val_data.batch(BATCH_SIZE)
val_data = val_data.cache()
val_data = val_data.prefetch(tf.data.experimental.AUTOTUNE)

## Define model

To get a good start with the training I have loaded EfficientNetB7 with weights pre-trained on ImageNet. It looks like TensorFlow will be making this model available in its Keras API in the next minor version. Until then I have loaded it through a pip library. The only additions are a global average pooling layer and a final dense layer to prepare EfficientNet's output for flower classification.

I've also brought back the TPU strategy that was defined at the beginning of this notebook. By building the model inside the with statement I am asking TensorFlow to create the model under that strategy so that training is distributed across the TPU cores.

In [None]:
with strategy.scope():    
    enet = efn.EfficientNetB7(
        input_shape=(IMAGE_SIZE, IMAGE_SIZE, 3),
        weights='imagenet',
        include_top=False
    )
    
    enet.trainable = True
    
    model = tf.keras.Sequential([
        enet,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(104, activation='softmax', dtype='float32')
    ])

In [None]:
model.summary()

Nothing fancy here. Compile the model with the Adam optimiser and use the usual classification loss function, in its sparse form (sparse categorical crossentropy) because the labels are integer class ids rather than one-hot vectors. Collect accuracy metrics as well to help evaluate the quality of the model.

In [None]:
model.compile(
    optimizer='adam',
    loss = 'sparse_categorical_crossentropy',
    metrics=['sparse_categorical_accuracy']
)
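
For anyone wondering how the sparse loss relates to ordinary categorical crossentropy, here is a small worked example for a single image. The three-class probabilities and the label are made up; the point is only that the two forms give the same number.

```python
import math

toy_probabilities = [0.1, 0.7, 0.2]   # made-up softmax output for one image
toy_label = 1                         # integer class id, as stored in the records

# Sparse form: look up the probability of the true class directly.
sparse_loss = -math.log(toy_probabilities[toy_label])

# Categorical form: the same loss written against a one-hot target.
one_hot = [1.0 if i == toy_label else 0.0 for i in range(len(toy_probabilities))]
categorical_loss = -sum(t * math.log(p) for t, p in zip(one_hot, toy_probabilities))

print(sparse_loss, categorical_loss)
```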

I've added callbacks to help when the loss is struggling to trend downwards. The first reduces the learning rate when the loss begins to plateau, while the second stops training early if the model is making no more progress. The second also restores the weights from the epoch with the best loss so that the best model is used for inference. Note that both callbacks monitor the training loss rather than the validation loss.

In [None]:
callbacks = [
    tf.keras.callbacks.ReduceLROnPlateau(monitor='loss', patience=2, verbose=1),
    tf.keras.callbacks.EarlyStopping(monitor='loss', patience=5, verbose=1, restore_best_weights=True),
]
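
To illustrate what patience means, here is a simplified pure-Python sketch of the early stopping logic. It is not the Keras implementation (which has further options that are ignored here), and `simulate_early_stopping` and the loss values are made up for the example.

```python
def simulate_early_stopping(losses, patience):
    """Return (epoch training stops at, epoch whose weights would be restored)."""
    best_loss = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(losses):
        if loss < best_loss:
            # Improvement: remember this epoch and reset the counter.
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            # No improvement: stop once this has happened `patience` times in a row.
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(losses) - 1, best_epoch


made_up_losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.6]
stopped_at, restored_from = simulate_early_stopping(made_up_losses, patience=3)
print(stopped_at, restored_from)
```

In this made-up run training stops before the final, lower loss is ever reached, which is the trade-off a short patience makes.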

Finally, train the model.

In [None]:
history = model.fit(
    train_data, 
    validation_data = val_data,
    steps_per_epoch = STEPS_PER_EPOCH, 
    epochs = EPOCHS,
    callbacks = callbacks,
)

## Evaluation

Let's see how the model did.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

axes[0].set_title('Loss')
axes[0].plot(history.history['loss'], label='Train')
axes[0].plot(history.history['val_loss'], label='Validation')
axes[0].legend()

axes[1].set_title('Accuracy')
axes[1].plot(history.history['sparse_categorical_accuracy'], label='Train')
axes[1].plot(history.history['val_sparse_categorical_accuracy'], label='Validation')
axes[1].legend()

plt.show()

## Submission

With the model trained, the final thing to do is to make predictions against the test set and submit the results. First a test pipeline needs to be defined. This is similar to the validation pipeline except that the test records have no label; each one carries an image id instead. This adjusted helper function handles that for us.

In [None]:
def read_unlabeled_tfrecord(example):
    tfrec_format = {
        "image": tf.io.FixedLenFeature([], tf.string),
        "id": tf.io.FixedLenFeature([], tf.string),  
    }
    
    example = tf.io.parse_single_example(example, tfrec_format)
    image = decode_image(example['image'])
    idnum = example['id']
    
    return image, idnum

In [None]:
test_data = tf.data.TFRecordDataset(
    tf.io.gfile.glob(GCS_DS_PATH + '/tfrecords-jpeg-' + str(IMAGE_SIZE) + 'x' + str(IMAGE_SIZE) + '/test/*.tfrec'),
    num_parallel_reads = tf.data.experimental.AUTOTUNE
)

# default options keep the order fixed, so ids and predictions line up
test_data = test_data.with_options(tf.data.Options())
test_data = test_data.map(read_unlabeled_tfrecord, num_parallel_calls = tf.data.experimental.AUTOTUNE)
test_data = test_data.batch(BATCH_SIZE)
test_data = test_data.prefetch(tf.data.experimental.AUTOTUNE)

Now use the pipeline to make predictions against each image. The model outputs a probability for every possible class per image, so argmax is used to pick the most probable class.

In [None]:
test_images = test_data.map(lambda image, idnum: image)

probabilities = model.predict(test_images)
predictions = np.argmax(probabilities, axis=-1)
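
As a tiny illustration of the argmax step, here it is on a made-up array of probabilities for two images and three classes (the real output has 104 columns).

```python
import numpy as np

# One row per image, one column per class (hypothetical values).
toy_probabilities = np.array([
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
])

# axis=-1 takes the argmax along the class axis, giving one class id per image.
toy_predictions = np.argmax(toy_probabilities, axis=-1)

print(toy_predictions)
```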

The starter notebooks for this competition use a different way of getting hold of the id for each image. When I tried it I got some odd errors that disconnected me from the bucket the data was in, so I have used the approach below instead.

In [None]:
ids = []

# test_data is already batched, so loop over every batch of ids
for _, image_ids in test_data:
    ids.append(image_ids.numpy())

ids = np.concatenate(ids, axis=None).astype(str)

Write the predictions and image ids to a file ready for submission.

In [None]:
submission = pd.DataFrame(data={'id': ids, 'label': predictions})
submission.to_csv('submission.csv', index=False)

And save the model in case it's needed in another notebook.

In [None]:
model.save('model.h5')