# Intro
Welcome to the [Cassava Leaf Disease Classification](https://www.kaggle.com/c/cassava-leaf-disease-classification) competition.
![](https://storage.googleapis.com/kaggle-competitions/kaggle/13836/logos/header.png)

There are 5 classes (click for further information):
* 0: [Cassava Bacterial Blight (CBB)](https://en.wikipedia.org/wiki/Bacterial_blight_of_cassava)
* 1: [Cassava Brown Streak Disease (CBSD)](https://en.wikipedia.org/wiki/Cassava_brown_streak_virus_disease)
* 2: [Cassava Green Mottle (CGM)](https://en.wikipedia.org/wiki/Cassava_green_mottle_virus)
* 3: [Cassava Mosaic Disease (CMD)](https://en.wikipedia.org/wiki/Cassava_mosaic_virus)
* 4: Healthy

The goal of this notebook is to give a short tutorial on using TFRecords. We do not focus on optimizing the prediction model.

For a more general tutorial we recommend [this notebook](https://www.kaggle.com/drcapa/tutorial-tfrecords-create-and-read).

<span style="color: royalblue;">Please upvote this notebook if it helps you. Thank you. </span>

# Motivation
TFRecord files (.tfrec) use a binary format for storing sequences of records. The TFRecord format was developed for TensorFlow. One motivation for the format is to feed Tensor Processing Units (TPUs) efficiently and thereby accelerate machine learning applications.
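
To make the idea of a "binary format for storing sequences" more concrete, here is a minimal pure-Python sketch of the record framing described in the TensorFlow documentation: each record is stored as a length, a masked CRC32C of the length, the payload, and a masked CRC32C of the payload. The helper names (`crc32c`, `masked_crc`, `write_record`, `read_records`) are our own and only serve as an illustration. In practice you would use `tf.io.TFRecordWriter` and `tf.data.TFRecordDataset` instead.

```python
import io
import struct

# CRC32C (Castagnoli) lookup table, reflected polynomial 0x82F63B78.
_CRC_TABLE = []
for _i in range(256):
    _crc = _i
    for _ in range(8):
        _crc = (_crc >> 1) ^ 0x82F63B78 if _crc & 1 else _crc >> 1
    _CRC_TABLE.append(_crc)


def crc32c(data):
    """Plain (unmasked) CRC32C checksum of a bytes object."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _CRC_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def masked_crc(data):
    """Rotate the CRC right by 15 bits and add a constant (32-bit arithmetic)."""
    crc = crc32c(data)
    rotated = ((crc >> 15) | (crc << 17)) & 0xFFFFFFFF
    return (rotated + 0xA282EAD8) & 0xFFFFFFFF


def write_record(stream, payload):
    """Append one framed record: length, CRC of length, payload, CRC of payload."""
    length = struct.pack("<Q", len(payload))  # little-endian uint64
    stream.write(length)
    stream.write(struct.pack("<I", masked_crc(length)))
    stream.write(payload)
    stream.write(struct.pack("<I", masked_crc(payload)))


def read_records(stream):
    """Yield the payloads one by one, verifying both checksums."""
    while True:
        header = stream.read(8)
        if len(header) < 8:
            return  # end of stream
        (length,) = struct.unpack("<Q", header)
        (length_crc,) = struct.unpack("<I", stream.read(4))
        payload = stream.read(length)
        (payload_crc,) = struct.unpack("<I", stream.read(4))
        if length_crc != masked_crc(header) or payload_crc != masked_crc(payload):
            raise ValueError("corrupted record")
        yield payload


# Round trip through an in-memory buffer (placeholder payloads, not real images).
buffer = io.BytesIO()
for item in [b"first record", b"second", b""]:
    write_record(buffer, item)
buffer.seek(0)
records = list(read_records(buffer))
print(records)
```

In the competition files each payload is a serialized `tf.train.Example`, which is why the functions further down parse every record with a feature description.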

To use a TPU you have to enable it in your notebook:
1. Click on the notebook settings (upper right corner of the notebook).
2. Click on "Accelerator".
3. Choose TPU v3-8.
![](https://i.ibb.co/mHFPHpN/setting.png)

# Libraries

In [None]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import re
import json

from sklearn.model_selection import train_test_split

import tensorflow as tf
from functools import partial
from kaggle_datasets import KaggleDatasets
print("Tensorflow version " + tf.__version__)

# Set Up

In [None]:
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    print("Device:", tpu.master())
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
except Exception:  # no TPU available: fall back to the default strategy
    strategy = tf.distribute.get_strategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

# Path

In [None]:
path = '/kaggle/input/cassava-leaf-disease-classification/'
os.listdir(path+'test_tfrecords/')

To create the GCS path we need internet access. This means we cannot use this notebook for submission, because internet access is not allowed in submission notebooks.

In [None]:
path_gcs = KaggleDatasets().get_gcs_path('cassava-leaf-disease-classification')
print(path_gcs) 

# Parameter

In [None]:
AUTOTUNE = tf.data.experimental.AUTOTUNE
BATCH_SIZE = 16*strategy.num_replicas_in_sync
IMAGE_SIZE = [512, 512]

# Load Data

In [None]:
samp_subm = pd.read_csv(path+'sample_submission.csv')

In [None]:
with open(path+'label_num_to_disease_map.json') as json_file:
    label_data = json.load(json_file)

In [None]:
label_data

In [None]:
train_filenames, val_filenames = train_test_split(tf.io.gfile.glob(path_gcs + '/train_tfrecords/*.tfrec'),
                                                  test_size=0.20, random_state=2020)
test_filenames = tf.io.gfile.glob(path_gcs+'/test_tfrecords/*.tfrec')

In [None]:
print('Number of train tfrec files:', len(train_filenames))
print('Number of val tfrec files:', len(val_filenames))
print('Number of test tfrec files:', len(test_filenames))

# Key Names
First we have to extract the feature keys. To see them, uncomment and run the code at the end of this section.

There are 3 feature keys for this dataset:
1. image
![](https://i.ibb.co/8rHQQLs/features-1.png)
2. image_name
![](https://i.ibb.co/9HLzNf3/features-2.png)
3. target
![](https://i.ibb.co/r0ML4yZ/features-3.png)

In [None]:
raw_dataset = tf.data.TFRecordDataset(train_filenames)
# for raw_record in raw_dataset.take(1):
#   example = tf.train.Example()
#   example.ParseFromString(raw_record.numpy())
#   print(example.features)
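
As a small illustration of what this inspection yields, the following sketch builds one synthetic `tf.train.Example` with the same three keys and lists each key together with the kind of value it stores. The values (`b"fake-jpeg-bytes"`, `b"demo.jpg"`, `3`) are placeholders we made up, not competition data.

```python
import tensorflow as tf

# Build one synthetic record with the same three keys as the competition files.
# All values are placeholders, not real competition data.
example = tf.train.Example(features=tf.train.Features(feature={
    "image": tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[b"fake-jpeg-bytes"])),
    "image_name": tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[b"demo.jpg"])),
    "target": tf.train.Feature(
        int64_list=tf.train.Int64List(value=[3])),
}))
serialized = example.SerializeToString()

# Parse the raw bytes back and record which value type each key holds.
parsed = tf.train.Example.FromString(serialized)
feature_kinds = {
    key: feature.WhichOneof("kind")
    for key, feature in parsed.features.feature.items()
}
print(feature_kinds)
```

The value kinds are what determine the feature description used for parsing: `bytes_list` entries are read as `tf.string` and `int64_list` entries as `tf.int64`.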

# Functions
To handle tfrecord files we follow the instructions of this [tutorial](https://keras.io/examples/keras_recipes/tfrecord/).

In [None]:
def decode_image(image):
    image = tf.image.decode_jpeg(image, channels=3)
    image = tf.image.resize(image, [*IMAGE_SIZE])
    image = tf.cast(image, tf.float32)/255.
    image = tf.reshape(image, [*IMAGE_SIZE, 3])
    return image


def read_tfrecord(example, labeled):
    tfrecord_format = {
        "image": tf.io.FixedLenFeature([], tf.string),
        "target": tf.io.FixedLenFeature([], tf.int64)
    } if labeled else {
        "image": tf.io.FixedLenFeature([], tf.string),
        "image_name": tf.io.FixedLenFeature([], tf.string)
    }
    example = tf.io.parse_single_example(example, tfrecord_format)
    image = decode_image(example['image'])
    if labeled:
        label = tf.cast(example['target'], tf.int32)
        return image, label #tf.one_hot(label, 5)
    idnum = example['image_name']
    return image, idnum


def load_dataset(filenames, labeled=True, ordered=False):
    ignore_order = tf.data.Options()
    if not ordered:
        ignore_order.experimental_deterministic = False  # disable order, increase speed
    dataset = tf.data.TFRecordDataset(
        filenames
    )  # automatically interleaves reads from multiple files
    dataset = dataset.with_options(
        ignore_order
    )  # uses data as soon as it streams in, rather than in its original order
    dataset = dataset.map(
        partial(read_tfrecord, labeled=labeled), num_parallel_calls=AUTOTUNE
    )
    # returns a dataset of (image, label) pairs if labeled=True or (image, image_name) pairs if labeled=False
    return dataset


def get_dataset(filenames, labeled=True, ordered=False):
    dataset = load_dataset(filenames, labeled=labeled, ordered=ordered)
    if not ordered:
        dataset = dataset.shuffle(2020)  # shuffle only when the order does not matter
    dataset = dataset.batch(BATCH_SIZE)
    dataset = dataset.prefetch(buffer_size=AUTOTUNE)  # prefetch whole batches
    return dataset


def number_of_files(filenames):
    """ Sum the image counts encoded in the TFRecord filenames """
    
    num = [int(re.compile(r"-([0-9]*)\.").search(filename).group(1)) for filename in filenames]
    return np.sum(num)


def show_batch(image_batch, label_batch):
    """ Plot up to 25 images of a batch """
    
    plt.figure(figsize=(20, 20))
    for n in range(min(25, len(image_batch))):  # a batch may hold fewer than 25 images
        ax = plt.subplot(5, 5, n + 1)
        plt.imshow(image_batch[n])
        plt.title(label_data[str(label_batch[n].numpy())])
        plt.axis("off")

In [None]:
print('Number of train images:', number_of_files(train_filenames))
print('Number of val images:', number_of_files(val_filenames))
print('Number of test images:', number_of_files(test_filenames))
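
The counts above are not obtained by reading the records. They come from the filenames, where the number between the last hyphen and the file extension is taken as the number of images in the file. A minimal sketch of that parsing step, using hypothetical filenames (`demo_filenames` and `count_from_name` are our own names):

```python
import re

# Same pattern as in number_of_files: digits between a hyphen and a dot.
COUNT_PATTERN = re.compile(r"-([0-9]*)\.")


def count_from_name(filename):
    """Return the image count encoded in a single TFRecord filename."""
    return int(COUNT_PATTERN.search(filename).group(1))


# Hypothetical filenames, only for illustration.
demo_filenames = [
    "gs://demo/train00-1338.tfrec",
    "gs://demo/train01-1338.tfrec",
    "gs://demo/train02-1327.tfrec",
]
counts = [count_from_name(name) for name in demo_filenames]
total = sum(counts)
print(counts, total)
```

This only works as long as the files follow that naming convention. For files without a count in the name, the records would have to be iterated and counted instead.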

In [None]:
train_dataset = get_dataset(train_filenames)
val_dataset = get_dataset(val_filenames)
test_dataset = get_dataset(test_filenames, labeled=False, ordered=True)

In [None]:
print(train_dataset)
print(val_dataset)
print(test_dataset)

# Show Examples

In [None]:
image_batch, label_batch = next(iter(train_dataset))
show_batch(image_batch, label_batch)

In [None]:
for image, idnum in test_dataset.take(3):
    print(image.numpy().shape, idnum.numpy().shape)
    print(idnum.numpy().astype('U'))

# Model

In [None]:
initial_learning_rate = 1e-5
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate,
    decay_steps=1000,
    decay_rate=0.9
)
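
To get a feeling for this schedule, the decay can be written out by hand. With the default `staircase=False`, `ExponentialDecay` computes `initial_learning_rate * decay_rate ** (step / decay_steps)`. The function below is our own plain-Python restatement of that formula, not the Keras implementation:

```python
def exponential_decay(step, initial_learning_rate=1e-5,
                      decay_steps=1000, decay_rate=0.9):
    """Learning rate after `step` optimizer steps (continuous decay)."""
    return initial_learning_rate * decay_rate ** (step / decay_steps)


# Print the learning rate at a few optimizer steps.
for step in (0, 500, 1000, 5000):
    print(step, exponential_decay(step))
```

Note that `step` counts batches, not epochs, so how far the learning rate decays during training depends on the number of batches per epoch.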

In [None]:
weights='../input/models/resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5'

We use the sparse_categorical_crossentropy loss and the sparse_categorical_accuracy metric, so we do not have to one-hot encode the 5 target labels.
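
The following NumPy sketch illustrates why the integer labels are sufficient: picking the predicted probability of the true class directly gives the same cross-entropy as multiplying with a one-hot vector. The probabilities and labels are made-up example values.

```python
import numpy as np

# Made-up predicted probabilities for two images and five classes.
probs = np.array([[0.10, 0.10, 0.10, 0.60, 0.10],
                  [0.70, 0.10, 0.10, 0.05, 0.05]])
labels = np.array([3, 0])  # integer targets, as stored in the TFRecords

# Sparse variant: index the probability of the true class directly.
sparse_loss = -np.log(probs[np.arange(len(labels)), labels])

# Dense variant: one-hot encode first, then sum over the classes.
one_hot = np.eye(5)[labels]
categorical_loss = -np.sum(one_hot * np.log(probs), axis=1)

print(sparse_loss, categorical_loss)
```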

In [None]:
def make_model():
    base_model = tf.keras.applications.ResNet50(include_top=False,
                     weights=weights,
                     input_shape=(*IMAGE_SIZE, 3))
    base_model.trainable = False

    inputs = tf.keras.layers.Input([*IMAGE_SIZE, 3])
    # decode_image scales pixels to [0, 1], but preprocess_input expects [0, 255]
    x = tf.keras.applications.resnet50.preprocess_input(inputs * 255.0)
    x = base_model(x)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    x = tf.keras.layers.Dense(64, activation="relu")(x)
    x = tf.keras.layers.Dropout(0.2)(x)
    outputs = tf.keras.layers.Dense(5, activation="softmax")(x)

    model = tf.keras.Model(inputs=inputs, outputs=outputs)

    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=lr_schedule),
        loss="sparse_categorical_crossentropy",
        metrics=['sparse_categorical_accuracy']
    )

    return model

In [None]:
with strategy.scope():
    model = make_model()
    
model.summary()

In [None]:
history = model.fit(
    train_dataset,
    epochs=5,
    validation_data = val_dataset,
)

# Analyse Results

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(20, 6))
fig.subplots_adjust(hspace = .2, wspace=.2)
axs = axs.ravel()
loss = history.history['loss']
loss_val = history.history['val_loss']
epochs = range(1, len(loss)+1)
axs[0].plot(epochs, loss, 'bo', label='loss_train')
axs[0].plot(epochs, loss_val, 'ro', label='loss_val')
axs[0].set_title('Value of the loss function')
axs[0].set_xlabel('epochs')
axs[0].set_ylabel('value of the loss function')
axs[0].legend()
axs[0].grid()
acc = history.history['sparse_categorical_accuracy']
acc_val = history.history['val_sparse_categorical_accuracy']
axs[1].plot(epochs, acc, 'bo', label='accuracy_train')
axs[1].plot(epochs, acc_val, 'ro', label='accuracy_val')
axs[1].set_title('Accuracy')
axs[1].set_xlabel('Epochs')
axs[1].set_ylabel('Value of accuracy')
axs[1].legend()
axs[1].grid()
plt.show()

# Predict Test Data

Prepare data:

In [None]:
def to_float32(image, idnum):
    return tf.cast(image, tf.float32), idnum

test_dataset = test_dataset.map(to_float32)
test_images = test_dataset.map(lambda image, idnum: image)

Predict test data:

In [None]:
pred_proba = model.predict(test_images, verbose=1)
preds = np.argmax(pred_proba, axis=-1)
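
As a small illustration of the last step, the sketch below applies `np.argmax` to made-up probabilities and maps the resulting class indices to the abbreviations from the intro. The names `demo_proba` and `class_names` are our own and the numbers are not model output.

```python
import numpy as np

# Made-up softmax outputs for two images (each row sums to 1).
demo_proba = np.array([[0.05, 0.05, 0.10, 0.70, 0.10],
                       [0.10, 0.10, 0.10, 0.10, 0.60]])

# The predicted class is the index of the largest probability per row.
demo_preds = np.argmax(demo_proba, axis=-1)

# Abbreviations of the 5 classes listed in the intro.
class_names = {0: "CBB", 1: "CBSD", 2: "CGM", 3: "CMD", 4: "Healthy"}
demo_names = [class_names[int(index)] for index in demo_preds]
print(demo_preds, demo_names)
```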

Write output for submission:

In [None]:
samp_subm['label'] = preds
samp_subm.to_csv('submission.csv', index=False)

In [None]:
samp_subm