# Cassava Leaf Disease Classification
## Identify the type of disease present on a Cassava Leaf image

### EfficientNet packages
In this notebook I use EfficientNet B3 as the pre-trained base model. The package can either be installed from the internet with pip, as below, or downloaded as a dataset and attached to the notebook as a second input.

In [None]:
!pip install efficientnet

## Libraries

In [None]:
import math, re, os 
import json
import pandas as pd
import numpy as np
import tensorflow as tf
import seaborn as sns
import matplotlib.pyplot as plt
import cv2
from kaggle_datasets import KaggleDatasets
import albumentations as A
from tensorflow import keras
import efficientnet.tfkeras as efn
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping
from tensorflow.keras.models import model_from_json
from functools import partial
from sklearn.model_selection import train_test_split

print("Tensorflow version " + tf.__version__)

## Detect TPU
Tensor Processing Units (TPUs) are hardware accelerators that are specialized for deep learning tasks. 
Thanks to Jesse Mostipak for the "Getting Started Tutorial".
The last line of the cell prints the number of replicas: expect 1 when no TPU is available and 8 when running on a TPU.

In [None]:
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    print('Device:', tpu.master())
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
except Exception:  # no TPU available; fall back to the default strategy
    strategy = tf.distribute.get_strategy()
print('Number of replicas:', strategy.num_replicas_in_sync)

## Setup
For this competition I'm using the Cassava Leaf Disease Classification dataset, which has already been attached to the notebook.

In [None]:
AUTOTUNE = tf.data.experimental.AUTOTUNE
model_path =  "/kaggle/working/models/model_EffB3.h5"
datasets_dir = KaggleDatasets().get_gcs_path()
batch_size = 64 * strategy.num_replicas_in_sync
image_size = [512, 512]
classes = ['0', '1', '2', '3', '4']
epochs = 100
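
The batch size above is multiplied by the number of replicas, so each replica still processes 64 images per step. A minimal sketch of that arithmetic, using a hypothetical helper `global_batch_size` that is not part of the notebook:

```python
def global_batch_size(per_replica_batch, num_replicas):
    """Return the batch size seen by the whole distribution strategy."""
    return per_replica_batch * num_replicas

# 1 replica (CPU/GPU fallback) versus 8 replicas (TPU)
print(global_batch_size(64, 1))
print(global_batch_size(64, 8))
```

On a TPU with 8 replicas the global batch is therefore 512 images, which also reduces the number of steps per epoch accordingly.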

## Exploratory data analysis (EDA)

The mapping between each label and the actual disease name is in the "label_num_to_disease_map.json" file. I'm going to link it to the image dataset to see what each disease looks like and get a general idea of the kind of images that are in the dataset. There are 4 disease classes and an extra class for healthy plants.

In [None]:
base_dir = "../input/cassava-leaf-disease-classification/"
with open(os.path.join(base_dir, "label_num_to_disease_map.json")) as file:
    name_classes = json.loads(file.read())
    name_classes = {int(k) : v for k, v in name_classes.items()}
    
print(json.dumps(name_classes, indent=4))
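
The conversion of the keys to `int` in the cell above matters because JSON object keys are always strings, while the labels in `train.csv` are integers. A small sketch of the difference, using a hypothetical inline copy of part of the mapping instead of the real file:

```python
import json

# Hypothetical inline excerpt of the mapping; the notebook reads it from disk.
raw = '{"0": "Cassava Bacterial Blight (CBB)", "3": "Cassava Mosaic Disease (CMD)", "4": "Healthy"}'

as_loaded = json.loads(raw)  # keys are the strings "0", "3", "4"
name_by_label = {int(k): v for k, v in as_loaded.items()}

print(3 in as_loaded)      # False: the integer 3 is not a key of the raw dict
print(name_by_label[3])    # lookup by integer label now works
```

Without the conversion, mapping an integer label column through the raw dictionary would find no matching keys.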

In [None]:
input_files = os.listdir(os.path.join(base_dir, "train_images"))
print(f"Images for training: {len(input_files)}")

With more than 21k JPG images in total, I can explore what the images look like and how the diseases are distributed.

In [None]:
train = pd.read_csv(os.path.join(base_dir, "train.csv"))
train["disease"] = train["label"].map(name_classes)
train

In [None]:
plt.figure(figsize=(8, 6))
sns.countplot(y="disease", data=train);
plt.ylabel("Health condition")
plt.xlabel("Available images")

As we can see, there are far more photographs of Cassava Mosaic Disease than of the other diseases. Is this disease more frequent? Is it easier to identify? Or is it a consequence of how the dataset was collected? Let's see what these cassava diseases look like.

In [None]:
# Visualize random sample images from a single category
def plot_images(label, disease, images_number=3, verbose=0):
    plot_list = train[train["label"] == label].sample(images_number)['image_id'].tolist()    
    if verbose:
        print(plot_list)    
    labels = [disease for i in range(len(plot_list))]        
    plt.figure(figsize=(16, 12))    
    for ind, (image_id, disease) in enumerate(zip(plot_list, labels)):
        plt.subplot(3, 3, ind + 1)
        image = cv2.imread(os.path.join(base_dir,"train_images",image_id))
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        plt.imshow(image)
        plt.title(disease, fontsize=12)
        plt.axis("off")   
    plt.show()

In [None]:
plot_images(label=0, disease="Cassava Bacterial Blight (CBB)")

In [None]:
plot_images(label=3, disease="Cassava Mosaic Disease (CMD)")

CMD is the most important threat to cassava production in some African regions, and it is also the most common disease in this dataset.

In [None]:
plot_images(label=1, disease="Cassava Brown Streak Disease (CBSD)")

In [None]:
plot_images(label=2, disease="Cassava Green Mottle (CGM)")

In [None]:
plot_images(label=4, disease='Healthy')

Even for healthy plants, the range of possibilities is wide. I found green, bright, beautiful, well-focused photos, but also low-quality photos, damaged or withered leaves, and a variety of plant parts: roots, stems, leaves, etc.

## Load the data
I'm going to use a few functions to feed the datasets into the model.

In [None]:
def decode_image(image):
    image = tf.image.decode_jpeg(image, channels=3)
    image = tf.cast(image, tf.float32) / 255.0
    image = tf.reshape(image, [*image_size, 3])
    return image

The TFRecord format is a simple format for storing a sequence of binary records.

In [None]:
def read_tfrecord(example, labeled):
    tfrecord_format = {
        "image": tf.io.FixedLenFeature([], tf.string),
        "target": tf.io.FixedLenFeature([], tf.int64)
    } if labeled else {
        "image": tf.io.FixedLenFeature([], tf.string),
        "image_name": tf.io.FixedLenFeature([], tf.string)}
    
    example = tf.io.parse_single_example(example, tfrecord_format)
    image = decode_image(example['image'])
    
    if labeled:
        label = tf.cast(example['target'], tf.int32)
        return image, label
    idnum = example['image_name']
    return image, idnum

In [None]:
def load_dataset(filenames, labeled=True, ordered=False):
    ignore_order = tf.data.Options()
    if not ordered:
        # disable order, increase speed
        ignore_order.experimental_deterministic = False 
    # automatically interleaves reads from multiple files
    dataset = tf.data.TFRecordDataset(filenames, num_parallel_reads=AUTOTUNE) 
    # uses data as soon as it streams in, rather than in its original order
    dataset = dataset.with_options(ignore_order) 
    dataset = dataset.map(partial(read_tfrecord, labeled=labeled), 
                          num_parallel_calls=AUTOTUNE)
    return dataset

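I'm using 75% of the TFRecord files for training and 25% for validation, which gives roughly a 75/25 split of the photographs if the files hold similar numbers of images.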

In [None]:
train_fnames, valid_fnames = train_test_split(
    tf.io.gfile.glob(datasets_dir + '/train_tfrecords/ld_train*.tfrec'),
    test_size=0.25, random_state=0)
test_fnames = tf.io.gfile.glob(datasets_dir + '/test_tfrecords/ld_test*.tfrec')

In [None]:
def data_augment(image, label):
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_flip_up_down(image)
    image = tf.image.random_brightness(image, 0.2)
    image = tf.image.random_saturation(image, 0, 2)
    image = tf.image.random_hue(image, 0.2)
    return image, label

I prefer to wrap the construction of each dataset in a function, since it is faster and cleaner.

In [None]:
def get_training_dataset():
    dataset = load_dataset(train_fnames, labeled=True)  
    dataset = dataset.map(data_augment, num_parallel_calls=AUTOTUNE)  
    dataset = dataset.repeat()
    dataset = dataset.shuffle(2048)
    dataset = dataset.batch(batch_size)
    dataset = dataset.prefetch(AUTOTUNE)
    return dataset

In [None]:
def get_validation_dataset(ordered=False):
    dataset = load_dataset(valid_fnames, labeled=True, ordered=ordered) 
    dataset = dataset.batch(batch_size)
    dataset = dataset.cache()
    dataset = dataset.prefetch(AUTOTUNE)
    return dataset

In [None]:
def get_test_dataset(ordered=False):
    dataset = load_dataset(test_fnames, labeled=False, ordered=ordered)
    dataset = dataset.batch(batch_size)
    dataset = dataset.prefetch(AUTOTUNE)
    return dataset

In [None]:
def count_data_items(filenames):
    n = [int(re.compile(r"-([0-9]*)\.").search(filename).group(1)) for filename in filenames]
    return np.sum(n)

In [None]:
print('Dataset: {} training images, {} validation images, {} (unlabeled) test images'.format(
    count_data_items(train_fnames), count_data_items(valid_fnames), count_data_items(test_fnames)))
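
`count_data_items` relies on the number of images being encoded in each TFRecord file name, between the last hyphen and the extension. A stdlib-only sketch of the same idea (it uses `sum` instead of `np.sum`), with hypothetical file names and bucket path:

```python
import re

# Digits between a hyphen and a dot, e.g. "-1338." gives "1338"
COUNT_PATTERN = re.compile(r"-([0-9]*)\.")

def count_items(filenames):
    """Sum the image counts embedded in TFRecord file names."""
    return sum(int(COUNT_PATTERN.search(name).group(1)) for name in filenames)

# Hypothetical file names, for illustration only
files = [
    "gs://some-bucket/train_tfrecords/ld_train00-1338.tfrec",
    "gs://some-bucket/train_tfrecords/ld_train01-1327.tfrec",
]
print(count_items(files))
```

This avoids reading the records just to count them, but it only works as long as the files follow that naming scheme.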

A test of applying object detection to the dataset as a cleaning step before classification (currently disabled).

In [None]:
# def run_detector(detector, image_path):
#     img = tf.io.read_file(image_path)
#     img = tf.image.decode_jpeg(img, channels=3)
#     converted_img  = tf.image.convert_image_dtype(img, tf.float32)[tf.newaxis, ...]
#     result = detector(converted_img)
#     result = {key:value.numpy() for key,value in result.items()}
#     return result, img

# def detecting_intruses(image, input_filename, boxes, class_names, scores, max_boxes=5):
#     plants_options = ["Plant","Houseplant","Flower","Tree"]
#     ch = 0
#     for i in range(min(boxes.shape[0], max_boxes)):
#         if class_names[i].decode("ascii") in plants_options:
#             ch+=1
#     if ch==0:
#         print(ch)
#         print("No plants in image:",input_filename)

In [None]:
# import tensorflow_hub as hub
# module_handle = "https://tfhub.dev/google/openimages_v4/ssd/mobilenet_v2/1"
# detector = hub.load(module_handle).signatures['default']

In [None]:
# for i in range(len(input_files)):
#     image_path = datasets_dir+"/train_images/"+input_files[i]
#     result, img = run_detector(detector, image_path)
#     detecting_intruses(img.numpy(), input_files[i],result["detection_boxes"],
#               result["detection_class_entities"], result["detection_scores"])

# Building the model


## Learning rate schedule
When training a model, it is often recommended to lower the learning rate as the training progresses. This schedule applies an exponential decay function to an optimizer step, given a provided initial learning rate.

In [None]:
lr_scheduler = keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3, decay_steps=10000, decay_rate=0.9)
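
To get a feel for the schedule above, the decay can be reproduced in plain Python. This is a sketch of the continuous (non-staircase) form, `initial_lr * decay_rate ** (step / decay_steps)`; the helper name `exponential_decay` is hypothetical:

```python
def exponential_decay(step, initial_lr=1e-3, decay_steps=10000, decay_rate=0.9):
    """Continuous (non-staircase) exponential decay of the learning rate."""
    return initial_lr * decay_rate ** (step / decay_steps)

# Learning rate at a few optimizer steps
for step in (0, 5000, 10000, 20000):
    print(step, exponential_decay(step))
```

With `decay_steps=10000`, the rate drops by 10% only after 10,000 optimizer steps, so a short run sees an almost constant learning rate.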

## Building the model
In order to ensure that the model is trained on the TPU, it is built using `with strategy.scope()`.    

The model uses transfer learning: the pre-trained EfficientNet B3 serves as a frozen base model, and a custom classification head is added on top with `tf.keras.Sequential`.

Note that I'm using `sparse_categorical_crossentropy` as the loss function because the labels are integer class indices rather than one-hot vectors.

In [None]:
with strategy.scope():    
    
    #     img_adjust_layer = tf.keras.layers.Lambda(tf.keras.applications.mobilenet.preprocess_input, input_shape=[*image_size, 3])
    #     base_model = tf.keras.applications.MobileNet(weights='imagenet', include_top=False)
    
    base_model = efn.EfficientNetB3(weights='imagenet', include_top=False)
    
    base_model.trainable = False
    model = tf.keras.Sequential([
        tf.keras.layers.BatchNormalization(renorm=True),
#         img_adjust_layer,
        base_model,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(256, activation='relu'),
#         tf.keras.layers.Dropout(0.4),
#         tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dropout(0.1),
        tf.keras.layers.Dense(len(classes), activation='softmax')  
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=lr_scheduler, epsilon=0.00001),
        loss='sparse_categorical_crossentropy',  
        metrics=['sparse_categorical_accuracy'])
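
The sparse loss chosen above gives the same value as the one-hot version; it just takes the integer label directly. A plain-Python sketch for a single example, with hypothetical probabilities and hand-written helper functions (these are not the Keras implementations):

```python
import math

def sparse_categorical_crossentropy(label, probs):
    """Loss for an integer class index and a predicted probability vector."""
    return -math.log(probs[label])

def categorical_crossentropy(one_hot, probs):
    """Loss for a one-hot target and a predicted probability vector."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

probs = [0.05, 0.05, 0.10, 0.70, 0.10]  # hypothetical softmax output, 5 classes
label = 3                                # integer label
one_hot = [0, 0, 0, 1, 0]                # the same target, one-hot encoded

print(sparse_categorical_crossentropy(label, probs))
print(categorical_crossentropy(one_hot, probs))
```

Both calls print the same number, which is why no one-hot encoding step is needed in the input pipeline.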


To keep the best model even if it degrades after some epochs, I'm using a checkpoint that lets me go back to the best score. I'm also using an early-stopping callback, which halts training when the monitored score stops improving. A third callback, still being tested, reduces the learning rate when there is no improvement after some epochs.

In [None]:
os.makedirs("/kaggle/working/models/", exist_ok=True)
weights_path = "/kaggle/working/models/weights_EffB3.h5"
callbacks_list = [
      ModelCheckpoint(weights_path, monitor='val_sparse_categorical_accuracy', 
                  verbose=1, save_best_only=True, save_weights_only=True),
      EarlyStopping(monitor='val_sparse_categorical_accuracy', patience=10, verbose=0),
#       ReduceLROnPlateau(monitor='val_sparse_categorical_accuracy', factor=0.2,
#                               patience=5)
      ]
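
The patience rule behind early stopping can be illustrated without Keras. The sketch below is a simplified stand-in, not the Keras implementation (which has more options, such as `min_delta`); the function name and the scores are hypothetical:

```python
def simulate_early_stopping(scores, patience):
    """Return (epoch at which training stops, epoch of the best score).

    Simplified patience rule for a score that should increase: stop once
    `patience` consecutive epochs fail to beat the best score so far.
    """
    best, best_epoch, wait = float("-inf"), None, 0
    for epoch, score in enumerate(scores):
        if score > best:
            best, best_epoch, wait = score, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(scores) - 1, best_epoch

# Hypothetical validation accuracies, one per epoch
val_accuracy = [0.60, 0.70, 0.72, 0.71, 0.72, 0.70, 0.69]
print(simulate_early_stopping(val_accuracy, patience=3))
```

Here training stops at epoch 5 while the best score was reached at epoch 2, which is exactly the case where the checkpoint saved at the best epoch is worth reloading.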

In [None]:
# Let's take a look to see how many layers are in the base model
print("Number of layers in the base model: ", len(base_model.layers))
print("Number of layers in the model: ", len(model.layers))

## Train the model

In [None]:
# load data
train_dataset = get_training_dataset()
valid_dataset = get_validation_dataset()

In [None]:
steps_per_epoch = count_data_items(train_fnames) // batch_size
valid_steps = count_data_items(valid_fnames) // batch_size

history = model.fit(train_dataset, 
                    steps_per_epoch=steps_per_epoch, 
                    epochs=epochs,
                    validation_data=valid_dataset,
                    validation_steps=valid_steps,
                    callbacks=callbacks_list
                   )

`model.summary()` prints each layer, its output shape, and its number of parameters.
At the bottom of the printout are the totals for all parameters, trainable parameters, and non-trainable parameters. Because the pre-trained base model is frozen, there is a large number of non-trainable parameters.

In [None]:
model.load_weights(weights_path)
model.save(model_path)
model.summary()
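
The trainable part of the summary can be checked by hand, since a dense layer has one weight per input-unit pair plus one bias per unit. A rough sketch for the two dense layers of the head, ignoring the small BatchNormalization layer and assuming the base model's pooled output has 1536 features (an assumption about EfficientNet B3, to be compared with the printed summary):

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

features = 1536  # assumption: channel count of EfficientNet B3's last feature map
hidden = dense_params(features, 256)  # Dense(256, activation='relu')
output = dense_params(256, 5)         # Dense(5, activation='softmax')
print(hidden, output, hidden + output)
```

Pooling and dropout layers add no parameters, so these two layers account for almost all of the trainable count.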

# Evaluate the model

In [None]:
# print out variables available to use
print(history.history.keys())

It is useful to plot the scores as a function of the number of epochs. In a fast and readable way, this shows when an inflection occurs, how smooth the evolution is, and how the training and validation curves relate to each other.

In [None]:
history_frame = pd.DataFrame(history.history)
history_frame.loc[:, ['loss', 'val_loss']].plot()
history_frame.loc[:, ['sparse_categorical_accuracy', 'val_sparse_categorical_accuracy']].plot();
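
The same history can also be summarized numerically, for example by locating the best validation epoch and the gap between training and validation accuracy at that point. A sketch with hypothetical values standing in for `history.history`:

```python
history_values = {  # hypothetical values, for illustration only
    "sparse_categorical_accuracy":     [0.62, 0.71, 0.76, 0.80, 0.83],
    "val_sparse_categorical_accuracy": [0.64, 0.70, 0.73, 0.74, 0.73],
}

val = history_values["val_sparse_categorical_accuracy"]
# Index of the epoch with the highest validation accuracy
best_epoch = max(range(len(val)), key=val.__getitem__)
# Train/validation gap at that epoch; a growing gap suggests overfitting
gap = history_values["sparse_categorical_accuracy"][best_epoch] - val[best_epoch]
print(best_epoch, round(gap, 2))
```

A training accuracy that keeps rising while the validation accuracy flattens or drops is the pattern the plots are meant to reveal.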

## Make predictions
Now that the model is trained, I can use it to make predictions.

In [None]:
# cast the test images to float32 (decode_image already returns float32, so this is only a safeguard)
def to_float32(image, label):
    return tf.cast(image, tf.float32), label

In [None]:
test_ds = get_test_dataset(ordered=True) 
test_ds = test_ds.map(to_float32)

print('Predicted label')
# keep only the images; the image names are not needed for prediction
test_images_ds = test_ds.map(lambda image, idnum: image)
probabilities = model.predict(test_images_ds)
predictions = np.argmax(probabilities, axis=-1)
print(predictions)
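
The predicted indices can be turned back into disease names with the label mapping from the EDA section. A plain-Python sketch of the argmax step and the lookup, using hypothetical probabilities in place of the model output:

```python
names = {0: "Cassava Bacterial Blight (CBB)", 1: "Cassava Brown Streak Disease (CBSD)",
         2: "Cassava Green Mottle (CGM)", 3: "Cassava Mosaic Disease (CMD)", 4: "Healthy"}

probabilities = [  # hypothetical model outputs for two images
    [0.05, 0.05, 0.10, 0.70, 0.10],
    [0.10, 0.15, 0.05, 0.10, 0.60],
]

def argmax(row):
    """Index of the largest value in a list of probabilities."""
    return max(range(len(row)), key=row.__getitem__)

predicted = [argmax(row) for row in probabilities]
predicted_names = [names[i] for i in predicted]
print(predicted, predicted_names)
```

In the notebook itself, `np.argmax(probabilities, axis=-1)` performs the same step on the whole array at once.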