# Introduction

This kernel is the training environment for my final degree project, in which I intend to build a Deep Learning model able to diagnose plant diseases, working specifically with the [plant pathology 2021 dataset](https://www.kaggle.com/c/plant-pathology-2021-fgvc8). It can therefore also be seen as one more attempt at the overall competition.

For this reason, the code here is only responsible for training and testing the models, as these tasks are computationally heavy and need TPUs. The trained models are then saved to .h5 files.

The remaining tasks, such as data analysis, dataset division and results analysis, are all performed locally. The utility functions that implement those parts can be found on GitHub.

Check out the project at my [github repo](https://github.com/gfelis/TFG).

### EfficientNet requires special installation

In [None]:
!pip install -q efficientnet

In [None]:
# Python built-in libraries
import random
import os

# Third party libraries
import numpy as np
import cv2
import pandas as pd
import tensorflow as tf
import warnings
from tensorflow.keras.callbacks import CSVLogger
from kaggle_datasets import KaggleDatasets
from sklearn.model_selection import train_test_split
import tensorflow.keras.layers as L

import efficientnet.tfkeras as efn
from tensorflow.keras.applications import DenseNet121

from tqdm import tqdm
tqdm.pandas()

tf.random.set_seed(0)


# HYPERPARAMS

PATH = '../input/plant-pathology-2021-fgvc8/'
TRAINDIR = PATH + 'train_images/'
TESTDIR = PATH + 'test_images/'
TRAIN_CSV = PATH + 'train.csv'
SUBMISSION = PATH + 'sample_submission.csv'

EPOCHS = 15
SAMPLE_LEN = 100

In [None]:
# Converting a loaded model to TFLite format

from tensorflow import lite
model = tf.keras.models.load_model('../input/densenet-2dataaug-model/dense_net_joint_2daug.h5')
converter = lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()
with open('model.tflite', 'wb') as f:
  f.write(tflite_model)

# Loading the dataset

This first set of functions is taken from *utils.py*. They handle data loading and the partition of the whole dataset into train and test sets.

Later on, we will also create a validation set. The difference between the two is that the validation set is evaluated at the end of each epoch during training, so we can track how the model performs on unseen data as its weights are updated. The test set, on the other hand, is only used once the model is completely trained, to measure its overall performance.

We hold out our own test set because the test images provided by the competition consist of only 3 samples. The competition has a hidden test set of 5 thousand extra images, which is only used once a submission has been made.

To summarize, we use 10% of the total dataset as the test set and 15% of the remaining training data (about 13.5% of the total) as the validation set.
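
Because the validation set is carved out of the training portion after the test set has been removed, the effective share of each subset is easy to misjudge. The sketch below works the sizes out for a hypothetical dataset of 1000 samples. The helper `split_sizes` and its rounding are assumptions made for illustration; the libraries used later in this kernel may round by one sample differently.

```python
def split_sizes(n_samples, test_frac=0.10, valid_frac=0.15):
    """Return (train, valid, test) sizes for a two-stage split (illustrative helper)."""
    # Stage 1: hold out the test set from the full dataset
    n_test = round(n_samples * test_frac)
    remaining = n_samples - n_test
    # Stage 2: hold out the validation set from what is left
    n_valid = round(remaining * valid_frac)
    n_train = remaining - n_valid
    return n_train, n_valid, n_test


train_n, valid_n, test_n = split_sizes(1000)
print(train_n, valid_n, test_n)  # 765 135 100
```

With these fractions the validation set ends up being 13.5% of the whole dataset, and the model trains on 76.5% of it.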

In [None]:
# To reproduce the same random partition of the dataset into train and test among different experiments
def seed_reproducer(seed=2021):
    np.random.seed(seed)
    random.seed(seed)

def load_split_dataset(frac: float=0.1, data: pd.DataFrame=None) -> "tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]":
    # Only read the CSV file when no dataframe is supplied
    dataset = pd.read_csv(TRAIN_CSV) if data is None else data
    seed_reproducer()
    state = random.randint(0, 10000)
    test = dataset.sample(frac=frac, random_state=state)
    # Every row that was not sampled for the test set goes to the train set
    train = dataset.drop(test.index).reset_index(drop=True)
    test = test.reset_index(drop=True)
    return dataset, train, test

In [None]:
# Read the dataset file and split them into train and test sets
data, train, test = load_split_dataset()

# This condition doesn't ensure the function is properly implemented, it's necessary but not sufficient
assert(len(data) == len(test) + len(train))

# We can check the dataset
train.head()
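
The length check only proves that the sizes add up. A stricter check, sketched here under the assumption that each sample is identified by a unique name, verifies that the subsets share no element and together cover the whole dataset. The helper `is_partition` is hypothetical and not part of *utils.py*.

```python
def is_partition(whole, *parts):
    """True if the parts are pairwise disjoint and together cover `whole`."""
    seen = set()
    for part in parts:
        items = set(part)
        if seen & items:
            # Some element appears in more than one part
            return False
        seen |= items
    return seen == set(whole)


images = ['a.jpg', 'b.jpg', 'c.jpg', 'd.jpg']
print(is_partition(images, ['a.jpg', 'b.jpg', 'c.jpg'], ['d.jpg']))  # True
print(is_partition(images, ['a.jpg', 'b.jpg'], ['b.jpg', 'd.jpg']))  # False
```

In this kernel it could be called with the `image` column of each dataframe, for example `is_partition(data['image'], train['image'], test['image'])`.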

# Removing duplicates

To remove duplicated images we use the duplicates.csv file, which contains 62 sequences of duplicates found with image_hash. The list is taken [from this notebook](https://www.kaggle.com/nickuzmenkov/pp2021-duplicates-revealing). For each duplicate sequence:

We keep only one sample if all duplicates share the same labels, and we delete all of them if at least one is labeled differently, because in that case we can't know which label is correct.
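
The rule can be illustrated on a toy example. This is a minimal sketch that uses a plain dictionary instead of the dataframe; the function `resolve_duplicates`, the image names and the labels are all made up for illustration.

```python
def resolve_duplicates(labels, groups):
    """Return the image names to drop. `labels` maps image name -> label string."""
    to_drop = []
    for group in groups:
        # Ignore names that are not in the dataset
        present = [img for img in group if img in labels]
        distinct = {labels[img] for img in present}
        if len(distinct) == 1:
            # Same labels everywhere: keep the first sample, drop the rest
            to_drop.extend(present[1:])
        else:
            # Conflicting labels: drop the whole group
            to_drop.extend(present)
    return to_drop


labels = {'a.jpg': 'scab', 'b.jpg': 'scab', 'c.jpg': 'rust', 'd.jpg': 'healthy'}
groups = [['a.jpg', 'b.jpg'], ['c.jpg', 'd.jpg']]
print(resolve_duplicates(labels, groups))  # ['b.jpg', 'c.jpg', 'd.jpg']
```

The first group agrees on `scab`, so only its second image is dropped; the second group disagrees, so both of its images are dropped.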

In [None]:
with open('../input/pp2021duplicatesrevealing/duplicates.csv', 'r') as file:
    duplicates = [x.strip().split(',') for x in file.readlines()]

def eliminate_duplicates(dataframe):
    init_len = len(dataframe)

    for row in duplicates:
        # Samples of this duplicate sequence that are present in the dataframe
        group = dataframe[dataframe['image'].isin(row)]
        if group['labels'].nunique() == 1:
            # All duplicates share the same labels: keep one sample, drop the rest
            dataframe.drop(group.index[1:], inplace=True)
        else:
            # The labels disagree, so we can't know which one is correct: drop them all
            dataframe.drop(group.index, inplace=True)
    print(f'Dropping {init_len - len(dataframe)} duplicate samples.')

eliminate_duplicates(data)

# We split the dataset again, but now taking into account the dropped rows
data, train, test = load_split_dataset(data=data)

# Normalising the dataset

The following functions are also taken from *utils.py*. They change how the labels are represented in the pandas dataframe.

Initially the dataset looked like this:

>|***image***    |***labels***   |  
|---|---|  
|e88d1bbd624e9c34.jpg   |powdery_mildew   |  
|8002cb321f8bfcdf.jpg   |scab frog_eye_leaf_spot complex   |  
| ...  | ...  |  


## Disjoint normalisation

The first type of normalisation is called *disjoint* because it keeps the 12 original label combinations as 12 separate classes.

Normalising the dataset this way transforms it into:

> |***image***    |***scab***    |***healthy***  |***frog_eye_leaf_spot***    |  ***rust***    |***complex***    |***powdery_mildew***    |***scab frog_eye_leaf_spot***    |***scab frog_eye_leaf_spot complex***    |***frog_eye_leaf_spot complex***    |***rust frog_eye_leaf_spot***   |***rust complex***    |***powdery_mildew complex***    | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|e88d1bbd624e9c34.jpg|0|0|0|0|0|1|0|0|0|0|0|0|
|8002cb321f8bfcdf.jpg|0|0|0|0|0|0|0|1|0|0|0|0|
|...|...|...|...|...|...|...|...|...|...|...|...|...|

## Joint normalisation

This normalisation reduces the labels to the 6 basic labels, so a label no longer contains spaces separating different diseases. A sample with several diseases simply has several columns set to 1.

According to the competition's information, different diseases are separated by spaces, which is why this normalisation is also considered, and seen as the most accurate one.

This normalisation transforms the dataset into:

> |***image***    |***scab***    |***healthy***  |***frog_eye_leaf_spot***    |  ***rust***    |***complex***    |***powdery_mildew***    |
|---|---|---|---|---|---|---|
|e88d1bbd624e9c34.jpg|0|0|0|0|0|1|
|8002cb321f8bfcdf.jpg|1|0|1|0|1|0|
|...|...|...|...|...|...|...|
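
As a quick illustration of the joint encoding, the following sketch converts a label string into the row shown in the table. The column order is copied from the table and the helper `encode_joint` is hypothetical; the actual implementation used in this kernel works on the whole dataframe.

```python
# Column order taken from the table above
CLASSES = ['scab', 'healthy', 'frog_eye_leaf_spot', 'rust', 'complex', 'powdery_mildew']


def encode_joint(label_string, classes=CLASSES):
    """Multi-hot encode a space-separated label string."""
    present = set(label_string.split())
    return [1 if name in present else 0 for name in classes]


print(encode_joint('powdery_mildew'))                   # [0, 0, 0, 0, 0, 1]
print(encode_joint('scab frog_eye_leaf_spot complex'))  # [1, 0, 1, 0, 1, 0]
```

Unlike the disjoint encoding, a row here can contain more than one 1, one for each disease present in the sample.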



In [None]:
# Here we have the 2 types of normalisations

def normalise_from_dataset_disjoint(dataset: pd.DataFrame) -> pd.DataFrame:
    # Every distinct label string (e.g. 'scab frog_eye_leaf_spot') becomes its own class
    labels = dataset['labels'].value_counts().index.tolist()
    columns = ['image'] + labels
    data = []

    for image, label in zip(dataset['image'], dataset['labels']):
        # One-hot row: a single 1 in the column of this sample's label
        row = [image] + [0] * len(labels)
        row[columns.index(label)] = 1
        data.append(row)

    return pd.DataFrame(data, columns=columns)

def normalise_from_dataset_joint(dataset: pd.DataFrame) -> pd.DataFrame:
    labels = dataset['labels'].value_counts().index.tolist()
    # Split every label string on spaces to collect the basic labels
    basic_labels = set()
    for label in labels:
        for word in label.split():
            basic_labels.add(word)

    # The column order follows the iteration order of the set
    columns = ['image']
    columns.extend(basic_labels)
    data = []

    for image, sample_labels in zip(dataset['image'], dataset['labels']):
        # Multi-hot row: one 1 per disease present in the sample
        row = [image] + [0] * len(basic_labels)
        for real_label in sample_labels.split():
            row[columns.index(real_label)] = 1
        data.append(row)

    return pd.DataFrame(data, columns=columns)

In [None]:
# Normalising with the joint approach
norm_train = normalise_from_dataset_joint(train)
norm_test = normalise_from_dataset_joint(test)

# The assertion should still hold
assert(len(data) == len(norm_train) + len(norm_test))

# If we are lucky enough we can check that there are rows with multiple diseases
norm_train.head()

# Kaggle's TPU configuration

As mentioned before, the sole purpose of this kernel is to perform the computation-intensive tasks that require Tensor Processing Units (TPUs). This is why we won't dig into data analysis or data preprocessing and will get right to the training part.

The following piece of code allows the connection of the Kaggle Kernel to the available TPUs or GPUs.

In [None]:
# AUTOTUNE lets tf.data choose the level of parallelism and the prefetch buffer size
# dynamically at runtime, instead of us fixing them by hand
AUTO = tf.data.experimental.AUTOTUNE

print('Using tensorflow %s' % tf.__version__)

try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
    print('Running on TPUv3-8')
except Exception:  # no TPU available, fall back to the default strategy
    tpu = None
    tf.keras.mixed_precision.set_global_policy('mixed_float16')
    strategy = tf.distribute.get_strategy()
    print('Running on GPU with mixed precision')

# The batch size refers to the number of samples utilized in each iteration of an epoch
BATCH_SIZE = 16 * strategy.num_replicas_in_sync

print('Number of replicas:', strategy.num_replicas_in_sync)
print('Batch size: %.i' % BATCH_SIZE)



# Tensorflow set up

In this section we prepare the data we have already gathered in the form TensorFlow needs, such as *numpy* arrays, *tensorflow* images or *tensorflow* datasets, and we adjust the path where our images are saved in Google Cloud Storage.

In [None]:
# The dataset is stored at google cloud's storage buckets
GCS_DS_PATH = KaggleDatasets().get_gcs_path('plant-pathology-2021-fgvc8')


# Use every column except 'image' as a label column, in norm_train's order,
# so the selection doesn't depend on the names of the first and last columns
label_columns = [col for col in norm_train.columns if col != 'image']

def format_path(st):
    return GCS_DS_PATH + '/train_images/' + st

# For the moment we will only use test_paths to predict with our model,
# test_labels is going to be used for the final evaluation
test_paths = norm_test['image'].apply(format_path).values
test_labels = np.float32(norm_test[label_columns].values)

train_paths = norm_train['image'].apply(format_path).values
train_labels = np.float32(norm_train[label_columns].values)

# Similar to the function we built in utils.py, scikit-learn provides
# a function to split data into train and test sets, used here to set up the validation set
train_paths, valid_paths, train_labels, valid_labels =\
train_test_split(train_paths, train_labels, test_size=0.15, random_state=2020)


def decode_image(filename, label=None, image_size=(512, 512)):
    bits = tf.io.read_file(filename)
    image = tf.image.decode_jpeg(bits, channels=3)
    image = tf.cast(image, tf.float32) / 255.0
    image = tf.image.resize(image, image_size)
    
    if label is None:
        return image
    else:
        return image, label

# For the moment we will only use 2 data augmentation techniques
def data_augment(image, label=None):
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_flip_up_down(image)
    #TODO
    #image = tf.image.random_crop()
    #image = tf.image.random_brightness
    #image = tf.image.random_contrast
    
    if label is None:
        return image
    else:
        return image, label
    
# Create Dataset objects
train_dataset = (
    tf.data.Dataset
    .from_tensor_slices((train_paths, train_labels))
    .map(decode_image, num_parallel_calls=AUTO)
    .map(data_augment, num_parallel_calls=AUTO)
    .repeat()
    .shuffle(512)
    .batch(BATCH_SIZE)
    .prefetch(AUTO)
)

valid_dataset = (
    tf.data.Dataset
    .from_tensor_slices((valid_paths, valid_labels))
    .map(decode_image, num_parallel_calls=AUTO)
    .batch(BATCH_SIZE)
    .cache()
    .prefetch(AUTO)
)

test_dataset = (
    tf.data.Dataset
    .from_tensor_slices(test_paths)
    .map(decode_image, num_parallel_calls=AUTO)
    .batch(BATCH_SIZE)
)


# Building the learning rate function

In [None]:
def build_lrfn(lr_start=0.00001, lr_max=0.00005, 
               lr_min=0.00001, lr_rampup_epochs=5, 
               lr_sustain_epochs=0, lr_exp_decay=.8):
    lr_max = lr_max * strategy.num_replicas_in_sync

    def lrfn(epoch):
        if epoch < lr_rampup_epochs:
            lr = (lr_max - lr_start) / lr_rampup_epochs * epoch + lr_start
        elif epoch < lr_rampup_epochs + lr_sustain_epochs:
            lr = lr_max
        else:
            lr = (lr_max - lr_min) *\
                 lr_exp_decay**(epoch - lr_rampup_epochs\
                                - lr_sustain_epochs) + lr_min
        return lr
    return lrfn

lrfn = build_lrfn()
STEPS_PER_EPOCH = train_labels.shape[0] // BATCH_SIZE
lr_schedule = tf.keras.callbacks.LearningRateScheduler(lrfn, verbose=1)
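
To see what this schedule looks like, the sketch below re-implements the same formula without TensorFlow and prints the rate for each of the 15 epochs. It assumes 8 replicas, as on a TPU v3-8; `make_schedule` and its parameter names are illustrative, not part of the kernel.

```python
def make_schedule(replicas, lr_start=1e-5, lr_max=5e-5, lr_min=1e-5,
                  rampup_epochs=5, sustain_epochs=0, exp_decay=0.8):
    # The peak rate is scaled by the number of replicas (assumed to be 8 below)
    peak = lr_max * replicas

    def schedule(epoch):
        if epoch < rampup_epochs:
            # Linear ramp-up from lr_start to the peak
            return (peak - lr_start) / rampup_epochs * epoch + lr_start
        if epoch < rampup_epochs + sustain_epochs:
            # Optional plateau at the peak
            return peak
        # Exponential decay of the gap between the peak and lr_min
        decay_steps = epoch - rampup_epochs - sustain_epochs
        return (peak - lr_min) * exp_decay ** decay_steps + lr_min

    return schedule


schedule = make_schedule(replicas=8)
for epoch in range(15):
    print(epoch, f'{schedule(epoch):.6f}')
```

Under these assumptions the rate climbs linearly from 1e-5 to a peak of 4e-4 at epoch 5, and from then on the gap to `lr_min` shrinks by a factor of 0.8 per epoch.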


# Building up the model: DenseNet 121

In [None]:
with strategy.scope():
    model_dense = tf.keras.Sequential([DenseNet121(input_shape=(512, 512, 3),
                                             weights='imagenet',
                                             include_top=False),
                                 L.GlobalAveragePooling2D(),
                                 L.Dense(train_labels.shape[1],
                                         activation='softmax')])
        
    model_dense.compile(optimizer='adam',
                  loss = 'categorical_crossentropy',
                  metrics=['categorical_accuracy'])
    model_dense.summary()
    
# To save the model training history for later reviews
dense_csv_logger = CSVLogger('dense_net_joint_2daug.log', separator=',', append=False)

# Training the model

In [None]:
history_dense = model_dense.fit(train_dataset,
                    epochs=EPOCHS,
                    callbacks=[lr_schedule, dense_csv_logger],
                    steps_per_epoch=STEPS_PER_EPOCH,
                    validation_data=valid_dataset)

# Saving the model
model_dense.save('dense_net_joint_2daug_dedup.h5')

# Building up the model: EfficientNet

In [None]:
with strategy.scope():
    model_efn = tf.keras.Sequential([efn.EfficientNetB7(input_shape=(512, 512, 3),
                                                    weights='imagenet',
                                                    include_top=False),
                                 L.GlobalAveragePooling2D(),
                                 L.Dense(train_labels.shape[1],
                                         activation='softmax')])
    
    
        
    model_efn.compile(optimizer='adam',
                  loss = 'categorical_crossentropy',
                  metrics=['categorical_accuracy'])
    efn_csv_logger = CSVLogger('efn_joint_2daug.log', separator=',', append=False)
    model_efn.summary()

# Training

In [None]:
history_efn = model_efn.fit(train_dataset,
                    epochs=EPOCHS,
                    callbacks=[lr_schedule, efn_csv_logger],
                    steps_per_epoch=STEPS_PER_EPOCH,
                    validation_data=valid_dataset)

# Saving the model
model_efn.save('efn_joint_2daug.h5')

# Building up the model: EfficientNet Noisy Student

In [None]:
with strategy.scope():
    model_efnns = tf.keras.Sequential([efn.EfficientNetB7(input_shape=(512, 512, 3),
                                                    weights='noisy-student',
                                                    include_top=False),
                                 L.GlobalAveragePooling2D(),
                                 L.Dense(train_labels.shape[1],
                                         activation='softmax')])
    
    
        
    model_efnns.compile(optimizer='adam',
                  loss = 'categorical_crossentropy',
                  metrics=['categorical_accuracy'])
    model_efnns.summary()
    efnns_csv_logger = CSVLogger('efnns_joint_2daug.log', separator=',', append=False)

# Training

In [None]:
history_efnns = model_efnns.fit(train_dataset,
                    epochs=EPOCHS,
                    callbacks=[lr_schedule, efnns_csv_logger],
                    steps_per_epoch=STEPS_PER_EPOCH,
                    validation_data=valid_dataset)

# Saving the model
model_efnns.save('efnns_joint_2daug.h5')