# Introduction

For the Kaggle **[Cassava Leaf Disease Classification competition](https://www.kaggle.com/c/cassava-leaf-disease-classification)** I will try to develop an effective model, because it could have a large impact on farmers in Africa. The model should classify pictures of leaves into four different diseases. A fifth category covers healthy leaves.   

With such a model, farmers may be able to quickly identify diseased plants, potentially saving their crops before the diseases inflict irreparable damage. As an added challenge, effective solutions must perform well under significant constraints, since African farmers may only have access to mobile-quality cameras and low-bandwidth connections.

Submissions will be evaluated based on their categorization accuracy.

### Categories  

"0": "Cassava Bacterial Blight (CBB)",   
"1": "Cassava Brown Streak Disease (CBSD)",   
"2": "Cassava Green Mottle (CGM)",   
"3": "Cassava Mosaic Disease (CMD)",   
"4": "Healthy"  

# Set up environment

In [None]:
import os
import tensorflow as tf
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import json
import cv2
from PIL import Image
from tensorflow import keras
from tensorflow.keras import models, layers
from tensorflow.keras.optimizers import Adam, SGD, RMSprop
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Dense, Flatten, Conv2D, BatchNormalization, MaxPool2D, Dropout, Activation, GlobalMaxPooling2D, GlobalAveragePooling2D
from tensorflow.keras.applications.efficientnet import EfficientNetB0, EfficientNetB7, preprocess_input
from tensorflow.keras.metrics import sparse_categorical_accuracy, sparse_categorical_crossentropy
from tensorflow.keras.losses import sparse_categorical_crossentropy
from tensorflow.keras.models import model_from_json
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau
from functools import partial

# Load the data

In [None]:
path = '/kaggle/input/cassava-leaf-disease-classification'

train_images = os.listdir(os.path.join(path, "train_images"))
print("Total images for Train: ", len(train_images))

with open(os.path.join(path, 'label_num_to_disease_map.json')) as file:
    classes = json.load(file)
    
print(json.dumps(classes,indent=4))

train_df = pd.read_csv(os.path.join(path, "train.csv"))
train_df.head()

train_df['class'] = train_df['label'].map({int(i) : c for i, c in classes.items()}) 

train_df.head()

In [None]:
# plot the categories/classes to visualize the distribution

plt.subplots(figsize=(12,8))
ax  = sns.countplot(x='class', data=train_df)

for a in ax.patches:
        ax.annotate('{:1}'.format(a.get_height()),
                    (a.get_x()+0.3, a.get_height()))
plt.xticks(rotation=90)
ax.set_title("classes", fontdict={'fontsize':15})
plt.show()

The plot shows that the data set is unbalanced. 
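One common way to account for such an imbalance is to weight each class inversely to its frequency. The following is a minimal sketch of that idea, not something this notebook does: the helper name `balanced_class_weights` and the toy labels are assumptions made up for illustration, and the counts are not the competition's.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight} with weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels (hypothetical, only to show the effect): class 3 dominates.
toy_labels = [3] * 6 + [4] * 2 + [0] * 1 + [1] * 1
weights = balanced_class_weights(toy_labels)

# Rare classes receive a larger weight than frequent ones.
for cls in sorted(weights):
    print(cls, round(weights[cls], 3))
```

Keras's `model.fit` accepts such a dictionary through its `class_weight` argument, keyed by integer class index. Whether this helps on this data set would have to be tested.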

In [None]:
def plot_images(class_id, label, images_number,verbose=0):

    plot_list = train_df[train_df["label"] == class_id].sample(images_number)['image_id'].tolist()
    
    # Printing list of images
    if verbose:
        print(plot_list)
        
    labels = [label for i in range(len(plot_list))]
    # Smallest square grid that fits all images; plt.subplot needs integers
    size = int(np.ceil(np.sqrt(images_number)))
        
    plt.figure(figsize=(20, 20))
    
    for ind, (image_id, label) in enumerate(zip(plot_list, labels)):
        plt.subplot(size, size, ind + 1)
        image = cv2.imread(os.path.join(path, "train_images", image_id))
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

        plt.imshow(image)
        plt.title(label, fontsize=12)
        plt.axis("off")
    
    plt.show()


plot_images(class_id=4, 
    label='Healthy',
    images_number=6,
    verbose=1)

plot_images(class_id=3, 
    label='Cassava Mosaic Disease (CMD)',
    images_number=6,
    verbose=1)

plot_images(class_id=2, 
    label='Cassava Green Mottle (CGM)',
    images_number=6,
    verbose=1)

plot_images(class_id=1, 
    label='Cassava Brown Streak Disease (CBSD)',
    images_number=6,
    verbose=1)

plot_images(class_id=0, 
    label='Cassava Bacterial Blight (CBB)',
    images_number=6,
    verbose=1)

Not all pictures are labeled correctly. I saw some "Healthy" pictures that look like "CBB", and there are also some pictures of fruits instead of leaves.  
These images will cause problems for the ML model: garbage in, garbage out.  
Normally I would try to identify and remove the wrongly labeled images. But this is a competition, and if the test data used to determine the accuracy contain the same noise, my model would score worse. To find the right labels, I would prefer k-means clustering: remove the "fruit" cluster and take a look at the remaining clusters. Maybe they reveal other diseases or mislabeled pictures. **Update: I tried to train the model without the fruit pictures, but it had a negative impact on the submission score** (0.106)
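The clustering idea can be illustrated with a small pure-Python k-means. This is only a sketch under assumptions: the 2-D `toy_features` stand in for image feature vectors (in practice these might be embeddings from a pretrained network), the function names are hypothetical, and the farthest-first initialisation is a simplification chosen to keep the example deterministic. A library implementation such as scikit-learn's `KMeans` would be the usual choice.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=10):
    """Cluster points into k groups and return the cluster index of each point."""
    # Farthest-first initialisation: start with the first point, then
    # repeatedly add the point that is farthest from all chosen centroids.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )

    assignment = []
    for _ in range(n_iter):
        # Assignment step: every point joins its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: every centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment


# Hypothetical feature vectors forming two well-separated groups.
toy_features = [(0, 0), (0, 1), (1, 0), (1, 1),
                (10, 10), (10, 11), (11, 10), (11, 11)]
clusters = kmeans(toy_features, k=2)
print(clusters)
```

Images that fall into a cluster dominated by another class would then be candidates for manual inspection.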

# Set up variables

The next chunk defines some variables. Changing them can affect the accuracy.  
I tried different target sizes, e.g. 380 **-> no impact on loss & accuracy**

In [None]:
TARGET_SIZE = (240,240) #380
BATCH_SIZE = 16
STEPS_PER_EPOCH = int(len(train_df) * 0.8) // BATCH_SIZE   # 80% of the data for training
VALIDATION_STEPS = int(len(train_df) * 0.2) // BATCH_SIZE  # 20% of the data for validation
EPOCHS = 10

The test data consists of only one picture, so I split the training data set 80%/20% into training and validation data.
"ImageDataGenerator" is used to increase the amount of data by adding slightly modified copies of already existing data.

"ImageDataGenerator" has no cross-validation parameter. (I would have to write a function to implement it -> not done for now)
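A cross-validation helper could start from a stratified fold assignment like the one sketched here. This is an illustration under assumptions, not code used in this notebook: the function name `stratified_folds` and the toy labels are hypothetical, and scikit-learn's `StratifiedKFold` offers the same functionality.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=2021):
    """Assign each sample index to a fold so every fold keeps roughly the class mix."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    fold_of = [None] * len(labels)
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across the folds.
        for position, idx in enumerate(indices):
            fold_of[idx] = position % n_folds
    return fold_of


# Toy labels (hypothetical): 10 samples of class 0 and 5 of class 1.
toy_labels = [0] * 10 + [1] * 5
fold_of = stratified_folds(toy_labels, n_folds=5)

for fold in range(5):
    val_idx = [i for i, f in enumerate(fold_of) if f == fold]
    train_idx = [i for i, f in enumerate(fold_of) if f != fold]
    print(fold, len(train_idx), len(val_idx))
```

For each fold, the index lists could select rows of the training DataFrame (for example with `train_df.iloc[train_idx]`), which would then be passed to `flow_from_dataframe` without `validation_split`.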

In [None]:
train_df.label = train_df.label.astype(str)


train_datagen = ImageDataGenerator(validation_split = 0.2,
                                    rotation_range = 45, 
                                    zoom_range = 0.2,
                                    horizontal_flip = True,
                                    vertical_flip = True,
                                    fill_mode = 'nearest',
                                    height_shift_range = 0.2,
                                    width_shift_range = 0.2,
                                  )

train_generator = train_datagen.flow_from_dataframe(train_df,
                         directory = '/kaggle/input/cassava-leaf-disease-classification/train_images/',
                         subset = "training",
                         x_col = "image_id",
                         y_col = "label",
                         target_size = TARGET_SIZE,
                         batch_size = BATCH_SIZE,
                         class_mode = "sparse",
                         seed = 2021,
                         shuffle= True)


validation_datagen = ImageDataGenerator(validation_split = 0.2) # no data augmentation on validation set

validation_generator = validation_datagen.flow_from_dataframe(train_df,
                         directory = '/kaggle/input/cassava-leaf-disease-classification/train_images/',
                         subset = "validation",
                         x_col = "image_id",
                         y_col = "label",
                         target_size = TARGET_SIZE,
                         batch_size = BATCH_SIZE,
                         class_mode = "sparse",
                         seed = 2021,
                         shuffle= True)

## Creating CNN

- load the pretrained weights --> EfficientNetB0 
- build the top Layers 

In [None]:
# I use this chunk to load my pretrained weights. To build the model for training I used the next chunk.
"""
weights_path = '/kaggle/input/efficientnetb0/efficientnetb0_notop.h5'

basemodel = EfficientNetB0(
    weights=weights_path, 
    include_top=False,
    input_shape=TARGET_SIZE+(3,))

headmodel = layers.GlobalAveragePooling2D()(basemodel.output)
headmodel = layers.Dense(5, activation="softmax")(headmodel) # 5 -> for five classes
model = keras.Model(inputs=basemodel.input, outputs=headmodel)

model.load_weights("../input/best-weights-efficient/best.h5")

model.compile(
    optimizer=keras.optimizers.Adam(1e-3),
    loss='sparse_categorical_crossentropy', # because not all pictures have the right label
    metrics=["accuracy"],
)
"""

In [None]:

weights_path = '/kaggle/input/efficientnetb0/efficientnetb0_notop.h5' 
# the pretrained weights must be uploaded, because we cannot use the Internet in the competition - if there is an Internet connection, use weights = 'imagenet'

basemodel = EfficientNetB0(
    weights=weights_path, #'imagenet'
    include_top=False, # we don't need the top layers, because we build these layers for our dataset
    input_shape=TARGET_SIZE+(3,))

headmodel = layers.GlobalAveragePooling2D()(basemodel.output)
headmodel = layers.Dense(5, activation="softmax")(headmodel)
model = keras.Model(inputs=basemodel.input, outputs=headmodel)


## Train the model  

Training the model is time-intensive. The process took hours, even with GPU usage. 
The biggest challenge is the notebook timeout, because it interrupts training.

In [None]:
# I used this code to train the model... 


model_save = ModelCheckpoint('./best_weights.h5', 
                             save_best_only = True, 
                             monitor = 'val_loss', 
                             mode = 'min', verbose = 1)
reduce_lr = ReduceLROnPlateau(monitor = 'val_loss', factor = 0.3, 
                              patience = 2, min_lr = 1e-6, 
                              mode = 'min', verbose = 1)
early_stop = EarlyStopping(monitor = 'val_loss', 
                           patience = 3, mode = 'min', verbose = 1,
                           restore_best_weights = True)


model.compile(
    optimizer=keras.optimizers.Adam(1e-3),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

history = model.fit(
    train_generator,
    steps_per_epoch = STEPS_PER_EPOCH,
    epochs = EPOCHS, 
    validation_data = validation_generator,
    validation_steps = VALIDATION_STEPS,
    callbacks = [model_save, early_stop, reduce_lr],
)

model.save("model.h5")



In [None]:
# to plot the model fit history 

def plot_history(history):
    loss_list = [s for s in history.history.keys() if 'loss' in s and 'val' not in s]
    val_loss_list = [s for s in history.history.keys() if 'loss' in s and 'val' in s]
    acc_list = [s for s in history.history.keys() if 'acc' in s and 'val' not in s]
    val_acc_list = [s for s in history.history.keys() if 'acc' in s and 'val' in s]
     
    if len(loss_list) == 0: 
        print('Loss is missing in history') 
        return 
     
    ## As loss always exists
    epochs = range(1,len(history.history[loss_list[0]]) + 1)
    
    ## Loss
    plt.figure(1)
    for l in loss_list: 
        plt.plot(epochs, history.history[l], 'b', label='Training loss (' + format(history.history[l][-1], '.5f') + ')')
    for l in val_loss_list:
        plt.plot(epochs, history.history[l], 'g', label='Validation loss (' + format(history.history[l][-1], '.5f') + ')')
    
    plt.title('Loss')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.legend()
    
    ## Accuracy
    plt.figure(2)
    for l in acc_list:
        plt.plot(epochs, history.history[l], 'b', label='Training accuracy (' + str(format(history.history[l][-1],'.5f'))+')')
    for l in val_acc_list:    
        plt.plot(epochs, history.history[l], 'g', label='Validation accuracy (' + str(format(history.history[l][-1],'.5f'))+')')

    plt.title('Accuracy')
    plt.xlabel('Epochs')
    plt.ylabel('Accuracy')
    plt.legend()
    plt.show()


plot_history(history)


## Creating a submission file
ready to submit to the competition!

In [None]:
ss = pd.read_csv(os.path.join('/kaggle/input/cassava-leaf-disease-classification', "sample_submission.csv"))
results = []

for image_id in ss.image_id:
    image = Image.open(os.path.join('/kaggle/input/cassava-leaf-disease-classification', "test_images", image_id))
    image = image.resize(TARGET_SIZE)
    image = np.expand_dims(image, axis = 0)
    # store the predicted class (index of the highest probability) of this image
    results.append(int(np.argmax(model.predict(image))))

ss['label'] = results
ss.to_csv('submission.csv', index = False)


## Path of research

I read a lot of papers and articles about image classification. Transfer learning is a popular approach.
There are a lot of networks, e.g. VGG16, ResNet50, EfficientNet, etc.

#### Some links:
https://www.nature.com/articles/s41598-020-59108-x  
https://towardsdatascience.com/a-comprehensive-hands-on-guide-to-transfer-learning-with-real-world-applications-in-deep-learning-212bf3b2f27a  
https://www.frontiersin.org/articles/10.3389/fpls.2016.01419/full  
https://www.mdpi.com/2223-7747/9/10/1319/htm  
https://thebinarynotes.com/transfer-learning-keras-vgg16/  

There are also a lot of public notebooks in this competition which inspired me.

## Challenges

- notebook runtime timeout  
- GPU hours (30h per week are a lot, but if there is a notebook timeout and training is interrupted, then the GPU time is wasted)  
- unbalanced dataset
- submission error (maybe because there is no free GPU left for now)

## Success  
- implement Transfer Learning  
- reached validation accuracy ~85%

## Next Steps

- implement cross-validation (for now the submission score is ~60, the validation accuracy ~85)
- fine tuning 