Hey Kagglers,

This is my first Kaggle competition, and I would like to share my approach to the Cassava Leaf Disease Classification competition.
My approach involves:
* Using the Keras library
* Using a GPU to train the model
* Using Albumentations to augment the dataset and reduce overfitting
* Using StratifiedKFold, as the dataset is skewed

This notebook is for beginners who would like to get started with this competition.

References:
1. Approaching (Almost) Any Machine Learning Problem - Book by Abhishek Thakur
2. https://www.kaggle.com/junyingsg/end-to-end-cassava-disease-classification-in-keras#Image-Augmentation-(Albumentations)
3. https://www.kaggle.com/tuckerarrants/cassava-tensorflow-starter-training

Please leave an upvote if you like this notebook.

# **Importing libraries**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os 
from sklearn.model_selection import StratifiedKFold
import seaborn as sns
import json
import cv2

import tensorflow as tf
from tensorflow.keras import models, layers,Sequential,regularizers
from tensorflow.keras.callbacks import ReduceLROnPlateau,EarlyStopping, ModelCheckpoint
from tensorflow.keras.layers import Input, Flatten, Dense, Conv2D, MaxPooling2D, GlobalAveragePooling2D, Dropout
from tensorflow.keras.models import Model
from tensorflow.keras.applications import Xception
from tensorflow.keras.losses import CategoricalCrossentropy 
from tensorflow.keras.mixed_precision import experimental as mixed_precision
policy = mixed_precision.Policy('mixed_float16')
mixed_precision.set_policy(policy)  # mixed precision can noticeably shorten training on recent GPUs (the speedup depends on the hardware)
import warnings
warnings.filterwarnings("ignore")

Cassava leaves are a rich source of protein, minerals, and vitamins. However, the presence of antinutrients and cyanogenic glucosides is the major drawback of cassava leaves and limits their human consumption. These antinutrients and toxic compounds can cause various diseases depending on the consumption level. For growers, though, viral diseases are a major source of poor yields. With the help of data science, it may be possible to identify common diseases so they can be treated.

![](https://www.rural21.com/fileadmin/_processed_/8/2/csm_Seite30_96bc157e8d.jpg)

In this competition, we are given a dataset of 21,367 labeled images collected during a regular survey in Uganda. Our task is to classify each cassava image into one of four disease categories or a fifth category indicating a healthy leaf.
The labels in the dataset are as follows:
* 0 - CBB - Cassava Bacterial Blight
* 1 - CBSD - Cassava Brown Streak Disease
* 2 - CGM - Cassava Green Mottle
* 3 - CMD - Cassava Mosaic Disease
* 4 - Healthy

# **Dataset and Kfold**

In [None]:
train_image_path="../input/cassava-leaf-disease-classification/train_images/"
train_df_path="../input/cassava-leaf-disease-classification/train.csv"

In [None]:
train_df=pd.read_csv(train_df_path)
train_df.head()

In [None]:
with open("../input/cassava-leaf-disease-classification/label_num_to_disease_map.json") as file:
    map_classes = json.loads(file.read())
    map_classes = {int(k) : v for k, v in map_classes.items()}

In [None]:
train_df["Class"]=train_df["label"].map(map_classes)
train_df.head()

In [None]:
sns.set(rc={'figure.figsize':(8,4)})
sns.set_style('whitegrid')

va=sns.countplot(y="Class",data=train_df,palette='rainbow')
# countplot(y="Class") draws horizontal bars: counts on the x-axis, classes on the y-axis
plt.xlabel("Count",fontsize=20)
plt.ylabel("Classes of leaves",fontsize=20)
plt.tight_layout()

In [None]:
def visualize(image_ids, labels):
    plt.figure(figsize=(16, 12))
    for ind, (image_id, label) in enumerate(zip(image_ids, labels)):
        plt.subplot(4, 4, ind + 1)
        image = cv2.imread(os.path.join("../input/cassava-leaf-disease-classification/train_images", image_id))
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        plt.imshow(image)
        plt.title(f"Class: {label}", fontsize=12)
        plt.axis("off")
    plt.tight_layout()  # lay out once, after all subplots are drawn
    plt.show()

In [None]:
train_df1=train_df.sample(8)
image_ids = train_df1["image_id"].values
labels = train_df1["Class"].values
visualize(image_ids, labels)

In [None]:
train_df.label.value_counts()

As you can see, the dataset is skewed: label 3, Cassava Mosaic Disease (CMD), has the most samples, and label 0, Cassava Bacterial Blight (CBB), has the fewest. To account for this, we use StratifiedKFold to split the dataset into training and validation folds, so that every fold keeps roughly the same class proportions and the validation scores are comparable across folds.

The general k-fold procedure is as follows:
1. Shuffle the dataset randomly.
2. Split the dataset into k groups.
3. For each unique group:
    * Take the group as a holdout or test data set
    * Take the remaining groups as a training data set
    * Fit a model on the training set and evaluate it on the test set
    * Retain the evaluation score and discard the model
4. Summarize the skill of the model using the sample of model evaluation scores.
    
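StratifiedKFold is a variation of KFold that preserves the percentage of samples of each class in every fold. It splits the data into n_splits parts and then uses each part once as the validation set. Note that scikit-learn's StratifiedKFold only shuffles the data when shuffle=True is passed (the default is shuffle=False), which is why the function below shuffles the DataFrame itself, once, before splitting.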
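
To see what stratification does in practice, here is a minimal sketch on toy labels. The label counts and variable names are made up for illustration and are not taken from the competition data. It assumes scikit-learn's `StratifiedKFold` with `shuffle=True` and a fixed `random_state`, and prints the class counts that land in each validation fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced labels (hypothetical): 10 samples of class 0,
# 20 samples of class 1 and 70 samples of class 3.
y = np.array([0] * 10 + [1] * 20 + [3] * 70)
X = np.zeros((len(y), 1))  # feature values do not influence the split

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_counts = []
for fold, (train_idx, valid_idx) in enumerate(skf.split(X, y)):
    # Count how many samples of each class ended up in this validation fold
    counts = np.bincount(y[valid_idx], minlength=4).tolist()
    fold_counts.append(counts)
    print(f"fold {fold}: validation class counts = {counts}")
```

Because every class count here is divisible by 5, each validation fold should receive the same 10/20/70 class mix as the full toy dataset. With the real, less tidy counts the folds match only approximately.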

![](https://miro.medium.com/max/703/0*QKJTHrcriSx2ZNYr.png)

In [None]:
def creat_nfold(train_df,n_split):
    train_df.loc[:,"kfold"]=-1
    train_df=train_df.sample(frac=1).reset_index(drop=True)
    SS=StratifiedKFold(n_splits=n_split)
    y=train_df.label.values
    for fold,(t_,v_) in enumerate((SS.split(X=train_df,y=y))):
        train_df.loc[v_,"kfold"]=fold
    return train_df

In [None]:
train_df.label=train_df.label.astype("str") 
# converting the label to str because Keras-style flow_from_dataframe with class_mode="categorical" expects string labels

In [None]:
train_df=creat_nfold(train_df,5)
train_df.head(5)


# **Augmentation**

Image augmentation is done with the Albumentations library. Keras's ImageDataGenerator does not take an Albumentations pipeline directly, so we will use a tool called ImageDataAugmentor (thanks to mjkvaak on GitHub), a Keras-style data generator that accepts one.

In [None]:
!pip install git+https://github.com/mjkvaak/ImageDataAugmentor

In [None]:
from ImageDataAugmentor.image_data_augmentor import *
import albumentations  # the pipelines below refer to the module by its full name

# augmentations referred from: https://www.kaggle.com/khyeh0719/pytorch-efficientnet-baseline-train-amp-aug
train_aug = albumentations.Compose([
            albumentations.RandomResizedCrop(300, 300),
            albumentations.Transpose(p=0.5),
            albumentations.HorizontalFlip(p=0.5),
            albumentations.VerticalFlip(p=0.5),
            albumentations.ShiftScaleRotate(p=0.5),
            albumentations.HueSaturationValue(
                hue_shift_limit=0.2, 
                sat_shift_limit=0.2, 
                val_shift_limit=0.2, 
                p=0.5
            ),
            albumentations.RandomBrightnessContrast(
                brightness_limit=(-0.2,0.2), 
                contrast_limit=(-0.2, 0.2), 
                p=0.5
            ),
            albumentations.Normalize(
                mean=[0.485, 0.456, 0.406], 
                std=[0.229, 0.224, 0.225], 
                max_pixel_value=255.0, 
                p=1.0
            ),
            albumentations.CoarseDropout(p=0.5),
            albumentations.Cutout(p=0.5),
            albumentations.ToFloat()
        ], p=1.)
  
        
valid_aug = albumentations.Compose([
            albumentations.CenterCrop(300, 300, p=1.),
            albumentations.Normalize(
                mean=[0.485, 0.456, 0.406], 
                std=[0.229, 0.224, 0.225], 
                max_pixel_value=255.0, 
                p=1.0
            ),
            albumentations.ToFloat()
        ], p=1.)




In [None]:
def train_generator(train,valid,batch_size,image_size):
    
        datagen_train = ImageDataAugmentor(augment=train_aug)
        datagen_val = ImageDataAugmentor(augment=valid_aug)
        train_generator = datagen_train.flow_from_dataframe(dataframe=train,
                                                    directory=train_image_path,
                                                    x_col="image_id",
                                                    y_col="label",
                                                    subset="training",
                                                    batch_size=batch_size,
                                                    seed=42,
                                                    shuffle=True,
                                                    class_mode="categorical",
                                                    target_size=(image_size,image_size))
        val_generator = datagen_val.flow_from_dataframe(dataframe=valid,
                                                    directory=train_image_path,
                                                    x_col="image_id",
                                                    y_col="label",
                                                    subset="training",
                                                    batch_size=batch_size,
                                                    seed=42,
                                                    shuffle=False,
                                                    class_mode="categorical",
                                                    target_size=(image_size,image_size))
        
        return train_generator, val_generator

# **Define the model**

We will use the Xception architecture, pretrained on ImageNet, as the backbone of the model. To read more about this architecture, refer to https://arxiv.org/abs/1610.02357

In [None]:
def make_model(IMG_SIZE):
    base_model = Xception(input_shape = (IMG_SIZE, IMG_SIZE, 3), include_top = False,
                             weights = 'imagenet')
    x = base_model.output
    x = GlobalAveragePooling2D()(x)
    x = Dropout(0.3)(x)
    predictions = Dense(5, activation='softmax',name='Final', dtype='float32')(x)

    model = Model(inputs=base_model.input, outputs=predictions)

    model.compile(optimizer =tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True),
                  loss = CategoricalCrossentropy(from_logits = False,  # the Final layer already applies softmax
                                                   label_smoothing=0.2,
                                                   name='categorical_crossentropy'),
                  metrics = ['categorical_accuracy']) 
    return model
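
The loss above uses `label_smoothing=0.2`. As a rough sketch of what that does to the targets, the snippet below applies the usual smoothing formula, `y * (1 - s) + s / num_classes`, to a one-hot label with NumPy. The helper names `smooth_labels` and `cross_entropy` and the probability values are hypothetical and only for illustration; this is not the Keras implementation itself.

```python
import numpy as np

def smooth_labels(one_hot, smoothing):
    """Label smoothing: y * (1 - s) + s / num_classes."""
    num_classes = one_hot.shape[-1]
    return one_hot * (1.0 - smoothing) + smoothing / num_classes

def cross_entropy(target, probs, eps=1e-7):
    """Mean categorical cross-entropy between targets and predicted probabilities."""
    probs = np.clip(probs, eps, 1.0)
    return float(-np.sum(target * np.log(probs), axis=-1).mean())

y_true = np.array([[0.0, 0.0, 0.0, 1.0, 0.0]])       # class 3 (CMD), one-hot
y_prob = np.array([[0.02, 0.03, 0.05, 0.85, 0.05]])  # made-up softmax output

y_smooth = smooth_labels(y_true, smoothing=0.2)
print("smoothed target:", y_smooth)
print("loss with hard target:    ", cross_entropy(y_true, y_prob))
print("loss with smoothed target:", cross_entropy(y_smooth, y_prob))
```

With five classes and a smoothing of 0.2, the true class gets a target of 0.84 and every other class 0.04, so the model is penalised for becoming overconfident on a single class.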

# **Training the model**

In [None]:
def run_train(df,batch_size,image_size,fold):
    
    train=df[df.kfold!=fold].reset_index(drop=True)
    valid=df[df.kfold==fold].reset_index(drop=True)
    
    train_gen,val_gen= train_generator(train,valid,batch_size,image_size)
    
    my_callbacks = [EarlyStopping(monitor = 'val_loss', min_delta = 0.001, 
                                  patience = 3, mode = 'min', verbose = 1,
                                  restore_best_weights = True),
                    ModelCheckpoint(filepath=f'model{fold}.h5', 
                                    save_best_only = True, 
                                    monitor = 'val_loss', 
                                    mode = 'min', verbose = 1),
                    ReduceLROnPlateau(monitor='val_loss',
                                      factor=0.1,
                                      patience=2, 
                                      min_lr=0.00001,
                                      mode='min',
                                      verbose=1)]
    
    steps_per_epoch = train_gen.n//train_gen.batch_size
    validation_steps = val_gen.n//val_gen.batch_size
    
    model=make_model(image_size)
    
    # model.fit accepts generators directly; fit_generator is deprecated in TF 2.x
    history = model.fit(train_gen,
                        steps_per_epoch=steps_per_epoch,
                        validation_steps=validation_steps,
                        validation_data = val_gen,
                        epochs = 10, 
                        callbacks =my_callbacks)
    return model, history,train_gen, val_gen
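
To get a feel for how the `ReduceLROnPlateau` settings above (`factor=0.1`, `patience=2`, `min_lr=0.00001`) change the learning rate, here is a simplified pure-Python simulation. It is a sketch of the callback's documented behaviour and not Keras's own code. The function name `simulate_reduce_lr` and the validation losses are made up, and `min_delta=1e-4` is assumed to be the callback's default threshold.

```python
def simulate_reduce_lr(val_losses, lr=0.01, factor=0.1, patience=2,
                       min_lr=1e-5, min_delta=1e-4):
    """Return the learning rate used in each epoch for a given val_loss history."""
    best = float("inf")
    wait = 0
    schedule = []
    for loss in val_losses:
        schedule.append(lr)            # learning rate in effect during this epoch
        if loss < best - min_delta:    # val_loss improved
            best = loss
            wait = 0
        else:                          # no improvement this epoch
            wait += 1
            if wait >= patience:       # plateau: cut the rate, but not below min_lr
                lr = max(lr * factor, min_lr)
                wait = 0
    return schedule

# Hypothetical validation losses for seven epochs
val_losses = [1.00, 0.90, 0.91, 0.92, 0.89, 0.90, 0.90]
schedule = simulate_reduce_lr(val_losses)
for epoch, (loss, lr) in enumerate(zip(val_losses, schedule), start=1):
    print(f"epoch {epoch}: val_loss={loss:.2f}, lr={lr:g}")
```

In this made-up run, epochs 3 and 4 bring no improvement, so the rate drops from 0.01 to 0.001 starting with epoch 5.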

In [None]:
oof_acc=[]

In [None]:
for i in range(5):
        print(25*"-")    
        print(f'{i}-fold training')
        print(25*"-")
        
        model,history,train_gen, val_gen = run_train(train_df,16,300,i)

        train_acc = history.history['categorical_accuracy']
        val_acc = history.history['val_categorical_accuracy']
        loss = history.history['loss']
        val_loss = history.history['val_loss']
        
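        # keep one score per fold: the validation accuracy at the epoch with the lowest val_loss
        oof_acc.append(val_acc[int(np.argmin(val_loss))])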
        
        epochs = range(1, len(train_acc) + 1)
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 10))

        ax1.plot(epochs , train_acc , 'go-' , label = 'Training Accuracy')
        ax1.plot(epochs , val_acc , 'ro-' , label = 'Validation Accuracy')
        ax1.set_title('Training & Validation Accuracy')
        ax1.legend()
        ax1.set_xlabel("Epochs")
        ax1.set_ylabel("Accuracy")

        ax2.plot(epochs , loss , 'g-o' , label = 'Training Loss')
        ax2.plot(epochs , val_loss , 'r-o' , label = 'Validation Loss')
        ax2.set_title('Training & Validation Loss')
        ax2.legend()
        ax2.set_xlabel("Epochs")
        ax2.set_ylabel("Loss")
       
        fig.tight_layout()
        plt.show()

In [None]:
print(np.mean(oof_acc))


In [None]:
from sklearn.metrics import confusion_matrix, classification_report
        
# Note: this uses the model and validation generator left over from the last fold only
pred = model.predict(val_gen) # Gives class probabilities
pred = np.argmax(pred, axis = 1) # Gives class labels (no rounding first, so every image gets its most likely class)

# Obtain actual labels
actual = val_gen.classes
    
# Now plot matrix
sns.set(rc={'figure.figsize':(10,10)})
sns.set_style('whitegrid')
cm = confusion_matrix(actual, pred, labels = [0,1,2,3,4])
sns.heatmap(
    cm, 
    cmap="Blues",
    annot = True, 
    fmt = "d"
)
plt.title("Confusion Matrix", fontsize=12)
plt.show()


In [None]:
print(classification_report(actual,pred))
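
When turning class probabilities into labels for the confusion matrix, it matters whether the probabilities are rounded before `np.argmax`. The small sketch below uses made-up softmax outputs (the `probs` values are hypothetical) to compare both options.

```python
import numpy as np

# Made-up softmax outputs for three images (each row sums to 1)
probs = np.array([
    [0.05, 0.05, 0.10, 0.70, 0.10],  # confident: class 3
    [0.10, 0.15, 0.30, 0.40, 0.05],  # class 3 is most likely, but below 0.5
    [0.05, 0.45, 0.10, 0.35, 0.05],  # class 1 is most likely, but below 0.5
])

# Option 1: take the most likely class directly
labels_argmax = np.argmax(probs, axis=1)

# Option 2: round to 0/1 first, then take argmax
labels_round_then_argmax = np.argmax(np.round(probs), axis=1)

print("argmax only:      ", labels_argmax.tolist())
print("round then argmax:", labels_round_then_argmax.tolist())
```

Whenever no class reaches 0.5, rounding produces a row of zeros and `argmax` falls back to index 0, so uncertain images get counted as class 0 (CBB). Taking `argmax` on the raw probabilities avoids this.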

**Work in progress**

# **Improvements**

Some improvements that could possibly be made:

1. More image augmentation (e.g. CutMix)
2. Different learning rate and learning rate schedule
3. Using TPU to decrease the training time
4. Increased input size
5. Add more dense layers and regularization
6. Using other architectures such as EfficientNet
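
As an illustration of the first idea, here is a minimal NumPy sketch of CutMix on a batch of images with one-hot labels. It is not part of this notebook's pipeline. The function name `cutmix`, the `alpha` value, and the toy batch are assumptions for illustration, and wiring it into the data generator is left out.

```python
import numpy as np

def cutmix(images, labels, alpha=1.0, rng=None):
    """Paste a random box from shuffled images into each image and mix the labels."""
    rng = np.random.default_rng() if rng is None else rng
    batch, height, width, _ = images.shape

    lam = rng.beta(alpha, alpha)       # target share of the original image
    perm = rng.permutation(batch)      # partner image for every sample

    # Box with area ratio (1 - lam), centred at a random point and clipped to the image
    cut_h = int(height * np.sqrt(1.0 - lam))
    cut_w = int(width * np.sqrt(1.0 - lam))
    cy, cx = rng.integers(0, height), rng.integers(0, width)
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, height)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, width)

    mixed = images.copy()
    mixed[:, y1:y2, x1:x2, :] = images[perm, y1:y2, x1:x2, :]

    # Recompute the mixing weight from the box that was actually pasted
    lam_adj = 1.0 - ((y2 - y1) * (x2 - x1)) / (height * width)
    mixed_labels = lam_adj * labels + (1.0 - lam_adj) * labels[perm]
    return mixed, mixed_labels

# Toy batch: 4 random 32x32 RGB images with one-hot labels for 5 classes
rng = np.random.default_rng(0)
images = rng.random((4, 32, 32, 3)).astype("float32")
labels = np.eye(5, dtype="float32")[[0, 3, 3, 4]]
mixed_images, mixed_labels = cutmix(images, labels, rng=rng)
print(mixed_images.shape, mixed_labels.sum(axis=1))
```

The mixed labels are soft targets that still sum to 1 per image, so they can be used with a categorical cross-entropy loss.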

If this notebook helped you, please leave an upvote!