# Cassava Leaf Disease Classification - Competition Introduction, EDA & Simple Model 



### Competition Summary
The objective of this competition is to identify the type of disease present in an image of a cassava leaf. There are 4 disease classes and 1 healthy class, for a total of 5 label classes across 21,397 training images. According to the competition description, farmers often take pictures with their phones; however, we don't know whether these images were taken with farmers' phones. Finally, we expect ~15,000 images in the test set.


### Notebook Summary Findings
- Training data is imbalanced, with Cassava Mosaic Disease (CMD) being the most common disease present
- Images range from close-up pictures of leaves to those with many leaves, and several include extra content (e.g. sky, houses, hands)
- Cassava Brown Streak Disease (CBSD) can be asymptomatic in the leaves, with the disease present only in the roots.
- Several diseases show a variety of symptoms that do not always manifest in the same way.
- Cassava Mosaic Disease (CMD) symptom intensity varies with onset time
- The Cassava Green Mottle (CGM) spectral profile in red and green is distinct from other classes.

### Contents
1. [Data Overview](#balance-data)
2. [Disease Types](#disease-types)
   - 2a. [Healthy](#healthy-leaf)
   - 2b. [Cassava Bacterial Blight (CBB)](#cassava-bacterial-blight)
   - 2c. [Cassava Brown Streak Disease (CBSD)](#cassava-brown-streak-disease)
   - 2d. [Cassava Green Mottle (CGM)](#cassava-green-mottle)
   - 2e. [Cassava Mosaic Disease (CMD)](#caccava-mosaic-disease)
3. [Model Building & Training](#modelling)


### References
1. [Cassava Leaf Diseases: Overview](https://www.kaggle.com/c/cassava-leaf-disease-classification/discussion/198143)

In [None]:
import os
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import StratifiedKFold
from imblearn.over_sampling import RandomOverSampler

import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.applications import DenseNet201

import cv2

<a id="balance-data"></a>
# Data Overview

In [None]:
PATH_IMGS = "../input/cassava-leaf-disease-classification/train_images"
train_df = pd.read_csv("../input/cassava-leaf-disease-classification/train.csv")

with open("../input/cassava-leaf-disease-classification/label_num_to_disease_map.json") as f:
    # The with-block closes the file automatically, so no explicit close() is needed
    class_names = json.load(f)

train_df["label_name"] = train_df['label'].apply(lambda x: class_names[str(x)])
train_df.label = train_df.label.astype(str)

print("Total training samples: ", len(train_df))
train_df.head(10)

In [None]:
fig = plt.figure()
train_df.groupby("label_name")["label_name"].count().plot(kind= "bar", title= "Frequency of image label classes")

We see this is an imbalanced dataset. What seems interesting is that it is not the healthy class that is dominant.
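
To put numbers on the imbalance, the counts can be turned into shares. The following is a minimal sketch, assuming a DataFrame with a `label_name` column like `train_df` above; the helper name `class_distribution` and the toy data are made up for illustration and are not the real class counts.

```python
import pandas as pd


def class_distribution(df, col="label_name"):
    """Return per-class counts and shares, most frequent class first."""
    counts = df[col].value_counts()  # sorted in descending order by default
    return pd.DataFrame({"count": counts, "share": counts / counts.sum()})


# Toy stand-in for train_df (hypothetical values, not the real class counts)
toy_df = pd.DataFrame({"label_name": ["CMD"] * 6 + ["Healthy"] * 3 + ["CBB"]})
dist = class_distribution(toy_df)
print(dist)
```

Calling `class_distribution(train_df)` should give the same information as the bar chart, in table form.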

Since the images may not have been taken in a standard way, it is worth checking whether the image resolution is standardized. Below we see that all images have a standard shape of 600 by 800 pixels with 3 channels.

In [None]:
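# Read every training image and record its (height, width, channels) shape
img_shapes = []
for image_id in train_df.image_id:
    img = cv2.imread(os.path.join(PATH_IMGS, image_id))
    if img is None:  # cv2.imread returns None when a file cannot be read
        continue
    img_shapes.append(img.shape)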

print("Unique image shapes in training set:", set(img_shapes))

<a id="disease-types"></a>
# Cassava Disease Profiles
Each disease profile includes a description of characteristics (primary symptoms, additional symptoms, and description), a collection of images of the disease, and a color density chart of a random sample of 500 images from the respective class.

<a id="healthy-leaf"></a>
## Healthy Leaf
A healthy cassava plant is a perennial shrub characterized by brown/purple woody stems and lobed leaves reaching up to 30 cm in length. Originally from South America, the plant can reach 4 meters in height and is usually harvested 9-12 months after planting.

Below we see a sample of images and the color density of healthy leaves. Note that even healthy leaves can show defects, as in the top left image.


In [None]:
def load_image(image_id):
    image = cv2.imread(os.path.join(PATH_IMGS, image_id))
    return cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

def view_imgs(df, seed=99):
    
    fig = plt.figure(figsize=(18, 10))
    samples = df.sample(6, random_state=seed)

    for idx, (_, row) in enumerate(samples.iterrows()):

        plt.subplot(2, 3, idx+1)  # 2 x 3 grid for the 6 sampled images
        
        image = load_image(row.image_id)

        plt.imshow(image)
        plt.title(f"Class Label: {row.label_name}", fontsize=12)
        plt.axis("off")

def plot_images_histogram(imgs, name):
    fig = plt.figure(figsize=(5,4))
    # Mean pixel value at each position across the sample, per channel
    # (assumes all images have the same shape)
    red_values = np.stack([img[:, :, 0].ravel() for img in imgs]).mean(axis=0)
    green_values = np.stack([img[:, :, 1].ravel() for img in imgs]).mean(axis=0)
    blue_values = np.stack([img[:, :, 2].ravel() for img in imgs]).mean(axis=0)
    
    sns.kdeplot(red_values, alpha=0.5, color='red')
    sns.kdeplot(green_values, alpha=0.5, color='green')
    sns.kdeplot(blue_values, alpha=0.5, color='blue')
    plt.title(f"Color density: {name}")
    plt.show()        

    
healthy = train_df[train_df.label=="4"]
view_imgs(healthy)

imgs = [load_image(img) for img in healthy.sample(500, random_state=99).image_id.values]
plot_images_histogram(imgs, healthy.label_name.unique()[0])
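
`plot_images_histogram` averages pixel values position by position across the sample and then plots the density of those averages. Another way to read "average histogram" is to build one histogram per image and channel and then average the histograms. The sketch below shows that variant on random stand-in images; `mean_channel_histograms` is a hypothetical helper and the bin count is an arbitrary choice, neither is part of the notebook.

```python
import numpy as np


def mean_channel_histograms(imgs, bins=32):
    """Average the normalised per-image histogram of each RGB channel.

    imgs: iterable of uint8 arrays shaped (height, width, 3).
    Returns an array shaped (3, bins); each row sums to 1.
    """
    edges = np.linspace(0, 256, bins + 1)
    hists = []
    for img in imgs:
        per_channel = []
        for c in range(3):
            counts, _ = np.histogram(img[:, :, c], bins=edges)
            per_channel.append(counts / counts.sum())
        hists.append(per_channel)
    return np.asarray(hists).mean(axis=0)


# Random stand-in images (assumption: real images would come from load_image)
rng = np.random.default_rng(0)
fake_imgs = [rng.integers(0, 256, size=(60, 80, 3), dtype=np.uint8) for _ in range(5)]
hist = mean_channel_histograms(fake_imgs)
print(hist.shape)
```

Because each histogram is normalised before averaging, this variant would also work for images of different sizes.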

<a id="cassava-bacterial-blight"></a>
## Cassava Bacterial Blight (CBB)

- **Primary symptoms:** 
    - blight, wilting, dieback, and vascular necrosis
- **Additional symptoms:**
    - Pools of extruded gum along cuts in the stem and leaf cross veins
    - Gum forms as both golden liquid and hardened amber deposits
- **Description:**
    - Visible angular necrotic spotting forms on the leaves with a chlorotic ring encircling the spots
    - Spots range from moist brown lesions restricted to the bottom of the plant to lesions covering the whole plant and killing entire leaves
[Source](https://en.wikipedia.org/wiki/Bacterial_blight_of_cassava)

In [None]:
cbb = train_df[train_df.label=="0"]
view_imgs(cbb)

imgs = [load_image(img) for img in cbb.sample(500, random_state=99).image_id.values]
plot_images_histogram(imgs, cbb.label_name.unique()[0])

<a id="cassava-brown-streak-disease"></a>
## Cassava Brown Streak Disease (CBSD) 


- **Primary symptoms:** 
    - Severe chlorosis and necrosis on infected leaves, with a yellowish, mottled appearance
- **Additional symptoms:**
    - Brown streaks on stems of plant
    - Dry brown-black necrotic rot of the cassava tuber ranging from a small lesion to the whole root
    - Root constriction from tuber rot and stunted growth
- **Description:**
    - Chlorosis may be associated with the veins, spanning the mid vein and the secondary and tertiary veins, or appear in blotches unconnected to veins
    - Leaf symptoms vary greatly depending on a variety of factors including growing conditions, plant age, and virus species
    - Only severely affected plants show all symptoms; others may show just some of them
    - Plant leaves may be asymptomatic, with disease only affecting the tubers. [Source](https://en.wikipedia.org/wiki/Cassava_brown_streak_virus_disease)

In [None]:
cbsd = train_df[train_df.label=="1"]
view_imgs(cbsd)

imgs = [load_image(img) for img in cbsd.sample(500, random_state=99).image_id.values]
plot_images_histogram(imgs, cbsd.label_name.unique()[0])

<a id="cassava-green-mottle"></a>
## Cassava Green Mottle (CGM)
- **Primary symptoms:** 
    - Leaves puckered with yellow spots (can be faint or distinct), green mosaic patterns, and twisted margins.
    - Shoots usually appear healthy
- **Additional symptoms:**
    - Plant severely stunted
    - Edible roots absent or small and woody
- **Description:**
    - Yellow patterns on leaves range from small dots to irregular yellow and green patches
    - Leaf margins are also often distorted [Source](https://www.pestnet.org/fact_sheets/cassava_green_mottle_068.htm)

In [None]:
cgm = train_df[train_df.label=="2"]
view_imgs(cgm)

imgs = [load_image(img) for img in cgm.sample(500, random_state=99).image_id.values]
plot_images_histogram(imgs, cgm.label_name.unique()[0])

<a id="caccava-mosaic-disease"></a>
## Cassava Mosaic Disease (CMD)
- **Primary symptoms:** 
    - Chlorotic mosaic of the leaves, leaf distortion, and stunted growth
- **Additional symptoms:**
    - Leaf stalks have characteristic S-shape
- **Description:**
    - Rapid symptom onset correlates with plant recovery
    - Slow development of disease correlates with plant death [Source](https://en.wikipedia.org/wiki/Cassava_mosaic_virus)

In [None]:
cmd = train_df[train_df.label=="3"]
view_imgs(cmd)

imgs = [load_image(img) for img in cmd.sample(500, random_state=99).image_id.values]
plot_images_histogram(imgs, cmd.label_name.unique()[0])

<a id="modelling"></a>
# Modelling

As a baseline model we will generate a balanced dataset and use DenseNet201 as the foundation. There appears to be a lot of natural variation in the images, so no image augmentation is added. However, based on the color densities and variety of image types, we could consider adding augmentation in the form of zoom, vertical flip, and color filters for browns and reds.
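
If augmentation were added later, the transformations mentioned above could look roughly like this. It is a NumPy-only sketch on a stand-in array, not the pipeline used in this notebook; the function name `augment`, the zoom factor, and the shift range are assumptions chosen for illustration.

```python
import numpy as np


def augment(image, rng, zoom=1.25, max_shift=20):
    """Apply a random vertical flip, a central zoom, and a per-channel shift.

    image: uint8 array shaped (height, width, 3).
    """
    out = image
    # Vertical flip (top <-> bottom) with probability 0.5
    if rng.random() < 0.5:
        out = out[::-1, :, :]

    # Central zoom: crop the middle 1/zoom of the image, then stretch it back
    # to the original size with nearest-neighbour sampling
    h, w = out.shape[:2]
    ch, cw = int(h / zoom), int(w / zoom)
    top, left = (h - ch) // 2, (w - cw) // 2
    rows = top + (np.arange(h) * ch // h)
    cols = left + (np.arange(w) * cw // w)
    out = out[rows][:, cols]

    # Colour shift: one random offset per channel, e.g. to vary reds and browns
    shift = rng.integers(-max_shift, max_shift + 1, size=3)
    return np.clip(out.astype(np.int16) + shift, 0, 255).astype(np.uint8)


# Random stand-in image (assumption: a real one would come from load_image)
rng = np.random.default_rng(99)
image = rng.integers(0, 256, size=(60, 80, 3), dtype=np.uint8)
augmented = augment(image, rng)
print(augmented.shape, augmented.dtype)
```

With the Keras `ImageDataGenerator`, the corresponding constructor arguments are `zoom_range`, `vertical_flip`, and `channel_shift_range`.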

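Note: I've used ROC AUC as a metric because accuracy appears to be reported incorrectly. A likely cause is that `tf.keras.metrics.Accuracy` checks predictions for exact equality with the labels, which softmax outputs rarely satisfy against one-hot labels; `tf.keras.metrics.CategoricalAccuracy` compares the argmax of each instead. Suggestions welcome :)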
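
The difference between exact-match accuracy and argmax-based accuracy can be reproduced with NumPy on made-up values. This illustrates one plausible cause of the accuracy problem noted above, not a confirmed diagnosis; the labels and predictions are invented for the example.

```python
import numpy as np

# Made-up one-hot labels and softmax-like predictions for 3 samples, 5 classes
y_true = np.array([[0, 0, 0, 1, 0],
                   [1, 0, 0, 0, 0],
                   [0, 0, 1, 0, 0]], dtype=float)
y_pred = np.array([[0.05, 0.05, 0.10, 0.70, 0.10],
                   [0.60, 0.10, 0.10, 0.10, 0.10],
                   [0.10, 0.50, 0.20, 0.10, 0.10]])

# Element-wise equality: what an exact-match accuracy measures
exact_match_acc = np.mean(y_true == y_pred)

# Argmax comparison: what categorical accuracy measures
categorical_acc = np.mean(y_true.argmax(axis=1) == y_pred.argmax(axis=1))

print(exact_match_acc)   # 0.0, no probability equals 0 or 1 exactly
print(categorical_acc)   # 2 of the 3 samples have the correct argmax
```

The exact-match value stays at zero however good the predictions are, which would look like a broken accuracy metric during training.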

### Steps:
1. Random oversampling of dataset
2. Add folds with Stratified KFold
3. Initialize image data generator and train model
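
One caveat with this order of steps: oversampling before splitting means copies of the same image can land in both the training and validation folds, which can inflate validation scores. A possible alternative, sketched below under the assumption that fold ids are assigned first, is to oversample only the training rows of each fold. The helper `oversample_train_only` and the toy data are hypothetical, and the sketch uses pandas sampling rather than `RandomOverSampler`.

```python
import pandas as pd


def oversample_train_only(df, fold, label_col="label", fold_col="skfold", seed=42):
    """Split on a precomputed fold column, then balance only the training part."""
    trn_df = df[df[fold_col] != fold]
    val_df = df[df[fold_col] == fold]

    # Keep every original row and add random duplicates until each class
    # reaches the size of the largest class
    target = trn_df[label_col].value_counts().max()
    parts = []
    for _, group in trn_df.groupby(label_col):
        parts.append(group)
        n_extra = target - len(group)
        if n_extra > 0:
            parts.append(group.sample(n=n_extra, replace=True, random_state=seed))
    balanced = pd.concat(parts, ignore_index=True)
    return balanced, val_df.reset_index(drop=True)


# Toy data: 8 images of class "3", 4 of class "4", folds assigned round-robin
toy = pd.DataFrame({
    "image_id": [f"img_{i}.jpg" for i in range(12)],
    "label": ["3"] * 8 + ["4"] * 4,
    "skfold": [i % 4 for i in range(12)],
})
trn_bal, val = oversample_train_only(toy, fold=0)
print(trn_bal.label.value_counts().to_dict())
```

With this ordering the validation fold keeps the original class distribution, so validation metrics reflect the imbalance the model will face at test time.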

In [None]:
def balance_set(df, x_cols, y_cols):
    ros = RandomOverSampler(random_state=42)

    x_multi, y_multi = ros.fit_resample(df[x_cols], df[y_cols].values)
    data = pd.concat([x_multi, pd.DataFrame(y_multi, columns= y_cols)], axis=1)
    return data

train_df_bal = balance_set(train_df, train_df.drop("label", axis=1).columns, ["label"])

skf = StratifiedKFold(n_splits=5)
for fold, (trn, val) in enumerate(skf.split(train_df_bal, train_df_bal.label)):
    train_df_bal.loc[val, "skfold"] = fold
    
train_df_bal.skfold = train_df_bal.skfold.astype(int)

train_df_bal[train_df_bal.skfold == 0].groupby("label")["image_id"].count().plot(kind='bar', title="Class distribution after balancing")

In [None]:
def make_image_gen(df, fold):

    img_gen = ImageDataGenerator(rescale=1./255)

    trn_df = df[df.skfold != fold]
    val_df = df[df.skfold == fold]
    
    print("Train length: ", trn_df.shape[0])
    print("Val length: ", val_df.shape[0])
    
    train_gen = img_gen.flow_from_dataframe(trn_df,
                                            directory=PATH_IMGS,
                                            x_col = "image_id",
                                            y_col = "label",
                                            target_size = (200,200),
                                            class_mode="categorical",
                                            batch_size=64,
                                            shuffle=True,
                                            seed=99)

    val_gen = img_gen.flow_from_dataframe(val_df,
                                          directory=PATH_IMGS,
                                          x_col = "image_id",
                                          y_col = "label",
                                          target_size = (200,200),
                                          class_mode="categorical",
                                          batch_size=64,
                                          shuffle=True,
                                          seed=99)
    return train_gen, val_gen

In [None]:
def make_model(trainable_weights=False):
    tf.keras.backend.clear_session()

    densenet = DenseNet201(input_shape = (200, 200, 3), weights="imagenet", include_top=False)

    N_CLASSES = len(train_df.label.unique())

    for layer in densenet.layers:
        layer.trainable=trainable_weights
    
    model = tf.keras.models.Sequential([
        densenet,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(N_CLASSES, activation='softmax')
    ])

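    # CategoricalAccuracy compares the argmax of one-hot labels and softmax outputs;
    # plain Accuracy tests for exact equality and is not meaningful here
    accuracy = tf.keras.metrics.CategoricalAccuracy(name="accuracy")
    roc_auc = tf.keras.metrics.AUC()

    optimizer = tf.keras.optimizers.Adam(learning_rate=0.0005)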
    
    model.compile(optimizer=optimizer, loss="categorical_crossentropy", metrics=[accuracy, roc_auc])

    return model

print("Baseline Model Summary")
make_model(trainable_weights=False).summary()

In [None]:
TRAIN_ALL_FOLDS = True
if not TRAIN_ALL_FOLDS:
    folds = [0]
else:
    folds = train_df_bal.skfold.unique()

print("Folds: ", folds)
train_results = dict()

for fold in folds:
    
    print(f"Fold {fold}")
    train_gen, val_gen = make_image_gen(train_df_bal, fold)
    
    model = make_model(trainable_weights=False)

    print("Model training")
    # model.fit accepts generators directly; fit_generator is deprecated
    history = model.fit(train_gen,
                        validation_data=val_gen,
                        epochs=5,
                        steps_per_epoch=len(train_gen),
                        validation_steps=len(val_gen))
    
    train_results[f"fold_{fold}"] = history.history
    
    model.save(f"./densenet_baseline_fold_{fold}.h5")
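
Once all folds have finished, `train_results` can be condensed into one score per metric. The following is a minimal sketch, assuming each entry holds a Keras-style history dict of per-epoch lists; the helper `summarise_folds`, the key `val_auc`, and the numbers are placeholders, since the actual key depends on the name Keras assigns to the metric.

```python
import numpy as np


def summarise_folds(results, metric="val_auc"):
    """Mean and standard deviation of the last-epoch value of `metric` across folds."""
    finals = [history[metric][-1] for history in results.values()]
    return float(np.mean(finals)), float(np.std(finals))


# Placeholder histories (made-up numbers, not results from this notebook)
fake_results = {
    "fold_0": {"val_auc": [0.80, 0.85, 0.90]},
    "fold_1": {"val_auc": [0.78, 0.84, 0.88]},
}
mean_auc, std_auc = summarise_folds(fake_results)
print(f"val_auc: {mean_auc:.3f} +/- {std_auc:.3f}")
```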

### Work in progress...