# <center>Cassava Leaf Disease Classification</center>
# ![image](https://i.ytimg.com/vi/VGCHcgmZu24/maxresdefault.jpg)

In this notebook, I build a model from scratch with the Keras framework and analyse how augmentation, different input sizes, and image preprocessing techniques affect the accuracy of the model. As a starting point, I have chosen the architecture described in the paper **"A predictive machine learning application in agriculture: Cassava disease detection and classification with imbalanced dataset using convolutional neural networks"**, published in 2020. With hyperparameter tuning and preprocessing, this architecture was able to achieve 93% accuracy in the Cassava 2019 CVPR challenge.

## Contents:

- [About competition](#s1)
- [Importing necessary Libraries](#s2)
- [Previous Experimentation Results](#s9)
- [Dataset Description](#s3)
 - [Class Distribution](#ss31)
 - [Sample Images](#s32)
- [Necessary Functions](#ss11)
 - [Random Cropping](#sss111)
 - [CLAHE](#sss112)
- [Train-Valid Split](#s4)
- [Model Architecture](#s5)
- [Training](#s6)
- [Model Visualization](#s7)
- [Prediction](#s8)
- [Next Steps](#s10)

<a id="s1"> </a>
## About Competition

Cassava, or Manihot esculenta, belongs to the family Euphorbiaceae and is cultivated in tropical and subtropical regions for its edible starchy tuberous root, which is commonly dried into a powder and named tapioca.

As the second-largest provider of carbohydrates in Africa, cassava is a key food security crop grown by smallholder farmers because it can withstand harsh conditions. At least 80% of household farms in Sub-Saharan Africa grow this starchy root, but viral diseases are major sources of poor yields. With the help of data science, it may be possible to identify common diseases so they can be treated.

Existing methods of disease detection require farmers to solicit the help of government-funded agricultural experts to visually inspect and diagnose the plants. This suffers from being labor-intensive, low-supply and costly. As an added challenge, effective solutions for farmers must perform well under significant constraints, since African farmers may only have access to mobile-quality cameras with low-bandwidth.

In this competition, we introduce a dataset of 21,367 labeled images collected during a regular survey in Uganda. Most images were crowdsourced from farmers taking photos of their gardens, and annotated by experts at the National Crops Resources Research Institute (NaCRRI) in collaboration with the AI lab at Makerere University, Kampala. This is in a format that most realistically represents what farmers would need to diagnose in real life.

### HEALTH BENEFITS

Tapioca has been associated with some health benefits, such as **healthy weight gain, increased red blood cell count, improved digestion, preventing diabetes, protecting bone mineral density, preventing Alzheimer’s disease and maintaining fluid balance within the body.**

### EVALUATION
**$$Accuracy=\frac{TP + TN}{TP + FP + TN + FN}$$**

where,  
 - TP: True Positive
 - FP: False Positive
 - TN: True Negative
 - FN: False Negative
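
For this five-class problem the same ratio reduces to the share of images whose predicted label matches the true label. A minimal sketch of that calculation follows; the `accuracy` helper and the six labels are made up for illustration and are not part of the competition code.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one label is required")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels for six images (classes 0-4)
y_true = [3, 3, 1, 4, 0, 2]
y_pred = [3, 2, 1, 4, 3, 2]
score = accuracy(y_true, y_pred)
print(f"accuracy = {score:.3f}")
```

Four of the six hypothetical predictions are correct, so the score is 4/6.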

<a id ='s2'></a>
## Importing Necessary Libraries

In [None]:
#numpy - for necessary matrix operations
import numpy as np

#pandas to work with dataframes
import pandas as pd

#keras - Model Building
from keras.preprocessing.image import ImageDataGenerator
from keras.layers import Conv2D,BatchNormalization,MaxPool2D,Dense,Flatten,Dropout
from keras.models import Sequential,load_model,Model
from keras.callbacks import ModelCheckpoint,EarlyStopping,ReduceLROnPlateau
from keras.optimizers import Adam

#cv2 - for image operations
import cv2
#reading disease names from provided json file
import json

import time

#visualizations
import plotly.graph_objs as go
import matplotlib.pyplot as plt
import seaborn as sns
from plotly.subplots import make_subplots
from sklearn.manifold import TSNE
from tqdm.notebook import tqdm

#oversampling technique
from imblearn.over_sampling import SMOTE

#train-test split
from sklearn.model_selection import train_test_split

<a id='s9'></a>
## Previous Experimentation Results

In [None]:
fig = go.Figure()
t1 = go.Bar(x=['Without Augmentation','With Augmentation','Augmentation and CLAHE'],y=[0.687,0.683,0.332],text = [0.687,0.683,0.332],textposition='auto')
fig.add_trace(t1)
fig.update_xaxes(title_text="Experimentations")
fig.update_yaxes(title_text="Score")
fig.update_layout(title='Public Score')
fig.show()

<a id='s3'></a>
## Dataset Description

In [None]:
#reading training data
train_df = pd.read_csv('../input/cassava-leaf-disease-classification/train.csv')
#mapping the class labels in the json file to their respective disease names
with open('../input/cassava-leaf-disease-classification/label_num_to_disease_map.json') as f:
    disease_names = json.load(f)

#parse through every label value and identify the disease name based on label number from json file
train_df['disease_name'] = train_df['label'].apply(lambda x: disease_names[str(x)])
#visualize the top five rows from table
train_df.head()

<a id='ss31'></a>
### Class Labels Distribution

In [None]:
fig = make_subplots(rows=1, cols=2,
            specs=[[{"type": "xy"}, {"type": "domain"}]],)
# value_counts: to count number of images in each class with respect to disease_name column
# Bar plot 
t1 = go.Bar(x=train_df['disease_name'].value_counts().index, 
            y=train_df['disease_name'].value_counts().values,
            text=train_df['disease_name'].value_counts().values,
            textposition='auto',name='Count',
           marker_color='indianred')
#Pie chart with labels and counts
t2 = go.Pie(labels=train_df['disease_name'].value_counts().index,
           values=train_df['disease_name'].value_counts().values,
           hole=0.3)
fig.add_trace(t1,row=1, col=1)
fig.add_trace(t2,row=1, col=2)
fig.update_layout(title='Distribution of Class Labels')
fig.show()

<a id="s32"></a>
### Sample Images from each class

In [None]:
#random seed is used to replicate the same images in every run
np.random.seed(2020)
#plotting 5 random samples for each class with image name and disease name as title
for class_name in train_df['disease_name'].unique():
    plt.figure(figsize=(20,50))
    for idx,img_name in enumerate(np.random.choice(train_df[train_df['disease_name'] == class_name]['image_id'].values,
                                                   size=5,replace=False)):
        plt.subplot(1,5,idx+1)
        #reading the image and converting BGR color space to RGB
        img = cv2.cvtColor(cv2.imread('../input/cassava-leaf-disease-classification/train_images/'+img_name), cv2.COLOR_BGR2RGB)
        plt.imshow(img)
        plt.axis('off')
        plt.title(r"$\bf{"+class_name + "}$"+'\n'+img_name )
    plt.show()

<a id='ss11'></a>
## Necessary Functions

<a id='sss111'></a>
### Random Cropping

In [None]:
def random_crop(img, random_crop_size):
    # Note: image_data_format is 'channels_last'
    assert img.shape[2] == 3
    height, width = img.shape[0], img.shape[1]
    dy, dx = random_crop_size
    x = np.random.randint(0, width - dx + 1)
    y = np.random.randint(0, height - dy + 1)
    return img[y:(y+dy), x:(x+dx), :]


def crop_generator(batches, crop_length):
    """Take as input a Keras ImageGen (Iterator) and generate random
    crops from the image batches generated by the original iterator.
    """
    while True:
        batch_x, batch_y = next(batches)
        batch_crops = np.zeros((batch_x.shape[0], crop_length, crop_length, 3))
        for i in range(batch_x.shape[0]):
            batch_crops[i] = random_crop(batch_x[i], (crop_length, crop_length))
        yield batch_crops, batch_y

<a id='sss112'></a>
### CLAHE: Pre-processing

**Contrast Limited AHE:**

Adaptive histogram equalization (AHE) is a computer image processing technique used to improve contrast in images. It differs from ordinary histogram equalization in the respect that the adaptive method computes several histograms, each corresponding to a distinct section of the image, and uses them to redistribute the lightness values of the image. It is therefore suitable for improving the local contrast and enhancing the definitions of edges in each region of an image.

**Contrast limited AHE** limits the contrast amplification to reduce amplified noise. It does so by clipping each local histogram at a predefined limit and redistributing the part that exceeds the clip limit equally across all histogram bins.
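
The clipping step can be illustrated on a toy histogram. This is a simplified, single-pass sketch: the `clip_histogram` helper is hypothetical, and treating `clip_limit` as an absolute count is an assumption that is not necessarily how OpenCV interprets its `clipLimit` parameter.

```python
def clip_histogram(hist, clip_limit):
    """Clip each bin at clip_limit and spread the excess evenly over all bins."""
    # Total count that sits above the clip limit
    excess = sum(max(count - clip_limit, 0) for count in hist)
    clipped = [min(count, clip_limit) for count in hist]
    # Every bin receives the same share of the excess
    share = excess / len(hist)
    return [count + share for count in clipped]


hist = [2, 30, 4, 4]  # toy 4-bin histogram of one image tile
clipped = clip_histogram(hist, clip_limit=10)
print(clipped)
```

The total count is preserved, but the dominant bin loses most of its peak, which is what limits the contrast amplification. A bin can end up slightly above the limit again after redistribution, so real implementations may repeat or refine this step.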

In [None]:
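def clahe_preprocessing(img):
    #apply CLAHE to each colour channel independently and merge the result back
    #cv2.split may return a tuple, so collect the equalized planes in a new list
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    planes = [clahe.apply(plane) for plane in cv2.split(img.astype(np.uint8))]
    return cv2.merge(planes)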

<a id='s4'></a>
## Train-Valid Split

20% of the training data is held out for validating the model.

In [None]:
VALIDATION_SPLIT_PERCENT = 0.2
TRAINING_IMGS_DIR = '../input/cassava-leaf-disease-classification/train_images/'
IMAGE_ID_COL_NAME = 'image_id'
LABEL_ID_COL_NAME = 'disease_name' #or label
TARGET_SIZE = 512
BATCH_SIZE = 8
CLASS_MODE = 'sparse'

### SMOTE: Synthetic Minority Oversampling Technique

Reference: https://www.analyticsvidhya.com/blog/2020/10/overcoming-class-imbalance-using-smote-techniques/

SMOTE is an oversampling technique in which synthetic samples are generated for the minority class. It helps to overcome the overfitting problem posed by random oversampling. It works in feature space, generating new instances by interpolating between minority-class (positive) instances that lie close together.

First, the total number of oversampled observations, N, is set. It is generally chosen so that the binary class distribution becomes 1:1, but it can be tuned down as needed. The iteration then starts by selecting a positive-class instance at random. Next, the K nearest neighbours (5 by default) of that instance are obtained. Finally, N of these K instances are chosen to interpolate new synthetic instances. To do so, the difference between the feature vector and each chosen neighbour is computed, multiplied by a random value in (0,1], and added to the original feature vector. This is pictorially represented below:
![image](https://editor.analyticsvidhya.com/uploads/77417image1.png)
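
The interpolation step described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the `imblearn` implementation: `smote_sample` and the four 2-D minority points are hypothetical, and the random factor is drawn from [0, 1) rather than (0, 1].

```python
import math
import random


def smote_sample(minority, k=5, rng=None):
    """Create one synthetic point between a random minority sample and one of its k nearest neighbours."""
    if rng is None:
        rng = random.Random()
    base = rng.choice(minority)
    # k nearest neighbours of `base` among the other minority samples
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # interpolation factor in [0, 1)
    # Move from `base` towards `neighbour` by the random fraction `gap`
    return [b + gap * (n - b) for b, n in zip(base, neighbour)]


# Hypothetical minority-class feature vectors
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
rng = random.Random(2020)
synthetic = [smote_sample(minority, k=2, rng=rng) for _ in range(3)]
for point in synthetic:
    print([round(v, 3) for v in point])
```

Each synthetic point lies on the line segment between two real minority samples, so it stays inside the region the minority class already occupies.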

In [None]:
#smt = SMOTE()
#read all the training images into variables (train_x, train_y) and apply SMOTE as below
#holding that many images in memory throws a memory error, so SMOTE is on hold for now in favour of hyperparameter tuning
#train_x_smt,train_y_smt = smt.fit_resample(train_x.reshape(len(train_x),-1),train_y)

### Load Train and Valid images

In [None]:
#generate images and split 20% for validation
train_data = ImageDataGenerator(validation_split=VALIDATION_SPLIT_PERCENT,
                                horizontal_flip=True,
                                vertical_flip=True,
                                shear_range=0.1,
                                rescale=1,
                                zoom_range=0.2,
                                width_shift_range=0.1,
                                height_shift_range=0.1)
train_gen = train_data.flow_from_dataframe(train_df,
                                           directory= TRAINING_IMGS_DIR,
                                           subset="training",
                                           x_col= IMAGE_ID_COL_NAME,
                                           y_col= LABEL_ID_COL_NAME,
                                          target_size=(TARGET_SIZE,TARGET_SIZE),
                                           batch_size=BATCH_SIZE,
                                           class_mode=CLASS_MODE)
#note: with crop_length equal to TARGET_SIZE the random crop returns the full image
train_images = crop_generator(train_gen,TARGET_SIZE)
valid_data = ImageDataGenerator(validation_split=VALIDATION_SPLIT_PERCENT)
valid_gen = valid_data.flow_from_dataframe(train_df,
                                           directory= TRAINING_IMGS_DIR,
                                           subset="validation",
                                           x_col=IMAGE_ID_COL_NAME,
                                           y_col= LABEL_ID_COL_NAME,
                                          target_size=(TARGET_SIZE,TARGET_SIZE),
                                           batch_size=BATCH_SIZE,
                                           class_mode=CLASS_MODE,
                                           #keep file order so predictions line up with valid_gen.labels
                                           shuffle=False)

In [None]:
fig = go.Figure()
t1 = go.Bar(name='Train',x=np.unique(train_gen.labels,return_counts=True)[0],y=np.unique(train_gen.labels,return_counts=True)[1],
           text=np.unique(train_gen.labels,return_counts=True)[1],textposition='auto')
t2 = go.Bar(name='Valid',x=np.unique(valid_gen.labels,return_counts=True)[0],y=np.unique(valid_gen.labels,return_counts=True)[1],
           text=np.unique(valid_gen.labels,return_counts=True)[1],textposition='auto')
fig.add_trace(t1)
fig.add_trace(t2)
#x-axis and y axis title
fig.update_xaxes(title_text="Class Labels")
fig.update_yaxes(title_text="Number of Images")
fig.update_layout(title='Train and Valid Split')
fig.show()

#Pie Chart
fig = make_subplots(rows=1, cols=2,subplot_titles=['Train Data', 'Valid Data'],
            specs=[[{"type": "domain"}, {"type": "domain"}]],)

#Pie chart with labels and counts
t1 = go.Pie(labels=np.unique(train_gen.labels,return_counts=True)[0],
           values=np.unique(train_gen.labels,return_counts=True)[1],
           hole=0.3)
t2 = go.Pie(labels=np.unique(valid_gen.labels,return_counts=True)[0],
           values=np.unique(valid_gen.labels,return_counts=True)[1],
           hole=0.3)
fig.add_trace(t1,row=1, col=1)
fig.add_trace(t2,row=1, col=2)
fig.update_layout(title='Distribution in Train and Valid Split')
fig.show()

<a id='s5'></a>
## Model Architecture       

In [None]:
def model_architecture(IMG_SIZE):
    #arrange the model in sequential manner
    model = Sequential()
    #First Convolutional Layer
    model.add(Conv2D(32,(5,5),activation='relu',input_shape=(IMG_SIZE,IMG_SIZE,3)))
    model.add(BatchNormalization())
    model.add(MaxPool2D((3,3)))
    #Second Convolutional Layer
    model.add(Conv2D(64,(3,3),activation='relu'))
    model.add(BatchNormalization())
    model.add(MaxPool2D((3,3)))
    #Third Convolutional Layer
    model.add(Conv2D(128,(3,3),activation='relu'))
    model.add(BatchNormalization())
    model.add(MaxPool2D((3,3)))
    #flatten the architecture
    model.add(Flatten())
    
    #First Dense Layer with 10% dropout ratio
    model.add(Dense(512,activation='relu'))
    model.add(Dropout(0.1))
    #Second Dense Layer with 10% dropout ratio
    model.add(Dense(1024,activation='relu'))
    model.add(Dropout(0.1))
    #Third Dense Layer with 10% dropout ratio
    model.add(Dense(1024,activation='relu'))
    model.add(Dropout(0.1))
    #Fourth Dense Layer with 10% dropout ratio
    model.add(Dense(256,activation='relu'))
    model.add(Dropout(0.1))
    #Output layer
    model.add(Dense(5,activation='softmax'))
    model.summary()
    
    #compile the model with Adam optimizer (learning_rate replaces the deprecated lr argument)
    model.compile(optimizer = Adam(learning_rate = 0.001),
                  loss = "sparse_categorical_crossentropy",
                  metrics = ["acc"])
    return model
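
The input size largely determines how big this network is, because the flattened feature map feeds the first dense layer. The sketch below works out that size by hand, assuming the Keras defaults used above ('valid' padding, convolution stride 1, pooling stride equal to the pool size); the helper names are hypothetical.

```python
def conv_out(size, kernel):
    """Side length after a 'valid' convolution with stride 1 (the Conv2D defaults)."""
    return size - kernel + 1


def pool_out(size, pool):
    """Side length after 'valid' max pooling with stride equal to the pool size (the MaxPool2D defaults)."""
    return (size - pool) // pool + 1


def flattened_size(img_size):
    """Return (side, features) of the tensor that reaches Flatten."""
    side = img_size
    filters = 3
    # (kernel size, number of filters) of the three Conv2D layers above
    for kernel, filters in [(5, 32), (3, 64), (3, 128)]:
        side = pool_out(conv_out(side, kernel), 3)
    return side, side * side * filters


side, features = flattened_size(512)
first_dense_params = features * 512 + 512  # weights + biases of Dense(512)
print(side, features, first_dense_params)
```

For a 512×512 input this gives a 17×17×128 tensor, i.e. 36,992 features, so the first dense layer alone holds about 18.9 million parameters.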

In [None]:
#len() of a Keras iterator already counts batches, so it is the number of steps per epoch
STEPS_PER_EPOCH = len(train_gen)
VALIDATION_STEPS = len(valid_gen)
EPOCHS = 100
MODEL_NAME = './cvpr_2019_model_with_augmentation_512_100_epochs.h5'
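
Because `len()` of a Keras batch iterator already returns the number of batches, dividing that value by the batch size a second time would cover only a fraction of the data per epoch. The arithmetic is sketched below with a hypothetical image count; `steps_per_epoch` is an illustrative helper, not a Keras function.

```python
import math


def steps_per_epoch(num_images, batch_size):
    """Number of batches needed to see every image once."""
    return math.ceil(num_images / batch_size)


num_images = 1000  # hypothetical number of training images
batch_size = 8
steps = steps_per_epoch(num_images, batch_size)
# Dividing by the batch size again shrinks the epoch to a fraction of the data
fraction_seen = (steps / batch_size) / steps
print(steps, fraction_seen)
```

With a batch size of 8, the second division would leave each epoch with one eighth of the batches.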

In [None]:
model = model_architecture(TARGET_SIZE)

In [None]:
#callbacks
model_save = ModelCheckpoint(MODEL_NAME, 
                             save_best_only = True, 
                             save_weights_only = True,
                             monitor = 'val_loss', 
                             mode = 'min', verbose = 1)
early_stop = EarlyStopping(monitor = 'val_loss', min_delta = 0.001, 
                           patience = 5, mode = 'min', verbose = 1,
                           restore_best_weights = True)
reduce_lr = ReduceLROnPlateau(monitor = 'val_loss', factor = 0.3, 
                              patience = 2, min_delta = 0.001, 
                              mode = 'min', verbose = 1)

<a id='s6'></a>
## Training

In [None]:
history = model.fit(
    train_images,
    steps_per_epoch = STEPS_PER_EPOCH,
    epochs = EPOCHS,
    validation_data = valid_gen,
    validation_steps = VALIDATION_STEPS,
    callbacks = [model_save, early_stop, reduce_lr]
)

In [None]:
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = [e for e in range(1, len(acc) + 1)]

fig = make_subplots(rows=1, cols=2,subplot_titles=['Accuracy', 'Loss'],
            specs=[[{"type": "xy"}, {"type": "xy"}]],)

t1 = go.Scatter(x=epochs,y=acc,name='Training',mode='markers+lines',line={'color': 'blue'})
t2 = go.Scatter(x=epochs,y=val_acc,name='Validation',mode='markers+lines',line={'dash': 'dash','color': 'red'})

t3 = go.Scatter(x=epochs,y=loss,name='Training',mode='markers+lines',line={'color': 'blue'},showlegend=False)
t4 = go.Scatter(x=epochs,y=val_loss,name='Validation',mode='markers+lines',line={'dash': 'dash','color': 'red'},showlegend=False)

fig.add_trace(t1,row=1, col=1)
fig.add_trace(t2,row=1, col=1)

fig.add_trace(t3,row=1, col=2)
fig.add_trace(t4,row=1, col=2)

fig.update_layout(title='Training History')
fig.show()

<a id='s7'></a>
## Model Visualization

I read one interesting kernel recently, which explains how to visualize the trained model.   

https://www.kaggle.com/harininarasimhan/why-not-to-trust-public-lb-visualization

Thank you **Harini Narasimhan** for sharing this kernel


In [None]:
model.load_weights(MODEL_NAME)

### Train Model Visualization

In [None]:
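#train_gen shuffles and augments, so its predictions would not line up with train_gen.labels;
#use an unshuffled, unaugmented iterator over the same training subset instead
embed_gen = valid_data.flow_from_dataframe(train_df, directory=TRAINING_IMGS_DIR, subset="training", x_col=IMAGE_ID_COL_NAME, y_col=LABEL_ID_COL_NAME, target_size=(TARGET_SIZE,TARGET_SIZE), batch_size=BATCH_SIZE, class_mode=CLASS_MODE, shuffle=False)
intermediater_layer_model = Model(inputs=model.inputs, outputs = model.get_layer('dense_3').output)
intermediate_output = intermediater_layer_model.predict(embed_gen,verbose=1)
intermediate_output.shape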

In [None]:
st_time = time.time()
t_sne = TSNE(random_state=2020)
t_sne_tr = t_sne.fit_transform(intermediate_output)
print('t-SNE done; time taken: {} seconds'.format(time.time()-st_time))
##T-SNE df
tsne_tr = pd.DataFrame()
for idx in range(t_sne_tr.shape[1]):
    tsne_tr['t_sne'+str(idx+1)] = t_sne_tr[:,idx]
tsne_tr['label'] = np.array(train_gen.labels).astype(int)
tsne_tr['disease_name'] = tsne_tr['label'].apply(lambda x: disease_names[str(x)])
fig = go.Figure()
colors = ['rgb(243, 247, 15)','rgb(13, 160, 200)','rgb(190, 81, 249)','rgb(248, 104, 73)','rgb(0,255,0)']
for idx,dn in enumerate(tsne_tr['disease_name'].unique()):
    df = tsne_tr[tsne_tr['disease_name'] == dn]
    fig.add_trace(go.Scatter(x=df['t_sne1'],y=df['t_sne2'],mode='markers',marker_color = colors[idx],name=dn))
fig.update_layout(title='Trained model performance')
fig.update_xaxes(title_text="TSNE_1")
fig.update_yaxes(title_text="TSNE_2")
fig.show()

### Valid Model Visualization

In [None]:
intermediater_layer_model = Model(inputs=model.inputs, outputs = model.get_layer('dense_3').output)
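#note: these outputs match valid_gen.labels only when valid_gen was created with shuffle=False
#the iterator already defines the batch size, so batch_size is not passed to predict
intermediate_output = intermediater_layer_model.predict(valid_gen,verbose=1)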

In [None]:
st_time = time.time()
t_sne = TSNE(random_state=2020)
t_sne_va = t_sne.fit_transform(intermediate_output)
print('t-SNE done; time taken: {} seconds'.format(time.time()-st_time))
##T-SNE df
tsne_va = pd.DataFrame()
for idx in range(t_sne_va.shape[1]):
    tsne_va['t_sne'+str(idx+1)] = t_sne_va[:,idx]
tsne_va['label'] = np.array(valid_gen.labels).astype(int)
tsne_va['disease_name'] = tsne_va['label'].apply(lambda x: disease_names[str(x)])
fig = go.Figure()
colors = ['rgb(243, 247, 15)','rgb(13, 160, 200)','rgb(190, 81, 249)','rgb(248, 104, 73)','rgb(0,255,0)']
for idx,dn in enumerate(tsne_va['disease_name'].unique()):
    df = tsne_va[tsne_va['disease_name'] == dn]
    fig.add_trace(go.Scatter(x=df['t_sne1'],y=df['t_sne2'],mode='markers',marker_color = colors[idx],name=dn))
fig.update_layout(title='Validated model performance')
fig.update_xaxes(title_text="TSNE_1")
fig.update_yaxes(title_text="TSNE_2")
fig.show()

<a id='s8'></a>
## Prediction

In [None]:
ss = pd.read_csv("../input/cassava-leaf-disease-classification/sample_submission.csv")
preds = []

for image_id in ss.image_id:
    image = cv2.cvtColor(cv2.imread('../input/cassava-leaf-disease-classification/test_images/'+image_id),cv2.COLOR_BGR2RGB)
    image = cv2.resize(image,(TARGET_SIZE,TARGET_SIZE))
    image = np.expand_dims(image, axis = 0)
    preds.append(np.argmax(model.predict(image)))

ss['label'] = preds
ss.to_csv('submission.csv', index = False)

<a id='s10'></a>
## Next Steps

- SMOTE to address the class imbalance in the dataset (on hold)
- Hyperparameter tuning
