## Problem Statement
The goal of this competition is to identify diseased cassava leaves from images which were crowdsourced from farmers taking photos of their gardens, and annotated by experts at the National Crops Resources Research Institute (NaCRRI) in collaboration with the AI lab at Makerere University, Kampala. This is in a format that most realistically represents what farmers would need to diagnose in real life.  
There are 4 different types of diseases that generally affect the Cassava Plant:
* Cassava Bacterial Blight (CBB)
* Cassava Brown Streak Disease (CBSD)
* Cassava Green Mottle (CGM)
* Cassava Mosaic Disease (CMD) 

Along with the diseased leaves, there is a fifth category of healthy leaves. The photos are taken with the relatively inexpensive mobile phone cameras that farmers typically own, and correctly identifying diseased leaves would enable quicker, automated turnarounds and earlier action.

![](https://storage.googleapis.com/kaggle-competitions/kaggle/13836/logos/header.png)

In [None]:
# Install the fastai library 
# !pip install fastai2

In [None]:
# Import libraries 
import os
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import cv2
from fastai.vision.all import *
# Disable Jedi-based completion (a common workaround when tab completion misbehaves)
%config Completer.use_jedi = False
# Configs
pd.options.display.max_colwidth = 100

In [None]:
# Verify that the pretrained resnet 50 model is available so it's not downloaded
os.listdir('/kaggle/input/pretrained-pytorch')

### EDA: general aspects of the data and sample images from each leaf category

In [None]:
# Train and Test dataset sizes
num_train_images = len(os.listdir('/kaggle/input/cassava-leaf-disease-classification/train_images'))
num_test_images = len(os.listdir('/kaggle/input/cassava-leaf-disease-classification/test_images'))
print('There are: ', num_train_images, "images in train and", num_test_images, 'images in test')

To prevent cheating and manual labelling, the full set of test images is only available when the notebook is submitted for scoring. This is why the number of test images is shown as 1.

In [None]:
# Let's see the distribution of classes in the training data
train_mappings_data = pd.read_csv('/kaggle/input/cassava-leaf-disease-classification/train.csv')
mapping_dict = pd.read_json('/kaggle/input/cassava-leaf-disease-classification/label_num_to_disease_map.json',
                           typ='series')
mapping_dict = mapping_dict.to_dict()
train_mappings_data['Label_expanded'] = train_mappings_data['label'].map(mapping_dict)
train_mappings_data.head()

In [None]:
# Class distribution
print(train_mappings_data['Label_expanded'].value_counts() / train_mappings_data.shape[0] * 100)

As can be seen, the class distribution is heavily imbalanced, with ~62% of the images belonging to the Cassava Mosaic Disease class. For a problem like this, tracking an imbalance-aware evaluation metric such as macro F1 or AUC during training is important, because plain accuracy can look good even when the minority classes are mostly missed.
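
To make the point about metrics concrete, here is a minimal sketch in plain Python. The `macro_f1` helper and the toy labels are made up for illustration (roughly mirroring the ~62% majority share above) and are not part of the competition data or of fastai. It compares accuracy with macro F1 for a degenerate model that always predicts the majority class.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores (hypothetical helper)."""
    f1_scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); define it as 0 when the class never appears
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Toy labels (made up): 62 of 100 images in class 3, the rest spread over the other classes
y_true = [3] * 62 + [0] * 5 + [1] * 10 + [2] * 11 + [4] * 12
# A degenerate model that always predicts the majority class
y_pred = [3] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred, labels=[0, 1, 2, 3, 4])
print(f"accuracy = {accuracy:.2f}, macro F1 = {score:.3f}")
```

On these toy labels accuracy is 0.62 while macro F1 is only about 0.15, since four of the five classes contribute an F1 of zero.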

### Visualization: a few images from each of the 5 categories

<h2 style='background:limegreen; border:0; color:black'><center> 3 Cassava Mosaic Disease </center></h2>

In [None]:
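TRAIN_IMG_DIR = '/kaggle/input/cassava-leaf-disease-classification/train_images'

def plot_images(img_list, img_label):
    """Plot up to six training images in a 3x2 grid, each titled with img_label."""
    plt.figure(figsize=(16, 12))
    for i, img in enumerate(img_list):
        plt.subplot(3, 2, i + 1)
        # Build the full path instead of changing the working directory
        temp_img = cv2.imread(os.path.join(TRAIN_IMG_DIR, img))
        # OpenCV loads images as BGR; matplotlib expects RGB
        temp_img = cv2.cvtColor(temp_img, cv2.COLOR_BGR2RGB)
        plt.imshow(temp_img)
        plt.title(img_label, fontsize=12)
    plt.show()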

In [None]:
CMD_IDs = list(train_mappings_data['image_id'][train_mappings_data['label'] == 3])
CMD_IDs_to_display = random.sample(CMD_IDs, 6)
plot_images(CMD_IDs_to_display, mapping_dict[3])

<h3>Let's see some symptoms of the Cassava Mosaic Disease</h3>

<img style="height:480px" src="https://www.easybiologyclass.com/wp-content/uploads/2018/09/Leaf-Mosaic-of-Tapioca.jpg" alt="Cassava Mosaic Disease"> <cite> <a href="https://www.easybiologyclass.com/mosaic-disease-of-tapioca-symptoms-and-control-measures/"> Easy Biology Class </a> </cite>

<h2 style='background:limegreen; border:0; color:black'><center> 4 Healthy Leaves </center></h2>

In [None]:
Healthy_IDs = list(train_mappings_data['image_id'][train_mappings_data['label'] == 4])
Healthy_IDs_to_display = random.sample(Healthy_IDs, 6)
plot_images(Healthy_IDs_to_display, mapping_dict[4])

As can be seen, healthy leaves seem to have more shine and clearer veins, but to the naked eye of a non-expert they can still be difficult to distinguish from diseased ones.

<h2 style='background:limegreen; border:0; color:black'> <center>2 Cassava Green Mottle</center></h2>

In [None]:
CGM_IDs = list(train_mappings_data['image_id'][train_mappings_data['label'] == 2])
# Sample from the CGM images (not the healthy ones)
CGM_IDs_to_display = random.sample(CGM_IDs, 6)
plot_images(CGM_IDs_to_display, mapping_dict[2])

<h3>Let's see some symptoms of the Cassava Green Mottle Disease</h3>

<div class="img-with-text">
    <img style="height:300px" src="https://www.pestnet.org/fact_sheets/assets/image/cassava_green_mottle_068/cgmv2.jpg" alt="Image 1"/>
    <p><center>Yellow spots throughout the leaf caused by infection from Cassava Green Mottle Virus</center></p>
</div>
<div class="img-with-text2">
    <img style="height:300px" src="https://www.pestnet.org/fact_sheets/assets/image/cassava_green_mottle_068/cgmv.jpg" alt="Image 1"/>
    <p><center><b>Larger irregular spots and patches on distorted leaves</b></center></p>
</div>
<cite>Link to Images: <a href="https://www.pestnet.org/fact_sheets/cassava_green_mottle_068.htm">Cassava Green Mottle Virus</a></cite>

<h2 style='background:limegreen; border:0; color:black'><center> 1 Cassava Brown Streak Disease </center></h2>

In [None]:
# Label 1 is Cassava Brown Streak Disease (CBSD)
CBSD_IDs = list(train_mappings_data['image_id'][train_mappings_data['label'] == 1])
CBSD_IDs_to_display = random.sample(CBSD_IDs, 6)
plot_images(CBSD_IDs_to_display, mapping_dict[1])

<h3>Zoom in to see a more detailed Image</h3>

<img style="height:300px;width:300px" src="https://newscenter.lbl.gov/wp-content/uploads/sites/2/Characteristic-yellowing-of-casssava-leaves-symptom-of-CBSD-infection.jpg" alt="Cassava Brown Streak Disease"> <cite> <a href="https://newscenter.lbl.gov/2013/03/18/cassava-brief-genomics/"> Newscenter </a> </cite>

<h2 style='background:limegreen; border:0; color:black'><center> 0 Cassava Bacterial Blight </center></h2>

In [None]:
CBBlight_IDs = list(train_mappings_data['image_id'][train_mappings_data['label'] == 0])
CBBlight_IDs_to_display = random.sample(CBBlight_IDs, 6)
plot_images(CBBlight_IDs_to_display, mapping_dict[0])

<h3> Zoom in to see a more detailed image </h3>

<img style="height:300px" src="http://www.pestnet.org/fact_sheets/assets/image/cassava_bacterial_blight_173/thumbs/cassavabb_sml.jpg" alt="Cassava Bacterial Blight"> <cite> <a href="http://www.pestnet.org/fact_sheets/cassava_bacterial_blight_173.htm"> PestNet </a> </cite>

This category seems somewhat easier to identify: the whole leaf looks yellowish rather than just spotted, and brown splotches and wilting are prevalent.

## Getting baseline results with the fastai library for reference

In [None]:
# Create a DataFrame with the image labels and paths for easy use with fastai's ImageDataLoaders
train_df = train_mappings_data.loc[:, ['image_id', 'label']]
train_df['Img_path'] = 'train_images/' + train_df['image_id']
train_df.head()

In [None]:
train_img_data = ImageDataLoaders.from_df(train_df,path='/kaggle/input/cassava-leaf-disease-classification/',
                                          valid_pct=0.2, seed=123, fn_col=2, 
                                          item_tfms = RandomResizedCrop(224, min_scale=0.5),
                                          batch_tfms=aug_transforms())

In [None]:
# Copy the pretrained model weights to the torch hub cache so the resnet model
# can load them without downloading from the internet
os.makedirs('/root/.cache/torch/hub/checkpoints/', exist_ok=True)
!cp '/kaggle/input/pretrained-pytorch/resnet50-19c8e357.pth' '/root/.cache/torch/hub/checkpoints/resnet50-19c8e357.pth'
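
The `!cp` shell escape above only works inside a notebook. As a rough sketch of the same staging step in plain Python, the hypothetical helper `stage_checkpoint` below (not a fastai or torch function) copies a checkpoint file into a cache directory if it is not already there. The demo uses throwaway temporary directories and placeholder bytes rather than real weights.

```python
import os
import shutil
import tempfile

def stage_checkpoint(src_path, cache_dir):
    """Copy a checkpoint file into cache_dir unless it is already present."""
    os.makedirs(cache_dir, exist_ok=True)
    dst_path = os.path.join(cache_dir, os.path.basename(src_path))
    if not os.path.exists(dst_path):
        shutil.copy(src_path, dst_path)
    return dst_path

# Demo on temporary directories so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "resnet50-19c8e357.pth")
    with open(src, "wb") as f:
        f.write(b"placeholder")  # stand-in bytes, not real weights
    staged = stage_checkpoint(src, os.path.join(tmp, "hub", "checkpoints"))
    staged_ok = os.path.exists(staged)
    staged_name = os.path.basename(staged)
```

In the notebook, the source would be the file under `/kaggle/input/pretrained-pytorch` and the cache directory would be `/root/.cache/torch/hub/checkpoints/`, as in the cell above.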

In [None]:
# Create learner
learner_50 = cnn_learner(train_img_data, resnet50, metrics=accuracy)

In [None]:
# Fine-tune: by default one epoch with the pretrained body frozen, then 4 epochs with all layers unfrozen
learner_50.fine_tune(4)

In [None]:
# Have a look at the mistakes the model made for future improvements
model_interpreter = ClassificationInterpretation.from_learner(learner_50)
model_interpreter.plot_confusion_matrix()

Healthy leaves seem to be misclassified most consistently. Categories 2 and 3 are also frequently confused with each other.
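
One way to put numbers on observations like these is per-class recall, read off the confusion matrix. The sketch below is plain Python, and the 3-class matrix is invented for illustration, not the notebook's actual results. It assumes rows are actual classes and columns are predicted classes, which is how the plot above is laid out. In the notebook, the real matrix could likely be obtained from `model_interpreter.confusion_matrix()`.

```python
def per_class_recall(cm):
    """Recall for each class, given a confusion matrix with rows = actual, columns = predicted."""
    recalls = []
    for i, row in enumerate(cm):
        total = sum(row)
        # Diagonal entry = correctly classified samples of class i
        recalls.append(row[i] / total if total else 0.0)
    return recalls

# Made-up 3-class matrix for illustration only
toy_cm = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 0, 10],
]

recall = per_class_recall(toy_cm)
worst_class = min(range(len(recall)), key=recall.__getitem__)
print(recall, worst_class)
```

The class with the lowest recall is the one the model misses most often, which makes it a natural first target for extra augmentation or more training data.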

In [None]:
# Let's see top losses
model_interpreter.plot_top_losses(5, nrows=1)

In [None]:
# Generate Predictions for the Test set and establish a baseline
# Ingest the test image names
test_df = pd.read_csv('/kaggle/input/cassava-leaf-disease-classification/sample_submission.csv')
test_df['Img_path'] = "test_images/" + test_df['image_id']
test_df.head()

In [None]:
# Add the test data to the Data Loader
test_img_data = train_img_data.test_dl(test_df)

In [None]:
# See if image was successfully added
test_img_data.show_batch()

In [None]:
# Generate predictions for the test data with TTA
preds, _ = learner_50.tta(dl=test_img_data, n=15, beta=0)
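
To illustrate what test-time augmentation does with `n` and `beta`, here is a simplified sketch in plain Python. It assumes, as fastai's `tta` appears to do, that the predictions from the `n` augmented passes are averaged and then linearly interpolated with the un-augmented prediction, with `beta` as the weight on the un-augmented one. The helper `tta_blend` and the probability values are hypothetical.

```python
def tta_blend(aug_probs, plain_probs, beta):
    """Blend averaged augmented predictions with the un-augmented prediction for one image.

    aug_probs:   list of n probability vectors, one per augmented pass
    plain_probs: probability vector from the un-augmented image
    beta:        weight on the un-augmented prediction (0 = augmented passes only)
    """
    n = len(aug_probs)
    aug_mean = [sum(p[k] for p in aug_probs) / n for k in range(len(plain_probs))]
    # Linear interpolation: aug_mean + beta * (plain - aug_mean)
    return [a + beta * (p - a) for a, p in zip(aug_mean, plain_probs)]

# Made-up probabilities for a 2-class problem
aug_probs = [[0.6, 0.4], [0.2, 0.8], [0.4, 0.6]]
plain_probs = [0.9, 0.1]

blended_beta0 = tta_blend(aug_probs, plain_probs, beta=0)
blended_half = tta_blend(aug_probs, plain_probs, beta=0.5)
# Predicted class = index of the largest blended probability
predicted = max(range(len(blended_beta0)), key=blended_beta0.__getitem__)
```

Under this assumption, `beta=0` as used above means the final prediction relies only on the average over the augmented passes.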

In [None]:
# Submit preds using the same format as the sample submissions file
submission_1 = test_df.drop(columns=['Img_path'])
submission_1['label'] = preds.argmax(dim=-1).numpy()
submission_1

In [None]:
# Make submission
submission_1.to_csv('/kaggle/working/submission.csv',index=False)