# Cassava Leaf Disease Classification - Exploratory Data Analysis

Quick exploratory data analysis for the [Cassava Leaf Disease Classification](https://www.kaggle.com/c/cassava-leaf-disease-classification) challenge.

This competition will challenge you to distinguish between several diseases that cause material harm to the food supply of many African countries. In some cases the main remedy is to burn the infected plants to prevent further spread, which can make a rapid automated turnaround quite useful to the farmers.

![](https://storage.googleapis.com/kaggle-competitions/kaggle/13836/logos/header.png)

<a id="top"></a>

<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='color:black; background:#5BEB9C; border:0' role="tab" aria-controls="home"><center>Quick Navigation</center></h3>

* [Overview](#1)
    
    
* [General Visualization](#2)
* [0 - CBB - Cassava Bacterial Blight](#3)
* [1 - CBSD - Cassava Brown Streak Disease](#4)
* [2 - CGM - Cassava Green Mottle](#5)
* [3 - CMD - Cassava Mosaic Disease](#6)
* [4 - Healthy](#7)
    
    
* [Augmentation Examples](#50)
    
    
* [Submission Example](#100)

<a id="1"></a>
<h2 style='background:#5BEB9C; border:0; color:black'><center>Overview</center></h2>

In [None]:
import os
import json

import numpy as np
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt
import cv2
import albumentations as A
from sklearn import metrics as sk_metrics

In [None]:
BASE_DIR = "../input/cassava-leaf-disease-classification/"

In this competition we have 5 classes: **4 diseases** and **1 healthy**.   
The mapping between each class number and its name is stored in the file `label_num_to_disease_map.json`.

In [None]:
with open(os.path.join(BASE_DIR, "label_num_to_disease_map.json")) as file:
    map_classes = json.loads(file.read())
    map_classes = {int(k) : v for k, v in map_classes.items()}
    
print(json.dumps(map_classes, indent=4))
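JSON object keys are always strings, which is why the cell above converts them with `int(k)`: the labels in `train.csv` are integers, and an integer lookup against string keys would find nothing. A minimal sketch of the difference, using an inline JSON string whose short names are placeholders taken from the section titles, not the exact contents of the competition file:

```python
import json

# Hypothetical stand-in for label_num_to_disease_map.json; names are abbreviated.
raw = '{"0": "CBB", "1": "CBSD", "2": "CGM", "3": "CMD", "4": "Healthy"}'

str_keyed = json.loads(raw)                             # keys are "0", "1", ...
int_keyed = {int(k): v for k, v in str_keyed.items()}   # keys are 0, 1, ...

print(3 in str_keyed)   # integer lookup against string keys
print(int_keyed[3])     # integer lookup after conversion
```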

In [None]:
input_files = os.listdir(os.path.join(BASE_DIR, "train_images"))
print(f"Number of train images: {len(input_files)}")

Let's take a look at the dimensions of the first 300 images.   
As you can see below, all 300 sampled images have the same shape, (600, 800, 3): 600 pixels high, 800 pixels wide, 3 color channels.

In [None]:
img_shapes = {}
for image_name in input_files[:300]:
    image = cv2.imread(os.path.join(BASE_DIR, "train_images", image_name))
    if image is None:  # cv2.imread returns None when a file cannot be decoded
        continue
    img_shapes[image.shape] = img_shapes.get(image.shape, 0) + 1

print(img_shapes)
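OpenCV returns images as NumPy arrays whose `shape` is `(height, width, channels)`. The same tally can also be written with `collections.Counter`. The sketch below runs on synthetic arrays rather than the real files, and the helper name `count_shapes` is illustrative only:

```python
from collections import Counter

import numpy as np


def count_shapes(images):
    """Count how many arrays share each shape, skipping failed reads (None)."""
    return Counter(img.shape for img in images if img is not None)


# Synthetic stand-ins for decoded images; None mimics a file that could not be read.
images = [np.zeros((600, 800, 3), dtype=np.uint8) for _ in range(3)]
images.append(None)

shape_counts = count_shapes(images)
print(shape_counts)
```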

Let's load the training dataframe and add a column with the real class name to it.

In [None]:
df_train = pd.read_csv(os.path.join(BASE_DIR, "train.csv"))

df_train["class_name"] = df_train["label"].map(map_classes)

df_train

Let's look at the number of pictures in each class.

In [None]:
plt.figure(figsize=(8, 4))
sn.countplot(y="class_name", data=df_train);

As we can see, the dataset is heavily imbalanced: class 3 (CMD) dominates, making up about 61.5% of the training images.
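One way to put numbers on the imbalance is to compute each class's share of the data and, if needed, inverse-frequency weights for a loss function. This is a minimal sketch on made-up labels (not the competition counts); the weighting formula `n_samples / (n_classes * count)` is one common heuristic and is not something this notebook uses elsewhere:

```python
import numpy as np

# Made-up labels for illustration only; the real ones live in df_train["label"].
labels = np.array([3, 3, 3, 3, 3, 3, 0, 1, 2, 4])

counts = np.bincount(labels, minlength=5)   # images per class
shares = counts / counts.sum()              # fraction of the dataset per class
# Inverse-frequency weights: rare classes get weights above 1.
# This assumes every class occurs at least once (no zero counts).
weights = counts.sum() / (len(counts) * counts)

for cls, (share, weight) in enumerate(zip(shares, weights)):
    print(f"class {cls}: share={share:.2f}, weight={weight:.2f}")
```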

<a id="2"></a>
<h2 style='background:#5BEB9C; border:0; color:black'><center>General Visualization</center></h2>

In [None]:
def visualize_batch(image_ids, labels):
    plt.figure(figsize=(16, 12))
    
    for ind, (image_id, label) in enumerate(zip(image_ids, labels)):
        plt.subplot(3, 3, ind + 1)
        image = cv2.imread(os.path.join(BASE_DIR, "train_images", image_id))
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

        plt.imshow(image)
        plt.title(f"Class: {label}", fontsize=12)
        plt.axis("off")
    
    plt.show()

In [None]:
tmp_df = df_train.sample(9)
image_ids = tmp_df["image_id"].values
labels = tmp_df["class_name"].values

visualize_batch(image_ids, labels)

<a id="3"></a>
<h2 style='background:#5BEB9C; border:0; color:black'><center>0 - CBB - Cassava Bacterial Blight</center></h2>

<img style="height:300px" src="https://www.googleapis.com/download/storage/v1/b/kaggle-forum-message-attachments/o/inbox%2F1865449%2Fbe9cdd94efb9b1660066ad10b55c8626%2Fbact_bright.jpeg?generation=1605827469211692&alt=media">
<cite>Image from the discussion: <a href="https://www.kaggle.com/c/cassava-leaf-disease-classification/discussion/198143">Cassava Leaf Diseases: Overview</a></cite>

In [None]:
tmp_df = df_train[df_train["label"] == 0]
print(f"Total train images for class 0: {tmp_df.shape[0]}")

tmp_df = tmp_df.sample(9)
image_ids = tmp_df["image_id"].values
labels = tmp_df["label"].values

visualize_batch(image_ids, labels)

<a id="4"></a>
<h2 style='background:#5BEB9C; border:0; color:black'><center>1 - CBSD - Cassava Brown Streak Disease</center></h2>

<img style="height:300px" src="https://www.googleapis.com/download/storage/v1/b/kaggle-forum-message-attachments/o/inbox%2F1865449%2Ffeba3dafc914d04517659650d137b77a%2Fbrown_st.jpeg?generation=1605830407530983&alt=media">
<cite>Image from the discussion: <a href="https://www.kaggle.com/c/cassava-leaf-disease-classification/discussion/198143">Cassava Leaf Diseases: Overview</a></cite>

In [None]:
tmp_df = df_train[df_train["label"] == 1]
print(f"Total train images for class 1: {tmp_df.shape[0]}")

tmp_df = tmp_df.sample(9)
image_ids = tmp_df["image_id"].values
labels = tmp_df["label"].values

visualize_batch(image_ids, labels)

<a id="5"></a>
<h2 style='background:#5BEB9C; border:0; color:black'><center>2 - CGM - Cassava Green Mottle</center></h2>

<img style="height:300px" src="https://www.googleapis.com/download/storage/v1/b/kaggle-forum-message-attachments/o/inbox%2F1865449%2F4f2975866feb2a1d4ef4111c2d57db29%2Fgreen_mottle.jpeg?generation=1605829101431013&alt=media">
<cite>Image from the discussion: <a href="https://www.kaggle.com/c/cassava-leaf-disease-classification/discussion/198143">Cassava Leaf Diseases: Overview</a></cite>

In [None]:
tmp_df = df_train[df_train["label"] == 2]
print(f"Total train images for class 2: {tmp_df.shape[0]}")

tmp_df = tmp_df.sample(9)
image_ids = tmp_df["image_id"].values
labels = tmp_df["label"].values

visualize_batch(image_ids, labels)

<a id="6"></a>
<h2 style='background:#5BEB9C; border:0; color:black'><center>3 - CMD - Cassava Mosaic Disease</center></h2>

<img style="height:300px" src="https://www.googleapis.com/download/storage/v1/b/kaggle-forum-message-attachments/o/inbox%2F1865449%2F36990f77ded6667e5c30d19b5405d4d3%2Fmosaic_disease.jpeg?generation=1605829705010773&alt=media">
<cite>Image from the discussion: <a href="https://www.kaggle.com/c/cassava-leaf-disease-classification/discussion/198143">Cassava Leaf Diseases: Overview</a></cite>

In [None]:
tmp_df = df_train[df_train["label"] == 3]
print(f"Total train images for class 3: {tmp_df.shape[0]}")

tmp_df = tmp_df.sample(9)
image_ids = tmp_df["image_id"].values
labels = tmp_df["label"].values

visualize_batch(image_ids, labels)

<a id="7"></a>
<h2 style='background:#5BEB9C; border:0; color:black'><center>4 - Healthy</center></h2>

In [None]:
tmp_df = df_train[df_train["label"] == 4]
print(f"Total train images for class 4: {tmp_df.shape[0]}")

tmp_df = tmp_df.sample(9)
image_ids = tmp_df["image_id"].values
labels = tmp_df["label"].values

visualize_batch(image_ids, labels)

<a id="50"></a>
<h2 style='background:#5BEB9C; border:0; color:black'><center>Augmentation Examples</center></h2>

Image augmentation is a process of creating new training examples from the existing ones. To make a new sample, you slightly change the original image. For instance, you could make a new image a little brighter; you could cut a piece from the original image; you could make a new image by mirroring the original one, etc. [[source]](https://albumentations.ai/docs/introduction/image_augmentation/)

<img style="height:500px" src="https://albumentations.ai/docs/images/introduction/image_augmentation/augmentation.jpg">
<cite>The image from the <a href="https://albumentations.ai/docs/introduction/image_augmentation/">Albumentations Documentation</a></cite>

In [None]:
def plot_augmentation(image_id, transform):
    plt.figure(figsize=(16, 4))
    img = cv2.imread(os.path.join(BASE_DIR, "train_images", image_id))
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

    plt.subplot(1, 3, 1)
    plt.imshow(img)
    plt.axis("off")

    plt.subplot(1, 3, 2)
    x = transform(image=img)["image"]
    plt.imshow(x)
    plt.axis("off")

    plt.subplot(1, 3, 3)
    x = transform(image=img)["image"]
    plt.imshow(x)
    plt.axis("off")
    
    plt.show()

Since some classes have relatively few images, we can use augmentation to enlarge the training set.    
This section shows examples of augmentation using the [albumentations](https://albumentations.ai/) library.

The example below uses shift-scale-rotate augmentation with reflected (mirrored) border filling. For this kind of picture, the result looks quite natural.

In [None]:
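transform_shift_scale_rotate = A.ShiftScaleRotate(
    p=1.0,
    shift_limit=(-0.3, 0.3),
    scale_limit=(-0.1, 0.1),
    rotate_limit=(-180, 180),
    interpolation=cv2.INTER_NEAREST,     # same as the integer 0
    border_mode=cv2.BORDER_REFLECT_101,  # same as the integer 4: mirror the image at its edges
)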

plot_augmentation("1003442061.jpg", transform_shift_scale_rotate)
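The border mode used above, `cv2.BORDER_REFLECT_101` (integer value 4), fills the area left empty by a shift or rotation with a mirror image of the picture. NumPy's `np.pad(..., mode="reflect")` follows the same rule (mirror around the edge pixel without repeating it), so a tiny 1-D example can show the effect without loading any images:

```python
import numpy as np

row = np.array([1, 2, 3, 4])

# Mirror around the edge value without repeating it (like BORDER_REFLECT_101).
reflect_101 = np.pad(row, 2, mode="reflect")
# Mirror including the edge value (like BORDER_REFLECT), shown for comparison.
reflect = np.pad(row, 2, mode="symmetric")

print(reflect_101)
print(reflect)
```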

Another useful augmentation is CoarseDropout. It makes the task harder for the model by hiding random patches, so that it does not rely too heavily on any single detail of the image.   
Let's look at the example below:

In [None]:
transform_coarse_dropout = A.CoarseDropout(
    p=1.0, 
    max_holes=100, 
    max_height=50, 
    max_width=50, 
    min_holes=30, 
    min_height=20, 
    min_width=20,
)

plot_augmentation("1003442061.jpg", transform_coarse_dropout)
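To make the idea concrete, here is a simplified NumPy re-implementation of coarse dropout. It is a sketch, not the albumentations code: it uses a fixed number of fixed-size square holes, whereas the library samples the hole count and size from ranges. The function name `coarse_dropout` and its parameters are illustrative.

```python
import numpy as np


def coarse_dropout(image, n_holes, hole_size, rng):
    """Return a copy of `image` with `n_holes` square patches set to zero."""
    out = image.copy()
    height, width = out.shape[:2]
    for _ in range(n_holes):
        # Pick the top-left corner so the hole fits entirely inside the image.
        top = rng.integers(0, height - hole_size + 1)
        left = rng.integers(0, width - hole_size + 1)
        out[top:top + hole_size, left:left + hole_size] = 0
    return out


rng = np.random.default_rng(0)
image = np.full((600, 800, 3), 255, dtype=np.uint8)  # plain white stand-in image
dropped = coarse_dropout(image, n_holes=30, hole_size=50, rng=rng)

print(f"zeroed fraction: {(dropped == 0).mean():.3f}")
```

Holes may overlap, so the zeroed fraction is at most `n_holes * hole_size**2 / (height * width)`.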

We can compose two or more augmentations into one pipeline.    
For example, let's apply shift-scale-rotate and CoarseDropout sequentially:

In [None]:
transform = A.Compose(
    transforms=[
        transform_shift_scale_rotate,
        transform_coarse_dropout,
    ],
    p=1.0,
)

plot_augmentation("1003442061.jpg", transform)
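A composed pipeline applies its steps in the listed order, so here the dropout holes are cut after the geometric transform. The order can change the result, as this sketch with plain functions shows; `compose`, `flip_horizontal` and `zero_left_column` are illustrative helpers, not part of the albumentations API:

```python
import numpy as np


def compose(*transforms):
    """Chain functions so each one receives the previous one's output."""
    def apply(image):
        for transform in transforms:
            image = transform(image)
        return image
    return apply


def flip_horizontal(image):
    return image[:, ::-1]


def zero_left_column(image):
    out = image.copy()
    out[:, 0] = 0
    return out


image = np.arange(1, 7).reshape(2, 3)  # [[1, 2, 3], [4, 5, 6]]

flip_then_zero = compose(flip_horizontal, zero_left_column)(image)
zero_then_flip = compose(zero_left_column, flip_horizontal)(image)

print(flip_then_zero)
print(zero_then_flip)
```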

<a id="100"></a>
<h2 style='background:#5BEB9C; border:0; color:black'><center>Submission Example</center></h2>

Load the submission template

In [None]:
df_sub = pd.read_csv(os.path.join(BASE_DIR, "sample_submission.csv"), index_col=0)
df_sub

As we can see, the submission template lists only one image, and the test folder contains only that file:

In [None]:
os.listdir(os.path.join(BASE_DIR, "test_images"))

This is because [it is a Code Competition](https://www.kaggle.com/c/cassava-leaf-disease-classification/overview/code-requirements) and the test data is hidden.   
Your notebook must work correctly on the unseen test dataset.

The full set of test images will only be available to your notebook when it is submitted for scoring.    
Expect to see roughly 15,000 images in the test set.    
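In practice this means the inference code should discover the test files at run time instead of relying on a fixed list. A minimal sketch under stated assumptions: the helper `build_submission` and the `predict_fn` callback are hypothetical names, the column names `image_id` and `label` follow the dataframes used in this notebook, and the demo runs on empty placeholder files in a temporary folder rather than real images.

```python
import csv
import os
import tempfile


def build_submission(test_dir, predict_fn, out_path):
    """Predict every file found in test_dir and write image_id,label rows."""
    image_ids = sorted(os.listdir(test_dir))
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["image_id", "label"])
        for image_id in image_ids:
            label = predict_fn(os.path.join(test_dir, image_id))
            writer.writerow([image_id, label])
    return len(image_ids)


# Demo on a temporary folder with empty placeholder files instead of real images.
with tempfile.TemporaryDirectory() as tmp:
    test_dir = os.path.join(tmp, "test_images")
    os.makedirs(test_dir)
    for name in ["a.jpg", "b.jpg", "c.jpg"]:
        open(os.path.join(test_dir, name), "w").close()

    out_path = os.path.join(tmp, "submission.csv")
    # A constant predictor stands in for a real model here.
    n_rows = build_submission(test_dir, lambda path: 3, out_path)
    with open(out_path, newline="") as f:
        rows = list(csv.reader(f))

print(n_rows, rows[0])
```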

The metric of this competition is **Accuracy**.    
Accuracy is the ratio of the number of correctly predicted samples to the total number of samples:

$$\text{Accuracy Score} = \frac{\text{number of samples predicted correctly}}{\text{total number of samples}}$$
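The formula translates directly into a few lines of NumPy. This is a sketch for illustration (the function name `accuracy` is ours); the notebook itself uses `sklearn.metrics.accuracy_score`.

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    return float((y_true == y_pred).mean())


# Toy labels: 3 of the 4 predictions match the true labels.
score = accuracy([3, 3, 0, 4], [3, 3, 3, 4])
print(score)
```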

Let's calculate the accuracy on the training set when we predict a single class for every example.

In [None]:
y_true = df_train["label"].values
for pred_class in range(5):
    # Predict the same class for every training image.
    y_pred = np.full_like(y_true, pred_class)
    acc = sk_metrics.accuracy_score(y_true, y_pred)
    print(f"accuracy score (predict {pred_class}): {acc:.3f}")

Because the classes are heavily imbalanced, always predicting the most frequent class gives the highest accuracy among these constant predictions.

Let's use the most frequent class in the training set as the label for all images in the test set.

In [None]:
df_sub["label"] = 3
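The value 3 is hard-coded here, but the most frequent class can also be derived from the labels. A minimal sketch with `collections.Counter` on made-up labels (in the notebook they would come from `df_train["label"]`):

```python
from collections import Counter

# Made-up labels for illustration only.
labels = [3, 3, 1, 3, 4, 0, 3, 2]

# most_common(1) returns a list with one (value, count) pair.
most_common_class, count = Counter(labels).most_common(1)[0]
baseline_accuracy = count / len(labels)

print(most_common_class, baseline_accuracy)
```

The share of the most frequent class equals the accuracy of always predicting it.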

Then write the result to the submission file.

In [None]:
df_sub.to_csv("submission.csv")

If you submit this result for evaluation, you will get an accuracy of 0.614 on the public leaderboard (on the training set it is 0.615). This suggests that the public test set has a similar class imbalance.

# WORK IN PROGRESS...