<h1 style="border:2px solid Purple;text-align:center">Table of Contents</h1>

1. [The Competition](#competition)

2. [Objective](#objective)

3. [Dataset](#dataset)

4. [Importing the necessary libraries](#imports)

5. [Disease Mappings](#mappings)

6. [Training Dataset](#trainds)

7. [Training Image Dataset](#imgtrainds)

8. [Why care about Histograms](#hists)

    8.1. [Healthy Leaves](#healthy)
    
    8.2. [Cassava Bacterial Blight (CBB)](#cbb)
    
    8.3. [Cassava Brown Streak Disease (CBSD)](#cbsd)
    
    8.4. [Cassava Green Mottle (CGM)](#cgm)
    
    8.5. [Cassava Mosaic Disease (CMD)](#cmd)

9. [Image Augmentation](#imageaug)

    9.1 [Image Augmentation - Tensorflow](#imageaugtens)
    
    9.2 [Image Augmentation - Pytorch](#imageaugpy)
    
    9.3 [Image Augmentation - Albumentations](#imageaugalbu)

<a id="competition"></a>
<h1 style="border:2px solid LightGreen;text-align:center">The Competition</h1>

![](https://www.pestnet.org/fact_sheets/assets/image/cassava_brown_leaf_spot_095/46.jpg)

Manihot esculenta, commonly called cassava, manioc, yuca, macaxeira, mandioca, aipim, and agbeli, is a woody shrub native to South America of the spurge family, Euphorbiaceae. Although a perennial plant, cassava is extensively cultivated as an annual crop in tropical and subtropical regions for its edible starchy tuberous root, a major source of carbohydrates. 

Cassava is the third-largest source of food carbohydrates in the tropics, after rice and maize. Cassava is a major staple food in the developing world, providing a basic diet for over half a billion people. It is one of the most drought-tolerant crops, capable of growing on marginal soils. Nigeria is the world's largest producer of cassava, while Thailand is the largest exporter of cassava starch.

Source : Wikipedia

<a id= "objective"></a>
<h1 style="border:2px solid LightGreen;text-align:center">Objective</h1>


The task is to classify each cassava image into one of four disease categories or a fifth category indicating a healthy leaf. With the help of data science, farmers may be able to quickly identify diseased plants, potentially saving their crops before the disease inflicts irreparable damage.

<a id= "dataset" ></a>
<h1 style="border:2px solid LightGreen;text-align:center">Dataset</h1>

In this competition, we are given a dataset of 21,367 labeled images collected during a regular survey in Uganda. Most images were crowdsourced from farmers taking photos of their gardens and annotated by experts at the National Crops Resources Research Institute (NaCRRI) in collaboration with the AI lab at Makerere University, Kampala. This format most realistically represents what farmers would need to diagnose in real life.

<a id= "imports" ></a>
<h1 style="border:2px solid LightGreen;text-align:center">Importing the Necessary Libraries</h1>

In [None]:
import pandas as pd
import numpy as np

# Importing Libraries for Image Augmentations
import tensorflow as tf
import torchvision
import albumentations as A

# Working with Files
import os
from pathlib import Path

# Fancy progress bar
from tqdm import tqdm

# Dynamic Graphs
import plotly.graph_objects as go
import plotly.express as px
import plotly.figure_factory as ff
from plotly.subplots import make_subplots

# Static Graphs
import matplotlib.pyplot as plt

# Working with images
import cv2

# For plotly graphs to be rendered properly
from plotly.offline import init_notebook_mode
init_notebook_mode()

In [None]:
# Storing the base address of the files
base_path = Path('../input/cassava-leaf-disease-classification')
train_img_dir = base_path /'train_images'
test_img_dir = base_path /'test_images'

# reading the train.csv file and the json file with the labels mapped to disease names
train_df = pd.read_csv(base_path/'train.csv')
diseaseMapping = pd.read_json(base_path/'label_num_to_disease_map.json', typ='series')

# List of all train and test Images
train_images = os.listdir(base_path/'train_images/')
test_images = os.listdir(base_path/'test_images/')

<a id= "mappings" ></a>
<h1 style="border:2px solid LightGreen;text-align:center">Disease Mappings</h1>

In [None]:
diseaseMapping

There are 5 classes to predict in this dataset:

* **Healthy** -> The Leaf is healthy
* **Cassava Bacterial Blight (CBB)**
* **Cassava Brown Streak Disease (CBSD)**
* **Cassava Green Mottle (CGM)**
* **Cassava Mosaic Disease (CMD)**


In [None]:
# Converting into a Dictionary
mappingDict = diseaseMapping.to_dict()

<a id= "trainds"></a>
<h1 style="border:2px solid LightGreen;text-align:center">Training Dataset</h1>

In [None]:
train_df.head()

There is nothing fancy here: just the image name and the label associated with each image. Since the labels are given as numbers, we can change them to their corresponding disease names using the mapping provided.

In [None]:
# Replacing Numeric Labels with Disease Names
train_df = train_df.replace(mappingDict)

In [None]:
# Counting the Number of Training Samples for each Label
labelCounts = train_df['label'].value_counts().reset_index()
labelCounts.columns = ['Label', 'Number of Observations']

# Plotting a Pie Chart to show the Distribution
fig = px.pie(labelCounts, 
             names = 'Label',values='Number of Observations', 
             labels = mappingDict, 
             title = 'Distribution of Labels in the Training Dataset',
             color_discrete_sequence=px.colors.sequential.Greens_r)
fig.show()

Only around 12% of the images show healthy leaves; the rest show diseased leaves.

Images of Cassava Mosaic Disease (CMD) are the most abundant, making up more than half of the dataset.
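The shares quoted above can be computed directly rather than read off the pie chart. Here is a minimal sketch, using a small made-up `label` column in place of `train_df['label']` (the toy counts are an assumption chosen only for illustration and do not reflect the real class balance):

```python
import pandas as pd

# Toy stand-in for train_df['label']; the real column holds the disease names.
toy_labels = pd.Series(
    ["CMD"] * 6 + ["Healthy"] + ["CBB"] + ["CBSD"] + ["CGM"],
    name="label",
)

# normalize=True returns fractions instead of raw counts;
# multiplying by 100 turns them into percentages.
label_share = toy_labels.value_counts(normalize=True).mul(100).round(1)
print(label_share)
```

On the real data, the same call would be applied to `train_df['label']`.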

In [None]:
uniqueIds = train_df['image_id'].nunique()
if(uniqueIds == len(train_df)):
    print('There are no repeating Image IDs in the dataset')
else:
    print(f'There are {len(train_df) - uniqueIds} repeating Image IDs')

The training dataset has no repeated image IDs. However, it may still contain duplicate images stored under different IDs.
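One way to check for such duplicates is to hash the raw bytes of every file and group files that share a hash. The helper below is a hypothetical sketch (the name `find_exact_duplicates` is not part of the original notebook), and it only catches byte-identical copies; resized or re-encoded copies of the same photo would need a perceptual comparison instead.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_exact_duplicates(image_paths):
    """Group files whose raw bytes are identical (illustrative helper)."""
    groups = defaultdict(list)
    for path in image_paths:
        path = Path(path)
        # Identical file contents always produce the same digest.
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        groups[digest].append(path.name)
    # Keep only digests shared by more than one file.
    return [sorted(names) for names in groups.values() if len(names) > 1]
```

In this notebook it could be called as `find_exact_duplicates(train_img_dir / name for name in train_images)`; an empty list would mean no byte-identical files were found.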

<a id= "imgtrainds" ></a>
<h1 style="border:2px solid LightGreen;text-align:center">Training Image Dataset</h1>

In [None]:
print(f'There are {len(train_images)} training images in the dataset')

In [None]:
healthyImages = train_df[train_df['label'] == 'Healthy']['image_id'].to_list()
cbbImages = train_df[train_df['label'] == 'Cassava Bacterial Blight (CBB)']['image_id'].to_list()
cbsdImages = train_df[train_df['label'] == 'Cassava Brown Streak Disease (CBSD)']['image_id'].to_list()
cgmImages = train_df[train_df['label'] == 'Cassava Green Mottle (CGM)']['image_id'].to_list()
cmdImages = train_df[train_df['label'] == 'Cassava Mosaic Disease (CMD)']['image_id'].to_list()

In [None]:
''' code modified from Parul Pandey's notebook
https://www.kaggle.com/parulpandey/melanoma-classification-eda-starter
'''
def showImages(images):

    # Pick 9 distinct random images (sampling without replacement)
    random_images = np.random.choice(images, size=9, replace=False)

    # Adjust the size of your images
    plt.figure(figsize=(10,8))

    # Iterate and plot random images
    for i in range(9):
        plt.subplot(3, 3, i + 1)
        img = plt.imread(train_img_dir/random_images[i])
        plt.imshow(img, cmap='gray')
        plt.axis('off')

    # Adjust subplot parameters to give specified padding
    plt.tight_layout()   

In [None]:
''' code used from Parul Pandey's notebook
https://www.kaggle.com/parulpandey/melanoma-classification-eda-starter
'''

def showHistogram(sample_img, title):
    f = plt.figure(figsize=(16,8))
    f.add_subplot(1,2, 1)

    raw_image = plt.imread(train_img_dir/sample_img)
    plt.imshow(raw_image, cmap='gray')
    plt.colorbar()
    plt.title(title)
    print(f"Image dimensions:  {raw_image.shape[0],raw_image.shape[1]}")
    print(f"Maximum pixel value : {raw_image.max():.1f} ; Minimum pixel value:{raw_image.min():.1f}")
    print(f"Mean value of the pixels : {raw_image.mean():.1f} ; Standard deviation : {raw_image.std():.1f}")

    f.add_subplot(1,2, 2)

    #_ = plt.hist(raw_image.ravel(),bins = 256, color = 'orange',)
    _ = plt.hist(raw_image[:, :, 0].ravel(), bins = 256, color = 'red', alpha = 0.5)
    _ = plt.hist(raw_image[:, :, 1].ravel(), bins = 256, color = 'Green', alpha = 0.5)
    _ = plt.hist(raw_image[:, :, 2].ravel(), bins = 256, color = 'Blue', alpha = 0.5)
    _ = plt.xlabel('Intensity Value')
    _ = plt.ylabel('Count')
    _ = plt.legend(['Red_Channel', 'Green_Channel', 'Blue_Channel'])
    plt.show()

In [None]:
'''
Inspired and modified from Tarun Paparaju's Work
https://www.kaggle.com/tarunpaparaju/plant-pathology-2020-eda-models
'''

def load_image(image_id):
    image = cv2.imread(str(train_img_dir/image_id))
    return cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

def showChannelDistribution(images, leafType):
    imageArray = [load_image(image_id) for image_id in images]
    
    red_values = [np.mean(imageArray[idx][:, :, 0]) for idx in range(len(imageArray))]
    green_values = [np.mean(imageArray[idx][:, :, 1]) for idx in range(len(imageArray))]
    blue_values = [np.mean(imageArray[idx][:, :, 2]) for idx in range(len(imageArray))]
    values = [np.mean(imageArray[idx]) for idx in range(len(imageArray))]
    
    hist_data = [red_values, green_values, blue_values, values]
    group_labels = ['Red', 'Green', 'Blue', 'All']

    fig = ff.create_distplot(hist_data, group_labels,colors = ['red', 'green','blue','grey'])
    fig.update_layout(template = 'plotly_white', title_text = f'Channel Distribution - {leafType}')
    fig.show()
    return hist_data

In [None]:
def showBoxPlot(histData, leafType):
    figData = []
    for i, name in zip(range(3), ['Red', 'Green', 'Blue']):
        trace = go.Box(y = histData[i], name = name, boxpoints='all', marker_color  = name)
        figData.append(trace)

    fig = go.Figure(figData)
    fig.update_layout(title_text = f'Pixel Intensity Distribution - {leafType}', template = 'plotly_white')
    fig.show() 

<a id= "hists" ></a>
<h1 style="border:2px solid Blue;text-align:center">Why care about Image Histograms?</h1>

In image processing, histograms are used to depict many aspects of the image we are working with, such as:
- Exposure
- Contrast
- Dynamic Range
- Saturation

and many more. 

By visualizing the histogram, we can judge how to improve the visual appearance of an image, and by comparing the histograms of an image we can find out what type of image processing may have been applied to it.

Source : [Histogram in Image Processing with skImage-Python](https://towardsdatascience.com/histograms-in-image-processing-with-skimage-python-be5938962935)
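To make the link between a histogram and properties such as exposure and contrast concrete, here is a rough sketch on synthetic images. The helper name `histogramSummary`, the random test images, and the use of the 1st and 99th percentiles as a dynamic-range proxy are all assumptions made for illustration, not part of the original analysis.

```python
import numpy as np


def histogramSummary(image):
    """Summarise an 8-bit image through its intensity histogram (illustrative helper)."""
    # One bin per intensity level 0..255.
    counts, _ = np.histogram(image.ravel(), bins=256, range=(0, 256))
    low, high = np.percentile(image, [1, 99])
    return {
        "counts": counts,
        "mean_intensity": float(image.mean()),  # rough proxy for exposure
        "std_intensity": float(image.std()),    # rough proxy for contrast
        "dynamic_range": float(high - low),     # spread between 1st and 99th percentile
    }


rng = np.random.default_rng(0)
# Synthetic stand-ins for an under-exposed and an over-exposed photo.
dark = rng.integers(0, 80, size=(64, 64, 3), dtype=np.uint8)
bright = rng.integers(150, 256, size=(64, 64, 3), dtype=np.uint8)

darkSummary = histogramSummary(dark)
brightSummary = histogramSummary(bright)
print(darkSummary["mean_intensity"], brightSummary["mean_intensity"])
```

The dark image puts all of its counts in the low bins and the bright image in the high bins, which is the same pattern one would look for in the per-channel histograms plotted below.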

<a id= "healthy" ></a>
<h1 style="border:2px solid LightGreen;text-align:center">Healthy Leaves</h1>

In [None]:
showImages(healthyImages)

In [None]:
showHistogram(healthyImages[0], 'Healthy Image')

In [None]:
data = showChannelDistribution(healthyImages, 'Healthy')

In [None]:
showBoxPlot(data, 'Healthy Leaves')

We see a difference in the median pixel intensities of the three channels:
- Red - 108
- Green - 126
- Blue - 80

<a id= "cbb"></a>
<h1 style="border:2px solid LightGreen;text-align:center">CBB Images</h1>

In [None]:
showImages(cbbImages)

In [None]:
showHistogram(cbbImages[0], 'CBB Image')

In [None]:
data = showChannelDistribution(cbbImages, 'CBB Images')

In [None]:
showBoxPlot(data, 'CBB Images')

For CBB Images, the median values of the pixel intensities for the 3 channels are:
- Red - 102
- Green - 117
- Blue - 66

The median for the Blue channel is lower than for healthy images (66 vs. 80). This makes the gap between the Red and Blue channel medians more striking: 36 for CBB images compared with 28 for healthy leaves.

<a id= "cbsd" ></a>
<h1 style="border:2px solid LightGreen;text-align:center">CBSD Images</h1>

In [None]:
showImages(cbsdImages)

In [None]:
showHistogram(cbsdImages[0], 'CBSD Image')

In [None]:
data = showChannelDistribution(cbsdImages, 'CBSD Images')

In [None]:
showBoxPlot(data, 'CBSD Images')

The median values for different channels for CBSD Images are:
- Red - 106
- Green - 123
- Blue - 72

<a id= "cgm" ></a>
<h1 style="border:2px solid LightGreen;text-align:center">CGM Images</h1>

In [None]:
showImages(cgmImages)

In [None]:
showHistogram(cgmImages[0], 'CGM Image')

In [None]:
data = showChannelDistribution(cgmImages, 'CGM Images')

In [None]:
showBoxPlot(data, 'CGM Images')

The median pixel intensities of the 3 channels for CGM images are:

- Red - 113
- Green - 128
- Blue - 85

<a id= "cmd" ></a>
<h1 style="border:2px solid LightGreen;text-align:center">CMD Images</h1>

In [None]:
showImages(cmdImages)

In [None]:
showHistogram(cmdImages[0], 'CMD Image')

In [None]:
# Doing this for the First 2k CMD Images, as doing this for all the images crashes the notebook
data = showChannelDistribution(cmdImages[:2000], 'CMD Images')

In [None]:
showBoxPlot(data, 'CMD Images')

The median pixel intensities for the first 2k CMD images are:
- Red - 110
- Green - 128
- Blue - 80

### Insights
- CGM images have the highest median RGB values
- CBB images have the lowest median RGB values
- The channel medians follow the same ordering for every class: G > R > B

In [None]:
channelIntensityDf = pd.DataFrame(
    {
        'Leaf Type' : ['Healthy', 'CBB','CBSD', 'CGM', 'CMD'], 
        'Red Channel Median' : [108,102,106,113,110],
        'Green Channel Median' : [126,117,123,128,128],
        'Blue Channel Median' : [80,66,72,85,80]
    }
)

channelIntensityDf.style.background_gradient(cmap='Greens', axis = 0)
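The values in this table were read off the box plots by hand. As a sketch of how they could be computed instead, the hypothetical helper `channelMedians` below takes per-image channel means in the same `[red, green, blue, all]` layout that `showChannelDistribution` returns. The numbers fed to it here are made up for illustration.

```python
import numpy as np


def channelMedians(histData):
    """Median of the per-image mean intensity for each colour channel."""
    names = ["Red", "Green", "Blue"]
    # histData[3] (the all-channel mean) is skipped, as in showBoxPlot.
    return {name: float(np.median(histData[i])) for i, name in enumerate(names)}


# Made-up per-image channel means for three images.
toyHistData = [
    [100.0, 110.0, 120.0],  # red
    [120.0, 125.0, 130.0],  # green
    [60.0, 80.0, 70.0],     # blue
    [93.3, 105.0, 106.7],   # all channels
]
medians = channelMedians(toyHistData)
print(medians)
```

Calling it on the `data` returned for each leaf type would give one row of the table per class.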

<a id = "imageaug" ></a>
<h1 style="border:2px solid Purple;text-align:center">Image Augmentations</h1>

<a id = "imageaugtens" ></a>
<h1 style="border:2px solid LightGreen;text-align:center">Image Augmentations - Tensorflow</h1>

TensorFlow offers many image augmentations as part of its `tf.image` module and many more as part of TensorFlow Addons.

Link to the tf.image module docs -> https://www.tensorflow.org/api_docs/python/tf/image

Link to tensorflow addons image module docs -> https://www.tensorflow.org/addons/api_docs/python/tfa/image/

In [None]:
def augmentImage(imageFile, seed = 0):
    image = tf.io.read_file(str(train_img_dir/imageFile))
    image = tf.image.decode_jpeg(image,channels = 3)
    actual_image = image
    brightness = tf.image.random_brightness(image, 0.2, seed = seed)
    contrast = tf.image.random_contrast(image, 0.2,0.3, seed = seed)
    crop = tf.image.random_crop(image, size = [448,448,3], seed = seed)
    left_right = tf.image.flip_left_right(image) # replace with random_flip_left_right when using as part of an augmentation pipeline
    up_down = tf.image.flip_up_down(image) # replace with random_flip_up_down when using as part of an augmentation pipeline
    hue = tf.image.random_hue(image, 0.2, seed = seed)
    saturation = tf.image.random_saturation(image, 5,10, seed = seed)
    jpeg_quality = tf.image.random_jpeg_quality(image, 75,85)
    
    return (
        actual_image, 
        brightness,
        contrast,
        crop,
        left_right,
        up_down,
        hue,
        saturation, 
        jpeg_quality
    )

In [None]:
augmentedImages = augmentImage(healthyImages[0])
plt.figure(figsize=(10, 10))
for i, imageName in zip(range(9), ['Input Image', 'Augmented - Brightness','Augmented - Contrast','Augmented - Crop','Augmented - Horizontal Flip',
                                  'Augmented - Vertical Flip','Augmented - Hue','Augmented - Saturation','Augmented - Jpeg Quality']):
    ax = plt.subplot(3, 3, i + 1)
    plt.imshow(augmentedImages[i].numpy().astype("uint8"))
    plt.title(imageName)
    plt.axis("off")

<a id = "imageaugpy" ></a>
<h1 style="border:2px solid LightGreen;text-align:center">Image Augmentations - Pytorch</h1>

You can find more image augmentation examples in the official PyTorch documentation.

Link to the docs -> https://pytorch.org/docs/stable/torchvision/transforms.html

In [None]:
def augmentImage(imageFile):
    image = load_image(imageFile)
    image = torchvision.transforms.ToPILImage()(image)
    actual_image = image
    brightness = torchvision.transforms.ColorJitter(brightness=0.2)(image)
    contrast = torchvision.transforms.ColorJitter(contrast=(0.2,0.3))(image)
    crop = torchvision.transforms.RandomCrop((448,448))(image)
    left_right = torchvision.transforms.RandomHorizontalFlip(p = 1.0)(image)
    up_down = torchvision.transforms.RandomVerticalFlip(p = 1.0)(image)
    hue = torchvision.transforms.ColorJitter(hue=0.2)(image)
    saturation = torchvision.transforms.ColorJitter(saturation=(0.05,0.1))(image)
    perspective = torchvision.transforms.RandomPerspective(p= 1.0)(image)
    
    return (
        actual_image, 
        brightness,
        contrast,
        crop,
        left_right,
        up_down,
        hue,
        saturation, 
        perspective
    )

In [None]:
augmentedImages = augmentImage(healthyImages[0])
plt.figure(figsize=(10, 10))
for i, imageName in zip(range(9), ['Input Image', 'Augmented - Brightness','Augmented - Contrast','Augmented - Crop','Augmented - Horizontal Flip',
                                  'Augmented - Vertical Flip','Augmented - Hue','Augmented - Saturation','Augmented - Perspective']):
    ax = plt.subplot(3, 3, i + 1)
    plt.imshow(augmentedImages[i])
    plt.title(imageName)
    plt.axis("off")

<a id = "imageaugalbu" ></a>
<h1 style="border:2px solid LightGreen;text-align:center">Image Augmentations - Albumentations</h1>

As Kagglers, we should be more aware of Albumentations as an Augmentation library. It blends well with both tensorflow and pytorch. 

#### Why use Albumentations?

> Albumentations is a Python library for fast and flexible image augmentations. Albumentations efficiently implements a rich variety of image transform operations that are optimized for performance, and does so while providing a concise, yet powerful image augmentation interface for different computer vision tasks, including object classification, segmentation, and detection.

Albumentations offers a wide variety of augmentations for all sorts of computer vision tasks. You can read more about the available augmentations at -> https://albumentations.ai/docs/api_reference/augmentations/transforms/


In [None]:
def augmentImage(imageFile):
    image = load_image(imageFile)
    actual_image = image
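    # RandomBrightness and RandomContrast are deprecated in favour of RandomBrightnessContrast
    brightness = A.RandomBrightnessContrast(brightness_limit = 0.2, contrast_limit = 0, p = 1.0)(image = image)['image']
    contrast = A.RandomBrightnessContrast(brightness_limit = 0, contrast_limit = 0.2, p = 1.0)(image = image)['image']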
    crop = A.RandomCrop(448,448)(image = image)['image']
    left_right = A.HorizontalFlip(p = 1.0)(image = image)['image']
    up_down = A.VerticalFlip(p = 1.0)(image = image)['image']
    hue = A.ColorJitter(hue=0.2,brightness=0,saturation=0, contrast=0,p=1.0)(image = image)['image']
    saturation = A.ColorJitter(hue=0,brightness=0,saturation=0.2, contrast=0,p=1.0)(image = image)['image']
    downscale = A.Downscale(scale_min = 0.25, scale_max = 0.25,p= 1.0)(image = image)['image']
    
    return (
        actual_image, 
        brightness,
        contrast,
        crop,
        left_right,
        up_down,
        hue,
        saturation, 
        downscale
    )

In [None]:
augmentedImages = augmentImage(healthyImages[0])
plt.figure(figsize=(10, 10))
for i, imageName in zip(range(9), ['Input Image', 'Augmented - Brightness','Augmented - Contrast','Augmented - Crop','Augmented - Horizontal Flip',
                                  'Augmented - Vertical Flip','Augmented - Hue','Augmented - Saturation','Augmented - Quality']):
    ax = plt.subplot(3, 3, i + 1)
    plt.imshow(augmentedImages[i])
    plt.title(imageName)
    plt.axis("off")

## Work in progress