## Overview

Cassava anthracnose disease (CAD) is widespread in most of the cassava-growing regions of Africa. The disease is caused by a fungus (Colletotrichum gloeosporioides) that can also cause disease in other food crops. CAD is estimated to cause yield losses of 30% or more in susceptible cultivars. The disease affects both leaf and stem production. Severe anthracnose attacks can kill stems, which reduces the availability of planting material, especially in large-scale production systems.

**Four kinds of disease are introduced here**

**1. Cassava Bacterial Blight (CBB):** At first, angular, water-soaked spots occur on the leaves which are restricted by the veins; the spots are more clearly seen on the lower leaf surface. The spots expand rapidly, join together, especially along the margins of the leaves, and turn brown with yellow borders (in the picture, on the left). Droplets of a creamy-white ooze occur at the center of the spots; later, they turn yellow. Stem infections block the flow of water and food and the leaves above wilt, die, and fall, and branches die back (in the picture, on the right). 
    
    Main characteristics to leverage: **angular spots, brown spots with yellow borders, yellow leaves, leaves wilting**
    
   <img src="https://www.googleapis.com/download/storage/v1/b/kaggle-forum-message-attachments/o/inbox%2F1865449%2Fbe9cdd94efb9b1660066ad10b55c8626%2Fbact_bright.jpeg?generation=1605827469211692&alt=media" style="width:500px;height:300px">

**2. Cassava Brown Streak Disease (CBSD):** Symptoms of cassava brown streak disease appear as patches of yellow areas mixed with normal green color. The characteristic yellow or necrotic vein banding may enlarge and coalesce to form large yellow patches. 

    The infected leaves **do not become distorted in shape as occurs with leaves infected by Cassava mosaic disease.** 
    
    Main characteristics to leverage: **yellow spots**
    
   <img src="https://www.googleapis.com/download/storage/v1/b/kaggle-forum-message-attachments/o/inbox%2F1865449%2Ffeba3dafc914d04517659650d137b77a%2Fbrown_st.jpeg?generation=1605830407530983&alt=media" style="width:500px;height:300px">

**3. Cassava Green Mottle (CGM):** Young leaves are puckered with faint to distinct yellow spots (in the picture, on the left), green patterns (mosaics), and twisted margins (in the picture, on the right). Occasionally, plants become severely stunted.

    Main characteristics to leverage: **yellow patterns, irregular patches of yellow and green, distorted leaf margins, stunting**
    
   <img src="https://www.googleapis.com/download/storage/v1/b/kaggle-forum-message-attachments/o/inbox%2F1865449%2F4f2975866feb2a1d4ef4111c2d57db29%2Fgreen_mottle.jpeg?generation=1605829101431013&alt=media" style="width:500px;height:300px">

**4. Cassava Mosaic Disease (CMD):** Cassava Mosaic Disease is characterized by severe mosaic symptoms on leaves, with affected leaves showing mottling and light-green, yellow or white spots. Discoloration, malformation, and puckering of the leaf blade occur.
    
    Main characteristics to leverage: **severe shape distortion, mosaic patterns**
    
   <img src="https://www.googleapis.com/download/storage/v1/b/kaggle-forum-message-attachments/o/inbox%2F1865449%2F36990f77ded6667e5c30d19b5405d4d3%2Fmosaic_disease.jpeg?generation=1605829705010773&alt=media" style="width:500px;height:300px">
   
Source: Discussion from **Jacopo Repossi**

In [None]:
#  importing libraries
import os
import json
import numpy as np
import pandas as pd
import missingno as msno
from glob import glob
from PIL import Image
import matplotlib.pyplot as plt
%matplotlib inline

#bokeh
from bokeh.models import ColumnDataSource  # the only bokeh model used below; Panel is no longer importable in Bokeh 3
from bokeh.plotting import figure
from bokeh.io import output_notebook, show, output_file
from bokeh.palettes import Spectral6

import warnings
warnings.filterwarnings('ignore')

### Setup directory structure

In [None]:
# Setup Directory Structure and Environment variables
BASE_DIR = "../input/cassava-leaf-disease-classification/"
TRAIN_IMAGES = os.path.join(BASE_DIR, "train_images/")
TEST_IMAGES = os.path.join(BASE_DIR, "test_images/")
TRAIN_DF = os.path.join(BASE_DIR, "train.csv")
LABELS = os.path.join(BASE_DIR, "label_num_to_disease_map.json")

In [None]:
# The mapping between each disease code and the real disease name.
with open(LABELS) as label:
    classes = json.load(label)

print(json.dumps(classes, indent = 3))

In [None]:
# Total number of training and test images
print("Number of training images: {}".format(len(os.listdir(TRAIN_IMAGES))))
print("Number of test images: {}".format(len(os.listdir(TEST_IMAGES))))

### Data Exploration

In [None]:
# reading dataframe
df_train = pd.read_csv(TRAIN_DF)
df_train.head()

In [None]:
# shape of data
df_train.shape

We have 21397 training images, so the dataframe should also have 21397 rows.
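
Comparing row counts only shows that the totals agree. It does not show that every row points at an existing file. The following is a minimal sketch of a stricter check, assuming the image IDs in the CSV are plain file names. The helper name `compare_ids` and the toy file names are hypothetical and only serve as illustration.

```python
def compare_ids(csv_ids, file_names):
    """Return (missing_files, unlabelled_files) as sorted lists.

    missing_files:    IDs listed in the CSV with no matching file on disk.
    unlabelled_files: files on disk that have no row in the CSV.
    """
    csv_set = set(csv_ids)
    file_set = set(file_names)
    missing_files = sorted(csv_set - file_set)
    unlabelled_files = sorted(file_set - csv_set)
    return missing_files, unlabelled_files


# Toy data (hypothetical file names, not taken from the dataset)
csv_ids = ["a.jpg", "b.jpg", "c.jpg"]
file_names = ["a.jpg", "b.jpg", "d.jpg"]

missing, unlabelled = compare_ids(csv_ids, file_names)
print("In CSV but not on disk:", missing)
print("On disk but not in CSV:", unlabelled)
```

In this notebook the call would look like `compare_ids(df_train["image_id"], os.listdir(TRAIN_IMAGES))`; two empty lists mean the CSV and the image folder agree.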

In [None]:
# checking for null values
print(df_train.isnull().sum())

msno.matrix(df_train, color=(207/255, 196/255, 171/255), fontsize = 10)

Every image has a label, and no data is missing.

The dataframe holds only the label code, while the code-to-name mapping lives in the JSON file. So, let's add class_name and image path columns to the dataframe.

In [None]:
# mapping class_name to train.csv
df_train["class_name"] = df_train["label"].astype(str).map(classes)
df_train.head()
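
The `.astype(str)` in the cell above matters because JSON object keys are always strings, while the `label` column holds integers. A small sketch of the difference, using a hypothetical two-class mapping rather than the real file:

```python
import json

# Hypothetical mapping, shaped like label_num_to_disease_map.json
raw = '{"0": "Class A", "1": "Class B"}'
classes_demo = json.loads(raw)

labels = [0, 1, 0]  # integer codes, as in a label column

# Looking up with string keys works ...
names_str = [classes_demo.get(str(label)) for label in labels]
# ... looking up with the raw integers finds nothing
names_int = [classes_demo.get(label) for label in labels]

# Alternative: convert the keys once and keep the labels as integers
int_keyed = {int(key): name for key, name in classes_demo.items()}
names_alt = [int_keyed[label] for label in labels]

print(names_str)
print(names_int)
print(names_alt)
```

Either direction works; converting the dictionary keys once avoids casting the whole column each time the mapping is applied.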

In [None]:
# mapping image path to train.csv
df_train["path"] = df_train["image_id"].map(lambda x: os.path.join(TRAIN_IMAGES, x))
df_train.head()

The images fall into 5 categories. Let's count and visualize the number of images per class.

In [None]:
df_train["class_name"].value_counts(sort=True)
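
Raw counts can be easier to compare as shares of the total. The sketch below uses only the standard library and made-up labels; the helper names `class_shares` and `imbalance_ratio` are hypothetical. It is one possible way to summarise how uneven the classes are, not part of the original analysis.

```python
from collections import Counter


def class_shares(labels):
    """Map each class to its fraction of all labels, most common first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.most_common()}


def imbalance_ratio(labels):
    """Size of the largest class divided by the size of the smallest."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Toy labels (made up for illustration)
toy_labels = ["CMD"] * 6 + ["Healthy"] * 3 + ["CBB"] * 1

shares = class_shares(toy_labels)
ratio = imbalance_ratio(toy_labels)
print(shares)
print(ratio)
```

With pandas, `df_train["class_name"].value_counts(normalize=True)` gives the same shares directly.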

In [None]:
# number of images in each category

# Take the category order from value_counts itself so names and counts always line up
class_counts = df_train["class_name"].value_counts(sort=True)
Categories = list(class_counts.index)
counts = class_counts.tolist()

# Spectral6 has six colours; keep one per category so every column has the same length
source = ColumnDataSource(data = dict(Categories = Categories, counts = counts, color = list(Spectral6[:len(Categories)])))

p = figure(x_range = Categories, y_range = (0,15000),plot_width = 1000,plot_height = 500, title = "Distribution of the number of images in the training set",
           tools = "hover, pan, box_zoom, wheel_zoom, reset, save", tooltips = ("@Categories: @counts"))

p.vbar(x = 'Categories', top = 'counts', width = 0.9, color = 'color', legend_field = "Categories", source = source)

p.xgrid.grid_line_color = None
p.legend.orientation = "horizontal"
p.legend.location = "top_center"

output_notebook()
show(p)

### Images Exploration

In [None]:
CMD = df_train[df_train['class_name'] == 'Cassava Mosaic Disease (CMD)']
Healthy = df_train[df_train['class_name'] == 'Healthy']
CGM = df_train[df_train['class_name'] == 'Cassava Green Mottle (CGM)']
CBSD = df_train[df_train['class_name'] == 'Cassava Brown Streak Disease (CBSD)']
CBB = df_train[df_train['class_name'] == 'Cassava Bacterial Blight (CBB)']

In [None]:
# Extract 9 random images from CMD
random_images = [np.random.choice((CMD['image_id'].values)) for i in range(9)]

print('Display CMD Images')

# Adjust the size of your images
plt.figure(figsize=(10,8))

# Iterate and plot random images
for i in range(9):
    plt.subplot(3, 3, i + 1)
    img = plt.imread(os.path.join(TRAIN_IMAGES, random_images[i]))
    plt.imshow(img, cmap = 'gray')
    plt.axis('off')
    
# Adjust subplot parameters to give specified padding
plt.tight_layout() 

In [None]:
# Extract 9 random images from Healthy
random_images = [np.random.choice((Healthy['image_id'].values)) for i in range(9)]

print('Display Healthy Images')

# Adjust the size of your images
plt.figure(figsize=(10,8))

# Iterate and plot random images
for i in range(9):
    plt.subplot(3, 3, i + 1)
    img = plt.imread(os.path.join(TRAIN_IMAGES, random_images[i]))
    plt.imshow(img, cmap = 'gray')
    plt.axis('off')
    
# Adjust subplot parameters to give specified padding
plt.tight_layout() 

In [None]:
# Extract 9 random images from CGM
random_images = [np.random.choice((CGM['image_id'].values)) for i in range(9)]

print('Display CGM Images')

# Adjust the size of your images
plt.figure(figsize=(10,8))

# Iterate and plot random images
for i in range(9):
    plt.subplot(3, 3, i + 1)
    img = plt.imread(os.path.join(TRAIN_IMAGES, random_images[i]))
    plt.imshow(img, cmap = 'gray')
    plt.axis('off')
    
# Adjust subplot parameters to give specified padding
plt.tight_layout() 

In [None]:
# Extract 9 random images from CBSD
random_images = [np.random.choice((CBSD['image_id'].values)) for i in range(9)]

print('Display CBSD Images')

# Adjust the size of your images
plt.figure(figsize=(10,8))

# Iterate and plot random images
for i in range(9):
    plt.subplot(3, 3, i + 1)
    img = plt.imread(os.path.join(TRAIN_IMAGES, random_images[i]))
    plt.imshow(img, cmap = 'gray')
    plt.axis('off')
    
# Adjust subplot parameters to give specified padding
plt.tight_layout() 

In [None]:
# Extract 9 random images from CBB
random_images = [np.random.choice((CBB['image_id'].values)) for i in range(9)]

print('Display CBB Images')

# Adjust the size of your images
plt.figure(figsize=(10,8))

# Iterate and plot random images
for i in range(9):
    plt.subplot(3, 3, i + 1)
    img = plt.imread(os.path.join(TRAIN_IMAGES, random_images[i]))
    plt.imshow(img, cmap = 'gray')
    plt.axis('off')
    
# Adjust subplot parameters to give specified padding
plt.tight_layout() 
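
The cells above call `np.random.choice` once per image, which samples with replacement, so the same image can show up more than once in a grid. Below is a minimal sketch of drawing distinct images instead, using only the standard library. The helper name `sample_unique`, the `seed` parameter, and the file names are hypothetical.

```python
import random


def sample_unique(image_ids, k=9, seed=None):
    """Pick up to k distinct IDs; fewer if the class has fewer than k images."""
    ids = list(image_ids)
    rng = random.Random(seed)  # a seed makes the selection reproducible
    return rng.sample(ids, min(k, len(ids)))


# Toy IDs (hypothetical file names)
ids = ["img_{}.jpg".format(i) for i in range(20)]

picked = sample_unique(ids, k=9, seed=0)
print(picked)
```

In the notebook this could replace the list comprehension, for example `random_images = sample_unique(CMD['image_id'].values)`, with the plotting loop left unchanged.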

In [None]:
# image size distribution
images_shape = []

for image_name in df_train['image_id']:
    # PIL reports size as (width, height); the context manager closes each file
    with Image.open(os.path.join(TRAIN_IMAGES, image_name)) as image:
        images_shape.append(image.size)

images_shape_df = pd.DataFrame(data = images_shape, columns = ['W', 'H'], dtype='object')
images_shape_df['Size'] = '[' + images_shape_df['W'].astype(str) + ',' + images_shape_df['H'].astype(str) + ']'

In [None]:
images_shape_df.head()

In [None]:
print("The training images come in {} distinct size(s)".format(images_shape_df['Size'].nunique()))

All training images have the same size: 800x600.
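
If a dataset did contain mixed sizes, a count per size is more informative than the number of distinct sizes alone. The sketch below assumes the sizes are `(width, height)` tuples, which is the order PIL's `Image.size` uses. The helper name `summarise_sizes` and the toy sizes are hypothetical.

```python
from collections import Counter


def summarise_sizes(sizes):
    """Count how often each (width, height) pair occurs, most common first."""
    counts = Counter(tuple(size) for size in sizes)
    return counts.most_common()


# Toy sizes (made up): four landscape images and one portrait image
toy_sizes = [(800, 600)] * 4 + [(600, 800)]

summary = summarise_sizes(toy_sizes)
for (width, height), n in summary:
    print("{}x{}: {} image(s)".format(width, height, n))
```

Applied to the `images_shape` list built above, a single entry in the result confirms that every image shares one size.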