
<h2><center>Cassava Leaf Disease Classification. Data analysis and visualization.</center></h2>

<center><img src="https://storage.googleapis.com/kaggle-competitions/kaggle/13277/logos/header.png?t=2019-03-08-20-57-32"></center>

### This competition will challenge you to distinguish between several diseases that cause material harm to the food supply of many African countries. In some cases the main remedy is to burn the infected plants to prevent further spread, which can make a rapid automated turnaround quite useful to the farmers.

<a id="top"></a>

<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='background:black; border:0' role="tab" aria-controls="home"><center>Quick navigation</center></h3>

* [1. train.csv](#1)
* [2. Image samples](#2)
* * [2.1 Healthy](#2.1)
* * [2.2 Cassava Mosaic Disease (CMD)](#2.2)
* * [2.3 Cassava Green Mottle (CGM)](#2.3)
* * [2.4 Cassava Brown Streak Disease (CBSD)](#2.4)
* * [2.5 Cassava Bacterial Blight (CBB)](#2.5)
* [3. Image size analysis](#3)
* [4. List of unusual samples for Cassava Brown Streak Disease](#4)
* [5. List of outliers](#5)

In [None]:
import numpy as np
import pandas as pd
import random
import os
        
import plotly.express as px
import json

import matplotlib.pyplot as plt
import cv2

<a id="1"></a>
<h2 style='background:black; border:0; color:white'><center>1. train.csv</center></h2>

**train.csv**

* `image_id`: the image file name.

* `label`: the ID code for the disease.

In [None]:
train = pd.read_csv("../input/cassava-leaf-disease-classification/train.csv")
train

In [None]:
# Share of each label in the training set (fractions that sum to 1)
ds = train['label'].value_counts(normalize=True).reset_index()
ds.columns = ['label', 'percent']

fig = px.pie(
    ds, 
    names='label', 
    values='percent', 
    title='Label distribution',
    width=800,
    height=500 
)

fig.show()

**label_num_to_disease_map.json** The mapping between each disease code and the real disease name.

In [None]:
with open("../input/cassava-leaf-disease-classification/label_num_to_disease_map.json") as f:
    map_dis = json.load(f)

print(json.dumps(map_dis, indent=4))

Combining the pie chart with this mapping, the most common class is label 3, "Cassava Mosaic Disease (CMD)".
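One detail worth keeping in mind when joining the two files: JSON object keys are always strings, so the integer codes in the `label` column need converting before lookup. A minimal sketch with toy data (`name_counts`, `map_dis_example` and `toy_labels` are hypothetical names, and the toy labels are not taken from `train.csv`):

```python
import json
from collections import Counter

# A two-entry excerpt in the same shape json.load returns: keys are strings.
map_dis_example = json.loads('{"3": "Cassava Mosaic Disease (CMD)", "4": "Healthy"}')

def name_counts(labels, mapping):
    """Count integer labels and return {disease name: count}, most common first."""
    counts = Counter(labels)
    return {mapping[str(code)]: n for code, n in counts.most_common()}

toy_labels = [3, 3, 4, 3]  # hypothetical labels for illustration only
named = name_counts(toy_labels, map_dis_example)
print(named)
```

In the notebook this could be called as `name_counts(train['label'].tolist(), map_dis)`.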

**Cassava mosaic disease (CMD)** spreads primarily through stem cuttings taken from infected cassava plants. Secondary spread can occur within and between fields via the whitefly vector *Bemisia tabaci*.

**[train/test]_images**: the image files. The full set of test images will only be available to your notebook when it is submitted for scoring. Expect to see roughly 15,000 images in the test set.

<a id="2"></a>
<h2 style='background:black; border:0; color:white'><center>2. Image samples</center></h2>

In [None]:
def plot_images(class_id, label, images_number):
    
    # Draw a random sample of image ids from the requested class
    plot_list = train[train["label"] == class_id].sample(images_number)['image_id'].tolist()
    labels = [label] * len(plot_list)
    # Side of the smallest square grid that fits all images; plt.subplot needs integers
    size = int(np.ceil(np.sqrt(images_number)))
        
    plt.figure(figsize=(20, 20))
    
    for ind, (image_id, label) in enumerate(zip(plot_list, labels)):
        plt.subplot(size, size, ind + 1)
        image = cv2.imread(os.path.join('../input/cassava-leaf-disease-classification/', "train_images", image_id))
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

        plt.imshow(image)
        plt.title(label, fontsize=12)
        plt.axis("off")
    
    plt.show()
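The grid-size logic in `plot_images` amounts to finding the smallest integer whose square is at least `images_number`. A minimal sketch of that calculation on its own, using only the standard library (`grid_side` is a hypothetical helper name, not part of the notebook):

```python
import math

def grid_side(images_number):
    """Return the smallest integer s such that s * s >= images_number."""
    if images_number < 1:
        raise ValueError("images_number must be positive")
    side = math.isqrt(images_number)  # floor of the exact integer square root
    if side * side < images_number:
        side += 1
    return side

for n in (1, 16, 17):
    print(n, grid_side(n))
```

An integer result matters here because `plt.subplot` expects integer row and column counts.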

<a id="2.1"></a>
<h2 style='background:black; border:0; color:white'><center>2.1. Healthy</center></h2>

In [None]:
plot_images(
    class_id=4, 
    label='Healthy',
    images_number=16
)

<a id="2.2"></a>
<h2 style='background:black; border:0; color:white'><center>2.2. Cassava Mosaic Disease (CMD)</center></h2>

In [None]:
plot_images(
    class_id=3, 
    label='Cassava Mosaic Disease (CMD)',
    images_number=16
)

<a id="2.3"></a>
<h2 style='background:black; border:0; color:white'><center>2.3. Cassava Green Mottle (CGM)</center></h2>

In [None]:
plot_images(
    class_id=2, 
    label='Cassava Green Mottle (CGM)',
    images_number=16
)

<a id="2.4"></a>
<h2 style='background:black; border:0; color:white'><center>2.4. Cassava Brown Streak Disease (CBSD)</center></h2>

In [None]:
plot_images(
    class_id=1, 
    label='Cassava Brown Streak Disease (CBSD)',
    images_number=16
)

<a id="2.5"></a>
<h2 style='background:black; border:0; color:white'><center>2.5. Cassava Bacterial Blight (CBB)</center></h2>

In [None]:
plot_images(
    class_id=0, 
    label='Cassava Bacterial Blight (CBB)',
    images_number=16
)

<a id="3"></a>
<h2 style='background:black; border:0; color:white'><center>3. Image size analysis</center></h2>

Let's check the shapes of all images in the training set.

In [None]:
%%time

train_dir = '/kaggle/input/cassava-leaf-disease-classification/train_images/'
check_dict = dict()  # maps image shape -> number of images with that shape

for filename in os.listdir(train_dir):
    img = cv2.imread(os.path.join(train_dir, filename))
    if img is None:  # cv2.imread returns None when a file cannot be decoded
        continue
    check_dict[img.shape] = check_dict.get(img.shape, 0) + 1

In [None]:
check_dict
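The resulting dictionary maps each `(height, width, channels)` shape to the number of images with that shape. One possible way to read it is to sort by count and attach each shape's share of the total. The sketch below uses a made-up dictionary, not the real output of the cell above, and `summarize_shapes` is a hypothetical helper:

```python
def summarize_shapes(shape_counts):
    """Return (shape, count, share) tuples sorted from most to least frequent."""
    total = sum(shape_counts.values())
    rows = sorted(shape_counts.items(), key=lambda item: item[1], reverse=True)
    return [(shape, count, count / total) for shape, count in rows]

# Hypothetical counts for illustration only; cv2 shapes are (height, width, channels).
example_counts = {(480, 640, 3): 1, (600, 800, 3): 9}

summary = summarize_shapes(example_counts)
for shape, count, share in summary:
    print(shape, count, f"{share:.1%}")
```

In the notebook, `summarize_shapes(check_dict)` would give the same kind of table for the real training images.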

<a id="4"></a>
<h2 style='background:black; border:0; color:white'><center>4. List of unusual samples for Cassava Brown Streak Disease</center></h2>

In [None]:
unusual = [
    '1004389140.jpg',
    '1008244905.jpg',
    '1338159402.jpg',
    '1339403533.jpg',
    '159654644.jpg',    
    '1010470173.jpg',
    '1014492188.jpg',
    '1359893940.jpg',
    '1366430957.jpg',
    '1689510013.jpg',
    '1726694302.jpg',
    '1770746162.jpg',
    '1773381712.jpg',
    '1848686439.jpg',
    '1905119159.jpg',
    '1917903934.jpg',
    '1960041118.jpg',
    '199112616.jpg',
    '2016389925.jpg',
    '2073193450.jpg',
    '2074713873.jpg',
    '2084868828.jpg',
    '2139839273.jpg',
    '2166623214.jpg',
    '2262263316.jpg',
    '2276509518.jpg',
    '2278166989.jpg',
    '2321669192.jpg',
    '2320471703.jpg',
    '2382642453.jpg',
    '2415837573.jpg',
    '2482667092.jpg',
    '2604713994.jpg',
    '262902341.jpg',
    '2642216511.jpg',
    '2698282165.jpg',
    '2719114674.jpg',
    '274726002.jpg',
    '2925605732.jpg',
    '2981404650.jpg',
    '3040241097.jpg',
    '3043097813.jpg',
    '3123906243.jpg',
    '3126296051.jpg',
    '3199643560.jpg',
    '3251960666.jpg',
    '3252232501.jpg',
    '3425850136.jpg',
    '3435954655.jpg',
    '3477169212.jpg',
    '3609350672.jpg',
    '3652033201.jpg',
    '3810809174.jpg',
    '3838556102.jpg',
    '3881028757.jpg',
    '3892366593.jpg',
    '4060987360.jpg',
    '4089218356.jpg',
    '4134583704.jpg',
    '4203623611.jpg',
    '421035788.jpg',
    '4239074071.jpg',
    '4269208386.jpg',
    '457405364.jpg',
    '549854027.jpg',
    '554488826.jpg',
    '580111608.jpg',
    '600736721.jpg',
    '616718743.jpg',
    '695438825.jpg',
    '723564013.jpg',
    '746746526.jpg',
    '826231979.jpg',
    '847847826.jpg',
    '9224019.jpg',
    '992748624.jpg'
]

In [None]:
images_number = 16

plot_list = random.sample(unusual, images_number)
labels = ['Unusual Cassava Brown Streak Disease'] * len(plot_list)
# Side of the smallest square grid that fits all images; plt.subplot needs integers
size = int(np.ceil(np.sqrt(images_number)))

plt.figure(figsize=(20, 20))

for ind, (image_id, label) in enumerate(zip(plot_list, labels)):
    plt.subplot(size, size, ind + 1)
    image = cv2.imread(os.path.join('../input/cassava-leaf-disease-classification/', "train_images", image_id))
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

    plt.imshow(image)
    plt.title(label, fontsize=12)
    plt.axis("off")

plt.show()
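Because this list was assembled by hand, it may be worth confirming that every file is present in `train.csv` and carries the expected label (1 for CBSD in this dataset). A rough sketch with toy data; `check_labels` and the toy ids are hypothetical:

```python
def check_labels(image_ids, id_to_label, expected_label):
    """Split ids into (missing from the table, present but with a different label)."""
    missing = [i for i in image_ids if i not in id_to_label]
    mismatched = [i for i in image_ids
                  if i in id_to_label and id_to_label[i] != expected_label]
    return missing, mismatched

# Toy stand-in for dict(zip(train['image_id'], train['label']))
toy_id_to_label = {'a.jpg': 1, 'b.jpg': 3, 'c.jpg': 1}

missing, mismatched = check_labels(['a.jpg', 'b.jpg', 'z.jpg'], toy_id_to_label, expected_label=1)
print(missing, mismatched)
```

Applied to the notebook, this would be `check_labels(unusual, dict(zip(train['image_id'], train['label'])), expected_label=1)`.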

<a id="5"></a>
<h2 style='background:black; border:0; color:white'><center>5. List of outliers</center></h2>

**Already processed classes:**
* 1) Cassava Brown Streak Disease (CBSD)
* 2) Cassava Bacterial Blight (CBB)
* 3) Cassava Mosaic Disease (CMD)
* 4) Cassava Green Mottle (CGM)
* 5) Healthy

In [None]:
outliers = [
    '156080014.jpg',
    '2182500020.jpg',
    '2489013604.jpg',
    '3129393327.jpg',
    '314640668.jpg',
    '490649765.jpg',
    '1285436512.jpg',
    '1403621003.jpg', # Looks like an unusual Cassava Brown Streak Disease sample but is labeled as Cassava Mosaic Disease
    '1819546557.jpg',
    '1841279687.jpg',
    '2088351120.jpg',
    '2161797110.jpg',
    '2602649407.jpg',
    '277532565.jpg',
    '3184864595.jpg',
    '3238801760.jpg',
    '3272750945.jpg',
    '3382391338.jpg',
    '357924077.jpg',
    '4044829046.jpg',
    '4059169921.jpg',
    '4280523848.jpg',
    '449389274.jpg',
    '452420525.jpg',
    '479472063.jpg',
    '612680278.jpg',
    '726377415.jpg',
    '1179237425.jpg',
    '1663857014.jpg',
    '2565638908.jpg',
    '3188953817.jpg',
    '3421208425.jpg',
    '504689064.jpg',
    '597389720.jpg',
    '1119403430.jpg',
    '1774341872.jpg',
    '1886828385.jpg',
    '2484530081.jpg',
    '2632579053.jpg',
    '2839068946.jpg',
    '284130814.jpg',
    '3609986814.jpg',
    '3724956866.jpg',
    '3746679490.jpg',
    '3853597900.jpg',
    '927165736.jpg'
]

In [None]:
images_number = 16

plot_list = random.sample(outliers, images_number)
labels = ['Outlier'] * len(plot_list)
# Side of the smallest square grid that fits all images; plt.subplot needs integers
size = int(np.ceil(np.sqrt(images_number)))

plt.figure(figsize=(20, 20))

for ind, (image_id, label) in enumerate(zip(plot_list, labels)):
    plt.subplot(size, size, ind + 1)
    image = cv2.imread(os.path.join('../input/cassava-leaf-disease-classification/', "train_images", image_id))
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

    plt.imshow(image)
    plt.title(label, fontsize=12)
    plt.axis("off")

plt.show()
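One plausible use of such a list is to exclude these files before training; the notebook itself does not do this. The sketch below first checks a hand-made list for repeated entries and then filters them out of a larger list of ids. All names and ids here are hypothetical toy values:

```python
from collections import Counter

def duplicated_ids(image_ids):
    """Return ids that appear more than once, in first-seen order."""
    counts = Counter(image_ids)
    return [i for i in counts if counts[i] > 1]

def drop_ids(image_ids, excluded):
    """Return image_ids without anything in excluded, keeping the original order."""
    excluded = set(excluded)
    return [i for i in image_ids if i not in excluded]

toy_all = ['a.jpg', 'b.jpg', 'c.jpg', 'd.jpg']
toy_outliers = ['b.jpg', 'd.jpg', 'b.jpg']

print(duplicated_ids(toy_outliers))
print(drop_ids(toy_all, toy_outliers))
```

With the notebook's DataFrame, the same filter could be written as `train[~train['image_id'].isin(outliers)]`.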

### Work in progress