---------------------------------------
# Plant Pathology 2021 FGVC8 - training dataset review

The Plant Pathology 2021 FGVC8 dataset contains about 18k images of apple leaf diseases. The current version considers six classes: 'healthy', 'scab', 'rust', 'powdery mildew', 'frog_eye_leaf_spot', and 'complex'.

During training, it was noted that a few predicted labels matched the disease visible in the image but not the actual label. Such cases compromise model training and lead to apparent misclassifications during validation. So, in an attempt to check the dataset for better training, a large sample (~15k) of the images has been reviewed, and some findings are displayed in this notebook to promote discussion.

**Some important considerations**
* I'm no botanist, so my conclusions are derived from reading some specialized sites, comparing them with Google Images results, and looking at the dataset to find patterns;
* The images shown here are just a selection. There might be many more misclassified images but, again, I'm no authority in the field;
* If it turns out that a large portion of the dataset's labels is compromised, the implication is that the Test Dataset and the Model Scoring might also be compromised.

**Conclusions and further considerations on the labels mismatch**

Overall, what has been found is that:
* Some images are not related to leaf diseases and should be removed from the dataset;
* Some images have been clearly misclassified;
* Some leaf diseases are very tricky to spot in the images. A small white spot might indicate Powdery Mildew, while an olive-green patch might be Scab. So there is room for discussion about whether a given leaf is healthy or diseased.
* The above point is probably related to the life cycle of the disease. An early stage might show only a weak symptom on the leaves and go unnoticed by an annotator, yet a neural network could easily pick it up.
* The *complex* class contains a large subset of images that might be classified differently, for instance with a 'two label' classification. This might be a source of error in the training/testing step.
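
The 'two label' idea from the last point could be expressed as a multi-hot target, where each image gets one 0/1 flag per class. The snippet below is only a minimal sketch: it assumes labels are stored as space-separated strings, and the class order, the underscore spelling `powdery_mildew`, and the helper name `to_multi_hot` are illustrative choices rather than anything defined by the competition.

```python
# Sketch: encode space-separated label strings as multi-hot vectors.
# The class order and spellings below are assumptions for illustration.
CLASSES = ['healthy', 'scab', 'rust', 'powdery_mildew',
           'frog_eye_leaf_spot', 'complex']


def to_multi_hot(label_string, classes=CLASSES):
    """Return a list of 0/1 flags, one per class, for a label string."""
    present = set(label_string.split())
    unknown = present - set(classes)
    if unknown:
        raise ValueError(f"Unknown label(s): {sorted(unknown)}")
    return [1 if name in present else 0 for name in classes]


# A single 'complex' label versus an explicit two-label annotation.
single = to_multi_hot('complex')
double = to_multi_hot('scab frog_eye_leaf_spot')
print(single)
print(double)
```

With a representation like this, an image such as "Scab + Frog eye" keeps both diseases visible to the model instead of collapsing into a single catch-all class.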

____________________________________________________
# Summary:
0. Typical images for each disease
1. Images not related to Leaf Diseases
2. Misclassification

        a. In Healthy subset 
        b. In Complex subset
        c. In Rust subset
        
        
3. Disease stage of contamination

        3.a Scab - Possible early contamination
        3.b Powdery Mildew - Possible early contamination
___________________________________________________

In [None]:
import pandas as pd
import numpy as np
import os
import cv2
import matplotlib.pyplot as plt
from plotly import figure_factory as ff

images_path = '/kaggle/input/plant-pathology-2021-fgvc8/train_images/'

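def plot_images(images_list, title_list, images_path, nx, ny, width, height):
    """Plot images on an nx-row by ny-column grid, titled with label and file name."""
    fig = plt.figure(figsize=(width, height))

    for idx, (image, title) in enumerate(zip(images_list, title_list)):
        img_file = os.path.join(images_path, image)
        img = cv2.imread(img_file)
        if img is None:  # cv2.imread returns None when the file cannot be read
            raise FileNotFoundError(img_file)
        # OpenCV loads images as BGR; matplotlib expects RGB
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

        ax = fig.add_subplot(nx, ny, idx + 1)
        ax.set_title(title + '\n' + image)
        ax.imshow(img)
    plt.show()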
    
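def plot_table(image_list, orig_class, corr_class):
    """Show a table with one row per image: file name, original and corrected class."""
    # ff.create_table uses the first row as the header, so add one and
    # transpose the three lists into rows.
    header = [['Image', 'Original class', 'Corrected class']]
    rows = [list(row) for row in zip(image_list, orig_class, corr_class)]

    fig = ff.create_table(header + rows)
    fig.show()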

## 0. Typical images for each disease
First, let's take a look at typical images for each disease so there is a basis for comparison. Each disease has its own characteristics:
* **Rust**: Small, pale yellow spots on the upper surfaces. Eventually, tiny, black, fruiting bodies (pycnia) become visible.
* **Powdery mildew**: Small, whitish felt-like patches of fungal growth appear and quickly cover the entire leaf.
* **Scab**: The fungal disease forms pale yellow or olive-green spots on the upper surface of leaves. Dark, velvety spots may appear on the lower surface.
* **Frog eye spot**: Small, purple specks appear on infected leaves. The round to irregularly lobed spots develop a light brown-to-gray center surrounded by one or more dark-brown concentric rings and a purple margin.
* **Complex**: As defined by the competition, unhealthy leaves with too many diseases to classify visually are assigned the complex class.


In [None]:
image_list = ['809e3fc0cbac7c59.jpg', '808f897fa9e2adb0.jpg', '828d2f4a91754ddb.jpg', '80a047cd9aa9fb4f.jpg', '80bf21a07dfcf047.jpg', '9090b5d5aba53cd3.jpg']
title_list = ['Healthy', 'Rust', 'Powdery Mildew', 'Scab', 'Frog Eye Leaf Spot', 'Complex']
plot_images(image_list, title_list, images_path, nx=3, ny=2, width=23, height=23)

_________________________________
## 1. Images not related to Leaf Diseases
Some images are not related to apple leaf diseases. I've found just these four, but there might be more.
Those images should simply be removed from the dataset.

    95276ccd226ad933.jpg - Sign
    ead085dfac287263.jpg - Nest
    ccec54723ff91860.jpg - Sign
    da8770e819d2696d.jpg - Sign
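
Removing these files before training could look like the sketch below. It assumes the labels are available as a pandas DataFrame with `image` and `labels` columns; the small stand-in table and the `'unknown'` placeholder labels are made up for illustration, since the labels of the four flagged files are not listed here.

```python
import pandas as pd

# Images flagged above as unrelated to leaf diseases.
non_leaf_images = [
    '95276ccd226ad933.jpg',
    'ead085dfac287263.jpg',
    'ccec54723ff91860.jpg',
    'da8770e819d2696d.jpg',
]

# Stand-in for the training labels table. The column names and the
# 'unknown' placeholder labels are assumptions for illustration only.
train_df = pd.DataFrame({
    'image': non_leaf_images + ['809e3fc0cbac7c59.jpg', '808f897fa9e2adb0.jpg'],
    'labels': ['unknown'] * 4 + ['healthy', 'rust'],
})

# Keep only the rows whose file name is not in the flagged list.
clean_df = train_df[~train_df['image'].isin(non_leaf_images)].reset_index(drop=True)
print(len(train_df), '->', len(clean_df))
```

Keeping the flagged names in a plain list makes it easy to extend the filter if more unrelated images turn up.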


In [None]:
image_list = ['95276ccd226ad933.jpg', 'ead085dfac287263.jpg', 'ccec54723ff91860.jpg', 'da8770e819d2696d.jpg']
title_list = ['Random sign', 'Nest (lol)', 'Another sign', 'Another sign']
plot_images(image_list, title_list, images_path, nx=1, ny=4, width=23, height=23)

_______________________________________________
## 2. Misclassification
Another set of images has been found to be misclassified. Some examples are shown below. Reminder: these are just selected images and do not cover all the images with wrong labels.

## **a. Healthy subset**
The following images have been found in the healthy subset. Each image title is the possible classification.
##### Some images have only a trace of the disease. For instance, the images:

    bdd290c8cbc9aca9.jpg - small yellow spot (rust) and white material on the leaf tip
    aa896f86f43a38b1.jpg - has a few olive-green patches 
    efbcc09086859ebc.jpg - has a white silky spot near the stem.
    ad8addd454e14a6c.jpg - has a very light clearer spot near the stem on the upper side of the leaf.
    
For these images, it can be argued that the disease life cycle matters for the classification. Image *ce5b7057c3259c1c.jpg* has a very obvious Powdery Mildew infection, but image *bdd290c8cbc9aca9.jpg* is not that clear. In fact, a large portion of the images shows a light powdery mildew infection on the leaf tip that went unnoticed (assuming it actually is powdery mildew; I'm no specialist). Some other cases are similar to picture *efbcc09086859ebc.jpg*, in which a small white patch sits somewhere on the blade of the leaf.
The question is at which stage, if any, a leaf should stop being classified as healthy and be labeled with a disease. Take the picture below as an example. Should a coverage of less than 1.5% be considered?

![% of leaf with disease symptom](https://www.scielo.br/img/revistas/aabc/v92s1//0001-3765-aabc-92-s1-e20180889-f1.jpg)
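
If a coverage rule were adopted, it could be computed from two binary masks, one for the leaf and one for the lesions. The sketch below only illustrates the arithmetic: the masks are toy arrays, the function name `lesion_coverage_percent` is hypothetical, and the 1.5% cut-off is simply the figure from the question above, not a validated threshold. Producing real masks would need a segmentation step that is not covered here.

```python
import numpy as np


def lesion_coverage_percent(leaf_mask, lesion_mask):
    """Percentage of leaf pixels that are also marked as lesion."""
    leaf_mask = np.asarray(leaf_mask, dtype=bool)
    # Only count lesion pixels that fall inside the leaf.
    lesion_mask = np.asarray(lesion_mask, dtype=bool) & leaf_mask
    leaf_pixels = leaf_mask.sum()
    if leaf_pixels == 0:
        raise ValueError("Leaf mask is empty")
    return 100.0 * lesion_mask.sum() / leaf_pixels


# Toy example: a 100x100 leaf with a 10x10 lesion patch (100 of 10000 pixels).
leaf = np.ones((100, 100), dtype=bool)
lesion = np.zeros((100, 100), dtype=bool)
lesion[:10, :10] = True

coverage = lesion_coverage_percent(leaf, lesion)
THRESHOLD_PERCENT = 1.5  # illustrative cut-off, not a validated value
is_diseased = bool(coverage >= THRESHOLD_PERCENT)
print(coverage, is_diseased)
```

Under this rule the toy leaf would still count as healthy, which is exactly the kind of borderline case the images in this section raise.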

In [None]:
image_list_healthy = ['ce5b7057c3259c1c.jpg', 
              'bdd290c8cbc9aca9.jpg', 
              'ad8addd454e14a6c.jpg']

title_list_healthy = ['Powdery Mildew', 
              'Rust + Powdery Mildew', 
              'Scab?']
plot_images(image_list_healthy, title_list_healthy, images_path, nx=3, ny=1, width=35, height=35)

In [None]:
image_list_healthy = ['aa896f86f43a38b1.jpg',
                      'efbcc09086859ebc.jpg']

title_list_healthy = ['Scab',
                      'Powdery Mildew']
plot_images(image_list_healthy, title_list_healthy, images_path, nx=2, ny=1, width=35, height=35)
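
The suggestions for the healthy subset can also be kept in a small table so they are easy to count or share. This is only a bookkeeping sketch: the column names are arbitrary, and the 'suggested' values are the tentative titles used in the plots above, not verified labels.

```python
import pandas as pd

# Images found in the healthy subset and the tentative classification
# used as plot titles above.
review = pd.DataFrame({
    'image': ['ce5b7057c3259c1c.jpg',
              'bdd290c8cbc9aca9.jpg',
              'ad8addd454e14a6c.jpg',
              'aa896f86f43a38b1.jpg',
              'efbcc09086859ebc.jpg'],
    'original': ['healthy'] * 5,
    'suggested': ['Powdery Mildew',
                  'Rust + Powdery Mildew',
                  'Scab?',
                  'Scab',
                  'Powdery Mildew'],
})

# How often each tentative classification was suggested.
counts = review['suggested'].value_counts()
print(counts)
```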

## **b. Complex subset**
The following images have been found in the Complex subset. Each image title is the possible classification.

In [None]:
image_list_complex = ['8cc32cc7f82f81f8.jpg',
                     '90df66a17f502e13.jpg',
                     'a0d79b991b79a0b8.jpg']

title_list_complex = ['Scab',
                     'Rust + Frog eye',
                     'Scab + Frog eye']

plot_images(image_list_complex, title_list_complex, images_path, nx=3, ny=1, width=35, height=35)

In [None]:
image_list_complex = ['a2be96953443c7e4.jpg',
                     'aab096b7030ae3f9.jpg',
                     'a7b0921acf317ce1.jpg']

title_list_complex = ['Rust',
                     'Rust + Scab',
                     'Frog eye + Powdery Mildew']

plot_images(image_list_complex, title_list_complex, images_path, nx=3, ny=1, width=35, height=35)

## **c. Rust subset**

In [None]:
image_list_rust = ['d3945c098edc9dd1.jpg']

title_list_rust = ['Powdery Mildew + Frog Eye']

plot_images(image_list_rust, title_list_rust, images_path, nx=1, ny=1, width=14, height=14)

## **3. Disease stage of contamination**

As mentioned previously, due to the nature of the disease life cycles, some leaves show an earlier stage of contamination than others. This can make both the annotation procedure and the model training harder. Some examples are shown below.

## 3.a Scab - Possible early contamination

The images below were found in both the Healthy and Scab subsets, which shows how hard it is to determine the real state of the disease manually.
So, the question here is the same as posed before: how can a rule be created to determine when there is an early scab contamination and when it is a healthy leaf?

**The image title shows the label the image originally belongs to.**

In [None]:
image_list_escab = ['b9909e99fd34210f.jpg',
                     'dc7fe1898d402277.jpg',
                     'cff02d717850c0b7.jpg']

title_list_escab = ['healthy',
                     'scab',
                     'scab']

plot_images(image_list_escab, title_list_escab, images_path, nx=3, ny=1, width=35, height=35)

In [None]:
image_list_escab = ['ab1981fd06e23f8a.jpg',
                     '8ab57810f73ec568.jpg',
                     'f758af92944f9282.jpg']

title_list_escab = ['healthy',
                     'scab',
                     'healthy']

plot_images(image_list_escab, title_list_escab, images_path, nx=3, ny=1, width=35, height=35)

## 3.b Powdery Mildew - Possible early contamination
As previously stated for Scab, here are some examples for powdery mildew.
The first 3 cases presented here show an indication of powdery mildew on the leaf tip. The last image shows some white, fluffy material on the blade.

In [None]:
image_list_epow = ['da9bcaa9a6d088c7.jpg',
                  'f1bb28fee48284c3.jpg']
title_list_epow = ['healthy',
                  'scab']
plot_images(image_list_epow, title_list_epow, images_path, nx=2, ny=1, width=24, height=24)

In [None]:
image_list_epow = ['d54a6887653e273c.jpg',
                  '90ece9d647c23ca6.jpg']
title_list_epow = ['healthy',
                  'healthy']
plot_images(image_list_epow, title_list_epow, images_path, nx=2, ny=1, width=24, height=24)