# Summary of notebook

**Purpose of EDA**<br>
1. Explore label distribution<br>
2. Show images for each label<br>
3. Summarize what I understood<br>

**Summary of EDA**
1. There are 18632 training images, and the image sizes are not uniform
2. The main categories are listed below
    1. Healthy
    2. Scab
    3. Frog_eye_leaf_spot
    4. Rust
    5. Powdery_mildew
    6. *Mixed* (label "complex" or labels naming multiple diseases)
3. *Mixed* labels make up about 20% of the total
4. "complex" marks leaves with too many diseases to classify (mentioned in the official description)
5. The labels "frog_eye_leaf_spot", "rust", and "powdery_mildew" seem easy to identify
6. For "scab", the spots seem difficult to identify<br>
   -> Consider applying a filter to emphasize scab  

# Imports

In [None]:
import os
import glob
import math
from PIL import Image

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Read data

In [None]:
!ls ../input/plant-pathology-2021-fgvc8

In [None]:
BASE_DIR = "../input/plant-pathology-2021-fgvc8"

In [None]:
# Training data image path and label
train_image_path = glob.glob(os.path.join(BASE_DIR, "train_images/*.jpg"))
label_df = pd.read_csv(os.path.join(BASE_DIR, "train.csv"))
# Number of training images
print("Number of training images: {}".format(len(train_image_path)))

In [None]:
# Show first 10 rows
print(train_image_path[:10])
display(label_df.head(10))

# Label distribution

In [None]:
labels = label_df.labels.unique()
print(labels)
print(len(labels))

In [None]:
# Count the number of samples for each label
label_count = label_df.labels.value_counts()
label_ratio = label_df.labels.value_counts(normalize=True, sort=True)
print(label_count)

In [None]:
# Show the label distribution as a bar graph and a pie chart
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(20, 8))
label_count.plot.bar(ax=ax[0], title="Number of data per label", rot=90, fontsize=15)
label_ratio.plot.pie(ax=ax[1], title="Ratio of label", autopct='%1.1f%%', fontsize=15)
plt.show()

**Notes from label distribution**
1. Healthy and scab images each make up about 25% of the total data
2. Some labels combine several diseases (e.g. "scab frog_eye_leaf_spot complex")
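
Because a combined label is a space-separated string, the frequency of each individual class and the share of *Mixed* rows can be derived by splitting every label. The following is a minimal sketch using only the standard library. `sample_labels` is made-up illustration data (in this notebook, `label_df.labels` would take its place), and `is_mixed` is a hypothetical helper that treats "complex" and any multi-word label as *Mixed*.

```python
from collections import Counter

# Hypothetical label strings standing in for label_df.labels
sample_labels = [
    "healthy",
    "scab",
    "scab frog_eye_leaf_spot complex",
    "rust",
    "complex",
    "scab frog_eye_leaf_spot",
    "powdery_mildew",
    "healthy",
]


def is_mixed(label):
    """A label counts as Mixed if it is 'complex' or names more than one class."""
    parts = label.split()
    return len(parts) > 1 or parts == ["complex"]


# Count how often each individual class name appears across all labels
class_counts = Counter(name for label in sample_labels for name in label.split())

# Share of rows that fall into the Mixed category
mixed_ratio = sum(is_mixed(label) for label in sample_labels) / len(sample_labels)

print(class_counts.most_common())
print("Mixed ratio: {:.1%}".format(mixed_ratio))
```

Counting per class instead of per label string shows how many images contain a given disease at all, which is the more useful view if the task is treated as multi-label classification.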

# Image properties

## Consistency between train.csv and image files

In [None]:
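# Get the set of image names listed in train.csv and the set of image files on disk
csv_images = set(label_df["image"].to_list())
file_images = set(os.path.basename(full_path) for full_path in train_image_path)

# Symmetric difference: names that appear on only one side (empty set if they match)
print(csv_images ^ file_images)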

- The image names in "train.csv" and the image files match
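
The symmetric difference only reports that some name is unmatched, not on which side. If it were ever non-empty, the two one-sided differences would show whether a file or a CSV row is missing. A small sketch with made-up file names (`csv_names` and `file_names` are hypothetical stand-ins for the two sets built above):

```python
# Hypothetical names standing in for the "image" column and the files on disk
csv_names = {"a.jpg", "b.jpg", "c.jpg"}
file_names = {"b.jpg", "c.jpg", "d.jpg"}

missing_files = csv_names - file_names    # listed in the CSV, but no file on disk
unlabeled_files = file_names - csv_names  # file on disk, but no row in the CSV

print("In CSV only:", sorted(missing_files))
print("On disk only:", sorted(unlabeled_files))

# Together, the two one-sided differences equal the symmetric difference
assert missing_files | unlabeled_files == csv_names ^ file_names
```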

## Image size

In [None]:
# Get image size (PIL reads the size from the header without decoding pixel data)
image_property = {"image": [], "width": [], "height": []}
for image_path in train_image_path:
    # Use a context manager so each file handle is closed after reading
    with Image.open(image_path) as im:
        width, height = im.size
    file_name = os.path.basename(image_path)
    image_property["image"].append(file_name)
    image_property["width"].append(width)
    image_property["height"].append(height)

In [None]:
# Merge width, height info with label data
image_size = pd.DataFrame(image_property)
image_size = pd.merge(image_size, label_df, on="image", how="outer")
image_size

In [None]:
# Describe image property
image_size.info()

In [None]:
# Plot (width, height) pair
image_size.plot(x="width", y="height", linestyle="none", marker="x")

In [None]:
# Count value pair for (width, height)
image_size.groupby(["width", "height"]).count()

- The images do not all have the same size
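
Since the sizes differ, it may also help to check whether the images at least share an aspect ratio, because that determines how much a fixed-size resize would distort them. A minimal sketch, assuming made-up `(width, height)` pairs in `sizes`; in this notebook the pairs would come from the `width` and `height` columns of `image_size`.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical (width, height) pairs, not the measured sizes of the dataset
sizes = [(4000, 2672), (4000, 2672), (2592, 1728), (4000, 3000), (3024, 4032)]

# Fraction reduces width/height to lowest terms, so equal ratios compare equal
ratio_counts = Counter(Fraction(w, h) for w, h in sizes)

for ratio, count in ratio_counts.most_common():
    print("{}:{} ({:.3f}) -> {} images".format(
        ratio.numerator, ratio.denominator, float(ratio), count))
```

A ratio below 1 indicates a portrait image, which a landscape-oriented resize would stretch the most.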

## Show images

In [None]:
labels = label_df.labels.value_counts()
display(labels)

In [None]:
display_num = 9
for label in labels.index:
    df_tmp = label_df[label_df.labels == label]
    # Cap the sample size so labels with fewer than display_num images do not raise
    images = df_tmp.sample(min(display_num, len(df_tmp))).image.values
    images = [os.path.join(BASE_DIR, "train_images/", file_name) for file_name in images]
    
    plt.figure(figsize=(20, 20))
    for idx, image in enumerate(images):
        plt.subplot(math.ceil(display_num / 3), 3, idx+1)
        im = plt.imread(image)
        plt.imshow(im)
    print()
    print("==================== Label:{} ===================".format(label))
    plt.tight_layout()
    plt.show()

**Observations from images**
- The main categories are listed below
    1. Healthy
    2. Scab
    3. Frog_eye_leaf_spot
    4. Rust
    5. Powdery_mildew
    6. *Mixed* (label "complex" or labels naming multiple diseases)
- *Mixed* labels make up about 20% of the total
- "complex" marks leaves with too many diseases to classify (mentioned in the official description)
- The labels "frog_eye_leaf_spot", "rust", and "powdery_mildew" seem easy to identify
- For "scab", the spots seem difficult to identify<br>
   -> Consider applying a filter to emphasize scab  
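
One possible starting point for such a filter is a contrast boost followed by an unsharp mask, both available in PIL. This is only a sketch: `emphasize_spots` and its default parameters are hypothetical and untuned, and whether this actually makes scab easier to see has not been checked on the dataset. The demo uses a synthetic image so the block runs without the competition data.

```python
from PIL import Image, ImageEnhance, ImageFilter


def emphasize_spots(im, contrast=1.5, radius=2, percent=150, threshold=3):
    """Return a copy of `im` with boosted contrast and sharpened edges.

    The default values are illustrative guesses, not tuned settings.
    """
    im = im.convert("RGB")
    boosted = ImageEnhance.Contrast(im).enhance(contrast)
    return boosted.filter(
        ImageFilter.UnsharpMask(radius=radius, percent=percent, threshold=threshold)
    )


# Synthetic stand-in for a leaf: green background with a slightly darker square "spot"
demo = Image.new("RGB", (64, 64), (90, 140, 80))
for x in range(28, 36):
    for y in range(28, 36):
        demo.putpixel((x, y), (70, 110, 60))

result = emphasize_spots(demo)

# Compare the green-channel gap between background and spot before and after
before = abs(demo.getpixel((0, 0))[1] - demo.getpixel((32, 32))[1])
after = abs(result.getpixel((0, 0))[1] - result.getpixel((32, 32))[1])
print("Green-channel gap before: {}, after: {}".format(before, after))
```

To try it on real data, the same function could be applied to `Image.open(path)` for a few sampled "scab" images and displayed next to the originals with `plt.imshow`.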