<div align='center'><font size="6" color="#F39C12">Plant Pathology 2020 - FGVC7</font></div>
<div align='center'><font size="5" color="#F39C12">Detecting the category of foliar diseases in apple trees</font></div>
<hr>
![](https://s3.amazonaws.com/plantvillage-production-new/images/pics/000/002/555/original/2.jpg?1394723222)
*Source: Liz West, licensed under Creative Commons Attribution 2.0 Generic, http://www.flickr.com/photos/calliope/54070473/*

Apples are one of the most widely cultivated fruits globally, and their cultivation is a major source of employment worldwide. Unfortunately, apple trees are also highly prone to diseases such as scab, rust, rot and powdery mildew, which can reduce fruit yield and sometimes even kill the tree. It is therefore necessary to identify disease symptoms early so that effective steps can be taken to maintain quality and production.


## Objective

![](https://ya-webdesign.com/transparent250_/icons-transparent-objective-2.png)

[The objective of the competition is multifold](https://www.kaggle.com/c/plant-pathology-2020-fgvc7/overview/description). We need to train a model on the given image dataset to:
- Classify the test images into healthy or diseased leaves.
- Accurately distinguish between the different leaf diseases, since a leaf may belong to more than one disease class.
- Deal with rare classes and novel symptoms.
- Address depth perception: angle, light, shade, and physiological age of the leaf.
- Incorporate expert knowledge in identification, annotation, quantification, and guiding computer vision to search for relevant features during learning.


## Dataset

The dataset contains leaf images divided into a training set and a test set. An image can show a healthy leaf, a leaf infected with apple rust, a leaf with apple scab, or a leaf with more than one disease.

Our job is to create an ML model that predicts, for every test image, the probability of each of the four categories: healthy, multiple_diseases, rust and scab. This is a multi-class classification problem rather than a binary one.

## Evaluation Metric


Evaluation metrics measure the quality of a statistical or machine learning model. Many are available, including classification accuracy and logarithmic loss. For this particular problem, submissions are evaluated on mean column-wise ROC AUC: the final score is the average of the individual AUCs of each predicted column.
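
To make the metric concrete, here is a minimal sketch of mean column-wise ROC AUC in plain Python. The helper names (`roc_auc`, `mean_columnwise_auc`) and the toy labels and scores are made up for illustration; in practice a library routine such as scikit-learn's `roc_auc_score` would normally be used instead.

```python
def roc_auc(y_true, y_score):
    """ROC AUC for one binary column via the pairwise (Mann-Whitney) definition."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    # Count positive/negative pairs that are ranked correctly; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_columnwise_auc(y_true_cols, y_score_cols):
    """Average the per-column AUCs; both arguments map column name -> list."""
    aucs = {c: roc_auc(y_true_cols[c], y_score_cols[c]) for c in y_true_cols}
    return sum(aucs.values()) / len(aucs), aucs


# Toy labels and scores for two of the four columns (made-up numbers).
y_true = {"healthy": [1, 0, 0, 1], "rust": [0, 1, 1, 0]}
y_score = {"healthy": [0.9, 0.2, 0.6, 0.5], "rust": [0.1, 0.8, 0.3, 0.4]}

score, per_column = mean_columnwise_auc(y_true, y_score)
print(per_column, score)
```

Because each column is scored separately and then averaged, a rare column counts as much as a frequent one in the final score.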




  

# Table of Contents
- [1. Importing the necessary libraries](#imports)
- [2. Reading the image datasets](#reading)
- [3. Exploring different categories of leaf diseases](#categories)
- [4. Checking for duplicate images]()
- [5. Visualizing Images]()
- [6. Histograms]()
- [7. Changing colorspaces]()


# 1. Importing the necessary libraries

In [None]:

import numpy as np 
import pandas as pd 


# For plotting
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.patheffects as PathEffects
import matplotlib.image as mpimg
from mpl_toolkits.axes_grid1 import ImageGrid

import cv2
from io import BytesIO
from PIL import Image

# skimage
from skimage.io import imshow, imread, imsave
from skimage.transform import rotate, AffineTransform, warp,rescale, resize, downscale_local_mean
from skimage import color,data
from skimage.exposure import adjust_gamma
from skimage.util import random_noise

#plotly
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)

# 3D scatter plot
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
from matplotlib import colors


COLORS = ["#01BEFE", "#FFDD00", "#FF7D00", "#FF006D", "#93D30C", "#8F00FF"]

#Ignore warnings
import warnings
warnings.filterwarnings('ignore')


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# 2. Reading the Image datasets

In [None]:
# List files available
print(os.listdir("../input/plant-pathology-2020-fgvc7"))

In [None]:
# Defining data path
IMAGE_PATH = "../input/plant-pathology-2020-fgvc7/images/"

train_df = pd.read_csv("../input/plant-pathology-2020-fgvc7/train.csv")
test_df = pd.read_csv("../input/plant-pathology-2020-fgvc7/test.csv")


#Training data
print('Training data shape: ', train_df.shape)
train_df.head(5)

In [None]:
# Null values and Data types
print('Train Set')
print(train_df.info())
print('-------------')
print('Test Set')
print(test_df.info())

* There are no null values in the training and the testing dataset.

In [None]:
# Total number of images in the dataset(train+test)
print("Total images in Train set: ",train_df['image_id'].count())
print("Total images in Test set: ",test_df['image_id'].count())


# 3. Exploring different categories of leaf diseases

In [None]:
# Categories of Images
classes = ['healthy', 'multiple_diseases', 'rust', 'scab']
print(f"The dataset images belong to the following categories - {classes} ")

In [None]:
for c in classes:
    print(f"The class {c} has {train_df[c].sum()} samples")

In [None]:
healthy = train_df[train_df['healthy'] == 1]['image_id'].to_list()
multiple_diseases = train_df[train_df['multiple_diseases'] == 1]['image_id'].to_list()
rust = train_df[train_df['rust'] == 1]['image_id'].to_list()
scab = train_df[train_df['scab'] == 1]['image_id'].to_list()

diseases = [len(healthy), len(multiple_diseases), len(rust), len(scab)]
diseases

In [None]:

trace = go.Bar(
                    x = classes,
                    y = diseases ,
                    orientation='v',
                    marker = dict(color=COLORS,
                                 line=dict(color='black',width=1)),
                    )
data = [trace]
layout = go.Layout(barmode = "group",title='',width=800, height=500, 
                       xaxis= dict(title='Leaf Categories'),
                       yaxis=dict(title="Count"),
                       showlegend=False)
fig = go.Figure(data = data, layout = layout)
iplot(fig)

This is a case of imbalanced data. The multiple_diseases category has considerably fewer leaves than the other categories.
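
The imbalance can be quantified with class shares and a largest-to-smallest ratio. The sketch below uses hypothetical counts, not the real ones; in the notebook the counts would come from `train_df[c].sum()` for each class `c`.

```python
# Hypothetical per-class counts, chosen only to illustrate the calculation.
counts = {"healthy": 500, "multiple_diseases": 100, "rust": 600, "scab": 600}

total = sum(counts.values())
shares = {c: n / total for c, n in counts.items()}

# Ratio between the largest and smallest class; 1.0 would mean perfectly balanced.
imbalance_ratio = max(counts.values()) / min(counts.values())

for c, share in shares.items():
    print(f"{c:>18}: {share:.1%}")
print(f"Imbalance ratio (largest / smallest): {imbalance_ratio:.1f}")
```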

# 4. Checking for duplicate images

We shall check for duplicate image IDs within the train set and within the test set, and then for image IDs that appear in both sets.


In [None]:
train_id = set(train_df.image_id.values )
print(f"Unique Training set Images: {len(train_id)}")
test_id = set(test_df.image_id.values )
print(f"Unique Test set Images: {len(test_id)}")

There are no duplicate image IDs in either the train or the test set, since the number of unique IDs matches the number of rows in each. Now let's check whether any image ID appears in both the train and the test set.

In [None]:
def duplicacy(df1, df2, image_id):
    # Return the values of column `image_id` that occur in both dataframes
    df1_unique = set(df1[image_id].values)
    df2_unique = set(df2[image_id].values)
    images_in_both_dataframes = list(df1_unique.intersection(df2_unique))
    return images_in_both_dataframes


duplicacy(train_df, test_df, 'image_id')


The returned list is empty; hence, no image ID is shared between the train and test sets.
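
The checks above compare image IDs only, so two files with different IDs but identical pixels would go unnoticed. A minimal sketch of a content-based check is to hash the raw file bytes and group IDs by hash. The function name `find_content_duplicates` and the toy byte strings are hypothetical; with the real data the bytes would be read from the image files.

```python
import hashlib
from collections import defaultdict


def find_content_duplicates(files):
    """Group image IDs whose raw bytes are identical.

    `files` maps image_id -> bytes (a hypothetical input format).
    """
    groups = defaultdict(list)
    for image_id, payload in files.items():
        groups[hashlib.sha256(payload).hexdigest()].append(image_id)
    # Keep only the hashes that are shared by more than one ID.
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


# Toy byte strings standing in for JPEG files.
files = {
    "Train_0": b"\xff\xd8leaf-a",
    "Train_1": b"\xff\xd8leaf-b",
    "Test_0": b"\xff\xd8leaf-a",
}
duplicates = find_content_duplicates(files)
print(duplicates)
```

This only catches byte-identical files; resized or re-encoded copies would need a different approach, such as a perceptual hash.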

# 5. Visualizing Images

## 5.1 Visualizing a random selection of images

Let's visualise a random selection of images containing both diseased and healthy leaves.

In [None]:
images = train_df['image_id'].values

# Extract 6 random images from it
random_images = [np.random.choice(images+'.jpg') for i in range(6)]

# Location of the image dir
img_dir = IMAGE_PATH

print('Display Random Images')

# Adjust the size of your images
plt.figure(figsize=(15,10))

# Iterate and plot random images
for i in range(6):
    plt.subplot(2, 3, i + 1)
    img = plt.imread(os.path.join(img_dir, random_images[i]))
    plt.imshow(img, cmap='gray')
    plt.axis('off')
    
# Adjust subplot parameters to give specified padding
plt.tight_layout()   

In [None]:
def display_images(images, img_dir, condition):
    # Pick 6 random image file names from the given IDs
    random_images = [np.random.choice(images+'.jpg') for i in range(6)]

    print(f"Display {condition} Images")

    # Adjust the size of your images
    plt.figure(figsize=(15,10))

    # Iterate and plot random images
    for i in range(6):
        plt.subplot(2, 3, i + 1)
        img = plt.imread(os.path.join(img_dir, random_images[i]))
        plt.imshow(img, cmap='gray')
        plt.axis('off')

    # Adjust subplot parameters to give specified padding
    plt.tight_layout()
    
    

## 5.2 Visualizing Healthy leaves

Visualising a few healthy leaves.

In [None]:
healthy_images = np.array(healthy,dtype='object')
images = healthy_images
display_images(images,IMAGE_PATH,'healthy')

## 5.3 Visualizing 'Rust' infected leaves

**Rust** appears as orange spots on the leaves, hence the name. It is caused by a fungus and can lead to damaged fruit.

In [None]:
rust_images = np.array(rust,dtype='object')
images = rust_images
display_images(images,IMAGE_PATH,'rust')

## 5.4 Visualizing 'Scab' infected leaves

[Apple scab](https://www.gardeningchannel.com/common-diseases-of-apple-trees/) is one of the most common and most serious diseases that affect apple trees. The disease is caused by the fungus Venturia inaequalis, which overwinters in infected leaves left on the ground.
Apple scab first appears as small, olive-colored lesions on the undersides of the leaves. As the fungus spreads, the top sides of the leaves develop lesions, as well, that may become black or mottled with defined edges. Severely infected trees may become defoliated by mid-summer, making the tree vulnerable to other diseases. The fruit develop black or brown scabs or soft areas. 

In [None]:
scab_images = np.array(scab,dtype='object')
images = scab_images
display_images(images,IMAGE_PATH,'scab')

## 5.5 Visualizing leaves with multiple diseases

Below are leaves which can be categorised into more than one category. There could be rust, scab and other diseases on the same leaf.

In [None]:
multiple_diseases_images = np.array(multiple_diseases,dtype='object')
images = multiple_diseases_images
display_images(images,IMAGE_PATH,'multiple_diseases')

# 6. Histograms

Histograms are a graphical representation showing how frequently the various color values occur in an image, i.e. the frequency of pixel intensity values. In an 8-bit RGB image, each channel value ranges from 0 to 255, where 0 is the darkest and 255 the brightest intensity of that channel. Analysing a histogram can help us understand the brightness, contrast and intensity distribution of an image. Now let's look at the histogram of a randomly selected sample from each category.
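
As a small illustration of what a per-channel histogram counts, the sketch below builds a synthetic 4x4 RGB array (made-up values, not a real leaf photo) and tallies one bin per intensity value with NumPy. The `stats` dictionary is just a convenience for this example.

```python
import numpy as np

# Synthetic 4x4 RGB image (uint8): green everywhere, plus a 2x2 yellowish patch.
image = np.zeros((4, 4, 3), dtype=np.uint8)
image[..., 1] = 120
image[:2, :2] = (200, 180, 40)

stats = {}
for idx, name in enumerate(["red", "green", "blue"]):
    channel = image[..., idx].ravel()
    # One bin per intensity value 0..255.
    hist = np.bincount(channel, minlength=256)
    stats[name] = (float(channel.mean()), int(hist.argmax()))
    print(f"{name:>5}: mean={stats[name][0]:.1f}, most common value={stats[name][1]}")
```

A channel whose counts pile up near 0 is mostly dark, while a wide spread of occupied bins indicates higher contrast in that channel.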

## Healthy image

In [None]:
f = plt.figure(figsize=(16,8))
f.add_subplot(1,2, 1)

sample_img = healthy[0]+'.jpg'
raw_image = plt.imread(os.path.join(img_dir, sample_img))
plt.imshow(raw_image, cmap='gray')
plt.colorbar()
plt.title('Healthy Image')
print(f"Image dimensions:  {raw_image.shape[0],raw_image.shape[1]}")
print(f"Maximum pixel value : {raw_image.max():.1f} ; Minimum pixel value:{raw_image.min():.1f}")
print(f"Mean value of the pixels : {raw_image.mean():.1f} ; Standard deviation : {raw_image.std():.1f}")

f.add_subplot(1,2, 2)

#_ = plt.hist(raw_image.ravel(),bins = 256, color = 'orange',)
_ = plt.hist(raw_image[:, :, 0].ravel(), bins = 256, color = 'red', alpha = 0.5)
_ = plt.hist(raw_image[:, :, 1].ravel(), bins = 256, color = 'Green', alpha = 0.5)
_ = plt.hist(raw_image[:, :, 2].ravel(), bins = 256, color = 'Blue', alpha = 0.5)
_ = plt.xlabel('Intensity Value')
_ = plt.ylabel('Count')
_ = plt.legend(['Red_Channel', 'Green_Channel', 'Blue_Channel'])
plt.show()


## Rust infested image

In [None]:
f = plt.figure(figsize=(16,8))
f.add_subplot(1,2, 1)

rust_img = rust[0]+'.jpg'
rust_image = plt.imread(os.path.join(img_dir, rust_img))
plt.imshow(rust_image, cmap='gray')
plt.colorbar()
plt.title('Rust Image')
print(f"Image dimensions:  {rust_image.shape[0],rust_image.shape[1]}")
print(f"Maximum pixel value : {rust_image.max():.1f} ; Minimum pixel value:{rust_image.min():.1f}")
print(f"Mean value of the pixels : {rust_image.mean():.1f} ; Standard deviation : {rust_image.std():.1f}")

f.add_subplot(1,2, 2)
#_ = plt.hist(raw_image.ravel(),bins = 256, color = 'orange',)
_ = plt.hist(rust_image[:, :, 0].ravel(), bins = 256, color = 'red', alpha = 0.5)
_ = plt.hist(rust_image[:, :, 1].ravel(), bins = 256, color = 'Green', alpha = 0.5)
_ = plt.hist(rust_image[:, :, 2].ravel(), bins = 256, color = 'Blue', alpha = 0.5)
_ = plt.xlabel('Intensity Value')
_ = plt.ylabel('Count')
_ = plt.legend(['Red_Channel', 'Green_Channel', 'Blue_Channel'])
plt.show()

## Scab infested image

In [None]:
f = plt.figure(figsize=(16,8))
f.add_subplot(1,2, 1)

scab_img = scab[0]+'.jpg'
scab_image = plt.imread(os.path.join(img_dir, scab_img))
plt.imshow(scab_image, cmap='gray')
plt.colorbar()
plt.title('Scab Image')
print(f"Image dimensions:  {scab_image.shape[0],scab_image.shape[1]}")
print(f"Maximum pixel value : {scab_image.max():.1f} ; Minimum pixel value:{scab_image.min():.1f}")
print(f"Mean value of the pixels : {scab_image.mean():.1f} ; Standard deviation : {scab_image.std():.1f}")

f.add_subplot(1,2, 2)
#source: https://towardsdatascience.com/histograms-in-image-processing-with-skimage-python-be5938962935
#_ = plt.hist(raw_image.ravel(),bins = 256, color = 'orange',)
_ = plt.hist(scab_image[:, :, 0].ravel(), bins = 256, color = 'red', alpha = 0.5)
_ = plt.hist(scab_image[:, :, 1].ravel(), bins = 256, color = 'Green', alpha = 0.5)
_ = plt.hist(scab_image[:, :, 2].ravel(), bins = 256, color = 'Blue', alpha = 0.5)
_ = plt.xlabel('Intensity Value')
_ = plt.ylabel('Count')
_ = plt.legend(['Red_Channel', 'Green_Channel', 'Blue_Channel'])
plt.show()


## Multiple Diseases 

In [None]:
f = plt.figure(figsize=(16,8))
f.add_subplot(1,2, 1)

multiple_diseases_img = multiple_diseases[0]+'.jpg'
multiple_diseases_image = plt.imread(os.path.join(img_dir, multiple_diseases_img))
plt.imshow(multiple_diseases_image, cmap='gray')
plt.colorbar()
plt.title('Multiple Diseases Image')
print(f"Image dimensions:  {multiple_diseases_image.shape[0],multiple_diseases_image.shape[1]}")
print(f"Maximum pixel value : {multiple_diseases_image.max():.1f} ; Minimum pixel value:{multiple_diseases_image.min():.1f}")
print(f"Mean value of the pixels : {multiple_diseases_image.mean():.1f} ; Standard deviation : {multiple_diseases_image.std():.1f}")

f.add_subplot(1,2, 2)
#source: https://towardsdatascience.com/histograms-in-image-processing-with-skimage-python-be5938962935
#_ = plt.hist(raw_image.ravel(),bins = 256, color = 'orange',)
_ = plt.hist(multiple_diseases_image[:, :, 0].ravel(), bins = 256, color = 'red', alpha = 0.5)
_ = plt.hist(multiple_diseases_image[:, :, 1].ravel(), bins = 256, color = 'Green', alpha = 0.5)
_ = plt.hist(multiple_diseases_image[:, :, 2].ravel(), bins = 256, color = 'Blue', alpha = 0.5)
_ = plt.xlabel('Intensity Value')
_ = plt.ylabel('Count')
_ = plt.legend(['Red_Channel', 'Green_Channel', 'Blue_Channel'])
plt.show()


Looking at the histograms above, the blue channel intensities are noticeably higher in the diseased leaves than in the healthy one, at least for the single sample per category plotted here.

# 7. Changing colorspaces

RGB is a very common colorspace. However, one downside of this color scheme is that the color values of an image are correlated with the amount of light falling on the scene. HSL (hue, saturation, lightness) and HSV (hue, saturation, value; also known as HSB, for hue, saturation, brightness) are alternative representations of the RGB color model. They separate hue and saturation from brightness, which is often more relevant for tasks like image segmentation. The code in this section is based on https://realpython.com/python-opencv-color-spaces/
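
The effect of lighting can be illustrated on two single pixels. The sketch below uses Python's standard `colorsys` module and rescales the result to the 8-bit HSV ranges OpenCV uses (H in 0..179, S and V in 0..255). The helper `to_opencv_hsv` and the pixel values are made up for illustration.

```python
import colorsys


def to_opencv_hsv(r, g, b):
    """Convert 8-bit RGB to OpenCV-style 8-bit HSV (H in 0..179, S and V in 0..255)."""
    h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    return round(h * 180) % 180, round(s * 255), round(v * 255)


# The same green surface under bright and dim light (made-up pixel values).
bright = (40, 160, 40)
dim = (20, 80, 20)

print("RGB:", bright, dim)
print("HSV:", to_opencv_hsv(*bright), to_opencv_hsv(*dim))
```

In RGB all three channels change between the two pixels, whereas in HSV the hue and saturation stay the same and only the value drops. This is why a threshold on hue is less sensitive to illumination.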

Let's compare the image in both RGB and HSV color spaces by visualizing the color distribution of its pixels. We shall be using OpenCV for this.

In [None]:
img_raw = cv2.imread('/kaggle/input/plant-pathology-2020-fgvc7/images/Train_1.jpg')

img = cv2.cvtColor(img_raw, cv2.COLOR_BGR2RGB) 
plt.imshow(img)

## 3D scatter plot for the image in RGB 

In [None]:
r, g, b = cv2.split(img)
fig = plt.figure()
axis = fig.add_subplot(1, 1, 1, projection="3d")

pixel_colors = img.reshape((np.shape(img)[0]*np.shape(img)[1], 3))
norm = colors.Normalize(vmin=-1.,vmax=1.)
norm.autoscale(pixel_colors)
pixel_colors = norm(pixel_colors).tolist()

axis.scatter(r.flatten(), g.flatten(), b.flatten(), facecolors=pixel_colors, marker=".")
axis.set_xlabel("Red")
axis.set_ylabel("Green")
axis.set_zlabel("Blue")
plt.show()




## 3D scatter plot for the image in HSV

In [None]:
hsv_image = cv2.cvtColor(img, cv2.COLOR_RGB2HSV)
h, s, v = cv2.split(hsv_image)
fig = plt.figure()
axis = fig.add_subplot(1, 1, 1, projection="3d")

axis.scatter(h.flatten(), s.flatten(), v.flatten(), facecolors=pixel_colors, marker=".")
axis.set_xlabel("Hue")
axis.set_ylabel("Saturation")
axis.set_zlabel("Value")
plt.show()

It is quite evident that the colors in the HSV colorspace are more concentrated in a particular region. This property is very useful for processes like color-based segmentation.

## Image segmentation based on green
 
 

In [None]:
# Green hue range in OpenCV's 8-bit HSV scale (H runs from 0 to 179)
lower_green = (36, 0, 0)
upper_green = (70, 255, 255)
mask = cv2.inRange(hsv_image, lower_green, upper_green)
result = cv2.bitwise_and(img, img, mask=mask)

plt.figure(figsize=(16,8))
plt.subplot(1, 2, 1)
plt.imshow(mask, cmap="gray")
plt.subplot(1, 2, 2)
plt.imshow(result)
plt.show()


Let's check another diseased leaf.

In [None]:
img_raw2 = cv2.imread('/kaggle/input/plant-pathology-2020-fgvc7/images/Train_3.jpg')

img2 = cv2.cvtColor(img_raw2, cv2.COLOR_BGR2RGB)
hsv_image2 = cv2.cvtColor(img2, cv2.COLOR_RGB2HSV)
plt.imshow(img2)

# Green hue range in OpenCV's 8-bit HSV scale (H runs from 0 to 179)
lower_green = (36, 0, 0)
upper_green = (70, 255, 255)
mask = cv2.inRange(hsv_image2, lower_green, upper_green)
result = cv2.bitwise_and(img2, img2, mask=mask)

plt.figure(figsize=(16,8))
plt.subplot(1, 2, 1)
plt.imshow(mask, cmap="gray")
plt.subplot(1, 2, 2)
plt.imshow(result)
plt.show()

It works reasonably well. There are certainly some false positives, so it would make sense to segment the leaf from the background before applying the mask.
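
To see where such false positives can come from, the sketch below reproduces the range test on three made-up HSV pixels with plain NumPy: a pixel passes when every channel lies within the inclusive bounds, which is the rule `cv2.inRange` applies. The stricter saturation floor of 60 is an assumption chosen for illustration, not a value used in the notebook.

```python
import numpy as np

# Three made-up pixels in OpenCV's 8-bit HSV scaling (H in 0..179).
hsv = np.array([[[60, 200, 150],    # leaf green
                 [16, 220, 200],    # rust-like orange
                 [45, 30, 230]]],   # pale, washed-out greenish background
               dtype=np.uint8)


def in_range(pixels, lower, upper):
    """Return 255 where all channels lie within [lower, upper], else 0."""
    lower = np.array(lower, dtype=np.uint8)
    upper = np.array(upper, dtype=np.uint8)
    inside = np.all((pixels >= lower) & (pixels <= upper), axis=-1)
    return (inside * 255).astype(np.uint8)


# Bounds used above: any saturation is accepted, so the pale pixel passes.
mask = in_range(hsv, (36, 0, 0), (70, 255, 255))
# Hypothetical stricter bounds: require some saturation as well.
strict_mask = in_range(hsv, (36, 60, 0), (70, 255, 255))

print(mask, strict_mask)
```

With a saturation floor of 0, weakly colored background pixels whose hue happens to fall in the green band are kept. Raising the floor is one possible way to reduce them, alongside separating the leaf from the background first.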