# Plant Pathology 2021 - FGVC8

### What is this competition about?

Apples are one of the most important temperate fruit crops in the world. Foliar (leaf) diseases pose a major threat to the overall productivity and quality of apple orchards. The current process for disease diagnosis in apple orchards is based on manual scouting by humans, which is time-consuming and expensive.

Although computer vision-based models have shown promise for plant disease identification, some limitations remain. Large variations in the visual symptoms of a single disease across different apple cultivars, or across new varieties that originated under cultivation, are major challenges for computer vision-based disease identification. These variations arise from differences in natural and image-capturing environments, for example leaf color and morphology, the age of infected tissues, non-uniform image backgrounds, and varying illumination during imaging.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

### Let's import some libraries!

In [None]:
import pandas as pd
import numpy as np
import tensorflow as tf
import PIL
import cv2
from keras_preprocessing.image import ImageDataGenerator
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import Adam
import tensorflow_addons as tfadd
import seaborn as sns
import matplotlib.pyplot as plt
from tensorflow.keras.applications import InceptionResNetV2

Next, we load the CSV files and set the directories for the training and test images.

In [None]:
train_data = pd.read_csv('/kaggle/input/plant-pathology-2021-fgvc8/train.csv')
test_data = pd.read_csv('/kaggle/input/plant-pathology-2021-fgvc8/sample_submission.csv')
TRAIN_IMG_DIR = '../input/plant-pathology-2021-fgvc8/train_images/'
TEST_IMG_DIR = '../input/plant-pathology-2021-fgvc8/test_images/'

In [None]:
labels = train_data.labels.unique()
value_counts = train_data['labels'].value_counts()

In [None]:
plt.figure(figsize=(10, 6))
plt.xticks(rotation=90)
plt.title("Comparison of Different Labels")
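# Take the x-axis from value_counts itself so each bar matches its label
# (unique() keeps order of appearance, value_counts() sorts by frequency)
sns.barplot(x=value_counts.index, y=value_counts.values)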

### Take a look at the labels!

An image may belong to one class or to several. There are 6 label classes in total, 5 of which are diseases:

* scab
* complex
* rust
* frog eye leaf spot
* powdery mildew

The remaining label is "healthy", which is self-explanatory.

This is a multi-label classification problem as one image can represent more than one class of diseases.
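To make the multi-label idea concrete, here is a minimal, dependency-free sketch of how one label string could be turned into a 0/1 target vector. The class names with underscores and the alphabetical ordering are assumptions made for illustration; `to_multi_hot` is a hypothetical helper, not part of the notebook.

```python
# Assumed label vocabulary: the six classes listed above, written with
# underscores (an assumption about how they appear in train.csv).
CLASSES = ["complex", "frog_eye_leaf_spot", "healthy",
           "powdery_mildew", "rust", "scab"]


def to_multi_hot(label_string, classes=CLASSES):
    """Turn a space-separated label string into a 0/1 vector, one slot per class."""
    present = set(label_string.split())
    return [1 if name in present else 0 for name in classes]


single = to_multi_hot("healthy")
multiple = to_multi_hot("scab frog_eye_leaf_spot complex")

print(single)    # [0, 0, 1, 0, 0, 0]
print(multiple)  # [1, 1, 0, 0, 0, 1]
```

An image with several diseases simply has several ones in its vector, which is why the model later needs one independent output per label.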

Let's have a look at some of the images

In [None]:
IMG_SIZE = 250

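for i in range(0, 100, 10):
    img_array = cv2.imread(TRAIN_IMG_DIR + train_data['image'][i])
    # OpenCV loads images as BGR; convert to RGB so matplotlib shows true colors
    img_array = cv2.cvtColor(img_array, cv2.COLOR_BGR2RGB)
    new_array = cv2.resize(img_array, (IMG_SIZE, IMG_SIZE))
    plt.imshow(new_array)
    plt.title(train_data['labels'][i])
    plt.show()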

We can also compare how often each individual label occurs. I will work on a copy of the training data so that the original stays unchanged.

In [None]:
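# Work on a copy so the original DataFrame is left untouched
dummy_train_data = train_data.copy()
# Split "a b c" into separate labels, one row per (image, label) pair
dummy_train_data = dummy_train_data['labels'].str.split(" ", expand=True).stack()
# One 0/1 column per label, summed back to one row per image
label_dummies = pd.get_dummies(dummy_train_data).groupby(level=0).sum()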

We have converted the label strings into one indicator (dummy) column per label using pandas.

In [None]:
label_dummies.head()

So far so good. Let's plot the labels and their occurrence in the training images
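As a sanity check on these counts, the same per-label totals can be obtained without pandas. The sketch below uses made-up label strings in place of the real `labels` column, and `count_individual_labels` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical label strings standing in for the `labels` column of train.csv
sample_labels = ["scab", "healthy", "scab frog_eye_leaf_spot", "rust", "healthy"]


def count_individual_labels(label_strings):
    """Count how often each individual label occurs across all images."""
    counts = Counter()
    for label_string in label_strings:
        # An image with several labels contributes once to each of them
        counts.update(label_string.split())
    return counts


label_totals = count_individual_labels(sample_labels)
print(label_totals.most_common())
```

Because a multi-label image is counted once per label, the totals can add up to more than the number of images.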

In [None]:
cols = label_dummies.columns
label_counts = label_dummies[cols].sum()

In [None]:
plt.figure(figsize=(10, 6))
plt.title("Comparison of all unique Labels")
sns.barplot(x=cols, y=label_counts)

## Problem with the original image size

The original training images are very large. I ran into two issues because of that:

1. During training, most of the time was spent on the CPU loading images. As a result, a single epoch took around 45 minutes. The GPU was in use, but the bottleneck was the CPU, which was busy loading the images from the dataset.

2. When I tried downsizing the images to a lower resolution, my RAM filled up and I could not continue. It also took a long time to downsize 18632 images.

Therefore, I will use a dataset of resized images ([link to the dataset](https://www.kaggle.com/ankursingh12/resized-plant2021)).
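A rough back-of-the-envelope calculation shows why image size matters so much for memory. The full resolution used below (4000x2672) is an assumption for illustration only, and the calculation assumes every decoded image is held in memory at once, which may not match what actually filled the RAM. `batch_memory_gib` is a hypothetical helper.

```python
def batch_memory_gib(num_images, height, width, channels=3, bytes_per_value=1):
    """Memory needed to hold every decoded image at once, in GiB."""
    total_bytes = num_images * height * width * channels * bytes_per_value
    return total_bytes / 1024**3


NUM_IMAGES = 18632  # number of training images mentioned above

# Assumed resolutions, for illustration only
full_res = batch_memory_gib(NUM_IMAGES, 2672, 4000)
resized = batch_memory_gib(NUM_IMAGES, 256, 256)

print(f"full resolution: {full_res:.1f} GiB, 256x256: {resized:.1f} GiB")
```

Under these assumptions the full-resolution set needs hundreds of GiB as uint8 arrays, while the 256x256 version fits in a few GiB, so working from pre-resized files avoids the problem entirely.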

Also, the labels of each training image need to be stored as a list (one entry per label) rather than as a single space-separated string.

In [None]:
train_data['labels'] = train_data['labels'].str.split(" ")

datagen = ImageDataGenerator(rescale=1./255, validation_split=0.1)
final_train_data = datagen.flow_from_dataframe(train_data,
    directory='/kaggle/input/resized-plant2021/img_sz_512',
    x_col="image",
    y_col="labels",
    target_size=(256, 256),
    color_mode="rgb",
    class_mode="categorical",
    subset="training")

In [None]:
validation_data = datagen.flow_from_dataframe(train_data,
    directory='/kaggle/input/resized-plant2021/img_sz_512',
    x_col="image",
    y_col="labels",
    target_size=(256, 256),
    color_mode="rgb",
    class_mode="categorical",
    subset="validation")

## Model Creation and Fitting

I will use the pre-trained InceptionResNetV2 model as the base. On top of it, I will add a GlobalAveragePooling2D layer and a final Dense layer with 6 nodes, one per label, with 'sigmoid' activation. Sigmoid is used because this is a multi-label problem: each label gets its own independent probability.
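The choice of sigmoid over softmax can be illustrated with a small, framework-free sketch. The logits below are made up; the point is only that sigmoid scores each label independently, while softmax forces the scores to compete and sum to 1.

```python
import math


def sigmoid(logits):
    """Independent probability per label."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def softmax(logits):
    """Probabilities that compete with each other and sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for six labels; two of them are strongly positive
logits = [3.0, -2.0, -4.0, 2.5, -3.0, -1.0]

sigmoid_probs = sigmoid(logits)
softmax_probs = softmax(logits)

print([round(p, 2) for p in sigmoid_probs])
print([round(p, 2) for p in softmax_probs])
```

With these logits, sigmoid puts two labels above 0.5, whereas softmax can only ever put one label above 0.5, which would make it a poor fit for images showing several diseases.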

In [None]:
weights = '../input/keras-pretrained-models/inception_resnet_v2_weights_tf_dim_ordering_tf_kernels_notop.h5'

pretrained_weight_model = InceptionResNetV2(
    include_top=False,
    weights=weights,
    input_shape=(256, 256, 3)
)

In [None]:
pretrained_weight_model.input
pretrained_weight_model.output

## F1 score as metric

Since this is multi-label image classification, I will track the macro-averaged F1 score instead of binary accuracy. A macro-average computes the metric independently for each class and then takes the average, so all classes are treated equally.
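The macro-averaging described here can be sketched in plain Python on toy data. This is an illustration of the definition, not the library's implementation; in particular, returning 0.0 for a class with no true or predicted positives is an assumption, and `f1_for_class` and `macro_f1` are hypothetical helper names.

```python
def f1_for_class(y_true, y_pred):
    """F1 for one class, given 0/1 ground truth and 0/1 predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    # Assumption: a class with nothing to find and nothing predicted scores 0.0
    return 2 * tp / denom if denom else 0.0


def macro_f1(true_rows, pred_rows):
    """Compute F1 per class (column), then take the unweighted mean."""
    num_classes = len(true_rows[0])
    per_class = [
        f1_for_class([row[c] for row in true_rows],
                     [row[c] for row in pred_rows])
        for c in range(num_classes)
    ]
    return sum(per_class) / num_classes, per_class


# Toy data: 4 images, 3 classes (rows are images, columns are classes)
true_rows = [[1, 0, 0], [1, 1, 0], [0, 1, 0], [0, 0, 1]]
pred_rows = [[1, 0, 0], [1, 0, 0], [0, 1, 1], [0, 0, 0]]

score, per_class = macro_f1(true_rows, pred_rows)
print(per_class)  # per-class F1: 1.0, 2/3 and 0.0
print(score)      # their mean, 5/9
```

The rare third class is missed entirely and drags the macro score down as much as a frequent class would, which is exactly the "all classes treated equally" property.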

In [None]:
final_model = Sequential([
    pretrained_weight_model,
    GlobalAveragePooling2D(),
    Dense(units=6, activation = 'sigmoid')
])

for layer in final_model.layers[:-1]:
    layer.trainable=False

final_model.summary()

Let's create an early-stopping callback to limit overfitting.
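The stopping rule itself is simple enough to sketch without any framework. The function below is a simplified illustration of patience-based stopping on a higher-is-better metric, with made-up validation scores; `early_stopping_epoch` is a hypothetical helper and is not how Keras exposes this logic.

```python
def early_stopping_epoch(scores, patience=3):
    """Return (stop_epoch, best_epoch) for a higher-is-better metric.

    Epochs are 0-indexed. Training stops once `patience` consecutive
    epochs pass without a strict improvement over the best score so far.
    """
    best_score = float("-inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, score in enumerate(scores):
        if score > best_score:
            best_score, best_epoch = score, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Never triggered: training ran through all epochs
    return len(scores) - 1, best_epoch


# Made-up validation F1 scores per epoch
val_f1 = [0.40, 0.55, 0.61, 0.60, 0.59, 0.61, 0.58]
stop_epoch, best_epoch = early_stopping_epoch(val_f1, patience=3)
print(stop_epoch, best_epoch)  # 5 2
```

Restoring the best weights corresponds to going back to `best_epoch` rather than keeping the model from `stop_epoch`.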

In [None]:
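# Macro-averaged F1 over the 6 labels. Note: with the default threshold=None,
# only the argmax counts as predicted; set threshold=0.5 to score each label on its own.
f1_score = tfadd.metrics.F1Score(num_classes=6, average='macro')

# `monitor` must be the name of a logged metric (a string), not the metric object
early_stopping = EarlyStopping(monitor='val_f1_score', patience=3, mode='max', restore_best_weights=True)

final_model.compile(loss='binary_crossentropy', optimizer=Adam(epsilon=0.01),
              metrics=[f1_score])

history = final_model.fit(final_train_data, epochs=60,
        callbacks=[early_stopping], validation_data=validation_data)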

## Analysis of scores

1. training loss vs. validation loss
2. training F1 score vs. validation F1 score

In [None]:
history_frame = pd.DataFrame(history.history)
history_frame.loc[:, ['loss', 'val_loss']].plot()

In [None]:
history_frame.loc[:, ['f1_score', 'val_f1_score']].plot();

Let's now predict on the test images. First, I will resize each image to 256×256 and save it to the working directory, then run the model on the resized copies.

In [None]:
for i in range(test_data.shape[0]):
    image_path = os.path.join(TEST_IMG_DIR, test_data.image[i])
    with PIL.Image.open(image_path) as image_data:
        # Save a 256x256 copy in the working directory under the same file name
        resized = image_data.resize((256, 256))
        resized.save(f'./{test_data.image[i]}')

In [None]:
final_test_data = datagen.flow_from_dataframe(test_data,
    directory='./',
    x_col="image",
    y_col=None,
    target_size=(256, 256),
    color_mode="rgb",
    class_mode=None,
    classes=None,
    shuffle=False,  # keep predictions in the same order as the rows of test_data
)

In [None]:
pred_data = final_model.predict(final_test_data)
pred_data = pred_data.tolist()

In [None]:
pred_data

Converting the output values to the indices of labels
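The rule used in the next cell (keep every label at or above 0.3, otherwise fall back to the single most likely label) can be tried out on toy probabilities first. The values below are made up, and `select_label_indices` is a hypothetical helper written for this illustration.

```python
def select_label_indices(probabilities, threshold=0.3):
    """Indices of labels at or above the threshold, with an argmax fallback."""
    chosen = [i for i, p in enumerate(probabilities) if p >= threshold]
    if not chosen:
        # No label is confident enough: keep the single most likely one
        chosen = [max(range(len(probabilities)), key=probabilities.__getitem__)]
    return chosen


confident = select_label_indices([0.05, 0.72, 0.01, 0.02, 0.45, 0.10])
uncertain = select_label_indices([0.05, 0.12, 0.25, 0.02, 0.08, 0.10])

print(confident)  # [1, 4]
print(uncertain)  # [2]
```

The fallback guarantees that every image receives at least one label, even when all six probabilities are low.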

In [None]:
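THRESHOLD = 0.3
index_list = []

for pred in pred_data:
    # Keep every label whose predicted probability reaches the threshold;
    # enumerate gives the right index even if two probabilities are equal
    index = [i for i, value in enumerate(pred) if value >= THRESHOLD]
    # If no label qualifies, fall back to the single most likely label
    if not index:
        index.append(int(np.argmax(pred)))
    index_list.append(index)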

Here are the predicted label indices:

In [None]:
index_list

Mapping the labels back to the disease names
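The mapping step amounts to inverting a name-to-index dictionary and joining the names with spaces. The dictionary below is a hypothetical stand-in shaped like `class_indices`; the actual names and ordering come from the training generator.

```python
# Hypothetical mapping shaped like `class_indices`: label name -> column index
class_indices = {"complex": 0, "frog_eye_leaf_spot": 1, "healthy": 2,
                 "powdery_mildew": 3, "rust": 4, "scab": 5}

# Invert it so that a column index can be looked up directly
index_to_name = {index: name for name, index in class_indices.items()}


def indices_to_submission_string(indices):
    """Join the label names for one image into a space-separated string."""
    return " ".join(index_to_name[i] for i in indices)


example = indices_to_submission_string([1, 5])
print(example)  # frog_eye_leaf_spot scab
```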

In [None]:
# class_indices maps label name -> column index; invert it to look up names by index
pred_labels = final_train_data.class_indices
pred_labels = dict((value, key) for key, value in pred_labels.items())

# Join the label names of each image into one space-separated string
pred_label_names = [
    ' '.join(str(pred_labels[i]) for i in indices)
    for indices in index_list
]

In [None]:
pred_label_names

We have our predicted labels. Now the final step is the submission of our predicted labels

In [None]:
resized_test_images = tf.io.gfile.glob('./*.jpg')

for image in resized_test_images:
    os.remove(image)

test_data['labels'] = pred_label_names
# test_data.to_csv('submission.csv', index=False)

In [None]:
test_data

The code below simply saves the trained model, since the submission has some time constraints attached to it.

In [None]:
final_model.save("plant_pathology_2021.h5")