# Why this Competition?
This competition provides another great opportunity to apply computer vision in the real world, with the potential for a meaningful impact on people's lives. It is also a great opportunity for us Data Science enthusiasts to understand the agricultural disease domain and showcase our skills in a competitive setting. Finally, it gives beginners (myself included) a chance to get their hands dirty and take part in constructive discussions and knowledge sharing on this platform.

# Problem Statement
This competition requires us to classify the type of foliar disease an apple tree has, using images of the tree/leaves/fruit. Although computer vision-based models have shown promise for plant disease identification, there are some limitations that need to be addressed. Large variations in visual symptoms of a single disease across different apple cultivars, or new varieties that originated under cultivation, are major challenges for computer vision-based disease identification.
The scoring metric for this competition is also an interesting one for multi-label classification: the mean F1 score (**F1-Mean**).
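
To get a feel for the metric, here is a minimal sketch of one way to compute a mean F1 score for multi-label predictions. The toy label strings are made up, and which averaging the leaderboard actually uses is an assumption on my part (both a per-image and a per-class mean are shown), so treat this as an illustration rather than the official scorer.

```python
from sklearn.metrics import f1_score
from sklearn.preprocessing import MultiLabelBinarizer

# Toy, made-up examples in the space-separated label format
y_true = ["scab", "scab frog_eye_leaf_spot", "healthy"]
y_pred = ["scab", "scab", "healthy"]

# Turn each label string into a multi-hot row, one column per disease
mlb = MultiLabelBinarizer()
mlb.fit([s.split() for s in y_true + y_pred])
true_hot = mlb.transform([s.split() for s in y_true])
pred_hot = mlb.transform([s.split() for s in y_pred])

# F1 computed per image, then averaged over images
f1_per_image = f1_score(true_hot, pred_hot, average="samples", zero_division=0)
# F1 computed per disease, then averaged over diseases
f1_per_class = f1_score(true_hot, pred_hot, average="macro", zero_division=0)

print(f"Mean F1 over images:  {f1_per_image:.3f}")
print(f"Mean F1 over classes: {f1_per_class:.3f}")
```

Missing one of two diseases in a single image costs less under the per-image mean than under the per-class mean, because in this toy data the missed disease only appears once.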

## Why bother?
Apples are one of the most important temperate fruit crops in the world. Foliar (leaf) diseases pose a major threat to the overall productivity and quality of apple orchards. The current process for disease diagnosis in apple orchards is based on manual scouting by humans, which is time-consuming and expensive.

## Data Description:-
* About 18,632 images showing the presence of various types of diseases. **A single photo might also contain multiple diseases**.

## Expected Outcome:-
* Detect the presence of all the diseases in a given image.

## Problem Category:-
From the data and the objective it is evident that this is a **multi-label classification problem** in the **Computer Vision** domain.

# About this Notebook
* Being a beginner myself, I will focus this notebook solely on the basics: getting to know the data and building a primitive yet effective model.
    * This notebook will be updated as and when I learn interesting new things that I think will be useful for the audience. Please keep in mind that I too am on a learning voyage here.
* Our weapon of choice throughout this notebook will be Deep Learning.
* Model Details:-
    * EfficientNet B4
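    * Epochs: 30 (upper limit; early stopping may end training sooner)
    * Optimizer: Adam
    * Loss Function: Categorical Crossentropy (with label smoothing of 0.1)
    * Scheduler: Constant LR with Reduction by 10x on plateau
    * Augmentations: Simple Left-Right and Top-Bottom flips, plus small random hue, saturation, contrast and brightness changes
    * Simple single multi-class classifier (each unique label combination is treated as one class)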
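
To make the "Reduction by 10x on plateau" scheduler a bit more concrete, here is a small pure-Python sketch of the idea. It is a simplified imitation, not the Keras `ReduceLROnPlateau` implementation (cooldown is ignored), and the function name and the toy validation scores are made up for illustration. The defaults mirror the values used later in this notebook (patience of 5, minimum LR of 1e-6).

```python
def simulate_reduce_on_plateau(scores, lr, factor=0.1, patience=5,
                               min_lr=1e-6, min_delta=1e-4):
    """Return the learning rate in effect after each epoch (metric is maximised)."""
    best = float("-inf")
    wait = 0
    lrs = []
    for score in scores:
        if score > best + min_delta:
            # Real improvement: remember it and reset the patience counter
            best, wait = score, 0
        else:
            wait += 1
            if wait >= patience:
                # Plateau detected: shrink the LR, but never below min_lr
                lr = max(lr * factor, min_lr)
                wait = 0
        lrs.append(lr)
    return lrs


# Made-up validation scores that stop improving after the second epoch
val_scores = [0.50, 0.60, 0.60, 0.60, 0.60, 0.60, 0.60]
lr_history = simulate_reduce_on_plateau(val_scores, lr=1e-5)
print(lr_history)
```

With a starting LR of 1e-5 and a floor of 1e-6, a single 10x reduction already hits the floor, so the schedule can drop only once.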

# Imports
Let's start by importing the basic libraries that we need throughout this notebook.

In [None]:
# Aesthetics
import warnings
import sklearn.exceptions
warnings.filterwarnings('ignore', category=DeprecationWarning)
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings("ignore", category=sklearn.exceptions.UndefinedMetricWarning)

# General
import pandas as pd
pd.set_option('display.max_columns', None)
import numpy as np
import os
import time
import cv2
import random
from kaggle_datasets import KaggleDatasets

# Visualisation
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set(style="whitegrid")
from plotly import graph_objs as go
import plotly.express as px
import plotly.figure_factory as ff
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from PIL import Image

# Machine Learning
# Pre-processing
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# Models
from sklearn.model_selection import train_test_split, cross_validate, StratifiedKFold, cross_val_score
# Deep Learning
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.applications import EfficientNetB4
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation, BatchNormalization, GlobalAveragePooling2D
from tensorflow.keras.callbacks import TensorBoard, EarlyStopping, ReduceLROnPlateau, ModelCheckpoint
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import CategoricalCrossentropy
from tensorflow_addons.metrics import F1Score, FBetaScore
from tensorflow_addons.callbacks import TQDMProgressBar
from tensorflow.keras.utils import plot_model

#Metrics
from sklearn import metrics
from sklearn.metrics import confusion_matrix, classification_report

print('TF',tf.__version__)

In [None]:
RANDOM_SEED = 42

In [None]:
def seed_everything(seed=RANDOM_SEED):
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    random.seed(seed)
    tf.random.set_seed(seed)

In [None]:
seed_everything()

In [None]:
data_path = '../input/plant-pathology-2021-fgvc8'

labels_file_path = os.path.join(data_path, 'train.csv')
sample_submission_path = os.path.join(data_path, 'sample_submission.csv')
train_images_path = os.path.join(data_path, 'train_images')
test_images_path = os.path.join(data_path, 'test_images')

print(f'Label File path: {labels_file_path}')
print(f'Sample Submission File path: {sample_submission_path}')
print(f'Train Images path: {train_images_path}')
print(f'Test Images path: {test_images_path}')

# EDA
Let's start by reading the train file...

In [None]:
train_df = pd.read_csv(labels_file_path)
train_df.head()

In [None]:
train_df.shape

In [None]:
train_df.describe()

We see that there is no missing data. Good for us. Now let's see what the unique labels are:-

In [None]:
train_df['labels'].unique()

Thus we have 6 different types of diseases + 1 healthy:-
1. Healthy
2. Scab
3. Frog eye leaf spot
4. Complex
5. Cider apple rust
6. Powdery mildew
7. Rust

And a tree can have several of these diseases at the same time. Now let's see how many examples we have of each...

In [None]:
fig, ax = plt.subplots(figsize=(18, 6))
sns.set_style("whitegrid")
sns.countplot(x='labels', data=train_df);
plt.ylabel("No. of Observations", size=20);
plt.xlabel("Class Name", size=20);
plt.xticks(rotation=90);

Observations:-
1. Single diseases are very highly represented in the train dataset, while combinations of diseases are very rare.
2. Around 51% of the input data belongs to either the Scab or the Healthy category.
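
Numbers like the share above can be read straight off the label column instead of the plot. Below is a minimal sketch on a tiny made-up DataFrame (`toy_df` is a hypothetical stand-in for `train_df`, so the printed percentage is not the real one). It also counts how often each individual disease appears once the combined labels are split on spaces.

```python
import pandas as pd

# Toy, made-up stand-in for train_df
toy_df = pd.DataFrame({
    "image": [f"img_{i}.jpg" for i in range(8)],
    "labels": ["scab", "healthy", "scab", "rust", "healthy",
               "scab frog_eye_leaf_spot", "complex", "scab"],
})

# Share of each label combination (the count plot, expressed as fractions)
combo_share = toy_df["labels"].value_counts(normalize=True)
scab_or_healthy = combo_share[["scab", "healthy"]].sum()

# How often each individual disease appears, alone or in a combination
disease_counts = toy_df["labels"].str.split().explode().value_counts()

print(f"Scab or healthy: {scab_or_healthy:.1%}")
print(disease_counts)
```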

## Utils

In [None]:
def show_image(class_name, examples=2, labels_df=train_df, train_images_path=train_images_path):
    image_list = labels_df[labels_df['labels'] == class_name]['image'].sample(frac=1)[:examples].to_list()
    plt.figure(figsize=(20,10))
    for i, img in enumerate(image_list):
        full_path = os.path.join(train_images_path, img)
        img = Image.open(full_path)
        plt.subplot(1 ,examples, i%examples +1)
        plt.axis('off')
        plt.imshow(img)
        plt.title(class_name)

Let's try to go through each disease category and try to answer the following questions:-
1. What are the various types of diseases?
2. How to identify each disease type?
3. What are the unique identifying characteristics of each disease?

Now with these questions in mind, let's start cracking...

## 1. Healthy
From the class name it is pretty clear that these images are of healthy trees. Let's have a look at them to understand what healthy trees look like, so that we have a baseline understanding.

In [None]:
show_image(class_name='healthy', examples=4)

Okay, this class is pretty straightforward. Clean, green-looking leaves are the identifying characteristic of this class.

## 2. Scab
Scab is a serious disease of apples and ornamental crabapples. Apple scab (Venturia inaequalis) attacks both leaves and fruit. The fungal disease forms pale yellow or olive-green spots on the upper surface of leaves. Dark, velvety spots may appear on the lower surface. Severely infected leaves become twisted and puckered and may drop early in the summer.

Symptoms on fruit are similar to those found on leaves. Scabby spots are sunken and tan and may have velvety spores in the center. As these spots mature, they become larger and turn brown and corky. Infected fruit becomes distorted and may crack allowing entry of secondary organisms. Severely affected fruit may drop, especially when young.  

Let's look at some examples...

In [None]:
show_image(class_name='scab', examples=4)

According to the examples, scab can be identified by:-
1. Tiny spots on the leaves.
2. The spots are usually yellow/brown in color.

## 3. Frog eye leaf spot
First, small purple spots form on the leaves. These spots gradually enlarge and eventually develop into lesions with a light tan interior, surrounded by a dark purple perimeter. Heavy infections of frog-eye leaf spot can cause leaves to turn yellow and drop.

Frog eye leaf spot symptoms on tree trunks and limbs appear as cankers, which are reddish brown in colour and are slightly sunken. As the wood ages it becomes shrunken and layers of bark begin to peel back.

Frog eye leaf spot also causes small purple black spots on the fruit. These spots eventually enlarge to form concentric brown rings.

Let's look at some examples in our dataset...

In [None]:
show_image(class_name='frog_eye_leaf_spot', examples=4)

According to the examples, Frog eye leaf spot can be identified by:-
1. Small patches on the leaves.
2. The patches are usually brown in color.
3. They have a distinctive ring shape with inner and outer rings.

## 4. Complex
According to data description:- Unhealthy leaves with too many diseases to classify visually will have the complex class, and may also have a subset of the diseases identified.

Let's look at some examples of this disease type in our dataset...

In [None]:
show_image(class_name='complex', examples=4)

This falls in line with the description, as we see some leaves with a new type of disease which we have not identified yet, along with something that looks like scab.

## 5. Cider apple rust
This disease is more commonly spelled cedar apple rust; this notebook keeps the `cider_apple_rust` spelling used for the class name. Circular, yellow spots (lesions) appear on the upper surfaces of the leaves shortly after bloom. In late summer, brownish clusters of threads or cylindrical tubes (aecia) appear beneath the yellow leaf spots or on fruits and twigs. The spores associated with the threads or tubes infect the leaves (needles) and twigs of junipers during wet, warm weather.

Let's look at some examples from our dataset...

In [None]:
show_image(class_name='cider_apple_rust', examples=4)

According to the examples, Cider Apple Rust can be identified by:-
1. Small to large patches on the leaves.
2. The patches are usually yellow to reddish in color.
3. They have a distinctive color and a yellow-red ring structure when fully grown.

## 6. Powdery Mildew
Powdery mildew is a fungal disease that affects a wide range of plants.

Powdery mildew is one of the easier plant diseases to identify, as its symptoms are quite distinctive. Infected plants display white powdery spots on the leaves and stems. The lower leaves are the most affected, but the mildew can appear on any above-ground part of the plant. As the disease progresses, the spots get larger and denser as large numbers of asexual spores are formed, and the mildew may spread up and down the length of the plant.

Looking at some examples...

In [None]:
show_image(class_name='powdery_mildew', examples=4)

According to the examples, Powdery Mildew can be identified by:-
1. Large patches on the leaves.
2. The leaves look like they have some powdery residue on them.
3. The patches/residue look white in color.

## 7. Scab + Frog eye leaf spot
This is a combination of Scab and Frog eye leaf spot in a single image... Let's see how those look...

In [None]:
show_image(class_name='scab frog_eye_leaf_spot', examples=4)

As seen in the images, the leaves have the small brown spots characteristic of Scab, as well as larger ring-shaped spots showing the presence of Frog eye leaf spot.

## 8. Scab + Frog eye leaf spot + Complex
This suggests that these images will be similar to the previous class... But might have some additional diseases as well.  
Let's look at some examples...

In [None]:
show_image(class_name='scab frog_eye_leaf_spot complex', examples=4)

As expected, we can mainly see traces of Scab and Frog eye leaf spot. Apart from those, there is a hint of some other rusty disease which we have not encountered yet...

## 9. Frog eye leaf spot + Complex
Similarly, this is a combination of Frog eye leaf spot and some unknown disease. Looking at some images...

In [None]:
show_image(class_name='frog_eye_leaf_spot complex', examples=4)

The unknown disease varies from image to image. There are some images where the unknown disease looks like scab, but in others it looks completely different. So classifying them under the complex umbrella makes sense...

## 10. Rust + Frog eye leaf spot
Similarly this is a combination of Rust and Frog eye leaf spot. Let's see some examples...

In [None]:
show_image(class_name='rust frog_eye_leaf_spot', examples=4)

The rust looks very similar to cider apple rust, and Frog eye leaf spot is clearly present with its characteristic brown ring structure.

## 11. Powdery Mildew + complex
This will have powdery mildew and some unknown disease in the images. Let's see some examples...

In [None]:
show_image(class_name='powdery_mildew complex', examples=4)

As expected, the major component of the leaf image is Powdery Mildew, while there are some small black spots caused by some unknown disease.

## 12. Rust + Complex

In [None]:
show_image(class_name='rust complex', examples=4)

The images have the characteristic rust spot, along with some smaller spots and some brown patches pertaining to some other disease.

**I think that in the multiple-disease categories, Rust actually means the Cider Apple Rust that we have seen earlier. Thus the same identifiers can be used to describe it. This will be useful when we pursue more detailed models later.**

# Model Creation
We will create a basic small model as a baseline for this task. Before that let's define the data generators.

In [None]:
BATCH = 20
IMG_DIM = 380
EPOCHS = 30
IMG_SHAPE = (IMG_DIM, IMG_DIM, 3)
LR = 1e-5

In [None]:
# From https://www.kaggle.com/xhlulu/ranzcr-efficientnet-tpu-training
def auto_select_accelerator():
    try:
        tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
        tf.config.experimental_connect_to_cluster(tpu)
        tf.tpu.experimental.initialize_tpu_system(tpu)
        strategy = tf.distribute.experimental.TPUStrategy(tpu)
        print("Running on TPU:", tpu.master())
    except ValueError:
        strategy = tf.distribute.get_strategy()
    print(f"Running on {strategy.num_replicas_in_sync} replicas")
    
    return strategy

In [None]:
COMPETITION_NAME = "plant-pathology-2021-fgvc8"
strategy = auto_select_accelerator()
BATCH_SIZE = strategy.num_replicas_in_sync * BATCH
GCS_DS_PATH = KaggleDatasets().get_gcs_path(COMPETITION_NAME)

In [None]:
df = pd.read_csv(labels_file_path)
sub_df = pd.read_csv(sample_submission_path)

In [None]:
paths = GCS_DS_PATH + '/train_images/' + df['image']
test_paths = GCS_DS_PATH + '/test_images/' + sub_df['image']

In [None]:
# le = LabelEncoder()
# le.fit(df['labels']);
# df['labels'] = le.transform(df['labels'])

In [None]:
label_cols = 'labels'
# labels = df[label_cols].values
# labels = [[i] for i in labels]

In [None]:
le = LabelEncoder()
le.fit(df['labels']);
integer_encoded = le.transform(df['labels'])

enc = OneHotEncoder(sparse=False)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
onehot_encoded = enc.fit_transform(integer_encoded)
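
To see what the two encoders above do to the label strings, and how we get back from a prediction to a label later on, here is a small sketch on made-up labels. The `toy_` names are hypothetical, and `np.eye` stands in for `OneHotEncoder` only to keep the example short.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up label strings in the same format as the 'labels' column
toy_labels = np.array(["scab", "healthy", "scab frog_eye_leaf_spot", "healthy"])

# Step 1: every unique string (including combinations) gets one integer id,
# assigned in alphabetical order of the strings
toy_le = LabelEncoder()
toy_int = toy_le.fit_transform(toy_labels)

# Step 2: each integer id becomes a one-hot row
toy_onehot = np.eye(len(toy_le.classes_))[toy_int]

# Going back: argmax of a row -> integer id -> original string
recovered = toy_le.inverse_transform(toy_onehot.argmax(axis=1))

print(toy_le.classes_)
print(toy_int)
print(recovered)
```

Because a combination such as `scab frog_eye_leaf_spot` gets its own id, the model can only ever predict combinations it has seen in the training labels.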

In [None]:
# label_map = dict(zip(le.classes_, le.transform(le.classes_)))
# label_inv_map = {v: k for k, v in label_map.items()}

In [None]:
(train_paths, valid_paths,
 train_labels, valid_labels) = train_test_split(paths, onehot_encoded, test_size=0.2, random_state=RANDOM_SEED)
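
The split above is purely random. Since some label combinations are rare, one option worth trying is a stratified split, which keeps the class proportions the same in both parts. The sketch below uses made-up data and is only a suggestion, not what this notebook does. Note that `stratify` needs at least two samples of every class, which I have not checked for the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 80 images of a common class and 20 of a rare combination
toy_paths = np.array([f"img_{i}.jpg" for i in range(100)])
toy_y = np.array(["scab"] * 80 + ["rust complex"] * 20)

# stratify=toy_y keeps the 80/20 class ratio in both the train and valid parts
tr_paths, va_paths, tr_y, va_y = train_test_split(
    toy_paths, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

rare_in_valid = int((va_y == "rust complex").sum())
print(f"Validation size: {len(va_y)}, rare class in validation: {rare_in_valid}")
```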

In [None]:
def build_decoder(with_labels=True, target_size=(IMG_DIM, IMG_DIM), ext='jpg'):
    def decode(path):
        file_bytes = tf.io.read_file(path)
        if ext == 'png':
            img = tf.image.decode_png(file_bytes, channels=3)
        elif ext in ['jpg', 'jpeg']:
            img = tf.image.decode_jpeg(file_bytes, channels=3)
        else:
            raise ValueError("Image extension not supported")

        img = tf.cast(img, tf.float32) / 255.0
        img = tf.image.resize(img, target_size)

        return img
    
    def decode_with_labels(path, label):
        return decode(path), label
    
    return decode_with_labels if with_labels else decode

In [None]:
def build_augmenter(with_labels=True):
    def augment(img):
        img = tf.image.random_flip_left_right(img)
        img = tf.image.random_flip_up_down(img)
        img = tf.image.random_hue(img, 0.01)
        img = tf.image.random_saturation(img, 0.70, 1.30)
        img = tf.image.random_contrast(img, 0.80, 1.20)
        img = tf.image.random_brightness(img, 0.2)
        return img
    
    def augment_with_labels(img, label):
        return augment(img), label
    
    return augment_with_labels if with_labels else augment

In [None]:
# From https://www.kaggle.com/xhlulu/ranzcr-efficientnet-tpu-training
def build_dataset(paths, labels=None, bsize=BATCH, cache=True,
                  decode_fn=None, augment_fn=None,
                  augment=True, repeat=True, shuffle=1024,
                  cache_dir=""):
    if cache_dir != "" and cache is True:
        os.makedirs(cache_dir, exist_ok=True)
    
    if decode_fn is None:
        decode_fn = build_decoder(labels is not None)
    
    if augment_fn is None:
        augment_fn = build_augmenter(labels is not None)
        
    AUTO = tf.data.experimental.AUTOTUNE
    slices = paths if labels is None else (paths, labels)
    
    dset = tf.data.Dataset.from_tensor_slices(slices)
    dset = dset.map(decode_fn, num_parallel_calls=AUTO)
    dset = dset.cache(cache_dir) if cache else dset
    dset = dset.map(augment_fn, num_parallel_calls=AUTO) if augment else dset
    dset = dset.repeat() if repeat else dset
    dset = dset.shuffle(shuffle) if shuffle else dset
    dset = dset.batch(bsize).prefetch(AUTO)
    
    return dset

In [None]:
def get_model(IMG_DIM=IMG_DIM, Num_Class=df['labels'].nunique()):
    with strategy.scope():
        IMG_SHAPE = (IMG_DIM, IMG_DIM, 3)
        
        feature_extractor = EfficientNetB4(input_shape=IMG_SHAPE,
                                           include_top=False,
                                           drop_connect_rate=0.2,
                                           weights='imagenet')
        feature_extractor.trainable = True

        global_average_layer = GlobalAveragePooling2D()
        dense_layer = Dense(256, activation='relu')
        softmax_layer = Dense(Num_Class, activation='softmax')
        
        clf_model = Sequential([feature_extractor, global_average_layer,
                                Dropout(0.2), dense_layer, Dropout(0.2),
                                softmax_layer])

        clf_model.compile(
            optimizer = Adam(learning_rate=LR),
            loss = CategoricalCrossentropy(label_smoothing=0.1),
            metrics=[F1Score(num_classes=Num_Class, average='macro'),
                     'accuracy']
        )

        clf_model.summary()
        return clf_model
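
The loss above uses `label_smoothing=0.1`. As a rough illustration of what that does to a one-hot target, here is a NumPy sketch of the usual smoothing formula (move a fraction of the probability mass away from the true class and spread it evenly over all classes). The helper name is made up, and the 12 classes are an assumption based on the twelve label combinations looked at in the EDA.

```python
import numpy as np

def smooth_labels(onehot, label_smoothing=0.1):
    """Blend one-hot targets with a uniform distribution over the classes."""
    num_classes = onehot.shape[-1]
    return onehot * (1.0 - label_smoothing) + label_smoothing / num_classes

# One made-up target whose true class is index 3 out of 12
target = np.eye(12)[[3]]
smoothed = smooth_labels(target)

print(np.round(smoothed, 4))
print("Sum of probabilities:", smoothed.sum())
```

The true class keeps most of the mass while every other class gets a small non-zero target, which discourages the network from becoming over-confident.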

In [None]:
decoder = build_decoder(with_labels=True, target_size=(IMG_DIM, IMG_DIM))
test_decoder = build_decoder(with_labels=False, target_size=(IMG_DIM, IMG_DIM))

train_dataset = build_dataset(
    train_paths, train_labels, bsize=BATCH_SIZE, decode_fn=decoder
)

valid_dataset = build_dataset(
    valid_paths, valid_labels, bsize=BATCH_SIZE, decode_fn=decoder,
    repeat=False, shuffle=False, augment=False
)

test_dataset = build_dataset(
    test_paths, cache=False, bsize=BATCH_SIZE, decode_fn=test_decoder,
    repeat=False, shuffle=False, augment=False
)

In [None]:
STEPS_PER_EPOCH = train_paths.shape[0] // BATCH_SIZE
VALID_STEPS = valid_paths.shape[0] // BATCH_SIZE

early = EarlyStopping(monitor='val_accuracy', min_delta=0.001, patience=10, verbose=0, mode='max', baseline=None, restore_best_weights=True)
# Monitor a metric that compile() actually logs; factor defaults to 0.1 (10x reduction)
lr_reducer = ReduceLROnPlateau(monitor='val_accuracy', patience=5, min_lr=1e-6, mode='max')
checkpoint = ModelCheckpoint(monitor='val_accuracy', filepath='weights_EffB4.hdf5', mode='max', verbose=0, save_best_only=True)
tqdm_callback = TQDMProgressBar()

callbacks_list = [lr_reducer, checkpoint, early, tqdm_callback]
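
For a rough idea of what the step counts above work out to, here is the arithmetic as a small sketch. The image count comes from the data description (about 18,632), and the 8 replicas are an assumption for a TPU run (on a CPU or a single GPU it would be 1), so the real numbers may differ.

```python
import math

n_images = 18_632        # from the data description ("about 18,632")
test_size = 0.2          # same fraction as in train_test_split
batch_per_replica = 20   # BATCH
replicas = 8             # assumption: TPU with 8 cores

# train_test_split rounds the validation part up
n_valid = math.ceil(n_images * test_size)
n_train = n_images - n_valid

global_batch = batch_per_replica * replicas
steps_per_epoch = n_train // global_batch
valid_steps = n_valid // global_batch

print(f"train={n_train}, valid={n_valid}, batch={global_batch}")
print(f"steps_per_epoch={steps_per_epoch}, valid_steps={valid_steps}")
```

Because of the floor division, the last partial batch of validation images is not evaluated when `validation_steps` is passed to `fit`.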

In [None]:
clf_model = get_model()

In [None]:
plot_model(clf_model, show_shapes=True, show_layer_names=True, to_file='model.png')
img = Image.open('model.png')
plot_dim = (10, 15)
ax = plt.subplots(figsize=plot_dim)
plt.imshow(img)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()
os.remove('model.png')

In [None]:
validation_steps=20

loss0, f10, accuracy0 = clf_model.evaluate(x = valid_dataset, steps = validation_steps)

In [None]:
print("initial loss: {:.2f}".format(loss0))
print("initial accuracy: {:.2f}".format(accuracy0))
print("initial F1: {:.2f}".format(f10))
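
As a sanity check for these initial numbers, it helps to know what a classifier that predicts every class with equal probability would score. The sketch below assumes 12 classes (the twelve label combinations seen in the EDA). The freshly initialised head will not be exactly uniform, so this is only a ballpark to compare against.

```python
import math

num_classes = 12  # assumption: twelve unique label combinations

# Cross-entropy of a uniform prediction; targets sum to 1 (smoothed or not),
# so the loss is -log(1/K) = log(K)
uniform_loss = -math.log(1.0 / num_classes)
chance_accuracy = 1.0 / num_classes

print(f"Loss of a uniform guess: {uniform_loss:.2f}")
print(f"Accuracy of a uniform random guess: {chance_accuracy:.3f}")
```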

In [None]:
tf.keras.backend.clear_session()
history = clf_model.fit(train_dataset,
                        steps_per_epoch=STEPS_PER_EPOCH,
                        epochs = EPOCHS,
                        validation_data = valid_dataset,
                        validation_steps=VALID_STEPS,
                        callbacks=callbacks_list,
                        verbose = 0)

In [None]:
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']

loss = history.history['loss']
val_loss = history.history['val_loss']

f1 = history.history['f1_score']
val_f1 = history.history['val_f1_score']
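
Before plotting, it can be handy to pull the best epoch out of these lists directly. Here is a minimal sketch using a made-up history dictionary (`toy_history` and `best_epoch` are hypothetical names, and the numbers are not results from this notebook).

```python
import numpy as np

# Made-up values shaped like history.history
toy_history = {
    "val_accuracy": [0.61, 0.72, 0.78, 0.77, 0.78],
    "val_f1_score": [0.40, 0.55, 0.63, 0.66, 0.65],
}

def best_epoch(history_dict, key):
    """Return (1-based epoch, value) of the highest entry; ties keep the first."""
    values = np.asarray(history_dict[key])
    idx = int(values.argmax())
    return idx + 1, float(values[idx])

print("Best val accuracy:", best_epoch(toy_history, "val_accuracy"))
print("Best val F1:", best_epoch(toy_history, "val_f1_score"))
```

The two metrics can peak at different epochs, which matters here because the checkpoint is chosen by validation accuracy while the competition is scored on F1.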

In [None]:
fig = go.Figure()
fig.add_trace(go.Scatter(y = val_loss, name = 'Validation Loss', line = dict(color='royalblue', width=4)))
fig.add_trace(go.Scatter(y = loss, name = 'Training Loss', line = dict(color='royalblue', width=4, dash='dash')))

fig.update_layout(title = 'Trained Loss History', xaxis_title = 'Epoch', yaxis_title = 'Loss')

fig.show()

In [None]:
fig = go.Figure()
fig.add_trace(go.Scatter(y = val_acc, name = 'Validation Accuracy', line = dict(color='firebrick', width=4)))
fig.add_trace(go.Scatter(y = acc, name = 'Training Accuracy', line = dict(color='firebrick', width=4, dash='dash')))

fig.update_layout(title='Trained Accuracy History', xaxis_title='Epoch', yaxis_title='Accuracy')

fig.show()

In [None]:
fig = go.Figure()
fig.add_trace(go.Scatter(y = val_f1, name = 'Validation F1 Score', line = dict(color='firebrick', width=4)))
fig.add_trace(go.Scatter(y = f1, name = 'Training F1 Score', line = dict(color='firebrick', width=4, dash='dash')))

fig.update_layout(title='Trained F1 History', xaxis_title='Epoch', yaxis_title='F1 Macro Score')

fig.show()

# Restore Best Weights

In [None]:
clf_model.load_weights('./weights_EffB4.hdf5')

# Save History Info

In [None]:
hist = pd.DataFrame(history.history)
hist.to_csv('EffNet_B4_History.csv')

# Prediction And Submission

In [None]:
clf_model.predict(test_dataset, verbose=1)

In [None]:
preds = np.argmax(clf_model.predict(test_dataset, verbose=1), axis=1)
inverted = le.inverse_transform(preds)

In [None]:
sub_df[label_cols] = inverted
sub_df.head()

In [None]:
sub_df.to_csv('submission.csv', index=False)