# Modeling and Evaluation

## Objectives

* Answer business requirement 2:
    - The client is interested in predicting if a cherry leaf is healthy or contains powdery mildew

## Inputs

* Split datasets:
    - inputs/datasets/cherry-leaves/train
    - inputs/datasets/cherry-leaves/validation
    - inputs/datasets/cherry-leaves/test

## Outputs

* Plot of balance of target labels in each set
* Target class names
* Leaf health classification model
* Model learning plots - loss and accuracy
* Evaluation of test set performance

---

## Change working directory

Change working directory to project root directory

In [None]:
import os
current_dir = os.getcwd()
current_dir

In [None]:
os.chdir(os.path.dirname(current_dir))

# confirm new directory
current_dir = os.getcwd()
current_dir

---

## Set up directories and variables

### Store file paths

Input directories

In [None]:
data_dir = "inputs/datasets/cherry-leaves"

train_dir = data_dir + "/train"
val_dir = data_dir + "/validation"
test_dir = data_dir + "/test"

### Create outputs directory

In [None]:
# Set version here
version = "v1"

file_path = f"outputs/{version}"

# only create the directory if this version tag has not been used before
if os.path.isdir(file_path):
    print("This version tag has already been used. Create a new version.")
else:
    os.makedirs(name=file_path)

### Store label names

In [None]:
labels = os.listdir(train_dir)
print("The image labels are:", labels)

---

## Import packages

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.image import imread
import seaborn as sns

sns.set_style("white")

---

## Display balance of target labels in each set

In [None]:
# code adapted from Code Institute walkthrough projects
# e.g. https://github.com/Code-Institute-Solutions/WalkthroughProject01
def plot_target_balance_per_set(data_dir, save_image=False):
    # DataFrame.append was removed in pandas 2.0, so collect the rows in a list first
    rows = []
    for folder in ["train", "validation", "test"]:
        for label in labels:
            # count the image files in this set/label folder
            n_images = len(os.listdir(os.path.join(data_dir, folder, label)))
            rows.append({"Set": folder, "Label": label, "Frequency": n_images})
            print(f"* {folder} - {label}: {n_images} images")

    df_freq = pd.DataFrame(rows)

    print("\n")
    sns.set_style("white")
    plt.figure(figsize=(8, 5))
    sns.barplot(data=df_freq, x="Set", y="Frequency", hue="Label")

    if save_image:
        plt.savefig(
            f"{file_path}/labels_distribution.png", bbox_inches="tight", dpi=150
        )

    plt.show()

In [None]:
plot_target_balance_per_set(data_dir)

Save if image looks good

In [None]:
plot_target_balance_per_set(data_dir, save_image=True)

---

## Load images

Images are loaded in batches to reduce working memory usage during model training.

`label_mode` is set to `categorical` so that the target is one-hot encoded. This is needed to train a model whose final layer uses a softmax activation with one neuron per class.
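
As a minimal illustration of what one-hot encoding produces for a two-class problem, the sketch below encodes a few integer labels by hand. The class names `class_a` and `class_b` and the helper `one_hot` are hypothetical placeholders; in the notebook the real names come from the folder names in `train_dir`, and the encoding itself is done by `image_dataset_from_directory`.

```python
# Hypothetical class names, used only for illustration
class_names = ["class_a", "class_b"]


def one_hot(index, num_classes):
    """Return a list with 1.0 at position `index` and 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


# Integer labels as they might be assigned to four images
integer_labels = [0, 1, 1, 0]
encoded = [one_hot(i, len(class_names)) for i in integer_labels]

for label, vector in zip(integer_labels, encoded):
    print(class_names[label], vector)
```

Each encoded label has length 2, which is why the label shape reported for each dataset should be `(None, 2)`.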

In [None]:
from tensorflow.keras.utils import image_dataset_from_directory

batch_size = 20

train_set = image_dataset_from_directory(
    train_dir,
    label_mode="categorical",  # encode labels as categorical vector, i.e. OHE
    seed=123,
    batch_size=batch_size,
)

train_set  # second shape tuple should be (None, 2)

In [None]:
validation_set = image_dataset_from_directory(
    val_dir,  # load the validation images, not the training images
    label_mode="categorical",
    seed=123,
    batch_size=batch_size,
)

validation_set  # second shape tuple should be (None, 2)

In [None]:
test_set = image_dataset_from_directory(
    test_dir,  # load the held-out test images, not the training images
    label_mode="categorical",
    seed=123,
    batch_size=batch_size,
)

test_set  # second shape tuple should be (None, 2)

---

## Save class names

Save label class names so these can be displayed to the user after predictions

In [None]:
import joblib

joblib.dump(value=train_set.class_names, filename=f"{file_path}/class_names.pkl")

---

## Define model

In [None]:
# import packages
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (
    Activation,
    Dropout,
    Flatten,
    Dense,
    Conv2D,
    MaxPooling2D,
    Rescaling
)

Initially, we create a fairly standard convolutional neural network as a starting point.

* Rescaling is done within the model (first layer) so that any image passed in real time will have this applied automatically
* Three pairs of convolution/pooling layers are used with kernel sizes of 3 x 3 and pool sizes of 2 x 2, as this is a standard starting configuration (see e.g. Code Institute [Walkthrough Project 1](https://github.com/Code-Institute-Solutions/WalkthroughProject01))
* A Flatten layer converts the feature maps into a one-dimensional vector, which the subsequent Dense layers require
* One Dense layer is used, followed by a dropout layer to reduce the risk of model overfitting
* The output layer uses a softmax activation function with 2 neurons since there are 2 label classes - because of this, the loss function used is `categorical_crossentropy` (see for example CI TensorFlow lesson)
* The optimizer is `adam` and the performance metric evaluated is overall accuracy
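
To make the link between the softmax output and the `categorical_crossentropy` loss concrete, here is a minimal pure-Python sketch of both calculations for a single image. The raw scores (`logits`) are made-up values, and the functions are simplified stand-ins written for illustration, not the Keras implementations.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)
    # subtracting the maximum keeps the exponentials numerically stable
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(one_hot_target, probs, eps=1e-7):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot_target, probs))


logits = [2.0, 0.5]  # hypothetical raw scores for the two classes
probs = softmax(logits)

# loss when the true class is the first one vs. the second one
loss_if_first = categorical_crossentropy([1.0, 0.0], probs)
loss_if_second = categorical_crossentropy([0.0, 1.0], probs)

print("probabilities:", probs)
print("loss if class 0 is correct:", loss_if_first)
print("loss if class 1 is correct:", loss_if_second)
```

The loss is small when the model puts most of its probability on the true class and large otherwise, which is the behaviour the optimizer exploits during training.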

In [None]:
# shape of all images in dataset
image_shape = (256, 256, 3)


def create_model():
    model = Sequential()

    # rescale data; as the first layer, this is where the input shape is declared
    model.add(Rescaling(1.0 / 255, input_shape=image_shape))

    # first pair
    model.add(
        Conv2D(
            filters=32,
            kernel_size=(3, 3),
            activation="relu",
        )
    )
    model.add(MaxPooling2D(pool_size=(2, 2)))

    # second pair
    model.add(
        Conv2D(
            filters=64,
            kernel_size=(3, 3),
            activation="relu",
        )
    )
    model.add(MaxPooling2D(pool_size=(2, 2)))

    # third pair
    model.add(
        Conv2D(
            filters=64,
            kernel_size=(3, 3),
            activation="relu",
        )
    )
    model.add(MaxPooling2D(pool_size=(2, 2)))

    model.add(Flatten())

    model.add(Dense(128, activation="relu"))
    model.add(Dropout(0.5))

    model.add(Dense(2, activation="softmax"))

    model.compile(
        loss="categorical_crossentropy",
        optimizer="adam",
        metrics=["accuracy"],
    )

    return model

---

## Train model

Define early stopping to avoid model overfitting. Stop model training when the validation set accuracy stops improving.

A `patience` of 10 and then 5 were tried first. Both led to model overfitting (a drop in validation accuracy in the last epoch(s)).
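
The effect of `patience` can be sketched with a small simulation. The function below is a simplified stand-in for the callback's logic (it is not the Keras implementation), and the validation accuracies in `history` are made-up values chosen only to show how different `patience` settings change the stopping epoch.

```python
def early_stopping_epoch(val_accuracies, patience):
    """Return the 1-based epoch after which training would stop.

    Training stops once `patience` consecutive epochs have passed
    without the validation accuracy exceeding its best value so far.
    If that never happens, the last epoch is returned.
    """
    best = float("-inf")
    wait = 0
    for epoch, value in enumerate(val_accuracies, start=1):
        if value > best:
            best = value
            wait = 0  # improvement: reset the counter
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_accuracies)


# Hypothetical validation accuracies, one per epoch
history = [0.90, 0.95, 0.97, 0.96, 0.97, 0.96, 0.95, 0.94]

for patience in (3, 5, 10):
    print(f"patience={patience}: stops after epoch", early_stopping_epoch(history, patience))
```

With a larger `patience`, training continues for more epochs after the best validation accuracy has been reached, which gives the model more opportunity to overfit the training data.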

In [None]:
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor="val_accuracy", patience=3)

Fit the model

In [None]:
model = create_model()

model.fit(
    train_set,
    epochs=100,
    validation_data=validation_set,
    callbacks=[early_stop],
    verbose=1,
)

Save the model

In [None]:
model.save(f"{file_path}/leaf_health_clf_model.h5")  # use the version set above

---

## Evaluate model

### Plot model learning curve

Plot loss and accuracy for training and validation sets

In [None]:
def plot_learning_curve(model, save_image=False):
    losses = pd.DataFrame(model.history.history)

    sns.set_style("whitegrid")
    losses[["loss", "val_loss"]].plot(style=".-")
    plt.title("Loss")
    if save_image:
        plt.savefig(f"{file_path}/model_training_losses.png", bbox_inches="tight", dpi=150)
    plt.show()

    print("\n")
    losses[["accuracy", "val_accuracy"]].plot(style=".-")
    plt.title("Accuracy")
    if save_image:
        plt.savefig(f"{file_path}/model_training_accuracy.png", bbox_inches="tight", dpi=150)
    plt.show()

plot_learning_curve(model)

Discussion: both the train and validation sets have low loss and high accuracy. The lines have similar shapes and the validation set performance is not significantly different from that of the train set, suggesting that the model did not overfit.

Save if images look good

In [None]:
plot_learning_curve(model, save_image=True)

### Test model

Test model on test set

In [None]:
evaluation = model.evaluate(test_set)

Save test evaluation

In [None]:
joblib.dump(value=evaluation, filename=f"{file_path}/evaluation.pkl")

### Check model size

GitHub limit: 100MB

In [None]:
os.stat(f"{file_path}/leaf_health_clf_model.h5").st_size  # size in bytes
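
Since `st_size` is reported in bytes, it can help to convert it before comparing against the limit. The sketch below does this with a hypothetical helper, `file_size_mb`, and demonstrates it on a temporary 2 MB file rather than on the saved model. Treating 1 MB as 1024 x 1024 bytes is an assumption.

```python
import os
import tempfile

GITHUB_LIMIT_MB = 100  # limit stated above


def file_size_mb(path):
    """Return the size of the file at `path` in megabytes."""
    # assumption: 1 MB is taken as 1024 * 1024 bytes
    return os.stat(path).st_size / (1024 * 1024)


# Demonstrate on a temporary file of exactly 2 MB
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\0" * (2 * 1024 * 1024))
    demo_path = tmp.name

demo_size = file_size_mb(demo_path)
print(f"{demo_size:.1f} MB, under limit: {demo_size < GITHUB_LIMIT_MB}")

os.remove(demo_path)  # clean up the temporary file
```

The same helper could be called with the path of the saved model file.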

---

## Test prediction on live data

Choose a random image from the test dataset

In [None]:
from tensorflow.keras.preprocessing import image

pointer = 66  # choose a random number to select an image
label = labels[0]  # select class of leaf image

test_image = image.load_img(
    test_dir + "/" + label + "/" + os.listdir(test_dir + "/" + label)[pointer],
    target_size=image_shape[:2],  # (height, width) only; resize to model training image size
    color_mode="rgb",
)

print(f"Image size (width, height): {test_image.size}")
test_image

Convert the image to an array and add a dimension of length 1 (the model expects a dimension indicating the number of images)

In [None]:
test_image = np.expand_dims(image.img_to_array(test_image), axis=0)
test_image.shape  # should be (1, 256, 256, 3)

Use the model to make a prediction and display the most likely label

In [None]:
class_names = train_set.class_names

# predict on the data, returns an array of probabilities
prediction_probs = model.predict(test_image)

# get the index of the highest probability, use to select label from class_names
prediction_class = class_names[np.argmax(prediction_probs, axis=1)[0]]
prediction_class
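
To show the label lookup in isolation, the sketch below applies the same argmax step to a made-up prediction and also reports the probability of the chosen class. The class names and probabilities are hypothetical, and the `argmax` helper is a pure-Python stand-in for `np.argmax`.

```python
# Hypothetical values, used only for illustration
class_names = ["class_a", "class_b"]
prediction_probs = [[0.03, 0.97]]  # one image, two class probabilities


def argmax(values):
    """Return the index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


first_image_probs = prediction_probs[0]
predicted_index = argmax(first_image_probs)
predicted_label = class_names[predicted_index]
confidence = first_image_probs[predicted_index]

print(f"Predicted label: {predicted_label} ({confidence:.0%} probability)")
```

Reporting the probability alongside the label gives the user a sense of how certain the model is about each prediction.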

---

## Conclusions and next steps

The trained model performed with 99% accuracy on the test data, which meets the business requirement (97% accuracy). Because of this, no further dataset manipulation (such as image augmentation) or model training steps were required. The model, plots and other outputs outlined at the beginning of the notebook have been saved, ready for use in the dashboard. An image was used to test whether a label prediction can be made on live data.

The model currently has very high performance but is fairly large. If desired by the client, a potential next step might be to develop a smaller model that has comparable performance. An initial approach in this case would be to reduce the size of training images.
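
A rough calculation suggests why smaller images would shrink the model. The sketch below follows the architecture defined earlier (three 3 x 3 convolution / 2 x 2 pooling pairs, 64 filters in the last pair, then a 128-unit Dense layer) and assumes the Keras defaults of `"valid"` padding and a pooling stride equal to the pool size. The function names are hypothetical, and the 128-pixel input is only an example size.

```python
def conv_pool_output(size, kernel=3, pool=2):
    """Side length after one convolution/pooling pair."""
    after_conv = size - kernel + 1  # "valid" padding, stride 1
    return after_conv // pool  # pooling with stride equal to pool size


def first_dense_layer_params(image_size, n_pairs=3, last_filters=64, dense_units=128):
    """Weights plus biases in the Dense layer that follows Flatten."""
    side = image_size
    for _ in range(n_pairs):
        side = conv_pool_output(side)
    flattened = side * side * last_filters
    return flattened * dense_units + dense_units


for image_size in (256, 128):
    params = first_dense_layer_params(image_size)
    print(f"{image_size} x {image_size} input: {params:,} parameters in the first Dense layer")
```

Under these assumptions, halving the image side length reduces the parameter count of this layer from 7,372,928 to 1,605,760, so most of the saving would come from the layer that follows Flatten.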

---