# 1. Train, validation, test procedure

The provided code defines a `Model` class, which is used to create, train, validate, and save a model that aims to identify pneumonia in chest X-rays.

All the models developed later will follow this organization:


The `__init__` method initializes the class. 
It sets up paths to the training data and creates `Dataset` objects for both the training and validation data. 

These datasets are then built using TensorFlow's `AUTOTUNE` functionality for optimized data loading.

Various information about the data, such as the class names and batch shapes, is then printed out.

In [None]:
import pathlib
import tensorflow as tf
import tensorflowjs as tfjs
import matplotlib.pyplot as plt

from x_ray_dataset_builder import Dataset


class Model:
    def __init__(self):
        train_dir = pathlib.Path("data/train")

        train_ds = Dataset(train_dir, 0.2, "training")
        val_ds = Dataset(train_dir, 0.2, "validation")

        AUTOTUNE = tf.data.AUTOTUNE

        train_ds.build(AUTOTUNE, True)
        val_ds.build(AUTOTUNE)

        class_names = train_ds.get_class_names()
        print("\nClass names:")
        print(class_names)

        train_x_batch_shape = train_ds.get_x_batch_shape()
        print("\nTraining dataset's images batch shape is:")
        print(train_x_batch_shape)

        train_y_batch_shape = train_ds.get_y_batch_shape()
        print("\nTraining dataset's labels batch shape is:")
        print(train_y_batch_shape)

        train_ds.display_images_in_batch(2, "Training dataset")
        train_ds.display_batch_number("Training dataset")
        train_ds.display_distribution("Training dataset")
        train_ds.display_mean("Training dataset")

        # Query the validation dataset here (not the training one).
        val_x_batch_shape = val_ds.get_x_batch_shape()
        print("\nValidation dataset's images batch shape is:")
        print(val_x_batch_shape)

        val_y_batch_shape = val_ds.get_y_batch_shape()
        print("\nValidation dataset's labels batch shape is:")
        print(val_y_batch_shape)

        val_ds.display_images_in_batch(2, "Validation dataset")
        val_ds.display_batch_number("Validation dataset")
        val_ds.display_distribution("Validation dataset")
        val_ds.display_mean("Validation dataset")

        self.class_names = class_names
        self.train_ds = train_ds.normalized_dataset
        self.val_ds = val_ds.normalized_dataset
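
The two `Dataset(train_dir, 0.2, ...)` calls suggest that a single directory is split into a training part and a validation part. The internals of `Dataset` are not shown in this notebook, so the following is only a minimal sketch of the idea in plain Python. It assumes a seeded shuffle so that both subsets are disjoint; the helper `split_files`, its `seed` parameter, and the file names are hypothetical.

```python
import random


def split_files(files, validation_split, subset, seed=123):
    """Return the 'training' or 'validation' part of a file list.

    Hypothetical helper: both calls must use the same seed so that the
    shuffle is identical and the two parts never overlap.
    """
    if subset not in ("training", "validation"):
        raise ValueError("subset must be 'training' or 'validation'")

    shuffled = sorted(files)               # fixed starting order
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle

    n_val = int(len(shuffled) * validation_split)
    if subset == "validation":
        return shuffled[:n_val]
    return shuffled[n_val:]


files = [f"img_{i:03d}.jpeg" for i in range(10)]  # made-up file names
train_files = split_files(files, 0.2, "training")
val_files = split_files(files, 0.2, "validation")

print(len(train_files), len(val_files))
```

Using the same seed for both calls is what keeps a validation image from also appearing in the training subset.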



The `build` method defines and compiles the model.

The model is a simple feed-forward neural network (also known as a multi-layer perceptron or MLP) with one hidden layer of 128 neurons. The goal is not to build a high-performing model but to implement a first train, validation, test procedure.

The input data are flattened before being passed through the network.

The output layer uses a softmax activation function, which is standard for multi-class classification problems.

The model is compiled with the Adam optimizer and categorical cross-entropy loss, which is also standard for such tasks. 
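
To make the softmax and categorical cross-entropy pairing concrete, here is a small sketch in plain Python for a single sample with two classes. The input scores are made up, and the `eps` clipping value is an assumption added only to avoid `log(0)`; this illustrates the formulas and is not the Keras implementation.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(one_hot, probs, eps=1e-7):
    """Loss for one sample: -sum(y_true * log(y_pred))."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(one_hot, probs))


probs = softmax([0.5, 2.0])  # made-up scores for two classes
loss_if_class_1 = categorical_cross_entropy([0, 1], probs)  # label is index 1
loss_if_class_0 = categorical_cross_entropy([1, 0], probs)  # label is index 0

print(probs, loss_if_class_1, loss_if_class_0)
```

The loss is small when the probability assigned to the true class is high, and grows as that probability shrinks.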

Four metrics are monitored during training: categorical accuracy, precision, recall, and AUC (Area Under the ROC Curve).

In [None]:
    def build(self):
        model = tf.keras.Sequential(
            [
                tf.keras.layers.Flatten(input_shape=(180, 180, 1)),
                tf.keras.layers.Dense(128, activation="relu"),
                tf.keras.layers.Dense(len(self.class_names), activation="softmax"),
            ]
        )

        model.compile(
            optimizer="adam",
            loss=tf.keras.losses.CategoricalCrossentropy(from_logits=False),
            metrics=[
                tf.keras.metrics.CategoricalAccuracy(), 
                tf.keras.metrics.Precision(), 
                tf.keras.metrics.Recall(),
                tf.keras.metrics.AUC()
            ],
        )

        model.summary()

        return model
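
The size of this network can be checked by hand against what `model.summary()` prints. The sketch below counts the trainable parameters of the two `Dense` layers, assuming two classes (the text describes the task as binary); with a different number of classes, only the output layer count changes.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units


n_classes = 2              # assumption: two classes
flat_size = 180 * 180 * 1  # size of the Flatten output (no parameters)

hidden = dense_params(flat_size, 128)  # hidden layer of 128 neurons
output = dense_params(128, n_classes)  # softmax output layer
total = hidden + output

print(flat_size, hidden, output, total)
```

Almost all parameters sit in the first `Dense` layer, because every one of the 32,400 flattened pixels is connected to each of the 128 hidden neurons.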



The `train` method trains the model for a specified number of epochs.

Once training has finished, the model's accuracy and loss on both the training and validation data are plotted for each epoch.

This allows for the monitoring of the model's performance over time and the detection of any overfitting (where the model performs well on the training data but poorly on the validation data).

After training, the model is saved in both Keras' native format and in a format compatible with TensorFlow.js, which enables the model to be used in a web browser.

In [None]:
    def train(self, epochs):
        model = self.build()

        print("\nStarting training...")
        history = model.fit(self.train_ds, validation_data=self.val_ds, epochs=epochs)
        print("\n\033[92mTraining done !\033[0m")

        acc = history.history["categorical_accuracy"]
        val_acc = history.history["val_categorical_accuracy"]

        loss = history.history["loss"]
        val_loss = history.history["val_loss"]

        epochs_range = range(epochs)

        plt.figure(figsize=(8, 8))
        plt.subplot(1, 2, 1)
        plt.plot(epochs_range, acc, label="Training Categorical Accuracy")
        plt.plot(epochs_range, val_acc, label="Validation Categorical Accuracy")
        plt.legend(loc="lower right")
        plt.title("Training and Validation Categorical Accuracy")

        plt.subplot(1, 2, 2)
        plt.plot(epochs_range, loss, label="Training Loss")
        plt.plot(epochs_range, val_loss, label="Validation Loss")
        plt.legend(loc="upper right")
        plt.title("Training and Validation Loss")
        plt.show()

        print("\nSaving...")
        model.save("notebooks/1_train_validation_test_procedure/model_1.keras")
        tfjs.converters.save_keras_model(model, "notebooks/1_train_validation_test_procedure")
        print("\n\033[92mSaving done !\033[0m")
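
Besides reading the curves by eye, the per-epoch values stored in `history.history` can be inspected directly. The sketch below is one simple way to do so, assuming a dictionary shaped like `history.history`; the helper names are hypothetical and the numbers are made up for illustration, not results from this model.

```python
def best_epoch(val_loss):
    """Index of the epoch with the lowest validation loss."""
    return min(range(len(val_loss)), key=val_loss.__getitem__)


def generalization_gap(train_loss, val_loss):
    """Per-epoch difference between validation and training loss."""
    return [v - t for t, v in zip(train_loss, val_loss)]


# Made-up numbers shaped like `history.history`; not real results.
history = {
    "loss": [0.60, 0.45, 0.35, 0.28, 0.22],
    "val_loss": [0.62, 0.50, 0.46, 0.49, 0.55],
}

epoch = best_epoch(history["val_loss"])
gaps = generalization_gap(history["loss"], history["val_loss"])

print("lowest validation loss at epoch index", epoch)
print("gap still widening at the end:", gaps[-1] > gaps[epoch])
```

A training loss that keeps falling while the validation loss rises again after its minimum is the typical overfitting pattern described above.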


In summary, this code provides a full procedure for training a machine learning model, including data loading, model creation, training, validation, and saving. 

The procedure is specific to a binary classification task, but could be adapted for other types of tasks. 

The use of a validation dataset allows for the monitoring of the model's performance on unseen data during training, which helps detect overfitting.
This dataset is built as a subset of the `data/train` directory, as is the training dataset. The test dataset is accessed only during the evaluation phase.
This allows us to detect overfitting or poor performance, and corresponds to the principle of the train, validation, test procedure.
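
The evaluation phase itself is not shown in this notebook. As a rough illustration of what is measured there, the sketch below computes accuracy, precision, and recall from predicted and true class indices using the textbook definitions. It assumes that class index 1 is the positive class; the labels are made up, and the Keras metric objects used in `build` may aggregate one-hot outputs differently.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def evaluate(y_true, y_pred, positive=1):
    """Return accuracy, precision, and recall as a dictionary."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up test labels and predictions (1 = assumed positive class).
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]

scores = evaluate(y_true, y_pred)
print(scores)
```

Because the test images play no part in training or model selection, scores computed this way give an estimate of performance on unseen data.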