# Recursion Cellular Image Classification

---
### Universidade de Brasília

CIC0193 - Fundamentos de Sistemas Inteligentes

Prof.: Vinicius Borges

Student: Pedro Lucas Silva Haga Torres

Enrollment no.: 16/0141575

##### Assignment IV - Convolutional Neural Networks

---
This is an assignment for the *Fundamentos de Sistemas Inteligentes* (Fundamentals of Intelligent Systems) course at Universidade de Brasília (University of Brasília). All of the code and documentation is in English; the header above identifies me as a student taking the course.

## Imports

In [None]:
import glob

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_addons as tfa

from keras import backend as K
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.applications import EfficientNetB0
from tensorflow.keras.metrics import CategoricalAccuracy
from tensorflow.keras.preprocessing.image import ImageDataGenerator

## Data analysis

In [None]:
# Open train dataframe and print it
# train_df = pd.read_csv(r"../input/recursion-cellular-image-classification/train.csv")

# train_df

As we can see, the only useful information in the training dataframe is the label (`sirna`). We can't get to the images using this dataframe, so we'll have to use the information in the images' paths to get to their labels. What identifies a label in this dataframe is the experiment, plate and well, so we'll use this information (which, according to the documentation, is also in the image path) to label the images.

### Getting the data we need

Since the directory structure doesn't give us much beyond labeling the images as train and test, we'll start by getting the paths to all images using `glob` and saving them to a Pandas dataframe.

In [None]:
# # Set path to get training images
# path = r"../input/recursion-cellular-image-classification/train/*/*/*.png"

# # Save path to all training images in a Pandas dataframe using glob
# df = pd.DataFrame(glob.glob(path), columns=["image_path"])

# # Print dataframe
# df

### Getting images' info based on their path

According to the documentation, we can find the experiment, plate and well in the path or in the image's name, so we're going to use this in order to properly label the images.

In [None]:
# # Get EXPERIMENT from image's path - according to the documentation
# df['experiment'] = df['image_path'].str.split("/").str[4]

# # Get PLATE from image's path
# df['plate'] = df['image_path'].str.split("/").str[5].str.split("Plate").str[1]

# # Cast 'plate' values to int
# df['plate'] = df['plate'].astype(int)

# # Get WELL from image's path
# df['well'] = df['image_path'].str.split("/").str[6].str.split("_").str[0]

# # Print dataframe to check if process went well
# df
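
As a side note, the positional `split("/")` indexing above depends on how deep the dataset directory is. A minimal sketch of the same extraction, counted from the end of the path instead, could look like the one below. The helper name `parse_image_path` and the example path are hypothetical, and the layout `<experiment>/Plate<N>/<well>_..._.png` is assumed from the code above.

```python
from pathlib import PurePosixPath


def parse_image_path(image_path):
    """Return (experiment, plate, well) taken from the end of an image path.

    Assumes the layout .../<experiment>/Plate<N>/<well>_<rest>.png.
    """
    parts = PurePosixPath(image_path).parts
    experiment = parts[-3]                  # e.g. "HEPG2-01"
    plate = int(parts[-2][len("Plate"):])   # "Plate1" -> 1
    well = parts[-1].split("_")[0]          # "B03_s1_w1.png" -> "B03"
    return experiment, plate, well


# Hypothetical path, used only to illustrate the parsing
example = "../input/some-dataset/train/HEPG2-01/Plate1/B03_s1_w1.png"
experiment, plate, well = parse_image_path(example)
print(experiment, plate, well)
```

Because the indices are negative, the function should keep working if the dataset is moved to a directory with a different depth.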

### Labeling the images

This is probably not the most efficient way to do this, but we're going to get the labels by searching the training dataframe using the experiment, plate and well that we got from the images' paths. The labels are first saved in a list, which is then added as a new column to the dataframe with the images' paths.

In [None]:
# # Create an empty list to save the labels
# sirna_list = []

# # Iterate through the dataframe using 'itertuples' and search on the training dataframe
# # for each image's label
# for t in df.itertuples(index=False):
#     sirna_list.append(train_df.loc[(t[1] == train_df['experiment']) &
#                                 (t[2] == train_df['plate']) &
#                                 (t[3] == train_df['well'])]['sirna'].values)

# # Add new column to dataframe containing the labels acquired
# df['sirna'] = sirna_list

# # The labels acquired come in the form of a series, we're getting the labels themselves
# # or, if the image is not labeled, we get an empty list, so we're replacing it with NaN
# df['sirna'] = df['sirna'].apply(lambda x: np.nan if len(x) == 0 else x[0])

# # Print the dataframe with the labels
# df
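
The row-by-row lookup above can probably be replaced by a single left join on the three key columns. The sketch below uses two small made-up dataframes (the paths and `sirna` values are invented for illustration) and assumes that `plate` has the same dtype in both dataframes. Images without a match end up with `NaN`, just as in the loop.

```python
import pandas as pd

# Toy stand-in for train.csv (values are made up)
toy_train_df = pd.DataFrame({
    "experiment": ["HEPG2-01", "HEPG2-01"],
    "plate": [1, 1],
    "well": ["B03", "B04"],
    "sirna": [513, 840],
})

# Toy stand-in for the dataframe built from the image paths
toy_df = pd.DataFrame({
    "image_path": ["x/B03_s1_w1.png", "x/B04_s1_w1.png", "x/B05_s1_w1.png"],
    "experiment": ["HEPG2-01", "HEPG2-01", "HEPG2-01"],
    "plate": [1, 1, 1],
    "well": ["B03", "B04", "B05"],
})

# Left join: keep every image, attach the label where the keys match
keys = ["experiment", "plate", "well"]
labeled = toy_df.merge(toy_train_df[keys + ["sirna"]], on=keys, how="left")

print(labeled)
```

The third toy image has no entry in the toy training dataframe, so its `sirna` is `NaN` and it would be dropped by `dropna()` in the next step.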

### Removing images without a label

We can see that we have a few unlabeled images, so we're going to remove them using Pandas' `dropna()`.

In [None]:
# # Check for NaNs
# print(df.info(), end="\n\n")

# # Remove NaNs
# df.dropna(inplace=True)

# # Check final product
# df.info()

### Check number of classes and data distribution

In [None]:
# Plot labels count
# df['sirna'].value_counts().plot(kind='bar', figsize=(14, 7))

We can see that we have a lot of classes (1108, to be precise) but the data is mostly balanced, with around 400 instances for each class.

### Save dataframe containing the images' path and label

In [None]:
# Save dataframe for later use and to avoid the costly method of acquiring the labels
# df.to_csv(r"/kaggle/working/train_dataframe.csv", index=False)

## Classification task

In [None]:
# Open the dataframe with the images' path and labels
df = pd.read_csv(r"../input/rcic-edited-dataframe/train_dataframe.csv")

df

### Stratified split between train, validation and test
We're going to use transfer learning, so, for training the top layers, I used the more common split of 70-20-10 (training, validation and test, respectively, in percentage). After some consideration, since there are more than 400k images, I chose a split of 90-5-5 to train the whole model.

Since `random_state` is defined, we can repeat the experiment any number of times and get the same results. I also stratified the split according to the labels' distribution.

In [None]:
x_train, x_test, y_train, y_test = train_test_split(df, df['sirna'], test_size=0.1,
                                                    random_state=31415,
                                                    stratify=df['sirna'])

x_val, x_test, y_val, y_test = train_test_split(x_test, x_test['sirna'], test_size=0.5,
                                                random_state=31415,
                                                stratify=x_test['sirna'])

print("Train data's shape:      {}; {}".format(x_train.shape, y_train.shape))
print("Validation data's shape: {}; {}".format(x_val.shape, y_val.shape))
print("Test data's shape:       {}; {}".format(x_test.shape, y_test.shape))

### Dependencies
#### Metrics

Since Keras/TensorFlow doesn't have the necessary metrics to evaluate a multiclass task (such as precision, recall and specificity), we're going to use both **micro** and **macro** averaged F1-score and categorical accuracy (available in TF/Keras) as our metrics. For micro/macro F1, we're using the [TensorFlow Addons](https://www.tensorflow.org/addons) library.

In [None]:
# Multiclass F1-score MICRO Avg.
micro_f1 = tfa.metrics.F1Score(
    num_classes=1108,
    average='micro',
    name="Micro F1",
)

# Multiclass F1-score MACRO Avg.
macro_f1 = tfa.metrics.F1Score(
    num_classes=1108,
    average='macro',
    name="Macro F1",
)
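
To make the difference between the two averages concrete, here is a small plain-Python sketch. It is only an illustration on made-up labels, not the TensorFlow Addons implementation, and the helper name `f1_scores` is hypothetical. Micro averaging pools the true positives, false positives and false negatives of all classes before computing F1, while macro averaging computes F1 per class and then takes the unweighted mean.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gets a false positive
            fn[t] += 1  # true class gets a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Made-up example: the frequent class is always right, the rare ones never
y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 2, 1]
micro, macro = f1_scores(y_true, y_pred)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
```

In this toy case the micro average stays fairly high because most instances are correct, while the macro average drops because two of the three classes are never predicted correctly. With a balanced dataset like this one, the two values should stay close to each other.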

### Image data generators
Given the time available for the task, I didn't consider using data augmentation (DA) at first, because I didn't know much about the underlying details of the problem or which methods were appropriate. So, the top layers were trained without DA.

I added DA after looking at the work of the competition's winner, so I'm using almost the same method as them. This means that I'm using DA in the training of all layers (just to be clear, this step wasn't taken for the top layers' training).

The images are resized to 224x224 px because it's the input size for EfficientNet-B0. Batch size is 512 in order to try and accelerate training and also because there is enough memory for the task (1024 works, but raises warnings). Once again we're seeding the RNG, so experiments should be consistent between runs.

In [None]:
IMG_SIZE = 224
BATCH_SIZE = 512

train_dataGen = ImageDataGenerator(
    rescale=1./255, rotation_range=90, horizontal_flip=True, vertical_flip=True)

train_generator = train_dataGen.flow_from_dataframe(
    dataframe=x_train, x_col='image_path', class_mode='categorical', seed=31415,
    y_col='sirna', target_size=(IMG_SIZE, IMG_SIZE), batch_size=BATCH_SIZE)

val_generator = train_dataGen.flow_from_dataframe(
    dataframe=x_val, x_col='image_path', class_mode='categorical', seed=31415,
    y_col='sirna', target_size=(IMG_SIZE, IMG_SIZE), batch_size=BATCH_SIZE)

# shuffle=False keeps the predictions in the same order as test_generator.labels
test_generator = train_dataGen.flow_from_dataframe(
    dataframe=x_test, x_col='image_path', class_mode='categorical', seed=31415,
    y_col='sirna', target_size=(IMG_SIZE, IMG_SIZE), batch_size=BATCH_SIZE,
    shuffle=False)

### Network instantiation

The chosen architecture was EfficientNet-B0. Since I'm submitting this work as an assignment for a university course, it was chosen in order to minimize training time (as I only had a week to dedicate to this assignment). EfficientNet-B0 is the smallest network of its family, so it should not be expected to have great classification performance.

In summary, the EfficientNet architecture was proposed by Tan and Le, who are (or were) both researchers at Google. The idea is to treat the scaling of a CNN's dimensions (number of layers, number of filters and input size) as an optimization problem. This can be applied to more traditional architectures such as ResNet, Inception, GoogLeNet, etc. in order to optimize them, or used to create a brand new architecture, EfficientNet, and scale it towards better performance with a reduced number of trainable parameters. For more information, check the paper hosted on arXiv below:

https://arxiv.org/abs/1905.11946

tl;dr: EfficientNet-B0's ImageNet accuracy is close to that of DenseNet-201 and ResNet-152, while having far fewer parameters (about 5.3M, versus roughly 20M for DenseNet-201 and 60M for ResNet-152).
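
As a rough illustration of the scaling idea, the paper's compound scaling rule multiplies depth, width and input resolution by `alpha**phi`, `beta**phi` and `gamma**phi`, with the constants found by grid search on the B0 baseline (alpha = 1.2, beta = 1.1, gamma = 1.15) so that `alpha * beta**2 * gamma**2` is close to 2. The sketch below only evaluates that rule; the function name `compound_scaling` is hypothetical, and the released B1 to B7 models use their own rounded coefficients, so the numbers are indicative only.

```python
# Constants reported in the EfficientNet paper for the B0 baseline
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scaling(phi):
    """Return the (depth, width, resolution) multipliers for a given phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi


# FLOPS grow roughly by this factor for each unit increase of phi
flops_factor = ALPHA * BETA ** 2 * GAMMA ** 2
print(f"FLOPS factor per step: {flops_factor:.3f}")

for phi in range(4):
    depth, width, resolution = compound_scaling(phi)
    print(f"phi={phi}: depth x{depth:.2f}, width x{width:.2f}, "
          f"resolution x{resolution:.2f}")
```

With `phi = 0` all multipliers are 1, which corresponds to the B0 baseline used in this notebook.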

In [None]:
# # Load EfficientNet pre-trained w/ ImageNet
# base_model = EfficientNetB0(include_top=False, weights="imagenet")

# # Rebuild top
# avg = layers.GlobalAveragePooling2D(name="avg_pool")(base_model.output)
# norm = layers.BatchNormalization()(avg)
# dropout = layers.Dropout(0.3, name="top_dropout")(norm)
# output = layers.Dense(1108, activation="softmax", name="pred")(dropout)

# model = tf.keras.Model(base_model.input, output, name="EfficientNet-B0")

# # Freeze the pretrained weights
# for layer in base_model.layers:
#     layer.trainable = False

# # Optimizer setup
# optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)

# # Metrics
# metrics = [CategoricalAccuracy(name='Categorical Accuracy'),
#            micro_f1, macro_f1]

# # Compile model
# model.compile(optimizer=optimizer, loss="categorical_crossentropy",
#               metrics=metrics)

### Top layers' training

In [None]:
# # Checkpoint to save network's best weights
# checkpoint = tf.keras.callbacks.ModelCheckpoint(
#     "./effNet-B0_{epoch:02d}",
#     monitor='val_loss', verbose=1, save_best_only=True,
#     save_weights_only=True, mode='min')

In [None]:
# # Top layers training

# # Epochs to train the top layers: min:8; max:80
# history = model.fit(train_generator, validation_data=val_generator, epochs=3,
#                     callbacks=[checkpoint])

In [None]:
# # Save model's current weights
# model.save_weights(r"./effNetB0_topTrained_weights")

# # Save whole model
# model.save(r"./effNetB0_topTrained_model")

### All layers' training

In [None]:
# Unfreeze all layers' pretrained weights
# for layer in model.layers:
#     layer.trainable = True

In [None]:
# Load whole model previously trained
# model = keras.models.load_model(r"../input/rcictfmodel-final/effNet-B0_13")

# Change optimizer's parameters (if needed)
# K.set_value(model.optimizer.learning_rate, 1e-3)
# K.set_value(model.optimizer.beta_1, 0.9)

# Verify changes to optimizer's parameters
# print(model.optimizer.learning_rate)
# print(model.optimizer.beta_1)

In [None]:
# Checkpoint to save model after each epoch of training
# checkpoint = tf.keras.callbacks.ModelCheckpoint(
#     "./effNet-B0_{epoch:02d}",
#     monitor='val_loss', verbose=1, save_best_only=False,
#     save_weights_only=False, mode='min')

In [None]:
# Train the whole model
# history = model.fit(train_generator, validation_data=val_generator, epochs=13,
#                     callbacks=[checkpoint], initial_epoch=10)

In [None]:
# Open dataframes containing previous training results
history_df1 = pd.read_csv(r"../input/rcictfmodel-te02/effNetB0_history.csv")
history_df2 = pd.read_csv(r"../input/rcictf-modelte04/effNetB0_history.csv")
history_df3 = pd.read_csv(r"../input/rcictfmodelte07/effNetB0_history.csv")
history_df4 = pd.read_csv(r"../input/rcictfmodelte10/effNetB0_history.csv")
history_df5 = pd.read_csv(r"../input/rcictfmodel-final/effNetB0_history.csv")

# Concatenate the newer epochs' training values after the 1st dataframe
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
history_df1 = pd.concat(
    [history_df1, history_df2, history_df3, history_df4, history_df5],
    ignore_index=True)

# Save metrics' history as a CSV file
history_df1.to_csv("./effNetB0_history.csv", index=False)

In [None]:
history_df1

In [None]:
# summarize history for accuracy
plt.plot(history_df1['Categorical Accuracy'])
plt.plot(history_df1['val_Categorical Accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper right')
plt.show()

# summarize history for loss
plt.plot(history_df1['loss'])
plt.plot(history_df1['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper right')
plt.show()

# summarize history for MICRO F1-score
plt.plot(history_df1['Micro F1'])
plt.plot(history_df1['val_Micro F1'])
plt.title('Model F1-Score (micro avg.)')
plt.ylabel('F1-Score (micro avg.)')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper right')
plt.show()

# summarize history for MACRO F1-score
plt.plot(history_df1['Macro F1'])
plt.plot(history_df1['val_Macro F1'])
plt.title('Model F1-score (macro avg.)')
plt.ylabel('F1-score (macro avg.)')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper right')
plt.show()

### Testing the model
I don't expect much from it; I'm just hoping it's on par with the validation metrics.

In [None]:
# Load whole model previously trained
model = keras.models.load_model(r"../input/rcictfmodel-final/effNet-B0_13")

In [None]:
# Predict labels (the row order only matches test_generator.labels
# when the generator does not shuffle)
test_predictions = model.predict(test_generator, verbose=1)

print(classification_report(test_generator.labels,
                            test_predictions.argmax(1), zero_division=0))

## Conclusion

Given both validation and test results, I'm not going to bother submitting mine, as it would be just a waste of computer cycles. I'll just leave here what I would have done if I had the time to start over:

1. I would have done a better job with image preprocessing (I'd resize instead of just crop with Keras) and data augmentation (I'd increase the number of training instances 3 or 4 fold).
2. I'd have used EfficientNet-B2, at the very least, but I would try using B3 or higher with a smaller batch size (128, at most).
3. Of course, I would train the model for more epochs and using SGD with Nesterov, to compare with Adam.

With these three steps, I think I could have done a better job without needing to go overboard with ensembles, for instance (remember, this is an undergrad assignment). I'm not satisfied with these results, but this is what I could do with the time I had.