CelebA is a dataset of facial images with a set of binary attributes for each image, some of which relate to hair color. These attributes will be combined into a single multi-class label, which will be learned using an InceptionV3 model.

Notes:
- The notebook uses the `celeba_utils` utility script for convenience.
- For simplicity of the notebook, all classes are tightly bound to the `Config` and `Paths` data classes. It is generally advised to make the use of these configurations more explicit (by passing the relevant configuration as arguments).

In [None]:
import os
import numpy as np
import random
import tensorflow.compat.v1 as tf
import keras.backend as K

# set a seed for reproducible results
os.environ['PYTHONHASHSEED'] = '0'
np.random.seed(1234)
random.seed(1234)
tf.set_random_seed(1234)

from typing import List, Tuple, Callable
import pandas as pd
import matplotlib.pyplot as plt

from keras.applications.inception_v3 import InceptionV3
from keras.callbacks import ModelCheckpoint
from keras.layers import Dropout, Dense, GlobalAveragePooling2D
from keras.models import Model
from keras.optimizers import SGD
from keras.utils import Sequence

from celeba_utils import Config  # struct with configurations 
from celeba_utils import AttrColumns, Paths
from celeba_utils import PartitionType, DataPartition, load_partition_table_with_attributes
from celeba_utils import load_image_set


# Configure current run

Config.TRAINING_SAMPLES = 1984
Config.VALIDATION_SAMPLES = 1984
Config.BATCH_SIZE = 32
Config.NUM_EPOCHS = 15


We will begin by sampling the large dataset using CelebA's official data partitioning (training/validation/test).

`_create_partition` is a generic method that organizes the image ids into a `DataPartition` class (defined in *celeba_utils.py*), based on a pandas DataFrame (CelebA's partitioning table) and a method that implements the sampling strategy: `(df: pd.DataFrame, partition: PartitionType, num_samples: int) -> pd.DataFrame`.

In [None]:
def _create_partition(df: pd.DataFrame, sample_strategy: Callable) -> DataPartition:
    _training = sample_strategy(
        df, PartitionType.TRAINING, Config.TRAINING_SAMPLES)

    _validation = sample_strategy(
        df, PartitionType.VALIDATION, Config.VALIDATION_SAMPLES)

    _test = sample_strategy(
        df, PartitionType.TEST, Config.TEST_SAMPLES)

    return DataPartition(
        training=_training[AttrColumns.ID.value].values,
        validation=_validation[AttrColumns.ID.value].values,
        test=_test[AttrColumns.ID.value].values
    )


def _sample_random_partition(df: pd.DataFrame, partition: PartitionType, num_samples: int) -> pd.DataFrame:
    filtered = df[df['partition'] == partition.value]
    sampled = filtered.sample(num_samples)
    return sampled


def create_random_partition() -> DataPartition:
    df = pd.read_csv(Paths.DATA_PARTITION)
    return _create_partition(df, _sample_random_partition)


def trace_partition(partition: DataPartition):
    print(f"Training includes {len(partition.training)} images: {partition.training[:10]}...")
    print(f"Validation includes {len(partition.validation)} images: {partition.validation[:10]}...")
    print(f"Test includes {len(partition.test)} images: {partition.test[:10]}...")
    
    
random_partition = create_random_partition()
trace_partition(random_partition)

CelebA offers a set of binary attributes, some of which relate to hair color: 'Black_Hair', 'Blond_Hair', 'Brown_Hair', 'Gray_Hair' and 'Bald' (column names are defined with the `AttrColumns` enum). We want to combine them into a single multi-class hair color label.
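
As a small illustration of the encoding involved, the sketch below builds a toy table in the `{-1, 1}` convention used by the attribute file, maps it to `{0, 1}`, and counts the positive attributes per row. The image ids and values are made up for the example and are not taken from CelebA; only the column names follow the attributes listed above.

```python
import numpy as np
import pandas as pd

# Toy rows in CelebA's {-1, 1} convention (ids and values are illustrative only)
toy = pd.DataFrame(
    {
        "Bald":       [-1, -1, -1],
        "Black_Hair": [1, -1, 1],
        "Blond_Hair": [-1, -1, -1],
        "Brown_Hair": [-1, -1, 1],
        "Gray_Hair":  [-1, -1, -1],
    },
    index=["a.jpg", "b.jpg", "c.jpg"],
)

binary = toy.clip(lower=0)                 # -1 becomes 0, 1 stays 1
per_row = binary.sum(axis=1).to_numpy()    # positive attributes per image
histogram = np.bincount(per_row, minlength=3)

print(per_row.tolist())    # [1, 0, 2]
print(histogram.tolist())  # [1, 1, 1]
```

The histogram reads as: one image without any hair-color attribute, one with exactly one, and one with two.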

In [None]:
print(f"Reading attributes table from {Paths.ATTRIBUTES}")
df_attr = pd.read_csv(Paths.ATTRIBUTES)
df_attr.set_index(AttrColumns.ID.value, inplace=True)

# sampling relevant columns
columns = [
    AttrColumns.BALD,
    AttrColumns.BLACK_HAIR,
    AttrColumns.BLOND_HAIR,
    AttrColumns.BROWN_HAIR,
    AttrColumns.GRAY_HAIR,
]        
# attribute values in CelebA are in {-1, 1}; map -1 to 0
df_attr = df_attr[[c.value for c in columns]].clip(lower=0)

print("First 5 labels (one hot encoding)")
print(df_attr.head(5))

print("Distribution of attributes")
print(df_attr.sum())


def trace_attributes_per_sample_count(df, title: str):
    # minlength guarantees that indices 0, 1 and 2 exist even when a count is missing
    attributes_per_sample = np.bincount(df.sum(axis=1), minlength=3)
    print(title + '=>')
    print(f"\t{attributes_per_sample[0]} samples don't have any hair-color attributes")
    print(f"\t{attributes_per_sample[1]} samples have exactly one hair-color attribute")
    print(f"\t{sum(attributes_per_sample[2:])} samples have more than one hair-color attribute")

    
trace_attributes_per_sample_count(df_attr, "Total attribute count per sample")

That's interesting, right? Shouldn't each person have only a single hair color? Well, real-world data is a messy business. This is a critical issue to consider because such rows don't form a valid target for one-hot encoding (labels should be a normalized distribution over the categories).
1. In the case of "no-color", the encoding can be changed from `[0,0,0,0,0]` to `[.2,.2,.2,.2,.2]`. This is not recommended though, as in reality the lack of any color attribute doesn't imply a uniform probability for all colors (note also that about 1/3 of the data doesn't have hair color attributes, so that's a lot of low quality data).
1. In the case of multiple color attributes, treating them as a distribution makes more sense, so:
   1. We will explore both sampling methods (with or without these samples).
   1. During label preparation we will normalize the rows as a distribution (sum should be 1).
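
The row normalization described above can be sketched with NumPy as follows. This is a minimal illustration, not the notebook's label generator: `normalize_rows` is a name chosen here, and the uniform fallback for all-zero rows is included only so that the function is defined for every input row.

```python
import numpy as np


def normalize_rows(labels: np.ndarray) -> np.ndarray:
    """Turn rows of 0/1 attributes into distributions that sum to 1."""
    labels = labels.astype(float)
    totals = labels.sum(axis=1, keepdims=True)
    # rows without any attribute keep this uniform fallback
    uniform = np.full_like(labels, 1.0 / labels.shape[1])
    # divide only where the row total is non-zero, so no 0/0 is ever evaluated
    return np.divide(labels, totals, out=uniform, where=totals != 0)


rows = np.array([
    [0, 1, 0, 0, 0],   # exactly one attribute
    [0, 1, 0, 1, 0],   # two attributes -> 0.5 each
    [0, 0, 0, 0, 0],   # no attribute -> uniform 0.2
])
normalized = normalize_rows(rows)
print(normalized)
```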

In [None]:
def _sample_non_zero_partition(df: pd.DataFrame, partition: PartitionType, num_samples: int) -> pd.DataFrame:
    """
    At least one column (attribute) must be positive
    """
    filtered = df[df['partition'] == partition.value]
    values = filtered.values[:, 2:]  # ignore first 2 columns: image_id and partition
    values[values <= 0] = 0  # attribute values in CelebA are in {-1, 1}
    filtered = filtered[values.sum(axis=1) > 0]
    sampled = filtered.sample(num_samples)
    return sampled


def _sample_mutualy_exclusive_partition(df: pd.DataFrame, partition: PartitionType, num_samples: int) -> pd.DataFrame:
    """
    Columns (attributes) are mutually exclusive (only one is allowed to be positive)
    """
    filtered = df[df['partition'] == partition.value]
    values = filtered.values[:, 2:]
    values[values <= 0] = 0
    filtered = filtered[values.sum(axis=1) == 1]
    sampled = filtered.sample(num_samples)
    return sampled


def create_hair_color_partition(sample_strategy: Callable) -> DataPartition:
    attrs = [
        AttrColumns.BALD,
        AttrColumns.BLACK_HAIR,
        AttrColumns.BLOND_HAIR,
        AttrColumns.BROWN_HAIR,
        AttrColumns.GRAY_HAIR,
    ]
    df = load_partition_table_with_attributes([c.value for c in attrs])
    return _create_partition(df, sample_strategy)


non_zero_partition = create_hair_color_partition(_sample_non_zero_partition)
trace_attributes_per_sample_count(df_attr.loc[non_zero_partition.training], "non-zero attribute count (training)")
trace_attributes_per_sample_count(df_attr.loc[non_zero_partition.validation], "non-zero attribute count (validation)")

mutually_exclusive_partition = create_hair_color_partition(_sample_mutualy_exclusive_partition)
trace_attributes_per_sample_count(df_attr.loc[mutually_exclusive_partition.training], "mutually-exclusive attribute count (training)")
trace_attributes_per_sample_count(df_attr.loc[mutually_exclusive_partition.validation], "mutually-exclusive attribute count (validation)")


We are now ready to create our data generators (based on the Keras `Sequence` class).

In [None]:
def _generate_multi_class_labels(columns: List[AttrColumns], df_attr: pd.DataFrame) -> np.ndarray:
    attr = df_attr[[c.value for c in columns]]
    labels = np.array(attr.values)
    labels[labels <= 0] = 0
    # we want to keep this method independent of our sampling method
    # therefore, regardless of sampling strategy, we want our label generation method to work with
    # images that have no attributes or more than a single attribute
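    num_of_attributes = labels.sum(axis=1, keepdims=True)
    # rows without any attribute fall back to a uniform distribution;
    # dividing only where the count is non-zero avoids a division-by-zero warning
    uniform = np.full(labels.shape, 1 / len(columns))
    labels = np.divide(labels, num_of_attributes, out=uniform, where=num_of_attributes != 0)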
    return labels
        

class MultiClassSequence(Sequence):
    def __init__(self, columns: List[AttrColumns], image_ids: List[str]):
        df_attr = pd.read_csv(Paths.ATTRIBUTES)
        df_attr.set_index(AttrColumns.ID.value, inplace=True)
               
        self.labels = _generate_multi_class_labels(
            columns,
            df_attr.loc[image_ids]
        )
        self.image_ids = image_ids
        self.batch_size = Config.BATCH_SIZE

    def __len__(self):
        return len(self.image_ids) // self.batch_size

    def __getitem__(self, idx) -> Tuple[np.ndarray, np.ndarray]:
        image_ids = self.image_ids[idx * self.batch_size:(idx + 1) * self.batch_size]
        images = load_image_set(image_ids)
        labels = self.labels[idx * self.batch_size:(idx + 1) * self.batch_size]

        return images, labels

    
class HairColorSequence(MultiClassSequence):
    def __init__(self, image_ids: List[str]):
        hair_color_columns = [
            AttrColumns.BALD,
            AttrColumns.BLACK_HAIR,
            AttrColumns.BLOND_HAIR,
            AttrColumns.BROWN_HAIR,
            AttrColumns.GRAY_HAIR,
        ]
        super().__init__(hair_color_columns, image_ids)       


We use a pretrained InceptionV3 model and replace its top layers with a new softmax classification head (one output per hair-color class).

In [None]:
def build_model(num_classes) -> Model:
    inc_model = InceptionV3(
        weights=str(Paths.MODEL_WEIGHTS),
        include_top=False,
        input_shape=(Config.IMG_HEIGHT, Config.IMG_WIDTH, 3))
    
    x = GlobalAveragePooling2D()(inc_model.output)
    x = Dense(1024, activation="relu")(x)
    x = Dropout(0.5)(x)
    x = Dense(512, activation="relu")(x)
    x = Dense(num_classes, activation="softmax")(x)    

    model = Model(inputs=inc_model.input, outputs=x)

    # freeze the first 52 layers; only the remaining layers are fine-tuned
    for layer in model.layers[:52]:
        layer.trainable = False

    model.compile(
        optimizer=SGD(lr=0.0001, momentum=0.9), 
        loss='categorical_crossentropy', 
        metrics=['accuracy'])

    return model


We will train the model twice, using both partition methods ("non zero" and "mutually exclusive"). However, please note that validation will always be performed using the `mutually_exclusive_partition`.
1. For a fairer comparison, we should test (validate) all experiments using the same data. In a sense, we need to assume a single source of truth against which to measure our results.
1. I usually favour testing with the more challenging test cases. We need to consider though that the multi-attribute cases are less reliable, so if we use them to test ourselves, we may penalize our system at times when it might be correct (and the labels wrong), and vice versa.
1. Generally speaking, each experiment can use its own validation set (given it is there to assess and control the training process). What is important is that when you **compare**, you do so consistently.
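
One concrete reason to be careful with multi-attribute rows in validation: as far as I know, the Keras `accuracy` metric for a categorical cross-entropy model compares the argmax of the label row with the argmax of the prediction. The NumPy sketch below approximates that behaviour (it is not the Keras implementation, and `categorical_accuracy` is just a name chosen here). With a tied soft label, only the first maximal class counts as correct.

```python
import numpy as np


def categorical_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Fraction of rows where the argmax of the label matches the argmax of the prediction."""
    return float(np.mean(np.argmax(y_true, axis=-1) == np.argmax(y_pred, axis=-1)))


# Columns: bald, black, blond, brown, gray (illustrative values)
y_true = np.array([[0.0, 0.5, 0.0, 0.5, 0.0]])      # black and brown both marked
pred_black = np.array([[0.1, 0.6, 0.1, 0.1, 0.1]])  # model says "black"
pred_brown = np.array([[0.1, 0.1, 0.1, 0.6, 0.1]])  # model says "brown"

print(categorical_accuracy(y_true, pred_black))  # 1.0
print(categorical_accuracy(y_true, pred_brown))  # 0.0
```

Both predictions agree with one of the marked attributes, yet only one is scored as a hit, which is the kind of unreliable penalty mentioned in the second point above.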

In [None]:
# Run non-zero configuration:
#   data sampling includes all rows where there is at least one positive attribute
#   (include rows with multiple positive color attributes)

checkpoint = ModelCheckpoint(
    filepath='weights.best.non-zero.hdf5',
    verbose=1,
    save_best_only=True)

hist_non_zero = build_model(num_classes=5).fit_generator(
    HairColorSequence(non_zero_partition.training),
    validation_data=HairColorSequence(mutually_exclusive_partition.validation),
    epochs=Config.NUM_EPOCHS,
    callbacks=[checkpoint],
    verbose=1
)


In [None]:
# Run mutually-exclusive configuration:
#   data sampling includes all rows where there is ONLY one positive attribute
#   (ignore rows with multiple positive color attributes)

checkpoint = ModelCheckpoint(
    filepath='weights.best.mutually_exclusive.hdf5',
    verbose=1,
    save_best_only=True)

hist_mutually_exclusive = build_model(num_classes=5).fit_generator(
    HairColorSequence(mutually_exclusive_partition.training),
    validation_data=HairColorSequence(mutually_exclusive_partition.validation),
    epochs=Config.NUM_EPOCHS,
    callbacks=[checkpoint],
    verbose=2
)



In [None]:
plt.plot(range(Config.NUM_EPOCHS), hist_non_zero.history['val_accuracy'])
plt.plot(range(Config.NUM_EPOCHS), hist_mutually_exclusive.history['val_accuracy'])
plt.xlabel('Epochs')
plt.ylabel('Val Accuracy')
plt.legend(['non_zero', 'mutually_exclusive'])

In our humble experiment we can see similar results for both the stricter "mutually-exclusive" and the "non-zero" configurations. Please take into account the following:
1. We have used only a portion of all the available data (although we did use a rather large validation set).
1. To reach any real conclusions, an experiment should be repeated ([cross validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics))), preferably exploring several hyperparameters.
1. In the "non-zero" sampling strategy, the (supposedly) "lower quality" samples (where a face has multiple color attributes) come at the expense of the "better" samples. The current experiment is relevant for the case where we are sampling a subset of the data and the question under investigation is whether to allow substituting the "high quality" samples with the "low quality" samples.
1. Alternatively, if we plan to use the entire dataset, a different question needs to be investigated: whether to have **additional** "low quality" samples.
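
To make the point about repetition a bit more concrete, here is a minimal sketch of how validation curves from several repeated runs could be summarized. It assumes each run yields a list such as `hist.history['val_accuracy']`; `summarize_runs` is a hypothetical helper, and the numbers are placeholders for illustration only, not measured results.

```python
import numpy as np


def summarize_runs(val_accuracy_per_run):
    """Per-epoch mean and sample standard deviation over repeated runs."""
    runs = np.asarray(val_accuracy_per_run, dtype=float)  # shape: (runs, epochs)
    return runs.mean(axis=0), runs.std(axis=0, ddof=1)


# Placeholder values (three runs, two epochs); NOT results from this notebook
placeholder_runs = [
    [0.50, 0.60],
    [0.54, 0.62],
    [0.52, 0.64],
]
mean_curve, std_curve = summarize_runs(placeholder_runs)
print(mean_curve, std_curve)
```

Comparing two configurations then means checking whether their mean curves differ by more than the spread between runs, rather than comparing two single curves.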

How about we try out the above hypothesis? The new sampling strategy will sample `num_samples` rows with exactly one positive attribute, and an additional `num_samples/extention_factor` rows with more than one positive attribute. This way we can check whether adding such samples is beneficial.
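
For the values configured at the top of this notebook, the sizes work out as in the short calculation below. It is plain arithmetic with the numbers hard-coded, assuming the default factor of 5 used by the sampling function that follows and the floor division in `MultiClassSequence.__len__`.

```python
training_samples = 1984   # Config.TRAINING_SAMPLES
extention_factor = 5      # default of the sampling function (spelling follows the notebook)
batch_size = 32           # Config.BATCH_SIZE

extra = training_samples // extention_factor   # multi-attribute rows added
total = training_samples + extra               # size of the extended training set
batches = total // batch_size                  # mirrors MultiClassSequence.__len__
dropped = total - batches * batch_size         # samples in the last, partial batch

print(extra, total, batches, dropped)  # 396 2380 74 12
```

So the extended training set is about 20% larger, and the 12 samples that fall into the final partial batch are not visited, since the sequence only serves full batches.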

In [None]:
def _sample_mutually_exclusive_partition_with_extra_multi_attributes(df: pd.DataFrame, partition: PartitionType, num_samples: int, extention_factor: int = 5) -> pd.DataFrame:
    filtered = df[df['partition'] == partition.value]
    values = filtered.values[:, 2:]  # ignore first 2 columns: image_id and partition
    values[values <= 0] = 0  # attribute values in CelebA are in {-1, 1}

    # rows with exactly one positive attribute
    mutually_exclusive = filtered[values.sum(axis=1) == 1]
    mutually_exclusive = mutually_exclusive.sample(num_samples)

    # additional rows with more than one positive attribute
    multiple_attributes = filtered[values.sum(axis=1) > 1]
    multiple_attributes = multiple_attributes.sample(num_samples // extention_factor)

    sampled = pd.concat([mutually_exclusive, multiple_attributes])
    sampled = sampled.sample(frac=1.)  # reshuffle

    return sampled


partition_with_extra = create_hair_color_partition(_sample_mutually_exclusive_partition_with_extra_multi_attributes)
trace_attributes_per_sample_count(df_attr.loc[partition_with_extra.training], "extended partition attribute count (training)")
trace_attributes_per_sample_count(df_attr.loc[partition_with_extra.validation], "extended partition attribute count (validation)")


In [None]:
checkpoint = ModelCheckpoint(
    filepath='weights.best.extended.hdf5',
    verbose=1,
    save_best_only=True)

hist_extended = build_model(num_classes=5).fit_generator(
    HairColorSequence(partition_with_extra.training),
    validation_data=HairColorSequence(mutually_exclusive_partition.validation),
    epochs=Config.NUM_EPOCHS,
    callbacks=[checkpoint],
    verbose=2
)


In [None]:
plt.plot(range(Config.NUM_EPOCHS), hist_non_zero.history['val_accuracy'])
plt.plot(range(Config.NUM_EPOCHS), hist_mutually_exclusive.history['val_accuracy'])
plt.plot(range(Config.NUM_EPOCHS), hist_extended.history['val_accuracy'])
plt.xlabel('Epochs')
plt.ylabel('Val Accuracy')
plt.legend(['non_zero', 'mutually_exclusive', 'extended'])

To sum up:
- An example of how to perform multi-class classification with the CelebA dataset was presented.
- We also explored a few methods to sample our data and saw how to compare them.
- As already mentioned, reaching real conclusions requires more robust experimentation.
- Either way, **make sure you give thorough consideration to what data you include in the training set**.
   1. Always check what the common practice is.
   1. Make sure the common practice is actually relevant for your situation.
   
I can share my personal experience with working on the [LIDC lung nodule dataset][1]: 
- A common practice was established for nodule malignancy classification: not including nodules with unknown malignancy.
- This practice was largely adopted by content-based image retrieval studies.
- As part of my [study][2], we saw that the addition of unknown-malignancy nodules was actually beneficial to retrieval tasks and consistently improved results.

References:
1. [LIDC][1]
1. [Loyman, Mark, and Hayit Greenspan. "Lung nodule retrieval using semantic similarity estimates." Medical Imaging 2019: Computer-Aided Diagnosis. SPIE, 2019.][2]

[1]: https://wiki.cancerimagingarchive.net/display/Public/LIDC-IDRI
[2]: https://www.spiedigitallibrary.org/conference-proceedings-of-spie/10950/109503P/Lung-nodule-retrieval-using-semantic-similarity-estimates/10.1117/12.2512115.short
