# Hyperparameter Tuning for Deepfake Detection CNN Model Development

This notebook documents the hyperparameter tuning process for a CNN model used in deepfake detection. Tuning is done in two phases: base hyperparameter tuning, followed by dropout rate tuning. The base hyperparameters are the number of convolutional layers (with their filter counts and kernel sizes), the number of dense layers (with their unit counts), and the learning rate of the Adam optimizer.

In [1]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import tensorflow as tf
import keras_tuner as kt
import pandas as pd
import random
import os

## Define Dataset Directory Path

In [2]:
CDF_TRAIN_DIR = "/kaggle/input/deepfake-detection-datasets/Celeb-DF-v2/Train"
CDF_VAL_DIR = "/kaggle/input/deepfake-detection-datasets/Celeb-DF-v2/Val"

DF_TRAIN_DIR = "/kaggle/input/deepfake-detection-datasets/DeeperForensics-1.0/Train"
DF_VAL_DIR = "/kaggle/input/deepfake-detection-datasets/DeeperForensics-1.0/Val"

DFDC_TRAIN_DIR = "/kaggle/input/deepfake-detection-datasets/DFDC/Train"
DFDC_VAL_DIR = "/kaggle/input/deepfake-detection-datasets/DFDC/Val"

In [3]:
cdf_train_deepfake_dir = os.path.join(CDF_TRAIN_DIR, "Deepfake")
cdf_train_real_vid_dir = os.path.join(CDF_TRAIN_DIR, "Original")
cdf_val_deepfake_dir = os.path.join(CDF_VAL_DIR, "Deepfake")
cdf_val_real_vid_dir = os.path.join(CDF_VAL_DIR, "Original")

df_train_deepfake_dir = os.path.join(DF_TRAIN_DIR, "Deepfake")
df_train_real_vid_dir = os.path.join(DF_TRAIN_DIR, "Original")
df_val_deepfake_dir = os.path.join(DF_VAL_DIR, "Deepfake")
df_val_real_vid_dir = os.path.join(DF_VAL_DIR, "Original")

dfdc_train_deepfake_dir = os.path.join(DFDC_TRAIN_DIR, "Deepfake")
dfdc_train_real_vid_dir = os.path.join(DFDC_TRAIN_DIR, "Original")
dfdc_val_deepfake_dir = os.path.join(DFDC_VAL_DIR, "Deepfake")
dfdc_val_real_vid_dir = os.path.join(DFDC_VAL_DIR, "Original")

In [4]:
print("Dataset loaded:")

print("\nCeleb-DF-v2 (Train Split)")
print(f" - {len(os.listdir(cdf_train_deepfake_dir))} deepfake frames")
print(f" - {len(os.listdir(cdf_train_real_vid_dir))} real vid frames")
print("Celeb-DF-v2 (Val Split)")
print(f" - {len(os.listdir(cdf_val_deepfake_dir))} deepfake frames")
print(f" - {len(os.listdir(cdf_val_real_vid_dir))} real vid frames")

print("\nDeeperForensics-1.0 (Train Split)")
print(f" - {len(os.listdir(df_train_deepfake_dir))} deepfake frames")
print(f" - {len(os.listdir(df_train_real_vid_dir))} real vid frames")
print("DeeperForensics-1.0 (Val Split)")
print(f" - {len(os.listdir(df_val_deepfake_dir))} deepfake frames")
print(f" - {len(os.listdir(df_val_real_vid_dir))} real vid frames")

print("\nDeepfake Detection Challenge (Train Split)")
print(f" - {len(os.listdir(dfdc_train_deepfake_dir))} deepfake frames")
print(f" - {len(os.listdir(dfdc_train_real_vid_dir))} real vid frames")
print("Deepfake Detection Challenge (Val Split)")
print(f" - {len(os.listdir(dfdc_val_deepfake_dir))} deepfake frames")
print(f" - {len(os.listdir(dfdc_val_real_vid_dir))} real vid frames")

Dataset loaded:

Celeb-DF-v2 (Train Split)
 - 7000 deepfake frames
 - 7000 real vid frames
Celeb-DF-v2 (Val Split)
 - 1000 deepfake frames
 - 1000 real vid frames

DeeperForensics-1.0 (Train Split)
 - 7000 deepfake frames
 - 7000 real vid frames
DeeperForensics-1.0 (Val Split)
 - 1000 deepfake frames
 - 1000 real vid frames

Deepfake Detection Challenge (Train Split)
 - 7000 deepfake frames
 - 7000 real vid frames
Deepfake Detection Challenge (Val Split)
 - 1000 deepfake frames
 - 1000 real vid frames


## Create Combined Dataset

For hyperparameter tuning, the three datasets are combined into a single set for each of the train and validation splits. To reduce the computational cost of tuning, the train split is subsampled to 4,000 frames per dataset (2,000 per label). The validation split is used in full.

### Get Full Filepath for Each Frame

In [5]:
def get_filepath_from_dir(dir_path):
    return [
        os.path.join(dir_path, filename) for filename in sorted(os.listdir(dir_path))
    ]

#### Train Split

In [6]:
cdf_train_deepfake_filepaths = get_filepath_from_dir(cdf_train_deepfake_dir)
cdf_train_real_vid_filepaths = get_filepath_from_dir(cdf_train_real_vid_dir)

df_train_deepfake_filepaths = get_filepath_from_dir(df_train_deepfake_dir)
df_train_real_vid_filepaths = get_filepath_from_dir(df_train_real_vid_dir)

dfdc_train_deepfake_filepaths = get_filepath_from_dir(dfdc_train_deepfake_dir)
dfdc_train_real_vid_filepaths = get_filepath_from_dir(dfdc_train_real_vid_dir)

In [7]:
SAMPLE_SIZE_PER_DATASET = 4000
NUM_LABELS = 2
SAMPLE_SIZE_PER_LABEL = SAMPLE_SIZE_PER_DATASET // NUM_LABELS

In [8]:
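# Subsample each train list: keep the first SAMPLE_SIZE_PER_LABEL paths
# (the lists are sorted by filename, so this selection is deterministic)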

cdf_train_deepfake_filepaths = cdf_train_deepfake_filepaths[:SAMPLE_SIZE_PER_LABEL]
cdf_train_real_vid_filepaths = cdf_train_real_vid_filepaths[:SAMPLE_SIZE_PER_LABEL]

df_train_deepfake_filepaths = df_train_deepfake_filepaths[:SAMPLE_SIZE_PER_LABEL]
df_train_real_vid_filepaths = df_train_real_vid_filepaths[:SAMPLE_SIZE_PER_LABEL]

dfdc_train_deepfake_filepaths = dfdc_train_deepfake_filepaths[:SAMPLE_SIZE_PER_LABEL]
dfdc_train_real_vid_filepaths = dfdc_train_real_vid_filepaths[:SAMPLE_SIZE_PER_LABEL]
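
The slices above keep the first `SAMPLE_SIZE_PER_LABEL` paths of each sorted list. If filenames group frames by source video (an assumption about the naming scheme that this notebook does not verify), the subset may cover only a few videos. A minimal sketch of a seeded random alternative follows. `sample_filepaths` and its `seed` argument are hypothetical and not part of the notebook, and the file names are made up.

```python
import random


def sample_filepaths(filepaths, sample_size, seed=42):
    """Return a reproducible random subset of `filepaths` (hypothetical helper)."""
    if sample_size > len(filepaths):
        raise ValueError("sample_size exceeds the number of available files")
    rng = random.Random(seed)  # local RNG, leaves the global random state untouched
    return sorted(rng.sample(filepaths, sample_size))


# Toy example with made-up file names
toy_paths = [f"frame_{i:04d}.jpg" for i in range(10)]
subset = sample_filepaths(toy_paths, sample_size=4)
print(subset)
```

Because the generator is seeded, calling the helper twice with the same arguments returns the same subset, which keeps tuning runs comparable.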

#### Val Split

In [9]:
cdf_val_deepfake_filepaths = get_filepath_from_dir(cdf_val_deepfake_dir)
cdf_val_real_vid_filepaths = get_filepath_from_dir(cdf_val_real_vid_dir)

df_val_deepfake_filepaths = get_filepath_from_dir(df_val_deepfake_dir)
df_val_real_vid_filepaths = get_filepath_from_dir(df_val_real_vid_dir)

dfdc_val_deepfake_filepaths = get_filepath_from_dir(dfdc_val_deepfake_dir)
dfdc_val_real_vid_filepaths = get_filepath_from_dir(dfdc_val_real_vid_dir)

### Combine Dataset in a List

#### Train Split

In [10]:
combined_train_filepaths = []
combined_train_filepaths += cdf_train_deepfake_filepaths + cdf_train_real_vid_filepaths
combined_train_filepaths += df_train_deepfake_filepaths + df_train_real_vid_filepaths
combined_train_filepaths += dfdc_train_deepfake_filepaths + dfdc_train_real_vid_filepaths

# Get labels for each file based on their directory name
combined_train_labels = [
    filepath.split("/")[-2]
    for filepath in combined_train_filepaths
]

print(f"Got {len(combined_train_filepaths)} frames in total")

Got 12000 frames in total
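
The label extraction relies on every frame sitting directly inside a `Deepfake` or `Original` folder. A small sketch of the same idea using `os.path` instead of splitting on `"/"` is shown here. `label_from_filepath` is a hypothetical helper and the example path is made up.

```python
import os


def label_from_filepath(filepath):
    """Return the name of the directory that directly contains `filepath`."""
    return os.path.basename(os.path.dirname(filepath))


# Made-up path that mirrors the <split>/<label>/<frame> layout
example_path = os.path.join("data", "Train", "Deepfake", "frame_0001.jpg")
print(label_from_filepath(example_path))  # Deepfake
```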


#### Val Split

In [11]:
combined_val_filepaths = []
combined_val_filepaths += cdf_val_deepfake_filepaths + cdf_val_real_vid_filepaths
combined_val_filepaths += df_val_deepfake_filepaths + df_val_real_vid_filepaths
combined_val_filepaths += dfdc_val_deepfake_filepaths + dfdc_val_real_vid_filepaths

# Get labels for each file based on their directory name
combined_val_labels = [
    filepath.split("/")[-2]
    for filepath in combined_val_filepaths
]

print(f"Got {len(combined_val_filepaths)} frames in total")

Got 6000 frames in total


### Create a DataFrame to Store the Dataset

#### Train Split

In [12]:
tuning_train_dataset = pd.DataFrame({
    "filepath": combined_train_filepaths,
    "label": combined_train_labels
})

tuning_train_dataset.head()

| | filepath | label |
|---|---|---|
| 0 | /kaggle/input/deepfake-detection-datasets/Cele... | Deepfake |
| 1 | /kaggle/input/deepfake-detection-datasets/Cele... | Deepfake |
| 2 | /kaggle/input/deepfake-detection-datasets/Cele... | Deepfake |
| 3 | /kaggle/input/deepfake-detection-datasets/Cele... | Deepfake |
| 4 | /kaggle/input/deepfake-detection-datasets/Cele... | Deepfake |


#### Val Split

In [13]:
tuning_val_dataset = pd.DataFrame({
    "filepath": combined_val_filepaths,
    "label": combined_val_labels
})

tuning_val_dataset.head()

| | filepath | label |
|---|---|---|
| 0 | /kaggle/input/deepfake-detection-datasets/Cele... | Deepfake |
| 1 | /kaggle/input/deepfake-detection-datasets/Cele... | Deepfake |
| 2 | /kaggle/input/deepfake-detection-datasets/Cele... | Deepfake |
| 3 | /kaggle/input/deepfake-detection-datasets/Cele... | Deepfake |
| 4 | /kaggle/input/deepfake-detection-datasets/Cele... | Deepfake |


## Image Data Generator

In [14]:
def create_generator(dataset_df):
    datagen = ImageDataGenerator(rescale=1./255)
    generator = datagen.flow_from_dataframe(
        dataset_df,
        x_col="filepath",
        y_col="label",
        target_size=(128, 128),
        batch_size=32,
        color_mode="rgb",
        class_mode="binary",
        shuffle=True,
        seed=42
    )
    
    return generator

In [15]:
print("Train image generator created:")
train_generator = create_generator(tuning_train_dataset)

print("\nValidation image generator created:")
val_generator = create_generator(tuning_val_dataset)

Train image generator created:
Found 12000 validated image filenames belonging to 2 classes.

Validation image generator created:
Found 6000 validated image filenames belonging to 2 classes.
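
With a batch size of 32, the number of batches each generator yields per epoch follows directly from the frame counts reported above. The arithmetic can be sketched as follows (`batches_per_epoch` is a hypothetical helper):

```python
import math

BATCH_SIZE = 32       # batch_size passed to flow_from_dataframe
TRAIN_FRAMES = 12000  # reported by the train generator
VAL_FRAMES = 6000     # reported by the validation generator


def batches_per_epoch(num_frames, batch_size):
    """Batches needed to see every frame once (the last batch may be smaller)."""
    return math.ceil(num_frames / batch_size)


print(batches_per_epoch(TRAIN_FRAMES, BATCH_SIZE))  # 375
print(batches_per_epoch(VAL_FRAMES, BATCH_SIZE))    # 188
```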


## Model Builder Function

In [16]:
def baseline_model_builder(hp):
    conv_layers = hp.Int("conv_layers", min_value=3, max_value=5)
    conv_layer_filters = [
        hp.Int(f"conv_{i+1}_filters", min_value=32, max_value=256, step=32)
        for i in range(conv_layers)
    ]
    conv_layer_kernel_size = [
        hp.Int(f"conv_{i+1}_kernel_size", min_value=3, max_value=7, step=2)
        for i in range(conv_layers)
    ]
    dense_layers = hp.Int("dense_layers", min_value=2, max_value=5)
    dense_layer_units = [
        hp.Int(f"dense_{i+1}_layer_units", min_value=32, max_value=256, step=32)
        for i in range(dense_layers)
    ]
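    # NOTE: no `sampling` argument is given, so the default linear sampling applies
    # and step=10 is additive. Only 1e-5 then fits inside [1e-5, 1e-2], which is
    # consistent with every recorded trial using 1e-05. Passing sampling="log"
    # would instead step by a factor of 10 (1e-5, 1e-4, 1e-3, 1e-2).
    learning_rate = hp.Float("learning_rate", min_value=1e-5, max_value=1e-2, step=10)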

    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Input(shape=(128, 128, 3)))

    for i in range(conv_layers):
        model.add(
            tf.keras.layers.Conv2D(
                conv_layer_filters[i],
                (conv_layer_kernel_size[i], conv_layer_kernel_size[i]),
                activation="relu",
            )
        )
        model.add(tf.keras.layers.MaxPooling2D(2, 2))

    model.add(tf.keras.layers.Flatten())

    for i in range(dense_layers):
        model.add(tf.keras.layers.Dense(dense_layer_units[i], activation="relu"))

    model.add(tf.keras.layers.Dense(1, activation="sigmoid"))

    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
        loss=tf.keras.losses.BinaryCrossentropy(),
        metrics=["accuracy"],
    )

    return model
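
Each block in the builder applies a `Conv2D` with the default `"valid"` padding followed by 2x2 max pooling, so the feature map shrinks quickly. The sketch below traces the side length through the stack. `feature_map_sizes` is a hypothetical helper that assumes stride-1 convolutions and non-overlapping pooling.

```python
def feature_map_sizes(input_size, kernel_sizes, pool_size=2):
    """Side length of the feature map after each Conv2D + MaxPooling2D block.

    Assumes stride-1 convolutions with "valid" padding and non-overlapping
    pooling, which are the Keras defaults relied on by the builder.
    """
    sizes = []
    size = input_size
    for kernel_size in kernel_sizes:
        size = size - kernel_size + 1  # "valid" convolution
        if size < pool_size:
            raise ValueError(f"feature map too small for kernel sizes {kernel_sizes}")
        size = size // pool_size       # max pooling rounds down
        sizes.append(size)
    return sizes


# Kernel sizes of the best trial reported later in this notebook
print(feature_map_sizes(128, [3, 3, 3, 7]))  # [63, 30, 14, 4]
```

Under these assumptions some corners of the search space do not fit a 128x128 input: five blocks with 7x7 kernels leave a 2x2 map before the fifth convolution, so `feature_map_sizes(128, [7] * 5)` raises `ValueError`, and the corresponding Keras model would likely fail to build.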

In [17]:
def dropout_model_builder(hp, base_hp):
    num_conv_layers = base_hp.get("conv_layers")
    num_dense_layers = base_hp.get("dense_layers")
    learning_rate = base_hp.get("learning_rate")
    dropout_rate = hp.Float("dropout_rate", min_value=0.2, max_value=0.8, step=0.05)

    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Input(shape=(128, 128, 3)))

    for i in range(num_conv_layers):
        conv_filter = base_hp.get(f"conv_{i+1}_filters")
        kernel_size = base_hp.get(f"conv_{i+1}_kernel_size")
        model.add(
            tf.keras.layers.Conv2D(
                conv_filter, (kernel_size, kernel_size), activation="relu"
            )
        )
        model.add(tf.keras.layers.MaxPooling2D(2, 2))

    model.add(tf.keras.layers.Flatten())

    for i in range(num_dense_layers):
        dense_units = base_hp.get(f"dense_{i+1}_layer_units")
        model.add(tf.keras.layers.Dense(dense_units, activation="relu"))

    model.add(tf.keras.layers.Dropout(dropout_rate))
    model.add(tf.keras.layers.Dense(1, activation="sigmoid"))

    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
        loss=tf.keras.losses.BinaryCrossentropy(),
        metrics=["accuracy"],
    )

    return model
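
The dropout search space runs from 0.2 to 0.8 in steps of 0.05. A quick sketch of the candidate values, assuming the grid is simply `min_value + i * step` (`dropout_candidates` is a hypothetical helper, and values are rounded for display):

```python
def dropout_candidates(min_value=0.2, max_value=0.8, step=0.05):
    """Enumerate min_value, min_value + step, ... up to max_value."""
    count = int(round((max_value - min_value) / step)) + 1
    return [round(min_value + i * step, 2) for i in range(count)]


candidates = dropout_candidates()
print(len(candidates))  # 13
print(candidates)
```

Without the rounding, sums of binary floating-point numbers can show small artifacts, which is probably why one trial in the results summary reports a rate of 0.6000000000000001.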

## Search for the Base Hyperparameters

For the base hyperparameters, the tuner searches for the configuration with the lowest training loss. The tuner is fitted to the train split only, with no validation data.

In [18]:
base_hp_tuner = kt.Hyperband(baseline_model_builder,
                             objective="loss",
                             max_epochs=20,
                             factor=3,
                             directory="tuner_result",
                             project_name="base_hp",
                             seed=42)
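
Hyperband gives small epoch budgets to many configurations and larger budgets to fewer. As a rough sketch, assume the budget for bracket `b` and round `r` is `ceil(max_epochs / factor ** (b - r))` and that bracket `b` has `b + 1` rounds. Both are assumptions chosen because they reproduce the `tuner/epochs` values of 3, 7, and 20 in this notebook's logs; they are not a statement about KerasTuner's internals. `epoch_budget` is a hypothetical helper.

```python
import math


def epoch_budget(max_epochs, factor, bracket, round_num):
    """Epochs granted to a trial in a given bracket and round (assumed formula)."""
    return math.ceil(max_epochs / factor ** (bracket - round_num))


MAX_EPOCHS, FACTOR = 20, 3
for bracket in (2, 1, 0):
    # Assumption: bracket b runs b + 1 rounds
    budgets = [epoch_budget(MAX_EPOCHS, FACTOR, bracket, r) for r in range(bracket + 1)]
    print(f"bracket {bracket}: {budgets}")
```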

In [19]:
early_stopping = tf.keras.callbacks.EarlyStopping(monitor="loss",
                                                  patience=2,
                                                  restore_best_weights=True)
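
With `patience=2`, training stops once the monitored loss has failed to improve for two consecutive epochs, and `restore_best_weights=True` rolls the model back to the best epoch. The counting logic can be sketched in plain Python. `early_stopping_epoch` is a hypothetical helper, the losses are made up, and details such as `min_delta` are ignored.

```python
def early_stopping_epoch(losses, patience=2):
    """Return (stop_epoch, best_epoch), both 0-indexed, for per-epoch losses.

    stop_epoch is None when training runs through every epoch.
    """
    best_loss = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(losses):
        if loss < best_loss:  # improvement: remember it and reset the counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:                 # no improvement: count towards patience
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch


made_up_losses = [0.90, 0.70, 0.71, 0.72, 0.50]
print(early_stopping_epoch(made_up_losses))  # (3, 1)
```

In this made-up run the later drop to 0.50 is never reached, which illustrates the trade-off of a short patience.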

In [20]:
base_hp_tuner.search(train_generator,
                     verbose=1,
                     callbacks=[early_stopping])

Trial 28 Complete [00h 00m 01s]

Best loss So Far: 0.010555267333984375
Total elapsed time: 00h 55m 21s


In [21]:
base_hp_tuner.results_summary()

Results summary
Results in tuner_result/base_hp
Showing 10 best trials
Objective(name="loss", direction="min")

Trial 0025 summary
Hyperparameters:
conv_layers: 4
conv_1_filters: 256
conv_2_filters: 224
conv_3_filters: 256
conv_1_kernel_size: 3
conv_2_kernel_size: 3
conv_3_kernel_size: 3
dense_layers: 4
dense_1_layer_units: 64
dense_2_layer_units: 224
learning_rate: 1e-05
conv_4_filters: 256
conv_4_kernel_size: 7
dense_3_layer_units: 224
dense_4_layer_units: 224
dense_5_layer_units: 224
conv_5_filters: 192
conv_5_kernel_size: 3
tuner/epochs: 20
tuner/initial_epoch: 0
tuner/bracket: 0
tuner/round: 0
Score: 0.010555267333984375

Trial 0016 summary
Hyperparameters:
conv_layers: 3
conv_1_filters: 224
conv_2_filters: 160
conv_3_filters: 160
conv_1_kernel_size: 7
conv_2_kernel_size: 5
conv_3_kernel_size: 7
dense_layers: 2
dense_1_layer_units: 160
dense_2_layer_units: 128
learning_rate: 1e-05
conv_4_filters: 224
conv_4_kernel_size: 7
dense_3_layer_units: 256
dense_4_layer_units: 256
dense_5_l
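
Both trials shown above report `learning_rate: 1e-05`. The builder declares this hyperparameter with `min_value=1e-5`, `max_value=1e-2`, `step=10`, and no `sampling` argument. The sketch below contrasts an additive step with a multiplicative one. It reflects one reading of how a step interacts with linear and log sampling and is not KerasTuner code; `float_grid` is a hypothetical helper.

```python
def float_grid(min_value, max_value, step, sampling="linear"):
    """Enumerate grid values for an additive ("linear") or multiplicative ("log") step."""
    values = []
    value = min_value
    tolerance = 1 + 1e-9  # guard against floating-point overshoot at the upper bound
    while value <= max_value * tolerance:
        values.append(value)
        value = value + step if sampling == "linear" else value * step
    return values


print(float_grid(1e-5, 1e-2, 10, sampling="linear"))    # [1e-05]
print(len(float_grid(1e-5, 1e-2, 10, sampling="log")))  # 4
```

Under this reading, the additive grid contains a single value, so the base search would not have explored the learning rate at all.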

## Search for the Optimal Dropout Rate

For dropout rate tuning, the tuner searches for the value with the lowest validation loss (`val_loss`). The tuner is fitted to the train split and evaluated on the validation split.

In [22]:
best_base_hp = base_hp_tuner.get_best_hyperparameters()[0].values
best_base_hp

{'conv_layers': 4,
 'conv_1_filters': 256,
 'conv_2_filters': 224,
 'conv_3_filters': 256,
 'conv_1_kernel_size': 3,
 'conv_2_kernel_size': 3,
 'conv_3_kernel_size': 3,
 'dense_layers': 4,
 'dense_1_layer_units': 64,
 'dense_2_layer_units': 224,
 'learning_rate': 1e-05,
 'conv_4_filters': 256,
 'conv_4_kernel_size': 7,
 'dense_3_layer_units': 224,
 'dense_4_layer_units': 224,
 'dense_5_layer_units': 224,
 'conv_5_filters': 192,
 'conv_5_kernel_size': 3,
 'tuner/epochs': 20,
 'tuner/initial_epoch': 0,
 'tuner/bracket': 0,
 'tuner/round': 0}
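
The dictionary lists values for `conv_5_*` and `dense_5_layer_units` even though `conv_layers` and `dense_layers` are both 4. `dropout_model_builder` only reads entries up to the chosen layer counts, so those extra values are never used. A sketch that filters the dictionary down to the entries that matter is given here. `active_hyperparameters` is a hypothetical helper and the example is an abbreviated copy of the dictionary.

```python
import re


def active_hyperparameters(values):
    """Drop per-layer entries beyond the chosen layer counts and "tuner/" keys."""
    limits = {"conv": values["conv_layers"], "dense": values["dense_layers"]}
    active = {}
    for name, value in values.items():
        if name.startswith("tuner/"):
            continue  # bookkeeping entry, not a model hyperparameter
        match = re.match(r"(conv|dense)_(\d+)_", name)
        if match and int(match.group(2)) > limits[match.group(1)]:
            continue  # belongs to a layer the builder never creates
        active[name] = value
    return active


# Abbreviated copy of the dictionary above
example = {"conv_layers": 4, "conv_4_filters": 256, "conv_5_filters": 192,
           "dense_layers": 4, "dense_4_layer_units": 224, "dense_5_layer_units": 224,
           "learning_rate": 1e-05, "tuner/epochs": 20}
print(sorted(active_hyperparameters(example)))
```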

In [23]:
dropout_tuner = kt.Hyperband(lambda hp: dropout_model_builder(hp, best_base_hp),
                             objective="val_loss",
                             max_epochs=20,
                             directory="tuner_result",
                             project_name="dropout_rate",
                             seed=42)

In [24]:
early_stopping = tf.keras.callbacks.EarlyStopping(monitor="val_loss",
                                                  patience=2,
                                                  restore_best_weights=True)

In [25]:
dropout_tuner.search(train_generator,
                     validation_data=val_generator,
                     verbose=1,
                     callbacks=[early_stopping])

Trial 19 Complete [00h 02m 01s]
val_loss: 0.6693305373191833

Best val_loss So Far: 0.6603451371192932
Total elapsed time: 00h 31m 51s
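
For a balanced binary problem, a model that outputs 0.5 for every frame has a binary cross-entropy of ln 2, roughly 0.693. The validation losses logged above (about 0.66 to 0.67) sit only slightly below that reference, which may indicate that the tuned model generalizes weakly to the validation split. A small sketch of the reference calculation, with made-up labels and predictions:

```python
import math


def binary_cross_entropy(labels, probabilities):
    """Mean binary cross-entropy for 0/1 labels and probabilities in (0, 1)."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


labels = [0, 1, 0, 1]
print(binary_cross_entropy(labels, [0.5] * 4))             # ln(2), about 0.6931
print(binary_cross_entropy(labels, [0.1, 0.9, 0.1, 0.9]))  # -ln(0.9), about 0.1054
```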


In [26]:
dropout_tuner.results_summary()

Results summary
Results in tuner_result/dropout_rate
Showing 10 best trials
Objective(name="val_loss", direction="min")

Trial 0011 summary
Hyperparameters:
dropout_rate: 0.65
tuner/epochs: 3
tuner/initial_epoch: 0
tuner/bracket: 2
tuner/round: 0
Score: 0.6603451371192932

Trial 0000 summary
Hyperparameters:
dropout_rate: 0.6000000000000001
tuner/epochs: 3
tuner/initial_epoch: 0
tuner/bracket: 2
tuner/round: 0
Score: 0.66315758228302

Trial 0018 summary
Hyperparameters:
dropout_rate: 0.7
tuner/epochs: 7
tuner/initial_epoch: 0
tuner/bracket: 1
tuner/round: 0
Score: 0.6693305373191833

Trial 0001 summary
Hyperparameters:
dropout_rate: 0.2
tuner/epochs: 3
tuner/initial_epoch: 0
tuner/bracket: 2
tuner/round: 0
Score: 0.6719924807548523

Trial 0007 summary
Hyperparameters:
dropout_rate: 0.5
tuner/epochs: 3
tuner/initial_epoch: 0
tuner/bracket: 2
tuner/round: 0
Score: 0.6723927855491638

Trial 0004 summary
Hyperparameters:
dropout_rate: 0.75
tuner/epochs: 3
tuner/initial_epoch: 0
tuner/brack