# Criteo Click-Through Rate (CTR) Prediction


<div align="center">
  <img src="../images/Criteo-Logo-Orange.png" alt="criteo-logo" width=500 />
</div>


## What is Click-Through Rate (CTR)?

**Click-Through Rate (CTR)** is a key metric in online advertising: the percentage of ad impressions that result in a click.

$$
CTR = \frac{\text{Clicks}}{\text{Impressions}} \times 100\%
$$

- **Clicks** -> how many times users clicked on the ad
- **Impressions** -> how many times the ad was shown
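
As a quick worked example of the formula above, the sketch below computes CTR for a made-up ad. The function name `click_through_rate` and the numbers are illustrative assumptions, not part of the dataset.

```python
def click_through_rate(clicks: int, impressions: int) -> float:
    """Return CTR as a percentage; 0.0 when the ad was never shown."""
    if clicks < 0 or impressions < 0:
        raise ValueError("clicks and impressions must be non-negative")
    if impressions == 0:
        return 0.0  # avoid division by zero for ads with no impressions
    return 100.0 * clicks / impressions


# Hypothetical ad: shown 1,000 times, clicked 25 times
ctr = click_through_rate(clicks=25, impressions=1_000)
print(f"CTR = {ctr:.2f}%")  # CTR = 2.50%
```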

The **CTR prediction task** models the _probability that a given impression is clicked_ based on:

- Ad characteristics (e.g., text, image, placement)
- User profile data (e.g., demographics, behavior)
- Contextual features (e.g., time of day, device, location)


## Data Preprocessing


### Data Dictionary


- **label (int64):** Target variable that indicates if an ad was clicked (1) or not (0).
- **I1-I13 (float64 | int64):** 13 integer features, mostly counts. pandas loads a column as float64 when it contains missing values and as int64 otherwise.
- **C1-C26 (object):** A total of 26 columns of categorical features. The values of these features have been hashed onto 32 bits for anonymization purposes.
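
The schema above can be sketched on two made-up rows. This assumes the hashed categorical values appear as 8-character hexadecimal strings (8 hex digits = 32 bits); the rows below are invented for illustration and are not real Criteo data. Passing an explicit `dtype` for the `C*` columns is one possible way to make sure the hashes stay strings.

```python
import io

import pandas as pd

dense_cols = [f"I{i}" for i in range(1, 14)]    # I1..I13
sparse_cols = [f"C{i}" for i in range(1, 27)]   # C1..C26
columns = ["label"] + dense_cols + sparse_cols  # 40 columns in total

# Two made-up rows (NOT real Criteo data); empty fields stand for missing values
row_a = ["0"] + ["1", "", "5"] + ["0"] * 10 + ["a1b2c3d4", ""] + ["deadbeef"] * 24
row_b = ["1"] + ["2", "7", ""] + ["3"] * 10 + ["0000ffff", "12345678"] + ["deadbeef"] * 24
raw = "\n".join("\t".join(row) for row in (row_a, row_b))

df = pd.read_csv(
    io.StringIO(raw),
    header=None,
    sep="\t",
    names=columns,
    dtype={col: "object" for col in sparse_cols},  # keep hashes as strings
)

print(df.shape)
print(df[["label", "I1", "I2", "C1", "C2"]].dtypes)
```

Here `I1` has no missing values and stays int64, while `I2` contains a missing value and becomes float64, which matches the `float64 | int64` note in the dictionary.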

Data is obtained from: https://www.kaggle.com/datasets/mrkmakr/criteo-dataset?resource=download


### Criteo Data (45M)


In [1]:
# import pandas as pd

In [2]:
# train_columns = (
#     ["label"] + [f"I{i}" for i in range(1, 14)] + [f"C{i}" for i in range(1, 27)]
# )
# test_columns = [f"I{i}" for i in range(1, 14)] + [f"C{i}" for i in range(1, 27)]

# criteo_train = pd.read_csv(
#     "data/criteo_kaggle/train.csv", header=None, sep="\t", names=train_columns
# )
# criteo_test = pd.read_csv(
#     "data/criteo_kaggle/test.csv", header=None, sep="\t", names=test_columns
# )

# print(f"Criteo train dataset has shape: {criteo_train.shape}")
# print(f"Criteo test dataset has shape: {criteo_test.shape}")

Train dataset has shape: (45840617, 40) -> 45M entries, 39 features, 1 label

Test dataset has shape: (6042135, 39) -> 6M entries, 39 features
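
At roughly 45M rows, loading the whole training file at once can be memory-heavy. One possible alternative, sketched below, is to read the file in chunks with the `chunksize` parameter of `pd.read_csv` and keep a random fraction of each chunk. The helper `sample_in_chunks` and the in-memory stand-in file are hypothetical; this is not how the notebook itself loads the data.

```python
import io

import pandas as pd


def sample_in_chunks(source, columns, frac, chunksize, seed=1773):
    """Read a tab-separated file chunk by chunk, keeping a random fraction of each chunk."""
    reader = pd.read_csv(
        source, header=None, sep="\t", names=columns, chunksize=chunksize
    )
    parts = [
        # Vary the seed per chunk so chunks do not all pick the same row positions
        chunk.sample(frac=frac, random_state=seed + i)
        for i, chunk in enumerate(reader)
    ]
    return pd.concat(parts, ignore_index=True)


# Tiny in-memory stand-in for train.csv: 100 made-up rows, 3 columns
fake_file = io.StringIO("".join(f"{i % 2}\t{i}\tv{i}\n" for i in range(100)))
sample = sample_in_chunks(fake_file, ["label", "I1", "C1"], frac=0.1, chunksize=20)
print(sample.shape)  # (10, 3)
```

For the real file, a fraction of about 1,000,000 / 45,840,617 ≈ 0.0218 would yield roughly one million rows. Unlike `sample(n=...)`, this approach returns an approximate rather than an exact row count.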


In [3]:
# criteo_train.head(10)

In [4]:
# train_missing_values = criteo_train.isnull().sum()

# train_missing_df = pd.DataFrame(
#     {
#         "Feature": train_missing_values.index,
#         "Missing Count": train_missing_values.values,
#         "Missing Percentage": (train_missing_values / len(criteo_train) * 100).round(2),
#     }
# )

# train_missing_df = train_missing_df.sort_values("Missing Count", ascending=False)
# train_missing_df

In [5]:
# criteo_train.describe()

### Preprocessing

In [6]:
import pandas as pd

from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from deepctr_torch.inputs import SparseFeat, DenseFeat, get_feature_names

In [None]:
sparse_features = ["C" + str(i) for i in range(1, 27)]
dense_features = ["I" + str(i) for i in range(1, 14)]
target = ["label"]
criteo_features = target + dense_features + sparse_features

criteo_path = "../data/criteo_kaggle/train.csv"
print(f"Loading the Criteo data from {criteo_path}")
criteo_data = pd.read_csv(criteo_path, header=None, sep="\t", names=criteo_features)

criteo_data = criteo_data.sample(n=1_000_000, random_state=1773)

# Impute missing categorical values with a string placeholder,
# so every sparse column holds strings only before label encoding
criteo_data[sparse_features] = criteo_data[sparse_features].fillna("-1")
# Impute missing dense values with 0 (a simple default that may need closer inspection)
criteo_data[dense_features] = criteo_data[dense_features].fillna(0)

# ----- Label-encode sparse features and min-max scale dense features ----- #

print("Encoding sparse features and transforming dense features...")

for feat in sparse_features:
    lbe = LabelEncoder()
    criteo_data[feat] = lbe.fit_transform(criteo_data[feat])

mms = MinMaxScaler(feature_range=(0, 1))
criteo_data[dense_features] = mms.fit_transform(criteo_data[dense_features])

# ----- Record the vocabulary size of each sparse field and the name of each dense field ----- #

fixlen_feature_columns = [
    SparseFeat(
        feat,
        vocabulary_size=criteo_data[feat].max() + 1,
        embedding_dim=4,
    )
    for feat in sparse_features
] + [DenseFeat(feat, 1) for feat in dense_features]

dnn_feature_columns = fixlen_feature_columns
linear_feature_columns = fixlen_feature_columns

feature_names = get_feature_names(linear_feature_columns + dnn_feature_columns)

Loading the Criteo data from ../data/criteo_kaggle/train.csv
Encoding sparse features and transforming dense features...
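
To make the preprocessing steps above concrete, here is a minimal sketch on a made-up two-column frame. It applies the same imputation, `LabelEncoder`, and `MinMaxScaler` calls as the cell above; the toy values are invented for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Made-up data: one categorical and one dense column, each with a missing value
toy = pd.DataFrame(
    {
        "C1": ["b2", None, "a1", "b2"],
        "I1": [10.0, None, 0.0, 5.0],
    }
)
toy["C1"] = toy["C1"].fillna("-1")  # string placeholder for missing categories
toy["I1"] = toy["I1"].fillna(0)     # zero for missing dense values

# LabelEncoder sorts the distinct values ("-1", "a1", "b2") and maps them to 0, 1, 2
encoder = LabelEncoder()
toy["C1"] = encoder.fit_transform(toy["C1"])

# MinMaxScaler rescales each column linearly so its min is 0 and its max is 1
scaler = MinMaxScaler(feature_range=(0, 1))
toy[["I1"]] = scaler.fit_transform(toy[["I1"]])

# Codes run from 0 to max, so the embedding table needs max + 1 rows
vocabulary_size = int(toy["C1"].max()) + 1

print(toy["C1"].tolist())  # [2, 0, 1, 2]
print(toy["I1"].tolist())  # [1.0, 0.0, 0.0, 0.5]
print(vocabulary_size)     # 3
```

Note that the missing-value placeholder `"-1"` receives its own integer code, so it is treated as one more category by the embedding layer.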


## Model Training and Evaluation


### Performance Metrics


$$
\mathrm{BCELoss} = L(y, \hat{y}) = -\frac{1}{N}\sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i) \right]
$$
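
The loss can be evaluated by hand on a few made-up predictions. The sketch below is a plain-Python transcription of the formula; the `eps` clipping is an added assumption to avoid `log(0)`, and the labels and scores are invented.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-15):
    """Mean binary cross-entropy; predictions are clipped to [eps, 1 - eps]."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # guard against log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


y_true = [1, 0, 0, 1]
confident = binary_cross_entropy(y_true, [0.9, 0.1, 0.2, 0.8])
uninformative = binary_cross_entropy(y_true, [0.5, 0.5, 0.5, 0.5])

print(f"{confident:.4f}")      # 0.1643
print(f"{uninformative:.4f}")  # 0.6931
```

Predicting 0.5 for every example always gives ln 2 ≈ 0.6931, regardless of the labels, which makes it a handy reference point when reading reported loss values.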

$$
\mathrm{AUC} = \int_{0}^{1} \mathrm{TPR}(t) \, d(\mathrm{FPR}(t))
$$
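
The area under the ROC curve equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes AUC that way on made-up scores. The function `roc_auc_pairwise` is a hypothetical helper for illustration; it compares every positive with every negative, so it is only practical for small inputs.

```python
def roc_auc_pairwise(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc_pairwise(y_true, y_score)
print(auc)  # 0.75
```

Three of the four positive-negative pairs are ordered correctly here (only 0.35 vs. 0.4 is not), which gives 0.75. An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking.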


### DeepFM (DeepCTR-Torch)

In [8]:
import torch
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.model_selection import train_test_split
from deepctr_torch.models import DeepFM

In [9]:
criteo_train, criteo_test = train_test_split(
    criteo_data, test_size=0.2, random_state=1773
)

train_model_input = {name: criteo_train[name] for name in feature_names}
test_model_input = {name: criteo_test[name] for name in feature_names}

use_cuda = True
if use_cuda and torch.cuda.is_available():
    device = "cuda:0"
    print(f"PyTorch: Cuda ready, device={device}")
else:
    device = "cpu"

model = DeepFM(
    linear_feature_columns=linear_feature_columns,
    dnn_feature_columns=dnn_feature_columns,
    dnn_hidden_units=(128, 128, 64),
    task="binary",
    l2_reg_embedding=1e-5,
    device=device,
    seed=1773,
)
model.compile("adam", "binary_crossentropy", metrics=["binary_crossentropy"])

history = model.fit(
    train_model_input,
    criteo_train[target].values,
    batch_size=16000,
    epochs=30,
    verbose=2,  # 0 = silent, 1 = progress bar, 2 = one line per epoch
    validation_split=0.2,
)

PyTorch: Cuda ready, device=cuda:0
cuda:0
Train on 640000 samples, validate on 160000 samples, 40 steps per epoch
Epoch 1/30
5s - loss:  0.5905 - binary_crossentropy:  0.5905 - val_binary_crossentropy:  0.5362
Epoch 2/30
5s - loss:  0.4952 - binary_crossentropy:  0.4952 - val_binary_crossentropy:  0.4942
Epoch 3/30
6s - loss:  0.4176 - binary_crossentropy:  0.4176 - val_binary_crossentropy:  0.5199
Epoch 4/30
6s - loss:  0.3354 - binary_crossentropy:  0.3354 - val_binary_crossentropy:  0.5535
Epoch 5/30
5s - loss:  0.3004 - binary_crossentropy:  0.3004 - val_binary_crossentropy:  0.5683
Epoch 6/30
6s - loss:  0.2853 - binary_crossentropy:  0.2853 - val_binary_crossentropy:  0.5796
Epoch 7/30
5s - loss:  0.2775 - binary_crossentropy:  0.2775 - val_binary_crossentropy:  0.5855
Epoch 8/30
5s - loss:  0.2725 - binary_crossentropy:  0.2725 - val_binary_crossentropy:  0.5929
Epoch 9/30
5s - loss:  0.2685 - binary_crossentropy:  0.2685 - val_binary_crossentropy:  0.6003
Epoch 10/30
5s - loss:
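
In the log above, the validation loss reaches its minimum at epoch 2 and then rises while the training loss keeps falling, a pattern commonly associated with overfitting. The sketch below replays the nine complete validation values from the log through a simple patience-based early-stopping rule. The function `early_stopping_epoch` and the `patience` value are illustrative assumptions; this is a library-agnostic sketch, not DeepCTR-Torch's callback API.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (best_epoch, stop_epoch), both 1-indexed.

    stop_epoch is None if the patience budget is never exhausted.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch  # new best validation loss
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch             # no improvement for `patience` epochs
    return best_epoch, None


# val_binary_crossentropy for epochs 1-9, copied from the training log above
val_losses = [0.5362, 0.4942, 0.5199, 0.5535, 0.5683, 0.5796, 0.5855, 0.5929, 0.6003]

best_epoch, stop_epoch = early_stopping_epoch(val_losses, patience=3)
print(best_epoch, stop_epoch)  # 2 5
```

With a patience of 3, training would have stopped at epoch 5 and the weights from epoch 2 would be the ones to keep.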

In [10]:
y_test = criteo_test[target].values
pred_ans = model.predict(test_model_input, batch_size=256)

test_bce = round(log_loss(y_test, pred_ans), 4)
test_auc = round(roc_auc_score(y_test, pred_ans), 4)
print(f"TEST: BCE Loss: {test_bce} | ROC AUC: {test_auc} ")

TEST: BCE Loss: 0.6933 | ROC AUC: 0.7001 
