# Criteo Click-Through Rate (CTR) Prediction


<div align="center">
  <img src="images/Criteo-Logo-Orange.png" alt="criteo-logo" width=500 />
</div>


## What is Click-Through Rate (CTR)?

**Click-Through Rate (CTR)** is a key metric in online advertising: the percentage of ad impressions that result in a click.

$$
CTR = \frac{\text{Clicks}}{\text{Impressions}} \times 100\%
$$

- **Clicks** -> how many times users clicked on the ad
- **Impressions** -> how many times the ad was shown
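
As a quick worked example of the formula, the snippet below computes CTR as a percentage. The function name `click_through_rate` and the counts are illustrative and are not taken from the dataset.

```python
def click_through_rate(clicks: int, impressions: int) -> float:
    """Return CTR as a percentage; 0.0 when the ad was never shown."""
    if clicks < 0 or impressions < 0:
        raise ValueError("counts must be non-negative")
    if impressions == 0:
        return 0.0
    return clicks / impressions * 100


# Illustrative numbers: 25 clicks over 1,000 impressions.
ctr = click_through_rate(25, 1_000)
print(f"CTR = {ctr:.2f}%")  # CTR = 2.50%
```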

The **CTR prediction task** focuses on modeling the _likelihood of a click_ based on:

- Ad characteristics (e.g., text, image, placement)
- User profile data (e.g., demographics, behavior)
- Contextual features (e.g., time of day, device, location)


## Data Preprocessing


### Data Dictionary


- **label (int64):** Target variable that indicates if an ad was clicked (1) or not (0).
- **I1-I13 (float64 | int64):** A total of 13 columns of integer features (mostly count features).
- **C1-C26 (object):** A total of 26 columns of categorical features. The values of these features have been hashed onto 32 bits for anonymization purposes.

Data is obtained from: https://www.kaggle.com/datasets/mrkmakr/criteo-dataset?resource=download


### Criteo Data (45M)


In [None]:
# import pandas as pd

In [None]:
# train_columns = (
#     ["label"] + [f"I{i}" for i in range(1, 14)] + [f"C{i}" for i in range(1, 27)]
# )
# test_columns = [f"I{i}" for i in range(1, 14)] + [f"C{i}" for i in range(1, 27)]

# criteo_train = pd.read_csv(
#     "data/criteo_kaggle/train.csv", header=None, sep="\t", names=train_columns
# )
# criteo_test = pd.read_csv(
#     "data/criteo_kaggle/test.csv", header=None, sep="\t", names=test_columns
# )

# print(f"Criteo train dataset has shape: {criteo_train.shape}")
# print(f"Criteo test dataset has shape: {criteo_test.shape}")

Train dataset has shape: (45840617, 40) -> 45M entries, 39 features, 1 label

Test dataset has shape: (6042135, 39) -> 6M entries, 39 features


In [None]:
# criteo_train.head(10)

In [None]:
# train_missing_values = criteo_train.isnull().sum()

# train_missing_df = pd.DataFrame(
#     {
#         "Feature": train_missing_values.index,
#         "Missing Count": train_missing_values.values,
#         "Missing Percentage": (train_missing_values / len(criteo_train) * 100).round(2),
#     }
# )

# train_missing_df = train_missing_df.sort_values("Missing Count", ascending=False)
# train_missing_df

In [None]:
# criteo_train.describe()

### Criteo - Training Data Sample

In [8]:
from src.utils.data_utils import criteo_sample_data

criteo_train, criteo_test = criteo_sample_data(
    data_path="data/criteo_kaggle/train.csv",
    use_sklearn_split=False,
    train_sample_size=1_000_000,
    test_sample_size=200_000,
    save=True,
    seed=1773,
)

CRITEO: Train pool shape: (45440617, 40)
CRITEO: Test pool shape: (400000, 40)
CRITEO: Train data shape: (1000000, 40)
CRITEO: Test data shape: (200000, 40)
CRITEO: Train data saved to data\criteo_kaggle\criteo_train_20251212_1655_1000000.csv
CRITEO: Test data saved to data\criteo_kaggle\criteo_test_20251212_1655_200000.csv
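
The implementation of `criteo_sample_data` is not shown here. As a rough sketch of what the logged shapes suggest (a held-out test pool, with the train and test samples drawn from disjoint pools), the hypothetical helper `split_and_sample` below applies the same idea to a toy frame. Whether the real helper builds the test pool randomly or by position is unknown; this sketch assumes a random pool.

```python
import pandas as pd


def split_and_sample(df, test_pool_size, train_n, test_n, seed):
    """Hypothetical helper: hold out a test pool, then sample from each pool."""
    # Assumption: the test pool is drawn at random; the real helper may differ.
    test_pool = df.sample(n=test_pool_size, random_state=seed)
    train_pool = df.drop(index=test_pool.index)
    train = train_pool.sample(n=train_n, random_state=seed)
    test = test_pool.sample(n=test_n, random_state=seed)
    return train.reset_index(drop=True), test.reset_index(drop=True)


# Toy data standing in for the full training file.
toy = pd.DataFrame({"row_id": range(1_000), "label": [i % 4 == 0 for i in range(1_000)]})
toy_train, toy_test = split_and_sample(
    toy, test_pool_size=100, train_n=200, test_n=50, seed=1773
)

print(toy_train.shape, toy_test.shape)  # (200, 2) (50, 2)
# The two samples never share a row because the pools are disjoint.
print(set(toy_train["row_id"]) & set(toy_test["row_id"]))  # set()
```

Sampling from disjoint pools keeps evaluation rows out of the training sample, no matter how the two samples are drawn afterwards.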


### Preprocessing

In [9]:
import pandas as pd

from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from deepctr_torch.inputs import SparseFeat, DenseFeat, get_feature_names

In [11]:
sparse_features = ["C" + str(i) for i in range(1, 27)]
dense_features = ["I" + str(i) for i in range(1, 14)]
target = ["label"]

# Impute missing dense values with 0 (a simple default that may need closer inspection)
for col in dense_features:
    if col in criteo_train.columns:
        criteo_train[col] = criteo_train[col].fillna(0)
    if col in criteo_test.columns:
        criteo_test[col] = criteo_test[col].fillna(0)

# Impute missing categorical values with a string placeholder (the encoder expects strings)
for col in sparse_features:
    if col in criteo_train.columns:
        criteo_train[col] = criteo_train[col].fillna("UNKNOWN")
    if col in criteo_test.columns:
        criteo_test[col] = criteo_test[col].fillna("UNKNOWN")

# Label encoding for sparse features.
# Fit one encoder per feature on the union of train and test values so both
# splits share the same integer ids; fitting on train alone would make
# transform() raise on categories that appear only in the test sample.
vocab_sizes = {}
for feat in sparse_features:
    lbe = LabelEncoder()
    lbe.fit(pd.concat([criteo_train[feat], criteo_test[feat]]))
    criteo_train[feat] = lbe.transform(criteo_train[feat])
    criteo_test[feat] = lbe.transform(criteo_test[feat])
    vocab_sizes[feat] = len(lbe.classes_)

# Min-max scaling for dense features: fit on train only, then reuse the same scaler for test
mms = MinMaxScaler(feature_range=(0, 1))
criteo_train[dense_features] = mms.fit_transform(criteo_train[dense_features])
criteo_test[dense_features] = mms.transform(criteo_test[dense_features])

# Record the vocabulary size of each sparse field and the name of each dense field
fixlen_feature_columns = [
    SparseFeat(feat, vocabulary_size=vocab_sizes[feat], embedding_dim=4)
    for feat in sparse_features
] + [DenseFeat(feat, 1) for feat in dense_features]

dnn_feature_columns = fixlen_feature_columns
linear_feature_columns = fixlen_feature_columns

feature_names = get_feature_names(linear_feature_columns + dnn_feature_columns)
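
One point worth keeping in mind with label encoding: integer ids are only comparable between train and test when a single fitted encoder produces both. The toy example below (made-up values, not Criteo data) contrasts two independently fitted encoders with one shared encoder.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

train_vals = np.array(["a", "b", "c", "b"])
test_vals = np.array(["c", "c", "b"])

# Fitting separately: each encoder numbers its own sorted vocabulary.
separate_train = LabelEncoder().fit_transform(train_vals)  # a->0, b->1, c->2
separate_test = LabelEncoder().fit_transform(test_vals)    # b->0, c->1

# Fitting once: train and test share the same mapping.
shared = LabelEncoder().fit(train_vals)
shared_train = shared.transform(train_vals)
shared_test = shared.transform(test_vals)

print(separate_test.tolist())  # [1, 1, 0] -> "c" is 1 here but 2 in train
print(shared_test.tolist())    # [2, 2, 1] -> consistent with train
```

With separately fitted encoders, an embedding row learned for one category during training would be looked up for a different category at test time. Note that `transform` raises a `ValueError` for labels the encoder has not seen, so unseen test categories need to be handled before encoding.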

## Model Training and Evaluation


### Performance Metrics


$$
\mathrm{BCELoss} = L(y, \hat{y}) = -\frac{1}{N}\sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i) \right]
$$
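
To make the loss concrete, here is a minimal pure-Python sketch of the formula. The function name `bce_loss`, the clipping constant `eps`, and the toy labels and predictions are illustrative assumptions.

```python
import math


def bce_loss(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy; eps clips predictions away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


labels = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]   # mostly correct, high confidence
uninformed = [0.5, 0.5, 0.5, 0.5]  # always predicts 50%

print(round(bce_loss(labels, confident), 4))   # about 0.1643
print(round(bce_loss(labels, uninformed), 4))  # ln(2), about 0.6931
```

A model that always predicts 0.5 scores $\ln 2 \approx 0.693$, which makes a useful sanity baseline for balanced labels. On an imbalanced dataset, the comparable baseline is the loss of always predicting the observed click rate.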

$$
\mathrm{AUC} = \int_{0}^{1} \mathrm{TPR}(t) \, d(\mathrm{FPR}(t))
$$
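
The integral above equals the probability that a randomly chosen clicked example receives a higher score than a randomly chosen non-clicked one. The sketch below computes AUC through that pairwise view. It is quadratic in the number of examples, so it is meant for illustration on small inputs only; the function name `auc_pairwise` and the toy scores are assumptions.

```python
def auc_pairwise(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.3, 0.4, 0.6, 0.1]

# 5 of the 6 (positive, negative) pairs are ordered correctly.
print(round(auc_pairwise(labels, scores), 4))  # 0.8333
```

Unlike the log loss, AUC depends only on the ordering of the scores and not on their calibration, which is why the two metrics are usually reported together for CTR models.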
