# Avazu Click-Through Rate (CTR) Prediction

<div align="center">
  <img src="images/banner_ctr.jpg" alt="avazu-ctr-img" />
</div>

## What is Click-Through Rate (CTR)?

**Click-Through Rate (CTR)** is a key metric in online advertising.  

$$
CTR = \frac{\text{Clicks}}{\text{Impressions}} \times 100\%
$$

- **Clicks** -> how many times users clicked on the ad  
- **Impressions** -> how many times the ad was shown  
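
As a quick illustration of the formula above, the sketch below computes CTR from raw counts. The function name `click_through_rate` and the example numbers are illustrative assumptions and do not come from the dataset.

```python
def click_through_rate(clicks: int, impressions: int) -> float:
    """Return CTR as a percentage; 0.0 when the ad was never shown."""
    if impressions <= 0:
        return 0.0
    return 100.0 * clicks / impressions


# Hypothetical example: 17 clicks out of 100 impressions.
ctr = click_through_rate(clicks=17, impressions=100)
print(f"CTR = {ctr:.1f}%")
```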

The **CTR prediction task** focuses on modeling the *likelihood of a click* based on:
- Ad characteristics (e.g., text, image, placement)  
- User profile data (e.g., demographics, behavior)  
- Contextual features (e.g., time of day, device, location)  

## Data Preprocessing

### Data Dictionary

- **id (float64):** Ad identifier
- **click (int64):** **(TARGET)** 0/1 for non-click/click
- **hour (int64):** Format is YYMMDDHH, so 14091123 means 23:00 on Sept. 11, 2014 UTC.
- **C1 (int64):** Anonymized categorical variable
- **banner_pos (int64):** Encoded position of the ad banner
- **site_id (object):** A hashed value
- **site_domain (object):** A hashed value
- **site_category (object):** A hashed value
- **app_id (object):** A hashed value 
- **app_domain (object):** A hashed value
- **app_category (object):** A hashed value
- **device_id (object):** A hashed value
- **device_ip (object):** A hashed value
- **device_model (object):** A hashed value
- **device_type (int64):** Encoded device type (e.g., mobile, tablet, PC)
- **device_conn_type (int64):** Encoded connection type (e.g., wireless, wired, LTE)
- **C14-C21 (int64):** Anonymized categorical variables
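
The `hour` format can be checked with the standard library alone. This is a minimal sketch; the helper name `parse_avazu_hour` is hypothetical, and the example value is the one from the dictionary entry above.

```python
from datetime import datetime, timezone


def parse_avazu_hour(raw: int) -> datetime:
    """Parse a YYMMDDHH integer (e.g. 14091123) into a UTC datetime."""
    parsed = datetime.strptime(f"{raw:08d}", "%y%m%d%H")
    return parsed.replace(tzinfo=timezone.utc)


ts = parse_avazu_hour(14091123)
print(ts.isoformat())  # 2014-09-11T23:00:00+00:00
print(ts.weekday(), ts.hour)  # weekday uses Monday=0 ... Sunday=6
```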

### Avazu - October 2014 - Data

In [None]:
import pandas as pd

In [None]:
# Training data
avazu_train = pd.read_csv("data/avazu/train.csv")
avazu_train.head(10)

In [None]:
avazu_train.id.nunique()

In [None]:
# Test data
avazu_test = pd.read_csv("data/avazu/test.csv")
avazu_test.head(10)

In [None]:
print(f"Train dataset has shape: {avazu_train.shape}")
print(f"Test dataset has shape: {avazu_test.shape}")

Train dataset has shape: (40_428_967, 24) -> about 40.4M rows, 23 features, 1 target

Test dataset has shape: (4_577_464, 23) -> about 4.6M rows, 23 features

In [None]:
avazu_train.info()

**Comments:**
- The target is available only for the training data, not for the test data, so we work with the training data.
- A fully random train/test split is not appropriate because the data is chronologically ordered.
- Instead, we randomly sample 800K rows for training from the training rows whose `hour` is earlier than 14103000.
- We test on the last 200K rows of the training set, all of which have `hour` values of the form 141030**.

In [None]:
test_data = avazu_train.tail(200_000)

In [None]:
test_data.head()

In [None]:
test_data.tail()

In [None]:
# Hold out the last 200K rows as test data, then sample 800K training rows from hours before 14103000
test_data = avazu_train.tail(200_000)

train_pool = avazu_train[avazu_train["hour"] < 14103000]
train_data = train_pool.sample(n=800_000, random_state=1773)

print(f"Avazu train data has shape: {avazu_train.shape}")
print(f"Avazu train pool has shape: {train_pool.shape}")
print(f"New training data has shape: {train_data.shape}")
print(f"New test data has shape: {test_data.shape}")

train_data.to_csv("data/avazu/avazu_train_800k.csv", index=False)
test_data.to_csv("data/avazu/avazu_test_200k.csv", index=False)

In [None]:
(train_data["hour"] == 14103000).any() # Check if any 14103000

In [None]:
# Get random samples from data to work on smaller subsets -> 80% train 20% test split in the paper
# avazu_train_50_sample = avazu_train.sample(n=50, random_state=1773)
# avazu_train_800k_sample = avazu_train.sample(n=800_000, random_state=1773)
# avazu_test_200k_sample = avazu_test.sample(n=200_000, random_state=1773)

# avazu_train_50_sample.to_csv("data/avazu/avazu_train_50_sample.csv", index=False)
# avazu_train_800k_sample.to_csv("data/avazu/avazu_train_800k_sample.csv", index=False)
# avazu_test_200k_sample.to_csv("data/avazu/avazu_test_200k_sample.csv", index=False)

### Avazu - October 2014 - Training Data Sample (800K)

In [None]:
import pandas as pd

%load_ext autoreload
%autoreload 2

In [None]:
avazu_train_800k = pd.read_csv("data/avazu/avazu_train_800k.csv")

# Datetime conversion and parsing for hour feature
avazu_train_800k["hour"] = pd.to_datetime(
    avazu_train_800k["hour"], format="%y%m%d%H", errors="coerce", utc=True
)

# No need for year and month here, all records are in October 2014
# avazu_train_800k["year"] = avazu_train_800k["hour"].dt.year
# avazu_train_800k["month"] = avazu_train_800k["hour"].dt.month
avazu_train_800k["day"] = avazu_train_800k["hour"].dt.day
avazu_train_800k["weekday"] = avazu_train_800k[
    "hour"
].dt.weekday  # Monday=0, Sunday=6
avazu_train_800k["hour_of_day"] = avazu_train_800k["hour"].dt.hour

The year and month are not extracted because all records are from October 2014.

In [None]:
avazu_train_800k.head(10)

In [None]:
avazu_train_800k.info()

In [None]:
# Check whether there are any missing values
avazu_train_800k.isnull().sum()

In [None]:
# Analyze categorical and numerical features
numerical = []
categorical = []

for col in avazu_train_800k.columns:
    if avazu_train_800k[col].dtype == "object":
        categorical.append(col)
    else:
        numerical.append(col)
        
max_len = max(len(numerical), len(categorical))

print("="*65)
print(f"{'Numerical Features':30} | {'Categorical Features':30}")
print("-"*65)

for i in range(max_len):
    num = numerical[i] if i < len(numerical) else ""
    cat = categorical[i] if i < len(categorical) else ""
    print(f"{num:30} | {cat:30}")

print("="*65)
print(f"Total features: {len(avazu_train_800k.columns)}")
print(f"Numerical features: {len(numerical)}")
print(f"Categorical features: {len(categorical)}")
print("="*65)

In [None]:
avazu_train_800k.describe()

### Outlier Elimination

In [None]:
from src.utils.plotting_utils import multi_distplot, multi_boxplot
from src.utils.stat_utils import iqr_elimination, percentile_capping

In [None]:
multi_distplot(avazu_train_800k, numerical, dpi=300)

In [None]:
multi_boxplot(
    avazu_train_800k,
    numerical_features=["C14", "C15", "C16", "C17", "C18", "C19", "C20", "C21"],
)

**Comments:**
- We check the **anonymized categorical features** (the `C` columns) for potential outliers.
- Some of these features may actually encode categories, such as days of the week, as integer values.
- If that is the case, those values are valid categories rather than real outliers.
- In such cases, there is no need to remove them from the data.

**Outcomes:**
- If the data is **approximately normally distributed**, values more than 3 standard deviations (`σ`) away from the mean (`μ`) can be flagged as outliers.
- For skewed distributions, we can use the **IQR (Interquartile Range)** to check specific numeric features for outliers.
- There are no visible outliers for features `C1, C14, C17, C18, C20`.
- There are some outliers for features `C15, C16, C19, C21`.
- Features `C15, C16` have extreme values in both the left and right tails.
- Features `C19, C21` have extreme values in the right tail only, at different percentiles.
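
The two rules mentioned above can be compared on a small example. This is a minimal sketch using only the standard library; the function names and the toy sample are illustrative assumptions and are not the project's `iqr_elimination` helper.

```python
import statistics


def three_sigma_outliers(values):
    """Flag values more than 3 population standard deviations from the mean."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [v for v in values if abs(v - mu) > 3 * sigma]


def iqr_outliers(values, k=1.5):
    """Flag values outside the fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical skewed sample: most values near 50, one extreme value.
sample = [48, 49, 50, 50, 50, 51, 52, 250]
print("3-sigma:", three_sigma_outliers(sample))
print("IQR k=1.5:", iqr_outliers(sample, k=1.5))
print("IQR k=3:", iqr_outliers(sample, k=3))
```

On this toy sample the extreme value inflates the standard deviation enough that the 3σ rule does not flag it, while the IQR fences do. That is one reason the IQR is often preferred for skewed data.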

#### IQR Elimination

<div align="center">
  <img src="images/Boxplot_vs_PDF.svg.png" alt="iqr-img" />
</div>

In [None]:
# Eliminate outliers detected on some C columns
c_cols = ["C15", "C16", "C19", "C21"]

iqr_1_5_avazu_train_800k_sample = iqr_elimination(
    avazu_train_800k, c_cols, k=1.5
)
iqr_3_avazu_train_800k_sample = iqr_elimination(avazu_train_800k, c_cols, k=3)

print(f"Before shape: {avazu_train_800k.shape}")
print(f"After k=1.5 IQR elimination shape: {iqr_1_5_avazu_train_800k_sample.shape}")
print(f"After k=3 IQR elimination shape: {iqr_3_avazu_train_800k_sample.shape}")

IQR elimination removes too much data, so we opt for percentile capping instead.

#### Percentile Capping (Winsorization)

<div align="center">
  <img src="images/percentile_outlier_removal.png" alt="percantile-outlier-img" />
</div>

In [None]:
# Both tails cap for C15, C16 at 90%
prc_90_both_avazu_train_800k_sample = percentile_capping(
    avazu_train_800k,
    ["C15", "C16"],
    q=0.90,
    side="both",
    gap_factor=None,
)

# Right tail caps for C21 at 90%
prc_90_right_c21 = percentile_capping(
    prc_90_both_avazu_train_800k_sample,
    ["C21"],
    q=0.90,
    side="right",
    gap_factor=None,
)

# Right tail caps for C19 at 82%
prc_82_right_c19 = percentile_capping(
    prc_90_right_c21,
    ["C19"],
    q=0.82,
    side="right",
    gap_factor=None,
)

In [None]:
multi_boxplot(
    prc_82_right_c19, numerical_features=["C15", "C16", "C19", "C21"]
)

In [None]:
prc_82_right_c19[["C15", "C16", "C19", "C21"]].describe()
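
For readers without access to the project's `percentile_capping` helper, the idea can be sketched with plain pandas. The function below is a hypothetical stand-in: it assumes the left cap sits at the `(1 - q)` quantile, and it does not model the `gap_factor` argument.

```python
import pandas as pd


def cap_at_percentile(series: pd.Series, q: float = 0.90, side: str = "both") -> pd.Series:
    """Clip values above the q-th quantile and/or below the (1 - q)-th quantile."""
    upper = series.quantile(q) if side in ("right", "both") else None
    lower = series.quantile(1 - q) if side in ("left", "both") else None
    return series.clip(lower=lower, upper=upper)


# Hypothetical feature with one extreme value in the right tail.
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
capped = cap_at_percentile(toy, q=0.90, side="right")
print(capped.tolist())
```

Unlike IQR elimination, capping keeps every row and only changes the extreme values, so the dataset size stays the same.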

### Label Encoding

In [None]:
from sklearn.preprocessing import LabelEncoder
from tqdm import tqdm

In [None]:
avazu_train_800k.nunique()

In [None]:
LABEL_ENCODE_FEATURE_NAMES = [
    "id",
    "C1",
    "banner_pos",
    "site_id",
    "site_domain",
    "site_category",
    "app_id",
    "app_domain",
    "app_category",
    "device_id",
    "device_ip",
    "device_model",
    "device_type",
    "device_conn_type",
    "C14",
    "C15",
    "C16",
    "C17",
    "C18",
    "C19",
    "C20",
    "C21",
]

feature_size = {}

for feature in tqdm(LABEL_ENCODE_FEATURE_NAMES):
    lbe = LabelEncoder()
    avazu_train_800k[feature] = lbe.fit_transform(avazu_train_800k[feature])
    feature_size[feature] = avazu_train_800k[feature].nunique()

In [None]:
feature_size

In [None]:
avazu_train_800k.head(10)

## Model Training and Evaluation

### Performance Metrics

$$BCELoss = L(y, \hat{y}) = -\frac{1}{N}\sum_{i=1}^{N} [y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i)]$$

$$
\mathrm{AUC} = \int_{0}^{1} \mathrm{TPR}(t) \, d(\mathrm{FPR}(t))
$$
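
Both metrics can be computed by hand on a tiny example. The sketch below uses only the standard library. The AUC is computed through its pairwise interpretation (the probability that a randomly chosen positive is scored above a randomly chosen negative), which is fine for a few samples but too slow for the full dataset. The function names and toy values are illustrative assumptions.

```python
import math


def bce_loss(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy; probabilities are clamped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


def roc_auc(y_true, y_prob):
    """Share of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
print(f"BCE = {bce_loss(y_true, y_prob):.4f}")  # BCE = 0.4723
print(f"AUC = {roc_auc(y_true, y_prob):.2f}")  # AUC = 0.75
```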

### MLP (Without Feature Embeddings + With Label Encoding) 

In [None]:
import os

from src.dataset_class.avazu_dataset import AvazuCTRDataset
from src.trainer.model_generator import ModelGenerator
from src.utils.logging_utils import *
from src.utils.constants import LOGS_PATH

%load_ext autoreload
%autoreload 2

MODEL_NAME = "mlp"
MLP_INIT_DIM = 21
MLP_HIDDEN_DIMS = (128, 64, 32)
EPOCHS = 10
BATCH_SIZE = 512
SEED = 1773

avazu_train_dataset = AvazuCTRDataset(
    data_path="data/avazu/avazu_train_800k.csv", drop_hour=True, drop_id=True
)
avazu_test_dataset = AvazuCTRDataset(
    data_path="data/avazu/avazu_test_200k.csv", drop_hour=True, drop_id=True
)

log_path = os.path.join(
    LOGS_PATH, f"avazu_drop_id_hour_{MODEL_NAME}_epochs_{EPOCHS}_bs_{BATCH_SIZE}_seed_{SEED}.log"
)
logger = setup_logger(log_path)

trainer = ModelGenerator(
    model_name=MODEL_NAME,
    mlp_input_dim=MLP_INIT_DIM,
    mlp_hidden_dims=MLP_HIDDEN_DIMS,
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    seed=SEED,
    logger=logger,
)
trainer.train_test(train_dataset=avazu_train_dataset, test_dataset=avazu_test_dataset)

close_logger(logger)

In [None]:
top_n_ads = trainer.recommend_ads(avazu_test_dataset, top_n=5)
top_n_ads

### Logistic Regression (LR)

In [None]:
import os

from src.dataset_class.avazu_dataset import AvazuCTRDataset
from src.trainer.model_generator import ModelGenerator
from src.utils.logging_utils import *
from src.utils.constants import LOGS_PATH

%load_ext autoreload
%autoreload 2

avazu_train_dataset = AvazuCTRDataset(
    data_path="data/avazu/avazu_train_800k.csv", drop_hour=True, drop_id=True
)

avazu_test_dataset = AvazuCTRDataset(
    data_path="data/avazu/avazu_test_200k.csv", drop_hour=True, drop_id=True
)

print(
    f"Total field dimensions of {len(avazu_train_dataset.field_dims)} features: {sum(avazu_train_dataset.field_dims)}"
)

MODEL_NAME = "lr"
FIELD_DIMS = avazu_train_dataset.field_dims
EPOCHS = 10
BATCH_SIZE = 512
SEED = 1773


log_path = os.path.join(
    LOGS_PATH, f"avazu_drop_id_hour_{MODEL_NAME}_epochs_{EPOCHS}_bs_{BATCH_SIZE}_seed_{SEED}.log"
)
logger = setup_logger(log_path)


trainer = ModelGenerator(
    model_name=MODEL_NAME,
    field_dims=FIELD_DIMS,
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    seed=SEED,
    logger=logger,
)
trainer.train_test(avazu_train_dataset, avazu_test_dataset)

close_logger(logger)

In [None]:
top_n_ads = trainer.recommend_ads(avazu_test_dataset, top_n=5)
top_n_ads

### DeepEmbed (Feature Embeddings + MLP)

In [None]:
import os

from src.dataset_class.avazu_dataset import AvazuCTRDataset
from src.trainer.model_generator import ModelGenerator
from src.utils.logging_utils import *
from src.utils.constants import LOGS_PATH

%load_ext autoreload
%autoreload 2

avazu_train_dataset = AvazuCTRDataset(
    data_path="data/avazu/avazu_train_800k.csv", drop_hour=True
)

avazu_test_dataset = AvazuCTRDataset(
    data_path="data/avazu/avazu_test_200k.csv", drop_hour=True
)

# BEST TUNED CONFIG - TRIAL 45
MODEL_NAME = "deep-embed"
FIELD_DIMS = avazu_train_dataset.field_dims
EMBED_DIM = 10
MLP_HIDDEN_DIMS = (441, 48, 266, 111)
DROPOUT = 0.073
LEARNING_RATE = 0.0092
EPOCHS = 17
BATCH_SIZE = 256
SEED = 1773

# CONFIG 
# MODEL_NAME = "deep-embed"
# FIELD_DIMS = avazu_train_dataset.field_dims
# EMBED_DIM = 16
# MLP_HIDDEN_DIMS = (256, 128, 64)
# DROPOUT = 0.36
# EPOCHS = 5
# BATCH_SIZE = 2048  # 1024 is set in the paper
# SEED = 1773


log_path = os.path.join(
    LOGS_PATH,
    f"avazu_{MODEL_NAME}_embed_dim_{EMBED_DIM}_hidden_dims_{MLP_HIDDEN_DIMS}_epochs_{EPOCHS}_bs_{BATCH_SIZE}_seed_{SEED}.log",
)
logger = setup_logger(log_path)


trainer = ModelGenerator(
    model_name=MODEL_NAME,
    field_dims=FIELD_DIMS,
    embed_dim=EMBED_DIM,
    mlp_hidden_dims=MLP_HIDDEN_DIMS,
    dropout=DROPOUT,
    lr=LEARNING_RATE,
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    seed=SEED,
    logger=logger,
    save_model=False
)
trainer.train_test(train_dataset=avazu_train_dataset, test_dataset=avazu_test_dataset)

close_logger(logger)

In [None]:
top_n_ads = trainer.recommend_ads(avazu_test_dataset, top_n=5)
top_n_ads

### FINT

In [None]:
import os

from src.dataset_class.avazu_dataset import AvazuCTRDataset
from src.trainer.model_generator import ModelGenerator
from src.utils.logging_utils import *
from src.utils.constants import LOGS_PATH

%load_ext autoreload
%autoreload 2

avazu_train_dataset = AvazuCTRDataset(
    data_path="data/avazu/avazu_train_800k.csv", drop_hour=True
)

avazu_test_dataset = AvazuCTRDataset(
    data_path="data/avazu/avazu_test_200k.csv", drop_hour=True
)

# PAPER CONFIG 
MODEL_NAME = "fint"
FIELD_DIMS = avazu_train_dataset.field_dims
EMBED_DIM = 16
FINT_LAYERS = 3
MLP_HIDDEN_DIMS = (300, 300, 300)
DROPOUT = 0.2
LEARNING_RATE = 1e-3
EPOCHS = 5
BATCH_SIZE = 1024  # 1024 is set in the paper
SEED = 1773


log_path = os.path.join(
    LOGS_PATH,
    f"avazu_{MODEL_NAME}_embed_dim_{EMBED_DIM}_fint_layers_{FINT_LAYERS}_hidden_dims_{MLP_HIDDEN_DIMS}_epochs_{EPOCHS}_bs_{BATCH_SIZE}_seed_{SEED}.log",
)
logger = setup_logger(log_path)


trainer = ModelGenerator(
    model_name=MODEL_NAME,
    field_dims=FIELD_DIMS,
    embed_dim=EMBED_DIM,
    fint_layers=FINT_LAYERS,
    mlp_hidden_dims=MLP_HIDDEN_DIMS,
    dropout=DROPOUT,
    lr=LEARNING_RATE,
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    seed=SEED,
    logger=logger,
    save_model=False
)
trainer.train_test(train_dataset=avazu_train_dataset, test_dataset=avazu_test_dataset)

close_logger(logger)

In [None]:
top_n_ads = trainer.recommend_ads(avazu_test_dataset, top_n=5)
top_n_ads

## Hyperparameter Tuning (Optuna)

### DeepEmbed (Feature Embeddings + MLP)

In [None]:
from src.trainer.optuna_tuner import OptunaTuner
from src.dataset_class.avazu_dataset import AvazuCTRDataset

%load_ext autoreload
%autoreload 2

avazu_train_dataset = AvazuCTRDataset(
    data_path="data/avazu/avazu_train_800k.csv", drop_hour=True, drop_id=True
)

# Config
MODEL_NAME = "deep-embed"
FIELD_DIMS = avazu_train_dataset.field_dims
NTRIALS = 50
SEED = 1773

tuner = OptunaTuner(
    train_dataset=avazu_train_dataset,
    model_name=MODEL_NAME,
    field_dims=FIELD_DIMS,
    seed=SEED
)
tuner.tune(n_trials=NTRIALS, storage_name="avazu_deep_embed_opt_3.db")

### FINT

In [None]:
from src.trainer.optuna_tuner import OptunaTuner
from src.dataset_class.avazu_dataset import AvazuCTRDataset

%load_ext autoreload
%autoreload 2

avazu_train_dataset = AvazuCTRDataset(
    data_path="data/avazu/avazu_train_800k.csv", drop_hour=True, drop_id=True
)

# Config
MODEL_NAME = "fint"
FIELD_DIMS = avazu_train_dataset.field_dims
NTRIALS = 50
SEED = 1773

tuner = OptunaTuner(
    train_dataset=avazu_train_dataset,
    model_name=MODEL_NAME,
    field_dims=FIELD_DIMS,
    seed=SEED
)
tuner.tune(n_trials=NTRIALS, storage_name="avazu_fint_opt.db")

## Q3: Model does not perform well in live settings. What can be the reason for that? How can we improve?


Assume that your model performs well on the test set and you have deployed it to production. When you monitor the model’s online performance, you see that it does not work well. What can be the reason for that? Suggest a proposal to improve it.

**Seasonal shifts (Concept Drift):** User behavior can change due to temporal or contextual factors. For instance, during summer or holiday periods, people tend to search for vacation- or celebration-related items and advertisements. Similarly, events such as pandemics, religious feasts, or national holidays can alter users’ online interests and interaction patterns.
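
One possible way to monitor such shifts is to compare the distribution of a feature at training time with its distribution in production. The sketch below uses the Population Stability Index (PSI) as an example drift score. This technique is an assumption for illustration and is not used elsewhere in this project; the `device_type` values are made up.

```python
import math
from collections import Counter


def population_stability_index(reference, live, eps=1e-6):
    """PSI between two samples of a categorical feature; larger means more drift."""
    categories = set(reference) | set(live)
    ref_counts, live_counts = Counter(reference), Counter(live)
    psi = 0.0
    for cat in categories:
        # Shares are floored at eps so unseen categories do not cause log(0).
        ref_share = max(ref_counts[cat] / len(reference), eps)
        live_share = max(live_counts[cat] / len(live), eps)
        psi += (live_share - ref_share) * math.log(live_share / ref_share)
    return psi


# Hypothetical device_type values at training time vs. in production.
reference = [1] * 80 + [0] * 15 + [4] * 5
stable = [1] * 78 + [0] * 17 + [4] * 5
shifted = [1] * 40 + [0] * 20 + [4] * 40
print(f"stable:  {population_stability_index(reference, stable):.3f}")
print(f"shifted: {population_stability_index(reference, shifted):.3f}")
```

A score that keeps growing over time is a signal to retrain the model on more recent data.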

**Feature processing bugs:** Data undergoes multiple transformation stages before being fed into the machine learning model. Any change in the data processing pipeline can distort the data and potentially degrade or corrupt the model’s performance.

**Data Leakage** occurs when information that should be unavailable during training influences the model. Since the data is collected as a time series, we can’t draw random samples from the whole dataset. If past and future records are mixed when predicting an incoming event, the validation score may be misleadingly high while real-world performance drops after deployment.
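
A time-ordered split avoids this kind of leakage. The sketch below shows the idea on a made-up frame that follows the dataset's YYMMDDHH `hour` convention; the rows and the variable names are illustrative assumptions.

```python
import pandas as pd

# Toy frame with the same YYMMDDHH "hour" convention as the dataset.
toy = pd.DataFrame(
    {
        "hour": [14102100, 14102512, 14102923, 14103000, 14103015, 14103023],
        "click": [0, 1, 0, 0, 1, 0],
    }
)

cutoff = 14103000  # everything from this hour onward is held out
train_part = toy[toy["hour"] < cutoff]
holdout_part = toy[toy["hour"] >= cutoff]

# The newest training row must be older than the oldest held-out row.
assert train_part["hour"].max() < holdout_part["hour"].min()
print(len(train_part), len(holdout_part))
```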

**Class Imbalance** arises when one class is significantly more frequent than the others in the dataset. This imbalance causes the model to favor the dominant class, predicting it most of the time to minimize overall loss, while failing to recognize or correctly classify minority cases. In click-through rate prediction, for example, clicks (positive class) are rare compared to non-clicks, as seen in the Avazu dataset. To mitigate this, techniques such as class weighting, oversampling the minority class, undersampling the majority class, or generating synthetic samples can be applied.
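
Class weighting can be sketched without any framework. The helper names and the toy labels below are assumptions for illustration. The weighting scheme mirrors the idea behind the `pos_weight` argument of `torch.nn.BCEWithLogitsLoss`, but this is not the project's training code.

```python
import math


def positive_class_weight(labels):
    """Weight for the positive class: number of negatives per positive."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("labels contain no positive examples")
    return (len(labels) - n_pos) / n_pos


def weighted_bce(y_true, y_prob, pos_weight, eps=1e-12):
    """Binary cross-entropy where the positive-class term is scaled by pos_weight."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


# Hypothetical labels: 2 clicks, 8 non-clicks.
labels = [1] * 2 + [0] * 8
weight = positive_class_weight(labels)
always_low = [0.05] * len(labels)  # a model that nearly always predicts "no click"
print(f"pos_weight = {weight}")
print(f"unweighted loss = {weighted_bce(labels, always_low, pos_weight=1.0):.3f}")
print(f"weighted loss   = {weighted_bce(labels, always_low, pos_weight=weight):.3f}")
```

With the weight applied, a model that ignores the minority class is penalized more heavily.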

**Evaluation Metric Misalignment** refers to choosing an inappropriate metric that fails to reflect the true objective of the model. For instance, accuracy can be misleading on a dataset with severe class imbalance, since a model that predicts only the majority class still achieves a high score. The model should instead be evaluated with metrics suited to the data characteristics, such as ROC-AUC.
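
The effect is easy to reproduce on made-up labels. In this sketch, the 17% click rate and the helper names are hypothetical choices for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """Fraction of actual positives that were predicted as positive."""
    predictions_for_positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(predictions_for_positives) / len(predictions_for_positives)


# Hypothetical imbalanced labels: 17 clicks, 83 non-clicks.
y_true = [1] * 17 + [0] * 83
always_no_click = [0] * len(y_true)

print(f"accuracy = {accuracy(y_true, always_no_click):.2f}")  # accuracy = 0.83
print(f"recall   = {recall(y_true, always_no_click):.2f}")  # recall   = 0.00
```

The classifier looks accurate even though it never identifies a single click.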


## Dependencies

Full dependencies are listed in `requirements.txt` and in `ctrpredenv.yml`.

Main ones for model training and evaluation are:
- `python=3.13.5`
- `torch==2.8.0+cu129`
- `torchvision==0.23.0+cu129`
- `optuna==4.5.0`
- `optuna-dashboard==0.19.0`

## References

[1] https://www.kaggle.com/competitions/avazu-ctr-prediction/overview

[2] https://en.wikipedia.org/wiki/Interquartile_range

[3] https://www.analyticsvidhya.com/blog/2021/05/feature-engineering-how-to-detect-and-remove-outliers-with-python-code/

[4] https://github.com/zhishan01/FINT

[5] https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html

[6] https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html

[7] https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html

[8] https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html

[9] https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html

[10] https://medium.com/@rjnclarke/build-a-simple-embedding-model-classifier-in-pytorch-6043c0681e65

[11] https://optuna.readthedocs.io/en/stable/reference/generated/optuna.trial.Trial.html

[12] https://optuna-dashboard.readthedocs.io/en/latest/getting-started.html