# Ad Click Prediction

In this notebook, we will be analyzing ad click data from [Kaggle](https://www.kaggle.com/datasets/marius2303/ad-click-prediction-dataset) and building a prediction model.

***

# Initialization

Importing libraries, helper functions, and data.

In [None]:
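# Setting PYTHONHASHSEED
# Note: hash randomization is fixed when the interpreter starts, so setting the
# variable here only affects subprocesses launched afterwards, not this kernel.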
import os

pyhashseed1 = os.environ.get("PYTHONHASHSEED")
os.environ['PYTHONHASHSEED'] = '0'
pyhashseed2 = os.environ.get("PYTHONHASHSEED")

# NOTEBOOK EXCLUSIVE CODE
if __name__ == "__main__":
    print('Make sure the following says \'None\': ', pyhashseed1)
    print('Make sure the following says \'0\': ', pyhashseed2)

In [None]:
# Importing libraries
from pathlib import Path

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math

from IPython.display import display

# Setting seed
np.random.seed(42)

In [None]:
# Displaying + saving plot function
IMAGES_PATH = Path() / "plots"
IMAGES_PATH.mkdir(parents=True, exist_ok=True)  # make sure the output folder exists before saving

def save_fig(fig_name, tight_layout=True, fig_extension='png', resolution=300):
    '''Saves an image to the plots folder with the specified name.'''
    path = IMAGES_PATH / f'{fig_name}.{fig_extension}'
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

def plot_show(plt_name):
    '''Saves an image using save_fig() under the plt_name and displays it.'''
    save_fig(plt_name)
    plt.show()

In [None]:
# Import data
adclicks = pd.read_csv("data/ad_click_dataset.csv")

***

# EDA + Splitting Data

This section analyzes and cleans the data: we examine its structure, remove redundant data, and explore relationships between features. We will also split the data into training and test sets.

## Data Overview

Structure of the dataset, share of missing and duplicate values, etc.

In [None]:
# Grab a quick snapshot
adclicks.head()

In [None]:
# Evaluate duplicate values
duplicates = adclicks.duplicated()
print("There are", len(adclicks[duplicates]), "duplicate rows.")

# Drop duplicates
adclicks = adclicks.drop_duplicates()
print("The duplicate rows have been dropped. There are", len(adclicks), "rows remaining.")

In [None]:
# General information
adclicks.info()

## Data Cleaning + Data Visualization

Evaluate missing values and think about how to deal with them. Visualize distributions of features and conduct initial visual analysis.

### Missing Data

Examining missing values and thinking about how to deal with the missingness.
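
As a rough sketch of the kind of summary that helps here, the snippet below reports both the count and the percentage of missing values per column. The `toy` frame and its values are made up for illustration and are not taken from the ad click data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the ad click data (values are made up)
toy = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, np.nan],
    "gender": ["Male", None, "Female", "Female"],
    "click": [1, 0, 1, 0],
})

# Per-column count and percentage of missing values, worst column first
missing_summary = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": toy.isna().mean() * 100,
}).sort_values("pct_missing", ascending=False)

print(missing_summary)
```

Looking at percentages rather than raw counts makes it easier to judge whether dropping rows, imputing, or adding a "Missing" category is reasonable for a given feature.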

In [None]:
# How many missing values in each feature?
missing = adclicks.isna().sum()
display(missing)

# How many rows with missing values?
missing_rows_count = adclicks.isnull().any(axis=1).sum()
print(f"Number of rows with missing data: {missing_rows_count}")

In [None]:
import missingno as msno

# Check pattern of missingness

# Missing matrix
msno.matrix(adclicks)
plot_show("missing_matrix")

# Nullity correlation heatmap
msno.heatmap(adclicks)
plot_show("nullity_corr_heatmap")

# Dendrogram
msno.dendrogram(adclicks)
plot_show("missing_dendrogram")

In [None]:
# Correlation heatmap for missingness
missing_corr = adclicks.isnull().corr()
sns.heatmap(missing_corr, annot=True, cmap="coolwarm")
plt.title("Correlation of Missingness")
plot_show("missing_correlation")

### Examining Unique Values

Examine the unique values of the categorical features and check whether any column can serve as a row ID.
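
A column can only act as a row ID if every value in it is distinct. A minimal sketch of that check, on a made-up `toy` frame (the `row` column is hypothetical and only there to show a positive case):

```python
import pandas as pd

# Toy frame: `id` and `full_name` repeat, `row` does not (values are made up)
toy = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "full_name": ["User1", "User2", "User2", "User3"],
    "row": [0, 1, 2, 3],
})

# A column qualifies as a row ID only if all of its values are distinct
id_candidates = [col for col in toy.columns if toy[col].is_unique]
print(id_candidates)
```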

In [None]:
# Extract categorical columns
categorical = adclicks.select_dtypes(include=["object", "category", "bool"])

# Display unique values in each categorical column
for feature in categorical:
    print(f"{feature}:", list(categorical[feature].unique()))

In [None]:
# Unique number of `id` and `full_name`
print(adclicks["id"].nunique())
print(adclicks["full_name"].nunique())

In [None]:
# Dropping `full_name`
adclicks = adclicks.drop(columns="full_name")

## Examining Recurring Users

Since we have recurring users, we are going to quickly examine their effect on the data to determine the best method of splitting.
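
One option worth keeping in mind when users recur is a group-aware split, which keeps all rows of a user on the same side of the split. The sketch below uses scikit-learn's `GroupShuffleSplit` on made-up data. It is an illustration of the idea under that assumption, not the splitting method this notebook ends up using.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy data: ids 1 and 3 occur twice (values are made up)
toy = pd.DataFrame({
    "id": [1, 1, 2, 3, 3, 4, 5, 6],
    "click": [1, 1, 0, 0, 1, 0, 1, 0],
})

# Split by user id so that no user lands in both train and test
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(toy, toy["click"], groups=toy["id"]))

train_ids = set(toy.iloc[train_idx]["id"])
test_ids = set(toy.iloc[test_idx]["id"])
print("Ids shared between train and test:", train_ids & test_ids)
```

Collapsing each user to a single row, as done later in this notebook, is another way to avoid the same user appearing in both sets.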

In [None]:
# Extract recurring users
# Count `id` occurrences
id_counts = adclicks["id"].value_counts()
print(f"The maximum number of times a user occurs is {id_counts.max()}.")

# Separate recurring users and single users
recurring_ids = id_counts[id_counts > 1].index
recurring_users = adclicks[adclicks["id"].isin(recurring_ids)]
single_users = adclicks[~adclicks["id"].isin(recurring_ids)]

# Count occurrences of each
# Single quotes inside the f-strings keep this valid on Python versions before 3.12
print(f"Total number of recurring users: {recurring_users['id'].nunique()}")
print(f"Total number of single users: {single_users['id'].nunique()}")

In [None]:
# Describe each dataset
print("Single User Statistics")
display(single_users.describe())
print("Recurring User Statistics")
display(recurring_users.describe())

## Distribution of Categorical Features

We will examine the distributions of the categorical features for the whole dataset, and separately for singly occurring and recurring users.

In [None]:
# Function to create plot grid of barplots
def plot_barplots(dataframes, dataframe_names, features, colors):
    '''A function that outputs a grid of barplots.'''
    
    num_df = len(dataframes)
    num_features = len(features)
    
    # Each figure will be 6 by 4
    fig, axes = plt.subplots(num_features, num_df, figsize=(6*num_df, 4*num_features), sharey=True)
    
    # Iterate through and plot figures
    for i, ax in enumerate(axes.flatten()):
        # Gather data
        df = dataframes[i % num_df]
        df_name = dataframe_names[i % num_df]
        feature = features[i // num_df]
        feature_name = feature.capitalize()
        color = colors[i // num_df]
        
        # Configure data for sns
        counts = df[feature].value_counts(dropna=False).reset_index()
        counts.columns = [feature_name, "Count"]
        counts[feature_name] = counts[feature_name].fillna("Missing")   # Convert Na's to Missing
        
        # Create barplot
        sns.barplot(x=feature_name, y="Count", data=counts, color=color, ax=ax)
        
        # Extra plot details
        ax.set_title(f"{feature_name} Counts for {df_name}")
        ax.set_xlabel(feature_name)
        ax.set_ylabel("Count")

        

In [None]:
dataframes = [adclicks, single_users, recurring_users]
dataframe_names = ["Whole Data", "Singly Occurring Users", "Recurring Users"]
# Categorical columns still present in `adclicks`, in a stable (non-set) order
features = [col for col in categorical.columns if col in adclicks.columns]
colors = ["skyblue", "orange", "purple", "green", "red"]

plot_barplots(dataframes, dataframe_names, features, colors)
plot_show("categorical_barplots")

## Distributions of Numerical Features

Evaluating the distribution of numerical features.

In [None]:
# Information of numerical features
adclicks.describe()

In [None]:
# Function to create plot grid of boxplots
def plot_boxplots(dataframes, dataframe_names, features, colors, group_by=None):
    '''A function that outputs a grid of boxplots.'''
    
    num_df = len(dataframes)
    num_features = len(features)
    
    # Each figure will be 6 by 4
    fig, axes = plt.subplots(num_features, num_df, figsize=(6*num_df, 4*num_features), sharey=True)
    
    # Iterate through and plot figures
    for i, ax in enumerate(axes.flatten()):
        # Gather data
        df = dataframes[i % num_df]
        df_name = dataframe_names[i % num_df]
        feature = features[i // num_df]
        feature_name = feature.capitalize()
        color = colors[i // num_df]
        
        # Create boxplot
        sns.boxplot(x=group_by, y=feature, data=df, color=color, ax=ax)
        
        # Extra plot details
        ax.set_title(f"{feature_name} Boxplot for {df_name}")
        ax.set_xlabel(group_by or "")   # x-axis holds the grouping variable, if any
        ax.set_ylabel(feature_name)     # y-axis holds the numerical feature


In [None]:
# Boxplot of age with user category by occurrence
features = ["age"]
colors = ["lightblue"]

plot_boxplots(dataframes, dataframe_names, features, colors)
plot_show("age_boxplots")

In [None]:
# Gender vs Age
plot_boxplots(dataframes, dataframe_names, features, ["purple"], group_by="gender")
plot_show("age_vs_gender_boxplots")

# Ad Position vs Age
plot_boxplots(dataframes, dataframe_names, features, ["orange"], group_by="ad_position")
plot_show("age_vs_ad_position_boxplots")

# Device Type vs Age
plot_boxplots(dataframes, dataframe_names, features, ["skyblue"], group_by="device_type")
plot_show("age_vs_device_type_boxplots")

# Browsing History vs Age
plot_boxplots(dataframes, dataframe_names, features, ["green"], group_by="browsing_history")
plot_show("age_vs_browsing_history_boxplots")

# Time of Day vs Age
plot_boxplots(dataframes, dataframe_names, features, ["red"], group_by="time_of_day")
plot_show("age_vs_time_of_day_boxplots")

In [None]:
# Function to create plot grid of histograms
def plot_histograms(dataframes, dataframe_names, features, colors):
    '''A function that outputs a grid of histograms.'''
    
    num_df = len(dataframes)
    num_features = len(features)
    
    # Each figure will be 6 by 4
    fig, axes = plt.subplots(num_features, num_df, figsize=(6*num_df, 4*num_features), sharey=True)
    
    # Iterate through and plot figures
    for i, ax in enumerate(axes.flatten()):
        # Gather data
        df = dataframes[i % num_df]
        df_name = dataframe_names[i % num_df]
        feature = features[i // num_df]
        feature_name = feature.capitalize()
        color = colors[i // num_df]
        
        # Create histogram
        sns.histplot(x=feature, data=df, kde=True, color=color, ax=ax)
        
        # Extra plot details
        ax.set_title(f"{feature_name} Histogram for {df_name}")
        ax.set_xlabel(feature_name)


In [None]:
# Histogram of age with user category by occurrence
features = ["age"]
colors = ["blue"]

plot_histograms(dataframes, dataframe_names, features, colors)
plot_show("age_histograms")

***

# Preprocessing Before Splitting

Preprocess data by imputing and encoding in multiple ways for different types of models.

## Aggregating and Collapsing Recurring User Data
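
Collapsing here means reducing each recurring user's rows to a single row that keeps one observed value per feature. As a minimal sketch on made-up data, `groupby(...).first()` returns the first non-null value of each column within a group. This is an alternative illustration of the idea and not the same code as the set-based approach in the cell below.

```python
import numpy as np
import pandas as pd

# Toy recurring-user data with gaps (values are made up)
toy = pd.DataFrame({
    "id": [1, 1, 2, 2, 2],
    "age": [np.nan, 30.0, 45.0, np.nan, 45.0],
    "gender": ["Male", None, None, None, "Female"],
})

# One row per user: first non-null value of every column
collapsed = toy.groupby("id").first()

# Number of rows each user contributed before collapsing
collapsed["num_visits"] = toy.groupby("id").size()
collapsed = collapsed.reset_index()

print(collapsed)
```

Unlike taking an arbitrary element of a set, `first()` is deterministic: it depends only on row order, which can make results easier to reproduce.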

In [None]:
# Aggregate and collapse recurring user data
recurring_users = recurring_users.fillna("Missing")

# Add `num_visits` column to represent the number of times a user visited
recurring_users["num_visits"] = recurring_users["id"].map(recurring_users["id"].value_counts())
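single_users = single_users.copy()  # work on a copy to avoid SettingWithCopyWarning on the slice of `adclicks`
single_users["num_visits"] = 1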

recurring_users_collapsed = recurring_users.groupby("id").agg(lambda x: list(set(x).difference({"Missing"})))
recurring_users_collapsed = recurring_users_collapsed.map(lambda x: x[0] if isinstance(x, list) and x else np.nan).reset_index()

In [None]:
# Add `recurring_user` column to both recurring and singly occurring users
recurring_users_collapsed.loc[:,"recurring_user"] = 1
single_users.loc[:,"recurring_user"] = 0

In [None]:
# Combine data
adclicks_users = pd.concat([single_users, recurring_users_collapsed], ignore_index=True)

## Drop `id`

Since each row now represents a unique user, we can drop `id`.

In [None]:
# Drop `id`
adclicks_users = adclicks_users.drop(columns="id")

## Two Datasets: Imputation vs. Missing category

In [None]:
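# Imputation dataset: missing values stay as NaN and are imputed later
adclicks2 = adclicks_users.copy()

# Missing-category dataset: NaNs in every column except `age` become an explicit "Missing" level
adclicks3 = adclicks_users[adclicks_users.columns.difference(["age"])].fillna("Missing")
adclicks3["age"] = adclicks_users["age"]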
reorder = ["age", "gender", "device_type", "ad_position", "browsing_history", "time_of_day", "click", "num_visits", "recurring_user"]
adclicks3 = adclicks3[reorder]

***

# Splitting Data

Split the data with respect to the class imbalance of `click`.
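
For reference, a stratified split is another common way to respect class proportions: it keeps the share of each class roughly equal in the training and test sets. The sketch below uses synthetic labels and is meant as a complementary illustration. The notebook itself balances the classes by downsampling before splitting.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 20% positives, 80% negatives (made up for illustration)
X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 20 + [0] * 80)

# `stratify=y` keeps the class ratio close to the original in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print("Positive share (train):", y_train.mean())
print("Positive share (test): ", y_test.mean())
```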

In [None]:
# Downsampling
from sklearn.utils import resample

# Split `adclicks2` and then separate indices from `adclicks3`
adclicks2_majority = adclicks2[adclicks2.click == 0]
adclicks2_minority = adclicks2[adclicks2.click == 1]

adclicks2_majority_ds = resample(adclicks2_majority, replace=False, n_samples=len(adclicks2_minority), random_state=42)
adclicks2_ds = pd.concat([adclicks2_majority_ds, adclicks2_minority])

In [None]:
# Extract same indices from `adclicks3`
adclicks3_ds = adclicks3.loc[adclicks2_ds.index]

In [None]:
# Split data
from sklearn.model_selection import train_test_split

X2 = adclicks2_ds.drop(columns="click")
y2 = adclicks2_ds["click"]

X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, random_state=42, test_size=0.2)

In [None]:
# Extract same indices from `adclicks3_ds`
X3 = adclicks3_ds.drop(columns="click")
y3 = adclicks3_ds["click"]

X3_train = X3.loc[X2_train.index]
X3_test = X3.loc[X2_test.index]
y3_train = y3.loc[y2_train.index]
y3_test = y3.loc[y2_test.index]

## Checking Class Imbalance

In [None]:
# Function to create plot grid of barplots
def plot_barplots2(dataframes, dataframe_names, features, colors):
    '''A function that outputs a grid of barplots.'''
    
    num_df = len(dataframes)
    num_features = len(features)
    
    # Each figure will be 6 by 4
    fig, axes = plt.subplots(num_features, num_df, figsize=(6*num_df, 4*num_features), sharey=True)
    
    # Iterate through and plot figures
    for i, ax in enumerate(axes.flatten()):
        # Gather data
        df = dataframes[i % num_df]
        df_name = dataframe_names[i % num_df]
        feature = features[i // num_df]
        feature_name = feature.capitalize()
        color = colors[i // num_df]
        
        # Configure data for sns
        counts = df[feature].value_counts(dropna=False).reset_index()
        counts.columns = [feature_name, "Count"]
        counts[feature_name] = counts[feature_name].fillna("Missing")   # Convert Na's to Missing
        
        # Create barplot
        sns.barplot(x=feature_name, y="Count", data=counts, color=color, ax=ax)
        
        # Extra plot details
        ax.set_title(f"{feature_name} Counts for {df_name}")
        ax.set_xlabel(feature_name)
        ax.set_ylabel("Count")

In [None]:
# Checking class imbalance
series = [y2_train, y2_test]
series_names = ["Training data", "Testing data"]

fig, axes = plt.subplots(1, 2, figsize=(6*2, 4*1), sharey=True)

# Iterate through and plot figures
for i, ax in enumerate(axes.flatten()):
    # Gather data
    srs = series[i % 2]
    srs_name = series_names[i % 2]
    feature_name = "Click"
    color = "blue"
    
    # Configure data for sns
    counts = srs.value_counts(dropna=False).reset_index()
    counts.columns = [feature_name, "Count"]
    counts[feature_name] = counts[feature_name].fillna("Missing")   # Convert Na's to Missing
    
    # Create barplot
    sns.barplot(x=feature_name, y="Count", data=counts, color=color, ax=ax)
    
    # Extra plot details
    ax.set_title(f"{feature_name} Counts for {srs_name}")
    ax.set_xlabel(feature_name)
    ax.set_ylabel("Count")

***

# Preprocessing After Splitting

Preprocess by imputing missing data, encoding categorical variables, scaling numerical variables.

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, LabelEncoder

from feature_engine.imputation import RandomSampleImputer

In [None]:
# Transformer Pipeline
num_transformer1 = Pipeline(
    steps =[
        ("imputed", RandomSampleImputer(random_state=42))
    ]
)

num_transformer2 = Pipeline(
    steps =[
        ("imputed", RandomSampleImputer(random_state=42)),
        ("scaled", MinMaxScaler())
    ]
)

preprocessor1 = ColumnTransformer(
    transformers=[
        ("encoding", OneHotEncoder(), ["gender", "device_type", "browsing_history", "ad_position", "time_of_day"]),
        ("num", num_transformer1, ["age"])
    ],
    remainder="passthrough",
    sparse_threshold=0   # always return a dense array so it can be wrapped in a DataFrame
)

preprocessor2 = ColumnTransformer(
    transformers=[
        ("encoding", OneHotEncoder(), ["gender", "device_type", "browsing_history", "ad_position", "time_of_day"]),
        ("num", num_transformer2, ["age", "num_visits"])
    ],
    remainder="passthrough",
    sparse_threshold=0   # always return a dense array so it can be wrapped in a DataFrame
)

In [None]:
# preprocessor2 imputes and applies MinMaxScaler, so its output is the scaled data
scaled_df = preprocessor2.fit_transform(X3_train, y3_train)
scaled_df_columns = preprocessor2.get_feature_names_out()

scaled_df = pd.DataFrame(scaled_df, columns=scaled_df_columns)

In [None]:
# preprocessor1 only imputes and encodes, so its output is the non-scaled data
non_scaled_df = preprocessor1.fit_transform(X3_train, y3_train)
non_scaled_df_columns = preprocessor1.get_feature_names_out()

non_scaled_df = pd.DataFrame(non_scaled_df, columns=non_scaled_df_columns)

***

# Feature Selection

We will look at the correlation of the features to see if we should introduce any feature reduction techniques.
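
Alongside a heatmap, it can help to list the strongly correlated pairs explicitly. The sketch below does this on synthetic data. The column names and the 0.9 cutoff are assumptions chosen for illustration, not values derived from this dataset.

```python
import numpy as np
import pandas as pd

# Synthetic features: `b` is nearly a scaled copy of `a`, `c` is independent
rng = np.random.default_rng(42)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})

corr = toy.corr().abs()

# Keep only the upper triangle so each pair is listed once and the diagonal is dropped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (feature, feature) pairs and keep those above the cutoff
pairs = upper.stack()
high_pairs = pairs[pairs > 0.9]

print(high_pairs)
```

Pairs that show up here are candidates for dropping one feature or for a dimensionality reduction step.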

In [None]:
# Correlation matrix for scaled data
correlation_matrix = scaled_df.corr()

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title("Feature Correlation Matrix (Scaled Data)")
plot_show("scaled_feature_corr_matrix")

In [None]:
# Correlation matrix for non-scaled data
correlation_matrix = non_scaled_df.corr()

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title("Feature Correlation Matrix (Non-Scaled Data)")
plot_show("non_scaled_feature_corr_matrix")

***

# Running Models with Cross Validation

**We will run the models on the train/test split derived from `adclicks3`. If those models do not perform well even after tuning and evaluation, we will go back to `adclicks2` and see whether performance improves.**
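
When several metrics are needed for the same model, scikit-learn's `cross_validate` can compute them all in a single pass over the folds instead of one cross-validation run per metric. The sketch below uses synthetic data from `make_classification` and a plain `LogisticRegression`. Both are stand-ins for illustration, not the notebook's pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic binary classification data (stand-in for the ad click features)
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = ["accuracy", "precision", "recall", "f1", "roc_auc", "neg_log_loss"]

# One fit per fold, with every metric scored on that fit
results = cross_validate(LogisticRegression(random_state=42), X, y, cv=cv, scoring=scoring)

# Average each metric across folds
mean_scores = {name: results[f"test_{name}"].mean() for name in scoring}
print(mean_scores)
```

Besides saving computation, this guarantees that all metrics come from the same fitted models on the same folds.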

In [None]:
# List of per-model metric dicts (converted to a pd DataFrame after each model)
metrics_store = list()

In [None]:
# Initializing cross validation
from sklearn.model_selection import cross_val_score, cross_val_predict, StratifiedKFold
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_auc_score, roc_curve

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)


## Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

# Logistic regression pipeline
logreg_pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor2),
    ('classifier', LogisticRegression(random_state=42))
])

# Evaluation metrics
metrics = {
    "model" : "LogisticRegression",
    "accuracy" : cross_val_score(logreg_pipeline, X3_train, y3_train, cv=cv, scoring="accuracy").mean(),
    "precision": cross_val_score(logreg_pipeline, X3_train, y3_train, cv=cv, scoring="precision").mean(),
    "recall" : cross_val_score(logreg_pipeline, X3_train, y3_train, cv=cv, scoring="recall").mean(),
    "f1" : cross_val_score(logreg_pipeline, X3_train, y3_train, cv=cv, scoring="f1").mean(),
    "auc" : cross_val_score(logreg_pipeline, X3_train, y3_train, cv=cv, scoring="roc_auc").mean(),
    "neg_log_loss" : cross_val_score(logreg_pipeline, X3_train, y3_train, cv=cv, scoring="neg_log_loss").mean()
}
metrics_store.append(metrics)

# Predict
y_pred = cross_val_predict(logreg_pipeline, X3_train, y3_train, cv=cv, method="predict")
y_prob = cross_val_predict(logreg_pipeline, X3_train, y3_train, cv=cv, method="predict_proba")[:, 1]

# Plot ROC curve
fpr, tpr, threshold = roc_curve(y3_train, y_prob)
auc = metrics_store[-1]["auc"].mean()

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color="blue", lw=2, label=f"ROC Curve (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], color='grey', linestyle='--', lw=1)  # Diagonal line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend(loc="lower right")

plt.show()


metrics = pd.DataFrame(metrics_store)

## Decision Tree

In [None]:
from sklearn import tree

# Decision Tree pipeline
decision_tree = Pipeline(steps=[
    ("preprocessor", preprocessor2),
    ('classifier', tree.DecisionTreeClassifier(random_state=42))
])

# Evaluation metrics
# Use the shared `cv` splitter so the folds match the other models
metrics = {
    "model" : "DecisionTree",
    "accuracy" : cross_val_score(decision_tree, X3_train, y3_train, cv=cv, scoring="accuracy").mean(),
    "precision": cross_val_score(decision_tree, X3_train, y3_train, cv=cv, scoring="precision").mean(),
    "recall" : cross_val_score(decision_tree, X3_train, y3_train, cv=cv, scoring="recall").mean(),
    "f1" : cross_val_score(decision_tree, X3_train, y3_train, cv=cv, scoring="f1").mean(),
    "auc" : cross_val_score(decision_tree, X3_train, y3_train, cv=cv, scoring="roc_auc").mean(),
    "neg_log_loss" : cross_val_score(decision_tree, X3_train, y3_train, cv=cv, scoring="neg_log_loss").mean()
}
metrics_store.append(metrics)

# Predict
y_pred = cross_val_predict(decision_tree, X3_train, y3_train, cv=cv, method="predict")
y_prob = cross_val_predict(decision_tree, X3_train, y3_train, cv=cv, method="predict_proba")[:, 1]

# Plot ROC curve
fpr, tpr, threshold = roc_curve(y3_train, y_prob)
auc = metrics_store[-1]["auc"].mean()

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color="blue", lw=2, label=f"ROC Curve (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], color='grey', linestyle='--', lw=1)  # Diagonal line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend(loc="lower right")

plt.show()


metrics = pd.DataFrame(metrics_store)

## Random Forest

In [2]:
from sklearn.ensemble import RandomForestClassifier


## XGBoost

In [1]:
from xgboost import XGBClassifier