# Intro

The most common measure of the effectiveness of online ads is the **click-through rate (CTR)**: the ratio of clicks on a specific ad to its total number of views. In general, the higher the CTR, the better targeted the ad is and the more successful the online advertising campaign.
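
As a quick illustration of the definition, the ratio can be computed as below. This is a minimal sketch: the function name `click_through_rate` and the numbers are made up for the example and do not come from the dataset used later.

```python
def click_through_rate(clicks: int, views: int) -> float:
    """Return clicks / views, or 0.0 when the ad was never shown."""
    if views == 0:
        return 0.0
    return clicks / views


# Hypothetical numbers: an ad shown 1,000 times and clicked 17 times.
ctr = click_through_rate(clicks=17, views=1000)
print(f"CTR: {ctr:.1%}")  # prints "CTR: 1.7%"
```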

# Imports

In [23]:
import gzip
import random
import numpy as np
import pandas as pd
import xgboost as xgb

from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import RandomForestClassifier

# Data

We will use the dataset from a Kaggle machine learning competition, Click-Through Rate Prediction (https://www.kaggle.com/c/avazu-ctr-prediction). The dataset can be downloaded from https://www.kaggle.com/c/avazu-ctr-prediction/data.

Only the `train.gz` file contains labeled samples, so it is the only file we need to download (this will take a while). The code below reads it directly with `gzip.open`, so there is no need to unzip it first, and we will focus on only the first 300,000 samples.

In [2]:
n_rows = 300000
with gzip.open('datasets/avazu-ctr-prediction/train.gz') as f:
    df = pd.read_csv(f, nrows=n_rows)

In [3]:
df.head(5)

| | id | click | hour | C1 | banner_pos | site_id | site_domain | site_category | app_id | app_domain | ... | device_type | device_conn_type | C14 | C15 | C16 | C17 | C18 | C19 | C20 | C21 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.000009e+18 | 0 | 14102100 | 1005 | 0 | 1fbe01fe | f3845767 | 28905ebd | ecad2386 | 7801e8d9 | ... | 1 | 2 | 15706 | 320 | 50 | 1722 | 0 | 35 | -1 | 79 |
| 1 | 1.000017e+19 | 0 | 14102100 | 1005 | 0 | 1fbe01fe | f3845767 | 28905ebd | ecad2386 | 7801e8d9 | ... | 1 | 0 | 15704 | 320 | 50 | 1722 | 0 | 35 | 100084 | 79 |
| 2 | 1.000037e+19 | 0 | 14102100 | 1005 | 0 | 1fbe01fe | f3845767 | 28905ebd | ecad2386 | 7801e8d9 | ... | 1 | 0 | 15704 | 320 | 50 | 1722 | 0 | 35 | 100084 | 79 |
| 3 | 1.000064e+19 | 0 | 14102100 | 1005 | 0 | 1fbe01fe | f3845767 | 28905ebd | ecad2386 | 7801e8d9 | ... | 1 | 0 | 15706 | 320 | 50 | 1722 | 0 | 35 | 100084 | 79 |
| 4 | 1.000068e+19 | 0 | 14102100 | 1005 | 1 | fe8cc448 | 9166c161 | 0569f928 | ecad2386 | 7801e8d9 | ... | 1 | 0 | 18993 | 320 | 50 | 2161 | 0 | 35 | -1 | 157 |


Note the anonymized and hashed values. These are categorical features: each hashed value corresponds to a real, meaningful value, but it is presented in this form for privacy reasons.

In [4]:
# the target variable is the click column:
Y = df['click'].values

In [5]:
# drop the target column and the columns we will not use as features
X = df.drop(['click', 'id', 'hour', 'device_id', 'device_ip'], axis=1).values

The samples are in chronological order, as indicated by the `hour` field. Since we cannot use future samples to predict past ones, we take the first 90% as training samples and the rest as testing samples:

In [6]:
n_train = int(n_rows * 0.9)
X_train, Y_train = X[:n_train], Y[:n_train]
X_test, Y_test = X[n_train:], Y[n_train:]

# Predicting CTR with a Decision Tree

Tree-based algorithms in scikit-learn require categorical features to be encoded as numerical values before they can be used as input to the algorithms.

There are a few common approaches to encoding categorical features for tree-based algorithms:

1. **Label Encoding**: In this approach, each unique category is assigned an integer label. For example, "red" could be encoded as 0, "blue" as 1, and "green" as 2. Scikit-learn provides the `LabelEncoder` class for this purpose, although it is designed for target labels and works on a single one-dimensional column at a time.

2. **One-Hot Encoding**: This approach creates one binary column per category, indicating the presence or absence of that category in the original feature. For example, if the original feature had three categories ("red", "blue", "green"), then three binary columns would be created, with values of 1 or 0 indicating the presence or absence of each category. Scikit-learn provides the `OneHotEncoder` class for this, and pandas provides the `get_dummies()` function.

3. **Ordinal Encoding**: This approach assigns numerical values to categories based on their order or rank. For example, if the categories are "low," "medium," and "high," they could be encoded as 0, 1, and 2, respectively. Scikit-learn provides the `OrdinalEncoder` class for this (its `categories` argument lets you specify the order explicitly), and third-party libraries such as the `category_encoders` package offer alternatives.

Once the categorical features are encoded as numerical values, you can use them as input to tree-based algorithms in scikit-learn, such as decision trees, random forests, or gradient boosting models.
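
To make the difference between the first two encodings concrete, here is a minimal sketch on a toy color column. The data is made up for illustration and is not part of the Avazu dataset. Note that scikit-learn sorts categories before assigning codes, so the integers differ from the "red = 0" example above.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Toy feature values (hypothetical, for illustration only).
colors = np.array(["red", "blue", "green", "blue"])

# Label encoding: categories are sorted first, so
# blue -> 0, green -> 1, red -> 2.
label_enc = LabelEncoder()
codes = label_enc.fit_transform(colors)
print(codes)  # [2 0 1 0]

# One-hot encoding expects a 2-D array with one column per feature.
one_hot_enc = OneHotEncoder()
matrix = one_hot_enc.fit_transform(colors.reshape(-1, 1)).toarray()
print(one_hot_enc.categories_[0])  # ['blue' 'green' 'red']
print(matrix)  # one row per sample, one column per category
```

The label-encoded column implies an ordering (blue < green < red) that does not exist in the data, whereas the one-hot matrix makes no such assumption, at the cost of more columns.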

We will now transform the string-based categorical features into one-hot encoded vectors using the `OneHotEncoder` class from scikit-learn.

In the following line of code, the `OneHotEncoder` class is instantiated with the `handle_unknown='ignore'` parameter. This parameter specifies how to handle unknown categories that are encountered during encoding. With `'ignore'`, a category found in the test set that was not seen during training is encoded as all zeros for that feature instead of raising an error.

In [7]:
enc = OneHotEncoder(handle_unknown='ignore')
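
The effect of `handle_unknown='ignore'` can be seen on a small example. This is a sketch with made-up device names (`demo_enc`, `train`, and `test` are hypothetical and separate from the notebook's variables):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single-feature data; "tv" never appears in the training set.
train = np.array([["phone"], ["tablet"]])
test = np.array([["phone"], ["tv"]])

demo_enc = OneHotEncoder(handle_unknown="ignore")
demo_enc.fit(train)

# The known category gets its usual one-hot row; the unseen one becomes all zeros.
encoded = demo_enc.transform(test).toarray()
print(encoded)
# [[1. 0.]
#  [0. 0.]]
```

With the default `handle_unknown='error'`, the `transform` call on `test` would raise a `ValueError` instead.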

Now we fit the encoder on the training data, transform the training data into its encoded form, and then use the learned encoding to transform the test data.

In [8]:
X_train_enc = enc.fit_transform(X_train)
X_test_enc = enc.transform(X_test)

In [9]:
X_train_enc[0]

<1x8204 sparse matrix of type '<class 'numpy.float64'>'
	with 19 stored elements in Compressed Sparse Row format>

We will train a decision tree model using grid search. 

We set `scoring='roc_auc'` because this is an imbalanced binary classification problem: only 51,211 of the 300,000 samples are clicks, a positive rate (CTR) of about 17%:

In [10]:
len(df[df.click == 1])/n_rows

0.17070333333333335
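
To see why accuracy would be misleading at this class balance, consider the following sketch. It uses synthetic labels with the same 17% positive rate (not the real data) and a degenerate model that always predicts "no click":

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic labels: 17 clicks and 83 non-clicks (illustration only).
y_true = np.array([1] * 17 + [0] * 83)

# A degenerate model that scores every sample identically and never predicts a click.
constant_scores = np.zeros(len(y_true))
constant_labels = np.zeros(len(y_true), dtype=int)

acc = accuracy_score(y_true, constant_labels)
auc = roc_auc_score(y_true, constant_scores)
print(f"accuracy: {acc:.2f}, ROC AUC: {auc:.2f}")  # accuracy: 0.83, ROC AUC: 0.50
```

The useless model reaches 83% accuracy simply by following the majority class, while its ROC AUC of 0.5 correctly reports that it cannot rank clicks above non-clicks.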

In [11]:
parameters = {'max_depth': [3, 10, None]}
decision_tree = DecisionTreeClassifier(criterion='gini',
                                       min_samples_split=30)

In [12]:
grid_search = GridSearchCV(decision_tree,
                           parameters,
                           n_jobs=-1,  # use all of the available CPU processors
                           cv=3,  # three-fold cross validation
                           scoring='roc_auc')

In [13]:
grid_search.fit(X_train_enc, Y_train)
print(grid_search.best_params_)

{'max_depth': 10}


In [14]:
decision_tree_best = grid_search.best_estimator_

In [15]:
pos_prob = decision_tree_best.predict_proba(X_test_enc)[:, 1]

In [16]:
print(f'The ROC AUC on testing set is: {roc_auc_score(Y_test, pos_prob):.3f}')

The ROC AUC on testing set is: 0.719


A ROC AUC of 0.719 on the testing set does not seem very high, but click-through involves many intricate human factors, which makes it hard to predict. Although we could optimize the hyperparameters further, an AUC of 0.72 is actually pretty good. For comparison, randomly marking 17% of the test samples as clicked gives an AUC of about 0.5 (0.496 in the run below):

In [17]:
percent = len(df[df.click == 1])/n_rows
pos_prob = np.zeros(len(Y_test))

np.random.seed(42) # seed NumPy's generator, which np.random.choice draws from
click_index = np.random.choice(len(Y_test), int(len(Y_test) *  percent), 
                               replace=False)
pos_prob[click_index] = 1

print(f'The ROC AUC on testing set is: {roc_auc_score(Y_test, pos_prob):.3f}')

The ROC AUC on testing set is: 0.496


# Predicting CTR with a Random Forest

In [18]:
random_forest = RandomForestClassifier(n_estimators=100,
                                       criterion='gini',
                                       min_samples_split=30,
                                       n_jobs=-1)

In [19]:
# fine tune the max_depth hyperparameter
grid_search = GridSearchCV(random_forest, 
                           parameters,
                           n_jobs=-1, 
                           cv=3, 
                           scoring='roc_auc')

In [20]:
grid_search.fit(X_train_enc, Y_train)
print(grid_search.best_params_)
print(grid_search.best_score_)

{'max_depth': None}
0.7355880183103466


In [21]:
random_forest_best = grid_search.best_estimator_
pos_prob = random_forest_best.predict_proba(X_test_enc)[:, 1]
print(f'The ROC AUC on testing set is: {roc_auc_score(Y_test, pos_prob):.3f}')

The ROC AUC on testing set is: 0.759


The random forest model improves on the single decision tree, raising the ROC AUC on the testing set from 0.719 to 0.759.

# Predicting CTR with Gradient Boosting

In gradient boosted trees (GBT), also called gradient boosting machines, individual trees are trained in succession, and each new tree aims to correct the errors made by the trees trained before it.
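
The idea of trees correcting earlier errors can be sketched in a few lines. The example below is a deliberately simplified illustration for squared-error regression on synthetic data. It is not how XGBoost is implemented (XGBoost uses a logistic loss for classification, second-order gradient information, and regularization), and all names in it are made up for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem (illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0])

learning_rate = 0.1
prediction = np.full(len(y), y.mean())  # start from a constant model
initial_mse = np.mean((y - prediction) ** 2)

trees = []
for _ in range(50):
    # Residuals are the errors of the ensemble built so far.
    residual = y - prediction
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residual)
    # Add a damped version of the new tree's correction to the ensemble.
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

final_mse = np.mean((y - prediction) ** 2)
print(f"training MSE before boosting: {initial_mse:.4f}, after: {final_mse:.4f}")
```

Each tree is shallow and weak on its own; the ensemble improves because every round fits what the previous rounds still get wrong. The `learning_rate` plays the same role as the parameter of that name in `XGBClassifier`.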

We will use the XGBoost package (https://xgboost.readthedocs.io/en/latest/) to implement GBT. Check [XGBClassifier](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBClassifier) to see what other hyperparameters we can tweak.

If XGBoost is not installed yet, we first install its Python package via the following command:

In [22]:
# pip install xgboost

In [26]:
model = xgb.XGBClassifier(learning_rate=0.1,
                          max_depth=10,
                          n_estimators=1000)

In [27]:
model.fit(X_train_enc, Y_train)

In [28]:
pos_prob = model.predict_proba(X_test_enc)[:, 1]
print(f'The ROC AUC on testing set is: {roc_auc_score(Y_test, pos_prob):.3f}')

The ROC AUC on testing set is: 0.771


The XGBoost GBT model achieves a ROC AUC of 0.771 on the testing set, the highest of the three tree-based models tried so far.

# Predicting CTR with Logistic Regression