# Hi! Welcome to the Fwumious Wabbit workshop

### Before you start, some prerequisites:

This workshop was built and tested on Linux and macOS. No guarantees for other operating systems.
If you run into issues, or some of the instructions are outdated, feel free to contact me at ykarni@outbrain.com

You'll need to have [Python 3](https://www.python.org/downloads/) installed,
and up-to-date Rust tools (rustc, cargo).

If you don't have Rust, we recommend installing it with [rustup](https://rustup.rs/).

Create a designated work dir for the workshop.

Download the fwumious wabbit code, and build it:

(Make sure to follow these instructions starting from the directory where you run Jupyter Notebook from,
or use another directory and just copy the fw binary into the notebook directory so that it's available.)

```bash
git clone https://github.com/outbrain/fwumious_wabbit.git
cd fwumious_wabbit
cargo build --release
cp target/release/fw .. # if you didn't start from the desired work dir, replace .. with your work dir
cd ..
```

### If you followed these instructions carefully, fwumious wabbit is now ready to run:

In [None]:
!./fw --help

### Downloading the dataset
Hopefully you have already downloaded the dataset files from Google Drive:

https://drive.google.com/drive/folders/1uNpus6CehoamstYh-JFBE_cwbJ-JLizM?usp=sharing

### Review your working directory

In [None]:
!ls -lh

### Let's have a glance at our dataset

The dataset is split into train and dev (cross validation), roughly an 80:20 split,

with train.fw.gz containing 69,713,384 records, and dev.fw.gz 17,428,347 records.
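
As a quick sanity check on the numbers above, here is a minimal sketch that recomputes the split from the two record counts quoted in the text. The variable names are mine and nothing here reads the actual files.

```python
# Record counts quoted in the text above.
train_records = 69_713_384
dev_records = 17_428_347

total_records = train_records + dev_records
train_share = train_records / total_records
dev_share = dev_records / total_records

print(f"total records: {total_records:,}")
print(f"train share:   {train_share:.4f}")
print(f"dev share:     {dev_share:.4f}")
```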

Let's examine a single record:

In [None]:
!tail -n 1 sample.fw

Let's use the namespace map file to understand better what we see:

In [None]:
!cat vw_namespace_map.csv
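
If reading the raw record by eye gets tedious, the lookup can be sketched in a few lines of Python. This is only an illustration under assumptions: I assume the map file is a CSV whose first two columns are the namespace letter and its verbose name, and that a record follows the Vowpal Wabbit text layout (`label |namespace features ...`). The map rows and the record below are made up, and `load_namespace_map` and `parse_record` are hypothetical helpers, not part of fw_util.

```python
import csv
import io

# Made-up stand-ins for vw_namespace_map.csv and one line of sample.fw.
namespace_map_csv = "A,uuid\nB,platform\nC,ad_id\n"
record = "-1 |A 1fd5aeb |B 2 |C 70009"


def load_namespace_map(text):
    """Map namespace letter -> verbose name, using the first two CSV columns."""
    return {row[0]: row[1] for row in csv.reader(io.StringIO(text)) if row}


def parse_record(line, names):
    """Split a VW-style line into its label and a {verbose name: features} dict."""
    label, *sections = line.split("|")
    features = {}
    for section in sections:
        tokens = section.split()
        if not tokens:
            continue
        namespace, values = tokens[0], tokens[1:]
        # Fall back to the raw letter when the map has no entry for it.
        features[names.get(namespace, namespace)] = values
    return label.strip(), features


names = load_namespace_map(namespace_map_csv)
label, features = parse_record(record, names)
print("label:", label)
for name, values in features.items():
    print(f"{name}: {values}")
```

To try it on the real files, replace the two made-up strings with the contents of vw_namespace_map.csv and a line from sample.fw, and adjust the parsing if the actual layout differs.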

### Great! Now let's give fwumious wabbit a test drive

We'll start by training a simple logistic regression model:

In [None]:
from fw_util import train_loop

max_iterations = 20
print_intermediate_loss = True

In [None]:
common_args_str = " ".join(["--cache", \
    "--linear uuid --linear platform --linear geo_location --linear traffic_source --linear document_id", \
    "--linear source_id --linear publisher_id --linear categories --linear ad_id --linear campaign_id --linear advertiser_id", \
    "--linear ad_document_id --linear ad_source_id --linear ad_publisher_id --linear ad_categories --linear user_categories"])

optimization_params = "--adaptive --sgd"

model_name = "logistic.1"

In [None]:
iterations = train_loop(common_args_str, optimization_params, model_name, max_iterations, print_intermediate_loss)

### Optional step: kaggle submission
Let's see how we would fare on the Outbrain click prediction Kaggle competition with this very basic model.

In [None]:
from create_submission_file import create_submission_file
from fw_util import create_model_and_predict_for_test_set

create_model_and_predict_for_test_set(common_args_str, optimization_params, model_name, iterations)
create_submission_file("logistic.1.test_preds", "logistic.1.submission.csv")

We'll drag the output file 'logistic.1.submission.csv' to the target in the Outbrain Kaggle competition "Late Submission" form, which you can find here: https://www.kaggle.com/c/outbrain-click-prediction/data

Click "Upload" to get the results shortly,

then use the Leaderboard to see where this result would place us.

**We scored 0.64318, which would put us at 265th place out of 978. Can we do better?**

### Let's try some meta-parameter search:
Try tweaking the learning rate ("-l 0.5") and adagrad smoothing ("--power_t 0.5") command line arguments.
See if you can get better results just by changing them.
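
One way to organize the search is to generate the candidate argument strings up front. The sketch below only builds strings from the flags already used in this notebook (`--adaptive`, `--sgd`, `--power_t`, `-l`); the value grids and the `logistic.grid.N` model names are arbitrary choices of mine, and nothing is trained here.

```python
from itertools import product

# Arbitrary example grids; widen or narrow them as you see fit.
learning_rates = [0.5, 0.1, 0.01]
power_ts = [0.5, 0.35, 0.2]

candidates = [
    f"--adaptive --sgd --power_t {power_t} -l {learning_rate}"
    for learning_rate, power_t in product(learning_rates, power_ts)
]

for index, params in enumerate(candidates, start=1):
    # Hypothetical model names, one per candidate.
    print(f"logistic.grid.{index}: {params}")
```

In the notebook, each string could be passed to train_loop in place of optimization_params, with a distinct model_name per run so the results don't overwrite each other.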

Succeeded? Great, me too! Here's what I got just by trying out a few values:

In [None]:
common_args_str = " ".join(["--cache", \
    "--linear uuid --linear platform --linear geo_location --linear traffic_source", \
    "--linear document_id --linear source_id --linear publisher_id --linear categories", \
    "--linear ad_id --linear campaign_id --linear advertiser_id", \
    "--linear ad_document_id --linear ad_source_id --linear ad_publisher_id", \
    "--linear ad_categories --linear user_categories"])

optimization_params = "--adaptive --sgd --power_t 0.2 -l 0.01"

model_name = "logistic.2"

In [None]:
iterations = train_loop(common_args_str, optimization_params, model_name, max_iterations, print_intermediate_loss)

### That was nice! Time for some namespace-mixing action
The big guns! Let's try out different feature combinations using the "--linear namespace_a,namespace_b" command line argument.

Go over the namespace list and try to make an educated guess.
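
To see how large the search space is before guessing, here is a small sketch that enumerates every pairwise combination of the namespaces used in the cells above and formats each one as a `--linear a,b` argument. It only builds strings; which pairs are worth training is still up to you.

```python
from itertools import combinations

# The namespaces used in the --linear arguments above.
namespaces = [
    "uuid", "platform", "geo_location", "traffic_source",
    "document_id", "source_id", "publisher_id", "categories",
    "ad_id", "campaign_id", "advertiser_id",
    "ad_document_id", "ad_source_id", "ad_publisher_id",
    "ad_categories", "user_categories",
]

pair_args = [f"--linear {a},{b}" for a, b in combinations(namespaces, 2)]

print(f"{len(namespaces)} namespaces give {len(pair_args)} possible pairs, for example:")
for arg in pair_args[:5]:
    print(" ", arg)
```

Trying all of them is expensive at 20 iterations each, which is why an educated guess is a reasonable starting point.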

How did it go? After some failures, I guessed that combining the publisher_id and advertiser_id might help, and also combining user categories and ad categories, and it did!

In [None]:
common_args_str = " ".join(["--cache", \
    "--linear publisher_id,advertiser_id --linear ad_categories,user_categories", \
    "--linear uuid --linear platform --linear geo_location --linear traffic_source", \
    "--linear document_id --linear source_id --linear publisher_id --linear categories", \
    "--linear ad_id --linear campaign_id --linear advertiser_id", \
    "--linear ad_document_id --linear ad_source_id --linear ad_publisher_id", \
    "--linear ad_categories --linear user_categories"])

optimization_params = "--adaptive --sgd --power_t 0.2 -l 0.01"

model_name = "logistic.3"

In [None]:
iterations = train_loop(common_args_str, optimization_params, model_name, max_iterations, print_intermediate_loss)

### Not bad! But we can do better.

Do we have collisions? Try tweaking the hash space size, using the --bit_precision (or -b) command line argument:

In [None]:
common_args_str = " ".join(["--cache -b 25", \
    "--linear publisher_id,advertiser_id --linear ad_categories,user_categories", \
    "--linear uuid --linear platform --linear geo_location --linear traffic_source", \
    "--linear document_id --linear source_id --linear publisher_id --linear categories", \
    "--linear ad_id --linear campaign_id --linear advertiser_id", \
    "--linear ad_document_id --linear ad_source_id --linear ad_publisher_id", \
    "--linear ad_categories --linear user_categories"])

optimization_params = "--adaptive --sgd --power_t 0.2 -l 0.01"

model_name = "logistic.4"

In [None]:
iterations = train_loop(common_args_str, optimization_params, model_name, max_iterations, print_intermediate_loss)

### Nice, but it doesn't come free. We pay with model size:

In [None]:
!ls -lh model.* | awk -F " " '{print $5", "$9}'
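
Both sides of this trade-off can be roughed out with a back-of-the-envelope calculation. The sketch below assumes a uniform hash into 2^b buckets and estimates what share of distinct features would land in an already occupied bucket. The feature count of 10 million is a made-up figure, not measured from this dataset, and the size estimate assumes one 4-byte float per bucket, which ignores optimizer state and anything else fw stores in the model file.

```python
import math


def expected_collision_share(distinct_features, bits):
    """Expected share of features that collide, for a uniform hash into 2**bits buckets."""
    buckets = 2 ** bits
    # Expected number of occupied buckets: m * (1 - (1 - 1/m)**n), computed stably.
    occupied = -buckets * math.expm1(distinct_features * math.log1p(-1 / buckets))
    return 1 - occupied / distinct_features


# Hypothetical number of distinct hashed features.
distinct_features = 10_000_000

for bits in (18, 22, 25):
    share = expected_collision_share(distinct_features, bits)
    # Assumption: one 4-byte float per bucket.
    size_mib = (2 ** bits) * 4 / 2 ** 20
    print(f"-b {bits}: {2 ** bits:>10,} buckets, "
          f"~{share:.1%} colliding, ~{size_mib:,.0f} MiB of weights")
```

Each extra bit doubles the table, so collisions fall and the model file grows together.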

### So, where did logistic regression take us so far?
Time for another kaggle submission:

In [None]:
from create_submission_file import create_submission_file
from fw_util import create_model_and_predict_for_test_set

create_model_and_predict_for_test_set(common_args_str, optimization_params, model_name, iterations)
create_submission_file("logistic.4.test_preds", "logistic.4.submission.csv")

**We scored 0.65563, which would place us at 166th place out of 978.** A nice improvement of 99 places - logistic regression with feature combinations can go a long way for our use case.

### Sweet! But we want to see some FFM action please...
Let's try to go all-in, and have a field for each namespace:


In [None]:
common_args_str = " ".join(["--cache -b 25 --ffm_k 2 --ffm_bit_precision 25", \
    "--linear publisher_id,advertiser_id --linear ad_categories,user_categories", \
    "--linear uuid --linear platform --linear geo_location --linear traffic_source", \
    "--linear document_id --linear source_id --linear publisher_id --linear categories", \
    "--linear ad_id --linear campaign_id --linear advertiser_id", \
    "--linear ad_document_id --linear ad_source_id --linear ad_publisher_id", \
    "--linear ad_categories --linear user_categories", \
    "--ffm_field_verbose uuid --ffm_field_verbose platform --ffm_field_verbose geo_location", \
    "--ffm_field_verbose traffic_source --ffm_field_verbose document_id", \
    "--ffm_field_verbose source_id --ffm_field_verbose publisher_id", \
    "--ffm_field_verbose categories --ffm_field_verbose ad_id --ffm_field_verbose campaign_id", \
    "--ffm_field_verbose advertiser_id --ffm_field_verbose ad_document_id", \
    "--ffm_field_verbose ad_source_id --ffm_field_verbose ad_publisher_id", \
    "--ffm_field_verbose ad_categories --ffm_field_verbose user_categories"])

optimization_params = "--adaptive --sgd --power_t 0.2 --ffm_power_t 0.2 -l 0.01 --ffm_learning_rate 0.01"

model_name = "ffm.1"

In [None]:
iterations = train_loop(common_args_str, optimization_params, model_name, max_iterations, print_intermediate_loss)
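
The argument list above spells out every namespace twice, once for the linear part and once as an FFM field. If you plan to experiment with different field layouts, a small helper can generate the string instead. `build_args` below is a hypothetical helper of mine, not part of fw_util; it only uses flags that already appear in this notebook.

```python
def build_args(namespaces, pairs=(), ffm_fields=(), prefix="--cache"):
    """Assemble an fw argument string from namespace names.

    pairs and ffm_fields are sequences of tuples; the names in each tuple
    are joined with commas into a single argument.
    """
    parts = [prefix]
    parts += [f"--linear {','.join(pair)}" for pair in pairs]
    parts += [f"--linear {name}" for name in namespaces]
    parts += [f"--ffm_field_verbose {','.join(field)}" for field in ffm_fields]
    return " ".join(parts)


namespaces = [
    "uuid", "platform", "geo_location", "traffic_source",
    "document_id", "source_id", "publisher_id", "categories",
    "ad_id", "campaign_id", "advertiser_id",
    "ad_document_id", "ad_source_id", "ad_publisher_id",
    "ad_categories", "user_categories",
]

args = build_args(
    namespaces,
    pairs=[("publisher_id", "advertiser_id"), ("ad_categories", "user_categories")],
    ffm_fields=[(name,) for name in namespaces],  # one field per namespace
    prefix="--cache -b 25 --ffm_k 2 --ffm_bit_precision 25",
)
print(args)
```

Passing multi-name tuples in ffm_fields produces grouped fields such as `--ffm_field_verbose uuid,platform,document_id`.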

### FFM models are even bigger:

In [None]:
!ls -lh model.* | awk -F " " '{print $5", "$9}'

### Can we improve further by tweaking the meta parameters?
We sure can, BUT I will leave most of the tweaks for you to experiment with. The only change here is increasing ffm_k (the latent vector length) from 2 to 4 - but consider more tweaks:
* Divide the fields into smaller groups, for example '--ffm_field_verbose uuid,platform,document_id --ffm_field_verbose ad_categories,categories,user_categories'
* Get rid of features in the linear part if they don't help (--interaction blah)
* You can still add more feature combinations!
* Tweak ffm_power_t, ffm_learning_rate for the optimization process
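
To make the role of ffm_k and of the field grouping more concrete, here is a minimal sketch of the textbook FFM interaction term: every feature keeps one latent vector of length k per field, and each pair of active features contributes the dot product of the vectors they hold for each other's field. This illustrates the general model only; it is not Fwumious Wabbit's implementation, and the feature names and weights are made up.

```python
from itertools import combinations


def ffm_interaction(active, latent):
    """Sum of pairwise FFM terms for one example with binary features.

    active: list of (feature, field) pairs present in the example.
    latent: latent[feature][other_field] -> list of k floats.
    """
    total = 0.0
    for (feat_a, field_a), (feat_b, field_b) in combinations(active, 2):
        vec_a = latent[feat_a][field_b]  # a's vector for b's field
        vec_b = latent[feat_b][field_a]  # b's vector for a's field
        total += sum(x * y for x, y in zip(vec_a, vec_b))
    return total


# Made-up example with k = 2 and two active features.
active = [("publisher_7", "publisher_id"), ("ad_42", "ad_id")]
latent = {
    "publisher_7": {"ad_id": [0.5, -1.0]},
    "ad_42": {"publisher_id": [2.0, 0.5]},
}
print("interaction term:", ffm_interaction(active, latent))

# In this formulation the latent weights per feature grow with fields * k.
for fields, k in ((16, 2), (16, 4), (4, 4)):
    print(f"{fields} fields, k={k}: {fields * k} latent weights per feature")
```

Raising k adds capacity per pair, while merging namespaces into fewer fields shrinks the number of vectors each feature carries.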


In [None]:
common_args_str = " ".join(["--cache -b 25 --ffm_k 4 --ffm_bit_precision 25", \
    "--linear publisher_id,advertiser_id --linear ad_categories,user_categories", \
    "--linear uuid --linear platform --linear geo_location --linear traffic_source", \
    "--linear document_id --linear source_id --linear publisher_id --linear categories", \
    "--linear ad_id --linear campaign_id --linear advertiser_id", \
    "--linear ad_document_id --linear ad_source_id --linear ad_publisher_id", \
    "--linear ad_categories --linear user_categories", \
    "--ffm_field_verbose uuid --ffm_field_verbose platform --ffm_field_verbose geo_location", \
    "--ffm_field_verbose traffic_source --ffm_field_verbose document_id", \
    "--ffm_field_verbose source_id --ffm_field_verbose publisher_id", \
    "--ffm_field_verbose categories --ffm_field_verbose ad_id --ffm_field_verbose campaign_id", \
    "--ffm_field_verbose advertiser_id --ffm_field_verbose ad_document_id", \
    "--ffm_field_verbose ad_source_id --ffm_field_verbose ad_publisher_id", \
    "--ffm_field_verbose ad_categories --ffm_field_verbose user_categories"])

optimization_params = "--adaptive --sgd --power_t 0.2 --ffm_power_t 0.2 -l 0.01 --ffm_learning_rate 0.01"

model_name = "ffm.2"

In [None]:
iterations = train_loop(common_args_str, optimization_params, model_name, max_iterations, print_intermediate_loss)

### Fresh from the oven: feature binning for numerical features
We haven't touched the numerical feature "user_page_views" yet. In the presentation we saw the numerical feature binning capability in Fwumious Wabbit - let's try it out.

The user_page_views feature holds the count of user page views before the display event of the recommendation candidate.
A user may have seen 0, 1, 4, 12, 30 or any old number in between or a bit above that.

We'll use BinnerSqrt with a max value of 10, and resolution=1 - you can tweak those numbers to see if you can get better results.
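
For intuition about why a square-root binner suits a count like this, here is a generic sketch of square-root binning. It is not Fwumious Wabbit's implementation, and the exact meaning of the two BinnerSqrt parameters in fw may differ: here I simply assume the bin is floor(sqrt(value) * resolution), capped at a maximum bin index. `sqrt_bin` is a hypothetical function of mine.

```python
import math


def sqrt_bin(value, max_bin=10, resolution=1.0):
    """Generic square-root binning: floor(sqrt(value) * resolution), capped at max_bin."""
    if value < 0:
        raise ValueError("page view counts cannot be negative")
    return min(int(math.sqrt(value) * resolution), max_bin)


for views in (0, 1, 4, 12, 30, 500):
    print(f"{views:>3} page views -> bin {sqrt_bin(views)}")
```

Small counts get their own bins while large counts share bins, so the model sees a handful of categorical values instead of an unbounded number.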

After defining the new feature (--transform), we can use it either in the linear part, alone or as part of an interaction - or as we do here: as a new field.

Let's see if it will improve our model:

In [None]:
common_args_str = " ".join(["--cache -b 25 --ffm_k 4 --ffm_bit_precision 25", \
    "--transform page_views_sqrt=BinnerSqrt(user_page_views)(10,1)", \
    "--linear publisher_id,advertiser_id --linear ad_categories,user_categories", \
    "--linear uuid --linear platform --linear geo_location --linear traffic_source", \
    "--linear document_id --linear source_id --linear publisher_id --linear categories", \
    "--linear ad_id --linear campaign_id --linear advertiser_id", \
    "--linear ad_document_id --linear ad_source_id --linear ad_publisher_id", \
    "--linear ad_categories --linear user_categories", \
    "--ffm_field_verbose uuid --ffm_field_verbose platform --ffm_field_verbose geo_location", \
    "--ffm_field_verbose traffic_source --ffm_field_verbose document_id", \
    "--ffm_field_verbose source_id --ffm_field_verbose publisher_id", \
    "--ffm_field_verbose categories --ffm_field_verbose ad_id --ffm_field_verbose campaign_id", \
    "--ffm_field_verbose advertiser_id --ffm_field_verbose ad_document_id", \
    "--ffm_field_verbose ad_source_id --ffm_field_verbose ad_publisher_id", \
    "--ffm_field_verbose ad_categories --ffm_field_verbose user_categories --ffm_field_verbose page_views_sqrt"])

optimization_params = "--adaptive --sgd --power_t 0.2 --ffm_power_t 0.2 -l 0.01 --ffm_learning_rate 0.01"

model_name = "ffm.3"

In [None]:
iterations = train_loop(common_args_str, optimization_params, model_name, max_iterations, print_intermediate_loss)

### To wrap things up, let's see where we are now on Kaggle:

In [None]:
from create_submission_file import create_submission_file
from fw_util import create_model_and_predict_for_test_set

create_model_and_predict_for_test_set(common_args_str, optimization_params, model_name, iterations)

print("creating submission file from predictions")
create_submission_file("ffm.3.test_preds", "ffm.3.submission.csv")
print("all done! good luck.")

We scored 0.66619, which would place us in 106th place - 60 places up.

Can you do better? If you want to play in the big league you'll probably need to do some more work on the dataset though.

GOOD LUCK!