<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>

<i>Licensed under the MIT License.</i>

# Vowpal Wabbit Deep Dive

<center>
<img src="https://github.com/VowpalWabbit/vowpal_wabbit/blob/master/logo_assets/vowpal-wabbits-github-logo.png?raw=true" height="30%" width="30%" alt="Vowpal Wabbit">
</center>

[Vowpal Wabbit](https://github.com/VowpalWabbit/vowpal_wabbit) is a fast online machine learning library that implements several algorithms relevant to the recommendation use case.

The main advantage of Vowpal Wabbit (VW) is that training is done in an online fashion typically using Stochastic Gradient Descent or similar variants, which allows it to scale well to very large datasets. Additionally, it is optimized to run very quickly and can support distributed training scenarios for extremely large datasets.

In this notebook we demonstrate how to use the VW library to generate recommendations on the [MovieLens](https://grouplens.org/datasets/movielens/) dataset.

Several things are worth noting about how VW is used in this notebook. On an Azure Data Science Virtual Machine ([DSVM](https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/)), VW comes pre-installed and can be run directly from the command line. Python bindings are also available, including a wrapper that conforms to the scikit-learn Estimator API. However, the Python bindings must be installed as an additional package with Boost dependencies, so for simplicity VW is executed here through a subprocess call that mimics running the model from the command line.
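
As a rough illustration of the subprocess approach, the sketch below turns a command string into the argument list that `subprocess.run` expects. It uses `shlex.split`, an alternative to the plain `str.split(' ')` used in this notebook, because it keeps quoted paths containing spaces intact. The file names are placeholders, and `vw` itself is not invoked here.

```python
import shlex

# Placeholder paths; the model path deliberately contains a space.
command = 'vw -f "my models/vw.model" -d train.dat'

# shlex.split follows shell quoting rules, so the quoted path stays one argument.
args = shlex.split(command)
print(args)

# With the VW executable on the PATH, this list could be passed to
# subprocess.run(args, check=True).
```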

VW expects a specific [input format](https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Input-format), and `to_vw()` below is a convenience function that converts the standard MovieLens dataset into that format. The data files are then written to disk and passed to VW for training.

The examples shown demonstrate functional capabilities; they are not meant to indicate performance advantages of one approach over another. Several hyperparameters (e.g. learning rate and regularization terms) can greatly impact the performance of VW models and can be adjusted using [command line options](https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Command-Line-Arguments). To compare approaches properly, it is helpful to learn about and tune these parameters for production workloads.

<h3>Environment Setup</h3>

In [6]:
import os
import sys
from subprocess import run
from tempfile import TemporaryDirectory
from time import process_time

import pandas as pd
import papermill as pm

from reco_utils.dataset.movielens import load_pandas_df
from reco_utils.dataset.python_splitters import python_stratified_split
from reco_utils.evaluation.python_evaluation import rmse, mae, exp_var, rsquared, get_top_k_items

print("System version: {}".format(sys.version))
print("Pandas version: {}".format(pd.__version__))

System version: 3.6.7 |Anaconda, Inc.| (default, Oct 23 2018, 19:16:44) 
[GCC 7.3.0]
Pandas version: 0.23.4


In [7]:
def to_vw(df, output, logistic=False):
    """Convert Pandas DataFrame to vw input format
    Args:
        df (pd.DataFrame): input DataFrame
        output (str): path to output file
        logistic (bool): flag to convert label to logistic value
    """
    with open(output, 'w') as f:
        tmp = df.reset_index()

        # we need to reset the rating type to an integer to simplify the vw formatting
        tmp['rating'] = tmp['rating'].astype('int64')
        
        # convert rating to binary value
        if logistic:
            tmp['rating'] = tmp['rating'].apply(lambda x: 1 if x >= 3 else -1)
        
        # convert each row to VW input format
        # [label] [tag]|[user namespace] [user id feature] |[item namespace] [movie id feature]
        # label is the true rating, tag is a unique id for the example just used to link predictions to truth
        # user and item namespaces separate the features to support interaction features through command line options
        for _, row in tmp.iterrows():
            f.write('{rating} {index}|user {userID} |item {itemID}\n'.format_map(row))
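
To make the input format concrete, here is a minimal sketch that builds the same kind of line as `to_vw` without pandas. The helper name `format_vw_example` and the sample ratings are made up for illustration.

```python
def format_vw_example(rating, tag, user_id, item_id):
    """Build one VW input line: label, tag, then a user and an item namespace."""
    return "{} {}|user {} |item {}".format(rating, tag, user_id, item_id)

# Hypothetical ratings: (rating, row index used as tag, userID, itemID)
rows = [(5, 0, 1, 1193), (3, 1, 1, 661)]
lines = [format_vw_example(*row) for row in rows]
for line in lines:
    print(line)
# 5 0|user 1 |item 1193
# 3 1|user 1 |item 661
```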

In [12]:
def run_vw(train_params, test_params, test_data, prediction_path, logistic=False):
    """Convenience function to train, test, and show metrics of interest
    Args:
        train_params (str): vw training parameters
        test_params (str): vw testing parameters
        test_data (pd.DataFrame): test data
        prediction_path (str): path to vw prediction output
        logistic (bool): flag to convert label to logistic value
    Returns:
        (dict): metrics and timing information
    """

    # train model
    train_start = process_time()
    run(train_params.split(' '), check=True)
    train_stop = process_time()
    
    # test model
    test_start = process_time()
    run(test_params.split(' '), check=True)
    test_stop = process_time()
    
    # read in predictions
    pred_df = pd.read_csv(prediction_path, delim_whitespace=True, names=['prediction'], index_col=1).join(test_data)
    
    test_df = test_data.copy()
    if logistic:
        # make the true label binary so that the metrics are captured correctly
        test_df['rating'] = test_df['rating'].apply(lambda x: 1 if x >= 3 else -1)
    else:
        # ensure results are integers in correct range
        pred_df['prediction'] = pred_df['prediction'].apply(lambda x: int(max(1, min(5, round(x)))))

    # calculate metrics
    result = dict()
    result['RMSE'] = rmse(test_df, pred_df)
    result['MAE'] = mae(test_df, pred_df)
    result['R2'] = rsquared(test_df, pred_df)
    result['Explained Variance'] = exp_var(test_df, pred_df)
    result['Train Time (ms)'] = (train_stop - train_start) * 1000
    result['Test Time (ms)'] = (test_stop - test_start) * 1000
    
    return result
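
One caveat when reading the timings from `run_vw`: `process_time` measures the CPU time of the calling Python process only, so work done inside the `vw` child process is not counted. The sketch below is not part of the original notebook. It contrasts `process_time` with `perf_counter` (wall-clock time), using a sleeping child process as a stand-in for a VW call.

```python
import subprocess
import sys
from time import perf_counter, process_time

# Stand-in for a VW call: a child Python process that sleeps for 0.3 seconds.
child_cmd = [sys.executable, "-c", "import time; time.sleep(0.3)"]

wall_start, cpu_start = perf_counter(), process_time()
subprocess.run(child_cmd, check=True)
wall_elapsed = perf_counter() - wall_start
cpu_elapsed = process_time() - cpu_start

# wall_elapsed covers the child's run time; cpu_elapsed only counts CPU time
# used by this parent process, so it stays much smaller.
print("wall clock: {:.3f}s, parent CPU: {:.3f}s".format(wall_elapsed, cpu_elapsed))
```

If end-to-end run time is the quantity of interest, wall-clock timing is usually the more representative measure for subprocess calls.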

In [9]:
# create temp directory to maintain data files
tmpdir = TemporaryDirectory()

model_path = os.path.join(tmpdir.name, 'vw.model')
train_path = os.path.join(tmpdir.name, 'train.dat')
test_path = os.path.join(tmpdir.name, 'test.dat')
train_logistic_path = os.path.join(tmpdir.name, 'train_logistic.dat')
test_logistic_path = os.path.join(tmpdir.name, 'test_logistic.dat')
prediction_path = os.path.join(tmpdir.name, 'prediction.dat')

<h3>Load & Transform Data</h3>

In [10]:
# load movielens data (use the 1M dataset)
df = load_pandas_df('1m')

# split data into train and test sets; default values take 75% of each user's ratings as train and 25% as test
train, test = python_stratified_split(df)

# save train and test data in vw format
to_vw(df=train, output=train_path)
to_vw(df=test, output=test_path)

# save data for logistic regression (requires adjusting the label)
to_vw(df=train, output=train_logistic_path, logistic=True)
to_vw(df=test, output=test_logistic_path, logistic=True)

<h3>Regression Based Recommendations</h3>

When considering different approaches for solving a problem with machine learning, it is helpful to establish a baseline against which more complex solutions can be compared in terms of performance, time, and resource (memory or CPU) usage.

Regression based approaches are some of the simplest and fastest baselines to consider for many ML problems.

<h4> Linear Regression</h4>

As the data provides a numerical rating between 1 and 5, fitting those values with a linear regression model is an easy first step. This model is trained on examples with the rating as the target variable and the corresponding user ids and movie ids as independent features.

By passing each user-item rating in as an example the model will begin to learn weights based on average ratings for each user as well as average ratings per item.

VW uses linear regression by default, so no extra command line options are needed beyond specifying where to locate the model and the data.

This, however, can generate predicted ratings which are no longer integers, so some additional adjustments should be made at prediction time to convert them back to the integer scale of 1 through 5 if necessary. Here, this is done in the `run_vw` function.
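
The adjustment amounts to rounding and clamping. A minimal standalone sketch, with made-up raw predictions and a hypothetical helper name:

```python
def to_rating(prediction, low=1, high=5):
    """Round a raw prediction and clamp it to the [low, high] rating scale."""
    return int(max(low, min(high, round(prediction))))

raw_predictions = [0.3, 2.2, 3.7, 6.1]  # made-up model outputs
ratings = [to_rating(p) for p in raw_predictions]
print(ratings)  # [1, 2, 4, 5]
```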

In [13]:
train_params = 'vw -f {model} -d {data}'.format(model=model_path, data=train_path)
test_params = 'vw -i {model} -d {data} -t -p {pred}'.format(model=model_path, data=test_path, pred=prediction_path)

result = run_vw(train_params=train_params, 
                test_params=test_params, 
                test_data=test, 
                prediction_path=prediction_path)

comparison = pd.DataFrame(result, index=['Linear Regression'])
comparison

| Model | RMSE | MAE | R2 | Explained Variance | Train Time (ms) | Test Time (ms) |
|---|---|---|---|---|---|---|
| Linear Regression | 0.960662 | 0.686874 | 0.261995 | 0.262026 | 7.852629 | 4.389875 |


<h4> Multinomial Logistic Regression</h4>

A similar alternative is to leverage multinomial classification, which treats each rating value as a distinct class. 

This avoids any non-integer results, but also reduces the training data for each class, which could lead to poorer performance if the counts of different rating levels are skewed.

Basic multiclass logistic regression can be accomplished using the One Against All approach, specified by the '--oaa N' option (where N is the number of classes), together with the logistic option for the loss function.
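
Conceptually, One Against All trains one binary classifier per class and predicts the class whose classifier scores highest. The sketch below shows only that final selection step with made-up scores. It illustrates the general idea and is not a description of VW's internals.

```python
def predict_one_against_all(scores):
    """Return the 1-based class whose binary scorer is most confident.

    scores[k] is the raw score of the "class k+1 vs. rest" classifier.
    """
    best_index = max(range(len(scores)), key=lambda k: scores[k])
    return best_index + 1

# Made-up scores from five binary classifiers, one per rating value 1..5.
example_scores = [-2.1, -0.7, 0.4, 1.3, -0.2]
predicted_rating = predict_one_against_all(example_scores)
print(predicted_rating)  # 4
```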

In [14]:
train_params = 'vw --loss_function logistic --oaa 5 -f {model} -d {data}'.format(model=model_path, data=train_path)
test_params = 'vw -i {model} -d {data} -t -p {pred}'.format(model=model_path, data=test_path, pred=prediction_path)

result = run_vw(train_params=train_params,
                test_params=test_params,
                test_data=test,
                prediction_path=prediction_path)

comparison = comparison.append(pd.DataFrame(result, index=['Multinomial Regression']))
comparison

| Model | RMSE | MAE | R2 | Explained Variance | Train Time (ms) | Test Time (ms) |
|---|---|---|---|---|---|---|
| Linear Regression | 0.960662 | 0.686874 | 0.261995 | 0.262026 | 7.852629 | 4.389875 |
| Multinomial Regression | 1.065052 | 0.718019 | 0.09289 | 0.118508 | 8.118148 | 4.681335 |


<h4>Logistic Regression</h4>

Additionally, one might simply be interested in whether the user likes or dislikes an item. We can adjust the input data to represent a binary outcome, where ratings of 1 or 2 are dislikes (negative results) and ratings of 3 to 5 are likes (positive results).

This framing allows a simple logistic regression model to be applied. To perform logistic regression, the loss_function parameter is changed to 'logistic' and the target label is converted to -1 or 1. Also, be sure to set '--link logistic' during prediction to convert the logit output back to a probability value.
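
The two transformations involved can be sketched in plain Python: the label mapping mirrors the threshold used in `to_vw`, and the logistic (sigmoid) function is the standard way to turn a logit into a probability. The helper names are hypothetical.

```python
import math

def to_binary_label(rating, threshold=3):
    """Map a 1-5 rating to +1 (like) or -1 (dislike), as to_vw does."""
    return 1 if rating >= threshold else -1

def logistic_link(score):
    """Convert a raw score (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

labels = [to_binary_label(r) for r in [1, 2, 3, 4, 5]]
print(labels)  # [-1, -1, 1, 1, 1]
print(logistic_link(0.0))  # 0.5
```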

In [15]:
train_params = 'vw --loss_function logistic -f {model} -d {data}'.format(model=model_path, data=train_logistic_path)
test_params = 'vw --link logistic -i {model} -d {data} -t -p {pred}'.format(model=model_path, data=test_logistic_path, pred=prediction_path)

result = run_vw(train_params=train_params,
                test_params=test_params,
                test_data=test,
                prediction_path=prediction_path,
                logistic=True)

comparison = comparison.append(pd.DataFrame(result, index=['Logistic Regression']))
comparison

| Model | RMSE | MAE | R2 | Explained Variance | Train Time (ms) | Test Time (ms) |
|---|---|---|---|---|---|---|
| Linear Regression | 0.960662 | 0.686874 | 0.261995 | 0.262026 | 7.852629 | 4.389875 |
| Multinomial Regression | 1.065052 | 0.718019 | 0.09289 | 0.118508 | 8.118148 | 4.681335 |
| Logistic Regression | 0.691872 | 0.382294 | 0.134096 | 0.178364 | 7.913344 | 5.097224 |


<h4>Linear Regression with Interaction Features</h4>

So far we have treated the user features and item features independently, but taking interactions between features into account can provide a mechanism to learn more fine-grained user preferences.

To generate interaction features use the quadratic command line argument and specify the namespaces that should be combined: '-q ui' combines the user and item namespaces based on the first letter of each.

When generating interaction terms one thing to consider is the [hash space](https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Feature-Hashing-and-Extraction) used for the features. It can be beneficial to increase the size of the space to reduce unwanted collisions.
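
To get a feel for why the bit size matters, the sketch below estimates how many features would collide under an idealized uniform hash, using the common approximation that `n` features occupy about `m * (1 - exp(-n / m))` of `m` slots. The feature count is an assumption chosen to resemble a 75% training split of MovieLens 1M, and the model ignores how VW actually hashes features.

```python
import math

def expected_collisions(n_features, bits):
    """Expected number of features sharing a slot with an earlier feature,
    assuming an idealized uniform hash into 2**bits slots."""
    slots = 2 ** bits
    expected_occupied = slots * (1.0 - math.exp(-n_features / slots))
    return n_features - expected_occupied

# Assumed count of distinct user-item interaction features; illustrative only.
n_interactions = 750000

for bits in (18, 24):  # 18 is VW's default for -b
    print(bits, 2 ** bits, round(expected_collisions(n_interactions, bits)))
```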

In [16]:
train_params = 'vw -b 24 -q ui -f {model} -d {data}'.format(model=model_path, data=train_path)
test_params = 'vw -i {model} -d {data} -t -p {pred}'.format(model=model_path, data=test_path, pred=prediction_path)

result = run_vw(train_params=train_params,
                test_params=test_params,
                test_data=test,
                prediction_path=prediction_path)

comparison = comparison.append(pd.DataFrame(result, index=['Linear Regression w/ Interaction']))
comparison

| Model | RMSE | MAE | R2 | Explained Variance | Train Time (ms) | Test Time (ms) |
|---|---|---|---|---|---|---|
| Linear Regression | 0.960662 | 0.686874 | 0.261995 | 0.262026 | 7.852629 | 4.389875 |
| Multinomial Regression | 1.065052 | 0.718019 | 0.09289 | 0.118508 | 8.118148 | 4.681335 |
| Logistic Regression | 0.691872 | 0.382294 | 0.134096 | 0.178364 | 7.913344 | 5.097224 |
| Linear Regression w/ Interaction | 0.96398 | 0.691836 | 0.256889 | 0.257098 | 8.054844 | 4.909482 |


<h3>Matrix Factorization Based Recommendations</h3>

All of the above approaches train a regression model, but VW also supports matrix factorization with two different approaches.

<h4>Singular Value Decomposition Based Matrix Factorization</h4>

The first approach is called using the '--rank' command line argument and performs matrix factorization based on Singular Value Decomposition (SVD).

See the [Matrix Factorization Example](https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Matrix-factorization-example) for more detail.
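
As a rough sketch of what a factorization model computes at prediction time, the example below adds linear user and item terms to the dot product of their latent factor vectors. All parameter values are made up. This shows the general form of such models and does not claim to reproduce VW's implementation.

```python
def mf_predict(bias, user_weight, item_weight, user_factors, item_factors):
    """Linear terms plus the dot product of user and item latent factors."""
    interaction = sum(u * v for u, v in zip(user_factors, item_factors))
    return bias + user_weight + item_weight + interaction

# Made-up parameters for one user and one item with rank 3.
prediction = mf_predict(
    bias=3.5,
    user_weight=0.25,
    item_weight=-0.5,
    user_factors=[0.5, -1.0, 2.0],
    item_factors=[1.0, 0.5, 0.25],
)
print(prediction)  # 3.75
```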

In [17]:
train_params = 'vw --rank 5 -q ui -f {model} -d {data}'.format(model=model_path, data=train_path)
test_params = 'vw -i {model} -d {data} -t -p {pred}'.format(model=model_path, data=test_path, pred=prediction_path)

result = run_vw(train_params=train_params,
                test_params=test_params,
                test_data=test,
                prediction_path=prediction_path)

comparison = comparison.append(pd.DataFrame(result, index=['Matrix Factorization (Rank)']))
comparison

| Model | RMSE | MAE | R2 | Explained Variance | Train Time (ms) | Test Time (ms) |
|---|---|---|---|---|---|---|
| Linear Regression | 0.960662 | 0.686874 | 0.261995 | 0.262026 | 7.852629 | 4.389875 |
| Multinomial Regression | 1.065052 | 0.718019 | 0.09289 | 0.118508 | 8.118148 | 4.681335 |
| Logistic Regression | 0.691872 | 0.382294 | 0.134096 | 0.178364 | 7.913344 | 5.097224 |
| Linear Regression w/ Interaction | 0.96398 | 0.691836 | 0.256889 | 0.257098 | 8.054844 | 4.909482 |
| Matrix Factorization (Rank) | 1.019621 | 0.769993 | 0.168628 | 0.22757 | 8.346378 | 5.065638 |


<h4>Factorization Machine Based Matrix Factorization</h4>

An alternative approach based on [Rendle's factorization machines](https://cseweb.ucsd.edu/classes/fa17/cse291-b/reading/Rendle2010FM.pdf) is called using '--lrq' (low rank quadratic). More details on LRQ are available in this [demo](https://github.com/VowpalWabbit/vowpal_wabbit/tree/master/demo/movielens).

This learns two lower rank matrices which are multiplied to generate an approximation of the user-item rating matrix. Compressing the matrix in this way leads to learning generalizable factors which avoids some of the limitations of using regression models with extremely sparse interaction features. This can lead to better convergence and smaller on-disk models.
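
A quick back-of-the-envelope comparison shows why the low-rank form is so much more compact than a full set of interaction weights. The user and item counts below are assumed values close to MovieLens 1M, and hashing is ignored.

```python
def full_interaction_weights(n_users, n_items):
    """One weight per possible user-item pair."""
    return n_users * n_items

def low_rank_weights(n_users, n_items, rank):
    """`rank` latent factors for every user and every item."""
    return rank * (n_users + n_items)

# Assumed MovieLens 1M sizes (about 6,040 users and 3,706 rated movies).
n_users, n_items, rank = 6040, 3706, 7

print(full_interaction_weights(n_users, n_items))  # 22384240
print(low_rank_weights(n_users, n_items, rank))    # 68222
```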

An additional option that can improve performance is '--lrqdropout', which drops out columns during training. This, however, tends to increase the optimal rank size. Other parameters such as L2 regularization can help avoid overfitting.

In [18]:
train_params = 'vw --lrq ui7 --lrqdropout -f {model} -d {data}'.format(model=model_path, data=train_path)
test_params = 'vw -i {model} -d {data} -t -p {pred}'.format(model=model_path, data=test_path, pred=prediction_path)

result = run_vw(train_params=train_params,
                test_params=test_params,
                test_data=test,
                prediction_path=prediction_path)

comparison = comparison.append(pd.DataFrame(result, index=['Matrix Factorization (LRQ)']))
comparison

| Model | RMSE | MAE | R2 | Explained Variance | Train Time (ms) | Test Time (ms) |
|---|---|---|---|---|---|---|
| Linear Regression | 0.960662 | 0.686874 | 0.261995 | 0.262026 | 7.852629 | 4.389875 |
| Multinomial Regression | 1.065052 | 0.718019 | 0.09289 | 0.118508 | 8.118148 | 4.681335 |
| Logistic Regression | 0.691872 | 0.382294 | 0.134096 | 0.178364 | 7.913344 | 5.097224 |
| Linear Regression w/ Interaction | 0.96398 | 0.691836 | 0.256889 | 0.257098 | 8.054844 | 4.909482 |
| Matrix Factorization (Rank) | 1.019621 | 0.769993 | 0.168628 | 0.22757 | 8.346378 | 5.065638 |
| Matrix Factorization (LRQ) | 1.036123 | 0.762572 | 0.1415 | 0.141591 | 7.463629 | 5.28315 |


<h3>Scoring</h3>

After training a model with any of the above approaches, the model can be used to score potential user-item pairs in offline batch mode or in a real-time scoring mode. The example below shows how to leverage the utilities in the reco_utils directory to generate Top-K recommendations from offline scored output.
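
For intuition about the Top-K step itself, here is a small standalone sketch that groups scored pairs by user and keeps the highest-scoring items. It uses only the standard library and made-up scores. It is not the implementation of `get_top_k_items`.

```python
import heapq
from collections import defaultdict

def top_k_per_user(scored_pairs, k):
    """Return {user: [(item, score), ...]} with the k highest scores per user."""
    by_user = defaultdict(list)
    for user, item, score in scored_pairs:
        by_user[user].append((item, score))
    return {
        user: heapq.nlargest(k, items, key=lambda pair: pair[1])
        for user, items in by_user.items()
    }

# Made-up (userID, itemID, predicted rating) triples.
scored = [(1, 10, 4.2), (1, 11, 3.1), (1, 12, 4.8), (2, 10, 2.5), (2, 13, 3.9)]
recommendations = top_k_per_user(scored, k=2)
print(recommendations)
```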

In [19]:
# store all data
data_path = os.path.join(tmpdir.name, 'all.dat')
to_vw(df=df, output=data_path)

# predict on the full set of users
train_params = 'vw --lrq ui7 --lrqdropout -f {model} -d {data}'.format(model=model_path, data=train_path)
test_params = 'vw -i {model} -d {data} -t -p {pred}'.format(model=model_path, data=data_path, pred=prediction_path)

result = run_vw(train_params=train_params,
                test_params=test_params,
                test_data=df,
                prediction_path=prediction_path)

In [20]:
# load predictions and filter to the selected test users
test_users = [1, 2, 3]
pred_data = pd.read_csv(prediction_path, delim_whitespace=True, names=['prediction'], index_col=1).join(df)
test_user_data = pred_data[pred_data['userID'].isin(test_users)]

get_top_k_items(test_user_data, col_rating='prediction', k=5)

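| | level_0 | level_1 | prediction | userID | itemID | rating | timestamp |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 2 | 5.0 | 1 | 914 | 3.0 | 978301968 |
| 1 | 0 | 8 | 5.0 | 1 | 594 | 4.0 | 978302268 |
| 2 | 0 | 13 | 5.0 | 1 | 2918 | 4.0 | 978302124 |
| 3 | 0 | 21 | 5.0 | 1 | 720 | 3.0 | 978300760 |
| 4 | 0 | 23 | 5.0 | 1 | 527 | 5.0 | 978824195 |
| 5 | 1 | 54 | 5.0 | 2 | 3068 | 4.0 | 978299000 |
| 6 | 1 | 55 | 5.0 | 2 | 1537 | 4.0 | 978299620 |
| 7 | 1 | 57 | 5.0 | 2 | 2194 | 4.0 | 978299297 |
| 8 | 1 | 59 | 5.0 | 2 | 2268 | 5.0 | 978299297 |
| 9 | 1 | 63 | 5.0 | 2 | 3468 | 5.0 | 978298542 |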


<h3>Cleanup</h3>

In [None]:
tmpdir.cleanup()