# Example: The RecModel package for implicit feedback

## Get the Data
In the following, we test the SLIM model on the Netflix dataset. Because all models in the RecModel package take a CSR (compressed sparse row) matrix as input, we first need to download the Netflix dataset and convert it to CSR format.

Fortunately, I created a small repo that does exactly that: [Data](https://github.com/titoeb/ImplicitFeedback).

Do the following steps to create the Netflix dataset:

* Clone the Repository
    ```
    git clone https://github.com/titoeb/ImplicitFeedback
    ```

* Change into the repository:
    ```
    cd ImplicitFeedback
    ```
* Allow execution of the Bash Script:
    ```
    chmod +x Create_Data
    ```

* Execute the Bash script. This may take some time, as the Netflix, ML20 and Million Song datasets are all downloaded.
    ```
    ./Create_Data
    ```

* Once the data is downloaded, run the Netflix.py script to create the CSR matrix.
    ```
    python Netflix.py
    ```
* Finally, copy the Netflix.npz file into the data folder of your RecModel package directory. You can then delete the cloned ImplicitFeedback folder.
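
The conversion script itself is not shown here, but the general idea of turning interaction records into a CSR file can be sketched as follows. This is a minimal sketch on made-up toy data, not the code from Netflix.py; the arrays and the file name `toy.npz` are hypothetical.

```python
import os
import tempfile

import numpy as np
import scipy.sparse

# Hypothetical toy interactions: one (user index, item index) pair per entry.
users = np.array([0, 0, 1, 2, 2, 2])
items = np.array([0, 3, 1, 0, 2, 3])
values = np.ones(len(users), dtype=np.float32)  # implicit feedback: 1 = interaction

# Build the user x item matrix in CSR format.
interactions = scipy.sparse.csr_matrix((values, (users, items)), shape=(3, 4))

# Round-trip through the .npz format that scipy.sparse.load_npz reads.
path = os.path.join(tempfile.mkdtemp(), "toy.npz")
scipy.sparse.save_npz(path, interactions)
restored = scipy.sparse.load_npz(path)

print(restored.shape, restored.nnz)
```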


To import the data, fit the SLIM model and analyze the results we need the following packages:

In [2]:
import scipy.sparse
import numpy as np
import RecModel
import matplotlib.pyplot as plt
plt.style.use('ggplot')

The RecModel package implements the following models:

* [Neighbor](https://dl.acm.org/doi/10.1145/371920.372071): RecModel.Neighborhood

* [SLIM](https://dl.acm.org/doi/10.1109/ICDM.2011.134): RecModel.SLIM

* [VAE](https://dl.acm.org/doi/abs/10.1145/3178876.3186150): RecModel.VAE

* [EASE](https://dl.acm.org/doi/abs/10.1145/3308558.3313710): RecModel.EASE

* [WMF](https://dl.acm.org/doi/10.1109/ICDM.2008.22): RecModel.WMF

* [RecWalk](https://dl.acm.org/doi/abs/10.1145/3289600.3291016): RecModel.RecWalk

All these models have different hyperparameters; for more details, see the documentation of the individual models.

Before we fit any models, we need to load the Netflix data. To speed up computation, we only use the first 10000 users and the first 2500 items.

In [3]:
netflix_data = scipy.sparse.load_npz('data/Netflix.npz')[:10000, :2500]

num_users, num_items = netflix_data.shape
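
Row and column slicing on a CSR matrix returns another sparse matrix, so the subset stays cheap to hold in memory. The sketch below illustrates this on a randomly generated toy matrix (not the Netflix data) and also computes the density, which is a useful sanity check after slicing.

```python
import numpy as np
import scipy.sparse

# Toy stand-in for the full interaction matrix: 50 users x 20 items, ~10% filled.
rng = np.random.default_rng(0)
dense = (rng.random((50, 20)) < 0.1).astype(np.float32)
full = scipy.sparse.csr_matrix(dense)

# Keep the first 10 users and the first 5 items, as done above with other sizes.
subset = full[:10, :5]
n_users, n_items = subset.shape

# Share of user-item pairs that have an observed interaction.
density = subset.nnz / (n_users * n_items)
print(subset.shape, subset.nnz, round(density, 3))
```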

Next, we split it into a training set and a test set:

In [4]:
train_data, test_data = RecModel.train_test_split_sparse_mat(netflix_data)
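
How `train_test_split_sparse_mat` splits the data internally is not described here. One common approach, shown below as a sketch under that assumption, is to move a random fraction of the observed interactions into a test matrix of the same shape. The function name `holdout_split` and its parameters are hypothetical and not part of RecModel.

```python
import numpy as np
import scipy.sparse


def holdout_split(mat, test_fraction=0.2, seed=0):
    """Move a random fraction of the non-zero entries into a test matrix."""
    coo = mat.tocoo()
    rng = np.random.default_rng(seed)
    is_test = rng.random(coo.nnz) < test_fraction

    def build(mask):
        # Rebuild a CSR matrix of the original shape from the selected entries.
        return scipy.sparse.csr_matrix(
            (coo.data[mask], (coo.row[mask], coo.col[mask])), shape=mat.shape
        )

    return build(~is_test), build(is_test)


# Toy interaction matrix: 40 users x 30 items, ~20% filled.
rng = np.random.default_rng(1)
toy = scipy.sparse.csr_matrix((rng.random((40, 30)) < 0.2).astype(np.float32))

train, test = holdout_split(toy)
print(toy.nnz, train.nnz, test.nnz)
```

Every interaction ends up in exactly one of the two matrices, so the model never sees a test interaction during training.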

In this tutorial, we first fit a baseline model that recommends random items to users. Afterwards, we fit the SLIM model, a regularized linear model.

Let's start with the Baseline!

The naive baseline only samples random items for every user. Therefore, it does not need to be trained.

In [5]:
naive_model = RecModel.NaiveBaseline(num_items=num_items)
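
To make the idea of a random recommender concrete, here is a minimal sketch that draws distinct random items for each user. It is an illustration only and not the `NaiveBaseline` implementation; the function name `random_topn` is hypothetical.

```python
import numpy as np


def random_topn(num_users, num_items, n, seed=0):
    """Return an array of shape (num_users, n) with n distinct random items per user."""
    rng = np.random.default_rng(seed)
    # Sorting random scores per user gives a random permutation of the items;
    # the first n columns are n distinct, uniformly drawn item indices.
    random_scores = rng.random((num_users, num_items))
    return np.argsort(random_scores, axis=1)[:, :n]


recommendations = random_topn(num_users=5, num_items=20, n=3)
print(recommendations)
```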

Evaluate the recall@4, recall@10, recall@20 and recall@50 performance based on 1000 random items for each user:

In [7]:
naive_model_performance = naive_model.eval_topn(test_mat=test_data, topn = np.array([4, 10, 20, 50]), rand_sampled=1000)

In [8]:
print(naive_model_performance)

{'Recall@4': 0.0055808444, 'Recall@10': 0.014061233, 'Recall@20': 0.027124774, 'Recall@50': 0.06940201}


Randomly sampling items only gets us a recall@50 of about 7 percent (in the same range as the 5 percent one would expect when sampling 50 out of 1000 items). Let's use the SLIM model to improve the performance!
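
The evaluation protocol can be sketched as follows, assuming that each held-out item is ranked against `rand_sampled` randomly drawn other items and counts as a hit if it lands in the top N. This is an assumption about what `eval_topn` does, not its actual code; `sampled_recall_at_n` and `score_fn` are hypothetical names. With purely random scores, the hit rate should be close to N / (rand_sampled + 1).

```python
import numpy as np


def sampled_recall_at_n(score_fn, held_out_items, num_items, topn, rand_sampled, seed=0):
    """Recall@N where each held-out item competes against random other items."""
    rng = np.random.default_rng(seed)
    hits = {n: 0 for n in topn}
    for item in held_out_items:
        # Draw distinct items, then shift indices so the held-out item is never drawn.
        negatives = rng.choice(num_items - 1, size=rand_sampled, replace=False)
        negatives[negatives >= item] += 1
        candidates = np.concatenate(([item], negatives))
        scores = score_fn(candidates)
        # Rank of the held-out item = number of candidates with a strictly higher score.
        rank = int(np.sum(scores > scores[0]))
        for n in topn:
            hits[n] += int(rank < n)
    return {f"Recall@{n}": hits[n] / len(held_out_items) for n in topn}


# A scorer that ignores the items and returns random scores, like a naive baseline.
score_rng = np.random.default_rng(1)
random_scores = lambda candidates: score_rng.random(len(candidates))

held_out = np.random.default_rng(2).integers(0, 2500, size=2000)
result = sampled_recall_at_n(
    random_scores, held_out, num_items=2500, topn=[4, 10, 20, 50], rand_sampled=1000
)
print(result)
```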

In [9]:
slim_model = RecModel.SLIM(num_items=num_items, num_users=num_users)

The SLIM model is an actual model-based collaborative filtering model, so it needs to be trained. Unfortunately, SLIM is expensive to train and would need about an hour of training time on the full Netflix dataset. To get more information about the status during training, set verbose to True!

In [10]:
slim_model.train(X=train_data.astype(np.float64), alpha=4.427181, l1_ratio=0.318495, max_iter=27, tolerance=0.006841, cores=8, verbose=False)
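
For intuition, SLIM learns a sparse, non-negative item-item weight matrix W with a zero diagonal, so that the interaction matrix A is approximated by A·W. The sketch below fits one elastic-net regression per item with plain coordinate descent on a tiny dense matrix. It is a slow, illustrative version and not RecModel's implementation; the function name `slim_sketch` is hypothetical, and the way `alpha` and `l1_ratio` split into L1 and L2 penalties follows the scikit-learn ElasticNet convention, which is an assumption about RecModel's parameters.

```python
import numpy as np


def slim_sketch(A, alpha=0.01, l1_ratio=0.5, max_iter=50):
    """Learn item-item weights W (W >= 0, diag(W) = 0) so that A @ W approximates A."""
    n_users, n_items = A.shape
    G = A.T @ A / n_users              # item-item Gram matrix, scaled by the number of users
    l1 = alpha * l1_ratio              # L1 penalty: drives weights to exactly zero
    l2 = alpha * (1.0 - l1_ratio)      # L2 penalty: shrinks the remaining weights
    W = np.zeros((n_items, n_items))
    for j in range(n_items):           # one independent regression per target item
        w = np.zeros(n_items)
        for _ in range(max_iter):
            for k in range(n_items):
                if k == j:             # zero diagonal: an item must not predict itself
                    continue
                denom = G[k, k] + l2
                if denom == 0.0:       # item without interactions and no L2 penalty
                    continue
                # Correlation of item k with the residual that excludes item k's own share.
                rho = G[k, j] - G[k] @ w + G[k, k] * w[k]
                # Soft-threshold, clip at zero (non-negativity), then rescale.
                w[k] = max(0.0, rho - l1) / denom
        W[:, j] = w
    return W


# Toy data: items 0 and 1 always co-occur, items 2 and 3 often co-occur.
A = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
], dtype=float)

W = slim_sketch(A)
scores = A @ W   # predicted preference of every user for every item
print(np.round(W, 2))
```

In this toy example, the last user has only interacted with item 2, and the learned weights give item 3 a higher score than items 0 or 1 for that user.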

After training the model, we can check its performance on the test dataset:

In [13]:
slim_model_performance = slim_model.eval_topn(test_mat=test_data.astype(np.float64), topn = np.array([4, 10, 20, 50]), rand_sampled=1000)

In [14]:
print(slim_model_performance)

{'Recall@4': 0.17652927, 'Recall@10': 0.3280227, 'Recall@20': 0.48915008, 'Recall@50': 0.7147534}


The SLIM model already gets a recall@4 performance of about 17.7 percent and a recall@50 performance of about 71.5 percent. Finally some solid numbers!
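
To put the two result sets side by side, the short sketch below copies the values printed above and computes how many times higher the SLIM recall is at each cutoff. The variable names are chosen for illustration.

```python
# Values copied from the printed evaluation results above.
naive = {'Recall@4': 0.0055808444, 'Recall@10': 0.014061233,
         'Recall@20': 0.027124774, 'Recall@50': 0.06940201}
slim = {'Recall@4': 0.17652927, 'Recall@10': 0.3280227,
        'Recall@20': 0.48915008, 'Recall@50': 0.7147534}

# Ratio of SLIM recall to baseline recall at each cutoff.
lift = {metric: slim[metric] / naive[metric] for metric in naive}

for metric in naive:
    print(f"{metric}: {naive[metric]:.3f} -> {slim[metric]:.3f} ({lift[metric]:.1f}x)")
```

The relative gain is largest at the small cutoffs, where random recommendations almost never hit.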