# LikeWise

### A Comparison of Book Recommender Systems

### By _Owen Burton_ and _Tobias Reaper_

---
---

## Outline

* [Introduction]()
* [_Imports and Configuration_]()
* [Data]()
* [Modeling]()
    * [Training]()
    * [Generating Recommendations]()

---
---

## Introduction

### Stop! Collaborate and Filter

[Collaborative Filtering](https://d2l.ai/chapter_recommender-systems/recsys-intro.html#collaborative-filtering) (CF)

> In general, CF only uses the user-item interaction data to make predictions and recommendations.
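
To make that concrete, here is a toy sketch (the users, books, and ratings are made up for illustration) that pivots a handful of interaction records into a user-item matrix. Nothing about the books or the users themselves is used, only who rated what.

```python
import pandas as pd

# Hypothetical interaction records: one row per (user, book, rating)
toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u2", "u3"],
    "book_id": ["b1", "b2", "b1", "b3", "b2"],
    "rating":  [5, 3, 4, 2, 5],
})

# Pivot into a user x item matrix; NaN marks pairs with no interaction,
# which are exactly the entries a CF model tries to predict
matrix = toy.pivot(index="user_id", columns="book_id", values="rating")
print(matrix)

# Fraction of the matrix that is actually observed
density = matrix.notna().to_numpy().mean()
print(f"density: {density:.2f}")
```

Real interaction data is far sparser than this toy matrix, which is why the models below learn compact embeddings rather than working on the full matrix directly.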

---
---

## Imports and Configuration

In [None]:
# === General imports === #
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import os

In [None]:
# === fastai imports === #
from fastai.collab import *

In [None]:
# === Configuration === #
%matplotlib inline
pd.options.display.max_columns = 100
pd.options.display.max_rows = 200

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# === Set up path to project dir === #
PROJECT_DIR = "/content/drive/My Drive/workshop/buildbox/likewise"

---
---

## Data

Intro to and explanation of dataset — why this dataset?

Here is the dataset used for the LikeWise recommender systems: [UCSD Book Graph - GoodReads Datasets](https://sites.google.com/eng.ucsd.edu/ucsdbookgraph/home).

For this particular model, I'll be using the "Shelves" dataset, which has interactions between users and books (ratings).

The relevant columns are `user_id`, `book_id`, and `rating`. They are pretty self-explanatory, but just to be explicit, each record indicates a user rating a book that they've (presumably) read and have an opinion on.

Speaking of explicit: because users _explicitly_ rate the books, this is a dataset of explicit preferences, as opposed to implicit signals such as shelving or reading a book without rating it.
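
As a small illustration of the distinction, the sketch below splits a made-up interactions table into explicit feedback (a rating was given) and implicit-only feedback (the book was shelved or read but not rated). It assumes that a `rating` of 0 means "no rating given", which is my reading of the dataset and worth verifying against its documentation.

```python
import pandas as pd

# Made-up records with the same columns as the real interactions file
toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3"],
    "book_id": ["b1", "b2", "b1", "b3"],
    "is_read": [True, False, True, True],
    "rating":  [4, 0, 5, 0],
})

# Assumption: rating == 0 means the user did not rate the book
explicit = toy[toy["rating"] > 0]
implicit_only = toy[toy["rating"] == 0]

print(len(explicit), "explicit;", len(implicit_only), "implicit-only")
```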

In [None]:
# === Load data from Drive === #
data_file = "interactions_mystery_thriller_crime_1m_sample.csv"
data_path = os.path.join(PROJECT_DIR, "assets/data", data_file)
interactions = pd.read_csv(data_path)

In [None]:
# === Data types === #
interactions.dtypes

user_id    object
book_id    object
is_read      bool
rating      int64
dtype: object

---
---

## Modeling

For this model, I'll be using the collaborative filtering module from fastai (`fastai.collab`).

Resources:

* [fastai.collab](https://docs.fast.ai/collab.html)
* [movielens recommender example](https://github.com/microsoft/recommenders/blob/master/notebooks/00_quick_start/fastai_movielens.ipynb)
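
As far as I understand it, the default model that `collab_learner` builds is an embedding dot-product with bias terms: each user and each book gets a learned vector plus a scalar bias, and the predicted rating is their dot product plus both biases, squashed into the rating range with a sigmoid. The NumPy sketch below illustrates that idea with random (untrained) weights. The sizes and names are made up, and it is not fastai's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, n_factors = 4, 6, 3   # toy sizes, chosen for illustration
y_min, y_max = 1.0, 5.0                 # rating range

# Random stand-ins for the parameters a real model would learn
user_vecs = rng.normal(size=(n_users, n_factors))
item_vecs = rng.normal(size=(n_items, n_factors))
user_bias = rng.normal(size=n_users)
item_bias = rng.normal(size=n_items)

def predict(u, i):
    """Predicted rating for user index u and item index i."""
    raw = user_vecs[u] @ item_vecs[i] + user_bias[u] + item_bias[i]
    # Sigmoid squashes the raw score into (0, 1); rescale to the rating range
    return y_min + (y_max - y_min) / (1.0 + np.exp(-raw))

print(predict(0, 2))
```

Training adjusts the vectors and biases so that predictions for observed (user, book) pairs match the ratings actually given.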

### Training

In [None]:
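# === First, train on rated books only === #
# `inters_rated` is not defined earlier in the notebook; this assumes
# a rating of 0 means "not rated" and keeps only ratings of 1-5
inters_rated = interactions[interactions["rating"] > 0]

rating_range = [1, 5]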

# Create databunch
data = CollabDataBunch.from_df(
    inters_rated,
    user_name="user_id",
    item_name="book_id",
    rating_name="rating",
    valid_pct=0.2,
    seed=92
)

# === Instantiate learner === #
learn = collab_learner(data, n_factors=50, y_range=rating_range)
learn.model

In [None]:
data.show_batch()

user_id,book_id,target
82,51,4.0
1761,8945,2.0
779,13588,3.0
1627,22778,5.0
1720,3098,4.0


In [None]:
# === Train! === #
learn.fit_one_cycle(5, 5e-3)

epoch,train_loss,valid_loss,time
0,1.123155,1.084469,00:10
1,0.634403,0.753151,00:12
2,0.298588,0.739403,00:12
3,0.128401,0.741949,00:11
4,0.073346,0.742687,00:10
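
The loss values above are easier to interpret on the rating scale. Assuming the reported loss is mean squared error (which I believe is fastai's default for this kind of learner, but have not verified here), the sketch below converts the logged values to RMSE and finds the epoch with the lowest validation loss. The numbers are copied from the training log.

```python
import math

# (epoch, train_loss, valid_loss) copied from the training log
history = [
    (0, 1.123155, 1.084469),
    (1, 0.634403, 0.753151),
    (2, 0.298588, 0.739403),
    (3, 0.128401, 0.741949),
    (4, 0.073346, 0.742687),
]

# Epoch with the lowest validation loss
best_epoch, _, best_valid = min(history, key=lambda row: row[2])
print(f"best epoch: {best_epoch}, valid RMSE: {math.sqrt(best_valid):.3f}")

# Gap between validation and training loss at the final epoch
final_train, final_valid = history[-1][1], history[-1][2]
print(f"final gap (valid - train): {final_valid - final_train:.3f}")
```

Validation loss bottoms out at epoch 2 while training loss keeps falling, which suggests the later epochs mostly overfit. Fewer epochs or some weight decay might be worth trying.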


In [None]:
# === Export trained model === #
data_file = "01_likewise.pkl"
data_path = os.path.join(PROJECT_DIR, "assets/models", data_file)
learn.export(data_path)

---

### Generating Recommendations

In [None]:
# === Import the model from file === #
learner = load_learner(
    path=os.path.join(PROJECT_DIR, "assets/models"),
    file="01_likewise.pkl",
)

In [None]:
# === Get all users and items that the model knows === #
total_users, total_items = learner.data.train_ds.x.classes.values()
total_items = total_items[1:]
total_users = total_users[1:]

print(total_users.shape, total_items.shape)

(1613,) (15925,)


In [None]:
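# Get all users from the test set and keep only those
# that were known in the training set
# (`test_df` and `USER` are not defined in this notebook; `USER` is
# assumed here to be the user column name)
USER = "user_id"

test_users = test_df[USER].unique()
test_users = np.intersect1d(test_users, total_users)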

In [None]:
from itertools import product

users_items = product(np.array(total_users), np.array(total_items))
users_items = pd.DataFrame(users_items, columns=["user_id", "book_id"])

In [None]:
users_items

Unnamed: 0,user_id,book_id
0,1,1
1,1,10
2,1,100
3,1,1000
4,1,10001
...,...,...
25687020,999,9990
25687021,999,9992
25687022,999,9993
25687023,999,9996
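
Scoring every user-book pair includes books the user has already rated, which are not useful as recommendations. One common way to drop them is an anti-join against the training interactions. A toy sketch with made-up ids follows, where `seen` stands in for the training data.

```python
from itertools import product

import pandas as pd

users = ["u1", "u2"]
items = ["b1", "b2", "b3"]

# All candidate (user, book) pairs
candidates = pd.DataFrame(list(product(users, items)), columns=["user_id", "book_id"])

# Hypothetical pairs already present in the training data
seen = pd.DataFrame({"user_id": ["u1", "u2"], "book_id": ["b1", "b3"]})

# Anti-join: left merge with an indicator column, keeping rows found only on the left
merged = candidates.merge(seen, on=["user_id", "book_id"], how="left", indicator=True)
unseen = merged.loc[merged["_merge"] == "left_only", ["user_id", "book_id"]]
unseen = unseen.reset_index(drop=True)

print(unseen)
```

At full scale the cross product here is 1,613 × 15,925 = 25,687,025 rows, so scoring in per-user batches may be kinder to memory than materializing everything at once.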


My search for how to generate recommendations led me to the [`score` function](https://github.com/microsoft/recommenders/blob/master/reco_utils/recommender/fastai/fastai_utils.py) in the `reco_utils` module, which is used in the [movielens notebook](https://github.com/microsoft/recommenders/blob/master/notebooks/00_quick_start/fastai_movielens.ipynb) referenced above.

I copied the `score` function into the cell below as a starting point for writing my own recommendation function.

In [None]:
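# `cc` supplies the default column names; this import assumes the
# `reco_utils` package from the linked repository is installed
from reco_utils.common import constants as cc

def score(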
    learner,
    test_df,
    user_col=cc.DEFAULT_USER_COL,
    item_col=cc.DEFAULT_ITEM_COL,
    prediction_col=cc.DEFAULT_PREDICTION_COL,
    top_k=None,
):
    """Score all users+items provided and reduce to top_k items per user if top_k>0
    
    Args:
        learner (obj): Model.
        test_df (pd.DataFrame): Test dataframe.
        user_col (str): User column name.
        item_col (str): Item column name.
        prediction_col (str): Prediction column name.
        top_k (int): Number of top items to recommend.
    Returns:
        pd.DataFrame: Result of recommendation 
    """
    # replace values not known to the model with NaN
    total_users, total_items = learner.data.train_ds.x.classes.values()
    test_df.loc[~test_df[user_col].isin(total_users), user_col] = np.nan
    test_df.loc[~test_df[item_col].isin(total_items), item_col] = np.nan

    # map ids to embedding ids
    u = learner.get_idx(test_df[user_col], is_item=False)
    m = learner.get_idx(test_df[item_col], is_item=True)

    # score the pytorch model
    pred = learner.model.forward(u, m)
    scores = pd.DataFrame(
        {user_col: test_df[user_col], item_col: test_df[item_col], prediction_col: pred}
    )
    scores = scores.sort_values([user_col, prediction_col], ascending=[True, False])
    if top_k is not None:
        top_scores = scores.groupby(user_col).head(top_k).reset_index(drop=True)
    else:
        top_scores = scores
    return top_scores
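
The last few lines of `score` are the part that turns predictions into recommendations: sort by user and by descending prediction, then keep the first `top_k` rows of each user's group. Here is that step in isolation on made-up scores, so it can be run without a trained model.

```python
import pandas as pd

# Made-up predictions for two users and three books
scores = pd.DataFrame({
    "user_id":    ["u1", "u1", "u1", "u2", "u2", "u2"],
    "book_id":    ["b1", "b2", "b3", "b1", "b2", "b3"],
    "prediction": [3.2, 4.7, 4.1, 2.5, 1.9, 4.4],
})

top_k = 2

# Sort users ascending and predictions descending,
# then take each user's first top_k rows
ranked = scores.sort_values(["user_id", "prediction"], ascending=[True, False])
top_scores = ranked.groupby("user_id").head(top_k).reset_index(drop=True)

print(top_scores)
```

Because the frame is sorted before grouping, `head(top_k)` returns each user's highest-scoring books in rank order.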