# LikeWise

### A book recommender system

### By _Tobias Reaper_

---
---

## Outline

* [Introduction]()
* [_Imports and Configuration_]()
* [Data]()
* [Modeling]()
    * [Training]()
    * [Generating Recommendations]()

---
---

## Introduction

### Stop! Collaborate and Filter

[Collaborative Filtering](https://d2l.ai/chapter_recommender-systems/recsys-intro.html#collaborative-filtering) (CF)

> In general, CF only uses the user-item interaction data to make predictions and recommendations.

---
---

## Imports and Configuration

In [1]:
# === General imports === #
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import os

In [2]:
# === fastai imports === #
from fastai.collab import *

In [3]:
# === Configuration === #
%matplotlib inline
pd.options.display.max_columns = 100
pd.options.display.max_rows = 200

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
# === Set up path to project dir === #
PROJECT_DIR = "/content/drive/My Drive/workshop/buildbox/likewise"

---
---

## Data

> Intro to and explanation of dataset — why this dataset?

The dataset used for the LikeWise recommender system is called the [UCSD Book Graph - GoodReads Datasets](https://sites.google.com/eng.ucsd.edu/ucsdbookgraph/home). For this particular model, I'll be using the "Shelves" dataset, which has interactions between users and books (ratings).

The relevant columns are `user_id`, `book_id`, and `rating`. They are pretty self-explanatory, but just to be explicit, each record indicates a user rating a book that they've (presumably) read and have an opinion on.

Speaking of explicit, the fact that the user _explicitly_ rates the books makes this dataset one of explicit preferences.

In [6]:
# === Load data from Drive === #
data_file = "interactions_mystery_thriller_crime_1m_sample.csv"
data_path = os.path.join(PROJECT_DIR, "assets/data", data_file)
interactions = pd.read_csv(data_path)
print(interactions.shape)
interactions.head()

(1000000, 4)


Unnamed: 0,user_id,book_id,is_read,rating
0,8842281e1d1347389f2ab93d60773d4d,6392944,True,3
1,8842281e1d1347389f2ab93d60773d4d,2279538,False,0
2,8842281e1d1347389f2ab93d60773d4d,20821043,False,0
3,8842281e1d1347389f2ab93d60773d4d,31184479,False,0
4,8842281e1d1347389f2ab93d60773d4d,28684704,True,3


In [7]:
# === Convert `book_id` to string === #
interactions["book_id"] = interactions["book_id"].astype("str")

In [8]:
# === Data types === #
interactions.dtypes

user_id    object
book_id    object
is_read      bool
rating      int64
dtype: object

In [9]:
# === Convert unread "ratings" to nulls === #
interactions["rating"] = np.where(~interactions["is_read"], np.nan, interactions["rating"])
interactions.head()

Unnamed: 0,user_id,book_id,is_read,rating
0,8842281e1d1347389f2ab93d60773d4d,6392944,True,3.0
1,8842281e1d1347389f2ab93d60773d4d,2279538,False,
2,8842281e1d1347389f2ab93d60773d4d,20821043,False,
3,8842281e1d1347389f2ab93d60773d4d,31184479,False,
4,8842281e1d1347389f2ab93d60773d4d,28684704,True,3.0


In [10]:
interactions.isnull().sum()

user_id         0
book_id         0
is_read         0
rating     477996
dtype: int64

In [22]:
# === Get dfs of read/unread books === #
unread = interactions[~interactions["is_read"]].copy()
print(f"Number of unread books: {unread.shape[0]}")

read = interactions[interactions["is_read"]].copy()
print(f"Number of read books: {read.shape[0]}")

Number of unread books: 477996
Number of read books: 522004
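
The counts above add up: 477996 unread plus 522004 read gives the full 1,000,000 rows, and the number of null ratings matches the number of unread books. A minimal sketch of that same check on a toy frame follows; the rows are made up, only the column names match `interactions`, and `Series.where` is used here as one alternative way to null out the unread ratings.

```python
import pandas as pd

# Toy stand-in for `interactions` (made-up rows, same column names)
toy = pd.DataFrame(
    {
        "user_id": ["u1", "u1", "u2", "u2"],
        "book_id": ["10", "11", "10", "12"],
        "is_read": [True, False, True, False],
        "rating": [3, 0, 5, 0],
    }
)

# Null out ratings of unread books: keep the rating only where is_read is True
toy["rating"] = toy["rating"].where(toy["is_read"])

read_toy = toy[toy["is_read"]]
unread_toy = toy[~toy["is_read"]]

# The two subsets should partition the frame, and every null should be unread
assert len(read_toy) + len(unread_toy) == len(toy)
assert toy["rating"].isnull().sum() == len(unread_toy)
print(len(read_toy), len(unread_toy))
```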


---
---

## Modeling

For this model, I'll be using the collaborative filtering tools in fastai's `fastai.collab` module.

Resources:

* [fastai.collab](https://docs.fast.ai/collab.html)
* [movielens recommender example](https://github.com/microsoft/recommenders/blob/master/notebooks/00_quick_start/fastai_movielens.ipynb)

### Training

In [None]:
# === First, train on rated books only === #
rating_range = [0, 5]

# Create databunch
data = CollabDataBunch.from_df(
    read,
    user_name="user_id",
    item_name="book_id",
    rating_name="rating",
    valid_pct=0.2,
    seed=92
)

# === Instantiate learner === #
learn = collab_learner(data, n_factors=50, y_range=rating_range)
learn.model

EmbeddingDotBias(
  (u_weight): Embedding(16072, 50)
  (i_weight): Embedding(69696, 50)
  (u_bias): Embedding(16072, 1)
  (i_bias): Embedding(69696, 1)
)
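
For intuition, the `EmbeddingDotBias` model printed above can be read as a dot product between a user vector and a book vector, plus one bias for the user and one for the book, squashed into the rating range with a sigmoid. The sketch below is a plain-Python illustration of that idea, not fastai's code: the function name `dot_bias_score` and every number in it are made up, and it uses three factors where the real model uses 50 learned ones.

```python
import math


def dot_bias_score(user_vec, item_vec, user_bias, item_bias, y_range=(0.0, 5.0)):
    """Dot product plus biases, scaled into y_range with a sigmoid."""
    raw = sum(u * i for u, i in zip(user_vec, item_vec)) + user_bias + item_bias
    low, high = y_range
    return low + (high - low) / (1.0 + math.exp(-raw))


# Made-up factors for one user and one book (illustration only)
user_vec = [0.3, -0.1, 0.8]
item_vec = [0.5, 0.2, 0.4]
prediction = dot_bias_score(user_vec, item_vec, user_bias=0.1, item_bias=-0.2)
print(round(prediction, 4))
```

Because of the sigmoid, the score always lands strictly inside the range given by `y_range`, which is why the learner is given `rating_range`.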

In [None]:
data.show_batch()

user_id,book_id,target
675e6a96b8104daeaccfeff77424f8a1,10614,2.0
026b67d613257e8e86d4380132d1050f,24649825,4.0
ef67b5b7e15169b4312d1de306b94845,16366,5.0
773c4869ab6fc4306210f06db3a71821,19288043,4.0
84333f5def09812528f3bf1ed941f1a6,66508,3.0


In [None]:
# === Train! === #
learn.fit_one_cycle(3, 5e-3)

epoch,train_loss,valid_loss,time
0,1.404012,1.409553,08:58
1,1.078573,1.337938,13:50
2,0.681099,1.339044,12:51
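
To make the losses above easier to read, the sketch below converts them to rating points. It assumes the reported loss is mean squared error, which is not confirmed anywhere in this notebook; under that assumption the square root is an RMSE on the 0 to 5 rating scale.

```python
import math

# (train_loss, valid_loss) per epoch, copied from the table above
losses = [
    (1.404012, 1.409553),
    (1.078573, 1.337938),
    (0.681099, 1.339044),
]

# Square root of MSE gives an error in rating points (assumes the loss is MSE)
valid_rmse = [math.sqrt(valid) for _, valid in losses]
# Gap between validation and training loss; a growing gap hints at overfitting
gaps = [valid - train for train, valid in losses]

for epoch, (rmse, gap) in enumerate(zip(valid_rmse, gaps)):
    print(f"epoch {epoch}: valid RMSE ~ {rmse:.3f}, valid - train = {gap:.3f}")
```

Read this way, the validation error stays a little above one rating point, while the training loss keeps falling. The widening gap suggests the last epoch mostly fit the training data.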


In [None]:
# === Export trained model === #
data_file = "01_likewise.pkl"
data_path = os.path.join(PROJECT_DIR, "assets/models", data_file)
learn.export(data_path)

---

### Generating Recommendations

In [17]:
# === Import the model from file === #
learner = load_learner(
    path=os.path.join(PROJECT_DIR, "assets/models"),
    file="01_likewise.pkl",
)

In [45]:
doc(learner.predict)

In [None]:
# === 3 user-shelved books that have not been read or rated === #
# A slice displays all three rows; separate expressions would show only the last
interactions.iloc[1:4]

In [48]:
# === Generate prediction for one book === #
learner.predict(interactions.iloc[2])

(FloatItem 3.978665, tensor(3.9787), tensor(3.9787))

In [49]:
interactions.iloc[2]

user_id    8842281e1d1347389f2ab93d60773d4d
book_id                            20821043
is_read                               False
rating                                  NaN
Name: 2, dtype: object

It worked! The model predicts that this user would rate this book 3.9787.

In [50]:
# === Try another unread book === #
learner.predict(interactions.iloc[3])

(FloatItem 3.2284296, tensor(3.2284), tensor(3.2284))

In [70]:
# === And another === #
print(unread.iloc[2345])
learner.predict(unread.iloc[2345])

user_id    617ccec66dac2d1029600ed3d706e8ed
book_id                            27276292
is_read                               False
rating                                  NaN
Name: 5245, dtype: object


(FloatItem 3.5305958, tensor(3.5306), tensor(3.5306))

In [54]:
# === Try one that has been rated, to see the difference === #
print(interactions.iloc[4])
learner.predict(interactions.iloc[4])

user_id    8842281e1d1347389f2ab93d60773d4d
book_id                            28684704
is_read                                True
rating                                    3
Name: 4, dtype: object


(FloatItem 4.406884, tensor(4.4069), tensor(4.4069))

That prediction isn't very close. Let's try another.

In [57]:
# === Try another that has been rated === #
print(interactions.iloc[855])
learner.predict(interactions.iloc[855])

user_id    4b3636a043e5c99fa27ac897ccfa1151
book_id                               66528
is_read                                True
rating                                    4
Name: 855, dtype: object


(FloatItem 3.9686158, tensor(3.9686), tensor(3.9686))

That one is a lot better!
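
Two spot checks are not much to go on, but they can at least be put on the same footing. The sketch below computes the absolute error for the two rated books checked above, using the actual and predicted values copied from those cells. It is only an illustration; the validation loss from training remains the better summary.

```python
# (actual rating, predicted rating) for the two rated books checked above
checked = [
    (3, 4.406884),
    (4, 3.9686158),
]

abs_errors = [abs(actual - predicted) for actual, predicted in checked]
mean_abs_error = sum(abs_errors) / len(abs_errors)

for (actual, predicted), err in zip(checked, abs_errors):
    print(f"actual={actual}, predicted={predicted:.4f}, abs error={err:.4f}")
print(f"mean absolute error over {len(checked)} samples: {mean_abs_error:.4f}")
```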

In [71]:
# === User that doesn't exist === #
interactions[interactions["user_id"] == "617ccec66dac2d1029600ed3d706e8er"]

Unnamed: 0,user_id,book_id,is_read,rating


In [72]:
# === User doesn't exist, book does === #
new_user_book = pd.Series(
    data=["617ccec66dac2d1029600ed3d706e8er", "27276292", np.nan],
    index=["user_id", "book_id", "rating"],
)
new_user_book

user_id    617ccec66dac2d1029600ed3d706e8er
book_id                            27276292
rating                                  NaN
dtype: object

In [73]:
# === Predict unknown user's rating === #
learner.predict(new_user_book)

(FloatItem 2.7347884, tensor(2.7348), tensor(2.7348))

As can be seen in the above three cells, I created a new user and predicted what rating they would give a book. The rating it spit out was 2.7348.

That seemed a little weird. So I tried it again, with a different `user_id`. It gave the same rating. I'm wondering if that's just the mean rating of that book or something.
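
One plausible explanation, which I have not verified against the fastai source, is that every ID the model did not see during training gets mapped to a single shared placeholder entry. If so, all unknown users would share one embedding and get the same prediction for a given book. The toy lookup below illustrates only that idea; `user_index`, `UNKNOWN`, and the factor values are hypothetical and do not come from the trained model.

```python
# Hypothetical vocabulary: slot 0 is reserved for any ID not seen in training
UNKNOWN = 0
user_index = {"user_a": 1, "user_b": 2}

# One made-up factor vector per slot, including the shared "unknown" slot
user_factors = [
    [0.0, 0.1],   # slot 0: shared by every unknown user
    [0.9, -0.3],  # user_a
    [-0.2, 0.7],  # user_b
]
book_factors = [0.5, 0.4]  # a single made-up book


def predict(user_id):
    """Score the book for a user, falling back to the shared unknown slot."""
    slot = user_index.get(user_id, UNKNOWN)
    return sum(u * b for u, b in zip(user_factors[slot], book_factors))


# Two different unknown IDs hit the same slot, so they get the same score
print(predict("brand_new_user_1") == predict("brand_new_user_2"))
# A known user has their own slot, so their score differs
print(predict("user_a") == predict("brand_new_user_1"))
```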

In [25]:
# === Get all users and items that the model knows === #
total_users, total_items = learner.data.train_ds.x.classes.values()
# Skip the first entry of each list, which appears to be a placeholder rather than a real ID
total_users = total_users[1:]
total_items = total_items[1:]

print(total_users.shape, total_items.shape)

(16071,) (69695,)


---

### More recommendations

In [19]:
# === Replace values not known to the model with NaN === #
unread.loc[~unread["user_id"].isin(total_users), "user_id"] = np.nan
unread.loc[~unread["book_id"].isin(total_items), "book_id"] = np.nan

In [35]:
# === Keep known users and items; anything else becomes NaN (no rows are dropped) === #
unread["user_id"] = unread.loc[unread["user_id"].isin(total_users), "user_id"]
unread["book_id"] = unread.loc[unread["book_id"].isin(total_items), "book_id"]

In [36]:
print(unread.shape)
unread.head()

(477996, 4)


Unnamed: 0,user_id,book_id,is_read,rating
1,8842281e1d1347389f2ab93d60773d4d,2279538,False,
2,8842281e1d1347389f2ab93d60773d4d,20821043,False,
3,8842281e1d1347389f2ab93d60773d4d,31184479,False,
5,8842281e1d1347389f2ab93d60773d4d,32283133,False,
6,8842281e1d1347389f2ab93d60773d4d,17288661,False,


In [37]:
# === Map ids to embedding ids === #
u = learner.get_idx(unread["user_id"], is_item=False)
m = learner.get_idx(unread["book_id"], is_item=True)

You're trying to access a user that isn't in the training data.
                  If it was in your original data, it may have been split such that it's only in the validation set now.
You're trying to access an item that isn't in the training data.
                  If it was in your original data, it may have been split such that it's only in the validation set now.


In [38]:
# === Create predictions === #
pred = learner.model.forward(u, m)

TypeError: ignored
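
The cells above replace unknown IDs with NaN but leave those rows in place, so they still reach the index lookup. One alternative is to drop any row whose user or book the model has never seen. The sketch below shows that filter on a toy frame; `known_users`, `known_items`, and `candidates` are hypothetical stand-ins for `total_users`, `total_items`, and `unread`, and whether this alone resolves the `TypeError` above is untested.

```python
import pandas as pd

# Hypothetical IDs the model was trained on
known_users = ["u1", "u2"]
known_items = ["10", "11"]

# Toy stand-in for `unread` (made-up rows)
candidates = pd.DataFrame(
    {
        "user_id": ["u1", "u1", "u3", "u2"],
        "book_id": ["10", "99", "11", "11"],
    }
)

# Keep a row only if both its user and its book are known to the model
is_known = candidates["user_id"].isin(known_users) & candidates["book_id"].isin(known_items)
scorable = candidates[is_known].reset_index(drop=True)

print(scorable)
```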

In [None]:
# Get all users from the test set and keep only those
# that were also known in the training set
# (`test_df` and `USER` come from the referenced notebook and are not defined here)

test_users = test_df[USER].unique()
test_users = np.intersect1d(test_users, total_users)

In [None]:
# === Cartesian product === #
from itertools import product

users_items = product(np.array(total_users), np.array(total_items))
users_items = pd.DataFrame(users_items, columns=["user_id", "book_id"])

In [None]:
users_items

Unnamed: 0,user_id,book_id
0,1,1
1,1,10
2,1,100
3,1,1000
4,1,10001
...,...,...
25687020,999,9990
25687021,999,9992
25687022,999,9993
25687023,999,9996
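
For the user and book counts printed earlier (16071 and 69695), a full Cartesian product would have roughly 1.12 billion rows, which may be more than fits comfortably in a single DataFrame. A minimal sketch of one alternative follows: generate candidate pairs lazily and skip books the user has already interacted with. The helper `candidate_pairs` and all IDs below are hypothetical.

```python
from itertools import islice


def candidate_pairs(users, items, seen):
    """Lazily yield (user, item) pairs the user has not interacted with."""
    for user in users:
        already_seen = seen.get(user, set())
        for item in items:
            if item not in already_seen:
                yield user, item


# Made-up IDs for illustration
users = ["u1", "u2"]
items = ["10", "11", "12"]
seen = {"u1": {"10"}, "u2": {"11", "12"}}

# Size of the full product for the counts printed earlier in the notebook
full_size = 16071 * 69695
print(f"{full_size:,} pairs in the full product")

# Pull only the first few pairs instead of materialising everything
first_pairs = list(islice(candidate_pairs(users, items, seen), 3))
print(first_pairs)
```

Pairs produced this way could be scored one batch at a time, for example one user per batch, rather than all at once.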


My search for how to generate recommendations led me to the [`score` function](https://github.com/microsoft/recommenders/blob/master/reco_utils/recommender/fastai/fastai_utils.py) in the reco_utils module, which is used in [one of the notebooks I'm referencing](https://github.com/microsoft/recommenders/blob/master/notebooks/00_quick_start/fastai_movielens.ipynb).

I copied the `score` function into the cell below to try and use some of it to write my own recommendation function.

In [None]:
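# `cc` holds the default column names used by reco_utils
# (assumes the reco_utils package from microsoft/recommenders is installed)
from reco_utils.common import constants as cc


def score(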
    learner,
    test_df,
    user_col=cc.DEFAULT_USER_COL,
    item_col=cc.DEFAULT_ITEM_COL,
    prediction_col=cc.DEFAULT_PREDICTION_COL,
    top_k=None,
):
    """Score all users+items provided and reduce to top_k items per user if top_k>0
    
    Args:
        learner (obj): Model.
        test_df (pd.DataFrame): Test dataframe.
        user_col (str): User column name.
        item_col (str): Item column name.
        prediction_col (str): Prediction column name.
        top_k (int): Number of top items to recommend.
    Returns:
        pd.DataFrame: Result of recommendation 
    """
    # replace values not known to the model with NaN
    total_users, total_items = learner.data.train_ds.x.classes.values()
    test_df.loc[~test_df[user_col].isin(total_users), user_col] = np.nan
    test_df.loc[~test_df[item_col].isin(total_items), item_col] = np.nan

    # map ids to embedding ids
    u = learner.get_idx(test_df[user_col], is_item=False)
    m = learner.get_idx(test_df[item_col], is_item=True)

    # score the pytorch model
    pred = learner.model.forward(u, m)
    scores = pd.DataFrame(
        {user_col: test_df[user_col], item_col: test_df[item_col], prediction_col: pred}
    )
    scores = scores.sort_values([user_col, prediction_col], ascending=[True, False])
    if top_k is not None:
        top_scores = scores.groupby(user_col).head(top_k).reset_index(drop=True)
    else:
        top_scores = scores
    return top_scores
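
The last step of `score` (sort, then keep the first `top_k` rows per user) is the part I most want to reuse. Here is a minimal sketch of just that step on made-up predictions; the column names and values are hypothetical and no model is involved.

```python
import pandas as pd

# Made-up predictions for two users
scores = pd.DataFrame(
    {
        "user_id": ["u1", "u1", "u1", "u2", "u2"],
        "book_id": ["10", "11", "12", "10", "12"],
        "prediction": [3.1, 4.7, 2.0, 4.2, 4.9],
    }
)

top_k = 2
# Sort by user, then by prediction from highest to lowest
ranked = scores.sort_values(["user_id", "prediction"], ascending=[True, False])
# groupby(...).head(k) keeps the first k rows of each user's sorted block
top_scores = ranked.groupby("user_id").head(top_k).reset_index(drop=True)

print(top_scores)
```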