# Preparation

In [1]:
import os

if os.path.basename(os.getcwd()) == "snorkel-tutorials":
  os.chdir("./recsys")

os.getcwd()

'/Users/scottchu/Projects/learning/snorkel-tutorials/recsys'

In [2]:
import warnings
warnings.simplefilter("ignore")

In [3]:
%pip install -r requirements.txt -q

Note: you may need to restart the kernel to use updated packages.


# Recommender System Tutorial
This tutorial considers a setting similar to the Netflix challenge, but with books instead of movies. We are given a set of users and books and, for each user, the set of books they have interacted with (read or marked as to-read). Users do not provide numerical ratings for the books they read, except in a small number of cases. Similarly, only some users have written text reviews.

The goal is to build a recommender system by training a classifier to predict whether a user will read and like any given book. The model is trained on user-book pairs to predict a `rating` (a `rating` of 1 means the user will read and like the book). To simplify inference, each user is represented by the set of books they interacted with (rather than by a learned user-specific representation). Once the model is trained, it can be used to recommend books to a user when they visit the site: predict the rating for the user paired with each of a few thousand likely books, then pick the ten books with the highest predicted ratings.
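
The "score a few thousand candidates, keep the top ten" step can be sketched as follows. This is only an illustration: `recommend_top_k`, `score_fn`, and `toy_score` are hypothetical names, the scorer stands in for the trained model, and skipping books the user has already interacted with is an assumption rather than something the tutorial prescribes.

```python
import heapq


def recommend_top_k(user_books, candidate_books, score_fn, k=10):
    """Return the k candidates with the highest predicted rating.

    score_fn is a hypothetical callable (user_books, book_idx) -> float
    standing in for the trained model's predicted probability.
    """
    # Assumption: books the user already interacted with are not recommended.
    unseen = [b for b in candidate_books if b not in user_books]
    return heapq.nlargest(k, unseen, key=lambda b: score_fn(user_books, b))


def toy_score(user_books, book_idx):
    # Toy scorer (not the tutorial's model): prefers indices close to
    # something the user has already read.
    return 1.0 / (1 + min(abs(book_idx - b) for b in user_books))


top_books = recommend_top_k({10, 20}, range(30), toy_score, k=3)
```

In practice `score_fn` would wrap a batched call to the trained model rather than scoring one book at a time.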

We will use the Goodreads dataset, from "Item Recommendation on Monotonic Behavior Chains", and "Fine-Grained Spoiler Detection from Large-Scale Review Corpora". In this dataset, we have user interactions and reviews for Young Adult novels from the Goodreads website, along with metadata (like `title` and `authors`) for the novels.

## Loading Data

Data and Context
- `user_idx`: A unique identifier for a user
- `book_idx`: A unique identifier for a book that is being rated by the user
- `book_idxs`: The set of books that the user has interacted with (read or planned to read)
- `review_text`: Optional text review written by the user for the book
- `rating`: Either `0` (the user did not read or did not like the book) or `1` (the user read and liked the book). The `rating` field is missing for `df_train`. The objective is to predict whether a given user (represented by the set of `book_idxs` the user has interacted with) will read and like any given book. That is, we want to train a model that takes a set of `book_idxs` (the user) and a single `book_idx` (the book to rate) and predicts the `rating`.
- `df_books`: Contains books with metadata for that book (`title` and `first_author`)

In [4]:
from utils import download_and_process_data

(df_train, df_test, df_dev, df_valid), df_books = download_and_process_data()

In [5]:
df_books.head()

| | authors | average_rating | book_id | country_code | description | is_ebook | language_code | ratings_count | similar_books | text_reviews_count | title | first_author | book_idx |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | [293603] | 4.35 | 10099492 | US | It all comes down to this.\nVlad's running out... | True | eng | 152 | [25861113, 7430195, 18765937, 6120544, 3247550... | 9 | Twelfth Grade Kills (The Chronicles of Vladimi... | 293603 | 0 |
| 4 | [4018722] | 3.71 | 22642971 | US | The future world is at peace.\nElla Shepherd h... | True | eng | 1525 | [20499652, 17934493, 13518102, 16210411, 17149... | 428 | The Body Electric | 4018722 | 1 |
| 5 | [6537142] | 3.89 | 31556136 | US | A gorgeously written and deeply felt literary ... | True | | 109 | [] | 45 | Like Water | 6537142 | 2 |
| 12 | [6455200, 5227552] | 3.9 | 18522274 | US | Zoe Vanderveen is on the run with her captor t... | True | en-US | 191 | [25063023, 18553080, 17567752, 18126509, 17997... | 6 | Volition (The Perception Trilogy, #2) | 6455200 | 3 |
| 13 | [187837] | 3.19 | 17262776 | US | The war is over, but for thirteen-year-old Rac... | True | eng | 248 | [16153997, 10836616, 17262238, 16074827, 13628... | 68 | Little Red Lies | 187837 | 4 |


In [6]:
df_dev.sample(frac=1, random_state=12).head()

| | user_idx | book_idxs | book_idx | rating | review_text |
|---|---|---|---|---|---|
| 214186 | 8283 | (9134, 25220, 17164, 17493, 1429, 29145, 23157... | 25220 | 1 | 4.5* |
| 357653 | 13814 | (17461, 15013, 13560, 25955, 27690, 20410, 117... | 4220 | 1 | I must say that this book will evoke different... |
| 691372 | 26865 | (13995, 24232, 19221, 2578, 6711, 8755, 8139, ... | 30827 | 1 | |
| 29419 | 1169 | (15660, 21161, 21162, 12921, 25965, 10394, 840... | 10844 | 0 | |
| 539209 | 20822 | (7047, 8517, 18228, 16282, 25444, 18231, 9760,... | 22472 | 1 | |


## Writing Labeling Functions

In [7]:
POSITIVE = 1
NEGATIVE = 0
ABSTAIN = -1

### Theory: Author
When a user interacted with several books written by an author, there is a good chance that the user will read and like other books by the same author.

In [8]:
from snorkel.labeling.lf import labeling_function

book_to_first_author = dict(zip(df_books.book_idx, df_books.first_author))

first_author_to_books_df = df_books.groupby("first_author")[["book_idx"]].agg(set)
first_author_to_books = dict(
  zip(
    first_author_to_books_df.index,
    first_author_to_books_df.book_idx
  )
)

@labeling_function(
  resources=dict(
    book_to_first_author=book_to_first_author,
    first_author_to_books=first_author_to_books
  )
)
def shared_first_author(x, book_to_first_author, first_author_to_books):
  author = book_to_first_author[x.book_idx]
  same_author_books = first_author_to_books[author]
  num_read = len(set(x.book_idxs).intersection(same_author_books))
  return POSITIVE if num_read > 15 else ABSTAIN
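
To see the threshold in action without Snorkel, the same rule can be written as a plain function and run on a toy catalog. The names `shared_author_vote` and `min_shared`, and the two-author catalog, are made up for this illustration.

```python
ABSTAIN, POSITIVE = -1, 1


def shared_author_vote(book_idx, book_idxs, book_to_author, author_to_books, min_shared=15):
    # Count how many of the user's books share the candidate book's first author.
    same_author_books = author_to_books[book_to_author[book_idx]]
    num_read = len(set(book_idxs) & same_author_books)
    return POSITIVE if num_read > min_shared else ABSTAIN


# Toy catalog (hypothetical): author "a" wrote books 0..19, author "b" wrote 20..24.
book_to_author = {i: "a" for i in range(20)}
book_to_author.update({i: "b" for i in range(20, 25)})
author_to_books = {"a": set(range(20)), "b": set(range(20, 25))}

heavy_reader = tuple(range(16))  # 16 books by author "a"
vote_same_author = shared_author_vote(19, heavy_reader, book_to_author, author_to_books)
vote_other_author = shared_author_vote(22, heavy_reader, book_to_author, author_to_books)
```

With 16 shared books the function votes `POSITIVE` for book 19, and it abstains on book 22 because the user has read nothing by author "b". The LF never votes `NEGATIVE`: having read few books by an author says little about whether the user would dislike another one.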

### Theory: Review
We can also use the long text reviews written by users to guess whether they liked or disliked a book. For example, a review containing the text `4.5 STARS` indicates that the user liked the book. We write a simple LF that looks for similar phrases to guess the user's rating of a book. We interpret 4 stars or above as a positive rating and anything below 4 stars as negative.

In [9]:
low_rating_stars = [
  "one star",
  "two star",
  "three star",

  "1 star",
  "2 star",
  "2.5 star",
  "3 star",
  "3.5 star",

  "1 out of 5 ",
  "2 out of 5 ",
  "3 out of 5 "
]

high_rating_stars = [
  "four stars",
  "five stars",
  "4 stars",
  "4.5 stars",
  "5 stars"
]

@labeling_function(
  resources=dict(
    low_rating_stars=low_rating_stars,
    high_rating_stars=high_rating_stars
  )
)
def stars_in_review(x, low_rating_stars, high_rating_stars):
  if not isinstance(x.review_text, str):
    return ABSTAIN

  review_text = x.review_text.lower()

  for low_rating_star in low_rating_stars:
    if low_rating_star in review_text:
      return NEGATIVE

  for high_rating_star in high_rating_stars:
    if high_rating_star in review_text:
      return POSITIVE

  return ABSTAIN
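
The order of the two loops matters because the match is a plain substring test: `"3.5 stars"` also contains `"5 stars"`. A minimal sketch of the same logic as a standalone function, using a shortened subset of the phrase lists (the names `LOW`, `HIGH`, and `stars_vote` are illustrative), makes this visible.

```python
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

# Shortened subsets of the phrase lists, for illustration only.
LOW = ["2.5 star", "3 star", "3.5 star"]
HIGH = ["4 stars", "4.5 stars", "5 stars"]


def stars_vote(review_text, low=LOW, high=HIGH):
    # Missing reviews are not strings (for example NaN), so abstain on them.
    if not isinstance(review_text, str):
        return ABSTAIN
    text = review_text.lower()
    # Low phrases are checked first so "3.5 stars" is not caught by "5 stars".
    if any(phrase in text for phrase in low):
        return NEGATIVE
    if any(phrase in text for phrase in high):
        return POSITIVE
    return ABSTAIN


vote_high = stars_vote("4.5 STARS")
vote_low = stars_vote("3.5 stars, almost great")
vote_missing = stars_vote(float("nan"))
vote_plain = stars_vote("loved it")
```

Here `vote_high` is `POSITIVE`, `vote_low` is `NEGATIVE`, and the last two abstain. Checking the high phrases first would flip the second vote to `POSITIVE`.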

### Theory: Sentiment
We run `TextBlob` on the reviews and use its polarity and subjectivity scores to estimate the user's rating for the book. The thresholds were picked by analyzing the score distributions and running error analysis.

In [10]:
from snorkel.preprocess import preprocessor
from textblob import TextBlob

@preprocessor(memoize=True)
def textblob_polarity(x):
  if isinstance(x.review_text, str):
    x.blob = TextBlob(x.review_text)
  else:
    x.blob = None

  return x

@labeling_function(pre=[textblob_polarity])
def polarity_positive(x):
  return POSITIVE if x.blob and x.blob.polarity > 0.3 else ABSTAIN

@labeling_function(pre=[textblob_polarity])
def subjectivity_positive(x):
  return POSITIVE if x.blob and x.blob.subjectivity > 0.75 else ABSTAIN

@labeling_function(pre=[textblob_polarity])
def polarity_negative(x):
  return NEGATIVE if x.blob and x.blob.polarity < 0.0 else ABSTAIN
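
The three sentiment LFs vote independently, so a single review can trigger more than one of them. The sketch below applies the same thresholds to hand-picked scores instead of calling `TextBlob`; `sentiment_votes` and the example scores are hypothetical.

```python
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1


def sentiment_votes(polarity, subjectivity):
    # Same thresholds as the labeling functions above, applied to raw scores.
    return {
        "polarity_positive": POSITIVE if polarity > 0.3 else ABSTAIN,
        "subjectivity_positive": POSITIVE if subjectivity > 0.75 else ABSTAIN,
        "polarity_negative": NEGATIVE if polarity < 0.0 else ABSTAIN,
    }


# Hypothetical scores: a strongly opinionated but negative review.
conflicting = sentiment_votes(polarity=-0.2, subjectivity=0.9)
# Hypothetical scores: a mildly positive, fairly neutral review.
silent = sentiment_votes(polarity=0.1, subjectivity=0.4)
```

For the first review `polarity_negative` votes `NEGATIVE` while `subjectivity_positive` votes `POSITIVE`, which is the kind of disagreement the label model has to resolve. For the second review all three LFs abstain.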



In [11]:
from snorkel.labeling import PandasLFApplier, LFAnalysis

lfs = [
  stars_in_review,
  shared_first_author,
  polarity_positive,
  polarity_negative,
  subjectivity_positive
]

applier = PandasLFApplier(lfs=lfs)
L_dev = applier.apply(df_dev)

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7881/7881 [00:02<00:00, 2826.31it/s]


In [12]:
LFAnalysis(L=L_dev, lfs=lfs).lf_summary(df_dev.rating.values)

| | j | Polarity | Coverage | Overlaps | Conflicts | Correct | Incorrect | Emp. Acc. |
|---|---|---|---|---|---|---|---|---|
| stars_in_review | 0 | [0, 1] | 0.016368 | 0.004822 | 0.001903 | 98 | 31 | 0.75969 |
| shared_first_author | 1 | [1] | 0.046948 | 0.000888 | 0.000508 | 222 | 148 | 0.6 |
| polarity_positive | 2 | [1] | 0.046948 | 0.013323 | 0.000634 | 305 | 65 | 0.824324 |
| polarity_negative | 3 | [0] | 0.017764 | 0.005202 | 0.004695 | 92 | 48 | 0.657143 |
| subjectivity_positive | 4 | [1] | 0.02081 | 0.01548 | 0.004187 | 117 | 47 | 0.713415 |
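
The Coverage and Emp. Acc. columns can be reproduced from the Correct and Incorrect counts. The sketch below assumes `df_dev` has 7881 rows (the figure shown in the applier's progress bar) and that every dev row has a gold `rating`; `coverage_and_accuracy` is an illustrative helper, not part of Snorkel.

```python
N_DEV = 7881  # assumed number of rows in df_dev, read off the progress bar

# (correct, incorrect) counts copied from the summary above
lf_counts = {
    "stars_in_review": (98, 31),
    "shared_first_author": (222, 148),
    "polarity_positive": (305, 65),
    "polarity_negative": (92, 48),
    "subjectivity_positive": (117, 47),
}


def coverage_and_accuracy(correct, incorrect, n_rows):
    labeled = correct + incorrect  # rows where the LF did not abstain
    return labeled / n_rows, correct / labeled


summary = {
    name: coverage_and_accuracy(correct, incorrect, N_DEV)
    for name, (correct, incorrect) in lf_counts.items()
}
```

Every LF labels fewer than 5% of the dev rows, so most data points receive no label at all from any single LF.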


### Applying labeling functions to the training set

In [13]:
from snorkel.labeling.model import LabelModel

L_train = applier.apply(df_train)
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train, n_epochs=5000, seed=123, log_freq=20, lr=0.01)
preds_train = label_model.predict(L_train)

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 797088/797088 [04:10<00:00, 3182.94it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5000/5000 [00:00<00:00, 5668.79epoch/s]


In [14]:
from snorkel.labeling import filter_unlabeled_dataframe

df_train_filtered, preds_train_filtered = filter_unlabeled_dataframe(
  df_train, preds_train, L_train)

df_train_filtered["rating"] = preds_train_filtered
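
`filter_unlabeled_dataframe` drops the data points on which every labeling function abstained, since they carry no training signal. A plain-Python sketch of that filtering, with made-up rows and labels (`filter_unlabeled` is an illustrative stand-in, not Snorkel's implementation):

```python
ABSTAIN = -1


def filter_unlabeled(rows, labels, label_matrix):
    # Keep a row only if at least one labeling function voted on it.
    keep = [any(vote != ABSTAIN for vote in votes) for votes in label_matrix]
    kept_rows = [row for row, flag in zip(rows, keep) if flag]
    kept_labels = [label for label, flag in zip(labels, keep) if flag]
    return kept_rows, kept_labels


rows = ["pair_a", "pair_b", "pair_c"]
label_matrix = [
    [-1, 1, -1],   # one LF voted POSITIVE
    [-1, -1, -1],  # every LF abstained
    [0, -1, 0],    # two LFs voted NEGATIVE
]
labels = [1, -1, 0]  # hypothetical label-model outputs
kept_rows, kept_labels = filter_unlabeled(rows, labels, label_matrix)
```

Only `pair_a` and `pair_c` survive. Given the low coverage of the LFs, the filtered training set is expected to be much smaller than `df_train`.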

### Rating Prediction Model
We use a Keras model to predict ratings given a user's book list and a book (the one being rated). The model represents the list of books the user interacted with, `book_idxs`, by learning an embedding for each idx and averaging the embeddings in `book_idxs`. It learns another embedding for `book_idx`, the book to be rated. It then concatenates the two embeddings and uses an MLP to compute the probability of the `rating` being 1. This type of model is common in large-scale recommender systems.

In [15]:
import numpy as np
import tensorflow as tf
from utils import precision_batch, recall_batch, f1_batch

n_books = max([max(df.book_idx) for df in [df_train, df_test, df_dev, df_valid]])

# Keras model to predict rating given book_idxs and book_idx
def get_model(embed_dim=64, hidden_layer_sizes=(32,)):
  # Compute embedding for book_idxs
  len_book_idxs = tf.keras.layers.Input([])
  book_idxs = tf.keras.layers.Input([None])

  # book_idxs % n_books is to prevent crashing if a book_idx in book_idxs is > n_books.
  book_idxs_emb = tf.keras.layers.Embedding(n_books, embed_dim)(book_idxs % n_books)
  book_idxs_emb = tf.math.divide(
    tf.keras.backend.sum(book_idxs_emb, axis=1),
    tf.expand_dims(len_book_idxs, 1)
  )

  # Compute embedding for book_idx
  book_idx = tf.keras.layers.Input([])
  book_idx_emb = tf.keras.layers.Embedding(n_books, embed_dim)(book_idx)

  input_layer = tf.keras.layers.concatenate([book_idxs_emb, book_idx_emb], 1)

  # Build Multi Layer Perceptron on input layer.
  cur_layer = input_layer

  for size in hidden_layer_sizes:
    # Chain the layers: each hidden layer consumes the previous layer's output.
    cur_layer = tf.keras.layers.Dense(size, activation=tf.nn.relu)(cur_layer)

  output_layer = tf.keras.layers.Dense(1, activation=tf.nn.sigmoid)(cur_layer)

  # Create and compile keras model
  model = tf.keras.Model(
    inputs=[
      len_book_idxs,
      book_idxs,
      book_idx
    ],
    outputs=[
      output_layer
    ]
  )

  model.compile(
    "Adagrad",
    "binary_crossentropy",
    metrics=["accuracy", f1_batch, precision_batch, recall_batch]
  )

  return model
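
Stripped of Keras, the forward pass is: average the embeddings of the user's books, concatenate the result with the embedding of the candidate book, and pass it through an MLP that ends in a sigmoid. The pure-Python sketch below uses tiny, randomly initialized, untrained weights; all names and sizes are assumptions for illustration and it is not a substitute for the model above.

```python
import math
import random


def relu(value):
    return max(0.0, value)


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def mean_vectors(vectors):
    # Element-wise average of equally sized vectors.
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def dense(inputs, weights, biases, activation):
    # One fully connected layer: activation(W @ inputs + b).
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


rng = random.Random(0)


def random_matrix(n_rows, n_cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_cols)] for _ in range(n_rows)]


N_BOOKS, EMBED_DIM, HIDDEN = 50, 4, 3  # toy sizes, not the tutorial's
history_table = random_matrix(N_BOOKS, EMBED_DIM)  # embeddings for book_idxs
target_table = random_matrix(N_BOOKS, EMBED_DIM)   # separate embeddings for book_idx
w_hidden, b_hidden = random_matrix(HIDDEN, 2 * EMBED_DIM), [0.0] * HIDDEN
w_out, b_out = random_matrix(1, HIDDEN), [0.0]


def predict_rating_probability(book_idxs, book_idx):
    user_vector = mean_vectors([history_table[i] for i in book_idxs])
    features = user_vector + target_table[book_idx]  # list concatenation
    hidden = dense(features, w_hidden, b_hidden, relu)
    return dense(hidden, w_out, b_out, sigmoid)[0]


probability = predict_rating_probability((3, 7, 11), 20)
```

Because the user is just the mean of book embeddings, the same weights serve users with any number of books, including users who were never seen during training.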

  

We use triples of (`book_idxs`, `book_idx`, `rating`) from our dataframes as training data points. In addition, we want to train the model to recognize when a user will not read a book. To create data points for that, we randomly sample a `book_idx` not in `book_idxs` and use it with a `rating` of 0 as a random negative data point. We do this for every positive (`rating` 1) data point in our dataframe so that positive and negative data points are roughly balanced.
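
The sampling step on its own can be sketched without TensorFlow. The helper names `sample_negative` and `with_negatives` are illustrative, and the sketch assumes no user has interacted with every book (otherwise the rejection loop would never terminate).

```python
import random


def sample_negative(book_idxs, n_books, rng):
    # Rejection sampling: redraw until the book is not in the user's history.
    seen = set(book_idxs)
    candidate = rng.randrange(n_books)
    while candidate in seen:
        candidate = rng.randrange(n_books)
    return candidate


def with_negatives(points, n_books, seed=0):
    # Emit every data point, plus one random negative per positive.
    rng = random.Random(seed)
    augmented = []
    for book_idxs, book_idx, rating in points:
        augmented.append((book_idxs, book_idx, rating))
        if rating == 1:
            negative = sample_negative(book_idxs, n_books, rng)
            augmented.append((book_idxs, negative, 0))
    return augmented


points = [((1, 2, 3), 2, 1), ((4, 5), 9, 0)]
augmented = with_negatives(points, n_books=10)
```

The positive point gains a sampled negative while the existing negative point is passed through unchanged, so `augmented` has three entries.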

In [16]:
# Generator to turn dataframe into data points.
def get_data_points_generator(df):
    def generator():
        for book_idxs, book_idx, rating in zip(df.book_idxs, df.book_idx, df.rating):
            # Remove book_idx from book_idxs so the model can't just look it up.
            book_idxs = tuple(filter(lambda x: x != book_idx, book_idxs))
            yield {
                "len_book_idxs": len(book_idxs),
                "book_idxs": book_idxs,
                "book_idx": book_idx,
                "label": rating,
            }
            if rating == 1:
                # Generate a random negative book_id not in book_idxs.
                random_negative = np.random.randint(0, n_books)
                while random_negative in book_idxs:
                    random_negative = np.random.randint(0, n_books)
                yield {
                    "len_book_idxs": len(book_idxs),
                    "book_idxs": book_idxs,
                    "book_idx": random_negative,
                    "label": 0,
                }

    return generator


def get_data_tensors(df):
    # Use generator to get data points each epoch, along with shuffling and batching.
    padded_shapes = {
        "len_book_idxs": [],
        "book_idxs": [None],
        "book_idx": [],
        "label": [],
    }
    dataset = (
        tf.data.Dataset.from_generator(
            get_data_points_generator(df), {k: tf.int64 for k in padded_shapes}
        )
        .shuffle(123)
        .repeat(None)
        .padded_batch(batch_size=256, padded_shapes=padded_shapes)
    )
    tensor_dict = tf.compat.v1.data.make_one_shot_iterator(dataset).get_next()
    return (
        (
            tensor_dict["len_book_idxs"],
            tensor_dict["book_idxs"],
            tensor_dict["book_idx"],
        ),
        tensor_dict["label"],
    )
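
Users have histories of different lengths, so each batch is padded to the longest `book_idxs` in it. The sketch below mimics that with plain lists, assuming zero is the padding value (the default for numeric types in `padded_batch`); `pad_batch` is a hypothetical helper.

```python
def pad_batch(book_idx_lists, pad_value=0):
    """Pad every sequence to the length of the longest one in the batch."""
    lengths = [len(seq) for seq in book_idx_lists]
    width = max(lengths)
    padded = [
        list(seq) + [pad_value] * (width - len(seq)) for seq in book_idx_lists
    ]
    return padded, lengths


padded, lengths = pad_batch([(5, 9, 2), (7,), (4, 1)])
# padded  == [[5, 9, 2], [7, 0, 0], [4, 1, 0]]
# lengths == [3, 1, 2]
```

This is why the generator also yields `len_book_idxs`: the model divides the summed embeddings by the true history length rather than by the padded width.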

In [20]:
from utils import get_n_epochs

model = get_model()

X_train, Y_train = get_data_tensors(df_train_filtered)
X_valid, Y_valid = get_data_tensors(df_valid)
model.fit(
    X_train,
    Y_train,
    steps_per_epoch=300,
    validation_data=(X_valid, Y_valid),
    validation_steps=40,
    epochs=get_n_epochs(),
    verbose=1,
)

Epoch 1/30
...
Epoch 26/30

<keras.callbacks.History at 0x463368be0>

In [24]:
X_test, Y_test = get_data_tensors(df_test)

_ = model.evaluate(X_test, Y_test, steps=30)
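
`model.evaluate` reports the loss together with the metrics passed to `compile`. The exact implementations of `precision_batch`, `recall_batch`, and `f1_batch` live in `utils` and are not shown here; the sketch below only illustrates the standard definitions, assuming predictions are thresholded at 0.5.

```python
def precision_recall_f1(y_true, y_prob, threshold=0.5):
    # Threshold probabilities into hard 0/1 predictions (assumed cutoff).
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Made-up labels and predicted probabilities for four user-book pairs.
precision, recall, f1 = precision_recall_f1([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
```

In this toy batch there is one true positive, one false positive, and one false negative, so all three metrics equal 0.5.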

## Summary

## Further Readings
- [Netflix Prize data](https://www.kaggle.com/netflix-inc/netflix-prize-data)
- [Recommender system](https://en.wikipedia.org/wiki/Recommender_system)
- [Multilayer Perceptron (MLP)](https://en.wikipedia.org/wiki/Multilayer_perceptron)