### Exploring Hybrid Recommender

* [Importing libraries & loading data](#chapter1)
* [Prepare data for training](#chapter2)

In this notebook, we propose a hybrid book recommendation engine that combines collaborative filtering and content-based filtering using LightFM. The dataset used for this experiment is sparse: most users have rated only a few books, and there is little overlap between the books that different users have rated. Collaborative filtering relies on user ratings, while content-based filtering relies on book metadata; both are popular approaches to building recommendation engines. On sparse data with little rating overlap, however, either approach on its own may struggle to provide accurate and diverse recommendations.

To address these limitations, we hypothesize that a hybrid approach that combines collaborative filtering and content-based filtering using LightFM can mitigate the sparsity issue and provide more accurate and diverse recommendations. LightFM is a flexible recommendation model that supports both explicit and implicit feedback and can incorporate user and item metadata into matrix factorization, which makes it suitable for hybrid recommendation scenarios.

In this notebook, we describe the methodology for building and evaluating the hybrid book recommendation engine. We outline the dataset used, the preprocessing steps, the feature engineering for content-based filtering, and the implementation of the LightFM model. We then evaluate the hybrid approach using appropriate evaluation metrics and compare it with traditional collaborative filtering and content-based filtering methods. Finally, we discuss the results and implications of the experiment in the full project.

### Importing libraries & loading data <a class="anchor" id="chapter1"></a>

In [2]:
import sys
import os

import itertools
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scrapbook as sb

from sklearn.preprocessing import LabelEncoder

import lightfm
from lightfm import LightFM
from lightfm.data import Dataset
from lightfm import cross_validation

# Import LightFM's evaluation metrics
from lightfm.evaluation import precision_at_k as lightfm_prec_at_k
from lightfm.evaluation import recall_at_k as lightfm_recall_at_k

# Import repo's evaluation metrics
from recommenders.evaluation.python_evaluation import precision_at_k, recall_at_k

from recommenders.utils.timer import Timer
from recommenders.models.lightfm.lightfm_utils import (
    track_model_metrics, prepare_test_df, prepare_all_predictions,
    compare_metric, similar_users, similar_items)

print("System version: {}".format(sys.version))
print("LightFM version: {}".format(lightfm.__version__))

System version: 3.9.13 (tags/v3.9.13:6de2ca5, May 17 2022, 16:36:42) [MSC v.1929 64 bit (AMD64)]
LightFM version: 1.17




In [3]:
books = pd.read_csv('../data/processed/processed_books.csv')
reviews = pd.read_csv('../data/processed/processed_reviews.csv')
ratings_dist = pd.read_csv('../data/processed/processed_ratings.csv')

Define the hyperparameters and constants, following the conventions of the recommenders library.

In [4]:
# default number of recommendations
K = 10
# percentage of data used for testing
TEST_PERCENTAGE = 0.25
# model learning rate
LEARNING_RATE = 0.9
# no of latent factors
NO_COMPONENTS = 20
# no of epochs to fit model
NO_EPOCHS = 100
# no of threads to fit model
NO_THREADS = 16
# regularisation for both user and item features
ITEM_ALPHA = 1e-6
USER_ALPHA = 1e-6

# seed for pseudorandom number generation
SEED = 42

In [5]:
books

Unnamed: 0.1,Unnamed: 0,book_id,title,author,price,genres,series,publisher,year_published,current_readers,wanted_to_read,num_reviews,num_ratings,rating,awards,primary_lists,book_score,author_score
0,0,77203.The_Kite_Runner,The Kite Runner,Khaled Hosseini,8.717848,"['Fiction', 'Historical Fiction', 'Classics', ...",0,Riverhead Books,2004-05-01,42900.0,1000000.0,90,2935385,4.0,['Borders Original Voices Award for Fiction (2...,['Books That Everyone Should Read At Least Onc...,0.559392,0.064747
1,1,929.Memoirs_of_a_Geisha,Memoirs of a Geisha,Arthur Golden,12.990000,"['Fiction', 'Historical Fiction', 'Romance', '...",0,Vintage Books USA,2005-11-22,12300.0,793000.0,34,1922540,4.0,[],"['Best Books Ever', 'Best Historical Fiction',...",0.504395,0.052931
2,2,128029.A_Thousand_Splendid_Suns,A Thousand Splendid Suns,Khaled Hosseini,12.990000,"['Fiction', 'Historical Fiction', 'Contemporar...",0,Riverhead Books,2007-06-01,32700.0,760000.0,69,1417260,4.0,['British Book Award for Best Read of the Year...,"['Best Books Ever', 'Books That Everyone Shoul...",0.476958,0.064747
3,3,19063.The_Book_Thief,The Book Thief,Markus Zusak,10.990000,"['Historical Fiction', 'Fiction', 'Young Adult...",0,Alfred A. Knopf,2006-03-14,86000.0,2000000.0,134,2345385,4.0,['National Jewish Book Award for Children’s an...,"['Best Books Ever', 'Books That Everyone Shoul...",0.527355,0.034407
4,4,4214.Life_of_Pi,Life of Pi,Yann Martel,8.717848,"['Fiction', 'Fantasy', 'Classics', 'Adventure'...",0,Seal Books,2006-08-29,24900.0,726000.0,51,1544622,3.0,"['Booker Prize (2002)', 'Bollinger Everyman Wo...","['Best Books Ever', 'Books That Everyone Shoul...",0.383873,0.021261
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4982,6257,25489259-death-of-an-alchemist,Death of an Alchemist,Mary Lawrence,5.990000,"['Mystery', 'Historical Fiction', 'Fiction', '...",1,Kensington Books,2016-01-26,-1.0,-1.0,68,285,3.0,[],['Most Anticipated Historical Mysteries for 20...,0.300015,0.000022
4983,6259,52185047-the-lost-boys-of-london,The Lost Boys of London,Mary Lawrence,8.717848,"['Mystery', 'Historical Fiction', 'Historical'...",1,Red Puddle Print,2020-04-28,-1.0,-1.0,51,99,4.0,[],"['Anticipated 2020 Literary Fiction', 'Crime, ...",0.400005,0.000022
4984,6262,36445482-no-cure-for-the-dead,No Cure for the Dead,Christine Trent,12.990000,"['Mystery', 'Historical Fiction', 'Historical ...",1,Crooked Lane Books,2018-05-08,-1.0,-1.0,86,380,3.0,[],"['Historical Fiction 2018', 'Historical Myster...",0.300021,0.000005
4985,6263,15793166-the-midwife-s-tale,The Midwife's Tale,Sam Thomas,5.990000,"['Historical Fiction', 'Mystery', 'Fiction', '...",1,Minotaur Books,2013-01-08,-1.0,-1.0,421,2855,3.0,[],"['Historical Fiction 2013', 'most anticipated ...",0.300155,0.000051


In [6]:
# Attach the book characteristics (genres, price, publisher) to the reviews by book_id
characteristics_book_df = books[['book_id', 'genres', 'price', 'publisher']]
reviews = reviews.drop('Unnamed: 0', axis = 1)
reviews = reviews[['book_id', 'user_id', 'rating']]

df = reviews.merge(characteristics_book_df, on='book_id', how='right')
df

Unnamed: 0,book_id,user_id,rating,genres,price,publisher
0,77203.The_Kite_Runner,613434.0,1.0,"['Fiction', 'Historical Fiction', 'Classics', ...",8.717848,Riverhead Books
1,77203.The_Kite_Runner,31207039.0,5.0,"['Fiction', 'Historical Fiction', 'Classics', ...",8.717848,Riverhead Books
2,77203.The_Kite_Runner,84023.0,2.0,"['Fiction', 'Historical Fiction', 'Classics', ...",8.717848,Riverhead Books
3,77203.The_Kite_Runner,616569.0,5.0,"['Fiction', 'Historical Fiction', 'Classics', ...",8.717848,Riverhead Books
4,77203.The_Kite_Runner,91373.0,1.0,"['Fiction', 'Historical Fiction', 'Classics', ...",8.717848,Riverhead Books
...,...,...,...,...,...,...
104256,25489259-death-of-an-alchemist,,,"['Mystery', 'Historical Fiction', 'Fiction', '...",5.990000,Kensington Books
104257,52185047-the-lost-boys-of-london,,,"['Mystery', 'Historical Fiction', 'Historical'...",8.717848,Red Puddle Print
104258,36445482-no-cure-for-the-dead,,,"['Mystery', 'Historical Fiction', 'Historical ...",12.990000,Crooked Lane Books
104259,15793166-the-midwife-s-tale,,,"['Historical Fiction', 'Mystery', 'Fiction', '...",5.990000,Minotaur Books


Count the ratings per user and keep the 5,000 most active users

In [7]:
# Group by user_id and count ratings per user
user_ratings_count = df.groupby('user_id').size().reset_index(name='ratings_count')

# Sort by rating count in descending order
sorted_user_ratings_count = user_ratings_count.sort_values(by='ratings_count', ascending=False)

# Select the top 5000 users
top_5000_users = sorted_user_ratings_count.head(5000)
top_5000_users

Unnamed: 0,user_id,ratings_count
2879,614778.0,615
13929,4622890.0,256
26862,17438949.0,225
15498,5253785.0,212
32580,32879029.0,203
...,...,...
11246,3345952.0,3
34725,41627667.0,3
5925,1325473.0,3
11249,3347032.0,3


In [8]:
reviews_top_5000_users = pd.merge(df, top_5000_users, on='user_id', how='inner')
df = reviews_top_5000_users
df = df[['user_id', 'book_id', 'rating', 'genres', 'price', 'publisher']]

Encode book_id

In [9]:
# Instantiate a LabelEncoder object
encoder = LabelEncoder()

# Label-encode the 'book_id' column. assign() returns a new DataFrame,
# which avoids pandas' SettingWithCopyWarning on a sliced frame.
df = df.assign(book_id=encoder.fit_transform(df['book_id']))


In [10]:
df = df.dropna(subset=['rating'])
df


Unnamed: 0,user_id,book_id,rating,genres,price,publisher
0,31207039.0,3208,5.0,"['Fiction', 'Historical Fiction', 'Classics', ...",8.717848,Riverhead Books
1,31207039.0,2239,5.0,"['Nonfiction', 'Psychology', 'Philosophy', 'Hi...",8.717848,Beacon Press
2,31207039.0,1809,5.0,"['Philosophy', 'Classics', 'Nonfiction', 'Poli...",10.990000,Penguin Classics
3,31207039.0,1407,4.0,"['Nonfiction', 'History', 'Science', 'Philosop...",8.717848,Vintage
4,31207039.0,105,3.0,"['Classics', 'Fiction', 'Literature', 'Novels'...",8.717848,Vintage International
...,...,...,...,...,...,...
54036,49334904.0,2049,4.0,"['Horror', 'Anthologies', 'Short Stories']",8.717848,Mosaic Press
54037,2134638.0,1411,4.0,"['Horror', 'Short Stories', 'Anthologies']",8.717848,Cemetery Dance Publications
54038,2134638.0,2174,5.0,"['Horror', 'Anthologies', 'Short Stories']",8.717848,Cemetery Dance Publications
54039,2134638.0,2930,4.0,"['Horror', 'Anthologies', 'Short Stories']",8.717848,Cemetery Dance Publications


In [11]:
df.isna().sum()

user_id      0
book_id      0
rating       0
genres       0
price        0
publisher    0
dtype: int64
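
The `genres` column is read from CSV as the string form of a Python list, so it has to be parsed before it can serve as content-based item features. The sketch below shows one possible way to do that; the `genre_features` helper, the `genre:` prefix, and the sample rows are illustrative assumptions rather than part of the original notebook.

```python
import ast

def genre_features(rows):
    """Turn (book_id, genres_string) pairs into (book_id, [feature, ...]) pairs."""
    features = []
    for book_id, genres_str in rows:
        # "['Fiction', 'Classics']" (a string) -> ['Fiction', 'Classics'] (a list)
        genres = ast.literal_eval(genres_str)
        features.append((book_id, [f"genre:{g}" for g in genres]))
    return features

# Hypothetical sample rows in the same layout as the df columns book_id / genres
sample = [
    (3208, "['Fiction', 'Historical Fiction', 'Classics']"),
    (2049, "['Horror', 'Anthologies', 'Short Stories']"),
]

item_features = genre_features(sample)
# The set of all distinct feature names seen across the books
vocabulary = sorted({f for _, feats in item_features for f in feats})

print(item_features[0])
print(len(vocabulary))
```

In LightFM, feature names like these would typically be registered through the `item_features` argument of `Dataset.fit`, and the `(item_id, [features])` pairs passed to `Dataset.build_item_features`. Whether that matches the feature engineering used in the full project is not shown in this section.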

### Prepare data for training <a class="anchor" id="chapter2"></a>

Before fitting the LightFM model, we need to create a `Dataset` instance, which maps user and item IDs to internal indices and is used to build the interaction matrix.

In [12]:
dataset = Dataset()

In [13]:
dataset.fit(users=df['user_id'], 
            items=df['book_id'])

# quick check to determine the number of unique users and items (books) in the data
num_users, num_items = dataset.interactions_shape()
print(f'Num users: {num_users}, num items: {num_items}.')

Num users: 4959, num items: 3596.


Next, we build the interaction matrix. The `build_interactions` method returns two COO sparse matrices: the interactions matrix and the weights matrix.

In [14]:
(interactions, weights) = dataset.build_interactions(df.iloc[:, 0:3].values)
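
To make the COO (coordinate) layout concrete, here is a minimal pure-Python sketch of the idea: raw IDs are mapped to contiguous indices, and each interaction is stored as a (row, column, value) entry. It is a simplified illustration, not LightFM's implementation (LightFM builds the ID mappings in `Dataset.fit` and returns SciPy sparse matrices); the `build_coo` helper and the sample triples are assumptions made for the example.

```python
def build_coo(triples):
    """Map raw IDs to indices and collect COO-style row/col/value lists."""
    user_index, item_index = {}, {}
    rows, cols, ones, weights = [], [], [], []
    for user_id, item_id, rating in triples:
        # Assign the next free index the first time an ID is seen
        u = user_index.setdefault(user_id, len(user_index))
        i = item_index.setdefault(item_id, len(item_index))
        rows.append(u)
        cols.append(i)
        ones.append(1)          # interactions matrix: 1 wherever a rating exists
        weights.append(rating)  # weights matrix: the rating value itself
    shape = (len(user_index), len(item_index))
    return shape, rows, cols, ones, weights

# Hypothetical (user_id, book_id, rating) triples, same column order as df.iloc[:, 0:3]
triples = [
    (31207039.0, 3208, 5.0),
    (31207039.0, 2239, 5.0),
    (2134638.0, 1411, 4.0),
    (2134638.0, 3208, 3.0),
]

shape, rows, cols, ones, weights = build_coo(triples)
print(shape)
print(list(zip(rows, cols, weights)))
```

Only the non-zero entries are stored, which is what keeps a 4959 x 3596 matrix with a few tens of thousands of ratings cheap to hold in memory.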

In [15]:
train_interactions, test_interactions = cross_validation.random_train_test_split(
    interactions, test_percentage=TEST_PERCENTAGE,
    random_state=np.random.RandomState(SEED))

In [16]:
print(f"Shape of train interactions: {train_interactions.shape}")
print(f"Shape of test interactions: {test_interactions.shape}")

Shape of train interactions: (4959, 3596)
Shape of test interactions: (4959, 3596)
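
Both matrices have the same shape because the split assigns individual interactions, not whole users or books, to train or test; every user and every book keeps its row and column in both. The sketch below illustrates that idea on plain triplets. It is a simplified stand-in for `random_train_test_split`, and the `random_split` helper and sample data are assumptions made for the example.

```python
import random

def random_split(triplets, shape, test_percentage, seed):
    """Shuffle interactions and cut them into train and test sets of equal shape."""
    rng = random.Random(seed)
    shuffled = list(triplets)
    rng.shuffle(shuffled)
    # Everything before the cutoff goes to train, the rest to test
    cutoff = int((1.0 - test_percentage) * len(shuffled))
    train = {"shape": shape, "entries": shuffled[:cutoff]}
    test = {"shape": shape, "entries": shuffled[cutoff:]}
    return train, test

# Hypothetical (user_index, item_index, value) triplets for a 4 x 5 matrix
triplets = [(0, 0, 1), (0, 3, 1), (1, 1, 1), (1, 4, 1),
            (2, 2, 1), (2, 0, 1), (3, 3, 1), (3, 4, 1)]

train, test = random_split(triplets, shape=(4, 5), test_percentage=0.25, seed=42)
print(train["shape"], len(train["entries"]))
print(test["shape"], len(test["entries"]))
```

One consequence of splitting this way is that a user with very few ratings may end up with all of them in one of the two sets, which is worth keeping in mind when reading per-user metrics on sparse data.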


#### Fitting the LightFM model

In [17]:
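# Note: no_components and user_alpha are set directly here and differ from
# the NO_COMPONENTS and USER_ALPHA constants defined earlier
model1 = LightFM(loss='warp', no_components=150,
                 learning_rate=LEARNING_RATE,
                 random_state=np.random.RandomState(SEED), user_alpha=0.000005)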

In [18]:
model1 = model1.fit(train_interactions, epochs=NO_EPOCHS, num_threads=NO_THREADS, verbose=False)
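
Once the model is fitted, the precision@k and recall@k metrics imported at the top of the notebook can be used to score it. As a reminder of what these metrics measure for a single user, here is a small pure-Python sketch; the `precision_recall_at_k` helper and the sample lists are hypothetical and are not the LightFM or recommenders implementations.

```python
def precision_recall_at_k(ranked_items, relevant_items, k):
    """Compute precision@k and recall@k for one user's ranked recommendations."""
    top_k = ranked_items[:k]
    hits = sum(1 for item in top_k if item in relevant_items)
    precision = hits / k  # share of the top-k slots filled with relevant items
    # Share of the relevant items that made it into the top k
    recall = hits / len(relevant_items) if relevant_items else 0.0
    return precision, recall

# Hypothetical example: book indices ranked by predicted score for one user
ranked = [12, 7, 3, 44, 9]
# Hypothetical books the user actually rated in the held-out test set
held_out = {7, 9, 100}

p, r = precision_recall_at_k(ranked, held_out, k=5)
print(f"precision@5 = {p:.2f}, recall@5 = {r:.2f}")
```

LightFM's `precision_at_k` and `recall_at_k` compute the same kind of quantity per user from the model and the test interactions, so the reported figure is usually the mean over users.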
