### Exploring Hybrid Recommender

* [Importing libraries & loading data](#chapter1)
* [Prepare data for training](#chapter2)

In this notebook, we propose a hybrid book recommendation engine that combines collaborative filtering and content-based filtering using the LightFM algorithm. The dataset contains user ratings with little cross-rating: most users have rated only a few books, and few books have been rated by the same users. Collaborative filtering relies on user ratings, while content-based filtering relies on book metadata. Both are popular approaches for building recommendation engines, but on sparse data with limited rating overlap each can struggle on its own to provide accurate and diverse recommendations.

To address these limitations, we hypothesize that a hybrid approach combining collaborative and content-based filtering through LightFM can mitigate the sparsity problem and provide more accurate and diverse recommendations. LightFM is a flexible recommendation algorithm that can handle both explicit and implicit feedback, which makes it suitable for hybrid scenarios. This notebook describes the methodology for building and evaluating the hybrid engine: the dataset, the preprocessing steps, the feature engineering for content-based filtering, and the implementation of the LightFM model. We then evaluate the hybrid approach with appropriate metrics and compare it with traditional collaborative filtering and content-based filtering methods. The results and their implications are discussed in the full project.
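
As a rough illustration of why item metadata can help when ratings are sparse: in the LightFM formulation, a user or an item is represented as the sum of the latent vectors of its features, and the score for a pair is the dot product of the two representations plus bias terms. The sketch below reproduces that arithmetic in plain Python with made-up two-dimensional embeddings. The feature names, the numbers, and the helpers `represent` and `score` are illustrative assumptions, not values or functions taken from this notebook or from the lightfm package.

```python
def represent(features, embeddings):
    """Sum the latent vectors of the given features (hypothetical helper)."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for name in features:
        for d, value in enumerate(embeddings[name]):
            total[d] += value
    return total


def score(user_features, item_features, user_emb, item_emb, user_bias, item_bias):
    """Dot product of the summed representations plus the summed biases."""
    q_u = represent(user_features, user_emb)
    p_i = represent(item_features, item_emb)
    dot = sum(a * b for a, b in zip(q_u, p_i))
    bias = sum(user_bias[f] for f in user_features) + sum(item_bias[f] for f in item_features)
    return dot + bias


# Made-up embeddings and biases, chosen only to keep the arithmetic easy to follow
user_emb = {"user:42": [1.0, 0.5]}
item_emb = {"item:7": [0.5, 0.0], "genre:Fantasy": [0.5, 1.0]}
user_bias = {"user:42": 0.0}
item_bias = {"item:7": 0.25, "genre:Fantasy": 0.25}

# Pure collaborative filtering: the item is described by its identity only
cf_only = score(["user:42"], ["item:7"], user_emb, item_emb, user_bias, item_bias)

# Hybrid: the item is described by its identity plus a genre feature
hybrid = score(["user:42"], ["item:7", "genre:Fantasy"], user_emb, item_emb, user_bias, item_bias)

print(cf_only, hybrid)
```

Because the genre vector is shared by every book carrying that genre, a book with few ratings can still borrow signal from better-rated books in the same genre.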

### Importing libraries & loading data <a class="anchor" id="chapter1"></a>

In [75]:
import sys
import os

import itertools
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scrapbook as sb

from sklearn.preprocessing import LabelEncoder

import lightfm
from lightfm import LightFM
from lightfm.data import Dataset
from lightfm import cross_validation

# Import LightFM's evaluation metrics
from lightfm.evaluation import precision_at_k as lightfm_prec_at_k
from lightfm.evaluation import recall_at_k as lightfm_recall_at_k

# Import repo's evaluation metrics
from recommenders.evaluation.python_evaluation import precision_at_k, recall_at_k

from recommenders.utils.timer import Timer
from recommenders.models.lightfm.lightfm_utils import (
    track_model_metrics, prepare_test_df, prepare_all_predictions,
    compare_metric, similar_users, similar_items)

print("System version: {}".format(sys.version))
print("LightFM version: {}".format(lightfm.__version__))

System version: 3.9.9 | packaged by conda-forge | (main, Dec 20 2021, 02:41:37) 
[Clang 11.1.0 ]
LightFM version: 1.16


In [76]:
books = pd.read_csv('../data/processed/processed_books.csv')
reviews = pd.read_csv('../data/processed/processed_reviews.csv')
ratings_dist = pd.read_csv('../data/processed/processed_ratings.csv')

Define the hyperparameters and constants, following the conventions of the recommenders library.

In [117]:
# default number of recommendations
K = 10
# percentage of data used for testing
TEST_PERCENTAGE = 0.25
# model learning rate
LEARNING_RATE = 0.25
# no of latent factors
NO_COMPONENTS = 20
# no of epochs to fit model
NO_EPOCHS = 10000
# no of threads to fit model
NO_THREADS = 32
# regularisation for both user and item features
ITEM_ALPHA = 1e-6
USER_ALPHA = 1e-6

# seed for pseudo-random number generation
SEED = 42

In [78]:
books

Unnamed: 0.1,Unnamed: 0,book_id,title,author,price,genres,series,publisher,year_published,current_readers,wanted_to_read,num_reviews,num_ratings,rating,awards,primary_lists,book_score,author_score
0,0,77203.The_Kite_Runner,The Kite Runner,Khaled Hosseini,8.717848,"['Fiction', 'Historical Fiction', 'Classics', ...",0,Riverhead Books,2004-05-01,42900.0,1000000.0,90,2935385,4.0,['Borders Original Voices Award for Fiction (2...,['Books That Everyone Should Read At Least Onc...,0.559392,0.064747
1,1,929.Memoirs_of_a_Geisha,Memoirs of a Geisha,Arthur Golden,12.990000,"['Fiction', 'Historical Fiction', 'Romance', '...",0,Vintage Books USA,2005-11-22,12300.0,793000.0,34,1922540,4.0,[],"['Best Books Ever', 'Best Historical Fiction',...",0.504395,0.052931
2,2,128029.A_Thousand_Splendid_Suns,A Thousand Splendid Suns,Khaled Hosseini,12.990000,"['Fiction', 'Historical Fiction', 'Contemporar...",0,Riverhead Books,2007-06-01,32700.0,760000.0,69,1417260,4.0,['British Book Award for Best Read of the Year...,"['Best Books Ever', 'Books That Everyone Shoul...",0.476958,0.064747
3,3,19063.The_Book_Thief,The Book Thief,Markus Zusak,10.990000,"['Historical Fiction', 'Fiction', 'Young Adult...",0,Alfred A. Knopf,2006-03-14,86000.0,2000000.0,134,2345385,4.0,['National Jewish Book Award for Children’s an...,"['Best Books Ever', 'Books That Everyone Shoul...",0.527355,0.034407
4,4,4214.Life_of_Pi,Life of Pi,Yann Martel,8.717848,"['Fiction', 'Fantasy', 'Classics', 'Adventure'...",0,Seal Books,2006-08-29,24900.0,726000.0,51,1544622,3.0,"['Booker Prize (2002)', 'Bollinger Everyman Wo...","['Best Books Ever', 'Books That Everyone Shoul...",0.383873,0.021261
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4982,6257,25489259-death-of-an-alchemist,Death of an Alchemist,Mary Lawrence,5.990000,"['Mystery', 'Historical Fiction', 'Fiction', '...",1,Kensington Books,2016-01-26,-1.0,-1.0,68,285,3.0,[],['Most Anticipated Historical Mysteries for 20...,0.300015,0.000022
4983,6259,52185047-the-lost-boys-of-london,The Lost Boys of London,Mary Lawrence,8.717848,"['Mystery', 'Historical Fiction', 'Historical'...",1,Red Puddle Print,2020-04-28,-1.0,-1.0,51,99,4.0,[],"['Anticipated 2020 Literary Fiction', 'Crime, ...",0.400005,0.000022
4984,6262,36445482-no-cure-for-the-dead,No Cure for the Dead,Christine Trent,12.990000,"['Mystery', 'Historical Fiction', 'Historical ...",1,Crooked Lane Books,2018-05-08,-1.0,-1.0,86,380,3.0,[],"['Historical Fiction 2018', 'Historical Myster...",0.300021,0.000005
4985,6263,15793166-the-midwife-s-tale,The Midwife's Tale,Sam Thomas,5.990000,"['Historical Fiction', 'Mystery', 'Fiction', '...",1,Minotaur Books,2013-01-08,-1.0,-1.0,421,2855,3.0,[],"['Historical Fiction 2013', 'most anticipated ...",0.300155,0.000051


In [79]:
# Attach the book characteristics (genres, price, publisher) to the reviews by book_id
characteristics_book_df = books[['book_id', 'genres', 'price', 'publisher']]
reviews = reviews.drop('Unnamed: 0', axis = 1)
reviews = reviews[['book_id', 'user_id', 'rating']]

df = reviews.merge(characteristics_book_df, on='book_id', how='right')
df

Unnamed: 0,book_id,user_id,rating,genres,price,publisher
0,77203.The_Kite_Runner,613434.0,1.0,"['Fiction', 'Historical Fiction', 'Classics', ...",8.717848,Riverhead Books
1,77203.The_Kite_Runner,31207039.0,5.0,"['Fiction', 'Historical Fiction', 'Classics', ...",8.717848,Riverhead Books
2,77203.The_Kite_Runner,84023.0,2.0,"['Fiction', 'Historical Fiction', 'Classics', ...",8.717848,Riverhead Books
3,77203.The_Kite_Runner,616569.0,5.0,"['Fiction', 'Historical Fiction', 'Classics', ...",8.717848,Riverhead Books
4,77203.The_Kite_Runner,91373.0,1.0,"['Fiction', 'Historical Fiction', 'Classics', ...",8.717848,Riverhead Books
...,...,...,...,...,...,...
104256,25489259-death-of-an-alchemist,,,"['Mystery', 'Historical Fiction', 'Fiction', '...",5.990000,Kensington Books
104257,52185047-the-lost-boys-of-london,,,"['Mystery', 'Historical Fiction', 'Historical'...",8.717848,Red Puddle Print
104258,36445482-no-cure-for-the-dead,,,"['Mystery', 'Historical Fiction', 'Historical ...",12.990000,Crooked Lane Books
104259,15793166-the-midwife-s-tale,,,"['Historical Fiction', 'Mystery', 'Fiction', '...",5.990000,Minotaur Books


In [80]:
df.dtypes

book_id       object
user_id      float64
rating       float64
genres        object
price        float64
publisher     object
dtype: object

In [81]:
# Take an explicit copy so the assignment below does not trigger SettingWithCopyWarning
df = df.dropna().copy()
df["user_id"] = df["user_id"].astype(int)


Count the ratings per user and keep the 5000 most active users

In [82]:
# Group by user_id and count ratings per user
user_ratings_count = df.groupby('user_id').size().reset_index(name='ratings_count')

# Sort by rating count in descending order
sorted_user_ratings_count = user_ratings_count.sort_values(by='ratings_count', ascending=False)

# Select the top 5000 users
top_5000_users = sorted_user_ratings_count.head(5000)
top_5000_users

Unnamed: 0,user_id,ratings_count
2814,614778,610
13579,4622890,246
26184,17438949,224
15097,5253785,212
37412,60866073,203
...,...,...
16086,5721271,3
35092,47555322,3
19473,7645532,3
16130,5743302,3


In [83]:
reviews_top_5000_users = pd.merge(df, top_5000_users, on='user_id', how='inner')
df = reviews_top_5000_users
df = df[['user_id', 'book_id', 'rating', 'genres', 'price', 'publisher']]

Encode book_id

In [84]:
# Instantiate a LabelEncoder object
encoder = LabelEncoder()

# Label-encode the 'book_id' column as consecutive integers;
# assign() returns a new DataFrame, which avoids SettingWithCopyWarning
df = df.assign(book_id=encoder.fit_transform(df['book_id']))


In [85]:
df = df.dropna(subset=['rating'])
df


Unnamed: 0,user_id,book_id,rating,genres,price,publisher
0,31207039,3194,5.0,"['Fiction', 'Historical Fiction', 'Classics', ...",8.717848,Riverhead Books
1,31207039,2232,5.0,"['Nonfiction', 'Psychology', 'Philosophy', 'Hi...",8.717848,Beacon Press
2,31207039,1804,5.0,"['Philosophy', 'Classics', 'Nonfiction', 'Poli...",10.990000,Penguin Classics
3,31207039,1403,4.0,"['Nonfiction', 'History', 'Science', 'Philosop...",8.717848,Vintage
4,31207039,104,3.0,"['Classics', 'Fiction', 'Literature', 'Novels'...",8.717848,Vintage International
...,...,...,...,...,...,...
52773,49334904,2044,4.0,"['Horror', 'Anthologies', 'Short Stories']",8.717848,Mosaic Press
52774,2134638,1407,4.0,"['Horror', 'Short Stories', 'Anthologies']",8.717848,Cemetery Dance Publications
52775,2134638,2167,5.0,"['Horror', 'Anthologies', 'Short Stories']",8.717848,Cemetery Dance Publications
52776,2134638,2917,4.0,"['Horror', 'Anthologies', 'Short Stories']",8.717848,Cemetery Dance Publications


In [86]:
df.isna().sum()

user_id      0
book_id      0
rating       0
genres       0
price        0
publisher    0
dtype: int64

In [87]:
df.drop_duplicates(subset=['user_id', 'book_id'], inplace=True)
df.duplicated().sum()

0

### Prepare data for training <a class="anchor" id="chapter2"></a>

Before fitting the LightFM model, we need to create an instance of Dataset which holds the interaction matrix.

In [88]:
dataset = Dataset()

In [89]:
dataset.fit(users=df['user_id'], 
            items=df['book_id'])

# quick check to determine the number of unique users and items in the data
num_users, num_topics = dataset.interactions_shape()
print(f'Num users: {num_users}, num_topics: {num_topics}.')

Num users: 5000, num_topics: 3596.


Next, we build the interaction matrix. The build_interactions method returns two COO sparse matrices: the interactions matrix and the weights matrix.

In [90]:
(interactions, weights) = dataset.build_interactions(df.iloc[:, 0:3].values)

In [91]:
train_interactions, test_interactions = cross_validation.random_train_test_split(
    interactions, test_percentage=TEST_PERCENTAGE,
    random_state=np.random.RandomState(SEED))

In [92]:
print(f"Shape of train interactions: {train_interactions.shape}")
print(f"Shape of test interactions: {test_interactions.shape}")

Shape of train interactions: (5000, 3596)
Shape of test interactions: (5000, 3596)
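
Both splits keep the full 5000 x 3596 shape; only the stored non-zero entries are divided between them. To put the sparsity mentioned in the introduction into numbers, the density of such a matrix is the number of stored interactions divided by the number of cells. A minimal sketch follows. The interaction count of 50,000 is a placeholder assumption, not a figure measured in this notebook (the real value would be `interactions.nnz`), and `density` is a hypothetical helper.

```python
def density(n_users, n_items, n_interactions):
    """Fraction of user-item cells that hold an observed interaction."""
    return n_interactions / (n_users * n_items)


# Shape taken from the output above; the interaction count is a placeholder
d = density(n_users=5000, n_items=3596, n_interactions=50_000)
print(f"density: {d:.4%}")
```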


#### Fitting the LightFM model

In [118]:
model1 = LightFM(loss='warp', no_components=NO_COMPONENTS,
                 learning_rate=LEARNING_RATE,
                 random_state=np.random.RandomState(SEED))

In [119]:
model1 = model1.fit(train_interactions, epochs=NO_EPOCHS, num_threads=NO_THREADS, verbose=False)

In [95]:
uids, iids, interaction_data = cross_validation._shuffle(
    interactions.row, interactions.col, interactions.data, 
    random_state=np.random.RandomState(SEED))

cutoff = int((1.0 - TEST_PERCENTAGE) * len(uids))
test_idx = slice(cutoff, None)
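
The cell above repeats the shuffle with the same seed that was used for the train/test split, so that the test rows can be recovered by position: everything from `cutoff` onward is treated as the test portion. The idea can be sketched in plain Python as follows. `split_indices` is a hypothetical helper built on the standard library's `random` module, so it will not reproduce the exact permutation that NumPy's RandomState produces.

```python
import random


def split_indices(n, test_percentage, seed):
    """Shuffle 0..n-1 with a fixed seed and cut the result into train and test parts."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cutoff = int((1.0 - test_percentage) * n)
    return indices[:cutoff], indices[cutoff:]


train_idx_demo, test_idx_demo = split_indices(100, test_percentage=0.25, seed=42)
print(len(train_idx_demo), len(test_idx_demo))
```

Re-running the helper with the same seed yields the same partition, which is the property the notebook relies on.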

In [96]:
print(type(test_idx))

<class 'slice'>


In [97]:
uid_map, ufeature_map, iid_map, ifeature_map = dataset.mapping()

In [98]:
with Timer() as test_time:
    test_df = prepare_test_df(test_idx, uids, iids, uid_map, iid_map, weights)
print(f"Took {test_time.interval:.1f} seconds for prepare and predict test data.")  
time_reco1 = test_time.interval

Took 1.2 seconds for prepare and predict test data.


In [99]:
test_df.sample(5, random_state=SEED)

Unnamed: 0,userID,itemID,rating
6170,4213258,505,4.0
8108,14684638,1221,5.0
2447,50424456,1332,4.0
3132,52545435,2648,4.0
6778,84049050,2725,4.0


In [100]:
df = df.rename(columns={"user_id": "userID",'book_id': 'itemID'})

In [101]:
# with Timer() as test_time:
#     all_predictions = prepare_all_predictions(df, uid_map, iid_map, 
#                                               interactions=train_interactions,
#                                               model=model1, 
#                                               num_threads=NO_THREADS)
# print(f"Took {test_time.interval:.1f} seconds for prepare and predict all data.")
# time_reco2 = test_time.interval

In [102]:
all_predictions = pd.read_csv('../data/processed/all_predictions.csv')

In [103]:
# Drop the index columns left over from earlier CSV round trips
all_predictions = all_predictions.drop(columns=['Unnamed: 0', 'Unnamed: 0.1'])


In [104]:
all_predictions

Unnamed: 0,userID,itemID,prediction
0,31207039,1804,-1.450593e+07
1,31207039,2115,-1.137490e+09
2,31207039,2559,-3.342550e+06
3,31207039,1981,-2.849655e+07
4,31207039,3542,-2.127561e+08
...,...,...,...
7667008,1935255,1277,6.849396e+07
7667009,1935255,2980,-5.445119e+06
7667010,1935255,1298,-1.023283e+08
7667011,1935255,20,5.868065e+06


In [126]:
with Timer() as test_time:
    eval_precision_lfm = lightfm_prec_at_k(model1, test_interactions, 
                                           train_interactions, k=K).mean()
    eval_recall_lfm = lightfm_recall_at_k(model1, test_interactions, 
                                          train_interactions, k=K).mean()
time_lfm = test_time.interval
    
print(
    "\n------ Using LightFM evaluation methods ------",
    f"Precision@K:\t{eval_precision_lfm:.6f}",
    f"Recall@K:\t{eval_recall_lfm:.6f}", 
    sep='\n')


------ Using LightFM evaluation methods ------
Precision@K:	0.008241
Recall@K:	0.020841
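
For reference, precision@K is the share of the top K recommendations that appear among a user's held-out positives, and recall@K is the share of those held-out positives that make it into the top K. The values printed above are averages over users. A small worked example follows, with invented item IDs and a hypothetical helper `precision_recall_at_k`.

```python
def precision_recall_at_k(ranked_items, relevant_items, k):
    """Compute precision@k and recall@k for a single user."""
    top_k = ranked_items[:k]
    hits = sum(1 for item in top_k if item in relevant_items)
    precision = hits / k
    recall = hits / len(relevant_items) if relevant_items else 0.0
    return precision, recall


ranked = [3, 1, 4, 5, 9]   # invented ranking, best first
relevant = {1, 9, 7}       # invented held-out positives
p_at_5, r_at_5 = precision_recall_at_k(ranked, relevant, k=5)
print(p_at_5, r_at_5)
```

Read this way, a precision@10 of about 0.008 corresponds to roughly 0.08 held-out books per top-10 list on average.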


In [128]:
df

Unnamed: 0,userID,itemID,rating,genres,price,publisher
0,31207039,3194,5.0,"['Fiction', 'Historical Fiction', 'Classics', ...",8.717848,Riverhead Books
1,31207039,2232,5.0,"['Nonfiction', 'Psychology', 'Philosophy', 'Hi...",8.717848,Beacon Press
2,31207039,1804,5.0,"['Philosophy', 'Classics', 'Nonfiction', 'Poli...",10.990000,Penguin Classics
3,31207039,1403,4.0,"['Nonfiction', 'History', 'Science', 'Philosop...",8.717848,Vintage
4,31207039,104,3.0,"['Classics', 'Fiction', 'Literature', 'Novels'...",8.717848,Vintage International
...,...,...,...,...,...,...
52773,49334904,2044,4.0,"['Horror', 'Anthologies', 'Short Stories']",8.717848,Mosaic Press
52774,2134638,1407,4.0,"['Horror', 'Short Stories', 'Anthologies']",8.717848,Cemetery Dance Publications
52775,2134638,2167,5.0,"['Horror', 'Anthologies', 'Short Stories']",8.717848,Cemetery Dance Publications
52776,2134638,2917,4.0,"['Horror', 'Anthologies', 'Short Stories']",8.717848,Cemetery Dance Publications


In [129]:
import ast 

def extract_genre_counts(df, column_name):
    """
    Extracts genre counts from a column of lists that are represented as strings in a pandas DataFrame.
    
    Args:
    - df (pandas.DataFrame): The DataFrame that contains the column of interest.
    - column_name (str): The name of the column of interest.
    
    Returns:
    - dict: A dictionary with the count of each genre found in the column.
    """
    # Extract the column as a Series
    column = df[column_name]
    
    # Convert the strings representing lists to actual lists
    column = column.apply(ast.literal_eval)
    
    # Count the occurrences of each genre in the column
    genre_counts = {}
    for lst in column:
        for genre in lst:
            if genre in genre_counts:
                genre_counts[genre] += 1
            else:
                genre_counts[genre] = 1
    
    return genre_counts

genre_counts = extract_genre_counts(df, 'genres')

print(genre_counts)


{'Fiction': 38158, 'Historical Fiction': 10858, 'Classics': 11340, 'Contemporary': 10250, 'Novels': 4027, 'Historical': 7883, 'Literature': 5296, 'Nonfiction': 5360, 'Psychology': 494, 'Philosophy': 1035, 'History': 3342, 'Self Help': 371, 'Memoir': 2117, 'Biography': 2424, 'Politics': 1648, 'School': 1441, 'Science': 761, 'Anthropology': 165, 'Audiobook': 9440, 'American': 728, '20th Century': 328, 'Southern Gothic': 62, 'Read For School': 123, 'Feminism': 929, 'Essays': 479, 'Writing': 150, 'Africa': 740, 'Spirituality': 164, 'Religion': 565, 'Buddhism': 42, 'Fantasy': 17886, 'Adventure': 5417, 'High Fantasy': 1889, 'Science Fiction Fantasy': 4026, 'Epic Fantasy': 1551, 'Mystery': 12023, 'Thriller': 8103, 'Mystery Thriller': 5880, 'Suspense': 3853, 'Adult': 5316, 'France': 626, 'Science Fiction': 7631, 'Space Opera': 1841, 'Young Adult': 10991, 'Childrens': 4727, 'Middle Grade': 3411, 'Christian': 175, 'Poetry': 1791, 'Mythology': 1712, 'American History': 568, 'Social Justice': 630,
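
The same tally can be written more compactly with `collections.Counter`, which also provides `most_common` for picking the top genres. The sketch below uses three invented rows rather than the notebook's DataFrame, and `count_genres` is a hypothetical helper.

```python
import ast
from collections import Counter


def count_genres(genre_strings):
    """Count genres across strings that each hold the repr of a Python list."""
    counts = Counter()
    for raw in genre_strings:
        counts.update(ast.literal_eval(raw))
    return counts


# Invented sample rows in the same string format as the 'genres' column
sample = ["['Fiction', 'Classics']", "['Fiction', 'Fantasy']", "['Fantasy', 'Fiction']"]
counts = count_genres(sample)
top_2 = [genre for genre, _ in counts.most_common(2)]
print(counts, top_2)
```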

In [130]:
# Sort the genre counts by the count in descending order
sorted_genre_counts = sorted(genre_counts.items(), key=lambda x: x[1], reverse=True)

# Select the top 20 genres
top_20_genres = [genre for genre, count in sorted_genre_counts[:20]]

print(top_20_genres)

['Fiction', 'Fantasy', 'Romance', 'Mystery', 'Classics', 'Young Adult', 'Historical Fiction', 'Contemporary', 'Audiobook', 'Thriller', 'Historical', 'Science Fiction', 'Crime', 'Mystery Thriller', 'Adventure', 'Nonfiction', 'Adult', 'Literature', 'Childrens', 'Paranormal']


In [131]:
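# Keep only the rows whose genre list contains at least one of the selected genres.
# 'genres' holds string representations of lists, so each one is parsed first;
# testing `genre in lst` on the raw string would match substrings instead.
top_20_set = set(top_20_genres)
df_filtered = df[df['genres'].apply(lambda s: bool(top_20_set & set(ast.literal_eval(s))))]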

print(df_filtered)

         userID  itemID  rating  \
0      31207039    3194     5.0   
1      31207039    2232     5.0   
2      31207039    1804     5.0   
3      31207039    1403     4.0   
4      31207039     104     3.0   
...         ...     ...     ...   
52766   2464998     521     2.0   
52767   2464998    3012     5.0   
52768   1815834    1198     3.0   
52769   1815834    2680     3.0   
52770   1815834    3351     4.0   

                                                  genres      price  \
0      ['Fiction', 'Historical Fiction', 'Classics', ...   8.717848   
1      ['Nonfiction', 'Psychology', 'Philosophy', 'Hi...   8.717848   
2      ['Philosophy', 'Classics', 'Nonfiction', 'Poli...  10.990000   
3      ['Nonfiction', 'History', 'Science', 'Philosop...   8.717848   
4      ['Classics', 'Fiction', 'Literature', 'Novels'...   8.717848   
...                                                  ...        ...   
52766  ['Horror', 'Fiction', 'Paranormal', 'Fantasy',...   8.717848   
52767      

In [138]:
dataset2 = Dataset()
dataset2.fit(users = df['userID'], 
            items = df['itemID'],
            item_features = top_20_genres)


In [140]:
item_features = dataset2.build_item_features((x, y) for x,y in zip(df.itemID, top_20_genres))

ValueError: Feature F not in feature mapping. Call fit first.
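
The error above appears to come from the shape of the input. build_item_features expects an iterable of (item id, [list of feature names]) pairs, whereas zip(df.itemID, top_20_genres) pairs each of the first 20 item IDs with a single genre string. That string is then iterated character by character, which would explain the complaint about feature 'F', the first letter of 'Fiction'. One possible fix, sketched below with invented rows, is to parse each book's genre string and keep only the genres that were registered in fit. The helper name `item_feature_pairs` is an assumption made for illustration.

```python
import ast


def item_feature_pairs(item_ids, genre_strings, allowed_genres):
    """Build one (item_id, [genres]) pair per distinct item, keeping only allowed genres."""
    allowed = set(allowed_genres)
    seen = set()
    pairs = []
    for item_id, raw in zip(item_ids, genre_strings):
        if item_id in seen:
            continue  # the same book appears once per rating; emit it only once
        seen.add(item_id)
        genres = [g for g in ast.literal_eval(raw) if g in allowed]
        pairs.append((item_id, genres))
    return pairs


# Invented rows in the same format as the 'itemID' and 'genres' columns
demo_pairs = item_feature_pairs(
    item_ids=[3194, 2232, 3194],
    genre_strings=[
        "['Fiction', 'Classics', 'Africa']",
        "['Nonfiction', 'Psychology']",
        "['Fiction', 'Classics', 'Africa']",
    ],
    allowed_genres=["Fiction", "Classics", "Nonfiction"],
)
print(demo_pairs)

# In the notebook, the call would then look roughly like this (untested sketch):
# item_features = dataset2.build_item_features(
#     item_feature_pairs(df.itemID, df.genres, top_20_genres))
```

Filtering to the registered genres matters because any feature name that was not passed to fit raises the same ValueError.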