### Exploring Content Filtering Based Recommendations

#### Table of Contents
* [Importing libraries & loading data](#chapter1)
    * [Dropping nulls & duplicates](#section_1_1)
    * [Quick clean & one-hot encoding](#section_1_2)
    * [Looking at data](#section_1_3)
* [Content based recommenders](#chapter3)
    * [SVD](#section_3_1)
    * [Manual-Item2Item](#section_3_2)
    * [BiVAE](#section_3_3)

#### Importing libraries & loading data <a class="anchor" id="chapter1"></a> 
Import the libraries and datasets. Important: to load `books`, `reviews` and `ratings_dist`, you first need to generate the cleaned data by running the EDA notebooks.

In [146]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import datetime
import ast

import warnings
warnings.filterwarnings('ignore')

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from sklearn.decomposition import LatentDirichletAllocation



In [125]:
books = pd.read_csv('../data/processed/processed_books.csv')
reviews = pd.read_csv('../data/processed/processed_reviews.csv')
ratings_dist = pd.read_csv('../data/processed/processed_ratings.csv')

In [126]:
books

Unnamed: 0.1,Unnamed: 0,book_id,title,author,price,genres,series,publisher,year_published,current_readers,wanted_to_read,num_reviews,num_ratings,rating,awards,primary_lists,book_score,author_score
0,0,77203.The_Kite_Runner,The Kite Runner,Khaled Hosseini,8.717848,"['Fiction', 'Historical Fiction', 'Classics', ...",0,Riverhead Books,2004-05-01,42900.0,1000000.0,90,2935385,4.0,['Borders Original Voices Award for Fiction (2...,['Books That Everyone Should Read At Least Onc...,0.559392,0.064747
1,1,929.Memoirs_of_a_Geisha,Memoirs of a Geisha,Arthur Golden,12.990000,"['Fiction', 'Historical Fiction', 'Romance', '...",0,Vintage Books USA,2005-11-22,12300.0,793000.0,34,1922540,4.0,[],"['Best Books Ever', 'Best Historical Fiction',...",0.504395,0.052931
2,2,128029.A_Thousand_Splendid_Suns,A Thousand Splendid Suns,Khaled Hosseini,12.990000,"['Fiction', 'Historical Fiction', 'Contemporar...",0,Riverhead Books,2007-06-01,32700.0,760000.0,69,1417260,4.0,['British Book Award for Best Read of the Year...,"['Best Books Ever', 'Books That Everyone Shoul...",0.476958,0.064747
3,3,19063.The_Book_Thief,The Book Thief,Markus Zusak,10.990000,"['Historical Fiction', 'Fiction', 'Young Adult...",0,Alfred A. Knopf,2006-03-14,86000.0,2000000.0,134,2345385,4.0,['National Jewish Book Award for Children’s an...,"['Best Books Ever', 'Books That Everyone Shoul...",0.527355,0.034407
4,4,4214.Life_of_Pi,Life of Pi,Yann Martel,8.717848,"['Fiction', 'Fantasy', 'Classics', 'Adventure'...",0,Seal Books,2006-08-29,24900.0,726000.0,51,1544622,3.0,"['Booker Prize (2002)', 'Bollinger Everyman Wo...","['Best Books Ever', 'Books That Everyone Shoul...",0.383873,0.021261
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4982,6257,25489259-death-of-an-alchemist,Death of an Alchemist,Mary Lawrence,5.990000,"['Mystery', 'Historical Fiction', 'Fiction', '...",1,Kensington Books,2016-01-26,-1.0,-1.0,68,285,3.0,[],['Most Anticipated Historical Mysteries for 20...,0.300015,0.000022
4983,6259,52185047-the-lost-boys-of-london,The Lost Boys of London,Mary Lawrence,8.717848,"['Mystery', 'Historical Fiction', 'Historical'...",1,Red Puddle Print,2020-04-28,-1.0,-1.0,51,99,4.0,[],"['Anticipated 2020 Literary Fiction', 'Crime, ...",0.400005,0.000022
4984,6262,36445482-no-cure-for-the-dead,No Cure for the Dead,Christine Trent,12.990000,"['Mystery', 'Historical Fiction', 'Historical ...",1,Crooked Lane Books,2018-05-08,-1.0,-1.0,86,380,3.0,[],"['Historical Fiction 2018', 'Historical Myster...",0.300021,0.000005
4985,6263,15793166-the-midwife-s-tale,The Midwife's Tale,Sam Thomas,5.990000,"['Historical Fiction', 'Mystery', 'Fiction', '...",1,Minotaur Books,2013-01-08,-1.0,-1.0,421,2855,3.0,[],"['Historical Fiction 2013', 'most anticipated ...",0.300155,0.000051


#### Dropping nulls & duplicates <a class="anchor" id="section_1_1"></a>

In [127]:
# count null values in each column
print('\nColumn null count: ', '\n', books.isnull().sum(axis=0))

# count null values in each row
print('\nRow null count: ', '\n', books.isnull().sum(axis=1))


Column null count:  
 Unnamed: 0         0
book_id            0
title              0
author             0
price              0
genres             0
series             0
publisher          0
year_published     0
current_readers    0
wanted_to_read     0
num_reviews        0
num_ratings        0
rating             0
awards             0
primary_lists      0
book_score         0
author_score       0
dtype: int64

Row null count:  
 0       0
1       0
2       0
3       0
4       0
       ..
4982    0
4983    0
4984    0
4985    0
4986    0
Length: 4987, dtype: int64


In [128]:
# count fully duplicated rows
print('\nColumn dup count: ', '\n', books.duplicated().sum(axis=0))


Column dup count:  
 0


#### Quick clean & one-hot encoding <a class="anchor" id="section_1_2"></a>

In [129]:
genres = ["Art", "Biography", "Business", "Chick Lit", "Children's", "Christian", "Classics",
          "Comics", "Contemporary", "Cookbooks", "Crime", "Ebooks", "Fantasy", "Fiction",
          "Gay and Lesbian", "Graphic Novels", "Historical Fiction", "History", "Horror",
          "Humor and Comedy", "Manga", "Memoir", "Music", "Mystery", "Nonfiction", "Paranormal",
          "Philosophy", "Poetry", "Psychology", "Religion", "Romance", "Science", "Science Fiction", 
          "Self Help", "Suspense", "Spirituality", "Sports", "Thriller", "Travel", "Young Adult"]

# 'genres' is read from the CSV as a string such as "['Fiction', 'Classics']".
# Parse it into a real list first; testing `genre in <string>` would do
# substring matching, so "Fiction" would also match "Historical Fiction".
parsed_genres = books["genres"].apply(ast.literal_eval)

# Create a new dataframe with one 0/1 column per genre in the list above
one_hot = pd.DataFrame(
    {genre: parsed_genres.apply(lambda row_genres, g=genre: int(g in row_genres))
     for genre in genres},
    index=books.index,
)

# Concatenate the original book dataframe with the one-hot encoded genre dataframe
books = pd.concat([books, one_hot], axis=1)
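
A small sketch of why the genre strings should be parsed before encoding. The two-row frame and the names `toy`, `toy_genres` and `toy_one_hot` are hypothetical and only illustrate the idea; they are not part of the dataset.

```python
import ast
import pandas as pd

# Toy data (hypothetical): genres stored as strings, as they are after pd.read_csv
toy = pd.DataFrame({
    "title": ["Book A", "Book B"],
    "genres": ["['Historical Fiction', 'Mystery']", "['Fiction', 'Romance']"],
})
toy_genres = ["Fiction", "Historical Fiction", "Mystery", "Romance"]

# Substring test on the raw string: "Fiction" wrongly matches Book A as well
naive = [int("Fiction" in g) for g in toy["genres"]]

# Membership test on the parsed list: only Book B is tagged "Fiction"
parsed = toy["genres"].apply(ast.literal_eval)
exact = [int("Fiction" in g) for g in parsed]

# One-hot frame built from the parsed lists
toy_one_hot = pd.DataFrame(
    {genre: [int(genre in g) for g in parsed] for genre in toy_genres},
    index=toy.index,
)
print(naive, exact)
print(toy_one_hot)
```

The same effect shows up with any genre whose name is contained in another, for example "Science" and "Science Fiction".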

#### Looking at data <a class="anchor" id="section_1_3"></a>

In [130]:
# Drop the leftover 'Unnamed: 0' column and check genres
books = books.drop('Unnamed: 0', axis=1)
books['genres']

0       ['Fiction', 'Historical Fiction', 'Classics', ...
1       ['Fiction', 'Historical Fiction', 'Romance', '...
2       ['Fiction', 'Historical Fiction', 'Contemporar...
3       ['Historical Fiction', 'Fiction', 'Young Adult...
4       ['Fiction', 'Fantasy', 'Classics', 'Adventure'...
                              ...                        
4982    ['Mystery', 'Historical Fiction', 'Fiction', '...
4983    ['Mystery', 'Historical Fiction', 'Historical'...
4984    ['Mystery', 'Historical Fiction', 'Historical ...
4985    ['Historical Fiction', 'Mystery', 'Fiction', '...
4986    ['Historical Fiction', 'Mystery', 'Fiction', '...
Name: genres, Length: 4987, dtype: object

In [None]:
# Final check
books

### Content based recommenders <a class="anchor" id="chapter3"></a>

#### Simple content based

Creating a soup of features

In [131]:
# Function to create a soup with title, author & genres
def create_soup(df):
    # Create a new column to store the soup text
    df['soup'] = pd.Series(dtype='str')

    # Loop through each row in the DataFrame
    for index, row in df.iterrows():
        # Extract the genres for the current row
        genres = row['genres']
        print(row['title'])
        print(row['author'])
        print(row['genres'])

        # Combine the book title, authors, and genres into a single string
        soup_text = row['title'] + ' ' + row['author']
        for genre in ast.literal_eval(genres):
            soup_text = soup_text + ' ' + genre

        # Store the soup text for the current book in the 'soup' column
        df.at[index, 'soup'] = soup_text

    return df

In [132]:
books = create_soup(books)
books['soup']

The Kite Runner
Khaled Hosseini
['Fiction', 'Historical Fiction', 'Classics', 'Contemporary', 'Novels', 'Historical', 'Literature']
Memoirs of a Geisha
Arthur Golden
['Fiction', 'Historical Fiction', 'Romance', 'Historical', 'Classics', 'Japan', 'Adult']
A Thousand Splendid Suns
Khaled Hosseini
['Fiction', 'Historical Fiction', 'Contemporary', 'Historical', 'Novels', 'War', 'Classics']
The Book Thief
Markus Zusak
['Historical Fiction', 'Fiction', 'Young Adult', 'Historical', 'Classics', 'War', 'World War II']
Life of Pi
Yann Martel
['Fiction', 'Fantasy', 'Classics', 'Adventure', 'Contemporary', 'Magical Realism', 'Novels']
The Poisonwood Bible
Barbara Kingsolver
['Fiction', 'Historical Fiction', 'Africa', 'Classics', 'Historical', 'Literary Fiction', 'Literature']
The Girl with the Dragon Tattoo
Stieg Larsson
['Fiction', 'Mystery', 'Thriller', 'Crime', 'Mystery Thriller', 'Suspense', 'Contemporary']
The Diary of a Young Girl
Anne Frank
['Classics', 'Nonfiction', 'History', 'Biography',

0       The Kite Runner Khaled Hosseini Fiction Histor...
1       Memoirs of a Geisha Arthur Golden Fiction Hist...
2       A Thousand Splendid Suns Khaled Hosseini Ficti...
3       The Book Thief Markus Zusak Historical Fiction...
4       Life of Pi Yann Martel Fiction Fantasy Classic...
                              ...                        
4982    Death of an Alchemist Mary Lawrence Mystery Hi...
4983    The Lost Boys of London Mary Lawrence Mystery ...
4984    No Cure for the Dead Christine Trent Mystery H...
4985    The Midwife's Tale Sam   Thomas Historical Fic...
4986    The Harlot's Tale Sam   Thomas Historical Fict...
Name: soup, Length: 4987, dtype: object

In [134]:
# min_df=1 keeps every term; recent scikit-learn versions reject an integer min_df of 0
count = CountVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')
count_matrix = count.fit_transform(books['soup'])
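
To get a feel for what the vectorizer produces, here is a minimal sketch on a single toy soup string. The string and the names `toy_soup`, `vec` and `matrix` are assumptions made for illustration; the settings mirror the ones used above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# One toy soup string (hypothetical), in the same shape as books['soup']
toy_soup = ["The Kite Runner Khaled Hosseini Fiction Historical Fiction"]

# Unigrams and bigrams, lowercased, with English stop words removed
vec = CountVectorizer(analyzer="word", ngram_range=(1, 2), stop_words="english")
matrix = vec.fit_transform(toy_soup)

# vocabulary_ maps each feature (a word or a word pair) to its column index
for feature, column in sorted(vec.vocabulary_.items()):
    print(feature, matrix[0, column])
```

Because bigrams are included, an author name such as "khaled hosseini" becomes a feature of its own, which helps books by the same author score as similar.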

#### Cosine Similarity
We will use cosine similarity to compute a numeric score that denotes how similar two books are. Mathematically, it is defined as follows:

$$ \text{cosine}(x, y) = \frac{x \cdot y^\intercal}{\|x\| \, \|y\|} $$

Here, x and y are the feature vectors of the two books, and ⊺ denotes the transpose of a vector. ||x|| and ||y|| are the magnitudes of the two vectors, calculated as the square root of the sum of squares of their elements. The numerator, x · y^⊺, is the dot product of the two vectors. In general the result ranges between -1 and 1: a value of 1 means the two vectors point in the same direction, 0 means they have nothing in common (they are orthogonal), and -1 means they point in opposite directions. Because the count vectors used here contain no negative values, the scores in this notebook fall between 0 and 1.
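
The formula can be checked by hand on a few small vectors. This is only a sketch: the vectors `x`, `y`, `z` and the helper `cosine` are made up for illustration, and the result is compared against scikit-learn's `cosine_similarity`.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy count vectors (hypothetical)
x = np.array([1, 1, 0, 2])
y = np.array([2, 2, 0, 4])   # same direction as x, twice as long
z = np.array([0, 0, 3, 0])   # shares no terms with x

def cosine(a, b):
    """Dot product divided by the product of the vector magnitudes."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(x, y))  # close to 1: same direction, length does not matter
print(cosine(x, z))  # 0: no overlap

# scikit-learn expects 2D arrays of shape (n_samples, n_features)
sk_xy = cosine_similarity(x.reshape(1, -1), y.reshape(1, -1))[0, 0]
print(sk_xy)
```

Note that `y` is not identical to `x`, yet the score is still 1; cosine similarity only looks at direction, not magnitude.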

In [135]:
cosine_sim = cosine_similarity(count_matrix, count_matrix)

In [136]:
indices = pd.Series(books.index, index=books['title'])
titles = books['title']

In [137]:
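def get_recommendations(title, n=10):
    idx = indices[title]
    # Some titles appear more than once; in that case use the first match
    if isinstance(idx, pd.Series):
        idx = idx.iloc[0]
    # Rank all books by similarity to the query book, most similar first
    sim_scores = sorted(enumerate(cosine_sim[idx]), key=lambda x: x[1], reverse=True)
    # Drop the query book itself and keep the n most similar books
    book_indices = [i for i, _ in sim_scores if i != idx][:n]
    return list(titles.iloc[book_indices].values)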

Trying out with "The Diary of a Young Girl"
- Genres are as follows: 'Classics', 'Nonfiction', 'History', 'Biography', 'Memoir', 'Historical', 'Holocaust'

In [138]:
get_recommendations("The Diary of a Young Girl")

['Twelve Years a Slave',
 '12 Years a Slave',
 "Eighth Moon: The True Story of a Young Girl's Life in Communist China",
 'Cheaper by the Dozen',
 'Cheaper by the Dozen',
 'The Midwife: A Memoir of Birth, Joy, and Hard Times',
 'Four Perfect Pebbles: A Holocaust Story',
 "Angela's Ashes",
 'The Story of My Life',
 'Call the Midwife: A True Story of the East End in the 1950s']

What if I want a specific book but I can't remember the full name...

Let's create a method to get book titles from a partial title.

In [139]:
def get_name_from_partial(title):
    # Case-insensitive, literal (non-regex) substring match on the title
    mask = books.title.str.contains(title, case=False, regex=False)
    return list(books.title[mask].values)

In [142]:
title = "business"
l = get_name_from_partial(title)
list(enumerate(l))

[(0, 'Fifth Business'),
 (1,
  'Machine Learning: An Introduction Math Guide for Beginners to Understand Data Science Through the Business Applications'),
 (2, 'The Politics of Breastfeeding: When Breasts Are Bad for Business')]

In [143]:
get_recommendations(l[1])

['Bayes Theorem: A Visual Introduction For Beginners',
 'An Introduction to Fluid Dynamics',
 'Understanding Machine Learning: From Theory to Algorithms',
 'Numerical Analysis',
 'Mathematical Methods of Classical Mechanics',
 'Introduction to Mathematical Thinking',
 'Introduction to Linear Algebra',
 'Linear Algebra and Its Applications',
 'Introduction to the Theory of Computation',
 'Elements of Information Theory']

#### Relation words
What if a user wants to search for books related to a topic such as war, without a specific title, genre or author to start from? The title-based lookup above cannot handle that. Let's look at a solution that compares the user's input text directly with each book's soup: both are projected into an LDA topic space and then ranked by cosine similarity.

In [156]:
def recommend_books(input_text, df, n=10):
    # Preprocess the book text data
    corpus = df['soup'].values.tolist()
    vectorizer = CountVectorizer(stop_words='english', max_features=1000)
    X = vectorizer.fit_transform(corpus)
    lda = LatentDirichletAllocation(n_components=10, random_state=42)
    X_topics = lda.fit_transform(X)

    # Preprocess user input
    input_vector = vectorizer.transform([input_text])
    input_topic = lda.transform(input_vector)

    # Compute cosine similarity between input topic and book topics
    similarities = cosine_similarity(input_topic, X_topics)

    # Get book recommendations based on cosine similarity.
    # argsort is ascending, so take the last n indices and reverse them
    # to list the most similar book first.
    top_indices = similarities.argsort()[0][-n:][::-1]
    recommendations = df.iloc[top_indices]['title'].values.tolist()
    
    return recommendations

In [157]:
recommend_books('War', books, n=10)

['Carved in Stone: Monochrome Destiny',
 'Caged Lions Never Roar',
 'A Commentary on the Chymical Wedding of Christian Rosenkreutz',
 'Seaview Road',
 'Cradle Will Rock: The Movie and the Moment',
 'To Hear a Nightingale',
 'Alias Grace',
 'Cold Mountain',
 'Wild Ginger',
 'Fifth Business']
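
One thing worth checking with a free-text query is whether it shares any term with the fitted vocabulary. A query with no overlap is transformed into an all-zero vector, so the similarity scores that follow carry no information about it. The sketch below assumes a toy corpus, and the helper `known_terms` is hypothetical; it is not part of `recommend_books`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy soup strings (hypothetical), standing in for books['soup']
toy_corpus = [
    "Cold Mountain Charles Frazier Historical Fiction War",
    "Life of Pi Yann Martel Fiction Fantasy Adventure",
]
vectorizer = CountVectorizer(stop_words="english")
vectorizer.fit(toy_corpus)

def known_terms(query, vectorizer):
    """Return the query tokens that exist in the fitted vocabulary."""
    analyzer = vectorizer.build_analyzer()  # same lowercasing and stop-word removal
    return [tok for tok in analyzer(query) if tok in vectorizer.vocabulary_]

print(known_terms("War", vectorizer))       # the query term is in the vocabulary
print(known_terms("Dragons", vectorizer))   # no overlap with the vocabulary

# A query without overlap becomes a vector with no non-zero entries
print(vectorizer.transform(["Dragons"]).nnz)
```

With `max_features=1000`, as used above, rarer words are dropped from the vocabulary, so a check like this can explain results that look unrelated to the query.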

#### Taking popularity & ratings into account

So far the recommendations are generic: they do not take into account a book's rating or its number of ratings.

In [175]:
def improved_recommendations(title, df, n=10):
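    # Compute a weighted rating for each book: ratings backed by few votes
    # are pulled toward the mean rating of the whole dataset
    v = df['num_ratings']                   # number of ratings per book
    m = df['num_ratings'].quantile(0.60)    # minimum number of ratings to qualify
    R = df['rating']                        # rating of each book
    C = df['rating'].mean()                 # mean rating across all books
    df['weighted_rating'] = (R*v + C*m) / (v + m)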

    # Compute cosine similarity between input title and similar books
    corpus = df['soup'].values.tolist()
    vectorizer = CountVectorizer(stop_words='english', max_features=1000)
    X = vectorizer.fit_transform(corpus)
    lda = LatentDirichletAllocation(n_components=10, random_state=42)
    X_topics = lda.fit_transform(X)
    input_vector = vectorizer.transform([title])
    input_topic = lda.transform(input_vector)
    similarities = cosine_similarity(input_topic, X_topics)

    # Compute final recommendation scores based on weighted rating and cosine similarity
    df['similarity_score'] = similarities[0]
    df['weighted_similarity_score'] = df['similarity_score'] * df['weighted_rating']
    qualified = df[df['num_ratings'] >= m]
    qualified = qualified.sort_values('weighted_similarity_score', ascending=False)

    # Return top n recommended books
    recommendations = qualified.loc[:, ['title', 'book_id', 'rating', 'num_ratings', 'genres', 'weighted_similarity_score']].head(n)
    return recommendations
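
The weighted rating `(R*v + C*m) / (v + m)` is easier to read on a tiny example. The three-row frame below is made up for illustration, and `weighted_rating` is a hypothetical helper that wraps the same formula.

```python
import pandas as pd

def weighted_rating(R, v, C, m):
    """Shrink rating R (backed by v ratings) toward the global mean C."""
    return (R * v + C * m) / (v + m)

# Toy data (hypothetical): same column names as the books dataframe
toy = pd.DataFrame({
    "title": ["Few ratings", "Some ratings", "Many ratings"],
    "rating": [5.0, 4.0, 3.0],
    "num_ratings": [10, 1_000, 100_000],
})

C = toy["rating"].mean()               # mean rating across the toy books
m = toy["num_ratings"].quantile(0.60)  # threshold on the number of ratings
toy["weighted_rating"] = weighted_rating(toy["rating"], toy["num_ratings"], C, m)
print(toy)
```

The book rated 5.0 by only 10 readers ends up just above the mean of 4.0, while the book rated 3.0 by 100,000 readers stays close to its own rating. This is the intended effect: the more ratings a book has, the more its own rating counts.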


In [177]:
improved_recommendations('war', books, n=10)

Unnamed: 0,title,book_id,rating,num_ratings,genres,weighted_similarity_score
671,Slaughterhouse-Five,4981.Slaughterhouse_Five,4.0,1295812,"['Classics', 'Fiction', 'Science Fiction', 'Wa...",3.896087
67,All the Light We Cannot See,18143977-all-the-light-we-cannot-see,4.0,1429862,"['Historical Fiction', 'Fiction', 'Historical'...",3.895536
103,The Count of Monte Cristo,7126.The_Count_of_Monte_Cristo,4.0,863191,"['Classics', 'Fiction', 'Historical Fiction', ...",3.8942
804,Pride and Prejudice,1885.Pride_and_Prejudice,4.0,3911165,"['Classics', 'Fiction', 'Romance', 'Historical...",3.89412
0,The Kite Runner,77203.The_Kite_Runner,4.0,2935385,"['Fiction', 'Historical Fiction', 'Classics', ...",3.893995
1853,Sense and Sensibility,14935.Sense_and_Sensibility,4.0,1124041,"['Classics', 'Fiction', 'Romance', 'Historical...",3.891611
61,Sarah's Key,556602.Sarah_s_Key,4.0,460911,"['Historical Fiction', 'Fiction', 'Holocaust',...",3.891465
569,American Dirt,45046527-american-dirt,4.0,495963,"['Fiction', 'Contemporary', 'Audiobook', 'Hist...",3.888999
2,A Thousand Splendid Suns,128029.A_Thousand_Splendid_Suns,4.0,1417260,"['Fiction', 'Historical Fiction', 'Contemporar...",3.888824
4725,The Grapes of Wrath,4395.The_Grapes_of_Wrath,4.0,860088,"['Classics', 'Fiction', 'Historical Fiction', ...",3.887029
