Next key step in building CF-based recommendation systems is to generate user-item ratings matrix from the ratings table.
 

Using SKlearn, we are going to use a variety of functions to find similarity, predict, and recommend different books.

In [None]:
#!pip3 install surprise

In [5]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

import sklearn
from surprise.model_selection import cross_validate
from sklearn.metrics.pairwise import pairwise_distances

import re
import surprise

import time
import warnings



In [7]:
ratings = pd.read_csv('BX-Book-Ratings.csv', encoding='ISO-8859–1',on_bad_lines='skip',quotechar='"',sep=";",escapechar= "\\")
books = pd.read_csv('BX-Books.csv', encoding='ISO-8859–1',on_bad_lines='skip',quotechar='"',sep=";",escapechar= "\\")
users = pd.read_csv('BX-Users.csv', encoding='ISO-8859–1',on_bad_lines='skip',quotechar='"',sep=";",escapechar= "\\")



In [10]:
#print(ratings.shape)
ratings.columns



#print(books.shape)
#print(users.shape)

Index(['User-ID', 'ISBN', 'Book-Rating'], dtype='object')

In [11]:
#remove implicit data
drop_duplicate_ratings = ratings.drop_duplicates().dropna()
explicit_book_ratings = drop_duplicate_ratings[drop_duplicate_ratings['Book-Rating'] > 0]


#Merge Users & Ratings dataset
reviews_and_users = pd.merge(left=explicit_book_ratings,right= books, how = 'inner').merge(users.dropna(), how = 'inner')
reviews_and_users = reviews_and_users.drop_duplicates()

#Dataset Cleaning
reviews_and_users = reviews_and_users.drop(columns = ['Location','Image-URL-S','Image-URL-M','Image-URL-L'])
reviews_and_users = reviews_and_users.rename(columns={"User-ID": "UserID", "Book-Rating": "BookRating", "Book-Author": "BookAuthor", "Book-Title": "BookTitle","Year-Of-Publication": "PublicationYear"})
reviews_and_users['BookAuthor'] = reviews_and_users['BookAuthor'].str.title()

In [13]:
reviews_and_users


Unnamed: 0,UserID,ISBN,BookRating,BookTitle,BookAuthor,PublicationYear,Publisher,Age
0,276729,052165615X,3,Help!: Level 1,Philip Prowse,1999,Cambridge University Press,16.0
1,276729,0521795028,6,The Amsterdam Connection : Level 4 (Cambridge ...,Sue Leather,2001,Cambridge University Press,16.0
2,16877,038550120X,9,A Painted House,John Grisham,2001,Doubleday,37.0
3,16877,034539657X,7,Dark Rivers of the Heart,Dean R. Koontz,1995,Ballantine Books,37.0
4,16877,0743211383,3,Dreamcatcher,Stephen King,2001,Scribner,37.0
...,...,...,...,...,...,...,...,...
269625,276660,0583307841,8,ROBOT RACE (MICRO ADV 6),David Antony Kroft,1985,HarperCollins Publishers,15.0
269626,276664,0004703723,9,Dictionary Of Economics-2Nd Ed,Christopher Pass,1991,Trafalgar Square,31.0
269627,276664,0140136908,7,History of Economic Thought (Penguin Economics),William J. Barber,1992,Penguin USA,31.0
269628,276664,0631189629,9,British Social Policy Since 1945 (Making Conte...,Howard Glennerster,1996,Blackwell Publishers,31.0


In [14]:
def popular_ratings(ratings, user_threshold=200, rating_threshold=200, book_threshold=1):
    counts_users = ratings.UserID.value_counts()
    counts_ratings = ratings.BookRating.value_counts()
    sample_ratings = ratings[ratings['UserID'].isin(counts_users[counts_users >= user_threshold].index)]
    sample_ratings = sample_ratings[ratings['BookRating'].isin(counts_ratings[counts_ratings >= rating_threshold].index)]
    isbn_group = sample_ratings.groupby('ISBN', as_index=False)['BookRating'].count()
    sample_ratings = sample_ratings[sample_ratings.ISBN.isin(list(isbn_group[isbn_group.BookRating > book_threshold].ISBN.values))]
    return sample_ratings



In [15]:
test_ratings = popular_ratings(reviews_and_users, user_threshold=400, rating_threshold=400, book_threshold=1)
rating_matrix = test_ratings.pivot(index='UserID',
                                         columns='ISBN',
                                         values= 'BookRating').fillna(0)
print(test_ratings.shape)
print(rating_matrix.shape)
rating_matrix

(4047, 8)
(26, 1841)


  sample_ratings = sample_ratings[ratings['BookRating'].isin(counts_ratings[counts_ratings >= rating_threshold].index)]


ISBN,0001056107,002026478X,0060002050,006000441X,0060004606,0060004622,006000469X,0060004746,0060008865,0060011904,...,1854710443,1855385074,1878448900,1890862185,1890862290,189205101X,1892065487,1895565014,1902852427,1932112138
UserID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
16795,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
23872,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
56399,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,10.0,10.0,0.0,0.0,0.0,0.0,7.0,0.0,0.0,0.0
60244,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,8.0,0.0,0.0,0.0,0.0,9.0,0.0,0.0
63714,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
69078,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
76626,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
78973,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
93047,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.0,0.0
95359,8.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [46]:
userid_list = rating_matrix.index.tolist()
column_names = ["UserID", "Location", "Age"]
sampled_users = pd.DataFrame(columns = column_names)
sampled_users = users.loc[users['User-ID'].isin(userid_list)]
sampled_users = sampled_users.reset_index()
sampled_users = sampled_users.drop(['index'], axis=1)

print(sampled_users.shape)
sampled_users.head(-1)






(26, 3)


Unnamed: 0,User-ID,Location,Age
0,16795,"mechanicsville, maryland, usa",47.0
1,23872,"tulsa, oklahoma, usa",22.0
2,56399,"n/a, surrey, united kingdom",63.0
3,60244,"alvin, texas, usa",47.0
4,63714,"milton keynes, england, united kingdom",29.0
5,69078,"new york, new york, usa",42.0
6,76626,"london, england, united kingdom",38.0
7,78973,"amadora, lisboa, portugal",29.0
8,93047,"nashua, new hampshire, usa",52.0
9,95359,"charleston, west virginia, usa",33.0


In [47]:
book_isbn_list = rating_matrix.columns.values.tolist()
column_names = ["ISBN", "BookTitle", "BookAuthor", "PublicationYear", "Publisher"]
sampled_books = pd.DataFrame(columns = column_names)
sampled_books = books.loc[books['ISBN'].isin(book_isbn_list)]
sampled_books = sampled_books.reset_index()
sampled_books = sampled_books.drop(['index'], axis=1)
sampled_books = sampled_books.drop(['Image-URL-S',"Image-URL-M","Image-URL-L"], axis=1)


sampled_books.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher
0,0971880107,Wild Animus,Rich Shapero,2004,Too Far
1,0446310786,To Kill a Mockingbird,Harper Lee,1988,Little Brown &amp; Company
2,0449005615,Seabiscuit: An American Legend,LAURA HILLENBRAND,2002,Ballantine Books
3,0553582747,From the Corner of His Eye,Dean Koontz,2001,Bantam Books
4,042518630X,Purity in Death,J.D. Robb,2002,Berkley Publishing Group


The Data is more or less ready now to be used for a Collaborative Filtering Model.

The Model I will go with is a KNN Model (K Nearest Neighbors) there is more information about the process below:

In K Nearest Neighbors for collaborative filtering, we use the number of k people who most similar to the person we are looking for to find good recommendations. 

The best value for k depends on the problem. We use KNN with Means algorithm for building user-based recommender system. 

This algorithm takes into account the mean ratings of each user.
We use cosine similarity measure to compute the closeness of users with each other.



# Collaborative Filtering Using OOP


In [23]:
#get neighbors of target user based on similarity measure.
#find k nearest neighbors and use their ratings to recommend the items to the target user.

#Because most of this code was borrowed, I took time to go in and document what I understood about it.



class UserBasedCollaborativeFiltering():
    
    def __init__(self, users, books, ratings, k=10, max_rating=10.0):
        self.users = users
        self.users = self.users.reset_index()
        self.users = self.users.drop(columns=['index'])
        
        self.books = books
        
        self.ratings = ratings
        self.ratings = self.ratings.reset_index()
        self.ratings = self.ratings.drop(columns=['UserID'])
        
        self.k = k
        self.max_rating = max_rating
    
    
    def normalize(self, dataframe):
        #This method normalizes a DataFrame by subtracting the mean of each row from the entries in that row.
        """ 
        row_sum_ratings: computes the sum of the ratings for each row.
        non_zero_count: counts the number of non-zero entries in each row.
        dataframe_mean: computes the mean of each row by dividing row_sum_ratings by non_zero_count.
        self.normalized_ratings: subtracts dataframe_mean from dataframe along the rows.
        """
    
    
        row_sum_ratings = dataframe.sum(axis=1) # sum entries of rows
        non_zero_count = dataframe.astype(bool).sum(axis=1) # count non-zero entries of rows 
        
        dataframe_mean = row_sum_ratings / non_zero_count # mean of rows
        
        self.normalized_ratings = dataframe.subtract(dataframe_mean, axis = 0) # subtract on rows(iteration over columns!)
    
    def compute_similarity(self, x, y):
        
        """
        This method computes the cosine similarity between two vectors x and y.
        
        np.dot(x, y) computes the dot product of x and y.
        np.linalg.norm(x) computes the Euclidean norm (magnitude) of x.
        np.linalg.norm(y) computes the Euclidean norm (magnitude) of y.
        
        The cosine similarity is then computed as the 
        dot product of x and y divided by the product of the Euclidean norms of x and y.
        """
        
        return np.dot(x, y)/ (np.linalg.norm(x) * np.linalg.norm(y))


    def create_similarity_matrix(self):   
        """
        This function computes the similarity between each pair of users in the system using book ratings.
        In order to do this we initialize a numpy array to store the similarities outputted by --
        --the compute_similarity function, for every pair of users.
        It then reshapes this array into a DataFrame and returns it as the similarity matrix.
        """
        
        num_users = len(self.users)
        similarity_array = np.array([self.compute_similarity(self.ratings.iloc[i,:], self.ratings.iloc[j,:])
        for i in range(num_users) for j in range(num_users)])
        similarity_matrix = pd.DataFrame(data = similarity_array.reshape(self.users.shape[0], self.users.shape[0]))
        
        return similarity_matrix

    def get_neighbors(self, user_id, similarity_matrix):
        """
        This function takes a user ID & an inputted similarity matrix & returns the indices of the k most similar users. 
        
        To do this, find the index of the specified user in the users dataframe
        and use this to extract the row from the similarity matrix corresponding to that user.
        
        It then sorts the similarities in decreasing order, takes the top k+1 values (excluding the similarity of the user to themselves), 
        and returns their indices.
        """
        user_index = self.users.loc[self.users['User-ID'] == user_id].index.values[0]
        user_similairities = similarity_matrix.iloc[user_index].values
        temp_neighbors_index = user_similairities.argsort()[-(self.k + 1):][::-1]
        neighbor_index = np.delete(temp_neighbors_index, np.where(temp_neighbors_index[user_index] == user_index))
        
        return neighbor_index    
        
    def score_item(self, user_id, neighbor_rating, neighbor_similarity, ratings):
        """
        This function computes a score for each item in a set of recommended books for a given user.
        
        First, we take the user ID, the normalized ratings of the k most similar users, their similarity scores, and the full ratings matrix. 
        
        It computes the mean rating for the active user (the user we are making recommendations for), 
        then computes the weighted sum of the ratings of the k most similar users, where the weights are the similarity scores. 
        It adds the active user's mean rating to this weighted sum to get the final score, which it returns as a dataframe.
        """
        user_index = self.users.loc[self.users['User-ID'] == user_id].index.values[0]
        active_user_mean_rating = np.mean(ratings.iloc[user_index, :])
        score = np.dot(neighbor_similarity, neighbor_rating) + active_user_mean_rating
        data = score.reshape(1, len(score))
        columns = neighbor_rating.columns
        
        return pd.DataFrame(data= data , columns= columns)
    
    

    def recommend(self, user_id):
        """
        This function takes a user ID as input and returns a set of recommended books for that user. 
        
        finds the index of the specified user in the users dataframe and extracts their ratings from the full ratings matrix.
        It then identifies which books the user has not yet rated and stores their ISBNs in a list. 
        
        We start to combine all of the previous functions to make this process work.
        This first involves calling normalize() to normalize the ratings matrix,
        
        Then create_similarity_matrix() to compute the similarity matrix. 
        
        get_neighbors() takes k most similar users and extracts their normalized ratings for the books that the active user has not yet rated. 


        score_item() to compute a score for each recommended book and returns the top k books with the highest scores. 
        
        Finally, it extracts the details of the recommended books from the books dataframe and returns them as a dataframe.
        """
        user_index = self.users.loc[self.users['User-ID'] == user_id].index.values[0]
        user_ratings = rating_matrix.iloc[user_index]
        recommendation_columns = []

        for i in range(len(user_ratings.index)):
            isbn = user_ratings.index[i]
            rating = user_ratings.values[i]
            if rating == 0.0:
                recommendation_columns.append(isbn)

        self.normalize(self.ratings)  
        similarity_matrix = self.create_similarity_matrix()
        neighbor_index = self.get_neighbors(user_id, similarity_matrix)
        neighbor_rating = self.normalized_ratings.loc[neighbor_index][recommendation_columns]
        neighbor_similarity = similarity_matrix[user_index].loc[neighbor_index]
        recommendation_score = self.score_item(user_id, neighbor_rating, neighbor_similarity, self.ratings)
        recommended_book_ISBNs = recommendation_score.stack().nlargest(self.k)
        recommended_book_ISBNs = [recommended_book_ISBNs.index.values[i][1] for i in range(len(recommended_book_ISBNs))]
        recommended_books = self.books.loc[self.books['ISBN'].isin(recommended_book_ISBNs)]

        return recommended_books
    

In [21]:
user_based_cf = UserBasedCollaborativeFiltering(sampled_users, sampled_books, rating_matrix)
similarity_matrix = user_based_cf.create_similarity_matrix()

user_id = 23872
recommendations = user_based_cf.recommend(user_id)


In [22]:
recommendations.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher
7,804106304,The Joy Luck Club,Amy Tan,1994,Prentice Hall (K-12)
39,316666343,The Lovely Bones: A Novel,Alice Sebold,2002,"Little, Brown"
66,61009059,One for the Money (Stephanie Plum Novels (Pape...,Janet Evanovich,1995,HarperTorch
224,399146431,The Bonesetter's Daughter,Amy Tan,2001,Putnam Publishing Group
255,811802981,The Golden Mean: In Which the Extraordinary Co...,Nick Bantock,1993,Chronicle Books


# My Thoughts with this Model: 

A glaring flaw exists within the design of this model:
For every recommendation, you must find establish similarity between every single User.

This becomes computationally more expensive the more users there are, and is not optimal for our massive Data Set.


# Item Based Collab. Filtering


In [51]:
class ItemBasedCollaborativeFiltering():    
    def __init__(self, users, books, ratings, k=10, max_rating=10.0):
        self.users = users
        self.users = self.users.reset_index()
        self.users = self.users.drop(columns=['index'])
        
        self.books = books
        
        self.ratings = ratings
        self.ratings = self.ratings.reset_index()
        self.ratings = self.ratings.drop(columns=['UserID'])
        
        self.k = k
        self.max_rating = max_rating
        
        self.frequencies = {}
        self.deviations = {}
        
    
    def prepare_data(self):
        
        """
            A method which prepares the data for the algorithm.
        It converts the ratings DataFrame into a list of dictionaries,
        where each dictionary contains the ratings of a user for the items.
                            Returns the list."""
        
        user_indices = list(self.ratings.index.values)
        
        users_ratings = []
        
        for user_index in user_indices:
            rated_book_indices = list(self.ratings.iloc[user_index].to_numpy().nonzero()[0])
            users_ratings.append({user_index: dict(self.ratings[self.ratings.columns[rated_book_indices]].iloc[user_index])})
    
        self.users_ratings = users_ratings
        
        return self.users_ratings
        
        
    def compute_deviations(self):
        
        """
         Computes the deviation of the ratings of each item from the ratings of the other items. 
             Populates two dictionaries:
             self.frequencies: a dictionary that stores the number of times each pair of items has been rated together
             
             self.deviations: a dictionary that stores the average deviation of the ratings of each item from the ratings of the other items

        """
        users_ratings = self.users_ratings
        num_users = len(self.users)
        
        for i in range(num_users):
            for ratings in self.users_ratings[i].values():
                for item, rating in ratings.items():
                    self.frequencies.setdefault(item, {})
                    self.deviations.setdefault(item, {})
                    
                    for (item2, rating2) in ratings.items():
                        if item != item2:
                            self.frequencies[item].setdefault(item2, 0)
                            self.deviations[item].setdefault(item2, 0.0)
                            self.frequencies[item][item2] += 1
                            self.deviations[item][item2] += rating - rating2
            
            for (item, ratings) in self.deviations.items():
                for item2 in ratings:
                    ratings[item2] /= self.frequencies[item][item2]
    
    
    def slope_one_recommend(self, user_ratings):
        
        """
        This function takes in user_ratings, 
        which is a dictionary where the keys are book IDs (ISBNs) and the values are the corresponding ratings given by the user.
        
        It computes recommendations for items that the user has not rated using a Slope One algorithm.


        """
        recommendations = {}
        frequencies = {}
        
        #Looping through each item and rating in the user's input, 
        #then looping through each item and deviation in the deviation dictionary. 
        
        for (user_item, user_rating) in user_ratings.items():
        
            for (diff_item, diff_ratings) in self.deviations.items():
                
                #If the current deviation dictionary item has not been rated by the user,
                #but the current user input item has been rated,
                #the function updates the recommendation dictionary with a weighted average 

                if diff_item not in user_ratings and user_item in self.deviations[diff_item]:
                    freq = self.frequencies[diff_item][user_item]
                    recommendations.setdefault(diff_item, 0.0)
                    frequencies.setdefault(diff_item, 0)
        
                    recommendations[diff_item] += (diff_ratings[user_item] + user_rating) * freq
                    frequencies[diff_item] += freq
        
        recommendations = [(k, v / frequencies[k]) for (k, v) in recommendations.items()]
        
        recommendations.sort(key=lambda ratings: ratings[1], reverse = True)
        
        #returns a list of recommendations, sorted by descending order of recommendation score
        return recommendations
    
    
    def recommend(self, recommendations):
        
        """ 
            takes in a list of recommendations and returns the top k recommendations,
        where k is the value passed when initializing the Class.
        """
        top_k_recommendations = recommendations[: self.k]
        
        isbns = [recommendation[0] for recommendation in top_k_recommendations]
        
        recommended_books = [self.books.loc[self.books['ISBN'] == isbn] for isbn in isbns]
        return pd.concat(recommended_books)

In [52]:
item_based_cf = ItemBasedCollaborativeFiltering(sampled_users, sampled_books, rating_matrix)
users_ratings = item_based_cf.prepare_data()
item_based_cf.compute_deviations()

user_index = 1
pd.DataFrame(sampled_users.iloc[user_index])


Unnamed: 0,1
User-ID,23872
Location,"tulsa, oklahoma, usa"
Age,22.0


In [53]:
recommendations = item_based_cf.slope_one_recommend(users_ratings[user_index][user_index])
item_based_cf.recommend(recommendations)

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher
1047,0312251947,Naked Came the Phoenix: A Serial Novel,Marcia Talley,2001,St. Martin's Minotaur
1444,0373790988,Slippery When Wet: Under the Covers (Harlequin...,Kristin Hardy,2003,Harlequin
785,0515133868,Once Upon a Kiss,Nora Roberts,2002,Jove Books
514,0140183515,Just So Stories (Penguin Twentieth-Century Cla...,Rudyard Kipling,1990,Penguin Books
1506,1556342616,GURPS Discworld,Phil Masters,1998,Steve Jackson Games
1825,0345335945,Camber of Culdi #1 (Legends of Camber of Culdi),Katherine Kurtz,1982,Del Rey Books
1410,0345347633,Deryni Rising (Chronicles of the Deryni),Katherine Kurtz,1990,Del Rey Books
449,0553560220,Illusion,Paula Volsky,1993,Bantam Books
1318,0553561189,Mistress of the Empire,Raymond E. Feist,1993,Bantam
798,185326119X,The Jungle Book (Wordsworth Collection),Rudyard Kipling,1998,NTC/Contemporary Publishing Company


In [54]:

sampled_users

Unnamed: 0,User-ID,Location,Age
0,16795,"mechanicsville, maryland, usa",47.0
1,23872,"tulsa, oklahoma, usa",22.0
2,56399,"n/a, surrey, united kingdom",63.0
3,60244,"alvin, texas, usa",47.0
4,63714,"milton keynes, england, united kingdom",29.0
5,69078,"new york, new york, usa",42.0
6,76626,"london, england, united kingdom",38.0
7,78973,"amadora, lisboa, portugal",29.0
8,93047,"nashua, new hampshire, usa",52.0
9,95359,"charleston, west virginia, usa",33.0


In [55]:
user_index = 25
pd.DataFrame(sampled_users.iloc[user_index])



Unnamed: 0,25
User-ID,257204
Location,"akron, ohio, usa"
Age,32.0


In [56]:
recommendations = item_based_cf.slope_one_recommend(users_ratings[user_index][user_index])
item_based_cf.recommend(recommendations)

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher
153,385492081,Into Thin Air : A Personal Account of the Mt. ...,JON KRAKAUER,1998,Anchor
122,312243022,The Hours : A Novel,Michael Cunningham,2000,Picador
54,449907481,A Thousand Acres (Ballantine Reader's Circle),JANE SMILEY,1992,Ballantine Books
722,440404193,"Are You There God? It's Me, Margaret",JUDY BLUME,1971,Yearling
857,440472431,Ramona and Her Mother (Ramona Quimby (Paperback)),Beverly Cleary,1980,Bantam Doubleday Dell
1320,440802458,Egypt Game,Zilpha Keatley Snyder,1991,Bantam Doubleday Dell Publishing Group
730,1558538445,I Hope You Dance,Mark D. Sanders,2000,Rutledge Hill Press
701,440498058,A Wrinkle In Time,MADELEINE L'ENGLE,1998,Yearling
780,553211285,The Adventures of Tom Sawyer (Adventures of To...,MARK TWAIN,1995,Bantam
728,307010856,The Monster at the End of This Book,JON STONE,2003,Golden Books


# Content Based Filtering

In [57]:
from sklearn.metrics.pairwise import sigmoid_kernel

class ContentBasedFiltering():
    
    def __init__(self, books, ratings, k = 10):
        self.ratings = ratings
        self.books = self.prepare_data(books)
        self.tfidf_matrix = self.create_embedding_matrix()
        self.sigmoid = self.create_kernel()
        self.indices = self.create_indices()
        self.k = k
        
        
    def clean(self, text, combine=False):
        text = text.lower()
        text = re.sub('[^a-z0-9 ]', '', text)
        
        if combine:
            return ''.join(t.replace(' ', '') for t in text).strip()
        
        return text.strip()
    
    
    def prepare_data(self, books, rating_threshold = 2):
        
        # select books that has been rated
        rated_books = books[books.ISBN.isin(ratings.ISBN)]
        
        # remove duplicates based on bookTitle
        unique_books = rated_books.drop_duplicates(subset = ['Book-Title'], keep = False)
        
        # if rating count of a book > rating_threshold, then the book will be selected.
        popular_ISBN = list(self.ratings.ISBN.value_counts()[self.ratings.ISBN.value_counts() >= rating_threshold].index)
        
        # Only keep the books that its rating count is > rating_threshold; this means that it is popular.
        popular_books = unique_books[unique_books.ISBN.isin(popular_ISBN)]
        
        popular_books['BookTitleClean'] = popular_books['Book-Title'].map(lambda x: self.clean(x, combine=False))
        
        popular_books['spaghetti'] = popular_books['BookTitleClean']
        
        return popular_books 
    
    
    def create_embedding_matrix(self):
        from sklearn.feature_extraction.text import TfidfVectorizer
        
        self.books['spaghetti'] = self.books['spaghetti'].fillna('')
        self.books = self.books.reset_index()

        tfidf_vectorizer = TfidfVectorizer(stop_words = 'english')

        tfidf_matrix = tfidf_vectorizer.fit_transform(self.books['spaghetti'])
        print('tf-idf embedding matrix shape = ' + str(tfidf_matrix.shape))
        
        return tfidf_matrix
    
    def create_kernel(self):
        return sigmoid_kernel(self.tfidf_matrix, self.tfidf_matrix)
        
    
    def create_indices(self):
        return pd.Series(self.books.index, index = self.books['Book-Title']).drop_duplicates()
    
    
    def recommend(self, query):
        
        idx = self.indices[query]

        sigmoid_scores = list(enumerate(self.sigmoid[idx]))
        sigmoid_scores = sorted(sigmoid_scores, key=lambda x: x[1], reverse=True)
        sigmoid_scores = sigmoid_scores[1: self.k + 1]
        
        book_indices = [i[0] for i in sigmoid_scores]
        
        recommendations =  pd.DataFrame(self.books.iloc[book_indices])
        recommendations = recommendations.drop(columns=['index', 'BookTitleClean', 'spaghetti'])
        
        return recommendations

In [64]:
content_based_cf = ContentBasedFiltering(sampled_books, test_ratings)


tf-idf embedding matrix shape = (1796, 2910)


In [65]:
Query = "Wild"
content_based_cf.recommend(Query)

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher
1208,515132292,Wild,Lori Foster,2002,Jove Books
954,1551668777,So Wild A Heart,Candace Camp,2002,Mira
1541,843953004,Wild Desire,Phoebe Conn,2003,Leisure Books
0,971880107,Wild Animus,Rich Shapero,2004,Too Far
1671,399149279,Wild Pitch,Mike Lupica,2002,Putnam Publishing Group
1206,743437128,Wild Orchids : A Novel,Jude Deveraux,2003,Atria Books
428,380812037,On a Wild Night (Cynster Novels),Stephanie Laurens,2002,Avon
843,373240880,Waiting For Nick (Those Wild Ukrainians) (Sil...,Nora Roberts,1997,Silhouette
452,671769316,NEW ROADSIDE AMERICA : THE MODERN TRAVELER'S G...,Doug Kirby,1992,Fireside
1,446310786,To Kill a Mockingbird,Harper Lee,1988,Little Brown &amp; Company


In [68]:
author = "Lori Foster"
content_based_cf.recommend(author)

KeyError: 'Lori Foster'

In [72]:
Publisher = "Jove Books"
content_based_cf.recommend(author)

KeyError: 'Kill'

# Final Thoughts / Conclusions

The scope of this project was to implement a working recommender system, based on traditional models. This mainly was done to develop a deeper understanding of the Algorithmic processes involved with this kind of work, while assessing which would be more useful for a given scenario. All three models work, but have slightly different alterations which give them an interesting Dynamic. I will rank them and give my explanation for this below:

1. Item-To-Item Collaborative Filtering works the best with the Book-Crossing Dataset because it finds similarities between ratings, not users. This serves as an advantage, because assessing recommendations becomes computationally expensive the more users you have. In short, you can recommend a lot of books, while computation time stays low.   



1. A unique advantage for Content-Based Filtering is that it can work when not a lot is known about the data. For instance, you can search for titles or Genres and get reasonable results. The only downside here is that, to take full advantage of its capabilities, you need quality features. For instance, my attempts at recommending Authors failed. This may have worked better if in my preprocessing phase, I spent more time looking over the Book-Author column, weeding out duplicates or other anomalies. 



1. Lastly, I rank User based filtering in third place, because I could not make full usage of this algorithm. User to User works best when a substantial amount of overlap comes into play, which, in my Data Analysis file, I found with careful digging. In my Analysis I found a 'popularity' metric can make quick assessments in which some Books are more favorable to recommend. This works in an 'item-to-item' context, but to truly personalize this for every user requires skills I currently do not possess.


As stated, this project started because I wanted to directly engage with the math, while finding effective ways to program them. I consider this a success, but it wouldn't be possible without my sources. In these you'll find code I used to analyze, which inspired the process I undertook. 


# Citations

1. https://towardsdatascience.com/my-journey-to-building-book-recommendation-system-5ec959c41847
1. https://surprise.readthedocs.io/en/stable/knn_inspired.html
1. https://towardsdatascience.com/user-user-collaborative-filtering-for-jokes-recommendation-b6b1e4ec8642
1. https://towardsdatascience.com/my-journey-to-building-book-recommendation-system-5ec959c41847
1. https://github.com/tttgm/fellowshipai/tree/master
1. https://github.com/mohsenMahmoodzadeh/Book-Crossing-Recommender-System
