# Recommender System
We built a recommender system based on the books we scraped.  The idea is that you give it a single book and it returns books you are also likely to enjoy, based on their similarity to the book provided.

While there are many types of recommender systems, the two most common are *collaborative filters* and *content filters*.


We focus on content-based filtering, which relies on similarities in the actual content of the data, such as weighted ratings, similarity of authors, frequency of topics appearing in the description, and so on, rather than on the behaviour of other users as collaborative filtering does.  This method requires a direct 'similarity score' between items in order to compute how related they are.
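
To make the idea of a 'similarity score' concrete, here is a minimal pure-Python sketch of cosine similarity over word counts, the measure this notebook imports from scikit-learn and uses later on. The function name `cosine_similarity_tokens` and the three word lists are made up for illustration; they are not part of the notebook or the dataset.

```python
import math
from collections import Counter

def cosine_similarity_tokens(tokens_a, tokens_b):
    """Cosine similarity between two token lists, using raw word counts."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # Dot product over the words that both items share.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical feature words for three made-up books.
book_a = ['fantasy', 'magic', 'wizard', 'school']
book_b = ['fantasy', 'magic', 'dragon', 'quest']
book_c = ['cooking', 'recipes', 'baking', 'bread']

print(cosine_similarity_tokens(book_a, book_b))  # 0.5 (two of four words shared)
print(cosine_similarity_tokens(book_a, book_c))  # 0.0 (no words shared)
```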

In [25]:
import pandas as pd
import numpy as np

import string
from rake_nltk import Rake
from nltk.tokenize import wordpunct_tokenize
from nltk.corpus import stopwords

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [26]:
book_data = pd.read_csv('./scraper/output/pages-1-100.tsv', sep='\t')

## Remove duplicates

In [27]:
book_data.drop_duplicates(subset='title', inplace=True)

# Resetting the index.
book_data = book_data.reset_index()

## Weighted rating & top books
We cannot use the average rating directly, because the number of ratings behind it varies widely.  A single user rating a book 5/5 should not outrank 50,000 people rating it 4.5 on average.  We need a way to weight the rating by the number of votes.

[IMDB's FAQ](https://help.imdb.com/article/imdb/track-movies-tv/ratings-faq/G67Y87TFYYP6TWAV?ref_=helpms_helpart_inline#calculatetop) describes the algorithm that they use to weight the rank of movies and TV shows for the top rated lists.  It reads:

$\text{Weighted Rating (WR)} = (\frac{v}{v+m} \cdot R) + (\frac{m}{v+m} \cdot C)$

where

* $R$ is the average rating for the movie (mean).
* $v$ is the number of votes for the movie.
* $m$ is the minimum number of votes to be listed (25,000 in their case)
* $C$ is the mean vote across the whole report.

We already have $R$ and $v$ directly as columns (`avg_rating` and `num_ratings`).  $C$ can be computed from the data, and $m$ is a parameter we can configure and tweak.  We begin with the 10th percentile of the number of ratings, which essentially cuts off the least-rated 10% of the books.
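
As a quick worked example of the formula, the sketch below applies it to the two cases mentioned earlier: one 5/5 vote versus 50,000 votes averaging 4.5. The values of `C` and `m` here are round illustrative numbers, not the ones computed from the dataset, and this standalone `weighted_rating` takes plain numbers instead of a DataFrame row.

```python
def weighted_rating(R, v, m, C):
    # IMDB-style weighted rating: blend the item's mean R with the global mean C.
    return (v / (v + m)) * R + (m / (v + m)) * C

# Illustrative values only (not taken from the dataset).
C = 4.0    # global mean rating
m = 1000   # minimum-votes parameter

one_vote = weighted_rating(R=5.0, v=1, m=m, C=C)
many_votes = weighted_rating(R=4.5, v=50_000, m=m, C=C)

print(round(one_vote, 3))    # 4.001, pulled almost entirely towards C
print(round(many_votes, 3))  # 4.49, stays close to its own average
```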

In [28]:
C = book_data['avg_rating'].mean()
C

4.0527289955780095

In [29]:
m = book_data['num_ratings'].quantile(0.1)
m

2421.9000000000005

In [30]:
def weighted_rating(book, m, C):
    # Average rating for the book.
    R = book['avg_rating']
    # Total number of votes for the book.
    v = book['num_ratings']
    # IMDB formula.
    return (v / (v+m) * R) + (m / (m+v) * C)

# Calculate the weighted rating for books that are within our threshold.
book_data.loc[book_data.num_ratings > m, 'weighted_rating'] = book_data.loc[book_data.num_ratings > m].apply(lambda x: weighted_rating(x, m, C), axis=1)

# Fill the NaN values (i.e., books lower than our threshold) with a zero score.
book_data['weighted_rating'] = book_data['weighted_rating'].fillna(0)

In [31]:
book_data.sort_values('weighted_rating', ascending=False).head(5)

Unnamed: 0,index,title,original_title,series,language,authors,avg_rating,num_ratings,num_reviews,genres,description,url,weighted_rating
1525,1539,The Complete Calvin and Hobbes,The Complete Calvin and Hobbes,Calvin and Hobbes,English,Bill Watterson,4.82,33322,961,"Sequential Art,Comics,Humor,Sequential Art,Gra...",[ Box Set | Book One | Book Two | Book Three...,https://www.goodreads.com/book/show/24812.The_...,4.768012
982,988,Words of Radiance,Words of Radiance,The Stormlight Archive,English,Brandon Sanderson,4.76,172432,10541,"Fantasy,Fiction,Fantasy,Epic Fantasy,Fantasy,H...",From #1 New York Times bestselling author Bran...,https://www.goodreads.com/book/show/17332218-w...,4.750204
6306,6538,"Harry Potter Boxed Set, Books 1-5 (Harry Potte...",,,English,"J.K. Rowling,Mary GrandPré (Illustrator)",4.78,39132,162,"Fantasy,Young Adult,Fiction,Fantasy,Magic",Box Set containing Harry Potter and the Sorcer...,https://www.goodreads.com/book/show/8.Harry_Po...,4.737612
1469,1481,Harry Potter Series Box Set,,Harry Potter,English,J.K. Rowling,4.74,234260,7065,"Fantasy,Young Adult,Fiction","Over 4000 pages of Harry Potter and his world,...",https://www.goodreads.com/book/show/862041.Har...,4.732967
5288,5455,It's a Magical World,It's a Magical World,Calvin and Hobbes,English,Bill Watterson,4.76,25119,334,"Sequential Art,Comics,Humor,Fiction,Sequential...",When cartoonist Bill Watterson announced that ...,https://www.goodreads.com/book/show/24814.It_s...,4.697804


In [32]:
book_data.sort_values('weighted_rating', ascending=False).tail(5)

Unnamed: 0,index,title,original_title,series,language,authors,avg_rating,num_ratings,num_reviews,genres,description,url,weighted_rating
6308,6540,Awakening Inner Guru,,,English,"Banani Ray,Amit Ray",4.78,104,24,"Spirituality,Inspirational,Self Help",Awakening Inner Guru is a clear and straightfo...,https://www.goodreads.com/book/show/8596181-aw...,0.0
6302,6534,30 Pieces of Gold: Self Growth - How to use In...,,,English,"Ron Millicent,Millie Parker (Editor)",4.31,128,1,"Novels,Inspirational,Contemporary,Adult,Self H...",Inspirational Quotes – Hah - Do They Really Wo...,https://www.goodreads.com/book/show/27467291-3...,0.0
6291,6520,The Pace,The Pace,The Pace,English,Shelena Shorts,3.7,1409,258,"Young Adult,Fantasy,Romance,Fantasy,Paranormal...",Weston Wilson is not immortal and he is of thi...,https://www.goodreads.com/book/show/6599113-th...,0.0
6282,6511,A Midnight Clear,A Midnight Clear,,English,William Wharton,4.18,1391,66,"Fiction,Historical,Historical Fiction,War,War,...",Set in the Ardennes Forest on Christmas Eve 19...,https://www.goodreads.com/book/show/720234.A_M...,0.0
4749,4890,Death of the Body,,Crossing Death,English,Rick Chiantaretto,3.82,217,74,"Fantasy,Fantasy,Paranormal,Fantasy,Urban Fanta...",I grew up in a world of magic. By the time I w...,https://www.goodreads.com/book/show/18624197-d...,0.0


In [33]:
del C
del m

## Content-Based Recommender System
The recommender is based on content, so we create a mixture of features per book that is used to calculate the similarity score between books.

The features used are the title, the series the book belongs to (if any), language, author(s), genres, and keywords extracted from the book's description.

Instead of treating each feature equally, we can add weight to a feature by repeating its words multiple times in the text that is vectorized to calculate similarity.
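
The effect of repeating words can be sketched in a few lines of pure Python. Everything below is illustrative: `make_soup` and `cosine` are hypothetical helpers working on word lists, standing in for the string-based soup and the scikit-learn vectorizer used in this notebook, and the two books are made up. With the genres repeated three times, two books that share genres but not title words become noticeably more similar.

```python
import math
from collections import Counter

def make_soup(title_words, genre_words, genres_importance=1):
    # Repeating a feature's words multiplies their counts in the word-count vector.
    return title_words + genre_words * genres_importance

def cosine(tokens_a, tokens_b):
    # Cosine similarity between the word-count vectors of two token lists.
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

# Two made-up books with the same genres but different titles.
titles = (['chamber', 'secrets'], ['golden', 'compass'])
genres = ['fantasy', 'youngadult']

for importance in (1, 3):
    soup_a = make_soup(titles[0], genres, importance)
    soup_b = make_soup(titles[1], genres, importance)
    print(importance, round(cosine(soup_a, soup_b), 3))
# prints "1 0.5", then "3 0.9"
```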

In [34]:
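# Build the stopword/punctuation set once instead of on every call.
STOP_TOKENS = set(stopwords.words('english')) | set(string.punctuation)

def clean_string(s):
    # Lowercase, tokenize, and drop stopwords and punctuation.
    return [n for n in wordpunct_tokenize(s.lower()) if n not in STOP_TOKENS]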

def create_soup(x):
    title_importance = 1
    language_importance = 1
    series_importance = 1
    authors_importance = 1
    genres_importance = 1

    soup = ''
    
    # Keywords from description.
    desc = x['description']
    if pd.notna(desc):
        rake = Rake()
        rake.extract_keywords_from_text(desc)
        desc_soup = ' '.join(list(rake.get_word_degrees().keys()))
        soup = ' '.join(filter(None, [soup, desc_soup]))
    
    # Title.
    title_soup = ' '.join(clean_string(x['title']) * title_importance)
    soup = ' '.join(filter(None, [soup, title_soup]))
    
    # Language.
    language = x['language']
    if pd.notna(language):
        language_soup = ' '.join(clean_string(language) * language_importance)
        soup = ' '.join(filter(None, [soup, language_soup]))
    
    # Series.
    series = x['series']
    if pd.notna(series):
        series_soup = ' '.join(clean_string(series) * series_importance)
        soup = ' '.join(filter(None, [soup, series_soup]))

    # Authors.
    authors = x['authors']
    if pd.notna(authors):
        author_soup = ' '.join([a.lower().replace(' ', '') for a in authors.split(',')] * authors_importance)
        soup = ' '.join(filter(None, [soup, author_soup]))
    
    # Genres.
    genres = x['genres']
    if pd.notna(genres):
        genre_soup = ' '.join([g.lower().replace(' ', '') for g in genres.split(',')] * genres_importance)
        soup = ' '.join(filter(None, [soup, genre_soup]))
    
    return soup

book_data['soup'] = book_data.apply(create_soup, axis=1)

In [35]:
book_data.soup.head()

0    remembered enormous project weaknesses crosses...
1    america two decades later fiction tim taking p...
2    philip k life stumbles upon threatens returnin...
3    surrealists porsche pop help eyre affair space...
4    tragic christmas cookies premier interpreters ...
Name: soup, dtype: object

Now it's time to create the similarity matrix between all books based on our lovely steaming soup.

In [36]:
count_vec = CountVectorizer()
count_matrix = count_vec.fit_transform(book_data['soup'])

# Pairwise cosine similarity between all books (a dense n x n matrix).
cos_sim = cosine_similarity(count_matrix, count_matrix)

In [38]:
# Reverse lookup of title vs. index.
title_to_index = pd.Series(book_data.index, index=book_data['title'])

def get_recommendation(title):
    idx = title_to_index[title]
    print(idx)
    print(book_data.loc[idx].soup)
    
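    # Sort all books by similarity to the given one, most similar first.
    scores = pd.Series(cos_sim[idx]).sort_values(ascending=False)
    # Position 0 is normally the book itself (similarity 1.0), so take the next ten.
    top_scores = scores.iloc[1:11]
    book_indices = list(top_scores.index)
    
    print(top_scores)
    return book_data.iloc[book_indices]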

get_recommendation('Harry Potter and the Chamber of Secrets')

1643
3308    0.410863
6306    0.408938
2762    0.380682
1675    0.369751
4940    0.361968
1469    0.325887
1739    0.320964
233     0.297491
1698    0.293085
1672    0.280109
dtype: float64


Unnamed: 0,index,title,original_title,series,language,authors,avg_rating,num_ratings,num_reviews,genres,description,url,weighted_rating,soup
3308,3382,Harry Potter and the Order of the Phoenix (Har...,,,English,J.K. Rowling,4.59,22648,418,"Fantasy,Fiction,Young Adult",,https://www.goodreads.com/book/show/1317181.Ha...,4.538096,harry potter order phoenix harry potter 5 part...
6306,6538,"Harry Potter Boxed Set, Books 1-5 (Harry Potte...",,,English,"J.K. Rowling,Mary GrandPré (Illustrator)",4.78,39132,162,"Fantasy,Young Adult,Fiction,Fantasy,Magic",Box Set containing Harry Potter and the Sorcer...,https://www.goodreads.com/book/show/8.Harry_Po...,4.737612,secrets goblet harry potter prisoner stone box...
2762,2810,Harry Potter Collection,"Harry Potter Collection (Harry Potter, #1-6)",Harry Potter,English,J.K. Rowling,4.73,29618,923,"Fantasy,Fiction,Young Adult,Fantasy,Magic","Six years of magic, adventure, and mystery mak...",https://www.goodreads.com/book/show/10.Harry_P...,4.678805,elegant hardcover boxed set ages thrilling sea...
1675,1691,Harry Potter and the Goblet of Fire,Harry Potter and the Goblet of Fire,Harry Potter,English,"J.K. Rowling,Mary GrandPré (Illustrator)",4.55,2286525,40512,"Fantasy,Young Adult,Fiction",Harry Potter is midway through his training as...,https://www.goodreads.com/book/show/6.Harry_Po...,4.549474,normal take place hundred years find age -- ba...
4940,5088,The Harry Potter trilogy,The Harry Potter trilogy: The Philosopher's St...,Harry Potter,English,J.K. Rowling,4.66,6373,167,"Fantasy,Fiction,Young Adult,Childrens,Adventur...",This box set collects hard cover editions Harr...,https://www.goodreads.com/book/show/2337379.Th...,4.492772,secrets harry potter prisoner stone slip case ...
1469,1481,Harry Potter Series Box Set,,Harry Potter,English,J.K. Rowling,4.74,234260,7065,"Fantasy,Young Adult,Fiction","Over 4000 pages of Harry Potter and his world,...",https://www.goodreads.com/book/show/862041.Har...,4.732967,internationally bestselling harry potter serie...
1739,1760,Harry Potter and the Sorcerer's Stone,Harry Potter and the Philosopher's Stone,Harry Potter,English,"J.K. Rowling,Mary GrandPré (Illustrator)",4.47,6215158,98678,"Fantasy,Young Adult,Fiction",Harry Potter's life is miserable. His parents ...,https://www.goodreads.com/book/show/3.Harry_Po...,4.469837,perfect relatives killing curse inflicted kill...
233,233,The Harry Potter Collection 1-4,Harry Potter Boxed Set Books 1-4,Harry Potter,English,"J.K. Rowling,Mary GrandPré (Illustrator)",4.67,48175,298,"Fantasy,Young Adult,Fiction","The exciting tales of Harry Potter, the young ...",https://www.goodreads.com/book/show/99298.The_...,4.640453,seem books storm easy buy one sorry -- next ch...
1698,1718,Harry Potter and the Deathly Hallows,Harry Potter and the Deathly Hallows,Harry Potter,English,J.K. Rowling,4.61,2539759,60223,"Fantasy,Young Adult,Fiction",Harry Potter is leaving Privet Drive for the l...,https://www.goodreads.com/book/show/136251.Har...,4.609469,one final battle destroy enemy leaving privet ...
1672,1688,Harry Potter and the Half-Blood Prince,Harry Potter and the Half-Blood Prince,Harry Potter,English,"J.K. Rowling,Mary GrandPré (Illustrator)",4.56,2176421,34865,"Fantasy,Young Adult,Fiction",When Harry Potter and the Half-Blood Prince op...,https://www.goodreads.com/book/show/1.Harry_Po...,4.559436,apparate objects fascinating fall disposal dum...
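
The `get_recommendation` function above skips the first position of the sorted scores, on the assumption that the queried book always ranks first. That holds as long as no other book reaches the same score, but a book with an identical feature string would tie at 1.0. The pure-Python sketch below, with hypothetical helper names and a made-up similarity row, shows one way to exclude the query by index instead of by position.

```python
def rank_by_score(scores):
    # Indices sorted from most to least similar (ties keep their original order).
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

def top_n_similar(scores, query_idx, n=10):
    # Drop the query by index rather than by position, then keep the best n.
    return [i for i in rank_by_score(scores) if i != query_idx][:n]

# Made-up similarity row for query book 2; book 0 has an identical
# feature string, so both score 1.0.
row = [1.0, 0.41, 1.0, 0.12, 0.38]

print(rank_by_score(row)[1:4])               # [2, 1, 4]  (positional slice keeps the query)
print(top_n_similar(row, query_idx=2, n=3))  # [0, 1, 4]
```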


Optionally, pickle the book data and the similarity matrix so they can be loaded elsewhere.

In [16]:
import pickle

should_export = False

if should_export:
    # Book data.
    print('Exporting book data...', end='')
    with open('book_data.pickle', 'wb') as f:
        pickle.dump(book_data, f)
    print('done!')
    
    # Cosine similarity (warning: this will be huge).
    print('Exporting similarity matrix...', end='')
    with open('cossim.pickle', 'wb') as f:
        pickle.dump(cos_sim, f)
    print('done!')
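
For completeness, loading the exported files elsewhere could look roughly like the sketch below. The helper `load_pickle` is a hypothetical name, and the commented-out calls assume the two file names from the export cell above sit in the working directory.

```python
import pickle

def load_pickle(path):
    # Only unpickle files you created yourself: pickle can execute arbitrary code.
    with open(path, 'rb') as f:
        return pickle.load(f)

# File names match the export cell above; adjust the paths as needed.
# book_data = load_pickle('book_data.pickle')
# cos_sim = load_pickle('cossim.pickle')
```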

In [18]:
book_data.loc[book_data.title.str.contains('Harry')]

Unnamed: 0,index,title,original_title,series,language,authors,avg_rating,num_ratings,num_reviews,genres,description,url,weighted_rating,soup
233,233,The Harry Potter Collection 1-4,Harry Potter Boxed Set Books 1-4,Harry Potter,English,"J.K. Rowling,Mary GrandPré (Illustrator)",4.67,48175,298,"Fantasy,Young Adult,Fiction","The exciting tales of Harry Potter, the young ...",https://www.goodreads.com/book/show/99298.The_...,4.640453,seem books storm easy buy one sorry -- next ch...
402,404,Harry Potter and the Cursed Child: Parts One a...,Harry Potter and the Cursed Child: Parts One a...,Harry Potter,English,"John Tiffany (Adaptation),Jack Thorne,J.K. Row...",3.66,592249,62023,"Fantasy,Fiction,Young Adult,Plays",Based on an original new story by J.K. Rowling...,https://www.goodreads.com/book/show/29056083-h...,3.661599,receive stage darkness comes john tiffany new ...
1469,1481,Harry Potter Series Box Set,,Harry Potter,English,J.K. Rowling,4.74,234260,7065,"Fantasy,Young Adult,Fiction","Over 4000 pages of Harry Potter and his world,...",https://www.goodreads.com/book/show/862041.Har...,4.732967,internationally bestselling harry potter serie...
1643,1659,Harry Potter and the Chamber of Secrets,Harry Potter and the Chamber of Secrets,Harry Potter,English,"J.K. Rowling,Mary GrandPré (Illustrator)",4.41,2401147,45913,"Fantasy,Young Adult,Fiction",The Dursleys were so mean and hideous that sum...,https://www.goodreads.com/book/show/15881.Harr...,4.40964,poisonous rival finally told girls one everyon...
1672,1688,Harry Potter and the Half-Blood Prince,Harry Potter and the Half-Blood Prince,Harry Potter,English,"J.K. Rowling,Mary GrandPré (Illustrator)",4.56,2176421,34865,"Fantasy,Young Adult,Fiction",When Harry Potter and the Half-Blood Prince op...,https://www.goodreads.com/book/show/1.Harry_Po...,4.559436,apparate objects fascinating fall disposal dum...
1675,1691,Harry Potter and the Goblet of Fire,Harry Potter and the Goblet of Fire,Harry Potter,English,"J.K. Rowling,Mary GrandPré (Illustrator)",4.55,2286525,40512,"Fantasy,Young Adult,Fiction",Harry Potter is midway through his training as...,https://www.goodreads.com/book/show/6.Harry_Po...,4.549474,normal take place hundred years find age -- ba...
1679,1695,Harry Potter and the Prisoner of Azkaban,Harry Potter and the Prisoner of Azkaban,Harry Potter,English,"J.K. Rowling,Mary GrandPré (Illustrator)",4.56,2443994,48308,"Fantasy,Young Adult,Fiction",Harry Potter's third year at Hogwarts is full ...,https://www.goodreads.com/book/show/5.Harry_Po...,4.559498,seem azkaban guards terrible power becomes clo...
1698,1718,Harry Potter and the Deathly Hallows,Harry Potter and the Deathly Hallows,Harry Potter,English,J.K. Rowling,4.61,2539759,60223,"Fantasy,Young Adult,Fiction",Harry Potter is leaving Privet Drive for the l...,https://www.goodreads.com/book/show/136251.Har...,4.609469,one final battle destroy enemy leaving privet ...
1739,1760,Harry Potter and the Sorcerer's Stone,Harry Potter and the Philosopher's Stone,Harry Potter,English,"J.K. Rowling,Mary GrandPré (Illustrator)",4.47,6215158,98678,"Fantasy,Young Adult,Fiction",Harry Potter's life is miserable. His parents ...,https://www.goodreads.com/book/show/3.Harry_Po...,4.469837,perfect relatives killing curse inflicted kill...
1773,1797,Harry Potter and the Order of the Phoenix,Harry Potter and the Order of the Phoenix,Harry Potter,English,"J.K. Rowling,Mary GrandPré (Illustrator)",4.49,2231129,37092,"Fantasy,Young Adult,Fiction",There is a door at the end of a silent corrido...,https://www.goodreads.com/book/show/2.Harry_Po...,4.489526,things pale next night boundless loyalty middl...
