# Content-Based

Author: Nirta Ika Yunita & Samuel Natamihardja
<br>Date: November 18, 2019

Good datasets exist for movie (Netflix, MovieLens) and music (Million Songs) recommendation, but not for books. That is, until now.

This dataset contains ratings for ten thousand popular books. As to the source, let's say that these ratings were found on the internet. Generally, there are 100 reviews for each book, although some have fewer. Ratings go from one to five.

Both book IDs and user IDs are contiguous: book IDs run from 1 to 10000 and user IDs from 1 to 53424. All users have made at least two ratings. The median number of ratings per user is 8.
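The claims above (contiguous IDs, at least two ratings per user, the median count) can be checked directly on a ratings table. The following is a minimal sketch, assuming the columns `book_id`, `user_id`, and `rating`; the helper name `summarize_ratings` and the toy data are hypothetical and stand in for the real `ratings.csv`.

```python
import pandas as pd

def summarize_ratings(ratings: pd.DataFrame) -> dict:
    """Summarize ID contiguity and per-user rating counts (hypothetical helper)."""
    per_user = ratings.groupby('user_id')['rating'].count()
    book_ids = ratings['book_id']
    user_ids = ratings['user_id']
    return {
        # IDs are contiguous when the number of distinct IDs fills the min..max range
        'books_contiguous': bool(book_ids.nunique() == book_ids.max() - book_ids.min() + 1),
        'users_contiguous': bool(user_ids.nunique() == user_ids.max() - user_ids.min() + 1),
        'min_ratings_per_user': int(per_user.min()),
        'median_ratings_per_user': float(per_user.median()),
        'rating_range': (int(ratings['rating'].min()), int(ratings['rating'].max())),
    }

# Toy data made up for illustration; it is not taken from ratings.csv
toy_ratings = pd.DataFrame({
    'book_id': [1, 1, 2, 2, 3, 3],
    'user_id': [1, 2, 1, 3, 2, 3],
    'rating':  [5, 3, 4, 1, 2, 5],
})
summary = summarize_ratings(toy_ratings)
print(summary)
```

Running the same helper on the real table would show whether the description holds for the file actually loaded.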

There are also books marked to read by the users, book metadata (author, year, etc.) and tags.

## Import Library

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

## Import Data

In [2]:
books = pd.read_csv('new_books.csv')
ratings = pd.read_csv('ratings.csv')

In [3]:
books = books[['book_id', 'authors', 'title', 'language_code', 'tag_name', 'image_url']]
print("Shape of 'books' dataset :", books.shape)
books.head()

Shape of 'books' dataset : (9759, 6)


| | book_id | authors | title | language_code | tag_name | image_url |
|---|---|---|---|---|---|---|
| 0 | 2767052 | Suzanne Collins | The Hunger Games (The Hunger Games, #1) | eng | young adult | https://images.gr-assets.com/books/1447303603m... |
| 1 | 3 | J.K. Rowling, Mary GrandPré | Harry Potter and the Sorcerer's Stone (Harry P... | eng | fantasy | https://images.gr-assets.com/books/1474154022m... |
| 2 | 41865 | Stephenie Meyer | Twilight (Twilight, #1) | en-US | young adult | https://images.gr-assets.com/books/1361039443m... |
| 3 | 2657 | Harper Lee | To Kill a Mockingbird | eng | classics | https://images.gr-assets.com/books/1361975680m... |
| 4 | 4671 | F. Scott Fitzgerald | The Great Gatsby | eng | classics | https://images.gr-assets.com/books/1490528560m... |


In [4]:
print("Shape of 'ratings' dataset :", ratings.shape)
ratings.head()


| | book_id | user_id | rating |
|---|---|---|---|
| 0 | 1 | 314 | 5 |
| 1 | 1 | 439 | 3 |
| 2 | 1 | 588 | 5 |
| 3 | 1 | 1169 | 4 |
| 4 | 1 | 1185 | 4 |


## Content-Based Recommender

In [5]:
# create metadata for similarity using 'authors', 'tag_name', and 'language_code'
def create_metadata(x):
    # str() keeps the join from failing on non-string values such as NaN
    return '  '.join(str(x[col]) for col in ('authors', 'tag_name', 'language_code'))

In [6]:
books['metadata']= books.apply(create_metadata,axis=1)
books['metadata']= books['metadata'].fillna('')
books.head()

| | book_id | authors | title | language_code | tag_name | image_url | metadata |
|---|---|---|---|---|---|---|---|
| 0 | 2767052 | Suzanne Collins | The Hunger Games (The Hunger Games, #1) | eng | young adult | https://images.gr-assets.com/books/1447303603m... | Suzanne Collins young adult eng |
| 1 | 3 | J.K. Rowling, Mary GrandPré | Harry Potter and the Sorcerer's Stone (Harry P... | eng | fantasy | https://images.gr-assets.com/books/1474154022m... | J.K. Rowling, Mary GrandPré fantasy eng |
| 2 | 41865 | Stephenie Meyer | Twilight (Twilight, #1) | en-US | young adult | https://images.gr-assets.com/books/1361039443m... | Stephenie Meyer young adult en-US |
| 3 | 2657 | Harper Lee | To Kill a Mockingbird | eng | classics | https://images.gr-assets.com/books/1361975680m... | Harper Lee classics eng |
| 4 | 4671 | F. Scott Fitzgerald | The Great Gatsby | eng | classics | https://images.gr-assets.com/books/1490528560m... | F. Scott Fitzgerald classics eng |
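One detail of the metadata step is easy to miss: calling `fillna('')` on the finished `metadata` column cannot catch missing inputs, because a missing `language_code` has already been stringified to the text `nan` by then. A possible alternative, sketched here on made-up rows rather than the real `new_books.csv`, is to fill the source columns first and concatenate them column-wise. The names `toy_books` and `filled` are illustrative only.

```python
import pandas as pd

# Made-up rows for illustration; the second book has no language code
toy_books = pd.DataFrame({
    'authors': ['Harper Lee', 'F. Scott Fitzgerald'],
    'tag_name': ['classics', 'classics'],
    'language_code': ['eng', None],
})

cols = ['authors', 'tag_name', 'language_code']
# Replace missing values before concatenating, so no 'nan' token leaks in
filled = toy_books[cols].fillna('')
toy_books['metadata'] = (
    filled['authors'] + ' ' + filled['tag_name'] + ' ' + filled['language_code']
).str.strip()

print(toy_books['metadata'].tolist())
```

Column-wise concatenation also avoids the row-by-row `apply`, which tends to be slower on larger tables.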


The **TfidfVectorizer** class from scikit-learn transforms text into TF-IDF feature vectors that can be used as input to an estimator.

In [7]:
# build a TF-IDF vectorizer that drops English stop words
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words = 'english')
vectorizer

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='content', lowercase=True, max_df=1.0, max_features=None,
                min_df=1, ngram_range=(1, 1), norm='l2', preprocessor=None,
                smooth_idf=True, stop_words='english', strip_accents=None,
                sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, use_idf=True, vocabulary=None)

In [8]:
tfidf_matrix = vectorizer.fit_transform(books['metadata'])
tfidf_matrix

<9759x6154 sparse matrix of type '<class 'numpy.float64'>'
	with 47698 stored elements in Compressed Sparse Row format>
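To make the effect of TF-IDF weighting concrete, here is a small sketch on three made-up metadata strings (not rows from the dataset). It shows that a token shared by every document, such as `eng`, receives a lower weight than a token that appears in only one document, such as an author name. The variable names are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up metadata strings in the same "authors tag language" layout
toy_metadata = [
    'Harper Lee classics eng',
    'F. Scott Fitzgerald classics eng',
    'Suzanne Collins young adult eng',
]

toy_vectorizer = TfidfVectorizer(stop_words='english')
toy_tfidf = toy_vectorizer.fit_transform(toy_metadata)

vocab = toy_vectorizer.vocabulary_          # token -> column index
first_row = toy_tfidf[0].toarray().ravel()  # weights for 'Harper Lee classics eng'

print(toy_tfidf.shape)  # (number of documents, vocabulary size)
print('harper:', first_row[vocab['harper']])
print('eng:   ', first_row[vocab['eng']])
```

Note that the default token pattern keeps only tokens of two or more characters, so the initial `F` is dropped from the vocabulary.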

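**Cosine Similarity** gives a numeric value that denotes the similarity between two books. Because `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), the dot product computed by `linear_kernel` equals the cosine similarity.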

In [9]:
# cosine similarity using linear kernel
from sklearn.metrics.pairwise import linear_kernel

cos_matrix = linear_kernel(tfidf_matrix, tfidf_matrix)
cos_matrix

array([[1.        , 0.01477531, 0.17848556, ..., 0.01830137, 0.01439408,
        0.        ],
       [0.01477531, 1.        , 0.        , ..., 0.01708608, 0.01343824,
        0.        ],
       [0.17848556, 0.        , 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.01830137, 0.01708608, 0.        , ..., 1.        , 0.04980255,
        0.        ],
       [0.01439408, 0.01343824, 0.        , ..., 0.04980255, 1.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        1.        ]])
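The use of `linear_kernel` in place of an explicit cosine computation relies on the TF-IDF rows having unit length. The sketch below checks this on made-up strings by comparing `linear_kernel` with scikit-learn's `cosine_similarity`; the toy data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Made-up metadata strings, not rows from the dataset
toy_metadata = [
    'Harper Lee classics eng',
    'F. Scott Fitzgerald classics eng',
    'Suzanne Collins young adult eng',
]
toy_tfidf = TfidfVectorizer().fit_transform(toy_metadata)

dot_products = linear_kernel(toy_tfidf, toy_tfidf)   # plain dot products
cosines = cosine_similarity(toy_tfidf, toy_tfidf)    # normalizes, then dot products

# With L2-normalized rows the two matrices agree up to floating-point error
print(np.allclose(dot_products, cosines))
```

If the vectorizer were created with `norm=None`, the two results would generally differ, and `cosine_similarity` would be the safer choice.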

### Return 15 most similar books based on the cosine similarity score

In [10]:
# build a 1-dimensional Series that maps book titles to row indices
titles = books['title']
indices = pd.Series(books.index, index=books['title'])

# function that gets book recommendations based on the cosine similarity score of 'metadata'
def get_recommendations(name, sim):
    index = indices[name]
    sim_scores = list(enumerate(sim[index]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # skip position 0 (the queried book itself) and keep the next 15
    sim_scores = sim_scores[1:16]
    book_indices = [i[0] for i in sim_scores]
    return titles.iloc[book_indices]
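The lookup above has a few edge cases: `indices[name]` raises a `KeyError` for an unknown title, it returns several positions when a title appears more than once, and slicing from position 1 assumes the queried book always sorts first. A possible variant that handles these explicitly is sketched below with NumPy's `argsort`. The function name `top_k_similar` and the toy similarity matrix are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd

def top_k_similar(title, titles, sim, k=15):
    """Return up to k titles most similar to `title`, excluding the book itself."""
    matches = np.flatnonzero(titles.to_numpy() == title)
    if matches.size == 0:
        raise KeyError(f'Title not found: {title!r}')
    idx = matches[0]  # if a title is duplicated, use its first occurrence
    # sort positions by descending similarity; 'stable' keeps ties in row order
    order = np.argsort(-sim[idx], kind='stable')
    # drop the queried book by position rather than assuming it is ranked first
    order = order[order != idx][:k]
    return titles.iloc[order]

# Toy titles and a made-up symmetric similarity matrix
toy_titles = pd.Series(['Othello', 'King Lear', 'Twilight', 'Macbeth'])
toy_sim = np.array([
    [1.0, 0.9, 0.0, 0.8],
    [0.9, 1.0, 0.1, 0.7],
    [0.0, 0.1, 1.0, 0.0],
    [0.8, 0.7, 0.0, 1.0],
])

result = top_k_similar('Othello', toy_titles, toy_sim, k=2)
print(result.tolist())  # ['King Lear', 'Macbeth']
```

Sorting a single row with `argsort` also avoids building a Python list of tuples for every query.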

In [11]:
# interactive script
print("We can give you a recommendation! What are you reading lately?\n")
read = input("Lately I'm reading --> ")
print("\nThese are our top 15 book recommendations for you :)")
get_recommendations(read, cos_matrix)

We can give you a recommendation! What are you reading lately?

Lately I'm reading --> Romeo and Juliet

These are our top 15 book recommendations for you :)


338                     Othello
691                   King Lear
828               Twelfth Night
1824             As You Like It
2137         The Complete Works
749     The Taming of the Shrew
779                 The Tempest
6373           Titus Andronicus
8675                 Richard II
6530       The Comedy of Errors
813      The Merchant of Venice
146                     Macbeth
6263          The Winter's Tale
512      Much Ado About Nothing
3601                Richard III
Name: title, dtype: object