# Hybrid

Author: Nirta Ika Yunita & Samuel Natamihardja
<br>Date: November 18, 2019

There have been good datasets for movies (Netflix, Movielens) and music (Million Songs) recommendation, but not for books. That is, until now.

This dataset contains ratings for ten thousand popular books. As to the source, let's say that these ratings were found on the internet. Generally, there are 100 reviews for each book, although some have less - fewer - ratings. Ratings go from one to five.

Both book IDs and user IDs are contiguous. For books, they are 1-10000, for users, 1-53424. All users have made at least two ratings. Median number of ratings per user is 8.

There are also books marked to read by the users, book metadata (author, year, etc.) and tags.

#### Importing Library

In [83]:
import pandas as pd #data wrangling
import numpy as np #calculation
import matplotlib.pyplot as plt #visualization
import seaborn as sns #visualization


from scipy.sparse import csr_matrix #prepare matrix

#Model
import surprise
from surprise import KNNWithMeans
from surprise import Dataset
from surprise import Reader


#Model Evaluation
from sklearn import metrics
from sklearn.metrics import auc, roc_curve
from sklearn.metrics import accuracy_score

## Importing Dataset

In [84]:
#importing dataset
books = pd.read_csv('new_books.csv')
ratings = pd.read_csv('ratings.csv')

#### Copy Dataset

In [85]:
df_books = books[['book_id','original_publication_year','title','authors', 'language_code','tag_name','image_url']]
print("Shape of 'books' dataset :", df_books.shape)
df_books.head()

Shape of 'books' dataset : (9759, 7)


Unnamed: 0,book_id,original_publication_year,title,authors,language_code,tag_name,image_url
0,2767052,2008.0,"The Hunger Games (The Hunger Games, #1)",Suzanne Collins,eng,young adult,https://images.gr-assets.com/books/1447303603m...
1,3,1997.0,Harry Potter and the Sorcerer's Stone (Harry P...,"J.K. Rowling, Mary GrandPré",eng,fantasy,https://images.gr-assets.com/books/1474154022m...
2,41865,2005.0,"Twilight (Twilight, #1)",Stephenie Meyer,en-US,young adult,https://images.gr-assets.com/books/1361039443m...
3,2657,1960.0,To Kill a Mockingbird,Harper Lee,eng,classics,https://images.gr-assets.com/books/1361975680m...
4,4671,1925.0,The Great Gatsby,F. Scott Fitzgerald,eng,classics,https://images.gr-assets.com/books/1490528560m...


In [86]:
df_ratings = ratings.copy()

#### Handling Missing Value

In [87]:
df_books.isnull().sum()

book_id                         0
original_publication_year      21
title                           0
authors                         0
language_code                1069
tag_name                        0
image_url                       0
dtype: int64

In [88]:
df_books.dropna(inplace = True)
df_books.isnull().sum()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


book_id                      0
original_publication_year    0
title                        0
authors                      0
language_code                0
tag_name                     0
image_url                    0
dtype: int64

In [89]:
df_ratings.isnull().sum()

book_id    0
user_id    0
rating     0
dtype: int64

In [90]:
df_ratings = df_ratings.astype(int)

In [91]:
df_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 981756 entries, 0 to 981755
Data columns (total 3 columns):
book_id    981756 non-null int32
user_id    981756 non-null int32
rating     981756 non-null int32
dtypes: int32(3)
memory usage: 11.2 MB


### Final Dataset

Dataset books

In [92]:
df_books = df_books.astype({"original_publication_year": int})
df_books.head()

Unnamed: 0,book_id,original_publication_year,title,authors,language_code,tag_name,image_url
0,2767052,2008,"The Hunger Games (The Hunger Games, #1)",Suzanne Collins,eng,young adult,https://images.gr-assets.com/books/1447303603m...
1,3,1997,Harry Potter and the Sorcerer's Stone (Harry P...,"J.K. Rowling, Mary GrandPré",eng,fantasy,https://images.gr-assets.com/books/1474154022m...
2,41865,2005,"Twilight (Twilight, #1)",Stephenie Meyer,en-US,young adult,https://images.gr-assets.com/books/1361039443m...
3,2657,1960,To Kill a Mockingbird,Harper Lee,eng,classics,https://images.gr-assets.com/books/1361975680m...
4,4671,1925,The Great Gatsby,F. Scott Fitzgerald,eng,classics,https://images.gr-assets.com/books/1490528560m...


Dataset books rating

In [93]:

df_ratings.head()

Unnamed: 0,book_id,user_id,rating
0,1,314,5
1,1,439,3
2,1,588,5
3,1,1169,4
4,1,1185,4


## Item-Based Collaborative Filtering(IB-CF)
With this method, item to item filtering tried to find items similarity.
Example question: "Users who liked this item also liked ..."

In [94]:
#reader
reader = Reader(rating_scale=(1, 5))

In [95]:
#prepare data that going tobe used by model
data = Dataset.load_from_df(df_ratings[["user_id", "book_id", "rating"]].head(5000), reader)

In [96]:
#modelling with KNN with means
model_knn = KNNWithMeans(sim_options={"name":"msd","user_base":False},k=5)

In [97]:
#prepare training data
data_train = data.build_full_trainset()

In [98]:
#fit training data to model
model_knn.fit(data_train)

Computing the msd similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x131d9b08>

### Prediction Test

recommend books to our top reviewer user.

In [99]:
#show top 5 user based on their reviews
df_ratings['user_id'].value_counts().head(5)

12874    200
30944    200
52036    199
28158    199
12381    199
Name: user_id, dtype: int64

In [100]:
df_ratings[df_ratings.user_id == 12874].sort_values('rating', ascending = False)

Unnamed: 0,book_id,user_id,rating
34340,344,12874,5
14128,142,12874,5
86730,868,12874,5
68037,681,12874,5
33834,339,12874,5
...,...,...,...
97635,977,12874,2
101438,1015,12874,2
116620,1167,12874,2
15331,154,12874,2


In [101]:
#temporary dataset for prediction result
pred = pd.DataFrame()

#looping book recommendation for user: 12874
for i in range(1,len(df_books)):
    pred.set_value(i,'book_id',df_books.book_id.iloc[i])
    pred.set_value(i,'predRating',model_knn.predict(12874, df_books.book_id.iloc[i]).est)

  
  import sys


In [102]:
#display recommendation result for user: 12874
final_pred = pd.merge(left=pred,right=df_books, left_on='book_id', right_on='book_id')
display(final_pred[~final_pred.book_id.isin(df_ratings[['book_id']][df_ratings.user_id == 12874])].sort_values('predRating',ascending=False).head(10))

Unnamed: 0,book_id,predRating,original_publication_year,title,authors,language_code,tag_name,image_url
1826,25.0,4.708366,1998,I'm a Stranger Here Myself: Notes on Returning...,Bill Bryson,en-US,nonfiction,https://s.gr-assets.com/assets/nophoto/book/11...
2132,26.0,4.65339,1989,The Lost Continent: Travels in Small Town America,Bill Bryson,en-US,travel,https://images.gr-assets.com/books/1404042682m...
25,1.0,4.613233,2005,Harry Potter and the Half-Blood Prince (Harry ...,"J.K. Rowling, Mary GrandPré",eng,fantasy,https://images.gr-assets.com/books/1361039191m...
16,5.0,4.406787,1999,Harry Potter and the Prisoner of Azkaban (Harr...,"J.K. Rowling, Mary GrandPré, Rufus Beck",eng,fantasy,https://images.gr-assets.com/books/1499277281m...
314,13.0,4.326805,1996,The Ultimate Hitchhiker's Guide to the Galaxy,Douglas Adams,eng,science fiction,https://images.gr-assets.com/books/1404613595m...
3401,10.0,4.17325,2005,"Harry Potter Collection (Harry Potter, #1-6)",J.K. Rowling,eng,fantasy,https://images.gr-assets.com/books/1328867351m...
19,2.0,4.162607,2003,Harry Potter and the Order of the Phoenix (Har...,"J.K. Rowling, Mary GrandPré",eng,fantasy,https://images.gr-assets.com/books/1387141547m...
364,50.0,4.123626,1986,"Hatchet (Brian's Saga, #1)",Gary Paulsen,en-US,young adult,https://s.gr-assets.com/assets/nophoto/book/11...
350,21.0,4.097652,2003,A Short History of Nearly Everything,Bill Bryson,en-US,history,https://s.gr-assets.com/assets/nophoto/book/11...
175,33.0,4.036739,1955,"The Lord of the Rings (The Lord of the Rings, ...",J.R.R. Tolkien,eng,fantasy,https://images.gr-assets.com/books/1411114164m...


## Content Based Recommender

In [103]:
# create metadata for similarity using 'author', 'tag_name', and 'language_code'
def create_metadata(x):
    return ''.join(x['authors'])+'  '+''.join(x['tag_name'])+'  '+''.join(str(x['language_code']))

In [104]:
df_books['metadata']= df_books.apply(create_metadata,axis=1)
df_books['metadata']= df_books['metadata'].fillna('')
df_books.head()

Unnamed: 0,book_id,original_publication_year,title,authors,language_code,tag_name,image_url,metadata
0,2767052,2008,"The Hunger Games (The Hunger Games, #1)",Suzanne Collins,eng,young adult,https://images.gr-assets.com/books/1447303603m...,Suzanne Collins young adult eng
1,3,1997,Harry Potter and the Sorcerer's Stone (Harry P...,"J.K. Rowling, Mary GrandPré",eng,fantasy,https://images.gr-assets.com/books/1474154022m...,"J.K. Rowling, Mary GrandPré fantasy eng"
2,41865,2005,"Twilight (Twilight, #1)",Stephenie Meyer,en-US,young adult,https://images.gr-assets.com/books/1361039443m...,Stephenie Meyer young adult en-US
3,2657,1960,To Kill a Mockingbird,Harper Lee,eng,classics,https://images.gr-assets.com/books/1361975680m...,Harper Lee classics eng
4,4671,1925,The Great Gatsby,F. Scott Fitzgerald,eng,classics,https://images.gr-assets.com/books/1490528560m...,F. Scott Fitzgerald classics eng


**TfidfVectorizer** function from scikit-learn, which transforms text to feature vectors that can be used as input to estimator.

In [105]:
# finding the similarity between two books
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words = 'english')
vectorizer

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='content', lowercase=True, max_df=1.0, max_features=None,
                min_df=1, ngram_range=(1, 1), norm='l2', preprocessor=None,
                smooth_idf=True, stop_words='english', strip_accents=None,
                sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, use_idf=True, vocabulary=None)

In [106]:
tfidf_matrix = vectorizer.fit_transform(df_books['metadata'])
tfidf_matrix

<8674x5719 sparse matrix of type '<class 'numpy.float64'>'
	with 42516 stored elements in Compressed Sparse Row format>

**Cosine Similarity** to calculate a numeric value that denotes the similarity between two books.

In [107]:
# cosine similarity using linear kernel
from sklearn.metrics.pairwise import linear_kernel

cos_matrix = linear_kernel(tfidf_matrix, tfidf_matrix)

In [108]:
# build a 1-dimensional array fot book titles and indices
titles = books['title']
indices = pd.Series(books.index, index = books['title'])

# function that get book recommendations based on the cosine similarity score of 'metadata'
def get_recommendations(name, sim):
    index = books.loc[books['title'] == name].index
    index = indices[name]
    sim_scores = list(enumerate(sim[index]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse = True)
    book_indeces = [i[0] for i in sim_scores]
    return titles.iloc[book_indeces]

What is latest book that our top user read?

In [138]:
#get latest book id
temp_rating = ratings[ratings.user_id == 12874]
temp_rating = pd.merge(left = temp_rating,right=df_books, left_on='book_id', right_on='book_id')
temp_result = temp_rating[['title','original_publication_year']].sort_values('original_publication_year', ascending = False).head(1)
temp_result

Unnamed: 0,title,original_publication_year
3,"Harry Potter Collection (Harry Potter, #1-6)",2005


In [140]:
read = temp_result['title'].iloc[0]
print("\nThese are recommendation books for you :)")
get_recommendations(read, cos_matrix)


These are recommendation books for you :)


3651         Harry Potter Collection (Harry Potter, #1-6)
5118                 Czas pogardy (Saga o Wiedźminie, #4)
5351    The Law of Attraction: The Basics of the Teach...
8608        The Green Mile, Part 2: The Mouse on the Mile
1209    Behind the Beautiful Forevers: Life, Death, an...
                              ...                        
8669                                Fruits Basket, Vol. 2
8670    Misquoting Jesus: The Story Behind Who Changed...
8671                                          Wait for It
8672                  The Other Wind (Earthsea Cycle, #6)
8673               The Paris Vendetta (Cotton Malone, #5)
Name: title, Length: 8674, dtype: object

## Hybrid Recommendation
Average of Content based and Collaborative Filtering

In [141]:
df_sim_scores = pd.DataFrame(sim_scores, columns = ['book_id','pred_sim'])
df_sim_scores['pred_sim'] = df_sim_scores['pred_sim']*10
df_sim_scores.head()

Unnamed: 0,book_id,pred_sim
0,28,10.0
1,338,7.473266
2,691,7.473266
3,828,7.473266
4,1824,7.473266


In [137]:
df_hybrid = pd.merge(left = df_sim_scores,right=final_pred, left_on='book_id', right_on='book_id')

df_hybrid['pred_hybrid'] = (df_hybrid.pred_sim + df_hybrid.predRating)/2
df_hybrid[['book_id','title','authors','tag_name','pred_hybrid']].sort_values('pred_hybrid', ascending = False).head(5)

Unnamed: 0,book_id,title,authors,tag_name,pred_hybrid
0,28,Notes from a Small Island,Bill Bryson,travel,6.728702
1,1824,The Men Who Stare at Goats,Jon Ronson,fiction,5.700633
2,2137,A Home at the End of the World,Michael Cunningham,fiction,5.700633
3,6530,"Trace (Kay Scarpetta, #13)",Patricia Cornwell,thriller,5.471959
4,119,The Lord of the Rings: The Art of The Fellowsh...,Gary Russell,fantasy,4.092403
