# Books recommendation system

## Datasets

  https://sites.google.com/eng.ucsd.edu/ucsdbookgraph/home

### Download book data
- Go to https://sites.google.com/eng.ucsd.edu/ucsdbookgraph/books
- Download book data from https://drive.google.com/uc?id=1LXpK1UfqtP89H1tYy0pBGHjYk8IhigUK

### Download interaction data

- Go to https://sites.google.com/eng.ucsd.edu/ucsdbookgraph/shelves
- Download https://drive.google.com/open?id=1zmylV7XW2dfQVCLeg1LbllfQtHD2KUon
- Download https://drive.google.com/uc?id=1CHTAaNwyzvbi1TR08MJrJ03BxA266Yxr


**Note**: Goodreads no longer provides API access to datasets. We are using Goodreads Book Data scraped by reseachers at UCSD and is availbe at above links.


## Project steps

1. search for books
2. Create books list
3. Recommend books

## Step1 - Build books search engine

In [21]:
# Mount google drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [22]:
# copy files to runtime
!cp /content/drive/MyDrive/Colab_Notebooks/Projects/books_recommendation/datasets/* .

In [23]:
## Let's look at first dataset
# number of lines in this file
!wc -l goodreads_books.json.gz

7588375 goodreads_books.json.gz


In [24]:
# how large this dataset is - size
!ls -lh | grep goodreads_books.json.gz

-rw------- 1 root root 2.0G Feb 23 20:49 goodreads_books.json.gz


In [25]:
# this dataset jason file is large and pd.read_json is not good way to work with it in memory
# use streaming method as below

import gzip


with gzip.open("goodreads_books.json.gz", 'r') as f:
    line = f.readline()

In [26]:
line

b'{"isbn": "0312853122", "text_reviews_count": "1", "series": [], "country_code": "US", "language_code": "", "popular_shelves": [{"count": "3", "name": "to-read"}, {"count": "1", "name": "p"}, {"count": "1", "name": "collection"}, {"count": "1", "name": "w-c-fields"}, {"count": "1", "name": "biography"}], "asin": "", "is_ebook": "false", "average_rating": "4.00", "kindle_asin": "", "similar_books": [], "description": "", "format": "Paperback", "link": "https://www.goodreads.com/book/show/5333265-w-c-fields", "authors": [{"author_id": "604031", "role": ""}], "publisher": "St. Martin\'s Press", "num_pages": "256", "publication_day": "1", "isbn13": "9780312853129", "publication_month": "9", "edition_information": "", "publication_year": "1984", "url": "https://www.goodreads.com/book/show/5333265-w-c-fields", "image_url": "https://images.gr-assets.com/books/1310220028m/5333265.jpg", "book_id": "5333265", "ratings_count": "3", "work_id": "5400751", "title": "W.C. Fields: A Life on Film", "t

In [27]:
# use json midile to load this single line

import json

json.loads(line)

{'asin': '',
 'authors': [{'author_id': '604031', 'role': ''}],
 'average_rating': '4.00',
 'book_id': '5333265',
 'country_code': 'US',
 'description': '',
 'edition_information': '',
 'format': 'Paperback',
 'image_url': 'https://images.gr-assets.com/books/1310220028m/5333265.jpg',
 'is_ebook': 'false',
 'isbn': '0312853122',
 'isbn13': '9780312853129',
 'kindle_asin': '',
 'language_code': '',
 'link': 'https://www.goodreads.com/book/show/5333265-w-c-fields',
 'num_pages': '256',
 'popular_shelves': [{'count': '3', 'name': 'to-read'},
  {'count': '1', 'name': 'p'},
  {'count': '1', 'name': 'collection'},
  {'count': '1', 'name': 'w-c-fields'},
  {'count': '1', 'name': 'biography'}],
 'publication_day': '1',
 'publication_month': '9',
 'publication_year': '1984',
 'publisher': "St. Martin's Press",
 'ratings_count': '3',
 'series': [],
 'similar_books': [],
 'text_reviews_count': '1',
 'title': 'W.C. Fields: A Life on Film',
 'title_without_series': 'W.C. Fields: A Life on Film',
 'u

In [28]:
# write function to parse single line and return parts of json

def parse_fields(line):
    data = json.loads(line)
    return {
        "book_id": data["book_id"],
        "title": data["title_without_series"],
        "ratings": data["ratings_count"],
        "url": data["url"],
        "image_url": data["image_url"]
    }   

In [30]:
# Create a lis of dict to go line by line and parse

books_titles = []
with gzip.open("goodreads_books.json.gz", 'r') as f:
    while True:
        line = f.readline()
        if not line:
            break

        fields = parse_fields(line)

        try:
            ratings = int(fields["ratings"])  ## turn into int - if errors out - continue
        except ValueError:
            continue
        if ratings > 5:    ## get rid of data with leass than 15 ratings and get list of books with ratings more than 15 only 
            books_titles.append(fields)

In [31]:
# convert list into dataframe

import pandas as pd

titles = pd.DataFrame.from_dict(books_titles)

In [32]:
# turn ratings to numerical column
titles["ratings"] = pd.to_numeric(titles["ratings"])

In [33]:
titles

Unnamed: 0,book_id,title,ratings,url,image_url
0,1333909,Good Harbor,10,https://www.goodreads.com/book/show/1333909.Go...,https://s.gr-assets.com/assets/nophoto/book/11...
1,7327624,"The Unschooled Wizard (Sun Wolf and Starhawk, ...",140,https://www.goodreads.com/book/show/7327624-th...,https://images.gr-assets.com/books/1304100136m...
2,6066819,Best Friends Forever,51184,https://www.goodreads.com/book/show/6066819-be...,https://s.gr-assets.com/assets/nophoto/book/11...
3,287140,Runic Astrology: Starcraft and Timekeeping in ...,15,https://www.goodreads.com/book/show/287140.Run...,https://images.gr-assets.com/books/1413219371m...
4,287141,The Aeneid for Boys and Girls,46,https://www.goodreads.com/book/show/287141.The...,https://s.gr-assets.com/assets/nophoto/book/11...
...,...,...,...,...,...
1782574,3084038,"This Sceptred Isle, Vol. 10: The Age of Victor...",12,https://www.goodreads.com/book/show/3084038-th...,https://images.gr-assets.com/books/1494763458m...
1782575,26168430,Sherlock Holmes and the July Crisis,6,https://www.goodreads.com/book/show/26168430-s...,https://images.gr-assets.com/books/1440592011m...
1782576,2342551,The Children's Classic Poetry Collection,36,https://www.goodreads.com/book/show/2342551.Th...,https://s.gr-assets.com/assets/nophoto/book/11...
1782577,22017381,"101 Nights: Volume One (101 Nights, #1-3)",70,https://www.goodreads.com/book/show/22017381-1...,https://images.gr-assets.com/books/1398621236m...


In [34]:
# preprocessing
titles["mod_title"] = titles["title"].str.replace("[^a-zA-Z0-9 ]", "", regex=True)
titles["mod_title"] = titles["mod_title"].str.lower()
titles["mod_title"] = titles["mod_title"].str.replace("\s+", " ", regex=True)
titles = titles[titles["mod_title"].str.len() > 0]
titles.to_json("books_titles.json")

In [35]:
titles

Unnamed: 0,book_id,title,ratings,url,image_url,mod_title
0,1333909,Good Harbor,10,https://www.goodreads.com/book/show/1333909.Go...,https://s.gr-assets.com/assets/nophoto/book/11...,good harbor
1,7327624,"The Unschooled Wizard (Sun Wolf and Starhawk, ...",140,https://www.goodreads.com/book/show/7327624-th...,https://images.gr-assets.com/books/1304100136m...,the unschooled wizard sun wolf and starhawk 12
2,6066819,Best Friends Forever,51184,https://www.goodreads.com/book/show/6066819-be...,https://s.gr-assets.com/assets/nophoto/book/11...,best friends forever
3,287140,Runic Astrology: Starcraft and Timekeeping in ...,15,https://www.goodreads.com/book/show/287140.Run...,https://images.gr-assets.com/books/1413219371m...,runic astrology starcraft and timekeeping in t...
4,287141,The Aeneid for Boys and Girls,46,https://www.goodreads.com/book/show/287141.The...,https://s.gr-assets.com/assets/nophoto/book/11...,the aeneid for boys and girls
...,...,...,...,...,...,...
1782574,3084038,"This Sceptred Isle, Vol. 10: The Age of Victor...",12,https://www.goodreads.com/book/show/3084038-th...,https://images.gr-assets.com/books/1494763458m...,this sceptred isle vol 10 the age of victoria ...
1782575,26168430,Sherlock Holmes and the July Crisis,6,https://www.goodreads.com/book/show/26168430-s...,https://images.gr-assets.com/books/1440592011m...,sherlock holmes and the july crisis
1782576,2342551,The Children's Classic Poetry Collection,36,https://www.goodreads.com/book/show/2342551.Th...,https://s.gr-assets.com/assets/nophoto/book/11...,the childrens classic poetry collection
1782577,22017381,"101 Nights: Volume One (101 Nights, #1-3)",70,https://www.goodreads.com/book/show/22017381-1...,https://images.gr-assets.com/books/1398621236m...,101 nights volume one 101 nights 13


In [36]:
# find term frequency
# create bag of words using TfidfVectorizer

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()

tfidf = vectorizer.fit_transform(titles["mod_title"])

In [37]:
# build single item search query

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import re

query = "fire upon the deep"
processed = re.sub("[^a-zA-Z0-9 ]", "", query.lower())
query_vec = vectorizer.transform([processed])
similarity = cosine_similarity(query_vec, tfidf).flatten()

In [38]:
similarity

array([0.        , 0.01391682, 0.        , ..., 0.02141545, 0.        ,
       0.00984393])

In [39]:
# numpy argpartition
indices = np.argpartition(similarity, -10)[-10:]
results = titles.iloc[indices]

In [40]:
results

Unnamed: 0,book_id,title,ratings,url,image_url,mod_title
1709903,8291178,"A Fire Upon the Deep (Zones of Thought, #1)",1435,https://www.goodreads.com/book/show/8291178-a-...,https://images.gr-assets.com/books/1328110731m...,a fire upon the deep zones of thought 1
1375428,77711,"A Fire Upon the Deep (Zones of Thought, #1)",38809,https://www.goodreads.com/book/show/77711.A_Fi...,https://images.gr-assets.com/books/1333915005m...,a fire upon the deep zones of thought 1
543695,28409581,"A Fire Upon the Deep (Zones of Thought, #1)",33,https://www.goodreads.com/book/show/28409581-a...,https://images.gr-assets.com/books/1451560537m...,a fire upon the deep zones of thought 1
71278,13551450,A Fire Upon the Deep,25,https://www.goodreads.com/book/show/13551450-a...,https://images.gr-assets.com/books/1332081952m...,a fire upon the deep
557833,17699614,A Fire Upon the Deep,48,https://www.goodreads.com/book/show/17699614-a...,https://images.gr-assets.com/books/1364507590m...,a fire upon the deep
1545944,91147,A Fire Upon The Deep,125,https://www.goodreads.com/book/show/91147.A_Fi...,https://images.gr-assets.com/books/1408940052m...,a fire upon the deep
1781816,9975763,A Fire Upon The Deep,194,https://www.goodreads.com/book/show/9975763-a-...,https://images.gr-assets.com/books/1316730548m...,a fire upon the deep
108444,6441120,A Fire Upon the Deep,27,https://www.goodreads.com/book/show/6441120-a-...,https://images.gr-assets.com/books/1304839097m...,a fire upon the deep
629521,940486,A Fire Upon the Deep,111,https://www.goodreads.com/book/show/940486.A_F...,https://images.gr-assets.com/books/1328110769m...,a fire upon the deep
928694,18900762,A Fire Upon the Deep,219,https://www.goodreads.com/book/show/18900762-a...,https://s.gr-assets.com/assets/nophoto/book/11...,a fire upon the deep


In [41]:
# sort and remove duplicates
results = results.sort_values("ratings", ascending=False)
results.head(5)

Unnamed: 0,book_id,title,ratings,url,image_url,mod_title
1375428,77711,"A Fire Upon the Deep (Zones of Thought, #1)",38809,https://www.goodreads.com/book/show/77711.A_Fi...,https://images.gr-assets.com/books/1333915005m...,a fire upon the deep zones of thought 1
1709903,8291178,"A Fire Upon the Deep (Zones of Thought, #1)",1435,https://www.goodreads.com/book/show/8291178-a-...,https://images.gr-assets.com/books/1328110731m...,a fire upon the deep zones of thought 1
928694,18900762,A Fire Upon the Deep,219,https://www.goodreads.com/book/show/18900762-a...,https://s.gr-assets.com/assets/nophoto/book/11...,a fire upon the deep
1781816,9975763,A Fire Upon The Deep,194,https://www.goodreads.com/book/show/9975763-a-...,https://images.gr-assets.com/books/1316730548m...,a fire upon the deep
1545944,91147,A Fire Upon The Deep,125,https://www.goodreads.com/book/show/91147.A_Fi...,https://images.gr-assets.com/books/1408940052m...,a fire upon the deep


In [42]:
# Now build a function for single item search

def search(query,vectorizer):
    processed = re.sub("[^a-zA-Z0-9 ]", "", query.lower())
    query_vec = vectorizer.transform([processed])
    similarity = cosine_similarity(query_vec, tfidf).flatten()
    indices = np.argpartition(similarity, -10)[-10:]
    results = titles.iloc[indices]
    results = results.sort_values("ratings", ascending=False)
    return results.head(5)

In [43]:
search("east of Eden", vectorizer)

Unnamed: 0,book_id,title,ratings,url,image_url,mod_title
236353,883438,East of Eden,2336,https://www.goodreads.com/book/show/883438.Eas...,https://images.gr-assets.com/books/1503315060m...,east of eden
1771254,276626,East of Eden,252,https://www.goodreads.com/book/show/276626.Eas...,https://images.gr-assets.com/books/1327385674m...,east of eden
427624,910727,East of Eden,119,https://www.goodreads.com/book/show/910727.Eas...,https://images.gr-assets.com/books/1299275232m...,east of eden
1363899,4415,East Of Eden,69,https://www.goodreads.com/book/show/4415.East_...,https://images.gr-assets.com/books/1274475012m...,east of eden
945242,20443029,East of Eden,41,https://www.goodreads.com/book/show/20443029-e...,https://s.gr-assets.com/assets/nophoto/book/11...,east of eden


In [44]:
# add clickable html links and images
def make_clickable(val):
    return '<a target="_blank" href="{}">Goodreads</a>'.format(val, val)

def show_image(val):
    return '<a href="{}"><img src="{}" width=50></img></a>'.format(val, val)

def search(query,vectorizer):
    processed = re.sub("[^a-zA-Z0-9 ]", "", query.lower())
    query_vec = vectorizer.transform([query])
    similarity = cosine_similarity(query_vec, tfidf).flatten()
    indices = np.argpartition(similarity, -10)[-10:]
    results = titles.iloc[indices]
    results = results.sort_values("ratings", ascending=False)
    
    return results.head(5).style.format({'url': make_clickable, 'cover_image': show_image})

In [45]:
search("Homegoing", vectorizer)

Unnamed: 0,book_id,title,ratings,url,image_url,mod_title
1720008,27071490,Homegoing,49315,Goodreads,https://images.gr-assets.com/books/1448108591m/27071490.jpg,homegoing
1305894,28683066,Homegoing,6645,Goodreads,https://s.gr-assets.com/assets/nophoto/book/111x148-bcc042a9c91a29c1d680899eff700a03.png,homegoing
1680449,30070018,Homegoing,685,Goodreads,https://s.gr-assets.com/assets/nophoto/book/111x148-bcc042a9c91a29c1d680899eff700a03.png,homegoing
367699,988505,Homegoing,301,Goodreads,https://images.gr-assets.com/books/1493938611m/988505.jpg,homegoing
1043208,30306735,Homegoing,195,Goodreads,https://s.gr-assets.com/assets/nophoto/book/111x148-bcc042a9c91a29c1d680899eff700a03.png,homegoing


In [46]:
# Now, build complete books search engine all in one place

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import re

def make_clickable(val):
    return '<a target="_blank" href="{}">Goodreads</a>'.format(val, val)

def show_image(val):
    return '<a href="{}"><img src="{}" width=50></img></a>'.format(val, val)

def search(query,vectorizer):
    processed = re.sub("[^a-zA-Z0-9 ]", "", query.lower())
    query_vec = vectorizer.transform([query])
    similarity = cosine_similarity(query_vec, tfidf).flatten()
    indices = np.argpartition(similarity, -10)[-10:]
    results = titles.iloc[indices]
    results = results.sort_values("ratings", ascending=False)
    
    return results.head(5).style.format({'url': make_clickable, 'cover_image': show_image})

### search for more boooks now to verify 

In [47]:
search("Pachinko", vectorizer)

Unnamed: 0,book_id,title,ratings,url,image_url,mod_title
1446134,29983711,Pachinko,8161,Goodreads,https://images.gr-assets.com/books/1462393298m/29983711.jpg,pachinko
1478531,32619967,Pachinko,1361,Goodreads,https://s.gr-assets.com/assets/nophoto/book/111x148-bcc042a9c91a29c1d680899eff700a03.png,pachinko
501019,684819,Dreaming Pachinko,283,Goodreads,https://s.gr-assets.com/assets/nophoto/book/111x148-bcc042a9c91a29c1d680899eff700a03.png,dreaming pachinko
241911,34051011,Pachinko,254,Goodreads,https://s.gr-assets.com/assets/nophoto/book/111x148-bcc042a9c91a29c1d680899eff700a03.png,pachinko
1370375,34335439,Pachinko,97,Goodreads,https://images.gr-assets.com/books/1501063789m/34335439.jpg,pachinko


In [48]:
search("Foundation", vectorizer)

Unnamed: 0,book_id,title,ratings,url,image_url,mod_title
289103,76680,"Foundation (Foundation, #1)",3469,Goodreads,https://s.gr-assets.com/assets/nophoto/book/111x148-bcc042a9c91a29c1d680899eff700a03.png,foundation foundation 1
1673947,414853,"Foundation (Foundation, #1)",604,Goodreads,https://images.gr-assets.com/books/1395929677m/414853.jpg,foundation foundation 1
774110,1706321,"Foundation (Foundation, #1)",154,Goodreads,https://images.gr-assets.com/books/1258332664m/1706321.jpg,foundation foundation 1
1222816,10287899,"Foundation (Foundation, #1)",143,Goodreads,https://images.gr-assets.com/books/1295683312m/10287899.jpg,foundation foundation 1
49197,31380440,"Foundation (Foundation, #1)",112,Goodreads,https://images.gr-assets.com/books/1470738316m/31380440.jpg,foundation foundation 1


In [49]:
search("Harry potter", vectorizer)

Unnamed: 0,book_id,title,ratings,url,image_url,mod_title
630902,86940,"هاري بوتر وحجر الفيلسوف (Harry Potter, #1)",1290,Goodreads,https://images.gr-assets.com/books/1327275224m/86940.jpg,harry potter 1
1422621,49869,"هاري بوتر وسجين أزكابان (Harry Potter, #3)",1023,Goodreads,https://images.gr-assets.com/books/1329651788m/49869.jpg,harry potter 3
1230739,70355,"هاري بوتر وجماعة العنقاء (Harry Potter, #5)",955,Goodreads,https://images.gr-assets.com/books/1351790790m/70355.jpg,harry potter 5
815063,8683527,"แฮร์รี่ พอตเตอร์กับศิลาอาถรรพ์ (Harry Potter, #1)",84,Goodreads,https://s.gr-assets.com/assets/nophoto/book/111x148-bcc042a9c91a29c1d680899eff700a03.png,harry potter 1
921387,22601967,"ჰარი პოტერი და ფენიქსის ორდენი (Harry Potter, #5)",33,Goodreads,https://images.gr-assets.com/books/1404053183m/22601967.jpg,harry potter 5


In [50]:
#  create a list of liked books
liked_books = ["8132407", "31147619", "29983711", "5996629", "7809996"]

In [51]:
liked_books

['8132407', '31147619', '29983711', '5996629', '7809996']