# Book Recommender System - Book Search Engine

#### Count the number of lines in the book metadata file

In [1]:
!wc -l goodreads_books.json.gz

7588375 goodreads_books.json.gz


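Note that `wc -l` on a `.gz` file counts newline bytes in the compressed data, not book records, so this figure is not the number of books. To count records, decompress first, for example with `zcat goodreads_books.json.gz | wc -l`.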
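To count records in the decompressed data from Python, one option is to stream the file and count its lines. This is a minimal sketch; `count_records` is a hypothetical helper name that does not appear in the original notebook.

```python
import gzip


def count_records(path):
    """Count newline-delimited records in a gzip file without loading it into memory."""
    with gzip.open(path, "rb") as f:
        # Iterating over the file object decompresses and yields one line at a time.
        return sum(1 for _ in f)


# Usage (assumes the file is in the working directory):
# count_records("goodreads_books.json.gz")
```

Because the file is read lazily, memory use stays flat regardless of the file size, at the cost of one full pass over the data.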

#### Show the file size

In [2]:
!ls -lh | grep goodreads_books.json.gz

-rwxrwxrwx 1 root   root  2.0G Dec 26 23:55 goodreads_books.json.gz


This is a fairly large file (2.0 GB compressed), so instead of reading it all at once we will stream it line by line.

In [2]:
import gzip
import json
import numpy as np
import pandas as pd
import re

In [3]:
with gzip.open("goodreads_books.json.gz") as f:
    line = f.readline()

In [4]:
line

b'{"isbn": "0312853122", "text_reviews_count": "1", "series": [], "country_code": "US", "language_code": "", "popular_shelves": [{"count": "3", "name": "to-read"}, {"count": "1", "name": "p"}, {"count": "1", "name": "collection"}, {"count": "1", "name": "w-c-fields"}, {"count": "1", "name": "biography"}], "asin": "", "is_ebook": "false", "average_rating": "4.00", "kindle_asin": "", "similar_books": [], "description": "", "format": "Paperback", "link": "https://www.goodreads.com/book/show/5333265-w-c-fields", "authors": [{"author_id": "604031", "role": ""}], "publisher": "St. Martin\'s Press", "num_pages": "256", "publication_day": "1", "isbn13": "9780312853129", "publication_month": "9", "edition_information": "", "publication_year": "1984", "url": "https://www.goodreads.com/book/show/5333265-w-c-fields", "image_url": "https://images.gr-assets.com/books/1310220028m/5333265.jpg", "book_id": "5333265", "ratings_count": "3", "work_id": "5400751", "title": "W.C. Fields: A Life on Film", "t

This is a single line of the file: one JSON object, returned as bytes, describing one book.

In [5]:
data = json.loads(line)
data

{'isbn': '0312853122',
 'text_reviews_count': '1',
 'series': [],
 'country_code': 'US',
 'language_code': '',
 'popular_shelves': [{'count': '3', 'name': 'to-read'},
  {'count': '1', 'name': 'p'},
  {'count': '1', 'name': 'collection'},
  {'count': '1', 'name': 'w-c-fields'},
  {'count': '1', 'name': 'biography'}],
 'asin': '',
 'is_ebook': 'false',
 'average_rating': '4.00',
 'kindle_asin': '',
 'similar_books': [],
 'description': '',
 'format': 'Paperback',
 'link': 'https://www.goodreads.com/book/show/5333265-w-c-fields',
 'authors': [{'author_id': '604031', 'role': ''}],
 'publisher': "St. Martin's Press",
 'num_pages': '256',
 'publication_day': '1',
 'isbn13': '9780312853129',
 'publication_month': '9',
 'edition_information': '',
 'publication_year': '1984',
 'url': 'https://www.goodreads.com/book/show/5333265-w-c-fields',
 'image_url': 'https://images.gr-assets.com/books/1310220028m/5333265.jpg',
 'book_id': '5333265',
 'ratings_count': '3',
 'work_id': '5400751',
 'title': '

## Parsing the Book Data

In [7]:
def parse_fields(line):
    data = json.loads(line)
    return {
        "book_id": data["book_id"], 
        "title": data["title_without_series"], 
        "ratings": data["ratings_count"], 
        "url": data["url"], 
        "cover_image": data["image_url"]
    }

Here we only load the fields that we're going to work on:
- Book ID
- Book title without series
- Book ratings count
- Book URL
- Book cover image

In [8]:
book_titles = []

with gzip.open("goodreads_books.json.gz") as f:
    # Iterating over the file object streams it one line at a time.
    for line in f:
        fields = parse_fields(line)
        try:
            ratings = int(fields["ratings"])
        except ValueError:
            # Skip records whose ratings count is not a valid integer.
            continue
        if ratings > 10:
            book_titles.append(fields)

We only keep books that have been rated more than 10 times. Many records describe books with very few ratings; those are unlikely to be recommended, so they are not useful for this project.
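The filtering step can also be written as a generator, which makes it easy to try on a few in-memory lines. This is a sketch under assumptions: `iter_popular_books` and `min_ratings` are hypothetical names, and the sample records are made up for illustration (including the empty `ratings_count`, which is one way the `ValueError` branch could be triggered).

```python
import json


def iter_popular_books(lines, min_ratings=10):
    """Yield records whose ratings_count parses as an integer above min_ratings."""
    for line in lines:
        data = json.loads(line)
        try:
            ratings = int(data["ratings_count"])
        except ValueError:
            # Non-numeric counts (for example an empty string) are skipped.
            continue
        if ratings > min_ratings:
            yield {"book_id": data["book_id"], "ratings": ratings}


# Made-up sample lines in the same bytes-per-line shape as the real file.
sample = [
    b'{"book_id": "1", "ratings_count": "3"}',
    b'{"book_id": "2", "ratings_count": ""}',
    b'{"book_id": "3", "ratings_count": "140"}',
]
popular = list(iter_popular_books(sample))
print(popular)
```

Only the third sample record passes: the first has too few ratings and the second fails the integer conversion.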

## Data Cleaning

In [9]:
titles = pd.DataFrame.from_dict(book_titles)

In [10]:
titles["ratings"] = pd.to_numeric(titles["ratings"])

In [11]:
def clean_title(title):
    title = re.sub("[^a-zA-Z0-9 ]", "", title.lower())
    return title

In [12]:
titles["title_clean"] = titles["title"].apply(clean_title)

In [13]:
titles["title_clean"] = titles["title_clean"].str.replace(r"\s+", " ", regex=True)

In [14]:
titles = titles[titles["title_clean"].str.len() > 0]

We clean the book titles by:
- Converting all letters to lowercase.
- Using a regex to remove every character that is not a letter, a digit, or a space.
- Collapsing runs of consecutive whitespace into a single space.
- Dropping any title that is empty after cleaning, by checking its length.
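To see the effect of these steps on individual strings, they can be combined into one plain-Python function. This is a small sketch; `normalize_title` is a hypothetical name, and the second and third inputs are made up for illustration.

```python
import re


def normalize_title(title):
    """Lowercase, strip non-alphanumeric characters, and collapse repeated spaces."""
    title = re.sub("[^a-zA-Z0-9 ]", "", title.lower())
    return re.sub(r"\s+", " ", title)


examples = [
    "The Unschooled Wizard (Sun Wolf and Starhawk, #1-2)",
    "W.C. Fields:  A Life   on Film",
    "???",
]
cleaned = [normalize_title(t) for t in examples]
# Titles that end up empty are dropped, mirroring the length check above.
kept = [t for t in cleaned if len(t) > 0]
print(kept)
```

The last example contains no letters or digits, so it becomes an empty string and is filtered out.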

In [15]:
titles.head()

| | book_id | title | ratings | url | cover_image | title_clean |
|---|---|---|---|---|---|---|
| 0 | 7327624 | The Unschooled Wizard (Sun Wolf and Starhawk, ... | 140 | https://www.goodreads.com/book/show/7327624-th... | https://images.gr-assets.com/books/1304100136m... | the unschooled wizard sun wolf and starhawk 12 |
| 1 | 6066819 | Best Friends Forever | 51184 | https://www.goodreads.com/book/show/6066819-be... | https://s.gr-assets.com/assets/nophoto/book/11... | best friends forever |
| 2 | 287140 | Runic Astrology: Starcraft and Timekeeping in ... | 15 | https://www.goodreads.com/book/show/287140.Run... | https://images.gr-assets.com/books/1413219371m... | runic astrology starcraft and timekeeping in t... |
| 3 | 287141 | The Aeneid for Boys and Girls | 46 | https://www.goodreads.com/book/show/287141.The... | https://s.gr-assets.com/assets/nophoto/book/11... | the aeneid for boys and girls |
| 4 | 378460 | The Wanting of Levine | 12 | https://www.goodreads.com/book/show/378460.The... | https://s.gr-assets.com/assets/nophoto/book/11... | the wanting of levine |


In [16]:
#titles.to_json("book_titles.json")

Optionally, save the result to a JSON file for future use (the call is commented out above).

## Creating a Search Function

### Creating the TF-IDF Matrix (Term Frequency - Inverse Document Frequency)

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [18]:
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(titles["title_clean"])

In [19]:
from sklearn.metrics.pairwise import cosine_similarity

In [20]:
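def make_clickable(val):
    # Render the book URL as a link that opens in a new tab.
    return '<a target="_blank" href="{}">Goodreads</a>'.format(val)

def show_image(val):
    # Render the cover image URL as a small inline thumbnail.
    return '<img src="{}" width=50></img>'.format(val)

def search(query):
    # Apply the same character cleaning that was used for the titles.
    query = clean_title(query)
    query_vec = vectorizer.transform([query])
    # Cosine similarity between the query and every title.
    similarity = cosine_similarity(query_vec, tfidf).flatten()
    # Indices of the 10 most similar titles (in no particular order).
    indices = np.argpartition(similarity, -10)[-10:]
    results = titles.iloc[indices]
    # Of those 10 candidates, show the 5 with the most ratings.
    results = results.sort_values("ratings", ascending=False)

    return results.head(5).style.format({'url': make_clickable, 'cover_image': show_image})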

In [21]:
search("the kite runner")

| | book_id | title | ratings | url | cover_image | title_clean |
|---|---|---|---|---|---|---|
| 150998 | 2975 | The Kite Runner | 5163 | Goodreads | | the kite runner |
| 1109543 | 18996134 | The Kite Runner | 3580 | Goodreads | | the kite runner |
| 1436229 | 819495 | The Kite Runner | 1469 | Goodreads | | the kite runner |
| 360489 | 77204 | The Kite Runner | 1453 | Goodreads | | the kite runner |
| 803758 | 457061 | The Kite Runner | 803 | Goodreads | | the kite runner |
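To make the ranking less of a black box, the sketch below hand-rolls TF-IDF weighting and cosine similarity in plain Python on a three-title toy corpus. It is an illustration, not the notebook's method: the function names and the corpus are made up, tokenization is a simple `split()`, and the weighting follows the smoothed IDF formula with L2 normalization that scikit-learn documents for `TfidfVectorizer` defaults, so scores may differ from the real vectorizer where tokenization differs.

```python
import math
from collections import Counter


def weigh(tokens, idf):
    """Turn a token list into an L2-normalised {term: tf * idf} vector."""
    counts = Counter(t for t in tokens if t in idf)  # unknown terms are ignored
    raw = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()} if norm else {}


def tfidf_vectors(docs):
    """Return (vectors, idf) for a list of already-cleaned documents."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    # Smoothed IDF: rarer terms get larger weights.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    return [weigh(tokens, idf) for tokens in tokenised], idf


def cosine(a, b):
    """Cosine similarity of two L2-normalised sparse vectors (a dot product)."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


docs = ["the kite runner", "the maze runner", "kite flying for beginners"]
vectors, idf = tfidf_vectors(docs)
query = weigh("the kite runner".split(), idf)
scores = [cosine(query, v) for v in vectors]
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best], [round(s, 3) for s in scores])
```

The exact-match title scores highest, the title sharing two words comes next, and the title sharing one word comes last. The `search` function above applies the same idea across all titles, then re-sorts the closest matches by ratings count.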


### Building an Interactive Search Box

In [22]:
import ipywidgets as widgets
from IPython.display import display

In [23]:
book_input = widgets.Text(
    value='',
    description='Book Title:',
    disabled=False
)
book_list = widgets.Output()

def on_type(data):
    with book_list:
        book_list.clear_output()
        title = data["new"]
        if len(title) >= 1:
            display(search(title))

book_input.observe(on_type, names='value')

display(book_input, book_list)

Text(value='', description='Book Title:')

Output()