In [1]:
# Libraries
import ujson
import pandas as pd

---
### Exploratory Data Analysis

**File containing book data: goodreads_books.json**

In [2]:
!wc -l goodreads_books.json

 2360655 goodreads_books.json


In [3]:
!ls -lh | grep goodreads_books.json

-rw-r--r--@ 1 ramiroromero  staff   8.6G Aug 27  2022 goodreads_books.json


In [4]:
with open("goodreads_books.json", 'r') as f:
    line = f.readline()
ujson.loads(line)

{'isbn': '0312853122',
 'text_reviews_count': '1',
 'series': [],
 'country_code': 'US',
 'language_code': '',
 'popular_shelves': [{'count': '3', 'name': 'to-read'},
  {'count': '1', 'name': 'p'},
  {'count': '1', 'name': 'collection'},
  {'count': '1', 'name': 'w-c-fields'},
  {'count': '1', 'name': 'biography'}],
 'asin': '',
 'is_ebook': 'false',
 'average_rating': '4.00',
 'kindle_asin': '',
 'similar_books': [],
 'description': '',
 'format': 'Paperback',
 'link': 'https://www.goodreads.com/book/show/5333265-w-c-fields',
 'authors': [{'author_id': '604031', 'role': ''}],
 'publisher': "St. Martin's Press",
 'num_pages': '256',
 'publication_day': '1',
 'isbn13': '9780312853129',
 'publication_month': '9',
 'edition_information': '',
 'publication_year': '1984',
 'url': 'https://www.goodreads.com/book/show/5333265-w-c-fields',
 'image_url': 'https://images.gr-assets.com/books/1310220028m/5333265.jpg',
 'book_id': '5333265',
 'ratings_count': '3',
 'work_id': '5400751',
 'title': '

The output above is a single line of the main JSON file; each line holds one book record. The dataset is large (over 2.36 million records, 29 variables, and 8.6 GB on disk), so loading it all at once is impractical. Throughout this project we will instead stream the file line by line, keeping only the fields and records we need in memory.
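As a minimal sketch of the streaming idea, assuming one JSON object per line: the generator below yields one parsed record at a time, so only the current line is held in memory. The name `stream_records` is hypothetical, the standard-library `json` module stands in for `ujson`, and an in-memory `io.StringIO` stands in for the real file.

```python
import io
import json


def stream_records(lines):
    """Yield one parsed JSON object per line, skipping blank or malformed lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            # A malformed line is skipped rather than aborting the whole pass
            continue


# Tiny stand-in for the real file: two valid records and one broken line
sample = io.StringIO(
    '{"book_id": "1", "ratings_count": "3"}\n'
    'not valid json\n'
    '{"book_id": "2", "ratings_count": "25"}\n'
)

# Only the fields we care about are kept; the raw lines are discarded as we go
book_ids = [record["book_id"] for record in stream_records(sample)]
print(book_ids)
```

Because a file object is itself an iterator over lines, the same generator could be passed an open file handle instead of the `StringIO` sample.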

Author metadata:
- Contains author IDs and their roles, but not their names
- We have a secondary dataset which we will use to map the main file's author IDs to their respective author names

**File containing author data: goodreads_book_authors.json**

In [5]:
with open("goodreads_book_authors.json", 'r') as f:
    line = f.readline()
ujson.loads(line)

{'average_rating': '3.98',
 'author_id': '604031',
 'text_reviews_count': '7',
 'name': 'Ronald J. Fields',
 'ratings_count': '49'}

For our search engine, we only need the author name. We will use the author IDs in the main file to look up the author names in the secondary file.
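A minimal sketch of that lookup, using the two sample records shown above: build a dictionary from author ID to name, then resolve a book's `authors` list against it. The helper name `author_names` and the variable names are hypothetical.

```python
# Lookup table built from the secondary file (here, only the single record shown above)
author_records = [{"author_id": "604031", "name": "Ronald J. Fields"}]
id_to_name = {record["author_id"]: record["name"] for record in author_records}


def author_names(book_authors, lookup):
    """Return the names for a book's author entries, skipping unknown IDs."""
    return [
        lookup[entry["author_id"]]
        for entry in book_authors
        if entry["author_id"] in lookup
    ]


# The `authors` field of the first book in the main file
book_authors = [{"author_id": "604031", "role": ""}]
names = author_names(book_authors, id_to_name)
print(names)  # ['Ronald J. Fields']
```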

---

### Creating Datasets

In [6]:
#._ Create a dataframe for the book titles

## Helper function
def parse_fields(line):
    '''
    Takes a single line and only returns the fields specified.
    '''
    try: 
        data = ujson.loads(line)
    except ujson.JSONDecodeError:
        return None
    
    return {
        "book_id": data.get("book_id"),
        "title": data.get("title_without_series"),
        "author": data.get("authors"),
        "ratings": data.get("ratings_count"),
        "url": data.get("url"),
        "cover_image": data.get("image_url")
    }

## Run through the lines of the JSON file and parse the relevant fields
books_titles = []
with open("goodreads_books.json", 'r') as f:
    for line in f:
        fields = parse_fields(line)

        # Skip lines that could not be parsed
        if fields is None:
            continue

        # Skip records whose ratings count is missing or not an integer
        try:
            ratings = int(fields["ratings"])
        except (ValueError, TypeError):
            continue

        # Only include books with more than 10 ratings
        if ratings > 10:
            books_titles.append(fields)

## Coerce the parsed fields into a dataframe
titles = pd.DataFrame.from_dict(books_titles)

titles.head(5)

Unnamed: 0,book_id,title,author,ratings,url,cover_image
0,7327624,"The Unschooled Wizard (Sun Wolf and Starhawk, ...","[{'author_id': '10333', 'role': ''}]",140,https://www.goodreads.com/book/show/7327624-th...,https://images.gr-assets.com/books/1304100136m...
1,6066819,Best Friends Forever,"[{'author_id': '9212', 'role': ''}]",51184,https://www.goodreads.com/book/show/6066819-be...,https://s.gr-assets.com/assets/nophoto/book/11...
2,287140,Runic Astrology: Starcraft and Timekeeping in ...,"[{'author_id': '149918', 'role': ''}]",15,https://www.goodreads.com/book/show/287140.Run...,https://images.gr-assets.com/books/1413219371m...
3,287141,The Aeneid for Boys and Girls,"[{'author_id': '3041852', 'role': ''}]",46,https://www.goodreads.com/book/show/287141.The...,https://s.gr-assets.com/assets/nophoto/book/11...
4,378460,The Wanting of Levine,"[{'author_id': '215594', 'role': ''}]",12,https://www.goodreads.com/book/show/378460.The...,https://s.gr-assets.com/assets/nophoto/book/11...


In [7]:
#._ Find the author names for each respective book

## Create a dictionary containing author IDs and names
authors = {}
with open("goodreads_book_authors.json", 'r') as f:
    for line in f:
        try:
            data = ujson.loads(line)
        except ujson.JSONDecodeError:
            continue
        
        authors[data.get("author_id")] = data.get("name")

## Collect the author IDs for the books in the titles dataframe
def extract_author_ids(dictionary_list):
    return [d.get("author_id") for d in dictionary_list]

titles_authors_ids = titles["author"].apply(extract_author_ids)

## Collect the author names for the books in the titles dataframe
def extract_author_names(author_ids):
    # Skip IDs with no (or an empty) name in the lookup instead of raising a KeyError
    return [authors[i] for i in author_ids if authors.get(i)]

titles_authors_names = titles_authors_ids.apply(extract_author_names)

## Assign author names to the "author" variable in the titles dataframe
titles["author"] = titles_authors_names
titles.head(5)

Unnamed: 0,book_id,title,author,ratings,url,cover_image
0,7327624,"The Unschooled Wizard (Sun Wolf and Starhawk, ...",[Barbara Hambly],140,https://www.goodreads.com/book/show/7327624-th...,https://images.gr-assets.com/books/1304100136m...
1,6066819,Best Friends Forever,[Jennifer Weiner],51184,https://www.goodreads.com/book/show/6066819-be...,https://s.gr-assets.com/assets/nophoto/book/11...
2,287140,Runic Astrology: Starcraft and Timekeeping in ...,[Nigel Pennick],15,https://www.goodreads.com/book/show/287140.Run...,https://images.gr-assets.com/books/1413219371m...
3,287141,The Aeneid for Boys and Girls,[Alfred J. Church],46,https://www.goodreads.com/book/show/287141.The...,https://s.gr-assets.com/assets/nophoto/book/11...
4,378460,The Wanting of Levine,[Michael Halberstam],12,https://www.goodreads.com/book/show/378460.The...,https://s.gr-assets.com/assets/nophoto/book/11...


---
### Data Sanitization

**Coercion**

In [8]:
# Coerce ratings into integers
titles["ratings"] = pd.to_numeric(titles["ratings"])

# Coerce author names into a single string
titles["author"] = titles['author'].apply(lambda x: ' '.join(x))

**Regex**

In [9]:
## Simplify title strings

# only include alpha-numeric characters
titles["mod_title"] = titles["title"].str.replace("[^a-zA-Z0-9 ]","",regex = True)

# replace any run of whitespace with a single space (raw string avoids an invalid-escape warning)
titles["mod_title"] = titles["mod_title"].str.replace(r"\s+", " ", regex = True)

# coerce all characters into lowercase characters
titles["mod_title"] = titles["mod_title"].str.lower()

# remove any books with empty titles (copy so later column assignments do not trigger SettingWithCopyWarning)
titles = titles[titles["mod_title"].str.len() > 0].copy()


## Simplify author strings

# only include alpha-numeric characters
titles["mod_author"] = titles["author"].str.replace("[^a-zA-Z0-9 ]","",regex = True)

# replace any run of whitespace with a single space
titles["mod_author"] = titles["mod_author"].str.replace(r"\s+", " ", regex = True)

# coerce all characters into lowercase characters
titles["mod_author"] = titles["mod_author"].str.lower()
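To make the cleaning steps concrete, here is a sketch that applies the same three operations to a single string with the standard `re` module. The function name `normalize` is hypothetical and not part of the notebook; the inputs are taken from the sample rows shown earlier.

```python
import re


def normalize(text):
    """Strip non-alphanumeric characters, collapse whitespace, and lowercase."""
    text = re.sub(r"[^a-zA-Z0-9 ]", "", text)  # keep letters, digits and spaces only
    text = re.sub(r"\s+", " ", text)           # collapse runs of whitespace
    return text.lower()


print(normalize("Alfred J. Church"))               # alfred j church
print(normalize("The Aeneid for Boys and Girls"))  # the aeneid for boys and girls
```

Applying the same normalization to user queries keeps them comparable with the cleaned titles.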



In [10]:
## Concatenate title and author
titles["mod_title"] = titles["mod_title"] + ' ' + titles["mod_author"]

**Save clean dataset to working directory**

In [11]:
titles.to_json("books_titles.json")
titles.head(5)

Unnamed: 0,book_id,title,author,ratings,url,cover_image,mod_title,mod_author
0,7327624,"The Unschooled Wizard (Sun Wolf and Starhawk, ...",Barbara Hambly,140,https://www.goodreads.com/book/show/7327624-th...,https://images.gr-assets.com/books/1304100136m...,the unschooled wizard sun wolf and starhawk 12...,barbara hambly
1,6066819,Best Friends Forever,Jennifer Weiner,51184,https://www.goodreads.com/book/show/6066819-be...,https://s.gr-assets.com/assets/nophoto/book/11...,best friends forever jennifer weiner,jennifer weiner
2,287140,Runic Astrology: Starcraft and Timekeeping in ...,Nigel Pennick,15,https://www.goodreads.com/book/show/287140.Run...,https://images.gr-assets.com/books/1413219371m...,runic astrology starcraft and timekeeping in t...,nigel pennick
3,287141,The Aeneid for Boys and Girls,Alfred J. Church,46,https://www.goodreads.com/book/show/287141.The...,https://s.gr-assets.com/assets/nophoto/book/11...,the aeneid for boys and girls alfred j church,alfred j church
4,378460,The Wanting of Levine,Michael Halberstam,12,https://www.goodreads.com/book/show/378460.The...,https://s.gr-assets.com/assets/nophoto/book/11...,the wanting of levine michael halberstam,michael halberstam


---
## TF-IDF Matrix

We will construct a term frequency-inverse document frequency (TF-IDF) matrix: a numerical representation of our book titles in which each title (combined with its author names) is a vector in a high-dimensional space.

This matrix is the basis for the text-based similarity calculations in our search engine. Titles with similar TF-IDF vectors are considered more similar in content. Our engine will use cosine similarity to rank documents by their relevance to a user's query.
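To show what these vectors look like, the following is a hand-rolled sketch in plain Python, not scikit-learn's implementation. It assumes whitespace tokenization (`TfidfVectorizer`'s default tokenizer additionally drops single-character tokens) and a smoothed IDF of `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is intended to mirror scikit-learn's default settings. The function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalized TF-IDF vectors) for whitespace-tokenized docs."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency (assumed formula, see the note above)
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vec = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vocab, vectors


def cosine(u, v):
    """Cosine similarity; the vectors are unit length, so this is a dot product."""
    return sum(a * b for a, b in zip(u, v))


docs = [
    "best friends forever jennifer weiner",
    "the aeneid for boys and girls alfred j church",
    "the wanting of levine michael halberstam",
]
vocab, vectors = tfidf_vectors(docs)
print(cosine(vectors[0], vectors[1]))  # no shared terms, so the similarity is zero
print(cosine(vectors[1], vectors[2]))  # the shared term "the" gives a small positive value
```

Documents with no terms in common are orthogonal, while any shared term contributes a positive amount that shrinks as the term becomes more common across the collection.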

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()

tfidf = vectorizer.fit_transform(titles["mod_title"])

In [13]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import re

def make_clickable(val):
    return '<a target="_blank" href="{}">Goodreads</a>'.format(val)

def show_cover(val):
    return '<img src="{}" width = 50></img>'.format(val)

def search(query, vectorizer = vectorizer):
    processed = re.sub("[^a-zA-Z0-9 ]", "", query.lower())
    query_vec = vectorizer.transform([processed])
    similarity = cosine_similarity(query_vec, tfidf).flatten()
    indices = np.argpartition(similarity, -10)[-10:]
    results = titles.iloc[indices].copy() # make a copy to avoid SettingWithCopyWarning
    
    # Determine weights for sorting criteria
    rating_weight = 0.01
    title_similarity_weight = 0.99
    
    # Center and standardize factors for sorting criteria (skip the division when the spread is zero)
    ratings_centered = results["ratings"] - results["ratings"].mean()
    ratings_std = results["ratings"].std()
    ratings_standardized = ratings_centered / ratings_std if ratings_std > 0 else ratings_centered

    top_similarity = similarity[indices]
    similarity_centered = top_similarity - np.mean(top_similarity)
    similarity_std = np.std(top_similarity)
    similarity_standardized = similarity_centered / similarity_std if similarity_std > 0 else similarity_centered

    # Calculate weighted sum for sorting, using the standardized value of both factors
    results.loc[:, "weighted_sum"] = (
        (rating_weight * ratings_standardized) + 
        (title_similarity_weight * similarity_standardized)
    )
    
    results = results.sort_values("weighted_sum", ascending = False)
    results = results.drop(columns = ["weighted_sum"])
    return results.head(20).style.format({'url': make_clickable, 'cover_image':show_cover})
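The idea behind the ranking step can be illustrated in isolation. The sketch below uses hypothetical candidate scores and the standard `statistics` module in place of NumPy and pandas; it standardizes both factors and combines them with the same 0.01 / 0.99 weights. The function names are assumptions made for illustration, and the population standard deviation is used for both factors as a simplification.

```python
from statistics import mean, pstdev


def standardize(values):
    """Center values and divide by the population std; only center if the spread is zero."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma > 0 else v - mu for v in values]


def rank(candidates, rating_weight=0.01, similarity_weight=0.99):
    """Sort (title, ratings, similarity) tuples by a weighted sum of standardized factors."""
    ratings = standardize([c[1] for c in candidates])
    sims = standardize([c[2] for c in candidates])
    scored = [
        (rating_weight * r + similarity_weight * s, c[0])
        for r, s, c in zip(ratings, sims, candidates)
    ]
    return [title for _, title in sorted(scored, reverse=True)]


# Hypothetical candidates: (title, ratings count, cosine similarity to the query)
candidates = [
    ("edition a", 26, 1.0),
    ("edition b", 2986, 1.0),
    ("other book", 50000, 0.4),
]
print(rank(candidates))  # ['edition b', 'edition a', 'other book']
```

With these weights, similarity dominates the ordering and the ratings count mostly acts as a tie-breaker between equally similar editions of the same book.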

## Searching for our liked books

Now let's make a CSV file containing books that we liked.

We will use the search function to look up the book IDs of books we liked, then create a dictionary with those book IDs as keys and our ratings of those books (on a scale from 1 to 5) as values.

In [14]:
search("their eyes are watching god")

Unnamed: 0,book_id,title,author,ratings,url,cover_image,mod_author
105152,170436,Their Eyes Were Watching God,Zora Neale Hurston,2986,Goodreads,,zora neale hurston
274250,6522048,Their Eyes Were Watching God,Zora Neale Hurston,1819,Goodreads,,zora neale hurston
1243695,885843,Their Eyes Were Watching God,Zora Neale Hurston,1301,Goodreads,,zora neale hurston
169524,7011207,Their Eyes Were Watching God,Zora Neale Hurston,299,Goodreads,,zora neale hurston
1243693,885847,Their Eyes Were Watching God,Zora Neale Hurston,273,Goodreads,,zora neale hurston
349714,51261,Their Eyes Were Watching God,Zora Neale Hurston,254,Goodreads,,zora neale hurston
502844,904282,Their Eyes Were Watching God,Zora Neale Hurston,181,Goodreads,,zora neale hurston
443225,29432264,Their Eyes Were Watching God,Zora Neale Hurston,159,Goodreads,,zora neale hurston
623361,19570107,Their Eyes Were Watching God,Zora Neale Hurston,47,Goodreads,,zora neale hurston
160857,1162432,Their Eyes Were Watching God,Zora Neale Hurston,26,Goodreads,,zora neale hurston


In [16]:
liked_books = {"25899336":5,"18373":5,"108713":5, "6856680":5, "35239798":5,
              "4030991":5, "596686":4, "133518":5, "357636":4, "75855":3, 
               "11987":4, "8238259":4, "275612":5}

df = titles[titles["book_id"].isin(liked_books)][["book_id", "title", "mod_title"]]
df["rating"] = df["book_id"].map(liked_books)
df["user_id"] = "-1"

df.to_csv("liked_books.csv")