# GoodReads Recommendations Data Preparation

## Table of Contents

1. Background
2. Business Understanding
3. Data Understanding
4. Data Preparation
    - 4.a. Book Metadata
    - 4.b. User Review Data
    - 4.c. User Reviews by Book


In [29]:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from nltk.corpus import stopwords
from nltk.tokenize import regexp_tokenize, RegexpTokenizer
from nltk.stem import WordNetLemmatizer

## 1. Background

## 2. Business Understanding

## 3. Data Understanding

## 4. Data Preparation

Note that the datasets are quite large, so the loading cells take some time to run. I've noted the approximate run time in a comment on each of those cells.

### 4.a. Book Metadata
I need the book metadata for two reasons. First, my collaborative filtering model returns recommendations to users, and I'd like to show them information about each recommended book. Second, content-based filtering finds similar books by comparing their features. Metadata such as the number of pages and the book's description will act as those features.

In [7]:
# This took about 2 mins
cols = ['isbn', 'average_rating', 'similar_books', 'description', 'link', 'authors',
        'num_pages', 'book_id', 'ratings_count', 'title', 'publication_year']
json_reader = pd.read_csv('data/meta_gr.csv', chunksize=10000)

In [8]:
list_df = []
for chunk in json_reader:
    list_df.append(chunk[cols])

In [9]:
metadata = pd.concat(list_df)

In [10]:
metadata.head()

Unnamed: 0,isbn,average_rating,similar_books,description,link,authors,num_pages,book_id,ratings_count,title,publication_year
0,312853122.0,4.0,[],,https://www.goodreads.com/book/show/5333265-w-...,"[{'author_id': '604031', 'role': ''}]",256.0,5333265,3.0,W.C. Fields: A Life on Film,1984.0
1,743509986.0,3.23,"['8709549', '17074050', '28937', '158816', '22...","Anita Diamant's international bestseller ""The ...",https://www.goodreads.com/book/show/1333909.Go...,"[{'author_id': '626222', 'role': ''}]",,1333909,10.0,Good Harbor,2001.0
2,,4.03,"['19997', '828466', '1569323', '425389', '1176...",Omnibus book club edition containing the Ladie...,https://www.goodreads.com/book/show/7327624-th...,"[{'author_id': '10333', 'role': ''}]",600.0,7327624,140.0,"The Unschooled Wizard (Sun Wolf and Starhawk, ...",1987.0
3,743294297.0,3.49,"['6604176', '6054190', '2285777', '82641', '75...",Addie Downs and Valerie Adler were eight when ...,https://www.goodreads.com/book/show/6066819-be...,"[{'author_id': '9212', 'role': ''}]",368.0,6066819,51184.0,Best Friends Forever,2009.0
4,850308712.0,3.4,[],,https://www.goodreads.com/book/show/287140.Run...,"[{'author_id': '149918', 'role': ''}]",,287140,15.0,Runic Astrology: Starcraft and Timekeeping in ...,


Column-by-column review:
- I will keep 'isbn' and 'link' because they can help users find the exact book that was recommended to them
- I will keep 'average_rating', 'num_pages', 'ratings_count', 'title', and 'publication_year' to act as features in my content-based recommendation model and as descriptors for returned recommendations
- I will keep 'description' to act as a feature in my content-based recommendation model and as a descriptor for my returned recommendations. It will be NLP processed in the content-based notebook (not here) before entering the content-based model
- I will keep 'similar_books' as a check for my model's recommendations. The user can see which books the dataset already lists as similar to my model's recommendations
- I will keep 'authors' as a descriptor for my returned recommendations, but I will need to merge in the authors' names using a **separate authors dataset** below
- The 'book_id' column is crucial for connecting this dataset to others

Notably missing here is genre information from the books. I'll merge that in using a **separate genres dataset** below
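Merging in author names first requires pulling the author IDs out of the 'authors' column. Below is a minimal sketch of that step, assuming the column comes back from the CSV as a string that looks like a Python list of dicts (as in the `head()` output above). The `author_names` lookup and the helper names are hypothetical stand-ins; the real names would come from the separate authors dataset.

```python
import ast

# One cell from the 'authors' column, assumed to be a string after reading the CSV
authors_cell = "[{'author_id': '604031', 'role': ''}]"

# Hypothetical lookup table; real names would come from the separate authors dataset
author_names = {"604031": "Example Author"}


def author_ids(cell):
    """Parse a stringified list of author dicts and return the author_id values."""
    return [entry["author_id"] for entry in ast.literal_eval(cell)]


def author_name_list(cell, lookup):
    """Map each author_id in the cell to a name, falling back to 'unknown'."""
    return [lookup.get(author_id, "unknown") for author_id in author_ids(cell)]


print(author_ids(authors_cell))
print(author_name_list(authors_cell, author_names))
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for parsing cells like this.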

In [None]:
# Loading in genre data

### 4.b. User Review Data

The goal here is a dataset with one row per user per book, containing the rating, user ID, book ID, and review text.

In [11]:
# This took 6 mins to load in
reviews = pd.read_json('data/goodreads_reviews_spoiler.json.gz', compression='gzip', lines=True)


In [12]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1378033 entries, 0 to 1378032
Data columns (total 7 columns):
 #   Column            Non-Null Count    Dtype         
---  ------            --------------    -----         
 0   user_id           1378033 non-null  object        
 1   timestamp         1378033 non-null  datetime64[ns]
 2   review_sentences  1378033 non-null  object        
 3   rating            1378033 non-null  int64         
 4   has_spoiler       1378033 non-null  bool          
 5   book_id           1378033 non-null  int64         
 6   review_id         1378033 non-null  object        
dtypes: bool(1), datetime64[ns](1), int64(2), object(3)
memory usage: 64.4+ MB


In [14]:
# I only need user_id, review_sentences, rating, and book_id 
reviews_subset = reviews[['user_id', 'rating', 'book_id', 'review_sentences']]
reviews_subset.head()

Unnamed: 0,user_id,rating,book_id,review_sentences
0,8842281e1d1347389f2ab93d60773d4d,5,18245960,"[[0, This is a special book.], [0, It started ..."
1,8842281e1d1347389f2ab93d60773d4d,3,16981,"[[0, Recommended by Don Katz.], [0, Avail for ..."
2,8842281e1d1347389f2ab93d60773d4d,3,28684704,"[[0, A fun, fast paced science fiction thrille..."
3,8842281e1d1347389f2ab93d60773d4d,0,27161156,"[[0, Recommended reading to understand what is..."
4,8842281e1d1347389f2ab93d60773d4d,4,25884323,"[[0, I really enjoyed this book, and there is ..."


In [15]:
# Each entry in review_sentences is a [number, sentence] pair. The number looks like a per-sentence spoiler flag (0 or 1), not an index
reviews_subset['review_sentences'][2]

[[0, 'A fun, fast paced science fiction thriller.'],
 [0, "I read it in 2 nights and couldn't put it down."],
 [0,
  'The book is about the quantum theory of many worlds which states that all decisions we make throughout our lives basically create branches, and that each possible path through the decision tree can be thought of as a parallel world.'],
 [0,
  'And in this book, someone invents a way to switch between these worlds.'],
 [0, 'This was nicely alluded to/foreshadowed in this quote:'],
 [0, '"I think about all the choices we\'ve made that created this moment.'],
 [0, 'Us sitting here together at this beautiful table.'],
 [0,
  'Then I think of all the possible events that could have stopped this moment from ever happening, and it all feels, I don\'t know..." "What?"'],
 [0, '"So fragile."'],
 [0, 'Now he becomes thoughtful for a moment.'],
 [0,
  'He says finally, "It\'s terrifying when you consider that every thought we have, every choice we could possibly make, branches int

In [16]:
# Function to concatenate a review's sentences into one string
def concatenate_sentences(sentences):
    concatenated_text = ''
    for sentence_info in sentences:
        # Each entry is a [number, sentence] pair. The number is a separate list
        # element, not a prefix of the text, so no characters need to be sliced off
        sentence = sentence_info[1]
        # Remove quotes around the sentence
        if sentence.startswith('"') and sentence.endswith('"'):
            sentence = sentence[1:-1]
        # Append the sentence to the concatenated text
        concatenated_text += sentence + ' '
    return concatenated_text.strip()  # Remove trailing space

# Apply the function to the 'review_sentences' column. assign() returns a new
# DataFrame, which avoids the SettingWithCopyWarning from assigning into a slice
reviews_subset = reviews_subset.assign(
    review_sentences=reviews_subset['review_sentences'].apply(concatenate_sentences))
reviews_subset['review_sentences'][2]

'A fun, fast paced science fiction thriller. I read it in 2 nights and couldn\'t put it down. The book is about the quantum theory of many worlds which states that all decisions we make throughout our lives basically create branches, and that each possible path through the decision tree can be thought of as a parallel world. And in this book, someone invents a way to switch between these worlds. This was nicely alluded to/foreshadowed in this quote: "I think about all the choices we\'ve made that created this moment. Us sitting here together at this beautiful table. Then I think of all the possible events that could have stopped this moment from ever happening, and it all feels, I don\'t know..." "What?" So fragile. Now he becomes thoughtful for a moment. He says finally, "It\'s terrifying when you consider that every thought we have, every choice we could possibly make, branches into a new world." This book can\'t be discussed without spoilers. It is a book about choice and regret. 

In [17]:
# Checking that review_sentences has been turned into one string
reviews_subset.head()

Unnamed: 0,user_id,rating,book_id,review_sentences
0,8842281e1d1347389f2ab93d60773d4d,5,18245960,This is a special book. It started slow for ab...
1,8842281e1d1347389f2ab93d60773d4d,3,16981,Recommended by Don Katz. Avail for free in Dec...
2,8842281e1d1347389f2ab93d60773d4d,3,28684704,"A fun, fast paced science fiction thriller. I ..."
3,8842281e1d1347389f2ab93d60773d4d,0,27161156,Recommended reading to understand what is goin...
4,8842281e1d1347389f2ab93d60773d4d,4,25884323,"I really enjoyed this book, and there is a lot..."


In [18]:
# I want to see how many individual (unique) users there are
user_vc = reviews_subset['user_id'].value_counts()
user_vc

user_id
aca760854b57ce2ec981df32e46dc96c    1815
843a44e2499ba9362b47a089b0b0ce75    1647
8e7e5b546a63cb9add8431ee6914cf59    1214
667b94d4c7e0b014bb6ab3636999e712    1189
c5b70e45e230a166bb00201662495d69    1165
                                    ... 
49e4ac9e5bc4f06039e6120de035738e       1
424fa9e638254075ebe6b4cba7310a91       1
1857263fbefd251fdac5cd6d468aa6e0       1
a0d87052bf1a80b01a9066744fb46e26       1
62e5c281998b211e61dab629ba595ef4       1
Name: count, Length: 18892, dtype: int64

In [24]:
# I want to see how many users I'd be left with if I only kept users who have written at least 5 reviews
users_vc_5 = user_vc[user_vc >= 5].index
users_vc_5

Index(['aca760854b57ce2ec981df32e46dc96c', '843a44e2499ba9362b47a089b0b0ce75',
       '8e7e5b546a63cb9add8431ee6914cf59', '667b94d4c7e0b014bb6ab3636999e712',
       'c5b70e45e230a166bb00201662495d69', 'aed35dbc626957174ebedf3c555b63d0',
       'd7879573928a367edb1d1accf2372810', 'c0e0fda388f87af0deffad748c9c8b67',
       '422e76592e2717d5d59465d22d74d47c', 'ccf944f8aca6814c1ec21cad667b7123',
       ...
       '54b7ab241624ebbc4158968b48cdd680', 'aaf7d634af94f2c65c04511f80ebdf02',
       '5fcdf0b98fb5d2dea561f5b93b298f8e', '156503c1d86446fd175c965320cb2129',
       'ec4a85bca6494332b52e7257e84b0714', '7ece23a7577ee3377c206ad52a287a47',
       'fdfaa1eeddeb558af311d36eaac263d8', 'c7303bb03f7f4a96c89d44f0504d6b0f',
       '8593c4720b979d9531b023ef8abf0e6d', '58fe4880f1c5746d7311867093747775'],
      dtype='object', name='user_id', length=17268)

Keeping only users with at least 5 reviews drops 1,624 of the 18,892 users (user_vc has 18,892 entries and users_vc_5 has 17,268), or about 8.6% of users.
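The filter-and-count logic can be checked on a toy table. This is only a sketch: `toy_reviews`, `min_reviews`, and the other names are made up for illustration, and the threshold is lowered from 5 to 2 so the example stays small. The last lines redo the arithmetic for the real counts shown above.

```python
import pandas as pd

# Toy stand-in for reviews_subset: one row per (user, book) review
toy_reviews = pd.DataFrame({
    "user_id": ["a", "a", "a", "b", "b", "c"],
    "book_id": [1, 2, 3, 1, 2, 1],
})

min_reviews = 2  # the notebook uses 5; 2 keeps the toy example small

# Count reviews per user, then keep the IDs of users at or above the threshold
review_counts = toy_reviews["user_id"].value_counts()
active_users = review_counts[review_counts >= min_reviews].index
filtered = toy_reviews[toy_reviews["user_id"].isin(active_users)]

users_dropped = review_counts.size - len(active_users)
share_dropped = users_dropped / review_counts.size
print(f"dropped {users_dropped} of {review_counts.size} users ({share_dropped:.1%})")

# Same arithmetic with the counts from the notebook output
real_dropped = 18892 - 17268
real_share = real_dropped / 18892
print(f"dropped {real_dropped} of 18892 users ({real_share:.1%})")
```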

In [25]:
# Keeping only users who have written at least 5 reviews
reviews_subset_2 = reviews_subset[reviews_subset['user_id'].isin(users_vc_5)]

In [26]:
reviews_subset_2['user_id'].value_counts()

user_id
aca760854b57ce2ec981df32e46dc96c    1815
843a44e2499ba9362b47a089b0b0ce75    1647
8e7e5b546a63cb9add8431ee6914cf59    1214
667b94d4c7e0b014bb6ab3636999e712    1189
c5b70e45e230a166bb00201662495d69    1165
                                    ... 
f8b91520e708af9a6c2ea8b2739f28bb       5
3b206409e55a73ca47284a85d84e0cc1       5
fdfaa1eeddeb558af311d36eaac263d8       5
d7c4e0e475bf1421b37890416f4cf1b9       5
297142f776e6cd363a2bc2f7aaa11c81       5
Name: count, Length: 17268, dtype: int64

In [27]:
# We have about 25,500 unique books in the reviews df
reviews_subset_2['book_id'].value_counts()

book_id
11870085    2725
11235712    2276
2767052     2236
7260188     1983
16096824    1716
            ... 
385483         8
2535743        8
16070990       7
118349         7
16148398       1
Name: count, Length: 25475, dtype: int64

In [28]:
reviews_subset_2.info()
# It's fine for user_id to be an object b/c the user id has letters in it

<class 'pandas.core.frame.DataFrame'>
Index: 1373966 entries, 0 to 1378032
Data columns (total 4 columns):
 #   Column            Non-Null Count    Dtype 
---  ------            --------------    ----- 
 0   user_id           1373966 non-null  object
 1   rating            1373966 non-null  int64 
 2   book_id           1373966 non-null  int64 
 3   review_sentences  1373966 non-null  object
dtypes: int64(2), object(2)
memory usage: 52.4+ MB


The review text data will need to be NLP processed before it can be meaningfully used. Natural Language Processing (NLP) involves inputs of a certain type: namely, "tokenized" text. Ideally, this is a list of lower-case, individual, normalized, semantically meaningful words.

Below we initialize a tokenizer, a stopwords list, and a lemmatizer that we use in a custom function.

The tokenizer uses a regex pattern to turn every run of at least 3 word characters (letters, digits, or underscores) into a "token."

The stopwords list is used to remove words like "is" and "the." These filler words carry little semantic meaning but make up a large share of most text. They are not useful for prediction and they dramatically increase the input that a model must process. Therefore, our function iterates through the tokenized text and removes them.

Finally, our lemmatizer reduces each word to its meaningful "base" or "lemma." The goal is to take "change," "changes," "changed," and "changing" and identify them all as the token "change" instead of 4 separate words. This is essentially the "normalization" of text. Note that WordNetLemmatizer treats every word as a noun unless it is given a part-of-speech tag, so when called without one it reduces "changes" to "change" but leaves verb forms like "changed" and "changing" as they are.
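To make the tokenizing and stopword steps concrete, here is a small sketch that uses only Python's `re` module. It assumes the same `\w{3,}` pattern as the tokenizer described above. The four-word `toy_stop_words` set and the `toy_tokenize` helper are illustrative stand-ins, not the NLTK stopwords list or tokenizer, and lemmatization is left out.

```python
import re

# Illustrative stand-in for a stopwords list; the NLTK English list is much longer
toy_stop_words = {"and", "the", "was", "for"}


def toy_tokenize(text, stop_words):
    """Lower-case the text, keep runs of 3+ word characters, drop stopwords."""
    tokens = re.findall(r"\w{3,}", text.lower())
    return [token for token in tokens if token not in stop_words]


sample = "I read it in 2 nights and couldn't put it down."
print(toy_tokenize(sample, toy_stop_words))
# → ['read', 'nights', 'couldn', 'put', 'down']
```

Short words ("I", "it", "in", "2") never become tokens because of the 3-character minimum, and the apostrophe splits "couldn't" into "couldn" plus a "t" that is too short to keep.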

In [30]:
tokenizer = RegexpTokenizer(r"(?u)\w{3,}") # This pattern finds runs of at least 3 word characters
# Named stop_words so the imported nltk `stopwords` module is not overwritten
# (overwriting it makes this cell fail when re-run); a set speeds up the lookups
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocessing(text, tokenizer, stop_words, lemmatizer):
    # Lower-case the text so tokens can match the lower-case stopwords, then tokenize
    tokens = tokenizer.tokenize(text.lower())
    
    # Remove stopwords
    tokens = [token for token in tokens if token not in stop_words]
    
    # Lemmatize (each token is treated as a noun by default)
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    
    return tokens

In [31]:
# Apply the preprocessing function to the 'review_sentences' column
reviews_subset_nlp = reviews_subset_2.copy()
reviews_subset_nlp['list_tokens'] = reviews_subset_nlp['review_sentences'].apply(lambda x: preprocessing(x, tokenizer, stop_words, lemmatizer))
reviews_subset_nlp.head()

In [None]:
type(reviews_subset_nlp['list_tokens'][0])

In [None]:
# # making the tokens list of strings into one big string 
# result_df['string_tokens'] = result_df['list_tokens'].apply(lambda x: ' '.join(x))
# result_df['string_tokens'][0]

In [None]:
# Saving the review data
reviews_subset_nlp.to_csv('data/gr_reviews_clean.csv')


### 4.c. User Reviews by Book
This continues from the user review dataset above. For content-based recommendations, all the reviews for each book need to be joined into one string, giving a dataset with one row per book and one cell holding all of that book's review text. That text will act as a feature when the model searches for similar books.
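As a rough illustration of the target shape, the sketch below joins per-review token lists into one string per book on a toy DataFrame. The data and the `join_token_lists` helper are made up for this example. It assumes each 'list_tokens' cell holds a Python list of tokens, which is why the tokens are joined within each review before the reviews are joined together.

```python
import pandas as pd

# Toy stand-in for the tokenized reviews: two reviews for book 1, one for book 2
toy_tokens = pd.DataFrame({
    "book_id": [1, 1, 2],
    "list_tokens": [["fun", "thriller"], ["great", "twist"], ["slow", "start"]],
})


def join_token_lists(token_lists):
    """Join the tokens within each review, then join all reviews into one string."""
    return " ".join(" ".join(tokens) for tokens in token_lists)


toy_per_book = (
    toy_tokens.groupby("book_id", as_index=False)
    .agg({"list_tokens": join_token_lists})
)
print(toy_per_book)
```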

In [None]:
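# list_tokens holds lists, so join the tokens within each review, then across each book's reviews
reviews_per_book = reviews_subset_nlp.groupby(['book_id'], as_index=False).agg({'list_tokens': lambda lists: ' '.join(' '.join(tokens) for tokens in lists)})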
reviews_per_book.head()

In [None]:
# Looking at the joined reviews for book_id 1
reviews_per_book['list_tokens'][0]

In [None]:
# Checking to see that the per-book reviews for book_id 1 match
# all the reviews for book_id 1 in the per-user, per-book reviews_subset_nlp dataframe
reviews_subset_nlp[reviews_subset_nlp['book_id']==1]

In [None]:
# Saving off the per-book reviews
reviews_per_book.to_csv('data/gr_reviews_per_book.csv')


### Genre Data

We want to add this genre data to the books metadata

In [None]:
genres.info()

In [None]:
genres.head()

### Books Meta Data

We want one row per book

In [None]:
books.info()

In [None]:
books.head()