# GoodReads Review to Rating Notebook

## I. Import libraries

In [1]:
import os
from pathlib import Path
from zipfile import ZipFile

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## II. Prepare Data

### 1. Install kaggle and get data from Kaggle

In [None]:
## Install kaggle for data
%pip install -q kaggle

In [None]:
from google.colab import files 
file = files.upload()

In [None]:
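# Create ~/.kaggle if it does not exist yet (exist_ok avoids an error on re-runs)
kaggle_path = f"{os.getenv('HOME')}/.kaggle"
kaggle = Path(kaggle_path)
kaggle.mkdir(parents=True, exist_ok=True)

# Move the uploaded API token into place and restrict it to owner read/write
os.replace("kaggle.json", os.path.join(kaggle_path, "kaggle.json"))
os.chmod(os.path.join(kaggle_path, "kaggle.json"), 0o600)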

In [None]:
! kaggle competitions download -c goodreads-books-reviews-290312

### 2. Unzip and move data to its specific folders

In [None]:
zip_file = ZipFile("goodreads-books-reviews-290312.zip")
zip_file.extractall()

In [None]:
data_path = "data"
train_path = os.path.join(data_path, "train")
test_path = os.path.join(data_path, "test")

In [None]:
for path in (train_path, test_path):
  os.makedirs(path, exist_ok=True)

# Move the train and test files into their folders
os.replace("./goodreads_train.csv", os.path.join(train_path, "goodreads_train.csv"))
os.replace("./goodreads_test.csv", os.path.join(test_path, "goodreads_test.csv"))

### 3. Loading data

In [None]:
train_data = pd.read_csv("../input/goodreads-books-reviews-290312/goodreads_train.csv")

In [None]:
train_data.head()

## III. Data Exploration

In [None]:
train_data.info()

The dataset has 11 columns and exactly 900,000 reviews. Most columns have no missing values, but a few, such as `read_at` and `started_at`, do contain nulls. Some initial thoughts on why these values are missing:
* The reviewers did not read the book but still reviewed it.
* They may have forgotten to fill in this information, or may not have wanted to share it.
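
To quantify how many values are missing per column, a summary like the following can help. This is a minimal sketch on a made-up toy frame; `toy` and `null_summary` are hypothetical names, and the values are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical toy frame standing in for train_data; the values are made up.
toy = pd.DataFrame({
    "review_text": ["great", "meh", "awful", "fine"],
    "read_at": ["2017-01-02", None, None, "2016-05-01"],
    "started_at": [None, None, None, "2016-04-20"],
})


def null_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and share of missing values for every column."""
    counts = df.isna().sum()
    return pd.DataFrame({"n_null": counts, "share_null": counts / len(df)})


summary = null_summary(toy)
print(summary)
```

Calling `null_summary(train_data)` on the real data would show which columns drive the missingness.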


We will look at the data again to get a basic feel for it.

In [None]:
train_data.head()

We also check the columns of the dataset.

In [None]:
train_data.columns

Based on the columns and the dataset, we can safely assume that there are 2 types of data in the dataset: categorical data and numerical data.

* Numerical data columns are: `n_votes`, `n_comments`.
* Categorical data columns are the remaining columns: `rating`, `user_id`, `book_id`, `review_id`, `review_text`, `date_added`, `date_updated`, `read_at`, `started_at`


### 1. Initial Hypothesis

#### a) Reviewing process

We can start forming hypotheses after walking through the process of adding a new review on GoodReads.
1. Log in to GoodReads to write a review. Anonymous users and guests cannot post reviews. (`user_id`)
2. Open the book's page. (`book_id`)
3. Rate the book. The rating scale is 0-5 stars in steps of 1. (`rating`)
4. An "Add a review" modal pops up, where we can write our review. (`review_text`)
5. We can also add optional information, such as marking the review as containing spoilers, the date we started reading (`started_at`), and the date we finished reading (`read_at`).
6. The review is added to the system (`review_id`) and the time the review was written is logged (`date_added`).
7. Other readers can view the review and like it (`n_votes`) if they find it helpful. They can also comment on the review to discuss their opinions and ideas (`n_comments`).
8. We can also update the review if we find it inappropriate or out of date. The time of the update is logged (`date_updated`).


#### b) Initiate hypothesis

- The most important features are `review_text`, `n_votes`, and `n_comments`; the last two are numerical columns:
  + `n_votes` indicates how reputable the review is. People tend to upvote reviews that they find helpful. Therefore, <b>a higher `n_votes` value should indicate a more accurate review.</b>
  + A high `n_comments` value can mean one of 2 things:
    * The review is helpful and others comment to thank the reviewer.
    * The review is debatable and others comment to share their opinions.
    * => <b>High `n_comments` also means that the review's rating is accurate.</b>


- After looking at some sample data, we noticed that some reviews are promotional. Therefore, another task would be to build a classifier that determines whether a review is spam (written only to give a book 5 stars) or not. (This is outside the scope of this notebook.)

### 2. Numerical features

We can start by learning about the numerical features first. Let's look at some basic statistics.

In [None]:
train_data[["n_votes", "n_comments"]].describe()

The `describe` table shows several notable things:
* The `rating` column may be left-skewed, as its Q3 is already the maximum rating (5 stars).
* `n_votes` and `n_comments` are heavily right-skewed: most values are small, with a long tail of large values.
* <b>ATTENTION</b>: the minimum of `n_votes` is -3, which is impossible. Goodreads does not have a downvote feature.
* <b>ATTENTION</b>: the minimum of `n_comments` is -1, which is also impossible.
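
Before deciding what to do with these impossible values, it is useful to know how many rows are affected. The sketch below counts them on a made-up toy frame; `toy`, `negative_per_column`, and `any_negative` are hypothetical names, and the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical toy frame; the real counts come from train_data.
toy = pd.DataFrame({
    "n_votes": [0, 3, -3, 12, 1],
    "n_comments": [0, 1, 0, -1, 2],
})

count_cols = ["n_votes", "n_comments"]

# Number of negative entries in each count column.
negative_per_column = (toy[count_cols] < 0).sum()

# Boolean mask of rows where at least one count is negative.
any_negative = (toy[count_cols] < 0).any(axis=1)

print(negative_per_column)
print(f"rows with at least one negative count: {any_negative.sum()}")
```

The same mask, inverted with `~any_negative`, would keep only the rows with valid counts.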


We will plot boxplots of the numerical columns to verify the first 2 observations.

In [None]:
fig, ax = plt.subplots(nrows=2, ncols=1, figsize=(20,16))
sns.boxplot(data=train_data, x="n_votes", ax=ax[0])
sns.boxplot(data=train_data, x="n_comments", ax=ax[1])
# plt.grid() only affects the last axes, so enable the grid on each subplot
for axis in ax:
  axis.grid()

#### a) Number of votes

Since there are so many outliers in `n_comments` and `n_votes`, we zoom in on the region of each plot where most of the values lie.

In [None]:
plt.figure(figsize=(10,6))
ax1 = sns.boxplot(data=train_data, x="n_votes")
ax1.set(xlim=(-10,20))
ax1.axvline(x=train_data["n_votes"].median(), c='r')
ax1.set_xticks(range(-10, 21))
_ = ax1.set_xticklabels([str(label) for label in range(-10, 21)])
plt.grid()

This boxplot tells us that:
- Q1 and Q2 coincide (the red line marks the median). Therefore, at least 25% of the reviews have 0 votes.
- A noticeable number of reviews have negative votes.
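
The reasoning behind the first point can be checked numerically: when Q1 and the median are both 0, every value between the 25th and 50th percentile is 0, so at least a quarter of the values are 0. The sketch below uses a small made-up list (`votes` is hypothetical) and the standard library only.

```python
import statistics

# Hypothetical vote counts: many zeros plus a long right tail.
votes = [0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 5, 40]

# Quartiles computed with the "inclusive" method (data treated as a full population).
q1, q2, q3 = statistics.quantiles(votes, n=4, method="inclusive")

# Share of reviews with exactly zero votes.
share_zero = votes.count(0) / len(votes)

print(f"Q1={q1}, median={q2}, Q3={q3}")
print(f"share of zero-vote reviews: {share_zero:.2f}")
```

On the real data, `train_data["n_votes"].quantile([0.25, 0.5, 0.75])` and `(train_data["n_votes"] == 0).mean()` would give the corresponding figures.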


In [None]:
plt.figure(figsize=(10,6))
ax1 = sns.histplot(x=train_data["n_votes"][train_data["n_votes"] < 21], bins=100)
ax1.set(xlim=(-10,20))
ax1.axvline(x=train_data["n_votes"].median(), c='r')
ax1.set_xticks(range(-10, 21))
_ = ax1.set_xticklabels([str(label) for label in range(-10, 21)])

Removing the reviews with negative votes makes the distribution look more like a geometric distribution (this is a discrete numerical feature).
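
One rough way to probe this resemblance is to fit a geometric distribution on the support 0, 1, 2, ... and compare its probabilities with the observed frequencies. The sketch below assumes the parameterization P(k) = (1 - p)^k * p, whose maximum-likelihood estimate is p = 1 / (1 + mean). The `votes` list is made up for illustration.

```python
from collections import Counter

# Hypothetical non-negative vote counts (negative rows already removed).
votes = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 4]

# Maximum-likelihood estimate of p for a geometric distribution on {0, 1, 2, ...}.
mean_votes = sum(votes) / len(votes)
p_hat = 1 / (1 + mean_votes)


def geometric_pmf(k: int, p: float) -> float:
    """P(X = k) for a geometric distribution counting failures before the first success."""
    return (1 - p) ** k * p


observed = Counter(votes)
for k in range(5):
    empirical = observed[k] / len(votes)
    fitted = geometric_pmf(k, p_hat)
    print(f"k={k}: empirical={empirical:.3f}, geometric={fitted:.3f}")
```

A large gap between the two columns on the real data would suggest that the geometric shape is only a loose description.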

In [None]:
plt.figure(figsize=(10,6))
ax1 = sns.boxplot(data=train_data, x="n_comments")
ax1.set(xlim=(-5,5))
ax1.set_xticks(range(-5, 6))
_ = ax1.set_xticklabels([str(label) for label in range(-5, 6)])

In [None]:
plt.figure(figsize=(20,16))
sns.regplot(data=train_data,x="n_votes", y="n_comments", scatter_kws={"alpha":0.3})


`n_votes`'s median is somewhere

We can see that `n_votes` and `n_comments` have a lot of outliers. We need to decide whether to drop them.

* Dropping the outliers of `n_votes` is undesirable, because highly upvoted reviews are likely to be more reputable than reviews with few upvotes: other readers evidently found them more helpful.

* We must investigate whether to drop the `n_comments` outliers. One hypothesis is that the larger `n_comments` is, the more debatable the review is. Another is that the review is a helpful one, which shows up as a large number of grateful comments.
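
To make "outlier" concrete, the usual boxplot rule flags values above Q3 + 1.5 * IQR. The sketch below applies that rule to made-up data and also shows `log1p`, which compresses the long tail instead of dropping rows. The log transform is an assumption on my part, not something this notebook does, and `values` is a hypothetical array.

```python
import numpy as np

# Hypothetical count data with one extreme value.
values = np.array([0, 0, 0, 1, 1, 2, 2, 3, 4, 150])

# Boxplot rule: anything above Q3 + 1.5 * IQR is flagged as an outlier.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr
is_outlier = values > upper_fence

print(f"upper fence: {upper_fence}, outliers: {values[is_outlier]}")

# Alternative to dropping: log1p keeps every row but shrinks large counts.
compressed = np.log1p(values)
print(np.round(compressed, 2))
```

Keeping the rows and transforming the counts preserves the highly upvoted reviews that the first bullet argues are valuable.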

In [None]:
train_data[train_data["n_comments"] < 0]

#### b) Rating

In [None]:
plt.figure(figsize=(10,6))
ax1 = sns.histplot(data=train_data, x="rating")

The rating distribution in our dataset is far from uniform. In fact, it is left-skewed (most reviews are rated 4-5 stars). Since we are using the review text to predict the rating, this imbalance can bias our model towards predicting 4-5 stars rather than 0-2 stars.

One way to handle this is to use stratified sampling based on rating, so that every split keeps the same rating proportions.
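
The idea of stratified sampling can be illustrated with pandas alone: sampling the same fraction from every rating group keeps the class proportions unchanged. This is a sketch on made-up data using `groupby(...).sample`, not the `StratifiedShuffleSplit` call used later in this notebook; `toy` and `sample` are hypothetical names.

```python
import pandas as pd

# Hypothetical imbalanced ratings: mostly 4-5 stars.
ratings = [5] * 50 + [4] * 30 + [3] * 10 + [0] * 10
toy = pd.DataFrame({
    "rating": ratings,
    "review_text": [f"review {i}" for i in range(len(ratings))],
})

# Draw 10% of the rows from each rating group separately.
sample = toy.groupby("rating").sample(frac=0.1, random_state=42)

# Proportions in the full data and in the sample should match.
print(toy["rating"].value_counts(normalize=True).sort_index())
print(sample["rating"].value_counts(normalize=True).sort_index())
```

Note that stratification preserves the imbalance rather than removing it; it only guarantees that rare ratings are not lost by chance when subsampling.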


In [2]:
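# int8 only holds -128..127, so the count columns use int32 to avoid overflow on heavily upvoted reviews
dtypes = {"review_id": str, "review_text": str, "n_votes": np.int32, "n_comments": np.int32, "rating": np.int8}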

In [3]:
train_data = pd.read_csv("../input/goodreads-books-reviews-290312/goodreads_train.csv", usecols=["review_id", "review_text", "n_votes", "n_comments", "rating"],dtype=dtypes)

In [4]:
train_pos = train_data[(train_data["n_votes"] >= 0) & (train_data["n_comments"] >= 0)]

In [5]:
del(train_data)

In [None]:
train_pos.describe()

In [6]:
from sklearn.model_selection import StratifiedShuffleSplit

# test_size=0.9 leaves 10% of the rows in each training split, stratified by rating
sss = StratifiedShuffleSplit(n_splits=5, random_state=42, test_size=0.9)

In [7]:
# Each iteration overwrites the previous one, so only the last of the 5 splits is kept
for train_index, test_index in sss.split(train_pos.drop("rating", axis=1), train_pos["rating"]):
  df_train, df_test = train_pos.iloc[train_index], train_pos.iloc[test_index]


In [8]:
import gc
del(train_pos)
del(df_test)
gc.collect()

23

In [None]:
plt.figure(figsize=(10,6))
ax1 = sns.histplot(data=df_train, x="rating")

In [10]:
from fastai.text.all import *

In [None]:
dls = TextDataLoaders.from_df(df=df_train, is_lm=True, text_col="review_text", label_col="rating")
dls.show_batch(max_n=3)

In [None]:
dls.categorize.vocab.map_objs([0,1,2,3,4,5])

In [None]:
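# Token-level accuracy and perplexity are the usual metrics for a language model;
# macro F1 over the whole vocabulary is not meaningful here
learn = language_model_learner(dls, AWD_LSTM, metrics=[accuracy, Perplexity()], wd=0.1).to_fp16()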

In [None]:
learn.fit_one_cycle(1)

In [None]:
learn.save('1epoch')

In [None]:
learn = learn.load('1epoch')

In [None]:
learn.unfreeze()
learn.fit_one_cycle(10)

In [None]:
learn.save_encoder('finetuned')

In [None]:
TEXT = "I liked this book because"
N_WORDS = 40
N_SENTENCES = 2
preds = [learn.predict(TEXT, N_WORDS, temperature=0.75) 
         for _ in range(N_SENTENCES)]

In [None]:
print("\n".join(preds))

In [11]:
dls_cls = TextDataLoaders.from_df(df=df_train, text_col="review_text", label_col="rating")
dls_cls.show_batch(max_n=3)

Unnamed: 0,text,category
0,"xxbos "" maybe love is n't something that comes full circle . xxmaj it just ebbs and flows , in and out , just like the people in our lives . -- xxmaj just because we did n't end up on the same wave , does n't mean we are n't apart of the same ocean . "" \n▁ xxmaj colleen xxmaj hoover is my favorite author , and this is without a doubt my favorite book from her . i finished reading this book months ago and i have n't been able to stop thinking about this story and these characters . xxmaj this story is so incredibly important and moving and emotional , and this book feels so raw and real . i love that each one of these characters has flaws , some bigger than others , but all of them feel so realistic and have so",5
1,"xxbos xxmaj the first time i read this book was smack in the middle of the year when i lived in xxmaj japan teaching xxmaj english -- an odd , yet not - so - odd place to first interact with this master of the xxmaj southern xxmaj gothic . \n▁ i had become a xxmaj christian only two years before and found my faith tested living in another country . xxmaj had i chosen xxmaj christianity because i wanted to fit in ? xxmaj in xxmaj japan , xxmaj christians do not fit in . xxmaj faith must be found off in the corners , in small acts and unexpected friendships . \n▁ o'connor 's stories revealed to me the faith that i was looking for while living in a very different culture . xxmaj although xxmaj christianity - as - religion is a major part of southern culture",5
2,"xxbos xxmaj please note : i am rating this overall 5 stars because there was one xxup perfect story and one so close to perfect that for the sake of not doing half - stars xxmaj i 'm going to call a 5 star as well . xxmaj the rest of the stories i rate two 4 's and one 3 . xxmaj so the average rating of the collection ought to be about 4 stars . xxmaj but xxmaj i 'm saying it 's 5 for the sake of the two i loved most . \n▁ xxup what xxup eyes xxup can xxup see - xxmaj elisabeth xxmaj brown \n▁ xxmaj my xxmaj rating : 3 stars \n▁ xxmaj this was certainly enjoyable , and had some different twists , so the idea was fine but something about it just fell a little flat for me ? i liked",5


In [None]:
del(dls)
gc.collect()

In [12]:
learn = text_classifier_learner(dls_cls, AWD_LSTM, drop_mult=0.5, metrics=F1Score(average="macro"))

In [13]:
learn = learn.load_encoder('finetuned')

In [None]:
learn.fit_one_cycle(1)

In [14]:
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))

| epoch | train_loss | valid_loss | f1_score | time |
|---|---|---|---|---|
| 0 | 1.01821 | 0.991959 | 0.489457 | 05:06 |
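
The `slice(1e-2/(2.6**4), 1e-2)` argument requests discriminative learning rates: as far as I understand fastai's behavior, the lower bound goes to the earliest layer group, the upper bound to the head, and the groups in between get geometrically spaced values. The sketch below reproduces that spacing in plain Python. The helper `geometric_lrs` is hypothetical, and the count of five parameter groups is an assumption about the AWD_LSTM classifier.

```python
def geometric_lrs(lo: float, hi: float, n_groups: int) -> list:
    """Return n_groups learning rates spaced geometrically from lo to hi."""
    if n_groups == 1:
        return [hi]
    ratio = (hi / lo) ** (1 / (n_groups - 1))
    return [lo * ratio ** i for i in range(n_groups)]


# Assumption: five parameter groups, so consecutive groups differ by a factor of 2.6.
lrs = geometric_lrs(1e-2 / (2.6 ** 4), 1e-2, n_groups=5)
for group, lr in enumerate(lrs):
    print(f"group {group}: lr={lr:.6f}")
```

Under that assumption, each layer group trains 2.6 times faster than the one before it, with the classifier head at the full `1e-2`.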


In [15]:
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3))

| epoch | train_loss | valid_loss | f1_score | time |
|---|---|---|---|---|
| 0 | 0.952744 | 0.959398 | 0.495178 | 08:55 |
| 1 | 0.897126 | 0.947927 | 0.526146 | 08:56 |
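
The `f1_score` column is a macro-averaged F1: the F1 of each rating class is computed separately and then averaged with equal weight, so rare ratings count as much as common ones. The sketch below is a from-scratch illustration on made-up labels, not the scikit-learn implementation that fastai wraps; `macro_f1` is a hypothetical helper.

```python
def macro_f1(y_true: list, y_pred: list, labels: list) -> float:
    """Average the per-class F1 scores, giving every class equal weight."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class that is never predicted and never present scores 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical imbalanced case: the model always predicts 5 stars.
y_true = [5, 5, 5, 5, 5, 5, 5, 5, 1, 1]
y_pred = [5] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro = macro_f1(y_true, y_pred, labels=[1, 5])
print(f"accuracy={accuracy:.2f}, macro F1={macro:.3f}")
```

In this toy case accuracy looks respectable while macro F1 is much lower, which is why macro F1 is a reasonable choice for a dataset dominated by 4-5 star reviews.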


In [None]:
learn.fine_tune(10)

In [16]:
learn.show_results()

Unnamed: 0,text,category,category_
0,"xxbos xxmaj to be fair , i started this book knowing that \n▁ i will hate it \n▁ . xxmaj it was kinda obvious since firstly it 's a romance novel ( and romance - especially an unrealistic one- is really not my thing ) , and secondly it was a teenage love story . xxmaj everything that i will expect to go wrong here will go wrong . \n▁ xxmaj but i actually do like the irony of this book , in that this story essentially is a ' cancer perk ' . xxmaj would the readers feel the same if the lovers were n't stricken by cancer ? xxmaj and did the author really just try to romanticise cancer ? \n▁ xxmaj however i also find it important to note that the author did write this book for a cancer - stricken patient who never got to live",1,2
1,"xxbos xxmaj current rating : 2 stars ( original rating was 3.5 ) ( was 2.5 , went down to 2 as the review kept getting longer ) \n▁ ( this review is semi - spoiler filled ? xxmaj legit spoilers will be hidden as per usual , but i feel like this has a lot of mini reviews ) \n▁ xxmaj as always , my reviews are a tad messy , because xxmaj i 'm still getting used to doing legit reviews and not rambling to my friends about how good or bad a book is . \n▁ "" and no one warns you about this . xxmaj no one tells you how hard it is , because , yay , love ! xxmaj and we 're so happy for them ! xxmaj but there 's this sharp edge to it , right ? xxmaj because yeah , you",1,3
2,"xxbos xxmaj was it worth spending a month slugging through this book ? > .- i am not really sure , but i liked xxmaj prince of xxmaj thorns and i was n't about to turn a blind eye to xxmaj king of xxmaj thorns ( also , the title of xxmaj king , sounds so much more badass than xxmaj prince , methinks ^_^ ) . xxmaj this has got to be the best 7 euro i have spent . \n▁ xxmaj may contain spoilers ! xxmaj read at your own risk ! \n▁ xxmaj this was in my opinion the strangest 5 star worthy book i have read . xxmaj it builds up slowly , rarely displays any hint of action or anything aside from xxmaj jorg 's introspection and memories , but i believe this is where the book truly shines . xxmaj the blurbs , are",5,4
3,"xxbos xxmaj this book honestly made me so angry ! i should warn this will probably be a rant and will be full of spoilers , xxmaj i 'll try to keep it to a minimum but just be aware . \n▁ xxup america : \n▁ i was about 50 pages in when i started to question why this book existed . xxmaj the story seemed to drag on with more of the same stuff . \n▁ ' oh i love xxmaj maxon , i want to be with him . ' , \n▁ ' no i do n't like xxmaj maxon , xxmaj aspin is the one i should marry . ' , \n▁ ' nope wrong again , xxmaj maxon , definitely xxmaj maxon . ' , \n▁ ' silly me it 's xxmaj aspin . ' \n▁ i just wanted to shake some sense into xxmaj america",3,3
4,"xxbos 5 xxunk stars \n▁ xxmaj god where do i even start with this book . xxmaj this is my first xxmaj leylah xxmaj attar book , and i am completely hooked . xxmaj there is no going back . xxmaj this is the kind of book , the genre i absolutely adore . xxmaj darkness melded with romance and you 've got one drunk reader in your hands . \n▁ i have to admit that i started this book because the cover was too damn appealing for me to not to read it . i mean just look at the cover ! xxmaj it 's got such an haunting melody to it i literally had the chills running down , and i have to say that the writing did not disappoint . xxmaj there was a point in the story when i had goosebumps from the language . xxmaj",5,5
5,"xxbos "" when the world owed you nothing , you demanded something of it anyway . "" \n▁ xxup ya books had become something of a chore for me before i picked up this series . i watched the xxmaj throne of xxmaj glass series be butchered before my eyes , i struggled through xxmaj the xxmaj grisha xxmaj trilogy , xxmaj miss xxmaj peregrine 's xxmaj home for xxmaj peculiar xxmaj children had been overhyped for me . i was losing faith in the xxup ya xxmaj fantasy genre , and that was probably why it took me so long to pick up these books . \n▁ xxmaj six of xxmaj crows and xxmaj crooked xxmaj kingdom are books driven by the characters . xxmaj they 're a stark contrast to the xxmaj grisha xxmaj trilogy , where plot became the main focus and i did n't find myself",5,4
6,"xxbos xxmaj oh . \n▁ xxmaj oh , my . \n▁ xxmaj this series pushes my boundaries in ways that are kind of awesome . i mean , xxmaj i 'm no prude . xxmaj it 's just … usually i like my books heavy on the swoon and lighter on the gratuitous sex ( although i like the sex to be hot as hell when it does show up ) . xxmaj sawyer xxmaj bennett is showing me that i can absolutely become emotionally invested in a series about a kinky as hell sex club . xxmaj even when i might sometimes be exposed to things that are not my personal favorite kinks ? xxmaj it 's still … kink . xxmaj and it 's still sexy . xxmaj and somehow , magically , xxmaj ms . xxmaj bennett also makes it emotional . \n▁ xxmaj cain is the",4,4
7,"xxbos i found xxmaj cloud xxmaj atlas to be less a novel than a series of short stories . xxmaj and on top of that , i found the quality of the stories varied wildly . xxmaj furthermore , the differences in style and tone of the various stories jarred me . \n▁ xxmaj some modest spoilers ahead ( the worst is hidden ) , so stop here if you do n't want to know anything about xxmaj cloud xxmaj atlas whatsoever . \n▁ i just could n't shake the feeling that xxmaj mitchell had written a series of short stories then later decided to weave them together . xxmaj this could certainly be factually inaccurate , but as a reader impression , it really does n't matter his intention . xxmaj while reading it , i could n't shake the feeling that they were short stories with gimmicky tricks",3,3
8,"xxbos xxmaj you can now read my interview with author xxmaj liz xxmaj prince ! xxmaj she was gracious enough to take the time to answer my questions about being a comic artist and about xxmaj tomboy . \n▁ xxmaj review originally published at xxmaj grab the xxmaj lapels . xxmaj please click the link to see the review with all the images . \n▁ xxmaj thirty - one - year - old comic artist xxmaj liz xxmaj prince shares her history as a tomboy . xxmaj she begins with her tantrum at age three when she did n't want to wear a dress . xxmaj all through elementary and middle school , xxmaj prince is tormented . xxmaj no one wants to play with her , she hates all things girly , and classmates begin to question her sexuality . xxmaj high school is a huge problem area until",5,4


In [18]:
learn.predict("I don't like this book")

('1', tensor(1), tensor([0.2086, 0.3936, 0.1066, 0.0611, 0.0672, 0.1628]))
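
The tuple returned by `predict` holds the decoded label, its index, and the probability of each class. The sketch below reads those probabilities in plain Python, assuming the class order is the ratings 0 to 5 (consistent with the label `'1'` at index 1 above). The expected-rating calculation is an extra illustration, not something the notebook computes.

```python
# Probabilities copied from the predict output above.
probs = [0.2086, 0.3936, 0.1066, 0.0611, 0.0672, 0.1628]

# Assumption: class index i corresponds to a rating of i stars.
ratings = [0, 1, 2, 3, 4, 5]

# Most likely rating (what predict reports).
predicted = ratings[probs.index(max(probs))]

# Probability-weighted average rating, a softer summary of the same output.
expected = sum(r * p for r, p in zip(ratings, probs))

print(f"predicted rating: {predicted}")
print(f"expected rating: {expected:.4f}")
```

The model is not very confident here: the top class has a probability below 0.4, and 5 stars still receives about 0.16.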

In [19]:
learn.export("review.pkl")

In [None]:
!ls -al 
