# BetterReads: Optimizing the number of reviews

This notebook explores how many reviews are needed to achieve meaningful results with the BetterReads algorithm. We assume the general model framework laid out in the `01_modelling_with_use.ipynb` notebook and developed further in the `02_optimizing_kmeans.ipynb` notebook.

We have two basic exploratory goals. The first goal is to determine how few review sentences we need to have for our algorithm to work as intended, so as to know how big our review datasets need to be. The second goal is to investigate whether filtering reviews by their rating leads to more meaningful or less meaningful results. Obviously, these two questions are related, as both concern the impact of the size of our dataset on the results of our model.

In [1]:
import numpy as np
import pandas as pd
import random

from sklearn.cluster import KMeans

import tensorflow_hub as hub

In [2]:
# Loads Universal Sentence Encoder locally, from downloaded module
embed = hub.load('../../Universal Sentence Encoder/module/')

# Loads Universal Sentence Encoder remotely, from Tensorflow Hub
# embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

## How many reviews does our algorithm need?

Thus far we have been trialing our algorithm on fairly large datasets consisting of close to 10,000 reviews. Although plenty of books on GoodReads have this number of reviews, not all do, especially newer books. Furthermore, our web-scraping script can currently extract only up to 1,500 reviews for any given book. While we have access to larger datasets through UCSD Book Graph, these reviews are a few years old now, and limited in their scope. Ideally we would want our algorithm to be able to produce meaningful results for any book it is given, and to do so it must be able to work on smaller sets of reviews.

### Sampling sentences

Our approach here will be as follows. Rather than simply seeing how our algorithm performs on smaller datasets directly, we will see how it performs on progressively smaller samples of a known larger dataset. This way we will have a sense, from its performance on the original dataset, of what results we should expect. If the algorithm starts to output drastically different results below a certain sample size, we can take this as an indication that the sample size is too small.

As in the previous notebooks, we will start by looking at *Gone Girl*. 

In [3]:
def load_sentences(file_path):
    '''
    Function to load and embed a book's sentences
    '''
    
    # Read in processed file as dataframe
    df = pd.read_csv(file_path)
    
    print(f'This dataset consists of {df.shape[0]} sentences from {df["review_index"].nunique()} reviews\n')

    # Copy sentence column to new variable
    sentences = df['sentence'].copy()

    # Vectorize sentences
    sentence_vectors = embed(sentences)
    
    return sentences, sentence_vectors

In [4]:
def get_clusters(sentences, sentence_vectors, k, n):
    '''
    Function to extract the n most representative sentences from k clusters, with density scores
    '''
    
    # Instantiate the model
    kmeans_model = KMeans(n_clusters=k, random_state=24)

    # Fit the model
    kmeans_model.fit(sentence_vectors);

    # Set the number of cluster centre points to look at when calculating density score
    centre_points = int(len(sentences) * 0.02)
    
    # Initialize list to store mean inner product value for each cluster
    cluster_density_scores = []
    
    # Initialize list to store one row (top sentences, density score) per cluster
    rows = []

    # Loop through number of clusters
    for i in range(k):

        # Define cluster centre
        centre = kmeans_model.cluster_centers_[i]

        # Calculate inner product of cluster centre and sentence vectors
        ips = np.inner(centre, sentence_vectors)

        # Find the sentences with the highest inner products
        top_indices = pd.Series(ips).nlargest(n).index
        top_sentences = list(sentences[top_indices])
        
        # Calculate density score as the mean inner product of the sentences closest to the centre
        centre_ips = pd.Series(ips).nlargest(centre_points)
        density_score = round(np.mean(centre_ips), 5)
        
        # Append the cluster density score to master list
        cluster_density_scores.append(density_score)

        # Append the cluster's top n sentences and density score to the list of rows
        rows.append([top_sentences, density_score])

    # Build dataframe of cluster centre sentences (DataFrame.append was removed in pandas 2.0)
    df = pd.DataFrame(rows, columns=['sentences', 'density'])

    # Sort dataframe by density score, from highest to lowest
    df = df.sort_values(by='density', ascending=False).reset_index(drop=True)
    
    # Loop through number of clusters selected
    for i in range(k):
        
        # Save density / similarity score & sentence list to variables
        sim_score = round(df.loc[i]["density"], 3)
        sents = df.loc[i]['sentences'].copy()
        
        print(f'Cluster #{i+1} sentences (density score: {sim_score}):\n')
        print(*sents, sep='\n')
        print('\n')
        
    model_density_score = round(np.mean(cluster_density_scores), 5)
    
    print(f'Model density score: {model_density_score}')

In [5]:
# Set file path
gg_file_path = 'data/8442457_gone_girl.csv'

# Load and embed sentences
sentences, sentence_vectors = load_sentences(gg_file_path)

This dataset consists of 41873 sentences from 8275 reviews



In [6]:
# Get cluster sentences
get_clusters(sentences, sentence_vectors, k=9, n=8)

Cluster #1 sentences (density score: 0.538):

I was kind of disappointed with the ending but it was a real ending.
I wasn't a huge fan of the ending.
I don't think I liked the ending.
I wasn't a huge fan of the ending but at the same time, the ending made sense with the story.
Although I didn't like the ending - it left me wanting an ending.
Definitely a page-turner, but I didn't like the ending.
I thought the ending was AWFUL.
I really didn't care for the ending.


Cluster #2 sentences (density score: 0.459):

Gillian Flynn is a good storyteller but I'm not sure that I'll read another of her books.
Gillian Flynn is a good writer and she had me fooled.
Let's start with the fact that Gillian Flynn is an excellent writer.
Gillian Flynn though, is clearly a good writer.
Gillian Flynn is an awesome writer.
Gillian Flynn is a brilliant writer.
I'm definitely going to be reading another Gillian Flynn book.
Gillian Flynn is an amazing writer.


Cluster #3 sentences (density score: 0.443):

Th

As we saw in previous notebooks, when we run our algorithm on the full dataset with k = 9 clusters, it identifies the following nine broad themes in the reviews:

1. Didn't like the ending
1. Gillian Flynn is an excellent writer
1. Couldn't put it down
1. Main characters (Nick and Amy)
1. Plot twists & turns
1. Awful characters, great story
1. Loved the book, but not the ending
1. A story of a bad marriage
1. Messed up & disturbing

Not all of these themes are entirely distinct, but we should expect to see several of them in any smaller sample we take.

We will begin by taking random samples of the full dataset and keeping the number of clusters set at 9. As we decrease the size of our dataset we may need to decrease the number of clusters, but that is something we will investigate further down the line.

Let us start by taking a 50% sample of our data and inspecting the results. (Note that, by taking a 50% sample of sentences, we will not achieve a 50% sample of reviews, since some reviews have more sentences than others. Whether this makes a difference to our algorithm is something else we'll investigate further on.)
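
To make the parenthetical point concrete, here is a minimal sketch on a hypothetical toy dataframe (the data and the name `toy_df` are invented for illustration; only the `review_index` and `sentence` column names come from our datasets). It takes a 50% sample of sentences and compares the share of sentences kept with the share of reviews still represented.

```python
import pandas as pd

# Hypothetical toy data: 4 reviews with 1, 2, 3 and 4 sentences respectively
toy_df = pd.DataFrame({
    'review_index': [0, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    'sentence': [f'sentence {i}' for i in range(10)],
})

# Take a 50% sample at the sentence level
sample_df = toy_df.sample(frac=0.5, random_state=24)

# Compare the share of sentences kept with the share of reviews represented
sentence_share = len(sample_df) / len(toy_df)
review_share = sample_df['review_index'].nunique() / toy_df['review_index'].nunique()

print(f'Share of sentences kept: {sentence_share:.0%}')
print(f'Share of reviews represented: {review_share:.0%}')
```

Because long reviews contribute many sentences, a sentence-level sample tends to touch a larger share of reviews than the sampling fraction suggests.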

In [7]:
def get_sample(file_path, frac):
    '''
    Function to take a random sample of a dataset's sentences
    '''
    
    # Read in file as dataframe
    df = pd.read_csv(file_path)
    
    # Take the sample
    sample_df = df.sample(frac=frac, random_state=24).reset_index(drop=True)
    print(f'This sample consists of {sample_df.shape[0]} sentences from {sample_df["review_index"].nunique()} reviews\n')
    
    # Copy sentence column to new variable
    sentences = sample_df['sentence'].copy()

    # Vectorize sentences
    sentence_vectors = embed(sentences)
    
    return sentences, sentence_vectors

In [8]:
# Get sentence clusters from a 50% sample of our dataset
sentences, sentence_vectors = get_sample(gg_file_path, 0.5)
get_clusters(sentences, sentence_vectors, k=9, n=8)

This sample consists of 20936 sentences from 6759 reviews

Cluster #1 sentences (density score: 0.538):

Definitely a page-turner, but I didn't like the ending.
I liked the book, but the ending was not what I expected AT ALL!
The ending was a total anti-climax - I see the intention of the ending but it irritated me.
I must admit I hated the ending.
I wish the ending had been better, but I guess the particular ending was supposed to be that way.
I was a little disappointed in the ending, but the rest of the book was so good, it didn't matter.
I didn't like the ending at all by the way.
Some people didn't like the ending but I found that the ending was just right.


Cluster #2 sentences (density score: 0.461):

Gillian Flynn is a good storyteller but I'm not sure that I'll read another of her books.
Let's start with the fact that Gillian Flynn is an excellent writer.
Gillian Flynn though, is clearly a good writer.
Gillian Flynn is a brilliant writer.
I love the way Gillian Flynn writes.


As we can see, a 50% sample still produces perfectly intelligible results. Let's summarize each of the identified clusters:

1. Didn't like the ending
1. Gillian Flynn is an excellent writer
1. Couldn't put it down
1. Main characters (Nick and Amy)
1. Plot twists & turns
1. Awful characters, great story
1. Loved the book, but not the ending
1. A story of a bad marriage
1. Messed up & disturbing

In other words, the algorithm identifies exactly the same thematic clusters as before (though in a different order, since the order of k-means clusters is random and not of any significance). Furthermore, the density scores of the individual clusters are nearly all the same, and the overall model density score is also virtually identical. This suggests that our model has not lost any significant information by taking a 50% sample of our data.
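
The comparison above is made by reading the cluster sentences. One possible way to check it numerically, sketched here under assumptions, is to match the cluster centres of the two runs by cosine similarity. Note that `get_clusters` does not return its fitted model, so the arrays below are random stand-ins for `kmeans_model.cluster_centers_`, and `match_centroids` is a hypothetical helper rather than part of our pipeline.

```python
import numpy as np

def match_centroids(centres_a, centres_b):
    '''
    For each centroid in centres_a, find the most similar centroid
    in centres_b by cosine similarity.
    Returns (indices of best matches, similarity of each best match).
    '''
    # Normalize rows so that the inner product equals cosine similarity
    a = centres_a / np.linalg.norm(centres_a, axis=1, keepdims=True)
    b = centres_b / np.linalg.norm(centres_b, axis=1, keepdims=True)
    sims = a @ b.T
    return sims.argmax(axis=1), sims.max(axis=1)

# Stand-in centroids (assumption: 9 clusters of 512-dimensional vectors)
rng = np.random.default_rng(24)
full_centres = rng.normal(size=(9, 512))

# Simulate a second run: the same centroids, reordered and slightly perturbed
order = rng.permutation(9)
sample_centres = full_centres[order] + 0.01 * rng.normal(size=(9, 512))

matches, sims = match_centroids(full_centres, sample_centres)
print(matches)
print(sims.round(3))
```

If every centroid from the full run has a close match in the sample run, the two runs have found much the same clusters, whatever order they come out in.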

Let's then see how things turn out with smaller samples. How about a 25% sample?

In [9]:
# Get sentence clusters from a 25% sample of our dataset
sentences, sentence_vectors = get_sample(gg_file_path, 0.25)
get_clusters(sentences, sentence_vectors, k=9, n=8)

This sample consists of 10468 sentences from 4958 reviews

Cluster #1 sentences (density score: 0.543):

I liked the book, but the ending was not what I expected AT ALL!
The ending was a total anti-climax - I see the intention of the ending but it irritated me.
I must admit I hated the ending.
I didn't like the ending at all by the way.
That said, I didn't like the ending, and I didn't actually like a lot of the book.
I liked everything but the ending.
Loved the story right up until the end, very disappointed in the ending.
Anyway, it's well written even if i didn't get the ending i was expecting.


Cluster #2 sentences (density score: 0.462):

Gillian Flynn though, is clearly a good writer.
Let's start with the fact that Gillian Flynn is an excellent writer.
I love the way Gillian Flynn writes.
Gillian Flynn is clearly a talented writer.
Gillian Flynn is a wonderful writer.
Gillian Flynn is a good writer though, and I can't wait to read her other novels.
Gillian Flynn is such an incre

Much the same results, with no significant drops in density score! Let's try going even lower...

In [10]:
# Get sentence clusters from a 10% sample of our dataset
sentences, sentence_vectors = get_sample(gg_file_path, 0.1)
get_clusters(sentences, sentence_vectors, k=9, n=8)

This sample consists of 4187 sentences from 2801 reviews

Cluster #1 sentences (density score: 0.537):

I liked the book, but the ending was not what I expected AT ALL!
I must admit I hated the ending.
Loved the story right up until the end, very disappointed in the ending.
The only thing that disappointed me tremendously was the ending.
I was somewhat disappointed in the ending (I won't spoil it) but in a way, I'm not sure what other ending would really suit the story.
I wasn't crazy about the ending, but then, it was kind of appropriate.
The ending was different than expected, but I'm happy it was.
I am not too sure I like the ending, but I enjoyed the book.


Cluster #2 sentences (density score: 0.469):

Gillian Flynn though, is clearly a good writer.
I love the way Gillian Flynn writes.
Gillian Flynn is clearly a talented writer.
Gillian Flynn is a wonderful writer.
Gillian Flynn is such an incredible author.
Gillian Flynn is an incredibly gifted writer.
Gillian Flynn is a beautifu

Again, the results here look good. Can we go even lower?

In [11]:
# Get sentence clusters from a 1% sample of our dataset
sentences, sentence_vectors = get_sample(gg_file_path, 0.01)
get_clusters(sentences, sentence_vectors, k=9, n=8)

This sample consists of 419 sentences from 396 reviews

Cluster #1 sentences (density score: 0.511):

Unfortunately I was extremely disappointed with the ending.
Not sure I love the ending, I'll leave it at that.
I would have given it 5 stars, but I absolutely hated the ending.
i didn't hate that ending, but i didn't feel completely satisfied by it either.
I literally just finished it so SPOILER ALERT   I hated the ending!!!
The ending disappointed me, to be honest.
The only thing which I found slightly disappointing about this book is the ending.
def not a fan of the ending!


Cluster #2 sentences (density score: 0.491):

Gillian Flynn is a wonderful writer.
Needless to say I probably won't be reading another of Gillian Flynn's books anytime soon.
In the end, I am still awed and fascinated by the characters gillian flynn created, and I can't wait to talk about this book with the friend who gave it to me and with everyone else who reads it.
One of my top favorites for sure and now I am

Remarkably, even with only a 1% sample of our original dataset – only 419 sentences – our model still identifies the same nine broad clusters as it originally did. This is a very strong indication of two things: first, that the clusters that our original model identified are indeed representative of the opinions most commonly expressed in the dataset; and second, that our model can still produce meaningful results from a very small dataset of reviews.

### Sampling reviews

There is, however, a potential weakness in how we've been proceeding thus far. In all of the above examples we have been sampling our dataset at the *sentence* level. Yet this is not the best indicator of how our model will perform on smaller sets of *reviews*. So let us now sample our dataset at the level of its reviews.
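
As a toy illustration of what review-level sampling means, the sketch below samples review indices and then keeps every sentence belonging to the sampled reviews. The helper name `sample_review_ids`, the toy data, and the use of a locally seeded NumPy generator are all assumptions made for this example.

```python
import numpy as np
import pandas as pd

def sample_review_ids(df, frac, seed=24):
    '''
    Keep all sentences from a random fraction of reviews.
    '''
    # Unique review indices and the number of reviews to keep
    review_ids = df['review_index'].unique()
    sample_size = int(len(review_ids) * frac)

    # Sample review indices without replacement, using a local seeded generator
    rng = np.random.default_rng(seed)
    sampled_ids = rng.choice(review_ids, size=sample_size, replace=False)

    # Keep every sentence that belongs to a sampled review
    return df[df['review_index'].isin(sampled_ids)].reset_index(drop=True)

# Hypothetical toy data: 4 reviews with 1, 2, 3 and 4 sentences respectively
toy_df = pd.DataFrame({
    'review_index': [0, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    'sentence': [f'sentence {i}' for i in range(10)],
})

sample_df = sample_review_ids(toy_df, 0.5)
print(sample_df)
```

Unlike sentence-level sampling, every review that appears in the sample appears in full, which is closer to what a small scraped dataset would look like.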

In [12]:
def get_review_sample(file_path, frac):
    '''
    Function to take a random sample of a dataset's reviews
    '''
    
    # Read in file as dataframe
    df = pd.read_csv(file_path)
    
    # Find the number of reviews in the dataset
    num_reviews = df['review_index'].nunique()
    
    # Calculate number of reviews to include in sample
    sample_size = int(num_reviews * frac)

    # Create a list of all review indices in dataset
    review_ids = df['review_index'].unique()

    # Take a random sample of these review indices
    # (random.sample requires a sequence; passing a set raises a TypeError in Python 3.11+)
    sample_reviews = random.sample(list(review_ids), sample_size)

    # Create dataframe of sentences from the sampled reviews
    sample_df = df[df['review_index'].isin(sample_reviews)].reset_index(drop=True)
    
    print(f'This sample consists of {sample_df.shape[0]} sentences from {sample_df["review_index"].nunique()} reviews\n')
    
    # Copy sentence column to new variable
    sentences = sample_df['sentence'].copy()

    # Vectorize sentences
    sentence_vectors = embed(sentences)
    
    return sentences, sentence_vectors

In [13]:
# Get sentence clusters from a 10% review sample of our dataset
random.seed(24)
sentences, sentence_vectors = get_review_sample(gg_file_path, 0.1)
get_clusters(sentences, sentence_vectors, k=9, n=8)

This sample consists of 4170 sentences from 827 reviews

Cluster #1 sentences (density score: 0.53):

Loved the story right up until the end, very disappointed in the ending.
Personally, seeing what the book turned into towards the end (in this it's a good thing), I thought that the ending was great.
Just wasn't crazy about the ending.
I didn't like the ending though.
While I can't say that I liked the ending, I thought it was perfect for the book.
I am really disappointed in the ending, but the rest of the book was amazing.
Not sure I liked the ending though.
Good story, but I didn't like the ending.


Cluster #2 sentences (density score: 0.479):

Gillian Flynn is an amazing writer.
I love the way Gillian Flynn writes!
Gillian Flynn is an author I'm definitely going to look out for.
Gillian Flynn is a beautiful writer.
Gillian Flynn is an incredibly gifted writer.
I really applaud the way that Gillian Flynn could explore all the psychological aspects of the book.
Lots of Kudos to Gill

These results still look good! Let's try an even smaller sample.

In [14]:
# Get sentence clusters from a 1% review sample of our dataset
random.seed(24)
sentences, sentence_vectors = get_review_sample(gg_file_path, 0.01)
get_clusters(sentences, sentence_vectors, k=9, n=8)

This sample consists of 367 sentences from 82 reviews

Cluster #1 sentences (density score: 0.51):

Gillian Flynn is a beautiful writer.
It's the third novel I've read by Gillian Flynn and it didn't disappoint.
Very disturbing at time, but all Gillian Flynn books are.
I would reccommend it to anyone i know and give major props to Gillian Flynn.
Gillian Flynn won me over by tapping into my morbid curiosity.
This is a sharp, witty and unpredictable book by Gillian Flynn.
It also feels like Gillian Flynn asks the reader: how well do you know the book's characters?
I admire Gillian Flynn for being bold enough to use this style of writing to tell her story.


Cluster #2 sentences (density score: 0.468):

Honestly I read this book since everyone else was raving about it.
i was so excited to read this book!
I cannot possibly say enough good things about this book!!
This book had it all for me, including one of the best endings I have read in awhile.
One of the best books I have read this year

The model starts to break down a bit here, but these are still remarkably good results. From only 367 sentences our model was still able to pull out the same nine broad thematic clusters.

Thus it seems that our model is able to perform quite well even on small datasets (assuming there are indeed patterns in the datasets to detect). We should not think that we need thousands and thousands of reviews in order to deploy our algorithm.
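
One practical caveat for very small datasets: our density score averages the inner products of the top 2% of sentences, and with fewer than 50 sentences `int(len(sentences) * 0.02)` evaluates to 0, leaving nothing to average. The sketch below shows one possible guard. The function name `density_score`, the `top_frac` parameter and the random toy vectors are assumptions for illustration, not the notebook's own code.

```python
import numpy as np

def density_score(centre, vectors, top_frac=0.02):
    '''
    Mean inner product between a cluster centre and the sentence
    vectors closest to it (the top `top_frac` share of all vectors).
    '''
    # Use at least one point so that tiny datasets do not give an empty selection
    centre_points = max(1, int(len(vectors) * top_frac))

    # Inner product of the centre with every sentence vector
    ips = np.inner(centre, vectors)

    # Average the largest inner products
    top_ips = np.sort(ips)[-centre_points:]
    return float(np.mean(top_ips))

# Hypothetical toy data: 40 unit vectors, too few for 2% to reach a whole sentence
rng = np.random.default_rng(24)
vectors = rng.normal(size=(40, 16))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
centre = vectors.mean(axis=0)

print(int(len(vectors) * 0.02))        # number of points without the guard
print(density_score(centre, vectors))  # score with the guard in place
```

With the guard, the score for a tiny dataset falls back to the single closest sentence rather than becoming undefined.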

## Filtering reviews by rating

Up until this point we have been running our model on (a random sample of) an entire dataset. But it is possible that our results would improve if we used our model to extract the most expressed opinions, not across *all* of a book's reviews, but only across its *positive* or *negative* reviews. Or we might simply be interested to know what people who liked some book are saying about it, or what people who didn't like it are saying about it. Let us then see what happens when we filter our reviews by their rating.

Let's first get a sense of the breakdown of our dataset.

In [15]:
pd.read_csv(gg_file_path)['rating'].value_counts().sort_index()

0.0      413
1.0     1465
2.0     3457
3.0     8382
4.0    16194
5.0    11962
Name: rating, dtype: int64

As we can see here, a strong majority of the sentences come from positive reviews, rated 4 or 5 stars. (Note that these counts are per sentence rather than per review, since each row of the dataset is a sentence.) This class imbalance is not necessarily an issue, however, since we know now that our model performs well on large and small datasets. Nonetheless, this may be something important to keep in mind as we proceed.
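
To put numbers on the imbalance, the short sketch below recomputes the shares from the counts printed above (which sum to the 41,873 sentences in the dataset). The variable names are our own for this example.

```python
import pandas as pd

# Sentence counts per rating, copied from the output above
rating_counts = pd.Series(
    [413, 1465, 3457, 8382, 16194, 11962],
    index=[0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
)

# Convert counts to shares of the whole dataset
shares = rating_counts / rating_counts.sum()

# Share of sentences from 4- and 5-star reviews, and from 1- and 2-star reviews
positive_share = shares.loc[[4.0, 5.0]].sum()
negative_share = shares.loc[[1.0, 2.0]].sum()

print(f'Positive (4-5 stars): {positive_share:.1%}')
print(f'Negative (1-2 stars): {negative_share:.1%}')
```

Roughly two thirds of the sentences come from 4- and 5-star reviews, while 1- and 2-star reviews account for a little over a tenth.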

Let us run our model on only the good reviews, which we will define as reviews with ratings of 4 or 5.

In [16]:
def filter_reviews(file_path, lower, upper):
    '''
    Function to filter reviews to those falling within a rating range
    '''
    
    # Read in file as dataframe
    df = pd.read_csv(file_path)
    
    # Filter dataframe to review ratings falling within specified range
    filtered_df = df[(df['rating'] >= lower) & (df['rating'] <= upper)].reset_index(drop=True)

    # Copy sentence column to new variable
    sentences = filtered_df['sentence'].copy()

    # Vectorize sentences
    sentence_vectors = embed(sentences)
    
    return sentences, sentence_vectors

In [17]:
# Get sentence clusters for 4-star and 5-star reviews
sentences, sentence_vectors = filter_reviews(gg_file_path, 4, 5)
get_clusters(sentences, sentence_vectors, k=9, n=8)

Cluster #1 sentences (density score: 0.543):

I was kind of disappointed with the ending but it was a real ending.
I wasn't a huge fan of the ending but at the same time, the ending made sense with the story.
I did feel the ending was a little anti-climatic but I love that I didn't expect it.
I liked the book, but the ending was not what I expected AT ALL!
I didn't really love the ending though.
I don't think I liked the ending, but I am still mulling it over.
I really didn't care for the ending.
I didn't like the ending at all.


Cluster #2 sentences (density score: 0.474):

Gillian Flynn is a good writer and she had me fooled.
Gillian Flynn is an awesome writer.
Gillian Flynn is a brilliant writer.
I'm definitely going to be reading another Gillian Flynn book.
Gillian Flynn is an amazing writer.
I can't wait to read another book by Gillian Flynn.
My favorite Gillian Flynn novel so far.
I can't wait to read another book by Gillian Flynn-- it was so hard to put this one down!


Cluster

These results look very similar to what we observed in our earlier trials on the unfiltered dataset. So it seems that the general trends we observed in the dataset are well represented in the good reviews. Let's see if things change when we filter to only the bad reviews.

In [18]:
# Get sentence clusters for 1-star and 2-star reviews
sentences, sentence_vectors = filter_reviews(gg_file_path, 1, 2)
get_clusters(sentences, sentence_vectors, k=9, n=8)

Cluster #1 sentences (density score: 0.504):

I really didn't like this book.
I wasn't a fan of this book.
I didn't really like this book.
I really, really wanted to like this book.
Aside from the easy read, I really didn't like this book.
I thought this book was going to be good, I really did.
I really really wanted to love this book.
I really wanted to like this book.


Cluster #2 sentences (density score: 0.496):

I thought the ending was AWFUL.
The ending was a total anti-climax - I see the intention of the ending but it irritated me.
I really did not like the ending.
I found the ending so very unsatisfying.
Brilliant, but I hated the ending!
I wanted to like it more but the ending really disappointed me.
The only good thing about the ending was that it was over.
It was an interesting book but I really hated the ending.


Cluster #3 sentences (density score: 0.425):

Interesting plot but the characters were so unlikeable - ugh!
Not one of the characters was likeable at all.
I compl

Here (for once!) our clusters look quite different. Some of the themes are the same as before, but many are quite different. To summarize:

1. Didn't like the book
1. Disliked the ending
1. Disliked the characters
1. Gillian Flynn is an excellent writer
1. Read it just to finish it
1. Main characters (Nick and Amy)
1. Plot twists & turns
1. Didn't match expectations
1. Horrible characters

Interestingly, although these clusters are different from what we saw before, they don't actually add much new information to what we already knew. Rather, these clusters simplify the sentiments in our previous results: rather than seeing that people loved the story despite disliking the ending, here we see simply that people disliked the ending. The only truly new sentiments captured here are generic ones: that the book didn't match expectations, and that people simply didn't like the book.

Because of this, it does not seem to benefit our model to filter reviews by their rating. The important sentiments are still likely to be captured in the full set of reviews, and the full set is more likely to capture more complex sentiments.

Nonetheless, from a user experience perspective, it may still make sense to allow the user the option of filtering reviews by rating, in case they are particularly interested in seeing what people who loved or hated a book are saying about it. The important conclusion for now is that filtering shouldn't be the *default* option, as the best results seem to be achieved when no filtering is done.

## Testing our conclusions

So far we have been basing all of our conclusions on a single dataset. It is obviously not guaranteed that these results will generalize to other datasets. So let's see how our model actually performs on some different and smaller datasets. We'll begin with a dataset for Margaret Atwood's *The Testaments* that was scraped from GoodReads in May 2020.

In [19]:
ts_file_path = 'data/42975172_the_testaments.csv'

sentences, sentence_vectors = load_sentences(ts_file_path)

pd.read_csv(ts_file_path)['rating'].value_counts().sort_index()

This dataset consists of 11289 sentences from 1198 reviews



1.0    1393
2.0    2288
3.0    2533
4.0    2409
5.0    2666
Name: rating, dtype: int64

As we can see here, this dataset is much smaller than the *Gone Girl* dataset, with about one quarter the number of sentences and about one seventh the number of reviews. Furthermore, this dataset is much more balanced with respect to its ratings. (This is due in part to how the data was collected.) Let's see how our model does on the full dataset with 8 clusters.

In [20]:
get_clusters(sentences, sentence_vectors, k=8, n=8)

Cluster #1 sentences (density score: 0.516):

As a fan of the book The Handmaid's Tale and an even bigger fan of the television series The Handmaid's Tale, I was very excited for this novel.
I loved the handmaids tale but didn’t like this at all.
First, the handmaids tale is one of my all time favorite books.
I’ve read The Handmaid’s Tale, more than once, and I genuinely enjoyed it.
‘The handmaid’s tale’ was one of the best books I’ve ever read.
I'm glad I read it, but I think The Handmaid's Tale is a better story.
I feel like this novel just cheapened the Handmaid’s Tale, which is such a shame.
I absolutely loved The Handmaid’s Tale.


Cluster #2 sentences (density score: 0.449):

I did like learning more about Aunt Lydia and how she became the infamous Aunt Lydia.
Also, the only interesting character in the book is Aunt Lydia.
Aunt Lydia is the most compelling but even she can't save this book.
Aunt Lydia was the sole reason I read this book until the end.
The only good I found in th

Judging from these results, our model has done a good enough job of clustering together semantically similar sentences (though not as good as it seemed to do with *Gone Girl*). However, these sentence clusters seem much less revealing of the book itself. Some clusters seem to merely express a simple sentiment of 'I didn't like it' or 'I was disappointed'; other clusters seem simply to be picking up a shared proper name, like 'Margaret Atwood' or 'Aunt Lydia' or 'The Handmaid's Tale'. On the whole, these clusters just don't tell us a whole lot about what is actually good or bad about the book.

Now this doesn't necessarily indicate any issue with our model. The issue could very well lie in the data itself; perhaps reviewers of *The Testaments* just aren't saying many specific things about the book. But given the distribution of the review ratings, it may behoove us to filter the reviews and see if more meaningful clusters emerge when we look only at people who liked or didn't like the book.

In [21]:
# Get sentence clusters for good reviews
sentences, sentence_vectors = filter_reviews(ts_file_path, 4, 5)
get_clusters(sentences, sentence_vectors, k=6, n=8)

Cluster #1 sentences (density score: 0.493):

The sequel to The Handmaid’s Tale is written very different from Margaret Atwood’s dystopian classic, but in no way does it suffer as a follow-up.
I loved "The Handmaids Tale" and starting this book, I wasn't so sure how to feel.
I’m the reader that loved The handmaid’s Tale and never thought about a follow up book.
The Handmaid's Tale has been one of my favourite novels for decades, and there was a lot for this sequel to live up to.
I haven’t read “The Handmaid’s Tale”, but I am a big fan of the series.
I'm not going to spend too much time on The Testaments, Margaret Atwood's follow-up to The Handmaid's Tale.
I do not see the The Testament by Margaret Atwood as a sequel for The Handmaid's Tale.
I read The Handmaid's Tale about 15 years ago and adored it.


Cluster #2 sentences (density score: 0.452):

I did like learning more about Aunt Lydia and how she became the infamous Aunt Lydia.
Ann Dowd (Hulu’s Aunt Lydia) reading the part of Aunt 

In [22]:
# Get sentence clusters for bad reviews
sentences, sentence_vectors = filter_reviews(ts_file_path, 1, 2)
get_clusters(sentences, sentence_vectors, k=6, n=8)

Cluster #1 sentences (density score: 0.502):

I loved the handmaids tale but didn’t like this at all.
I feel like this novel just cheapened the Handmaid’s Tale, which is such a shame.
As a fan of the book The Handmaid's Tale and an even bigger fan of the television series The Handmaid's Tale, I was very excited for this novel.
I had extremely high hopes for this book given the fact that Margaret Atwood’s “The Handmaid’s Tale” is a modern-day classic.
I loved The Handmaids Tale, but this just seems so bad?
buddy read with NkishaI freely admit I'm not exactly a Margaret Atwood fan and when I decided to read this sequel to The Handmaid's Tale I doubted I would much enjoy it as I found The Handmaid's Tale to be less than well executed; marvellous premise though.
‘The handmaid’s tale’ was one of the best books I’ve ever read.
I’ve read The Handmaid’s Tale, more than once, and I genuinely enjoyed it.


Cluster #2 sentences (density score: 0.415):

The only good I found in the entire book was

These results seem better. We're now getting more detail about why people liked or disliked the book. So perhaps our earlier conclusions about review filtering were mistaken. With a dataset that is more balanced between positive and negative reviews, filtering by review rating can reveal more meaningful results. This seems like further reason to include this as an optional parameter in our model, that the user can set as they wish.
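
An optional filtering parameter could be sketched as follows. This is only one possible design: the function name `select_sentences`, the `rating_range` parameter and the toy data are hypothetical, and the default of `None` reflects the earlier conclusion that filtering should not be applied unless the user asks for it.

```python
import pandas as pd

def select_sentences(df, rating_range=None):
    '''
    Return the sentence column, optionally restricted to reviews whose
    rating falls within an inclusive (lower, upper) range.
    With rating_range=None (the default), no filtering is applied.
    '''
    if rating_range is not None:
        lower, upper = rating_range
        # Series.between is inclusive at both ends by default
        df = df[df['rating'].between(lower, upper)]

    return df['sentence'].reset_index(drop=True)

# Hypothetical toy data
toy_df = pd.DataFrame({
    'rating': [1.0, 2.0, 3.0, 4.0, 5.0, 5.0],
    'sentence': [f'sentence {i}' for i in range(6)],
})

all_sentences = select_sentences(toy_df)
good_sentences = select_sentences(toy_df, rating_range=(4, 5))
bad_sentences = select_sentences(toy_df, rating_range=(1, 2))

print(len(all_sentences), len(good_sentences), len(bad_sentences))
```

The returned sentences could then be embedded and clustered exactly as in the unfiltered case.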

Let's see how our model performs on some other books. Here's how it does with Tara Westover's *Educated*, whose dataset was also scraped from GoodReads.

In [23]:
ed_file_path = 'data/35133922_educated.csv'

sentences, sentence_vectors = load_sentences(ed_file_path)

This dataset consists of 15369 sentences from 1265 reviews



In [24]:
# Get sentence clusters
get_clusters(sentences, sentence_vectors, k=8, n=8)

Cluster #1 sentences (density score: 0.394):

Finally I think it is a book about Mormons even when she states at the start: “This story is not about Mormonism”
She is clear as she meets many other Mormons at BYU that their Mormonism is not what her family is about.
Tara's parents were Mormon, but as Tara states in the author's note, this story is not about or against Mormonism.
At the very start of the book she claims it is not a book about Mormonism, but she allows the reader to assume that many of the quirky things her parents did, they did because they were Mormon.
Her parents are members of the Church of Jesus Christ of Latter-day Saints, but they espouse views and beliefs that are very extreme; their practice of Mormonism looks nothing like the type of mainstream Mormonism I practice.
It may have been what her family believed, but its not what the mainstream LDS church teaches.
I know this book "wasn't about Mormonism" but it certainly leads credence to the cries of poor homeschoo

In [25]:
# Get sentence clusters for good reviews
sentences, sentence_vectors = filter_reviews(ed_file_path, 4, 5)
get_clusters(sentences, sentence_vectors, k=6, n=8)

Cluster #1 sentences (density score: 0.368):

More than anything else, this book feels like Tara Westover's setting the record straight for her family and small community.
Tara is nothing like her family.
I thought it was amazing that Tara made it out of that family.
Tara Westover has opened up her heart and her journals and has given us a very raw, honest, eye-opening story!In this story Tara takes us on quite a journey – starting with her childhood through her college years.
This book has been heavily reviewed, so I will make a few observations:A difference between Tara (who eventually breaks free) and her sister-in-law (who doesn’t) is that Tara goes to great lengths to hide (to others and to herself) her shame.
Some parts were difficult to read and others frustrating, but I always wanted to know how Tara would survive her family.
Tara does such an amazing job of telling her story.
I am amazed at the drive and resiliency of Tara as she pulls herself out of a horrible family situatio

In [26]:
# Get sentence clusters for bad reviews
sentences, sentence_vectors = filter_reviews(ed_file_path, 1, 2)
get_clusters(sentences, sentence_vectors, k=6, n=8)

Cluster #1 sentences (density score: 0.445):

I really wanted to enjoy this book, I just didn't.
I'm just not sure why I really didn't like this book that much.
I couldn’t stop reading this book.
I disliked this book probably more than just about any book I have ever read.
I hate giving books up, but this book was “meh”.
Just wasn't worth my time reading this book.
I really should never have read this book.
Meh, this book really didn't do alot for me.


Cluster #2 sentences (density score: 0.314):

While she certainly has crafted an interesting story of abuse and religious extremism, she never made me feel anything other than horror of the abuse in her situation and hatred of her brother and father.
Her re-telling of countless acts of abuse by her brother and mental abuse by her parents became redundant to say the least.
I truly think she would have stayed with the family if her parents hadn’t turned a blind eye towards her brother’s abuse.
What I couldn’t get over was how dysfunctiona

Here again filtering seems to produce better and more easily interpretable results.

Let's look at one last book, Ted Chiang's *Exhalation*, also scraped from GoodReads.

In [27]:
ex_file_path = 'data/41160292_exhalation_stories.csv'

sentences, sentence_vectors = load_sentences(ex_file_path)

This dataset consists of 7059 sentences from 937 reviews



In [28]:
# Get sentence clusters
get_clusters(sentences, sentence_vectors, k=8, n=8)

Cluster #1 sentences (density score: 0.557):

"Anxiety is the dizziness of freedom" is the last story, and probably the most imaginative.
I also really enjoyed "Anxiety Is the Dizziness of Freedom".
'Anxiety is the Dizziness of Freedom' was definitely my favourite.
Anxiety is the Dizziness of Freedom - This one is a pure masterpiece, also a candidate for my favorite story.
“Anxiety Is the Dizziness of Freedom” is another favorite.
"Anxiety is the Dizziness of Freedom" has a great title and a satisfying ending.
:)Anxiety Is the Dizziness of Freedom - Another novella, and fascinating as hell.
"Anxiety is the Dizziness of Freedom" is probably the second greatest in the book.


Cluster #2 sentences (density score: 0.505):

'The Merchant and the Alchemist's Gate' is a delightfully engaging time-travel tale.
I loved the first story, “The Merchant and the Alchemist's Gate”.
The first story, "The Merchant and the Alchemist's Gate," was really good.
Merchant and the Alchemist's Gate was my favo

In [29]:
# Get sentence clusters for good reviews
sentences, sentence_vectors = filter_reviews(ex_file_path, 4, 5)
get_clusters(sentences, sentence_vectors, k=6, n=8)

Cluster #1 sentences (density score: 0.563):

"Anxiety is the dizziness of freedom" is the last story, and probably the most imaginative.
'Anxiety is the Dizziness of Freedom' was definitely my favourite.
“Anxiety Is the Dizziness of Freedom” is another favorite.
"Anxiety is the Dizziness of Freedom" has a great title and a satisfying ending.
:)Anxiety Is the Dizziness of Freedom - Another novella, and fascinating as hell.
Anxiety is the Dizziness of Freedom - This one is a pure masterpiece, also a candidate for my favorite story.
"Anxiety is the Dizziness of Freedom" is probably the second greatest in the book.
Anxiety Is the Dizziness of Freedom ( Well thought out)


Cluster #2 sentences (density score: 0.438):

Ted Chiang is maybe the best living writer of the science-fiction short story.
I love whenever a new Ted Chiang short story comes out.
Ted Chiang is my favorite short story writer and having a new collection of his to read is awesome!
I cannot say if Ted Chiang is as insightf

In [30]:
# Get sentence clusters for bad reviews
sentences, sentence_vectors = filter_reviews(ex_file_path, 1, 2)
get_clusters(sentences, sentence_vectors, k=6, n=8)

Cluster #1 sentences (density score: 0.46):

I liked the first short story the most (“The Merchant and the Alchemist's Gate”) and the never ending story on digients (“The Lifecycle of Software Objects”) the least.
I loved the first story, “The Merchant and the Alchemist's Gate”.
That said, I loved the first story, The Merchant and the Alchemist's Gate.
The first story, "The Merchant and the Alchemist's Gate," was really good.
I liked the first story very much “The merchant and alchemist gate”.
The first story, "The Merchant and the Alchemist's Gate," really drew me in.
My favorite story was the Merchant and the Alchemists Gate and least favorite was the Lifecycle of Software Objects.
The only story to grab my attention was "The Merchant and The Alchemist Gate. "


Cluster #2 sentences (density score: 0.446):

I love science fiction so it was not a genre issue.
The thing that made this book so boring is that the stories were not very sci-fi but more philosophical.
Some of the stories we

In this case, the full set of reviews seems to produce more helpful results than the filtered sets of reviews. This shows that there is no hard-and-fast rule about which settings will produce the most meaningful results: the appropriate setting will depend on the book, its reviews, and the needs of the user.
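
Since the best setting depends on the book and its reviews, one quick check before choosing is to see how many reviews fall into each rating band. The sketch below is only an illustration: the helper name `rating_band_counts`, the `rating` column name, and the toy ratings are assumptions, and the bands simply mirror the 4-5 and 1-2 ranges passed to `filter_reviews` above.

```python
import pandas as pd


def rating_band_counts(reviews, rating_col="rating"):
    """Count reviews in the bad (1-2), middle (3), and good (4-5) rating bands.

    Hypothetical helper; assumes integer star ratings in a single column.
    """
    ratings = reviews[rating_col]
    # Series.between is inclusive at both ends by default
    return {
        "bad (1-2)": int(ratings.between(1, 2).sum()),
        "rating 3": int((ratings == 3).sum()),
        "good (4-5)": int(ratings.between(4, 5).sum()),
    }


# Toy ratings for illustration only (not taken from the scraped datasets)
toy_reviews = pd.DataFrame({"rating": [5, 4, 1, 3, 2, 5, 4]})

band_counts = rating_band_counts(toy_reviews)
print(band_counts)
```

If one band turns out to hold very few reviews, clustering it on its own may have too little text to work with, which is one possible reason a filtered run could look less informative than the full set.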

In conclusion, we have shown in this notebook that our model still performs well on relatively small datasets, even those with as few as 100 reviews. This means that we shouldn't feel constrained in the size of datasets we can work with. We have also shown that, while filtering reviews by their rating produces better results in some cases, it produces worse results in others. Because of this, review filtering should be left as a parameter that can be set by the user.