# BetterReads: Optimizing GoodReads review data

This notebook explores how to achieve the best results with the BetterReads algorithm when using review data scraped from GoodReads. It is a short follow-up to the exploration performed in the `03_optimizing_reviews.ipynb` notebook.

We have two options when scraping review data from GoodReads: For any given book, we can either scrape 1,500 reviews, with 300 reviews for each star rating (1 to 5), or we can scrape just the top 300 reviews, of any rating. (This is due to some quirks in the way that reviews are displayed on the GoodReads website; for more information, see my [GoodReadsReviewsScraper script](https://github.com/williecostello/GoodReadsReviewsScraper).)

There are advantages and disadvantages to both options. If we scrape 1,500 reviews, we obviously have more review data to work with; however, the data is artificially class-balanced, such that, for example, we'll still see a good number of negative reviews even if the vast majority of the book's reviews are positive. If we scrape just the top 300 reviews, we will have a more representative dataset, but much less data to work with.

We saw in the `03_optimizing_reviews.ipynb` notebook that the BetterReads algorithm can achieve meaningful and representative results from a dataset with fewer than 100 reviews. So we should not dismiss the 300-review option simply because it involves less data. We should only dismiss it if its smaller dataset leads to worse results. So let's try these two options out on a particular book and see how the algorithm performs.

In [1]:
import numpy as np
import pandas as pd
import random

from sklearn.cluster import KMeans

import tensorflow_hub as hub

In [2]:
# Loads Universal Sentence Encoder locally, from downloaded module
embed = hub.load('../../Universal Sentence Encoder/module/')

# Loads Universal Sentence Encoder remotely, from Tensorflow Hub
# embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

## Which set of reviews should we use?

For this notebook we'll work with a new example: Sally Rooney's *Conversations with Friends*.

<img src='https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1500031338l/32187419._SY475_.jpg' width=250 align=center>

We have prepared two datasets, one of 1,500 reviews and another of 300 reviews, as described above. Both datasets were scraped from GoodReads at the same time, so there is some overlap between them. (Note that the total number of reviews in both datasets is less than advertised, since non-English and very short reviews are dropped during data cleaning.)

In [3]:
# Set path for processed file
file_path_1500 = 'data/32187419_conversations_with_friends.csv'
file_path_300 = 'data/32187419_conversations_with_friends_top_300.csv'

# Read in processed file as dataframe
df_1500 = pd.read_csv(file_path_1500)
df_300 = pd.read_csv(file_path_300)

print(f'The first dataset consists of {df_1500.shape[0]} sentences from {df_1500["review_index"].nunique()} reviews')
print(f'The second dataset consists of {df_300.shape[0]} sentences from {df_300["review_index"].nunique()} reviews')

The first dataset consists of 8604 sentences from 1190 reviews
The second dataset consists of 2874 sentences from 293 reviews


As we can see above, in comparison to the smaller dataset, the bigger dataset contains approximately three times the number of sentences, drawn from approximately four times the number of reviews. And as we can see below, the bigger dataset contains approximately the same number of reviews for each star rating, while the smaller dataset is much more heavily skewed toward 5-star and 4-star reviews.

In [4]:
df_1500.groupby('review_index')['rating'].mean().value_counts().sort_index()

1.0    252
2.0    250
3.0    239
4.0    212
5.0    237
Name: rating, dtype: int64

In [5]:
df_300.groupby('review_index')['rating'].mean().value_counts().sort_index()

1.0     14
2.0     27
3.0     46
4.0     80
5.0    116
Name: rating, dtype: int64

On [the book's actual GoodReads page](https://www.goodreads.com/book/show/32187419-conversations-with-friends), its average review rating is listed as 3.82 stars. This is close to the average review rating of our smaller dataset (about 3.91). The bigger dataset's average review rating, in contrast, is just under 3 (about 2.94). This supports our earlier suspicion that the smaller dataset presents a more representative sample of the book's full set of reviews.

In [6]:
df_300.groupby('review_index')['rating'].mean().mean()

3.908127208480565

In [7]:
df_1500.groupby('review_index')['rating'].mean().mean()

2.942857142857143
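
As a quick sanity check, the two averages above can be recomputed by hand from the per-rating review counts printed earlier. The sketch below is only illustrative: the dictionaries are copied from the `value_counts()` outputs, and the helper names `mean_rating` and `share_positive` are hypothetical, not part of the BetterReads code. Note that the counts printed for the smaller dataset sum to 283 rather than 293, which suggests that a few reviews have no rating; that is an inference from the outputs, not something stated in the notebook.

```python
# Per-review rating counts, copied from the value_counts() outputs above
counts_1500 = {1: 252, 2: 250, 3: 239, 4: 212, 5: 237}
counts_300 = {1: 14, 2: 27, 3: 46, 4: 80, 5: 116}


def mean_rating(counts):
    """Weighted mean star rating from a {rating: number_of_reviews} mapping."""
    total_reviews = sum(counts.values())
    total_stars = sum(rating * n for rating, n in counts.items())
    return total_stars / total_reviews


def share_positive(counts, threshold=4):
    """Fraction of reviews whose rating is at least `threshold` stars."""
    total_reviews = sum(counts.values())
    positive = sum(n for rating, n in counts.items() if rating >= threshold)
    return positive / total_reviews


for name, counts in [("1,500-review", counts_1500), ("300-review", counts_300)]:
    print(
        f"{name} dataset: mean rating {mean_rating(counts):.3f}, "
        f"share of 4- and 5-star reviews {share_positive(counts):.1%}"
    )
```

The recomputed means agree with the values returned by the two cells above, and the share of 4- and 5-star reviews makes the class balancing of the bigger dataset explicit.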

Let's see how these high-level differences affect the output of our algorithm.

In [11]:
def load_sentences(file_path):
    '''
    Function to load and embed a book's sentences
    '''
    
    # Read in processed file as dataframe
    df = pd.read_csv(file_path)

    # Copy sentence column to new variable
    sentences = df['sentence'].copy()

    # Vectorize sentences
    sentence_vectors = embed(sentences)
    
    return sentences, sentence_vectors

In [9]:
def get_clusters(sentences, sentence_vectors, k, n):
    '''
    Function to extract the n most representative sentences from k clusters, with density scores
    '''
    
    # Instantiate the model
    kmeans_model = KMeans(n_clusters=k, random_state=24)

    # Fit the model
    kmeans_model.fit(sentence_vectors)

    # Set the number of cluster centre points to look at when calculating density score
    centre_points = int(len(sentences) * 0.02)
    
    # Initialize list to store mean inner product value for each cluster
    cluster_density_scores = []
    
    # Initialize list to store each cluster's top sentences and density score
    rows = []

    # Loop through number of clusters
    for i in range(k):

        # Define cluster centre
        centre = kmeans_model.cluster_centers_[i]

        # Calculate inner product of cluster centre and sentence vectors
        ips = np.inner(centre, sentence_vectors)

        # Find the sentences with the highest inner products
        top_indices = pd.Series(ips).nlargest(n).index
        top_sentences = list(sentences[top_indices])

        # Calculate density score as the mean of the highest inner products
        centre_ips = pd.Series(ips).nlargest(centre_points)
        density_score = round(np.mean(centre_ips), 5)

        # Append the cluster density score to master list
        cluster_density_scores.append(density_score)

        # Store the cluster's top n sentences and density score
        rows.append([top_sentences, density_score])

    # Build dataframe from the collected rows
    # (DataFrame.append was removed in pandas 2.0)
    df = pd.DataFrame(rows, columns=['sentences', 'density'])

    # Sort dataframe by density score, from highest to lowest
    df = df.sort_values(by='density', ascending=False).reset_index(drop=True)
    
    # Loop through number of clusters selected
    for i in range(k):
        
        # Save density / similarity score & sentence list to variables
        sim_score = round(df.loc[i]["density"], 3)
        sents = df.loc[i]['sentences'].copy()
        
        print(f'Cluster #{i+1} sentences (density score: {sim_score}):\n')
        print(*sents, sep='\n')
        print('\n')
        
    model_density_score = round(np.mean(cluster_density_scores), 5)
    
    print(f'Model density score: {model_density_score}')
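
To make the density score in `get_clusters` more concrete, here is a minimal sketch of the same calculation on toy data: the mean inner product between a cluster centre and the small fraction of vectors closest to it. Everything in it is an assumption made for illustration: the helper name `density_score`, the 2D unit vectors standing in for sentence embeddings, the `fraction=0.5` setting (the function above uses 2% of all sentences), and the `max(1, ...)` guard, which the function above does not have.

```python
import numpy as np


def density_score(centre, vectors, fraction=0.02):
    """Mean inner product between `centre` and its closest `fraction` of `vectors`."""
    # Number of closest points to average over (the max(1, ...) guard is an addition)
    n_points = max(1, int(len(vectors) * fraction))

    # Inner product of the centre with every vector
    ips = np.inner(centre, vectors)

    # Average of the n_points largest inner products
    top_ips = np.sort(ips)[-n_points:]
    return float(np.mean(top_ips))


def unit_vectors(angles_in_degrees):
    """Toy 2D unit vectors (hypothetical stand-ins for sentence embeddings)."""
    radians = np.deg2rad(angles_in_degrees)
    return np.column_stack([np.cos(radians), np.sin(radians)])


centre = np.array([1.0, 0.0])
tight = unit_vectors([0, 5, -5, 10])    # vectors bunched around the centre
loose = unit_vectors([0, 40, -40, 80])  # vectors spread away from the centre

tight_score = density_score(centre, tight, fraction=0.5)
loose_score = density_score(centre, loose, fraction=0.5)

print(f'Tight cluster density: {tight_score:.3f}')
print(f'Loose cluster density: {loose_score:.3f}')
```

The tightly bunched vectors score higher than the spread-out ones, which is the sense in which a higher density score in the output below indicates a more coherent cluster.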

In [12]:
# Load and embed sentences
sentences_1500, sentence_vectors_1500 = load_sentences(file_path_1500)
sentences_300, sentence_vectors_300 = load_sentences(file_path_300)

In [14]:
# Get cluster sentences for bigger dataset
get_clusters(sentences_1500, sentence_vectors_1500, k=6, n=8)

Cluster #1 sentences (density score: 0.437):

Sally Rooney has a really interesting way of writing, which I deeply appreciate.
i just cannot get over how well Sally Rooney writes.
I think that Sally Rooney is a fantastic writer.
I'm very happy I read Rooney's Normal People first and loved it so deeply, bc I feel certain I would actively avoid Sally Rooney if this book was the first piece of writing I read by her.
Sally Rooney is a brilliant writer, and I was really looking forward to this from reading her short fiction.
I can only write that I love it even more than "Normal people" and I can't wait for more book by Sally Rooney.
I love how Sally Rooney writes - naturally and simply.
Well-written because it’s Sally Rooney and so even her debut is brilliant.


Cluster #2 sentences (density score: 0.392):

I really just couldn't get with this book.
I enjoyed this book way more than I thought I would at the beginning.
Don’t get me wrong I did enjoy this book, but I think I expected more fr

In [15]:
# Get cluster sentences for smaller dataset
get_clusters(sentences_300, sentence_vectors_300, k=6, n=8)

Cluster #1 sentences (density score: 0.44):

i just cannot get over how well Sally Rooney writes.
I finished CONVERSATIONS WITH FRIENDS by Sally Rooney this morning and once again I am in awe of Rooney's writing.
Rooney really seems to understand the lives of her chracters.
I'm looking forward to reading anything else that Sally Rooney writes.
Sally Rooney has become one of my favorite writers.
Rooney is an excellent writer; I desperately hope she is just getting started.
Sally Rooney makes me feel like I could do anything in life as long as I wrote about it well.
I can’t wait to read whatever Sally Rooney comes out with next!


Cluster #2 sentences (density score: 0.365):

There were times I was curious to see how it would play out with Frances & Nick, as well as Frances' relationship with Bobbi.
The book follows Frances and her best friend Bobbi, who become entangled with a married couple, Nick and Melissa.
Frances and Nick end up in a relationship and the conversations between them 

Let's summarize our results. The bigger dataset's sentence clusters can be summed up as follows:

1. Fantastic writing
1. Reading experience (?)
1. Unlikeable characters
1. Plot synopsis
1. Not enjoyable
1. Thematic elements: relationships & emotions

The smaller dataset's clusters can be summed up like this:

1. Fantastic writing
1. Plot synopsis
1. Loved it
1. Unlikeable characters
1. Reading experience
1. Thematic elements: Relationships & emotions

As we can see, the two sets of results are broadly similar; there are no radical differences between the two sets of clusters. The only major difference is that the bigger dataset includes a cluster of sentences expressing dislike of the book, whereas the smaller dataset includes a cluster of sentences expressing love of the book. But this was to be expected, given the relative proportions of positive and negative reviews in the two datasets.

Given these results, we feel that the smaller dataset is preferable. Its clusters seem slightly more internally coherent, and they seem to better capture the general sentiment toward the book.