# BERTerReads: Proof of concept notebook

This notebook outlines how the BERTerReads app works. The process is simple:

1. A GoodReads book URL is provided as input
1. The first page of the book's reviews is scraped live
1. The reviews are divided into their individual sentences
1. Each sentence is transformed into a 768-dimensional vector with DistilBERT
1. The set of vectors is run through a K-means clustering algorithm, dividing the sentences into 3 clusters
1. The vector closest to each cluster centre, measured by inner product, is identified
1. The sentences corresponding to these 3 vectors are displayed back to the user

### Imports

In [1]:
# Imports
import numpy as np
import pandas as pd

import requests
from bs4 import BeautifulSoup

from nltk.tokenize import sent_tokenize
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

In [2]:
# Load DistilBERT model
model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')

### 1. Retrieve URL from user

In [3]:
url = input('Input GoodReads book URL:')

Input GoodReads book URL: https://www.goodreads.com/book/show/51791252-the-vanishing-half


### 2. Scrape reviews from URL

In [4]:
def get_reviews(url):
    '''
    Function to scrape all the reviews from the first page of a GoodReads book URL
    '''

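    # Download the book page and parse its HTML
    r = requests.get(url)
    soup = BeautifulSoup(r.content, features='html.parser')

    # Select the divs whose class attribute is exactly 'reviewText stacked'
    reviews_src = soup.find_all('div', class_='reviewText stacked')

    # Keep only the text of each review
    reviews = [review.text for review in reviews_src]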

    df = pd.DataFrame(reviews, columns=['review'])
    
    return df

In [5]:
reviews_df = get_reviews(url)
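
The selector in `get_reviews` passes a two-class string to `class_`. As a minimal offline sketch (the HTML snippet below is made up for illustration and is not real GoodReads markup), this shows that Beautiful Soup compares such a string against the exact value of the `class` attribute, so a div listing the same classes in a different order is not matched. A CSS selector via `select` is shown as one possible order-independent alternative; it is not what the notebook uses.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, for illustration only
html = '''
<div class="reviewText stacked"><span>Loved it.</span></div>
<div class="stacked reviewText"><span>Not for me.</span></div>
<div class="reviewText"><span>Unrelated block.</span></div>
'''

soup = BeautifulSoup(html, features='html.parser')

# Exact match on the class attribute string, as in get_reviews
exact = [d.get_text(strip=True)
         for d in soup.find_all('div', class_='reviewText stacked')]

# CSS selector: matches divs carrying both classes, in any order
both = [d.get_text(strip=True)
        for d in soup.select('div.reviewText.stacked')]

print(exact)
print(both)
```

If the page markup ever reorders or extends these classes, the exact-string match would silently return no reviews, which is worth checking when the scraper comes back empty.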

### 3. Divide reviews into individual sentences

In [6]:
def clean_reviews(df):
    '''
    Function to clean review text and divide into individual sentences
    '''

    # Define spoiler marker & "...more" strings, and remove from all reviews
    spoiler_str_gr = '                    This review has been hidden because it contains spoilers. To view it,\n                    click here.\n\n\n'
    more_str = '\n...more\n\n'
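    # regex=False matches the strings literally, so the dots are not treated as wildcards
    df['review'] = df['review'].str.replace(spoiler_str_gr, '', regex=False)
    df['review'] = df['review'].str.replace(more_str, '', regex=False)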

    # Scraped reviews from GoodReads typically repeat the first ~500 characters
    # The following loop removes these repeated characters

    # Loop through each row in dataframe
    for i in range(len(df)):

        # Save review and review's first ~250 characters to variables
        review = df.iloc[i]['review']
        review_start = review[2:250]

        # Find the last later position where the review start occurs again
        j = review.rfind(review_start, 3)

        # If found, chop off all previous characters from review
        if j != -1:
            df.at[i, 'review'] = review[j:]

    # Replace all new line characters
    df['review'] = df['review'].str.replace('\n', ' ')

    # Append space to all sentence end characters
    # (each replacement goes through .str.replace so it acts inside the text)
    for end_char in ['.', '!', '?']:
        df['review'] = df['review'].str.replace(end_char, end_char + ' ', regex=False)

    # Initialize list to store one row per review sentence
    sentence_rows = []

    # Loop through each review
    for i in range(len(df)):

        # Save row and review to variables
        row = df.iloc[i]
        review = row.loc['review']

        # Tokenize review into sentences
        sentences = sent_tokenize(review)

        # Loop through each sentence in list of tokenized sentences
        for sentence in sentences:
            # Add row for sentence to list of rows
            new_row = row.copy()
            new_row.at['review'] = sentence
            sentence_rows.append(new_row)

    # Build sentences dataframe (DataFrame.append was removed in pandas 2.0)
    sentences_df = pd.DataFrame(sentence_rows, columns=df.columns).reset_index(drop=True)

    sentences_df.rename(columns={'review':'sentence'}, inplace=True)

    # Word-count thresholds: keep sentences with more than 5 and fewer than 50 words
    lower_thresh = 5
    upper_thresh = 50

    # Remove whitespaces at the start and end of sentences
    sentences_df['sentence'] = sentences_df['sentence'].str.strip()

    # Count the words in each sentence
    sentence_lengths = sentences_df['sentence'].str.split(' ').map(len)

    # Filter sentences
    sentences_df = sentences_df[
        (sentence_lengths > lower_thresh) & (sentence_lengths < upper_thresh)]

    sentences_df.reset_index(drop=True, inplace=True)
    
    return sentences_df['sentence']
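
To make the repeated-text removal in `clean_reviews` easier to follow, here is a minimal standalone sketch of the same idea on a plain string. The helper name `strip_repeated_prefix`, its `probe_len` parameter, and the sample text are hypothetical. Skipping the first two characters mirrors the notebook's `review[2:250]` slice, presumably to step over leading whitespace.

```python
def strip_repeated_prefix(review: str, probe_len: int = 250) -> str:
    """Drop everything before the last repeat of the review's opening text."""
    # Opening characters used as the probe (first two characters skipped)
    probe = review[2:probe_len]
    if not probe:
        return review
    # Search from index 3 so the probe cannot match its own position
    j = review.rfind(probe, 3)
    return review[j:] if j != -1 else review


# Toy input: a short preview followed by the full review text
full = "I loved this book. It was moving and sharp."
scraped = "  I loved this book. It was\n  " + full

cleaned = strip_repeated_prefix(scraped, probe_len=20)
untouched = strip_repeated_prefix("Short and sweet review text.")
print(cleaned)
```

A review whose opening text does not recur is returned unchanged, so the step is safe to apply to every row.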

In [7]:
sentences = clean_reviews(reviews_df)
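
The final filter in `clean_reviews` keeps only sentences with more than 5 and fewer than 50 space-separated words. The sketch below reproduces that rule in plain Python; `keep_mid_length` is a hypothetical helper name and the sample sentences are illustrative.

```python
def keep_mid_length(sentences, lower=5, upper=50):
    """Keep sentences whose word count lies strictly between the thresholds."""
    kept = []
    for sentence in sentences:
        stripped = sentence.strip()
        # Count words the same way as the notebook: split on single spaces
        n_words = len(stripped.split(' '))
        if lower < n_words < upper:
            kept.append(stripped)
    return kept


samples = [
    "Loved it.",                                    # 2 words: too short
    "I found this to be a beautifully written and thought-provoking book.",
    " ".join(["word"] * 50),                        # 50 words: too long
]

kept = keep_mid_length(samples)
print(kept)
```

Both bounds are exclusive, so a sentence of exactly 5 or exactly 50 words is dropped.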

### 4. Transform each sentence into a vector

In [8]:
sentence_vectors = model.encode(sentences)
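
The `mean-tokens` suffix in the model name appears to refer to mean pooling: the sentence vector is the average of the token vectors, with padding positions ignored. The sketch below illustrates that operation with NumPy on random stand-in data. It does not call the real model, and `mean_pool` is a hypothetical helper rather than part of the sentence-transformers API.

```python
import numpy as np


def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padded positions."""
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    # Guard against division by zero for an all-padding row
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts


# Stand-in data: 2 sentences, 4 token slots each, 768 dimensions per token
rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 4, 768))
mask = np.array([[1, 1, 1, 0],    # 3 real tokens, 1 padding
                 [1, 1, 0, 0]])   # 2 real tokens, 2 padding

vectors = mean_pool(tokens, mask)
print(vectors.shape)  # (2, 768)
```

Each sentence ends up as a single 768-dimensional row, which is the shape the clustering step expects.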

### 5. Cluster sentences and print sentences closest to each cluster centre

In [9]:
def get_opinions(sentences, sentence_vectors, k=3, n=1):
    '''
    Function to print the most representative sentence from each of k clusters, ordered by cluster density score
    '''
    
    # Instantiate the model
    # (n_init is set explicitly because its default changed in newer scikit-learn releases)
    kmeans_model = KMeans(n_clusters=k, random_state=24, n_init=10)

    # Fit the model
    kmeans_model.fit(sentence_vectors)
    
    # Set the number of cluster centre points to look at when calculating density score
    # (2% of the sentences, but at least 1 so the mean below is never taken over nothing)
    centre_points = max(1, int(len(sentences) * 0.02))

    # Initialize list to store each cluster's top sentence and density score
    rows = []

    # Loop through number of clusters
    for i in range(k):

        # Define cluster centre
        centre = kmeans_model.cluster_centers_[i]

        # Calculate inner product of cluster centre and sentence vectors
        ips = np.inner(centre, sentence_vectors)

        # Find the sentence with the highest inner product
        top_index = pd.Series(ips).nlargest(n).index
        top_sentence = sentences[top_index].iloc[0]

        # Density score: mean of the highest inner products with the centre
        centre_ips = pd.Series(ips).nlargest(centre_points)
        density_score = round(np.mean(centre_ips), 5)

        # Store cluster's top sentence and density score
        rows.append([top_sentence, density_score])

    # Build dataframe of cluster centre sentences (DataFrame.append was removed in pandas 2.0)
    df = pd.DataFrame(rows, columns=['sentence', 'density'])

    # Sort dataframe by density score, from highest to lowest
    df = df.sort_values(by='density', ascending=False).reset_index(drop=True)
    
    for i in range(len(df)):
        print(f"Opinion #{i+1}: {df['sentence'][i]}\n")
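
The density score used to order the opinions is the mean of the largest inner products between a cluster centre and the sentence vectors, taken over 2% of the sentences. The sketch below isolates that calculation with NumPy on toy data. The `density_score` helper and its `fraction` parameter are hypothetical names, and the `max(1, ...)` guard is an assumption of this sketch that keeps small inputs from averaging an empty selection.

```python
import numpy as np


def density_score(centre, vectors, fraction=0.02):
    """Mean of the top `fraction` of inner products with the centre."""
    ips = np.inner(centre, vectors)
    # Number of vectors to average over, never fewer than one
    m = max(1, int(len(vectors) * fraction))
    top = np.sort(ips)[-m:]
    return round(float(top.mean()), 5)


# Toy data: 100 two-dimensional vectors whose inner products are 0..99
vectors = np.column_stack([np.arange(100.0), np.zeros(100)])
centre = np.array([1.0, 0.0])

score_100 = density_score(centre, vectors)       # averages the top 2 values
score_10 = density_score(centre, vectors[:10])   # falls back to the top 1
print(score_100, score_10)
```

With fewer than 50 sentences, 2% rounds down to zero, which is the case the guard covers.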

In [10]:
get_opinions(sentences, sentence_vectors, 3)

Opinion #1: I found this to be a beautifully written and thought-provoking book.

Opinion #2: While racial identity is the core of the story, there are so many other layers here with characters that the author portrays in such a way that I got a sense of who they were, even if at times they questioned their own identities.

Opinion #3: Nearly broken from her sister’s choice to leave her, she never gives up hope of finding Stella until it’s nearly too late.
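
`get_opinions` ranks sentences by their inner product with the cluster centre rather than by Euclidean distance to it. The two criteria can pick different sentences when vector lengths vary, as this small NumPy sketch on made-up 2-D vectors shows. Whether the DistilBERT vectors differ enough in norm for this to matter in practice is not established here.

```python
import numpy as np

centre = np.array([1.0, 1.0])
vectors = np.array([
    [1.0, 1.0],    # sits exactly on the centre
    [3.0, 3.0],    # same direction, but three times longer
    [-1.0, 0.5],   # points elsewhere
])

# Criterion used in get_opinions: largest inner product with the centre
by_inner = int(np.argmax(np.inner(centre, vectors)))

# Alternative criterion: smallest Euclidean distance to the centre
by_distance = int(np.argmin(np.linalg.norm(vectors - centre, axis=1)))

print(by_inner, by_distance)  # 1 0
```

Here the inner product favours the longer vector, while the distance criterion picks the one lying on the centre.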



### Quick run

In [11]:
url = input('Input GoodReads book URL:')
reviews_df = get_reviews(url)
print('\nScraped reviews!')
sentences = clean_reviews(reviews_df)
print('Cleaned reviews!')
sentence_vectors = model.encode(sentences)
print('Embedded sentences!\n')
get_opinions(sentences, sentence_vectors, 3)

Input GoodReads book URL: https://www.goodreads.com/book/show/48570454-transcendent-kingdom



Scraped reviews!
Cleaned reviews!
Embedded sentences!

Opinion #1: This is definitely a book that I appreciate, respect, admire, more than I love.

Opinion #2: While these experiences have affected Gifty’s relationship to her faith, and she’s somewhat embarrassed when reading her old diary entries, in which she pleads for divine intervention, as an adult Gifty finds herself craving that ardor.

Opinion #3: Her brother’s addiction and her mother’s depression have irrevocably shaped Gifty, the protagonist and narrator of Transcendent Kingdom, who is now a sixth-year PhD candidate in neuroscience at Stanford.

