# BetterReads: Optimizing the Universal Sentence Encoder

This notebook is a short investigation into which version of the Universal Sentence Encoder should be used for the BetterReads algorithm.

There are two versions of the Universal Sentence Encoder currently available through TensorFlow Hub: [`universal-sentence-encoder`](https://tfhub.dev/google/universal-sentence-encoder/4) and [`universal-sentence-encoder-large`](https://tfhub.dev/google/universal-sentence-encoder-large/5). We refer to these as Encoder 4 and Encoder 5, respectively. Curiously, `universal-sentence-encoder-large` seems to be about half the size of `universal-sentence-encoder`, and Encoder 4, not 5, seems to be TensorFlow's recommended version.

We have been using Encoder 4 thus far, as it's the default option in TensorFlow Hub. But it's possible that Encoder 5 will perform better than Encoder 4, either in the quality of our model's output or in its running time. Let's see what, if any, differences we can observe.

In [1]:
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
import tensorflow_hub as hub

## Load time performance

To begin, let's compare how long it takes for each encoder to be loaded. We will time both the local and the remote load time.

In [2]:
%%time
# Loads Universal Sentence Encoder remotely, from TensorFlow Hub
use_4_embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

CPU times: user 12.8 s, sys: 3.88 s, total: 16.6 s
Wall time: 44.5 s


In [3]:
%%time
# Loads Universal Sentence Encoder locally, from downloaded module
use_4_embed = hub.load('../../Universal Sentence Encoder/module/')

CPU times: user 6.12 s, sys: 1.96 s, total: 8.08 s
Wall time: 15.4 s


In [4]:
%%time
use_5_embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder-large/5")

CPU times: user 17.7 s, sys: 3.51 s, total: 21.2 s
Wall time: 35.4 s


In [5]:
%%time
use_5_embed = hub.load('../../Universal Sentence Encoder/module-5/')

CPU times: user 13.8 s, sys: 1.76 s, total: 15.6 s
Wall time: 16.6 s


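As we can see here, the local load is slightly slower for Encoder 5 (16.6 s vs. 15.4 s wall time), whereas the remote load is slower for Encoder 4 (44.5 s vs. 35.4 s). These are strange results, but they need not be taken as tremendously important: each figure comes from a single run, remote loads depend on network conditions, and the differences are small for a one-off loading step.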
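Because a single `%%time` measurement is noisy, one option is to repeat each measurement and report the median. The following is a minimal sketch, assuming a hypothetical helper called `time_call` that is not part of the original notebook. It uses a stand-in workload so that it runs without downloading a model.

```python
import statistics
import time


def time_call(fn, repeats=3):
    """Run fn() `repeats` times and return (median, min, max) wall time in seconds.

    `time_call` is a hypothetical helper, not part of the original notebook.
    """
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations), min(durations), max(durations)


# Stand-in workload; in the notebook, fn could instead be something like
# lambda: hub.load('../../Universal Sentence Encoder/module/')
median_s, fastest_s, slowest_s = time_call(lambda: sum(range(100_000)), repeats=5)
print(f'median {median_s:.4f} s (min {fastest_s:.4f} s, max {slowest_s:.4f} s)')
```

Note that TensorFlow Hub caches downloaded modules, so repeated remote loads would mostly measure the cached path rather than the download itself.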

## Encoding time performance

Now let's compare how long it takes for each encoder to embed a dataset of sentences.

In [6]:
# Set path for processed file
file_path = 'data/32187419_conversations_with_friends_top_300.csv'

# Read in processed file as dataframe
df = pd.read_csv(file_path)

# Copy sentence column to new variable
sentences = df['sentence'].copy()

In [7]:
%%time
# Vectorize sentences
sentence_vectors_4 = use_4_embed(sentences)

CPU times: user 790 ms, sys: 467 ms, total: 1.26 s
Wall time: 1.32 s


In [8]:
%%time
sentence_vectors_5 = use_5_embed(sentences)

CPU times: user 46.6 s, sys: 10.9 s, total: 57.5 s
Wall time: 26.4 s


As we can see here, the encoding time is significantly longer with Encoder 5: about 26 s of wall time versus about 1.3 s for Encoder 4, roughly a twentyfold difference. (Indeed, when I first tried this out with a bigger dataset, encoding the sentences with Encoder 5 crashed my kernel.) This difference in performance is perhaps sufficient reason to prefer Encoder 4 over Encoder 5. But before making any decisions, let's see whether the two embeddings lead to any noticeable differences in results.
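One possible mitigation for the kernel crash on larger datasets is to embed the sentences in batches rather than in a single call, which may reduce peak memory use. Whether this would have avoided the crash was not tested here. The sketch below assumes a hypothetical helper, `embed_in_batches`, and an arbitrary `batch_size`; neither comes from the original notebook. A stand-in encoder is used so that the example runs without a model.

```python
import numpy as np


def embed_in_batches(embed_fn, sentences, batch_size=64):
    """Embed sentences in fixed-size batches and stack the results.

    `embed_fn` is any callable mapping a list of strings to a 2-D array-like
    (for example `use_5_embed`). The helper name and the default batch size
    are illustrative assumptions, not part of the original notebook.
    """
    sentences = list(sentences)
    batches = []
    for start in range(0, len(sentences), batch_size):
        batch = sentences[start:start + batch_size]
        # np.asarray converts TensorFlow tensors (or any array-like) to NumPy
        batches.append(np.asarray(embed_fn(batch)))
    return np.vstack(batches)


# Stand-in encoder: maps each sentence to a 2-dimensional vector
def fake_embed(batch):
    return [[len(s), s.count(' ')] for s in batch]


vectors = embed_in_batches(fake_embed, ['a b', 'hello', 'one two three'], batch_size=2)
print(vectors.shape)  # (3, 2)
```

In the notebook, the call would look like `embed_in_batches(use_5_embed, sentences)`, and the resulting NumPy array could be passed to `get_clusters` in place of the tensor.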

## Output performance

Lastly, let's compare the output of our model using our two different sentence embeddings.

In [10]:
def get_clusters(sentences, sentence_vectors, k, n):
    '''
    Function to extract the n most representative sentences from k clusters, with density scores
    '''
    
    # Instantiate the model
    kmeans_model = KMeans(n_clusters=k, random_state=24)

    # Fit the model
    kmeans_model.fit(sentence_vectors)

    # Set the number of cluster centre points to look at when calculating density score
    centre_points = int(len(sentences) * 0.02)
    
    # Initialize list to store mean inner product value for each cluster
    cluster_density_scores = []
    
    # Initialize list to store one row (top sentences, density score) per cluster
    rows = []

    # Loop through number of clusters
    for i in range(k):

        # Define cluster centre
        centre = kmeans_model.cluster_centers_[i]

        # Calculate inner product of cluster centre and sentence vectors
        ips = np.inner(centre, sentence_vectors)

        # Find the n sentences with the highest inner products
        top_indices = pd.Series(ips).nlargest(n).index
        top_sentences = list(sentences[top_indices])
        
        # Density score: mean inner product of the points closest to the centre
        centre_ips = pd.Series(ips).nlargest(centre_points)
        density_score = round(np.mean(centre_ips), 5)
        
        # Append the cluster density score to master list
        cluster_density_scores.append(density_score)

        # Store the cluster's top n sentences and density score
        # (DataFrame.append was removed in pandas 2.0, so rows are collected in a list)
        rows.append({'sentences': top_sentences, 'density': density_score})

    # Build dataframe of cluster centre sentences and density scores
    df = pd.DataFrame(rows, columns=['sentences', 'density'])

    # Sort dataframe by density score, from highest to lowest
    df = df.sort_values(by='density', ascending=False).reset_index(drop=True)
    
    # Loop through number of clusters selected
    for i in range(k):
        
        # Save density / similarity score & sentence list to variables
        sim_score = round(df.loc[i]["density"], 3)
        sents = df.loc[i]['sentences'].copy()
        
        print(f'Cluster #{i+1} sentences (density score: {sim_score}):\n')
        print(*sents, sep='\n')
        print('\n')
        
    model_density_score = round(np.mean(cluster_density_scores), 5)
    
    print(f'Model density score: {model_density_score}')

In [12]:
# Get cluster sentences with Encoder 4 embeddings
get_clusters(sentences, sentence_vectors_4, k=6, n=8)

Cluster #1 sentences (density score: 0.44):

i just cannot get over how well Sally Rooney writes.
I finished CONVERSATIONS WITH FRIENDS by Sally Rooney this morning and once again I am in awe of Rooney's writing.
Rooney really seems to understand the lives of her chracters.
I'm looking forward to reading anything else that Sally Rooney writes.
Sally Rooney has become one of my favorite writers.
Rooney is an excellent writer; I desperately hope she is just getting started.
Sally Rooney makes me feel like I could do anything in life as long as I wrote about it well.
I can’t wait to read whatever Sally Rooney comes out with next!


Cluster #2 sentences (density score: 0.365):

There were times I was curious to see how it would play out with Frances & Nick, as well as Frances' relationship with Bobbi.
The book follows Frances and her best friend Bobbi, who become entangled with a married couple, Nick and Melissa.
Frances and Nick end up in a relationship and the conversations between them 

In [14]:
# Get cluster sentences with Encoder 5 embeddings
get_clusters(sentences, sentence_vectors_5, k=6, n=8)

Cluster #1 sentences (density score: 0.491):

Rooney is brilliant and her writing so precise.
One last thing: I am always so impressed by the way Rooney ends her novels, especially this particular one.
I really love Rooney her writing style.
I’m so impressed by the way Rooney brings her characters interior worlds into contact with the outside world, gently manipulating her readers’ expectations and not always giving the reader what they want.
I really admire Sally Rooney’s writing.
Rooney is an excellent writer; I desperately hope she is just getting started.
I think I've fallen in love with Rooney's writing style and the way she describes feelings.
I finished CONVERSATIONS WITH FRIENDS by Sally Rooney this morning and once again I am in awe of Rooney's writing.


Cluster #2 sentences (density score: 0.449):

Frances is wonderfully specific, her self-perception as unemotional/flat/lacking a personality developed in a way that somehow feels both novel and relatable.
Frances was such an 

Here we can see that the two sentence embeddings do indeed lead to different output. It is hard to compare these two sets of results to one another, but our intuitive judgment is that Encoder 5's results are slightly preferable: they include a higher proportion of meaningful clusters, and the individual clusters exhibit a higher degree of similarity.
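To supplement this intuitive judgment, cluster cohesion could also be measured numerically. The sketch below computes the mean cosine similarity between each vector and the centroid of its cluster. The helper `mean_centroid_cosine` is hypothetical and not part of the original notebook, and this is a simple illustrative measure rather than a standard metric (scikit-learn's `silhouette_score` would be a more established alternative). Toy data is used so that the example runs on its own.

```python
import numpy as np


def mean_centroid_cosine(vectors, labels):
    """Mean cosine similarity between each vector and its cluster centroid.

    Hypothetical helper, not part of the original notebook. Values closer
    to 1 indicate tighter clusters. Assumes no vector is all zeros.
    """
    vectors = np.asarray(vectors, dtype=float)
    labels = np.asarray(labels)
    # Normalize rows to unit length so inner products are cosine similarities
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    similarities = np.empty(len(unit))
    for label in np.unique(labels):
        mask = labels == label
        centroid = unit[mask].mean(axis=0)
        centroid /= np.linalg.norm(centroid)
        similarities[mask] = unit[mask] @ centroid
    return float(similarities.mean())


# Toy data: two tight groups pointing in different directions
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_labels = [0, 0, 1, 1]
print(round(mean_centroid_cosine(toy_vectors, toy_labels), 3))
```

In the notebook, the labels would come from the fitted model's `labels_` attribute, which assumes `get_clusters` is changed to return or expose the fitted `KMeans` model. Scores from two different encoders live in different embedding spaces, so any comparison between them should be read with caution.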

But perhaps these differences are just due to some peculiarities of this particular dataset. Let's try out the two encoders on a different dataset and see how those results compare.

In [15]:
# Set path for processed file
file_path = 'data/19161852_the_fifth_season.csv'

# Read in processed file as dataframe
df = pd.read_csv(file_path)

# Copy sentence column to new variable
sentences = df['sentence'].copy()

In [16]:
%%time
# Vectorize sentences
sentence_vectors_4 = use_4_embed(sentences)

CPU times: user 638 ms, sys: 1.38 s, total: 2.02 s
Wall time: 3.41 s


In [17]:
%%time
sentence_vectors_5 = use_5_embed(sentences)

CPU times: user 1min 8s, sys: 19.5 s, total: 1min 28s
Wall time: 35.4 s


In [18]:
# Get cluster sentences with Encoder 4 embeddings
get_clusters(sentences, sentence_vectors_4, k=6, n=8)

Cluster #1 sentences (density score: 0.378):

I liked the overall story we got from this book but I honestly think I’m just going to leave it at one book.
This book easily is now one of my favorite books of all time.
I feel like I may need to reread this book in order to properly review it, but wow I'm kind of blown away still.
I heard all of the praises for this book but for some reason I didn’t pick it up for the longest time.
I’ll certainly be reading book two, and I hope it resonates with me more than this first book.
I finally managed to read this book, giving myself a break from the huge pile of forthcoming books that I'm supposed to be reading, and it was even better than I had hoped.
Even as I read the book there were so many things about this book that were so well done.
What was different with The Fifth Season was that the instant I read the blurb on goodreads I knew that this book was something that I would enjoy immensely and I had some ridiculously stupid expectation.


Cl

In [19]:
# Get cluster sentences with Encoder 5 embeddings
get_clusters(sentences, sentence_vectors_5, k=6, n=8)

Cluster #1 sentences (density score: 0.498):

The Fifth Season is a beautifully constructed, masterfully executed work of fiction that everyone needs to read.
The Fifth Season is pure originality and the best way I've ever seen someone deal with the fantasy genre.
The Fifth Season is unlike anything else I have ever read.
The Fifth Season was unlike anything I've ever read, and I mean that in the best way possible!
The Fifth Season is unlike anything I've read in a long time.
The Fifth Season is the most unique book I've read in quite a while.
The Fifth Season is a fascinating and intelligent fantasy novel, with exquisite world-building and such interesting characters.
”One of the reasons why The Fifth Season has such compelling narratives is Jemisin’s jaw-dropping world-building.


Cluster #2 sentences (density score: 0.396):

I finally managed to read this book, giving myself a break from the huge pile of forthcoming books that I'm supposed to be reading, and it was even better than 

Here again we would give the slight advantage to Encoder 5: its clusters are, overall, slightly more meaningful and slightly more internally cohesive.

## Conclusions

The two available versions of the Universal Sentence Encoder do indeed show differences in performance: Encoder 4 is far faster at embedding sentences (roughly 1 to 3 s versus 26 to 35 s of wall time on our two datasets), while load times are comparable and show no consistent winner. Encoder 5, however, seems to perform slightly better in terms of results.

Nonetheless, Encoder 5's significantly greater embedding time makes it practically infeasible for the purposes of our algorithm. Thus we will continue to build our algorithm using Encoder 4.