An interactive app that rapidly extracts the most commonly expressed opinions across all of a book's reviews.
BetterReads is a Python-based app that uses natural language processing and unsupervised machine learning to distill thousands of reviews of a particular book down to their most commonly expressed opinions. It accomplishes this by dividing each review into its individual sentences (using NLTK), transforming each sentence into a 512-dimensional vector (using the Universal Sentence Encoder), finding the densest regions in the resulting vector space (using k-means clustering), and extracting the sentences closest to the centre of each region (see the sketch after the links below). To learn more...
- Visit the app: bit.ly/betterreads
- Read the blog post: bit.ly/whatisbetterreads
- Watch a video walkthrough: youtu.be/ojkM4JGT93k
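For a concrete sense of how these steps fit together, here is a minimal, illustrative sketch of the pipeline in plain Python. It is not the repo's actual implementation: the tfhub.dev model URL, cluster count, and function name are assumptions, and the real code (in src/ and the notebooks) handles cleaning, optimization, and much more.

```python
# Minimal, illustrative sketch of the BetterReads pipeline (not the
# repo's actual implementation). Assumes Universal Sentence Encoder v4
# from tfhub.dev and a fixed cluster count.
import nltk
import tensorflow_hub as hub
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

nltk.download("punkt", quiet=True)

def most_common_opinions(reviews, n_opinions=5):
    # 1. Divide each review into its individual sentences (NLTK).
    sentences = [s for review in reviews for s in nltk.sent_tokenize(review)]

    # 2. Transform each sentence into a 512-dimensional vector (USE).
    embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
    vectors = embed(sentences).numpy()

    # 3. Find the densest regions of the vector space (k-means).
    kmeans = KMeans(n_clusters=n_opinions, random_state=42).fit(vectors)

    # 4. Extract the sentence closest to the centre of each region.
    closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, vectors)
    return [sentences[i] for i in closest]
```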
Detailed explanations of how the algorithm was built and the exploratory analysis involved are presented in the included Jupyter notebooks.
- 01_modelling_with_use: Establishes proof of concept of the BetterReads algorithm
- 02_optimizing_kmeans: Explores how to optimize the k-means clustering model
- 03_optimizing_reviews: Explores how many reviews are needed to obtain meaningful results
- 04_optimizing_goodreads: Explores how to tune the model on GoodReadsReviewsScraper data
- 05_optimizing_use: Compares performance between versions of the Universal Sentence Encoder
- 06_visualizing_results: Creates some simple visualizations of our model
After cloning the repo and creating a virtual environment with the packages listed in requirements.txt, run the setup.sh script to create the expected file directories and download the required data. Note that not all datasets can be downloaded with this script, so be sure to follow the instructions in the script's comments.
Scripts to clean, process, and transform the data are included under src. All scripts should be run from the root directory to preserve file directory references.
The web app script can be run locally with the following command:
streamlit run src/05_web_app/betterreads.py
Note that the web app expects to find sentence datasets in src/05_web_app/data and book cover images in src/05_web_app/book_covers, and downloads sentence embeddings from a private S3 bucket. The sentence datasets are identical to the files written to data/03_processed, and the sentence vectors are identical to the files written to data/05_model_output. The app's code can easily be rewritten to load these files locally, and the book cover images can be commented out.
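As a hypothetical example, swapping the S3 download for local files might look like the sketch below. The file names and formats here are assumptions; match them to whatever process_reviews.py and embed_sentences.py actually write.

```python
# Hypothetical sketch of loading one book's data locally instead of
# from S3. File names and formats are assumptions; adjust to match
# the output of process_reviews.py and embed_sentences.py.
import numpy as np
import pandas as pd

book_id = "example_book"  # placeholder identifier

# Sentence dataset, identical to the files written to data/03_processed
sentences = pd.read_csv(f"data/03_processed/{book_id}.csv")

# Sentence embeddings, identical to the files written to data/05_model_output
embeddings = np.load(f"data/05_model_output/{book_id}.npy")
```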
Data for the BetterReads algorithm comes from two sources:
- UCSD Book Graph: full text of 15.7 million GoodReads reviews, of 2M books from 465K users, scraped in late 2017
- GoodReadsReviewsScraper: self-made web-scraping script that, given any GoodReads book URL (or list of URLs), downloads the full text of the book's top 1,500 reviews
Instructions for downloading the UCSD Book Graph data are included in setup.sh. Data generated with the scraper should be copied to data/01_raw/goodreads/.
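For reference, here is a hedged sketch of streaming reviews out of the UCSD dump without loading it all into memory. It assumes the gzipped JSON-lines layout of the public release; the file name and the field names (book_id, review_text) are assumptions to verify against the dataset's documentation.

```python
# Hedged sketch: stream reviews from the UCSD Book Graph dump one at a
# time. The file name and the book_id / review_text field names are
# assumptions based on the public release.
import gzip
import json

def iter_reviews(path="data/01_raw/ucsd/goodreads_reviews_dedup.json.gz"):
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            yield record["book_id"], record["review_text"]
```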
├── setup.sh <- Script to download data files
│
├── requirements.txt <- Required Python packages
│
├── data <- Data directory (created in setup.sh)
│ ├── 01_raw <- Raw data from UCSD Book Graph / GoodReadsReviewsScraper
│ ├── 02_intermediate <- Cleaned UCSD book review datasets
│ ├── 03_processed <- Processed book review datasets
│ ├── 04_models <- Universal Sentence Encoder (installed in setup.sh)
│ └── 05_model_output <- Sentence embeddings
│
├── notebooks <- Jupyter notebooks detailing analysis
│
└── src <- All scripts to be run from root directory
├── 01_data <- Splits UCSD data into smaller chunks
│ └── split_ucsd_files.sh
│
├── 02_intermediate <- Cleans UCSD data
│ ├── ucsd_01_compile_book_list.py
│ ├── ucsd_02_compile_master_list.py
│ ├── ucsd_03_filter_reviews.py
│ └── ucsd_04_write_book_files.py
│
├── 03_processing <- Processes review datasets into sentence datasets
│ ├── process_reviews.py
│ └── process_utils.py
│
├── 04_modelling <- Embeds sentence datasets as vectors
│ └── embed_sentences.py
│
└── 05_web_app <- Runs Streamlit web app (bit.ly/betterreads)
└── betterreads.py
Special thanks to Mengting Wan and Julian McAuley of UCSD, for making the Book Graph dataset freely available online; see their papers "Item Recommendation on Monotonic Behavior Chains" and "Fine-Grained Spoiler Detection from Large-Scale Review Corpora".
Kushal Chauhan's blog post on "Unsupervised Text Summarization using Sentence Embeddings" was a huge help and a big inspiration early on.
The BetterReads app was made as part of my capstone for the Data Science Diploma Program at BrainStation. Many many thanks to my educators: Myles Harrison, Adam Thorstenstein, Govind Suresh, Patrick Min, and Daria Aza. Thanks also to all my fantastic classmates!