#1. Tools for text processing

What are the most frequent words in F. Scott Fitzgerald's novel, The Great Gatsby, and how often do they occur?

In this notebook, we'll scrape the novel The Great Gatsby from the website Project Gutenberg (which contains a large corpus of books) using the Python package requests. Then we'll extract words from this web data using BeautifulSoup. Finally, we'll analyze the distribution of words using the Natural Language Toolkit (nltk) and Counter.

The data science pipeline we'll build in this notebook can be used to count word frequencies in any novel that you can find on Project Gutenberg. The natural language processing tools used here apply to much of the data that data scientists encounter, since a large proportion of the world's data is unstructured and includes a great deal of text.

Let's start by loading the Python packages we are going to use: requests, BeautifulSoup, nltk, and Counter.

In [17]:
# Importing requests, BeautifulSoup, nltk, and Counter
import requests
from bs4 import BeautifulSoup
import nltk
from collections import Counter

#2. Request The Great Gatsby
To analyze The Great Gatsby, we first need to get its contents from somewhere. Luckily, the text is freely available online at Project Gutenberg as an HTML file: https://www.gutenberg.org/cache/epub/64317/pg64317-images.html.

Note that HTML stands for Hypertext Markup Language and is the standard markup language for the web.

To fetch the HTML file with The Great Gatsby, we're going to use the requests package to make a GET request for the page, which means we're getting data from it. This is what your browser does when you visit a webpage, but here we get the requested page directly into Python instead.

In [18]:
# Getting The Great Gatsby HTML  
r = requests.get('https://www.gutenberg.org/cache/epub/64317/pg64317-images.html')

# Setting the correct text encoding of the HTML page
r.encoding = 'utf-8'

# Extracting the HTML from the request object
html = r.text

# Printing the first 2000 characters in html
print(html[0:2000])

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"><style>
#pg-header div, #pg-footer div {
    all: initial;
    display: block;
    margin-top: 1em;
    margin-bottom: 1em;
    margin-left: 2em;
}
#pg-footer div.agate {
    font-size: 90%;
    margin-top: 0;
    margin-bottom: 0;
    text-align: center;
}
#pg-footer li {
    all: initial;
    display: block;
    margin-top: 1em;
    margin-bottom: 1em;
    text-indent: -0.6em;
}
#pg-footer div.secthead {
    font-size: 110%;
    font-weight: bold;
}
#pg-footer #project-gutenberg-license {
    font-size: 110%;
    margin-top: 0;
    margin-bottom: 0;
    text-align: center;
}
#pg-header-heading {
    all: inherit;
    text-align: center;
    font-size: 120%;
    font-weight:bold;
}
#pg-footer-heading {
    all: inherit;
    text-align: center;
    font-size: 120%;
    font-weight: normal;
    margin-top: 0;
    margin-bottom: 0;
}
#pg-header #pg-machine-header p {
    text-in
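
To see why the encoding line matters, here is a small sketch that needs no network access. It encodes a phrase from the novel containing a curly apostrophe as UTF-8 and then decodes the same bytes two ways. The variable names are illustrative, and ISO-8859-1 is used as the example of a wrong guess because requests can fall back to it when a text response does not declare a charset.

```python
# A phrase from the novel with a curly apostrophe (U+2019)
phrase = "Ten o’clock"

# The bytes a server would send for a UTF-8 page
raw = phrase.encode("utf-8")

# Decoding with the right encoding recovers the phrase
right = raw.decode("utf-8")

# Decoding with the wrong encoding turns the apostrophe into three stray characters
wrong = raw.decode("iso-8859-1")

print(right == phrase)           # the correct decoding round-trips
print(wrong == phrase)           # the wrong decoding does not
print(len(phrase), len(wrong))   # the garbled version is longer
```

Garbled characters like these would later be split into separate tokens, so it is worth setting the encoding before extracting any text.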

#3. Get the text from the HTML
This HTML is not quite what we want. However, it does contain what we want: the text of The Great Gatsby. What we need to do now is wrangle this HTML to extract the text of the novel. For this we'll use the package BeautifulSoup.

Firstly, a word on the name of the package: Beautiful Soup? In web development, the term "tag soup" refers to structurally or syntactically incorrect HTML code written for a web page. What Beautiful Soup does best is to make tag soup beautiful again and to extract information from it with ease! In fact, the main object created and queried when using this package is called BeautifulSoup.

In [19]:
# Creating a BeautifulSoup object from the HTML
soup = BeautifulSoup(html, "html.parser")

# Getting the text out of the soup
text = soup.get_text()

# Printing out text between characters 32000 and 34000
print(text[32000:34000])

r lovely face, as if she had asserted her membership in a rather distinguished secret society to which she and Tom belonged.



Inside, the crimson room bloomed with light. Tom and Miss Baker sat at either end of the long couch and she read aloud to him from the Saturday Evening Post—the words, murmurous and uninflected, running together in a soothing tune. The lamplight, bright on his boots and dull on the autumn-leaf yellow of her hair, glinted along the paper as she turned a page with a flutter of slender muscles in her arms.


When we came in she held us silent for a moment with a lifted hand.


“To be continued,” she said, tossing the magazine on the table, “in our very next issue.”


Her body asserted itself with a restless movement of her knee, and she stood up.


“Ten o’clock,” she remarked, apparently finding the time on the ceiling. “Time for this good girl to go to bed.”


“Jordan’s going to play in the tournament tomorrow,” explained Daisy, “over at Westchester.

#4. Extract the words

We now have the text of the novel! There is some unwanted material at the start (such as the page's style rules and the Project Gutenberg header) and at the end (the Project Gutenberg license). We could remove it, but it is so small compared with the text of The Great Gatsby that, to a first approximation, it is okay to leave it in.
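
If you did want to trim the extra material, one possible approach is to keep only the text between Project Gutenberg's start and end markers. The sketch below is an assumption-laden illustration: the marker strings and the helper name strip_gutenberg_boilerplate are hypothetical and should be checked against the actual text before use. It runs on a tiny made-up sample rather than the real page.

```python
# Assumed marker strings; verify them against the real text before relying on them
START_MARKER = "*** START OF THE PROJECT GUTENBERG EBOOK"
END_MARKER = "*** END OF THE PROJECT GUTENBERG EBOOK"

def strip_gutenberg_boilerplate(text):
    """Return the text between the start and end markers, or text unchanged."""
    start = text.find(START_MARKER)
    end = text.find(END_MARKER)
    if start == -1 or end == -1 or end <= start:
        # Markers missing or out of order: leave the text alone rather than guess
        return text
    # Skip the rest of the line that contains the start marker
    line_end = text.find("\n", start)
    if line_end == -1 or line_end > end:
        return text
    return text[line_end + 1:end].strip()

# A tiny stand-in for the real page text
sample = (
    "pg-header div { display: block; }\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK THE GREAT GATSBY ***\n"
    "In my younger and more vulnerable years\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK THE GREAT GATSBY ***\n"
    "License text\n"
)
body = strip_gutenberg_boilerplate(sample)
print(body)
```

Returning the input unchanged when the markers are not found keeps the rest of the pipeline working even if the page layout differs from what is assumed here.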

Now that we have the text of interest, it's time to count how many times each word appears, and for this we'll use nltk – the Natural Language Toolkit. We'll start by tokenizing the text, that is, remove everything that isn't a word (whitespace, punctuation, etc.) and then split the text into a list of words.

In [20]:
# Creating a tokenizer
# A raw string keeps the backslash in \w from being read as a string escape
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')

# Tokenizing the text
tokens = tokenizer.tokenize(text)

# Printing out the first 8 words / tokens 
tokens[0:8]

['pg', 'header', 'div', 'pg', 'footer', 'div', 'all', 'initial']
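
As a rough illustration of what the pattern \w+ keeps, the sketch below uses re.findall from the standard library as a stand-in for the nltk tokenizer. That substitution is an assumption made so the example runs without nltk; with this pattern the two should produce the same tokens. The sentence is taken from the excerpt printed earlier, and the variable names are illustrative.

```python
import re

# A sentence from the novel with curly quotes and a curly apostrophe
sentence = "“Ten o’clock,” she remarked"

# Keep only runs of word characters (letters, digits, underscore)
sample_tokens = re.findall(r"\w+", sentence)

print(sample_tokens)  # → ['Ten', 'o', 'clock', 'she', 'remarked']
```

Note that the apostrophe is treated as a separator, so "o’clock" becomes two tokens. For a word count this is usually acceptable, but it does mean fragments such as 's' or 't' can show up in the token list.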

#5. Make the words lowercase
OK! We're nearly there. Note that the same word can be capitalized in one place and not in another: for example, 'The' at the start of a sentence and 'the' in the middle of one. Both should be counted as the same word. For this reason, we should build a list of all words in The Great Gatsby in which all capital letters have been made lower case.

In [21]:
# Create a list called words containing all tokens transformed to lower-case
words = [token.lower() for token in tokens]

# Printing out the first 8 words / tokens 
words[:8]

['pg', 'header', 'div', 'pg', 'footer', 'div', 'all', 'initial']
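
The effect of lowercasing is easiest to see on a handful of tokens. This is a minimal sketch on made-up sample data (the list sample_tokens is hypothetical, not taken from the notebook's output). It compares the number of distinct tokens before and after lowercasing.

```python
# Hypothetical tokens that differ only in capitalization
sample_tokens = ["The", "the", "Gatsby", "gatsby", "THE"]

# Lowercase every token, as in the cell above
sample_words = [token.lower() for token in sample_tokens]

# Count distinct entries before and after
print(len(set(sample_tokens)), len(set(sample_words)))  # → 5 2
```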

#6. Load in stop words
It is common practice to remove words that appear a lot in the English language such as 'the', 'of' and 'a' because they're not so interesting. Such words are known as stop words. The package nltk includes a good list of stop words in English that we can use.

In [22]:
# Getting the English stop words from nltk
sw = nltk.corpus.stopwords.words('english')

# Printing out the first fifteen stop words
sw[:15]

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours']

#7. Remove stop words in The Great Gatsby
We now want to create a new list with all words in The Great Gatsby, except those that are stop words (that is, those words listed in sw).

In [23]:
# Create a list words_ns containing all words that are in words but not in sw
words_ns = [word for word in words if word not in sw]

# Printing the first 15 words_ns to check that stop words are gone
words_ns[:15]

['pg',
 'header',
 'div',
 'pg',
 'footer',
 'div',
 'initial',
 'display',
 'block',
 'margin',
 'top',
 '1em',
 'margin',
 'bottom',
 '1em']
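
The filter above checks each word against a list, which scans the list every time. A common variation, sketched here on toy data, is to convert the stop words to a set first so each membership test is fast. The names sample_sw and sample_words are hypothetical stand-ins for sw and words, and the short stop word list is made up for the example.

```python
# Hypothetical stand-ins for the real stop word list and word list
sample_sw = ["the", "of", "a", "and"]
sample_words = ["the", "green", "light", "and", "the", "dock"]

# A set gives constant-time membership tests on average
sw_set = set(sample_sw)

# Same list comprehension as above, but checking against the set
filtered = [word for word in sample_words if word not in sw_set]

print(filtered)  # → ['green', 'light', 'dock']
```

The result is the same as filtering against the list; only the lookup cost changes, which can be noticeable on a full novel.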

#8. We have the answer
Our original question was:

What are the most frequent words in F. Scott Fitzgerald's novel The Great Gatsby and how often do they occur?

We are now ready to answer that! Let's answer this question using the Counter class we imported earlier.

In [24]:
# Initialize a Counter object from our processed list of words
count = Counter(words_ns)

# Store 10 most common words and their counts as top_ten
top_ten = count.most_common(10)

# Print the top ten words and their counts
print(top_ten)

[('gatsby', 268), ('said', 235), ('tom', 191), ('daisy', 186), ('one', 154), ('like', 122), ('man', 114), ('back', 109), ('came', 108), ('little', 103)]
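
For readers new to Counter, here is a small sketch of how most_common behaves, using a made-up list of words rather than the novel. It returns (word, count) pairs sorted from most to least frequent, so the first pair holds the most frequent word. The names sample, sample_count, and top_two are illustrative.

```python
from collections import Counter

# A made-up list of already-processed words
sample = ["gatsby", "said", "gatsby", "tom", "said", "gatsby"]

# Count how many times each word occurs
sample_count = Counter(sample)

# The two most frequent words with their counts
top_two = sample_count.most_common(2)
print(top_two)  # → [('gatsby', 3), ('said', 2)]

# Each entry is a (word, count) pair, so it can be unpacked directly
word, n = top_two[0]
print(word, n)  # → gatsby 3
```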


#9. The most common word
Nice! Using our variable top_ten, we now have an answer to our original question.

The natural language processing skills we used in this notebook also apply to much of the data that data scientists encounter, since a large proportion of the world's data is unstructured and includes a great deal of text.

So, which word turned out, not surprisingly, to be the most common in The Great Gatsby?

In [25]:
# What's the most common word in The Great Gatsby?
# top_ten holds (word, count) pairs, most frequent first
most_common_word = top_ten[0][0]
print(most_common_word)

gatsby
