![mobydick](mobydick.jpg)

In this workspace, you'll use the Python `requests` package to scrape the novel Moby Dick from [Project Gutenberg](https://www.gutenberg.org/), a website that hosts a large corpus of books. You'll extract the text from this web data with `BeautifulSoup`, then analyze the distribution of words using the Natural Language Toolkit (`nltk`) and `Counter`.

The data science pipeline you'll build in this workspace can be used to analyze the word frequency distribution of any novel you can find on Project Gutenberg.

In [1]:
# Import packages
import requests
from bs4 import BeautifulSoup
import nltk
from collections import Counter

In [3]:
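# Fetch the HTML and build a BeautifulSoup object with an HTML parser to extract the text.

url = 'https://www.gutenberg.org/files/2701/2701-h/2701-h.htm'

# The parser name belongs to BeautifulSoup, not to requests.get
# (a second positional argument there is treated as query parameters).
response = requests.get(url)
response.raise_for_status()  # stop early on an HTTP error
response.encoding = 'utf-8'

soup = BeautifulSoup(response.text, 'html.parser')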

In [4]:
# Initialize a regex tokenizer that keeps only runs of alphanumeric characters (nltk was imported above)

tokenizer = nltk.tokenize.RegexpTokenizer(pattern=r'[a-zA-Z0-9]+')

tokens = tokenizer.tokenize(soup.text)
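
To see what the pattern `[a-zA-Z0-9]+` keeps and drops, it can help to try it on a short string first. The sketch below uses the standard-library `re.findall` as a stand-in, on the assumption that `RegexpTokenizer` in its default mode returns the pattern's matches in the same way. The sample sentence is made up for illustration and is not taken from the scraped page.

```python
import re

# Same character class as the RegexpTokenizer pattern above.
PATTERN = r'[a-zA-Z0-9]+'

# Hypothetical sample sentence, not taken from the scraped HTML.
sample = "Call me Ishmael. Don't ask--never mind how long."

# Every maximal run of letters and digits becomes one token;
# punctuation and whitespace act as separators and are discarded.
sample_tokens = re.findall(PATTERN, sample)
print(sample_tokens)
```

Note that the apostrophe splits "Don't" into two tokens, `Don` and `t`, so contractions leave short fragments in the token list.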

In [5]:
# Lowercase the tokens and remove English stop words
from nltk.corpus import stopwords

nltk.download('stopwords', quiet=True)  # fetch the stop word corpus if it is not installed yet
stops = set(stopwords.words('english'))

def words_no_stop(tokens):
    new_tokens = []
    for token in tokens:
        token = token.lower()
        if token not in stops:
            new_tokens.append(token)
    return new_tokens

tokens = words_no_stop(tokens)
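
The same filtering step can be written as a list comprehension. This is a minimal sketch and not a replacement for the cell above: `words_no_stop_compact` and `sample_stops` are hypothetical names, and the tiny stop list stands in for nltk's full English list so the example runs without any downloads.

```python
# Hypothetical miniature stop list; the notebook uses nltk's full English list.
sample_stops = {'me', 'the', 'of'}

def words_no_stop_compact(tokens, stops):
    """Lowercase each token and drop those found in the stop set."""
    return [t.lower() for t in tokens if t.lower() not in stops]

filtered = words_no_stop_compact(['Call', 'me', 'Ishmael', 'The', 'Whale'], sample_stops)
print(filtered)
```

Lowercasing before the membership test matters here: the stop set holds lowercase words, so a capitalized `The` would slip through a case-sensitive check.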

In [6]:
# initialize a Counter object and find the ten most common words

counter = Counter(tokens)

top_ten = counter.most_common(10)

print(top_ten)

[('whale', 1240), ('one', 923), ('like', 647), ('upon', 566), ('man', 530), ('ship', 519), ('ahab', 517), ('ye', 473), ('sea', 455), ('old', 450)]
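
To reuse these steps on another novel, the tokenizing, filtering, and counting can be wrapped in one function. The sketch below is an illustration under stated assumptions: `top_words` is a hypothetical helper, it uses `re.findall` in place of `RegexpTokenizer` so it depends only on the standard library, and the sample text and stop list are invented.

```python
import re
from collections import Counter

def top_words(text, stop_words, n=10):
    """Return the n most common non-stop words in text as (word, count) pairs."""
    # Keep runs of letters and digits, as in the tokenizer cell above.
    tokens = re.findall(r'[a-zA-Z0-9]+', text)
    # Lowercase each token and drop stop words.
    words = [t.lower() for t in tokens if t.lower() not in stop_words]
    return Counter(words).most_common(n)

# Hypothetical sample input; for a real book, pass soup.text and the nltk stop set.
sample_text = "The whale, the white whale! The sea and the ship and the whale and the sea."
sample_stops = {'the', 'and'}

result = top_words(sample_text, sample_stops, n=2)
print(result)
```

With the notebook's own objects, the call would look like `top_words(soup.text, stops)`, after changing `url` to point at a different Project Gutenberg HTML file.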
