![mobydick](mobydick.jpg)

In this workspace, you'll scrape the novel Moby Dick from [Project Gutenberg](https://www.gutenberg.org/), a website that hosts a large corpus of books, using the Python `requests` package. You'll extract the text from this web data with `BeautifulSoup`, split it into words with the Natural Language Toolkit (`nltk`), and analyze the distribution of those words with `Counter`.

The data science pipeline you'll build in this workspace can be reused to analyze the word frequency distribution of any novel you can find on Project Gutenberg.

In [39]:
# Import and download packages
import requests
from bs4 import BeautifulSoup
import nltk
from collections import Counter
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/repl/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [40]:
# URL of the Moby Dick HTML file to scrape
url = "https://s3.amazonaws.com/assets.datacamp.com/production/project_147/datasets/2701-h.htm"

# Request the HTML file and stop early if the download failed
r = requests.get(url)
r.raise_for_status()

# Decode the response body as UTF-8
r.encoding = "utf-8"

In [41]:
# Parse the HTML content with BeautifulSoup
html = r.text
html_soup = BeautifulSoup(html, 'html.parser')

# Print the first 500 characters of the page text
print(html_soup.get_text()[:500])






      Moby Dick; Or the Whale, by Herman Melville
    





The Project Gutenberg EBook of Moby Dick; or The Whale, by Herman Melville

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org


Title: Moby Dick; or The Whale

Author: Herman Melville

Release Date: December 25, 2008 [EB
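
The page title appears at the top of this output because `get_text()` returns every text node in the document, including the contents of `<title>`. A minimal sketch on a made-up HTML snippet illustrates this; `sample_html` is hypothetical and not taken from the Moby Dick file:

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page, used only for illustration
sample_html = (
    "<html><head><title>Sample Title</title></head>"
    "<body><h1>Chapter 1</h1><p>Call me <b>Ishmael</b>.</p></body></html>"
)

sample_soup = BeautifulSoup(sample_html, "html.parser")

# get_text() joins all text nodes and drops the tags themselves;
# separator and strip keep the pieces from running together
sample_text = sample_soup.get_text(separator=" ", strip=True)
print(sample_text)

# The title is also available on its own
print(sample_soup.title.string)
```

Because the title and the license header are part of the extracted text, their words are counted along with the words of the novel.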


In [42]:
# Extract the text from the HTML
moby_text = html_soup.get_text()

# Tokenize with the pattern \w+, which keeps runs of letters, digits, and underscores and drops punctuation
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(moby_text)

# Print the first 20 tokens
print(tokens[:20])

['Moby', 'Dick', 'Or', 'the', 'Whale', 'by', 'Herman', 'Melville', 'The', 'Project', 'Gutenberg', 'EBook', 'of', 'Moby', 'Dick', 'or', 'The', 'Whale', 'by', 'Herman']
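
The pattern `\w+` matches runs of letters, digits, and underscores, so punctuation acts as a separator and contractions or hyphenated words are split apart. The sketch below reproduces this with the standard library's `re.findall`, on the assumption that it behaves like `RegexpTokenizer(r'\w+')` for this pattern; the sample sentence is made up:

```python
import re

# Made-up sample sentence, not taken from the novel
sample = "Call me Ishmael. Don't ye know the white-whale?"

# Each maximal run of word characters becomes one token
sample_tokens = re.findall(r"\w+", sample)
print(sample_tokens)
# ['Call', 'me', 'Ishmael', 'Don', 't', 'ye', 'know', 'the', 'white', 'whale']
```

Note that `Don't` becomes `Don` and `t`, and `white-whale` becomes two separate tokens.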


In [43]:
# Convert tokens to lowercase
words = [token.lower() for token in tokens]

# Load English stop words as a set for fast membership tests
stop_words = set(nltk.corpus.stopwords.words('english'))

# Remove stop words
words_no_stop = [word for word in words if word not in stop_words]

# Print the first 20 words after stop word removal
print(words_no_stop[:20])

['moby', 'dick', 'whale', 'herman', 'melville', 'project', 'gutenberg', 'ebook', 'moby', 'dick', 'whale', 'herman', 'melville', 'ebook', 'use', 'anyone', 'anywhere', 'cost', 'almost', 'restrictions']
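
Stop word removal is one membership test per word, so for a text of this size the container type matters: a `set` gives constant-time lookups on average, while a list is scanned on every check. A minimal sketch with a tiny hand-written stop word set, where `demo_stop_words` is a hypothetical stand-in for the NLTK list:

```python
# Hypothetical stand-in for nltk.corpus.stopwords.words('english')
demo_stop_words = {"the", "of", "or", "by", "is", "for"}

# The first eight tokens from the tokenizer output above
demo_tokens = ["Moby", "Dick", "Or", "the", "Whale", "by", "Herman", "Melville"]

# Lowercase first so that "Or" and "or" are treated as the same word
demo_words = [token.lower() for token in demo_tokens]
demo_filtered = [word for word in demo_words if word not in demo_stop_words]
print(demo_filtered)
# ['moby', 'dick', 'whale', 'herman', 'melville']
```

Lowercasing before filtering matters here because the stop words in this sketch are all lowercase, so a capitalized `Or` would otherwise slip through.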


In [44]:
# Initialize the Counter object to count word frequencies
count = Counter(words_no_stop)

# Get the 10 most common words
top_ten = count.most_common(10)

# Print the top 10 most common words
print(top_ten)

[('whale', 1246), ('one', 925), ('like', 647), ('upon', 568), ('man', 527), ('ship', 519), ('ahab', 517), ('ye', 473), ('sea', 455), ('old', 452)]
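
To reuse the counting steps on another book, the tokenize, lowercase, filter, and count stages can be wrapped in one function. The sketch below uses only the standard library and rests on stated assumptions: `top_words` is a hypothetical helper, `re.findall` stands in for the NLTK tokenizer, and the stop word set is supplied by the caller.

```python
import re
from collections import Counter


def top_words(text, stop_words, n=10):
    """Return the n most common non-stop words in text as (word, count) pairs."""
    # Tokenize on runs of word characters and normalize case
    words = (token.lower() for token in re.findall(r"\w+", text))
    # Count only the words that are not stop words
    counts = Counter(word for word in words if word not in stop_words)
    return counts.most_common(n)


# Made-up sample text, not taken from the novel
sample_text = "The whale, the ship and the sea. The whale and the ship. The whale!"
print(top_words(sample_text, {"the", "and"}, n=2))
# [('whale', 3), ('ship', 2)]
```

With the variables from this notebook, the call would look like `top_words(moby_text, set(stop_words))`.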
