![mobydick](mobydick.jpg)

In this workspace, you'll scrape the novel *Moby Dick* from the website [Project Gutenberg](https://www.gutenberg.org/) (which contains a large corpus of books) using the Python `requests` package. You'll extract the text from this web data using `BeautifulSoup`, then analyze the distribution of words using the Natural Language Toolkit (`nltk`) and `Counter`.

The data science pipeline you'll build in this workspace can be used to examine the word frequency distribution of any novel you can find on Project Gutenberg.

In [18]:
# Import and download packages
import requests
from bs4 import BeautifulSoup
import nltk
from collections import Counter
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to /home/repl/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [22]:
# Import libraries
import requests
from bs4 import BeautifulSoup
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from collections import Counter
import nltk

# Download the NLTK stop word list (only needed once)
nltk.download('stopwords')

# 1. Request the HTML file (object name: r)
url = "https://s3.amazonaws.com/assets.datacamp.com/production/project_147/datasets/2701-h.htm"
r = requests.get(url)
r.raise_for_status()  # Stop early if the download failed
r.encoding = 'utf-8'
html = r.text

# 2. Create the BeautifulSoup object (object name: html_soup)
html_soup = BeautifulSoup(html, "html.parser")

# 3. Extract the text (object name: moby_text)
moby_text = html_soup.get_text()

# 4. Create a regular-expression tokenizer that keeps runs of word characters
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(moby_text)

# 5. Convert to lowercase (object name: words)
words = [word.lower() for word in tokens]

# 6. Remove stop words (object name: words_no_stop)
# A set makes each membership test constant-time instead of a list scan
stop_words = set(stopwords.words('english'))
words_no_stop = [
    word for word in words
    if word not in stop_words
]

# 7. Create the Counter object (object name: count); important
count = Counter(words_no_stop)

# 8. Extract the 10 most frequent words
top_ten = count.most_common(10)

# Print the result
print(top_ten)

## Results Interpretation

The most frequent words in *Moby Dick* reveal the main themes and focus of the novel.

Words such as **"whale"** (1,246 occurrences), **"ship"** (519), and **"sea"** (455) all appear in the top ten, indicating that the story revolves around maritime life and the pursuit of the whale.

The word **"ahab"** appears 517 times, nearly as often as **"ship"**, which reflects the central role of Captain Ahab in the narrative.  
This is consistent with reading the novel as not only an adventure story but also a psychological exploration of obsession and revenge, although word counts alone cannot establish that reading.

## Analytical Insight

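By removing stop words and counting word frequencies, we can focus on terms that carry thematic significance.  
This approach helps identify candidate motifs without manually reading the entire text. It is not a perfect filter: generic words such as "one", "like", and "upon" still appear in the top ten because they are not in the stop word list that was used.

For example, the high counts for words related to the sea and the whale reflect the novel’s maritime setting; broader themes such as nature, fate, and human struggle are interpretations that need a closer reading than frequency counts provide.  
Such text analysis techniques are useful in literary studies, content analysis, and large-scale document processing.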
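
The effect of stop word removal can be seen on a toy sentence. The sketch below is an illustration only: it uses `re.findall` from the standard library in place of `RegexpTokenizer`, and `TOY_STOP_WORDS`, `count_words`, and the sample sentence are hypothetical names and data, not part of the analysis above.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word set for illustration only;
# the analysis above uses NLTK's English list instead.
TOY_STOP_WORDS = {"the", "a", "of", "and", "is", "was", "in", "to"}


def count_words(text, stop_words=frozenset()):
    """Lowercase the text, split it into runs of word characters,
    drop any token found in stop_words, and count what remains."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(token for token in tokens if token not in stop_words)


# Made-up sample sentence, not a quotation from the novel.
sample = "The whale was the terror of the sea, and the sea was the home of the whale."

raw_counts = count_words(sample)
filtered_counts = count_words(sample, TOY_STOP_WORDS)

print(raw_counts.most_common(3))       # dominated by "the"
print(filtered_counts.most_common(3))  # content words only
```

Without filtering, the most frequent token is "the"; with filtering, only content words such as "whale" and "sea" remain.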

## Conclusion

This project demonstrates how Python can be used to:
- Collect data from the web
- Clean and preprocess raw text
- Perform basic natural language processing
- Extract meaningful insights from large documents

The same workflow can be applied to news articles, customer reviews, social media data, and other real-world text datasets.  
This makes it a strong foundation for further work in data analysis, NLP, and machine learning.
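
As a rough sketch of how the workflow might be reused on other HTML text, the steps can be wrapped in one function. This version is an assumption-laden illustration rather than the notebook's own code: it swaps in the standard library's `html.parser.HTMLParser` for `BeautifulSoup` and `re.findall` for `RegexpTokenizer`, it skips `<script>` and `<style>` contents as a design choice, and `top_words`, `_TextExtractor`, and the sample review are hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document, skipping script and style."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def top_words(html_text, stop_words, n=10):
    """Return the n most common non-stop words in an HTML string."""
    parser = _TextExtractor()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    text = " ".join(parser.parts)
    tokens = re.findall(r"\w+", text.lower())
    return Counter(t for t in tokens if t not in stop_words).most_common(n)


# Made-up customer review used only to exercise the function.
review_html = (
    "<html><body><h1>Great kettle</h1>"
    "<p>The kettle boils fast &amp; the kettle is quiet.</p>"
    "<script>var kettle = 1;</script></body></html>"
)
print(top_words(review_html, stop_words={"the", "is"}, n=3))
```

Passing the stop word collection in as an argument keeps the function independent of any one language or word list.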

## One-Line Summary

This analysis shows how simple NLP techniques can uncover thematic patterns in classic literature using Python.




[('whale', 1246), ('one', 925), ('like', 647), ('upon', 568), ('man', 527), ('ship', 519), ('ahab', 517), ('ye', 473), ('sea', 455), ('old', 452)]
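
The introduction mentions visualizing the frequency distribution, while the notebook stops at printing the list. One minimal way to view the result, sketched here with the standard library only, is a plain-text bar chart. The counts are copied from the output above; `bar_chart_lines` and the `width` parameter are hypothetical names introduced for this illustration.

```python
# Counts copied from the printed output above.
top_ten = [('whale', 1246), ('one', 925), ('like', 647), ('upon', 568),
           ('man', 527), ('ship', 519), ('ahab', 517), ('ye', 473),
           ('sea', 455), ('old', 452)]


def bar_chart_lines(pairs, width=40):
    """Return one text line per (word, count) pair, with bars scaled
    so that the largest count is drawn with `width` characters."""
    largest = max(count for _, count in pairs)
    label_width = max(len(word) for word, _ in pairs)
    lines = []
    for word, count in pairs:
        bar = "#" * round(width * count / largest)
        lines.append(f"{word:<{label_width}} {bar} {count}")
    return lines


for line in bar_chart_lines(top_ten):
    print(line)
```

The chart makes the gap between "whale" and every other word easier to see than the raw list does.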
