## Introduction
<p><img src="https://assets.datacamp.com/production/project_1010/img/book_cover.jpg" alt="The book cover of Peter and Wendy" style="width:183;height:253px;"></p>
<h3 id="flyawaywithpeterpan">Fly away with Peter Pan!</h3>
<p>Peter Pan has been the companion of many children and has come a long way, starting as a Christmas play and ending up as a Disney classic. Did you know that although the play was titled "Peter Pan, Or The Boy Who Wouldn't Grow Up", J. M. Barrie's novel was actually titled "Peter and Wendy"?</p>
<p>You're going to explore and analyze the text of Peter Pan to answer the question in the instruction pane below. You are working with the version of the text available at <a href="https://www.gutenberg.org/files/16/16-h/16-h.htm">Project Gutenberg</a>. Feel free to add as many cells as necessary. Finally, remember that you are only tested on your answer, not on the methods you use to arrive at it!</p>
<p><strong>Note:</strong> If you haven't completed a DataCamp project before, you should check out the <a href="https://projects.datacamp.com/projects/33">Intro to Projects</a> first to learn about the interface. <a href="https://www.datacamp.com/courses/intermediate-importing-data-in-python">Intermediate Importing Data in Python</a> and <a href="https://www.datacamp.com/courses/introduction-to-natural-language-processing-in-python">Introduction to Natural Language Processing in Python</a> teach the skills required to complete this project. Should you decide to use them, English stopwords have been downloaded from <code>nltk</code> and are available for you in your environment.</p>
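
The Project Gutenberg link above points to an HTML page, so the markup has to be removed before any word counting. A minimal sketch of that step follows, using only the standard library's `html.parser` rather than BeautifulSoup so that it runs without extra installs. The class name `TextExtractor`, the helper `html_to_text`, and the sample HTML string are illustrative assumptions, not part of the project.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def get_text(self):
        # Join the pieces and collapse runs of whitespace into single spaces.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Hypothetical helper: return the visible text of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.get_text()


# Made-up sample page, used here instead of a network download.
sample_html = (
    "<html><head><style>p { color: red; }</style></head>"
    "<body><h1>Peter Pan</h1><p>All children, except one, grow up.</p></body></html>"
)
sample_text = html_to_text(sample_html)
print(sample_text)
```

In a notebook with BeautifulSoup available, `BeautifulSoup(html, "html.parser").get_text()` would likely do the same job in one line.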

In [8]:
import requests
from bs4 import BeautifulSoup
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from collections import Counter

# Step 1: Download the text of "Peter Pan" from Project Gutenberg
url = "https://www.gutenberg.org/files/16/16-0.txt"
response = requests.get(url)
response.raise_for_status()  # stop early if the download failed
text = response.text

# Step 2: Tokenize the text into words
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(text)

# Step 3: Convert tokens to lowercase
words = [token.lower() for token in tokens]

# Step 4: Remove stopwords
stop_words = set(stopwords.words('english'))
meaningful_words = [word for word in words if word not in stop_words]

# Step 5: Count word frequencies
word_counts = Counter(meaningful_words)

# Step 6: Get the top 10 most common words
top_ten_words = word_counts.most_common(10)

# Step 7: List candidate character names to look for among the top words.
# "Tinker" covers Tinker Bell, which the tokenizer splits into two tokens.
all_possible_protagonists = ["Peter", "Wendy", "Tinker", "Hook", "Darling", "John"]

# Identify character names among the top words
protagonists = [word.capitalize() for word, _ in top_ten_words if word.capitalize() in all_possible_protagonists]

# Step 8: Print the results
print("Top 10 most common meaningful words:", top_ten_words)
print("Character names among the top 10 words:", protagonists)


Top 10 most common meaningful words: [('peter', 408), ('wendy', 362), ('said', 358), ('would', 217), ('one', 212), ('hook', 174), ('could', 142), ('cried', 136), ('john', 133), ('time', 126)]
Character names among the top 10 words: ['Peter', 'Wendy', 'Hook', 'John']
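
The cell above tokenizes the whole downloaded file, so any Project Gutenberg license text around the novel is counted too. Below is a minimal sketch of trimming it first, assuming the file marks the book with lines beginning `*** START OF` and `*** END OF` (the usual Gutenberg convention, but worth checking in the file itself). The helper name `strip_gutenberg_boilerplate` and the sample string are hypothetical.

```python
import re
from collections import Counter

# Assumed marker lines; verify them against the actual download.
START_MARKER = re.compile(r"^\*\*\* ?START OF.*$", re.MULTILINE)
END_MARKER = re.compile(r"^\*\*\* ?END OF.*$", re.MULTILINE)


def strip_gutenberg_boilerplate(text):
    """Return the text between the START and END marker lines.

    If a marker is missing, that side of the text is left untouched.
    """
    start = START_MARKER.search(text)
    end = END_MARKER.search(text)
    begin = start.end() if start else 0
    finish = end.start() if end else len(text)
    return text[begin:finish].strip()


# Made-up sample standing in for the downloaded file.
sample = (
    "The Project Gutenberg eBook of Peter Pan\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK PETER PAN ***\n"
    "Peter flew. Wendy flew. Peter laughed.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK PETER PAN ***\n"
    "License terms follow here.\n"
)

body = strip_gutenberg_boilerplate(sample)
counts = Counter(re.findall(r"\w+", body.lower()))
print(body)
print(counts.most_common(2))
```

In the notebook, this would be applied to `text` before tokenizing; the reported counts may shift slightly as a result.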
