# Word Frequency in Peter Pan by J.M. Barrie
## Introduction

### In this project I used the Python requests package to download the novel Peter Pan by J.M. Barrie from the Project Gutenberg site. I then used Beautiful Soup to extract the text from the HTML, nltk to tokenize the text and remove stopwords, and Counter to count the occurrences of each word.

### This project comes from DataCamp.

## Common Imports
### For this project the following imports were used:
1. requests: Python package used for making HTTP requests
2. BeautifulSoup from bs4: Python package for parsing HTML and XML documents
3. nltk: Natural Language Toolkit, used for processing human language
4. Counter from collections: dict subclass for counting hashable items; in this project, the number of occurrences of each word

In [1]:
import requests
from bs4 import BeautifulSoup
import nltk
from collections import Counter

## Import the HTML File and Extract Its Text

In [2]:
#Retrieving the Peter Pan HTML
r = requests.get('https://www.gutenberg.org/cache/epub/16/pg16-images.html')

#Setting the correct text encoding of the HTML page
r.encoding = 'utf-8'

#Extracting the HTML from the response object
html = r.text


In [3]:
#Create a BeautifulSoup object from the HTML, naming the parser explicitly
soup = BeautifulSoup(html, 'html.parser')

#Get the text out of the soup
text = soup.get_text()

## Preparing Data

### To prepare the data, the following steps were taken:
1. Tokenize the text into individual words
2. Convert all words to lowercase so that "The" and "the" are counted as the same word

In [4]:
#Create a tokenizer that matches runs of word characters (raw string avoids an invalid-escape warning)
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')

#Tokenize the text
tokens = tokenizer.tokenize(text)

#Print out the first 11 words/tokens
tokens[0:11]

['The',
 'Project',
 'Gutenberg',
 'eBook',
 'of',
 'Peter',
 'Pan',
 'by',
 'James',
 'M',
 'Barrie']

In [5]:
#Create a list called words containing all tokens transformed to lowercase
words = [token.lower() for token in tokens]

#Print out the first 11 words/tokens
words[:11]

['the',
 'project',
 'gutenberg',
 'ebook',
 'of',
 'peter',
 'pan',
 'by',
 'james',
 'm',
 'barrie']
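
To make the two preparation steps concrete, here is a minimal sketch that uses only the standard library. It assumes that `re.findall` with the same `\w+` pattern behaves like the `RegexpTokenizer` above for this simple case; `tokenize_lower` and the sample sentence are hypothetical names introduced for illustration, not part of the original notebook.

```python
import re

def tokenize_lower(text):
    """Split text into runs of word characters and lowercase each one.

    Hypothetical helper: mirrors the tokenize-then-lowercase steps
    using re.findall instead of nltk's RegexpTokenizer.
    """
    return [token.lower() for token in re.findall(r"\w+", text)]

# Punctuation is dropped because it is not matched by \w
sample = "Peter Pan, by James M. Barrie"
sample_words = tokenize_lower(sample)
print(sample_words)

# An apostrophe is not a word character, so contractions split in two
contraction = tokenize_lower("Don't")
print(contraction)
```

Note that this pattern splits contractions into fragments such as `don` and `t`, which is one reason single letters can show up in the token list.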

## Removing Stopwords
### Stopwords are very common English words, like "the," "and," or "or," that carry little meaning on their own and would otherwise dominate the counts. For this project, stopwords were removed.

In [6]:
#Download the stopwords
nltk.download('stopwords')

#Getting the English stop words from nltk as a set for fast membership tests
sw = set(nltk.corpus.stopwords.words('english'))

#Create a list of words that are in words but not in stopwords
words_ns = [word for word in words if word not in sw]

#Print the first 5 words to confirm the stopwords are gone
words_ns[:5]

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sfroe\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['project', 'gutenberg', 'ebook', 'peter', 'pan']
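
The filtering step can be tried without downloading anything. The sketch below assumes a small, hand-picked stopword set (`SAMPLE_STOPWORDS`) that stands in for the much longer nltk list; both it and `remove_stopwords` are hypothetical names used only for illustration.

```python
# Hypothetical stand-in for nltk's English stopword list (illustration only)
SAMPLE_STOPWORDS = {"the", "of", "by", "and", "or", "a"}

def remove_stopwords(words, stopwords):
    """Return the words that are not stopwords, preserving their order."""
    # A set makes each membership test constant time on average
    stopword_set = set(stopwords)
    return [word for word in words if word not in stopword_set]

sample_words = ["the", "project", "gutenberg", "ebook", "of", "peter", "pan", "by"]
filtered = remove_stopwords(sample_words, SAMPLE_STOPWORDS)
print(filtered)
```

Because the words were lowercased earlier, a lowercase stopword set is enough; without that step, "The" would slip through the filter.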

## Get The Answer

### Getting the answer involved counting the occurrences of each word and then reporting the most frequently used one.

In [7]:
#Initialize a Counter object from our processed list of words
count = Counter(words_ns)

#Store 10 most common words and their counts as top_ten
top_ten = count.most_common(10)

#Print the top ten words and their counts
print(top_ten)

[('peter', 408), ('wendy', 361), ('said', 358), ('would', 217), ('one', 212), ('hook', 174), ('could', 142), ('cried', 136), ('john', 133), ('time', 126)]


In [8]:
#What's the most common word in Peter Pan?
print("The most common word in Peter Pan is", top_ten[0][0], "and it occurs", top_ten[0][1], "times.")

The most common word in Peter Pan is peter and it occurs 408 times.
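
For readers new to `Counter`, the following minimal sketch shows the same counting pattern on a tiny, made-up word list. The list `sample_words` is an assumption for illustration and is not taken from the novel's actual counts.

```python
from collections import Counter

# Made-up word list for illustration only
sample_words = ["peter", "wendy", "peter", "hook", "wendy", "peter"]

# Counter maps each distinct word to the number of times it appears
counts = Counter(sample_words)

# most_common(n) returns (word, count) pairs sorted by descending count
top_two = counts.most_common(2)
print(top_two)

# The first pair holds the most frequent word and its count
top_word, top_count = top_two[0]
print("The most common word is", top_word, "and it occurs", top_count, "times.")
```

Unpacking the first pair into two names, as in the last lines, is an alternative to indexing with `top_ten[0][0]` and `top_ten[0][1]`.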
