## 1. Tools for text processing
<p><img style="float: right ; margin: 5px 20px 5px 10px; width: 45%" src="https://images3.penguinrandomhouse.com/cover/9780147514011"> </p>
<p>In this project, we analyze the most frequent words in Louisa May Alcott's classic novel <em>Little Women</em> and count how often each one occurs. We download the novel's HTML from Project Gutenberg, a site offering free access to a large collection of ebooks, using <code>requests</code>, extract the text with <code>BeautifulSoup</code>, tokenize it with <code>nltk</code>, and count the words with <code>Counter</code>.</p>

<p>Credit goes to Project Gutenberg for providing the free ebook and its HTML. The same pipeline applies to other novels available on Project Gutenberg, and it shows how natural language processing can surface patterns in unstructured text.</p>

Feel free to get in touch with your comments and feedback.

Email: zaidsaad99@gmail.com

Github: https://github.com/zsaad9

LinkedIn: https://www.linkedin.com/in/zaidbsaad/


In [None]:
# Importing requests, BeautifulSoup, nltk, and Counter
import requests
from bs4 import BeautifulSoup
import nltk
from collections import Counter

## 2. Request Little Women
<p>To analyze Little Women, we need to get the contents of the book from <em>somewhere</em>. Luckily, the text is freely available online at Project Gutenberg as an HTML file: https://www.gutenberg.org/cache/epub/37106/pg37106-images.html .</p>
<p>To fetch the HTML file, we'll use the <code>requests</code> package to make a <code>GET</code> request to that URL, which means we're <em>getting</em> data from it.</p>

In [None]:
# Getting the Little Women HTML
r = requests.get("https://www.gutenberg.org/cache/epub/37106/pg37106-images.html")

# Setting the correct text encoding of the HTML page
r.encoding = 'utf-8'

# Extracting the HTML from the request object
html = r.text

# Printing the first 2000 characters in html
print(html[:2000])

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"><style>
#pg-header div, #pg-footer div {
    all: initial;
    display: block;
    margin-top: 1em;
    margin-bottom: 1em;
    margin-left: 2em;
}
#pg-footer div.agate {
    font-size: 90%;
    margin-top: 0;
    margin-bottom: 0;
    text-align: center;
}
#pg-footer li {
    all: initial;
    display: block;
    margin-top: 1em;
    margin-bottom: 1em;
    text-indent: -0.6em;
}
#pg-footer div.secthead {
    font-size: 110%;
    font-weight: bold;
}
#pg-footer #project-gutenberg-license {
    font-size: 110%;
    margin-top: 0;
    margin-bottom: 0;
    text-align: center;
}
#pg-header-heading {
    all: inherit;
    text-align: center;
    font-size: 120%;
    font-weight:bold;
}
#pg-footer-heading {
    all: inherit;
    text-align: center;
    font-size: 120%;
    font-weight: normal;
    margin-top: 0;
    margin-bottom: 0;
}
#pg-header #pg-machine-header p {
    text-indent: -4em;
    margin-left: 4em;
    margin-top:
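
Why set `r.encoding` explicitly? The response body arrives as bytes, and `r.text` decodes those bytes with whichever encoding `requests` believes applies. The sketch below uses only the standard library and a made-up sample string (not text from the novel) to show what happens when UTF-8 bytes are decoded with the wrong codec.

```python
# A made-up sample containing one non-ASCII character (an assumption for illustration)
sample = "caf\u00e9"

# The bytes as they might travel over the network, encoded as UTF-8
raw = sample.encode("utf-8")

# Decoding with the right codec recovers the original string
decoded_ok = raw.decode("utf-8")

# Decoding with the wrong codec silently produces garbled characters
decoded_bad = raw.decode("latin-1")

print(decoded_ok == sample)   # True
print(decoded_bad == sample)  # False
```

The wrong codec raises no error here, which is why a mismatch is easy to miss until odd characters show up in the word counts.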

## 3. Getting the text from the HTML
<p>This HTML is not quite what we want. However, it does <em>contain</em> what we want: the text of <em>Little Women</em>. What we need to do now is <em>wrangle</em> this HTML to extract the text of the novel. For this we'll use the package <code>BeautifulSoup</code>.</p>


In [None]:
# Creating a BeautifulSoup object from the HTML
soup = BeautifulSoup(html, 'html.parser')

# Getting the text out of the soup
text = soup.get_text()

# Printing out text between characters 32000 and 34000
print(text[32000:34000])


rong place with
a croak or a quaver that spoilt the most pensive tune. They had
always done this from the time they could lisp
"Crinkle, crinkle, 'ittle 'tar,"

and it had become a household custom, for the mother was a born
singer. The first sound in the morning was her voice, as she went
about the house singing like a lark; and the last sound at night was
the same cheery sound, for the girls never grew too old for that
familiar lullaby.





II. A Merry Christmas.




II.
A MERRY CHRISTMAS.
Jo was the first to wake in the gray dawn of Christmas morning.
No stockings hung at the fireplace, and for a moment she felt as
much disappointed as she did long ago, when her little sock fell down
because it was so crammed with goodies. Then she remembered her
mother's promise, and, slipping her hand under her pillow, drew out
a little crimson-covered book. She knew it very well, for it was that
beautiful old story of the best life ever lived, and Jo felt that it was
a true guide-book for any pi
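
To make the idea of pulling text out of markup concrete, here is a minimal sketch of tag stripping with Python's built-in `html.parser` instead of BeautifulSoup. The class name `TextExtractor` and the sample HTML are hypothetical, and this is not how BeautifulSoup works internally; it only illustrates keeping text nodes while discarding tags and stylesheet content.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <style> and <script>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # greater than 0 while inside <style> or <script>

    def handle_starttag(self, tag, attrs):
        if tag in ("style", "script"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("style", "script") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


# Made-up HTML standing in for the downloaded page
sample_html = (
    "<html><head><style>p { margin: 0; }</style></head>"
    "<body><h2>II.</h2>\n<p>Jo was the <em>first</em> to wake.</p></body></html>"
)

extractor = TextExtractor()
extractor.feed(sample_html)
extractor.close()
sample_text = "".join(extractor.parts)
print(sample_text)
```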

## 4. Extracting the words
<p>We now have the text of the novel! The page also contains Project Gutenberg's header at the start and its license text at the end. We could remove them, but they are so small compared with the text of <em>Little Women</em> that, to a first approximation, it is okay to leave them in.</p>
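
If the extra material ever matters, one option is to keep only the text between Project Gutenberg's start and end marker lines. The sketch below assumes markers beginning with `*** START OF THE PROJECT GUTENBERG EBOOK` and `*** END OF THE PROJECT GUTENBERG EBOOK`; the exact wording in this edition is an assumption and should be checked against `text` before relying on it. The helper name `strip_gutenberg_boilerplate` and the sample string are hypothetical.

```python
# Assumed marker prefixes; verify them against the downloaded text
START_MARKER = "*** START OF THE PROJECT GUTENBERG EBOOK"
END_MARKER = "*** END OF THE PROJECT GUTENBERG EBOOK"


def strip_gutenberg_boilerplate(full_text):
    """Return the text between the two markers, or the input unchanged."""
    start = full_text.find(START_MARKER)
    end = full_text.find(END_MARKER)
    if start == -1 or end == -1 or end <= start:
        return full_text
    # Move past the rest of the start-marker line
    line_end = full_text.find("\n", start)
    if line_end == -1 or line_end > end:
        return full_text
    return full_text[line_end + 1:end].strip()


# Made-up miniature "book" for illustration
sample = (
    "Title: Little Women\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK LITTLE WOMEN ***\n"
    "Jo was the first to wake.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK LITTLE WOMEN ***\n"
    "License text\n"
)
print(strip_gutenberg_boilerplate(sample))
```

Returning the input unchanged when a marker is missing keeps the rest of the pipeline working even if the assumption about the markers turns out to be wrong.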
<p>Now that we have the text of interest, it's time to count how many times each word appears. For this we'll use <code>nltk</code>, the Natural Language Toolkit. We'll start by tokenizing the text: splitting it into a list of words while dropping everything that isn't a word character (whitespace, punctuation, etc.).</p>

In [None]:
# Creating a tokenizer
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')

# Tokenizing the text
tokens = tokenizer.tokenize(text)

# Printing out the first 8 words / tokens
print(tokens[:8])

['The', 'Project', 'Gutenberg', 'eBook', 'of', 'Little', 'Women', 'by']
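
To see what the pattern `\w+` keeps and discards, the sketch below applies the same regular expression with the standard library's `re.findall` to a made-up sentence. For a simple pattern like this one the result should match what `RegexpTokenizer` returns, though that equivalence is stated here as an assumption rather than checked against NLTK.

```python
import re

# Made-up sentence containing contractions, a digit, and punctuation
sentence = "Jo didn't wake at 7 o'clock, did she?"

# \w+ matches runs of letters, digits, and underscores
sample_tokens = re.findall(r"\w+", sentence)
print(sample_tokens)
# ['Jo', 'didn', 't', 'wake', 'at', '7', 'o', 'clock', 'did', 'she']
```

Note that apostrophes split contractions into fragments such as `didn` and `t`, and that digits count as word characters, so numbers are kept as tokens.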


## 5. Making the words lowercase
<p>OK! We're nearly there. Note that in the above 'Little' has a capital 'L' and that in other places it may not, but both 'Little' and 'little' should be counted as the same word. For this reason, we should build a list of all words in <em>Little Women</em> in which all capital letters have been made lower case.</p>

In [None]:
# Create a list called words containing all tokens transformed to lower-case
words = [word.lower() for word in tokens]

# Printing out the first 8 words / tokens
print(words[:8])


['the', 'project', 'gutenberg', 'ebook', 'of', 'little', 'women', 'by']


## 6. Loading in the stop words
<p>It is common practice to remove words that appear a lot in the English language such as 'the', 'of' and 'a' because they're not so interesting. Such words are known as <em>stop words</em>. The package <code>nltk</code> includes a good list of stop words in English that we can use.</p>

In [None]:
# Getting the English stop words from nltk
from nltk.corpus import stopwords

# Downloading the stop word list (skipped if it is already up to date)
nltk.download('stopwords')
sw = stopwords.words('english')

# Printing out the first 8 stop words
print(sw[:8])


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves']


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## 7. Removing stop words in Little Women
<p>We now want to create a new list with all <code>words</code> in the novel, except those that are stop words (that is, those words listed in <code>sw</code>).</p>

In [None]:
# Creating a list words_ns containing all words that are in words but not in sw
# (sw is converted to a set so that each membership test is fast)
sw_set = set(sw)
words_ns = [word for word in words if word not in sw_set]

# Printing the first 5 words_ns to check that stop words are gone
print(words_ns[:5])

['project', 'gutenberg', 'ebook', 'little', 'women']
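
The order of the last two steps matters: a lowercase stop word list will not match capitalized tokens. The sketch below shows this with a small hand-written stop word set (a stand-in for `sw`, not NLTK's actual list) and made-up tokens.

```python
# Hand-written stand-in for the stop word list (an assumption for illustration)
sample_sw = {"the", "of", "a", "and"}

sample_tokens = ["The", "story", "of", "Jo", "and", "the", "March", "sisters"]

# Filtering before lowercasing lets the capitalized "The" slip through
filtered_first = [t for t in sample_tokens if t not in sample_sw]

# Lowercasing first, then filtering, removes every stop word
lowered = [t.lower() for t in sample_tokens]
filtered_after = [t for t in lowered if t not in sample_sw]

print(filtered_first)  # ['The', 'story', 'Jo', 'March', 'sisters']
print(filtered_after)  # ['story', 'jo', 'march', 'sisters']
```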


## 8. We have the answer
<p>Our original question was:</p>
<blockquote>
  <p>What are the most frequent words in Louisa May Alcott's classic novel, "Little Women" and how often do they occur?</p>
</blockquote>
<p>Let's answer this question using the <code>Counter</code> class we imported.</p>

In [None]:
# Initializing a Counter object from our processed list of words
count = Counter(words_ns)
# Storing 10 most common words and their counts as top_ten
top_ten = count.most_common(10)

# Print the top ten words and their counts
print(top_ten)


[('jo', 1406), ('one', 900), ('said', 839), ('little', 773), ('meg', 704), ('amy', 669), ('laurie', 610), ('like', 604), ('beth', 494), ('good', 481)]
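
As a small illustration of how `Counter` produces these pairs, the sketch below counts a made-up word list and also converts the counts into shares of the total, which can make results easier to compare across books of different lengths. The sample words and variable names are hypothetical.

```python
from collections import Counter

# Made-up word list standing in for words_ns
sample_words = ["jo", "meg", "jo", "amy", "jo", "meg", "beth"]
sample_count = Counter(sample_words)

# most_common(n) returns (word, count) pairs sorted by descending count
top_two = sample_count.most_common(2)
print(top_two)  # [('jo', 3), ('meg', 2)]

# Share of all counted words accounted for by each of the top words
total = sum(sample_count.values())
shares = {word: n / total for word, n in top_two}
print(shares)
```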


## 9. The most common word
<p>Using our variable <code>top_ten</code>, we now have an answer to our original question.</p>
<p>The natural language processing steps used in this notebook also apply well beyond novels: much of the world's data is unstructured, and a great deal of it is text.</p>

In [None]:
# What's the most common word in Little Women?
print(count.most_common(1))

[('jo', 1406)]
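
Since the same steps apply to other texts, they can be gathered into one function. The sketch below is a standard-library approximation of the notebook's pipeline: it uses `re.findall` in place of NLTK's tokenizer and takes the stop words as an argument. The function name `top_words` and the sample inputs are hypothetical.

```python
import re
from collections import Counter


def top_words(text, stop_words, n=10):
    """Return the n most common non-stop words in text as (word, count) pairs."""
    stop_set = set(stop_words)
    # Tokenize on word characters, then lowercase so 'Jo' and 'jo' are merged
    words = [w.lower() for w in re.findall(r"\w+", text)]
    return Counter(w for w in words if w not in stop_set).most_common(n)


# Made-up text and stop word list for illustration
sample_text = "Jo laughed. Meg laughed and Jo ran, and Jo sang."
print(top_words(sample_text, ["and"], n=2))  # [('jo', 3), ('laughed', 2)]
```

With the notebook's own variables, the equivalent call would be `top_words(text, sw)`, applied to the text of any other novel fetched the same way.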
