### 📚 NLTK Gutenberg Corpus – Brief Introduction

<div style="background-color:#fff9c4; padding:16px; border-radius:8px; border-left:5px solid #fbc02d;">

<h3>📘 Executive Summary</h3>

The Gutenberg Corpus in NLTK is a collection of classic literary texts from Project Gutenberg, a digital library of public-domain books. It is commonly used for Natural Language Processing (NLP) practice, text analysis, and linguistic research.

The corpus includes well-known works such as:
<ul>
<li>Jane Austen novels (e.g., Emma, Persuasion)</li> 
<li>Moby Dick by Herman Melville </li>
<li>Shakespeare plays </li>
<li>The King James Bible </li>
</ul>

In NLTK, the Gutenberg corpus provides:
<ul>
<li> Raw text of entire books </li>
<li> Easy access to tokenization-ready data</li> 
<li> A clean environment for experimenting with word frequency, vocabulary richness, sentence length, and other NLP techniques</li>
</ul>
This study uses <strong><em>austen-emma.txt</em></strong> (Jane Austen's <em>Emma</em>) from the Gutenberg corpus.
    </div>

**Importing the required libraries**

In [1]:
import random
import re
from collections import Counter

import nltk
import numpy as np
import pandas as pd  # used for the TF-IDF DataFrame in the last section
from nltk import FreqDist
from nltk.corpus import gutenberg, stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
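# Resources used in this notebook: tokenizer models, stop words, WordNet, the Gutenberg texts, and the POS tagger
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # newer NLTK releases load the tokenizer from this resource
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('gutenberg', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)
nltk.download('averaged_perceptron_tagger_eng', quiet=True)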


True

**Loading the book**

In [3]:
text = gutenberg.raw('austen-emma.txt')
print("Sample Original Text:\n", text[:600])

Sample Original Text:
 [Emma by Jane Austen 1816]

VOLUME I

CHAPTER I


Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
with very little to distress or vex her.

She was the youngest of the two daughters of a most affectionate,
indulgent father; and had, in consequence of her sister's marriage,
been mistress of his house from a very early period.  Her mother
had died too long ago for her to have more than an indistinct
remembrance of her caresses; and her place had b


**Tokenize (words + sentences)**

In [4]:
sentences = sent_tokenize(text)
tokens = word_tokenize(text)
print("Total sentences:", len(sentences))
print("Total tokens (includes punctuation):", len(tokens))
print("\nSample sentence:\n", sentences[0])
print("\nSample tokens:\n", tokens[:25])

Total sentences: 7493
Total tokens (includes punctuation): 191855

Sample sentence:
 [Emma by Jane Austen 1816]

VOLUME I

CHAPTER I


Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
with very little to distress or vex her.

Sample tokens:
 ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', 'VOLUME', 'I', 'CHAPTER', 'I', 'Emma', 'Woodhouse', ',', 'handsome', ',', 'clever', ',', 'and', 'rich', ',', 'with', 'a', 'comfortable', 'home']
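
The first sentence in the output above still carries the title line and the `VOLUME I` / `CHAPTER I` headings, because the sentence tokenizer splits on sentence-ending punctuation rather than on line breaks. If that matters for an analysis, the headings could be stripped before tokenizing. The following is a minimal sketch on a short excerpt copied from the sample output. The helper name `strip_headings` and its regular expressions are assumptions made for illustration and are not part of the notebook; applying them to the full text would change the counts reported here.

```python
import re

# Short excerpt copied from the start of austen-emma.txt (see the sample output above).
sample = (
    "[Emma by Jane Austen 1816]\n\n"
    "VOLUME I\n\n"
    "CHAPTER I\n\n\n"
    "Emma Woodhouse, handsome, clever, and rich, with a comfortable home\n"
    "and happy disposition, seemed to unite some of the best blessings\n"
    "of existence; and had lived nearly twenty-one years in the world\n"
    "with very little to distress or vex her."
)

def strip_headings(raw: str) -> str:
    """Hypothetical helper: drop the bracketed title and VOLUME/CHAPTER lines."""
    # Remove a leading "[...]" title.
    raw = re.sub(r"\A\[[^\]]*\]", "", raw)
    # Remove lines that consist only of VOLUME/CHAPTER plus a Roman numeral.
    raw = re.sub(r"^(?:VOLUME|CHAPTER) [IVXLC]+[ \t]*$", "", raw, flags=re.MULTILINE)
    return raw.strip()

cleaned = strip_headings(sample)
print(cleaned.splitlines()[0])
```
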


**Lowercasing the words & removing non-alphabetic tokens**

In [5]:
words_only = [t.lower() for t in tokens if t.isalpha()]
print("Words only:", len(words_only))
print("Sample words:", words_only[:20])

Words only: 157114
Sample words: ['emma', 'by', 'jane', 'austen', 'volume', 'i', 'chapter', 'i', 'emma', 'woodhouse', 'handsome', 'clever', 'and', 'rich', 'with', 'a', 'comfortable', 'home', 'and', 'happy']
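
`str.isalpha()` is true only when every character in the token is a letter, so this filter removes more than punctuation: numbers such as `1816`, hyphenated words, and possessive clitics are dropped as well. The sketch below illustrates this on a hand-written token list in the style of `word_tokenize` output. The list is an assumption for illustration, not the tokenizer's actual output.

```python
# Hand-written tokens in the style of word_tokenize output (an assumption for
# illustration; run word_tokenize on the real text to see the actual tokens).
sample_tokens = ["[", "Emma", "by", "Jane", "Austen", "1816", "]",
                 "nearly", "twenty-one", "years", "sister", "'s", "marriage", "."]

# Same filter as the cell above: keep purely alphabetic tokens, lowercased.
kept = [t.lower() for t in sample_tokens if t.isalpha()]
dropped = [t for t in sample_tokens if not t.isalpha()]

print("kept:   ", kept)
print("dropped:", dropped)
```
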


**Word Length Analysis**

In [6]:
word_lengths = [len(w) for w in words_only]

In [7]:
avg_len = float(np.mean(word_lengths))
min_len = int(np.min(word_lengths))
max_len = int(np.max(word_lengths))

In [8]:
print("Word length stats:")
print("  Average:", round(avg_len, 2))
print("  Min:", min_len)
print("  Max:", max_len)

Word length stats:
  Average: 4.25
  Min: 1
  Max: 17


**Most common Word Lengths**

In [9]:
len_counts = Counter(word_lengths)
print("\nMost common word lengths (length -> count):")
print(len_counts.most_common(10))


Most common word lengths (length -> count):
[(3, 37162), (4, 30200), (2, 29527), (5, 16597), (6, 11179), (7, 9734), (1, 6315), (8, 5733), (9, 5314), (10, 2695)]
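
Raw counts are easier to compare as shares of the whole text. The sketch below converts the counts printed above into percentages, using the "Words only" total reported earlier. The numbers are copied from the notebook output; the variable names are illustrative.

```python
# Counts copied from the notebook output above (ten most common lengths only).
top_length_counts = [(3, 37162), (4, 30200), (2, 29527), (5, 16597), (6, 11179),
                     (7, 9734), (1, 6315), (8, 5733), (9, 5314), (10, 2695)]
total_words = 157114  # "Words only" count reported earlier

# Share of all words that have each length.
shares = {length: count / total_words for length, count in top_length_counts}
for length in sorted(shares):
    print(f"{length:>2} letters: {shares[length]:6.2%}")

# Combined share of short words (1 to 4 letters).
short_share = sum(c for length, c in top_length_counts if length <= 4) / total_words
print(f"Words of 1-4 letters: {short_share:.2%}")
```
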


**Stop Words Removal + Top Words**

In [10]:
stop_words = set(stopwords.words("english"))
content_words = [w for w in words_only if w not in stop_words]

In [11]:
print("Before Stopwords Count:", len(words_only))
print("After Stopwords Count:", len(content_words))

Before Stopwords Count: 157114
After Stopwords Count: 69693


In [12]:
freq = FreqDist(content_words)
print("Top 20 most common words:\n")
for word, count in freq.most_common(20):
    print(f"{word} → {count}")

Top 20 most common words:

emma → 860
could → 836
would → 818
miss → 599
must → 566
harriet → 500
much → 484
said → 483
one → 447
weston → 437
every → 435
thing → 394
think → 383
elton → 383
knightley → 379
well → 378
little → 359
never → 358
know → 335
might → 325
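
Several entries in this list are modal verbs and quantifiers (`could`, `would`, `must`, `might`, `much`) that survived the stop-word filter. One option is to extend the stop-word set with a few hand-picked words. The sketch below applies such an extension to the top-20 output above. The set `extra_stop_words` is a hypothetical choice for illustration, not something the notebook defines.

```python
# (word, count) pairs copied from the top-20 output above.
top_words = [("emma", 860), ("could", 836), ("would", 818), ("miss", 599),
             ("must", 566), ("harriet", 500), ("much", 484), ("said", 483),
             ("one", 447), ("weston", 437), ("every", 435), ("thing", 394),
             ("think", 383), ("elton", 383), ("knightley", 379), ("well", 378),
             ("little", 359), ("never", 358), ("know", 335), ("might", 325)]

# Hypothetical extra stop words chosen by hand for this example.
extra_stop_words = {"could", "would", "must", "might", "much", "one", "every"}

# Keep only the entries that are not in the extra stop-word set.
filtered = [(w, c) for w, c in top_words if w not in extra_stop_words]
for word, count in filtered[:5]:
    print(f"{word}: {count}")
```

In the notebook itself, the same effect could be obtained by filtering `words_only` against `stop_words | extra_stop_words` before building the `FreqDist`.
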


**Sentence Length Analysis**

In [13]:
def sentence_word_count(s: str) -> int:
    toks = word_tokenize(s)
    toks = [t for t in toks if t.isalpha()]
    return len(toks)

In [14]:
sentence_lengths = [sentence_word_count(s) for s in sentences]
sentence_lengths = [l for l in sentence_lengths if l > 0]

In [15]:
print("Sentence length stats (words per sentence):")
print("  Count:", len(sentence_lengths))
print("  Average:", round(float(np.mean(sentence_lengths)), 2))
print("  Median:", int(np.median(sentence_lengths)))
print("  Min:", int(np.min(sentence_lengths)))
print("  Max:", int(np.max(sentence_lengths)))

Sentence length stats (words per sentence):
  Count: 7459
  Average: 21.06
  Median: 15
  Min: 1
  Max: 234


**Stemming**

In [16]:
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(w) for w in words_only]
print("Original vs Stemmed:")
for i in range(10):
    print(words_only[i], "→", stemmed_tokens[i])

Original vs Stemmed:
emma → emma
by → by
jane → jane
austen → austen
volume → volum
i → i
chapter → chapter
i → i
emma → emma
woodhouse → woodhous


**Lemmatization**

In [17]:
lemmatizer = WordNetLemmatizer()
lemm_tokens = [lemmatizer.lemmatize(w) for w in words_only]
print("Original vs Lemmatized:")
for i in range(10):
    print(words_only[i], "→", lemm_tokens[i])

Original vs Lemmatized:
emma → emma
by → by
jane → jane
austen → austen
volume → volume
i → i
chapter → chapter
i → i
emma → emma
woodhouse → woodhouse


**Summary**

In [18]:
summary = {
    "characters": len(text),
    "sentences": len(sentences),
    "tokens_including_punct": len(tokens),
    "words_only": len(words_only),
    "avg_word_length": round(float(np.mean(word_lengths)), 2),
    #"unique_words": unique_tokens,
    "avg_sentence_length_words": round(float(np.mean(sentence_lengths)), 2),
}
summary

{'characters': 887071,
 'sentences': 7493,
 'tokens_including_punct': 191855,
 'words_only': 157114,
 'avg_word_length': 4.25,
 'avg_sentence_length_words': 21.06}
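
The summary leaves `unique_words` commented out, and the introduction lists vocabulary richness among the things this corpus is useful for. One simple measure is the type-token ratio: the number of distinct words divided by the total number of words. The sketch below computes it on the 20 sample words printed earlier; the variable names are illustrative. The ratio falls as a text gets longer, so values are only comparable between samples of similar size.

```python
# The first 20 cleaned words, copied from the "Sample words" output earlier.
sample_words = ["emma", "by", "jane", "austen", "volume", "i", "chapter", "i",
                "emma", "woodhouse", "handsome", "clever", "and", "rich", "with",
                "a", "comfortable", "home", "and", "happy"]

unique_words = len(set(sample_words))                 # number of distinct word types
type_token_ratio = unique_words / len(sample_words)   # types divided by tokens

print("Unique words:", unique_words)
print("Type-token ratio:", round(type_token_ratio, 2))
```

On the full text, the same calculation would use `len(set(words_only)) / len(words_only)`.
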

**Content Generation**

In [19]:
def generate_random_text(word_list, length=50):
    # Each word is drawn independently; repeats in word_list make frequent words more likely
    return " ".join(random.choice(word_list) for _ in range(length))

generated_text = generate_random_text(words_only, 50)
print(generated_text)

felicity wife good marriage you it it of her as narration of but any who an reply by injury but how daily and weymouth has whose but wish walking the did smith her i a a never of set hence henceforward me no shocked it graciously that great vexed one
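
Because each word is drawn independently from a list that contains repeats, frequent words appear more often, but the output has no grammar and changes on every run. A seeded variant makes the result reproducible. This is a minimal sketch: the `seed` parameter and the short `sample_words` list are additions for illustration and are not in the notebook.

```python
import random

def generate_random_text(word_list, length=50, seed=None):
    """Join `length` words drawn independently (with replacement) from word_list."""
    rng = random.Random(seed)  # a private generator, so the global random state is untouched
    return " ".join(rng.choice(word_list) for _ in range(length))

# Tiny illustrative word list; repeated entries are more likely to be drawn.
sample_words = ["emma", "and", "and", "and", "harriet", "the", "the", "weston"]

first = generate_random_text(sample_words, length=10, seed=42)
second = generate_random_text(sample_words, length=10, seed=42)
print(first)
print("Same text for the same seed:", first == second)
```
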


**POS Tagging**

In [25]:
sentence = "nlp is good approach for analysis"
pos_tokens = nltk.word_tokenize(sentence)  # separate name so the Emma `tokens` list is not overwritten
pos_tags = nltk.pos_tag(pos_tokens)
print(pos_tags)

[('nlp', 'NN'), ('is', 'VBZ'), ('good', 'JJ'), ('approach', 'NN'), ('for', 'IN'), ('analysis', 'NN')]
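
The tags in this output follow the Penn Treebank scheme: `NN` is a singular noun, `VBZ` a third-person singular present verb, `JJ` an adjective, and `IN` a preposition or subordinating conjunction. A common next step is to collapse these tags into the coarse part-of-speech letters that WordNet uses (`n`, `v`, `a`, `r`). The sketch below does that for the tagged output above. The helper `penn_to_wordnet` is hypothetical, and falling back to `n` for tags without a WordNet counterpart is an assumption.

```python
# Tagged output copied from the cell above.
pos_tags = [("nlp", "NN"), ("is", "VBZ"), ("good", "JJ"),
            ("approach", "NN"), ("for", "IN"), ("analysis", "NN")]

def penn_to_wordnet(tag: str, default: str = "n") -> str:
    """Hypothetical helper: map a Penn Treebank tag to a WordNet POS letter."""
    # Adjective, verb, noun, and adverb tags start with J, V, N, and R respectively.
    prefix_map = {"J": "a", "V": "v", "N": "n", "R": "r"}
    return prefix_map.get(tag[:1], default)

for word, tag in pos_tags:
    print(f"{word:<10}{tag:<5}{penn_to_wordnet(tag)}")
```
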


In [21]:
# Create 10 questions for the dataset (questions & answers)

In [22]:
sent4 = sentences[:4]

**Using Stemmer**

In [30]:
corpus = []
stop_words = set(stopwords.words('english'))

ps = PorterStemmer()
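for s in sent4:
    s = re.sub(r"[^\w\s]", " ", s)  # replace punctuation with spaces
    toks = (w.lower() for w in s.split())  # lowercase each whitespace-separated word
    # drop stop words, then stem what remains
    cleaned = " ".join(ps.stem(w) for w in toks if w not in stop_words)
    corpus.append(cleaned)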

corpus[:4]

['emma jane austen 1816 volum chapter emma woodhous handsom clever rich comfort home happi disposit seem unit best bless exist live nearli twenti one year world littl distress vex',
 'youngest two daughter affection indulg father consequ sister marriag mistress hous earli period',
 'mother die long ago indistinct remembr caress place suppli excel woman gover fallen littl short mother affect',
 'sixteen year miss taylor mr woodhous famili less gover friend fond daughter particularli emma']

**Using Lemmatization**

In [44]:
lemmatizer = WordNetLemmatizer()
corpus = []
for s in sent4[:4]:
    s = re.sub(r"[^\w\s]", " ", s)  # remove punctuation
    words = [lemmatizer.lemmatize(w.lower()) for w in s.split() if w.lower() not in stop_words]
    corpus.append(" ".join(words))
corpus

['emma jane austen 1816 volume chapter emma woodhouse handsome clever rich comfortable home happy disposition seemed unite best blessing existence lived nearly twenty one year world little distress vex',
 'youngest two daughter affectionate indulgent father consequence sister marriage mistress house early period',
 'mother died long ago indistinct remembrance caress place supplied excellent woman governess fallen little short mother affection',
 'sixteen year miss taylor mr woodhouse family less governess friend fond daughter particularly emma']

**Term Frequency**

In [45]:
all_words = " ".join(corpus).split()
tf = Counter(all_words)
tf

Counter({'emma': 3,
         'woodhouse': 2,
         'year': 2,
         'little': 2,
         'daughter': 2,
         'mother': 2,
         'governess': 2,
         'jane': 1,
         'austen': 1,
         '1816': 1,
         'volume': 1,
         'chapter': 1,
         'handsome': 1,
         'clever': 1,
         'rich': 1,
         'comfortable': 1,
         'home': 1,
         'happy': 1,
         'disposition': 1,
         'seemed': 1,
         'unite': 1,
         'best': 1,
         'blessing': 1,
         'existence': 1,
         'lived': 1,
         'nearly': 1,
         'twenty': 1,
         'one': 1,
         'world': 1,
         'distress': 1,
         'vex': 1,
         'youngest': 1,
         'two': 1,
         'affectionate': 1,
         'indulgent': 1,
         'father': 1,
         'consequence': 1,
         'sister': 1,
         'marriage': 1,
         'mistress': 1,
         'house': 1,
         'early': 1,
         'period': 1,
         'died': 1,
         'long'

In [43]:
len(tf)

65

**TF-IDF Vectorization**

In [33]:
vectorizer = TfidfVectorizer(binary=False,min_df=1,max_df=1.0,use_idf=True,ngram_range=(1, 1),norm='l2')
X = vectorizer.fit_transform(corpus[:4])
df = pd.DataFrame(X.toarray(),columns=vectorizer.get_feature_names_out())
df

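| | 1816 | affection | affectionate | ago | austen | best | blessing | caress | chapter | clever | ... | twenty | two | unite | vex | volume | woman | woodhouse | world | year | youngest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.187808 | 0.0 | 0.0 | 0.0 | 0.187808 | 0.187808 | 0.187808 | 0.0 | 0.187808 | 0.187808 | ... | 0.187808 | 0.0 | 0.187808 | 0.187808 | 0.187808 | 0.0 | 0.14807 | 0.187808 | 0.14807 | 0.0 |
| 1 | 0.0 | 0.0 | 0.281477 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.281477 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.281477 |
| 2 | 0.0 | 0.234126 | 0.0 | 0.234126 | 0.0 | 0.0 | 0.0 | 0.234126 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.234126 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.226578 | 0.0 | 0.226578 | 0.0 |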
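
To see where these weights come from, the calculation can be repeated by hand. With `use_idf=True`, `norm='l2'`, and scikit-learn's default smoothing, each weight is the term count multiplied by `ln((1 + n) / (1 + df)) + 1`, and each row is then scaled to unit length. The sketch below applies that to the four lemmatized sentences. Splitting on whitespace is an assumption that works here because none of the sentences contains a single-character token, which `TfidfVectorizer` would ignore by default.

```python
import math
from collections import Counter

# The four lemmatized sentences from the "Using Lemmatization" output above.
docs = [
    "emma jane austen 1816 volume chapter emma woodhouse handsome clever rich "
    "comfortable home happy disposition seemed unite best blessing existence "
    "lived nearly twenty one year world little distress vex",
    "youngest two daughter affectionate indulgent father consequence sister "
    "marriage mistress house early period",
    "mother died long ago indistinct remembrance caress place supplied excellent "
    "woman governess fallen little short mother affection",
    "sixteen year miss taylor mr woodhouse family less governess friend fond "
    "daughter particularly emma",
]

tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: the number of sentences in which each term appears.
doc_freq = Counter(term for toks in tokenized for term in set(toks))

def tfidf_row(tokens):
    """L2-normalised TF-IDF weights for one document, using smoothed IDF."""
    tf = Counter(tokens)
    weights = {t: tf[t] * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1) for t in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

rows = [tfidf_row(toks) for toks in tokenized]
print(round(rows[0]["1816"], 6), round(rows[3]["woodhouse"], 6))
```

The two printed values can be compared with the `1816` column of row 0 and the `woodhouse` column of row 3 in the table above.
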
