<center> Title : Module 6 Assignment



---






Name : Sasank Yadav Daliboyina

Date : 12.17.2023

Professor : Mitch Harris

EAI 6000-Fundamentals of Artificial Intelligence.

Northeastern University.

NUId : 002612278


**INTRODUCTION :**  This notebook applies basic Natural Language Processing (NLP) techniques from Python's NLTK library to Jane Austen's "Emma", taken from NLTK's Project Gutenberg sample. The text is first tokenized into sentences and words, and common stopwords are removed so that the remaining tokens carry more of the meaning. Stemming then reduces words to crude stems, while lemmatization uses part-of-speech tags to map each word to its dictionary form. A bag-of-words representation turns the text into counts that machine learning models can consume. Together these steps cover the core preprocessing techniques for working with language data.

**Step 1: Import and Download Necessary NLTK Resources**

In [13]:
import nltk
from nltk.corpus import gutenberg

nltk.download('gutenberg')


[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.


True

**Step 2: Explore the Gutenberg Corpus**

In [17]:
# List available texts in the Gutenberg Corpus
print(gutenberg.fileids())

# Choose "Emma" by Jane Austen
austen_text = gutenberg.raw('austen-emma.txt')

# Display the first 500 characters
print(austen_text[:500])



['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
[Emma by Jane Austen 1816]

VOLUME I

CHAPTER I


Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
with very little to distress or vex her.

She was the youngest of the two daughters of a most affectionate,
indulgent father; and had, in consequence of her sister's marriage,
been mistress of his house from a very early period.  Her mother
had died t


**Step 3: Tokenization**

In [18]:
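from nltk.tokenize import sent_tokenize, word_tokenize

# Both tokenizers rely on NLTK's Punkt models; newer NLTK releases may ask for 'punkt_tab' instead
nltk.download('punkt')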

# Sentence Tokenization
sentences = sent_tokenize(austen_text[:5000])  # Limiting to the first 5000 characters
print("First 5 sentences:", sentences[:5])

# Word Tokenization
words = word_tokenize(austen_text[:5000])
print("First 50 words:", words[:50])


First 5 sentences: ['[Emma by Jane Austen 1816]\n\nVOLUME I\n\nCHAPTER I\n\n\nEmma Woodhouse, handsome, clever, and rich, with a comfortable home\nand happy disposition, seemed to unite some of the best blessings\nof existence; and had lived nearly twenty-one years in the world\nwith very little to distress or vex her.', "She was the youngest of the two daughters of a most affectionate,\nindulgent father; and had, in consequence of her sister's marriage,\nbeen mistress of his house from a very early period.", 'Her mother\nhad died too long ago for her to have more than an indistinct\nremembrance of her caresses; and her place had been supplied\nby an excellent woman as governess, who had fallen little short\nof a mother in affection.', "Sixteen years had Miss Taylor been in Mr. Woodhouse's family,\nless as a governess than a friend, very fond of both daughters,\nbut particularly of Emma.", 'Between _them_ it was more the intimacy\nof sisters.']
First 50 words: ['[', 'Emma', 'by', 'Jane
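To make the effect of word tokenization on punctuation concrete, the sketch below uses a plain regular expression as a rough stand-in for `word_tokenize`. The pattern and the helper name `simple_word_tokenize` are assumptions made for illustration; NLTK's tokenizer applies many more rules (for instance, it splits off clitics such as `'s`).

```python
import re

# Hypothetical helper: a rough regex approximation of word tokenization.
# It keeps hyphenated words together and emits each punctuation mark
# as its own token. It is NOT the algorithm behind nltk.word_tokenize.
TOKEN_PATTERN = re.compile(r"\w+(?:-\w+)*|[^\w\s]")

def simple_word_tokenize(text):
    return TOKEN_PATTERN.findall(text)

sample = "Emma Woodhouse, handsome, clever, and rich, had lived nearly twenty-one years."
tokens = simple_word_tokenize(sample)
print(tokens)

# Naive whitespace splitting leaves punctuation glued to the words
print(sample.split()[:3])
```

Separating punctuation into its own tokens is what lets later steps treat `Woodhouse` and `Woodhouse,` as the same word.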

**Step 4: Removing Stopwords**

In [19]:
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Filter out stopwords from the first 500 words
filtered_words = [word for word in words[:500] if word.lower() not in stop_words]
print("First 50 words without stopwords:", filtered_words[:50])


First 50 words without stopwords: ['[', 'Emma', 'Jane', 'Austen', '1816', ']', 'VOLUME', 'CHAPTER', 'Emma', 'Woodhouse', ',', 'handsome', ',', 'clever', ',', 'rich', ',', 'comfortable', 'home', 'happy', 'disposition', ',', 'seemed', 'unite', 'best', 'blessings', 'existence', ';', 'lived', 'nearly', 'twenty-one', 'years', 'world', 'little', 'distress', 'vex', '.', 'youngest', 'two', 'daughters', 'affectionate', ',', 'indulgent', 'father', ';', ',', 'consequence', 'sister', "'s", 'marriage']


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
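The filtered output above still contains tokens such as `[`, `,` and `;`, because the stopword list holds words only. One possible extra filter, sketched here under the assumption that a useful token has at least one letter or digit, drops pure punctuation. The helper `has_alnum` is hypothetical, and the token list is a selection copied from the output above.

```python
# A selection of tokens copied from the filtered output above
filtered_sample = ['[', 'Emma', 'Jane', 'Austen', '1816', ']', 'VOLUME',
                   'CHAPTER', 'Emma', 'Woodhouse', ',', 'handsome', ',',
                   'clever', ',', 'rich', ',', 'twenty-one', "'s"]

def has_alnum(token):
    """Return True if the token contains at least one letter or digit."""
    return any(ch.isalnum() for ch in token)

# Keep only tokens that carry some alphanumeric content
content_tokens = [tok for tok in filtered_sample if has_alnum(tok)]
print(content_tokens)
```

A stricter test such as `str.isalpha()` would also discard `1816` and `twenty-one`, so the choice of filter depends on whether numbers and hyphenated words matter for the analysis.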


**Step 5: Stemming**

In [20]:
from nltk.stem import PorterStemmer

porter = PorterStemmer()
stemmed_words = [porter.stem(word) for word in filtered_words]
print("First 50 stemmed words:", stemmed_words[:50])


First 50 stemmed words: ['[', 'emma', 'jane', 'austen', '1816', ']', 'volum', 'chapter', 'emma', 'woodhous', ',', 'handsom', ',', 'clever', ',', 'rich', ',', 'comfort', 'home', 'happi', 'disposit', ',', 'seem', 'unit', 'best', 'bless', 'exist', ';', 'live', 'nearli', 'twenty-on', 'year', 'world', 'littl', 'distress', 'vex', '.', 'youngest', 'two', 'daughter', 'affection', ',', 'indulg', 'father', ';', ',', 'consequ', 'sister', "'s", 'marriag']
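The point of stemming is that related word forms collapse onto one stem. The short sketch below illustrates this with the same `PorterStemmer`; the word family is an illustrative choice and is not taken from "Emma".

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()

# A family of related word forms (illustrative, not from the novel)
family = ['connect', 'connected', 'connecting', 'connection', 'connections']
stems = {word: porter.stem(word) for word in family}
print(stems)

# Stems need not be dictionary words, as the output above also shows
print(porter.stem('happy'))
```

Because every form maps to the same stem, a later count treats them as one term, at the cost of stems such as `happi` that are not real words.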


**Step 6: From Strings to Vectors (Bag of Words)**

In [21]:
from collections import Counter

# Creating a bag of words
bow = Counter(stemmed_words)
print("Bag of Words Count:", bow.most_common(10))


Bag of Words Count: [(',', 31), ('.', 15), (';', 8), ('emma', 6), ("'s", 6), ('miss', 5), ('taylor', 5), ('friend', 5), ('littl', 3), ('father', 3)]
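A `Counter` maps each token to its count, but it is not yet a vector. A minimal sketch of the missing step follows, assuming a shared, sorted vocabulary across documents; the helper `to_vector` and the two tiny token lists are hypothetical stand-ins for processed text.

```python
from collections import Counter

def to_vector(counts, vocabulary):
    """Return the counts as a list ordered by the given vocabulary."""
    return [counts.get(term, 0) for term in vocabulary]

# Two tiny token lists standing in for processed documents (illustrative)
doc_a = ['emma', 'miss', 'taylor', 'friend', 'emma']
doc_b = ['father', 'friend', 'emma']

bow_a, bow_b = Counter(doc_a), Counter(doc_b)

# A shared, sorted vocabulary gives every document the same dimensions
vocabulary = sorted(set(bow_a) | set(bow_b))
vec_a = to_vector(bow_a, vocabulary)
vec_b = to_vector(bow_b, vocabulary)

print(vocabulary)
print(vec_a)
print(vec_b)
```

Fixing the term order is what makes the vectors comparable: position `i` refers to the same term in every document.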


**Step 7: Vectorization with sklearn**

In [22]:
# Vectorizing using CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)

print("Vectorized text shape:", X.shape)


Vectorized text shape: (26, 389)
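The shape `(26, 389)` reads as 26 sentences (rows) by 389 distinct terms (columns). To make the rows and columns concrete, the sketch below builds the same kind of count matrix in plain Python. The regular expression follows the spirit of `CountVectorizer`'s defaults (lowercasing, tokens of two or more word characters); everything else is simplified, and `build_count_matrix` is a hypothetical helper rather than part of scikit-learn.

```python
import re

# Lowercased tokens of two or more word characters, in the spirit of
# CountVectorizer's defaults; this is a simplified stand-in, not sklearn.
TOKEN_RE = re.compile(r"\b\w\w+\b")

def build_count_matrix(documents):
    """Return (vocabulary, matrix) with one row per document."""
    tokenized = [TOKEN_RE.findall(doc.lower()) for doc in documents]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    index = {term: col for col, term in enumerate(vocabulary)}
    matrix = [[0] * len(vocabulary) for _ in documents]
    for row, doc in enumerate(tokenized):
        for tok in doc:
            matrix[row][index[tok]] += 1
    return vocabulary, matrix

docs = [
    "She was the youngest of the two daughters.",
    "Between them it was more the intimacy of sisters.",
]
vocabulary, matrix = build_count_matrix(docs)
print("shape:", (len(matrix), len(vocabulary)))
print(vocabulary)
```

Unlike this dense list of lists, `CountVectorizer` returns a sparse matrix, which matters once the vocabulary grows to thousands of terms.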


**Step 8: Lemmatization**

Lemmatization is a more sophisticated approach than stemming. Instead of stripping suffixes, it uses each word's part of speech to look up its dictionary form (lemma), so the result is normally a real word rather than a truncated stem.

In [23]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk import pos_tag

nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

def get_wordnet_pos(treebank_tag):
    """Convert the part-of-speech naming scheme from the Penn Treebank tag to a WordNet tag."""
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(word, get_wordnet_pos(tag)) for word, tag in pos_tag(words[:500])]
print("Lemmatized words:", lemmatized[:50])


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...


Lemmatized words: ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', 'VOLUME', 'I', 'CHAPTER', 'I', 'Emma', 'Woodhouse', ',', 'handsome', ',', 'clever', ',', 'and', 'rich', ',', 'with', 'a', 'comfortable', 'home', 'and', 'happy', 'disposition', ',', 'seem', 'to', 'unite', 'some', 'of', 'the', 'best', 'blessing', 'of', 'existence', ';', 'and', 'have', 'live', 'nearly', 'twenty-one', 'year', 'in', 'the', 'world', 'with']
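The lemma that comes back depends on the part of speech passed to the lemmatizer. The sketch below restates the tag mapping with the plain single-letter codes that WordNet uses, so it runs without any corpus download. The `(word, tag)` pairs are assumed Penn Treebank tags chosen for illustration, not actual `pos_tag` output, and `penn_to_wordnet` is a hypothetical name.

```python
# WordNet's part-of-speech codes as plain strings, matching
# wordnet.ADJ, wordnet.VERB, wordnet.NOUN and wordnet.ADV in NLTK
PREFIX_TO_WORDNET = {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}

def penn_to_wordnet(treebank_tag, default='n'):
    """Map a Penn Treebank tag to a WordNet POS code by its first letter."""
    return PREFIX_TO_WORDNET.get(treebank_tag[:1], default)

# Hypothetical (word, tag) pairs; a real run would take them from pos_tag
tagged = [('youngest', 'JJS'), ('had', 'VBD'), ('blessings', 'NNS'),
          ('nearly', 'RB'), ('of', 'IN')]
for word, tag in tagged:
    print(word, tag, penn_to_wordnet(tag))
```

This is consistent with the outputs in this notebook: `youngest` becomes `young` and `had` becomes `have`, which suggests they were lemmatized as an adjective and a verb. Tags outside the four groups fall back to the noun code.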


**Step 9: Using a Stronger List of Stopwords**

Combining NLTK's stopwords with other sources for a more comprehensive list:

In [24]:
# Example using an external list of stopwords
extra_stopwords = ['example', 'stopword1', 'stopword2']  # This is an illustrative list

all_stopwords = set(stopwords.words('english')).union(set(extra_stopwords))
filtered_words = [word for word in lemmatized if word.lower() not in all_stopwords]
print("Filtered words:", filtered_words[:50])


Filtered words: ['[', 'Emma', 'Jane', 'Austen', '1816', ']', 'VOLUME', 'CHAPTER', 'Emma', 'Woodhouse', ',', 'handsome', ',', 'clever', ',', 'rich', ',', 'comfortable', 'home', 'happy', 'disposition', ',', 'seem', 'unite', 'best', 'blessing', 'existence', ';', 'live', 'nearly', 'twenty-one', 'year', 'world', 'little', 'distress', 'vex', '.', 'young', 'two', 'daughter', 'affectionate', ',', 'indulgent', 'father', ';', ',', 'consequence', 'sister', "'s", 'marriage']
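When stopwords come from several sources, their casing and whitespace may differ. A minimal sketch of one way to merge them follows, assuming each source is an iterable of strings; `merge_stopwords` and both word lists are hypothetical and stand in for real lists such as NLTK's.

```python
def merge_stopwords(*sources):
    """Combine several iterables of stopwords into one lowercase set."""
    merged = set()
    for source in sources:
        # Normalize case and whitespace, and skip empty entries
        merged.update(word.strip().lower() for word in source if word.strip())
    return merged

# Illustrative stand-ins: a base list and a hypothetical domain list
base_list = ['the', 'and', 'of', 'a']
period_terms = ['Mr.', 'Mrs.', 'Miss ', '']

all_stops = merge_stopwords(base_list, period_terms)
tokens = ['Miss', 'Taylor', 'and', 'Mr.', 'Woodhouse']
kept = [tok for tok in tokens if tok.lower() not in all_stops]
print(kept)
```

Whether titles such as `Miss` belong on the list is a judgment call; in a novel they may be frequent without saying much about the content.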


**Conclusion:**

In the above analysis, we employed several fundamental natural language processing (NLP) techniques to process and analyze text data, using Python's NLTK library and texts from the Project Gutenberg corpus. Specifically, we:

Downloaded and Explored Text Data: We initially aimed to use "Pride and Prejudice" by Jane Austen but switched to "Emma" by the same author, because "Pride and Prejudice" is not among the texts in NLTK's Gutenberg sample (see the file list in Step 2). This change demonstrates the flexibility required in data analysis when dealing with text corpora.

Tokenization: We split the text into sentences and words. This step is crucial for breaking down large text blocks into manageable units for further analysis.

Stopwords Removal: We filtered out common stopwords (words with little semantic value) from our text data. This process helps focus on the more meaningful content in the text.

Stemming: We applied the Porter stemmer to reduce words to their stems. This consolidates different forms of a word into one token, although the stems themselves (for example "happi" or "volum") are not always dictionary words.

Vectorization (Bag of Words): We counted the stemmed tokens with a Counter and, separately, used CountVectorizer to turn the 26 tokenized sentences into a 26 x 389 matrix of term counts, a numerical format suitable for machine learning algorithms.

Lemmatization: As an advanced step, we performed lemmatization to reduce words to their dictionary form, using part-of-speech tags to choose the right lemma. This process is more sophisticated than stemming and yields real words rather than truncated stems.

Advanced Stopword Handling: We explored the idea of extending the list of stopwords by combining NLTK's list with additional entries, shown here with placeholder words, to create a more comprehensive filter.

Application to a Machine Learning Task: The count matrix from Step 7 is in the form that classifiers such as Naive Bayes accept as input. Training such a classifier on labeled text is the natural next step; it is not part of the code shown above.

Throughout this process, we demonstrated key NLP techniques and their applications in text analysis and machine learning. These methods form the foundation of many advanced NLP applications and are essential for anyone looking to delve into text analytics or develop NLP-driven models.