# Week 6
## NLP PART 2.


<div class="alert alert-block alert-info">
<b>Step 1: Download Required Resources</b><br> Download necessary NLTK datasets, including the inaugural speeches, POS tagger, WordNet for lemmatization, and stopwords.</div>

In [12]:
import nltk
import re
from nltk.corpus import inaugural
from collections import Counter
from nltk.stem import WordNetLemmatizer
import matplotlib.pyplot as plt

In [13]:
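nltk.download('inaugural')  # Corpus of inaugural speeches
nltk.download('punkt')  # Tokenizer models needed by sent_tokenize/word_tokenize in Step 4
# Some newer NLTK releases may ask for a different tokenizer resource name; download whatever the LookupError reports
nltk.download('averaged_perceptron_tagger')  # For POS tagging
nltk.download('wordnet')  # WordNet for lemmatization
nltk.download('stopwords')  # Stopwords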

[nltk_data] Downloading package inaugural to
[nltk_data]     C:\Users\sumin\AppData\Roaming\nltk_data...
[nltk_data]   Package inaugural is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\sumin\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\sumin\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sumin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Let's see which speeches are available in the inaugural corpus!

<div class="alert alert-block alert-info">
<b>Step 2: Choose a Speech</b><br> Select a specific speech for analysis.</div>

In [4]:
available_speeches = inaugural.fileids()
available_speeches

['1789-Washington.txt',
 '1793-Washington.txt',
 '1797-Adams.txt',
 '1801-Jefferson.txt',
 '1805-Jefferson.txt',
 '1809-Madison.txt',
 '1813-Madison.txt',
 '1817-Monroe.txt',
 '1821-Monroe.txt',
 '1825-Adams.txt',
 '1829-Jackson.txt',
 '1833-Jackson.txt',
 '1837-VanBuren.txt',
 '1841-Harrison.txt',
 '1845-Polk.txt',
 '1849-Taylor.txt',
 '1853-Pierce.txt',
 '1857-Buchanan.txt',
 '1861-Lincoln.txt',
 '1865-Lincoln.txt',
 '1869-Grant.txt',
 '1873-Grant.txt',
 '1877-Hayes.txt',
 '1881-Garfield.txt',
 '1885-Cleveland.txt',
 '1889-Harrison.txt',
 '1893-Cleveland.txt',
 '1897-McKinley.txt',
 '1901-McKinley.txt',
 '1905-Roosevelt.txt',
 '1909-Taft.txt',
 '1913-Wilson.txt',
 '1917-Wilson.txt',
 '1921-Harding.txt',
 '1925-Coolidge.txt',
 '1929-Hoover.txt',
 '1933-Roosevelt.txt',
 '1937-Roosevelt.txt',
 '1941-Roosevelt.txt',
 '1945-Roosevelt.txt',
 '1949-Truman.txt',
 '1953-Eisenhower.txt',
 '1957-Eisenhower.txt',
 '1961-Kennedy.txt',
 '1965-Johnson.txt',
 '1969-Nixon.txt',
 '1973-Nixon.txt',
 '1

In [8]:
speech = inaugural.raw(fileids='2009-Obama.txt')
print(speech)

My fellow citizens:

I stand here today humbled by the task before us, grateful for the trust you have bestowed, mindful of the sacrifices borne by our ancestors. I thank President Bush for his service to our nation, as well as the generosity and cooperation he has shown throughout this transition.

Forty-four Americans have now taken the presidential oath. The words have been spoken during rising tides of prosperity and the still waters of peace. Yet, every so often the oath is taken amidst gathering clouds and raging storms. At these moments, America has carried on not simply because of the skill or vision of those in high office, but because We the People have remained faithful to the ideals of our forbearers, and true to our founding documents.

So it has been. So it must be with this generation of Americans.

That we are in the midst of crisis is now well understood. Our nation is at war, against a far-reaching network of violence and hatred. Our economy is badly weakened, a conse

<div class="alert alert-block alert-info">
<b>Step 3: Clean the Text</b><br> Remove numbers and unnecessary whitespace to prepare the text for tokenization.</div>

In [9]:
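def clean_speech(text):
    # Remove every run of digits (the r prefix makes a raw string, so \d reaches the regex engine unchanged)
    text = re.sub(r'\d+', '', text)
    # Collapse runs of whitespace, including newlines, into single spaces and trim both ends
    text = re.sub(r'\s+', ' ', text).strip()

    return text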
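
To see what the two substitutions do, here is a minimal sketch on a made-up sample string (the sample text is an assumption for illustration, not part of the speech). The function is repeated so the block runs on its own.

```python
import re

def clean_speech(text):
    # Drop digit runs, then collapse all whitespace to single spaces
    text = re.sub(r"\d+", "", text)
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical sample with digits, blank lines, and repeated spaces
sample = "In 2009,\n\n  44 presidents   had taken the oath.\n"
cleaned_sample = clean_speech(sample)
print(cleaned_sample)  # In , presidents had taken the oath.
```

Note the side effect: deleting digits can leave stray punctuation behind (`In ,`), which the later punctuation step has to handle.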


In [10]:
cleaned_speech = clean_speech(speech)
print(cleaned_speech)

My fellow citizens: I stand here today humbled by the task before us, grateful for the trust you have bestowed, mindful of the sacrifices borne by our ancestors. I thank President Bush for his service to our nation, as well as the generosity and cooperation he has shown throughout this transition. Forty-four Americans have now taken the presidential oath. The words have been spoken during rising tides of prosperity and the still waters of peace. Yet, every so often the oath is taken amidst gathering clouds and raging storms. At these moments, America has carried on not simply because of the skill or vision of those in high office, but because We the People have remained faithful to the ideals of our forbearers, and true to our founding documents. So it has been. So it must be with this generation of Americans. That we are in the midst of crisis is now well understood. Our nation is at war, against a far-reaching network of violence and hatred. Our economy is badly weakened, a consequen

<div class="alert alert-block alert-info">
<b>Step 4: Tokenize Text into Sentences and Words</b><br>Break the cleaned text into individual sentences and further tokenize each sentence into words.</div>

In [14]:
sentences = nltk.sent_tokenize(cleaned_speech)
words = [nltk.word_tokenize(sentence) for sentence in sentences]
words

LookupError: 
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/english.pickle

  Searched in:
    - 'C:\\Users\\sumin/nltk_data'
    - 'C:\\Users\\sumin\\anaconda3\\nltk_data'
    - 'C:\\Users\\sumin\\anaconda3\\share\\nltk_data'
    - 'C:\\Users\\sumin\\anaconda3\\lib\\nltk_data'
    - 'C:\\Users\\sumin\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
    - ''
**********************************************************************
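
Because the cell above failed, it may help to see the shape of the result that Step 4 is meant to produce: a list of sentences, each a list of tokens. The sketch below uses plain regular expressions as a stand-in. It is not the Punkt tokenizer, and `naive_sent_tokenize` and `naive_word_tokenize` are hypothetical helper names.

```python
import re

def naive_sent_tokenize(text):
    # Split after ., ! or ? when followed by whitespace (no abbreviation handling)
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def naive_word_tokenize(sentence):
    # A word (keeping internal hyphens/apostrophes) or a single punctuation mark
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", sentence)

sample = "So it has been. So it must be with this generation of Americans."
sample_words = [naive_word_tokenize(s) for s in naive_sent_tokenize(sample)]
print(sample_words)
# [['So', 'it', 'has', 'been', '.'],
#  ['So', 'it', 'must', 'be', 'with', 'this', 'generation', 'of', 'Americans', '.']]
```

The real `nltk.sent_tokenize` and `nltk.word_tokenize` handle abbreviations and contractions differently, so treat this only as a picture of the nested-list structure that the POS tagging step expects.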


<div class="alert alert-block alert-info">
<b>Step 5: Apply POS Tagging</b><br> Perform Part-of-Speech (POS) tagging on the tokenized words to identify their grammatical roles.</div>

In [15]:
pos_tagged_words = [nltk.pos_tag(word_list) for word_list in words]
print(pos_tagged_words)

NameError: name 'words' is not defined

<div class="alert alert-block alert-info">
<b>Step 6: Chunking into Phrases</b><br> 
   Use a defined chunk grammar to group words into phrases such as noun phrases (NP) and verb phrases (VP), and visualize the chunk tree for the first sentence.</div>

In [16]:
# Step 6: Define the chunk grammar rules
chunk_grammar = r"""
  NP: {<DT>?<JJ>*<NN.*>}  # Noun Phrase
  VP: {<VB.*><RB.*>*}     # Verb Phrase
  ADJP: {<JJ><CC>?<JJ>*}  # Adjective Phrase

"""
#  PP: {<IN><NP>}          # Prepositional Phrase
# ADVP: {<RB><RB.*>*}     # Adverb Phrase
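
To build intuition for the NP rule, here is a rough pure-Python approximation that matches the same tag pattern (optional `DT`, any number of `JJ`, then one `NN*` tag) over a hand-tagged phrase. It is not `nltk.RegexpParser`; the helper name `chunk_noun_phrases` and the tags on the sample are assumptions made for illustration.

```python
import re

def chunk_noun_phrases(tagged):
    # Encode the tag sequence as one string, e.g. "<DT><JJ><NNS>"
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    # Same shape as NP: {<DT>?<JJ>*<NN.*>}
    pattern = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN[^>]*>")
    phrases = []
    for match in pattern.finditer(tag_string):
        start = tag_string[:match.start()].count("<")  # index of first token in the match
        length = match.group().count("<")              # number of tokens matched
        phrases.append([word for word, _ in tagged[start:start + length]])
    return phrases

# Hand-tagged fragment (tags assumed, not produced by nltk.pos_tag)
fragment = [("during", "IN"), ("rising", "VBG"), ("tides", "NNS"), ("and", "CC"),
            ("the", "DT"), ("still", "JJ"), ("waters", "NNS")]
noun_phrases = chunk_noun_phrases(fragment)
print(noun_phrases)  # [['tides'], ['the', 'still', 'waters']]
```

Because the rule ends in a single `<NN.*>`, two nouns in a row come out as two separate noun phrases rather than one.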


In [18]:
chunk_parser = nltk.RegexpParser(chunk_grammar)
chunk_parser

<chunk.RegexpParser with 3 stages>

In [19]:
first_sentence_tree = chunk_parser.parse(pos_tagged_words[0])

NameError: name 'pos_tagged_words' is not defined

In [20]:
first_sentence_tree.draw()

NameError: name 'first_sentence_tree' is not defined

<div class="alert alert-block alert-info">
<b>Step 7: Lemmatization Based on POS Tags</b><br> 
   Lemmatize the words using the POS tags to reduce them to their base forms, which helps with consistency in analysis.</div>

In [21]:
lemmatizer = WordNetLemmatizer()

In [22]:
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return nltk.corpus.wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return nltk.corpus.wordnet.VERB
    elif treebank_tag.startswith('N'):
        return nltk.corpus.wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return nltk.corpus.wordnet.ADV
    else:
        return nltk.corpus.wordnet.NOUN
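
The function above only looks at the first letter of the Penn Treebank tag. A compact sketch of the same mapping as a dictionary lookup is shown below; the WordNet constants are spelled out as their single-letter values so the block runs without NLTK, and `get_wordnet_pos_sketch` is a hypothetical name.

```python
# Single-letter values of wordnet.ADJ, wordnet.VERB, wordnet.NOUN, wordnet.ADV
ADJ, VERB, NOUN, ADV = "a", "v", "n", "r"

def get_wordnet_pos_sketch(treebank_tag):
    # First letter of the Treebank tag decides the WordNet category; default to noun
    prefix_to_pos = {"J": ADJ, "V": VERB, "N": NOUN, "R": ADV}
    return prefix_to_pos.get(treebank_tag[:1], NOUN)

for tag in ["JJR", "VBD", "NNS", "RB", "DT"]:
    print(tag, "->", get_wordnet_pos_sketch(tag))
# JJR -> a
# VBD -> v
# NNS -> n
# RB -> r
# DT -> n
```

Falling back to noun for tags such as `DT` mirrors the `else` branch above, since noun is also the lemmatizer's default part of speech.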

In [23]:
lemmatized_words = []

for sentence in pos_tagged_words:
    # Each sentence is a list of (word, pos) tuples: [(word, pos), (word, pos), ...]
    lemmatized_sentence = []
    for word, pos in sentence:
        lemmatized_word = lemmatizer.lemmatize(word.lower(), get_wordnet_pos(pos))
        lemmatized_sentence.append((lemmatized_word, pos))  # pass a single tuple as the argument
    lemmatized_words.extend(lemmatized_sentence)  # flatten into one list across all sentences

NameError: name 'pos_tagged_words' is not defined

In [24]:
lemmatized_words

[]

<div class="alert alert-block alert-info">
<b>Step 8: Remove Stopwords</b><br> 
   Filter out common stopwords (like 'the', 'and', 'us') to focus on the most meaningful words.</div>

In [34]:
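stop_words = set(nltk.corpus.stopwords.words('english'))

# Extra words that are frequent in this speech but not informative
custom_stopwords = ['america', 'nation']
stop_words.update(custom_stopwords)

# Keep only (word, pos) pairs whose word is not a stopword
filtered_words = [(word, pos) for word, pos in lemmatized_words if word not in stop_words]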


SyntaxError: invalid syntax (613174406.py, line 3)

In [35]:
filtered_words

[]
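
Since the cell above produced an empty list, here is a small sketch of the filtering step on hand-made data. The stopword set is a tiny hypothetical stand-in for NLTK's English list, and the tagged lemmas are assumed for illustration.

```python
# Tiny stand-in for set(nltk.corpus.stopwords.words('english')) (assumption)
stop_words_sketch = {"the", "and", "us", "of", "we", "be"}
stop_words_sketch.update(["america", "nation"])  # custom additions, as in the cell above

# Hand-made (lemma, pos) pairs, already lowercased
lemmas = [("we", "PRP"), ("the", "DT"), ("people", "NNS"),
          ("nation", "NN"), ("remain", "VBN"), ("faithful", "JJ")]

filtered_sample = [(word, pos) for word, pos in lemmas if word not in stop_words_sketch]
print(filtered_sample)
# [('people', 'NNS'), ('remain', 'VBN'), ('faithful', 'JJ')]
```

Using a `set` keeps each membership test fast, and the comparison only works because the words were lowercased during lemmatization.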

<div class="alert alert-block alert-info">
<b>Step 9: Remove Punctuation</b><br> 
   Remove any remaining punctuation to ensure clean and uniform word tokens.</div>

In [36]:
final_filtered_words = []

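for word, pos in filtered_words:
    # Strip every non-alphanumeric character; tokens that were pure punctuation become empty
    cleaned_word = re.sub(r'[^a-zA-Z0-9]', '', word)
    if cleaned_word:
        final_filtered_words.append((cleaned_word, pos))  # append takes one argument, so pass a tuple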
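
A quick sketch of this step on a few hand-made tokens (the tokens and tags are assumptions) shows which entries survive and how words with internal punctuation change.

```python
import re

# Hand-made (word, pos) pairs including punctuation tokens
tokens = [("far-reaching", "JJ"), (",", ","), ("nation", "NN"),
          (".", "."), ("people's", "NNS")]

cleaned_tokens = []
for word, pos in tokens:
    stripped = re.sub(r"[^a-zA-Z0-9]", "", word)  # remove anything that is not a letter or digit
    if stripped:                                  # skip tokens that were only punctuation
        cleaned_tokens.append((stripped, pos))

print(cleaned_tokens)
# [('farreaching', 'JJ'), ('nation', 'NN'), ('peoples', 'NNS')]
```

Hyphens and apostrophes inside words are removed too, so `far-reaching` becomes `farreaching`. If that matters for the analysis, the character class would need to allow `-` and `'`.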

<div class="alert alert-block alert-info">
<b>Step 10: Separate Words into Categories (Nouns, Verbs, Adjectives)</b><br> 
   Separate the cleaned and lemmatized words into categories based on their POS tags.</div>

In [37]:
# Penn Treebank tags: NN* = nouns, JJ* = adjectives, VB* = verbs
nouns = [word for word, pos in final_filtered_words if pos.startswith('NN')]
adjectives = [word for word, pos in final_filtered_words if pos.startswith('JJ')]
verbs = [word for word, pos in final_filtered_words if pos.startswith('VB')]
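
The prefix test works because Penn Treebank tags share a stem per category (`NN`, `NNS`, `NNP`, `NNPS` for nouns, and so on). A minimal sketch on hand-tagged data, with a hypothetical helper `words_with_prefix`:

```python
def words_with_prefix(pairs, prefix):
    # Keep the words whose POS tag starts with the given prefix
    return [word for word, pos in pairs if pos.startswith(prefix)]

# Hand-made (word, pos) pairs (tags assumed for illustration)
tagged_sample = [("crisis", "NN"), ("americans", "NNPS"), ("carry", "VBN"),
                 ("faithful", "JJ"), ("great", "JJS"), ("today", "RB")]

sample_nouns = words_with_prefix(tagged_sample, "NN")
sample_verbs = words_with_prefix(tagged_sample, "VB")
sample_adjectives = words_with_prefix(tagged_sample, "JJ")
print(sample_nouns, sample_verbs, sample_adjectives)
# ['crisis', 'americans'] ['carry'] ['faithful', 'great']
```

Words tagged with anything else, such as the adverb `today`, fall into none of the three lists.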

<div class="alert alert-block alert-info">
<b>Step 11: Get the Most Common Words for Each Category</b><br> 
   Use a frequency counter to find the most common nouns, verbs, and adjectives.</div>

In [43]:
most_common_nouns = Counter(nouns).most_common(10)
most_common_verbs = Counter(verbs).most_common(10)
most_common_adjectives = Counter(adjectives).most_common(10)

In [41]:
most_common_nouns

[]

In [40]:
most_common_verbs

[]

In [44]:
most_common_adjectives

[]
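
The empty results above follow directly from the empty input lists: `Counter([]).most_common(10)` is simply `[]`. A short sketch with made-up words shows what the output looks like once the earlier steps run.

```python
from collections import Counter

# Made-up noun list standing in for the real `nouns`
nouns_sample = ["people", "work", "people", "crisis", "work", "people"]

top_two = Counter(nouns_sample).most_common(2)
print(top_two)                       # [('people', 3), ('work', 2)]
print(Counter([]).most_common(10))   # []
```

`most_common(n)` returns `(word, count)` pairs sorted from most to least frequent, which is the shape the plotting cells unpack with `zip(*...)`.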

<div class="alert alert-block alert-info">
<b>Step 12: Visualization of Word Frequencies</b><br> 
   Plot bar charts of the top 10 most common nouns, verbs, and adjectives from the speech.</div>


ValueError: not enough values to unpack (expected 2, got 0)

In [47]:
# Split [(word, count), ...] into two parallel tuples for plotting
nouns_list, nouns_counts = zip(*most_common_nouns)
plt.figure(figsize=(12,6))
plt.bar(nouns_list, nouns_counts, color = 'green')
plt.xlabel('Nouns')
plt.ylabel('Frequency')
plt.title("Top 10 most common nouns in Obama's 2009 inaugural speech")

plt.show()

NameError: name 'nouns_list' is not defined

<Figure size 1200x600 with 0 Axes>

In [None]:
verbs_list, verbs_counts = zip(*most_common_verbs)
plt.figure(figsize=(12,6))
plt.bar(verbs_list, verbs_counts, color = 'blue')
plt.xlabel('Verbs')
plt.ylabel('Frequency')
plt.title("Top 10 most common verbs in Obama's 2009 inaugural speech")

plt.show()

In [None]:
adj_list, adj_counts = zip(*most_common_adjectives)
plt.figure(figsize=(12,6))
plt.bar(adj_list, adj_counts, color = 'red')
plt.xlabel('Adjectives')
plt.ylabel('Frequency')
plt.title("Top 10 most common adjectives in Obama's 2009 inaugural speech")

plt.show()
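
The `ValueError: not enough values to unpack` reported in this step comes from calling `zip(*...)` on an empty list, which yields nothing to unpack. One possible guard, sketched with a hypothetical helper `split_pairs`, returns two empty lists in that case so the plotting code can decide whether to draw anything.

```python
def split_pairs(pairs):
    # Turn [(label, count), ...] into (labels, counts); tolerate an empty input
    if not pairs:
        return [], []
    labels, counts = zip(*pairs)
    return list(labels), list(counts)

labels, counts = split_pairs([("people", 3), ("work", 2)])
print(labels, counts)    # ['people', 'work'] [3, 2]
print(split_pairs([]))   # ([], [])
```

With this in place, a cell could check `if labels:` before calling `plt.bar(labels, counts)` and print a short message otherwise.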

<div class="alert alert-block alert-success">
<b> Phew. What we went through today:</b> <br>
1. Clean Text: Remove numbers and unnecessary spaces.<br>
2. Tokenize: Break down the text into sentences and words.<br>
3. POS Tagging: Tag each word with its grammatical role.<br>
4. Chunking: Identify and visualize phrases in the text (e.g., noun phrases).<br>
5. Lemmatize: Reduce words to their base forms for uniformity.<br>
6. Remove Stopwords and Punctuation: Filter out common but less meaningful words and leftover punctuation.<br>
7. Separate into Categories: Group words into nouns, verbs, and adjectives.<br>
8. Visualize Frequencies: Plot charts for the most common nouns, verbs, and adjectives.<br>
9. Final Visualization: Display the chunk tree to understand the structure of phrases.<br>
</div>