
# EAI6010 - Module 3: NLP AI Applications

## Introduction

This report presents code examples for analyzing textual data using the Natural Language Toolkit (NLTK) library in Python. The code covers two main topics: syntax analysis for modal verbs and word frequency analysis for a particular speech.

The first section of the code demonstrates how to install and import the NLTK library, download a corpus, and use it to count the frequency and relative frequency of modal verbs. The second section focuses on analyzing the frequency of long words in a particular speech, identifying the ten most common long words, and finding synonyms for these words using the WordNet database. The report includes code examples for each step of the analysis, along with explanations of the syntax and methodology used.

## Body

#### Question 1A. Download and install the Gutenberg corpus tool to your Jupyter Notebook. You can install the NLTK package and use Gutenberg corpus.

#### Answer 1A:

In [None]:
# @title 1A & 1B Syntax Guide

# Install NLTK package
#!pip install nltk

# Import NLTK
import nltk


#### Question 1B. Download the Gutenberg corpus tool in the NLTK package.

#### Answer 1B:

In [None]:
# @title 1B Syntax Guide

# Download Gutenberg corpus
nltk.download('gutenberg')

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.


True

#### Question 1C. Use the texts in the corpus.

#### Question 1D. Create a table displaying relative frequencies with which “modals” (can, could, may, might, will, would, and should) appear in all texts provided in the corpus.

#### Answer 1C & 1D:

In [None]:
# @title 1C & 1D Syntax Guide

from nltk.corpus import gutenberg
import pandas as pd

# Get all modal verbs
modals = ['can', 'could', 'may', 'might', 'will', 'would', 'should']

# Count the frequency of each modal verb in the corpus
freqs = nltk.FreqDist([w.lower() for w in gutenberg.words() if w.lower() in modals])

# Calculate the total number of words in the corpus
num_words = len(gutenberg.words())

# Create a pandas DataFrame to store the frequency and relative frequency of each modal verb
df = pd.DataFrame({'Frequency': [freqs[modal] for modal in modals],
                   'Relative Frequency': [freqs[modal] / num_words for modal in modals]},
                  index=modals)

# Print the DataFrame
print(df)


        Frequency  Relative Frequency
can          2327            0.000888
could        3594            0.001371
may          2549            0.000972
might        1963            0.000749
will         7368            0.002810
would        4046            0.001543
should       2550            0.000973


The table shows the relative frequencies of the seven modal verbs in the Gutenberg corpus. "Will" is the most frequent, with a relative frequency of 0.002810, followed by "would" (0.001543) and "could" (0.001371); "might" is the least frequent at 0.000749. Each modal on its own therefore accounts for well under 1% of the tokens in the corpus. Summed, the seven relative frequencies come to about 0.0093 (24,397 occurrences), or roughly one token in every 107, which is frequent enough to be worth analyzing.
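Question 1E refers to frequencies relative to the total number of modals rather than to all tokens. As a minimal sketch, assuming the raw counts from the table above, those shares can be computed without NLTK. The names `modal_counts` and `modal_shares` are introduced here for illustration only.

```python
# Raw counts copied from the frequency table above.
modal_counts = {
    'can': 2327, 'could': 3594, 'may': 2549, 'might': 1963,
    'will': 7368, 'would': 4046, 'should': 2550,
}

total_modals = sum(modal_counts.values())

# Share of each modal among all modal occurrences (not among all tokens).
modal_shares = {modal: count / total_modals for modal, count in modal_counts.items()}

# Print the modals from most to least used.
for modal, share in sorted(modal_shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{modal:<7}{share:.4f}")
```

The difference between the largest and smallest share gives the span that the question asks about.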

#### Question 1E. For two modals with the largest span of relative (to the total number of modals) frequencies (most used minus least used), select a text which uses it the most and the text that uses it the least. Compare usage in both texts by examining the relative frequencies of those modals in the two texts. Try to explain why those words are used differently in the two texts.

#### Answer 1E:

In [None]:
# @title 1E Syntax Guide

from prettytable import PrettyTable

# Select the two most frequent modals overall; these are the pair compared across texts
sorted_modals = sorted(modals, key=lambda x: freqs[x] / num_words, reverse=True)
modal1, modal2 = sorted_modals[:2]

# Find the text that uses each modal the most and the least
most_text1 = ''
most_text2 = ''
least_text1 = ''
least_text2 = ''
most_freq1 = 0
most_freq2 = 0
least_freq1 = float('inf')
least_freq2 = float('inf')
for fileid in gutenberg.fileids():
    words = gutenberg.words(fileid)
    # Note: count() is case-sensitive, so capitalized forms such as 'Will' are not counted
    freq1 = words.count(modal1) / len(words)
    freq2 = words.count(modal2) / len(words)
    if freq1 > most_freq1:
        most_text1 = fileid
        most_freq1 = freq1
    if freq2 > most_freq2:
        most_text2 = fileid
        most_freq2 = freq2
    if freq1 < least_freq1:
        least_text1 = fileid
        least_freq1 = freq1
    if freq2 < least_freq2:
        least_text2 = fileid
        least_freq2 = freq2

# Create a table to display the results
table = PrettyTable()
table.field_names = ['', modal1, modal2]
table.add_row(['Most used text', most_text1, most_text2])
table.add_row(['Relative frequency', '{:.6f}'.format(most_freq1), '{:.6f}'.format(most_freq2)])
table.add_row(['Least used text', least_text1, least_text2])
table.add_row(['Relative frequency', '{:.6f}'.format(least_freq1), '{:.6f}'.format(least_freq2)])

# Print the table
print(table)


+--------------------+------------------------+-----------------+
|                    |          will          |      would      |
+--------------------+------------------------+-----------------+
|   Most used text   | shakespeare-caesar.txt | austen-emma.txt |
| Relative frequency |        0.004994        |     0.004235    |
|  Least used text   |    blake-poems.txt     | blake-poems.txt |
| Relative frequency |        0.000359        |     0.000359    |
+--------------------+------------------------+-----------------+


The variation in the use of "will" and "would" can be explained by several factors, including the period, genre, and context of each text. Shakespeare, Blake, and Austen worked in different genres (drama, poetry, and the novel, respectively) and wrote in distinct styles and forms that reflected their historical and literary contexts (WILLIAM BLAKE, n.d.). These differences in context, style, and theme may have influenced how often each author used modal verbs.
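The per-text loop above counts exact matches, so a capitalized "Will" at the start of a sentence is not counted, whereas the table in 1D lowercases every token first. A minimal sketch of a case-insensitive alternative follows. The helper name `relative_frequency` and the toy token lists are assumptions made for illustration and are not taken from the corpus.

```python
def relative_frequency(tokens, target):
    """Return the case-insensitive share of `target` among `tokens`."""
    tokens = list(tokens)
    if not tokens:
        return 0.0
    target = target.lower()
    hits = sum(1 for token in tokens if token.lower() == target)
    return hits / len(tokens)


# Toy token lists for illustration only (not taken from the corpus).
sample_a = ['Will', 'you', 'go', '?', 'I', 'will', '.']
sample_b = ['She', 'would', 'not', 'go', '.']

print(relative_frequency(sample_a, 'will'))
print(relative_frequency(sample_b, 'will'))
```

The same helper should accept the token sequence returned by `gutenberg.words(fileid)`, since it only iterates over its input.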

#### Question 2A. Download the inaugural corpus from NLTK.

#### Answer 2A:

In [None]:
# @title 2A Syntax Guide

# Download and import the inaugural corpus from NLTK
nltk.download("inaugural")
from nltk.corpus import inaugural

[nltk_data] Downloading package inaugural to /root/nltk_data...
[nltk_data]   Package inaugural is already up-to-date!


#### Question 2B. Choose Kennedy's speech.

#### Answer 2B:

In [None]:
# @title 2B Syntax Guide

# Get the text of Kennedy's speech
nltk.corpus.inaugural.words("1961-Kennedy.txt")

['Vice', 'President', 'Johnson', ',', 'Mr', '.', ...]

#### Question 2C. Identify the 10 most frequently used long words (words longer than 7 characters).

#### Answer 2C:

In [None]:
# @title 2C Syntax Guide

# Load Kennedy's speech
speech = nltk.corpus.inaugural.words('1961-Kennedy.txt')

# Create a list of long words (words longer than 7 characters)
long_words = [word.lower() for word in speech if len(word) > 7]

# Use FreqDist to count the frequency of each word
word_freq = nltk.FreqDist(long_words)

# Get the 10 most common words and print them
for word, frequency in word_freq.most_common(10):
    print(word)


citizens
president
americans
generation
forebears
revolution
committed
powerful
supporting
themselves


The 10 most common long words in Kennedy's speech offer some insight into the themes he emphasized. They suggest a focus on the power of the presidency, national identity and unity, honoring the past while looking toward the future, and the importance of collective action (John F. Kennedy’s Inaugural Address: An Analysis, n.d.).
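The filter `len(word) > 7` keeps every token of eight or more characters. As a hedged sketch, the same selection can be wrapped in a small helper that also restricts the result to alphabetic tokens. The name `most_common_long_words`, its parameters, and the toy tokens are assumptions for illustration, not part of the original analysis.

```python
from collections import Counter


def most_common_long_words(tokens, min_length=8, n=10):
    """Return the n most common alphabetic tokens with at least min_length characters."""
    long_words = [t.lower() for t in tokens if t.isalpha() and len(t) >= min_length]
    return Counter(long_words).most_common(n)


# Toy tokens for illustration only.
tokens = ['Citizens', 'of', 'the', 'world', ',', 'citizens', 'and',
          'generation', 'after', 'generation', ',', 'citizens', '.']

print(most_common_long_words(tokens, n=2))
```

Lowercasing before counting merges "Citizens" and "citizens" into one entry, as the 2C code does.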

#### Question 2D. Which one of those 10 words has the largest number of synonyms? Use WordNet as a helper:

#### Answer 2D:

In [None]:
# @title 2D Syntax Guide

nltk.download('omw-1.4')
nltk.download('wordnet')
from nltk.corpus import wordnet as wn
from nltk.tokenize import word_tokenize

# Take the 10 most common long words from the FreqDist built in 2C
top_long_words = word_freq.most_common(10)

# initialize variables for the word with the most WordNet entries and its count
max_synonyms_word = ''
max_synonyms_count = 0

# iterate over the top 10 words
for word, count in top_long_words:
    # wn.synsets() returns the word's synsets (senses); their number is used
    # here as a proxy for the number of synonyms
    synonyms = wn.synsets(word)
    synonyms_count = len(synonyms)
    # update the variables if the word has more synsets than the previous max
    if synonyms_count > max_synonyms_count:
        max_synonyms_word = word
        max_synonyms_count = synonyms_count

# print the result
print("The word with the most synonyms among the top 10 most frequently used long words is '{}' with {} synonyms.".format(max_synonyms_word, max_synonyms_count))


The word with the most synonyms among the top 10 most frequently used long words is 'supporting' with 14 synonyms.


[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


One possible insight from these results is that Kennedy used the word "supporting" to convey solidarity and cooperation with various groups and causes. Its many WordNet senses cover different aspects of support, such as giving financial aid, providing emotional encouragement, strengthening something physically, and agreeing with or approving something. This breadth of meaning let Kennedy appeal to different audiences and contexts and emphasize his commitment to various issues and values.
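The count reported above is the number of synsets (senses) returned by `wn.synsets(word)`, which is not the same as the number of distinct synonyms. A minimal sketch of the difference follows, assuming objects that expose `lemma_names()` the way NLTK's `Synset` does. `FakeSynset`, `distinct_synonyms`, and the lemma lists are made up so that the example runs without WordNet data.

```python
class FakeSynset:
    """Stand-in for an NLTK Synset; only lemma_names() is modelled."""

    def __init__(self, lemma_names):
        self._lemma_names = list(lemma_names)

    def lemma_names(self):
        return self._lemma_names


def distinct_synonyms(word, synsets):
    """Collect the distinct lemma names across synsets, excluding the word itself."""
    names = {name.lower() for synset in synsets for name in synset.lemma_names()}
    names.discard(word.lower())
    return names


# Made-up senses for illustration; real data would come from wn.synsets(word).
senses = [FakeSynset(['support', 'supporting']),
          FakeSynset(['encouraging', 'supporting']),
          FakeSynset(['support', 'back'])]

print(len(senses))                                       # number of senses
print(sorted(distinct_synonyms('supporting', senses)))   # distinct synonyms
```

Several senses can share the same lemma, so the two counts can differ in either direction.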

#### Question 2E. List all synonyms for the 10 most frequently used words. Which one of those 10 words has the largest number of hyponyms?

#### Answer 2E:

In [None]:
# @title 2E Syntax Guide

import nltk
from collections import Counter

# Choose Kennedy's speech
#speech = inaugural.words('1961-Kennedy.txt')

# Identify the 10 most frequently used long words
long_words = [word.lower() for word in speech if len(word) > 7]
top_long_words = Counter(long_words).most_common(10)

# Find all synonyms for the top 10 long words using WordNet
synonyms = {}
for word, count in top_long_words:
    if wn.synsets(word, pos=wn.NOUN):
        pos = wn.NOUN
    elif wn.synsets(word, pos=wn.ADJ):
        pos = wn.ADJ
    elif wn.synsets(word, pos=wn.ADV):
        pos = wn.ADV
    else:
        continue
    synsets = wn.synsets(word, pos=pos)
    synonyms[word] = set()
    for synset in synsets:
        for lemma in synset.lemmas():
            synonyms[word].add(lemma.name())
    synonyms[word].discard(word)  # Remove the original word from the synonyms

# Find the long word with the most hyponyms
most_hyponyms = 0
most_hyponyms_word = None
for word, count in top_long_words:
    try:
        if wn.synsets(word, pos=wn.NOUN):
            pos = wn.NOUN
        elif wn.synsets(word, pos=wn.ADJ):
            pos = wn.ADJ
        elif wn.synsets(word, pos=wn.ADV):
            pos = wn.ADV
        else:
            continue
        hyponyms = wn.synset(word + '.' + pos + '.01').hyponyms()
        if len(hyponyms) > most_hyponyms:
            most_hyponyms = len(hyponyms)
            most_hyponyms_word = word
    except Exception:
        # wn.synset() raises an error when no synset is named '<word>.<pos>.01'
        # (for example, inflected forms such as 'citizens'); skip those words
        continue

# Print the synonyms for the top 10 long words and the word with the most hyponyms
print("Synonyms for the top 10 long words:")
for word, count in top_long_words:
    if word in synonyms:
        print(word, synonyms[word])
    else:
        print(word, set())
print("The long word with the most hyponyms is:", most_hyponyms_word)


Synonyms for the top 10 long words:
citizens {'citizen'}
president {'United_States_President', 'chair', 'prexy', 'Chief_Executive', 'chairwoman', 'chairman', 'President_of_the_United_States', 'chairperson', 'President'}
americans {'American', 'American_English', 'American_language'}
generation {'propagation', 'coevals', 'contemporaries', 'genesis', 'multiplication'}
forebears {'forebear', 'forbear'}
revolution {'rotation', 'gyration'}
committed {'attached'}
powerful {'knock-down', 'brawny', 'sinewy', 'muscular', 'potent', 'herculean', 'hefty'}
supporting {'support'}
themselves set()
The long word with the most hyponyms is: generation


To interpret and improve the code for Question 2E, it helps to understand how WordNet organizes words into synsets by part of speech. WordNet covers only nouns, verbs, adjectives, and adverbs. A lookup restricted to pos=wn.NOUN returns nothing for 'committed', 'powerful', and 'themselves', because these words have no noun sense in WordNet (WordNet Documentation, n.d.).

For that reason, the code falls back to pos=wn.ADJ and then pos=wn.ADV when no noun sense exists, which is how it finds synonyms for the adjectives 'committed' and 'powerful' (WordNet Documentation, n.d.). It never queries pos=wn.VERB, so verb senses are ignored.

"Themselves" is a reflexive pronoun. Because WordNet does not include pronouns, the lookup returns no synsets for it, and the empty set in the output is expected.
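The hyponym lookup `wn.synset(word + '.' + pos + '.01')` in the code above only succeeds when the word itself names a first sense, so inflected forms are skipped. As a hedged sketch, hyponyms could instead be counted over every synset returned for a word. `StubSynset`, `count_hyponyms`, `word_with_most_hyponyms`, and the toy data are hypothetical and only mimic the `name()` and `hyponyms()` methods of NLTK's `Synset`.

```python
class StubSynset:
    """Stand-in for an NLTK Synset; only name() and hyponyms() are modelled."""

    def __init__(self, name, hyponyms=()):
        self._name = name
        self._hyponyms = list(hyponyms)

    def name(self):
        return self._name

    def hyponyms(self):
        return self._hyponyms


def count_hyponyms(synsets):
    """Count distinct direct hyponyms over all given senses."""
    return len({hypo.name() for synset in synsets for hypo in synset.hyponyms()})


def word_with_most_hyponyms(senses_by_word):
    """Return (word, count) for the entry with the most distinct hyponyms."""
    counts = {word: count_hyponyms(senses) for word, senses in senses_by_word.items()}
    if not counts:
        return None, 0
    best = max(counts, key=counts.get)
    return best, counts[best]


# Toy data for illustration only; real senses would come from wn.synsets(word).
senses_by_word = {
    'citizens': [StubSynset('citizen.n.01',
                            [StubSynset('voter.n.01'), StubSynset('civilian.n.01')])],
    'generation': [StubSynset('generation.n.01', [StubSynset('posterity.n.01')]),
                   StubSynset('generation.n.02')],
}

print(word_with_most_hyponyms(senses_by_word))
```

With real data, a dictionary such as `{word: wn.synsets(word) for word, _ in top_long_words}` could be passed in; the result may differ from the output shown above because more words would be considered.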

#### Question 2F. List all hyponyms of the 10 most frequently used words.

#### Answer 2F:

In [None]:
# @title 2F Syntax Guide

import nltk
from nltk.corpus import inaugural
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet as wn
from collections import Counter

# download the punkt tokenizer
nltk.download('punkt')

# choose Kennedy's speech
kennedy_speech = inaugural.raw('1961-Kennedy.txt')

# identify the 10 most frequently used long words
words = word_tokenize(kennedy_speech)
long_words = [word for word in words if len(word) > 7]
most_common = Counter(long_words).most_common(10)
print('10 most frequently used long words:\n')
for word in most_common:
    print(f"{word[0]}: {word[1]}")
    print('----------------------------------------')

# list all hyponyms for the 10 most frequently used words
hyponyms = {}
for word in most_common:
    syns = wn.synsets(word[0])
    if syns:
        hyponyms[word[0]] = [hypo.name() for s in syns for hypo in s.hyponyms()]

print('\nHyponyms of the 10 most frequently used long words:\n')
for word in hyponyms:
    print(f"{word}:")
    print(*hyponyms[word], sep='\n')
    print('----------------------------------------')


10 most frequently used long words:

citizens: 5
----------------------------------------
President: 4
----------------------------------------
Americans: 4
----------------------------------------
generation: 3
----------------------------------------
forebears: 2
----------------------------------------
revolution: 2
----------------------------------------
committed: 2
----------------------------------------
powerful: 2
----------------------------------------
supporting: 2
----------------------------------------
themselves: 2
----------------------------------------

Hyponyms of the 10 most frequently used long words:

citizens:
active_citizen.n.01
civilian.n.01
freeman.n.01
private_citizen.n.01
repatriate.n.01
thane.n.02
voter.n.01
----------------------------------------
President:
ex-president.n.01
kalon_tripa.n.01
vice_chairman.n.01
----------------------------------------
Americans:
african-american.n.01
alabaman.n.01
alaskan.n.01
anglo-american.n.01
appalachian.n.01
arizona

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Insights from the output suggest that the speech is likely about patriotism, directed towards young people, calling for change, and encouraging action and support. The tone is motivational and emphasizes the power of heritage and determination.

## Conclusion

In conclusion, this report demonstrates the use of the Natural Language Toolkit (NLTK) in Python to analyze textual data. We began by installing and importing the NLTK package, and then downloaded and analyzed the Gutenberg corpus to count the frequency of modal verbs. We also analyzed President Kennedy's inaugural speech by identifying the most frequently used long words and looking up their synonyms and hyponyms in WordNet.

Based on the results of our analysis, we suggest that those conducting similar analyses consider exploring other corpora and texts, as well as conducting more sophisticated analyses, such as sentiment analysis or topic modeling.

## References

*John F. Kennedy’s Inaugural Address: An Analysis.* (n.d.). EDUZAURUS. Retrieved March 9, 2023, from https://eduzaurus.com/free-essay-samples/john-f-kennedys-inaugural-address-an-analysis/

*Will vs. Would: What’s The Difference?* (2022, September 1). Thesaurus. https://www.thesaurus.com/e/grammar/will-vs-would/

*WILLIAM BLAKE.* (n.d.). Litpriest. Retrieved March 9, 2023, from https://litpriest.com/authors/william-blake/

*WordNet Documentation.* (n.d.). wordnet.princeton.edu. Retrieved March 10, 2023, from https://wordnet.princeton.edu/documentation