# NLP AI Applications
@Yu-Wei Hsu

Install NLTK package

In [None]:
pip install nltk

## Gutenberg

Download the Gutenberg corpus through NLTK

In [None]:
import nltk
nltk.download('gutenberg')

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.


True

List the file ID of each text in the corpus

In [None]:
from nltk.corpus import gutenberg
gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

Compute the relative frequency of each modal verb in each text, that is, each modal's count divided by the total count of all seven modals in that text

In [None]:
import pandas as pd
modals = ['can','could','may','might','will','would','should']
df = pd.DataFrame()

# iterate over every text in the corpus
for txt in gutenberg.fileids():
    # read the text as a sequence of word tokens
    content = gutenberg.words(txt)
    freq = []

    # count the occurrences of each modal (case-sensitive)
    for m in modals:
        freq.append(content.count(m))

    # convert raw counts to relative frequencies (share of all modal occurrences)
    rel_freq = [round(c/sum(freq),3) for c in freq]
    # store the result as one column per text
    df[txt] = rel_freq

# swap index and column
df = df.T
df.columns = modals
df

| text | can | could | may | might | will | would | should |
|---|---|---|---|---|---|---|---|
| austen-emma.txt | 0.08 | 0.245 | 0.063 | 0.096 | 0.166 | 0.242 | 0.109 |
| austen-persuasion.txt | 0.067 | 0.297 | 0.058 | 0.111 | 0.108 | 0.235 | 0.124 |
| austen-sense.txt | 0.092 | 0.253 | 0.075 | 0.096 | 0.158 | 0.226 | 0.101 |
| bible-kjv.txt | 0.031 | 0.024 | 0.149 | 0.069 | 0.552 | 0.064 | 0.111 |
| blake-poems.txt | 0.476 | 0.071 | 0.119 | 0.048 | 0.071 | 0.071 | 0.143 |
| bryant-stories.txt | 0.133 | 0.274 | 0.032 | 0.041 | 0.256 | 0.196 | 0.068 |
| burgess-busterbrown.txt | 0.13 | 0.316 | 0.017 | 0.096 | 0.107 | 0.26 | 0.073 |
| carroll-alice.txt | 0.197 | 0.252 | 0.038 | 0.097 | 0.083 | 0.241 | 0.093 |
| chesterton-ball.txt | 0.16 | 0.143 | 0.11 | 0.084 | 0.242 | 0.17 | 0.092 |
| chesterton-brown.txt | 0.177 | 0.238 | 0.066 | 0.1 | 0.156 | 0.185 | 0.079 |
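Calling `content.count(m)` scans the whole text once per modal. As a minimal sketch of an alternative, the counts for all modals can be collected in a single pass with `collections.Counter`. The helper name `modal_relative_freq` and the sample sentence are hypothetical and only illustrate the idea; the matching stays case-sensitive, as in the cell above.

```python
from collections import Counter

MODALS = ['can', 'could', 'may', 'might', 'will', 'would', 'should']

def modal_relative_freq(tokens, modals=MODALS, ndigits=3):
    """Return {modal: share of all modal occurrences} for one token list.

    Hypothetical helper: counts every modal in one pass over the tokens.
    """
    wanted = set(modals)
    counts = Counter(t for t in tokens if t in wanted)
    total = sum(counts.values())
    if total == 0:
        # no modal found: avoid dividing by zero
        return {m: 0.0 for m in modals}
    return {m: round(counts[m] / total, ndigits) for m in modals}

# made-up sample sentence, not taken from the corpus
sample = "I can go and you can stay , but she will leave and he may follow".split()
result = modal_relative_freq(sample)
print(result)
```

With the corpus loaded, the same helper could presumably be applied to `gutenberg.words(txt)` for each file ID.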


Find the two modals with the largest range (maximum minus minimum) of relative frequencies, then compare the texts that use those modals most and least.

In [None]:
for m in modals:
    print('{}: {}'.format(m,(max(df[m]) - min(df[m]))))

can: 0.44499999999999995
could: 0.292
may: 0.16199999999999998
might: 0.128
will: 0.48100000000000004
would: 0.196
should: 0.124


The two modals with the largest range of relative frequencies are 'will' and 'can'.
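The two widest ranges can also be picked programmatically instead of by reading the printed list. The sketch below uses plain dictionaries and only the two rows for `blake-poems.txt` and `bible-kjv.txt` copied from the table above; the function names `spans` and `top_spans` are hypothetical.

```python
# relative frequencies copied from the table above (two rows only)
rows = {
    'blake-poems.txt': {'can': 0.476, 'could': 0.071, 'may': 0.119, 'might': 0.048,
                        'will': 0.071, 'would': 0.071, 'should': 0.143},
    'bible-kjv.txt':   {'can': 0.031, 'could': 0.024, 'may': 0.149, 'might': 0.069,
                        'will': 0.552, 'would': 0.064, 'should': 0.111},
}

def spans(rows, ndigits=3):
    """Return {modal: max - min} across all texts, rounded to hide float noise."""
    modals = list(next(iter(rows.values())))
    result = {}
    for m in modals:
        values = [freqs[m] for freqs in rows.values()]
        result[m] = round(max(values) - min(values), ndigits)
    return result

def top_spans(rows, n=2):
    """Return the n modals with the widest range, widest first."""
    s = spans(rows)
    return sorted(s, key=s.get, reverse=True)[:n]

print(spans(rows))
print(top_spans(rows))
```

On the full DataFrame, an expression such as `(df.max() - df.min()).nlargest(2)` should give the same two modals without a loop.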

In [None]:
print('The text that uses the word "{}" most is {}.'.format('can',df['can'].idxmax()))
print('The text that uses the word "{}" least is {}.'.format('can',df['can'].idxmin()))
print('The text that uses the word "{}" most is {}.'.format('will',df['will'].idxmax()))
print('The text that uses the word "{}" least is {}.'.format('will',df['will'].idxmin()))

The text that uses the word "can" most is blake-poems.txt.
The text that uses the word "can" least is bible-kjv.txt.
The text that uses the word "will" most is bible-kjv.txt.
The text that uses the word "will" least is blake-poems.txt.


In [None]:
df.loc[['blake-poems.txt','bible-kjv.txt']]

| text | can | could | may | might | will | would | should |
|---|---|---|---|---|---|---|---|
| blake-poems.txt | 0.476 | 0.071 | 0.119 | 0.048 | 0.071 | 0.071 | 0.143 |
| bible-kjv.txt | 0.031 | 0.024 | 0.149 | 0.069 | 0.552 | 0.064 | 0.111 |


- `bible-kjv.txt` is the King James Version of the Bible, the Christian scriptures. 'will' has the highest relative frequency among the modals in this text. As a noun, 'will' also names the faculty by which a person decides on and initiates action, and in the Bible it usually expresses intention.
- `blake-poems.txt` is a collection of poems by William Blake. 'can' has the highest relative frequency among the modals in this text. 'can' has many uses: it expresses permission or ability, and as a noun it names a container. It is also an easy word to rhyme in a poem.

*These could be the reasons why 'will' and 'can' have the highest relative frequencies in the Bible and in Blake's poems, respectively.*

## Inaugural

Load the Inaugural corpus and select Kennedy's 1961 speech

In [None]:
nltk.download('inaugural')
from nltk.corpus import inaugural

[nltk_data] Downloading package inaugural to /root/nltk_data...
[nltk_data]   Unzipping corpora/inaugural.zip.


In [None]:
text = nltk.corpus.inaugural.words('1961-Kennedy.txt')

Identify the 10 most frequently used long words (words with more than seven characters)

In [None]:
from collections import Counter

# Keep only the words with more than seven characters
long_words = [w for w in text if len(w) > 7]
# Count how often each long word occurs
freq = Counter(long_words)

# Print the 10 most frequent long words and keep them for the next cells
top10 = []
for word, count in freq.most_common(10):
    print('{}: {}'.format(word, count))
    top10.append(word)

citizens: 5
President: 4
Americans: 4
generation: 3
forebears: 2
revolution: 2
committed: 2
powerful: 2
supporting: 2
themselves: 2
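The count above is case-sensitive, so a capitalized form such as 'President' is counted separately from any lowercase occurrence. As a rough sketch, the variant below optionally folds case before counting. The helper `top_long_words`, its parameters, and the sample tokens are hypothetical and are not taken from the speech.

```python
from collections import Counter

def top_long_words(tokens, min_len=8, n=10, fold_case=False):
    """Return the n most common tokens with at least min_len characters.

    Hypothetical helper: fold_case=True merges 'Citizens' and 'citizens'.
    """
    if fold_case:
        tokens = (t.lower() for t in tokens)
    counts = Counter(t for t in tokens if len(t) >= min_len)
    return counts.most_common(n)

# made-up token list for illustration only
tokens = ['Citizens', 'citizens', 'generation', 'short',
          'citizens', 'Generation', 'president']

print(top_long_words(tokens, n=2))                  # case-sensitive
print(top_long_words(tokens, n=2, fold_case=True))  # case-folded
```

Whether folding case is appropriate depends on the question: it merges sentence-initial capitals with their lowercase forms, but it also merges proper nouns with common nouns.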


Number of synonyms of each of the 10 most frequently used long words

In [None]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [None]:
from nltk.corpus import wordnet

for w in top10:
    print('Synonyms of ',w)
    syn = []
    for synset in wordnet.synsets(w):
        for lemma in synset.lemmas():
            syn.append(lemma.name())
    print(syn)
    print('Numbers of synonyms: {}\n'.format(len(syn)))


Synonyms of  citizens
['citizen']
Numbers of synonyms: 1

Synonyms of  President
['president', 'President_of_the_United_States', 'United_States_President', 'President', 'Chief_Executive', 'president', 'president', 'chairman', 'chairwoman', 'chair', 'chairperson', 'president', 'prexy', 'President_of_the_United_States', 'President', 'Chief_Executive']
Numbers of synonyms: 16

Synonyms of  Americans
['American', 'American_English', 'American_language', 'American', 'American']
Numbers of synonyms: 5

Synonyms of  generation
['coevals', 'contemporaries', 'generation', 'generation', 'generation', 'generation', 'genesis', 'generation', 'generation', 'generation', 'multiplication', 'propagation']
Numbers of synonyms: 12

Synonyms of  forebears
['forebear', 'forbear']
Numbers of synonyms: 2

Synonyms of  revolution
['revolution', 'revolution', 'rotation', 'revolution', 'gyration']
Numbers of synonyms: 5

Synonyms of  committed
['perpetrate', 'commit', 'pull', 'give', 'dedicate', 'consecrate', '

**Among the 10 most frequently used long words, 'supporting' has the largest number of synonyms, with 52.**
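The lists printed above collect every lemma name of every synset, so a name can appear several times and the word itself is included. As a hedged sketch, the helper below removes repeats and the query word before counting. The name `unique_synonyms` is hypothetical, and the input list is copied from the 'President' output above.

```python
def unique_synonyms(word, lemma_names):
    """Return lemma names without repeats and without the word itself.

    Hypothetical helper: comparison with the query word ignores case,
    and the first-seen order of the remaining names is preserved.
    """
    seen = set()
    result = []
    for name in lemma_names:
        if name.lower() == word.lower() or name in seen:
            continue
        seen.add(name)
        result.append(name)
    return result

# lemma names copied from the 'President' output above (16 entries)
president_lemmas = [
    'president', 'President_of_the_United_States', 'United_States_President',
    'President', 'Chief_Executive', 'president', 'president', 'chairman',
    'chairwoman', 'chair', 'chairperson', 'president', 'prexy',
    'President_of_the_United_States', 'President', 'Chief_Executive',
]

distinct = unique_synonyms('President', president_lemmas)
print(distinct)
print('Number of distinct synonyms: {}'.format(len(distinct)))
```

Counting this way would lower the totals for every word, so the ranking of the ten words might differ from the one reported above.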

Number of hyponyms of each of the 10 most frequently used long words

In [None]:
for w in top10:
    print('Hyponyms of',w)
    hypo = []
    for synset in wordnet.synsets(w):
        for h in synset.hyponyms():
            hypo.append(h.lemma_names())
    print(hypo)
    print('Numbers of hyponyms: {}\n'.format(len(hypo)))

Hyponyms of citizens
[['active_citizen'], ['civilian'], ['freeman', 'freewoman'], ['private_citizen'], ['repatriate'], ['thane'], ['voter', 'elector']]
Numbers of hyponyms: 7

Hyponyms of President
[['ex-president'], ['Kalon_Tripa'], ['vice_chairman']]
Numbers of hyponyms: 3

Hyponyms of Americans
[['African-American', 'African_American', 'Afro-American', 'Black_American'], ['Alabaman', 'Alabamian'], ['Alaskan'], ['Anglo-American'], ['Appalachian'], ['Arizonan', 'Arizonian'], ['Arkansan', 'Arkansawyer'], ['Asian_American'], ['Bay_Stater'], ['Bostonian'], ['Californian'], ['Carolinian'], ['Coloradan'], ['Connecticuter'], ['Creole'], ['Delawarean', 'Delawarian'], ['Floridian'], ['Franco-American'], ['Georgian'], ['German_American'], ['Hawaiian'], ['Idahoan'], ['Illinoisan'], ['Indianan', 'Hoosier'], ['Iowan'], ['Kansan'], ['Kentuckian', 'Bluegrass_Stater'], ['Louisianan', 'Louisianian'], ['Mainer', 'Down_Easter'], ['Marylander'], ['Michigander', 'Wolverine'], ['Minnesotan', 'Gopher'], ['

**Among the 10 most frequently used long words, 'Americans' has the largest number of hyponyms, with 75.**
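The hyponym counts above are counts of hyponym synsets: each inner list holds the lemma names of one synset, and `len(hypo)` counts the inner lists. The sketch below makes the difference between synsets and lemma names explicit, using the 'citizens' output copied from above; the function name `hyponym_counts` is hypothetical.

```python
def hyponym_counts(hypo):
    """Return (number of hyponym synsets, number of lemma names in them)."""
    n_synsets = len(hypo)
    n_lemmas = sum(len(names) for names in hypo)
    return n_synsets, n_lemmas

# hyponyms of 'citizens', copied from the output above
citizen_hyponyms = [
    ['active_citizen'], ['civilian'], ['freeman', 'freewoman'],
    ['private_citizen'], ['repatriate'], ['thane'], ['voter', 'elector'],
]

n_synsets, n_lemmas = hyponym_counts(citizen_hyponyms)
print('Hyponym synsets: {}, lemma names: {}'.format(n_synsets, n_lemmas))
```

Either number can serve as "the number of hyponyms" as long as the same definition is applied to all ten words.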

### Reference:
- WordNet Interface. https://www.nltk.org/howto/wordnet.html