# FLIP(01):  Advanced Data Science
**(Module 03: Natural Language Processing)**

---
- Materials in this module include resources collected from various open-source online repositories.
- You are free to use, but NOT allowed to change or distribute this package.

Prepared by and for 
**Student Members** |
2006-2018 [TULIP Lab](http://www.tulip.org.au)

---


# Session 01 - Accessing Text Corpora

## Gutenberg Corpus
NLTK includes a small selection of texts from the Project Gutenberg electronic text archive, which contains some 25,000 free electronic books, hosted at http://www.gutenberg.org/. We begin by getting the Python interpreter to load the NLTK package, then ask to see nltk.corpus.gutenberg.fileids(), the file identifiers in this corpus:

In [None]:
import nltk
nltk.corpus.gutenberg.fileids()

In [None]:
emma = nltk.corpus.gutenberg.words('austen-emma.txt')

In [None]:
len(emma)

In [None]:
emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))

In [None]:
emma.concordance("surprize")

In [None]:
from nltk.corpus import gutenberg
gutenberg.fileids()

In [None]:
emma = gutenberg.words('austen-emma.txt')

In [None]:
for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid)) 
    num_words = len(gutenberg.words(fileid))
    num_sents = len(gutenberg.sents(fileid))
    num_vocab = len(set([w.lower() for w in gutenberg.words(fileid)]))
    print(int(num_chars/num_words), int(num_words/num_sents), int(num_words/num_vocab), fileid)
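
The three numbers printed for each file are the average word length, the average sentence length, and the average number of times each vocabulary item is used. Because the raw character count includes spaces, the first figure overstates the true word length. The sketch below repeats the same arithmetic on a tiny hand-made text using only the standard library; the toy data and the helper name `text_stats` are assumptions for illustration and are not part of NLTK.

```python
# Toy stand-in for a tokenized corpus: a raw string plus its sentences as token lists.
raw = 'The cat sat. The dog sat too.'
sents = [
    ['The', 'cat', 'sat', '.'],
    ['The', 'dog', 'sat', 'too', '.'],
]

def text_stats(raw, sents):
    """Return (avg word length, avg sentence length, avg uses per vocabulary item)."""
    words = [w for s in sents for w in s]
    vocab = set(w.lower() for w in words)   # case-folded vocabulary
    return (len(raw) / len(words),          # characters per token (spaces included)
            len(words) / len(sents),        # tokens per sentence
            len(words) / len(vocab))        # tokens per distinct word

avg_word_len, avg_sent_len, avg_reuse = text_stats(raw, sents)
print(round(avg_word_len, 2), avg_sent_len, avg_reuse)
```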

In [None]:
macbeth_sentences = gutenberg.sents('shakespeare-macbeth.txt')

In [None]:
macbeth_sentences

In [None]:
longest_len = max([len(s) for s in macbeth_sentences])

In [None]:
print(longest_len)

In [None]:
[s for s in macbeth_sentences if len(s) == longest_len]

## Web and Chat Text
Although Project Gutenberg contains thousands of books, it represents established literature. It is important to consider less formal language as well. NLTK’s small collection of web text includes content from a Firefox discussion forum, conversations overheard in New York, the movie script of Pirates of the Caribbean, personal advertisements, and wine reviews:

In [None]:
from nltk.corpus import webtext

In [None]:
for fileid in webtext.fileids():
    print(fileid, webtext.raw(fileid)[:65], '...')

In [None]:
from nltk.corpus import nps_chat

In [None]:
chatroom = nps_chat.posts('10-19-20s_706posts.xml')

In [None]:
chatroom[123]

## Brown Corpus
The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University. This corpus contains text from 500 sources, and the sources have been categorized by genre, such as news, editorial, and so on.

In [None]:
from nltk.corpus import brown

In [None]:
brown.categories()

In [None]:
brown.words(categories='news')

In [None]:
brown.words(fileids=['cg22'])

In [None]:
brown.sents(categories=['news', 'editorial', 'reviews'])

In [None]:
from nltk.corpus import brown

In [None]:
news_text = brown.words(categories='news')

In [None]:
fdist = nltk.FreqDist([w.lower() for w in news_text])

In [None]:
fdist

In [None]:
modals = ['can', 'could', 'may', 'might', 'must', 'will']

In [None]:
for m in modals:
    print(m + ':', fdist[m], end=' ')  # end=' ' keeps all counts on one line
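
A frequency distribution is essentially a mapping from each item to the number of times it occurs, and looking up an item that never occurred gives 0 rather than an error. The sketch below shows the same idea with `collections.Counter` from the standard library; the token list is made up for illustration.

```python
from collections import Counter

# Hypothetical token list standing in for a corpus section.
tokens = ['You', 'can', 'go', 'if', 'you', 'must', ',', 'but', 'you', 'can', 'stay', '.']

# Case-fold before counting, as in the cells above.
counts = Counter(w.lower() for w in tokens)

modals = ['can', 'could', 'must']
modal_counts = {m: counts[m] for m in modals}   # missing keys count as 0
print(modal_counts)
```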

In [None]:
# small test:
# Choose a different section of the Brown Corpus, and adapt the preceding example to count a selection of wh words, such as what,
# when, where, who and why.

In [None]:
model = ['what','where','why','when','who']

In [None]:
for i in model:
    print( i + ':',fdist[i])

In [None]:
cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))

In [None]:
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=genres, samples=modals)

## Reuters Corpus
The Reuters Corpus contains 10,788 news documents totaling 1.3 million words. The documents have been classified into 90 topics, and grouped into two sets, called “training” and “test”; thus, the text with fileid 'test/14826' is a document drawn from the test set. This split is for training and testing algorithms that automatically detect the topic of a document.

In [None]:
from nltk.corpus import reuters

In [None]:
reuters.fileids()

In [None]:
reuters.categories()

In [None]:
reuters.categories('training/9865')

In [None]:
reuters.categories(['training/9865', 'training/9880'])

In [None]:
reuters.fileids('barley')

In [None]:
reuters.fileids(['barley', 'corn'])

In [None]:
reuters.words('training/9865')[:14]

In [None]:
reuters.words(['training/9865', 'training/9880'])

In [None]:
reuters.words(categories='barley')

In [None]:
reuters.words(categories=['barley', 'corn'])

## Inaugural Address Corpus
The corpus is actually a collection of 55 texts, one for each presidential address. An interesting property of this collection is its time dimension:

In [None]:
from nltk.corpus import inaugural

In [None]:
inaugural.fileids()

In [None]:
[fileid[:4] for fileid in inaugural.fileids()]

In [None]:
import nltk
cfd = nltk.ConditionalFreqDist(
    (target, fileid[:4])
    for fileid in inaugural.fileids()
    for w in inaugural.words(fileid)
    for target in ['america', 'citizen']
    if w.lower().startswith(target))

In [None]:
cfd.plot()

## Corpora in Other Languages
NLTK comes with corpora for many languages, though in some cases you will need to learn how to manipulate character encodings in Python before using these corpora.

In [None]:
nltk.corpus.cess_esp.words()

In [None]:
nltk.corpus.floresta.words()

In [None]:
nltk.corpus.indian.words('hindi.pos')

In [None]:
nltk.corpus.udhr.fileids()

In [None]:
nltk.corpus.udhr.words('Javanese-Latin1')[11:]

In [None]:
from nltk.corpus import udhr

In [None]:
languages = ['Chickasaw', 'English', 'German_Deutsch',
             'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']

In [None]:
cfd = nltk.ConditionalFreqDist(
    (lang, len(word))
    for lang in languages
    for word in udhr.words(lang + '-Latin1'))

In [None]:
cfd.plot(cumulative = True)

In [None]:
# small test:
# Pick a language of interest in udhr.fileids(), and define a variable raw_text = udhr.raw(Language-Latin1). Now plot a frequency
# distribution of the letters of the text using nltk.FreqDist(raw_text).plot().

## Loading Your Own Corpus
If you have your own collection of text files that you would like to access using the methods discussed earlier, you can easily load them with the help of NLTK’s PlaintextCorpusReader. Check the location of your files on your file system; in the following example, we have taken this to be the directory /Anaconda3/Lib/nltk_data. Whatever the location, set this to be the value of corpus_root.

In [None]:
from nltk.corpus import PlaintextCorpusReader
# corpus_root = '/usr/share/dict' 
corpus_root = '/Anaconda3/Lib/nltk_data'
wordlists = PlaintextCorpusReader(corpus_root, '.*') 
wordlists.fileids()

In [None]:
wordlists.words('connectives')
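
To make the idea of a plain-text corpus reader concrete, here is a rough standard-library stand-in, not NLTK's implementation: it walks a directory, reads each matching file, and splits the text into word and punctuation tokens with a simple regular expression. The helper name `read_words`, the glob-style pattern, and the temporary files are all assumptions for illustration (PlaintextCorpusReader itself takes a regular expression for file ids).

```python
import re
import tempfile
from pathlib import Path

def read_words(corpus_root, pattern='*'):
    """Map each file name under corpus_root to its list of tokens."""
    corpus = {}
    for path in sorted(Path(corpus_root).glob(pattern)):
        if path.is_file():
            text = path.read_text(encoding='utf-8')
            # Runs of word characters, or runs of punctuation, become tokens.
            corpus[path.name] = re.findall(r'\w+|[^\w\s]+', text)
    return corpus

# Build a throwaway two-file corpus so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as root:
    Path(root, 'a.txt').write_text('Hello, world!', encoding='utf-8')
    Path(root, 'b.txt').write_text('One more file.', encoding='utf-8')
    corpus = read_words(root, '*.txt')

print(sorted(corpus))
print(corpus['a.txt'])
```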

# Conditional Frequency Distributions
## Conditions and Events
A frequency distribution counts observable events, such as the appearance of words in a text. A conditional frequency distribution needs to pair each event with a condition. So instead of processing a sequence of words, we have to process a sequence of pairs.

In [None]:
text = ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...] 
pairs = [('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ...]
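
One way to picture what happens to such a sequence of pairs: every condition gets its own counter, and each event is tallied in the counter for its condition. The following is a minimal standard-library sketch of that idea, assuming a short made-up list of pairs; it is not NLTK's own implementation.

```python
from collections import Counter, defaultdict

# Hypothetical (condition, event) pairs.
pairs = [('news', 'The'), ('news', 'jury'), ('news', 'said'),
         ('romance', 'She'), ('romance', 'said'), ('romance', 'said')]

# One Counter per condition, created on first use.
cond_counts = defaultdict(Counter)
for condition, event in pairs:
    cond_counts[condition][event] += 1

print(sorted(cond_counts))              # the conditions
print(cond_counts['romance']['said'])   # count of one event under one condition
```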

## Counting Words by Genre
Whereas FreqDist() takes a simple list as input, ConditionalFreqDist() takes a list of pairs.

In [None]:
import nltk
from nltk.corpus import brown
cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))

In [None]:
genre_word = [(genre, word)
              for genre in ['news', 'romance']
              for word in brown.words(categories=genre)]

In [None]:
len(genre_word)

In [None]:
genre_word[:4]

In [None]:
genre_word[-4:]

In [None]:
cfd = nltk.ConditionalFreqDist(genre_word)

In [None]:
cfd

In [None]:
cfd.conditions()

In [None]:
cfd['news']

In [None]:
cfd['romance']

In [None]:
list(cfd['romance'])

In [None]:
cfd['romance']['could']

## Plotting and Tabulating Distributions
Apart from combining two or more frequency distributions, and being easy to initialize, a ConditionalFreqDist provides some useful methods for tabulation and plotting.

In [None]:
from nltk.corpus import inaugural

In [None]:
cfd = nltk.ConditionalFreqDist(
    (target, fileid[:4]) 
    for fileid in inaugural.fileids()
    for w in inaugural.words(fileid)
    for target in ['america', 'citizen'] 
    if w.lower().startswith(target))

In [None]:
from nltk.corpus import udhr

In [None]:
languages = ['Chickasaw', 'English', 'German_Deutsch',
             'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']

In [None]:
cfd = nltk.ConditionalFreqDist(
    (lang, len(word)) 
    for lang in languages
    for word in udhr.words(lang + '-Latin1'))

In [None]:
cfd.tabulate(conditions=['English', 'German_Deutsch'],
             samples=range(10), cumulative=True)
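
With cumulative=True, each column holds the running total for all samples up to and including that one, so a column for length 5 reads as "words of five letters or fewer". The sketch below shows the difference between plain and cumulative counts for a made-up word list, using only the standard library.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical word list; we count words by their length.
words = ['a', 'to', 'the', 'of', 'and', 'right', 'in', 'is']
length_counts = Counter(len(w) for w in words)

samples = range(1, 6)
plain = [length_counts[n] for n in samples]   # count for each exact length
cumulative = list(accumulate(plain))          # running totals, left to right

print(plain)
print(cumulative)
```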

In [None]:
# small test:
# Working with the news and romance genres from the Brown Corpus, find out which days of the week are most newsworthy,
# and which are most romantic. Define a variable called days containing a list of days of the week, i.e., ['Monday', ...]. Now tabulate the counts
# for these words using cfd.tabulate(samples=days). Now try the same thing using plot in place of tabulate. You may control the output order
# of days with the help of an extra parameter: samples=['Monday', ...]

## Generating Random Text with Bigrams
We can use a conditional frequency distribution to create a table of bigrams. The bigrams() function takes a list of words and builds a list of consecutive word pairs:

In [None]:
sent = ['In', 'the', 'beginning', 'God', 'created', 'the', 'heaven',
        'and', 'the', 'earth', '.']

In [None]:
list(nltk.bigrams(sent))  # bigrams() returns a generator in NLTK 3, so wrap it in list() to see the pairs

In [None]:
def generate_model(cfdist, word, num=15):
    for i in range(num):
        print(word, end=' ')
        word = cfdist[word].max()

In [None]:
text = nltk.corpus.genesis.words('english-kjv.txt')
bigrams = nltk.bigrams(text)
cfd = nltk.ConditionalFreqDist(bigrams)

In [None]:
print(cfd['living'])

In [None]:
generate_model(cfd, 'living')
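
Because this generator always picks the single most frequent follower, it is deterministic and tends to get stuck in a loop. The standard-library sketch below reproduces the mechanics on a made-up token list: build bigrams with `zip`, tally followers per word, then repeatedly jump to the most common follower. The function names and the toy tokens are assumptions for illustration.

```python
from collections import Counter, defaultdict

def bigrams(tokens):
    """Return consecutive pairs: (t0, t1), (t1, t2), ..."""
    return list(zip(tokens, tokens[1:]))

def generate(follow, word, num=6):
    """Greedily follow the most frequent next word, at most num steps."""
    out = []
    for _ in range(num):
        out.append(word)
        if not follow[word]:      # dead end: this word never had a follower
            break
        word = follow[word].most_common(1)[0][0]
    return out

tokens = ['the', 'cat', 'saw', 'the', 'dog', 'and',
          'the', 'cat', 'saw', 'the', 'cat']

follow = defaultdict(Counter)     # word -> Counter of the words that follow it
for first, second in bigrams(tokens):
    follow[first][second] += 1

generated = generate(follow, 'the')
print(' '.join(generated))        # note how quickly the output starts to cycle
```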

# Lexical Resources
## Wordlist Corpora
NLTK includes some corpora that are nothing more than wordlists. The Words Corpus is the /usr/dict/words file from Unix, used by some spellcheckers. We can use it to find unusual or misspelled words in a text corpus.

In [None]:
def unusual_words(text):
    text_vocab = set(w.lower() for w in text if w.isalpha())
    english_vocab = set(w.lower() for w in nltk.corpus.words.words())
    unusual = text_vocab.difference(english_vocab)
    return sorted(unusual)

In [None]:
import nltk
unusual_words(nltk.corpus.gutenberg.words('austen-sense.txt'))

In [None]:
unusual_words(nltk.corpus.nps_chat.words())
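
The filter boils down to a set difference between the text's vocabulary and a reference wordlist. Here is a small self-contained sketch of the same logic, assuming a made-up reference list passed in as a second argument (the `known_vocab` parameter is not part of the function above).

```python
def unusual_words(text, known_vocab):
    """Return alphabetic tokens from text that are absent from known_vocab."""
    text_vocab = set(w.lower() for w in text if w.isalpha())
    reference = set(w.lower() for w in known_vocab)
    return sorted(text_vocab - reference)

# Hypothetical reference wordlist and text.
known = ['the', 'was', 'a', 'great', 'surprise', 'to', 'her']
text = ['The', 'surprize', 'was', 'great', ',', 'lol', '!']

result = unusual_words(text, known)
print(result)
```

Punctuation is dropped by the `isalpha()` test, so only the archaic spelling and the chat abbreviation remain.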

In [None]:
from nltk.corpus import stopwords

In [None]:
stopwords.words('english')

In [None]:
def content_fraction(text):
    stopwords = nltk.corpus.stopwords.words('english')
    content = [w for w in text if w.lower() not in stopwords]
    return len(content) / len(text)

In [None]:
content_fraction(nltk.corpus.reuters.words())
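
The result is simply the share of tokens that are not stopwords. The sketch below runs the same calculation on a toy sentence with a tiny made-up stopword list; converting the list to a set first is an optional tweak that makes each membership test faster on large corpora.

```python
def content_fraction(text, stop):
    """Fraction of tokens in text that are not in the stopword collection."""
    stop = set(stop)   # set lookup is constant time on average
    content = [w for w in text if w.lower() not in stop]
    return len(content) / len(text)

# Hypothetical stopword list and token list.
toy_stopwords = ['the', 'of', 'a', 'is', 'in']
text = ['The', 'price', 'of', 'barley', 'is', 'rising', 'in', 'Canada']

fraction = content_fraction(text, toy_stopwords)
print(fraction)
```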

In [None]:
puzzle_letters = nltk.FreqDist('egivrvonl')

In [None]:
obligatory = 'r'

In [None]:
wordlist = nltk.corpus.words.words()

In [None]:
[w for w in wordlist if len(w) >= 6
 and obligatory in w 
 and nltk.FreqDist(w) <= puzzle_letters] 
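
The comparison between the two frequency distributions checks that a candidate word uses each letter no more often than the puzzle supplies it. The sketch below spells that check out with `collections.Counter`; the helper name `fits` and the short candidate list are assumptions for illustration.

```python
from collections import Counter

def fits(word, letters):
    """True if every letter of word is available often enough in letters."""
    need = Counter(word)
    return all(letters[ch] >= n for ch, n in need.items())

puzzle_letters = Counter('egivrvonl')
obligatory = 'r'

# Hypothetical candidates standing in for a full wordlist.
candidates = ['glover', 'govern', 'groove', 'violin', 'region', 'lover']

solutions = [w for w in candidates
             if len(w) >= 6               # long enough
             and obligatory in w          # contains the required letter
             and fits(w, puzzle_letters)] # uses only the available letters
print(solutions)
```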

In [None]:
names = nltk.corpus.names
names.fileids()

In [None]:
male_names = names.words('male.txt')
female_names = names.words('female.txt')
[w for w in male_names if w in female_names]

In [None]:
cfd = nltk.ConditionalFreqDist(
    (fileid, name[-1])
    for fileid in names.fileids()
    for name in names.words(fileid))

In [None]:
cfd.plot()

## A Pronouncing Dictionary
A slightly richer kind of lexical resource is a table (or spreadsheet), containing a word plus some properties in each row. NLTK includes the CMU Pronouncing Dictionary for U.S. English, which was designed for use by speech synthesizers.

In [None]:
entries = nltk.corpus.cmudict.entries()
len(entries)

In [None]:
for entry in entries[39943:39951]:
    print(entry)

In [None]:
for word, pron in entries:
    if len(pron) == 3:
        ph1, ph2, ph3 = pron
        if ph1 == 'P' and ph3 == 'T':
            print(word, ph2, end=' ')

In [None]:
syllable = ['N', 'IH0', 'K', 'S']

In [None]:
[word for word, pron in entries if pron[-4:] == syllable]

In [None]:
[w for w, pron in entries if pron[-1] == 'M' and w[-1] == 'n']

In [None]:
sorted(set(w[:2] for w, pron in entries if pron[0] == 'N' and w[0] != 'n'))

In [None]:
def stress(pron):
    return [char for phone in pron for char in phone if char.isdigit()]

In [None]:
[w for w, pron in entries if stress(pron) == ['0', '1', '0', '2', '0']]

In [None]:
[w for w, pron in entries if stress(pron) == ['0', '2', '0', '1', '0']]
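
The `stress` function keeps only the digits attached to vowel phones, where 1 marks primary stress, 2 secondary stress, and 0 no stress. A small sketch with a few hand-typed entries in the same format shows what it extracts; the toy list is an assumption and stands in for the full dictionary.

```python
def stress(pron):
    """Collect the stress digits from a list of phones, in order."""
    return [ch for phone in pron for ch in phone if ch.isdigit()]

# Hand-typed (word, phones) pairs in the style of the entries above.
toy_entries = [
    ('fire', ['F', 'AY1', 'ER0']),
    ('blog', ['B', 'L', 'AA1', 'G']),
    ('natural', ['N', 'AE1', 'CH', 'ER0', 'AH0', 'L']),
]

patterns = {word: ''.join(stress(pron)) for word, pron in toy_entries}
print(patterns)
```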

In [None]:
p3 = [(pron[0]+'-'+pron[2], word)
      for (word, pron) in entries
      if pron[0] == 'P' and len(pron) == 3]

In [None]:
cfd = nltk.ConditionalFreqDist(p3)

In [None]:
for template in cfd.conditions():
    if len(cfd[template]) > 10:
        words = cfd[template].keys()
        wordlist = ' '.join(words)
        print(template, wordlist[:70] + "...")

In [None]:
prondict = nltk.corpus.cmudict.dict()

In [None]:
prondict['fire']

In [None]:
prondict['blog'] = [['B', 'L', 'AA1', 'G']]

In [None]:
prondict['blog']

In [None]:
text = ['natural', 'language', 'processing']

In [None]:
[ph for w in text for ph in prondict[w][0]]
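
Indexing the dictionary directly raises a KeyError for any word it does not contain. One possible way to handle that, sketched here with a tiny hand-made dictionary, is to use `dict.get` and emit a placeholder for unknown words; the helper name `phones` and the placeholder format are assumptions for illustration.

```python
# Hand-made stand-in for the pronouncing dictionary: word -> list of pronunciations.
toy_prondict = {
    'natural': [['N', 'AE1', 'CH', 'ER0', 'AH0', 'L']],
    'blog': [['B', 'L', 'AA1', 'G']],
}

def phones(words, prondict):
    """Concatenate the first pronunciation of each word; mark unknown words."""
    out = []
    for w in words:
        prons = prondict.get(w.lower())
        if prons:
            out.extend(prons[0])        # take the first listed pronunciation
        else:
            out.append('<' + w + '>')   # placeholder instead of a KeyError
    return out

result = phones(['Natural', 'blog', 'zzyzx'], toy_prondict)
print(result)
```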

## Comparative Wordlists
Another example of a tabular lexicon is the comparative wordlist. NLTK includes so-called Swadesh wordlists, lists of about 200 common words in several languages. The languages are identified using an ISO 639 two-letter code.

In [None]:
from nltk.corpus import swadesh

In [None]:
swadesh.fileids()

In [None]:
swadesh.words('en')

In [None]:
fr2en = swadesh.entries(['fr', 'en'])

In [None]:
fr2en

In [None]:
translate = dict(fr2en)

In [None]:
translate['chien']

In [None]:
translate['jeter']

In [None]:
de2en = swadesh.entries(['de', 'en']) # German-English
es2en = swadesh.entries(['es', 'en']) # Spanish-English
translate.update(dict(de2en))
translate.update(dict(es2en))

In [None]:
translate['Hund']

In [None]:
translate['perro']
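
The translation table is an ordinary dictionary built from (source, English) pairs, so the usual dictionary behaviour applies: `update` adds new keys and silently overwrites any key that already exists. A short sketch with a few hand-typed pairs, which are assumptions standing in for the full wordlists, also shows how to reverse the direction of lookup.

```python
# Hand-typed word pairs in the same (foreign, English) shape as above.
fr2en = [('chien', 'dog'), ('jeter', 'throw')]
de2en = [('Hund', 'dog'), ('werfen', 'throw')]

translate = dict(fr2en)          # French -> English
translate.update(dict(de2en))    # add German -> English to the same table

# Reverse the French pairs to look up English -> French.
en2fr = {en: fr for fr, en in fr2en}

print(translate['Hund'], en2fr['dog'])
```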

In [None]:
languages = ['en', 'de', 'nl', 'es', 'fr', 'pt', 'la']
for i in [139, 140, 141, 142]:
    print(swadesh.entries(languages)[i])

## Shoebox and Toolbox Lexicons
A Toolbox file consists of a collection of entries, where each entry is made up of one or more fields. Most fields are optional or repeatable, which means that this kind of lexical resource cannot be treated as a table or spreadsheet.

In [None]:
from nltk.corpus import toolbox

In [None]:
toolbox.entries('rotokas.dic')

# WordNet
## Senses and Synonyms
Consider the sentence in (1a). If we replace the word motorcar in (1a) with automobile, to get (1b), the meaning of the sentence stays pretty much the same:
        
                            (1) a. Benz is credited with the invention of the motorcar.
                                b. Benz is credited with the invention of the automobile.
Since everything else in the sentence has remained unchanged, we can conclude that the words motorcar and automobile have the same meaning, i.e., they are synonyms. We can explore these words with the help of WordNet.

In [None]:
from nltk.corpus import wordnet as wn

In [None]:
wn.synsets('motorcar')

In [None]:
wn.synset('car.n.01').lemma_names()  # a method call in NLTK 3

In [None]:
wn.synset('car.n.01').definition()

In [None]:
wn.synset('car.n.01').examples()

In [None]:
wn.synset('car.n.01').lemmas()

In [None]:
wn.lemma('car.n.01.automobile')

In [None]:
wn.lemma('car.n.01.automobile').synset()

In [None]:
wn.lemma('car.n.01.automobile').name()

In [None]:
wn.synsets('car')

In [None]:
for synset in wn.synsets('car'):
    print(synset.lemma_names())

In [None]:
wn.lemmas('car')

In [None]:
# small test:
# Write down all the senses of the word dish that you can think of. Now, explore this word with the help of WordNet, using the
# same operations shown earlier.

## The WordNet Hierarchy
WordNet synsets correspond to abstract concepts, and they don’t always have corresponding words in English. These concepts are linked together in a hierarchy. Some concepts are very general, such as Entity, State, Event; these are called unique beginners
or root synsets. Others, such as gas guzzler and hatchback, are much more specific.

In [None]:
motorcar = wn.synset('car.n.01')
types_of_motorcar = motorcar.hyponyms()
types_of_motorcar[26]

In [None]:
sorted([lemma.name() for synset in types_of_motorcar for lemma in synset.lemmas()])

In [None]:
motorcar.hypernyms()

In [None]:
paths = motorcar.hypernym_paths()

In [None]:
len(paths)

In [None]:
[synset.name() for synset in paths[0]]

In [None]:
[synset.name() for synset in paths[1]]

In [None]:
motorcar.root_hypernyms()
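
A concept has more than one hypernym path whenever some ancestor has more than one parent. The sketch below illustrates this with a heavily simplified hand-made hierarchy stored as a dictionary; it is an assumption for illustration and does not reproduce WordNet's actual data or code. Paths are listed from the root down to the starting concept.

```python
# Hand-made hierarchy: each concept maps to its list of direct hypernyms (parents).
toy_hypernyms = {
    'motorcar': ['motor_vehicle'],
    'motor_vehicle': ['wheeled_vehicle'],
    'wheeled_vehicle': ['vehicle', 'container'],   # two parents -> two paths
    'vehicle': ['entity'],
    'container': ['entity'],
    'entity': [],                                  # the root has no parent
}

def hypernym_paths(node, graph):
    """Return every path from a root down to node, as lists of concept names."""
    parents = graph[node]
    if not parents:
        return [[node]]
    return [path + [node]
            for parent in parents
            for path in hypernym_paths(parent, graph)]

paths = hypernym_paths('motorcar', toy_hypernyms)
for path in paths:
    print(' > '.join(path))
```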

In [None]:
# small test:
# Try out NLTK’s convenient graphical WordNet browser: nltk.app.wordnet(). Explore the WordNet hierarchy by following the
# hypernym and hyponym links.

## More Lexical Relations
Hypernyms and hyponyms are called lexical relations because they relate one synset to another. These two relations navigate up and down the “is-a” hierarchy. Another important way to navigate the WordNet network is from items to their components (meronyms) or to the things they are contained in (holonyms). For example, the parts of a tree are its trunk, crown, and so on; these are the part_meronyms(). The substance a tree is made of includes heartwood and sapwood, i.e., the substance_meronyms(). A collection of trees forms a forest, i.e., the member_holonyms():

In [None]:
wn.synset('tree.n.01').part_meronyms()

In [None]:
wn.synset('tree.n.01').substance_meronyms()

In [None]:
wn.synset('tree.n.01').member_holonyms()

In [None]:
for synset in wn.synsets('mint', wn.NOUN):
    print(synset.name(), synset.definition())

In [None]:
wn.synset('mint.n.04').part_holonyms()

In [None]:
wn.synset('mint.n.04').substance_holonyms()

In [None]:
wn.synset('walk.v.01').entailments()

In [None]:
wn.synset('eat.v.01').entailments()

In [None]:
wn.synset('tease.v.03').entailments()

In [None]:
wn.lemma('supply.n.02.supply').antonyms()

In [None]:
wn.lemma('rush.v.01.rush').antonyms()

In [None]:
wn.lemma('horizontal.a.01.horizontal').antonyms()

In [None]:
wn.lemma('staccato.r.01.staccato').antonyms()

## Semantic Similarity
We have seen that synsets are linked by a complex network of lexical relations. Given a particular synset, we can traverse the WordNet network to find synsets with related meanings. Knowing which words are semantically related is useful for indexing a collection of texts, so that a search for a general term such as vehicle will match documents containing specific terms such as limousine.

In [None]:
right = wn.synset('right_whale.n.01')
orca = wn.synset('orca.n.01')
minke = wn.synset('minke_whale.n.01')
tortoise = wn.synset('tortoise.n.01')
novel = wn.synset('novel.n.01')
right.lowest_common_hypernyms(minke)

In [None]:
right.lowest_common_hypernyms(orca)

In [None]:
right.lowest_common_hypernyms(tortoise)

In [None]:
right.lowest_common_hypernyms(novel)

In [None]:
wn.synset('baleen_whale.n.01').min_depth()

In [None]:
wn.synset('whale.n.02').min_depth()

In [None]:
wn.synset('vertebrate.n.01').min_depth()

In [None]:
wn.synset('entity.n.01').min_depth()

In [None]:
right.path_similarity(minke)

In [None]:
right.path_similarity(orca)

In [None]:
right.path_similarity(tortoise)

In [None]:
right.path_similarity(novel)
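
The scores shrink as the two concepts get further apart in the hierarchy. The sketch below imitates this on a small hand-made tree: it finds the lowest common hypernym of two concepts and scores them as 1 / (1 + number of edges on the path between them). The tree is a simplified assumption, so the numbers differ from the real WordNet values, and the helper names are illustrative only.

```python
# Hand-made tree: each concept maps to its single parent (None for the root).
toy_parent = {
    'right_whale': 'baleen_whale',
    'minke_whale': 'baleen_whale',
    'baleen_whale': 'whale',
    'orca': 'whale',
    'whale': 'vertebrate',
    'tortoise': 'vertebrate',
    'vertebrate': 'entity',
    'novel': 'entity',
    'entity': None,
}

def ancestors(node):
    """Return the chain from node up to the root, starting with node itself."""
    chain = []
    while node is not None:
        chain.append(node)
        node = toy_parent[node]
    return chain

def lowest_common_hypernym(a, b):
    """First concept on b's chain that also lies on a's chain."""
    seen = set(ancestors(a))
    for node in ancestors(b):
        if node in seen:
            return node

def path_similarity(a, b):
    """1 / (1 + edges between a and b via their lowest common hypernym)."""
    lch = lowest_common_hypernym(a, b)
    distance = ancestors(a).index(lch) + ancestors(b).index(lch)
    return 1 / (1 + distance)

for other in ['minke_whale', 'orca', 'tortoise', 'novel']:
    print(other, lowest_common_hypernym('right_whale', other),
          round(path_similarity('right_whale', other), 3))
```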