# FLIP(01):  Advanced Data Science
**(Module 07: Natural Language Processing)**

---
- Materials in this module include resources collected from various open-source online repositories.
- You are free to use, change and distribute this package.
- If you find any issue or bug in this document, please submit an issue at [tulip-lab/mds](https://github.com/tulip-lab/mds/issues)

Prepared by and for 
**Student Members** |
2006-2019 [TULIP Lab](http://www.tulip.org.au)

---

# Session B - Accessing Text Corpora and Lexical Resources

## Contents

1 [Accessing Text Corpora](#Corpora)
* Gutenberg Corpus
* Web and Chat Text
* Brown Corpus
* Regular Expressions
* Reuters Corpus
* Inaugural Address Corpus
* Corpora in Other Languages
* Text Corpus Structure


2 [Conditional Frequency Distributions](#Distributions)
* Conditions and Events
* Counting Words by Genre
* Plotting and Tabulating Distributions

3 [More Python: Reusing Code](#Code)
* Creating Programs with a Text Editor
* Functions
* Modules

4 [Lexical Resources](#Resources)
* Wordlist Corpora
* Comparative Wordlists
* Shoebox and Toolbox Lexicons

5 [WordNet](#WordNet)
* Senses and Synonyms
* The WordNet Hierarchy
* More Lexical Relations
* Semantic Similarity

6 [Summary](#Summary)

7 [Further Reading](#reading)

<a id = "Corpora"></a>

## <span style="color:#0b486b">1. Accessing Text Corpora</span>

### Gutenberg Corpus

- http://www.gutenberg.org

In [None]:
import nltk

In [None]:
nltk.download()

In [None]:
from nltk import corpus

In [None]:
help(corpus.gutenberg.fileids)

In [None]:
nltk.corpus.gutenberg.fileids()

In [None]:
emma = nltk.corpus.gutenberg.words('austen-emma.txt')
emma

In [None]:
len(emma)

In [None]:
help(nltk.Text)

In [None]:
emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))
emma.concordance('surprize')

In [None]:
from nltk.corpus import gutenberg

In [None]:
gutenberg.fileids()

In [None]:
emma = gutenberg.words('austen-emma.txt')

#### Interpret the for statement

- num_chars: number of characters in the raw text (spaces included)
- num_words: number of word tokens
- num_sents: number of sentences
- num_vocab: number of distinct words, ignoring case


- The loop prints three statistics for each text, followed by its fileid:
- num_chars / num_words: average word length
- num_words / num_sents: average sentence length
- num_words / num_vocab: the number of times each vocabulary item appears in the text on average (lexical diversity score)

> Note: These statistics presumably serve a purpose, but what they will be used for is not yet clear.

In [None]:
for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))
    num_words = len(gutenberg.words(fileid))
    num_sents = len(gutenberg.sents(fileid))
    num_vocab = len(set([w.lower() for w in gutenberg.words(fileid)]))
    print(int(num_chars/num_words), int(num_words/num_sents),
          int(num_words/num_vocab), fileid)

In [None]:
help(gutenberg.raw)

In [None]:
len(gutenberg.raw('austen-emma.txt'))

In [None]:
help(gutenberg.words)

In [None]:
gutenberg.words('austen-emma.txt')

In [None]:
num_words2 = len(gutenberg.words('austen-emma.txt'))
num_words2

In [None]:
num_vocab2 = len(set([w.lower() for w in 
                      gutenberg.words('austen-emma.txt')]))
num_vocab2

In [None]:
# Average number of times each distinct word occurs (lexical diversity score)
num_words2 / num_vocab2

In [None]:
# sents() returns a list of sentences;
# each sentence is itself a list of word strings
help(gutenberg.sents)

In [None]:
gutenberg.sents('austen-emma.txt')

In [None]:
len(gutenberg.sents('austen-emma.txt'))

- average word length: really 3, not 4, since the num_chars variable counts space characters

- sents(): divides the text up into its sentences, where each sentence is a list of words
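
The three statistics can be checked by hand on a tiny text. The sketch below uses a made-up two-sentence sample and plain Python lists in place of the corpus reader (the names `raw`, `words`, and `sents` only mirror the reader methods; they are assumptions, not NLTK objects). It also shows why dividing the raw character count by the word count overstates the average word length.

```python
# Toy "corpus": a raw string plus the tokenised views a corpus reader would return.
raw = "The cat sat. The dog sat too."
sents = [["The", "cat", "sat", "."], ["The", "dog", "sat", "too", "."]]
words = [w for sent in sents for w in sent]

num_chars = len(raw)                          # spaces and punctuation are counted too
num_words = len(words)                        # punctuation marks count as tokens
num_sents = len(sents)
num_vocab = len({w.lower() for w in words})   # distinct tokens, ignoring case

avg_word_len = num_chars / num_words
avg_sent_len = num_words / num_sents
lexical_diversity = num_words / num_vocab

# Average length of the tokens themselves, leaving out the spaces in raw
avg_token_len = sum(len(w) for w in words) / num_words

print(avg_word_len, avg_sent_len, lexical_diversity, avg_token_len)
```

Here `avg_word_len` comes out larger than `avg_token_len`, which is the same effect the note on space characters describes.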

In [None]:
# number of characters (including spaces) in the text
len(gutenberg.raw('blake-poems.txt'))

In [None]:
macbeth_sentences = gutenberg.sents('shakespeare-macbeth.txt')
macbeth_sentences

In [None]:
len(macbeth_sentences)

In [None]:
macbeth_sentences[1037]

In [None]:
longest_len = max([len(s) for s in macbeth_sentences])
longest_len

In [None]:
[s for s in macbeth_sentences if len(s) == longest_len][0][:20]

In [None]:
' '.join([s for s in macbeth_sentences if len(s) == longest_len][0])

- Most NLTK corpus readers provide a variety of access methods, such as `words()`, `raw()`, and `sents()`.

### Web and Chat Text

- Project Gutenberg contains thousands of books, but it represents established literature.
- It is important to consider less formal language as well, such as web and chat text.

In [None]:
from nltk.corpus import webtext

In [None]:
help(webtext)

In [None]:
for fileid in webtext.fileids():
    print(fileid, webtext.raw(fileid)[:65], '...')

In [None]:
webtext.fileids()

In [None]:
from nltk.corpus import nps_chat

In [None]:
# The file name encodes its contents:
# 706 posts gathered on 10/19
# from the 20s chat room
chatroom = nps_chat.posts('10-19-20s_706posts.xml')

In [None]:
chatroom[123]

### Brown Corpus

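- The texts are categorized by genre, such as news, editorial, and so on.
- The Brown Corpus was the first million-word electronic corpus of English.
- It was created in 1961 at Brown University, which gives the corpus its name.
- The naming follows the source, in the same way that the Reuters Corpus is built from Reuters news.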

In [None]:
from nltk.corpus import brown

In [None]:
brown.categories()

In [None]:
brown.words(categories='news')

In [None]:
len(brown.words(categories='news'))

In [None]:
brown.sents(categories='news')

In [None]:
len(brown.sents(categories='news'))

In [None]:
' '.join(brown.sents(categories='news')[0])

In [None]:
brown.sents(categories=['news', 'editorial', 'reviews'])

In [None]:
len(brown.sents(categories=['news', 'editorial', 'reviews']))

In [None]:
from nltk.corpus import brown

In [None]:
news_text = brown.words(categories='news')
news_text

In [None]:
len(news_text)

In [None]:
fdist = nltk.FreqDist([w.lower() for w in news_text])
fdist

In [None]:
modals = ['can', 'could', 'may', 'might', 'must', 'will']

In [None]:
for m in modals:
    print(m + ':', fdist[m], end=' ')

#### Your Turn

- Choose a different section of the Brown Corpus, and adapt the preceding example to count a selection of wh words, such as what, when, where, who and why.

In [None]:
other_text = brown.words(categories=['news', 'humor', 'fiction'])
other_text

In [None]:
wh_words = ['what', 'when', 'where', 'who', 'why']

In [None]:
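# Naive substring test: keeps every word containing 'wh' (e.g. 'which', 'somewhat'), not only the words in wh_words
fdist2 = nltk.FreqDist([w.lower() for w in other_text if 'wh' in w])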
fdist2

### Regular Expressions

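- Without regular expressions, a plain substring test also matches longer words that merely contain a wh-word (e.g. *whatever*, *somewhere*); anchoring the pattern with `^` and `$` matches whole words only.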
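
The difference between the two tests can be seen on a short hand-written token list. This is a minimal sketch using only the standard `re` module; the token list is made up, and `re.fullmatch` is used here as an equivalent of anchoring with `^` and `$`.

```python
import re

wh_words = ['what', 'when', 'where', 'who', 'why']
tokens = ['What', 'whatever', 'who', 'whom', 'somewhere', 'Why', 'the']

# Substring test: also accepts longer words that merely contain a wh-word
substring_hits = [t for t in tokens
                  if any(wh in t.lower() for wh in wh_words)]

# Whole-word test: one compiled alternation, matched against the entire token
pattern = re.compile('|'.join(re.escape(wh) for wh in wh_words))
whole_word_hits = [t for t in tokens if pattern.fullmatch(t.lower())]

print(substring_hits)    # includes 'whatever', 'whom', 'somewhere'
print(whole_word_hits)   # only exact wh-words remain
```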

In [None]:
import re

In [None]:
fdist3 = nltk.FreqDist([word.lower() for word in other_text 
                                    for wh_word in wh_words 
                                    if re.search('^'+wh_word.lower()+'$', 
                                                 word.lower())])
fdist3

In [None]:
fdist3.N()

In [None]:
other_text

In [None]:
len(other_text)

In [None]:
lst = []
for word in other_text:
    for wh_word in wh_words:
        if wh_word.lower() in word.lower():
            lst.append(word)
        
len(lst)

In [None]:
# Each pair is (condition, event),
# e.g. ('news', 'can') or ('hobbies', 'can')
cfd = nltk.ConditionalFreqDist((genre, word)
                               for genre in brown.categories()
                               for word in brown.words(categories=genre))
cfd

In [None]:
len(cfd)

In [None]:
type(cfd)

In [None]:
cfd2 = nltk.ConditionalFreqDist()
len(cfd2)

In [None]:
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 
          'humor']

In [None]:
modals = ['can', 'could', 'may', 'might', 'must', 'will']

In [None]:
cfd.tabulate(conditions=genres, samples=modals)

### Reuters Corpus

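- 10,788 news documents totaling 1.3 million words
- The documents are classified into 90 topics
- and grouped into two sets: "training" and "test"
- e.g. the fileid `test/14826` is a document drawn from the test set
- The split exists for training and testing algorithms that automatically detect the topic of a document.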

In [None]:
from nltk.corpus import reuters

In [None]:
len(reuters.fileids())

In [None]:
reuters.fileids()[::500]

In [None]:
reuters.categories()

In [None]:
len(reuters.categories())

- Categories in the Reuters Corpus overlap with each other,
- because a news story often covers multiple topics.

In [None]:
reuters.categories('training/9865')

In [None]:
reuters.categories(['training/9865', 'training/9880'])

In [None]:
reuters.fileids('barley')

In [None]:
reuters.fileids(['barley', 'corn'])[::30]

- Similarly, we can request the words or sentences we want in terms of files or categories.

- The first few words in each text are the title, which is stored in upper case.

In [None]:
reuters.words('training/9865')[:14]

In [None]:
len(reuters.words('training/9865'))

In [None]:
reuters.words(['training/9865', 'training/9880'])

In [None]:
len(reuters.words(['training/9865', 'training/9880']))

In [None]:
reuters.words(categories='barley')

In [None]:
len(reuters.words(categories='barley'))

In [None]:
reuters.words(categories=['barley', 'corn'])

In [None]:
len(reuters.words(categories=['barley', 'corn']))

### Inaugural Address Corpus

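- word offset: the numerical index of a word when the whole corpus is treated as a single text, counting from the first word of the first address
- The corpus is actually a collection of texts, one per address, and each fileid starts with the year.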

In [None]:
from nltk.corpus import inaugural

In [None]:
inaugural.fileids()

In [None]:
[fileid[:4] for fileid in inaugural.fileids()]

In [None]:
# Alternative: split on '-'; the slicing approach above seems simpler
[fileid.split('-')[0] for fileid in inaugural.fileids()]

In [None]:
cfd = nltk.ConditionalFreqDist((target, fileid[:4])
                               for fileid in inaugural.fileids()
                               for w in inaugural.words(fileid)
                               for target in ['america', 'citizen']
                               if w.lower().startswith(target))
len(cfd)

#### Plot of a conditional frequency distribution

- All words in the Inaugural Address Corpus that begin with america or citizen are counted;
- separate counts are kept for each address;
- these are plotted so that trends in usage over time can be observed;
- counts are not normalized for document length.
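
What the conditional frequency distribution collects here can be sketched with standard-library containers. The miniature corpus below is entirely made up (placeholder fileids and tokens, not the real addresses); it only illustrates counting prefix matches per year, and, as an optional extra step that the plot does not perform, dividing by document length.

```python
from collections import Counter, defaultdict

# Made-up miniature corpus (fileid -> tokens); names and words are placeholders
addresses = {
    '1801-SpeakerA.txt': ['Fellow', 'Citizens', 'of', 'the', 'Senate'],
    '1901-SpeakerB.txt': ['citizens', 'of', 'America', 'and', 'American', 'law'],
    '2001-SpeakerC.txt': ['America', 'Americans', 'citizenship'],
}
targets = ['america', 'citizen']

counts = defaultdict(Counter)               # target -> Counter mapping year to count
for fileid, tokens in addresses.items():
    year = fileid[:4]
    for w in tokens:
        for target in targets:
            if w.lower().startswith(target):
                counts[target][year] += 1

# Raw counts favour long documents; dividing by length gives a comparable rate
rates = {fileid[:4]: counts['america'][fileid[:4]] / len(tokens)
         for fileid, tokens in addresses.items()}

print(sorted(counts['america'].items()))
print(sorted(counts['citizen'].items()))
print(rates)
```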

In [None]:
%matplotlib inline
cfd.plot()

### Annotated Text Corpora

- Corpus annotation adds linguistic information to a raw corpus, producing an annotated corpus that is more useful for analysis.
- Many text corpora contain linguistic annotations, representing part-of-speech tags, named entities, syntactic structures, semantic roles, and so forth.
- NLTK provides convenient ways to access several of these corpora.
- Its data packages contain corpora and corpus samples, freely downloadable for use in teaching and research.
- [More information](http://www.nltk.org/data)

#### Table 2-2 Some of the corpora and corpus samples distributed with NLTK

Corpus | Compiler | Contents
--- | --- | ---
Brown Corpus | Francis, Kucera | 15 genres, 1.15M words, tagged, categorized
CESS Treebanks | CLiC-UB | 1M words, tagged and parsed (Catalan, Spanish)
⋯ | ⋯ | ⋯

### Corpora in Other Languages

- NLTK comes with corpora for many languages.
- In some cases, you need to learn how to manipulate character encodings in Python before using these corpora.

In [None]:
help(nltk.corpus.cess_esp)

In [None]:
# Spanish treebank (CESS-ESP)
nltk.corpus.cess_esp.words()

In [None]:
len(nltk.corpus.cess_esp.words())

In [None]:
# Portuguese treebank (Floresta)
nltk.corpus.floresta.words()

In [None]:
len(nltk.corpus.floresta.words())

In [None]:
nltk.corpus.indian.words('hindi.pos')

In [None]:
len(nltk.corpus.indian.words('hindi.pos'))

In [None]:
nltk.corpus.udhr.fileids()[::50]

In [None]:
nltk.corpus.udhr.words('Javanese-Latin1')[11:]

- udhr: the Universal Declaration of Human Rights in over 300 languages.
- The fileids for this corpus include information about the character encoding used in the file, such as UTF8 or Latin1.
- Below, a conditional frequency distribution examines the differences in word lengths for a selection of languages in the udhr corpus.

In [None]:
from nltk.corpus import udhr

In [None]:
languages = ['Chickasaw', 'English', 'German_Deutsch',
             'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']

In [None]:
cfd = nltk.ConditionalFreqDist((lang, len(word))
                               for lang in languages
                               for word in udhr.words(lang + '-Latin1'))
len(cfd)

#### Figure 2-2. Cumulative word length distribution

- Six translations of the Universal Declaration of Human Rights are processed
- this graph shows that words having five or fewer letters account for about 80% of Ibibio text, 60% of German text, and 25% of Inuktitut text.
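
The quantity plotted on the cumulative graph can be computed by hand. The following sketch uses only the standard library; the helper name `cumulative_length_share` is an assumption made for this illustration, and the sample is a single English sentence rather than a whole translation.

```python
from collections import Counter
from itertools import accumulate

def cumulative_length_share(words, max_len=10):
    """Return {length: share of words with at most that many letters}."""
    lengths = Counter(len(w) for w in words)
    total = sum(lengths.values())
    running = accumulate(lengths[n] for n in range(1, max_len + 1))
    return {n: count / total
            for n, count in zip(range(1, max_len + 1), running)}

sample = "all human beings are born free and equal in dignity and rights".split()
shares = cumulative_length_share(sample)

print(shares[5])    # share of words with five or fewer letters
```

A language with many long words reaches a given share only at larger lengths, which is why its curve rises more slowly on the cumulative plot.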

In [None]:
# Cumulative plot of the conditional frequency distribution
cfd.plot(cumulative=True)

In [None]:
cfd.plot(cumulative=False)

#### Your Turn

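- Pick a language of interest in `udhr.fileids()` and define a variable `raw_text = udhr.raw(Language-Latin1)`. Then plot a frequency distribution of the letters of the text using `nltk.FreqDist(raw_text).plot()`.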

In [None]:
raw_text = udhr.raw(['Chickasaw-Latin1', 'English-Latin1'])
nltk.FreqDist(raw_text).plot()

In [None]:
nltk.FreqDist(raw_text).plot(cumulative=True)

In [None]:
raw_text = udhr.raw(['Chickasaw-Latin1', 'English-Latin1', 
                     'German_Deutsch-Latin1'])
nltk.FreqDist(raw_text).plot()

In [None]:
# nltk.FreqDist(raw_text).plot(cumulative=True)

- Unfortunately, for many languages, substantial corpora are not yet available.
- Often there is insufficient government or industrial support for developing language resources.
- Some languages have no established writing system.

### Text Corpus Structure

- A summary of the basic corpora functionality is shown in Table 2-3.


In [None]:
help(nltk.corpus.reader)

#### Table 2-3. Basic corpus functionality defined in NLTK

Example | Description
--- | ---
fileids() | The files of the corpus
fileids([categories]) | The files of the corpus corresponding to these categories
categories() | The categories of the corpus
categories([fileids]) | The categories of the corpus corresponding to these files
raw() | The raw content of the corpus
raw(fileids=[f1, f2, f3]) | The raw content of the specified files
raw(categories=[c1, c2]) | The raw content of the specified categories
words() | The words of the whole corpus
words(fileids=[f1, f2, f3]) | The words of the specified fileids
words(categories=[c1, c2]) | The words of the specified categories
sents() | The sentences of the whole corpus
sents(fileids=[f1, f2, f3]) | The sentences of the specified fileids
sents(categories=[c1, c2]) | The sentences of the specified categories
abspath(fileid) | The location of the given file on disk
encoding(fileid) | The encoding of the file (if known)
open(fileid) | Open a stream for reading the given corpus file
root | The path to the root of the locally installed corpus
readme() | The contents of the README file of the corpus

In [None]:
raw = gutenberg.raw('burgess-busterbrown.txt')

In [None]:
raw[1:20]

In [None]:
len(raw)

In [None]:
words = gutenberg.words('burgess-busterbrown.txt')
words[1:20]

In [None]:
' '.join(words[1:20])

In [None]:
sents = gutenberg.sents('burgess-busterbrown.txt')
sents[1:5]

### Loading Your Own Corpus

- If you have your own collection of text files and would like to access them using the methods discussed above, you can load them with NLTK's `PlaintextCorpusReader`.

In [None]:
from nltk.corpus import PlaintextCorpusReader

In [None]:
corpus_root = '/usr/share/dict'

In [None]:
wordlists = PlaintextCorpusReader(corpus_root, '.*')

In [None]:
wordlists.fileids()

In [None]:
!ls /usr/share/dict

In [None]:
ws = wordlists.words('connectives')
ws

In [None]:
type(ws)

In [None]:
len(wordlists.words('connectives'))

#### Another example: a local copy of the Penn Treebank

In [None]:
from nltk.corpus import BracketParseCorpusReader

In [None]:
corpus_root = '/Users/re4lfl0w/nltk_data/corpora/treebank/parsed/'

In [None]:
file_pattern = r'.*00[0-1][0-9]\.prd'

In [None]:
ptb = BracketParseCorpusReader(corpus_root, file_pattern)

In [None]:
ptb.fileids()

In [None]:
len(ptb.sents())

In [None]:
ptb.sents(fileids='wsj_0005.prd')[1]

<a id = "Distributions"></a>

## <span style="color:#0b486b">2. Conditional Frequency Distributions</span>

- The texts of a corpus are often divided into several categories (genre, topic, author, etc.).
- We can maintain a separate frequency distribution for each category.
- conditional frequency distribution: a collection of frequency distributions, each one for a different condition.


### Conditions and Events

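- A frequency distribution counts observable events, such as the appearance of words in a text; a conditional frequency distribution pairs each event with a condition.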

In [None]:
text = ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said']

In [None]:
# (condition, event)
pairs = [('news', 'The'), ('news', 'Fulton'), ('news', 'County')]

- Brown Corpus by genre: 15 conditions, 1,161,192 events (one per word)

### Counting Words by Genre

- `FreqDist()`: takes a simple list as input
- `ConditionalFreqDist()`: takes a list of pairs
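
Conceptually, a conditional frequency distribution is one counter per condition. The sketch below reproduces that idea with `collections` only; it is a simplified stand-in for illustration (the function name `conditional_freq` is an assumption), not NLTK's implementation.

```python
from collections import Counter, defaultdict

def conditional_freq(pairs):
    """Group (condition, event) pairs into one Counter per condition."""
    table = defaultdict(Counter)
    for condition, event in pairs:
        table[condition][event] += 1
    return table

pairs = [('news', 'the'), ('news', 'jury'), ('news', 'the'),
         ('romance', 'the'), ('romance', 'could'), ('romance', 'could')]
cfd_sketch = conditional_freq(pairs)

print(sorted(cfd_sketch))                     # the conditions
print(cfd_sketch['news']['the'])              # count of one event under one condition
print(cfd_sketch['romance'].most_common(1))   # most frequent event for a condition
```

Indexing first by condition and then by event mirrors the `cfd['romance']['could']` lookups used in this section.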

In [None]:
from nltk.corpus import brown

In [None]:
cfd = nltk.ConditionalFreqDist((genre, word)
                               for genre in brown.categories()
                               for word in brown.words(categories=genre))

In [None]:
cfd.conditions()

In [None]:
cfd.plot()

In [None]:
genre_word = [(genre, word)
              for genre in ['news', 'romance']
              for word in brown.words(categories=genre)]
len(genre_word)

In [None]:
genre_word[:4]

In [None]:
genre_word[-4:]

In [None]:
cfd = nltk.ConditionalFreqDist(genre_word)
cfd

In [None]:
cfd.conditions()

In [None]:
cfd['news']

In [None]:
cfd['romance']

In [None]:
list(cfd['romance'])

In [None]:
cfd['romance']['could']

In [None]:
cfd.plot()

### Plotting and Tabulating Distributions

- Apart from combining two or more frequency distributions and being easy to initialize,
- a `ConditionalFreqDist` provides some useful methods for tabulation and plotting.

In [None]:
from nltk.corpus import inaugural

In [None]:
cfd = nltk.ConditionalFreqDist((target, fileid[:4])
                               for fileid in inaugural.fileids()
                               for w in inaugural.words(fileid)
                               for target in ['america', 'citizen']
                               if w.lower().startswith(target))

cfd.plot()

In [None]:
from nltk.corpus import udhr

In [None]:
languages = ['Chickasaw', 'English', 'German_Deutsch',
             'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']

In [None]:
cfd = nltk.ConditionalFreqDist((lang, len(word))
                               for lang in languages
                               for word in udhr.words(lang + '-Latin1'))
cfd.plot()

In [None]:
cfd.plot(cumulative=True)

In [None]:
cfd.tabulate(conditions=['English', 'German_Deutsch'],
             samples=range(10), cumulative=True)

In [None]:
# Omitting conditions= tabulates all conditions
cfd.tabulate(samples=range(10), cumulative=True)

### Generating Random Text with Bigrams

- You can use a conditional frequency distribution to create a table of bigrams (word pairs).

In [None]:
sent = ['In', 'the', 'beginning', 'God', 'created', 'the', 'heaven',
        'and', 'the', 'earth', '.']

In [None]:
# nltk.bigrams() returns a generator, so wrap it in list() to see the pairs
list(nltk.bigrams(sent))

#### Example 2-1. Generating random text

- This program obtains all bigrams from the text of the book of Genesis,
- then constructs a conditional frequency distribution to record which words are most likely to follow a given word;
- e.g., after the word living, the most likely word is creature;
- the generate_model() function uses this data, and a seed word, to generate random text.

In [None]:
def generate_model(cfdist, word, num=15):
    for i in range(num):
        print(word, end=' ')
        word = cfdist[word].max()

In [None]:
text = nltk.corpus.genesis.words('english-kjv.txt')
text

In [None]:
# Materialize the generator so it can be inspected here and reused below
bigrams = list(nltk.bigrams(text))
bigrams[:10]

In [None]:
cfd = nltk.ConditionalFreqDist(bigrams)

In [None]:
print(cfd['living'])

In [None]:
generate_model(cfd, 'living', num=5)

In [None]:
generate_model(cfd, 'living', num=2)

In [None]:
# The most likely word to follow 'living'
cfd['living'].max()

In [None]:
cfd['creature'].max()

In [None]:
cfd['living']

In [None]:
' '.join(cfd['living'])

- Conditional frequency distributions are a useful data structure for many NLP tasks
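
The bigram table behind `generate_model()` can also be sketched without NLTK. The version below is a simplified illustration built on `collections`, run on a short made-up token sequence (the function names `build_bigram_table` and `generate` are assumptions for this sketch). It shows a known weakness of always choosing the most frequent follower: the output quickly falls into a loop.

```python
from collections import Counter, defaultdict

def build_bigram_table(tokens):
    """Map each word to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for first, second in zip(tokens, tokens[1:]):
        table[first][second] += 1
    return table

def generate(table, word, num=8):
    """Greedy generation: always pick the most frequent follower."""
    output = []
    for _ in range(num):
        output.append(word)
        followers = table.get(word)
        if not followers:          # dead end: nothing ever followed this word
            break
        word = followers.most_common(1)[0][0]
    return output

tokens = "the living creature and the living soul and the living creature".split()
table = build_bigram_table(tokens)
print(' '.join(generate(table, 'living')))
```

Sampling a follower in proportion to its count, instead of taking the maximum, is one common way to avoid such loops.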

#### Table 2-4. NLTK's conditional frequency distributions

Example | Description
--- | ---
cfdist = ConditionalFreqDist(pairs) | Create a conditional frequency distribution from a list of pairs
cfdist.conditions() | Alphabetically sorted list of conditions
cfdist[condition] | The frequency distribution for this condition
cfdist\[condition\]\[sample\] | Frequency for the given sample for this condition
cfdist.tabulate() | Tabulate the conditional frequency distribution
cfdist.tabulate(samples, conditions) | Tabulation limited to the specified samples and conditions
cfdist.plot() | Graphical plot of the conditional frequency distribution
cfdist.plot(samples, conditions) | Graphical plot limited to the specified samples and conditions
cfdist1 < cfdist2 | Test if samples in cfdist1 occur less frequently than in cfdist2

<a id = "Code"></a>

## <span style="color:#0b486b">3. More Python: Reusing Code</span>

  - Python functions

### Functions

- function
- parameters
- return value

In [None]:
from __future__ import division

In [None]:
def lexical_diversity(text):
    return len(text) / len(set(text))

In [None]:
s = 'hello world hello'

In [None]:
lexical_diversity(s)

In [None]:
len(s)

In [None]:
len(set(s))

In [None]:
len(s) / len(set(s))

In [None]:
def lexical_diversity(my_text_data):
    word_count = len(my_text_data)
    vocab_size = len(set(my_text_data))
    diversity_score = word_count / vocab_size
    return diversity_score

In [None]:
lexical_diversity(s)

- local variables
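
Two points about the function are easy to miss: the result depends on whether it receives a string or a list of words, and names assigned inside the function are local to it. A minimal sketch, reusing the sample string from the cells above:

```python
def lexical_diversity(tokens):
    word_count = len(tokens)          # local variable: exists only during the call
    vocab_size = len(set(tokens))
    return word_count / vocab_size

s = 'hello world hello'
print(lexical_diversity(s))           # a string is a sequence of characters: 17 / 8
print(lexical_diversity(s.split()))   # a list of words: 3 / 2

try:
    word_count
except NameError:
    print('word_count is not defined outside the function')
```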

#### Example 2-2 A Python function

- This function tries to work out the plural form of any English noun

In [None]:
def plural(word):
    if word.endswith('y'):
        return word[:-1] + 'ies'
    elif word[-1] in 'sx' or word[-2:] in ['sh', 'ch']:
        return word + 'es'
    elif word.endswith('an'):
        return word[:-2] + 'en'
    else:
        return word + 's'

In [None]:
plural('fairy')

In [None]:
plural('woman')

<a id = "Resources"></a>

## <span style="color:#0b486b">4. Lexical Resources</span>

- lexicon or lexical resource: a collection of words and/or phrases along with associated information, such as part-of-speech and sense definitions

```python
vocab = sorted(set(my_text))
word_freq = FreqDist(my_text)
```

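- `vocab`, `word_freq`: simple lexical resources
- lexical entry: a headword (also known as a lemma) along with additional information, such as the part of speech and the sense definition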


### Wordlist Corpora

- NLTK includes some corpora that are nothing more than wordlists.
- Words Corpus: the `/usr/dict/words` file from Unix, used by some spell checkers.

#### Example 2-3 Filtering a text

- Compute the vocabulary of a text.
- Remove all items that occur in an existing wordlist, leaving just the uncommon or misspelled words.

In [None]:
def unusual_words(text):
    text_vocab = set(w.lower() for w in text if w.isalpha())
    english_vocab = set(w.lower() for w in nltk.corpus.words.words())
    # equivalent to: text_vocab - english_vocab
    unusual = text_vocab.difference(english_vocab)
    return sorted(unusual)

In [None]:
len(nltk.corpus.words.words())

In [None]:
nltk.corpus.words.words()[::10000]

In [None]:
unusual_words(nltk.corpus.gutenberg.words('austen-sense.txt'))[::100]

In [None]:
unusual_words(nltk.corpus.nps_chat.words())[::100]

- stopwords: high-frequency words such as *the*, *to*, and *also*
   - that we sometimes want to filter out of a document
   - before further processing.
- Stopwords usually have little lexical content, and their presence in a text fails to distinguish it from other texts.

In [None]:
from nltk.corpus import stopwords

In [None]:
stopwords.words('english')

In [None]:
def content_fraction(text):
    stopwords = set(nltk.corpus.stopwords.words('english'))  # a set makes membership tests fast
    content = [w for w in text if w.lower() not in stopwords]
    return len(content) / len(text)

In [None]:
content_fraction(nltk.corpus.reuters.words())

- With the help of stopwords, we filter out about a third of the words of the text.
- This combines two different kinds of corpus:
- a lexical resource is used to filter the content of a text corpus.

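- A wordlist is also useful for solving word puzzles. Here we look for words of six or more letters that contain the obligatory letter `r` and use only the letters of `egivrvonl`, each no more often than it appears there.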
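
The letter-count check that such a puzzle relies on can be sketched with `collections.Counter`. This is an illustration under stated assumptions: the helper name `fits_puzzle` and the candidate list are made up, and the real search runs over the full Words Corpus.

```python
from collections import Counter

def fits_puzzle(word, letters, obligatory, min_len=6):
    """True if word can be spelled from letters, using each no more often than it appears."""
    available = Counter(letters)
    needed = Counter(word)
    return (len(word) >= min_len
            and obligatory in word
            and all(needed[ch] <= available[ch] for ch in needed))

candidates = ['glover', 'govern', 'violin', 'revolving', 'lovers', 'over']
print([w for w in candidates if fits_puzzle(w, 'egivrvonl', 'r')])
```

Comparing per-letter counts, rather than sets of letters, is what rejects a word that needs a letter more often than the puzzle supplies it.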

In [None]:
puzzle_letters = nltk.FreqDist('egivrvonl')
puzzle_letters

In [None]:
obligatory = 'r'

In [None]:
wordlist = nltk.corpus.words.words()
len(wordlist)

In [None]:
[w for w in wordlist if len(w) >= 6
                        and obligatory in w
                        and nltk.FreqDist(w) <= puzzle_letters]

In [None]:
len([w for w in wordlist if len(w) >= 6
                        and obligatory in w])

In [None]:
names = nltk.corpus.names
names

In [None]:
names.fileids()

In [None]:
male_names = names.words('male.txt')
male_names[::100]

In [None]:
len(male_names)

In [None]:
female_names = names.words('female.txt')
female_names[::100]

In [None]:
len(female_names)

In [None]:
[w for w in male_names if w in female_names][::50]

#### Figure 2-7. Conditional frequency distribution

- This plot shows the number of female and male names ending with each letter of the alphabet
- most names ending with a, e, or i are female;
- names ending in h and l are equally likely to be male or female;
- names ending in k, o, r, s, and t are likely to be male.

In [None]:
cfd = nltk.ConditionalFreqDist((fileid, name[-1])
                                for fileid in names.fileids()
                                for name in names.words(fileid))
cfd.plot()

### A Pronouncing Dictionary

- A slightly richer kind of lexical resource is a table (or spreadsheet)
- containing a word plus some properties in each row.
- NLTK includes the CMU Pronouncing Dictionary for US English.
- It was designed for use by speech synthesizers.
- For each word, this lexicon provides a list of phonetic codes,
- that is, distinct labels for each contrastive sound.
- [Arpabet - Wikipedia, the free encyclopedia](http://en.wikipedia.org/wiki/Arpabet)


In [None]:
entries = nltk.corpus.cmudict.entries()

In [None]:
len(entries)

In [None]:
for entry in entries[39943:39951]:
    print(entry)

- The phonetic codes listed for each word are known as phones.
- *fire* has two pronunciations (in US English): the one-syllable F AY1 R and the two-syllable F AY1 ER0.
- The symbols in the CMU Pronouncing Dictionary are from the Arpabet:
- http://en.wikipedia.org/wiki/Arpabet


- Each entry consists of two parts, and we can process these individually using a more complex version of the for statement.
- Each time through the loop, `word` is assigned the first part of the entry and `pron` is assigned the second part.

In [None]:
for word, pron in entries:
    if len(pron) == 3:
        ph1, ph2, ph3 = pron
        if ph1 == 'P' and ph3 == 'T':
            print(word, ph2)

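- The program above scans the lexicon for entries whose pronunciation consists of three phones; when the first phone is P and the last is T, it prints the word and its middle phone.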

In [None]:
syllable = ['N', 'IH0', 'K', 'S']

In [None]:
# Words whose pronunciation ends with the four phones in syllable (words that rhyme with 'nicks')
[word for word, pron in entries if pron[-4:] == syllable]

In [None]:
entries[10]

In [None]:
entries[10][-1:]

In [None]:
entries[10][1][-2:]

In [None]:
entries[10][1][-2]

In [None]:
entries[10][1][-2:] == 'AH0'

In [None]:
entries[10][1][-2] == 'AH0'

In [None]:
entries[10][-4:]

In [None]:
[w for w, pron in entries if pron[-1] == 'M' and w[-1] == 'n']

In [None]:
sorted(set(w[:2] for w, pron in entries if pron[0] == 'N' and w[0] != 'n'))

#### pron, phone

- pron is the list of phones for a word.
- Vowel phones end in a digit (0, 1, or 2) that marks stress:
- 1: primary stress
- 2: secondary stress
- 0: no stress

In [None]:
def stress(pron):
    '''Return the stress digits found in a pronunciation.

    Input such as ['AO1', 'R', 'ER0', 'Z'] gives ['1', '0'].
    '''
    return [char for phone in pron for char in phone if char.isdigit()]

In [None]:
len(entries)

In [None]:
len([w for w, pron in entries if stress(pron) == ['0', '1', '0', '2', '0']])

In [None]:
[w for w, pron in entries if stress(pron) == ['0', '1', '0', '2', '0']][::50]

In [None]:
[w for w, pron in entries if stress(pron) == ['0', '2', '0', '1', '0']][::50]

- stress() is invoked inside the condition of a list comprehension; stress() itself uses a doubly nested for loop.
- Next, find all the p-words consisting of three sounds
- and group them according to their first and last sounds.
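
The grouping step can be tried on a handful of entries before running it on the full dictionary. The entries below are a small hand-written sample in the `(word, pron)` shape returned by `cmudict.entries()`; they are assumptions for illustration, not loaded from the corpus.

```python
from collections import defaultdict

# Hand-written sample in the (word, pron) shape of cmudict.entries()
entries_sample = [
    ('pat', ['P', 'AE1', 'T']), ('pit', ['P', 'IH1', 'T']),
    ('pot', ['P', 'AA1', 'T']), ('pick', ['P', 'IH1', 'K']),
    ('pack', ['P', 'AE1', 'K']), ('spot', ['S', 'P', 'AA1', 'T']),
]

groups = defaultdict(list)      # 'P-T' -> words that differ only in the middle phone
for word, pron in entries_sample:
    if len(pron) == 3 and pron[0] == 'P':
        groups[pron[0] + '-' + pron[2]].append(word)

for template in sorted(groups):
    print(template, ' '.join(groups[template]))
```

Each group is a set of words that contrast only in their vowel, which is the kind of minimally contrasting set the conditional frequency distribution is used to find.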

In [None]:
p3 = [(pron[0] + '-' + pron[2], word)
      for (word, pron) in entries
      if pron[0] == 'P' and len(pron) == 3]

In [None]:
p3[::20]

In [None]:
cfd = nltk.ConditionalFreqDist(p3)

In [None]:
cfd.conditions()

In [None]:
for template in cfd.conditions():
    if len(cfd[template]) > 10:
        words = cfd[template].keys()
        wordlist = ' '.join(words)
        print(template, wordlist[:66] + '...')

In [None]:
prondict = nltk.corpus.cmudict.dict()

In [None]:
len(prondict)

In [None]:
prondict['fire']

In [None]:
prondict['blog'] = [['B', 'L', 'AA1', 'G']]
prondict['blog']

In [None]:
text = ['natural', 'language', 'processing']

- Any lexical resource can be used to process a text, e.g. to filter out words having some lexical property (like nouns), or to map every word of the text.
- For example, the following text-to-speech function looks up each word of the text in the pronunciation dictionary.
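
Indexing the dictionary directly raises a `KeyError` for any word it does not contain. The sketch below shows one way to handle that, assuming a small hand-copied sample in the shape of `cmudict.dict()` (word mapped to a list of pronunciations); the helper name `to_phones` is hypothetical.

```python
# Hand-copied sample in the shape of cmudict.dict(): word -> list of pronunciations
prondict_sample = {
    'natural': [['N', 'AE1', 'CH', 'ER0', 'AH0', 'L']],
    'language': [['L', 'AE1', 'NG', 'G', 'W', 'AH0', 'JH']],
}

def to_phones(text, prondict):
    """Return (phones, unknown_words), using the first listed pronunciation."""
    phones, unknown = [], []
    for w in text:
        prons = prondict.get(w.lower())   # dictionary keys are lower case
        if prons:
            phones.extend(prons[0])
        else:
            unknown.append(w)             # collect misses instead of failing
    return phones, unknown

phones, unknown = to_phones(['Natural', 'language', 'blog'], prondict_sample)
print(phones)
print(unknown)
```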

In [None]:
[ph for w in text for ph in prondict[w][0]]

### Comparative Wordlists

- comparative wordlist: another example of a tabular lexicon.
- NLTK includes Swadesh wordlists: lists of about 200 common words in several languages.
- The languages are identified by an ISO 639 two-letter code.

In [None]:
from nltk.corpus import swadesh

In [None]:
swadesh.fileids()

In [None]:
swadesh.words('en')[:30]

- Use the `entries()` method with a list of languages to access cognate words from multiple languages.
- The result can be converted into a simple dictionary.

In [None]:
fr2en = swadesh.entries(['fr', 'en'])
fr2en[:30]

In [None]:
translate = dict(fr2en)

In [None]:
translate['chien']

In [None]:
translate['jeter']

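- Make the translator more useful by adding other source languages: get the German-English and Spanish-English pairs, convert each to a dictionary with `dict()`, and update `translate` with them.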
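
Merging dictionaries with `update()` has one caveat worth seeing in isolation: a key that already exists is overwritten. The sketch below uses a few hand-written pairs in the shape returned by `swadesh.entries()` (a tiny sample, not the full lists) and shows an alternative keyed on language and word; `translate_by_lang` is a name introduced only for this illustration.

```python
# Hand-written pairs in the shape of swadesh.entries([...]); a tiny sample only
fr2en = [('chien', 'dog'), ('jeter', 'throw')]
de2en = [('Hund', 'dog'), ('werfen', 'throw')]

translate = dict(fr2en)
translate.update(dict(de2en))          # adds the German entries in place
print(translate['chien'], translate['Hund'])

# update() silently overwrites an existing key, so a spelling shared by two
# source languages would keep only the last translation. Keying on
# (language, word) avoids that.
translate_by_lang = {}
for lang, pairs in [('fr', fr2en), ('de', de2en)]:
    for word, english in pairs:
        translate_by_lang[(lang, word)] = english

print(translate_by_lang[('de', 'werfen')])
print(translate.get('gato', '<unknown>'))   # .get() avoids a KeyError for missing words
```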

In [None]:
de2en = swadesh.entries(['de', 'en'])

In [None]:
es2en = swadesh.entries(['es', 'en'])

In [None]:
translate.update(dict(de2en))

In [None]:
translate.update(dict(es2en))

In [None]:
translate['Hund']

In [None]:
translate['perro']

- You can compare words in various Germanic and Romance languages.

In [None]:
languages = ['en', 'de', 'nl', 'es', 'fr', 'pt', 'la']

In [None]:
for i in [139, 140, 141, 142]:
    print(swadesh.entries(languages)[i])

### Shoebox and Toolbox Lexicons

- Toolbox, previously known as Shoebox, is perhaps the single most popular tool used by linguists for managing data.
- A Toolbox file consists of a collection of entries, where each entry is made up of one or more fields.

In [None]:
from nltk.corpus import toolbox

In [None]:
toolbox.entries('rotokas.dic')[:3]

- Entries consist of a series of attribute-value pairs.
- ('ps', 'V'): the part of speech is 'V' (verb)
- ('ge', 'gag'): the gloss into English is 'gag'
- The loose structure of Toolbox files makes it hard to do much more with them at this stage.
- XML provides a powerful way to process this kind of corpus.
- Rotokas is notable for having an inventory of just 12 phonemes.

<a id = "WordNet"></a>

## <span style="color:#0b486b">5. WordNet</span>

- WordNet is a semantically oriented dictionary of English.
- NLTK includes the English WordNet, with 155,287 words and 117,659 synonym sets.

### Senses and Synonyms


- a. Benz is credited with the invention of the motorcar.
- b. Benz is credited with the invention of the automobile.

- motorcar, automobile: synonyms

In [None]:
from nltk.corpus import wordnet as wn

In [None]:
wn.synsets('motorcar')

- `car.n.01`: a synset (synonym set), that is, a collection of synonymous words (or "lemmas"); here it is the first noun sense of *car*.

In [None]:
wn.synset('car.n.01').lemma_names()

In [None]:
wn.synset('car.n.01').definition()

In [None]:
wn.synset('car.n.01').examples()

In [None]:
# get all the lemmas for a given synset
wn.synset('car.n.01').lemmas()

In [None]:
# look up a particular lemma
wn.lemma('car.n.01.automobile')

In [None]:
# get the synset corresponding to a lemma
wn.lemma('car.n.01.automobile').synset()

In [None]:
# get the "name" of a lemma
wn.lemma('car.n.01.automobile').name()

In [None]:
wn.synsets('car')

In [None]:
for synset in wn.synsets('car'):
    print(synset.lemma_names())

In [None]:
wn.lemmas('car')

### The WordNet Hierarchy

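- WordNet synsets correspond to abstract concepts, and they do not always have corresponding words in English.
- These concepts are linked together in a hierarchy.
- unique beginners (root synsets): very general concepts such as *entity*, *state*, and *event*.

- WordNet makes it easy to navigate between concepts.
- hyponyms: the more specific concepts immediately below a given concept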
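
Navigating up and down such a hierarchy can be sketched on a toy graph. The dictionary below is a made-up simplification, loosely modelled on the motorcar example and not real WordNet data; the function names only echo `hypernym_paths()` and `hyponyms()`. A concept with two parents yields more than one path to the root.

```python
# Toy hierarchy (child -> list of parents); a simplification, not real WordNet data
hypernyms = {
    'hatchback': ['car'],
    'car': ['motor_vehicle'],
    'motor_vehicle': ['wheeled_vehicle'],
    'wheeled_vehicle': ['container', 'vehicle'],   # two parents give two paths
    'container': ['entity'],
    'vehicle': ['entity'],
    'entity': [],
}

def hypernym_paths(concept):
    """All paths from a root concept down to `concept`, most general first."""
    parents = hypernyms[concept]
    if not parents:
        return [[concept]]
    return [path + [concept] for p in parents for path in hypernym_paths(p)]

def hyponyms(concept):
    """Immediate children of `concept`."""
    return sorted(c for c, parents in hypernyms.items() if concept in parents)

for path in hypernym_paths('car'):
    print(' > '.join(path))
print(hyponyms('car'))
```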

In [None]:
motorcar = wn.synset('car.n.01')

In [None]:
types_of_motorcar = motorcar.hyponyms()

In [None]:
types_of_motorcar[26]

In [None]:
types_of_motorcar

In [None]:
sorted([lemma.name() for synset in types_of_motorcar for lemma in synset.lemmas()])

In [None]:
motorcar.hypernyms()

In [None]:
paths = motorcar.hypernym_paths()
paths

In [None]:
len(paths)

In [None]:
[synset.name() for synset in paths[0]]

In [None]:
[synset.name() for synset in paths[1]]
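
A synset has more than one hypernym path when one of its ancestors has more than one parent. The sketch below reproduces that idea on a small hand-coded graph. The node names follow the motorcar example, but the graph is a simplified assumption, and the function `hypernym_paths` here is a plain-Python stand-in rather than the NLTK method.

```python
# Child -> list of parents (hypernyms). Hand-coded and simplified.
HYPERNYMS = {
    'car': ['motor_vehicle'],
    'motor_vehicle': ['wheeled_vehicle'],
    'wheeled_vehicle': ['container', 'vehicle'],  # two parents
    'container': ['instrumentality'],
    'vehicle': ['conveyance'],
    'conveyance': ['instrumentality'],
    'instrumentality': ['entity'],
    'entity': [],  # root
}


def hypernym_paths(node):
    """Return every path from a root down to node, root first."""
    parents = HYPERNYMS[node]
    if not parents:
        return [[node]]
    return [path + [node]
            for parent in parents
            for path in hypernym_paths(parent)]


for path in hypernym_paths('car'):
    print(' > '.join(path))
```

The two printed paths differ only in the stretch between `instrumentality` and `wheeled_vehicle`, which is where the toy graph branches.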

In [None]:
# Get the most general hypernyms(or root hypernyms)
motorcar.root_hypernyms()

### More Lexical Relations

- lexical relations: hypernyms and hyponyms relate one synset to another.
- [meronyms](http://en.wikipedia.org/wiki/Meronymy): the parts or components of an item
- [holonyms](http://en.wikipedia.org/wiki/Holonymy): the things an item is contained in
- substance_meronyms(): the substances an item is made of
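
Meronymy and holonymy are inverse relations: if a trunk is a part of a tree, then tree is a holonym of trunk. The sketch below inverts a small hand-written part-of table to show this. The table and the helper name `invert` are illustrative assumptions and are not pulled from WordNet.

```python
from collections import defaultdict

# whole -> its parts (part meronyms). Hand-written toy data.
PART_MERONYMS = {
    'tree': ['trunk', 'crown', 'limb'],
    'car': ['engine', 'wheel'],
    'bicycle': ['wheel'],
}


def invert(relation):
    """Turn a whole -> parts table into a part -> wholes table."""
    inverse = defaultdict(list)
    for whole, parts in relation.items():
        for part in parts:
            inverse[part].append(whole)
    return dict(inverse)


PART_HOLONYMS = invert(PART_MERONYMS)
print(PART_HOLONYMS['trunk'])
print(sorted(PART_HOLONYMS['wheel']))
```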

In [None]:
wn.synset('tree.n.01').part_meronyms()

In [None]:
wn.synset('tree.n.01').substance_meronyms()

In [None]:
wn.synset('tree.n.01').member_holonyms()

In [None]:
# print the name and definition of each noun sense of 'mint'
for synset in wn.synsets('mint', wn.NOUN):
    print('{0}: {1}'.format(synset.name(), synset.definition()))

In [None]:
wn.synset('mint.n.04').part_holonyms()

In [None]:
wn.synset('mint.n.04').substance_holonyms()

In [None]:
wn.synset('walk.v.01').entailments()

In [None]:
wn.synset('eat.v.01').entailments()

In [None]:
wn.synset('tease.v.03').entailments()

- antonymy: a relation between lemmas with opposite meanings (defined on lemmas, not on synsets)

In [None]:
# Supply and Demand
wn.lemma('supply.n.02.supply').antonyms()

In [None]:
wn.lemma('rush.v.01.rush').antonyms()

In [None]:
wn.lemma('horizontal.a.01.horizontal').antonyms()

In [None]:
wn.lemma('staccato.r.01.staccato').antonyms()

#### dir

- Use dir() to see which lexical relations and other methods are defined on a synset.

In [None]:
dir(wn.synset('harmony.n.02'))
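
The raw output of `dir()` includes many underscore-prefixed internals. A small filter can make it easier to scan. `public_methods` is a hypothetical helper name, and it is demonstrated on a plain string so that it runs without the WordNet data.

```python
def public_methods(obj):
    """Return names of callable attributes that have no leading underscore."""
    return [name for name in dir(obj)
            if not name.startswith('_') and callable(getattr(obj, name))]


# Demonstrated on a string; a synset object could be passed in the same way.
print(public_methods('harmony')[:5])
```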

### Semantic Similarity

- Synsets are linked by a complex network of lexical relations.

In [None]:
right = wn.synset('right_whale.n.01')
right

In [None]:
orca = wn.synset('orca.n.01')
orca

In [None]:
minke = wn.synset('minke_whale.n.01')
minke

In [None]:
tortoise = wn.synset('tortoise.n.01')
tortoise

In [None]:
novel = wn.synset('novel.n.01')
novel

In [None]:
right.lowest_common_hypernyms(minke)

In [None]:
right.lowest_common_hypernyms(orca)

In [None]:
right.lowest_common_hypernyms(tortoise)

In [None]:
right.lowest_common_hypernyms(novel)

- baleen whale: very specific
- vertebrate: more general
- entity: completely general
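
To make "lowest common hypernym" and "depth" concrete, the sketch below works on a hand-coded slice of the hierarchy in which every node has a single parent. The slice is a simplifying assumption (many intermediate levels are omitted), so the depths it prints are not the values WordNet reports, and the function names are plain-Python stand-ins rather than NLTK methods.

```python
# Child -> parent for a hand-coded, simplified slice of the hierarchy.
PARENT = {
    'right_whale': 'baleen_whale',
    'minke_whale': 'rorqual',
    'rorqual': 'baleen_whale',
    'baleen_whale': 'whale',
    'orca': 'dolphin',
    'dolphin': 'toothed_whale',
    'toothed_whale': 'whale',
    'whale': 'entity',  # many intermediate levels omitted
    'entity': None,
}


def ancestors(node):
    """Return [node, parent, grandparent, ..., root]."""
    chain = []
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain


def depth(node):
    """Number of edges between node and the root."""
    return len(ancestors(node)) - 1


def lowest_common_hypernym(a, b):
    """First ancestor of a, walking upward, that is also an ancestor of b."""
    seen = set(ancestors(b))
    for node in ancestors(a):
        if node in seen:
            return node


print(lowest_common_hypernym('right_whale', 'minke_whale'))
print(lowest_common_hypernym('right_whale', 'orca'))
print(depth('baleen_whale'), depth('entity'))
```

The more closely related two concepts are, the deeper (more specific) their lowest common hypernym sits in the hierarchy.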

In [None]:
wn.synset('minke_whale.n.01').min_depth()

In [None]:
wn.synset('baleen_whale.n.01').min_depth()

In [None]:
wn.synset('whale.n.02').min_depth()

In [None]:
wn.synset('vertebrate.n.01').min_depth()

In [None]:
wn.synset('entity.n.01').min_depth()

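- Similarity measures
- path_similarity: returns a score in the range 0-1, based on the shortest path that connects the two concepts in the hypernym hierarchy.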

In [None]:
right.path_similarity(minke)

In [None]:
right.path_similarity(orca)

In [None]:
right.path_similarity(tortoise)

In [None]:
right.path_similarity(novel)
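
One way to read these scores: as far as I understand NLTK's implementation, `path_similarity` is 1 / (d + 1), where d is the number of edges on the shortest path between the two synsets. The sketch below applies that formula to a hand-coded fragment of the whale hierarchy. The fragment and the function names are assumptions made for illustration, so the code runs without the WordNet data.

```python
from collections import deque

# Undirected hypernym/hyponym links; hand-coded and simplified.
EDGES = [
    ('right_whale', 'baleen_whale'),
    ('minke_whale', 'rorqual'),
    ('rorqual', 'baleen_whale'),
    ('baleen_whale', 'whale'),
    ('orca', 'dolphin'),
    ('dolphin', 'toothed_whale'),
    ('toothed_whale', 'whale'),
]

NEIGHBOURS = {}
for a, b in EDGES:
    NEIGHBOURS.setdefault(a, set()).add(b)
    NEIGHBOURS.setdefault(b, set()).add(a)


def shortest_path_length(start, goal):
    """Breadth-first search; returns the edge count, or None if unreachable."""
    queue = deque([(start, 0)])
    visited = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in NEIGHBOURS[node]:
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def path_similarity(a, b):
    """Score 1 / (d + 1), where d is the shortest path length."""
    d = shortest_path_length(a, b)
    return None if d is None else 1.0 / (d + 1)


print(path_similarity('right_whale', 'minke_whale'))
print(path_similarity('right_whale', 'orca'))
```

A synset compared with itself has d = 0 and therefore scores 1, and the score shrinks toward 0 as the path grows longer.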

#### Several other similarity measures are available

In [None]:
help(wn)

In [None]:
nltk.corpus.verbnet

<a id = "Summary"></a>

## <span style="color:#0b486b">6. Summary</span>

- A text corpus is a large, structured collection of texts. NLTK comes with many corpora, e.g., the Brown Corpus, nltk.corpus.brown.
- Some text corpora are categorized, e.g., by genre or topic; sometimes the categories of a corpus overlap each other.
- A conditional frequency distribution is a collection of frequency distributions, each one for a different condition. They can be used for counting word frequencies, given a context or a genre.
- Python programs more than a few lines long should be entered using a text editor, saved to a file with a .py extension, and accessed using an import statement.
- Python functions permit you to associate a name with a particular block of code, and reuse that code as often as necessary.
- Some functions, known as "methods", are associated with an object, and we give the object name followed by a period followed by the method name, like this: **x.funct(y)**, e.g., **word.isalpha()**
- To find out about some variable v, type help(v) in the Python interactive interpreter to read the help entry for this kind of object.
- WordNet is a semantically oriented dictionary of English, consisting of synonym sets (synsets) organized into a network.
- Some functions are not available by default, but must be accessed using Python's import statement.

<a id = "reading"></a>

## <span style="color:#0b486b">7. Further Reading</span>

- [Extra materials](http://www.nltk.org)
- [Corpus HOWTO](http://www.nltk.org/howto)
- Linguistic Data Consortium (LDC)
- European Language Resource Agency (ELRA)
- [OLAC Metadata](http://www.language-archives.org)
- [Ethnologue](http://www.ethnologue.com)
- [WordNet](http://globalwordnet.org)