# FLIP(01):  Advanced Data Science
**(Module 03: Natural Language Processing)**

---
- Materials in this module include resources collected from various open-source online repositories.
- You are free to use, but NOT allowed to change or distribute this package.

Prepared by and for 
**Student Members** |
2006-2018 [TULIP Lab](http://www.tulip.org.au)

---


# Session 10 - Managing Linguistic Data

### Corpus Structure: A Case Study
#### The Structure of TIMIT

NLTK includes a sample from the TIMIT Corpus. You can access its documentation in the usual way, using help(nltk.corpus.timit). Print nltk.corpus.timit.fileids() to see a list of the 160 recorded utterances in the corpus sample. Each item has a phonetic transcription, which can be accessed using the phones() method. We can access the corresponding word tokens in the customary way. Both access methods permit an optional argument offset=True, which includes the start and end offsets of the corresponding span in the audio file.

In [None]:
import nltk
phonetic = nltk.corpus.timit.phones('dr1-fvmh0/sa1')

In [None]:
phonetic

In [None]:
nltk.corpus.timit.word_times('dr1-fvmh0/sa1')

In [None]:
timitdict = nltk.corpus.timit.transcription_dict()

In [None]:
timitdict['greasy'] + timitdict['wash'] + timitdict['water']

In [None]:
nltk.corpus.timit.spkrinfo('dr1-fvmh0')

# The Life Cycle of a Corpus

We can also measure the agreement between two independent segmentations of language input, e.g., for tokenization, sentence segmentation, and named entity recognition. Windowdiff is a simple algorithm for evaluating the agreement of two segmentations by running a sliding window over the data and awarding partial credit for near misses. If we preprocess our tokens into a sequence of zeros and ones, to record when a token is followed by a boundary, we can represent the segmentations as strings and apply the windowdiff scorer.

In [None]:
s1 = "00000010000000001000000"
s2 = "00000001000000010000000"
s3 = "00010000000000000001000"

In [None]:
nltk.windowdiff(s1, s1, 3)

In [None]:
nltk.windowdiff(s1, s2, 3)

In [None]:
nltk.windowdiff(s2, s3, 3)
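The preprocessing and scoring steps described in this section can be sketched in plain Python. This is a minimal illustration, not NLTK's implementation: the helper names boundary_string and window_diff are made up for this example, and the sketch assumes that the end of the input is not recorded as a boundary and that a window counts as a mismatch whenever the two segmentations place a different number of boundaries in it.

```python
def boundary_string(segments):
    """Encode a segmentation as one '0'/'1' flag per token.

    A token gets '1' when a boundary follows it. The end of the input is
    not treated as a boundary (an assumption of this sketch).
    """
    flags = []
    for segment in segments:
        if not segment:
            continue  # skip empty segments
        flags.extend("0" * (len(segment) - 1) + "1")
    if flags:
        flags[-1] = "0"  # no boundary is recorded after the final token
    return "".join(flags)


def window_diff(seg1, seg2, k, boundary="1"):
    """Fraction of length-k windows in which the boundary counts differ."""
    if len(seg1) != len(seg2):
        raise ValueError("segmentations must have the same length")
    if not 0 < k <= len(seg1):
        raise ValueError("window size must be between 1 and the sequence length")
    n_windows = len(seg1) - k + 1
    mismatches = sum(
        seg1[i:i + k].count(boundary) != seg2[i:i + k].count(boundary)
        for i in range(n_windows)
    )
    return mismatches / n_windows


s1 = "00000010000000001000000"
s2 = "00000001000000010000000"
print(boundary_string([["a", "b"], ["c"], ["d", "e"]]))
print(window_diff(s1, s1, 3), window_diff(s1, s2, 3))
```

Because each boundary in s2 is only one position away from its counterpart in s1, most windows still agree, which is how the near misses earn partial credit.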

# Acquiring Data

## Obtaining Data from Word Processor Files

Consider the following fragment of a lexical entry: “sleep [sli:p] v.i. a condition of body and mind...”. We can key in such text using MSWord, then “Save as Web Page,” then inspect the resulting HTML file:

<p class=MsoNormal>sleep
<span style='mso-spacerun:yes'> </span>
[<span class=SpellE>sli:p</span>]
<span style='mso-spacerun:yes'> </span>
<b><span style='font-size:11.0pt'>v.i.</span></b>
<span style='mso-spacerun:yes'> </span>
<i>a condition of body and mind ...<o:p></o:p></i>
</p>

Observe that the entry is represented as an HTML paragraph, using the <p> element, and that the part of speech appears inside a <span style='font-size:11.0pt'> element. The following program defines the set of legal parts-of-speech, legal_pos. Then it extracts all 11-point content from the dict.htm file and stores it in the set used_pos. Observe that the search pattern contains a parenthesized sub-expression; only the material that matches this subexpression is returned by re.findall. Finally, the program constructs the set of illegal parts-of-speech as the set difference between used_pos and legal_pos:

In [None]:
legal_pos = set(['n', 'v.t.', 'v.i.', 'adj', 'det'])

In [None]:
import re
pattern = re.compile(r"'font-size:11.0pt'>([a-z.]+)<")

In [None]:
document = open("dict.htm").read()

In [None]:
used_pos = set(re.findall(pattern, document))
illegal_pos = used_pos.difference(legal_pos)
print(list(illegal_pos))
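Since dict.htm is not distributed with this notebook, the same steps can be tried on an inline string. The sketch below reuses the first entry shown above (with the spacer spans dropped) and adds a second, made-up entry whose part-of-speech label v.intr is deliberately outside legal_pos.

```python
import re

# Two entries in the style of Word's HTML export; the second one is invented
# for this example and uses a part-of-speech label that is not in legal_pos.
html = """<p class=MsoNormal>sleep
[<span class=SpellE>sli:p</span>]
<b><span style='font-size:11.0pt'>v.i.</span></b>
<i>a condition of body and mind ...</i>
</p>
<p class=MsoNormal>walk
[<span class=SpellE>wo:k</span>]
<b><span style='font-size:11.0pt'>v.intr</span></b>
<i>progress by lifting and setting down each foot ...</i>
</p>"""

legal_pos = {'n', 'v.t.', 'v.i.', 'adj', 'det'}

# findall() returns only the text captured by the parenthesized group
pattern = re.compile(r"'font-size:11.0pt'>([a-z.]+)<")
used_pos = set(pattern.findall(html))

illegal_pos = used_pos - legal_pos
print(sorted(illegal_pos))
```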

In [None]:
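# Converting HTML created by Microsoft Word into comma-separated values
import re
from bs4 import BeautifulSoup  # nltk.clean_html() was removed in NLTK 3

def lexical_data(html_file):
    SEP = '_ENTRY'
    with open(html_file) as f:
        html = f.read()
    html = re.sub(r'<p', SEP + '<p', html)  # mark the start of each entry
    text = BeautifulSoup(html, 'html.parser').get_text()  # strip the markup
    text = ' '.join(text.split())  # normalize whitespace
    for entry in text.split(SEP):
        if entry.count(' ') > 2:
            yield entry.split(' ', 3)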

In [None]:
import csv
# In Python 3 the csv module expects a text-mode file opened with newline=""
with open("dict1.csv", "w", newline="") as csv_file:
    csv.writer(csv_file).writerows(lexical_data("dict.htm"))

## Obtaining Data from Spreadsheets and Databases

Spreadsheets are often used for acquiring wordlists or paradigms. For example, a comparative wordlist may be created using a spreadsheet, with a row for each cognate set and a column for each language (see nltk.corpus.swadesh and www.rosettaproject.org). Most spreadsheet software can export its data in CSV format. As we will see later, it is easy for Python programs to access these files using the csv module.
When our goal is simply to extract the contents of a database, it is enough to dump the tables (or SQL query results) in CSV format and load them into our program. Our program might perform a linguistically motivated query that cannot easily be expressed in SQL, e.g., select all words that appear in example sentences for which no dictionary entry is provided. For this task, we would need to extract enough information from a record for it to be uniquely identified, along with the headwords and example sentences. Let’s suppose this information is now available in a CSV file dict.csv with four fields per row, the first holding the lexeme and the last the definition. We can then express this query as shown here:

In [None]:
import csv
lexicon = csv.reader(open('dict.csv'))
pairs = [(lexeme, defn) for (lexeme, _, _, defn) in lexicon]
lexemes, defns = zip(*pairs)
defn_words = set(w for defn in defns for w in defn.split())
sorted(defn_words.difference(lexemes))
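To try this query without a dict.csv file on disk, the rows can be supplied from an in-memory string. The three rows below are invented for illustration and only assume the four-field layout that the code above unpacks; the names sample_csv and missing are likewise specific to this sketch.

```python
import csv
import io

# Hypothetical rows: lexeme, pronunciation, part of speech, definition
sample_csv = io.StringIO(
    '"sleep","sli:p","v.i","a condition of body and mind"\n'
    '"walk","wo:k","v.intr","progress by lifting each foot"\n'
    '"wake","weik","intrans","cease to sleep"\n'
)

pairs = [(lexeme, defn) for lexeme, _pron, _pos, defn in csv.reader(sample_csv)]
lexemes = {lexeme for lexeme, _ in pairs}
defn_words = {word for _, defn in pairs for word in defn.split()}

# Words used in definitions that have no entry of their own
missing = sorted(defn_words - lexemes)
print(missing)
```

Only "sleep" is filtered out here, because it is the one definition word that also occurs as a lexeme.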

## Converting Data Formats

In the simplest case, the input and output formats are isomorphic. For instance, we might be converting lexical data from Toolbox format to XML, and it is straightforward to transliterate the entries one at a time. The structure of the data is reflected in the structure of the required program: a for loop whose body takes care of a single entry.
In another common case, the output is a digested form of the input, such as an inverted file index. Here it is necessary to build an index structure in memory, then write it to a file in the desired format. The following example constructs an index that maps the words of a dictionary definition to the corresponding lexeme for each lexical entry, having tokenized the definition text and discarded short words. Once the index has been constructed, we open a file and then iterate over the index entries, to write out the lines in the required format.

In [None]:
idx = nltk.Index((defn_word, lexeme)
                 for (lexeme, defn) in pairs
                 for defn_word in nltk.word_tokenize(defn)
                 if len(defn_word) > 3)

In [None]:
idx_file = open("dict.idx", "w")

In [None]:
for word in sorted(idx):
    idx_words = ', '.join(idx[word])
    idx_line = "%s: %s\n" % (word, idx_words)
    idx_file.write(idx_line)
idx_file.close()
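The index structure itself does not depend on NLTK. As a rough sketch of what the cells above build, the version below uses collections.defaultdict in place of nltk.Index and plain str.split() in place of nltk.word_tokenize; both substitutions, and the four sample entries, are assumptions made to keep the example self-contained.

```python
from collections import defaultdict

# Hypothetical (lexeme, definition) pairs
pairs = [
    ("sleep", "a condition of body and mind"),
    ("wake", "cease to sleep"),
    ("doze", "sleep lightly"),
    ("walk", "progress by lifting each foot"),
]

# Inverted index: definition word -> lexemes whose definition contains it
index = defaultdict(list)
for lexeme, defn in pairs:
    for word in defn.split():
        if len(word) > 3:  # discard short words
            index[word].append(lexeme)

# One "word: lexeme, lexeme" line per index entry, in alphabetical order
lines = ["%s: %s" % (word, ", ".join(index[word])) for word in sorted(index)]
print("\n".join(lines))
```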

# Working with XML

## The ElementTree Interface

Python’s ElementTree module provides a convenient way to access data stored in XML files. ElementTree is part of Python’s standard library as xml.etree.ElementTree. Older NLTK releases bundled a copy as nltk.etree, but NLTK 3 no longer provides it, so import the module from the standard library.
We will illustrate the use of ElementTree using a collection of Shakespeare plays that have been formatted using XML. Let’s load the XML file and inspect the raw data, first at the top of the file, where we see some XML headers and the name of a schema called play.dtd, followed by the root element PLAY. We pick it up again at the start of Act 1.

In [None]:
merchant_file = nltk.data.find('corpora/shakespeare/merchant.xml')
raw = open(merchant_file).read()

In [None]:
print(raw[0:168])

In [None]:
from xml.etree import ElementTree as ET
# parse() returns an ElementTree wrapper; getroot() gives the indexable root element (PLAY)
merchant = ET.parse(merchant_file).getroot()

In [None]:
merchant

In [None]:
merchant[0]

In [None]:
merchant[0].text

In [None]:
list(merchant)  # getchildren() was removed in Python 3.9

In [None]:
merchant[-2][0].text

In [None]:
merchant[-2][1]

In [None]:
merchant[-2][1][0].text

In [None]:
merchant[-2][1][54]

In [None]:
merchant[-2][1][54][0]

In [None]:
merchant[-2][1][54][0].text

In [None]:
merchant[-2][1][54][1]

In [None]:
merchant[-2][1][54][1].text

Although we can access the entire tree this way, it is more convenient to search for subelements with particular names. Recall that the elements at the top level have several types. We can iterate over just the types we are interested in (such as the acts), using merchant.findall('ACT'). Here’s an example of doing such tag-specific searches at every level of nesting:

In [None]:
for i, act in enumerate(merchant.findall('ACT')):
    for j, scene in enumerate(act.findall('SCENE')):
        for k, speech in enumerate(scene.findall('SPEECH')):
            for line in speech.findall('LINE'):
                if 'music' in str(line.text):
                    print("Act %d Scene %d Speech %d: %s" % (i+1, j+1, k+1, line.text))
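The same tag-specific searches can be tried on a small inline document when the Shakespeare corpus is not installed. The play below is invented for this sketch; only the element names (PLAY, ACT, SCENE, SPEECH, SPEAKER, LINE) follow the structure used above.

```python
import xml.etree.ElementTree as ET

# A made-up play that mimics the element structure of merchant.xml
play_xml = """<PLAY>
  <TITLE>A Made-Up Play</TITLE>
  <ACT>
    <TITLE>ACT I</TITLE>
    <SCENE>
      <TITLE>SCENE I</TITLE>
      <SPEECH><SPEAKER>ANNA</SPEAKER><LINE>Let the music play.</LINE></SPEECH>
      <SPEECH><SPEAKER>BEN</SPEAKER><LINE>I hear nothing.</LINE></SPEECH>
    </SCENE>
  </ACT>
  <ACT>
    <TITLE>ACT II</TITLE>
    <SCENE>
      <TITLE>SCENE I</TITLE>
      <SPEECH><SPEAKER>ANNA</SPEAKER><LINE>Still no music?</LINE></SPEECH>
    </SCENE>
  </ACT>
</PLAY>"""

play = ET.fromstring(play_xml)  # fromstring() returns the root element directly

acts = play.findall('ACT')  # direct children named ACT only
speakers = [s.text for s in play.findall('ACT/SCENE/SPEECH/SPEAKER')]  # path search
music_lines = [line.text for line in play.iter('LINE') if 'music' in (line.text or '')]

print(len(acts), speakers, music_lines)
```

findall() with a plain tag name looks only at direct children, a slash-separated path descends one level per step, and iter() walks the whole subtree regardless of depth.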

In [None]:
speaker_seq = [s.text for s in merchant.findall('ACT/SCENE/SPEECH/SPEAKER')]
speaker_freq = nltk.FreqDist(speaker_seq)

In [None]:
top5 = [speaker for speaker, _ in speaker_freq.most_common(5)]  # keys() is neither sliceable nor frequency-sorted in Python 3

In [None]:
top5

In [None]:
from collections import defaultdict
mapping = defaultdict(lambda: 'OTH')
for s in top5:
    mapping[s] = s[:4]

In [None]:
speaker_seq2 = [mapping[s] for s in speaker_seq]
cfd = nltk.ConditionalFreqDist(nltk.bigrams(speaker_seq2))  # ibigrams() was removed in NLTK 3
cfd.tabulate()

## Using ElementTree for Accessing Toolbox Data

We can use the toolbox.xml() method to access a Toolbox file and load it into an ElementTree object. This file contains a lexicon for the Rotokas language of Papua New Guinea.

In [None]:
from nltk.corpus import toolbox
lexicon = toolbox.xml('rotokas.dic')

In [None]:
lexicon[3][0]

In [None]:
lexicon[3][0].tag

In [None]:
lexicon[3][0].text

In [None]:
[lexeme.text.lower() for lexeme in lexicon.findall('record/lx')]

In [None]:
import sys
from xml.etree.ElementTree import ElementTree
tree = ElementTree(lexicon[3])
tree.write(sys.stdout, encoding='unicode')  # 'unicode' writes str, as sys.stdout expects

## Formatting Entries

In [None]:
html = "<table>\n"
for entry in lexicon[70:80]:
    lx = entry.findtext('lx')
    ps = entry.findtext('ps')
    ge = entry.findtext('ge')
    html += " <tr><td>%s</td><td>%s</td><td>%s</td></tr>\n" % (lx, ps, ge)
# Close the table and print once, after all rows have been added
html += "</table>"
print(html)

# Working with Toolbox Data

Given the popularity of Toolbox among linguists, we will discuss some further methods for working with Toolbox data. Many of the methods discussed in previous chapters, such as counting, building frequency distributions, and tabulating co-occurrences, can be applied to the content of Toolbox entries. For example, we can trivially compute the average number of fields for each entry:

In [None]:
from nltk.corpus import toolbox
lexicon = toolbox.xml('rotokas.dic')
sum(len(entry) for entry in lexicon) / len(lexicon)

## Adding a Field to Each Entry

In [None]:
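# Adding a new cv field to a lexical entry
import re

def cv(s):
    s = s.lower()
    s = re.sub(r'[^a-z]', r'_', s)   # anything that is not a letter becomes '_'
    s = re.sub(r'[aeiou]', r'V', s)  # vowels become 'V'
    s = re.sub(r'[^V_]', r'C', s)    # every remaining letter is a consonant
    return s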

In [None]:
from xml.etree.ElementTree import SubElement

def add_cv_field(entry):
    for field in entry:
        if field.tag == 'lx':
            cv_field = SubElement(entry, 'cv')
            cv_field.text = cv(field.text)

In [None]:
lexicon = toolbox.xml('rotokas.dic')

In [None]:
add_cv_field(lexicon[53])

In [None]:
from nltk.toolbox import to_sfm_string
print(to_sfm_string(lexicon[53]))
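The effect of adding the field can be checked without the Rotokas lexicon. The sketch below applies the same consonant-vowel mapping to a hand-written record that stands in for a real entry; the record contents are an assumption of this example, and ET.tostring is used for display instead of the Toolbox SFM formatter.

```python
import re
import xml.etree.ElementTree as ET


def cv(s):
    """Map a word to its consonant-vowel template, e.g. 'CVC'."""
    s = s.lower()
    s = re.sub(r'[^a-z]', '_', s)   # non-letters
    s = re.sub(r'[aeiou]', 'V', s)  # vowels
    return re.sub(r'[^V_]', 'C', s)  # everything else is a consonant


# A hand-written record standing in for a lexicon entry
entry = ET.fromstring("<record><lx>kaeviro</lx><ps>V</ps></record>")

# Append a new <cv> child derived from the text of the <lx> field
cv_field = ET.SubElement(entry, 'cv')
cv_field.text = cv(entry.findtext('lx'))

print(ET.tostring(entry, encoding='unicode'))
```

SubElement() appends the new field at the end of the entry, so it shows up as the last field when the entry is printed.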

## Validating a Toolbox Lexicon

Many lexicons in Toolbox format do not conform to any particular schema. Some entries may include extra fields, or may order existing fields in a new way. Manually inspecting thousands of lexical entries is not practicable. However, we can easily identify frequent versus exceptional field sequences, with the help of a FreqDist:

In [None]:
fd = nltk.FreqDist(':'.join(field.tag for field in entry) for entry in lexicon)

In [None]:
fd.most_common()  # field sequences sorted from most to least frequent
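The idea of counting field sequences can also be sketched with the standard library alone. Here collections.Counter stands in for nltk.FreqDist, and the three records are invented so that one of them orders its fields differently from the others.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# A made-up lexicon: the third record swaps the order of <ps> and <ge>
lexicon = ET.fromstring("""<toolbox_data>
  <record><lx>a</lx><ps>N</ps><ge>x</ge></record>
  <record><lx>b</lx><ps>V</ps><ge>y</ge></record>
  <record><lx>c</lx><ge>z</ge><ps>N</ps></record>
</toolbox_data>""")

# Summarize each entry by the sequence of its field tags, then count the sequences
field_sequences = Counter(':'.join(field.tag for field in entry) for entry in lexicon)

for sequence, count in field_sequences.most_common():
    print(count, sequence)
```

Sequences with a low count are the candidates for manual inspection, since they deviate from the dominant layout.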

In [None]:
# Validating Toolbox entries using a context-free grammar.
grammar = nltk.CFG.fromstring('''
    S -> Head PS Glosses Comment Date Sem_Field Examples
    Head -> Lexeme Root
    Lexeme -> "lx"
    Root -> "rt" |
    PS -> "ps"
    Glosses -> Gloss Glosses |
    Gloss -> "ge" | "tkp" | "eng"
    Date -> "dt"
    Sem_Field -> "sf"
    Examples -> Example Ex_Pidgin Ex_English Examples |
    Example -> "ex"
    Ex_Pidgin -> "xp"
    Ex_English -> "xe"
    Comment -> "cmt" | "nt" |
    ''')

In [None]:
def validate_lexicon(grammar, lexicon, ignored_tags):
    rd_parser = nltk.RecursiveDescentParser(grammar)
    for entry in lexicon:
        marker_list = [field.tag for field in entry if field.tag not in ignored_tags]
        if list(rd_parser.parse(marker_list)):  # parse() yields trees; nbest_parse() was removed in NLTK 3
            print("+", ':'.join(marker_list))
        else:
            print("-", ':'.join(marker_list))

In [None]:
lexicon = toolbox.xml('rotokas.dic')[10:20]

In [None]:
ignored_tags = ['arg', 'dcsv', 'pt', 'vx']

In [None]:
validate_lexicon(grammar, lexicon, ignored_tags)

In [None]:
# Chunking a Toolbox lexicon: a chunk grammar describing the structure of entries in a lexicon for Iu Mien, a language of China.
# nltk_contrib is not maintained for Python 3; NLTK itself provides ToolboxData.
from nltk.toolbox import ToolboxData

In [None]:
grammar = r"""
    lexfunc: {<lf>(<lv><ln|le>*)*}
    example: {<rf|xv><xn|xe>*}
    sense: {<sn><ps><pn|gv|dv|gn|gp|dn|rn|ge|de|re>*<example>*<lexfunc>*}
    record: {<lx><hm><sense>+<dt>}
"""

In [None]:
from xml.etree.ElementTree import ElementTree, indent  # indent() requires Python 3.9+

In [None]:
db = ToolboxData()
db.open(nltk.data.find('corpora/toolbox/iu_mien_samp.db'))

In [None]:
lexicon = db.parse(grammar, encoding='utf8')

In [None]:
indent(lexicon)  # pretty-print the tree in place

In [None]:
tree = ElementTree(lexicon)

In [None]:
# write() emits bytes when an encoding is given, so open the file in binary mode
with open("iu_mien_samp.xml", "wb") as output:
    tree.write(output, encoding='utf8')