# Linguistic annotation with `Python`

`Python` is a highly versatile programming language that offers a large number of
libraries to support your work as a digital lexicographer.

This notebook illustrates the different levels of automatic linguistic
annotation used in the course.

## Libraries used in the course

We use the [Natural Language Toolkit](https://www.nltk.org/) (`NLTK`), which comes
with a large set of tools and resources. In addition, we use [spaCy](https://spacy.io/).
It offers a smaller range of functionality but is considerably faster and relies on
state-of-the-art algorithms (namely deep learning approaches).

## Setup

We assume that you have a working `Python3` installation. The following instructions
are tailored to Linux and macOS but should, with minor modifications, work on
Windows as well.

### `pip`

`pip` is the package manager for `Python`. It has shipped with `Python` since version 3.4.

### `virtualenv`

`virtualenv` allows you to set up local (and clean) `Python` environments. It may be
installed via
```sh
[sudo] pip install virtualenv
```

Create a virtual environment in a subdirectory of your choice (e.g. `env`) using
```sh
virtualenv -p python3 env
```

and activate it:
```sh
. env/bin/activate
```
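
To check from within `Python` whether the environment is really active, a small sketch along the following lines may help. The helper name `in_virtualenv` is made up for this illustration; the check relies on `sys.prefix` differing from `sys.base_prefix` inside a virtual environment (older `virtualenv` releases set `sys.real_prefix` instead).

```python
import sys


def in_virtualenv() -> bool:
    """Return True if the interpreter appears to run inside a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; venv and newer
    # virtualenv releases make sys.prefix differ from sys.base_prefix.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or sys.prefix != base_prefix


print("virtual environment active:", in_virtualenv())
print("interpreter:", sys.executable)
```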

### `NLTK` and `spaCy`

Third-party `Python` packages (including `NLTK` and `spaCy`) are best installed using `pip`:
```sh
(env) pip install -r requirements.txt
```

## Testing

Now, we are ready to roll. Start `Python`:
```sh
(env) python
```

### `NLTK`

`NLTK` itself provides a high-level API to numerous NLP tools. Before we can use them, the
models and data they rely on have to be downloaded.

#### Tokenization

In [12]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/kmw/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

We are now ready to do some work:

In [4]:
sentence = "This shows NLTK's potentials."
tokens = nltk.word_tokenize(sentence)
print(tokens)

['This', 'shows', 'NLTK', "'s", 'potentials', '.']
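
To get a feeling for what a tokenizer has to decide, here is a deliberately naive regular-expression sketch. It is not the algorithm behind `nltk.word_tokenize`; the pattern and the name `toy_tokenize` are assumptions made for illustration, and the approach breaks down quickly on abbreviations, numbers and the like.

```python
import re

# Order matters: try the clitic "'s" first, then runs of word characters,
# then any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"'s|\w+|[^\w\s]")


def toy_tokenize(text):
    """Split text into word, clitic and punctuation tokens (toy version)."""
    return TOKEN_PATTERN.findall(text)


print(toy_tokenize("This shows NLTK's potentials."))
# prints ['This', 'shows', 'NLTK', "'s", 'potentials', '.']
```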


#### Morphological analysis

Let's go for stemming!

In [5]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
for token in tokens:
    print(stemmer.stem(token))

this
show
nltk
's
potenti
.


Not very impressive? Let's try lemmatization instead:

In [7]:
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
for token in tokens:
    print(lemmatizer.lemmatize(token))

[nltk_data] Downloading package wordnet to /Users/kmw/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


This
show
NLTK
's
potential
.
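
The difference between the two outputs comes from the approach: a stemmer strips suffixes by rule and may produce non-words (`potenti`), while a lemmatizer looks words up in a lexicon and returns dictionary forms. The following toy sketch contrasts the two ideas; the suffix list and the lemma table are hypothetical and have nothing to do with the actual Snowball rules or WordNet.

```python
SUFFIXES = ("ing", "ed", "s")  # hypothetical rule set, far from complete


def toy_stem(word):
    """Rule-based: lowercase, then strip the first matching suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        # keep at least four characters so short words stay intact
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


# hypothetical mini lexicon mapping inflected forms to lemmas
LEMMA_TABLE = {"shows": "show", "potentials": "potential", "mice": "mouse"}


def toy_lemmatize(word):
    """Lexicon-based: look the form up, leave unknown words untouched."""
    return LEMMA_TABLE.get(word, word)


for word in ["shows", "potentials", "mice", "This"]:
    print(word, toy_stem(word), toy_lemmatize(word))
```

Irregular forms such as `mice` show the limits of suffix stripping: only the lexicon lookup can map them to `mouse`.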


#### PoS tagging

In [12]:
nltk.download('averaged_perceptron_tagger')
pos_tags = nltk.pos_tag(tokens)
for pos_tag in pos_tags:
    print(pos_tag)

('This', 'DT')
('shows', 'VBZ')
('NLTK', 'NNP')
("'s", 'POS')
('potentials', 'NNS')
('.', '.')


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/kmw/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
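
Since the tagger returns plain `(word, tag)` tuples, the result can be processed with ordinary `Python`. As a small sketch, the list below is copied from the output above (so the snippet runs without `NLTK`), and we pick out the nouns and count the tags.

```python
from collections import Counter

# (word, tag) pairs copied from the tagger output above
pos_tags = [
    ("This", "DT"), ("shows", "VBZ"), ("NLTK", "NNP"),
    ("'s", "POS"), ("potentials", "NNS"), (".", "."),
]

# all Penn Treebank noun tags start with "NN" (NN, NNS, NNP, NNPS)
nouns = [word for word, tag in pos_tags if tag.startswith("NN")]
tag_counts = Counter(tag for _, tag in pos_tags)

print(nouns)  # prints ['NLTK', 'potentials']
print(tag_counts.most_common())
```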


Now, what about the tagset?

In [14]:
import nltk
nltk.download('tagsets')
nltk.help.upenn_tagset('NNS')

NNS: noun, common, plural
    undergraduates scotches bric-a-brac products bodyguards facets coasts
    divestitures storehouses designs clubs fragrances averages
    subjectivists apprehensions muses factory-jobs ...


[nltk_data] Downloading package tagsets to /Users/kmw/nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


There is no German tagger available with NLTK. Let's try to train one.
First, we have to obtain a training corpus:

In [4]:
from urllib import request

url = 'http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/download/tigercorpus-2.2.conll09.tar.gz'
request.urlretrieve(url, 'tigercorpus-2.2.conll09.tar.gz')

('tigercorpus-2.2.conll09.tar.gz', <http.client.HTTPMessage at 0x10cd40828>)

Then, uncompress it:

In [6]:
import tarfile

# the context manager closes the archive even if extraction fails
with tarfile.open('tigercorpus-2.2.conll09.tar.gz', mode='r:gz') as tar:
    tar.extractall()

Read it with NLTK:

In [16]:
import nltk
corp = nltk.corpus.ConllCorpusReader('.', 'tiger_release_aug07.corrected.16012013.conll09',
                                     ['ignore', 'words', 'ignore', 'ignore', 'pos'],
                                     encoding='utf-8')
print(corp.tagged_sents()[0:1])

[[('``', '$('), ('Ross', 'NE'), ('Perot', 'NE'), ('wäre', 'VAFIN'), ('vielleicht', 'ADV'), ('ein', 'ART'), ('prächtiger', 'ADJA'), ('Diktator', 'NN'), ("''", '$(')]]
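
The column list passed to the reader says that the second column of each line holds the word and the fifth the PoS tag. To illustrate what that means, here is a hand-written sketch of such a column reader. The sample rows are simplified stand-ins (the ignored columns are filled with `_`), not actual lines from the TIGER file, and `read_tagged_sents` is a made-up helper rather than part of `NLTK`.

```python
# Simplified stand-in data: tab-separated columns, one token per line,
# sentences separated by an empty line.
SAMPLE = (
    "1\tRoss\t_\t_\tNE\n"
    "2\tPerot\t_\t_\tNE\n"
    "3\twäre\t_\t_\tVAFIN\n"
    "\n"
    "1\tAllerdings\t_\t_\tADV\n"
    "2\tglaubt\t_\t_\tVVFIN\n"
)


def read_tagged_sents(text, word_col=1, pos_col=4):
    """Collect (word, tag) pairs per sentence from column-formatted text."""
    sents, current = [], []
    for line in text.splitlines():
        if not line.strip():  # an empty line closes the current sentence
            if current:
                sents.append(current)
                current = []
            continue
        cols = line.split("\t")
        current.append((cols[word_col], cols[pos_col]))
    if current:  # the last sentence may lack a trailing empty line
        sents.append(current)
    return sents


print(read_tagged_sents(SAMPLE))
```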


Split into training and test:

In [17]:
tagged_sents = corp.tagged_sents()[0:1000]

# set a split size: hold out the first 10% for testing, train on the remaining 90%
split_perc = 0.1
split_size = int(len(tagged_sents) * split_perc)
train_sents, test_sents = tagged_sents[split_size:], tagged_sents[:split_size]
print(train_sents[0])

[('Allerdings', 'ADV'), ('glaubt', 'VVFIN'), ('fast', 'ADV'), ('die', 'ART'), ('Hälfte', 'NN'), ('der', 'ART'), ('Chief', 'FM'), ('Executives', 'FM'), (',', '$,'), ('daß', 'KOUS'), ('Perot', 'NE'), ('durchaus', 'ADV'), ('Chancen', 'NN'), ('habe', 'VAFIN'), (',', '$,'), ('die', 'ART'), ('Wahl', 'NN'), ('im', 'APPRART'), ('November', 'NN'), ('zu', 'PTKZU'), ('gewinnen', 'VVINF'), (',', '$,'), ('wenn', 'KOUS'), ('er', 'PPER'), ('denn', 'ADV'), ('kandidiert', 'VVFIN'), ('.', '$.')]
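
The split above simply holds out the first 10% of the sentences. If the corpus is ordered (for example by source or topic), a shuffled split may give a more representative test set. A minimal sketch, with the helper name `shuffled_split` and the fixed seed as assumptions; integers stand in for sentences here:

```python
import random


def shuffled_split(items, test_fraction=0.1, seed=42):
    """Shuffle a copy of items and return (train, test)."""
    items = list(items)        # materialize and copy, leave the input untouched
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(items)
    n_test = int(len(items) * test_fraction)
    return items[n_test:], items[:n_test]


data = list(range(1000))  # stand-in for 1000 tagged sentences
train, test = shuffled_split(data)
print(len(train), len(test))  # prints 900 100
```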


Train a tagger:

In [18]:
from nltk.tag import CRFTagger

ct = CRFTagger()
ct.train(train_sents,'model.crf.tagger')
ct.tag_sents([['Das','ist','schön', '.'], ['Funktioniert','es','?']])

[[('Das', 'NN'), ('ist', 'VAFIN'), ('schön', 'ADV'), ('.', '$.')],
 [('Funktioniert', 'VVFIN'), ('es', 'PPER'), ('?', '$.')]]

Evaluate it:

In [11]:
ct.evaluate(test_sents)

0.8876651982378855
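
The score is a token-level accuracy: the share of tokens whose predicted tag equals the gold tag. The sketch below computes it by hand. The predictions are taken from the tagger output above, whereas the gold tags (`PDS` for *Das*, `ADJD` for *schön*) are our own assumption for illustration, not corpus annotations.

```python
def token_accuracy(gold_sents, predicted_sents):
    """Share of tokens whose predicted tag matches the gold tag."""
    correct = total = 0
    for gold, pred in zip(gold_sents, predicted_sents):
        for (_, gold_tag), (_, pred_tag) in zip(gold, pred):
            correct += gold_tag == pred_tag
            total += 1
    return correct / total if total else 0.0


# assumed gold annotation (illustrative only)
gold = [
    [("Das", "PDS"), ("ist", "VAFIN"), ("schön", "ADJD"), (".", "$.")],
    [("Funktioniert", "VVFIN"), ("es", "PPER"), ("?", "$.")],
]
# predictions as printed by the trained tagger above
predicted = [
    [("Das", "NN"), ("ist", "VAFIN"), ("schön", "ADV"), (".", "$.")],
    [("Funktioniert", "VVFIN"), ("es", "PPER"), ("?", "$.")],
]

accuracy = token_accuracy(gold, predicted)
print(round(accuracy, 3))  # 5 of 7 tokens match
```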

#### Collocations

To identify collocations, we first need a corpus to work with. Luckily, `NLTK` provides a number of them. We choose the forefather of modern corpora, the Brown corpus.

In [19]:
import nltk
nltk.download('brown')
from nltk.corpus import brown

[nltk_data] Downloading package brown to /Users/kmw/nltk_data...
[nltk_data]   Package brown is already up-to-date!


We start by searching for words with the concordance function. For this, we create a `Text` object:

In [4]:
text = nltk.Text(brown.words(brown.fileids()[0:100]))
# concordance() prints its matches itself and returns None, so no print() is needed
text.concordance('mother', lines=30)
text.concordance('mother', lines=10, width=10)

Displaying 30 of 38 matches:
a , Mont. , have a new baby . Their mother is Mrs. Camilla Alsop Wendell . Mr.
n gown , and it's fitted , says her mother . Also , invitations have been addr
also has a number of parolees to `` mother '' , watching to see that they do n
 College in Wellesley , Mass. . Her mother is the former Miss Stella Hayward .
 in the `` maskers' dances '' . The mother of young queen , Mrs. G. Henry Pier
e fair . Police said the children's mother , Mrs. Eleanor Somerville , was vis
at her Portland home by her widowed mother , 80 , her maiden aunt , also 80 an
y of six surviving children , whose mother died yesterday as the aftermath to 
ay night at the flat of her widowed mother , Mrs. Mary Pankowski , in the adjo
rch , 31978 Mound , in Warren . The mother and daughter , who will be buried s
Kowalski girls present held for her mother , because the flat lacked electrici
, and we must stay together the way Mother wanted '' , Kowalski said in tellin
born in County Down , I
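
A concordance of this kind (keyword in context, KWIC) is easy to sketch by hand: find every position of the keyword and print a window of tokens around it. The helper `kwic` below is a made-up illustration, not `NLTK`'s implementation; the tokens are taken from two of the lines shown above.

```python
def kwic(tokens, keyword, window=4):
    """Return one formatted line per occurrence of keyword (case-insensitive)."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window): i])
            right = " ".join(tokens[i + 1: i + 1 + window])
            # right-align the left context so the keywords line up
            lines.append(f"{left:>30} [{token}] {right}")
    return lines


tokens = (
    "Their mother is Mrs. Camilla Alsop Wendell . "
    "Her mother is the former Miss Stella Hayward ."
).split()

for line in kwic(tokens, "mother"):
    print(line)
```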

Extracting collocations is easy!

1. Import:

In [20]:
from nltk.collocations import *

2. Create a measures object:

In [23]:
bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()

3. Create a finder object which does the hard work:

In [24]:
bigram_finder = BigramCollocationFinder.from_words(brown.words(brown.fileids()[0:10]))
trigram_finder = TrigramCollocationFinder.from_words(brown.words(brown.fileids()[0:10]))

4. Inspect the results:

In [26]:
print(bigram_finder.nbest(bigram_measures.pmi, 10))
print()
print(bigram_finder.nbest(bigram_measures.chi_sq, 10))
print()
print(bigram_finder.nbest(bigram_measures.likelihood_ratio, 10))
print()
print(bigram_finder.nbest(bigram_measures.student_t, 10))
print()
print(trigram_finder.nbest(trigram_measures.pmi, 10))
print()
print(trigram_finder.nbest(trigram_measures.chi_sq, 10))
print()
print(trigram_finder.nbest(trigram_measures.likelihood_ratio, 10))
print()
print(trigram_finder.nbest(trigram_measures.student_t, 10))

[('$115,000', 'annually'), ('$157,460', 'yearly'), ('100,000', 'recipients'), ('12,000', 'babies'), ('1311', 'acre'), ('1409', 'SW'), ('182', 'scholastics'), ('330', 'Woodland'), ('Abe', 'Stark'), ('Al', 'Ullman')]

[('$115,000', 'annually'), ('$157,460', 'yearly'), ('100,000', 'recipients'), ('12,000', 'babies'), ('1311', 'acre'), ('1409', 'SW'), ('182', 'scholastics'), ('330', 'Woodland'), ('Abe', 'Stark'), ('Al', 'Ullman')]

[('.', 'The'), ('of', 'the'), ('.', 'He'), ("''", '.'), ('has', 'been'), ('in', 'the'), ('United', 'States'), ('would', 'be'), ('will', 'be'), ('more', 'than')]

[('.', 'The'), ('of', 'the'), ('in', 'the'), ("''", '.'), ('.', 'He'), ('on', 'the'), ('.', '``'), ('for', 'the'), ("''", ','), ('has', 'been')]

[('12,000', 'babies', 'born'), ('1409', 'SW', 'Maplecrest'), ('Cape', 'Cod', 'writing'), ('Community', 'visiting', 'nurse'), ('Contempt', 'proceedings', 'originally'), ('Defends', 'Ike', 'Earlier'), ('Edgar', 'Hoover', 'presides'), ('Emerald', 'Empire', 'Kiwan
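
The PMI ranking is dominated by very rare word pairs. The reason becomes clear when computing the measure by hand: a pair of words that each occur only once, and only together, gets the highest possible score. The sketch below uses a common formulation, PMI(x, y) = log2(c(x, y) · N / (c(x) · c(y))), on made-up tokens; it is meant as an illustration and is not guaranteed to match `NLTK`'s implementation in every detail.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Map each adjacent word pair to its pointwise mutual information."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {
        (x, y): math.log2(count * n / (unigrams[x] * unigrams[y]))
        for (x, y), count in bigrams.items()
    }


tokens = ["a", "b", "a", "b", "c", "d"]  # made-up toy corpus
scores = bigram_pmi(tokens)

# ("c", "d") occurs once, ("a", "b") twice, yet the rare pair scores higher
for pair in [("c", "d"), ("a", "b")]:
    print(pair, round(scores[pair], 3))
```

This is why frequency filters are usually applied before ranking by PMI.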

It works similarly for tagged words:

In [30]:
nltk.download('universal_tagset')

bigram_finder = BigramCollocationFinder.from_words(brown.tagged_words(brown.fileids()[0:10], tagset='universal'))

print(bigram_finder.nbest(bigram_measures.pmi, 10))

bigram_finder = BigramCollocationFinder.from_words(t for w, t in brown.tagged_words(brown.fileids()[0:10], tagset='universal'))

print(bigram_finder.nbest(bigram_measures.pmi, 10))

[nltk_data] Downloading package universal_tagset to
[nltk_data]     /Users/kmw/nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!


[(('$115,000', 'NOUN'), ('annually', 'ADV')), (('$157,460', 'NOUN'), ('yearly', 'ADV')), (('100,000', 'NUM'), ('recipients', 'NOUN')), (('12,000', 'NUM'), ('babies', 'NOUN')), (('1311', 'NUM'), ('acre', 'NOUN')), (('1409', 'NUM'), ('SW', 'NOUN')), (('182', 'NUM'), ('scholastics', 'NOUN')), (('330', 'NUM'), ('Woodland', 'NOUN')), (('3646', 'NUM'), ('N.', 'ADJ')), (('Abe', 'NOUN'), ('Stark', 'NOUN'))]
[('X', 'X'), ('PRON', 'VERB'), ('PRT', 'VERB'), ('.', 'PRON'), ('ADP', 'DET'), ('DET', 'ADJ'), ('ADV', 'ADV'), ('ADP', 'NUM'), ('ADJ', 'X'), ('VERB', 'ADV')]


It is possible to apply filters to the finder objects:

In [50]:
bigram_finder = BigramCollocationFinder.from_words(brown.words(brown.fileids()[0:10]))
print(bigram_finder.nbest(bigram_measures.pmi, 10))
print()
bigram_finder.apply_freq_filter(3)
print(bigram_finder.nbest(bigram_measures.pmi, 10))
print()

from nltk.corpus import stopwords
# build the stop word set once instead of reloading the list for every word
stops = set(stopwords.words('english'))
bigram_finder.apply_word_filter(lambda w: (not w.isalpha()) or w.lower() in stops)
print(bigram_finder.nbest(bigram_measures.likelihood_ratio, 10))
print()
bigram_finder.apply_ngram_filter(lambda *w: 'United' not in w)
print(bigram_finder.nbest(bigram_measures.pmi, 10))

[('$115,000', 'annually'), ('$157,460', 'yearly'), ('100,000', 'recipients'), ('12,000', 'babies'), ('1311', 'acre'), ('1409', 'SW'), ('182', 'scholastics'), ('330', 'Woodland'), ('Abe', 'Stark'), ('Al', 'Ullman')]

[('Latin', 'America'), ('Feb.', '9'), ('U.', 'S.'), ('railroad', 'retirement'), ('rescue', 'trucks'), ('semester', 'hours'), ('Central', 'Falls'), ('Citizens', 'Group'), ('Eastwick', 'Corp.'), ('Pathet', 'Lao')]

[('United', 'States'), ('per', 'cent'), ('White', 'House'), ('social', 'security'), ('civil', 'defense'), ('last', 'year'), ('Rhode', 'Island'), ('New', 'Jersey'), ('home', 'rule'), ('Soviet', 'Union')]

[('United', 'Nations'), ('United', 'States')]
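
Conceptually, the filters simply discard candidate bigrams before scoring: the frequency filter drops pairs seen fewer than a minimum number of times, and the word filter drops pairs containing an unwanted word. A plain-`Python` sketch of the same idea follows; the mini stop list, the example sentence and the helper name `filtered_bigrams` are made up for illustration.

```python
from collections import Counter

STOPWORDS = {"the", "of", "in", "a", "and"}  # hypothetical mini stop list


def filtered_bigrams(tokens, min_freq=2, stopwords=STOPWORDS):
    """Count adjacent pairs, then drop rare pairs and pairs with unwanted words."""
    counts = Counter(zip(tokens, tokens[1:]))

    def keep(word):
        # keep alphabetic words that are not on the stop list
        return word.isalpha() and word.lower() not in stopwords

    return {
        bigram: count
        for bigram, count in counts.items()
        if count >= min_freq and all(keep(word) for word in bigram)
    }


tokens = "the United States and the United Nations met in the United States .".split()
print(filtered_bigrams(tokens))  # prints {('United', 'States'): 2}
```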


Practical session!
1. Load your own corpus (e.g. [Gutenberg](http://web.eecs.umich.edu/~lahiri/gutenberg_dataset.html)).
2. Annotate it.
3. Determine some collocations (using lemmata and maybe even dependencies).

In [45]:
import os
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

corpus_dir = os.path.abspath('corpus/')
# raw string, so that `\.` reaches the regex engine as an escaped dot
corpus = PlaintextCorpusReader(corpus_dir, r'.*Abraham.*\.txt')
print(corpus.sents(corpus.fileids()[0:1]))

[['LINCOLN', 'LETTERS'], ['By', 'Abraham', 'Lincoln'], ...]


In [39]:
from nltk.corpus import brown
nltk.download('stopwords')
text = nltk.Text(brown.words(brown.fileids()[0:100]))
# collocations() and concordance() print their results themselves and return None
text.collocations()
text.concordance('York', lines=30)

[nltk_data] Downloading package stopwords to /Users/kmw/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


United States; New York; per cent; Los Angeles; last year; White
House; years ago; San Francisco; United Nations; last night; World
War; President Kennedy; last week; home runs; Mr. Kennedy; St. Louis;
General Assembly; East Greenwich; Viet Nam; New Orleans
Displaying 30 of 108 matches:
 ADC program in Cook county by a New York City welfare consulting firm , liste
 of surplus funds of the Port of New York Authority , and making New Jersey at
tions , published in yesterday's New York Times . The Mayor said : `` It didn'
 of the American League champion New York Yankees , who come in here tomorrow 
two-game weekend series with the New York Yankees . Skinny Brown and Hoyt Wilh
helm , plans to bring the entire New York squad here from St. Petersburg , inc
rld record earlier this month in New York with a clocking of 1.09.3 , wiped ou
March 17 ( AP ) -- Two errors by New York Yankee shortstop Tony Kubek in the e
ti , Ohio ( AP ) -- The powerful New York Yankees won their 19th world seri

### `spaCy`

Install `spaCy` models via
```shell
python -m spacy download en
python -m spacy download de
```

In [18]:
import spacy
from nltk import Tree
en_nlp = spacy.load('en')
doc = en_nlp("The quick brown fox jumps over the lazy dog.")
for sent in doc.sents:
    for token in sent:
        print(token.orth_, token.pos_, token.tag_, token.dep_)
    print()

de_nlp = spacy.load('de')
# German text has to be processed with the German pipeline
doc = de_nlp("Des kleinen Mannes größte Freude ist sein Farbfernseher. Jeder Abschied ist ein kleiner Tod. Ein Mann tanzt mit einer Frau.")
for sent in doc.sents:
    for token in sent:
        print(token.orth_, token.pos_, token.tag_, token.dep_)
    print()

The DET DT det
quick ADJ JJ amod
brown ADJ JJ amod
fox NOUN NN nsubj
jumps VERB VBZ ROOT
over ADP IN prep
the DET DT det
lazy ADJ JJ amod
dog NOUN NN pobj
. PUNCT . punct
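
The four printed columns (form, coarse PoS, fine-grained tag, dependency label) can be put to use directly. As a small sketch, the rows for the English sentence are copied from the output above, so the snippet runs without `spaCy`, and we look up the root and its subject and group the words by dependency label.

```python
from collections import defaultdict

# (form, pos, tag, dep) rows copied from the English output above
rows = [
    ("The", "DET", "DT", "det"), ("quick", "ADJ", "JJ", "amod"),
    ("brown", "ADJ", "JJ", "amod"), ("fox", "NOUN", "NN", "nsubj"),
    ("jumps", "VERB", "VBZ", "ROOT"), ("over", "ADP", "IN", "prep"),
    ("the", "DET", "DT", "det"), ("lazy", "ADJ", "JJ", "amod"),
    ("dog", "NOUN", "NN", "pobj"), (".", "PUNCT", ".", "punct"),
]

# the first token carrying the respective dependency label
root = next(form for form, _, _, dep in rows if dep == "ROOT")
subject = next(form for form, _, _, dep in rows if dep == "nsubj")

by_dep = defaultdict(list)
for form, _, _, dep in rows:
    by_dep[dep].append(form)

print(subject, root)   # prints fox jumps
print(by_dep["amod"])  # prints ['quick', 'brown', 'lazy']
```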

