# Linguistic annotation with `Python`

`Python` is a highly versatile programming language with a large number of
libraries that support your work as a digital lexicographer.

This notebook illustrates the different levels of automatic linguistic
annotation used in the course.

## Libraries used in the course

We use the [Natural Language Toolkit](https://www.nltk.org/) (NLTK), which comes
with a large set of tools and resources. In addition, we use
[spaCy](https://spacy.io/). It offers a smaller range of functionality but is a lot
faster and uses state-of-the-art algorithms (namely deep learning approaches).

## Setup

We assume that you have a working `Python3` installation. The following instructions
are tailored to Linux and macOS but should, with minor modifications, work on
Windows as well.

### `pip`

`pip` is the package manager for `Python`. From version 3.4 on, it ships with `Python`. 

### `virtualenv`

`virtualenv` allows you to set up local (and clean) `Python` environments. It may be
installed via
```sh
[sudo] pip install virtualenv
```

Create a virtual environment in a subdirectory of your choice (e.g. `env`) using
```sh
virtualenv -p python3 env
```

and activate it:
```sh
. env/bin/activate
```

### `NLTK` and `spaCy`

Third-party `Python` packages (including `NLTK` and `spaCy`) are best installed using `pip`:
```sh
(env) pip install -r requirements.txt
```

## Testing

Now, we are ready to roll. Start `Python`:
```sh
(env) python
```

### `NLTK`

`NLTK` itself provides a high-level API to numerous NLP tools. Before we can use them, the
models and data they rely on have to be downloaded.

#### Tokenization

In [2]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/kmw/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

We are now ready to do some work:

In [4]:
sentence = "This shows NLTK's potentials."
tokens = nltk.word_tokenize(sentence)
print(tokens)

['This', 'shows', 'NLTK', "'s", 'potentials', '.']
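To get a feel for what the tokenizer does, here is a minimal sketch that uses only the standard library's `re` module. It is a rough approximation, assuming that splitting off the clitic `'s` and single punctuation marks is enough. It is not the algorithm NLTK uses, and `simple_tokenize` is a hypothetical helper name.

```python
import re

# Hypothetical helper: a crude regex tokenizer, NOT NLTK's algorithm.
# Alternatives are tried left to right: the clitic 's, then a run of
# word characters, then any single character that is neither a word
# character nor whitespace.
TOKEN_PATTERN = re.compile(r"'s|\w+|[^\w\s]")

def simple_tokenize(text):
    """Split text into word, clitic and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

print(simple_tokenize("This shows NLTK's potentials."))
```

For the example sentence this yields the same six tokens as above. On real text it will quickly fall short (abbreviations, other contractions, numbers with separators), which is why a dedicated tokenizer is worth using.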


#### Morphological analysis

Let's go for stemming!

In [5]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
for token in tokens:
    print(stemmer.stem(token))

this
show
nltk
's
potenti
.


Not very impressive? Let's try lemmatization instead:

In [7]:
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
for token in tokens:
    print(lemmatizer.lemmatize(token))

[nltk_data] Downloading package wordnet to /Users/kmw/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


This
show
NLTK
's
potential
.
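The difference between the two outputs comes from the approach: a stemmer cuts off suffixes by rule, whereas a lemmatizer looks words up in a lexical resource. The toy sketch below illustrates that contrast with the standard library only. The suffix list, the minimum stem length and the tiny lookup table are assumptions chosen for this one sentence, and `toy_stem` and `toy_lemmatize` are hypothetical names. Neither reflects what the Snowball stemmer or the WordNet lemmatizer do internally.

```python
# Toy illustration only: hand-picked rules and entries for this sentence.
SUFFIXES = ("s", "al")          # stripped in this order, at most once each
MIN_STEM_LENGTH = 4             # do not strip if the rest would be shorter
LEMMAS = {"shows": "show", "potentials": "potential"}

def toy_stem(word):
    """Rule-based: lowercase, then strip known suffixes without a dictionary."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            word = word[:-len(suffix)]
    return word

def toy_lemmatize(word):
    """Lookup-based: return the listed lemma, or the word itself if unknown."""
    return LEMMAS.get(word, word)

tokens = ['This', 'shows', 'NLTK', "'s", 'potentials', '.']
for token in tokens:
    print(token, toy_stem(token), toy_lemmatize(token))
```

The rule-based function produces the non-word `potenti`, while the lookup returns the dictionary form `potential` and leaves unknown tokens untouched.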


#### PoS tagging

In [12]:
nltk.download('averaged_perceptron_tagger')
pos_tags = nltk.pos_tag(tokens)
for pos_tag in pos_tags:
    print(pos_tag)

('This', 'DT')
('shows', 'VBZ')
('NLTK', 'NNP')
("'s", 'POS')
('potentials', 'NNS')
('.', '.')


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/kmw/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


Now, what about the tagset?

In [13]:
nltk.download('tagsets')
nltk.help.upenn_tagset('NNS')

[nltk_data] Downloading package tagsets to /Users/kmw/nltk_data...


NNS: noun, common, plural
    undergraduates scotches bric-a-brac products bodyguards facets coasts
    divestitures storehouses designs clubs fragrances averages
    subjectivists apprehensions muses factory-jobs ...


[nltk_data]   Unzipping help/tagsets.zip.
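The tagger output is a plain list of `(token, tag)` tuples, so it can be post-processed with ordinary Python. The sketch below groups the tokens by tag and picks out the nouns. The list is copied from the output above, and `group_by_tag` is a hypothetical helper name.

```python
from collections import defaultdict

# Tagger output copied from the cell above.
pos_tags = [('This', 'DT'), ('shows', 'VBZ'), ('NLTK', 'NNP'),
            ("'s", 'POS'), ('potentials', 'NNS'), ('.', '.')]

def group_by_tag(tagged_tokens):
    """Map each tag to the list of tokens that carry it."""
    grouped = defaultdict(list)
    for token, tag in tagged_tokens:
        grouped[tag].append(token)
    return dict(grouped)

by_tag = group_by_tag(pos_tags)
# In the Penn Treebank tagset all noun tags (NN, NNS, NNP, NNPS) start with 'NN'.
nouns = [token for token, tag in pos_tags if tag.startswith('NN')]

print(by_tag)
print(nouns)
```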


NLTK does not ship with a German tagger, so let's try to train one.
First, we have to obtain a training corpus:

In [4]:
url = 'http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/download/tigercorpus-2.2.conll09.tar.gz'
from urllib import request
request.urlretrieve(url, 'tigercorpus-2.2.conll09.tar.gz')

('tigercorpus-2.2.conll09.tar.gz', <http.client.HTTPMessage at 0x10cd40828>)

Then, uncompress it:

In [6]:
import tarfile
# extract the archive into the current directory;
# the context manager closes the file afterwards
with tarfile.open('tigercorpus-2.2.conll09.tar.gz', mode='r:gz') as tar:
    tar.extractall()

Read it with NLTK:

In [2]:
import nltk
corp = nltk.corpus.ConllCorpusReader('.', 'tiger_release_aug07.corrected.16012013.conll09',
                                     ['ignore', 'words', 'ignore', 'ignore', 'pos'],
                                     encoding='utf-8')
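The column list tells the reader which role each column of the file plays: here the second column is the word form, the fifth the PoS tag, and the rest is ignored. The sketch below mimics this in plain Python. The sample rows are made up for illustration (`_` stands in for values not shown in this notebook), the blank-line sentence separator is an assumption about the CoNLL layout, and `read_tagged_sents` is a hypothetical helper, not part of NLTK.

```python
# Column roles as passed to ConllCorpusReader above.
COLUMN_TYPES = ['ignore', 'words', 'ignore', 'ignore', 'pos']

# Made-up sample in a CoNLL-like layout: one token per line,
# whitespace-separated columns, blank line between sentences.
SAMPLE = (
    "1\tDer\t_\t_\tART\n"
    "2\tPolitologe\t_\t_\tNN\n"
    "\n"
    "1\tDas\t_\t_\tPDS\n"
)

def read_tagged_sents(text, column_types):
    """Return a list of sentences, each a list of (word, tag) tuples."""
    word_col = column_types.index('words')
    pos_col = column_types.index('pos')
    sents, current = [], []
    for line in text.splitlines():
        if not line.strip():            # a blank line ends the sentence
            if current:
                sents.append(current)
                current = []
            continue
        cols = line.split()
        current.append((cols[word_col], cols[pos_col]))
    if current:                         # last sentence without trailing blank line
        sents.append(current)
    return sents

print(read_tagged_sents(SAMPLE, COLUMN_TYPES))
```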

Split into training and test:

In [9]:
# use only the first 1000 sentences of the corpus
tagged_sents = corp.tagged_sents()[0:1000]

# hold out the first 10% for testing, use the remaining 90% for training
split_perc = 0.1
split_size = int(len(tagged_sents) * split_perc)
train_sents, test_sents = tagged_sents[split_size:], tagged_sents[:split_size]
print(train_sents[0])

[('Der', 'ART'), ('Politologe', 'NN'), ('Hans-Gerd', 'NE'), ('Jaschke', 'NE'), ('über', 'APPR'), ('den', 'ART'), ('hilflosen', 'ADJA'), ('Umgang', 'NN'), ('mit', 'APPR'), ('Rechtsradikalen', 'NN')]
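The split above takes the first 10% of the sentences as test data, so the result depends on the order of the corpus. A common alternative is to shuffle before splitting. The following is a minimal sketch, assuming a fixed seed is wanted for reproducibility. `shuffled_split`, its parameters and the placeholder data are hypothetical, and using it would change the scores reported in this notebook.

```python
import random

def shuffled_split(items, test_fraction=0.1, seed=42):
    """Shuffle a copy of items and split it into (train, test)."""
    items = list(items)                      # copy, leave the input untouched
    random.Random(seed).shuffle(items)       # seeded for reproducibility
    split_size = int(len(items) * test_fraction)
    return items[split_size:], items[:split_size]

# Placeholder data standing in for the 1000 tagged sentences.
data = list(range(1000))
train, test = shuffled_split(data)
print(len(train), len(test))
```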


Train a tagger:

In [10]:
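from nltk.tag import CRFTagger

# CRFTagger requires the python-crfsuite package
ct = CRFTagger()
# train on the training split and store the model in 'model.crf.tagger'
ct.train(train_sents, 'model.crf.tagger')
ct.tag_sents([['Das', 'ist', 'schön', '.'], ['Funktioniert', 'es', '?']])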

[[('Das', 'PDS'), ('ist', 'VAFIN'), ('schön', 'PTKVZ'), ('.', '$.')],
 [('Funktioniert', 'VVFIN'), ('es', 'PPER'), ('?', '$.')]]

Evaluate it:

In [11]:
ct.evaluate(test_sents)

0.8876651982378855
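A score like this can be read as the share of test tokens that receive the same tag as in the gold standard. The sketch below spells that out, assuming plain token-level accuracy. `token_accuracy` is a hypothetical helper, the predicted tags are copied from the tagger output above, and the gold tags are hand-written for illustration (with `ADJD` for the predicative adjective `schön`).

```python
def token_accuracy(gold_sents, predicted_sents):
    """Share of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for gold, predicted in zip(gold_sents, predicted_sents):
        for (_, gold_tag), (_, predicted_tag) in zip(gold, predicted):
            total += 1
            correct += gold_tag == predicted_tag
    return correct / total if total else 0.0

# Tagger output copied from the cell above.
predicted = [[('Das', 'PDS'), ('ist', 'VAFIN'), ('schön', 'PTKVZ'), ('.', '$.')],
             [('Funktioniert', 'VVFIN'), ('es', 'PPER'), ('?', '$.')]]
# Hand-written gold tags for illustration only.
gold = [[('Das', 'PDS'), ('ist', 'VAFIN'), ('schön', 'ADJD'), ('.', '$.')],
        [('Funktioniert', 'VVFIN'), ('es', 'PPER'), ('?', '$.')]]

accuracy = token_accuracy(gold, predicted)
print(accuracy)  # 6 of 7 tokens match
```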

#### Collocations

In [31]:
import nltk
nltk.download('brown')
from nltk.corpus import brown
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

bigram_measures = BigramAssocMeasures()
# collect bigram statistics from the first ten files of the Brown corpus
finder = BigramCollocationFinder.from_words(brown.words(brown.fileids()[0:10]))
# print the ten best-scoring bigrams for each association measure
print(finder.nbest(bigram_measures.pmi, 10))
print(finder.nbest(bigram_measures.chi_sq, 10))
print(finder.nbest(bigram_measures.likelihood_ratio, 10))
print(finder.nbest(bigram_measures.student_t, 10))

[nltk_data] Downloading package brown to /Users/kmw/nltk_data...
[nltk_data]   Package brown is already up-to-date!


[('$115,000', 'annually'), ('$157,460', 'yearly'), ('100,000', 'recipients'), ('12,000', 'babies'), ('1311', 'acre'), ('1409', 'SW'), ('182', 'scholastics'), ('330', 'Woodland'), ('Abe', 'Stark'), ('Al', 'Ullman')]
[('$115,000', 'annually'), ('$157,460', 'yearly'), ('100,000', 'recipients'), ('12,000', 'babies'), ('1311', 'acre'), ('1409', 'SW'), ('182', 'scholastics'), ('330', 'Woodland'), ('Abe', 'Stark'), ('Al', 'Ullman')]
[('.', 'The'), ('of', 'the'), ('.', 'He'), ("''", '.'), ('has', 'been'), ('in', 'the'), ('United', 'States'), ('would', 'be'), ('will', 'be'), ('more', 'than')]
[('.', 'The'), ('of', 'the'), ('in', 'the'), ("''", '.'), ('.', 'He'), ('on', 'the'), ('.', '``'), ('for', 'the'), ("''", ','), ('has', 'been')]
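The first two rankings are filled with pairs of rare words, while the last two favour frequent function-word pairs. For pointwise mutual information (PMI) this follows from its definition: it compares how often a pair occurs with how often it would be expected if both words were independent, so a pair of words that occur only together scores very high even if seen once. The sketch below computes a simplified PMI from raw counts. `pmi_scores`, its `min_count` parameter and the toy token list are hypothetical, and NLTK's implementation may differ in detail.

```python
import math
from collections import Counter

def pmi_scores(tokens, min_count=1):
    """Score each adjacent word pair with pointwise mutual information."""
    n = len(tokens)
    word_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), count in bigram_counts.items():
        if count < min_count:               # optional frequency threshold
            continue
        # log2 of the observed pair probability divided by the probability
        # expected if both words occurred independently
        scores[(w1, w2)] = math.log2(count * n / (word_counts[w1] * word_counts[w2]))
    return scores

# Toy data: 'new york' occurs twice, 'rare gem' only once.
tokens = ['new', 'york', 'is', 'big', 'and', 'new', 'york', 'is',
          'old', 'and', 'rare', 'gem']

scores = pmi_scores(tokens)
print(max(scores, key=scores.get))          # best pair without a threshold
filtered = pmi_scores(tokens, min_count=2)
print(sorted(filtered))                     # pairs seen at least twice
```

Without a threshold the single occurrence of `rare gem` outranks the repeated `new york`. With a minimum count of 2 only the repeated pairs remain. NLTK's finder offers `apply_freq_filter` for the same purpose.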


In [27]:
from nltk.corpus import brown
for sent in brown.sents(brown.fileids()[0:1]):
    pos_tags = nltk.pos_tag(sent)
    print(pos_tags)

[('The', 'DT'), ('Fulton', 'NNP'), ('County', 'NNP'), ('Grand', 'NNP'), ('Jury', 'NNP'), ('said', 'VBD'), ('Friday', 'NNP'), ('an', 'DT'), ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NNP'), ('recent', 'JJ'), ('primary', 'JJ'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'), ('no', 'DT'), ('evidence', 'NN'), ("''", "''"), ('that', 'IN'), ('any', 'DT'), ('irregularities', 'NNS'), ('took', 'VBD'), ('place', 'NN'), ('.', '.')]
[('The', 'DT'), ('jury', 'NN'), ('further', 'RB'), ('said', 'VBD'), ('in', 'IN'), ('term-end', 'JJ'), ('presentments', 'NNS'), ('that', 'IN'), ('the', 'DT'), ('City', 'NNP'), ('Executive', 'NNP'), ('Committee', 'NNP'), (',', ','), ('which', 'WDT'), ('had', 'VBD'), ('over-all', 'JJ'), ('charge', 'NN'), ('of', 'IN'), ('the', 'DT'), ('election', 'NN'), (',', ','), ('``', '``'), ('deserves', 'VBZ'), ('the', 'DT'), ('praise', 'NN'), ('and', 'CC'), ('thanks', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('City', 'NNP'), ('of', 'IN'), ('Atlanta', 'NNP'), ("''", "''

In [39]:
from nltk.corpus import brown
nltk.download('stopwords')
text = nltk.Text(brown.words(brown.fileids()[0:100]))
# collocations() and concordance() print their results themselves and return None
text.collocations()
text.concordance('York', lines=30)

[nltk_data] Downloading package stopwords to /Users/kmw/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


United States; New York; per cent; Los Angeles; last year; White
House; years ago; San Francisco; United Nations; last night; World
War; President Kennedy; last week; home runs; Mr. Kennedy; St. Louis;
General Assembly; East Greenwich; Viet Nam; New Orleans
Displaying 30 of 108 matches:
 ADC program in Cook county by a New York City welfare consulting firm , liste
 of surplus funds of the Port of New York Authority , and making New Jersey at
tions , published in yesterday's New York Times . The Mayor said : `` It didn'
 of the American League champion New York Yankees , who come in here tomorrow 
two-game weekend series with the New York Yankees . Skinny Brown and Hoyt Wilh
helm , plans to bring the entire New York squad here from St. Petersburg , inc
rld record earlier this month in New York with a clocking of 1.09.3 , wiped ou
March 17 ( AP ) -- Two errors by New York Yankee shortstop Tony Kubek in the e
ti , Ohio ( AP ) -- The powerful New York Yankees won their 19th world seri