# Linguistic annotation with `Python`

`Python` is a highly versatile programming language that offers a great number of
libraries to support your work as a digital lexicographer.

This notebook illustrates the different levels of automatic linguistic
annotation used in the course.

## Libraries used in the course

We use the wonderful [Natural Language Toolkit](https://www.nltk.org/) (`NLTK`), which comes
with a great set of tools and resources. In addition, we use [spaCy](https://spacy.io/). It
offers a smaller range of functionality but is a lot faster and uses state-of-the-art
algorithms (namely deep learning approaches).

## Setup

We assume that you have a working `Python 3` installation. The following instructions
are tailored to Linux and macOS but should, with minor modifications, work on
Windows as well.

### `pip`

`pip` is the package manager for `Python`. From version 3.4 on, it ships with `Python`. 

### `virtualenv`

`virtualenv` allows you to set up local (and clean) `Python` environments. It may be
installed via
```sh
[sudo] pip install virtualenv
```

Create a virtual environment in a subdirectory of your choice (e.g. `env`) using
```sh
virtualenv -p python3 env
```

and activate it.
```sh
. env/bin/activate
```
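
To confirm that the activation worked, you can ask the interpreter itself. The following is a minimal sketch; the helper name `in_virtual_environment` is our own and not part of any library. It relies on the fact that the standard `venv` module and recent `virtualenv` releases make `sys.prefix` differ from `sys.base_prefix`, while older `virtualenv` releases set `sys.real_prefix` instead.

```python
import sys


def in_virtual_environment():
    """Return True if the running interpreter belongs to a virtual environment."""
    # Older virtualenv releases mark the environment with sys.real_prefix.
    if hasattr(sys, "real_prefix"):
        return True
    # venv and recent virtualenv releases keep the original installation
    # in sys.base_prefix and point sys.prefix at the environment.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print(in_virtual_environment())
```

If this prints `False` after activation, the `python` on your path is probably not the one inside `env`.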

### `NLTK` and `spaCy`

Third-party `Python` packages (including `NLTK` and `spaCy`) are best installed using `pip`:
```sh
(env) pip install -r requirements.txt
```
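
A quick way to check that the installation succeeded is to query the installed versions. This sketch assumes that `requirements.txt` lists the distributions `nltk` and `spacy`, as the sentence above suggests, and that you run Python 3.8 or later (where `importlib.metadata` is part of the standard library). The helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Distribution names are assumed from the packages named in this notebook.
for name in ("nltk", "spacy"):
    print(name, installed_version(name) or "not installed")
```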

## Testing

Now, we are ready to roll. Start `Python`:
```sh
(env) python
```

### `NLTK`

`NLTK` itself provides a high-level API to numerous NLP tools. Some of them rely on data
packages (such as the `punkt` tokenizer models) that have to be downloaded before first use.

```python
import nltk
nltk.download('punkt')
```

```
[nltk_data] Downloading package punkt to /Users/kmw/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
True
```
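
If you rerun the notebook often, you may prefer to download a data package only when it is missing. The sketch below is one way to do that; `ensure_resource` is a hypothetical helper of ours, not an `NLTK` function. It uses `nltk.data.find`, which raises a `LookupError` for resources that are not installed, and falls back to `nltk.download`.

```python
def ensure_resource(resource_path, package_id):
    """Download an NLTK data package only if its resource cannot be found."""
    # Imported here so the helper can be defined before NLTK is installed.
    import nltk

    try:
        # nltk.data.find raises LookupError if the resource is missing.
        nltk.data.find(resource_path)
        return True
    except LookupError:
        # nltk.download returns True on success, False otherwise.
        return nltk.download(package_id, quiet=True)
```

For the package used here, the call would be `ensure_resource('tokenizers/punkt', 'punkt')`. Depending on your `NLTK` version, the tokenizer may expect a differently named package (recent releases reportedly use `punkt_tab`); if `word_tokenize` raises a `LookupError`, its message names the package to download.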

We are now ready to do some work:

```python
sentence = "Das z.B. ist ein Testsatz."
tokens = nltk.word_tokenize(sentence)
print(tokens)
```

```
['Das', 'z.B', '.', 'ist', 'ein', 'Testsatz', '.']
```
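
Note that the abbreviation `z.B.` was split into `z.B` and `.` in the output above. A likely reason is that `nltk.word_tokenize` uses the English models unless told otherwise: its `language` parameter defaults to `'english'`. The following sketch passes `language='german'` instead; the wrapper name `tokenize_german` is our own, and we make no claim about the exact tokens it returns, since that depends on the installed models.

```python
def tokenize_german(text):
    """Tokenize text with NLTK's German sentence model instead of the English default."""
    # Imported here so the helper can be defined before NLTK is installed.
    import nltk

    # `language` selects the Punkt sentence model used before word splitting.
    return nltk.word_tokenize(text, language="german")
```

Calling `tokenize_german("Das z.B. ist ein Testsatz.")` and comparing the result with the English-model output is a quick way to see how much the choice of model matters.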
