# ECL I: Exercise 3: Training a simple TnT-like tagger
In the lecture, we named adjustability and reusability as criteria for a good tagger.
What exactly does that mean? A tagger such as the TnT tagger can be trained for any language and any tagset. All it needs is a corpus annotated with POS tags.

Note: All lines that start with `!` are UNIX shell commands executed from this notebook.
You can also run them directly in a terminal by omitting the `!`.


## (1) Go to the right working directory
Make sure you start in the directory that contains your corpus, the Python program `tagger.py`, and the evaluation script `tag-eval.perl`.
To navigate your system, use `pwd` (print working directory), `cd` (change directory), and `ls` (list the contents of the current directory).
In a notebook, you can use the magic command `%cd` to change the current directory for all subsequent commands. After the following command, we need to be in the `ex03` directory.

In [None]:
%cd ex03

In [None]:
! ls -l

## (2) 9/10 of the corpus will be our training data

In [None]:
!head -n 277313 ud-de-v2.tts > training.tts

## (3) Look at training data
Which UPOS tags do the tokens containing the string "bestimmt" receive?

In [None]:
!grep -C 3 bestimmt training.tts
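
The `grep` output shows the matches in context, but you still have to tally the tags by eye. As a minimal sketch, the same question can be answered with a small count in Python. It assumes that each line of the `.tts` file holds a token and its tag separated by a tab, and that blank lines can be skipped. The function name `tags_for_substring` and the sample lines are made up for illustration and are not taken from the corpus.

```python
from collections import Counter, defaultdict


def tags_for_substring(lines, needle):
    """Count the tags of every token that contains `needle`."""
    counts = defaultdict(Counter)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:  # skip blank or malformed lines
            continue
        token, tag = fields[0], fields[1]
        if needle in token:
            counts[token][tag] += 1
    return counts


# Invented sample lines in the assumed token<TAB>tag format.
sample = [
    "Das\tPRON\n",
    "ist\tAUX\n",
    "bestimmt\tADV\n",
    "richtig\tADJ\n",
    "\n",
    "Er\tPRON\n",
    "bestimmt\tVERB\n",
    "die\tDET\n",
    "bestimmte\tADJ\n",
    "Regel\tNOUN\n",
]

for token, tag_counts in sorted(tags_for_substring(sample, "bestimmt").items()):
    print(token, dict(tag_counts))
```

To run it on the real data, pass an open file instead of the sample, for example `tags_for_substring(open("training.tts", encoding="utf-8"), "bestimmt")`.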

## (4) 1/10 of our corpus will be our test data
Copy the last 30812 lines into `test.tts`.

In [None]:
! tail -n 30812 ud-de-v2.tts > test.tts

## (5) Create an evaluation corpus by removing the tags from our test corpus


In [None]:
! cut -f 1 test.tts > test.txt

## (6) Training and tagging
Train the tagger with `training.tts`. Tag the evaluation file and write the results to `result.tts`. This can take a few minutes, so be patient. Some status information will appear on standard error.


In [None]:
! python3 tagger.py training.tts < test.txt > result.tts

## (7) Evaluation
Evaluate the result with a confusion matrix by comparing our tagger output `result.tts` with the test corpus `test.tts`.

In [None]:
! perl tag-eval.perl -k test.tts -e result.tts 
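
To see what such a comparison involves, here is a rough Python sketch that computes the accuracy and counts the confused tag pairs. It is not a reimplementation of `tag-eval.perl`. It assumes that both files use the token-tab-tag format and contain the same tokens in the same order. The names `read_pairs` and `evaluate`, as well as the sample data, are hypothetical.

```python
from collections import Counter


def read_pairs(lines):
    """Return (token, tag) pairs, skipping blank or malformed lines."""
    pairs = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            pairs.append((fields[0], fields[1]))
    return pairs


def evaluate(gold_lines, predicted_lines):
    """Return (accuracy, Counter of (gold_tag, predicted_tag) errors)."""
    gold = read_pairs(gold_lines)
    predicted = read_pairs(predicted_lines)
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted files are not aligned")
    correct = 0
    confusion = Counter()
    for (_, gold_tag), (_, predicted_tag) in zip(gold, predicted):
        if gold_tag == predicted_tag:
            correct += 1
        else:
            confusion[(gold_tag, predicted_tag)] += 1
    accuracy = correct / len(gold) if gold else 0.0
    return accuracy, confusion


# Invented sample data: one of four tags is wrong.
gold_sample = ["Er\tPRON\n", "bestimmt\tVERB\n", "die\tDET\n", "Regel\tNOUN\n"]
predicted_sample = ["Er\tPRON\n", "bestimmt\tADV\n", "die\tDET\n", "Regel\tNOUN\n"]

accuracy, confusion = evaluate(gold_sample, predicted_sample)
print(f"accuracy: {accuracy:.2%}")
for (gold_tag, predicted_tag), n in confusion.most_common():
    print(f"{gold_tag} tagged as {predicted_tag}: {n}")
```

For the files from this exercise, you would call `evaluate(open("test.tts", encoding="utf-8"), open("result.tts", encoding="utf-8"))`.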

Our simple NLTK-based POS tagger does not guess tags for unknown words. See for yourself:

In [None]:
!grep -C 3 Unk result.tts
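
To get a number instead of a list of matches, the following sketch counts how many tokens received the unknown-word tag. It assumes that this tag is exactly `Unk`, as the `grep` above suggests, and that `result.tts` uses the token-tab-tag format. The function name `unknown_tokens` and the sample lines are made up for illustration.

```python
from collections import Counter


def unknown_tokens(lines, unknown_tag="Unk"):
    """Return (number of tagged tokens, Counter of tokens tagged as unknown)."""
    total = 0
    unknown = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:  # skip blank or malformed lines
            continue
        total += 1
        if fields[1] == unknown_tag:
            unknown[fields[0]] += 1
    return total, unknown


# Invented tagger output in the assumed format.
sample_result = [
    "Die\tDET\n",
    "Quantenverschränkung\tUnk\n",
    "ist\tAUX\n",
    "bestimmt\tADV\n",
    "\n",
    "Quantenverschränkung\tUnk\n",
]

total, unknown = unknown_tokens(sample_result)
share = sum(unknown.values()) / total if total else 0.0
print(f"{sum(unknown.values())} of {total} tokens are unknown ({share:.1%})")
print(unknown.most_common(5))
```

Since every `Unk` token counts as an error in the evaluation, this share gives an upper bound on the accuracy the tagger can reach on the test data.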

## Control question

What happens if you train on the whole corpus and then evaluate? (Just replace `training.tts` in the training and tagging command with the whole corpus, `ud-de-v2.tts`.)

Supplement: How can you compute the lengths of your training and test corpora automatically? Note that a split by line count may cut in the middle of a sentence.



In [None]:
%%bash
# Count the lines of the whole corpus.
corpuslen=$(wc -l < ud-de-v2.tts)
# Integer division: one tenth of the lines becomes the test set.
testlen=$((corpuslen / 10))
trainlen=$((corpuslen - testlen))
echo "Corpus lines $corpuslen test lines $testlen training lines $trainlen"
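
One way to avoid cutting in the middle of a sentence is to move the cut to the next sentence boundary. The sketch below does this in Python under the assumption that sentences in the `.tts` file are separated by blank lines; check your corpus before relying on that. The function name `split_at_sentence_boundary` and its `parts` parameter are hypothetical.

```python
def split_at_sentence_boundary(lines, parts=10):
    """Split lines into (train, test), with roughly 1/parts of them as test data.

    The cut starts where the line-count arithmetic puts it and is then moved
    forward until it directly follows a blank line (assumed sentence end).
    """
    lines = list(lines)
    test_len = len(lines) // parts  # integer division, as in the shell version
    cut = len(lines) - test_len
    while 0 < cut < len(lines) and lines[cut - 1].strip():
        cut += 1
    return lines[:cut], lines[cut:]


# Invented sample: three sentences separated by blank lines.
sample = [
    "A\tX\n", "B\tX\n", "\n",
    "C\tX\n", "D\tX\n", "E\tX\n", "\n",
    "F\tX\n", "G\tX\n", "\n",
]

# With parts=2 the naive cut would fall between "D" and "E".
train, test = split_at_sentence_boundary(sample, parts=2)
print(len(train), "training lines,", len(test), "test lines")
print("first test line:", repr(test[0]))
```

Moving the cut forward makes the test set slightly smaller than one tenth. Moving it backward would work just as well and would make it slightly larger instead.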


# Using the hunpos tagger
Our TnT-like tagger is pretty slow. The good thing is that you can inspect its Python [code](https://www.nltk.org/_modules/nltk/tag/tnt.html). For [our UD pipeline web demo](https://pub.cl.uzh.ch/users/siclemat/lehre/ecl1/ud-de-hunpos-maltparser/html/), we use hunpos, a reimplementation of TnT that is a lot faster and supports unknown-word guessing.

Training with hunpos:

In [None]:
! hunpos-train training.mod < training.tts

Tagging with hunpos:

In [None]:
! hunpos-tag training.mod < test.txt > results.tts

Evaluating the result:

In [None]:
! perl tag-eval.perl -k test.tts -e results.tts