# ECL I: Exercises 3: Training a simple TnT-like POS tagger
In the lecture we heard that adjustability and reusability are criteria for a good tagger.
What exactly does that mean? A statistical tagger such as the TnT tagger can be trained for any language and tagset. The only requirement is a corpus annotated with POS tags. Here we demonstrate two different implementations of the TnT tagger:
 - the [NLTK TnT reimplementation](https://www.nltk.org/_modules/nltk/tag/tnt.html), whose Python source code you can inspect if you are interested. Training and tagging are wrapped in a [small wrapper script](ex03/tagger.py) that you can also look at. This tagger is rather slow during tagging.

 - the [hunpos tagger](https://github.com/mivoq/hunpos), which builds a model file during training with a separate training program. The tagger program then loads this model file and applies it very efficiently. For [our UD pipeline web demo](https://pub.cl.uzh.ch/users/siclemat/lehre/ecl1/ud-de-hunpos-maltparser/html/), we use the hunpos tagger because it is very fast and supports unknown-word guessing.


Explanation: All lines that start with a ! are Unix shell commands executed from this notebook.
You can also run them directly in a terminal without the !.


## (0) Go to the right working directory
Make sure you work in the directory that contains your corpus, the Python program `tagger.py`, and the evaluation script `tag-eval.perl`.
To navigate your file system, use `pwd` (print working directory), `cd` (change directory), and `ls` (list the contents of the current directory).
In a notebook you can use the magic command `%cd` to change the current directory for all subsequent commands. After the following command, we need to be in the `ex03` directory.

In [None]:
%cd ex03

In [None]:
! ls -l

##  (1) 9/10 of the corpus will be our training data

In [None]:
!head -n 277270 ud-de-v2.tts > training.tts

## (1a) Look at training data
Which UPOS tags are assigned to the tokens that contain the string "bestimmt"?

In [None]:
!grep -C 3 bestimmt training.tts
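
The same question can also be answered by counting instead of reading grep output. The following is a minimal sketch, assuming that each non-empty line of the corpus has the form token, TAB, tag and that empty lines separate sentences; the function name `tags_by_token` and the sample data are made up for illustration.

```python
from collections import Counter, defaultdict

def tags_by_token(lines, substring):
    """Count the tags of every token that contains `substring`."""
    counts = defaultdict(Counter)
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():          # skip empty lines (assumed sentence breaks)
            continue
        token, _, tag = line.partition("\t")
        if substring in token:
            counts[token][tag] += 1
    return counts

# Hypothetical sample in the assumed token<TAB>tag format.
sample = [
    "Das\tPRON\n",
    "ist\tAUX\n",
    "bestimmt\tADV\n",
    "richtig\tADJ\n",
    "\n",
    "Er\tPRON\n",
    "bestimmt\tVERB\n",
    "die\tDET\n",
    "bestimmte\tADJ\n",
    "Regel\tNOUN\n",
]

for token, tag_counts in sorted(tags_by_token(sample, "bestimmt").items()):
    print(token, dict(tag_counts))
```

To run it on the real data, pass an open file instead of the sample, for example `tags_by_token(open("training.tts", encoding="utf-8"), "bestimmt")`.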

## (2) 1/10 of our corpus will be our test data
Copy the last 30807 lines into `test.tts`.

In [None]:
! tail -n 30807 ud-de-v2.tts > test.tts

## (3) Create an evaluation corpus by removing the tags from our test corpus


In [None]:
! cut -f 1 test.tts > test.txt

## (4) Train a model with the hunpos training tool

In [None]:
! hunpos-train training.mod < training.tts

## (5) Tag the test set with the hunpos tagger and the trained model

In [None]:
! hunpos-tag training.mod < test.txt > result.tts

## (4b/5b) Training and tagging with the NLTK TnT tagger
Train the tagger with `training.tts`. Tag the evaluation file and write the results to `result-nltk.tts` (this can take a few minutes, so be patient). Some status information will appear on standard error.


In [None]:
! python3 tagger.py training.tts < test.txt > result-nltk.tts

## (6) Evaluation of hunpos tagger output
Evaluate the *hunpos tagger result* by comparing our tagger output `result.tts` with the test corpus `test.tts`. The evaluation script also prints a confusion matrix.

In [None]:
! perl tag-eval.perl -k test.tts -e result.tts 
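
To make the idea behind such an evaluation concrete, here is a minimal sketch of computing accuracy and tag confusions in Python. It is not the logic of `tag-eval.perl`; it assumes that the gold file and the tagger output are aligned line by line, with the token in the first tab-separated column and the tag in the second. The function name `evaluate` and the sample data are hypothetical.

```python
from collections import Counter

def evaluate(gold_lines, predicted_lines):
    """Return (accuracy, confusions) for two aligned token<TAB>tag sequences."""
    correct = total = 0
    confusions = Counter()
    for gold, pred in zip(gold_lines, predicted_lines):
        gold_fields = gold.rstrip("\n").split("\t")
        pred_fields = pred.rstrip("\n").split("\t")
        if len(gold_fields) < 2:      # empty line, assumed to mark a sentence break
            continue
        total += 1
        pred_tag = pred_fields[1] if len(pred_fields) > 1 else ""
        if pred_tag == gold_fields[1]:
            correct += 1
        else:
            # Key is (gold tag, predicted tag), as in a confusion matrix cell.
            confusions[(gold_fields[1], pred_tag)] += 1
    accuracy = correct / total if total else 0.0
    return accuracy, confusions

# Hypothetical sample data in the assumed format.
gold = ["Dies\tPRON\n", "ist\tAUX\n", "ein\tDET\n", "Satz\tNOUN\n", "\n"]
predicted = ["Dies\tPRON\n", "ist\tVERB\n", "ein\tDET\n", "Satz\tNOUN\n", "\n"]

accuracy, confusions = evaluate(gold, predicted)
print(f"accuracy: {accuracy:.2%}")
for (gold_tag, pred_tag), n in confusions.most_common():
    print(f"{gold_tag} tagged as {pred_tag}: {n}")
```

The confusion counts show which gold tags the tagger most often replaces with which wrong tag, which is the same information a confusion matrix presents in table form.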

## (6b) Evaluation of NLTK tagger output
Evaluate the NLTK TnT tagger output `result-nltk.tts` against the test corpus `test.tts`:

In [None]:
! perl tag-eval.perl -k test.tts -e result-nltk.tts

Our simple NLTK-based POS tagger does not guess the tags of unknown words. See for yourself:

In [None]:
!grep -C 3 Unk result-nltk.tts
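
How many tokens are affected depends on how many test tokens never occur in the training data. The sketch below estimates this under the assumption that the token is the first tab-separated column and that a token counts as unknown when its exact form is absent from the training file; the tagger's own lookup may differ in detail. The function name `unknown_tokens` and the sample data are hypothetical.

```python
def unknown_tokens(training_lines, test_lines):
    """Return the test tokens that never occur in the training data."""
    def tokens(lines):
        for line in lines:
            token = line.rstrip("\n").split("\t")[0]
            if token:                 # skip empty lines
                yield token
    vocabulary = set(tokens(training_lines))
    return [t for t in tokens(test_lines) if t not in vocabulary]

# Hypothetical sample data: tagged training lines, untagged test lines.
training = ["Dies\tPRON\n", "ist\tAUX\n", "ein\tDET\n", "Satz\tNOUN\n", "\n"]
heldout = ["Dies\n", "ist\n", "ein\n", "Beispiel\n", "\n"]

unknown = unknown_tokens(training, heldout)
n_tokens = sum(1 for line in heldout if line.strip())
print(unknown, f"{len(unknown) / n_tokens:.0%} of test tokens are unknown")
```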

## (7) Testing with your own tokenized input

In [None]:
! echo "Dies ist ein Satz !" | tr " " "\n" | hunpos-tag training.mod 

## Control question

What happens if you train on the whole corpus and evaluate afterwards? (Just substitute "training.tts" in the training and tagging commands with the whole corpus "ud-de-v2.tts".)

Supplement: How can you compute the lengths of your training and test corpora automatically from your split proportions? Note: This simple method can cut in the middle of a sentence.



In [None]:
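%%bash
# Total number of lines in the corpus
corpuslen=$(wc -l < ud-de-v2.tts)
# One tenth of the lines (integer division) goes to the test set
testlen=$((corpuslen / 10))
# The remaining lines form the training set
trainlen=$((corpuslen - testlen))
echo "Corpus lines $corpuslen test lines $testlen training lines $trainlen"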
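
One way to avoid cutting in the middle of a sentence is to move the split point forward to the next sentence boundary. The following is a minimal sketch, assuming that sentences in the corpus file are separated by empty lines; the function name `split_at_sentence_boundary` and the sample data are made up for illustration.

```python
def split_at_sentence_boundary(lines, test_fraction=0.1):
    """Split lines into (train, test) without cutting inside a sentence."""
    cut = len(lines) - int(len(lines) * test_fraction)
    # Move the cut forward until the line before it is empty,
    # i.e. until a sentence has just ended (or the corpus ends).
    while 0 < cut < len(lines) and lines[cut - 1].strip():
        cut += 1
    return lines[:cut], lines[cut:]

# Hypothetical corpus: five sentences of three tokens, each followed by an empty line.
sample = []
for i in range(5):
    sample += [f"w{i}a\tNOUN\n", f"w{i}b\tVERB\n", f"w{i}c\tPUNCT\n", "\n"]

train, test_part = split_at_sentence_boundary(sample, test_fraction=0.3)
print(len(train), "training lines,", len(test_part), "test lines")
```

Because the cut only moves forward, the test part can end up slightly smaller than the requested fraction.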
