# How to Use the Hanover Tagger

## Table of Contents

* [Installation and Import](#sec-installation)
* [German](#sec-german)
* [Dutch](#sec-dutch)
* [English](#sec-english)

## Installation and Import<a class="anchor" id="sec-installation"></a>

In [2]:
!pip install HanTa



In [1]:
from HanTa import HanoverTagger as ht

## German<a class="anchor" id="sec-german"></a>

Load a trained model, e.g. the German model available on GitHub, which was trained on the TIGER Corpus:

In [2]:
tagger = ht.HanoverTagger('morphmodel_ger.pgz')

### Analyzing a word

The method analyze gives the most probable part of speech, the lemma and a morphological analysis of a word. By using the optional parameter taglevel, we can vary the amount of information shown:

In [3]:
print(tagger.analyze('Fachmärkte'))
print(tagger.analyze('Fachmärkte',taglevel=0))
print(tagger.analyze('Fachmärkte',taglevel=1))
print(tagger.analyze('Fachmärkte',taglevel=2))
print(tagger.analyze('Fachmärkte',taglevel=3))

('Fachmarkt', 'NN')
NN
('Fachmarkt', 'NN')
('fach+markt+e', 'NN')
('fachmarkt', [('fach', 'NN'), ('märkt', 'NN_VAR'), ('e', 'SUF_NN')], 'NN')


If the taglevel is set to 1, the Hanover Tagger tries to generate the correct lemma. For levels 2 and 3, the stem of the word is given instead.

In [4]:
print(tagger.analyze('wirft',taglevel=1))
print(tagger.analyze('wirft',taglevel=2))
print(tagger.analyze('wirft',taglevel=3))

('werfen', 'VV(FIN)')
('werf+t', 'VV(FIN)')
('werf', [('wirf', 'VV_VAR'), ('t', 'SUF_FIN')], 'VV(FIN)')
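
Judging from the outputs above, the return value changes shape with taglevel: a bare tag string at level 0, a pair at levels 1 and 2, and a triple at level 3. Downstream code may want to normalize this. The helper below is a hypothetical sketch and not part of HanTa; the name `normalize_analysis` is an assumption, and it runs on values copied from the outputs above, so no model file is needed.

```python
def normalize_analysis(result):
    """Map an analyze() result of any taglevel to a uniform dict.

    Hypothetical helper (not part of HanTa). The shapes handled here are
    inferred from the example outputs shown in this document.
    """
    if isinstance(result, str):
        # taglevel=0: only the POS tag is returned
        return {"form": None, "morphemes": None, "pos": result}
    if len(result) == 2:
        # taglevel=1 or 2: (lemma or segmentation, POS)
        form, pos = result
        return {"form": form, "morphemes": None, "pos": pos}
    # taglevel=3: (stem, list of (morpheme, tag) pairs, POS)
    form, morphemes, pos = result
    return {"form": form, "morphemes": morphemes, "pos": pos}


# Values copied from the outputs above
examples = [
    'NN',
    ('werfen', 'VV(FIN)'),
    ('werf', [('wirf', 'VV_VAR'), ('t', 'SUF_FIN')], 'VV(FIN)'),
]

normalized = [normalize_analysis(r) for r in examples]
for entry in normalized:
    print(entry)
```

With a uniform dict, later code can always read the `pos` key, whatever taglevel was used.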


Using the parameter pos, we can force the tagger to return the most likely analysis for a given part of speech.

In [9]:
print(tagger.analyze('vertraute',taglevel=3,pos='VV(FIN)'))
print(tagger.analyze('vertraute',taglevel=3,pos='ADJ(D)'))
print(tagger.analyze('vertraute',taglevel=3,pos='NNA'))

('vertrau', [('vertrau', 'VVnp'), ('te', 'SUF_FIN')], 'VV(FIN)')
('vertraut', [('vertrau', 'VVnp'), ('t', 'SUF_PP'), ('e', 'SUF_ADJ')], 'ADJ(D)')
('vertraut', [('vertrau', 'VVnp'), ('t', 'SUF_PP'), ('e', 'SUF_ADJ')], 'NNA')


### Tagging a word

With the method tag_word we can get the most probable POS-tags for a word:

In [16]:
tagger.tag_word('verdachte')

[('VV(FIN)', -12.311673325151059),
 ('ADJ(A)', -13.777650841629542),
 ('NNA', -16.371677520215265)]

The numbers are the natural logarithms of the probability that the given POS produces the word, as estimated by the underlying Hidden Markov Model. Here, for example, the probability that a finite verb is realized by the word 'verdachte' is $e^{-12.31} \approx 4.5 \cdot 10^{-6}$.
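
To read the scores as plain probabilities, the logarithm can be undone with `math.exp`. The short sketch below uses the values copied from the output above, so it runs without HanTa; the variable names are chosen for illustration only.

```python
import math

# Output of tagger.tag_word('verdachte') as listed above
tagged = [('VV(FIN)', -12.311673325151059),
          ('ADJ(A)', -13.777650841629542),
          ('NNA', -16.371677520215265)]

# exp() inverts the natural logarithm
probs = {pos: math.exp(logprob) for pos, logprob in tagged}

for pos, p in probs.items():
    print(f"{pos:8s} {p:.2e}")
```

Since these are probabilities of the word given the POS, they are not expected to sum to 1 across tags.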

Using the parameter cutoff, we can get more or fewer results. The cutoff gives the maximal difference between the logprob of the last result and that of the best result. The cutoff parameter does not apply to frequent words with cached analyses! The aim of the cutoff is to exclude impossible analyses; each cached analysis, however, has been observed and is therefore possible.

In [18]:
print(tagger.tag_word('verdachte',cutoff=0))
print(tagger.tag_word('verdachte',cutoff=10))
print(tagger.tag_word('verdachte',cutoff=20))

[('NN', -18.662410642261552)]
[('NN', -18.662410642261552), ('VV(FIN)', -25.772053305619295), ('ADJ(A)', -27.799424476126276)]
[('NN', -18.662410642261552), ('VV(FIN)', -25.772053305619295), ('ADJ(A)', -27.799424476126276), ('ADV', -34.24922543179404), ('ADJ(D)', -34.40659304587046), ('NNA', -35.26082374287396), ('VV(IMP)', -35.958731005154625)]
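
The cutoff rule described above can be sketched as a simple filter: keep every candidate whose log-probability lies within `cutoff` of the best one. The function `apply_cutoff` is a hypothetical illustration, not HanTa's implementation; applied to the candidate list from the last output above, it yields the same result sizes as the three calls shown.

```python
def apply_cutoff(scored, cutoff):
    """Keep (pos, logprob) pairs whose logprob is within `cutoff` of the best.

    Hypothetical re-implementation of the rule described in the text.
    """
    if not scored:
        return []
    best = max(logprob for _, logprob in scored)
    return [(pos, lp) for pos, lp in scored if best - lp <= cutoff]


# Candidates copied from the cutoff=20 output above
candidates = [('NN', -18.662410642261552),
              ('VV(FIN)', -25.772053305619295),
              ('ADJ(A)', -27.799424476126276),
              ('ADV', -34.24922543179404),
              ('ADJ(D)', -34.40659304587046),
              ('NNA', -35.26082374287396),
              ('VV(IMP)', -35.958731005154625)]

for cutoff in (0, 10, 20):
    kept = apply_cutoff(candidates, cutoff)
    print(cutoff, [pos for pos, _ in kept])
```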


If the optional parameter casesensitive is set to True (the default value), uppercase is used to guess the most likely part of speech, mainly favouring noun readings over other possibilities.

In [20]:
tagger.tag_word('Vertraute',casesensitive=False)

[('NNA', -12.0875), ('VV(FIN)', -12.3106), ('ADJ(A)', -13.6969)]

In [21]:
tagger.tag_word('Vertraute',casesensitive=True)

[('NNA', -12.10138084948469),
 ('ADJ(A)', -16.25339063408611),
 ('VV(FIN)', -19.14813044593456)]

### Tagging a sentence

The Hanover Tagger can also tag all words in a sentence at once. First, probabilities for each word and POS are computed. Then a trigram sentence model is used to disambiguate the tags and select the contextually most appropriate POS. Finally, the words are analysed again for the best POS, and the analysis for each word is given.

Here we can again use the parameters taglevel and casesensitive.
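
To illustrate how a sentence model can override the best word-level tag, here is a minimal Viterbi sketch. It is deliberately simplified to a bigram model (HanTa uses a trigram model, as stated above), and all log-probabilities in it are invented toy values, not numbers from any HanTa model. The function name `viterbi` and the `floor` score for unseen transitions are assumptions made for this example.

```python
def viterbi(emissions, transitions, start="<s>", floor=-20.0):
    """Return the most probable tag sequence under a bigram model.

    emissions:   one dict per word, mapping tag -> log P(word | tag)
    transitions: dict mapping (previous tag, tag) -> log P(tag | previous tag)
    floor:       log-probability assumed for transitions not listed
    """
    # best[tag] = (score of the best path ending in tag, that path)
    best = {start: (0.0, [])}
    for word_scores in emissions:
        new_best = {}
        for tag, emit in word_scores.items():
            candidates = [
                (score + transitions.get((prev, tag), floor) + emit, path + [tag])
                for prev, (score, path) in best.items()
            ]
            new_best[tag] = max(candidates, key=lambda c: c[0])
        best = new_best
    return max(best.values(), key=lambda c: c[0])[1]


# Toy values (invented): the second word on its own prefers VV(FIN)
emissions = [
    {'ART': -1.0, 'PDS': -3.0},
    {'VV(FIN)': -12.3, 'ADJ(A)': -13.8},
    {'NN': -2.0},
]
transitions = {
    ('<s>', 'ART'): -1.0, ('<s>', 'PDS'): -2.0,
    ('ART', 'ADJ(A)'): -1.0, ('ART', 'VV(FIN)'): -8.0,
    ('PDS', 'VV(FIN)'): -1.0, ('PDS', 'ADJ(A)'): -5.0,
    ('ADJ(A)', 'NN'): -0.5, ('VV(FIN)', 'NN'): -3.0,
}

best_path = viterbi(emissions, transitions)
print(best_path)
```

In this toy setup the adjective reading wins for the second word, because an article followed by an adjective and a noun is a much more likely tag sequence than the alternatives.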

In [24]:
import nltk
from pprint import pprint

sent = "Die Europawahl in den Niederlanden findet immer donnerstags statt."

words = nltk.word_tokenize(sent)
lemmata = tagger.tag_sent(words)
pprint(lemmata)

[('Die', 'der', 'ART'),
 ('Europawahl', 'Europawahl', 'NN'),
 ('in', 'in', 'APPR'),
 ('den', 'der', 'ART'),
 ('Niederlanden', 'Niederlanden', 'NE'),
 ('findet', 'finden', 'VV(FIN)'),
 ('immer', 'immer', 'ADV'),
 ('donnerstags', 'donnerstags', 'ADV'),
 ('statt', 'statt', 'PTKVZ'),
 ('.', '.', '$.')]
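
The result is a plain list of (word, lemma, POS) triples, so typical post-processing needs no special API. The sketch below works on the output copied from above and runs without HanTa; which tags count as content words is a choice made for this example only.

```python
# Output of tagger.tag_sent(words) as shown above
tagged = [('Die', 'der', 'ART'),
          ('Europawahl', 'Europawahl', 'NN'),
          ('in', 'in', 'APPR'),
          ('den', 'der', 'ART'),
          ('Niederlanden', 'Niederlanden', 'NE'),
          ('findet', 'finden', 'VV(FIN)'),
          ('immer', 'immer', 'ADV'),
          ('donnerstags', 'donnerstags', 'ADV'),
          ('statt', 'statt', 'PTKVZ'),
          ('.', '.', '$.')]

# All lemmas, skipping punctuation (punctuation tags start with '$')
lemmas = [lemma for _, lemma, pos in tagged if not pos.startswith('$')]

# Lemmas of nouns, proper names and full verbs only
content = [lemma for _, lemma, pos in tagged
           if pos in {'NN', 'NE'} or pos.startswith('VV')]

print(lemmas)
print(content)
```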


In [25]:
sent = "Die Sozialdemokraten haben ersten Prognosen zufolge die Europawahl in den Niederlanden gewonnen."

words = nltk.word_tokenize(sent)
lemmata = tagger.tag_sent(words,taglevel = 3)
pprint(lemmata)

[('Die', 'der', [('die', 'ART')], 'ART'),
 ('Sozialdemokraten',
  'sozialdemokrat',
  [('sozialdemokrat', 'NN'), ('en', 'SUF_NN')],
  'NN'),
 ('haben', 'hab', [('hab', 'VA'), ('en', 'SUF_FIN')], 'VA(FIN)'),
 ('ersten', 'erst', [('erst', 'ADJ'), ('en', 'SUF_ADJ')], 'ADJ(A)'),
 ('Prognosen', 'prognose', [('prognose', 'NN'), ('n', 'SUF_NN')], 'NN'),
 ('zufolge', 'zufolge', [('zufolge', 'APPO')], 'APPO'),
 ('die', 'der', [('die', 'ART')], 'ART'),
 ('Europawahl', 'europawahl', [('europawahl', 'NN')], 'NN'),
 ('in', 'in', [('in', 'APPR')], 'APPR'),
 ('den', 'der', [('den', 'ART')], 'ART'),
 ('Niederlanden', 'niederlanden', [('niederlanden', 'NE')], 'NE'),
 ('gewonnen',
  'gewinn',
  [('gewonn', 'VVnp_VAR_PP'), ('en', 'SUF_PP')],
  'VV(PP)'),
 ('.', '.', [('.', '$.')], '$.')]
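
With taglevel 3, each entry carries a list of (morpheme, tag) pairs that can be inspected directly. The following sketch uses a few entries copied from the output above; the helpers `segmentation` and `suffixes` are hypothetical names, and treating every tag that starts with `SUF_` as a suffix is an assumption based on the tags visible in this document.

```python
# A few entries copied from the taglevel=3 output above
analysed = [
    ('Sozialdemokraten', 'sozialdemokrat',
     [('sozialdemokrat', 'NN'), ('en', 'SUF_NN')], 'NN'),
    ('haben', 'hab', [('hab', 'VA'), ('en', 'SUF_FIN')], 'VA(FIN)'),
    ('gewonnen', 'gewinn',
     [('gewonn', 'VVnp_VAR_PP'), ('en', 'SUF_PP')], 'VV(PP)'),
    ('.', '.', [('.', '$.')], '$.'),
]


def segmentation(morphemes):
    """Join the surface morphemes with '+'."""
    return '+'.join(morph for morph, _ in morphemes)


def suffixes(morphemes):
    """Return the morphemes whose tag marks them as a suffix."""
    return [morph for morph, tag in morphemes if tag.startswith('SUF_')]


for word, stem, morphemes, pos in analysed:
    print(word, stem, segmentation(morphemes), suffixes(morphemes), pos)
```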


In [26]:
sent = "Der palästinensische Schriftsteller Emil Habibi ist der einzige Autor im Nahen Osten, dessen Werk von allen Seiten größte und offizielle Anerkennung zuteil geworden ist."

words = nltk.word_tokenize(sent)
tags = tagger.tag_sent(words,taglevel= 0)
print(tags)

['ART', 'ADJ(A)', 'NN', 'NE', 'NE', 'VA(FIN)', 'ART', 'ADJ(A)', 'NN', 'APPRART', 'ADJ(A)', 'NN', '$,', 'PRELAT', 'NN', 'APPR', 'PIAT', 'NN', 'ADJ(A)', 'KON', 'ADJ(A)', 'NN', 'PTKVZ', 'VA(PP)', 'VA(FIN)', '$.']


### Some information on the underlying tagging model

The German model was trained on data derived from the Tiger Corpus. Hence the POS tags are almost the same as in the Tiger Corpus, i.e. the tags from the Stuttgart-Tübingen Tagset (STTS). See https://www.ims.uni-stuttgart.de/documents/ressourcen/korpora/tiger-corpus/annotation/tiger_scheme-morph.pdf (esp. pp. 26/27). A general description of the tagset is available e.g. here: https://www.ims.uni-stuttgart.de/forschung/ressourcen/lexika/germantagsets/#id-cfcbf0a7-0 or here: https://homepage.ruhr-uni-bochum.de/stephen.berman/Korpuslinguistik/Tagsets-STTS.html

The tags used for the morphemes are derived from the POS tags. 

HanTa can list all the POS tags and the tags for morphemes with some random examples:

In [3]:
tagger.list_postags()

$(	..., :, ,, ", /, (, -, )
$,	,
$.	?, :, ;, ., !
ADJ(A)	jungen, europäische, sogenannte, große, moderne, britische, israelische, neue, bestimmten, dresdner
ADJ(D)	alt, recht, lang, unterschiedlich, rasch, heftig, erheblich, offen, überwiegend, bekannt
ADV	zurück, bis, demnächst, knapp, zuletzt, zu, unten, gar, somit, sonst
APPO	ungeachtet, gegenüber, über, voran, hinunter, zufolge, wegen, entlang, halber, entgegen
APPR	anno, voller, via, seit, mitsamt, inmitten, a, minus, gemäß, infolge
APPRART	übers, vorm, v., im, unters, beim, z., zum, a., überm
APZR	heraus, willen, aus, herunter, her, hinaus, herum, hinein, hinweg, hin
ART	das, 'n, det, einem, d., einen, eines, die, einer, s
CARD	75, 700, 2,5, 2500, 1986, 1987, 29, 3,5, 400, 24
FM	british, friends, il, 's, &, labour, news, sir, spe, la
ITJ	mann, o, na, ach, ja
KOKOM	wie, als, denn
KON	plus, ebenso, +, mal, weder, sondern, u, wenngleich, entweder, beziehungsweise
KOUI	statt, ums, anstatt, ohne, um
KOUS	ehe, sofern, wie, zumal, obwoh

In [4]:
tagger.list_mtags()

ADJ             technisch, jüdisch, 13., allgemein, hiesig, 14., 19., zahlreich, schwierig, tief
ADJ_COMP        er
ADJ_IRR         täglich, größtes, viertgrößte, wahrscheinlich, politisch, größter, unklar, letzter, größtem, frühren
ADJ_SUP         st, est
ADJ_VAR         abgeschirmt, treuest, edl, bess, betreut, geschmiegt, geknüpft, schwach, bekanntgegeben, irreversibl
FUGE            en, es, er, e, nen, -, s, n
HYPHEN          -
NE_VAR          courmayeur, l, karlsruhe, skipi, shakespear, düsseldorf, jacques, lersnerstr., mehrdorn, saigon-fluss
NN_IRR          -bedingungen, -fenster, ``aldi''-brüder, -steuer, ``klassik''-mitarbeiter, ``rations''-abschnitten, ``kavadi''-träger, öfen, -führer, -desaster
NN_VAR          rückgäng, abläuf, anträg, verbänd, verstöß, wäld, töpf, nöt, sprüch, anschläg
PDAT_VAR        solch, jen, dieselbe, derartig, dasselbe, ebendies, di, denjenige, desselbe, demselbe
PIS_IRR         ihresgleichen
PIS_VAR         wat, ein, wen'g, viel
PPOSAT_VAR      ihr, e

## Dutch<a class="anchor" id="sec-dutch"></a>

You can load trained morphology models for some other languages in the same way as shown above for German. Here are a few examples for a Dutch model.

In [5]:
tagger_nl = ht.HanoverTagger('morphmodel_dutch.pgz')

In [28]:
print(tagger_nl.analyze('huishoudhulpje'))
print(tagger_nl.analyze('huishoudhulpje',taglevel=0))
print(tagger_nl.analyze('huishoudhulpje',taglevel=1))
print(tagger_nl.analyze('huishoudhulpje',taglevel=2))
print(tagger_nl.analyze('huishoudhulpje',taglevel=3))

('huishoudhulp', 'N(soort,ev,dim,onz,stan)')
N(soort,ev,dim,onz,stan)
('huishoudhulp', 'N(soort,ev,dim,onz,stan)')
('huis+houd+hulp+je', 'N(soort,ev,dim,onz,stan)')
('huishoudhulp', [('huis', 'N(soort,onz)'), ('houd', 'WW'), ('hulp', 'N(soort,zijd)'), ('je', 'SUF_DIM')], 'N(soort,ev,dim,onz,stan)')


In [29]:
tagger_nl.tag_word('vertrouwen')

[('N(soort,ev,basis,onz,stan)', -9.735243330616047),
 ('WW(inf,vrij,zonder)', -12.609773795398848),
 ('WW(pv,tgw,mv)', -13.014584292537286)]

In [30]:
sent = "Elk jaar wisselen ruim 1 miljoen Nederlanders van zorgverzekeraar. "

words = nltk.word_tokenize(sent)
lemmata = tagger_nl.tag_sent(words,taglevel= 3)
pprint(lemmata)

[('Elk',
  'elk',
  [('elk', 'VNW(onbep,det,stan,prenom,zonder,evon)')],
  'VNW(onbep,det,stan,prenom,zonder,evon)'),
 ('jaar', 'jaar', [('jaar', 'N(soort,onz)')], 'N(soort,ev,basis,onz,stan)'),
 ('wisselen',
  'wissel',
  [('wissel', 'WW'), ('en', 'SUF_WW(mv)')],
  'WW(pv,tgw,mv)'),
 ('ruim', 'ruim', [('ruim', 'ADJ')], 'ADJ(vrij,basis,zonder)'),
 ('1', '1', [('1', 'TW(hoofd,prenom,stan)')], 'TW(hoofd,prenom,stan)'),
 ('miljoen',
  'miljoen',
  [('miljoen', 'N(soort,onz)')],
  'N(soort,ev,basis,onz,stan)'),
 ('Nederlanders',
  'nederlander',
  [('nederlander', 'N(eigen,zijd)'), ('s', 'SUF_N_S')],
  'N(eigen,mv,basis)'),
 ('van', 'van', [('van', 'VZ(init)')], 'VZ(init)'),
 ('zorgverzekeraar',
  'zorgverzekeraar',
  [('zorg', 'N(soort,zijd)'), ('verzekeraar', 'N(soort,zijd)')],
  'N(soort,ev,basis,zijd,stan)'),
 ('.', '.', [('.', 'LET()')], 'LET()')]
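
The Dutch tags bundle a main category with a list of features in parentheses. If only the coarse category is needed, the tag string can be split with ordinary string methods. The function `split_tag` below is a hypothetical helper, not part of HanTa, and is tested only against tag strings that appear in the output above.

```python
def split_tag(tag):
    """Split 'N(soort,ev)' into ('N', ['soort', 'ev']).

    Tags without parentheses, or with empty ones such as 'LET()',
    yield an empty feature list.
    """
    head, sep, rest = tag.partition('(')
    if not sep:
        return head, []
    features = rest.rstrip(')')
    return head, [f for f in features.split(',') if f]


# Tags copied from the output above
tags = ['N(soort,ev,basis,onz,stan)', 'WW(pv,tgw,mv)',
        'ADJ(vrij,basis,zonder)', 'VZ(init)', 'LET()']

parsed = [split_tag(t) for t in tags]
for head, features in parsed:
    print(head, features)
```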


## English<a class="anchor" id="sec-english"></a>

In [43]:
tagger_en = ht.HanoverTagger('morphmodel_en.pgz')

In [34]:
print(tagger_en.analyze('walking',taglevel=0))
print(tagger_en.analyze('walking',taglevel=1))
print(tagger_en.analyze('walking',taglevel=2))
print(tagger_en.analyze('walking',taglevel=3))

VBG
('walk', 'VBG')
('walk+ing', 'VBG')
('walk', [('walk', 'VB'), ('ing', 'SUF_ING')], 'VBG')


In [39]:
tagger_en.tag_word('walks')

[('VBZ', -12.031221308520081), ('NNS', -12.225015720529305)]

If you analyze English sentences, make sure that the word tokenization is done properly. The model provided works with the default word tokenization from NLTK, which splits words like _cannot_ and _don't_. If the wrong type of apostrophe is used, tokenization might not give the expected results, as is the case in the first (commented-out) variant of the sentence below.
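
One way to guard against this is to replace typographic apostrophes with the ASCII apostrophe before tokenizing. The sketch below is an assumption about a reasonable preprocessing step and not something HanTa or NLTK does for you; the function name `normalize_apostrophes` is hypothetical. It maps U+2019 (right single quotation mark) and U+02BC (modifier letter apostrophe) to `'`.

```python
# Code points that commonly stand in for an apostrophe
APOSTROPHES = {0x2019: "'", 0x02BC: "'"}


def normalize_apostrophes(text):
    """Replace typographic apostrophes with the ASCII apostrophe."""
    return text.translate(APOSTROPHES)


raw = "so here\u2019s a manageable list of things to clean"
clean = normalize_apostrophes(raw)
print(clean)
```

The cleaned string can then be passed to the tokenizer as usual. Note that U+2019 is also used as a closing single quotation mark, so this replacement may not be appropriate for text with quoted passages.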

In [40]:
#sent = "Tackling the entire kitchen can be an intimidating task, so here’s a manageable list of things to clean, ingredients to check, equipment to organize and more."
sent = "Tackling the entire kitchen can be an intimidating task, so here's a manageable list of things to clean, ingredients to check, equipment to organize and more."

words = nltk.word_tokenize(sent)

print(tagger_en.tag_sent(words,taglevel = 0))
print('----')
print(tagger_en.tag_sent(words,taglevel = 1))
print('----')
print(tagger_en.tag_sent(words,taglevel = 2))
print('----')
print(tagger_en.tag_sent(words,taglevel = 3))

['VBG', 'AT', 'JJ', 'NN', 'MD', 'BE', 'AT', 'JJ', 'NN', ',', 'QL', 'RB', 'BEZ', 'AT', 'JJ', 'NN', 'IN', 'NNS', 'TO', 'VB', ',', 'NNS', 'IN', 'NN', ',', 'NN', 'TO', 'VB', 'CC', 'AP', '.']
----
[('Tackling', 'tackl', 'VBG'), ('the', 'the', 'AT'), ('entire', 'entire', 'JJ'), ('kitchen', 'kitchen', 'NN'), ('can', 'can', 'MD'), ('be', 'be', 'BE'), ('an', 'a', 'AT'), ('intimidating', 'intimidating', 'JJ'), ('task', 'task', 'NN'), (',', ',', ','), ('so', 'so', 'QL'), ('here', 'here', 'RB'), ("'s", "'s", 'BEZ'), ('a', 'a', 'AT'), ('manageable', 'manageable', 'JJ'), ('list', 'list', 'NN'), ('of', 'of', 'IN'), ('things', 'thing', 'NNS'), ('to', 'to', 'TO'), ('clean', 'clean', 'VB'), (',', ',', ','), ('ingredients', 'ingredient', 'NNS'), ('to', 'to', 'IN'), ('check', 'check', 'NN'), (',', ',', ','), ('equipment', 'equipment', 'NN'), ('to', 'to', 'TO'), ('organize', 'organize', 'VB'), ('and', 'and', 'CC'), ('more', 'more', 'AP'), ('.', '.', '.')]
----
[('Tackling', 'tackl+ing', 'VBG'), ('the', 'th