# How to Use the Hanover Tagger

## Table of Contents

* [Installation and Import](#sec-installation)
* [German](#sec-german)
* [Dutch](#sec-dutch)
* [English](#sec-english)

## Installation and Import<a class="anchor" id="sec-installation"></a>

In [30]:
!pip install --upgrade HanTa 

Collecting HanTa
  Downloading HanTa-1.1.0-py3-none-any.whl (15.0 MB)
Installing collected packages: HanTa
  Attempting uninstall: HanTa
    Found existing installation: HanTa 1.0.0
    Uninstalling HanTa-1.0.0:
      Successfully uninstalled HanTa-1.0.0
Successfully installed HanTa-1.1.0


In [1]:
from HanTa import HanoverTagger as ht

## German<a class="anchor" id="sec-german"></a>

Load a trained model, e.g. the model on GitHub that was trained on the TIGER Corpus:

In [2]:
tagger = ht.HanoverTagger('morphmodel_ger.pgz')

### Analyzing a word

The method analyze gives the most probable part of speech, the lemma and a morphological analysis of a word. By using the optional parameter taglevel, we can vary the amount of information shown:

In [3]:
print(tagger.analyze('Fachmärkte'))
print(tagger.analyze('Fachmärkte',taglevel=0))
print(tagger.analyze('Fachmärkte',taglevel=1))
print(tagger.analyze('Fachmärkte',taglevel=2))
print(tagger.analyze('Fachmärkte',taglevel=3))

('Fachmarkt', 'NN')
NN
('Fachmarkt', 'NN')
('fach+markt+e', 'NN')
('fachmarkt', [('fach', 'NN'), ('märkt', 'NN_VAR'), ('e', 'SUF_NN')], 'NN')


If taglevel is set to 1, the Hanover Tagger tries to generate the correct lemma. For levels 2 and 3, the stem of the word is given instead.
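
The level-3 result is an ordinary Python tuple, so it can be unpacked without further help from the tagger. The following sketch hard-codes the output shown above for 'Fachmärkte' instead of calling HanTa; the helper name `split_analysis` is hypothetical and not part of the library.

```python
# Level-3 analysis copied from the output above (no tagger call needed here).
analysis = ('fachmarkt',
            [('fach', 'NN'), ('märkt', 'NN_VAR'), ('e', 'SUF_NN')],
            'NN')


def split_analysis(analysis):
    """Hypothetical helper: unpack a taglevel=3 result into its parts."""
    stem, morphemes, pos = analysis
    surface_parts = [form for form, _ in morphemes]  # morphemes as written
    morpheme_tags = [tag for _, tag in morphemes]    # their morpheme tags
    return stem, surface_parts, morpheme_tags, pos


stem, parts, mtags, pos = split_analysis(analysis)
print(stem, parts, mtags, pos)
```

Joining the surface parts gives back the lower-cased word form, whereas the first element of the tuple is the stem.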

In [4]:
print(tagger.analyze('wirft',taglevel=1))
print(tagger.analyze('wirft',taglevel=2))
print(tagger.analyze('wirft',taglevel=3))

('werfen', 'VV(FIN)')
('werf+t', 'VV(FIN)')
('werf', [('wirf', 'VV_VAR'), ('t', 'SUF_FIN')], 'VV(FIN)')


Using the parameter *pos*, we can force the tagger to return the most likely analysis for a given part of speech.

In [5]:
print(tagger.analyze('vertraute',taglevel=3,pos='VV(FIN)'))
print(tagger.analyze('vertraute',taglevel=3,pos='ADJ(D)'))
print(tagger.analyze('vertraute',taglevel=3,pos='NNA'))

('vertrau', [('vertrau', 'VVnp'), ('te', 'SUF_FIN')], 'VV(FIN)')
('vertraut', [('vertrau', 'VVnp'), ('t', 'SUF_PP'), ('e', 'SUF_ADJ')], 'ADJ(D)')
('vertraut', [('vertrau', 'VVnp'), ('t', 'SUF_PP'), ('e', 'SUF_ADJ')], 'NNA')


### Tagging a word

With the method tag_word we can get the most probable POS-tags for a word:

In [6]:
tagger.tag_word('Angeln')

[('NN', -13.007050108561302),
 ('NNI', -13.704188662004706),
 ('NE', -19.594898957386754),
 ('VV(INF)', -23.595031081236865),
 ('VV(FIN)', -25.169976341439774)]

The numbers are the natural logarithms of the probability that the word is found with the given POS, as estimated by the underlying Hidden Markov Model. Here, for example, the probability that the word 'Angeln' is found together with the tag 'NN' is roughly $e^{-13} \approx 2.26 \cdot 10^{-6}$.
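
To read the scores as probabilities, apply the exponential function. The following sketch uses only the standard library and the values printed above for 'Angeln'; it does not call the tagger. The variable names are chosen for illustration.

```python
import math

# Scores copied from the tag_word('Angeln') output above.
scores = [('NN', -13.007050108561302),
          ('NNI', -13.704188662004706),
          ('NE', -19.594898957386754),
          ('VV(INF)', -23.595031081236865),
          ('VV(FIN)', -25.169976341439774)]

# exp() undoes the natural logarithm.
probs = {tag: math.exp(logp) for tag, logp in scores}

# A difference of log-probabilities corresponds to a ratio of probabilities.
ratio = math.exp(scores[0][1] - scores[1][1])

for tag, p in probs.items():
    print(f"{tag:8s} {p:.3e}")
print(f"NN vs. NNI probability ratio: {ratio:.2f}")
```

Working with differences of log-probabilities, as in the last step, avoids handling very small numbers directly.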

Using the parameter cutoff we can get more or fewer results. The cutoff gives the maximal difference between the log-probability of the last result and that of the best result. The cutoff parameter does not apply to frequent words with cached analyses! The aim of the cutoff is to exclude impossible analyses. Each cached analysis, however, has been observed and is therefore possible.

In [7]:
print(tagger.tag_word('verdachte',cutoff=0))
print(tagger.tag_word('verdachte',cutoff=10))
print(tagger.tag_word('verdachte',cutoff=20))

[('NN', -18.757361480409173)]
[('NN', -18.757361480409173), ('VV(FIN)', -25.868323558371134), ('ADJ(A)', -28.178979842637556)]
[('NN', -18.757361480409173), ('VV(FIN)', -25.868323558371134), ('ADJ(A)', -28.178979842637556), ('ADV', -34.50301836881482), ('ADJ(D)', -34.836595953773134), ('NNA', -35.32902383199287), ('FM', -35.82508555170638), ('NE', -35.997194162564966), ('VV(IMP)', -36.16169684930409)]
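
The effect of the cutoff can be reproduced on the printed scores. The function `apply_cutoff` below is an illustrative re-implementation of the rule described above, not HanTa's own code, and it ignores the caching of frequent words.

```python
# Scores copied from the tag_word('verdachte', cutoff=20) output above.
scored = [('NN', -18.757361480409173), ('VV(FIN)', -25.868323558371134),
          ('ADJ(A)', -28.178979842637556), ('ADV', -34.50301836881482),
          ('ADJ(D)', -34.836595953773134), ('NNA', -35.32902383199287),
          ('FM', -35.82508555170638), ('NE', -35.997194162564966),
          ('VV(IMP)', -36.16169684930409)]


def apply_cutoff(scored, cutoff):
    """Keep every reading whose log-probability is at most `cutoff`
    below that of the best reading (illustrative sketch only)."""
    best = max(logp for _, logp in scored)
    return [(tag, logp) for tag, logp in scored if best - logp <= cutoff]


for cutoff in (0, 10, 20):
    print(cutoff, [tag for tag, _ in apply_cutoff(scored, cutoff)])
```

For the three cutoff values used above, this filter selects the same tags as the tagger output.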


If the optional parameter casesensitive is set to True (the default value), capitalization is used to guess the most likely part of speech, mainly favouring proper noun readings and, for German, noun readings over other possibilities.

In [8]:
tagger.tag_word('angeln',casesensitive=False)

[('NN', -13.0038),
 ('NNI', -13.6969),
 ('VV(INF)', -16.8673),
 ('VV(FIN)', -18.3327),
 ('NE', -19.5531)]

In [9]:
tagger.tag_word('angeln',casesensitive=True)

[('VV(INF)', -16.868497963605382),
 ('VV(FIN)', -18.333773598068934),
 ('NNI', -18.621977405734608),
 ('NN', -18.734491493868266),
 ('NE', -22.74881056438149)]

### Tagging a sentence

The Hanover Tagger can also tag all words in a sentence at once. First, probabilities for each word and POS are computed. Then a trigram sentence model is used to disambiguate the tags and select the contextually most appropriate POS. Finally, the words are analysed again, and the analysis leading to the contextually best POS is given.

Here we can again use the parameters taglevel and casesensitive.

In [10]:
import nltk
from pprint import pprint

sent = "Die Europawahl in den Niederlanden findet immer donnerstags statt."

words = nltk.word_tokenize(sent)
lemmata = tagger.tag_sent(words)
pprint(lemmata)

[('Die', 'der', 'ART'),
 ('Europawahl', 'Europawahl', 'NN'),
 ('in', 'in', 'APPR'),
 ('den', 'der', 'ART'),
 ('Niederlanden', 'Niederlanden', 'NE'),
 ('findet', 'finden', 'VV(FIN)'),
 ('immer', 'immer', 'ADV'),
 ('donnerstags', 'donnerstags', 'ADV'),
 ('statt', 'statt', 'PTKVZ'),
 ('.', '.', '$.')]
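
Since the result is a list of (word, lemma, POS) triples, it can be post-processed with plain Python. The sketch below works on the output copied from above rather than calling the tagger. The choice of tag prefixes counted as content words is an assumption made for this example.

```python
# Output of tag_sent copied from above: (word, lemma, POS) triples.
tagged = [('Die', 'der', 'ART'), ('Europawahl', 'Europawahl', 'NN'),
          ('in', 'in', 'APPR'), ('den', 'der', 'ART'),
          ('Niederlanden', 'Niederlanden', 'NE'),
          ('findet', 'finden', 'VV(FIN)'), ('immer', 'immer', 'ADV'),
          ('donnerstags', 'donnerstags', 'ADV'), ('statt', 'statt', 'PTKVZ'),
          ('.', '.', '$.')]

# Tag prefixes treated as content words here (an assumption for this sketch).
CONTENT_PREFIXES = ('NN', 'NE', 'VV', 'ADJ')

# Keep only the lemmas whose POS tag starts with one of the prefixes.
content_lemmas = [lemma for _, lemma, pos in tagged
                  if pos.startswith(CONTENT_PREFIXES)]
print(content_lemmas)
```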


In [11]:
sent = "Die Sozialdemokraten haben ersten Prognosen zufolge die Europawahl in den Niederlanden gewonnen."

words = nltk.word_tokenize(sent)
lemmata = tagger.tag_sent(words,taglevel = 3)
pprint(lemmata)

[('Die', 'der', [('die', 'ART')], 'ART'),
 ('Sozialdemokraten',
  'sozialdemokrat',
  [('sozialdemokrat', 'NN'), ('en', 'SUF_NN')],
  'NN'),
 ('haben', 'hab', [('hab', 'VA'), ('en', 'SUF_FIN')], 'VA(FIN)'),
 ('ersten', 'erst', [('erst', 'ADJ'), ('en', 'SUF_ADJ')], 'ADJ(A)'),
 ('Prognosen', 'prognose', [('prognose', 'NN'), ('n', 'SUF_NN')], 'NN'),
 ('zufolge', 'zufolge', [('zufolge', 'APPO')], 'APPO'),
 ('die', 'der', [('die', 'ART')], 'ART'),
 ('Europawahl', 'europawahl', [('europawahl', 'NN')], 'NN'),
 ('in', 'in', [('in', 'APPR')], 'APPR'),
 ('den', 'der', [('den', 'ART')], 'ART'),
 ('Niederlanden', 'niederlanden', [('niederlanden', 'NE')], 'NE'),
 ('gewonnen',
  'gewinn',
  [('gewonn', 'VVnp_VAR_PP'), ('en', 'SUF_PP')],
  'VV(PP)'),
 ('.', '.', [('.', '$.')], '$.')]


In [13]:
sent = "Der palästinensische Schriftsteller Emil Habibi ist der einzige Autor im Nahen Osten, dessen Werk von allen Seiten größte und offizielle Anerkennung zuteil geworden ist."

words = nltk.word_tokenize(sent)
tags = tagger.tag_sent(words,taglevel= 0)
print(tags)

['ART', 'ADJ(A)', 'NN', 'NE', 'NE', 'VA(FIN)', 'ART', 'ADJ(A)', 'NN', 'APPRART', 'ADJ(A)', 'NN', '$,', 'PRELAT', 'NN', 'APPR', 'PIAT', 'NN', 'ADJ(A)', 'KON', 'ADJ(A)', 'NN', 'PTKVZ', 'VA(PP)', 'VA(FIN)', '$.']


If only the part of speech is needed, it is recommended to set taglevel = 0, since this will be much faster.
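
With taglevel = 0 the result contains only the tags. In the example above there is one tag per token, in the same order, so the tags can be paired with the tokens again. A minimal sketch, using the first six tokens and tags of that example instead of a tagger call:

```python
# First six tokens of the sentence above and the first six tags of the output.
words = ['Der', 'palästinensische', 'Schriftsteller', 'Emil', 'Habibi', 'ist']
tags = ['ART', 'ADJ(A)', 'NN', 'NE', 'NE', 'VA(FIN)']

# Pair each token with its tag; this assumes both lists are aligned.
pairs = list(zip(words, tags))
print(pairs)
```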

### Some information on the underlying tagging model

The German model was trained on data derived from the Tiger Corpus. Hence the POS tags are almost the same as in the Tiger Corpus, i.e. the tags from the Stuttgart-Tübingen Tagset (STTS). See https://www.ims.uni-stuttgart.de/documents/ressourcen/korpora/tiger-corpus/annotation/tiger_scheme-morph.pdf (esp. pp. 26/27). A general description of the tagset is available e.g. here: https://www.ims.uni-stuttgart.de/forschung/ressourcen/lexika/germantagsets/#id-cfcbf0a7-0 or here: https://homepage.ruhr-uni-bochum.de/stephen.berman/Korpuslinguistik/Tagsets-STTS.html

The tags used for the morphemes are derived from the POS tags. 

HanTa can list all the POS tags and the tags for morphemes with some random examples:

In [14]:
tagger.list_postags()

$(	/, (, ,, ), ", ..., :, -
$,	,
$.	!, :, ?, ., ;
ADJ(A)	grünen, weiterer, allgemeinen, russische, erhebliche, halbe, westlichen, gemeinsame, britischen, großer
ADJ(D)	eindeutig, bewußt, nötig, größer, neu, eng, umstritten, verstärkt, angemessen, interessiert
ADV	dort, zuerst, ohnehin, erst, spätestens, weit, zweimal, außen, etwa, aber
APPO	nach, durch, über, gegenüber, wegen, entgegen, entlang, voran, halber, zufolge
APPR	hinsichtlich, als, inklusive, entlang, entgegen, bei, außerhalb, statt, jenseits, v.
APPRART	unters, aufs, vom, am, z., beim, ins, ums, zur, fürs
APZR	hinein, vorbei, aus, hin, herum, willen, entlang, hinaus, an, heraus
ART	den, ein, einen, 'n, einer, s, die, der, eines, eine
CARD	26, 1977, tausend, 60, 35, 1972, sieben, 1987, 54, 65
FM	austria, first, nouveau, akbar, tiger, par, puncto, parks, il, labour
ITJ	ach, ja, o, mann, na
KOKOM	denn, wie, als
KON	auch, weder, aber, denn, wie, statt, u, als, mal, sowie
KOUI	ums, statt, um, ohne, anstatt
KOUS	seit, obwohl, ehe,

In [16]:
tagger.list_mtags()

ACR_NE             kws, d., anc, bgh, egb, iter, npd, raf, dws, hbv
ACR_NN             mrd, dm, gmbh, kp, kwg, ngo, fckw, vatu, pvc, a320
ADJ                westlich, wirtschaftlich, jung, nötig, ander, gesetzlich, deutsch, nigerianisch, kommunistisch, groß
ADJ_COMP           er
ADJ_INVAR          würzburger, bregenzer, kopenhagener, seeheimer, baseler, mecklenburger, petersburger, weise, haager, bochumer
ADJ_IRR            größte, höchstverantwortlichen, letztes, klar, umgerechnet, größtem, größten, viertgrößte, politisch, erforderlich
ADJ_SUP            st, est
ADJ_VAR            veritabl, gröb, größt, größer, innern, edl, sinistr, profitabl, höch, millionenteur
FUGE               nen, es, s, n, en, e, er
HYPHEN             -
NE_VAR             shakespear, n', vereinten, lersnerstr., l', tian'anmen-platz, özgür, thüringer, aserbaidschan, großbritannien
NN_IRR             ``aldi''-brüder, -führer, ``kavadi''-träger, ``klassik''-mitarbeiter, -steuer, -desaster, -fenster, öfen, -melodie

## Dutch<a class="anchor" id="sec-dutch"></a>

You can load trained morphology models for some other languages in the same way as shown above for German. Here are a few examples for a Dutch model.

In [17]:
tagger_nl = ht.HanoverTagger('morphmodel_dutch.pgz')

In [18]:
print(tagger_nl.analyze('staat',taglevel=2,pos='WW(pv,tgw,met-t)'))
print(tagger_nl.analyze('staat',taglevel=2,pos='N(soort,ev,basis,zijd,stan)'))

('staa+t', 'WW(pv,tgw,met-t)')
('staat', 'N(soort,ev,basis,zijd,stan)')


In [19]:
tagger_nl.tag_word('staat')

[('WW(pv,tgw,met-t)', -7.763586419721995),
 ('N(soort,ev,basis,zijd,stan)', -8.121654473590889)]

In [20]:
print(tagger_nl.analyze('huishoudhulpje'))
print(tagger_nl.analyze('huishoudhulpje',taglevel=0))
print(tagger_nl.analyze('huishoudhulpje',taglevel=1))
print(tagger_nl.analyze('huishoudhulpje',taglevel=2))
print(tagger_nl.analyze('huishoudhulpje',taglevel=3))

('huishoudhulp', 'N(soort,ev,dim,onz,stan)')
N(soort,ev,dim,onz,stan)
('huishoudhulp', 'N(soort,ev,dim,onz,stan)')
('huis+houd+hulp+je', 'N(soort,ev,dim,onz,stan)')
('huishoudhulp', [('huis', 'N(soort,onz)'), ('houd', 'WW'), ('hulp', 'N(soort,zijd)'), ('je', 'SUF_DIM')], 'N(soort,ev,dim,onz,stan)')


In [21]:
tagger_nl.tag_word('vertrouwen')

[('N(soort,ev,basis,onz,stan)', -9.735243330616047),
 ('WW(inf,vrij,zonder)', -12.609773795398848),
 ('WW(pv,tgw,mv)', -13.014584292537286)]

In [22]:
sent = "Elk jaar wisselen ruim 1 miljoen Nederlanders van zorgverzekeraar. "

words = nltk.word_tokenize(sent)
lemmata = tagger_nl.tag_sent(words,taglevel= 3)
pprint(lemmata)

[('Elk',
  'elk',
  [('elk', 'VNW(onbep,det,stan,prenom,zonder,evon)')],
  'VNW(onbep,det,stan,prenom,zonder,evon)'),
 ('jaar', 'jaar', [('jaar', 'N(soort,onz)')], 'N(soort,ev,basis,onz,stan)'),
 ('wisselen',
  'wissel',
  [('wissel', 'WW'), ('en', 'SUF_WW(mv)')],
  'WW(pv,tgw,mv)'),
 ('ruim', 'ruim', [('ruim', 'ADJ')], 'ADJ(vrij,basis,zonder)'),
 ('1', '1', [('1', 'TW(hoofd,prenom,stan)')], 'TW(hoofd,prenom,stan)'),
 ('miljoen',
  'miljoen',
  [('miljoen', 'N(soort,onz)')],
  'N(soort,ev,basis,onz,stan)'),
 ('Nederlanders',
  'nederlander',
  [('nederlander', 'N(eigen,zijd)'), ('s', 'SUF_N_S')],
  'N(eigen,mv,basis)'),
 ('van', 'van', [('van', 'VZ(init)')], 'VZ(init)'),
 ('zorgverzekeraar',
  'zorgverzekeraar',
  [('zorg', 'N(soort,zijd)'), ('verzekeraar', 'N(soort,zijd)')],
  'N(soort,ev,basis,zijd,stan)'),
 ('.', '.', [('.', 'LET()')], 'LET()')]


## English<a class="anchor" id="sec-english"></a>

In [23]:
tagger_en = ht.HanoverTagger('morphmodel_en.pgz')

In [24]:
print(tagger_en.analyze('walking',taglevel=0))
print(tagger_en.analyze('walking',taglevel=1))
print(tagger_en.analyze('walking',taglevel=2))
print(tagger_en.analyze('walking',taglevel=3))

VBG
('walk', 'VBG')
('walk+ing', 'VBG')
('walk', [('walk', 'VB'), ('ing', 'SUF_ING')], 'VBG')


In [25]:
tagger_en.tag_word('walks')

[('VBZ', -12.031221308520081), ('NNS', -12.225015720529305)]

If you analyze English sentences, make sure that the word tokenization is done properly. The model provided works with the default word tokenization from NLTK, which splits words like _cannot_ and _don't_. If the wrong type of apostrophe is used, tokenization might not give the expected results, as is the case in the first variant of the following sentence.
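
One way to guard against this is to replace typographic apostrophes with the plain ASCII apostrophe before tokenizing. The helper `normalize_apostrophes` below is hypothetical and not part of HanTa or NLTK, and the set of characters it covers is an assumption. Note that it also changes curly single quotation marks, which may not be wanted in every text.

```python
# Map typographic apostrophes to the ASCII apostrophe.
# The characters covered here are an assumption; extend the table as needed.
APOSTROPHES = str.maketrans({'\u2019': "'", '\u2018': "'", '\u02bc': "'"})


def normalize_apostrophes(text):
    """Hypothetical helper: replace curly apostrophes before tokenizing."""
    return text.translate(APOSTROPHES)


print(normalize_apostrophes("so here\u2019s a manageable list"))
```

The normalized string can then be passed to `nltk.word_tokenize` in the same way as in the other cells.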

In [26]:
#sent = "Tackling the entire kitchen can be an intimidating task, so here’s a manageable list of things to clean, ingredients to check, equipment to organize and more."
sent = "Tackling the entire kitchen can be an intimidating task, so here's a manageable list of things to clean, ingredients to check, equipment to organize and more."

words = nltk.word_tokenize(sent)

print(tagger_en.tag_sent(words,taglevel = 0))
print('----')
print(tagger_en.tag_sent(words,taglevel = 1))
print('----')
print(tagger_en.tag_sent(words,taglevel = 2))
print('----')
print(tagger_en.tag_sent(words,taglevel = 3))

['VBG', 'AT', 'JJ', 'NN', 'MD', 'BE', 'AT', 'JJ', 'NN', ',', 'QL', 'RB', 'BEZ', 'AT', 'JJ', 'NN', 'IN', 'NNS', 'TO', 'VB', ',', 'NNS', 'IN', 'NN', ',', 'NN', 'TO', 'VB', 'CC', 'AP', '.']
----
[('Tackling', 'tackl', 'VBG'), ('the', 'the', 'AT'), ('entire', 'entire', 'JJ'), ('kitchen', 'kitchen', 'NN'), ('can', 'can', 'MD'), ('be', 'be', 'BE'), ('an', 'a', 'AT'), ('intimidating', 'intimidating', 'JJ'), ('task', 'task', 'NN'), (',', ',', ','), ('so', 'so', 'QL'), ('here', 'here', 'RB'), ("'s", "'s", 'BEZ'), ('a', 'a', 'AT'), ('manageable', 'manageable', 'JJ'), ('list', 'list', 'NN'), ('of', 'of', 'IN'), ('things', 'thing', 'NNS'), ('to', 'to', 'TO'), ('clean', 'clean', 'VB'), (',', ',', ','), ('ingredients', 'ingredient', 'NNS'), ('to', 'to', 'IN'), ('check', 'check', 'NN'), (',', ',', ','), ('equipment', 'equipment', 'NN'), ('to', 'to', 'TO'), ('organize', 'organize', 'VB'), ('and', 'and', 'CC'), ('more', 'more', 'AP'), ('.', '.', '.')]
----
[('Tackling', 'tackl+ing', 'VBG'), ('the', 'th