# How to Use the Hanover Tagger

In [1]:
!pip install HanTa



In [2]:
from HanTa import HanoverTagger as ht

Load a trained model, e.g. the model on GitHub that was trained on the TIGER corpus:

In [3]:
tagger = ht.HanoverTagger('morphmodel_ger.pgz')

## Analyzing a word

The method analyze gives the most probable part of speech, the lemma and a morphological analysis of a word. Using the optional parameter taglevel, we can vary the amount of information shown:

In [4]:
print(tagger.analyze('Fachmärkte'))
print(tagger.analyze('Fachmärkte',taglevel=0))
print(tagger.analyze('Fachmärkte',taglevel=1))
print(tagger.analyze('Fachmärkte',taglevel=2))
print(tagger.analyze('Fachmärkte',taglevel=3))

('Fachmarkt', 'NN')
NN
('Fachmarkt', 'NN')
('fach+markt', 'NN')
('fachmarkt', [('fach', 'NN'), ('märkt', 'NN_VAR'), ('e', 'SUF_NN')], 'NN')


If taglevel is set to 1, the Hanover Tagger tries to generate the correct lemma. For levels 2 and 3, the stem of the word is given instead.

In [5]:
print(tagger.analyze('wirft',taglevel=1))
print(tagger.analyze('wirft',taglevel=2))
print(tagger.analyze('wirft',taglevel=3))

('werfen', 'VVFIN')
('werf', 'VVFIN')
('werf', [('wirf', 'VV_VAR'), ('t', 'SUF_FIN')], 'VVFIN')
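
Because the shape of the return value depends on taglevel, downstream code may want to normalize it. The following is a minimal sketch that only relies on the output shapes shown above; the helper name `unpack_analysis` and the dictionary keys are hypothetical and not part of HanTa.

```python
def unpack_analysis(result):
    """Normalize an analyze() result into a dict, whatever taglevel produced it.

    Hypothetical helper: it only assumes the result shapes printed above.
    """
    if isinstance(result, str):
        # taglevel=0: only the POS tag is returned
        return {"base": None, "morphemes": None, "pos": result}
    if len(result) == 2:
        # taglevel=1 (lemma, POS) or taglevel=2 (stem, POS)
        base, pos = result
        return {"base": base, "morphemes": None, "pos": pos}
    # taglevel=3: (stem, list of (morpheme, label) pairs, POS)
    base, morphemes, pos = result
    return {"base": base, "morphemes": morphemes, "pos": pos}


# Results copied from the outputs above
print(unpack_analysis('NN'))
print(unpack_analysis(('werfen', 'VVFIN')))
print(unpack_analysis(('werf', [('wirf', 'VV_VAR'), ('t', 'SUF_FIN')], 'VVFIN')))
```

With a loaded tagger, the same helper could be applied directly to the value returned by tagger.analyze.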


Using the parameter pos, we can force the tagger to give the most likely analysis for a given part of speech.

In [6]:
print(tagger.analyze('vertraute',taglevel=3,pos='VVFIN'))
print(tagger.analyze('vertraute',taglevel=3,pos='VVPP'))
print(tagger.analyze('vertraute',taglevel=3,pos='ADJA'))
print(tagger.analyze('vertraute',taglevel=3,pos='NN'))

('vertrau', [('ver', 'PREF_V'), ('trau', 'VV'), ('te', 'SUF_FIN')], 'VVFIN')
('vertrau', [('ver', 'PREF_V'), ('trau', 'VV'), ('te', 'SUF_PP')], 'VVPP')
('vertraut', [('ver', 'PREF_V'), ('trau', 'VV'), ('t', 'SUF_PP'), ('e', 'SUF_ADJ')], 'ADJA')
('vertraute', [('vertraute', 'NN')], 'NN')
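
The morpheme list of a taglevel=3 analysis can be turned into a readable segmentation string. This is a small sketch working on tuples copied from the output above; the function name `segmentation` and the separator are illustrative choices, not HanTa functionality.

```python
def segmentation(analysis, sep="+"):
    """Join the morphemes of a taglevel=3 analysis into one string."""
    _stem, morphemes, _pos = analysis
    return sep.join(morph for morph, _label in morphemes)


# Analyses copied from the output above
vvfin = ('vertrau', [('ver', 'PREF_V'), ('trau', 'VV'), ('te', 'SUF_FIN')], 'VVFIN')
adja = ('vertraut', [('ver', 'PREF_V'), ('trau', 'VV'), ('t', 'SUF_PP'), ('e', 'SUF_ADJ')], 'ADJA')

print(segmentation(vvfin))  # ver+trau+te
print(segmentation(adja))   # ver+trau+t+e
```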


## Tagging a word

With the method tag_word we can get the most probable POS-tags for a word:

In [7]:
tagger.tag_word('verdachte')

[('VVFIN', -16.45041338404915),
 ('ADJA', -17.756739025841874),
 ('NN', -18.556571435108527)]

The numbers are the natural logarithms of the probability that the given POS produces the word, as estimated by the underlying Hidden Markov Model. Here, e.g., the probability that a finite verb is realized by the word 'verdachte' is $e^{-16.45} \approx 7.2 \cdot 10^{-8}$.
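
To make the log values easier to read, they can be converted back to probabilities with the exponential function. The sketch below uses the three results shown above as literal data; the variable names are illustrative.

```python
import math

# (POS, log-probability) pairs copied from the tag_word output above
results = [('VVFIN', -16.45041338404915),
           ('ADJA', -17.756739025841874),
           ('NN', -18.556571435108527)]

# exp() undoes the natural logarithm
probs = {tag: math.exp(logprob) for tag, logprob in results}
for tag, p in probs.items():
    print(f"{tag}: {p:.2e}")

# A difference of log-probabilities corresponds to a ratio of probabilities
ratio = math.exp(results[0][1] - results[2][1])
print(f"VVFIN is about {ratio:.1f} times as likely to produce the word as NN")
```

Note that these are probabilities of the word given the POS, so they are not expected to sum to 1 across tags.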

Using the parameter cutoff, we can get more or fewer results. The cutoff gives the maximal difference between the logprob of the last result and that of the best result. The cutoff parameter does not apply to frequent words with cached analyses! The aim of the cutoff is to exclude impossible analyses. Each cached analysis, however, has been observed and is therefore possible.

In [8]:
print(tagger.tag_word('verdachte',cutoff=0))
print(tagger.tag_word('verdachte',cutoff=5))
print(tagger.tag_word('verdachte',cutoff=20))

[('VVFIN', -16.45041338404915)]
[('VVFIN', -16.45041338404915), ('ADJA', -17.756739025841874), ('NN', -18.556571435108527)]
[('VVFIN', -16.45041338404915), ('ADJA', -17.756739025841874), ('NN', -18.556571435108527), ('VVPP', -22.014090845844397), ('ADJD', -24.93130401111237), ('ADV', -27.35113055669061), ('VVIMP', -36.235118881764656)]
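
The effect of the cutoff can be reproduced on the numbers above. The following is a re-implementation sketch for illustration only, not HanTa's own code; in particular, the name `apply_cutoff` and the use of `<=` rather than `<` for the comparison are assumptions.

```python
def apply_cutoff(scored_tags, cutoff):
    """Keep the tags whose logprob is at most `cutoff` below the best one."""
    ranked = sorted(scored_tags, key=lambda item: item[1], reverse=True)
    if not ranked:
        return []
    best = ranked[0][1]
    return [(tag, logprob) for tag, logprob in ranked if best - logprob <= cutoff]


# All candidates, copied from the cutoff=20 output above
candidates = [('VVFIN', -16.45041338404915), ('ADJA', -17.756739025841874),
              ('NN', -18.556571435108527), ('VVPP', -22.014090845844397),
              ('ADJD', -24.93130401111237), ('ADV', -27.35113055669061),
              ('VVIMP', -36.235118881764656)]

for cutoff in (0, 5, 20):
    print(cutoff, [tag for tag, _ in apply_cutoff(candidates, cutoff)])
```

For cutoff=5, VVPP is dropped because it lies about 5.56 below the best logprob, which matches the output above.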


If the optional parameter casesensitive is set to True (the default value), uppercase is used to guess the most likely part of speech, mainly favouring noun readings over other possibilities.

In [9]:
tagger.tag_word('Verdachte',casesensitive=True,cutoff=10)

[('NN', -12.901501651584534), ('ADJA', -20.232237126454365)]

In [10]:
tagger.tag_word('Verdachte',casesensitive=False)

[('NN', -12.898008032529743),
 ('VVFIN', -16.449339998261664),
 ('ADJA', -17.675969432512957)]

In [11]:
tagger.tag_word('verdachte',casesensitive=True)

[('VVFIN', -16.45041338404915),
 ('ADJA', -17.756739025841874),
 ('NN', -18.556571435108527)]

In [12]:
tagger.tag_word('verdachte',casesensitive=False)

[('NN', -12.898008032529743),
 ('VVFIN', -16.449339998261664),
 ('ADJA', -17.675969432512957)]

## Analyzing sentences

The Hanover Tagger can also analyse a whole sentence at once. First, probabilities for each word and POS are computed. Then a trigram sentence model is used to disambiguate the tags and select the contextually most appropriate POS. Finally, the words are analysed again for the best POS, and the analysis for each word is given.

Here we can again use the parameters taglevel and casesensitive.
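
The idea behind the disambiguation step can be illustrated on a toy example. The sketch below is not HanTa's implementation: all probabilities are invented, the toy sentence ("die vertraute Stimme") is made up, and a brute-force search over all tag sequences is used in place of an efficient decoder such as Viterbi. It only shows how per-word scores and trigram scores over tags combine.

```python
from itertools import product

# Invented log P(word | tag) values for the toy sentence "die vertraute Stimme"
emissions = [
    {'ART': -1.0, 'PDS': -3.0},     # "die"
    {'ADJA': -2.0, 'VVFIN': -1.5},  # "vertraute"
    {'NN': -1.0},                   # "Stimme"
]

# Invented log P(tag | two previous tags); '<s>' pads the sentence start
trigrams = {
    ('<s>', '<s>', 'ART'): -0.5,
    ('<s>', '<s>', 'PDS'): -2.0,
    ('<s>', 'ART', 'ADJA'): -1.0,
    ('<s>', 'ART', 'VVFIN'): -6.0,
    ('<s>', 'PDS', 'ADJA'): -5.0,
    ('<s>', 'PDS', 'VVFIN'): -1.0,
    ('ART', 'ADJA', 'NN'): -0.5,
    ('ART', 'VVFIN', 'NN'): -4.0,
    ('PDS', 'ADJA', 'NN'): -3.0,
    ('PDS', 'VVFIN', 'NN'): -2.0,
}
UNSEEN = -20.0  # penalty for tag trigrams missing from the toy table


def score(tags):
    """Sum emission and trigram log-probabilities for one tag sequence."""
    total = 0.0
    history = ('<s>', '<s>')
    for position, tag in enumerate(tags):
        total += emissions[position][tag]
        total += trigrams.get(history + (tag,), UNSEEN)
        history = (history[1], tag)
    return total


# Brute force: score every combination of candidate tags and keep the best
best = max(product(*(e.keys() for e in emissions)), key=score)
print(best, score(best))  # ('ART', 'ADJA', 'NN') -6.0
```

Taken in isolation, the second word scores better as VVFIN in this toy table, but the tag context makes ADJA the better choice for the sentence as a whole.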

In [13]:
import nltk
from pprint import pprint

sent = "Die Europawahl in den Niederlanden findet immer donnerstags statt."

words = nltk.word_tokenize(sent)
lemmata = tagger.tag_sent(words,taglevel= 1)
pprint(lemmata)

[('Die', 'die', 'ART'),
 ('Europawahl', 'Europawahl', 'NN'),
 ('in', 'in', 'APPR'),
 ('den', 'den', 'ART'),
 ('Niederlanden', 'Niederlanden', 'NE'),
 ('findet', 'finden', 'VVFIN'),
 ('immer', 'immer', 'ADV'),
 ('donnerstags', 'donnerstags', 'ADV'),
 ('statt', 'statt', 'PTKVZ'),
 ('.', '--', '$.')]
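
The (word, lemma, POS) triples are easy to post-process, for example to keep only the lemmata of content words. The sketch below works on the output above, copied in as literal data; which tag prefixes count as content words is an assumption made for this example.

```python
# Output of tag_sent(words, taglevel=1), copied from above
tagged = [('Die', 'die', 'ART'),
          ('Europawahl', 'Europawahl', 'NN'),
          ('in', 'in', 'APPR'),
          ('den', 'den', 'ART'),
          ('Niederlanden', 'Niederlanden', 'NE'),
          ('findet', 'finden', 'VVFIN'),
          ('immer', 'immer', 'ADV'),
          ('donnerstags', 'donnerstags', 'ADV'),
          ('statt', 'statt', 'PTKVZ'),
          ('.', '--', '$.')]

# Assumed selection: nouns, names, full verbs, adjectives and adverbs
CONTENT_PREFIXES = ('NN', 'NE', 'VV', 'ADJ', 'ADV')

content_lemmas = [lemma for _word, lemma, pos in tagged
                  if pos.startswith(CONTENT_PREFIXES)]
print(content_lemmas)
# ['Europawahl', 'Niederlanden', 'finden', 'immer', 'donnerstags']
```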


In [14]:
sent = "Die Sozialdemokraten haben ersten Prognosen zufolge die Europawahl in den Niederlanden gewonnen."

words = nltk.word_tokenize(sent)
lemmata = tagger.tag_sent(words,taglevel = 3)
pprint(lemmata)

[('Die', 'die', [('die', 'ART')], 'ART'),
 ('Sozialdemokraten',
  'sozialdemokrat',
  [('sozial', 'NN'), ('demokrat', 'NN'), ('en', 'SUF_NN')],
  'NN'),
 ('haben', 'hab', [('hab', 'VA'), ('en', 'SUF_FIN')], 'VAFIN'),
 ('ersten', 'erster', [('erst', 'ADJ'), ('en', 'SUF_ADJ')], 'ADJA'),
 ('Prognosen', 'prognose', [('prognose', 'NN'), ('n', 'SUF_NN')], 'NN'),
 ('zufolge', 'zufolge', [('zufolge', 'APPO')], 'APPO'),
 ('die', 'die', [('die', 'ART')], 'ART'),
 ('Europawahl', 'europawahl', [('europawahl', 'NN')], 'NN'),
 ('in', 'in', [('in', 'APPR')], 'APPR'),
 ('den', 'den', [('den', 'ART')], 'ART'),
 ('Niederlanden', 'niederlanden', [('niederlanden', 'NE')], 'NE'),
 ('gewonnen',
  'gewinn',
  [('ge', 'PREF_PP'), ('wonn', 'VV_VAR_PP'), ('en', 'SUF_PP')],
  'VVPP'),
 ('.', '.', [('.', '$.')], '$.')]


In [15]:
sent = "Der palästinensische Schriftsteller Emil Habibi ist der einzige Autor im Nahen Osten, dessen Werk von allen Seiten größte und offizielle Anerkennung zuteil geworden ist."

words = nltk.word_tokenize(sent)
tags = tagger.tag_sent(words,taglevel= 0)
print(tags)

['ART', 'ADJA', 'NN', 'NE', 'NE', 'VAFIN', 'ART', 'ADJA', 'NN', 'APPRART', 'ADJA', 'NN', '$,', 'PRELAT', 'NN', 'APPR', 'PIAT', 'NN', 'ADJA', 'KON', 'ADJA', 'NN', 'PTKVZ', 'VAPP', 'VAFIN', '$.']
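
With taglevel=0 only the tags are returned, so they have to be paired with the tokens again if both are needed. The sketch below assumes the token list that the tokenizer would be expected to produce for the sentence above, written out by hand here so that the example runs without nltk or a model.

```python
# Tokens of the sentence above, written out by hand (assumed tokenizer output)
words = ['Der', 'palästinensische', 'Schriftsteller', 'Emil', 'Habibi', 'ist',
         'der', 'einzige', 'Autor', 'im', 'Nahen', 'Osten', ',', 'dessen',
         'Werk', 'von', 'allen', 'Seiten', 'größte', 'und', 'offizielle',
         'Anerkennung', 'zuteil', 'geworden', 'ist', '.']

# Tags copied from the output above
tags = ['ART', 'ADJA', 'NN', 'NE', 'NE', 'VAFIN', 'ART', 'ADJA', 'NN',
        'APPRART', 'ADJA', 'NN', '$,', 'PRELAT', 'NN', 'APPR', 'PIAT', 'NN',
        'ADJA', 'KON', 'ADJA', 'NN', 'PTKVZ', 'VAPP', 'VAFIN', '$.']

# zip() silently truncates, so check that both lists line up first
assert len(words) == len(tags)
pairs = list(zip(words, tags))

# Example: collect the tokens tagged as proper names
names = [word for word, tag in pairs if tag == 'NE']
print(names)  # ['Emil', 'Habibi']
```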
