Mistakes in the Dutch stemmer #1

gboer · 2013-03-18T10:40:56Z

I first want to thank everyone on the Snowball project for creating this software. It's great that we can use the software to build more sophisticated search capabilities for our users. However, when I was testing several Dutch words, I noticed there are actually quite a lot of mistakes. I'm not quite sure how to fix the problems in the Dutch stemmer, so I thought I'd mention them here and hope someone picks it up.

Not sure where to start, so I'll mention a couple that are incorrect. Each line gives the input word, then the stemmer's current output, then (after the arrow) the correct stem:

gevaren gevar -> gevaar
gevaar gevar -> gevaar
gevaarlijk gevar -> gevaarlijk
gevaarlijke gevar -> gevaarlijk
gevaarlijker gevaarlijker -> gevaarlijk
gevaarten gevaart -> gevaarte
gevallen gevall -> geval
geven gev -> geef
gevist gevist -> vis
gewasbescherming gewasbescherm -> gewasbescherming
gewassen gewass -> gewas
geweer gewer -> geweer
aanbellen aanbell -> bel aan (yes, Dutch is weird)
aandeel aandel -> aandeel
aaneen aanen -> aaneen (should really be excluded from stemming if possible, since there is no way that this word occurs in any other form)
aalmoezen aalmoez -> aalmoes
gangetje gangetj -> gang
gebaartje gebaartj -> gebaar

These are just a few, but there are quite a lot more. Should you need help verifying or testing the stemmer for the Dutch words, I'm happy to help :)


rboulton · 2013-03-27T12:47:22Z

Sorry not to have responded sooner; I'll try and take a look at this within the next week.

ojwb · 2014-12-09T00:35:34Z

@rboulton Did you manage to take a look at this?

As a general point, the aim of these stemmers is not to map their inputs to words in the same language, but rather to map different forms of the same word to the same string of characters (and forms of different words to different strings of characters). It just happens that in many cases the outputs are words in the same language.

So it isn't necessarily an error to be mapping gevaren and gevaar to gevar even if that isn't a word in Dutch. It is an error if other forms of gevaar get mapped to something else, or if unrelated words get mapped to gevar as well.
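The criterion above can be checked mechanically. The following is a minimal sketch, not part of Snowball: `conflation_errors` is a hypothetical helper, the stems in `REPORTED` are the outputs quoted in the opening post, and the grouping of forms under `gevaar` and `gevaarlijk` is an assumption made for illustration.

```python
from collections import defaultdict

def conflation_errors(stem, word_groups):
    """Check a stemmer against groups of forms that belong together.

    Returns (split, merged):
      split  - groups whose forms ended up with more than one stem
      merged - stems shared by forms from more than one group
    """
    split = {}
    stem_to_groups = defaultdict(set)
    for group, forms in word_groups.items():
        stems = {stem(form) for form in forms}
        if len(stems) > 1:
            split[group] = sorted(stems)
        for s in stems:
            stem_to_groups[s].add(group)
    merged = {s: sorted(g) for s, g in stem_to_groups.items() if len(g) > 1}
    return split, merged

# Stems as reported in the opening post of this thread (not re-verified here).
REPORTED = {
    "gevaar": "gevar",
    "gevaren": "gevar",
    "gevaarlijk": "gevar",
    "gevaarlijke": "gevar",
    "gevaarlijker": "gevaarlijker",
}

# Assumed grouping, chosen for illustration only.
GROUPS = {
    "gevaar": ["gevaar", "gevaren"],
    "gevaarlijk": ["gevaarlijk", "gevaarlijke", "gevaarlijker"],
}

split, merged = conflation_errors(REPORTED.__getitem__, GROUPS)
print("split groups:", split)
print("merged stems:", merged)
```

With this grouping, gevaar and gevaren sharing the stem gevar is not flagged, whereas gevaarlijker receiving a different stem from gevaarlijk is. Whether gevaar and gevaarlijk sharing a stem counts as an error depends on whether they are treated as forms of the same word.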

Sutharsan · 2015-01-12T09:11:33Z

I have the same experience with the 'Dutch' Snowball stemmer. Much better stemming is achieved using the 'Kraaij-Pohlmann' stemming algorithm (language="Kp"). The simplest improvement is to use this algorithm as the default for Dutch stemming.
See https://wiki.apache.org/solr/LanguageAnalysis#dutch

Sutharsan · 2015-01-12T20:20:51Z

I agree with @ojwb that it is not a problem if the stemmer does not map to existing words, as long as it does not map to an existing word with a different meaning. Here are a few examples comparing the "Dutch" language with the "Kp" language.

Using: <filter class="solr.SnowballPorterFilterFactory" language="Dutch" />

'adverteer' > 'adveter' (advertise, 1st person singular) > No stem, not an existing word, but Ok.
'adverteren' > 'adveter' (advertise, 1st person plural) > No stem, not an existing word, but Ok.
'geadverteerd' > 'geadverteerd' (advertised) > Same word, Not Ok.
'artikelen' > 'artikel' (articles, plural) > Stem, singular word, Ok
'artikeltje' > 'artikeltj' (small article) > Not an existing word, Not Ok.
'openbaar' > 'open' (public) > Existing word (open), "stem" is related, not sure if Ok.
'zaken' > 'zak' (business) > Existing word (bag), not related, Not Ok.
'gelezen' > 'gelez' (read, regular verb) > Not an existing word, Not Ok.
'gebroken' > 'gebrok' (broken, irregular verb) > Not an existing word, Not Ok.

Using: <filter class="solr.SnowballPorterFilterFactory" language="Kp" />

'adverteer' > 'adveteer' (advertise, 1st person singular) > Stem, Ok.
'adverteren' > 'adveteer' (advertise, 1st person plural) > Stem, Ok.
'geadverteerd' > 'adveteer' (advertised) > Stem, Ok.
'artikelen' > 'artikel' (articles, plural) > Stem, Ok.
'artikeltje' > 'artikel' (small article) > Stem, Ok.
'openbaar' > 'open' (public) > Existing word (open), "stem" is related, not sure if Ok.
'zaken' > 'zaak' (business) > Stem, Ok.
'gelezen' > 'lees' (read, regular verb) > Stem, Ok.
'gebroken' > 'brook' (broken, irregular verb) > Not an existing word, Not Ok.

Of course I have picked examples where the stemming fails, but I have found only one category where the "Kp" language fails: irregular verbs. Altogether, the "Kp" Kraaij-Pohlmann algorithm is a much better stemmer than the obvious choice of the "Dutch" language. Instead of fixing the Dutch stemmer, I recommend replacing it with the "Kp" stemmer.

fishruti · 2018-06-18T14:41:46Z

@Sutharsan Thanks for your input. Could you please tell me how I could set the language to 'kp' while using the Python implementation of KPSS? In which language are you doing it here?
Thank you.

Sutharsan · 2018-06-18T14:46:29Z

I'm using the stemmer as part of a Solr configuration; that is where <filter class="solr.SnowballPorterFilterFactory" language="Kp" /> originates from.

istepaniuk · 2018-12-14T09:16:12Z

I am not a good Dutch speaker but I can see from the examples that some plural nouns are not stemmed correctly in the example diff for Dutch.

For example, I understand that acties should become actie, but it remains unchanged. The same occurs for all other nouns of that same form, such as conclusies, condities, etc.

It would seem that this rule is missing entirely.

MPParsley · 2019-01-03T10:15:11Z

I have the same experience with the 'Dutch' Snowball stemmer. Much better stemming is achieved using the 'Kraaij-Pohlmann' stemming algorithm (language="Kp"). The simplest improvement is to use this algorithm as the default for Dutch stemming.

@Sutharsan, I opened an issue in the Search API Solr queue to move to Kraaij-Pohlmann.

ojwb · 2019-10-14T04:38:56Z

I brought this matter up on the list recently:

https://lists.tartarus.org/pipermail/snowball-discuss/2019-October/001658.html

The history here is that Martin implemented dutch.sbl (and devised the algorithm for it). He also implemented kraaij_pohlmann.sbl from a paper and C implementation by Kraaij and Pohlmann (the paper only contains a partial description). When Snowball was quite new, Martin implemented several existing stemming algorithms in it (also Lovins' English stemmer and Schinke's Latin stemmer) as a way to demonstrate that the language was flexible enough to implement any algorithmic stemmer. At least for kraaij_pohlmann.sbl, though, he didn't worry about matching every detail of the behaviour: essentially it's a proof-of-concept implementation rather than something intended for the wider use it now seems to be getting.

Helpfully Martin managed to find a copy of the original C implementation from Kraaij and Pohlmann, which means we can look at the discrepancies between that and kraaij_pohlmann.sbl and decide which are worth addressing.

https://snowballstem.org/algorithms/kraaij_pohlmann/stemmer.html claims "in the demonstration vocabulary only 32 words out of over 45,000 stem differently" and goes on to list them (and a significant number appear to be non-Dutch words) but if I attempt to repeat that comparison I get 220 differences.

One obvious difference from looking at the sources is that the C version includes vowels with accents whereas the Snowball version only considers unaccented vowels - a quick attempt to copy that in Snowball reduced the differences from 220 to 153.
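As a rough illustration of this kind of comparison, here is a minimal sketch. It does not run either real implementation: `stem_differences` is a hypothetical helper, and the two stemmers are toy stand-ins (one rule, differing only in which characters count as vowels) rather than anything resembling the actual Kraaij-Pohlmann rules.

```python
def stem_differences(words, stem_a, stem_b):
    """Return (word, stem_a(word), stem_b(word)) for each word the stemmers disagree on."""
    diffs = []
    for word in words:
        a, b = stem_a(word), stem_b(word)
        if a != b:
            diffs.append((word, a, b))
    return diffs

PLAIN_VOWELS = set("aeiouy")
ACCENTED_VOWELS = PLAIN_VOWELS | set("áéíóúàèëïöü")

def make_toy_stemmer(vowels):
    """Toy rule for illustration only: drop a final 's' that follows a vowel."""
    def stem(word):
        if len(word) > 2 and word.endswith("s") and word[-2] in vowels:
            return word[:-1]
        return word
    return stem

# Stand-ins for the two implementations being compared.
unaccented_only = make_toy_stemmer(PLAIN_VOWELS)
with_accents = make_toy_stemmer(ACCENTED_VOWELS)

vocabulary = ["autos", "cafés", "boek"]
diffs = stem_differences(vocabulary, unaccented_only, with_accents)
for word, a, b in diffs:
    print(f"{word}: {a} vs {b}")
```

In practice the two stem functions would wrap the original C program and the compiled Snowball stemmer, and `vocabulary` would be read from a word list; the length of `diffs` is then the count of differences discussed above.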

I'll see if I can usefully summarise the differences so people can easily take a look.

If anyone knows of a good quality Dutch word list, that might be useful - dutch/voc.txt seems to have a significant amount of non-Dutch, which is rather unhelpful in this instance.

MPParsley · 2019-12-21T18:49:31Z

@ojwb, have a look at https://github.com/nielsbom/Hangman for a list of Dutch nouns.
