Issues with numbers and alphanumeric codes #81

ojwb · 2018-06-11T05:27:10Z

In #66, it was reported that the Finnish stemmer damaged numbers (e.g. 2000 -> 200). This is now fixed, but there are similar (though more subtle) issues elsewhere.

For example, the Danish stemmer damages alphanumeric codes where the initial alpha part meets certain criteria, e.g. space1999 -> space199, hal9000 -> hal900, 0x0e00 -> 0x0e0. These are significantly less problematic in practice, as in many cases there probably isn't any unwanted conflation (there isn't a "space199" or a "hal900") and the stem is really just an opaque internal token in typical usage. It's more likely to be a problem for cases such as hex codes in error messages, inventory codes, etc., and I think it is worth addressing in a similar way to the fix for Finnish, i.e. by replacing a non-v check with a c check, where c is all the letters in Danish except those in v.
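
As a rough illustration of why "not a vowel" and "a consonant" differ here, the following Python sketch models only a simplified "drop a doubled final character" step. It is not the actual Snowball source: the function names are hypothetical, the region limits the real stemmer applies are ignored, and the vowel and consonant sets are assumptions based on the description above.

```python
# Assumed groupings: v is the Danish vowels, c is the remaining Danish letters.
VOWELS = set("aeiouyæåø")
CONSONANTS = set("bcdfghjklmnpqrstvwxz")


def undouble_non_vowel(word):
    """Drop the final character if it is 'not a vowel' and is doubled.

    Digits are 'not a vowel', so a trailing '00' or '99' is shortened too.
    """
    if len(word) >= 2 and word[-1] not in VOWELS and word[-1] == word[-2]:
        return word[:-1]
    return word


def undouble_consonant(word):
    """Drop the final character only if it is a doubled consonant letter."""
    if len(word) >= 2 and word[-1] in CONSONANTS and word[-1] == word[-2]:
        return word[:-1]
    return word


for code in ["space1999", "hal9000", "0x0e00"]:
    print(code, undouble_non_vowel(code), undouble_consonant(code))
```

With the negated check the three codes come out as space199, hal900 and 0x0e0, while the explicit consonant check leaves them alone. A doubled consonant letter (e.g. the made-up input "takk") is still shortened by both versions.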

We should also review the other algorithms where non is used (dutch, english, french, german, hungarian, indonesian, irish, italian, kraaij_pohlmann, lithuanian, norwegian, porter, portuguese, romanian, russian, spanish, swedish, turkish), and add a generic test that feeds some numbers and alphanumeric codes into the stemmer and checks they aren't changed, to help ensure there aren't other issues of this sort now or in the future.
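
A generic test along those lines could look something like the sketch below. It assumes the stemmer is available as a plain callable taking and returning a string. `find_damaged`, `SAMPLES` and the toy `buggy_stem` are hypothetical names for illustration, not part of Snowball or its test suite.

```python
def find_damaged(stem, samples):
    """Return (input, output) pairs for every sample the stemmer changed."""
    damaged = []
    for sample in samples:
        out = stem(sample)
        if out != sample:
            damaged.append((sample, out))
    return damaged


# Numbers and alphanumeric codes that a stemmer should leave alone.
SAMPLES = ["2000", "1999", "hal9000", "space1999", "0x0e00"]


def buggy_stem(word):
    """Toy stand-in for a stemmer: drops a doubled final non-vowel."""
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiouyæåø":
        return word[:-1]
    return word


for original, stemmed in find_damaged(buggy_stem, SAMPLES):
    print(f"damaged: {original} -> {stemmed}")
```

Running the same check against each language's stemmer, and failing if the returned list is non-empty, would catch this class of problem without needing per-language expected output.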

See snowballstem/snowball#81

We now define "consonant" more tightly than just "not a vowel", which in particular means alphanumeric codes ending in a double digit (e.g. 0x0e00, hal9000, space1999) are no longer mangled. See #81.

ojwb · 2018-11-14T23:02:46Z

I've just pushed a change for Danish which is similar to that applied for Finnish.

I've reviewed the other algorithms which use non, and none of those uses appears to result in a character matched by the negated grouping being deleted or replaced, which seems to be the essence of why this was particularly problematic in the Finnish and (to a lesser extent) Danish stemmers.

Some hex codes containing digits can get altered in some cases by the other stemmers, but hex codes not containing digits are indistinguishable from words and some are words which ought to be stemmed - e.g. defaced in English, so that's not really something I think we want to try to solve.
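
To make the ambiguity concrete, here is a small Python sketch (the helper names are hypothetical, and prefixes such as 0x are deliberately not handled) showing that a token made only of hex digits cannot be told apart from a word unless it also contains a decimal digit:

```python
HEX_DIGITS = set("0123456789abcdef")


def is_hex_token(token):
    """True if every character of the token is a hexadecimal digit."""
    lowered = token.lower()
    return bool(lowered) and set(lowered) <= HEX_DIGITS


def has_digit(token):
    """True if the token contains at least one decimal digit."""
    return any(ch.isdigit() for ch in token)


for token in ["defaced", "0e00", "space"]:
    print(token, is_hex_token(token), has_digit(token))
```

Here "defaced" is valid hex but has no digit, so nothing distinguishes it from the English word, whereas "0e00" contains digits and could at least be recognised as a code.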

I think all that's left to do here is add a generic test that checks the stemmers don't damage numbers and any other cases that we think should be left alone, and to add notes of advice such as "don't damage numbers" and "use non with care" to the website docs.

See snowballstem/snowball#81

Since the fix to improve handling of numbers and alphanumeric codes in snowballstem/snowball#81 this no longer provides a regression test, and it causes iconv errors when running tests for danish in ISO 8859-1:

iconv: illegal input sequence at position 213041
iconv: illegal input sequence at position 174917

It would be good to have a regression test for this, but we'll have to do it a different way.

This reverts commit ef66b48.

This has led to issues with stemmers damaging numbers and numeric codes, see: snowballstem/snowball#81

ojwb added a commit to snowballstem/snowball-data that referenced this issue Nov 14, 2018

Update Danish test data for alphanumeric mangling fix

2d7b98a

See snowballstem/snowball#81

ojwb added a commit to snowballstem/snowball-website that referenced this issue Nov 14, 2018

Update Danish stemmer for alphanumeric change

306f4e0

See snowballstem/snowball#81

ojwb closed this as completed in 59da79d Sep 6, 2021

ojwb added a commit to snowballstem/snowball-website that referenced this issue Sep 6, 2021

Add warning about avoiding non followed by delete

a6bb5aa

This has led to issues with stemmers damaging numbers and numeric codes, see: snowballstem/snowball#81
