Issues with numbers and alphanumeric codes #81
We now define "consonant" more tightly than just "not a vowel", which in particular means alphanumeric codes ending in a double digit (e.g. 0x0e00, hal9000, space1999) are no longer mangled. See #81.
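To make the difference between "not a vowel" and a tighter consonant definition concrete, here is a minimal toy sketch. It is not the Snowball implementation: the function names, the vowel list and the consonant set are assumptions chosen for illustration, and it ignores region limits that a real stemmer would apply. It only shows one way a trailing double digit can be mistaken for a doubled consonant.

```python
# Toy illustration only: this is NOT the real Snowball stemmer code.
# Assumed vowel list and consonant set, chosen for the example.
VOWELS = set("aeiouyæåø")
CONSONANTS = set("abcdefghijklmnopqrstuvwxyz") - VOWELS


def undouble_loose(word):
    """Drop one of a trailing doubled character if it is merely 'not a vowel'."""
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in VOWELS:
        return word[:-1]
    return word


def undouble_strict(word):
    """Drop one of a trailing doubled character only if it is a listed consonant."""
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] in CONSONANTS:
        return word[:-1]
    return word


# "bakk" is a made-up word ending in a doubled consonant.
for token in ["0x0e00", "hal9000", "space1999", "bakk"]:
    print(token, undouble_loose(token), undouble_strict(token))
```

With the loose check, digits count as "not a vowel", so the codes lose a trailing digit. With the strict check, only the made-up word is shortened.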
I've just pushed a change for Danish which is similar to that applied for Finnish. I've reviewed the other algorithms which use Some hex codes containing digits can get altered in some cases by the other stemmers, but hex codes not containing digits are indistinguishable from words and some are words which ought to be stemmed - e.g. I think all that's left to do here is add a generic test that checks the stemmers don't damage numbers and any other cases that we think should be left alone, and to add notes of advice such as "don't damage numbers" and "use |
Since the fix to improve handling of numbers and alphanumeric codes in snowballstem/snowball#81, this no longer provides a regression test, and it causes iconv errors when running the tests for Danish in ISO 8859-1:

iconv: illegal input sequence at position 213041
iconv: illegal input sequence at position 174917

It would be good to have a regression test for this, but we'll have to do it a different way.

This reverts commit ef66b48.
This has led to issues with stemmers damaging numbers and numeric codes, see: snowballstem/snowball#81
In #66, it was reported that the Finnish stemmer damaged numbers (e.g. `2000` -> `200`). This is now fixed, but there are similar (though more subtle) issues elsewhere.

E.g. the Danish stemmer damages alphanumeric codes where the initial alpha part meets certain criteria - e.g. `space1999` -> `space199`, `hal9000` -> `hal900`, `0x0e00` -> `0x0e0`. These are significantly less problematic in practice, as in many cases there probably isn't any unwanted conflation (there isn't a "space199" or a "hal900") and the stem is really just an opaque internal token in typical usage. It's more likely to be a problem for cases such as hex codes in error messages, inventory codes, etc., and I think it is worth addressing in a similar way to the fix for Finnish - i.e. by replacing a `non-v` check with a `c` check where `c` is all the letters in Danish except those in `v`.

We also should review the other algorithms where `non` is used (dutch, english, french, german, hungarian, indonesian, irish, italian, kraaij_pohlmann, lithuanian, norwegian, porter, portuguese, romanian, russian, spanish, swedish, turkish), and add a generic test that feeds some numbers and alphanumeric codes into the stemmer and checks they aren't changed, to help ensure there aren't other issues of this sort now or in the future.
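The generic test proposed above could look roughly like the following sketch. Everything here is hypothetical scaffolding rather than the project's actual test harness: `find_damaged`, `SAMPLE_TOKENS`, and the two stand-in stemmers are names invented for the example, and the token list is just the codes from this issue plus two made-up ones.

```python
def find_damaged(stem, tokens):
    """Return (token, stemmed) pairs for every token the stemmer altered."""
    damaged = []
    for token in tokens:
        stemmed = stem(token)
        if stemmed != token:
            damaged.append((token, stemmed))
    return damaged


# Illustrative inputs: examples from this issue plus two made-up codes.
SAMPLE_TOKENS = ["2000", "0x0e00", "hal9000", "space1999", "42", "a1b2"]


def lossy_stem(word):
    """Hypothetical stand-in stemmer that drops one of a trailing doubled character."""
    if len(word) >= 2 and word[-1] == word[-2]:
        return word[:-1]
    return word


def identity_stem(word):
    """Hypothetical stand-in stemmer that leaves every token alone."""
    return word


# A well-behaved stemmer yields an empty list; a lossy one lists the damage.
print(find_damaged(identity_stem, SAMPLE_TOKENS))
print(find_damaged(lossy_stem, SAMPLE_TOKENS))
```

Any callable mapping a word to its stem can be passed as `stem`. For instance, assuming the `snowballstemmer` Python package is installed, `snowballstemmer.stemmer("danish").stemWord` should fit, and the check could be looped over each language of interest.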