Issues with numbers and alphanumeric codes #81

Closed
ojwb opened this issue Jun 11, 2018 · 1 comment

@ojwb
Member

ojwb commented Jun 11, 2018

In #66, it was reported that the Finnish stemmer damaged numbers (e.g. 2000 -> 200). This is now fixed, but there are similar (though more subtle) issues elsewhere.

For example, the Danish stemmer damages alphanumeric codes whose initial alphabetic part meets certain criteria: space1999 -> space199, hal9000 -> hal900, 0x0e00 -> 0x0e0. These are significantly less problematic in practice, since in many cases there probably isn't any unwanted conflation (there isn't a "space199" or a "hal900") and the stem is really just an opaque internal token in typical usage. It's more likely to be a problem for things like hex codes in error messages or inventory codes, and I think it is worth addressing in a similar way to the fix for Finnish, i.e. by replacing a non-v check with a c check, where c is all the letters in Danish except those in v.
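To make the difference between a non-v check and a c check concrete, here is a minimal Python sketch. It is not the Danish algorithm: it is a toy "remove one of a trailing doubled character" rule that ignores the region restriction mentioned above, and the function names and the contents of CONSONANTS are assumptions made for illustration.

```python
# Toy illustration only; this is NOT the real Danish stemmer.
# VOWELS stands in for the grouping called v; CONSONANTS stands in for c.
VOWELS = set("aeiouyæåø")
# Assumed consonant set: the lowercase ASCII letters that are not in VOWELS.
CONSONANTS = set("bcdfghjklmnpqrstvwxz")


def undouble_non_vowel(word):
    """Drop the last character if the word ends in a doubled non-vowel.

    "Non-vowel" matches anything outside VOWELS, including digits.
    """
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in VOWELS:
        return word[:-1]
    return word


def undouble_consonant(word):
    """Same rule, but the doubled character must be an explicit consonant."""
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] in CONSONANTS:
        return word[:-1]
    return word


if __name__ == "__main__":
    for token in ["space1999", "hal9000", "0x0e00", "hopp"]:
        print(token, undouble_non_vowel(token), undouble_consonant(token))
```

With the negated check, a trailing "99" or "00" counts as a doubled non-vowel and loses a digit; with the explicit consonant set, digits never match, while a doubled letter such as the "pp" in the made-up token "hopp" is still reduced.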

We should also review the other algorithms where non is used (dutch, english, french, german, hungarian, indonesian, irish, italian, kraaij_pohlmann, lithuanian, norwegian, porter, portuguese, romanian, russian, spanish, swedish, turkish), and add a generic test that feeds some numbers and alphanumeric codes into each stemmer and checks they aren't changed, to help ensure there aren't other issues of this sort now or in the future.
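One possible shape for such a generic test, sketched in Python. It assumes only that a stemmer can be called as a function from string to string; find_damaged and SAMPLE_TOKENS are hypothetical names, and the token list simply reuses the examples from this issue rather than any real test data.

```python
# Hypothetical harness: `stem` is any callable mapping a token to its stem.
SAMPLE_TOKENS = ["2000", "1999", "0x0e00", "hal9000", "space1999"]


def find_damaged(stem, tokens=SAMPLE_TOKENS):
    """Return (token, stem) pairs for every token the stemmer changed."""
    damaged = []
    for token in tokens:
        result = stem(token)
        if result != token:
            damaged.append((token, result))
    return damaged


if __name__ == "__main__":
    # Stand-in stemmer with a deliberate bug, used only to show the report.
    def buggy_stem(word):
        return word[:-1] if word.endswith("00") else word

    for token, result in find_damaged(buggy_stem):
        print(f"damaged: {token} -> {result}")
```

Running the same token list through every algorithm, and failing the test when the returned list is non-empty, would flag any stemmer that alters a number or code.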

ojwb added a commit to snowballstem/snowball-data that referenced this issue Nov 14, 2018
ojwb added a commit that referenced this issue Nov 14, 2018
We now define "consonant" more tightly than just "not a vowel",
which in particular means alphanumeric codes ending in a double digit
(e.g. 0x0e00, hal9000, space1999) are no longer mangled.

See #81.
@ojwb
Member Author

ojwb commented Nov 14, 2018

I've just pushed a change for Danish which is similar to that applied for Finnish.

I've reviewed the other algorithms which use non, and none of those uses appears to result in a character matched by the negated grouping being deleted or replaced, which seems to be the essence of why this was particularly problematic in the Finnish and (to a lesser extent) Danish stemmers.

Some hex codes containing digits can still be altered by the other stemmers, but hex codes not containing digits are indistinguishable from words, and some are words which ought to be stemmed (e.g. defaced in English), so that's not really something I think we want to try to solve.
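A small Python sketch of why digit-free hex codes can't be singled out: a token made only of the letters a-f passes any purely character-based "is this a word?" check. The helper names here are hypothetical.

```python
HEX_LETTERS = set("abcdef")


def is_digit_free_hex(token):
    """True if the token consists only of the hex letters a-f."""
    return bool(token) and all(ch in HEX_LETTERS for ch in token.lower())


def contains_digit(token):
    """True if at least one character of the token is a digit."""
    return any(ch.isdigit() for ch in token)


if __name__ == "__main__":
    for token in ["defaced", "0x0e00", "stemmer"]:
        print(token, is_digit_free_hex(token), contains_digit(token))
```

Here "defaced" is both a valid hex string and an ordinary English word, so a stemmer has no character-level way to tell which one it was given, whereas "0x0e00" is at least recognisable by its digits.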

I think all that's left to do here is add a generic test that checks the stemmers don't damage numbers and any other cases that we think should be left alone, and to add notes of advice such as "don't damage numbers" and "use non with care" to the website docs.

ojwb added a commit to snowballstem/snowball-website that referenced this issue Nov 14, 2018
ojwb added a commit to snowballstem/snowball-data that referenced this issue Jul 2, 2019
Since the fix to improve handling of numbers and alphanumeric codes in
snowballstem/snowball#81 this no longer
provides a regression test, and it causes iconv errors when running
tests for danish in ISO 8859-1:

iconv: illegal input sequence at position 213041
iconv: illegal input sequence at position 174917

It would be good to have a regression test for this, but we'll have to
do it a different way.

This reverts commit ef66b48.
@ojwb ojwb closed this as completed in 59da79d Sep 6, 2021
ojwb added a commit to snowballstem/snowball-website that referenced this issue Sep 6, 2021
This has led to issues with stemmers damaging numbers and numeric
codes, see: snowballstem/snowball#81