-
Notifications
You must be signed in to change notification settings - Fork 174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Greek stemmer #44
Conversation
algorithms/greek/stem_Unicode.sbl
Outdated
([substring] atlimit among ( | ||
'{m}' '{p}' '{a}{p}' '{a}{r}' '{i}{d}' '{k}{t}' '{s}{k}' '{s}{ch}' '{u}{ps}' '{f}{a}' '{ch}{r}' '{ch}{t}' '{a}{k}{t}' | ||
'{a}{o}{r}' '{a}{s}{ch}' '{a}{t}{a}' '{a}{ch}{n}' '{a}{ch}{t}' '{g}{e}{m}' '{g}{u}{r}' '{e}{m}{p}' '{e}{u}{p}' '{e}{ch}{th}' '{i}{f}{a}' | ||
'Ή{f}{a}' '{k}{a}{th}' '{k}{a}{k}' '{k}{u}{l}' '{l}{u}{g}' '{m}{a}{k}' '{m}{e}{g}' '{t}{a}{ch}' '{f}{y}{l}' '{ch}{oo}{r}' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'Ή{f}{a}'
should be removed. It would be escaped as '{I'}{f}{a}'
, but tolower
converts that to '{i}{f}{a}'
, which is already present in this list.
algorithms/greek/stem_Unicode.sbl
Outdated
do unset test1 | ||
[substring] atlimit among ( | ||
'{s}' '{ch}' | ||
(-> s <- 'A{r}{a}{k}' insert s) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A
should be {a}
.
algorithms/greek/stem_Unicode.sbl
Outdated
'{a}{y}{f}{n}' '{y}{r}' '{o}{l}{o}' '{ps}{a}{l}' (-> s <- '{y}{d}' insert s) | ||
)) or | ||
([substring] among ( | ||
'{e}' '{p}{a}{y}{ch}{n}' (-> s <- 'I{z}' insert s) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I
should be {y}
.
According to Saroukos 2008, the inserted suffix is “ΙΔ”, not “ΙΖ”, so should this be {y}{d}
?
@dscorbett thanks for comments! I think I've addressed them. |
algorithms/greek.sbl
Outdated
'{p}{a}{t}{e}{r}' '{p}' '{s}{k}' '{t}{o}{s}' '{t}{r}{y}{p}{o}{l}' | ||
(-> s <- '{y}{t}{s}' insert s) | ||
)) or | ||
(['{k}{o}{r}'] -> s <- '{y}{z}' insert s) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the original algorithm, the added suffix is “ΙΤΣ”.
algorithms/greek.sbl
Outdated
backwardmode ( | ||
define has_min_length as ( | ||
$length = len | ||
$length >= 3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can test the length directly using $(len >= 3)
, so the length
variable is not necessary. (This syntax is a recent addition to Snowball.)
algorithms/greek.sbl
Outdated
|
||
strings ( s ) | ||
|
||
integers ( length ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
algorithms/greek.sbl:109: warning: integer 'length' defined but not used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should have checked that. Fixed now.
Thank you for this work, when merged. it will make a milestone addition for Greek language applications. |
Thanks, merged. And thanks to @dscorbett for the reviews. Sorry this took a while - I actually started to work on merging it not long after it was submitted (the original version of the testdata has been in the snowball-data repo since 2016-07-12) but then you pushed some further changes so I stop reviewing until the branch had stabilised, but failed to pick it up again once it had. Please ping if a similar situation occurs again. I'd encourage you to submit a write up of this algorithm for the website - something like the references from the first comment here plus notes of any points of departure and the reasons behind them would be fine. e.g. here's what I wrote for the Indonesian stemmer implementation: https://github.com/snowballstem/snowball-website/blob/master/algorithms/indonesian/stemmer.tt |
@@ -16,6 +16,7 @@ english UTF_8,ISO_8859_1 english,en,eng | |||
finnish UTF_8,ISO_8859_1 finnish,fi,fin | |||
french UTF_8,ISO_8859_1 french,fr,fre,fra | |||
german UTF_8,ISO_8859_1 german,de,ger,deu | |||
greek UTF_8 greek,gr,gre |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gr
here should be el
- I've pushed a fix in e063dd8 and also added ell
, since we support other ISO 639-2T codes (e.g. fra
, deu
).
FYI, this stemmer uses |
Based on the algorithm described in http://sais.se/mthprize/2007/ntais2007.pdf
and http://tampub.uta.fi/bitstream/handle/10024/80480/gradu03463.pdf
Despite of some shortcoming (such as not handling diacritics) it reaches a good accuracy on the Universal Dependencies corpus.
Test data is in another PR: snowballstem/snowball-data#5