Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Greek stemmer #44

Closed
wants to merge 11 commits into from
Closed

Add Greek stemmer #44

wants to merge 11 commits into from

Conversation

master
Copy link
Contributor

@master master commented Aug 2, 2016

Based on the algorithm described in http://sais.se/mthprize/2007/ntais2007.pdf
and http://tampub.uta.fi/bitstream/handle/10024/80480/gradu03463.pdf

Despite of some shortcoming (such as not handling diacritics) it reaches a good accuracy on the Universal Dependencies corpus.

Test data is in another PR: snowballstem/snowball-data#5

([substring] atlimit among (
'{m}' '{p}' '{a}{p}' '{a}{r}' '{i}{d}' '{k}{t}' '{s}{k}' '{s}{ch}' '{u}{ps}' '{f}{a}' '{ch}{r}' '{ch}{t}' '{a}{k}{t}'
'{a}{o}{r}' '{a}{s}{ch}' '{a}{t}{a}' '{a}{ch}{n}' '{a}{ch}{t}' '{g}{e}{m}' '{g}{u}{r}' '{e}{m}{p}' '{e}{u}{p}' '{e}{ch}{th}' '{i}{f}{a}'
'Ή{f}{a}' '{k}{a}{th}' '{k}{a}{k}' '{k}{u}{l}' '{l}{u}{g}' '{m}{a}{k}' '{m}{e}{g}' '{t}{a}{ch}' '{f}{y}{l}' '{ch}{oo}{r}'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'Ή{f}{a}' should be removed. It would be escaped as '{I'}{f}{a}', but tolower converts that to '{i}{f}{a}', which is already present in this list.

do unset test1
[substring] atlimit among (
'{s}' '{ch}'
(-> s <- 'A{r}{a}{k}' insert s)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A should be {a}.

'{a}{y}{f}{n}' '{y}{r}' '{o}{l}{o}' '{ps}{a}{l}' (-> s <- '{y}{d}' insert s)
)) or
([substring] among (
'{e}' '{p}{a}{y}{ch}{n}' (-> s <- 'I{z}' insert s)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should be {y}.

According to Saroukos 2008, the inserted suffix is “ΙΔ”, not “ΙΖ”, so should this be {y}{d}?

@master
Copy link
Contributor Author

master commented Jul 16, 2018

@dscorbett thanks for comments! I think I've addressed them.

'{p}{a}{t}{e}{r}' '{p}' '{s}{k}' '{t}{o}{s}' '{t}{r}{y}{p}{o}{l}'
(-> s <- '{y}{t}{s}' insert s)
)) or
(['{k}{o}{r}'] -> s <- '{y}{z}' insert s)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the original algorithm, the added suffix is “ΙΤΣ”.

backwardmode (
define has_min_length as (
$length = len
$length >= 3
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can test the length directly using $(len >= 3), so the length variable is not necessary. (This syntax is a recent addition to Snowball.)


strings ( s )

integers ( length )
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

algorithms/greek.sbl:109: warning: integer 'length' defined but not used

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should have checked that. Fixed now.

@raratiru
Copy link

Thank you for this work, when merged. it will make a milestone addition for Greek language applications.

@ojwb ojwb closed this in c64169d Oct 29, 2018
@ojwb
Copy link
Member

ojwb commented Oct 29, 2018

Thanks, merged. And thanks to @dscorbett for the reviews.

Sorry this took a while - I actually started to work on merging it not long after it was submitted (the original version of the testdata has been in the snowball-data repo since 2016-07-12) but then you pushed some further changes so I stop reviewing until the branch had stabilised, but failed to pick it up again once it had. Please ping if a similar situation occurs again.

I'd encourage you to submit a write up of this algorithm for the website - something like the references from the first comment here plus notes of any points of departure and the reasons behind them would be fine. e.g. here's what I wrote for the Indonesian stemmer implementation: https://github.com/snowballstem/snowball-website/blob/master/algorithms/indonesian/stemmer.tt

@@ -16,6 +16,7 @@ english UTF_8,ISO_8859_1 english,en,eng
finnish UTF_8,ISO_8859_1 finnish,fi,fin
french UTF_8,ISO_8859_1 french,fr,fre,fra
german UTF_8,ISO_8859_1 german,de,ger,deu
greek UTF_8 greek,gr,gre
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gr here should be el - I've pushed a fix in e063dd8 and also added ell, since we support other ISO 639-2T codes (e.g. fra, deu).

@ojwb
Copy link
Member

ojwb commented Nov 20, 2018

FYI, this stemmer uses do unset test1 a number of times - here the do is completely redundant since unset can't give signal f and can't change the cursor position. I've pushed 7c9d05b to eliminate do in such cases. It doesn't affect the generated code (for C at least, and probably not for other languages) but it reduces the size of the code a bit and may avoid puzzling people who wonder why it's there.

a16bitsysop added a commit to a16bitsysop/snowball that referenced this pull request Dec 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants