Lemmatization does not appear to be working for Indonesian (GSD). #1003

Closed
xavier-taylor opened this issue Apr 18, 2022 · 5 comments

xavier-taylor commented Apr 18, 2022

Describe the bug
Lemmatization does not appear to be working for Indonesian.

To Reproduce
```python
import stanza

nlp = stanza.Pipeline(lang='id', processors='tokenize,pos,lemma,depparse')
doc = nlp.process('Ia menjadi Gubernur Bali menggantikan Anak Agung Bagus Sutedja.')
```

Output:

word: Ia lemma: ia
word: menjadi lemma: menjadi
word: Gubernur lemma: gubernur
word: Bali lemma: bali
word: menggantikan lemma: mengantikan
word: Anak lemma: anak
word: Agung lemma: agung
word: Bagus lemma: bagus
word: Sutedja lemma: sutedja
word: . lemma: .
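
The snippet above does not show how the word/lemma lines were printed. A minimal sketch of one way to produce that format follows. The helper name `format_lemmas` is made up for illustration, and the stand-in document only mimics the `sentences` / `words` / `text` / `lemma` attributes used later in this thread, so no models need to be downloaded to run it.

```python
from types import SimpleNamespace


def format_lemmas(doc):
    """Return one 'word: ... lemma: ...' line per word in a parsed document.

    Only relies on doc.sentences, sentence.words, word.text and word.lemma.
    """
    return [
        f"word: {word.text} lemma: {word.lemma}"
        for sentence in doc.sentences
        for word in sentence.words
    ]


# Stand-in document (an assumption for this sketch, not a real Stanza object).
fake_doc = SimpleNamespace(
    sentences=[
        SimpleNamespace(
            words=[
                SimpleNamespace(text="menjadi", lemma="menjadi"),
                SimpleNamespace(text="menggantikan", lemma="mengantikan"),
            ]
        )
    ]
)

for line in format_lemmas(fake_doc):
    print(line)
```

With a real pipeline, the `doc` returned by `nlp.process(...)` would presumably be passed in place of `fake_doc`.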

Expected behavior

I am not an expert on Indonesian, but I am studying it now and am confident about the following:

In Indonesian many words are built by adding affixes to lemmas.

For example

menjadi is men + jadi. Jadi is the lemma.
menggantikan is meng + ganti + kan. Ganti is the lemma.

In all example sentences I have tried, the words are not being correctly lemmatized.

So I expect, for words with lemmas like these two examples, to get output like:

word: menjadi lemma: jadi
word: menggantikan lemma: ganti
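
To make the affix idea concrete, here is a deliberately naive sketch that strips the prefixes and the suffix from the two examples above. The affix lists and the function name `strip_affixes` are assumptions chosen for this illustration only. This is not how Stanza lemmatizes, and real Indonesian morphology needs far more than this.

```python
# Toy affix lists covering only the two examples in this issue.
PREFIXES = ("meng", "men")  # longest first, so "meng" is tried before "men"
SUFFIXES = ("kan",)


def strip_affixes(word):
    """Naively remove at most one known prefix and one known suffix."""
    stem = word.lower()
    for prefix in PREFIXES:
        if stem.startswith(prefix) and len(stem) > len(prefix):
            stem = stem[len(prefix):]
            break
    for suffix in SUFFIXES:
        if stem.endswith(suffix) and len(stem) > len(suffix):
            stem = stem[: -len(suffix)]
            break
    return stem


for word in ("menjadi", "menggantikan"):
    print(f"word: {word} lemma: {strip_affixes(word)}")
```

A rule list like this over-strips any root that merely happens to start with "men", which is one reason a trained lemmatizer is preferable to hand-written rules.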

Environment:

  • OS: Ubuntu
  • Python version: 3.9
  • Stanza version: 1.3

Additional context
I noticed the same behavior in the following libraries:
https://github.com/nlp-uoregon/trankit
https://github.com/TakeLab/spacy-udpipe

I understand that these projects also use UD Indonesian GSD. I note that when you look at this data, you can see the proper lemmatization - see the third and final columns in this row for an example:
5 menggantikan ganti VERB VSA Mood=Ind|Voice=Act 2 advcl _ MorphInd=^meN+ganti<v>+kan_VSA$
https://github.com/UniversalDependencies/UD_Indonesian-GSD/blob/master/id_gsd-ud-test.conllu
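
For readers unfamiliar with the CoNLL-U layout, the row quoted above can be split into its ten named columns as sketched below. The row is rebuilt here with tab separators, as CoNLL-U files use, and `parse_conllu_row` is a hypothetical helper written for this example.

```python
CONLLU_FIELDS = (
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
)

# The row quoted above, joined with tabs.
ROW = "\t".join([
    "5", "menggantikan", "ganti", "VERB", "VSA", "Mood=Ind|Voice=Act",
    "2", "advcl", "_", "MorphInd=^meN+ganti<v>+kan_VSA$",
])


def parse_conllu_row(line):
    """Map one tab-separated CoNLL-U token line onto its column names."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(CONLLU_FIELDS):
        raise ValueError(f"expected {len(CONLLU_FIELDS)} columns, got {len(values)}")
    return dict(zip(CONLLU_FIELDS, values))


token = parse_conllu_row(ROW)
print(token["form"], token["lemma"], token["misc"])
```

The third column (`lemma`) holds ganti, and the final column (`misc`) holds the MorphInd analysis with the same root.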

Dictionary definition showing words and lemmas used in this example:
image
image

When the bug does not appear

I figured this out just before posting this issue: the bug does not appear when you use the non-default package 'csui':

```python
nlp = stanza.Pipeline(
    lang='id', processors='tokenize,pos,lemma,depparse', package='csui')

doc = nlp.process('Ia menjadi Gubernur Bali menggantikan Anak Agung Bagus Sutedja.')
```

When that runs, you get the expected output:

word: Ia lemma: ia
word: menjadi lemma: jadi
word: Gubernur lemma: Gubernur
word: Bali lemma: Bali
word: menggantikan lemma: ganti
word: Anak lemma: anak
word: Agung lemma: Agung
word: Bagus lemma: Bagus
word: Sutedja lemma: Sutedja
word: . lemma: .
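
The two runs differ in more than the affixed verbs. The sketch below compares the lemmas reported in this issue for the default package and for 'csui'. The dictionaries are copied from the outputs above, and `lemma_differences` is a name invented for this illustration.

```python
# Lemmas reported above for the default package.
default_lemmas = {
    "Ia": "ia", "menjadi": "menjadi", "Gubernur": "gubernur", "Bali": "bali",
    "menggantikan": "mengantikan", "Anak": "anak", "Agung": "agung",
    "Bagus": "bagus", "Sutedja": "sutedja", ".": ".",
}

# Lemmas reported above for package='csui'.
csui_lemmas = {
    "Ia": "ia", "menjadi": "jadi", "Gubernur": "Gubernur", "Bali": "Bali",
    "menggantikan": "ganti", "Anak": "anak", "Agung": "Agung",
    "Bagus": "Bagus", "Sutedja": "Sutedja", ".": ".",
}


def lemma_differences(first, second):
    """Return {word: (lemma_in_first, lemma_in_second)} where the two disagree."""
    return {
        word: (lemma, second.get(word))
        for word, lemma in first.items()
        if lemma != second.get(word)
    }


for word, (old, new) in lemma_differences(default_lemmas, csui_lemmas).items():
    print(f"{word}: {old} -> {new}")
```

Besides the two verbs, the remaining differences are all in capitalization of the lemma.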

Interestingly, both the GSD and CSUI treebanks have the correct lemma in their lemma column (column 3).
See:
https://raw.githubusercontent.com/UniversalDependencies/UD_Indonesian-CSUI/master/id_csui-ud-train.conllu
and
https://raw.githubusercontent.com/UniversalDependencies/UD_Indonesian-GSD/master/id_gsd-ud-train.conllu
Search either file for the word menggantikan, for example: in both files it is annotated with the correct lemma, ganti.
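
Rather than searching by hand, the check can be scripted. The sketch below counts which lemmas a given word form is annotated with in a sequence of CoNLL-U lines. The function name `lemmas_for_form` is hypothetical, and the sample contains only the single row quoted earlier in this issue, not the real treebank contents.

```python
from collections import Counter


def lemmas_for_form(lines, form):
    """Count the lemmas (column 3) annotated for a word form (column 2)."""
    counts = Counter()
    for line in lines:
        # Skip blank sentence separators and '#' comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 10:
            continue
        if fields[1] == form:
            counts[fields[2]] += 1
    return counts


# Tiny illustrative sample; a real check would iterate over a downloaded file.
SAMPLE = [
    "# comment lines and blank lines are skipped",
    "5\tmenggantikan\tganti\tVERB\tVSA\tMood=Ind|Voice=Act\t2\tadvcl\t_\t"
    "MorphInd=^meN+ganti<v>+kan_VSA$",
    "",
]

print(lemmas_for_form(SAMPLE, "menggantikan"))
```

Since an open text file iterates line by line, a locally saved copy of either training file could be passed in directly as `lines`.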

So what is going wrong in the training of the GSD model?

Thanks for any insight,

Xavier

@xavier-taylor xavier-taylor changed the title Lemmatization does not appear to be working for Indonesian. Lemmatization does not appear to be working for Indonesian (GSD). Apr 18, 2022
@AngledLuffa
Collaborator

AngledLuffa commented Apr 18, 2022 via email

@xavier-taylor
Author

I will follow this thread for any updates. I appreciate your work. Once again, thanks for making this amazing toolkit publicly available!

@AngledLuffa
Collaborator

AngledLuffa commented Apr 19, 2022 via email

@xavier-taylor
Author

Thanks mate, I will see if I can figure out how to do that! In the meantime, I hope that you get well soon!

@xavier-taylor
Author

xavier-taylor commented Apr 19, 2022

Note for those who, like me, are not very familiar with Python: I didn't immediately know how to install the dev branch.

This did the trick:

pip install https://github.com/stanfordnlp/stanza/archive/dev.zip

You may also need something like: pip install --upgrade torch torchvision depending on whether your torch version is different to the one assumed by dev.
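
To see which versions actually ended up installed after the pip commands above, something like the following sketch can help. It uses only the standard library's `importlib.metadata`, and `installed_version` is a helper name made up for this example.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in ("stanza", "torch"):
    print(name, installed_version(name))
```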

I checked that GSD is indeed working for lemmatization:

```python
# Run inside a python3 session.
import stanza

nlp = stanza.Pipeline('id', package='gsd')
doc = nlp.process('Ia menjadi Gubernur Bali menggantikan Anak Agung Bagus Sutedja.')
for sentence in doc.sentences:
    for word in sentence.words:
        print(f'word: {word.text} lemma: {word.lemma}')
```

output:
word: Ia lemma: dia
> word: menjadi lemma: jadi
word: Gubernur lemma: gubernur
word: Bali lemma: bali
> word: menggantikan lemma: ganti
word: Anak lemma: anak
word: Agung lemma: agung
word: Bagus lemma: bagus
word: Sutedja lemma: sutedja
word: . lemma: .

So this is good. Thanks John.
