Lemmatization does not appear to be working for Indonesian (GSD). #1003

xavier-taylor · 2022-04-18T11:00:43Z

Describe the bug
Lemmatization does not appear to be working for Indonesian.

To Reproduce
`
import stanza

nlp = stanza.Pipeline(lang='id', processors='tokenize,pos,lemma,depparse')

doc = nlp.process('Ia menjadi Gubernur Bali menggantikan Anak Agung Bagus Sutedja.')
`
Output:

word: Ia lemma: ia
word: menjadi lemma: menjadi
word: Gubernur lemma: gubernur
word: Bali lemma: bali
word: menggantikan lemma: mengantikan
word: Anak lemma: anak
word: Agung lemma: agung
word: Bagus lemma: bagus
word: Sutedja lemma: sutedja
word: . lemma: .

Expected behavior

I am not an expert on Indonesian, but I am studying it now and am confident about the following:

In Indonesian many words are built by adding affixes to lemmas.

For example

menjadi is men + jadi. Jadi is the lemma.
menggantikan is meng + ganti + kan. Ganti is the lemma.
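
To make the two decompositions above concrete, here is a toy sketch in Python. It is an assumption-laden illustration, not how Stanza or any real Indonesian lemmatizer works: the `PREFIXES` and `SUFFIXES` tables and the `strip_affixes` helper are hypothetical and cover only the affixes named in these two examples, so the function will over-strip many ordinary words.

```python
# Toy illustration only: covers just the affixes from the two examples above.
# PREFIXES, SUFFIXES and strip_affixes are hypothetical names, not a Stanza API.
PREFIXES = ("meng", "men")  # longest first, so "meng" is tried before "men"
SUFFIXES = ("kan",)


def strip_affixes(word: str) -> str:
    """Strip at most one known prefix and one known suffix from a word."""
    stem = word.lower()
    for prefix in PREFIXES:
        if stem.startswith(prefix):
            stem = stem[len(prefix):]
            break
    for suffix in SUFFIXES:
        if stem.endswith(suffix):
            stem = stem[:-len(suffix)]
            break
    return stem


for w in ("menjadi", "menggantikan"):
    print(f"word: {w} lemma: {strip_affixes(w)}")
```

A trained lemmatizer learns these patterns from the treebank's lemma column rather than from a hand-written affix list like this one.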

In all example sentences I have tried, the words are not being correctly lemmatized.

So I expect, for words with lemmas like these two examples, to get output like:

word: menjadi lemma: jadi
word: menggantikan lemma: ganti

Environment:

OS: Ubuntu
Python version: 3.9
Stanza version: 1.3

Additional context
I noticed the same behavior in the following libraries:
https://github.com/nlp-uoregon/trankit
https://github.com/TakeLab/spacy-udpipe

I understand that these projects also use UD Indonesian GSD. I note that when you look at this data, you can see the proper lemmatization - see the third and final columns in this row for an example:
5 menggantikan ganti VERB VSA Mood=Ind|Voice=Act 2 advcl _ MorphInd=^meN+ganti<v>+kan_VSA$
https://github.com/UniversalDependencies/UD_Indonesian-GSD/blob/master/id_gsd-ud-test.conllu
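
For readers unfamiliar with the format, here is a minimal sketch of reading that row in Python. It assumes the standard CoNLL-U layout of ten tab-separated columns (the tabs appear as spaces in the row pasted above); `CONLLU_FIELDS` and `parse_conllu_row` are names made up for this illustration.

```python
# Minimal sketch: split one CoNLL-U token row into its ten named columns.
# CONLLU_FIELDS and parse_conllu_row are illustrative names, not a library API.
CONLLU_FIELDS = (
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
)


def parse_conllu_row(row: str) -> dict:
    """Return a dict mapping CoNLL-U column names to the values in `row`."""
    values = row.rstrip("\n").split("\t")
    if len(values) != len(CONLLU_FIELDS):
        raise ValueError(f"expected {len(CONLLU_FIELDS)} columns, got {len(values)}")
    return dict(zip(CONLLU_FIELDS, values))


# The row quoted above, with the column separators written as tabs.
row = (
    "5\tmenggantikan\tganti\tVERB\tVSA\tMood=Ind|Voice=Act"
    "\t2\tadvcl\t_\tMorphInd=^meN+ganti<v>+kan_VSA$"
)
token = parse_conllu_row(row)
print(f"form: {token['form']} lemma: {token['lemma']} misc: {token['misc']}")
```

The `lemma` column (third) holds `ganti`, and the `misc` column (last) holds the MorphInd analysis showing the same root.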

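Dictionary definition showing words and lemmas used in this example:
[image: image] <https://user-images.githubusercontent.com/26212434/163791987-641db501-4927-4093-95a7-f0ec3d12ea21.png>
[image: image] <https://user-images.githubusercontent.com/26212434/163792092-f4ba043f-635f-4584-af63-2090a3cb217c.png>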

When the bug does not appear

I figured this out just before posting this issue - the bug does not appear when you use the (non-default) package 'csui':
`
nlp = stanza.Pipeline(
lang='id', processors='tokenize,pos,lemma,depparse', package='csui')

doc = nlp.process('Ia menjadi Gubernur Bali menggantikan Anak Agung Bagus Sutedja.')
`

When that runs, you get the expected output:

word: Ia lemma: ia
word: menjadi lemma: jadi
word: Gubernur lemma: Gubernur
word: Bali lemma: Bali
word: menggantikan lemma: ganti
word: Anak lemma: anak
word: Agung lemma: Agung
word: Bagus lemma: Bagus
word: Sutedja lemma: Sutedja
word: . lemma: .
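
To see exactly where the two packages disagree, the two outputs in this report can be compared mechanically. The sketch below only uses the printed text copied from this issue, so it does not need Stanza installed; `parse_output` is a helper name invented for the illustration, and the comparison ignores case so that capitalization differences (Gubernur vs gubernur) are not counted.

```python
# Sketch: compare the two "word: X lemma: Y" outputs pasted in this issue.
# parse_output is a hypothetical helper; the data is copied from the report.
DEFAULT_OUTPUT = """\
word: Ia lemma: ia
word: menjadi lemma: menjadi
word: Gubernur lemma: gubernur
word: Bali lemma: bali
word: menggantikan lemma: mengantikan
word: Anak lemma: anak
word: Agung lemma: agung
word: Bagus lemma: bagus
word: Sutedja lemma: sutedja
word: . lemma: .
"""

CSUI_OUTPUT = """\
word: Ia lemma: ia
word: menjadi lemma: jadi
word: Gubernur lemma: Gubernur
word: Bali lemma: Bali
word: menggantikan lemma: ganti
word: Anak lemma: anak
word: Agung lemma: Agung
word: Bagus lemma: Bagus
word: Sutedja lemma: Sutedja
word: . lemma: .
"""


def parse_output(text: str) -> list:
    """Turn 'word: X lemma: Y' lines into a list of (word, lemma) pairs."""
    pairs = []
    for line in text.strip().splitlines():
        word_part, lemma = line.split(" lemma: ")
        pairs.append((word_part[len("word: "):], lemma))
    return pairs


default_pairs = parse_output(DEFAULT_OUTPUT)
csui_pairs = parse_output(CSUI_OUTPUT)

# Keep only tokens whose lemmas differ beyond capitalization.
real_diffs = [
    (word, default_lemma, csui_lemma)
    for (word, default_lemma), (_, csui_lemma) in zip(default_pairs, csui_pairs)
    if default_lemma.lower() != csui_lemma.lower()
]

for word, default_lemma, csui_lemma in real_diffs:
    print(f"word: {word} default: {default_lemma} csui: {csui_lemma}")
```

Apart from capitalization, the only tokens that differ are the two affixed verbs, menjadi and menggantikan.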

Of interest to me is that both the GSD and CSUI do have the correct lemma in their lemma column, column 3.
See:
https://raw.githubusercontent.com/UniversalDependencies/UD_Indonesian-CSUI/master/id_csui-ud-train.conllu
and
https://raw.githubusercontent.com/UniversalDependencies/UD_Indonesian-GSD/master/id_gsd-ud-train.conllu
for example, and search for the word menggantikan. In both files menggantikan appears correctly as the lemma ganti.

So why/how is it going wrong with the training of the GSD model?

Thanks for any insight,

Xavier

AngledLuffa · 2022-04-18T17:00:04Z

That's a recent update which we haven't incorporated yet. We can rebuild the models for the updated GSD dataset pretty quickly.


xavier-taylor · 2022-04-19T06:53:39Z

I will follow this thread for any updates. I appreciate your work. Once again, thanks for making this amazing toolkit publicly available!

AngledLuffa · 2022-04-19T07:05:01Z

This is now fixed on the dev branch, in case you are willing to install that. New release will be within a week, hopefully. Depends on how long this cold is kicking my butt.


xavier-taylor · 2022-04-19T10:35:37Z

Thanks mate, I will see if I can figure out how to do that! In the meantime, I hope that you get well soon!

xavier-taylor · 2022-04-19T23:13:13Z

Note for those like myself not super familiar with Python - I didn't immediately know how to install the dev branch.

This did the trick:

pip install https://github.com/stanfordnlp/stanza/archive/dev.zip

You may also need something like `pip install --upgrade torch torchvision`, depending on whether your torch version is different to the one assumed by dev.

I checked that GSD is indeed working for lemmatization:

`
$ python3

import stanza
doc = stanza.Pipeline('id', package='gsd').process('Ia menjadi Gubernur Bali menggantikan Anak Agung Bagus Sutedja.')
# Print each word with its predicted lemma
for sentence in doc.sentences:
    for word in sentence.words:
        print(f'word: {word.text} lemma: {word.lemma}')
`

output:
word: Ia lemma: dia
> word: menjadi lemma: jadi
word: Gubernur lemma: gubernur
word: Bali lemma: bali
> word: menggantikan lemma: ganti
word: Anak lemma: anak
word: Agung lemma: agung
word: Bagus lemma: bagus
word: Sutedja lemma: sutedja
word: . lemma: .

So this is good. Thanks John.

xavier-taylor added the bug label Apr 18, 2022

xavier-taylor changed the title ~~Lemmatization does not appear to be working for Indonesian.~~ Lemmatization does not appear to be working for Indonesian (GSD). Apr 18, 2022

xavier-taylor mentioned this issue Apr 18, 2022

Issue with lemmatization of indonesian nlp-uoregon/trankit#45

Closed

xavier-taylor closed this as completed Apr 19, 2022
