Lemmatization does not appear to be working for Indonesian (GSD). #1003
Comments
That's a recent update which we haven't incorporated yet. We can rebuild
the models for the updated GSD dataset pretty quickly.
On Mon, Apr 18, 2022 at 4:00 AM Xavier Taylor wrote:
*Describe the bug*
Lemmatization does not appear to be working for Indonesian.
*To Reproduce*
`
import stanza

nlp = stanza.Pipeline(lang='id', processors='tokenize,pos,lemma,depparse')
doc = nlp.process('Ia menjadi Gubernur Bali menggantikan Anak Agung Bagus Sutedja.')
`
Output:
word: Ia lemma: ia
word: menjadi lemma: menjadi
word: Gubernur lemma: gubernur
word: Bali lemma: bali
word: menggantikan lemma: mengantikan
word: Anak lemma: anak
word: Agung lemma: agung
word: Bagus lemma: bagus
word: Sutedja lemma: sutedja
word: . lemma: .
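The snippet above does not show how the word/lemma lines were printed. A minimal sketch of one way to produce them, assuming the `sentences`, `words`, `text` and `lemma` attributes of a processed Stanza document. The helper name `format_lemmas` is hypothetical, and the stand-in objects only mimic that attribute layout so the sketch runs without downloading any models:

```python
from types import SimpleNamespace


def format_lemmas(doc):
    """Return one 'word: X lemma: Y' line per word in a Stanza-style document."""
    return [
        f"word: {word.text} lemma: {word.lemma}"
        for sentence in doc.sentences
        for word in sentence.words
    ]


# Stand-in for a processed document (assumption: same attribute layout as a
# Stanza document). With Stanza installed, pass the real `doc` instead.
fake_doc = SimpleNamespace(
    sentences=[
        SimpleNamespace(
            words=[
                SimpleNamespace(text="menjadi", lemma="menjadi"),
                SimpleNamespace(text="menggantikan", lemma="mengantikan"),
            ]
        )
    ]
)

for line in format_lemmas(fake_doc):
    print(line)
```

With a real pipeline, `format_lemmas(doc)` would be called on the object returned by `nlp.process(...)`.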
*Expected behavior*
I am not an expert on Indonesian, but I am studying it now and am
confident about the following:
In Indonesian many words are built by adding affixes to lemmas.
For example
menjadi is men + jadi. Jadi is the lemma.
menggantikan is meng + ganti + kan. Ganti is the lemma.
In all example sentences I have tried, the words are not being correctly
lemmatized.
So I expect, for words with lemmas like these two examples, to get output
like:
word: menjadi lemma: jadi
word: menggantikan lemma: ganti
*Environment:*
- OS: Ubuntu
- Python version: 3.9
- Stanza version: 1.3
*Additional context*
I noticed the same behavior in the following libraries:
https://github.com/nlp-uoregon/trankit
https://github.com/TakeLab/spacy-udpipe
I understand that these projects also use UD Indonesian GSD. I note that
when you look at this data, you can see the proper lemmatization - see the
third and final columns in this row for an example:
5 menggantikan ganti VERB VSA Mood=Ind|Voice=Act 2 advcl _ MorphInd=^meN+ganti<v>+kan_VSA$
https://github.com/UniversalDependencies/UD_Indonesian-GSD/blob/master/id_gsd-ud-test.conllu
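To make the column layout of that row concrete, here is a small sketch that splits a CoNLL-U token line into its ten standard fields and pulls the morphemes out of the `MorphInd` entry. The function names are hypothetical, and the reading of the `MorphInd` string (a trailing `_TAG` plus optional angle-bracket annotations on the root) is an assumption based only on the example row:

```python
import re

# The ten standard CoNLL-U columns, in order.
CONLLU_FIELDS = ("id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc")


def parse_conllu_row(line):
    """Split one tab-separated CoNLL-U token line into a field dict."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != len(CONLLU_FIELDS):
        raise ValueError(f"expected 10 columns, got {len(columns)}")
    return dict(zip(CONLLU_FIELDS, columns))


def morphind_morphemes(misc):
    """Return the '+'-separated morphemes of a MorphInd=^...$ entry, if any."""
    for entry in misc.split("|"):
        if entry.startswith("MorphInd="):
            analysis = entry[len("MorphInd="):].strip("^$")
            analysis = analysis.rsplit("_", 1)[0]  # drop trailing tag, e.g. _VSA
            # Strip angle-bracket annotations such as <v>, if present.
            return [re.sub(r"<[^>]*>", "", part) for part in analysis.split("+")]
    return []


row = "\t".join([
    "5", "menggantikan", "ganti", "VERB", "VSA", "Mood=Ind|Voice=Act",
    "2", "advcl", "_", "MorphInd=^meN+ganti<v>+kan_VSA$",
])
token = parse_conllu_row(row)
print(token["form"], "->", token["lemma"])
print(morphind_morphemes(token["misc"]))
```

Both the lemma column and the morpheme breakdown point to `ganti` as the root for this row.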
Dictionary definition showing words and lemmas used in this example:
[image: image]
<https://user-images.githubusercontent.com/26212434/163791987-641db501-4927-4093-95a7-f0ec3d12ea21.png>
[image: image]
<https://user-images.githubusercontent.com/26212434/163792092-f4ba043f-635f-4584-af63-2090a3cb217c.png>
*When the bug does not appear*
I figured this out just before posting this issue - the bug does not
appear when you use the (non default) package 'csui':
`
nlp = stanza.Pipeline(
    lang='id', processors='tokenize,pos,lemma,depparse', package='csui')
doc = nlp.process('Ia menjadi Gubernur Bali menggantikan Anak Agung Bagus Sutedja.')
`
When that runs, you get the expected output:
word: Ia lemma: ia
word: menjadi lemma: jadi
word: Gubernur lemma: Gubernur
word: Bali lemma: Bali
word: menggantikan lemma: ganti
word: Anak lemma: anak
word: Agung lemma: Agung
word: Bagus lemma: Bagus
word: Sutedja lemma: Sutedja
word: . lemma: .
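For a side-by-side view of the two runs, the sketch below compares the word/lemma pairs copied from the two outputs in this report. The helper name `lemma_differences` is hypothetical. The `ignore_case` option separates pure capitalization differences from differences in the lemma itself:

```python
# Word/lemma pairs copied from the two outputs reported in this issue.
default_output = [
    ("Ia", "ia"), ("menjadi", "menjadi"), ("Gubernur", "gubernur"),
    ("Bali", "bali"), ("menggantikan", "mengantikan"), ("Anak", "anak"),
    ("Agung", "agung"), ("Bagus", "bagus"), ("Sutedja", "sutedja"), (".", "."),
]
csui_output = [
    ("Ia", "ia"), ("menjadi", "jadi"), ("Gubernur", "Gubernur"),
    ("Bali", "Bali"), ("menggantikan", "ganti"), ("Anak", "anak"),
    ("Agung", "Agung"), ("Bagus", "Bagus"), ("Sutedja", "Sutedja"), (".", "."),
]


def lemma_differences(first, second, ignore_case=False):
    """Return (word, lemma_in_first, lemma_in_second) where the lemmas differ."""
    if len(first) != len(second):
        raise ValueError("outputs have different numbers of tokens")
    differences = []
    for (word_a, lemma_a), (word_b, lemma_b) in zip(first, second):
        if word_a != word_b:
            raise ValueError(f"token mismatch: {word_a!r} vs {word_b!r}")
        left, right = lemma_a, lemma_b
        if ignore_case:
            left, right = left.lower(), right.lower()
        if left != right:
            differences.append((word_a, lemma_a, lemma_b))
    return differences


for word, gsd_lemma, csui_lemma in lemma_differences(
        default_output, csui_output, ignore_case=True):
    print(f"{word}: default={gsd_lemma} csui={csui_lemma}")
```

Ignoring case, only `menjadi` and `menggantikan` differ between the two runs. The remaining differences are in the capitalization of the proper-noun lemmas.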
Of interest to me is that both the GSD and CSUI do have the correct lemma
in their lemma column, column 3.
See:
https://raw.githubusercontent.com/UniversalDependencies/UD_Indonesian-CSUI/master/id_csui-ud-train.conllu
and
https://raw.githubusercontent.com/UniversalDependencies/UD_Indonesian-GSD/master/id_gsd-ud-train.conllu
for example, and search for the word menggantikan. In both files
menggantikan appears correctly as the lemma ganti.
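Rather than searching the files by hand, the check can be scripted. The following is a sketch only: `lemma_counts` is a hypothetical helper, and the in-memory `sample` is a stand-in built from the single row quoted in this report, not real treebank statistics:

```python
from collections import Counter


def lemma_counts(lines, form):
    """Count the lemmas annotated for `form` in an iterable of CoNLL-U lines."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank separators and sentence-level comments
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 10 or not columns[0].isdigit():
            continue  # skip malformed lines, multiword ranges and empty nodes
        if columns[1].lower() == form.lower():
            counts[columns[2]] += 1
    return counts


# Stand-in input; for a real check, iterate over a downloaded file instead:
#   with open("id_gsd-ud-train.conllu", encoding="utf-8") as handle:
#       print(lemma_counts(handle, "menggantikan"))
sample = [
    "# illustrative sentence header",
    "\t".join(["5", "menggantikan", "ganti", "VERB", "VSA",
               "Mood=Ind|Voice=Act", "2", "advcl", "_",
               "MorphInd=^meN+ganti<v>+kan_VSA$"]),
    "",
]
print(lemma_counts(sample, "menggantikan"))
```

Running the same function over both training files would show whether every occurrence of the form carries the lemma `ganti` or whether the annotation varies.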
So why/how is it going wrong with the training of the GSD model?
Thanks for any insight,
Xavier
I will follow this thread for any updates. I appreciate your work. Once again, thanks for making this amazing toolkit publicly available!
This is now fixed on the dev branch, in case you are willing to install
that.
The new release should be out within a week, hopefully. It depends on how
long this cold keeps kicking my butt.
On Mon, Apr 18, 2022 at 11:53 PM Xavier Taylor wrote:
I will follow this thread for any updates. I appreciate your work. Once
again, thanks for making this amazing toolkit publicly available!
Thanks mate, I will see if I can figure out how to do that! In the meantime, I hope that you get well soon!
Note for those like myself not super familiar with Python: I didn't immediately know how to install the dev branch. This did the trick:
You may also need something like:
I checked that GSD is indeed working for lemmatization:
So this is good. Thanks John.