Lemmatization of Possessive Case markers in Hindi and Urdu #1067

Closed
raydoc opened this issue Jun 30, 2022 · 7 comments

raydoc commented Jun 30, 2022

Hi,
I was checking your lemmatization for Hindi and Urdu and found that the possessive (genitive) case markers in both languages are lemmatized incorrectly.
The Hindi possessive case markers are:
का की के
Your lemmatizer tends to reduce all three to the single form का (perhaps because it relies on a built-in library). For example:
लड़की के मामा की बहन
Lemmatized form: लड़की का मामा का बहन
I do not agree with this approach, not only because /ka/ is not the base form (lemma) of /ki/ and /ke/, but also because it is "sexist" in nature: it reduces the feminine and the feminine/masculine plural forms to a masculine singular form. I do not see 'her', 'sa', or 'ihre' in English, French, and German reduced to 'him', 'son', or 'ihr', which they should be by the same logic.
The same happens in Urdu.
Stanford's Stanza reduces the Urdu genitive case markers /ka/, /ke/, /ki/ to /ka/.
Here is the output for an Urdu sentence:
مھمد کی کتاب اور ھسن کے گھر
Lemma: مھمد کا کتاب اور ھسن کا گھر
I believe the library that does this is at fault. I consulted colleagues who are linguists in Hindi and Urdu and also work in NLP, and we feel this approach is linguistically incorrect and, worse still, smacks of sexism. I do not think it is right to reduce a feminine form to a masculine one.
my email: raymond.doctor@gmail.com
I hope a more rational approach to this will be adopted.
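
For anyone who wants to reproduce the lemmatizer output reported above, a minimal sketch along the following lines may help. It assumes the `stanza` package is installed and that the Hindi or Urdu models have already been fetched (for example with `stanza.download("hi")`). The helper names `lemma_pairs`, `changed_lemmas`, and `lemmatize` are illustrative and not part of Stanza.

```python
from typing import List, Tuple


def lemma_pairs(doc) -> List[Tuple[str, str]]:
    """Return (surface form, lemma) pairs for every word in a Stanza-style document."""
    return [(word.text, word.lemma) for sent in doc.sentences for word in sent.words]


def changed_lemmas(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep only the pairs whose lemma differs from the surface form."""
    return [(form, lemma) for form, lemma in pairs if form != lemma]


def lemmatize(text: str, lang: str = "hi") -> List[Tuple[str, str]]:
    """Run a Stanza pipeline on `text`; assumes the models for `lang` are downloaded."""
    import stanza  # imported lazily so the helpers above work without Stanza installed

    nlp = stanza.Pipeline(lang=lang, processors="tokenize,pos,lemma")
    return lemma_pairs(nlp(text))
```

Calling `changed_lemmas(lemmatize("लड़की के मामा की बहन"))` should then list the words whose lemma differs from the form, which makes it easy to see whether के and की are still being mapped to का by a given model version.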

@AngledLuffa
Collaborator

AngledLuffa commented Jun 30, 2022 via email

@raydoc
Author

raydoc commented Jun 30, 2022 via email

@AngledLuffa
Collaborator

Thank you for the detailed explanation. Now that I know what to look for, I believe this is an issue with the underlying data. For languages where we don't have any expertise of our own (which is unfortunately the case for Hindi), we simply feed the data into our model training, and whatever comes out is what comes out.

The datasets are here:

https://universaldependencies.org/
https://github.com/UniversalDependencies/UD_Hindi-HDTB
https://github.com/UniversalDependencies/UD_Urdu-UDTB

So, for example with the Hindi dataset, I can grep for की in the dataset.

(Note: the fields are tab separated, so you can grep for exactly that word by surrounding the character with tabs. You can put a tab in a bash shell with ctrl-V tab. You may already know all of that.)
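
The tab-delimited exact-match search described above can also be sketched in Python, which avoids having to type literal tabs in the shell. This is only an illustration: the function name `find_form` is hypothetical, and it assumes the standard 10-column CoNLL-U layout in which the second column is the word form and the third is the lemma.

```python
from typing import Iterable, Iterator, List


def find_form(lines: Iterable[str], form: str) -> Iterator[List[str]]:
    """Yield CoNLL-U token rows whose FORM column equals `form` exactly."""
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank sentence separators and comment lines such as "# text = ..."
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # A well-formed token row has 10 tab-separated columns; FORM is column 2
        if len(cols) == 10 and cols[1] == form:
            yield cols


# Small inline sample standing in for a real treebank file
sample = [
    "# text = example\n",
    "3\tकी\tका\tADP\tPSP\tAdpType=Post\t2\tcase\t_\t_\n",
    "4\tकीमत\tकीमत\tNOUN\tNN\t_\t0\troot\t_\t_\n",
]
for row in find_form(sample, "की"):
    print(row[1], "->", row[2])  # form -> lemma
```

With a local checkout of the treebank, the same function could be fed an open file handle (for instance a path such as `hi_hdtb-ud-train.conllu`, which is an assumed file name) instead of the inline sample.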

The results of grepping for की look like:

3       की      का      ADP     PSP     AdpType=Post|Case=Acc|Gender=Fem|Number=Plur    2       case    _       ChunkId=NP|ChunkType=child|Translit=kī
3       की      का      ADP     PSP     AdpType=Post|Case=Nom|Gender=Fem|Number=Sing    2       case    _       ChunkId=NP2|ChunkType=child|Translit=kī

So you can see, the underlying dataset itself lemmatizes it to the masculine form.

For के, the results are less consistent. I'll leave a bit more context in case you can explain why the dataset treats these cases differently. Sometimes the lemma stays के, and sometimes it becomes का.

10      जाने     जा      VERB    VM      Case=Acc|VerbForm=Inf   16      advcl   _       Vib=ना_के_लिए|Tam=nA|ChunkId=VGNN|ChunkType=head|Translit=jāne
11      के       के       ADP     PSP     AdpType=Post    10      mark    _       ChunkId=VGNN|ChunkType=child|Translit=ke
12      लिए     लिए     ADP     PSP     AdpType=Post    10      mark    _       ChunkId=VGNN|ChunkType=child|Translit=lie
--
9       देश      देश      NOUN    NN      Case=Acc|Gender=Masc|Number=Sing|Person=3       11      nmod    _       Vib=0_का|Tam=0|ChunkId=NP4|ChunkType=head|Translit=deśa
10      के       का      ADP     PSP     AdpType=Post|Case=Acc|Gender=Masc|Number=Plur   9       case    _       ChunkId=NP4|ChunkType=child|Translit=ke
11      लोगों    लोग     NOUN    NN      Case=Acc|Gender=Masc|Number=Plur|Person=3       14      obj     _       Vib=0_को|Tam=0|ChunkId=NP5|ChunkType=head|Translit=logoṁ

I suggest going through the dataset some yourself to see if there's a reasonable standard or if you think it should be changed. The fact that के isn't consistent is a little suspicious to me, if nothing else. Plus, as you point out, most other language datasets don't ignore the gender in the lemma.
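
One way to go through the dataset systematically is to tally which lemmas each marker receives. The sketch below is a rough starting point under stated assumptions: `lemma_counts` is a hypothetical helper, the input is assumed to be standard 10-column CoNLL-U, and the grouping by dependency relation is only a guess prompted by the two excerpts above, where the के that keeps its lemma is attached as `mark` and the के lemmatized to का is attached as `case`. Whether that pattern holds across the treebank is untested.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def lemma_counts(
    lines: Iterable[str], forms: Iterable[str]
) -> Dict[str, Dict[Tuple[str, str], int]]:
    """Count (lemma, deprel) combinations for each requested surface form."""
    wanted = set(forms)
    counts = defaultdict(Counter)
    for line in lines:
        if line.startswith("#"):
            continue  # comment line
        cols = line.rstrip("\n").split("\t")
        # Columns: 2 = FORM, 3 = LEMMA, 8 = DEPREL (1-indexed, per CoNLL-U)
        if len(cols) == 10 and cols[1] in wanted:
            counts[cols[1]][(cols[2], cols[7])] += 1
    return {form: dict(counter) for form, counter in counts.items()}


# Rows abbreviated from the excerpts quoted in this thread
sample = [
    "11\tके\tके\tADP\tPSP\tAdpType=Post\t10\tmark\t_\t_\n",
    "10\tके\tका\tADP\tPSP\tAdpType=Post\t9\tcase\t_\t_\n",
    "3\tकी\tका\tADP\tPSP\tAdpType=Post\t2\tcase\t_\t_\n",
]
for form, table in lemma_counts(sample, ["का", "की", "के"]).items():
    print(form, table)
```

Running this over the full training file would show how often each form keeps its own lemma and how often it is mapped to का, which is the kind of evidence a dataset issue or pull request would likely need.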

Anyway, a lot of the dataset maintainers are pretty responsive to issues. You could create an issue or even a pull request against the Hindi dataset if you think the lemmas should be updated to reflect the gender of the case marker. I would suggest starting from a more neutral attitude rather than going straight to calling the dataset sexist, though :) If you do effect some changes in those datasets, we can retrain the models at any point, not necessarily when UD 2.11 comes out. Alternatively, we can always train the models from a fork of the dataset if it seems they are not responding and you are certain your change is an improvement.

BTW, the reason I like this job is learning interesting tidbits about other languages: in English, the gender of the possessor determines the possessive form, whereas in Hindi & Urdu it is the gender of the possessed noun that determines it.

@raydoc
Author

raydoc commented Jul 1, 2022 via email

@stale

stale bot commented Aug 31, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Aug 31, 2022
@AngledLuffa
Collaborator

Any luck sorting out the different lemmas in the datasets? Happy to rebuild the models for those languages if we make an improvement to the data.

@stale

stale bot commented Sep 8, 2022

This issue has been automatically closed due to inactivity.

@stale stale bot closed this as completed Sep 8, 2022