Lemmatization of Possessive Case markers in Hindi and Urdu #1067
Comments
The simple fact is we don't have anyone who speaks Hindi working on this
project. The models are trained from the Hindi and Urdu datasets available
from Universal Dependencies without any human intervention:
https://universaldependencies.org/
https://github.com/UniversalDependencies/UD_Hindi-HDTB
https://github.com/UniversalDependencies/UD_Urdu-UDTB
I don't even know which of three possessive case markers are male, female,
or neutral. Google translate doesn't distinguish them. Your message
assumes we'll know which ones are which. Hopefully your next message will
adopt a more rational approach to explaining what the problem is.
On Wed, Jun 29, 2022 at 9:10 PM raydoc ***@***.***> wrote:
Hi,
I was checking your lemmatization for Hindi and Urdu and found that
possessive [genitive] case markers in Hindi and Urdu are wrongly lemmatized.
It refers to the Hindi possessive case markers:
का की के
I have noticed that your lemmatiser tends to reduce these [maybe because
it relies on built-in libraries] to the single form का. For example:
लड़की के मामा की बहन
Lemmatized form: लड़की का मामा का बहन
I personally do not agree with this approach, not only because /ka/ is not
the base form [lemma] of /ki/ and /ke/, but also because it is "sexist" in
nature: it reduces the feminine and the feminine/masculine plural to a
masculine singular form. I do not see 'her', 'sa', or 'ihre' in English,
French, and German reduced to 'him', 'son', and 'ihr', which the same logic
would require.
The same scenario holds in Urdu:
Stanford's Stanza reduces the Urdu genitive case markers /ka/, /ke/, /ki/ to /ka/.
Here is the output for an Urdu sentence:
مھمد کی کتاب اور ھسن کے گھر
Lemma: مھمد کا کتاب اور ھسن کا گھر
I believe the library which does this is at fault. I consulted my
colleagues who are linguists in Hindi and Urdu and also work in the area of
NLP, and we feel this approach is linguistically incorrect and, worse still,
smacks of sexism. I do not think it is right to reduce a feminine form to a
masculine one.
my email: ***@***.***
I hope a more rational approach to this will be adopted.
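
To make the report above concrete, here is a minimal Python sketch that lines up the surface tokens with the lemmas quoted in this message and lists the pairs that differ. It uses only the two example sentences from the message. The helper name `changed_lemmas` is hypothetical, and splitting on whitespace is an assumption that happens to work for these examples; it is not how Stanza tokenizes.

```python
def changed_lemmas(surface: str, lemmatized: str) -> list[tuple[str, str]]:
    """Pair whitespace-separated tokens and return the (form, lemma) pairs that differ."""
    forms = surface.split()
    lemmas = lemmatized.split()
    if len(forms) != len(lemmas):
        raise ValueError("token counts differ; cannot align forms with lemmas")
    return [(form, lemma) for form, lemma in zip(forms, lemmas) if form != lemma]


# Hindi example from the message: only the two case markers change.
hindi = changed_lemmas("लड़की के मामा की बहन", "लड़की का मामा का बहन")
# Urdu example from the message.
urdu = changed_lemmas("مھمد کی کتاب اور ھسن کے گھر", "مھمد کا کتاب اور ھسن کا گھر")

for form, lemma in hindi + urdu:
    print(f"{form} -> {lemma}")
```

In both sentences the only tokens whose lemma differs from the surface form are the genitive markers, which is the behaviour being reported.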
|
Hi,
I assumed someone knew Hindi/Urdu and hence did not go into details. I'll
take Hindi and Urdu one by one, although they both exhibit the same problem.
HINDI
The issue is as follows:
Hindi has 3 genitive case markers: का /ka/, की /kii/, के /ke/. These case
markers agree in number and gender with the possessed. If the
object/person is masculine singular, का is used; if feminine singular, की;
and if plural (masculine or feminine), के.
Examples:
राम का भाई Ram's brother: ka because Brother is masculine
राम की बहन Ram's sister: kii because Sister is feminine
राम के *मित्रों* : Ram's friends: ke because friends is plural
As you can see, the genitive is marked for the number and gender of the
possessed. Reducing all three to the masculine singular /ka/ is not the right
approach and, as I mentioned, is sexist. It is like lemmatising
"her" to "his" in English.
URDU
In the case of Urdu, the scenario is similar. I will take the same
examples to make comprehension easier:
Urdu has 3 similar case markers: کا /ka/, کی /kii/, کے /ke/.
These case markers agree in number and gender with the possessed.
If the object/person is masculine singular, کا is used;
if feminine singular, کی;
and if plural (masculine or feminine), کے.
Examples:
رام کا بھائی: Ram's brother: ka because Brother is masculine
رام کی بہن: Ram's sister: kii because Sister is feminine
رام کےمتروں: Ram's friends: ke because friends is plural
As you can see, the genitive is marked for the number and gender of the
possessed, unlike in English (his book, her book).
Reducing them to the masculine singular /ka/ is not the right approach.
I trust this explanation will help you see the problem in its right
perspective.
Thank you
Best regards,
Doc
|
Hi,
I'll go through the datasets and get back to you.
A little clarification. In the case of
11 के के ADP PSP AdpType=Post 10 mark _ ChunkId=VGNN|ChunkType=child|Translit=ke
12 लिए लिए ADP PSP AdpType=Post 10 mark _ ChunkId=VGNN|ChunkType=child|Translit=lie
के लिए /ke lie/, /ke/ is not lemmatized to /ka/ because /ke lie/
constitutes a single unit, roughly translated as 'for (which/whom)',
and requires a pronoun/noun before it. In this case: देश
[country]. The whole construct means 'for the country's sake'.
Hopefully one more tidbit to add.
Best regards,
Doc
On Fri, Jul 1, 2022 at 11:57 AM John Bauer ***@***.***> wrote:
Thank you for the detailed explanation. Now that I know what to look for,
I will claim that this is an issue with the underlying data. Especially
true for languages where we don't have any expertise of our own (which is
unfortunately true for Hindi), we simply put the data into our model
training, and whatever comes out is what comes out.
The datasets are here:
https://universaldependencies.org/
https://github.com/UniversalDependencies/UD_Hindi-HDTB
https://github.com/UniversalDependencies/UD_Urdu-UDTB
So, for example with the Hindi dataset, I can grep for की in the dataset.
(Note: the fields are tab separated, so you can grep for exactly that word
by surrounding the character with tabs. You can put a tab in a bash shell
with ctrl-V tab. You may already know all of that.)
The results of grepping for की look like:
3 की का ADP PSP AdpType=Post|Case=Acc|Gender=Fem|Number=Plur 2 case _ ChunkId=NP|ChunkType=child|Translit=kī
3 की का ADP PSP AdpType=Post|Case=Nom|Gender=Fem|Number=Sing 2 case _ ChunkId=NP2|ChunkType=child|Translit=kī
So you can see, the underlying dataset turns it into the male form.
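
The same exact-match lookup can be sketched in Python instead of `grep`. This assumes the standard 10-column, tab-separated CoNLL-U layout (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). The two sample rows are the ones quoted above, and `rows_with_form` is a hypothetical helper, not part of Stanza or the UD tooling.

```python
from collections import Counter

# The two rows quoted above, rebuilt with real tab separators.
SAMPLE = "\n".join([
    "\t".join(["3", "की", "का", "ADP", "PSP",
               "AdpType=Post|Case=Acc|Gender=Fem|Number=Plur",
               "2", "case", "_", "ChunkId=NP|ChunkType=child|Translit=kī"]),
    "\t".join(["3", "की", "का", "ADP", "PSP",
               "AdpType=Post|Case=Nom|Gender=Fem|Number=Sing",
               "2", "case", "_", "ChunkId=NP2|ChunkType=child|Translit=kī"]),
])


def rows_with_form(conllu_text, form):
    """Yield the column list of every token row whose FORM column equals `form` exactly."""
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip sentence breaks and comment lines
        cols = line.split("\t")
        if len(cols) == 10 and cols[1] == form:
            yield cols


matches = list(rows_with_form(SAMPLE, "की"))
lemma_counts = Counter(cols[2] for cols in matches)
print(lemma_counts)
for cols in matches:
    print(cols[1], cols[2], cols[5])  # FORM, LEMMA, FEATS
```

On the sample rows the lemma is का both times, while the FEATS column still records `Gender=Fem`. To run this on the real data, pass in the contents of one of the dataset's `.conllu` files instead of `SAMPLE`.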
For के, the results are less consistent. I'll leave a bit more context in
case you can explain why it is doing things differently. Sometimes it keeps
it the same, and sometimes it switches it to का
10 जाने जा VERB VM Case=Acc|VerbForm=Inf 16 advcl _ Vib=ना_के_लिए|Tam=nA|ChunkId=VGNN|ChunkType=head|Translit=jāne
11 के के ADP PSP AdpType=Post 10 mark _ ChunkId=VGNN|ChunkType=child|Translit=ke
12 लिए लिए ADP PSP AdpType=Post 10 mark _ ChunkId=VGNN|ChunkType=child|Translit=lie
--
9 देश देश NOUN NN Case=Acc|Gender=Masc|Number=Sing|Person=3 11 nmod _ Vib=0_का|Tam=0|ChunkId=NP4|ChunkType=head|Translit=deśa
10 के का ADP PSP AdpType=Post|Case=Acc|Gender=Masc|Number=Plur 9 case _ ChunkId=NP4|ChunkType=child|Translit=ke
11 लोगों लोग NOUN NN Case=Acc|Gender=Masc|Number=Plur|Person=3 14 obj _ Vib=0_को|Tam=0|ChunkId=NP5|ChunkType=head|Translit=logoṁ
I suggest going through the dataset some yourself to see if there's a
reasonable standard or if you think it should be changed. The fact that के
isn't consistent is a little suspicious to me, if nothing else. Plus, as
you point out, most other language datasets don't ignore the gender in the
lemma.
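
One way to look for this kind of inconsistency across a whole treebank is to collect the distinct lemmas seen for each form, together with the token that follows, for context. The sketch below does that for the के rows quoted above, again assuming 10 tab-separated CoNLL-U columns and blank lines between sentences. `lemma_variants` and `row` are hypothetical helpers.

```python
from collections import defaultdict


def row(*cols):
    """Join ten column values into one tab-separated CoNLL-U token row."""
    return "\t".join(cols)


# The two excerpts quoted above, separated by a blank line as CoNLL-U sentences would be.
SAMPLE = "\n".join([
    row("10", "जाने", "जा", "VERB", "VM", "Case=Acc|VerbForm=Inf", "16", "advcl", "_", "Translit=jāne"),
    row("11", "के", "के", "ADP", "PSP", "AdpType=Post", "10", "mark", "_", "Translit=ke"),
    row("12", "लिए", "लिए", "ADP", "PSP", "AdpType=Post", "10", "mark", "_", "Translit=lie"),
    "",
    row("9", "देश", "देश", "NOUN", "NN", "Case=Acc|Gender=Masc|Number=Sing|Person=3", "11", "nmod", "_", "Translit=deśa"),
    row("10", "के", "का", "ADP", "PSP", "AdpType=Post|Case=Acc|Gender=Masc|Number=Plur", "9", "case", "_", "Translit=ke"),
    row("11", "लोगों", "लोग", "NOUN", "NN", "Case=Acc|Gender=Masc|Number=Plur|Person=3", "14", "obj", "_", "Translit=logoṁ"),
])


def lemma_variants(conllu_text):
    """Map each FORM to {LEMMA: [following FORM or None, ...]}."""
    variants = defaultdict(lambda: defaultdict(list))
    tokens = []  # (form, lemma) pairs of the sentence being read

    def flush():
        for i, (form, lemma) in enumerate(tokens):
            following = tokens[i + 1][0] if i + 1 < len(tokens) else None
            variants[form][lemma].append(following)
        tokens.clear()

    for line in conllu_text.splitlines():
        cols = line.split("\t")
        if len(cols) == 10 and cols[0].isdigit():  # plain token rows only
            tokens.append((cols[1], cols[2]))
        elif not line.strip():  # blank line ends the sentence
            flush()
    flush()
    return variants


variants = lemma_variants(SAMPLE)
inconsistent = {form: dict(lemmas) for form, lemmas in variants.items() if len(lemmas) > 1}
print(inconsistent)
```

On these rows only के is flagged: it keeps the lemma के when followed by लिए and gets the lemma का when followed by लोगों. Whether that pattern holds across the full dataset is something the real files would have to confirm.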
Anyway, a lot of the dataset maintainers are pretty responsive to issues.
You could create an issue or even a pull request against the Hindi dataset
if you think the lemmas should be updated to reflect the gender of the
pronoun. I would suggest starting from a more neutral attitude rather than
going straight to calling the dataset sexist, though :) If you do effect
some changes in those datasets, we can retrain the models at any point, not
necessarily when UD 2.11 comes out. Alternatively, we can always train the
models from a fork of the dataset if it seems they are not responding and
you are certain your change is an improvement.
BTW, the reason I like this job is learning interesting tidbits about
other languages - in English, the gender of the subject determines the
pronoun, whereas the gender of the object determines the pronoun in Hindi &
Urdu.
|
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
Any luck sorting out the different lemmas in the datasets? Happy to rebuild the models for those languages if we make an improvement to the data. |
This issue has been automatically closed due to inactivity. |