Language Tamil - Wrong POS tag for "ஊறு" (VERB instead of ADJ) #1319
Comments
There isn't a lot of labeled data for Tamil, but we can possibly improve
the results for Tamil by including a transformer or at least a charlm. Let
me investigate that.
|
The simplest improvement to make was to add a transformer. I chose Google's MuRIL Large, as it scored the highest on the dev sets of the UD POS and depparse tasks.
(Edit: you can use it now, with the existing 1.7.0 release, with `package="default_accurate"` when building a pipeline)
If that's not sufficient improvement, we could also look into getting more data and including it in the model's training data.
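For reference, a minimal sketch of building a pipeline with the `package="default_accurate"` setting mentioned above. The language code `"ta"` and the helper name `build_tamil_pipeline` are assumptions for illustration, not something taken from this thread; the models are expected to be downloaded on first use.

```python
# Keyword arguments for the pipeline. "default_accurate" is the package
# named in the comment above; "ta" is assumed to be the Tamil language code.
PIPELINE_KWARGS = {"lang": "ta", "package": "default_accurate"}


def build_tamil_pipeline():
    """Build a Stanza pipeline using the transformer-based models.

    Hypothetical helper: stanza is imported lazily so that this module
    can be imported even where stanza is not installed.
    """
    import stanza

    return stanza.Pipeline(**PIPELINE_KWARGS)
```

Calling `build_tamil_pipeline()` once and reusing the returned pipeline avoids reloading the models for every query.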
|
Right. Expected: செய் VERB -- Got: செய் PUNCT. Code:
|
கற்ற is ADJ not ADV |
Ultimately we would need more data to fix this. Maybe one of the other
Tamil POS datasets I mentioned will be suitable
…On Sun, Dec 10, 2023, 12:29 AM மனோஜ்குமார் பழனிச்சாமி < ***@***.***> wrote:
கற்ற is ADJ not ADV
|
Wait... it shouldn't be tagging things PUNCT. Weirdly I thought that would
be improved with some recent changes we made. I can investigate later on
…On Sun, Dec 10, 2023, 10:16 AM John Bauer ***@***.***> wrote:
Ultimately we would need more data to fix this. Maybe one of the other
Tamil POS datasets I mentioned will be suitable
|
Alright, please try it again. I set the punct "dropout" for the end of
sentences to be significantly higher. I should probably experiment to see
whether that can be the default setting for all languages
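To illustrate the idea only: the thread does not show how Stanza implements this setting, so the sketch below is one plausible reading of punct "dropout", namely randomly removing sentence-final punctuation from training sentences so the tagger also sees unpunctuated input. The function name, the probability `p`, and the `(form, upos)` sentence representation are all assumptions.

```python
import random


def drop_final_punct(sentence, p, rng):
    """Return the sentence without its final PUNCT token with probability p.

    sentence: list of (form, upos) pairs (a hypothetical representation).
    rng: a random.Random instance, passed in so results are reproducible.
    """
    ends_in_punct = len(sentence) > 1 and sentence[-1][1] == "PUNCT"
    if ends_in_punct and rng.random() < p:
        return sentence[:-1]
    return sentence
```

With a higher `p`, more training sentences end in a content word, which should make a model less inclined to tag a lone final word as PUNCT.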
|
Got error
|
Where did you set it? |
That was a training parameter.
Sorry for the inconvenience with the models. That should be fixed now.
|
Now it is showing as VERB. Is there any visualization tool that shows how it decides on a tag? |
No visualization tool. However, I will point out that the models all expect context. A single word isn't a great query to give it if there's no surrounding text. I don't know about Tamil, but in English it wouldn't even be possible to correctly tag single words: "tag", "tool", "point", "query" being examples from this comment which would be ambiguous. |
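A small sketch of how one might compare the tag a word gets in isolation with the tag it gets inside a sentence. It assumes `nlp` is a Stanza pipeline (or anything returning a document with `sentences`, `words`, `text` and `upos`); the helper name `upos_tags` is made up for this example.

```python
def upos_tags(nlp, text):
    """Run the pipeline on text and return (word, UPOS) pairs in order.

    nlp: a callable such as a stanza.Pipeline; it must return an object
    whose .sentences each have .words with .text and .upos attributes.
    """
    doc = nlp(text)
    return [(word.text, word.upos)
            for sentence in doc.sentences
            for word in sentence.words]
```

Calling `upos_tags(nlp, "ஊறு")` and then `upos_tags(nlp, ...)` on a full sentence containing the same word shows whether the surrounding context changes the prediction.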
E.g., here, shouldn't "learned" be ADJ?
|
This one is kinda borderline, and I'll point to some examples of
so maybe |
Did you think to point examples of |
Yes, as I earlier stated, I looked for those examples, and there was not a single example of You seem to care deeply about this particular possible mistagging, so I created an issue where I asked people who know more linguistics than I do what their opinion is: UniversalDependencies/docs#1004 If they like |
As I said, I am familiar with that meaning, and it is not in use anywhere in the training data, which makes it a serious problem for a statistical model to be able to predict that meaning. Is there some clarification needed on how that works? |
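One way to check this kind of claim yourself is to count how often each UPOS tag occurs for a given word form in the training treebank. The sketch below parses CoNLL-U text (tab-separated, FORM in column 2, UPOS in column 4). The function name and the sample sentences are hypothetical and are not taken from any real treebank.

```python
from collections import Counter, defaultdict


def upos_counts(conllu_text):
    """Map each word form to a Counter of the UPOS tags it appears with."""
    counts = defaultdict(Counter)
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        # Skip malformed rows, multiword-token ranges ("1-2")
        # and empty nodes ("1.1"), whose IDs are not plain integers.
        if len(cols) < 4 or not cols[0].isdigit():
            continue
        counts[cols[1]][cols[3]] += 1
    return counts


# Hypothetical two-sentence sample, for illustration only.
SAMPLE = (
    "# text = They trained\n"
    "1\tThey\tthey\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\ttrained\ttrain\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
    "# text = We trained\n"
    "1\tWe\twe\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\ttrained\ttrain\tVERB\t_\t_\t0\troot\t_\t_\n"
)
```

If a form never appears with a given tag in the training data (as with `ADJ` for "trained" in this made-up sample), a statistical tagger has little evidence from which to predict that tag.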
I was saying that they mentioned it as an Adjective instead of a verb here. -- https://www.dictionary.com/browse/trained For -- What do you think about this? |
Describe the bug
ஊறு
To Reproduce
Steps to reproduce the behavior:
Output:
Environment (please complete the following information):