Language Tamil - Wrong POS tag for "ஊறு" (VERB instead of ADJ) #1319

SmartManoj · 2023-12-07T02:48:37Z

Describe the bug
The word "ஊறு" is given the wrong POS tag.

To Reproduce
Steps to reproduce the behavior:

import logging
import stanza

logging.getLogger('stanza').setLevel(logging.ERROR)

# Download and initialize the Tamil model
# stanza.download('ta')

nlp = stanza.Pipeline(lang='ta')

# Sample text in Tamil
text = "ஊறு + காய் "
# Process the text
doc = nlp(text)

# Iterate over the sentences and tokens to print POS tags
print(f'{"POS":<7} | {"WORD":<10}')
for sentence in doc.sentences:
    for word in sentence.words:
        print(f"{word.pos:7} | {word.text}")

Output:

POS     | WORD      
ADV     | ஊறு
NOUN    | +
PROPN   | காய்

Environment (please complete the following information):

Stanza version: 1.7.0


AngledLuffa · 2023-12-07T03:10:55Z

There isn't a lot of labeled data for Tamil, but we can possibly improve the results for Tamil by including a transformer or at least a charlm. Let me investigate that.

AngledLuffa · 2023-12-08T16:21:05Z

The simplest improvement to make was to add a transformer. I chose Google's Muril Large, as it scored the highest on the dev sets of the UD POS and depparse tasks. (Edit: you can use it now, with the existing 1.7.0 release, with `package="default_accurate"` when building a pipeline) If that's not sufficient improvement, we could also look into getting more data and including it in the model's training data.

SmartManoj · 2023-12-09T16:23:58Z

Right

செய் VERB
தவம் PROPN

--
Wrong With default_accurate

செய் PUNCT
தவம் PUNCT

Code:

import logging

import stanza
from transformers import logging as hf_logging  # aliased so it does not shadow the stdlib module

# Silence transformers and stanza log output below ERROR level
hf_logging.set_verbosity_error()
logging.getLogger('stanza').setLevel(logging.ERROR)

# Download and initialize the Tamil model
# stanza.download('ta')
print('Stanza model loading...')
if 1:  # change to 0 to load the default package instead
    nlp = stanza.Pipeline(lang='ta', package="default_accurate")
else:
    nlp = stanza.Pipeline(lang='ta')
print('Stanza model loaded.')

def do_nlp(text, verbose=False):
    """Return the POS tags of text joined by spaces; print a table if verbose."""
    doc = nlp(text)
    if verbose:
        print(f'{"POS":<7} | {"WORD":<10}')
    res = []
    # Iterate over the sentences and words to collect POS tags
    for sentence in doc.sentences:
        for word in sentence.words:
            if verbose:
                print(f"{word.pos:7} | {word.text}")
            res.append(word.pos)
    if verbose:
        print('----------------------')
    return ' '.join(res)

# Sample words in Tamil
if __name__ == '__main__':
    for i in ('செய்', 'தவம்'):
        print(i, do_nlp(i))

SmartManoj · 2023-12-09T16:29:09Z

கற்ற is ADJ not ADV

AngledLuffa · 2023-12-10T02:16:29Z

Ultimately we would need more data to fix this. Maybe one of the other Tamil POS datasets I mentioned will be suitable


AngledLuffa · 2023-12-10T02:18:04Z

Wait... it shouldn't be tagging things PUNCT. Weirdly I thought that would be improved with some recent changes we made. I can investigate later on


AngledLuffa · 2023-12-10T12:32:11Z

Alright, if you try it again, I set the punct "dropout" for the end of sentences to be significantly higher. I should probably experiment to see if that can just be the default setting for all languages
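Note for readers: the training-time change described here can be pictured as randomly stripping the final punctuation token from some training sentences, so the tagger also sees sentences that do not end in PUNCT and is less inclined to label a lone final word as punctuation. The following is a minimal sketch of that idea only, not Stanza's actual training code; `drop_final_punct`, its `rate` and `seed` parameters, and the toy sentences are all assumptions made for illustration.

```python
import random

def drop_final_punct(sentences, rate, seed=0):
    """Return a copy of `sentences` in which a trailing PUNCT token is removed
    from roughly a fraction `rate` of the sentences that end in one.
    Each sentence is a list of (word, upos) pairs. Hypothetical helper."""
    rng = random.Random(seed)
    augmented = []
    for sent in sentences:
        ends_in_punct = len(sent) > 1 and sent[-1][1] == "PUNCT"
        if ends_in_punct and rng.random() < rate:
            augmented.append(sent[:-1])   # drop the final punctuation token
        else:
            augmented.append(list(sent))  # keep the sentence unchanged
    return augmented

# Toy training data, invented for this sketch.
train = [
    [("A", "DET"), ("learned", "VERB"), ("boy", "NOUN"), (".", "PUNCT")],
    [("She", "PRON"), ("is", "AUX"), ("trained", "VERB"), (".", "PUNCT")],
]
always = drop_final_punct(train, rate=1.0)  # every final PUNCT removed
never = drop_final_punct(train, rate=0.0)   # nothing removed
print([len(s) for s in always])
print([len(s) for s in never])
```

A higher `rate` exposes the model to more sentences without closing punctuation, which is the direction of the change mentioned in the comment above.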

SmartManoj · 2023-12-10T13:19:57Z

Got an error:

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.7.0.json: 369kB [00:00, 24.6MB/s]
2023-12-10 19:08:17 INFO: Downloading default packages for language: ta (Tamil) ...
2023-12-10 19:08:18 INFO: File exists: C:\Users\smart\stanza_resources\ta\default.zip
2023-12-10 19:08:20 INFO: Finished downloading models and saved to C:\Users\smart\stanza_resources.
Stanza model loading...
2023-12-10 19:08:20 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.7.0.json: 369kB [00:00, 22.4MB/s]
2023-12-10 19:08:21 INFO: Loading these models for language: ta (Tamil):
=====================================
| Processor | Package               |
-------------------------------------
| tokenize  | ttb                   |
| mwt       | ttb                   |
| pos       | ttb_muril-large-cased |
| lemma     | ttb_nocharlm          |
| depparse  | ttb_muril-large-cased |
=====================================

2023-12-10 19:08:21 INFO: Using device: cpu
2023-12-10 19:08:21 INFO: Loading: tokenize
2023-12-10 19:08:22 INFO: Loading: mwt
2023-12-10 19:08:22 INFO: Loading: pos
2023-12-10 19:08:32 INFO: Loading: lemma
2023-12-10 19:08:32 INFO: Loading: depparse
Traceback (most recent call last):
  File "c:\Users\smart\Desktop\p2p\c5.py", line 11, in <module>
    nlp = stanza.Pipeline(lang='ta',package="default_accurate")
  File "C:\Python310\lib\site-packages\stanza\pipeline\core.py", line 304, in __init__
    self.processors[processor_name] = NAME_TO_PROCESSOR_CLASS[processor_name](config=curr_processor_config,
  File "C:\Python310\lib\site-packages\stanza\pipeline\depparse_processor.py", line 30, in __init__
    super().__init__(config, pipeline, device)
  File "C:\Python310\lib\site-packages\stanza\pipeline\processor.py", line 193, in __init__
    self._set_up_model(config, pipeline, device)
  File "C:\Python310\lib\site-packages\stanza\pipeline\depparse_processor.py", line 43, in _set_up_model
    self._trainer = Trainer(args=args, pretrain=self.pretrain, model_file=config['model_path'], device=device, foundation_cache=pipeline.foundation_cache)
  File "C:\Python310\lib\site-packages\stanza\models\depparse\trainer.py", line 34, in __init__
    self.load(model_file, pretrain, args, foundation_cache)
  File "C:\Python310\lib\site-packages\stanza\models\depparse\trainer.py", line 120, in load
    self.model = Parser(self.args, self.vocab, emb_matrix=emb_matrix, foundation_cache=foundation_cache)
  File "C:\Python310\lib\site-packages\stanza\models\depparse\model.py", line 38, in __init__
    self.lemma_emb = nn.Embedding(len(vocab['lemma']), self.args['word_emb_dim'], padding_idx=0)
  File "C:\Python310\lib\site-packages\stanza\models\common\vocab.py", line 228, in __getitem__
    return self._vocabs[key]
KeyError: 'lemma'

SmartManoj · 2023-12-10T14:05:52Z

> I set

Where did you set it?

AngledLuffa · 2023-12-10T14:29:31Z

That was a training parameter. Sorry for the inconvenience with the models. That should be fixed now.

SmartManoj · 2023-12-10T14:32:14Z

Where

Got it
https://huggingface.co/stanfordnlp/stanza-ta/commit/1a6352282b2e28a8aa9a9da7f33f215e71405745

SmartManoj · 2023-12-10T14:40:09Z

கற்ற is ADJ not ADV

Now it is showing as VERB. Is there any visualization tool that shows how the model arrives at a tag?

AngledLuffa · 2023-12-10T17:41:23Z

No visualization tool. However, I will point out that the models all expect context. A single word isn't a great query to give it if there's no surrounding text. I don't know about Tamil, but in English it wouldn't even be possible to correctly tag single words: "tag", "tool", "point", "query" being examples from this comment which would be ambiguous.

SmartManoj · 2023-12-11T02:49:59Z

Eg:
A learned boy.
DET VERB NOUN PUNCT

Here, shouldn't "learned" be ADJ?

import logging

import stanza
from transformers import logging as hf_logging  # aliased so it does not shadow the stdlib module

# Silence transformers and stanza log output below ERROR level
hf_logging.set_verbosity_error()
logging.getLogger('stanza').setLevel(logging.ERROR)

# Download the model first if needed
# stanza.download('ta')
print('Stanza model loading...')
lang = 'en'  # set to 'ta' for Tamil
if 1:  # change to 0 to load the default package instead
    nlp = stanza.Pipeline(lang=lang, package="default_accurate")
else:
    nlp = stanza.Pipeline(lang=lang)
print('Stanza model loaded.')

def do_nlp(text, verbose=False):
    """Return the POS tags of text joined by spaces; print a table if verbose."""
    doc = nlp(text)
    if verbose:
        print(f'{"POS":<7} | {"WORD":<10}')
    res = []
    # Iterate over the sentences and words to collect POS tags
    for sentence in doc.sentences:
        for word in sentence.words:
            if verbose:
                print(f"{word.pos:7} | {word.text}")
            res.append(word.pos)
    if verbose:
        print('----------------------')
    return ' '.join(res)

if __name__ == '__main__':
    # With lang = 'ta', use words = ('கற்ற சிறுவன்',) instead
    words = ('A learned boy.',)
    for i in words:
        print(i, do_nlp(i))

AngledLuffa · 2023-12-11T04:30:39Z

This one is kinda borderline, and I'll point to some examples of trained used as a verb in the EWT and GUM datasets:

# sent_id = newsgroup-groups.google.com_misc.consumers_a534e32067078b08_ENG_20060116_030800-0026
# text = They include 120,000 Iranian Revolutionary Guards trained for land and naval asymmetrical warfare.
4       Iranian Iranian ADJ     NNP     Degree=Pos      6       amod    6:amod  _
5       Revolutionary   Revolutionary   ADJ     NNP     Degree=Pos      6       amod    6:amod  _
6       Guards  Guard   PROPN   NNPS    Number=Plur     2       obj     2:obj   _
7       trained train   VERB    VBN     Tense=Past|VerbForm=Part|Voice=Pass     6       acl     6:acl   _
8       for     for     ADP     IN      _       13      case    13:case _
9       land    land    NOUN    NN      Number=Sing     13      compound        13:compound     _
10      and     and     CCONJ   CC      _       11      cc      11:cc   _
11      naval   naval   ADJ     JJ      Degree=Pos      9       conj    9:conj:and|13:compound  _
12      asymmetrical    asymmetrical    ADJ     JJ      Degree=Pos      13      amod    13:amod _
13      warfare warfare NOUN    NN      Number=Sing     7       obl     7:obl:for       SpaceAfter=No

# sent_id = answers-20111108105225AAAJ9ek_ans-0014
# text = If your cat is not trained to use the litter pan, you may have a problem taking her.
1       If      if      SCONJ   IN      _       6       mark    6:mark  _
2       your    your    PRON    PRP$    Case=Gen|Person=2|Poss=Yes|PronType=Prs 3       nmod:poss       3:nmod:poss     _
3       cat     cat     NOUN    NN      Number=Sing     6       nsubj:pass      6:nsubj:pass|8:nsubj:xsubj      _
4       is      be      AUX     VBZ     Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   6       aux:pass        6:aux:pass      _
5       not     not     PART    RB      _       6       advmod  6:advmod        _
6       trained train   VERB    VBN     Tense=Past|VerbForm=Part|Voice=Pass     15      advcl   15:advcl:if     _

# sent_id = answers-20111108111031AARG57j_ans-0015
# text = She is crate trained, potty trained, ...
1       She     she     PRON    PRP     Case=Nom|Gender=Fem|Number=Sing|Person=3|PronType=Prs   4       nsubj:pass      4:nsubj:pass|7:nsubj:pass|11:nsubj      _
2       is      be      AUX     VBZ     Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   4       aux:pass        4:aux:pass      _
3       crate   crate   NOUN    NN      Number=Sing     4       obl:npmod       4:obl:npmod     _
4       trained train   VERB    VBN     Tense=Past|VerbForm=Part|Voice=Pass     0       root    0:root  SpaceAfter=No
5       ,       ,       PUNCT   ,       _       7       punct   7:punct _
6       potty   potty   NOUN    NN      Number=Sing     7       obl:npmod       7:obl:npmod     _
7       trained train   VERB    VBN     Tense=Past|VerbForm=Part|Voice=Pass     4       conj    4:conj:and      SpaceAfter=No

# sent_id = GUM_voyage_sydfynske-27
# text = Several rental places also gives you the option of trained guide, which can both provide information about the sights you visit, and make sure you are safe.
7       the     the     DET     DT      Definite=Def|PronType=Art       8       det     8:det   Entity=(109-abstract-new-cf3-2-sgl
8       option  option  NOUN    NN      Number=Sing     5       obj     5:obj   MSeg=opt-ion
9       of      of      ADP     IN      _       11      case    11:case _
10      trained train   VERB    VBN     Tense=Past|VerbForm=Part|Voice=Pass     11      amod    11:amod Entity=(110-person-new-cf6-2-sgl|MSeg=train-ed
11      guide   guide   NOUN    NN      Number=Sing     8       nmod    8:nmod:of|16:nsubj      SpaceAfter=No

so maybe learned as VERB is correct here. However, regardless, there isn't a single instance of learned as an ADJ in the datasets we use to train, so I would never expect the model to get it right.
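Note for readers: the lookup described in this comment can be repeated on any CoNLL-U file by counting the UPOS column for a given word form. The sketch below is one way to do that under stated assumptions: `count_upos` is a hypothetical helper, and the embedded sample reuses three token lines quoted above (with the MISC column simplified to `_`) instead of reading the real EWT or GUM files.

```python
from collections import Counter

def count_upos(conllu_text, form):
    """Count UPOS tags of tokens whose FORM equals `form`, ignoring case.
    Hypothetical helper for illustration."""
    counts = Counter()
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed CoNLL-U token line
        tok_id, tok_form, upos = cols[0], cols[1], cols[3]
        if "-" in tok_id or "." in tok_id:
            continue  # skip multiword token ranges and empty nodes
        if tok_form.lower() == form.lower():
            counts[upos] += 1
    return counts

# Three token lines taken from the examples quoted above.
rows = [
    ["6", "Guards", "Guard", "PROPN", "NNPS", "Number=Plur", "2", "obj", "2:obj", "_"],
    ["7", "trained", "train", "VERB", "VBN",
     "Tense=Past|VerbForm=Part|Voice=Pass", "6", "acl", "6:acl", "_"],
    ["10", "trained", "train", "VERB", "VBN",
     "Tense=Past|VerbForm=Part|Voice=Pass", "11", "amod", "11:amod", "_"],
]
sample = "# text = (excerpt)\n" + "\n".join("\t".join(r) for r in rows)

print(count_upos(sample, "trained"))
print(count_upos(sample, "learned"))
```

Pointing the same function at the text of a full training file would give the per-tag counts for any word of interest.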

SmartManoj · 2023-12-11T04:48:12Z

Did you consider pointing to examples of learned instead of trained?

https://www.dictionary.com/browse/learned

AngledLuffa · 2023-12-11T05:35:41Z

Yes, as I stated earlier, I looked for those examples, and there was not a single example of learned as an ADJ in the training data. I mean, I do understand the meaning of learned that you're going for, but 1) as shown with the trained examples, it's not clear whether the annotation scheme we used would have tagged it as ADJ or as related to the past participle use in "he learned something" and 2) it's immaterial because there are 33 instances of learned as a VERB and 0 as an ADJ, so the statistical models we use will tag it as a VERB no matter what sentence you write.

You seem to care deeply about this particular possible mistagging, so I created an issue where I asked people who know more linguistics than I do what their opinion is:

UniversalDependencies/docs#1004

If they like ADJ we can possibly add a few sentences to the English training data with the appropriate context, but that's not a "today" project, at any rate.

SmartManoj · 2023-12-11T05:39:32Z

that you're going for

https://www.dictionary.com/browse/learned#:~:text=on%20Thesaurus.com-,adjective,-having%20much%20knowledge

AngledLuffa · 2023-12-11T05:44:01Z

As I said, I am familiar with that meaning, and it is not in use anywhere in the training data, which makes it a serious problem for a statistical model to be able to predict that meaning. Is there some clarification needed on how that works?
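Note for readers: as a rough illustration of why a tag that never occurs for a word in training is hard to predict, consider the simplest statistical tagger, a most-frequent-tag baseline. This is a deliberate simplification and not how Stanza's neural tagger works; `most_frequent_tag`, `relative_frequency`, and the `fallback` tag are assumptions made for this sketch. The only real numbers are the 33 VERB and 0 ADJ counts for learned given earlier in the thread.

```python
from collections import Counter

# Counts for "learned" as reported in the thread: 33 VERB, 0 ADJ.
tag_counts = {"learned": Counter({"VERB": 33, "ADJ": 0})}

def most_frequent_tag(word, counts, fallback="X"):
    """Return the tag seen most often for `word`, or `fallback` if unseen."""
    seen = counts.get(word.lower())
    if not seen or sum(seen.values()) == 0:
        return fallback
    return seen.most_common(1)[0][0]

def relative_frequency(word, tag, counts):
    """Fraction of training occurrences of `word` that carry `tag`."""
    seen = counts.get(word.lower(), Counter())
    total = sum(seen.values())
    return seen[tag] / total if total else 0.0

print(most_frequent_tag("learned", tag_counts))
print(relative_frequency("learned", "ADJ", tag_counts))
```

A neural tagger also uses the surrounding words, but it still learns from the same labeled examples, so a word and tag pairing with zero training occurrences gives it nothing to generalize from directly.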

SmartManoj · 2023-12-11T06:17:55Z

that you're going for

dictionary.com/browse/learned#:~:text=on%20Thesaurus.com-,adjective,-having%20much%20knowledge

I was saying that they mentioned it as an Adjective instead of a verb here.

--

https://www.dictionary.com/browse/trained

For trained, they redirected to train itself.

--

What do you think about this?

SmartManoj added the bug label Dec 7, 2023

AngledLuffa mentioned this issue Dec 11, 2023

Would learned, describing how much knowledge a person has acquired, be treated as an ADJ or a VERB? UniversalDependencies/docs#1004

Closed
