Language Tamil - Wrong POS tag for "ஊறு" (VERB instead of ADJ) #1319

Open
SmartManoj opened this issue Dec 7, 2023 · 20 comments
@SmartManoj

Describe the bug
ஊறு

To Reproduce
Steps to reproduce the behavior:

import logging
import stanza

logging.getLogger('stanza').setLevel(logging.ERROR)

# Download and initialize the Tamil model
# stanza.download('ta')

nlp = stanza.Pipeline(lang='ta')

# Sample text in Tamil
text = "ஊறு + காய் "
# Process the text
doc = nlp(text)

# Iterate over the sentences and tokens to print POS tags
print(f'{"POS":<7} | {"WORD":<10}')
for sentence in doc.sentences:
    for word in sentence.words:
        print(f"{word.pos:7} | {word.text}")


Output:

POS     | WORD      
ADV     | ஊறு
NOUN    | +
PROPN   | காய்

Environment (please complete the following information):

  • Stanza version: 1.7.0
@SmartManoj SmartManoj added the bug label Dec 7, 2023
@AngledLuffa
Collaborator

AngledLuffa commented Dec 7, 2023 via email

@AngledLuffa
Collaborator

AngledLuffa commented Dec 8, 2023 via email

@SmartManoj
Author

Right

செய் VERB
தவம் PROPN

--
Wrong with default_accurate:

செய் PUNCT
தவம் PUNCT


Code:

import logging

import stanza
# Alias the transformers logger so it does not shadow the stdlib logging module
from transformers import logging as hf_logging

# Silence transformers and stanza log output
hf_logging.set_verbosity_error()
logging.getLogger('stanza').setLevel(logging.ERROR)

# Download and initialize the Tamil model
# stanza.download('ta')
USE_ACCURATE = True  # set to False to load the default package instead
print('Stanza model loading...')
if USE_ACCURATE:
    nlp = stanza.Pipeline(lang='ta', package="default_accurate")
else:
    nlp = stanza.Pipeline(lang='ta')
print('Stanza model loaded.')

def do_nlp(text, verbose=False):
    """Print a POS table if verbose; otherwise return the tags joined by spaces."""
    doc = nlp(text)
    if verbose:
        print(f'{"POS":<7} | {"WORD":<10}')
    res = []
    # Iterate over the sentences and words to collect POS tags
    for sentence in doc.sentences:
        for word in sentence.words:
            if verbose:
                print(f"{word.pos:7} | {word.text}")
            else:
                res.append(word.pos)
    return ' '.join(res)

# Sample words in Tamil, tagged one at a time
if __name__ == '__main__':
    for i in ('செய்', 'தவம்'):
        print(i, do_nlp(i))

@SmartManoj
Author

கற்ற is ADJ not ADV

@AngledLuffa
Collaborator

AngledLuffa commented Dec 10, 2023 via email

@AngledLuffa
Collaborator

AngledLuffa commented Dec 10, 2023 via email

@AngledLuffa
Collaborator

AngledLuffa commented Dec 10, 2023 via email

@SmartManoj
Author

SmartManoj commented Dec 10, 2023

Got error

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.7.0.json: 369kB [00:00, 24.6MB/s]
2023-12-10 19:08:17 INFO: Downloading default packages for language: ta (Tamil) ...
2023-12-10 19:08:18 INFO: File exists: C:\Users\smart\stanza_resources\ta\default.zip
2023-12-10 19:08:20 INFO: Finished downloading models and saved to C:\Users\smart\stanza_resources.
Stanza model loading...
2023-12-10 19:08:20 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.7.0.json: 369kB [00:00, 22.4MB/s]
2023-12-10 19:08:21 INFO: Loading these models for language: ta (Tamil):
=====================================
| Processor | Package               |
-------------------------------------
| tokenize  | ttb                   |
| mwt       | ttb                   |
| pos       | ttb_muril-large-cased |
| lemma     | ttb_nocharlm          |
| depparse  | ttb_muril-large-cased |
=====================================

2023-12-10 19:08:21 INFO: Using device: cpu
2023-12-10 19:08:21 INFO: Loading: tokenize
2023-12-10 19:08:22 INFO: Loading: mwt
2023-12-10 19:08:22 INFO: Loading: pos
2023-12-10 19:08:32 INFO: Loading: lemma
2023-12-10 19:08:32 INFO: Loading: depparse
Traceback (most recent call last):
  File "c:\Users\smart\Desktop\p2p\c5.py", line 11, in <module>
    nlp = stanza.Pipeline(lang='ta',package="default_accurate")
  File "C:\Python310\lib\site-packages\stanza\pipeline\core.py", line 304, in __init__
    self.processors[processor_name] = NAME_TO_PROCESSOR_CLASS[processor_name](config=curr_processor_config,
  File "C:\Python310\lib\site-packages\stanza\pipeline\depparse_processor.py", line 30, in __init__
    super().__init__(config, pipeline, device)
  File "C:\Python310\lib\site-packages\stanza\pipeline\processor.py", line 193, in __init__
    self._set_up_model(config, pipeline, device)
  File "C:\Python310\lib\site-packages\stanza\pipeline\depparse_processor.py", line 43, in _set_up_model
    self._trainer = Trainer(args=args, pretrain=self.pretrain, model_file=config['model_path'], device=device, foundation_cache=pipeline.foundation_cache)
  File "C:\Python310\lib\site-packages\stanza\models\depparse\trainer.py", line 34, in __init__
    self.load(model_file, pretrain, args, foundation_cache)
  File "C:\Python310\lib\site-packages\stanza\models\depparse\trainer.py", line 120, in load
    self.model = Parser(self.args, self.vocab, emb_matrix=emb_matrix, foundation_cache=foundation_cache)
  File "C:\Python310\lib\site-packages\stanza\models\depparse\model.py", line 38, in __init__
    self.lemma_emb = nn.Embedding(len(vocab['lemma']), self.args['word_emb_dim'], padding_idx=0)
  File "C:\Python310\lib\site-packages\stanza\models\common\vocab.py", line 228, in __getitem__
    return self._vocabs[key]
KeyError: 'lemma'

@SmartManoj
Author

I set

Where did you set it?

@AngledLuffa
Collaborator

AngledLuffa commented Dec 10, 2023 via email

@SmartManoj
Author

SmartManoj commented Dec 10, 2023

@SmartManoj
Author

கற்ற is ADJ not ADV

Now it is showing as VERB. Is there any visualization tool for how it detects?

@AngledLuffa
Collaborator

No visualization tool. However, I will point out that the models all expect context. A single word isn't a great query to give it if there's no surrounding text. I don't know about Tamil, but in English it wouldn't even be possible to correctly tag single words: "tag", "tool", "point", "query" being examples from this comment which would be ambiguous.
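
One way to check how much context matters is to tag the same word alone and inside a sentence and compare the results. The helper below is a minimal sketch: `upos_of` is a hypothetical name, not part of Stanza, and it only assumes that `nlp` is a callable returning a document with `sentences`, `words`, `text`, and `upos`, as a `stanza.Pipeline` does.

```python
def upos_of(nlp, text, target):
    """Return the UPOS tag of every word in `text` whose surface form is `target`.

    `nlp` is assumed to behave like a stanza.Pipeline: calling it on a string
    returns a document with .sentences, each holding .words with .text and .upos.
    """
    doc = nlp(text)
    return [
        word.upos
        for sentence in doc.sentences
        for word in sentence.words
        if word.text == target
    ]

# Hypothetical usage with a real pipeline (requires the English models):
#   import stanza
#   nlp = stanza.Pipeline(lang="en")
#   print(upos_of(nlp, "point", "point"))                  # word in isolation
#   print(upos_of(nlp, "I will point out a tool.", "point"))  # word in context
```

If the two calls disagree, the isolated query is the less trustworthy one, since the tagger had nothing around the word to condition on.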

@SmartManoj
Author

Eg:
A learned boy.
DET VERB NOUN PUNCT

Here, shouldn't "learned" be ADJ?

import logging

import stanza
# Alias the transformers logger so it does not shadow the stdlib logging module
from transformers import logging as hf_logging

# Silence transformers and stanza log output
hf_logging.set_verbosity_error()
logging.getLogger('stanza').setLevel(logging.ERROR)

# Download and initialize the model
# stanza.download('ta')
print('Stanza model loading...')
lang = 'en'  # use 'ta' for the Tamil example
USE_ACCURATE = True  # set to False to load the default package instead
if USE_ACCURATE:
    nlp = stanza.Pipeline(lang=lang, package="default_accurate")
else:
    nlp = stanza.Pipeline(lang=lang)
print('Stanza model loaded.')

def do_nlp(text, verbose=False):
    """Print a POS table if verbose; otherwise return the tags joined by spaces."""
    doc = nlp(text)
    if verbose:
        print(f'{"POS":<7} | {"WORD":<10}')
    res = []
    # Iterate over the sentences and words to collect POS tags
    for sentence in doc.sentences:
        for word in sentence.words:
            if verbose:
                print(f"{word.pos:7} | {word.text}")
            else:
                res.append(word.pos)
    return ' '.join(res)

# Sample text; the Tamil equivalent is ('கற்ற சிறுவன்',)
if __name__ == '__main__':
    words = ('A learned boy.',)
    for i in words:
        print(i, do_nlp(i))

@AngledLuffa
Collaborator

This one is kinda borderline, and I'll point to some examples of trained used as a verb in the EWT and GUM datasets:

# sent_id = newsgroup-groups.google.com_misc.consumers_a534e32067078b08_ENG_20060116_030800-0026
# text = They include 120,000 Iranian Revolutionary Guards trained for land and naval asymmetrical warfare.
4       Iranian Iranian ADJ     NNP     Degree=Pos      6       amod    6:amod  _
5       Revolutionary   Revolutionary   ADJ     NNP     Degree=Pos      6       amod    6:amod  _
6       Guards  Guard   PROPN   NNPS    Number=Plur     2       obj     2:obj   _
7       trained train   VERB    VBN     Tense=Past|VerbForm=Part|Voice=Pass     6       acl     6:acl   _
8       for     for     ADP     IN      _       13      case    13:case _
9       land    land    NOUN    NN      Number=Sing     13      compound        13:compound     _
10      and     and     CCONJ   CC      _       11      cc      11:cc   _
11      naval   naval   ADJ     JJ      Degree=Pos      9       conj    9:conj:and|13:compound  _
12      asymmetrical    asymmetrical    ADJ     JJ      Degree=Pos      13      amod    13:amod _
13      warfare warfare NOUN    NN      Number=Sing     7       obl     7:obl:for       SpaceAfter=No

# sent_id = answers-20111108105225AAAJ9ek_ans-0014
# text = If your cat is not trained to use the litter pan, you may have a problem taking her.
1       If      if      SCONJ   IN      _       6       mark    6:mark  _
2       your    your    PRON    PRP$    Case=Gen|Person=2|Poss=Yes|PronType=Prs 3       nmod:poss       3:nmod:poss     _
3       cat     cat     NOUN    NN      Number=Sing     6       nsubj:pass      6:nsubj:pass|8:nsubj:xsubj      _
4       is      be      AUX     VBZ     Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   6       aux:pass        6:aux:pass      _
5       not     not     PART    RB      _       6       advmod  6:advmod        _
6       trained train   VERB    VBN     Tense=Past|VerbForm=Part|Voice=Pass     15      advcl   15:advcl:if     _

# sent_id = answers-20111108111031AARG57j_ans-0015
# text = She is crate trained, potty trained, ...
1       She     she     PRON    PRP     Case=Nom|Gender=Fem|Number=Sing|Person=3|PronType=Prs   4       nsubj:pass      4:nsubj:pass|7:nsubj:pass|11:nsubj      _
2       is      be      AUX     VBZ     Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   4       aux:pass        4:aux:pass      _
3       crate   crate   NOUN    NN      Number=Sing     4       obl:npmod       4:obl:npmod     _
4       trained train   VERB    VBN     Tense=Past|VerbForm=Part|Voice=Pass     0       root    0:root  SpaceAfter=No
5       ,       ,       PUNCT   ,       _       7       punct   7:punct _
6       potty   potty   NOUN    NN      Number=Sing     7       obl:npmod       7:obl:npmod     _
7       trained train   VERB    VBN     Tense=Past|VerbForm=Part|Voice=Pass     4       conj    4:conj:and      SpaceAfter=No

# sent_id = GUM_voyage_sydfynske-27
# text = Several rental places also gives you the option of trained guide, which can both provide information about the sights you visit, and make sure you are safe.
7       the     the     DET     DT      Definite=Def|PronType=Art       8       det     8:det   Entity=(109-abstract-new-cf3-2-sgl
8       option  option  NOUN    NN      Number=Sing     5       obj     5:obj   MSeg=opt-ion
9       of      of      ADP     IN      _       11      case    11:case _
10      trained train   VERB    VBN     Tense=Past|VerbForm=Part|Voice=Pass     11      amod    11:amod Entity=(110-person-new-cf6-2-sgl|MSeg=train-ed
11      guide   guide   NOUN    NN      Number=Sing     8       nmod    8:nmod:of|16:nsubj      SpaceAfter=No

so maybe learned as VERB is correct here. However, regardless, there isn't a single instance of learned as an ADJ in the datasets we use to train, so I would never expect the model to get it right.
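
A count like the one described above can be reproduced by scanning a treebank file for a word form and tallying its UPOS column. The sketch below is an illustration only: `upos_counts` is a hypothetical helper, and the sample holds three token lines copied from the EWT and GUM excerpts quoted above rather than a full treebank.

```python
from collections import Counter

# Three token rows in CoNLL-U order (ID, FORM, LEMMA, UPOS, XPOS, FEATS,
# HEAD, DEPREL, DEPS, MISC), taken from the excerpts quoted in this thread.
ROWS = [
    ["6", "trained", "train", "VERB", "VBN",
     "Tense=Past|VerbForm=Part|Voice=Pass", "15", "advcl", "15:advcl:if", "_"],
    ["10", "trained", "train", "VERB", "VBN",
     "Tense=Past|VerbForm=Part|Voice=Pass", "11", "amod", "11:amod",
     "Entity=(110-person-new-cf6-2-sgl|MSeg=train-ed"],
    ["11", "guide", "guide", "NOUN", "NN",
     "Number=Sing", "8", "nmod", "8:nmod:of|16:nsubj", "SpaceAfter=No"],
]
SAMPLE = "# sample rows from EWT and GUM\n" + "\n".join("\t".join(r) for r in ROWS)

def upos_counts(conllu_text, form):
    """Count the UPOS tags assigned to `form` (case-insensitive)."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        if cols[1].lower() == form.lower():
            counts[cols[3]] += 1  # fourth column is UPOS
    return counts

print(upos_counts(SAMPLE, "trained"))  # Counter({'VERB': 2})
```

Running the same function over the full training files, with "learned" as the form, is the kind of check that shows whether an ADJ reading ever appears in the data.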

@SmartManoj
Author

SmartManoj commented Dec 11, 2023

Did you mean to point to examples of learned instead of trained?

https://www.dictionary.com/browse/learned

@AngledLuffa
Collaborator

Yes, as I stated earlier, I looked for those examples, and there was not a single example of learned as an ADJ in the training data. I mean, I do understand the meaning of learned that you're going for, but 1) as shown with the trained examples, it's not clear whether the annotation scheme we used would have tagged it as ADJ or as related to the use of the past participle of "he learned something" and 2) it's immaterial because there are 33 instances of learned as a VERB and 0 as an ADJ, so the statistical models we use will tag it as a VERB no matter what sentence you write.

You seem to care deeply about this particular possible mistagging, so I created an issue where I asked people who know more linguistics than I do what their opinion is:

UniversalDependencies/docs#1004

If they like ADJ we can possibly add a few sentences to the English training data with the appropriate context, but that's not a "today" project, at any rate.
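
The effect of those counts can be illustrated with a most-frequent-tag baseline. This is a deliberately simplified sketch and not how Stanza's neural tagger works (the real model also uses context). The 33/0 figures come from the comment above, while `TRAIN_COUNTS` and `most_frequent_tag` are hypothetical names introduced for the example.

```python
from collections import Counter

# Tag counts for "learned" as reported in this thread: 33 VERB, 0 ADJ.
TRAIN_COUNTS = {"learned": Counter({"VERB": 33})}

def most_frequent_tag(word, counts, default="X"):
    """Return the tag seen most often with `word` in training, else `default`."""
    tags = counts.get(word.lower())
    if not tags:
        return default  # word never seen in training
    return tags.most_common(1)[0][0]

def seen_with_tag(word, tag, counts):
    """True if `word` was ever annotated with `tag` in the training counts."""
    return counts.get(word.lower(), Counter())[tag] > 0

print(most_frequent_tag("learned", TRAIN_COUNTS))     # VERB
print(seen_with_tag("learned", "ADJ", TRAIN_COUNTS))  # False
```

A contextual model is more flexible than this baseline, but it still has no direct evidence for a word/tag pairing that never occurs in its training data, which is why adding annotated sentences is the proposed remedy.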

@SmartManoj
Author

@AngledLuffa
Collaborator

As I said, I am familiar with that meaning, and it is not in use anywhere in the training data, which makes it a serious problem for a statistical model to be able to predict that meaning. Is there some clarification needed on how that works?

@SmartManoj
Author

SmartManoj commented Dec 11, 2023

that you're going for

dictionary.com/browse/learned#:~:text=on%20Thesaurus.com-,adjective,-having%20much%20knowledge

I was pointing out that the dictionary lists it as an adjective rather than a verb here.

--

https://www.dictionary.com/browse/trained

For trained, the dictionary redirects to train itself.

--

What do you think about this?
