
Langid model gives languages not in langid_lang_subset on difficult strings #1076

Closed
paulthemagno opened this issue Jul 13, 2022 · 4 comments

@paulthemagno

Describe the bug
If you set langid_lang_subset on the langid processor, the result is not always a language from that subset.

To Reproduce
Steps to reproduce the behavior:

    import stanza

    # restrict language ID to Spanish only
    langid = stanza.Pipeline("multilingual", langid_lang_subset=["es"])
    langid("aaa").lang

The result will be:

la

Expected behavior
The language should be es 100% of the time, since I use lang_subset = ["es"].
Environment:

  • Stanza version: 1.4.0

Additional context
I see that here:
https://github.com/stanfordnlp/stanza/blob/011b6c4831a614439c599fd163a9b40b7c225566/stanza/models/langid/model.py#L85
the self.lang_subset variable is used, which in this case should be ["es"]. Without a subset everything seems to work properly, but when self.lang_subset is set it does not always.

I tried adding some prints:

    def prediction_scores(self, x):
        prediction_probs = self(x)
        print("prediction_probs")
        print(prediction_probs)
        if self.lang_subset:
            print("lang_mask")
            print(self.lang_mask)
            prediction_batch_size = prediction_probs.size()[0]
            print("prediction_batch_size")
            print(prediction_batch_size)
            # one copy of the 0/1 language mask per batch element
            batch_mask = torch.stack([self.lang_mask for _ in range(prediction_batch_size)])
            print("batch_mask")
            print(batch_mask)
            # zero out the scores of languages outside the subset
            prediction_probs = prediction_probs * batch_mask
            print("prediction_probs")
            print(prediction_probs)
        print("argmax")
        print(torch.argmax(prediction_probs, dim=1))
        print(self.idx_to_tag[torch.argmax(prediction_probs, dim=1)])
        return torch.argmax(prediction_probs, dim=1)

The result is:

prediction_probs
tensor([[-0.8162, -1.6429,  2.5455, -1.8278, -2.5117, -2.9207, -3.7888, -0.8804,
         -0.3428,  1.0926,  0.6126, -4.4035, -0.5995, -3.2369,  2.7545, -0.2619,
         -0.7622, -1.5564, -4.9973, -0.7835, -5.0576, -4.1394,  0.3663, -1.4397,
         -3.2930, -1.5079, -0.7348, -1.1671,  1.4471, -6.8030,  1.9239, -2.3856,
         -6.2493,  1.5562, -2.8086, -2.9353, -2.9437,  0.3282,  1.0697, -6.2935,
          0.5006, -0.3015, -0.4489, -0.4419,  6.0343, -2.1565,  0.9285, -2.3867,
         -2.4929, -0.8634, -3.6259, -4.6344, -0.1315, -4.4329,  0.6783,  0.3423,
         -1.2810, -3.3283, -1.4476, -6.0616, -2.5742, -4.2238, -0.5372, -6.5371,
         -2.5767, -1.5134, -3.3457,  1.6572]], device='cuda:0',
       grad_fn=<SumBackward1>)
lang_mask
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
       device='cuda:0')
prediction_batch_size
1
batch_mask
tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.]],
       device='cuda:0')
prediction_probs
tensor([[-0.0000, -0.0000,  0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000,
         -0.0000,  0.0000,  0.0000, -0.0000, -0.0000, -0.0000,  0.0000, -0.0000,
         -0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000,  0.0000, -0.0000,
         -0.0000, -0.0000, -0.0000, -0.0000,  0.0000, -0.0000,  0.0000, -0.0000,
         -0.0000,  0.0000, -0.0000, -0.0000, -0.0000,  0.0000,  0.0000, -0.0000,
          0.0000, -0.0000, -0.0000, -0.0000,  0.0000, -0.0000,  0.0000, -0.0000,
         -0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000,  0.0000,  0.0000,
         -0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.5372, -0.0000,
         -0.0000, -0.0000, -0.0000,  0.0000]], device='cuda:0',
       grad_fn=<MulBackward0>)
argmax
tensor([0], device='cuda:0')
la

I think in this case the string aaa was too difficult for the model (indeed, it doesn't mean anything), so the score for es was -0.5372 (< 0). All the other scores are set to -0.0000, which is a higher value. So when you compute the argmax you get the first label, which happens to be la.

I know this case happens rarely, since with a real word the highest score is usually positive, but I found similar errors with models I trained with your training script: they sometimes pick the first language of the label list because all the languages in the lang_subset have negative scores.
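
To make the effect easy to see outside of stanza, here is a minimal sketch with plain PyTorch and made-up scores (the shapes and values are hypothetical, not taken from the model):

    import torch

    # scores for a hypothetical 4-language model; only index 2 ("es") is in the subset
    scores = torch.tensor([[1.3, 0.4, -0.5, 2.1]])  # the "es" score is negative
    mask = torch.tensor([0., 0., 1., 0.])           # 1 = language in the subset

    masked = scores * mask                          # tensor([[0., 0., -0.5, 0.]])
    print(torch.argmax(masked, dim=1))              # tensor([0]): an illegal language wins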

@AngledLuffa (Collaborator)

I can see exactly where the problem is:

    def build_lang_mask(self, use_gpu=None):
        """
        Build language mask if a lang subset is specified (e.g. ["en", "fr"])
        """
        device = torch.device("cuda") if use_gpu else None
        lang_mask_list = [int(lang in self.lang_subset) for lang in self.idx_to_tag] if self.lang_subset else \
                         [1 for lang in self.idx_to_tag]
        self.lang_mask = torch.tensor(lang_mask_list, device=device, dtype=torch.float)

and then

    def prediction_scores(self, x):
        prediction_probs = self(x)
        if self.lang_subset:
            prediction_batch_size = prediction_probs.size()[0]
            batch_mask = torch.stack([self.lang_mask for _ in range(prediction_batch_size)])
            prediction_probs = prediction_probs * batch_mask  # zeroed illegal entries can beat negative legal scores
        return torch.argmax(prediction_probs, dim=1)

maybe negative infinity instead of 0 for illegal languages would work better

@AngledLuffa
Copy link
Collaborator

definitely not negative infinity, considering that it turns really unlikely languages into supa likely languages...
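
For reference, a sketch of why a multiplicative -inf mask misbehaves, and one alternative (masked_fill with -inf on the illegal positions); this is just an illustration, not necessarily the exact fix that landed:

    import torch

    scores = torch.tensor([[-3.0, -0.5, 1.2]])  # index 1 is the only legal language

    # multiplicative -inf mask: a very unlikely illegal language flips to +inf
    bad_mask = torch.tensor([float("-inf"), 1.0, float("-inf")])
    print(scores * bad_mask)                    # tensor([[inf, -0.5000, -inf]])

    # overwriting illegal positions with -inf leaves legal scores untouched
    legal = torch.tensor([False, True, False])
    fixed = scores.masked_fill(~legal, float("-inf"))
    print(torch.argmax(fixed, dim=1))           # tensor([1]): the legal language wins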

AngledLuffa added a commit that referenced this issue Jul 13, 2022
… languages can be chosen if all legal languages were negative

Addresses #1076
@J38 (Collaborator) commented Jul 14, 2022
Thanks for pointing this out!

AngledLuffa added a commit that referenced this issue Jul 14, 2022
… languages can be chosen if all legal languages were negative

Addresses #1076
@AngledLuffa (Collaborator)

This is now fixed in 1.4.1.
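
For anyone landing here later, a quick way to check after upgrading (based on this thread, the original repro should now stay inside the subset):

    import stanza

    langid = stanza.Pipeline("multilingual", langid_lang_subset=["es"])
    # with the fix, even a nonsense string stays inside the subset
    assert langid("aaa").lang == "es"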
