
Langid model gives languages not in langid_lang_subset on difficult strings #1076

Closed
paulthemagno opened this issue Jul 13, 2022 · 4 comments

@paulthemagno

Describe the bug
If you set langid_lang_subset on the langid processor, the result is not always a language from that subset.

To Reproduce
Steps to reproduce the behavior:

    import stanza

    # restrict language ID to Spanish only
    langid = stanza.Pipeline("multilingual", langid_lang_subset=["es"])
    langid("aaa").lang

The result will be:

la

Expected behavior
The language should be es 100% of the time, since I use lang_subset = ["es"].
Environment:

  • Stanza version: 1.4.0

Additional context
I see that here:
https://github.com/stanfordnlp/stanza/blob/011b6c4831a614439c599fd163a9b40b7c225566/stanza/models/langid/model.py#L85
the self.lang_subset variable is used, which in this case should be ["es"]. Without a subset everything seems to work properly, but when self.lang_subset is set it does not always.

I tried adding some prints:

    def prediction_scores(self, x):
        prediction_probs = self(x)
        print("prediction_probs")
        print(prediction_probs)
        if self.lang_subset:
            print("lang_mask")
            print(self.lang_mask)
            prediction_batch_size = prediction_probs.size()[0]
            print("prediction_batch_size")
            print(prediction_batch_size)
            # one copy of the 0/1 language mask per batch element
            batch_mask = torch.stack([self.lang_mask for _ in range(prediction_batch_size)])
            print("batch_mask")
            print(batch_mask)
            # zero out the scores of languages outside the subset
            prediction_probs = prediction_probs * batch_mask
            print("prediction_probs")
            print(prediction_probs)
        print("argmax")
        print(torch.argmax(prediction_probs, dim=1))
        print(self.idx_to_tag[torch.argmax(prediction_probs, dim=1)])
        return torch.argmax(prediction_probs, dim=1)

The result is:

prediction_probs
tensor([[-0.8162, -1.6429,  2.5455, -1.8278, -2.5117, -2.9207, -3.7888, -0.8804,
         -0.3428,  1.0926,  0.6126, -4.4035, -0.5995, -3.2369,  2.7545, -0.2619,
         -0.7622, -1.5564, -4.9973, -0.7835, -5.0576, -4.1394,  0.3663, -1.4397,
         -3.2930, -1.5079, -0.7348, -1.1671,  1.4471, -6.8030,  1.9239, -2.3856,
         -6.2493,  1.5562, -2.8086, -2.9353, -2.9437,  0.3282,  1.0697, -6.2935,
          0.5006, -0.3015, -0.4489, -0.4419,  6.0343, -2.1565,  0.9285, -2.3867,
         -2.4929, -0.8634, -3.6259, -4.6344, -0.1315, -4.4329,  0.6783,  0.3423,
         -1.2810, -3.3283, -1.4476, -6.0616, -2.5742, -4.2238, -0.5372, -6.5371,
         -2.5767, -1.5134, -3.3457,  1.6572]], device='cuda:0',
       grad_fn=<SumBackward1>)
lang_mask
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
       device='cuda:0')
prediction_batch_size
1
batch_mask
tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.]],
       device='cuda:0')
prediction_probs
tensor([[-0.0000, -0.0000,  0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000,
         -0.0000,  0.0000,  0.0000, -0.0000, -0.0000, -0.0000,  0.0000, -0.0000,
         -0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000,  0.0000, -0.0000,
         -0.0000, -0.0000, -0.0000, -0.0000,  0.0000, -0.0000,  0.0000, -0.0000,
         -0.0000,  0.0000, -0.0000, -0.0000, -0.0000,  0.0000,  0.0000, -0.0000,
          0.0000, -0.0000, -0.0000, -0.0000,  0.0000, -0.0000,  0.0000, -0.0000,
         -0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000,  0.0000,  0.0000,
         -0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.5372, -0.0000,
         -0.0000, -0.0000, -0.0000,  0.0000]], device='cuda:0',
       grad_fn=<MulBackward0>)
argmax
tensor([0], device='cuda:0')
la

I think in this case the string aaa was too difficult for the model (indeed, it doesn't mean anything), so the score for es was -0.5372 (< 0). All the other scores are set to -0.0000, which is a higher value. So when you compute the argmax you get the first label, which happens to be la.

I know this case happens rarely, since with a real word the highest score is usually positive, but I found similar errors with models I trained with your training script: they sometimes pick the first language of the label list because all the languages in the lang_subset have negative scores.
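
To make the effect easy to see outside of stanza, here is a minimal sketch with plain PyTorch and made-up scores (the shapes and values are hypothetical, not taken from the model):

    import torch

    # scores for a hypothetical 4-language model; only index 2 ("es") is in the subset
    scores = torch.tensor([[1.3, 0.4, -0.5, 2.1]])  # the "es" score is negative
    mask = torch.tensor([0., 0., 1., 0.])           # 1 = language in the subset

    masked = scores * mask                          # tensor([[0., 0., -0.5, 0.]])
    print(torch.argmax(masked, dim=1))              # tensor([0]): an illegal language wins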

@AngledLuffa (Collaborator)

I can see exactly where the problem is:

    def build_lang_mask(self, use_gpu=None):
        """
        Build language mask if a lang subset is specified (e.g. ["en", "fr"])
        """
        device = torch.device("cuda") if use_gpu else None
        lang_mask_list = [int(lang in self.lang_subset) for lang in self.idx_to_tag] if self.lang_subset else \
                         [1 for lang in self.idx_to_tag]
        self.lang_mask = torch.tensor(lang_mask_list, device=device, dtype=torch.float)

and then

    def prediction_scores(self, x):
        prediction_probs = self(x)
        if self.lang_subset:
            prediction_batch_size = prediction_probs.size()[0]
            batch_mask = torch.stack([self.lang_mask for _ in range(prediction_batch_size)])
            prediction_probs = prediction_probs * batch_mask  # zeroed illegal entries can beat negative legal scores
        return torch.argmax(prediction_probs, dim=1)

maybe negative infinity instead of 0 for illegal languages would work better

@AngledLuffa
Copy link
Collaborator

definitely not negative infinity, considering that it turns really unlikely languages into supa likely languages...
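
For reference, a sketch of why a multiplicative -inf mask misbehaves, and one alternative (masked_fill with -inf on the illegal positions); this is just an illustration, not necessarily the exact fix that landed:

    import torch

    scores = torch.tensor([[-3.0, -0.5, 1.2]])  # index 1 is the only legal language

    # multiplicative -inf mask: a very unlikely illegal language flips to +inf
    bad_mask = torch.tensor([float("-inf"), 1.0, float("-inf")])
    print(scores * bad_mask)                    # tensor([[inf, -0.5000, -inf]])

    # overwriting illegal positions with -inf leaves legal scores untouched
    legal = torch.tensor([False, True, False])
    fixed = scores.masked_fill(~legal, float("-inf"))
    print(torch.argmax(fixed, dim=1))           # tensor([1]): the legal language wins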

AngledLuffa added a commit that referenced this issue Jul 13, 2022
… languages can be chosen if all legal languages were negative

Addresses #1076
@J38 (Collaborator) commented Jul 14, 2022
Thanks for pointing this out!

AngledLuffa added a commit that referenced this issue Jul 14, 2022
… languages can be chosen if all legal languages were negative

Addresses #1076
@AngledLuffa (Collaborator)

This is now fixed in 1.4.1.
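
For anyone landing here later, a quick way to check after upgrading (based on this thread, the original repro should now stay inside the subset):

    import stanza

    langid = stanza.Pipeline("multilingual", langid_lang_subset=["es"])
    # with the fix, even a nonsense string stays inside the subset
    assert langid("aaa").lang == "es"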
