Can't add new language to pre-trained spoken language recognition model: Model forgets other languages #1516

Closed
kirillkoncha opened this issue Jul 26, 2022 · 5 comments


@kirillkoncha

Hello,

I am trying to fine-tune an existing spoken language recognition model. I chose the Common Voice language-ID model and am trying to add a new language. I did everything exactly as described in the fine-tuning tutorial (and made sure there is an unknown label in the label encoder as well).

I also tried freezing more layers; for example, I froze every module except the classifier. However, when I fine-tune the model, the performance gets worse. During the first several epochs the model gives various incorrect outputs, but around the 5th epoch it starts assigning the label of the new language to every input.

I also tried to fine-tune the model on 19 different languages (including the previously unknown one), but the results are still the same. Is there any way to fine-tune the model to predict new languages, or is this model not supposed to be fine-tuned? Why can't the model learn new languages, and why does it forget the old ones during fine-tuning?

Here is the class I used for fine-tuning:

import torch
import speechbrain


class LanguageBrain(speechbrain.core.Brain):
    
    def on_stage_start(self, stage, epoch):
        # enable grad for all modules we want to fine-tune
        if stage == speechbrain.Stage.TRAIN:
            for module in [self.modules.compute_features, self.modules.mean_var_norm, 
                           self.modules.embedding_model, self.modules.classifier]:
                for p in module.parameters():
                    p.requires_grad = True

    def compute_forward(self, batch, stage):
        """Computation pipeline based on a encoder + speaker classifier.
        Data augmentation and environmental corruption are applied to the
        input speech.
        """
        batch = batch.to(self.device)
        wavs, lens = batch.sig
        if stage == speechbrain.Stage.TRAIN:

            # Applying the augmentation pipeline
            wavs_aug_tot = []
            wavs_aug_tot.append(wavs)

            # Apply speed perturbation followed by reverberation/noise
            wavs_aug = self.hparams.augment_speed(wavs, lens)
            wavs_aug = self.hparams.add_rev_noise(wavs_aug, lens)
            # Managing speed change: trim or pad so the augmented signal
            # matches the length of the original batch
            if wavs_aug.shape[1] > wavs.shape[1]:
                wavs_aug = wavs_aug[:, 0 : wavs.shape[1]]
            else:
                zero_sig = torch.zeros_like(wavs)
                zero_sig[:, 0 : wavs_aug.shape[1]] = wavs_aug
                wavs_aug = zero_sig
           
            wavs = wavs_aug
            wavs_aug_tot[0] = wavs

            wavs = torch.cat(wavs_aug_tot, dim=0)
            self.n_augment = len(wavs_aug_tot)
            lens = torch.cat([lens] * self.n_augment)
        
        feats = self.modules.compute_features(wavs)
        feats = self.modules.mean_var_norm(feats, lens)

        # Embeddings + language classifier
        embeddings = self.modules.embedding_model(feats, lens)
        outputs = self.modules.classifier(embeddings)
        return outputs, lens

    def compute_objectives(self, predictions, batch, stage):
        """Computes the loss using the language id as the label."""
        predictions, lens = predictions
        uttid = batch.id
        langid = batch.lang_id_encoded
        # Replicate the labels to match the (possibly augmented) batch size
        langid = torch.cat([langid] * self.n_augment, dim=0)
        loss = self.hparams.compute_cost(predictions, langid.unsqueeze(1), lens)
        return loss
    
    def on_stage_end(self, stage, stage_loss, epoch=None):
        """Gets called at the end of each stage (TRAIN, VALID, TEST)."""
        stage_stats = {"loss": stage_loss}
        # Keep only the checkpoint with the lowest loss
        self.hparams.checkpointer.save_and_keep_only(
            meta={"loss": stage_stats["loss"]},
            min_keys=["loss"])

@mravanelli
Collaborator

Hi,
thank you for sharing your results. I'm not sure I totally understood your problem. However, if you use a pre-trained model (e.g., trained on English) and you fine-tune it on data from another language (e.g., French), it is normal that the performance on English gets worse. This phenomenon is called catastrophic forgetting. There are different possible ways to mitigate it. One consists of periodically showing the system some sentences in the previous language (e.g., English sentences). This technique is called replay.
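
A minimal sketch of what replay could look like at the data level (toy tensors instead of real Common Voice data, and plain PyTorch rather than a specific SpeechBrain API):

import torch
from torch.utils.data import TensorDataset, ConcatDataset, DataLoader

# Toy stand-ins for the real corpora: random "waveforms" with language labels.
old_langs_data = TensorDataset(torch.randn(100, 16000), torch.randint(0, 18, (100,)))
new_lang_data = TensorDataset(torch.randn(20, 16000), torch.full((20,), 18))

# Replay: keep showing old-language samples alongside the new language.
replay_dataset = ConcatDataset([old_langs_data, new_lang_data])
replay_loader = DataLoader(replay_dataset, batch_size=32, shuffle=True)

for wavs, lang_ids in replay_loader:
    pass  # each batch now mixes old and new languages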

@kirillkoncha
Author

Thank you for the answer!
I included data from different languages along with the new language (I fed 19 languages to the model, e.g., French, English, and Kyrgyz, the latter of which was not represented in the original pre-trained model). At a certain point during training, the model starts assigning one particular label (for example, French) to every language I want to predict.

However, I did not shuffle my training set, and the data was organised in such a way that all samples of one language were fed to the model before moving on to the next language. Could shuffling the training set help?

@mravanelli
Collaborator

That could play an important role. I think it is important to make sure there are data from different languages in each batch.
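
One simple way to get that is to shuffle, or to go further and weight the sampling by language. The snippet below is only a sketch with toy tensors (not part of any SpeechBrain recipe); it uses PyTorch's WeightedRandomSampler so that under-represented languages still show up in most batches:

import torch
from torch.utils.data import TensorDataset, DataLoader, WeightedRandomSampler

# Toy dataset: 19 languages with very different amounts of data
# (18 well-represented languages and one rare new language).
labels = torch.cat([torch.full((count,), lang)
                    for lang, count in enumerate([50] * 18 + [5])])
waves = torch.randn(len(labels), 16000)
dataset = TensorDataset(waves, labels)

# Weight each sample by the inverse frequency of its language so every batch
# is likely to contain a mix of languages, including the rare new one.
counts = torch.bincount(labels)
weights = 1.0 / counts[labels].float()
sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)

loader = DataLoader(dataset, batch_size=32, sampler=sampler)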

@kirillkoncha
Author

I ensured that every batch contains all the languages I want to detect and fine-tuned the model again; it worked pretty well. Thanks a lot!

@anautsch
Collaborator

Looks like this is solved; closing this one—please feel free to reopen :)
