
add whisper normalization on training #1768

Closed
wants to merge 2 commits

Conversation

Moumeneb1
Contributor

Hi :D

In the current Whisper fine-tuning implementation, we train on raw text (no normalisation) but validate and test with Whisper normalisation. This is a small adjustment to fine-tune the model on Whisper-normalised text, since encode only performs tokenisation, not normalisation.

from transformers.models.whisper.tokenization_whisper import WhisperTokenizer

test = "hello i have fifty two dollars"

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-base")
print(tokenizer._normalize(test))
print(tokenizer.decode(tokenizer.encode(test)))
print(tokenizer.decode(tokenizer.encode(tokenizer._normalize(test))))
# printed outputs:
# hello i have $52
# <|startoftranscript|><|notimestamps|>hello i have fifty two dollars<|endoftext|>
# <|startoftranscript|><|notimestamps|>hello i have $52<|endoftext|>

@Moumeneb1
Contributor Author

Taking into account the normalized_transcripts param:

def text_pipeline(wrd):
    # Optionally apply Whisper normalisation before tokenisation
    # (hparams is the loaded hyperparameter dict, so check key membership).
    if hparams.get("normalized_transcripts", False):
        wrd = tokenizer._normalize(wrd)
    yield wrd
    tokens_list = tokenizer.encode(wrd)
    # Avoid bos and eos tokens.
    tokens_list = tokens_list[1:-1]
    yield tokens_list
    tokens_bos = torch.LongTensor([hparams["bos_index"]] + tokens_list)
    yield tokens_bos
    tokens_eos = torch.LongTensor(tokens_list + [hparams["eos_index"]])
    yield tokens_eos
    tokens = torch.LongTensor(tokens_list)
    yield tokens
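
For completeness, this is roughly how the generator gets hooked into the datasets in the SpeechBrain recipes; a minimal sketch only, where train_data/valid_data/test_data stand for the DynamicItemDatasets returned by the recipe's dataio_prepare (names assumed from the CommonVoice/LibriSpeech recipes, body omitted since it is defined above):

import speechbrain as sb

@sb.utils.data_pipeline.takes("wrd")
@sb.utils.data_pipeline.provides(
    "wrd", "tokens_list", "tokens_bos", "tokens_eos", "tokens"
)
def text_pipeline(wrd):
    # same body as defined above
    ...

datasets = [train_data, valid_data, test_data]  # hypothetical DynamicItemDatasets from dataio_prepare
sb.dataio.dataset.add_dynamic_item(datasets, text_pipeline)
sb.dataio.dataset.set_output_keys(
    datasets, ["id", "sig", "wrd", "tokens_bos", "tokens_eos", "tokens"]
)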

@Adel-Moumen
Collaborator

Hello @Moumeneb1,

Thanks for the PR! Are you sure they do normalisation on the training transcripts in the official Whisper paper? If I'm not mistaken they don't, but feel free to correct me.

By the way, do you have results on normalised training vs un-normalised training?

@Adel-Moumen Adel-Moumen self-requested a review on Dec 17, 2022
@Adel-Moumen Adel-Moumen self-assigned this on Dec 17, 2022
@Moumeneb1
Contributor Author

Hi :D !

The model produces numbers as numeric values (even without applying normalization), but on datasets like CommonVoice or LibriSpeech the numbers in the transcripts are written out as words. The paper doesn't really discuss this for fine-tuning, but since the model shows this behaviour, I think they used it during training as well.

These are some outputs I got from plain Whisper without applying normalization:

  • "So if he has $3,000 a month in income",
  • "all four million. So they'll put, they collect $270",
  • " by 2060. And it turns out"

Happy holidays btw :D

@Adel-Moumen
Collaborator

Hello,

Do you have any results to share? e.g. WER ?

Happy new year :)

@Adel-Moumen
Collaborator

Hello @Moumeneb1,

Any news on this PR please?

Thanks!
