
add whisper normalization on training #1768

Closed
wants to merge 2 commits

Conversation

Moumeneb1
Contributor

Hi :D

In the current Whisper fine-tuning implementation, we train on raw text (no normalisation) but validate and test with Whisper normalisation. This is a small adjustment to fine-tune the model on Whisper-normalised text, since encode only performs tokenisation, not normalisation.

from transformers.models.whisper.tokenization_whisper import WhisperTokenizer

test = "hello i have fifty two dollars"

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-base")
print(tokenizer._normalize(test))
print(tokenizer.decode(tokenizer.encode(test)))
print(tokenizer.decode(tokenizer.encode(tokenizer._normalize(test))))
# printed outputs:
# hello i have $52
# <|startoftranscript|><|notimestamps|>hello i have fifty two dollars<|endoftext|>
# <|startoftranscript|><|notimestamps|>hello i have $52<|endoftext|>

@Moumeneb1
Contributor Author

Taking into account the normalized_transcripts param:

def text_pipeline(wrd):
    # Optionally apply Whisper normalisation before tokenisation
    # (hparams is the loaded hyperparameter dict, so check key membership).
    if hparams.get("normalized_transcripts", False):
        wrd = tokenizer._normalize(wrd)
    yield wrd
    tokens_list = tokenizer.encode(wrd)
    # Avoid bos and eos tokens.
    tokens_list = tokens_list[1:-1]
    yield tokens_list
    tokens_bos = torch.LongTensor([hparams["bos_index"]] + tokens_list)
    yield tokens_bos
    tokens_eos = torch.LongTensor(tokens_list + [hparams["eos_index"]])
    yield tokens_eos
    tokens = torch.LongTensor(tokens_list)
    yield tokens
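
For completeness, this is roughly how the generator gets hooked into the datasets in the SpeechBrain recipes; a minimal sketch only, where train_data/valid_data/test_data stand for the DynamicItemDatasets returned by the recipe's dataio_prepare (names assumed from the CommonVoice/LibriSpeech recipes, body omitted since it is defined above):

import speechbrain as sb

@sb.utils.data_pipeline.takes("wrd")
@sb.utils.data_pipeline.provides(
    "wrd", "tokens_list", "tokens_bos", "tokens_eos", "tokens"
)
def text_pipeline(wrd):
    # same body as defined above
    ...

datasets = [train_data, valid_data, test_data]  # hypothetical DynamicItemDatasets from dataio_prepare
sb.dataio.dataset.add_dynamic_item(datasets, text_pipeline)
sb.dataio.dataset.set_output_keys(
    datasets, ["id", "sig", "wrd", "tokens_bos", "tokens_eos", "tokens"]
)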

@Adel-Moumen
Collaborator

Hello @Moumeneb1,

Thanks for the PR! Are you sure they do normalisation on the training transcripts in the official Whisper paper? If I'm not mistaken they don't, but feel free to correct me.

By the way, do you have results on normalised training vs un-normalised training?

@Adel-Moumen Adel-Moumen self-requested a review on Dec 17, 2022
@Adel-Moumen Adel-Moumen self-assigned this on Dec 17, 2022
@Moumeneb1
Contributor Author

Hi :D !

The model produces numbers as numeric values (even without applying normalization), but on datasets like CommonVoice or LibriSpeech the numbers in the transcripts are written out as words. The paper doesn't really discuss this for fine-tuning, but since the model shows this behaviour, I think they used it during training as well.

These are some outputs I got from plain Whisper without applying normalization:

  • "So if he has $3,000 a month in income",
  • "all four million. So they'll put, they collect $270",
  • " by 2060. And it turns out"

Happy holidays btw :D

@Adel-Moumen
Collaborator

Hello,

Do you have any results to share? e.g. WER ?

Happy new year :)

@Adel-Moumen
Collaborator

Hello @Moumeneb1,

Any news on this PR please?

Thanks!
