# Using insanely-fast-whisper

Reference: https://github.com/Vaibhavs10/insanely-fast-whisper/blob/main/notebooks/infer_transformers_whisper_large_v2.ipynb

In [1]:
import torch
from transformers import pipeline

In [2]:
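# Load Whisper large-v2 in half precision on the first GPU.
pipe = pipeline("automatic-speech-recognition",
                "openai/whisper-large-v2",
                torch_dtype=torch.float16,
                device="cuda:0")
# Swap in BetterTransformer fused attention kernels (requires the optimum package).
pipe.model = pipe.model.to_bettertransformer()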

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
The BetterTransformer implementation does not support padding during training, as the fused kernels do not support attention masks. Beware that passing padded batched data during training may result in unexpected outputs. Please refer to https://huggingface.co/docs/optimum/bettertransformer/overview for more details.


In [18]:
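import librosa
# Path to a local WAV file.
audio_path = 'I-have-a-dream.wav'
# The pipeline assumes a raw array is already at 16 kHz; librosa.load defaults to 22,050 Hz.
speech, sr = librosa.load(audio_path, sr=16000)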
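
Whisper's feature extractor works on 16 kHz audio, while `librosa.load` resamples to 22,050 Hz unless `sr` is passed. A raw array carries no sampling rate, so a consumer that assumes 16 kHz will read such a clip as longer and slower than it is. The sketch below illustrates the effect with plain arithmetic on a sample count rather than the real file; `apparent_duration` is a hypothetical helper, not part of librosa or transformers.

```python
WHISPER_SR = 16_000          # sampling rate Whisper's feature extractor expects
LIBROSA_DEFAULT_SR = 22_050  # librosa.load default when sr is not given


def apparent_duration(num_samples: int, assumed_sr: int) -> float:
    """Duration in seconds that a consumer assuming `assumed_sr` would infer."""
    return num_samples / assumed_sr


# Sample count of a hypothetical 10-second clip loaded at librosa's default rate.
num_samples = 10 * LIBROSA_DEFAULT_SR

true_seconds = apparent_duration(num_samples, LIBROSA_DEFAULT_SR)
misread_seconds = apparent_duration(num_samples, WHISPER_SR)
stretch = misread_seconds / true_seconds

print(f"true: {true_seconds:.2f}s, misread: {misread_seconds:.2f}s, stretch: {stretch:.3f}x")
```

Passing `sr=16000` to `librosa.load` resamples at load time and avoids the mismatch.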

In [50]:

outputs = pipe(speech,
               chunk_length_s=30,
               batch_size=16,
               return_timestamps=False)

print(outputs["text"])

 I have a dream that my four little children will one day be in a nation where they will not be judged by the color of their skin, but by the content of their character.


# using faster_whisper

In [1]:
from faster_whisper import WhisperModel

In [13]:
model_size = "large-v3"
# Run on GPU with FP16 (device defaults to "auto", which selects CUDA when available)
model = WhisperModel(model_size,
                     # device="cuda",
                     compute_type="float16")
# or run on GPU with INT8
# model = WhisperModel(model_size, device="cuda", compute_type="int8_float16")
# or run on CPU with INT8
# model = WhisperModel(model_size, device="cpu", compute_type="int8")
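
The three options in the cell above each pair a device with a compute type. A minimal sketch of selecting one of them programmatically follows; `pick_config` and its `quantize` flag are hypothetical names introduced for illustration, and only the `device` and `compute_type` values are taken from the cell above.

```python
def pick_config(cuda_available: bool, quantize: bool = False) -> dict:
    """Return WhisperModel keyword arguments mirroring the three options listed above."""
    if cuda_available:
        # GPU: FP16 by default, or INT8 when quantization is requested.
        compute_type = "int8_float16" if quantize else "float16"
        return {"device": "cuda", "compute_type": compute_type}
    # CPU: the only option listed above is INT8.
    return {"device": "cpu", "compute_type": "int8"}


print(pick_config(cuda_available=True))
print(pick_config(cuda_available=True, quantize=True))
print(pick_config(cuda_available=False))

# Intended use (not executed here, since it would download the model):
# model = WhisperModel(model_size, **pick_config(cuda_available=True))
```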

In [15]:

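import librosa

audio_path = 'I-have-a-dream.wav'
# faster-whisper treats a NumPy array as 16 kHz audio, so resample on load.
speech, sr = librosa.load(audio_path, sr=16000)
# segments is a generator: transcription runs lazily as it is iterated below.
segments, info = model.transcribe(speech,
                                  beam_size=5,
                                  without_timestamps=True,
                                  word_timestamps=False)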

print("Detected language '%s' with probability %f" % (info.language, info.language_probability))

for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))

Detected language 'en' with probability 0.893066
[0.00s -> 20.41s]  I have a dream that my four little children will one day be with a nation where they will not judged by the color of their skin, but by the content of their character.
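
In faster-whisper, `segments` is a generator, so transcription only runs as it is iterated, and it can be consumed only once. A minimal sketch of collecting the text into a single string follows; `FakeSegment` and `fake_transcribe` are hypothetical stand-ins exposing the same `start`, `end`, and `text` attributes used above, so the snippet runs without loading a model.

```python
from typing import Iterable, NamedTuple


class FakeSegment(NamedTuple):
    """Hypothetical stand-in for a faster-whisper segment."""
    start: float
    end: float
    text: str


def join_segments(segments: Iterable) -> str:
    """Consume the segment iterable once and return the concatenated transcript."""
    return "".join(segment.text for segment in segments).strip()


def fake_transcribe():
    # Segment text typically starts with a leading space, as in the output above.
    yield FakeSegment(0.0, 2.5, " first segment")
    yield FakeSegment(2.5, 5.0, " second segment")


segments = fake_transcribe()
transcript = join_segments(segments)
print(transcript)

# The generator is now exhausted; iterating again yields nothing.
leftover = list(segments)
print(leftover)
```

To both print per-segment timings and keep the full text, one option is to materialize the generator first with `segments = list(segments)`.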
