### Testing out Distil-Whisper

[GitHub repo](https://github.com/huggingface/distil-whisper/tree/3c8c15f771139f4c98284486534667a87927ae45) for the model.

In [4]:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

In [5]:
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

In [11]:
display(device)
display(torch_dtype)

'cpu'

torch.float32

In [12]:
# Distil-Whisper model id on the Hugging Face Hub
model_id = "distil-whisper/distil-large-v2"

# load the model...
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,  # lowers peak CPU memory use while loading the weights
    use_safetensors=True,  # load the weights from the safetensors file
)

model.to(device)

# ... and the processor
processor = AutoProcessor.from_pretrained(model_id)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [13]:
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)

In [14]:
# load an audio sample from a small dummy subset of the LibriSpeech corpus
from datasets import load_dataset

dataset = load_dataset('hf-internal-testing/librispeech_asr_dummy', 'clean', split='validation')
sample_audio = dataset[0]['audio']

In [15]:
result = pipe(sample_audio)
print(result['text'])

 Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.
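To put a number on how close this transcription is, one option is a word error rate (WER) against the reference transcript. The block below is a minimal sketch, not the evaluation used by the Distil-Whisper authors. It assumes the reference text for this sample is the string typed in below (in the notebook it would presumably come from `dataset[0]['text']`), and it uses a very simple normalization (lowercasing and stripping punctuation) rather than Whisper's own text normalizer. The helper names `normalize` and `word_error_rate` are made up for this example.

```python
import re


def normalize(text):
    """Lowercase, drop punctuation (except apostrophes), and split into words."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", "", text)
    return text.split()


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Classic dynamic-programming edit distance, computed row by row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Assumed reference transcript for the first sample (hypothetical here;
# in the notebook it would be read from the dataset instead of typed in).
reference = ("MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES "
             "AND WE ARE GLAD TO WELCOME HIS GOSPEL")
hypothesis = (" Mr. Quilter is the apostle of the middle classes, "
              "and we are glad to welcome his gospel.")

wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.3f}")
```

With this naive normalization the only mismatch is "Mr." versus "MISTER", so the score reflects a single substitution over 17 reference words. A normalizer that expands abbreviations would likely report zero errors for this sample.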


In [16]:
# load local sample audio
green_eggs_and_ham_audio = r"C:\Users\Administrator\Desktop\repos\auto_caption\green_eggs_and_ham_video\green_eggs_and_ham_audio.mp4"
result2 = pipe(green_eggs_and_ham_audio)
print(result2['text'])

 You're listening to a toad-stool in Ferrydusts production of Green Eggs and Ham by Dr. Seuss. Be sure to hit the thumbs-up button and subscribe. It lets YouTube know that you're interested in more videos like this. I am Sam.
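The transcript stops shortly after the introduction. A likely cause is that the pipeline was called without chunking, so only the first window of audio was decoded (and `max_new_tokens=128` also caps the output length). The ASR pipeline accepts a `chunk_length_s` argument for long audio, and the Distil-Whisper README suggests 15 seconds. The sketch below only illustrates the idea of overlapping windows in plain Python. It is a simplified approximation, not the library's actual implementation; the helper `chunk_windows` and the default overlap of one sixth of the chunk length on each side are assumptions made for this example.

```python
def chunk_windows(duration_s, chunk_length_s=15.0, stride_s=None):
    """Return (start, end) windows in seconds that cover the audio with overlap.

    stride_s is the overlap kept on each side of a chunk; here it is assumed
    to default to one sixth of the chunk length.
    """
    if stride_s is None:
        stride_s = chunk_length_s / 6
    step = chunk_length_s - 2 * stride_s
    if step <= 0:
        raise ValueError("stride_s must be less than half of chunk_length_s")

    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_length_s, duration_s)
        windows.append((start, end))
        if end >= duration_s:
            break
        start += step
    return windows


# Example: a hypothetical 60-second recording split into 15-second chunks.
windows = chunk_windows(60.0)
for start, end in windows:
    print(f"{start:5.1f}s -> {end:5.1f}s")

# In the notebook, chunking would be requested from the pipeline itself,
# for example (untested here):
#   result2 = pipe(green_eggs_and_ham_audio, chunk_length_s=15, batch_size=16)
```

Each window is transcribed separately and the overlapping regions let the pipeline stitch neighbouring transcripts together, which is why the windows advance by less than their full length.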
