# Inference 101 using Whisper Models from HuggingFace

Note: there are many different ways to run inference. This is just one example to demonstrate how the audio data from the datasets can be run through a model.

We focus exclusively on Whisper models here, although other models could be used as well. More on this later.

## Preparation -- Imports and Load dataset

In [None]:
import datasets
from huggingface_hub import hf_hub_download
from IPython.display import Audio, display
import pandas as pd

from transformers import pipeline

In [None]:
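from getpass import getpass
from huggingface_hub import login, whoami

# Prompt for the token without echoing it into the notebook output
HF_TOKEN = getpass("Hugging Face token: ")
login(token=HF_TOKEN)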

In [None]:
dataset_name = 'cdli/kenyan_english_nonstandard_speech_v0'
ds = datasets.load_dataset(dataset_name, split='test', streaming=False)
ds = ds.filter(lambda example: example['audio_length'] <= 30)
ds
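
The 30-second cutoff in the filter above lines up with Whisper's fixed 30-second input window. As a minimal sketch, assuming the `audio_length` column is measured in seconds (the filter suggests this, but it is worth verifying on the dataset card), the same check can be derived from the raw waveform. The helper names `duration_seconds` and `fits_whisper_window` are hypothetical and not part of any library.

```python
# Whisper's encoder consumes fixed windows of 30 seconds of audio.
WHISPER_WINDOW_SECONDS = 30.0


def duration_seconds(num_samples: int, sampling_rate: int) -> float:
    """Return the clip length in seconds for a mono waveform."""
    if sampling_rate <= 0:
        raise ValueError("sampling_rate must be positive")
    return num_samples / sampling_rate


def fits_whisper_window(num_samples: int, sampling_rate: int) -> bool:
    """True if the clip fits into a single 30-second window."""
    return duration_seconds(num_samples, sampling_rate) <= WHISPER_WINDOW_SECONDS


# Toy numbers (not taken from the dataset): 10 s and 45 s of 16 kHz audio.
print(fits_whisper_window(160_000, 16_000))
print(fits_whisper_window(720_000, 16_000))
```

In the notebook, `num_samples` would be `len(example['audio']['array'])` and `sampling_rate` would be `example['audio']['sampling_rate']`.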

## Load a model for Inference

In [None]:
WHISPER_MODEL_NAME = "openai/whisper-tiny"
# WHISPER_MODEL_NAME = "openai/whisper-small"
# WHISPER_MODEL_NAME = "openai/whisper-large-v3"

### Easiest way is via HF's pipeline approach

In [None]:
pipe = pipeline("automatic-speech-recognition", 
                model=WHISPER_MODEL_NAME,
                #return_timestamps=False,
)

In [None]:
generate_kwargs={
    "language": 'en', 
    "task": "transcribe",
    "max_length": 448,  # Do not exceed 448: larger values go beyond the model's positional-embedding limit and cause index errors
    "num_beams": 1,
    "do_sample": False
}    

In [None]:
example = ds[0]
example

In [None]:
Audio(example['audio']['array'], rate=example['audio']['sampling_rate'])

In [None]:
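# Pass the sampling rate with the waveform and apply the decoding options defined above
audio = {"raw": example['audio']['array'], "sampling_rate": example['audio']['sampling_rate']}
prediction = pipe(audio, generate_kwargs=generate_kwargs)
prediction['text']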

In [None]:
example['transcription']
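
Comparing the prediction and the reference transcription by eye works for a single example. One common way to put a number on the difference is the word error rate (WER). The following is a minimal sketch, not the notebook's own evaluation method: `word_error_rate` is a hypothetical helper, it only lowercases and splits on whitespace (no punctuation handling), and the example strings are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Made-up strings for illustration: one substituted word out of three.
print(word_error_rate("the cat sat", "the bat sat"))
```

In the notebook this could be called as `word_error_rate(example['transcription'], prediction['text'])`. Proper evaluations usually apply text normalization (punctuation, numbers, casing) before scoring.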

### Separate Processor and Model Output Generation

* not the recommended approach if all you want are predictions
* but it can be handy if you want to analyze the intermediate representations, look more closely at the predicted token IDs, or simply learn how the model works internally

In [None]:
from transformers import WhisperForConditionalGeneration, WhisperProcessor
import torch


processor = WhisperProcessor.from_pretrained(WHISPER_MODEL_NAME)
model = WhisperForConditionalGeneration.from_pretrained(WHISPER_MODEL_NAME)


In [None]:
inputs = processor(example['audio']['array'], 
                   sampling_rate=example['audio']['sampling_rate'], 
                   return_tensors="pt")

with torch.no_grad():
    generated_ids = model.generate(
        inputs["input_features"],
        # language="en",
        # task="transcribe",
        # max_length=448,  # Whisper's max length - do not exceed!
        num_beams=1,
        #temperature=0.7,
        #do_sample=True
        do_sample=False
        )

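# Decode the token IDs back to text. Without skip_special_tokens=True the output
# keeps markers such as <|startoftranscript|> and <|endoftext|>.
result = processor.batch_decode(generated_ids, skip_special_tokens=True)
result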
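
When inspecting `inputs["input_features"]`, it helps to know where its time dimension comes from. The arithmetic below is a sketch based on what I understand to be the default Whisper feature-extractor settings (16 kHz audio, 30-second chunks, hop length of 160 samples). Treat the constants as assumptions and compare them against `processor.feature_extractor` in your environment.

```python
SAMPLING_RATE = 16_000  # Hz, the rate Whisper expects
CHUNK_SECONDS = 30      # every input is padded or trimmed to this length
HOP_LENGTH = 160        # samples between consecutive spectrogram frames

# Number of waveform samples in one padded chunk.
samples_per_chunk = SAMPLING_RATE * CHUNK_SECONDS

# Number of log-mel frames the encoder sees per chunk.
frames_per_chunk = samples_per_chunk // HOP_LENGTH

# Time covered by one frame, in milliseconds.
frame_ms = 1000 * HOP_LENGTH / SAMPLING_RATE

print(samples_per_chunk, frames_per_chunk, frame_ms)
```

Under these assumptions the feature tensor has shape `(batch, n_mels, 3000)`, where the number of mel bins depends on the checkpoint. Checking `inputs["input_features"].shape` confirms the actual values.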

## Next step

* try different model sizes!
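
One way to organize such a comparison is to loop over the model names and collect one transcript per model. The sketch below is only an outline: `transcribe_with_each` and `fake_build` are hypothetical names, and the stand-in factory returns placeholder text so the example runs without downloading any model.

```python
from typing import Callable, Dict, Iterable

# A transcriber takes audio and returns text.
Transcriber = Callable[[object], str]


def transcribe_with_each(
    model_names: Iterable[str],
    build_transcriber: Callable[[str], Transcriber],
    audio: object,
) -> Dict[str, str]:
    """Run the same audio through one transcriber per model name."""
    results = {}
    for name in model_names:
        transcribe = build_transcriber(name)
        results[name] = transcribe(audio)
    return results


def fake_build(name: str) -> Transcriber:
    """Stand-in factory that does not load a real model."""
    return lambda audio: f"[{name}] transcript placeholder"


results = transcribe_with_each(
    ["openai/whisper-tiny", "openai/whisper-small"], fake_build, audio=None
)
for name, text in results.items():
    print(name, "->", text)
```

In the notebook, `fake_build` would be replaced by a factory that creates a `pipeline("automatic-speech-recognition", model=name)` and returns a function extracting the `'text'` field of its output.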