Whisper models:
* [OpenAI Whisper - installation requirements](https://github.com/openai/whisper)
* [Hugging Face Transformers - Whisper](https://huggingface.co/openai/whisper-large-v3/blob/main/README.md)

In [8]:
# Imports
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration, pipeline
from IPython.display import Audio, display

In [7]:
# Config params
pipeline_id: str = "automatic-speech-recognition"
model_id: str = "openai/whisper-large-v3"
device: str = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
audio_file: str = r"D:\data\Audio files\grip_audio_file.mp3"
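
Since the audio path is hard-coded, it may be worth checking it before the large model is loaded. The following is a minimal sketch, not part of the original notebook: `check_audio_path` and `SUPPORTED_SUFFIXES` are hypothetical names, and the extension set is an illustrative assumption rather than a list of everything the audio backend can decode.

```python
from pathlib import Path

# Illustrative extensions only (an assumption, not an exhaustive list).
SUPPORTED_SUFFIXES = {".mp3", ".wav", ".flac", ".m4a", ".ogg"}


def check_audio_path(path: str) -> Path:
    """Return the path as a Path, or raise if the suffix is unexpected or the file is missing."""
    p = Path(path)
    # Compare the suffix case-insensitively so "clip.MP3" is accepted too.
    if p.suffix.lower() not in SUPPORTED_SUFFIXES:
        raise ValueError(f"Unexpected audio file extension: {p.suffix!r}")
    if not p.is_file():
        raise FileNotFoundError(f"Audio file not found: {p}")
    return p
```

Calling `check_audio_path(audio_file)` right after the config cell would surface a wrong path immediately instead of after the model download.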

#### Instantiate model and speech processing pipeline
***

In [5]:
# Instantiate the model
model = WhisperForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)


WhisperForConditionalGeneration(
  (model): WhisperModel(
    (encoder): WhisperEncoder(
      (conv1): Conv1d(128, 1280, kernel_size=(3,), stride=(1,), padding=(1,))
      (conv2): Conv1d(1280, 1280, kernel_size=(3,), stride=(2,), padding=(1,))
      (embed_positions): Embedding(1500, 1280)
      (layers): ModuleList(
        (0-31): 32 x WhisperEncoderLayer(
          (self_attn): WhisperSdpaAttention(
            (k_proj): Linear(in_features=1280, out_features=1280, bias=False)
            (v_proj): Linear(in_features=1280, out_features=1280, bias=True)
            (q_proj): Linear(in_features=1280, out_features=1280, bias=True)
            (out_proj): Linear(in_features=1280, out_features=1280, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=1280, out_features=5120, bias=True)
          (fc2): Linear(in_features=5120, out_features=1280, bia

In [6]:
# Instantiate the processor (which bundles the tokenizer and feature extractor for Whisper)
processor = WhisperProcessor.from_pretrained(model_id)

# Instantiate the pipeline
pipe = pipeline(
    pipeline_id,
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

#### Run inference
***

In [9]:
# play audio
Audio(audio_file)

In [23]:
# Generation settings for transcription (German audio to German text)
generate_kwargs = {
    "language": "german",
    "task": "transcribe",
    "condition_on_prev_tokens": True,
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    "logprob_threshold": -1.0,
    "no_speech_threshold": 0.6,
}

# Run transcription
try:
    result = pipe(audio_file, generate_kwargs=generate_kwargs)

    # play audio
    display(Audio(audio_file))

    # show transcribed text
    text = result["text"]
    print(f"Transcribed text:\n\t{text}")
    
except Exception as e:
    print(f"An error occurred: {e}")

Transcribed text:
	 Auslieferungspreis hat angefangen ab 768.000 Euro, wert heute circa 1,5 Millionen Euro. Das Auto ist von 2015. Wir stehen hier vor einem Auto, was circa jetzt neun Jahre alt ist, aber es wirkt immer noch neu, frisch. Es wirkt immer noch, als ob es heute auf den Markt gekommen ist, oder? Der Porsche 918 Spyder hat nun mal eine Besonderheit. Es wurde bei Porsche nur 100 Leute, die speziell dafür ausgebildet worden sind, die eine spezielle Qualifizierung auch mitnehmen mussten, durften dieses Auto bauen. Und das macht das Auto nochmal noch... Also das ist genial. Diese Geschichten packen einen. Wie groß bist du? 1,90m, 1,92m. Stell dich mal am Fahrzeug ran, damit wir mal ein Gefühl kriegen, wie tief dieses Auto ist. Das geht nicht. Hier bist du im Gürtel, siehst du das? Und bei mir bist du...
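
The `temperature` tuple and `logprob_threshold` in `generate_kwargs` request temperature fallback: decoding starts at the first temperature and is retried at the next one when the result looks unreliable. The sketch below illustrates only that idea in plain Python. `decode_with_fallback` and `decode_fn` are hypothetical names, and the actual logic inside the library is more involved (for example, it can also check a compression ratio).

```python
from typing import Callable, Sequence, Tuple


def decode_with_fallback(
    decode_fn: Callable[[float], Tuple[str, float]],
    temperatures: Sequence[float] = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    logprob_threshold: float = -1.0,
) -> Tuple[str, float]:
    """Try each temperature in order and return (text, temperature_used).

    decode_fn is a hypothetical stand-in for one decoding pass; it takes a
    temperature and returns (text, average log-probability).
    """
    if not temperatures:
        raise ValueError("At least one temperature is required")
    result = ("", temperatures[0])
    for temperature in temperatures:
        text, avg_logprob = decode_fn(temperature)
        result = (text, temperature)
        # Accept the first attempt whose average log-probability meets the threshold.
        if avg_logprob >= logprob_threshold:
            break
    # If no attempt met the threshold, the last attempt is returned.
    return result
```

With the values used above, a segment is first decoded greedily at temperature 0.0, and sampling at 0.2, 0.4, and so on is only tried if the greedy result scores below -1.0.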


In [25]:
# Generation settings for translation (German audio to English text)
generate_kwargs = {
    "language": "german",
    "task": "translate",
    "condition_on_prev_tokens": True,
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    "logprob_threshold": -1.0,
    "no_speech_threshold": 0.6,
}

# Run translation
try:
    result = pipe(audio_file, generate_kwargs=generate_kwargs)

    # play audio
    display(Audio(audio_file))

    # show translated text
    text = result["text"]
    print(f"Translated text:\n\t{text}")
    
except Exception as e:
    print(f"An error occurred: {e}")

Translated text:
	 The delivery price started at 768,000 euros, worth about 1.5 million euros today. The car is from 2015. We are standing here in front of a car that is about nine years old now, but it still looks new, fresh, it still looks like it has come into the market today, right? The Porsche 918 Spyder has a special feature again. Porsche only had 100 people who were specially trained for it, who also had to take a special qualification with them, were allowed to build this car. And that makes the car even more ... So that's great. These stories pack you. How tall are you? 1.90, 1.92. Get in the car so that we can get a feeling of how deep this car is. That's possible. It goes all the way to the belt, you see that? Yes, yes. And with me to my ...
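
The transcription and translation cells differ only in the `task` value. One way to avoid duplicating the dictionary is a small helper, sketched here under the assumption that the remaining settings stay fixed; `build_generate_kwargs` is a hypothetical name, not part of the original notebook or of Transformers.

```python
from typing import Any, Dict

VALID_TASKS = {"transcribe", "translate"}


def build_generate_kwargs(task: str, language: str = "german") -> Dict[str, Any]:
    """Return the generation settings used in this notebook for the given task."""
    if task not in VALID_TASKS:
        raise ValueError(f"task must be one of {sorted(VALID_TASKS)}, got {task!r}")
    return {
        "language": language,
        "task": task,
        "condition_on_prev_tokens": True,
        "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
        "logprob_threshold": -1.0,
        "no_speech_threshold": 0.6,
    }
```

Both runs could then be written as `pipe(audio_file, generate_kwargs=build_generate_kwargs("transcribe"))` and `pipe(audio_file, generate_kwargs=build_generate_kwargs("translate"))`.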
