# 2. Whisper Translate Experiment

**Install required packages**

In [1]:
!pip install -q git+https://github.com/openai/whisper.git

**Import required packages**

In [2]:
from google.colab import files

import whisper

## 2.1 Define model constants

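This section defines the model constants: the model size (`WHISPER_MODEL_ID`), the target device (`WHISPER_MODEL_DEVICE`), and a low-memory flag (`WHISPER_MODEL_LOW_RAM`) that is later passed to Whisper as `fp16`. It also defines two ANSI escape sequences used to print bold labels.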

In [3]:
WHISPER_MODEL_ID = "medium"

WHISPER_MODEL_DEVICE = "cpu"
WHISPER_MODEL_LOW_RAM = True

TEXT_BOLD = "\033[1m"
TEXT_END = "\033[0m\x1B[0m"
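The constants above hard-code `"cpu"` together with a flag that is later passed as `fp16`. On CPU, Whisper does not run half precision: it emits a warning and falls back to FP32. As a minimal sketch, the two settings could instead be derived together. The helper name `resolve_device_and_precision` is hypothetical and not part of the notebook or of Whisper; it only assumes that PyTorch, if installed, exposes `torch.cuda.is_available()`.

```python
def resolve_device_and_precision(prefer_gpu=True):
    """Return (device, use_fp16) for loading and running the model.

    Hypothetical helper: picks "cuda" when a GPU is visible to PyTorch,
    otherwise "cpu", and only enables half precision on the GPU.
    """
    device = "cpu"
    if prefer_gpu:
        try:
            import torch  # Whisper depends on PyTorch, so this is normally present

            if torch.cuda.is_available():
                device = "cuda"
        except ImportError:
            # No PyTorch available: stay on the CPU default.
            pass

    # Half precision only makes sense when the model runs on a GPU.
    use_fp16 = device != "cpu"
    return device, use_fp16


device, use_fp16 = resolve_device_and_precision()
print(device, use_fp16)
```

The returned pair could then stand in for `WHISPER_MODEL_DEVICE` and `WHISPER_MODEL_LOW_RAM`, which keeps the two values consistent with each other.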

## 2.2 Create model processor

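Instantiate the Whisper model from pretrained weights. `whisper.load_model()` downloads the checkpoint on first use and caches it locally. The same model is also published on [huggingface.co](https://huggingface.co/openai/whisper-medium).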

In [4]:
model = whisper.load_model(WHISPER_MODEL_ID, device=WHISPER_MODEL_DEVICE)

100%|█████████████████████████████████████| 1.42G/1.42G [00:19<00:00, 79.2MiB/s]


## 2.3 Create inference functions

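Create the function that runs inference with Whisper. `run_translate_inference()` takes audio in any language Whisper supports and translates it to English in a single pass (`task="translate"`). When `language` is `None`, Whisper detects the spoken language itself. The function returns a dictionary containing the translated text, the detected language, and the individual segments.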

In [5]:
def run_translate_inference(
    model,
    file,
    language=None,
    no_speech_threshold=0.6,
    use_previous_text=True,
    enable_blank_supression=True,
    enable_timestamps=False,
):
    result = model.transcribe(
        file,
        task="translate",
        language=language,
        no_speech_threshold=no_speech_threshold,
        condition_on_previous_text=use_previous_text,
        suppress_blank=enable_blank_supression,
        without_timestamps=not enable_timestamps,
        fp16=WHISPER_MODEL_LOW_RAM,
    )

    return result

## 2.4 Create result functions

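Create a function that prints the result returned by Whisper: the model settings, the full translated text, the detected language, and each segment with its time range and no-speech probability.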

In [6]:
def show_result(result):
    print("-" * 100)
    print("")

    print(TEXT_BOLD + "Model: " + TEXT_END + WHISPER_MODEL_ID)
    print(TEXT_BOLD + "Device: " + TEXT_END + WHISPER_MODEL_DEVICE)
    print("")

    print(TEXT_BOLD + "Text: " + TEXT_END + result.get("text", "-").strip())
    print(TEXT_BOLD + "Language: " + TEXT_END + result.get("language", "-").strip())
    print("")

    print(TEXT_BOLD + "Segments:" + TEXT_END)
    for segment in result.get("segments", []):
        print("\t" + TEXT_BOLD + "Id: " + TEXT_END + f"{segment['id']}")
        print("\t" + "Time: " + f"{segment['start']:.2f}s - {segment['end']:.2f}s")
        print("\t" + "Text: " + segment["text"].strip())
        print("\t" + "No Speech Probability: " + f"{segment['no_speech_prob']:.2f}")
        print("")

    print("-" * 100)
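Because every segment carries `start`, `end`, and `text`, the same result dictionary can also be rendered as subtitles. The following is a minimal sketch, assuming the result has the shape printed by `show_result()`. The helper names `format_timestamp` and `segments_to_srt` are hypothetical and not part of the notebook.

```python
def format_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(result):
    """Render the segments of a result dictionary as SRT subtitle text."""
    blocks = []
    for index, segment in enumerate(result.get("segments", []), start=1):
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        text = segment["text"].strip()
        blocks.append(f"{index}\n{start} --> {end}\n{text}\n")
    # A blank line separates consecutive subtitle blocks.
    return "\n".join(blocks)


# Illustrative input only, shaped like a Whisper result.
example = {"segments": [{"start": 0.0, "end": 8.84, "text": " Hello there. "}]}
print(segments_to_srt(example))
```

Calling `segments_to_srt(result)` on a real result and writing the string to a `.srt` file would give a subtitle track aligned with the segment timings.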

## 2.5 Run inference

In [7]:
%%time
# Upload audio
upload = files.upload()
# Pick the first uploaded file whose name ends in .mp3 (case-insensitive)
audio = next(name for name in upload if name.lower().endswith(".mp3"))

# Run inference
result = run_translate_inference(
    model,
    audio,
    language=None,
    no_speech_threshold=0.6,
    use_previous_text=True,
    enable_blank_supression=True,
    enable_timestamps=True
)

Saving sample-nl.mp3 to sample-nl.mp3




CPU times: user 51.8 s, sys: 558 ms, total: 52.3 s
Wall time: 1min 6s


## 2.6 Show result

In [8]:
show_result(result)

----------------------------------------------------------------------------------------------------

Model: medium
Device: cpu

Text: The fixed maximum speeds in Belgium are 30, 50, 70, 90 and 120 km per hour.
Language: nl

Segments:
	Id: 0
	Time: 0.00s - 8.84s
	Text: The fixed maximum speeds in Belgium are 30, 50, 70, 90 and 120 km per hour.
	No Speech Probability: 0.03

----------------------------------------------------------------------------------------------------
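Each segment reports a no-speech probability, which can be used to drop segments that are likely silence or noise before further processing. The sketch below is a simple post-filter on the result dictionary, not a reimplementation of Whisper's internal `no_speech_threshold` handling, which also takes the average log probability into account. The helper name `filter_speech_segments` and the sample data are hypothetical.

```python
def filter_speech_segments(result, max_no_speech_prob=0.6):
    """Keep only segments whose no-speech probability is below the cutoff."""
    return [
        segment
        for segment in result.get("segments", [])
        if segment["no_speech_prob"] < max_no_speech_prob
    ]


# Illustrative data only, shaped like a Whisper result.
sample = {
    "segments": [
        {"id": 0, "text": " Spoken sentence.", "no_speech_prob": 0.03},
        {"id": 1, "text": " ...", "no_speech_prob": 0.91},
    ]
}

kept = filter_speech_segments(sample)
print([segment["id"] for segment in kept])
```

With the default cutoff of 0.6, the single segment shown above (probability 0.03) would be kept.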
