Greetings to: https://github.com/PacktPublishing/Learn-OpenAI-Whisper/tree/main/Chapter06

# Learn OpenAI Whisper - Chapter 6
## Notebook 3: Video Subtitle Generation using Whisper and OpenVINO™


In this advanced tutorial, we use OpenAI's Whisper model together with the OpenVINO toolkit to automatically generate subtitles for a sample video. The process consists of the following key steps:

1. Obtaining the pre-trained Whisper model
2. Setting up the PyTorch model pipeline
3. Transforming the model into OpenVINO Intermediate Representation (IR) format using the model conversion API
4. Executing the Whisper pipeline with the converted OpenVINO models to generate the subtitles


## Setting Up the Environment


We start by importing a helper Python utility module called utils.py from our GitHub repository.

In [17]:
!wget -nv "https://raw.githubusercontent.com/redhat-aaiche/my-whisper/refs/heads/main/Chapter06/utils.py" -O utils.py

2025-01-13 11:48:23 URL:https://raw.githubusercontent.com/redhat-aaiche/my-whisper/refs/heads/main/Chapter06/utils.py [11251/11251] -> "utils.py" [1]


Next, we install critical software dependencies to enable working with AI models and speech data.

In [2]:
#aa execute the following cell TWICE

In [None]:
%pip install -q cohere openai tiktoken
%pip install -q "openvino>=2023.1.0"
%pip install -q "python-ffmpeg<=1.0.16" moviepy transformers --extra-index-url https://download.pytorch.org/whl/cpu
%pip install -q "git+https://github.com/garywu007/pytube.git"
%pip install -q gradio
%pip install -q "openai-whisper==20231117" --extra-index-url https://download.pytorch.org/whl/cpu

## Initializing the Whisper Model

OpenAI's Whisper is a powerful Transformer-based encoder-decoder model, also known as a sequence-to-sequence model, designed for speech recognition tasks. It operates by mapping a sequence of audio spectrogram features to a corresponding sequence of text tokens. The process can be broken down into three main steps:

1. **Feature Extraction**: The raw audio inputs are first converted into a log-Mel spectrogram representation using a feature extractor module.

2. **Encoding**: The Transformer encoder then processes the spectrogram, generating a sequence of hidden states that capture the essential information from the audio input.

3. **Decoding**: Finally, the decoder autoregressively predicts the text tokens, conditioned on both the previously generated tokens and the encoder's hidden states.
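To make the first two steps concrete, the short sketch below works out the tensor shapes involved. The constants (16 kHz sampling, 30-second windows, a hop of 160 samples, 80 mel bins, and an encoder that halves the time axis) are the values we assume the `openai-whisper` package uses for the `base` model; check `whisper/audio.py` and `model.dims` before relying on them.

```python
# Assumed Whisper audio front-end constants; verify against whisper/audio.py.
SAMPLE_RATE = 16_000      # samples per second
CHUNK_SECONDS = 30        # each window is padded or trimmed to 30 s
HOP_LENGTH = 160          # STFT hop, in samples
N_MELS = 80               # mel bins (128 for large-v3)
ENCODER_STRIDE = 2        # the encoder's second convolution halves the time axis

n_samples = SAMPLE_RATE * CHUNK_SECONDS       # samples in one window
n_frames = n_samples // HOP_LENGTH            # spectrogram frames per window
n_audio_ctx = n_frames // ENCODER_STRIDE      # positions the decoder attends over

mel_shape = (1, N_MELS, n_frames)             # (batch, mel bins, frames)
print(mel_shape, n_audio_ctx)
```

Under these assumptions, the spectrogram shape is the same one used later as the example input when converting the encoder.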

The architecture of the Whisper model is illustrated in the diagram below:

![whisper_architecture.svg](https://user-images.githubusercontent.com/29454499/204536571-8f6d8d77-5fbd-4c6d-8e29-14e734837860.svg)

*Source: https://openai.com/research/whisper*

By leveraging this powerful architecture, Whisper achieves state-of-the-art performance on various speech recognition benchmarks, making it an ideal choice for our subtitle generation task.


The creators of Whisper have trained several models with varying sizes and capabilities to cater to different use cases and resource constraints. For the purpose of this tutorial, we will be using the `base` model, which offers a good balance between performance and efficiency. However, it's important to note that the steps and techniques demonstrated in this notebook can be easily applied to other models within the Whisper family, allowing you to experiment with different configurations and find the one that best suits your specific requirements.

In [5]:
from whisper import _MODELS
import ipywidgets as widgets

model_id = widgets.Dropdown(
    options=list(_MODELS),
    value='base',
    description='Model:',
    disabled=False,
)

model_id

Dropdown(description='Model:', index=3, options=('tiny.en', 'tiny', 'base.en', 'base', 'small.en', 'small', 'm…

In [6]:
!ls -alh  ~/.cache/whisper/

ls: cannot access '/opt/app-root/src/.cache/whisper/': No such file or directory


In [7]:
import whisper

# model = whisper.load_model(model_id.value)
model = whisper.load_model(model_id.value, "cpu")
model.eval()
pass

100%|███████████████████████████████████████| 139M/139M [00:06<00:00, 21.6MiB/s]


In [8]:
!ls -alh  ~/.cache/whisper/

total 139M
drwxr-xr-x. 2 1000800000 root   21 Jan 13 11:40 .
drwxr-xr-x. 8 1000800000 root   94 Jan 13 11:40 ..
-rw-r--r--. 1 1000800000 root 139M Jan 13 11:41 base.pt


### Converting the Model to OpenVINO Intermediate Representation (IR) Format

To achieve optimal performance and efficiency with the OpenVINO toolkit, it is highly recommended to convert the Whisper model into the OpenVINO-specific Intermediate Representation (IR) format. This process requires two key components:

1. An initialized model object
2. Sample input data for shape inference

We will leverage the `ov.convert_model` function provided by OpenVINO to perform the model conversion. This function takes the initialized model object and sample inputs as arguments and returns an OpenVINO-compatible model that is ready to be loaded onto the target device for inference.

Once the conversion is complete, we can save the OpenVINO model to disk using the `ov.save_model` function. This allows us to reuse the converted model in future sessions without the need to repeat the conversion process, saving valuable time and resources.

By converting the Whisper model to OpenVINO IR format, we can take full advantage of the performance optimizations and hardware acceleration capabilities offered by the OpenVINO toolkit, ensuring efficient and high-quality subtitle generation.
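The "convert once, reuse later" idea can be sketched without OpenVINO installed. `ov.save_model` writes an `.xml` file (topology) alongside a `.bin` file (weights), so a slightly more defensive check looks for both before skipping the conversion. The helper names `ir_is_cached`, `convert_once`, and `fake_convert_and_save` are hypothetical and exist only for this illustration; in the notebook, the callback would wrap `ov.convert_model` and `ov.save_model`.

```python
from pathlib import Path
from tempfile import TemporaryDirectory
from typing import Callable


def ir_is_cached(xml_path: Path) -> bool:
    """Return True only if both halves of the IR pair are on disk."""
    return xml_path.exists() and xml_path.with_suffix(".bin").exists()


def convert_once(xml_path: Path, convert_and_save: Callable[[Path], None]) -> bool:
    """Run the conversion callback unless the IR is already cached.

    Returns True if a conversion was performed.
    """
    if ir_is_cached(xml_path):
        return False
    convert_and_save(xml_path)
    return True


def fake_convert_and_save(xml_path: Path) -> None:
    # Stand-in for ov.convert_model + ov.save_model so the sketch runs anywhere.
    xml_path.write_text("<net/>")
    xml_path.with_suffix(".bin").write_bytes(b"")


with TemporaryDirectory() as tmp:
    target = Path(tmp) / "whisper_base_encoder.xml"
    first = convert_once(target, fake_convert_and_save)   # converts
    second = convert_once(target, fake_convert_and_save)  # reuses the cached files
    print(first, second)
```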


### Converting the Whisper Encoder to OpenVINO IR

In [9]:
from pathlib import Path

WHISPER_ENCODER_OV = Path(f"whisper_{model_id.value}_encoder.xml")
WHISPER_DECODER_OV = Path(f"whisper_{model_id.value}_decoder.xml")


An example input is created as a tensor of zeros with the shape the encoder expects, `(1, n_mels, 3000)`. The `ov.convert_model` function then converts the encoder to OpenVINO's IR format, and the converted model is saved to disk for future use.

In [10]:
import torch
import openvino as ov

mel = torch.zeros((1, 80 if 'v3' not in model_id.value else 128, 3000))
audio_features = model.encoder(mel)
if not WHISPER_ENCODER_OV.exists():
    encoder_model = ov.convert_model(model.encoder, example_input=mel)
    ov.save_model(encoder_model, WHISPER_ENCODER_OV)



### Converting the Whisper Decoder to OpenVINO IR

The Whisper decoder employs a technique called attention caching to reduce computational complexity and improve efficiency. This involves storing the key and value projections from previous steps in the attention modules, which can then be reused in subsequent computations. However, to ensure accurate tracing and conversion of the decoder to OpenVINO IR format, we need to modify this caching mechanism.

In the following code cells, we will define custom forward functions for the decoder's attention modules and residual blocks. These modified functions will explicitly handle the caching and retrieval of key and value projections, making the caching process more transparent and traceable.

By adapting the decoder's architecture to be more compatible with the OpenVINO conversion process, we can successfully convert the Whisper decoder to OpenVINO IR format, enabling us to leverage the performance benefits of the OpenVINO toolkit while maintaining the decoder's functionality and efficiency.
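To see why explicit key/value caching does not change the result, here is a toy, pure-Python version of single-head attention. It is only an illustration of the idea: the `project` function is a made-up stand-in for the learned linear layers, and none of these names come from Whisper or OpenVINO.

```python
import math
from typing import List, Optional, Tuple

Vector = List[float]


def attend(q: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [
        sum(w * v[i] for w, v in zip(weights, values)) / total
        for i in range(len(values[0]))
    ]


def project(x: Vector, scale: float) -> Vector:
    # Toy stand-in for a learned linear projection.
    return [scale * xi for xi in x]


def step(x: Vector, cache: Optional[Tuple[List[Vector], List[Vector]]]):
    """Process one new token, reusing cached keys/values from earlier tokens."""
    keys, values = cache if cache is not None else ([], [])
    keys = keys + [project(x, 0.5)]
    values = values + [project(x, 2.0)]
    return attend(project(x, 1.5), keys, values), (keys, values)


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Incremental decoding: one token per call, cache threaded through explicitly.
cache = None
for tok in tokens:
    cached_out, cache = step(tok, cache)

# Reference: recompute keys and values for the whole prefix from scratch.
all_keys = [project(t, 0.5) for t in tokens]
all_values = [project(t, 2.0) for t in tokens]
full_out = attend(project(tokens[-1], 1.5), all_keys, all_values)

print(all(math.isclose(a, b) for a, b in zip(cached_out, full_out)))
```

Passing the cache in and out as ordinary arguments, rather than hiding it in module state, is what makes the data flow visible to a tracer.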

In [11]:
import torch
from typing import Optional, Tuple
from functools import partial


def attention_forward(
        attention_module,
        x: torch.Tensor,
        xa: Optional[torch.Tensor] = None,
        mask: Optional[torch.Tensor] = None,
        kv_cache: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
):
    """
    Override for the forward method of a decoder attention module that returns cache values explicitly.
    Parameters:
      attention_module: current attention module.
      x: input hidden states of the text tokens.
      xa: input audio features (Optional).
      mask: mask for applying attention (Optional).
      kv_cache: tuple of cached (key, value) tensors from previous steps (Optional).
    Returns:
      attention module output tensor
      updated kv_cache
    """
    q = attention_module.query(x)

    if xa is None:
        # self-attention: project keys/values for the new tokens and, if a cache
        # was passed in, prepend the cached tensors along the sequence axis.
        k = attention_module.key(x)
        v = attention_module.value(x)
        if kv_cache is not None:
            k = torch.cat((kv_cache[0], k), dim=1)
            v = torch.cat((kv_cache[1], v), dim=1)
        kv_cache_new = (k, v)
    else:
        # cross-attention: keys and values are projected from the audio features on every call and are not cached.
        k = attention_module.key(xa)
        v = attention_module.value(xa)
        kv_cache_new = (None, None)

    wv, qk = attention_module.qkv_attention(q, k, v, mask)
    return attention_module.out(wv), kv_cache_new


def block_forward(
    residual_block,
    x: torch.Tensor,
    xa: Optional[torch.Tensor] = None,
    mask: Optional[torch.Tensor] = None,
    kv_cache: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
):
    """
    Override for the residual block forward method that passes kv_cache to the attention module.
    Parameters:
      residual_block: current residual block.
      x: input hidden states of the text tokens.
      xa: input audio features (Optional).
      mask: attention mask (Optional).
      kv_cache: tuple of cached (key, value) tensors for this block (Optional).
    Returns:
      x: residual block output
      kv_cache: updated kv_cache
    """
    x0, kv_cache = residual_block.attn(residual_block.attn_ln(
        x), mask=mask, kv_cache=kv_cache)
    x = x + x0
    if residual_block.cross_attn:
        x1, _ = residual_block.cross_attn(
            residual_block.cross_attn_ln(x), xa)
        x = x + x1
    x = x + residual_block.mlp(residual_block.mlp_ln(x))
    return x, kv_cache



# update forward functions
for idx, block in enumerate(model.decoder.blocks):
    block.forward = partial(block_forward, block)
    block.attn.forward = partial(attention_forward, block.attn)
    if block.cross_attn:
        block.cross_attn.forward = partial(attention_forward, block.cross_attn)


def decoder_forward(decoder, x: torch.Tensor, xa: torch.Tensor, kv_cache: Optional[Tuple[Tuple[torch.Tensor, torch.Tensor]]] = None):
    """
    Override for decoder forward method.
    Parameters:
      x: torch.LongTensor, shape = (batch_size, <= n_ctx) the text tokens
      xa: torch.Tensor, shape = (batch_size, n_audio_ctx, n_audio_state)
           the encoded audio features to be attended on
      kv_cache: tuple with one (key, value) tensor pair per decoder block, holding the
           self-attention cache from previous steps (Optional)
    """
    if kv_cache is not None:
        offset = kv_cache[0][0].shape[1]
    else:
        offset = 0
        kv_cache = [None for _ in range(len(decoder.blocks))]
    x = decoder.token_embedding(
        x) + decoder.positional_embedding[offset: offset + x.shape[-1]]
    x = x.to(xa.dtype)
    kv_cache_upd = []

    for block, kv_block_cache in zip(decoder.blocks, kv_cache):
        x, kv_block_cache_upd = block(x, xa, mask=decoder.mask, kv_cache=kv_block_cache)
        kv_cache_upd.append(tuple(kv_block_cache_upd))

    x = decoder.ln(x)
    logits = (
        x @ torch.transpose(decoder.token_embedding.weight.to(x.dtype), 1, 0)).float()

    return logits, tuple(kv_cache_upd)



# override decoder forward
model.decoder.forward = partial(decoder_forward, model.decoder)
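The overrides above rely on `functools.partial` to bind each module instance as the first argument of a plain function. The stripped-down sketch below shows the same pattern on a stand-in class; `Block` and `forward_with_cache` are hypothetical names used only here, and the integer "cache" is purely illustrative.

```python
from functools import partial


class Block:
    """Tiny stand-in for a decoder block; not the real Whisper class."""

    def __init__(self, scale: int):
        self.scale = scale

    def forward(self, x: int) -> int:
        return x * self.scale

    def __call__(self, *args, **kwargs):
        # Like torch.nn.Module, dispatch through the instance's `forward` attribute.
        return self.forward(*args, **kwargs)


def forward_with_cache(block: Block, x: int, kv_cache=None):
    """Replacement forward that also returns an (illustrative) updated cache."""
    history = (kv_cache or ()) + (x,)
    return x * block.scale, history


block = Block(scale=3)
# Bind the instance as the first argument, mirroring `partial(block_forward, block)`.
block.forward = partial(forward_with_cache, block)

out, cache = block(2)
out, cache = block(5, kv_cache=cache)
print(out, cache)
```

Because the replacement is assigned on the instance, only the patched objects change behavior; the class itself is left untouched.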

In [12]:
tokens = torch.ones((5, 3), dtype=torch.int64)
logits, kv_cache = model.decoder(tokens, audio_features, kv_cache=None)

tokens = torch.ones((5, 1), dtype=torch.int64)

if not WHISPER_DECODER_OV.exists():
    decoder_model = ov.convert_model(model.decoder, example_input=(tokens, audio_features, kv_cache))
    ov.save_model(decoder_model, WHISPER_DECODER_OV)



The decoder autoregressively predicts the next token, guided by the encoder hidden states and the previously predicted sequence. As a result, the shapes of the inputs that depend on the previous step (the tokens and the attention hidden states) are dynamic. For efficient memory use, you can define an upper bound for these dynamic input shapes.
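A rough back-of-the-envelope calculation shows how the self-attention cache grows with the number of decoded tokens, and why bounding it is useful. The dimensions below are what we assume for the `base` checkpoint (in practice, read them from `model.dims`), and the batch size of 5 simply mirrors the example tokens used for tracing.

```python
# Assumed dimensions for the `base` checkpoint; read them from `model.dims` in practice.
N_TEXT_LAYER = 6      # decoder blocks
N_TEXT_STATE = 512    # hidden size
N_TEXT_CTX = 448      # maximum number of text tokens per window
BATCH_SIZE = 5        # matches the batch dimension of the tracing example
BYTES_PER_FLOAT32 = 4


def self_attn_cache_bytes(n_tokens: int) -> int:
    """Bytes held by the self-attention key and value caches after n_tokens steps."""
    per_tensor = BATCH_SIZE * n_tokens * N_TEXT_STATE * BYTES_PER_FLOAT32
    return 2 * N_TEXT_LAYER * per_tensor  # one key and one value tensor per block


for n in (1, 64, N_TEXT_CTX):
    print(n, round(self_attn_cache_bytes(n) / 2**20, 2), "MiB")
```

The cache grows linearly with the sequence length, so capping the token dimension at the model's text context gives the runtime a fixed worst case to plan for.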

### Preparing the Inference Pipeline

The image below illustrates the pipeline of video transcribing using the Whisper model.

![ch06_diagram01.png](https://raw.githubusercontent.com/PacktPublishing/Learn-OpenAI-Whisper/main/Chapter06/ch06_diagram01.png)

In the original PyTorch implementation of Whisper, running the transcription pipeline is as simple as calling the `model.transcribe(audio, **parameters)` function, which handles all the necessary steps internally. We will reuse this pipeline for audio transcription.

To leverage the benefits of the OpenVINO toolkit, we will modify this pipeline by replacing the original PyTorch models with their OpenVINO IR counterparts. By doing so, we can take advantage of the performance optimizations and hardware acceleration capabilities offered by OpenVINO while maintaining the overall structure and functionality of the transcription pipeline.

In the following sections, we will dive deeper into each step of the pipeline and demonstrate how to integrate the OpenVINO models seamlessly.

### Selecting the Inference Device

One of the key advantages of the OpenVINO toolkit is its ability to optimize and run inference on a wide range of hardware devices, including CPUs, GPUs, and specialized accelerators. To harness this flexibility, we need to specify the target device on which we want to execute the inference pipeline.

In the code cell below, you will find a dropdown menu that allows you to select the desired inference device. The available options are dynamically populated based on the devices supported by your system and the installed OpenVINO runtime.

Simply choose the appropriate device from the dropdown list, considering factors such as performance, power consumption, and availability. OpenVINO will then optimize the converted models and execute the inference pipeline on the selected device, ensuring the best possible performance and efficiency.

By default, the "AUTO" option is selected, which allows OpenVINO to automatically choose the most suitable device based on the available hardware and the model's requirements. However, you can override this behavior by explicitly selecting a specific device from the list.

Once you have selected the inference device, the subsequent steps in the pipeline will be executed on that device, taking full advantage of the OpenVINO runtime's optimizations and acceleration capabilities.
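If you prefer to pick the device in code rather than through the dropdown, a small fallback helper along these lines may be handy. `choose_device` is a hypothetical function written for this illustration, and the `detected` list stands in for whatever `core.available_devices` returns on your machine.

```python
from typing import Sequence


def choose_device(available: Sequence[str], preferred: str = "AUTO") -> str:
    """Return `preferred` if it can be used, otherwise fall back to "AUTO".

    `available` stands in for `core.available_devices`; "AUTO" is always accepted
    because it is a virtual device that delegates the choice to OpenVINO.
    """
    if preferred == "AUTO" or preferred in available:
        return preferred
    return "AUTO"


detected = ["CPU"]  # hypothetical result of core.available_devices
print(choose_device(detected, "GPU"))  # GPU is not listed, so this falls back
print(choose_device(detected, "CPU"))
```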


In [13]:
core = ov.Core()

In [14]:
import ipywidgets as widgets

device = widgets.Dropdown(
    options=core.available_devices + ["AUTO"],
    value='AUTO',
    description='Device:',
    disabled=False,
)

device

Dropdown(description='Device:', index=1, options=('CPU', 'AUTO'), value='AUTO')

In [15]:
#aa: doesn't work - complains about moviepy.editor

In [21]:
!pip install --upgrade moviepy


[notice] A new release of pip available: 22.2.2 -> 24.3.1
[notice] To update, run: pip install --upgrade pip


In [23]:
from utils import patch_whisper_for_ov_inference, OpenVINOAudioEncoder, OpenVINOTextDecoder

patch_whisper_for_ov_inference(model)

model.encoder = OpenVINOAudioEncoder(core, WHISPER_ENCODER_OV, device=device.value)
model.decoder = OpenVINOTextDecoder(core, WHISPER_DECODER_OV, device=device.value)

## Running the Video Transcription Pipeline

With the Whisper model converted to OpenVINO IR format and the inference device selected, we are now ready to run the video transcription pipeline on our chosen video.

For the purpose of this tutorial, we will demonstrate the transcription process using a video from YouTube. In the code cell below, you can enter the URL of the YouTube video you wish to transcribe. Please keep in mind that downloading the video may take some time, depending on the video's length and your internet connection speed.

Once the video URL is provided, the code will automatically download the video and save it to the local file system. The downloaded video file will serve as the input for the transcription pipeline.



In [26]:
import ipywidgets as widgets
VIDEO_LINK = "https://youtu.be/kgL5LBM-hFI"
#VIDEO_LINK = "https://youtu.be/5bs9XoTac88"
link = widgets.Text(
    value=VIDEO_LINK,
    placeholder="Type link for video",
    description="Video:",
    disabled=False
)

link

Text(value='https://youtu.be/kgL5LBM-hFI', description='Video:', placeholder='Type link for video')

Select the task for the model:

* **transcribe** - generate audio transcription in the source language (automatically detected).
* **translate** - generate a transcription translated into English.

In [28]:
from whisper import _MODELS
list(_MODELS)

['tiny.en',
 'tiny',
 'base.en',
 'base',
 'small.en',
 'small',
 'medium.en',
 'medium',
 'large-v1',
 'large-v2',
 'large-v3',
 'large']

In [29]:
from whisper import _MODELS
import ipywidgets as widgets

model_id = widgets.Dropdown(
    options=list(_MODELS),
    value='base',
    description='Model:',
    disabled=False,
)

model_id

Dropdown(description='Model:', index=3, options=('tiny.en', 'tiny', 'base.en', 'base', 'small.en', 'small', 'm…

In [31]:
task = widgets.Select(
    options=["transcribe", "translate"],
    value="translate",
    description="Select task:",
    disabled=False
)
task

Select(description='Select task:', index=1, options=('transcribe', 'translate'), value='translate')

In [34]:
task.value

'transcribe'

In [24]:
torch.cuda.is_available()

False

In [None]:
!curl https://johnvansickle.com/ffmpeg/releases/ffmpeg-release-amd64-static.tar.xz -o ffmpeg.tar.xz \
     && tar -xf ffmpeg.tar.xz && rm ffmpeg.tar.xz

In [40]:
ffmdir = !find . -iname ffmpeg-*-static
path = %env PATH
path = path + ':' + ffmdir[0]
%env PATH $path
!which ffmpeg
print('Done!')

env: PATH=/opt/app-root/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/opt/app-root/src/.local/bin/:/opt/app-root/src/bin:/opt/app-root/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/mssql-tools18/bin:./ffmpeg-7.0.2-amd64-static
/opt/app-root/src/my-whisper/Chapter06/ffmpeg-7.0.2-amd64-static/ffmpeg
Done!


In [41]:
audio = "example_1.wav"

In [42]:
transcription = model.transcribe(audio, fp16=torch.cuda.is_available(), task=task.value)

The results will be saved in the `downloaded_video.srt` file. SRT is one of the most popular formats for storing subtitles and is compatible with many modern video players. This file can be used to show the transcription during playback or to embed it directly into a video file using `ffmpeg`.

In [44]:
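from pathlib import Path
from utils import prepare_srt

# Base name for the subtitle file; with_suffix(".srt") below yields downloaded_video.srt
output_file = Path("downloaded_video")

#srt_lines = prepare_srt(transcription, filter_duration=duration)
srt_lines = prepare_srt(transcription)
# save transcription
with output_file.with_suffix(".srt").open("w") as f:
    f.writelines(srt_lines)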
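For readers curious about the SRT layout itself, the sketch below builds subtitle entries from a list of segments with `start`, `end`, and `text` keys, which is the shape of the `segments` entries in the result of `model.transcribe`. It is not the implementation of `prepare_srt` from `utils.py`; `format_timestamp` and `segments_to_srt` are hypothetical helpers, and the sample segments are shortened for illustration.

```python
from typing import Dict, Iterable, List


def format_timestamp(seconds: float) -> str:
    """Render seconds as the SRT timestamp HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments: Iterable[Dict]) -> List[str]:
    """Turn segments with 'start', 'end' and 'text' keys into SRT lines."""
    lines = []
    for index, seg in enumerate(segments, start=1):
        lines.append(f"{index}\n")  # 1-based entry counter
        lines.append(f"{format_timestamp(seg['start'])} --> {format_timestamp(seg['end'])}\n")
        lines.append(f"{seg['text'].strip()}\n\n")  # blank line ends the entry
    return lines


example = [
    {"start": 0.0, "end": 6.36, "text": " Mr. Quilter is the apostle of the middle classes."},
    {"start": 6.36, "end": 11.28, "text": " Nor is Mr. Quilter's manner less interesting."},
]
print("".join(segments_to_srt(example)))
```

Each entry consists of a counter, a time range, the subtitle text, and a blank separator line.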

Now let us see the results.

In [45]:
print("".join(srt_lines))

1
00:00:00,000 --> 00:00:06,360
 Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.

2
00:00:06,360 --> 00:00:11,280
 Nor is Mr. Quilter's manner less interesting than his matter.

3
00:00:11,280 --> 00:00:16,840
 He tells us that at this festive season of the year, with Christmas and roast beef looming

4
00:00:16,840 --> 00:00:23,800
 before us, similarly he's drawn from eating and its results occur most readily to the mind.

5
00:00:23,800 --> 00:00:29,400
 He has graved doubts whether Sir Frederick Layton's work is really Greek after all, and

6
00:00:29,400 --> 00:00:33,600
 can discover in it but little of Rocky Ithaca.

7
00:00:33,600 --> 00:00:39,800
 Lynelle's pictures are a sort of upgards and atom paintings, and Mason's exquisite

8
00:00:39,800 --> 00:00:44,600
 ittles are as national as a jingo poem.

9
00:00:44,600 --> 00:00:50,360
 Mr. Birk at Foster's landscapes smile at one much in the same way that Mr. Carker used

10
00:00:50,360

## Interactive Demo

...
