<a href="https://colab.research.google.com/github/weedge/doraemon-nb/blob/main/whisper_torch_compile.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Whisper: 4x Faster Inference with Torch Compile

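The Whisper model in Transformers is *overhead*-bound: inference speed is bottlenecked by how quickly the CPU can issue instructions to the GPU, which leaves the GPU under-utilised. To fix this, we need to hand the GPU more operations at a time. One way to do so is [**torch compile**](https://pytorch.org/docs/stable/generated/torch.compile.html), a native PyTorch function for accelerating PyTorch model inference.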

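Torch compile takes a large region of the PyTorch graph and captures it as a single compiled region. Doing so reduces the GPU instructions to this one compiled region, which lowers the CPU overhead. In addition, torch compile generates faster kernels for these operations, which speeds up the computation and ensures it is *memory*-bound.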

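In this Colab notebook, we will see how to enable torch compile for the Whisper model with just two lines of code. We then run a benchmark that highlights the 4x inference speed-up torch compile provides for any Whisper model on the Hugging Face Hub.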

## Setup

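The runtime is configured to use an L4 / A100 GPU, available through Google Colab Pro. You can connect to an L4 / A100 by clicking the "Connect L4 / A100" button in the top right-hand corner of the screen, or select a different GPU by clicking "Runtime" -> "Change runtime type".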

Once that is done, we can install the necessary Python packages. We will install [🤗 Transformers](https://huggingface.co/docs/transformers/index) to load and run the Whisper model, [🤗 Datasets](https://huggingface.co/docs/datasets/index) to load our benchmark dataset, and [🤗 Accelerate](https://huggingface.co/docs/accelerate/index) for fast model loading:

In [7]:
!pip install -q transformers==4.52.4 torchcodec==0.2 datasets[audio] accelerate

In [None]:
!pip list | grep -E "transformers|accelerate|datasets|torch"

accelerate                            1.9.0
datasets                              3.6.0
sentence-transformers                 4.1.0
tensorflow-datasets                   4.9.9
torch                                 2.6.0+cu124
torchao                               0.10.0
torchaudio                            2.6.0+cu124
torchdata                             0.11.0
torchsummary                          1.5.1
torchtune                             0.6.1
torchvision                           0.21.0+cu124
transformers                          4.52.4
vega-datasets                         0.9.0


## Benchmarking

First, we load the Whisper model and its accompanying processor using the familiar 🤗 Transformers API.
In this example, we load the pre-trained Whisper [medium.en](https://huggingface.co/openai/whisper-medium.en) model, but feel free
to swap it for any of the [10k Whisper checkpoints](https://huggingface.co/models?library=transformers&other=whisper&sort=trending)
on the Hugging Face Hub. To reduce the loading time, we pass the [low_cpu_mem_usage](https://huggingface.co/docs/transformers/v4.43.4/en/big_models#accelerates-big-model-inference) flag to `.from_pretrained`:

In [None]:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

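# Each assignment below overrides the previous one, so the last line wins.
# Keep only the checkpoint you want to benchmark.
model_id = "openai/whisper-medium.en" # English-only
model_id = "openai/whisper-large-v3" # Multilingual
model_id = "openai/whisper-medium" # Multilingual

# Note: on the L4, inference time actually went up after torch compile...
model_id = "openai/whisper-large-v3-turbo" # Multilingual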



model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)


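For benchmarking, we load a small dataset consisting of 73 samples from the validation-clean split of [LibriSpeech ASR](https://huggingface.co/datasets/librispeech_asr). This amounts to about 9MB of data, so it is very lightweight and quick to download: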

In [None]:
from datasets import load_dataset

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
dataset


Dataset({
    features: ['file', 'audio', 'text', 'speaker_id', 'chapter_id', 'id'],
    num_rows: 73
})

In [None]:
dataset[0]

{'file': '/Users/sanchitgandhi/.cache/huggingface/datasets/downloads/extracted/aad76e6f21870761d7a8b9b34436f6f8db846546c68cb2d9388598d7a164fa4b/dev_clean/1272/128104/1272-128104-0000.flac',
 'audio': {'path': '1272-128104-0000.flac',
  'array': array([0.00238037, 0.0020752 , 0.00198364, ..., 0.00042725, 0.00057983,
         0.0010376 ]),
  'sampling_rate': 16000},
 'text': 'MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL',
 'speaker_id': 1272,
 'chapter_id': 128104,
 'id': '1272-128104-0000'}

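To make sure the audio sampling rate matches that of our model, we resample the audio to the rate Whisper expects (16kHz).
Note that resampling is applied on the fly when the audio is loaded, and is a no-op if the sampling rate already matches: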

In [None]:
from datasets import Audio

sampling_rate = processor.feature_extractor.sampling_rate
dataset = dataset.cast_column("audio", Audio(sampling_rate=sampling_rate))

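We are now ready to benchmark 📏 The cell below iterates over the samples in the dataset one at a time (i.e. batch size one). For each sample, we perform three stages of inference:
1. Pre-process the raw audio input into a log-mel spectrogram
2. Auto-regressively generate text tokens from the spectrogram input
3. Post-process the generated tokens into a text string

For benchmarking purposes, we time the generation step, which is the part performed by the Whisper model itself: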

In [None]:
import time
from tqdm import tqdm
from torch.nn.attention import sdpa_kernel, SDPBackend
import torch

inference_time = 0.0
model.generation_config.max_new_tokens = 128

torch.set_float32_matmul_precision("high")


In [None]:
print(dataset[0]["text"].lower())
# 1. Pre-process the audio inputs
# input_features: torch.Size([1, 80, 3000]) torch.float32 up to large-v2; large-v3 checkpoints use 128 mel bins
input_features = processor(dataset[0]["audio"]["array"], sampling_rate=16000, return_tensors="pt").input_features
print(input_features.shape,input_features.dtype)
input_features = input_features.to(device, dtype=torch_dtype)
# Create an attention mask
attention_mask = torch.ones(input_features.shape[:2], dtype=torch.long, device=device)
# 2. Auto-regressively generate text tokens
start = time.time()
pred_ids = model.generate(input_features, attention_mask=attention_mask)
inference_time = time.time() - start
# 3. Post-process tokens to text
pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True)
print(pred_text[0].lower())
print(inference_time)

Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English.This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.


mister quilter is the apostle of the middle classes and we are glad to welcome his gospel
torch.Size([1, 128, 3000]) torch.float32
 mr. quilter is the apostle of the middle classes, and we are glad to welcome his gospel.
1.3720624446868896
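
The `3000` in the printed shape follows from Whisper's fixed 30-second input window. The arithmetic below is a small sketch: the 30-second window and the hop length of 160 samples are assumed to be the Whisper feature extractor defaults rather than values stated in this notebook, and `num_frames` is a helper name chosen for illustration.

```python
SAMPLING_RATE = 16_000   # Hz, as reported by the dataset sample above
CHUNK_LENGTH_S = 30      # assumed: every input is padded or truncated to 30 s
HOP_LENGTH = 160         # assumed: samples between consecutive spectrogram frames


def num_frames(sampling_rate: int, chunk_length_s: int, hop_length: int) -> int:
    """Number of log-mel frames produced for one padded input window."""
    n_samples = sampling_rate * chunk_length_s
    return n_samples // hop_length


frames = num_frames(SAMPLING_RATE, CHUNK_LENGTH_S, HOP_LENGTH)
print(frames)  # 3000
```

Because every input is padded to the same window, the encoder always sees the same shape regardless of how long the utterance is.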


In [None]:
inference_time = 0.0

for sample in tqdm(dataset):
    #print(sample["text"].lower())

    # 1. Pre-process the audio inputs
    # input_features: torch.Size([1, 80, 3000]) torch.float32 up to large-v2; large-v3 checkpoints use 128 mel bins
    input_features = processor(sample["audio"]["array"], sampling_rate=16000, return_tensors="pt").input_features
    input_features = input_features.to(device, dtype=torch_dtype)

    # Create an attention mask
    attention_mask = torch.ones(input_features.shape[:2], dtype=torch.long, device=device)

    # 2. Auto-regressively generate text tokens
    start = time.time()
    pred_ids = model.generate(input_features, attention_mask=attention_mask)
    inference_time += time.time() - start

    # 3. Post-process tokens to text
    pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True)
    #print(pred_text[0].lower())

print(inference_time)

100%|██████████| 73/73 [00:18<00:00,  3.86it/s]

17.399162769317627
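
The timing pattern in the loop above can be factored into a small helper. This is a sketch, not part of the original benchmark: `time_calls` and its optional `sync` hook are names introduced here, and `time.perf_counter` is swapped in for `time.time` because it is a monotonic clock intended for measuring intervals. On a GPU, one could pass `torch.cuda.synchronize` as `sync` so that pending asynchronous kernels are included in the measurement.

```python
import time
from typing import Callable, Iterable, Optional


def time_calls(
    fn: Callable,
    inputs: Iterable,
    sync: Optional[Callable[[], None]] = None,
) -> float:
    """Return the total wall-clock seconds spent inside fn across all inputs."""
    total = 0.0
    for item in inputs:
        if sync is not None:
            sync()  # finish earlier work before starting the clock
        start = time.perf_counter()
        fn(item)
        if sync is not None:
            sync()  # wait for work launched by fn
        total += time.perf_counter() - start
    return total


# Stand-in workload; in the notebook this would wrap model.generate
elapsed = time_calls(lambda n: sum(range(n)), [10_000, 20_000, 30_000])
print(f"{elapsed:.6f} s")
```

Only the call to `fn` sits between the two clock reads, so pre-processing and decoding stay outside the measurement, as in the loop above.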





Baseline times for the uncompiled model

L4 GPU:
- medium.en: 36.4 s
- large-v3: 47.5 s
- large-v3-turbo: 13 s

A100 GPU:
- medium.en: 35.7 s
- medium: 39.3 s
- large-v3: 50.2 s
- large-v3-turbo: 10.6 s



Now let's apply torch compile and re-measure the performance.

## Enable torch compile

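The first step to enable compilation is self-explanatory: we need to apply the `torch.compile` transformation to the model's forward pass. We set the compile mode to `reduce-overhead`, which uses [CUDA Graphs](https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/) to further reduce CPU overhead. We also set `fullgraph=True` to compile the whole model in a single graph (i.e. with no graph breaks):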

In [None]:
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)
# Alternative without `mode`, since torch.compile rejects `mode` and `options` together:
#model.forward = torch.compile(model.forward, fullgraph=True, options={"triton.cudagraphs": True})

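The second step involves setting up the key-value (kv) cache. During decoding, the Whisper decoder computes the kv states for each new input token and saves them for re-use in the next decoding step, forming a **kv cache**. The default kv cache implementation grows in length with every generated token, since we save a new set of kv states for each decoding step.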

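While dynamic shapes are compatible with a subset of `torch.compile` optimisations, they limit how far the CPU overhead can be reduced. We therefore switch the kv cache to a **static** implementation, which pre-allocates the entire kv cache at its maximum size and masks the unused positions out of the attention computation. This makes the kv cache compatible with the `reduce-overhead` setting from the previous step.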

In [None]:
model.generation_config.cache_implementation = "static"
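
To make the idea concrete, here is a toy single-head sketch. It is not the Transformers implementation: `MAX_LEN`, `HEAD_DIM` and the tensor names are made up for illustration. It checks that attention over a pre-allocated cache with the unused slots masked out gives the same result as attention over a cache that grows by one entry per step.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
MAX_LEN, HEAD_DIM = 8, 4  # hypothetical sizes for illustration

# Static cache: allocated once at full size, so its shape never changes
static_k = torch.zeros(1, 1, MAX_LEN, HEAD_DIM)
static_v = torch.zeros(1, 1, MAX_LEN, HEAD_DIM)
# Dynamic cache: grows by one position per decoding step
dyn_k = torch.empty(1, 1, 0, HEAD_DIM)
dyn_v = torch.empty(1, 1, 0, HEAD_DIM)

for step in range(5):
    # Random stand-ins for the query, key and value of the newest token
    q, k, v = (torch.randn(1, 1, 1, HEAD_DIM) for _ in range(3))

    dyn_k = torch.cat([dyn_k, k], dim=2)
    dyn_v = torch.cat([dyn_v, v], dim=2)
    out_dynamic = F.scaled_dot_product_attention(q, dyn_k, dyn_v)

    # Write the new entry into its slot instead of concatenating
    static_k[:, :, step] = k[:, :, 0]
    static_v[:, :, step] = v[:, :, 0]
    # Boolean mask: True marks the slots filled so far, False hides the rest
    mask = (torch.arange(MAX_LEN) <= step).view(1, 1, 1, MAX_LEN)
    out_static = F.scaled_dot_product_attention(q, static_k, static_v, attn_mask=mask)

    assert torch.allclose(out_dynamic, out_static, atol=1e-6)

print(dyn_k.shape, static_k.shape)
```

The dynamic cache ends with a different shape after every step, while the static one keeps the same shape throughout, which is the property the `reduce-overhead` mode relies on.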

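Since torch compile is a form of "just-in-time" (JIT) compilation, we need to run a series of compilation steps to compile our model. Here we perform three warm-up steps, each generating up to the maximum number of tokens we allow: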

In [None]:
max_new_tokens = model.generation_config.max_new_tokens

for _ in tqdm(range(3)):
    with sdpa_kernel(SDPBackend.MATH):
        model.generate(input_features, attention_mask=attention_mask, min_new_tokens=max_new_tokens, max_new_tokens=max_new_tokens)

  0%|          | 0/3 [00:00<?, ?it/s]W0801 09:29:13.028000 4526 torch/_inductor/utils.py:1137] [0/0] Not enough SMs to use max_autotune_gemm mode
100%|██████████| 3/3 [01:23<00:00, 27.72s/it]


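**Note:** this code cell may take a few minutes to run, especially on the first call. To reduce compilation time on subsequent runs, upgrade to `torch>2.4` and enable the [FX graph cache](https://pytorch.org/tutorials/recipes/torch_compile_caching_tutorial.html) with the `TORCHINDUCTOR_FX_GRAPH_CACHE=1` flag.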
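
One way to set this flag from inside a notebook is through `os.environ`. This is a minimal sketch and assumes the cell runs before anything has been compiled in the session, ideally before `torch` is imported.

```python
import os

# Enable the Inductor FX graph cache named in the note above.
# Set it before the first torch.compile call so the compiler sees it.
os.environ["TORCHINDUCTOR_FX_GRAPH_CACHE"] = "1"

print(os.environ["TORCHINDUCTOR_FX_GRAPH_CACHE"])
```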

## Benchmarking with Compile

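We are now ready to re-run our benchmark with the compiled implementation. The only change we make is to use the [scaled dot-product attention (SDPA) context manager](https://pytorch.org/docs/stable/generated/torch.nn.attention.sdpa_kernel.html) to switch the attention implementation from flash attention to the native PyTorch C++ implementation, which generally gives better performance under compilation: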

In [None]:
sample = dataset[0]
print(sample["text"].lower())
# 1. Pre-process the audio inputs
# input_features: torch.Size([1, 80, 3000]) torch.float32 up to large-v2; large-v3 checkpoints use 128 mel bins
inputs = processor(
    sample["audio"]["array"],
    sampling_rate=16000,
    return_tensors="pt",
    padding="max_length",
    return_attention_mask=True,
)
inputs = inputs.to(device, dtype=torch_dtype)
print(inputs)
print(inputs.input_features.shape)

# Create an attention mask
attention_mask = torch.ones(inputs.input_features.shape[:2], dtype=torch.long, device=device)
print(attention_mask.shape,inputs.attention_mask.shape)
# 2. Auto-regressively generate text tokens
start = time.time()
with sdpa_kernel(SDPBackend.MATH):
    pred_ids = model.generate(
        **inputs,
        #inputs.input_features,
        #attention_mask=attention_mask,
        task="transcribe",
        return_timestamps="word",
    )
    print(pred_ids)
inference_time = time.time() - start
# 3. Post-process tokens to text
pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True,decode_with_timestamps=True)
print(pred_text[0].lower())
print(f"{inference_time=}")


mister quilter is the apostle of the middle classes and we are glad to welcome his gospel
{'input_features': tensor([[[ 0.0436, -0.1703, -0.1855,  ..., -0.7607, -0.7607, -0.7607],
         [ 0.1411, -0.0728, -0.0880,  ..., -0.7607, -0.7607, -0.7607],
         [-0.0206, -0.0893, -0.0648,  ..., -0.7607, -0.7607, -0.7607],
         ...,
         [-0.7607, -0.7456, -0.7607,  ..., -0.7607, -0.7607, -0.7607],
         [-0.7607, -0.7607, -0.7607,  ..., -0.7607, -0.7607, -0.7607],
         [-0.7607, -0.7607, -0.7607,  ..., -0.7607, -0.7607, -0.7607]]],
       device='cuda:0', dtype=torch.float16), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0]], device='cuda:0', dtype=torch.int32)}
torch.Size([1, 128, 3000])
torch.Size([1, 128]) torch.Size([1, 3000])
tensor([[50364,  2221,    13,  2326,   388,   391,   307,   264, 50244,   295,
           264,  2808,  5359,    11,   293,   321,   366,  5404,   281,  2928,
           702, 14943,    13]], device='cuda:0')
 mr. quilter is the apostle of the m

In [None]:
inference_time = 0.0

for sample in tqdm(dataset):
    # 1. Pre-process the audio inputs
    input_features = processor(sample["audio"]["array"], sampling_rate=16000, return_tensors="pt").input_features
    input_features = input_features.to(device, dtype=torch_dtype)
    attention_mask = torch.ones(input_features.shape[:2], dtype=torch.long, device=device)

    # 2. Auto-regressively generate text tokens
    start = time.time()
    with sdpa_kernel(SDPBackend.MATH):
        pred_ids = model.generate(input_features, attention_mask=attention_mask)
    inference_time += time.time() - start

    # 3. Post-process tokens to text
    pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True)

print(inference_time)

100%|██████████| 73/73 [00:37<00:00,  1.92it/s]

36.409374475479126





Baseline times for the uncompiled model

L4 GPU:
- medium.en: 36.4 s
- medium: 39.3 s
- large-v3: 47.5 s
- large-v3-turbo: 13 s

A100 GPU:
- medium.en: 35.7 s
- medium: 39.3 s
- large-v3: 50.2 s
- large-v3-turbo: 10.6 s

---


Times for the compiled model

L4 GPU:
- medium.en: 23 s
- large-v3: 40 s
- large-v3-turbo: 25.8 s

A100 GPU:
- medium.en: 10.8 s
- medium: 12.7 s
- large-v3: 19.3 s
- large-v3-turbo: 8 s
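
The speed-up factors implied by these measurements can be worked out directly. The snippet below is plain arithmetic on the numbers listed above; `speedups` is a helper name introduced for illustration, and models without a measurement on both sides are skipped.

```python
# Total generation time in seconds, copied from the lists above
baseline = {
    "L4": {"medium.en": 36.4, "large-v3": 47.5, "large-v3-turbo": 13.0},
    "A100": {"medium.en": 35.7, "medium": 39.3, "large-v3": 50.2, "large-v3-turbo": 10.6},
}
compiled = {
    "L4": {"medium.en": 23.0, "large-v3": 40.0, "large-v3-turbo": 25.8},
    "A100": {"medium.en": 10.8, "medium": 12.7, "large-v3": 19.3, "large-v3-turbo": 8.0},
}


def speedups(before: dict, after: dict) -> dict:
    """Speed-up factor per GPU and model; values below 1.0 mean a slowdown."""
    return {
        gpu: {
            name: before[gpu][name] / seconds
            for name, seconds in models.items()
            if name in before[gpu]
        }
        for gpu, models in after.items()
    }


result = speedups(baseline, compiled)
for gpu, models in result.items():
    for name, factor in models.items():
        print(f"{gpu:5s} {name:15s} {factor:.2f}x")
```

On these numbers the largest factor is about 3.3x (medium.en on the A100), while large-v3-turbo on the L4 comes out near 0.5x, i.e. roughly twice as slow after compilation.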

---
The A100 results are consistent with the speed-up obtained in the openai whisper library by using torch `F.scaled_dot_product_attention`; see: https://github.com/openai/whisper/discussions/2363#discussion-7264254

---

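Recall that this optimisation technique is model-agnostic: it can be applied to any Whisper model in the Transformers library. The speed-up depends on the hardware and the model size, with smaller models typically seeing the largest gains. Even the largest Whisper model ([large-v3](https://huggingface.co/openai/whisper-large-v3)) gets faster, although the speed-up is limited.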

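There is one exception: [large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo) is slower after compilation on the L4 GPU, while it is faster on the A100. This is related to how the kernels are compiled for each CUDA architecture.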
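
To see which architecture a runtime is actually using, the device properties can be printed. This is a minimal sketch; `describe_gpu` is a helper name introduced here, and it returns `None` when no CUDA device is present. The SM count is included because the warm-up log above warns about not having enough SMs for one of the autotuning modes.

```python
import torch


def describe_gpu(index: int = 0):
    """Return basic properties of a CUDA device, or None if CUDA is unavailable."""
    if not torch.cuda.is_available():
        return None
    props = torch.cuda.get_device_properties(index)
    return {
        "name": props.name,
        "capability": (props.major, props.minor),  # CUDA compute capability
        "sm_count": props.multi_processor_count,   # streaming multiprocessors
        "memory_gb": round(props.total_memory / 1024**3, 1),
    }


print(describe_gpu())
```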

## Conclusion

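In this Colab, we broke down the steps for running Whisper inference with torch compile, showing that a 4x speed-up can be achieved with just two extra lines of code. For an end-to-end code example, refer to the [Whisper model card](https://huggingface.co/openai/whisper-large-v3#torch-compile).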

# pipeline

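The model can also be used with the pipeline class, which transcribes audio of arbitrary length and can return sentence-level and word-level timestamps: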

In [1]:
import time

import tqdm
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3-turbo"
model_id = "openai/whisper-large-v3"


model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)


In [1]:
# Uncompiled

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,
    batch_size=1,  # batch size for inference - set based on your device
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
print(dataset)
sample = dataset[0]["audio"]
print(dataset[0])

start = time.time()
result = pipe(
    sample.copy(),
    chunk_length_s=30,
    batch_size=1,
    generate_kwargs={"language": "en", "task": "transcribe"},
    return_timestamps=True,
    #return_timestamps="word",

)
inference_time = time.time() - start
print(f"{inference_time=}")

print(result)



Device set to use cuda:0


Dataset({
    features: ['audio'],
    num_rows: 1
})


You have passed task=transcribe, but also have set `forced_decoder_ids` to [[1, None], [2, 50360]] which creates a conflict. `forced_decoder_ids` will be ignored in favor of task=transcribe.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


{'audio': {'path': '0d38672e0bbdbdc460af55b8bb84a15b2730db2819f2af64f9c777d4d586f2de', 'array': array([0.00238037, 0.0020752 , 0.00198364, ..., 0.00024414, 0.00048828,
       0.0005188 ]), 'sampling_rate': 16000}}
inference_time=14.550672054290771
{'text': " Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel. Nor is Mr. Quilter's manner less interesting than his matter. He tells us that at this festive season of the year, with Christmas and roast beef looming before us, similes drawn from eating and its results occur most readily to the mind. He has grave doubts whether Sir Frederick Leighton's work is really Greek after all, and can discover in it but little of rocky Ithaca. Linnell's pictures are a sort of Upguards and Adam paintings, and Mason's exquisite idylls are as national as a jingo poem. Mr. Burkett Foster's landscapes smile at one much in the same way that Mr. Carker used to flash his teeth. And Mr. John Collier gives his sitter a cheerfu

In [5]:
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset
from tqdm import tqdm

torch.set_float32_matmul_precision("high")

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
).to(device)

# Enable static cache and compile the forward pass
model.generation_config.cache_implementation = "static"
model.generation_config.max_new_tokens = 256
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

# 2 warmup steps
for _ in tqdm(range(2), desc="Warm-up step"):
    with sdpa_kernel(SDPBackend.MATH):
        result = pipe(sample.copy(), generate_kwargs={"min_new_tokens": 256, "max_new_tokens": 256},return_timestamps=True)

# fast run
with sdpa_kernel(SDPBackend.MATH):
    result = pipe(sample.copy(),return_timestamps=True)

print(result["text"])


Device set to use cuda:0
Warm-up step:   0%|          | 0/2 [00:00<?, ?it/s]Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English.This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.
skipping cudagraphs due to mutated inputs (64 instances). Found from : 
   File "/usr/local/lib/python3.11/dist-packages/transformers/models/whisper/modeling_whisper.py", line 1694, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.11/dist-packages/transformers/models/whisper/modeling_whisper.py", line 1529, in forward
    decoder_outputs = self.decoder(
  File "/usr/local/lib/python3.11/dist-packages/transformers/models/whisper/modeling_whisper.py", line 1188, in forward
    layer_outputs = decoder_layer(
  File "/usr/local/lib/python3.1

 Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel. Nor is Mr. Quilter's manner less interesting than his matter. He tells us that at this festive season of the year, with Christmas and roast beef looming before us, similes drawn from eating and its results occur most readily to the mind. He has grave doubts whether Sir Frederick Leighton's work is really Greek after all, One can discover in it but little of rocky Ithaca. Linnell's pictures are a sort of Upgards and Adam paintings, and Mason's exquisite idylls are as national as a jingo poem. Mr. Burkett Foster's landscapes smile at one much in the same way that Mr. Carker used to flash his teeth, and Mr. John Collier gives his sitter a cheerful slap on the back before he says, like a shampooer in a Turkish bath. Next man!


Note: torch.compile is currently not compatible with the chunked long-form algorithm or Flash Attention 2 ⚠️
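
For context, chunked long-form transcription splits the audio into overlapping windows of `chunk_length_s` seconds, transcribes each window, and merges the results across the overlaps. The sketch below only illustrates how such windows could be laid out. It is a simplification and not the pipeline's code: `chunk_windows` is a hypothetical helper, and the 5-second stride on each side (one sixth of the chunk length) is an assumption about the default.

```python
def chunk_windows(total_s: float, chunk_length_s: float = 30.0, stride_s: float = 5.0):
    """Return (start, end) times in seconds of overlapping windows covering the audio."""
    step = chunk_length_s - 2 * stride_s  # new audio covered by each window
    if step <= 0:
        raise ValueError("stride_s must be less than half of chunk_length_s")
    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_length_s, total_s)
        windows.append((start, end))
        if end >= total_s:
            break
        start += step
    return windows


# 64 s is a made-up duration for illustration
windows = chunk_windows(total_s=64.0)
print(windows)  # [(0.0, 30.0), (20.0, 50.0), (40.0, 64.0)]
```

Consecutive windows share 10 seconds of audio here, which gives the merge step context on both sides of each boundary.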

In [6]:
# fast run
start = time.time()
with sdpa_kernel(SDPBackend.MATH):
    result = pipe(
        sample.copy(),
        chunk_length_s=30,
        batch_size=1,
        return_timestamps=True,
        #return_timestamps="word",
    )
inference_time = time.time() - start
print(f"{inference_time=}")
print(len(result["text"]),len(result["chunks"]))
print(result["chunks"])


The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


inference_time=110.80034637451172
822 10
[{'timestamp': (0.0, 45.78), 'text': " Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel. Nor is Mr. Quilter's manner less interesting than his matter. He tells us that at this festive season of the year, with Christmas and roast beef looming before us, similes drawn from eating and its results occur most readily to the mind. He has grave doubts whether Sir Frederick Leighton's work is really Greek after all, and can discover in it but little of rocky Ithaca. Linnell's pictures are a sort of Upguards and Adam paintings, and Mason's exquisite idylls are as national as a jingo poem. Mr. Burkett Foster's"}, {'timestamp': (45.78, 47.62), 'text': ' landscapes smile at one'}, {'timestamp': (47.62, 49.78), 'text': ' much in the same way that Mr. Carker'}, {'timestamp': (49.78, 51.56), 'text': ' used to flash his teeth.'}, {'timestamp': (52.44, 54.0), 'text': ' And Mr. John Collier'}, {'timestamp': (54.0, 55.9), 'te

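# Summary
- Running through the pipeline supports sentence-level and word-level timestamps, but the speed-up from torch.compile is disappointing.
- Calling the seq2seq decoder's generate directly to accelerate the model's forward pass works much better than the pipeline wrapper. This is the same approach multimodal models use for generation (and they were probably inspired by Whisper to some degree). I have not read the pipeline source closely, but it is probably similar to the WhisperX pipeline. Use the pipeline only for offline workloads. For online processing you still need to optimise the processor (text tokenizer + audio encoder), the decoder's autoregressive generation, and the subsequent sentence-level and word-level time alignment (Wav2Vec2 or whisper-tiny).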