# How to use Whisper for speech recognition
> Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition as well as speech translation and language identification.


# Installing Whisper

The commands below will install the Python packages needed to use Whisper models and evaluate the transcription results.

In [1]:
! pip install git+https://github.com/openai/whisper.git
! pip install jiwer

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /tmp/pip-req-build-secg28uw
  Running command git clone -q https://github.com/openai/whisper.git /tmp/pip-req-build-secg28uw
Collecting transformers>=4.19.0
  Downloading transformers-4.22.1-py3-none-any.whl (4.9 MB)
Collecting ffmpeg-python==0.2.0
  Downloading ffmpeg_python-0.2.0-py3-none-any.whl (25 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting huggingface-hub<1.0,>=0.9.0
  Downloading huggingface_hub-0.9.1-py3-none-any.whl (120 kB)
Building wheels for collected packages: whisper
 

# Loading the LibriSpeech dataset

The following will load the test-clean split of the LibriSpeech corpus using torchaudio.

In [2]:
import os
import numpy as np

try:
    import tensorflow  # required in Colab to avoid protobuf compatibility issues
except ImportError:
    pass

import torch
import pandas as pd
import whisper
import torchaudio

from tqdm.notebook import tqdm


DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


Moving 0 files to the new cache system


0it [00:00, ?it/s]

In [3]:
class LibriSpeech(torch.utils.data.Dataset):
    """
    A simple class to wrap LibriSpeech and trim/pad the audio to 30 seconds.
    It will drop the last few seconds of a very small portion of the utterances.
    """
    def __init__(self, split="test-clean", device=DEVICE):
        self.dataset = torchaudio.datasets.LIBRISPEECH(
            root=os.path.expanduser("~/.cache"),
            url=split,
            download=True,
        )
        self.device = device

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, item):
        audio, sample_rate, text, _, _, _ = self.dataset[item]
        assert sample_rate == 16000
        audio = whisper.pad_or_trim(audio.flatten()).to(self.device)
        mel = whisper.log_mel_spectrogram(audio)
        
        return (mel, text)
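
The trim/pad step in the class above can be illustrated without Whisper installed. The sketch below is a plain-Python stand-in for the idea behind `whisper.pad_or_trim`, assuming a 30-second window at 16 kHz (480,000 samples); the helper name `pad_or_trim_list` is hypothetical and operates on a list rather than a tensor.

```python
SAMPLE_RATE = 16_000                      # Hz, matching the assertion in the class above
CHUNK_SECONDS = 30                        # window length named in the docstring
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS   # 480,000 samples


def pad_or_trim_list(samples, length=N_SAMPLES):
    """Hypothetical stand-in: zero-pad or cut a list of samples to `length`."""
    if len(samples) >= length:
        return samples[:length]           # drop everything past the window
    return samples + [0.0] * (length - len(samples))  # pad with silence


short_clip = [0.1] * (SAMPLE_RATE * 5)    # 5 seconds of audio
long_clip = [0.1] * (SAMPLE_RATE * 35)    # 35 seconds of audio

padded = pad_or_trim_list(short_clip)
trimmed = pad_or_trim_list(long_clip)
print(len(padded), len(trimmed))
```

Both clips end up the same length, which is what lets the `DataLoader` stack them into a batch. The 35-second clip loses its last 5 seconds, which is the truncation the docstring warns about.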

In [4]:
dataset = LibriSpeech("test-clean")
loader = torch.utils.data.DataLoader(dataset, batch_size=16)

  0%|          | 0.00/331M [00:00<?, ?B/s]

# Running inference on the dataset using a base Whisper model

The following will take a few minutes to transcribe all utterances in the dataset.

In [5]:
model = whisper.load_model("base.en")
print(
    f"Model is {'multilingual' if model.is_multilingual else 'English-only'} "
    f"and has {sum(np.prod(p.shape) for p in model.parameters()):,} parameters."
)

100%|████████████████████████████████████████| 139M/139M [00:01<00:00, 144MiB/s]


Model is English-only and has 71,825,408 parameters.


In [6]:
# predict without timestamps for short-form transcription
options = whisper.DecodingOptions(language="en", without_timestamps=True)

In [7]:
hypotheses = []
references = []

for mels, texts in tqdm(loader):
    results = model.decode(mels, options)
    hypotheses.extend([result.text for result in results])
    references.extend(texts)

  0%|          | 0/164 [00:00<?, ?it/s]
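
The 164 steps in the progress bar follow from the batch size. A quick check, assuming the test-clean split holds 2,620 utterances (a figure not stated in this notebook):

```python
import math

n_utterances = 2620   # assumed size of LibriSpeech test-clean
batch_size = 16       # as passed to the DataLoader above

n_batches = math.ceil(n_utterances / batch_size)
last_batch = n_utterances - (n_batches - 1) * batch_size
print(n_batches, last_batch)
```

Under that assumption the final batch is only partly filled, which `DataLoader` allows by default.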

In [8]:
data = pd.DataFrame(dict(hypothesis=hypotheses, reference=references))
data

| | hypothesis | reference |
|---|---|---|
| 0 | He hoped there would be stew for dinner, turni... | HE HOPED THERE WOULD BE STEW FOR DINNER TURNIP... |
| 1 | Stuffered into you, his belly counseled him. | STUFF IT INTO YOU HIS BELLY COUNSELLED HIM |
| 2 | After early nightfall the yellow lamps would l... | AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD L... |
| 3 | Hello Bertie, any good in your mind? | HELLO BERTIE ANY GOOD IN YOUR MIND |
| 4 | Number 10. Fresh Nelly is waiting on you. Good... | NUMBER TEN FRESH NELLY IS WAITING ON YOU GOOD ... |
| ... | ... | ... |
| 2615 | Oh, to shoot my soul's full meaning into futur... | OH TO SHOOT MY SOUL'S FULL MEANING INTO FUTURE... |
| 2616 | Then I, long tried by natural ills, received t... | THEN I LONG TRIED BY NATURAL ILLS RECEIVED THE... |
| 2617 | I love thee freely as men strive for right. I ... | I LOVE THEE FREELY AS MEN STRIVE FOR RIGHT I L... |
| 2618 | I love thee with the passion put to use, in my... | I LOVE THEE WITH THE PASSION PUT TO USE IN MY ... |


# Calculating the word error rate

Next, we standardize both the hypotheses and the references with Whisper's English text normalizer (`EnglishTextNormalizer`), then use `jiwer` to calculate the word error rate (WER).

In [9]:
import jiwer
from whisper.normalizers import EnglishTextNormalizer

normalizer = EnglishTextNormalizer()

In [10]:
data["hypothesis_clean"] = [normalizer(text) for text in data["hypothesis"]]
data["reference_clean"] = [normalizer(text) for text in data["reference"]]
data

| | hypothesis | reference | hypothesis_clean | reference_clean |
|---|---|---|---|---|
| 0 | He hoped there would be stew for dinner, turni... | HE HOPED THERE WOULD BE STEW FOR DINNER TURNIP... | he hoped there would be stew for dinner turnip... | he hoped there would be stew for dinner turnip... |
| 1 | Stuffered into you, his belly counseled him. | STUFF IT INTO YOU HIS BELLY COUNSELLED HIM | stuffered into you his belly counseled him | stuff it into you his belly counseled him |
| 2 | After early nightfall the yellow lamps would l... | AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD L... | after early nightfall the yellow lamps would l... | after early nightfall the yellow lamps would l... |
| 3 | Hello Bertie, any good in your mind? | HELLO BERTIE ANY GOOD IN YOUR MIND | hello bertie any good in your mind | hello bertie any good in your mind |
| 4 | Number 10. Fresh Nelly is waiting on you. Good... | NUMBER TEN FRESH NELLY IS WAITING ON YOU GOOD ... | number 10 fresh nelly is waiting on you good n... | number 10 fresh nelly is waiting on you good n... |
| ... | ... | ... | ... | ... |
| 2615 | Oh, to shoot my soul's full meaning into futur... | OH TO SHOOT MY SOUL'S FULL MEANING INTO FUTURE... | 0 to shoot my soul is full meaning into future... | 0 to shoot my soul is full meaning into future... |
| 2616 | Then I, long tried by natural ills, received t... | THEN I LONG TRIED BY NATURAL ILLS RECEIVED THE... | then i long tried by natural ills received the... | then i long tried by natural ills received the... |
| 2617 | I love thee freely as men strive for right. I ... | I LOVE THEE FREELY AS MEN STRIVE FOR RIGHT I L... | i love thee freely as men strive for right i l... | i love thee freely as men strive for right i l... |
| 2618 | I love thee with the passion put to use, in my... | I LOVE THEE WITH THE PASSION PUT TO USE IN MY ... | i love thee with the passion put to use in my ... | i love thee with the passion put to use in my ... |
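
To see why normalization matters before scoring, here is a deliberately simplified stand-in. It is not Whisper's `EnglishTextNormalizer`: the hypothetical `simple_normalize` only lowercases, strips ASCII punctuation, and collapses whitespace, whereas the output above shows the real normalizer also harmonizing spellings (`COUNSELLED` to `counseled`) and number words (`TEN` to `10`).

```python
import re
import string


def simple_normalize(text):
    """Simplified stand-in, NOT Whisper's EnglishTextNormalizer."""
    text = text.lower()
    # Remove ASCII punctuation such as commas, periods and question marks.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse runs of whitespace left behind by the removal.
    return re.sub(r"\s+", " ", text).strip()


hypothesis = "Hello Bertie, any good in your mind?"
reference = "HELLO BERTIE ANY GOOD IN YOUR MIND"

print(simple_normalize(hypothesis))
print(simple_normalize(hypothesis) == simple_normalize(reference))
```

Without this step, casing and punctuation alone would count every word in this pair as an error even though the transcription is correct.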


In [11]:
wer = jiwer.wer(list(data["reference_clean"]), list(data["hypothesis_clean"]))

print(f"WER: {wer * 100:.2f} %")

WER: 4.26 %
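
For intuition about what the metric measures, WER is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function below is a minimal sketch of that definition for a single pair of strings. It is not jiwer's implementation, and the name `word_error_rate` is made up for this example.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Row 1 of the normalized table above.
ref = "stuff it into you his belly counseled him"
hyp = "stuffered into you his belly counseled him"
print(f"{word_error_rate(ref, hyp):.2f}")
```

Here one substitution (`stuff` to `stuffered`) and one deletion (`it`) against eight reference words give a WER of 0.25 for this utterance.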


In [12]:
import io
import os
import numpy as np

try:
    import tensorflow  # required in Colab to avoid protobuf compatibility issues
except ImportError:
    pass

import torch
import pandas as pd
import urllib.request  # a bare "import urllib" does not guarantee the request submodule is loaded
import tarfile
import whisper
import torchaudio

from scipy.io import wavfile
from tqdm.notebook import tqdm


pd.options.display.max_rows = 100
pd.options.display.max_colwidth = 1000
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

In [13]:
DEVICE

'cuda'

## Loading the Fleurs dataset
Select the language of the Fleurs dataset to download. Please note that the transcription and translation performance varies widely depending on the language. Appendix D.2 in the paper contains the performance breakdown by language.

In [14]:
import ipywidgets as widgets

languages = {"af_za": "Afrikaans", "am_et": "Amharic", "ar_eg": "Arabic", "as_in": "Assamese", "az_az": "Azerbaijani", "be_by": "Belarusian", "bg_bg": "Bulgarian", "bn_in": "Bengali", "bs_ba": "Bosnian", "ca_es": "Catalan", "cmn_hans_cn": "Chinese", "cs_cz": "Czech", "cy_gb": "Welsh", "da_dk": "Danish", "de_de": "German", "el_gr": "Greek", "en_us": "English", "es_419": "Spanish", "et_ee": "Estonian", "fa_ir": "Persian", "fi_fi": "Finnish", "fil_ph": "Tagalog", "fr_fr": "French", "gl_es": "Galician", "gu_in": "Gujarati", "ha_ng": "Hausa", "he_il": "Hebrew", "hi_in": "Hindi", "hr_hr": "Croatian", "hu_hu": "Hungarian", "hy_am": "Armenian", "id_id": "Indonesian", "is_is": "Icelandic", "it_it": "Italian", "ja_jp": "Japanese", "jv_id": "Javanese", "ka_ge": "Georgian", "kk_kz": "Kazakh", "km_kh": "Khmer", "kn_in": "Kannada", "ko_kr": "Korean", "lb_lu": "Luxembourgish", "ln_cd": "Lingala", "lo_la": "Lao", "lt_lt": "Lithuanian", "lv_lv": "Latvian", "mi_nz": "Maori", "mk_mk": "Macedonian", "ml_in": "Malayalam", "mn_mn": "Mongolian", "mr_in": "Marathi", "ms_my": "Malay", "mt_mt": "Maltese", "my_mm": "Myanmar", "nb_no": "Norwegian", "ne_np": "Nepali", "nl_nl": "Dutch", "oc_fr": "Occitan", "pa_in": "Punjabi", "pl_pl": "Polish", "ps_af": "Pashto", "pt_br": "Portuguese", "ro_ro": "Romanian", "ru_ru": "Russian", "sd_in": "Sindhi", "sk_sk": "Slovak", "sl_si": "Slovenian", "sn_zw": "Shona", "so_so": "Somali", "sr_rs": "Serbian", "sv_se": "Swedish", "sw_ke": "Swahili", "ta_in": "Tamil", "te_in": "Telugu", "tg_tj": "Tajik", "th_th": "Thai", "tr_tr": "Turkish", "uk_ua": "Ukrainian", "ur_pk": "Urdu", "uz_uz": "Uzbek", "vi_vn": "Vietnamese", "yo_ng": "Yoruba"}
selection = widgets.Dropdown(
    options=[("Select language", None), ("----------", None)] + sorted([(f"{v} ({k})", k) for k, v in languages.items()]),
    value="ko_kr",
    description='Language:',
    disabled=False,
)

selection

Dropdown(description='Language:', index=39, options=(('Select language', None), ('----------', None), ('Afrika…

In [15]:
lang = selection.value
assert lang is not None, "Please select a language"  # check before the lookup, which would raise KeyError on None

language = languages[lang]
print(f"Selected language: {language} ({lang})")

Selected language: Chinese (cmn_hans_cn)


In [16]:
class Fleurs(torch.utils.data.Dataset):
    """
    A simple class to wrap Fleurs and subsample a portion of the dataset as needed.
    """
    def __init__(self, lang, split="test", subsample_rate=1, device=DEVICE):
        url = f"https://storage.googleapis.com/xtreme_translations/FLEURS102/{lang}.tar.gz"
        tar_path = os.path.expanduser(f"~/.cache/fleurs/{lang}.tgz")
        os.makedirs(os.path.dirname(tar_path), exist_ok=True)

        if not os.path.exists(tar_path):
            with urllib.request.urlopen(url) as source, open(tar_path, "wb") as output:
                with tqdm(total=int(source.info().get("Content-Length")), ncols=80, unit='iB', unit_scale=True, unit_divisor=1024) as loop:
                    while True:
                        buffer = source.read(8192)
                        if not buffer:
                            break

                        output.write(buffer)
                        loop.update(len(buffer))

        labels = {}
        all_audio = {}
        with tarfile.open(tar_path, "r:gz") as tar:
            for member in tar.getmembers():
                name = member.name
                if name.endswith(f"{split}.tsv"):
                    labels = pd.read_table(tar.extractfile(member), names=("id", "file_name", "raw_transcription", "transcription", "_", "num_samples", "gender"))

                if f"/{split}/" in name and name.endswith(".wav"):
                    audio_bytes = tar.extractfile(member).read()
                    all_audio[os.path.basename(name)] = wavfile.read(io.BytesIO(audio_bytes))[1]
                    

        self.labels = labels.to_dict("records")[::subsample_rate]
        self.all_audio = all_audio
        self.device = device

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, item):
        record = self.labels[item]
        audio = torch.from_numpy(self.all_audio[record["file_name"]].copy())
        text = record["transcription"]
        
        return (audio, text)
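
The archive-scanning logic in `__init__` can be tried without the multi-gigabyte download. The sketch below builds a tiny in-memory `.tar.gz` and applies the same two member filters. The paths, the language code `xx_yy`, and the file contents are made up for illustration and may not match the real FLEURS layout.

```python
import io
import tarfile


def make_demo_archive():
    """Build a tiny in-memory .tar.gz with hypothetical FLEURS-like paths."""
    files = {
        "xx_yy/test.tsv": b"1\ta.wav\tHello.\thello\th e l l o\t16000\tFEMALE\n",
        "xx_yy/audio/test/a.wav": b"fake-wav-bytes",
        "xx_yy/audio/train/b.wav": b"fake-wav-bytes",
    }
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


split = "test"
label_rows, audio_names = [], []
with tarfile.open(fileobj=make_demo_archive(), mode="r:gz") as tar:
    for member in tar.getmembers():
        # The label file for the split, e.g. ".../test.tsv".
        if member.name.endswith(f"{split}.tsv"):
            text = tar.extractfile(member).read().decode("utf-8")
            label_rows = [line.split("\t") for line in text.splitlines()]
        # Only audio stored under the requested split directory.
        if f"/{split}/" in member.name and member.name.endswith(".wav"):
            audio_names.append(member.name.rsplit("/", 1)[-1])

print(label_rows[0][1], audio_names)
```

The training clip is skipped because its path lacks `/test/`, mirroring how the class keeps only the requested split in memory.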

In [17]:
dataset = Fleurs(lang, subsample_rate=10)  # subsample 10% of the dataset for a quick demo

  0%|                                              | 0.00/2.35G [00:00<?, ?iB/s]

In [25]:
small_dataset = Fleurs(lang, subsample_rate=1)  # subsample_rate=1 keeps every record, so despite its name this is the full test split

In [27]:
dataset, small_dataset

(<__main__.Fleurs at 0x7fa62041dbd0>, <__main__.Fleurs at 0x7fa5ba4cbbd0>)

In [30]:
len(small_dataset.labels), len(dataset.labels)

(945, 95)
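
The two lengths follow directly from the slice `[::subsample_rate]` in the class. A minimal sketch with stand-in records (the `id` values here are just list positions, not FLEURS ids):

```python
records = [{"id": i} for i in range(945)]  # stand-in for the 945 test records

every_tenth = records[::10]  # subsample_rate=10 keeps records 0, 10, 20, ...
everything = records[::1]    # subsample_rate=1 keeps every record

print(len(every_tenth), len(everything))
```

A step of 10 keeps roughly 10% of the records, while a step of 1 leaves the list unchanged.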

In [32]:
type(small_dataset.labels), small_dataset.labels[0]

(list,
 {'id': 1779,
  'file_name': '10325559490685159122.wav',
  'raw_transcription': '特朗普与土耳其总统雷杰普·塔伊普·埃尔多安（Recep Tayyip Erdoğan）通话后发表了声明。',
  'transcription': '特朗普与土耳其总统雷杰普·塔伊普·埃尔多安recep tayyip erdoğan通话后发表了声明',
  '_': '特 朗 普 与 土 耳 其 总 统 雷 杰 普 · 塔 伊 普 · 埃 尔 多 安 r e c e p | t a y y i p | e r d o ğ a n 通 话 后 发 表 了 声 明 |',
  'num_samples': 242560,
  'gender': 'MALE'})

In [33]:
type(small_dataset.all_audio)

dict

In [34]:
len(small_dataset.all_audio)

945

In [38]:
import random
random.choice(list(small_dataset.all_audio.items()))

('14665455542148263931.wav',
 array([0.0000000e+00, 0.0000000e+00, 0.0000000e+00, ..., 8.3088875e-05,
        5.1021576e-05, 6.8724155e-05], dtype=float32))

Randomly choose 10 audio clips from the test split and print their lengths and raw samples:


In [39]:
for i in range(10):
  file, raw_data = random.choice(list(small_dataset.all_audio.items()))
  print(file, len(raw_data))
  print(raw_data)

11545158540803261075.wav 212160
[ 0.0000000e+00  0.0000000e+00  0.0000000e+00 ...  3.6954880e-06
 -2.6524067e-05 -4.0173531e-05]
8107720956886769719.wav 81600
[0.         0.         0.         ... 0.00137931 0.00124866 0.00143242]
2625982748996855379.wav 222720
[0.0000000e+00 0.0000000e+00 0.0000000e+00 ... 9.5546246e-05 5.7399273e-05
 4.2498112e-05]
17251775862374398296.wav 96960
[ 4.1723251e-07 -4.7683716e-07  5.9604645e-07 ... -1.1312366e-03
 -1.0677576e-03 -1.0817647e-03]
10949290146151676233.wav 273280
[0.0000000e+00 0.0000000e+00 0.0000000e+00 ... 1.4293194e-04 5.1200390e-05
 1.0114908e-04]
8277463471762550023.wav 251520
[0.         0.         0.         ... 0.00026637 0.00031906 0.0003075 ]
1351715574318686401.wav 283200
[ 0.0000000e+00  0.0000000e+00  0.0000000e+00 ... -2.4855137e-05
 -2.2649765e-05 -2.2768974e-05]
16122645321793286284.wav 276480
[0.         0.         0.         ... 0.00013369 0.00012755 0.00010586]
6268497158786378126.wav 205440
[ 0.0000000e+00  0.0000000e+00
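
The numbers printed next to each file name are sample counts. Converting them to seconds is a single division, assuming the clips are sampled at 16 kHz (the rate Whisper expects, though this notebook never prints it). The helper name `duration_seconds` is made up for this example.

```python
SAMPLE_RATE = 16_000  # Hz; assumed, not read from the files


def duration_seconds(num_samples, sample_rate=SAMPLE_RATE):
    """Length of a clip in seconds given its sample count."""
    return num_samples / sample_rate


# Sample counts copied from the output above.
clips = {
    "11545158540803261075.wav": 212160,
    "8107720956886769719.wav": 81600,
    "2625982748996855379.wav": 222720,
}
for name, num_samples in clips.items():
    print(f"{name}: {duration_seconds(num_samples):.2f} s")
```

Under that assumption these clips run between about 5 and 14 seconds, comfortably inside Whisper's 30-second window.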

In [52]:
from IPython.display import Audio, display

fs = 16000  # sampling rate in Hz; assumed 16 kHz, the rate Whisper expects
for _ in range(10):
  file, raw_data = random.choice(list(small_dataset.all_audio.items()))
  print(file, len(raw_data))
  display(Audio(raw_data, rate=fs, autoplay=False))

18059580157250761435.wav 344640


7669819575583671099.wav 189120


10824731430851864186.wav 196800


14192991371425576281.wav 133440


12601036378860326355.wav 222400


16298512420216472244.wav 275520


5217965021762558271.wav 170880


6006946498162292333.wav 247360


11031078104467854733.wav 130560


10777625043982141593.wav 281280


In [50]:
from IPython.display import Audio, display
fs = 16000  # sampling rate in Hz; assumed 16 kHz (the original 16100 would play clips slightly fast)
display(Audio(raw_data, rate=fs, autoplay=True))

In [28]:
small_dataset.labels

[{'id': 1779,
  'file_name': '10325559490685159122.wav',
  'raw_transcription': '特朗普与土耳其总统雷杰普·塔伊普·埃尔多安（Recep Tayyip Erdoğan）通话后发表了声明。',
  'transcription': '特朗普与土耳其总统雷杰普·塔伊普·埃尔多安recep tayyip erdoğan通话后发表了声明',
  '_': '特 朗 普 与 土 耳 其 总 统 雷 杰 普 · 塔 伊 普 · 埃 尔 多 安 r e c e p | t a y y i p | e r d o ğ a n 通 话 后 发 表 了 声 明 |',
  'num_samples': 242560,
  'gender': 'MALE'},
 {'id': 1779,
  'file_name': '149040922025982373.wav',
  'raw_transcription': '特朗普与土耳其总统雷杰普·塔伊普·埃尔多安（Recep Tayyip Erdoğan）通话后发表了声明。',
  'transcription': '特朗普与土耳其总统雷杰普·塔伊普·埃尔多安recep tayyip erdoğan通话后发表了声明',
  '_': '特 朗 普 与 土 耳 其 总 统 雷 杰 普 · 塔 伊 普 · 埃 尔 多 安 r e c e p | t a y y i p | e r d o ğ a n 通 话 后 发 表 了 声 明 |',
  'num_samples': 127680,
  'gender': 'FEMALE'},
 {'id': 1843,
  'file_name': '14716260585206763911.wav',
  'raw_transcription': '邓迪大学（University of Dundee）的 Pamela Ferguson 教授指出：“记者如果公布嫌疑人的照片等信息的话，似乎确实存在危害性。”',
  'transcription': '邓 迪 大 学 university of dundee 的 pamela ferguson 教 授 指 出 记 者 如 果 公 布 嫌 疑 人 的 照 片 等 信 息 的 话 

## Running inference on the dataset using a medium Whisper model
The following will take a few minutes to transcribe and translate utterances in the dataset.

In [18]:
model = whisper.load_model("medium")
print(
    f"Model is {'multilingual' if model.is_multilingual else 'English-only'} "
    f"and has {sum(np.prod(p.shape) for p in model.parameters()):,} parameters."
)

100%|█████████████████████████████████████| 1.42G/1.42G [00:32<00:00, 46.5MiB/s]


Model is multilingual and has 762,321,920 parameters.


In [19]:
options = dict(language=language, beam_size=5, best_of=5)
transcribe_options = dict(task="transcribe", **options)
translate_options = dict(task="translate", **options)

In [20]:
references = []
transcriptions = []
translations = []

for audio, text in tqdm(dataset):
    transcription = model.transcribe(audio, **transcribe_options)["text"]
    translation = model.transcribe(audio, **translate_options)["text"]
    
    transcriptions.append(transcription)
    translations.append(translation)
    references.append(text)

  0%|          | 0/95 [00:00<?, ?it/s]

In [21]:
data = pd.DataFrame(dict(reference=references, transcription=transcriptions, translation=translations))
data

| | reference | transcription | translation |
|---|---|---|---|
| 0 | 特朗普与土耳其总统雷杰普·塔伊普·埃尔多安recep tayyip erdoğan通话后发表了声明 | 特朗普與土耳其總統雷傑普塔伊普埃爾多安通話後發表了聲明 | Trump and Turkish President Recep Tayyip Erdogan made a statement after the call. |
| 1 | 他 受 到 了 新 加 坡 副 总 理 黄 根 成 的 欢 迎 并 与 新 加 坡 总 理 李 显 龙 探 讨 了 贸 易 和 恐 怖 主 义 问 题 | 他受到了新加坡副总理黄根成的欢迎,并与新加坡总理李显龙探讨了贸易和恐怖主义问题。 | He was welcomed by the Deputy Prime Minister of Singapore, Huang Gencheng, and discussed trade and terrorism issues with Singapore Prime Minister Lee Hsien Loong. |
| 2 | 虽 然 有 一 种 实 验 性 疫 苗 看 似 能 够 降 低 埃 博 拉 病 毒 的 死 亡 率 但 迄 今 为 止 还 没 明 确 证 明 任 何 药 物 适 合 治 疗 现 有 的 感 染 | 虽然有一种实验性疫苗看似能够降低埃博拉病毒的死亡率,但迄今为止还没明确证明任何药物适合治疗现有的感染。 | Although there is an experimental vaccine that seems to be able to reduce the death rate of the Ebola virus, so far, it has not clearly proven that any drug is suitable for the current infection. |
| 3 | 场 景 在 金 字 塔 上 展 示 不 同 的 金 字 塔 被 点 亮 | 场景在金字塔上展示不同的金字塔被点亮 | The scene is displayed on the crystal tower, and different crystals are lit. |
| 4 | 同 理 有 了 申 根 签 证 你 就 不 必 分 别 向 每 个 申 根 成 员 国 申 请 签 证 从 而 节 省 了 时 间 金 钱 和 手 续 | 同理,有了申根签证,你就不必分别向每个申根成员国申请签证,从而节省了时间、金钱和手续。 | In the same way, you don't have to apply for a visa to each member of the deep root member state if you have a deep root visa, which saves time, money and procedures. |
| 5 | 昨 日 上 午 土 耳 其 加 齐 安 泰 普 gaziantep 的 警 察 总 部 发 生 了 一 起 汽 车 炸 弹 爆 炸 事 件 该 事 件 导 致 两 名 警 察 死 亡 20 余 人 受 伤 | 昨日上午,土耳其加奇安泰普的警察总部发生了一起汽车爆炸事件,该事件导致两名警察死亡,二十余人受伤。 | Yesterday morning, a car explosion occurred in the headquarter of the police in Turkey's Gaci Antep, which led to the death of two police officers and the injury of more than 20 people. |
| 6 | 州 长 办 公 室 表 示 伤 者 中 有 十 九 人 是 警 察 | 州长办公室表示,伤者中19人是警察 | The governor's office said that 19 of the injured were police officers. |
| 7 | 文 明 是 一 种 由 共 同 生 活 合 作 工 作 的 人 群 社 会 所 共 享 的 单 一 文 化 | 文明是一种由共同生活合作工作的人群社会所共享的单一文化 | Civilization is a single culture shared by a community society that cooperates with each other in life and work. |
| 8 | 随 着 希 腊 知 识 的 衰 落 西 方 脱 离 了 其 希 腊 哲 学 和 科 学 根 源 | 随着希腊知识的衰落,西方脱离了其希腊哲学和科学根源。 | With the fall of Greek knowledge, the West distanced itself from its Greek philosophy and scientific roots. |
| 9 | 早 年 该 节 目 仅 在 运 营 已 久 的 互 联 网 广 播 网 站 toginet radio 上 播 出 toginet radio 是 一 个 专 注 于 谈 话 广 播 的 网 站 | 早年,该节目仅在运营已久的互联网广播网站TGNet Radio上播出,TGNet Radio是一个专注于谈话广播的网站。 | In the early years, this program was only available on the TGNet Radio, an online broadcasting website that has been in operation for a long time. |
