<a href="https://colab.research.google.com/github/qqeip/ChatGPT-Next-Web/blob/main/notebooks/LibriSpeech.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Installing Whisper

The commands below will install the Python packages needed to use Whisper models and evaluate the transcription results.

In [3]:
! pip install git+https://github.com/openai/whisper.git
! pip install jiwer

Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /tmp/pip-req-build-j0dvfgqn
  Running command git clone --filter=blob:none --quiet https://github.com/openai/whisper.git /tmp/pip-req-build-j0dvfgqn
  Resolved https://github.com/openai/whisper.git to commit 517a43ecd132a2089d85f4ebc044728a71d49f6e
  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
Collecting tiktoken (from openai-whisper==20240930)
  Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch->openai-whisper==20240930)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch->openai-whisper==20240930)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-

# Loading the LibriSpeech dataset

The following will load the test-clean split of the LibriSpeech corpus using torchaudio.

In [2]:
import os
import numpy as np

try:
    import tensorflow  # required in Colab to avoid protobuf compatibility issues
except ImportError:
    pass

import torch
import pandas as pd
import whisper
import torchaudio

from tqdm.notebook import tqdm


DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

In [3]:
class LibriSpeech(torch.utils.data.Dataset):
    """
    A simple class to wrap LibriSpeech and trim/pad the audio to 30 seconds.
    It will drop the last few seconds of a very small portion of the utterances.
    """
    def __init__(self, split="test-clean", device=DEVICE):
        self.dataset = torchaudio.datasets.LIBRISPEECH(
            root=os.path.expanduser("~/.cache"),
            url=split,
            download=True,
        )
        self.device = device

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, item):
        audio, sample_rate, text, _, _, _ = self.dataset[item]
        assert sample_rate == 16000
        audio = whisper.pad_or_trim(audio.flatten()).to(self.device)
        mel = whisper.log_mel_spectrogram(audio)

        return (mel, text)

In [4]:
dataset = LibriSpeech("test-clean")
loader = torch.utils.data.DataLoader(dataset, batch_size=16)

100%|██████████| 331M/331M [00:05<00:00, 58.7MB/s]


# Running inference on the dataset using a base Whisper model

The following will take a few minutes to transcribe all utterances in the dataset.

In [5]:
model = whisper.load_model("base.en")
print(
    f"Model is {'multilingual' if model.is_multilingual else 'English-only'} "
    f"and has {sum(np.prod(p.shape) for p in model.parameters()):,} parameters."
)

100%|███████████████████████████████████████| 139M/139M [00:20<00:00, 7.12MiB/s]


Model is English-only and has 71,825,408 parameters.


In [6]:
# predict without timestamps for short-form transcription
options = whisper.DecodingOptions(language="en", without_timestamps=True)

In [7]:
hypotheses = []
references = []

for mels, texts in tqdm(loader):
    results = model.decode(mels, options)
    hypotheses.extend([result.text for result in results])
    references.extend(texts)

  0%|          | 0/164 [00:00<?, ?it/s]

In [8]:
data = pd.DataFrame(dict(hypothesis=hypotheses, reference=references))
data

Unnamed: 0,hypothesis,reference
0,"He hoped there would be stew for dinner, turni...",HE HOPED THERE WOULD BE STEW FOR DINNER TURNIP...
1,"Stuffered into you, his belly counseled him.",STUFF IT INTO YOU HIS BELLY COUNSELLED HIM
2,After early nightfall the yellow lamps would l...,AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD L...
3,"Hello Bertie, any good in your mind?",HELLO BERTIE ANY GOOD IN YOUR MIND
4,Number 10. Fresh Nelly is waiting on you. Good...,NUMBER TEN FRESH NELLY IS WAITING ON YOU GOOD ...
...,...,...
2615,"Oh, to shoot my soul's full meaning into futur...",OH TO SHOOT MY SOUL'S FULL MEANING INTO FUTURE...
2616,"Then I, long tried by natural ills, received t...",THEN I LONG TRIED BY NATURAL ILLS RECEIVED THE...
2617,I love thee freely as men strive for right. I ...,I LOVE THEE FREELY AS MEN STRIVE FOR RIGHT I L...
2618,"I love thee with the passion put to use, in my...",I LOVE THEE WITH THE PASSION PUT TO USE IN MY ...


# Calculating the word error rate

Now, we use our English normalizer implementation to standardize the transcription and calculate the WER.

In [9]:
import jiwer
from whisper.normalizers import EnglishTextNormalizer

normalizer = EnglishTextNormalizer()

In [10]:
data["hypothesis_clean"] = [normalizer(text) for text in data["hypothesis"]]
data["reference_clean"] = [normalizer(text) for text in data["reference"]]
data

Unnamed: 0,hypothesis,reference,hypothesis_clean,reference_clean
0,"He hoped there would be stew for dinner, turni...",HE HOPED THERE WOULD BE STEW FOR DINNER TURNIP...,he hoped there would be stew for dinner turnip...,he hoped there would be stew for dinner turnip...
1,"Stuffered into you, his belly counseled him.",STUFF IT INTO YOU HIS BELLY COUNSELLED HIM,stuffered into you his belly counseled him,stuff it into you his belly counseled him
2,After early nightfall the yellow lamps would l...,AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD L...,after early nightfall the yellow lamps would l...,after early nightfall the yellow lamps would l...
3,"Hello Bertie, any good in your mind?",HELLO BERTIE ANY GOOD IN YOUR MIND,hello bertie any good in your mind,hello bertie any good in your mind
4,Number 10. Fresh Nelly is waiting on you. Good...,NUMBER TEN FRESH NELLY IS WAITING ON YOU GOOD ...,number 10 fresh nelly is waiting on you good n...,number 10 fresh nelly is waiting on you good n...
...,...,...,...,...
2615,"Oh, to shoot my soul's full meaning into futur...",OH TO SHOOT MY SOUL'S FULL MEANING INTO FUTURE...,0 to shoot my soul is full meaning into future...,0 to shoot my soul is full meaning into future...
2616,"Then I, long tried by natural ills, received t...",THEN I LONG TRIED BY NATURAL ILLS RECEIVED THE...,then i long tried by natural ills received the...,then i long tried by natural ills received the...
2617,I love thee freely as men strive for right. I ...,I LOVE THEE FREELY AS MEN STRIVE FOR RIGHT I L...,i love thee freely as men strive for right i l...,i love thee freely as men strive for right i l...
2618,"I love thee with the passion put to use, in my...",I LOVE THEE WITH THE PASSION PUT TO USE IN MY ...,i love thee with the passion put to use in my ...,i love thee with the passion put to use in my ...


In [11]:
wer = jiwer.wer(list(data["reference_clean"]), list(data["hypothesis_clean"]))

print(f"WER: {wer * 100:.2f} %")

WER: 4.27 %


In [4]:
import whisper

model = whisper.load_model("turbo")
result = model.transcribe("./test1.mp3") #lcy.wav")
print(result["text"])

100%|█████████████████████████████████████| 1.51G/1.51G [01:01<00:00, 26.3MiB/s]


你怎么可能不逆呢现在好了 开始录了放在了这个是三匹这个是个大三匹我 nost投进去就 оно有 regul样的是 refused Cut去下哥是个大傻逼下哥是个大傻逼下哥是个大傻逼这教员 congestion抓 emoccciones恋好小忙小忙重点中端城区快飞海《 chain- construction》《 chain- construction》《 chain- construction》《 chain- construction》《 chain- construction》《 chain- construction》《 chain- construction》 מח voor eenhouden这个是个大撒逼 这个是个大撒逼这个是个大撒逼杂未来累不跟你ان我ении 10 vid请不吝点赞 订阅 转发 打赏支持明镜与点点栏目明镜与点点栏目明镜与点点栏目明镜与点点栏目


In [21]:
import whisper

model = whisper.load_model("turbo")
result2 = model.transcribe("./lcy.wav") #lcy.wav")
print(result2["text"])

游戏里的你再强大也是假的,不是真的


In [22]:
print(result2)

{'text': '游戏里的你再强大也是假的,不是真的', 'segments': [{'id': 0, 'seek': 0, 'start': 0.0, 'end': 4.0, 'text': '游戏里的你再强大也是假的,不是真的', 'tokens': [50365, 9592, 116, 1486, 237, 15759, 1546, 2166, 8623, 5702, 118, 3582, 22021, 31706, 1546, 11, 7296, 8034, 50565], 'temperature': 0.0, 'avg_logprob': -0.31722869873046877, 'compression_ratio': 0.8596491228070176, 'no_speech_prob': 1.0468501066007718e-11}], 'language': 'zh'}


In [6]:
!pip install pydub

Collecting pydub
  Downloading pydub-0.25.1-py2.py3-none-any.whl.metadata (1.4 kB)
Downloading pydub-0.25.1-py2.py3-none-any.whl (32 kB)
Installing collected packages: pydub
Successfully installed pydub-0.25.1


In [7]:
def format_time(seconds, precision=2):
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    seconds = int(seconds % 60)
    milliseconds = int((seconds % 1) * 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{milliseconds:0{precision}d}"

for segment in result['segments']:
    start_formatted = format_time(segment['start'], precision=3)
    end_formatted = format_time(segment['end'], precision=3)
    print(f"开始时间: {start_formatted}")
    print(f"结束时间: {end_formatted}")
    print(f"文本: {segment['text']}")
    print("------------------------")

开始时间: 00:00:00.000
结束时间: 00:00:02.000
文本: 你怎么可能不逆呢
------------------------
开始时间: 00:00:02.000
结束时间: 00:00:04.000
文本: 现在好了 开始录了
------------------------
开始时间: 00:00:04.000
结束时间: 00:00:05.000
文本: 放在了
------------------------
开始时间: 00:00:30.000
结束时间: 00:00:54.000
文本: 这个是三匹
------------------------
开始时间: 00:00:54.000
结束时间: 00:00:56.000
文本: 这个是个大三匹
------------------------
开始时间: 00:00:56.000
结束时间: 00:01:07.000
文本: 我 nost投进去
------------------------
开始时间: 00:01:07.000
结束时间: 00:01:08.000
文本: 就 оно有 regul样的
------------------------
开始时间: 00:01:08.000
结束时间: 00:01:12.000
文本: 是
------------------------
开始时间: 00:01:12.000
结束时间: 00:01:21.000
文本:  refused
------------------------
开始时间: 00:01:21.000
结束时间: 00:01:22.000
文本:  Cut
------------------------
开始时间: 00:01:22.000
结束时间: 00:01:23.000
文本: 去
------------------------
开始时间: 00:01:31.000
结束时间: 00:01:33.000
文本: 下哥是个大傻逼
------------------------
开始时间: 00:01:34.000
结束时间: 00:01:36.000
文本: 下哥是个大傻逼
------------------------
开始时间: 00:01:39.000
结束时间: 00:01:40

In [17]:
from pydub import AudioSegment

def merged_segment(result):

    # 合并间隔2秒内的片段
    merged_segments = []
    current_segment = None

    for segment in result['segments']:
        if current_segment is None:
            current_segment = {
                'start': segment['start'],
                'end': segment['end'],
                'text': segment['text']
            }
        else:
            gap = segment['start'] - current_segment['end']
            if gap <= 2:
                current_segment['end'] = segment['end']
                current_segment['text'] += ' ' + segment['text']
            else:
                merged_segments.append(current_segment)
                current_segment = {
                    'start': segment['start'],
                    'end': segment['end'],
                    'text': segment['text']
                }
    if current_segment is not None:
        merged_segments.append(current_segment)

    # 提取并录制音频文件
    input_audio_path = './test1.mp3'
    output_folder = './merged_audio'

    # 加载原始音频文件
    input_audio = AudioSegment.from_file(input_audio_path)

    # 初始化一个空的AudioSegment对象，用于存储合并后的音频
    output_audio = AudioSegment.silent()

    # 计算需要保留的时间范围
    current_end = 0.0
    for segment in merged_segments:
        start = segment['start']
        end = segment['end']

        # 提取当前段之前的保留部分
        if start > current_end:
            start_time = current_end * 1000  # 转换为毫秒
            end_time = start * 1000
            retained_chunk = input_audio[start_time:end_time]
            output_audio += retained_chunk

        current_end = end

    # 处理最后一部分保留的时间段
    if current_end < input_audio.duration_seconds:
        start_time = current_end * 1000
        end_time = input_audio.duration_seconds * 1000
        retained_chunk = input_audio[start_time:end_time]
        output_audio += retained_chunk

    # 导出合并后的音频文件
    output_audio.export("retained_audio.mp3", format="mp3")

    audio = AudioSegment.from_file(input_audio_path)

    for i, segment in enumerate(merged_segments):
        start_time = segment['start']
        end_time = segment['end']
        output_path = f"{output_folder}/audio_{i+1}.mp3"
        print(output_path)
        print(f"开始时间: {segment['start']}")
        print(f"结束时间: {segment['end']}")
        print(f"文本: {segment['text']}")
        print("------------------------")

        segment_audio = audio[start_time*1000:end_time*1000]
        segment_audio.export(output_path, format='mp3')

merged_segment(result)

./merged_audio//audio_1.mp3
开始时间: 0.0
结束时间: 5.0600000000000005
文本: 你怎么可能不逆呢 现在好了 开始录了 放在了
------------------------
./merged_audio//audio_2.mp3
开始时间: 30.0
结束时间: 83.92
文本: 这个是三匹 这个是个大三匹 我 nost投进去 就 оно有 regul样的 是  refused  Cut 去
------------------------
./merged_audio//audio_3.mp3
开始时间: 91.62
结束时间: 96.42
文本: 下哥是个大傻逼 下哥是个大傻逼
------------------------
./merged_audio//audio_4.mp3
开始时间: 99.16
结束时间: 100.68
文本: 下哥是个大傻逼
------------------------
./merged_audio//audio_5.mp3
开始时间: 112.92
结束时间: 116.86
文本: 这教员 congestion
------------------------
./merged_audio//audio_6.mp3
开始时间: 118.96000000000001
结束时间: 125.16
文本: 抓 emoccciones 恋好 小忙
------------------------
./merged_audio//audio_7.mp3
开始时间: 127.3
结束时间: 140.22
文本: 小忙 重点 中端 城区 快飞 海
------------------------
./merged_audio//audio_8.mp3
开始时间: 142.92
结束时间: 209.16
文本: 《 chain- construction》 《 chain- construction》 《 chain- construction》 《 chain- construction》 《 chain- construction》 《 chain- construction》 《 chain- construction》  מח voor eenhouden 这个是个大撒逼 这个是

In [19]:
!cp ../lcy.wav lcy.wav