<a href="https://colab.research.google.com/github/stcoats/LVS_content/blob/main/LVS_2024_transcription.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook uses ffmpeg to convert an audio file, then uses OpenAI's Whisper and WhisperX (a pipeline built on Whisper) to automatically transcribe the recording.

First, open the "Runtime" menu above and choose "Change runtime type". If no GPU is selected as the hardware accelerator, select one.

The cell below installs the programs and packages needed for the task. The downloads are large, so installation will take a while.




In [None]:
# Install the required packages.
# Run this cell before executing any other code.

!apt install -y ffmpeg
!pip3 install -U huggingface_hub
!pip3 install torch torchvision torchaudio yt-dlp Cython
!pip install git+https://github.com/openai/whisper.git
!pip install git+https://github.com/m-bain/whisperX.git
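
Once the installation has finished, it can be worth confirming that the runtime really has a GPU attached, since the later cells assume one. The following is a minimal sketch, assuming PyTorch is importable after the installation cell; the helper name `pick_device` is made up for this example and is not part of Whisper or WhisperX.

```python
def pick_device():
    """Return "cuda" if PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed yet: run the installation cell first.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print("Using device:", device)
```

If this prints `cpu`, revisit the runtime settings described at the top of the notebook before continuing.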


Now we can retrieve some audio to transcribe with the line below.

In [2]:
!wget https://media.talkbank.org/ca/SBCSAE/0wav/54.wav

--2024-01-23 17:55:13--  https://media.talkbank.org/ca/SBCSAE/0wav/54.wav
Resolving media.talkbank.org (media.talkbank.org)... 128.2.27.37
Connecting to media.talkbank.org (media.talkbank.org)|128.2.27.37|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 51265158 (49M) [audio/x-wav]
Saving to: ‘54.wav’


2024-01-23 17:55:16 (23.3 MB/s) - ‘54.wav’ saved [51265158/51265158]



The recording is an excerpt from the Santa Barbara Corpus of Spoken American English, made available via the [TalkBank](https://doi.org/10.21415/T5VG6X) resource.

Let's listen to it.

In [18]:
from IPython.display import Audio, display

display(Audio(filename="./54.wav", autoplay=True))

For automatic processing, we need `.wav` files. If you have files in another format, you can convert them to `.wav` with the code block below. For example, if you have a file named `my_mp3.mp3`, upload it to the `/content` directory (in the file browser to the left), then run the code below.

In [6]:
# Convert my_mp3.mp3 to a mono (-ac 1), 16 kHz (-ar 16000) .wav file named audio_16k.wav.
# The new file will then be available in your environment (to the left).

!ffmpeg -i "my_mp3.mp3" -ac 1 -ar 16000 audio_16k.wav

ffmpeg version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2000-2021 the FFmpeg developers
  built with gcc 11 (Ubuntu 11.2.0-19ubuntu1)
  configuration: --prefix=/usr --extra-version=0ubuntu0.22.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enab
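
If you have several files to convert, the same ffmpeg options can be applied in a loop from Python. This is a sketch under assumptions: the helper names `ffmpeg_to_wav_cmd` and `convert_all` and the `_16k.wav` output naming are invented for this example, and `-y` (overwrite existing output) has been added to the options used above.

```python
import subprocess
from pathlib import Path


def ffmpeg_to_wav_cmd(src, sample_rate=16000, channels=1):
    """Build the ffmpeg argument list that converts `src` to a mono 16 kHz .wav."""
    src = Path(src)
    dst = src.with_name(src.stem + "_16k.wav")
    return ["ffmpeg", "-y", "-i", str(src),
            "-ac", str(channels), "-ar", str(sample_rate), str(dst)]


def convert_all(folder=".", pattern="*.mp3"):
    """Convert every matching file in `folder`; return the output paths."""
    outputs = []
    for src in sorted(Path(folder).glob(pattern)):
        cmd = ffmpeg_to_wav_cmd(src)
        subprocess.run(cmd, check=True)  # raises CalledProcessError if ffmpeg fails
        outputs.append(cmd[-1])
    return outputs


# Show the command that would be run for a single file
print(ffmpeg_to_wav_cmd("my_mp3.mp3"))
```

Calling `convert_all("/content")` would then convert every `.mp3` file in the Colab working directory.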

Now you are ready to automatically transcribe the recording using Whisper. The line below specifies that Whisper will use its "small" model. After transcribing and inspecting the transcript, try changing the command to specify the "tiny" model and transcribe the recording again. Are there any differences?
What about with the "large-v2" model?

In [None]:
!whisper './54.wav' --model small

Now double-click on the .json, .srt, .tsv, .txt, and .vtt files that were generated in the `/content` directory. These are commonly used transcript data formats.
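
The subtitle formats differ mainly in how they write timestamps. As an illustration, the sketch below turns a segment with `start` and `end` times in seconds into an `.srt` cue. The helper names `srt_timestamp` and `to_srt` are invented for this example; Whisper has its own writers, and this is not their implementation.

```python
def srt_timestamp(seconds):
    """Format seconds as HH:MM:SS,mmm, the timestamp style used in .srt files."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments):
    """Render a list of {'start', 'end', 'text'} dicts as numbered .srt cues."""
    cues = []
    for number, seg in enumerate(segments, start=1):
        cues.append(
            f"{number}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(cues)


example = [{"start": 11.96, "end": 13.44, "text": " Children under 12 are free."}]
print(to_srt(example))
```

Comparing the printed cue with the generated `.srt` file shows how the same segment information is laid out in that format.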

In [14]:
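# Workaround from the WhisperX issue linked on the next line: stop huggingface_hub from treating this runtime as Google Colab.
from huggingface_hub.utils import _runtime   #https://github.com/m-bain/whisperX/issues/656#issuecomment-1877955404
_runtime._is_google_colab = False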

In [15]:
from whisper import load_model

# Larger models generally produce more accurate transcripts and timestamps.
# Smaller models are faster but less accurate.

model = load_model("tiny")

# By default, transcribe() uses greedy decoding (beam_size=None). You can also
# pass a beam size, for example beam_size=5. This may improve transcription
# quality but will increase runtime.

results = model.transcribe('./54.wav')

100%|██████████████████████████████████████| 72.1M/72.1M [00:00<00:00, 130MiB/s]


In [17]:
# Whisper segments the audio into chunks; print each chunk's start time, end time, and text

for x in results["segments"]:
    print(x["start"], x["end"], x["text"])

0.0 6.4  Any of you who find that you really enjoy storytelling, I would like to mention that the
6.4 11.96  last weekend in July in Spring, Roeville, and Oy, there is a festival.
11.96 13.44  Children under 12 are free.
13.44 16.92  I'll just mention that while you parents.
16.92 19.32  And it's called the Illinois Storytelling Festival.
19.32 22.32  Have any of you ever gone?
22.32 23.32  I know Lu hair.
23.32 24.32  All right.
24.32 25.32  Put your hand up.
25.32 29.96  When I found out I was telling here, I didn't realize that Dusty had attended your church
29.96 30.96  a number of times.
30.96 35.6  He's also a storyteller from the Illinois Storytelling Festival.
35.6 37.96  And we would love to have you all come.
37.96 43.42  There are four tents, a children's tent, an adult tent, a general tent, and then a tradition
43.42 48.32  tent where everyone can come and just share their memories of growing up or world war
48.32 52.28  whatever or the first time you ever wrote in a train.
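
To compare this output with a transcript of the same audio from a different model size, a word-level diff is often easier to read than two blocks of text side by side. The sketch below uses Python's standard `difflib`. The function name `word_diff` is invented for this example, and the second string is a made-up stand-in, not real model output.

```python
import difflib


def word_diff(text_a, text_b):
    """Return (operation, words_a, words_b) for each stretch where two transcripts differ."""
    a, b = text_a.split(), text_b.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (op, " ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


tiny_text = "I know Lu hair."    # taken from the tiny-model output above
other_text = "I know Lou has."   # made-up stand-in for another model's output

for op, old, new in word_diff(tiny_text, other_text):
    print(f"{op}: {old!r} -> {new!r}")
```

In practice, each full transcript can be built with `" ".join(s["text"].strip() for s in results["segments"])` after transcribing with each model.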

In [20]:
# WhisperX can be used to get individual word timestamps by using wav2vec-based forced alignment.


import whisperx

device = 'cuda'
alignment_model, metadata = whisperx.load_align_model(language_code=results["language"], device=device)
result_aligned = whisperx.align(results["segments"], alignment_model, metadata, '54.wav', device)

  torchaudio.set_audio_backend("soundfile")
  torchaudio.set_audio_backend("soundfile")
Downloading: "https://download.pytorch.org/torchaudio/models/wav2vec2_fairseq_base_ls960_asr_ls960.pth" to /root/.cache/torch/hub/checkpoints/wav2vec2_fairseq_base_ls960_asr_ls960.pth
100%|██████████| 360M/360M [00:06<00:00, 58.5MB/s]


In [21]:
result_aligned

{'segments': [{'start': 1.946,
   'end': 6.4,
   'text': ' Any of you who find that you really enjoy storytelling, I would like to mention that the',
   'words': [{'word': 'Any', 'start': 1.946, 'end': 2.107, 'score': 0.886},
    {'word': 'of', 'start': 2.147, 'end': 2.207, 'score': 0.762},
    {'word': 'you', 'start': 2.247, 'end': 2.387, 'score': 1.0},
    {'word': 'who', 'start': 2.468, 'end': 2.608, 'score': 0.93},
    {'word': 'find', 'start': 2.668, 'end': 2.909, 'score': 0.952},
    {'word': 'that', 'start': 2.949, 'end': 3.07, 'score': 0.972},
    {'word': 'you', 'start': 3.11, 'end': 3.25, 'score': 0.915},
    {'word': 'really', 'start': 3.29, 'end': 3.551, 'score': 0.796},
    {'word': 'enjoy', 'start': 3.611, 'end': 4.073, 'score': 0.828},
    {'word': 'storytelling,', 'start': 4.153, 'end': 4.895, 'score': 0.813},
    {'word': 'I', 'start': 5.056, 'end': 5.116, 'score': 0.928},
    {'word': 'would', 'start': 5.156, 'end': 5.337, 'score': 0.914},
    {'word': 'like', 'start'
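
Each aligned word carries a `score` alongside its timestamps, which can be used to flag words whose alignment may need manual checking. A minimal sketch, assuming the dictionary layout shown in the output above; the function name `low_confidence_words` and the 0.8 threshold are arbitrary choices for illustration.

```python
def low_confidence_words(aligned, threshold=0.8):
    """Collect aligned words whose alignment score falls below `threshold`."""
    flagged = []
    for segment in aligned["segments"]:
        for w in segment.get("words", []):
            # Some tokens may come back without timing or score, so check first.
            if "score" in w and w["score"] < threshold:
                flagged.append((w["word"], w.get("start"), w.get("end"), w["score"]))
    return flagged


# A few words copied from the output above
sample = {"segments": [{"words": [
    {"word": "Any", "start": 1.946, "end": 2.107, "score": 0.886},
    {"word": "of", "start": 2.147, "end": 2.207, "score": 0.762},
    {"word": "really", "start": 3.29, "end": 3.551, "score": 0.796},
]}]}

print(low_confidence_words(sample))
```

With the full result, `low_confidence_words(result_aligned)` would list every word in the recording that scores below the threshold.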

In [22]:
!pip install pydub

Collecting pydub
  Downloading pydub-0.25.1-py2.py3-none-any.whl (32 kB)
Installing collected packages: pydub
Successfully installed pydub-0.25.1


## Variation between [ðə] and [ði:]

The pronunciation of the word *the* varies according to phonological context ([ði:] is typical before vowel-initial words, [ðə] before consonant-initial words) and can also reflect emphasis.

Let's listen to each instance of *the* in this recording.

In [26]:
from IPython.display import Audio, display
from pydub import AudioSegment

audio = AudioSegment.from_file('54.wav')

for x in result_aligned["word_segments"]:
    # Match "the" regardless of capitalization or attached punctuation,
    # and skip words that have no timestamps.
    if x["word"].strip('.,;:!?"').lower() == "the" and "start" in x and "end" in x:

        # pydub slices audio by milliseconds
        start_time = int(x["start"] * 1000)
        stop_time = int(x["end"] * 1000)

        audio_segment = audio[start_time:stop_time]

        # Save the segment to a temporary file (you can adjust the file path if needed)
        segment_file_path = '/content/temp_segment.wav'
        audio_segment.export(segment_file_path, format="wav")

        # Display the audio player for the specified segment
        display(Audio(filename=segment_file_path))
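
To relate what you hear to the phonological context, it helps to list the word that follows each *the*. The sketch below checks whether that word begins with a vowel letter. This is only an orthographic approximation of a vowel-initial pronunciation (it misclassifies words such as *hour* or *university*), and the function name `the_contexts` is invented for this example.

```python
import string

VOWEL_LETTERS = set("aeiou")


def the_contexts(words):
    """For each 'the', return (start time, following word, starts with a vowel letter?)."""
    cleaned = [w["word"].strip(string.punctuation).lower() for w in words]
    rows = []
    for i, token in enumerate(cleaned[:-1]):
        if token == "the":
            following = cleaned[i + 1]
            rows.append((words[i].get("start"), following, following[:1] in VOWEL_LETTERS))
    return rows


# Small hand-made sample in the same layout as result_aligned["word_segments"]
sample_words = [
    {"word": "the", "start": 1.0},
    {"word": "Illinois", "start": 1.2},
    {"word": "The", "start": 2.0},
    {"word": "festival.", "start": 2.2},
]

for row in the_contexts(sample_words):
    print(row)
```

Running `the_contexts(result_aligned["word_segments"])` gives a list you can check against the audio clips played above.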

## Diarization

Diarization is the assignment of parts of a transcript to individual speakers. It can be important if you want to consider only the speech of particular persons.

Let's see how it can be done automatically.

In [None]:
access_token = "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"

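diarize_model = whisperx.DiarizationPipeline(use_auth_token=access_token, device=device)

# The pipeline takes a file path or an array from whisperx.load_audio, not a pydub AudioSegment
audio_array = whisperx.load_audio('54.wav')

# add min/max number of speakers if known
diarize_segments = diarize_model(audio_array)
# diarize_model(audio_array, min_speakers=min_speakers, max_speakers=max_speakers)

# attach speaker labels to the aligned transcript created above
result_diarized = whisperx.assign_word_speakers(diarize_segments, result_aligned)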
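
Conceptually, attaching speakers to a transcript is a matter of comparing time intervals: each word or segment goes to the speaker whose turns overlap it the most. The sketch below illustrates that idea in plain Python. It is not WhisperX's actual implementation, and the names `overlap` and `assign_speaker` are invented for this example.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speaker(item, turns):
    """Return the speaker whose turns overlap `item` the most, or None if none do."""
    totals = {}
    for turn in turns:
        shared = overlap(item["start"], item["end"], turn["start"], turn["end"])
        if shared > 0:
            totals[turn["speaker"]] = totals.get(turn["speaker"], 0.0) + shared
    return max(totals, key=totals.get) if totals else None


# Hand-made speaker turns, in seconds
turns = [
    {"start": 0, "end": 5, "speaker": "SPEAKER_00"},
    {"start": 5, "end": 10, "speaker": "SPEAKER_01"},
]

# This segment overlaps SPEAKER_00 for 1 s and SPEAKER_01 for 2 s
print(assign_speaker({"start": 4, "end": 7}, turns))
```

Overlap-based assignment means a word spoken during crosstalk still receives exactly one label, which is worth keeping in mind when inspecting the results.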

In [28]:
#Let's get audio from a conversation with multiple speakers

!yt-dlp -xv --audio-format wav  -o audio.wav -- https://www.youtube.com/watch?v=-guwyA8wxVQ

[debug] Command-line config: ['-xv', '--audio-format', 'wav', '-o', 'audio.wav', '--', 'https://www.youtube.com/watch?v=-guwyA8wxVQ']
[debug] Encodings: locale UTF-8, fs utf-8, pref UTF-8, out utf-8, error utf-8, screen utf-8
[debug] yt-dlp version stable@2023.12.30 from yt-dlp/yt-dlp [f10589e34] (pip)
[debug] Python 3.10.12 (CPython x86_64 64bit) - Linux-6.1.58+-x86_64-with-glibc2.35 (OpenSSL 3.0.2 15 Mar 2022, glibc 2.35)
[debug] exe versions: ffmpeg 4.4.2 (setts), ffprobe 4.4.2
[debug] Optional libraries: Cryptodome-3.20.0, brotli-1.1.0, certifi-2023.11.17, mutagen-1.47.0, requests-2.31.0, secretstorage-3.3.1, sqlite3-3.37.2, urllib3-2.0.7, websockets-12.0
[debug] Proxy map: {'colab_language_server': '/usr/colab/bin/language_service'}
[debug] Request Handlers: urllib, requests, websockets
[debug] Loaded 1798 extractors
[youtube] Extracting URL: https://www.youtube.com/watch?v=-guwyA8wxVQ
[youtube] -guwyA8wxVQ: Downloading webpage
[youtube] -guwyA8wxVQ: Downloading ios player API JSO

In [34]:
import whisperx
import gc

device = "cuda"
audio_file = "audio.wav"
batch_size = 16 # reduce if low on GPU mem
compute_type = "float16" # change to "int8" if low on GPU mem (may reduce accuracy)
access_token = "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" # you can try this token, but you will eventually need your own from Hugging Face

# To get one, create an account at huggingface.co, then generate a token at https://huggingface.co/settings/tokens
# and copy-paste that token into the string above.


# 1. Transcribe with original whisper (batched)
model = whisperx.load_model("large-v2", device, compute_type=compute_type)

audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)

# 2. Align whisper output
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)

#print(result["segments"]) # after alignment

# delete the alignment model if low on GPU resources
# import gc, torch; del model_a; gc.collect(); torch.cuda.empty_cache()



# 3. Assign speaker labels
diarize_model = whisperx.DiarizationPipeline(use_auth_token=access_token, device=device)

# add min/max number of speakers if known
diarize_segments = diarize_model(audio)
# diarize_model(audio, min_speakers=min_speakers, max_speakers=max_speakers)

result = whisperx.assign_word_speakers(diarize_segments, result)
print(diarize_segments)

INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.1.3. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.1.0+cu121. Bad things might happen unless you revert torch to 1.x.
Detected language: en (0.98) in first 30s of audio...


config.yaml:   0%|          | 0.00/469 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/5.91M [00:00<?, ?B/s]

config.yaml:   0%|          | 0.00/399 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/26.6M [00:00<?, ?B/s]

config.yaml:   0%|          | 0.00/221 [00:00<?, ?B/s]

                               segment label     speaker       start  \
0    [ 00:00:07.886 -->  00:00:13.081]     A  SPEAKER_01    7.886248   
1    [ 00:00:13.641 -->  00:00:15.628]     B  SPEAKER_01   13.641766   
2    [ 00:00:16.222 -->  00:00:19.346]     C  SPEAKER_01   16.222411   
3    [ 00:00:19.906 -->  00:00:23.709]     D  SPEAKER_01   19.906621   
4    [ 00:00:24.320 -->  00:00:27.801]     E  SPEAKER_01   24.320883   
..                                 ...   ...         ...         ...   
286  [ 00:15:59.991 -->  00:16:00.534]    KA  SPEAKER_02  959.991511   
287  [ 00:16:00.534 -->  00:16:00.670]    KB  SPEAKER_01  960.534805   
288  [ 00:16:00.670 -->  00:16:00.687]    KC  SPEAKER_02  960.670628   
289  [ 00:16:00.687 -->  00:16:00.721]    KD  SPEAKER_01  960.687606   
290  [ 00:16:00.721 -->  00:16:00.738]    KE  SPEAKER_02  960.721562   

            end  intersection       union  
0     13.081494   -941.796506  947.451752  
1     15.628183   -939.249817  941.696234  
2  

In [37]:
import pandas as pd
pd.DataFrame(result["segments"])

Unnamed: 0,start,end,text,words,speaker
0,7.935,8.956,"Okay, thank you very much.","[{'word': 'Okay,', 'start': 7.935, 'end': 8.19...",SPEAKER_01
1,9.016,12.898,It's a great honor to have Nancy Pelosi with u...,"[{'word': 'It's', 'start': 9.016, 'end': 9.116...",SPEAKER_01
2,12.998,17.661,And we've actually worked very hard on a coupl...,"[{'word': 'And', 'start': 12.998, 'end': 13.07...",SPEAKER_01
3,17.761,19.322,Criminal justice reform.,"[{'word': 'Criminal', 'start': 17.761, 'end': ...",SPEAKER_01
4,20.463,27.767,"As you know, we just heard word, got word, tha...","[{'word': 'As', 'start': 20.463, 'end': 20.583...",SPEAKER_01
...,...,...,...,...,...
316,945.211,947.032,"The last time you shut it down, it didn't work.","[{'word': 'The', 'start': 945.211, 'end': 945....",SPEAKER_01
317,947.332,949.294,I will take the mantle of shutting down.,"[{'word': 'I', 'start': 947.332, 'end': 947.41...",SPEAKER_01
318,949.374,952.436,And I'm going to shut it down for border secur...,"[{'word': 'And', 'start': 949.374, 'end': 949....",SPEAKER_01
319,952.736,953.137,Okay.,"[{'word': 'Okay.', 'start': 952.736, 'end': 95...",SPEAKER_01
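
Once each segment has a speaker label, simple summaries become possible, such as how long each speaker talks. A minimal sketch, assuming segments shaped like the rows above (`start`, `end`, `speaker`); the function name `speaking_time` and the `"UNKNOWN"` fallback label are invented for this example.

```python
from collections import defaultdict


def speaking_time(segments):
    """Sum segment durations (in seconds) per speaker label."""
    totals = defaultdict(float)
    for seg in segments:
        # Segments without a speaker label are grouped under "UNKNOWN".
        totals[seg.get("speaker", "UNKNOWN")] += seg["end"] - seg["start"]
    return dict(totals)


# Hand-made example segments
sample_segments = [
    {"start": 0.0, "end": 2.0, "speaker": "SPEAKER_01"},
    {"start": 2.0, "end": 3.5, "speaker": "SPEAKER_02"},
    {"start": 4.0, "end": 5.0, "speaker": "SPEAKER_01"},
]

print(speaking_time(sample_segments))
```

Applied to the diarized transcript, `speaking_time(result["segments"])` gives a rough picture of how the conversation is distributed across speakers.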
