<a href="https://colab.research.google.com/github/stcoats/LVS_content/blob/main/LVS_2024_transcription.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook uses ffmpeg to convert an audio file, then uses OpenAI's Whisper and WhisperX (a pipeline built on Whisper) to automatically transcribe the recording.

First, open the "Runtime" menu above and choose "Change runtime type". If no GPU is selected as the hardware accelerator, select one.

The cell below installs the programs and packages needed for the task. These are very large, so it will take a while.




In [None]:
# Install the required packages.
# Run this cell before executing any other code.

!apt install -y ffmpeg
!pip3 install -U huggingface_hub
!pip3 install torch torchvision torchaudio yt-dlp Cython
!pip install git+https://github.com/openai/whisper.git
!pip install git+https://github.com/m-bain/whisperX.git


Now we can retrieve some audio to transcribe with the line below.

In [11]:
!wget https://media.talkbank.org/ca/SBCSAE/0wav/54.wav

--2024-01-23 14:57:53--  https://media.talkbank.org/ca/SBCSAE/0wav/54.wav
Resolving media.talkbank.org (media.talkbank.org)... 128.2.27.37
Connecting to media.talkbank.org (media.talkbank.org)|128.2.27.37|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 51265158 (49M) [audio/x-wav]
Saving to: ‘54.wav’


2024-01-23 14:57:54 (49.4 MB/s) - ‘54.wav’ saved [51265158/51265158]



Your recording is an excerpt from the Santa Barbara Corpus of Spoken American English, made available via the [TalkBank](https://doi.org/10.21415/T5VG6X) resource.

Let's listen to it.

In [12]:
from IPython.display import Audio

Audio(filename="/content/54.wav", autoplay=True)

For automatic processing, we need `.wav` files. If your files are in another format, you can convert them to `.wav` with the code block below. If you have a file named (for example) `my_mp3.mp3`, upload it to the `/content` directory in the file browser to the left, then run the code below.

In [None]:
# The file my_mp3.mp3 will be converted to a mono, 16 kHz `.wav` file named `audio_16k.wav`.
# Afterwards it is available in your environment (file browser to the left).

!ffmpeg -i "my_mp3.mp3" -ac 1 -ar 16000 audio_16k.wav
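
If you have several files to convert, the same ffmpeg call can be scripted from Python. The sketch below is one possible way to do that, assuming ffmpeg is installed and on the PATH. The helper names `ffmpeg_command` and `convert_all` and the `_16k` file suffix are illustrative choices, not part of ffmpeg or Whisper.

```python
import subprocess
from pathlib import Path


def ffmpeg_command(src, dst):
    """Build the ffmpeg call used above: mono (-ac 1), 16 kHz (-ar 16000).

    The -y flag overwrites an existing output file without asking.
    """
    return ["ffmpeg", "-y", "-i", str(src), "-ac", "1", "-ar", "16000", str(dst)]


def convert_all(folder, pattern="*.mp3"):
    """Convert every file in `folder` that matches `pattern` to a 16 kHz mono .wav."""
    converted = []
    for src in sorted(Path(folder).glob(pattern)):
        dst = src.with_name(src.stem + "_16k.wav")  # hypothetical naming scheme
        subprocess.run(ffmpeg_command(src, dst), check=True)
        converted.append(dst)
    return converted
```

Calling `convert_all("/content")` would convert every `.mp3` file in the Colab working directory and return the paths of the new `.wav` files.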

Now you are ready to automatically transcribe the recording using Whisper. The command below tells Whisper to use its "medium" model. After transcribing and inspecting the transcript, change `medium` to `tiny` in the command and transcribe the recording again. Are there any differences?
What about with the "large-v2" model?

In [None]:
!whisper './54.wav' --model medium
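
After running the command with two different models, the transcripts can also be compared word by word rather than by eye. The sketch below uses Python's standard `difflib` module for this. The function name `word_diff` and the example sentences are made up for illustration. Note that each run writes its output under the same file name, so you would need to rename the first transcript or pass a different `--output_dir` before running the second model.

```python
import difflib


def word_diff(reference, hypothesis):
    """Return a similarity ratio and the word spans that differ between two transcripts."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    matcher = difflib.SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # tag is "replace", "delete", or "insert"
            changes.append((tag, " ".join(ref_words[i1:i2]), " ".join(hyp_words[j1:j2])))
    return matcher.ratio(), changes


# Made-up example strings; in practice, read the two .txt transcripts instead.
ratio, changes = word_diff("we went over the hill", "we went over a hill")
print(round(ratio, 2), changes)  # 0.8 [('replace', 'the', 'a')]
```

A ratio of 1.0 means the two word sequences are identical. The list of changes shows where the models disagree, which is usually more informative than the single number.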

Now double-click on the `.json`, `.srt`, `.tsv`, `.txt`, and `.vtt` files that were generated in the `/content` directory. These are commonly used transcript data formats.
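
The `.json` file is the most convenient of these for further processing in Python. The sketch below shows one way to pull the timed segments out of it, assuming the file has a top-level `segments` list whose entries carry `start`, `end`, and `text` keys, as the Whisper command-line tool writes them. The function name `segments_from_json` is illustrative.

```python
import json


def segments_from_json(json_text):
    """Parse the text of a Whisper .json transcript into (start, end, text) tuples."""
    data = json.loads(json_text)
    return [(seg["start"], seg["end"], seg["text"].strip()) for seg in data["segments"]]


# Made-up miniature transcript in the assumed layout
example = '{"text": " hi", "language": "en", "segments": [{"start": 0.0, "end": 1.5, "text": " hi"}]}'
print(segments_from_json(example))  # [(0.0, 1.5, 'hi')]
```

For the recording used here, the same function could be applied to the contents of the generated file, for example `segments_from_json(open("54.json", encoding="utf-8").read())`.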

In [14]:
from huggingface_hub.utils import _runtime   #https://github.com/m-bain/whisperX/issues/656#issuecomment-1877955404
_runtime._is_google_colab = False

In [18]:
from whisper import load_model
from huggingface_hub.utils import _runtime   #https://github.com/m-bain/whisperX/issues/656#issuecomment-1877955404
_runtime._is_google_colab = False

# Larger models generally give more accurate transcripts and, in turn,
# better alignment between words and timestamps.
# Smaller models are faster but less accurate.

model = load_model("tiny")

# beam_size is None by default (greedy decoding). You can also pass a beam
# size, e.g. model.transcribe('./54.wav', beam_size=5). This may improve
# transcription quality but will increase runtime.

results = model.transcribe('./54.wav')

In [None]:
# Whisper segments the audio into chunks; print the start time, end time, and text of each

for x in results["segments"]:
  print(x["start"],x["end"],x["text"])
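
The `start` and `end` values printed above are in seconds. Subtitle formats such as `.srt` write them as `HH:MM:SS,mmm` timestamps instead. The sketch below shows how such a conversion can be done by hand. The function names and the example segment are illustrative, and Whisper's own `.srt` writer may differ in detail.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as the HH:MM:SS,mmm timestamp used in .srt files."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments):
    """Render segment dicts with start, end, and text keys as numbered SRT cues."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(cues)


# Made-up segment in the same shape as the entries of results["segments"]
print(to_srt([{"start": 0, "end": 1.5, "text": " hi"}]))
```

With the transcription from the cell above, `to_srt(results["segments"])` would produce the cue text for the whole recording.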

In [20]:
# WhisperX can be used to get individual word timestamps by using wav2vec-based forced alignment.


import whisperx

device = 'cuda'
alignment_model, metadata = whisperx.load_align_model(language_code=results["language"], device=device)
result_aligned = whisperx.align(results["segments"], alignment_model, metadata, '54.wav', device)

  torchaudio.set_audio_backend("soundfile")
  torchaudio.set_audio_backend("soundfile")
Downloading: "https://download.pytorch.org/torchaudio/models/wav2vec2_fairseq_base_ls960_asr_ls960.pth" to /root/.cache/torch/hub/checkpoints/wav2vec2_fairseq_base_ls960_asr_ls960.pth
100%|██████████| 360M/360M [00:06<00:00, 61.5MB/s]


In [None]:
result_aligned
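
The aligned result contains a `word_segments` list with one dictionary per word. According to the WhisperX README, words containing characters that the alignment model does not know (digits, for instance) may come back without timings, so it is safer to check for the `start` and `end` keys before using them. The sketch below illustrates this with made-up entries. The helper names `timed_words` and `find_word` are not part of WhisperX.

```python
def timed_words(word_segments):
    """Keep only the words that received both a start and an end time."""
    return [w for w in word_segments if "start" in w and "end" in w]


def find_word(word_segments, target):
    """Return (start, end) pairs in seconds for each timed occurrence of `target`,
    ignoring case and surrounding punctuation."""
    hits = []
    for w in timed_words(word_segments):
        cleaned = w["word"].strip().strip(".,!?;:\"'").lower()
        if cleaned == target.lower():
            hits.append((w["start"], w["end"]))
    return hits


# Made-up entries in the same shape as result_aligned["word_segments"]
sample = [
    {"word": "Over", "start": 1.2, "end": 1.5, "score": 0.9},
    {"word": "2014."},  # a word without timing information
    {"word": "over,", "start": 3.0, "end": 3.3, "score": 0.8},
]
print(find_word(sample, "over"))  # [(1.2, 1.5), (3.0, 3.3)]
```

Matching on a cleaned, lower-cased form catches occurrences that an exact string comparison would miss, such as a sentence-initial "Over" or an "over," followed by a comma.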

In [51]:
!pip install pydub

Collecting pydub
  Downloading pydub-0.25.1-py2.py3-none-any.whl (32 kB)
Installing collected packages: pydub
Successfully installed pydub-0.25.1


In [54]:
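from IPython.display import Audio, display
from pydub import AudioSegment

audio = AudioSegment.from_file('54.wav')

# Play every occurrence of the word "over"
for x in result_aligned["word_segments"]:
  # Words that could not be aligned have no "start"/"end" keys, so skip them
  if x["word"] == "over" and "start" in x and "end" in x:

    # pydub slices audio in milliseconds
    start_time = int(x["start"] * 1000)
    stop_time = int(x["end"] * 1000)

    audio_segment = audio[start_time:stop_time]

    # Save the segment to a temporary file (you can adjust the file path if needed)
    segment_file_path = '/content/temp_segment.wav'
    audio_segment.export(segment_file_path, format="wav")

    # Display the audio player for the segment. Inside a loop, the player
    # only appears if it is passed to display() explicitly.
    display(Audio(filename=segment_file_path, autoplay=True))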

In [None]:
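# Speaker diarization downloads models from Hugging Face and needs an access token.
# Paste your own token here and keep it out of shared notebooks.
access_token = "YOUR_HF_TOKEN"  # placeholder, replace before running
diarize_model = whisperx.DiarizationPipeline(use_auth_token=access_token, device=device)

# The pipeline expects audio loaded with whisperx.load_audio, not a pydub object
audio_array = whisperx.load_audio('54.wav')

# add min/max number of speakers if known
diarize_segments = diarize_model(audio_array)
# diarize_model(audio_array, min_speakers=min_speakers, max_speakers=max_speakers)

result = whisperx.assign_word_speakers(diarize_segments, result_aligned)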

In [55]:
Audio(filename=segment_file_path, autoplay=True)

In [None]:
#Let's get audio from a conversation with multiple speakers

# The extracted audio is saved as audio.wav
!yt-dlp -xv --audio-format wav -o "audio.%(ext)s" "https://www.youtube.com/watch?v=-guwyA8wxVQ"

In [None]:
import whisperx
import gc

device = "cuda"
audio_file = "audio.wav"  # the file downloaded in the previous cell
batch_size = 16 # reduce if low on GPU mem
compute_type = "float16" # change to "int8" if low on GPU mem (may reduce accuracy)

# 1. Transcribe with original whisper (batched)
model = whisperx.load_model("large-v2", device, compute_type=compute_type)

audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)

# 2. Align whisper output
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)

print(result["segments"]) # after alignment

# delete model if low on GPU resources
# import gc, torch; del model_a; gc.collect(); torch.cuda.empty_cache()

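# Hugging Face access token, needed to download the diarization models.
# Use your own token and keep it out of shared notebooks.
access_token = "YOUR_HF_TOKEN"  # placeholder, replace before running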

# 3. Assign speaker labels
diarize_model = whisperx.DiarizationPipeline(use_auth_token=access_token, device=device)

# add min/max number of speakers if known
diarize_segments = diarize_model(audio)
# diarize_model(audio, min_speakers=min_speakers, max_speakers=max_speakers)

result = whisperx.assign_word_speakers(diarize_segments, result)
print(diarize_segments)
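
Once speaker labels have been assigned, a conversation is often easier to read as a sequence of speaker turns than as individual segments. The sketch below merges consecutive segments from the same speaker, assuming each segment dictionary may carry a `speaker` key such as `SPEAKER_00`. The function name `speaker_turns`, the `UNKNOWN` fallback label, and the example segments are made up for illustration.

```python
def speaker_turns(segments, unknown="UNKNOWN"):
    """Merge consecutive segments spoken by the same speaker into single turns."""
    turns = []
    for seg in segments:
        speaker = seg.get("speaker", unknown)  # some segments may have no label
        text = seg["text"].strip()
        if turns and turns[-1]["speaker"] == speaker:
            # Same speaker as the previous segment: extend the current turn
            turns[-1]["end"] = seg["end"]
            turns[-1]["text"] += " " + text
        else:
            turns.append(
                {"speaker": speaker, "start": seg["start"], "end": seg["end"], "text": text}
            )
    return turns


# Made-up segments in the assumed shape of result["segments"]
example_segments = [
    {"start": 0.0, "end": 1.0, "text": " Hello.", "speaker": "SPEAKER_00"},
    {"start": 1.0, "end": 2.0, "text": " How are you?", "speaker": "SPEAKER_00"},
    {"start": 2.2, "end": 3.0, "text": " Fine.", "speaker": "SPEAKER_01"},
    {"start": 3.1, "end": 3.5, "text": " Mm."},
]
for turn in speaker_turns(example_segments):
    print(f'{turn["speaker"]} [{turn["start"]}-{turn["end"]}]: {turn["text"]}')
```

Applied to the diarized output, `speaker_turns(result["segments"])` would give a transcript organized by who is speaking.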

In [40]:
# Store the word-to-timestamp mapping in a file, one JSON object per line.
import json

with open('./word_ts.text', 'w') as f:
    for line in result_aligned['word_segments']:
        line_temp = line.copy()
        # WhisperX should not leave whitespace around a word, but strip it just in case.
        line_temp['word'] = line_temp['word'].strip()
        f.write(f'{json.dumps(line_temp)}\n')
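
A file written this way holds one JSON object per line, so it can be read back line by line. The sketch below shows one way to do that. The function name `parse_word_lines` and the example lines are illustrative.

```python
import json


def parse_word_lines(lines):
    """Parse an iterable of JSON lines into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


# Made-up lines in the same layout as the file written above
example_lines = [
    '{"word": "over", "start": 1.2, "end": 1.5, "score": 0.9}\n',
    "\n",
    '{"word": "there", "start": 1.5, "end": 1.8, "score": 0.7}\n',
]
words = parse_word_lines(example_lines)
print([w["word"] for w in words])  # ['over', 'there']
```

Because an open file is itself an iterable of lines, the saved file can be passed in directly: `with open("./word_ts.text") as f: words = parse_word_lines(f)`.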

In [None]:
!pip install --upgrade hydra-core llvmlite omegaconf --ignore-installed


In [None]:
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@main