# Transcribe audio files as fast as possible

## Install prerequisites

In [1]:
!ffmpeg -version

ffmpeg version 6.1.1-3ubuntu5 Copyright (c) 2000-2023 the FFmpeg developers
built with gcc 13 (Ubuntu 13.2.0-23ubuntu3)
configuration: --prefix=/usr --extra-version=3ubuntu5 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --disable-omx --enable-gnutls --enable-libaom --enable-libass --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libglslang --enable-libgme --enable-libgsm --enable-libharfbuzz --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzimg --enable-

In [2]:
!uv add transformers accelerate

Resolved 275 packages in 0.83ms
Audited 178 packages in 2ms


In [3]:
from importlib.metadata import version

In [4]:
version('transformers')

'4.57.1'

In [5]:
version('accelerate')

'1.11.0'

## Convert video files to audio files

This step is optional. To extract the audio track from a video file, replace the file names below with your own.

In [None]:
!ffmpeg -y -i "2024-09-26 15-35-04.mp4" "data/2024-09-26 15-35-04.mp3"
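
To convert several videos at once, the same command can be driven from Python. The sketch below is a minimal illustration and not part of the original notebook: the helper names `build_ffmpeg_command` and `convert_all` and the `videos` directory are hypothetical, and it assumes `ffmpeg` is available on the PATH.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(video_path: Path, output_dir: Path) -> list[str]:
    """Build the same ffmpeg call as above: overwrite (-y), read the video (-i), write an .mp3."""
    audio_path = output_dir / (video_path.stem + ".mp3")
    return ["ffmpeg", "-y", "-i", str(video_path), str(audio_path)]


def convert_all(video_dir: Path, output_dir: Path) -> None:
    """Convert every .mp4 file found in video_dir (hypothetical directory layout)."""
    output_dir.mkdir(parents=True, exist_ok=True)
    for video_path in sorted(video_dir.glob("*.mp4")):
        # check=True raises CalledProcessError if ffmpeg exits with an error
        subprocess.run(build_ffmpeg_command(video_path, output_dir), check=True)


# Example (not executed here): convert_all(Path("videos"), Path("data"))
```

Passing the command as a list avoids shell quoting problems with file names that contain spaces, such as the one used above.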

## Choose a Whisper model on Hugging Face

You saw in the first notebook how to use the official Whisper model to transcribe English speech:

https://huggingface.co/openai/whisper-large-v3-turbo

If you need to transcribe audio files in another language, you can find optimized models on Hugging Face. For example, for French:

https://huggingface.co/eustlb/distil-large-v3-fr

In [1]:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "eustlb/distil-large-v3-fr"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, dtype=torch_dtype, 
    use_safetensors=True, low_cpu_mem_usage=True, device_map=device, 
    attn_implementation="sdpa"
)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=256,
    dtype=torch_dtype
)

# warmup
dummy_input = torch.randn( (1, model.config.num_mel_bins, 3000), dtype=torch_dtype, device=device)
_ = model.generate(dummy_input)

  from .autonotebook import tqdm as notebook_tqdm
Device set to use cuda:0
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


## Choose a long form transcription algorithm

See: https://huggingface.co/openai/whisper-large-v3#chunked-long-form

Whisper has a receptive field of 30 seconds. To transcribe audio longer than this, one of two long-form algorithms is required:

- Sequential: uses a "sliding window" for buffered inference, transcribing 30-second slices one after the other
- Chunked: splits long audio files into shorter ones (with a small overlap between segments), transcribes each segment independently, and stitches the resulting transcriptions at the boundaries

The sequential long-form algorithm should be used in either of the following scenarios:
- Transcription accuracy is the most important factor, and speed is less of a consideration
- You are transcribing batches of long audio files, in which case the latency of sequential is comparable to chunked, while being up to 0.5% WER more accurate

Conversely, the chunked algorithm should be used when:
- Transcription speed is the most important factor
- You are transcribing a single long audio file

By default, Transformers uses the sequential algorithm. To enable the chunked algorithm, pass the `chunk_length_s` parameter to the pipeline. To activate batching over long audio files, also pass the `batch_size` argument.
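
To make the chunked strategy concrete, the sketch below computes the overlapping windows that such an algorithm would cut from a long recording. It is an illustration only: `chunk_windows` and its `overlap_s` parameter are hypothetical names, the 5-second overlap is an assumed value, and the actual windowing and stitching logic inside Transformers may differ.

```python
def chunk_windows(duration_s: float, chunk_length_s: float = 30.0, overlap_s: float = 5.0):
    """Return (start, end) windows in seconds that cover the audio with a small overlap."""
    if overlap_s >= chunk_length_s:
        raise ValueError("overlap_s must be smaller than chunk_length_s")
    if duration_s <= 0:
        return []
    step = chunk_length_s - overlap_s  # how far each new window advances
    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_length_s, duration_s)
        windows.append((start, end))
        if end >= duration_s:
            break
        start += step
    return windows


# A 70-second clip is covered by three windows that overlap by 5 seconds
print(chunk_windows(70.0))  # [(0.0, 30.0), (25.0, 55.0), (50.0, 70.0)]
```

Because each window can be transcribed independently, the windows can be stacked into batches, which is where the speed advantage of the chunked algorithm comes from.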

## Sequential transcription with timestamps

Replace the mp3 audio file names below with your own files uploaded to the ./data directory.

In this example, the audio file is 1 hour and 8 minutes long.

In [8]:
result = pipe("./data/2024-09-26 15-35-04.mp3", return_timestamps=True)
len(result["text"]),result["text"][:200]

(61299,
 " dernière partie de notre tronc commun savoir réaliser un projet chez lui donc là jusqu'à maintenant on a vu principalement tous les éléments qui étaient nécessaires pour identifier des projets faire ")

Performance on RTX 4090 -> 1 hour 8 min transcribed in **1 min 51 sec**

In [9]:
result["chunks"][14]

{'timestamp': (51.94, 54.06), 'text': ' qui a un vrai challenge,'}
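
The timestamped chunks lend themselves to subtitle generation. The following sketch formats chunks shaped like the one above as SRT entries. The helper names `format_srt_time` and `chunks_to_srt` are hypothetical, and the code assumes timestamps are expressed in seconds and that a missing end timestamp (`None`) can fall back to the start time.

```python
def format_srt_time(seconds: float) -> str:
    """Format a duration in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunks_to_srt(chunks) -> str:
    """Turn a list of {'timestamp': (start, end), 'text': ...} dicts into SRT text."""
    entries = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:  # assumption: reuse the start time when the end is missing
            end = start
        time_range = f"{format_srt_time(start)} --> {format_srt_time(end)}"
        entries.append(f"{index}\n{time_range}\n{chunk['text'].strip()}\n")
    return "\n".join(entries)


sample = [{"timestamp": (51.94, 54.06), "text": " qui a un vrai challenge,"}]
print(chunks_to_srt(sample))
```

With a real transcription, `chunks_to_srt(result["chunks"])` would produce the full subtitle file content.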

## Chunked transcription with batch parallelization

Use batch size 16 for an 8 GB GPU, batch size 32 for a 16 GB+ GPU, batch size 128 for a datacenter GPU.
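
These rules of thumb can be encoded in a small helper. This is a sketch under stated assumptions: `suggest_batch_size` is a hypothetical name, and the 40 GB threshold used to mean "datacenter GPU" is an assumption, since the guidance above does not give a number.

```python
def suggest_batch_size(vram_gb: float, datacenter_threshold_gb: float = 40.0) -> int:
    """Map GPU memory (in GB) to the batch sizes suggested above."""
    if vram_gb >= datacenter_threshold_gb:  # assumed cut-off for datacenter GPUs
        return 128
    if vram_gb >= 16:
        return 32
    return 16


# Example (not executed here, requires a CUDA device):
# import torch
# vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
# batch_size = suggest_batch_size(vram_gb)
print(suggest_batch_size(8), suggest_batch_size(24), suggest_batch_size(80))  # 16 32 128
```

If the GPU runs out of memory during transcription, lowering the batch size is the first thing to try.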

In [3]:
result = pipe("./data/2024-09-26 15-35-04.mp3", chunk_length_s=30, batch_size=32)
len(result["text"]),result["text"][:200]



(62063,
 " Dernière partie de notre tronc commun, savoir réaliser un projet chez lui. Donc là, jusqu'à maintenant, on a vu principalement tous les éléments qui étaient nécessaires pour identifier des projets, f")

Performance on RTX 4090 -> 1 hour 8 min transcribed in **29 sec**
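
The two measurements can be compared as real-time factors (seconds of audio transcribed per second of compute). The short calculation below uses only the figures reported in this notebook; `realtime_factor` is a hypothetical helper name.

```python
audio_s = 1 * 3600 + 8 * 60   # 1 hour 8 min of audio = 4080 s
sequential_s = 1 * 60 + 51    # 1 min 51 sec = 111 s
chunked_s = 29                # 29 sec


def realtime_factor(audio_seconds: float, elapsed_seconds: float) -> float:
    """Seconds of audio processed per second of wall-clock time."""
    return audio_seconds / elapsed_seconds


print(round(realtime_factor(audio_s, sequential_s), 1))  # 36.8
print(round(realtime_factor(audio_s, chunked_s), 1))     # 140.7
print(round(sequential_s / chunked_s, 1))                # 3.8
```

On this file, chunked transcription with batching is therefore roughly 3.8 times faster than sequential transcription.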

## Code examples to transcribe a list of audio files

Batch processing

In [None]:
results = pipe(["./audio/2024-09-19 15-03-35.mp3","./audio/2024-09-19 16-32-50.mp3"], batch_size=2)
for result in results: print(result["text"])

Sequential processing

In [None]:
import os
import glob

# Specify the directory containing mp3 files
directory = '/workspace/wordslab-voice/data'

# Use glob to get all .mp3 files in the directory
mp3_files = glob.glob(os.path.join(directory, '*.mp3'))

# Loop through each mp3 file
for mp3_file in mp3_files:
    # Get the base name of the file (without directory path)
    base_name = os.path.basename(mp3_file)
    
    # Swap the .mp3 extension for a method-specific .txt suffix
    stem = os.path.splitext(base_name)[0]
    sequential_txt_file = f"{stem}_sequential.txt"
    chunked_txt_file = f"{stem}_chunked.txt"
    
    # Full path of the text file to be written
    sequential_txt_file_path = os.path.join(directory, sequential_txt_file)
    chunked_txt_file_path = os.path.join(directory, chunked_txt_file)

    # Transcribe audio with two methods
    print(f"- {base_name} (sequential) ...")
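    # Sequential long-form decoding relies on timestamp tokens for audio longer than 30 seconds
    sequential_txt = pipe(mp3_file, return_timestamps=True)["text"]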
    print("OK")
    
    print(f"- {base_name} (chunked) ...")
    chunked_txt = pipe(mp3_file, chunk_length_s=25, batch_size=32)["text"]
    print("OK")
    
    # Write one text file per method, named after the mp3 file (UTF-8 keeps accented characters portable)
    with open(sequential_txt_file_path, 'w', encoding='utf-8') as file:
        file.write(sequential_txt)
    print(f"Saved: {sequential_txt_file_path}")
    
    with open(chunked_txt_file_path, 'w', encoding='utf-8') as file:
        file.write(chunked_txt)
    print(f"Saved: {chunked_txt_file_path}")

## Reformatting the transcribed audio

In [18]:
instruction = """
The text below is the result of an automatic transcription of the voice of a presenter at a conference on artificial intelligence.
This transcription is imperfect: errors, incomplete words, missing punctuation, hesitations, interruptions...
Your task is to **strictly repeat** the text provided below, but correcting its syntax and formatting:
- Rewording into equivalent sentences that are well-constructed and free of spelling errors.
- Adding line breaks and paragraphs whenever the presenter changes subject.
- Generating chapter titles and subtitles in Markdown format.

Here's the text to be formatted:


"""

In [19]:
transcribed_text = result["text"]

In [20]:
import ollama

Replace the model below with one that fits in your GPU's VRAM:

In [26]:
formatted_text = ollama.generate(model='hf.co/unsloth/Mistral-Small-3.2-24B-Instruct-2506-GGUF:UD-Q4_K_XL', prompt=f"{instruction} {transcribed_text}")
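
If a transcript turns out to be too long for the chosen model's context window, one option is to reformat it piece by piece. The sketch below is not part of the original notebook: `split_transcript` is a hypothetical helper, and the 8000-character limit is an arbitrary assumption to adjust to the model in use.

```python
def split_transcript(text: str, max_chars: int = 8000) -> list[str]:
    """Split text into pieces of at most max_chars, cutting at sentence ends when possible."""
    pieces = []
    remaining = text.strip()
    while len(remaining) > max_chars:
        cut = remaining.rfind(". ", 0, max_chars)  # prefer the last sentence end
        if cut == -1:
            cut = remaining.rfind(" ", 0, max_chars)  # else the last word boundary
        if cut == -1:
            cut = max_chars - 1  # else a hard cut
        pieces.append(remaining[:cut + 1].strip())
        remaining = remaining[cut + 1:].strip()
    if remaining:
        pieces.append(remaining)
    return pieces


print(split_transcript("Aa bb. Cc dd. Ee ff.", max_chars=10))  # ['Aa bb.', 'Cc dd.', 'Ee ff.']

# Example (not executed here, requires a running Ollama server and a model name of your choice):
# formatted_parts = [
#     ollama.generate(model=model_name, prompt=f"{instruction} {piece}").response
#     for piece in split_transcript(transcribed_text)
# ]
```

Note that formatting pieces independently may yield inconsistent chapter titles across pieces, so the combined result may need a final review.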

In [28]:
from IPython.display import Markdown, display

display(Markdown(formatted_text.response[500:1500]))

sser" des projets : laisser les métiers s'approprier les idées (ex : direction commerciale exige désormais une étude IA avant tout projet).
  - Ateliers avec les métiers : animer sans imposer (ex : atelier ACM sans idées concrètes).

- **Veille technologique** :
  - **Intégrée au projet** : Pas une activité séparée, mais une partie du temps de travail quotidien.
  - **Exemples concrets** :
    - Connaître les versions récentes de langages (ex : C# 14) pour éviter de réinventer des fonctionnalités existantes.
    - Benchmarker les modèles IA disponibles (ex : qualité de rédaction pour un assistant client).
  - **Rentabilité** : 1 jour de veille peut économiser 3-4 jours de développement.

- **Compétences et formation** :
  - Planifier des lignes de veille dans les projets (ex : 7 heures sur 2-3 semaines).
  - Former l'équipe sur des outils spécifiques (ex : Camunda, prompting pour LLM).
  - Partager les retours d'expérience (ex : document type pour évaluer les besoins en compétences).

