# ASR with OpenAI/Whisper-large-v3

## Build an Automatic Speech Recognition Pipeline

- [Whisper (Large) on Hugging Face](https://huggingface.co/openai/whisper-large-v3 "official openai repository")
- [Hugging Face course on ASR with Whisper (Small)](https://huggingface.co/learn/audio-course/chapter5/asr_models#graduation-to-seq2seq)


#### Usage:
Whisper large-v3 is supported in Hugging Face 🤗 Transformers through the main branch of the Transformers repo. To run the model, first install the Transformers library from the GitHub repo. For this example, we'll also install 🤗 Datasets to load a toy audio dataset from the Hugging Face Hub:

In [None]:
# Check that pip is the latest version:
%pip install -U pip

# Install transformers from source along with the other requirements for this notebook
# (gradio is needed for the demo at the end)
%pip install -U git+https://github.com/huggingface/transformers.git accelerate datasets[audio] torch gradio

The model can be used with the [pipeline](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline) class to transcribe audio files of arbitrary length. Transformers uses a chunked algorithm to transcribe long-form audio files, which in practice is 9x faster than the sequential algorithm proposed by OpenAI (see Table 7 of the [Distil-Whisper paper](https://arxiv.org/abs/2311.00430)). The batch size should be set based on the specifications of your device:

In [None]:
# Import necessary libraries
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Check if a GPU is available and set the device accordingly
# If a GPU is available, use it (cuda:0); otherwise, use the CPU
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Set the data type for tensors based on the availability of a GPU
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

In [None]:
# Model ID for the pre-trained model
model_id = "openai/whisper-large-v3"

# Load the pre-trained model with specific configurations
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, 
    torch_dtype=torch_dtype,  # Use the previously set data type for tensors
    low_cpu_mem_usage=True,  # Reduce peak CPU RAM while the weights are loaded
    use_safetensors=True     # Load the weights from the safetensors format
)

# Move the model to the specified device (GPU or CPU)
model.to(device)

# Load the processor for the model
processor = AutoProcessor.from_pretrained(model_id)

In [None]:
# Set up a pipeline for automatic speech recognition
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,  # Use the loaded model
    tokenizer=processor.tokenizer,  # Use the tokenizer from the processor
    feature_extractor=processor.feature_extractor,  # Use the feature extractor from the processor
    max_new_tokens=128,
    chunk_length_s=30,  # Set the chunk length for processing
    batch_size=16,  # Set batch size
    return_timestamps=True,  # Return timestamps for the transcriptions
    torch_dtype=torch_dtype,  # Use the specified data type for tensors
    device=device  # Specify the device (GPU or CPU)
)
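
To make the chunked algorithm more concrete, here is a simplified sketch of how a long recording could be cut into overlapping windows. It is an illustration only, not the Transformers implementation: the helper `chunk_windows` and its `stride_s` parameter are hypothetical, and the 5-second stride is an assumed value chosen for readability. The overlap gives neighbouring chunks shared context, so their transcriptions can be stitched together at the boundaries.

```python
def chunk_windows(total_s, chunk_length_s=30.0, stride_s=5.0):
    """Return (start, end) windows in seconds; neighbours overlap by 2 * stride_s."""
    step = chunk_length_s - 2 * stride_s
    if step <= 0:
        raise ValueError("chunk_length_s must be larger than 2 * stride_s")

    windows = []
    start = 0.0
    while True:
        # Clip the last window to the end of the recording
        end = min(start + chunk_length_s, total_s)
        windows.append((start, end))
        if end >= total_s:
            break
        start += step
    return windows


# A 70-second recording is covered by three overlapping 30-second windows
for start, end in chunk_windows(70.0):
    print(f"{start:5.1f}s -> {end:5.1f}s")
```

Because each window is independent, the windows can be batched together, which is where the `batch_size` argument comes in.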

In [None]:
from datasets import load_dataset

# Load a dataset for validation
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")

In [None]:
sample = dataset[0]["audio"]  # Get the first sample from the dataset

In [None]:
# Run the pipeline on the sample and get the result
result = pipe(sample)

# Print the recognized text from the audio
print(result["text"])

To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline:

```python
- result = pipe(sample)
+ result = pipe("audio.mp3")
```
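
If you have a folder of recordings rather than a single file, one option is to collect the paths first and pass them to the pipeline one by one. The sketch below assumes a hypothetical local folder named `recordings`; the helper `list_audio_files` and the extension set are illustrative choices, not part of Transformers.

```python
from pathlib import Path

# Assumed set of extensions; adjust it to the formats you actually use
AUDIO_EXTENSIONS = {".mp3", ".wav", ".flac", ".m4a", ".ogg"}


def list_audio_files(folder, extensions=AUDIO_EXTENSIONS):
    """Return the sorted paths (as strings) of audio files directly inside `folder`."""
    return sorted(
        str(path)
        for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() in extensions
    )


# Usage (assumes `pipe` from above and a hypothetical "recordings" folder):
# for path in list_audio_files("recordings"):
#     print(path, pipe(path)["text"])
```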

#### Configure Source Language
Whisper predicts the language of the source audio automatically. If the source audio language is known *a priori*, it can be passed as an argument to the pipeline:

In [None]:
result = pipe(sample, generate_kwargs={"language": "english"})

#### Configure Task
By default, Whisper performs the task of speech transcription, where the source audio language is the same as the target text language. To perform speech translation, where the target text is in English, set the task to "translate":

In [None]:
result = pipe(sample, generate_kwargs={"task": "translate"})

#### Timestamps
Finally, the model can be made to predict timestamps.

For ***sentence-level timestamps***, pass the `return_timestamps` argument with `True`:

In [None]:
result = pipe(sample, return_timestamps=True)
print(result["chunks"])
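
Each entry in `result["chunks"]` is a dictionary with a `"text"` string and a `"timestamp"` pair of start and end times in seconds. As one possible use, the sketch below turns such a list into SRT subtitle text. The helpers `format_timestamp` and `chunks_to_srt` are hypothetical, the sample chunks are made up for illustration, and falling back to the start time when an end time is missing is an assumption about how you might want to handle an open-ended final segment.

```python
def format_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunks_to_srt(chunks):
    """Convert a list of {"text", "timestamp"} chunks into SRT-formatted text."""
    entries = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:
            end = start  # assumed fallback when the end time is unknown
        entries.append(
            f"{index}\n"
            f"{format_timestamp(start)} --> {format_timestamp(end)}\n"
            f"{chunk['text'].strip()}\n"
        )
    return "\n".join(entries)


# Made-up chunks in the same shape as result["chunks"]
example_chunks = [
    {"timestamp": (0.0, 2.5), "text": " Hello there."},
    {"timestamp": (2.5, 4.0), "text": " How are you?"},
]
print(chunks_to_srt(example_chunks))
```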

And for ***word-level timestamps***, pass `"word"`:

In [None]:
result = pipe(sample, return_timestamps="word")
print(result["chunks"])
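
With word-level timestamps, every chunk holds a single word, which makes it straightforward to look up when something was said. The helper `find_word` below is a hypothetical sketch run on made-up chunks; stripping trailing punctuation before comparing is an assumption about how the words are formatted.

```python
def find_word(chunks, word):
    """Return the (start, end) timestamps of every chunk whose text matches `word`."""
    target = word.strip().lower()
    return [
        chunk["timestamp"]
        for chunk in chunks
        # Ignore surrounding whitespace, case and common trailing punctuation
        if chunk["text"].strip().strip(".,!?;:").lower() == target
    ]


# Made-up chunks in the same shape as the word-level result["chunks"]
word_chunks = [
    {"text": " Hello", "timestamp": (0.0, 0.4)},
    {"text": " world,", "timestamp": (0.4, 0.9)},
    {"text": " hello.", "timestamp": (1.0, 1.3)},
]
print(find_word(word_chunks, "hello"))
```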

The above arguments can be used in ***isolation*** or in ***combination***. 

For example, to perform speech translation where the source audio is in French and we want sentence-level timestamps, the following can be used:

In [None]:
result = pipe(sample, return_timestamps=True, generate_kwargs={"language": "french", "task": "translate"})
print(result["chunks"])
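
When the language and task come from user input, it can help to assemble `generate_kwargs` in one place. The helper `build_generate_kwargs` below is a hypothetical convenience, not a Transformers function. It assumes the two task names used in this notebook, `"transcribe"` and `"translate"`, and leaves the language out when it is unknown so that Whisper detects it automatically.

```python
VALID_TASKS = {"transcribe", "translate"}


def build_generate_kwargs(language=None, task="transcribe"):
    """Build the generate_kwargs dict, omitting the language when it is not known."""
    if task not in VALID_TASKS:
        raise ValueError(f"task must be one of {sorted(VALID_TASKS)}, got {task!r}")
    kwargs = {"task": task}
    if language is not None:
        kwargs["language"] = language
    return kwargs


print(build_generate_kwargs())
print(build_generate_kwargs(language="french", task="translate"))

# Usage (assumes `pipe` and `sample` from above):
# result = pipe(sample, return_timestamps=True,
#               generate_kwargs=build_generate_kwargs("french", "translate"))
```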

### Additional Speed & Memory Improvements

#### Flash Attention
We recommend using [Flash-Attention 2](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#flashattention-2) if your GPU allows for it. To do so, you first need to install [Flash Attention](https://github.com/Dao-AILab/flash-attention):

In [None]:
%pip install flash-attn --no-build-isolation

and then pass `attn_implementation="flash_attention_2"` to *`from_pretrained`* (older Transformers releases used `use_flash_attention_2=True`, which is now deprecated):

```python
- model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, 
    low_cpu_mem_usage=True, use_safetensors=True)

+ model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, 
    low_cpu_mem_usage=True, use_safetensors=True, attn_implementation="flash_attention_2") # Use Flash Attention 2
```
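
If the notebook has to run on machines both with and without Flash Attention, one option is to choose the attention backend at runtime. The sketch below is an assumption-laden illustration: `pick_attn_implementation` is a hypothetical helper, it relies on the `attn_implementation` argument accepted by recent Transformers releases, and it falls back to `"sdpa"` (PyTorch's scaled dot-product attention) when the `flash_attn` package is not importable or no GPU is present. It does not check whether your GPU architecture is supported by Flash Attention.

```python
import importlib.util


def pick_attn_implementation(has_cuda, flash_attn_installed=None):
    """Return "flash_attention_2" when a GPU and flash-attn are available, else "sdpa"."""
    if flash_attn_installed is None:
        # Detect the package without importing it
        flash_attn_installed = importlib.util.find_spec("flash_attn") is not None
    if has_cuda and flash_attn_installed:
        return "flash_attention_2"
    return "sdpa"


# Usage (assumes `torch`, `model_id` and `torch_dtype` from above):
# model = AutoModelForSpeechSeq2Seq.from_pretrained(
#     model_id,
#     torch_dtype=torch_dtype,
#     low_cpu_mem_usage=True,
#     use_safetensors=True,
#     attn_implementation=pick_attn_implementation(torch.cuda.is_available()),
# )
```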

## Build a Demo with Gradio

We’ll define a function that takes the filepath for an audio input and passes it through the pipeline. 

Here, the pipeline automatically takes care of loading the audio file, resampling it to the correct sampling rate, and running inference with the model. We can then simply return the transcribed text as the output of the function. 

To ensure our model can handle audio inputs of arbitrary length, we’ll enable chunking:

In [None]:
def transcribe_speech(filepath):
    result = pipe(
        filepath,
        max_new_tokens=128,
        generate_kwargs={
            "task": "transcribe",
            "language": "english",
        },
        chunk_length_s=30,
        batch_size=8,
    )
    return result["text"]
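
In an interactive demo the function may be called before any audio has been recorded or uploaded, in which case the file path can be empty. The sketch below shows one way to guard against that. `make_transcriber` is a hypothetical wrapper, and the message it returns and the default language are assumptions; it takes the pipeline as an argument so that it can be exercised with a stand-in before the real model is loaded.

```python
def make_transcriber(asr_pipe, language="english"):
    """Wrap an ASR pipeline in a function that tolerates a missing file path."""

    def transcribe(filepath):
        if not filepath:
            # Nothing was recorded or uploaded, so skip inference
            return "No audio received. Please record or upload a file."
        result = asr_pipe(
            filepath,
            generate_kwargs={"task": "transcribe", "language": language},
        )
        return result["text"].strip()

    return transcribe


# Usage (assumes `pipe` from above):
# transcribe_speech = make_transcriber(pipe)
```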

We’ll use the Gradio blocks feature to launch two tabs on our demo: one for microphone transcription, and the other for file upload.

In [None]:
import gradio as gr

demo = gr.Blocks()

# Tab 1: record audio from the microphone
mic_transcribe = gr.Interface(
    fn=transcribe_speech,
    inputs=gr.Audio(sources=["microphone"], type="filepath"),
    outputs=gr.Textbox(),  # gr.outputs was removed in recent Gradio releases
)

# Tab 2: upload an existing audio file
file_transcribe = gr.Interface(
    fn=transcribe_speech,
    inputs=gr.Audio(sources=["upload"], type="filepath"),
    outputs=gr.Textbox(),
)

Launch the Gradio demo with the two interfaces shown as tabs:

In [None]:
with demo:
    gr.TabbedInterface(
        [mic_transcribe, file_transcribe],
        ["Transcribe Microphone", "Transcribe Audio File"],
    )

demo.launch(debug=True)