## Real-Time Speech Recognition with Distil-Whisper

In this notebook, we'll showcase how to run streaming speech recognition using the [Distil-Whisper](https://huggingface.co/distil-whisper/distil-medium.en) model
in the 🤗 Transformers library. Streaming speech recognition works by constantly listening to the audio input, and continuously passing chunks of audio to the 
transcription model for inference.

The Distil-Whisper model is lightweight and fast enough to run locally on a CPU, so there's no need for a dedicated hardware accelerator such as a GPU. We'll show 
how the chunk length of the audio can be controlled to give a trade-off between latency and accuracy.

### Installation

Ensure you have PyTorch installed according to the [official instructions](https://pytorch.org/get-started/locally/), and the latest version of the Transformers library:

```bash
pip install --upgrade transformers
```

The microphone input relies on the `ffmpeg` library. You can verify that you have `ffmpeg` installed by running the following command from the command line:
```bash
ffmpeg -version
```
If the `ffmpeg` library is not present, you can install it from the [`ffmpeg` homepage](https://www.ffmpeg.org/download.html).
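
The same check can also be made from Python before starting the microphone stream. The snippet below is a minimal sketch using only the standard library; `ffmpeg_available` is a hypothetical helper name introduced here for illustration, not part of Transformers:

```python
import shutil


def ffmpeg_available():
    """Return True if an `ffmpeg` executable is found on the PATH."""
    # shutil.which returns the full path to the executable, or None if it is missing
    return shutil.which("ffmpeg") is not None


if not ffmpeg_available():
    print("ffmpeg not found: install it before using the microphone input.")
```

This only confirms that an executable named `ffmpeg` is discoverable; it does not verify that the installed build can access your microphone.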

### Streaming Inference

In [None]:
from transformers import pipeline
from transformers.pipelines.audio_utils import ffmpeg_microphone_live
import torch
import sys

We'll load the [`distil-medium.en`](https://huggingface.co/distil-whisper/distil-medium.en) checkpoint, which is 6.8x faster than `large-v2` and performs to within 2% word error rate of it:

In [37]:
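# Pick the fastest available backend: CUDA GPU, then Apple Silicon (MPS), then CPU
device = "cuda:0" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"

transcriber = pipeline(
    "automatic-speech-recognition", model="distil-whisper/distil-medium.en", device=device
)

# English-only checkpoint: leave the language and task generation settings unset
transcriber.model.generation_config.language = None
transcriber.model.generation_config.task = None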

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Now we'll define our transcription function, which listens to a total of `chunk_length_s` seconds of audio data. This audio is delivered in segments of `stream_chunk_s` seconds each. 
After each new segment arrives, we forward all of the audio recorded so far to the model for transcription:

In [36]:
def transcribe(chunk_length_s=10.0, stream_chunk_s=1.0):
    sampling_rate = transcriber.feature_extractor.sampling_rate

    mic = ffmpeg_microphone_live(
        sampling_rate=sampling_rate,
        chunk_length_s=chunk_length_s,
        stream_chunk_s=stream_chunk_s,
    )

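    print("Start speaking...")
    for item in transcriber(mic, generate_kwargs={"max_new_tokens": 128}):
        # Clear the current terminal line, then overwrite it with the latest transcription
        sys.stdout.write("\033[K")
        print(item["text"], end="\r")
        # Stop once the full `chunk_length_s` window has been transcribed
        if not item["partial"][0]:
            break

    return item["text"]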

Using a shorter `stream_chunk_s` lends itself to more real-time speech recognition, since we divide our input audio into smaller chunks and transcribe them on the fly. However, this comes at the expense of poorer accuracy, since there’s less context for the model to infer from.
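
To make the latency side of this trade-off concrete, here is a simplified simulation of the streaming schedule described above. It is not the library's implementation: `simulate_stream` is a hypothetical helper, and it assumes that each new segment triggers exactly one forward pass over all the audio accumulated so far, with the final pass covering the full `chunk_length_s` window. The 16 kHz sampling rate is also an assumption made for the example:

```python
def simulate_stream(chunk_length_s=10.0, stream_chunk_s=1.0, sampling_rate=16000):
    """Return a list of (window_seconds, is_partial) pairs, one per model call."""
    chunk_len = int(round(chunk_length_s * sampling_rate))  # total samples to record
    step = int(round(stream_chunk_s * sampling_rate))       # samples per streamed segment
    if step <= 0 or chunk_len <= 0:
        raise ValueError("chunk_length_s and stream_chunk_s must be positive")

    windows = []
    accumulated = 0
    while True:
        accumulated += step
        if accumulated < chunk_len:
            # Still recording: transcribe everything heard so far as a partial result
            windows.append((accumulated / sampling_rate, True))
        else:
            # The full window is available: this is the final, non-partial result
            windows.append((chunk_len / sampling_rate, False))
            break
    return windows


for stream_chunk_s in (0.5, 1.0, 2.0):
    windows = simulate_stream(chunk_length_s=10.0, stream_chunk_s=stream_chunk_s)
    print(f"stream_chunk_s={stream_chunk_s}: {len(windows)} model calls")
```

Under these assumptions, halving `stream_chunk_s` doubles the number of forward passes over the same 10 seconds of audio, so the text on screen refreshes more often but the model also has to keep up with more frequent calls.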

Let's apply the transcription function to our model and generate some real-time transcription results:

In [43]:
transcribe()

Start speaking...
 Hey, I'm running the Distill Whisper model in real time using the Transformers library with a streaming input and a chunk length of one second.

" Hey, I'm running the Distill Whisper model in real time using the Transformers library with a streaming input and a chunk length of one second."

Looks good! Try running the above with different values of `stream_chunk_s` to see how the latency impacts the accuracy. For simplicity, we terminated our microphone recording after the first `chunk_length_s` seconds (which is set to 10 seconds by default). However, you can experiment with using a [voice activity detection (VAD)](https://huggingface.co/models?pipeline_tag=voice-activity-detection&sort=trending) model to predict when the user has stopped speaking.
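
As a rough illustration of the idea of stopping when the speaker goes quiet, the sketch below uses a plain energy threshold rather than a learned VAD model. It is a crude stand-in, not a replacement for one: the helper names `rms` and `ends_in_silence`, the 0.5-second window, and the 0.01 threshold are all assumptions chosen for the example, and the samples are assumed to be floats in the range [-1, 1]:

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a sequence of float samples."""
    samples = list(samples)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def ends_in_silence(samples, sampling_rate, window_s=0.5, threshold=0.01):
    """Return True if the last `window_s` seconds are quieter than `threshold`."""
    n = max(1, int(window_s * sampling_rate))
    tail = list(samples)[-n:]
    return rms(tail) < threshold


# Synthetic example: one second of a 220 Hz tone, followed by one second of silence
sampling_rate = 16000
tone = [0.5 * math.sin(2 * math.pi * 220 * t / sampling_rate) for t in range(sampling_rate)]
silence = [0.0] * sampling_rate

print(ends_in_silence(tone, sampling_rate))
print(ends_in_silence(tone + silence, sampling_rate))
```

A fixed energy threshold is sensitive to background noise and microphone gain, which is the main reason a trained VAD model is usually the better choice in practice.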