### 1. Install dependencies


In [None]:
%pip install transformers
%pip install torch
%pip install gradio

### 2. Set up the Transformers model

For this example, we'll use a pre-trained ASR model from Hugging Face. By default, the automatic speech recognition pipeline loads Facebook's `facebook/wav2vec2-base-960h` model. You can choose a different model by passing its name as the `model` parameter; the cell below loads `openai/whisper-large-v2`.

In [None]:
from transformers import pipeline
generator = pipeline("automatic-speech-recognition", model="openai/whisper-large-v2")

The Whisper model is designed to work on audio samples of up to 30 seconds. With a chunking algorithm, however, it can transcribe audio of arbitrary length. Chunking is enabled by setting `chunk_length_s=30` when instantiating the pipeline. With chunking enabled, the pipeline can also run batched inference.

In [None]:
from transformers import pipeline

generator = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v2",
    chunk_length_s=30,  # split long audio into 30-second chunks
)
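
To make the chunking idea concrete, the sketch below computes the time ranges a long recording could be split into. It is an illustration only, not the pipeline's internal code: the helper name `chunk_bounds` and the 5-second overlap between neighbouring windows are assumptions chosen for this example.

```python
def chunk_bounds(total_s, chunk_s=30.0, overlap_s=5.0):
    """Return (start, end) times in seconds for overlapping windows.

    Hypothetical helper for illustration; not part of Transformers.
    """
    if overlap_s >= chunk_s:
        raise ValueError("overlap_s must be smaller than chunk_s")
    step = chunk_s - overlap_s  # how far each window advances
    bounds = []
    start = 0.0
    while start < total_s:
        end = min(start + chunk_s, total_s)
        bounds.append((start, end))
        if end >= total_s:
            break  # the last window reached the end of the audio
        start += step
    return bounds


# A 70-second recording split into 30-second windows that overlap by 5 seconds.
for start, end in chunk_bounds(70.0):
    print(f"{start:5.1f}s -> {end:5.1f}s")
```

Each window is short enough for a model limited to 30 seconds, and the overlap gives a transcriber some shared context for stitching neighbouring windows back together.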

### 3. Create a full-context ASR app with Transformers

In a full-context demo, the user records the entire audio clip first, and the model runs inference only once the recording is complete.

In [None]:
import gradio as gr

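def transcribe(audio):
    # `audio` is the path to the recorded file, because type="filepath" below.
    text = generator(audio)["text"]
    return text

gr.Interface(
    fn=transcribe,
    # Gradio 3.x spelling; Gradio 4 and later expect sources=["microphone"] instead.
    inputs=gr.Audio(source="microphone", type="filepath"),
    outputs="text",
    title="Automatic speech recognition with Transformers",
    description="This is a full-context demo of ASR with Transformers models.",
).launch()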


### 4. Create a streaming ASR app with DeepSpeech

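Mozilla, the organization behind DeepSpeech, has moved the project to a new community-driven effort called Coqui, whose Python package is named `stt` (speech-to-text). The code below still uses the original `deepspeech` package, pinned to version 0.8.2, and expects the matching model and scorer files in the working directory.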

In [None]:
%pip install deepspeech==0.8.2

from deepspeech import Model
import numpy as np

model_file_path = "deepspeech-0.8.2-models.pbmm"
lm_file_path = "deepspeech-0.8.2-models.scorer"
beam_width = 100
lm_alpha = 0.93
lm_beta = 1.18

model = Model(model_file_path)
model.enableExternalScorer(lm_file_path)
model.setScorerAlphaBeta(lm_alpha, lm_beta)
model.setBeamWidth(beam_width)


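def reformat_freq(sr, y):
    # DeepSpeech only supports 16 kHz audio; 48 kHz input is downsampled to 16 kHz.
    if sr not in (48000, 16000):
        raise ValueError("Unsupported rate", sr)
    if sr == 48000:
        # Drop leftover samples so the array splits evenly into groups of 3.
        y = y[: len(y) - len(y) % 3]
        y = (
            ((y / max(np.max(y), 1)) * 32767)  # scale so the positive peak maps to 32767
            .reshape((-1, 3))  # group every 3 consecutive samples
            .mean(axis=1)  # average each group: 48000 / 3 = 16000
            .astype("int16")
        )
        sr = 16000
    return sr, y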


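def transcribe(speech, stream):
    _, y = reformat_freq(*speech)  # speech is a (sample_rate, samples) tuple
    if stream is None:
        stream = model.createStream()  # first chunk: start a new streaming session
    stream.feedAudioContent(y)
    text = stream.intermediateDecode()  # transcript of everything fed so far
    return text, stream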
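
The 48 kHz to 16 kHz conversion in `reformat_freq` works because 48000 / 16000 = 3, so every three input samples collapse into one output sample. The sketch below reproduces only that averaging step on a plain Python list, without NumPy. The name `downsample_by_3` is hypothetical and used only here.

```python
def downsample_by_3(samples):
    """Average each run of three samples into one (48 kHz -> 16 kHz).

    Hypothetical stand-in for the reshape((-1, 3)).mean(axis=1) step.
    Trailing samples that do not fill a complete group of three are dropped.
    """
    usable = len(samples) - len(samples) % 3
    return [sum(samples[i : i + 3]) / 3 for i in range(0, usable, 3)]


one_second = [0] * 48000  # one second of silence at 48 kHz
print(len(downsample_by_3(one_second)))  # number of samples after downsampling
```

Averaging neighbouring samples is a crude low-pass filter, which is why it is a common quick choice here; a dedicated resampler would generally give cleaner results.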

In [None]:
import gradio as gr

gr.Interface(
    fn=transcribe,
    inputs=[
        gr.Audio(source="microphone", type="numpy"),
        "state",
    ],
    outputs=[
        "text",
        "state",
    ],
    live=True,
).launch()
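
The `"state"` input and output are what let each call see the audio fed so far: the second value returned by one call is passed back in as the second argument of the next. The sketch below imitates that loop without Gradio or DeepSpeech. `FakeStream` and `fake_transcribe` are hypothetical stand-ins, and the "decoding" here just counts samples.

```python
class FakeStream:
    """Hypothetical stand-in for a DeepSpeech stream; it only counts samples."""

    def __init__(self):
        self.samples = 0

    def feedAudioContent(self, chunk):
        self.samples += len(chunk)

    def intermediateDecode(self):
        return f"{self.samples} samples heard so far"


def fake_transcribe(chunk, stream):
    # First call: no state yet, so create the stream (mirrors `stream is None`).
    if stream is None:
        stream = FakeStream()
    stream.feedAudioContent(chunk)
    return stream.intermediateDecode(), stream


# Simulate the UI calling the function once per incoming audio chunk,
# passing the returned state back in each time.
state = None
for chunk in ([0] * 160, [0] * 160, [0] * 80):
    text, state = fake_transcribe(chunk, state)
    print(text)
```

Because the stream object travels through the state rather than living in a global variable, each user session can keep its own running transcript.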

