# Speech Recognition using Hugging Face's OpenAI Whisper Model

OpenAI's Whisper is an automatic speech recognition (ASR) model that converts spoken language into text. In this notebook, we use the **Whisper** checkpoints available through Hugging Face's `transformers` library to transcribe a sample audio clip from a dataset and, optionally, a recording made with a microphone.

## Installation

Ensure that the necessary packages are installed. If they are not installed yet, uncomment and run the following cell.

- **Note for Google Colab users**: Only install `transformers` and `datasets` since `torch` is already pre-installed, and `sounddevice` will not be used in this environment.

In [1]:
# Uncomment the following line if you need to install the required packages
# !pip install transformers datasets sounddevice torch
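
The Colab note above can also be expressed as a small helper that picks the package list for the current environment. This is only a sketch: `required_packages` is a hypothetical name, and checking `sys.modules` for `google.colab` is a common convention for detecting Colab, not an official API.

```python
import sys


def required_packages(in_colab: bool) -> list[str]:
    """Return the pip packages this notebook needs in the given environment."""
    packages = ["transformers", "datasets"]
    if not in_colab:
        # Local machines also need torch and, for the optional
        # microphone step, sounddevice.
        packages += ["sounddevice", "torch"]
    return packages


# Assumption: a Colab kernel has the google.colab module already loaded.
in_colab = "google.colab" in sys.modules
print("pip install " + " ".join(required_packages(in_colab)))
```

The printed command can then be pasted into the install cell above.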

### 1. Import Necessary Libraries

To begin, we import the libraries used in the main workflow. The `WhisperProcessor` and `WhisperForConditionalGeneration` classes from the `transformers` library handle preprocessing and the ASR model, and `load_dataset` from `datasets` loads a sample dataset. `sounddevice`, which records audio, is imported later in the optional microphone step.

In [2]:
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import load_dataset

### 2. Load a Sample Dataset

We will use a small dummy dataset hosted on the Hugging Face Hub, loaded with the `datasets` library, to get an audio clip and its reference transcription.

In [3]:
# Load dummy dataset and read audio files
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = ds[0]["audio"]

# Original text from the dataset
original_text = ds[0]["text"]

# Print the original text
print("Original Text:", original_text)

Original Text: MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL
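
Before passing a clip to Whisper, it can help to confirm its length and sampling rate. The sketch below assumes the sample behaves like a dictionary with `array` and `sampling_rate` entries, which is how it is used in this notebook. `describe_audio` is a hypothetical helper, and the 16 kHz default reflects the rate that Whisper checkpoints expect.

```python
def describe_audio(sample, expected_rate=16000):
    """Summarize an audio sample that has 'array' and 'sampling_rate' entries."""
    num_samples = len(sample["array"])
    rate = sample["sampling_rate"]
    return {
        "num_samples": num_samples,
        "sampling_rate": rate,
        "duration_s": num_samples / rate,
        # Whisper expects 16 kHz input; other rates need resampling first.
        "needs_resampling": rate != expected_rate,
    }


# Stand-in for ds[0]["audio"]: two seconds of silence at 16 kHz.
fake_sample = {"array": [0.0] * 32000, "sampling_rate": 16000}
print(describe_audio(fake_sample))
```

In the notebook, the same call would be `describe_audio(sample)`.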


### 3. Play the Sample Audio File

The audio clip from the dataset can be played in the notebook with the `IPython.display.Audio` class.

In [4]:
from IPython.display import Audio, display

# Play the audio file
display(Audio(sample["array"], rate=sample["sampling_rate"]))

### 4. Load the Whisper Model and Processor

Now we load the **Whisper** model and processor from the Hugging Face model hub. To keep the download small while retaining reasonable accuracy, we use the **Whisper-small** checkpoint. If you need higher accuracy and can accommodate a larger model, switch to **Whisper-large-v3** by replacing the model name with `"openai/whisper-large-v3"`, a download of around **3-4 GB**.

In [5]:
# Load the processor and the model
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

### 5. Process the Audio and Generate Transcription

The processor first converts the raw waveform into input features (a log-Mel spectrogram). The model then generates token IDs from these features, and the processor decodes the IDs into the final transcription.

In [6]:
# Process the audio sample to get input features
input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features

# Generate token IDs from the input features
predicted_ids = model.generate(input_features)

# Decode the token IDs to obtain the transcription text
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

# Print the original text and transcription
print("Original Text:", original_text)
print("Transcription:", transcription)

Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English.This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.
Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.43.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Original Text: MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL
Transcription: [' Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.']
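
The two printed lines differ in casing and punctuation, so comparing them by eye is awkward. One common way to quantify the match is the word error rate (WER): the word-level edit distance divided by the number of reference words. The following is a minimal pure-Python sketch rather than a standard scoring tool. `normalize` and `word_error_rate` are hypothetical helpers, and the normalization is deliberately crude.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, and split into words."""
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    # prev[j] is the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / max(len(ref), 1)


reference = ("MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES "
             "AND WE ARE GLAD TO WELCOME HIS GOSPEL")
hypothesis = (" Mr. Quilter is the apostle of the middle classes, "
              "and we are glad to welcome his gospel.")
print(f"WER: {word_error_rate(reference, hypothesis):.3f}")
```

With this crude normalization, the only mismatch is "MISTER" versus "Mr.", giving one error in 17 words. In the notebook, `transcription` is a list, so the call would be `word_error_rate(original_text, transcription[0])`.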


### 6. (Optional) Speech Recognition from Microphone

If you want to record your own voice and transcribe it, you can use the following code. This step requires a microphone connected to your machine and utilizes `sounddevice` to record audio, which is then processed similarly to the sample audio.

- **Note for Google Colab users**: `sounddevice` cannot access your microphone from Google Colab, because the code runs on a remote machine. If you'd like to try this step, please run the code on your local machine (e.g., a laptop or desktop).

In [7]:
import sounddevice as sd  # Used for recording audio from the microphone.
import torch

In [27]:
# Set the sampling rate and duration for recording
sampling_rate = 16000  # Samples per second; Whisper expects 16 kHz audio
duration = 8  # The length of the audio recording in seconds

# Record audio from the microphone
print("Recording audio...")
audio = sd.rec(int(sampling_rate * duration), samplerate=sampling_rate, channels=1)
sd.wait()  # Wait until the recording is finished
print("Recording complete.")

Recording audio...
Recording complete.
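
The recording length matters because Whisper works on fixed 30-second windows. With the processor's default settings, shorter clips are padded to that length and longer clips are truncated, so a single call like the one in this notebook covers at most 30 seconds. The sketch below only does the arithmetic, and `recording_plan` is a hypothetical helper.

```python
WHISPER_WINDOW_S = 30  # Whisper processes audio in 30-second windows


def recording_plan(duration_s: float, sampling_rate: int = 16000) -> dict:
    """Compute sample counts for a recording and for Whisper's window."""
    num_samples = int(sampling_rate * duration_s)
    window_samples = WHISPER_WINDOW_S * sampling_rate
    return {
        "num_samples": num_samples,
        "window_samples": window_samples,
        # Padding needed to fill the window (0 if the clip is too long).
        "padding_samples": max(window_samples - num_samples, 0),
        "fits_in_window": num_samples <= window_samples,
    }


print(recording_plan(8))
```

For the 8-second recording above, this gives 128,000 samples, well within the 480,000-sample window.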


In [28]:
# Play the recorded audio
display(Audio(audio.flatten(), rate=sampling_rate))

In [29]:
# Convert the recorded audio (a NumPy array) to a PyTorch tensor and ensure it's flattened to a 1D array
audio_tensor = torch.from_numpy(audio.flatten()).float()

# Process the audio sample to get input features
input_features = processor(audio_tensor, sampling_rate=sampling_rate, return_tensors="pt").input_features

# Force Whisper to transcribe the audio in English
forced_decoder_ids = processor.get_decoder_prompt_ids(language="en", task="transcribe")

# Generate token IDs from the input features, using the forced English prompt
predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)

# Decode the token IDs to obtain the transcription text
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

In [30]:
# Print the transcription of the recorded audio
print("Transcription of recorded audio:", transcription)

Transcription of recorded audio:  Record audio with a microphone and use a model to turn it into text.
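
The warning printed in step 5 suggests passing `language` directly to `generate`. The sketch below wraps that call pattern in a hypothetical `transcribe` helper as an alternative to `forced_decoder_ids`. It assumes a Transformers version in which Whisper's `generate` accepts the `language` and `task` keyword arguments. The model and processor are passed in, so it reuses the objects loaded in step 4.

```python
def transcribe(model, processor, audio, sampling_rate=16000, language="en"):
    """Transcribe one mono audio clip with a loaded Whisper model and processor."""
    inputs = processor(audio, sampling_rate=sampling_rate, return_tensors="pt")
    # Assumption: this version of generate() accepts language and task.
    predicted_ids = model.generate(
        inputs.input_features, language=language, task="transcribe"
    )
    # batch_decode returns one string per clip; a single clip was passed in.
    text = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
    # The decoded text starts with a space in the outputs above, so strip it.
    return text.strip()
```

For the recording above, the call would be `transcribe(model, processor, audio.flatten(), sampling_rate)`.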


### 7. Conclusion

In this notebook, we used OpenAI's Whisper model to transcribe a sample audio clip from a dataset and, optionally, audio recorded from a microphone. With a processor and a model loaded from Hugging Face, a few lines of code are enough to convert speech to text.