# Speech to text / Automatic speech recognition

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sensein/senselab/blob/main/tutorials/audio/speech_to_text.ipynb)

This tutorial demonstrates how to use senselab's `transcribe_audios` function to transcribe audio files, and how to visualize and evaluate the resulting transcripts.

## Setup
First, let's install senselab and import the classes and functions we'll be using.

In [None]:
%pip install 'senselab[audio]'

In [1]:
from senselab.audio.data_structures import Audio
from senselab.audio.tasks.preprocessing import downmix_audios_to_mono, resample_audios
from senselab.audio.tasks.speech_to_text import transcribe_audios
from senselab.audio.tasks.speech_to_text_evaluation import calculate_wer
from senselab.utils.data_structures import DeviceType, HFModel
from senselab.utils.tasks.plotting import plot_transcript

## Specifying the ASR model and the preferred device
Let's specify the model we want to use and the device we want to run it on. For reproducibility, remember to set both `path_or_uri` and `revision`.

In [2]:
model = HFModel(path_or_uri="openai/whisper-tiny", revision="main")
device = DeviceType.CPU

## Loading Audio Files
Now let's download the audio files we want to transcribe and load them with senselab's built-in `Audio` class.

In [None]:
!mkdir -p tutorial_audio_files
!wget -O tutorial_audio_files/audio_48khz_mono_16bits.wav https://github.com/sensein/senselab/raw/main/src/tests/data_for_testing/audio_48khz_mono_16bits.wav
!wget -O tutorial_audio_files/audio_48khz_stereo_16bits.wav https://github.com/sensein/senselab/raw/main/src/tests/data_for_testing/audio_48khz_stereo_16bits.wav

audio1 = Audio(filepath="tutorial_audio_files/audio_48khz_mono_16bits.wav")
audio2 = Audio(filepath="tutorial_audio_files/audio_48khz_stereo_16bits.wav")

## Preprocessing
Let's preprocess the audio so that it matches the input the ASR model expects (here, mono audio sampled at 16 kHz). These requirements are listed in the model card on the Hugging Face Hub.

In [4]:
# Downmix to mono
audio2 = downmix_audios_to_mono([audio2])[0]

# Resample both audios to 16kHz
audios = resample_audios([audio1, audio2], 16000)
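
To make the downmixing step concrete, here is a minimal sketch in plain Python. It assumes that downmixing means averaging the channels sample by sample. `downmix_to_mono` is a hypothetical helper written for illustration only; it is not senselab's implementation, which works on `Audio` objects and may differ in detail.

```python
from typing import List


def downmix_to_mono(channels: List[List[float]]) -> List[float]:
    """Average equally long channels sample by sample (illustrative only)."""
    if not channels:
        raise ValueError("at least one channel is required")
    num_samples = len(channels[0])
    if any(len(channel) != num_samples for channel in channels):
        raise ValueError("all channels must have the same number of samples")
    # zip(*channels) yields one tuple per time step, holding that step's
    # sample from every channel; the mono sample is their mean.
    return [sum(samples) / len(channels) for samples in zip(*channels)]


# Hypothetical two-channel signal with three samples per channel.
left = [1.0, 0.0, 0.5]
right = [0.0, 1.0, 0.5]
mono = downmix_to_mono([left, right])
print(mono)
```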

## Transcription
Finally, let's transcribe the audio clips.

Note: If you know the language spoken in your clips, you can specify it with the `language` parameter. For more details, see the [**dedicated documentation**](https://sensein.group/senselab/senselab/audio/tasks/speech_to_text.html).

In [None]:
transcripts = transcribe_audios(audios=audios, model=model, device=device)

Here are the resulting transcripts, one per input audio clip.

In [None]:
transcripts

## Transcript visualization
Let's plot the first transcript to make it easier to inspect.

In [None]:
plot_transcript(transcripts[0])

## Transcript evaluation
To compare a model's output against the ground truth with senselab, you can compute the word error rate (WER). The WER is the number of word-level insertions, deletions, and substitutions needed to turn the reference into the hypothesis, divided by the total number of words in the reference. Lower values are better, and 0 means the transcript matches the reference exactly.

In [None]:
ground_truth = "This is Peter. This is Johnny. Kenny. And Joe. We just wanted to take a minute to thank you."

wer = calculate_wer(reference=ground_truth, hypothesis=transcripts[0].text)
print(f"The Word Error Rate (WER) is: {wer}")
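
To illustrate the definition of WER, here is a minimal sketch in plain Python that computes a word-level edit distance and divides it by the reference length. `word_error_rate` is a hypothetical helper for illustration, not the implementation behind `calculate_wer`. It assumes whitespace tokenization with no text normalization, so `Peter.` and `peter` count as different words; senselab's result may therefore differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word is missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref_words)


# One substitution ("b" -> "x") and one deletion ("d") over 4 reference words.
example_wer = word_error_rate("a b c d", "a x c")
print(example_wer)  # 0.5
```

Because insertions are counted as well, the WER can exceed 1 when the hypothesis is much longer than the reference.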

Check the [**documentation**](https://sensein.group/senselab/senselab/audio/tasks/speech_to_text_evaluation.html) for more details.