<a href="https://colab.research.google.com/github/rahiakela/speech-recognition-case-studies/blob/main/wav2vec-speech-to-text/01_speech_to_text_with_wav2vec_2_0.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Speech to Text with Wav2Vec 2.0

<img src='https://www.kdnuggets.com/wp-content/uploads/wav2vec2-facebook-ai-header.jpg?raw=1' width='800'/>

In this notebook, we see how to convert speech into text using Facebook's Wav2Vec 2.0 model.

Facebook recently introduced and [open-sourced their new framework](https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio) for self-supervised learning of representations from raw audio data called Wav2Vec 2.0. Facebook researchers claim this framework can enable [automatic speech recognition models](https://analyticsindiamag.com/facebook-makes-advancements-in-automatic-speech-recognition/) with just 10 minutes of transcribed speech data.

Transformers now play a major role in natural language processing. Wav2Vec 2.0 was added to the Hugging Face `transformers` library in version 4.3.0, making it the first automatic speech recognition model included in the library.

The model architecture is beyond the scope of this notebook. For a detailed description of Wav2Vec 2.0, see [Facebook AI's blog post](https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/).

Let’s see how to convert an audio file into text with Hugging Face `transformers` in a few lines of code.

## Setup

In [1]:
!pip install -q transformers

In [2]:
!wget https://github.com/sdhilip200/speech-to-text/raw/main/taken_clip.wav

--2021-06-03 10:23:20--  https://github.com/sdhilip200/speech-to-text/raw/main/taken_clip.wav
Resolving github.com (github.com)... 140.82.113.3
Connecting to github.com (github.com)|140.82.113.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/sdhilip200/speech-to-text/main/taken_clip.wav [following]
--2021-06-03 10:23:20--  https://raw.githubusercontent.com/sdhilip200/speech-to-text/main/taken_clip.wav
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 444800 (434K) [audio/wav]
Saving to: ‘taken_clip.wav.2’


2021-06-03 10:23:20 (11.5 MB/s) - ‘taken_clip.wav.2’ saved [444800/444800]



In [3]:
# For loading audio files
import librosa

# Import PyTorch
import torch

# Import the Wav2Vec 2.0 model and tokenizer
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer

import IPython.display as display

import warnings
warnings.filterwarnings("ignore", category=UserWarning)

Wav2Vec2 is a speech model that accepts a float array corresponding to the raw waveform of the speech signal. The checkpoint used in this notebook was fine-tuned with connectionist temporal classification (CTC), so the model output has to be decoded with `Wav2Vec2Tokenizer`.
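
To make the CTC decoding step more concrete, here is a minimal pure-Python sketch of greedy CTC decoding: merge consecutive repeated ids, then drop the blank token. The toy vocabulary and the `ctc_greedy_decode` helper are illustrative assumptions, not the actual `Wav2Vec2Tokenizer` implementation, which also handles word delimiters and special tokens.

```python
# Toy vocabulary for illustration only; id 0 plays the role of the CTC blank.
# (The real checkpoint ships its own vocabulary.)
TOY_VOCAB = {0: "<blank>", 1: " ", 2: "I", 3: "W", 4: "L"}
BLANK_ID = 0


def ctc_greedy_decode(ids, vocab=TOY_VOCAB, blank_id=BLANK_ID):
    """Collapse repeated ids, then remove blanks (hypothetical helper)."""
    chars = []
    previous = None
    for token_id in ids:
        # Keep an id only if it differs from the previous frame and is not blank
        if token_id != previous and token_id != blank_id:
            chars.append(vocab[token_id])
        previous = token_id
    return "".join(chars)


# One id per audio frame, as an argmax over the logits would produce
frame_ids = [2, 2, 0, 1, 3, 3, 0, 2, 4, 0, 4, 4]
decoded = ctc_greedy_decode(frame_ids)
print(decoded)  # I WILL
```

The blank between the two `L` frames is what lets CTC emit a genuine double letter instead of merging it into one.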

## Reading the audio file

I use Liam Neeson's famous line from the movie `Taken` in this example: `I will look for you, I will find you and I will kill you`.

In [4]:
display.Audio("taken_clip.wav", autoplay=True)

Note that the Wav2Vec model was pre-trained on audio sampled at 16 kHz, so the raw audio file must also be sampled at 16 kHz. I used an online [audio conversion tool](https://audio.online-convert.com/convert-to-wav) to resample the `taken` audio clip to 16 kHz.
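
Before loading a file, it can help to confirm its sampling rate. Below is a minimal sketch using Python's standard `wave` module, which reads PCM WAV headers. The `demo_16k.wav` file is generated on the fly purely for illustration (it contains silence, not the `Taken` clip), and `get_sample_rate` is a hypothetical helper.

```python
import wave


def get_sample_rate(path):
    """Return the sampling rate (Hz) stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


# Write one second of 16-bit mono silence so the example is self-contained
demo_path = "demo_16k.wav"
with wave.open(demo_path, "wb") as wav_file:
    wav_file.setnchannels(1)        # mono
    wav_file.setsampwidth(2)        # 2 bytes per sample = 16-bit audio
    wav_file.setframerate(16000)    # 16 kHz
    wav_file.writeframes(b"\x00\x00" * 16000)

sample_rate = get_sample_rate(demo_path)
print(sample_rate)
if sample_rate != 16000:
    print("Resample to 16 kHz before passing the audio to the model")
```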

Let's load the audio file with the librosa library, specifying a sampling rate of 16000 Hz (`librosa.load` resamples the audio if the file uses a different rate). It returns the waveform as a NumPy array, stored in the `audio` variable, together with the sampling rate.

In [5]:
# Loading the audio file
audio, rate = librosa.load("taken_clip.wav", sr=16000)

# printing audio 
print(audio)

# printing rate
print(rate)

[0. 0. 0. ... 0. 0. 0.]
16000


## Importing pre-trained Wav2Vec model

In [None]:
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

Next, we pass the audio array to the tokenizer to obtain the input values. Setting `return_tensors="pt"` returns PyTorch tensors rather than plain Python lists.

In [7]:
# Taking an input value
input_values = tokenizer(audio, return_tensors="pt").input_values

We then run the model on the input values to get the logits (unnormalized prediction scores).

In [8]:
# Storing logits (non-normalized prediction values); gradients are not needed for inference
with torch.no_grad():
    logits = model(input_values).logits
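
The logits contain one row per audio frame rather than one per sample. As a rough sketch, the number of frames can be estimated from the kernel sizes and strides of the convolutional feature encoder. The values below are what I believe to be the `Wav2Vec2Config` defaults (`conv_kernel` and `conv_stride`), so treat them as an assumption and check `model.config` for the actual checkpoint; `num_output_frames` is a hypothetical helper.

```python
# Assumed defaults of the Wav2Vec2 feature encoder (verify with model.config)
CONV_KERNEL = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDE = (5, 2, 2, 2, 2, 2, 2)


def num_output_frames(num_samples, kernels=CONV_KERNEL, strides=CONV_STRIDE):
    """Apply the unpadded 1D convolution length formula layer by layer."""
    length = num_samples
    for kernel, stride in zip(kernels, strides):
        length = (length - kernel) // stride + 1
    return length


# One second of 16 kHz audio
frames_per_second = num_output_frames(16000)
print(frames_per_second)  # 49
```

Under these assumptions the model emits roughly one prediction every 20 ms, and `logits` has shape `(batch, frames, vocab_size)`.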

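Taking the argmax of the logits over the vocabulary dimension gives the predicted token id for each frame. No softmax is required here, because softmax does not change which entry is the largest.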

In [9]:
# Storing predicted ids
prediction = torch.argmax(logits, dim=-1)
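
The cell above calls `torch.argmax` directly on the logits. Here is a small pure-Python sketch of why this yields the same ids as applying softmax first: softmax preserves the ordering of the scores, so the largest logit is also the largest probability. The logits below are made-up numbers for a single frame, and the two helper functions are written here only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


# Made-up logits for one frame over a 5-token vocabulary
frame_logits = [-1.2, 0.3, 4.1, 0.0, 2.7]
probs = softmax(frame_logits)

print(argmax(frame_logits), argmax(probs))  # same index both times
```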

## Converting audio to text

The final step is to pass the predicted ids to the tokenizer's `batch_decode` method to get the transcription.

In [10]:
# Passing the predicted ids to the tokenizer's batch_decode to get the transcription
transcription = tokenizer.batch_decode(prediction)[0]

# Printing the transcription
print(transcription)

I WILL LOOK FOR YOU I WILL FIND YOU  AND I WILL KILL YOU


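The transcription matches the words spoken in the audio clip, although it is uppercase and has no punctuation.

This is useful for NLP projects that work with audio data, since spoken content can be turned into text for further processing.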
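
Since the raw transcription is uppercase and may contain repeated spaces, a light clean-up is often applied before downstream NLP steps. Here is a minimal sketch using the string printed above; whether lowercasing is desirable depends on the downstream task, so treat this as one possible choice.

```python
# Transcription string copied from the output above
raw_transcription = "I WILL LOOK FOR YOU I WILL FIND YOU  AND I WILL KILL YOU"

# Collapse repeated whitespace, then lowercase the text
cleaned = " ".join(raw_transcription.split()).lower()
print(cleaned)
```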

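## Converting Hindi speech to text

The same steps can be applied to a Hindi clip, assumed here to be already available as `dil_nahin_todna.wav`. Keep in mind that `facebook/wav2vec2-base-960h` was fine-tuned on English speech (LibriSpeech), so it can only spell out what it hears using English characters. A checkpoint fine-tuned on Hindi would be needed for a proper Hindi transcription.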

In [11]:
display.Audio("dil_nahin_todna.wav", autoplay=True)

In [12]:
# Loading the audio file
audio, rate = librosa.load("dil_nahin_todna.wav", sr=16000)

# printing audio 
print(audio)

# printing rate
print(rate)

[ 0.          0.          0.         ... -0.1776886  -0.17518616
 -0.18510437]
16000


In [13]:
# Taking an input value
input_values = tokenizer(audio, return_tensors="pt").input_values

In [None]:
# Storing logits (non-normalized prediction values); gradients are not needed for inference
with torch.no_grad():
    logits = model(input_values).logits

In [None]:
# Storing predicted ids
prediction = torch.argmax(logits, dim=-1)

In [None]:
# Passing the predicted ids to the tokenizer's batch_decode to get the transcription
transcription = tokenizer.batch_decode(prediction)[0]

# Printing the transcription
print(transcription)