# Speech model in Transformers - Demo

This notebook shows how [Facebook's Wav2Vec2](https://arxiv.org/abs/2006.11477) can be used in 🤗 Transformers. We'll use the "base" model fine-tuned on 960 hours of speech: https://huggingface.co/facebook/wav2vec2-base-960h


Let's install 🤗 Transformers from the development branch, which contains the speech models.

In [1]:
%%capture
!pip install datasets
!pip install git+https://github.com/huggingface/transformers.git
!pip install soundfile

Load the model

In [2]:
from transformers import AutoModelForMaskedLM, AutoTokenizer

speech_model = AutoModelForMaskedLM.from_pretrained("facebook/wav2vec2-base-960h")
tokenizer = AutoTokenizer.from_pretrained("facebook/wav2vec2-base-960h", do_lower_case=True)


Alright, let's see the model in action. We'll use some samples from the [librispeech corpus](https://huggingface.co/datasets/librispeech_asr), which consists of **read** English speech taken from audiobooks.

Let's pick three examples (indices 5 to 7 of the validation split) to transcribe.

In [None]:
from datasets import load_dataset
import soundfile as sf

# use "dummy" samples of validation split because `load_dataset("librispeech_asr", "clean")` requires > 50GB 
libri_speech_dummy = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

# define function to read in audio file
def map_to_array(batch):
  speech, _ = sf.read(batch["file"])
  batch["speech"] = speech
  return batch

samples = libri_speech_dummy.map(map_to_array)[5:8]

Downloading and preparing dataset librispeech_asr/clean (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /root/.cache/huggingface/datasets/librispeech_asr/clean/2.1.0/468ec03677f46a8714ac6b5b64dba02d246a228d92cbbad7f3dc190fa039eab1...

Dataset librispeech_asr downloaded and prepared to /root/.cache/huggingface/datasets/librispeech_asr/clean/2.1.0/468ec03677f46a8714ac6b5b64dba02d246a228d92cbbad7f3dc190fa039eab1. Subsequent calls will reuse this data.




Let's listen to our samples.

In [None]:
import IPython.display as ipd

def play_sample(samples, idx):
  return ipd.Audio(samples['file'][idx])

In [None]:
play_sample(samples, 0)

In [None]:
play_sample(samples, 1)

In [None]:
play_sample(samples, 2)

Alright, let's transcribe the audio!

Let's pad the inputs to the length of the longest sample so that they can be batched

In [None]:
raw_speech_input = tokenizer(samples["speech"], padding="longest", return_tensors="pt").input_values
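
To make the padding step more concrete, here is a minimal sketch of what `padding="longest"` does conceptually: every sequence in the batch is right-padded to the length of the longest one. The helper `pad_to_longest` and the pad value `0.0` are assumptions made for illustration, not the library's actual implementation.

```python
from typing import List, Sequence


def pad_to_longest(batch: Sequence[Sequence[float]], pad_value: float = 0.0) -> List[List[float]]:
    """Right-pad every sequence to the length of the longest one (hypothetical helper)."""
    max_len = max(len(seq) for seq in batch)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in batch]


# Three toy "waveforms" of different lengths
toy_batch = [[0.1, 0.2, 0.3], [0.5], [0.7, 0.8]]
padded = pad_to_longest(toy_batch)
print(padded)  # every row now has the same length
```

After padding, the batch is rectangular and can be stacked into a single tensor.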

run it through the model

In [None]:
import torch

# inference only, so no gradients are needed
with torch.no_grad():
  logits = speech_model(raw_speech_input).logits

# pick the most likely token id for every frame
predicted_ids = torch.argmax(logits, dim=-1)

and decode it

In [None]:
transcription = tokenizer.batch_decode(predicted_ids)
transcription

['it is obviously unnecessary for us to point out how luminous these criticisms are how delicate an expression',
 'on the general principles of art mister quilter writes with equal lucidity',
 'painting he tells us is of a different quality to mathematics and finish in art is adding more fact']
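
The decoding step can be sketched as follows. We assume here that the model was trained with a CTC-style objective, so that decoding roughly amounts to collapsing repeated ids, dropping the blank token, and turning word delimiters into spaces. The toy vocabulary, the blank id `0`, and the helper `greedy_ctc_decode` are hypothetical and only serve to illustrate the idea; this is not the tokenizer's actual code.

```python
from itertools import groupby
from typing import Dict, List

# Toy vocabulary (hypothetical); id 0 plays the role of the blank/padding token
TOY_VOCAB: Dict[int, str] = {0: "<pad>", 1: "|", 2: "h", 3: "i", 4: "y", 5: "o"}


def greedy_ctc_decode(frame_ids: List[int], blank_id: int = 0, word_delimiter: str = "|") -> str:
    """Collapse repeated ids, drop blanks, and turn word delimiters into spaces."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    chars = [TOY_VOCAB[i] for i in collapsed if i != blank_id]
    return "".join(chars).replace(word_delimiter, " ").strip()


# One id per audio frame, as produced by an argmax over the vocabulary axis
frame_ids = [2, 2, 0, 3, 3, 1, 1, 0, 4, 5, 5, 0]
print(greedy_ctc_decode(frame_ids))  # hi yo
```

Note that a blank between two identical ids keeps them apart, which is how a doubled letter survives the collapsing step.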

An interesting aspect to notice here is that the model is very fast because it does not rely on auto-regressive generation: the whole transcript is predicted in a single forward pass. An encoder-decoder setting, in which the speech input is passed to the encoder and the decoder auto-regressively generates the transcript, would likely lead to better results because it can utilize the power of an auto-regressive LM, but it would also be significantly slower.
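
The speed difference can be illustrated with a toy sketch that simply counts forward passes. `CountingModel` and both of its methods are hypothetical stand-ins and not real networks. The point is only that the non-autoregressive path needs one call, while the auto-regressive path needs one call per generated token.

```python
from typing import List


class CountingModel:
    """Toy stand-in for a network that counts how often it is called (hypothetical)."""

    def __init__(self) -> None:
        self.forward_passes = 0

    def predict_all_frames(self, frames: List[int]) -> List[int]:
        # Non-autoregressive: one pass yields a prediction for every frame
        self.forward_passes += 1
        return [f % 3 for f in frames]

    def predict_next_token(self, frames: List[int], prefix: List[int]) -> int:
        # Autoregressive: one pass yields a single token, given the prefix so far
        self.forward_passes += 1
        return frames[len(prefix)] % 3


frames = list(range(10))

parallel_model = CountingModel()
parallel_out = parallel_model.predict_all_frames(frames)

sequential_model = CountingModel()
sequential_out: List[int] = []
while len(sequential_out) < len(frames):
    sequential_out.append(sequential_model.predict_next_token(frames, sequential_out))

print(parallel_model.forward_passes, sequential_model.forward_passes)  # 1 10
```

In this toy setup both paths produce the same output. With real models the per-pass cost and the output quality differ, so the counts only hint at the latency gap.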