# 40-audio-inference
> Starting to use audio for inference

In this notebook, we start to investigate using wav2vec2 and the associated classes on HuggingFace.  We'll start by trying straight transcription using the classes themselves.

In [None]:
#all_no_test
#modeling imports
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import soundfile as sf
import torch
import librosa

#python imports
import os.path

# Models
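Here, we'll use Facebook's wav2vec2 models.  Only the base checkpoint (`pretrain_name1`) is loaded below; the name of the larger checkpoint is kept for reference.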

In [None]:
# load model and tokenizer
pretrain_name1 = "facebook/wav2vec2-base-960h"
pretrain_name2 = "facebook/wav2vec2-large-960h-lv60-self"

# create processors and models
processor = Wav2Vec2Processor.from_pretrained(pretrain_name1)
model = Wav2Vec2ForCTC.from_pretrained(pretrain_name1)

Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# Sample demo
In this example, we try to transcribe a short, clean audio clip that was recorded directly.  The audio is found on Box at the location indicated below.  The wav2vec2 model description on HuggingFace states:
>The base model pretrained and fine-tuned on 960 hours of Librispeech on 16kHz sampled speech audio. When using the model make sure that your speech input is also sampled at 16Khz.

We therefore make sure that the audio has the correct sampling rate.

In [None]:
# Use test file on Box and read it
myd, mys = sf.read(os.path.expanduser('~/Box/DSI Documents/test_files/mytest.wav')) # my own short test audio
print("The sample rate of test audio: ", mys)
print("The length of audio numpy:", myd.shape)

The sample rate of test audio:  16000
The length of audio numpy: (60929,)
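
If a file turns out not to be sampled at 16 kHz, it would need to be resampled before being passed to the processor.  The following is a minimal sketch of such a check, not something run in this notebook: the helper name `ensure_sampling_rate` is hypothetical, and it assumes `librosa.resample` with the `orig_sr` and `target_sr` keyword arguments.

```python
TARGET_SR = 16000  # rate the wav2vec2 checkpoints above were trained on


def ensure_sampling_rate(audio, sr, target_sr=TARGET_SR):
    """Return (audio, sr) at target_sr, resampling only when needed."""
    if sr == target_sr:
        # already at the expected rate, so return the input untouched
        return audio, sr
    # librosa is imported here so the no-op path has no extra dependency
    import librosa
    resampled = librosa.resample(audio, orig_sr=sr, target_sr=target_sr)
    return resampled, target_sr
```

For the test file above this would be a no-op, e.g. `myd, mys = ensure_sampling_rate(myd, mys)`.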


In [None]:
# tokenize data
input_values = processor(myd, return_tensors = "pt", sampling_rate = 16000).input_values  # Batch size 1

# retrieve logits
logits = model(input_values).logits

# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
transcription

['WHO ARE YOU MY NAME IS MICHAEL']

As we can see, the first word "How" is transcribed as "Who".  Otherwise, the result is pretty good when the audio is clear and the sentence is simple.  Let's see what happens when we transcribe class audio.
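
For intuition about the "take argmax and decode" step, here is a minimal sketch of greedy CTC decoding on a toy example.  The vocabulary and frame ids are made up for illustration, and the function `greedy_ctc_decode` is hypothetical; it assumes, as I believe is the case for this checkpoint, that the blank symbol is `<pad>` and that `|` marks word boundaries.  It is meant to show the idea, not to reproduce `processor.batch_decode` exactly.

```python
def greedy_ctc_decode(ids, vocab, blank_id=0, word_delimiter="|"):
    """Collapse repeated ids, drop blanks, and map the rest to characters."""
    chars = []
    previous = None
    for i in ids:
        # keep an id only if it differs from the previous frame and is not the blank
        if i != previous and i != blank_id:
            chars.append(vocab[i])
        previous = i
    return "".join(chars).replace(word_delimiter, " ").strip()


# Hypothetical toy vocabulary, not the real tokenizer's id mapping
toy_vocab = {0: "<pad>", 1: "|", 2: "H", 3: "I"}
# One id per audio frame, in the form argmax over the logits gives for one clip
frame_ids = [2, 2, 0, 3, 3, 1, 1, 2, 0, 3]
print(greedy_ctc_decode(frame_ids, toy_vocab))  # HI HI
```

Note that a blank between two identical ids keeps both characters, which is how CTC can emit doubled letters.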

# Class audio
Now, we'll use the transformer model to embed the audio array and then decode it as words.  I only pass the first 2 million samples (about 125 seconds at 16 kHz) to wav2vec2, because running all 13 million samples (about 13.6 minutes) through the model in a single forward pass exhausted our memory.

In [None]:
# Use test file on Box and read it
class_audio, class_sr = sf.read(os.path.expanduser('~/Box/DSI Documents/cleaned_data/resampled_audio_16khz/008-1.wav')) # class audio, already resampled to 16 kHz
print("The sample rate of class audio: ", class_sr)
print("The length of audio array:", class_audio.shape)

The sample rate of class audio:  16000
The length of audio array: (13089280,)


In [None]:
# process the first 2 million samples
input_values = processor(class_audio[:2000000], return_tensors = "pt", sampling_rate = 16000).input_values  # Batch size 1

# retrieve logits
logits = model(input_values).logits

# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
transcription

['MONIDEGO ACORN AND YO IT ERY NY GO NOT TE BE ANY WAY HAPING ACON THE WOW NOT OT TE TAT TE GERY TE LAT  G A CORN AR HEL AD WAT WA GROD TO BE GOL GAT THE WA TAT GRD O GAD GOL WO A GA T  GO WAY OR REE AN THE RAT OR TE GREE  GRL O IN THE WO OA ON TE NA THE GAT TE RD  GRA O THE TA TE TA ROL W HER O GOW VERY  E  E EN TH GO A TA A GRAT MA GOWL AN E KNOW THAT YOU NO WA OR TE GRA TE GROL E GOD TE TAT TA E E TAT AAD TOT EVERYTHING A GRO GO MELY GO WA A BE IF ATE TO BE A A GO WIT TE A NAT AND O O BE A TA IN AGRAT TOD OR NOT AT YET A TAT AND A A TE AN HE TA A GOL NO OAD AN A NEE GA I HAD TA Y NOT TE  NOT YE NO BO  BE TE O GRAD NA HAN O RA  AT W AD AT WI TE GE R O RAT AN NO THE BERY OL TE AT  NEW TAT TAT AD T  A GO A A AN RAN  MAD TE E GIV ON  BAY IF TE TE MAD A GON E GR GOON  E    O GD NA TAT TE THE MAN TA GIV ON BE A TAT']

Note that this transcription is poor.  We shouldn't be overly worried about this, as going from audio to the corresponding words involves several very challenging steps.  Keep in mind that mapping the audio representation to text is difficult in its own right; for our purposes we only need the steps up to the audio representation, and the transcription component itself is relatively superfluous.
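
One way around the memory limit mentioned above would be to run the model over the recording in consecutive chunks rather than in one pass.  The following is a sketch under stated assumptions, not something run in this notebook: the helper names `chunk_bounds` and `transcribe_in_chunks` and the 30-second chunk length are hypothetical choices, and `processor` and `model` are assumed to be the objects loaded earlier.

```python
def chunk_bounds(n_samples, chunk_len):
    """Yield (start, end) index pairs covering n_samples in consecutive chunks."""
    for start in range(0, n_samples, chunk_len):
        yield start, min(start + chunk_len, n_samples)


def transcribe_in_chunks(audio, processor, model, sampling_rate=16000, chunk_seconds=30):
    """Transcribe audio chunk by chunk and return one string per chunk."""
    import torch  # imported here so chunk_bounds stays dependency-free
    texts = []
    for start, end in chunk_bounds(len(audio), chunk_seconds * sampling_rate):
        inputs = processor(audio[start:end], return_tensors="pt", sampling_rate=sampling_rate)
        with torch.no_grad():  # inference only, so skip gradient bookkeeping
            logits = model(inputs.input_values).logits
        predicted_ids = torch.argmax(logits, dim=-1)
        texts.append(processor.batch_decode(predicted_ids)[0])
    return texts
```

Usage would look like `texts = transcribe_in_chunks(class_audio, processor, model)`.  Fixed-length chunks can cut a word in half at a boundary, and a very short final chunk may need to be padded or dropped, so this trades some accuracy at the seams for bounded memory use.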

# A brief aside on loading in data
As is clear, we're using soundfile to load in the audio, which is sampled at 16,000 Hz (16,000 samples per second).  We can use some of the functionality of this package to state specifically which subsets of the audio we want to load in.  Alternatively, we can take the already loaded file (currently `class_audio`) and slice out the segments that we want.  Let's check out what the second option looks like.

In [None]:
#constants
sampling_rate = 16000

Let's say we want to start at 10 seconds in and have a duration of 4 seconds.  Let's see what the indices of these would be in the data.

In [None]:
#do some math to get start and end indices (in the signal array)
start_index = 10*sampling_rate
end_index = start_index + 4*sampling_rate
print('Start at:', start_index, 'and end at:', end_index)

Start at: 160000 and end at: 224000


Let's see how we can use this:

In [None]:
# process subset of audio
input_values = processor(class_audio[start_index:end_index], return_tensors = "pt", sampling_rate = sampling_rate).input_values  # Batch size 1; the slice end is exclusive, so this is exactly 4 seconds

# retrieve logits
logits = model(input_values).logits

# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
transcription

['ACORN AND YOU ARE WITH A VERY NIFE']

This transcription appears to be substantially better, possibly because this segment of the recording is cleaner.  Shorter sequences also appear to work better.
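
The other option mentioned in this aside, letting soundfile read only the wanted subset from disk, could be sketched as follows.  The helper names `segment_frames` and `read_segment` are hypothetical; the sketch assumes the `start` and `stop` arguments of `sf.read` and the `samplerate` attribute returned by `sf.info`.

```python
def segment_frames(start_seconds, duration_seconds, sampling_rate):
    """Convert a start time and duration in seconds to (start, stop) frame indices."""
    start = int(round(start_seconds * sampling_rate))
    stop = start + int(round(duration_seconds * sampling_rate))
    return start, stop


def read_segment(path, start_seconds, duration_seconds):
    """Read only the requested segment from disk; returns (samples, sampling_rate)."""
    import soundfile as sf  # imported here so segment_frames has no dependencies
    sampling_rate = sf.info(path).samplerate
    start, stop = segment_frames(start_seconds, duration_seconds, sampling_rate)
    # stop is exclusive: it is the index after the last frame that is read
    return sf.read(path, start=start, stop=stop)
```

This avoids loading the full 13 million samples when only a few seconds are needed, at the cost of going back to disk for each segment.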