[Transformers](https://huggingface.co/docs/transformers/en/index)

In [1]:
import torch
import torchaudio
from transformers import Wav2Vec2Processor, Wav2Vec2Model



In [2]:
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")

Some weights of Wav2Vec2Model were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [3]:
audio_path = "./NLP_sample_audio.wav"

In [4]:
audio_input, sample_rate = torchaudio.load(audio_path)

In [5]:
audio_input

tensor([[ 0.0000,  0.0000,  0.0000,  ..., -0.0002, -0.0002, -0.0002],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000]])

In [6]:
sample_rate

48000

In [7]:
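# Expected to fail: the file is sampled at 48 kHz, but this checkpoint was trained on 16 kHz audio.
input_values = processor(audio_input, sampling_rate=sample_rate, return_tensors="pt")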

ValueError: The model corresponding to this feature extractor: Wav2Vec2FeatureExtractor {
  "do_normalize": true,
  "feature_extractor_type": "Wav2Vec2FeatureExtractor",
  "feature_size": 1,
  "padding_side": "right",
  "padding_value": 0.0,
  "return_attention_mask": false,
  "sampling_rate": 16000
}
 was trained using a sampling rate of 16000. Please make sure that the provided `raw_speech` input was sampled with 16000 and not 48000.

In [8]:
resample_speech = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)(audio_input)
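
Going from 48 kHz to 16 kHz is a 3:1 reduction, so the resampled tensor should hold about one third as many samples per channel. The helper below is a minimal sketch for sanity-checking `resample_speech.shape[-1]`. The name `resampled_length` is hypothetical, and the ceiling rule is an assumption about how the resampler sizes its output, so treat a mismatch of one sample as inconclusive.

```python
from math import gcd


def resampled_length(num_samples: int, orig_freq: int, new_freq: int) -> int:
    """Estimate the per-channel sample count after resampling.

    Assumes the output length is ceil(num_samples * new_freq / orig_freq).
    """
    # Reduce the ratio first (48000:16000 becomes 3:1) to keep the integers small.
    common = gcd(orig_freq, new_freq)
    orig, new = orig_freq // common, new_freq // common
    # Integer ceiling division avoids floating-point rounding.
    return (num_samples * new + orig - 1) // orig


# One second of 48 kHz audio becomes one second of 16 kHz audio.
print(resampled_length(48000, 48000, 16000))
```

With the notebook's variables this could be compared against the real tensor as `resampled_length(audio_input.shape[-1], sample_rate, 16000)`.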

In [9]:
# The audio was resampled to 16 kHz above, so state that rate explicitly.
# Note: squeeze() is a no-op for stereo input; the (2, num_samples) array is treated as a batch of two mono clips.
input_values = processor(resample_speech.squeeze().numpy(), sampling_rate=16000, return_tensors="pt")


In [10]:
with torch.no_grad():
    output = model(**input_values)

In [11]:
output.last_hidden_state.shape

torch.Size([2, 600, 768])
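
The shape reads as (batch, frames, hidden size). Here the batch of 2 comes from the two stereo channels, 768 is the hidden size of the base model, and 600 is the number of frames produced by the convolutional feature encoder. The sketch below reproduces the frame count from the sample count. It assumes the default kernel and stride settings of the base Wav2Vec2 configuration, and `num_frames` is a hypothetical helper, not part of the library.

```python
# Assumed defaults for the base Wav2Vec2 convolutional feature encoder.
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_frames(num_samples: int) -> int:
    """Number of output frames for a 16 kHz waveform with num_samples samples."""
    length = num_samples
    # Each unpadded 1-D convolution shortens the sequence by this rule.
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        length = (length - kernel) // stride + 1
    return length


# One second of 16 kHz audio yields 49 frames.
print(num_frames(16000))  # 49
```

The strides multiply to 320 samples, so each frame advances by 20 ms at 16 kHz. Under these assumptions, 600 frames correspond to roughly 12 seconds of audio per channel.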

In [12]:
output.last_hidden_state

tensor([[[-2.8313e-02,  4.2560e-02, -4.5139e-02,  ..., -2.9016e-01,
           4.3965e-02, -4.4799e-03],
         [-2.2972e-02,  4.0503e-02, -4.3333e-02,  ..., -2.8777e-01,
           4.5689e-02, -6.3296e-03],
         [-2.3948e-02,  4.0958e-02, -4.1602e-02,  ..., -2.8878e-01,
           4.6640e-02, -6.2517e-03],
         ...,
         [-3.2487e-02,  6.1837e-02, -6.4588e-02,  ..., -2.9823e-01,
           4.8786e-02,  1.7487e-03],
         [-3.2536e-02,  7.4236e-02, -8.5617e-02,  ..., -3.1201e-01,
           4.1741e-02,  8.8672e-03],
         [-2.8567e-02,  4.6920e-02, -3.6513e-02,  ..., -2.7847e-01,
           4.8712e-02,  8.8344e-03]],

        [[-1.0796e-01,  1.3843e-02,  2.7779e-02,  ...,  8.3089e-05,
           3.8994e-03,  2.1546e-02],
         [-1.0028e-01,  1.7617e-02,  3.3556e-02,  ...,  1.6858e-04,
          -1.8821e-02,  3.1268e-03],
         [-1.1116e-01,  2.0320e-02,  2.7727e-02,  ...,  4.5864e-03,
          -1.3077e-02,  2.1474e-03],
         ...,
         [-9.0107e-02,  2