## Enabling GPU if you run on Colab
- Navigate to Edit → Notebook settings.
- Select GPU from the Hardware accelerator drop-down.
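
After switching the runtime, it can help to confirm that a GPU is actually attached. The following is a minimal sketch, assuming the NVIDIA driver exposes the `nvidia-smi` tool on the `PATH` (as Colab GPU runtimes typically do). The helper name `gpu_visible` is hypothetical and not part of the original notebook.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if nvidia-smi is installed and lists at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # No NVIDIA tooling on PATH: assume no GPU runtime is attached.
        return False
    try:
        # "nvidia-smi -L" prints one line per GPU, e.g. "GPU 0: ...".
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU visible:", gpu_visible())
```

Once PyTorch is installed, `torch.cuda.is_available()` gives an equivalent check from within Python.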



## Install Dependencies

In [1]:
!pip install soundfile transformers datasets youtube-dl \
    torch==1.9.1+cu111 \
    torchvision==0.10.1+cu111 \
    torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html

Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting torch==1.9.1+cu111
  Downloading https://download.pytorch.org/whl/cu111/torch-1.9.1%2Bcu111-cp37-cp37m-linux_x86_64.whl (2041.3 MB)

## Download Audio

In [2]:
!rm -fR audio.wav
!ffmpeg -i "$(youtube-dl -f 18 --get-url 'https://www.youtube.com/watch?v=LTxMRQObjfs')" \
  -ss 00:01:15 -to 00:01:30 -ar 16000 -ac 1 audio.wav
# A longer clip needs more RAM at inference time, e.g. -ss 00:01:15 -to 00:01:59

ffmpeg version 3.4.8-0ubuntu0.2 Copyright (c) 2000-2020 the FFmpeg developers
  built with gcc 7 (Ubuntu 7.5.0-3ubuntu1~18.04)

In [3]:
from IPython.display import Audio

# Play back the extracted clip inline
Audio('audio.wav')
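
The Wav2Vec2 checkpoint used in this notebook was trained on 16 kHz speech, which is why the ffmpeg command passes `-ar 16000 -ac 1`. The sketch below shows one way to double-check the file before inference using only the standard library. It assumes ffmpeg wrote a plain PCM WAV file, which Python's `wave` module can read. The helper name `wav_info` is hypothetical.

```python
import os
import wave


def wav_info(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    return rate, channels, duration


# Only inspect the clip if it has already been downloaded.
if os.path.exists("audio.wav"):
    rate, channels, duration = wav_info("audio.wav")
    print(f"{rate} Hz, {channels} channel(s), {duration:.1f} s")
    if rate != 16000 or channels != 1:
        print("Warning: expected 16 kHz mono audio")
```

The duration is also a rough guide to memory use, since longer clips produce longer input tensors.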

## Inference

In [6]:
import soundfile as sf
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Use the GPU when one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the pretrained processor and model
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h-lv60-self").to(device)

# Load the audio clip
audio_input, sample_rate = sf.read('audio.wav')

# Normalize the input and return a PyTorch tensor on the chosen device
input_values = processor(audio_input, sampling_rate=sample_rate, return_tensors="pt").input_values.to(device)

# INFERENCE
# Compute logits without tracking gradients, then take the argmax per frame
with torch.no_grad():
    logits = model(input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)

# Decode the predicted token IDs into text
transcription = processor.decode(predicted_ids[0])
print("-" * 20)
print("Transcription:\n", transcription.lower())
print("-" * 20)

Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-large-960h-lv60-self and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


--------------------
Transcription:
 everyone on this movie they care about the story whether it's the language the symbology everything was done with care with research we all bond it because it's just such an important film you were meant for greatness
--------------------
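
The inference cell takes the argmax over the logits for every audio frame and then hands the resulting IDs to `processor.decode`. For a CTC model, decoding of this kind amounts to merging consecutive repeated IDs and dropping the blank token. The sketch below illustrates that idea on a toy example. The vocabulary, the blank ID of 0, and the function name `ctc_greedy_collapse` are assumptions for illustration, not the tokenizer's actual implementation.

```python
from itertools import groupby


def ctc_greedy_collapse(ids, blank_id=0):
    """Merge consecutive repeated IDs, then drop the CTC blank token."""
    return [token for token, _ in groupby(ids) if token != blank_id]


# Toy vocabulary (hypothetical): ID 0 plays the role of the CTC blank.
vocab = {0: "", 1: "h", 2: "e", 3: "l", 4: "o"}

# Per-frame argmax output: repeats come from one character spanning several
# frames; the blank between the two runs of 3 keeps the double "l".
frame_ids = [1, 1, 0, 2, 2, 3, 0, 3, 3, 4, 0]

collapsed = ctc_greedy_collapse(frame_ids)
text = "".join(vocab[i] for i in collapsed)
print(text)  # hello
```

The blank between repeated characters is what lets CTC distinguish a genuine double letter from a single letter held across several frames.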
