In this notebook we convert speech to text using Facebook's Wav2Vec 2.0 model. Wav2Vec2 is a speech model that accepts a float array corresponding to the raw waveform of the speech signal. The model was trained using connectionist temporal classification (CTC), so its output has to be decoded using Wav2Vec2Tokenizer. To learn more, see the [documentation](https://huggingface.co/transformers/model_doc/wav2vec2.html).

In [None]:
!pip install --upgrade transformers

### Import Libraries

In [None]:
import librosa
import torch
import IPython.display as display
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer
import numpy as np

### Load pre-trained Wav2Vec model

In [None]:
# load the pre-trained model and tokenizer
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")

### Load Audio file

In [None]:
# load the audio file, resampled to 16 kHz (the sampling rate the model was trained on)
audio, sampling_rate = librosa.load("../input/automatic-speech-recognition-in-wolof/clips/clips/0031672b4484f963c8a07babe6f713dd559539d44140e80ac19708db36d9712d81dd5b170c016f65bbd6763372c35bfc984a55448e356f3161dbf8d7c28aa047.mp3",sr=16000)

In [None]:
audio,sampling_rate

### Play the Audio

In [None]:
# audio
display.Audio("../input/automatic-speech-recognition-in-wolof/clips/clips/0031672b4484f963c8a07babe6f713dd559539d44140e80ac19708db36d9712d81dd5b170c016f65bbd6763372c35bfc984a55448e356f3161dbf8d7c28aa047.mp3", autoplay=True)

### Speech to Text

First, convert the waveform into model input values with the tokenizer, then take the highest-scoring token from the logits at each frame, and finally decode the predicted ids into text.

In [None]:
input_values = tokenizer(audio, return_tensors = 'pt').input_values
input_values
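As a rough picture of what the tokenizer call produces, the sketch below mimics it with NumPy. It assumes that normalization is enabled in the checkpoint's preprocessing config (not verified in this notebook), in which case the waveform is rescaled to zero mean and unit variance before a batch dimension is added. The function name `normalize_waveform`, the epsilon value, and the synthetic signal are illustrative assumptions, not the library's code.

```python
import numpy as np

def normalize_waveform(waveform, eps=1e-7):
    """Rescale a 1-D waveform to zero mean and unit variance.

    Hypothetical helper; eps is an assumed small constant that guards
    against division by zero on silent clips.
    """
    waveform = np.asarray(waveform, dtype=np.float32)
    return (waveform - waveform.mean()) / np.sqrt(waveform.var() + eps)

# One second of synthetic "audio" at 16 kHz, with a small DC offset.
rng = np.random.default_rng(0)
fake_audio = 0.1 * rng.standard_normal(16000).astype(np.float32) + 0.05

normalized = normalize_waveform(fake_audio)

# The model expects a batch dimension: shape (batch_size, num_samples).
batch = normalized[np.newaxis, :]
print(batch.shape, float(normalized.mean()), float(normalized.std()))
```

The resulting array has the same layout as `input_values`: one row per clip, one column per audio sample.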

In [None]:
# store logits (non-normalized predictions); gradients are not needed for inference
with torch.no_grad():
    logits = model(input_values).logits
logits

In [None]:
# store predicted ids
# take the argmax over the vocabulary dimension; no softmax is needed because it does not change the argmax
predicted_ids = torch.argmax(logits, dim=-1)
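The argmax is taken directly on the logits. Softmax is a monotonic transformation along the vocabulary axis, so applying it first would not change which id wins. A small NumPy sketch of that equivalence, using random toy logits whose shape is chosen only for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

# Toy logits with shape (batch, frames, vocabulary); the sizes are hypothetical.
rng = np.random.default_rng(0)
toy_logits = rng.standard_normal((1, 5, 32))

ids_from_logits = toy_logits.argmax(axis=-1)
ids_from_probs = softmax(toy_logits).argmax(axis=-1)

# Both routes pick the same id for every frame.
print(np.array_equal(ids_from_logits, ids_from_probs))
```

Softmax is only worth computing when the per-frame probabilities themselves are needed, for example to inspect model confidence.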

In [None]:
# pass the predictions to the tokenizer's decode method to get the transcription
transcriptions = tokenizer.decode(predicted_ids[0])

In [None]:
transcriptions
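To make the decoding step less of a black box, here is a minimal sketch of greedy CTC decoding: collapse runs of repeated ids, drop the blank token, then map ids to characters. The toy vocabulary, the blank id, and the function name `greedy_ctc_decode` are assumptions for illustration; the real tokenizer uses the checkpoint's own vocabulary and handles more cases.

```python
from itertools import groupby

# Toy vocabulary (hypothetical, not the model's real one).
VOCAB = ["<pad>", "|", "A", "C", "T"]
BLANK_ID = 0        # assumed: the padding token doubles as the CTC blank
WORD_DELIM = "|"    # assumed: this token marks word boundaries

def greedy_ctc_decode(ids, vocab=VOCAB, blank_id=BLANK_ID):
    """Turn a sequence of per-frame ids into text."""
    # 1. Collapse consecutive duplicates: [3, 3, 0, 2] -> [3, 0, 2]
    collapsed = [key for key, _ in groupby(ids)]
    # 2. Drop blanks, which only separate repeated characters.
    tokens = [vocab[i] for i in collapsed if i != blank_id]
    # 3. Join characters and turn word delimiters into spaces.
    return "".join(tokens).replace(WORD_DELIM, " ").strip()

# One id per audio frame, as produced by an argmax over the logits.
frame_ids = [3, 3, 0, 2, 2, 2, 0, 0, 4, 1, 1, 2, 0, 4, 4]
print(greedy_ctc_decode(frame_ids))  # CAT AT
```

The blank is what lets CTC emit double letters: `[2, 0, 2]` decodes to `AA`, while `[2, 2]` collapses to a single `A`.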

In [None]:
import pandas as pd
test = pd.read_csv("../input/automatic-speech-recognition-in-wolof/Test.csv")

In [None]:
# transcribe every clip listed in Test.csv
audio_dir = "../input/automatic-speech-recognition-in-wolof/audio_wav_16000/tmp/WOLOF_ASR_dataset/audio_wav_16000/"
test_trans = []
for x in test.ID:
    audio, sampling_rate = librosa.load(audio_dir + str(x) + ".wav", sr=16000)
    input_values = tokenizer(audio, return_tensors='pt').input_values
    # inference only, so skip gradient tracking
    with torch.no_grad():
        logits = model(input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    transcriptions = tokenizer.decode(predicted_ids[0])
    test_trans.append(transcriptions)

In [None]:
test["transcription"]=test_trans
test.head()

In [None]:
# lower-case the transcriptions
test["transcription"] = test["transcription"].str.lower()
test.head()

In [None]:
test[["ID","transcription"]].to_csv("submission.csv",index=False)

In [None]:
sum(test["ID"]=="e3a74a8998f03c320f5a4923272247")

In [None]:
test.isna().sum()

In [None]:
sub = pd.read_csv("../input/automatic-speech-recognition-in-wolof/SampleSubmission.csv")
# assumes SampleSubmission.csv lists the IDs in the same order as Test.csv
sub["transcription"] = [t.lower() for t in test_trans]

sub.head()

In [None]:
sub[["ID","transcription"]].to_csv("submission1.csv",index=False)
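The cell that fills `sub` assigns transcriptions by position, which is only correct if SampleSubmission.csv and Test.csv list their IDs in the same order. If that is in doubt, matching on the ID is safer. Below is a minimal sketch in plain Python; the function name `align_by_id` and the sample IDs and texts are made up for illustration.

```python
def align_by_id(target_ids, source_ids, source_texts, missing=""):
    """Return the texts reordered to follow target_ids.

    IDs present in target_ids but absent from source_ids get `missing`.
    """
    if len(source_ids) != len(source_texts):
        raise ValueError("source_ids and source_texts must have the same length")
    lookup = dict(zip(source_ids, source_texts))
    return [lookup.get(item_id, missing) for item_id in target_ids]

# Hypothetical data: the sample submission lists the clips in a different order.
test_ids = ["a1", "b2", "c3"]
test_texts = ["hello", "good morning", "thank you"]
sample_ids = ["c3", "a1", "b2"]

aligned = align_by_id(sample_ids, test_ids, test_texts)
print(aligned)  # ['thank you', 'hello', 'good morning']
```

In the notebook itself the same idea can be written with pandas as `sub["transcription"] = sub["ID"].map(dict(zip(test["ID"], test["transcription"])))`.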