## Speech to Text using Hugging Face and wav2vec2

Ref: https://towardsdatascience.com/building-nlp-web-apps-with-gradio-and-hugging-face-transformers-59ce8ab4a319


## Imports

In [1]:
!pip install gradio -q
!pip install wandb --upgrade -q
!pip install -q git+https://github.com/huggingface/transformers.git

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done


In [2]:
import gradio as gr
import librosa
import soundfile as sf
import torch
from transformers import Wav2Vec2ForMaskedLM, Wav2Vec2Tokenizer

In [3]:
import wandb
from tqdm import tqdm

# Log in and start a Weights & Biases run to track this experiment
wandb.login()
# wandb.init(project="Audio2Text", entity="raghavadurs", id="asr_5")
wandb.init(project="voice2text", entity="sjsu-cmpe-258-musketeers", id="asr_5")
config = wandb.config
config.learning_rate = 0.01

wandb: Currently logged in as: sjsu-cmpe-258-musketeers (use `wandb login --relogin` to force relogin)


In [4]:
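# Load the wav2vec2 tokenizer and model.
# Note: Wav2Vec2Tokenizer and Wav2Vec2ForMaskedLM are deprecated in newer
# transformers releases in favor of Wav2Vec2Processor and Wav2Vec2ForCTC.
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")
# processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")
model = Wav2Vec2ForMaskedLM.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")

# Define the speech-to-text function
def asr_transcribe(audio_file):
  transcript = ""

  # Stream the file in blocks of 20 x 16000 samples. This is 20 seconds
  # only for 16 kHz audio, because librosa.stream does not resample.
  stream = librosa.stream(
      audio_file.name, block_length=20, frame_length=16000, hop_length=16000
  )

  for speech in stream:
    # librosa.stream yields mono blocks by default; if a multi-channel block
    # of shape (channels, samples) does arrive, average the channels
    if len(speech.shape) > 1:
      speech = speech.mean(axis=0)

    input_values = tokenizer(speech, return_tensors="pt").input_values
    # Inference only, so skip gradient tracking
    with torch.no_grad():
      logits = model(input_values).logits

    # Greedy decoding: take the most likely token at each frame
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = tokenizer.batch_decode(predicted_ids)[0]
    transcript += transcription.lower() + " "

  return transcript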

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'Wav2Vec2CTCTokenizer'. 
The class this function is called from is 'Wav2Vec2Tokenizer'.
Some weights of Wav2Vec2ForMaskedLM were not initialized from the model checkpoint at facebook/wav2vec2-large-960h-lv60-self and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
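
The `torch.argmax` and `tokenizer.batch_decode` steps in `asr_transcribe` amount to greedy CTC decoding: pick the most likely token per frame, merge consecutive repeats, and drop the blank token. The sketch below illustrates that idea in plain Python. The vocabulary, the blank id, and the helper names (`greedy_ctc_decode`, `one_hot`) are hypothetical and chosen only for illustration; the real tokenizer uses its own vocabulary and handles more cases.

```python
from itertools import groupby

# Toy vocabulary (hypothetical, for illustration only); id 0 plays the CTC blank.
VOCAB = ["<pad>", "|", "h", "i", "o"]
BLANK_ID = 0
WORD_DELIMITER = "|"  # assumed word separator, replaced by a space on output


def argmax(row):
    """Return the index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def greedy_ctc_decode(logits, vocab=VOCAB, blank_id=BLANK_ID):
    """Greedy CTC decoding: argmax per frame, merge repeats, drop blanks."""
    frame_ids = [argmax(frame) for frame in logits]
    # Merge runs of identical ids first, then discard the blank token,
    # so a blank between two identical letters keeps them distinct.
    collapsed = [idx for idx, _ in groupby(frame_ids) if idx != blank_id]
    text = "".join(vocab[idx] for idx in collapsed)
    return text.replace(WORD_DELIMITER, " ").strip()


def one_hot(idx, size=len(VOCAB)):
    """Build a fake logit frame that puts all its score on one token."""
    return [1.0 if i == idx else 0.0 for i in range(size)]


# Frames spell: h h <pad> i | o <pad> o
frames = [one_hot(i) for i in [2, 2, 0, 3, 1, 4, 0, 4]]
print(greedy_ctc_decode(frames))  # hi oo
```

The repeated `h` frames merge into one letter, while the blank between the two `o` frames is what lets a doubled letter survive the merge.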


In [5]:
gradio_asr = gr.Interface(
    fn=asr_transcribe,
    title="Speech to Text with Wav2Vec2 and Hugging Face",
    description="Upload an audio clip and let the model transcribe the words",
    inputs=gr.inputs.Audio(label="Upload Audio File", type="file"),
    outputs=gr.outputs.Textbox(label="Auto-Transcription"),
)
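
The upload widget accepts files at any sample rate, whereas the streaming parameters in `asr_transcribe` (`block_length=20`, `frame_length=16000`, `hop_length=16000`) give 20-second blocks only for 16 kHz audio, which is also the rate the wav2vec2 checkpoint was trained on. As a rough sketch, a check like the following could be run before transcription. The helper names `wav_sample_rate` and `block_seconds` are hypothetical, the header check uses the standard-library `wave` module and so only covers uncompressed WAV files, and the block-size formula is an assumption based on how `librosa.stream` documents its block length.

```python
import wave

TARGET_SAMPLE_RATE = 16_000  # Hz, the rate the wav2vec2 checkpoint expects


def wav_sample_rate(path):
    """Read the sample rate (Hz) from a WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def block_seconds(sample_rate, block_length=20, frame_length=16000, hop_length=16000):
    """Duration in seconds of one streamed block at the given sample rate.

    Assumes a block holds frame_length + (block_length - 1) * hop_length samples.
    """
    samples = frame_length + (block_length - 1) * hop_length
    return samples / sample_rate


print(block_seconds(TARGET_SAMPLE_RATE))  # 20.0
print(block_seconds(48_000))              # much shorter blocks at 48 kHz
```

If `wav_sample_rate(path)` differs from `TARGET_SAMPLE_RATE`, one option is to resample the audio to 16 kHz before passing it to the model instead of streaming the file directly.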

In [6]:
gradio_asr.launch()

Colab notebook detected. To show errors in colab notebook, set `debug=True` in `launch()`
This share link will expire in 72 hours. If you need a permanent link, visit: https://gradio.app/introducing-hosted
Running on External URL: https://27251.gradio.app


(<Flask 'gradio.networking'>,
 'http://127.0.0.1:7860/',
 'https://27251.gradio.app')