# Wav2Vec 2.0

## Important points -
- Quantization module, quantizes $Z$ (latent representation) from continuous to a discrete version ($Q$)
- This is done using codebooks, where $\argmin_i (q_i - z_i)$ is performed
- Wav2Vec 2.0 uses **Gumbel softmax trick** to solve the discretization problem due to argmin in back propagation (Since it is not differentiable)
- It converts $\argmax$ to $\text{softmax}$ (Discrete to continuous)
- **Temperature Softmax** is used

### For the Loss calculation
- Contrastive Loss ($L_m$): 
    - Similarity is calculated between the positive and predicted sample
    - Similarity is also calculated for the negative ($K$ distractors) and predicted sample
    - The similarity values are passed through a **temperature log-softmax**, with the temperature $\tau$ to be equal to $K$ (maybe!)
    - Similarity used is **Cosine Similarity** - $\text{sim}(\mathbf{a},\mathbf{b}) = \frac{\mathbf{a}^T\mathbf{b}}{\|{\mathbf{a}}\|\|{\mathbf{b}}\|}$
    - Final Contrastive Loss function -
    $$L_m = -\log\bigg(\frac{\exp(\text{sim}(\mathbf{c_t,q_t}))/\kappa}{\sum_{\tilde q \sim Q_t} \exp(\text{sim}(\mathbf{c_t,\tilde q}))/\kappa}\bigg)$$

    where -
    - $\mathbf{\tilde q \sim Q_t}$ includes $(K+1)$ samples: which includes -
        - $\mathbf{q_t}$, which is the positive sample
        - $K$ distractor samples
- Diversity Loss ($L_d$)
- L2 Penalty Loss ($L_p$)

Credit to AssemblyAI

In [1]:
import torch
import miniaudio
import io
import speech_recognition as ASR

from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
from pydub import AudioSegment

In [2]:
tokenizer = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [3]:
recognizer = ASR.Recognizer()

In [7]:
with ASR.Microphone(sample_rate=16000) as source:
    print('You can start speaking now...')
    while True:
        audio = recognizer.listen(source = source, timeout = 1, phrase_time_limit = 5) # pyaudio object
        data = io.BytesIO(audio.get_wav_data()) # list of bytes
        clip = AudioSegment.from_file(data) # To convert to numpy array
        x = torch.FloatTensor(clip.get_array_of_samples()) # Tensor

        inputs = tokenizer(x, sampling_rate = 16000, return_tensors = 'pt', padding = 'longest').input_values
        logits = model(inputs).logits
        tokens = torch.argmax(logits, axis = -1) # Get the distribution at every time step
        text = tokenizer.batch_decode(tokens) # Tokens to a string

        print("You said:",str(text).lower())

You can start speaking now...
You said: ['what you do in how was life']
You said: ["what's up"]
You said: ['whant tat er doing']
You said: ['oh doyou think i am']
You said: ['her']


KeyboardInterrupt: 

### Wav2Vec 2.0 Large

In [8]:
tokenizer2 = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")
model2 = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")

Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-large-960h-lv60-self and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [10]:
with ASR.Microphone(sample_rate=16000) as source:
    print('You can start speaking now...')
    while True:
        audio = recognizer.listen(source = source, phrase_time_limit=5) # pyaudio object
        data = io.BytesIO(audio.get_wav_data()) # list of bytes
        clip = AudioSegment.from_file(data) # To convert to numpy array
        x = torch.FloatTensor(clip.get_array_of_samples()) # Tensor

        inputs = tokenizer(x, sampling_rate = 16000, return_tensors = 'pt', padding = 'longest').input_values
        logits = model(inputs).logits
        tokens = torch.argmax(logits, axis = -1) # Get the distribution at every time step
        text = tokenizer.batch_decode(tokens) # Tokens to a string

        print("You said:",str(text).lower())

You can start speaking now...
You said: ['and lo how his life']
You said: ['nor me']
You said: ['do you know me']
You said: ['what am i saying and what are interpreting']
You said: ['very good this is what i call amaging']


KeyboardInterrupt: 