## **4. Combine an *n-gram* with Wav2Vec2**

As a final step, we wrap the *5-gram* in a `Wav2Vec2ProcessorWithLM` object so that *5-gram*-boosted decoding is as seamless as shown in Section 1.
We start by downloading the processor of [`xls-r-300m-sv`](https://huggingface.co/hf-test/xls-r-300m-sv), which currently has no LM attached.
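
The wrapping step described above can be sketched as follows. This is a minimal sketch, not necessarily the exact code used later in this notebook: the helper names `labels_from_vocab` and `build_processor_with_lm` are hypothetical, and it assumes `pyctcdecode` and `kenlm` are installed and that an ARPA file is available on disk.

```python
def labels_from_vocab(vocab: dict) -> list:
    """Return tokens ordered by their id, the order the CTC decoder expects."""
    return [token for token, _ in sorted(vocab.items(), key=lambda item: item[1])]


def build_processor_with_lm(model_id: str, arpa_path: str):
    """Attach a KenLM n-gram to an existing processor (downloads the processor)."""
    # Imported lazily so the helper above can be used without these packages.
    from pyctcdecode import build_ctcdecoder
    from transformers import AutoProcessor, Wav2Vec2ProcessorWithLM

    processor = AutoProcessor.from_pretrained(model_id)
    labels = labels_from_vocab(processor.tokenizer.get_vocab())
    decoder = build_ctcdecoder(labels=labels, kenlm_model_path=arpa_path)
    return Wav2Vec2ProcessorWithLM(
        feature_extractor=processor.feature_extractor,
        tokenizer=processor.tokenizer,
        decoder=decoder,
    )


# Toy vocabulary (hypothetical) to show the ordering step on its own
print(labels_from_vocab({"b": 2, "[PAD]": 0, "a": 1}))  # ['[PAD]', 'a', 'b']
```

A call such as `build_processor_with_lm("hf-test/xls-r-300m-sv", "5gram.arpa")` would then return the LM-boosted processor; the ARPA path here is a placeholder.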

In [1]:
!sudo apt-get install ffmpeg

Reading package lists... Done
Building dependency tree       
Reading state information... Done
ffmpeg is already the newest version (7:4.2.7-0ubuntu0.1).
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.


In [2]:
# Imports for audio I/O, tensors, and notebook audio playback
import soundfile as sf
import torch
from IPython.display import Audio
import numpy as np

In [3]:
!pip install pydub


[notice] A new release of pip is available: 23.3.2 -> 24.0
[notice] To update, run: pip install --upgrade pip


In [4]:
# Split the audio into 10-second chunks
from pydub import AudioSegment
from pydub.utils import make_chunks

import math

# Load your MP3 file
audio = AudioSegment.from_mp3("FinalProject/wav2vec-kenlm/data/dafyomi/batra_155.mp3")

# Define the length of each chunk in milliseconds
chunk_length_ms = 10000  # 10 seconds * 1000 ms/sec
chunks = make_chunks(audio, chunk_length_ms)
# Resample to 16 kHz mono, the format passed to the models below
chunks = [chunk.set_frame_rate(16000).set_channels(1) for chunk in chunks]
# Convert to float32 before taking absolute values to avoid int16 overflow
chunks = [np.array(chunk.get_array_of_samples(), dtype=np.float32) for chunk in chunks]
# Peak-normalize to [-1, 1]; `or 1.0` avoids dividing by zero on silent chunks
chunks = [chunk / (np.abs(chunk).max() or 1.0) for chunk in chunks]
# Calculate the number of chunks to split the file into
# num_chunks = math.ceil(len(audio) / chunk_length_ms)
# chunks = []
# Split the audio and save each chunk
# for i in range(num_chunks):
#     start_ms = i * chunk_length_ms
#     end_ms = min((i + 1) * chunk_length_ms, len(audio))
#     chunk = audio[start_ms:end_ms]
#     chunks.append(chunk)
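
`make_chunks` handles the slicing here, while the commented-out loop computes the same boundaries by hand. A minimal sketch of that boundary arithmetic, using a hypothetical helper `chunk_bounds` and a made-up 25-second clip length:

```python
import math


def chunk_bounds(total_ms: int, chunk_ms: int) -> list:
    """Return (start_ms, end_ms) pairs covering [0, total_ms) in chunk_ms steps."""
    num_chunks = math.ceil(total_ms / chunk_ms)
    return [(i * chunk_ms, min((i + 1) * chunk_ms, total_ms)) for i in range(num_chunks)]


# A hypothetical 25-second clip split into 10-second chunks; the last chunk is shorter
print(chunk_bounds(25_000, 10_000))  # [(0, 10000), (10000, 20000), (20000, 25000)]
```

Each pair can be used to slice a pydub `AudioSegment`, as in `audio[start_ms:end_ms]`.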

In [11]:
from transformers import AutoModelForCTC, AutoProcessor, Wav2Vec2Processor

class ASRModel:
    """Thin wrapper around a CTC model and its processor, with optional LM-boosted decoding."""

    def __init__(self, model_name=None, model=None, processor=None, feature_extractor=None, tokenizer=None, lm_model=None):
        self.model_name = model_name
        self.feature_extractor = feature_extractor
        self.tokenizer = tokenizer
        self.lm_model = lm_model
        if feature_extractor and tokenizer:
            # AutoProcessor cannot be instantiated directly, so build a concrete processor.
            # Wav2Vec2Processor is assumed here; other model families need their own class
            # (for example Wav2Vec2BertProcessor), passed in via `processor` instead.
            self.processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)
        elif processor:
            self.processor = processor
        else:
            self.processor = AutoProcessor.from_pretrained(model_name)

        print('Getting Model...')
        if lm_model:
            # Size the CTC head to the LM processor's tokenizer vocabulary.
            # from_pretrained raises a size-mismatch error if this differs from the checkpoint's lm_head.
            self.model = AutoModelForCTC.from_pretrained(model_name, vocab_size=len(lm_model.tokenizer.vocab))
        elif model:
            self.model = model
        else:
            self.model = AutoModelForCTC.from_pretrained(model_name)

    def get_prediction(self, inputs, sampling_rate=16000, return_tensors="pt"):
        self.inputs = self.processor(inputs, sampling_rate=sampling_rate, return_tensors=return_tensors)
        with torch.no_grad():
            self.logits = self.model(**self.inputs).logits
        if self.lm_model:
            # Beam-search decoding with the n-gram LM; expects NumPy logits on the CPU
            return self.lm_model.batch_decode(self.logits.cpu().numpy()).text
        else:
            # Greedy decoding: most likely token per frame, collapsed by the processor
            predicted_ids = torch.argmax(self.logits, dim=-1)
            return self.processor.batch_decode(predicted_ids)
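
When no LM is attached, `get_prediction` takes the argmax per frame and leaves the rest to the processor's `batch_decode`. The core of that step is the standard CTC collapse: merge consecutive repeats, then drop blanks. A minimal sketch of the idea, assuming blank id 0 and a toy vocabulary (both hypothetical, not taken from the models used here):

```python
def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Merge consecutive repeated ids, then drop blanks (standard CTC collapse)."""
    collapsed = []
    previous = None
    for token_id in frame_ids:
        # Keep an id only when it starts a new run and is not the blank symbol
        if token_id != previous and token_id != blank_id:
            collapsed.append(token_id)
        previous = token_id
    return collapsed


# Toy example: the blank between the two runs of 1 keeps them as separate tokens
toy_vocab = {1: "a", 2: "b"}
ids = ctc_greedy_collapse([0, 1, 1, 0, 1, 2, 2, 0])
print(ids)                                  # [1, 1, 2]
print("".join(toy_vocab[i] for i in ids))   # aab
```

The LM branch replaces this greedy step with a beam search over the full logits, which is why it receives the logits rather than the argmax ids.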


In [6]:
sample = chunks[10]
sf.write("bert_test.wav", sample, 16000)


In [12]:
from transformers import Wav2Vec2CTCTokenizer, SeamlessM4TFeatureExtractor, Wav2Vec2BertForCTC, Wav2Vec2ProcessorWithLM, Wav2Vec2BertProcessor
# Tokenizer comes from the LM directory, feature extractor from the fine-tuned model directory
tokenizer = Wav2Vec2CTCTokenizer.from_pretrained("models/wav2vec2bertLm")
feature_extractor = SeamlessM4TFeatureExtractor.from_pretrained("models/wav2vec2Bert")
processor = Wav2Vec2BertProcessor(feature_extractor=feature_extractor, tokenizer=tokenizer)
# Processor with the 5-gram decoder attached
bertLM = Wav2Vec2ProcessorWithLM.from_pretrained("models/wav2vec2bertLm")
wav2vec2BertLm = ASRModel(model_name="models/wav2vec2Bert", processor=processor, lm_model=bertLM)
# sample = chunks[10]


Loading the LM will be faster if you build a binary file.
Reading /teamspace/studios/this_studio/models/wav2vec2bertLm/language_model/5gram.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************


Getting Model...


RuntimeError: Error(s) in loading state_dict for Wav2Vec2BertForCTC:
	size mismatch for lm_head.weight: copying a param with shape torch.Size([35, 1024]) from checkpoint, the shape in current model is torch.Size([32, 1024]).
	size mismatch for lm_head.bias: copying a param with shape torch.Size([35]) from checkpoint, the shape in current model is torch.Size([32]).
	You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.
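
The traceback reports that the checkpoint stores an `lm_head` with 35 rows while the model was built with 32, the value `ASRModel` passes as `vocab_size=len(lm_model.tokenizer.vocab)`. One possible cause, not confirmed here, is that this count leaves out tokens added on top of the base vocabulary. A minimal sketch for comparing the sizes before loading the model; the helper names `head_mismatch` and `report_vocab_sizes` are hypothetical:

```python
def head_mismatch(checkpoint_vocab_size: int, requested_vocab_size: int) -> int:
    """Rows in the checkpoint's lm_head beyond the requested size (0 means they match)."""
    return checkpoint_vocab_size - requested_vocab_size


def report_vocab_sizes(model_dir: str, lm_dir: str) -> dict:
    """Collect the vocabulary sizes seen by the checkpoint and by the LM tokenizer."""
    # Imported lazily so head_mismatch can be used without transformers installed.
    from transformers import AutoConfig, Wav2Vec2CTCTokenizer

    config = AutoConfig.from_pretrained(model_dir)
    tokenizer = Wav2Vec2CTCTokenizer.from_pretrained(lm_dir)
    return {
        "checkpoint_vocab_size": config.vocab_size,   # size of the saved lm_head
        "tokenizer_base_vocab": tokenizer.vocab_size,  # base vocabulary only
        "tokenizer_with_added": len(tokenizer),        # base vocabulary plus added tokens
    }


# Sizes taken from the traceback above
print(head_mismatch(35, 32))  # 3
```

Running `report_vocab_sizes("models/wav2vec2Bert", "models/wav2vec2bertLm")` shows which count, if any, equals the checkpoint's size. Note that `ignore_mismatched_sizes=True`, as the error message suggests, would load the model but reinitialize the CTC head with random weights, so it does not preserve the fine-tuned output layer.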

In [None]:
# inputs = processor(sample, sampling_rate=16000, return_tensors="pt")
wav2vec2BertLm.get_prediction(sample)

In [9]:
wav2vec_he = ASRModel("imvladikon/wav2vec2-xls-r-300m-hebrew")
wav2vec_he.get_prediction(sample)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Getting Model...


Some weights of the model checkpoint at imvladikon/wav2vec2-xls-r-300m-hebrew were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.encoder.pos_conv_embed.conv.weight_g', 'wav2vec2.encoder.pos_conv_embed.conv.weight_v']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at imvladikon/wav2vec2-xls-r-300m-hebrew and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1']
You should probab