## **4. Combine an *n-gram* with Wav2Vec2**

In a final step, we want to wrap the *5-gram* into a `Wav2Vec2ProcessorWithLM` object to make the *5-gram* boosted decoding as seamless as shown in Section 1.
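We start by installing `pyctcdecode` and `kenlm`, and then download the currently "LM-less" processor. The original walkthrough uses [`xls-r-300m-sv`](https://huggingface.co/hf-test/xls-r-300m-sv); the cells below load `imvladikon/wav2vec2-xls-r-300m-hebrew` instead.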

In [7]:
!pip install pyctcdecode

Collecting pyctcdecode
  Downloading pyctcdecode-0.5.0-py2.py3-none-any.whl (39 kB)


Collecting pygtrie<3.0,>=2.1 (from pyctcdecode)
  Downloading pygtrie-2.5.0-py3-none-any.whl (25 kB)
Collecting hypothesis<7,>=6.14 (from pyctcdecode)
  Downloading hypothesis-6.97.1-py3-none-any.whl.metadata (6.0 kB)
Collecting sortedcontainers<3.0.0,>=2.1.0 (from hypothesis<7,>=6.14->pyctcdecode)
  Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB)
Downloading hypothesis-6.97.1-py3-none-any.whl (436 kB)
Installing collected packages: sortedcontainers, pygtrie, hypothesis, pyctcdecode
Successfully installed hypothesis-6.97.1 pyctcdecode-0.5.0 pygtrie-2.5.0 sortedcontainers-2.4.0


In [14]:
!pip install kenlm -U

Collecting kenlm
  Downloading kenlm-0.2.0.tar.gz (427 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: kenlm
  Building wheel for kenlm (pyproject.toml) ... done
  Created wheel for kenlm: filename=kenlm-0.2.0-cp310-cp310-linux_x86_64.whl size=597536 sha256=03ba5e26a8afaa4150a6c9134736f4980450cca2742d8dc5f19f56c7688e0b42
  Stored in directory: /home/zeus/.cache/pip/wheels/fd/80/e0/18f4148e863fb137bd87e21ee2bf423b81b3ed6989dab95135
Successfully built kenlm
Installing collected packages: kenlm
Successfully installed kenlm-0.2.0


In [1]:
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("imvladikon/wav2vec2-xls-r-300m-hebrew")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Ignored unknown kwarg option normalize
Ignored unknown kwarg option normalize


Next, we extract the vocabulary of its tokenizer as it represents the `"labels"` of `pyctcdecode`'s `BeamSearchDecoder` class.

In [2]:
vocab_dict = processor.tokenizer.get_vocab()
sorted_vocab_dict = {k.lower(): v for k, v in sorted(vocab_dict.items(), key=lambda item: item[1])}
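The order of the labels matters: position `i` in the label list has to correspond to token id `i` in the model's output. Here is a minimal sketch of the same sorting step on a hypothetical toy vocabulary (not the real tokenizer's), with two sanity checks that guard against lowercasing silently merging tokens.

```python
# Hypothetical toy vocabulary standing in for processor.tokenizer.get_vocab().
vocab_dict = {"<pad>": 0, "|": 1, "B": 3, "a": 2, "<unk>": 4}

# Sort tokens by their id so that position i in the label list is token id i.
sorted_vocab_dict = {
    token.lower(): idx
    for token, idx in sorted(vocab_dict.items(), key=lambda item: item[1])
}
labels = list(sorted_vocab_dict.keys())

# Lowercasing can merge two tokens (e.g. "A" and "a"), which would shift every
# later label by one position, so check that nothing was lost.
assert len(labels) == len(vocab_dict), "lowercasing collapsed some tokens"

# Token ids should be contiguous from 0, otherwise positions and ids diverge.
assert list(sorted_vocab_dict.values()) == list(range(len(labels)))

print(labels)  # ['<pad>', '|', 'a', 'b', '<unk>']
```

If the first check fails on a real vocabulary, dropping the `.lower()` call is one way to keep the labels aligned with the token ids.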

The `"labels"` and the previously built `5gram_correct.arpa` file are all that's needed to build the decoder.

In [3]:
from pyctcdecode import build_ctcdecoder

decoder = build_ctcdecoder(
    labels=list(sorted_vocab_dict.keys()),
    kenlm_model_path="5gram_correct.arpa",
)

Loading the LM will be faster if you build a binary file.
Reading /teamspace/studios/this_studio/FinalProject/word2vec-kenlm/5gram_correct.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Found entries of length > 1 in alphabet. This is unusual unless style is BPE, but the alphabet was not recognized as BPE type. Is this correct?
Unigrams and labels don't seem to agree.
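To get a feeling for what the last warning is about, we can compare the characters used by the LM's unigrams with the single-character labels of the tokenizer. The snippet below is a rough sketch of such a coverage check on hypothetical toy data. It is not `pyctcdecode`'s own implementation, and `char_coverage` is a helper name made up for this example.

```python
def char_coverage(unigrams, labels):
    """Return (coverage, missing) for the characters used in `unigrams`.

    coverage: fraction of distinct unigram characters that are also labels
    missing:  sorted list of unigram characters that no label covers
    """
    # Only single-character labels can emit a character; special tokens such
    # as "<pad>" are skipped.
    label_chars = {label for label in labels if len(label) == 1}
    unigram_chars = set("".join(unigrams))
    if not unigram_chars:
        return 1.0, []
    missing = sorted(unigram_chars - label_chars)
    coverage = 1 - len(missing) / len(unigram_chars)
    return coverage, missing


# Hypothetical toy data: the LM contains an uppercase "A" that the
# (lowercased) labels cannot produce.
toy_labels = ["<pad>", "|", "a", "b", "c"]
toy_unigrams = ["ab", "cab", "Abc"]

coverage, missing = char_coverage(toy_unigrams, toy_labels)
print(coverage, missing)  # 0.75 ['A']
```

A low coverage usually points to a mismatch in casing, script or normalization between the text the LM was trained on and the tokenizer's vocabulary.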


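The first warning is expected here, since the tokenizer's special tokens (for example `<pad>`) are longer than one character, and it can safely be ignored. The second one, `Unigrams and labels don't seem to agree.`, hints that the characters in the LM's unigrams and the tokenizer's labels may not fully overlap, so it is worth a quick look before relying on the decoder. All that is left to do now is to wrap the just-created `decoder`, together with the processor's `tokenizer` and `feature_extractor`, into a `Wav2Vec2ProcessorWithLM` class.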

In [4]:
from transformers import Wav2Vec2ProcessorWithLM

processor_with_lm = Wav2Vec2ProcessorWithLM(
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    decoder=decoder
)
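Conceptually, LM-boosted decoding ranks candidate transcriptions by combining the acoustic model's score with the *n-gram* score. The toy sketch below illustrates that idea with made-up numbers. The weights `alpha` and `beta`, the candidate scores and the helper `fused_score` are all illustrative assumptions, and the real beam search in `pyctcdecode` is considerably more involved.

```python
def fused_score(acoustic_logp, lm_logp, num_words, alpha=0.5, beta=1.5):
    """Combine acoustic and LM log-probabilities into a single score.

    alpha weights the language model, beta rewards each emitted word.
    Both values are illustrative choices for this sketch.
    """
    return acoustic_logp + alpha * lm_logp + beta * num_words


# Hypothetical candidates: (text, acoustic log-prob, LM log-prob).
candidates = [
    ("i scream", -4.0, -9.0),
    ("ice cream", -4.2, -6.0),
]

# Without an LM, the candidate with the best acoustic score wins.
best_acoustic = max(candidates, key=lambda c: c[1])[0]

# With the LM, the more plausible word sequence can overtake it.
best_fused = max(
    candidates,
    key=lambda c: fused_score(c[1], c[2], num_words=len(c[0].split())),
)[0]

print(best_acoustic)  # i scream
print(best_fused)     # ice cream
```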

We want to directly upload the LM-boosted processor into the model folder of [`xls-r-300m-sv`](https://huggingface.co/hf-test/xls-r-300m-sv) to have all relevant files in one place.

Let's clone the repo, add the new decoder files and upload them afterward.
First, we need to install `git-lfs`.

In [None]:
# !sudo apt-get install git-lfs tree

Cloning and uploading of modeling files can be done conveniently with the `huggingface_hub`'s `Repository` class.

For more information on how to use the `huggingface_hub` to upload any files, please take a look at the [official docs](https://huggingface.co/docs/hub/how-to-upstream).

In [None]:
# from huggingface_hub import Repository

# repo = Repository(local_dir="xls-r-300m-sv", clone_from="hf-test/xls-r-300m-sv")

Cloning https://huggingface.co/hf-test/xls-r-300m-sv into local empty directory.


Having cloned `xls-r-300m-sv`, let's save the new processor with LM into it.

In [None]:
# processor_with_lm.save_pretrained("xls-r-300m-sv")

Let's inspect the local repository. Conveniently, the `tree` command can also show the size of the different files.

In [None]:
# !tree -h xls-r-300m-sv/

xls-r-300m-sv/
├── [  23]  added_tokens.json
├── [ 401]  all_results.json
├── [ 253]  alphabet.json
├── [2.0K]  config.json
├── [ 304]  emissions.csv
├── [ 226]  eval_results.json
├── [4.0K]  language_model
│   ├── [4.1G]  5gram_correct.arpa
│   ├── [  78]  attrs.json
│   └── [4.9M]  unigrams.txt
├── [ 240]  preprocessor_config.json
├── [1.2G]  pytorch_model.bin
├── [3.5K]  README.md
├── [4.0K]  runs
│   └── [4.0K]  Jan09_22-00-50_brutasse
│       ├── [4.0K]  1641765760.8871996
│       │   └── [4.6K]  events.out.tfevents.1641765760.brutasse.31164.1
│       ├── [ 42K]  events.out.tfevents.1641765760.brutasse.31164.0
│       └── [ 364]  events.out.tfevents.1641794162.brutasse.31164.2
├── [1.2K]  run.sh
├── [ 30K]  run_speech_recognition_ctc.py
├── [ 502]  special_tokens_map.json
├── [ 279]  tokenizer_config.json
├── [ 29K]  trainer_state.json
├── [2.9K]  training_args.bin
├── [ 196]  train_results.json
├── [ 319]  vocab.json
└── [4.0K]  wandb
    ├── [  52]  debug-internal.log -> run-202

As can be seen, the *5-gram* LM is quite large: it amounts to more than 4 GB.
To reduce the size of the *n-gram* and make loading faster, `KenLM` allows converting `.arpa` files to binary ones using the `build_binary` executable.

Let's make use of it here.

In [None]:
# !kenlm/build/bin/build_binary xls-r-300m-sv/language_model/5gram_correct.arpa xls-r-300m-sv/language_model/5gram.bin

Reading xls-r-300m-sv/language_model/5gram_correct.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
SUCCESS


Great, it worked! Let's remove the `.arpa` file and check the size of the binary *5-gram* LM.

In [None]:
# !rm xls-r-300m-sv/language_model/5gram_correct.arpa && tree -h xls-r-300m-sv/

xls-r-300m-sv/
├── [  23]  added_tokens.json
├── [ 401]  all_results.json
├── [ 253]  alphabet.json
├── [2.0K]  config.json
├── [ 304]  emissions.csv
├── [ 226]  eval_results.json
├── [4.0K]  language_model
│   ├── [1.8G]  5gram.bin
│   ├── [  78]  attrs.json
│   └── [4.9M]  unigrams.txt
├── [ 240]  preprocessor_config.json
├── [1.2G]  pytorch_model.bin
├── [3.5K]  README.md
├── [4.0K]  runs
│   └── [4.0K]  Jan09_22-00-50_brutasse
│       ├── [4.0K]  1641765760.8871996
│       │   └── [4.6K]  events.out.tfevents.1641765760.brutasse.31164.1
│       ├── [ 42K]  events.out.tfevents.1641765760.brutasse.31164.0
│       └── [ 364]  events.out.tfevents.1641794162.brutasse.31164.2
├── [1.2K]  run.sh
├── [ 30K]  run_speech_recognition_ctc.py
├── [ 502]  special_tokens_map.json
├── [ 279]  tokenizer_config.json
├── [ 29K]  trainer_state.json
├── [2.9K]  training_args.bin
├── [ 196]  train_results.json
├── [ 319]  vocab.json
└── [4.0K]  wandb
    ├── [  52]  debug-internal.log -> run-20220109_220
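The two listings report 4.1G for `5gram_correct.arpa` and 1.8G for `5gram.bin`. As a quick worked check of the size reduction, using only those two rounded numbers:

```python
# Sizes as reported by `tree -h` in the two listings (rounded values).
arpa_size_gb = 4.1    # language_model/5gram_correct.arpa
binary_size_gb = 1.8  # language_model/5gram.bin

# Relative size reduction achieved by converting to the binary format.
reduction = 1 - binary_size_gb / arpa_size_gb

print(f"{reduction:.0%}")  # 56%
```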

Nice, we reduced the size of the *n-gram* by more than half, to less than 2 GB. In the final step, let's upload all files.

In [None]:
# repo.push_to_hub(commit_message="Upload lm-boosted decoder")

Git LFS: (1 of 1 files) 1.85 GB / 1.85 GB
Counting objects: 9, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (9/9), done.
Writing objects: 100% (9/9), 1.23 MiB | 1.92 MiB/s, done.
Total 9 (delta 3), reused 0 (delta 0)
To https://huggingface.co/hf-test/xls-r-300m-sv
   27d0c57..5a191e2  main -> main


That's it. Now you should be able to use the *5-gram* for LM-boosted decoding as shown in Section 1.

As can be seen on [`xls-r-300m-sv`'s model card](https://huggingface.co/hf-test/xls-r-300m-sv#inference-with-lm), our *5-gram* LM-boosted decoder yields a WER of 18.85% on Common Voice 7's test set, which is a relative improvement of *ca.* 30% 🔥.
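For reference, the word error rate (WER) is the word-level edit distance between reference and hypothesis, divided by the number of reference words. Below is a minimal sketch of the metric and of how a relative improvement is computed. The example sentences and the baseline of 27.0% are hypothetical values chosen for illustration; only the 18.85% comes from the text above.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer, new_wer):
    """Fraction by which the WER dropped relative to the baseline."""
    return (baseline_wer - new_wer) / baseline_wer


# One missing word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(round(wer, 4))  # 0.1667

# Hypothetical baseline of 27.0% WER versus the reported 18.85% with LM.
gain = relative_improvement(27.0, 18.85)
print(f"{gain:.1%}")  # 30.2%
```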

In [7]:
# # Run inference on one example of the test dataset
# import datasets
# import torch
# from IPython.display import Audio
# from transformers import Wav2Vec2ForCTC

# test_dataset = datasets.load_from_disk("/teamspace/studios/this_studio/FinalProject/datasets/kan_dataset/test")

# audio_sample = test_dataset[2]
# audio_array = audio_sample["audio"]["array"]
# sampling_rate = audio_sample["audio"]["sampling_rate"]
# print(audio_sample["sentence"])  # reference transcription
# Audio(data=audio_array, autoplay=True, rate=sampling_rate)

# model = Wav2Vec2ForCTC.from_pretrained("imvladikon/wav2vec2-xls-r-300m-hebrew")
# inputs = processor(audio_array, sampling_rate=sampling_rate, return_tensors="pt")

# with torch.no_grad():
#     logits = model(**inputs).logits

# # Greedy alternative without the LM: predicted_ids = torch.argmax(logits, dim=-1)
# # batch_decode returns one transcription per batch item in `.text`
# transcription = processor_with_lm.batch_decode(logits.numpy()).text[0]

Some weights of the model checkpoint at imvladikon/wav2vec2-xls-r-300m-hebrew were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.encoder.pos_conv_embed.conv.weight_g', 'wav2vec2.encoder.pos_conv_embed.conv.weight_v']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at imvladikon/wav2vec2-xls-r-300m-hebrew and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0']
You should probab

NameError: name 'audio_sample' is not defined
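For comparison with LM-boosted decoding, the greedy alternative hinted at by the commented-out `torch.argmax` line can be sketched in a few lines: take the most likely token per frame, collapse repeats, drop the blank token and turn the word delimiter into spaces. The toy vocabulary and frame ids below are hypothetical, and `greedy_ctc_decode` is a helper written for this illustration, not a `transformers` function.

```python
def greedy_ctc_decode(frame_ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Decode per-frame argmax ids with the standard CTC collapsing rule."""
    # 1. Collapse consecutive repeats of the same id.
    collapsed = []
    previous = None
    for idx in frame_ids:
        if idx != previous:
            collapsed.append(idx)
        previous = idx
    # 2. Drop the blank token (Wav2Vec2 tokenizers use "<pad>" for it).
    tokens = [id_to_token[idx] for idx in collapsed if idx != blank_id]
    # 3. Join characters and turn the word delimiter into spaces.
    return "".join(tokens).replace(word_delimiter, " ").strip()


# Hypothetical toy vocabulary and per-frame argmax ids.
id_to_token = ["<pad>", "|", "h", "i", "o"]
frame_ids = [2, 2, 0, 3, 3, 1, 1, 2, 0, 4, 4, 0]

print(greedy_ctc_decode(frame_ids, id_to_token))  # hi ho
```

Greedy decoding looks at each frame in isolation, which is why adding an *n-gram* on top can correct words that are acoustically plausible but unlikely as text.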

In [9]:
# with open("dafyomi_test.txt") as f:
#     lines = f.readlines()


'היום נחלק את השיעור לשלושה חלקים בשני החלקים הראשונים נדבר על השלב שקובע למעשר בגידולים שונים בחלק השלישי נראה מאמר מוסגר לגבי היכולת לדייק בשיעורי דרבנן \n'