<a href="https://colab.research.google.com/github/yohabay/voice-assistance-ecommerce-app/blob/main/examples/mms/asr/tutorial/MMS_ASR_Inference_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Running MMS-ASR inference in Colab

This notebook shows how to run simple ASR inference with the MMS ASR model.

Credit to epk2112 [(github)](https://github.com/epk2112/fairseq_meta_mms_Google_Colab_implementation)

## 1. Clone fairseq and install the latest version

In [1]:
!mkdir "temp_dir"
!git clone https://github.com/pytorch/fairseq

# Change current working directory
!pwd
%cd "/content/fairseq"
!pip install --editable ./
!pip install tensorboardX


Cloning into 'fairseq'...
remote: Enumerating objects: 35385, done.
remote: Counting objects: 100% (15/15), done.
remote: Compressing objects: 100% (10/10), done.
remote: Total 35385 (delta 6), reused 5 (delta 5), pack-reused 35370 (from 2)
Receiving objects: 100% (35385/35385), 25.47 MiB | 14.91 MiB/s, done.
Resolving deltas: 100% (25539/25539), done.
/content
/content/fairseq
Obtaining file:///content/fairseq
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Collecting hydra-core<1.1,>=1.0.7 (from fairseq==0.12.2)
  Downloading hydra_core-1.0.7-py3-none-any.whl.metadata (3.7 kB)
Collecting omegaconf<2.1 (from fairseq==0.12.2)
  Downloading omegaconf-2.0.6-py3-none-any.whl.metadata (3.0 kB)
Requested omegaconf<2.1 from https://files.pythonhosted.org/packages/d0/eb/9d63ce

## 2. Download MMS model
Uncomment the line for your preferred model.
In this example, we use MMS-FL102 for demo purposes.
For better model quality and language coverage, you can use the MMS-1B-ALL model instead (it requires more RAM, so use Colab Pro rather than the free tier).


In [2]:
# MMS-1B:FL102 model - 102 Languages - FLEURS Dataset
!wget -P ./models_new 'https://dl.fbaipublicfiles.com/mms/asr/mms1b_fl102.pt'

# # MMS-1B:L1107 - 1107 Languages - MMS-lab Dataset
# !wget -P ./models_new 'https://dl.fbaipublicfiles.com/mms/asr/mms1b_l1107.pt'

# # MMS-1B-all - 1162 Languages - MMS-lab + FLEURS + CV + VP + MLS
# !wget -P ./models_new 'https://dl.fbaipublicfiles.com/mms/asr/mms1b_all.pt'

--2024-12-25 02:32:52--  https://dl.fbaipublicfiles.com/mms/asr/mms1b_fl102.pt
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 3.163.189.96, 3.163.189.108, 3.163.189.51, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|3.163.189.96|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4851043301 (4.5G) [binary/octet-stream]
Saving to: ‘./models_new/mms1b_fl102.pt’


2024-12-25 02:33:17 (186 MB/s) - ‘./models_new/mms1b_fl102.pt’ saved [4851043301/4851043301]
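
Because the checkpoint is several gigabytes, an interrupted download is easy to miss. The sketch below is one way to confirm that the file on disk matches the length wget reported in the log above (4851043301 bytes for `mms1b_fl102.pt`). The helper name `download_is_complete` is hypothetical, and a size match does not verify file integrity.

```python
import os


def download_is_complete(path, expected_bytes):
    """Return True when the file exists and its size equals the expected length."""
    return os.path.isfile(path) and os.path.getsize(path) == expected_bytes


# Expected size copied from the wget log for mms1b_fl102.pt.
print(download_is_complete("./models_new/mms1b_fl102.pt", 4851043301))
```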



## 3. Prepare audio file
Create the folder '/content/audio_samples/' and upload the .wav audio files you want to transcribe, e.g. '/content/audio_samples/audio.wav'.

Note: The audio must have a sample rate of 16 kHz. FFmpeg can resample it, as in the example below, which converts an audio file to .wav and sets the sample rate to 16 kHz.

This tutorial was written around a FLEURS English MP3 sample; the cell below instead converts an uploaded file, `Conference.wav`.
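
To confirm that a converted file really is 16 kHz before running inference, the sample rate can be read from the WAV header. This is a minimal sketch using Python's standard `wave` module, which handles uncompressed PCM WAV files only; the helper names and the demo file are hypothetical.

```python
import os
import tempfile
import wave


def wav_sample_rate(path):
    """Return the sample rate in Hz stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def is_16khz(path):
    """Return True when the file's sample rate is exactly 16000 Hz."""
    return wav_sample_rate(path) == 16000


# Self-contained demo: write one second of 16 kHz mono silence, then check it.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_16k.wav")  # hypothetical file
with wave.open(demo_path, "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)  # 16-bit samples
    wav_file.setframerate(16000)
    wav_file.writeframes(b"\x00\x00" * 16000)

print(is_16khz(demo_path))  # prints True
```

In the notebook, the same check could be pointed at the `audio.wav` file produced by FFmpeg.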

In [6]:
! mkdir -p /content/audio_samples/

In [16]:
# Conference.wav is a local upload, not a URL, so wget cannot fetch it ("Scheme missing").
# Convert the uploaded file directly and resample it to 16 kHz.
!mkdir -p ./audio_samples
!ffmpeg -y -i /content/audio_samples/Conference.wav -ar 16000 ./audio_samples/audio.wav

ffmpeg version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2000-2021 the FFmpeg developers
  built with gcc 11 (Ubuntu 11.2.0-19ubuntu1)
  configuration: --prefix=/usr --extra-version=0ubuntu0.22.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enab

## 4. Run inference and transcribe your audio


In the example below, we transcribe a sentence in English.

To transcribe other languages:
1. Go to the [MMS README ASR section](https://github.com/facebookresearch/fairseq/tree/main/examples/mms#asr)
2. Open the Supported languages link
3. Find your target language in the Language Name column
4. Copy the corresponding ISO code
5. Replace `--lang "eng"` with the new ISO code

To improve transcription quality, you can use language-model (LM) decoding by following the instructions in [ASR LM decoding](https://github.com/facebookresearch/fairseq/tree/main/examples/mms#asr).
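
When switching languages often, it can help to assemble the command from parameters instead of editing the string by hand. The sketch below builds the same `mms_infer.py` invocation used in this notebook, with the `--model`, `--lang` and `--audio` flags shown in the cell that follows. The helper `build_mms_command` is hypothetical, and `"fra"` is only an example value; confirm the code against the supported-languages list.

```python
import shlex


def build_mms_command(model_path, lang, audio_paths):
    """Assemble the mms_infer.py invocation as an argument list."""
    if not audio_paths:
        raise ValueError("at least one audio path is required")
    return [
        "python", "examples/mms/asr/infer/mms_infer.py",
        "--model", model_path,
        "--lang", lang,
        "--audio", *audio_paths,
    ]


cmd = build_mms_command(
    "/content/fairseq/models_new/mms1b_fl102.pt",
    "fra",  # example ISO code (assumption); check the supported-languages list
    ["/content/fairseq/audio_samples/audio.wav"],
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, cwd="/content/fairseq")` instead of using a `!` shell line.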

In [17]:
import os

os.environ["TMPDIR"] = '/content/temp_dir'
os.environ["PYTHONPATH"] = "."
os.environ["PREFIX"] = "INFER"
os.environ["HYDRA_FULL_ERROR"] = "1"
os.environ["USER"] = "micro"

!python examples/mms/asr/infer/mms_infer.py --model "/content/fairseq/models_new/mms1b_fl102.pt" --lang "eng" --audio "/content/fairseq/audio_samples/audio.wav"


>>> preparing tmp manifest dir ...
>>> loading model & running inference ...
Traceback (most recent call last):
  File "/content/fairseq/examples/speech_recognition/new/infer.py", line 21, in <module>
    from examples.speech_recognition.new.decoders.decoder_config import (
  File "/content/fairseq/examples/speech_recognition/__init__.py", line 1, in <module>
    from . import criterions, models, tasks  # noqa
  File "/content/fairseq/examples/speech_recognition/criterions/__init__.py", line 15, in <module>
    importlib.import_module(
  File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/content/fairseq/examples/speech_recognition/criterions/cross_entropy_acc.py", line 13, in <module>
    from fairseq import utils
  File "/content/fairseq/fairseq/__init__.py", line 20, in <module>
    from fairseq.distributed import utils as distributed_utils
  File "/content/fairseq/fairseq/distributed/_

## 5. Beam search decoding using a language model


Since MMS is a CTC model, we can further improve the accuracy by running beam search decoding using a language model.
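
For context, greedy (best-path) CTC decoding takes the most likely label per frame, collapses consecutive repeats, and drops the blank symbol. Beam search with a language model replaces this single path with a scored search over many candidates. The toy sketch below illustrates only the greedy collapse rule; the vocabulary and frame labels are invented for illustration and are unrelated to the MMS token set or to fairseq's own decoder code.

```python
def ctc_greedy_decode(frame_ids, blank_id=0):
    """Collapse consecutive repeated labels, then drop blanks (CTC best-path rule)."""
    decoded = []
    previous = None
    for label in frame_ids:
        if label != previous and label != blank_id:
            decoded.append(label)
        previous = label
    return decoded


# Toy vocabulary (hypothetical); index 0 is the CTC blank.
vocab = {0: "<blank>", 1: "c", 2: "a", 3: "t"}
frames = [1, 1, 0, 2, 2, 2, 0, 0, 3, 3]  # per-frame argmax labels
print("".join(vocab[i] for i in ctc_greedy_decode(frames)))  # prints "cat"
```

A blank between two identical labels keeps both of them, which is how CTC represents doubled letters.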

While we have not open sourced the language models used in MMS (yet!), we have provided details of the data and the commands used to train the LMs in the Appendix of our paper.


For this tutorial, we will use an alternative English language model based on Common Crawl data, which has been made publicly available through the efforts of [Likhomanenko, Tatiana, et al. "Rethinking evaluation in ASR: Are our models robust enough?"](https://arxiv.org/abs/2010.11745). The language model can be accessed from the GitHub repository [here](https://github.com/flashlight/wav2letter/tree/main/recipes/rasr).

In [18]:
! mkdir -p /content/lmdecode

!wget -P /content/lmdecode  https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/lm_common_crawl_small_4gram_prun0-6-15_200kvocab.bin # smaller LM
!wget -P /content/lmdecode  https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/lexicon.txt

--2024-12-25 02:49:53--  https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/lm_common_crawl_small_4gram_prun0-6-15_200kvocab.bin
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 3.163.189.14, 3.163.189.96, 3.163.189.108, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|3.163.189.14|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2627163608 (2.4G) [application/octet-stream]
Saving to: ‘/content/lmdecode/lm_common_crawl_small_4gram_prun0-6-15_200kvocab.bin’


2024-12-25 02:50:56 (40.4 MB/s) - ‘/content/lmdecode/lm_common_crawl_small_4gram_prun0-6-15_200kvocab.bin’ saved [2627163608/2627163608]

--2024-12-25 02:50:56--  https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/lexicon.txt
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 3.163.189.96, 3.163.189.14, 3.163.189.108, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|3.163.189.96|:443... connected.
HTTP request sent, awaiting response... 200 O


Install decoder bindings from [flashlight](https://github.com/flashlight/flashlight)


In [19]:
# Taken from https://github.com/flashlight/flashlight/blob/main/scripts/colab/colab_install_deps.sh
# Install dependencies from apt
! sudo apt-get install -y libfftw3-dev libsndfile1-dev libgoogle-glog-dev libopenmpi-dev libboost-all-dev
# Install Kenlm
! cd /tmp && git clone https://github.com/kpu/kenlm && cd kenlm && mkdir build && cd build && cmake .. -DCMAKE_BUILD_TYPE=Release && make install -j$(nproc)

# Install Intel MKL 2020
! cd /tmp && wget https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS-2019.PUB && \
    apt-key add GPG-PUB-KEY-INTEL-SW-PRODUCTS-2019.PUB
! sh -c 'echo deb https://apt.repos.intel.com/mkl all main > /etc/apt/sources.list.d/intel-mkl.list' && \
    apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends intel-mkl-64bit-2020.0-088
# Remove existing MKL libs to avoid double linkage
! rm -rf /usr/local/lib/libmkl*


Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
libboost-all-dev is already the newest version (1.74.0.3ubuntu7).
libopenmpi-dev is already the newest version (4.1.2-2ubuntu1).
libsndfile1-dev is already the newest version (1.0.31-2ubuntu0.1).
The following additional packages will be installed:
  libfftw3-bin libfftw3-double3 libfftw3-long3 libfftw3-quad3 libfftw3-single3
  libgflags-dev libgflags2.2 libgoogle-glog0v5 libunwind-dev
Suggested packages:
  libfftw3-doc
The following NEW packages will be installed:
  libfftw3-bin libfftw3-dev libfftw3-double3 libfftw3-long3 libfftw3-quad3
  libfftw3-single3 libgflags-dev libgflags2.2 libgoogle-glog-dev
  libgoogle-glog0v5 libunwind-dev
0 upgraded, 11 newly installed, 0 to remove and 49 not upgraded.
Need to get 6,861 kB of archives.
After this operation, 32.4 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/main amd64 libfftw3-double3 amd64 3.3.8-2ubunt

In [20]:
! rm -rf flashlight
! git clone --recursive https://github.com/flashlight/flashlight.git
%cd flashlight
! git checkout 035ead6efefb82b47c8c2e643603e87d38850076
%cd bindings/python
! python3 setup.py install

%cd /content/fairseq

Cloning into 'flashlight'...
remote: Enumerating objects: 26016, done.
remote: Counting objects: 100% (34/34), done.
remote: Compressing objects: 100% (20/20), done.
remote: Total 26016 (delta 21), reused 17 (delta 14), pack-reused 25982 (from 3)
Receiving objects: 100% (26016/26016), 15.92 MiB | 25.87 MiB/s, done.
Resolving deltas: 100% (18590/18590), done.
/content/fairseq/flashlight
Note: switching to '035ead6efefb82b47c8c2e643603e87d38850076'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at 035e

Next, we download an audio file from the [People's Speech](https://huggingface.co/datasets/MLCommons/peoples_speech) dataset. We will use an audio sample from its 'dirty' subset, which is more challenging for the ASR model.

In [27]:
!pip install huggingface_hub



In [28]:
from huggingface_hub import login

# Never hard-code an access token in a notebook; login() with no arguments prompts for it.
login()

In [32]:
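# The dataset-server URL returned 403 Forbidden in this run, so tmp.wav is not usable.
!wget -O ./audio_samples/tmp.wav 'https://datasets-server.huggingface.co/assets/MLCommons/peoples_speech/--/dirty/train/0/audio/audio.wav'
# Convert a locally available recording instead, resampling it to 16 kHz.
!ffmpeg -y -i ./audio_samples/app_src_main_assets_10001-90210-01803.wav -ar 16000 ./audio_samples/audio_noisy.wav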


--2024-12-25 03:04:28--  https://datasets-server.huggingface.co/assets/MLCommons/peoples_speech/--/dirty/train/0/audio/audio.wav
Resolving datasets-server.huggingface.co (datasets-server.huggingface.co)... 13.224.14.100, 13.224.14.109, 13.224.14.92, ...
Connecting to datasets-server.huggingface.co (datasets-server.huggingface.co)|13.224.14.100|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2024-12-25 03:04:28 ERROR 403: Forbidden.

ffmpeg version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2000-2021 the FFmpeg developers
  built with gcc 11 (Ubuntu 11.2.0-19ubuntu1)
  configuration: --prefix=/usr --extra-version=0ubuntu0.22.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfo

Let's listen to the audio file


In [35]:
!pip install omegaconf

Collecting omegaconf
  Downloading omegaconf-2.3.0-py3-none-any.whl.metadata (3.9 kB)
Collecting antlr4-python3-runtime==4.9.* (from omegaconf)
  Downloading antlr4-python3-runtime-4.9.3.tar.gz (117 kB)
     117.0/117.0 kB 3.6 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Downloading omegaconf-2.3.0-py3-none-any.whl (79 kB)
   79.5/79.5 kB 5.3 MB/s eta 0:00:00
Building wheels for collected packages: antlr4-python3-runtime
  Building wheel for antlr4-python3-runtime (setup.py) ... done
  Created wheel for antlr4-python3-runtime: filename=antlr4_python3_runtime-4.9.3-py3-none-any.whl size=144555 sha256=25cd206c9a4f6cd0160806a76b6857ce2f2a6e8c639a92e79c8a56c861cb0ca7
  Stored in directory: /root/.cache/pip/wheels/12/93/dd/1f6a127edc45659556564c5730f6d4e300888f4bca2d4c5a88
Successful

In [37]:
import IPython
import os

# Install omegaconf
!pip install omegaconf

# Set environment variables
os.environ["TMPDIR"] = '/content/temp_dir'
os.environ["PYTHONPATH"] = "."
os.environ["PREFIX"] = "INFER"
os.environ["HYDRA_FULL_ERROR"] = "1"
os.environ["USER"] = "micro"

# Run MMS ASR inference to get the transcript
!python examples/mms/asr/infer/mms_infer.py --model "/content/fairseq/models_new/mms1b_fl102.pt" --lang "eng" --audio "/content/fairseq/audio_samples/audio_noisy.wav" > transcript.txt

# Read the transcript from the output file
with open("transcript.txt", "r") as f:
    transcript = f.read().strip()

# Play the audio
IPython.display.display(IPython.display.Audio("./audio_samples/audio_noisy.wav"))

# Print the transcript
print("Transcript:", transcript)

>>> preparing tmp manifest dir ...
>>> loading model & running inference ...
2024-12-25 03:08:41.630153: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-12-25 03:08:41.656201: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-12-25 03:08:41.666085: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-12-25 03:08:41.686286: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appr

Transcript: 
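
Reading the whole of `transcript.txt` as one string mixes file names with transcriptions when several files are decoded. The sketch below pairs them up, assuming the `Input: <path>` / `Output: <text>` line format that the script prints in the final cell of this notebook. The helper `parse_mms_output` is hypothetical, and the sample text is copied from that output.

```python
def parse_mms_output(text):
    """Pair each 'Input:' line with the 'Output:' line that follows it."""
    results = {}
    current_input = None
    for line in text.splitlines():
        if line.startswith("Input:"):
            current_input = line[len("Input:"):].strip()
        elif line.startswith("Output:") and current_input is not None:
            results[current_input] = line[len("Output:"):].strip()
            current_input = None
    return results


sample = (
    ">>> loading model & running inference ...\n"
    "Input: /content/fairseq/audio_samples/audio.wav\n"
    "Output: a tornado is a spinning colum of very low-pressure air\n"
)
for path, transcript in parse_mms_output(sample).items():
    print(path, "->", transcript)
```

Lines that match neither prefix, such as progress messages and framework warnings, are ignored.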


In [33]:
import IPython
IPython.display.display(IPython.display.Audio("./audio_samples/audio_noisy.wav"))
print("Transcript: limiting emotions that we experience mainly in our childhood which stop us from living our life just open freedom i mean trust and")

Transcript: limiting emotions that we experience mainly in our childhood which stop us from living our life just open freedom i mean trust and


Run inference with both greedy decoding and LM decoding
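
Because `lmweight` and `wordscore` need tuning per language model, it can be convenient to generate the `decoding.*` override string from a dictionary and sweep a few values. The sketch below builds strings in the same `key=value` form that the next cell passes through `--extra-infer-args`. The helper name and the grid values are hypothetical and are not recommended settings.

```python
from itertools import product


def build_decoding_overrides(options):
    """Join key/value pairs into space-separated 'decoding.key=value' overrides."""
    return " ".join(f"decoding.{key}={value}" for key, value in options.items())


# Fixed settings, copied from the LM decoding cell in this notebook.
base = {
    "type": "kenlm",
    "beam": 500,
    "beamsizetoken": 50,
    "lmpath": "/content/lmdecode/lm_common_crawl_small_4gram_prun0-6-15_200kvocab.bin",
    "lexicon": "/content/lmdecode/lexicon.txt",
}

# Hypothetical search grid; the values are illustrative only.
lm_weights = [1.5, 2.69]
word_scores = [0.0, 2.8]

for lm_weight, word_score in product(lm_weights, word_scores):
    settings = {**base, "lmweight": lm_weight, "wordscore": word_score}
    print(build_decoding_overrides(settings))
```

Each printed line could be run as one inference pass and the transcripts compared against a reference to pick the best pair.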

In [None]:
import os

os.environ["TMPDIR"] = '/content/temp_dir'
os.environ["PYTHONPATH"] = "."
os.environ["PREFIX"] = "INFER"
os.environ["HYDRA_FULL_ERROR"] = "1"
os.environ["USER"] = "micro"

print("======= WITHOUT LM DECODING=======")

!python examples/mms/asr/infer/mms_infer.py --model "/content/fairseq/models_new/mms1b_fl102.pt" --lang "eng" --audio "/content/fairseq/audio_samples/audio.wav" "/content/fairseq/audio_samples/audio_noisy.wav"

print("\n\n\n======= WITH LM DECODING=======")

# Note that lmweight and wordscore need to be tuned for each LM;
# the values below may not be optimal for a different LM.
decoding_cmds = """
decoding.type=kenlm
decoding.beam=500
decoding.beamsizetoken=50
decoding.lmweight=2.69
decoding.wordscore=2.8
decoding.lmpath=/content/lmdecode/lm_common_crawl_small_4gram_prun0-6-15_200kvocab.bin
decoding.lexicon=/content/lmdecode/lexicon.txt
""".replace("\n", " ")
!python examples/mms/asr/infer/mms_infer.py --model "/content/fairseq/models_new/mms1b_fl102.pt" --lang "eng" --audio "/content/fairseq/audio_samples/audio.wav" "/content/fairseq/audio_samples/audio_noisy.wav" \
    --extra-infer-args '{decoding_cmds}'


>>> preparing tmp manifest dir ...
>>> loading model & running inference ...
2023-05-26 01:01:58.415006: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Input: /content/fairseq/audio_samples/audio.wav
Output: a tornado is a spinning colum of very low-pressure air which sucks it surrounding air inward and upward
Input: /content/fairseq/audio_samples/audio_noisy.wav
Output: limiting emotions that weexperienced mainly in our childhood which stop us from living our lives in just open freedom and interust and
>>> preparing tmp manifest dir ...
>>> loading model & running inference ...
2023-05-26 01:03:50.066828: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-cri