<a href="https://colab.research.google.com/github/shreyasr-upenn/asr-error-correction-cis522/blob/main/Baseline_ASR_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CIS 522 - Final Project
### Team: Transcriptionists
Members:
- Manni Arora - manni@seas.upenn.edu
- Pooja Dattatri - poojadat@seas.upenn.edu
- Shreyas Ramesh - shreyasr@seas.upenn.edu

# Pretrained ASR model: Wav2Letter/Flashlight

We use a pre-trained speech recognition model, trained with CTC loss on a
combination of publicly available datasets. Details can be found in [Rethinking Evaluation in ASR: Are Our Models Robust Enough?](https://arxiv.org/abs/2010.11745)
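
To illustrate what a CTC-trained model produces, here is a minimal sketch of the standard CTC collapse rule: merge consecutive repeated labels, then drop blanks. The blank symbol `#` and the frame sequence are made up for this example and are not taken from the model's `tokens.txt`; this is not Flashlight's decoding code.

```python
from itertools import groupby

BLANK = "#"  # hypothetical blank symbol, chosen only for this sketch


def ctc_collapse(frame_labels, blank=BLANK):
    """Collapse per-frame labels: merge consecutive repeats, then drop blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != blank)


# A made-up sequence of per-frame best labels.
frames = list("hh#e#ll#llo")
print(ctc_collapse(frames))  # -> hello
```

The blank between the two runs of `l` is what lets CTC emit a doubled letter; without it the repeats would merge into one.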

### Installing dependencies




In [None]:
# First, choose backend to build with
backend = 'CUDA' #@param ["CPU", "CUDA"]
# Clone Flashlight
!git clone https://github.com/flashlight/flashlight.git
# install all dependencies for colab notebook
!source flashlight/scripts/colab/colab_install_deps.sh


Cloning into 'flashlight'...
remote: Enumerating objects: 20076, done.
remote: Counting objects: 100% (3192/3192), done.
remote: Compressing objects: 100% (560/560), done.
remote: Total 20076 (delta 2821), reused 2632 (delta 2632), pack-reused 16884
Receiving objects: 100% (20076/20076), 14.12 MiB | 23.32 MiB/s, done.
Resolving deltas: 100% (14356/14356), done.
Reading package lists... Done
Building dependency tree       
Reading state information... Done
libboost-all-dev is already the newest version (1.65.1.0ubuntu1).
libopenmpi-dev is already the newest version (2.1.1-8).
libsndfile1-dev is already the newest version (1.0.28-4ubuntu0.18.04.2).
The following additional packages will be installed:
  libfftw3-bin libfftw3-long3 libfftw3-quad3 libfftw3-single3 libgflags-dev
  libgflags2.2 libgoogle-glog0v5
Suggested packages:
  libfftw3-doc
The following NEW packages will be installed:
  libfftw3-bin libfftw3-dev libfftw3-long3 libfftw3-quad3 libfftw3-single3
  libgflags-dev

Build CPU/CUDA Backend of `Flashlight`:
- Builds from a pinned commit (`d2e1924cb2a2b32b48cc326bb7e332ca3ea54f67`) rather than the current master.
- Builds the ASR app. 
- Resulting binaries in `/content/flashlight/build/bin/asr`.

If using a GPU Colab runtime, build the CUDA backend; else build the CPU backend.
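
One way to check which runtime is active before picking a backend is to test whether `nvidia-smi` is present and runs successfully. The sketch below is only a heuristic, and `pick_backend` and `suggested_backend` are hypothetical names introduced here; it does not replace the `backend` form field set earlier.

```python
import shutil
import subprocess


def pick_backend():
    """Return "CUDA" if nvidia-smi exists and exits with status 0, else "CPU"."""
    smi = shutil.which("nvidia-smi")
    if smi is None:
        return "CPU"
    try:
        result = subprocess.run([smi], capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return "CPU"
    return "CUDA" if result.returncode == 0 else "CPU"


suggested_backend = pick_backend()
print(suggested_backend)
```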

In [None]:
# export necessary env variables
%env MKLROOT=/opt/intel/mkl
%env ArrayFire_DIR=/opt/arrayfire/share/ArrayFire/cmake
%env DNNL_DIR=/opt/dnnl/dnnl_lnx_2.0.0_cpu_iomp/lib/cmake/dnnl

if backend == "CUDA":
  # Total time: ~13 minutes
  !cd flashlight && git checkout d2e1924cb2a2b32b48cc326bb7e332ca3ea54f67 && mkdir -p build && cd build && \
  cmake .. -DCMAKE_BUILD_TYPE=Release \
           -DFL_BUILD_TESTS=OFF \
           -DFL_BUILD_EXAMPLES=OFF \
           -DFL_BUILD_APP_ASR=ON && \
  make -j$(nproc)
elif backend == "CPU":
  # Total time: ~14 minutes
  !cd flashlight && git checkout d2e1924cb2a2b32b48cc326bb7e332ca3ea54f67 &&  mkdir -p build && cd build && \
  cmake .. -DFL_BACKEND=CPU \
           -DCMAKE_BUILD_TYPE=Release \
           -DFL_BUILD_TESTS=OFF \
           -DFL_BUILD_EXAMPLES=OFF \
           -DFL_BUILD_APP_ASR=ON && \
  make -j$(nproc)
else:
  raise ValueError(f"Unknown backend {backend}")

env: MKLROOT=/opt/intel/mkl
env: ArrayFire_DIR=/opt/arrayfire/share/ArrayFire/cmake
env: DNNL_DIR=/opt/dnnl/dnnl_lnx_2.0.0_cpu_iomp/lib/cmake/dnnl
Note: checking out 'd2e1924cb2a2b32b48cc326bb7e332ca3ea54f67'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>

HEAD is now at d2e1924c Tensor any and all (#685)
-- The CXX compiler identification is GNU 7.5.0
-- The C compiler identification is GNU 7.5.0
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile f

In [None]:
# Binaries are located in
# /content/flashlight/build/bin
!ls /content/flashlight/build/bin/asr

fl_asr_align	       fl_asr_sfx_apply  fl_asr_tutorial_finetune_ctc
fl_asr_arch_benchmark  fl_asr_test	 fl_asr_tutorial_inference_ctc
fl_asr_decode	       fl_asr_train	 fl_asr_voice_activity_detection_ctc


#### Downloading the model files

>Architecture | Parameters | Criterion | Model file | Architecture file
>---|---|:---|:---:|:---:
> Transformer|70M|CTC|am_transformer_ctc_stride3_letters_70Mparams.bin |am_transformer_ctc_stride3_letters_70Mparams.arch

We use the model listed above; the next cell downloads its weights and its architecture file.

In [None]:
!wget -nv --continue -o /dev/null https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/am_transformer_ctc_stride3_letters_70Mparams.bin -O model.bin # acoustic model
!wget -nv --continue -o /dev/null https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/am_transformer_ctc_stride3_letters_70Mparams.arch -O arch.txt # model architecture file

Along with the model, we also download the corresponding tokens file and lexicon file.

In [None]:
!wget -nv --continue -o /dev/null https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/tokens.txt -O tokens.txt # tokens (defines predicted tokens)
!wget -nv --continue -o /dev/null https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/lexicon.txt -O lexicon.txt # lexicon file (maps each word to its token sequence)
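
Because the download cells send wget's log to `/dev/null`, a failed download produces no visible message. A small sanity check like the following sketch can catch that before the test run; the helper name `missing_or_empty` is hypothetical, and the file names are the ones passed to `-O` in the download cells.

```python
from pathlib import Path


def missing_or_empty(paths):
    """Return the paths that are not regular files or have zero size."""
    bad = []
    for p in map(Path, paths):
        if not p.is_file() or p.stat().st_size == 0:
            bad.append(str(p))
    return bad


# File names taken from the -O arguments of the download cells.
expected = ["model.bin", "arch.txt", "tokens.txt", "lexicon.txt"]
problems = missing_or_empty(expected)
print("Missing or empty:", problems if problems else "none")
```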

#### Downloading the dataset

The [AMI Corpus](http://groups.inf.ed.ac.uk/ami/corpus/) data used here consists of 10 min, 1 hr and 10 hr subsets, organized as follows.

```
dev.lst           # development set 
test.lst          # test set 
train_10min_0.lst # first 10 min fold
train_10min_1.lst
train_10min_2.lst
train_10min_3.lst
train_10min_4.lst
train_10min_5.lst
train_9hr.lst     # remaining data of the 10h split (10h=1h+9h)
```
The 10h split combines the 9h split and the 1h split. The 1h split itself consists of six 10 min folds. We evaluate only on the 9h split (`train_9hr.lst`).
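
To check that a `.lst` file covers the expected amount of audio, the durations can be summed. The sketch below assumes each line has the form `<sample id> <audio path> <duration in ms> <transcript words...>`; that layout is an assumption about the list files, not something verified against this archive, and the sample lines are invented.

```python
def parse_list_line(line):
    """Split one list-file line into its fields (assumed layout, see above)."""
    sample_id, audio_path, duration_ms, *words = line.split()
    return {
        "id": sample_id,
        "audio": audio_path,
        "duration_ms": float(duration_ms),
        "transcript": " ".join(words),
    }


def total_hours(lines):
    """Sum the assumed duration column over non-blank lines, in hours."""
    total_ms = sum(parse_list_line(l)["duration_ms"] for l in lines if l.strip())
    return total_ms / 3_600_000


# Invented sample lines, for illustration only.
sample = [
    "utt_0001 /content/audio/utt_0001.flac 1800000 okay let us start",
    "utt_0002 /content/audio/utt_0002.flac 1800000 sounds good",
]
print(total_hours(sample))  # -> 1.0
```

On the real data, passing an open file handle for `ami_limited_supervision/train_9hr.lst` to `total_hours` should give a value close to 9 if the assumed layout holds.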

In [None]:
!rm -f /content/ami_limited_supervision.tar.gz  # -f: no error if the archive is not there yet
!wget -nv --continue -o /dev/null https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/ami_limited_supervision.tar.gz -O /content/ami_limited_supervision.tar.gz
!tar -xf /content/ami_limited_supervision.tar.gz
!ls /content/ami_limited_supervision

audio	  train_10min_0.lst  train_10min_3.lst	train_9hr.lst
dev.lst   train_10min_1.lst  train_10min_4.lst
test.lst  train_10min_2.lst  train_10min_5.lst


### Get baseline WER

In [None]:
! ./flashlight/build/bin/asr/fl_asr_test --am model.bin --datadir '' --emission_dir '' --uselexicon false \
            --test ami_limited_supervision/train_9hr.lst --tokens tokens.txt --lexicon lexicon.txt --show >> output_9hr.txt

I0422 00:39:44.398045 23136 CachingMemoryManager.cpp:114 CachingMemoryManager recyclingSizeLimit_=18446744073709551615 (16777216.00 TiB) splitSizeLimit_=18446744073709551615 (16777216.00 TiB)
I0422 00:39:45.018168  9576 Test.cpp:111] Gflags after parsing 
--flagfile=; --fromenv=; --tryfromenv=; --undefok=; --tab_completion_columns=80; --tab_completion_word=; --help=false; --helpfull=false; --helpmatch=; --helpon=; --helppackage=false; --helpshort=false; --helpxml=false; --version=false; --adambeta1=0.94999999999999996; --adambeta2=0.98999999999999999; --am=model.bin; --am_decoder_tr_dropout=0.20000000000000001; --am_decoder_tr_layerdrop=0.20000000000000001; --am_decoder_tr_layers=6; --arch=EG_GLU1x2048_S3_TR36x384_1536_DO0.05_LD0.05_CTC; --attention=keyvalue; --attentionthreshold=2147483647; --attnWindow=softPretrain; --attnconvchannel=0; --attnconvkernel=0; --attndim=0; --batching_max_duration=0; --batching_strategy=none; --batchsize=4; --beamsize=2500; --beamsizetoken=2500

The run reports a word error rate (WER) of 26.6% on the 9hr split (`train_9hr.lst`), which serves as our baseline.
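
For reference, WER is the word-level edit distance between hypothesis and reference (substitutions + deletions + insertions) divided by the number of reference words. The following is a generic sketch of that definition for a single utterance, not Flashlight's implementation; the example sentences are invented.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("starts" -> "start") and one insertion ("now"): 2 / 5
print(word_error_rate("the meeting starts at nine",
                      "the meeting start at nine now"))  # -> 0.4
```

A corpus-level figure is normally obtained by summing edit counts and reference lengths over all utterances before dividing, rather than averaging per-utterance rates.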

