
Automatic Speech Recognition using Wav2Vec2

This repository uses the wav2vec2 model from Hugging Face Transformers to build an ASR system that takes a speech signal as input and outputs transcriptions asynchronously.

I have also written a post explaining wav2vec2 in some detail, along with some further learning directions.
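
For reference, the core idea behind the inference scripts can be sketched with a few lines of Hugging Face Transformers code. This is only an illustrative example of greedy (max) CTC decoding; the model name, audio path, and overall structure are assumptions, not the repo's actual defaults.

```python
# Minimal sketch: transcribe one audio file with a pretrained wav2vec2 model
# using greedy (max) CTC decoding. The model name and file path are illustrative;
# the repo's scripts wrap this logic with argument parsing, streaming and
# optional beam-search / language-model decoding.
import torch
import torchaudio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# Load the recording and resample to the 16 kHz rate the model expects.
waveform, sample_rate = torchaudio.load("data/samples/rec.wav")
if sample_rate != 16000:
    waveform = torchaudio.transforms.Resample(sample_rate, 16000)(waveform)

inputs = processor(waveform.squeeze(0).numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy (max) decoding: take the most likely token per frame, then let the
# processor collapse repeats and strip CTC blanks.
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])
```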

Installation

Installing via pip

  • Download and install Python
  • Create a virtual environment using python -m venv env_name
  • Activate the created environment: env_path\Scripts\activate
  • Install PyTorch pip install torch==1.8.0+cu102 torchaudio===0.8.0 -f https://download.pytorch.org/whl/torch_stable.html
  • Install required dependencies pip install -r requirements.txt

Installing via conda

  • Download and install miniconda
  • Create a new virtual environment using conda create --name env_name python==3.8
  • Activate the created environment: conda activate env_name
  • Install PyTorch conda install pytorch torchaudio cudatoolkit=11.1 -c pytorch
  • Install required dependencies pip install -r requirements.txt
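
After either install path, a quick sanity check (a small sketch, not part of the repo's scripts) confirms that PyTorch, torchaudio, and the GPU are visible:

```python
# Quick environment sanity check (illustrative; not a repo script).
import torch
import torchaudio

print("torch:", torch.__version__)
print("torchaudio:", torchaudio.__version__)
print("CUDA available:", torch.cuda.is_available())  # should be True if you plan to use --device cuda
```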

Inferencing

transcribing an audio file

  • run python asr_inference_offline.py with parameters:
    • --recording or -rec : path to the audio recording to transcribe
    • --model or -m : path to a locally saved wav2vec2 CTC model; if not passed, it will be downloaded (Defaults to None)
    • --pipeline or -t : path to a locally saved wav2vec2 pipeline; if not passed, it will be downloaded (Defaults to None)
    • --output or -out : path to an output file in which to save transcriptions (optional)
    • --device or -d : device to use for inference (choices=["cpu", "cuda"], Defaults to cpu)
    • --lm or -l : path to a folder containing a trained language model with unigram and bigram files; the beam search decoder uses this language model to weight beam scores (Defaults to None)
    • --beam_width or -bw : beam width used by the beam search decoder during inference (Defaults to 1). If beam_width <= 1, max (greedy) decoding is used to decode the CTC outputs; otherwise beam search decoding is used (see the sketch after the examples below).
  • example
    • python asr_inference_offline.py --recording data/samples/rec.wav -out output/transcription.txt
    • python asr_inference_offline.py --recording data/samples/rec.wav --device cuda
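
The --beam_width behaviour described above can be illustrated with a simplified decoder sketch. This is not the repo's decoder: the blank index, scoring, and LM hook are assumptions, and real CTC beam search (prefix merging, proper LM integration) is more involved.

```python
# Simplified sketch of the two decoding modes selected by --beam_width.
# Assumptions: `log_probs` is a (time, vocab) array of CTC log-probabilities,
# blank index 0, and `lm_score` is any function scoring a token sequence.
import numpy as np

BLANK = 0

def collapse_ctc(ids):
    """Remove repeated tokens, then remove CTC blanks."""
    out, prev = [], None
    for i in ids:
        if i != prev and i != BLANK:
            out.append(i)
        prev = i
    return out

def max_decode(log_probs):
    """Greedy (max) decoding: used when beam_width <= 1."""
    return collapse_ctc(np.argmax(log_probs, axis=-1))

def beam_search_decode(log_probs, beam_width=4, lm_score=lambda seq: 0.0, lm_weight=0.5):
    """Naive beam search: keep the beam_width best frame-level paths,
    re-weighting each candidate with an optional language-model score."""
    beams = [((), 0.0)]  # (path of frame-level ids, accumulated log-prob)
    for frame in log_probs:
        candidates = []
        for path, score in beams:
            # Expand each beam with the top tokens of the current frame.
            for tok in np.argsort(frame)[-beam_width:]:
                candidates.append((path + (int(tok),), score + float(frame[tok])))
        # Rank by acoustic score plus weighted LM score of the collapsed text.
        candidates.sort(key=lambda c: c[1] + lm_weight * lm_score(collapse_ctc(c[0])), reverse=True)
        beams = candidates[:beam_width]
    return collapse_ctc(beams[0][0])
```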

transcribing streaming audio

  • run python asr_inference_recording.py with parameters:
    • --recording or -rec : path to audio recording
    • --model or -m : path to a locally saved wav2vec2 CTC model; if not passed, it will be downloaded (Defaults to None)
    • --pipeline or -t : path to a locally saved wav2vec2 pipeline; if not passed, it will be downloaded (Defaults to None)
    • --blocksize or -bs : number of samples in each audio block passed to the model (Defaults to 16000)
    • --overlap or -ov : number of samples of overlap between consecutive blocks (Defaults to 0); block handling is illustrated in the sketch after the examples below
    • --output or -out : path to an output file in which to save transcriptions (optional)
    • --device or -d : device to use for inference (choices=["cpu", "cuda"], Defaults to cpu)
    • --lm or -l : path to a folder containing a trained language model with unigram and bigram files; the beam search decoder uses this language model to weight beam scores (Defaults to None)
    • --beam_width or -bw : beam width used by the beam search decoder during inference (Defaults to 1). If beam_width <= 1, max (greedy) decoding is used to decode the CTC outputs; otherwise beam search decoding is used.
  • example
    • python asr_inference_recording.py --recording data/samples/rec.wav -bs 16000 -out output/transcription.txt
    • python asr_inference_recording.py --recording data/samples/rec.wav -bs 16000 -ov 1600 -out output/transcription.txt
    • python asr_inference_recording.py --recording data/samples/rec.wav -bs 16000 -ov 1600 -out output/transcription.txt --device cuda
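
The block-wise behaviour of --blocksize and --overlap can be sketched as follows. This is only an illustration, assuming the soundfile package and a hypothetical transcribe_block helper; the actual script's block handling may differ.

```python
# Illustrative block-wise streaming over a recording (not the repo's implementation).
# Assumes 16 kHz audio and a hypothetical `transcribe_block(samples)` helper
# that wraps the wav2vec2 model call shown earlier in this README.
import soundfile as sf

BLOCKSIZE = 16000   # --blocksize: one second of 16 kHz audio per block
OVERLAP = 1600      # --overlap: 0.1 s shared between consecutive blocks

def transcribe_block(samples):
    # Placeholder: run the wav2vec2 model on `samples` and return text.
    return "..."

for block in sf.blocks("data/samples/rec.wav", blocksize=BLOCKSIZE, overlap=OVERLAP, dtype="float32"):
    partial_text = transcribe_block(block)
    print(partial_text, end=" ", flush=True)
```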

live recording and transcribing

  • run python asr_inference_live.py with parameters:
    • --model or -m : path to a locally saved wav2vec2 CTC model; if not passed, it will be downloaded (Defaults to None)
    • --pipeline or -t : path to a locally saved wav2vec2 pipeline; if not passed, it will be downloaded (Defaults to None)
    • --blocksize or -bs : number of samples in each audio block passed to the model (Defaults to 16000)
    • --output or -out : path to an output file in which to save transcriptions (optional)
    • --device or -d : device to use for inference (choices=["cpu", "cuda"], Defaults to cpu)
    • --lm or -l : path to a folder containing a trained language model with unigram and bigram files; the beam search decoder uses this language model to weight beam scores (Defaults to None)
    • --beam_width or -bw : beam width used by the beam search decoder during inference (Defaults to 1). If beam_width <= 1, max (greedy) decoding is used to decode the CTC outputs; otherwise beam search decoding is used.
  • example
    • python asr_inference_live.py -bs 16000 -out output/transcription.txt
    • python asr_inference_live.py
    • python asr_inference_live.py --device cuda
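
A rough sketch of a live-recording loop is shown below. It assumes the sounddevice package and the same hypothetical transcribe_block helper; the repo's script additionally runs transcription asynchronously so recording is not blocked.

```python
# Illustrative live capture loop (not the repo's implementation).
# Assumes the `sounddevice` package and a hypothetical `transcribe_block` helper.
import queue
import numpy as np
import sounddevice as sd

SAMPLE_RATE = 16000
BLOCKSIZE = 16000  # --blocksize: one second of audio per model call

audio_queue = queue.Queue()

def on_audio(indata, frames, time_info, status):
    # Called by the audio driver; hand each mono block to the main loop.
    audio_queue.put(indata[:, 0].copy())

def transcribe_block(samples: np.ndarray) -> str:
    # Placeholder: run the wav2vec2 model on `samples` and return text.
    return "..."

with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, blocksize=BLOCKSIZE, callback=on_audio):
    print("Recording... press Ctrl+C to stop")
    try:
        while True:
            block = audio_queue.get()
            print(transcribe_block(block), end=" ", flush=True)
    except KeyboardInterrupt:
        pass
```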

Training Language Model

  • run python asr_inference_live.py with parameters:
    • --corpus or -c : path to corpus text file.
    • --save or -s : folder path to save model files.
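
A minimal sketch of building unigram and bigram files from a corpus text file is shown below. The output format (one JSON file per n-gram order) is an assumption; the repo's actual LM file format and training script may differ.

```python
# Illustrative unigram/bigram count model built from a corpus text file
# (assumed file format; not necessarily the repo's actual LM files).
import json
from collections import Counter
from pathlib import Path

def train_simple_lm(corpus_path: str, save_dir: str) -> None:
    unigrams, bigrams = Counter(), Counter()
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            tokens = line.lower().split()
            unigrams.update(tokens)
            bigrams.update(zip(tokens, tokens[1:]))

    out = Path(save_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "unigrams.json").write_text(json.dumps(dict(unigrams)))
    (out / "bigrams.json").write_text(json.dumps({" ".join(k): v for k, v in bigrams.items()}))

train_simple_lm("data/corpus.txt", "models/lm")
```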

Notebooks

All notebooks reside in the notebook folder; they come in handy when using Google Colab or similar platforms. All of these notebooks have been tested in Google Colab.

  • wav2vec2_asr_pretrained_inference : Basic inference notebook
  • wav2vec2_experiment_language_model : kenlm language model with beam search
  • wav2vec2large_experiment_language_model : kenlm language model with beam search for larger model
  • wav2vec2_finetuning_version_1 : finetuning notebook without augmentation
  • wav2vec2_finetuning_version_2_with_data_augmentations : finetuning notebook with augmentation
  • Training_Simple_Lanugage_Model : language model training notebook using Wikipedia data

Comparisons

GPU inference vs CPU inference

Total time taken to transcribe a 4 min 10 sec recorded audio:

  1. GPU (Nvidia GeForce 940MX) : 18.29sec
  2. CPU : 116.85sec
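
These numbers were measured on the hardware listed above. A hedged sketch of how such a CPU vs GPU comparison could be timed (not the exact benchmark script used here) is:

```python
# Illustrative timing of CPU vs GPU inference (not the exact benchmark used above).
import time
import torch

def time_inference(model, input_values: torch.Tensor, device: str) -> float:
    model = model.to(device)
    input_values = input_values.to(device)
    if device == "cuda":
        torch.cuda.synchronize()  # ensure prior GPU work is done before timing
    start = time.perf_counter()
    with torch.no_grad():
        model(input_values)
    if device == "cuda":
        torch.cuda.synchronize()  # wait for all GPU work before stopping the clock
    return time.perf_counter() - start
```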

To do list

  • Environment Setup ✔
  • Inferencing with CPU ✔
  • Inferencing with GPU ✔
  • Asyncio Compatible ✔
  • Training and Finetuning Notebooks ✔
  • Training and Finetuning Scripts
  • Converting model to TensorFlow with ONNX for inference using TensorFlow

Tested Platforms

  • Native Windows 10 ✔
  • Windows 10 WSL2 (CPU) ✔
  • Windows 10 WSL2 (GPU) ✔
  • Linux ✔

References