English speech recognition demo using [tugstugi/mongolian-speech-recognition](https://github.com/tugstugi/mongolian-speech-recognition) with an OCR network aka [CRNN](https://arxiv.org/abs/1507.05717) :)

An OCR model predicts a sequence of characters from an image. If you treat a spectrogram as an image, a speech recognition model does the same thing: it predicts a sequence of characters from an image (the spectrogram). This means you can use an OCR network for the speech recognition task. Conversely, DeepSpeech can be used for optical character recognition.
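
To make the "spectrogram as image" idea concrete, here is a minimal sketch that turns a synthetic tone into a 2D array of time frames by frequency bins. It uses a plain short-time Fourier transform with NumPy rather than the mel spectrogram this notebook computes later, and the function name `log_spectrogram` and the frame parameters are assumptions chosen for illustration.

```python
import numpy as np

def log_spectrogram(signal, frame_len=400, hop=160):
    """Return a (time frames, frequency bins) log-magnitude spectrogram."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the signal into overlapping, windowed frames
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # One FFT per frame gives one "row of pixels" per time step
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(magnitude + 1e-6)

sample_rate = 16000
t = np.arange(sample_rate) / sample_rate   # one second of audio
tone = np.sin(2 * np.pi * 440.0 * t)       # synthetic 440 Hz tone
spec = log_spectrogram(tone)
print(spec.shape)                          # (time frames, frequency bins)
```

The result is an ordinary 2D array, which is why a network designed for images of text can consume it.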

For other deep-learning Colab notebooks, visit [tugstugi/dl-colab-notebooks](https://github.com/tugstugi/dl-colab-notebooks).

## Install

Clone the project and install the dependencies:

In [1]:
import os
import time
from os.path import exists, join, basename, splitext

git_repo_url = 'https://github.com/tugstugi/mongolian-speech-recognition.git'
project_name = splitext(basename(git_repo_url))[0]
if not exists(project_name):
  !git clone -q {git_repo_url}
  !cd {project_name} && git checkout a79b916
  !cd {project_name} && pip install -q -r requirements.txt
  !pip install -q wget
  !pip install -q https://github.com/tugstugi/dl-colab-notebooks/archive/colab_utils.zip

import sys  
sys.path.append(project_name)
  
from IPython.display import Audio, display, clear_output
import ipywidgets as widgets
import numpy as np
from scipy.io import wavfile
from dl_colab_notebooks.audio import record_audio, upload_audio

Note: checking out 'a79b916'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>

HEAD is now at a79b916 fix crnn
  Building wheel for python-speech-features (setup.py) ... done
  Building wheel for python-levenshtein (setup.py) ... done
  Building wheel for Deep-Learning-Colab-Notebook-Utils (setup.py) ... done


For language model support, we also need the `ctcdecode` library:

In [0]:
if not exists('ctcdecode'):
  !git clone -q --recursive https://github.com/parlance/ctcdecode.git
  !cd ctcdecode && pip install .

## Download Model

Download the pre-trained model (16 epochs / 15 hours / 14.6% WER on LibriSpeech dev-clean) and initialize it:

In [11]:
checkpoint_file = 'checkpoint.pth'
if not exists(checkpoint_file):
  !wget -q -O {checkpoint_file} 'https://docs.google.com/uc?export=download&id=1Bt1TQD2a_RIefPW3iosa-yXtqkjehwt-'

import torch
from models.crnn import Speech2TextCRNN
from datasets.libri_speech import vocab
from datasets import *
from utils import load_checkpoint
from decoder import *
model = Speech2TextCRNN(vocab)
load_checkpoint(checkpoint_file, model, optimizer=None, use_gpu=True)
model = model.float().cuda().eval()

loaded checkpoint epoch=16 step=118816
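
The WER (word error rate) quoted above is the word-level edit distance between the reference transcript and the recognized text, divided by the number of reference words. The sketch below shows that calculation on a made-up sentence pair. The helper `word_error_rate` is hypothetical and is not the evaluation code used to produce the reported figure.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

# Illustrative sentences only: one substitution plus one deletion
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER = {wer:.1%}")
```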


Download an n-gram language model (a pruned 3-gram in gzipped ARPA format):

In [8]:
lm_model = '3-gram.pruned.1e-7.arpa'
if not exists(lm_model):
  # Keep the .gz suffix on the download so gunzip recognizes the file
  !wget -q -O {lm_model}.gz http://www.openslr.org/resources/11/3-gram.pruned.1e-7.arpa.gz
  !gunzip {lm_model}.gz
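
The language model is combined with the network's output during beam search, weighted by the `ALPHA` and `BETA` values set in the recognition cell. A common way to do this, and the one assumed here, is to add the weighted LM log-probability and a per-word bonus to the acoustic log-probability of each candidate. The sketch below illustrates that scoring rule with invented numbers. The function `fused_score` and the candidate scores are hypothetical, not values taken from `ctcdecode`.

```python
def fused_score(ctc_log_prob, lm_log_prob, word_count, alpha, beta):
    """Assumed scoring rule: acoustic score + alpha * LM score + beta per word."""
    return ctc_log_prob + alpha * lm_log_prob + beta * word_count

# Invented candidates: (text, acoustic log-prob, LM log-prob, number of words)
candidates = [
    ("i scream", -4.0, -12.0, 2),
    ("ice cream", -4.5, -6.0, 2),
]

def best_candidate(alpha, beta=1.85):
    """Return the text of the highest-scoring candidate."""
    return max(candidates,
               key=lambda c: fused_score(c[1], c[2], c[3], alpha, beta))[0]

print(best_candidate(alpha=0.0))  # acoustic score alone decides
print(best_candidate(alpha=0.3))  # the LM can overturn a small acoustic lead
```

With `alpha=0` the language model has no influence, which matches the comment on `ALPHA` in the recognition cell.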


## Record or Upload Speech

In [0]:
#@title { run: "auto" }

SAMPLE_RATE = 16000
record_or_upload = "Record" #@param ["Record", "Upload (.mp3 or .wav)"]
record_seconds = 10 #@param {type:"number", min:1, max:10, step:1}

def _recognize(audio):
  display(Audio(audio, rate=SAMPLE_RATE, autoplay=True))
  wavfile.write('test.wav', SAMPLE_RATE, (32767*audio).astype(np.int16))

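  # Load the saved WAV file and convert it to a resized mel spectrogram
  audio = Compose([LoadAudio(), ComputeMelSpectrogram(), ResizeMelSpectrogram()])({'fname': 'test.wav', 'text':''})['input']
  torch.set_grad_enabled(False)  # inference only, no gradients needed
  inputs = torch.from_numpy(audio).unsqueeze(0)  # add a batch dimension
  inputs = inputs.permute(0, 2, 1).cuda()
  outputs = model(inputs)
  # Normalize the last axis to probabilities and reorder the axes for the decoders
  outputs = outputs.softmax(2).permute(1, 0, 2)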
  greedy_decoder = GreedyDecoder(labels=vocab)
  decoded_output, _ = greedy_decoder.decode(outputs)
  print("\nwithout LM:")
  print(decoded_output[0][0])
  print("\n")

  ALPHA = 0.3  # LM weight: 0 disables the LM, larger values trust the LM more
  BETA = 1.85  # less important; uses the DeepSpeech default
  beam_ctc_decoder = BeamCTCDecoder(labels='$' + vocab[1:].upper(), num_processes=4,
                                            lm_path=lm_model,
                                            alpha=ALPHA, beta=BETA,
                                            cutoff_top_n=40, cutoff_prob=1.0, beam_width=1000)
  decoded_output, _ = beam_ctc_decoder.decode(outputs)
  print("with LM:")
  print(decoded_output[0][0].lower())


def _record_audio(b):
  clear_output()
  audio = record_audio(record_seconds, sample_rate=SAMPLE_RATE)
  _recognize(audio)
def _upload_audio(b):
  clear_output()
  audio = upload_audio(sample_rate=SAMPLE_RATE)
  _recognize(audio)

if record_or_upload == "Record":
  button = widgets.Button(description="Record Speech")
  button.on_click(_record_audio)
  display(button)
else:
  try:
    _upload_audio("")
  except TypeError:
    print("uploading failed")
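
The "without LM" result above comes from greedy CTC decoding. As a rough sketch of what such a decoder does: take the most likely symbol in each frame, collapse consecutive repeats, then drop the blank symbol. The function `ctc_greedy_decode`, the three-symbol label set, and the blank at index 0 are assumptions for illustration. This is not the `GreedyDecoder` implementation from the repository.

```python
def ctc_greedy_decode(frame_probs, labels, blank_index=0):
    """Collapse the per-frame argmax path into text using the CTC rules."""
    # Most likely label index for every time frame
    best_path = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    chars = []
    previous = None
    for index in best_path:
        # Skip repeats of the previous frame's label and skip blanks
        if index != previous and index != blank_index:
            chars.append(labels[index])
        previous = index
    return "".join(chars)

labels = "_ab"  # hypothetical vocabulary, index 0 is the CTC blank
frame_probs = [
    [0.10, 0.80, 0.10],  # a
    [0.20, 0.70, 0.10],  # a (repeat, collapsed)
    [0.90, 0.05, 0.05],  # blank (separates the two a's)
    [0.10, 0.80, 0.10],  # a
    [0.10, 0.10, 0.80],  # b
    [0.10, 0.20, 0.70],  # b (repeat, collapsed)
    [0.80, 0.10, 0.10],  # blank
]
decoded = ctc_greedy_decode(frame_probs, labels)
print(decoded)
```

The blank between the two `a` frames is what allows a doubled letter to survive the collapsing step.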