# Voice transcription (inference) using DeepSpeech

Just a fun additional experiment where I convert audio to text using the pre-trained DeepSpeech model by Mozilla. Some quick tests yielded an impressive word error rate of about 3%. 

There is also a wrapper function at the end that allows you to record your own voice from this notebook and have the model perform inference on it in real time. 

In [1]:
!pip install pydrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# 1. Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)



In [2]:
!sudo apt-get install python-dev
!apt -qq install -y sox
!pip install librosa
!pip install soundfile
!pip3 install deepspeech
!pip install jiwer
!pip install wave
!pip install ffmpeg-python
!pip install pydub
!pip install sounddevice
!sudo apt-get install libcdio-dev libcdio-paranoia-dev

Reading package lists... Done
Building dependency tree       
Reading state information... Done
python-dev is already the newest version (2.7.15~rc1-1).
0 upgraded, 0 newly installed, 0 to remove and 21 not upgraded.
The following additional packages will be installed:
  libmagic-mgc libmagic1 libopencore-amrnb0 libopencore-amrwb0 libsox-fmt-alsa
  libsox-fmt-base libsox3
Suggested packages:
  file libsox-fmt-all
The following NEW packages will be installed:
  libmagic-mgc libmagic1 libopencore-amrnb0 libopencore-amrwb0 libsox-fmt-alsa
  libsox-fmt-base libsox3 sox
0 upgraded, 8 newly installed, 0 to remove and 21 not upgraded.
Need to get 760 kB of archives.
After this operation, 6,717 kB of additional disk space will be used.
Selecting previously unselected package libopencore-amrnb0:amd64.
(Reading database ... 144611 files and directories currently installed.)
Preparing to unpack .../0-libopencore-amrnb0_0.1.3-2.1_amd64.deb ...
Unpacking libopencore-amrnb0:amd64 (0.1.3-2.1) ...
Sel

# DeepSpeech

DeepSpeech is a state-of-the-art speech recognition system originally proposed in Baidu's Deepspeech paper. It is notable for being one of the first end-to-end trained speech recognition models without the use of handcrafted expert features and beating the word error rate (WER) of previously published results on several test sets. 

It is comprised of two components: a RNN model that predicts letters successively from an audio clip, and a subsequent language model that is trained to correct the output of the RNN to more plausible sequences, due to the nosiy and dynamic distribution of audio. 

To overcome the problem of there being no fixed mapping from input (audio sequence) to output (text, variable length), the paper uses the Connectionist Temporal Classifcation (CTC) loss to train the RNN. 

Essentially, the audio sequence is split into infitestimally brief timesteps, and the model predicts an output for every short window of samples it is given, which can include null characters. The model discovers the possibilities of all valid distributions of the output sequence using a dynamic programming approach. Finally, the model marginalizes over similar distributions with beam search in order to produce the final possibilities of the output sequence. 



In [3]:
%%bash 
# Install DeepSpeech
pip3 install deepspeech

# Download pre-trained English model files
curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.8.1/deepspeech-0.8.1-models.pbmm
curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.8.1/deepspeech-0.8.1-models.scorer

# Download example audio files
curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.8.1/audio-0.8.1.tar.gz
tar xvf audio-0.8.1.tar.gz

# Transcribe an audio file
deepspeech --model deepspeech-0.8.1-models.pbmm --scorer deepspeech-0.8.1-models.scorer --audio audio/2830-3980-0043.wav 

audio/
audio/2830-3980-0043.wav
audio/Attribution.txt
audio/4507-16021-0012.wav
audio/8455-210777-0068.wav
audio/License.txt
experience proves this


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0100   652  100   652    0     0   2990      0 --:--:-- --:--:-- --:--:--  2990
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0  4  180M    4 7458k    0     0  6009k      0  0:00:30  0:00:01  0:00:29 7384k 28  180M   28 50.9M    0     0  22.6M      0  0:00:07  0:00:02  0:00:05 25.2M 52  180M   52 93.7M    0     0  28.8M      0  0:00:06  0:00:03  0:00:03 31.0M 64  180M   64  115M    0     0  27.2M      0  0:00:06  0:00:04  0:00:02 28.8M 87  180M   87  157M    0     0  30.0M      0  0:00:05  0:00:05 --:--:-- 31.4M100  180M  100  180M    0     0  31.6M      0  0:00:05  0:00:05 --:--:-- 38.8M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   

In [4]:
import pandas as pd
import pydub
clip_info = pd.read_csv('/content/drive/My Drive/clip_info_frontier.csv')

In [5]:
%%time
!cp "/content/drive/My Drive/speaker_audio_wavs.zip" "speaker_audio_wavs.zip" 
!unzip -q "speaker_audio_wavs.zip"

CPU times: user 1.09 s, sys: 183 ms, total: 1.27 s
Wall time: 8min 16s


In [6]:
!pip install deepspeech
import deepspeech
from scipy.io import wavfile
from jiwer import wer



In [7]:

model = deepspeech.Model("deepspeech-0.8.1-models.pbmm")
model.enableExternalScorer("deepspeech-0.8.1-models.scorer")

In [11]:
import soundfile as sf

data, samplerate = sf.read("/content/content/speaker_audio_wavs/3664/3664-11714-0019.wav", dtype='int16')
pred = model.stt(data)
print(pred)


he replied with a groan which proved the monk's memory to be only too true then at last when he had finished the renzo asked in a doubtful tone then do you believe my father that god will forgive me everything both my sins and my crimes


In [13]:
## Testing out total word error with our LibriSpeech corpus 
## An approximate experiment with 100 samples had an impressive error rate of 3%. 

total_wer = 0.0
for i in range(len(clip_info)):
  row = clip_info.iloc[i]
  data, samplerate =  sf.read(row.wav_path,  dtype='int16')
  pred = model.stt(data)
  total_wer += wer(pred, row.normalized_content)
  if i == 100:
    break

print(total_wer/(i+1))

0.02990148597035311


# Try it out yourself! 

A wrapper function to perform inference on your own pre-recorded audio file. 

In [49]:
"""
To write this piece of code I took inspiration/code from a lot of places.
It was late night, so I'm not sure how much I created or just copied o.O
Here are some of the possible references:
https://blog.addpipe.com/recording-audio-in-the-browser-using-pure-html5-and-minimal-javascript/
https://stackoverflow.com/a/18650249
https://hacks.mozilla.org/2014/06/easy-audio-capture-with-the-mediarecorder-api/
https://air.ghost.io/recording-to-an-audio-file-using-html5-and-js/
https://stackoverflow.com/a/49019356
"""
from IPython.display import HTML, Audio
from google.colab.output import eval_js
from base64 import b64decode
import numpy as np
from scipy.io.wavfile import read as wav_read
import io
import scipy
import ffmpeg
import random
import string



AUDIO_HTML = """
<script>
var my_div = document.createElement("DIV");
var my_p = document.createElement("P");
var my_btn = document.createElement("BUTTON");
var t = document.createTextNode("Press to start recording");

my_btn.appendChild(t);
//my_p.appendChild(my_btn);
my_div.appendChild(my_btn);
document.body.appendChild(my_div);

var base64data = 0;
var reader;
var recorder, gumStream;
var recordButton = my_btn;

var handleSuccess = function(stream) {
  gumStream = stream;
  var options = {
    //bitsPerSecond: 8000, //chrome seems to ignore, always 48k
    mimeType : 'audio/webm;codecs=opus'
    //mimeType : 'audio/webm;codecs=pcm'
  };            
  //recorder = new MediaRecorder(stream, options);
  recorder = new MediaRecorder(stream);
  recorder.ondataavailable = function(e) {            
    var url = URL.createObjectURL(e.data);
    var preview = document.createElement('audio');
    preview.controls = true;
    preview.src = url;
    document.body.appendChild(preview);

    reader = new FileReader();
    reader.readAsDataURL(e.data); 
    reader.onloadend = function() {
      base64data = reader.result;
      //console.log("Inside FileReader:" + base64data);
    }
  };
  recorder.start();
  };

recordButton.innerText = "Recording... press to stop";

navigator.mediaDevices.getUserMedia({audio: true}).then(handleSuccess);


function toggleRecording() {
  if (recorder && recorder.state == "recording") {
      recorder.stop();
      gumStream.getAudioTracks()[0].stop();
      recordButton.innerText = "Saving the recording... pls wait!"
  }else if (recorder && recorder.state != "recording"){
    recorder.start()
    gumStream.getAudioTracks()[0].start()
    recordButton.innerText = "Press button to start recording"
  }
}

// https://stackoverflow.com/a/951057
function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

var data = new Promise(resolve=>{
//recordButton.addEventListener("click", toggleRecording);
recordButton.onclick = ()=>{
toggleRecording()

sleep(2000).then(() => {
  // wait 2000ms for the data to be available...
  // ideally this should use something like await...
  //console.log("Inside data:" + base64data)
  recordButton.innerText = "Saved!"
  resolve(base64data.toString())

});

}
});
      
</script>
"""



def get_audio():
  display(HTML(AUDIO_HTML))
  data = eval_js("data")
  binary = b64decode(data.split(',')[1])
  
  process = (ffmpeg
    .input('pipe:0')
    .output('pipe:1', format='wav')
    .run_async(pipe_stdin=True, pipe_stdout=True, pipe_stderr=True, quiet=True, overwrite_output=True)
  )
  output, err = process.communicate(input=binary)
  
  riff_chunk_size = len(output) - 8
  # Break up the chunk size into four bytes, held in b.
  q = riff_chunk_size
  b = []
  for i in range(4):
      q, r = divmod(q, 256)
      b.append(r)

  # Replace bytes 4:8 in proc.stdout with the actual size of the RIFF chunk.
  riff = output[:4] + bytes(b) + output[8:]

  sr, audio = wav_read(io.BytesIO(riff))

  fn = ''.join(random.SystemRandom().choice(string.ascii_uppercase + string.digits) for _ in range(12))
  newfn = fn + '.wav'
  
  scipy.io.wavfile.write(newfn, sr, audio)

  mysound = AudioSegment.from_wav(newfn)
  mysound = mysound.set_channels(1)
  mysound.export(newfn, format="wav")
  
  
  print("File saved as " + fn + ".wav")

  
  return audio, sr, newfn

def return_predictions():
  audio, sr, newfn = get_audio()
  data, samplerate =  sf.read(newfn,  dtype='int16')
  print(data.shape)
  pred = model.stt(data)
  print(pred)


In [50]:
## Run this cell to record your voice from the notebook.
## Your browser will ask you permission for 
return_predictions()

File saved as 9CEGOKCCY5HJ.wav
(130488,)
oh hush oh


In [None]:
%%time
'''
#root = "/content/drive/My Drive/speaker_audio_corpus/LibriSpeech/train-clean-100/"

target_root = "/content/speaker_audio_wavs"
for i in range(len(clip_info)):
  row = clip_info.iloc[i]
  if pd.isnull(row.wav_path):
    class_name = str(row.speaker_id)
    id = row.id

    if not os.path.exists(os.path.join(target_root, class_name)):
      os.mkdir(os.path.join(target_root, class_name))


    wav_path = os.path.join(os.path.join(target_root, class_name), id + ".wav")
    
    flac_path = row.clip_path

    song = AudioSegment.from_file(flac_path, "flac")
    song.export(wav_path,format = "wav") 
    clip_info.loc[i, "wav_path"] = wav_path
    if i % 10000 == 0:
      print(i)

!zip  -r "speaker_audio_wavs.zip" "/content/speaker_audio_wavs"

import os
i = 0
for root, _, fns in os.walk("/content/content/speaker_audio_wavs"):
  for fn in fns:
    id = fn.split('/')[-1][:-4]
    path = os.path.join(root, fn)
    clip_info.loc[clip_info.id == id, 'wav_path'] = path

import string

for i in range(len(clip_info)):

  con = clip_info.iloc[i].content
  clip_info.loc[i,'normalized_content'] = con.translate(str.maketrans('', '', string.punctuation)).lower()    

clip_info.to_csv('clip_info_frontier.csv', index=False)
'''

20000
CPU times: user 1min 26s, sys: 2min 24s, total: 3min 51s
Wall time: 2h 4min 5s
