### Necessary files

This is the audio package Riley mentioned. It's pretty easy to use, so I thought it was worth a shot. You'll need Vosk (`pip install vosk`). This is full-on speech recognition, but it seemed easier to start here than to try to distinguish signals directly.

Additionally, you'll need a neural network model, ideally the small English model, which can be found at this [link](https://alphacephei.com/vosk/models).

In [50]:
# relevant packages
from vosk import Model, KaldiRecognizer, SetLogLevel
import sys
import os
import wave
import subprocess
import json
import numpy as np

SetLogLevel(0)

# wav2str below loads the model from the "modelsmall" folder, so check for that name
if not os.path.exists("modelsmall"):
    print("Please download the model from https://alphacephei.com/vosk/models and unpack as 'modelsmall' in the current folder.")
    sys.exit(1)

### Convert input to string

The following is the Vosk code to convert our wav files to a list of strings. It assumes the input is a single channel with a sampling rate of 16000 Hz, already converted out of Vox ADPCM (just to WAV with 16-bit PCM). It does fairly well, but make sure to apply a low-pass filter to the audio first. I also added my code for computing the edit distance, because the transcription will certainly have errors and the edit distance is a good way of comparing results.

In [67]:
# convert audio from a wave file, e.g. "example.wav", to a list of recognized words
def wav2str(filename):
    sample_rate=16000
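    # folder name of the unpacked Vosk model
    model = Model("modelsmall")
    rec = KaldiRecognizer(model, sample_rate)
    # ask for per-word results; newer Vosk versions may omit the 'result' list otherwise
    rec.SetWords(True)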

    wf = wave.open(filename, "rb")
    if wf.getnchannels() != 1 or wf.getsampwidth() != 2 or wf.getcomptype() != "NONE":
        print("Audio file must be WAV format mono PCM.")
        sys.exit(1)

    results = []
    while True:
        data = wf.readframes(4000)
        if len(data) == 0:
            break
        if rec.AcceptWaveform(data):
            results.append(rec.Result())
    results.append(rec.FinalResult())

    # collect the recognized words from each JSON result, in order
    strings = []
    for res in results:
        jres = json.loads(res)
        if 'result' not in jres:
            continue
        for word in jres['result']:
            strings.append(word['word'])

    return strings

# word-level edit (Levenshtein) distance between two lists
def edit_dist(A, B):
    if len(A) <= len(B):                 # name the lists by length so each row has len(shorter) + 1 entries
        shorter,longer = A,B
    else:
        shorter,longer = B,A

    a = np.zeros((2,len(shorter) + 1), dtype=int) # matrix of values
    
    # get the first row
    for i in range(len(shorter)+1):
        a[0][i] = i                      # 0th row
    
    # get the rest of the rows
    for j in range(1,len(longer)+1):
        a[1][0] = j                          # first column
        for i in range(1,len(shorter)+1):
            a[1][i] = min([a[0][i-1] + (longer[j-1] != shorter[i-1]),
                           a[0][i] + 1,
                           a[1][i-1] + 1])
        a[0] = a[1]                          # push row back
    
    return(a[0][len(shorter)])           # return last value
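
The input assumptions and the low-pass step mentioned above can be sketched roughly as follows. This is not the filter actually used on the recordings: `check_wav`, `lowpass_fir`, `filter_wav`, the 3400 Hz cutoff, and the 101-tap length are all assumptions picked for illustration, and the filter is a plain windowed-sinc FIR built with NumPy.

```python
import wave
import numpy as np

def check_wav(path, expected_rate=16000):
    """Return a list of reasons why the file does not match what wav2str assumes."""
    problems = []
    with wave.open(path, "rb") as wf:
        if wf.getnchannels() != 1:
            problems.append("not mono")
        if wf.getsampwidth() != 2:
            problems.append("not 16-bit")
        if wf.getcomptype() != "NONE":
            problems.append("not uncompressed PCM")
        if wf.getframerate() != expected_rate:
            problems.append("sample rate is %d, expected %d"
                            % (wf.getframerate(), expected_rate))
    return problems

def lowpass_fir(samples, sample_rate=16000, cutoff_hz=3400.0, num_taps=101):
    """Windowed-sinc FIR low-pass filter for int16 samples.

    Assumes the clip is longer than num_taps samples.
    """
    fc = cutoff_hz / sample_rate                   # cutoff in cycles per sample
    n = np.arange(num_taps) - (num_taps - 1) / 2   # tap indices centered on zero
    taps = 2 * fc * np.sinc(2 * fc * n)            # ideal low-pass impulse response
    taps *= np.hamming(num_taps)                   # window to limit ripple
    taps /= taps.sum()                             # unity gain at DC
    filtered = np.convolve(samples.astype(np.float64), taps, mode="same")
    return np.clip(np.round(filtered), -32768, 32767).astype(np.int16)

def filter_wav(src, dst, cutoff_hz=3400.0):
    """Low-pass filter a mono 16-bit WAV file and write the result to dst."""
    with wave.open(src, "rb") as wf:
        params = wf.getparams()
        samples = np.frombuffer(wf.readframes(wf.getnframes()), dtype="<i2")
    filtered = lowpass_fir(samples, params.framerate, cutoff_hz)
    with wave.open(dst, "wb") as out:
        out.setparams(params)
        out.writeframes(filtered.astype("<i2").tobytes())
```

A possible workflow is to call `check_wav` on each recording, run `filter_wav` if the list of problems is empty, and pass the filtered file to `wav2str`. The cutoff is worth tuning against the actual radio audio.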

In [68]:
one = wav2str('Recordings/tsent1.wav')
two = wav2str('Recordings/tsent2.wav')
three = wav2str('Recordings/tsent3.wav')
# one and two are close, three is perfect:
print(one,two,three)
X = [one, two, three ]
for i in range(len(X)):
    for j in range(len(X)):
        print(edit_dist(X[i],X[j]))

['sentence', 'number', 'one'] ['a', 'sentence', 'number', 'two'] ['test', 'sentence', 'number', 'three']
0
2
2
2
0
2
2
2
0
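
The pairwise distances above compare transcripts with each other. To score a single transcript, the same kind of distance can be divided by the length of a reference sentence, which gives a word error rate. The sketch below is self-contained, so it carries its own small distance helper instead of calling `edit_dist`. The names `word_distance` and `word_error_rate` are made up here, and the reference sentence is an assumption about what was spoken, inferred from the outputs above.

```python
def word_distance(ref, hyp):
    """Word-level Levenshtein distance, keeping only two rows of the table."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j - 1] + (r != h),  # substitute (or match)
                           prev[j] + 1,             # word missing from hyp
                           cur[j - 1] + 1))         # extra word in hyp
        prev = cur
    return prev[-1]

def word_error_rate(ref, hyp):
    """Edit distance normalized by the number of reference words."""
    if not ref:
        raise ValueError("reference must not be empty")
    return word_distance(ref, hyp) / len(ref)

# Assumed reference: the spoken sentence is not recorded in this notebook.
ref_one = "test sentence number one".split()
print(word_error_rate(ref_one, ['sentence', 'number', 'one']))  # 0.25
```

With this normalization, a dropped word in a four-word sentence and a dropped word in a longer sentence are no longer scored the same.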


In [76]:
# me saying 'test command 1' through TBS2

wav2str('Recordings/tbs2test1-1.wav'),wav2str('Recordings/tbs2test1-2.wav')

(['one'], ['command', 'one'])
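
Both TBS2 results keep the right words in the right order and only drop some of them. One way to use that, sketched here under assumptions, is to look for known commands that contain the recognized words as an in-order subsequence. The `COMMANDS` list and the `matching_commands` helper are hypothetical and only illustrate the idea.

```python
COMMANDS = [  # hypothetical command vocabulary, for illustration only
    "test command one",
    "test command two",
    "test command three",
]

def is_subsequence(words, target):
    """True if every word appears in target, in the same order (gaps allowed)."""
    it = iter(target)
    # 'w in it' advances the iterator until w is found, so order is preserved
    return all(w in it for w in words)

def matching_commands(words, commands=COMMANDS):
    """Return the commands that the recognized words could be a partial copy of."""
    return [cmd for cmd in commands if is_subsequence(words, cmd.split())]

print(matching_commands(['one']))             # ['test command one']
print(matching_commands(['command', 'one']))  # ['test command one']
```

An empty transcript matches every command, and a substituted word matches none, so this would need to be combined with the edit distance when the recognizer mishears words rather than dropping them.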