In this notebook we begin exploring how to listen continuously for audio signals, which we can then pass through our trained neural net in order to recognize specific noises and act on them.

# Learning to use sounddevice: minimal working example

A couple of instructive examples that use [pyaudio](https://people.csail.mit.edu/hubert/pyaudio/) can be found here:
* https://github.com/swharden/Python-GUI-examples/blob/master/2016-07-37_qt_audio_monitor/SWHear.py
* https://github.com/chaosparrot/parrot.py/blob/master/lib/listen.py

The latter is a more complex example, but is part of a project similar to this one, and so may be particularly insightful. We will postpone trying to parse it until we need to, however. In particular, first let's understand the basics, and construct a minimal working example.

An alternative library, which seems just as powerful but is much better documented, is [sounddevice](https://python-sounddevice.readthedocs.io/en/0.3.14/). It also has the benefit of outputting numpy arrays by default, which will save us some processing. We'll try to use sounddevice, but keep pyaudio in mind as a fallback option.

So let's construct a minimal working example. Here we adapt and strip down one of the examples at https://python-sounddevice.readthedocs.io/en/0.3.14/examples.html: a command-line script that shows a text-mode spectrogram using live microphone data. Our version listens for five seconds, printing the maximum and minimum amplitude of each 50 ms block.

In [1]:
import sounddevice as sd
from IPython.display import clear_output
import time

class args:
    block_duration = 50 # ms
    device = 2 # select the microphone. Use sd.query_devices() to see options

samplerate = sd.query_devices(args.device, 'input')['default_samplerate']

def callback(indata, frames, time_pa, status): # 'time_pa' avoids shadowing the time module
    if status:
        print('STATUS: ', str(status))
    if any(indata):
        # dynamically print the max and min values
        clear_output(wait=True) # this sometimes takes too long, causing input overflows
        print(indata.max())
        print(indata.min())
    else:
        print('no input')

start = time.time()
with sd.InputStream(device=args.device, channels=1, callback=callback,
                    blocksize=int(samplerate * args.block_duration / 1000),
                    samplerate=samplerate):
    while True:
        # listen for five seconds
        if time.time() - start > 5:
            break

0.010467529
-0.009124756
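
To see what the callback receives without needing a microphone, the block arithmetic can be reproduced offline. This is only an illustrative sketch: the 44100 Hz sample rate, the synthetic sine wave, and the helper name `block_peaks` are assumptions, not part of the notebook.

```python
import math

def block_peaks(signal, samplerate, block_duration_ms):
    """Split `signal` into fixed-size blocks and return (max, min) per block."""
    # Same conversion as the `blocksize` argument passed to sd.InputStream above.
    blocksize = int(samplerate * block_duration_ms / 1000)
    peaks = []
    # Only complete blocks are considered; a trailing partial block is dropped.
    for start in range(0, len(signal) - blocksize + 1, blocksize):
        block = signal[start:start + blocksize]
        peaks.append((max(block), min(block)))
    return blocksize, peaks

assumed_samplerate = 44100  # assumption; the notebook reads this from the device
# One second of a quiet 440 Hz tone standing in for microphone input.
fake_signal = [0.01 * math.sin(2 * math.pi * 440 * n / assumed_samplerate)
               for n in range(assumed_samplerate)]

frames_per_block, peaks = block_peaks(fake_signal, assumed_samplerate, 50)
print(frames_per_block)  # frames in each 50 ms block
print(len(peaks))        # number of complete blocks in one second
```

Each `(max, min)` pair corresponds to the two values the callback prints for one block.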


Let's write a second example to measure how quickly I can make distinct sounds. Each sound is going to take at least 0.06 s to record and process, so let's hope I can't go faster than that. ... Well, it looks like I easily can, at least with some sounds. For now we'll just hope that users go slowly enough, and figure out how to handle over-rapid noisemaking later.

In [14]:
THRESHOLD_ABSOLUTE = 0.005 # ignore any spikes that don't rise above this

last_sound = time.time()

def callback(indata, frames, time_pa, status):
    global last_sound
    if status:
        print('STATUS: ', str(status))
    if any(indata):
        if indata.max() > THRESHOLD_ABSOLUTE:
            new_sound = time.time()
            print(new_sound - last_sound)
            last_sound = new_sound
    else:
        print('no input')

start = time.time()
with sd.InputStream(device=args.device, channels=1, callback=callback,
                    blocksize=int(samplerate * args.block_duration / 1000),
                    samplerate=samplerate):
    while True:
        # listen for one second
        if time.time() - start > 1:
            break

0.11463308334350586
0.09275507926940918
0.005228996276855469
0.0876607894897461
0.005400180816650391
0.08743906021118164
0.015585660934448242
0.07728719711303711
0.01627182960510254
0.07467126846313477
0.04583382606506348
0.05915117263793945
0.035568952560424805
0.04714703559875488
0.09284710884094238
0.02297663688659668
0.07494711875915527
0.040576934814453125


# Loading and testing the model

Before we proceed to real-time processing with noise recognition, let's see that we can load the model we trained in the Exploration 2 notebook, and successfully apply it to a 28x14 tensor.

We need to copy and paste the class definition here, and then load the parameters:

In [4]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self, image_size, N_noises):
        super(Net, self).__init__()
        
        # image_size is a 3-tuple (channels, height, width): the expected dimensions of each spectrogram
        channels, h, w = image_size
        
        # (square) kernel size for each convolution layer, and the pool size,
        # assuming the stride for pooling is the same as the pool size
        kernels = [3, 3]
        pool = 2
        
        # compute the number of input nodes for the first dense layer
        h_out, w_out = h, w
        for k in kernels:
            # the convolution.
            h_out += -k + 1
            w_out += -k + 1
            
            # the pool. (from help(torch.nn.MaxPool2d))
            h_out = int( (h_out - pool) / pool + 1 )
            w_out = int( (w_out - pool) / pool + 1 )
            
        self.image_out = h_out * w_out
        
        # define the layers. The numbers of nodes chosen do not have deep thought behind them.
        self.conv0 = nn.Conv2d(1, 32, kernels[0])
        self.pool = nn.MaxPool2d(2)
        self.conv1 = nn.Conv2d(32, 10, kernels[1])
        self.fc0 = nn.Linear(10 * self.image_out, 50)
        self.fc1 = nn.Linear(50, 10)
        # number of output nodes for final dense layer: the number of noise types        
        self.fc2 = nn.Linear(10, N_noises)
        
    def forward(self, x):
        x = self.pool(F.relu(self.conv0(x)))
        x = self.pool(F.relu(self.conv1(x)))
        x = x.view(-1, 10 * self.image_out)
        x = F.relu(self.fc0(x))
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x


# Initialize the model
model = Net(torch.Size([1, 28, 14]), 15)

# Now load the parameters
PATH2 = './trained_models/14_noises_60ms_model_params.pth'
model.load_state_dict(torch.load(PATH2))

<All keys matched successfully>
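
As a sanity check on the size bookkeeping in `Net.__init__`, the same arithmetic can be run by hand for a 28x14 input. The helper name `conv_pool_output_size` is hypothetical; the sketch assumes unpadded, stride-1 convolutions and a pooling stride equal to the pool size, as in the class above.

```python
def conv_pool_output_size(h, w, kernels=(3, 3), pool=2):
    """Mirror the loop in Net.__init__ that tracks the feature-map size."""
    for k in kernels:
        # A k x k convolution without padding trims k - 1 from each dimension.
        h, w = h - k + 1, w - k + 1
        # Max pooling with stride equal to the pool size, rounding down.
        h, w = (h - pool) // pool + 1, (w - pool) // pool + 1
    return h, w

h_out, w_out = conv_pool_output_size(28, 14)
print(h_out, w_out)        # 5 2
print(10 * h_out * w_out)  # 100 inputs to fc0 (conv1 has 10 output channels)
```

The height goes 28 → 26 → 13 → 11 → 5 and the width 14 → 12 → 6 → 4 → 2, so `fc0` receives 10 × 5 × 2 = 100 values per sample.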

In [5]:
# And get the dictionary of noise labels
noise_int_to_str = {
    0: 't',
    1: 'p',
    2: 'k',
    3: 'ch',
    4: 'ts',
    5: 'ps',
    6: 'ks',
    7: 'chsh',
    8: 'tf',
    9: 'pf',
    10: 'kf',
    11: 'chf',
    12: 'forward-tsk',
    13: 'side-cluck',
    14: 'lip-open-pop'
}

In [6]:
%%time

# Test model on dummy data.
# Remember that the model expects a batch of inputs, so the first
# dimension is the batch size (the second is the number of channels)

foo = torch.rand(1,1,28,14) 
output = model(foo)
energy, label = [ x.item() for x in torch.max(output.data, 1) ]
print(energy)
print(label)
print(noise_int_to_str[label])

1.2102808952331543
13
side-cluck
CPU times: user 2.61 ms, sys: 1.26 ms, total: 3.87 ms
Wall time: 2.72 ms


As an interesting observation here, it seems that random spectrograms pretty much always produce 'side-cluck'.

# Using the neural net for real-time recognition

Now let's adapt this basic template to listen for spikes in volume, record for a sufficient period, generate a spectrogram, classify the sound with the neural net we trained, and print the result.

At the moment, this is fairly fragile, in the sense that we have to carefully process this audio in the same way that we processed audio for the neural network, in "Exploration 2 - many noises - cleaning and training.ipynb" (we saved the resulting network as ./trained_models/14_noises_net1.pth). To this end, we have carefully copied over here some key parameters. This is sufficient for now, but ultimately we will want to make this more robust, especially to allow for varying the neural network approach without having to manually synchronize this code to match.

There are three key features that we should preserve, since these were used in training the net:
* Each audio sample should produce a Mel Spectrogram that is 28x14 in resolution. The 28 is the number of mel filterbanks, which is easily specified. The 14 is set by the duration of the sample analyzed.
* The samples should be identified with the percussive sound starting near the beginning of the sample.
* The samples should be normalized by dividing out the mean amplitude.

The first feature is the most important, because otherwise the net will throw an error. We think the '14' dimension comes about as follows: There were AFTER * frame_rate = 3 * 0.02 * 44100 = 2646 frames, grouped into windows of width 400, with each window a hop of 200 frames over from the one before, hence ceil(2646 / 200) = 14. These are the default values for the window width and hop, and can be found here: https://pytorch.org/audio/transforms.html#melspectrogram. We will just try to gather the same number of frames per sample to be analyzed.
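
The frame-count estimate above can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: a 44100 Hz sample rate, and the window and hop values quoted in the text. The `centered` formula reflects our understanding of how a centre-padded STFT counts frames; it is an assumption, not something verified against the training notebook.

```python
import math

BATCH_DURATION = 0.02  # seconds per batch, as in the notebook
N_BATCHES = 3          # AFTER = 3 * BATCH_DURATION
SAMPLERATE = 44100     # assumed device sample rate
HOP = 200              # hop between windows, as quoted in the text

frames_per_batch = int(SAMPLERATE * BATCH_DURATION)
n_samples = N_BATCHES * frames_per_batch

estimate = math.ceil(n_samples / HOP)  # the estimate used in the text
centered = 1 + n_samples // HOP        # assumption: centre-padded STFT frame count

print(n_samples, estimate, centered)   # 2646 14 14
```

The two formulas agree here, but they differ when the number of samples is an exact multiple of the hop, so a different sample rate or batch duration could change the spectrogram width.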

In [37]:
import sounddevice as sd
from IPython.display import clear_output
import time
import torch
import torchaudio
import numpy as np

########### parameters ###########

device = 0 # select the microphone. Use sd.query_devices() to see options

# These are key variables and quantities we used in training the network.
BATCH_DURATION = 0.02 # look at BATCH_DURATION (seconds) at a time
THRESHOLD_MULTIPLIER = 5 # detect a spike when the next batch is at least THRESHOLD_MULTIPLIER times bigger
# AFTER  = 3 * BATCH_DURATION # the time (sec) to look after the spike location
n_mels = 28 # the number of mel filterbanks in the spectrogram
# --------------------

THRESHOLD_ABSOLUTE = 0.005 # ignore any spikes that don't rise above this

samplerate = sd.query_devices(device, 'input')['default_samplerate']

block_duration = BATCH_DURATION # sec
blocksize = int(samplerate * block_duration) # convert to frames

#################################

# Lots of debugging time checks in here right now
def get_prediction(recording):
    """ Build the spectrogram and use our model to recognize the noise """
    
    obs_data = torch.from_numpy(recording) / recording.mean()
    
    now = time.time()
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=samplerate, n_mels=n_mels)(obs_data).log2()
    print('MelSpectrogram', time.time() - now)

    # change from torch.Size([28, 14]) to torch.Size([1, 1, 28, 14])
    mel = mel[None, None, :, :]
    output = model(mel)
    energy, label = torch.max(output.data, 1)
    return noise_int_to_str[label.item()]

def act_on_noise(noise_heard):
    print(noise_heard)

# bundling these is easier than declaring them 'global' in callback
class listen:
    prev_max = 1
    sound_detected = False
    batches_to_collect = 0
    recording = None
    start = 0
    end = 0
    
def callback(indata, frames, time_pa, status):
    if status:
        print('STATUS: ', str(status))
    if any(indata):
        new_max = np.absolute(indata).max()
                
        if listen.sound_detected:
            if listen.batches_to_collect > 0:
                listen.recording = np.append( listen.recording, indata )
                listen.batches_to_collect -= 1
            else:
                noise_heard = get_prediction(listen.recording)
                act_on_noise(noise_heard)
                listen.sound_detected = False
                listen.end = time.time()
                print('Processing took', listen.end - listen.start, 'sec\n')
                
        elif ( new_max > THRESHOLD_ABSOLUTE and
               new_max > THRESHOLD_MULTIPLIER * listen.prev_max ):
#             print('noise', new_max, '<', listen.prev_max)
            listen.start = time.time()
            listen.sound_detected = True
            listen.batches_to_collect = 2 # get two more batches, because AFTER = 3 batches
            listen.recording = indata.copy() # copy, since indata[:] is only a view of the stream's buffer
            
        listen.prev_max = new_max
        
    else:
        print('no input')
        
def listen_and_respond(duration):

    start = time.time()
    with sd.InputStream(device=device, channels=1, callback=callback,
                        blocksize=int(samplerate * block_duration),
                        samplerate=samplerate):
        print('Listening...')
        while True:
            # listen for a few seconds
            if time.time() - start > duration:
                break
        print('Done.')

listen_and_respond(3)

Listening...
MelSpectrogram 0.1575329303741455
pf
Processing took 0.38613414764404297 sec

STATUS:  input overflow
MelSpectrogram 0.13739395141601562
side-cluck
Processing took 0.36205601692199707 sec

STATUS:  input overflow
Done.


Right now, this doesn't perform very well. The processing is quite slow, and the recognition doesn't seem very good, despite the excellent (95%) performance the model showed on the test data when we first built it. The spectrogram calculation seems to be the biggest single time sink (at least within the get_prediction function), but it accounts for only about 40% of the total time. We also end up with a lot of input overflow errors, because the callback function takes so long that it can't keep up with the rate of audio input.

However, a big part of the problem seems to lie in the performance of the audio stream and callback mechanism, not necessarily in the content of the callback function. For instance, when we reprocess the audio from the last recognized sound outside the callback, things go orders of magnitude faster:

In [150]:
start = time.time()

foo = torch.from_numpy(listen.recording) / listen.recording.mean()

now = time.time()
mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=samplerate, n_mels=n_mels)(foo).log2()
print('MelSpectrogram', time.time() - now)

mel = mel[None, None, :, :]
output = model(mel)
energy, label = torch.max(output.data, 1)
print(noise_int_to_str[label.item()])

print('Total time elapsed:', time.time() - start)

MelSpectrogram 0.0020189285278320312
ts
Total time elapsed: 0.00464177131652832


In any case, now that we have the essential features in place, we can work on tuning up performance. We'll focus on making it go fast first, and then worry about accuracy. If the recordings are somehow getting distorted by a callback function that can't keep up, it won't matter how well-trained our model is.

Studying some more of the real-time examples in the sounddevice documentation is likely a good place to start: https://python-sounddevice.readthedocs.io/en/0.3.14/examples.html#plot-microphone-signal-s-in-real-time

Let's try moving the processing out of "callback", which seems to run quite slowly. Instead we'll put the processing in the main loop (in the "with" block for the stream), and leave callback to just queue data for processing.

In [11]:
import sounddevice as sd
from IPython.display import clear_output
import time
import torch
import torchaudio
import numpy as np
import queue

########### parameters ###########

device = 2 # select the microphone. Use sd.query_devices() to see options

# These are key variables and quantities we used in training the network.
BATCH_DURATION = 0.02 # look at BATCH_DURATION (seconds) at a time
THRESHOLD_MULTIPLIER = 5 # detect a spike when the next batch is at least THRESHOLD_MULTIPLIER times bigger
# AFTER  = 3 * BATCH_DURATION # the time (sec) to look after the spike location
n_mels = 28 # the number of mel filterbanks in the spectrogram
# --------------------

THRESHOLD_ABSOLUTE = 0.005 # ignore any spikes that don't rise above this

samplerate = sd.query_devices(device, 'input')['default_samplerate']

block_duration = BATCH_DURATION # sec
blocksize = int(samplerate * block_duration) # convert to frames

#################################

def get_prediction(recording):
    """ Build the spectrogram and use our model to recognize the noise """
    
    # normalize like we did in training the model
    obs_data = torch.from_numpy(recording) / recording.mean()
    
    now = time.time()
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=samplerate, n_mels=n_mels)(obs_data).log2()
    print('MelSpectrogram', time.time() - now)

    # change from torch.Size([28, 14]) to torch.Size([1, 1, 28, 14])
    mel = mel[None, None, :, :]
    
    # run through the model and get prediction
    output = model(mel)
    energy, label = torch.max(output.data, 1)
    return noise_int_to_str[label.item()]

def act_on_noise(noise_heard):
    """ Do something in response to a noise that has been heard. """
    print(noise_heard)

# bundling these is easier than declaring them 'global' in callback
class listen:
    prev_max = 1
    batches_to_collect = 0
    batches_collected = 0
    recording = None
    
    start = 0 # for timing recording
    end = 0
    
q_recordings = queue.Queue()
    
def callback(indata, frames, time_pa, status):
    """ Detect if a noise has been made, and add audio to the queue. """
    if status:
        print('STATUS: ', str(status))
    if any(indata):
        new_max = np.absolute(indata).max()
        
        # Gather audio data if more is required
        if listen.batches_to_collect > 0:
            q_recordings.put_nowait(indata.copy()) # copy, since indata[:] is only a view of the stream's buffer
            listen.batches_collected  += 1
            listen.batches_to_collect -= 1
            return
                
        # See if a new noise has been detected
        elif ( new_max > THRESHOLD_ABSOLUTE and
             new_max > THRESHOLD_MULTIPLIER * listen.prev_max ):
            
            listen.start = time.time()
            
            q_recordings.put_nowait(indata.copy()) # copy, since indata[:] is only a view of the stream's buffer
            listen.batches_collected += 1
            listen.batches_to_collect = 2 # get two more batches, because AFTER = 3 batches
               
        listen.prev_max = new_max
        
    else:
        print('no input')
        
def listen_and_respond(duration):
    """ Listen continuously for noises, then recognize and act on them """

    start = time.time()
    with sd.InputStream(device=device, channels=1, callback=callback,
                        blocksize=int(samplerate * block_duration),
                        samplerate=samplerate):
        print('Listening...')
        while True:
            
            # data collects if it meets the threshold. Process if enough data is in queue
            if listen.batches_collected >= 3:
                data1 = q_recordings.get_nowait()
                data2 = q_recordings.get_nowait()
                data3 = q_recordings.get_nowait()
                listen.batches_collected -= 3
                listen.recording = np.concatenate( (data1, data2, data3), axis=None )
                
                noise_heard = get_prediction(listen.recording)
                act_on_noise(noise_heard)
                
                listen.end = time.time()
                print('Processing took', listen.end - listen.start, 'sec\n')
            
            # listen for a few seconds total
            if time.time() - start > duration:
                break
        print('Done.')

listen_and_respond(3)

Listening...
MelSpectrogram 0.0009238719940185547
k
Processing took 0.002618074417114258 sec

MelSpectrogram 0.0012590885162353516
pf
Processing took 0.0026819705963134766 sec

MelSpectrogram 0.0011420249938964844
k
Processing took 0.0022580623626708984 sec

MelSpectrogram 0.0007388591766357422
p
Processing took 0.004266023635864258 sec

MelSpectrogram 0.001348257064819336
side-cluck
Processing took 0.0964360237121582 sec

MelSpectrogram 0.0010323524475097656
t
Processing took 0.09032487869262695 sec

Done.
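
The main loop above polls `listen.batches_collected` in a tight loop. One possible alternative, sketched here with a fake producer thread in place of the audio callback, is to let the consumer block on `queue.Queue.get` until three batches have arrived. The names `producer` and `consume` and the tiny fake batches are hypothetical; this is not the notebook's implementation.

```python
import queue
import threading
import time

BATCHES_PER_SOUND = 3  # matches AFTER = 3 batches in the notebook

def producer(q, n_batches):
    """Stand-in for the audio callback: pushes fake batches onto the queue."""
    for i in range(n_batches):
        q.put([float(i)] * 4)  # a short list standing in for one audio block
        time.sleep(0.001)

def consume(q, n_sounds, timeout=1.0):
    """Block until three batches arrive, then join them into one recording."""
    recordings = []
    for _ in range(n_sounds):
        # get() waits for data instead of spinning; raises queue.Empty on timeout.
        batches = [q.get(timeout=timeout) for _ in range(BATCHES_PER_SOUND)]
        recordings.append([x for batch in batches for x in batch])
    return recordings

q = queue.Queue()
worker = threading.Thread(target=producer, args=(q, 6))
worker.start()
recordings = consume(q, n_sounds=2)
worker.join()
print(len(recordings), len(recordings[0]))  # 2 12
```

Because the queue itself tracks how much data is waiting, this pattern would not need a shared counter that both the callback thread and the main loop modify.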


This queuing approach seems to have fixed our slow processing issue, but a couple of issues still remain:
* The accuracy is still pretty terrible. Were we misled on the accuracy of the model from the testing data? Or is there something else going on?
* At present, several noises are often being detected for each one that we make. Perhaps something about our threshold detection strategy is failing.
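
One idea to try for the repeated detections is a refractory period: after a spike is flagged, ignore further spikes for a short time. The sketch below is an untested assumption rather than a diagnosis of the actual problem. The class name `SpikeDetector`, the 0.06 s refractory value, and the synthetic batch peaks are all hypothetical; the two thresholds are the ones used in the notebook.

```python
class SpikeDetector:
    """Flag a spike only if none was flagged within the last `refractory` seconds."""

    def __init__(self, threshold_absolute=0.005, threshold_multiplier=5,
                 refractory=0.06):
        self.threshold_absolute = threshold_absolute
        self.threshold_multiplier = threshold_multiplier
        self.refractory = refractory  # assumed: one 60 ms analysis window
        self.prev_max = 1.0
        self.last_spike = float('-inf')

    def update(self, new_max, now):
        """Return True if this batch starts a new sound."""
        is_spike = (new_max > self.threshold_absolute
                    and new_max > self.threshold_multiplier * self.prev_max
                    and now - self.last_spike >= self.refractory)
        if is_spike:
            self.last_spike = now
        self.prev_max = new_max
        return is_spike

detector = SpikeDetector()
# (timestamp in seconds, peak amplitude of one 20 ms batch); synthetic values
batches = [(0.00, 0.001), (0.02, 0.050), (0.04, 0.002),
           (0.06, 0.040), (0.08, 0.001), (0.10, 0.030)]
spikes = [t for t, peak in batches if detector.update(peak, t)]
print(spikes)  # [0.02, 0.1]
```

In this synthetic run the batch at 0.06 s passes both amplitude thresholds but is suppressed because it arrives only 0.04 s after the previous spike.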