# Prototype Recognizer
In this notebook we'll be putting together something that listens to the audio and tries to tell who is talking.

In [1]:
import audiostream
import audiolib
import application
import config
import models
import numpy as np

application.init()

Using TensorFlow backend.


In [2]:
model = application.load_model()

Resuming at batch 29731
Loading model from checkpoints\voice-embeddings.h5
Preloaded model from checkpoints\voice-embeddings.h5


In [3]:
clip1 = audiostream.record(4)

Recording
4 seconds remaining
3 seconds remaining
2 seconds remaining
1 seconds remaining
Recording finished


In [4]:
clip2 = audiostream.record(4)

Recording
4 seconds remaining
3 seconds remaining
2 seconds remaining
1 seconds remaining
Recording finished


How long are these sound samples?

In [13]:
len(clip1) / 44100

4.0

In [3]:
def calc_embedding(model, sound):
    """
    sound: a numpy array of 16-bit signed sound samples.
    """
    print('extract soundlen=',len(sound))
    features = audiolib.extract_features(sound, sample_rate=44100, num_filters=config.NUM_FILTERS)
    emb = models.get_embedding(model, features)
    return emb 

In [19]:
399/160

2.49375

The key to what's going wrong here is ```minibatch._clipped_audio``` and ```config.NUM_FRAMES```.  I don't quite understand the details yet (actually I figured it out; read on).  I thought that 160 frames would be 4 seconds, but I suspect that somehow it is not.

Under the covers we're using [python_speech_features.fbank()](https://github.com/jameslyons/python_speech_features) with the default winstep parameter of 0.010 = 10ms.  At 160 frames, my theory is that we're actually processing not 4 seconds, but 1.6 seconds.  In particular, a randomly selected 1.6 seconds from the sample.

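This theory is consistent with the fact that I was getting almost exactly 2.5x the number of frames I expected (399 instead of 160).  The frame count is driven by winstep: a 4-second clip at a 10ms step is about 400 frames, and 4 / 1.6 = 2.5.  (winlen=0.025 and winstep=0.010 also happen to be in a 2.5:1 ratio, but that looks like a coincidence.)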

I can determine what I want to do about this long term, but for now I think if I set up my sound buffer to have 1.6 seconds of sound this will work well enough to see if it generally works.
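
The arithmetic behind this theory can be sketched in a few lines. This assumes the framing rule I understand `fbank()` to use with its defaults (25ms windows advanced by 10ms, with a trailing partial window still counted as a frame); `frames_in` is a hypothetical helper, not part of the library.

```python
import math

WINLEN_MS = 25    # fbank() default winlen=0.025
WINSTEP_MS = 10   # fbank() default winstep=0.010


def frames_in(duration_ms):
    """Approximate number of frames produced for a clip of this duration."""
    if duration_ms <= WINLEN_MS:
        return 1
    # A partial final step still produces one (zero-padded) frame.
    return 1 + math.ceil((duration_ms - WINLEN_MS) / WINSTEP_MS)


print(frames_in(4000))   # 399, the number seen for the 4-second clip
print(frames_in(1600))   # 159, one short of the 160 the model expects
```

Under these assumptions a 1.6-second buffer lands one frame short of 160, so the buffer may need to be slightly longer than 1.6 seconds.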

In [4]:
def sound_chunks(sound, chunk_seconds=1.6, step_seconds=0.5, sample_rate=44100):
    """Return a sequence of sound chunks from a sound clip.
    Each chunk will be chunk_seconds long (trailing chunks may be shorter),
    and each successive chunk will be advanced by step_seconds.
    sound: a numpy array of 16-bit signed integers representing a sound sample.
    """
    chunk_len = int(chunk_seconds * sample_rate)
    chunk_step = int(step_seconds * sample_rate)
    chunk_count = int(len(sound) / chunk_step)
    for i in range(chunk_count):
        start = i * chunk_step
        end = start + chunk_len
        yield sound[start:end]

In [89]:
chunks = list(sound_chunks(clip1, chunk_seconds=1.61))

In [86]:
len(chunks)

8

In [90]:
len(chunks[0]) / 44100

1.61
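
Only the leading chunks are full length; the trailing ones are shorter because the slice runs off the end of the clip. A minimal sketch of a variant that yields only full-length windows, working directly in sample counts (`full_chunks` is a hypothetical name, not something in this project):

```python
def full_chunks(sound, chunk_len, chunk_step):
    """Yield only the windows that are exactly chunk_len samples long."""
    last_start = len(sound) - chunk_len
    # range() is empty when the clip is shorter than one chunk.
    for start in range(0, last_start + 1, chunk_step):
        yield sound[start:start + chunk_len]


clip = [0] * 176400   # stand-in for a 4-second clip at 44100 Hz
windows = list(full_chunks(clip, chunk_len=71001, chunk_step=22050))
print(len(windows))   # 5 full windows out of the 8 chunks above
```

This would avoid having to filter out short chunks after the fact.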

In [17]:
def embeddings_from_sound(model, sound, sample_rate=44100):
    """Return a sequence of embeddings from the different time slices
    in the sound clip.
    sound: a numpy array of 16-bit signed integers representing a sound sample.
    """
    # The extra 0.01 s is a hack to make sure we end up with a shape of 160 instead of 159.
    # What we actually want is 1.6.
    #*TODO: Figure out a better way to fix the 159->160 off by one error than adding 0.01.
    chunk_seconds=1.61
    for chunk in sound_chunks(sound, chunk_seconds=chunk_seconds, sample_rate=sample_rate):
        # The last portion of the sound may be less than our desired length.
        # We can safely skip it because we'll process it later as it shifts down the time window.
        lc = len(chunk)
        print('lc=%d sec=%f delta=%f' % (lc, lc/sample_rate, lc/sample_rate - chunk_seconds))
        if len(chunk)/sample_rate - chunk_seconds < -0.1:
            continue
        print('calculating embedding for chunk len=%d' % len(chunk))
        yield calc_embedding(model, chunk)

In [184]:
embs1 = list(embeddings_from_sound(model, clip1))



extract soundlen= 71001
extract soundlen= 71001




extract soundlen= 71001
extract soundlen= 71001
extract soundlen= 71001


In [185]:
71001/44100

1.61

In [150]:
len(embs1)

5
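
One possible way to address the TODO in `embeddings_from_sound`, instead of padding the chunk length by a fudge factor, is to derive the chunk length from the desired frame count. This is only a sketch, assuming 25ms windows and a 10ms step that are each rounded to whole samples; `samples_for_frames` is a hypothetical helper.

```python
def samples_for_frames(num_frames, sample_rate, winlen_ms=25, winstep_ms=10):
    """Samples needed to fill num_frames windows exactly, with no padding."""
    # Integer arithmetic with round-half-up avoids floating-point surprises.
    frame_len = (winlen_ms * sample_rate + 500) // 1000
    frame_step = (winstep_ms * sample_rate + 500) // 1000
    return frame_len + (num_frames - 1) * frame_step


print(samples_for_frames(160, 44100))   # 71222 samples, about 1.615 seconds
print(samples_for_frames(160, 16000))   # 25840 samples, also 1.615 seconds
```

The 1.61-second chunk (71001 samples) falls just under this, which would explain why it still produces 160 frames once the last window is padded.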

## Frame Length and FFT size
I've been having this problem while processing sound samples into filter banks:
> WARNING:root:frame length (1103) is greater than FFT size (512), frame will be truncated. Increase NFFT to avoid.

This problem is discussed in this [github issue on Python Speech Features](https://github.com/jameslyons/python_speech_features).

The recommendation there is to read this [Practical Cryptography tutorial on Mel Frequency Cepstral Coefficients (MFCCs)](http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/).

The recommendation is:
> Please read this [tutorial](http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/) to better understand the relationship between FFT size and frame length. The FFT size should be longer than the frame length. either increase your fft size, or decrease your frame length, or you can ignore the warning and see how it performs.

Investigating this further, I found this [pull request](https://github.com/jameslyons/python_speech_features/pull/76/commits/9ab32879b1fb31a38c1a70392fd21370b8fdc30f) that makes NFFT a power of two and simply increases it until it fits.  The comments in the PR say:
> Having an FFT less than the window length loses precision by dropping
    many of the samples; a longer FFT than the window allows zero-padding
    of the FFT buffer which is neutral in terms of frequency domain conversion.

So it sounds like I could increase the nfft parameter on ```fbank()``` from its default of 512 to 2048.  But why am I getting this warning in the first place?  Am I giving it the wrong sample rate?  I didn't have this problem when running the training data through this code.  So I need to figure out how to run my real-time captured sound data through the same code without it breaking.  I want to use the same processing code for training and inference, so I'm not liking the idea of changing the ```nfft``` parameter in ```fbank()```.

Let's compare some training data to my realtime captured audio and try to zero in on what's going wrong.

## Comparing Realtime Audio to Training Data

In [181]:
len(clip1)

176400

In [182]:
len(clip1)/44100

4.0

In [187]:
db = application.make_speaker_db()
triplet = db.random_triplet()

In [188]:
triplet

('d:\\datasets\\voxceleb1\\vox1\\wav\\id10716\\yKO2BD79hQ0\\00012.wav',
 'd:\\datasets\\voxceleb1\\vox1\\wav\\id10716\\Pvm_Dv1P3-M\\00001.wav',
 'd:\\datasets\\voxceleb1\\vox1\\wav\\id10161\\6KdOSVcTQNc\\00001.wav',
 'id10716',
 'id10161')

In [189]:
rate, sample = audiolib.load_wav(triplet[0])

In [190]:
rate

16000

That's an awfully low sample rate, isn't it? Is that actually correct?  I would expect 44100 not 16000.  Let's see if these numbers check out.

In [197]:
len(sample)/rate # Expected number of seconds

9.6400625

I opened this sound clip up in Audacity and verified that it is in fact 9.64 seconds long.  That could explain why I'm getting such a big difference in working with my own sound clips.  My sound clips are sampled at a significantly higher rate.

In [194]:
features = audiolib.extract_features(sample, sample_rate=rate)

Notice there was no warning.  I've created a resampled version of this same clip, resampled from a rate of 16000 to 44100.  Let's see how that fares.

In [201]:
rate2, sample2 = audiolib.load_wav(r'd:\tmp\00012-441.wav')
rate2

44100

In [202]:
len(sample2)/rate2

9.640068027210884

In [203]:
features2 = audiolib.extract_features(sample2, sample_rate=rate2)



Aha! So the same sound clip, upsampled from 16000 to 44100, gives me this warning.

I either need to pre-downsample my sound clips or increase the FFT size (NFFT) in the feature extraction.
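
The numbers in the warning line up with the sample rates. A small sketch, assuming a 25ms window rounded to whole samples and an FFT size of 512 (both helper names are hypothetical):

```python
NFFT = 512   # the FFT size named in the warning


def frame_length(sample_rate, winlen_ms=25):
    """Window length in samples, rounded half-up."""
    return (winlen_ms * sample_rate + 500) // 1000


def next_power_of_two(n):
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


for rate in (16000, 44100):
    length = frame_length(rate)
    status = 'fits' if length <= NFFT else 'truncated'
    print(rate, length, status, next_power_of_two(length))
```

At 16000 the window is 400 samples and fits in 512; at 44100 it is 1103 samples, which matches the warning and would need an FFT size of 2048.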

## Downsampling
Here are a couple of options for downsampling that I found in this [thread](https://stackoverflow.com/questions/30619740/downsampling-wav-audio-file):
 * [audioop](https://docs.python.org/2/library/audioop.html) ```ratecv``` in the Python standard library.
 * [scipy signal resample](https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.resample.html#scipy.signal.resample)

I'm going to try the standard library approach.  To verify that I've got it downsampling properly I want to save out a wav file.  I won't need to save wav files for the actual application, but I don't want to blindly use the sound data I've downsampled; I want to verify it's really doing what I think it is.

From [audioop Docs](https://docs.python.org/2/library/audioop.html):
```audioop.ratecv(fragment, width, nchannels, inrate, outrate, state[, weightA[, weightB]])```
 > state is a tuple containing the state of the converter. The converter returns a tuple (newfragment, newstate), and newstate should be passed to the next call of ratecv(). The initial call should pass None as the state.

In [204]:
len(clip1)

176400

In [234]:
import audioop
def downsample(samples, from_rate=44100, to_rate=16000):
    width = 2
    nchannels = 1
    state = None
    fragment, new_state = audioop.ratecv(samples, width, nchannels, from_rate, to_rate, state)
    return fragment

In [235]:
clip1_d = downsample(clip1)

In [238]:
len(clip1), len(clip1_d), len(clip1)/44100, len(clip1_d)/16000

(176400, 128000, 4.0, 8.0)

In [237]:
128000 / 16000

8.0

The problem here is that each entry in my samples is 16 bits, and this thing is expecting a bytes-like object.  I've thought of 3 options:
 1. Pass ```ratecv``` width=1
 2. Do the conversion inside of AudioStream
 3. Tell AudioStream that I want samples at 16000 instead of 44100.
 
Let's try option 3 first, because if that works, it will be cleaner and eliminate needless conversions.
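
For what it's worth, the 8.0 above is probably a units problem rather than a bad conversion: `ratecv` returns raw bytes, and with a width of 2 each sample takes two bytes, so `len()` counts twice as many items as there are samples. A small sketch using only the standard library (`sample_count` is a hypothetical helper, and this doesn't call `audioop` at all):

```python
from array import array


def sample_count(fragment, width=2):
    """Number of samples in a raw audio fragment with `width` bytes per sample."""
    return len(fragment) // width


# 4 seconds of silent 16-bit mono audio at 16000 Hz, as raw bytes.
raw = array('h', [0] * 64000).tobytes()
print(len(raw) / 16000)            # 8.0 if bytes are mistaken for samples
print(sample_count(raw) / 16000)   # 4.0 once the sample width is accounted for
```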

In [271]:
import time
as2 = audiostream.AudioStream(seconds=4, rate=16000)
as2.start()
seconds=6
print('Recording')
for i in range(seconds):
    print('%d seconds remaining' % (seconds - i))
    time.sleep(1)
as2.stop()
clip3 = as2.sound_array()

Recording
6 seconds remaining
5 seconds remaining
4 seconds remaining
3 seconds remaining
2 seconds remaining
1 seconds remaining


In [245]:
len(clip3) / 16000

4.0

In [255]:
features3 = audiolib.extract_features(clip3, sample_rate=16000)

Ok that worked with no warnings.  Let's try saving that to a wav file to make sure it's actually producing sound the way we expect and not just crazy data.

We'll use the Python standard library [Wave module](https://docs.python.org/2/library/wave.html).

In [256]:
import wave
def save_wav(filename, samples, rate=16000, width=2, channels=1):
    # The context manager closes the file even if a write fails.
    with wave.open(filename, 'wb') as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(width)
        wav.setframerate(rate)
        wav.writeframes(samples)

In [272]:
save_wav(r'd:\tmp\clip3.wav', clip3, rate=16000)

This does produce a wav file as I expect, so this seems to be working from the realtime audio streaming all the way to producing sound at the desired sampling rate.
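
Besides listening to the file, the header can be checked programmatically. A minimal round-trip sketch using only the standard library; the silent test signal and the temporary path are stand-ins, not the clip recorded above:

```python
import os
import tempfile
import wave
from array import array

samples = array('h', [0] * 16000)   # one second of silence, 16-bit mono
path = os.path.join(tempfile.mkdtemp(), 'check.wav')

# Write the samples the same way save_wav does.
with wave.open(path, 'wb') as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(16000)
    wav.writeframes(samples.tobytes())

# Read the header back and confirm the rate and duration.
with wave.open(path, 'rb') as wav:
    rate = wav.getframerate()
    seconds = wav.getnframes() / rate

print(rate, seconds)   # 16000 1.0
```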

## Testing with some clips
Now that we've got the issues with the sampling rate all sorted out (the issue is I need to use a rate of 16000 not 44100), let's try this with some live recorded clips.  We'll do that in the next notebook, "6-Prototype Speaker Recognizer".

In [6]:
clip1 = audiostream.record(4, rate=16000)

Recording
4 seconds remaining
3 seconds remaining
2 seconds remaining
1 seconds remaining
Recording finished


In [7]:
clip2 = audiostream.record(4, rate=16000)

Recording
4 seconds remaining
3 seconds remaining
2 seconds remaining
1 seconds remaining
Recording finished


In [13]:
audiolib.save_wav(r'd:\tmp\clip1.wav', clip1, rate=16000)
audiolib.save_wav(r'd:\tmp\clip2.wav', clip2, rate=16000)

In [18]:
embs1 = list(embeddings_from_sound(model, clip1, sample_rate=16000))



lc=25760 sec=1.610000 delta=0.000000
calculating embedding for chunk len=25760
extract soundlen= 25760


ValueError: Error when checking input: expected input_1 to have shape (160, 64, 1) but got array with shape (57, 64, 1)

In [10]:
len(embs1)

TypeError: object of type 'generator' has no len()
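
The `ValueError` above is consistent with `calc_embedding` still passing `sample_rate=44100` to `audiolib.extract_features` even though the clip was recorded at 16000. A sketch of the frame arithmetic, assuming the framing rule I understand python_speech_features to use (window and step rounded half-up to whole samples, trailing partial window padded); `num_frames` is a hypothetical helper:

```python
def num_frames(n_samples, sample_rate, winlen_ms=25, winstep_ms=10):
    """Estimate how many frames a clip yields at the given sample rate."""
    frame_len = (winlen_ms * sample_rate + 500) // 1000
    frame_step = (winstep_ms * sample_rate + 500) // 1000
    if n_samples <= frame_len:
        return 1
    # Ceiling division: a partial final step still produces a frame.
    return 1 + -(-(n_samples - frame_len) // frame_step)


chunk_len = 25760   # the 1.61-second chunk recorded at 16000 Hz
print(num_frames(chunk_len, 16000))   # 160, what the model expects
print(num_frames(chunk_len, 44100))   # 57, what the error reports
```

If that is the cause, passing the real sample rate through `calc_embedding` instead of hardcoding 44100 should produce the expected shape.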