# Training

I've got the following components working now:
 * **Speaker database**. Looking at our dataset of audio speaker samples and returning anchor, positive, negative triplets
 * **Audio preprocessing**. Running FFTs on audio samples and breaking them into Mel filter bank energies (log frequency bands)
 * **Batch building**. Build a batch of 32 triplets (anchor, positive, negative) and slice each sample into 4-second segments. Preprocess the audio and build a batch of correctly shaped tensors ready to feed into the model.
 * **Model**. A deep neural net model that takes preprocessed audio and produces **speaker embeddings**
 * **Loss function**.  A loss function used to optimize the model using [Triplet Loss](https://en.wikipedia.org/wiki/Triplet_loss)
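
As a rough illustration of what the loss function computes, here is a minimal sketch of triplet loss in plain Python. It assumes squared Euclidean distance as in the FaceNet formulation; the actual model may use a different distance (Deep Speaker uses cosine similarity), and the margin `alpha=0.2` and the function names are illustrative assumptions, not this project's code.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchors, positives, negatives, alpha=0.2):
    """Mean triplet loss over a batch of embeddings.

    Each argument is a list of embedding vectors of the same length.
    alpha is the margin; 0.2 is an illustrative value, not this project's setting.
    """
    losses = []
    for a, p, n in zip(anchors, positives, negatives):
        # A triplet contributes zero loss once the negative is at least
        # alpha farther from the anchor than the positive is.
        losses.append(max(squared_distance(a, p) - squared_distance(a, n) + alpha, 0.0))
    return sum(losses) / len(losses)
```

A real training loop would express the same arithmetic with TensorFlow ops so that gradients can flow back into the model.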

## Next Steps

### Processing speed
I'm training about one batch per 24 seconds.  This breaks down into two major components:
* **Batch building**. *9 sec*. Reading the audio files and running FFT transforms.  This could be done on another thread. Estimated speedup: about 1/3 faster.
* **Training**. *15 sec*. I'm currently training on a CPU because I haven't got TensorFlow working on my GPU yet. Estimated speedup from setting up the GPU: at least 10x.
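
One way to move batch building onto another thread is a small producer/consumer queue. The sketch below is an assumption about how that could look, not existing code: `build_batch` is a hypothetical zero-argument callable standing in for the real batch builder, and `max_prefetch` is an illustrative setting.

```python
import queue
import threading


def prefetch_batches(build_batch, num_batches, max_prefetch=2):
    """Yield batches that are built on a background thread.

    build_batch is a hypothetical zero-argument callable returning one
    preprocessed batch; max_prefetch bounds how many batches wait in memory.
    """
    batches = queue.Queue(maxsize=max_prefetch)
    done = object()  # sentinel marking the end of the stream

    def producer():
        try:
            for _ in range(num_batches):
                batches.put(build_batch())  # blocks while the queue is full
        finally:
            batches.put(done)  # always unblock the consumer, even on error

    threading.Thread(target=producer, daemon=True).start()

    while True:
        batch = batches.get()
        if batch is done:
            return
        yield batch
```

The training loop would then iterate with `for batch in prefetch_batches(build_batch, 1000): ...`. How much this helps depends on whether file reads and FFTs release Python's global interpreter lock and on how many CPU cores are free while training on CPU; a process pool is an alternative if threads turn out not to overlap.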

Based on results I've seen from others doing similar things, I estimate I'll need about 100,000 batches to reach a good loss.  Just for laughs, the calculation further down shows what that would take with the current approach.

Here is a chart from [Walleclipse](https://github.com/Walleclipse/Deep_Speaker-speaker_recognition_system) showing the losses from their speaker embedding training:
![Walleclipse training losses over time](images/walleclipse-loss.png "Walleclipse training losses")


There are some funny/interesting things in there:
 * How did they have so many losses close to zero? Why did they still have so many losses at 4+ after 85k iterations? As of 329 iterations, I'm not seeing anything close to zero.
 * Why did they see so much variance? I'm not seeing much variance.  Right now I'm at 2.7-3.8 and not much outside of that.  I did see one batch at 1.25 loss at batch 276, but haven't seen it that low since.

In [3]:
batch_build_sec = 9
training_sec = 15
batch_train_sec = 22
need_batches = 100000
seconds_needed = batch_train_sec * need_batches
hours_needed = seconds_needed / 3600
days_needed = hours_needed / 24
days_needed

25.462962962962962
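
To compare the speedups discussed earlier, the same arithmetic can be wrapped in a small function. This is a sketch under stated assumptions: it uses the 9 s and 15 s component timings (which sum to 24 s, not the 22 s used in the cell above), treats "another thread" as perfect overlap between batch building and training, and takes the 10x GPU figure at face value. The function name and parameters are illustrative.

```python
def days_needed(build_sec, train_sec, batches=100_000, overlap=False):
    """Estimate wall-clock days to train the given number of batches.

    With overlap=True, batch building runs concurrently with training,
    so each step costs whichever of the two is slower.
    """
    per_batch = max(build_sec, train_sec) if overlap else build_sec + train_sec
    return per_batch * batches / 86_400  # 86,400 seconds per day


print(days_needed(9, 15))                 # serial, CPU only
print(days_needed(9, 15, overlap=True))   # batch building on another thread
print(days_needed(9, 1.5, overlap=True))  # plus an assumed 10x GPU speedup
```

These come out to roughly 27.8, 17.4, and 10.4 days. Under these assumptions, once the GPU is working the 9-second batch building becomes the bottleneck, which is an argument for preprocessing the dataset ahead of time.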

The training situation may not be as bad as I thought.  After 350 batches I was getting training losses from 2.1-4.2, down from around 5-6.5 at the start.  This was after 2-3 hours of CPU-only training.

## Backlog
Here's a prioritized backlog of things I need to do:
 * **Add timing**. Show time needed to train a batch.  Break that time down into preprocessing and training.  You'll need this for determining when it's time to stop optimizing and just let the thing learn.
 * **Inference**. Make something that does inference.  Given two samples, decide if they're the same person.  Report the distance between anchor and positive, and between anchor and negative.
 * **Configure GPU**
     * After configuring GPU, estimate training time needed for 100,000 batches
 * **Chart loss**. Show a continuously updated graph of the training process so I can verify that it's making progress.
 * **Optimize audio preprocessing**.  Either do this on another thread, or preprocess the entire dataset ahead of time.
 * **Document model architecture**
 * **Document training processing pipeline**
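
For the inference item, a same-speaker check could be as simple as thresholding the similarity between two embeddings. The sketch below assumes cosine similarity and nonzero embedding vectors; the function names and the `threshold=0.75` default are hypothetical and would need tuning on held-out pairs.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two nonzero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def same_speaker(embedding_a, embedding_b, threshold=0.75):
    """Return (decision, similarity) for two speaker embeddings.

    threshold is an illustrative value, not one measured on this model.
    """
    similarity = cosine_similarity(embedding_a, embedding_b)
    return similarity >= threshold, similarity
```

Running this on anchor/positive and anchor/negative pairs from the speaker database would give a quick read on how well the embeddings separate speakers.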
 
## Backlog: Done
 * **Checkpoint / reload models**. Periodically save the model while it's training, and create a facility to resume training.  Determine how big the model file is, then decide how often to save and whether to keep the last N models, all models, or only the last few days' worth.
   * Models are 20 MB.
 * **Log losses**. Append to a log that shows training loss per iteration.  You'll need this for doing charting that persists between incremental training sessions.
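
A loss log that survives between training sessions can be a plain CSV file opened in append mode. The following is a minimal sketch of that idea, not the notebook's actual logging code; the file layout and the names `append_loss` and `read_losses` are assumptions.

```python
import csv
import os


def append_loss(log_path, iteration, loss):
    """Append one (iteration, loss) row, writing a header if the file is new."""
    is_new = not os.path.exists(log_path)
    with open(log_path, "a", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(["iteration", "loss"])
        writer.writerow([iteration, loss])


def read_losses(log_path):
    """Load the whole log as a list of (iteration, loss) pairs for charting."""
    with open(log_path, newline="") as f:
        return [(int(row["iteration"]), float(row["loss"])) for row in csv.DictReader(f)]
```

Because each call reopens the file and appends, a resumed session keeps adding to the same history, and the chart can be rebuilt from `read_losses` at any time.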


# References

 * [FaceNet paper](https://arxiv.org/pdf/1503.03832.pdf) by Florian Schroff, Dmitry Kalenichenko, James Philbin (Google)
 * [Deep Speaker paper](https://arxiv.org/pdf/1705.02304.pdf) by Chao Li, Xiaokong Ma, Bing Jiang, Xiangang Li (Baidu)
 * [Walleclipse/Deep_Speaker](https://github.com/Walleclipse/Deep_Speaker-speaker_recognition_system)
 * [philipperemy/deep-speaker](https://github.com/philipperemy/deep-speaker)

In [7]:
import os.path
os.path.exists(r'C:\Users\Richard Lack\Documents\notebooks\voice-embeddings\checkpoints\x.h5')

False