In [None]:
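import base64
from IPython.display import Audio, HTML

def video(fname, mimetype, width="100%"):
    """Embed a local video file in the notebook as a base64 data URI."""
    # str.encode("base64") only exists in Python 2; base64.b64encode works in both.
    with open(fname, "rb") as f:
        video_encoded = base64.b64encode(f.read()).decode("ascii")
    video_tag = '<video controls alt="test" src="data:video/{0};base64,{1}" width="{2}"></video>'.format(
        mimetype, video_encoded, width)
    return HTML(data=video_tag)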

# Biologically inspired methods in speech recognition and synthesis: closing the loop

## PhD defense presentation

## February 4, 2016

## Trevor Bekolay

# Motivation

In [None]:
video('spaun.mp4', 'mp4', "80%")

Spaun: very cool, but eyes and arms are lame.

Speech is more natural interaction. Give it ears and a mouth.

Taking ideas from:

- Theoretical neuroscience (want to keep the good parts of Spaun)
- Speech (can't just add ears and a mouth)
- Machine learning (lots of existing work on speech recognition and synthesis)

# Goal: speech input and output

<img src="speech.svg">

Our goal is to be able to build models using speech
for input and output.
First, we have to answer some basic questions,
like what are our ears doing,
and how does that translate into internal representations
used in the brain?
How does an internal representation of
a speech intention translate to the 
precise manipulations of the vocal tract
that produce speech?
These are the basic questions we have to answer
before we can do something like Spaun with speech.

# What makes speech unique?

<img src="spectrotemporal.svg">

The techniques that are used in, for example,
Spaun can't be easily applied to speech.
These and other reasons have led
some to posit that speech is a unique phenomenon
that requires unique techniques.

While that's not a debate I want to have,
one important characteristic of speech
is that speech has high temporal resolution
and low spatial resolution.
What I mean by that is...

If we think about sensorimotor problems
that humans face, ... (explain figure)

# Closed-loop modeling

<img src="perception-action.svg">

Spaun sees and writes in one unified system.
Most speech systems either do recognition or synthesis.
If they do both, they use text as an intermediary;
humans obviously do not do this.
Natural interactions necessitate closed loop models.

All animals live in a constant feedback loop with their environment.
We perceive the environment,
integrate that into our internal model of the world,
and take actions that affect the environment.
This process is constantly happening;
we don't take turns with the environment.

# Closed-loop modeling

In [None]:
video('shadowing.mp4', 'mp4', "80%")

This is partly why current
speech recognition and synthesis systems
are unnatural to us -- they feel turn-based,
which is not how people talk.
People talk over each other
and speak in unison.
There has been a push in second language learning lately
to do this explicitly;
it's a method called shadowing and it sounds like this.

A very long-term goal would be to have
a computer shadow a human speaking.

# Conceptual model: Sermo

<img src="sermo.svg">

What might be required to enable
that kind of speech-based interaction?
Sermo is my attempt at identifying the computations necessary.
(Go through it...)

# Conceptual model: Sermo

1. Subsystems must operate in a continuous, online fashion.
2. Subsystems must be implementable in
   biologically plausible spiking neurons.

In addition to implementing these computations,
the way in which they are implemented matters.
For a closed loop model with natural interactions,
the subsystems in Sermo must operate continuously and online.
Additionally, they must be implementable in biologically plausible
spiking neurons.
(More, why, etc)

# Conceptual model: Sermo

<img src="sermo-implemented.svg">

Introduce
- Speech recognition (Neural Cepstral Coefficients)
- Speech production (Trajectory generation)
- Sensorimotor integration (Trajectory classification)

# Neural cepstral coefficients

<img src="ncc.svg">

# Neural cepstral coefficients

<img src="ncc-network.svg">

# Periphery models

<img src="gammatone.svg">

# Discrete Cosine Transform

\begin{equation}
  y_k = \frac{x_0}{\sqrt{N}} + \sqrt{\frac{2}{N}} \sum_{n=1}^{N-1}
  x_n \cos \left( \frac{\pi}{N} n \left( k + \frac{1}{2} \right) \right)
  \text{ for } 0 \le k < N,
\end{equation}

\begin{align}
  \mathbf{k} &= \left[ 0, 1, \ldots, N-1 \right] & 1 \times N \text{ vector} \nonumber \\
  \mathbf{s} &= \left[ \sqrt{\tfrac{1}{2}}, 1, 1, \ldots, 1 \right] & 1 \times N \text{ vector} \nonumber \\
  \mathbf{T} &= \sqrt{\frac{2}{N}} \, \mathbf{s} \circ \cos \left( \frac{\pi}{N} \left(
    \mathbf{k} + \frac{1}{2} \right) \otimes \mathbf{k} \right)
    & N \times N \text{ matrix} \nonumber \\
  \mathbf{y} &= \mathbf{T}\mathbf{x} & N \times 1 \text{ vector}
\end{align}
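
The matrix form can be checked numerically against the summation. The following is a minimal NumPy sketch, not the code used for the results in this talk. The function names `dct_matrix` and `dct_direct` are made up for illustration, and the sketch assumes the first entry of the scaling vector is $\sqrt{1/2}$, which is what the $x_0 / \sqrt{N}$ term in the summation requires.

```python
import numpy as np


def dct_matrix(n):
    """Build the N x N matrix T such that y = T @ x (hypothetical helper)."""
    k = np.arange(n)
    # Scaling vector: assumed to be sqrt(1/2) for the n = 0 column so that
    # sqrt(2/N) * s_0 equals the 1/sqrt(N) factor on x_0 in the summation.
    s = np.ones(n)
    s[0] = np.sqrt(0.5)
    # Outer product: rows index the output k (shifted by 1/2),
    # columns index the input n.
    angles = (np.pi / n) * np.outer(k + 0.5, k)
    # s broadcasts across rows, scaling each column n by s_n.
    return np.sqrt(2.0 / n) * s * np.cos(angles)


def dct_direct(x):
    """Evaluate the summation form term by term, for comparison."""
    x = np.asarray(x, dtype=float)
    n = x.size
    idx = np.arange(1, n)
    return np.array([
        x[0] / np.sqrt(n)
        + np.sqrt(2.0 / n) * np.sum(x[1:] * np.cos(np.pi / n * idx * (k + 0.5)))
        for k in range(n)
    ])


x = np.random.default_rng(0).normal(size=8)
T = dct_matrix(8)
assert np.allclose(T @ x, dct_direct(x))  # both forms agree
assert np.allclose(T @ T.T, np.eye(8))    # the rows of T are orthonormal
```

Writing the transform as a single matrix is what makes it convenient to implement as connection weights between two neural populations, since the whole computation is one linear map.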

# Example NCC

<img src="mfcc-ncc.svg">

# Evaluation

<img src="ncc-eval-train.svg">

# Evaluation

<img src="ncc-eval-test.svg">

What we're measuring is "classification correctness,"
which is the proportion of predicted phoneme labels
that are correct
(that's it, since samples are pre-segmented).
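
As a concrete illustration, here is a minimal sketch of that metric, assuming one predicted label and one true label per pre-segmented sample. The function name `classification_correctness` and the example labels are hypothetical, not taken from the evaluation code.

```python
def classification_correctness(predicted, actual):
    """Proportion of predicted phoneme labels that match the true labels.

    Assumes samples are pre-segmented, so the two sequences line up
    one-to-one and no alignment step is needed.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if len(actual) == 0:
        raise ValueError("need at least one sample")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Made-up labels: three of the four predictions match.
predicted = ["aa", "iy", "s", "t"]
actual = ["aa", "ih", "s", "t"]
print(classification_correctness(predicted, actual))  # prints 0.75
```

Because segmentation is given, insertions and deletions cannot occur, which is why a plain proportion is enough here.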

# NCCs outperform MFCCs

<img src="ncc-phones.svg">

# NCCs are computationally expensive

<img src="ncc-phones-time">

# Enables comparing periphery models

<img src="ncc-periphmodel-racc-b.svg">

# Syllable production

<img src="sermo-implemented.svg">

# VocalTractLab

<img src="vtl.svg">

# Gesture scores

<img src="gs.svg">

<img src="gs-traj.svg">

# Syllable production network

<img src="prod-network.svg">

# Example syllable sequence

<img src="prod-good.svg">

# Audio sample

In [None]:
Audio("original.wav")

In [None]:
Audio("model.wav")

# Enables speech of varying speeds

<img src="prod-freq.svg">

# Syllable classification

<img src="sermo-implemented.svg">

# Inverse DMPs

<img src="idmp.svg">

# Example syllable classification

<img src="recog-good.svg">

# Operates online with no resets

<img src="recog-sequence_len.svg">

# Limitations & future work

<img src="sermo-implemented.svg">

# NCC

- Slow! Experiment with silicon cochleas.
- Implement in continuous speech recognition system.

# Syllable production

- Does not handle long sequences well.
- Does not handle large syllabaries well.

# Syllable classification

- Does not handle varying speech speeds well.

# Contributions

<ol>
<li>Speech: Sermo, spiking neuron models</li>
<li class="fragment">Machine learning: NCC, gesture score data set</li>
<li class="fragment">Neural modeling: iDMP, linking temporal inputs/outputs to Spaun, ears and vocal tract</li>
</ol>

# Thank you

List some thanks or screenshot of acknowledgements?