# Adjacency constraints in Hidden Layers

## Constraining for activation regions

The constraint is applied by arranging the 1024 units of a hidden layer on a 32 by 32 grid, and then penalising the difference between each cell's activation and the activations of the surrounding cells.

<img src="images/cellinfluence.png"/>

These constraints wrap around at the edges of the grid.



## The Constraint Term

We define a term $c$,

$$c = \sqrt{\sum_{(i,j) \in H^2,\text{adjacent}(i,j)} (h_i - h_j)^2}$$

where $H$ is the set of indices of the cells in the chosen hidden layer and $h_i$ is the activation of cell $i$. This term is then added to the final cost:

$$\hat{L} = L + \alpha c$$

In the following experiments, a value of $\alpha = 0.0125$ is used.
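
As a rough illustration of the term $c$ and the penalised cost $\hat{L}$, here is a minimal NumPy sketch for a single activation vector. It assumes that "adjacent" means the 8 surrounding cells on the wrapped 32 by 32 grid; the source does not state the exact neighbourhood, and the function names are hypothetical.

```python
import numpy as np

# Assumption: "adjacent" means the 8 surrounding cells (not confirmed by the source).
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def adjacency_penalty(h, side=32):
    """Return c for one activation vector h of length side * side."""
    grid = np.asarray(h, dtype=float).reshape(side, side)
    total = 0.0
    for dr, dc in OFFSETS:
        # np.roll wraps around at the edges, so the grid behaves like a torus.
        neighbour = np.roll(grid, shift=(dr, dc), axis=(0, 1))
        # Each offset contributes one ordered pair (i, j) per cell i.
        total += np.sum((grid - neighbour) ** 2)
    return np.sqrt(total)


def constrained_loss(loss, h, alpha=0.0125):
    """Penalised cost: L_hat = L + alpha * c."""
    return loss + alpha * adjacency_penalty(h)
```

Because the sum runs over ordered pairs in $H^2$, each pair of neighbouring cells is counted twice, which this loop over all 8 offsets reproduces.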

## Experiment

+ Dataset: TIMIT, using FBank features + energies + deltas (order=2), with a context of 11 frames
+ Kaldi toolkit for feature processing, Theano for neural network training
+ Experiments
    + Constrained all layers
    + Constrained on each layer individually, selected best model by dev set

## Results

| System                       | PER  |
|------------------------------|------|
| Unconstrained                | 21.1 |
| Constrained all layers       | 20.7 |
| Constrained on 1 layer (2nd) | 20.7 |

Both constrained models improve on the unconstrained baseline. Constraining all layers and constraining just one layer give similar results.

The layer selected on the dev set was the second hidden layer.

## Plots of the hidden layers

Activations averaged over the frames of the `aa` phoneme.

<table>
<tr>
<td>Unconstrained</td>
<td>Constrained</td>
</tr>
<tr>
<td><img src="images/constraint2.png" width="75%"/> </td>
<td><img src="images/constraint1.png" width="75%"/></td>
</tr>
</table>
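
One way such an averaged plot could be computed is sketched below. It assumes the hidden activations are available as a `(num_frames, 1024)` array with one phoneme id per frame; the array layout and the function name are hypothetical, not taken from the original pipeline.

```python
import numpy as np


def mean_activation_map(activations, labels, phoneme_id, side=32):
    """Average hidden activations over all frames with a given phoneme label.

    activations: (num_frames, side * side) array of hidden-layer outputs.
    labels:      (num_frames,) array of per-frame phoneme ids.
    Returns a (side, side) array laid out like the constraint grid.
    """
    activations = np.asarray(activations, dtype=float)
    mask = np.asarray(labels) == phoneme_id
    if not mask.any():
        raise ValueError("no frames carry this phoneme label")
    # Mean over the selected frames, then reshape onto the grid.
    return activations[mask].mean(axis=0).reshape(side, side)
```

The resulting array can be shown as an image, for example with matplotlib's `imshow`.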

## Similarities between plots

<table>
<tr>
    <td>aa<img src="images/layer-0-phn-2.png" width="55%"/></td>
    <td>ae<img src="images/layer-0-phn-3.png" width="55%"/></td>
</tr>
<tr>
    <td>s<img src="images/layer-0-phn-37.png" width="55%"/></td>
    <td>z<img src="images/layer-0-phn-47.png" width="55%"/></td>
</tr>
</table>

## Video for an utterance

<video width="320" height="240" controls>
<source src="images/animation--1.mp4" type="video/mp4"/>
</video>

# What can this be useful for?

Some ideas for exploiting the behaviour caused by the constraints.

## Noise robustness?

Intuition: The constraint introduces some form of redundancy to each layer. Redundancy may be helpful in being robust to noise.

## Results on noisy test set

1. Added babble noise from NoiseX (`babble.wav`) to the test set at 5 dB.
2. Evaluated models trained on clean data on the noisy test set.

| System                       | Clean PER | Noisy PER (5dB) |
|------------------------------|-----------|-----------------|
| Unconstrained                | 21.1      | 73.9            |
| Constrained on 1 layer (1st) | 20.9      | 62.5            |
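
The noise-mixing step of this setup can be sketched as follows. This assumes that "5dB" refers to a signal-to-noise ratio of 5 dB measured over the whole utterance, which the source does not spell out; the function name is hypothetical and audio loading is left out.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db=5.0):
    """Add noise to speech so that the result has the requested SNR in dB."""
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)
    # Repeat or truncate the noise so it covers the whole utterance.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[:len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose the scale so that 10 * log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```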

#### Training on noisy data and testing on noisy data

The unconstrained model performs better than the constrained models.

## Speaker adaptation?

Intuition: Exploit the "blobs" formed by the constraint.

Suppose each speaker forms similar types of blobs, just in different locations. Can we modify the weights of the feature extractors in the next layer to suit the speaker?

<img src="images/layer-0-phn-47.png" width="55%"/>

## Gaussian filters 

<table style="border:0px">
<tr style="border:0px">
<td style="border:0px;width:40%;"><img src="images/gaussian_filters.png"></td>
<td style="border:0px">
We want to learn feature extractors for the input in terms of Gaussian filters with varying width, height and rotation.

Let the Gaussian function be,

$$ g(\mathbf{x};\mathbf{B},\boldsymbol{\mu}) = \exp\left(-{\left\| \mathbf{B} (\mathbf{x} - \boldsymbol{\mu}) \right\|}^2\right) $$
    
Then we define a weight matrix $\mathbf{W}$ between layers such that,

$$ \underbrace{\mathbf{W}}_{(n,h)} = \underbrace{\mathbf{G}}_{(n,k)} \underbrace{\mathbf{M}}_{(k,h)}$$

where $\mathbf{M}$ is a standard transformation matrix freely tuned by gradient descent, and $\mathbf{G}$ is a matrix with each column representing a Gaussian filter:

$$\mathbf{G}_{i,j} = g(\left[\text{row}(i),\text{col}(i)\right]^\top;\mathbf{B}_j,\boldsymbol{\mu}_j)$$

</td>
</tr>
</table>
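
The factorisation $\mathbf{W} = \mathbf{G}\mathbf{M}$ can be illustrated with a small NumPy sketch. It assumes a row-major layout of the grid, so that $\text{row}(i) = \lfloor i / 32 \rfloor$ and $\text{col}(i) = i \bmod 32$, and it measures distances without wraparound; both are assumptions, and the function names are hypothetical.

```python
import numpy as np


def gaussian_filter_matrix(B, mu, side=32):
    """Build G with shape (side * side, k), one Gaussian filter per column.

    B:  (k, 2, 2) array, one matrix B_j per filter.
    mu: (k, 2) array, one centre mu_j per filter, in (row, col) grid units.
    """
    B = np.asarray(B, dtype=float)
    mu = np.asarray(mu, dtype=float)
    idx = np.arange(side * side)
    # Assumed row-major layout: row(i) = i // side, col(i) = i % side.
    coords = np.stack([idx // side, idx % side], axis=1).astype(float)  # (n, 2)
    diff = coords[:, None, :] - mu[None, :, :]                          # (n, k, 2)
    # proj[n, k] = B_k (x_n - mu_k)
    proj = np.einsum("kab,nkb->nka", B, diff)
    return np.exp(-np.sum(proj ** 2, axis=-1))                          # (n, k)


def factored_weights(B, mu, M, side=32):
    """W = G M, with G of shape (n, k) and M of shape (k, h)."""
    return gaussian_filter_matrix(B, mu, side) @ M
```

Choosing $\mathbf{B}_j$ as a diagonal scaling multiplied by a rotation gives a filter with independent width, height and orientation.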

## Updating the Gaussian Filters

1. When training the canonical model, update all parameters (with the adjacency constraint).
2. During adaptation, update all the $\mathbf{B}$ and $\boldsymbol{\mu}$ parameters of $\mathbf{G}$.


Initial experiments with this approach did not work well: improvements over the canonical model are within 0.1-0.2% in PER.