# tweetynet inference graph
This notebook uses the original implementation of TweetyNet, written with the low-level TensorFlow 1.x API, to walk through the dimensions of each layer. The goals are to explain the architecture and to make sure the Torch implementation reproduces it, and its results, correctly.

In [1]:
from math import ceil

import tensorflow as tf

In [3]:
n_syllables=10
batch_size=11
time_bins=88
freq_bins=257
channels=1
conv1_filters=32
conv2_filters=64
pool1_size=(1, 8)
pool1_strides=(1, 8)
pool2_size=(1, 8)
pool2_strides=(1, 8)
learning_rate=0.001

In [4]:
xentropy = tf.compat.v1.nn.sparse_softmax_cross_entropy_with_logits


def out_width(in_width, filter_width, stride):
    return ceil(float(in_width - filter_width + 1) / float(stride))
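
As a sanity check on `out_width`: for integer sizes with `in_width >= filter_width`, it should agree with the more familiar floor-based form of the output-size formula for unpadded ("valid") windows, `(in_width - filter_width) // stride + 1`. The sketch below compares the two over a range of sizes. The helper name `out_width_floor` is made up for this check and is not part of the original code.

```python
from math import ceil


def out_width(in_width, filter_width, stride):
    return ceil(float(in_width - filter_width + 1) / float(stride))


def out_width_floor(in_width, filter_width, stride):
    # hypothetical helper: the floor-based form of the same formula
    return (in_width - filter_width) // stride + 1


# collect every (width, filter, stride) combination where the two forms disagree
mismatches = [
    (w, f, s)
    for w in range(1, 300)
    for f in range(1, w + 1)
    for s in range(1, 10)
    if out_width(w, f, s) != out_width_floor(w, f, s)
]
print('mismatches:', len(mismatches))
```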

## input shape
Originally, spectrograms were transposed **before** being fed to the network, so the input shape was 
(batch, time bins, freq bins, channel).
If you want to think of this as an image, the order would have been:
(batch, width, height, channel).
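
As a small illustration of this layout, the sketch below takes a spectrogram stored as (freq bins, time bins), transposes it, and adds the batch and channel axes. It uses plain nested lists and made-up sizes. The starting layout is an assumption, and the `shape` helper is hypothetical and exists only for this example.

```python
def shape(nested):
    """Return the shape of a rectangular nested list as a tuple."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


freq_bins, time_bins = 3, 5  # made-up sizes, much smaller than the real ones

# assumed starting layout: rows are frequency bins, columns are time bins
spect = [[10 * f + t for t in range(time_bins)] for f in range(freq_bins)]

# transpose so that time is the first axis: (time bins, freq bins)
spect_t = [list(column) for column in zip(*spect)]

# add a trailing channel axis, then a leading batch axis of size 1
batch = [[[[value] for value in row] for row in spect_t]]

print(shape(spect), '->', shape(batch))
```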


In [5]:
X = tf.random.normal((batch_size, time_bins, freq_bins, 1))

## convolutional network

In [6]:
conv1 = tf.compat.v1.layers.conv2d(
    inputs=tf.reshape(X, [batch_size, -1, freq_bins, 1]),
    filters=conv1_filters,
    kernel_size=[5, 5],
    padding="same",
    activation=tf.nn.relu,
    name='conv1')

pool1 = tf.compat.v1.layers.max_pooling2d(inputs=conv1,
                                pool_size=pool1_size,
                                strides=pool1_strides,
                                name='pool1')

conv2 = tf.compat.v1.layers.conv2d(
    inputs=pool1,
    filters=conv2_filters,
    kernel_size=[5, 5],
    padding="same",
    activation=tf.nn.relu,
    name='conv2')

pool2 = tf.compat.v1.layers.max_pooling2d(inputs=conv2,
                                pool_size=pool2_size,
                                strides=pool2_strides,
                                name='pool2')

Instructions for updating:
Use keras.layers.conv2d instead.
Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Use keras.layers.max_pooling2d instead.


In [7]:
for name, layer in zip(
    ['conv1', 'pool1', 'conv2', 'pool2'],
    [conv1, pool1, conv2, pool2]
):
    print(f'shape of {name}:', layer.shape)

shape of conv1: (11, 88, 257, 32)
shape of pool1: (11, 88, 32, 32)
shape of conv2: (11, 88, 32, 64)
shape of pool2: (11, 88, 4, 64)
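
The shapes printed above can be reproduced with plain arithmetic. This is a sketch, assuming stride-1 convolutions with "same" padding (which only change the channel count) and pooling with the default "valid" padding (which shrinks an axis according to the `out_width` formula). The helper names `conv_same` and `pool_valid` are made up for this example.

```python
from math import ceil


def conv_same(shape, filters):
    # "same" padding with stride 1 keeps time and freq; only channels change
    batch, time, freq, _ = shape
    return (batch, time, freq, filters)


def pool_valid(shape, pool_size, strides):
    # "valid" pooling: apply the out_width formula to the time and freq axes
    batch, time, freq, channels = shape
    new_time = ceil((time - pool_size[0] + 1) / strides[0])
    new_freq = ceil((freq - pool_size[1] + 1) / strides[1])
    return (batch, new_time, new_freq, channels)


x_shape = (11, 88, 257, 1)
conv1_shape = conv_same(x_shape, 32)
pool1_shape = pool_valid(conv1_shape, (1, 8), (1, 8))
conv2_shape = conv_same(pool1_shape, 64)
pool2_shape = pool_valid(conv2_shape, (1, 8), (1, 8))

for name, s in [('conv1', conv1_shape), ('pool1', pool1_shape),
                ('conv2', conv2_shape), ('pool2', pool2_shape)]:
    print(f'shape of {name}:', s)
```

Because the pooling windows are `(1, 8)`, only the frequency axis is downsampled; the number of time bins stays at 88 all the way through.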


## reshaping input for recurrent network

### determining number of hidden units
After passing through the convnet, the input spectrogram has been mapped to 64 channels with 4 frequency bins each (see the shape of `pool2` above). We stack all of these channels on top of each other to produce a new "image" where the rows are "channel frequencies". 

Then we feed this "image" to a recurrent neural network with a hidden unit for each "channel frequency".

Therefore the number of hidden units must equal the number of frequency bins left after the downsampling done by the max pooling layers, multiplied by the number of channels after the convolutional layers: 4 × 64 = 256.

With eager mode and the attributes added in later versions of TensorFlow, there's no longer a need to determine the output shapes programmatically as we do here.

Still, the explicit variable names (`freq_bins_after_pool2` and `conv2_filters`) are helpful for people trying to understand the network structure.

In [8]:
# Determine number of hidden units in bidirectional LSTM:
# uniquely determined by number of filters and frequency bins
# in output shape of pool2
freq_bins_after_pool1 = out_width(freq_bins,
                                  pool1_size[1],
                                  pool1_strides[1])
freq_bins_after_pool2 = out_width(freq_bins_after_pool1,
                                  pool2_size[1],
                                  pool2_strides[1])
num_hidden = freq_bins_after_pool2 * conv2_filters

In [10]:
print('number of hidden units:', num_hidden)


number of hidden units: 256


## bi-directional LSTM

In [11]:
# dynamic bi-directional LSTM
lstm_f1 = tf.compat.v1.nn.rnn_cell.BasicLSTMCell(num_hidden, forget_bias=1.0,
                                       state_is_tuple=True, reuse=None)
lstm_b1 = tf.compat.v1.nn.rnn_cell.BasicLSTMCell(num_hidden, forget_bias=1.0,
                                       state_is_tuple=True, reuse=None)
outputs, _states = tf.compat.v1.nn.bidirectional_dynamic_rnn(lstm_f1,
                                                             lstm_b1,
                                                             inputs=tf.reshape(pool2, 
                                                                               [batch_size,
                                                                                -1,
                                                                                num_hidden]),
                                                             time_major=False,
                                                             dtype=tf.float32,
                                                             # sequence_length=[time_bins],
                                                            )

Instructions for updating:
This class is equivalent as tf.keras.layers.LSTMCell, and will be replaced by that in Tensorflow 2.0.
Instructions for updating:
Please use `keras.layers.Bidirectional(keras.layers.RNN(cell))`, which is equivalent to this API
Instructions for updating:
Please use `keras.layers.RNN(cell)`, which is equivalent to this API


Notice that the input to the bidirectional LSTM is `pool2`, reshaped so that the dimension order is:  
(batch, time_bins, num_hidden)

This is where the stacking happens: a reshape hidden inside a function call.

We need time bins to be the second axis.

It's fine to just use `-1` for that axis here, because the output of `pool2` already has time bins on the second dimension.
We can confirm this by testing whether the result is equal to a reshape where we explicitly specify that the second dimension should be of size `time_bins`.

In [10]:
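# In graph mode a Tensor has no .numpy() method,
# so we evaluate the comparison in a session instead.
reshapes_equal = tf.reduce_all(
    tf.math.equal(
        tf.reshape(pool2, [batch_size, -1, num_hidden]),
        tf.reshape(pool2, [batch_size, time_bins, num_hidden])
    )
)
with tf.compat.v1.Session() as sess:
    # the conv layers own variables that must be initialized first
    sess.run(tf.compat.v1.global_variables_initializer())
    print(sess.run(reshapes_equal))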
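
To make the stacking concrete: `tf.reshape` keeps elements in row-major order, so within each time bin the entry for frequency bin `f` and channel `c` ends up at index `f * conv2_filters + c` of the stacked vector. The sketch below shows this with plain lists and made-up sizes (2 frequency bins and 3 channels instead of 4 and 64).

```python
freq_bins_after_pool2, conv2_filters = 2, 3  # made-up sizes for readability

# one time bin of pool2: rows are frequency bins, columns are channels
frame = [[(f, c) for c in range(conv2_filters)]
         for f in range(freq_bins_after_pool2)]

# a row-major reshape to a single axis reads the rows one after another
stacked = [value for row in frame for value in row]

# check the index formula for every (frequency bin, channel) pair
for f in range(freq_bins_after_pool2):
    for c in range(conv2_filters):
        assert stacked[f * conv2_filters + c] == (f, c)

print(stacked)
```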

In [27]:
lstm_f1.weights[0]

<tf.Tensor 'strided_slice_1:0' shape=() dtype=float32>

In [13]:
lstm_b1.weights

[<tf.Variable 'bidirectional_rnn/bw/basic_lstm_cell/kernel:0' shape=(512, 1024) dtype=float32_ref>,
 <tf.Variable 'bidirectional_rnn/bw/basic_lstm_cell/bias:0' shape=(1024,) dtype=float32_ref>]
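
The shapes printed above follow from how `BasicLSTMCell` stores its parameters: a single kernel acts on the input concatenated with the previous hidden state, and the weights for all four gates sit side by side in its columns. A short check of the arithmetic, with variable names chosen for this example:

```python
num_hidden = 256          # units per LSTM cell, as computed earlier
input_size = num_hidden   # each time step of the reshaped pool2 has 256 features
n_gates = 4               # input gate, new input, forget gate, output gate

# rows: input features plus previous hidden state; columns: all gates together
kernel_shape = (input_size + num_hidden, n_gates * num_hidden)
bias_shape = (n_gates * num_hidden,)

print('kernel:', kernel_shape)
print('bias:', bias_shape)
```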

In [11]:
print('shape of outputs of bidirectional rnn:')
print('forward: ', outputs[0].shape)
print('backward: ', outputs[1].shape)

shape of outputs of bidirectional rnn:
forward:  (11, 88, 256)
backward:  (11, 88, 256)


## projecting outputs
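This is the most low-level part, and the hardest one to wrap my mind around.

But essentially we create a set of weights for the forward direction, a set of weights for the backward direction, and a single bias. Each direction's output is multiplied by its own weights, and the two products are summed together with the bias to give the logits.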

In [None]:
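# project the LSTM outputs onto the syllable classes,
# which gives the logits for each time step
W_f = tf.Variable(tf.random.normal([num_hidden, n_syllables]))  # forward weights
W_b = tf.Variable(tf.random.normal([num_hidden, n_syllables]))  # backward weights
bias = tf.Variable(tf.random.normal([n_syllables]))

# split each direction's output into one
# (time_bins, num_hidden) matrix per batch item
expr1 = tf.unstack(outputs[0],
                   axis=0,
                   num=batch_size)
expr2 = tf.unstack(outputs[1],
                   axis=0,
                   num=batch_size)
# project each item, then concatenate along axis 0, so that
# logits has shape (batch_size * time_bins, n_syllables)
logits = tf.concat([tf.matmul(ex1, W_f) + bias + tf.matmul(ex2, W_b)
                    for ex1, ex2 in zip(expr1, expr2)], 0)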
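
One way to read the unstack, matmul, and concat pattern above: it should give the same result as flattening the batch and time axes into one axis and projecting once. The sketch below checks this with plain lists, small made-up sizes, and integer values so the comparison is exact. The helpers `matmul`, `rand_matrix`, and `project` are hypothetical and only serve this example.

```python
import random


def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m), both lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def rand_matrix(rows, cols):
    return [[random.randint(-3, 3) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
batch_size, time_bins, num_hidden, n_syllables = 2, 3, 4, 5  # made-up sizes

fwd = [rand_matrix(time_bins, num_hidden) for _ in range(batch_size)]
bwd = [rand_matrix(time_bins, num_hidden) for _ in range(batch_size)]
W_f = rand_matrix(num_hidden, n_syllables)
W_b = rand_matrix(num_hidden, n_syllables)
bias = [random.randint(-3, 3) for _ in range(n_syllables)]


def project(f, b):
    # forward projection + bias + backward projection, row by row
    pf, pb = matmul(f, W_f), matmul(b, W_b)
    return [[x + c + y for x, y, c in zip(rf, rb, bias)]
            for rf, rb in zip(pf, pb)]


# as in the cell above: project each batch item, then concatenate on axis 0
logits = [row for f, b in zip(fwd, bwd) for row in project(f, b)]

# alternative: flatten batch and time into one axis, then project once
flat_fwd = [row for f in fwd for row in f]
flat_bwd = [row for b in bwd for row in b]
logits_flat = project(flat_fwd, flat_bwd)

print('logits shape:', (len(logits), len(logits[0])))
print('same result:', logits == logits_flat)
```

With the notebook's real sizes, the same reasoning gives logits of shape (11 × 88, 10) = (968, 10), one row per time bin of each batch item.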