# Project summary:

* Implement a Dense NN with 5 hidden layers for MNIST digit classification
* Compress the weight matrices using Singular Value Decomposition (SVD)
* Improve the accuracy of the compressed network by carrying out low rank approximation of weights during network training
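
To get a feel for what the compression buys, here is a rough back-of-the-envelope sketch. It assumes the truncated factors $U$, $s$ and $V$ are stored separately instead of the reconstructed matrix (the notebook itself multiplies them back into a full-size matrix), and the helper names `dense_params` and `low_rank_params` are made up for this illustration.

```python
def dense_params(m, n):
    """Number of entries in a full m x n weight matrix."""
    return m * n

def low_rank_params(m, n, k):
    """Entries needed to store U[:, :k], s[:k] and V[:, :k] separately."""
    return k * (m + n + 1)

# weight shapes of the five hidden layers (784 -> 1024, then 1024 -> 1024 four times)
layers = [(784, 1024)] + [(1024, 1024)] * 4
k = 20  # rank used for the Student network

for m, n in layers:
    full, compact = dense_params(m, n), low_rank_params(m, n, k)
    print(f"{m}x{n}: {full} -> {compact} values ({full / compact:.1f}x smaller)")
```

The saving only materializes if the network is evaluated with the factors directly; a reconstructed $\hat{W}$ takes as much memory as $W$.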

### Models:

The first model, henceforth referred to as the **Teacher Network**, has five hidden **FC** layers of $1024$ units each, followed by an output layer with $10$ units. All the hidden layers use the **ReLU** non-linearity, while the output layer uses **Softmax** to predict the class. As in the previous assignments, the model minimizes the **Softmax cross-entropy** loss using the **Adam** optimizer with the learning rate set to $10^{-4}$, and converges at a loss of $0.2126$ over just $31$ epochs with a batch size of $50$ samples.
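
For reference, a minimal NumPy sketch of a softmax cross-entropy loss in the same `(classes, batch)` layout as the placeholders in this notebook. It is not the exact expression used in the graph: as an assumption for this sketch, it takes raw pre-softmax scores (`logits`), subtracts the column-wise maximum for numerical stability, and averages over samples only.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean cross-entropy; logits and one-hot labels have shape (classes, batch)."""
    # subtracting the per-sample maximum leaves softmax unchanged but avoids overflow
    shifted = logits - logits.max(axis=0, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=0, keepdims=True))
    # pick out the log-probability of the true class, then average over the batch
    return float(-(labels * log_probs).sum(axis=0).mean())

# with all-zero scores every class gets probability 1/10, so the loss is log(10)
logits = np.zeros((10, 4))
labels = np.eye(10)[:, :4]
print(softmax_cross_entropy(logits, labels), np.log(10))
```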

The second model, aka the **Student Network**, uses the low-rank approximated weights of the Teacher network as its starting point. The overall structure of the Student network, including the number of layers, the number of units, and the activation functions, is exactly the same as that of the Teacher network. During training, the low-rank approximation is applied again to the weights of each layer (except the last) to compute $\hat{W}$, and it is $\hat{W}$ that is used in the forward pass to the next layer. During backpropagation, however, a custom gradient function sets the derivative of this approximation to $1$, so the update is applied to $W$ rather than to $\hat{W}$. This network converges at a loss of $0.2213$ over $30$ epochs with a batch size of $100$ samples.
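
The idea of "forward with $\hat{W}$, update $W$" can be sketched outside TensorFlow in a few lines of NumPy. This is only an illustration under simplifying assumptions: a single linear layer, a squared-error loss, and plain gradient descent instead of Adam. The function names `low_rank` and `straight_through_step` are hypothetical.

```python
import numpy as np

def low_rank(W, k):
    """Rank-k approximation of W from its truncated SVD."""
    u, s, vh = np.linalg.svd(W, full_matrices=False)
    return (u[:, :k] * s[:k]) @ vh[:k, :]

def straight_through_step(W, x, target, k, lr):
    """One gradient step on 0.5 * mean ||W_hat^T x - target||^2, applied to W."""
    W_hat = low_rank(W, k)                   # forward pass uses the compressed weights
    err = W_hat.T @ x - target               # shape (n, batch)
    grad_W_hat = x @ err.T / x.shape[1]      # dL/dW_hat, shape (m, n)
    # treat d(W_hat)/dW as the identity, so dL/dW := dL/dW_hat
    return W - lr * grad_W_hat

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 5))
x = rng.standard_normal((6, 3))
target = rng.standard_normal((5, 3))
W = straight_through_step(W, x, target, k=2, lr=0.1)
```

Note that $W$ itself stays full rank after the update; only the weights used in the forward pass are rank $k$.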

### Output:

The Teacher network achieves train and test accuracies north of $99\%$ and $98\%$ respectively. The Student network, which starts from a baseline test accuracy of $91.81\%$ (the Teacher weights compressed to rank $20$, before any retraining), converges at about $96\%$ training and testing accuracy.

### Narrative:
The only tricky part in this assignment was figuring out how to create a custom gradient function. Initially, I tried running the Student network without specifying the gradient function, and it worked fine, giving an accuracy of $96\%$. After researching a bit online, I found that TensorFlow automatically assumes the custom gradient to be unity when no gradient is explicitly specified.

Having said that, after reading the TF documentation (which is really poor, by the way) and looking at similar examples on the web, I was able to implement my own custom gradient function. The trick was to use the `@tf.RegisterGradient` decorator to register the custom gradient function while building the graph, and then to use the `gradient_override_map` method on the graph to map the gradient of the **Identity** op to the registered function. I had to wrap the outputs of all layers (calculated using the approximated weights) in `tf.identity` within the override context so that the mapping is triggered at the time of execution.

## Boot

In [None]:
%tensorflow_version 1.x
%load_ext tensorboard

import tensorflow as tf
import numpy as np
import math

# !pip install librosa
import librosa
import IPython

In [None]:
tf.__version__

'1.15.0'

In [None]:
from tensorflow.python.client import device_lib
device_lib.list_local_devices()

!cat /proc/cpuinfo
!cat /proc/meminfo

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 63
model name	: Intel(R) Xeon(R) CPU @ 2.30GHz
stepping	: 0
microcode	: 0x1
cpu MHz		: 2300.000
cache size	: 46080 KB
physical id	: 0
siblings	: 2
core id		: 0
cpu cores	: 1
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt arat md_clear arch_capabilities
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs
bogomips	: 4600.00
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	:

## Load data

In [None]:
# Load MNIST data
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz


## Build graph for Teacher network

In [None]:
'''
  Teacher Neural Network
'''

tf.reset_default_graph()

!rm -rf ./logs/
STORE_PATH = "./logs/"

# initialize tensors
x = tf.placeholder(tf.float32, shape=(784, None), name="inputs")
y = tf.placeholder(tf.float32, shape=(10, None), name="labels")
lr = tf.constant(0.0001)
init = tf.initializers.glorot_normal()

def create_tensors(m, n, H, w_name, b_name):
  W = tf.Variable(init([m, n], dtype=tf.float32), dtype=tf.float32, name=w_name)
  b = tf.Variable(init([n, 1], dtype=tf.float32), dtype=tf.float32, name=b_name)
  if n == 10:
    Y = tf.nn.softmax(tf.transpose(W)@H + b)
  else:
    Y = tf.nn.relu(tf.transpose(W, name="trans"+w_name[-3:])@H + b, name="relu"+w_name[-3:])
  return W, b, Y

# implement 5 fully-connected hidden layers
Wi, bi, Yi = create_tensors(784, 1024, x, "wt_Li", "bias_Li")
W1, b1, Y1 = create_tensors(1024, 1024, Yi, "wt_L1", "bias_L1")
W2, b2, Y2 = create_tensors(1024, 1024, Y1, "wt_L2", "bias_L2")
W3, b3, Y3 = create_tensors(1024, 1024, Y2, "wt_L3", "bias_L3")
W4, b4, Y4 = create_tensors(1024, 1024, Y3, "wt_L4", "bias_L4")
Wo, bo, Yo = create_tensors(1024, 10, Y4, "wt_Lo", "bias_Lo")

# loss and optimizer
# note: Yo is already the softmax output, so this expression normalizes it with a second softmax
loss = tf.reduce_mean(-y*tf.log(tf.exp(Yo)/tf.reduce_sum(tf.exp(Yo), 0, True)), name="loss")
opt = tf.train.AdamOptimizer(learning_rate=lr).minimize(loss)

# compute accuracy
acc = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(y, axis=0), tf.argmax(Yo, axis=0)), dtype=tf.float32), name="accuracy")

with tf.Session() as sess:
    writer = tf.summary.FileWriter(STORE_PATH, sess.graph)

## Fit model on Teacher network

In [None]:
sess = tf.InteractiveSession()
tf.global_variables_initializer().run()

In [None]:
# Fit and evaluate
dim = mnist.train.num_examples
b_size = 50
maxIter = 40

errt = np.zeros(maxIter)  # for storing the loss per epoch

for i in range(maxIter):
  epoch_loss = 0
  for j in range(dim//b_size):
    feat, lab = mnist.train.next_batch(b_size)
    batch_loss, _ = sess.run([loss, opt], feed_dict={x: np.transpose(feat), y: np.transpose(lab)})
    epoch_loss += batch_loss
  # collect loss
  errt[i] = epoch_loss * b_size / dim
  # evaluate train and test accuracies
  train_accu = sess.run(acc, feed_dict={x: np.transpose(mnist.train.images), y: np.transpose(mnist.train.labels)})
  test_accu = sess.run(acc, feed_dict={x: np.transpose(mnist.test.images), y: np.transpose(mnist.test.labels)})
  print("epoch: ", i+1, "  loss: ", errt[i], "  train acc: ", train_accu, " test acc: ", test_accu)
  # early stopping
  if test_accu >= 0.98:
    break

epoch:  1   loss:  0.21395367618311537   train acc:  0.8524  test acc:  0.8645
epoch:  2   loss:  0.21309344024820762   train acc:  0.9238  test acc:  0.927
epoch:  3   loss:  0.21294779968532648   train acc:  0.9399091  test acc:  0.9433
epoch:  4   loss:  0.21285404059019955   train acc:  0.94274545  test acc:  0.9432
epoch:  5   loss:  0.2128465810418129   train acc:  0.91785455  test acc:  0.9223
epoch:  6   loss:  0.21280549243092536   train acc:  0.95967275  test acc:  0.9572
epoch:  7   loss:  0.2127521503919905   train acc:  0.93363637  test acc:  0.9373
epoch:  8   loss:  0.21274347630414095   train acc:  0.96689093  test acc:  0.965
epoch:  9   loss:  0.21273233738812533   train acc:  0.9730545  test acc:  0.9699
epoch:  10   loss:  0.21277180958877911   train acc:  0.9668546  test acc:  0.9619
epoch:  11   loss:  0.21272344814105468   train acc:  0.9746909  test acc:  0.9697
epoch:  12   loss:  0.21273506451736798   train acc:  0.9751818  test acc:  0.9682
epoch:  13   loss:

## Compress trained weights and evaluate baseline results

In [None]:
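# Extract the learned weights and biases of the five hidden layers
# (w_L1/b_L1 come from Wi/bi, w_L2/b_L2 from W1/b1, ..., w_L5/b_L5 from W4/b4)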
w_L1, b_L1, w_L2, b_L2, w_L3, b_L3, w_L4, b_L4, w_L5, b_L5 = sess.run([Wi, bi, W1, b1, W2, b2, W3, b3, W4, b4])

# Carry out SVD on weights
s_L1, u_L1, v_L1 = tf.linalg.svd(w_L1)
s_L2, u_L2, v_L2 = tf.linalg.svd(w_L2)
s_L3, u_L3, v_L3 = tf.linalg.svd(w_L3)
s_L4, u_L4, v_L4 = tf.linalg.svd(w_L4)
s_L5, u_L5, v_L5 = tf.linalg.svd(w_L5)

In [None]:
# Compute test accuracies for different compressions
# D is the number of singular values (the rank) kept for every hidden layer
def set_wts(D, to_np=0):
  w1 = tf.matmul(tf.matmul(u_L1[:,:D], tf.linalg.diag(s_L1[:D,])), tf.transpose(v_L1[:,:D]))
  w2 = tf.matmul(tf.matmul(u_L2[:,:D], tf.linalg.diag(s_L2[:D,])), tf.transpose(v_L2[:,:D]))
  w3 = tf.matmul(tf.matmul(u_L3[:,:D], tf.linalg.diag(s_L3[:D,])), tf.transpose(v_L3[:,:D]))
  w4 = tf.matmul(tf.matmul(u_L4[:,:D], tf.linalg.diag(s_L4[:D,])), tf.transpose(v_L4[:,:D]))
  w5 = tf.matmul(tf.matmul(u_L5[:,:D], tf.linalg.diag(s_L5[:D,])), tf.transpose(v_L5[:,:D]))
  if to_np:
    return w1.eval(), w2.eval(), w3.eval(), w4.eval(), w5.eval()
  else:
    return w1, w2, w3, w4, w5

for D in [10, 20, 50, 100, 200, 784]:
  w1, w2, w3, w4, w5 = set_wts(D)
  sess.run(tf.assign(Wi, w1))
  sess.run(tf.assign(W1, w2))
  sess.run(tf.assign(W2, w3))
  sess.run(tf.assign(W3, w4))
  sess.run(tf.assign(W4, w5))
  print("Evaluation results for D = ", D)
  print("Test accuracy: ", sess.run(acc, feed_dict={x: np.transpose(mnist.test.images), y: np.transpose(mnist.test.labels)}))

Evaluation results for D =  10
Test accuracy:  0.7541
Evaluation results for D =  20
Test accuracy:  0.9181
Evaluation results for D =  50
Test accuracy:  0.9706
Evaluation results for D =  100
Test accuracy:  0.9733
Evaluation results for D =  200
Test accuracy:  0.9774
Evaluation results for D =  784
Test accuracy:  0.9813
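
The truncation step itself can be checked in isolation with NumPy. The sketch below uses a small random matrix as a stand-in for a trained weight matrix (an assumption, since the trained weights are not available outside the session) and compares the Frobenius reconstruction error against the norm of the discarded singular values, which should agree. Note that `np.linalg.svd` returns `(u, s, vh)` with `vh` already transposed, whereas `tf.linalg.svd` returns `(s, u, v)`.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 48))  # random stand-in for a weight matrix

def truncate(W, k):
    """Return the rank-k reconstruction of W and all singular values."""
    u, s, vh = np.linalg.svd(W, full_matrices=False)
    W_k = (u[:, :k] * s[:k]) @ vh[:k, :]
    return W_k, s

for k in [5, 10, 20, 48]:
    W_k, s = truncate(W, k)
    err = np.linalg.norm(W - W_k)          # Frobenius norm of the residual
    tail = np.sqrt((s[k:] ** 2).sum())     # norm of the discarded singular values
    print(f"k={k:2d}  error={err:.4f}  discarded={tail:.4f}")
```

For a random matrix the singular values decay slowly, so the error drops slowly with `k`; the accuracies above suggest the trained weights are much more compressible than that.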


## Use compressed weights to build Student network

In [None]:
# Extract weights corr. to D=20 as numpy arrays
w1, w2, w3, w4, w5 = set_wts(20, to_np=1)

In [None]:
'''
  Student Neural Network
'''

sess.close()
tf.reset_default_graph()

!rm -rf ./logs/
STORE_PATH = "./logs/"

# register custom gradient: pass the incoming gradient through unchanged (a derivative of 1)
@tf.RegisterGradient("SvdGrad")
def svd_grad(op, grad):
  return grad

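# function to carry out the rank-20 SVD approximation of one hidden layer's weights
# note: W_hat is a non-trainable Variable, so it is computed from w when the variables are initialized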
def svd_op(w, b, H, w_name):
  s, u, v = tf.linalg.svd(w, name="svd_op"+w_name[-3:])
  W_hat = tf.Variable(tf.matmul(tf.matmul(u[:,:20], tf.linalg.diag(s[:20,])), tf.transpose(v[:,:20])), name=w_name, trainable=False)
  Y = tf.nn.relu(tf.transpose(W_hat, name="trans"+w_name[-3:])@H + b, name="relu"+w_name[-3:])
  return Y

# initialize tensors
init = tf.initializers.glorot_normal()
lr = tf.constant(0.001, name="lr")

x = tf.placeholder(tf.float32, shape=(784, None), name="inputs")
y = tf.placeholder(tf.float32, shape=(10, None), name="labels")

Wi = tf.Variable(w1, dtype=tf.float32, name="wt_Li")
W1 = tf.Variable(w2, dtype=tf.float32, name="wt_L1")
W2 = tf.Variable(w3, dtype=tf.float32, name="wt_L2")
W3 = tf.Variable(w4, dtype=tf.float32, name="wt_L3")
W4 = tf.Variable(w5, dtype=tf.float32, name="wt_L4")
Wo = tf.Variable(init([1024, 10], dtype=tf.float32), dtype=tf.float32, name="wt_Lo")

bi = tf.Variable(b_L1, dtype=tf.float32, name="bias_Li")
b1 = tf.Variable(b_L2, dtype=tf.float32, name="bias_L1")
b2 = tf.Variable(b_L3, dtype=tf.float32, name="bias_L2")
b3 = tf.Variable(b_L4, dtype=tf.float32, name="bias_L3")
b4 = tf.Variable(b_L5, dtype=tf.float32, name="bias_L4")
bo = tf.Variable(init([10, 1], dtype=tf.float32), dtype=tf.float32, name="bias_Lo")

Yi = svd_op(Wi, bi, x, "W_hat_Li")
Y1 = svd_op(W1, b1, Yi, "W_hat_L1")
Y2 = svd_op(W2, b2, Y1, "W_hat_L2")
Y3 = svd_op(W3, b3, Y2, "W_hat_L3")
Y4 = svd_op(W4, b4, Y3, "W_hat_L4")
Yo = tf.nn.softmax(tf.transpose(Wo, name="trans_Lo") @ Y4 + bo)

g = tf.get_default_graph()

# loss and optimizer
loss = tf.reduce_mean(-y*tf.log(tf.exp(Yo)/tf.reduce_sum(tf.exp(Yo), 0, True)), name="loss")
opt = tf.train.AdamOptimizer(learning_rate=lr).minimize(loss)

# compute accuracy
acc = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(y, axis=0), tf.argmax(Yo, axis=0)), dtype=tf.float32), name="accuracy")

# override gradients
with g.gradient_override_map({'Identity': 'SvdGrad'}):
    Yi = tf.identity(Yi, name="svd_Li")
    Y1 = tf.identity(Y1, name="svd_L1")
    Y2 = tf.identity(Y2, name="svd_L2")
    Y3 = tf.identity(Y3, name="svd_L3")
    Y4 = tf.identity(Y4, name="svd_L4")
    grad_Li = tf.gradients(Yi, Wi, name="grad_Li")
    grad_L1 = tf.gradients(Y1, W1, name="grad_L1")
    grad_L2 = tf.gradients(Y2, W2, name="grad_L2")
    grad_L3 = tf.gradients(Y3, W3, name="grad_L3")
    grad_L4 = tf.gradients(Y4, W4, name="grad_L4")

with tf.Session() as sess:
    writer = tf.summary.FileWriter(STORE_PATH, sess.graph)



## Fit and evaluate Student network

In [None]:
sess = tf.InteractiveSession()
tf.global_variables_initializer().run()

In [None]:
# Fit and evaluate
dim = mnist.train.num_examples
b_size = 100
maxIter = 30

errt = np.zeros(maxIter)  # for storing the loss per epoch

for i in range(maxIter):
  epoch_loss = 0
  for j in range(dim//b_size):
    feat, lab = mnist.train.next_batch(b_size)
    batch_loss, _ = sess.run([loss, opt], feed_dict={x: np.transpose(feat), y: np.transpose(lab)})
    epoch_loss += batch_loss
  # collect loss
  errt[i] = epoch_loss * b_size / dim
  # evaluate train and test accuracies
  train_accu = sess.run(acc, feed_dict={x: np.transpose(mnist.train.images), y: np.transpose(mnist.train.labels)})
  test_accu = sess.run(acc, feed_dict={x: np.transpose(mnist.test.images), y: np.transpose(mnist.test.labels)})
  print("epoch: ", i+1, "  loss: ", errt[i], "  train acc: ", train_accu, " test acc: ", test_accu)
  # early stopping
  if test_accu >= 0.97:
    break

epoch:  1   loss:  0.22152437635443428   train acc:  0.9528  test acc:  0.9531
epoch:  2   loss:  0.22135312275453048   train acc:  0.9578182  test acc:  0.9582
epoch:  3   loss:  0.2213447708975185   train acc:  0.96114546  test acc:  0.9597
epoch:  4   loss:  0.22133866315538234   train acc:  0.9602182  test acc:  0.9591
epoch:  5   loss:  0.2213400980017402   train acc:  0.96221817  test acc:  0.9616
epoch:  6   loss:  0.22133784313093532   train acc:  0.96205455  test acc:  0.9602
epoch:  7   loss:  0.22133790788325397   train acc:  0.96165454  test acc:  0.9607
epoch:  8   loss:  0.22133393257856368   train acc:  0.964  test acc:  0.9635
epoch:  9   loss:  0.2213344425775788   train acc:  0.9608727  test acc:  0.9618
epoch:  10   loss:  0.22133299087936228   train acc:  0.96196365  test acc:  0.9619
epoch:  11   loss:  0.22133188570087606   train acc:  0.9607818  test acc:  0.9606
epoch:  12   loss:  0.22133179502053693   train acc:  0.9613091  test acc:  0.963
epoch:  13   loss: 

In [None]:
g.get_collection("trainable_variables")

[<tf.Variable 'wt_Li:0' shape=(784, 1024) dtype=float32_ref>,
 <tf.Variable 'wt_L1:0' shape=(1024, 1024) dtype=float32_ref>,
 <tf.Variable 'wt_L2:0' shape=(1024, 1024) dtype=float32_ref>,
 <tf.Variable 'wt_L3:0' shape=(1024, 1024) dtype=float32_ref>,
 <tf.Variable 'wt_L4:0' shape=(1024, 1024) dtype=float32_ref>,
 <tf.Variable 'wt_Lo:0' shape=(1024, 10) dtype=float32_ref>,
 <tf.Variable 'bias_Li:0' shape=(1024, 1) dtype=float32_ref>,
 <tf.Variable 'bias_L1:0' shape=(1024, 1) dtype=float32_ref>,
 <tf.Variable 'bias_L2:0' shape=(1024, 1) dtype=float32_ref>,
 <tf.Variable 'bias_L3:0' shape=(1024, 1) dtype=float32_ref>,
 <tf.Variable 'bias_L4:0' shape=(1024, 1) dtype=float32_ref>,
 <tf.Variable 'bias_Lo:0' shape=(10, 1) dtype=float32_ref>]
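
As a quick sanity check on the size of the network, the shapes listed above can be multiplied out in plain Python. The dictionary below is copied by hand from that listing.

```python
from math import prod

# shapes copied from the trainable-variable listing above
shapes = {
    "wt_Li": (784, 1024),
    "wt_L1": (1024, 1024),
    "wt_L2": (1024, 1024),
    "wt_L3": (1024, 1024),
    "wt_L4": (1024, 1024),
    "wt_Lo": (1024, 10),
    "bias_Li": (1024, 1),
    "bias_L1": (1024, 1),
    "bias_L2": (1024, 1),
    "bias_L3": (1024, 1),
    "bias_L4": (1024, 1),
    "bias_Lo": (10, 1),
}

total = sum(prod(shape) for shape in shapes.values())
print(total)  # 5012490 trainable parameters
```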