## Imports of utility functions we will be using
Below we import the code for our residual network and deep hybrid scatter network. All the real "heavy lifting" for this project is done in the resnet and ScatteringTransform scripts. For a more detailed look at what is going on under the hood, give those scripts a peek.

In [1]:
import sys
import os
sys.path.append('/home/ubuntu/feature_viz/resnet')
sys.path.append('/home/ubuntu/feature_viz/ScatteringTransform/src/model')
from flags import define_flags as scatternet_define_flags
from train_mnist import train_model as scatternet_train
from res_features import train as resnet_train
from visualize import visualize_features
import sklearn.linear_model  # a bare `import sklearn` does not guarantee the linear_model submodule is loaded
import tensorflow as tf

mnist = tf.contrib.learn.datasets.load_dataset("mnist")
x_test, y_test = mnist.test.next_batch(1280)



Extracting MNIST-data/train-images-idx3-ubyte.gz
Extracting MNIST-data/train-labels-idx1-ubyte.gz
Extracting MNIST-data/t10k-images-idx3-ubyte.gz
Extracting MNIST-data/t10k-labels-idx1-ubyte.gz


## Multilayer Neural Network & Logistic Regression
The score we use for comparison is mean accuracy: the fraction of test samples that are assigned the correct label. The _viz features and labels used below were never used for training. They were used only for the TSNE visualizations, which is why we can still justify using them as our test set.

In [None]:
def logistic_regression(train_features, train_labels, test_features, test_labels):
    lm = sklearn.linear_model.LogisticRegression(multi_class='multinomial', solver='saga')
    lm.fit(train_features, train_labels)
    score = lm.score(test_features, test_labels)
    return score

def NN_score(nn, x_test, test_labels):
    '''
    Return the mean accuracy of a trained network on the test set.

    nn is either the resnet or the hybrid scatter network.

    We pass the raw test images rather than precomputed features: the network is
    deterministic, so its forward pass regenerates the same intermediate features.
    '''
    return nn.score(x_test, test_labels)
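
As a small illustration of the score described above, here is a dependency-free sketch of mean accuracy. The helper name `mean_accuracy` and the example labels are made up for illustration; the `score` methods of the networks and of scikit-learn's `LogisticRegression` are assumed to compute the same quantity.

```python
def mean_accuracy(predicted_labels, true_labels):
    """Fraction of samples whose predicted label equals the true label."""
    if len(predicted_labels) != len(true_labels):
        raise ValueError("predictions and labels must have the same length")
    if not true_labels:
        raise ValueError("cannot score an empty test set")
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return correct / len(true_labels)


# Hypothetical predictions for five test digits; four of the five are right.
print(mean_accuracy([7, 2, 1, 0, 4], [7, 2, 1, 0, 9]))
```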
    

# FEATURE EXTRACTION WITH RESIDUAL NETWORK AND FINE TUNING
## Resnet 50 pretrained from the Kth layer and higher
Below we build two pretrained residual networks. Both are 50 layer residual networks whose variables are initialized with the values from a model pretrained on imagenet. The difference is that one freezes the weights before layer K, while the other leaves all variables trainable. The motivation for freezing is that the initial layers of a CNN learn features that are generally task invariant. We therefore assume that the layers before K are "good enough" and that training should be spent on tuning the layers >= K.
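
The freezing rule can be sketched without TensorFlow as a partition of variable names by block index. Everything here is an assumption made for illustration: the scope name, the variable names, the helper `split_by_block`, and the choice to treat the stem convolution as block 0. The actual logic lives in the resnet scripts and may differ.

```python
import re

BACKBONE_SCOPE = "resnet_v1_50/"  # assumed scope name, for illustration only


def split_by_block(variable_names, freeze_before_k):
    """Partition variable names into (frozen, trainable) lists."""
    frozen, trainable = [], []
    for name in variable_names:
        if not name.startswith(BACKBONE_SCOPE):
            # Newly added classification head: always trained.
            trainable.append(name)
            continue
        match = re.search(r"/block(\d+)/", name)
        # Assumption: backbone variables outside a numbered block (the stem)
        # count as block 0, so they are frozen whenever freeze_before_k > 0.
        block_index = int(match.group(1)) if match else 0
        if block_index < freeze_before_k:
            frozen.append(name)
        else:
            trainable.append(name)
    return frozen, trainable


# Hypothetical variable names, one per block plus a new head.
names = [
    "resnet_v1_50/conv1/weights",
    "resnet_v1_50/block1/unit_1/conv1/weights",
    "resnet_v1_50/block2/unit_1/conv1/weights",
    "resnet_v1_50/block3/unit_1/conv1/weights",
    "resnet_v1_50/block4/unit_1/conv1/weights",
    "head/fc1/weights",
]
frozen, trainable = split_by_block(names, freeze_before_k=3)
print("frozen:", frozen)
print("trainable:", trainable)
```

With `freeze_before_k=0` nothing is frozen, which corresponds to the fully trainable network. In TensorFlow 1.x, one common way to apply such a split is to pass only the trainable variables as `var_list` to the optimizer's `minimize` call.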

For both resnets we stack 3 additional fully connected layers on top of the flattened features generated by the network. The first 2 layers use rectified linear units as activation functions, and the final layer uses a softmax. All 3 layers initialize their weights with a uniform Xavier initializer. More information on the motivation behind Xavier initializers can be found [here](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf).

The purpose of the final 3 stacked fully connected layers is to generate logits for classification. These layers can be thought of as replacements for the final fully connected layers that generate logits for the 1000 imagenet classes. Besides the mismatch in classes between imagenet and MNIST, we do not reuse the variables from those layers because they are far more task dependent.
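
To make the classification head concrete, here is a minimal pure-Python sketch of three fully connected layers with uniform Xavier initialization, ReLU on the first two layers and a softmax on the last. The layer widths, the random input and the helper names are assumptions chosen for illustration, and biases are omitted for brevity; the real head is built in TensorFlow inside the resnet scripts.

```python
import math
import random


def xavier_uniform(fan_in, fan_out, rng):
    """Sample a fan_in x fan_out matrix from U(-limit, limit), limit = sqrt(6 / (fan_in + fan_out))."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)] for _ in range(fan_in)]


def dense(x, weights):
    """Multiply a feature vector by a weight matrix (no bias term)."""
    return [sum(x[i] * weights[i][j] for i in range(len(x))) for j in range(len(weights[0]))]


def relu(v):
    return [max(0.0, a) for a in v]


def softmax(v):
    peak = max(v)  # subtract the max for numerical stability
    exps = [math.exp(a - peak) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
sizes = [64, 32, 16, 10]  # hypothetical widths: flattened features -> 2 hidden layers -> 10 digits
layers = [xavier_uniform(a, b, rng) for a, b in zip(sizes, sizes[1:])]

features = [rng.random() for _ in range(sizes[0])]  # stand-in for flattened resnet features
hidden = relu(dense(relu(dense(features, layers[0])), layers[1]))
probabilities = softmax(dense(hidden, layers[2]))
print(len(probabilities), round(sum(probabilities), 6))
```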

In [None]:
resnet_classifier_k = resnet_train(freeze_before_k=3) # all blocks before block k are frozen for training
resnet_k_features_test = resnet_classifier_k.get_features(x_test)

nn_res_k_score = NN_score(resnet_classifier_k, x_test, y_test)
print('fully connected neural network stacked on frozen resnet features accuracy: ', nn_res_k_score)

tf.reset_default_graph() # clear graph for classifier_0
resnet_classifier_0 = resnet_train(freeze_before_k=0)
resnet_0_features_test = resnet_classifier_0.get_features(x_test)

nn_res_0_score = NN_score(resnet_classifier_0, x_test, y_test)
print('fully connected neural network stacked on unfrozen resnet features accuracy: ', nn_res_0_score)

Extracting MNIST-data/train-images-idx3-ubyte.gz
Extracting MNIST-data/train-labels-idx1-ubyte.gz
Extracting MNIST-data/t10k-images-idx3-ubyte.gz
Extracting MNIST-data/t10k-labels-idx1-ubyte.gz
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Restoring parameters from /home/ubuntu/feature_viz/resnet/resnet_v1_50.ckpt
iter 0 train accuracy: 0.09375
iter 100 test accuracy: 0.6499999761581421
EPOCH:  0
saving after 100 iterations
iter 100 train accuracy: 0.734375
iter 200 test accuracy: 0.9200000166893005
EPOCH:  0
saving after 200 iterations
iter 200 train accuracy: 0.8359375
iter 300 test accuracy: 0.8799999952316284
EPOCH:  0
saving after 300 iterations
iter 300 train accuracy: 0.9140625
iter 400 test accuracy: 0.8799999952316284
EPOCH:  0
saving after 400 iterations
iter 400 train accuracy: 0.890625
iter 500 test accuracy: 0.9399999976158142
EPOCH:  1
saving after 500 iterations
iter 500 train accuracy: 0.890625
iter 600 test accuracy: 0.9100000262260437
EPOCH:  1
savi

# FEATURE EXTRACTION WITH SCATTER NETWORK
## Scatternet Training and Testing on MNIST
The scatternet code is mostly based on the work done by [tdeboissiere](https://github.com/tdeboissiere) found [here](https://github.com/tdeboissiere/DeepLearningImplementations/tree/master/ScatteringTransform). This scatter network is actually a bit different from the scatter network found in Bruna & Mallat's work. What we analyze here is a deep hybrid scatter network. We use scatter transforms similar to those found in [Bruna & Mallat's work](https://arxiv.org/abs/1203.1513), followed by a few convolutional layers and fully connected layers. A detailed explanation of how this works can be found in Oyallon's [Deep Hybrid Networks paper](https://arxiv.org/abs/1703.08961).

In [None]:
scatternet_define_flags() 
scatternet_classifier = scatternet_train()  # no pretrained weights for the hybrid scatter net so we do not need to specify a k

nn_scatternet_score = NN_score(scatternet_classifier, x_test, y_test)
print('fully connected neural network stacked on scatter network features accuracy: ', nn_scatternet_score)

# FEATURE VISUALIZATION
Sample some data points from MNIST and map them into a feature space with each of our classifiers. Once we have these new feature spaces we can use TSNE to reduce their dimensionality to a space that is easy to visualize.

## TSNE on 50 layer residual network features
It looks like we are forming some clusters for each of the classes, but there is heavy overlap. Multiple clusters are being formed for some of the classes. Some interesting future work could be a more in-depth cluster analysis. There appear to be 3 different clusters for images representing 9s. Perhaps one cluster contains 9s that are straight, another 9s slanted to the left, and a final cluster 9s slanted to the right.

In [None]:
tf.reset_default_graph(); resnet_classifier_k.load() # have to save and load b/c of conflicting params between resnets

resnet_k_features_test = resnet_classifier_k.get_features(x_test)
visualize_features(resnet_k_features_test, y_test, 'resnet50_4')

## TSNE on unfrozen 50 layer residual network features
It appears that we get better separation between classes with the unfrozen residual network. I suspect this is because the MNIST dataset did not impose a data constraint on us. However, if we were to significantly reduce the amount of training data, I suspect that the frozen residual network would provide better features.

In [None]:
tf.reset_default_graph(); resnet_classifier_0.load() # have to save and load b/c of conflicting params between resnets

resnet_0_features_test = resnet_classifier_0.get_features(x_test)
visualize_features(resnet_0_features_test, y_test, 'resnet50_0')

## TSNE on Deep Hybrid Network
Class separation is much stronger here than what we saw with our 50 layer residual network. I am guessing this is because all the weights of the Deep Hybrid Network were trained, whereas we froze all the weights and biases below the 3rd block of the resnet.

In [None]:
scatternet_features_test = scatternet_classifier.get_features(x_test)
visualize_features(scatternet_features_test, y_test, 'scatternet')

# IMAGE CLASSIFICATION


## Train images and feature generation
Collect a batch of training images that we can use as a training set for the logistic regression models.

In [None]:
x_train, y_train = mnist.train.next_batch(1280)

## Deep Hybrid Scatter Network: logistic regression vs fully connected neural network

In [None]:
scatternet_features_train = scatternet_classifier.get_features(x_train)

lm_scatternet_score = logistic_regression(scatternet_features_train, y_train, scatternet_features_test, y_test)
print('logistic regression on deep hybrid scatter network features accuracy: ', lm_scatternet_score)

nn_scatternet_score = NN_score(scatternet_classifier, x_test, y_test)
print('fully connected neural network stacked on scatter network features accuracy: ', nn_scatternet_score)

## Frozen block K resnet: logistic regression vs fully connected neural network

In [None]:
tf.reset_default_graph(); resnet_classifier_k.load() # have to save and load b/c of conflicting params between resnets
resnet_k_features_train = resnet_classifier_k.get_features(x_train)

lm_res_k_score = logistic_regression(resnet_k_features_train, y_train, resnet_k_features_test, y_test)
print('logistic regression on frozen resnet features accuracy: ', lm_res_k_score)

nn_res_k_score = NN_score(resnet_classifier_k, x_test, y_test)
print('fully connected neural network stacked on frozen resnet features accuracy: ', nn_res_k_score)

## Unfrozen resnet: logistic regression vs fully connected neural network

In [None]:
tf.reset_default_graph(); resnet_classifier_0.load() # have to save and load b/c of conflicting params between resnets
resnet_0_features_train = resnet_classifier_0.get_features(x_train)

lm_res_0_score = logistic_regression(resnet_0_features_train, y_train, resnet_0_features_test, y_test)
print('logistic regression on unfrozen resnet features accuracy: ', lm_res_0_score)

nn_res_0_score = NN_score(resnet_classifier_0, x_test, y_test)
print('fully connected neural network stacked on unfrozen resnet features accuracy: ', nn_res_0_score)

# Results
In our analysis we compared 6 different classifiers, which differ along three axes. First, the features were generated either by a 50 layer residual network or by a deep hybrid network. Second, the variables of the 50 layer residual network were either frozen if they belonged to a layer earlier than K, or initialized from the pretrained model and left free to train. Third, the actual classification was done in one of two ways: with a basic multiclass logistic regression model, or with a multilayer fully connected neural network. Our logistic regression model used saga for optimization. More information about saga can be found [here](https://www.di.ens.fr/~fbach/Defazio_NIPS2014.pdf) in a 2014 NIPS paper.

1. The unfrozen residual network and the deep hybrid network performed very similarly. It is unclear which of the two is actually superior at this task. The Deep Hybrid Network performed marginally better, but its hyperparameters and preprocessing were optimized for MNIST. For the residual network, by contrast, no preprocessing was used aside from resizing the image, and hyperparameters were left at their defaults. I am also unsure whether any image preprocessing was used when the pretrained residual network was trained on imagenet. For example, we did not apply mean shifting to the MNIST images, and if mean shifting of each channel was used during pretraining, this could render the initialized weights useless.

1. An interesting next piece of work would be to compare the features generated by a basic Scattering Convolutional Network with those generated by a Deep Hybrid Network.

| | Frozen Resnet LR | Frozen Resnet NN | Unfrozen Resnet LR | Unfrozen Resnet NN | Deep Hybrid Network LR | Deep Hybrid Network NN |
| ------------ | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
| test | 0.9 | 0.8 | 0.7 | 0.6 | 0.5 | 0.4 |
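
The mean shifting concern raised in the first point of the results can be illustrated with a short sketch. It assumes MNIST pixels in [0, 1] and uses the per-channel RGB means from the TF-slim VGG-style preprocessing (123.68, 116.78, 103.94 on a 0-255 scale). Whether the checkpoint used in this notebook was trained with exactly this preprocessing is an assumption that should be checked against its source. The helper name `mean_shift_grayscale` is made up for illustration.

```python
# Per-channel means (R, G, B) on a 0-255 scale, as used by the TF-slim
# VGG-style preprocessing. Assumption: the pretrained checkpoint expects them.
CHANNEL_MEANS = (123.68, 116.78, 103.94)


def mean_shift_grayscale(image, channel_means=CHANNEL_MEANS):
    """Convert a grayscale image with values in [0, 1] to mean-shifted RGB.

    `image` is a list of rows of floats. Each output pixel is an (R, G, B)
    tuple: the gray value is rescaled to 0-255, copied into all three
    channels, and the corresponding channel mean is subtracted.
    """
    shifted = []
    for row in image:
        shifted.append([
            tuple(pixel * 255.0 - mean for mean in channel_means)
            for pixel in row
        ])
    return shifted


tiny = [[0.0, 1.0], [0.5, 0.25]]  # hypothetical 2x2 "digit"
out = mean_shift_grayscale(tiny)
print(out[0][0])
```

Feeding unshifted images to a network pretrained on shifted ones changes the input distribution seen by the first layers, which is one way the frozen early blocks could end up less useful than expected.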