# Neural Network Examples - Feature Learning

This notebook aims to accomplish the following 3 goals:

1. Visually demonstrate the power of neural networks as hierarchical feature learners. 
2. Serve as a lightweight introduction to building/training neural networks in **Keras**, a user-friendly Python wrapper for **TensorFlow**.
3. Provide examples of text-processing neural nets including **recurrent neural networks**, as well as examples of applying **transfer learning** with pre-trained word vectors.  

## Installing Keras
TRY THIS FIRST (in command line):

conda install -c conda-forge keras

If that doesn't work - 

conda install tensorflow  
move into a folder for installing tools  
git clone https://github.com/fchollet/keras.git  
cd keras  
python setup.py install  

In [1]:
#Installing network structure viz

#!pip install pydot-ng
#!brew install graphviz
#!pip install pydot

In [2]:
# display and plotting imports
%pylab inline 
import seaborn as sns
sns.set()
from IPython.display import SVG

import pandas as pd

# sklearn imports
from sklearn.datasets import fetch_openml  # replaces fetch_mldata, which was removed from scikit-learn
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.cluster import KMeans

# keras imports
from keras.utils import np_utils
from keras.utils.vis_utils import model_to_dot
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

from keras.models import Model, Sequential
from keras.layers import Dense, Activation, Dropout
from keras.layers import Input, Embedding, Bidirectional, LSTM

# gensim import for word2vec loading
from gensim.models.keyedvectors import KeyedVectors

Populating the interactive namespace from numpy and matplotlib


Using TensorFlow backend.


## Digit Images Example (MNIST)

We'll start by simply loading in the MNIST digits data (restricted to digits 0-4), doing some PCA visualization in 2 dimensions, and seeing how well a simple linear model (softmax regression) can perform on this 2-dim representation.

In [6]:
# We are building some model to classify the 5 digits

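# fetch_mldata("MNIST Original") no longer works; fetch_openml is its replacement
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, as_frame=False)  # as_frame=False returns NumPy arrays

X_digits, Y_digits = mnist.data, mnist.target
Y_digits = Y_digits.astype(int)  # OpenML labels come back as strings, so cast to int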
# only keeping 0 1 2 3 4
X_digits, Y_digits = X_digits[Y_digits < 5], Y_digits[Y_digits < 5]

print(X_digits.shape)

X_train, X_test, y_train, y_test = (train_test_split(X_digits, Y_digits, 
                                                     test_size = .2, random_state = 42))




As usual, we want to standard scale before running PCA.

In [None]:
# Before we do PCA, we always want to standardize the features!

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit and transform
X_test = scaler.transform(X_test)        # take the same scaler object and transform the x_test
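
For intuition, here is a minimal NumPy sketch of what the scaling step does, assuming toy data in place of the MNIST pixels. The mean and spread are estimated on the training rows only and then reused for the test rows. The helper names `fit_standardizer` and `standardize` and the toy arrays are illustrative assumptions, not part of scikit-learn.

```python
import numpy as np

def fit_standardizer(X):
    """Return per-column mean and scale estimated from X."""
    mean = X.mean(axis=0)
    scale = X.std(axis=0)
    # Constant columns (e.g. always-black pixels) have zero spread;
    # use a scale of 1 for them to avoid dividing by zero.
    scale[scale == 0] = 1.0
    return mean, scale

def standardize(X, mean, scale):
    """Center and scale X with statistics computed elsewhere."""
    return (X - mean) / scale

toy_train = np.array([[0.0, 10.0, 5.0],
                      [2.0, 20.0, 5.0],
                      [4.0, 30.0, 5.0]])
toy_test = np.array([[2.0, 40.0, 5.0]])

mean, scale = fit_standardizer(toy_train)
toy_train_scaled = standardize(toy_train, mean, scale)
toy_test_scaled = standardize(toy_test, mean, scale)  # reuses the training statistics
```

Reusing the training statistics on the test rows is the same reason the cell above calls `fit_transform` on the training data but only `transform` on the test data.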

Now we'll do dimensionality reduction to 2 principal components and visualize them. We can clearly see patterns that will let us separate the various digit classes, but also a lot of messiness. **We shouldn't expect a linear model to have outstanding performance** on this representation, since the classes are clearly not linearly separable.  

In [None]:
pca = PCA(n_components = 2)

X_train_2PC = pca.fit_transform(X_train)    # pca fit_transform the train data
X_test_2PC = pca.transform(X_test)          # pca transform the test data

plt.scatter(X_train_2PC[:,0], X_train_2PC[:,1], c = y_train)
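
As a rough illustration of what the PCA projection computes, the sketch below reproduces it with a plain SVD on a small random matrix. The toy data and variable names are assumptions for illustration, and the signs of the components may differ from what scikit-learn's `PCA` returns.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for the scaled training matrix: 200 rows, 5 correlated features
toy_X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

# Center the data, then take the top-2 right singular vectors as principal axes
toy_mean = toy_X.mean(axis=0)
_, singular_values, Vt = np.linalg.svd(toy_X - toy_mean, full_matrices=False)
components = Vt[:2]                            # shape (2, 5)
toy_2pc = (toy_X - toy_mean) @ components.T    # shape (200, 2)

# Share of the total variance captured by each kept component
explained_ratio = singular_values[:2] ** 2 / np.sum(singular_values ** 2)
```

Note that nothing in this computation looks at the labels, which is why the resulting 2-D picture is not guaranteed to separate the classes.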

Running the softmax regression confirms what we expected above - mediocre performance.

In [None]:
# We try using a linear model to classify the above data

lr = LogisticRegression(multi_class = 'multinomial', solver = 'lbfgs')
lr.fit(X_train_2PC, y_train)
lr.score(X_test_2PC, y_test)   # we see that the accuracy is only 64%

### Building a Neural Network Model

We've already taken the step of standard scaling our data, so we're nearly ready to build a NN. 

We do need to adjust the format of the training labels - right now we have a 1 dimensional array of digit labels like [0, 0, 1, 3, ...], but **multi-class NN output format requires a 2-dim array with binary columns corresponding to each class** (one hot encoding). Luckily, keras provides some utilities that let us easily reformat. 

In [None]:
y_train_cat = np_utils.to_categorical(y_train)

y_train_cat  # this is some kind of one-hot encoding to fit into our deep learning model
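
For intuition, the one-hot format described above can be reproduced in a few lines of NumPy. This is a sketch with a hypothetical helper name (`one_hot`) and toy labels, not the Keras implementation.

```python
import numpy as np

def one_hot(labels, num_classes=None):
    """One-hot encode a 1-D array of non-negative integer labels."""
    labels = np.asarray(labels, dtype=int)
    if num_classes is None:
        num_classes = labels.max() + 1
    encoded = np.zeros((labels.shape[0], num_classes), dtype="float32")
    # Set a single 1 per row, in the column given by that row's label
    encoded[np.arange(labels.shape[0]), labels] = 1.0
    return encoded

toy_labels = np.array([0, 0, 1, 3])
toy_one_hot = one_hot(toy_labels, num_classes=5)
```

Taking `argmax` along each row recovers the original integer labels.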

Now we get to the fun part! We'll construct our first NN with Keras. 

Here's a quick breakdown of what all of these component parts are:

 * **Sequential** : the default container for a multi-layer network, holding a linear stack of layers **(the base, or foundation, that layers are added to)**
 * **Dense** : basic hidden layer type - fully connected, meaning that for each node we learn a weight for each of the previous layer features, just like logistic regression. The first argument is the number of nodes (output feature dimensions) 
 * **Activation** (to transform our output into another form): The nonlinearity we pass through at each layer. Typical choices are 'sigmoid', 'tanh', and 'relu', **'relu' often works best.** The activation at the end **(softmax in this case, because we are doing a multi-class classifier)** corresponds to the output format we want, which in this case is multi-class. We would use **sigmoid for binary classification**, and **no activation for a regression problem.**
 * **Loss**: Which loss function to optimize.
 * **Optimizer** : Which style of gradient descent to use. 'adam' (adaptive moment estimation) often works really well for optimizing.
 * **Epochs** (1 epoch = 1 full pass through the dataset): Number of passes through the training data. Too few can underfit, too many can overfit. Can be **optimized with validation/CV including with early stopping methods.**
 * **Batch Size** (batch size 64 = processing a set of 64 rows at a time): Number of samples per gradient update. Compare with stochastic gradient descent, which updates on one sample at a time - we train NNs through mini-batch gradient descent, and this controls the mini-batch size. Larger batch sizes lead to faster epochs but give fewer, less noisy updates per epoch, which runs the risk of settling into a poor local minimum. 
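
To make the softmax activation and the cross-entropy loss from the list above concrete, here is a small NumPy sketch. The function names and toy numbers are illustrative assumptions; Keras computes these internally.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row max keeps exp() numerically stable."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_crossentropy(y_true_one_hot, probs, eps=1e-12):
    """Mean negative log-probability assigned to the true class."""
    return float(-np.mean(np.sum(y_true_one_hot * np.log(probs + eps), axis=1)))

toy_logits = np.array([[2.0, 0.5, 0.1],
                       [0.2, 0.2, 3.0]])
toy_targets = np.array([[1.0, 0.0, 0.0],
                        [0.0, 0.0, 1.0]])

toy_probs = softmax(toy_logits)                       # each row sums to 1
toy_loss = categorical_crossentropy(toy_targets, toy_probs)
# Loss of a model that guesses uniformly over the 3 classes, for comparison
uniform_loss = categorical_crossentropy(toy_targets, np.full((2, 3), 1 / 3))
```

The loss shrinks as the probability assigned to the true class grows, which is the quantity gradient descent pushes on during training.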

In [None]:
'''
 In this network structure, note that we follow a very common heuristic of "funneling"
 to lower dimensional representations over time with multiple layers. Tuning the exact
 choice of number of nodes and layers is quite challenging and there aren't generically
 correct choices, but this heuristic often works pretty well.
'''

NN = Sequential()  # base layer of your model, base of your lego building (foundation of your model)

# Add more layers on the base layer
NN.add(Dense(100,  
             input_dim = X_train.shape[1]) # telling our model to expect 784 features coming in (from X_train.shape) == (___, 784)
      ) # the first hidden layer needs the input feature dimension (28x28 = 784)
NN.add(Activation('relu'))    # add an activation layer after adding your dense layer

### Continue adding layers

NN.add(Dense(20))
NN.add(Activation('relu'))

NN.add(Dense(10)) # 10 nodes here; use Dense(2) for a hidden layer that is literally 2-D, as the layer name and the plots below assume
NN.add(Activation('relu', name = '2D_layer')) # naming this layer so we can extract it later


### We know that this is our last layer because 'softmax' is there.
NN.add(Dense(5))
NN.add(Activation('softmax'))


# After we have the whole neural network architecture (as written above, with all the layers)
NN.compile(loss='categorical_crossentropy', optimizer='adam')  # Now we compile all the layers together 
NN.fit(X_train, y_train_cat, epochs=20, batch_size=512, verbose=1) # track progress as we fit
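
As a quick piece of arithmetic on how epochs and batch size interact, the sketch below counts gradient updates. The training-set size used here is a hypothetical round number, not the actual row count of `X_train`.

```python
import math

def updates_per_epoch(n_samples, batch_size):
    """Mini-batches per pass; the last batch may be smaller than batch_size."""
    return math.ceil(n_samples / batch_size)

n_samples = 28_000   # hypothetical training-set size, for illustration only
for batch_size in (64, 512):
    per_epoch = updates_per_epoch(n_samples, batch_size)
    # 20 epochs, as in the fit call above
    print(batch_size, per_epoch, per_epoch * 20)
```

A larger batch size means fewer weight updates per pass over the data, which is why epochs run faster.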


We've built and trained our model already, but even before training it we can get a summary of the network structure and visualize it to understand exactly how the model is set up.   

In [None]:
NN.summary()  # very useful function to call

# Shows you information about all your layers that you have put in above

# For such a small model, there are already 80,785 parameters to train and obtain weights for
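
The parameter count can be checked by hand: each dense layer has one weight per (input, node) pair plus one bias per node. A short sketch, with `dense_params` as a hypothetical helper name:

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units

layer_sizes = [784, 100, 20, 10, 5]   # input, three hidden layers, output
per_layer = [dense_params(n_in, n_out)
             for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
total_params = sum(per_layer)
print(per_layer, total_params)   # [78500, 2020, 210, 55] 80785
```

Almost all of the parameters sit in the first layer, because that is where the 784 input pixels are connected.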

In [None]:
SVG(model_to_dot(NN, show_shapes=True).create(prog='dot', format='svg'))

Of course we can also run predictions and score our model on the test data. It does really well!

In [None]:
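accuracy_score(y_test, NN.predict(X_test).argmax(axis=1))  # argmax over the softmax outputs; predict_classes is not available in recent Keras releases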

# this gives a score of 98%!!
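
Scoring a softmax classifier comes down to picking the most probable class per row and comparing with the true labels. A minimal sketch with made-up probabilities (hypothetical values, not output from the model above):

```python
import numpy as np

# Hypothetical softmax outputs for 4 samples over the 5 digit classes
toy_probs = np.array([[0.90, 0.05, 0.02, 0.02, 0.01],
                      [0.10, 0.70, 0.10, 0.05, 0.05],
                      [0.05, 0.05, 0.10, 0.20, 0.60],
                      [0.30, 0.10, 0.40, 0.10, 0.10]])
toy_true = np.array([0, 1, 4, 0])

toy_pred = toy_probs.argmax(axis=1)                   # most probable class per row
toy_accuracy = float(np.mean(toy_pred == toy_true))   # fraction of correct rows
```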

Next, to build some visual intuition for how neural networks perform **representation learning by creating new features (often in reduced dimensions)**, we're going to do something neat: extract the two node outputs from the last hidden layer and visualize them. 

In [None]:
# But why does this work so well?
# Let's gain some intuition for this

# All of the nodes just perform some kind of feature transformation

In [None]:
# We are going to gain some intuition for how our features are transformed by cutting our network
# partway through.
# This feature extractor is itself a Keras model
feature_extractor = \
    Model(inputs=NN.input, 
          outputs=NN.get_layer('2D_layer').output) # output of the layer named '2D_layer' (the activation after the third dense layer)

X_train_NN_features_2d = feature_extractor.predict(X_train)
X_test_NN_features_2d = feature_extractor.predict(X_test)

Now we'll plot the 2 feature representation learned by the neural network, and compare with the 2 principal components of the original data as before. Look at how the **neural network has created a beautiful, linearly separable representation** of the original data. 

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12,5))

# This is the PCA representation of our MNIST dataset (the same plot as earlier)
axes[0].scatter(X_train_2PC[:,0], X_train_2PC[:,1], c = y_train)
axes[0].set_title('Top 2 Principal Components: Unsupervised')


# Look at what happens when we put our data through the neural network
axes[1].scatter(X_train_NN_features_2d[:,0], 
                X_train_NN_features_2d[:,1], c = y_train) # columns 0 and 1, i.e. the first two outputs of the extracted layer
axes[1].set_title('Neural Network Top Layer Features: Supervised')

# The network applies a series of transformations that make the classes easy to separate

And as expected, a softmax regression shows very strong performance on the data representation that the network has learned. 

In [None]:
lr = LogisticRegression(multi_class = 'multinomial', solver = 'lbfgs')

# Let's see what happens if we extract these output from the neural network into a logistic regression model
lr.fit(X_train_NN_features_2d, y_train)
lr.score(X_test_NN_features_2d, y_test)  # We get a score of 0.98 actually!

In [None]:
# Therefore we now find a way to do dimension reduction using deep learning!!

In [None]:
# A little writeup below about the point of the whole process we did above

Hopefully this is a visually powerful representation of the **potential predictive power to be gained from using supervised feature learning / dimension reduction techniques**. Our neural network was designed to construct a 2-dimensional, linearly separable representation of the dataset and was able to accomplish this with flying colors.

### Why does this happen? Here's some intuition:  
  
The network structure is set up to terminate in a simple softmax regression mapping to the final predictions (see output layer above). So the features that are **fed to that mapping must be linearly separable for the network to predict well.**   
  
In this sense, the **network is designed to create a final hidden layer of linearly separable features**! The beauty of the feed-forward / backpropagation structure is that it makes it possible to algorithmically generate this representation.

### What can you think of Neural Networks as?
This is why I like to think of neural nets as **analogous to a supervised version of PCA**. They learn features in a hierarchical fashion that ultimately represent the input data in a much simpler and more useful way for prediction. 

PCA is unsupervised so can only represent the data in a simpler way based on explained variance, but neural nets are supervised so can represent the data in a simpler way based on **target explainability**.

## Digit Exercises

In [None]:
# EXERCISE: 
#   Reducing to a layer with 2 feature dimensions before the terminal softmax 
#   oversimplifies the model. Try adjusting the number of nodes in this layer to improve
#   the model's prediction performance.

# EXERCISE: 
#   Experiment with the network structure to try to improve performance. 
#   
#   Try adding or taking away nodes/layers. Look at the summary and visual diagram to
#   understand how the network and # of parameters are changing.

#   Try adjusting the number of epochs and the batch size. 
#   What impact do they have on performance?
#
#   Are you overfitting or underfitting? Is more or less complexity better? 
#   You can use # of parameters as a simple proxy for model complexity

# EXERCISE:
#   As you experiment with network structure, try to also incorporate dropout regularization.
#   See the bottom of the text example below for the syntax.

## Text Classification Example

Here we're going to look at a balanced binary text classification problem (sentiment detection), and train a **very simple neural network** - it will learn a 2-d representation of data just for the sake of visualization. Note that as in the digits example, this is making the model much simpler than it needs to be / below the level of complexity we would typically use in practice when building a predictive NN model.

In [None]:
df = pd.read_csv("Data/amazon_cells_labelled.txt", sep='\t', names=['text', 'sentiment'])
# Take a look
df.head()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df.text, df.sentiment, 
                                                    test_size=0.2, random_state = 42)

tfidf_vect = TfidfVectorizer(decode_error = 'ignore')
X_train = tfidf_vect.fit_transform(X_train).toarray()  # toarray() gives a plain ndarray; todense() returns np.matrix, which newer libraries reject
X_test = tfidf_vect.transform(X_test).toarray()

X_train.shape

Here's our simple network to get a supervised, **2-dimensional embedding of the tf-idf features**. 

In [None]:
NN = Sequential()

NN.add(Dense(2, input_dim = X_train.shape[1], name = '2D_layer'))
NN.add(Activation('relu'))

NN.add(Dense(1))
NN.add(Activation('sigmoid'))

NN.compile(loss='binary_crossentropy', # binary_crossentropy is the loss for binary (two-class) classification
           optimizer='adam')
NN.fit(X_train, y_train, epochs=65, batch_size=128, verbose=1)

In [None]:
feature_extractor = \
    Model(inputs=NN.input, outputs=NN.get_layer('2D_layer').output) 

X_train_NN_features_2d = feature_extractor.predict(X_train)
X_test_NN_features_2d = feature_extractor.predict(X_test)

plt.scatter(X_train_NN_features_2d[:,0], X_train_NN_features_2d[:,1], c = y_train)
plt.title('NN Learned 2D Feature Representation vs. Sentiment Label')

# The plot shows that partway through the model, the two classes are already pretty easy to separate

Neural networks can be a very powerful tool for working with text data, **provided there is enough data**. In this case, we're only training on 800 samples so we should not expect amazing generalization results from the network.

As we can see from plotting the **learned features for the test data set**, the representation that works extremely well for the training data does not generalize as well to unseen data.

In [None]:
plt.scatter(X_test_NN_features_2d[:,0], X_test_NN_features_2d[:,1], c = y_test)
plt.title('NN Learned 2D Feature Representation vs. Sentiment Label')

In [None]:
accuracy_score(y_test, (NN.predict(X_test)[:,0] > .5).astype(int))  # threshold the sigmoid output at 0.5

We can power up the complexity of this NN by adding more layers and choosing a higher number of dimensions (hidden nodes) for the top layer, but it's hard to really do much better than our simple network. This example demonstrates that there's a risk of learning a representation that's overfit to the training data. **This overfitting becomes increasingly likely if we make the network excessively complex (too many nodes + layers)**. 

In this case, **we're likely better off just using a simple model like logistic regression or naive bayes on tf-idf features due to the small data size**.

In [None]:
lr = LogisticRegression(C = 100)
lr.fit(X_train, y_train)
print('Simple logistic score: {}'.format(lr.score(X_test, y_test)))

nb = MultinomialNB()
nb.fit(X_train, y_train)
print('Naive Bayes score: {}'.format(nb.score(X_test, y_test)))

And here is the fancy 3 layer network, which doesn't seem to be a real improvement over the simple baseline at all. 

In [None]:
NN.summary()

In [None]:
NN = Sequential()

NN.add(Dense(200, input_dim = X_train.shape[1]))
NN.add(Activation('relu'))
NN.add(Dropout(.3))

NN.add(Dense(100))
NN.add(Activation('relu'))
NN.add(Dropout(.3))

NN.add(Dense(50)) # 50 dimensional top-layer representation
NN.add(Activation('relu'))
NN.add(Dropout(.3))

NN.add(Dense(1))
NN.add(Activation('sigmoid'))

NN.compile(loss='binary_crossentropy', optimizer='adam')
NN.fit(X_train, y_train, epochs=30, batch_size=512, verbose=1)

In [None]:
accuracy_score(y_test, (NN.predict(X_test)[:,0] > .5).astype(int))  # threshold the sigmoid output at 0.5

## Text Classification: Moving Beyond the Simple Fully-Connected Model

When it comes to **handling unstructured data, the real power of neural networks starts to become clear** when we move beyond the simple computational graph of fully-connected models. With neural nets, we're able to arbitrarily piece together building blocks of matrix algebra transformations to create models that mimic the underlying patterns present in the data we're modeling. Once we've constructed a graph that mimics these patterns, we can train for the optimal weights of the linear algebraic transformations with our old friend backpropagation/gradient descent. 

For example, we can treat a text as a sequence of words and process these words in an explicitly sequential manner. This style captures the **spatial patterns** of word usage, which are extremely relevant to how humans communicate; in previous methods we've seen like word-counting/tf-idf, our models were completely unaware of order, which may cause us to lose lots of predictive signal. We'll see another example later on (convolutional neural networks) in working with images. 

### Enter the Recurrent Neural Network (RNN)
Below is an example of the computational graph structure of a **Recurrent Neural Network (RNN)**. For our simple sentiment classification exercise, we can ignore the multiple output steps and treat this network as a mechanism for storing a "memory" and sequentially updating it as new information comes in word by word, then using the final state of the "memory" to make a fixed prediction.

![rnn](img/rnn3.jpg)
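
The "memory" update described above can be sketched in NumPy as a plain tanh RNN. This is a simplified illustration under assumed toy sizes and random weights, not the LSTM used later, which adds gating on top of this basic loop.

```python
import numpy as np

def rnn_final_state(inputs, W_x, W_h, b):
    """Run a plain tanh RNN over a sequence and return the last hidden state."""
    h = np.zeros(W_h.shape[0])
    for x_t in inputs:                         # one time step per word vector
        h = np.tanh(W_x @ x_t + W_h @ h + b)   # update the running "memory"
    return h

rng = np.random.default_rng(0)
seq_len_toy, input_dim, hidden_dim = 6, 4, 3   # illustrative sizes
toy_sequence = rng.normal(size=(seq_len_toy, input_dim))
W_x = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
W_h = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

final_state = rnn_final_state(toy_sequence, W_x, W_h, b)

# A logistic unit on the final state gives one fixed-size prediction
w_out = rng.normal(size=hidden_dim)
prediction = 1.0 / (1.0 + np.exp(-(w_out @ final_state)))
```

Because each step feeds the previous state back in, feeding the same words in a different order generally produces a different final state, which is what makes the model order-aware.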

In [7]:
# The node itself is called the Long Short-Term Memory

We'll build a modern variant of this neural net architecture, the **Long Short-Term Memory (LSTM) network**. At its core is the computational engine shown above, with some added complexity around the memory updating strategy. In the process of building the LSTM we'll also see the typical preprocessing steps required for leveraging this network architecture for NLP problems.

In [None]:
# However, to use this model, we need to do a lot of pre-processing

In [None]:
# Every text is padded or truncated to exactly 100 tokens
seq_len = 100 # standardized length of each word sequence 

# We only keep the 1500 most frequent words in the vocabulary
max_vocab = 1500 # max number of words to consider when tokenizing (based on freq)


# fit tokenizer vocab (note that it lowercases and strips punct)
tokenizer = Tokenizer(num_words=max_vocab)
tokenizer.fit_on_texts(df.text)


# standard train/val split
train_text, val_text, y_train, y_val = train_test_split(df.text, df.sentiment, 
                                                        test_size=0.2, random_state = 42)

# convert train and val texts to token sequences of standardized length 100;
# by default pad_sequences fills in leading 0s for short texts and,
# for long texts, keeps only the last 100 tokens
train_text = tokenizer.texts_to_sequences(train_text)   # map each word to its integer index (this is indexing, not vectorizing)
train_text = pad_sequences(train_text, maxlen=seq_len)  # pad/truncate every sequence to the same fixed length

val_text = tokenizer.texts_to_sequences(val_text)
val_text = pad_sequences(val_text, maxlen=seq_len)

train_text[0]  
# the first text has only 3 words, so a long run of zeros is added in front
# to bring the sequence up to length 100

# We do this because the model is rigid: it only accepts inputs that all have the same length.
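
The indexing and padding steps can be sketched in pure Python. This is a simplified stand-in with hypothetical helper names: it splits on whitespace only and does not strip punctuation the way the Keras `Tokenizer` does, but it follows the same conventions of frequency-ranked indices starting at 1, zeros as padding, and padding/truncating at the front.

```python
from collections import Counter

def build_word_index(texts):
    """Map words to 1-based indices, most frequent first (0 is reserved for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

def to_padded_sequence(text, word_index, max_vocab, seq_len):
    """Index the words, drop rare/unknown ones, then left-pad or left-truncate."""
    seq = [word_index[w] for w in text.lower().split()
           if w in word_index and word_index[w] < max_vocab]
    seq = seq[-seq_len:]                        # keep the last seq_len tokens
    return [0] * (seq_len - len(seq)) + seq     # zeros go in front

toy_texts = ["good phone", "good battery good screen", "bad phone"]
toy_index = build_word_index(toy_texts)
toy_padded = [to_padded_sequence(t, toy_index, max_vocab=10, seq_len=3)
              for t in toy_texts]
```

Every row of `toy_padded` has the same length, which is the fixed-size input format the network expects.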

Now we use the **keras functional API** to create a computational mapping from input text sequences to output sentiment binary targets. We'll actually make the LSTM component of the model **bidirectional**, meaning that we process the text both front-to-back and back-to-front, allowing us to capture a rich set of context. 

In [None]:
embedding_dim = 20 # hyper-parameter 

# Unlike the Sequential models above, where we added layers one by one,
# the functional API links layers by calling each layer on the previous layer's output

inp = Input(shape=(seq_len,)) # must specify format of input layer
x = (Embedding(max_vocab, 
              embedding_dim)  # this embedding_dim is defined above
     (inp)) # model learns its own word embeddings

x = Bidirectional(LSTM(8, recurrent_dropout=.3))(x) # bi-LSTM with regularization. 
# The trailing '(x)' calls the layer on the output of the previous layer

y = Dense(1, activation='sigmoid')(x)  # Final layer is y.

NN = Model(inp, y)
NN.summary()

# We can see that we get a lot of trainable parameters
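
The trainable parameter count reported by the summary can be estimated by hand. The sketch below assumes the standard LSTM parameterization (four gates, each with input weights, recurrent weights and a bias) and that the two directions are concatenated; the helper names are hypothetical.

```python
def embedding_params(vocab_size, embedding_dim):
    """One embedding_dim-sized vector per vocabulary entry."""
    return vocab_size * embedding_dim

def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * ((input_dim + units) * units + units)

def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit."""
    return n_inputs * n_units + n_units

emb = embedding_params(1500, 20)        # max_vocab x embedding_dim
bilstm = 2 * lstm_params(20, 8)         # forward and backward LSTM(8)
out = dense_params(2 * 8, 1)            # both directions are concatenated
print(emb, bilstm, out, emb + bilstm + out)   # 30000 1856 17 31873
```

Under these assumptions the embedding table dominates the count, which is one motivation for reusing pre-trained embeddings when data is scarce.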

The below diagram may start to **confuse you more than it illuminates,** but it helps emphasize the manner in which an RNN can be bidirectional:

![rnn](img/rnn.png)

Without further ado, we fit the model and track its train and validation accuracy over the training epochs. Then we'll plot the accuracy curves to get a feel for the accuracy trajectory. 

In [None]:
NN.compile(loss='binary_crossentropy', 
           optimizer='adam', 
           metrics=['accuracy']) # This accuracy is just used to keep track of the progress. 
                                # It DOES NOT use the accuracy to do ANYTHING!
    
# NN.fit returns a History object; assigning it to "history" keeps the per-epoch metrics
history = NN.fit(train_text, y_train, 
                 validation_data=(val_text, y_val),  # a single held-out validation set, not cross-validation
                 epochs=50, batch_size=512, verbose=1)


# We want to see that both our loss and validation loss decrease steadily!

# We can see that beyond the third epoch, it is facing some difficulty.
# There could be some overfitting

In [None]:
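# Using the history, we can do some plotting to see how our model changes
acc_key = 'acc' if 'acc' in history.history else 'accuracy'  # the key name depends on the Keras version
plt.plot(history.history[acc_key])
plt.plot(history.history['val_' + acc_key])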

plt.title('Accuracy vs. Training Epoch')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend(['Train','Validation'])

# We see that it's not doing so well on the validation 
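
One common response to a validation curve that stalls is early stopping: stop once the validation score has not improved for a set number of epochs. The sketch below shows the idea in plain Python; the function name, the `patience` value and the accuracy numbers are all hypothetical, not output from the model above.

```python
def best_epoch_with_patience(val_scores, patience=3):
    """Return (stop_epoch, best_epoch), both 0-based, for a list of validation scores."""
    best_epoch, best_score = 0, val_scores[0]
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            best_epoch, best_score = epoch, score
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch     # no improvement for `patience` epochs
    return len(val_scores) - 1, best_epoch

# Hypothetical validation accuracies, for illustration only
toy_val_acc = [0.60, 0.70, 0.78, 0.77, 0.76, 0.78, 0.75, 0.74]
stop_epoch, best_epoch = best_epoch_with_patience(toy_val_acc, patience=3)
```

Here training would stop at epoch 5 and the weights from epoch 2 would be the ones to keep.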

### Hmm it looks like we really can't do any better than our amazing naive bayes/logistic baseline!

In [None]:
accuracy_score(y_val, (NN.predict(val_text)[:,0] > .5).astype(int))

## Text Classification: Leveraging The Power Of Transfer Learning

### **Well no, actually we can do better than that baseline.**  We're going to take advantage of what Google and Stanford have already done
  
But we'll have to do more, effectively leveraging a much larger text dataset than our paltry 1000 records: we'll use **transfer learning**. Transfer learning broadly refers to the process of training neural network weights on one dataset/task, then taking those weights and applying them to a different dataset/task. It sounds like it shouldn't really work, but it turns out that since neural net weights can learn very rich representations of fairly low-level, generalizable concepts, these weights often have broad applicability. Since these weights can be learned on massive datasets and ported over to much smaller ones, this method often helps us essentially use a lot more data than we immediately have on hand for training.

One classic example of transfer learning, which we'll see here, is use of **pre-trained word vectors**. Recall that word vectors are learned via the task of predicting what words appear in similar contexts. This training task allows the vectors to capture a great deal of semantic and syntactic information that often gives relevant signal for other prediction tasks. We'll test this out by **using Google's pre-trained word vectors as fixed word embeddings in our sentiment model**, instead of training embeddings on the fly. 

The code below is adapted from: https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html. First we load the word vectors and build an embedding matrix.

In [None]:
# word_index is the tokenizer's vocabulary: a dict mapping each word to its integer index
word_index = tokenizer.word_index
word_index 

In [None]:
word_index = tokenizer.word_index

# change the path to point to your pretrained google vectors file
w2v_file = '/Users/jeddy-metis/nltk_data/GoogleNews-vectors-negative300.bin.gz'

# load the w2v vectors using gensim
word_vectors = KeyedVectors.load_word2vec_format(w2v_file, binary=True)

# in word_vectors, every single word has certain vector!
# eg. word_vectors['cat'] = vector


embedding_dim = 300 # w2v embedding dim


#### There are thousands and thousands of words in word2vec. 
#### We don't need everything! We just need the words that are relevant to our dataset!

# use the gensim model to build a numpy array of embeddings,
# we'll feed this array to the keras embeddings layer.
# each row i of the array will correspond to the word token assigned to that value 

embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))

for word, i in word_index.items():    # this will pull out all the word vectors that are relevant to our dataset
    try:
        embedding_vector = word_vectors[word]
        embedding_matrix[i] = embedding_vector
    except KeyError: # word in our data vocab is missing in w2v, will use 0 vector for that word
        pass
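
The same construction can be seen end to end on a toy scale. The sketch below uses hypothetical 3-dimensional vectors in place of the 300-dimensional word2vec ones and shows that a frozen embedding layer amounts to a row lookup into the matrix.

```python
import numpy as np

# Hypothetical 3-dimensional "pre-trained" vectors standing in for word2vec
toy_vectors = {"good": np.array([0.9, 0.1, 0.0]),
               "bad":  np.array([-0.8, 0.2, 0.1])}
toy_word_index = {"good": 1, "bad": 2, "qwerty": 3}   # index 0 is the padding token

toy_matrix = np.zeros((len(toy_word_index) + 1, 3))
for word, i in toy_word_index.items():
    if word in toy_vectors:              # words without a vector keep a zero row
        toy_matrix[i] = toy_vectors[word]

# A frozen embedding layer is a row lookup into this matrix
toy_sequence = np.array([0, 0, 1, 3, 2])
embedded = toy_matrix[toy_sequence]      # shape (5, 3)
```

Padding positions and out-of-vocabulary words both map to all-zero rows, so they contribute no signal to the layers that follow.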

Now we have what we need to define the model.

In [None]:
embedding_dim = 300

### LET US CHECK OUT THE TRANSFER LEARNING!!!!!!!

In [None]:
# Now using Recurrent Neural Nets

inp = Input(shape=(seq_len,))
x = Embedding(len(word_index) + 1,
              embedding_dim,
              weights=[embedding_matrix], ##### This part is the transfer learning part
                                          ##### where we feed the pretrained vecs from google
              
              trainable=False)(inp) # freeze these parameters in the model. 
                                    # trainable=False, That means there will be NO UPDATING IN THIS LAYER AT ALL.


x = Bidirectional(LSTM(64, recurrent_dropout=.1))(x)
x = Dense(32)(x) # fully connected layer on top of the output of the bi-LSTM
x = Dropout(.3)(x)
y = Dense(1, activation='sigmoid')(x)

NN = Model(inp, y)
NN.summary()

Let's train and see how we did!

In [None]:
NN.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
history = NN.fit(train_text, y_train, 
                 validation_data=(val_text, y_val),
                 epochs=30, batch_size=512, verbose=1)

In [None]:
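acc_key = 'acc' if 'acc' in history.history else 'accuracy'  # the key name depends on the Keras version
plt.plot(history.history[acc_key])
plt.plot(history.history['val_' + acc_key])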

plt.title('Accuracy vs. Training Epoch')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend(['Train','Validation'])


# Now we plot out our results.
# We can see that our validation score is much better than when we trained the embeddings ourselves,
# and all of this happens just because we used the transferred weights from Google!

Looks like we can improve over the naive Bayes baseline by at least 2% or so using the Google vectors. So there's hope for deep learning after all! Once again, a big problem with this dataset is that 1000 records is just a very small size for neural net methods, so it should already strike you as extremely promising that we can do so well with the transfer learning method here.