## Advanced Tensorflow Tutorial

![img](https://github.com/yandexdataschool/nlp_course/raw/master/resources/tf_birds_bees.png)

A highly subjective list of cool stuff about tensorflow that didn't fit into the basic tutorial.



## Part I: Debugging tensorflow

Tensorflow error messages are hideous monstrosities with a heart of gold :)

If your code breaks, TF will throw a wall of text your way. But you shouldn't be afraid of it. The key skill here is finding the part of the error that actually matters: your code. Let's look at an example:

In [None]:
import numpy as np
import tensorflow as tf
keras, L = tf.keras, tf.keras.layers
tf.reset_default_graph()
sess = tf.Session()

embeddings = tf.Variable(np.random.randn(16, 10).astype('float32'))

sequence_ids = tf.placeholder('int32')
sequence_emb = tf.gather(embeddings, sequence_ids)
mean_emb = tf.reduce_mean(sequence_emb, axis=2)



In [None]:
sess.run(tf.global_variables_initializer())
sess.run(mean_emb, {sequence_ids: np.random.randint(32, size=[3, 20])})
sess.run(mean_emb, {sequence_ids: np.random.randint(32, size=20)})

Okay, here's what you should see:
* First and most important, this is just a traceback. No need to freak out. Keep calm.
* Second, it tells us which sess.run caused an error - the second one. Here's the relevant part:
```
        1 sess.run(tf.global_variables_initializer())
        2 sess.run(mean_emb, {sequence_ids: np.random.randint(32, size=[3, 20])})
----> 3 sess.run(mean_emb, {sequence_ids: np.random.randint(32, size=20)})
```

* Then it tells us which line broke down:
```
  File "<ipython-input-76-a9559652841a>", line 11, in <module>
    mean_emb = tf.reduce_mean(sequence_emb, axis=2)
```
* And the error:
```
Invalid reduction dimension (2 for input with 2 dimension(s)
```

This information should already be sufficient to find out what happened: we took 1d indices, mapped them to 2d embeddings and now want to average over axis 2, but the embeddings only have axes [0, 1].
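
The same shape arithmetic can be replayed in NumPy, where plain indexing plays the role of `tf.gather`. This is only an analogy: NumPy raises a different exception type than TF, the variable names below are ours, and reducing over `axis=-1` is just one possible fix, assuming the intent was to average over the last (embedding) axis.

```python
import numpy as np

embeddings = np.random.randn(16, 10).astype('float32')

ids_2d = np.random.randint(16, size=[3, 20])   # a batch of sequences
ids_1d = np.random.randint(16, size=20)        # a single sequence

emb_2d = embeddings[ids_2d]   # shape (3, 20, 10): axes 0, 1, 2
emb_1d = embeddings[ids_1d]   # shape (20, 10): axes 0, 1 only

print(emb_2d.mean(axis=2).shape)   # fine, axis 2 exists

try:
    emb_1d.mean(axis=2)            # there is no axis 2 here
except (ValueError, IndexError) as e:
    print('failed:', e)

# axis=-1 always means "the last axis", so it works for both inputs
print(emb_2d.mean(axis=-1).shape, emb_1d.mean(axis=-1).shape)
```

Printing `.shape` of every intermediate result like this is often the quickest way to check a hypothesis about a shape error.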

Let's try a few more:

In [None]:
%%writefile my_rnn_library.py
import numpy as np
import tensorflow as tf

def my_rnn(x_emb, emb_size, hid_size):
    """ takes x_emb[time, batch, emb_size] and returns hidden states h[time, batch, hid_size] """
    W = tf.Variable(np.random.randn(emb_size + hid_size, hid_size).astype('float32'),)
    h0 = tf.zeros([tf.shape(x_emb)[1], hid_size])
    
    def scan_step(h_t, x_t):
      rnn_inp = tf.concat([h_t, x_t], axis=1)
      h_next = tf.tanh(tf.matmul(x_t, W))
      return h_next
      

    return tf.scan(scan_step, elems=x_emb, initializer=h0)

In [None]:
%load_ext autoreload
%autoreload 2
# ^-- an extension that reloads .py modules if you change their code

In [None]:
import my_rnn_library
x = tf.placeholder('float32', [None, None, None])
h = my_rnn_library.my_rnn(x, emb_size=32, hid_size=128)

sess.run(tf.global_variables_initializer())
sess.run(h, {x: np.random.randn(10, 3, 32)})

# spoiler: it's gonna fail. Your task is to understand what operation failed and how to fix that.
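
If the wall of text gets too tall, a few lines of plain Python can pull out the frames that point at your own files. The sketch below works on the error text itself. The helper name `user_frames`, the `library_markers` heuristic and the sample text are all made up for illustration; real TF messages may be laid out differently.

```python
import re

FRAME_RE = re.compile(r'File "(?P<path>[^"]+)", line (?P<line>\d+), in (?P<func>\S+)')

def user_frames(error_text, library_markers=('site-packages', 'dist-packages')):
    """Return (path, line, func) for frames that do not come from installed libraries."""
    frames = []
    for m in FRAME_RE.finditer(error_text):
        path = m.group('path')
        if not any(marker in path for marker in library_markers):
            frames.append((path, int(m.group('line')), m.group('func')))
    return frames

# A made-up error text, for illustration only
sample = '''
  File "/usr/lib/python3/site-packages/tensorflow/python/ops/math_ops.py", line 1, in reduce_mean
  File "<ipython-input-1-abc>", line 11, in <module>
    mean_emb = tf.reduce_mean(sequence_emb, axis=2)
'''
print(user_frames(sample))
```

To try it on a real failure, wrap the failing call in `try/except Exception as e` and pass `str(e)` to the helper.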

### Debugging tensorflow: invalid values

If your code fails with an error, it's easy to find out what's wrong. However, sometimes there's no error, but your network doesn't train and your loss is equal to NaN or -inf. Or mean squared error is negative. Or ... well, you just know it's wrong.

The question is: where is it wrong? There are two strategies: using tf.asserts and good old tinkering. We'll try the old way.
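
Tinkering usually boils down to evaluating intermediate values one by one, in the order they are computed, and stopping at the first one that is not finite. Here is a minimal NumPy sketch of that idea. The helper `first_non_finite` and the toy computation are ours; in graph mode the values would come from `sess.run` on the intermediate tensors instead.

```python
import numpy as np

def first_non_finite(named_values):
    """Return the name of the first value that contains NaN or +-inf, or None."""
    for name, value in named_values:
        if not np.all(np.isfinite(value)):
            return name
    return None

x = np.array([4.0, 1.0, 9.0])
with np.errstate(invalid='ignore'):   # silence the warning, we want to find it ourselves
    shifted = x - 2.0                 # [2, -1, 7]
    root = np.sqrt(shifted)           # sqrt of a negative number gives NaN
    total = root.sum()                # NaN propagates to everything downstream

steps = [('x', x), ('shifted', shifted), ('root', root), ('total', total)]
print(first_non_finite(steps))
```

The order of the list matters: the first offender is where the problem is born, everything after it is just contaminated.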

The next example contains two errors:
* an error with shapes that causes tensorflow to throw an error
* an error that causes tensorflow to return NaN

In [None]:
x = tf.placeholder_with_default(np.random.randn(3, 15, 100).astype('float32'),
                                [None, None, 100])
x_len = tf.placeholder_with_default(np.array([3, 14, 8], dtype='int32'), [None])

logits = L.Dense(256)(x)

mask = tf.sequence_mask(x_len, dtype=tf.float32)


logits = logits - (1 - mask)[:, :, None] * 1e9

probs = tf.nn.softmax(logits, axis=1)

mean_logp = tf.log(tf.reduce_mean(probs, axis=-1))

mean_prob = tf.exp(mean_logp)

grads = tf.gradients(mean_prob, [x])[0]

grad_norms = tf.reduce_sum(grads ** 2, axis=(1, 2)) ** 0.5

In [None]:
sess.run(tf.global_variables_initializer())
sess.run(grad_norms)

# Your quest is as usual: find where's Waldo (NaN). And eliminate it :)

## Part II: Cool tensorflow features

In [None]:
# for the next section we'll need to reload tensorflow in eager mode
# PLEASE RESTART THE NOTEBOOK! (kernel-restart in jupyter, runtime -> restart in colab)
# also if you're in colab, please request GPU-enabled runtime (settings -> notebook settings)

### 1. Tensorflow Eager

When you first saw tensorflow in action, there was a lot of complicated stuff happening: defining operations on placeholders, sessions, variable initializers, etc.

Luckily, TF also allows you to write code on the fly much the same way as you did in numpy. It's called __Tensorflow Eager__.

In [None]:
import numpy as np
import tensorflow as tf
tf.enable_eager_execution()

In [None]:
# use tensorflow operations like you would use numpy
x = tf.constant([[1, 2], [3, 4]], dtype=tf.float32)
y = tf.matmul(x, tf.random_normal([2, 4]))
z = tf.nn.softmax(y, axis=1)

In [None]:
# every tensor has a value (like numpy arrays)
z

In [None]:
# ... and can be converted to numpy
z.numpy()

In [None]:
# you can even mix numpy arrays in tf computations
z + np.linspace(0, 4, 4) 

### Training with tf.eager

Eager execution has its own API for automatic gradients. It's called GradientTape.

In [None]:
x = tf.Variable([3.0, 5.0]) 

with tf.GradientTape() as tape:
    y = x * x 

# gradients are computed after the tape has finished recording
dy_dx = tape.gradient(y, x)

print('gradients:', dy_dx)
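
For `y = x * x` the gradient is `2 * x`, so for `x = [3, 5]` the tape should report `[6, 10]`. When you are unsure about a gradient, a finite-difference check is a cheap sanity test. Below is a minimal NumPy sketch; the helper `numeric_grad` is ours and is not part of the TF API.

```python
import numpy as np

def numeric_grad(f, x, eps=1e-4):
    """Central finite-difference estimate of d sum(f(x)) / dx."""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (np.sum(f(x + step)) - np.sum(f(x - step))) / (2 * eps)
    return grad

x = np.array([3.0, 5.0])
approx = numeric_grad(lambda v: v * v, x)
print(approx)   # should be close to the analytic gradient 2 * x
```

The helper differentiates the sum of the outputs because that is also what a gradient of a non-scalar target means in tensorflow.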

Now let's train some networks. As usual, we'll build the model with keras for convenience.


In [None]:
# load MNIST through tf.keras rather than the standalone keras package
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()
X_train, X_test = X_train.astype('float32') / 255., X_test.astype('float32') / 255.
y_train, y_test = y_train.astype('int32'), y_test.astype('int32')

In [None]:
keras, L = tf.keras, tf.keras.layers # use these and not just import keras

model = keras.models.Sequential([ 
    L.InputLayer(X_train.shape[1:]), L.Flatten(), L.Dense(100), L.Activation('relu'), L.Dense(10) 
])
opt = tf.train.AdamOptimizer(learning_rate=1e-3)

In [None]:
for i in range(1000):
    batch = np.random.randint(0, len(X_train), size=100)
    with tf.GradientTape() as tape:
        logits = model(X_train[batch])
        loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=y_train[batch], logits=logits)
        loss = tf.reduce_mean(loss)
        
    # compute and apply gradients outside of the tape context
    grads = tape.gradient(loss, model.trainable_variables)
    opt.apply_gradients(zip(grads, model.trainable_variables))
      
    if i % 100 == 0:
        print('step %i, loss=%.3f' % (i, loss.numpy()))
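
In case the loss in the loop above looks like magic: for each example it is the negative log-probability that softmax assigns to the correct class. Here is a NumPy sketch of that computation. The function below is our own re-implementation for illustration, not the code TF runs internally.

```python
import numpy as np

def sparse_softmax_cross_entropy(labels, logits):
    """Per-example -log softmax(logits)[label], computed in a numerically stable way."""
    shifted = logits - logits.max(axis=-1, keepdims=True)   # avoid overflow in exp
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels]

logits = np.array([[2.0, 0.0, 0.0],    # fairly confident in class 0
                   [0.0, 0.0, 0.0]])   # no idea: uniform over 3 classes
labels = np.array([0, 2])

loss = sparse_softmax_cross_entropy(labels, logits)
print(loss, loss.mean())
```

The second example has uniform predictions over 3 classes, so its loss is `log(3)`. An untrained 10-class MNIST model should start somewhere around `log(10)`.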

In [None]:
# we can now evaluate our model using any external metrics we want
from sklearn.metrics import accuracy_score

y_test_pred = model(X_test).numpy().argmax(-1)
print("Test acc:", accuracy_score(y_test, y_test_pred))

RTFM:
* [tf.eager basics](https://www.tensorflow.org/tutorials/eager/eager_basics)
* [tape-based gradients](https://www.tensorflow.org/tutorials/eager/automatic_differentiation)
* [training walkthrough](https://www.tensorflow.org/tutorials/eager/custom_training_walkthrough)
* You can also embed eager code into normal tensorflow graph with [tf.contrib.eager.py_func](https://www.tensorflow.org/guide/eager)

In [None]:
# Please restart the notebook again

### 2. Tensorboard

If you run more than one experiment, you will eventually have to compare your results. We've already mentioned that this can be done with tensorboard. Ideally, you wanna obtain something like this:

![img](https://raw.githubusercontent.com/yandexdataschool/nlp_course/master/resources/lm_acc1.png)

_except the training is not finished_

If you're not into tensorflow, check out [visdom](https://github.com/facebookresearch/visdom) or [tensorboardX](https://github.com/lanpa/tensorboardX).


### 3. Tensorflow Hub

Most deep learning applications nowadays depend on some kind of pre-trained network to start from, be it Keras [applications](https://keras.io/applications/) for computer vision, [gensim](https://github.com/RaRe-Technologies/gensim) for embeddings, or one of the many smaller model zoos dedicated to a particular topic.

One such model zoo is Tensorflow Hub, featuring several hot NLP models:
* [Universal Sentence Encoder](https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/semantic_similarity_with_tf_hub_universal_encoder.ipynb#scrollTo=MSeY-MUQo2Ha)
* [ELMO](https://tfhub.dev/google/elmo/2)

In [None]:
import numpy as np
import tensorflow as tf
tf.reset_default_graph()
sess = tf.Session()

!pip3 install --quiet tensorflow-hub
import tensorflow_hub as hub

model = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/2")
sess.run([tf.global_variables_initializer(), tf.tables_initializer()]);

In [None]:
sentence_embs = model(["A cat sat on a mat.", "I am the monument to all your sins"])

print(sess.run(sentence_embs))

## Part III. Worst practices

There's a number of things about TF that are kind of... ~~sucky~~ in need of improvement.

Don't get me wrong, they are all great for their job. Except they can easily be misused with dramatic consequences.

__#1. TF.contrib is a mess__

Tensorflow [contrib](https://www.tensorflow.org/api_docs/python/tf/contrib) is a place where tensorflow holds dozens of sub-libraries dedicated to everything. You name it:
* Helper functions for sequence-to-sequence models - [check!](https://www.tensorflow.org/api_docs/python/tf/contrib/seq2seq)
* Wrapper modules for CUDNN RNN operations - [check!](https://github.com/tensorflow/probability)
* A full-blown deep learning library? - [check](https://www.tensorflow.org/api_docs/python/tf/contrib/keras)-[check](https://www.tensorflow.org/api_docs/python/tf/contrib/slim)-[check](https://www.tensorflow.org/api_docs/python/tf/contrib/learn)!

The catch is that most of the code in tf.contrib was built by independent authors. Sometimes it's poorly supported. Sometimes it's outdated. And it is definitely not designed for full compatibility with one another.

For instance, LSTM cells from tf.contrib.rnn are not guaranteed to work with tf.keras abstractions. Nor do tf.slim layers fit into keras models.

There's a rule of thumb: if the functionality you need is in both tf core and tf.contrib, pick tf core. If it's only in tf.contrib - read through it and maybe play with it on a toy task before integrating it into your larger projects.


__#2. Pythonic and symbolic loops__


Sometimes you want your tensorflow graph to contain loops. The most obvious example is RNN.

Tensorflow allows you to define such loops with primitives like [tf.while_loop](https://www.tensorflow.org/api_docs/python/tf/while_loop) and [tf.scan](https://www.tensorflow.org/api_docs/python/tf/scan).

If you read the docs, you'll also see other primitives like __tf.map_fn__ and __tf.cond__. It is tempting to use those operations to write python-style code. 

__But you shouldn't__. Or rather, try hard to have as few of them as possible. Each iteration of a symbolic loop introduces a gigantic overhead in computation time.

In [None]:
import numpy as np
import tensorflow as tf
tf.reset_default_graph()
sess = tf.Session()

x_ph = tf.placeholder_with_default(np.linspace(-10, 10, 10**4).astype('float32'), [None])


my_square = tf.map_fn(lambda x_i: x_i ** 2, x_ph)
my_sum_squares = tf.scan(lambda ctr, x_i: ctr + x_i, elems=my_square, initializer=0.0)[-1]


tf_square = x_ph ** 2
tf_sum_squares = tf.reduce_sum(tf_square)

In [None]:
print("Symbolic loops:")
%time print(sess.run(my_sum_squares))
print("Vector operations:")
%time print(sess.run(tf_sum_squares))

__TL;DR:__ use control flow ops sparingly. Few large iterations are okay, many small iterations are not.
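
The same effect shows up outside tensorflow whenever many tiny steps each pay a fixed overhead. Here is a rough NumPy analogy: a python loop over elements versus one vectorized call. It measures python interpreter overhead rather than `tf.map_fn` itself, so treat the timings as an illustration only.

```python
import time
import numpy as np

x = np.linspace(-10, 10, 10**4).astype('float32')

start = time.perf_counter()
loop_sum = 0.0
for x_i in x:                      # many tiny steps, each pays interpreter overhead
    loop_sum += float(x_i) ** 2
loop_time = time.perf_counter() - start

start = time.perf_counter()
vector_sum = float(np.sum(x.astype('float64') ** 2))   # one large operation
vector_time = time.perf_counter() - start

print('loop:   %.3f in %.5fs' % (loop_sum, loop_time))
print('vector: %.3f in %.5fs' % (vector_sum, vector_time))
```

Both versions compute the same number; only the amount of per-element bookkeeping differs.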

__#3. Control Dependencies__ 

By default, if your tensorflow graph has two parallel branches of code, there's no way of telling which branch will be executed first. This can cause inconveniences. You may want to explicitly tell tensorflow "Run this op before that one" to save memory or make debug logs prettier.

However, you can also use control dependencies to mutate graph state in the middle of execution. DON'T DO THAT unless you absolutely have to. 

And even then __DON'T DO THAT__.

Here's a demotivational example:

In [None]:
import tensorflow as tf
tf.reset_default_graph()
sess = tf.Session()

x = tf.Variable(1.0)

y1 = x ** 2

add_first = tf.assign_add(x, 1)

with tf.control_dependencies([add_first, y1]):
  y2 = x ** 2
  
  add_second = tf.assign_add(x, 1)
  with tf.control_dependencies([y2, add_second]):
    y3 = x ** 2

sess.run(tf.global_variables_initializer())

print('First run:', sess.run([y1, y2, y3]))

print('Second run:', sess.run([y1, y2, y3]))

# Bonus quest: change as few lines as possible to make it print [1, 4, 9], [9, 16, 25]

## Part IV: cool stuff that didn't make it into the tutorial
* [tf.data.Dataset](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) - an advanced tool for loading and managing data.
* Creating new tf ops - [in c++](https://www.tensorflow.org/extend/adding_an_op)
* Managing gradient computation with [tf.stop_gradient](https://github.com/tensorflow/fold) and [gradient override map](https://stackoverflow.com/questions/41391718/tensorflows-gradient-override-map-function)
* [Gradient checkpointing](https://github.com/openai/gradient-checkpointing/) to backprop through large models in low memory
* Tensorflow is available in many other languages. For instance, here's [tf for javascript](https://js.tensorflow.org/) or [tutorial on exporting keras model for an android app](https://medium.com/@thepulkitagarwal/deploying-a-keras-model-on-android-3a8bb83d75ca)
* Efficient gpu parallelism with [horovod](https://github.com/uber/horovod)


### [if we have time] XLA: Tensorflow, compiled

While tf.eager gives you the freedom to experiment, eventually you'll figure out exactly what you want and you'll need your code to run... faster. Preferably much faster. And on half as much gpu memory so you can increase batch size.

Your typical neural network has a lot of operations that are fast to compute but require allocating large amounts of memory. Consider adding bias element-wise to a large tensor and then applying nonlinearity. These operations can be _fused_ together: you don't allocate new memory but perform everything in-place as a single operation.
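
To get a feel for what fusion saves, here is the bias-plus-nonlinearity example written twice in NumPy: once with a fresh array per step and once reusing a single buffer. This is a hand-written analogy under our own assumptions (ReLU as the nonlinearity, arbitrary sizes). XLA goes further and compiles the steps into one kernel, which NumPy cannot do.

```python
import numpy as np

x = np.random.randn(1024, 1024).astype('float32')
bias = np.random.randn(1024).astype('float32')

# Unfused: every step allocates a new array the size of x
with_bias = x + bias
unfused = np.maximum(with_bias, 0)

# "Fused" by hand: both steps write into the same buffer
fused = x.copy()                 # the copy is only here so x survives for the check
np.add(fused, bias, out=fused)
np.maximum(fused, 0, out=fused)

print(np.array_equal(unfused, fused))
```

The results are identical; the second version just never materializes the intermediate `with_bias` tensor.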

__Warning:__ XLA is included by default starting from tf 1.12; earlier versions require compiling tensorflow manually **with** XLA support.

In [None]:
# Please restart the notebook and make sure you use tensorflow with GPU. 
# If you don't, the code will work but XLA will give no performance boost
# and actually run slower.

import numpy as np
import tensorflow as tf
keras = tf.keras
L = tf.keras.layers

assert tf.test.is_gpu_available()

In [None]:
tf.reset_default_graph()
config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1
config.gpu_options.per_process_gpu_memory_fraction = 0.5
sess = tf.Session(config=config)

In [None]:
with tf.device('/gpu:0'):
    model = keras.models.Sequential()
    model.add(L.InputLayer([None, 256]))
    model.add(L.SimpleRNN(256, return_sequences=True))
    model.add(L.SimpleRNN(256))
    model.add(L.Dense(100))
    
    x = tf.placeholder_with_default(np.random.randn(1, 1000, 256).astype('float32'), [None, None, 256])
    pred = model(x)

In [None]:
sess.run(tf.global_variables_initializer())
sess.run(pred);  # "warmup run"

In [None]:
%timeit sess.run(pred)

__RTFM:__ [XLA jit](https://www.tensorflow.org/performance/xla/jit)

__What to expect:__ many small ops (lstm steps, dropout and batchnorm, etc) on GPU = large improvement. Few large ops (vgg16, large batch) or CPU = no improvement