# TensorFlow

### ... is a general math framework

TensorFlow is designed to accommodate...

* Easy operations on tensors (n-dimensional arrays)
* Mappings to performant low-level implementations, including native CPU and GPU
* Optimization via gradient descent variants
    * Including high-performance differentiation
    
At the lowest level are math primitives called "Ops".

From these primitives, linear algebra and other higher-level constructs are formed.

Going up one more level, common neural-net components have been built and included.

At an even higher level of abstraction, various libraries have been created that simplify building and wiring common network patterns. Over the last year, we've seen 3-5 such libraries.

We will focus later on one, Keras, which has now been adopted as the "official" high-level wrapper for TensorFlow.
  

### We'll get familiar with TensorFlow so that it is not a "magic black box"

But for most of our work, it will be more productive to work with the higher-level wrappers.

In [None]:
import tensorflow as tf

x = tf.constant(100, name='x')
y = tf.Variable(x + 50, name='y')

print(y)

### There's a bit of "ceremony" there...

... and ... where's the actual output?

For performance reasons, TensorFlow separates the design of the computation from the actual execution.

TensorFlow programs describe a computation graph -- an abstract DAG of data flow -- that can then be analyzed, optimized, and implemented on a variety of hardware, as well as potentially scheduled across a cluster of separate machines.

Like many query engines and compute graph engines, evaluation is __lazy__ ... so we don't get "real numbers" until we force TensorFlow to run the calculation:

In [None]:
model = tf.global_variables_initializer()

with tf.Session() as session:
    session.run(model)
    print(session.run(y))
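To make the "describe first, run later" idea concrete without any TensorFlow at all, here is a toy sketch of a lazy compute graph. The `Node`, `constant`, `as_node`, and `run` names are made up for this illustration and are not TensorFlow APIs; the real engine does far more (optimization, device placement, scheduling).

```python
class Node:
    """A node in a toy dataflow graph: an operation plus its input nodes."""

    def __init__(self, op, inputs=(), value=None):
        self.op = op
        self.inputs = inputs
        self.value = value

    def __add__(self, other):
        # Building the expression only records an "add" node; nothing is computed.
        return Node("add", (self, as_node(other)))

    def __mul__(self, other):
        return Node("mul", (self, as_node(other)))


def constant(value):
    """Wrap a plain Python value as a graph node."""
    return Node("const", value=value)


def as_node(x):
    """Pass nodes through unchanged; wrap anything else as a constant."""
    return x if isinstance(x, Node) else constant(x)


def run(node):
    """Walk the graph and actually compute the value of `node`."""
    if node.op == "const":
        return node.value
    args = [run(i) for i in node.inputs]
    if node.op == "add":
        return args[0] + args[1]
    if node.op == "mul":
        return args[0] * args[1]
    raise ValueError("unknown op: " + node.op)


x = constant(100)
y = x + 50          # describes the computation; no arithmetic happens yet
print(y.op)         # add
result = run(y)     # plays the role of session.run(y)
print(result)       # 150
```

Printing `y` itself would only show a `Node` object, which is the same surprise we got from `print(y)` in the first TensorFlow cell.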

### TensorFlow integrates tightly with NumPy

and we typically use NumPy to create and manage the tensors (vectors, matrices, etc.) that will "flow" through our graph

In [None]:
import numpy as np

data = np.random.randint(1000, size=100)

print(data)

In [None]:
data = np.random.normal(loc=10.0, scale=2.0, size=[3,3]) # mean 10, std dev 2

print(data)

In [None]:
x = tf.constant(data, name='x')
y = tf.Variable(x * 10, name='y')

model = tf.global_variables_initializer()

with tf.Session() as session:
    session.run(model)
    print(session.run(y))

### We will often iterate on a calculation ... 

Calling `session.run` runs just one step, so we can iterate using Python as a control:

In [None]:
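with tf.Session() as session:
    for i in range(3):
        session.run(model)          # (re)initialize variables; harmless to repeat
        x = x + 1                   # adds a new "add" op to the graph on each pass
        print(session.run(x))       # evaluate the newest version of x
        print("----------------------------------------------")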

### TensorBoard is a helper tool for visualizing compute graphs and outputs

In [None]:
x = tf.constant(100, name='x')

print(x)

y = tf.Variable(x + 1, name='y')

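with tf.Session() as session:
    merged = tf.summary.merge_all()  # None here: no summary ops have been defined
    writer = tf.summary.FileWriter("data/scratch", session.graph)  # logs the graph
    model =  tf.global_variables_initializer()
    session.run(model)
    print(session.run(y))
    writer.close()  # flush the log to disk so TensorBoard can read it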

This code records a log file... To view it, run TensorBoard separately from the command line:

```
tensorboard --logdir=data/scratch/

Starting TensorBoard 39 on port 6006
```

And browse to `localhost:6006` (or whichever port is selected)


In [None]:
x = tf.placeholder("float", [3])
y = x * 2

with tf.Session() as session:
    result = session.run(y, feed_dict={x: [1, 2, 3]})
    print(result)

In [None]:
x = tf.placeholder("float", [2, 3])
y = x * 2

with tf.Session() as session:
    x_data = [[1, 2, 3],
              [4, 5, 6],]
    result = session.run(y, feed_dict={x: x_data})
    print(result)

In [None]:
x = tf.placeholder("float", [None, 2]) #None --> unspecified
y = x * 2

with tf.Session() as session:
    x_data = [[1, 2], [3, 4], [5, 6]]
    result = session.run(y, feed_dict={x: x_data})
    print(result)

### Let's make a slightly more complex graph:

We'll use a slice operator (https://www.tensorflow.org/api_docs/python/tf/slice)

In [None]:
x = tf.placeholder("float", [2, 3])
y = x * 2
z = tf.slice(y, [0,1], [2,2]) * 10

with tf.Session() as session:
    x_data = [[1, 2, 3],
              [4, 5, 6],]
    result = session.run(z, feed_dict={x: x_data})
    print(result)
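If the `begin` and `size` arguments of `tf.slice` feel opaque, the same arithmetic can be traced with plain Python lists. This is only a sketch: `slice_2d` is a hypothetical helper written for this note, and it handles just the 2-D case.

```python
def slice_2d(matrix, begin, size):
    """Return the block of `matrix` that starts at `begin` and has shape `size`."""
    row0, col0 = begin
    n_rows, n_cols = size
    return [row[col0:col0 + n_cols] for row in matrix[row0:row0 + n_rows]]


x_data = [[1, 2, 3],
          [4, 5, 6]]

y = [[2 * v for v in row] for row in x_data]                          # y = x * 2
z = [[10 * v for v in row] for row in slice_2d(y, [0, 1], [2, 2])]    # slice, then * 10

print(z)  # [[40, 60], [100, 120]]
```

So `[0, 1]` means "start at row 0, column 1" and `[2, 2]` means "take 2 rows and 2 columns", matching what the TensorFlow cell above prints (as floats).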

What does the compute graph for this look like?

In [None]:
session = tf.InteractiveSession()

x = tf.constant(list(range(10)))

print(x)
print(x.eval())

session.close()

### Optimizers

TF includes a set of built-in algorithm implementations (though you could certainly write them yourself) for performing optimization.

These are oriented around gradient-descent methods, with a set of handy extensions to make things converge faster.

## Gradient Descent

A family of numeric optimization techniques, where we solve a problem with the following pattern:

1. Formulate the goal as a function of many parameters, where we would like to minimize the function's value
<br><br>*For example, if we can write the error (e.g., RMSE or cross-entropy) as a function of variables, we would like to find values for those variables that will minimize the error.*<br><br>

2. Calculate the target function value

3. Compute the gradient (the vector of partial derivatives) of the target -- its negative is the "slope toward lower error"

4. Adjust the input variables in the indicated direction

5. Repeat

<img src="http://i.imgur.com/ntIU6Q8.png">
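The five steps above can be sketched in a few lines of plain Python. Everything here is an assumption chosen for illustration: a one-parameter target `(w - 3)**2`, a starting guess of 0, and a step size of 0.1.

```python
def target(w):
    """Function to minimize; its minimum is at w = 3."""
    return (w - 3.0) ** 2


def gradient(w):
    """Derivative of target with respect to w."""
    return 2.0 * (w - 3.0)


w = 0.0               # starting guess (arbitrary)
learning_rate = 0.1   # step size (assumed for this sketch)

for step in range(50):                 # step 5: repeat
    value = target(w)                  # step 2: evaluate the target
    grad = gradient(w)                 # step 3: compute the gradient
    w = w - learning_rate * grad       # step 4: move against the gradient

print(w)  # ends up very close to 3
```

With one parameter the "gradient" is just a derivative; with many parameters the same loop applies, but `w` and `grad` become vectors.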

### Big Picture Goal: Function Approximation

If we can imagine our goal in terms of a function: a value that comes out based on values in, or a probability distribution out from a sampled distribution going in...

... and we can formulate our error as something solvable by gradient descent ...

... then we can use a numerical solving technique like this to get us plausible values for the parameters of the function.

__One more time, because this is important: we're not solving for the output of the function -- in order to "learn" a function we need real input and output to begin with. We're solving for the parameters of the function that get us close to the output__

What about the structure of the function? Where does that come in? That is going to be our hard-coded hypothesis that we start with. For example, earlier we imagined Logistic Regression as a type of model -- in that case, the logistic regression itself is the structure of the function. We proposed it as a hypothesis, knowing that it would be good at some distributions and bad at others.

__As we get further into Deep Learning, we'll see that the neural net topology or type is the structure of the function. As hard as it is to conjecture what might work, it can also get pragmatically hard to solve for all the function parameters because we may have tens or hundreds of thousands of interrelated parameters!__

### Notes and Thoughts on Gradient Descent

We want function approximation in a trainable way. Trainable by gradient descent means the shape of the space we're optimizing should ideally be smooth. It would be great if we knew it was convex or had a unique minimum, but that is rarely the case, so we try to get an "ok" solution anyway.

That's where the research and experimentation come in.

#### Some ideas to help build your intuition

* What happens if the variables (imagine just 2, to keep the mental picture simple) are on wildly different scales ... like one ranges from -1 to 1 while another from -1e6 to +1e6?

* What if some of the variables are correlated? I.e., a change in one corresponds to, say, a linear change in another?

* Other things being equal, an approximate solution with fewer variables is easier to work with than one with more -- how could we get rid of some less valuable parameters? (e.g., L1 penalty)

* How do we know how far to "adjust" our parameters with each step?

<img src="http://i.imgur.com/AvM2TN6.png" width=600>
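The last question (how far to adjust on each step) is easy to poke at numerically. Below is a small sketch on the simplest possible target, `f(w) = w**2`; the three step sizes are arbitrary picks to show three regimes, not recommendations.

```python
def descend(learning_rate, steps=20, start=1.0):
    """Run gradient descent on f(w) = w**2 and return the final w."""
    w = start
    for _ in range(steps):
        w = w - learning_rate * 2.0 * w   # the gradient of w**2 is 2*w
    return w


small_step = descend(0.01)   # each step multiplies w by 0.98: slow progress
good_step = descend(0.4)     # each step multiplies w by 0.2: fast convergence
too_big = descend(1.1)       # each step multiplies w by -1.2: overshoots and grows

print(small_step, good_step, too_big)
```

After 20 steps the small step size has barely moved from the starting point, the middle one is essentially at the minimum, and the large one has diverged.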


What if we have billions of data points? Does it make sense to use all of them for each update? Is there a shortcut?

Yes: __Stochastic Gradient Descent__
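The shortcut works because a gradient computed on a random sample is a noisy but usable estimate of the gradient over the whole data set. Here is a sketch of that claim; the synthetic data (`y = 2x + 6`), the current guess `m = b = 1`, and the batch size of 500 are all assumptions made up for this illustration.

```python
import random

random.seed(0)

# Synthetic data set (assumed): y = 2x + 6 for x drawn uniformly from [0, 1).
xs = [random.random() for _ in range(10000)]
data = [(x, 2.0 * x + 6.0) for x in xs]


def mse_gradient(points, m, b):
    """Gradient of the mean squared error of y ~ m*x + b over `points`."""
    n = len(points)
    grad_m = sum(2.0 * (m * x + b - y) * x for x, y in points) / n
    grad_b = sum(2.0 * (m * x + b - y) for x, y in points) / n
    return grad_m, grad_b


m, b = 1.0, 1.0                                   # current (poor) guess
full_grad = mse_gradient(data, m, b)              # uses all 10,000 points
batch_grad = mse_gradient(random.sample(data, 500), m, b)   # uses 5% of them

print(full_grad)
print(batch_grad)
```

The two gradients point in nearly the same direction, but the second costs a twentieth of the work, so we can afford many more (slightly noisy) updates in the same time.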

### Beyond SGD

In the beginning of the big-data machine learning revolution, SGD was the workhorse of optimization. 

It works, but there are a variety of refinements that have been created so that now, in production, we typically use a slightly more complex variant.

#### Momentum

* We may want to use some weighted history of our gradient descent path, to smooth out changes in direction and velocity
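One common way to write this "weighted history" is the classic heavy-ball update: keep a velocity that is a decaying sum of past gradients, and move by the velocity. The sketch below compares it with plain gradient descent on a deliberately lopsided bowl; the target function, step size 0.02, and decay 0.9 are assumptions picked for the illustration, and TensorFlow's own momentum optimizer may differ in details.

```python
def gradient(w):
    """Gradient of f(w) = 0.5 * (w[0]**2 + 50 * w[1]**2), a narrow valley."""
    return [w[0], 50.0 * w[1]]


def plain_descent(steps, learning_rate):
    w = [1.0, 1.0]
    for _ in range(steps):
        g = gradient(w)
        w = [w[i] - learning_rate * g[i] for i in range(2)]
    return w


def momentum_descent(steps, learning_rate, beta):
    w = [1.0, 1.0]
    velocity = [0.0, 0.0]
    for _ in range(steps):
        g = gradient(w)
        # Velocity is a decaying sum of past gradient steps (beta = decay factor).
        velocity = [beta * velocity[i] - learning_rate * g[i] for i in range(2)]
        w = [w[i] + velocity[i] for i in range(2)]
    return w


def distance_to_minimum(w):
    """Euclidean distance from w to the minimum at (0, 0)."""
    return (w[0] ** 2 + w[1] ** 2) ** 0.5


plain = distance_to_minimum(plain_descent(100, 0.02))
with_momentum = distance_to_minimum(momentum_descent(100, 0.02, 0.9))

print(plain, with_momentum)
```

With the same step size and the same 100 steps, the momentum version ends up much closer to the minimum, because the accumulated velocity keeps pushing along the shallow direction where each individual gradient is tiny.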

#### Conjugate-Gradient

* We can use a change of basis to minimize the amount of time we go back and forth "undoing" progress in a particular dimension ... i.e., we can try to approach the goal in a more additive way.

<img src="http://i.imgur.com/Jx8YKDc.jpg" width="500">
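For a quadratic bowl, this idea has a textbook form: the linear conjugate-gradient method, which never has to revisit a direction and so finishes in as many steps as there are dimensions. The sketch below is that textbook version on a small 2x2 example chosen for illustration; it is not how a neural-net optimizer would be written, and the helper names are made up.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mat_vec(A, v):
    return [dot(row, v) for row in A]


def conjugate_gradient(A, b, steps):
    """Minimize 0.5*x.A.x - b.x (equivalently, solve A x = b).

    Assumes A is symmetric and positive definite.
    """
    x = [0.0] * len(b)
    r = list(b)            # residual b - A x, which equals b at x = 0
    p = list(r)            # first search direction: steepest descent
    for _ in range(steps):
        Ap = mat_vec(A, p)
        alpha = dot(r, r) / dot(p, Ap)                    # exact step along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r_new = [ri - alpha * api for ri, api in zip(r, Ap)]
        beta = dot(r_new, r_new) / dot(r, r)
        # New direction is chosen so it does not undo progress made along p.
        p = [rn + beta * pi for rn, pi in zip(r_new, p)]
        r = r_new
    return x


A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = conjugate_gradient(A, b, steps=2)

print(x)  # approximately [1/11, 7/11]
```

Two dimensions, two steps: each step settles one direction for good, which is the "additive" progress described above.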


#### Second-Order Methods

* Inspired by Newton's method of approximation
* Use 2nd derivatives ... with $n$ parameters there are $n^2$ second derivatives in the Hessian matrix, so we need some tricks to keep this tractable
* Examples: L-BFGS, OWL-QN


<img src="http://i.imgur.com/SZxMEg0.png" width=200>
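In one dimension the Newton-style update is short enough to write out: divide the first derivative by the second, and step by that amount. The target `f(w) = exp(w) - 2*w` below is just an assumed example with a known minimum at `w = ln 2`; in many dimensions the division becomes a solve against the Hessian, which is where the cost comes from.

```python
import math


def first_derivative(w):
    """f'(w) for f(w) = exp(w) - 2*w."""
    return math.exp(w) - 2.0


def second_derivative(w):
    """f''(w) for the same f; always positive, so f is convex."""
    return math.exp(w)


w = 0.0
for _ in range(6):
    # Newton step: the curvature tells us how far to trust the slope.
    w = w - first_derivative(w) / second_derivative(w)

print(w, math.log(2.0))
```

Six steps are enough here to match `ln 2` to many decimal places, with no learning rate to tune.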

---

We'll try to talk about some of the more "advanced" tricks later, but just so the terminology isn't mysterious, these are things like *Adam*, *Adagrad*, *Adadelta*, *RMSProp* etc.

You definitely don't need to be able to code all of those by hand, but it will be useful having a bit of intuition about what the idea is, so you can choose -- or understand why another scientist chose -- a particular optimizer and parameters. In practice, the choice is often based on empirical experimentation.

### Back to TensorFlow: Using TF optimizer to solve problems

As we've said, TF is a toolkit for math, and it has higher-level implementations (like gradient descent optimizers) that are available to us.

We can use them to solve anything (not just neural networks) so let's start with a simple equation.

We supply a bunch of data points that represent inputs and their matching outputs. We will generate them based on a known, simple equation (y will always be 2\*x + 6) but we won't tell TF that. Instead, we will give TF a function structure ... linear with 2 parameters, and let TF try to figure out the parameters by minimizing an error function.

What is the error function? 

The "real" error is the absolute value of the difference between TF's current approximation and our ground-truth y value.

But absolute value is not a friendly function to work with here (its derivative is undefined at zero), so instead we'll square the difference. That gets us a nice, smooth function that TF can work with, and it has the same minimum (zero error):

In [None]:
x = tf.placeholder("float")
y = tf.placeholder("float")

m = tf.Variable([1.0], name="m-slope-coefficient") # initial values ... for now they don't matter much
b = tf.Variable([1.0], name="b-intercept")

y_model = tf.multiply(x, m) + b

error = tf.square(y - y_model)

train_op = tf.train.GradientDescentOptimizer(0.01).minimize(error)

model = tf.global_variables_initializer()

with tf.Session() as session:
    session.run(model)
    for i in range(10):
        x_value = np.random.rand()
        y_value = x_value * 2 + 6
        session.run(train_op, feed_dict={x: x_value, y: y_value})

    out = session.run([m, b])
    print(out)
    print("Model: {r:.3f}x + {s:.3f}".format(r=out[0][0], s=out[1][0]))

#### That's pretty terrible :)

Try two experiments. Change the number of iterations the optimizer runs, and -- independently -- try changing the learning rate (that's the number we passed to `GradientDescentOptimizer`)

See what happens with different values.
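If you'd like to see the effect without waiting on TensorFlow, here is a rough pure-Python sketch of what each `session.run(train_op, ...)` call amounts to for this model: compute the residual, then nudge `m` and `b` against the gradient of the squared error. The `fit_line` function is a hypothetical stand-in written for this note, with the gradients worked out by hand; it is not TensorFlow's implementation.

```python
import random


def fit_line(steps, learning_rate, seed=0):
    """Fit y ~ m*x + b to samples of y = 2x + 6, one random point per step."""
    rng = random.Random(seed)
    m, b = 1.0, 1.0                        # same starting values as the TF cell
    for _ in range(steps):
        x = rng.random()
        y = 2.0 * x + 6.0
        residual = y - (m * x + b)
        # error = residual**2, so d(error)/dm = -2*residual*x
        # and d(error)/db = -2*residual; step against both.
        m += learning_rate * 2.0 * residual * x
        b += learning_rate * 2.0 * residual
    return m, b


few_steps = fit_line(steps=10, learning_rate=0.01)
many_steps = fit_line(steps=5000, learning_rate=0.1)

print("10 steps, lr 0.01:   {:.3f}x + {:.3f}".format(*few_steps))
print("5000 steps, lr 0.1:  {:.3f}x + {:.3f}".format(*many_steps))
```

Ten small steps barely move the parameters away from their starting values, while more steps with a larger learning rate land very close to the true `2x + 6`.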

#### We can also look at the errors and plot those:

In [None]:
x = tf.placeholder("float")
y = tf.placeholder("float")

m = tf.Variable([1.0], name="m-slope-coefficient") # initial values ... for now they don't matter much
b = tf.Variable([1.0], name="b-intercept")

y_model = tf.multiply(x, m) + b

error = tf.square(y - y_model)

train_op = tf.train.GradientDescentOptimizer(0.01).minimize(error)

model = tf.global_variables_initializer()

errors = []

with tf.Session() as session:
    session.run(model)
    for i in range(100):
        x_value = np.random.rand()
        y_value = x_value * 2 + 6
        _, error_val = session.run([train_op, error], feed_dict={x: x_value, y: y_value})
        errors.append(error_val)

    out = session.run([m, b])
    print(out)
    print("Model: {r:.3f}x + {s:.3f}".format(r=out[0][0], s=out[1][0]))
    
import matplotlib.pyplot as plt
plt.plot(errors)
plt.show()

### That is the essence of TensorFlow!

There are three principal directions to explore further:

* Working with tensors instead of scalars: this is not intellectually difficult, but takes some practice to wrangle the shaping and re-shaping of tensors. If you get the shape of a tensor wrong, your script will blow up. Just takes practice.

* Building more complex models. You can write these yourself using lower-level "Ops" -- like matrix multiply -- or using higher-level functions like `tf.layers.dense`. *Use the source, Luke!*

* Operations and integration ecosystem: as TensorFlow has matured, it is easier to integrate additional tools and solve the peripheral problems:
    * TensorBoard for visualizing training
    * tfdbg command-line debugger
    * Distributed TensorFlow for clustered training
    * GPU integration
    * Feeding large datasets from external files
    * TensorFlow Serving for serving models (i.e., using an existing model to predict on new incoming data)
