In [2]:
import numpy as np
import matplotlib.pyplot as plt

## 1. TensorFlow Graph and Session

* TensorFlow’s API is built around the idea of a computational graph: a structured description of a mathematical computation.

> TensorFlow separates the defining of a computational graph from executing that computational graph by having them happen in separate places: a `graph` defines the operations, but the operations only happen within a `session`. Graphs and sessions are created independently. A graph is like a blueprint, and a session is like a construction site. For details, see [Graphs and Sessions](https://www.tensorflow.org/versions/r1.3/programmers_guide/graphs).

<img src="images/tf_graph.png" alt="tf_graph" style="width:60%;height:60%"/>

* A TensorFlow graph has two general computation units:
    * `Nodes` represent mathematical operations.
    * `Edges` represent tensors (multi-dimensional arrays) that hold the values passing through the graph.
* Note that while we are constructing the graph, values are not yet known; they are represented by either `variables` or `placeholders`.
* When we run the computation on the constructed graph, values enter the graph through the variables and placeholders.
    * `Variables` store values that can be updated during computation and must be initialized before the graph is run. You must explicitly run an operation such as `tf.global_variables_initializer()` with `session.run` to initialize all variables. Typically, weights and biases are represented by variables.
    * `Placeholders` store values that cannot be updated during the computation and are provided by the user (from outside the graph). Typically, external inputs such as training samples, testing samples, and some hyperparameters are represented by placeholders.
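
The split between building a graph and running it can be sketched without TensorFlow. The toy classes below (`Placeholder`, `Constant`, `Op`, and the `run` function) are hypothetical names invented for this illustration. They are not TensorFlow's API, and they leave out variables, devices, and everything else a real session manages.

```python
# Toy deferred-execution graph in plain Python (illustrative only).

class Placeholder:
    """A graph input whose value is supplied only when the graph is run."""
    def __init__(self, name):
        self.name = name

class Constant:
    """A graph node holding a fixed value."""
    def __init__(self, value):
        self.value = value

class Op:
    """A graph node that combines two input nodes."""
    def __init__(self, kind, left, right):
        self.kind = kind
        self.inputs = (left, right)

def run(node, feed_dict=None):
    """Evaluate a node recursively, reading placeholder values from feed_dict."""
    feed_dict = feed_dict or {}
    if isinstance(node, Placeholder):
        return feed_dict[node]
    if isinstance(node, Constant):
        return node.value
    left, right = (run(n, feed_dict) for n in node.inputs)
    if node.kind == "add":
        return left + right
    if node.kind == "mul":
        return left * right
    raise ValueError("unknown op: %s" % node.kind)

# Building the graph computes nothing; it only records the structure.
x = Placeholder("x")
y = Op("add", Op("mul", x, Constant(3.0)), Constant(1.0))

# Running the same graph twice with different fed values.
result_a = run(y, {x: 2.0})
result_b = run(y, {x: 10.0})
print(result_a, result_b)  # 7.0 31.0
```

The same graph object `y` is reused for both runs; only the fed value changes, which is the role placeholders play in the text above.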

**Why Graph**
* One obvious reason is that a neural network is essentially a graph.
* More importantly, a graph makes it easier to take partial derivatives with respect to a given variable.
    * For example, the partial derivative of E with respect to C is 1 and it does not depend on D. The edges of the graph tell us which path to follow when calculating the derivatives.
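
The graph from the figure is not reproduced in the text, so the sketch below assumes a hypothetical graph with `C = a * b`, `D = a + b`, and `E = C + D`. Under that assumption it shows how the chain rule follows the edges of the graph, and it checks the result against a numerical derivative.

```python
# Chain rule on an ASSUMED graph: C = a * b, D = a + b, E = C + D.
# The node names mirror the text, but the operations are hypothetical.

def forward(a, b):
    c = a * b
    d = a + b
    e = c + d
    return c, d, e

a, b = 2.0, 3.0
c, d, e = forward(a, b)

# Local derivatives along each edge of the graph.
de_dc = 1.0          # E = C + D, so dE/dC is 1 regardless of D
de_dd = 1.0
dc_da, dc_db = b, a  # C = a * b
dd_da, dd_db = 1.0, 1.0  # D = a + b

# Sum over all paths from E back to each input.
de_da = de_dc * dc_da + de_dd * dd_da
de_db = de_dc * dc_db + de_dd * dd_db

# Numerical check with a central difference.
eps = 1e-6
numeric_de_da = (forward(a + eps, b)[2] - forward(a - eps, b)[2]) / (2 * eps)

print(de_da, de_db)  # 4.0 3.0
```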

In [3]:
import tensorflow as tf



> At this point TensorFlow has already started managing a lot of state for us. There's already an implicit default graph, for example. Internally, the default graph lives in the `_default_graph_stack`, but we don't have access to that directly. We use [`tf.get_default_graph()`](https://www.tensorflow.org/api_docs/python/tf/get_default_graph).

In [3]:
graph = tf.get_default_graph()
print("default graph: ", graph)
print("all operations: ", graph.get_operations())

default graph:  <tensorflow.python.framework.ops.Graph object at 0x10fadfc50>
all operations:  []


Currently, there isn't anything in the graph. We’ll need to put everything we want TensorFlow to compute into that graph. Let's start with a simple constant input value of one.

> TensorFlow uses protocol buffers internally. ([Protocol buffers](https://developers.google.com/protocol-buffers/) are sort of like a Google-strength JSON.) Printing the `node_def` for the constant operation below shows what's in TensorFlow's protocol buffer representation for the number one.

In [4]:
input_value = tf.constant(1.0)
operations = graph.get_operations()
print("all operations: ", operations)
print("first operations: ", operations[0].node_def)

all operations:  [<tf.Operation 'Const' type=Const>]
first operations:  name: "Const"
op: "Const"
attr {
  key: "dtype"
  value {
    type: DT_FLOAT
  }
}
attr {
  key: "value"
  value {
    tensor {
      dtype: DT_FLOAT
      tensor_shape {
      }
      float_val: 1.0
    }
  }
}



We can create a new graph using [tf.Graph](https://www.tensorflow.org/api_docs/python/tf/Graph) and set it as the default graph with the `tf.Graph.as_default` context manager, which overrides the current default graph for the lifetime of the context:

In [7]:
g = tf.Graph()
default_g = tf.get_default_graph()
assert g is not default_g
with g.as_default():
  # Define operations and tensors in `g`.
  c = tf.constant(30.0)
  assert c.graph is g

Why does TensorFlow define its own versions of objects such as variables instead of just using normal Python variables? [One of the TensorFlow tutorials has an explanation](https://www.tensorflow.org/get_started/mnist/pros#deep-mnist-for-experts):
> To do efficient numerical computing in Python, we typically use libraries like NumPy that do expensive operations such as matrix multiplication outside Python, using highly efficient code implemented in another language. Unfortunately, there can still be a lot of overhead from switching back to Python every operation. This overhead is especially bad if you want to run computations on GPUs or in a distributed manner, where there can be a high cost to transferring data.

> TensorFlow also does its heavy lifting outside Python, but it takes things a step further to avoid this overhead. Instead of running a single expensive operation independently from Python, TensorFlow lets us describe a graph of interacting operations that run entirely outside Python. This approach is similar to that used in Theano or Torch.

> The role of the Python code is therefore to build this external computation graph, and to dictate which parts of the computation graph should be run.


As described above, a graph is like a blueprint that sketches the neural network architecture; it does not contain any numerical values yet. If we inspect our `input_value`, we see it is a constant 32-bit float tensor of no dimension: just one number.

In [11]:
input_value

<tf.Tensor 'Const_2:0' shape=() dtype=float32>

Note that this doesn't tell us what that number is. To evaluate input_value and get a numerical value out, we need to create a “session” where graph operations can be evaluated and then explicitly ask to evaluate or “run” input_value. 

Note that a session uses the default graph unless another graph is passed to it.

### tf.Session()

* A TensorFlow session is an environment for running a graph. A graph only runs computations once a session has been created. The session is in charge of allocating the operations to GPU(s) and/or CPU(s), including those on remote machines.
* Why do we need a session?
    * A session separates running a constructed TensorFlow graph from constructing it. It allocates the necessary computational resources (a CPU, a GPU, or multiple GPUs) according to its configuration when we actually run the graph.
* The following code creates a session instance `sess` using [`tf.Session`](https://www.tensorflow.org/api_docs/python/tf/Session). We define the session within a `with` block, so the session closes automatically when the block exits.
    * The `sess.run()` function then evaluates the tensor and returns the result.

In [12]:
# hello_constant = tf.constant('Hello World!')
with tf.Session() as sess:
    output = sess.run(input_value)
    print(output)

1.0


### A Side Note

* TensorFlow has some syntax and boilerplate code, such as sessions, that has nothing to do with machine learning or deep learning. It exists only so that TensorFlow can work.
* Therefore, provided you understand the fundamental components and concepts of machine learning and deep learning well, you can work with any framework to construct even tremendously complex deep neural networks.
    * Data + Model + Model Assumption
    * Loss function
    * Optimizer that minimizes the loss with respect to the model parameters
    * Make predictions


## 2. TensorFlow Basics


### What are Tensors?

* Tensors are the standard way of representing data in TensorFlow.
* Tensors are multidimensional arrays, an extension of two-dimensional tables (matrices) to data with higher dimensions.
    * A rank-0 tensor is a scalar
    * A rank-1 tensor is a vector
    * A rank-2 tensor is a matrix (table of numbers)
    * A rank-3 tensor is a 3-tensor (cube of numbers)
    * For rank 4 or higher, you have to imagine it
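
As a rough illustration of rank, the sketch below treats nested Python lists as tensors and counts their nesting depth. The helper names `rank` and `shape` are made up for this example, and they assume regular (non-ragged, non-empty) lists.

```python
# Rank and shape of nested Python lists (illustrative; assumes regular, non-empty nesting).

def rank(value):
    """Number of nested list levels: 0 for a scalar, 1 for a vector, and so on."""
    depth = 0
    while isinstance(value, list):
        depth += 1
        value = value[0]
    return depth

def shape(value):
    """Length of each nesting level, outermost first."""
    dims = []
    while isinstance(value, list):
        dims.append(len(value))
        value = value[0]
    return tuple(dims)

scalar = 3.0
vector = [1.0, 2.0, 3.0]
matrix = [[1, 2, 3], [4, 5, 6]]
cube = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]

for tensor in (scalar, vector, matrix, cube):
    print(rank(tensor), shape(tensor))
# 0 ()
# 1 (3,)
# 2 (2, 3)
# 3 (2, 2, 2)
```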

### tf.placeholder()

* You can’t just set your X variable to your training data and put it in TensorFlow, because over time you'll want your TensorFlow model to take in different training data with different parameters. For that you need `tf.placeholder()`.

* `tf.placeholder()` returns a tensor that gets its value from data passed to the `tf.Session.run()` function through the `feed_dict` argument. This lets you set the input right before the session runs and, optionally, constrain the shape of the input as well.

* [`tf.placeholder()`](https://www.tensorflow.org/api_docs/python/tf/placeholder) accepts parameters:
    * `dtype`: The type of elements in the tensor to be fed.
    * `shape`: The shape of the tensor to be fed (optional). If the shape is not specified, you can feed a tensor of any shape.
    * `name`: A name for the operation (optional).

In [84]:
A = tf.placeholder(tf.float32, shape=(5,5), name='A')
print(A)
v = tf.placeholder(tf.float32)
print(v)
w = tf.matmul(A, v)
print(w)

Tensor("A_2:0", shape=(5, 5), dtype=float32)
Tensor("Placeholder_2:0", dtype=float32)
Tensor("MatMul_81:0", shape=(5, ?), dtype=float32)


### Weights and Bias in TensorFlow

The most common operation in neural networks is calculating the linear combination of inputs, weights, and biases. As a reminder, we can write the output of the linear operation as

$$y = xW + b$$

$W$ is a matrix of the weights connecting two layers. The output $y$, the input $x$ and the biases $b$ are all vectors.

* The goal of training a neural network is to modify weights and biases to best predict the labels. 
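
To make the formula concrete, here is a minimal plain-Python sketch of $y = xW + b$ for a single input vector. The function name `linear` and the numbers are arbitrary choices for illustration, not values from the notebook.

```python
# y = xW + b for one input vector, written with plain lists (illustrative values).

def linear(x, W, b):
    """x has n_in entries, W is n_in x n_out, b has n_out entries."""
    n_in, n_out = len(x), len(b)
    return [sum(x[i] * W[i][j] for i in range(n_in)) + b[j] for j in range(n_out)]

x = [1.0, 2.0, 3.0]                        # input with 3 features
W = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]   # 3 inputs -> 2 outputs
b = [0.0, 1.0]                             # one bias per output

y = linear(x, W, b)
print(y)  # approximately [2.2, 3.8]
```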


> In order to use weights and bias, you'll need a Tensor that can be modified. This leaves out `tf.placeholder()` and `tf.constant()`, since those Tensors can't be modified. This is where <b style='color: red'>tf.Variable</b> class comes in.

### tf.Variable()

> Internally, a `tf.Variable` stores a persistent tensor. Specific ops allow you to read and modify the values of this tensor. These modifications are visible across multiple tf.Sessions, so multiple workers can see the same values for a tf.Variable.

> Variables are manipulated via the tf.Variable class. A tf.Variable represents a tensor whose value can be changed by running ops on it. Unlike tf.Tensor objects, a tf.Variable exists outside the context of a single session.run call.

* The tf.Variable class creates a tensor with an initial value that can be modified, much like a normal Python variable. 
* This tensor stores its state in the session, so you must initialize the state of the tensor manually. You'll use the `tf.global_variables_initializer()` function to initialize the state of all the Variable tensors.


In [40]:
tf.reset_default_graph()
graph = tf.get_default_graph()

print("----- All operations at the very beginning --------------------")    
for op in graph.get_operations(): print(op.name)
    
x = tf.Variable(5)

print("------All operations after define a variable x ----------------")    
for op in graph.get_operations(): print(op.name)
print("---------------------------------------------------------------")  
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(x.eval())
    
print("------All operations after session closed ----------------")    
for op in graph.get_operations(): print(op.name)

----- All operations at the very beginning --------------------
------All operations after define a variable x ----------------
Variable/initial_value
Variable
Variable/Assign
Variable/read
---------------------------------------------------------------
5
------All operations after session closed ----------------
Variable/initial_value
Variable
Variable/Assign
Variable/read
init


You might expect that adding a variable (e.g., `x = tf.Variable(5)`) would add one operation to the graph, but in fact that one line adds four operations.

> The `tf.global_variables_initializer()` call returns an operation that will initialize all TensorFlow variables from the graph. You call the operation using a session to initialize all the variables as shown above. Using the tf.Variable class allows us to change the weights and bias, but an initial value needs to be chosen.

> Initializing the weights with random numbers from a normal distribution is good practice. Randomizing the weights helps keep the model from becoming stuck in the same place every time you train it.

> Similarly, choosing weights from a normal distribution prevents any one weight from overwhelming other weights. You'll use the `tf.truncated_normal()` function to generate random numbers from a normal distribution.

### tf.truncated_normal()

* The [`tf.truncated_normal()`](https://www.tensorflow.org/api_docs/python/tf/truncated_normal) function returns a tensor with random values from a normal distribution whose magnitude is no more than 2 standard deviations from the mean.

```
truncated_normal(
    shape,
    mean=0.0,
    stddev=1.0,
    dtype=tf.float32,
    seed=None,
    name=None
)
```
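
The truncation can be sketched in plain Python by redrawing any sample that falls more than two standard deviations from the mean. This is only an illustration built on `random.gauss`; the helper below shares the name `truncated_normal` for readability but is not the TensorFlow function and takes a sample count instead of a shape.

```python
import random

def truncated_normal(n, mean=0.0, stddev=1.0, seed=None):
    """Draw n normal samples, redrawing any that lie more than 2 stddev from the mean."""
    rng = random.Random(seed)
    samples = []
    while len(samples) < n:
        value = rng.gauss(mean, stddev)
        if abs(value - mean) <= 2.0 * stddev:
            samples.append(value)
    return samples

weights = truncated_normal(1000, seed=0)
largest = max(abs(w) for w in weights)
print(len(weights), largest <= 2.0)  # 1000 True
```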

* Since the weights are already helping prevent the model from getting stuck, you don't need to randomize the bias. Let's use the simplest solution, setting the bias to 0.

In [86]:
n_features = 120
n_labels = 5
weights = tf.Variable(tf.truncated_normal((n_features, n_labels)))
print(weights)

<tf.Variable 'Variable_175:0' shape=(120, 5) dtype=float32_ref>


### tf.zeros()
* The `tf.zeros()` function returns a tensor with all zeros.
* For many other similar functions, refer to the [documentation](https://www.tensorflow.org/api_guides/python/constant_op#truncated_normal)

In [27]:
n_labels = 5
bias = tf.Variable(tf.zeros(n_labels))
print(bias)

<tf.Variable 'Variable_1:0' shape=(5,) dtype=float32_ref>


### Session’s feed_dict

* Use the feed_dict parameter in <b>tf.Session.run()</b> to set the placeholder tensor
* <b>Note</b>: If the data passed to the feed_dict doesn’t match the tensor type and can’t be cast into the tensor type, you’ll get the error “ValueError: invalid literal for...”.

In [7]:
with tf.Session() as session:
    output = session.run(w, feed_dict={A: np.random.randn(5,5), v:np.random.randn(5,1)})
    print(output, type(output))

[[ 0.89380491]
 [ 0.76145381]
 [ 0.35286978]
 [-0.47771859]
 [ 0.92436922]] <class 'numpy.ndarray'>


> TensorFlow performs true matrix multiplication, so when you feed in a vector you have to specify all of its dimensions, for example as a `(5, 1)` column:

```
v:np.random.randn(5,1)
```
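
The shape requirement can be illustrated with a small plain-Python matrix product that checks the inner dimensions. The helpers `shape` and `matmul` are written for this sketch only; they mimic the rule, not TensorFlow's implementation or error messages.

```python
# Matrix multiplication with an explicit shape check (illustrative helpers).

def shape(m):
    return (len(m), len(m[0]))

def matmul(a, b):
    (n, k), (k2, m) = shape(a), shape(b)
    if k != k2:
        raise ValueError("inner dimensions differ: %d vs %d" % (k, k2))
    return [[sum(a[i][t] * b[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]

A = [[1.0 if i == j else 0.0 for j in range(5)] for i in range(5)]  # 5x5 identity
v = [[1.0], [2.0], [3.0], [4.0], [5.0]]                             # shape (5, 1)

w = matmul(A, v)
print(shape(w))  # (5, 1)

# A row vector of shape (1, 5) does not line up with a (5, 5) matrix on the left.
try:
    matmul(A, [[1.0, 2.0, 3.0, 4.0, 5.0]])
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
print(mismatch_detected)  # True
```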

## TensorFlow Math

Getting the input is great, but now you need to use it. You're going to use basic math functions that everyone knows and loves - add, subtract, multiply, and divide - with tensors. (There are many more math functions you can check out in the [documentation](https://www.tensorflow.org/api_guides/python/math_ops).)

### Addition

We’ll start with the add function. The `tf.add()` function does exactly what you expect it to do. It takes in two numbers, two tensors, or one of each, and returns their sum as a tensor.

In [29]:
weight = tf.Variable(5)
input_value = tf.constant(4)
z = tf.add(weight, input_value)

# This shows how the add operation tracks
# where its inputs come from: they come from other
# operations in the graph; one is a variable and the
# other is a constant
op = graph.get_operations()[-1]
print(op.name)
for op_input in op.inputs: print(op_input)

Add
Tensor("Variable_3/read:0", shape=(), dtype=int32)
Tensor("Const:0", shape=(), dtype=int32)


### Subtraction and Multiplication
Here’s an example with subtraction and multiplication.

```python
x = tf.subtract(10, 4) # 6
y = tf.multiply(2, 5)  # 10
```

The x tensor will evaluate to 6, because 10 - 4 = 6. The y tensor will evaluate to 10, because 2 * 5 = 10. That was easy!

### Converting types

It may be necessary to convert between types to make certain operators work together. For example, if you tried the following, it would fail with an exception:

```python
# Fails with ValueError: Tensor conversion requested dtype float32 for Tensor with dtype int32:
tf.subtract(tf.constant(2.0),tf.constant(1)) 
```

That's because the constant 1 is an integer but the constant 2.0 is a floating point value and subtract expects them to match.

In cases like these, you can either make sure your data is all of the same type, or you can cast a value to another type. In this case, converting the 2.0 to an integer before subtracting, like so, will give the correct result:

```python
tf.subtract(tf.cast(tf.constant(2.0), tf.int32), tf.constant(1))   # 1
```


**Quiz** The code below is a simple algorithm using division and subtraction. Convert the following algorithm in regular Python to TensorFlow and print the results of the session. You can use tf.constant() for the values 10, 2, and 1.

In [36]:
# Convert the following to TensorFlow:
# x = 10
# y = 2
# z = x/y - 1

x = tf.constant(10)
y = tf.constant(2)
z = tf.subtract(tf.divide(x,y),tf.cast(tf.constant(1), tf.float64))

# Print z from a session
with tf.Session() as sess:
    output = sess.run(z)
    print(output)

4.0


## TensorFlow Softmax

> The softmax function squashes its inputs, typically called <b style='color: red'>logits</b> or logit scores, to be between 0 and 1 and also normalizes the outputs such that they all sum up to 1.

> This means the output of the softmax function is equivalent to a categorical probability distribution. It's the perfect function to use as the output activation for a network predicting multiple classes.

<img src="images/softmax.png" alt="Drawing" style="width:60%;height:60%"/>

* We're using TensorFlow to build neural networks and, appropriately, there's a function for calculating softmax.
* <b style='color: red'>tf.nn.softmax()</b> implements the softmax function for you. It takes in logits and returns softmax activations.
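
For reference, the softmax computation itself is short enough to write by hand. The sketch below uses only the standard library and subtracts the largest logit before exponentiating, a common numerical-stability trick that does not change the result. It is a plain-Python illustration, not how `tf.nn.softmax()` is implemented.

```python
import math

def softmax(logits):
    """Exponentiate each logit and normalize so the outputs sum to 1."""
    largest = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.2, 0.9, 0.4])
print([round(p, 4) for p in probs])  # [0.4566, 0.3383, 0.2052]
```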

In [11]:
def compute_softmax():
    output = None
    logit_data = [1.2, 0.9, 0.4]
    logits = tf.placeholder(tf.float32)
    
    # Calculate the softmax of the logits
    softmax = tf.nn.softmax(logits)    
    
    with tf.Session() as sess:
        # Feed in the logit data
        output = sess.run(softmax,   feed_dict={logits: logit_data} )

    return output

print(compute_softmax())

[ 0.45659032  0.3382504   0.20515925]


## TensorFlow Cross Entropy

When you're using softmax, your output is a vector that contains a probability distribution over the output classes. You can express your data labels as a vector using what's called `one-hot encoding`.

`one-hot encoding` means that you have a vector the length of which is the number of classes, and the target label is marked with a 1 while the other labels are set to 0. In the case of classifying digits, our label vector for the image of the number 4 would be:

y = [0,0,0,0,1,0,0,0,0,0]

And our output prediction vector could be something like:

$\hat{y}$ = [0.047,0.048,0.061,0.07,0.330,0.062,0.001,0.213,0.013,0.150].
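
A one-hot label like the one above can be built with a few lines of plain Python. The helper name `one_hot` is chosen for this sketch and is not a call into TensorFlow.

```python
def one_hot(label, num_classes):
    """Return a list of zeros with a single 1 at the index of the true class."""
    vector = [0] * num_classes
    vector[label] = 1
    return vector

y = one_hot(4, 10)
print(y)  # [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
```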


We want our error to be proportional to how far apart these two vectors are. To calculate this distance, we'll use the `cross entropy`. Then, our goal while training the network is to make our prediction vectors as close as possible to the label vectors by minimizing the cross entropy. The cross entropy calculation is shown below:

<img src="images/cross_entropy.png" alt="Drawing" style="width:60%;height:60%"/>

As you can see above, the cross entropy is the sum of the label vector times the natural log of the prediction vector. Note that this formula is not symmetric! Flipping the vectors is a bad idea because the label vector has a lot of zeros and taking the log of zero will cause an error.

What's cool about using one-hot encoding for the label vector is that $y_j$ is 0 except for the one true class. Then, all terms in that sum except for where $y_j=1$ are zero and the cross entropy is simply $D=-\ln(\hat{y})$ for the true label. For example, if your input image is of the digit 4 and it's labeled 4, then only the output of the unit corresponding to 4 matters in the cross entropy cost.
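
As a worked check, the sketch below evaluates the cross entropy for the label and prediction vectors given earlier for the digit 4, using only the standard library. Terms with a zero label are skipped, which is why only the prediction for the true class ends up mattering.

```python
import math

y = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
y_hat = [0.047, 0.048, 0.061, 0.07, 0.330, 0.062, 0.001, 0.213, 0.013, 0.150]

def cross_entropy(labels, predictions):
    """D = -sum_j y_j * ln(y_hat_j), skipping the terms where y_j is 0."""
    return -sum(l * math.log(p) for l, p in zip(labels, predictions) if l != 0)

D = cross_entropy(y, y_hat)
print(round(D, 4))  # 1.1087, which equals -ln(0.330)
```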

* As with the softmax function, TensorFlow has a function to do the cross entropy calculations for us.
* To create a cross entropy function in TensorFlow, you'll need to use two new functions:
    * <b style='color: red'>tf.reduce_sum()</b>, takes an array of numbers and sums them together.
    * <b style='color: red'>tf.log()</b>, takes the natural log of a number

In [15]:
x = tf.reduce_sum([1, 2, 3, 4, 5])  # 15
x = tf.log(100.0)  # 4.60517


In [12]:
# probability distribution vector
prod_dist_data = np.array([0.45659032, 0.3382504, 0.20515925])
# one hot vector
one_hot_data = np.array([1.0, 0.0, 0.0])

print(prod_dist_data.shape)

prod_dist = tf.placeholder(tf.float32)
one_hot = tf.placeholder(tf.float32)

cross_entropy = -tf.reduce_sum(tf.multiply(one_hot, tf.log(prod_dist)))

with tf.Session() as sess:
    output = sess.run(cross_entropy, feed_dict={prod_dist:prod_dist_data, one_hot:one_hot_data})
    print(output)

(3,)
0.783969


TensorFlow provides a function for computing the cross entropy, called [<b style='color: red'>tf.nn.softmax_cross_entropy_with_logits()</b>](https://www.tensorflow.org/api_docs/python/tf/nn/softmax_cross_entropy_with_logits).

* It computes softmax cross entropy between logits and labels.

* Measures the probability error in discrete classification tasks in which the classes are mutually exclusive (each entry is in exactly one class). For example, each CIFAR-10 image is labeled with one and only one label: an image can be a dog or a truck, but not both.

> While the classes are mutually exclusive, their probabilities need not be. All that is required is that each row of labels is a valid probability distribution. If they are not, the computation of the gradient will be incorrect.

> If using exclusive labels (wherein one and only one class is true at a time), see sparse_softmax_cross_entropy_with_logits.

* Other important points

> This operation expects **unscaled logits**, since it performs a softmax on logits internally for efficiency. Do not call this operation with the output of softmax, as it will produce incorrect results.

> `logits` and `labels` must have the same shape, e.g. `[batch_size, num_classes]` and the same dtype (either float16, float32, or float64).

> To avoid confusion, it is required to pass only named arguments to this function.

The most important arguments for this function are:
* `labels`: Each row labels[i] must be a valid probability distribution.
* `logits`: Unscaled log probabilities.
* `dim`: The class dimension. Defaults to -1, which is the last dimension.

It returns a **1-D tensor of length `batch_size`** of the same type as logits with the softmax cross entropy loss.

Backpropagation in this version of softmax cross entropy will happen only into logits. To calculate a cross entropy loss that allows backpropagation into both logits and labels, see [tf.nn.softmax_cross_entropy_with_logits_v2](https://www.tensorflow.org/api_docs/python/tf/nn/softmax_cross_entropy_with_logits_v2).
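
To see what "softmax on the logits, then cross entropy" amounts to, here is a plain-Python sketch that computes one loss per row from raw logits. It uses the log-sum-exp form for numerical stability. The function name `softmax_cross_entropy` is invented for this example, and the code illustrates the math rather than TensorFlow's implementation.

```python
import math

def softmax_cross_entropy(logits_row, labels_row):
    """Cross entropy between labels and softmax(logits) for one example."""
    largest = max(logits_row)
    log_sum_exp = largest + math.log(sum(math.exp(z - largest) for z in logits_row))
    log_probs = [z - log_sum_exp for z in logits_row]  # log of the softmax output
    return -sum(l * lp for l, lp in zip(labels_row, log_probs))

logits = [[1.2, 0.9, 0.4], [0.8, 0.9, 1.4], [1.2, 0.9, 1.1]]
labels = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]

# One loss per example, then the mean over the batch.
losses = [softmax_cross_entropy(z, y) for z, y in zip(logits, labels)]
mean_loss = sum(losses) / len(losses)

print([round(loss, 4) for loss in losses])  # [0.784, 0.7679, 1.2729]
print(round(mean_loss, 4))                  # 0.9416
```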

In [8]:
logit = np.array([[1.2, 0.9, 0.4],[0.8, 0.9, 1.4],[1.2, 0.9, 1.1]])
one_hot = np.array([[1.0, 0.0, 0.0],[0, 0, 1.0],[0, 1.0, 0]])

print('logit shape', logit.shape)
print('one hot shape', one_hot.shape)

logit_placeholder = tf.placeholder(tf.float32)
one_hot_placeholder = tf.placeholder(tf.float32)

# NOTE: pass the raw logits here; the function applies softmax internally, so do not apply tf.nn.softmax() or tf.log() first
cross_entropy = tf.nn.softmax_cross_entropy_with_logits(logits=logit_placeholder, labels=one_hot_placeholder)
with tf.Session() as sess:
    output = sess.run(cross_entropy, feed_dict={logit_placeholder:logit, one_hot_placeholder:one_hot})
    print('cross entropy:', output)
    
# tf.nn.softmax_cross_entropy_with_logits returns a 1-D tensor of length batch_size of the same type 
# as logits. 
# Each element in the returned tensor is a cross entropy for a logit with its corresponding one hot vector. 
# To compute the average cross entropy of all training examples, we use tf.reduce_mean function.
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logit_placeholder, 
                                                                       labels=one_hot_placeholder))
with tf.Session() as sess:
    output = sess.run(cross_entropy, feed_dict={logit_placeholder:logit, one_hot_placeholder:one_hot})
    print('cross entropy mean', output)

logit shape (3, 3)
one hot shape (3, 3)
cross entropy: [0.78396875 0.7679496  1.2729189 ]
cross entropy mean 0.9416124


[tf.reduce_mean](https://www.tensorflow.org/versions/r1.0/api_docs/python/tf/reduce_mean) computes the mean of elements across specified dimensions of a tensor.

> Reduces input_tensor along the dimensions given in axis. Unless keep_dims is true, the rank of the tensor is reduced by 1 for each entry in axis. If keep_dims is true, the reduced dimensions are retained with length 1.

> If axis has no entries, all dimensions are reduced, and a tensor with a single element is returned

The most important arguments for this function:
* `input_tensor`: The tensor to reduce. Should have numeric type.
* `axis`: The dimensions to reduce. If None (the default), reduces all dimensions.
* `keep_dims`: If true, retains reduced dimensions with length 1.

It returns the reduced tensor.
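
The effect of the `axis` argument can be illustrated on a small 2-D list. The `reduce_mean` helper below is a plain-Python stand-in written for this sketch; it handles only 2-D input and does not implement `keep_dims`.

```python
def reduce_mean(matrix, axis=None):
    """Mean of a 2-D list over all elements (axis=None), down columns (0), or along rows (1)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    if axis is None:
        return sum(sum(row) for row in matrix) / (n_rows * n_cols)
    if axis == 0:
        return [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]
    if axis == 1:
        return [sum(row) / n_cols for row in matrix]
    raise ValueError("this sketch supports only axis None, 0, or 1")

m = [[1.0, 2.0], [3.0, 4.0]]
mean_all = reduce_mean(m)
mean_axis0 = reduce_mean(m, axis=0)
mean_axis1 = reduce_mean(m, axis=1)

print(mean_all)    # 2.5
print(mean_axis0)  # [2.0, 3.0]
print(mean_axis1)  # [1.5, 3.5]
```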

As shown in the code, we use `tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=Y))` to compute the loss over all the training examples in a mini-batch.
* `tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=Y)` computes a 1-D tensor in which each element is the softmax cross entropy value for one training example; there are `batch_size` of them in total.
* Then `tf.reduce_mean` computes the mean of those `batch_size` softmax cross entropy values.


## TensorFlow ReLU

* ReLU (rectified linear unit) is a non-linear activation function. It is 0 for negative inputs and x for all inputs x > 0.
* TensorFlow provides the ReLU function as <b style='color: red'>tf.nn.relu()</b>, as shown below.
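
Element-wise, the function is simply `max(0, x)`. A minimal plain-Python sketch, with sample inputs chosen only for illustration:

```python
def relu(x):
    """Rectified linear unit: 0 for negative inputs, x otherwise."""
    return max(0.0, x)

inputs = [-2.0, -0.5, 0.0, 1.5]
outputs = [relu(x) for x in inputs]
print(outputs)  # [0.0, 0.0, 0.0, 1.5]
```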

```python
# Hidden Layer with ReLU activation function
hidden_layer = tf.add(tf.matmul(features, hidden_weights), hidden_biases)
hidden_layer = tf.nn.relu(hidden_layer)

output = tf.add(tf.matmul(hidden_layer, output_weights), output_biases)
```

The above code applies the tf.nn.relu() function to the hidden_layer, effectively turning off any negative values and acting like an on/off switch. Adding additional layers, like the output layer, after an activation function turns the model into a nonlinear function. This nonlinearity allows the network to solve more complex problems.

* Below you'll use the ReLU function to turn a linear single layer network into a non-linear multilayer network.
<img src="images/two_layer_nn.png" alt="Drawing" style="width:60%;height:60%"/>

In [117]:
import tensorflow as tf

output = None
hidden_layer_weights = [
    [0.1, 0.2, 0.4],
    [0.4, 0.6, 0.6],
    [0.5, 0.9, 0.1],
    [0.8, 0.2, 0.8]]
out_weights = [
    [0.1, 0.6],
    [0.2, 0.1],
    [0.7, 0.9]]

# Weights and biases
weights = [
    tf.Variable(hidden_layer_weights),
    tf.Variable(out_weights)]
biases = [
    tf.Variable(tf.zeros(3)),
    tf.Variable(tf.zeros(2))]

# Input
features = tf.Variable([[1.0, 2.0, 3.0, 4.0], [-1.0, -2.0, -3.0, -4.0], [11.0, 12.0, 13.0, 14.0]])

# forward pass
hidden_layer = tf.add(tf.matmul(features, weights[0]), biases[0])
hidden_layer = tf.nn.relu(hidden_layer)

logits = tf.add(tf.matmul(hidden_layer, weights[1]), biases[1])

# Print session results
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(logits))

[[  5.11000013   8.44000053]
 [  0.           0.        ]
 [ 24.01000214  38.23999786]]
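
The forward pass in the cell above can be cross-checked without TensorFlow. The sketch below repeats the same weights and inputs with plain lists; the helper names `matmul`, `add_bias`, and `relu_all` are made up for this check.

```python
def matmul(a, b):
    """Product of two 2-D lists whose inner dimensions match."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def add_bias(matrix, bias):
    """Add the bias vector to every row."""
    return [[value + bias[j] for j, value in enumerate(row)] for row in matrix]

def relu_all(matrix):
    """Apply max(0, x) to every element."""
    return [[max(0.0, value) for value in row] for row in matrix]

hidden_layer_weights = [[0.1, 0.2, 0.4], [0.4, 0.6, 0.6], [0.5, 0.9, 0.1], [0.8, 0.2, 0.8]]
out_weights = [[0.1, 0.6], [0.2, 0.1], [0.7, 0.9]]
features = [[1.0, 2.0, 3.0, 4.0], [-1.0, -2.0, -3.0, -4.0], [11.0, 12.0, 13.0, 14.0]]

# Hidden layer with ReLU, then a linear output layer; both biases are zero.
hidden = relu_all(add_bias(matmul(features, hidden_layer_weights), [0.0, 0.0, 0.0]))
logits = add_bias(matmul(hidden, out_weights), [0.0, 0.0])

for row in logits:
    print([round(value, 2) for value in row])
# [5.11, 8.44]
# [0.0, 0.0]
# [24.01, 38.24]
```

The second input row is entirely negative, so every hidden unit is switched off by ReLU and both logits come out as zero.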


## TensorFlow Gradient Descent Optimizer

* In this section, we will use [`tf.train.GradientDescentOptimizer()`](https://www.tensorflow.org/api_docs/python/tf/train/GradientDescentOptimizer) to find the minimum of a simple convex function.

$$u^2 + u + 1$$

* You can refer to this [document](https://www.tensorflow.org/api_guides/python/train) for many other optimizers.
    * [`tf.train.GradientDescentOptimizer()`](https://www.tensorflow.org/api_docs/python/tf/train/GradientDescentOptimizer)
```python
__init__(
    learning_rate,
    use_locking=False,
    name='GradientDescent'
)
```
    * [`tf.train.AdamOptimizer`](https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer)
```python
__init__(
    learning_rate=0.001,
    beta1=0.9,
    beta2=0.999,
    epsilon=1e-08,
    use_locking=False,
    name='Adam'
)
```

In [15]:
u = tf.Variable(20.0)
cost = u*u + u + 1
print(cost)

Tensor("add_1:0", shape=(), dtype=float32)


In [17]:
# 0.3 is the learning rate
optimizer = tf.train.GradientDescentOptimizer(0.3)
train_op = optimizer.minimize(cost)
print(train_op)

name: "GradientDescent_1"
op: "NoOp"
input: "^GradientDescent_1/update_Variable_3/ApplyGradientDescent"



In [19]:
init = tf.global_variables_initializer()
with tf.Session() as session:
    session.run(init)
    for i in range(12):
        session.run(train_op)
        print("i = %d, cost %.3f, u = %.3f" % (i , cost.eval(), u.eval()))


i = 0, cost 67.990, u = 7.700
i = 1, cost 11.508, u = 2.780
i = 2, cost 2.471, u = 0.812
i = 3, cost 1.025, u = 0.025
i = 4, cost 0.794, u = -0.290
i = 5, cost 0.757, u = -0.416
i = 6, cost 0.751, u = -0.466
i = 7, cost 0.750, u = -0.487
i = 8, cost 0.750, u = -0.495
i = 9, cost 0.750, u = -0.498
i = 10, cost 0.750, u = -0.499
i = 11, cost 0.750, u = -0.500


The `minimize(cost)` function of the optimizer automatically (1) computes gradients through the whole network and (2) applies those gradients to the variables (i.e., carries out the backward step of learning).
* This method simply combines calls to `compute_gradients()` and `apply_gradients()`. If you want to process the gradients before applying them, call `compute_gradients()` and `apply_gradients()` explicitly instead of using this function.
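
The two steps can be mimicked in plain Python for the cost $u^2 + u + 1$, whose derivative $2u + 1$ is written out by hand here. The functions `compute_gradient` and `apply_gradient` are hypothetical stand-ins for the two phases; TensorFlow derives the gradient from the graph instead of requiring it to be coded manually.

```python
def cost(u):
    return u * u + u + 1

def compute_gradient(u):
    """Hand-derived derivative of u^2 + u + 1."""
    return 2 * u + 1

def apply_gradient(u, gradient, learning_rate):
    """One gradient descent update."""
    return u - learning_rate * gradient

u = 20.0
history = []
for i in range(12):
    gradient = compute_gradient(u)          # step 1: compute
    u = apply_gradient(u, gradient, 0.3)    # step 2: apply
    history.append((u, cost(u)))
    print("i = %d, cost %.3f, u = %.3f" % (i, cost(u), u))
```

With the same starting point (20.0) and learning rate (0.3) as the TensorFlow cell, the loop approaches the minimum at $u = -0.5$, where the cost is 0.75.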

**compute_gradients()**

```python
compute_gradients(
    loss,
    var_list=None,
    gate_gradients=GATE_OP,
    aggregation_method=None,
    colocate_gradients_with_ops=False,
    grad_loss=None
)
```

* Compute gradients of loss for the variables in var_list.
* This is the first part of minimize(). It returns a list of (gradient, variable) pairs where "gradient" is the gradient for "variable". Note that "gradient" can be a Tensor, an IndexedSlices, or None if there is no gradient for the given variable.
* Args:
    * `loss`: A Tensor containing the value to minimize.
    * `var_list`: Optional list or tuple of tf.Variable to update to minimize loss. Defaults to the list of variables collected in the graph under the key `GraphKeys.TRAINABLE_VARIABLES`.
    * `gate_gradients`: How to gate the computation of gradients. Can be GATE_NONE, GATE_OP, or GATE_GRAPH.
    * `aggregation_method`: Specifies the method used to combine gradient terms. Valid values are defined in the class AggregationMethod.
    * `colocate_gradients_with_ops`: If True, try colocating gradients with the corresponding op.
    * `grad_loss`: Optional. A Tensor holding the gradient computed for loss.

Returns:
A list of (gradient, variable) pairs. Variable is always present, but gradient can be None.

**apply_gradients()**

```python
apply_gradients(
    grads_and_vars,
    global_step=None,
    name=None
)
```

* Apply gradients to variables.
* This is the second part of minimize(). It returns an Operation that applies gradients.
* Args:
    * `grads_and_vars`: List of (gradient, variable) pairs as returned by compute_gradients().
    * `global_step`: Optional Variable to increment by one after the variables have been updated.
    * `name`: Optional name for the returned operation. Default to the name passed to the Optimizer constructor.

* Returns: An Operation that applies the specified gradients. If global_step was not None, that operation also increments global_step.


In [35]:
u = tf.Variable(20.0)
cost = u*u + u + 1

optimizer = tf.train.GradientDescentOptimizer(0.3)
grads_and_vars = optimizer.compute_gradients(cost)
train_op = optimizer.apply_gradients(grads_and_vars)

init = tf.global_variables_initializer()
with tf.Session() as session:
    session.run(init)
    for i in range(12):
        session.run(train_op)
        print("i = %d, cost %.3f, u = %.3f" % (i , cost.eval(), u.eval()))

i = 0, cost 67.990, u = 7.700
i = 1, cost 11.508, u = 2.780
i = 2, cost 2.471, u = 0.812
i = 3, cost 1.025, u = 0.025
i = 4, cost 0.794, u = -0.290
i = 5, cost 0.757, u = -0.416
i = 6, cost 0.751, u = -0.466
i = 7, cost 0.750, u = -0.487
i = 8, cost 0.750, u = -0.495
i = 9, cost 0.750, u = -0.498
i = 10, cost 0.750, u = -0.499
i = 11, cost 0.750, u = -0.500
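
The run above applies the gradients unchanged. As a rough illustration of why the two-step form can be useful, the plain-Python sketch below mimics the same update for `cost = u*u + u + 1` and optionally clips the gradient between the two steps. The helper names (`compute_gradients`, `clip_gradient`, `apply_gradients`, `run`) and the clipping limit of 5.0 are assumptions made for this sketch; they only mirror the arithmetic and are not the TensorFlow API.

```python
# Plain-Python stand-in for the two-step optimizer pattern (not TensorFlow code).

def compute_gradients(u):
    """Return a list of (gradient, variable value) pairs for cost = u*u + u + 1."""
    return [(2.0 * u + 1.0, u)]


def clip_gradient(grad, limit):
    """Clamp a gradient to [-limit, limit]; a None gradient is passed through."""
    if grad is None:
        return None
    return max(-limit, min(limit, grad))


def apply_gradients(grads_and_vars, learning_rate):
    """Take one gradient-descent step; a None gradient leaves the value unchanged."""
    updated = []
    for grad, value in grads_and_vars:
        updated.append(value if grad is None else value - learning_rate * grad)
    return updated


def run(steps, learning_rate=0.3, limit=None, start=20.0):
    """Repeat compute -> (optional clip) -> apply and return the final value of u."""
    u = start
    for _ in range(steps):
        grads_and_vars = compute_gradients(u)
        if limit is not None:
            grads_and_vars = [(clip_gradient(g, limit), v) for g, v in grads_and_vars]
        (u,) = apply_gradients(grads_and_vars, learning_rate)
    return u


unclipped_first_step = run(1)            # same first update as the cell above
clipped_first_step = run(1, limit=5.0)   # the gradient 41 is clipped to 5
print(unclipped_first_step, clipped_first_step)
```

Without clipping, the first step moves `u` from 20 to about 7.7, as in the printed output; with the assumed limit the first step is much smaller. In TensorFlow itself the processing step would operate on the gradient tensors returned by `compute_gradients()` before they are handed to `apply_gradients()`.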


In [21]:
a=list([[[1, 2, 3],
            [4, 5, 6],
            [7, 8, 9]],
           [[10,11,12],
            [13,14,15],
            [16,17,18]]])
print(a)
print(type(a))
A = tf.convert_to_tensor(a)
# A = tf.constant(a)
# B = tf.constant(b)
B = A[:, -1]
print("A", A.shape)
print("A", A)
concatenated=tf.concat(A, axis=1) 

with tf.Session() as sess:
    concatenated = tf.reshape(concatenated, [-1, 3])
    C, B = sess.run([concatenated, B])
    print("C", C)
    print("C shape", C.shape)
    print("B", B)
    print("B shape", B.shape)

[[[1, 2, 3], [4, 5, 6], [7, 8, 9]], [[10, 11, 12], [13, 14, 15], [16, 17, 18]]]
<class 'list'>
A (2, 3, 3)
A Tensor("Const_17:0", shape=(2, 3, 3), dtype=int32)
C [[ 1  2  3]
 [ 4  5  6]
 [ 7  8  9]
 [10 11 12]
 [13 14 15]
 [16 17 18]]
C shape (6, 3)
B [[ 7  8  9]
 [16 17 18]]
B shape (2, 3)
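
To check these shapes without running a session, the same two results can be reproduced with nested Python lists. This is only a sketch of the indexing logic; the names `last_rows` and `flattened` are chosen for illustration.

```python
a = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
     [[10, 11, 12], [13, 14, 15], [16, 17, 18]]]

# Counterpart of A[:, -1]: take the last row of each 3x3 block -> shape (2, 3).
last_rows = [block[-1] for block in a]

# Counterpart of reshaping the (2, 3, 3) data to [-1, 3]: stack all rows -> shape (6, 3).
flattened = [row for block in a for row in block]

print(last_rows)
print(len(flattened), len(flattened[0]))
```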


### Difference between tf.contrib.layers.fully_connected and tf.layers.dense

They are essentially the same: [tf.contrib.layers.fully_connected](https://www.tensorflow.org/api_docs/python/tf/contrib/layers/fully_connected) calls [tf.layers.dense](https://www.tensorflow.org/api_docs/python/tf/layers/dense).
Layers in `tf.layers` (and in `tf.contrib.layers`) are part of TensorFlow's higher-level API, which takes care of initializing weights and biases. You can still choose an initializer for them.

One [major difference](https://stackoverflow.com/questions/44912297/are-tf-layers-dense-and-tf-contrib-layers-fully-connected-interchangeable) is that `tf.contrib.layers.fully_connected` uses ReLU as its default activation, while `tf.layers.dense` uses a linear activation by default.

```
tf.layers.dense
dense(
    inputs,
    units,
    activation=None,
    use_bias=True,
    kernel_initializer=None,
    bias_initializer=tf.zeros_initializer(),
    kernel_regularizer=None,
    bias_regularizer=None,
    activity_regularizer=None,
    kernel_constraint=None,
    bias_constraint=None,
    trainable=True,
    name=None,
    reuse=None
)

tf.contrib.layers.fully_connected
fully_connected(
    inputs,
    num_outputs,
    activation_fn=tf.nn.relu,
    normalizer_fn=None,
    normalizer_params=None,
    weights_initializer=initializers.xavier_initializer(),
    weights_regularizer=None,
    biases_initializer=tf.zeros_initializer(),
    biases_regularizer=None,
    reuse=None,
    variables_collections=None,
    outputs_collections=None,
    trainable=True,
    scope=None
)

```

* `inputs`: A tensor of at least rank 2 with a static value for its last dimension, e.g. `[batch_size, depth]` (where `depth` can be the number of features of an example) or `[None, None, None, channels]`.
* `units` / `num_outputs`: Integer or long, the number of output units in the layer.
* `activation` / `activation_fn`: Activation function. For `fully_connected` the default is ReLU; explicitly set it to None to skip it and keep a linear activation. For `dense` the default is already None, i.e. linear.
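
The practical effect of the two defaults can be seen in a small plain-Python sketch of a fully connected layer applied to a single example. The function `dense` below and the weights are hypothetical stand-ins written for this illustration, not the TensorFlow implementations; the `activation=None` default mimics `tf.layers.dense`, and passing `relu` mimics the default of `tf.contrib.layers.fully_connected`.

```python
def relu(values):
    """Element-wise max(0, v)."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases, activation=None):
    """Toy fully connected layer for one example: activation(inputs . weights + biases)."""
    outputs = []
    for j in range(len(biases)):
        total = biases[j]
        for i, x in enumerate(inputs):
            total += x * weights[i][j]
        outputs.append(total)
    return outputs if activation is None else activation(outputs)


x = [1.0, -2.0]            # one example with two features
w = [[0.5, 1.0],
     [1.0, 0.25]]          # shape [n_inputs, n_units]
b = [0.0, 0.0]

linear_out = dense(x, w, b)                  # no activation: negative values survive
relu_out = dense(x, w, b, activation=relu)   # ReLU: negative values become 0
print(linear_out, relu_out)
```

With these made-up numbers the linear layer returns a negative first output, which the ReLU variant replaces with zero. This is why swapping one function for the other without setting the activation explicitly can change a model's behaviour.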


We may have different ways to define the fully connected output layer.

```python
## 1. 
# (1) We first define a fully-connected layer that only performs 
#     linear transformation of x since we set activation=None.
# (2) Then, we define the leaky relu as activation function
h = tf.layers.dense(x, n_units, activation=None)
h = tf.maximum(alpha * h, h)

## 2.
# (1) We first reshape the tensor from the output of a LSTM network 
#     to the tensor with shape accepted by the fully-connected layer we will define next.
# (2) Then, we define a dense (i.e., fully-connected layer) that only performs 
#     linear transformation of x since we set activation=None.
# (3) Finally, we use tf.nn.sigmoid_cross_entropy_with_logits to compute the cost.
# (4) Note that tf.nn.sigmoid_cross_entropy_with_logits will compute sigmoid of the inputted logits 
#     Therefore, we should not compute sigmoid on logits by ourselves.
last_layer_input = reshape(lstm_output)
logits = tf.layers.dense(last_layer_input, 1, activation=None)
cost = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=logits, labels=labels))


## 3.
# (1) We first reshape the tensor from the output of a LSTM network 
#     to the tensor with shape accepted by the fully-connected layer we will define next.
# (2) Then, we define a fully-connected layer with sigmoid activation function.
# (3) Finally, we use tf.losses.mean_squared_error to compute the cost.
# (4) Note that tf.losses.mean_squared_error will not compute sigmoid of the inputted logits 
#     Therefore, we should compute sigmoid on logits by ourselves.
last_layer_input = reshape(lstm_output)
predictions = tf.contrib.layers.fully_connected(last_layer_input, 1, activation_fn=tf.sigmoid)
cost = tf.losses.mean_squared_error(labels, predictions)
```
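
The difference between options 2 and 3 comes down to where the sigmoid is applied. The sketch below works through both losses for a single scalar logit in plain Python. It uses the numerically stable form `max(x, 0) - x * z + log(1 + exp(-|x|))` given in the TensorFlow documentation for `tf.nn.sigmoid_cross_entropy_with_logits`; the function names are local to this sketch and are not TensorFlow calls.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_cross_entropy_with_logits(logit, label):
    """Stable cross-entropy on a raw logit: max(x, 0) - x * z + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def naive_cross_entropy(logit, label):
    """Textbook form that applies the sigmoid first; shown only for comparison."""
    p = sigmoid(logit)
    return -label * math.log(p) - (1.0 - label) * math.log(1.0 - p)


def squared_error(logit, label):
    """Option 3: the sigmoid is applied explicitly before the squared error."""
    return (label - sigmoid(logit)) ** 2


logit, label = 2.0, 1.0
stable = sigmoid_cross_entropy_with_logits(logit, label)
naive = naive_cross_entropy(logit, label)
mse = squared_error(logit, label)
print(stable, naive, mse)
```

For moderate logits the stable and naive forms agree, which shows that the loss already contains the sigmoid and should be fed raw logits. For very large logits the naive form in this sketch breaks down, because the sigmoid rounds to exactly 1 and `log(1 - p)` is undefined, whereas the stable form still returns a finite value.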

## Variable

[document](https://jhui.github.io/2017/03/08/TensorFlow-variable-sharing/)

### Difference between tf.Variable and tf.get_variable

* [`tf.Variable`](https://www.tensorflow.org/api_docs/python/tf/Variable) will always create a new variable and thus `tf.Variable` requires that an `initial_value` be specified. 

```
__init__(
    initial_value=None,
    trainable=True,
    collections=None,
    validate_shape=True,
    caching_device=None,
    name=None,
    variable_def=None,
    dtype=None,
    expected_shape=None,
    import_scope=None,
    constraint=None
)
```
```
W = tf.Variable(<initial-value>, name=<optional-name>)
```

*  [`tf.get_variable`](https://www.tensorflow.org/api_docs/python/tf/get_variable) gets an existing variable from the graph or creates a new one if it does not exist; whether an existing variable may be returned is controlled by the `reuse` setting of the enclosing variable scope. Unlike `tf.Variable`, it does not require an initial value.
    * This function requires you to specify the Variable's name. 
        * This name will be used by other replicas to access the same variable, as well as to name this variable's value when checkpointing and exporting models. 
    * To create a new variable with `tf.get_variable`, you have to specify the `shape` of the variable in addition to its name. `tf.get_variable` then initializes the variable with the specified shape by using an `initializer`.
        * If initializer is None (the default), the default initializer passed in the `variable scope` will be used. If that one is None too, a `glorot_uniform_initializer` will be used.
        * When the initializer is a `tf.Tensor` you should not specify the variable's shape, as the shape of the initializer tensor will be used.

```
get_variable(
    name,
    shape=None,
    dtype=None,
    initializer=None,
    regularizer=None,
    trainable=True,
    collections=None,
    caching_device=None,
    partitioner=None,
    validate_shape=True,
    use_resource=None,
    custom_getter=None,
    constraint=None
)
```
```
W = tf.get_variable("W", shape=[784, 256],
       initializer=tf.contrib.layers.xavier_initializer())
```     
       
It is generally better to use `tf.get_variable()`, since it makes your code easier to refactor if you need to share variables at any time, e.g. in a multi-GPU setting (see the multi-GPU CIFAR example). There is no downside to it.
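
The name-based lookup that makes sharing possible can be modelled with a small registry. The sketch below is a toy model under stated assumptions, not TensorFlow's implementation: `VariableStore` is a hypothetical class, variables are plain dictionaries, and only the two cases `reuse=False` (create, fail if the name exists) and `reuse=True` (look up, fail if the name is missing) are covered.

```python
class VariableStore:
    """Toy registry that creates or looks up variables by their scoped name."""

    def __init__(self):
        self._variables = {}

    def get_variable(self, scope, name, shape, reuse=False):
        full_name = scope + "/" + name
        if reuse:
            # Reuse mode: the variable must already exist.
            if full_name not in self._variables:
                raise ValueError("Variable %s does not exist" % full_name)
            return self._variables[full_name]
        # Create mode: the name must still be free.
        if full_name in self._variables:
            raise ValueError("Variable %s already exists" % full_name)
        variable = {"name": full_name, "shape": list(shape)}
        self._variables[full_name] = variable
        return variable


store = VariableStore()
a = store.get_variable("one", "v", [1])
c = store.get_variable("one", "v", [1], reuse=True)   # same object as a
d = store.get_variable("two", "v", [1])               # different scope, new object
print(a is c, a is d)
```

Because the lookup key is the scoped name, two pieces of code that agree on the scope and name end up with the same object, which is the property that variable sharing relies on.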

> A new variable is added to the graph collections listed in collections, which defaults to `GraphKeys.GLOBAL_VARIABLES`.
> If trainable is True the variable is also added to the graph collection `GraphKeys.TRAINABLE_VARIABLES`.



In [55]:
import tensorflow as tf

tf.reset_default_graph()

with tf.variable_scope("one"):
    a = tf.get_variable("v", [1]) #a.name == "one/v:0"
    print('a', a)
    
with tf.variable_scope("one"):
    a2 = tf.get_variable("v2", shape=[1]) #a2.name == "one/v2:0" (a new name, so no error)
    print('a2', a2)
    
# with tf.variable_scope("one"):
#     b = tf.get_variable("v", [1]) #ValueError: Variable one/v already exists
    
with tf.variable_scope("one", reuse = True):
    c = tf.get_variable("v", [1]) #c.name == "one/v:0"
    print('c', c)

with tf.variable_scope("two"):
    d = tf.get_variable("v", [1]) #d.name == "two/v:0"
    e = tf.Variable(1, name = "v", expected_shape = [1]) #e.name == "two/v_1:0"
    print('d', d)
    print('e', e)

assert(a is c)  #Assertion is true, they refer to the same object.
assert(a is d)  #AssertionError: they are different objects
assert(d is e)  #AssertionError: they are different objects

a <tf.Variable 'one/v:0' shape=(1,) dtype=float32_ref>
a2 <tf.Variable 'one/v2:0' shape=(1,) dtype=float32_ref>
c <tf.Variable 'one/v:0' shape=(1,) dtype=float32_ref>
d <tf.Variable 'two/v:0' shape=(1,) dtype=float32_ref>
e <tf.Variable 'two/v_1:0' shape=() dtype=int32_ref>


AssertionError: 

Although the two variables `d` and `e` were both given the name 'v' under the same scope 'two', they are different variables. This is because `tf.Variable()` always creates a new variable in the current graph. If the current graph already contains an operation named "two/v", TensorFlow appends "_1", "_2", and so on to the name in order to make it unique.

In [57]:
print(d.name)   #d.name == "two/v:0"
print(e.name)   #e.name == "two/v_1:0"

two/v:0
two/v_1:0
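
The suffixing rule can be sketched in a few lines of plain Python. `NameRegistry` is a hypothetical helper written for this illustration; it reproduces only the "_1", "_2" behaviour described above and is not how TensorFlow is implemented internally.

```python
class NameRegistry:
    """Toy model of name uniquification: append _1, _2, ... until the name is free."""

    def __init__(self):
        self._used = set()

    def unique_name(self, name):
        candidate = name
        suffix = 0
        while candidate in self._used:
            suffix += 1
            candidate = "%s_%d" % (name, suffix)
        self._used.add(candidate)
        return candidate


registry = NameRegistry()
first = registry.unique_name("two/v")    # name is free, used as is
second = registry.unique_name("two/v")   # clash, gets the _1 suffix
third = registry.unique_name("two/v")    # clash again, gets the _2 suffix
print(first, second, third)
```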


## References

* [Hello tensorflow](https://www.oreilly.com/learning/hello-tensorflow)
* [TensorFlow tutorials](https://www.tensorflow.org/tutorials/)
* [TensorFlow programmers guide](https://www.tensorflow.org/programmers_guide/)
* [TensorFlow get started](https://www.tensorflow.org/get_started/)
