# Main Content
- ANN architectures
- Multi-Layer Perceptrons
- MNIST digit classification

# From Biological to Artificial Neurons
## Biological Neurons
## Logical Computations with Neurons
## The Perceptron
The Perceptron is based on a slightly different artificial neuron called a **linear threshold unit (LTU)**.

![10](images/10-4.png)

The most common step function used in Perceptrons is the **Heaviside step function**. Sometimes the **sign function** is used instead.

![10](images/e10-1.png)

A perceptron is simply composed of a single layer of LTUs, with each neuron connected to all the inputs.

![10](images/10-5.png)

#### How is a perceptron trained?

![10](images/e10-2.png)
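
The Perceptron learning rule can be sketched in plain Python as follows. This is a minimal illustration, not Scikit-Learn's implementation: it assumes a single output LTU with a Heaviside step function, zero-initialized weights, and the update `w_i <- w_i + eta * (y - y_hat) * x_i`. The function names (`train_perceptron`, `predict`) and the logical AND toy data are made up for this example.

```python
def step(z):
    """Heaviside step function: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0


def predict(w, b, x):
    """Output of a single LTU for one instance x."""
    return step(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)


def train_perceptron(samples, targets, eta=1.0, n_epochs=20):
    """Apply the Perceptron learning rule one instance at a time."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(n_epochs):
        for x, y in zip(samples, targets):
            error = y - predict(w, b, x)  # 0 when the prediction is correct
            # Reinforce the connections that would have reduced the error.
            w = [w_i + eta * error * x_i for w_i, x_i in zip(w, x)]
            b += eta * error
    return w, b


# Toy data (assumption): logical AND, which is linearly separable.
X_and = [(0, 0), (0, 1), (1, 0), (1, 1)]
y_and = [0, 0, 0, 1]

w, b = train_perceptron(X_and, y_and)
and_predictions = [predict(w, b, x) for x in X_and]
print(and_predictions)
```

Weights change only when a prediction is wrong, which is why the rule settles down once every training instance is classified correctly.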

An example on the iris dataset.

In [1]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron

iris = load_iris()
X = iris.data[:,(2,3)] # petal length, petal width
y = (iris.target == 0).astype(int) # Iris Setosa (np.int was removed from NumPy)

per_clf = Perceptron(random_state = 42)
per_clf.fit(X,y)

y_pred = per_clf.predict([[2,0.5]])
print(y_pred)

[1]




In fact, Scikit-Learn's `Perceptron` class is equivalent to using an `SGDClassifier` with the hyperparameters `loss='perceptron'`, `learning_rate='constant'`, `eta0=1` (the learning rate), and `penalty=None` (no regularization).
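
As a rough check of that equivalence, the sketch below trains both estimators on the same iris features used above and compares their predictions. It assumes a reasonably recent Scikit-Learn and the same `random_state` for both models; the variable name `sgd_clf` is chosen for this example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron, SGDClassifier

iris = load_iris()
X = iris.data[:, (2, 3)]            # petal length, petal width
y = (iris.target == 0).astype(int)  # 1 for Iris Setosa, 0 otherwise

per_clf = Perceptron(random_state=42)
per_clf.fit(X, y)

# Same model expressed through SGDClassifier's hyperparameters.
sgd_clf = SGDClassifier(loss="perceptron", learning_rate="constant",
                        eta0=1, penalty=None, random_state=42)
sgd_clf.fit(X, y)

# With identical settings the two should agree on the training set.
print((per_clf.predict(X) == sgd_clf.predict(X)).all())
```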

**NOTE:** Perceptrons do not output a class probability as Logistic Regression does. They make predictions based on a hard threshold. This is one good reason to prefer Logistic Regression over Perceptrons.

Because Perceptrons cannot solve even some trivial problems, such as the Exclusive OR (XOR) classification problem, many researchers dropped **connectionism** in favor of higher-level problems such as logic, problem solving, and search. However, it turns out that some of these limitations can be eliminated by stacking multiple Perceptrons; the resulting network is called a **Multi-Layer Perceptron (MLP)**.

![10](images/10-6.png)
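
To make the XOR case concrete, here is a small sketch of a two-layer network of LTUs that computes XOR. The weights and biases are hand-picked for this illustration (one hidden unit behaves like OR, the other like AND); they are an assumption of this example and are not necessarily the values shown in the figure.

```python
def step(z):
    """Heaviside step function: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0


def xor_mlp(x1, x2):
    """Two hidden LTUs feeding one output LTU, with hand-picked weights."""
    h_or = step(x1 + x2 - 0.5)       # fires if at least one input is 1
    h_and = step(x1 + x2 - 1.5)      # fires only if both inputs are 1
    return step(h_or - h_and - 0.5)  # "OR but not AND" is XOR


xor_table = {(a, b): xor_mlp(a, b) for a in (0, 1) for b in (0, 1)}
print(xor_table)
```

No single LTU can produce this table because the two classes are not linearly separable; the hidden layer is what makes it possible.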

# Multi-Layer Perceptron and Backpropagation

![10](images/10-7.png)

#### Backpropagation -- the first way to train an MLP.
Today we would describe it as Gradient Descent using reverse-mode autodiff.

**Description**: for each training instance, the backpropagation algorithm first makes a prediction (forward pass), measures the error, then goes through each layer in reverse to measure the error contribution from each connection (reverse pass), and finally slightly tweaks the connection weights to reduce the error (Gradient Descent step).

In order for this algorithm to work properly, the authors made a key change to the MLP's architecture: **they replaced the step function with the logistic function, $\sigma(z) = 1/(1+\exp(-z))$**.

This was essential because the step function contains only flat segments, so there is no gradient to work with, while the logistic function has a well-defined nonzero derivative everywhere, allowing GD to make some progress at every step.
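
The forward pass, reverse pass, and Gradient Descent step described above can be sketched on a deliberately tiny network: one input, one logistic hidden neuron, and one linear output neuron. Everything specific here is an assumption made for illustration (the squared-error loss, the starting weights, the learning rate `eta`, and the single training instance); real MLPs apply the same chain rule to whole weight matrices.

```python
import math


def sigmoid(z):
    """Logistic function: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def train_step(params, x, y, eta=0.5):
    """One backpropagation step on a 1-1-1 network with squared error."""
    w1, b1, w2, b2 = params

    # Forward pass: compute the prediction and measure the error.
    h = sigmoid(w1 * x + b1)
    y_hat = w2 * h + b2
    loss = 0.5 * (y_hat - y) ** 2

    # Reverse pass: propagate the error gradient back, layer by layer.
    d_y_hat = y_hat - y           # dLoss/dy_hat
    d_w2 = d_y_hat * h
    d_b2 = d_y_hat
    d_h = d_y_hat * w2            # error contribution of the hidden output
    d_z1 = d_h * h * (1.0 - h)    # the logistic derivative is h * (1 - h)
    d_w1 = d_z1 * x
    d_b1 = d_z1

    # Gradient Descent step: nudge every parameter against its gradient.
    new_params = (w1 - eta * d_w1, b1 - eta * d_b1,
                  w2 - eta * d_w2, b2 - eta * d_b2)
    return new_params, loss


params = (0.5, 0.0, -0.3, 0.1)  # arbitrary starting values
losses = []
for _ in range(50):
    params, loss = train_step(params, x=1.0, y=1.0)
    losses.append(loss)

print(losses[0], losses[-1])
```

The term `h * (1.0 - h)` is where the choice of activation matters: with a step function this factor would be zero everywhere, and no gradient would reach `w1` or `b1`.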

##### Other activation functions that can replace the logistic function
- The hyperbolic tangent function $\tanh(z)=2\sigma(2z)-1$:
    - S-shaped, continuous, and differentiable
    - output values range from -1 to 1 (instead of 0 to 1 for the logistic function), which tends to make each layer's output more or less normalized at the beginning of training. This helps speed up convergence.

- The ReLU function $ReLU(z)=\max(0,z)$:
    - continuous
    - not differentiable at z=0
    - fast to compute
    - does not have a maximum output value, which helps reduce some issues during Gradient Descent.
    
![10](images/10-8.png)
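
The activation functions listed above, together with their derivatives, can be written in a few lines of plain Python. This is an illustrative sketch: the function names are chosen for this example, and setting the ReLU derivative to 0 at `z = 0` is a common convention rather than a mathematical fact, since ReLU is not differentiable there.

```python
import math


def sigmoid(z):
    """Logistic function, output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def tanh(z):
    """Hyperbolic tangent written via the logistic function, output in (-1, 1)."""
    return 2.0 * sigmoid(2.0 * z) - 1.0


def relu(z):
    """Rectified linear unit: max(0, z), unbounded above."""
    return max(0.0, z)


def relu_derivative(z):
    # Not differentiable at z = 0; 0 is used there by convention (assumption).
    return 1.0 if z > 0 else 0.0


for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), tanh(z), relu(z))
```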

![10](images/10-9.png)

**Biological neurons seem to implement a roughly sigmoid (S-shaped) activation function. But it turns out that the ReLU activation function generally works better in ANNs. This is one of the cases where the biological analogy was misleading.**

# Training an MLP with TensorFlow's High-Level API
The `DNNClassifier` class makes it trivial to train a deep neural network with any number of hidden layers, and a softmax output layer to output estimated class probabilities.

In [2]:
import numpy as np
import tensorflow as tf

# TensorFlow 1.x tutorial helper (deprecated); it downloads MNIST on first use
from tensorflow.examples.tutorials.mnist import input_data
# NOTE: one_hot=True returns the labels as one-hot vectors, not integer class ids;
# the sparse cross-entropy loss used later expects integer class ids instead
mnist = input_data.read_data_sets("datasets/MNIST_data/", one_hot=True)

# mnist = fetch_mldata('MNIST original')
X_train = mnist.train.images
y_train = mnist.train.labels
X_test = mnist.test.images
y_test = mnist.test.labels



Extracting datasets/MNIST_data/train-images-idx3-ubyte.gz
Extracting datasets/MNIST_data/train-labels-idx1-ubyte.gz
Extracting datasets/MNIST_data/t10k-images-idx3-ubyte.gz
Extracting datasets/MNIST_data/t10k-labels-idx1-ubyte.gz


**I had some problems running the following code; please refer to the picture below it to see what it is expected to do.**
```
feature_columns = tf.contrib.learn.infer_real_valued_columns_from_input(X_train)
# feature_columns = tf.contrib.estimator.multi_class_head(X_train)
dnn_clf = tf.contrib.learn.DNNClassifier(hidden_units=[300,100],n_classes=10, feature_columns=feature_columns)
dnn_clf.fit(x=X_train, y=y_train, batch_size=50, steps=40000)
```

![10](images/c10-1.png)

# Training a DNN Using Plain TensorFlow
**Take more control over the architecture of the network**.

Implement Mini-batch Gradient Descent to train it on the MNIST data in two steps.
## Construction Phase
- First, import the library, then specify the number of inputs and outputs, and set the number of hidden neurons in each layer.

In [3]:
import tensorflow as tf

n_inputs = 28*28 # size of an image
n_hidden1 = 300
n_hidden2 = 300
n_outputs = 10

- Second, use placeholder nodes to represent the training data and targets.
    - for the shape of X: the number of features is going to be 28*28 for one instance and the number of instances is unknown.
    - y is a 1D tensor with one entry per instance, but we don't know the size of the training batch at this point.

In [5]:
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')
y = tf.placeholder(tf.int64, shape=(None,), name='y')  # (None,) is a 1D shape; (None) is just None

- Third, create the actual neural network.
    - The placeholder X will act as the input layer, and it will be replaced with one training batch at a time (all the instances in a training batch will be processed simultaneously by the neural network).
    - The two hidden layers are almost identical. The output layer is also similar, but it uses a softmax activation function instead of a ReLU activation function.

So create a `neuron_layer()` function to create one layer at a time.

In [6]:
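import numpy as np

def neuron_layer(X, n_neurons, name, activation=None):
    with tf.name_scope(name):
        n_inputs = int(X.get_shape()[1])  # number of features feeding this layer
        # small random weights, scaled by the number of inputs, break the symmetry between neurons
        stddev = 2 / np.sqrt(n_inputs)
        init = tf.truncated_normal((n_inputs, n_neurons), stddev=stddev)
        W = tf.Variable(init, name='weights')  # shape (n_inputs, n_neurons)
        b = tf.Variable(tf.zeros([n_neurons]), name='bias')  # one bias per neuron
        z = tf.matmul(X, W) + b  # weighted sums for the whole batch at once
        if activation == 'relu':
            return tf.nn.relu(z)
        else:
            return z  # no activation: return the raw weighted sums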

![10](images/tx10-1.png)
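
What one such layer computes can be sketched without TensorFlow: for every instance in the batch, each neuron takes the weighted sum of the inputs plus its bias, and then the activation (if any) is applied. The helper `dense_layer` and the toy numbers below are made up for this illustration, and nested lists stand in for tensors.

```python
def dense_layer(X, W, b, activation=None):
    """Compute activation(X @ W + b) for a batch stored as nested lists."""
    Z = []
    for x in X:  # one row per instance in the batch
        row = [sum(x_i * W[i][j] for i, x_i in enumerate(x)) + b[j]
               for j in range(len(b))]  # one weighted sum per neuron
        Z.append(row)
    if activation == "relu":
        Z = [[max(0.0, z) for z in row] for row in Z]
    return Z


# Toy batch (assumption): 2 instances, 2 features, 3 neurons.
X_batch = [[1.0, 2.0], [0.0, -1.0]]
W = [[1.0, -1.0, 0.5],
     [0.5, 1.0, -2.0]]   # shape (n_inputs, n_neurons)
b = [0.0, 0.5, 1.0]

pre_activation = dense_layer(X_batch, W, b)
hidden = dense_layer(X_batch, W, b, activation="relu")
print(pre_activation)
print(hidden)
```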

#### Use neuron_layer function to create a neuron layer

In [7]:
with tf.name_scope('dnn'):
    hidden1 = neuron_layer(X, n_hidden1, "hidden1", activation='relu')
    hidden2 = neuron_layer(hidden1, n_hidden2, "hidden2", activation='relu')
    logits = neuron_layer(hidden2, n_outputs, 'outputs')

`logits` is the output of the neural network before going through the softmax activation function.

#### Now do some optimizations
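Use the `fully_connected()` function to build the network instead of the custom `neuron_layer()` function. It applies the ReLU activation function by default, so only the output layer needs `activation_fn=None`.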

In [9]:
from tensorflow.contrib.layers import fully_connected

with tf.name_scope('dnn'):
    hidden1 = fully_connected(X, n_hidden1, scope='hidden1')
    hidden2 = fully_connected(hidden1, n_hidden2, scope='hidden2')
    logits = fully_connected(hidden2, n_outputs, scope='outputs', activation_fn=None)

- Now define the cost function to train the neural network model.
    - Cross entropy will penalize models that estimate a low probability for the target class. 
    - Use `sparse_softmax_cross_entropy_with_logits()`: it computes the cross entropy based on the logits. It expects labels in the form of integers ranging from 0 to the number of classes minus 1 (`softmax_cross_entropy_with_logits()` takes labels in the form of one-hot vectors instead). This will give us a 1D tensor containing the cross entropy for each instance.
    - Use `reduce_mean()` function to compute the mean cross entropy over all instances.

In [10]:
with tf.name_scope('loss'):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name='loss')
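
To see what this loss computes, the sketch below reproduces the same arithmetic in plain Python: a softmax over each row of logits, the negative log of the probability assigned to the target class, and then the mean over the batch. It is an illustration under assumptions (the helper name and the toy logits are invented here) and is not TensorFlow's implementation.

```python
import math


def sparse_softmax_cross_entropy(logits, labels):
    """Per-instance cross entropy from raw logits and integer class ids."""
    losses = []
    for row, label in zip(logits, labels):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in row))
        # -log(softmax(row)[label]) == log_sum_exp - row[label]
        losses.append(log_sum_exp - row[label])
    return losses


# Toy batch (assumption): 2 instances, 3 classes.
logits = [[2.0, 0.5, -1.0],   # confident in class 0, and the target is 0
          [0.0, 0.0, 0.0]]    # uniform over 3 classes, target is 2
labels = [0, 2]

xentropy = sparse_softmax_cross_entropy(logits, labels)
loss = sum(xentropy) / len(xentropy)  # same role as reduce_mean()
print(xentropy, loss)
```

The second instance gets a loss of `log(3)` because a uniform guess assigns probability 1/3 to the target class, while the first instance is penalized less because the target class already has the largest logit.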