# Regression

**Prediction**: Linear Regression

**Classification**: Logistic Regression

# Perceptrons


!["The Perceptron"](NeuralNetworksFiles/perceptron.png "The Perceptron, a fundamental part of Neural Networks")
_The Perceptron, a fundamental part of Neural Networks_

## Perceptron Formula
### Discrete Activation Function
$$f(x_1, x_2, x_3, ..., x_m) = \left\{
                \begin{array}{ll}
                  0 \text{ if } b + \sum w_i \cdot x_i < 0\\
                  1 \text{ if } b + \sum w_i \cdot x_i \geq 0
                \end{array}
              \right.$$
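
As a rough illustration of the step activation, here is a minimal sketch in plain Python. The function name `perceptron` and the AND-gate weights are illustrative assumptions, and a score of exactly zero is treated as a positive output.

```python
def perceptron(inputs, weights, bias):
    """Step-activated perceptron: 1 if b + sum(w_i * x_i) >= 0, else 0."""
    score = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if score >= 0 else 0


# Hypothetical weights and bias that make the perceptron act like a logical AND.
and_weights = [1.0, 1.0]
and_bias = -1.5

# Evaluate all four input combinations: (0,0), (0,1), (1,0), (1,1).
and_outputs = [perceptron([a, b], and_weights, and_bias)
               for a in (0, 1) for b in (0, 1)]
print(and_outputs)
```

Only the input `(1, 1)` pushes the weighted sum above zero, so it is the only combination classified as 1.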
              
### Continuous Activation Function
_Sigmoid Function:_

$$ \sigma(x) = \frac{1}{1 + e^{-x}}$$
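
A minimal sketch of the sigmoid in plain Python, assuming scalar inputs. The branch on the sign of `x` is an added numerical-stability detail (it avoids overflow in `math.exp` for large negative inputs), not something the formula itself requires.

```python
import math


def sigmoid(x):
    """Logistic sigmoid 1 / (1 + e^-x), computed without overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, rewrite as e^x / (1 + e^x) so exp() never overflows.
    z = math.exp(x)
    return z / (1.0 + z)


print(sigmoid(0.0))   # the midpoint of the curve
print(sigmoid(4.0))   # close to 1
print(sigmoid(-4.0))  # close to 0
```

Unlike the discrete step, the output changes smoothly with the input, which is what makes gradient-based training possible.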

## Softmax
$$p_i = \frac{e^{z_i}}{ \sum_{j=0}^{n} e^{z_j}}$$ where $p_i$ is the probability of class $i$
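
The formula can be sketched in a few lines of plain Python. The function name `softmax` and the example logits are assumptions for illustration; subtracting the largest logit first is a common stability trick that leaves the result unchanged.

```python
import math


def softmax(logits):
    """Turn a list of scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print(probs)
print(sum(probs))
```

The largest score receives the largest probability, and the ordering of the scores is preserved.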

## One Hot Encoding
One column per feature, with binary values: 1 if the feature is present, 0 if it is not.
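
A minimal sketch of one-hot encoding in plain Python, assuming string labels and one column per distinct value in sorted order. The helper name `one_hot` and the sample labels are illustrative only.

```python
def one_hot(labels):
    """Return the sorted distinct classes and one binary row per label."""
    classes = sorted(set(labels))
    index = {c: i for i, c in enumerate(classes)}
    rows = []
    for label in labels:
        row = [0] * len(classes)
        row[index[label]] = 1  # mark the column of the class that is present
        rows.append(row)
    return classes, rows


classes, encoded = one_hot(["cat", "dog", "bird", "dog"])
print(classes)
print(encoded)
```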

## Maximum Likelihood
Pick the model that gives the existing labels the highest probability.
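
To make this concrete, the sketch below compares two hypothetical models on four binary labels. The names `likelihood`, `model_a` and `model_b` and all of the numbers are made up for illustration; each list holds a model's predicted probability that the label is 1.

```python
def likelihood(labels, predicted_probs):
    """Probability a model assigns to the observed binary labels."""
    result = 1.0
    for y, p in zip(labels, predicted_probs):
        # Use p when the label is 1, and (1 - p) when the label is 0.
        result *= p if y == 1 else (1.0 - p)
    return result


labels = [1, 1, 0, 0]
model_a = [0.9, 0.8, 0.2, 0.1]  # confident and correct
model_b = [0.6, 0.5, 0.5, 0.4]  # hesitant

likelihood_a = likelihood(labels, model_a)
likelihood_b = likelihood(labels, model_b)
print(likelihood_a, likelihood_b)
```

Under maximum likelihood, `model_a` would be preferred because it gives the existing labels the higher probability.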

## Cross Entropy
Sum of the negative logarithms of the probabilities the model assigns to the actual labels. Smaller is better, because a larger predicted probability for the correct label is better.

$$ \text{Cross Entropy } = - \sum_{i=1}^{m} \left[ y_i \ln(p_i) + (1-y_i) \ln(1-p_i) \right]$$
or
$$ H(p,q) = - \sum_{x} p(x) \log q(x) $$
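
A minimal sketch of binary cross entropy in plain Python, using the standard form in which the label $y_i$ selects $\ln(p_i)$ and $(1 - y_i)$ selects $\ln(1 - p_i)$. The function name and sample values are assumptions, and every probability is assumed to lie strictly between 0 and 1.

```python
import math


def binary_cross_entropy(labels, probs):
    """Sum of negative log-probabilities assigned to the actual labels."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(labels, probs))


labels = [1, 1, 0, 0]
good_probs = [0.9, 0.8, 0.2, 0.1]   # high probability on the right answers
poor_probs = [0.6, 0.5, 0.5, 0.4]   # close to guessing

good_loss = binary_cross_entropy(labels, good_probs)
poor_loss = binary_cross_entropy(labels, poor_probs)
print(good_loss, poor_loss)
```

Because the logarithm turns products into sums, this value equals the negative logarithm of the likelihood, so minimizing cross entropy and maximizing likelihood select the same model.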

## Multi-Class Cross Entropy

$$ \text{ Cross Entropy } = - \sum_{i=1}^{n} \sum_{j=1}^{m} y_{ij} \ln(p_{ij}) $$
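
The double sum can be sketched as two nested loops, assuming one-hot encoded label rows and one row of predicted class probabilities per sample. The function name and sample data are illustrative.

```python
import math


def multiclass_cross_entropy(one_hot_labels, probs):
    """Cross entropy summed over samples (outer loop) and classes (inner loop)."""
    total = 0.0
    for y_row, p_row in zip(one_hot_labels, probs):
        for y, p in zip(y_row, p_row):
            if y:  # only the true class has y = 1 and contributes
                total -= y * math.log(p)
    return total


sample_labels = [[1, 0, 0], [0, 1, 0]]
sample_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
loss = multiclass_cross_entropy(sample_labels, sample_probs)
print(loss)
```

With one-hot labels, each sample contributes only the negative log-probability of its true class.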

...

## Back-Propagation

*A backward pass that propagates the prediction error from the output back through the network to compute each weight's gradient, which is then used to adjust the weights.*

### Gradient Descent
_**Gradient:** partial derivatives of the loss function with respect to all of the weights._

_**Gradient Descent:** Adjusting the weights to reduce prediction error (minimizing the error function) by moving the weights in the opposite direction of the gradient, with a step equal to the learning rate times the gradient._


Pseudocode: `x = x - learning_rate * gradient_of_x`
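
The pseudocode can be turned into a small runnable sketch. The toy loss $(x - 3)^2$, its gradient $2(x - 3)$, and the hyperparameter values are assumptions chosen so the minimum is known to be at $x = 3$.

```python
def gradient_descent(gradient, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, scaled by the learning rate."""
    x = start
    for _ in range(steps):
        x = x - learning_rate * gradient(x)
    return x


# Gradient of the toy loss (x - 3)^2.
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(minimum)
```

Each step shrinks the distance to the minimum by a constant factor here, so the result approaches 3.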

![image.png](attachment:image.png)


#### Neural Network as a Function Composition
_A neural network is the composition of several functions, usually two per node (linear + sigmoid)_

_Therefore Gradient Descent needs to compute the partial derivative w.r.t. a node's inputs for each node._

#### Chain Rule
$$ \frac{\partial (f \circ g)}{\partial x} = \frac{\partial f}{\partial g} \frac{\partial g}{\partial x}$$
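
As a sanity check, the sketch below applies the chain rule to a single node (a linear function followed by a sigmoid) and compares the result with a finite-difference estimate. The weight, bias, and input values are arbitrary assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def node(x, w=2.0, b=-1.0):
    """One node: linear function g(x) = w*x + b followed by f = sigmoid."""
    return sigmoid(w * x + b)


def node_gradient(x, w=2.0, b=-1.0):
    """Chain rule: d sigmoid/dg times dg/dx, i.e. s * (1 - s) times w."""
    s = sigmoid(w * x + b)
    return s * (1.0 - s) * w


x = 0.5
h = 1e-6
numeric = (node(x + h) - node(x - h)) / (2 * h)  # central difference
analytic = node_gradient(x)
print(numeric, analytic)
```

Back-propagation applies the same product of local derivatives node by node, from the output back to the inputs.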


### Stochastic Gradient Descent
_Gradient Descent with random sampling of a batch of data points [on which the error is computed]_
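
A minimal sketch of the idea, assuming a one-parameter model $y = w x$ trained with mean squared error. The function name `sgd_fit_slope`, the synthetic data, and the hyperparameters are all illustrative; each step estimates the gradient from a small random sample instead of the whole dataset.

```python
import random


def sgd_fit_slope(xs, ys, learning_rate=0.01, batch_size=4, steps=500, seed=0):
    """Fit y = w * x by stepping on the gradient of a random batch each time."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(steps):
        batch = rng.sample(indices, batch_size)  # random subset of the data
        # Gradient of the batch's mean squared error with respect to w.
        grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / batch_size
        w -= learning_rate * grad
    return w


xs = [float(i) for i in range(1, 11)]
ys = [2.0 * x for x in xs]  # synthetic data with true slope 2
slope = sgd_fit_slope(xs, ys)
print(slope)
```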



# TensorFlow

Hello world:

In [1]:
import tensorflow as tf

# Create TensorFlow object called hello_constant
hello_constant = tf.constant('Hello World!')

with tf.Session() as sess:
    # Run the tf.constant operation in the session
    output = sess.run(hello_constant)
    print(output)


b'Hello World!'


![image.png](attachment:image.png)
_Data is stored as n-dimensional tensors_


`tf.placeholder()` returns a tensor whose value is supplied at runtime through the `feed_dict` argument of `sess.run()`, so its size can adjust to the dataset.

In [2]:
x = tf.placeholder(tf.string)

with tf.Session() as sess:
    output = sess.run(x, feed_dict={x: 'Hello World'})

_The_ `tf.Variable` _class creates a tensor whose value can be modified, such as weights and biases._ Variables must be initialized before use:

`init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)`

`tf.truncated_normal()` initializes a tensor with random values from a normal distribution, redrawing any value that falls more than 2 standard deviations from the mean.

`tf.zeros()` initializes tensor to all zeroes

In [3]:
n_features = 120
n_labels = 5
weights = tf.Variable(tf.truncated_normal((n_features, n_labels)))

In [4]:
n_labels = 5
bias = tf.Variable(tf.zeros(n_labels))

### Multinomial Logistic Classification
`X -> Linear Model -> Logit -> Soft Max -> Predicted Probabilities, 1-Hot Encoded Actual Label -> Cross Entropy`

or 

`Linear Model(X) -> Soft Max(Logit) -> Cross Entropy(Predicted Probabilities, 1-Hot Encoded Actual Labels)`

**Training Loss**: Average Cross Entropy Across Whole Dataset

### Improving Stochastic Gradient Descent
#### Momentum
Running average of the gradient. Smooths out direction.
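
One common way to write this is as an exponential running average of the gradient, sketched below. The averaging factor `beta`, the toy loss $(x - 3)^2$, and the other values are assumptions; other formulations of momentum exist.

```python
def momentum_descent(gradient, start, learning_rate=0.1, beta=0.9, steps=500):
    """Gradient descent that steps along a running average of the gradient."""
    x = start
    velocity = 0.0
    for _ in range(steps):
        # Blend the previous direction with the new gradient.
        velocity = beta * velocity + (1 - beta) * gradient(x)
        x -= learning_rate * velocity
    return x


# Gradient of the toy loss (x - 3)^2, whose minimum is at x = 3.
momentum_result = momentum_descent(lambda x: 2 * (x - 3), start=0.0)
print(momentum_result)
```

Because each step blends in earlier gradients, a single noisy gradient changes the direction of travel only slightly.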

#### Learning Rate Decay
Reducing the learning rate as training progresses toward a minimum of the loss function.
E.g. exponential decay, reducing on plateau, etc.
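
Exponential decay can be sketched as a small schedule function. The name `exponential_decay`, its parameters, and the example values are assumptions for illustration; the rate is multiplied by `decay_rate` once every `decay_steps` training steps.

```python
def exponential_decay(initial_rate, decay_rate, decay_steps, step):
    """Learning rate after `step` training steps of smooth exponential decay."""
    return initial_rate * decay_rate ** (step / decay_steps)


# Hypothetical schedule: start at 0.1 and halve every 100 steps.
for step in (0, 100, 200, 300):
    print(step, exponential_decay(0.1, 0.5, 100, step))
```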

#### Adagrad
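An adaptive variant of SGD that scales each parameter's learning rate by its accumulated squared gradients. This decays the learning rate automatically and makes training less sensitive to the initial learning rate.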
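
The core Adagrad update for a single parameter can be sketched as follows. The toy loss $(x - 3)^2$, the hyperparameters, and the small `epsilon` that guards against division by zero are assumptions for illustration.

```python
import math


def adagrad(gradient, start, learning_rate=0.5, steps=500, epsilon=1e-8):
    """Divide each step by the root of the sum of all past squared gradients."""
    x = start
    accumulated = 0.0
    for _ in range(steps):
        g = gradient(x)
        accumulated += g * g  # grows over time, so the effective rate shrinks
        x -= learning_rate * g / (math.sqrt(accumulated) + epsilon)
    return x


# Gradient of the toy loss (x - 3)^2, whose minimum is at x = 3.
adagrad_result = adagrad(lambda x: 2 * (x - 3), start=0.0)
print(adagrad_result)
```

The very first step has size `learning_rate` regardless of the gradient's magnitude, which is one reason the method is less sensitive to how the rate is chosen.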

### Mini Batching
Helps handle large datasets that wouldn't fit in memory.
Per Epoch: Shuffle data, create mini-batches, train on each batch.

Random Batches -> SGD
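
The per-epoch steps can be sketched in plain Python. Both helpers below are hypothetical: `batches` is only one plausible shape for the `helper.batches` function imported later in these notes (its real implementation is not shown here), and `shuffle_together` is an illustrative name.

```python
import random


def shuffle_together(features, labels, seed=None):
    """Shuffle features and labels with the same random order."""
    order = list(range(len(features)))
    random.Random(seed).shuffle(order)
    return [features[i] for i in order], [labels[i] for i in order]


def batches(batch_size, features, labels):
    """Split features and labels into consecutive [features, labels] batches."""
    assert len(features) == len(labels)
    output = []
    for start in range(0, len(features), batch_size):
        end = start + batch_size
        output.append([features[start:end], labels[start:end]])
    return output


example_features = [[i, i + 1] for i in range(7)]
example_labels = [i % 2 for i in range(7)]

shuffled_x, shuffled_y = shuffle_together(example_features, example_labels, seed=0)
example_batches = batches(3, shuffled_x, shuffled_y)
print([len(batch_x) for batch_x, _ in example_batches])
```

The last batch may be smaller than `batch_size` when the dataset size is not an exact multiple of it.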

In [None]:
from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf
import numpy as np

n_input = 784  # MNIST data input (img shape: 28*28)
n_classes = 10  # MNIST total classes (0-9 digits)

# Import MNIST data
mnist = input_data.read_data_sets('/datasets/ud730/mnist', one_hot=True)

# The features are already scaled and the data is shuffled
train_features = mnist.train.images
test_features = mnist.test.images

train_labels = mnist.train.labels.astype(np.float32)
test_labels = mnist.test.labels.astype(np.float32)

# Weights & bias
weights = tf.Variable(tf.random_normal([n_input, n_classes]))
bias = tf.Variable(tf.random_normal([n_classes]))

In [6]:
# Features and Labels
# The None dimension is a placeholder for the batch size.
# At runtime, TensorFlow will accept any batch size greater than 0.
features = tf.placeholder(tf.float32, [None, n_input])
labels = tf.placeholder(tf.float32, [None, n_classes])

## Epochs
An epoch is a single forward and backward pass of the whole dataset.

In [None]:
from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf
import numpy as np
from helper import batches  # Helper function created in Mini-batching section


def print_epoch_stats(epoch_i, sess, last_features, last_labels):
    """
    Print cost and validation accuracy of an epoch
    """
    current_cost = sess.run(
        cost,
        feed_dict={features: last_features, labels: last_labels})
    valid_accuracy = sess.run(
        accuracy,
        feed_dict={features: valid_features, labels: valid_labels})
    print('Epoch: {:<4} - Cost: {:<8.3} Valid Accuracy: {:<5.3}'.format(
        epoch_i,
        current_cost,
        valid_accuracy))

n_input = 784  # MNIST data input (img shape: 28*28)
n_classes = 10  # MNIST total classes (0-9 digits)

# Import MNIST data
mnist = input_data.read_data_sets('/datasets/ud730/mnist', one_hot=True)

# The features are already scaled and the data is shuffled
train_features = mnist.train.images
valid_features = mnist.validation.images
test_features = mnist.test.images

train_labels = mnist.train.labels.astype(np.float32)
valid_labels = mnist.validation.labels.astype(np.float32)
test_labels = mnist.test.labels.astype(np.float32)

# Features and Labels
features = tf.placeholder(tf.float32, [None, n_input])
labels = tf.placeholder(tf.float32, [None, n_classes])

# Weights & bias
weights = tf.Variable(tf.random_normal([n_input, n_classes]))
bias = tf.Variable(tf.random_normal([n_classes]))

# Logits - xW + b
logits = tf.add(tf.matmul(features, weights), bias)

# Define loss and optimizer
learning_rate = tf.placeholder(tf.float32)
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(cost)

# Calculate accuracy
correct_prediction = tf.equal(tf.argmax(logits, 1), tf.argmax(labels, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

init = tf.global_variables_initializer()

batch_size = 128
epochs = 10
learn_rate = 0.001

train_batches = batches(batch_size, train_features, train_labels)

with tf.Session() as sess:
    sess.run(init)

    # Training cycle
    for epoch_i in range(epochs):

        # Loop over all batches
        for batch_features, batch_labels in train_batches:
            train_feed_dict = {
                features: batch_features,
                labels: batch_labels,
                learning_rate: learn_rate}
            sess.run(optimizer, feed_dict=train_feed_dict)

        # Print cost and validation accuracy of an epoch
        print_epoch_stats(epoch_i, sess, batch_features, batch_labels)

    # Calculate accuracy for test dataset
    test_accuracy = sess.run(
        accuracy,
        feed_dict={features: test_features, labels: test_labels})

print('Test Accuracy: {}'.format(test_accuracy))

# ...

# Convolutional Neural Networks

In [8]:
# Output depth
k_output = 64

# Image Properties
image_width = 10
image_height = 10
color_channels = 3

# Convolution filter
filter_size_width = 5
filter_size_height = 5

# Input/Image
input = tf.placeholder(
    tf.float32,
    shape=[None, image_height, image_width, color_channels])

# Weight and bias
weight = tf.Variable(tf.truncated_normal(
    [filter_size_height, filter_size_width, color_channels, k_output]))
bias = tf.Variable(tf.zeros(k_output))

# Apply Convolution
conv_layer = tf.nn.conv2d(input, weight, strides=[1, 2, 2, 1], padding='SAME')
# Add bias
conv_layer = tf.nn.bias_add(conv_layer, bias)
# Apply activation function
conv_layer = tf.nn.relu(conv_layer)
# Apply Max Pooling
conv_layer = tf.nn.max_pool(
    conv_layer,
    ksize=[1, 2, 2, 1],
    strides=[1, 2, 2, 1],
    padding='SAME')

The `tf.nn.max_pool()` function performs max pooling with the `ksize` parameter as the size of the filter and the `strides` parameter as the length of the stride. 2x2 filters with a stride of 2x2 are common in practice.

The `ksize` and `strides` parameters are structured as 4-element lists, with each element corresponding to a dimension of the input tensor (`[batch, height, width, channels]`). For both `ksize` and `strides`, the batch and channel dimensions are typically set to `1`.
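
To see how strides and padding affect the layer sizes in the cell above, the sketch below applies TensorFlow's documented output-size rules for `'SAME'` and `'VALID'` padding. The helper names are assumptions for illustration and are not part of the TensorFlow API.

```python
import math


def same_padding_output(size, stride):
    """Output height or width with padding='SAME': ceil(size / stride)."""
    return math.ceil(size / stride)


def valid_padding_output(size, filter_size, stride):
    """Output height or width with padding='VALID'."""
    return math.ceil((size - filter_size + 1) / stride)


image_size = 10   # the 10x10 input image from the cell above
k_output = 64     # output depth of the convolution

after_conv = same_padding_output(image_size, stride=2)  # conv2d, stride 2
after_pool = same_padding_output(after_conv, stride=2)  # max_pool, stride 2
print([after_conv, after_conv, k_output])
print([after_pool, after_pool, k_output])
```

Under these rules, the 10x10x3 input becomes 5x5x64 after the convolution and 3x3x64 after max pooling; the batch dimension is unchanged.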