# ACTIVATION FUNCTIONS

# What is an activation function?

* It's a function applied to the neurons in a layer during prediction (ex. `relu`)
* A small list of activations **account for the vast majority of activation needs**,   
  improvements on them have been minute in most cases
* An infinite number of functions could be used as activations,   
  but not every function can be used as an activation

### Several constraints on what makes a function an activation function:

**1.** Must be **continuous and infinite** in domain:   
   = has an output number for **_any_** input, no input x for which we can't compute an output (y)   
<br/>   

**2.** Must be **monotonic**, never changing direction:   
   = is 1:1, i.e. either **_always increasing_** or **_always decreasing_**   
   = can never have the same output value for multiple input values   
   = a function that changes direction can make learning harder if there are multiple right answers, multiple possible perfect weight configurations   
   SEE convex vs non-convex optimization      
<br/>   

**3.** Must be **nonlinear** (they squiggle and turn)   
<br/>   

**4.** Should be **efficiently computable** (and their derivatives):   
  = you'll be calling it a lot (sometimes billions of times)   
  = ex. `relu` is the most popular, it is ridiculously easy to compute at the expense of its expressiveness 
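
The monotonic and nonlinear constraints above can be spot-checked numerically. The sketch below is illustrative only: the helper names `is_monotonic` and `is_nonlinear` and the sample grid are assumptions, not part of these notes, and a finite grid can only suggest these properties, not prove them. Note that `relu` passes the monotonic check as non-decreasing (it is flat for negative inputs).

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def is_monotonic(f, xs):
    """True if f never changes direction on the sorted sample points xs."""
    diffs = np.diff(f(xs))
    return bool(np.all(diffs >= 0) or np.all(diffs <= 0))

def is_nonlinear(f, xs):
    """True if the slope between neighboring sample points is not constant."""
    slopes = np.diff(f(xs)) / np.diff(xs)
    return not np.allclose(slopes, slopes[0])

xs = np.linspace(-5, 5, 101)  # hypothetical sample grid
candidates = [
    ("relu", relu),
    ("sigmoid", sigmoid),
    ("tanh", np.tanh),
    ("x**2", np.square),        # changes direction at 0
    ("2*x", lambda x: 2 * x),   # monotonic but linear
]
for name, f in candidates:
    print(name, "monotonic:", is_monotonic(f, xs), "nonlinear:", is_nonlinear(f, xs))
```

On this grid, `x**2` fails the monotonic check and `2*x` fails the nonlinear check, while `relu`, `sigmoid` and `tanh` pass both.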


# Hidden layer activation functions

### sigmoid

* smoothly squishes the infinite amount of input to an output **between 0 and 1**
* you can interpret the output of an individual neuron as a probability
* thus people use this nonlinearity both in hidden and output layers

### tanh

* better than `sigmoid` for hidden layers
* smoothly squishes the infinite amount of input to an output **between -1 and 1**
* like `sigmoid`, gives varying degrees of positive correlation (see selective correlation),
* but because its range extends below 0, it can also throw in some *negative correlation*:
  * not useful for output layers unless predictions between -1 and 1
  * powerful for hidden layers
* outperforms `sigmoid` on many problems
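
To make the two output ranges concrete, here is a minimal sketch comparing `sigmoid` and `tanh` on a few made-up input values (the inputs are assumptions chosen for illustration). It also checks the standard identity that `tanh` is a rescaled, recentered `sigmoid`.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

xs = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])  # made-up sample inputs

sig = sigmoid(xs)   # always between 0 and 1
tan = np.tanh(xs)   # always between -1 and 1, negative for negative inputs

print("sigmoid:", np.round(sig, 3))
print("tanh:   ", np.round(tan, 3))

# tanh(x) equals 2 * sigmoid(2x) - 1
rescaled = 2 * sigmoid(2 * xs) - 1
print("identity holds:", np.allclose(tan, rescaled))
```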

# Output layer activation functions

* different than what's best for hidden layers, especially for classification
* 3 major types of output layers

### no activation: when predicting raw data values

* most straightforward but least common type of output layer
* range of output values is something other than a probability   
  ex. predicting average temperature in Paris given temperature in surrounding cities
* focus on ensuring that output nonlinearity can predict the right answers
* `sigmoid` or `tanh` would be inappropriate: they force predictions between 0 and 1, or -1 and 1   
  range of temperature values can be higher or lower

### sigmoid: when predicting unrelated yes/no probabilities

* multiple binary probabilities in one network (see "GD with multiple outputs")
* often when predicting one label, the network will learn something useful to one of the other labels
* `sigmoid` more appropriate because it models individual probabilities separately for each output node

### softmax: when predicting which-one probabilities

* by far the most common type of output layer: predicting a single label out of many
* could use `sigmoid` and declare that highest output probability is the most likely, will work reasonably well
* but it's far better with an activation function that models the idea that   
  "the more likely it's one label, the less likely it's any of the other labels"
* example:
  * raw dot product values: `0, 0, 0, 0, 0, 0, 0, 0, 0, 100` (mnist, output label is '9')
  * with `sigmoid`: `.50, .50, .50, .50, .50, .50, .50, .50, .50, .99`   
    network seems less sure that it's a 9, seems to think there's a 50% chance it could be any of the other digits
  * with `softmax`: `0, 0, 0, 0, 0, 0, 0, 0, 0, 1`   
    9 is the highest, and the network doesn't even suspect it's any of the other possible digits   
* this is a flaw of `sigmoid` that can have serious effects during backpropagation   
  example: mean squared error calculated on `sigmoid`'s output is `.25, .25, .25, .25, .25, .25, .25, .25, .25, .00`   
  weights would be massively updated even though the network predicted perfectly!   
  why? because for `sigmoid` to reach 0 error, it doesn't just have to predict the highest positive number for the true output, it also has to predict 0 everywhere else.      
<br/>   

* `softmax` = "which digit seems like the best fit for this input?"
* `sigmoid` = "you better believe that it's only digit 9 and doesn't have anything in common with the other digits"
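
The numbers in the example above can be reproduced with a short sketch. The raw values and one-hot target come from the example; subtracting the maximum inside `softmax` is an added assumption for numerical safety (it does not change the result).

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def softmax(x):
    # shifting by the max avoids overflow in exp and leaves the output unchanged
    temp = np.exp(x - np.max(x))
    return temp / np.sum(temp)

raw = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 100], dtype=float)   # raw dot products
target = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1], dtype=float)  # true label is '9'

sig_out = sigmoid(raw)
soft_out = softmax(raw)

# squared error per output node
sig_sq_err = (sig_out - target) ** 2
soft_sq_err = (soft_out - target) ** 2

print("sigmoid output:", np.round(sig_out, 2))
print("sigmoid error: ", np.round(sig_sq_err, 2))
print("softmax output:", np.round(soft_out, 2))
print("softmax error: ", np.round(soft_sq_err, 2))
```

With `sigmoid`, the nine wrong labels each sit at 0.5 and contribute a squared error of 0.25, even though the highest output is the correct one. With `softmax`, the same raw values give an error that is effectively zero everywhere.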
  

# Notes

## Different inputs share characteristics

* different inputs can have overlapping features (ex. digits '2' and '3' both have a top curve)
* as a general rule, similar inputs create similar outputs
* training with `sigmoid` would penalize the network for predicting a 2 based on this shared feature (top curve),  
  because it would be looking for the same feature it uses to detect 3s:   
  when the input is a 3, the 2 label would also get some probability (because of that top curve), and vice versa
* side effect: lots of images share pixels in the middle of images, so network tries to focus on the edges
* these can be the best **_individual indicators_** for a label
* but the best overall is a network that sees **_the entire shape_** for what it is
* individual indicators can be accidentally triggered by a different input
* network isn't learning the true essence of a label,   
  because it needs to learn 2 and *not 1*, *not 3*, *not 4*, etc.   
<br/>   

* Good activation doesn't penalize labels that are similar
* it pays attention to all information that can be indicative of any potential input
* `softmax` works better both in theory and practice

## softmax

* raises $e$ to the power of each input value $x$ ($e^x$) and then divides by the layer's sum
* notice: turns every prediction into a positive number,   
  **negative numbers** turn into **very small positive numbers** (between 0 and 1),   
  and **big numbers** turn into **very big numbers**
* next step: compute sum of the layer and divide each node's value by that sum,   
  which (in the mnist example above) makes every number effectively 0 except the value for the predicted label
* with `softmax`, the higher the network predicts one label, the lower it predicts all others   
  = increase the **_sharpness of attenuation_**: encourages the network to predict one output with very high probability
* to adjust how aggressively it does this, use numbers slightly higher or lower than $e$ when exponentiating:
  * lower numbers -> lower attenuation
  * higher numbers -> higher attenuation   
  * most people just stick with $e$
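
The effect of the exponentiation base can be sketched as follows. The function name `softmax_with_base`, the raw values, and the bases 2 and 4 are hypothetical choices for illustration. Shifting by the maximum is an added numerical-safety step that does not change the output.

```python
import numpy as np

def softmax_with_base(x, base=np.e):
    """Softmax that exponentiates with an arbitrary base instead of e."""
    shifted = x - np.max(x)          # numerical safety, output is unchanged
    temp = np.power(base, shifted)   # base ** x for every node
    return temp / np.sum(temp)       # divide by the layer's sum

raw = np.array([1.0, 2.0, 3.0])  # made-up raw predictions

p_low = softmax_with_base(raw, base=2.0)
p_e = softmax_with_base(raw)  # the usual softmax
p_high = softmax_with_base(raw, base=4.0)

for name, p in [("base 2", p_low), ("base e", p_e), ("base 4", p_high)]:
    print(name, np.round(p, 3))
```

The largest probability grows with the base: a lower base spreads probability more evenly, a higher base concentrates it on the top label.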

# Activation installation instructions

* Adding an activation function to a layer in forward propagation is relatively straightforward
* But properly compensating for the activation function in backpropagation is a bit more nuanced   

### In forward propagation

* Apply the function to each node **in the input** of the layer.
* The **_input to a layer_** is the value before nonlinearity:   
  in this example, the input to `layer_1` is `np.dot(layer_0, weights_0_1)`, not `layer_0` (the previous layer)
```
layer_0 = test_images[m:m+1]
layer_1 = relu(np.dot(layer_0, weights_0_1))
layer_2 = np.dot(layer_1, weights_1_2)
```

### In backpropagation

* When you backpropagate, in order to generate `layer_1_delta`:
  * multiply the backpropagated `delta` from `layer_2`: `layer_2_delta.dot(weights_1_2.T)`
  * by the slope of `relu` *at the point predicted in forward propagation*: `relu2deriv(layer_1)`
  ```
  layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)
  ```
  * `relu2deriv` takes the output of `relu` and calculates the slope of `relu` at that point   
<br/>   

* The slope is an indicator of how much a *tiny* change to the input affects the output   
  You want to modify the incoming `delta` (from the following layer)   
  to take into account whether a weight update before this node would have any effect
* Multiplying `delta` by the slope encourages the network to leave weights alone if adjusting them will have little to no effect   
  (the same reasoning applies to `sigmoid`)   
<br/>   
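
A minimal end-to-end sketch of the forward and backward steps described above. The input row, layer sizes, and `goal` value are made up for illustration; `relu` and `relu2deriv` are written the way these notes describe them (`relu2deriv` takes the output of `relu`).

```python
import numpy as np

def relu(x):
    return (x > 0) * x

def relu2deriv(output):
    # slope of relu, computed from relu's output: 1 where the node is on, else 0
    return output > 0

np.random.seed(0)
layer_0 = np.array([[1.0, 0.5, -0.2]])               # one made-up input row
weights_0_1 = 0.2 * np.random.random((3, 4)) - 0.1   # 3 inputs -> 4 hidden nodes
weights_1_2 = 0.2 * np.random.random((4, 1)) - 0.1   # 4 hidden nodes -> 1 output
goal = np.array([[1.0]])                             # made-up target

# forward propagation: relu is applied to the input of layer_1
layer_1 = relu(np.dot(layer_0, weights_0_1))
layer_2 = np.dot(layer_1, weights_1_2)

# backpropagation: scale the incoming delta by relu's slope at each hidden node
layer_2_delta = goal - layer_2
layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)

print("layer_1:      ", layer_1)
print("layer_1_delta:", layer_1_delta)
```

Hidden nodes that `relu` turned off get a `delta` of exactly 0, so the weights feeding them are left alone for this example.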

#### Multiplying the backpropagated delta by the layer's slope

* `delta` on a neuron tells the weights whether they should move or not:   
  if moving them has no effect, they should be left alone
* that's what `relu` does: either turn on or turn off
* `sigmoid`: more nuanced
  * its sensitivity to change in the input slowly increases as **input approaches 0 from either direction** (negative or positive) (see the sigmoid curve)
  * and **very negative and very positive** inputs approach a slope of very near 0
* so small changes to the incoming weights become **_less relevant_** to the neuron's error at this training example   
  = many hidden nodes are irrelevant to the accurate prediction of a particular label,   
  but we shouldn't mess with their weights because we could corrupt their usefulness elsewhere (predicting another digit)
* inversely, it also creates **_stickiness_**: 
  * weights previously updated a lot in one direction (for similar training examples) confidently predict a high or low value
  * these nonlinearities help make it harder for occasional erroneous training examples to corrupt intelligence that has been reinforced many times
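
The shape of `sigmoid`'s slope can be checked directly. This sketch uses made-up input values and a helper named `sigmoid2deriv` (a hypothetical name, following the `relu2deriv` naming) that computes the slope from the output as `output * (1 - output)`.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid2deriv(output):
    # slope of sigmoid expressed in terms of its own output
    return output * (1 - output)

inputs = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])  # made-up inputs
slopes = sigmoid2deriv(sigmoid(inputs))

for x, s in zip(inputs, slopes):
    print(f"input {x:6.1f} -> slope {s:.5f}")
```

The slope peaks at 0.25 for an input of 0 and shrinks toward 0 for very negative or very positive inputs, which is why a confidently high or low node barely passes `delta` back to its weights.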
    
#### Converting output to slope (derivative)

* most great activations can convert their output to their slope
* adding an activation to a layer (`relu`) changes how to compute `delta` for that layer (`relu2deriv`)
* unlike in calculus, most great activation functions use the **_output_** of the layer (at forward propagation) to compute the derivative  
<br/>  
* Functions and their derivatives:  
  * `input`: NumPy `ndarray`, the input to the layer (the value before the nonlinearity)
  * `output`: prediction
  * `deriv`: vector of activation derivatives, one per node (the slope of the activation at each node)
  * `target_pred`: vector of expected values (typically 1 for the correct label position, 0 everywhere else)   
<br/>   
  
<table style="width: 80%; align:center;">
    <tr>
        <th style="width: 20%; text-align: center;">Function</th>
        <th style="width: 40%; text-align: center;">Forward propagation</th>
        <th style="width: 40%; text-align: center;">Backpropagation delta</th>
    </tr>
    <tr>
        <td style="text-align: center">relu</td>
        <td>
            <p>ones_and_zeros = input > 0</p>
            <p>output = input * ones_and_zeros</p>
        </td>
        <td>
            <p>mask = output > 0</p>
            <p>deriv = mask</p>
        </td>
    </tr>
    <tr>
        <td style="text-align: center">sigmoid</td>
        <td>
            <p>output = 1 / (1 + np.exp(-input))</p>
        </td>
        <td>
            <p>deriv = output * (1 - output)</p>
        </td>
    </tr>
    <tr>
        <td style="text-align: center">tanh</td>
        <td>
            <p>output = np.tanh(input)</p>
        </td>
        <td>
            <p>deriv = 1 - (output ** 2)</p>
        </td>
    </tr>
    <tr>
        <td style="text-align: center">softmax</td>
        <td>
            <p>temp = np.exp(input)</p>
            <p>output = temp / np.sum(temp)</p>
        </td>
        <td>
            <p>temp = output - target_pred</p>
            <p>delta = temp / len(target_pred)</p>
        </td>
    </tr>
</table>
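
The output-to-slope conversions for `relu`, `sigmoid` and `tanh` can be sanity-checked against a numerical derivative. This is a sketch under assumptions: the helper `numerical_slope`, the sample points, and the tolerance are illustrative choices, and the sample points avoid 0, where `relu` has no single slope. `softmax` is left out because its entry in the table is a delta against a target rather than a per-node slope.

```python
import numpy as np

def relu(x):
    return (x > 0) * x

def relu2deriv(output):
    return (output > 0).astype(float)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid2deriv(output):
    return output * (1 - output)

def tanh2deriv(output):
    return 1 - output ** 2

def numerical_slope(f, x, eps=1e-6):
    """Central finite difference, used only as an independent check."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x = np.array([-2.0, -0.5, 0.5, 2.0])  # made-up inputs, none at 0

checks = [
    ("relu", relu, relu2deriv),
    ("sigmoid", sigmoid, sigmoid2deriv),
    ("tanh", np.tanh, tanh2deriv),
]
results = {}
for name, f, f2deriv in checks:
    slope_from_output = f2deriv(f(x))     # uses only the forward output
    slope_numeric = numerical_slope(f, x)
    results[name] = bool(np.allclose(slope_from_output, slope_numeric, atol=1e-5))
    print(name, "matches numerical slope:", results[name])
```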

# Example code

* `tanh` should be better for hidden layers   
  `softmax` should be better for output layers
* but things are not always that simple   

Adjustments to tune the network properly with these activations:
* for `tanh`: reduce the std dev of incoming weights:
  * when initializing weights, `np.random.random` creates a matrix with numbers between 0 and 1
  * by multiplying by 0.2 and subtracting 0.1, we **rescale** the random range between -0.1 and 0.1
  * worked great for `relu` but less optimal for `tanh`:
    * `tanh` needs narrower random initialization
    * so we adjust it to be between -0.01 and 0.01

* Removed `error` calculation because `softmax` is best used with the **_cross entropy_** error function   
  (seen later in the book)
* Much higher `alpha` was required to reach a good score within 300 iterations
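
The rescaling of the random initialization described above can be verified with a short sketch. The variable names `relu_style` and `tanh_style` are hypothetical; the scale and shift values are the ones given in these notes.

```python
import numpy as np

np.random.seed(1)
pixels_per_image, hidden_size = 784, 100

# np.random.random gives values in [0, 1); scale and shift to recenter around 0
relu_style = 0.2 * np.random.random((pixels_per_image, hidden_size)) - 0.1    # about -0.1 to 0.1
tanh_style = 0.02 * np.random.random((pixels_per_image, hidden_size)) - 0.01  # about -0.01 to 0.01

for name, w in [("relu-style", relu_style), ("tanh-style", tanh_style)]:
    print(name, "min", w.min(), "max", w.max(), "std", w.std())
```

The narrower range has one tenth the standard deviation, which is the reduction in incoming-weight spread that the `tanh` adjustment calls for.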

#### Loading data

```
import sys
import numpy as np

from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, y_train = x_train[0:1000], y_train[0:1000]

# IMAGES
train_images = x_train.reshape(1000, 28*28) / 255
test_images = x_test.reshape(len(x_test), 28*28) / 255

# LABELS: one_hot vectors
# label '4' = [0, 0, 0, 0, 1, 0, 0...]
train_labels = np.zeros((len(y_train), 10))
for i,l in enumerate(y_train):
    train_labels[i][l] = 1

test_labels = np.zeros((len(y_test), 10))
for i,l in enumerate(y_test):
    test_labels[i][l] = 1
```
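
As an aside, the one-hot loop can also be written as a single indexing operation. The helper name `one_hot` and the three sample labels are hypothetical; this is an equivalent sketch, not the approach used in the notebook.

```python
import numpy as np

def one_hot(labels, num_labels=10):
    """Pick row `l` of the identity matrix for every label `l`."""
    return np.eye(num_labels)[labels]

y = np.array([4, 0, 9])  # made-up labels
encoded = one_hot(y)

# the explicit loop, for comparison
looped = np.zeros((len(y), 10))
for i, l in enumerate(y):
    looped[i][l] = 1

print(encoded[0])  # label '4' has its 1 at index 4
print("same as loop:", np.array_equal(encoded, looped))
```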

#### Network parameters

```
np.random.seed(1)

# activations and their output-to-slope helpers
tanh = lambda x: np.tanh(x)
tanh2deriv = lambda output: 1 - (output ** 2)
def softmax(x):
    # row-wise softmax: each row (one image) sums to 1
    temp = np.exp(x)
    return temp / np.sum(temp, axis=1, keepdims=True)

alpha = 2
iterations = 701
hidden_size = 100
batch_size = 100

pixels_per_image = 784
num_labels = 10
num_train_images = len(train_images)
num_test_images = len(test_images)

# narrower range (-0.01 to 0.01) for the weights feeding tanh
weights_0_1 = 0.02 * np.random.random((pixels_per_image, hidden_size)) - 0.01
# range (-0.1 to 0.1) for the weights feeding softmax
weights_1_2 = 0.2 * np.random.random((hidden_size, num_labels)) - 0.1
```

#### Learning and predicting

```
for j in range(iterations):

    # Predicting on training data

    train_correct_cnt = 0

    for i in range(int(num_train_images / batch_size)):
        batch_start, batch_end = (i * batch_size), ((i+1) * batch_size)

        layer_0 = train_images[batch_start:batch_end]
        layer_1 = tanh(np.dot(layer_0, weights_0_1))
        # dropout: randomly turn off about half the hidden nodes, double the rest
        dropout_mask = np.random.randint(2, size=layer_1.shape)
        layer_1 *= dropout_mask * 2
        layer_2 = softmax(np.dot(layer_1, weights_1_2))

        for k in range(batch_size):
            train_correct_cnt += int( np.argmax(layer_2[k:k+1]) == \
                                      np.argmax(train_labels[batch_start+k:batch_start+k+1]) )
        train_acc = train_correct_cnt / float(num_train_images)

        # softmax output delta, then tanh's slope for the hidden layer
        layer_2_delta = (train_labels[batch_start:batch_end] - layer_2) / (batch_size * layer_2.shape[0])
        layer_1_delta = layer_2_delta.dot(weights_1_2.T) * tanh2deriv(layer_1)
        layer_1_delta *= dropout_mask

        weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
        weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)

    # Predicting on test data (no dropout)

    test_correct_cnt = 0

    for m in range(num_test_images):
        layer_0 = test_images[m:m+1]
        layer_1 = tanh(np.dot(layer_0, weights_0_1))
        # softmax is skipped here: it doesn't change which output is largest
        layer_2 = np.dot(layer_1, weights_1_2)

        test_correct_cnt += int( np.argmax(layer_2) == np.argmax(test_labels[m:m+1]) )
        test_acc = test_correct_cnt / float(num_test_images)

    if (j % 20 == 0) or (j == iterations-1):
        print('Iter ' + str(j) + ' Train-Acc ' + str(train_acc)[0:5] + ' Test-Acc ' + str(test_acc)[0:5])
```

```
Iter 0 Train-Acc 0.156 Test-Acc 0.394
Iter 20 Train-Acc 0.732 Test-Acc 0.702
Iter 40 Train-Acc 0.794 Test-Acc 0.766
Iter 60 Train-Acc 0.849 Test-Acc 0.810
Iter 80 Train-Acc 0.867 Test-Acc 0.831
Iter 100 Train-Acc 0.883 Test-Acc 0.840
Iter 120 Train-Acc 0.901 Test-Acc 0.848
Iter 140 Train-Acc 0.905 Test-Acc 0.852
Iter 160 Train-Acc 0.925 Test-Acc 0.857
Iter 180 Train-Acc 0.933 Test-Acc 0.861
Iter 200 Train-Acc 0.926 Test-Acc 0.864
Iter 220 Train-Acc 0.93 Test-Acc 0.866
Iter 240 Train-Acc 0.938 Test-Acc 0.868
Iter 260 Train-Acc 0.945 Test-Acc 0.868
Iter 280 Train-Acc 0.949 Test-Acc 0.869
Iter 300 Train-Acc 0.95 Test-Acc 0.871
Iter 320 Train-Acc 0.95 Test-Acc 0.872
Iter 340 Train-Acc 0.952 Test-Acc 0.873
Iter 360 Train-Acc 0.948 Test-Acc 0.872
Iter 380 Train-Acc 0.958 Test-Acc 0.874
Iter 400 Train-Acc 0.951 Test-Acc 0.873
Iter 420 Train-Acc 0.958 Test-Acc 0.875
Iter 440 Train-Acc 0.955 Test-Acc 0.877
Iter 460 Train-Acc 0.962 Test-Acc 0.878
Iter 480 Train-Acc 0.958 Test-Acc 0.878
Iter 500
```

```
from tensorflow.python.client import device_lib

def get_available_devices():
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos]

print(get_available_devices())
```

```
['/device:CPU:0', '/device:XLA_CPU:0']
```
