### Deep Learning

### Prompt:

What are your working definitions of Deep Learning?

### Useful Answer:  

A subset of machine learning that uses multiple layers of computation to extract abstract features from raw input.

### More Conversational Answer:

Doing fun things with Neural Networks.

#### What Is Deep Learning Meant To Do?

 - Works best with raw, unstructured data
 - Best at finding subtle, non-linear patterns in data that isn't clearly marked for human consumption
 - Can be very computationally expensive -- fitting times are often days/weeks long for final work
 - Primarily run on GPUs
 - Uses a different set of libraries to fit -- TensorFlow or PyTorch
 - Mostly used in classification for now

### So How Does Deep Learning Work?  

 1.  Take in raw data
 2.  Randomly initialize sets of weights for every single neuron within the network
 3.  Create a linear combination of your data and the weights of your first computation layer
 4.  Take the output from step 3, and pass it through a *non-linear activation function*
 5.  Take this data, and pass it through to your next layer, repeat steps 3 & 4
 6.  Proceed until there are no more units left, at which point you make your prediction
 7.  Use the error term in your prediction to update the values of your weights
 8.  Repeat steps 1-7 all over again
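
The eight steps above can be sketched in plain NumPy. This is a minimal illustration, not how you'd train a real network: the synthetic data, the layer sizes, the learning rate, and every variable name here are assumptions made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: take in raw data (synthetic here, purely for illustration)
X = rng.normal(size=(200, 4))
y = (X[:, 0] * X[:, 1] > 0).astype(float).reshape(-1, 1)  # a non-linear target

# Step 2: randomly initialize the weights of every neuron
W1, b1 = rng.normal(scale=0.5, size=(4, 8)), np.zeros(8)  # hidden layer, 8 neurons
W2, b2 = rng.normal(scale=0.5, size=(8, 1)), np.zeros(1)  # output layer, 1 neuron

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

learning_rate = 0.1
losses = []
for epoch in range(1000):  # Step 8: repeat
    # Steps 3-4: linear combination, then a non-linear activation (ReLU)
    hidden = np.maximum(0, X @ W1 + b1)
    # Steps 5-6: pass to the next layer and make the prediction
    prob = sigmoid(hidden @ W2 + b2)

    # Error term: average log loss of the predictions
    loss = -np.mean(y * np.log(prob + 1e-12) + (1 - y) * np.log(1 - prob + 1e-12))
    losses.append(loss)

    # Step 7: use the error to update the weights (gradient descent)
    d_out = (prob - y) / len(X)               # gradient at the output, before the sigmoid
    d_hidden = (d_out @ W2.T) * (hidden > 0)  # gradient pushed back through the ReLU
    grad_W2, grad_b2 = hidden.T @ d_out, d_out.sum(axis=0)
    grad_W1, grad_b1 = X.T @ d_hidden, d_hidden.sum(axis=0)
    W2 -= learning_rate * grad_W2
    b2 -= learning_rate * grad_b2
    W1 -= learning_rate * grad_W1
    b1 -= learning_rate * grad_b1

print(f"loss went from {losses[0]:.3f} to {losses[-1]:.3f}")
```

The only thing the loop guarantees is that the loss goes down over the rounds; how far it drops depends on the data and the settings above.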

### Neural Network Layers

 - Layers are the basic unit of neural network construction
 - Consist of 3 basic types:
  - **Input Layer**: Your dataset
  - **Output Layer**: Your set of probabilities for your prediction (i.e., sklearn's `predict_proba` method)
  - **Hidden Layer**: Intermediate layers of computation that are used to detect non-linear patterns within your data
  
**Somewhat Interesting Note**:  A neural network without a hidden layer is just a linear model!  You can think of Logistic Regression as the simplest possible implementation of a neural network.
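
A quick way to convince yourself of this: a "network" with no hidden layer and a sigmoid output has log-odds that are exactly linear in the inputs, which is the defining property of logistic regression. The data and weights below are made up for the check.

```python
import numpy as np

rng = np.random.default_rng(0)
X_small = rng.normal(size=(5, 3))   # 5 rows, 3 features (illustrative)
weights = rng.normal(size=3)
bias = 0.2

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# input layer -> linear combination -> sigmoid output, no hidden layer
network_output = sigmoid(X_small @ weights + bias)

# convert the probabilities back to log-odds
log_odds = np.log(network_output / (1 - network_output))

# the log-odds are just the linear model again
print(np.allclose(log_odds, X_small @ weights + bias))
```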

### Case Study: A Simple 3 Layer Neural Network That Predicts A Binary Outcome

### Step 1:  Creating the Input Layer

 - **Clue:**  THIS IS ALWAYS YOUR DATASET!
 - If it is multidimensional, then you'll have to flatten it before feeding it to something else
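
As a small sketch of what flattening means, here is a hypothetical batch of 28 x 28 images being reshaped so that every image becomes a single row. The array is all zeros and only there to show the shapes.

```python
import numpy as np

# hypothetical batch: 100 grayscale images, 28 x 28 pixels each
images = np.zeros((100, 28, 28))

# keep one row per image, and unroll the remaining dimensions into columns
flat_images = images.reshape(len(images), -1)

print(images.shape, "->", flat_images.shape)
```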

In [56]:
# let's create some synthetic data
import numpy as np
X = np.random.normal(loc=0.0, scale=.1, size=10000).reshape(1000, 10)

In [57]:
# we have a data set with 1000 rows + 10 columns -- our input layer
X.shape

(1000, 10)

### The Hidden Layer(s) 

 - Basically what makes a neural network a neural network
 - A neural network with at least 2 hidden layers is 'Deep Learning'
 - Are what allow the network to pick up the subtle non-linear patterns in your data

In [62]:
# let's pretend that we have a linear model
coef_      = np.random.random(10)
intercept_ = 0.20

In [63]:
# we could make our predictions like this:
preds = X.dot(coef_) + intercept_
preds

array([ 4.14117135e-01,  1.89116614e-01,  2.54843852e-01,  2.73971715e-01,
        3.63089909e-01,  2.01218408e-01,  3.81355101e-01,  2.23998936e-01,
        2.68179686e-01,  3.41797375e-01,  9.80392422e-02,  2.27624378e-01,
        3.60333655e-01,  4.14284310e-01,  7.79933528e-02,  1.20234819e-02,
        1.66830153e-01,  1.63099871e-01,  1.51837473e-01,  3.24021886e-01,
        1.37006318e-01,  6.75806462e-02,  3.88265067e-01,  1.29081324e-01,
        2.08381632e-01,  1.48597799e-01,  2.15641003e-01,  1.47088298e-01,
        2.62606244e-01,  1.17455772e-01, -3.58945318e-04,  3.11797990e-01,
        3.66925873e-01,  2.11866187e-01,  1.10630074e-02,  1.12665875e-01,
        1.76394002e-02,  1.24263567e-01,  8.62156904e-02,  3.09995493e-01,
        3.57731275e-01,  2.45099897e-01,  4.66268447e-01,  3.69762324e-02,
        1.87907902e-01,  2.61121351e-01,  1.16979201e-01,  1.93576857e-01,
        3.18096881e-01,  3.04810542e-01,  2.20135477e-01,  1.51586083e-01,
        1.48314209e-01, ...])

In [64]:
# our predictions are a 1-D vector with 1000 values, one per row of X
preds.shape

(1000,)

In [65]:
# now, let's try a slightly different twist
coef_ = np.random.random(20).reshape(10, 2)
coef_

array([[0.69289092, 0.60135457],
       [0.78841963, 0.33034955],
       [0.42753444, 0.24694184],
       [0.63211957, 0.81858974],
       [0.75048724, 0.03439717],
       [0.6276052 , 0.18382032],
       [0.96199689, 0.35972355],
       [0.89633581, 0.8763487 ],
       [0.78989909, 0.38946304],
       [0.97913371, 0.4550744 ]])

In [66]:
# now run through the formula again
preds = X.dot(coef_) + intercept_
preds

array([[ 0.61194202,  0.37124985],
       [ 0.33434484,  0.1353811 ],
       [ 0.24539244,  0.26576138],
       ...,
       [ 0.5093821 ,  0.45236435],
       [-0.10534285, -0.00812207],
       [ 0.42219637,  0.17154141]])

In [67]:
# our predictions now have a second column to them (Note -- these are NOT our actual predictions)
preds.shape

(1000, 2)

### Key Point:  For every single column we add to coef_, we get an additional column of output in our predictions

 - Assuming this represents our hidden layer, each additional column is referred to as a *neuron*

In [68]:
# as an example, to add an additional neuron to our hidden layer
coef_ = np.random.random(30).reshape(10, 3)
preds = X.dot(coef_) + intercept_
# this would be a hidden layer with 3 neurons
preds.shape

(1000, 3)

### Question:  Is Our Data Currently A Linear or Non-Linear Representation of Our Input?

 - What if we just did this same process with another hidden layer?  Would anything change?
 - Can you derive non-linearity from a combination of strictly linear hidden units?
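
One way to explore these questions numerically is to stack two strictly linear layers and check whether a single linear layer can reproduce them. The sketch below uses made-up weights (`W1`, `b1`, `W2`, `b2` are illustrative names, not part of the lab).

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=0.0, scale=0.1, size=(1000, 10))

# two "hidden layers" with no activation function in between
W1, b1 = rng.random((10, 3)), np.full(3, 0.20)
W2, b2 = rng.random((3, 2)), np.full(2, 0.10)
two_layers = (X_demo @ W1 + b1) @ W2 + b2

# a single linear layer built by multiplying the weights together
W_combined = W1 @ W2
b_combined = b1 @ W2 + b2
one_layer = X_demo @ W_combined + b_combined

# if this prints True, the second layer added nothing a single linear model couldn't do
print(np.allclose(two_layers, one_layer))
```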

### Activation Functions

 - Used at each non-input layer to create a non-linear transformation from the previous step
 - Allow us to derive the non-linear patterns inherent within our data
 - Different activation functions are used at different points within a neural network:
     - **hidden layers:** ReLU, others
     - **output layer:** Sigmoid, Softmax, others -- used to make your final prediction
 - **Key Point**: activation functions in hidden units are meant to be 'gentle', and introduce non-linearity slowly; output activation functions are much stronger, and are used to assign a row to its most likely value
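
For reference, here is a minimal NumPy sketch of the three activation functions named above. These are hand-written versions for illustration; the function names are my own, and in practice you'd use the ones that ship with your framework.

```python
import numpy as np

def relu(z):
    # hidden layers: zero out the negatives, leave everything else alone
    return np.maximum(0, z)

def sigmoid(z):
    # output layer: squash each value independently into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # output layer: turn each row into probabilities that sum to 1
    shifted = z - z.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

z = np.array([[-2.0, 0.0, 3.0]])
print(relu(z))
print(sigmoid(z))
print(softmax(z))
```

Note the difference between the two output activations: sigmoid treats every column on its own, while softmax makes the columns compete with each other.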

### ReLU Activation Function

 - stands for Rectified Linear Unit
 - most commonly used activation function for hidden layers

**Generic Function**:  

$$ \max(0, x) $$

In [72]:
hidden_output = np.maximum(0, preds)
hidden_output[:20]

array([[0.41837907, 0.38371288, 0.41460791],
       [0.4172674 , 0.24030903, 0.27089342],
       [0.19979873, 0.34171297, 0.1055203 ],
       [0.14964119, 0.16027052, 0.01891691],
       [0.37383954, 0.39087638, 0.27395295],
       [0.08111006, 0.3363707 , 0.        ],
       [0.44170424, 0.47014229, 0.33048733],
       [0.10739369, 0.16921601, 0.08830669],
       [0.21141957, 0.40442362, 0.21960638],
       [0.22938957, 0.33537887, 0.35931258],
       [0.11231355, 0.21665335, 0.24263517],
       [0.23016502, 0.14222932, 0.26784108],
       [0.50262578, 0.22022004, 0.41847515],
       [0.15502927, 0.2050146 , 0.20231301],
       [0.10919169, 0.21469923, 0.17217959],
       [0.        , 0.0763882 , 0.03615379],
       [0.23382187, 0.22899891, 0.24760317],
       [0.18286833, 0.21654106, 0.15020049],
       [0.28907719, 0.41671786, 0.33870905],
       [0.17749718, 0.4794991 , 0.24994409]])

### Question Prompts:

 - Have we radically changed the shape of our data?
 - What's the benefit of this approach?

### The Output Layer

 - Where you actually make your predictions
 - Is essentially the same output as `predict_proba` in scikit-learn
 - Will have as many columns as there are unique categories that you're trying to predict
 - Is created essentially the same way as the hidden layer: randomly generate weights, then take a linear combination with the previous layer's output

### Prompt:

We want to predict a binary outcome.  What should the dimensions of our output layer be?

In [81]:
output_coef_ = np.random.normal(0, 0.1, 6).reshape(3, 2)

In [82]:
output_coef_

array([[-0.16436953, -0.10568609],
       [ 0.05983523, -0.07302869],
       [-0.01900292,  0.21000109]])

In [86]:
# we'll now use this to create our final predictions -- for this round
final_output = hidden_output.dot(output_coef_)

In [87]:
# we now have a 1000 x 2 array with our output
final_output.shape

(1000, 2)

In [88]:
# it gives us this
final_output[:10]

array([[-0.05368798,  0.01482922],
       [-0.05935487, -0.0047609 ],
       [-0.01439954, -0.02391142],
       [-0.0153661 , -0.02354677],
       [-0.04326556, -0.01052441],
       [ 0.0067948 , -0.03313692],
       [-0.05075187, -0.01161317],
       [-0.00920526, -0.00516314],
       [-0.01472532, -0.00576105],
       [-0.02446517,  0.02672047]])

In [95]:
# and now we provide our final activation function -- the sigmoid!
from scipy.special import expit
# expit is another name for the sigmoid function
predict_proba = expit(final_output)
predict_proba[:10]

array([[0.48658123, 0.50370724],
       [0.48516564, 0.49880978],
       [0.49640018, 0.49402243],
       [0.49615855, 0.49411358],
       [0.4891853 , 0.49736892],
       [0.50169869, 0.49171653],
       [0.48731476, 0.49709674],
       [0.4976987 , 0.49870922],
       [0.49631874, 0.49855974],
       [0.49388401, 0.50667972]])

In [94]:
# and get our final predictions
final_predictions = np.argmax(predict_proba, axis=1)

In [104]:
final_predictions[:10]

array([1, 1, 0, 0, 1, 0, 1, 1, 1, 1], dtype=int64)

### Some Important Notes:  

 - What we just did is called *forward propagation*
 - This is basically how you churn through layers of computation to get predictions in a neural network
 - You go through this process multiple times when fitting a neural network, iteratively updating the weights after each round
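
To make the idea concrete, the whole forward pass from this case study can be wrapped in one function. This is just a sketch: `forward` and its argument names are my own, the weights are random, and `sigmoid` is written out by hand (it computes the same thing as `scipy.special.expit`).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, hidden_coef, hidden_intercept, output_coef):
    """One round of forward propagation through a single hidden layer."""
    # hidden layer: linear combination + ReLU
    hidden_output = np.maximum(0, X.dot(hidden_coef) + hidden_intercept)
    # output layer: linear combination + sigmoid
    predict_proba = sigmoid(hidden_output.dot(output_coef))
    # final prediction: the column with the highest value
    return np.argmax(predict_proba, axis=1), predict_proba

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=0.0, scale=0.1, size=(1000, 10))
labels, probs = forward(
    X_demo,
    hidden_coef=rng.random((10, 3)),         # 3 hidden neurons
    hidden_intercept=0.20,
    output_coef=rng.normal(0, 0.1, (3, 2)),  # 2 output columns
)
print(labels.shape, probs.shape)
```

Training is then a matter of calling something like `forward` over and over, with the weights nudged between calls.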

### Lab Wrap Up:

 - forward propagation is the basic machinery of how a neural network makes its predictions
 - the presence of hidden layers + non-linear activation functions allows a neural network to go beyond linear models and tease out hidden patterns in raw data
 - when training a neural network, you go through many rounds (typically 10-50) of forward propagation, updating the weights after each round, to converge on the appropriate weights for each variable

### Additional Questions:

 - how many hidden layers do you add? how many neurons?
 - how exactly are the weights updated?  
 - how does one implement neural networks in practice?

#### On Hidden Layers.......

 - The universal approximation theorem shows that a neural network *with just one hidden layer* (and enough neurons) can approximate any continuous function on a bounded input range......although this is not always the fastest way to do things.
 - In practice it's usually best to keep the number of hidden layers to a minimum, to avoid overfitting
 - When a neural network has 2 or more hidden layers, this is typically where Deep Learning starts.
 - A lot of research has been done on different types of layers to use in a neural network, which has expanded the number of hidden layers that can feasibly be added to a neural network for various tasks

### Deep Learning Frameworks

 - Two major ones are TensorFlow & PyTorch
 - Developed by Google & Facebook, respectively
 - TensorFlow tends to be used more for production, PyTorch for prototyping & development
 - TensorFlow has a high-level API called Keras, which is very easy to use

### Keras

 - wrapper built around TensorFlow to make it easy to construct neural networks
 - essentially a connector set for Neural Network layers

In [None]:
from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow as tf
from tensorflow import keras
import numpy as np
import matplotlib.pyplot as plt

In [None]:
mnist = keras.datasets.mnist
(train_img, train_label), (test_img, test_label) = mnist.load_data()

In [None]:
# scale your pixel values from 0-255 down to the range 0-1
train_img = train_img / 255.0

test_img = test_img / 255.0

In [None]:
# this is the equivalent of what we just created in the previous lab
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(10, activation=tf.nn.relu),
    keras.layers.Dense(5, activation=tf.nn.relu),
    keras.layers.Dense(10, activation=tf.nn.sigmoid)
])

In [None]:
model.compile(optimizer='sgd', 
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

### Compiling your model

 - optimizer: strategy you use to update your weights
 - loss: loss function you use to optimize when training
     - 'sparse_categorical_crossentropy': use this when your labels are encoded as integers 0, 1, 2, 3, 4, etc (most common)
 - metrics: metric you use to score your model
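
To see what the 'sparse' part of the loss means, here is a hand-rolled NumPy version of cross-entropy computed two ways: straight from integer labels, and from the same labels one-hot encoded. This is an illustrative re-implementation (the function name `sparse_cross_entropy` and the toy numbers are mine), not the Keras internals.

```python
import numpy as np

def sparse_cross_entropy(labels, probs):
    # pick out the predicted probability of the true class for each row
    picked = probs[np.arange(len(labels)), labels]
    return -np.log(picked).mean()

labels = np.array([0, 2, 1])        # integer-encoded classes
probs = np.array([[0.7, 0.2, 0.1],  # made-up predicted probabilities
                  [0.1, 0.1, 0.8],
                  [0.3, 0.4, 0.3]])

# the same loss, but starting from one-hot encoded labels
one_hot = np.eye(3)[labels]
dense_loss = -(one_hot * np.log(probs)).sum(axis=1).mean()

print(sparse_cross_entropy(labels, probs), dense_loss)
```

Both numbers come out the same; the sparse version just saves you from building the one-hot matrix.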

#### Optimizers:

 - Technique you use for updating your weights
 - All built around the same basic concept:  gradient descent
 - Common choices:
  - **sgd**: stochastic gradient descent (most common)
  - **adam**: updated version of sgd, formulated in 2015.....adapts the step size for each weight as training moves on
 - Most common parameter to play around with:  **the learning rate**
 - Typical values will be anywhere from 0.0001 to 0.3

### The Learning Rate:  The Size of the Step You Take When Updating Your Weights

 - Updating weights in a neural network is based off of the derivative of your cost function
 - The learning rate is the size of the 'step' that you take in the direction opposite to the derivative (ie, downhill on the cost function)
 - Essentially:
  - a larger learning rate will converge faster, but potentially 'skip' over more ideal versions of your weights
  - a smaller one will converge more slowly, and can potentially get stuck in local minima
 - Essentially a trade-off between training speed and stability, which makes it one of the first parameters worth tuning
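
Here's a toy sketch of how the step size plays out. It runs plain gradient descent on a made-up one-weight cost function, `(w - 3) ** 2`, whose best value is `w = 3`. The function name and the three learning rates are just for illustration, and since this cost has a single minimum it can't show the local-minima issue.

```python
def gradient_descent(learning_rate, steps=25, start=0.0):
    w = start
    for _ in range(steps):
        gradient = 2 * (w - 3)             # derivative of (w - 3) ** 2
        w = w - learning_rate * gradient   # step against the derivative
    return w

for lr in (0.01, 0.1, 1.1):
    print(f"learning rate {lr}: w = {gradient_descent(lr):.3f}")
```

With the same number of steps, the smallest rate is still far from 3, the middle one lands very close to it, and the largest one overshoots further on every step and never settles.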

In [None]:
# you would change the parameters of your optimizer in the following way
sgd = keras.optimizers.SGD(learning_rate=0.01, momentum=0.0, nesterov=False)
# to actually use it, pass optimizer=sgd (instead of 'sgd') to model.compile

In [None]:
# let's fit our model - epochs is the number of full passes through the training data
model.fit(train_img, train_label, epochs=10)

In [None]:
# and finally - score our model
test_loss, test_acc = model.evaluate(test_img, test_label)

### Lab:  The Fashion MNIST Dataset

In [None]:
# load in the dataset this way
mnist = keras.datasets.fashion_mnist
(train_img, train_label), (test_img, test_label) = mnist.load_data()

### Step 1:  
 - Go ahead and build your model using the *same* parameters that we had before:
  - Flattened input layer
  - Hidden layers with 10 & 5 neurons + ReLU activation function
  - Output layer with 10 neurons + sigmoid activation
  
What are our results?

### Step 2:  Can you get to ~90% accuracy?

Try getting better results by adjusting the following parameters:

 - The number of neurons in your hidden layers
 - Switching the final activation function from sigmoid to softmax
 - Increasing or decreasing your learning rate
 - Changing the number of epochs
 
Take 20 minutes

### Basic Rules For Hidden Layers/Neurons:

 - If the sole purpose of a hidden layer is for non-linear computation, usually no more than 2 is needed
 - 1 is often sufficient
 - Adding neurons is typically better for teasing out non-linear boundaries for one task or another
 - number of neurons:
  - A good starting point is a number somewhere between the number of columns in your input layer and the number in your output layer
  - Can typically adjust depending on whether or not your model is overfitting or underfitting
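
When you're adjusting neuron counts, it helps to know how many weights you're actually asking the network to fit. A small sketch for dense layers, using the 784-10-5-10 layout of the Keras model from earlier (the helper name `count_dense_parameters` is my own):

```python
def count_dense_parameters(layer_sizes):
    """Total weights + biases for a stack of fully connected layers."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out  # one weight per connection, one bias per neuron
    return total

# flattened 28 x 28 input -> 10 neurons -> 5 neurons -> 10 outputs
print(count_dense_parameters([784, 10, 5, 10]))  # 7965
```

Almost all of those parameters sit in the first layer, which is why widening the first hidden layer grows the model much faster than widening the later ones.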

### Basic Rules for Training Neural Networks

 - small learning rate + lots of hidden units is the safest bet for working with complicated data
 - you can also use regularization to curb overfitting
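 - you rarely use traditional k-fold cross validation, since each fit is so expensive.  You typically hold out a single validation set instead, and use drop out as a regularizer to curb overfitting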
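
For a sense of what drop out does mechanically, here is a minimal NumPy sketch of the common 'inverted' variant: during training a random share of a layer's outputs is zeroed and the survivors are scaled up, and at prediction time nothing is touched. The `dropout` helper and the 20% rate are assumptions for illustration, not any framework's API.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    if not training or rate == 0.0:
        return activations  # at prediction time, every neuron is kept
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob  # True = keep this value
    # scale the survivors so the expected output stays the same
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
hidden_demo = np.ones((1000, 50))  # stand-in for a hidden layer's output
dropped = dropout(hidden_demo, rate=0.2, rng=rng)

print("share zeroed:", (dropped == 0).mean())
print("mean after drop out:", dropped.mean())
```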

### Layers You Can Add to A Neural Network:

 - **Dense**:  Each input unit is connected to every single output unit, i.e., a linear combination.  This is what we've been doing so far.  These are the most straightforward.
 - **Convolutional**:  Best for processing images.  Segments your data into smaller chunks when forming connections.
 - **Recurrent**:   Maintain information about previous inputs in a sequence, through a hidden state that is carried from one step to the next.  Therefore have a notion of sequence.  Useful for data where the order of items matters.  Language processing especially.
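
To show what "a notion of sequence" looks like in code, here is a bare-bones recurrent layer in NumPy. It's a simplified sketch with random weights and made-up sizes (a plain tanh cell, not what you'd get from any particular framework layer): the state from one step is fed back in at the next, so feeding the same items in a different order gives a different result.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden = 4, 3
W_input = rng.normal(scale=0.1, size=(n_features, n_hidden))  # weights on the current item
W_state = rng.normal(scale=0.1, size=(n_hidden, n_hidden))    # weights on the previous state
bias = np.zeros(n_hidden)

def run_recurrent_layer(sequence):
    state = np.zeros(n_hidden)  # nothing has been seen yet
    for x_t in sequence:
        # combine the current item with what was remembered from earlier items
        state = np.tanh(x_t @ W_input + state @ W_state + bias)
    return state

sequence = rng.normal(size=(5, n_features))  # 5 items, in order
forward_state = run_recurrent_layer(sequence)
reversed_state = run_recurrent_layer(sequence[::-1])

# same items, different order, different final state
print(np.allclose(forward_state, reversed_state))
```

A dense layer fed the same five items one at a time would give each item the same output no matter where it appeared in the sequence.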