Latest innovations: speech recognition, machine translation, transformers>embeddings  

# 1. BUILDING

### Shallow NN
1 hidden layer (logistic regression is the extreme case: no hidden layer)   

### Deep NN
more than 1 hidden layer  
first layers: learn low level features (common to all related tasks) -> simple tasks   
last layers: learn high level features -> complex tasks  
ex: audio signal -> phonemes -> words -> sentences   

<img src="images/nn-basic-architecture.png" width="650">



### Standard NNs 

### Convolutional NNs  

### Recurrent NNs  

### LSTM

- Long Short-Term Memory
- Type of RNN model that avoids the vanishing gradient problem by adding 'forget' gates.


### Structured data
supervised learning  

### Unstructured data
audio, images  


### Logistic Regression
binary classification   


### Linear Regression  


### Deep Learning
- Learn the parameters that minimize the error 


### Training in Deep Learning  
  Adjust the parameters (w, b...) in order to minimize the cost function  
- Adjust each weight in the **correct direction** by the **correct amount** so the **error decreases toward its minimum** (ideally close to 0)


### Parameters     
  - weight (w)  
  - bias (b)  


### Hyperparameters   
  (see course 2 week 3 Tuning process: order of importance)
  they control the parameters and determine their final value   
  - learning rate (`alpha`)   
    - if alpha too large, we may overshoot the optimal value  
    - if alpha too small, too many iterations to converge  
    tuning = plot the curve with various values  
    **See** ai specialization course 1 'w2-002-logistic-regression-with-nn'  
  - regularization parameter (`lambda`)  
    tuning = use the dev set   
  - number of iterations   
  - number of hidden layers
  - number of hidden units
  - choice of activation functions
  - momentum (`beta`)
  - mini-batch size
  - `beta1`, `beta2`, `epsilon` in Adam optim algo
  - learning rate decay


### Learning rate
- Often noted `α` or sometimes `η`; indicates the pace at which the weights get updated
- Can be fixed or adaptively changed
- Methods:
  - Adam (currently the most popular; adapts the learning rate)
  - Momentum
  - RMSProp
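
A minimal sketch of how the pace of the updates matters, assuming a toy cost `J(w) = w**2` (gradient `2*w`) and a fixed learning rate. The function name `run_gradient_descent` and the three alpha values are illustrative, not taken from the course.

```python
def run_gradient_descent(alpha, w0=1.0, steps=20):
    """Minimize the toy cost J(w) = w**2 with a fixed learning rate alpha."""
    w = w0
    for _ in range(steps):
        grad = 2 * w            # dJ/dw for J(w) = w**2
        w = w - alpha * grad    # gradient descent update
    return w

# Too small: barely moves. Reasonable: close to the minimum at 0. Too large: overshoots and diverges.
for alpha in (0.001, 0.1, 1.1):
    print(alpha, run_gradient_descent(alpha))
```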
  
<img src="images/optimization-adaptive-learning-rates.png" width="650">


### Hyperparameters tuning     
deep learning = empirical process   
trying different values of hyperparameters and plotting the cost J against the number of iterations to find the best curve    
best values for a given problem can change over time (hardware, data updates...)   


### Loss function: 
cost for one example  
- **L1** loss function  
- **L2** loss function 
 
### Cost function: 
  average of the loss function over all examples = `J(w,b)` (plotted as a surface over w and b)  
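
To make the loss/cost distinction concrete, here is a small NumPy sketch, assuming the binary cross-entropy loss and made-up predictions. The names `cross_entropy_loss` and `cost` are illustrative.

```python
import numpy as np

def cross_entropy_loss(y_hat, y):
    """Loss: one value per example."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cost(y_hat, y):
    """Cost J: average of the per-example losses."""
    return np.mean(cross_entropy_loss(y_hat, y))

# Made-up predictions and labels for 4 examples
y_hat = np.array([0.9, 0.2, 0.7, 0.4])
y = np.array([1, 0, 1, 0])

losses = cross_entropy_loss(y_hat, y)   # shape (4,): one loss per example
J = cost(y_hat, y)                      # scalar: the cost
print(losses, J)
```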

<img src="images/loss-function-cross-entropy.png" width="650">



### Gradient Descent algorithm: 
update w and b  
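
A minimal sketch of one update step, assuming the gradients `dw` and `db` were already computed by backpropagation. The function name `gradient_descent_step` and the sample values are made up for illustration.

```python
import numpy as np

def gradient_descent_step(w, b, dw, db, alpha=0.01):
    """Move each parameter against its gradient, scaled by the learning rate alpha."""
    w = w - alpha * dw
    b = b - alpha * db
    return w, b

w = np.array([[0.5], [-0.3]])
b = 0.1
dw = np.array([[0.2], [-0.4]])   # made-up gradients
db = 0.05

w, b = gradient_descent_step(w, b, dw, db, alpha=0.1)
print(w.ravel(), b)
```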


###  Derivative/Slope 
  slope = height/width (called a partial derivative when J is a function of 2+ args)   
  for any value of a, if you nudge it by a certain amount, the slope will be ~
  Derivatives at each step are used to compute derivatives at following steps  


### Computation Graph: 
  organizes the computations:  
  1. Initialize parameters / Define hyperparameters: nn architecture  
  2. Loop for num_iterations:   
    a. Forward propagation: compute output (linear->activation)   
    b. Compute cost function   
    c. Backward propagation: compute gradients/derivatives using cost   
    d. Update parameters: gradient descent, using parameters, and grads from backprop    
  3. Predict: use trained parameters to predict labels  
  **See** ai specialization course 1 'w2-002-logistic-regression-with-nn' + exercises c1w4
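
The steps above can be sketched as a small logistic regression training loop. This is only an illustration under stated assumptions: the toy data, the learning rate and the number of iterations are made up, and this is not the course notebook's code.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Toy data (assumption): 2 features, 4 examples, label is 1 when the first feature is positive
X = np.array([[1.0, 2.0, -1.0, -2.0],
              [0.5, -0.5, 0.3, -0.3]])   # shape (n_x, m)
Y = np.array([[1, 1, 0, 0]])              # shape (1, m)
m = X.shape[1]

# Initialize parameters / define hyperparameters
w = np.zeros((X.shape[0], 1))
b = 0.0
alpha, num_iterations = 0.1, 1000
costs = []

for i in range(num_iterations):
    # Forward propagation: linear -> activation
    A = sigmoid(np.dot(w.T, X) + b)
    # Compute cost function (cross-entropy averaged over the m examples)
    cost = -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    costs.append(cost)
    # Backward propagation: gradients of the cost
    dZ = A - Y
    dw = np.dot(X, dZ.T) / m
    db = np.sum(dZ) / m
    # Update parameters: gradient descent
    w = w - alpha * dw
    b = b - alpha * db

# Predict: use the trained parameters to predict labels
predictions = (sigmoid(np.dot(w.T, X) + b) > 0.5).astype(int)
print(costs[0], costs[-1], predictions)
```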


### Chain rule  
  in backpropagation, if we change a, we change z, so we change J  
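
A tiny numerical check of this idea, assuming a made-up graph `a -> z = w*a + b -> J = (z - y)**2`. The chain rule gives `dJ/da = dJ/dz * dz/da`, and nudging `a` by a small amount should give roughly the same slope.

```python
# Made-up constants for the toy graph
w, b, y = 3.0, 1.0, 2.0

def z_of(a):
    return w * a + b

def J_of(a):
    return (z_of(a) - y) ** 2

a = 0.5
z = z_of(a)

# Chain rule: dJ/da = dJ/dz * dz/da
dJ_dz = 2 * (z - y)
dz_da = w
dJ_da = dJ_dz * dz_da

# Numerical check: nudge a slightly and measure the change in J
eps = 1e-6
dJ_da_numeric = (J_of(a + eps) - J_of(a - eps)) / (2 * eps)
print(dJ_da, dJ_da_numeric)
```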


### Backpropagation  
[coursera ai: backpropagation](https://www.coursera.org/learn/neural-networks-deep-learning/lecture/6dDj7/backpropagation-intuition-optional).  
  compute the derivative of the final output with respect to some other variable, usually called `dvar` in code  
  
<img src="images/backpropagation.png" width="650">


### Updating weights

<img src="images/updating-weights.png" width="650">


### Vectorization  
  whenever possible, avoid explicit `for` loops  
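
A quick sketch comparing an explicit loop with the vectorized `np.dot` on the same dot product. The array size is arbitrary and the timings depend on the machine, so only the equality of the two results is guaranteed.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(100_000)
x = rng.standard_normal(100_000)

# Explicit for loop
tic = time.perf_counter()
z_loop = 0.0
for i in range(len(w)):
    z_loop += w[i] * x[i]
loop_time = time.perf_counter() - tic

# Vectorized version
tic = time.perf_counter()
z_vec = np.dot(w, x)
vec_time = time.perf_counter() - tic

print(f"loop: {loop_time:.4f}s, vectorized: {vec_time:.6f}s")
```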


### Broadcasting in Python  
  https://numpy.org/doc/stable/user/basics.broadcasting.html  
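
A short sketch of broadcasting with made-up arrays: a `(1, 3)` row and a `(2, 1)` column are both stretched to match a `(2, 3)` matrix.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])         # shape (2, 3)
row = np.array([[10.0, 20.0, 30.0]])   # shape (1, 3)
col = np.array([[100.0], [200.0]])     # shape (2, 1)

print(A + row)   # row is repeated down the 2 rows
print(A + col)   # col is repeated across the 3 columns

# Typical use: divide each column by its sum to get percentages
percentages = 100 * A / A.sum(axis=0, keepdims=True)   # (2, 3) / (1, 3)
print(percentages)
```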


### Softmax   
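  normalizing function that turns the output scores into probabilities summing to 1; used when the algorithm needs to classify two or more classes (with two classes it reduces to logistic regression)    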
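
A minimal NumPy sketch of softmax, assuming scores stored as one column per example. Subtracting the column maximum is a common numerical-stability trick and does not change the result; the input values are made up.

```python
import numpy as np

def softmax(z):
    """Column-wise softmax; z has shape (n_classes, m)."""
    z_shifted = z - np.max(z, axis=0, keepdims=True)   # avoids overflow in exp
    exp_z = np.exp(z_shifted)
    return exp_z / np.sum(exp_z, axis=0, keepdims=True)

# Made-up scores: 3 classes, 2 examples
z = np.array([[2.0, 1.0],
              [1.0, 1.0],
              [0.1, 1.0]])
probs = softmax(z)
print(probs)
print(probs.sum(axis=0))   # each column sums to 1
```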


### Hidden
the values of the hidden layers are not observed: they are not seen in the training set


### Activation functions
[coursera ai: activation functions](https://www.coursera.org/learn/neural-networks-deep-learning/lecture/4dDC1/activation-functions)    
[coursera ai: nns-non-linear-activation-functions](https://www.coursera.org/learn/neural-networks-deep-learning/lecture/OASKH/why-do-you-need-non-linear-activation-functions): why do nns need non-linear activation functions   
   -> because without them the nn would just output a linear function of the input   
   = standard linear regression.   
linear activation function: g(z) = z -> outputs the input  
[coursera ai: derivatives of activation functions](https://www.coursera.org/learn/neural-networks-deep-learning/lecture/qcG1j/derivatives-of-activation-functions)   
  -> output values based on possible input values (max, min, 0)     
    
can be different for different layers in the same nn
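- **sigmoid**: output in (0, 1) -> mainly useful for the output layer in binary classification (0 or 1)  
- **tanh**: hyperbolic tangent, output in (-1, 1) -> zero-mean activations. almost always better than sigmoid for hidden layers. makes learning easier for the next layer  
- **relu**: `max(0, z)`, slope is 1 for z > 0 and 0 for z < 0 *most used* -> usually the fastest learning of all  
- **leaky relu**: like relu, but with a small slope (e.g. 0.01) instead of 0 for z < 0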
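
The four functions can be sketched in NumPy as follows. The leaky relu slope of 0.01 is an assumed common default, and the test input is arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z, slope=0.01):
    # Small slope for negative inputs instead of a flat 0
    return np.where(z > 0, z, slope * z)

z = np.array([-2.0, 0.0, 2.0])
for g in (sigmoid, tanh, relu, leaky_relu):
    print(g.__name__, g(z))
```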

<img src="images/activation-functions.png" width="650">


### Initialization
[coursera ai: random initialization](https://www.coursera.org/learn/neural-networks-deep-learning/lecture/XtFPI/random-initialization).   
- **weights**: do not initialize with zeros -> all hidden units are symmetric, they all compute the same function. initialize with random values instead.   
`np.random.randn(2, 2) * 0.01`   # '* 0.01' initializes to very small values   
if weights are too large, GD will be very slow (values fall in the flat parts of the activation functions)  
- **bias**: can be initialized with zeros, as long as weights are not   
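
A small sketch of this initialization for a network with one hidden layer, plus a check of the symmetry problem. The helper name `initialize_parameters`, the layer sizes and the random input are assumptions made for illustration.

```python
import numpy as np

def initialize_parameters(n_x, n_h, n_y, seed=0):
    """Small random weights, zero biases (hypothetical helper)."""
    np.random.seed(seed)
    return {
        "W1": np.random.randn(n_h, n_x) * 0.01,
        "b1": np.zeros((n_h, 1)),
        "W2": np.random.randn(n_y, n_h) * 0.01,
        "b2": np.zeros((n_y, 1)),
    }

params = initialize_parameters(n_x=3, n_h=4, n_y=1)
for name, value in params.items():
    print(name, value.shape)

# Symmetry check on a made-up input of 5 examples
X = np.random.randn(3, 5)
A1_zero = np.tanh(np.dot(np.zeros((4, 3)), X))               # zero weights: all hidden units identical
A1_rand = np.tanh(np.dot(params["W1"], X) + params["b1"])    # random weights: hidden units differ
print(A1_zero)
print(A1_rand)
```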
  
  
### Overfitting

- If a network is overfitting, you can counter it by:
    - choosing simpler nonlinearities
    - smaller layer sizes
    - shallower architectures
    - larger datasets
    - or more aggressive regularization techniques
- All have a similar effect on the loss function and a similar consequence on the behavior of the network
   
   
   
### Underfitting

# Get your dimensions right

[coursera ai: getting your dimensions right](https://www.coursera.org/learn/neural-networks-deep-learning/lecture/Rz47X/getting-your-matrix-dimensions-right)

```python
import numpy as np

# --- Rank 1 arrays: shape (n,)
a = np.random.randn(5)
print(a.shape)
print(a.T)                # a and a.T are the same
print(np.dot(a, a.T))     # scalar

# --- Column vectors: (n, 1) - Row vectors: (1, n)
print("")
a = np.random.randn(5, 1)  # 5*1 matrix
print(a)
print(a.T)                # transposed shape: (1, 5)
print(np.dot(a, a.T))     # 5*5 matrix (outer product)
assert a.shape == (5, 1)  # can help as DOCUMENTATION

# --- ALWAYS use column or row vectors, not rank 1 arrays
# OR reshape (reshape returns a new array, so assign the result):
a = a.reshape((5, 1))

# A trick when you want to flatten a matrix X of shape (a,b,c,d)
# to a matrix X_flatten of shape (b ∗ c ∗ d, a) is to use:
X = np.random.randn(10, 4, 4, 3)          # example input (assumed shape), e.g. 10 images of 4*4*3
X_flatten = X.reshape(X.shape[0], -1).T   # shape (48, 10)

## Many software bugs in deep learning
## come from having matrix/vector dimensions that don't fit
```

Output (values vary from run to run):

```
(5,)
[-1.48936219  1.13770586  2.16944225 -1.43955269  0.58857812]
10.637790162716158

[[ 0.72143301]
 [ 1.3055879 ]
 [ 1.46287306]
 [-0.44561694]
 [-0.31813234]]
[[ 0.72143301  1.3055879   1.46287306 -0.44561694 -0.31813234]]
[[ 0.52046558  0.9418942   1.05536491 -0.32148277 -0.22951117]
 [ 0.9418942   1.70455976  1.90990937 -0.58179209 -0.41534974]
 [ 1.05536491  1.90990937  2.13999759 -0.65188102 -0.46538723]
 [-0.32148277 -0.58179209 -0.65188102  0.19857446  0.14176516]
 [-0.22951117 -0.41534974 -0.46538723  0.14176516  0.10120819]]
```

# Best NN perfs = scale
  scale algorithms: very large NN (lots of hidden layers, of parameters, of connections)  
+ scale data: very large amount of data  
+ scale computation: CPU, GPU  
ex: switch from sigmoid (learning becomes really slow at the extremities, where the slope is close to 0) to relu  
 -> GD works much faster  
 (Neural Networks and Deep Learning - Week 1 - Why is Deep Learning taking off?)  
