<a href="https://colab.research.google.com/github/rjsaito/Data-Science-Essentials/blob/master/Machine_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Linear Algebra

## Eigenvalues and Eigenvectors



$$
A v = \lambda v
$$


- eigenvalue
- eigenvector
- invertible matrix
- singularity
- singular values (square roots of the eigenvalues of $A^T A$)
- rank
- linear independence
- eigenvalue decomposition (matrix represented in terms of its eigenvalues and eigenvectors)
- singular value decomposition (complexity of $m^2 n + n^3$, or $O(mn^2)$, for an $m \times n$ matrix)
- matrix factorization: exact solution in $(nm)^{O(2^r r^2)}$ time
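
The relationships in the list above can be checked numerically. The following is a minimal NumPy sketch, assuming a small hand-picked matrix `A` that is not part of the original notes: it compares the singular values of `A` with the square roots of the eigenvalues of $A^T A$, and verifies $A v = \lambda v$ for each eigenpair.

```python
import numpy as np

# Hypothetical 3x2 matrix, chosen only for illustration
A = np.array([[3.0, 0.0],
              [0.0, 2.0],
              [0.0, 0.0]])

# Singular value decomposition: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Eigenvalue decomposition of the symmetric matrix M = A^T A
M = A.T @ A
eigvals, eigvecs = np.linalg.eigh(M)

# Singular values of A are the square roots of the eigenvalues of A^T A
sqrt_eigvals = np.sort(np.sqrt(eigvals))[::-1]

# Rank is the number of non-zero singular values
rank = np.linalg.matrix_rank(A)

# M v - lambda v should be zero for every eigenpair (columns of eigvecs)
residual = M @ eigvecs - eigvecs * eigvals

print("singular values:", s)
print("sqrt of eigenvalues of A^T A:", sqrt_eigvals)
print("rank:", rank)
```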



## Principal Components

## Factor Analysis

# Machine Learning

## Loss Functions

### Cross Entropy (Log Loss)

In [None]:
import numpy as np

# Full function: -(y*log(p) + (1-y)*log(1-p))
def CrossEntropy(p, y):
  # p: predicted probability of class 1, y: true label (0 or 1)
  if y == 1:
    return -np.log(p)
  else:
    return -np.log(1 - p)

### Hinge

In [None]:
def Hinge(p, y):
  # p: raw model score, y: true label in {-1, +1}
  return np.maximum(0, 1 - p * y)

### Mean Absolute Error (L1)

In [None]:
def MAE(yH, y):
  return np.sum(np.absolute(yH - y)) / y.size

### Mean Squared Error (L2)

In [None]:
def MSE(yH, y):
  return np.sum((yH - y)**2) / y.size

## scikit-learn

## Regression

In [None]:
# load libraries
from sklearn.linear_model import LinearRegression, Ridge, RidgeCV, Lasso, LassoCV
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
# NOTE: load_boston was removed in scikit-learn 1.2, so this cell needs an older release
from sklearn.datasets import load_boston

# load data
boston = load_boston()
features = boston.data
target = boston.target

# standardize
scalar = StandardScaler()
features_standardized = scalar.fit_transform(features)
features.shape

(506, 13)

### Feature Engineering

In [None]:
# polynomial terms
polynomial = PolynomialFeatures(degree = 3, include_bias = False)
features_polynomial = polynomial.fit_transform(features)

### Linear Regression

In [None]:
# create regression
regression = LinearRegression()

# fit model
model = regression.fit(features_polynomial, target)

# get model output
print(model.intercept_)
print(model.coef_)

### Ridge Regression

In [None]:
# create ridge regression, with cross validation
ridge = RidgeCV(cv = 5)

# fit model
ridgemodel = ridge.fit(features, target)

# get model output
print(ridgemodel.intercept_)
print(ridgemodel.coef_)

27.467884964141177
[-0.10143535  0.0495791  -0.0429624   1.95202082 -2.37161896  3.70227207
 -0.01070735 -1.24880821  0.2795956  -0.01399313 -0.79794498  0.01003684
 -0.55936642]




### Lasso Regression

In [None]:
# create lasso regression, with cross validation
lasso = LassoCV(cv = 5)

# fit model
lassomodel = lasso.fit(features, target)

# get model output
print(lassomodel.intercept_)
print(lassomodel.coef_)

36.33499969015174
[-0.07426626  0.04945448 -0.          0.         -0.          1.804385
  0.01133345 -0.81324404  0.27228399 -0.01542465 -0.74287183  0.00892587
 -0.70365352]


## Tree-Based Models

- Classification tree analysis is when the predicted outcome is the class (discrete) to which the data belongs.
- Regression tree analysis is when the predicted outcome can be considered a real number (e.g. the price of a house, or a patient's length of stay in a hospital).
- Boosted trees incrementally build an ensemble by training each new instance to emphasize the training instances previously mis-modeled. A typical example is AdaBoost. These can be used for regression-type and classification-type problems.[7][8]
- Bootstrap aggregated (or bagged) decision trees, an early ensemble method, build multiple decision trees by repeatedly resampling the training data with replacement and letting the trees vote for a consensus prediction.[9]
  - A random forest classifier is a specific type of bootstrap aggregating
- Rotation forest – in which every decision tree is trained by first applying principal component analysis (PCA) on a random subset of the input features.[10]
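
As a rough illustration of the ensemble types listed above, the sketch below fits a single tree, bagged trees, a random forest and AdaBoost with scikit-learn. The synthetic dataset from `make_classification` and all hyperparameter values are assumptions made for this example only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used here only so the example is self-contained
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    # bagging: each tree sees a bootstrap resample of the training data
    "bagged trees": BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0),
    # random forest: bagging plus a random feature subset at each split
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    # boosting: each new learner focuses on previously misclassified samples
    "boosted trees (AdaBoost)": AdaBoostClassifier(n_estimators=50, random_state=0),
}

scores = {name: model.fit(X_train, y_train).score(X_test, y_test)
          for name, model in models.items()}
for name, score in scores.items():
    print(f"{name}: test accuracy = {score:.3f}")
```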

## Feature Importance


**Information Gain**

Also known as the **Kullback-Leibler divergence** (or relative entropy), information gain is the amount of information gained about a random variable or signal from observing another random variable.

The information gain of a random variable X obtained from an observation of a random variable A taking value A = a is defined:

$ IG_{X,A}(X,a) = D_{\text{KL}}\left(P_X(x|a) \,\|\, P_X(x|I)\right), $

the Kullback-Leibler divergence of the prior distribution $P_X(x|I)$ for x from the posterior distribution $P_X(x|a)$ for x given a.


In general terms, the expected information gain is the **change in information entropy H** as:

$ IG(T, a) = H(T) - H(T|a) $

where $ H(T|a) $ is the conditional entropy of T given the value of attribute a


$ H(T) = I_E(p_1, p_2, \ldots, p_J) = -\sum^J_{i=1} p_i \log_2 p_i $
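
A small sketch of the entropy and expected information gain formulas above, assuming discrete labels and a discrete attribute. The function names and the toy arrays are illustrative and not taken from any library.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy H(T) in bits of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(labels, attribute):
    """IG(T, a) = H(T) - H(T|a), where H(T|a) is the size-weighted entropy per attribute value."""
    labels = np.asarray(labels)
    attribute = np.asarray(attribute)
    conditional = 0.0
    for value in np.unique(attribute):
        mask = attribute == value
        conditional += mask.mean() * entropy(labels[mask])
    return entropy(labels) - conditional

# Toy data: one attribute separates the classes perfectly, the other not at all
labels = np.array([1, 1, 0, 0])
perfect = np.array(["a", "a", "b", "b"])
useless = np.array(["a", "b", "a", "b"])

print(information_gain(labels, perfect))
print(information_gain(labels, useless))
```

An attribute with higher information gain produces purer groups after the split, which is why tree learners use it to rank candidate splits.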


## Matrix Factorization

### LibMF

https://www.csie.ntu.edu.tw/~cjlin/papers/libmf/libmf_open_source.pdf

Non-convex optimization problem:

$$ \min_{P,Q} \sum_{(u,v) \in R} \left[ f(p_u, q_v; r_{u,v}) + \mu_p \|p_u\|_1 + \mu_q \|q_v\|_1 +
\frac{\lambda_p}{2} \|p_u\|^2_2 + \frac{\lambda_q}{2} \|q_v\|^2_2 \right] \ \ \ (1)
$$

where 

$f(p_u, q_v; r_{u,v})$ is the loss function, $p_u, q_v$ are latent factors, $r_{u,v}$ is the interaction, and $\mu_p, \mu_q, \lambda_p, \lambda_q$ are regularization parameters.

### Training with Stochastic Gradient Descent

The algorithm for the Fast Parallelized Stochastic Gradient Descent is:

1. randomly shuffle R
2. grid R into a set B with at least (s + 1) × (s + 1) blocks
3. sort each block by user (or item) identities
4. construct a scheduler
5. launch s working threads
6. wait until the total number of updates reaches a user-defined value


The basic idea of SG is that, instead of expensively calculating the gradient of (1), it randomly selects a $(u,v)$ entry from the summation and calculates the corresponding gradient [Robbins and Monro 1951; Kiefer and Wolfowitz 1952]. Once $r_{u,v}$ is chosen, the objective function in (1) reduces to:

$$ f(p_u, q_v; r_{u,v}) + \mu_p \|p_u\|_1 + \mu_q \|q_v\|_1 +
\frac{\lambda_p}{2} p_u^T p_u + \frac{\lambda_q}{2} q_v^T q_v \ \ \ (2)
$$

We calculate the sub-gradient over $p_u$ and $q_v$. Variables are updated by the following rules:

$$
p_u \leftarrow p_u - \gamma \frac{\partial}{\partial p_u}(2),  \\
q_v \leftarrow q_v - \gamma \frac{\partial}{\partial q_v}(2)
$$

where $\gamma$ is the learning rate.
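
The per-entry update can be sketched in plain NumPy. This is a single-threaded illustration only, assuming a squared loss for $f$ and dropping the L1 terms ($\mu_p = \mu_q = 0$); it is not LIBMF's parallel implementation, and the toy ratings, factor size and step size are made up for the example.

```python
import numpy as np

def sgd_step(P, Q, u, v, r, lr=0.02, lam=0.02):
    """One SG update on a single observed entry r_{u,v} (squared loss, L2 terms only)."""
    p_u, q_v = P[u].copy(), Q[v].copy()
    err = r - p_u @ q_v                      # residual of the current prediction
    grad_p = -2.0 * err * q_v + lam * p_u    # gradient of (r - p.q)^2 + lam/2 * ||p||^2
    grad_q = -2.0 * err * p_u + lam * q_v
    P[u] = p_u - lr * grad_p                 # step against the gradient
    Q[v] = q_v - lr * grad_q

def rmse(P, Q, ratings):
    return float(np.sqrt(np.mean([(r - P[u] @ Q[v]) ** 2 for u, v, r in ratings])))

rng = np.random.default_rng(0)
k = 2  # number of latent factors (assumed)
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]  # (user, item, rating) toy data
P = rng.normal(scale=0.1, size=(3, k))
Q = rng.normal(scale=0.1, size=(3, k))

rmse_before = rmse(P, Q, ratings)
for epoch in range(500):
    for idx in rng.permutation(len(ratings)):  # randomly shuffle R each pass
        u, v, r = ratings[idx]
        sgd_step(P, Q, u, v, r)
rmse_after = rmse(P, Q, ratings)

print("RMSE before:", rmse_before, "after:", rmse_after)
```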

### Loss Functions

 $f(·)$ is a non-convex loss function of $p_u$ and $q_v$, and $\mu_p$, $\mu_q$, $\lambda_p$, and $\lambda_q$ are regularization coefficients. For Real Valued MF, the loss function can be a squared loss, an absolute loss, or generalized KL-divergence. If R is a binary matrix, users may select among logistic loss, hinge loss, and squared hinge loss to perform BMF. Note that, the non-negative constraints,

# Deep Learning

## Two-layer Neural Network

In [None]:
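import numpy as np
import matplotlib.pyplot as plt

# NOTE: initialize_parameters, linear_activation_forward, compute_cost,
# linear_activation_backward and update_parameters are helper functions
# assumed to be defined elsewhere; they are not included in these notes.
def two_layer_model(X, Y, layers_dims, learning_rate = 0.0075, num_iterations = 3000, print_cost=False):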
    """
    Implements a two-layer neural network: LINEAR->RELU->LINEAR->SIGMOID.
    
    Arguments:
    X -- input data, of shape (n_x, number of examples)
    Y -- true "label" vector (containing 1 if cat, 0 if non-cat), of shape (1, number of examples)
    layers_dims -- dimensions of the layers (n_x, n_h, n_y)
    num_iterations -- number of iterations of the optimization loop
    learning_rate -- learning rate of the gradient descent update rule
    print_cost -- If set to True, this will print the cost every 100 iterations 
    
    Returns:
    parameters -- a dictionary containing W1, W2, b1, and b2
    """
    
    np.random.seed(1)
    grads = {}
    costs = []                              # to keep track of the cost
    m = X.shape[1]                           # number of examples
    (n_x, n_h, n_y) = layers_dims
    
    # Initialize parameters dictionary, by calling one of the functions you'd previously implemented
    ### START CODE HERE ### (≈ 1 line of code)
    parameters = initialize_parameters(n_x, n_h, n_y)
    ### END CODE HERE ###
    
    # Get W1, b1, W2 and b2 from the dictionary parameters.
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]
    
    # Loop (gradient descent)

    for i in range(0, num_iterations):

        # Forward propagation: LINEAR -> RELU -> LINEAR -> SIGMOID. Inputs: "X, W1, b1, W2, b2". Output: "A1, cache1, A2, cache2".
        ### START CODE HERE ### (≈ 2 lines of code)
        A1, cache1 = linear_activation_forward(X, W1, b1, activation = "relu")
        A2, cache2 = linear_activation_forward(A1, W2, b2, activation = "sigmoid")
        ### END CODE HERE ###
        
        # Compute cost
        ### START CODE HERE ### (≈ 1 line of code)
        cost = compute_cost(A2, Y)
        ### END CODE HERE ###
        
        # Initializing backward propagation
        dA2 = - (np.divide(Y, A2) - np.divide(1 - Y, 1 - A2))
        
        # Backward propagation. Inputs: "dA2, cache2, cache1". Outputs: "dA1, dW2, db2; also dA0 (not used), dW1, db1".
        ### START CODE HERE ### (≈ 2 lines of code)
        dA1, dW2, db2 = linear_activation_backward(dA2, cache2, activation = "sigmoid")
        dA0, dW1, db1 = linear_activation_backward(dA1, cache1, activation = "relu")
        ### END CODE HERE ###
        
        # Set grads['dW1'] to dW1, grads['db1'] to db1, grads['dW2'] to dW2, grads['db2'] to db2
        grads['dW1'] = dW1
        grads['db1'] = db1
        grads['dW2'] = dW2
        grads['db2'] = db2
        
        # Update parameters.
        ### START CODE HERE ### (approx. 1 line of code)
        parameters = update_parameters(parameters, grads, learning_rate)
        ### END CODE HERE ###

        # Retrieve W1, b1, W2, b2 from parameters
        W1 = parameters["W1"]
        b1 = parameters["b1"]
        W2 = parameters["W2"]
        b2 = parameters["b2"]
        
        # Print and record the cost every 100 iterations
        if print_cost and i % 100 == 0:
            print("Cost after iteration {}: {}".format(i, np.squeeze(cost)))
            costs.append(cost)
       
    # plot the cost

    plt.plot(np.squeeze(costs))
    plt.ylabel('cost')
    plt.xlabel('iterations (per hundreds)')
    plt.title("Learning rate =" + str(learning_rate))
    plt.show()
    
    return parameters

## **Recurrent Neural Networks**

#### **Types of Sequence Data**

- Speech recognition
- Music generation
- Sentiment Classification
- DNA sequence analysis
- Machine translation
- Video activity recognition
- Named entity recognition

**Why Standard Networks don't work**

If input were one-hot-encoded word vectors:

- Inputs, Outputs can be different lengths in different examples
- Doesn't share features learned across different positions of text

### **(Unidirectional) RNN**

---



Steps:

- $a^{<0>} = 0$
- $a^{<1>} = g_1(w_{aa}a^{<0>} + w_{ax}x^{<1>}+b_a) $  <- tanh/ReLU
- $\hat{y}^{<1>} = g_2(w_{ya}a^{<1>} + b_y) $ <- sigmoid
- $a^{<t>} = g_1(w_{aa}a^{<t-1>} + w_{ax}x^{<t>}+b_a) $
- $\hat{y}^{<t>} = g_2(w_{ya}a^{<t>} + b_y) $

Written otherwise:

- $a^{<t>} = g(w_{a}[a^{<t-1>}, x^{<t>}]+b_a) $
  - Stack $w_{aa}$ and $w_{ax}$ side by side into a single matrix $w_a = [w_{aa} \mid w_{ax}]$
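
The forward steps above can be sketched in NumPy. The layer sizes and random weights are assumptions for illustration; the hidden state uses tanh and the output uses a sigmoid, as annotated in the steps. The loop also checks that the stacked form gives the same hidden state as the two-matrix form.

```python
import numpy as np

rng = np.random.default_rng(0)
n_a, n_x, n_y, T = 4, 3, 2, 5   # hidden size, input size, output size, sequence length (assumed)

W_aa = rng.normal(size=(n_a, n_a))
W_ax = rng.normal(size=(n_a, n_x))
W_ya = rng.normal(size=(n_y, n_a))
b_a = np.zeros((n_a, 1))
b_y = np.zeros((n_y, 1))
x = rng.normal(size=(T, n_x, 1))  # one column vector per time step

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Stacked weights: W_a = [W_aa | W_ax], applied to the stacked vector [a; x]
W_a = np.hstack([W_aa, W_ax])

a = np.zeros((n_a, 1))          # a^<0> = 0
a_stacked = np.zeros((n_a, 1))
y_hats = []
for t in range(T):
    a = np.tanh(W_aa @ a + W_ax @ x[t] + b_a)                      # two-matrix form
    a_stacked = np.tanh(W_a @ np.vstack([a_stacked, x[t]]) + b_a)  # stacked form
    y_hats.append(sigmoid(W_ya @ a + b_y))

print(np.allclose(a, a_stacked))
```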


### **Bidirectional RNN**

A bidirectional RNN has a similar structure to the unidirectional RNN, but adds a second set of activations computed at each step in the reverse direction, so each prediction can use context from both earlier and later inputs.

### **BERT**

BERT stands for Bidirectional Encoder Representations from Transformers. It applies bidirectional training to the Transformer, a popular attention model. (Technically it is non-directional, since it looks at all words simultaneously.)

**Masked LM (MLM)**

Before feeding word sequences into BERT (tokenized and encoded), 15% of the words in each sequence are replaced with a [MASK] token. The model then attempts to predict the original value of the masked words, based on the context provided by the other, non-masked words in the sequence.

The prediction of the output words requires:
1. Adding a classification layer on top of the encoder output
2. Multiplying the output vectors by the embedding matrix, transforming them into the vocabulary dimension
3. Calculating the probability of each word in the vocabulary with softmax

The BERT loss function takes into consideration only the prediction of the masked values and ignores the prediction of the non-masked words.

As a consequence, the model converges more slowly than directional models, a characteristic which is offset by its increased context awareness.
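
A toy NumPy sketch of the masking, prediction and loss steps described above. Everything here is a stand-in: the "encoder" is just an embedding lookup, the classification layer uses a tanh, and the sizes and `MASK_ID` are hypothetical. It only illustrates the data flow, not BERT's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden, seq_len = 50, 8, 20   # hypothetical sizes
MASK_ID = 0                                # hypothetical id reserved for [MASK]

token_ids = rng.integers(1, vocab_size, size=seq_len)  # ids 1..49, so none equals MASK_ID

# Replace 15% of the positions with the [MASK] token
n_masked = max(1, int(round(0.15 * seq_len)))
masked_pos = rng.choice(seq_len, size=n_masked, replace=False)
inputs = token_ids.copy()
inputs[masked_pos] = MASK_ID

# Stand-in for the Transformer encoder: a plain embedding lookup
embedding = rng.normal(size=(vocab_size, hidden))
encoder_out = embedding[inputs]                       # (seq_len, hidden)

# 1. classification layer on top of the encoder output (tanh is an assumption)
W_cls = rng.normal(size=(hidden, hidden))
h = np.tanh(encoder_out @ W_cls)

# 2. multiply by the embedding matrix to reach the vocabulary dimension
logits = h @ embedding.T                              # (seq_len, vocab_size)

# 3. softmax over the vocabulary
logits -= logits.max(axis=1, keepdims=True)
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Loss: cross entropy on the masked positions only
loss = -np.mean(np.log(probs[masked_pos, token_ids[masked_pos]]))
print("masked positions:", np.sort(masked_pos), "loss:", loss)
```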

**Next Sentence Prediction (NSP)**

In the BERT training process, the model receives pairs of sentences as input and learns to predict whether the second sentence in the pair is the subsequent sentence in the original document. During training, 50% of the inputs are a pair in which the second sentence is the subsequent sentence in the original document, while in the other 50% a random sentence from the corpus is chosen as the second sentence. The assumption is that the random sentence will be disconnected from the first sentence.

How input is processed before entering the model:
1. A [CLS] token is inserted at the beginning of the first sentence and a [SEP] token is inserted at the end of each sentence
2. A sentence embedding indicating Sentence A or Sentence B is added to each token. Sentence embeddings are similar in concept to token embeddings with a vocabulary of 2
3. A positional embedding is added to each token to indicate its position in the sequence. The concept and implementation of positional embedding are presented in the Transformer paper
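
The three preprocessing steps can be sketched as a small helper. `build_bert_input` is a hypothetical function written for this note; it works on already-tokenized word lists and returns ids for the sentence (segment) and positional embeddings rather than the embeddings themselves.

```python
def build_bert_input(tokens_a, tokens_b):
    """Pack two tokenized sentences as [CLS] A [SEP] B [SEP] with segment and position ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Sentence A (segment 0) covers [CLS], A and its [SEP]; sentence B (segment 1) covers the rest
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # One position index per token
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids

tokens, segment_ids, position_ids = build_bert_input(
    ["the", "dog", "barks"], ["it", "is", "loud"])

for tok, seg, pos in zip(tokens, segment_ids, position_ids):
    print(pos, seg, tok)
```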

### Architecture

#### Example

RNN for Short-Term Prediction

In [None]:
import numpy as np
from matplotlib import pyplot as plt
import tensorflow as tf
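# eager execution is on by default in TensorFlow 2.x; it only has to be enabled on 1.x
if not tf.executing_eagerly():
    tf.compat.v1.enable_eager_execution()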
print("Tensorflow version: " + tf.__version__)

Generate Fake Data

In [None]:
DATA_SEQ_LEN = 1024*128
data = np.concatenate([create_time_series(waveform, DATA_SEQ_LEN) for waveform in Waveforms]) # 4 different wave forms
picture_this_1(data, DATA_SEQ_LEN)
DATA_LEN = DATA_SEQ_LEN * 4 # since we concatenated 4 sequences

Hyperparameters

In [None]:
RNN_CELLSIZE = 32   # size of the RNN cells
SEQLEN = 32         # unrolled sequence length
BATCHSIZE = 32      # mini-batch size
LAST_N = SEQLEN//2  # loss computed on last N element of sequence in advanced RNN model

Visualize Training Sequences

In [None]:
picture_this_2(data, BATCHSIZE, SEQLEN) # execute multiple times to see different sample sequences

Create Model

In [None]:
# this is how to create a Keras model from neural network layers
def compile_keras_sequential_model(list_of_layers, msg):
  
    # a tf.keras.Sequential model is a sequence of layers
    model = tf.keras.Sequential(list_of_layers)
    
    # keras does not have a pre-defined metric for Root Mean Square Error. Let's define one.
    def rmse(y_true, y_pred): # Root Mean Squared Error
      return tf.sqrt(tf.reduce_mean(tf.square(y_pred - y_true)))
    
    print('\nModel ', msg)
    
    # to finalize the model, specify the loss, the optimizer and metrics
    model.compile(
       loss = 'mean_squared_error',
       optimizer = 'rmsprop',
       metrics = [rmse])
    
    # this prints a description of the model
    model.summary()
    
    return model
  
#
# three very simplistic "models" that require no training. Can you beat them ?
#

# SIMPLISTIC BENCHMARK MODEL 1
predict_same_as_last_value = lambda x: x[:,-1] # shape of x is [BATCHSIZE,SEQLEN]
# SIMPLISTIC BENCHMARK MODEL 2
predict_trend_from_last_two_values = lambda x: x[:,-1] + (x[:,-1] - x[:,-2])
# SIMPLISTIC BENCHMARK MODEL 3
predict_random_value = lambda x: tf.random.uniform(tf.shape(x)[0:1], -2.0, 2.0)

def model_layers_from_lambda(lambda_fn, input_shape, output_shape):
  return [tf.keras.layers.Lambda(lambda_fn, input_shape=input_shape),
          tf.keras.layers.Reshape(output_shape)]

model_layers_RAND  = model_layers_from_lambda(predict_random_value,               input_shape=[SEQLEN,], output_shape=[1,])
model_layers_LAST  = model_layers_from_lambda(predict_same_as_last_value,         input_shape=[SEQLEN,], output_shape=[1,])
model_layers_LAST2 = model_layers_from_lambda(predict_trend_from_last_two_values, input_shape=[SEQLEN,], output_shape=[1,])

# three neural network models for comparison, in increasing order of complexity

l = tf.keras.layers  # syntax shortcut

# BENCHMARK MODEL 4: linear model (RMSE: 0.215 after 10 epochs)
model_layers_LINEAR = [l.Dense(1, input_shape=[SEQLEN,])] # output shape [BATCHSIZE, 1]

# BENCHMARK MODEL 5: 2-layer dense model (RMSE: 0.197 after 10 epochs)
model_layers_DNN = [l.Dense(SEQLEN//2, activation='relu', input_shape=[SEQLEN,]), # input  shape [BATCHSIZE, SEQLEN]
                    l.Dense(1)] # output shape [BATCHSIZE, 1]

# BENCHMARK MODEL 6: convolutional (RMSE: 0.186 after 10 epochs)
model_layers_CNN = [
    l.Reshape([SEQLEN, 1], input_shape=[SEQLEN,]), # [BATCHSIZE, SEQLEN, 1] is necessary for conv model
    l.Conv1D(filters=8, kernel_size=4, activation='relu', padding="same"), # [BATCHSIZE, SEQLEN, 8]
    l.Conv1D(filters=16, kernel_size=3, activation='relu', padding="same"), # [BATCHSIZE, SEQLEN, 16]
    l.Conv1D(filters=8, kernel_size=1, activation='relu', padding="same"), # [BATCHSIZE, SEQLEN, 8]
    l.MaxPooling1D(pool_size=2, strides=2),  # [BATCHSIZE, SEQLEN//2, 8]
    l.Conv1D(filters=8, kernel_size=3, activation='relu', padding="same"),  # [BATCHSIZE, SEQLEN//2, 8]
    l.MaxPooling1D(pool_size=2, strides=2),  # [BATCHSIZE, SEQLEN//4, 8]
    # mis-using a conv layer as linear regression :-)
    l.Conv1D(filters=1, kernel_size=SEQLEN//4, activation=None, padding="valid"), # output shape [BATCHSIZE, 1, 1]
    l.Reshape([1,]) ] # output shape [BATCHSIZE, 1]

# RNN
model_layers_RNN = [
    # input shape needed on first layer only
    l.Reshape([SEQLEN, 1], input_shape=[SEQLEN,]),
    l.GRU(RNN_CELLSIZE), # shape [BATCHSIZE, RNN_CELLSIZE]
    l.Dense(1, ) # shape [BATCHSIZE, 1]
    
]

## Convolutional Neural Networks

#### **Types of CV Problems**

- Image Classification
- Object Detection
- Neural Style Transfer


#### **Edge Detection**

Edges are detected using Filters (or Kernels)
- In Tensorflow: tf.nn.conv2d
- In Keras: Conv2D

Horizontal Edge Detection (3x3 filter)

|  1,  1,  1 | <br/>
|  0,  0,  0 | <br/>
| -1, -1, -1 |

Vertical Edge Detection (3x3 filter)

| 1, 0, -1 | <br/>
| 1, 0, -1 | <br/>
| 1, 0, -1 |

Sobel Filter (3x3)
- More robust (vertical) detector with higher weight in the center

| 1, 0, -1 | <br/>
| 2, 0, -2 | <br/>
| 1, 0, -1 |
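
To see what these filters do, here is a minimal NumPy sketch that slides a 3x3 filter over a toy 6x6 image whose left half is bright and right half is dark. The `conv2d_valid` helper and the image are assumptions for illustration; like deep learning libraries, it computes a cross-correlation (no kernel flip).

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' cross-correlation of a 2-D image with a 2-D kernel (no padding, stride 1)."""
    n_h, n_w = image.shape
    f_h, f_w = kernel.shape
    out = np.zeros((n_h - f_h + 1, n_w - f_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + f_h, j:j + f_w] * kernel)
    return out

# Toy image: bright left half, dark right half, so there is one vertical edge
image = np.zeros((6, 6))
image[:, :3] = 10.0

vertical = np.array([[1, 0, -1]] * 3, dtype=float)   # vertical edge filter
horizontal = vertical.T                               # horizontal edge filter

edges_v = conv2d_valid(image, vertical)
edges_h = conv2d_valid(image, horizontal)
print(edges_v)   # each row is [0, 30, 30, 0]: strong response at the vertical edge
print(edges_h)   # all zeros: the image has no horizontal edge
```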

#### **Padding**

Padding is used to ensure the input array size matches the output (so as not to lose the edge pixels)

Two Types of Convolution:
- Valid: no padding (nxn * fxf -> [n - f + 1] x [n - f + 1])
- Same: pad so output size is the same as input (padding p = (f - 1) / 2)

*Note: f is usually odd (1x1, 3x3, 5x5)

#### **Strided Convolution**

Stride is the number of steps (pixels) the filter moves at each iteration (vertically and horizontally)

(nxn * fxf, padding p, stride s (e.g. s = 2) -> [floor((n + 2p - f)/s) + 1] x [floor((n + 2p - f)/s) + 1])
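
The output-size arithmetic for valid, same and strided convolutions can be captured in two small helpers. The function names are made up for this note; integer division implements the floor.

```python
def conv_output_size(n, f, p=0, s=1):
    """Output size along one dimension: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

def same_padding(f):
    """Padding that keeps the output size equal to the input for odd f and stride 1."""
    return (f - 1) // 2

print(conv_output_size(6, 3))                     # valid convolution: 4
print(conv_output_size(6, 3, p=same_padding(3)))  # same convolution: 6
print(conv_output_size(7, 3, s=2))                # strided convolution: 3
```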

#### **Convolution on RGB images**

Your input may look like 6 x 6 x 3 (height x width x color channels)

Your convolution (filter) may then be a 3 x 3 x 3

#### **Pooling Layers**

Pooling will abstract or summarise your input (bring down to lower resolution)

Types of Pooling:
- Max Pooling
- Average Pooling

#### **Why use Convolution?**

- **Parameter sharing**: A feature detector (such as a vertical edge detector) that's useful in one part of the image is probably useful in another part of the image
- **Sparsity of connections**: In each layer, each output value depends only on a small number of inputs

#### **\# of Parameters and Shape**

- Conv Layer:
  - \# Parameters: ([filter width m] x [filter height n] x [prev layer's filters d] + 1 bias) x [k filters in current layer]
  - Shape: [(input size + 2 \* padding - filter size)/stride + 1] x [(input size + 2 \* padding - filter size)/stride + 1] x [k filters in current layer]


| Type | Dimensions |
|--|--|
| Input | nh[L-1] x nw[L-1] x nc[L-1]|
| Activation a[L] | nh[L] x nw[L] x nc[L] |
| Weights | f[L] x f[L] x nc[L-1] x nc[L]|
| Bias | nc[L]|

Where
- f[L] = filter size
- p[L] = padding
- s[L] = stride
- nc[L] = number of filters
- nc[L-1] = number of channels in the previous layer (color channels for the input image)

https://engmrk.com/convolutional-neural-network-3/
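
A short sketch of the parameter count and the shapes from the table, assuming square filters. Both helper names are hypothetical.

```python
def conv_layer_params(f, n_c_prev, n_c):
    """(f * f * n_c_prev weights + 1 bias) per filter, times n_c filters."""
    return (f * f * n_c_prev + 1) * n_c

def conv_layer_shapes(n_h_prev, n_w_prev, n_c_prev, f, n_c, p=0, s=1):
    """Shapes of the input, activation, weights and bias for one conv layer."""
    n_h = (n_h_prev + 2 * p - f) // s + 1
    n_w = (n_w_prev + 2 * p - f) // s + 1
    return {
        "input": (n_h_prev, n_w_prev, n_c_prev),
        "activation": (n_h, n_w, n_c),
        "weights": (f, f, n_c_prev, n_c),
        "bias": (n_c,),
    }

# 10 filters of size 3x3 applied to an RGB input: (3*3*3 + 1) * 10 = 280 parameters
print(conv_layer_params(f=3, n_c_prev=3, n_c=10))
print(conv_layer_shapes(6, 6, 3, f=3, n_c=10))
```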


### **Classic Architectures**

**LeNet-5 (1998)**

1. Convolution 6 [5x5] (s = 1)
2. Avg Pooling (f = 2, s = 2)
3. Convolution 16 [5x5] (s = 1)
4. Avg Pooling (f = 2, s = 2)
5. Dense Layer (120 nodes)
6. Dense Layer (84 nodes)
7. Softmax (10-way)

~60K parameters




**AlexNet (2012)**

0. Start: 227x227x3
1. Convolution 96 [11x11] (s = 4)
2. Max Pooling (f = 3, s = 2)
3. Convolution 256 [5x5] (same)
4. Max Pooling (f = 3, s = 2)
5. Convolution 384 [3x3] (same)
6. Convolution 384 [3x3] (same)
7. Convolution 256 [3x3] (same)
8. Max Pooling (f = 3, s = 2)
9. Dense (4096)
10. Dense (4096)
11. Softmax (1000)

~60M parameters

**VGG-16 (2015)**

Key Features:
- Conv = [3x3] (s = 1, same padding)
- All Max Pooling [2x2] w/ stride = 2

0. Start: 224x224x3
1. Convolution 64x2 (2 times) [3x3] (same)
2. Max Pooling (f = 2, s = 2)
3. Convolution 128x2 (s = 1, same)
4. Max Pooling (f = 2, s = 2)
5. Convolution 256x3 (s = 1, same)
6. Max Pooling (f = 2, s = 2)
7. Convolution 512x3 (s = 1, same)
8. Max Pooling (f = 2, s = 2)
9. Convolution 512x3 (s = 1, same)
10. Max Pooling (f = 2, s = 2)
11. Dense (4096)
12. Dense (4096)
13. Softmax (1000)

~138M parameters

**ResNet (2015)**

**Key Features**:
- Utilizes skip connections
- Utilizes residual blocks
- By training w/ residual blocks, training error does NOT increase with more layers

**Residual Block**
- Start w/ activation A[L]
- Linear Operator (z[L+1] = W[L+1] A[L] + b[L+1])
- ReLu (A[L+1] = g(z[L+1]))
- Linear Operator (z[L+2] = W[L+2] A[L+1] + b[L+2])
- ReLu w/ shortcut from A[L] (A[L+2] = g(z[L+2] + A[L]))
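
The residual block steps can be written as a few lines of NumPy with fully connected layers. The function name and toy values are assumptions. The example also shows why the shortcut helps: with all weights at zero the block simply passes the (ReLU of the) input through.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def residual_block(a, W1, b1, W2, b2):
    """Two linear + ReLU layers with a shortcut adding the block input before the last ReLU."""
    z1 = W1 @ a + b1
    a1 = relu(z1)
    z2 = W2 @ a1 + b2
    return relu(z2 + a)   # shortcut: input a is added to z2

a = np.array([[1.0], [-2.0], [3.0]])
n = a.shape[0]

# With zero weights and biases the block reduces to relu(a)
out_identity = residual_block(a, np.zeros((n, n)), np.zeros((n, 1)),
                              np.zeros((n, n)), np.zeros((n, 1)))
print(out_identity.ravel())
```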

**Identity Block**

A standard block, corresponds to the case where the input activation (a[L]) has the same dimension as the output activation (a[L+2 or L+3]) (skipping over 2 or 3 layers)

- Start with A
- (1) Conv2D(F1, [1x1], valid padding)
- (1) Batch Norm
- (1) ReLu
- (2) Conv2D(F2, [f x f], same padding)
- (2) Batch Norm
- (2) ReLu
- (3) Conv2D(F3, [1x1], valid padding)
- (3) Batch Norm
- X_shortcut added to the output of step (3) (Add()([X, X_shortcut]))
- ReLu

**Convolution Block**

Another block type - use when input and output dimensions don't match up. The difference is that convolution is applied to the shortcut

#### **Example ResNet-50 Model**

The details of this ResNet-50 model are:

- Zero-padding pads the input with a pad of (3,3)
- Stage 1 (1 layer):
  - The 2D Convolution has 64 filters of shape (7,7) and uses a stride of (2,2). Its name is "conv1".
  - BatchNorm is applied to the 'channels' axis of the input.
  - MaxPooling uses a (3,3) window and a (2,2) stride.
- Stage 2 (9 layers (1x3 + 2x3)):
  - The convolutional block uses three sets of filters of size [64,64,256], "f" is 3, "s" is 1 and the block is "a".
  - The 2 identity blocks use three sets of filters of size [64,64,256], "f" is 3 and the blocks are "b" and "c".
- Stage 3 (12 layers (1x3 + 3x3)):
  - The convolutional block uses three sets of filters of size [128,128,512], "f" is 3, "s" is 2 and the block is "a".
  - The 3 identity blocks use three sets of filters of size [128,128,512], "f" is 3 and the blocks are "b", "c" and "d".
- Stage 4 (18 layers (1x3 + 5x3)):
  - The convolutional block uses three sets of filters of size [256, 256, 1024], "f" is 3, "s" is 2 and the block is "a".
  - The 5 identity blocks use three sets of filters of size [256, 256, 1024], "f" is 3 and the blocks are "b", "c", "d", "e" and "f".
- Stage 5 (10 layers (3x3 + 1)):
  - The convolutional block uses three sets of filters of size [512, 512, 2048], "f" is 3, "s" is 2 and the block is "a".
  - The 2 identity blocks use three sets of filters of size [512, 512, 2048], "f" is 3 and the blocks are "b" and "c".
  - The 2D Average Pooling uses a window of shape (2,2) and its name is "avg_pool".
  - The 'flatten' layer doesn't have any hyperparameters or name.
  - The Fully Connected (Dense) layer reduces its input to the number of classes using a softmax activation. Its name should be 'fc' + str(classes).

In [None]:
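# Imports assume the Keras API bundled with TensorFlow
from tensorflow.keras.initializers import glorot_uniform
from tensorflow.keras.layers import (Activation, AveragePooling2D, BatchNormalization, Conv2D,
                                     Dense, Flatten, Input, MaxPooling2D, ZeroPadding2D)
from tensorflow.keras.models import Model

# NOTE: convolutional_block and identity_block are assumed to be defined elsewhere
def ResNet50(input_shape = (64, 64, 3), classes = 6):
    """
    Implementation of the popular ResNet50 with the following architecture: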
    CONV2D -> BATCHNORM -> RELU -> MAXPOOL -> CONVBLOCK -> IDBLOCK*2 -> CONVBLOCK -> IDBLOCK*3
    -> CONVBLOCK -> IDBLOCK*5 -> CONVBLOCK -> IDBLOCK*2 -> AVGPOOL -> TOPLAYER

    Arguments:
    input_shape -- shape of the images of the dataset
    classes -- integer, number of classes

    Returns:
    model -- a Model() instance in Keras
    """
    
    # Define the input as a tensor with shape input_shape
    X_input = Input(input_shape)

    # Zero-Padding
    X = ZeroPadding2D((3, 3))(X_input)
    
    # Stage 1
    X = Conv2D(64, (7, 7), strides = (2, 2), name = 'conv1', kernel_initializer = glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis = 3, name = 'bn_conv1')(X)
    X = Activation('relu')(X)
    X = MaxPooling2D((3, 3), strides=(2, 2))(X)

    # Stage 2
    X = convolutional_block(X, f = 3, filters = [64, 64, 256], stage = 2, block='a', s = 1)
    X = identity_block(X, 3, [64, 64, 256], stage=2, block='b')
    X = identity_block(X, 3, [64, 64, 256], stage=2, block='c')

    # Stage 3
    X = convolutional_block(X, f = 3, filters = [128, 128, 512], stage = 3, block='a', s = 2)
    X = identity_block(X, 3, [128, 128, 512], stage=3, block='b')
    X = identity_block(X, 3, [128, 128, 512], stage=3, block='c')
    X = identity_block(X, 3, [128, 128, 512], stage=3, block='d')

    # Stage 4
    X = convolutional_block(X, f = 3, filters = [256, 256, 1024], stage = 4, block='a', s = 2)
    X = identity_block(X, 3, [256, 256, 1024], stage=4, block='b')
    X = identity_block(X, 3, [256, 256, 1024], stage=4, block='c')
    X = identity_block(X, 3, [256, 256, 1024], stage=4, block='d')
    X = identity_block(X, 3, [256, 256, 1024], stage=4, block='e')
    X = identity_block(X, 3, [256, 256, 1024], stage=4, block='f')

    # Stage 5
    X = convolutional_block(X, f = 3, filters = [512, 512, 2048], stage = 5, block='a', s = 2)
    X = identity_block(X, 3, [512, 512, 2048], stage=5, block='b')
    X = identity_block(X, 3, [512, 512, 2048], stage=5, block='c')

    # AVGPOOL
    X = AveragePooling2D(pool_size=(2,2), name="avg_pool")(X)

    # output layer
    X = Flatten()(X)
    X = Dense(classes, activation='softmax', name='fc' + str(classes), kernel_initializer = glorot_uniform(seed=0))(X)
    
    # Create model
    model = Model(inputs = X_input, outputs = X, name='ResNet50')

    return model

**Inception Network (2014)**

Inception Networks are used to apply different convolutions at an activation step, and concatenate all the outputs into one layer (channel concatenation)

Motivation: significantly reduces # of parameters needed

### **Data Augmentation**

Motivation for Data Augmentation: Increase training data size

Methods:
- Mirroring
- Random Cropping (identity, rotating, or shearing)
- Color Shifting (RGB) (scalar or PCA)


### **Bias and Variance**

How to reduce Bias:
- Deeper (bigger) network

How to reduce Variance:
- More training data
- Regularization (dropout)

### **Problem with Deep Networks**

1. Vanishing / Exploding Gradients
- as you backprop from the final layer to the first layer, multiplying by the weight matrix on each step can cause the gradient to decrease / increase exponentially quickly 
- addressed using Batch Normalization or using Skip Connections


### **Object Localization**

Most Image Classification Tasks require Classification with Localization (or Detection for multiple objects)

In general for localization tasks, the final layer will output the classes + the points of the bounding box (of the object)

#### **Defining the target label y**

For a multi-class object detection problem, the y label may look like:

y = [pc, bx, by, bh, bw, c1, c2, c3]

Where
- pc: is there any object?
- bx: x coordinate of the bounding box center
- by: y coordinate of the bounding box center
- bh: height of the bounding box
- bw: width of the bounding box
- c1: is it class 1?
- c2: is it class 2?
- c3: is it class 3?

#### **Landmark Detection**

Landmarks are points of interest you want located within an image (e.g. for a face recognition task, the landmarks may be the eyes, mouth, nose, etc., and you may want to mark multiple points along each feature)

Each x-y coordinate on the landmark becomes a labeled point

E.g. if you have 64 landmark points, then you have 64x2 points + 1 indicator point = 129 labels

*Used widely in AR (e.g. Snapchat Filters)

#### **Sliding Windows Detection**

If you have a classifier trained on closely cropped images of your object, you can slide a window across every position in the image and classify each crop to find the object

You might use different window sizes and iterate through the image multiple times



#### **Dense Layers as Convolution Layers**

A Dense (or Fully Connected) layer can be converted into a Convolution Layer by the following:

Example
- Previous Layer: [5 x 5 x 16] tensor
- You want to replace a Dense Layer w/ 400 Nodes
- Replace with 400 filters, each a [5 x 5 x 16] kernel
- Output is a [1 x 1 x 400]

### **YOLO (You only look once) (2015)**

Key Feature:
- Only need to iterate through the image once to find the object

**Labels for training**

For each grid cell (sliding windows):
- y = [pc, bx, by, bh, bw, c1, c2, c3]
- if you have 3x3 grid cells: target output is [3 x 3 x 8]


**Specifying the position**

For each grid cell, the top left is (0,0) and the bottom right is (1,1)

#### **Intersection over Union**

Sliding windows may not always perfectly encapsulate an object (i.e localize it)

Intersection over Union (IoU) is [size of intersection] / [size of union] of the predicted window and the object's true bounding box

Typically, if IoU >= 0.5, then deem as "correct" (can be some other threshold)
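
A minimal IoU sketch, assuming boxes are given as corner coordinates `(x1, y1, x2, y2)`. That box format is a choice made for this example; the labels elsewhere in these notes use center, height and width instead.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2) corners."""
    # Corners of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    intersection = max(0.0, x2 - x1) * max(0.0, y2 - y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # intersection 1, union 7
print(iou((0, 0, 2, 2), (0, 0, 2, 2)))  # identical boxes
print(iou((0, 0, 1, 1), (2, 2, 3, 3)))  # disjoint boxes
```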

#### **Non-max suppression**

When an object spans multiple windows, multiple bounding boxes may be detected. Simply keep the one with the highest confidence (pc), and remove the others that overlap it heavily

In Code:
- For each output prediction: Discard all boxes with pc <= 0.6
- While there are any remaining boxes:
  - Pick the box with the largest pc, output as prediction
  - Discard any remaining box with IoU >= 0.5 with the box output in the previous step
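
The steps above can be sketched for a single class as follows. The corner-coordinate box format `(x1, y1, x2, y2)`, the helper names and the sample boxes are assumptions for illustration; the thresholds are the 0.6 and 0.5 values from the list.

```python
import numpy as np

def box_iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = w * h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def non_max_suppression(boxes, scores, score_threshold=0.6, iou_threshold=0.5):
    """Return indices of the boxes kept, highest pc first."""
    # Discard all boxes with pc <= score_threshold, then sort by pc descending
    candidates = [int(i) for i in np.argsort(scores)[::-1] if scores[i] > score_threshold]
    keep = []
    while candidates:
        best = candidates.pop(0)          # box with the largest pc
        keep.append(best)
        # Discard remaining boxes that overlap the chosen box too much
        candidates = [i for i in candidates
                      if box_iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.3]
kept = non_max_suppression(boxes, scores)
print(kept)  # box 1 overlaps box 0 heavily, box 3 has a low pc
```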

#### **Anchor Boxes**

When you have overlapping objects with the same (or similar) mid point, non-max suppression may remove one of the valid objects.

How to use Anchor Boxes:
- when you have 2 overlapping objects:
- a sample y is concatenated with the 2 objects
  - e.g. y = [pc, bx, by, bh, bw, c1, c2, c3, pc, bx, by, bh, bw, c1, c2, c3]
  - output may look like [3 x 3 x 16]


### **Face Recognition**

Two kinds of task:

**Face Verification**
- Input: Image, Name ID
- Output: whether or not the input image is that of the claimed person
- 1:1 task (easy)

**Face Recognition**
- Has a database of K persons
- Input: image
- Output: ID if the image is any of the K persons (or "not recognized")
- 1:K task (difficult)

#### **One Shot Learning**

Learning from one example to recognize the person again (typically you only have 1 training sample)

Also, every time you want to add a new face, if using a deep learning classifier, you need to add one more node to the softmax layer (retraining every time does not work well)

Instead -> learn a "similarity" function -> d(img1, img2)

#### **Siamese Network (2014)**

- Input: Image
- Use sequence of Conv, Pooling, and FC Layers
- Obtain the encoding (embedding vector) $f(x^{(1)})$
- $ d(x^{(1)}, x^{(2)}) = || f(x^{(1)}) - f(x^{(2)}) ||^2 $
- Learning parameters such that:
  - if $x^{(i)}$ and $x^{(j)}$ are the same person, $|| f(x^{(i)}) - f(x^{(j)}) ||^2$ is small
  - if $x^{(i)}$ and $x^{(j)}$ are different people, $|| f(x^{(i)}) - f(x^{(j)}) ||^2$ is large
- We can do this using the Triplet Loss (Anchor, Positive, Negative)
- $$ L(A, P, N) = \max(||f(A) - f(P)||^2 - ||f(A) - f(N)||^2 + \alpha, 0) $$
  - $\alpha$ is the margin
- Note: Choose triplets that are "hard" to train
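
A small NumPy sketch of the triplet loss on precomputed encodings. The 2-D embedding vectors and the margin value of 0.2 are made up for illustration. The second call shows a "hard" triplet, where the negative is almost as close to the anchor as the positive and the loss is non-zero.

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """max(||f(A) - f(P)||^2 - ||f(A) - f(N)||^2 + alpha, 0) for one triplet."""
    pos_dist = np.sum((f_a - f_p) ** 2)
    neg_dist = np.sum((f_a - f_n) ** 2)
    return float(max(pos_dist - neg_dist + alpha, 0.0))

anchor = np.array([0.0, 0.0])
positive = np.array([0.0, 0.1])
easy_negative = np.array([1.0, 0.0])   # far from the anchor
hard_negative = np.array([0.0, 0.2])   # nearly as close as the positive

easy = triplet_loss(anchor, positive, easy_negative)
hard = triplet_loss(anchor, positive, hard_negative)
print(easy, hard)
```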


#### **Face Verification and Binary Classification**

Face Verification problem can be treated as a Binary Classification problem (1:1 mapping)

Input: 2 images

### **Neural Style Transfer**

Input: Content C, Style S

Output: Generated Image G (Content C in Style S)

Cost Function
$$ J(G) = \alpha  J_{content}(C, G) + \beta  J_{style}(S, G)$$


Steps:
1. Initialize G randomly (white noise)
  - G: 100 x 100 x 3 (RGB)
2. Use gradient descent to minimize J(G)
  - $ G := G - \frac{\partial}{\partial G} J(G)$

**Content Cost Function**

- Say you use hidden layer l to compute content cost
- Use pre-trained ConvNet (e.g. VGG network)
- Let a\[l](C) and a\[l](G) be the activation of layer l on the images
- If  a\[l](C) and a\[l](G) are similar, both images have similar content

$$ J_{content}(C, G) = \frac{1}{2} ||a^{[l](C)} - a^{[l](G)}||^2 $$

**Style Cost Function**

- Say you are using layer l's activation to measure "style"
- Define style as **correlation** between activations across channels
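- Let $a^{[l]}_{i,j,k}$ = activation at (i=H, j=W, k=C). The style (Gram) matrix $G^{[l]}$ is $n_C^{[l]} \times n_C^{[l]}$, with $G^{[l]}_{kk'} = \sum_i \sum_j a^{[l]}_{i,j,k} a^{[l]}_{i,j,k'}$

$$ J_{style}^{[l]}(S,G) = \frac{1}{(2n_H^{[l]}n_W^{[l]}n_C^{[l]})^2}\sum_k \sum_{k'}\left(G_{kk'}^{[l](S)} - G_{kk'}^{[l](G)}\right)^2 $$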
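
A NumPy sketch of the per-layer style cost, using the squared difference between the Gram matrices of the style and generated activations. Random arrays stand in for the activations of a pre-trained network, and the function names are hypothetical.

```python
import numpy as np

def gram_matrix(a):
    """Gram (style) matrix of activations a with shape (n_H, n_W, n_C); result is (n_C, n_C)."""
    n_h, n_w, n_c = a.shape
    flat = a.reshape(n_h * n_w, n_c)   # one row per spatial position
    return flat.T @ flat               # channel-by-channel correlations

def layer_style_cost(a_S, a_G):
    """Sum of squared Gram differences, scaled by 1 / (2 * n_H * n_W * n_C)^2."""
    n_h, n_w, n_c = a_S.shape
    G_S, G_G = gram_matrix(a_S), gram_matrix(a_G)
    return float(np.sum((G_S - G_G) ** 2) / (2 * n_h * n_w * n_c) ** 2)

rng = np.random.default_rng(0)
a_S = rng.normal(size=(4, 4, 3))   # stand-in for style image activations
a_G = rng.normal(size=(4, 4, 3))   # stand-in for generated image activations

cost_same = layer_style_cost(a_S, a_S)
cost_diff = layer_style_cost(a_S, a_G)
print(cost_same, cost_diff)
```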

### **Conv 3D**

- In 3D, you will have height, width and depth (e.g. 14 x 14 x 14)
- Filters are also 3D (5 x 5 x 5)
- (14 x 14 x 14 x 1) * (5 x 5 x 5 x 1) with 16 filters -> 10 x 10 x 10 x 16
