## Hyperparameters

- Use random sampling (rather than a fixed grid) to choose values such as the number of layers and the number of features.
- Sample each hyperparameter on a scale that matches its range.
    - For example, $\alpha = 0.0001 \dots 1$
        - Sample on a log scale so that each decade $0.0001, 0.001, 0.01, 0.1, 1$ is covered equally.
    - For example, $\beta = 0.9 \dots 0.999$
        - Sample $1-\beta$ on a log scale over $0.1, 0.01, 0.001$
- Panda: babysit one model.
- Caviar: train many models in parallel.
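
The log-scale sampling described above can be sketched with the Python standard library. The helper names `sample_log_uniform` and `sample_beta` and the fixed seed are assumptions made for illustration; they are not part of the original notes.

```python
import math
import random

def sample_log_uniform(low, high, rng=random):
    """Draw a value whose base-10 logarithm is uniform on [log10(low), log10(high)]."""
    exponent = rng.uniform(math.log10(low), math.log10(high))
    return 10 ** exponent

def sample_beta(low=0.9, high=0.999, rng=random):
    """Sample beta by drawing 1 - beta on a log scale, as in the notes."""
    one_minus_beta = sample_log_uniform(1 - high, 1 - low, rng)
    return 1 - one_minus_beta

rng = random.Random(0)  # fixed seed only to make the sketch repeatable
alphas = [sample_log_uniform(1e-4, 1.0, rng) for _ in range(5)]
betas = [sample_beta(rng=rng) for _ in range(5)]
```

Sampling $\alpha$ uniformly on $[0.0001, 1]$ instead would put about 90% of the draws in $[0.1, 1]$, which is why the log scale is used.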

## Batch normalization

- Normalize activations.
    - Given some intermediate values in neural network $z^{(1)} \dots z^{(m)}$
    - $\mu = \dfrac{1}{m}\displaystyle\sum_{i}z^{(i)}$
    - $\sigma^{2} = \dfrac{1}{m}\displaystyle\sum_{i}(z^{(i)}-\mu)^{2}$
    - $z_{norm}^{(i)} = \dfrac{z^{(i)}-\mu}{\sqrt{\sigma^{2}+\epsilon}}$
    - $\tilde{z}^{(i)} = \gamma z_{norm}^{(i)} + \beta$
- For example, if $\gamma = \sqrt{\sigma^{2}+\epsilon}$ and $\beta = \mu$, then $\tilde{z}^{(i)} = z^{(i)}$ (the normalization is undone).
- Use $\tilde{z}^{(i)}$ instead of ${z}^{(i)}$ 
- Unlike the inputs, hidden activations should not be forced to be ~ $N(0,1)$; $\gamma$ and $\beta$ let the network learn a different mean and variance.

$X \xrightarrow{w^{[1]}, b^{[1]}} z^{[1]} \xrightarrow{\beta^{[1]}, \gamma^{[1]}} \tilde{z}^{[1]} \rightarrow a^{[1]} = g^{[1]}(\tilde{z}^{[1]}) \xrightarrow{w^{[2]}, b^{[2]}} z^{[2]} \xrightarrow{\beta^{[2]}, \gamma^{[2]}} \tilde{z}^{[2]} \rightarrow a^{[2]} \rightarrow \dots$ 
- parameters: $w, b, \beta, \gamma$
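
The normalization steps above can be sketched for a single hidden unit over one batch. This is a minimal illustration in plain Python; the function name `batch_norm`, the sample values, and the choice of `eps` are assumptions.

```python
import math

def batch_norm(z, gamma, beta, eps=1e-8):
    """Normalize a list of pre-activations z, then scale by gamma and shift by beta."""
    m = len(z)
    mu = sum(z) / m
    var = sum((zi - mu) ** 2 for zi in z) / m  # sigma^2
    z_norm = [(zi - mu) / math.sqrt(var + eps) for zi in z]
    z_tilde = [gamma * zn + beta for zn in z_norm]
    return z_tilde, mu, var

z = [1.0, 2.0, 3.0, 4.0]  # hypothetical values of z for one unit
z_tilde, mu, var = batch_norm(z, gamma=1.0, beta=0.0)

# Choosing gamma = sqrt(sigma^2 + eps) and beta = mu recovers the original z.
eps = 1e-8
recovered, _, _ = batch_norm(z, gamma=math.sqrt(var + eps), beta=mu, eps=eps)
```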

Working with mini-batches
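- Parameters: $w, \beta, \gamma$ (no need for $b$: subtracting the mini-batch mean cancels any constant added to $z^{[l]}$, and $\beta^{[l]}$ takes over the role of the bias)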
- $z^{[l]} = w^{[l]}a^{[l-1]}$
- $\tilde{z}^{[l]} = \gamma^{[l]}z_{norm}^{[l]} + \beta^{[l]}$
- For $t = 1 \dots$ num_mini_batches
    - Compute forward prop on $X^{\{t\}}$ 
        - In each layer, use BN to replace $z^{[l]}$ with $\tilde{z}^{[l]}$
    - Use backprop to compute $dw^{[l]}, d\beta^{[l]}, d\gamma^{[l]}$ (no need for $db^{[l]}$)
    - Update $w^{[l]} = w^{[l]} - \alpha dw^{[l]}, \beta^{[l]} = \beta^{[l]} - \alpha d\beta^{[l]}, \gamma^{[l]} = \gamma^{[l]} - \alpha d\gamma^{[l]}$
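
A short check of why $b^{[l]}$ can be dropped: adding a constant to every $z$ in the mini-batch leaves the normalized values unchanged. The helper `normalize` and the sample numbers are assumptions for illustration.

```python
import math

def normalize(z, eps=1e-8):
    """Return z_norm for one unit over a mini-batch."""
    m = len(z)
    mu = sum(z) / m
    var = sum((zi - mu) ** 2 for zi in z) / m
    return [(zi - mu) / math.sqrt(var + eps) for zi in z]

z = [0.5, -1.0, 2.0, 3.5]  # hypothetical w * a values for one unit
b = 7.0                    # hypothetical bias

without_bias = normalize(z)
with_bias = normalize([zi + b for zi in z])

# The bias shifts the mean by b, so it disappears after mean subtraction.
max_diff = max(abs(p - q) for p, q in zip(without_bias, with_bias))
```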
    
Batch normalization as regularization
- Each mini-batch is scaled by mean/variance computed on just that mini-batch.
- This adds some noise to $z^{[l]}$
- This has slight regularization effect.

Batch normalization at test time
- $\mu, \sigma^{2}$: estimate using an exponentially weighted average across mini-batches during training, then use those estimates at test time.
- $X^{\{1\}} \rightarrow \mu^{\{1\}[l]}, \sigma^{\{1\}[l]}, X^{\{2\}} \rightarrow \mu^{\{2\}[l]}, \sigma^{\{2\}[l]}, X^{\{3\}} \rightarrow \mu^{\{3\}[l]}, \sigma^{\{3\}[l]}, \dots$
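
The running estimates can be sketched as follows for one unit. The decay value `0.9`, the initial values `0.0` and `1.0`, the absence of bias correction, and all names are assumptions for illustration, not something the notes specify.

```python
import math

def batch_stats(z):
    """Mean and variance of one unit over a mini-batch."""
    mu = sum(z) / len(z)
    var = sum((zi - mu) ** 2 for zi in z) / len(z)
    return mu, var

def update_running(running, batch_value, decay=0.9):
    """Exponentially weighted average of a statistic across mini-batches."""
    return decay * running + (1 - decay) * batch_value

running_mu, running_var = 0.0, 1.0  # assumed starting values
mini_batches = [[1.0, 3.0], [2.0, 4.0], [0.0, 2.0]]  # hypothetical z values
for batch in mini_batches:
    mu, var = batch_stats(batch)
    running_mu = update_running(running_mu, mu)
    running_var = update_running(running_var, var)

def batch_norm_test_time(z, gamma, beta, eps=1e-8):
    """Normalize a single test example with the running estimates."""
    z_norm = (z - running_mu) / math.sqrt(running_var + eps)
    return gamma * z_norm + beta
```

At test time a single example has no mini-batch to compute statistics from, which is why the stored averages are used instead.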

## Softmax regression

- Let $C$ be number of classes.
- Last layer (softmax layer) has $n^{[L]}= C$ units.
    - $z^{[L]} = w^{[L]}a^{[L-1]} + b^{[L]}$
    - $t = e^{(z^{[L]})}$
    - $a^{[L]} = \dfrac{e^{(z^{[L]})}}{\displaystyle\sum_{j}t_{j}}, a_{i}^{[L]} = \dfrac{t_{i}}{\displaystyle\sum_{j}t_{j}}$
- Softmax regression generalizes logistic regression to $C$ classes.
- Loss function
    - $L(\hat{y}, y) = -\displaystyle\sum_{j}y_{j}\log\hat{y}_{j}$
- Cost function
    - $J(w^{[1]}, b^{[1]}, \dots) = \dfrac{1}{m}\displaystyle\sum_{i=1}^{m}L(\hat{y}^{(i)}, y^{(i)})$
    
$z^{[L]} \rightarrow a^{[L]} = \hat{y} \rightarrow L(\hat{y}, y)$
- Backprop: $dz^{[L]} = \hat{y} - y$
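
The softmax activation, the loss, and the output-layer gradient above can be sketched for a single example. The max-shift inside `softmax` is a standard numerical-stability step added here as an assumption (it does not change the result); the values of `z` and `y` are hypothetical.

```python
import math

def softmax(z):
    """a_i = t_i / sum_j t_j with t = exp(z)."""
    shift = max(z)  # subtracting a constant leaves the ratios unchanged
    t = [math.exp(zi - shift) for zi in z]
    total = sum(t)
    return [ti / total for ti in t]

def cross_entropy(y_hat, y):
    """L(y_hat, y) = -sum_j y_j * log(y_hat_j) for a one-hot y."""
    return -sum(yj * math.log(pj) for yj, pj in zip(y, y_hat) if yj > 0)

z = [5.0, 2.0, -1.0, 3.0]  # z^[L] for C = 4 classes
y = [0, 1, 0, 0]           # one-hot label

y_hat = softmax(z)
loss = cross_entropy(y_hat, y)
dz = [p - t for p, t in zip(y_hat, y)]  # dz^[L] = y_hat - y
```

With $C = 2$, `softmax([z, 0.0])[0]` equals the logistic sigmoid of `z`, which is the sense in which softmax regression generalizes logistic regression.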

# Structuring Machine Learning Projects

## Introduction to ML strategy

### Orthogonalization

- Fit training set well on cost function. (bigger network, better optimization algorithm)
- Then, fit dev set well on cost function. (regularization, bigger training set)
- Then, fit test set well on cost function. (bigger dev set)
- Then, perform well in real world. (change dev set or cost function)

## Setting up your goal

### Single number evaluation metric

### Satisficing and optimizing metric

- Ex. maximize accuracy subject to running_time $\le 100ms$
- $N$ metrics: $1$ optimizing, $N-1$ satisficing (each only has to meet a threshold).
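
Model selection with one optimizing and one satisficing metric can be sketched as a filter followed by a maximum. The candidate models, their numbers, and the function name `pick_model` are hypothetical.

```python
candidates = [
    {"name": "A", "accuracy": 0.90, "running_time_ms": 80},
    {"name": "B", "accuracy": 0.92, "running_time_ms": 95},
    {"name": "C", "accuracy": 0.95, "running_time_ms": 1500},
]

def pick_model(models, max_time_ms=100):
    """Keep models that meet the satisficing threshold, then maximize accuracy."""
    feasible = [m for m in models if m["running_time_ms"] <= max_time_ms]
    if not feasible:
        return None  # no model satisfies the constraint
    return max(feasible, key=lambda m: m["accuracy"])

best = pick_model(candidates)  # model "B": most accurate among those under 100 ms
```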

### Train/dev/test distributions

- Choose a dev set and test set to reflect data you expect to get in the future and consider important to do well on.
- Dev and test set must come from the same distribution.

### Size of dev/test sets

- Set your test set to be big enough to give high confidence in the overall performance of your system.

### When to change dev/test sets and metrics

- If doing well on your metric and dev/test set does not correspond to doing well on your application, change your metric and/or dev/test set.

## Comparing to human-level performance



# Feature selection

- Remove unneeded, irrelevant, and redundant attributes from the data.
- Redundant features can mislead the model. (Especially k-nearest neighbors.)
- Irrelevant features can cause the model to overfit.

## Filter method

- Assign score to each feature.
- Ex. Chi squared test, information gain, correlation coefficient scores
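
A filter method can be sketched by scoring each feature with the absolute Pearson correlation against the target and ranking by score. The feature names, the data, and the function names are hypothetical; treating a constant feature as score `0.0` is an assumption made to avoid dividing by zero.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no information
    return cov / (sx * sy)

def rank_features(features, target):
    """Score each feature independently of any model, highest score first."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

features = {
    "size": [1.0, 2.0, 3.0, 4.0, 5.0],
    "noise": [2.0, -1.0, 0.5, -2.0, 1.5],
    "constant": [1.0, 1.0, 1.0, 1.0, 1.0],
}
target = [2.0, 4.1, 5.9, 8.2, 10.0]

ranked = rank_features(features, target)
```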

## Embedded method

- Learn which features are contributing to the accuracy of model.
- Ex. regularization (LASSO, elastic net, ridge regression)
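
The difference between an L1 and an L2 penalty can be illustrated under a simplifying assumption: with an orthonormal design matrix, the LASSO solution for $\frac{1}{2}\|y - Xw\|^{2} + \lambda\|w\|_{1}$ is a soft-threshold of the least-squares weights, while ridge with penalty $\frac{\lambda}{2}\|w\|^{2}$ only rescales them. The weights and names below are hypothetical.

```python
def soft_threshold(w, lam):
    """LASSO under an orthonormal design: shrink by lam, small weights become exactly 0."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0

def ridge_shrink(w, lam):
    """Ridge under an orthonormal design: every weight shrinks, none becomes 0."""
    return w / (1.0 + lam)

ols_weights = {"f1": 2.0, "f2": -1.2, "f3": 0.3}  # hypothetical least-squares weights
lam = 0.5

lasso = {name: soft_threshold(w, lam) for name, w in ols_weights.items()}
ridge = {name: ridge_shrink(w, lam) for name, w in ols_weights.items()}

# Features whose LASSO weight is non-zero are the ones the model "selected".
selected = [name for name, w in lasso.items() if w != 0.0]
```

In this sketch only the L1 penalty removes a feature outright; ridge keeps all three with smaller weights.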

# Precision and recall

- True Negative: ground truth was negative and prediction was negative.
- True Positive: ground truth was positive and prediction was positive.
- False Negative: ground truth was positive but prediction was negative.
- False Positive: ground truth was negative but prediction was positive.

Precision
- What percentage of positive predictions were correct?
    - Ex. Of examples recognized as cat, what % actually are cats?
- True Positive / (True Positive + False Positive)

Recall
- What percentage of positive cases did you catch?
    - Ex. What % of actual cats are correctly recognized?
- True Positive / (True Positive + False Negative)

F1 score
- Harmonic mean of precision and recall.
- $\dfrac{2}{\dfrac{1}{P}+\dfrac{1}{R}}$

Accuracy
- What percentage of predictions were correct?
- (True Positive + True Negative) / (True Negative + True Positive + False Negative + False Positive)
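
The four metrics above can be computed directly from the confusion-matrix counts. The function name and the example counts are hypothetical; returning `0.0` when a denominator is zero is an assumed convention.

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, F1, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 / (1 / precision + 1 / recall) if precision and recall else 0.0
    else:
        f1 = 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# 8 cats found, 2 non-cats flagged as cats, 4 cats missed, 86 non-cats rejected.
metrics = classification_metrics(tp=8, fp=2, fn=4, tn=86)
```

Here precision is $8/10$, recall is $8/12$, F1 is $8/11$, and accuracy is $94/100$.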

False Positive Vs. False Negative
- In a medical exam, a False Negative is dangerous to the patient, so a False Positive is preferred.
- In spam filtering, a False Positive is annoying to users, so a False Negative is preferred.

# Imbalanced data in classification?

1. Collect more data.
2. Undersample the over-represented class.
3. Change the performance metric.
    - Accuracy is not the right metric to use when data is imbalanced.
    - Look at precision / recall / F1 score.
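
A small sketch of two points from the list: why accuracy misleads on imbalanced data, and what undersampling looks like. The class sizes, the always-negative baseline, and the helper `undersample` are hypothetical.

```python
import random

def undersample(examples, labels, seed=0):
    """Randomly keep as many examples of each class as the rarest class has."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(examples, labels):
        by_class.setdefault(y, []).append(x)
    n = min(len(xs) for xs in by_class.values())
    balanced = [(x, y) for y, xs in by_class.items() for x in rng.sample(xs, n)]
    rng.shuffle(balanced)
    return balanced

labels = [1] * 10 + [0] * 990  # 1% positive class
examples = list(range(len(labels)))

# A classifier that always predicts negative looks accurate but catches nothing.
predictions = [0] * len(labels)
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = true_positives / sum(labels)

balanced = undersample(examples, labels)
```

The baseline reaches 99% accuracy with a recall of 0, which is the reason to look at precision, recall, or F1 instead.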

# Bias and variance

## Why human-level performance

- While ML is worse than humans, you can:
    - Get labeled data from humans.
    - Gain insight from manual error analysis. (why did a person get this right?)
    - Better analysis of bias/variance.
    
## Avoidable bias

- Use human-level error as a proxy for Bayes error.
- Gap between human and training error: avoidable bias.
- Gap between training and dev error: variance.
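
The two gaps can be worked through with a small helper. The error values (in percent) and the rule of focusing on the larger gap are assumptions used for illustration.

```python
def diagnose(human_error, train_error, dev_error):
    """Return (avoidable bias, variance, which one to work on first)."""
    avoidable_bias = train_error - human_error
    variance = dev_error - train_error
    focus = "bias" if avoidable_bias > variance else "variance"
    return avoidable_bias, variance, focus

# Same training and dev error, different human-level error (all in percent).
case_1 = diagnose(human_error=1.0, train_error=8.0, dev_error=10.0)
case_2 = diagnose(human_error=7.5, train_error=8.0, dev_error=10.0)
```

In the first case the avoidable bias (7 points) dominates, so fitting the training set better comes first; in the second case the variance (2 points) is the larger gap.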

## Two fundamental assumptions of supervised learning

- You can fit the training set pretty well ~ avoidable bias.
- Training set performance generalizes pretty well to dev/test set ~ variance.
- Avoidable bias
    - Train a bigger model.
    - Train longer / use better optimization algorithms.
    - NN architecture / hyperparameters search.
- Variance
    - More data.
    - Regularization.
    - NN architecture / hyperparameters search.
- Increasing $\lambda$ decreases variance; decreasing $\lambda$ decreases bias.
- More features decrease bias but increase variance. Fewer features decrease variance but increase bias.



# Sparse data

- L1 regularization.
- Linear regression if the relationship is linear.