# Structuring Machine Learning Projects

## Introduction to ML strategy

### Orthogonalization

- Fit training set well on cost function. (bigger network, better optimization algorithm)
- Then, fit dev set well on cost function. (regularization, bigger training set)
- Then, fit test set well on cost function. (bigger dev set)
- Then, perform well in real world. (change dev set or cost function)

## Setting up your goal

### Single number evaluation metric

- Precision: of the examples recognized as cats, what % actually are cats?
- Recall: what % of actual cats are correctly recognized?
- F1 score: the harmonic mean ("average") of precision and recall.
    - $F_1 = \dfrac{2}{\dfrac{1}{P}+\dfrac{1}{R}}$
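
A minimal sketch of the three metrics above, assuming binary labels where `1` means "cat". The function name and the toy labels are made up for illustration.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, F1) for binary labels where 1 means 'cat'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # 2PR / (P + R) is algebraically the same as 2 / (1/P + 1/R).
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical toy labels.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]
p, r, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Collapsing precision and recall into one number makes it easy to rank classifiers without weighing two columns by eye.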

### Satisficing and optimizing metrics

- Ex. maximize accuracy subject to running time $\le 100$ ms.
- $N$ metrics: $1$ optimizing, $N-1$ satisficing.
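
The selection rule can be sketched as a filter followed by a max. The candidate models and their numbers below are hypothetical; only the 100 ms threshold comes from the example above.

```python
# Hypothetical candidates: (name, dev accuracy, running time in ms).
candidates = [
    ("A", 0.90, 80),
    ("B", 0.92, 95),
    ("C", 0.95, 1500),
]


def pick_model(models, max_ms=100):
    """Keep models that meet the running-time threshold, then maximize accuracy."""
    feasible = [m for m in models if m[2] <= max_ms]
    if not feasible:
        return None  # nothing meets the threshold
    return max(feasible, key=lambda m: m[1])


best = pick_model(candidates)
print(best)
```

Model C has the highest accuracy but fails the threshold, so it is never considered. Among the models that pass, only accuracy matters.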

### Train/dev/test distributions

- Choose a dev set and test set to reflect data you expect to get in the future and consider important to do well on.
- Dev and test set must come from the same distribution.

### Size of dev/test sets

- Set your test set to be big enough to give high confidence in the overall performance of your system.
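
One rough way to judge "big enough" is the width of a confidence interval on test accuracy. The sketch below assumes independent test examples and uses the normal approximation to the binomial; the function name and the 95% accuracy figure are illustrative.

```python
import math


def accuracy_margin(accuracy, n_examples, z=1.96):
    """Half-width of an approximate 95% interval (normal approximation)."""
    return z * math.sqrt(accuracy * (1 - accuracy) / n_examples)


# The margin shrinks with the square root of the test set size.
for n in (1_000, 10_000, 100_000):
    print(n, round(accuracy_margin(0.95, n), 4))
```

If two systems differ by less than this margin, the test set is probably too small to tell them apart.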

### When to change dev/test sets and metrics

- If doing well on your metric and dev/test set does not correspond to doing well on your application, change your metric and/or dev/test set.

## Comparing to human-level performance

### Why human-level performance

- While ML is worse than humans, you can:
    - Get labeled data from humans.
    - Gain insight from manual error analysis. (Why did a person get this right?)
    - Better analysis of bias/variance.
    
### Avoidable bias

- Human-level error serves as a proxy for Bayes error.
- Gap between human and training error: avoidable bias.
- Gap between training and dev error: variance.
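
A small sketch of the two gaps, with error rates given as fractions. The function name and the example numbers are illustrative, and "focus on the larger gap" is a rule of thumb rather than a strict procedure.

```python
def diagnose(human_error, train_error, dev_error):
    """Return (avoidable bias, variance, which gap is larger)."""
    avoidable_bias = train_error - human_error
    variance = dev_error - train_error
    focus = "bias" if avoidable_bias > variance else "variance"
    return avoidable_bias, variance, focus


# Same training and dev error, different human-level error.
print(diagnose(human_error=0.01, train_error=0.08, dev_error=0.10))
print(diagnose(human_error=0.075, train_error=0.08, dev_error=0.10))
```

The two calls share the same training and dev error, yet the suggested focus differs because the human-level proxy for Bayes error moved.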

### Two fundamental assumptions of supervised learning

- You can fit the training set pretty well ~ avoidable bias.
- Training set performance generalizes pretty well to dev/test set ~ variance.
- Avoidable bias
    - Train a bigger model.
    - Train longer / use better optimization algorithms.
    - NN architecture / hyperparameters search.
- Variance (gap between training and dev error)
    - More data.
    - Regularization.
    - NN architecture / hyperparameters search.

## Error analysis

### Carrying out error analysis

- Look at dev examples to evaluate ideas.
- Ex. cat detection
    - Dogs being recognized as cats.
    - Big cats (lions, panthers, etc.) being recognized as cats.
    - Blurry images.
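
Error analysis of this kind usually ends in a tally. The sketch below assumes you have reviewed a sample of misclassified dev examples by hand and tagged each one; the records and tag names are hypothetical.

```python
from collections import Counter

# Hypothetical notes from manually reviewing misclassified dev examples;
# one example may carry several tags.
reviewed = [
    {"id": 1, "tags": ["dog"]},
    {"id": 2, "tags": ["blurry"]},
    {"id": 3, "tags": ["big_cat", "blurry"]},
    {"id": 4, "tags": ["blurry"]},
    {"id": 5, "tags": []},  # none of the listed categories
]


def tag_fractions(examples):
    """Fraction of reviewed errors carrying each tag, most common first."""
    counts = Counter(tag for ex in examples for tag in ex["tags"])
    return {tag: n / len(examples) for tag, n in counts.most_common()}


fractions = tag_fractions(reviewed)
for tag, frac in fractions.items():
    print(f"{tag}: {frac:.0%}")
```

Each fraction is a ceiling on how much dev error could drop if that category were fixed completely, which helps decide which idea is worth pursuing.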
    
### Cleaning up incorrectly labeled data

- DL algorithms are quite robust to random (not systematic) errors in the training set.
- Consider errors due to incorrect labels vs. errors due to other causes.
- Apply same process to your dev and test sets to make sure they continue to come from the same distributions.
- Consider examining examples your algorithm got right as well as ones it got wrong.
- Train and dev/test data may now come from slightly different distributions.

### Build your first system quickly, then iterate

- Set up dev/test set and metric.
- Build initial system quickly.
- Use bias/variance analysis & error analysis to prioritize next steps.

## Mismatched training and dev/test set

### Training and testing on different distributions

- Ex. cat classifier (200,000 images from webpages and 10,000 from the mobile app)
    - Train: 200,000 from web + 5,000 from mobile.
    - Dev: 2,500 from mobile.
    - Test: 2,500 from mobile.
    - This is to ensure dev & test sets come from the same distribution.
- Ex. speech recognition
    - Training: purchased data, smart speaker control, voice keyboard.
    - Dev/test: speech-activated rearview mirror.
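
The cat split above can be sketched as follows. The function name and the placeholder records are assumptions; only the counts come from the example.

```python
import random


def split_mismatched(web, mobile, n_mobile_train, seed=0):
    """All web data plus part of mobile go to train; the rest of mobile
    is split evenly into dev and test."""
    rng = random.Random(seed)
    mobile = list(mobile)
    rng.shuffle(mobile)
    train = list(web) + mobile[:n_mobile_train]
    held_out = mobile[n_mobile_train:]
    half = len(held_out) // 2
    return train, held_out[:half], held_out[half:]


# Placeholder records standing in for images.
web = [("web", i) for i in range(200_000)]
mobile = [("mobile", i) for i in range(10_000)]
train, dev, test = split_mismatched(web, mobile, n_mobile_train=5_000)
print(len(train), len(dev), len(test))
```

Dev and test contain only mobile data, so the target the team aims at is the distribution the application actually sees.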
    
### Bias and variance with mismatched data distribution

- Training-dev set: same distribution as training set, but not used for training.
- Gap between human-level and training error? avoidable bias.
- Gap between training and training-dev error? variance.
- Gap between training-dev and dev error? data mismatch.
- Gap between test and dev error? degree of overfitting to dev set.
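
The four gaps can be read off a list of error rates. In the sketch below the function name and all numbers are hypothetical.

```python
def error_gaps(human, train, train_dev, dev, test):
    """Name each gap between consecutive error rates."""
    return {
        "avoidable_bias": train - human,
        "variance": train_dev - train,
        "data_mismatch": dev - train_dev,
        "dev_overfitting": test - dev,
    }


# Hypothetical error rates, as fractions.
gaps = error_gaps(human=0.04, train=0.07, train_dev=0.08, dev=0.14, test=0.145)
largest = max(gaps, key=gaps.get)
for name, gap in gaps.items():
    print(f"{name}: {gap:.3f}")
print("largest gap:", largest)
```

With these numbers the jump from training-dev to dev error dominates, which points at data mismatch rather than bias or variance.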

### Addressing data mismatch

- Carry out manual error analysis to try to understand the differences between training and dev/test sets.
- Make training data more similar, or collect more data similar to dev/test sets.
    - Artificial data synthesis.
    
## Learning from multiple tasks

### Transfer learning (from A to B)

- Tasks A and B have the same input $x$.
- You have a lot more data for task A than task B.
- Low-level features from A could be helpful for learning B.

### Multi-task learning

- Training on a set of tasks that could benefit from having shared lower-level features.
- Usually, the amount of data you have for each task is quite similar.
- Can train a big enough neural network to do well on all the tasks.
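
One common way to train a single network on several tasks is to sum a per-task loss over the output units. The sketch below is an assumption, not something these notes specify: it uses binary cross-entropy per task and skips labels marked `None`, so partially labeled examples can still be used. All names and numbers are hypothetical.

```python
import math


def multitask_loss(predictions, labels):
    """Mean over examples of the summed per-task binary cross-entropy.

    labels[i][j] is 0, 1, or None when task j is unlabeled for example i;
    unlabeled entries are skipped.
    """
    total = 0.0
    for y_hat, y in zip(predictions, labels):
        for p, t in zip(y_hat, y):
            if t is None:
                continue
            total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(predictions)


# Two examples, three tasks; the first example has no label for task 3.
preds = [[0.9, 0.2, 0.7], [0.1, 0.8, 0.5]]
labels = [[1, 0, None], [0, 1, 1]]
loss = multitask_loss(preds, labels)
print(round(loss, 4))
```

Unlike softmax classification, each example can be positive for several tasks at once, so every output gets its own independent loss term.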

## End-to-end deep learning

Speech recognition example
- Audio $\rightarrow$ features $\rightarrow$ phonemes $\rightarrow$ words $\rightarrow$ transcript vs. audio $\rightarrow$ transcript.

Machine translation example
- English $\rightarrow$ text analysis $\rightarrow$ $\dots$ $\rightarrow$ French vs. English $\rightarrow$ French.

Pros
- Let the data speak.
- Less hand-designing of components needed.

Cons
- May need a large amount of data.
- Excludes potentially useful hand-designed components.