# STRUCTURING MACHINE LEARNING PROJECTS

<h1><span style="background:yellow;padding: 15px;">Orthogonalization</span></h1>
<br/>
    
- = Know exactly **what to change** in order to achieve **what effect**
- **orthogonal** = at 90° to each other
- Example:
  - old TV set with a lot of knobs to adjust the picture
  - each knob does only one thing and does not affect other knobs   
  -> ***orthogonal controls*** = at 90° to each other   
  -> much easier to adjust one feature and then another separately
  - same with cars: steering wheel, accelerator, brakes; each has only one effect (angle, speed...)
  - imagine if the steering wheel changed 0.3*angle - 0.8*speed!   

# *Machine Learning chain of assumptions*
   
- Each step has its own knobs to adjust   
  => improving NNs = diagnosing exactly what the bottleneck to your system's performance is
  
   
1. Fit model on **training set** until it does well on the cost function -> roughly ***human-level performance***
  - train a bigger network
  - switch to better optimization algorithm (Adam...)
  - etc.
     
     
2. Adjust model on **dev set** until it does well on the cost function
    - regularization
    - get a bigger training set  
    
    
3. Hope model will do well on **test set** -> cost function
    - get a bigger dev set: if the model does well on the dev set but not on the test set, it is over-tuned to the dev set  
    
    
4. Hope model will do well in **production**
    - change the dev set -> dev set distribution not set correctly
    - change the cost function -> cost function not measuring the right thing
    
    
- **Note**. Andrew Ng advises against early stopping because it is not orthogonal: it affects both how well you fit the training set and how well you do on the dev set (it is often used to improve dev set performance)
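
A minimal sketch of the chain above as a diagnostic. The per-stage error values, the target thresholds, and the names `KNOBS` and `first_bottleneck` are hypothetical and not part of the notes; the code only illustrates that each stage has its own set of knobs.

```python
# Hypothetical lookup table: which knobs to try for each failing stage.
KNOBS = {
    "training": ["train a bigger network", "switch to a better optimizer (e.g. Adam)"],
    "dev": ["add regularization", "get a bigger training set"],
    "test": ["get a bigger dev set"],
    "production": ["change the dev set", "change the cost function"],
}


def first_bottleneck(errors, targets):
    """Return the first stage whose error exceeds its target, or None."""
    for stage in ("training", "dev", "test", "production"):
        if errors[stage] > targets[stage]:
            return stage
    return None


# Assumed example numbers: the training set is fit well, the dev set is not.
errors = {"training": 0.02, "dev": 0.09, "test": 0.10, "production": 0.10}
targets = {"training": 0.03, "dev": 0.04, "test": 0.04, "production": 0.05}

stage = first_bottleneck(errors, targets)
print(stage, KNOBS[stage])
```

Checking the stages in order reflects the chain: there is little point tuning dev-set knobs before the training set is fit well.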

# *Using a single evaluation metric*

A single metric makes it easier to compare models and iterate quickly, especially at the beginning of the project
   
   
### Satisficing and Optimizing metric

- **Optimizing**: metric we want to optimize
- **Satisficing**: metric that only has to reach a certain threshold; beyond that we don't care
- If you have N metrics:
  - choose **one** as optimizing -> maximize it
  - the others are satisficing -> as long as they reach a threshold, you don't care how they do
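
A minimal sketch of this selection rule, assuming accuracy as the optimizing metric and running time as the single satisficing metric. The candidate models, the 100 ms threshold, and the name `pick_model` are hypothetical.

```python
# Hypothetical candidates: (name, accuracy, running time in ms).
candidates = [
    ("A", 0.90, 80),
    ("B", 0.92, 95),
    ("C", 0.95, 1500),
]

MAX_RUNNING_TIME_MS = 100  # satisficing threshold (assumed value)


def pick_model(models, max_time_ms):
    """Maximize accuracy among the models that meet the running-time threshold."""
    feasible = [m for m in models if m[2] <= max_time_ms]
    if not feasible:
        return None
    return max(feasible, key=lambda m: m[1])


chosen = pick_model(candidates, MAX_RUNNING_TIME_MS)
print(chosen)
```

Model C has the best accuracy but misses the threshold, so it is never considered; among the remaining models only accuracy matters.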
   
##### Example

- **Precision**: of images recognized as cats, what percentage actually are cats?
- **Recall**: of all images of cats, what percentage are actually recognized as such?   
  
-> If one model does better on precision while the other is better on recall, which one is better?    
-> 2 evaluation metrics make it difficult to quickly pick one   
  
  
- **F1 score**: ***harmonic mean*** of precision and recall. `2 / (1/P + 1/R)`    
  
-> Well defined dev set + single evaluation metric => **speed up iterating**   
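
A short sketch of how the F1 score turns two metrics into one ranking. The precision and recall values for classifiers A and B are made up for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall: 2 / (1/P + 1/R)."""
    if precision == 0 or recall == 0:
        return 0.0  # the harmonic mean is taken as 0 when either term is 0
    return 2 / (1 / precision + 1 / recall)


# Hypothetical classifiers as (precision, recall):
# A is better on recall, B is better on precision.
classifiers = {"A": (0.95, 0.90), "B": (0.98, 0.85)}

scores = {name: f1_score(p, r) for name, (p, r) in classifiers.items()}
best_by_f1 = max(scores, key=scores.get)
print(scores, best_by_f1)
```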

##### Example

- Likewise for accuracy + running time -> combine them into a single metric, e.g. `accuracy - 0.5*running_time`

# *Train/Dev/Test set distributions*

- Choose **dev** and **test** sets:
  - with **same distribution**
  - that reflect **real world data**
  - that you consider **important to do well** on
  
  
<table align=left style="border:1px solid black">
    <tr>
        <td style="text-align:left; width: 320px;">Setting up <b>dev set + evaluation metric</b></td>
        <td style="text-align:left; width: 350px;">= <b>defining</b> what target you want to hit</td>
    </tr>
    <tr>
        <td style="text-align:left; width: 320px;">Setting up <b>dev set + test set</b> to same distribution</td>
        <td style="text-align:left; width: 350px;">= <b>aiming</b> at that target</td>
    </tr>
    <tr>
        <td style="text-align:left; width: 320px;">Setting up <b>training set</b></td>
        <td style="text-align:left; width: 350px;">= how well you will <b>hit</b> that target</td>
    </tr>
</table>
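
One simple way to give the dev and test sets the same distribution is to pool all the data you care about, shuffle it, and only then split it. A minimal sketch, where the `region_a`/`region_b` sources and the 50/50 split are assumptions for illustration:

```python
import random


def split_dev_test(examples, dev_fraction=0.5, seed=0):
    """Shuffle the pooled examples, then cut them into dev and test sets."""
    pooled = list(examples)
    random.Random(seed).shuffle(pooled)
    cut = int(len(pooled) * dev_fraction)
    return pooled[:cut], pooled[cut:]


# Hypothetical data pooled from two sources.
examples = [("region_a", i) for i in range(50)] + [("region_b", i) for i in range(50)]

dev_set, test_set = split_dev_test(examples)
print(len(dev_set), len(test_set))
```

Splitting by source instead (dev = `region_a`, test = `region_b`) would give the two sets different distributions, which is what the first bullet above warns against.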

# *Size of the dev/test sets*

- Choose test set big enough to give **high confidence** in overall system performance
  -> Allows you to evaluate how good your final system is
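
To get a feel for "big enough", here is a rough sketch of the margin of error of a measured accuracy as a function of test set size. It uses the normal approximation to the binomial distribution, which is an assumption added here and not part of the notes; the 90% accuracy is an example value.

```python
import math


def accuracy_margin(accuracy, n_examples, z=1.96):
    """Approximate 95% margin of error for an accuracy measured on n examples."""
    return z * math.sqrt(accuracy * (1 - accuracy) / n_examples)


for n in (1_000, 10_000, 100_000):
    print(n, round(accuracy_margin(0.90, n), 4))
```

The margin shrinks with the square root of the test set size, so telling apart systems that differ by a fraction of a percent takes far more test examples than telling apart systems that differ by several percent.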

# *When to change dev/test sets and metrics*

- metric does not match human evaluation   
see details in video

#### Orthogonalization
- First define your target: evaluation metric
- Then worry about how well to do on this metric

# *Why human-level performance?*

- Before reaching human performance (ex 92%), **progress is fast**
- Then progress **slows down** when it surpasses human performance
- Ultimately reaches **optimum** level of performance (ex 97%) and **plateaus**
  = **Bayes Optimal Error**: cannot be surpassed by any function mapping x -> y   
  ex. noisy audio, blurry image
  

#### Why compare to human level performance?

- Humans quite good at a lot of tasks.
- So long as ML is worse than humans, you can:
  - get labeled data from humans
  - gain insight from manual error analysis: why did a person get this right when system is wrong?
  - better analysis of bias and variance

# *Avoidable bias*
  
- Depending on what we estimate human error to be, we focus on different solutions
  

<table align=left style="border:1px solid black">
    <tr>
        <td style="text-align:left; width:200px;"></td>
        <td style="text-align:left; width:200px;"></td>
        <th style="text-align:left; width:300px;">Problem</th>
        <th style="text-align:left; width:300px;">Solution</th>
    </tr>
    <tr>
        <th style="text-align:left;">Humans</th>
        <td style="text-align:left; color:red;"><b>1%</b></td>
        <td rowspan=3 style="text-align:left;">
            huge gap between human and training shows that system is not fitting well on training set
        </td>
        <td rowspan=3 style="text-align:left;">
            <b>focus on bias</b>: deeper network, more hidden units, etc.
        </td>
    </tr>
    <tr>
        <th style="text-align:left;">Training error</th>
        <td style="text-align:left; color:red;"><b>8%</b></td>
    </tr>
    <tr>
        <th style="text-align:left;">Dev error</th>
        <td style="text-align:left;">10%</td>
    </tr>
</table>   
     
<BR/>    

<table align=left style="border:1px solid black">
    <tr>
        <td style="text-align:left; width:200px;"></td>
        <td style="text-align:left; width:200px;"></td>
        <th style="text-align:left; width:300px;">Problem</th>
        <th style="text-align:left; width:300px;">Solution</th>
    </tr>
    <tr>
        <th style="text-align:left;">Humans</th>
        <td style="text-align:left;"><b>7.5%</b></td>
        <td rowspan=3 style="text-align:left;">
            images so blurry that even humans cannot recognize them: the training error is already close to human level, so little bias is avoidable
        </td>
        <td rowspan=3 style="text-align:left;">
            <b>focus on variance</b>: regularization, etc.
        </td>
    </tr>
    <tr>
        <th style="text-align:left;">Training error</th>
        <td style="text-align:left; color:red;"><b>8%</b></td>
    </tr>
    <tr>
        <th style="text-align:left;">Dev error</th>
        <td style="text-align:left; color:red;"><b>10%</b></td>
    </tr>
</table>     
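
The two tables can be reproduced with a small calculation, using human error as a proxy for Bayes error and taking the 10% figure as the dev-set error. The function name `diagnose` and the rule "focus on the larger gap" are a simplification for illustration.

```python
def diagnose(human_error, training_error, dev_error):
    """Compare avoidable bias with variance and say which one to focus on."""
    avoidable_bias = training_error - human_error  # gap to human level
    variance = dev_error - training_error          # gap between training and dev
    focus = "bias" if avoidable_bias > variance else "variance"
    return avoidable_bias, variance, focus


print(diagnose(0.01, 0.08, 0.10))   # first table: focus on bias
print(diagnose(0.075, 0.08, 0.10))  # second table: focus on variance
```

The training and dev errors are identical in both cases; only the estimate of human error changes which problem is worth working on.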

# *Understanding human-level performance*

cf. details

# *Surpassing human-level performance*

# *Improving your model performance*

**= Reducing (avoidable) bias and variance**
- Different tools for each problem

### Avoidable bias
- train bigger model
- train longer/better optimization algorithms (Momentum, RMSProp, Adam)
- improve NN architecture/hyperparameters search (RNN, CNN, different activations...)

### Variance
- get more data (helps generalize better to dev set)
- regularization (L2, dropout, data augmentation)
- improve NN architecture/hyperparameters search (RNN, CNN, different activations...)

# *Error analysis*

### Error analysis

- manually examine errors made by the system to gain insights on what to do next

### Clean up incorrectly labeled data

- if you correct incorrectly labeled data in the dev set
  - you should also do it for the test set, so that they keep the same distribution
  - you don't need to also do it for training data

### Build your first system quickly, then iterate

# *Mismatched training and dev/test set*

### Training and testing on different distributions

### Bias and variance with mismatched data distributions

<img src="images/course2improvingnnsP2.png" width="950">

### Addressing data mismatch

- Carry out error analysis to try to understand the differences between training and dev/test sets
  - ex: the cause is noisy in-car audio
- Make the training data more similar to the dev/test sets, or collect more data similar to them
  - ex: find more audio with in-car noise, or synthesize it by adding a small amount of recorded in-car noise to clean audio
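
A toy sketch of the synthesis idea, with audio represented as plain lists of samples. The sample values, the `noise_gain` parameter, and the name `add_noise` are assumptions for illustration; real audio would normally be handled with an audio or array library.

```python
import random


def add_noise(clean, noise, noise_gain=0.3, rng=None):
    """Overlay a randomly chosen window of `noise` onto `clean`, sample by sample."""
    rng = rng or random.Random(0)
    if len(noise) < len(clean):
        # Tile the short noise clip so that it covers the whole clean signal.
        repeats = -(-len(clean) // len(noise))  # ceiling division
        noise = noise * repeats
    start = rng.randrange(len(noise) - len(clean) + 1)
    window = noise[start:start + len(clean)]
    return [c + noise_gain * n for c, n in zip(clean, window)]


clean_audio = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]
car_noise = [0.1, -0.2, 0.05]  # deliberately short noise recording

noisy_audio = add_noise(clean_audio, car_noise)
print(noisy_audio)
```

Tiling a very short noise clip, as done here, repeats the same noise over and over, so a model trained on such data may overfit to that particular recording.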

# *Learning from multiple tasks: transfer learning*

- Tasks A and B have the same input x
- You have a lot more data for Task A than for Task B
- Low level features from A could be helpful for learning B
  

- **little** data for a **specific** task, but **lots** of data for a similar more **generic** task
- **little** data for a **general** task, but **lots** of data for a similar more **specific** task    
  -> more data is always better
  - Example:
    - Low-level features (lines, dots, curves, small parts of objects)    
      = useful info about what images look like in general
    - can be helpful for **any** image related task
   
   
- **Example application**:
  - you train a nn for speech recognition (general task)
  - now you want to use it for wake-word detection (specific task)
  - **Process**:
    1. remove the output layer   
    2. replace it with a new output node, or with multiple new layers, specific to your task   
    3. depending on how much data you have, you might 
        - retrain only the last layers, 
        - or also retrain the earlier layers from the general task

<img src="images/transfer-learning.png" width="650">
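
The process above can be sketched without any deep learning framework by treating a network as a list of layers with a `trainable` flag. The `Layer` class, the layer names, and `adapt_for_new_task` are hypothetical; in a real framework the same steps would be done with that framework's own layer-freezing mechanism.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    trainable: bool = True


def adapt_for_new_task(pretrained, new_head, retrain_last=0):
    """Swap the output layer for `new_head` and freeze all but the last
    `retrain_last` pretrained layers."""
    body = pretrained[:-1]  # 1. remove the output layer
    n_frozen = len(body) - retrain_last
    adapted = [
        Layer(layer.name, trainable=(i >= n_frozen))
        for i, layer in enumerate(body)
    ]
    # 2. add the new layers, which are always trained on the new task
    adapted += [Layer(name) for name in new_head]
    return adapted


speech_model = [Layer("conv1"), Layer("conv2"), Layer("rnn"), Layer("transcript_out")]

# 3. with little wake-word data, retrain only the last pretrained layer
wake_word_model = adapt_for_new_task(
    speech_model, ["dense_new", "wake_word_out"], retrain_last=1
)
for layer in wake_word_model:
    print(layer)
```

With more data for the new task, `retrain_last` can be raised until every pretrained layer is retrained.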

# *Learning from multiple tasks: multi-task learning*

- Training on set of tasks that could benefit from having shared low-level features
- Output = vector of classes instead of only 1 class
  - ex: [has_red_light, has_pedestrian, has_stop_sign...]
- Usually same amount of data available for each task   
  -> combining datasets for each task => bigger dataset
- Multi-task learning usually does at least as well as training on individual tasks; the only exception is when the network is not big or deep enough
- If some examples are missing some labels, you can compute the cost over the labeled entries only, so that unlabeled entries do not influence it
  
  
- **Example**: autonomous driving, simultaneously recognizing signs, pedestrians, cars, etc.
  - an image labeled [1 0 0 1 0] contains a stop sign and a red traffic light
  - it's ok if some labels are missing for some examples [1 ? 0 1 ?]
  - 100,000 labeled images + 900,000 found online (not same distribution)
    - training set = 80,000 of your labeled images + 900,000 internet images
    - dev and test set = remaining 20,000 of your labeled images
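
A minimal sketch of a cost that ignores unlabeled entries, using binary cross-entropy per label and `None` in place of `?`. The predictions are made-up numbers, and averaging over the labeled entries is one possible normalization chosen here for simplicity.

```python
import math


def masked_bce(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy over labeled entries only; None marks a missing label."""
    total, count = 0.0, 0
    for row_true, row_pred in zip(y_true, y_pred):
        for t, p in zip(row_true, row_pred):
            if t is None:
                continue  # unlabeled entry: contributes nothing to the cost
            p = min(max(p, eps), 1 - eps)  # avoid log(0)
            total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
            count += 1
    return total / count if count else 0.0


y_true = [[1, 0, 0, 1, 0], [1, None, 0, 1, None]]
y_pred = [[0.9, 0.1, 0.2, 0.8, 0.1], [0.7, 0.5, 0.3, 0.6, 0.9]]

loss = masked_bce(y_true, y_pred)
print(loss)
```

Changing the predictions at the two unlabeled positions leaves `loss` unchanged, which is the property the bullet on missing labels asks for.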

# *End-to-end learning*

= multiple processing steps combined in a single one
  
- **traditional approach**: multi-steps
  - speech recognition: audio -> features -> phonemes -> words -> transcript  
- **deep learning approach**: single step from input to output
  - speech recognition: audio -> transcript
  - sometimes with a few intermediate steps  
- might need a **lot of data** before it works well
  - little data -> hand-designed components are very useful
  
  
- **Example**:
  - machine translation (MT) for pairs of languages with lots of aligned data available
  - English -> text analysis ... -> French **VS** English -> French


- **Pros and cons**:
  - **Pros**:
    - let the data speak: more machine reasoning than human reasoning
    - less hand-designing of components needed
  - **Cons**:
    - May need large amount of data
    - Excludes potentially useful hand-designed components
   
   
### Key question

Do you have sufficient data to learn a function of the complexity needed to map x directly to y?