# Training and Testing on Different Distributions

Suppose that we train a model using cat pictures from the web to predict whether pictures uploaded by users on their mobile phones are cat pictures.

Suppose that we have only 10,000 pictures from our user app, and around 200,000 pictures crawled from the internet. 

## Option 1

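First, we combine the two datasets, shuffle them, and split the result into train/dev/test sets. This should be avoided: only 10,000 of the 210,000 pooled images (under 5%) come from the mobile app, so the dev and test sets would consist mostly of web images. We would not be evaluating our model against an accurate portrayal of the real-world data that we will be receiving.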
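To make the imbalance concrete, here is a small arithmetic sketch. The dataset sizes come from the example above; the dev set size of 2,500 is a hypothetical value chosen only for illustration.

```python
# Dataset sizes taken from the example above.
n_web = 200_000
n_mobile = 10_000

# After pooling and shuffling, every split has the same expected mix.
mobile_fraction = n_mobile / (n_web + n_mobile)

# Hypothetical dev set size, chosen only for illustration.
dev_size = 2_500
expected_mobile_in_dev = dev_size * mobile_fraction

print(f"Mobile fraction of pooled data: {mobile_fraction:.2%}")
print(f"Expected mobile images in the dev set: {expected_mobile_in_dev:.0f} of {dev_size}")
```

Under these assumptions, the vast majority of dev set images would be web images, which is not the distribution we care about.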

## Option 2

Use the 200,000 images from the web as the training set, and use the mobile app data for the dev/test sets. Training on a distribution that differs from the dev/test distribution introduces its own problems, but we will worry about those later.
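A minimal sketch of this split, using `(source, index)` tuples as stand-ins for the actual images. The function name `split_option_2` and the 50/50 division of mobile data between dev and test are assumptions made for illustration, not something the notes specify.

```python
import random

N_WEB, N_MOBILE = 200_000, 10_000


def split_option_2(n_web, n_mobile, seed=0):
    """Return (train, dev, test) lists of (source, index) identifiers."""
    rng = random.Random(seed)
    web = [("web", i) for i in range(n_web)]
    mobile = [("mobile", i) for i in range(n_mobile)]
    rng.shuffle(mobile)

    half = n_mobile // 2  # assumed 50/50 dev/test split of the mobile data
    train = web           # training set: web images only
    dev = mobile[:half]   # dev set: mobile images only
    test = mobile[half:]  # test set: mobile images only
    return train, dev, test


train, dev, test = split_option_2(N_WEB, N_MOBILE)
print(len(train), len(dev), len(test))
```

The important property is that dev and test contain only mobile app images, so they reflect the distribution the model will face in production.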

# Bias and Variance with Mismatched Data

Suppose that we have a cat classifier example, where human error is around 0% (near perfect).

If we have a training error of 1% and a dev error of 10%, we would normally say that our model has high variance. However, if the training set and dev set come from different distributions, we cannot draw the same conclusion, because the dev set might simply be more difficult than the training set.

In order to separate the two effects (variance and the difference in distributions), we define a new set of data called a **training-dev set**. It has the same distribution as the training set, but the neural network is not trained on it.

What we do:

1. Perform the normal train/dev/test split
2. Randomly shuffle the training set and set aside a piece of it as the training-dev set
3. Train the neural network on the remaining training data, without the training-dev set
4. Evaluate on the training-dev set AND the dev set for error analysis
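Step 2 can be sketched as follows. The helper name `carve_training_dev` and the 5% holdout fraction are hypothetical choices for illustration; the notes do not prescribe a size for the training-dev set.

```python
import random


def carve_training_dev(train_examples, holdout_fraction=0.05, seed=0):
    """Shuffle the training examples and hold out a slice as the training-dev set.

    Returns (train, training_dev). Both come from the same distribution,
    but only `train` is used to fit the model.
    """
    shuffled = list(train_examples)
    random.Random(seed).shuffle(shuffled)

    n_holdout = round(len(shuffled) * holdout_fraction)
    training_dev = shuffled[:n_holdout]
    train = shuffled[n_holdout:]
    return train, training_dev


# Integer IDs stand in for the 200,000 web images.
train, training_dev = carve_training_dev(range(200_000))
print(len(train), len(training_dev))
```

Because the training-dev set is drawn from the same shuffled pool as the training set, any gap between training error and training-dev error cannot be explained by a difference in distributions.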

Suppose that:
- Error on training set is 1%
- Error on training-dev set is 9%
- Error on dev set is 10%

This means that the neural network is overfitting (a variance problem): it does not generalize well even to the training-dev set, which comes from the same distribution as the training data.

Now suppose that:
- Error on training set is 1%
- Error on training-dev set is 1.5%
- Error on dev set is 10%

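Now we don't have a variance problem, but rather a **data mismatch** problem: the model generalizes well to unseen data from the training distribution (1% to 1.5%), but not to the dev distribution (1.5% to 10%).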

Again, suppose that:
- Error on training set is 10%
- Error on training-dev set is 11%
- Error on dev set is 12%

And assume that human-level error is approximately 0%. This performance is indicative of an avoidable bias problem, since the training error itself is far above human-level error.

Finally, suppose that:
- Error on training set is 10%
- Error on training-dev set is 11%
- Error on dev set is 20%

In this case, the avoidable bias is still quite high and the variance is small, but there is also a data mismatch problem.
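The four scenarios above can be read as three gaps between consecutive error figures. The sketch below computes them, assuming human-level error of 0% throughout; the function name `error_gaps` and the rule of reporting the largest gap as the dominant problem are illustrative simplifications, not a formal diagnostic.

```python
def error_gaps(human, train, train_dev, dev):
    """Split the errors (in percent) into three gaps."""
    return {
        "avoidable_bias": train - human,   # human level -> training set
        "variance": train_dev - train,     # training set -> training-dev set
        "data_mismatch": dev - train_dev,  # training-dev set -> dev set
    }


# Error figures (percent) from the four scenarios, with human error assumed 0%.
scenarios = {
    "scenario_1": dict(human=0, train=1, train_dev=9, dev=10),
    "scenario_2": dict(human=0, train=1, train_dev=1.5, dev=10),
    "scenario_3": dict(human=0, train=10, train_dev=11, dev=12),
    "scenario_4": dict(human=0, train=10, train_dev=11, dev=20),
}

dominant = {}
for name, errors in scenarios.items():
    gaps = error_gaps(**errors)
    dominant[name] = max(gaps, key=gaps.get)
    print(name, gaps, "-> largest gap:", dominant[name])
```

Note that the last scenario has two large gaps (avoidable bias of 10 points and data mismatch of 9 points), so looking only at the largest gap would hide the second problem.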

## General Principles

Key quantities to look out for:

1. Human error
2. Training set error
3. Training-dev set error
4. Dev set error

An even more general formulation can be tabulated:

![Data Mismatch](./images/data-mismatch.png)

There is no fully systematic way to address data mismatch, but there are a few things we can try.

# How to Address Data Mismatch?

We can carry out **manual error analysis** to understand how the training set differs from the dev/test sets.

Another way is to collect more data that is similar to the dev/test sets and include it in the training set.

## Artificial Data Synthesis

The example used in the lecture is to create new artificial sound data by mixing (adding) car noise into a clean recording of "The quick brown fox jumps over the lazy dog".
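A minimal sketch of this kind of synthesis, using plain Python lists as audio samples. The sine wave standing in for speech, the uniform random noise standing in for car noise, the 16 kHz sample rate, and the `noise_gain` parameter are all assumptions made for illustration; a real pipeline would load recorded audio instead.

```python
import math
import random

SAMPLE_RATE = 16_000  # assumed sample rate in Hz


def mix(clean, noise, noise_gain=0.3):
    """Overlay noise on clean audio sample by sample (equal lengths required)."""
    if len(clean) != len(noise):
        raise ValueError("clean and noise must have the same number of samples")
    return [c + noise_gain * n for c, n in zip(clean, noise)]


rng = random.Random(0)

# One second of a 440 Hz tone as a stand-in for the clean speech recording.
clean = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE) for t in range(SAMPLE_RATE)]

# One second of uniform random noise as a stand-in for recorded car noise.
car_noise = [rng.uniform(-1.0, 1.0) for _ in range(SAMPLE_RATE)]

synthetic = mix(clean, car_noise)
print(len(synthetic))
```

The noise is added on top of the clean signal rather than placed before or after it, so the synthesized clip has the same duration as the original.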

Note of caution: suppose we have 10,000 hours of recorded data and just one hour of car noise. We could repeat the car noise 10,000 times in order to add it to the data. The audio will sound perfectly fine to the human ear, but there is a risk that the algorithm will overfit to that one hour of car noise.

No matter what artificial data we have, there is always a chance that we may overfit to a very small subset of our entire dataset.
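The caution about repeated car noise can be illustrated with a toy count. In this sketch each hour of audio is represented by an integer ID rather than real samples, and `tile_noise` is a hypothetical helper that repeats the available noise end to end.

```python
from collections import Counter


def tile_noise(noise_segments, target_length):
    """Repeat the noise segments end to end until target_length is reached."""
    return [noise_segments[i % len(noise_segments)] for i in range(target_length)]


clean_hours = 10_000  # hours of recorded speech, from the example above
noise_hours = 1       # hours of distinct car noise, from the example above

noise_ids = list(range(noise_hours))          # one ID per distinct hour of noise
tiled = tile_noise(noise_ids, clean_hours)    # noise paired with each clean hour

reuse = Counter(tiled)
print("distinct noise hours:", len(reuse))
print("times each noise hour is reused:", dict(reuse))
```

Every one of the 10,000 synthesized hours shares the same single hour of noise, so the model sees only a tiny slice of the space of possible car noise even though the dataset looks large.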