# Conclusion

Model evaluation takes on an entirely new dimension for the Data Scientist. At some point, someone will say to you, "we want to increase engagement." That's fine, except that you need to agree on what that means in a very concrete way. Does it mean time on the page or number of pages visited? You need to check whether that data is being collected and whether the metric can be calculated from it. This should have been settled in the original CoNVO or in subsequent refinements of the problem, but it may have been missed.

Let us take an example from the very first module: you have a news site and you want people to read the articles. One way you might do that is to build a model that scores each story for each individual, producing a propensity to read the story. This is a complex classification/information retrieval problem. When building the model, you might conduct some cross validation studies to see how well it performs.

But "reading propensity" may not be the metric you're trying to optimize. In the A/B test, the performance metric can diverge from both the loss function and the performance metric of the model. Your organization may have a whole slate of engagement scores. Your model was not built to optimize all of them, and it was not built to do so in a dynamic context.

Put differently, there's a difference between the estimation of response variables (and associated loss functions) and evaluation metrics. The response variable of your model may be the probability that someone buys a certain product. The evaluation metric of the model may be purchase rates. These are not the same thing.
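To make the distinction concrete, here is a minimal sketch in plain Python. The numbers are invented for illustration and do not come from any real experiment. The model is scored offline with log loss on its predicted purchase probabilities, while the experiment is scored on the observed purchase rate in each arm. The function names `log_loss` and `purchase_rate` are our own helpers, not a library API.

```python
import math

def log_loss(y_true, p_pred, eps=1e-12):
    """Average negative log-likelihood of binary outcomes (the model's loss)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def purchase_rate(purchases):
    """Fraction of users who purchased (the evaluation metric)."""
    return sum(purchases) / len(purchases)

# Hypothetical held-out data: predicted purchase probabilities and outcomes.
p_pred = [0.9, 0.8, 0.3, 0.2, 0.6, 0.1]
y_true = [1, 1, 0, 0, 1, 0]

# Hypothetical purchase outcomes for users in each arm of an experiment.
control = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]
treatment = [1, 1, 0, 0, 1, 0, 1, 0, 1, 0]

model_loss = log_loss(y_true, p_pred)                     # how well probabilities fit
lift = purchase_rate(treatment) - purchase_rate(control)  # what the business sees

print(f"log loss: {model_loss:.3f}, purchase-rate lift: {lift:.2f}")
```

The two numbers answer different questions. A model can have a low log loss and still produce no lift, because the loss measures how well the probabilities match outcomes, not whether acting on them changes behavior.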

A/B testing may ultimately be used to test the thing that matters more, the organization's own metric, and so it validates the model only in a very indirect way.
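As an illustration of what the A/B test actually measures, the sketch below compares purchase rates between two arms with a two-proportion z-test. This is one common frequentist choice, not necessarily the analysis used elsewhere in this text, and the counts are made up. The helper `two_proportion_z_test` is our own name, written with only the standard library.

```python
import math

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for H0: both arms share one rate."""
    rate_a = success_a / n_a
    rate_b = success_b / n_b
    # Pooled rate under the null hypothesis of no difference.
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    # Two-sided tail probability of the standard normal, via erfc.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: arm A is the old experience, arm B uses the model.
z, p_value = two_proportion_z_test(200, 5000, 250, 5000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

Note that nothing in this calculation refers to the model's loss function or its cross validation score. The test only sees purchases, which is why it says little directly about the quality of the model itself.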

## Summary

Here are a few takeaway points:

1. High bias and high variance are not absolutes. No algorithm produces models that are always high bias. High bias or high variance exists in the context of the inputs, metaparameters, and data.
2. There is a lot of trial and error in "debugging" a machine learning model. The search space of possibilities is very large, so use EDA to inform your search. We knew to try a quadratic transformation on $x_1$ because we'd looked at the data. Note that some companies, DataRobot for example, are trying to automate the process by searching the grid of possibilities on huge clusters in the cloud. You might be able to accomplish something similar on your own.
3. Real-life curves, whether learning or validation, rarely look as neat as the textbook versions.
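The "search the grid of possibilities" idea in the second point can be sketched on a very small scale. The example below is an assumption-laden toy, not the chapter's dataset or any vendor's method: it generates synthetic quadratic data, then uses 5-fold cross validation to score every combination of a feature transformation (identity or square) and a ridge penalty. The helpers `fit_ridge_1d` and `cv_mse` are hypothetical names written for this sketch.

```python
import random

def fit_ridge_1d(xs, ys, lam):
    """Closed-form ridge slope for y ~ beta * x with no intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def cv_mse(xs, ys, transform, lam, k=5):
    """Mean squared error averaged over k held-out folds."""
    n = len(xs)
    fold_errors = []
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        beta = fit_ridge_1d([transform(xs[i]) for i in train_idx],
                            [ys[i] for i in train_idx], lam)
        errors = [(ys[i] - beta * transform(xs[i])) ** 2 for i in test_idx]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / k

# Synthetic data (an assumption for this sketch): y depends on x squared.
rng = random.Random(0)
xs = [rng.uniform(-3, 3) for _ in range(100)]
ys = [2 * x * x + rng.gauss(0, 1) for x in xs]

# The grid: every transformation paired with every penalty.
transforms = {"identity": lambda x: x, "square": lambda x: x * x}
lambdas = [0.0, 1.0, 10.0]

results = {
    (name, lam): cv_mse(xs, ys, f, lam)
    for name, f in transforms.items()
    for lam in lambdas
}
best_transform, best_lambda = min(results, key=results.get)
print(best_transform, best_lambda)
```

A brute-force grid like this finds the quadratic transformation only because it was put in the grid. Looking at the data first is what tells you which candidates are worth including.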

## Review


1. Describe the two cultures of model evaluation. How are they different?
2. What is the most typical evaluation metric for regression (value prediction) problems? What is the formula?
3. What is a confusion matrix? What does each element of the matrix represent?
4. Define and provide the formula for:
   1. accuracy
   2. error rate
   3. sensitivity/true positive rate
   4. specificity
   5. precision
5. How does 10-fold cross validation work? How would you apply it to linear or logistic regression?
6. What is the bias/variance tradeoff?
7. How does bias relate to underfitting?
8. How does variance relate to overfitting?
9. What are the five general ways in which a "model" can be improved?
10. Describe how to calculate learning curves. What do they tell you about your model?
11. What do learning curves look like when your current situation involves high bias?
12. What do learning curves look like when your current situation involves high variance?
13. If your learning curves indicate high bias, what does that suggest about getting more data? What should you do?
14. If your learning curves indicate high variance, what does that suggest about getting more data? What should you do?
15. What are validation curves used for? How do you interpret them?
16. What are some other ways you can improve your model under high bias/high variance?
17. What is regularization?
18. What is the difference between "backtesting" (cross validation) and A/B testing?
  