# PA 3: Evaluation and Comparison on Deep Learning Models

## 1. Understanding Evaluation Metrics (20 points)


### 1.1 Please explain the following commonly used evaluation metrics. (10 points)
- Accuracy: Accuracy is how often the model predicts the outcome correctly. It is the number of samples correctly classified divided by the total number of samples. It answers the question: how often is the model right? Accuracy is simple to calculate and easy to understand, but it is not very helpful if our classes are imbalanced or if what we really care about is predicting events that rarely occur.
- Precision: Precision measures how often the model's positive predictions are correct. It is the number of true positive predictions divided by the total number of positive predictions (true positives plus false positives). It answers the question: how often are the positive predictions correct? Precision addresses the problem that accuracy has when dealing with unbalanced classes. It is especially helpful when the cost of a false positive is high and we can tolerate missing some positives (false negatives). Precision is best when we care more about "being right" than about detecting every positive.
- Recall: Recall measures how many of the actual positives the model correctly predicts as positive. It is the number of true positive predictions divided by the total number of actual positives (true positives plus false negatives). It answers the question: how well does a model find all instances of the positive class? Recall addresses the problem that accuracy has when dealing with unbalanced classes. It is especially helpful when the cost of a false negative is high and we can tolerate some false positives. Recall is best when we care more about detecting every positive than about "being right".
- F1 Score: The F1 score combines precision and recall into one value. Because precision and recall are both rates, it uses the harmonic mean: 2 * (Precision * Recall) / (Precision + Recall). The harmonic mean is pulled toward the lower of the two values, so a model cannot score well by doing well on only one of them. The result is a number between 0 and 1: 0 means the model found no true positives, and 1 means perfect precision and perfect recall. The F1 score is used to evaluate LLM accuracy as well as binary and multi-class classification problems (especially when classes are unbalanced). It is useful when we want to account for both precision and recall and the costs of false negatives and false positives are roughly equal. If one is more costly than the other, using precision or recall alone would be best.
- ROC Curve and AUC (Area Under the Curve): The ROC curve is a graph that plots the True Positive Rate (Recall) on the y axis against the False Positive Rate (1 - Specificity, that is, FP / (FP + TN); it can be thought of as the "False Alarm Rate") on the x axis. Each point on the curve corresponds to one classification threshold, so sweeping the threshold from high to low traces out the whole curve for a model. Curves for different models can then be compared on the same graph. The random baseline is a straight line from the bottom left of the graph to the top right. The more the ROC curve bows upward, away from the baseline, the better the model is. (A perfect ROC curve would run straight up the y axis to 1, then turn right and run horizontally.) This upward bow is quantified by the AUC metric. It measures the area under the curve, from 0.5 for the random baseline to 1 for a perfect model, and summarizes performance across all thresholds in one number. It allows us to determine which model is best (the largest AUC); the threshold itself is then chosen as a point on that model's curve.
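
To make the definitions above concrete, here is a minimal sketch in plain Python that computes each metric from the confusion counts and builds an ROC curve by sweeping the threshold. The function names, the toy labels, and the toy scores are all made up for illustration, and returning 0.0 when a denominator is zero is only one possible convention.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def safe_div(num, den):
    """Return num / den, or 0.0 if den is zero (an assumed convention)."""
    return num / den if den else 0.0


def classification_metrics(y_true, y_pred):
    """Compute the metrics from section 1.1 for hard 0/1 predictions."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)  # also the true positive rate
    return {
        "accuracy": safe_div(tp + tn, tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "fpr": safe_div(fp, fp + tn),  # false positive rate = 1 - specificity
    }


def roc_points(y_true, scores):
    """Sweep the threshold over every distinct score; return sorted (fpr, tpr) points."""
    points = {(0.0, 0.0)}  # threshold above every score: nothing predicted positive
    for threshold in set(scores):
        y_pred = [1 if s >= threshold else 0 for s in scores]
        m = classification_metrics(y_true, y_pred)
        points.add((m["fpr"], m["recall"]))
    return sorted(points)


def auc(points):
    """Area under the ROC curve using the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Toy data, invented purely for illustration.
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # one threshold = one ROC point

print(classification_metrics(y_true, y_pred))
print(auc(roc_points(y_true, scores)))
```

At the 0.5 threshold this toy model has 4 true positives, 1 false positive, 0 false negatives, and 3 true negatives, so accuracy is 0.875, precision is 0.8, and recall is 1.0. The AUC comes out to 0.9375 because only 1 of the 16 positive/negative pairs is ranked in the wrong order.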

### 1.2 Provide a practical scenario and explain which metric(s) should be chosen to assess the model performance in that scenario. (10 points)
Scenario: A binary classification model to identify tumors in mammogram images. The data used for this model would have unbalanced classes, since there are many more mammograms taken of healthy breast tissue than of tumors. (Of all women getting mammograms, only 0.5% have breast cancer.) This means that accuracy would not be a good metric: we could classify every image as tumor-free and would be correct 99.5% of the time. We could choose to use the F1 score if the cost of false negatives and false positives were similar. However, in this case, a false positive means a woman would have the stress of coming in for more diagnostic testing, but a false negative would prevent her from receiving possibly life-saving treatment. The cost of a false negative is much higher than the cost of a false positive, so recall would be our most important metric. We want to make sure we detect all the true positives, even if we end up with some false positives, because lives are at stake.

Recall cannot be pursued without limits, however. We could, in theory, classify every image as positive and we'd have 100% recall. This is the same problem as accuracy on the flip side: it would result in a large cost (and stress) burden as women are re-tested unnecessarily. We might, therefore, want to plot the model's ROC curve, which shows the recall and the false positive rate at every threshold. Because we want to maximize recall, we would probably lower the threshold so that more images are classified as positive. The ROC curve lets us compare candidate thresholds and pick one that reaches the recall we need at an acceptable false positive rate. If we have several candidate models, the AUC quantifies which one separates the classes best overall (in case it's not visually clear on the graph).
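
The arithmetic behind this scenario can be checked with a short sketch. It assumes 1,000 hypothetical screenings at the 0.5% prevalence quoted above (5 tumors, 995 healthy); the model scores and the helper names `rates` and `pick_threshold` are invented for illustration and are not taken from any real screening model.

```python
def rates(y_true, y_pred):
    """Return (accuracy, recall, false_positive_rate) for binary labels (1 = tumor)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return accuracy, recall, fpr


def pick_threshold(y_true, scores, min_recall):
    """Return the highest threshold whose recall is at least min_recall, or None.

    The highest qualifying threshold flags the fewest images, so it keeps
    false positives as low as possible for the required recall.
    """
    for threshold in sorted(set(scores), reverse=True):
        y_pred = [1 if s >= threshold else 0 for s in scores]
        _, recall, _ = rates(y_true, y_pred)
        if recall >= min_recall:
            return threshold
    return None


# 1,000 hypothetical screenings: 5 tumors, 995 healthy (0.5% prevalence).
y_true = [1] * 5 + [0] * 995

# Two degenerate classifiers from the discussion above.
always_negative = rates(y_true, [0] * 1000)  # high accuracy, zero recall
always_positive = rates(y_true, [1] * 1000)  # full recall, every healthy case flagged
print(always_negative)
print(always_positive)

# Invented scores: most tumors score high, one is subtle (0.35),
# and a few healthy images look suspicious (0.5).
scores = [0.95, 0.9, 0.8, 0.6, 0.35] + [0.5] * 10 + [0.3] * 40 + [0.05] * 945

threshold = pick_threshold(y_true, scores, min_recall=1.0)
y_pred = [1 if s >= threshold else 0 for s in scores]
print(threshold, rates(y_true, y_pred))
```

With these made-up scores, the always-negative classifier reaches 99.5% accuracy with 0% recall, while requiring 100% recall pushes the threshold down to 0.35 and flags 10 of the 995 healthy cases for follow-up. That is the trade-off the ROC curve makes visible.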

Sources:
- "Accuracy vs. precision vs. recall in machine learning: what's the difference?" https://www.evidentlyai.com/classification-metrics/accuracy-precision-recall#:~:text=Accuracy%20is%20a%20metric%20that,often%20the%20model%20is%20right%3F
- "Understanding and Applying F1 Score: AI Evaluation Essentials with Hands-On Coding Example" https://arize.com/blog-course/f1-score/#:~:text=F1%20score%20is%20often%20preferred,number%20of%20non%2Dspam%20emails.
- "How to explain the ROC curve and ROC AUC score?" https://www.evidentlyai.com/classification-metrics/explain-roc-curve
- "What Percentage of Abnormal Mammograms Are Cancer?" https://www.medicinenet.com/what_percentage_of_abnormal_mammograms_are_cancer/article.htm