# Intro to ML: Model Performance

How do we know we can trust our models?

Say we're working with a model that predicts whether a patient has cancer. If the model predicts that a patient has cancer, we would want to know how sure the model is of this prediction.

Models make predictions, but our confidence in a model's ability to make **accurate** predictions comes from evaluating how well it performs on **practice (training) cases** and **new (test) cases**.

This homework explores methods to assess the accuracy of model predictions. 

## Key idea 💡
**“A model is only as good as the data it’s trained on — and as the metrics we use to judge it.”**

<p align="left">
    <img src = "https://www.explainxkcd.com/wiki/images/d/d7/flawed_data.png" width = "400">
</p>

Consider the following analogy: 

A student's performance on a calculus test is directly related to how much time they spent practicing and the types of problems they practiced. It's best to evaluate a student's calculus abilities using calculus problems! Don't use a geometry test to test abilities in calculus!

In this analogy: 

1. model = student
2. training data = the calculus problems the student practiced
3. metrics to test model performance = calculus test 

<p align="left">
    <img src = "https://dataedo-website.s3.amazonaws.com/cartoon/machine_learning.png?1654170935" width = "400">
</p>

### Q: In your own words, and given the analogy above, define Machine Learning.

### A:
YOUR ANSWER HERE

Hopefully this reinforces the idea of **Machine Learning**: an algorithm learns a relationship in the data so that it can make predictions when handed new data.

----

## Accuracy

Accuracy is defined as: 

$$\text{Accuracy}=\frac{\text{Correct Predictions}}{\text{Total Predictions}}$$

Though it's an intuitive first choice for a model performance metric, it can be misleading when working with imbalanced datasets, where one class is much more common than another. For example, predicting “does not have cancer” for everyone when 99% of patients do not have cancer yields 99% accuracy, yet it is not very helpful: the model never identifies a single patient who does have cancer. Accuracy alone doesn't tell us how well the model predicts when someone does have cancer!
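
As a rough illustration of this point, the sketch below builds a made-up set of 1,000 patients in which 1% have cancer (the labels are hypothetical, not taken from any dataset used in this homework) and scores a "model" that always predicts "does not have cancer". It assumes scikit-learn's `accuracy_score` and `recall_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 1 = has cancer, 0 = does not have cancer.
# 10 of 1,000 patients (1%) have cancer.
y_true_demo = np.array([1] * 10 + [0] * 990)

# A "model" that predicts "does not have cancer" for everyone
y_pred_demo = np.zeros_like(y_true_demo)

acc = accuracy_score(y_true_demo, y_pred_demo)
# Recall on the cancer class: of all patients with cancer, how many were found?
cancer_recall = recall_score(y_true_demo, y_pred_demo, pos_label=1)

print(f"Accuracy: {acc:.2%}")
print(f"Recall on the cancer class: {cancer_recall:.2%}")
```

The accuracy comes out at 99% while the recall on the cancer class is 0%, which is why accuracy by itself can hide a useless model.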

<p align="left">
    <img src = "https://media.licdn.com/dms/image/v2/C5612AQGpdX8HSIAEeA/article-cover_image-shrink_600_2000/article-cover_image-shrink_600_2000/0/1610873977828?e=2147483647&v=beta&t=gnytCodqB3H_WcT0Zz3NpDqHfbNNIL_OTctWWcB1iKE" width = "800">
</p>

## The Data Life Cycle in ML

A schematic of the training (learning) and testing (evaluation) procedures for an ML model.
<p align="left">
    <img src = "https://www.machinelearningplus.com/wp-content/uploads/2022/12/train_test_split-procedure.jpg" width = "800">
</p>

Imagine you’re studying for a test. If you only practice using the exact same questions that will be on the test, will you really know the material? Probably not — you’d just memorize the answers.

## Key idea 💡
**We want our models to learn patterns in data, not memorize the training examples.**

That’s why we divide our data into parts — to test if the model can handle new, unseen examples.

----

## Training and Testing
When we train a machine learning model, we split the data into two sets:
1. Training set: Used to teach the model patterns in the data.
2. Testing set: Used to evaluate how well the model performs on new data.

Typically, we allocate 70–80% of the data for training and 20–30% for testing.
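
A minimal sketch of such a split, assuming scikit-learn's `train_test_split` and a toy array of 100 made-up samples (the data and variable names here are illustrative only):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples with 2 features each, plus alternating 0/1 labels
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100) % 2

# Hold out 20% of the samples for testing; the rest is used for training
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

print("Training samples:", len(X_tr))
print("Testing samples:", len(X_te))

# The first feature is unique per sample, so it works as a sample ID:
# no sample should appear in both sets
train_ids = set(X_tr[:, 0].tolist())
test_ids = set(X_te[:, 0].tolist())
overlap = len(train_ids & test_ids)
print("Samples in both sets:", overlap)
```

The key property is that the two sets are disjoint, so the test score reflects examples the model never saw during training.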

### Generalization

Generalization is a model’s ability to make accurate predictions on data it has never seen before. Models that don't generalize well fall into one of two categories: overfitting or underfitting.

| Type         | Description                                             | Example Behavior                            |
| ------------ | ------------------------------------------------------- | ------------------------------------------- |
| Overfitting  | Model is too complex, memorizes noise                   | Perfect on training data, fails on new data |
| Underfitting | Model is too simple, doesn’t capture underlying pattern | Straight line through nonlinear data        |

<p align="left">
    <img src = "https://miro.medium.com/v2/resize:fit:1400/1*pXJJTOS0f0dqgnlleP10aA.jpeg" width = "800">
</p>
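
To make the two failure modes concrete, here is a small sketch that fits polynomials of increasing degree to noisy samples of a sine curve. The synthetic data, the noise level, and the degrees (1, 3, 12) are all assumptions chosen for illustration. A degree-1 fit is the "straight line through nonlinear data" case, while a high-degree fit has enough freedom to chase noise in the training points.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of one period of a sine curve (synthetic, for illustration)."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)
    return x, y

xs_train, ys_train = make_data(40)    # small training set
xs_test, ys_test = make_data(200)     # fresh data the fits never saw

train_mse, test_mse = {}, {}
for degree in (1, 3, 12):
    # Least-squares polynomial fit on the training data only
    poly = Polynomial.fit(xs_train, ys_train, deg=degree)
    train_mse[degree] = float(np.mean((poly(xs_train) - ys_train) ** 2))
    test_mse[degree] = float(np.mean((poly(xs_test) - ys_test) ** 2))
    print(f"degree {degree:2d}: train MSE = {train_mse[degree]:.3f}, "
          f"test MSE = {test_mse[degree]:.3f}")
```

Training error can only shrink as the degree grows, so a low training error on its own says little. Comparing it with the error on the held-out points is what exposes underfitting (both errors high) or overfitting (training error much lower than test error).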

### Q: What are the pros and cons of splitting your data 80% train, 20% test vs. 50% train, 50% test?

### A: 
YOUR ANSWER HERE

Below we train a classification algorithm to label patients as 'has cancer' or 'does not have cancer'. The training and testing accuracies are printed after the code.

In [8]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)

# split data into train (80%) and test (20%); random state for reproducibility 
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# train a logistic regression model -- it's typically used to answer binary classification problems
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)

# predict on both train and test sets
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

print(f"Training Accuracy: {accuracy_score(y_train, y_train_pred):.1%}")
print(f"Testing Accuracy: {accuracy_score(y_test, y_test_pred):.1%}")


Training Accuracy: 95.8%
Testing Accuracy: 95.6%


### Q: If you get > 95% accuracy in testing, does this alone tell us our model is great?

### A: 
YOUR ANSWER HERE

----

## Classification Report

If a model predicts whether a patient has cancer or not, what does it mean if it’s 95%+ accurate? Is that good? We can't really tell! What we really want to know is: 
1. How accurate is the model at correctly predicting that the patient has cancer?
2. How accurate is the model at correctly predicting that the patient does *not* have cancer?

That’s where the classification report comes in: it gives us a deeper look at model performance **by class**. The report summarizes several important metrics for each class in a classification problem, showing how well the model identifies each category. It includes:

| Metric    | Meaning                                                   | Formula                               |
| --------- | --------------------------------------------------------- | ------------------------------------- |
| Precision | Of all items predicted as class X, how many were correct? | $\frac{TP}{TP + FP}$                |
| Recall    | Of all true class X items, how many did we find?          | $\frac{TP}{TP + FN}$                |
| F1-Score  | Balance between precision and recall.                     | $2 \times \frac{P \times R}{P + R}$ |
| Support   | Number of true instances of each class.                   | Count                                 |

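Let Positive = 'does *not* have cancer' and Negative = 'does have cancer'. This is the reverse of the usual medical convention, but it matches the label encoding of the breast cancer dataset used above (benign = 1, malignant = 0).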
- FP: False Positive
    - ex: someone with cancer is told they do not have cancer
- TP: True Positive 
    - ex: someone that does not have cancer is told they do not have cancer
- FN: False Negative
    - ex: someone without cancer is told they have cancer
- TN: True Negative
    - ex: someone with cancer is told they have cancer
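
The four outcomes can be counted directly from a list of true and predicted labels. The sketch below uses eight made-up patients and the convention above (1 = Positive = does not have cancer, 0 = Negative = has cancer), then applies the precision, recall, and F1 formulas from the table. The labels are hypothetical and chosen only to give small, easy-to-check counts.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Made-up labels: 1 = does not have cancer (Positive), 0 = has cancer (Negative)
true_labels = np.array([1, 1, 1, 1, 0, 0, 0, 1])
pred_labels = np.array([1, 1, 0, 1, 1, 0, 0, 1])

# Count each outcome with boolean masks
tp = int(np.sum((true_labels == 1) & (pred_labels == 1)))  # healthy, told healthy
fp = int(np.sum((true_labels == 0) & (pred_labels == 1)))  # cancer, told healthy
fn = int(np.sum((true_labels == 1) & (pred_labels == 0)))  # healthy, told cancer
tn = int(np.sum((true_labels == 0) & (pred_labels == 0)))  # cancer, told cancer

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")
print(f"precision={precision:.2f}, recall={recall:.2f}, F1={f1:.2f}")

# scikit-learn's helpers treat label 1 as the positive class by default
sk_precision = precision_score(true_labels, pred_labels)
sk_recall = recall_score(true_labels, pred_labels)
sk_f1 = f1_score(true_labels, pred_labels)
print("Matches scikit-learn:",
      np.allclose([precision, recall, f1], [sk_precision, sk_recall, sk_f1]))
```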

These Positive and Negative labels are summarized in the form of a **confusion matrix**, shown below: 
<p align="left">
    <img src = "https://www.blog.trainindata.com/wp-content/uploads/2024/09/confusion-matrix-1.png" width = "500">
</p>


The confusion matrix for the model we trained above, computed on the testing data, is shown below. Rows are the true classes and columns are the predicted classes, in the order malignant, benign.

In [13]:
from sklearn.metrics import classification_report, confusion_matrix
print("Confusion Matrix:\n", confusion_matrix(y_test, y_test_pred), "\n")

Confusion Matrix:
 [[39  4]
 [ 1 70]] 



In [14]:
print(classification_report(y_test, y_test_pred, target_names=load_breast_cancer().target_names))

              precision    recall  f1-score   support

   malignant       0.97      0.91      0.94        43
      benign       0.95      0.99      0.97        71

    accuracy                           0.96       114
   macro avg       0.96      0.95      0.95       114
weighted avg       0.96      0.96      0.96       114
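
The numbers in the report can be reproduced by hand from the confusion matrix above. The sketch below hard-codes that matrix rather than recomputing it, and assumes scikit-learn's layout (rows = true class, columns = predicted class, in the order malignant, benign).

```python
import numpy as np

# Confusion matrix copied from the output above.
# Rows: true malignant, true benign. Columns: predicted malignant, predicted benign.
cm = np.array([[39, 4],
               [1, 70]])

# Treat "benign" (label 1) as the Positive class, matching the convention above
tn, fp, fn, tp = cm.ravel()

precision_benign = tp / (tp + fp)   # 70 / 74
recall_benign = tp / (tp + fn)      # 70 / 71
f1_benign = (2 * precision_benign * recall_benign
             / (precision_benign + recall_benign))

# Swapping the roles of the two classes gives the "malignant" row of the report
precision_malignant = tn / (tn + fn)  # 39 / 40
recall_malignant = tn / (tn + fp)     # 39 / 43

accuracy = (tp + tn) / cm.sum()       # 109 / 114

print(f"benign:    precision={precision_benign:.3f}, "
      f"recall={recall_benign:.3f}, f1={f1_benign:.3f}")
print(f"malignant: precision={precision_malignant:.3f}, "
      f"recall={recall_malignant:.3f}")
print(f"accuracy:  {accuracy:.3f}")
```

Note that the malignant recall (39 of 43) is the lowest number in the report: 4 of the 43 patients with malignant tumors were labeled benign, which the overall accuracy does not reveal.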



### Q: What could you say about the model we trained above if the precision and recall were both 1? 
Make quantitative arguments about the values of FN and FP to support your answer.

### A: 
YOUR ANSWER HERE

### Q: Sometimes it's not possible to increase precision without decreasing recall (and vice-versa). Provide a conceptual argument for why this might be the case. 
Assume there's a way to force the model to maximize precision (or recall) upon the user's request (there is).

### A: 
YOUR ANSWER HERE

----

## More Testing via Cross Validation  

Sometimes, a single train/test split isn’t enough. Why? Because your model’s performance might depend on how the data happened to be split.

Consider the following example: 

If you only take one practice test, it might not reflect your true skill. But if you take five different versions and average your scores, you’ll get a better estimate of how you’ll perform on the real exam.

This is what **cross-validation** does!

<p align="left">
    <img src = "https://miro.medium.com/1*GhKMAUmi4bfFiEwZCPlDsA.png" width = "700">
</p>

K-Fold Cross-Validation Pseudocode:

1. Split the data into K parts (folds).
2. Train on K–1 folds, test on the remaining one.
3. Repeat K times, each time using a different fold as the test set.
4. Average the performance scores.

Notice that cross-validation uses *all data* for both training and testing (just not at the same time). This allows us to produce a more reliable estimate of model performance.
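
The four steps above can be sketched as an explicit loop. This is an illustration only: it assumes scikit-learn's `KFold` for the splitting, the wine dataset, and a decision tree as a stand-in model. None of these choices is essential to the idea.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X_wine, y_wine = load_wine(return_X_y=True)

# Step 1: split the data into K = 5 folds
splitter = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
times_tested = np.zeros(len(X_wine), dtype=int)  # how often each sample is in a test fold

# Steps 2 and 3: train on K-1 folds, test on the remaining one, K times
for train_idx, test_idx in splitter.split(X_wine):
    clf = DecisionTreeClassifier(random_state=0)  # fresh model for every fold
    clf.fit(X_wine[train_idx], y_wine[train_idx])
    preds = clf.predict(X_wine[test_idx])
    fold_scores.append(accuracy_score(y_wine[test_idx], preds))
    times_tested[test_idx] += 1

# Step 4: average the performance scores
mean_score = float(np.mean(fold_scores))
print("Fold accuracies:", np.round(fold_scores, 3))
print("Mean accuracy:", round(mean_score, 3))
print("Every sample tested exactly once:", bool(np.all(times_tested == 1)))
```

The `times_tested` counter confirms the remark above: each sample lands in a test fold exactly once and in a training fold the other four times.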

### Types of Cross-Validation (CV)
1. K-Fold CV: Standard method as described above.
2. Stratified K-Fold: Keeps class proportions balanced.
    - example: Given an image of a solid color, a machine learning model predicts the color of the image as "orange", "purple", or "yellow". Stratified K-fold splits the data so that each fold contains roughly the same proportion of "orange", "purple", and "yellow" images as the full dataset, for **both** training and testing.
3. Leave-One-Out CV: Each sample is its own test set (expensive, since it trains one model per sample, but each model sees nearly all the data).
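
A small sketch of how these splitters differ, using 100 made-up labels with a 90/10 class imbalance (the data, variable names, and fold count are assumptions for illustration). It assumes scikit-learn's `KFold`, `StratifiedKFold`, and `LeaveOneOut`.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, StratifiedKFold

# Hypothetical imbalanced labels: 90 samples of class 0, 10 of class 1
labels = np.array([0] * 90 + [1] * 10)
features = np.zeros((len(labels), 1))  # placeholder features; only the labels matter here

plain = KFold(n_splits=5, shuffle=True, random_state=0)
strat = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Count how many minority-class samples end up in each test fold
plain_counts = [int(labels[test].sum()) for _, test in plain.split(features)]
strat_counts = [int(labels[test].sum()) for _, test in strat.split(features, labels)]

print("Minority samples per test fold (KFold):          ", plain_counts)
print("Minority samples per test fold (StratifiedKFold):", strat_counts)

# Leave-one-out creates one split per sample
n_loo_splits = LeaveOneOut().get_n_splits(features)
print("Leave-one-out splits:", n_loo_splits)
```

With stratification, each test fold of 20 samples holds 2 minority samples, the same 10% share as the full dataset. Plain K-fold makes no such promise, so the minority count can vary from fold to fold.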

### Q: (Bonus, for Cool Nerds Only) If you're working with a classification model, why is it bad if your data is imbalanced? What would you need to do if your data is imbalanced but you still want to perform CV?

### A: 
YOUR ANSWER HERE

In [15]:
from sklearn.datasets import load_wine
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
import numpy as np

X, y = load_wine(return_X_y=True)

# another type of algorithm commonly used for classification problems 
model = RandomForestClassifier(random_state=42)

# 5-fold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(model, X, y, cv=kf, scoring='accuracy')

print("Cross-validation accuracies:", scores)
print("Mean accuracy:", np.mean(scores))
print("Standard deviation:", np.std(scores))


Cross-validation accuracies: [1.         1.         0.94444444 0.97142857 1.        ]
Mean accuracy: 0.9831746031746033
Standard deviation: 0.02230370548603213


### Q: I committed a cardinal sin when evaluating the performance of this algorithm. What was it?

### A:

YOUR ANSWER HERE


# <span style="color:orange;">Happy Halloween!</span>
<p align="left">
    <img src = "https://static.wikia.nocookie.net/p__/images/1/1e/SpookleyTheSquarePumpkin.webp/revision/latest/thumbnail/width/360/height/360?cb=20250921145457&path-prefix=protagonist" width = "300">
</p>