# Day 8: Model Evaluation & Assessment

**Duration:** 90 minutes  
**Dataset:** Titanic Passenger Data

## Learning Objectives
- Understand the bias-variance tradeoff
- Identify overfitting and underfitting
- Apply regression metrics (MSE, RMSE, MAE, R²)
- Apply classification metrics (Precision, Recall, Accuracy, F1-Score)
- Interpret confusion matrices
- Understand ROC curves and AUC
- Differentiate Type 1 vs Type 2 errors
- Understand ethical AI considerations
- Recognize Trustworthy AI principles

---

## Part 1: Setup and Data Loading (5 mins)

### Introduction to Model Evaluation

Building a machine learning model is only half the battle. The real question is: **How good is our model?**

Today we'll learn how to:
- Measure model performance objectively
- Identify when models are too simple or too complex
- Choose the right metrics for different problems
- Ensure our AI systems are trustworthy and ethical

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import seaborn as sns

# Machine Learning libraries
from sklearn.model_selection import train_test_split, cross_val_score, learning_curve
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# Metrics
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_curve, roc_auc_score,
    mean_squared_error, mean_absolute_error, r2_score
)

# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

print("✓ Libraries imported successfully!")

In [None]:
# Load Titanic dataset
df = sns.load_dataset('titanic')
print(f"Dataset loaded: {df.shape[0]} rows, {df.shape[1]} columns")
print(f"Survival rate: {df['survived'].mean()*100:.1f}%")
df.head()

In [None]:
# Prepare the data
# Select features and handle missing values
df_clean = df.copy()
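# Assign the result back: calling fillna(..., inplace=True) on a selected column
# is deprecated chained assignment and may leave df_clean unchanged in newer pandas
df_clean['age'] = df_clean['age'].fillna(df_clean['age'].median())
df_clean['embarked'] = df_clean['embarked'].fillna(df_clean['embarked'].mode()[0])
df_clean['fare'] = df_clean['fare'].fillna(df_clean['fare'].median())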

# Create features
df_clean['sex_encoded'] = df_clean['sex'].map({'male': 1, 'female': 0})
df_clean['family_size'] = df_clean['sibsp'] + df_clean['parch'] + 1
df_clean['is_alone'] = (df_clean['family_size'] == 1).astype(int)

# Select final features
features = ['pclass', 'sex_encoded', 'age', 'fare', 'family_size', 'is_alone']
X = df_clean[features]
y = df_clean['survived']

print("Features prepared:")
print(X.head())

---
## Part 2: Train-Test Split & The Golden Rule (8 mins)

### The Golden Rule of Machine Learning

**Never test on your training data!** 

Think of it like studying for an exam:
- **Training Set:** Practice problems you study from
- **Test Set:** The actual exam with new questions

If you memorize the practice problems without understanding, you'll fail the real exam!

In [None]:
# TODO: Split data into training and testing sets (80/20 split)
# Hint: Use train_test_split with test_size=0.2 and random_state=42

X_train, X_test, y_train, y_test = # YOUR CODE HERE

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
print(f"\nTraining set survival rate: {y_train.mean()*100:.1f}%")
print(f"Test set survival rate: {y_test.mean()*100:.1f}%")

**Question:** Why do we use `random_state=42`?

Your answer: ___________________________________

---
## Part 3: Bias-Variance Tradeoff (12 mins)

### Understanding Bias and Variance

**Bias:** How far off are our predictions on average?
- High bias = Underfitting (model too simple)
- Model misses important patterns

**Variance:** How much do predictions vary with different training data?
- High variance = Overfitting (model too complex)
- Model memorizes training data, including noise

**The Sweet Spot:** Balance between bias and variance!

```
High Bias (Underfitting)  →  Sweet Spot  →  High Variance (Overfitting)
Too Simple                                   Too Complex
```
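
To make the three regimes concrete, here is a minimal sketch on synthetic data (not the Titanic dataset). The data generator, the three `predict_*` functions, and the 20% label noise are all illustrative assumptions: the "threshold" model happens to match how the toy data was generated, while the "memorizer" simply copies the label of the nearest training point.

```python
import random

random.seed(0)

def make_data(n):
    """1-D points; the true label is x > 0.5, with about 20% of labels flipped as noise."""
    data = []
    for _ in range(n):
        x = random.random()
        label = int(x > 0.5)
        if random.random() < 0.2:
            label = 1 - label
        data.append((x, label))
    return data

train, test = make_data(200), make_data(200)

def predict_constant(x):
    """Too simple (high bias): always predict the majority training class."""
    ones = sum(label for _, label in train)
    return int(ones * 2 >= len(train))

def predict_threshold(x):
    """Reasonable complexity: a single split at 0.5."""
    return int(x > 0.5)

def predict_memorizer(x):
    """Too complex (high variance): copy the label of the nearest training point."""
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]

def accuracy(predict, data):
    return sum(predict(x) == label for x, label in data) / len(data)

scores = {}
for name, predict in [("constant", predict_constant),
                      ("threshold", predict_threshold),
                      ("memorizer", predict_memorizer)]:
    scores[name] = (accuracy(predict, train), accuracy(predict, test))
    print(f"{name:10s} train={scores[name][0]:.2f}  test={scores[name][1]:.2f}")
```

The memorizer reproduces every training label, noise included, so its training accuracy is perfect by construction. The number to watch is how far its test accuracy falls below that.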

### Exercise 3.1: Training Models with Different Complexities

Let's train 3 models with varying complexity:
1. **Simple:** Logistic Regression (linear, low complexity)
2. **Medium:** Decision Tree with limited depth
3. **Complex:** Decision Tree with no depth limit

In [None]:
# Model 1: Simple - Logistic Regression
model_simple = LogisticRegression(max_iter=1000, random_state=42)
model_simple.fit(X_train, y_train)

# Model 2: Medium - Decision Tree (max_depth=3)
model_medium = DecisionTreeClassifier(max_depth=3, random_state=42)
model_medium.fit(X_train, y_train)

# TODO: Model 3: Complex - Decision Tree (no depth limit)
# Hint: Use DecisionTreeClassifier with no max_depth parameter
model_complex = # YOUR CODE HERE
model_complex.fit(X_train, y_train)

print("✓ All models trained!")

In [None]:
# Compare training vs test accuracy
models = {
    'Simple (Logistic)': model_simple,
    'Medium (Tree depth=3)': model_medium,
    'Complex (Tree unlimited)': model_complex
}

results = []
for name, model in models.items():
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    gap = train_acc - test_acc
    
    results.append({
        'Model': name,
        'Train Accuracy': train_acc,
        'Test Accuracy': test_acc,
        'Gap': gap
    })

results_df = pd.DataFrame(results)
print("Model Performance Comparison:")
print(results_df.round(4))

In [None]:
# Visualize the bias-variance tradeoff
fig = go.Figure()

fig.add_trace(go.Bar(
    name='Training Accuracy',
    x=results_df['Model'],
    y=results_df['Train Accuracy'],
    marker_color='lightblue'
))

fig.add_trace(go.Bar(
    name='Test Accuracy',
    x=results_df['Model'],
    y=results_df['Test Accuracy'],
    marker_color='darkblue'
))

fig.update_layout(
    title='Bias-Variance Tradeoff: Training vs Test Performance',
    xaxis_title='Model Complexity',
    yaxis_title='Accuracy',
    barmode='group',
    yaxis_tickformat='.0%'
)
fig.show()

**Question:** Which model shows signs of overfitting? How can you tell?

Your answer: ___________________________________

**Question:** Which model would you choose and why?

Your answer: ___________________________________

### Exercise 3.2: Learning Curves

Learning curves show how model performance changes with training set size.

In [None]:
# Generate learning curve for the medium complexity model
train_sizes, train_scores, test_scores = learning_curve(
    model_medium, X_train, y_train, 
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5, random_state=42
)

# Calculate mean and std
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)

# Plot learning curve
fig = go.Figure()

fig.add_trace(go.Scatter(
    x=train_sizes, y=train_mean,
    name='Training Score',
    mode='lines+markers',
    line=dict(color='lightblue')
))

fig.add_trace(go.Scatter(
    x=train_sizes, y=test_mean,
    name='Cross-validation Score',
    mode='lines+markers',
    line=dict(color='darkblue')
))

fig.update_layout(
    title='Learning Curve: How Performance Changes with Training Data Size',
    xaxis_title='Training Set Size',
    yaxis_title='Accuracy',
    yaxis_tickformat='.0%'
)
fig.show()

**Question:** What happens to the gap between training and test scores as we add more data?

Your answer: ___________________________________

---
## Part 4: Regression Metrics (10 mins)

While our main task is classification (survived or not), let's quickly understand regression metrics by predicting a continuous variable: **fare**.

### Common Regression Metrics

**1. MAE (Mean Absolute Error)**
- Average absolute difference between predictions and actual values
- Easy to interpret: "On average, we're off by X units"
- Formula: `MAE = mean(|actual - predicted|)`

**2. MSE (Mean Squared Error)**
- Average of squared errors
- Penalizes large errors more heavily
- Formula: `MSE = mean((actual - predicted)²)`

**3. RMSE (Root Mean Squared Error)**
- Square root of MSE
- Same units as the target variable
- Formula: `RMSE = √MSE`

**4. R² (R-Squared / Coefficient of Determination)**
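- Proportion of variance in the target explained by the model
- Typically between 0 and 1 (higher is better); it can be negative on test data when the model does worse than always predicting the mean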
- 0.8 = model explains 80% of variance
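
As a quick sanity check on the formulas above, here is a hand-computable sketch in plain Python. The four actual and predicted values are made up for illustration and have nothing to do with the fare data.

```python
import math

# Made-up values, small enough to verify by hand
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 35.0]
n = len(actual)

errors = [a - p for a, p in zip(actual, predicted)]  # [-2, 2, -3, 5]

mae = sum(abs(e) for e in errors) / n    # (2 + 2 + 3 + 5) / 4
mse = sum(e ** 2 for e in errors) / n    # (4 + 4 + 9 + 25) / 4
rmse = math.sqrt(mse)

# R² compares the model's squared error to that of always predicting the mean
mean_actual = sum(actual) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r2 = 1 - ss_res / ss_tot

print(f"MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}  R²={r2:.3f}")
```

Note how the single error of 5 contributes 25 of the 42 squared-error units: that is the "penalizes large errors" effect, and it is why RMSE comes out above MAE here.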

In [None]:
# Create a regression problem: predict fare from other features
X_reg = df_clean[['pclass', 'sex_encoded', 'age', 'family_size']].copy()
y_reg = df_clean['fare'].copy()

# Split the data
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

# Train a linear regression model
reg_model = LinearRegression()
reg_model.fit(X_train_reg, y_train_reg)

# Make predictions
y_pred_reg = reg_model.predict(X_test_reg)

print("✓ Regression model trained!")

In [None]:
# TODO: Calculate regression metrics
# Hint: Use mean_absolute_error, mean_squared_error, r2_score

mae = # YOUR CODE HERE
mse = # YOUR CODE HERE
rmse = # YOUR CODE HERE (np.sqrt(mse))
r2 = # YOUR CODE HERE

print("Regression Metrics:")
print(f"MAE:  £{mae:.2f} (on average, off by this much)")
print(f"MSE:  {mse:.2f}")
print(f"RMSE: £{rmse:.2f} (similar to MAE but penalizes large errors)")
print(f"R²:   {r2:.3f} (model explains {r2*100:.1f}% of variance)")

In [None]:
# Visualize predictions vs actual values
comparison_df = pd.DataFrame({
    'Actual Fare': y_test_reg,
    'Predicted Fare': y_pred_reg,
    'Error': y_test_reg - y_pred_reg
})

fig = px.scatter(comparison_df, x='Actual Fare', y='Predicted Fare',
                 title='Actual vs Predicted Fare',
                 labels={'Actual Fare': 'Actual Fare (£)', 'Predicted Fare': 'Predicted Fare (£)'})

# Add perfect prediction line
fig.add_trace(go.Scatter(
    x=[0, comparison_df['Actual Fare'].max()],
    y=[0, comparison_df['Actual Fare'].max()],
    mode='lines',
    name='Perfect Prediction',
    line=dict(color='red', dash='dash')
))

fig.show()

**Question:** What does it mean if points are far from the red line?

Your answer: ___________________________________

---
## Part 5: Classification Metrics (15 mins)

Now back to our main task: predicting survival (classification).

### The Four Outcomes of Binary Classification

```
                    Predicted
                 No                    |  Yes
Actual    No     TN                    |  FP (Type I Error)
          Yes    FN (Type II Error)    |  TP
```

- **TP (True Positive):** Correctly predicted survival ✓
- **TN (True Negative):** Correctly predicted death ✓
- **FP (False Positive):** Predicted survival, but died ✗ (Type I Error)
- **FN (False Negative):** Predicted death, but survived ✗ (Type II Error)

### Key Metrics

**Accuracy:** Overall correctness
- Formula: `(TP + TN) / Total`
- Good when classes are balanced

**Precision:** Of all positive predictions, how many were correct?
- Formula: `TP / (TP + FP)`
- "When I predict survival, how often am I right?"

**Recall (Sensitivity):** Of all actual positives, how many did we catch?
- Formula: `TP / (TP + FN)`
- "Of all survivors, how many did I identify?"

**F1-Score:** Harmonic mean of Precision and Recall
- Formula: `2 × (Precision × Recall) / (Precision + Recall)`
- Balanced metric when you care about both Precision and Recall
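
The four formulas can be checked by hand on a tiny example. The sketch below uses ten made-up labels (not model output) and counts the outcomes directly.

```python
# Made-up labels: 1 = survived, 0 = died
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # 2
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # 5
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # 1
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # 2

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Accuracy looks decent at 0.7, yet recall is only 0.5: half of the actual positives were missed. This is why accuracy alone can hide what a model is doing.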

In [None]:
# Train a model for classification
clf_model = RandomForestClassifier(n_estimators=100, random_state=42)
clf_model.fit(X_train, y_train)

# Make predictions
y_pred = clf_model.predict(X_test)

print("✓ Classification model trained!")

In [None]:
# TODO: Calculate classification metrics
# Hint: Use accuracy_score, precision_score, recall_score, f1_score

accuracy = # YOUR CODE HERE
precision = # YOUR CODE HERE
recall = # YOUR CODE HERE
f1 = # YOUR CODE HERE

print("Classification Metrics:")
print(f"Accuracy:  {accuracy:.3f} ({accuracy*100:.1f}% correct overall)")
print(f"Precision: {precision:.3f} (when we predict survival, we're right {precision*100:.1f}% of the time)")
print(f"Recall:    {recall:.3f} (we catch {recall*100:.1f}% of actual survivors)")
print(f"F1-Score:  {f1:.3f} (balanced metric)")

### Exercise 5.1: Understanding the Tradeoff

**Scenario:** You're designing an alarm system for detecting icebergs.

- **High Precision:** Alarm rarely goes off falsely, but might miss some icebergs
- **High Recall:** Catches all icebergs, but many false alarms
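
One way to see this tradeoff is to sweep the decision threshold of a scoring model. The sketch below is illustrative only: the scores, labels, and the helper `precision_recall_at` are made up, standing in for an alarm that fires when a score reaches the threshold.

```python
# Made-up alarm scores and ground truth (1 = real iceberg)
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 1, 0, 1, 0]

def precision_recall_at(threshold):
    """Raise the alarm when score >= threshold, then score the alarms."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # no alarms at all
    recall = tp / (tp + fn)
    return precision, recall

results = {t: precision_recall_at(t) for t in (0.85, 0.5, 0.25)}
for t, (p, r) in results.items():
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```

A strict threshold gives few false alarms but misses icebergs; a lenient one catches everything at the price of more false alarms.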

**Question:** For iceberg detection, would you prioritize Precision or Recall? Why?

Your answer: ___________________________________

---
## Part 6: Confusion Matrix (10 mins)

A confusion matrix shows all four outcomes (TP, TN, FP, FN) in a table.

In [None]:
# TODO: Create confusion matrix
# Hint: Use confusion_matrix(y_test, y_pred)

cm = # YOUR CODE HERE

print("Confusion Matrix:")
print(cm)
print("\nInterpretation:")
print(f"True Negatives (TN):  {cm[0,0]} - Correctly predicted died")
print(f"False Positives (FP): {cm[0,1]} - Predicted survived but died (Type I Error)")
print(f"False Negatives (FN): {cm[1,0]} - Predicted died but survived (Type II Error)")
print(f"True Positives (TP):  {cm[1,1]} - Correctly predicted survived")

In [None]:
# Visualize confusion matrix
fig = px.imshow(cm, 
                labels=dict(x="Predicted", y="Actual", color="Count"),
                x=['Died (0)', 'Survived (1)'],
                y=['Died (0)', 'Survived (1)'],
                text_auto=True,
                color_continuous_scale='Blues',
                title='Confusion Matrix')
fig.show()

### Exercise 6.1: Manual Calculation

Let's verify our metrics using the confusion matrix values.

In [None]:
# TODO: Calculate metrics manually from confusion matrix
TN, FP, FN, TP = cm.ravel()

# Calculate accuracy
manual_accuracy = # YOUR CODE HERE (TP + TN) / (TP + TN + FP + FN)

# Calculate precision
manual_precision = # YOUR CODE HERE TP / (TP + FP)

# Calculate recall
manual_recall = # YOUR CODE HERE TP / (TP + FN)

print("Manual Calculations:")
print(f"Accuracy:  {manual_accuracy:.3f}")
print(f"Precision: {manual_precision:.3f}")
print(f"Recall:    {manual_recall:.3f}")

print("\nVerification (should match earlier calculations):")
print(f"Accuracy matches:  {np.isclose(manual_accuracy, accuracy)}")
print(f"Precision matches: {np.isclose(manual_precision, precision)}")
print(f"Recall matches:    {np.isclose(manual_recall, recall)}")

### Exercise 6.2: Classification Report

Scikit-learn provides a convenient summary of all metrics.

In [None]:
# TODO: Generate classification report
# Hint: Use classification_report(y_test, y_pred)

print("Classification Report:")
print(# YOUR CODE HERE)

---
## Part 7: Type I vs Type II Errors (8 mins)

### Understanding Error Types

**Type I Error (False Positive):**
- Predicting something that isn't true
- Example: Diagnosing a healthy person as sick
- Example: Spam filter marking important email as spam

**Type II Error (False Negative):**
- Missing something that is true
- Example: Failing to diagnose a sick person
- Example: Spam filter letting spam through

### Medical Testing Example

```
Disease Testing:
- Type I Error (FP):  Telling healthy person they're sick
- Type II Error (FN): Telling sick person they're healthy
```

### Exercise 7.1: Real-World Scenarios

For each scenario, identify which error type is worse:

**Scenario 1: Cancer Screening**
- Type I Error: Healthy person told they have cancer (leads to stress, more tests)
- Type II Error: Cancer patient told they're healthy (cancer goes untreated)

Which is worse? ___________________________________

**Scenario 2: Credit Card Fraud Detection**
- Type I Error: Legitimate transaction blocked (customer inconvenience)
- Type II Error: Fraudulent transaction approved (financial loss)

Which is worse? ___________________________________

**Scenario 3: Spam Email Filter**
- Type I Error: Important email goes to spam (might miss it)
- Type II Error: Spam email in inbox (minor annoyance)

Which is worse? ___________________________________

In [None]:
# Calculate error rates from our Titanic model
type1_error_rate = FP / (FP + TN)  # False Positive Rate
type2_error_rate = FN / (FN + TP)  # False Negative Rate

print("Error Analysis for Titanic Survival Prediction:")
print(f"\nType I Error Rate (False Positive):  {type1_error_rate:.3f}")
print(f"  - We predicted {FP} people would survive but they didn't")
print(f"  - That's {type1_error_rate*100:.1f}% of actual non-survivors")

print(f"\nType II Error Rate (False Negative): {type2_error_rate:.3f}")
print(f"  - We predicted {FN} people would die but they survived")
print(f"  - That's {type2_error_rate*100:.1f}% of actual survivors")

print("\nQuestion: In a real disaster scenario, which error would be worse?")
print("Type I (predicting survival when they don't) or Type II (predicting death when they survive)?")

---
## Part 8: ROC Curve & AUC (12 mins)

### Understanding ROC Curves

**ROC (Receiver Operating Characteristic) Curve:**
- Shows tradeoff between True Positive Rate and False Positive Rate
- X-axis: False Positive Rate (Type I Error)
- Y-axis: True Positive Rate (Recall)

**AUC (Area Under Curve):**
- Summary metric: area under the ROC curve
- Range: 0 to 1.0 (a score below 0.5 means the model ranks cases worse than chance)
- 0.5 = Random guessing (coin flip)
- 1.0 = Perfect classifier
- 0.7-0.8 = Acceptable
- 0.8-0.9 = Excellent
- 0.9+ = Outstanding
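
To demystify the curve, here is a from-scratch sketch on eight made-up scores (assuming no tied scores). It builds the ROC points by walking down the ranking, then computes AUC two ways: as the area under those points, and as the fraction of (positive, negative) pairs in which the positive has the higher score.

```python
# Made-up predicted probabilities and true labels
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 1, 0, 1, 0]

n_pos = sum(labels)
n_neg = len(labels) - n_pos

# Lower the threshold one example at a time, from highest score to lowest
roc_points = [(0.0, 0.0)]
tp = fp = 0
for score, label in sorted(zip(scores, labels), reverse=True):
    if label == 1:
        tp += 1   # curve moves up
    else:
        fp += 1   # curve moves right
    roc_points.append((fp / n_neg, tp / n_pos))

# Area under the piecewise-linear curve (trapezoid rule)
auc_trapezoid = sum(
    (x1 - x0) * (y0 + y1) / 2
    for (x0, y0), (x1, y1) in zip(roc_points, roc_points[1:])
)

# Same number as a ranking probability
pos_scores = [s for s, y in zip(scores, labels) if y == 1]
neg_scores = [s for s, y in zip(scores, labels) if y == 0]
auc_pairs = sum(p > q for p in pos_scores for q in neg_scores) / (n_pos * n_neg)

print(f"AUC (area)  = {auc_trapezoid:.3f}")
print(f"AUC (pairs) = {auc_pairs:.3f}")
```

Both routes agree, which gives a handy reading of AUC: the chance that a randomly chosen positive is scored above a randomly chosen negative.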

In [None]:
# Get probability predictions (not just 0/1)
y_pred_proba = clf_model.predict_proba(X_test)[:, 1]

# TODO: Calculate ROC curve
# Hint: Use roc_curve(y_test, y_pred_proba)
fpr, tpr, thresholds = # YOUR CODE HERE

# TODO: Calculate AUC
# Hint: Use roc_auc_score(y_test, y_pred_proba)
auc = # YOUR CODE HERE

print(f"AUC Score: {auc:.3f}")

In [None]:
# Plot ROC Curve
fig = go.Figure()

# ROC curve
fig.add_trace(go.Scatter(
    x=fpr, y=tpr,
    mode='lines',
    name=f'ROC Curve (AUC = {auc:.3f})',
    line=dict(color='blue', width=2)
))

# Random classifier line
fig.add_trace(go.Scatter(
    x=[0, 1], y=[0, 1],
    mode='lines',
    name='Random Classifier (AUC = 0.5)',
    line=dict(color='red', dash='dash')
))

fig.update_layout(
    title='ROC Curve: True Positive Rate vs False Positive Rate',
    xaxis_title='False Positive Rate (Type I Error)',
    yaxis_title='True Positive Rate (Recall)',
    xaxis=dict(range=[0, 1]),
    yaxis=dict(range=[0, 1]),
    width=700, height=700
)
fig.show()

**Question:** What does it mean if the ROC curve is close to the top-left corner?

Your answer: ___________________________________

**Question:** What does it mean if the ROC curve follows the diagonal red line?

Your answer: ___________________________________

### Exercise 8.1: Comparing Multiple Models

Let's compare ROC curves for different models.

In [None]:
# Train multiple models
models_to_compare = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'K-Nearest Neighbors': KNeighborsClassifier(n_neighbors=5)
}

fig = go.Figure()

for name, model in models_to_compare.items():
    # Train model
    model.fit(X_train, y_train)
    
    # Get predictions
    y_proba = model.predict_proba(X_test)[:, 1]
    
    # Calculate ROC curve
    fpr, tpr, _ = roc_curve(y_test, y_proba)
    auc_score = roc_auc_score(y_test, y_proba)
    
    # Add to plot
    fig.add_trace(go.Scatter(
        x=fpr, y=tpr,
        mode='lines',
        name=f'{name} (AUC = {auc_score:.3f})'
    ))

# Add random classifier
fig.add_trace(go.Scatter(
    x=[0, 1], y=[0, 1],
    mode='lines',
    name='Random (AUC = 0.5)',
    line=dict(color='red', dash='dash')
))

fig.update_layout(
    title='ROC Curves: Model Comparison',
    xaxis_title='False Positive Rate',
    yaxis_title='True Positive Rate',
    width=700, height=700
)
fig.show()

**Question:** Which model performs best? How do you know?

Your answer: ___________________________________

---
## Part 9: Cross-Validation (8 mins)

### The Problem with Single Train-Test Split

What if our test set happened to be particularly easy or hard? Our results might be misleading!

### Solution: K-Fold Cross-Validation

1. Split data into K parts (folds)
2. Train K times, each time using a different fold as test set
3. Average the results

```
Fold 1: [Test][Train][Train][Train][Train]
Fold 2: [Train][Test][Train][Train][Train]
Fold 3: [Train][Train][Test][Train][Train]
Fold 4: [Train][Train][Train][Test][Train]
Fold 5: [Train][Train][Train][Train][Test]
```

This gives us more reliable performance estimates!
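
The splitting logic in the diagram can be sketched in a few lines of plain Python. `k_fold_indices` is a hypothetical helper written for illustration; it uses contiguous, unshuffled folds, which is a simplification of what library implementations offer.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous, unshuffled folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
for i, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"Fold {i}: test={test_idx}  train={train_idx}")
```

Every sample lands in the test set exactly once, so each one contributes to the final averaged score.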

In [None]:
# TODO: Perform 5-fold cross-validation
# Hint: Use cross_val_score with cv=5

cv_scores = # YOUR CODE HERE (cross_val_score(clf_model, X, y, cv=5))

print("Cross-Validation Results (5 folds):")
print(f"Scores: {cv_scores}")
print(f"\nMean Accuracy: {cv_scores.mean():.3f}")
print(f"Standard Deviation: {cv_scores.std():.3f}")
# Rough spread across folds; with only 5 scores this is not a formal confidence interval
print(f"Mean ± 1.96 std: {cv_scores.mean():.3f} ± {1.96*cv_scores.std():.3f}")

In [None]:
# Visualize cross-validation results
fig = go.Figure()

fig.add_trace(go.Bar(
    x=[f'Fold {i+1}' for i in range(len(cv_scores))],
    y=cv_scores,
    marker_color='lightblue'
))

fig.add_hline(y=cv_scores.mean(), line_dash="dash", line_color="red",
              annotation_text=f"Mean: {cv_scores.mean():.3f}")

fig.update_layout(
    title='Cross-Validation Scores Across 5 Folds',
    xaxis_title='Fold',
    yaxis_title='Accuracy',
    yaxis_tickformat='.0%'
)
fig.show()

**Question:** Why is cross-validation more reliable than a single train-test split?

Your answer: ___________________________________

---
## Part 10: Ethical AI & Trustworthy AI (10 mins)

### Why Ethics Matter in Machine Learning

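Our Titanic model predicts a life-or-death outcome, but only for a historical event. In real applications, model predictions feed into decisions that affect people directly: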
- **Healthcare:** Diagnosis and treatment decisions
- **Criminal Justice:** Risk assessment, sentencing
- **Finance:** Loan approvals, credit scores
- **Hiring:** Resume screening, candidate selection

**Models can perpetuate and amplify human biases!**

### Exercise 10.1: Detecting Bias in Our Model

Let's check if our model treats men and women fairly.

In [None]:
# Analyze model performance by gender
results_by_gender = []

for gender in [0, 1]:  # 0=female, 1=male
    # Filter test set by gender
    mask = X_test['sex_encoded'] == gender
    X_test_gender = X_test[mask]
    y_test_gender = y_test[mask]
    
    if len(y_test_gender) > 0:
        # Make predictions
        y_pred_gender = clf_model.predict(X_test_gender)
        
        # Calculate metrics
        acc = accuracy_score(y_test_gender, y_pred_gender)
        prec = precision_score(y_test_gender, y_pred_gender, zero_division=0)
        rec = recall_score(y_test_gender, y_pred_gender, zero_division=0)
        
        results_by_gender.append({
            'Gender': 'Female' if gender == 0 else 'Male',
            'Count': len(y_test_gender),
            'Accuracy': acc,
            'Precision': prec,
            'Recall': rec
        })

bias_df = pd.DataFrame(results_by_gender)
print("Model Performance by Gender:")
print(bias_df.round(3))

In [None]:
# Visualize performance differences
fig = go.Figure()

metrics = ['Accuracy', 'Precision', 'Recall']
for metric in metrics:
    fig.add_trace(go.Bar(
        name=metric,
        x=bias_df['Gender'],
        y=bias_df[metric]
    ))

fig.update_layout(
    title='Model Performance by Gender',
    xaxis_title='Gender',
    yaxis_title='Score',
    barmode='group',
    yaxis_tickformat='.0%'
)
fig.show()

**Question:** Does the model perform equally well for both genders? If not, what might be the reason?

Your answer: ___________________________________

**Question:** In the actual Titanic disaster, women and children were prioritized for lifeboats. Is this historical bias reflected in our model? Is that a problem?

Your answer: ___________________________________

### Exercise 10.2: Trustworthy AI Principles

The EU's **Ethics Guidelines for Trustworthy AI** outline 7 key requirements:

1. **Human Agency & Oversight**
   - Humans should remain in control
   - AI should augment, not replace, human decision-making

2. **Technical Robustness & Safety**
   - Models should be reliable and secure
   - Should handle errors gracefully

3. **Privacy & Data Governance**
   - Respect user privacy
   - Secure data handling

4. **Transparency**
   - Decisions should be explainable
   - Users should understand how AI works

5. **Diversity, Non-discrimination & Fairness**
   - Avoid bias
   - Ensure equal treatment

6. **Societal & Environmental Wellbeing**
   - Consider broader impact
   - Sustainability

7. **Accountability**
   - Clear responsibility for AI decisions
   - Mechanisms for redress

### Reflection Exercise: Applying Trustworthy AI

For each scenario, identify which Trustworthy AI principle(s) are most relevant:

**Scenario 1:** A hospital uses an AI system to prioritize patients in the emergency room. The system is a "black box" - doctors don't understand why it makes certain recommendations.

Which principles are violated? ___________________________________

**Scenario 2:** A facial recognition system has 99% accuracy for white males but only 65% accuracy for dark-skinned females.

Which principles are violated? ___________________________________

**Scenario 3:** A loan approval AI system was trained on historical data that reflected past discriminatory lending practices.

Which principles are violated? ___________________________________

**Scenario 4:** An AI hiring tool automatically rejects candidates without giving them a chance to appeal or understand why.

Which principles are violated? ___________________________________

### Exercise 10.3: Feature Importance & Interpretability

In [None]:
# Examine feature importance from our Random Forest model
# This helps with transparency!

feature_importance = pd.DataFrame({
    'Feature': features,
    'Importance': clf_model.feature_importances_
}).sort_values('Importance', ascending=False)

print("Feature Importance:")
print(feature_importance)

# Visualize
fig = px.bar(feature_importance, x='Importance', y='Feature', 
             orientation='h',
             title='Which Features Does Our Model Rely On Most?')
fig.show()

**Question:** Is it ethical for a survival prediction model to heavily weight 'sex'? Why or why not?

Your answer: ___________________________________

**Question:** How would you explain this model's predictions to someone who's not a data scientist?

Your answer: ___________________________________

---
## Summary & Reflection (5 mins)

### Key Takeaways

Today we learned:

**Model Evaluation Fundamentals:**
- ✓ Always split data into train/test sets
- ✓ Bias-variance tradeoff: balance between underfitting and overfitting
- ✓ Cross-validation provides more reliable estimates

**Regression Metrics:**
- ✓ MAE, MSE, RMSE, R² for continuous predictions

**Classification Metrics:**
- ✓ Accuracy, Precision, Recall, F1-Score
- ✓ Confusion matrix shows all four outcomes
- ✓ ROC curves visualize performance tradeoffs; AUC summarizes them in one number

**Error Types:**
- ✓ Type I Error (False Positive): Predicting yes when it's no
- ✓ Type II Error (False Negative): Predicting no when it's yes
- ✓ Different applications prioritize different error types

**Ethics & Trust:**
- ✓ Models can perpetuate bias
- ✓ Trustworthy AI requires transparency, fairness, and accountability
- ✓ Feature importance helps explain model decisions

### Metric Selection Guide

```
Choose Accuracy when:
  - Classes are balanced
  - All errors are equally bad

Choose Precision when:
  - False positives are costly
  - Example: Spam detection (don't want false alarms)

Choose Recall when:
  - False negatives are costly
  - Example: Disease screening (don't miss cases)

Choose F1-Score when:
  - Need balance between Precision and Recall
  - Classes are imbalanced

Choose AUC when:
  - Want a single metric for overall performance
  - Comparing multiple models
```

### Reflection Questions

1. **Understanding Metrics:** Which metric do you think is most important for the Titanic survival prediction task? Why?

   Your answer: ___________________________________

2. **Overfitting:** Describe in your own words why overfitting is a problem and how to detect it.

   Your answer: ___________________________________

3. **Real-World Application:** Think of a real-world ML application. What would be the cost of Type I vs Type II errors?

   Your answer: ___________________________________

4. **Ethics:** What's one thing you'll keep in mind about ethical AI when building models in the future?

   Your answer: ___________________________________

---
## Bonus Challenges (Optional)

If you finish early, try these additional exercises!

### Bonus 1: Precision-Recall Curve

Similar to the ROC curve, but plots Precision against Recall.

In [None]:
# TODO: Create a Precision-Recall curve
# Hint: from sklearn.metrics import precision_recall_curve

# YOUR CODE HERE

### Bonus 2: Stratified K-Fold

Regular K-Fold might create imbalanced folds. Stratified K-Fold keeps the class distribution approximately the same in every fold.

In [None]:
# TODO: Compare regular K-Fold with Stratified K-Fold
# Hint: from sklearn.model_selection import StratifiedKFold

# YOUR CODE HERE

### Bonus 3: Cost-Sensitive Learning

What if false negatives are 10x worse than false positives? Adjust your model!
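
Before reaching for class weights, it can help to see what "10x worse" does to the best decision threshold. The sketch below is a different technique from the class-weight approach in the hint: it keeps the model fixed and picks the threshold with the lowest total cost. The scores, labels, cost values, and candidate thresholds are all made up.

```python
# Made-up predicted probabilities and true labels
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 1, 0, 1, 0]

COST_FN = 10  # assumed cost of missing a positive
COST_FP = 1   # assumed cost of a false alarm

def total_cost(threshold):
    """Predict positive when score >= threshold, then add up the error costs."""
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return COST_FN * fn + COST_FP * fp

candidates = [0.85, 0.5, 0.25]
costs = {t: total_cost(t) for t in candidates}
best_threshold = min(costs, key=costs.get)

for t, c in costs.items():
    print(f"threshold={t:.2f}  cost={c}")
print(f"Lowest-cost threshold: {best_threshold}")
```

With false negatives this expensive, the lowest threshold wins even though it produces the most false positives.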

In [None]:
# TODO: Train a model with class weights
# Hint: Use class_weight parameter in LogisticRegression or RandomForestClassifier

# YOUR CODE HERE

### Bonus 4: Fairness Metrics

Calculate demographic parity and equalized odds to assess fairness.
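
As a warm-up, here is a toy sketch of both definitions on ten made-up records. The groups "A" and "B", the records, and the helper `group_rates` are illustrative assumptions, not Titanic results.

```python
records = [
    # (group, y_true, y_pred), all made up
    ("A", 1, 1), ("A", 1, 1), ("A", 1, 0), ("A", 0, 0), ("A", 0, 1),
    ("B", 1, 1), ("B", 1, 0), ("B", 0, 0), ("B", 0, 0), ("B", 0, 0),
]

def group_rates(group):
    """Positive-prediction rate, TPR and FPR for one group."""
    rows = [(t, p) for g, t, p in records if g == group]
    preds_for_positives = [p for t, p in rows if t == 1]
    preds_for_negatives = [p for t, p in rows if t == 0]
    return {
        "positive_rate": sum(p for _, p in rows) / len(rows),
        "tpr": sum(preds_for_positives) / len(preds_for_positives),
        "fpr": sum(preds_for_negatives) / len(preds_for_negatives),
    }

rates = {g: group_rates(g) for g in ("A", "B")}

# Demographic parity compares how often each group is predicted positive
parity_gap = rates["A"]["positive_rate"] - rates["B"]["positive_rate"]
# Equalized odds compares error behaviour: both TPR and FPR should match
tpr_gap = rates["A"]["tpr"] - rates["B"]["tpr"]
fpr_gap = rates["A"]["fpr"] - rates["B"]["fpr"]

print(f"Demographic parity gap: {parity_gap:.3f}")
print(f"Equalized odds gaps:    TPR {tpr_gap:.3f}, FPR {fpr_gap:.3f}")
```

A gap of zero would mean the criterion is met exactly; in practice you decide how large a gap is acceptable for the application.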

In [None]:
# TODO: Calculate fairness metrics comparing male vs female
# Demographic Parity: P(predicted=1 | male) should equal P(predicted=1 | female)
# Equalized Odds: TPR and FPR should be equal across groups

# YOUR CODE HERE

---
## Resources for Further Learning

### Documentation
- **Scikit-learn Metrics:** https://scikit-learn.org/stable/modules/model_evaluation.html
- **Confusion Matrix Guide:** https://en.wikipedia.org/wiki/Confusion_matrix
- **ROC Curves Explained:** https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc

### Ethics & Fairness
- **EU Trustworthy AI:** https://digital-strategy.ec.europa.eu/en/library/ethics-guidelines-trustworthy-ai
- **Fairness in ML:** https://fairmlbook.org/
- **Google's Responsible AI:** https://ai.google/responsibility/responsible-ai-practices/

### Interactive Learning
- **Confusion Matrix Calculator:** https://www.machinelearningplus.com/statistics/confusion-matrix-calculator/
- **ROC Curve Demo:** https://arogozhnikov.github.io/2015/10/05/roc-curve.html

**Congratulations on completing Day 8!** 🎉

You now have the tools to rigorously evaluate machine learning models and ensure they're trustworthy and ethical!