# ML Practice Questions - Part 1: ML Fundamentals and Problem Types

This notebook covers fundamental machine learning concepts including different learning paradigms, problem types, and basic evaluation concepts. Each question includes detailed explanations, mathematical foundations, and practical implementation examples.

## Learning Objectives

By completing these questions, you will:
- Understand the key differences between supervised, unsupervised, and reinforcement learning
- Classify real-world problems into appropriate ML categories
- Understand the importance of proper data splitting
- Apply basic performance metrics to different problem types
- Recognize common pitfalls in ML problem formulation

## Difficulty Levels
- ★☆☆ **Beginner**: Basic conceptual understanding
- ★★☆ **Intermediate**: Applied knowledge and implementation
- ★★★ **Advanced**: Deep understanding and complex scenarios

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification, make_regression, make_blobs
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.cluster import KMeans
import warnings
warnings.filterwarnings('ignore')

# Set style for better plots
plt.style.use('default')
sns.set_palette("husl")

print("Libraries imported successfully!")

---

## Question 1: Learning Paradigms ★☆☆

**Question:** Explain the key differences between supervised, unsupervised, and reinforcement learning. For each paradigm, provide:
1. A clear definition
2. The type of data required
3. The learning objective
4. Two real-world examples

### Answer 1: Learning Paradigms

#### **Supervised Learning**
**Definition:** Learning from labeled examples to make predictions on new, unseen data.

**Data Required:** 
- Input features (X) paired with target labels/values (y)
- Training dataset: {(x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ)}

**Learning Objective:** 
Learn a mapping function f: X → Y that minimizes prediction error on new data

**Examples:**
1. **Email Spam Detection**: Classify emails as spam/not spam based on content features
2. **House Price Prediction**: Predict property values based on location, size, amenities

#### **Unsupervised Learning**
**Definition:** Finding hidden patterns or structures in data without labeled examples.

**Data Required:**
- Only input features (X) without target labels
- Dataset: {x₁, x₂, ..., xₙ}

**Learning Objective:**
Discover underlying data structure, relationships, or patterns

**Examples:**
1. **Customer Segmentation**: Group customers based on purchasing behavior
2. **Anomaly Detection**: Identify unusual network traffic patterns

#### **Reinforcement Learning**
**Definition:** Learning through interaction with an environment via trial and error.

**Data Required:**
- States, actions, and rewards from environment interactions
- Sequential decision-making scenarios

**Learning Objective:**
Maximize cumulative reward through optimal action selection

**Examples:**
1. **Game Playing**: AI learning to play chess or video games
2. **Autonomous Driving**: Vehicle learning optimal driving strategies
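
To make "cumulative reward" concrete, here is a minimal sketch of the discounted return that is commonly used as the RL objective. The discount factor `gamma` and the example reward sequence are illustrative assumptions, not values from the scenarios above.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards where a reward t steps ahead is weighted by gamma**t."""
    total = 0.0
    # Work backwards through the episode: G_t = r_t + gamma * G_{t+1}
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

# Hypothetical episode: no reward for two steps, then a reward of 10
episode_rewards = [0, 0, 10]
print(f"Discounted return: {discounted_return(episode_rewards, gamma=0.9):.2f}")
```

A `gamma` close to 1 makes the agent value delayed rewards almost as much as immediate ones; a smaller `gamma` makes it short-sighted.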

#### **Key Distinguishing Factors:**

| Aspect | Supervised | Unsupervised | Reinforcement |
|--------|------------|--------------|---------------|
| **Feedback** | Immediate (labels) | None | Delayed (rewards) |
| **Goal** | Prediction accuracy | Pattern discovery | Cumulative reward |
| **Data Structure** | (X, y) pairs | X only | (state, action, reward) sequences |
| **Evaluation** | Test set performance | Intrinsic metrics | Environment performance |

In [None]:
# Demonstration of the three learning paradigms

# 1. Supervised Learning Example: Binary Classification
print("=== Supervised Learning Example ===")
X_sup, y_sup = make_classification(n_samples=1000, n_features=2, n_redundant=0, 
                                   n_informative=2, n_clusters_per_class=1, random_state=42)

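# Hold out a test set so the reported accuracy reflects unseen data
X_sup_train, X_sup_test, y_sup_train, y_sup_test = train_test_split(
    X_sup, y_sup, test_size=0.2, stratify=y_sup, random_state=42
)

# Train a simple classifier on the training portion only
clf = LogisticRegression()
clf.fit(X_sup_train, y_sup_train)
accuracy = clf.score(X_sup_test, y_sup_test)
print(f"Supervised model test accuracy: {accuracy:.3f}")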

# 2. Unsupervised Learning Example: Clustering
print("\n=== Unsupervised Learning Example ===")
X_unsup, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)

# Apply clustering (note: no labels used!)
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X_unsup)
print(f"Discovered {len(np.unique(clusters))} clusters in unlabeled data")

# 3. Reinforcement Learning Concept (simplified)
print("\n=== Reinforcement Learning Concept ===")
print("RL involves sequential decision making:")
print("State → Action → Reward → New State → ...")
print("Goal: Learn optimal policy to maximize cumulative reward")

# Simple RL-like example: Multi-armed bandit
np.random.seed(42)
# Three slot machines with different reward probabilities
arm_probabilities = [0.1, 0.5, 0.8]  # Arm 3 is best
n_trials = 100
epsilon = 0.1  # Exploration rate

# Simple epsilon-greedy strategy
arm_counts = np.zeros(3)
arm_rewards = np.zeros(3)
total_reward = 0

for trial in range(n_trials):
    # Epsilon-greedy action selection
    if np.random.random() < epsilon or trial < 3:
        action = np.random.randint(3)  # Explore
    else:
        action = np.argmax(arm_rewards / (arm_counts + 1e-10))  # Exploit
    
    # Get reward (simulate pulling arm)
    reward = 1 if np.random.random() < arm_probabilities[action] else 0
    
    # Update statistics
    arm_counts[action] += 1
    arm_rewards[action] += reward
    total_reward += reward

print(f"RL agent total reward: {total_reward}/{n_trials}")
# Guard against division by zero in case an arm was never pulled
print(f"Best arm discovered: Arm {np.argmax(arm_rewards / np.maximum(arm_counts, 1)) + 1}")
print(f"True best arm: Arm {np.argmax(arm_probabilities) + 1}")

---

## Question 2: Problem Type Classification ★★☆

**Question:** For each of the following scenarios, identify:
1. The learning paradigm (supervised/unsupervised/reinforcement)
2. The specific problem type (classification/regression/clustering/etc.)
3. What would constitute the input features (X) and target (y, if applicable)
4. An appropriate evaluation metric

**Scenarios:**
a) Predicting stock prices for the next week
b) Grouping customers by shopping behavior
c) Determining if a medical image contains a tumor
d) Optimizing ad placement to maximize click-through rates
e) Predicting the number of stars (1-5) a user will give to a product

### Answer 2: Problem Type Classification

#### **a) Predicting stock prices for the next week**
- **Learning Paradigm:** Supervised Learning
- **Problem Type:** Regression (continuous target values)
- **Input Features (X):** Historical prices, trading volume, technical indicators, market sentiment, economic indicators
- **Target (y):** Future stock prices (continuous values)
- **Evaluation Metric:** MAE (Mean Absolute Error) or RMSE (Root Mean Squared Error)
- **Rationale:** We have historical data (features) and known outcomes (past prices) to learn from

#### **b) Grouping customers by shopping behavior**
- **Learning Paradigm:** Unsupervised Learning
- **Problem Type:** Clustering
- **Input Features (X):** Purchase frequency, average order value, product categories, seasonality patterns
- **Target (y):** None (no predefined labels)
- **Evaluation Metric:** Silhouette score, inertia, or business-relevant metrics
- **Rationale:** No predefined customer segments; we want to discover natural groupings

#### **c) Determining if a medical image contains a tumor**
- **Learning Paradigm:** Supervised Learning
- **Problem Type:** Binary Classification
- **Input Features (X):** Image pixels, extracted features, or deep learning representations
- **Target (y):** Binary labels (tumor/no tumor)
- **Evaluation Metric:** Sensitivity (recall), specificity, AUC-ROC (medical context requires high sensitivity)
- **Rationale:** We have labeled medical images from expert radiologists

#### **d) Optimizing ad placement to maximize click-through rates**
- **Learning Paradigm:** Reinforcement Learning
- **Problem Type:** Sequential decision making / Multi-armed bandit
- **State/Features:** User demographics, browsing history, time of day, device type
- **Actions:** Different ad placements, formats, targeting strategies
- **Reward:** Click-through rates, conversion rates
- **Evaluation Metric:** Cumulative click-through rate, total conversions
- **Rationale:** Need to balance exploration (trying new strategies) vs exploitation (using known good strategies)

#### **e) Predicting the number of stars (1-5) a user will give to a product**
- **Learning Paradigm:** Supervised Learning
- **Problem Type:** Ordinal Regression or Multi-class Classification
- **Input Features (X):** User profile, product features, past ratings, review text, price
- **Target (y):** Star rating (1, 2, 3, 4, or 5)
- **Evaluation Metric:** Mean Absolute Error (preserves ordinal nature) or classification accuracy
- **Rationale:** Historical user ratings provide labeled training data; ordinal nature of ratings matters
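
The point about MAE preserving the ordinal nature of star ratings can be illustrated with a small sketch. The ratings below are made up, and the helper names `star_accuracy` and `star_mae` are hypothetical.

```python
def star_accuracy(y_true, y_pred):
    """Fraction of ratings predicted exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def star_mae(y_true, y_pred):
    """Average distance in stars between the true and predicted rating."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

true_stars = [5, 4, 1, 3]
near_miss = [4, 4, 1, 3]  # one mistake, off by one star
far_miss = [1, 4, 1, 3]   # one mistake, off by four stars

for name, preds in [("near miss", near_miss), ("far miss", far_miss)]:
    print(f"{name}: accuracy={star_accuracy(true_stars, preds):.2f}, "
          f"MAE={star_mae(true_stars, preds):.2f}")
```

Both prediction sets have the same accuracy, but MAE separates them because it measures how far off each wrong rating is.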

In [None]:
# Demonstration of different problem types with synthetic data

fig, axes = plt.subplots(2, 3, figsize=(15, 10))
fig.suptitle('Different ML Problem Types', fontsize=16)

# 1. Regression Example (Stock Price Prediction)
np.random.seed(42)
time = np.linspace(0, 100, 200)
trend = 0.5 * time
noise = np.random.normal(0, 5, 200)
stock_price = 100 + trend + noise

axes[0, 0].plot(time, stock_price, 'b-', alpha=0.7)
axes[0, 0].set_title('Regression: Stock Price Prediction')
axes[0, 0].set_xlabel('Time')
axes[0, 0].set_ylabel('Price')
axes[0, 0].grid(True, alpha=0.3)

# 2. Binary Classification (Medical Diagnosis)
X_med, y_med = make_classification(n_samples=200, n_features=2, n_redundant=0, 
                                   n_informative=2, n_clusters_per_class=1, random_state=42)
# Plot each class separately so the legend gets one correctly colored entry per class
for label, color, name in [(0, 'blue', 'No Tumor'), (1, 'red', 'Tumor')]:
    mask = y_med == label
    axes[0, 1].scatter(X_med[mask, 0], X_med[mask, 1], c=color, alpha=0.6, label=name)
axes[0, 1].set_title('Binary Classification: Tumor Detection')
axes[0, 1].set_xlabel('Feature 1')
axes[0, 1].set_ylabel('Feature 2')
axes[0, 1].legend()

# 3. Clustering (Customer Segmentation)
X_cluster, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.5, random_state=42)
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(X_cluster)
axes[0, 2].scatter(X_cluster[:, 0], X_cluster[:, 1], c=cluster_labels, cmap='viridis', alpha=0.6)
axes[0, 2].scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], 
                   c='red', marker='x', s=200, linewidths=3)
axes[0, 2].set_title('Clustering: Customer Segmentation')
axes[0, 2].set_xlabel('Purchase Frequency')
axes[0, 2].set_ylabel('Average Order Value')

# 4. Multi-class Classification (Star Ratings)
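# make_classification requires n_classes * n_clusters_per_class <= 2**n_informative,
# so five classes need at least three informative features; only the first two are plotted
X_rating, y_rating = make_classification(n_samples=300, n_features=3, n_redundant=0,
                                         n_informative=3, n_classes=5,
                                         n_clusters_per_class=1, random_state=42)
y_rating = y_rating + 1  # Shift labels from 0-4 to star ratings 1-5
scatter = axes[1, 0].scatter(X_rating[:, 0], X_rating[:, 1], c=y_rating, cmap='RdYlBu', alpha=0.6)
axes[1, 0].set_title('Multi-class: Star Ratings (1-5)')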
axes[1, 0].set_xlabel('Product Quality')
axes[1, 0].set_ylabel('User Satisfaction')
plt.colorbar(scatter, ax=axes[1, 0])

# 5. Reinforcement Learning Concept (Multi-armed Bandit)
# Show the learning curve of our bandit example
trials = range(1, n_trials + 1)
cumulative_rewards = []
running_reward = 0

# Simulate the learning process again for plotting
np.random.seed(42)
arm_counts = np.zeros(3)
arm_rewards = np.zeros(3)

for trial in range(n_trials):
    if np.random.random() < epsilon or trial < 3:
        action = np.random.randint(3)
    else:
        action = np.argmax(arm_rewards / (arm_counts + 1e-10))
    
    reward = 1 if np.random.random() < arm_probabilities[action] else 0
    arm_counts[action] += 1
    arm_rewards[action] += reward
    running_reward += reward
    cumulative_rewards.append(running_reward / (trial + 1))

axes[1, 1].plot(trials, cumulative_rewards, 'g-', linewidth=2)
axes[1, 1].axhline(y=max(arm_probabilities), color='r', linestyle='--', 
                   label='Optimal Performance')
axes[1, 1].set_title('RL: Multi-armed Bandit Learning')
axes[1, 1].set_xlabel('Trial')
axes[1, 1].set_ylabel('Average Reward')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

# 6. Problem Complexity Comparison
problem_types = ['Regression', 'Binary\nClassif.', 'Multi-class\nClassif.', 'Clustering', 'RL']
complexity_scores = [3, 4, 5, 6, 8]  # Illustrative, subjective scores (not measured)
colors_complexity = ['lightblue', 'lightgreen', 'orange', 'lightcoral', 'purple']

bars = axes[1, 2].bar(problem_types, complexity_scores, color=colors_complexity, alpha=0.7)
axes[1, 2].set_title('Problem Complexity Comparison')
axes[1, 2].set_ylabel('Relative Complexity Score')
axes[1, 2].set_ylim(0, 10)

# Add value labels on bars
for bar, score in zip(bars, complexity_scores):
    axes[1, 2].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.1,
                     str(score), ha='center', va='bottom')

plt.tight_layout()
plt.show()

print("\nKey Insights:")
print("• Regression predicts continuous values (stock prices, temperatures)")
print("• Classification predicts discrete categories (spam/not spam, tumor/no tumor)")
print("• Clustering finds hidden patterns without labels")
print("• RL learns through trial and error with delayed feedback")
print("• Problem complexity increases with more classes, unlabeled data, and sequential decisions")

---

## Question 3: Data Splitting Strategy ★★☆

**Question:** You're building a machine learning model to predict customer churn (binary classification). You have a dataset with 10,000 customers collected over 2 years, with 15% churn rate.

1. Design an appropriate data splitting strategy (train/validation/test)
2. Explain why you chose those proportions
3. What potential issues should you watch out for with this dataset?
4. How would your strategy change if you only had 1,000 customers?

### Answer 3: Data Splitting Strategy

#### **1. Recommended Data Splitting Strategy (10,000 customers)**

**Split Proportions:**
- **Training Set: 70% (7,000 customers)**
- **Validation Set: 15% (1,500 customers)**
- **Test Set: 15% (1,500 customers)**

#### **2. Rationale for These Proportions**

**Training Set (70%):**
- Need sufficient data to learn complex patterns
- With 15% churn rate: ~1,050 positive examples, 5,950 negative examples
- Enough samples for reliable model training

**Validation Set (15%):**
- Used for hyperparameter tuning and model selection
- ~225 churners, 1,275 non-churners
- Large enough for reliable performance estimates
- Helps detect overfitting to the training data

**Test Set (15%):**
- Final, unbiased performance evaluation
- Never used during model development
- Simulates real-world performance

#### **3. Potential Issues to Watch For**

**Class Imbalance:**
- Only 15% positive cases (churn)
- May lead to models biased toward majority class
- **Solution:** Stratified sampling, balanced metrics (F1, AUC-ROC)

**Temporal Dependencies:**
- Customer behavior may change over 2 years
- Seasonal patterns in churn behavior
- **Solution:** Time-based splitting, temporal validation

**Data Leakage:**
- Features that wouldn't be available at prediction time
- Information from the future predicting the past
- **Solution:** Careful feature engineering, temporal awareness

**Customer-Level Splitting:**
- Ensure same customer doesn't appear in multiple sets
- Account for potential data from same household/company

#### **4. Strategy for Smaller Dataset (1,000 customers)**

**Modified Split:**
- **Training: 80% (800 customers)**, with k-fold cross-validation inside this set taking the place of a separate validation set
- **Test: 20% (200 customers)**, held out for the final evaluation

**Rationale:**
- With only 150 churners total, need to maximize training data
- Use k-fold CV (k=5 or k=10) for more robust evaluation
- Consider stratified k-fold to maintain class balance
- Might need simpler models to avoid overfitting

In [None]:
# Demonstration of proper data splitting for churn prediction

# Generate synthetic churn dataset
np.random.seed(42)
n_customers = 10000
churn_rate = 0.15

# Create synthetic features
X_churn = np.random.randn(n_customers, 5)  # 5 features
# Make churn depend on the first two features so there is signal to learn.
# The intercept of -2.1 keeps the average churn probability at roughly 15%.
churn_probability = 1 / (1 + np.exp(-(X_churn[:, 0] + 0.5 * X_churn[:, 1] - 2.1)))
y_churn = np.random.binomial(1, churn_probability)

print(f"Dataset Overview:")
print(f"Total customers: {n_customers:,}")
print(f"Churned customers: {y_churn.sum():,} ({y_churn.mean():.1%})")
print(f"Non-churned customers: {(1-y_churn).sum():,} ({(1-y_churn).mean():.1%})")

# Proper stratified splitting
from sklearn.model_selection import train_test_split

# First split: train+val vs test (85% vs 15%)
X_temp, X_test, y_temp, y_test = train_test_split(
    X_churn, y_churn, test_size=0.15, stratify=y_churn, random_state=42
)

# Second split: train vs val (70% vs 15% of original)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.15/0.85, stratify=y_temp, random_state=42
)

print(f"\nData Split Results:")
print(f"Training set: {len(X_train):,} samples ({len(X_train)/n_customers:.1%})")
print(f"  - Churned: {y_train.sum():,} ({y_train.mean():.1%})")
print(f"Validation set: {len(X_val):,} samples ({len(X_val)/n_customers:.1%})")
print(f"  - Churned: {y_val.sum():,} ({y_val.mean():.1%})")
print(f"Test set: {len(X_test):,} samples ({len(X_test)/n_customers:.1%})")
print(f"  - Churned: {y_test.sum():,} ({y_test.mean():.1%})")

# Verify stratification worked
print(f"\nStratification Check:")
print(f"Original churn rate: {y_churn.mean():.3f}")
print(f"Train churn rate: {y_train.mean():.3f}")
print(f"Val churn rate: {y_val.mean():.3f}")
print(f"Test churn rate: {y_test.mean():.3f}")

# Visualization of data splits
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Plot 1: Sample sizes
splits = ['Train', 'Validation', 'Test']
sizes = [len(X_train), len(X_val), len(X_test)]
colors = ['skyblue', 'lightgreen', 'lightcoral']

bars = axes[0].bar(splits, sizes, color=colors, alpha=0.7)
axes[0].set_title('Data Split Sizes')
axes[0].set_ylabel('Number of Samples')
for bar, size in zip(bars, sizes):
    axes[0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 50,
                 f'{size:,}', ha='center', va='bottom')

# Plot 2: Churn rates across splits
churn_rates = [y_train.mean(), y_val.mean(), y_test.mean()]
bars = axes[1].bar(splits, churn_rates, color=colors, alpha=0.7)
axes[1].set_title('Churn Rates Across Splits')
axes[1].set_ylabel('Churn Rate')
axes[1].set_ylim(0, max(churn_rates) * 1.3)  # Leave headroom for the value labels
for bar, rate in zip(bars, churn_rates):
    axes[1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.005,
                 f'{rate:.1%}', ha='center', va='bottom')

# Plot 3: Class distribution
split_data = [y_train, y_val, y_test]
bottoms = [0, 0, 0]
churned = [data.sum() for data in split_data]
not_churned = [len(data) - data.sum() for data in split_data]

axes[2].bar(splits, not_churned, label='Not Churned', color='lightblue', alpha=0.7)
axes[2].bar(splits, churned, bottom=not_churned, label='Churned', color='red', alpha=0.7)
axes[2].set_title('Class Distribution Across Splits')
axes[2].set_ylabel('Number of Samples')
axes[2].legend()

plt.tight_layout()
plt.show()

# Demonstrate what happens with smaller dataset
print(f"\n\n=== Smaller Dataset Analysis (1,000 customers) ===")
# Sample 1,000 customers from our dataset
small_indices = np.random.choice(n_customers, 1000, replace=False)
X_small = X_churn[small_indices]
y_small = y_churn[small_indices]

print(f"Small dataset: {len(X_small)} customers, {y_small.sum()} churned ({y_small.mean():.1%})")

# For small dataset, use 80/20 split + cross-validation
X_small_train, X_small_test, y_small_train, y_small_test = train_test_split(
    X_small, y_small, test_size=0.2, stratify=y_small, random_state=42
)

print(f"Small dataset split:")
print(f"  Train: {len(X_small_train)} ({len(X_small_train)/len(X_small):.0%})")
print(f"  Test: {len(X_small_test)} ({len(X_small_test)/len(X_small):.0%})")
print(f"  Use cross-validation on training set for validation")

# Demonstrate cross-validation for small dataset
from sklearn.model_selection import StratifiedKFold, cross_val_score

clf_small = RandomForestClassifier(n_estimators=50, random_state=42)
cv_scores = cross_val_score(clf_small, X_small_train, y_small_train, 
                           cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
                           scoring='f1')

print(f"\n5-Fold CV F1 Scores: {cv_scores}")
print(f"Mean CV F1 Score: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")

print(f"\nKey Takeaways:")
print(f"• Stratified sampling maintains class balance across splits")
print(f"• Larger datasets allow for dedicated validation sets")
print(f"• Smaller datasets benefit from cross-validation")
print(f"• Always consider temporal aspects in time-series data")
print(f"• Test set should never be used during model development")

---

## Question 4: Performance Metrics Selection ★★★

**Question:** You're working on different ML projects. For each scenario below, choose the most appropriate primary evaluation metric and explain your reasoning. Also mention what secondary metrics you'd monitor.

**Scenarios:**
1. **Fraud Detection**: Detecting credit card fraud (0.1% fraud rate)
2. **Medical Screening**: Identifying potential cancer cases for further testing
3. **Recommendation System**: Movie recommendations for streaming platform
4. **A/B Testing**: Comparing two website designs for conversion rate
5. **Demand Forecasting**: Predicting daily sales for inventory management

### Answer 4: Performance Metrics Selection

#### **1. Fraud Detection (0.1% fraud rate)**

**Primary Metric: Precision-Recall AUC (AP Score)**
- **Rationale:** Extreme class imbalance makes accuracy misleading
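- Precision-recall curves ignore true negatives, so they stay informative when legitimate transactions vastly outnumber fraud, whereas ROC-AUC can look strong even when precision is poor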
- Focus on identifying fraud cases without too many false alarms

**Secondary Metrics:**
- **Precision at fixed recall** (e.g., 80% recall): Business constraint on missing fraud
- **F2 Score**: Emphasizes recall (catching fraud) over precision
- **Cost-sensitive metrics**: Actual monetary impact of false positives vs false negatives

**Key Consideration:** Cost of missing fraud >> Cost of investigating false alarms
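
A short worked example shows why accuracy misleads at a 0.1% fraud rate and how the F2 score behaves. The confusion-matrix counts are hypothetical, and `summarize_counts` is an illustrative helper rather than a library function.

```python
def summarize_counts(tp, fp, fn, tn, beta=2.0):
    """Accuracy, precision, recall and F-beta from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = beta**2 * precision + recall
    f_beta = (1 + beta**2) * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_beta": f_beta}

# 100,000 transactions at a 0.1% fraud rate -> 100 fraud cases
always_legit = summarize_counts(tp=0, fp=0, fn=100, tn=99_900)     # flags nothing
modest_model = summarize_counts(tp=80, fp=320, fn=20, tn=99_580)   # catches 80 of 100

for name, stats in [("always legit", always_legit), ("modest model", modest_model)]:
    print(name, {k: round(v, 4) for k, v in stats.items()})
```

The model that never flags fraud has the higher accuracy even though it catches nothing, which is why recall-oriented metrics are the better guide here.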

#### **2. Medical Screening (Cancer Detection)**

**Primary Metric: Sensitivity (Recall)**
- **Rationale:** Missing a cancer case has severe consequences
- High sensitivity ensures we catch most potential cases
- False positives lead to additional testing; false negatives can be fatal

**Secondary Metrics:**
- **Specificity**: To avoid overwhelming the healthcare system with false positives
- **NPV (Negative Predictive Value)**: Confidence that negative results are truly negative
- **F1 Score**: Balance between precision and recall

**Key Consideration:** "First, do no harm" - don't miss cancer cases
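
How much a positive or negative screening result can be trusted depends on prevalence as well as on sensitivity and specificity. The sketch below applies Bayes' rule; the 95% sensitivity, 90% specificity, and 5% prevalence are assumed values chosen for illustration.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied to a population with the given prevalence."""
    tp = sensitivity * prevalence                # diseased, test positive
    fn = (1 - sensitivity) * prevalence          # diseased, test negative
    tn = specificity * (1 - prevalence)          # healthy, test negative
    fp = (1 - specificity) * (1 - prevalence)    # healthy, test positive
    return tp / (tp + fp), tn / (tn + fn)

ppv, npv = predictive_values(sensitivity=0.95, specificity=0.90, prevalence=0.05)
print(f"PPV: {ppv:.3f}, NPV: {npv:.3f}")
```

Under these assumptions only about one in three positive results is a true case, while a negative result is very reliable, which is the profile a screening step usually aims for.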

#### **3. Recommendation System (Movies)**

**Primary Metric: Mean Average Precision (MAP@k)**
- **Rationale:** Ranking quality matters more than binary classification
- Averages the precision at each rank where a relevant item appears within the top k
- Rewards placing relevant items near the top of the list, where users are most likely to look

**Secondary Metrics:**
- **NDCG (Normalized Discounted Cumulative Gain)**: Considers rating scores, not just binary relevance
- **Diversity metrics**: Avoid filter bubbles
- **Coverage**: Percentage of catalog being recommended
- **Click-through rate**: Real user engagement

**Key Consideration:** User satisfaction and engagement drive business value
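
NDCG, listed among the secondary metrics, can be sketched in a few lines. This version uses the graded relevance directly as the gain (an exponential gain of `2**rel - 1` is another common variant), and the relevance lists are made up.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain: relevance at rank r is divided by log2(r + 1)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Graded relevance of recommended movies, in the order they were shown
good_ranking = [3, 2, 1, 0]
poor_ranking = [0, 1, 2, 3]
print(f"NDCG@4 good: {ndcg_at_k(good_ranking, 4):.3f}")
print(f"NDCG@4 poor: {ndcg_at_k(poor_ranking, 4):.3f}")
```

Unlike MAP, this uses the rating scale itself, so a highly rated movie in first position counts for more than a mildly liked one.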

#### **4. A/B Testing (Website Design)**

**Primary Metric: Conversion Rate Difference, validated with a significance test (e.g., Chi-square, Fisher's exact)**
- **Rationale:** The quantity of interest is the difference in conversion rate; the test determines whether the observed difference is statistically significant
- Report the conversion rate difference with confidence intervals
- Account for multiple testing if comparing multiple metrics

**Secondary Metrics:**
- **Effect Size**: Practical significance, not just statistical
- **Confidence Intervals**: Range of likely true effect
- **Power Analysis**: Ensure sufficient sample size
- **Business Impact**: Revenue per visitor, lifetime value

**Key Consideration:** Statistical rigor prevents false conclusions
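
As a rough sketch of reporting an effect together with its uncertainty, the function below computes the difference in conversion rates, a normal-approximation 95% confidence interval, and a two-sided p-value from a two-proportion z-test. The visitor and conversion counts are hypothetical, and the z-test is shown as a simple alternative to the chi-square test named above.

```python
import math

def two_proportion_summary(conv_a, n_a, conv_b, n_b, z_crit=1.96):
    """Difference in rates (B - A), its approximate 95% CI, and a two-sided p-value."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    # Unpooled standard error for the confidence interval
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    ci = (diff - z_crit * se, diff + z_crit * se)
    # Pooled standard error for testing "no difference"
    # (assumes the pooled rate is strictly between 0 and 1)
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = diff / se_pooled
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return diff, ci, p_value

ab_diff, ab_ci, ab_p = two_proportion_summary(conv_a=250, n_a=5000, conv_b=275, n_b=5000)
print(f"Difference: {ab_diff:.4f}, 95% CI: ({ab_ci[0]:.4f}, {ab_ci[1]:.4f}), p={ab_p:.3f}")
```

With these counts the interval includes zero, so a 0.5 percentage point lift on 5,000 visitors per arm is not enough evidence on its own; a power analysis would indicate how many more visitors are needed.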

#### **5. Demand Forecasting (Sales Prediction)**

**Primary Metric: MAPE (Mean Absolute Percentage Error)**
- **Rationale:** Scale-independent, easy to interpret across different products
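- Caveat: MAPE is not symmetric. It is undefined when actual sales are zero, and with non-negative forecasts an under-forecast can cost at most 100% while an over-forecast is unbounded, which can favor models that forecast low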
- Directly relates to business planning accuracy

**Secondary Metrics:**
- **WMAPE (Weighted MAPE)**: Accounts for volume differences across products
- **Forecast Bias**: Systematic over/under-forecasting
- **Safety Stock Impact**: Cost of stockouts vs excess inventory
- **MAE**: Absolute error in units for inventory planning

**Key Consideration:** Balance between stockouts (lost sales) and excess inventory (holding costs)
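
The difference between MAPE and WMAPE, and the meaning of forecast bias, is easiest to see on a tiny example. The two products and their sales figures are invented, and the helper names are hypothetical.

```python
def mape_score(actual, forecast):
    """Mean of per-item absolute percentage errors (actuals must be non-zero)."""
    return sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast)) / len(actual)

def wmape_score(actual, forecast):
    """Total absolute error divided by total actual volume."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / sum(abs(a) for a in actual)

def forecast_bias(actual, forecast):
    """Mean signed error; negative means under-forecasting on average."""
    return sum(f - a for a, f in zip(actual, forecast)) / len(actual)

# One high-volume product and one low-volume product
actual_units = [1000, 10]
forecast_units = [950, 15]

print(f"MAPE:  {mape_score(actual_units, forecast_units):.1%}")
print(f"WMAPE: {wmape_score(actual_units, forecast_units):.1%}")
print(f"Bias:  {forecast_bias(actual_units, forecast_units):.1f} units")
```

MAPE is dominated by the 50% miss on the low-volume product, while WMAPE reflects that almost all units were forecast within 5%.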

In [None]:
# Demonstration of appropriate metrics for different scenarios

from sklearn.metrics import (
    precision_recall_curve, average_precision_score, roc_auc_score,
    precision_score, recall_score, f1_score, confusion_matrix
)
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error
import matplotlib.pyplot as plt

# Scenario 1: Fraud Detection (Highly Imbalanced)
print("=== Scenario 1: Fraud Detection ===")
np.random.seed(42)

# Generate imbalanced fraud data (0.1% fraud rate)
n_transactions = 10000
fraud_rate = 0.001
y_true_fraud = np.random.binomial(1, fraud_rate, n_transactions)

# Simulate two models with different characteristics
# Model A: High precision, lower recall
y_scores_A = np.random.beta(0.5, 2, n_transactions)  # Conservative model
y_scores_A[y_true_fraud == 1] += 0.3  # Boost fraud scores

# Model B: Higher recall, lower precision
y_scores_B = np.random.beta(1, 1.5, n_transactions)  # More aggressive model
y_scores_B[y_true_fraud == 1] += 0.4  # Boost fraud scores more

# Calculate metrics
ap_A = average_precision_score(y_true_fraud, y_scores_A)
ap_B = average_precision_score(y_true_fraud, y_scores_B)
roc_auc_A = roc_auc_score(y_true_fraud, y_scores_A)
roc_auc_B = roc_auc_score(y_true_fraud, y_scores_B)

print(f"Fraud cases: {y_true_fraud.sum()} out of {n_transactions} ({y_true_fraud.mean():.3%})")
print(f"Model A - AP Score: {ap_A:.4f}, ROC-AUC: {roc_auc_A:.4f}")
print(f"Model B - AP Score: {ap_B:.4f}, ROC-AUC: {roc_auc_B:.4f}")
print(f"For fraud detection, focus on AP Score due to extreme imbalance")

# Scenario 2: Medical Screening
print("\n=== Scenario 2: Medical Screening ===")
np.random.seed(42)

# Generate medical screening data (5% positive cases)
n_patients = 1000
disease_rate = 0.05
y_true_medical = np.random.binomial(1, disease_rate, n_patients)

# Simulate model predictions (threshold affects sensitivity/specificity trade-off)
y_scores_medical = np.random.beta(2, 5, n_patients)
y_scores_medical[y_true_medical == 1] += 0.4

# Test different thresholds for sensitivity/specificity trade-off
thresholds = [0.1, 0.3, 0.5]
print(f"Disease cases: {y_true_medical.sum()} out of {n_patients} ({y_true_medical.mean():.1%})")
print(f"Threshold | Sensitivity | Specificity | PPV | NPV")
print(f"-" * 50)

for threshold in thresholds:
    y_pred = (y_scores_medical >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true_medical, y_pred).ravel()
    
    sensitivity = tp / (tp + fn)  # Recall
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp) if (tp + fp) > 0 else 0  # Precision
    npv = tn / (tn + fn) if (tn + fn) > 0 else 0
    
    print(f"   {threshold:.1f}    |    {sensitivity:.3f}   |    {specificity:.3f}   | {ppv:.3f} | {npv:.3f}")

# Scenario 3: Recommendation System
print("\n=== Scenario 3: Recommendation System ===")

# Simulate recommendation rankings (simplified)
def calculate_map_at_k(y_true, y_scores, k=10):
    """Calculate Average Precision at K for one ranked list (MAP@K is the mean of this over users)"""
    # Sort by scores (descending)
    sorted_indices = np.argsort(y_scores)[::-1]
    y_true_sorted = y_true[sorted_indices]
    
    # Calculate AP@K
    relevant_items = 0
    precision_sum = 0
    
    for i in range(min(k, len(y_true_sorted))):
        if y_true_sorted[i] == 1:
            relevant_items += 1
            precision_at_i = relevant_items / (i + 1)
            precision_sum += precision_at_i
    
    return precision_sum / min(k, np.sum(y_true)) if np.sum(y_true) > 0 else 0

# Generate movie relevance data
n_movies = 100
y_true_movies = np.random.binomial(1, 0.15, n_movies)  # 15% relevant
y_scores_movies = np.random.beta(2, 5, n_movies)
y_scores_movies[y_true_movies == 1] += 0.3  # Boost relevant movie scores

map_at_5 = calculate_map_at_k(y_true_movies, y_scores_movies, 5)
map_at_10 = calculate_map_at_k(y_true_movies, y_scores_movies, 10)

print(f"Relevant movies: {y_true_movies.sum()} out of {n_movies} ({y_true_movies.mean():.1%})")
print(f"MAP@5: {map_at_5:.4f}")
print(f"MAP@10: {map_at_10:.4f}")

# Scenario 4: A/B Testing
print("\n=== Scenario 4: A/B Testing ===")
from scipy.stats import chi2_contingency

# Simulate A/B test data
np.random.seed(42)
n_visitors_A = 5000
n_visitors_B = 5000
conversion_rate_A = 0.05  # 5% baseline
conversion_rate_B = 0.055  # 5.5% treatment (10% relative improvement)

conversions_A = np.random.binomial(n_visitors_A, conversion_rate_A)
conversions_B = np.random.binomial(n_visitors_B, conversion_rate_B)

# Contingency table
observed = np.array([[conversions_A, n_visitors_A - conversions_A],
                     [conversions_B, n_visitors_B - conversions_B]])

chi2, p_value, dof, expected = chi2_contingency(observed)

print(f"Version A: {conversions_A}/{n_visitors_A} conversions ({conversions_A/n_visitors_A:.3%})")
print(f"Version B: {conversions_B}/{n_visitors_B} conversions ({conversions_B/n_visitors_B:.3%})")
print(f"Chi-square statistic: {chi2:.3f}")
print(f"P-value: {p_value:.4f}")
print(f"Significant at α=0.05: {'Yes' if p_value < 0.05 else 'No'}")

# Effect size (Cohen's h for proportions)
p1, p2 = conversions_A/n_visitors_A, conversions_B/n_visitors_B
cohens_h = 2 * (np.arcsin(np.sqrt(p2)) - np.arcsin(np.sqrt(p1)))
print(f"Effect size (Cohen's h): {cohens_h:.4f}")

# Scenario 5: Demand Forecasting
print("\n=== Scenario 5: Demand Forecasting ===")

# Generate sales forecasting data
np.random.seed(42)
n_days = 100
true_sales = 100 + 10 * np.sin(np.linspace(0, 4*np.pi, n_days)) + np.random.normal(0, 5, n_days)
predicted_sales = true_sales + np.random.normal(0, 8, n_days)  # Add prediction error

# Calculate forecasting metrics
mae = mean_absolute_error(true_sales, predicted_sales)
mape = mean_absolute_percentage_error(true_sales, predicted_sales)
rmse = np.sqrt(np.mean((true_sales - predicted_sales)**2))

# Forecast bias
bias = np.mean(predicted_sales - true_sales)

print(f"Forecasting Performance:")
print(f"MAE: {mae:.2f} units")
print(f"MAPE: {mape:.2%}")
print(f"RMSE: {rmse:.2f} units")
print(f"Forecast Bias: {bias:.2f} units ({'Over' if bias > 0 else 'Under'}forecasting)")

print(f"\n=== Key Insights ===")
print(f"• Fraud Detection: Use AP Score for extreme imbalance")
print(f"• Medical Screening: Prioritize Sensitivity (don't miss cases)")
print(f"• Recommendations: Use ranking metrics (MAP@K, NDCG)")
print(f"• A/B Testing: Focus on statistical significance and effect size")
print(f"• Forecasting: Use scale-independent metrics (MAPE) for interpretability")

---

## Question 5: Common ML Pitfalls ★★★

**Question:** Identify and explain the problems in each of the following ML scenarios. For each problem, provide a solution.

1. **Data Leakage**: A model predicting loan defaults uses the applicant's credit score from 6 months after the loan application as a feature.

2. **Selection Bias**: A medical AI trained on data from a prestigious hospital is deployed in rural clinics.

3. **Target Leakage**: An e-commerce model predicting purchase likelihood includes "items in cart" as a feature.

4. **Survivorship Bias**: A model predicting startup success is trained only on companies that lasted at least 2 years.

5. **Simpson's Paradox**: A hiring algorithm shows better performance for both male and female candidates separately, but worse overall performance.

### Answer 5: Common ML Pitfalls

#### **1. Data Leakage Problem**

**Problem Identified:**
- Using **future information** (a credit score measured 6 months after application) as a feature for a decision made at application time
- This creates artificially high performance that won't carry over to real predictions
- The model learns from information that is not available at prediction time

**Why It's Problematic:**
- Credit scores often change due to loan default itself
- Creates impossibly good validation metrics
- Will fail catastrophically in production

**Solution:**
- **Temporal Cutoff**: Only use features available at or before loan application time
- **Feature Engineering**: Use historical credit score trends, not future values
- **Time-Aware Validation**: Use temporal splits instead of random splits
- **Domain Expertise**: Collaborate with loan officers to identify appropriate features
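
The time-aware validation idea can be sketched as follows. This is a minimal illustration under stated assumptions: the `application_date` column, the synthetic loan table, and the `temporal_split` helper are hypothetical and not part of this notebook's datasets.

```python
import numpy as np
import pandas as pd

def temporal_split(df, date_col, cutoff):
    """Split rows into train (strictly before cutoff) and test (on or after cutoff)."""
    cutoff = pd.Timestamp(cutoff)
    train = df[df[date_col] < cutoff]
    test = df[df[date_col] >= cutoff]
    return train, test

# Hypothetical loan applications, one per day over two years
rng = np.random.default_rng(0)
loans = pd.DataFrame({
    "application_date": pd.date_range("2022-01-01", periods=730, freq="D"),
    "income": rng.normal(50000, 15000, 730),
    "defaulted": rng.binomial(1, 0.15, 730),
})

# Everything the model trains on precedes everything it is evaluated on
train_df, test_df = temporal_split(loans, "application_date", "2023-07-01")
print(f"Train rows: {len(train_df)}, test rows: {len(test_df)}")
```

Unlike a random split, this keeps every test row later than every training row, which is closer to how the model would be used in production.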

#### **2. Selection Bias Problem**

**Problem Identified:**
- **Population Mismatch**: Training data from prestigious hospitals doesn't represent rural clinic patients
- Different demographics, disease prevalence, imaging equipment, protocols
- Model learns patterns specific to the training population

**Why It's Problematic:**
- Prestigious hospitals often see younger, wealthier patients and have better equipment and specialist care
- Rural clinics often serve older patients from more varied socioeconomic backgrounds, with basic equipment
- Disease presentation and prevalence can therefore differ substantially between the two settings

**Solution:**
- **Representative Sampling**: Include data from diverse healthcare settings
- **Domain Adaptation**: Techniques to adapt models across populations
- **Continuous Monitoring**: Track performance across different demographics
- **Local Validation**: Test on target population before deployment
- **Federated Learning**: Train on distributed data while preserving privacy
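
One common way to check for population mismatch before deployment is a "domain classifier": train a model to tell source rows from target rows and look at its AUC. The sketch below is an assumption-laden illustration; the `domain_shift_auc` helper and the synthetic age/health-score features are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def domain_shift_auc(X_source, X_target, random_state=0):
    """Cross-validated AUC of a classifier that separates source from target rows.

    Values near 0.5 suggest similar feature distributions;
    values well above 0.5 suggest the populations differ.
    """
    X = np.vstack([X_source, X_target])
    domain = np.concatenate([np.zeros(len(X_source)), np.ones(len(X_target))])
    clf = RandomForestClassifier(
        n_estimators=100, min_samples_leaf=5, random_state=random_state
    )
    return cross_val_score(clf, X, domain, cv=5, scoring="roc_auc").mean()

# Hypothetical features: [age, health_score]
rng = np.random.default_rng(0)
source = rng.normal([45, 80], [15, 10], size=(500, 2))
target_same = rng.normal([45, 80], [15, 10], size=(500, 2))
target_shifted = rng.normal([60, 70], [20, 15], size=(500, 2))

auc_same = domain_shift_auc(source, target_same)
auc_shifted = domain_shift_auc(source, target_shifted)
print(f"Same population AUC:    {auc_same:.3f}")
print(f"Shifted population AUC: {auc_shifted:.3f}")
```

This check only detects differences in the feature distributions; it says nothing about whether the relationship between features and disease also differs, so local validation on labeled target data is still needed.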

#### **3. Target Leakage Problem**

**Problem Identified:**
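- **Feature is a near-proxy for the target**: Items in the cart are recorded just before the purchase decision and strongly indicate purchase intent
- This feature is close to the target variable in disguise
- The model ends up predicting purchases from a step that is already part of purchasing

**Why It's Problematic:**
- The feature's predictive power comes from its position in the purchase funnel, not from anything the business can act on earlier
- Cart abandonment is common in e-commerce, so the feature is not even a perfect proxy
- Model won't help with actual business decisions (driving cart additions)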

**Solution:**
- **Redefine Target**: Predict cart addition likelihood instead of purchase given cart
- **Temporal Separation**: Use past behavior to predict future actions
- **Causal Features**: Focus on features that influence behavior (browsing patterns, seasonality)
- **Business Logic**: Work with stakeholders to define meaningful prediction tasks
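
A quick screening heuristic for target leakage is to score each feature on its own: a single feature with near-perfect AUC deserves a closer look. The sketch below assumes synthetic data that deliberately exaggerates the cart signal; `flag_suspicious_features`, the 0.95 threshold, and the feature names are hypothetical choices for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def flag_suspicious_features(X, y, names, threshold=0.95):
    """Return (name, auc) pairs for features whose single-feature AUC meets the threshold."""
    flagged = []
    for j, name in enumerate(names):
        auc = roc_auc_score(y, X[:, j])
        auc = max(auc, 1.0 - auc)  # treat strong negative association the same way
        if auc >= threshold:
            flagged.append((name, round(float(auc), 3)))
    return flagged

# Synthetic sessions (assumed numbers, exaggerated on purpose)
rng = np.random.default_rng(0)
n = 2000
purchased = rng.binomial(1, 0.3, n)
browsing_minutes = rng.normal(10 + 2 * purchased, 5)  # weak, legitimate signal
items_in_cart = np.where(
    purchased == 1,
    1 + rng.poisson(1.5, n),   # buyers always have at least one item
    rng.binomial(1, 0.1, n),   # non-buyers rarely have anything in the cart
)

X = np.column_stack([browsing_minutes, items_in_cart])
flagged = flag_suspicious_features(X, purchased, ["browsing_minutes", "items_in_cart"])
print(flagged)
```

A flagged feature is not automatically leakage; the screen only tells you which features to review with stakeholders for timing and causal direction.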

#### **4. Survivorship Bias Problem**

**Problem Identified:**
- **Missing Failed Cases**: Training only on startups that lasted 2+ years drops every early failure, along with the signals that preceded it
- Creates a biased picture of which factors lead to success
- Underestimates actual failure rates

**Why It's Problematic:**
- Early failures contain crucial information about what doesn't work
- Model will be overly optimistic about success probability
- Missing the majority of the actual startup population

**Solution:**
- **Complete Population**: Include all startups, regardless of outcome
- **Right-Censored Data**: Use survival analysis techniques for ongoing companies
- **Multiple Data Sources**: Government registrations, funding databases, news archives
- **Time-Based Analysis**: Track companies from inception with proper follow-up periods
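
The right-censoring idea can be illustrated with a small Kaplan-Meier estimator, a standard survival-analysis technique. This is a minimal NumPy sketch, not a replacement for a dedicated survival library; the `kaplan_meier` helper and the eight example startups are hypothetical.

```python
import numpy as np

def kaplan_meier(durations, observed):
    """Kaplan-Meier survival estimate.

    durations: months until failure, or until last observation if still operating
    observed:  1 if the failure was seen, 0 if the company is right-censored
    Returns (event_times, survival_probabilities).
    """
    durations = np.asarray(durations, dtype=float)
    observed = np.asarray(observed, dtype=int)
    event_times = np.unique(durations[observed == 1])
    survival = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(durations >= t)                      # still under observation at t
        failures = np.sum((durations == t) & (observed == 1))
        s *= 1.0 - failures / at_risk
        survival.append(s)
    return event_times, np.array(survival)

# Hypothetical startups: months observed and whether a failure was recorded
months = [6, 12, 12, 18, 24, 24, 30, 36]
failed = [1, 1, 0, 1, 0, 1, 0, 0]

times, surv = kaplan_meier(months, failed)
for t, s in zip(times, surv):
    print(f"P(survive beyond {t:.0f} months) = {s:.3f}")
```

Companies that are still operating contribute to the at-risk counts up to their last observation instead of being dropped or treated as guaranteed successes.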

#### **5. Simpson's Paradox Problem**

**Problem Identified:**
- **Confounding Variables**: A hidden factor (for example, department or role) is related to both gender and the outcome
- The apparent relationship changes with the level of aggregation
- Subgroups differ in size and base rate, so pooling them weights each group's results differently

**Why It's Problematic:**
- May indicate discrimination in hiring pools or evaluation criteria
- Overall metric masks important subgroup differences
- Can lead to unfair or biased decisions

**Solution:**
- **Stratified Analysis**: Analyze performance within each subgroup
- **Causal Inference**: Identify and control for confounding variables
- **Fairness Metrics**: Use appropriate fairness definitions for the context
- **Domain Investigation**: Understand why paradox occurs in this specific context
- **Policy Review**: Examine hiring practices and evaluation criteria for bias
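
Stratified analysis can be sketched with a small pandas table that compares within-department rates to pooled rates. The counts below are hypothetical and chosen so that the reversal appears; the column names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical hiring counts per department and gender
df = pd.DataFrame({
    "department": ["A", "A", "B", "B"],
    "gender": ["male", "female", "male", "female"],
    "hired": [80, 45, 10, 120],
    "total": [100, 50, 20, 200],
})

# Stratified view: hire rate within each department
df["rate"] = df["hired"] / df["total"]
by_dept = df.pivot(index="department", columns="gender", values="rate")

# Aggregated view: hire rate after pooling departments
pooled = df.groupby("gender")[["hired", "total"]].sum()
overall = pooled["hired"] / pooled["total"]

# Share of each gender's applications going to each department.
# These shares are the weights that turn per-department rates into the pooled rate.
mix = df.pivot(index="department", columns="gender", values="total")
mix = mix / mix.sum()

print(by_dept, overall, mix, sep="\n\n")
```

The pooled rate for each gender is a weighted average of its department rates, so a group that mostly applies to the department with lower rates can look worse overall even when it does better in every department.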

In [None]:
# Demonstrations of ML pitfalls and their effects

# 1. Data Leakage Demonstration
print("=== 1. Data Leakage Demonstration ===")
np.random.seed(42)

# Simulate loan data
n_loans = 1000
# Features available at application time
income = np.random.normal(50000, 15000, n_loans)
credit_score_initial = np.random.normal(650, 100, n_loans)
loan_amount = np.random.normal(25000, 10000, n_loans)

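# Generate default outcome
# (intercept of 5.5 gives a default rate on the order of 15-20%; with a near-zero
# intercept almost no loan defaults and ROC AUC is undefined in the CV folds)
default_prob = 1 / (1 + np.exp(-(5.5 - 0.00002*income - 0.01*credit_score_initial + 0.00001*loan_amount)))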
defaults = np.random.binomial(1, default_prob)

# Credit score 6 months later (LEAKED feature - affected by the default itself)
credit_score_later = credit_score_initial - 100 * defaults + np.random.normal(0, 20, n_loans)

# Compare models with and without leakage
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Model WITHOUT leakage (proper features)
X_proper = np.column_stack([income, credit_score_initial, loan_amount])
model_proper = RandomForestClassifier(n_estimators=50, random_state=42)
scores_proper = cross_val_score(model_proper, X_proper, defaults, cv=5, scoring='roc_auc')

# Model WITH leakage (includes future credit score)
X_leaked = np.column_stack([income, credit_score_initial, loan_amount, credit_score_later])
model_leaked = RandomForestClassifier(n_estimators=50, random_state=42)
scores_leaked = cross_val_score(model_leaked, X_leaked, defaults, cv=5, scoring='roc_auc')

print(f"Default rate: {defaults.mean():.1%}")
print(f"Model without leakage AUC: {scores_proper.mean():.3f} ± {scores_proper.std():.3f}")
print(f"Model with leakage AUC: {scores_leaked.mean():.3f} ± {scores_leaked.std():.3f}")
print(f"Leakage creates artificially high performance!")

# 2. Selection Bias Demonstration
print("\n=== 2. Selection Bias Demonstration ===")

# Simulate medical data from two different populations
np.random.seed(42)
n_patients = 500

# Prestigious hospital population (younger, healthier baseline)
age_prestigious = np.random.normal(45, 15, n_patients)
health_score_prestigious = np.random.normal(80, 10, n_patients)  # Better baseline health
disease_prob_prestigious = 1 / (1 + np.exp(-(0.05*age_prestigious - 0.02*health_score_prestigious - 2)))
disease_prestigious = np.random.binomial(1, disease_prob_prestigious)

# Rural clinic population (older, different health patterns)
age_rural = np.random.normal(60, 20, n_patients)  # Older population
health_score_rural = np.random.normal(70, 15, n_patients)  # Different health baseline
disease_prob_rural = 1 / (1 + np.exp(-(0.03*age_rural - 0.015*health_score_rural - 1)))
disease_rural = np.random.binomial(1, disease_prob_rural)

# Train on prestigious hospital, test on rural clinic
X_prestigious = np.column_stack([age_prestigious, health_score_prestigious])
X_rural = np.column_stack([age_rural, health_score_rural])

# Hold out part of the prestigious-hospital data so the in-population score
# is measured on unseen patients, not on the rows the model was fit on
X_train_p, X_test_p, y_train_p, y_test_p = train_test_split(
    X_prestigious, disease_prestigious, test_size=0.3, random_state=42,
    stratify=disease_prestigious
)

model_biased = RandomForestClassifier(n_estimators=50, random_state=42)
model_biased.fit(X_train_p, y_train_p)

# Accuracy on held-out patients from the same population
score_same = model_biased.score(X_test_p, y_test_p)
# Accuracy on the different (target) population
score_different = model_biased.score(X_rural, disease_rural)

print(f"Training population disease rate: {disease_prestigious.mean():.1%}")
print(f"Target population disease rate: {disease_rural.mean():.1%}")
print(f"Accuracy on held-out training population: {score_same:.3f}")
print(f"Accuracy on target population: {score_different:.3f}")
print(f"Accuracy difference between populations: {(score_same - score_different):.3f}")

# 3. Simpson's Paradox Demonstration
print("\n=== 3. Simpson's Paradox Demonstration ===")

# Simulate hiring data with confounding variable (department)
np.random.seed(42)

# Department A: mostly male applicants, higher hire rates
dept_A_male = {'hired': 80, 'total': 100}  # 80% hire rate
dept_A_female = {'hired': 45, 'total': 50}  # 90% hire rate

# Department B: mostly female applicants, lower hire rates
dept_B_male = {'hired': 10, 'total': 20}   # 50% hire rate
dept_B_female = {'hired': 120, 'total': 200}  # 60% hire rate

# Calculate rates
male_A_rate = dept_A_male['hired'] / dept_A_male['total']
female_A_rate = dept_A_female['hired'] / dept_A_female['total']
male_B_rate = dept_B_male['hired'] / dept_B_male['total']
female_B_rate = dept_B_female['hired'] / dept_B_female['total']

# Overall rates
male_overall = (dept_A_male['hired'] + dept_B_male['hired']) / (dept_A_male['total'] + dept_B_male['total'])
female_overall = (dept_A_female['hired'] + dept_B_female['hired']) / (dept_A_female['total'] + dept_B_female['total'])

print(f"Department A - Male: {male_A_rate:.1%}, Female: {female_A_rate:.1%}")
print(f"Department B - Male: {male_B_rate:.1%}, Female: {female_B_rate:.1%}")
print(f"Overall - Male: {male_overall:.1%}, Female: {female_overall:.1%}")
print(f"\nParadox: women have a higher hire rate in each department, but a lower hire rate overall!")
print(f"This happens because most women applied to Department B, where hire rates are lower for everyone.")

# 4. Survivorship Bias Visualization
print("\n=== 4. Survivorship Bias Impact ===")

# Simulate startup data
np.random.seed(42)
n_startups = 1000

# Features
funding = np.random.exponential(100000, n_startups)  # Funding amount
team_size = np.random.poisson(5, n_startups)  # Initial team size
market_size = np.random.normal(1000000, 500000, n_startups)  # Target market size

# Survival probability (2+ years)
survival_prob = 1 / (1 + np.exp(-(0.000001*funding + 0.1*team_size + 0.0000005*market_size - 3)))
survived = np.random.binomial(1, survival_prob)

print(f"Total startups: {n_startups}")
print(f"Survived 2+ years: {survived.sum()} ({survived.mean():.1%})")
print(f"Training only on survivors ignores {(1-survived.mean()):.1%} of the data!")
print(f"This creates overly optimistic success predictions.")

# Show how the survivors-only sample differs from the full population
X_startup = np.column_stack([funding, team_size, market_size])

# Survivors-only training set: every label is 1, so a classifier
# has no failed examples from which to learn what separates success from failure
X_survivors = X_startup[survived == 1]
y_survivors = np.ones(X_survivors.shape[0])  # All are "successful"

print(f"\nAverage funding - All startups: ${funding.mean():,.0f}")
print(f"Average funding - Survivors only: ${funding[survived == 1].mean():,.0f}")
print(f"Any gap between these averages comes from selecting on survival, so survivor-only statistics can mislead.")

print(f"\n=== Key Takeaways ===")
print(f"• Always validate features are available at prediction time")
print(f"• Ensure training data represents the target population")
print(f"• Be careful of features that are consequences of the target")
print(f"• Include all relevant cases, not just successful ones")
print(f"• Analyze subgroups separately to detect Simpson's Paradox")
print(f"• Domain expertise is crucial for identifying these pitfalls")

---

## Summary and Key Takeaways

### **Core Concepts Mastered**

1. **Learning Paradigms**: Clear distinction between supervised, unsupervised, and reinforcement learning
2. **Problem Classification**: Ability to map real-world scenarios to appropriate ML problem types
3. **Data Splitting**: Understanding proper train/validation/test strategies for different dataset sizes
4. **Metric Selection**: Choosing appropriate evaluation metrics based on problem context and business needs
5. **Pitfall Recognition**: Identifying and avoiding common ML mistakes that lead to poor real-world performance

### **Critical Success Factors**

- **Domain Understanding**: Always collaborate with subject matter experts
- **Data Quality**: Invest time in understanding your data before modeling
- **Validation Strategy**: Use appropriate evaluation methods for your specific problem
- **Temporal Awareness**: Consider time dependencies in your data and features
- **Bias Recognition**: Actively look for sources of bias in data collection and model development

### **Next Steps**

Continue to Part 2 to master data preprocessing and feature engineering techniques that build upon these foundational concepts.

### **Practice Recommendations**

1. Apply these concepts to real datasets in your domain
2. Practice identifying problem types in news articles and research papers
3. Review ML papers critically for potential pitfalls
4. Implement custom evaluation metrics for your specific use cases
5. Build a checklist for avoiding common ML mistakes