# Lab 2: Machine Learning Basics

**Day 1 - Foundations**

| Duration | Difficulty | Prerequisites |
|----------|------------|---------------|
| 75 min | Beginner | Lab 1 |

## Learning Objectives

- Understand the supervised learning workflow
- Implement linear regression from scratch
- Use scikit-learn for classification
- Evaluate model performance with metrics

## Setup

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error, accuracy_score, confusion_matrix

np.random.seed(42)

---

## Exercise 1: Understanding the ML Workflow

The standard ML workflow:
1. Collect/load data
2. Explore and preprocess
3. Split into train/test sets
4. Train model
5. Evaluate performance
6. Iterate/improve
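
To make step 3 concrete, here is a minimal sketch of what a train/test split does under the hood, written in plain NumPy. The helper `manual_split` and the toy arrays are hypothetical and exist only for illustration; in the exercise you should still use scikit-learn's `train_test_split`.

```python
import numpy as np

def manual_split(X, y, test_size=0.2, seed=0):
    """Hypothetical helper: shuffle row indices, then slice off the test rows."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))          # random ordering of row indices
    n_test = int(round(len(X) * test_size))    # number of rows held out for testing
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    # Index X and y with the same indices so each row keeps its label
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

X_demo = np.arange(20).reshape(10, 2)   # 10 samples, 2 features
y_demo = np.arange(10)                  # label i belongs to row i

X_tr, X_te, y_tr, y_te = manual_split(X_demo, y_demo)
print(f"Train rows: {X_tr.shape[0]}, test rows: {X_te.shape[0]}")
```

The key point is that features and labels are shuffled together and that no row appears in both sets.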

**Your Task:** Implement the data splitting step.

In [None]:
# Generate synthetic data
np.random.seed(42)
X = np.random.randn(100, 2)  # 100 samples, 2 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # Binary classification

print(f"Features shape: {X.shape}")
print(f"Labels shape: {y.shape}")
print(f"Label distribution: {np.bincount(y)}")

In [None]:
def split_data(X, y, test_size=0.2, random_state=42):
    """
    Split data into training and test sets.
    
    Args:
        X: Feature matrix
        y: Labels
        test_size: Fraction for test set (0.2 = 20%)
        random_state: Random seed for reproducibility
    
    Returns:
        X_train, X_test, y_train, y_test
    """
    # TODO: Use train_test_split from sklearn
    # Hint: train_test_split(X, y, test_size=..., random_state=...)
    pass

In [None]:
# Test Exercise 1
result = split_data(X, y)
if result is not None:
    X_train, X_test, y_train, y_test = result
    print(f"Training set: {X_train.shape[0]} samples")
    print(f"Test set: {X_test.shape[0]} samples")
else:
    print("Implement split_data() function")

---

## Exercise 2: Linear Regression from Scratch

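Linear regression finds the line y = mx + b that minimizes the sum of squared differences between the predicted and observed y values.

**Your Task:** Implement linear regression using the closed-form least-squares solution (the one-feature case of the normal equation).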
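
For reference, the normal equation in its general matrix form can be sketched as follows. This is not the approach the exercise asks for (you will use the scalar slope and intercept formulas), and the four-point dataset is made up so the answer is easy to check by hand.

```python
import numpy as np

# Toy data lying exactly on the line y = 2x + 1 (illustrative values)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

# Design matrix: a column of ones (for the intercept) next to the feature
A = np.column_stack([np.ones_like(x), x])

# Normal equation: solve (A^T A) theta = A^T y for theta = [intercept, slope]
theta = np.linalg.solve(A.T @ A, A.T @ y)
intercept, slope = theta
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

With a single feature, solving this 2x2 system is equivalent to the slope and intercept formulas given in the `fit()` docstring.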

In [None]:
# Generate regression data
np.random.seed(42)
X_reg = np.random.rand(100, 1) * 10  # Feature: 0-10
y_reg = 2.5 * X_reg.flatten() + 5 + np.random.randn(100) * 2  # y = 2.5x + 5 + noise

plt.scatter(X_reg, y_reg, alpha=0.5)
plt.xlabel('X')
plt.ylabel('y')
plt.title('Linear Regression Data')
plt.show()

In [None]:
class SimpleLinearRegression:
    """Linear regression from scratch."""
    
    def __init__(self):
        self.slope = None  # m
        self.intercept = None  # b
    
    def fit(self, X, y):
        """
        Fit the model using least squares.
        
        Formulas:
        slope = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean) ** 2)
        intercept = y_mean - slope * x_mean
        """
        X = X.flatten()  # Make sure X is 1D
        
        # TODO: Calculate x_mean and y_mean
        x_mean = None
        y_mean = None
        
        # TODO: Calculate slope using the formula above
        # numerator = sum((x - x_mean) * (y - y_mean))
        # denominator = sum((x - x_mean) ** 2)   (in Python, ^ is XOR, not power)
        self.slope = None
        
        # TODO: Calculate intercept
        self.intercept = None
        
        return self
    
    def predict(self, X):
        """Predict y = mx + b."""
        X = X.flatten()
        # TODO: Return predictions using slope and intercept
        pass

In [None]:
# Test Exercise 2
model = SimpleLinearRegression()
model.fit(X_reg, y_reg)

if model.slope is not None:
    print(f"Slope: {model.slope:.4f} (expected ~2.5)")
    print(f"Intercept: {model.intercept:.4f} (expected ~5.0)")
    
    # Plot the fit, but only once predict() returns values
    y_pred = model.predict(X_reg)
    if y_pred is not None:
        plt.scatter(X_reg, y_reg, alpha=0.5, label='Data')
        plt.plot(X_reg, y_pred, 'r-', linewidth=2, label='Prediction')
        plt.xlabel('X')
        plt.ylabel('y')
        plt.legend()
        plt.title('Linear Regression Fit')
        plt.show()
    else:
        print("Implement the predict() method")
else:
    print("Implement the fit() method")

---

## Exercise 3: Using Scikit-Learn

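Scikit-learn provides ready-to-use ML models that share a common interface: `fit()` trains the model and `predict()` generates predictions.

**Your Task:** Use sklearn's LinearRegression and compare its slope and intercept with your from-scratch model.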
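
As a rough illustration of the estimator interface, the sketch below fits `LinearRegression` on a made-up, noise-free dataset with two features. The arrays and the `demo_` names are assumptions for this example only. It also shows why the test cell reads `coef_[0]`: `coef_` holds one weight per feature.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data generated from y = 1 + 2*x0 + 3*x1 (no noise)
X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y_demo = 1.0 + 2.0 * X_demo[:, 0] + 3.0 * X_demo[:, 1]

demo_model = LinearRegression()
demo_model.fit(X_demo, y_demo)               # learn coefficients from the data

print("coef_:", demo_model.coef_)            # one weight per feature
print("intercept_:", demo_model.intercept_)  # scalar bias term

demo_pred = demo_model.predict(np.array([[2.0, 2.0]]))  # input must be 2D
print("prediction:", demo_pred)
```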

In [None]:
def train_sklearn_regression(X_train, y_train, X_test):
    """
    Train sklearn LinearRegression model.
    
    Returns:
        model: Trained model
        predictions: Predictions on X_test
    """
    # TODO: Create LinearRegression model
    model = None
    
    # TODO: Fit the model on training data
    # model.fit(X_train, y_train)
    
    # TODO: Make predictions on test data
    predictions = None
    
    return model, predictions

In [None]:
# Test Exercise 3
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

sk_model, sk_pred = train_sklearn_regression(X_train_reg, y_train_reg, X_test_reg)

if sk_model is not None and hasattr(sk_model, "coef_"):  # coef_ exists only after fit()
    print(f"Sklearn slope: {sk_model.coef_[0]:.4f}")
    print(f"Sklearn intercept: {sk_model.intercept_:.4f}")
else:
    print("Implement train_sklearn_regression()")

---

## Exercise 4: Classification with Logistic Regression

Logistic regression models the probability that a sample belongs to a class; the predicted class is the one with the higher probability.

**Your Task:** Train a classifier and make predictions.
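
To see where the probabilities come from, here is a small NumPy sketch of the sigmoid function applied to a linear score. The weights, bias, and sample points are hypothetical values chosen for illustration; a trained `LogisticRegression` learns its own.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters (a fitted model would learn these from data)
weights = np.array([1.5, -0.5])
bias = 0.5

X_demo = np.array([[2.0, 1.0], [0.0, 0.5], [-1.0, 2.0]])

scores = X_demo @ weights + bias        # linear part: w . x + b
probs = sigmoid(scores)                 # probability of class 1
labels = (probs >= 0.5).astype(int)     # threshold at 0.5 to get a class

for s, p, label in zip(scores, probs, labels):
    print(f"score={s:+.2f}  P(class 1)={p:.3f}  predicted class={label}")
```

A positive score gives a probability above 0.5 and a negative score gives one below 0.5, so the decision boundary is the set of points where the score is zero.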

In [None]:
# Generate classification data
from sklearn.datasets import make_classification

X_clf, y_clf = make_classification(
    n_samples=200,
    n_features=2,
    n_redundant=0,
    n_informative=2,
    n_clusters_per_class=1,
    random_state=42
)

# Visualize
plt.scatter(X_clf[:, 0], X_clf[:, 1], c=y_clf, cmap='coolwarm', alpha=0.6)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Classification Data')
plt.colorbar(label='Class')
plt.show()

In [None]:
def train_classifier(X_train, y_train):
    """
    Train a logistic regression classifier.
    
    Returns:
        Trained LogisticRegression model
    """
    # TODO: Create LogisticRegression model
    # TODO: Fit on training data
    pass

In [None]:
def get_predictions_and_probabilities(model, X_test):
    """
    Get predictions and probability estimates.
    
    Returns:
        predictions: Class predictions (0 or 1)
        probabilities: Probability of class 1
    """
    # TODO: Use model.predict() for class predictions
    predictions = None
    
    # TODO: Use model.predict_proba() for probabilities
    # Note: predict_proba returns an array of shape (n_samples, 2) whose
    # columns are [P(class 0), P(class 1)]; keep only the class-1 column
    probabilities = None
    
    return predictions, probabilities

In [None]:
# Test Exercise 4
X_train_clf, X_test_clf, y_train_clf, y_test_clf = train_test_split(
    X_clf, y_clf, test_size=0.2, random_state=42
)

clf = train_classifier(X_train_clf, y_train_clf)
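if clf is not None:
    preds, probs = get_predictions_and_probabilities(clf, X_test_clf)
    if preds is not None and probs is not None:
        print(f"First 5 predictions: {preds[:5]}")
        print(f"First 5 probabilities: {probs[:5]}")
    else:
        print("Implement get_predictions_and_probabilities()")
else:
    preds, probs = None, None  # keep names defined for the Exercise 5 test cell
    print("Implement train_classifier()")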

---

## Exercise 5: Model Evaluation

Metrics computed on held-out test data estimate how well a model will perform on examples it has not seen.

**Your Task:** Calculate various evaluation metrics.
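
Before filling in the functions, it may help to see how a confusion matrix is laid out. The sketch below builds one by hand with NumPy for five made-up labels, using the same convention as `sklearn.metrics.confusion_matrix`: rows are true classes and columns are predicted classes.

```python
import numpy as np

# Illustrative labels: 2 negatives and 3 positives
y_true = np.array([0, 0, 1, 1, 1])
y_pred = np.array([0, 1, 1, 1, 0])

# cm[i, j] counts samples whose true class is i and predicted class is j
cm = np.zeros((2, 2), dtype=int)
np.add.at(cm, (y_true, y_pred), 1)
print(cm)

# Correct predictions sit on the diagonal
accuracy = np.trace(cm) / cm.sum()
print(f"Accuracy: {accuracy:.2f}")
```

Here `cm[0, 1]` is the false-positive count and `cm[1, 0]` is the false-negative count. Accuracy alone hides that distinction.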

In [None]:
def calculate_regression_metrics(y_true, y_pred):
    """
    Calculate regression metrics.
    
    Returns:
        dict with MSE, RMSE, MAE
    """
    # TODO: Calculate Mean Squared Error
    # MSE = mean((y_true - y_pred) ** 2)
    mse = None
    
    # TODO: Calculate Root Mean Squared Error
    rmse = None
    
    # TODO: Calculate Mean Absolute Error
    # MAE = mean(|y_true - y_pred|)
    mae = None
    
    return {'MSE': mse, 'RMSE': rmse, 'MAE': mae}

In [None]:
def calculate_classification_metrics(y_true, y_pred):
    """
    Calculate classification metrics.
    
    Returns:
        dict with accuracy, confusion matrix
    """
    # TODO: Calculate accuracy
    # Accuracy = correct predictions / total predictions
    accuracy = None
    
    # TODO: Calculate confusion matrix
    # Use confusion_matrix from sklearn.metrics
    conf_matrix = None
    
    return {'accuracy': accuracy, 'confusion_matrix': conf_matrix}

In [None]:
def plot_confusion_matrix(conf_matrix, classes=('Class 0', 'Class 1')):
    """
    Visualize confusion matrix.
    """
    # TODO: Use plt.imshow() to visualize
    # Add labels for each cell
    pass

In [None]:
# Test Exercise 5
if sk_model is not None and hasattr(sk_model, "coef_"):  # skip if the model was never fitted
    y_pred_reg = sk_model.predict(X_test_reg)
    reg_metrics = calculate_regression_metrics(y_test_reg, y_pred_reg)
    print("Regression Metrics:")
    for name, value in reg_metrics.items():
        if value is not None:
            print(f"  {name}: {value:.4f}")

if clf is not None and preds is not None:
    clf_metrics = calculate_classification_metrics(y_test_clf, preds)
    print("\nClassification Metrics:")
    if clf_metrics['accuracy'] is not None:
        print(f"  Accuracy: {clf_metrics['accuracy']:.4f}")
    if clf_metrics['confusion_matrix'] is not None:
        print(f"  Confusion Matrix:\n{clf_metrics['confusion_matrix']}")

---

## Exercise 6: Understanding Overfitting

**Your Task:** Observe overfitting with polynomial regression.
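
Overfitting can also be seen in numbers rather than plots. The sketch below is an illustration under stated assumptions, not the exercise solution: it uses `numpy.polynomial.Polynomial.fit` instead of the scikit-learn pipeline, holds out every second point as a test set, and uses degrees 1, 4, and 9 chosen for this example.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Noisy sine data similar in shape to the data used in this exercise
rng = np.random.default_rng(42)
x = np.linspace(0, 4, 30)
y = np.sin(1.5 * x) + rng.normal(scale=0.3, size=x.size)

# Assumption for this sketch: alternate points form the train and test sets
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]

train_mse = {}
test_mse = {}
for degree in (1, 4, 9):
    poly = Polynomial.fit(x_train, y_train, degree)   # least-squares polynomial fit
    train_mse[degree] = float(np.mean((poly(x_train) - y_train) ** 2))
    test_mse[degree] = float(np.mean((poly(x_test) - y_test) ** 2))
    print(f"degree={degree}: train MSE={train_mse[degree]:.3f}, "
          f"test MSE={test_mse[degree]:.3f}")
```

Training error can only go down as the degree grows, because a higher-degree polynomial can always reproduce a lower-degree one. The test error is the number to watch: a widening gap between the two is the usual sign of overfitting.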

In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

def demonstrate_overfitting():
    """Show underfitting, good fit, and overfitting."""
    # Generate noisy sine wave data
    np.random.seed(42)
    X = np.linspace(0, 4, 30).reshape(-1, 1)
    y = np.sin(X.flatten() * 1.5) + np.random.randn(30) * 0.3
    
    plt.figure(figsize=(15, 4))
    degrees = [1, 4, 15]  # Underfit, Good fit, Overfit
    titles = ['Underfitting (degree=1)', 'Good Fit (degree=4)', 'Overfitting (degree=15)']
    
    for i, degree in enumerate(degrees):
        plt.subplot(1, 3, i + 1)
        
        # TODO: Create polynomial model with given degree
        # Hint: make_pipeline(PolynomialFeatures(degree), LinearRegression())
        model = None
        
        # TODO: Fit model and predict on dense X values for smooth line
        # X_smooth = np.linspace(0, 4, 100).reshape(-1, 1)
        
        plt.scatter(X, y, color='blue', alpha=0.5, label='Data')
        # TODO: Plot the prediction line
        
        plt.title(titles[i])
        plt.xlabel('X')
        plt.ylabel('y')
        plt.legend()
    
    plt.tight_layout()
    plt.show()

In [None]:
# Test Exercise 6
demonstrate_overfitting()

---

## Checkpoint

Congratulations! You've completed Lab 2.

### Key Takeaways:
- Always split data into train/test sets
- Linear regression finds the best line through data
- Logistic regression is for classification
- Evaluation metrics help assess model quality
- Overfitting means fitting the noise in the training data rather than the underlying pattern, which hurts performance on new data

**Next:** Lab 3 - Neural Networks