# Homework 03: Loss Functions and Classification

### <p style="text-align: right;"> &#9989; Put your name here</p>

<img src="https://allisonhorst.github.io/palmerpenguins/reference/figures/lter_penguins.png" 
     alt="Palmer Penguins" 
     style="width: 600px; height: auto;">

## 🎯 **Learning Goals**

1. Learn to organize a machine learning project using a professional directory structure and clean workflow practices.

2. Understand binary classification with logistic regression, including loss functions, decision boundaries, and model interpretation.

3. Apply grid search to optimize logistic regression parameters and visualize decision boundaries and probability distributions.

4. Explore multi-class classification using one-vs-rest logistic regression with the Palmer Penguins dataset, and compare it to direct multi-class approaches.

**This assignment is due by 11:59 p.m. on Friday, September 19,** and should be uploaded into the appropriate "Homework" submission folder on D2L. Submission instructions can be found at the end of the notebook.


**📚 Preparation:** Read Chapter 3 of the textbook and complete the following assignment.

---

## Problem 1: Project Organization and Workflow (20 points)

The goal of this problem is to organize your machine learning project in a structured, professional manner. **Project organization is absolutely critical** in data science work as it enables reproducibility, facilitates collaboration with team members, maintains clean and efficient workflows, and most importantly, **allows for easy sharing and handoff of your work** to colleagues, stakeholders, or the broader community. A well-organized project can be understood and utilized by others without extensive explanation, making your work truly impactful and professional.

**Important Note:** There is no single "correct" way to organize data science projects - folder structures and organization patterns often depend on the specific task, team preferences, project complexity, and organizational standards. However, consistency and logical structure are always essential. For this assignment, we have chosen a particular organizational structure that represents common industry practices and will help you develop good organizational habits that you can adapt to different contexts throughout your career.

#### Required Directory Structure

You must organize your GitHub repository according to the following structure:

```
/ca_housing_project
├── /data
│   ├── /raw
│   ├── /train  
│   └── /test
├── /images
├── /models
├── /analysis
└── README.md
```

#### Task Instructions

**📚 Preparation:** Complete Parts 1-3 and have the Chapter 2 notebook code available from the textbook.

**🗒️ Task (20 points):** Organize your project files according to the specifications below. You will be graded on proper file placement, code organization, and adherence to the directory structure.

#### `/data/raw`
- Store the **original, unmodified housing dataset**, *i.e.* the CSV file `housing.csv`
- This file should never be modified or processed

#### `/data/test`
- Store the **raw testing set**, containing the features and target variable created from stratified splitting
- This should contain **only 13 columns**
- File should be named descriptively (e.g., `housing_test.csv`)

#### `/data/train` 
- Store **two versions** of the training data:
  1. **Raw training set**: Stratified split with **13 columns only**
  2. **Processed training set**: After EDA with **24 features** (including engineered features)
- Use clear, descriptive filenames (e.g., `housing_train.csv`, `housing_train_processed.csv`)

#### `/images` Directory
This directory should contain all the images and plots created in the project. This directory is automatically created in the notebook from the textbook. 

#### `/analysis` Directory 
The `/analysis` directory should contain the following files.

#### `ida.ipynb` - Initial Data Analysis Notebook
- Contains code for **Initial Data Analysis**, *i.e.* everything up to and including dataset splitting
- Should include sections, clearly identified via markdown cells, for:
  - Data loading
  - Data type analysis  
  - Stratified train/test splitting
- **Final section**: Code for saving training and testing sets to appropriate folders
- Use clear markdown headers to organize sections
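To make the idea of a stratified split concrete, here is a minimal pure-Python sketch. It is not the textbook's implementation (which relies on scikit-learn); the function name `stratified_split` and the toy `strata_labels` list are hypothetical and only illustrate that each stratum contributes the same fraction of rows to the test set.

```python
import random
from collections import defaultdict

def stratified_split(strata, test_fraction=0.2, seed=42):
    """Return (train_indices, test_indices) keeping each stratum's share intact."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_stratum = defaultdict(list)
    for index, label in enumerate(strata):
        by_stratum[label].append(index)

    train_idx, test_idx = [], []
    for label in sorted(by_stratum):
        indices = by_stratum[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])    # this stratum's share of the test set
        train_idx.extend(indices[n_test:])   # the remainder goes to training
    return sorted(train_idx), sorted(test_idx)

# Hypothetical strata: 10 "low", 30 "mid", 60 "high" rows
strata_labels = ["low"] * 10 + ["mid"] * 30 + ["high"] * 60
train_idx, test_idx = stratified_split(strata_labels)
```

With a 20% test fraction, the test set in this toy example receives 2, 6, and 12 rows from the three strata, so the class proportions match those of the full data.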

#### `eda.ipynb` - Exploratory Data Analysis Notebook  
- Contains code from textbook for conducting **Exploratory Data Analysis**
- Should include sections for:
  - Geographic data visualization
  - Feature correlation analysis
  - Feature engineering and creation
- **Final section**: Code for saving the processed dataset (24 features) to the training folder
- Use clear markdown headers to organize sections

#### `preprocessing_pipeline.py` - Python Script
- A **Python script** (not notebook) that:
  - Reads the raw training set
  - Uses a **scikit-learn Pipeline** to process the dataset
  - Saves the final processed dataset to the training folder
- Should be executable from command line
- Include proper imports and comments

#### `README.md` file

A README file explaining the content of the repo.

#### `/models` Directory

The `models` directory should contain **separate notebooks** for each model type:
- `LinearRegression.ipynb`
- `DecisionTree.ipynb` 
- `RandomForest.ipynb`
- `SVR.ipynb`

Each model notebook must contain the following **clearly marked sections** using Markdown headers:

1. **# Data Loading**
   - Read the cleaned/processed dataset (24 features)
   - No pipelines needed since data is already preprocessed

2. **# Model Fitting** 
   - Initialize and fit the specific model
   - Display basic training results

3. **# Cross-Validation**
   - Implement cross-validation evaluation
   - Display CV scores and statistics

4. **# Hyperparameter Tuning**
   - Use GridSearchCV or RandomizedSearchCV
   - Show best parameters and improved performance

5. **# Model Saving**
   - Save the trained model to the `/models` directory
   - Use appropriate naming convention (e.g., `linear_regression_model.pkl`)
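As an illustration of the model-saving step, the sketch below round-trips an object through Python's built-in `pickle` module. The helper names `save_model` and `load_model` are hypothetical, and a plain dictionary stands in for a fitted estimator; a temporary directory is used so the example does not touch a real `/models` folder.

```python
import pickle
import tempfile
from pathlib import Path

def save_model(model, path):
    """Pickle `model` to `path`, creating parent folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(model, f)
    return path

def load_model(path):
    """Load a previously pickled model from `path`."""
    with Path(path).open("rb") as f:
        return pickle.load(f)

# Stand-in for a fitted estimator (assumption: any picklable object works the same way)
placeholder_model = {"name": "linear_regression", "coef": [0.5, -1.2]}

with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_model(placeholder_model, Path(tmp) / "models" / "linear_regression_model.pkl")
    restored_model = load_model(saved_path)
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.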


#### Submission Guidelines

**File Organization Requirements:**
- All files must be in their correct directories
- Use descriptive, consistent naming conventions
- No unnecessary or duplicate files

**Code Quality Requirements:**
- All notebooks should run without errors
- Use clear markdown headers for section organization
- Include appropriate comments in code
- Python script must be executable and functional

**Documentation Requirements:**
- Each notebook should have a brief introduction explaining its purpose
- Code cells should be well-documented
- File paths should be relative to the project root
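One possible way to keep file paths relative to the project root is to build them with `pathlib` in a single place. The function `project_paths` below is a hypothetical helper, and it assumes the code is run from the project root; the folder names follow the required directory structure.

```python
from pathlib import Path

def project_paths(root="."):
    """Map short names to the project folders, relative to `root`."""
    root = Path(root)
    return {
        "raw": root / "data" / "raw",
        "train": root / "data" / "train",
        "test": root / "data" / "test",
        "images": root / "images",
        "models": root / "models",
        "analysis": root / "analysis",
    }

paths = project_paths(".")
# Example file name taken from the instructions above
train_file = paths["train"] / "housing_train.csv"
```

Because every notebook and script builds paths the same way, moving the repository to another machine does not require editing hard-coded absolute paths.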

---
## Problem 2: Binary Classification with Logistic Regression (20 points)

Now let's move to classification with four points in 2D:
- Class 0: (1,0), (2,0)
- Class 1: (3,1), (4,1)

Think about the encoding: what values will your $y_i$ take? 

Let's write $x_1$ for the horizontal coordinate and $x_2$ for the vertical coordinate of each point (the symbol $y$ is reserved for the class label). Our model computes the probability:
$$P(y=1|\mathbf{x}) = \sigma(w_0 + w_1x_1 + w_2x_2)$$

where $\sigma(z)$ is the sigmoid function:
$$\sigma(z) = \frac{1}{1 + e^{-z}}.$$

We will use the loss function (binary cross-entropy):
$$L(w_0,w_1,w_2) = -\frac{1}{4}\sum_{i=1}^4 [y_i\log(p_i) + (1-y_i)\log(1-p_i)]$$
where $p_i = \sigma(w_0 + w_1x_{1i} + w_2x_{2i})$.
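To see how the loss formula turns into numbers, here is a small sketch in plain Python (no libraries beyond `math`). The function names are illustrative, and the 0/1 label encoding is an assumption that matches the formula above. It evaluates the loss at $w_0 = w_1 = w_2 = 0$, where every $p_i = 0.5$ and the loss therefore equals $\log 2$.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)

def binary_cross_entropy(w0, w1, w2, points, labels):
    """Average binary cross-entropy over all points.

    Note: no clipping is applied, so weights extreme enough to give
    p exactly 0 or 1 would raise a math domain error.
    """
    total = 0.0
    for (x1, x2), y in zip(points, labels):
        p = sigmoid(w0 + w1 * x1 + w2 * x2)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(points)

points = [(1, 0), (2, 0), (3, 1), (4, 1)]
labels = [0, 0, 1, 1]  # assumed encoding: class 0 -> 0, class 1 -> 1

# With all weights zero, every p_i is 0.5, so the loss is log(2)
loss_at_zero = binary_cross_entropy(0.0, 0.0, 0.0, points, labels)
```

Any set of weights that classifies the points better than a coin flip should give a loss below this baseline.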

🗒️ **Task 2.1 (4 points):** Plot the four points in the plane, using different colors for each class.

In [1]:
### ANSWER

🗒️ **Task 2.2 (5 points):**  
   - Write out the gradients with respect to $w_0$, $w_1$, and $w_2$
   - Set them to zero and comment on what you find
   - You don't need to solve these equations, but explain why they're challenging


✏️ **Answer:** 
*Put your answers here!*

🗒️ **Task 2.3 (5 points):** The decision boundary occurs where $P(y=1|x) = 0.5$, which is where:
   $w_0 + w_1x_1 + w_2x_2 = 0$
   - What shape is this in the $(x_1, x_2)$ plane?
   - Add a possible decision boundary to your plot (that is, guess where it might be and plot your guess)
   - How many different possible lines could separate these points?

✏️ **Answer:** 
*Put your answers here!*

In [None]:
### ANSWER

🗒️ **Task 2.4 (5 points):** If you have a new point at $(2.5, 0.5)$:
   - Where might it fall relative to your proposed decision boundary? (add a marker X to your plot at this new point)
   - What probability would you expect the model to assign to it?

✏️ **Answer:** 
*Put your answers here!*

In [None]:
### ANSWER

---
## Problem 3: Finding and Visualizing the Decision Boundary (20 points)

#### The Model

In logistic regression for 2D classification, we model the probability that $y=1$ given input coordinates $(x_1,x_2)$:
$$P(y=1|\mathbf{x}) = \sigma(w_0 + w_1x_1 + w_2x_2)$$
where $\sigma$ is the sigmoid (logistic) function:
$$\sigma(z) = \frac{1}{1 + e^{-z}}.$$

#### The Loss Function
For binary classification with labels $y \in \{0,1\}$, we use binary cross-entropy loss:
$$
L(w_0,w_1,w_2) = -\frac{1}{n}\sum_{i=1}^n [y_i\log(p_i) + (1-y_i)\log(1-p_i)]
$$
where $p_i = P(y=1|\mathbf{x}_i) = \sigma(w_0 + w_1x_{1i} + w_2x_{2i})$.

#### The Decision Boundary
- The decision boundary occurs where $P(y=1|x) = 0.5$
- Since $\sigma(0) = 0.5$, this happens when $w_0 + w_1x_1 + w_2x_2 = 0$
- This is a line in the $(x_1, x_2)$ plane. When $w_2 \neq 0$, solving for $x_2$ gives $x_2 = -\frac{w_1}{w_2}x_1 - \frac{w_0}{w_2}$, so:
  - slope = $-w_1/w_2$
  - intercept = $-w_0/w_2$
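The slope and intercept formulas can be checked with a few lines of Python. The function `boundary_line` and the weights used here are hypothetical illustration values, not the solution to the tasks below.

```python
def boundary_line(w0, w1, w2):
    """Return (slope, intercept) of the line w0 + w1*x1 + w2*x2 = 0."""
    if w2 == 0:
        # The boundary cannot be written as x2 = slope*x1 + intercept
        # (it is a vertical line when w1 != 0).
        raise ValueError("w2 must be nonzero to express x2 as a function of x1")
    return -w1 / w2, -w0 / w2

# Hypothetical weights chosen only for illustration
slope, intercept = boundary_line(w0=-3.0, w1=1.0, w2=2.0)

# A point on the line should make the linear score exactly zero
x1_on_line = 1.0
x2_on_line = slope * x1_on_line + intercept
score = -3.0 + 1.0 * x1_on_line + 2.0 * x2_on_line
```

For these weights the slope is $-0.5$ and the intercept is $1.5$, and the point $(1, 1)$ lies on the boundary because its score is zero.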

🗒️ **Task 3.1 (1 point):** Given four points:
- Class 0: (1,0), (2,0)
- Class 1: (3,1), (4,1)

Find the "best" parameters for this model using **grid search** written in Python (no `sklearn`, `numpy` or other libraries!!). Helper code is given below if you wish to use it. (In practice we might use something more efficient, such as gradient descent, but I want this to be very robust, clear, and focused.)
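As a reminder of how brute-force grid search works in general, here is a sketch on an unrelated toy function of two parameters. The names `toy_loss`, `a_grid`, and `b_grid` are hypothetical; the same loop pattern extends to three parameters and to the cross-entropy loss, which is left for you to write.

```python
def toy_loss(a, b):
    """A simple bowl-shaped function with its minimum at a=1, b=-2."""
    return (a - 1) ** 2 + (b + 2) ** 2

# Candidate values for each parameter (the "grid")
a_grid = [-3, -2, -1, 0, 1, 2, 3]
b_grid = [-3, -2, -1, 0, 1, 2, 3]

best_loss = float("inf")
best_a, best_b = None, None

# Try every combination and remember the one with the smallest loss
for a in a_grid:
    for b in b_grid:
        loss = toy_loss(a, b)
        if loss < best_loss:
            best_loss = loss
            best_a, best_b = a, b
```

The search can only return values that are on the grid, so the spacing of the grid limits how close the result can be to the true minimum.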

In [None]:
### ANSWER

# helper code if you want to use it
# if you use this, heavily comment it so that you can do it yourself next time

import numpy as np
import matplotlib.pyplot as plt

# Data
X = np.array([[1,0], [2,0], [3,1], [4,1]])  # Each row is [x₁,x₂]
y = np.array([0, 0, 1, 1])

def compute_loss(w0, w1, w2, X, y):
    """Compute binary cross-entropy loss."""
    z = w0 + w1*X[:,0] + w2*X[:,1]  # X[:,0] is x₁, X[:,1] is x₂
    p = 1/(1 + np.exp(-z))  # sigmoid
    eps = 1e-15  # avoid log(0)
    return -np.mean(y * np.log(p + eps) + (1-y) * np.log(1 - p + eps))

# Grid search over parameters
#   grid search is easy to understand, easy to code and ensures you cover the span of the parameters
#   but, it can be slow and inefficient
w0_range = np.linspace(-10, 10, 20)
w1_range = np.linspace(-10, 10, 20)
w2_range = np.linspace(-10, 10, 20)
best_loss = float('inf')
best_w0, best_w1, best_w2 = 0, 0, 0

# Brute-force grid search: add your code below

🗒️ **Task 3.2 (2 points):** Using your best parameters:
   - Write the equation of your decision boundary line
   - Calculate $P(y=1|x)$ for each training point
   - What is the slope and intercept of your boundary?


✏️ **Answer:** 
*Put your answers here!*

In [None]:
### ANSWER

🗒️ **Task 3.3 (1 point):** Visualize the probability of being in class 1 and add the decision boundary to your plot. Helper code is given below if you wish to use it.


In [None]:
### ANSWER
# 
# helper code if you want to use it
# if you use this, heavily comment it so that you can do it yourself next time

# Create a grid of points
x1, x2 = np.meshgrid(np.linspace(0, 5, 100),
                     np.linspace(-0.5, 1.5, 100))
# Compute probabilities for each point
Z = 1/(1 + np.exp(-(best_w0 + best_w1*x1 + best_w2*x2)))

# Plot probability heatmap
plt.figure(figsize=(10, 6))
plt.contourf(x1, x2, Z, levels=20, cmap='RdBu', alpha=0.7)
plt.colorbar(label='P(y=1|x)')

# Plot decision boundary (where P=0.5)
x1_bd = np.linspace(0, 5, 100)
x2_bd = -(best_w0 + best_w1*x1_bd)/best_w2

# Modify the code below to show the decision boundary
# (contour() does not use a `label` keyword, so it is omitted here)
plt.contour(x1, x2, Z, levels=[???], colors='k', linestyles='--')


# Plot original points
plt.scatter(X[y==0,0], X[y==0,1], color='blue', label='Class 0')
plt.scatter(X[y==1,0], X[y==1,1], color='red', label='Class 1')


🗒️ **Task 3.4 (16 points):** Interpretation (give detailed answers in a markdown cell):
   - How does the probability change as you move perpendicular to the decision boundary?
   - What determines how quickly probabilities change near the boundary?
   - How confident is your model in regions far from the training data?
   - For what applications might you want probabilities rather than just classifications?

✏️ **Answer:** 
*Put your answers here!*

---

## Problem 4: Multi-class Classification with Penguins (20 points)

We'll use the Palmer Penguins dataset to explore multi-class classification using one-vs-rest (OvR) with SGD.

The one-vs-rest approach trains three separate binary classifiers:
- Adelie vs rest: $P(y=0|x)$ vs $P(y\neq 0|x)$
- Chinstrap vs rest: $P(y=1|x)$ vs $P(y\neq 1|x)$
- Gentoo vs rest: $P(y=2|x)$ vs $P(y\neq 2|x)$

Each classifier uses logistic regression with loss:
$$L_k(w) = -\frac{1}{n}\sum_{i=1}^n [y_{ki}\log(p_{ki}) + (1-y_{ki})\log(1-p_{ki})]$$
where $y_{ki}$ is 1 if example $i$ is class $k$ and 0 otherwise.
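The binary targets $y_{ki}$ can be built directly from the class labels. The sketch below uses a short made-up label list and a hypothetical helper `one_vs_rest_targets`; it only illustrates the encoding, not the classifier itself.

```python
def one_vs_rest_targets(labels, class_label):
    """Return 1 where the label equals `class_label`, else 0."""
    return [1 if label == class_label else 0 for label in labels]

# Made-up labels for five examples (0, 1, 2 stand for the three species)
labels = [0, 2, 1, 0, 2]

# One binary target vector per class
targets_by_class = {k: one_vs_rest_targets(labels, k) for k in sorted(set(labels))}
```

Each example is a positive for exactly one of the three binary problems, so the target vectors sum to 1 at every position.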

Some helper code is given below to get this problem set up quickly. 

Tasks:

1. **(5 points) Visualization Setup.** Complete the visualization code below and create two plots:
   - Original data points colored by species
   - Decision regions after fitting the classifier

2. **(5 points) Binary Boundaries.** The one-vs-rest approach creates three binary boundaries. For each classifier:
   - Plot the binary decision boundary (provide code)
   - Identify regions where boundaries disagree
   - What happens in these regions?

3. **(5 points) Class Probabilities.** Choose a point near a decision boundary and:
   - Get probabilities for each class using `clf.predict_proba()`
   - Explain how these relate to the three binary classifiers
   - Does the highest probability always win?

4. **(5 points) Reflection.** Compare one-vs-rest to direct multi-class logistic regression:
   - What are the advantages/disadvantages of each?
   - When might you prefer one over the other?
   - How do the decision boundaries differ?

Hints:
- Use `clf.fit(X_scaled, y)` to train the model
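- Access the binary classifiers through `clf.coef_` and `clf.intercept_`: row `k` of `clf.coef_` and entry `k` of `clf.intercept_` define the class-`k`-vs-rest classifier (a fitted `SGDClassifier` has no `estimators_` attribute; that exists only on a `sklearn.multiclass.OneVsRestClassifier` wrapper)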
- For binary boundaries, create separate classifiers with:
  ```python
  y_binary = (y == class_label).astype(int)
  clf_binary = SGDClassifier(loss='log_loss')
  clf_binary.fit(X_scaled, y_binary)
  ```
- Check species encoding with `le.classes_` to confirm labels

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Load data
penguins = sns.load_dataset("penguins")

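# Prepare features (bill length and depth) and target (species)
# I am using .dropna() here for simplicity, but I don't generally recommend it!
# Drop incomplete rows from the whole table first so that X and y stay aligned
# (dropping NaNs from X and y separately leaves them with different lengths).
penguins = penguins.dropna(subset=['bill_length_mm', 'bill_depth_mm', 'species'])
X = penguins[['bill_length_mm', 'bill_depth_mm']]
y = penguins['species']

# Convert species to numeric labels
le = LabelEncoder()
y = le.fit_transform(y)  # alphabetical order: 0:Adelie, 1:Chinstrap, 2:Gentoo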

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Create OvR classifier
clf = SGDClassifier(loss='log_loss', max_iter=1000, random_state=42)

# what is next?!



In [None]:
def plot_decision_regions(X, y, clf, title):
    # Create meshgrid
    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                        np.arange(y_min, y_max, 0.02))
    
    # Get predictions
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    # Plot
    plt.figure(figsize=(10, 8))
    plt.contourf(xx, yy, Z, alpha=0.4)
    plt.scatter(X[:, 0], X[:, 1], c=y, alpha=0.8)
    plt.title(title)
    plt.xlabel('Standardized Bill Length')
    plt.ylabel('Standardized Bill Depth')
    return plt

&#169; Copyright 2025, Department of Computational Mathematics, Science and Engineering at Michigan State University.