## From score to confidence

Last time, we looked at the perceptron, which uses the **sign** of a **score** to make a decision.

- For each training example, it computes the score: 
    $$
      z = \sum_{i=1}^n w_i x_i + b
    $$
- It predicts the class based on the sign of the score:
    - if $z > 0$, predict class 1
    - if $z < 0$, predict class -1
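
The scoring and decision rule above can be sketched in a few lines of Python. The function names `perceptron_score` and `perceptron_predict` are illustrative rather than part of the course code, and sending the tie $z = 0$ to class 1 is an assumption made here, since the rule above leaves that case unspecified.

```python
def perceptron_score(x, w, b):
    """Weighted sum of the features plus the bias: z = sum_i w_i * x_i + b."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x)) + b


def perceptron_predict(x, w, b):
    """Hard decision based only on the sign of the score."""
    z = perceptron_score(x, w, b)
    # Assumption: a score of exactly 0 is assigned to class 1.
    return 1 if z >= 0 else -1


print(perceptron_predict([1.5, 4.0], w=[1.0, 1.0], b=-2.0))  # score z = 3.5
print(perceptron_predict([0.0, 0.0], w=[1.0, 1.0], b=-2.0))  # score z = -2.0
```

Note that the prediction ignores how far the score is from zero: a score of 0.01 and a score of 100 give the same answer.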

This makes a **hard decision**. Let's plot it and see.

#### Binary and Multi-Class Classification
This is a **binary classification** task because there are only two possible classes: class 1 (positive) and class -1 (negative). The perceptron must decide between these two options for each example. If we have more than two classes, it is called a **multi-class classification** problem. 

For now, we will keep our focus on the binary case. You will do a multi-class classification task in the `PA`.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

z_vals = np.linspace(-10, 10, 400)
step = np.where(z_vals > 0, 1, -1)  # predicted class: +1 if z > 0, otherwise -1

plt.figure(figsize=(7, 4))
plt.axvline(x=0, color='black', linestyle='-', linewidth=2, alpha=0.7, label="z=0")

plt.plot(z_vals, step, label="Perceptron step(z)")
plt.xlabel("Score ($z$)")
plt.ylabel("Prediction")
plt.title("Perceptron Step Function")
plt.grid(True)
plt.ylim(-1.1, 1.1)
plt.legend()
plt.show()

### Learning

The perceptron learns by updating its weights and bias whenever it makes a **mistake** on a training example. 


  **Update rule:**

  - For each feature $i$:
    - $w_i = w_i + y \cdot x_i$
  - Bias:
    - $b = b + y$

  Here, $y$ is the true label (+1 or -1), and $x_i$ is the value of feature $i$.

  This update nudges the model to be more likely to predict the correct class next time for similar examples.
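
As a rough sketch, one step of this rule could look like the following. The helper name `perceptron_update` is made up for illustration, and treating a score of exactly zero as a mistake is an assumption of this sketch.

```python
def perceptron_update(x, y, w, b):
    """Return (w, b) after one perceptron step on the example (x, y), with y in {-1, +1}."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    if y * z <= 0:  # mistake (or exactly on the boundary): apply the update rule
        w = [w_i + y * x_i for w_i, x_i in zip(w, x)]
        b = b + y
    return w, b


# A class -1 point that the current model scores positively (z = 1.0)
w, b = perceptron_update([2.0, 1.0], -1, w=[1.0, 1.0], b=-2.0)
print(w, b)  # [-1.0, 0.0] -3.0
```

With the new parameters the same point scores $z = -5$, so it now lands on the correct side.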



### Perceptron Loss

The perceptron update can be understood as reducing the **perceptron loss**.

The perceptron loss measures how badly a point is misclassified by a linear model.

The perceptron loss is given by:

$$
\mathcal{L}(y, z) = \max(0, -y z)
$$

This means:

- If $y z > 0$, the prediction is correct and the loss is 0.
- If $y z \le 0$, the point is misclassified and the loss is $-y z$.

The perceptron only updates on misclassified points ($y z \le 0$); correctly classified points have zero loss and trigger no update.
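
A minimal sketch of this loss in code, assuming labels in $\{-1, +1\}$ (the function name `perceptron_loss` is chosen here for illustration):

```python
def perceptron_loss(y, z):
    """Perceptron loss max(0, -y*z) for a label y in {-1, +1} and a score z."""
    return max(0.0, -y * z)


print(perceptron_loss(+1, 2.0))   # correct side of the boundary -> 0.0
print(perceptron_loss(+1, -2.0))  # misclassified -> 2.0
print(perceptron_loss(-1, 3.0))   # misclassified -> 3.0
```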


#### Why the Update Reduces Loss

For a misclassified point, $y z \le 0$.

- The weight update changes $z$ in the direction of $y$. The weight update increases $z$ when $y = 1$ and decreases $z$ when $y = -1$.
- The bias update does the same.

Together, these updates **increase $y z$**.

Since the loss is:
$$
\max(0, -y z)
$$

increasing $y z$ directly reduces the perceptron loss for that example and eventually drives it to zero.
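
To make this concrete, here is the algebra for a single update. The primed symbols ($z'$ for the score after the update) are notation introduced here; the derivation uses only the update rule above and the fact that $y^2 = 1$ for labels in $\{-1, +1\}$:

$$
z' = \sum_{i=1}^n (w_i + y x_i)\, x_i + (b + y) = z + y \left( \sum_{i=1}^n x_i^2 + 1 \right)
$$

$$
y z' = y z + \sum_{i=1}^n x_i^2 + 1
$$

So each update raises $y z$ for that example by at least 1, which is why repeated updates on the same point eventually make $y z$ positive.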

Let's plot the perceptron loss next.

In [None]:
# Range of scores
z = np.linspace(-5, 5, 400)

# Perceptron loss
loss_y_pos = np.maximum(0, -z)   # y = +1
loss_y_neg = np.maximum(0, z)    # y = -1

plt.figure()
plt.plot(z, loss_y_pos, label="y = +1")

# Uncomment the next line to see the loss function for when y = -1

# plt.plot(z, loss_y_neg, label="y = -1")
plt.xlabel("Score z")
plt.ylabel("Perceptron loss")
plt.title("Perceptron Loss")
plt.grid(True)
plt.legend()
plt.show()

In [None]:
import matplotlib.pyplot as plt
from ipywidgets import interact, FloatSlider
import numpy as np  # Used only for fast grid rendering
import math

# Data using standard Python lists (labels in {-1, +1})
X_toy = [[1.5, 4.0], [1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 2.0], [0.0, 0.0], [1.5, -0.5]]
y_toy = [1, 1, -1, 1, -1, -1, -1]

def combined_perceptron_plot(w1=1.0, w2=1.0, b=-2.0):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
    
    # LEFT: 1D Step View (sign)
    z_axis = [z * 0.1 for z in range(-100, 101)]
    ax1.plot(z_axis, [1 if z >= 0 else -1 for z in z_axis],
             color='green', drawstyle='steps-mid', label="sign(z)")
    ax1.axvline(x=0, color='black', linestyle='--', alpha=0.5)
    
    total_loss = 0  # Initialize loss counter
    
    for i in range(len(X_toy)):
        x1, x2 = X_toy[i]
        y = y_toy[i]
        z_score = w1 * x1 + w2 * x2 + b
        pred = 1 if z_score >= 0 else -1
        
        loss = max(0.0, -y * z_score)  # Perceptron loss
        total_loss += loss
        
        color, m = ('red', 'x') if y == 1 else ('blue', 'o')
        if y == 1:
            ax1.scatter(z_score, pred, color=color, marker=m, s=100, linewidths=2, zorder=5)
        else:
            ax1.scatter(z_score, pred, color=color, marker=m, s=80, edgecolors='k', zorder=5)
            
    # Update title with the calculated loss
    ax1.set_title(f"Perceptron: Score vs Prediction\nTotal Perceptron Loss: {total_loss:.2f}")
    ax1.set_xlabel("Score (z)")
    ax1.set_ylim(-1.1, 1.1)
    ax1.set_yticks([-1, 1])

    # RIGHT: 2D Feature Space (Speedy Rendering)
    xx, yy = np.meshgrid(np.linspace(-1, 5, 100), np.linspace(-1, 6, 100))
    zz = w1 * xx + w2 * yy + b

    # Binary background: everything <0 is blue, everything >0 is red
    ax2.contourf(xx, yy, zz, levels=[-1e9, 0, 1e9], colors=['blue', 'red'], alpha=0.2)

    if w2 != 0:
        x_line = np.array([-1, 5])
        y_line = -(w1 * x_line + b) / w2
        ax2.plot(x_line, y_line, "k-", linewidth=2)

    for i in range(len(X_toy)):
        x1, x2 = X_toy[i]
        y = y_toy[i]
        if y == 1:
            ax2.scatter(x1, x2, color='red', marker='x', s=100, linewidths=2)
        else:
            ax2.scatter(x1, x2, color='blue', marker='o', s=80, edgecolors='k')

    ax2.set_xlim(-1, 5)
    ax2.set_ylim(-1, 6)
    ax2.set_title("2D View: Hard Decision Regions")
    plt.tight_layout()
    plt.show()

interact(combined_perceptron_plot,
         w1=(-5.0, 5.0, 0.1),
         w2=(-5.0, 5.0, 0.1),
         b=(-10.0, 10.0, 0.1))


Sometimes we don’t want a *hard* yes/no.

Instead, we might want to say things like:
- "I'm very confident this is class 1"
- "I'm somewhat confident this is class 1"
- "I'm unsure"
- "I'm somewhat confident this is class 0"
- "I'm confident this is class 0"

To do this, we can instead use a function that:
- takes any number (positive or negative) and outputs a value between 0 and 1


Q: Why a value between 0 and 1?

`YOUR ANSWER HERE`

One common function used for this is called the **sigmoid** function.

It takes any number (in our case, the score $z$) and outputs a number between 0 and 1:

$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$

We will treat this output as a **confidence**.

**Note**: *We interpret this value as a confidence, but mathematically it behaves like a probability.*

In [None]:
# Complete the sigmoid function below.
def sigmoid(z):
    """
    z: a numeric value.
    Return sigmoid(z) = 1 / (1 + exp(-z)).
    """
    # YOUR CODE HERE
    # You can use math.exp() to compute the exponential of a number.
    
    raise NotImplementedError

In [None]:
# Use assertions for 5 key test cases
assert abs(sigmoid(0) - 0.5) < 1e-7, "sigmoid(0) failed"
assert abs(sigmoid(1) - 0.7310586) < 1e-7, "sigmoid(1) failed"
assert abs(sigmoid(-1) - 0.2689414) < 1e-7, "sigmoid(-1) failed"
assert abs(sigmoid(-10) - 0.0000454) < 1e-7, "sigmoid(-10) failed"
assert abs(sigmoid(10) - 0.9999546) < 1e-7, "sigmoid(10) failed"

print("All key sigmoid test cases passed!")

Q. What happens if we try to compute `sigmoid(-800)` with the sigmoid implementation above? Why?

`YOUR ANSWER HERE`


In [None]:
# sigmoid(-800)

The sigmoid function can be written in two equivalent forms:

$$
\sigma(z) = \frac{1}{1 + e^{-z}} = \frac{e^{z}}{1 + e^{z}}
$$

Both formulas produce the same output for any value of $z$. The second form is especially useful for large negative $z$, as it avoids numerical overflow.
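
The short experiment below is one way to see the difference between the two forms. The names `sigmoid_first_form` and `sigmoid_second_form` are illustrative only. It relies on `math.exp` raising `OverflowError` for very large arguments and returning `0.0` for very negative ones.

```python
import math


def sigmoid_first_form(z):
    return 1 / (1 + math.exp(-z))


def sigmoid_second_form(z):
    ez = math.exp(z)
    return ez / (1 + ez)


# For moderate scores the two forms agree
print(sigmoid_first_form(2.0), sigmoid_second_form(2.0))

# For a large negative score, exp(-z) = exp(800) is too big for a float
try:
    sigmoid_first_form(-800)
except OverflowError as err:
    print("first form failed:", err)

# exp(-800) underflows to 0.0 instead of raising, so the second form is safe here
print(sigmoid_second_form(-800))
```

The second form has the mirror-image problem for large positive $z$, so a stable implementation would pick the form based on the sign of $z$.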

In [None]:
# A numerically stable version of the sigmoid function.
def stable_sigmoid(z):
    """
    This implementation avoids overflow issues by handling large positive and negative values of z separately.

    z: a numeric value.

    if z >= 0: use the 'standard' formula: 1/(1 + exp(-z))
    if z < 0: use the alternative formula to avoid overflow: exp(z) / (1 + exp(z))
    """
    if z >= 0:
        return 1 / (1 + math.exp(-z))
    else:
        ez = math.exp(z)
        return ez / (1 + ez)

In [None]:
stable_sigmoid(-800)

Now, let's plot the sigmoid function.

In [None]:
plt.figure(figsize=(7, 4))
sigmoid_values = [stable_sigmoid(z) for z in z_vals]
plt.plot(z_vals, sigmoid_values, label="Sigmoid(z)")
plt.axvline(x=0, color='black', linestyle='-', linewidth=2, alpha=0.7, label="z=0")
plt.xlabel("z")
plt.ylabel("sigmoid(z)")
plt.title("Sigmoid Function")
plt.grid(True)
plt.legend()
plt.show()

Now we can combine the steps and make a prediction:

1. Compute a score: 
 $$
      z = \sum_{i=1}^n w_i x_i + b
$$
2. Turn the score into a confidence value $p$: $$p = \sigma(z)$$

We have not changed how the score is computed; we only changed what we do *after* the score.

- To make a prediction:
    - If $p > 0.5$, predict class 1
    - If $p < 0.5$, predict class 0

This means the model outputs a **confidence** (between 0 and 1), and we use a threshold (usually 0.5) to decide the final class.
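
Put together, the steps above might be sketched as follows. The function name `predict_with_confidence` and the `threshold` argument are illustrative, and assigning $p$ exactly equal to the threshold to class 0 is an assumption, since the rule above does not cover that case.

```python
import math


def predict_with_confidence(x, w, b, threshold=0.5):
    """Return (p, predicted_class) for one data point x."""
    # Step 1: score
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    # Step 2: confidence, choosing the overflow-safe form of the sigmoid
    if z >= 0:
        p = 1 / (1 + math.exp(-z))
    else:
        p = math.exp(z) / (1 + math.exp(z))
    # Step 3: threshold (assumption: p == threshold goes to class 0)
    return p, (1 if p > threshold else 0)


print(predict_with_confidence([1.5, 4.0], w=[1.0, 1.0], b=-2.0))  # z = 3.5, confident class 1
print(predict_with_confidence([0.0, 0.0], w=[1.0, 1.0], b=-2.0))  # z = -2.0, leans class 0
```

Unlike the perceptron, the returned $p$ keeps information about how far the score is from the boundary.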

## From confidence to loss

A **loss function** tells us how bad the model’s prediction was.

- Small loss -> the model did well
- Large loss -> the model did poorly

***Note**: We will switch the classes of $y$ from $-1$ and $1$ to $0$ and $1$ for mathematical convenience. This change does not make a difference to the underlying logic or results.*

For binary classification, we want a loss function such that:

- If the true label is 1:
  - high confidence (close to 1) → small loss
  - low confidence (close to 0) → large loss

- If the true label is 0:
  - low confidence (close to 0) → small loss
  - high confidence (close to 1) → large loss

One commonly used loss function that has exactly this behavior
is called **binary cross-entropy**.

For today, you do NOT need to know where this formula comes from.
You only need to know that:

- it behaves the way we want
- it is smooth
- it works well with sigmoid


Binary cross-entropy loss for one data point is:
- For $y=1$: $L(p, 1) = -\log(p)$  
- For $y=0$: $L(p, 0) = -\log(1-p)$
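
A few numbers help to get a feel for these two formulas. The snippet below simply evaluates them for a confident prediction that is right and one that is wrong; the variable names are illustrative.

```python
import math

# True label y = 1: loss is -log(p)
loss_y1_right = -math.log(0.9)      # confident and right: about 0.105
loss_y1_wrong = -math.log(0.1)      # confident and wrong: about 2.303

# True label y = 0: loss is -log(1 - p)
loss_y0_right = -math.log(1 - 0.1)  # about 0.105
loss_y0_wrong = -math.log(1 - 0.9)  # about 2.303

print(loss_y1_right, loss_y1_wrong)
print(loss_y0_right, loss_y0_wrong)
```

A confident wrong answer costs roughly twenty times more than a confident right one, and the gap keeps growing as $p$ gets closer to the wrong extreme.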

In [None]:
import numpy as np

import matplotlib.pyplot as plt

p_vals = np.linspace(0.001, 0.999, 500)
bce_y1 = [-np.log(p) for p in p_vals]           # y = 1
bce_y0 = [-np.log(1 - p) for p in p_vals]       # y = 0
plt.figure(figsize=(7, 4))
plt.plot(p_vals, bce_y1, label="y = 1")
# plt.plot(p_vals, bce_y0, label="y = 0")
plt.xlabel("$p$")
plt.ylabel("Binary cross-entropy loss")
plt.title("Binary Cross-Entropy Loss vs Confidence")
plt.legend()
plt.axvline(x=0.5, color='gray', linestyle='--', linewidth=2, alpha=0.5)
plt.grid(True)
plt.show()

Combining the two cases, we get: 
$$
\mathcal{L}(p, y) = -\big(y \log(p) + (1-y)\log(1-p)\big)
$$

where:
- $p$ is the model’s confidence (from sigmoid)
- $y$ is the true label (0 or 1)
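
Substituting each label into the combined formula recovers the two separate cases, because one of the two terms is always multiplied by zero:

$$
y = 1: \quad \mathcal{L}(p, 1) = -\big(1 \cdot \log(p) + 0 \cdot \log(1-p)\big) = -\log(p)
$$

$$
y = 0: \quad \mathcal{L}(p, 0) = -\big(0 \cdot \log(p) + 1 \cdot \log(1-p)\big) = -\log(1-p)
$$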

Complete the code in the next cell.

In [None]:
# Complete the binary cross-entropy loss function below.

def binary_cross_entropy(p, y):
    """
    p: model confidence [0, 1]
    y: true label (0 or 1)
    """

    # Return the binary cross-entropy loss: -(y * log(p) + (1 - y) * log(1 - p))
    # You can use math.log() to compute the logarithm of a number.

    # YOUR CODE HERE
    
    raise NotImplementedError

In [None]:
# True label = 1
print(binary_cross_entropy(0.9, 1))  # should be small
print(binary_cross_entropy(0.1, 1))  # should be large

# True label = 0
print(binary_cross_entropy(0.1, 0))  # should be small
print(binary_cross_entropy(0.9, 0))  # should be large

Q. For which values of $p$ is this loss undefined or numerically unstable? Why?

`YOUR ANSWER HERE`


In [None]:
def stable_binary_cross_entropy(p, y):
    """
    Clips p away from 0 and 1 to avoid log(0), then computes binary cross-entropy loss.
    
    p: model confidence (between 0 and 1)
    y: true label (0 or 1)
    """
    eps = 1e-8

    if p < eps:
        p = eps
    if p > 1 - eps:
        p = 1 - eps

    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

In [None]:
# should not error out
print(stable_binary_cross_entropy(0.0, 1))
print(stable_binary_cross_entropy(1.0, 0)) 

So far we have:

1. Compute a score:  
   $z = \sum_{i=1}^n w_i x_i + b$

2. Turn the score into a confidence:  
   $p = \sigma(z)$

3. Measure how good that confidence was using a loss (given the true label $y$).

This gives the *forward* computation:
  
$x \xrightarrow{\,w,b\,} z \xrightarrow{\,\sigma(.)\,} p \xrightarrow{\,y\,} L$

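Each arrow means that the quantity on the right is computed from (depends on) the quantity on the left, together with whatever is written above the arrow.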

This sequence is the **computation graph** for our model.
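
For a single data point, the graph can be followed step by step in code. This is only a sketch: `forward_one_point` is a made-up name, and the sigmoid and clipping are written inline so that the snippet runs on its own.

```python
import math


def forward_one_point(x, y, w, b):
    """Follow the graph x -> z -> p -> L for one data point with label y in {0, 1}."""
    # x --(w, b)--> z
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    # z --(sigmoid)--> p, using the overflow-safe form for each sign of z
    if z >= 0:
        p = 1 / (1 + math.exp(-z))
    else:
        p = math.exp(z) / (1 + math.exp(z))
    # p --(y)--> L, clipping p so that log() never sees 0
    p_safe = min(max(p, 1e-8), 1 - 1e-8)
    loss = -(y * math.log(p_safe) + (1 - y) * math.log(1 - p_safe))
    return z, p, loss


z, p, loss = forward_one_point([1.5, 4.0], 1, w=[1.0, 1.0], b=-2.0)
print(z, p, loss)
```

Every intermediate value is returned so that each arrow in the graph can be inspected separately.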

Next, we will ask:

- How should we change the parameters $w$ and $b$ so that the loss $L$ becomes smaller?

But first, let's visualize how changing the parameters affects the decision boundary and the loss.


In [None]:
# Load the toy data again for the next section
X_toy = [
    [1.5, 4],
    [1, 2],
    [2, 1],
    [3, 5],
    [4, 2],
    [0, 0],
    [1.5, -0.5]
]
y_toy = [1, 1, 0, 1, 0, 0, 0]

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from ipywidgets import interact, FloatSlider

def combined_interactive_plot(w1=1.0, w2=1.0, b=0.0):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
    
    # LEFT PLOT: Sigmoid Confidence (1D View)
    z_range = np.linspace(-10, 10, 400)
    sig_vals = 1 / (1 + np.exp(-z_range))
   

    ax1.plot(z_range, sig_vals, color='green', label="Sigmoid(z)")
    ax1.axvline(x=0, color='black', linestyle='--', alpha=0.5, label="Boundary (z=0)")
    
    # Plot toy points on the sigmoid curve
    total_loss = 0
    for x, y in zip(X_toy, y_toy):
        z_score = w1 * x[0] + w2 * x[1] + b
        p = 1 / (1 + np.exp(-z_score))
        
        # Calculate loss for annotation
        p_clipped = np.clip(p, 1e-8, 1 - 1e-8)
        loss = -(y * np.log(p_clipped) + (1 - y) * np.log(1 - p_clipped))
        total_loss += loss
        
        color = 'red' if y == 1 else 'blue'
        ax1.scatter(z_score, p, color=color, s=50, edgecolors='k', zorder=5)
    
    ax1.set_xlabel("Score ($z = w_1x_1 + w_2x_2 + b$)")
    ax1.set_ylabel("Confidence ($p$)")
    ax1.set_title(f"1D View: Confidence vs Score\nTotal CE Loss: {total_loss:.2f}")
    ax1.grid(True, alpha=0.3)
    ax1.legend()

    # RIGHT PLOT: Decision Boundary (2D View)
    x1_min, x1_max = -1, 5
    x2_min, x2_max = -1, 6
    xx, yy = np.meshgrid(np.linspace(x1_min, x1_max, 100), np.linspace(x2_min, x2_max, 100))
    z_grid = w1*xx + w2*yy + b
    p_grid = 1 / (1 + np.exp(-z_grid))
    
    # Background probability contour
    contour = ax2.imshow(p_grid, origin="lower", extent=[x1_min, x1_max, x2_min, x2_max], 
                         aspect="auto", alpha=0.3, cmap='RdBu_r')
    
    # Decision boundary line (where z=0)
    if w2 != 0:
        x1_vals = np.array([x1_min, x1_max])
        x2_vals = -(w1 * x1_vals + b) / w2
        ax2.plot(x1_vals, x2_vals, "k--", label="z=0")
    elif w1 != 0:
        x0 = -b / w1
        ax2.axvline(x=x0, color="k", linestyle="--", label="z=0")
    
    # Plot the original points
    labeled0 = False
    labeled1 = False
    for x, y in zip(X_toy, y_toy):
        if y == 1:
            ax2.scatter(x[0], x[1], c='red', marker='x', s=100,
                label="Class 1" if not labeled1 else "")
            labeled1 = True
        else:
            ax2.scatter(x[0], x[1], c='blue', marker='o', s=100,
                label="Class 0" if not labeled0 else "")
            labeled0 = True

    ax2.set_xlim(x1_min, x1_max)
    ax2.set_ylim(x2_min, x2_max)
    ax2.set_xlabel("Feature $x_1$")
    ax2.set_ylabel("Feature $x_2$")
    ax2.set_title("2D View: Feature Space Boundary")
    ax2.grid(True, alpha=0.3)
    ax2.legend()

    plt.tight_layout()
    plt.show()

interact(
    combined_interactive_plot,
    w1=FloatSlider(value=1.0, min=-5, max=5, step=0.1),
    w2=FloatSlider(value=1.0, min=-5, max=5, step=0.1),
    b=FloatSlider(value=-2.0, min=-5, max=5, step=0.1)
)

## Reducing the loss

Earlier in the course, we reduced error by asking:

- If we change a parameter slightly, how does the error change?
- In which direction should we move the parameter to reduce error?
- How sensitive is the error to that parameter?

We answered these questions by treating error as a function of the parameters and using **gradients** to understand how changes in the parameters affect the error.

[SIDE QUEST](https://docs.google.com/presentation/d/1g1uaurl-GckLb47FskBPzgtsrQJPiAQUhdVML_eOI_8/edit?usp=sharing): Play the slideshow and watch the videos to get some intuition about gradients. I have marked the relevant timestamps in the slides, but please feel free to watch the full videos.

Here, we use the same idea, but with a more precise quantity called a **loss**.

The goal is the same:
understand how the loss changes with respect to the parameters so that we can update $w$ and $b$ to reduce it.
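
Before applying this to our model, here is the idea on a toy problem with a single parameter. Everything in this snippet (`toy_error`, `estimate_slope`, the step size 0.1) is made up for illustration and is not part of the model.

```python
def toy_error(w):
    """A made-up error curve with its minimum at w = 3."""
    return (w - 3) ** 2


def estimate_slope(f, w, eps=1e-4):
    """Forward finite difference: (f(w + eps) - f(w)) / eps."""
    return (f(w + eps) - f(w)) / eps


w = 1.0
slope = estimate_slope(toy_error, w)
print(slope)  # close to the exact derivative 2 * (w - 3) = -4

# The slope is negative, so increasing w lowers the error:
# step in the direction opposite to the slope.
w_new = w - 0.1 * slope
print(toy_error(w), toy_error(w_new))
```

The sign of the slope gives the direction to move, and its size says how sensitive the error is to the parameter.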

Our model has three parameters:
- `w1`: weight for feature x1
- `w2`: weight for feature x2
- `b`: bias

Computation Graph: 

$x \xrightarrow{\,w,b\,} z \xrightarrow{\,\sigma(.)\,} p \xrightarrow{\,y\,} L$

In [None]:
# The forward pass to get the confidence values
def forward(X, w1, w2, b):
    """
    Inputs: 
    X: list of data points, where each point is a list of 2 features [x1, x2]
    w1, w2: weights for the two features
    b: bias term

    Output: A list of p (confidence values between 0 and 1) for each data point
    """
    
    # Initialize an empty list to store confidence values
    p_list = []

    # Iterate through each data point in X
    for i in range(len(X)):
        x1 = X[i][0]
        x2 = X[i][1]
        
        # Calculate the linear score z = w1*x1 + w2*x2 + b
        z = w1 * x1 + w2 * x2 + b
        
        # Use the score to get the sigmoid confidence
        p = stable_sigmoid(z)

        # Append the confidence value to the list
        p_list.append(p)
        
    return p_list

In [None]:
# The Loss Function
def get_avg_binary_cross_entropy(p_list, y):
    """
    Average binary cross-entropy loss over all data points.

    p_list: list of predicted confidence values (between 0 and 1)
    y: list of true labels (0 or 1)
    """
    total_loss = 0.0
    for i in range(len(p_list)):
        p = p_list[i]
        y_true = y[i]
        
        # Safety: Clip p to avoid log(0) error
        loss_i = stable_binary_cross_entropy(p, y_true)
        total_loss += loss_i
        
    return total_loss / len(p_list)

In [None]:
# The gradient calculation using the finite difference method
# To see how (w, b) affect the loss L, we nudge one parameter at a time and look at how that affects the loss.

def get_gradients(X, y, w1, w2, b):
    eps = 0.0001
    
    # STEP 0: Baseline - Where are we now?
    # We must Run Forward -> Then Calculate Loss
    base_conf = forward(X, w1, w2, b)
    base_loss  = get_avg_binary_cross_entropy(base_conf, y)
    
    # STEP 1: Get gradient for w1
    # Nudge w1 -> Re-run Forward -> Re-calculate Loss
    w1_conf = forward(X, w1 + eps, w2, b)
    w1_loss  = get_avg_binary_cross_entropy(w1_conf, y)
    grad_w1  = (w1_loss - base_loss) / eps
    
    # STEP 2: Get gradient for w2
    # Nudge w2 -> Re-run Forward -> Re-calculate Loss
    w2_conf = forward(X, w1, w2 + eps, b)
    w2_loss  = get_avg_binary_cross_entropy(w2_conf, y)
    grad_w2  = (w2_loss - base_loss) / eps
    
    # STEP 3: Get gradient for bias
    # Nudge b -> Re-run Forward -> Re-calculate Loss
    b_conf = forward(X, w1, w2, b + eps)
    b_loss  = get_avg_binary_cross_entropy(b_conf, y)
    grad_b  = (b_loss - base_loss) / eps
     
    return grad_w1, grad_w2, grad_b

In [None]:
# The Training Loop

# Initialize Parameters
w1 = 0.0
w2 = 0.0
b  = 0.0

# Initialize hyperparameters
learning_rate = 0.1
epochs = 500

print(f"Initial Loss: {get_avg_binary_cross_entropy(forward(X_toy, w1, w2, b), y_toy):.4f}")

for i in range(epochs):
    
    # 1. Calculate Gradients (This runs the model 4 times!)
    dw1, dw2, db = get_gradients(X_toy, y_toy, w1, w2, b)
    
    # 2. Update Weights
    w1 = w1 - learning_rate * dw1
    w2 = w2 - learning_rate * dw2
    b  = b  - learning_rate * db
    
    if i % 50 == 0:
        # Check progress
        curr_preds = forward(X_toy, w1, w2, b)
        curr_loss = get_avg_binary_cross_entropy(curr_preds, y_toy)
        print(f"Iter {i}: w1={w1:.2f}, w2={w2:.2f}, b={b:.2f} | Loss={curr_loss:.4f}")

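# Recompute the loss with the final parameters
# (curr_loss above is only refreshed every 50 iterations)
final_preds = forward(X_toy, w1, w2, b)
final_loss = get_avg_binary_cross_entropy(final_preds, y_toy)

print("\nFinal Result:")
print(f"Final loss: {final_loss:.4f}")
print(f"w1: {w1:.4f}, w2: {w2:.4f}, b: {b:.4f}")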

We just trained our model by repeating the following steps:

- Running the model forward to get the confidence values $p$
- Computing the loss using $p$ and $y$
- Nudging one parameter at a time to see how the loss changes
- Updating the parameters using those estimates
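
Written as formulas, the last two steps look like this for $w_1$ (the same pattern applies to $w_2$ and $b$). Here $\varepsilon$ stands for the small nudge `eps`, $\eta$ for `learning_rate`, and $L(w_1, w_2, b)$ for the average loss over the training set; these symbols are shorthand introduced here:

$$
\frac{\partial L}{\partial w_1} \approx \frac{L(w_1 + \varepsilon, w_2, b) - L(w_1, w_2, b)}{\varepsilon}
$$

$$
w_1 \leftarrow w_1 - \eta \, \frac{\partial L}{\partial w_1}
$$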


This works, but it is slow and far from optimal. Why?

*Hint:*  Look closely at `get_gradients(X, y, w1, w2, b)`. How many times does the model run forward and recompute the loss during **one training step**?


`YOUR ANSWER HERE`