# Lecture 09: Logistic Regression
## Possible Subjective Exam Questions
---

## Section 1: Introduction - From Regression to Classification

### Q1. Why can't we directly use linear regression for classification tasks? Explain with an example.

**Answer:**

**Problem with Linear Regression for Classification:**

Linear regression is not suitable for classification because:

1. **Predictions can go outside [0,1]:** Linear regression can predict values like -0.5 or 1.5, which don't make sense as probabilities

2. **Not bounded:** The output range is $(-\infty, +\infty)$ but we need values between 0 and 1

3. **Sensitive to outliers:** Extreme values can shift the regression line badly

**Example:**
- Medical diagnosis: Stroke ($y=0$) vs Drug Overdose ($y=1$)
- If we use linear regression and predict $\hat{y} = 1.3$, what does this mean?
- It cannot be interpreted as a probability, since a probability must lie between 0 and 1
- Similarly, $\hat{y} = -0.2$ doesn't make sense as a probability

**Solution:** Use Logistic Regression which outputs probabilities in [0,1]
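
The out-of-range problem can be reproduced with a minimal sketch. The six data points below are hypothetical (they are not from the lecture), and the fit uses the standard single-feature least-squares formulas:

```python
# Hypothetical 1-D data: feature x, binary label y (0 or 1).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]

# Ordinary least squares for one feature: slope = S_xy / S_xx.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
s_xx = sum((x - mean_x) ** 2 for x in xs)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x


def predict_linear(x):
    """Raw linear-regression output; nothing constrains it to [0, 1]."""
    return intercept + slope * x


for x in (0.0, 3.5, 6.0):
    print(f"x = {x:.1f} -> y_hat = {predict_linear(x):.3f}")
```

On this data the fitted line gives a negative value at $x = 0$ and a value above 1 at $x = 6$, so neither can be read as a probability.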

### Q2. What is binary classification? Give a real-world example with proper encoding.

**Answer:**

**Binary Classification:**
A classification problem where there are exactly two possible classes/outcomes.

**Encoding:**
- Class 0: Negative class
- Class 1: Positive class

**Real-World Example: Emergency Room Diagnosis**

| Patient Symptoms | Actual Condition | Encoding |
|-----------------|------------------|----------|
| Symptom set A | Stroke | $y = 0$ |
| Symptom set B | Drug Overdose | $y = 1$ |

**Other Examples:**
- Email: Spam (1) vs Not Spam (0)
- Medical: Disease Present (1) vs Absent (0)
- Finance: Fraud (1) vs Legitimate (0)
- Image: Cat (1) vs Dog (0)

### Q3. If we use linear regression for binary classification with threshold 0, what are the potential problems?

**Answer:**

**Approach:**
- Fit linear regression to binary response
- Classify as Class 1 if $\hat{y} > 0$, otherwise Class 0

**Problems:**

1. **Unbounded predictions:**
   - $\hat{y}$ can be any value from $-\infty$ to $+\infty$
   - Cannot interpret as probability

2. **Sensitive to outliers:**
   - Adding extreme points can shift the line dramatically
   - Changes classification for many points

3. **Poor probability estimates:**
   - If $\hat{y} = 5$, we can't say probability is 5
   - If $\hat{y} = -2$, we can't say probability is -2

4. **Non-optimal decision boundary:**
   - Linear regression minimizes squared error, not classification error
   - The objective is wrong for classification

## Section 2: The Logistic (Sigmoid) Function

### Q4. Define the Logistic (Sigmoid) function and explain its properties.

**Answer:**

**Definition:**

$$s(a) = \frac{1}{1 + e^{-a}} = \frac{e^a}{1 + e^a}$$

**Properties:**

| Property | Explanation |
|----------|-------------|
| **Output Range** | Always between 0 and 1: $s(a) \in (0, 1)$ |
| **When $a \gg 0$** | $s(a) \approx 1$ (approaches 1) |
| **When $a \ll 0$** | $s(a) \approx 0$ (approaches 0) |
| **When $a = 0$** | $s(0) = \frac{1}{1+1} = 0.5$ |
| **Monotonic** | Always increasing |
| **Symmetric** | $s(-a) = 1 - s(a)$ |
| **Differentiable** | Smooth curve, easy to compute gradient |

**Why Useful:**
- Converts any real number to a probability
- Suitable for modeling binary outcomes

### Q5. Calculate the sigmoid function value for $a = 0$, $a = 2$, and $a = -2$.

**Answer:**

**Formula:** $s(a) = \frac{1}{1 + e^{-a}}$

**Case 1: $a = 0$**
$$s(0) = \frac{1}{1 + e^{0}} = \frac{1}{1 + 1} = \frac{1}{2} = 0.5$$

**Case 2: $a = 2$**
$$s(2) = \frac{1}{1 + e^{-2}} = \frac{1}{1 + 0.135} = \frac{1}{1.135} \approx 0.88$$

**Case 3: $a = -2$**
$$s(-2) = \frac{1}{1 + e^{2}} = \frac{1}{1 + 7.389} = \frac{1}{8.389} \approx 0.12$$

**Observation:**
- $s(2) + s(-2) = 0.88 + 0.12 = 1$ (symmetry property)
- Positive input → probability > 0.5
- Negative input → probability < 0.5
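
The three values can be checked with a short sketch. The function name `sigmoid` is an illustrative choice; the branch for negative inputs uses the equivalent form $\frac{e^a}{1 + e^a}$ so that large negative inputs do not overflow:

```python
import math


def sigmoid(a):
    """Logistic function s(a) = 1 / (1 + e^{-a}), written to avoid overflow."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    # For negative a, use the equivalent form e^a / (1 + e^a).
    e = math.exp(a)
    return e / (1.0 + e)


for a in (0.0, 2.0, -2.0):
    print(f"s({a:+.0f}) = {sigmoid(a):.3f}")

# Symmetry: s(-a) = 1 - s(a)
print(abs(sigmoid(-2.0) - (1.0 - sigmoid(2.0))) < 1e-12)
```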

### Q6. Prove that the sigmoid function satisfies: $s(-a) = 1 - s(a)$

**Answer:**

**To Prove:** $s(-a) = 1 - s(a)$

**Proof:**

Starting with the left side:
$$s(-a) = \frac{1}{1 + e^{-(-a)}} = \frac{1}{1 + e^{a}}$$

Now, let's compute $1 - s(a)$:
$$1 - s(a) = 1 - \frac{1}{1 + e^{-a}}$$

$$= \frac{(1 + e^{-a}) - 1}{1 + e^{-a}}$$

$$= \frac{e^{-a}}{1 + e^{-a}}$$

Multiply numerator and denominator by $e^a$:
$$= \frac{e^{-a} \cdot e^a}{(1 + e^{-a}) \cdot e^a} = \frac{1}{e^a + 1} = \frac{1}{1 + e^a}$$

This equals $s(-a)$. Hence proved: $s(-a) = 1 - s(a)$ ✓

### Q7. Why is the sigmoid function preferred over other functions for binary classification?

**Answer:**

**Reasons for Using Sigmoid:**

1. **Bounded Output:**
   - Always outputs values in $(0, 1)$
   - Directly interpretable as probability

2. **Smooth and Differentiable:**
   - Can compute gradients easily
   - Enables gradient descent optimization

3. **Nice Derivative:**
   - $\frac{ds}{da} = s(a)(1 - s(a))$
   - Simple formula in terms of the function itself

4. **Probabilistic Interpretation:**
   - Derived from log-odds (logit function)
   - Has statistical foundation

5. **Decision Boundary at 0.5:**
   - When $a = 0$, output is exactly 0.5
   - Natural threshold for classification

6. **Monotonically Increasing:**
   - Higher input always gives higher probability

## Section 3: Logistic Regression Model

### Q8. Write the complete logistic regression model. Explain each component.

**Answer:**

**Logistic Regression Model:**

**Step 1: Linear Combination**
$$a = \varphi^T \theta$$

**Step 2: Apply Sigmoid**
$$P(y=1|\varphi) = s(\varphi^T \theta) = \frac{1}{1 + e^{-\varphi^T \theta}}$$

**Components Explained:**

| Symbol | Meaning | Dimension |
|--------|---------|------------|
| $\varphi$ | Feature vector (input) | $d \times 1$ |
| $\theta$ | Parameter vector (weights) | $d \times 1$ |
| $\varphi^T \theta$ | Linear combination (score) | scalar |
| $s(\cdot)$ | Sigmoid function | maps to $(0,1)$ |
| $P(y=1|\varphi)$ | Probability of class 1 given input | $(0,1)$ |

**Also:**
$$P(y=0|\varphi) = 1 - P(y=1|\varphi) = \frac{e^{-\varphi^T\theta}}{1 + e^{-\varphi^T\theta}}$$
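
As a rough sketch, the two steps of the model translate directly into code. The feature and parameter values are made up for illustration, and `predict_proba` is a hypothetical helper name:

```python
import math


def predict_proba(phi, theta):
    """Return (P(y=0 | phi), P(y=1 | phi)) for one sample."""
    # Step 1: linear combination a = phi^T theta
    a = sum(p * t for p, t in zip(phi, theta))
    # Step 2: sigmoid maps the score to a probability
    p1 = 1.0 / (1.0 + math.exp(-a))
    return 1.0 - p1, p1


phi = [1.0, 3.0]     # illustrative feature vector
theta = [-1.0, 0.5]  # illustrative parameters, so a = 0.5
p0, p1 = predict_proba(phi, theta)
print(f"P(y=0) = {p0:.4f}, P(y=1) = {p1:.4f}")
```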

### Q9. Explain the classification rule in logistic regression. Why is threshold 0.5 used?

**Answer:**

**Classification Rule:**

$$\text{Predicted Class} = \begin{cases} 1 & \text{if } s(\varphi^T\theta) \geq 0.5 \\ 0 & \text{if } s(\varphi^T\theta) < 0.5 \end{cases}$$

**Equivalent Rule:**
$$\text{Predicted Class} = \begin{cases} 1 & \text{if } \varphi^T\theta \geq 0 \\ 0 & \text{if } \varphi^T\theta < 0 \end{cases}$$

**Why 0.5 is Used:**

1. **Natural threshold:** When $s(a) = 0.5$, both classes are equally likely

2. **At $a = 0$:** $s(0) = 0.5$ exactly

3. **Maximum uncertainty point:** The point where we're most unsure

4. **Equal error cost assumption:** Assumes misclassifying Class 0 as Class 1 has same cost as the reverse

**Note:** In practice, threshold can be adjusted based on:
- Cost of different errors
- Class imbalance
- Application requirements

### Q10. Show that the classification rule $s(\varphi^T\theta) \geq 0.5$ is equivalent to $\varphi^T\theta \geq 0$.

**Answer:**

**To Show:** $s(\varphi^T\theta) \geq 0.5 \Leftrightarrow \varphi^T\theta \geq 0$

**Proof:**

Let $a = \varphi^T\theta$

We need to show: $s(a) \geq 0.5 \Leftrightarrow a \geq 0$

Starting with $s(a) \geq 0.5$:

$$\frac{1}{1 + e^{-a}} \geq 0.5$$

$$1 \geq 0.5(1 + e^{-a})$$

$$1 \geq 0.5 + 0.5e^{-a}$$

$$0.5 \geq 0.5e^{-a}$$

$$1 \geq e^{-a}$$

Taking the natural log of both sides (the log is strictly increasing, so the inequality is preserved):

$$\ln(1) \geq -a$$

$$0 \geq -a$$

$$a \geq 0$$

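Every step is reversible, so the two conditions are equivalent. Hence proved: $s(\varphi^T\theta) \geq 0.5 \Leftrightarrow \varphi^T\theta \geq 0$ ✓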

### Q11. What is the decision boundary in logistic regression? Why is it linear?

**Answer:**

**Decision Boundary:**
The set of points where the classifier is uncertain (probability = 0.5).

**Mathematical Definition:**
$$\text{Decision Boundary: } \varphi^T\theta = 0$$

Or equivalently:
$$\theta_1 x_1 + \theta_2 x_2 + ... + \theta_d x_d = 0$$

**Why It's Linear:**

1. The equation $\varphi^T\theta = 0$ is a linear equation in the features

2. In 2D: $\theta_1 x_1 + \theta_2 x_2 + \theta_0 = 0$ is a straight line

3. In 3D: It's a plane

4. In higher dimensions: It's a hyperplane

**Geometric Interpretation:**
- Points on one side: $\varphi^T\theta > 0$ → Class 1
- Points on other side: $\varphi^T\theta < 0$ → Class 0
- Points on boundary: $\varphi^T\theta = 0$ → Uncertain

**Note:** Logistic regression is a **linear classifier** because its decision boundary is linear.

## Section 4: Cost Function (Binary Cross-Entropy)

### Q12. Why can't we use Mean Squared Error (MSE) as the cost function for logistic regression?

**Answer:**

**Problem with MSE for Logistic Regression:**

If we use MSE:
$$J(\theta) = \frac{1}{N}\sum_{i=1}^{N}\left(s(\varphi^{(i)T}\theta) - y^{(i)}\right)^2$$

**Issues:**

1. **Non-convex:**
   - Squared error composed with the sigmoid is non-convex in $\theta$
   - Can have local minima and flat plateaus
   - Gradient descent may get stuck or stall

2. **Small gradients:**
   - When sigmoid output is near 0 or 1
   - Gradient becomes very small
   - Learning becomes very slow

3. **Not probabilistically motivated:**
   - Doesn't align with maximum likelihood principle
   - Cross-entropy has proper statistical foundation

**Solution:** Use Binary Cross-Entropy (Log Loss) which:
- Is convex
- Has better gradients
- Is derived from maximum likelihood
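
The small-gradient issue can be made concrete for a single confidently wrong prediction. This sketch compares the derivative of each loss with respect to the score $a$; the score value $-6$ is an arbitrary illustrative choice:

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def grad_mse_wrt_score(a, y):
    """d/da of (s(a) - y)^2, using the chain rule and s'(a) = s(a)(1 - s(a))."""
    pi = sigmoid(a)
    return 2.0 * (pi - y) * pi * (1.0 - pi)


def grad_ce_wrt_score(a, y):
    """d/da of -[y ln s(a) + (1 - y) ln(1 - s(a))], which simplifies to s(a) - y."""
    return sigmoid(a) - y


# Confidently wrong: the true label is 1, but the score is strongly negative.
a, y = -6.0, 1
print(f"MSE gradient: {grad_mse_wrt_score(a, y):.5f}")
print(f"CE  gradient: {grad_ce_wrt_score(a, y):.5f}")
```

The squared-error gradient is nearly zero because of the extra $\pi(1-\pi)$ factor, while the cross-entropy gradient stays close to $-1$, so the badly wrong sample still drives a large correction.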

### Q13. Define the Log Loss (Binary Cross-Entropy) cost function. Explain intuitively why it works.

**Answer:**

**Log Loss Formula:**

For a single sample $(\varphi, y)$:

$$f(\theta) = \begin{cases} -\ln(\pi) & \text{if } y = 1 \\ -\ln(1-\pi) & \text{if } y = 0 \end{cases}$$

Where $\pi = P(y=1|\varphi) = s(\varphi^T\theta)$

**Unified Formula:**
$$f(\theta) = -[y \ln(\pi) + (1-y) \ln(1-\pi)]$$

**Intuitive Explanation:**

| Actual $y$ | Predicted $\pi$ | $-\ln(\pi)$ or $-\ln(1-\pi)$ | Cost |
|------------|-----------------|------------------------------|------|
| 1 | $\approx 1$ | $-\ln(\pi) \to 0$ | Low ✓ |
| 1 | $\approx 0$ | $-\ln(\pi) \to \infty$ | High (penalty) |
| 0 | $\approx 0$ | $-\ln(1-\pi) \to 0$ | Low ✓ |
| 0 | $\approx 1$ | $-\ln(1-\pi) \to \infty$ | High (penalty) |

**Key Insight:** Confident wrong predictions are heavily penalized!

### Q14. Write the complete cost function for logistic regression over N training samples.

**Answer:**

**Complete Cost Function:**

$$J(\theta) = -\sum_{i=1}^{N} \left[ y^{(i)} \ln(\pi^{(i)}) + (1 - y^{(i)}) \ln(1 - \pi^{(i)}) \right]$$

Where:
- $N$ = number of training samples
- $y^{(i)}$ = actual label for sample $i$ (0 or 1)
- $\pi^{(i)} = s(\varphi^{(i)T}\theta)$ = predicted probability for sample $i$

**Expanded Form:**
$$J(\theta) = -\sum_{i=1}^{N} \left[ y^{(i)} \ln\left(\frac{1}{1+e^{-\varphi^{(i)T}\theta}}\right) + (1 - y^{(i)}) \ln\left(\frac{e^{-\varphi^{(i)T}\theta}}{1+e^{-\varphi^{(i)T}\theta}}\right) \right]$$

**Alternative with Average:**
$$J(\theta) = -\frac{1}{N}\sum_{i=1}^{N} \left[ y^{(i)} \ln(\pi^{(i)}) + (1 - y^{(i)}) \ln(1 - \pi^{(i)}) \right]$$

### Q15. Calculate the cost for the following cases:
- Case A: $y = 1$, $\pi = 0.9$
- Case B: $y = 1$, $\pi = 0.1$
- Case C: $y = 0$, $\pi = 0.2$

**Answer:**

**Formula:** $f(\theta) = -[y \ln(\pi) + (1-y) \ln(1-\pi)]$

**Case A: $y = 1$, $\pi = 0.9$**
$$f = -[1 \cdot \ln(0.9) + 0 \cdot \ln(0.1)]$$
$$= -\ln(0.9) = -(-0.105) = 0.105$$

**Interpretation:** Low cost - correct confident prediction ✓

**Case B: $y = 1$, $\pi = 0.1$**
$$f = -[1 \cdot \ln(0.1) + 0 \cdot \ln(0.9)]$$
$$= -\ln(0.1) = -(-2.303) = 2.303$$

**Interpretation:** High cost - wrong confident prediction ✗

**Case C: $y = 0$, $\pi = 0.2$**
$$f = -[0 \cdot \ln(0.2) + 1 \cdot \ln(0.8)]$$
$$= -\ln(0.8) = -(-0.223) = 0.223$$

**Interpretation:** Low cost - correct prediction ✓
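
The three cases can be verified with a small helper. The name `log_loss` is an illustrative choice, and the sketch assumes $0 < \pi < 1$ so that both logarithms are defined:

```python
import math


def log_loss(y, pi):
    """Per-sample binary cross-entropy; assumes 0 < pi < 1."""
    return -(y * math.log(pi) + (1 - y) * math.log(1.0 - pi))


cases = {"A": (1, 0.9), "B": (1, 0.1), "C": (0, 0.2)}
for name, (y, pi) in cases.items():
    print(f"Case {name}: y = {y}, pi = {pi}, cost = {log_loss(y, pi):.3f}")
```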

### Q16. Why is the cross-entropy cost function convex for logistic regression?

**Answer:**

**Convexity of Cross-Entropy:**

The cost function $J(\theta)$ is convex because:

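1. **Negative log-likelihood:**
   - Cross-entropy is the negative log-likelihood of the Bernoulli model
   - Each per-sample term is $-\ln s(a) = \ln(1 + e^{-a})$ or $-\ln(1 - s(a)) = \ln(1 + e^{a})$, and both are convex in $a$

2. **Second derivative is non-negative:**
   - The Hessian is $\nabla^2 J = \sum_{i} \pi^{(i)}(1 - \pi^{(i)})\,\varphi^{(i)}\varphi^{(i)T}$, which is positive semi-definite
   - This guarantees convexity

3. **Convexity survives the composition:**
   - $a = \varphi^T\theta$ is linear in $\theta$, and a convex function of a linear map is convex
   - Sum of convex functions is convex
   - The sigmoid itself is not convex, so convexity of $-\ln(x)$ alone is not a sufficient argument

**Benefits of Convexity:**
1. Every local minimum is a global minimum
2. No spurious local minima to get stuck in
3. Gradient descent with a suitable learning rate converges to the global minimum when one exists
4. The solution is unique when the feature matrix has full column rank and the data is not perfectly separable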

**Note:** However, there is no closed-form solution like in linear regression.

## Section 5: Gradient of the Cost Function

### Q17. Derive the derivative of the sigmoid function.

**Answer:**

**To Find:** $\frac{\partial s(a)}{\partial a}$ where $s(a) = \frac{1}{1 + e^{-a}}$

**Derivation:**

Let $s(a) = (1 + e^{-a})^{-1}$

Using chain rule:
$$\frac{ds}{da} = -1 \cdot (1 + e^{-a})^{-2} \cdot \frac{d}{da}(1 + e^{-a})$$

$$= -(1 + e^{-a})^{-2} \cdot (-e^{-a})$$

$$= \frac{e^{-a}}{(1 + e^{-a})^2}$$

Now, notice that:
$$s(a) = \frac{1}{1 + e^{-a}}$$
$$1 - s(a) = \frac{e^{-a}}{1 + e^{-a}}$$

Therefore:
$$\frac{ds}{da} = \frac{1}{1 + e^{-a}} \cdot \frac{e^{-a}}{1 + e^{-a}} = s(a) \cdot (1 - s(a))$$

**Final Result:**
$$\boxed{\frac{\partial s(a)}{\partial a} = s(a)(1 - s(a))}$$
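
A quick numerical sanity check of this result compares the closed form against a central finite difference. The step size `h` and the test points are arbitrary choices for this sketch:

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def sigmoid_derivative(a):
    """Closed form derived above: s'(a) = s(a) * (1 - s(a))."""
    s = sigmoid(a)
    return s * (1.0 - s)


def central_difference(f, a, h=1e-6):
    """Numerical derivative (f(a + h) - f(a - h)) / (2h)."""
    return (f(a + h) - f(a - h)) / (2.0 * h)


for a in (-3.0, 0.0, 1.5):
    exact = sigmoid_derivative(a)
    approx = central_difference(sigmoid, a)
    print(f"a = {a:+.1f}: closed form = {exact:.6f}, numerical = {approx:.6f}")
```

The derivative peaks at $a = 0$, where $s'(0) = 0.5 \cdot 0.5 = 0.25$.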

### Q18. Derive the gradient of the logistic regression cost function.

**Answer:**

**Cost Function:**
$$J(\theta) = -\sum_{i=1}^{N} \left[ y^{(i)} \ln(\pi^{(i)}) + (1 - y^{(i)}) \ln(1 - \pi^{(i)}) \right]$$

Where $\pi^{(i)} = s(\varphi^{(i)T}\theta)$

**Step 1:** For a single sample, the cost is:
$$f = -[y \ln(\pi) + (1-y) \ln(1-\pi)]$$

**Step 2:** Compute $\frac{\partial f}{\partial \theta}$ using chain rule:
$$\frac{\partial f}{\partial \theta} = \frac{\partial f}{\partial \pi} \cdot \frac{\partial \pi}{\partial a} \cdot \frac{\partial a}{\partial \theta}$$

Where $a = \varphi^T\theta$

**Step 3:** Compute each part:
- $\frac{\partial f}{\partial \pi} = -\frac{y}{\pi} + \frac{1-y}{1-\pi}$
- $\frac{\partial \pi}{\partial a} = \pi(1-\pi)$
- $\frac{\partial a}{\partial \theta} = \varphi$

**Step 4:** Combine:
$$\frac{\partial f}{\partial \theta} = \left(-\frac{y}{\pi} + \frac{1-y}{1-\pi}\right) \cdot \pi(1-\pi) \cdot \varphi$$

$$= (-y(1-\pi) + (1-y)\pi) \cdot \varphi$$

$$= (\pi - y) \cdot \varphi$$

**Final Gradient:**
$$\boxed{\nabla_\theta J(\theta) = \sum_{i=1}^{N} \varphi^{(i)} \cdot (\pi^{(i)} - y^{(i)})}$$
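
One way to gain confidence in the derived gradient is to compare it with finite differences of the cost. The sketch below does this on a tiny made-up dataset; the data, the parameter values, and the helper names are all illustrative assumptions:

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def cost(theta, Phi, y):
    """J(theta) = -sum_i [y_i ln(pi_i) + (1 - y_i) ln(1 - pi_i)]."""
    total = 0.0
    for phi, yi in zip(Phi, y):
        pi = sigmoid(dot(phi, theta))
        total -= yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
    return total


def gradient(theta, Phi, y):
    """Analytic gradient: sum_i phi_i * (pi_i - y_i)."""
    grad = [0.0] * len(theta)
    for phi, yi in zip(Phi, y):
        err = sigmoid(dot(phi, theta)) - yi
        for j, pj in enumerate(phi):
            grad[j] += err * pj
    return grad


def numeric_gradient(theta, Phi, y, h=1e-6):
    """Central finite differences of the cost, one coordinate at a time."""
    grad = []
    for j in range(len(theta)):
        up, down = list(theta), list(theta)
        up[j] += h
        down[j] -= h
        grad.append((cost(up, Phi, y) - cost(down, Phi, y)) / (2.0 * h))
    return grad


# Illustrative data: first feature is a constant 1 (bias), second is x.
Phi = [[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]]
y = [1, 0, 1]
theta = [0.1, -0.2]

print("analytic:", gradient(theta, Phi, y))
print("numeric: ", numeric_gradient(theta, Phi, y))
```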

### Q19. Write the gradient formula in component form for parameter $\theta_j$.

**Answer:**

**Component-wise Gradient:**

$$\frac{\partial J}{\partial \theta_j} = \sum_{i=1}^{N} (\pi^{(i)} - y^{(i)}) \cdot \varphi_j^{(i)}$$

**Where:**
- $\theta_j$ = the $j$-th parameter
- $\varphi_j^{(i)}$ = the $j$-th feature of the $i$-th sample
- $\pi^{(i)}$ = predicted probability for sample $i$
- $y^{(i)}$ = actual label for sample $i$

**Interpretation:**
- $(\pi^{(i)} - y^{(i)})$ = prediction error for sample $i$
- We weight this error by the feature value $\varphi_j^{(i)}$
- Sum over all samples

**Example:** For 3 samples and 2 features:
$$\frac{\partial J}{\partial \theta_1} = (\pi^{(1)} - y^{(1)})\varphi_1^{(1)} + (\pi^{(2)} - y^{(2)})\varphi_1^{(2)} + (\pi^{(3)} - y^{(3)})\varphi_1^{(3)}$$

### Q20. Compare the gradient of logistic regression with the gradient of linear regression.

**Answer:**

**Linear Regression Gradient:**
$$\nabla_\theta J = \sum_{i=1}^{N} \varphi^{(i)} \cdot (\hat{y}^{(i)} - y^{(i)})$$

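Where $\hat{y}^{(i)} = \varphi^{(i)T}\theta$ and the cost is taken as $J = \frac{1}{2}\sum_{i}(\hat{y}^{(i)} - y^{(i)})^2$ (without the $\frac{1}{2}$, the gradient carries an extra factor of 2)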

**Logistic Regression Gradient:**
$$\nabla_\theta J = \sum_{i=1}^{N} \varphi^{(i)} \cdot (\pi^{(i)} - y^{(i)})$$

Where $\pi^{(i)} = s(\varphi^{(i)T}\theta)$

**Comparison:**

| Aspect | Linear Regression | Logistic Regression |
|--------|-------------------|---------------------|
| Form | $\varphi \cdot (\hat{y} - y)$ | $\varphi \cdot (\pi - y)$ |
| Prediction | $\hat{y} = \varphi^T\theta$ | $\pi = s(\varphi^T\theta)$ |
| Range of prediction | $(-\infty, +\infty)$ | $(0, 1)$ |
| Closed-form solution | Yes (Normal Equations) | No |

**Key Insight:** The gradients look remarkably similar! The only difference is linear vs sigmoid prediction.

## Section 6: Gradient Descent Optimization

### Q21. Write the gradient descent algorithm for logistic regression.

**Answer:**

**Algorithm: Gradient Descent for Logistic Regression**

```text
Input: Training data {(φ(i), y(i))}_{i=1}^N, learning rate α, max_iterations
Output: Optimal parameters θ*

1. Initialize θ randomly or with zeros

2. Repeat until convergence or max_iterations:
   
   a. For each sample i, compute:
      π(i) = sigmoid(φ(i)^T θ)
   
   b. Compute gradient:
      ∇J = Σ φ(i) · (π(i) - y(i))
   
   c. Update parameters:
      θ_new = θ_old - α · ∇J

3. Return θ
```

**Update Rule:**
$$\theta_{new} = \theta_{old} - \alpha \cdot \nabla_\theta J(\theta)$$

**Component-wise Update:**
$$\theta_j = \theta_j - \alpha \sum_{i=1}^{N} (\pi^{(i)} - y^{(i)}) \cdot \varphi_j^{(i)}$$
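
The algorithm can be sketched in plain Python as follows. This is one possible implementation under stated assumptions, not a reference one: the dataset is made up (and deliberately not linearly separable), the function name `fit_logistic` is hypothetical, and the stopping test uses the gradient norm:

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def fit_logistic(Phi, y, alpha=0.05, max_iter=10000, tol=1e-6):
    """Batch gradient descent on the summed cross-entropy cost."""
    theta = [0.0] * len(Phi[0])  # step 1: initialize with zeros
    for _ in range(max_iter):
        # steps 2a and 2b: predictions and gradient sum_i phi_i * (pi_i - y_i)
        grad = [0.0] * len(theta)
        for phi, yi in zip(Phi, y):
            pi = sigmoid(sum(p * t for p, t in zip(phi, theta)))
            for j, pj in enumerate(phi):
                grad[j] += (pi - yi) * pj
        # convergence check: stop once the gradient norm is tiny
        if math.sqrt(sum(g * g for g in grad)) < tol:
            break
        # step 2c: theta_new = theta_old - alpha * gradient
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta


# Illustrative data: phi = [1, x]; labels overlap, so the data is not separable.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0, 0, 1, 0, 1, 1]
Phi = [[1.0, x] for x in xs]

theta = fit_logistic(Phi, y)
print("theta =", [round(t, 3) for t in theta])
for x in (0.0, 5.0):
    p = sigmoid(theta[0] + theta[1] * x)
    print(f"P(y=1 | x={x}) = {p:.3f}")
```

Because the gradient is a sum over samples, a workable learning rate depends on the dataset size; averaging the gradient over $N$ is a common alternative.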

### Q22. Why is there no closed-form solution for logistic regression unlike linear regression?

**Answer:**

**Linear Regression:**
- Cost: $J(\theta) = ||\Phi\theta - y||^2$
- Gradient: $\nabla J = 2\Phi^T\Phi\theta - 2\Phi^Ty$
- Setting to zero: $\Phi^T\Phi\theta = \Phi^Ty$
- **Closed-form:** $\theta^* = (\Phi^T\Phi)^{-1}\Phi^Ty$

**Logistic Regression:**
- Cost: $J(\theta) = -\sum[y\ln(s(\varphi^T\theta)) + (1-y)\ln(1-s(\varphi^T\theta))]$
- Gradient: $\nabla J = \sum \varphi(s(\varphi^T\theta) - y)$

**Why No Closed-Form:**

1. **Non-linear function:** Sigmoid is inside the expression
2. **Cannot isolate θ:** Setting gradient to zero gives:
   $$\sum \varphi \cdot s(\varphi^T\theta) = \sum \varphi \cdot y$$
3. **Transcendental equation:** $\theta$ appears inside exponential
4. **No algebraic solution:** Cannot solve analytically

**Solution:** Use iterative methods like gradient descent

### Q23. Explain the effect of learning rate (α) on gradient descent. What happens with too large or too small α?

**Answer:**

**Learning Rate (α):** Controls the step size in each iteration

$$\theta_{new} = \theta_{old} - \alpha \cdot \nabla J$$

**Too Large α:**

| Problem | Description |
|---------|-------------|
| Overshooting | Jumps over the minimum |
| Oscillation | Bounces back and forth |
| Divergence | Cost keeps increasing |
| Instability | Never converges |

**Too Small α:**

| Problem | Description |
|---------|-------------|
| Slow convergence | Takes many iterations |
| Time consuming | Very slow training |
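| Premature stopping | Tiny updates can satisfy a convergence test before the minimum is reached (local minima are not the issue, since the logistic loss is convex) |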
| Practical issues | May never reach minimum in reasonable time |

**Ideal α:**
- Fast but stable convergence
- Cost decreases smoothly
- Reaches minimum in reasonable time

### Q24. What is the convergence criterion for gradient descent in logistic regression?

**Answer:**

**Common Convergence Criteria:**

1. **Small gradient:**
   $$||\nabla J(\theta)|| < \epsilon$$
   Stop when gradient magnitude is very small

2. **Small change in cost:**
   $$|J(\theta_{new}) - J(\theta_{old})| < \epsilon$$
   Stop when cost barely changes

3. **Small change in parameters:**
   $$||\theta_{new} - \theta_{old}|| < \epsilon$$
   Stop when parameters barely change

4. **Maximum iterations:**
   $$\text{iterations} > \text{max\_iter}$$
   Stop after fixed number of iterations

**Typical Values:**
- $\epsilon = 10^{-6}$ or $10^{-8}$
- max_iter = 1000 or 10000

**Best Practice:** Use combination of criteria

## Section 7: Numerical Problems

### Q25. Given a single training sample with $\varphi = [1, 2]^T$, $y = 1$, and current $\theta = [0.5, 0.5]^T$, compute one gradient descent update with $\alpha = 0.1$.

**Answer:**

**Given:**
- $\varphi = [1, 2]^T$
- $y = 1$
- $\theta = [0.5, 0.5]^T$
- $\alpha = 0.1$

**Step 1: Compute linear combination**
$$a = \varphi^T\theta = 1(0.5) + 2(0.5) = 0.5 + 1.0 = 1.5$$

**Step 2: Compute sigmoid**
$$\pi = s(1.5) = \frac{1}{1 + e^{-1.5}} = \frac{1}{1 + 0.223} = \frac{1}{1.223} \approx 0.818$$

**Step 3: Compute error**
$$\pi - y = 0.818 - 1 = -0.182$$

**Step 4: Compute gradient**
$$\nabla J = \varphi \cdot (\pi - y) = [1, 2]^T \cdot (-0.182) = [-0.182, -0.364]^T$$

**Step 5: Update parameters**
$$\theta_{new} = \theta_{old} - \alpha \cdot \nabla J$$
$$= [0.5, 0.5]^T - 0.1 \cdot [-0.182, -0.364]^T$$
$$= [0.5 + 0.0182, 0.5 + 0.0364]^T$$
$$= [0.518, 0.536]^T$$
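
The hand calculation can be reproduced in a few lines. The helper name `gradient_step` is an illustrative choice:

```python
import math


def gradient_step(phi, y, theta, alpha):
    """One gradient descent update for a single training sample."""
    a = sum(p * t for p, t in zip(phi, theta))   # Step 1: linear combination
    pi = 1.0 / (1.0 + math.exp(-a))              # Step 2: sigmoid
    err = pi - y                                 # Step 3: error
    grad = [err * p for p in phi]                # Step 4: gradient
    return [t - alpha * g for t, g in zip(theta, grad)]  # Step 5: update


theta_new = gradient_step(phi=[1.0, 2.0], y=1, theta=[0.5, 0.5], alpha=0.1)
print([round(t, 3) for t in theta_new])
```

Both parameters increase because the prediction ($\pi \approx 0.818$) is below the label $y = 1$.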

### Q26. For two samples: $(\varphi_1 = [1, 0]^T, y_1 = 0)$ and $(\varphi_2 = [0, 1]^T, y_2 = 1)$, with $\theta = [0, 0]^T$, calculate the total cost $J(\theta)$.

**Answer:**

**Given:**
- Sample 1: $\varphi_1 = [1, 0]^T$, $y_1 = 0$
- Sample 2: $\varphi_2 = [0, 1]^T$, $y_2 = 1$
- $\theta = [0, 0]^T$

**For Sample 1:**
$$a_1 = \varphi_1^T\theta = 1(0) + 0(0) = 0$$
$$\pi_1 = s(0) = 0.5$$
$$\text{Cost}_1 = -[y_1\ln(\pi_1) + (1-y_1)\ln(1-\pi_1)]$$
$$= -[0 \cdot \ln(0.5) + 1 \cdot \ln(0.5)]$$
$$= -\ln(0.5) = 0.693$$

**For Sample 2:**
$$a_2 = \varphi_2^T\theta = 0(0) + 1(0) = 0$$
$$\pi_2 = s(0) = 0.5$$
$$\text{Cost}_2 = -[1 \cdot \ln(0.5) + 0 \cdot \ln(0.5)]$$
$$= -\ln(0.5) = 0.693$$

**Total Cost:**
$$J(\theta) = \text{Cost}_1 + \text{Cost}_2 = 0.693 + 0.693 = 1.386$$

### Q27. For the decision boundary $2x_1 + 3x_2 - 1 = 0$, determine which class the point $(1, 1)$ belongs to.

**Answer:**

**Given:**
- Decision boundary: $2x_1 + 3x_2 - 1 = 0$
- Point to classify: $(x_1, x_2) = (1, 1)$

**Step 1: Identify θ and φ**
$$\theta = [2, 3, -1]^T \text{ (including bias)}$$
$$\varphi = [x_1, x_2, 1]^T = [1, 1, 1]^T$$

**Step 2: Compute linear combination**
$$\varphi^T\theta = 2(1) + 3(1) + (-1)(1) = 2 + 3 - 1 = 4$$

**Step 3: Apply classification rule**
Since $\varphi^T\theta = 4 > 0$:
- The point is on the positive side of the boundary
- **Classification: Class 1**

**Verification with probability:**
$$\pi = s(4) = \frac{1}{1 + e^{-4}} \approx 0.982$$
Since $0.982 > 0.5$, classify as Class 1 ✓
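
The same classification can be sketched in code, with the bias stored as the last entry of $\theta$ to match Step 1:

```python
import math

theta = [2.0, 3.0, -1.0]  # boundary 2*x1 + 3*x2 - 1 = 0, bias last


def classify(x1, x2):
    """Return (score, probability, predicted class) for a 2-D point."""
    phi = [x1, x2, 1.0]
    score = sum(p * t for p, t in zip(phi, theta))
    prob = 1.0 / (1.0 + math.exp(-score))
    label = 1 if score >= 0 else 0
    return score, prob, label


score, prob, label = classify(1.0, 1.0)
print(f"score = {score}, P(y=1) = {prob:.3f}, class = {label}")
```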

### Q28. Given that the sigmoid output is 0.73, what is the value of the input $a$?

**Answer:**

**Given:** $s(a) = 0.73$

**To Find:** $a$

**Solution:**

$$s(a) = \frac{1}{1 + e^{-a}} = 0.73$$

$$1 + e^{-a} = \frac{1}{0.73} = 1.370$$

$$e^{-a} = 1.370 - 1 = 0.370$$

Taking natural log:
$$-a = \ln(0.370) = -0.994$$

$$a = 0.994 \approx 1.0$$

**Verification:**
$$s(1.0) = \frac{1}{1 + e^{-1}} = \frac{1}{1 + 0.368} = \frac{1}{1.368} \approx 0.731$$ ✓
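
Inverting the sigmoid in general gives the logit, $a = \ln\frac{p}{1-p}$, which follows from the same algebra with $0.73$ replaced by $p$. A minimal sketch, with `logit` as an illustrative function name:

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def logit(p):
    """Inverse of the sigmoid: a = ln(p / (1 - p)), valid for 0 < p < 1."""
    return math.log(p / (1.0 - p))


a = logit(0.73)
print(f"a = {a:.4f}")
print(f"s(a) = {sigmoid(a):.4f}")  # round trip back to 0.73
```

Without intermediate rounding the result is about $0.9946$; the hand calculation gives $0.994$ because $e^{-a}$ was rounded to $0.370$ first.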

## Section 8: Conceptual Questions

### Q29. Compare Linear Regression and Logistic Regression.

**Answer:**

| Aspect | Linear Regression | Logistic Regression |
|--------|-------------------|---------------------|
| **Task** | Regression (continuous output) | Classification (discrete output) |
| **Output** | Any real number | Probability in $(0, 1)$ |
| **Model** | $\hat{y} = \varphi^T\theta$ | $\pi = s(\varphi^T\theta)$ |
| **Cost Function** | Mean Squared Error | Binary Cross-Entropy |
| **Gradient** | $\varphi(\hat{y} - y)$ | $\varphi(\pi - y)$ |
| **Closed-form** | Yes (Normal Equations) | No |
| **Decision Boundary** | N/A | Linear (hyperplane) |
| **Example** | Predict house price | Predict spam/not spam |

### Q30. What does it mean that logistic regression has a "linear decision boundary"?

**Answer:**

**Linear Decision Boundary:**

The decision boundary is the set of points where:
$$\varphi^T\theta = 0$$

**In Different Dimensions:**

| Dimension | Boundary Shape | Equation |
|-----------|----------------|----------|
| 2D | Line | $\theta_1 x_1 + \theta_2 x_2 + \theta_0 = 0$ |
| 3D | Plane | $\theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_0 = 0$ |
| nD | Hyperplane | $\sum_{j=1}^{n} \theta_j x_j + \theta_0 = 0$ |

**Implications:**
1. Can only separate linearly separable data
2. Cannot learn XOR-like patterns
3. Simple but limited

**Visual Example:**
- Cats on one side of the line
- Dogs on the other side
- The dashed line is the decision boundary

### Q31. Can logistic regression handle non-linearly separable data? How?

**Answer:**

**Basic Logistic Regression:**
- Cannot handle non-linearly separable data
- Decision boundary is always a hyperplane

**Solutions for Non-linear Data:**

1. **Feature Engineering:**
   - Add polynomial features: $x_1^2, x_2^2, x_1 x_2$
   - Original: $\varphi = [x_1, x_2, 1]^T$
   - Extended: $\varphi = [x_1, x_2, x_1^2, x_2^2, x_1 x_2, 1]^T$

2. **Kernel Methods:**
   - Transform data to higher dimension
   - Linear boundary in high dimension = curved boundary in original space

3. **Use Different Algorithms:**
   - Neural Networks
   - Decision Trees
   - SVM with kernels

**Example:**
For circular boundary, add $x_1^2 + x_2^2$ as a feature.
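
A minimal sketch of that idea: with the single engineered feature $x_1^2 + x_2^2$ plus a bias, a linear rule in feature space becomes a circle in the original space. The parameters below are hand-picked for illustration (boundary at radius 1), not learned from data:

```python
def features(x1, x2):
    """Feature map phi = [x1^2 + x2^2, 1]."""
    return [x1 ** 2 + x2 ** 2, 1.0]


# Hand-picked parameters (an assumption for this sketch):
# phi^T theta = x1^2 + x2^2 - 1, so the boundary is the unit circle.
theta = [1.0, -1.0]


def classify(x1, x2):
    score = sum(p * t for p, t in zip(features(x1, x2), theta))
    return 1 if score >= 0 else 0


for point in [(0.1, 0.2), (2.0, 0.0), (-1.5, 1.5)]:
    print(point, "-> class", classify(*point))
```

The model is still linear in $\theta$, so the cost function, gradient, and training procedure are unchanged.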

### Q32. What is the probabilistic interpretation of logistic regression?

**Answer:**

**Probabilistic Model:**

Logistic regression models the probability of class membership:

$$P(y = 1 | \varphi) = \frac{1}{1 + e^{-\varphi^T\theta}}$$

$$P(y = 0 | \varphi) = 1 - P(y = 1 | \varphi) = \frac{e^{-\varphi^T\theta}}{1 + e^{-\varphi^T\theta}}$$

**Log-Odds (Logit):**

$$\ln\left(\frac{P(y=1)}{P(y=0)}\right) = \ln\left(\frac{P(y=1)}{1 - P(y=1)}\right) = \varphi^T\theta$$

**Interpretation:**
- The linear combination $\varphi^T\theta$ models the log-odds
- Sigmoid converts log-odds to probability
- Each $\theta_j$ is the change in log-odds when $\varphi_j$ increases by 1 with all other features held fixed

**Maximum Likelihood:**
- Cross-entropy loss = negative log-likelihood
- Minimizing cross-entropy = maximizing likelihood

### Q33. How can you adjust the decision threshold based on application requirements?

**Answer:**

**Default Threshold:** 0.5
- Classify as Class 1 if $P(y=1) \geq 0.5$

**When to Adjust:**

| Scenario | Threshold | Reason |
|----------|-----------|--------|
| High cost of FN | Lower (e.g., 0.3) | Don't miss positive cases |
| High cost of FP | Higher (e.g., 0.7) | Be more certain before predicting positive |
| Imbalanced data | Adjust based on class ratio | Handle class imbalance |

**Examples:**

1. **Disease Detection:** Use threshold 0.3
   - Missing a disease (FN) is very costly
   - Better to have false alarms than miss cases

2. **Spam Detection:** Use threshold 0.7
   - Blocking important email (FP) is very costly
   - Better to let some spam through

**Method:** Use ROC curve to find optimal threshold
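
The trade-off can be seen by counting errors at several thresholds. The predicted probabilities and labels below are made up for illustration:

```python
# Hypothetical predicted probabilities P(y=1) and true labels.
probs = [0.1, 0.35, 0.4, 0.6, 0.65, 0.8, 0.2, 0.9]
labels = [0, 1, 0, 1, 0, 1, 0, 1]


def count_errors(threshold):
    """Return (false positives, false negatives) at a given threshold."""
    fp = fn = 0
    for p, y in zip(probs, labels):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
    return fp, fn


for t in (0.3, 0.5, 0.7):
    fp, fn = count_errors(t)
    print(f"threshold {t}: FP = {fp}, FN = {fn}")
```

On this data, lowering the threshold removes false negatives at the price of more false positives, and raising it does the opposite.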

### Q34. What happens when the data is perfectly linearly separable?

**Answer:**

**Problem with Perfect Separability:**

When data is perfectly separable:

1. **Parameters diverge to infinity:**
   - $||\theta|| \rightarrow \infty$
   - Gradient descent keeps increasing weights

2. **Probabilities become extreme:**
   - All Class 1 samples: $\pi \rightarrow 1$
   - All Class 0 samples: $\pi \rightarrow 0$

3. **Numerical issues:**
   - Overflow/underflow in computations
   - $\ln(0)$ problems once $\pi$ rounds to exactly 0 or 1 in floating point

**Solutions:**

1. **Regularization:**
   - Add penalty term: $J(\theta) + \lambda||\theta||^2$
   - Prevents parameters from growing too large

2. **Early stopping:**
   - Stop training before divergence

3. **Maximum iterations:**
   - Limit number of gradient descent steps
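
The divergence is easy to observe on the smallest possible separable dataset. This sketch uses one feature, no bias, and two made-up samples; the learning rate is an arbitrary choice:

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


# Perfectly separable toy data: x = -1 is class 0, x = +1 is class 1.
xs = [-1.0, 1.0]
ys = [0, 1]


def theta_after(n_steps, alpha=0.5):
    """Run unregularized gradient descent from theta = 0 for n_steps."""
    theta = 0.0
    for _ in range(n_steps):
        grad = sum(x * (sigmoid(x * theta) - y) for x, y in zip(xs, ys))
        theta -= alpha * grad
    return theta


for n in (10, 100, 1000):
    print(f"after {n:4d} steps: theta = {theta_after(n):.3f}")
```

Here the gradient equals $-2(1 - s(\theta))$, which is negative for every finite $\theta$, so the parameter keeps growing and no finite minimizer exists.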

## Section 9: Extension Questions

### Q35. How can logistic regression be extended to multi-class classification?

**Answer:**

**Two Main Approaches:**

**1. One-vs-All (OvA):**
- Train K binary classifiers (one for each class)
- Classifier k: Class k vs all other classes
- Prediction: Choose class with highest probability

**2. Softmax Regression (Multinomial Logistic):**

$$P(y = k | \varphi) = \frac{e^{\varphi^T\theta_k}}{\sum_{j=1}^{K} e^{\varphi^T\theta_j}}$$

- One parameter vector for each class
- Probabilities sum to 1
- Natural extension of sigmoid to multiple classes

**Comparison:**

| Aspect | One-vs-All | Softmax |
|--------|------------|--------|
| Classifiers | K separate | 1 joint |
| Training | Independent | Coupled |
| Probabilities | May not sum to 1 | Sum to 1 exactly |
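
A minimal sketch of the softmax computation, taking the $K$ scores $\varphi^T\theta_k$ as input. Subtracting the maximum score before exponentiating is a common numerical-stability step that leaves the probabilities unchanged; the score values are illustrative:

```python
import math


def softmax(scores):
    """Convert K real-valued scores into probabilities that sum to 1."""
    m = max(scores)  # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs], "sum =", round(sum(probs), 6))

# With K = 2 and the second score fixed at 0, softmax reduces to the sigmoid.
a = 1.5
print(abs(softmax([a, 0.0])[0] - 1.0 / (1.0 + math.exp(-a))) < 1e-12)
```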

### Q36. What is regularization in logistic regression and why is it needed?

**Answer:**

**Regularization:**
Adding a penalty term to the cost function to prevent overfitting.

**L2 Regularization (Ridge):**
$$J(\theta) = -\sum_{i=1}^{N}[y^{(i)}\ln(\pi^{(i)}) + (1-y^{(i)})\ln(1-\pi^{(i)})] + \lambda\sum_{j=1}^{d}\theta_j^2$$

**L1 Regularization (Lasso):**
$$J(\theta) = -\sum_{i=1}^{N}[y^{(i)}\ln(\pi^{(i)}) + (1-y^{(i)})\ln(1-\pi^{(i)})] + \lambda\sum_{j=1}^{d}|\theta_j|$$

**Why Needed:**

1. **Prevents overfitting:** Keeps parameters small
2. **Handles perfect separability:** Prevents divergence
3. **Feature selection:** L1 can make some weights zero
4. **Generalization:** Better performance on test data

**λ (Regularization strength):**
- Large λ: More regularization, simpler model
- Small λ: Less regularization, complex model
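
With the L2 penalty $\lambda\sum_j \theta_j^2$, the gradient gains a term $2\lambda\theta_j$. The sketch below applies this to a perfectly separable toy dataset (one feature, no bias, made-up samples) to show that the regularized problem has a finite solution; $\lambda = 0.1$ and the learning rate are arbitrary choices:

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


# Perfectly separable toy data: x = -1 is class 0, x = +1 is class 1.
xs = [-1.0, 1.0]
ys = [0, 1]


def regularized_gradient(theta, lam):
    """Gradient of cross-entropy plus lam * theta^2."""
    data_term = sum(x * (sigmoid(x * theta) - y) for x, y in zip(xs, ys))
    return data_term + 2.0 * lam * theta


def fit(lam, alpha=0.1, n_steps=2000):
    theta = 0.0
    for _ in range(n_steps):
        theta -= alpha * regularized_gradient(theta, lam)
    return theta


theta_hat = fit(lam=0.1)
print(f"theta = {theta_hat:.4f}")
print(f"gradient at theta = {regularized_gradient(theta_hat, 0.1):.2e}")
```

The penalty's pull toward zero balances the data term's push toward infinity, so gradient descent settles at a finite $\theta$.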

### Q37. Summarize the key points of logistic regression.

**Answer:**

**Key Points Summary:**

**1. Purpose:**
- Binary classification (can extend to multi-class)
- Estimates probability of class membership

**2. Model:**
$$P(y=1|\varphi) = \frac{1}{1 + e^{-\varphi^T\theta}}$$

**3. Cost Function:**
$$J(\theta) = -\sum_{i=1}^{N}[y^{(i)}\ln(\pi^{(i)}) + (1-y^{(i)})\ln(1-\pi^{(i)})]$$

**4. Gradient:**
$$\nabla J = \sum_{i=1}^{N}\varphi^{(i)}(\pi^{(i)} - y^{(i)})$$

**5. Optimization:**
- No closed-form solution
- Use gradient descent: $\theta_{new} = \theta_{old} - \alpha \nabla J$

**6. Decision Boundary:**
- Linear (hyperplane)
- $\varphi^T\theta = 0$

**7. Properties:**
- Convex cost function
- Every local minimum is global (unique when features have full column rank and the data is not perfectly separable)
- Simple and interpretable

---
## Summary of Important Formulas

| Concept | Formula |
|---------|--------|
| Sigmoid Function | $s(a) = \frac{1}{1 + e^{-a}}$ |
| Sigmoid Property | $s(-a) = 1 - s(a)$ |
| Sigmoid Derivative | $\frac{ds}{da} = s(a)(1 - s(a))$ |
| Probability Model | $P(y=1|\varphi) = s(\varphi^T\theta)$ |
| Log Loss (single) | $f = -[y\ln(\pi) + (1-y)\ln(1-\pi)]$ |
| Cost Function | $J(\theta) = -\sum_{i=1}^{N}[y^{(i)}\ln(\pi^{(i)}) + (1-y^{(i)})\ln(1-\pi^{(i)})]$ |
| Gradient | $\nabla_\theta J = \sum_{i=1}^{N}\varphi^{(i)}(\pi^{(i)} - y^{(i)})$ |
| Update Rule | $\theta_{new} = \theta_{old} - \alpha \nabla J$ |
| Decision Boundary | $\varphi^T\theta = 0$ |
| Classification Rule | Class 1 if $\varphi^T\theta \geq 0$, else Class 0 |

---