### **Why Linear Regression Fails for Classification**

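1. **Outliers:** The best-fit line is sensitive to extreme values. A single far-away point can tilt the line and move the point where it crosses 0.5, which changes the predicted class of other examples.
2. **Output Range:** Linear regression is unbounded, so it can produce outputs below 0 or above 1, which are not valid probabilities.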
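
Both failure modes can be reproduced on a toy dataset. The sketch below is illustrative only: the data values and the helper name `fit_line` are made up for this example, and NumPy's `polyfit` stands in for any least-squares fit.

```python
import numpy as np

# Hypothetical toy data: one feature, binary labels (0 or 1).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

def fit_line(x, y):
    """Least-squares fit of y = slope * x + intercept."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return slope, intercept

slope, intercept = fit_line(x, y)

# Output range: predictions far from the training data leave [0, 1].
preds = slope * np.array([0.0, 10.0]) + intercept

# Outliers: one extreme (but correctly labeled) point flattens the line
# and moves the x value at which the prediction crosses 0.5.
x_out = np.append(x, 30.0)
y_out = np.append(y, 1.0)
slope_out, intercept_out = fit_line(x_out, y_out)

boundary = (0.5 - intercept) / slope
boundary_out = (0.5 - intercept_out) / slope_out

print("predictions at x=0 and x=10:", preds)
print("0.5 crossing without outlier:", boundary)
print("0.5 crossing with outlier:   ", boundary_out)
```

With this data the crossing point moves past $x = 4$, so an example that was on the positive side of the threshold ends up on the negative side, even though the added point agrees with the original labels.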

---

### **Solution: Logistic Regression**

* Apply a **squashing function** to the linear output to constrain predictions between 0 and 1.
* The **sigmoid (logistic) function** is used:

$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$

**Key properties:**

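* Outputs lie strictly between 0 and 1: $\sigma(z) \to 0$ as $z \to -\infty$ and $\sigma(z) \to 1$ as $z \to +\infty$.
* Midpoint at 0.5 when $z = 0$.
* If $z > 0$, $\sigma(z) > 0.5$; if $z < 0$, $\sigma(z) < 0.5$.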

* **Hypothesis function with sigmoid:**

$$
h_\theta(x) = \sigma(\theta_0 + \theta_1 x_1)
$$

* For multiple features:

$$
h_\theta(x) = \sigma(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots)
$$
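
As a minimal sketch, the sigmoid and the hypothesis function could be written with NumPy as follows. The function names, the parameter values, and the convention that `theta[0]` holds the intercept $\theta_0$ are assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, X):
    """h_theta(x) = sigmoid(theta_0 + theta_1 * x_1 + ...) for each row of X.

    Assumes theta[0] is the intercept and theta[1:] are the feature weights.
    """
    z = theta[0] + X @ theta[1:]
    return sigmoid(z)

# Hypothetical parameters and inputs (two features per example).
theta = np.array([-1.0, 2.0, 0.5])
X = np.array([[0.0, 0.0],
              [1.0, 2.0],
              [-3.0, 1.0]])

probs = hypothesis(theta, X)
print(sigmoid(0.0))  # midpoint of the curve
print(probs)         # one value per row, each strictly between 0 and 1
```

For inputs with very large magnitude, `np.exp(-z)` can overflow and emit a warning; a numerically stable implementation such as `scipy.special.expit` is one way around that.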

---

### **Cost Function**

* The squared-error cost used in linear regression becomes **non-convex** in $\theta$ once $h_\theta$ includes the sigmoid, so gradient descent can get stuck in **local minima**.
* Logistic regression instead uses **log loss (cross-entropy)**, which is convex in $\theta$:

$$
\text{Cost}(h_\theta(x), y) =
\begin{cases} 
- \log(h_\theta(x)) & \text{if } y = 1 \\
- \log(1 - h_\theta(x)) & \text{if } y = 0
\end{cases}
$$

* Combined form:

$$
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \big[ y^{(i)} \log(h_\theta(x^{(i)})) + (1-y^{(i)}) \log(1-h_\theta(x^{(i)})) \big]
$$

* This cost is **convex** in $\theta$, so gradient descent with a suitable learning rate moves toward the **global minimum** instead of stalling in a local one.
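
The combined form translates almost directly into code. The following is a sketch under stated assumptions: the function name `log_loss` and the clipping constant `eps` are choices made here (clipping only guards against `log(0)`), and the sample predictions are invented.

```python
import numpy as np

def log_loss(y, p, eps=1e-12):
    """Mean cross-entropy J between labels y (0 or 1) and predictions p."""
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0) for predictions of exactly 0 or 1
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

y = np.array([1.0, 0.0, 1.0])

# Confident and correct predictions give a small cost ...
confident_right = log_loss(y, np.array([0.9, 0.1, 0.9]))
# ... while confident and wrong predictions are penalized heavily.
confident_wrong = log_loss(y, np.array([0.1, 0.9, 0.1]))

print(confident_right, confident_wrong)
```

The asymmetry is the point of the logarithm: the cost grows without bound as a prediction approaches the wrong extreme.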

---

### **Gradient Descent**

* Parameter update rule, where $\alpha$ is the learning rate:

$$
\theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j}
$$

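* Repeat until convergence, for example until $J(\theta)$ stops decreasing noticeably.
* For multiple features, update every $\theta_j$ simultaneously, computing all partial derivatives from the current parameters before changing any of them.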
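
For the log loss above, the partial derivative works out to

$$
\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \big( h_\theta(x^{(i)}) - y^{(i)} \big) x_j^{(i)}
$$

with $x_0^{(i)} = 1$ for the intercept term. A minimal batch gradient descent built on that expression might look like the sketch below. The function names, the learning rate, the iteration count, and the toy data are all assumptions for illustration, and a fixed number of iterations replaces a real convergence check.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, n_iters=5000):
    """Batch gradient descent for logistic regression.

    Returns theta with theta[0] as the intercept.
    """
    m = X.shape[0]
    Xb = np.hstack([np.ones((m, 1)), X])  # prepend x_0 = 1 for theta_0
    theta = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        error = sigmoid(Xb @ theta) - y   # h_theta(x) - y for every example
        grad = Xb.T @ error / m           # all partial derivatives at once
        theta = theta - alpha * grad      # simultaneous update of every theta_j
    return theta

# Hypothetical, deliberately non-separable toy data (one feature).
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])

theta = gradient_descent(X, y)
probs = sigmoid(theta[0] + X[:, 0] * theta[1])
print(theta)
print(probs)
```

Computing the whole gradient vector before assigning `theta` is what makes the update simultaneous; updating one component and then reusing it for the next would be a different algorithm.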

---

### **Summary**

1. Compute a linear combination of the features: $z = \theta_0 + \theta_1 x_1 + \dots$
2. Apply the **sigmoid activation** to squash $z$ into a value between 0 and 1.
3. Use **log loss** so that the cost function is convex.
4. Optimize the parameters using **gradient descent**.
5. Interpret the output $h_\theta(x)$ as the **probability** that $y = 1$, and compare it with a threshold (commonly 0.5) to classify.
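
Turning probabilities into class labels is a single comparison. In this sketch the helper name `predict_labels`, the default threshold of 0.5, and the sample probabilities are assumptions; other thresholds can be appropriate when the two kinds of error have different costs.

```python
import numpy as np

def predict_labels(probs, threshold=0.5):
    """Label an example 1 when its predicted probability reaches the threshold."""
    return (np.asarray(probs) >= threshold).astype(int)

# Hypothetical model outputs h_theta(x) for four examples.
probs = np.array([0.05, 0.49, 0.50, 0.93])
labels = predict_labels(probs)
print(labels)
```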

