This section derives the partial derivatives of the **cost function** for simple linear regression (one input $x$, one output $y$). The cost function is:

$$
J(m, c) = \sum_{i=1}^n \left( y_i - (mx_i + c) \right)^2
$$


### What are we trying to do?

We are trying to **minimize** the total error between the line's predictions $\hat{y}_i = mx_i + c$ and the observed values $y_i$. The total error is the **sum of squared differences** (squared residuals) over all $n$ data points.
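As a minimal sketch of what this total error looks like in code, the snippet below evaluates $J(m, c)$ for two candidate lines. The data points and the function name `cost` are assumptions made up for illustration; they do not come from any particular dataset or library.

```python
def cost(m, c, xs, ys):
    """Sum of squared residuals J(m, c) for the line y = m*x + c."""
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))

# Hypothetical toy data lying exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

perfect_fit = cost(2.0, 1.0, xs, ys)  # every residual is zero
poor_fit = cost(0.0, 0.0, xs, ys)     # predicts 0 everywhere: 1 + 9 + 25 + 49

print(perfect_fit, poor_fit)
```

The better the line matches the data, the smaller the value of `cost`.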

To do this, we want the values of $m$ (slope) and $c$ (intercept) that make this error as small as possible. We find them by computing the **derivatives** of $J$, which tell us how the error changes as each parameter increases: a positive derivative means the error is rising, a negative one means it is falling.



### Derivative Intuition

Imagine you are walking on a hilly surface (the graph of the function). The **derivative** tells you the slope of the hill at your feet. If the ground rises in the direction you are facing, turn around. If it falls, keep going: each step downhill lowers the function's value. Where the ground is flat (derivative zero), you may have reached the bottom.
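To make the hill picture concrete, here is a small sketch that estimates the slope numerically with a central difference. The function `g` is a hypothetical one-dimensional "hill" chosen only for illustration, and `slope` is a helper name introduced here, not part of any library.

```python
def g(w):
    """A hypothetical one-dimensional bowl whose lowest point is at w = 3."""
    return (w - 3.0) ** 2

def slope(func, w, h=1e-6):
    """Central-difference estimate of the derivative of func at w."""
    return (func(w + h) - func(w - h)) / (2 * h)

slope_right = slope(g, 5.0)   # positive: ground rises to the right, so step left
slope_left = slope(g, 1.0)    # negative: ground falls to the right, so step right
slope_bottom = slope(g, 3.0)  # approximately zero: flat ground at the minimum

print(slope_right, slope_left, slope_bottom)
```

The sign of the slope tells you which way to step, and its size tells you how steep the ground is.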


### Let's write the cost function again:

$$
J(m, c) = \sum_{i=1}^n \left( y_i - (mx_i + c) \right)^2
$$

Let’s give the inner expression a name. For each data point, the error (residual) is:

$$
\text{error}_i = y_i - (mx_i + c)
$$

and its square is:

$$
\text{error}_i^2 = \left( y_i - (mx_i + c) \right)^2
$$

So our goal is to **adjust** $m$ and $c$ to reduce the total squared error.

---

## Derivative with respect to **m** (slope)

We're going to ask: “How does the error change if we nudge the slope $m$ a little bit?”

Let’s derive $\frac{\partial J}{\partial m}$:

$$
\frac{\partial J}{\partial m} = \sum_{i=1}^n 2 \cdot \left( y_i - (mx_i + c) \right) \cdot (-x_i)
$$

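Why this form? The derivative of a sum is the sum of the derivatives, so we differentiate each squared error separately:

* $2 \cdot \text{error}_i$: comes from the power rule applied to the square (recall $\frac{d}{du}[u^2] = 2u$)
* $-x_i$: comes from the chain rule; differentiating the inner expression $y_i - (mx_i + c)$ with respect to $m$ leaves only the term $-mx_i$, whose derivative is $-x_i$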

So, putting it all together:

$$
\frac{\partial J}{\partial m} = -2 \sum_{i=1}^n x_i \cdot \left( y_i - (mx_i + c) \right)
$$
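One way to gain confidence in this formula is to compare it with a numerical estimate. The sketch below does that on made-up data; the names `cost` and `dJ_dm`, the data, and the step size `h` are all assumptions chosen for illustration.

```python
def cost(m, c, xs, ys):
    """Sum of squared residuals J(m, c)."""
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))

def dJ_dm(m, c, xs, ys):
    """Analytic derivative of J with respect to the slope m."""
    return -2.0 * sum(x * (y - (m * x + c)) for x, y in zip(xs, ys))

# Hypothetical toy data lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

m, c = 0.5, 0.0  # an arbitrary starting guess
h = 1e-6         # small nudge applied to m

analytic = dJ_dm(m, c, xs, ys)
# Central difference: nudge m up and down while holding c fixed.
numeric = (cost(m + h, c, xs, ys) - cost(m - h, c, xs, ys)) / (2 * h)

print(analytic, numeric)
```

Both values should agree closely. Here the derivative is negative, which means that increasing $m$ from 0.5 reduces the error.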

---

## Derivative with respect to **c** (intercept)

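Same logic, but now we ask: “What if we nudge the intercept $c$?” The inner expression contains $-c$, whose derivative with respect to $c$ is $-1$, so the chain rule gives:

$$
\frac{\partial J}{\partial c} = \sum_{i=1}^n 2 \cdot \left( y_i - (mx_i + c) \right) \cdot (-1)
$$

Pulling the constant factors out of the sum:

$$
\frac{\partial J}{\partial c} = -2 \sum_{i=1}^n \left( y_i - (mx_i + c) \right)
$$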
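The sign of this derivative tells us which way to move $c$. The following sketch, using made-up data and hypothetical helper names `cost` and `dJ_dc`, starts from a line with the correct slope but an intercept that is too low and checks that raising $c$ reduces the cost.

```python
def cost(m, c, xs, ys):
    """Sum of squared residuals J(m, c)."""
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))

def dJ_dc(m, c, xs, ys):
    """Analytic derivative of J with respect to the intercept c."""
    return -2.0 * sum(y - (m * x + c) for x, y in zip(xs, ys))

# Hypothetical toy data lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

m, c = 2.0, 0.0  # correct slope, intercept too low: every residual equals 1

grad_c = dJ_dc(m, c, xs, ys)           # -2 * (1 + 1 + 1 + 1)
cost_before = cost(m, c, xs, ys)
cost_after = cost(m, c + 0.1, xs, ys)  # nudge c upward, against the negative gradient

print(grad_c, cost_before, cost_after)
```

A negative derivative means the cost falls as $c$ increases, which matches the drop from `cost_before` to `cost_after`.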

---

### Final Derivatives

$$
\frac{\partial J}{\partial m} = -2 \sum_{i=1}^n x_i \cdot \left( y_i - (mx_i + c) \right)
$$

$$
\frac{\partial J}{\partial c} = -2 \sum_{i=1}^n \left( y_i - (mx_i + c) \right)
$$

---

### What do you do with these?

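You use them in **gradient descent**. Starting from an initial guess for $m$ and $c$, repeatedly update both parameters, computing both derivatives from the current values before changing either one:

$$
m := m - \alpha \cdot \frac{\partial J}{\partial m}
$$

$$
c := c - \alpha \cdot \frac{\partial J}{\partial c}
$$

Here $\alpha$ is the **learning rate**, a small positive number such as 0.01. Subtracting the derivative moves each parameter in the direction that decreases $J$. Because $J$ is a sum over all $n$ points, its derivatives grow with the dataset size, so $\alpha$ may need to be smaller for larger datasets.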
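The update rule can be sketched as a short loop. This is a minimal illustration under stated assumptions, not a production implementation: the data, the starting point, the iteration count, and the helper name `gradients` are all made up for the example.

```python
def gradients(m, c, xs, ys):
    """Return (dJ/dm, dJ/dc) for the sum-of-squared-errors cost."""
    residuals = [y - (m * x + c) for x, y in zip(xs, ys)]
    dJ_dm = -2.0 * sum(x * r for x, r in zip(xs, residuals))
    dJ_dc = -2.0 * sum(residuals)
    return dJ_dm, dJ_dc

# Hypothetical toy data lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

m, c = 0.0, 0.0  # initial guess
alpha = 0.01     # learning rate

for _ in range(2000):
    # Compute both derivatives from the current (m, c) before updating either.
    dJ_dm, dJ_dc = gradients(m, c, xs, ys)
    m -= alpha * dJ_dm
    c -= alpha * dJ_dc

print(m, c)  # should end up close to the true slope 2 and intercept 1
```

With this data and learning rate the loop settles near $m = 2$, $c = 1$. A learning rate that is too large for the data would make the updates overshoot and diverge instead.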




### 1. **Partial Derivatives (Multivariable Functions)**

When you have a function of **multiple variables**, like:

$$
f(x, y) = x^2 + y^2
$$

A **partial derivative** measures the rate of change of the function **with respect to one variable**, while **keeping all others constant**.

* $\frac{\partial f}{\partial x} = 2x$
* $\frac{\partial f}{\partial y} = 2y$

This is exactly what the derivation above does: $\frac{\partial J}{\partial m}$ treats $c$ as a constant, and $\frac{\partial J}{\partial c}$ treats $m$ as a constant.
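"Keeping all others constant" translates directly into code: nudge one argument and leave the other untouched. The sketch below checks the two partial derivatives of $f(x, y) = x^2 + y^2$ numerically at an arbitrarily chosen point; the helper names `partial_x` and `partial_y` are assumptions introduced for illustration.

```python
def f(x, y):
    return x ** 2 + y ** 2

def partial_x(func, x, y, h=1e-6):
    """Nudge x only; y is held constant."""
    return (func(x + h, y) - func(x - h, y)) / (2 * h)

def partial_y(func, x, y, h=1e-6):
    """Nudge y only; x is held constant."""
    return (func(x, y + h) - func(x, y - h)) / (2 * h)

# At the point (3, 4) the formulas 2x and 2y predict 6 and 8.
df_dx = partial_x(f, 3.0, 4.0)
df_dy = partial_y(f, 3.0, 4.0)

print(df_dx, df_dy)
```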

### 2. **Total Derivatives (Chain Rule Context)**

A **total derivative** accounts for every way the function depends on a variable, including **indirect dependencies** that pass through other variables via the chain rule.

For example, if:

* $z = f(x, y)$, and
* $x = x(t), y = y(t)$, then

$$
\frac{dz}{dt} = \frac{\partial f}{\partial x} \frac{dx}{dt} + \frac{\partial f}{\partial y} \frac{dy}{dt}
$$

This is the **total derivative** of $z$ with respect to $t$.
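As a worked example (the choice of functions is an assumption made for illustration), take $f(x, y) = x^2 + y^2$ with $x(t) = t$ and $y(t) = t^2$. The chain rule gives:

$$
\frac{dz}{dt} = \underbrace{2x}_{\partial f / \partial x} \cdot \underbrace{1}_{dx/dt} + \underbrace{2y}_{\partial f / \partial y} \cdot \underbrace{2t}_{dy/dt} = 2t + 4t^3
$$

We can check this by substituting first, $z = t^2 + t^4$, and differentiating directly, which also gives $2t + 4t^3$.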


| Type                | Applies to            | Meaning                                                                |
| ------------------- | --------------------- | ---------------------------------------------------------------------- |
| Partial Derivative  | Multivariable         | Change in the function with respect to one variable, others held fixed |
| Total Derivative    | Function of functions | Overall rate of change, including indirect dependencies                |
| Ordinary Derivative | Single-variable       | Standard derivative (like $f'(x)$)                                     |

---

In **convex optimization**, you typically deal with **partial derivatives**, and collectively they form the **gradient vector**:

$$
\nabla f(x) = \begin{bmatrix}
\frac{\partial f}{\partial x_1} \\
\frac{\partial f}{\partial x_2} \\
\vdots \\
\frac{\partial f}{\partial x_n}
\end{bmatrix}
$$
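A gradient can be approximated by estimating each partial derivative in turn and collecting the results in a list. The sketch below is one simple way to do this with central differences; `numerical_gradient` and `sum_of_squares` are hypothetical names introduced for illustration, not functions from any library.

```python
def numerical_gradient(func, point, h=1e-6):
    """Approximate the gradient of func at point, one coordinate at a time."""
    grad = []
    for i in range(len(point)):
        forward = list(point)
        backward = list(point)
        forward[i] += h   # nudge only coordinate i upward
        backward[i] -= h  # and downward; all other coordinates stay fixed
        grad.append((func(forward) - func(backward)) / (2 * h))
    return grad

def sum_of_squares(v):
    """f(x_1, ..., x_n) = x_1^2 + ... + x_n^2, whose gradient is 2 * v."""
    return sum(x * x for x in v)

grad = numerical_gradient(sum_of_squares, [1.0, 2.0, 3.0])

print(grad)  # each entry should be close to twice the matching coordinate
```

For the linear regression cost, the same idea applies with the parameter vector $(m, c)$, and the gradient is the pair $\left( \frac{\partial J}{\partial m}, \frac{\partial J}{\partial c} \right)$ derived earlier.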

