# Problem 1 
----


#### (1a) Loss Function in Linear Regression

The most commonly used loss function for linear regression is the **Mean Squared Error (MSE)**. This loss function measures the average of the squared differences between the predicted values $ \hat{y} $ and the actual values $ y $.

The mathematical formula for the loss function $ L_{reg}(\hat{y}, y) $ is given by:

$ L_{reg}(\hat{y}, y) = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 $

where:
- $ N $ is the number of data points,
- $ y_i $ is the actual value for the $i$-th data point,
- $ \hat{y}_i $ is the predicted value for the $i$-th data point.
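
As a quick illustration of the formula, the following minimal sketch computes the MSE in plain Python. The function name `mse` and the sample values are assumptions made for this example and are not part of the problem statement.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of (y_i - y_hat_i)^2 over all N points."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    n = len(y_true)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / n


# Illustrative values only (not taken from the problem).
y_true = [2.0, 3.0, 5.0]
y_pred = [1.0, 2.0, 3.0]

# Squared differences are 1, 1 and 4, so the mean is 6 / 3 = 2.0.
loss = mse(y_true, y_pred)
print(loss)
```

Because the differences are squared, the sign of each error does not matter and large errors are penalized more heavily than small ones.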

### Derivation of the Gradient
To minimize this loss function, we need to compute the gradient of $ L_{reg} $ with respect to the weight parameters $ w_0 $ and $ w_1 $.

For a linear model, the prediction is given by:

$ \hat{y} = w_0 + w_1 x $

where:
- $ w_0 $ is the intercept,
- $ w_1 $ is the slope (or weight for the feature $ x $).

Using the chain rule, we compute the partial derivatives:

1. Gradient with respect to $ w_0 $:
$ \frac{\partial L_{reg}}{\partial w_0} = \frac{2}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i) $

2. Gradient with respect to $ w_1 $:
$ \frac{\partial L_{reg}}{\partial w_1} = \frac{2}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i) x_i $

These gradients are used to update the weights during the optimization process, typically with methods like gradient descent.
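
To make the update concrete, one gradient descent step can be written as follows. The learning rate $ \eta > 0 $ is a symbol introduced here for illustration, since the text above does not fix a particular step size:

$ w_0 \leftarrow w_0 - \eta \, \frac{\partial L_{reg}}{\partial w_0}, \qquad w_1 \leftarrow w_1 - \eta \, \frac{\partial L_{reg}}{\partial w_1} $

Both gradients are evaluated at the current values of $ w_0 $ and $ w_1 $ before either parameter is changed.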


### Step-by-Step Derivation of MSE Gradient (Without Matrix Notation)

We begin by defining the Mean Squared Error (MSE) loss function for linear regression:

\[
L_{reg}(\hat{y}, y) = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2
\]

where:
- \( y_i \) is the actual value,
- \( \hat{y}_i \) is the predicted value,
- \( N \) is the number of data points.

The predicted value \( \hat{y}_i \) in linear regression is given by:

\[
\hat{y}_i = w_0 + w_1 x_i
\]

where:
- \( w_0 \) is the intercept,
- \( w_1 \) is the slope (or weight),
- \( x_i \) is the input feature for the \(i\)-th data point.

Our goal is to minimize the loss function. To do so, we first find its gradients with respect to the parameters \( w_0 \) and \( w_1 \).

### 1. Gradient with respect to \( w_0 \)

Start by taking the partial derivative of the MSE with respect to \( w_0 \):

\[
\frac{\partial L_{reg}}{\partial w_0} = \frac{\partial}{\partial w_0} \left( \frac{1}{N} \sum_{i=1}^{N} (y_i - (w_0 + w_1 x_i))^2 \right)
\]

We can apply the chain rule here. For convenience, define the error term \( e_i = \hat{y}_i - y_i \), that is:

\[
e_i = w_0 + w_1 x_i - y_i
\]

Now, apply the chain rule:

\[
\frac{\partial}{\partial w_0} (y_i - \hat{y}_i)^2 = 2 (y_i - \hat{y}_i) \cdot \frac{\partial}{\partial w_0} (-\hat{y}_i)
\]

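Since \( \frac{\partial}{\partial w_0} \hat{y}_i = \frac{\partial}{\partial w_0} (w_0 + w_1 x_i) = 1 \), the inner derivative is \( \frac{\partial}{\partial w_0} (-\hat{y}_i) = -1 \), so each term becomes \( 2 (y_i - \hat{y}_i)(-1) = 2 (\hat{y}_i - y_i) \). Summing over all data points and dividing by \( N \) gives: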

\[
\frac{\partial L_{reg}}{\partial w_0} = \frac{2}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)
\]
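
One way to sanity-check this result is to compare it against a central finite-difference approximation of the derivative. The sketch below does this in plain Python; the helper names (`mse_loss`, `grad_w0`), the data points, and the step size `h` are all assumptions chosen for illustration.

```python
def mse_loss(w0, w1, xs, ys):
    """MSE of the linear model y_hat = w0 + w1 * x."""
    n = len(xs)
    return sum((y - (w0 + w1 * x)) ** 2 for x, y in zip(xs, ys)) / n


def grad_w0(w0, w1, xs, ys):
    """Analytic gradient: (2 / N) * sum(y_hat_i - y_i)."""
    n = len(xs)
    return (2.0 / n) * sum((w0 + w1 * x) - y for x, y in zip(xs, ys))


# Illustrative data and parameters (not taken from the problem).
xs = [1.0, 2.0, 3.0]
ys = [2.0, 3.0, 5.0]
w0, w1 = 0.0, 1.0

# Central difference: (L(w0 + h) - L(w0 - h)) / (2h), with w1 held fixed.
h = 1e-6
numeric = (mse_loss(w0 + h, w1, xs, ys) - mse_loss(w0 - h, w1, xs, ys)) / (2 * h)
analytic = grad_w0(w0, w1, xs, ys)

print(analytic, numeric)  # the two values should agree closely
```

For these values the errors \( \hat{y}_i - y_i \) are \( -1, -1, -2 \), so the analytic gradient is \( \frac{2}{3} \cdot (-4) = -\frac{8}{3} \), and the numerical estimate should match it up to floating-point rounding.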

### 2. Gradient with respect to \( w_1 \)

Similarly, take the partial derivative of the MSE with respect to \( w_1 \):

\[
\frac{\partial L_{reg}}{\partial w_1} = \frac{\partial}{\partial w_1} \left( \frac{1}{N} \sum_{i=1}^{N} (y_i - (w_0 + w_1 x_i))^2 \right)
\]

Again, applying the chain rule:

\[
\frac{\partial}{\partial w_1} (y_i - \hat{y}_i)^2 = 2 (y_i - \hat{y}_i) \cdot \frac{\partial}{\partial w_1} (-\hat{y}_i)
\]

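Since \( \frac{\partial}{\partial w_1} \hat{y}_i = \frac{\partial}{\partial w_1} (w_0 + w_1 x_i) = x_i \), the inner derivative is \( \frac{\partial}{\partial w_1} (-\hat{y}_i) = -x_i \), so each term becomes \( 2 (y_i - \hat{y}_i)(-x_i) = 2 (\hat{y}_i - y_i) x_i \). Summing over all data points and dividing by \( N \), we get: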

\[
\frac{\partial L_{reg}}{\partial w_1} = \frac{2}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i) x_i
\]
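
As a small worked check of this formula, take three illustrative data points \( (x_i, y_i) = (1, 2), (2, 3), (3, 5) \) and parameters \( w_0 = 0 \), \( w_1 = 1 \). These numbers are assumptions chosen for this example and do not come from the problem. The predictions are \( \hat{y}_i = 1, 2, 3 \), so the errors \( \hat{y}_i - y_i \) are \( -1, -1, -2 \), and:

\[
\frac{\partial L_{reg}}{\partial w_1} = \frac{2}{3} \left( (-1)(1) + (-1)(2) + (-2)(3) \right) = \frac{2}{3} \cdot (-9) = -6
\]

The negative sign indicates that increasing \( w_1 \) from this point would decrease the loss.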

### Final Gradients

Thus, the gradients of the MSE loss function with respect to the parameters \( w_0 \) and \( w_1 \) are:

\[
\frac{\partial L_{reg}}{\partial w_0} = \frac{2}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)
\]

\[
\frac{\partial L_{reg}}{\partial w_1} = \frac{2}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i) x_i
\]
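
To show how these two gradients fit together in practice, here is a minimal sketch of gradient descent for the one-feature model in plain Python. The function names (`mse_gradients`, `fit_line`), the learning rate, the number of steps, and the synthetic data are all assumptions made for illustration; this is one possible implementation, not a prescribed one.

```python
def mse_gradients(w0, w1, xs, ys):
    """Return (dL/dw0, dL/dw1) for the MSE of y_hat = w0 + w1 * x."""
    n = len(xs)
    errors = [(w0 + w1 * x) - y for x, y in zip(xs, ys)]  # y_hat_i - y_i
    grad_w0 = (2.0 / n) * sum(errors)
    grad_w1 = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
    return grad_w0, grad_w1


def fit_line(xs, ys, learning_rate=0.05, num_steps=2000):
    """Plain gradient descent starting from w0 = w1 = 0."""
    w0, w1 = 0.0, 0.0
    for _ in range(num_steps):
        g0, g1 = mse_gradients(w0, w1, xs, ys)
        # Update both parameters using gradients from the same iterate.
        w0 -= learning_rate * g0
        w1 -= learning_rate * g1
    return w0, w1


# Synthetic, noise-free data generated from y = 1 + 2x (illustrative only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w0, w1 = fit_line(xs, ys)
print(w0, w1)  # should be close to the generating values 1 and 2
```

The learning rate here was picked to be small enough for this particular data set. With features on a larger scale, the same value could cause the updates to diverge, so it would need to be reduced or the inputs rescaled.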

