# Lecture 07-08: Linear Regression & Gradient Descent
## Possible Subjective Exam Questions
---

## Section 1: Introduction and Motivation

### Q1. Why is accurate forecasting important for renewable energy generation? Explain with an example.

**Answer:**

Accurate forecasting is important because:

**Problem with Renewable Energy:**
1. Wind and solar power are intermittent (not always available)
2. When wind stops, we need backup from traditional power plants
3. Traditional plants (like nuclear) cannot be switched on or off, or ramped up or down, quickly

**Why Forecasting Helps:**
1. If we know when wind will stop, we can prepare backup in advance
2. Makes renewable energy more efficient and reliable
3. Saves money by better planning

**Real Example:**
Wind forecasting algorithms (using Machine Learning) at NCAR saved utility companies \$6-\$10 million per year.

### Q2. Why is it difficult to estimate energy consumption using "a priori" models? How does data-driven approach help?

**Answer:**

**Why A Priori Models are Difficult:**
1. Energy consumption depends on many factors
2. Hard to write equations that capture all relationships
3. Human behavior is unpredictable
4. Weather patterns are complex

**How Data-Driven Approach Helps:**
1. We have abundant historical data available
2. We can learn patterns from past data
3. The model discovers relationships automatically
4. No need to manually define all rules
5. Model improves as more data becomes available

### Q3. What factors affect electricity consumption? How can visualization help in understanding these patterns?

**Answer:**

**Factors Affecting Electricity Consumption:**

1. **Hour of the day:** Consumption varies throughout the day
2. **Season:** Different consumption in February vs July vs October
3. **Temperature:** Higher temperature leads to more AC usage
4. **Day of week:** Weekdays vs weekends have different patterns

**How Visualization Helps:**
1. Shows correlation between variables
2. Reveals patterns that are not obvious
3. Helps identify which features are useful for prediction
4. Example: Plotting Peak Hourly Demand (GW) vs High Temperature (F) shows a clear relationship

## Section 2: Linear Model Basics

### Q4. Write the general formula for a linear model in simple regression. Explain each component.

**Answer:**

**General Formula:**

$$\text{predicted peak demand} = \theta_1 \cdot (\text{high temperature}) + \theta_2$$

**Components Explained:**

| Component | Meaning |
|-----------|----------|
| $\theta_1$ | Slope - how much output changes when input changes by 1 unit |
| $\theta_2$ | Intercept - the output value when input is zero |
| high temperature | Input feature (independent variable) |
| predicted peak demand | Output (dependent variable) |

**Note:** Both $\theta_1$ and $\theta_2$ are real numbers ($\theta_1, \theta_2 \in \mathbb{R}$) learned from data.

### Q5. Given $\theta_1 = 0.046$ and $\theta_2 = -1.46$, calculate the predicted peak demand when the high temperature is 80°F. Show your work.

**Answer:**

**Given:**
- $\theta_1 = 0.046$
- $\theta_2 = -1.46$
- High temperature = 80°F

**Formula:**
$$\text{predicted peak demand} = \theta_1 \cdot (\text{temperature}) + \theta_2$$

**Calculation:**
$$\text{predicted peak demand} = 0.046 \times 80 + (-1.46)$$

$$= 3.68 - 1.46$$

$$= 2.22 \text{ GW}$$

**Answer:** The predicted peak demand is approximately **2.22 GW**. (The slides give 2.19 GW; the small gap presumably comes from the slides using unrounded values of $\theta_1$ and $\theta_2$.)
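
The same arithmetic can be checked in a few lines of Python. This is only a sketch: the function name `predict_peak_demand` is made up for illustration, and the default parameters are the rounded values given in the question.

```python
def predict_peak_demand(high_temp_f, theta1=0.046, theta2=-1.46):
    """Linear model from Q4: theta1 * (high temperature) + theta2, in GW."""
    return theta1 * high_temp_f + theta2


demand_80 = predict_peak_demand(80)
print(f"Predicted peak demand at 80°F: {demand_80:.2f} GW")  # prints 2.22 GW
```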

### Q6. What is the Least Squares Method? Why is it used in linear regression?

**Answer:**

**Least Squares Method:**
A method to find the best-fitting line by minimizing the sum of squared differences between predicted and actual values.

**Why It's Used:**

1. **Minimizes error:** Finds the line that is closest to all data points

2. **Squared errors:** Using squares makes all errors positive and penalizes large errors more

3. **Mathematical convenience:** Leads to a nice analytical solution

4. **Unique solution:** Gives exactly one best answer whenever the features are linearly independent (see Q24 and Q33 for the exceptions)

**Goal:** Find $\theta$ values that minimize:
$$\sum_{i=1}^{m} (\text{predicted}_i - \text{actual}_i)^2$$

## Section 3: Fitting a Line - Manual Calculation

### Q7. Explain the step-by-step process to fit a line to 2D data points.

**Answer:**

**Steps to Fit a Line:**

**Step 1:** Calculate the mean of X values
$$\overline{X} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

**Step 2:** Calculate the mean of Y values
$$\overline{Y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$

**Step 3:** Calculate the slope (m) using:
$$m = \frac{\sum_{i=1}^{n}(x_i - \overline{X})(y_i - \overline{Y})}{\sum_{i=1}^{n}(x_i - \overline{X})^2}$$

**Step 4:** Calculate the intercept (b) using:
$$b = \overline{Y} - m \cdot \overline{X}$$

**Step 5:** Write the final equation:
$$y = mx + b$$
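
The five steps translate almost line for line into code. The sketch below uses plain Python; the helper name `fit_line` is hypothetical, and the sample points are the ones from Q30.

```python
from statistics import fmean


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    x_bar = fmean(xs)                                   # Step 1: mean of X
    y_bar = fmean(ys)                                   # Step 2: mean of Y
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    m = sxy / sxx                                       # Step 3: slope
    b = y_bar - m * x_bar                               # Step 4: intercept
    return m, b                                         # Step 5: y = m*x + b


m, b = fit_line([1, 2, 3, 4], [2, 3, 5, 4])
print(f"y = {m:.1f}x + {b:.1f}")
```

The guard on `sxx` covers the one case where the formula breaks down: if every $x_i$ is the same, the denominator is zero and no unique line exists.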

### Q8. Derive and explain the formula for calculating slope in linear regression.

**Answer:**

**Slope Formula:**

$$m = \frac{\sum_{i=1}^{n}(x_i - \overline{X})(y_i - \overline{Y})}{\sum_{i=1}^{n}(x_i - \overline{X})^2}$$

**Explanation of Components:**

**Numerator:** $\sum_{i=1}^{n}(x_i - \overline{X})(y_i - \overline{Y})$
- This is proportional to the covariance between X and Y (the covariance multiplied by $n$)
- Measures how X and Y change together
- Positive if they increase together

**Denominator:** $\sum_{i=1}^{n}(x_i - \overline{X})^2$
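- This is proportional to the variance of X (the variance multiplied by $n$), so the slope equals covariance divided by variance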
- Measures how spread out X values are

**Meaning of Slope:**
- How much Y changes when X increases by 1 unit
- Positive slope: Y increases as X increases
- Negative slope: Y decreases as X increases
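
**Derivation sketch:** One standard route to the formula, assuming the line is written $y = mx + b$ and fitted by least squares. The symbol $J(m, b)$ is introduced here only for this sketch.

$$J(m, b) = \sum_{i=1}^{n} (m x_i + b - y_i)^2$$

Setting the derivative with respect to $b$ to zero:

$$\frac{\partial J}{\partial b} = 2\sum_{i=1}^{n}(m x_i + b - y_i) = 0 \;\Rightarrow\; b = \overline{Y} - m\overline{X}$$

Substituting this $b$ back in leaves a function of $m$ only:

$$J(m) = \sum_{i=1}^{n}\big(m(x_i - \overline{X}) - (y_i - \overline{Y})\big)^2$$

Setting the derivative with respect to $m$ to zero:

$$\frac{dJ}{dm} = 2\sum_{i=1}^{n}\big(m(x_i - \overline{X}) - (y_i - \overline{Y})\big)(x_i - \overline{X}) = 0$$

$$\Rightarrow\; m = \frac{\sum_{i=1}^{n}(x_i - \overline{X})(y_i - \overline{Y})}{\sum_{i=1}^{n}(x_i - \overline{X})^2}$$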

### Q9. Why do we subtract the mean ($\overline{X}$ and $\overline{Y}$) in the slope formula?

**Answer:**

**Reasons for Subtracting Mean:**

1. **Centering the data:** Moves the data so that the center is at origin (0,0)

2. **Removes the offset:** The slope depends only on how X and Y vary around their means, not on their absolute levels

3. **Measures deviation:** Shows how each point differs from the average

4. **Covariance calculation:** The formula essentially calculates covariance divided by variance

5. **Ensures line passes through mean point:** The regression line always passes through $(\overline{X}, \overline{Y})$

### Q10. Given slope $m = -1.1$ and intercept $b = 14.0$, interpret what these values mean in context of a prediction problem.

**Answer:**

**Final Equation:** $y = -1.1x + 14.0$

**Interpretation of Slope ($m = -1.1$):**
- Negative slope indicates inverse relationship
- For every 1 unit increase in X, Y decreases by 1.1 units
- Example: If temperature increases by 1°F, demand might decrease by 1.1 units

**Interpretation of Intercept ($b = 14.0$):**
- When X = 0, the predicted Y = 14.0
- This is the baseline value
- May or may not have practical meaning depending on context

**Using the Model:**
- If $x = 5$: $y = -1.1(5) + 14.0 = -5.5 + 14.0 = 8.5$
- If $x = 10$: $y = -1.1(10) + 14.0 = -11 + 14.0 = 3.0$

## Section 4: Formal Problem Setting

### Q11. Define the formal notation used in linear regression. Explain input, output, parameters, and feature mapping.

**Answer:**

**Formal Notation:**

| Symbol | Meaning | Example |
|--------|---------|----------|
| $x_i \in \mathbb{R}^n$ | Input vector for sample $i$ | High temperature for day $i$ |
| $y_i \in \mathbb{R}$ | Output (target) for sample $i$ | Peak demand for day $i$ |
| $\theta \in \mathbb{R}^k$ | Model parameters to learn | Slope and intercept |
| $\phi: \mathbb{R}^n \rightarrow \mathbb{R}^k$ | Feature mapping function | Maps input to feature vector |

**Feature Mapping Example:**
- For $n=1$ (one input) and $k=2$ (two parameters)
- $\phi(x_i) = [x_i, 1]^T$
- This adds a 1 for the intercept term

**This is a Regression Task:** Output is a continuous real number.

### Q12. What is feature mapping? Why do we add a "1" to the feature vector?

**Answer:**

**Feature Mapping ($\phi$):**
A function that transforms the input into a feature vector suitable for the model.

**Example:**
$$\phi(x_i) = [x_i, 1]^T$$

**Why Add "1":**

1. **For the intercept term:** The "1" allows us to include a bias/intercept in our model

2. **Compact notation:** Prediction becomes a simple dot product:
$$\hat{y}_i = \theta^T \phi(x_i) = [\theta_1, \theta_2] \cdot [x_i, 1]^T = \theta_1 x_i + \theta_2$$

3. **Uniform treatment:** Both slope and intercept are treated as parameters

4. **Matrix operations:** Makes it easy to use matrix notation

### Q13. Write and explain the prediction formula in linear regression using vector notation.

**Answer:**

**Prediction Formula:**

$$\hat{y}_i = \sum_{j=1}^{k} \theta_j \cdot \phi_j(x_i) \equiv \theta^T \phi(x_i)$$

**Expanded Form:**
$$\hat{y}_i = \theta_1 \cdot \phi_1(x_i) + \theta_2 \cdot \phi_2(x_i) + ... + \theta_k \cdot \phi_k(x_i)$$

**Components:**
- $\hat{y}_i$ = Predicted output for sample $i$
- $\theta_j$ = The $j$-th parameter
- $\phi_j(x_i)$ = The $j$-th feature of input $x_i$
- $\theta^T \phi(x_i)$ = Dot product of parameter vector and feature vector

**Example with $k=2$:**
$$\hat{y}_i = \theta_1 \cdot x_i + \theta_2 \cdot 1 = \theta_1 x_i + \theta_2$$
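
As a rough illustration, the dot-product form can be coded directly. The function names `phi` and `predict` are hypothetical, and the parameter values are taken from the line $y = -1.1x + 14.0$ in Q10.

```python
def phi(x):
    """Feature map for simple regression: the input plus a constant 1."""
    return [x, 1.0]


def predict(theta, features):
    """Compute theta^T phi(x) as a sum over j of theta_j * phi_j(x)."""
    return sum(t * f for t, f in zip(theta, features))


theta = [-1.1, 14.0]            # [slope, intercept] from Q10
y_hat = predict(theta, phi(5.0))
print(round(y_hat, 2))          # prints 8.5
```

Because `predict` only zips two lists, the same function works unchanged for any number of features $k$.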

## Section 5: Loss Function

### Q14. What is a loss function? Why do we use squared loss in linear regression?

**Answer:**

**Loss Function:**
A function that measures how "close" the predicted value is to the actual value.

$$l: \mathbb{R} \times \mathbb{R} \rightarrow \mathbb{R}_+$$

It takes predicted and actual values and returns a non-negative number.

**Squared Loss:**
$$l(\hat{y}_i, y_i) = (\hat{y}_i - y_i)^2$$

**Why Use Squared Loss:**

1. **Always positive:** Squaring ensures loss is never negative

2. **Penalizes large errors more:** An error of 10 costs 100, not just 10

3. **Differentiable:** Easy to compute gradients for optimization

4. **Analytical solution:** Leads to closed-form solution (Normal Equations)

5. **Statistical interpretation:** Optimal under Gaussian noise assumption

### Q15. Write and explain the cost function (objective function) for linear regression.

**Answer:**

**Cost Function:**

$$J(\theta) = \sum_{i=1}^{m} (\theta^T \phi(x_i) - y_i)^2$$

**Explanation:**

| Part | Meaning |
|------|----------|
| $J(\theta)$ | Total cost as a function of parameters |
| $m$ | Total number of training samples |
| $\theta^T \phi(x_i)$ | Predicted value $\hat{y}_i$ |
| $y_i$ | Actual/true value |
| $(\theta^T \phi(x_i) - y_i)^2$ | Squared error for sample $i$ |
| $\sum_{i=1}^{m}$ | Sum over all samples |

**Objective:** We want to find $\theta$ that minimizes $J(\theta)$

**Goal:** Find $\theta^*$ such that $\hat{y}_i \approx y_i$ for all samples.

### Q16. What is the difference between loss function and cost function?

**Answer:**

| Loss Function | Cost Function |
|---------------|---------------|
| For a single sample | For all samples combined |
| $l(\hat{y}_i, y_i) = (\hat{y}_i - y_i)^2$ | $J(\theta) = \sum_{i=1}^{m} l(\hat{y}_i, y_i)$ |
| Measures individual error | Measures total/average error |
| Also called error function | Also called objective function |

**Relationship:**
$$\text{Cost Function} = \sum (\text{Loss Function for each sample})$$

Sometimes cost function includes average:
$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)^2$$
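
The distinction is easy to see in code. In this sketch (function names are illustrative, and the `average` flag is an assumption added to cover both variants of the cost), the loss handles one sample and the cost aggregates over the Q27 data.

```python
def squared_loss(y_hat, y):
    """Loss for a single sample."""
    return (y_hat - y) ** 2


def cost(theta, xs, ys, average=False):
    """Cost over all samples: sum (or mean) of the per-sample losses."""
    theta1, theta2 = theta
    total = sum(squared_loss(theta1 * x + theta2, y) for x, y in zip(xs, ys))
    return total / len(xs) if average else total


xs, ys = [1, 2, 3], [3, 5, 7]           # data from Q27
perfect = cost((2.0, 1.0), xs, ys)      # the fitted line y = 2x + 1
worse = cost((1.0, 1.0), xs, ys)        # a deliberately poor line y = x + 1
print(perfect, worse)                   # prints 0.0 14.0
```

Summing or averaging changes the value of $J$ but not the $\theta$ that minimizes it, since dividing by $m$ only rescales the function.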

## Section 6: Gradient Descent

### Q17. Derive the partial derivative of the cost function with respect to $\theta_j$.

**Answer:**

**Cost Function:**
$$J(\theta) = \sum_{i=1}^{m} (\theta^T \phi(x_i) - y_i)^2$$

**Derivation:**

Using chain rule:

$$\frac{\partial J}{\partial \theta_j} = \frac{\partial}{\partial \theta_j} \sum_{i=1}^{m} (\theta^T \phi(x_i) - y_i)^2$$

$$= \sum_{i=1}^{m} 2(\theta^T \phi(x_i) - y_i) \cdot \frac{\partial}{\partial \theta_j}(\theta^T \phi(x_i) - y_i)$$

Since $\frac{\partial}{\partial \theta_j}(\theta^T \phi(x_i)) = \phi_j(x_i)$:

$$\frac{\partial J}{\partial \theta_j} = \sum_{i=1}^{m} 2(\theta^T \phi(x_i) - y_i) \cdot \phi_j(x_i)$$

**Or simply:**
$$\frac{\partial J}{\partial \theta_j} = 2 \sum_{i=1}^{m} (\hat{y}_i - y_i) \cdot \phi_j(x_i)$$
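
A quick way to gain confidence in a derived gradient is to compare it with a finite-difference estimate. The sketch below does that for $\phi(x) = [x, 1]^T$ on the Q30 data; the numerical check and all function names are additions for illustration, not part of the lecture.

```python
def cost(theta, xs, ys):
    t1, t2 = theta
    return sum((t1 * x + t2 - y) ** 2 for x, y in zip(xs, ys))


def gradient(theta, xs, ys):
    """Analytic gradient: 2 * sum((y_hat_i - y_i) * phi_j(x_i)) for each j."""
    t1, t2 = theta
    residuals = [t1 * x + t2 - y for x, y in zip(xs, ys)]
    d_theta1 = 2 * sum(r * x for r, x in zip(residuals, xs))  # phi_1(x) = x
    d_theta2 = 2 * sum(residuals)                             # phi_2(x) = 1
    return [d_theta1, d_theta2]


def numeric_gradient(theta, xs, ys, eps=1e-6):
    """Central-difference approximation of each partial derivative."""
    grads = []
    for j in range(len(theta)):
        up, down = list(theta), list(theta)
        up[j] += eps
        down[j] -= eps
        grads.append((cost(up, xs, ys) - cost(down, xs, ys)) / (2 * eps))
    return grads


xs, ys = [1, 2, 3, 4], [2, 3, 5, 4]
theta = [0.0, 0.0]
analytic = gradient(theta, xs, ys)
numeric = numeric_gradient(theta, xs, ys)
print(analytic, numeric)
```

At $\theta = 0$ every residual is $-y_i$, so the analytic gradient is $[-78, -28]$; the numerical estimate should agree to several decimal places.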

### Q18. What is the Design Matrix? How is it constructed?

**Answer:**

**Design Matrix ($\Phi$):**
A matrix that contains all the feature vectors for all samples.

**Construction:**

$$\Phi = \begin{bmatrix} \phi(x_1)^T \\ \phi(x_2)^T \\ \vdots \\ \phi(x_m)^T \end{bmatrix}$$

**Dimensions:** $\Phi \in \mathbb{R}^{m \times k}$
- $m$ = number of samples (rows)
- $k$ = number of features (columns)

**Example:**
If $\phi(x_i) = [x_i, 1]^T$ for 3 samples:

$$\Phi = \begin{bmatrix} x_1 & 1 \\ x_2 & 1 \\ x_3 & 1 \end{bmatrix}$$

**Target Vector:** $y = [y_1, y_2, ..., y_m]^T \in \mathbb{R}^m$
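
Building $\Phi$ is just stacking one feature vector per sample. A minimal sketch with nested lists follows; the helper names are hypothetical, and the inputs are the Q27 data with $\theta = [2, 1]^T$.

```python
def phi(x):
    """Feature vector [x, 1] for one sample."""
    return [x, 1.0]


def design_matrix(xs):
    """Stack phi(x_i)^T as rows: one row per sample, one column per feature."""
    return [phi(x) for x in xs]


def matvec(matrix, vector):
    """Multiply an m x k matrix by a length-k vector."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]


Phi = design_matrix([1, 2, 3])
predictions = matvec(Phi, [2.0, 1.0])    # Phi @ theta
print(Phi)
print(predictions)                       # prints [3.0, 5.0, 7.0]
```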

### Q19. Write the least-squares objective in matrix notation and explain each term.

**Answer:**

**Matrix Form of Cost Function:**

$$J(\theta) = ||\Phi\theta - y||_2^2$$

**Explanation of Terms:**

| Term | Meaning | Dimension |
|------|---------|------------|
| $\Phi$ | Design matrix | $m \times k$ |
| $\theta$ | Parameter vector | $k \times 1$ |
| $y$ | Target vector | $m \times 1$ |
| $\Phi\theta$ | Predictions for all samples | $m \times 1$ |
| $\Phi\theta - y$ | Error vector | $m \times 1$ |
| $\lVert\cdot\rVert_2^2$ | Squared L2 norm | scalar |

**Interpretation:**
This measures the total squared difference between predictions and actual values.

### Q20. What is the condition for finding the minimum of a function? How does it differ for 1D vs multi-variate case?

**Answer:**

**1-D Case (single variable):**

The minimum is found where the derivative equals zero:
$$\frac{dJ}{d\theta} = 0$$

**Multi-variate Case ($\theta \in \mathbb{R}^k$):**

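The minimum is found where the gradient (the vector of all partial derivatives) equals zero. This is a necessary condition; for a convex $J$ such as the least-squares cost it is also sufficient:
$$\nabla_{\theta} J(\theta) = 0$$

Written out component by component, the condition is: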
$$\nabla_{\theta} J(\theta) = \begin{bmatrix} \frac{\partial J}{\partial \theta_1} \\ \frac{\partial J}{\partial \theta_2} \\ \vdots \\ \frac{\partial J}{\partial \theta_k} \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}$$

This gives us $k$ equations to solve for $k$ unknowns.

## Section 7: Normal Equations

### Q21. Expand the least-squares objective function and show the detailed matrix form.

**Answer:**

**Starting Point:**
$$J(\theta) = ||\Phi\theta - y||_2^2$$

**Expansion:**
$$J(\theta) = (\Phi\theta - y)^T (\Phi\theta - y)$$

Using $(A - B)^T = A^T - B^T$:
$$= (\theta^T\Phi^T - y^T)(\Phi\theta - y)$$

Expanding:
$$= \theta^T\Phi^T\Phi\theta - \theta^T\Phi^Ty - y^T\Phi\theta + y^Ty$$

Since $\theta^T\Phi^Ty$ is a scalar, it equals its own transpose $y^T\Phi\theta$, so the two middle terms combine:
$$J(\theta) = \theta^T\Phi^T\Phi\theta - 2y^T\Phi\theta + y^Ty$$

**Terms:**
- $\theta^T\Phi^T\Phi\theta$: Quadratic term in $\theta$
- $-2y^T\Phi\theta$: Linear term in $\theta$
- $y^Ty$: Constant (doesn't affect optimization)

### Q22. Compute the gradient of the cost function and derive the Normal Equations.

**Answer:**

**Cost Function:**
$$J(\theta) = \theta^T\Phi^T\Phi\theta - 2y^T\Phi\theta + y^Ty$$

**Computing Gradient:**

Using matrix calculus rules:
- $\nabla_{\theta}(\theta^T A \theta) = 2A\theta$ (for symmetric $A$; here $A = \Phi^T\Phi$)
- $\nabla_{\theta}(b^T\theta) = b$ (here $b^T = y^T\Phi$, so $b = \Phi^Ty$)

$$\nabla_{\theta} J(\theta) = 2\Phi^T\Phi\theta - 2\Phi^Ty$$

**Setting Gradient to Zero:**
$$2\Phi^T\Phi\theta - 2\Phi^Ty = 0$$

$$\Phi^T\Phi\theta = \Phi^Ty$$

**Normal Equations Solution:**
$$\theta^* = (\Phi^T\Phi)^{-1}\Phi^Ty$$

**This is the analytical (closed-form) solution!**

### Q23. State and explain the Normal Equations. Why is this solution called "analytical"?

**Answer:**

**Normal Equations:**

$$\Phi^T\Phi\theta = \Phi^Ty$$

When $\Phi^T\Phi$ is invertible, solving them gives:

$$\theta^* = (\Phi^T\Phi)^{-1}\Phi^Ty$$

**Explanation of Each Part:**

| Term | Dimension | Meaning |
|------|-----------|----------|
| $\Phi^T$ | $k \times m$ | Transpose of design matrix |
| $\Phi^T\Phi$ | $k \times k$ | Gram matrix (symmetric) |
| $(\Phi^T\Phi)^{-1}$ | $k \times k$ | Inverse of Gram matrix |
| $\Phi^Ty$ | $k \times 1$ | Inner product of each feature column with the target (an unnormalized correlation) |
| $\theta^*$ | $k \times 1$ | Optimal parameters |

**Why "Analytical":**
1. We can directly compute the exact solution
2. No iterative algorithm needed
3. One-shot computation
4. This is rare - most ML problems don't have closed-form solutions

### Q24. What are the conditions for the Normal Equations solution to exist?

**Answer:**

**Condition:** The matrix $(\Phi^T\Phi)$ must be invertible.

**When is it invertible:**

1. **Full column rank:** $\Phi$ must have linearly independent columns
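2. **Enough samples:** Number of samples $m$ must be at least $k$ (number of features); this is necessary but not sufficient
3. **No redundant features:** Features should not be perfectly correlated

**When it is not invertible:**
1. Too many features compared to samples ($k > m$)
2. Perfect multicollinearity (one feature is an exact linear combination of others); features that are merely highly correlated leave the matrix invertible but ill-conditioned
3. Duplicate features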

**Solutions:**
1. Use pseudo-inverse instead
2. Add regularization (Ridge regression)
3. Remove redundant features

## Section 8: Multidimensional Inputs

### Q25. How do we extend linear regression to handle multiple input features?

**Answer:**

**Scenario:** Input $x \in \mathbb{R}^2$ (e.g., Temperature and Hour of Day)

**Feature Vector:**
$$\phi(x) = [\text{temperature}, \text{hour of day}, 1]^T \in \mathbb{R}^3$$

**Design Matrix:**
$$\Phi = \begin{bmatrix} \text{temp}_1 & \text{hour}_1 & 1 \\ \text{temp}_2 & \text{hour}_2 & 1 \\ \vdots & \vdots & \vdots \\ \text{temp}_m & \text{hour}_m & 1 \end{bmatrix} \in \mathbb{R}^{m \times 3}$$

**Solution remains the same:**
$$\theta^* = (\Phi^T\Phi)^{-1}\Phi^Ty$$

**Prediction:**
$$\hat{y} = \theta_1 \cdot \text{temperature} + \theta_2 \cdot \text{hour} + \theta_3$$
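
To show that nothing changes with more features, here is a sketch that solves the normal equations for three parameters in plain Python. Several things are assumptions made for illustration: the helper names, the choice to solve the linear system by Gaussian elimination instead of forming the inverse, and the data, which is synthetic and generated from the known rule $y = 2 \cdot \text{temp} + 3 \cdot \text{hour} + 1$ so the recovered $\theta$ can be checked.

```python
def transpose(M):
    return [list(col) for col in zip(*M)]


def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    aug = [[float(v) for v in A[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular (Phi^T Phi is not invertible)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):                    # back substitution
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def fit_normal_equations(Phi, y):
    """Solve (Phi^T Phi) theta = Phi^T y for theta."""
    Phi_t = transpose(Phi)
    gram = matmul(Phi_t, Phi)
    rhs = [sum(p * t for p, t in zip(col, y)) for col in Phi_t]
    return solve(gram, rhs)


# Synthetic [temperature, hour] inputs; targets follow y = 2*temp + 3*hour + 1
inputs = [[1, 0], [0, 1], [1, 1], [2, 1], [1, 3]]
Phi = [[t, h, 1.0] for t, h in inputs]
y = [2 * t + 3 * h + 1 for t, h in inputs]
theta = fit_normal_equations(Phi, y)
print([round(v, 6) for v in theta])
```

Solving the system directly gives the same $\theta^*$ as multiplying by $(\Phi^T\Phi)^{-1}$ and is generally the numerically safer habit.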

### Q26. Write the general formula for linear regression with $n$ input features.

**Answer:**

**General Setting:**
- Input: $x \in \mathbb{R}^n$ with $n$ features
- Feature vector: $\phi(x) \in \mathbb{R}^{n+1}$ (adding 1 for intercept)
- Parameters: $\theta \in \mathbb{R}^{n+1}$

**Feature Mapping:**
$$\phi(x) = [x_1, x_2, ..., x_n, 1]^T$$

**Prediction Formula:**
$$\hat{y} = \theta^T\phi(x) = \theta_1 x_1 + \theta_2 x_2 + ... + \theta_n x_n + \theta_{n+1}$$

**Or in summation form:**
$$\hat{y} = \sum_{j=1}^{n} \theta_j x_j + \theta_{n+1}$$

**Optimal Parameters:**
$$\theta^* = (\Phi^T\Phi)^{-1}\Phi^Ty$$

## Section 9: Numerical Problems

### Q27. Given the following data points, calculate the slope and intercept of the regression line.

| X | Y |
|---|---|
| 1 | 3 |
| 2 | 5 |
| 3 | 7 |

**Answer:**

**Step 1: Calculate means**
$$\overline{X} = \frac{1+2+3}{3} = \frac{6}{3} = 2$$
$$\overline{Y} = \frac{3+5+7}{3} = \frac{15}{3} = 5$$

**Step 2: Calculate slope**

Numerator:
$$\sum(x_i - \overline{X})(y_i - \overline{Y}) = (1-2)(3-5) + (2-2)(5-5) + (3-2)(7-5)$$
$$= (-1)(-2) + (0)(0) + (1)(2) = 2 + 0 + 2 = 4$$

Denominator:
$$\sum(x_i - \overline{X})^2 = (-1)^2 + (0)^2 + (1)^2 = 1 + 0 + 1 = 2$$

$$m = \frac{4}{2} = 2$$

**Step 3: Calculate intercept**
$$b = \overline{Y} - m\overline{X} = 5 - 2(2) = 5 - 4 = 1$$

**Final Equation:** $y = 2x + 1$

### Q28. For the data in Q27, construct the design matrix $\Phi$ and verify the solution using Normal Equations.

**Answer:**

**Design Matrix:**
$$\Phi = \begin{bmatrix} 1 & 1 \\ 2 & 1 \\ 3 & 1 \end{bmatrix}$$

**Target Vector:**
$$y = \begin{bmatrix} 3 \\ 5 \\ 7 \end{bmatrix}$$

**Step 1:** Calculate $\Phi^T\Phi$
$$\Phi^T\Phi = \begin{bmatrix} 1 & 2 & 3 \\ 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 2 & 1 \\ 3 & 1 \end{bmatrix} = \begin{bmatrix} 14 & 6 \\ 6 & 3 \end{bmatrix}$$

**Step 2:** Calculate $\Phi^Ty$
$$\Phi^Ty = \begin{bmatrix} 1 & 2 & 3 \\ 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} 3 \\ 5 \\ 7 \end{bmatrix} = \begin{bmatrix} 34 \\ 15 \end{bmatrix}$$

**Step 3:** Calculate $(\Phi^T\Phi)^{-1}$
$$\det = 14(3) - 6(6) = 42 - 36 = 6$$
$$(\Phi^T\Phi)^{-1} = \frac{1}{6}\begin{bmatrix} 3 & -6 \\ -6 & 14 \end{bmatrix} = \begin{bmatrix} \frac{1}{2} & -1 \\ -1 & \frac{7}{3} \end{bmatrix}$$

**Step 4:** Calculate $\theta^*$
$$\theta^* = \begin{bmatrix} \frac{1}{2} & -1 \\ -1 & \frac{7}{3} \end{bmatrix} \begin{bmatrix} 34 \\ 15 \end{bmatrix} = \begin{bmatrix} 17 - 15 \\ -34 + 35 \end{bmatrix} = \begin{bmatrix} 2 \\ 1 \end{bmatrix}$$

**Result:** $\theta_1 = 2$ (slope), $\theta_2 = 1$ (intercept) ✓
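
The hand calculation can be double-checked with exact rational arithmetic, which avoids any rounding of the entry $14/6$. This is a sketch for the $2 \times 2$ case only; the variable names are illustrative.

```python
from fractions import Fraction

xs = [1, 2, 3]
ys = [3, 5, 7]

# Entries of Phi^T Phi = [[a, b], [b, d]] and Phi^T y = [p, q] for phi(x) = [x, 1]
a = sum(x * x for x in xs)                  # 14
b = sum(xs)                                 # 6
d = len(xs)                                 # 3
p = sum(x * y for x, y in zip(xs, ys))      # 34
q = sum(ys)                                 # 15

det = a * d - b * b                         # 6
inverse = [[Fraction(d, det), Fraction(-b, det)],
           [Fraction(-b, det), Fraction(a, det)]]

theta1 = inverse[0][0] * p + inverse[0][1] * q
theta2 = inverse[1][0] * p + inverse[1][1] * q
print(theta1, theta2)                       # prints 2 1
```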

### Q29. Calculate the cost $J(\theta)$ for the model $y = 2x + 1$ on the data from Q27.

**Answer:**

**Data and Predictions:**

| $x_i$ | $y_i$ (actual) | $\hat{y}_i = 2x_i + 1$ | Error | Squared Error |
|-------|----------------|------------------------|-------|---------------|
| 1 | 3 | $2(1) + 1 = 3$ | 0 | 0 |
| 2 | 5 | $2(2) + 1 = 5$ | 0 | 0 |
| 3 | 7 | $2(3) + 1 = 7$ | 0 | 0 |

**Cost Function:**
$$J(\theta) = \sum_{i=1}^{3} (\hat{y}_i - y_i)^2 = 0 + 0 + 0 = 0$$

**Interpretation:**
The cost is 0 because all points lie exactly on the line. This is a perfect fit (which is rare in real data).

### Q30. Given data points (1,2), (2,3), (3,5), (4,4), calculate the regression line.

**Answer:**

**Step 1: Calculate means**
$$\overline{X} = \frac{1+2+3+4}{4} = \frac{10}{4} = 2.5$$
$$\overline{Y} = \frac{2+3+5+4}{4} = \frac{14}{4} = 3.5$$

**Step 2: Calculate deviations**

| $x_i$ | $y_i$ | $x_i - \overline{X}$ | $y_i - \overline{Y}$ | Product | $(x_i-\overline{X})^2$ |
|-------|-------|----------------------|----------------------|---------|------------------------|
| 1 | 2 | -1.5 | -1.5 | 2.25 | 2.25 |
| 2 | 3 | -0.5 | -0.5 | 0.25 | 0.25 |
| 3 | 5 | 0.5 | 1.5 | 0.75 | 0.25 |
| 4 | 4 | 1.5 | 0.5 | 0.75 | 2.25 |
| **Sum** | | | | **4.0** | **5.0** |

**Step 3: Calculate slope**
$$m = \frac{4.0}{5.0} = 0.8$$

**Step 4: Calculate intercept**
$$b = 3.5 - 0.8(2.5) = 3.5 - 2.0 = 1.5$$

**Final Equation:** $y = 0.8x + 1.5$

## Section 10: Conceptual Questions

### Q31. Why is linear regression called "linear"? Is it limited to straight lines?

**Answer:**

**Why "Linear":**
Linear regression is called linear because the prediction is a **linear combination of the parameters** $\theta$.

$$\hat{y} = \theta_1 \phi_1(x) + \theta_2 \phi_2(x) + ... + \theta_k \phi_k(x)$$

The model is linear **in $\theta$**, not necessarily in $x$.

**Not Limited to Straight Lines:**

We can use non-linear feature mappings:

| Feature Mapping | Resulting Model |
|-----------------|------------------|
| $\phi(x) = [x, 1]^T$ | $\hat{y} = \theta_1 x + \theta_2$ (line) |
| $\phi(x) = [x^2, x, 1]^T$ | $\hat{y} = \theta_1 x^2 + \theta_2 x + \theta_3$ (parabola) |
| $\phi(x) = [\sin(x), \cos(x), 1]^T$ | $\hat{y} = \theta_1 \sin(x) + \theta_2 \cos(x) + \theta_3$ |

All these are still "linear regression" because they're linear in $\theta$.
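
A small numerical check makes the "linear in $\theta$, not in $x$" point concrete. The parameter values below are arbitrary numbers chosen for illustration, and the function names are hypothetical.

```python
def phi_quadratic(x):
    """Non-linear feature map [x^2, x, 1]."""
    return [x * x, x, 1.0]


def predict(theta, x):
    return sum(t * f for t, f in zip(theta, phi_quadratic(x)))


theta_a = [1.0, -2.0, 0.5]      # arbitrary illustrative parameters
theta_b = [0.5, 3.0, -1.0]
theta_sum = [a + b for a, b in zip(theta_a, theta_b)]
x = 3.0

# Linear in theta: adding parameter vectors adds the predictions.
lhs = predict(theta_sum, x)
rhs = predict(theta_a, x) + predict(theta_b, x)

# Not linear in x: doubling x does not double the prediction.
at_x = predict(theta_a, x)
at_2x = predict(theta_a, 2 * x)
print(lhs, rhs, at_x, at_2x)    # prints 16.0 16.0 3.5 24.5
```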

### Q32. Compare the analytical solution (Normal Equations) vs iterative solution (Gradient Descent).

**Answer:**

| Aspect | Normal Equations | Gradient Descent |
|--------|------------------|------------------|
| **Formula** | $\theta^* = (\Phi^T\Phi)^{-1}\Phi^Ty$ | $\theta := \theta - \alpha \nabla J$ |
| **Type** | Closed-form/Analytical | Iterative |
| **Iterations** | One-shot | Multiple iterations |
| **Hyperparameters** | None | Learning rate $\alpha$, number of iterations |
| **Time Complexity** | $O(mk^2 + k^3)$ (form $\Phi^T\Phi$, then invert) | $O(mk)$ per iteration |
| **Large $k$** | Slow (matrix inverse) | Fast |
| **Large $m$** | Need all data in memory | Can use mini-batches |
| **When to use** | Small to medium $k$ | Large $k$ or online learning |
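
For comparison with the closed-form answer, here is a minimal batch gradient descent sketch for $\phi(x) = [x, 1]^T$ on the Q30 data. The learning rate, iteration count, and zero initialization are illustrative choices, not values from the lecture.

```python
def gradient(theta, xs, ys):
    """Gradient of the summed squared error for the model theta1 * x + theta2."""
    t1, t2 = theta
    residuals = [t1 * x + t2 - y for x, y in zip(xs, ys)]
    return [2 * sum(r * x for r, x in zip(residuals, xs)), 2 * sum(residuals)]


def gradient_descent(xs, ys, alpha=0.01, iterations=5000):
    """Repeat the update theta := theta - alpha * grad J(theta)."""
    theta = [0.0, 0.0]
    for _ in range(iterations):
        grad = gradient(theta, xs, ys)
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta


theta = gradient_descent([1, 2, 3, 4], [2, 3, 5, 4])
print([round(t, 4) for t in theta])
```

With these settings the iterates approach the Normal Equations answer from Q30 (slope 0.8, intercept 1.5). A learning rate that is too large makes the updates diverge, which is the practical cost of the extra hyperparameter.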

### Q33. What happens if $\Phi^T\Phi$ is not invertible? What causes this?

**Answer:**

**What Happens:**
- The Normal Equations cannot be solved by inverting $\Phi^T\Phi$
- The minimizer is not unique
- Infinitely many $\theta$ achieve the same minimum cost

**Causes:**

1. **Redundant features:** Two features are linearly dependent
   - Example: "height in cm" and "height in inches"

2. **More features than samples:** $k > m$
   - Not enough data to determine all parameters

3. **Duplicate features:** Same feature included twice

**Solutions:**

1. **Remove redundant features**
2. **Use pseudo-inverse:** $\theta = \Phi^+ y$
3. **Add regularization (Ridge):** $\theta = (\Phi^T\Phi + \lambda I)^{-1}\Phi^Ty$
4. **Use gradient descent instead**
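
The duplicate-feature case and the Ridge fix can be seen on a tiny example. Everything here is an illustrative assumption: the helper names, the synthetic targets, and the value $\lambda = 0.1$. The model has two features and no intercept so that $\Phi^T\Phi$ is $2 \times 2$.

```python
def gram_entries(col1, col2, lam=0.0):
    """Entries (a, b, d) of the symmetric 2x2 matrix Phi^T Phi + lam * I."""
    a = sum(u * u for u in col1) + lam
    b = sum(u * v for u, v in zip(col1, col2))
    d = sum(v * v for v in col2) + lam
    return a, b, d


def solve_symmetric_2x2(a, b, d, p, q):
    """Solve [[a, b], [b, d]] theta = [p, q] using the explicit 2x2 inverse."""
    det = a * d - b * b
    if abs(det) < 1e-12:
        raise ValueError("matrix is singular")
    return [(d * p - b * q) / det, (a * q - b * p) / det]


feature = [1.0, 2.0, 3.0]       # the same feature is included twice
y = [2.0, 4.0, 6.0]             # synthetic targets, y = 2 * feature
p = sum(f * t for f, t in zip(feature, y))
q = p                           # both entries of Phi^T y coincide

try:
    solve_symmetric_2x2(*gram_entries(feature, feature), p, q)
    plain_failed = False
except ValueError:
    plain_failed = True         # determinant is exactly zero

theta_ridge = solve_symmetric_2x2(*gram_entries(feature, feature, lam=0.1), p, q)
print(plain_failed, [round(t, 4) for t in theta_ridge])
```

Without regularization the determinant is zero and the solve fails. With $\lambda > 0$ the matrix becomes invertible, and Ridge splits the weight evenly between the two identical columns.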

### Q34. Explain the geometric interpretation of the least squares solution.

**Answer:**

**Geometric View:**

1. **Column space of $\Phi$:** The set of all possible predictions $\Phi\theta$

2. **Target $y$:** May not lie in the column space of $\Phi$

3. **Optimal $\hat{y} = \Phi\theta^*$:** The projection of $y$ onto the column space of $\Phi$

4. **Residual $(y - \Phi\theta^*)$:** Perpendicular to the column space

**Key Insight:**
The least squares solution finds the point in the "prediction space" that is closest to the actual target.

**Why Normal Equations Work:**
The condition $\Phi^T(\Phi\theta - y) = 0$ means the residual is orthogonal to all columns of $\Phi$.
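
The orthogonality condition can be verified numerically. This sketch uses the Q30 data and its fitted line $y = 0.8x + 1.5$; the two columns of $\Phi$ are the $x$ values and the constant 1s.

```python
xs, ys = [1, 2, 3, 4], [2, 3, 5, 4]
theta1, theta2 = 0.8, 1.5                   # least-squares fit from Q30

# Residual vector y - Phi @ theta
residuals = [y - (theta1 * x + theta2) for x, y in zip(xs, ys)]

# Dot products of the residual with each column of Phi
dot_with_x = sum(r * x for r, x in zip(residuals, xs))
dot_with_ones = sum(residuals)

print(abs(dot_with_x) < 1e-9, abs(dot_with_ones) < 1e-9)   # prints True True
```

Both dot products vanish (up to floating-point rounding), which is exactly what $\Phi^T(\Phi\theta - y) = 0$ states. Any other choice of $\theta$ would leave at least one of them non-zero.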

### Q35. What is the role of the intercept term (bias) in linear regression?

**Answer:**

**Role of Intercept:**

1. **Shifts the line:** Allows the line to not pass through origin

2. **Baseline prediction:** The predicted value when all features are zero

3. **Flexibility:** Without intercept, model is too constrained

**Example:**
- Predicting house price based on size
- Without intercept: A 0 sq.ft house has price = \$0
- With intercept: A 0 sq.ft house has price = \$base\_price (land value, etc.)

**How to Include:**
Add a column of 1s to the feature matrix:
$$\phi(x) = [x_1, x_2, ..., x_n, 1]^T$$

The parameter corresponding to the 1 becomes the intercept.

### Q36. List the assumptions of linear regression.

**Answer:**

**Key Assumptions:**

1. **Linearity:** Relationship between X and Y is linear (in parameters)

2. **Independence:** Observations are independent of each other

3. **Homoscedasticity:** Variance of errors is constant across all levels of X

4. **No multicollinearity:** Features are not highly correlated with each other

5. **Normality of errors:** Errors follow a normal distribution (for inference)

6. **No autocorrelation:** Errors are not correlated with each other

**When Assumptions Violated:**
- Model may give biased or unreliable estimates
- Need to use different techniques (regularization, transformations, etc.)

## Section 11: Summary Questions

### Q37. Summarize the complete pipeline for linear regression from data to prediction.

**Answer:**

**Complete Pipeline:**

**Step 1: Data Collection**
- Gather input-output pairs: $(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)$

**Step 2: Feature Engineering**
- Create feature mapping: $\phi(x_i) = [x_i, 1]^T$ (or more complex)

**Step 3: Construct Design Matrix**
- Build $\Phi \in \mathbb{R}^{m \times k}$ from all feature vectors

**Step 4: Solve Normal Equations**
$$\theta^* = (\Phi^T\Phi)^{-1}\Phi^Ty$$

**Step 5: Make Predictions**
- For new input $x_{new}$:
$$\hat{y}_{new} = \theta^{*T}\phi(x_{new})$$

**Step 6: Evaluate**
- Calculate error on test data
- Check if model is good enough

### Q38. What makes linear regression special compared to other ML algorithms?

**Answer:**

**Special Properties of Linear Regression:**

1. **Analytical Solution:** One of the few ML problems with a closed-form solution
   $$\theta^* = (\Phi^T\Phi)^{-1}\Phi^Ty$$

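2. **Interpretability:** Each coefficient gives the change in prediction per unit change in its feature; comparing magnitudes as "feature importance" is meaningful only when features are on comparable scales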

3. **Simplicity:** Easy to understand and implement

4. **Speed:** Fast to train (just matrix operations)

5. **Foundation:** Basis for many advanced techniques

6. **No hyperparameters:** (in basic form) No tuning required

7. **Convex optimization:** The cost is convex, so any local minimum is also a global minimum

**Limitations:**
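- Cannot model non-linear relationships unless suitable non-linear features are added (see Q31)
- Sensitive to outliers (squared loss amplifies large errors)
- Assumes the output is linear in the chosen features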

---
## Summary of Important Formulas

| Concept | Formula |
|---------|--------|
| Linear Model | $\hat{y} = \theta_1 x + \theta_2$ |
| Slope | $m = \frac{\sum(x_i - \overline{X})(y_i - \overline{Y})}{\sum(x_i - \overline{X})^2}$ |
| Intercept | $b = \overline{Y} - m\overline{X}$ |
| Prediction (vector) | $\hat{y}_i = \theta^T\phi(x_i)$ |
| Squared Loss | $l(\hat{y}_i, y_i) = (\hat{y}_i - y_i)^2$ |
| Cost Function | $J(\theta) = \sum_{i=1}^{m}(\theta^T\phi(x_i) - y_i)^2$ |
| Matrix Cost | $J(\theta) = \lVert\Phi\theta - y\rVert_2^2$ |
| Expanded Cost | $J(\theta) = \theta^T\Phi^T\Phi\theta - 2y^T\Phi\theta + y^Ty$ |
| Gradient | $\nabla_\theta J = 2\Phi^T\Phi\theta - 2\Phi^Ty$ |
| Normal Equations | $\theta^* = (\Phi^T\Phi)^{-1}\Phi^Ty$ |

---