# **Simple Linear Regression:**

## **What is Simple Linear Regression:**

Simple linear regression models the relationship between a single input variable $X$ and an output variable $Y$ using a straight line:

$$
Y = \beta_0 + \beta_1 X + \varepsilon
$$

- $Y$: the response (dependent) variable  
- $X$: the predictor (independent) variable  
- $\varepsilon$: the error term, capturing the variation in $Y$ that the line does not explain  

---

### Model Parameters

- **$\beta_0$ (Intercept)**:  
  The expected value of $Y$ when $X = 0$. It determines where the regression line crosses the Y-axis.

- **$\beta_1$ (Slope)**:  
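  The change in the predicted value of $Y$ for a one-unit increase in $X$. Its sign gives the direction of the linear relationship and its magnitude gives the rate of change; the strength of the association is measured separately by the correlation $r$ or by $R^2$.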

---

### Questions this Model can Answer

1. **What is the average change in $Y$ for a one-unit increase in $X$? How strong is the association?**  
   → The average change is given by the slope coefficient $\beta_1$; the strength of the association is summarized by $r$ or $R^2$.

2. **What is the predicted value of $Y$ for a given value of $X$?**  
   → Use the regression equation: $\hat{Y} = \beta_0 + \beta_1 X$.

3. **Is there a statistically significant linear relationship between $X$ and $Y$?**  
   → Test $H_0: \beta_1 = 0$ against $H_1: \beta_1 \neq 0$ using a t-test.

4. **How well does $X$ explain the variation in $Y$?**  
   → Measured using the coefficient of determination, $R^2$.

5. **What is the expected value of $Y$ when $X = 0$?**  
   → Interpreted through the intercept $\beta_0$ (if meaningful in context).

6. **How much uncertainty is in our predictions?**  
   → Use confidence intervals for $\hat{Y}$ or prediction intervals for new observations.

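7. **Is there a synergy effect with other predictors?**  
   → Not answerable with a single predictor; this requires multiple regression with interaction terms.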

## **Estimating Coefficients**:

#### Residual Sum of Squares (RSS)

In simple linear regression, the **residual sum of squares (RSS)** measures the total squared difference between the observed values $y_i$ and the predicted values $\hat{y}_i$:

$$
\text{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2
$$

- A **smaller RSS** means the regression line fits the data more closely.

---

#### Relationship Between RSS and MSE

The **mean squared error (MSE)** is RSS divided by the number of observations, i.e., the average squared residual:

$$
\text{MSE} = \frac{\text{RSS}}{n}
$$

- RSS sums the squared errors.
- MSE scales it by the number of observations $n$, giving the **average squared error**.
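
As a quick illustration, the sketch below computes RSS and MSE for two candidate lines on a small made-up dataset. The data and the helper names `rss` and `mse` are assumptions introduced only for this example.

```python
def rss(x, y, b0, b1):
    """Residual sum of squares for the line y = b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


def mse(x, y, b0, b1):
    """Mean squared error: RSS divided by the number of observations."""
    return rss(x, y, b0, b1) / len(x)


# Toy data, made up for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

rss_a = rss(x, y, 2.2, 0.6)  # a line that follows the data closely
rss_b = rss(x, y, 0.0, 1.0)  # a line that fits worse
print(rss_a, rss_b, mse(x, y, 2.2, 0.6))
```

The line with the smaller RSS is the closer fit; dividing by $n$ only rescales the comparison.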

---

#### Estimating $\beta_1$ and $\beta_0$

To find the best-fitting line, we choose the estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ that **minimize RSS**. Taking the partial derivatives of RSS with respect to each coefficient and setting them to zero gives:

$$
\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}
$$

$$
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
$$

- $\bar{x}$ and $\bar{y}$ are the sample means of $x$ and $y$
- $\hat{\beta}_1$ is the estimated slope: it measures how much $Y$ changes per unit change in $X$
- $\hat{\beta}_0$ is the estimated intercept: the predicted value of $Y$ when $X = 0$

Setting the two partial derivatives of RSS to zero yields the **normal equations**; the formulas above are their solution.
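
The closed-form estimates can be sketched in a few lines of plain Python. The function name `fit_line` and the toy data are assumptions for illustration, not part of the original notes.

```python
def fit_line(x, y):
    """Return (b0, b1): the least-squares intercept and slope."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator and denominator of the slope formula
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Toy data, made up for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = fit_line(x, y)
print(b0, b1)  # intercept close to 2.2, slope close to 0.6
```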


## **Unbiased Vs. Biased Estimators:**

In statistics, an **estimator** is a rule or formula for estimating a population parameter (like the mean or variance) from sample data.

---

#### Unbiased Estimator

An estimator $\hat{\theta}$ is **unbiased** if its expected value equals the true parameter $\theta$:

$$
\mathbb{E}[\hat{\theta}] = \theta
$$

This means that, on average across many samples, the estimator will correctly estimate the true value.

---

#### Biased Estimator

An estimator is **biased** if its expected value does **not** equal the true parameter:

$$
\text{Bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta \ne 0
$$

It systematically overestimates or underestimates the true parameter.

---

#### Example: Sample Variance

- The **unbiased** sample variance uses $n - 1$ in the denominator:
  
  $$
  s^2 = \frac{1}{n - 1} \sum_{i=1}^{n}(x_i - \bar{x})^2
  $$

- The **biased** version uses $n$ in the denominator and, on average, underestimates the true population variance.
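
A small sketch of the two denominators, using a made-up sample. Python's `statistics` module provides both conventions: `pvariance` divides by $n$ and `variance` divides by $n - 1$.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample
n = len(data)
mean = sum(data) / n
ss = sum((v - mean) ** 2 for v in data)  # sum of squared deviations

biased = ss / n          # divides by n
unbiased = ss / (n - 1)  # divides by n - 1

# Compare against the standard-library implementations
print(biased, statistics.pvariance(data))
print(unbiased, statistics.variance(data))
```

The $n - 1$ version is always the larger of the two, which compensates for measuring deviations from $\bar{x}$ rather than from the true mean.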


## **Standard Error:**

The **standard error** measures the variability (or precision) of an estimator. It tells us how much the estimate would vary if we repeated the sampling process many times.

---

#### Variance of the Sample Mean

For a sample of $n$ independent observations from a population with variance $\sigma^2$, the sample mean $\bar{x}$ has variance:

$$
\operatorname{Var}(\bar{x}) = \frac{\sigma^2}{n}
$$

So the **standard error of the sample mean** is:

$$
\text{SE}(\bar{x}) = \sqrt{\frac{\sigma^2}{n}} = \frac{\sigma}{\sqrt{n}}
$$
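
To make the $1/\sqrt{n}$ scaling concrete, here is a minimal sketch. The helper `se_of_mean` and the value $\sigma = 2$ are assumptions for illustration; in practice $\sigma$ is usually replaced by a sample estimate.

```python
import math


def se_of_mean(sigma, n):
    """Standard error of the sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)


se_25 = se_of_mean(2.0, 25)    # 2 / 5
se_100 = se_of_mean(2.0, 100)  # 2 / 10
print(se_25, se_100)
```

Quadrupling the sample size halves the standard error.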

---

#### Standard Errors in Simple Linear Regression

Let $X$ be the predictor with sample size $n$, and let $s^2$ be the residual variance estimate:

$$
s^2 = \frac{1}{n - 2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$

Then:

- **Standard error of the intercept** $\hat{\beta}_0$ (shown squared):

  $$
  \text{SE}(\hat{\beta}_0)^2 = s^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \right]
  $$

- **Standard error of the slope** $\hat{\beta}_1$ (shown squared):

  $$
  \text{SE}(\hat{\beta}_1)^2 = \frac{s^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
  $$
  
These quantify the uncertainty in our estimates of $\beta_0$ and $\beta_1$ based on the variability in the data.

---

Smaller standard errors indicate more precise estimates.
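
The two standard-error formulas can be sketched as follows. The function name `coef_standard_errors` and the toy data are assumptions for illustration.

```python
import math


def coef_standard_errors(x, y):
    """Return (se_b0, se_b1) for a simple least-squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    # Residual variance estimate with n - 2 degrees of freedom
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s2 = rss / (n - 2)
    se_b0 = math.sqrt(s2 * (1 / n + x_bar ** 2 / sxx))
    se_b1 = math.sqrt(s2 / sxx)
    return se_b0, se_b1


# Toy data, made up for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

se_b0, se_b1 = coef_standard_errors(x, y)
print(se_b0, se_b1)
```

Note that `sxx` sits in the denominator of both formulas: spreading the $x$ values further apart makes both estimates more precise.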


## **Confidence Intervals:**

A **confidence interval (CI)** gives a range of plausible values for an unknown population parameter based on sample data. It is centered around a point estimate and constructed using the **standard error (SE)**.

---

#### General Formula

For a parameter estimate $\hat{\theta}$, a 95% confidence interval is:

$$
\hat{\theta} \pm z^* \cdot \text{SE}(\hat{\theta})
$$

- $\hat{\theta}$: point estimate (e.g., $\hat{\beta}_1$, $\bar{x}$)  
- $\text{SE}(\hat{\theta})$: standard error of the estimate  
- $z^*$: critical value from the standard normal distribution (e.g., $z^* \approx 1.96$ for 95% confidence)

---

#### Interpretation

A 95% confidence interval means that if we repeated the sampling process many times, about **95% of the intervals** constructed this way would contain the **true population parameter**.

---

#### Example: Confidence Interval for $\beta_1$

$$
\hat{\beta}_1 \pm 1.96 \cdot \text{SE}(\hat{\beta}_1)
$$

This provides a plausible range for the true slope in simple linear regression.

---

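Note: Use a $t$-distribution (with appropriate degrees of freedom) instead of $z^*$ when the sample size is small or $\sigma$ is unknown. For the coefficients of a simple linear regression, that is a $t$-distribution with $n - 2$ degrees of freedom.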
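
The sketch below builds an interval from an estimate, its standard error, and a critical value. The slope estimate, its standard error, and the helper `confidence_interval` are assumed values and names for illustration; the $t$ critical value 3.182 is the tabulated 97.5th percentile for 3 degrees of freedom (as for $n = 5$).

```python
from statistics import NormalDist


def confidence_interval(estimate, se, crit):
    """Return (lower, upper) = estimate -/+ crit * se."""
    half_width = crit * se
    return estimate - half_width, estimate + half_width


b1_hat = 0.6    # assumed slope estimate
se_b1 = 0.2828  # assumed standard error of the slope

z_star = NormalDist().inv_cdf(0.975)  # about 1.96
t_star = 3.182                        # from a t-table, 3 degrees of freedom

ci_z = confidence_interval(b1_hat, se_b1, z_star)
ci_t = confidence_interval(b1_hat, se_b1, t_star)
print(ci_z)
print(ci_t)
```

With so few observations the $t$-based interval is noticeably wider, and here it includes zero while the normal-based one does not.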


## **Hypothesis Tests:**

In simple linear regression, you often test whether the slope $\beta_1$ is significantly different from zero — i.e., whether the predictor $X$ has a linear relationship with the response $Y$.

---

#### Hypotheses

- **Null hypothesis** ($H_0$): $\beta_1 = 0$ (no linear relationship)
- **Alternative hypothesis** ($H_1$): $\beta_1 \ne 0$ (there is a linear relationship)

---

#### Test Statistic: t-score

To test $H_0$, we compute a **t-statistic**:

$$
t = \frac{\hat{\beta}_1}{\text{SE}(\hat{\beta}_1)}
$$

- $\hat{\beta}_1$: estimated slope from the regression  
- $\text{SE}(\hat{\beta}_1)$: standard error of the slope

This tells us how many standard errors $\hat{\beta}_1$ is away from zero.

---

#### Using the t-Statistic in Simple Linear Regression:

1. Compare the absolute value of $t$ to a **critical value** from the $t$-distribution with $n - 2$ degrees of freedom.
2. Alternatively, calculate a **p-value** and compare it to a significance level (e.g., $\alpha = 0.05$).

- If the p-value is small (typically < 0.05), we **reject $H_0$**.
- This suggests that the predictor $X$ is statistically significant.

---

A large $|t|$ value indicates strong evidence against the null hypothesis.
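
A minimal sketch of the test using the critical-value approach. The slope estimate and standard error are assumed values, and 3.182 is the tabulated two-sided 5% critical value for $n - 2 = 3$ degrees of freedom.

```python
def t_statistic(b1_hat, se_b1):
    """Number of standard errors the slope estimate lies from zero."""
    return b1_hat / se_b1


b1_hat = 0.6    # assumed slope estimate
se_b1 = 0.2828  # assumed standard error of the slope
t_crit = 3.182  # two-sided 5% critical value, 3 degrees of freedom (t-table)

t = t_statistic(b1_hat, se_b1)
reject_h0 = abs(t) > t_crit
print(t, reject_h0)
```

Here $|t|$ falls short of the critical value, so with these assumed numbers $H_0$ would not be rejected at the 5% level.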


# **Assessing the Accuracy of the Model:**

## **Residual Standard Error:**

The **Residual Standard Error (RSE)** measures the typical size of the residuals (prediction errors) in a regression model. It estimates the standard deviation of the error term $\varepsilon$.

---

#### Formula

$$
\text{RSE} = \sqrt{\frac{\text{RSS}}{n - p - 1}} = \sqrt{\frac{1}{n - p - 1} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
$$

- $\text{RSS}$: Residual Sum of Squares  
- $n$: number of observations  
- $p$: number of predictors
- $n - p - 1$: residual degrees of freedom; for simple linear regression ($p = 1$) this is $n - 2$, since two parameters ($\beta_0$ and $\beta_1$) are estimated

---

#### Interpretation

- RSE gives an estimate of the **typical distance** between the actual data points and the regression line.
- It is in the **same units as the response variable $Y$**.
- A **smaller RSE** indicates a better fit.

---

#### Relation to RSS

- RSS measures **total squared error**.
- RSE adjusts RSS by dividing by degrees of freedom and taking the square root, converting it to an **average per-data-point error**.

---

#### Bounds

- $\text{RSE} \ge 0$  
- RSE = 0 only if the model fits the data **perfectly** (all residuals = 0), which is rare and usually unrealistic.

RSE is useful for comparing models or assessing absolute prediction accuracy.

---


#### Relationship Between RSE and MSE

In simple linear regression, the **Residual Standard Error (RSE)** is related to the **Mean Squared Error (MSE)** as follows:

$$
\text{RSE} = \sqrt{\frac{n}{n - p - 1} \cdot \text{MSE}}
$$

- $n$: number of observations  
- $p$: number of predictors
- MSE uses denominator $n$  
- RSE uses denominator $n - p - 1$, which is $n - 2$ in simple linear regression, to account for the degrees of freedom lost when estimating $\beta_0$ and $\beta_1$

---

#### What They Represent

- **MSE** is the **average squared residual** and is used primarily during model training to evaluate and minimize prediction error. It tells you how far off your predictions are, in squared units of $Y$.
- **RSE** is the **estimated standard deviation of the residuals**, giving an interpretable sense of the typical prediction error in the same units as $Y$. It tells you how far off your predictions are, on the original scale of $Y$.

---

So, RSE is essentially the **square root of a bias-corrected version of MSE**. When $n$ is large, the difference between RSE and $\sqrt{\text{MSE}}$ becomes small.
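
The relationship can be checked numerically. The RSS value, sample size, and helper names below are assumptions for illustration.

```python
import math


def rse(rss, n, p):
    """Residual standard error with n - p - 1 degrees of freedom."""
    return math.sqrt(rss / (n - p - 1))


def mse(rss, n):
    """Mean squared error: RSS divided by n."""
    return rss / n


rss_val, n, p = 2.4, 5, 1  # assumed values for a small simple regression

rse_val = rse(rss_val, n, p)
root_mse = math.sqrt(mse(rss_val, n))
via_mse = math.sqrt(n / (n - p - 1) * mse(rss_val, n))  # identity from the text

# Ratio RSE / sqrt(MSE) when n = 1000 and p = 1
ratio_large_n = math.sqrt(1000 / (1000 - 1 - 1))
print(rse_val, root_mse, via_mse, ratio_large_n)
```

For $n = 5$ the two quantities differ visibly; for $n = 1000$ the ratio is within about 0.1% of 1.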


## **${R^2}$ Statistic:**

The **$R^2$ statistic** measures the proportion of variability in the response variable $Y$ that is explained by the regression model.

---

#### Formula

$$
R^2 = 1 - \frac{\text{RSS}}{\text{TSS}}
$$

- $\text{RSS}$: Residual Sum of Squares  
- $\text{TSS}$: Total Sum of Squares

---

#### What is TSS?

$$
\text{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2
$$

- TSS measures the **total variation** in the response variable $Y$.
- It quantifies how far the observed $y_i$ values are from the mean $\bar{y}$.
- TSS represents the total "error" you'd have if you used the mean $\bar{y}$ to predict every $y_i$.

---

#### Important: What Is a "Good" $R^2$?

There is **no universal threshold** for a "good" $R^2$ — it depends on the field and type of data:

- In **physics or engineering**, $R^2 > 0.9$ is common due to low-noise systems.
- In **biology, psychology, or economics**, even $R^2$ values around 0.2–0.4 may be acceptable due to high natural variability or unobserved factors.
- A low $R^2$ does **not always mean** the model is useless — it may still reveal important patterns or predictors.

Always interpret $R^2$ **in context**, alongside domain knowledge and other model diagnostics.

---

$R^2$ answers the question:  
**“How much better is my model at predicting $Y$ compared to just using the mean?”**
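
A short sketch of the $R^2$ computation. The observed values are made up, and the fitted values are assumed to come from the line $\hat{y} = 2.2 + 0.6x$ evaluated at $x = 1, \dots, 5$.

```python
def r_squared(y, y_hat):
    """R^2 = 1 - RSS / TSS."""
    y_bar = sum(y) / len(y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    tss = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - rss / tss


y = [2, 4, 5, 4, 5]                # made-up observations
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]  # assumed fitted values

r2 = r_squared(y, y_hat)
r2_mean_only = r_squared(y, [4.0] * 5)  # predicting the mean everywhere
print(r2, r2_mean_only)
```

Predicting $\bar{y}$ for every observation gives RSS = TSS and therefore $R^2 = 0$, which is the baseline the question above refers to.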


## **Correlation Coefficient $(r)$:**

The **correlation coefficient** $r$ measures the **strength and direction** of the linear relationship between two variables $X$ and $Y$.

---

#### Formula for Pearson Correlation

$$
r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2} \cdot \sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}
$$

- $r$ ranges from **–1 to 1**
  - $r = 1$: perfect positive linear relationship  
  - $r = -1$: perfect negative linear relationship  
  - $r = 0$: no linear relationship

---

#### Relationship to $R^2$

In **simple linear regression** (only one predictor):

$$
R^2 = r^2
$$

- $R^2$ is the **square of the correlation** between $X$ and $Y$.
- This means $R^2$ captures the **proportion of variance in $Y$** that can be explained by a **linear** relationship with $X$.

---

#### Key Notes

- $r$ captures **direction and strength** of linear correlation.
- $R^2$ captures **how much variation** in $Y$ is explained; for a least-squares fit with an intercept it lies **between 0 and 1**, so the direction of the relationship is lost.
- In multiple regression, $R^2$ is **not equal** to the square of any single $r$ — it accounts for **joint effects** of all predictors.
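
The identity $R^2 = r^2$ for a single predictor can be checked on a toy example. The data and the helper `pearson_r` are assumptions for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between x and y."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / (math.sqrt(sxx) * math.sqrt(syy))


# Toy data, made up for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
r_flipped = pearson_r(x, [-yi for yi in y])  # same strength, opposite direction
print(r, r ** 2, r_flipped)
```

Flipping the sign of $y$ flips the sign of $r$ but leaves $r^2$ unchanged, which is why $R^2$ alone cannot tell you the direction of the relationship.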


# **Multiple Linear Regression:**

## **Why Multiple Linear Regression?:**

**Multiple linear regression** models the relationship between several input variables $X_1, X_2, \dots, X_p$ and an output variable $Y$ using a linear equation:

$$
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon
$$

- $Y$: the response (dependent) variable  
- $X_1, X_2, \dots, X_p$: predictor (independent) variables  

---

### Model Parameters

- **$\beta_0$ (Intercept)**:  
  The expected value of $Y$ when all predictors are zero.

- **$\beta_j$ (Coefficient for $X_j$)**:  
  The change in the predicted value of $Y$ for a one-unit increase in $X_j$, holding all other predictors constant.

---

### Questions This Model Can Answer

1. **What is the expected change in $Y$ for a one-unit increase in $X_j$, controlling for other predictors?**  
   → Interpreted through $\beta_j$.

2. **What is the predicted value of $Y$ for a given combination of predictors?**  
   → Use the regression equation: $\hat{Y} = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$.

3. **Is there a statistically significant relationship between each predictor and $Y$?**  
   → Test whether each $\beta_j \ne 0$ using individual t-tests ($H_0: \beta_j = 0$).

4. **How well do all predictors together explain the variation in $Y$?**  
   → Measured using the coefficient of determination, $R^2$ (or adjusted $R^2$ for multiple predictors).

5. **What is the expected value of $Y$ when all $X_j = 0$?**  
   → Interpreted through $\beta_0$ (context-dependent).

6. **How much uncertainty is in our predictions?**  
   → Use confidence intervals for $\hat{Y}$ or prediction intervals for new observations.

7. **Is there a synergy or interaction between predictors?**  
   → Include and test interaction terms (e.g., $X_1 \cdot X_2$) in the model.


## **F Statistic**:

The **F-statistic** is used to test the **overall significance** of a linear regression model. It evaluates whether at least one predictor variable has a non-zero coefficient.

---

#### Purpose

While a **t-test** evaluates the significance of a **single predictor** ($H_0: \beta_j = 0$), the **F-test** answers:

> **Is the model better than a model with no predictors at all?**  
> That is, **does at least one $\beta_j \ne 0$?**

---

#### Hypotheses

- **Null hypothesis ($H_0$):** All slope coefficients are zero: $\beta_1 = \beta_2 = \dots = \beta_p = 0$
- **Alternative hypothesis ($H_1$):** At least one $\beta_j \ne 0$

---

#### Formula

$$
F = \frac{\text{Explained variation per predictor}}{\text{Unexplained variation per residual degree of freedom}} = \frac{(\text{TSS} - \text{RSS})/p}{\text{RSS}/(n - p - 1)}
$$

- $\text{TSS}$: Total Sum of Squares  
- $\text{RSS}$: Residual Sum of Squares  
- $p$: number of predictors  
- $n$: number of observations

---

#### Interpretation

- A **large F-value** indicates that the model explains a significant amount of variability in $Y$ compared to a null model (with only the intercept).
- The associated **p-value** tells you if this improvement is statistically significant.
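
The formula translates directly into code. The helper name `f_statistic` and the input values are assumptions for illustration.

```python
def f_statistic(tss, rss, n, p):
    """Overall F-statistic: ((TSS - RSS) / p) / (RSS / (n - p - 1))."""
    return ((tss - rss) / p) / (rss / (n - p - 1))


# Assumed values for a small simple regression (p = 1)
f = f_statistic(tss=6.0, rss=2.4, n=5, p=1)
print(f)
```

With a single predictor, the F-statistic equals the square of the slope's t-statistic, so the two tests agree in that special case.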

---

#### Why Use F Instead of t or z?

- In **multiple regression**, using **individual t-tests** for each predictor can miss the **combined effect** of variables.
- The **F-test** evaluates the model **as a whole** and is especially important when:
  - Testing overall model fit
  - Performing **ANOVA**
  - Comparing nested models

---

In summary, the F-statistic tests whether your model has **any predictive power at all**.

---


#### F-Statistic for a Subset of Predictors

The **F-statistic** can be used to test whether a **subset of predictors** contributes significantly to explaining the variation in $Y$, beyond what is explained by the rest of the model.

---

#### Use Case: Comparing Two Models

We compare:
- A **reduced model** (without certain predictors)
- A **full model** (with all predictors, including those being tested)

**Goal:** Determine if the additional predictors in the full model provide a **statistically significant improvement**.

---

#### Hypotheses

- **$H_0$ (null):** The additional predictors have **no effect**; their coefficients are all zero.
- **$H_1$ (alternative):** At least one of the added predictors has a **non-zero** effect.

---

#### F-Statistic for Partial Effect

$$
F = \frac{(RSS_{\text{reduced}} - RSS_{\text{full}}) / q}{RSS_{\text{full}} / (n - p - 1)}
$$

- $RSS_{\text{reduced}}$: RSS from the smaller model  
- $RSS_{\text{full}}$: RSS from the larger model  
- $q$: number of predictors added (difference in model size)  
- $p$: total number of predictors in the full model  
- $n$: number of observations

---

#### Interpretation

- A **large F-value** suggests that the subset of predictors being tested **improves the model fit** significantly.
- The corresponding **p-value** tells you whether this improvement is **statistically significant**.
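
A minimal sketch of the partial F-statistic. The function name `partial_f` and all numeric inputs are hypothetical values chosen for illustration.

```python
def partial_f(rss_reduced, rss_full, q, n, p):
    """F-statistic for q predictors added to a reduced model.

    p is the total number of predictors in the full model.
    """
    return ((rss_reduced - rss_full) / q) / (rss_full / (n - p - 1))


# Hypothetical comparison: full model has p = 3 predictors, 2 of them added
f_subset = partial_f(rss_reduced=520.0, rss_full=480.0, q=2, n=100, p=3)

# Special case: the reduced model is intercept-only, so its RSS equals TSS
# and the partial F reduces to the overall F-statistic.
f_overall = partial_f(rss_reduced=6.0, rss_full=2.4, q=1, n=5, p=1)
print(f_subset, f_overall)
```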

---

#### Why This Matters

- This form of the F-test reveals the **partial effect** of adding new variables.
- It helps determine whether adding variables is **justified**, or if they contribute **redundant or noise-driven information**.

---

This method is often used in **model selection**, **stepwise regression**, and **testing interaction effects**.


## **Variable Selection in Linear Regression:**

In multiple linear regression, we often have many potential predictors. **Variable selection** is the process of choosing a subset of those predictors that best explain the response variable $Y$.

---

#### Why Variable Selection Matters

- Including **too many predictors** can lead to **overfitting**, increased variance, and less interpretability.
- Including **irrelevant variables** can dilute the effect of important ones.
- Fewer, well-chosen predictors lead to **simpler, more robust, and more interpretable models**.

---

#### Why Not Try Every Possible Model?

- For $p$ predictors, there are $2^p$ possible subsets.  
  → For just 20 variables, that's over 1 million models.
- **Exhaustive search** is computationally infeasible for even moderately sized problems.
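
The growth in the number of candidate models is easy to see by enumerating subsets for a tiny case. The predictor names and the helper `all_subsets` are placeholders for illustration.

```python
from itertools import combinations


def all_subsets(predictors):
    """Return every subset of the predictors, including the empty one."""
    subsets = []
    for k in range(len(predictors) + 1):
        subsets.extend(combinations(predictors, k))
    return subsets


subsets = all_subsets(["x1", "x2", "x3"])
print(len(subsets))  # 2**3 = 8
print(2 ** 20)       # 1048576 candidate models for 20 predictors
```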

---

### Common Variable Selection Methods

#### 1. Forward Selection

- Start with no predictors.
- Add the predictor that improves model fit the most (e.g., lowest AIC/BIC, p-value, highest adjusted $R^2$).
- Repeat until adding predictors no longer improves the model significantly.

**Best for**: smaller datasets where starting from a minimal model makes sense.
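
The greedy loop behind forward selection can be sketched independently of how models are fitted. Here `score` is a hypothetical user-supplied function (lower is better, for example AIC or a validation error), and the predictor names and gain values are invented for the demonstration.

```python
def forward_selection(candidates, score, min_improvement=0.0):
    """Greedily add the predictor that lowers score the most.

    Stops when no remaining predictor improves the score by more
    than min_improvement.
    """
    selected = []
    remaining = list(candidates)
    best = score(selected)
    while remaining:
        # Score each model obtained by adding one more predictor
        trials = [(score(selected + [c]), c) for c in remaining]
        new_best, choice = min(trials)
        if best - new_best <= min_improvement:
            break
        selected.append(choice)
        remaining.remove(choice)
        best = new_best
    return selected


# Toy scoring function: each predictor lowers the score by a fixed amount
gains = {"tv": 6.0, "radio": 3.0, "newspaper": 0.0}


def toy_score(subset):
    return 10.0 - sum(gains[name] for name in subset)


chosen = forward_selection(gains.keys(), toy_score)
print(chosen)
```

In this toy setup the predictor with zero gain is never added, because it does not improve the score.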

---

#### 2. Backward Elimination

- Start with all predictors.
- Iteratively remove the least significant predictor (e.g., highest p-value).
- Stop when all remaining variables are statistically significant.

**Best for**: situations where $n > p$ and you suspect many irrelevant variables.

---

#### 3. Stepwise (Mixed) Selection

- Combines forward and backward steps.
- At each step, you can add or remove a variable based on a criterion (like AIC or p-value).

**Best for**: flexible balance between the two methods when you're unsure of variable importance.

---

### When to Use Each

| Method            | Use Case                                      |
|-------------------|-----------------------------------------------|
| Forward Selection | Few predictors expected to be relevant        |
| Backward Elimination | Many predictors, want to simplify            |
| Stepwise Selection| Need automation or compromise between both    |

---

### Real-World Applications of Variable Selection

1. **Healthcare**: Identify key risk factors for disease from dozens of clinical indicators.
2. **Finance**: Select relevant macroeconomic variables to predict asset returns.
3. **Marketing**: Choose customer attributes that drive purchasing behavior.
4. **Manufacturing**: Predict product quality based on process variables.
5. **Energy**: Forecast power consumption using selected environmental and usage features.

---

Variable selection improves **model performance**, **interpretability**, and **generalization** — making it essential in applied data science and decision-making.


## **Qualitative Predictors:**

Linear regression can include **categorical variables** like homeownership status by converting them into **dummy variables**. Let’s say we're modeling **average credit card balance ($Y$)** based on whether someone **owns a home**.

---

### 1. Binary (2-Level) Categorical Predictor: Homeownership

Suppose we have a variable `Homeowner` with two levels: **Yes** and **No**. We encode it as a dummy variable:

$$
X = 
\begin{cases}
1 & \text{if Homeowner = Yes} \\
0 & \text{if Homeowner = No}
\end{cases}
$$

The regression model becomes:

$$
\text{Balance} = \beta_0 + \beta_1 X + \varepsilon
$$

- **$\beta_0$**: the **average balance** for non-homeowners ($X = 0$)  
- **$\beta_1$**: the **difference** in average balance between homeowners and non-homeowners  
  → So, average balance for homeowners = $\beta_0 + \beta_1$
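
The interpretation of $\beta_0$ and $\beta_1$ as group means can be verified on a toy example. The balances below are made up, and `fit_line` is an illustrative helper implementing the usual least-squares formulas.

```python
def fit_line(x, y):
    """Return (b0, b1): the least-squares intercept and slope."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


# Made-up data: dummy variable (1 = homeowner) and card balance
homeowner = [0, 0, 0, 1, 1, 1]
balance = [500, 520, 540, 470, 490, 510]

b0, b1 = fit_line(homeowner, balance)

# Group means computed directly, for comparison
mean_no = sum(b for h, b in zip(homeowner, balance) if h == 0) / homeowner.count(0)
mean_yes = sum(b for h, b in zip(homeowner, balance) if h == 1) / homeowner.count(1)
print(b0, b1, mean_no, mean_yes)
```

The fitted intercept matches the non-homeowner mean, and the intercept plus slope matches the homeowner mean.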

---

### 2. Multi-Level Example: Housing Type

Suppose instead of just "Homeowner", we have a variable `Housing` with 3 categories: **Renter**, **Owner**, **Living with Parents**. We create 2 dummy variables:

- $X_1 = 1$ if Owner, 0 otherwise  
- $X_2 = 1$ if Living with Parents, 0 otherwise  
- Renter is the **baseline**

The model becomes:

$$
\text{Balance} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon
$$

- **$\beta_0$**: average balance for **Renters**  
- **$\beta_1$**: difference in average balance between **Owners** and Renters  
- **$\beta_2$**: difference in average balance between those **Living with Parents** and Renters
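
A sketch of the two-dummy encoding with Renter as the baseline. The function names and the coefficient values are hypothetical, chosen only to show how each group's expected balance is assembled.

```python
def encode_housing(housing):
    """Map a housing category to (x1, x2), with Renter as the baseline."""
    x1 = 1 if housing == "Owner" else 0
    x2 = 1 if housing == "Living with Parents" else 0
    return x1, x2


def expected_balance(housing, b0, b1, b2):
    """Expected balance under the two-dummy model."""
    x1, x2 = encode_housing(housing)
    return b0 + b1 * x1 + b2 * x2


# Hypothetical coefficients, for illustration only
b0, b1, b2 = 500.0, -40.0, -120.0

for group in ["Renter", "Owner", "Living with Parents"]:
    print(group, encode_housing(group), expected_balance(group, b0, b1, b2))
```

Three categories need only two dummies: the baseline group is represented by both dummies being zero.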

---

### Why This Matters

- Including categorical variables allows us to assess **how group membership affects the response**.
- The coefficients tell us the **average differences** in credit card balance **relative to a baseline group**.
- This helps identify financial behavior patterns across groups (e.g., homeowners may carry lower balances).

---

This is how we bring qualitative traits into a quantitative model.


# **Extending the Linear Model:**

## **Removing the Additive Assumption:**

Linear regression makes two key assumptions about the relationship between predictors and the response:

---

### 1. **Linearity** Assumption

The relationship between each predictor $X_j$ and the response $Y$ is **linear**:

$$
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon
$$

This means the change in $Y$ associated with a one-unit change in $X_j$ is **constant**, regardless of the current value of $X_j$.

---

### 2. **Additivity** Assumption

Each predictor’s effect is **added** to the total prediction:

- No interaction or synergy is assumed between predictors.
- For example, $\beta_1 X_1$ affects $Y$ the same way whether $X_2$ is low or high.

---

### Going Beyond Additivity: Modeling Synergy or Interference

In real-world data, predictors may **interact** — the effect of one depends on the level of another. This is common in:

- **Medicine**: two treatments may work better together (synergy) or cancel each other (interference)
- **Marketing**: a discount and an ad campaign may only work well **together**
- **Manufacturing**: pressure and temperature may jointly affect material strength

---

#### Capturing Interactions

To allow for non-additive effects, include **interaction terms**:

$$
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 (X_1 \cdot X_2) + \varepsilon
$$

- The term $\beta_3 (X_1 \cdot X_2)$ models the **interaction** between $X_1$ and $X_2$
- Now, the effect of $X_1$ on $Y$ **depends on the value of $X_2$**, and vice versa

---

### Summary

- **Linearity** assumes straight-line effects.
- **Additivity** assumes independent contributions.
- **Interaction terms** break additivity and allow the model to capture **synergies or interferences** between predictors.

This makes the model more flexible and often more realistic — at the cost of increased complexity.

---

#### Example: Predicting Productivity in Manufacturing

Suppose we want to model **productivity** (e.g., units produced per shift) in a factory. Two important predictors might be:

- $X_1$: **Number of workers on the line**
- $X_2$: **Number of manufacturing lines operating**

---

### Basic Linear Additive Model

We start with an additive linear model in both predictors:

$$
\text{Productivity} = \beta_0 + \beta_1 \cdot \text{Workers} + \beta_2 \cdot \text{Lines} + \varepsilon
$$

This assumes:
- Each **additional worker** increases productivity by $\beta_1$ units, regardless of how many lines are running.
- Each **additional line** increases productivity by $\beta_2$ units, regardless of how many workers there are.

**Limitation**: This ignores **synergy** — the fact that adding more workers may only help if more lines are active (and vice versa).

---

### Adding an Interaction Term (Synergy)

To capture this dependency, we introduce an interaction term:

$$
\text{Productivity} = \beta_0 + \beta_1 \cdot \text{Workers} + \beta_2 \cdot \text{Lines} + \beta_3 \cdot (\text{Workers} \cdot \text{Lines}) + \varepsilon
$$

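- $\beta_3$ captures the **synergy** between workers and lines: with the interaction term, each additional worker changes productivity by $\beta_1 + \beta_3 \cdot \text{Lines}$, so the benefit of adding workers depends on how many lines are running (and vice versa).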
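
To see how an interaction term changes the effect of one predictor, consider the sketch below. The coefficient values are hypothetical and the function `productivity` is an illustrative name.

```python
def productivity(workers, lines, b0=10.0, b1=2.0, b2=5.0, b3=1.5):
    """Predicted productivity with a Workers x Lines interaction term.

    The default coefficients are hypothetical, for illustration only.
    """
    return b0 + b1 * workers + b2 * lines + b3 * workers * lines


# Effect of adding one worker (4 -> 5) at different numbers of lines
gain_one_line = productivity(5, 1) - productivity(4, 1)     # b1 + b3 * 1
gain_three_lines = productivity(5, 3) - productivity(4, 3)  # b1 + b3 * 3
print(gain_one_line, gain_three_lines)
```

With a positive interaction coefficient, the same extra worker adds more output when more lines are running.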


## **Non-Linear Relationships:**

Linear regression assumes a **linear relationship** between each predictor and the response. But in many real-world cases, the effect of a variable on the outcome is **nonlinear**.

---

### Example: Predicting MPG with Horsepower

Suppose we want to predict a car’s **fuel efficiency (MPG)** using its **horsepower**:

**Initial model:**

$$
\text{MPG} = \beta_0 + \beta_1 \cdot \text{Horsepower} + \varepsilon
$$

This model assumes:
- Each additional unit of horsepower decreases MPG by a **constant amount** ($\beta_1$).
- The relationship is a **straight line**.

---

### The Problem: The Relationship May Be Curved

In reality:
- Going from 100 → 120 HP may reduce MPG slightly.
- Going from 300 → 320 HP might reduce MPG **much more**.

This suggests a **nonlinear** relationship — MPG drops **faster** at higher horsepower levels.

---

### Modeling Nonlinearity with Polynomial Terms

We can capture this curvature by adding a **squared term**:

$$
\text{MPG} = \beta_0 + \beta_1 \cdot \text{Horsepower} + \beta_2 \cdot \text{Horsepower}^2 + \varepsilon
$$

- This is still a **linear model** in terms of parameters (so we can use linear regression techniques).
- But it allows the predicted MPG curve to **bend**, downward in this example.
- $\beta_2$ controls the **curvature**: negative values bend the curve downward, positive values bend it upward.
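
Because the model is linear in its parameters, the quadratic fit is an ordinary least-squares problem with columns $1$, $x$, and $x^2$. The sketch below solves the normal equations in plain Python; the data are synthetic, generated from a known quadratic (with horsepower expressed in hundreds) so the recovered coefficients can be checked, and the function names are illustrative.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_quadratic(x, y):
    """Least-squares fit of y = b0 + b1*x + b2*x**2 via the normal equations."""
    rows = [[1.0, xi, xi ** 2] for xi in x]  # design matrix columns: 1, x, x^2
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    return solve_linear_system(xtx, xty)


# Synthetic data from a known curve: mpg = 40 - 5*hp - 2*hp**2
hp = [0.5, 1.0, 1.5, 2.0, 2.5]  # horsepower in hundreds
mpg = [40 - 5 * h - 2 * h ** 2 for h in hp]

b0, b1, b2 = fit_quadratic(hp, mpg)
print(b0, b1, b2)
```

The fit recovers the generating coefficients up to rounding, and the negative $\beta_2$ corresponds to MPG falling faster at higher horsepower.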

---

### Why This Helps

- More flexible models can capture realistic behavior.
- Polynomial terms are a simple way to model nonlinearity **without switching to complex models**.

This is a common technique in regression diagnostics and model refinement.


# **Potential Problems with Linear Regression:**

## **Non-Linearity of the Data:**

## **Correlation of Error Terms:**

## **Non-Constant Variance of Error Terms:**

## **Outliers:**

## **High-Leverage Points:**

## **Collinearity:**