# Week 7 Overview
This week, we review the fundamentals of simple linear regression, including how to calculate the slope and interpret the intercept. We explore the differences between error terms and residuals and discuss the significance of confounders in regression analysis. Additionally, we examine key statistical measures used to assess the accuracy of regression models. There is a lot of content and reading this week. We will focus primarily on the material through Lesson 2.3.

## Learning Objectives
At the end of this week, you will be able to: 
- Explain the difference between error terms and residuals in regression 
- Identify potential confounders in a regression model 
- Explain the common assumptions for a regression model 
- Interpret statistical measures to assess regression model accuracy 
- Compute slope in simple regression and distinguish between error terms and residuals for accurate modeling 

## Topic Overview: Regression Concepts and Error Terms 
In this topic, we cover how the slope in a simple regression situation is computed, with the estimate of the slope being the covariance of $X$ and $Y$ divided by the variance of $X$. We also discuss the distinction between error terms and residuals, the relationship between the true regression model and noise ($\varepsilon$), and the importance of ensuring that the error term is uncorrelated with $X$ for an accurate model. 

### Learning Objectives
- Explain the difference between error terms and residuals in regression 
- Identify potential confounders in a regression model 

## Understanding Error in Regression

### How We Compute the Slope in a Simple Regression Situation 

With one predictor $X$, the estimate of the slope is the covariance of $X$ and $Y$,

$$ E_{sample} \big( (X - \langle X \rangle)(Y - \langle Y \rangle) \big), $$

divided by the variance of $X$,

$$ E_{sample} \big( (X - \langle X \rangle)^2 \big). $$
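
As a rough illustration of the covariance-over-variance recipe, the sketch below computes the slope in plain Python. The data are made up, the function name `ols_slope_intercept` is hypothetical, and the intercept formula (mean of $Y$ minus slope times mean of $X$) is the standard least-squares result rather than something stated above.

```python
def mean(values):
    return sum(values) / len(values)

def ols_slope_intercept(x, y):
    """Return (slope, intercept) of the least-squares line y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    # Sample covariance of X and Y.
    cov_xy = mean([(xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)])
    # Sample variance of X.
    var_x = mean([(xi - x_bar) ** 2 for xi in x])
    slope = cov_xy / var_x
    intercept = y_bar - slope * x_bar   # standard OLS intercept (assumption, see lead-in)
    return slope, intercept

# Made-up data lying exactly on y = 1 + 2x.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
slope, intercept = ols_slope_intercept(x, y)
print(slope, intercept)   # slope 2, intercept 1
```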

### Error Terms
If we subtract the predicted value $(\hat{Y})$ from the actual outcome $Y$, we get the **residual**. 

There is a difference between the **error** and the **residual**. In regression, the "true" relationship between $X$ and $Y$ is:

$$ Y \; =  \; \beta_0 + \beta_1 X + \varepsilon $$ 

Where:

$\varepsilon$ is some noise; it is a random variable. 

We could write $\varepsilon (X = x)$, meaning that epsilon can be conditioned on a particular value of $X$. If there is another variable $Z$, then $\varepsilon(Z = z)$ is also valid. 

That is, perhaps at $X = 0, \varepsilon (X = 0)$ is a random normal distribution with mean 0 and a standard deviation 1, but at $X = 1, \varepsilon (X = 1)$ is a random normal distribution with mean 0 and standard deviation 2. 

On the other hand, $\varepsilon(Z = 0)$ may be something totally different. 

We do, however, require that the mean of $\varepsilon(X = x)$ is 0 no matter the value of $X$. Otherwise, we'd say that $X$ is correlated with the error term. 

If we predict $Y$ using $\beta_0 + \beta_1 X$, then $\varepsilon$ is the error term, which is similar to the residual. However, they are not always the same. For example, if we get the $\beta_1$ coefficient wrong and predict $\beta_0 + \beta_1' X$, then the residual is $(\beta_1 - \beta_1') X + \varepsilon$.
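
A small simulation can make the gap between error and residual concrete. This is only a sketch: the "true" coefficients, the wrong slope `beta1_wrong`, and the noise level are all invented for illustration, and in real data the error term is never observed directly.

```python
import random

rng = random.Random(0)        # fixed seed so the run is repeatable
beta0, beta1 = 1.0, 2.0       # assumed "true" coefficients (illustrative)
beta1_wrong = 1.5             # a deliberately wrong slope

x = [float(i) for i in range(10)]
eps = [rng.gauss(0.0, 1.0) for _ in x]   # error term (unobservable in practice)
y = [beta0 + beta1 * xi + e for xi, e in zip(x, eps)]

# Residuals computed with the true slope reproduce the error term.
resid_true = [yi - (beta0 + beta1 * xi) for xi, yi in zip(x, y)]
# Residuals computed with the wrong slope pick up an extra (beta1 - beta1_wrong) * x.
resid_wrong = [yi - (beta0 + beta1_wrong * xi) for xi, yi in zip(x, y)]
expected = [(beta1 - beta1_wrong) * xi + e for xi, e in zip(x, eps)]

max_gap = max(abs(r - ex) for r, ex in zip(resid_wrong, expected))
print(max_gap)   # floating-point noise only
```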

Since $\varepsilon$ is a random variable, and epsilon is part of $Y$, it follows that $Y$ is a random variable too - it does not uniquely depend on $X$. 

We could think of the error as "random", but presumably, any error depends on something, even if it depends on a butterfly flapping its wings. (There is an idea in chaos theory that the weather can change when a butterfly flaps its wings.) 

So, perhaps $\varepsilon (X = x)$ is a random variable, but $\varepsilon (X = x, \text{butterfly} = \text{True})$ is deterministic. Since we don't know what epsilon depends on, apart from $X$ or other variables in our model, we can't make epsilon deterministic, even if we condition on all variables in our model. 

Is it even meaningful to say that the "true" relationship is $Y = \beta_0 + \beta_1 X + \varepsilon$ ? 

What if we just define the relationship to be $Y = \beta_0 + \beta_1' X + \varepsilon'$ for some other $\beta_1'$, and we say that $\varepsilon' = (\beta_1 - \beta_1') X + \varepsilon$?

Is there any reason to say that one value of $\beta_1$ is the true one and the other is not? 

Yes, as mentioned above, we would like $X$ to not be correlated with the error term, which is to say that at a given value of $X$, the expected value of $\varepsilon(X = x)$ should be zero. (It should be a constant, but if that constant isn't zero, we can just adjust $\beta_0$ to make the constant 0.)

Now, if $\varepsilon(X = x)$ has an expectation value of 0 at all values of $X$, is it possible that 

$( \beta_0 - \beta_0') + (\beta_1 - \beta_1')X + \varepsilon(X = x)$ also has an expectation value of 0 at all values of $X$ for some different coefficients $\beta_0'$ and $\beta_1'$?

This is only possible if $X$ is constant, which seems unlikely and would make the whole situation trivial and boring. This suggests that there is only one correct value of $\beta_0$ and of $\beta_1$ that makes the noise uncorrelated with $X$.

## 1.2 Lesson: Confounders and Error Patterns

### The effect of a confounder

If $Y = \beta_0 + \beta_1 X + \varepsilon_Y$, then what happens if we are missing a confounder, $W$? 

In this case things really depend on $W$, not just $X$, so the real relationship is:

$Y = \beta_0 + \beta_1 X + \beta_2W + \varepsilon_{Y'}$

and

$\varepsilon_Y = \beta_2W + \varepsilon_{Y'}$

Even if $\varepsilon_{Y'} (X = x)$ has a mean that is constant in $X$, since $X$ and $W$ are causally related and correlated, it is likely that $\beta_2 W$ has a mean that varies with $X$, and therefore $\varepsilon_Y (X = x)$ also has a mean that varies with $X$. 

Therefore, if we use the simpler equation, our regression does not really work the way we want. We wanted $\varepsilon_Y$ to have an expected value that is independent of $X$, but that is not the case. 

___


### Why Confounders Need to Be Included: Example

In this example, we will see why confounders need to be included in a regression (we have to control for them) while non-confounders need not be included.

If
$Y = \beta_0 + \beta_1' X + \beta_2' Z + \varepsilon_Y$
and $X$ and $Z$ are not correlated (in which case they likely are not causally related either), then if we regress
$Y = \beta_0 + \beta_1 X$,
we should get $\beta_1 = \beta_1'$ approximately. That is, we can just treat $\beta_2' Z + \varepsilon_Y$ as the noise term of the simpler regression; it is still uncorrelated with $X$ and therefore does not alter the coefficient on $X$ on average.

One possible reason we would need to regress on the confounder (say $Z$ is a confounder) is that $Z$ causes both $X$ and $Y$. If we fail to regress on $Z$ in that case, then consider the following example. Say $Y = X - Z$. This equation shows that $Z$ is a cause of $Y$.

| $Y$ | $X$ | $Z$ |
| :--- | :--- | :--- |
| 0 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 2 | 1 |
| 2 | 3 | 1 |

Regressing $Y$ on $X$ alone will not produce the correct coefficient of 1; it will produce a smaller coefficient (the least-squares slope for this table is 0.6; roughly, $Y$ increases by 2 while $X$ increases by 3).
Since in this example $X$ and $Y$ are both partly caused by $Z$ ($Z$ increases $X$ and decreases $Y$), failing to control for $Z$ means underestimating the effect of $X$ on $Y$.

(By omitting $Z$, we miss the fact that the $Y$s are lower than expected for large $X$, not because the effect of $X$ is small, but because the large $X$ is caused by a large $Z$, and that large $Z$ decreases $Y$.)

⸻

On the other hand, suppose $Y = X - Z$ but $X$ and $Z$ are not correlated; perhaps $Z$ causes $Y$ but not $X$:

| $Y$ | $X$ | $Z$ |
| :--- | :--- | :--- |
| -1 | 0 | 1 |
| 1 | 1 | 0 |
| 2 | 2 | 0 |
| 2 | 3 | 1 |

In this case, the coefficient of $X$ will be approximately correct (equal to 1) even if $Z$ is omitted from the analysis. $Z$ is no longer a confounder since it does not affect $X$, only $Y$. This possibility was mentioned above, too.

Finally, suppose that $Z$ causes $X$, which then causes $Y$, but $Z$ doesn’t cause $Y$ directly. In that case, $Z$ still is not a confounder, and the correct equation for $Y$ is just $Y = X$, so, again, $Z$ need not be controlled for.

⸻

Conclusion:
This shows how regression on a confounder is necessary to discern the true effect, but if $Z$ is not a confounder (it does not cause both $X$ and $Y$), then it need not be included in the regression.
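
The two tables above can be checked numerically. The sketch below uses the closed-form least-squares expressions for one and two predictors, written in terms of covariances; the helper names are hypothetical and the two-predictor formula is the standard textbook result, not something derived in these notes.

```python
def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    a_bar, b_bar = mean(a), mean(b)
    return mean([(ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b)])

def slope_simple(x, y):
    """Slope from regressing y on x alone."""
    return cov(x, y) / cov(x, x)

def slopes_two_predictors(x, z, y):
    """Coefficients on x and z from regressing y on both (with an intercept)."""
    denom = cov(x, x) * cov(z, z) - cov(x, z) ** 2
    b_x = (cov(z, z) * cov(x, y) - cov(x, z) * cov(z, y)) / denom
    b_z = (cov(x, x) * cov(z, y) - cov(x, z) * cov(x, y)) / denom
    return b_x, b_z

# First table: Z is correlated with X, so Z is a confounder.
x1, z1, y1 = [0, 1, 2, 3], [0, 0, 1, 1], [0, 1, 1, 2]
# Second table: Z is uncorrelated with X.
x2, z2, y2 = [0, 1, 2, 3], [1, 0, 0, 1], [-1, 1, 2, 2]

print(slope_simple(x1, y1))               # about 0.6: biased when Z is omitted
print(slopes_two_predictors(x1, z1, y1))  # about (1, -1): controlling for Z recovers Y = X - Z
print(cov(x2, z2))                        # 0: X and Z are uncorrelated
print(slope_simple(x2, y2))               # about 1: omitting Z does no harm here
```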


### Controlling for a Variable

When a variable $W$ is a confounder, we need to control for $W$:

$ Y = \beta_0 + \beta_1 X + \beta_2 W + \varepsilon_Y$

If we have now included all the confounders, then $\varepsilon_Y$ is not correlated with $X$ or $W$. (The expected value of epsilon at a given value of $X$ and $W$ is zero.)

Then, for a specific value of $W = w$,

$Y = \beta_0 + \beta_1 X + \beta_2 w + \varepsilon_Y(W = w)$

Once this is done, $\varepsilon_Y(W = w)$ is not correlated with $X$, so that $\varepsilon_Y(X = x, W = w)$ has a mean that doesn’t depend on $x$ — regardless of the value of $w$. This implies that when $X$ increases by 1, $Y$ increases by (on average) $\beta_1$.

Since we’ve controlled for $W$, this is true when we hold $W$ constant and change $X$ alone — so it’s the causal relationship between $X$ and $Y$ when $W$ is constant. This is all it means to control for $W$.

If there are still other confounders not listed in the regression equation, then $X$ must still be correlated with the error term.

⸻

If we have an interaction term, like

$Y = \beta_0 + \beta_1 X + \beta_2 W + \beta_3 X \cdot W + \varepsilon$ 

then at fixed $W = w$ there will be some relationship between $Y$ and $X$, but it will be a different relationship at different values of $W$.
So, we’d have:

$Y = \beta_0 + \beta_1 X + \text{constant} + \beta_3 X \cdot w + \varepsilon$

and the slope of the relationship will be $\beta_1 + \beta_3 \cdot w$.

So, if we just toss out all data where $W$ is not equal to a specific value $w$, then there will be a coefficient $\beta_1 + \beta_3 \cdot w$.
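
To see the slope $\beta_1 + \beta_3 \cdot w$ emerge from a subset of the data, here is a minimal sketch. The coefficient values are invented, and the noise term is left out so that the fitted slopes match the formula exactly.

```python
def mean(v):
    return sum(v) / len(v)

def slope_simple(x, y):
    x_bar, y_bar = mean(x), mean(y)
    num = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    den = sum((xi - x_bar) ** 2 for xi in x)
    return num / den

beta0, beta1, beta2, beta3 = 1.0, 2.0, 0.5, 3.0   # illustrative values only

def outcome(x, w):
    # Interaction model with the noise term omitted for clarity.
    return beta0 + beta1 * x + beta2 * w + beta3 * x * w

data = [(float(x), float(w)) for w in (0, 1, 2) for x in range(5)]

def slope_at(w_fixed):
    """Keep only rows with W == w_fixed, then regress Y on X."""
    xs = [x for x, w in data if w == w_fixed]
    ys = [outcome(x, w_fixed) for x in xs]
    return slope_simple(xs, ys)

for w in (0.0, 1.0, 2.0):
    print(w, slope_at(w), beta1 + beta3 * w)   # the last two columns agree
```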

⸻

When $X$ is correlated with the error term in a nonlinear way, the model will be misspecified, regardless of how you choose the coefficients and the corresponding error term.

For example, if:

$Y = \beta_0 + \beta_1 X + \beta_2 X^2$

Then, suppose $\varepsilon_Y = \beta_2 X^2$ — we are trying to absorb a quadratic term into the error term. This error term is correlated with $X$, which is not ideal. Can we fix the problem by redefining the error term? We could write:

$Y = \beta_0 + \beta_1' X + \left(\beta_2 X^2 + (\beta_1 - \beta_1') X \right)$

Because of the nonlinear relationship, no choice of $\beta_1'$ will fix the problem. The error term is correlated with $X$ in either case, in the sense used earlier: its mean at a given value of $X$ still varies with $X$. We can choose the $\beta_1'$ that produces the best fit, resulting in some choice of $\varepsilon_Y$, but there is no way to make the mean of epsilon independent of $X$, regardless of how we choose $\beta_1'$.
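
A tiny numerical sketch of this misspecification: we fit a straight line to data generated from $Y = X^2$ (an assumed special case with $\beta_0 = \beta_1 = 0$, $\beta_2 = 1$, and no noise) and look at what is left over.

```python
def mean(values):
    return sum(values) / len(values)

x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [xi ** 2 for xi in x]   # assumed true relationship: Y = X^2, no noise

x_bar, y_bar = mean(x), mean(y)
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar

# What the best-fitting straight line leaves unexplained at each x.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
print(slope, intercept)   # 4.0 -2.0
print(residuals)          # [2.0, -1.0, -2.0, -1.0, 2.0]
```

Even for the best-fitting line, the leftover term is positive at the ends and negative in the middle, so its mean clearly depends on $X$.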


## Topic Overview: Statistical Significance and Model Evaluation
In this topic, we explore the assumption that regression coefficients follow a normal distribution, which allows us to estimate standard errors and perform hypothesis testing. Key conditions for normality include ensuring the error term is normally distributed with constant variance and that error terms across samples are independent. 

### Learning Objectives 
- Explain the common assumptions for a regression model 
- Interpret statistical measures to assess regression model accuracy 

### 2.1 Lesson: Regression Coefficients

#### Regression coefficients follow a normal distribution.  
Suppose we take different samples from the same population. The estimated value of the coefficients will have some random distribution. We want to assume that these coefficients are normally distributed so that we can estimate standard errors. To make this work, we also have to estimate the mean and standard deviation of the coefficients. 

If the coefficients are to be normally distributed, we need two conditions: 
1. The error term $\varepsilon$ is normally distributed with a fixed distribution. That is, it has a fixed mean (0) independent of $X$, and it has a fixed standard deviation ($\sigma$) that is independent of $X$. (If the variance is not fixed, it’s called heteroskedasticity. This can make it harder to compute the standard errors.) 
2. The error term of one sample is independent of another sample. That is, the two error terms are uncorrelated. When would they be correlated? Suppose that in this dataset, two samples that are close together in time are likely to have similar errors. So (if the samples are taken at particular times), then perhaps the 10am and 11am samples for the same day are likely to be both high or both low. We want this not to happen in order for estimated coefficients to be normally distributed. 

With one predictor, $X$, we have $Y = \beta_0 + \beta_1 X + \varepsilon$ and then an estimate $\widehat{\beta}_1$,  where the $\widehat{\beta}_1$ estimate has a normal distribution with a mean equal to the true value, $\beta_1$. What is its standard deviation?  

To know this, we need $\sigma$ (the standard deviation of the error term $\varepsilon$), $\text{Var} (X)$ (the variance of $X$), and $n$ (the number of observations in the data). 

The standard deviation of $\widehat{\beta}_1$ is then $\sqrt{\frac{\sigma^2}{\text{Var} (X) \cdot n}}$ 

Since we do not have access to the error term epsilon, we can estimate sigma using the residual, which is the actual $Y$ minus the estimated $Y$ in the training data. 

The residual and the error term are similar so long as we estimate $\beta_1$ approximately correctly. 

With multiple predictors, $Y = \beta_0 + \beta_1 X + \beta_2 Z + \varepsilon_Y$, there’s a different formula for the standard error of $\widehat{\beta}_1$ and $\widehat{\beta}_2$ that depends on the covariance matrix of $X$ and $Z$, but otherwise it’s similar — $\widehat{\beta}_1$ and $\widehat{\beta}_2$ still follow normal distributions. 

To make the estimate $\widehat{\beta}_1$ more precise, we would have to change $\sigma^2$, $\text{Var} (X)$, or $n$: 

1. The way to improve sigma is to reduce the variance of the error term. This would involve predicting $Y$ more accurately. We might have to add more variables to the model in order to achieve this. 
2. We are stuck with $\text{Var} (X)$ unless we are willing to change the feature $X$ that we are trying to model. 
3. We can, of course, do better if we increase the number of observations $n$. 

The standard deviation of some estimated coefficient $\widehat{\beta}_1$, which might take the form $\sqrt{\frac{\sigma^2}{\text{Var} (X) \cdot n}}$ in the one predictor case, is called the “standard error.” If you run linear regression in Python, you should be able to see the standard errors.
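
As a hedged sketch of the one-predictor formula, the code below estimates $\sigma^2$ from the residuals and plugs it into $\sqrt{\sigma^2 / (\text{Var}(X) \cdot n)}$. Dividing the residual sum of squares by $n - 2$ is a common convention that is assumed here, and the simulated data and the function name `fit_with_standard_error` are made up. Regression libraries such as statsmodels report the same quantity in their output.

```python
import math
import random

def mean(values):
    return sum(values) / len(values)

def fit_with_standard_error(x, y):
    """Return (slope estimate, its standard error) for a one-predictor regression."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    var_x = mean([(xi - x_bar) ** 2 for xi in x])
    cov_xy = mean([(xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)])
    b1 = cov_xy / var_x
    b0 = y_bar - b1 * x_bar
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    # Estimate sigma^2 from the residuals; n - 2 accounts for the two
    # estimated coefficients (assumed convention).
    sigma2_hat = sum(r * r for r in residuals) / (n - 2)
    se_b1 = math.sqrt(sigma2_hat / (var_x * n))
    return b1, se_b1

rng = random.Random(42)
x = [i / 10 for i in range(200)]
y = [1.0 + 2.0 * xi + rng.gauss(0.0, 1.0) for xi in x]   # true slope 2, sigma 1
b1, se_b1 = fit_with_standard_error(x, y)
print(b1, se_b1)
```

Rerunning with more observations or a wider spread of `x` values should shrink `se_b1`, in line with the three levers listed above.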

### 2.2 Lesson: Hypothesis Testing and Statistical Significance


#### Hypothesis Testing
We can use statistics to determine how likely it is that our estimated value of $\widehat{\beta}_1$ (or something more extreme in magnitude) would be produced on the null hypothesis that $\beta_1$ is actually equal to zero. 

Frequentist statistics gives us a **p-value** that is equal to this likelihood: 
- If the **p-value** is greater than some $\alpha$ (say, 0.05), then we assume it is possible that $\beta_1$ is actually equal to zero. 
- This means that there might be no causal relationship between $X$ and $Y$. We might still have evidence of a correlation between $X$ and $Y$, but the correlation could be caused by some confounder (say, $W$), which we have controlled for. 

The **t-statistic** is the estimated coefficient value $\widehat{\beta}_1$ divided by its standard error $\text {se} \left( \widehat{\beta}_1 \right)$. 
- If the t-statistic is below -1.96 or above 1.96, then it indicates statistical significance at the p = 0.05 level. (Assuming a large sample — with a small sample, the t-statistic may behave somewhat differently.) 
- This means that there is likely a nonzero relationship, but it doesn’t mean that the specific value $\widehat{\beta}_1$ is totally accurate. To know how accurate $\widehat{\beta}_1$ might be, we’d need a confidence interval on $\widehat{\beta}_1$.  
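
The sketch below turns an estimate and its standard error into a t-statistic, a two-sided p-value, and a confidence interval. It relies on the large-sample normal approximation mentioned above (so it is not an exact small-sample t-test), the interval $\widehat{\beta}_1 \pm 1.96 \cdot \text{se}$ is the usual textbook form, and the input numbers are invented.

```python
from statistics import NormalDist

def z_test(beta_hat, se, alpha=0.05):
    """Large-sample test of the null hypothesis beta = 0."""
    t_stat = beta_hat / se
    std_normal = NormalDist()   # mean 0, standard deviation 1
    # Probability of a statistic at least this extreme in magnitude under the null.
    p_value = 2 * (1 - std_normal.cdf(abs(t_stat)))
    z_crit = std_normal.inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    ci = (beta_hat - z_crit * se, beta_hat + z_crit * se)
    return t_stat, p_value, ci

# Hypothetical estimate and standard error.
t_stat, p_value, ci = z_test(beta_hat=0.5, se=0.2)
print(t_stat)    # 2.5, beyond the 1.96 threshold
print(p_value)   # roughly 0.012
print(ci)        # roughly (0.11, 0.89)
```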

Some comments on statistical significance: 
- Just because an estimate isn’t significant doesn’t mean that the true $\beta_1$ value is really zero. It just means that if it were zero, we might see the very results we are seeing. 
    - Imagine, for instance, that a patient has a pattern of spots on his arm. 
    - The patient is worried that it is something serious, but the doctor tells them that irritation due to clothing can cause this sort of rash. 
    - This is all the frequentist test is saying: The symptom (the estimate $\widehat{\beta}_1$) could be caused by nothing serious (the true $\beta_1$ value is $0$). 
- Do not change the analysis to try to improve p-values. If we check 20 arms that truly have no serious rash (just clothing irritation), we will eventually come across one that looks serious, just by chance. (It’s a particularly bad case of clothing irritation.) So, if we go out seeking confirmation, we will eventually find it, even if it’s not real. 
- A p-value is not an effect size. That is, just because we know that a rash isn’t skin irritation doesn’t mean we know for sure what it is. It could still be a serious problem or not serious. Likewise, if our statistical test tells us that a particular medication works, we still don’t know how well it works. A medication that lowers blood pressure by 0.01 is “working” — but is that a strong enough effect to matter? If we have a study with 1 million patients, we might be able to detect such a small effect — it might have a very significant p-value yet not be very useful.

#### Statistics
The $R^2$ value shows that if we predict $Y$ by this model, then the residual has $1 - R^2$ of the variance of the observed $Y$ values.

**Adjusted $R^2$** is the same idea, but it adjusts for the number of variables in the model. 
- After all, just adding a totally arbitrary variable, even if it has no relationship with $Y$, could appear to help us predict $Y$ because it has some random correlations with $Y$. No matter the values of $X$, some expression $A + BX$ is *guaranteed* to be able to exactly match at least two samples, which might then appear to partly explain $Y$. 
- Therefore, it might be more accurate to assume a weaker (adjusted) correlation than the calculated $R^2$ value. 

The **F-statistic** does a hypothesis test where the null is that *all* non-intercept coefficients are zero. It's unlikely that your model is *that* wrong, so the F-statistic has limited value. However, if it proves insignificant, there is a problem.

The **RMSE** (Root Mean Square Error) compares the predicted values to the actual values of each sample. The degrees of freedom matter here: this is the number of observations ($N$) minus the number of coefficients in the model ($P$), so:

$\text{dof} \; = \; N - P$

Again, more coefficients are guaranteed to improve the prediction even if the variables have arbitrary values. So, we might choose that, when taking the average in the RMSE, we will divide by $N - P$ instead of $N$. Thus:

$$\text{RMSE (adjusted)} \; = \; \sqrt{\frac{\text{sum of squares of residuals}}{N - P}} \; \text{instead of}  \; \sqrt{\frac{\text{sum of squares of residuals}}{N}}$$ 
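
Here is a minimal sketch of these fit statistics on made-up predictions. It assumes that $P$ counts every estimated coefficient including the intercept, and it uses the common adjusted $R^2$ form $1 - (1 - R^2)\frac{N - 1}{N - P}$; neither convention is spelled out in the text above, so treat both as assumptions.

```python
import math

def fit_statistics(y_actual, y_predicted, n_coefficients):
    n = len(y_actual)
    p = n_coefficients   # assumed to include the intercept
    y_bar = sum(y_actual) / n
    ss_res = sum((a - f) ** 2 for a, f in zip(y_actual, y_predicted))
    ss_tot = sum((a - y_bar) ** 2 for a in y_actual)
    r2 = 1 - ss_res / ss_tot
    return {
        "r2": r2,
        "adjusted_r2": 1 - (1 - r2) * (n - 1) / (n - p),
        "rmse": math.sqrt(ss_res / n),
        "rmse_adjusted": math.sqrt(ss_res / (n - p)),   # divides by the degrees of freedom
    }

# Made-up actual values and predictions from a hypothetical two-coefficient model.
stats = fit_statistics([1, 2, 3, 4, 5], [1.2, 1.9, 3.1, 3.8, 5.0], n_coefficients=2)
print(stats)
```

With these numbers the adjusted versions are slightly more pessimistic than the raw ones, which is the intended effect of the correction.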

What does $\beta_1$ mean? It simply means that if the other variables are held constant, then a one-unit change in $X$ will correspond to (on average) a $\beta_1$ change in $Y$. 

### 2.3 Lesson: Subscripts, Polynomial Models, and Variable Transformations

#### Subscripts in Regression Equations
We sometimes use subscripts (e.g., $Y_i$ means the outcome $Y$ of sample $i$). 

Thus, instead of:

$Y \; = \; \beta_0 + \beta_1 X + \varepsilon_Y$

We could write

$Y_i \; = \; \beta_0 + \beta_1 X_i + \varepsilon_{Y, i}$

If the samples were taken at a sequence of times, then "$i$" refers to the time sample. So we might use $t$ instead of $i$:

$Y_t \; = \; \beta_0 + \beta_1X_{t - 1} + \varepsilon_{Y, t}$

Note that we would likely not relate $Y_t$ to $X_t$ because if we want to predict the future, we want to use the data at time $t-1$ to predict the outcome at time $t$. By the time we get access to samples $X_t$, outcome $Y_t$ has already happened; we no longer need to predict it.

We could also have $X_{it}$ meaning individual $i$ at time $t$. 

**To relate this to causal diagrams:**
- We should control for all variables that we must control for in order to close backdoors, and we should not control for anything that we must not control for due to colliders. 
- We are free to control for anything else that's related to $Y$; if we do, it will likely explain more of $Y$ and reduce standard errors. The coefficients will be more accurate.
- We can also consider adding interaction terms. In principle, any number of polynomial or interaction terms could be included; we could also have $X \times Z$, $X^2$, $X \times Z^2$, etc. In practice, we want to be careful about this to avoid overfitting; only include polynomial terms if our domain knowledge suggests that they might be relevant.

If we have a categorical variable with $N$ values, we can model this by taking $N - 1$ binary variables: 
- If the $N = 3$ values of the categorical variable $X$ are $A$, $B$, and $C$, then we could model this using $N - 1 = 2$ binary variables, $X_A$ and $X_B$. 
    - $X_A$ is 1 if $X = A$; otherwise it's 0.
    - $X_B$ is 1 if $X = B$; otherwise it's 0.
    - If $X = C$, then $X_A = X_B = 0$. (It should never be the case that $X_A = X_B = 1$)
- If you want to assess the significance of $X_A$ and $X_B$, you have to then use a joint F-test for all variables ($X_A$ and $X_B$ in this case) because the significance of $X_A$ by itself is not meaningful. 
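
The $N - 1$ binary variables described above can be built with a few lines of plain Python. The helper `dummy_code` is hypothetical, and the choice of which level to drop (the reference level, here $C$) is up to the analyst.

```python
def dummy_code(values, reference):
    """Encode a categorical variable as N - 1 binary columns, dropping `reference`."""
    levels = sorted(set(values))
    kept = [lv for lv in levels if lv != reference]
    return [{f"X_{lv}": int(v == lv) for lv in kept} for v in values]

rows = dummy_code(["A", "B", "C", "A"], reference="C")
for row in rows:
    print(row)
# {'X_A': 1, 'X_B': 0}
# {'X_A': 0, 'X_B': 1}
# {'X_A': 0, 'X_B': 0}   <- the reference level C
# {'X_A': 1, 'X_B': 0}
```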

#### Polynomial Models
In a polynomial model: $Y \; = \; \beta_0 + \beta_1X + \beta_2 X^2 + \beta_3 X ^3 + \varepsilon$

**How do we interpret the above coefficients?**

If $X$ is changed by a small amount $dX$ (holding constant other variables, if there were any), then the change in $Y$ would equal the derivative of the polynomial expression, multiplied by $dX$. In this case we would have: 

$dY \; = \; \left( \beta_1 + 2\beta_2X + 3\beta_3X^2 \right)dX + d\varepsilon$

The mean of the $\varepsilon$ term is independent of $X$, so when $X$ increases by $dX$, the mean of $\varepsilon$ does not change, and the average $dY$ contributed by that term ($d\varepsilon$) is therefore zero.

**Why would we not always add unlimited numbers of polynomial terms?** 
1. First, these terms may not help. 
2. Second, they increase overfitting.

Apart from using domain knowledge to decide whether to add polynomial terms, we can:
- Graph the data visually, and plot the residuals, ($Y_\text{actuals} - Y_\text{predicted}$). If the residuals show a nonlinear relationship, it suggests that polynomial terms are desirable.
- If you include terms beyond cubic (in other words, including terms to the fourth or fifth powers) you are probably going too far. 
- You could use a statistical significance test to see if the linear terms alone are enough. However, if you then do another significance test on the polynomial version, you are “p-hacking” by doing multiple significance tests. You are not supposed to do multiple significance tests in order to improve your fit. 
- You can use a LASSO regression. However, this also does not make it easy for us to do a significance test. 

#### Variable Transformations
You can always transform the variables — either predictors, outcome, or both — before you do the regression. That is, if you want to use $\ln(X)$ as your predictor and/or use $\ln(Y)$, no problem:

$\ln(Y) \; = \; \beta_0 + \beta_1 \ln(X) + \varepsilon_Y$

This would mean that a $1\%$ increase in $X$ leads to approximately a $\beta_1 \times 1\%$ change in $Y$. In other words:

$Y \; = \; e^{\beta_0} \cdot X^{\beta_1} \cdot e^{\varepsilon_Y}$

This suggests a relationship between $Y$ and $X$ in which $\beta_1$ acts as an exponent on $X$ and the noise is multiplicative rather than additive.

Using $\ln(X)$ instead of $X$ also means that outliers will not have as much impact because their values will be smaller and close to the other values: 
- If the data has values $X = 1$, $X = 3$, $X = 5$, and $X = 100$, then taking the logarithms gives $\ln(X) = 0$, $\ln(X) = 1.1$, $\ln(X) = 1.6$, $\ln(X) = 4.6$. The last value (the outlier) is now closer to the others. 

Other transformations include:
- $\ln(X + 1)$ (useful if X can equal zero)
- $\sqrt{X}$
- $\text{asinh}(X) = \ln (X + \sqrt{1 + X^2})$ (the inverse hyperbolic sine). This accepts any number, even negative ones.
- Winsorizing: Take every observation above the $p$th percentile and replace it with the $p$th percentile value. You can also do this at the bottom of the distribution.
- Standardizing a variable by subtracting its mean and dividing by its standard deviation. This doesn’t change the statistical properties of $X$; the linear regression will find the computation no easier and no more difficult. However, if $\varepsilon$ were very small, then this could ensure that a 1-sigma change in $X$ corresponds to a 1-sigma change in $Y$. (If $\varepsilon$ is large, then $X$ might be able to change a lot without $Y$ changing very much.) 
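
The transformations listed above can be sketched with the standard library alone. The data reuse the $1, 3, 5, 100$ example; `winsorize_upper` is a hypothetical helper that takes the cap value directly (choosing the percentile is left to the caller), and `standardize` uses the population standard deviation as an assumed convention.

```python
import math
import statistics

x = [1.0, 3.0, 5.0, 100.0]

log_x = [math.log(v) for v in x]       # natural log: pulls the outlier in
log1p_x = [math.log1p(v) for v in x]   # ln(X + 1): safe when X can be zero
asinh_x = [math.asinh(v) for v in x]   # ln(X + sqrt(1 + X^2)): also accepts negatives

def winsorize_upper(values, cap):
    """Replace every value above `cap` with `cap`."""
    return [min(v, cap) for v in values]

def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mu) / sd for v in values]

print([round(v, 1) for v in log_x])    # [0.0, 1.1, 1.6, 4.6]
print(winsorize_upper(x, cap=5.0))     # [1.0, 3.0, 5.0, 5.0]
print(standardize(x))
```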

### 2.4 Lesson: Interaction Terms, Nonlinear Regression, and Advanced Concepts

#### Interaction Terms
An **interaction term** looks like:

$$Y \; = \; \beta_0 + \beta_1 X + \beta_2 Z + \beta_3 X \cdot Z + \varepsilon$$

This is interpreted to mean that the effect of $X$ on $Y$ is: 

$$\frac{\partial Y}{\partial X} = \beta_1 + \beta_3 Z$$

I.e., $\beta_3$ is “how much stronger the effect of $X$ on $Y$ gets when $Z$ increases by 1 unit.” 

Use caution with interaction terms to avoid overfitting. You should use them if domain knowledge suggests that they may be important. 

They also require a lot of data:
- Suppose you have two groups ($Z = 0$ and $Z = 1$), and the effect of $X$ on $Y$ in the $Z = 1$ group is twice as much as in the $Z = 0$ group: $Y = X + X \cdot Z + \varepsilon$

Even then, you’d still need 16 times as many observations to get enough statistical power.

#### Nonlinear Regression
One thing that might be a little confusing is that we use the word “linear regression” to apply to an equation like: 

$$Y = \beta_0 + \beta_1X + \beta_2 X^2 + \varepsilon$$

Why is this "linear" even if there is an $X^2$ term? The answer is that it's linear in the $\beta$ values. So, nonlinear regression is not linear in the beta values. 

For example: $Y = F(\beta_0 + \beta_1X)$ is nonlinear if $F$ is nonlinear. 

An important use of nonlinear regression is when we want a probability output. The expressions we have seen so far will output unlimited values - if $X$ is very large, then the output will also be very large.

To get **probabilities**, we can use **logit**, which passes the linear expression through the logistic function:

$$F(x) = \frac{e^x}{1 + e^x}$$

This is always between 0 and 1: it approaches its minimum of 0 as $x \to -\infty$ and its maximum of 1 as $x \to \infty$.

Or we can use **probit**, which is the cumulative distribution function of the standard normal (the area under the curve between $-\infty$ and $x$). 
- $\text{probit}(x)$ is always between 0 and 1 because the total area under the normal curve is 1, meaning that $\text{probit}(x)$ is always a probability.

We could also just stick with linear regression and presumably clip $Y$ values below 0 or above 1. But this is unlikely to be how the data actually are and can result in a poor estimate of $\frac{\partial Y}{\partial X}$. 

After all, for a linear model, $\frac{\partial Y}{\partial X}$ remains large right up to the point where $Y$ reaches 0 or 1; then we clip $Y$, and the effect goes to zero. But this seems unlikely to be correct. 

If $Y = F(\beta_0 + \beta_1X)$ is logit, then it happens to be the case that we can find the marginal effect of $X$ on the probability that $Y = 1$ as:

$$\frac{\partial \text{Prob} (Y = 1)}{\partial (X)} \; = \; \beta_1 \text{Prob}(Y = 1) (1 - \text{Prob}(Y = 1))$$

Perhaps the notation is not absolutely clear; why can we take a derivative of $\text{Prob} (Y = 1)$ with respect to $X$? What’s implied is that $\text{Prob} (Y = 1)$ is a function of $X$:

$$ \text{Prob}(Y = 1 \; | \; X) = F(\beta_0 + \beta_1X)$$

We could, of course, also compute this partial derivative based on $X$ using the formula for $F (x)$.

When someone asks for “the” marginal effect on $\text{Prob} (Y = 1)$ they likely want us to compute the average marginal effect, which involves computing each individual observation’s marginal effect and then averaging them. Alternatively, we could compute the marginal effect at the mean, where we first average the observations’ $X$ values and then take the marginal effect of a hypothetical observation that has that value. 

$$\frac{\partial \text{Prob} (Y = 1)}{\partial (X)} \; = \; \beta_1 \text{Prob}(Y = 1 \; | \; X = \text{mean}) (1 - \text{Prob}(Y = 1 \; | \; X = \text{mean}))$$

That’s not as good, though — there may be no individual whose $X$ value is at the mean, so this is somewhat artificial. (Whose marginal effect?)
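
The difference between the average marginal effect and the marginal effect at the mean can be sketched as follows. The coefficients and the $X$ values are invented; in practice the coefficients would come from a fitted logit model.

```python
import math

def logistic(u):
    # Same function as e^u / (1 + e^u), written in an equivalent form.
    return 1.0 / (1.0 + math.exp(-u))

def marginal_effect(x, beta0, beta1):
    """d Prob(Y = 1 | X = x) / dX for a logit model."""
    p = logistic(beta0 + beta1 * x)
    return beta1 * p * (1 - p)

beta0, beta1 = -1.0, 0.8     # hypothetical fitted coefficients
xs = [0.0, 1.0, 2.0, 5.0]    # hypothetical observed X values

# Average marginal effect: compute each observation's effect, then average.
ame = sum(marginal_effect(x, beta0, beta1) for x in xs) / len(xs)

# Marginal effect at the mean: average X first, then compute one effect.
x_mean = sum(xs) / len(xs)
mem = marginal_effect(x_mean, beta0, beta1)

print(ame, mem)   # the two generally differ
```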

#### Heteroskedasticity
In the equation:

$ Y = \beta_0 + \beta_1 X + \varepsilon_Y$

We have said that $\varepsilon_Y$ is a random variable whose mean ought to be independent of $X$ unless there are confounders. 

However, $\varepsilon_Y$ can still have a standard deviation that depends on $X$:
- Suppose $X$ can take on values from 0 to 10. 
- Then, imagine that for $X < 5$, there is a high variance in epsilon, while for $X \geq 5$, there is a low variance. 

Then, our regression should primarily use the values with $X \geq 5$ to compute the slope $\beta_1$ because the low variance means we are more certain about the mean value in that region. 

However, linear regression doesn’t do this — it weights all regions the same. 

This is the **heteroskedasticity problem**, and the result is that our standard errors will be wrong. In this case, one possible approach is to use a “sandwich estimator” such as Huber-White to estimate the standard error of $\beta_1$ instead of $\sqrt{\frac{\sigma^2}{\text{var} (X) \cdot n}}$
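
For a single predictor, the simplest sandwich estimator (often labeled HC0) has a compact form: the square root of $\sum_i (X_i - \langle X \rangle)^2 e_i^2$ divided by $\sum_i (X_i - \langle X \rangle)^2$, where $e_i$ are the residuals. The sketch below compares it with the classical standard error on simulated data that follow the high-variance and low-variance regions described above. The data, noise levels, and function name are assumptions, and real software usually offers refined variants with small-sample corrections.

```python
import math
import random

def slope_standard_errors(x, y):
    """Return (slope, classical SE, HC0 robust SE) for a one-predictor regression."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    dx = [xi - x_bar for xi in x]
    sxx = sum(d * d for d in dx)
    b1 = sum(d * (yi - y_bar) for d, yi in zip(dx, y)) / sxx
    b0 = y_bar - b1 * x_bar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    # Classical formula: one common sigma^2 for every observation.
    se_classical = math.sqrt(sum(r * r for r in resid) / (n - 2) / sxx)
    # HC0 sandwich: each observation contributes its own squared residual.
    se_robust = math.sqrt(sum((d * r) ** 2 for d, r in zip(dx, resid))) / sxx
    return b1, se_classical, se_robust

rng = random.Random(1)
x = [i / 10 for i in range(100)]   # X from 0.0 to 9.9
# High noise for X < 5, low noise for X >= 5 (illustrative standard deviations).
y = [1.0 + 2.0 * xi + rng.gauss(0.0, 3.0 if xi < 5 else 0.5) for xi in x]
b1, se_classical, se_robust = slope_standard_errors(x, y)
print(b1, se_classical, se_robust)
```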

#### Errors that Fail to be Independent
Recall that errors are supposed to be uncorrelated between samples. If we have a time series where sequential values have correlated errors, then the usual approach to regression is inaccurate. 

The problem is this: 
- Suppose we see a “mountain” shape in the errors, where the residual gets really big for a long time. 
- We might be tempted to say that this is extremely improbable — how likely is it that all the residuals could be large and positive many times in a row? 
- If the errors are correlated, then, in fact, that is quite likely; it is very common for the same error to occur many times in a row. How do we solve this? 

If they are correlated across time (time-based autocorrelation from sample $N$ to sample $N + 1$), then we use heteroskedasticity-and-autocorrelation-consistent (HAC) standard errors. Example: the Newey-West estimator. 

If there is a hierarchical structure, e.g., students in the same class are correlated, then we apply clustered standard errors, such as Liang-Zeger standard errors, which are also a sandwich estimator. 
- Clustering at too broad a level (say there are only two schools, and you cluster at the school level) will mean that the data won’t tell you anything. 
- The more common approach is to cluster at the level of treatment (if group G was randomly assigned all to treatment or all to control, then group G is a reasonable unit to use). 
- This works well for a large number of clusters ($ > 50$), not a small number. 

You can also use bootstrap standard errors. Draw random samples of your data (with replacement), each the same size as the original data set, and compute the desired property (mean, $\beta_1$, or whatever it is you want) on each one. Once you’ve collected all the $\beta_1$’s (let us say), their standard deviation is the bootstrap estimate of the standard error. 
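A minimal sketch of this procedure for the slope follows. The simulated data, the number of resamples `n_boot`, and the helper name `slope` are assumptions for illustration.

```python
import numpy as np

def slope(x, y):
    """Bivariate OLS slope: Cov(X, Y) / Var(X)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

n_boot = 2000
boot_slopes = np.empty(n_boot)
for b in range(n_boot):
    idx = rng.integers(0, n, size=n)        # resample row indices with replacement
    boot_slopes[b] = slope(x[idx], y[idx])

# The spread of the resampled slopes estimates the standard error
boot_se = boot_slopes.std(ddof=1)
print("slope:", slope(x, y), "bootstrap SE:", boot_se)
```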

Bootstrapping requires a large sample size, and there should not be extreme values (one value that is much larger than the others). If there are extreme values, then the estimated $\beta_1$ may depend very strongly on whether the extreme value is included in the bootstrapped samples.

Even worse, it might also mean that we don’t know the true distribution of these extreme values (we have too few examples), so our bootstrap sample is not representative of the population. For example, if we have 99 data points between 0 and 1 but one data point that is at 10, then the question is whether all the large data points are going to be around 10. What if some are at 20 or 30? What if the true probability of the 10 was 0.1% and not 1%?

#### Sample Weights
Sample weighting means that different samples have different importance. In a bivariate regression, our estimate of the slope $(\beta_1)$ is $\frac{\text{Cov} (X, Y)}{\text{Var} (X)}$

With weights, when we calculate these averages, we multiply each term by the appropriate weight, e.g.: 

$\text{Cov} (X, Y)_{\text{unweighted}} \; = \; \langle (X_i - \langle X \rangle)(Y_i - \langle Y \rangle) \rangle$

$\text{Cov} (X, Y)_{\text{weighted}} \; = \; \langle w_i (X_i - \langle X \rangle)(Y_i - \langle Y \rangle) \rangle$

$\text{Var} (X)_{\text{unweighted}} \; = \; \langle (X_i - \langle X \rangle)^2 \rangle$

$\text{Var} (X)_{\text{weighted}} \; = \; \langle w_i (X_i - \langle X \rangle)^2 \rangle$

This is called “weighted least squares.” 
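The sketch below computes a weighted slope from the weighted covariance and variance and checks it against the weighted normal equations. It assumes that the means $\langle X \rangle$ and $\langle Y \rangle$ are themselves weighted averages; the data and the helper name `weighted_slope` are made up for illustration.

```python
import numpy as np

def weighted_slope(x, y, w):
    """Slope from weighted covariance over weighted variance."""
    w = w / w.sum()                          # normalize so weights sum to 1
    x_bar = np.sum(w * x)                    # weighted means (an assumption here)
    y_bar = np.sum(w * y)
    cov_xy = np.sum(w * (x - x_bar) * (y - y_bar))
    var_x = np.sum(w * (x - x_bar) ** 2)
    return cov_xy / var_x

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
w = rng.uniform(0.5, 2.0, size=n)            # hypothetical sample weights

# Cross-check: solve the weighted normal equations (X'WX) beta = X'Wy
X = np.column_stack([np.ones(n), x])
beta_wls = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * y))

print(weighted_slope(x, y, w), beta_wls[1])
```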

**Reasons to weight:** 

- **Some people are more likely to be sampled than others**. People more likely to be sampled should be down-weighted. Thus, if a sample has 75 men and 25 women, then men should get a weight of 1 / 75% = 1.33 and women 1 / 25% = 4. (That’s if we assume the “real” population is 1,000 men and 1,000 women. If the real population is 750 men and 250 women, then our sample is already representative.)
- **Aggregated data**. What if you’re using classroom average scores as your data points instead of individual students? But then the larger classroom might be more important because it represents more students. Should we weight a classroom of size 30 the same as a classroom of size 2? In this case, one possible solution is **frequency weighting**, meaning that we repeat the size 30 classroom’s value 30 times and the size 2 classroom’s value 2 times. 
- On the other hand, the means of the larger classrooms are better estimated than the others. Thus, perhaps we should weight the classrooms by the classroom size $N$ since the variance of a classroom mean goes like $\frac{\sigma^2}{N}$. (This is **inverse variance weighting**.) This is similar to the frequency weighting mentioned above in that it weights the larger classrooms more.
- When it comes to coefficient estimates, these two approaches both weight by $N$, and so they are the same. They produce the same coefficients. However, when it comes to sample size, frequency weighting literally repeats each observation $N$ times, resulting in a much larger sample size, whereas inverse variance weighting uses a single value per classroom but uses different weights $w_i$ for each classroom, as shown above. This means that frequency weighting standard errors will be smaller than inverse variance weighting standard errors.
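The comparison between the two weighting schemes can be sketched as below. The classroom data are simulated, and the helper `wls_with_se` is a hypothetical convenience function that treats `n_obs` as the sample size when computing classical standard errors.

```python
import numpy as np

def wls_with_se(X, y, w, n_obs):
    """Weighted least squares with classical SEs, using n_obs as the sample size."""
    XtWX_inv = np.linalg.inv(X.T @ (X * w[:, None]))
    beta = XtWX_inv @ X.T @ (w * y)
    resid = y - X @ beta
    sigma2 = np.sum(w * resid ** 2) / (n_obs - X.shape[1])
    return beta, np.sqrt(np.diag(sigma2 * XtWX_inv))

rng = np.random.default_rng(0)
k = 40                                        # number of classrooms
sizes = rng.integers(2, 31, size=k)           # classroom sizes N
x = rng.normal(size=k)                        # classroom-level predictor
# Classroom mean outcome: noisier for small classrooms (variance sigma^2 / N)
y = 1.0 + 0.5 * x + rng.normal(0, 1.0 / np.sqrt(sizes))
X = np.column_stack([np.ones(k), x])

# Frequency weighting: literally repeat each classroom N times
X_rep = np.repeat(X, sizes, axis=0)
y_rep = np.repeat(y, sizes)
beta_freq, se_freq = wls_with_se(X_rep, y_rep, np.ones(len(y_rep)), len(y_rep))

# Inverse variance weighting: one row per classroom, weight N
beta_ivw, se_ivw = wls_with_se(X, y, sizes.astype(float), k)

print("coefficients:", beta_freq, beta_ivw)
print("slope SEs:", se_freq[1], se_ivw[1])
```

The coefficients agree, while the frequency-weighted standard errors are smaller because the repeated rows are counted as extra observations.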

#### Collinearity
Perfect collinearity makes regression impossible.

If the true equation is:
- $Y \; = \; \beta_0 + \beta_1X + \beta_2Z + \varepsilon_Y$, 
- but $X = Z$ always,
- then an equivalent equation is $Y = \beta_0 + (\beta_1 + 1)X + (\beta_2 - 1)Z + \varepsilon_Y$. In other words, the coefficients are undetermined and could be arbitrarily large (millions) while the equation is still true. (You could use: $Y = \beta_0 + (\beta_1 + 1000000)X + (\beta_2 - 1000000)Z + \varepsilon_Y$.)
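A small numerical check of this argument: when $X = Z$, the design matrix loses a rank, and shifting the two coefficients in opposite directions leaves the predictions unchanged. The coefficient values and variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
z = x.copy()                                  # perfect collinearity: X = Z always

D = np.column_stack([np.ones(n), x, z])       # columns: intercept, X, Z
rank = np.linalg.matrix_rank(D)               # one less than the number of columns

beta_a = np.array([1.0, 2.0, 3.0])            # arbitrary (beta_0, beta_1, beta_2)
shift = 1_000_000.0
beta_b = beta_a + np.array([0.0, shift, -shift])

pred_a = D @ beta_a
pred_b = D @ beta_b                           # same predictions, very different coefficients
print(rank, np.allclose(pred_a, pred_b))
```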

However, what if collinearity is not perfect but the independent variables are just strongly correlated? 
- Then, the coefficient estimates will be sensitive and noisy with large standard errors. 
- This should be obvious: If perfect collinearity leads to infinite standard errors, then it makes sense that partial collinearity would lead to high standard errors.

One way to deal with partial collinearity is to remove variables from the analysis that reflect duplicated information. 
- This can be done via a domain knowledge understanding of the data. 
    - For example: 
        - If you are trying to predict the cost of a bag of apples based on its weight and volume, you might expect that there is substantial overlap in the two effects. 
        - Weight and volume influence the price of the bag in about the same way.
- Alternatively, you can check if there's too much collinearity via the variance inflation factor (VIF), which for variable $j$ is $\text{VIF}_j \; = \; \frac{1}{1 - R_j^2}$, where $R_j^2$ is the $R^2$ found by regressing variable $j$ on the other variables (that is, predicting $j$ based on the other features).

- If variable $j$ is perfectly predicted (linearly) by the other variables, then $R_j^2 = 1$ and the VIF is infinite. 
- A large VIF indicates collinearity. In particular, a VIF of 5 or 10 or so (or larger) indicates a problem.
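The VIF definition above can be computed directly, as in this sketch. The apple-bag data are simulated (weight and volume nearly duplicate each other, plus one unrelated variable), and the helper name `vif` is an assumption for this example.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (no intercept column in X)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(0)
n = 500
weight = rng.normal(size=n)
volume = weight + 0.1 * rng.normal(size=n)    # nearly the same information as weight
ripeness = rng.normal(size=n)                 # hypothetical unrelated feature

vifs = vif(np.column_stack([weight, volume, ripeness]))
print(vifs)
```

The first two VIFs come out far above 10, while the unrelated variable stays close to 1.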

#### Measurement Error
Classical measurement error means that the error in measuring $X$ or $Y$ is uncorrelated with $X$ and $Y$. In this case, measurement error on $X$ will attenuate its relationship with $Y$, while measurement error on $Y$ will not attenuate its relationship with $X$ (but will add noise to the coefficients). To see this, draw a perfect linear relationship between $X$ and $Y$, and then, add noise in $X$. Alternatively, try adding noise in $Y$. You’ll see that for a large amount of noise in $X$, the relationship of $X$ and $Y$ will attenuate (the coefficient $\beta_1$ will shrink) because if you draw a line from the smallest $X$ values on the left to the largest $X$ values on the right, it will have a shallower slope. In contrast, even a large amount of noise in $Y$ alone will not attenuate $\beta_1$ on average because there are new high and low $Y$ values on the left and new high and low $Y$ values on the right. 
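The thought experiment can be run as a simulation. The sketch below starts from a perfect linear relationship with slope 2 and adds classical noise to $X$ or to $Y$; the noise levels and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x_true = rng.normal(size=n)
y_true = 2.0 * x_true                         # perfect linear relationship

x_noisy = x_true + rng.normal(0, 1.0, size=n) # classical error in X only
y_noisy = y_true + rng.normal(0, 1.0, size=n) # classical error in Y only

slope_noisy_x = np.polyfit(x_noisy, y_true, 1)[0]
slope_noisy_y = np.polyfit(x_true, y_noisy, 1)[0]

# With equal variance for x_true and its noise, the slope shrinks by about half
print("noise in X:", slope_noisy_x)
print("noise in Y:", slope_noisy_y)
```

The shrinkage factor in the noisy-$X$ case is $\frac{\text{Var}(X_\text{true})}{\text{Var}(X_\text{true}) + \text{Var}(\text{noise})}$, which is $\frac{1}{2}$ under the variances assumed here.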

In contrast, non-classical measurement error is when the error in measuring $X$ or $Y$ is correlated with the true value of $X$ or $Y$. For example, if $Y$ is binary, then the noise is negatively correlated with the true value; for $Y_\text{true} = 0$, the noise can only be 0 or positive, while for $Y_{\text{true}} = 1$, the noise can only be 0 or negative. 

In The Effect, the author mentions several ways to deal with measurement error: Deming regression, total least squares, and the generalized method of moments (GMM). However, he does not say very much about how these techniques work. 

#### Penalized Regression 
Penalized regression includes: 
- LASSO regression (add a multiple of the L1 norm of the coefficients, $\sum_j |\beta_j|$, to the sum of squared residuals)
- Ridge regression (add a multiple of the squared L2 norm of the coefficients, $\sum_j \beta_j^2$, to the sum of squared residuals). 

LASSO regression tends to send many of the coefficients to zero, while ridge regression just makes many of them smaller without sending them to zero. 
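To see the contrast, the sketch below fits both penalties to simulated data where only two of ten predictors matter. LASSO is solved with a basic coordinate descent loop and ridge with its closed form; the penalty value, the scaling of the objective by $\frac{1}{2n}$, and the helper names are assumptions for this example rather than a reference implementation.

```python
import numpy as np

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold, stopping at exactly zero."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)||y - X b||^2 + lam * sum(|b_j|)."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            partial_resid = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_resid / n
            z = X[:, j] @ X[:, j] / n
            beta[j] = soft_threshold(rho, lam) / z
    return beta

def ridge(X, y, lam):
    """Closed form for (1/2n)||y - X b||^2 + (lam/2) * sum(b_j^2)."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
true_beta = np.array([3.0, -2.0] + [0.0] * (p - 2))
y = X @ true_beta + rng.normal(size=n)

X = (X - X.mean(axis=0)) / X.std(axis=0)      # standardize predictors
y = y - y.mean()                              # center outcome (no intercept needed)

beta_lasso = lasso_cd(X, y, lam=0.3)
beta_ridge = ridge(X, y, lam=0.3)
print("LASSO:", np.round(beta_lasso, 3))
print("ridge:", np.round(beta_ridge, 3))
```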

LASSO may give you biased estimates of the coefficients; to avoid this, use LASSO to select the coefficients, but then use ordinary least squares regression to re-derive the coefficients (using only the ones that LASSO picked). 

However, this approach is not perfect: There may be a selection bias. For example, if there is truly no effect, and you run LASSO with 1,000 features, you might easily find a few features that seem to correlate just by chance. One approach would be to re-derive the coefficients on entirely new data. 

For LASSO, all variables must first be standardized; otherwise, it will be impossible to balance the magnitudes of the coefficients in a meaningful way ($|\beta_1| + |\beta_2|$ is only meaningful if the two coefficients are on the same scale). You can include a lot of terms (such as many polynomial terms) and let LASSO pick the ones to keep. There is still an overfitting problem if you had infinitely many coefficients, but it would require quite a large number of candidate coefficients to produce this problem because, in the end, the number of nonzero coefficients produced by LASSO will always be small, so the number of arrangements increases polynomially in the number of distinct allowed coefficients rather than exponentially. 

You can use cross-validation to pick the coefficient $\lambda$ of the regularization term. 
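A minimal sketch of $k$-fold cross-validation over a grid of $\lambda$ values is shown below. It uses ridge regression only because ridge has a simple closed form; the same loop would apply to LASSO with a different fitting routine. The grid, the fold count, the simulated data, and the helper names are assumptions for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge coefficients: solve (X'X + lam * I) beta = X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cv_error(X, y, lam, folds):
    """Mean squared prediction error averaged over held-out folds."""
    errors = []
    for k in np.unique(folds):
        train, test = folds != k, folds == k
        # Standardize with training-fold statistics only
        mu, sd = X[train].mean(axis=0), X[train].std(axis=0)
        y_mu = y[train].mean()
        beta = ridge_fit((X[train] - mu) / sd, y[train] - y_mu, lam)
        pred = y_mu + ((X[test] - mu) / sd) @ beta
        errors.append(np.mean((y[test] - pred) ** 2))
    return np.mean(errors)

rng = np.random.default_rng(0)
n, p = 120, 30
X = rng.normal(size=(n, p))
true_beta = np.concatenate([np.array([2.0, -1.0, 0.5]), np.zeros(p - 3)])
y = X @ true_beta + rng.normal(size=n)

folds = rng.permutation(n) % 5                # random assignment to 5 folds
lambdas = np.logspace(-2, 3, 12)              # candidate penalty strengths
cv_errors = np.array([cv_error(X, y, lam, folds) for lam in lambdas])
best_lam = lambdas[np.argmin(cv_errors)]
print("best lambda:", best_lam)
```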