# Week 9 Overview
This week, we will explore the concept of simulation in data science. We will learn how to use simulations to validate methods when mathematical proofs are not available and understand how to create data for simulations using directed acyclic graphs. Additionally, we will delve into power analysis and bootstrap simulations to estimate standard errors and assess model predictions.

## Learning Objectives
At the end of this week, you will be able to: 
- Design simulations to validate statistical methods when theoretical proofs are not available. 
- Analyze the distribution and variability of estimates through repeated simulations. 
- Construct data using directed acyclic graphs (DAGs) to model relationships and dependencies. 
- Evaluate the impact of different assumptions on the accuracy of statistical estimators. 
- Perform power analysis to determine the sample size required for detecting significant effects in regression models.

## Simulation
Simulation offers a powerful way to test methods and estimate uncertainty when analytical solutions are unavailable or difficult to derive. By generating synthetic data that mimics real-world relationships, simulation allows us to evaluate how well an estimator performs, examine the impact of assumptions like heteroskedasticity or omitted variables, and compare different modeling approaches. Using tools like directed acyclic graphs and assumptions about functional forms and error structures, simulations can help data analysts explore how estimates behave across thousands of repetitions — revealing potential bias, variability, or limitations in a given approach. 

## 1.1 Lesson: Simulation
Often in data science, we use math to prove that our methods work. 

For example, we can prove that in regression:

**If**

- the errors are normally distributed with a fixed mean of zero,
- the errors have a fixed standard deviation across all values of the treatment ($X$) and covariates ($Z$), and
- the samples are independent (not, for example, a time series with values that change gradually over time),

**then**

- Our estimates and standard errors are correct. 
- In technical language, the estimates are maximum likelihood estimators for the values they estimate, and the standard errors represent the standard deviation of the estimate if the current estimate is the true value of the coefficient.

However, what happens if we come up with a particular way of doing things where we are not sure how to prove that it works? 

Or we find a method described in a book or on the internet, but it does not say how to compute standard errors. 

In that case, **simulation** can help us check that we are doing things correctly. 

**Simulation** involves using computer-generated data to mimic real-world processes, allowing us to test methods and estimate uncertainty when analytical solutions are difficult or unavailable.

### Example of a Simulation
Suppose we flip 100 coins and get $N$ heads: 
- Our estimate of the likelihood of getting heads is $N / 100$. 
- The true likelihood is 0.5 (a coin flip has a 50% chance of being either heads or tails. 50/100 = 0.5). 
- We want to know how close our estimate of $N / 100$ is likely to be to the true probability of heads, 0.5. 
- To do this by simulation, we could flip the 100 coins a thousand times (so, a total of 100,000 flips) and come up with 1,000 different estimates. (51 / 100, 54 / 100, 48 / 100, …). 
- Now, we can learn exactly how far off we’re likely to be under the assumption that 50 heads is the true average value.
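
The coin-flip experiment above can be sketched in Python. This is a minimal illustration, assuming NumPy is available; the seed and the variable names are arbitrary choices rather than part of the lesson.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility

n_flips = 100         # coins per experiment
n_experiments = 1000  # number of repeated experiments
true_p = 0.5          # assumed true probability of heads

# Each entry is the number of heads N in one experiment of 100 flips.
heads = rng.binomial(n=n_flips, p=true_p, size=n_experiments)
estimates = heads / n_flips  # 1,000 estimates of the form N / 100

# Summarize how far the estimates typically fall from the true value.
mean_estimate = estimates.mean()
sd_estimate = estimates.std(ddof=1)
low, high = np.percentile(estimates, [2.5, 97.5])

print(f"mean of estimates: {mean_estimate:.3f}")
print(f"standard deviation of estimates: {sd_estimate:.3f}")
print(f"middle 95% of estimates: [{low:.2f}, {high:.2f}]")
```

For comparison, the theoretical standard deviation of $N / 100$ for a fair coin is $\sqrt{0.5 \times 0.5 / 100} = 0.05$, so the simulated standard deviation should land near that value.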

We could simulate for any of the following reasons: 
1. We want to know how far off our estimate is likely to be from the true effect. 
2. We want to see how our estimate is distributed. (Normally? Uniformly?) 
3. We want to see whether our estimate varies a lot depending on assumptions. For example, if the error is heteroskedastic (its variance changes as a function of $X$), does that produce a very different estimate than if the error is homoskedastic (its variance is constant in $X$)?

### Creating Data
To create data, use the **directed acyclic graph:** 
- Guess the functional form of each node based on the nodes that point to it. 
- Thus, if $Z$ points to $X$ and $Y$, while $X$ points to $Y$ (this is the graph for the simplest possible confounder, $Z$), you’d need to describe: 
    - $X$ as a function of $Z$ (perhaps $X = \beta_0 + \beta_1Z + \varepsilon_X$), and
    - $Y$ as a function of both (perhaps $Y = \alpha_0 + \alpha_1X + \alpha_2Z + \varepsilon_Y$). 
    - You could also include polynomial terms ($\alpha_3Z^2$) or interaction terms ($\alpha_4 XZ$).

You get to determine the error terms, in this case $\varepsilon_X$ and $\varepsilon_Y$: 
- You might decide, for instance, that $\varepsilon_X$ is a normal distribution with mean 0 and standard deviation 1 regardless of $Z$. 
- Or, instead, you could create heteroskedasticity where $\varepsilon_X$ has standard deviation $1 + 0.5 Z$ (this requires $1 + 0.5 Z > 0$ for every value of $Z$). 
- To calculate this heteroskedastic $\varepsilon_X$, you’d compute $Z$, then use the given standard deviation of $1 + 0.5 Z$ to compute $\varepsilon_X$. 
- After that, you could add $\beta_0 + \beta_1Z + \varepsilon_X$ to compute $X$.
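
The steps above can be sketched for the simple confounder graph ($Z \to X$, $Z \to Y$, $X \to Y$). This is only an illustration: the coefficient values and the choice of a uniform distribution for $Z$ are assumptions made here, chosen so that the standard deviation $1 + 0.5 Z$ stays positive.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n = 1000

# Illustrative (assumed) coefficients; the lesson does not specify values.
beta_0, beta_1 = 1.0, 2.0
alpha_0, alpha_1, alpha_2 = 0.5, 1.5, -1.0

# Z has no parents in the graph, so it is generated first.
# Assumption: Z is uniform on [0, 2], which keeps 1 + 0.5 * Z positive.
z = rng.uniform(0.0, 2.0, size=n)

# Heteroskedastic error for X: its standard deviation depends on Z.
eps_x = rng.normal(loc=0.0, scale=1.0 + 0.5 * z)
x = beta_0 + beta_1 * z + eps_x

# Homoskedastic error for Y: constant standard deviation of 1.
eps_y = rng.normal(loc=0.0, scale=1.0, size=n)
y = alpha_0 + alpha_1 * x + alpha_2 * z + eps_y

# The spread of eps_x should be visibly larger where Z is larger.
print(eps_x[z < 1].std(), eps_x[z > 1].std())
```

Nodes are generated in the order the arrows allow: parents first, then children.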

You can also create clustering, where there are groups, and there is a coefficient $\beta_{group}$ for each group:

$Y = \beta_0 + \beta_{group} + \beta_1X + \varepsilon$

There might be a hierarchy of groups, in fact, like:

$Y = \beta_0 + \beta_{\text{big\_group}} + \beta_{\text{small\_group\_within}} + \beta_1X + \varepsilon$

Here, you'd usually assume that the value of $\beta_{\text{big\_group}}$ comes from its own distribution, while $\beta_{\text{small\_group\_within}}$ comes from another distribution, which is the same for all small groups regardless of which big group they're in.
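
One way to generate such hierarchical data is sketched below. The group counts, the standard deviations of the group effects, and the coefficient values are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=2)

n_big = 5            # number of big groups
n_small_per_big = 4  # small groups inside each big group
n_per_small = 50     # observations per small group
n_small = n_big * n_small_per_big

beta_0, beta_1 = 1.0, 2.0  # assumed intercept and slope

# One effect per big group, drawn from its own distribution.
big_effects = rng.normal(0.0, 2.0, size=n_big)
# One effect per small group, drawn from a single shared distribution
# regardless of which big group the small group belongs to.
small_effects = rng.normal(0.0, 0.5, size=n_small)

# Label each observation with its small group, then derive its big group.
small_id = np.repeat(np.arange(n_small), n_per_small)
big_id = small_id // n_small_per_big

x = rng.normal(0.0, 1.0, size=small_id.size)
eps = rng.normal(0.0, 1.0, size=small_id.size)
y = beta_0 + big_effects[big_id] + small_effects[small_id] + beta_1 * x + eps
```

Indexing the effect arrays by `big_id` and `small_id` gives every observation in a group the same group coefficient.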

You also need to decide what estimator you care about. (How will you estimate the mean, standard deviation, causal effect, or whatever you care about? Will you estimate the standard deviation by dividing by the number of samples or the number of samples minus one?) 

Now, iterate the simulation as many times as you can, likely at least thousands to tens of thousands, depending on the context.

Finally, check whether the estimate of the mean equals the true mean (that we assumed when generating the data). Of course, it’s unlikely that the estimate is exactly equal to the true mean, so you might check that the true mean lies between the 2.5th and 97.5th percentiles of the simulated estimates of the mean. (Technical note: This is not a confidence interval; this is an interval of possible estimated means surrounding a true mean, whereas a confidence interval is an interval of possible candidate true means surrounding an estimated mean.)

Alternatively, you could check how close the standard deviation of the estimated regression coefficients (e.g., $\beta$) across simulations is to the theoretical standard error, which for simple linear regression is $\sqrt{\frac{\sigma^2}{\text{Var} (X) \cdot n}}$ where $\sigma^2$ is the variance of the residuals.
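
Both checks can be sketched for a simple linear regression. The data-generating values below are assumptions chosen for illustration, and `np.polyfit` stands in for whichever estimator you actually want to test.

```python
import numpy as np

rng = np.random.default_rng(seed=3)

n, n_sims = 200, 5000
beta_0, beta_1 = 1.0, 2.0  # assumed true coefficients
sd_x, sigma = 2.0, 1.0     # assumed SD of X and of the error term

slopes = np.empty(n_sims)
for i in range(n_sims):
    x = rng.normal(0.0, sd_x, size=n)
    y = beta_0 + beta_1 * x + rng.normal(0.0, sigma, size=n)
    # np.polyfit with deg=1 returns the least-squares [slope, intercept].
    slopes[i] = np.polyfit(x, y, deg=1)[0]

# Check 1: does the true slope lie inside the middle 95% of the estimates?
low, high = np.percentile(slopes, [2.5, 97.5])
print(f"middle 95% of slope estimates: [{low:.3f}, {high:.3f}]")

# Check 2: does the spread of the estimates match the theoretical formula?
empirical_sd = slopes.std(ddof=1)
theoretical_se = np.sqrt(sigma**2 / (sd_x**2 * n))
print(f"empirical SD: {empirical_sd:.4f}, theoretical SE: {theoretical_se:.4f}")
```

Because $X$ is redrawn in every iteration, the two numbers should agree closely but not exactly.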

### Purpose of simulation  
Some uses of simulation include:
- Inventing or experimenting with novel estimators.  
- Contrasting one estimator vs. another.  
- Checking what assumptions are necessary for an estimator to be accurate. (For example, do we need homoskedasticity, where the error $\varepsilon \mid X$, conditioned on $X$, has the same standard deviation regardless of the value of $X$?)

If you fail to include some back door variables in your analysis (say, in a linear regression), then $X$ will be correlated with the error term no matter what beta coefficients you choose. 

That is, if:

$X = Z + \varepsilon_X$

$Y = X + Z + \varepsilon_Y$

Then if you try to fit:

$Y = \beta_0 + \beta_1X + \varepsilon$

You can try substituting $Z = X - \varepsilon_X$ into the $Y$ equation to get something like:

$Y = 2X - \varepsilon_X + \varepsilon_Y$

But, unfortunately $\varepsilon_X$ is correlated with $X$. 

We know this from the equation $X = Z + \varepsilon_X$: 
- If $X$ is large, it's likely partly because $\varepsilon_X$ happened to be large. 
- If $X$ is small, it's likely partly because $\varepsilon_X$ happened to be small.
- Therefore, any expression for $\varepsilon$ that includes $\varepsilon_X$ will involve a correlation with $X$. ($\varepsilon_Y$ should not be correlated with either $X$ or $Z$; that's an important assumption for the regression to work in the $Y$ equation.)

Once confounders have been removed, $X$ is uncorrelated with the error term at the population level, even if it is correlated in a specific sample. In that case, even if small-sample correlations exist due to randomness, these diminish as the sample size increases.  

Simulation can tell us if it’s a major problem that we didn’t close all the back doors. We may find experimentally, for instance, that the bias is not that large, given certain assumptions.
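
A minimal sketch of such an experiment, using the equations $X = Z + \varepsilon_X$ and $Y = X + Z + \varepsilon_Y$ from above. The assumption added here is that $Z$, $\varepsilon_X$, and $\varepsilon_Y$ are independent standard normal variables; under that assumption the regression of $Y$ on $X$ alone has a population slope of $\text{Cov}(X, Y) / \text{Var}(X) = 3/2$ rather than the true effect of 1.

```python
import numpy as np

rng = np.random.default_rng(seed=4)
n, n_sims = 500, 2000

short_slopes = np.empty(n_sims)  # Y regressed on X only (back door open)
full_slopes = np.empty(n_sims)   # Y regressed on X and Z (back door closed)

for i in range(n_sims):
    z = rng.normal(size=n)
    x = z + rng.normal(size=n)
    y = x + z + rng.normal(size=n)

    # Design matrices with an intercept column.
    design_short = np.column_stack([np.ones(n), x])
    design_full = np.column_stack([np.ones(n), x, z])

    # Index 1 of the solution is the coefficient of X in both models.
    short_slopes[i] = np.linalg.lstsq(design_short, y, rcond=None)[0][1]
    full_slopes[i] = np.linalg.lstsq(design_full, y, rcond=None)[0][1]

print(f"mean slope without Z: {short_slopes.mean():.3f}")
print(f"mean slope with Z:    {full_slopes.mean():.3f}")
```

The gap between the two averages is the bias caused by leaving the back door through $Z$ open. Changing the assumed strength of $Z$ in either equation shows how the size of that bias depends on the assumptions.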

## 1.2 Lesson: Power Analysis
What does flipping a coin five times tell us about fairness — or randomness itself? And how can we ever be confident in our estimate based on just a handful of outcomes? In this video, we take a closer look at how simulations help us explore questions of statistical power, error, and estimation, using the surprisingly rich example of five coin flips. 

As you watch, think about this: When we make decisions based on data, are we uncovering truth or just navigating uncertainty with our best guess? 

- Data-Generating Process (DGP)
    - A simulation assumes knowledge (or assumptions) of how data are generated.
    - Example setting: flipping five coins to estimate the true probability of heads.
- Five-Coin Flip Example
    - Let $n$ = number of heads in 5 flips.
    - Possible outcomes and their probabilities under a fair coin (50:50):

| Heads ($n$) | Probability ($\Pr$) | Estimated $p$ ($\hat p = n/5$) |
| :--- | :--- | :--- |
| 5 | 1/32 | 1.00 |
| 4 | 5/32 | 0.80 |
| 3 | 10/32 | 0.60 |
| 2 | 10/32 | 0.40 |
| 1 | 5/32 | 0.20 |
| 0 | 1/32 | 0.00 |

Unbiasedness of $\hat{p}$:
- The expected value of $\hat p$ under the true $p = 0.5$ is: $E[\hat p] = \sum_{n=0}^5 \Pr(n)\times \frac{n}{5} = 0.5$
- Thus, $\hat p$ is an unbiased estimator of the true probability.
	
Estimator Variance & Standard Error
- True variance of $\hat p$ is $\frac{p(1-p)}{5}$, so $\mathrm{SD}(\hat p) = \sqrt{0.05}\approx0.2236$.
- Usual standard-error formula is $\sqrt{\hat p(1-\hat p)/5}$.
- Averaging that across all six outcomes, weighted by their probabilities, gives a mean standard error of $\approx 0.1928$, somewhat below the true SD of $\approx 0.2236$.
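
These three quantities can be reproduced exactly by enumerating the six outcomes. The sketch below uses only the Python standard library; the helper name `standard_error` is an illustrative choice.

```python
from fractions import Fraction
from math import comb, sqrt

n_flips = 5
p = Fraction(1, 2)  # fair coin

# Exact probability of each possible number of heads.
probs = {
    k: comb(n_flips, k) * p**k * (1 - p) ** (n_flips - k)
    for k in range(n_flips + 1)
}

# Expected value of the estimator p_hat = k / 5.
expected_p_hat = sum(pr * Fraction(k, n_flips) for k, pr in probs.items())

# True standard deviation of p_hat.
true_sd = sqrt(p * (1 - p) / n_flips)

def standard_error(k):
    """Usual plug-in standard error when k heads are observed."""
    p_hat = k / n_flips
    return sqrt(p_hat * (1 - p_hat) / n_flips)

# Probability-weighted average of the plug-in standard error.
mean_se = sum(float(pr) * standard_error(k) for k, pr in probs.items())

print(expected_p_hat, round(true_sd, 4), round(mean_se, 4))
```

The plug-in standard error is zero whenever all five flips agree, which is one reason its average falls below the true standard deviation.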

Statistical Power & Hypothesis Testing
- Power = probability of correctly rejecting a (false) null hypothesis.
- Under $H_0\!: p=0.5$, no two-sided critical region at 5% exists for 5 flips (minimum mass in two tails = $2/32=6.25\%$).
- Using a one-sided test (reject on 5 heads) gives power = $\Pr(\text{5 heads} \mid p=0.6) = 0.0778$, which is low: only about 2.5 times the false positive rate of $1/32 \approx 0.031$ that the same test has under $H_0$.

Increasing Sample Size (50 coins)
- Exact probabilities become unwieldy by hand; use a computer to compute $\Pr(\ge k\text{ heads})$.
- For a 60:40 true split, rejecting $H_0:p=0.5$ when $\ge32$ heads yields power $\approx0.3356$, a substantial improvement but still modest.
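
The tail probabilities for both the 5-flip and the 50-flip tests can be computed exactly with the binomial formula. The sketch below uses only the standard library; `prob_at_least` is a helper name introduced here for illustration.

```python
from math import comb

def prob_at_least(k, n, p):
    """Exact Pr(at least k heads in n flips) when Pr(heads) = p."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))

# Five flips: reject H0 only when all 5 flips are heads.
alpha_5 = prob_at_least(5, 5, 0.5)  # false positive rate under the null
power_5 = prob_at_least(5, 5, 0.6)  # power under the 60:40 alternative

# Fifty flips: reject H0 when 32 or more flips are heads.
alpha_50 = prob_at_least(32, 50, 0.5)
power_50 = prob_at_least(32, 50, 0.6)

print(f"5 flips:  alpha = {alpha_5:.4f}, power = {power_5:.4f}")
print(f"50 flips: alpha = {alpha_50:.4f}, power = {power_50:.4f}")
```

Evaluating the same tail probability under the null and under the alternative gives the false positive rate and the power of the test, respectively.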

Key Takeaways
- Simulations require a hypothesized DGP (e.g. null and alternative).
- Without a plausible DGP, hypothesis testing or simulation-based inference has no foundation.
- Bootstrapping substitutes “the empirical distribution” for a DGP, but still relies on the assumption that the sample reflects the true process.

Now that you've seen how power analysis works for analyzing coin flips, let’s take it a step further. In this next video, we’ll explore how power analysis applies to regression, where we’re not just estimating coin flip probabilities but estimating the effect of one variable on another. You’ll see how factors like effect size, variation, and sample size come together to determine whether a regression model can reliably detect a real relationship. 

### Power Analysis For Regression
In the case of regression, we have an equation like:

$Y = \beta_0 + \beta_1X + \varepsilon_Y$

In this situation, **power analysis** balances five things:
1. The true effect size in regression. 
    - This is the coefficient $\beta_1$. 
    - Assuming the treatment $X$ is binary, the coefficient of $X$ denotes the causal effect of treating a particular sample instead of not treating it.
2. The amount of variation in the treatment. This is the variance of the treatment $X$.
3. The amount of other variation in $Y$. 
    - This is the variation of the residual after explaining $Y$ with $X$, that is the variance of $Y_{true} - Y_{predicted}$.
4. The sample size
5. The statistical precision or power.

Item 5 can mean either:
1. The standard error of the estimate, that is, how much the estimate of $\beta_1$ would vary if we did the experiment many times, or
2. The chance that the simulation correctly detects a real effect (the statistical power).

Here's an example of balancing five things: 
- Suppose $X = \text{normal} (\text{mean} = 0, \text{stddev} = 2)$, and 
- $Y = 1 + 2X + \text{normal} (\text{mean} = 0, \text{stddev} = 1)$

Then:
1. The true effect size is $2$ because that's the coefficient of $X$. 
2. The variation in the treatment goes like $\text{normal}(0,2)$. Its variance is $2^2 = 4$. 
3. The other variation in $Y$ goes like $\text{normal} (0,1)$. Assuming we get the coefficients right, its variance is $1$. 
4. Suppose the sample size is 1000. 
5. Let's consider the second definition of statistical precision: the chance that the simulation correctly detects a real effect. 

First, how big does the estimated coefficient have to be to say that it is real? 
- Perhaps the absolute value of its t-statistic has to be bigger than 1.96. 
- We're using a two-sided 5% test and assuming a large sample size of 1000. 

The **t-statistic** is: 

$$\frac{|\text{estimate}|}{\text{standard error}} \; = \; \frac{|\hat{\beta}|}{\sqrt{\frac{\text{Var}(\text{Residual})}{\text{Var}(X) \cdot n}}} \; = \; \frac{|\hat{\beta}|}{\sqrt{\frac{\text{Var}(\text{Residual})}{4 \cdot 1000}}}$$

Assuming the variance of the residual is approximately equal to the variance of $\varepsilon_Y$, which is 1, this is equal to: 

$$ \quad = \quad \frac{| \; \hat{\beta} \; |} {\sqrt{\frac{1}{4000}}}$$

So: 
$$| \; \hat{\beta} \; | \; > \; 1.96 \times \sqrt{\frac{1}{4000}}$$

will result in a detection. The simulation can then compute $\hat{\beta}$ for many simulated datasets and tell us how often it results in a detection. In a more complicated situation, for example one that also includes a confounder $Z$, the true effect size is still 2, assuming we use the full model including $Z$. 

In this case, computing the variations and the standard error is a bit more complicated because of the multiple equations. Instead of using items 2, 3, and 4 to compute item 5 mathematically, it might be better to simply compute the t-statistic in Python with statsmodels and check whether its absolute value is above 1.96. We can then simulate to decide how often there is a detection.
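
A simulation of this kind might look like the sketch below, which uses the example $X \sim \text{normal}(0, 2)$, $Y = 1 + 2X + \text{normal}(0, 1)$ with $n = 1000$. To keep the example dependency-light, the t-statistic is computed with the textbook simple-regression formulas in NumPy rather than with statsmodels. The function name `simulate_power` and the second scenario with a much smaller effect ($\beta_1 = 0.03$) are assumptions added for contrast.

```python
import numpy as np

def simulate_power(beta_1, n, sd_x=2.0, sd_eps=1.0, n_sims=2000,
                   t_crit=1.96, seed=0):
    """Fraction of simulated datasets in which |t| for the slope exceeds t_crit."""
    rng = np.random.default_rng(seed)
    detections = 0
    for _ in range(n_sims):
        x = rng.normal(0.0, sd_x, size=n)
        y = 1.0 + beta_1 * x + rng.normal(0.0, sd_eps, size=n)

        # Ordinary least squares for a single predictor.
        x_c = x - x.mean()
        sxx = np.sum(x_c**2)
        slope = np.sum(x_c * (y - y.mean())) / sxx
        intercept = y.mean() - slope * x.mean()

        # Standard error of the slope from the residual variance.
        resid = y - intercept - slope * x
        sigma2_hat = np.sum(resid**2) / (n - 2)
        se = np.sqrt(sigma2_hat / sxx)

        if abs(slope / se) > t_crit:
            detections += 1
    return detections / n_sims

power_lesson_example = simulate_power(beta_1=2.0, n=1000)
power_small_effect = simulate_power(beta_1=0.03, n=1000)  # hypothetical case

print(power_lesson_example, power_small_effect)
```

With an effect of 2, the expected t-statistic is about $2 / \sqrt{1/4000} \approx 126$, so every simulated dataset produces a detection. The smaller hypothetical effect sits close to the detection threshold, so only a fraction of the datasets detect it.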


### Reviewing Power Analysis with Simulation
Imagine that a meteorologist is trying to discern when it will rain. A true positive result means that they predict rain, and it does rain. The **true positive rate (TPR)** refers to the likelihood that they detect a true result. Suppose it rains on 50 days, and they predict only 3 of them. Then the TPR would be $3 / 50 = 6\%$. 

On the other hand, the **alpha value** (often 0.05) is meant to be the likelihood that a false result will be incorrectly detected as positive. This is the expected **false positive rate (FPR)**. 

For example: 
- Suppose it doesn’t rain on 50 days, and the meteorologist predicts rain on 2 of them. 
- This would be an FPR of $2 / 50 = 4\%$. 
- The p-value is a characteristic of the sample: it is the probability that results as extreme as our sample (or more extreme) would be found if the null hypothesis (no rain) is true. 
- A “sample” here refers to the rain indicators (humidity, temperature, etc.) on a specific morning. 
- The alpha value, in contrast, is the fraction of these null samples that (on average) we incorrectly judge to be positive: the expected FPR. 
- If the p-value is below alpha, we say that the sample is positive. 

This meteorologist gets a very good FPR of 4%. They’re unlikely to predict rain when it’s actually not going to rain. But they have a very bad TPR of 6%. The upshot is that on the 5 days (out of 100) when they predict that it rains, it only rains on 3 of them. That’s not great — it suggests that we have to know both the FPR and the TPR to understand this meteorologist. 
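
The arithmetic behind these rates is short enough to check directly. The variable names below are illustrative; `share_correct` is simply the fraction of rain predictions that turned out to be right.

```python
rainy_days, predicted_on_rainy = 50, 3  # true positives: 3
dry_days, predicted_on_dry = 50, 2      # false positives: 2

tpr = predicted_on_rainy / rainy_days   # true positive rate
fpr = predicted_on_dry / dry_days       # false positive rate

# Days on which rain was predicted, and how many of those were rainy.
total_predictions = predicted_on_rainy + predicted_on_dry
share_correct = predicted_on_rainy / total_predictions

print(f"TPR = {tpr:.0%}, FPR = {fpr:.0%}")
print(f"{predicted_on_rainy} of {total_predictions} rain predictions were correct")
```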

Both FPR and TPR are important, but FPR is often easier to calculate because the null hypothesis is a known quantity. (The null hypothesis might be that there is no effect at all.) In contrast, the TPR refers to a scenario where the null hypothesis is false; but then, what is true? To simulate the TPR, we’d need to make some assumptions about what is true instead of the null hypothesis. 

Now, let’s apply this to an estimation of an effect: 
- The TPR could tell us how likely we are to detect an effect when there really is one. 
- However, in this context, when we are talking about detecting an effect in a whole statistical analysis, we usually use the phrase “statistical power” instead of TPR. (“TPR” usually refers to the success rate per true sample in a set of samples, while “statistical power” refers to the success rate of the whole analysis.) 

To figure this out, we may have to do a simulation. 

### Balancing Five Components 
Power analysis balances five things: 
1. The **true effect size**: this is the coefficient in a regression equation for $Y$. Assuming the treatment, $X$, is binary, the coefficient of $X$ denotes the causal effect of treating a particular sample instead of not treating it. 
2. The **amount of variation in the treatment**: this is the variance of the treatment $X$. 
3. The **amount of other variation** in $Y$: this is the variation of the residual after explaining $Y$ with $X$, i.e., the variance of $Y_{\text{true}} - Y_{\text{predicted}}$, which is also equal to $1 - R^2$ times the variance of $Y_{\text{true}}$. 
4. The **sample size**. 
5. The **statistical precision**.

**Statistical precision** refers to how accurately we can estimate the effect of the treatment: 
- One way to measure it is with the **standard error of the estimate**, which tells us how much our estimate of the treatment effect would vary if we repeated the study many times. 
- A **smaller standard error** means our estimate is more precise. 
- Another way to think about precision is through **statistical power:** the chance that our simulation correctly detects a real effect (for example, gives a p-value less than 0.05 when the effect is truly nonzero). 
    - If the standard error is small, the power will usually be high because it’s easier to tell that the effect is not due to random noise. 
    - In a simulation, you can compute power by checking how often the test correctly identifies a statistically significant effect across many runs. 

It is possible to compute one given the other four. The one that you compute is likely to be either (4) a minimum sample size, (1) a minimum detectable effect, or (less likely) (5) the statistical precision.

We make the best guesses we can to assume four of the items. The fifth item is simulated. It is common to have a goal of 80%–90% statistical power. 
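
As an illustration of solving for the sample size, the sketch below fixes guesses for the effect size, the treatment variation, and the residual variation, then simulates power over a grid of candidate sample sizes. All numeric values, the candidate grid, and the function names are assumptions chosen for the example: a binary treatment assigned 50:50, a true effect of 0.5, a residual standard deviation of 1, and a target power of 80%.

```python
import numpy as np

def estimate_power(n, effect=0.5, sd_resid=1.0, n_sims=2000,
                   t_crit=1.96, seed=0):
    """Simulated power to detect `effect` with a binary 50:50 treatment."""
    rng = np.random.default_rng(seed)
    detections = 0
    for _ in range(n_sims):
        x = rng.integers(0, 2, size=n).astype(float)  # binary treatment
        y = effect * x + rng.normal(0.0, sd_resid, size=n)

        # Simple-regression slope and its standard error.
        x_c = x - x.mean()
        sxx = np.sum(x_c**2)
        slope = np.sum(x_c * y) / sxx
        resid = (y - y.mean()) - slope * x_c
        se = np.sqrt(np.sum(resid**2) / (n - 2) / sxx)

        if abs(slope / se) > t_crit:
            detections += 1
    return detections / n_sims

def minimum_sample_size(candidates, target_power=0.8):
    """Smallest candidate n whose simulated power reaches the target."""
    for n in candidates:
        if estimate_power(n) >= target_power:
            return n
    return None  # no candidate was large enough

candidates = [25, 50, 100, 150, 200, 400]
n_needed = minimum_sample_size(candidates)
print(n_needed)
```

A finer grid, or a search between the last failing and first passing candidate, would narrow the answer further. The same loop can be turned around to find a minimum detectable effect by fixing `n` and varying `effect`.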

**Note:** When detecting an interaction term, it may be necessary to have tens of thousands of samples in your simulation. 

## 1.3 Lesson: Bootstrap Review
You’ve already attempted a bootstrap simulation in the first few weeks of the course, and earlier in the program. Here is a review of bootstrap simulation. 

The idea is to resample with replacement — collect a group of samples at random, allowing yourself to pick the same sample twice. From this sample, calculate your desired estimates.  

The goal here is very different from the power analysis simulation described above. With the power analysis, we assume that our guesses represent a kind of “ground truth,” whereas with bootstrap simulation, we do not have any ground truth. Instead, our goal is largely to estimate the standard errors of our estimates. We assume that the standard error of the estimate based on resampling from the sample is the same as the true standard error based on resampling from the population. 
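
A minimal sketch of a bootstrap standard error for a sample mean follows. The "observed" sample is itself generated at random here purely so that the example is self-contained; in practice it would be your real data. The analytic standard error of the mean is computed only as a sanity check.

```python
import numpy as np

rng = np.random.default_rng(seed=5)

# Stand-in for an observed sample (assumed here for illustration).
sample = rng.normal(loc=5.0, scale=2.0, size=200)

n_boot = 2000
boot_means = np.empty(n_boot)
for b in range(n_boot):
    # Resample with replacement, keeping the original sample size.
    resample = rng.choice(sample, size=sample.size, replace=True)
    boot_means[b] = resample.mean()

# The spread of the bootstrap estimates approximates the standard error.
bootstrap_se = boot_means.std(ddof=1)
analytic_se = sample.std(ddof=1) / np.sqrt(sample.size)

print(f"bootstrap SE: {bootstrap_se:.4f}, analytic SE: {analytic_se:.4f}")
```

The same loop works for estimators with no convenient formula: replace `resample.mean()` with the estimate of interest.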

Another use of bootstrapping is to create training data for models, so that we can see how the model predictions change based on different training data. This essentially allows us to see the bias and variance of the model. It is also possible to make an ensemble out of the different models, which is a kind of bagging. 

It is recommended to sample hundreds or thousands of times to get the bootstrap estimates right. If there are significant outliers, the bootstrap simulation could converge slowly. For example, if positive outliers randomly fail to appear when you run the simulation, you can mis-estimate the mean or variance. Alternatively, the positive outliers might appear more often than expected. Because there are few of them, their contribution to the variance is large.

## Knowledge Check: Simulation
1. Which of the following would not be a likely purpose of a simulation?
    - Correct: To test the coefficient for a single sample.
    - A simulation uses many samples, not a single sample. Apart from that, the others are valid purposes of simulations.
2. Which of the following describes bootstrap simulation?
    - Correct: It resamples the original sample with replacement.
    - A bootstrap simulation resamples the original sample with replacement; that is, each time it needs a new dataset, it picks individual data points from the original sample, allowing itself to pick the same item multiple times.
3. To perform a typical (non-bootstrap) simulation, we need to assume:
    - Correct: That we understand the data generating process.
    - Simulation does not require that we are matching or that we are using regression. However, we do have to understand how the data is generated in order to imitate that process with our code.