# Week 2 Overview
This week, you’ll explore two key tools for understanding causal relationships in data: fixed effects models and bootstrap simulation. You’ll begin by learning how fixed effects help control for unobserved differences across groups—like companies or soil types—so you can isolate the impact of a single variable of interest. You’ll also practice building your own example that uses fixed effects. Then, you’ll turn to simulation techniques, using bootstrapping to estimate how your results might vary if you had a different dataset. By the end of the week, you’ll have hands-on experience running fixed effects regressions, performing bootstrap simulations, and reflecting on what your results say about real-world data-generating processes (DGPs).

## Learning Objectives
At the end of this week, you will be able to: 
- Perform fixed effects  
- Invent your own example situation that uses fixed effects  
- Perform bootstrap simulation  
- Describe an example of a data-generating process


## Topic Overview: Simulation
**Bootstrapping** is a powerful simulation technique that allows you to estimate how your results might vary if you collected a different sample from the same population. By repeatedly sampling your existing dataset with replacement, you can create many simulated datasets, each slightly different from the original. You then run your analysis, such as calculating a mean or running a regression, on each of these samples. The variation across these results helps you understand the reliability of your estimates, including how much they might fluctuate due to sampling variability. This week, you’ll explore how bootstrapping works, why it’s especially useful when you can’t access more data, and how it helps evaluate the robustness of statistical findings in causal inference.

### Learning Objectives:
- Perform bootstrap simulation
- Describe an example of a data-generating process

## 1.1 Lesson: Bootstrapping
In this short video, we will discuss how bootstrapping helps you estimate the variability of regression coefficients and why resampling your data can reveal how confident you should be in the patterns you’ve found.

### Simulation
Let’s look at another example. 

Say you want to know how well your linear regression will work. You have 1,000 samples, but what if you had a different 1,000 samples? Would you get a similar effect? A very different effect? What is the variance of the effects you’d get? 

A simple way to answer this question is a **Bootstrap Simulation**. This means that you pick 1,000 samples out of the 1,000 samples with replacement. That is, you can pick the same sample twice. Now you can run whatever test you want to run but use the bootstrap sample. For example, suppose that values in our dataset are:

`values = [1,2,3,4,5]`

To do the bootstrap, we pick five of the values at random. But there are only five values, you say! What does it mean to pick “five of them?” We pick them with replacement, meaning that we can pick the same one twice. We get:

`values_bootstrap = [1,2,2,4,4]`

Now, we can estimate some statistic related to these new values. For example, the `mean`:

$$\frac{1 + 2 + 2 + 4 + 4}{5} \; = \; \frac{13}{5} \; = \; 2.6$$

This is different from the original sample mean of 3. If we repeat the resampling many times, we'll get a variety of means. This approximates the range of means we could get from different samples of the population. For example, suppose the true population looks like this:

`values_true = [0, 1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 7]`

Then, the true distribution of possible means would come from repeatedly drawing samples from that population.

Instead, we're only taking samples from our sampled population, or `[1,2,3,4,5]`.

However, it turns out that for large samples, the distribution of a statistic across bootstrap samples is usually close to the distribution we would get by drawing fresh samples from the population as a whole. So, we can use the bootstrap to get a handle on how much our estimates would vary.
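The claim above can be checked in a small simulation. The sketch below is only an illustration: the normal population, its parameters, and the sample size of 1,000 are all assumptions chosen for the example. It compares the spread of the mean across fresh samples from the population with the spread across bootstrap samples of a single observed sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population (an assumption): 100,000 values from a normal distribution.
population = rng.normal(loc=3.0, scale=2.0, size=100_000)

# The single sample we actually get to observe.
sample = rng.choice(population, size=1_000, replace=False)

n_reps = 2_000

# What we usually cannot do: draw many fresh samples from the population.
fresh_means = np.array([
    rng.choice(population, size=1_000, replace=True).mean()
    for _ in range(n_reps)
])

# What the bootstrap does: resample the observed sample with replacement.
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(n_reps)
])

print("spread of means across fresh samples:    ", fresh_means.std(ddof=1))
print("spread of means across bootstrap samples:", boot_means.std(ddof=1))
```

The two spreads should come out similar. Note that the bootstrap means are centered on the mean of the observed sample, not on the population mean, so the bootstrap is informative about variability rather than about where the true value lies.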

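Bootstrap sampling might not work well with some data, such as data drawn from heavy-tailed distributions like the Pareto distribution mentioned previously.

That is, with the Pareto distribution, when we sample the original population 1,000 times, we will likely get some pretty weird items in our sample, such as astronomically tall trees. A bootstrap sample can only reuse values that are already in the data, so it can never contain an extreme value that the original sample happened to miss.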
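A minimal sketch of this limitation, assuming a classical Pareto distribution with shape 1.1 and minimum value 1 (both numbers are arbitrary choices for illustration): the largest value in any bootstrap sample is capped by the largest value we observed, while fresh samples from the distribution have no such cap.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical heavy-tailed data (assumed shape and minimum).
# rng.pareto draws from the Lomax form; adding 1 gives a classical Pareto with minimum 1.
shape = 1.1
sample = 1.0 + rng.pareto(shape, size=1_000)

# Largest value in each of 1,000 bootstrap samples of the observed data.
boot_max = np.array([
    rng.choice(sample, size=sample.size, replace=True).max()
    for _ in range(1_000)
])

# Largest value in each of 1,000 fresh samples from the same distribution.
fresh_max = np.array([
    (1.0 + rng.pareto(shape, size=1_000)).max()
    for _ in range(1_000)
])

print("observed maximum:            ", sample.max())
print("largest bootstrap maximum:   ", boot_max.max())
print("largest fresh-sample maximum:", fresh_max.max())
```

The bootstrap maxima never exceed the observed maximum, whereas the fresh-sample maxima will typically include far larger values. Statistics that depend on the tail are therefore the ones the bootstrap is most likely to misjudge.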

Usually, with a big enough sample, bootstrapping gives a reasonably good estimate of what would happen if we sampled the original population many times. In particular, it can estimate the variance of any statistic. It can give us the mean of a statistic, too, but that mean will be close to the value of the statistic in the original sample, so it adds little new information. That is, the 2.6 value above differs from the sample mean of 3, but if we repeated the resampling many times, the bootstrap means would average out to about 3.
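The small example can be repeated many times in a few lines. This is a sketch using NumPy; the number of repetitions (10,000) and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)
values = np.array([1, 2, 3, 4, 5])

# Draw 10,000 bootstrap samples (size 5, with replacement) and record each mean.
boot_means = np.array([
    rng.choice(values, size=values.size, replace=True).mean()
    for _ in range(10_000)
])

print("sample mean:", values.mean())
print("average of the bootstrap means:", boot_means.mean())        # close to 3
print("std of the bootstrap means:", boot_means.std(ddof=1))       # close to sqrt(2 / 5)
```

Any single bootstrap mean, such as 2.6, can differ from 3, but the average over many repetitions sits close to 3. The useful output is the spread: the values 1 to 5 have variance 2, so the mean of five draws has a standard deviation of about $\sqrt{2/5} \approx 0.63$.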

We can use bootstrap samples to find the mean and variance of any statistic. For instance, we could run a linear regression on each bootstrap sample and record the coefficients. Then, we could find the mean and standard deviation of each coefficient across the samples. This could tell us, for instance, how likely the coefficient is to be zero (or rather, above or below zero), which is an important question for power analysis. 
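As a sketch of the regression case, the code below bootstraps the slope of a simple linear regression. The data are simulated from an assumed relationship, y = 2 + 0.5x plus noise, purely for illustration; with real data you would resample the rows of your own dataset instead.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical data (all numbers are assumptions): y = 2 + 0.5 * x + noise.
n = 1_000
x = rng.uniform(0, 10, size=n)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=n)

n_reps = 2_000
slopes = np.empty(n_reps)
for i in range(n_reps):
    idx = rng.integers(0, n, size=n)  # row indices drawn with replacement
    # np.polyfit returns coefficients from highest degree down: slope, then intercept.
    slope, intercept = np.polyfit(x[idx], y[idx], deg=1)
    slopes[i] = slope

low, high = np.percentile(slopes, [2.5, 97.5])
print("bootstrap mean of the slope:", slopes.mean())
print("bootstrap std of the slope: ", slopes.std(ddof=1))
print("central 95% of bootstrap slopes:", low, "to", high)
print("share of bootstrap slopes above zero:", (slopes > 0).mean())
```

Resampling whole rows keeps each x paired with its own y. The share of bootstrap slopes above zero is one rough way to read off how confident we can be about the sign of the coefficient.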

**Power analysis** is a method used to estimate how likely a statistical test is to detect a true effect, if one exists. In other words, it helps you assess whether your study has enough data to confidently find meaningful results. We’ll address power analysis in a later week.

### Knowledge Check: Bootstrapping
1. Which of the following best describes a simple example of a data-generating process (DGP)?
- Correct: Assuming that each student's test score is determined by their study time plus random variation. 
- A DGP describes the underlying relationship between variables — how the data is produced in the real world. For example, if test scores are generated by study time and some random noise, that setup defines a DGP. Simply running a regression or resampling data without understanding how outcomes are generated does not capture the essence of a DGP.
2. When performing a bootstrap simulation, what is the purpose of sampling with replacement from the original dataset?
- Correct: Allow the same observation to be selected multiple times, creating new samples to estimate variability.
- The purpose of sampling with replacement in a bootstrap simulation is to create new samples by allowing observations to appear more than once. This process lets you estimate how much a statistic, like a mean or regression coefficient, might vary if you had collected a different sample. Bootstrapping does not generate a new population or correct missing data — it resamples from the data you already have.
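The data-generating process from question 1 can be written down as a short simulation. This is a sketch only: the intercept of 50, the effect of 5 points per study hour, and the noise level are made-up values, and `generate_scores` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(3)

def generate_scores(n_students):
    """Simulate one dataset from the assumed DGP: score = 50 + 5 * hours + noise."""
    study_hours = rng.uniform(0, 8, size=n_students)
    noise = rng.normal(0, 5, size=n_students)
    scores = 50 + 5 * study_hours + noise
    return study_hours, scores

hours, scores = generate_scores(500)

# A regression on the simulated data should roughly recover the effect built into the DGP.
slope, intercept = np.polyfit(hours, scores, deg=1)
print("estimated effect of one extra study hour:", slope)
```

Because we wrote the DGP ourselves, we know the true effect and can check whether an analysis recovers it. With real data the DGP is unknown, which is why assumptions about it matter.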


## Topic 2: Fixed Effects
Fixed effects models are a powerful tool for controlling group-level differences that could confound your analysis. When comparing outcomes across different groups—like companies, classrooms, or regions—each group may have its own baseline level that influences the results. Fixed effects allow you to account for these unobserved, constant differences by giving each group its own intercept while estimating a common relationship between your independent and dependent variables. You’ll learn how fixed effects work through both conceptual examples and hands-on regression exercises. You’ll also practice inventing your own example to better understand when and why fixed effects are used in causal inference.

### Learning Objectives:
- Perform fixed effects
- Invent your own example situation that uses fixed effects

### 2.1 Lesson: Between and Within Variation
In this video, we’ll discuss how fixed effects can help isolate the true relationship between variables when comparing data across different groups and how controlling for group-level differences — like baseline productivity across companies — leads to more accurate estimates in your regression model.

#### Fixed Effects
Let's say we're trying to understand how training hours impact employee productivity in a certain industry, measured in number of sales. How do we account for company-level differences? 

$\text{Company level differences} \; = \; \text{Base level of productivity} \; +/- \; \text{Training Hours}$

A simple way to model this might be to run a simple linear regression for each company. For example:

`Y_company_1 = a_company_1 * x_company_1 + b_company_1`

`Y_company_2 = a_company_2 * x_company_2 + b_company_2`

and so on...

where `Y_company_1` is the number of sales at company 1,

`x_company_1` is the number of training hours,

`a_company_1` is the slope,

`b_company_1` is the intercept, representing the base productivity level.

What if we believe that the effect of training hours on sales is the same for each company? Then the intercepts vary, but there is only one common slope coefficient, `a`.

That is,

`a_company_1 = a_company_2 = a_company_3 = ... = a`

(The slope `a` represents the universal effect of training hours: for every additional hour of training, productivity increases by the same amount.)

We can now pool all the data together to estimate a more precise slope.

Why does this help? When we believe there's a common pattern, we can borrow statistical power across companies.

#### Fixed Effects
- Transform the data by subtracting each company's mean
- Remove company-specific intercepts
- Estimate a common slope
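The three steps above can be sketched with pandas and NumPy. Everything about the data here is an assumption made for illustration: five companies, a true common slope of 2, and a setup in which companies with higher base productivity happen to train less, so that ignoring company differences gives a misleading answer.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(11)

# Hypothetical DGP: sales = company base productivity + 2 * training hours + noise.
n_companies, n_per_company, true_slope = 5, 200, 2.0
base = np.array([10.0, 20.0, 30.0, 40.0, 50.0])  # company-specific intercepts
company = np.repeat(np.arange(n_companies), n_per_company)
# Companies with a higher base train less here, which confounds a naive regression.
hours = rng.uniform(0, 10, size=company.size) + (10 - 2 * company)
sales = base[company] + true_slope * hours + rng.normal(0, 1, size=company.size)
df = pd.DataFrame({"company": company, "hours": hours, "sales": sales})

# Step 1: subtract each company's mean from its own observations.
cols = ["hours", "sales"]
demeaned = df[cols] - df.groupby("company")[cols].transform("mean")

# Steps 2 and 3: the intercepts are gone, so a regression through the origin
# on the demeaned data gives the common slope.
fe_slope = (demeaned["hours"] * demeaned["sales"]).sum() / (demeaned["hours"] ** 2).sum()

# For comparison: a naive regression that ignores company differences.
naive_slope, _ = np.polyfit(df["hours"].to_numpy(), df["sales"].to_numpy(), deg=1)

print("fixed effects slope:", fe_slope)
print("naive pooled slope: ", naive_slope)
```

In this setup the fixed effects slope should land near the true value of 2, while the naive slope is pulled far away from it because it mixes the effect of training with differences in base productivity between companies.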

Another example:

Suppose that farmers grow plants in three types of soil: 
- sandy 
- loamy 
- clayey 

In each soil, they can add a certain amount of water. 

The growth of the plants is related to: 
- (1) the type of soil
- (2) the amount of water added. 

Let’s take a very simple perspective where the soil adds a fixed constant to the growth, and then there is a linear relationship with the amount of water added. 

Thus, the sandy soil has a relationship:

$Y = \beta_{\text{sandy}} + \beta_1 \cdot \text{water amount} + \epsilon_{\text{sandy}}$

The loamy soil has the relationship:

$Y = \beta_{\text{loamy}} + \beta_1 \cdot \text{water amount} + \epsilon_{\text{loamy}}$

And the clayey soil has the relationship:

$Y = \beta_{\text{clayey}} + \beta_1 \cdot \text{water amount} + \epsilon_{\text{clayey}}$

Several assumptions in this model are implausible:

- Does the water matter exactly as much in all three cases?
    - Surely, the water has less effect in the sandy soil, so that more is needed to produce the same effect.
    - In the clayey soil, the same amount of water will pool and have even more effect.

Furthermore, it seems likely that below a given threshold, the plant simply dies, and the growth is zero. And above a given threshold, the plant also dies because it drowns. So, things are much more complicated than our model! 

Still, let’s say that we are careful to water the plants within a narrow range: not too much and not too little. In this range, there may be an approximately linear relationship between the amount of water and the plant growth. In any case, you’d be right to poke holes in the example, because it’s important to find problems with assumptions. But let’s run with this example for now. 

Now, one way to solve this problem is to take the data from each soil type and average the equation over that soil type. Then, subtract the averaged equation from the original equation. So, for the clayey soil, for example, we have:

$\langle Y \rangle \; = \; \beta_{\text{clayey}} + \beta_1 \; \cdot \; \langle \text{water amount} \rangle$

where the average removes $\epsilon$ because its mean is zero. Then, subtracting:

$Y - \langle Y \rangle \; = \; \beta_1 \; \cdot \; (\text{water amount} - \langle \text{water amount} \rangle) + \epsilon_{\text{clayey}}$

The soil-specific intercept cancels, while the noise term of each individual observation remains. This lets us mix the $Y - \langle Y \rangle$ terms from the different soil types and mix the $\text{water amount} - \langle \text{water amount} \rangle$ terms, because they all have the same coefficient $\beta_1$. We can then run a linear regression on all the data and find $\beta_1$.
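The derivation can be checked numerically. The sketch below simulates the soil example with made-up values (the three intercepts, $\beta_1 = 1.5$, and the watering range are all assumptions) and estimates $\beta_1$ in two ways: from the demeaned data, and from a regression with one indicator column per soil type. The two estimates should agree, since both fit the same model.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)

# Hypothetical values: soil intercepts, beta_1 = 1.5, water kept in a narrow range.
soil_intercept = {"sandy": 2.0, "loamy": 5.0, "clayey": 8.0}
beta_1 = 1.5
soil = np.repeat(list(soil_intercept), 100)
water = rng.uniform(1.0, 3.0, size=soil.size)
noise = rng.normal(0, 0.5, size=soil.size)
growth = np.array([soil_intercept[s] for s in soil]) + beta_1 * water + noise
df = pd.DataFrame({"soil": soil, "water": water, "growth": growth})

# Within transformation: Y - <Y> and water - <water>, computed separately per soil type.
y_dm = df["growth"] - df.groupby("soil")["growth"].transform("mean")
w_dm = df["water"] - df.groupby("soil")["water"].transform("mean")
beta_within = (w_dm * y_dm).sum() / (w_dm ** 2).sum()

# Equivalent formulation: one indicator column per soil type plus the water column.
dummies = pd.get_dummies(df["soil"]).to_numpy(dtype=float)
X = np.column_stack([dummies, df["water"].to_numpy()])
coef, *_ = np.linalg.lstsq(X, df["growth"].to_numpy(), rcond=None)
beta_dummies = coef[-1]

print("beta_1 from demeaned data:   ", beta_within)
print("beta_1 from soil indicators: ", beta_dummies)
```

The first three entries of `coef` are the estimated soil intercepts, which the demeaning approach removes without ever estimating them.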


```python
# Requires the pyfixest package (for example, `pip install pyfixest`);
# without it, the last import raises ModuleNotFoundError.
import pandas as pd
import numpy as np
import sqlite3
import datetime as dt
import pyfixest as pf
```