# Course Overview

This course will not delve into the often-discussed differences between **Bayesian inference** and **Frequentism**, the two major paradigms in statistics. The so-called "statistics wars" are largely over, and these debates are mostly confined to statisticians. Our primary interest lies in justifying our statistical procedures, whether they are Bayesian or not. This focus leads us to a field called **Causal Inference**, which is a diverse field with a variety of tools. 

Often, statistics is taught in the absence of scientific models. In this course, we will not follow that approach. Statistical models, which are devices that process data to produce estimates, require a firm logical connection to a scientific (causal) model to produce scientific insights. Within a scientific model, changing a variable changes only some other variables. The variables whose changes propagate in this way are what we call **causes**. For any given sample, a statistical analysis can find any cause it wants in the absence of a causal model. Therefore, the reasons for a statistical analysis are not found in the data themselves. We cannot troll through the numbers and come up with a theory. Instead, we put causes into the design of a statistical model that gives us the estimate we are after. The causes of the data cannot be extracted from the data alone; we need additional causal assumptions. In other words, **no causes in, no causes out**.

# What is causal inference?

1. **Correlation and Causation**: Correlation does not imply causation, and causation does not imply correlation either. The basic problem is that an association between two variables runs in both directions, so it cannot tell us which variable is the cause. This is what we mean when we say "correlation does not imply causation". Correlation is also a very limited measure of association: variables can be associated and still have no correlation. Associations are bidirectional and carry no causal information; they are just statistical measures.

2. **Causal Inference**: Causal inference is about predicting the consequences of an intervention. It's not just about predicting what happens in the absence of the intervention, or predicting associations of the features, but about predicting the impact of changing one variable on other variables. This can be thought of as imputation of a missing observation.

> **Example 1**: When you look outside on a windy day, the trees sway in the wind. We know that the wind causes the trees to sway. The variables `Movement of the trees` and `The presence of the wind` are statistically associated, but nothing about that statistical association tells us the causal information. This has implications for hypothetical interventions. When we know the cause of the swaying of the trees, we are able to predict the outcome of a particular intervention, such as adding wind to make them sway. But climbing the tree and swaying the branches does not create wind. So, causal inference asks "What if I do this?", and it is very different from pure or raw prediction.

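To make the asymmetry in Example 1 concrete, here is a minimal Python sketch. The probabilities (0.5, 0.9, 0.1) and the names `simulate`, `do_wind`, and `do_sway` are assumptions invented for illustration, not part of the course material. The observed association runs both ways, but only one of the two interventions changes the other variable.

```python
import random

def simulate(n, rng, do_wind=None, do_sway=None):
    """Simulate n days as (wind, sway) pairs. Wind causes sway, never the reverse."""
    days = []
    for _ in range(n):
        # Wind blows on about half of the days unless we set it by intervention.
        wind = do_wind if do_wind is not None else (rng.random() < 0.5)
        if do_sway is not None:
            sway = do_sway  # someone climbs the tree and shakes the branches
        else:
            sway = rng.random() < (0.9 if wind else 0.1)
        days.append((wind, sway))
    return days

def share(days, index):
    """Fraction of days on which the variable at `index` is True."""
    return sum(d[index] for d in days) / len(days)

rng = random.Random(1)
observed = simulate(10_000, rng)

# Association: knowing that the trees sway tells us a lot about the wind.
p_wind_given_sway = share([d for d in observed if d[1]], 0)

# Interventions: adding wind moves the trees, shaking the trees adds no wind.
p_sway_do_wind = share(simulate(10_000, rng, do_wind=True), 1)
p_wind_do_sway = share(simulate(10_000, rng, do_sway=True), 0)

print("P(wind | sway observed):", round(p_wind_given_sway, 2))
print("P(sway | do(wind)):     ", round(p_sway_do_wind, 2))
print("P(wind | do(sway)):     ", round(p_wind_do_sway, 2))
```

Under these assumed numbers, the first two quantities should come out high, while the third should stay near the baseline frequency of wind.
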
> **Example 2**: When we know the cause, we can construct unobserved counterfactual outcomes. We might ask, "What if something else had happened?" For example, "What if China landed on the moon?" Of course, that's not the kind of question that we can answer, but there are much simpler systems where we can predict outcomes. This ability to construct and analyze hypothetical scenarios is a crucial aspect of causal inference.

3. **Causal Inference, Description, and Research Design**: Causal inference is one kind of research goal, but it is intimately related to "Description" and "Research Design". What binds these three together is that all of them depend upon a ***scientific model*** to be conducted effectively. So, causal inference can only be done if we have a causal model. Descriptive studies also depend upon causal knowledge, because the sample that we use to conduct the description has causes of its own. The research project design also depends upon some causal knowledge about the system we are designing.

![image.png](attachment:image.png)

### Causes Are Not Optional

Let’s take a moment to discuss the concept of description in research. It’s not always clear to everyone that even though descriptive studies don’t make causal claims, they still require understanding of causal modeling and inference. This is because the sample, which is often different from the population, is influenced by various factors. ***These factors need to be understood using causal logic, whether you’re designing or calculating around them***. 
> So, even if your goal is to describe the population, you still need to model the causes of the sample and why it differs from the population.

![image.png](attachment:image.png)

# DAGs (Directed Acyclic Graphs) or Causal Diagrams
#### These are highly abstract causal models. The only information in a DAG is the names of variables and their causal relationships, represented by arrows.

> **Understanding DAGs:**
>> An arrow in a DAG indicates that if you change a variable at the start of the arrow, it will also induce a change in the variable at the end of the arrow, but not in the reverse. For example, if you look at `X` & `Y`, there's an arrow going from `X` to `Y`. Changing `X` would change `Y` according to this diagram. If you change `Y`, `X` would not change.
>
> **Interventions and DAGs:**
>> These arrows represent interventions. A DAG tells you the consequences of a hypothetical intervention. We can use DAGs to figure out which statistical models we need to answer particular specified questions about the variables in the graph and about particular hypothetical interventions.
>
> **Assumptions in DAGs:**
>> DAGs don't make any specific assumptions about the functional relationships between these variables; they just name influences. They don't assume that things are linear, and by default they allow all variables to interact. In a DAG, everything is potentially moderation.
>
> **Usefulness of DAGs:**
>> DAGs are useful for lots of reasons. They answer very general questions about what we can decide without making additional assumptions. Eventually, we will make additional assumptions about the functional relationships between these variables, which will give us more scientific power. But the DAG has usefulness even after we do that.

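Since a DAG holds nothing but variable names and arrows, it can be sketched as a plain dictionary. The example below is a hypothetical three-variable graph (`X -> Y -> W`; the variable `W` and the helper `affected_by` are made up for illustration). It lists which variables could change under an intervention on a given variable by following the arrows forward only.

```python
# Hypothetical DAG: X -> Y -> W. Each key maps to its direct effects (children).
dag = {"X": ["Y"], "Y": ["W"], "W": []}

def affected_by(dag, variable):
    """Return every variable downstream of `variable` (its descendants)."""
    seen = set()
    stack = list(dag[variable])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(dag[node])  # keep following arrows forward
    return seen

for name in dag:
    print(f"Intervening on {name} can change: {sorted(affected_by(dag, name))}")
```

Intervening on `X` reaches `Y` and `W`, while intervening on `Y` never reaches `X`, which is the directionality the arrows encode.
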
![image.png](attachment:image.png)

> **Breaking Down DAGs:**
>> In most research, we're interested in the causal effect of one variable, `X`, on another, `Y`. We call `X` the treatment and `Y` the outcome. This relationship goes in one direction. There are other variables in the world that influence `X` & `Y`, and we need to draw these influences to study the relationship between `X` & `Y`. For instance, there are variables like `B`, which are not competing causes of `Y` like `X`, but can be useful to measure. There are variables like `A` which are influences on the treatment, and variables like `C` which are common causes of `X` & `Y`. `C` is a confound, a variable that we would want to control for in a statistical analysis to correctly estimate the relationship between `X` & `Y`.
>
> **Relationships Among Variables:**
>> Variables like `C`, `A`, & `B` have relationships among themselves, and these relationships can confuse our regression strategies.
>
> **Different queries, Different Models:**
>> A causal model like a DAG allows you to ask multiple questions of it. You don't only have to ask about the influence of `X` on `Y`, you can also ask about the influence of `A` on `Y` and so on. Each causal query will imply a different statistical procedure, different estimate. In many cases, you will not be able to use one statistical model to answer all of those causal queries.
>
> **Control Variables:**
>> It comes down to the issue of choosing what some fields call control variables. There are good controls: variables you want to add because adding them controls for some confounding influence and lets you correctly measure causal influence. But it's not safe to just add everything and see what happens to the coefficients, because there are also bad controls: variables that, when added to the model, create bias and mess up your estimates.
>
> **DAGs are intuition pumps:**
>> A DAG provides a clear route for testing and refining the causal model because it's logically specified and you can deduce its implications. DAGs are intuition pumps. They get the researcher's head out of the data, out of the numbers and into the science, and then we can go back into the data and make more sense of it.

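As a rough illustration of a good control, the sketch below simulates a common cause `C` of `X` and `Y` and compares a naive contrast with one stratified by `C`. All numbers (a true effect of 1.0, a confounding effect of 2.0, the treatment probabilities) and the function names are assumptions chosen for this example; stratification on a binary confound is used here as one simple way to "control", not as the course's method.

```python
import random
from statistics import mean

def simulate(n, rng, effect=1.0):
    """Generate (c, x, y) rows where C causes both X and Y, and X causes Y."""
    rows = []
    for _ in range(n):
        c = rng.random() < 0.5
        x = rng.random() < (0.8 if c else 0.2)      # C influences treatment
        y = effect * x + 2.0 * c + rng.gauss(0, 1)  # C also influences outcome
        rows.append((c, x, y))
    return rows

def diff_in_means(rows):
    """Mean of Y among treated rows minus mean of Y among untreated rows."""
    treated = [y for _, x, y in rows if x]
    untreated = [y for _, x, y in rows if not x]
    return mean(treated) - mean(untreated)

def stratified_diff(rows):
    """Average the X contrast within each level of C, weighted by stratum size."""
    total = 0.0
    for level in (False, True):
        stratum = [r for r in rows if r[0] == level]
        total += diff_in_means(stratum) * len(stratum) / len(rows)
    return total

rng = random.Random(2)
rows = simulate(20_000, rng)
naive = diff_in_means(rows)
adjusted = stratified_diff(rows)
print("naive contrast (ignores C):", round(naive, 2))
print("stratified by C:           ", round(adjusted, 2))
```

With these assumed values, the naive contrast should overstate the effect of `X`, while the stratified contrast should land near the true value of 1.0 that was put into the simulation.
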

# Golems

![image.png](attachment:image.png)

### Statistical Models as Golems
The story of the Golem of Prague serves as a powerful metaphor for understanding the nature and use of statistical models. In the 16th century, Rabbi Loew, according to legend, used magic to construct a clay robot, a golem, to defend the Jewish community against discrimination and blood libel. The golem, while powerful, had no wisdom or foresight and merely executed the instructions given to it, leading to unintended harm. Today, we're surrounded by 'golems' of all kinds, not just physical robots, but also software and statistical models. These modern golems, like their clay predecessor, are built for particular tasks but are blind to our intent. They execute the instructions we give them without understanding the intent behind those instructions. If they're not used wisely and in the right context, they can do severe damage.

> Many people learn to use flow charts and select statistical tests for testing a null hypothesis in their introductory stats courses. This approach, while not inherently bad, is more suited to basic quality control and experimental science than to training research scientists. Each statistical test, like **Spearman's rank correlation** or a **t-test**, is useful but also extremely narrow. This narrowness presents a limiting picture of statistics. It's important to note that this has nothing to do with old Boomer arguments about Bayes versus frequentism. There are Bayesian versions of every one of these procedures. The tradition of teaching students and researchers to use these isolated tests, and teaching them that the sole goal of statistics is to reject a null hypothesis, is not always useful, especially in most of research science. In such cases, it's typically not possible to define a clear and sensible null hypothesis that can be rejected. Instead, we must design multiple process models and study their implications.


![image-2.png](attachment:image-2.png)


> **Rejecting null hypothesis:**
>> Neanderthals, very similar to humans, once lived in Europe and the Near East. All humans outside of Africa carry Neanderthal DNA, suggesting interbreeding between Neanderthals and modern humans. This interbreeding model proposes that when modern humans left Africa, they interacted with Neanderthals. However, an alternative hypothesis is that what appears to be Neanderthal DNA is due to ancient population substructure. This means we could share what looks like Neanderthal DNA with Neanderthals because both groups got it from now-extinct southeastern African populations. Both hypotheses are consistent with rejecting the null hypothesis of no Neanderthal DNA in modern humans outside Africa. These hypotheses need to be tested against each other using process models.

![image-3.png](attachment:image-3.png)

- **Null Hypothesis Framework**
  - The null hypothesis framework can be limiting in realistic research contexts. We need to think in terms of scientific models and how to connect them to data, from design through to analysis.

> **Generative Causal Models**
  > - We need generative causal models, not just Directed Acyclic Graphs (DAGs).
  > - DAGs don't have enough details to be generative.
  > - Generative means you can simulate data from the model.

**Statistical Models Justified by generative models and questions (`estimand`):**
<br/> We will write statistical models that analyze the synthetic data to produce estimates of the targets we care about (the estimands). Once we're sure that a model works in principle on synthetic data, we'll introduce the real data.

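The idea of checking a statistical model on synthetic data before touching real data can be sketched as follows. This is a deliberately simple, non-Bayesian stand-in: the generative model (a straight line with Gaussian noise), the least-squares estimator, and the names `simulate` and `estimate_slope` are all assumptions made for illustration. Because we choose the true slope ourselves, we can see whether the estimator recovers it.

```python
import random

def simulate(n, slope, rng):
    """Generative model: X is standard normal and Y = slope * X + noise."""
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [slope * x + rng.gauss(0, 1) for x in xs]
    return xs, ys

def estimate_slope(xs, ys):
    """Ordinary least-squares slope of Y on X."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

rng = random.Random(3)
recovered = {}
for true_slope in (0.0, 0.5, 2.0):
    xs, ys = simulate(5_000, true_slope, rng)
    recovered[true_slope] = estimate_slope(xs, ys)
    print(f"true slope {true_slope:.1f} -> estimate {recovered[true_slope]:.2f}")
```
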
---

In system modeling, we're interested in the relationship between two variables, `X` and `Y`. We also have other variables, and the question arises: should we use any of them in the analysis? This question **cannot be answered without a generative or causal model**.

There are many regression models that use associations with multiple other variables. For example, we could have a model where we only examine the association between `Y` and `X`, or we could add variable `A`, or look at `A` and `B` together with `X`, and so on. The relationships among these variables can cause problems when we add control variables.

One of the most common questions in applied statistics is which covariates to use as controls. This **cannot be decided without at least a DAG or, preferably, a more detailed generative model** that specifies the shape of the relationships.

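To see why adding a control can hurt, consider one well-known kind of bad control, a common effect (collider). In the hypothetical simulation below, `X` has no effect on `Y` at all, but both influence a third variable `Z`. The variable `Z`, the thresholds, and the function names are assumptions made up for this sketch; they do not come from the course's own diagram.

```python
import random
from statistics import mean

def simulate(n, rng):
    """Generate (z, x, y) rows where X and Y are independent and both cause Z."""
    rows = []
    for _ in range(n):
        x = rng.random() < 0.5
        y = rng.gauss(0, 1)                     # X has no effect on Y
        z = (x + y + rng.gauss(0, 0.5)) > 0.5   # Z is caused by X and Y
        rows.append((z, x, y))
    return rows

def diff_in_means(rows):
    """Mean of Y among treated rows minus mean of Y among untreated rows."""
    treated = [y for _, x, y in rows if x]
    untreated = [y for _, x, y in rows if not x]
    return mean(treated) - mean(untreated)

def stratified_diff(rows):
    """Average the X contrast within each level of Z, weighted by stratum size."""
    total = 0.0
    for level in (False, True):
        stratum = [r for r in rows if r[0] == level]
        total += diff_in_means(stratum) * len(stratum) / len(rows)
    return total

rng = random.Random(5)
rows = simulate(20_000, rng)
naive = diff_in_means(rows)
with_bad_control = stratified_diff(rows)
print("contrast without Z:", round(naive, 2))
print("contrast within Z: ", round(with_bad_control, 2))
```

Here the contrast without `Z` should sit near the true value of zero, while stratifying by `Z` should manufacture a clearly negative association that the generative model never contained.
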
What we want to do is analyze the graph so that we can deduce, given the assumptions in this graph, which control variables are good and which are bad. In a particular example, which we won't explain today but will show in a future lecture, the correct adjustment set, as it is called, consists of the variables `B` and `C`: we stratify by `B` and `C` when examining the relationship between `X` and `Y`.

![image.png](attachment:image.png)

After using the DAG and analyzing it, we have our adjustment set. However, this is not enough. We need a generative version of the causal model to design and debug our code.

> Then, we need a strategy to create an estimate. This involves coping with finite data to study something about a population that could potentially produce infinite data. We also need to properly characterize the uncertainty in the estimate we produce. The easiest approach is `Bayesian data analysis`. We use it not out of some kind of software commitment, but as a committed scientist. Bayesian data analysis allows us to take the generative assumptions in our scientific models and confront them with data with the least fuss.


### Bayes is practical, not philosophical
While **Bayesian statistics** may seem overkill for simple examples, it proves highly practical for complex, real-world analytical challenges. These include dealing with **measurement errors**, **missing data**, and **regularization**. These are not exotic problems, but routine issues in scientific research, especially at the cutting edge.

Bayesian methods have grown more popular in research in recent decades due to their practical utility, not philosophical commitments. One of the key advantages of Bayesian modeling is that Bayesian models are **generative**. They can simulate data like a causal model, allowing us to express our Bayesian statistical models in close identity with the causal models of interest.

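A small sketch of what "generative" means for a Bayesian model: the same assumptions that define the likelihood can also simulate data. The example below uses a made-up problem (estimating a proportion from binary outcomes), a flat prior, and a grid approximation; the names `simulate_successes` and `grid_posterior` and all numbers are assumptions for illustration only.

```python
import random
from math import exp, log

def simulate_successes(p, n, rng):
    """Generative direction: draw n binary outcomes with success probability p."""
    return sum(rng.random() < p for _ in range(n))

def grid_posterior(successes, n, grid_size=199):
    """Inferential direction: posterior over p on a grid, assuming a flat prior."""
    grid = [(i + 1) / (grid_size + 1) for i in range(grid_size)]  # interior of (0, 1)
    log_like = [successes * log(p) + (n - successes) * log(1 - p) for p in grid]
    top = max(log_like)                         # subtract the max for numerical stability
    unnorm = [exp(ll - top) for ll in log_like] # flat prior: posterior is prop. to likelihood
    total = sum(unnorm)
    return grid, [u / total for u in unnorm]

rng = random.Random(4)
true_p, n = 0.7, 500
successes = simulate_successes(true_p, n, rng)   # synthetic data from the model
grid, posterior = grid_posterior(successes, n)   # the model confronted with that data
posterior_mean = sum(p * w for p, w in zip(grid, posterior))
print("simulated successes:", successes, "of", n)
print("posterior mean of p:", round(posterior_mean, 3))
```

The posterior is a full distribution over `p`, so it describes the uncertainty in the estimate as well as its location.
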
![image.png](attachment:image.png)


## Statistics wars are over 

Bayesian statistics, once controversial, is now mainstream and no longer taboo. It's even seen as prestigious in fields like biology. The point is, we should move beyond the **Bayes-Frequentist debate**.

However, there's a lag in the university curriculum. Most places lack dedicated teaching slots for applied Bayesian data analysis. But it's becoming more common as researchers use Bayesian methods more than their teachers did. This creates a feeling of uncertainty, but the curriculum will catch up.

A lot of the research, innovation, and action is now in **machine learning**. They have their own battles to fight, so we can let the remaining combatants fight about basic frequentism. We have our own battles.

![image.png](attachment:image.png)

# Owls
Now we will talk about owls. Some of you have seen this internet joke about how to draw an owl. The joke begins by saying, well, step one you draw some circles, one for the head and one for the body. And then step 2 is you draw the rest of the owl.

### The Problem with Current Teaching Methods
*We often teach computational tasks in research, not just statistics but all kinds of programming and technical things, in a similar way. There's some guide that tells you how to do the initial steps, and then a bunch of steps are skipped until you arrive at the finished result. It all seems to go by really fast.*

### Our Approach
> So we want to move more slowly. We want to draw out all the intermediate steps in drawing the owl so that the student has some hope of finding out which step they're having trouble with, and of learning effectively. This naturally takes more time, both from the teacher and from the students. But it's much more successful to get all the steps when you want to draw the owl.

![image.png](attachment:image.png)

### Documenting Our Steps
*We're interested in documenting our steps of drawing the owl, so to speak. And what this means is we're going to have an explicit workflow where we:*

```markdown
1. Set up our code so that we have our generative simulation,
2. Write an estimator,
3. Validate that estimator using the simulated data,
4. Analyze real data,
5. Go further with other "step fives": we can reuse the step one code to do things like compute hypothetical interventions and other very useful tasks.
```



**The Problem with Current Scientific Data Analysis**

> "Scientific data analysis is a very bizarre kind of software engineering. It's like software engineering done by amateurs who haven't been taught anything about software engineering, right? This is an unfortunate state of affairs, but most data analysis in the sciences is now done with scripting. There are some people still using point-and-click methods, and that's terrible for reasons I'll get to."

**The Importance of Quality Control and Assurance**

Scripting is a kind of programming, albeit a simple one, and you should approach it as such. The software should be tested, documented, and commented appropriately. We want **quality control** and **quality assurance**.

**Three Modes to Drawing the Bayesian OWL**

> "There are three modes to drawing the Bayesian OWL in this course. The first mode is understanding what you're doing. Breaking it down into steps and having a recipe in a workflow that you hold yourself to is extremely useful. Otherwise, you'll just have a salad of code and you'll get lost."

**The Importance of Documenting Your Work**

Documenting your work reduces error. This isn't just about understanding; it's about reducing scientific error as well and giving your colleagues some faith that your code actually works. We're professionals and we should behave professionally. We want a respectable scientific workflow that we're not afraid to describe to our colleagues.

![image.png](attachment:image.png)

```markdown
1. Define some theoretical estimand: What are we even trying to do in this study?
2. Design some scientific model, causal model: This can start out as a DAG, but it eventually needs to be generative.
3. Build some series of statistical models: These should address the specific estimands that can be justified in light of the causal models in step 2.
4. Do testing: We simulate from the generative model to validate that the estimator works.
5. Analyze the real data: There may be additional steps after this. We might decide to look back and revise the causal models. But as long as we document all that, this is the workflow that we want to draw the OWL.
```


