# [Causality and Experiments](https://www.inferentialthinking.com/chapters/02/causality-and-experiments) (Inferential Thinking - 2)

Is chocolate good for you? Does the death penalty have a deterrent effect? What causes breast cancer?

All of the questions above attempt to assign a cause to an effect. 

## Observation

Observation is key to good science. An **observational study** is a study in which scientists make conclusions based on data that they have observed but had no hand in generating. In data science, many such studies involve observations on:
* A group of individuals, study subjects, participants, units
    * e.g. European adults
* A factor of interest called a **treatment**
    * e.g. chocolate consumption
* An **outcome** measured on each individual
    * e.g. heart disease

## Question 1: Is there any relation between chocolate consumption and heart disease?

#### Some Data:
"Among those in the top tier of chocolate consumption, 12% developed or died of cardiovascular disease during the study, compared to 17.4% of those who didn't eat chocolate."

This points to an **association**.
The fundamental question is whether the treatment has an effect on the outcome. Any relation between the treatment and the outcome is called an **association**.

## Question 2: Does chocolate consumption lead to a reduction in heart disease?

This points to **causality**.
If the treatment causes the outcome to occur, then the association is **causal**. This question is often harder to answer.

According to JoAnn Manson, chief of Preventive Medicine at Brigham and Women's Hospital, Boston,
"The study doesn't prove 

## [Observation and Visualization: John Snow and the Broad Street Pump](https://www.inferentialthinking.com/chapters/02/1/observation-and-visualization-john-snow-and-the-broad-street-pump) (Inferential Thinking - 2.1)

One of the earliest examples of astute observation eventually leading to the establishment of causality comes from London in the 1850s. It was the world's wealthiest city, but many of its people were desperately poor. Charles Dickens, an English writer, was writing about their plight (def: unfortunate situation, difficulty). Disease was rife (def: common occurrence) in the poorer parts of the city, with **cholera** being among the most feared. At that time, it was not yet known that germs cause disease, and the prevailing explanation was the "miasma" theory.

### Miasma, Miasmatism, Miasmatists

* **Bad smells** given off by waste and rotting matter
* **Believed to be the main source of disease**
* Suggested remedies:
    * Hold sweet-smelling things to their nose

A doctor by the name of John Snow had been following the cholera epidemics that hit England from time to time. Snow was skeptical of the miasma theory. He had noticed that while entire households were wiped out by cholera, the people in neighboring houses sometimes remained completely unaffected even though they were breathing the same air - or **miasmas** - as their neighbors.

Snow also noticed that the onset of the disease almost always involved vomiting and diarrhea. Thus, he believed that the infection was carried by **something people ate or drank, not by the air that they breathed**. His prime suspect was water contaminated by sewage.

As the deaths mounted, Snow recorded them diligently using a method that eventually became a standard in the study of how diseases spread: **drawing a map**. On a street map of the district, he recorded the location of each death.

Below is Snow's original map.
* Each black bar represents one death
* The black discs mark the locations of water pumps
* The map displays a striking revelation: **the deaths are roughly clustered around the Broad Street pump**

<img src = "snow_map.jpg"/>

Snow studied his map carefully and investigated the apparent anomalies:
* There were deaths in houses that were closer to the Rupert Street pump than the Broad Street pump. Even though the Rupert Street pump was closer, it was less convenient to get to because of dead ends and the layout of the streets. **The residents in those houses used the Broad Street pump instead**.
* There were no deaths in 2 blocks just east of the pump. That was the location of the Lion Brewery, where the workers drank what they brewed. If they wanted water, the brewery had its own well.
* There were scattered deaths in houses several blocks away from the Broad Street pump. Those were children who drank from the Broad Street pump on their way to school.

#### Final piece of evidence that supports Snow's theory:

* 2 isolated deaths in the leafy and genteel (def: refined) Hampstead area, quite far from Soho.
    * The deceased were Mrs. Susannah Eley, who had once lived in Broad Street, and her niece.
    * Mrs. Eley had water from Broad Street pump delivered to her in Hampstead every day. She liked its taste.

Later it was discovered that a cesspit that was just a few feet away from the well of the Broad Street pump had been leaking into the well. The pump's water had been contaminated by sewage from houses of cholera victims.

Snow used his map to convince local authorities to remove the handle of the Broad Street pump. Snow's map is one of the earliest and most powerful uses of **data visualization**. Disease maps of various kinds are now a standard tool for tracking epidemics.

### Towards Causality

Though the map gave Snow a strong indication that the cleanliness of the water supply was the key to controlling cholera, he was still a long way from a convincing scientific argument that contaminated water was causing the spread of the disease. To make a more compelling case, he used the method of **comparison**.

Scientists use comparison to identify an association between a treatment and an outcome. They compare the outcomes of:
* A group of individuals who got the treatment (the **treatment group**) with
* The group that did not get treatment (the **control group**)

For example, researchers might compare the average murder rate in states that have the death penalty with the average murder rate in states that don't.

If the results are different, that is evidence for an association. However, more care is needed to determine causation.
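As a minimal sketch of this kind of comparison, the Python snippet below computes the proportion of individuals with the outcome in each group and the difference between the two. The records and the helper name `outcome_rate` are hypothetical, made up for illustration; they are not data from any study mentioned in these notes.

```python
# Hypothetical records: (group, had_outcome). Invented for illustration only.
records = [
    ("treatment", True), ("treatment", False), ("treatment", False), ("treatment", False),
    ("control", True), ("control", True), ("control", False), ("control", False),
]

def outcome_rate(records, group):
    """Proportion of individuals in `group` who had the outcome."""
    outcomes = [had_outcome for g, had_outcome in records if g == group]
    return sum(outcomes) / len(outcomes)

treatment_rate = outcome_rate(records, "treatment")
control_rate = outcome_rate(records, "control")
difference = treatment_rate - control_rate

print(f"treatment: {treatment_rate:.2f}, control: {control_rate:.2f}, "
      f"difference: {difference:+.2f}")
```

A nonzero difference like this one is evidence of an association only. Whether the treatment caused it is a separate question.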

## [Snow's "Grand Experiment"](https://www.inferentialthinking.com/chapters/02/2/snow-s-grand-experiment) (Inferential Thinking - 2.2)

Snow completed a more thorough analysis of cholera deaths. For some time, he had been gathering data on cholera deaths in an area of London that was served by 2 water companies: **Lambeth** and **Southwark & Vauxhall (S&V)**.
* Lambeth water company drew its water upriver from where sewage was discharged into the River Thames.
    * Its water was relatively clean
* S&V water company drew its water below the sewage discharge
    * Thus its water supply was contaminated.
    
The map below shows the area served by the 2 companies. Snow analyzed the region where the 2 service areas overlap.

<img src = "2_companies.jpg"/>

Snow noticed that there was no systematic difference between the people who were supplied by S&V and those supplied by Lambeth.
**“Each company supplies both rich and poor, both large houses and small; there is no difference either in the condition or occupation of the persons receiving the water of the different Companies … there is no difference whatever in the houses or the people receiving the supply of the two Water Companies, or in any of the physical conditions with which they are surrounded …”**

The only difference was in the water supply, “one group being supplied with water containing the sewage of London, and amongst it, whatever might have come from the cholera patients, the other group having water quite free from impurity.”

Confident that he would be able to arrive at a clear conclusion, Snow summarized his data in the table below:
<img src = "companies_table.jpg"/>

The numbers pointed accusingly at S&V. The death rate from cholera in the S&V houses was almost ten times the rate in the houses supplied by Lambeth.

## [Establishing Causality](https://www.inferentialthinking.com/chapters/02/3/establishing-causality) (Inferential Thinking - 2.3)

In the language developed earlier in the section, you can think of:
* Treatment group = the people in S&V houses
* Control group = the people in Lambeth houses

A crucial element in Snow's analysis was that the people in the 2 groups were comparable to each other, apart from the treatment.

### Key to Establishing Causality
If the treatment and control groups are **similar apart from the treatment**, then differences between the outcomes in the 2 groups can be ascribed to the treatment. In order to establish whether it was the water supply that was causing cholera, Snow had to compare two groups that were similar to each other in all but one aspect: **their water supply**. Only then would he be able to ascribe (def: attribute something to) the differences in their outcomes to the water supply. 

### Trouble
* If the treatment and the control groups have **systematic differences other than the treatment**, then it might be difficult to identify causality. 
* Such differences are often present in **observational studies**.
* When they lead researchers astray, they are called **confounding factors**

In John Snow's case, if the 2 groups had been different in some other ways as well, it would have been difficult to identify the water supply as the cause of the epidemic. For example, if the treatment group had consisted of factory workers and the control group had not, then differences in outcomes between the 2 groups could have been due to the water supply, or the factory work, or both, or to any other characteristic that made the 2 groups differ from each other.

### Confounding

In an observational study, if the treatment and control groups differ in ways other than the treatment, it will be more difficult to draw conclusions about causality.

An underlying difference, other than the treatment, between the 2 groups is called a **confounding factor**, because it might confound you when you try to reach a conclusion.

#### Confounding Example: Coffee and Lung Cancer
Studies in the 1960’s showed that coffee drinkers had higher rates of lung cancer than those who did not drink coffee. Because of this, some people identified coffee as a cause of lung cancer. But coffee does not cause lung cancer. The analysis contained a confounding factor – smoking. In those days, coffee drinkers were also likely to have been smokers, and smoking does cause lung cancer. Coffee drinking was associated with lung cancer, but it did not cause the disease.

Confounding factors are common in observational studies. Good studies take great care to reduce confounding.
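The coffee example can be illustrated with a small simulation. The sketch below is not based on the 1960s studies: every probability in it is invented, and the function names are hypothetical. It builds a population in which smoking raises both the chance of drinking coffee and the chance of lung cancer, while coffee itself has no effect on cancer at all.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

def simulate_person():
    """One hypothetical person; all probabilities below are invented."""
    smoker = rng.random() < 0.4
    # Smokers are more likely to drink coffee (the confounding link).
    coffee = rng.random() < (0.8 if smoker else 0.3)
    # Cancer risk depends on smoking only; coffee plays no role.
    cancer = rng.random() < (0.15 if smoker else 0.01)
    return smoker, coffee, cancer

people = [simulate_person() for _ in range(100_000)]

def cancer_rate(rows):
    """Proportion of the given people who have cancer."""
    return sum(cancer for _, _, cancer in rows) / len(rows)

def coffee_gap(rows):
    """Cancer rate among coffee drinkers minus the rate among non-drinkers."""
    drinkers = [r for r in rows if r[1]]
    others = [r for r in rows if not r[1]]
    return cancer_rate(drinkers) - cancer_rate(others)

overall_gap = coffee_gap(people)
smoker_gap = coffee_gap([r for r in people if r[0]])
nonsmoker_gap = coffee_gap([r for r in people if not r[0]])

print(f"all people:  {overall_gap:+.3f}")
print(f"smokers:     {smoker_gap:+.3f}")
print(f"non-smokers: {nonsmoker_gap:+.3f}")
```

Across the whole population, coffee drinkers show a clearly higher cancer rate. Once smokers and non-smokers are looked at separately, the gap between coffee drinkers and non-drinkers shrinks to roughly zero, which is what one would expect when smoking, not coffee, drives the association.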


### Randomization

An excellent way to avoid confounding is:
* Assign individuals to the treatment and control groups at random
* Then administer the treatment to those who were assigned to the treatment group

Randomization keeps the two groups similar apart from the treatment.
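Random assignment can be sketched in a few lines of Python. This is one simple way to do it (shuffle, then split in half), not a prescribed procedure; the participant list, the ages, and the helper names `assign_at_random` and `mean_age` are all hypothetical.

```python
import random

def assign_at_random(individuals, seed=None):
    """Shuffle the individuals and split them into (treatment, control).

    With an odd number of individuals, the extra one goes to control.
    """
    rng = random.Random(seed)
    shuffled = list(individuals)  # copy, so the input is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical participants as (id, age) pairs; the ages are invented.
age_rng = random.Random(1)
participants = [(i, age_rng.randint(20, 80)) for i in range(1000)]

treatment, control = assign_at_random(participants, seed=42)

def mean_age(group):
    """Average age of the (id, age) pairs in a group."""
    return sum(age for _, age in group) / len(group)

print(len(treatment), len(control))
print(f"mean age: treatment {mean_age(treatment):.1f}, control {mean_age(control):.1f}")
```

Because the split ignores every characteristic of the participants, a background variable such as age tends to come out similar in the two groups, which is the sense in which randomization keeps them comparable.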

If you can assign the treatment at random in this way, you are running a **randomized controlled experiment**, also known as a **randomized controlled trial (RCT)**.

Sometimes, people’s responses in an experiment are influenced by their knowing which group they are in. In this case, you can run a **blind** experiment in which individuals do not know whether they are in the treatment group or the control group. To make this work, you give the control group a **placebo**, which is something that looks like the treatment but in fact has no effect.

Randomized controlled experiments have long been a gold standard in the medical field, for example in establishing whether a new drug works. They are also becoming more commonly used in other fields such as economics.

#### Example: Welfare subsidies in Mexico 
In Mexican villages in the 1990s, children in poor families were often not enrolled in school. One of the reasons was that the older children could go to work and thus help support the family. Santiago Levy, a minister in the Mexican Ministry of Finance, set out to investigate whether welfare programs could be used to increase school enrollment and improve health conditions. He conducted an RCT on a set of villages, selecting some of them at random to receive a new welfare program called PROGRESA. The program gave money to poor families if their children went to school regularly and the family used preventive health care. More money was given if the children were in secondary school than in primary school, to compensate for the children’s lost wages, and more money was given for girls attending school than for boys. The remaining villages did not get this treatment, and formed the control group. Because of the randomization, there were no confounding factors and it was possible to establish that PROGRESA increased school enrollment. For boys, the enrollment increased from 73% in the control group to 77% in the PROGRESA group. For girls, the increase was even greater, from 67% in the control group to almost 75% in the PROGRESA group. Due to the success of this experiment, the Mexican government supported the program under the new name OPORTUNIDADES, as an investment in a healthy and well-educated population.

In some situations it's impossible to run a randomized controlled experiment, even when the aim is to investigate causality.

For example, suppose you want to study the effects of alcohol consumption during pregnancy, and you randomly assign some pregnant women to your “alcohol” group. You should not expect cooperation from them if you present them with a drink. In such situations you are more likely to conduct an observational study, not an experiment. Be alert for confounding factors.

## [Endnote](https://www.inferentialthinking.com/chapters/02/5/endnote) (Inferential Thinking - 2.5)

John Snow conducted an observational study, not a randomized experiment. But he called his study a “grand experiment” because, as he wrote, “No fewer than three hundred thousand people … were divided into two groups without their choice, and in most cases, without their knowledge …”

Studies such as Snow’s are sometimes called **natural experiments**. However, true randomization does not simply mean that the treatment and control groups are selected without their choice.

The method of randomization can be as simple as tossing a coin. It may also be quite a bit more complex. But every method of randomization consists of a sequence of carefully defined steps that allow chances to be specified mathematically. This allows us to:
* Account mathematically for the possibility that randomization produces treatment and control groups that are quite different from each other.
* Make precise mathematical statements about differences between the treatment and control groups.
    * This in turn helps us make justifiable conclusions about whether the treatment has any effect.
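To see what "specified mathematically" can mean, consider the coin-toss method: each individual goes to the treatment group on heads and to the control group on tails. The sketch below computes the exact chance that this leaves one group very small. The function name `prob_lopsided` and the choice of 20 individuals with a cutoff of 5 are assumptions made for illustration.

```python
from math import comb

def prob_lopsided(n, max_small_group):
    """Chance that n fair coin tosses leave one group with at most
    `max_small_group` individuals."""
    # Treatment-group sizes k for which one of the two groups is that small.
    extreme = [k for k in range(n + 1)
               if k <= max_small_group or k >= n - max_small_group]
    # Each k has probability C(n, k) / 2**n under fair, independent tosses.
    return sum(comb(n, k) for k in extreme) / 2 ** n

p = prob_lopsided(20, 5)
print(f"{p:.4f}")
```

With 20 individuals, the chance that either group ends up with 5 or fewer people is about 4%. Because the randomization steps are precisely defined, this kind of probability can be calculated rather than guessed.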

In this course, you will learn how to conduct and analyze your own randomized experiments. That will involve more than what has been presented in this section. For now, focus on the main ideas:
* To try to establish causality, run a randomized controlled experiment if possible.
* If you are conducting an observational study, you might be able to establish association but not causation.
* Be extremely careful about confounding factors before making conclusions about causality based on an observational study.