# Risk Ratios for Journalists

### Note: This workbook has the answers inline! Please see risk-ratio-workbook.ipynb to work it out yourself
Welcome! This notebook will take you through the basics of understanding and working with risk ratios. A risk ratio describes a *change* in risk, which is useful across a wide variety of analyses including vaccine effectiveness, pay-to-play meetings with politicians, employment discrimination, TSA security screening, and medical test results. Although the formula is simple, there are connections to deep topics like false positive rates, causal inference, and the important notion of conditional probability.

In [16]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Example
A risk ratio, also called a relative risk, is the ratio of two probabilities. Each of these probabilities represents something happening to one of two groups.

We'll base our example on statins, a type of drug that reduces heart attacks and has [similar numbers](https://researchaddict.com/statins-reduce-the-risk-of-death-heart-attacks-and-even-dementia/) to those in this example. Suppose that in a randomized trial we have 60 people in a control group, 3 of whom go on to have a heart attack, and 40 who take the drug, 1 of whom has a heart attack. We can draw this situation like this.

![risk-ratio-statins.png](attachment:risk-ratio-statins.png)

We can represent all of this information in a 2x2 table, like this. 

In [69]:
statins = pd.DataFrame({'no heart attack':[39,57],'heart attack':[1,3]}, index=['medication','no medication'])
statins

| | no heart attack | heart attack |
| - | - | - |
| medication | 39 | 1 |
| no medication | 57 | 3 |


We might naturally be interested in how much more or less likely someone taking the medication is to get a heart attack. We can't just count the number of heart attacks in each group, because the groups might be different sizes (as they are here). Instead we want to compare the probability of heart attack within each group.

In [71]:
p_attack_medication = statins['heart attack']['medication'] / (statins['heart attack']['medication'] + statins['no heart attack']['medication'])
p_attack_medication

0.025

In [70]:
p_attack_no_medication = statins['heart attack']['no medication'] / (statins['heart attack']['no medication'] + statins['no heart attack']['no medication'])
p_attack_no_medication

0.05

All we are doing here is calculating the percentage of medicine takers and non-medicine takers who had a heart attack, which is why we divide by the total number of takers/non-takers in the denominators of the formulas above. 

From this, we can see that without medication there is about a 5% chance of having a heart attack, and with medication there is a 2.5% chance. We might want to have a single number that summarizes this *change in risk* for a variety of reasons, such as to compare the effect of different treatments or risk factors. For example, is taking this drug more or less effective than never exercising? 

The simplest way to do this would be to subtract the two probabilities:

In [72]:
p_attack_medication - p_attack_no_medication

-0.025

This is a difference of probabilities, a number known as a **risk difference** or **absolute risk reduction**. It's negative here, indicating that the risk went down. 

We might also want to ask how many times bigger one probability is than the other, which implies a division:

In [74]:
p_attack_medication / p_attack_no_medication

0.5

This, at last, is the **risk ratio** or **relative risk**. It's less than one, meaning the risk was reduced.

## Definition

Given two groups and two outcomes, we can place these four numbers in a table like this:

| Group | Positive | Negative |
| - | - | - |
| Treated | a | b |
| Untreated | c | d |

This is called a **cross table** or **contingency table**. The two outcomes are usually called **positive** and **negative**, even though the "positive" outcome might actually be bad, like a heart attack. The two groups are called various names, such as the **untreated** and **treated** for studies of drugs or other interventions, or sometimes **unexposed** and **exposed** when studying the effects of some risk factor.

Then the **risk ratio** is defined as `(a/(a+b)) / (c/(c+d))`. This is also called the **relative risk**.

We've also already seen the *risk difference*, which is `(a/(a+b)) - (c/(c+d))`. There is another quantity called the *odds ratio*, which is calculated as a ratio of odds instead of a ratio of probabilities, that is `(a/b) / (c/d)`. In fact there are a [number of related measures](https://en.wikipedia.org/wiki/Relative_risk_reduction) one can calculate from this table, each of which can be useful in certain cases. Yet all comparisons of risk eventually boil down to calculations on this table.

We will focus mostly on the *risk ratio*, because it is a simple and versatile ratio of probabilities.
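
The three formulas above can be sketched as one small helper. This is only an illustration: the function name `risk_measures` and its return format are our own choices, not part of any library, and the inputs are the statin numbers from the example (treated = medication, positive = heart attack).

```python
def risk_measures(a, b, c, d):
    """Compute common measures from a 2x2 contingency table.

    a, b: treated group with positive / negative outcome
    c, d: untreated group with positive / negative outcome
    (hypothetical helper, written for this notebook only)
    """
    p_treated = a / (a + b)      # probability of a positive outcome if treated
    p_untreated = c / (c + d)    # probability of a positive outcome if untreated
    return {
        "risk_ratio": p_treated / p_untreated,
        "risk_difference": p_treated - p_untreated,
        "odds_ratio": (a / b) / (c / d),
    }


# Statin example: 1 of 40 on medication and 3 of 60 without it had a heart attack
statin_measures = risk_measures(a=1, b=39, c=3, d=57)
for name, value in statin_measures.items():
    print(f"{name}: {value:.3f}")
```

Note that the odds ratio comes out close to, but not equal to, the risk ratio. The two tend to be similar when the outcome is rare in both groups, as it is here.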


# Exercise 1: COVID vaccine risk ratio

Here's a [paper](https://www.nejm.org/doi/full/10.1056/nejmoa2035389) which reports on the phase 3 clinical trials of the Moderna vaccine. What is the risk ratio that describes the effectiveness of this vaccine? You will need to read the paper to find the four values a,b,c,d as above.

**a) What is the risk ratio of taking the vaccine?**

In [7]:
a = None  # number of people who DID get the vaccine and DID get COVID
b = None  # number of people who DID get the vaccine and DID NOT get COVID
c = None  # number of people who DID NOT get the vaccine and DID get COVID
d = None  # number of people who DID NOT get the vaccine and DID NOT get COVID

In [1]:
# Then calculate the risk ratio itself
# Your answer here

**b) What does it mean that this number is smaller than 1?**



Your answer here

**c) This paper says that the vaccine "showed 94.1% efficacy at preventing Covid-19 illness." Where does this number come from?**

Your answer here

# Communicating Relative Risk
How should we write about this? Relative risk is often written as *times as likely*, so we could say "people who take this medicine are 0.5 times as likely to have a heart attack." 

In this case we could also go with a nice clean "half as likely," but *times* is the general wording. Consider this sentence reporting a risk ratio from a 2015 [ProPublica story](https://www.propublica.org/article/deadly-force-in-black-and-white): "Young black males in recent years were at a far greater risk of being shot dead by police than their white counterparts – 21 times greater." 

You could also report the absolute risk reduction by saying "those who took the medication were 2.5% less likely to have a heart attack." This gives a different picture of what has happened to the risk. It has decreased by only a small amount, but then again it *can't* decrease by more than the baseline of 5%.

Typically, a risk ratio is reported as *times as likely*, which implies a multiplication -- we are multiplying the risk of the untreated group by the relative risk to find the risk of the treated group. Conversely, risk difference typically is reported as *less likely* or *more likely than* because it implies addition -- we add the risk difference to the risk of the untreated group to get the risk for the treated group.

You may be tempted to write "those who took the medication were 50% less likely to have a heart attack." This has a nice ring to it, and technically that 50% is a number called the **[relative risk reduction](https://en.wikipedia.org/wiki/Relative_risk_reduction)** which is just 1-risk ratio (if the risk ratio was 80% then the relative risk reduction would be 20%). However, this is confusing because "less likely" is usually used to report absolute risk reduction. Please don't do this.

Both risk ratios and risk differences are ways of summarizing a *change* in risk. They're very useful for comparing different interventions. However, neither of these numbers alone really tells the whole story.


## A simple and accurate way to report risk changes

In the heart attack example the risk ratio was large, meaning far from one. People taking the medication were 0.5 times as likely to have a heart attack, or a 50% relative risk reduction. But the absolute risk reduction was small, just a few percent. This is because the absolute risk was small in both groups, but these two numbers can be a lot closer for common outcomes. If your probability of getting a parking ticket decreases from 70% to 30% in a different neighborhood, then the relative risk reduction is 1-30/70 = 57%, but the absolute risk reduction is 40%.
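
To see the contrast side by side, here is a minimal sketch that computes the relative and absolute reductions for both cases. The helper name `summarize_change` is made up for this notebook; the probabilities are the ones quoted in the text. Here the absolute risk reduction is written as baseline minus new risk, so a reduction shows up as a positive number.

```python
def summarize_change(p_baseline, p_new):
    """Summarize a change in risk (hypothetical helper for illustration)."""
    risk_ratio = p_new / p_baseline
    return {
        "risk_ratio": risk_ratio,
        "relative_risk_reduction": 1 - risk_ratio,
        # Sign flipped relative to the risk difference so a drop is positive
        "absolute_risk_reduction": p_baseline - p_new,
    }


statin_summary = summarize_change(p_baseline=0.05, p_new=0.025)
parking_summary = summarize_change(p_baseline=0.70, p_new=0.30)

for label, summary in [("statins", statin_summary), ("parking", parking_summary)]:
    print(
        f"{label}: relative reduction {summary['relative_risk_reduction']:.0%}, "
        f"absolute reduction {summary['absolute_risk_reduction']:.1%}"
    )
```

The relative reductions are similar in size, while the absolute reductions differ by more than a factor of ten.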

Beware of those who report [whichever number looks better](https://centerforhealthjournalism.org/2017/01/11/reporting-findings-absolute-vs-relative-risk)! A pharmaceutical company may pick the large number to report effectiveness, and the smaller number to report side effects. Also, some writers might unintentionally mislead by using "less likely" to report a relative risk reduction rather than an absolute risk reduction.

Fortunately, there's a simple alternative that will help both you and your readers. First, convert both the untreated and treated group to percentages, so readers don't have to work out the denominator in their heads. For our statin example that could be visualized like this:

![risk-ratio-normalized.png](attachment:risk-ratio-normalized.png)

Then simply report both numbers: "2.5% of those who took the medication had a heart attack, compared to 5% of those who did not." While you would still want to use risk ratios to compare, say, the effectiveness of two different drugs, for a single drug reporting the before and after percentages is simple, comprehensive, and easy to visualize.
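
One possible way to get from the 2x2 table to that sentence is sketched below. It rebuilds the `statins` table so the cell runs on its own, and the variable names `statins_pct` and `sentence` are just our choices.

```python
import pandas as pd

# Same 2x2 table as in the example above
statins = pd.DataFrame(
    {"no heart attack": [39, 57], "heart attack": [1, 3]},
    index=["medication", "no medication"],
)

# Divide each row by its row total so every group adds up to 100%
statins_pct = statins.div(statins.sum(axis=1), axis=0) * 100

treated_pct = statins_pct.loc["medication", "heart attack"]
untreated_pct = statins_pct.loc["no medication", "heart attack"]

# The "g" format drops trailing zeros, so 5.0 prints as 5
sentence = (
    f"{treated_pct:g}% of those who took the medication had a heart attack, "
    f"compared to {untreated_pct:g}% of those who did not."
)
print(sentence)
```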

**Careful with causality!** It is very tempting to write "taking the medicine reduced the risk of heart attacks from 5% to 2.5%" which means that the drug *caused* the reduction. This *might* be true, if the risk ratio was computed as part of a carefully controlled experiment, as when reporting on a scientific study. But in general, *risk ratios are statements of correlation, not causation*. We'll talk about this more below.



# The Vaccine Adverse Event Reporting System (VAERS)

One great advantage of getting comfortable with risk ratios is being able to recognize when something should really be a risk ratio, but isn't. The logic of the 2x2 table explains why the Vaccine Adverse Event Reporting System (VAERS) database cannot be used to calculate the number of deaths caused by the COVID vaccine. To be clear, there are many problems with trying to use this data to count deaths. For one thing, it's an open database and only collects reports of "adverse reactions" *after* vaccinations -- no attempt is made to determine causality at this stage. As this [article](https://www.nebraskamed.com/COVID/does-vaers-list-deaths-caused-by-covid-19-vaccines) notes,

>VAERS is like the Wikipedia of data reporting. Anyone can report anything. Many reports are helpful. Some reports are nonsense – to prove the point, one anesthesiologist successfully submitted a VAERS report several years ago that the flu vaccine had [turned him into The Incredible Hulk](https://www.politifact.com/factchecks/2017/may/11/bill-zedler/bill-zedler-insists-program-doesnt-collect-wide-ra/). More recently, a [false report](https://www.usatoday.com/story/news/factcheck/2021/05/09/fact-check-no-evidence-2-year-old-died-covid-vaccine/4971367001/) of a 2-year-old dying from a COVID-19 vaccine was removed from VAERS because the CDC says it was "completely made up."

But even if every report was perfectly accurate, if you understand risk ratios you'll see immediately that it's not possible to determine if vaccines increase the risk of death from this sort of data. Politifact [explains this well](https://www.politifact.com/article/2021/may/03/vaers-governments-vaccine-safety-database-critical/):

> Offit explained that four sets of data are needed to measure whether a vaccine has caused or contributed to an adverse event: vaccinated people who experienced that problem; vaccinated people who didn’t have it; unvaccinated people who had the problem; and unvaccinated people who didn’t.

In other words, VAERS tracks only one of the four numbers in the contingency table, so right away we know that we cannot say from this data alone whether getting the vaccine increases or decreases your risk for "adverse events."
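
A quick sketch makes the point concrete. Every number below is invented purely for illustration and has nothing to do with real VAERS data: we fix the one cell we know (`a`) and try three different guesses for the cells we don't.

```python
def risk_ratio(a, b, c, d):
    """Risk ratio from a 2x2 table: treated (a, b) vs. untreated (c, d)."""
    return (a / (a + b)) / (c / (c + d))


# The only cell a reporting system gives us (made-up value)
reported_events_after_vaccination = 100  # a

# Three made-up guesses for the three cells that are NOT recorded
scenarios = {
    "vaccine looks harmful": dict(b=99_900, c=50, d=99_950),
    "no difference": dict(b=99_900, c=100, d=99_900),
    "vaccine looks protective": dict(b=99_900, c=200, d=99_800),
}

results = {
    label: risk_ratio(reported_events_after_vaccination, **cells)
    for label, cells in scenarios.items()
}
for label, rr in results.items():
    print(f"{label}: risk ratio = {rr:.2f}")
```

The same value of `a` is compatible with a risk ratio above, at, or below one, which is why the count of reports alone cannot settle the question.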


# Exercise 2: Clinton Foundation Meetings

In summer 2016 the [AP reported](https://apnews.com/article/82df550e1ec646098b434f7d5771f625) that "at least 85 of 154 people from private interests who met or had phone conversations scheduled with Clinton while she led the State Department donated to her family charity or pledged commitments to its international programs." Is this evidence that donating to the Clinton Foundation would get you a meeting with the Secretary of State?

In risk ratio terms, quid-pro-quo would mean that the "risk" (probability) of getting a meeting for people who donated is higher than for those who did not donate, i.e. the risk ratio is greater than one. 

***Using the numbers in the story above, try to calculate the risk ratio***

In [None]:
a = None  # number of people who DID donate and DID get a meeting
b = None  # number of people who DID donate and DID NOT get a meeting
c = None  # number of people who DID NOT donate and DID get a meeting
d = None  # number of people who DID NOT donate and DID NOT get a meeting

Your answer here

# Exercise 3: Vaccine Effectiveness in Provincetown

You may remember that the CDC released a [paper](https://www.cdc.gov/mmwr/volumes/70/wr/mm7031e2.htm) in August 2021 warning that COVID infection was still possible among the vaccinated, based on a study of a COVID outbreak in Provincetown, MA. Much of the coverage focussed on the statistic that 74% of the people who contracted COVID were vaccinated. Unfortunately, this number may give a misleading impression of vaccine effectiveness. Instead, I want you to use the data in this paper to compute the risk ratio and better understand what was happening here.

**Calculate risk ratio of getting COVID if you are vaccinated.** Every number you need is in the paper, but you will have to do a little bit of algebra to work it out. Hint: the risk ratio can be re-expressed like this
```
rr = (cases_among_vaxxed / total_vaxxed) / (cases_among_unvaxxed / total_unvaxxed)
   = (cases_among_vaxxed / cases_among_unvaxxed ) * ( total_unvaxxed / total_vaxxed )
```
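
If the algebra in the hint looks suspicious, it can be checked numerically. The counts below are made up and are not from the CDC paper. The last form shows that the rearranged formula also works when all we have are shares (proportions) rather than raw counts.

```python
import math

# Made-up counts for illustration only (not from the CDC paper)
cases_among_vaxxed = 30
cases_among_unvaxxed = 20
total_vaxxed = 600
total_unvaxxed = 200

# Form 1: ratio of the two risks
rr_from_risks = (cases_among_vaxxed / total_vaxxed) / (cases_among_unvaxxed / total_unvaxxed)

# Form 2: ratio of cases times the inverse ratio of group sizes
rr_from_ratios = (cases_among_vaxxed / cases_among_unvaxxed) * (total_unvaxxed / total_vaxxed)

# Form 2 again, but starting from shares instead of raw counts
share_of_cases_vaxxed = cases_among_vaxxed / (cases_among_vaxxed + cases_among_unvaxxed)
share_of_people_vaxxed = total_vaxxed / (total_vaxxed + total_unvaxxed)
rr_from_shares = (share_of_cases_vaxxed / (1 - share_of_cases_vaxxed)) * (
    (1 - share_of_people_vaxxed) / share_of_people_vaxxed
)

print(rr_from_risks, rr_from_ratios, rr_from_shares)
assert math.isclose(rr_from_risks, rr_from_ratios)
assert math.isclose(rr_from_ratios, rr_from_shares)
```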

In [1]:
# Your answer here

## Relationship to Causality
Why is this number greater than one? We know from studies of the vaccine that the relative risk of getting COVID should be something like 5%. But that implicitly assumes that vaccinated and unvaccinated people act the same way. Here, it's very likely they don't. This outbreak centered around July 4 and Provincetown hosts a popular gay party scene, so perhaps out-of-town party-goers thought the vaccine made them immune and radically changed their behavior to be far riskier than the still unvaccinated.

The deeper problem is that we often want risk ratios to tell us about causality. After all, this is why we compute risk ratios for medicines, corruption, discrimination, and so on. But causality is complex; risk ratios can only tell us about causality if we can be sure that the untreated and treated groups are identical in every way that could matter, such as during a randomized experiment. When analyzing observational data, this is usually not the case. **Risk ratios are correlations, not causes.**
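
Here is a small sketch of how this can happen. All counts are invented for illustration, and the "exposure" grouping is a hypothetical stand-in for any behavior that differs between the groups. Within each exposure level the vaccinated are 0.2 times as likely to be infected, yet the pooled risk ratio comes out above one because most vaccinated people are in the high-exposure group.

```python
def risk_ratio(treated, untreated):
    """Risk ratio from two (cases, people) pairs."""
    (treated_cases, treated_n), (untreated_cases, untreated_n) = treated, untreated
    return (treated_cases / treated_n) / (untreated_cases / untreated_n)


# Made-up (cases, people) counts, split by a hypothetical exposure level
strata = {
    "high exposure": {"vaccinated": (180, 1800), "unvaccinated": (50, 100)},
    "low exposure": {"vaccinated": (2, 200), "unvaccinated": (95, 1900)},
}

# Risk ratio inside each stratum, where the groups behave alike
stratum_rr = {
    name: risk_ratio(groups["vaccinated"], groups["unvaccinated"])
    for name, groups in strata.items()
}


def pooled(group):
    """Add up cases and people for one group across all strata."""
    cases = sum(groups[group][0] for groups in strata.values())
    people = sum(groups[group][1] for groups in strata.values())
    return cases, people


# Risk ratio when exposure is ignored
pooled_rr = risk_ratio(pooled("vaccinated"), pooled("unvaccinated"))

for name, rr in stratum_rr.items():
    print(f"{name}: risk ratio = {rr:.2f}")
print(f"pooled: risk ratio = {pooled_rr:.2f}")
```

The pooled number is a perfectly valid risk ratio. It just answers a question about correlation in this mixed population, not about what the vaccine does to any one person.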

# Exercise 4: Hiring Discrimination

Risk ratios are also a basic building block of discrimination analyses. Consider this recent [Miami Herald story](https://www.miamiherald.com/sports/nfl/article258302943.html) about the hiring of Black head coaches in the NFL:

> The chances of landing an NFL head coaching position were three times better for white candidates compared to their non-white counterparts — even after including the most recent hires and controlling for age, number of opportunities, previous coaching position and years of experience in the league.

How do we get to this conclusion? Let's look at the data, which can be downloaded from the Herald [here](https://docs.google.com/spreadsheets/d/1lVPgIu7OKg40trVMnVlzg5EvnAByxXJCWQqGL1qDBis/edit#gid=0). For this exercise we'll use a slightly reformatted version of the data, with one row per candidate per year (candidates often apply for multiple jobs in the same year).

In [87]:
coaches = pd.read_csv('FINAL_coaches_by_year.csv')
coaches.head()

| | Unit_of_Analysis | Coach_ID | Name | Age | Hired | Year | Number_of_Interviews_That_Year | Previous_Job | Previous_Job_Coded | NFL_Playing_Experience | NFL_Coaching_Experience | Total_NFL_Experience | Black | White | Minority | OC | DC | HC |
| - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| 0 | Aaron Glenn 2021 | 1 | Aaron Glenn | 48 | 0 | 2021 | 1 | Other NFL Job | 5 | 15 | 8 | 23 | 1 | 0 | 1 | 0 | 0 | 0 |
| 1 | Aaron Glenn 2022 | 1 | Aaron Glenn | 49 | 0 | 2022 | 2 | Defensive Coordinator | 4 | 15 | 9 | 24 | 1 | 0 | 1 | 0 | 1 | 0 |
| 2 | Adam Gase 2015 | 2 | Adam Gase | 36 | 0 | 2015 | 5 | Offensive Coordinator | 3 | 0 | 12 | 12 | 0 | 1 | 0 | 1 | 0 | 0 |
| 3 | Adam Gase 2016 | 2 | Adam Gase | 37 | 1 | 2016 | 4 | Offensive Coordinator | 3 | 0 | 13 | 13 | 0 | 1 | 0 | 1 | 0 | 0 |
| 4 | Adam Gase 2019 | 2 | Adam Gase | 40 | 1 | 2019 | 2 | Head Coach Previous Season | 1 | 0 | 16 | 16 | 0 | 1 | 0 | 0 | 0 | 1 |


Let's strip this down to just two variables of interest: whether they were hired and whether they are Black.

In [88]:
coaches=coaches[['Hired','Black']]
coaches.head()

| | Hired | Black |
| - | - | - |
| 0 | 0 | 1 |
| 1 | 0 | 1 |
| 2 | 0 | 0 |
| 3 | 1 | 0 |
| 4 | 1 | 0 |


We can count the number of Black and non-Black coaches hired like this

In [105]:
hired_black = sum((coaches['Hired']==1) & (coaches['Black']==1))
hired_black

8

In [103]:
hired_not_black = sum((coaches['Hired']==1) & (coaches['Black']==0))
hired_not_black

48

But of course this doesn't tell the full story, because it doesn't account for how many candidates of each race applied.
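
One way to get all four cells of the contingency table at once is `pd.crosstab`. The sketch below runs on a tiny invented data set, not the Herald data, so the numbers mean nothing. Only the column names `Black` and `Hired` are borrowed from the real file.

```python
import pandas as pd

# Toy data invented for illustration, NOT the Herald dataset
toy = pd.DataFrame({
    "Black": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "Hired": [1, 0, 0, 0, 1, 1, 1, 0, 0, 0],
})

# Rows are the two groups (Black = 0 or 1), columns the two outcomes (Hired = 0 or 1)
table = pd.crosstab(toy["Black"], toy["Hired"])
print(table)

# Probability of being hired within each group: hired count / row total
hire_rate = table[1] / table.sum(axis=1)

# Risk ratio of being hired for Black vs. non-Black candidates in the toy data
toy_risk_ratio = hire_rate.loc[1] / hire_rate.loc[0]
print(toy_risk_ratio)
```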

**Compute the relative risk of being hired if you are Black.**

In [2]:
# Your answer here

This number is close to the "three times" in the article, but again, *risk ratios are correlations not causes*. To know if race is the *cause* of the disparity we have to rule out other factors that could legitimately account for the difference, such as the previous coaching experience of the candidates (which we threw out above). The Herald worked with a statistician to build a model to do this. One technique that could work is [logistic regression](https://investigate.ai/regression/logistic-regression-quickstart/) which will eventually produce an odds ratio (similar but not identical to a risk ratio, as discussed [here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5300861/)). This is called **controlling** or **adjusting** for experience, which is a basic technique of [causal inference](https://mixtape.scunning.com/).

In this case, controlling for experience barely changes the outcome:
![coaches-chart.png](attachment:coaches-chart.png)

# Conditional probability and the Base Rate Fallacy

This is an introduction to the idea of conditional probability, which has a close connection to risk ratios. Understanding conditional probability will help you avoid a set of common mistakes known as the *base rate fallacy*. Plus, conditional probability is one of the basic building blocks of statistics.

This section is all told through pictures, so [please see the slides](https://docs.google.com/presentation/d/1FicSPCksCe9kVqXjeBYmCnZ5KYhg2ymsPapCZlawVo8/edit?usp=sharing)

As an example of the sorts of problems we are trying to solve, suppose I tell you:

- If a woman has cancer, her mammogram is positive 75% of the time

If a woman has a positive mammogram, how likely is she to have cancer? The answer is not 75%. But to see why, and work out how to calculate the correct answer, we need the machinery of conditional probability. If we work through the math, we will see that we need two other quantities as well, the **base rates** of cancer and positive tests:

- 10% of all tests come up positive
- 14 of 1000 women under 50 have breast cancer

The [base rate fallacy](https://en.wikipedia.org/wiki/Base_rate_fallacy) is the failure to consider how common something is overall. Most commonly, it is a confusion between `P(cause|effect)`, such as cancer given a positive test, and `P(effect|cause)` such as a positive test given cancer.
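
For the mammogram numbers above, the calculation can be sketched with Bayes' Theorem, `P(cancer|positive) = P(positive|cancer) * P(cancer) / P(positive)`. The variable names are our own, and the three inputs are the figures given in the text.

```python
# The three quantities stated in the text
p_positive_given_cancer = 0.75   # mammogram is positive 75% of the time if cancer is present
p_positive = 0.10                # base rate: 10% of all tests come up positive
p_cancer = 14 / 1000             # base rate: 14 of 1000 women under 50 have breast cancer

# Bayes' Theorem: flip the conditional using the two base rates
p_cancer_given_positive = p_positive_given_cancer * p_cancer / p_positive

print(f"P(cancer | positive test) = {p_cancer_given_positive:.1%}")
```

The result is far below 75% because cancer is much rarer than a positive test.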


# Exercise 5: COVID hospitalizations and Bayes' Theorem

According to [hospitalization statistics](https://www.ctvnews.ca/health/coronavirus/making-sense-of-the-numbers-greater-proportion-of-unvaccinated-are-being-hospitalized-1.5770226) in Quebec, Canada:

> In Quebec, for example, health officials reported 160 new COVID-19-related hospitalizations on Feb. 5 among those aged five and older. This figure includes patients in hospital wards and intensive care units (ICUs). Of these hospitalizations, 118 were among those who were vaccinated with either two or three doses of the COVID-19 vaccine. 

Naively, one might calculate the probability of getting hospitalized with COVID after vaccination like this

In [77]:
118/160

0.7375

**a) What conditional probability does this calculate? Your answer should be of the form `p(X|Y)`**

Your answer here

**b) What conditional probability do we want instead, if we want to know the effectiveness of vaccines in preventing hospitalization?**

Your answer here

**c) How can we calculate the probability in b) starting from the probability in a)? Use Bayes' Theorem**

Your answer here