# Chapter 19 - Evaluating Effect Sizes

## 19.1 Statistical vs. practical significance

In the previous chapters we discussed how we can use data to test hypotheses. This approach gives us p-values, which we can use to decide whether it is plausible that the true effect in a population is 0. But that sort of decision is ultimately a binary answer: we either reject or fail to reject the null hypothesis. We neglect a lot of the information a model gives us if p-values are the only thing we pay attention to.

Consider the following situation. You are testing the efficacy of a new drug for treating depression, and find that the effect of the drug has a p-value of <0.01. Taking the drug significantly improves depression! In a simple world, that would be all you need to send the drug into production and start selling it to people. However, we don't live in a simple world. It takes money and time to produce medicine, which are resources taken away from producing other things. Your bosses (and the taxpayers funding Medicare) want that money and time to be worth it. 

Of course it's worth it, you say, because the effect is significant! But what if you learned that the actual *amount* the drug improved depression by was 1 point on a 10 point scale? Is that worth it? What if it instead improved depression by only 0.1 points? What if it improved depression by only a little bit, and also came with side effects like acne and stomach upset? Would you make the same drug production decision across all of these scenarios?

We know from our practice with p-values that even tiny effects can be significant if we have enough data. If your only goal is to publish a paper saying that an effect probably exists, then a p-value is enough for that purpose. But if you care about applying your new knowledge in the real world, a p-value doesn't tell us everything. We don't know if the **effect size** is large enough to matter. So while null hypothesis testing tells us whether an effect likely exists in some way (statistical significance), effect sizes tell us how meaningful that effect is (called **practical significance**). Good statistical analyses should report both p-values and effect sizes in order to give the full picture about a statistical model.  
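To make the point about sample size concrete, here is a minimal sketch in Python. It assumes a hypothetical observed improvement of 0.1 points, a standard deviation of 2 points in both groups, and equal group sizes, and it uses a normal (z-test) approximation rather than any particular test from earlier chapters. The function name `p_value_for_difference` and all of the numbers are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def p_value_for_difference(diff, sd, n_per_group):
    """Two-sided p-value for a difference in group means (normal approximation)."""
    standard_error = sd * sqrt(2 / n_per_group)
    z = diff / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical numbers: the same 0.1-point improvement at different sample sizes.
observed_diff = 0.1
assumed_sd = 2.0
sample_sizes = [50, 1000, 5000, 100000]

p_values = {
    n: p_value_for_difference(observed_diff, assumed_sd, n) for n in sample_sizes
}

for n, p in p_values.items():
    print(f"n per group = {n:>6}: p = {p:.4f}")
```

The effect size is identical in every row. Only the amount of data changes, and the p-value changes with it.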

## 19.2 Unstandardized effect sizes

How do you measure an effect size? Throughout this course, we've already been doing it! An effect size is any number expressing the magnitude of an effect. When fitting models, we have interpreted the b coefficients estimated by those models in the context of what difference in predictions we would make for different values of input variables. For a b coefficient of 5, we'd change our prediction of an outcome variable by 5 units for every one unit increase in the predictor variable. For a b coefficient of 0.1, we'd change our outcome prediction only a little bit. These numbers are all effect sizes. They express the size of the effect a predictor variable has on the predictions of an outcome variable.

Specifically, the b coefficients are **unstandardized effect sizes.** An unstandardized effect size is the magnitude of an effect expressed in the units of the variables. 
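As a rough illustration of what "expressed in the units of the variables" means, the sketch below fits a one-predictor model to a tiny made-up dataset and then refits it after changing the units of the predictor. The data, the variable names, and the helper `slope` are all hypothetical; the helper just computes the ordinary least-squares slope by hand.

```python
from statistics import mean


def slope(x, y):
    """Least-squares slope (the b coefficient) for a one-predictor model."""
    mean_x, mean_y = mean(x), mean(y)
    sum_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sum_xx = sum((xi - mean_x) ** 2 for xi in x)
    return sum_xy / sum_xx


# Hypothetical data: drug dose and depression score on a 10-point scale.
dose_mg = [0, 10, 20, 30, 40]
depression_score = [7.0, 6.4, 6.1, 5.2, 4.8]

b_per_mg = slope(dose_mg, depression_score)

# Same doses, now recorded in grams instead of milligrams.
dose_g = [d / 1000 for d in dose_mg]
b_per_g = slope(dose_g, depression_score)

print(f"b = {b_per_mg:.3f} points per mg")
print(f"b = {b_per_g:.1f} points per g")
```

The relationship in the data is the same in both fits, but the coefficient is 1000 times larger when dose is recorded in grams. An unstandardized effect size only makes sense when it is read together with its units.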

Unstandardized effect sizes are useful for making decisions about particular situations. Drug dose, cholesterol, etc. all have meaningful units by which their quantity is expressed. An unstandardized effect size expressed in these units tells you how much of a drug you have to give to get an expected outcome, and the amount of cholesterol in the blood maps well onto heart risks. We interact with the world through the units of quantities, and we best understand effects in these same units.

However, we have less understanding about domains or measurements that we're not already familiar with. E.g., if a model suggested that drinking alcohol before bed is associated with a decrease in sleep length of 2 hours, we intuitively know what that means. We all sleep, and we've all felt the effects of 8 hours versus 6 hours of sleep. But if that model instead suggested that drinking alcohol before bed is associated with a 2 point decrease in cognitive fluency, what does that mean? What concrete things define cognitive fluency? What does fluency = 5 feel like compared to fluency = 7? What range is it on? 

Unstandardized coefficients are useful for understanding the practical significance of effects in units that we know about, but are less useful for abstract measurements or domains we are unfamiliar with. In that case, we can standardize the effect size. 

## 19.3 Standardized effect sizes

**Standardized effect sizes** remove the real-world units from variables. We've done a version of this already when learning about z-scores. A standardized variable is one that has been converted to z-scores. Z-scores, to review, express the value of a variable in terms of how many standard deviations it is away from the variable mean. By doing this, we don't need to have domain expertise in the variable to intrinsically understand what certain values mean. Could you say what a temperature of 350°F means relative to other typical temperatures for baking? What about a [Gas mark](https://en.wikipedia.org/wiki/Gas_mark) of 5? You're likely more familiar with the former than the latter. But if we convert each to z-scores and find that both have a z-score of -1, we can easily understand either of those values relative to their distribution.
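Here is a small sketch of that idea in Python. The list of baking temperatures is made up, and Celsius is used as a stand-in for a second, less familiar scale because its conversion from Fahrenheit is a simple formula. The helper `z_scores` is a hypothetical name; it uses the sample standard deviation.

```python
from statistics import mean, stdev


def z_scores(values):
    """Convert raw values to z-scores: distance from the mean in standard deviations."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


# Hypothetical set of typical baking temperatures, in degrees Fahrenheit.
baking_temps_f = [300, 325, 350, 375, 400, 425, 450]

# The same temperatures converted to degrees Celsius.
baking_temps_c = [(f - 32) * 5 / 9 for f in baking_temps_f]

z_f = z_scores(baking_temps_f)
z_c = z_scores(baking_temps_c)

for f, c, zf, zc in zip(baking_temps_f, baking_temps_c, z_f, z_c):
    print(f"{f} F = {c:.0f} C -> z = {zf:+.2f} (from F), {zc:+.2f} (from C)")
```

The raw numbers look very different in the two units, but each temperature gets the same z-score either way. Once a value is standardized, we no longer need to know which unit it started in.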

A standardized effect size operates similarly. You don't need to know about a variable's units to understand the magnitude of a standardized effect size. Doing this has three benefits for understanding statistical models:

1) Standardized effect sizes help you evaluate how big or small an effect is when the units of measurement aren’t intuitive. We don't have to know the typical range or mean of a variable already in order to interpret a standardized effect size - its value has all the information wrapped into it. 

2) Standardized effect sizes can help you compare results across studies. Many variables are measured on different scales in different studies. This isn’t likely to happen with a variable like temperature, but there are multiple anxiety scales to choose from, each of which is on a different scale. Including standardized effect size statistics can help readers understand trends or differences across studies. It can also help us compare effect sizes of variables in the same model that are measured in different units (although remember we need to use hypothesis testing to figure out if a difference in estimated effect is statistically significant).

3) Standardized effect sizes let us plan our research more easily. We'll cover this process more later in this chapter, but the short of it is that there are tools we can use to figure out how big of a study we should run if we want to be able to find a particular effect size as statistically significant. These tools use standardized effect sizes.
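The second benefit can be sketched with a toy example. Below, the same hypothetical people have their anxiety recorded on a 0-10 scale and on a 0-100 scale. As a simplifying assumption, the second scale is an exact rescaling of the first, which real questionnaires would not be. We compute the slope on the raw variables and again after converting both variables to z-scores. The data and the helper names `slope` and `z_scores` are made up for illustration.

```python
from statistics import mean, stdev


def slope(x, y):
    """Least-squares slope for a one-predictor model."""
    mean_x, mean_y = mean(x), mean(y)
    sum_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sum_xx = sum((xi - mean_x) ** 2 for xi in x)
    return sum_xy / sum_xx


def z_scores(values):
    """Convert raw values to z-scores using the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


# Hypothetical data: anxiety on two scales, and hours of sleep.
anxiety_a = [2, 4, 5, 7, 9]                 # 0-10 scale
anxiety_b = [a * 10 for a in anxiety_a]     # 0-100 scale (assumed exact rescaling)
sleep_hours = [8.0, 7.5, 7.0, 6.0, 5.5]

# Unstandardized slopes: hours of sleep per one point of anxiety.
b_a = slope(anxiety_a, sleep_hours)
b_b = slope(anxiety_b, sleep_hours)

# Standardized slopes: both variables converted to z-scores first.
beta_a = slope(z_scores(anxiety_a), z_scores(sleep_hours))
beta_b = slope(z_scores(anxiety_b), z_scores(sleep_hours))

print(f"unstandardized: scale A b = {b_a:.3f}, scale B b = {b_b:.4f}")
print(f"standardized:   scale A = {beta_a:.3f}, scale B = {beta_b:.3f}")
```

The unstandardized slopes differ by a factor of 10 even though they describe the same relationship, while the standardized slopes agree. That agreement is what makes it possible to compare results from studies that used different scales.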

Next we will cover how to calculate some of the most common standardized effect statistics. 

### Standardized model coefficient


- r
- R²
- Cohen's d
- Cohen's f²

## 19.4 Visualizing uncertainty

## 19.5 Statistical power

Effect size is also related to how likely we are to find a statistically significant effect at all. This likelihood is called **statistical power**.
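As a rough sketch of this relationship, the code below approximates power for a two-sided comparison of two equal-sized groups. It relies on several assumptions that are not taken from this chapter: the effect size `d` is a standardized mean difference, the significance threshold `alpha` is 0.05, and a normal approximation is used in place of an exact test. The function name `approx_power` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison (normal approximation)."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Expected value of the z statistic if the true standardized effect is d.
    shift = d * sqrt(n_per_group / 2)
    # Probability that the z statistic lands beyond either critical value.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


for d in (0.2, 0.5, 0.8):
    for n in (20, 64, 200):
        print(f"d = {d}, n per group = {n:>3}: power = {approx_power(d, n):.2f}")
```

Under these assumptions, power rises with both the effect size and the sample size: larger effects are easier to detect, and smaller effects need more data to reach the same chance of a significant result.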

## 19.6 Positive predictive value

## 19.7 Power planning