# Chapter 22 - Lying with Statistics

We are very nearly done with this course on statistical methods in psychology research. So far you have learned how to read, mutate, summarize, and visualize data. You've also learned several ways to model data, making predictions about a variable based on information in other variables. Finally, you've learned about both Frequentist and Bayesian approaches to evaluating those models and making inferences about the state of the world based on data. 

All of this empowers you to seek insights from data. However, you also have the power to do dangerous things with data. You've learned about how easy bias and misinterpretation are. Given the objective veneer that statistics bring to our arguments, it behooves us to wield these statistics responsibly. In this final chapter, we will learn about the consequences of using statistics *badly*, and how to look out for these practices in our own and others' research. 

## 22.1 The replication crisis

Most people think that science is a reliable way to answer questions about the world. When our physician prescribes a treatment, we trust that it has been shown to be effective through research. We have similar faith that the airplanes that we fly in aren’t going to fall from the sky. 

However, there has been an increasing concern that science may not always work as well as we think. In 2004, the renowned psychologist Daryl Bem published a book chapter called “Writing the Empirical Journal Article” in order to give advice to budding scientists on how to publish their research. Bem provided suggestions such as:

> **Which article should you write?** There are two possible articles you can write: (1) the article you planned to write when you designed your study or (2) the article that makes the most sense now that you have seen the results. They are rarely the same, and the correct answer is (2).

In another section, he continues: 

> **Analyzing data** Examine them from every angle. Analyze the [variable levels] separately. Make up new composite indices. If a datum suggests a new hypothesis, try to find further evidence for it elsewhere in the data. If you see dim traces of interesting patterns, try to reorganize the data to bring them into bolder relief. If there are participants you don’t like, or trials, observers, or interviewers who gave you anomalous results, drop them (temporarily). Go on a fishing expedition for something — anything — interesting.

At first blush, this may seem like good advice. Data are just numbers that represent the world, so we should do whatever we can to reveal the message they communicate, right? 

Unfortunately, as we have seen in the prior chapters, there are many ways numbers can lead us astray. If we are not careful, the inferences we make from data can be false positives, imprecise, biased, etc. There are many ways to analyze data, and different analyses can lead to different conclusions. If we have a particular conclusion we *want* to reveal through data (either about the size of an effect or just a significant finding in general), we can often "massage" the data in a particular way in order to reveal that conclusion. British economist Ronald Coase captured this reality succinctly when he wrote: "If you torture the data long enough, it will confess to anything."

The kinds of data manipulation in search of a significant result that Bem described are now known as **Questionable Research Practices (QRPs)**. But in 2004, not many people thought of them that way. They came to the center stage of psychology in 2011 when the same Daryl Bem [published a paper](https://psycnet.apa.org/doiLanding?doi=10.1037%2Fa0021524) claiming to have discovered evidence of psychic abilities in his participants. Specifically, he claimed that his participants demonstrated a significant effect of *precognition*, meaning they were able to predict information in the future at levels significantly above chance. 

This was a shocking claim, and attracted a lot of attention. Upon closer inspection, other researchers [pointed out](http://www.talyarkoni.org/blog/2011/01/10/the-psychology-of-parapsychology-or-why-good-researchers-publishing-good-articles-in-good-journals-can-still-get-it-totally-wrong/) that Bem had engaged in the following QRPs: 

- The paper published the results of 9 studies, but the sample sizes varied across studies
- Different studies appear to have been lumped together or split apart
- The studies allow many different hypotheses, and it’s not clear which were planned in advance
- Bem used one-tailed tests even when it’s not clear that there was a directional prediction (so &alpha; was really 0.1)
- Most of the p-values are very close to 0.05
- It’s not clear how many other studies were run but not reported
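The arithmetic behind the one-tailed point in the list above can be checked directly. The following is a minimal sketch, assuming a simple z-test (not necessarily the tests used in the paper) and assuming the direction of the test is picked only after seeing which way the data lean. Under those assumptions, a nominal one-tailed &alpha; of 0.05 behaves like a two-tailed &alpha; of 0.10.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal distribution: mean 0, sd 1

alpha_one_tailed = 0.05

# Critical z for a one-tailed test at alpha = 0.05 (about 1.645).
z_crit = std_normal.inv_cdf(1 - alpha_one_tailed)

# If the direction is chosen after looking at the data, a result counts as
# "significant" whenever |z| exceeds z_crit, no matter which tail it falls in.
# The false positive rate is then the area in BOTH tails beyond z_crit.
effective_alpha = 2 * (1 - std_normal.cdf(z_crit))

print(f"critical z: {z_crit:.3f}")
print(f"effective false positive rate: {effective_alpha:.3f}")
```

A one-tailed test is only honest when the direction was committed to before the data were collected, which is one reason it matters whether a directional prediction was planned.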

When other researchers repeated the experiments with a pre-registered plan, the results didn't replicate. There was no robust evidence of precognition. 

But how could such a hugely respected scientist get it so wrong? Surely this must be a fluke? A team of researchers (together called the Open Science Collaboration) were perturbed by this turn of events, and set out to find how common these QRPs were across the field. This systematic investigation led to a paper called ["Estimating the reproducibility of psychological science"](https://www.science.org/doi/10.1126/science.aac4716).

It would not be an exaggeration to say that this paper was earth-shattering for the field of psychology. The large team of researchers chose 100 well-cited psychology studies and attempted to replicate the results originally reported in the papers. Whereas 97% of the original papers had reported statistically significant findings, only 36% of these effects were statistically significant in the replications. It seemed like the majority of psychological knowledge was built on nothing but hot air. And although psychology got much of the attention (and criticism), in the ensuing years it also became clear that other fields of research suffered from the same crisis of replicability, such as [cancer biology](https://elifesciences.org/articles/04333), [chemistry](https://www.nature.com/articles/548485a), [economics](https://www.nber.org/system/files/working_papers/w22989/w22989.pdf), and [other social sciences](https://www.nature.com/articles/s41562-018-0399-z). 

Clearly, the traditional way of doing research was failing us. Statistics based on QRPs, together with the incentive to hunt for anything significant, led to a majority of research being unreplicable, which stalls progress and wastes billions of dollars in research funding. It was time for a better way of doing research. 

In the decade since, the field has been evolving. Different statistical standards are being debated, and problematic practices are diminishing. There is still a ways to go to figure out what the best practices are, but in the meantime we have learned which practices are definitely harmful. In this chapter, we will discuss 10 of these "statistical sins" so that you know to avoid them in your own research, and to be sensitive to their use in research that you review. 

## 22.2 P-hacking

The p-value is a core component of null hypothesis significance testing in the Frequentist framework. A p-value is defined as the probability of obtaining a result at least as extreme as the observed one if the null hypothesis is true (i.e., if there is no effect). If the p-value is smaller than the &alpha; threshold, then the test result is labeled “significant” and the null hypothesis is rejected. Researchers who are interested in showing an effect in their data (e.g., that a new medicine improved the health of patients) are therefore eager to obtain small p-values that allow them to reject the null hypothesis and claim the existence of an effect.
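To make the definition concrete, here is a minimal sketch of a p-value calculation. It assumes the simplest possible case, a z-test where the population standard deviation is known, and every number in it is hypothetical. The function name `two_sided_p_value` is made up for this illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p_value(sample_mean, null_mean, sigma, n):
    """Probability of a result at least as extreme as sample_mean,
    in either direction, if the null hypothesis is true (z-test)."""
    standard_error = sigma / sqrt(n)
    z = (sample_mean - null_mean) / standard_error
    # Area in both tails of the standard normal beyond |z|.
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical study: 100 participants average 103 on a scale whose
# null-hypothesis mean is 100 and whose known sd is 15.
p = two_sided_p_value(sample_mean=103.0, null_mean=100.0, sigma=15.0, n=100)
alpha = 0.05
print(f"p = {p:.4f}, significant at alpha = {alpha}: {p < alpha}")
```

With these made-up numbers the z-statistic is exactly 2, so the p-value lands just under 0.05 and the result would be labeled significant.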

The first major statistical sin we will discuss is the kind of slicing and dicing of a dataset that Bem recommended in the pursuit of a significant p-value. Collectively, these practices are known as **p-hacking** - hacking into your dataset to find whatever configuration of variables will result in a significant p-value. 

You can do a great many things while p-hacking: choosing slightly different predictor/outcome variables, transforming variables, controlling for other variables, removing certain outliers, splitting the data into subgroups, etc. No matter the method, the overarching process of p-hacking is that you first check the significance results of a model. If it is not significant, you try a different version of the model, then check the significance again. You keep repeating this process with slightly different versions of the model until you finally arrive at the coveted p < 0.05. 

The problem with this is what we first learned about in chapter 17 - Type I error. In the Frequentist approach, if an effect is truly equal to 0 in the population, setting an &alpha; level to 0.05 means that there is a 5% chance our analysis in a sample will return a false positive - find a significant effect, when there shouldn't be one. But if we test many models over and over, the chance that *any* of them is a false positive climbs above 5%. Even if there is no effect in the population, the probability is very high that at least one hypothesis test will (erroneously) show a significant result, if a sufficiently large number of tests are conducted. Researchers then report this significant result, and claim to have found an effect.
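How quickly the false positive rate accumulates can be worked out with a short calculation. This sketch assumes the tests are independent of one another, which is a simplification: in real p-hacking the models share data, so the tests are correlated and the inflation is usually somewhat smaller. The function name is made up for illustration.

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one false positive across n_tests
    independent tests when every null hypothesis is true."""
    # Each test avoids a false positive with probability (1 - alpha);
    # all n_tests avoid one with probability (1 - alpha) ** n_tests.
    return 1 - (1 - alpha) ** n_tests


for k in (1, 5, 10, 20, 50):
    rate = familywise_error_rate(0.05, k)
    print(f"{k:>2} tests: {rate:.3f}")
```

Under this assumption, five tests already push the chance of at least one false positive to roughly 23%, and twenty tests push it past 60%.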

Below are some simulations that show the effects of p-hacking on the Type I error rate. For example, let's say we measured the same outcome variable in 5 slightly different ways. We could p-hack by fitting a model to each of these 5 different outcomes and reporting whichever model turned out best. If the true effect were 0, then across 1000 samples like this, we'd want only 5% of them to erroneously give us a significant result. But reporting the best out of 5 inflates our Type I error rate to ~23%. 

<img src="images/ch22-phack1.png" width="800">
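A simulation in the same spirit can be sketched in a few lines. This is not the code that produced the figure. It is a simplified illustration that assumes a two-group design, five outcome measures that are independent of each other, and a known standard deviation of 1 so that a z-test can be computed with the standard library alone. The function names, sample sizes, and seed are all made up for this example.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def z_test_p(group_a, group_b, sigma=1.0):
    """Two-sided p-value for a difference in group means (known sigma,
    equal group sizes)."""
    n = len(group_a)
    z = (fmean(group_a) - fmean(group_b)) / (sigma * sqrt(2 / n))
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate(n_sims=2000, n_outcomes=5, n_per_group=30, alpha=0.05, seed=158):
    """Compare an honest analysis (one pre-specified outcome) with a
    p-hacked one (report the best of n_outcomes). The true effect is 0."""
    rng = random.Random(seed)
    honest_hits = 0
    hacked_hits = 0
    for _ in range(n_sims):
        p_values = []
        for _ in range(n_outcomes):
            # Both groups come from the same population: no real effect.
            a = [rng.gauss(0, 1) for _ in range(n_per_group)]
            b = [rng.gauss(0, 1) for _ in range(n_per_group)]
            p_values.append(z_test_p(a, b))
        honest_hits += p_values[0] < alpha    # only the planned outcome
        hacked_hits += min(p_values) < alpha  # whichever outcome looks best
    return honest_hits / n_sims, hacked_hits / n_sims


honest_rate, hacked_rate = simulate()
print(f"Type I error rate, one planned outcome: {honest_rate:.3f}")
print(f"Type I error rate, best of 5 outcomes:  {hacked_rate:.3f}")
```

The first rate should land near the nominal 5%, while the second should land near the ~23% described above. If the five outcomes were correlated, as slightly different measurements of the same variable usually are, the inflation would be smaller but still well above 5%.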

1) Garden of forking paths (different IV/DV variables, variable transformations, controlling, preferable outlier exclusion, composite score redefinition, different model types, NA imputation, subgroup analysis) – basically deciding what model is good based on running a bunch of models and reporting the best one
2) Related, no correction for multiple comparisons
3) HARKing
4) Optional stopping/continuing
5) Violations of linear model assumptions 
6) Difference in conditions without comparing them directly
7) Small samples
8) Over-interpreting non-significant results
9) Correlation vs. causation
10) Misleading visualization