# P-hacking and confidence intervals

Now that we have tools for simulating random events, let's investigate a couple of ways they can be used and misused.  The first is called "p-hacking", where a scientific study is set up in a way that makes a "statistically significant" result more likely.

Let's look at one example, inspired by this [XKCD comic](https://xkcd.com/882/).

Write code to do the following:
1. Generate two sets of normally distributed random data with the same mean and variance.  The first is the *control* and the second is the *test* set.
2. Use a t-test (`scipy.stats.ttest_ind()`) to calculate the p-value: the probability of seeing a difference at least this large between the two data sets if both were drawn from the same distribution.
3. If $p \ge 0.05$, then print out a result like "we found no link between purple jelly beans and acne (p = 0.791)".  If $p < 0.05$, print out a result like "green jelly beans linked to acne! Only 5% chance of coincidence!"

Once you've got this working, write a `for` loop to run this experiment for all of the jelly bean colors.
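
The steps above can be sketched as follows. This is one possible approach rather than the intended solution: the helper `run_trial`, the sample size of 50, and the mean of 0 and standard deviation of 1 are all assumptions made for illustration.

```python
import numpy as np
from scipy import stats

jellybean_colors = ["purple", "brown", "pink", "blue", "teal", "salmon", "red",
                    "turquoise", "magenta", "yellow", "grey", "tan", "cyan", "green",
                    "mauve", "beige", "lilac", "black", "peach", "orange"]


def run_trial(color, rng, n=50, alpha=0.05):
    """Hypothetical helper: run one control-vs-test experiment for one color."""
    # Both groups come from the same distribution, so any "effect" is pure chance
    control = rng.normal(loc=0.0, scale=1.0, size=n)
    test = rng.normal(loc=0.0, scale=1.0, size=n)
    p = stats.ttest_ind(control, test).pvalue
    if p < alpha:
        message = f"{color} jelly beans linked to acne! (p = {p:.3f})"
    else:
        message = f"we found no link between {color} jelly beans and acne (p = {p:.3f})"
    return p, message


rng = np.random.default_rng()
for color in jellybean_colors:
    p, message = run_trial(color, rng)
    print(message)
```

Because each of the 20 tests has a 5% chance of a false positive, the chance that at least one color looks "significant" is $1 - 0.95^{20}$, or about 64%.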


In [None]:
import numpy as np
from scipy import stats

jellybean_colors = ["purple", "brown", "pink", "blue", "teal", "salmon", "red", "turquoise", "magenta", "yellow", "grey", "tan", "cyan", "green", "mauve", "beige", "lilac", "black", "peach", "orange"]

# Your code here...

*Challenge*:
* Make some scatter plots showing the data, and annotate them with the corresponding means
* What happens if you test some other things along with acne? (e.g., 

P-hacking can take many forms, nearly all of which are well-intentioned:
* Collecting some data, finding that $p \ge 0.05$, and then continuing to collect data
* Making various choices about data filtering
* Recording lots of data and only using part of it
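
To see why the first item in the list matters, here is a minimal simulation sketch. It assumes that item means "keep adding data and re-testing until the result turns significant"; the function `optional_stopping_trial` and its parameters (`start_n`, `step`, `max_n`) are hypothetical names chosen for this example.

```python
import numpy as np
from scipy import stats


def optional_stopping_trial(rng, start_n=20, step=10, max_n=100, alpha=0.05):
    """Return True if a 'significant' result appears at any look at the data."""
    # Control and test come from the same distribution: there is no real effect
    control = rng.normal(size=start_n)
    test = rng.normal(size=start_n)
    while True:
        p = stats.ttest_ind(control, test).pvalue
        if p < alpha:
            return True   # stop early and "publish"
        if len(control) >= max_n:
            return False  # ran out of budget without a significant result
        # Not significant yet, so collect more data and test again
        control = np.concatenate([control, rng.normal(size=step)])
        test = np.concatenate([test, rng.normal(size=step)])


rng = np.random.default_rng(seed=1)  # fixed seed, only for reproducibility
n_experiments = 500
false_positives = sum(optional_stopping_trial(rng) for _ in range(n_experiments))
rate = false_positives / n_experiments
print(f"false-positive rate with optional stopping: {rate:.3f}")
```

Each individual test still uses the 0.05 threshold, but checking repeatedly gives chance several opportunities to cross it, so the overall false-positive rate typically comes out well above 5%.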

[Why Most Published Research Findings Are False (PLOS Medicine)](https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0020124)

## Confidence intervals

Let's start by generating a "random walk": imagine walking on a number line, taking steps back and forth at random and recording our path.


In [None]:
import matplotlib.pyplot as plt

rng = np.random.default_rng()

# Create a "random walk" starting at 0
x = [0]

for i in range(499):
    # Take the previous location, and add a random number to it
    x.append(x[-1] + rng.normal())

plt.plot(range(500), x)
plt.show()

Write code to create many random walks (say 100) and plot them all on the same graph.

Can you calculate the 50% confidence interval from the data?  The 95% confidence interval?
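
One way to approach this is sketched below. It assumes that the "confidence interval" here means the central band that contains 50% (or 95%) of the walks at each step, estimated with percentiles; the array names `walks`, `lo50`, `hi50`, `lo95`, and `hi95` are chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed, only for reproducibility
n_walks, n_steps = 100, 500

# Each row is one walk: start at 0, then accumulate 499 normally distributed steps
steps = rng.normal(size=(n_walks, n_steps - 1))
walks = np.concatenate([np.zeros((n_walks, 1)), np.cumsum(steps, axis=1)], axis=1)

# At each time step, take percentiles across all of the walks
lo50, hi50 = np.percentile(walks, [25, 75], axis=0)
lo95, hi95 = np.percentile(walks, [2.5, 97.5], axis=0)

print(f"final step, 50% band: [{lo50[-1]:.1f}, {hi50[-1]:.1f}]")
print(f"final step, 95% band: [{lo95[-1]:.1f}, {hi95[-1]:.1f}]")
```

The bands can be drawn over the walks with, for example, `plt.fill_between(range(n_steps), lo95, hi95, alpha=0.2)`. Since the position after $k$ steps is a sum of $k$ independent standard normal values, its standard deviation is $\sqrt{k}$, so the 95% band should sit roughly at $\pm 1.96\sqrt{k}$.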

In [None]:
# Create many random walks (say 100)
# Your code here...