In the hypothesis testing modules, the line that creates the simulation is generally in the same chunk as the line that plots the null distribution. For instance:
```{r}
# Simulation and plot live in the same chunk, so both rerun together
sims <- do(5000) * diffprop(smoke ~ shuffle(gender), data = smoke_gender)
ggplot(sims, aes(x = diffprop)) +
  geom_histogram() +
  geom_vline(xintercept = obs_diff, color = "blue") +
  geom_vline(xintercept = -obs_diff, color = "blue")
```
This causes problems when students experiment to find a good binwidth: the simulations are regenerated every time the code chunk is run, so the null distribution and the calculated p-value may change slightly from one run to the next.
My suggestion is to train students to generate their simulations in a separate code chunk and only then plot the null distribution in a later chunk.
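A minimal sketch of that split, reusing the example above. It assumes `smoke_gender` and `obs_diff` already exist from earlier in the module and that the mosaic and ggplot2 packages are loaded; the `binwidth` value is only a placeholder for students to adjust.

```{r}
# Chunk 1: generate the simulations once and store them
sims <- do(5000) * diffprop(smoke ~ shuffle(gender), data = smoke_gender)
```

```{r}
# Chunk 2: plot only; rerunning this chunk reuses the stored `sims`
ggplot(sims, aes(x = diffprop)) +
  geom_histogram(binwidth = 0.02) +  # placeholder binwidth to experiment with
  geom_vline(xintercept = obs_diff, color = "blue") +
  geom_vline(xintercept = -obs_diff, color = "blue")
```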
This way the simulations will at least be stable through multiple redrawings of the ggplot. It will also improve runtime and decrease memory load on the server.
Other possible solutions (that I don't really like):
- Have students "restart R and run all chunks" every time they want to execute any chunk, but this seems like overkill.
- Have students reset the seed every time they generate a simulation, but this would be onerous in the modules where `rflip` and `shuffle` are introduced and run multiple times.
- Reset the seed only before generating a big data frame of simulations, but then I'd be worried about weird errors resulting from running the module's chunks in a different order.
Related to the secondary point about setting seeds, I've decided to set a seed in every code chunk in which a random process appears. It is a bit onerous, but it's the best thing to ensure reproducibility in a notebook that allows for code chunks to be run in any order and/or multiple times.
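As a rough illustration of what a seeded chunk could look like (the seed value `1234` is an arbitrary placeholder, and `smoke_gender` is assumed to exist from earlier in the module, with mosaic loaded):

```{r}
# Reset the seed inside the same chunk as the random process, so the
# chunk gives the same result no matter when or how often it is run
set.seed(1234)  # arbitrary placeholder seed
sims <- do(5000) * diffprop(smoke ~ shuffle(gender), data = smoke_gender)
```

Because the call to `set.seed()` sits in the same chunk as `shuffle()`, rerunning the chunk on its own or out of order should reproduce the same `sims`.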