[Suggestion] Generate simulations in a separate code chunk #9

rhinopotamus · 2018-10-03T15:39:53Z

In the hypothesis testing modules, the line that creates the simulation is generally in the same chunk as the line that plots the null distribution. For instance:

```{r}
sims <- do(5000) * diffprop(smoke ~ shuffle(gender), data = smoke_gender)
ggplot(sims, aes(x=diffprop)) + 
    geom_histogram() +
    geom_vline(xintercept = obs_diff, color = "blue") +
    geom_vline(xintercept = -obs_diff, color = "blue")
```

This can cause problems as students experiment to find a good binwidth -- the simulations keep getting regenerated every time the code chunk is run, and therefore the distribution or the calculated p-value may change slightly.

My suggestion is to train students to generate their simulations in a separate code block and then start plotting the null distribution:

```{r}
sims <- do(5000) * diffprop(smoke ~ shuffle(gender), data = smoke_gender)
```

```{r}
ggplot(sims, aes(x=diffprop)) + 
    geom_histogram(binwidth= 0.01) +
    geom_vline(xintercept = obs_diff, color = "blue") +
    geom_vline(xintercept = -obs_diff, color = "blue")
```

This way the simulations will at least be stable through multiple redrawings of the ggplot. It will also improve runtime and decrease memory load on the server.
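A minimal base-R sketch of why the split helps. The `toy` data frame and the `shuffled_diff()` helper are hypothetical stand-ins for `smoke_gender` and mosaic's `diffprop(smoke ~ shuffle(gender))`, not code from the modules. Once `sims` is stored by its own chunk, anything computed from it downstream (a histogram, a two-sided p-value) stays the same however many times the downstream chunk is rerun:

```r
# Hypothetical toy data standing in for smoke_gender (not from the modules).
toy <- data.frame(
  smoke  = rep(c("yes", "no"), times = c(30, 70)),
  gender = rep(c("F", "M", "F", "M"), times = c(20, 10, 30, 40))
)

# Base-R stand-in (assumption) for diffprop(smoke ~ shuffle(gender), data = toy):
# shuffle the group labels, then take the difference in proportions of smokers.
shuffled_diff <- function(data) {
  g <- sample(data$gender)
  mean(data$smoke[g == "F"] == "yes") - mean(data$smoke[g == "M"] == "yes")
}

# Observed difference in proportions on the unshuffled data.
obs_diff <- mean(toy$smoke[toy$gender == "F"] == "yes") -
  mean(toy$smoke[toy$gender == "M"] == "yes")

# "Chunk 1": generate the simulations once and keep them.
sims <- replicate(2000, shuffled_diff(toy))

# "Chunk 2": rerun as often as needed; it only reads the stored sims.
p_value_1 <- mean(abs(sims) >= abs(obs_diff))
p_value_2 <- mean(abs(sims) >= abs(obs_diff))
identical(p_value_1, p_value_2)
```

If the `replicate()` line lived in the same chunk as the p-value, each rerun would draw a fresh set of shuffles and the two values could differ slightly.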

Other possible solutions (that I don't really like):

- Have students "restart R and run all chunks" every time they want to execute any chunk, but this seems like overkill.
- Have students reset the seed every time they generate a simulation, but this would be onerous in the modules where rflip and shuffle are introduced and run multiple times.
- Or we could say, just reset the seed before generating a big dataframe of simulations, but then I'd be worried about weird errors resulting from running the module in a different order.


VectorPosse · 2018-10-03T15:47:21Z

Yes, this makes a lot of sense.

VectorPosse · 2019-01-11T03:25:55Z

Related to the secondary point about setting seeds, I've decided to set a seed in every code chunk in which a random process appears. It is a bit onerous, but it's the best thing to ensure reproducibility in a notebook that allows for code chunks to be run in any order and/or multiple times.
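For illustration, a rough base-R sketch of the per-chunk seed idea. The seed value `1234`, the `toy` data, and the `shuffled_diff()` helper are all hypothetical; the helper is only a stand-in for mosaic's `diffprop(smoke ~ shuffle(gender))`. Because the seed is set at the top of the chunk, rerunning that chunk reproduces the same simulations even after other random code has run in between:

```r
# Hypothetical toy data standing in for smoke_gender (not from the modules).
toy <- data.frame(
  smoke  = rep(c("yes", "no"), times = c(30, 70)),
  gender = rep(c("F", "M", "F", "M"), times = c(20, 10, 30, 40))
)

# Base-R stand-in (assumption) for diffprop(smoke ~ shuffle(gender), data = toy).
shuffled_diff <- function(data) {
  g <- sample(data$gender)  # shuffle the group labels
  mean(data$smoke[g == "F"] == "yes") - mean(data$smoke[g == "M"] == "yes")
}

# First run of the chunk: the seed is set before the random process.
set.seed(1234)
sims_first <- replicate(1000, shuffled_diff(toy))

# Some other chunk consumes random numbers in the meantime.
invisible(runif(10))

# Rerunning the same chunk: same seed, same simulations.
set.seed(1234)
sims_second <- replicate(1000, shuffled_diff(toy))

identical(sims_first, sims_second)
```

Without the second `set.seed()` call, the rerun would start from wherever the random number stream had been left, and the two sets of simulations would generally differ.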

VectorPosse added the suggestion label Nov 9, 2018

VectorPosse closed this as completed Jan 11, 2019
