[Suggestion] Generate simulations in a separate code chunk #9

Closed
rhinopotamus opened this issue Oct 3, 2018 · 2 comments

Comments

@rhinopotamus
Contributor

In the hypothesis testing modules, the line that creates the simulation is generally in the same chunk as the line that plots the null distribution. For instance:

```{r}
sims <- do(5000) * diffprop(smoke ~ shuffle(gender), data = smoke_gender)
ggplot(sims, aes(x=diffprop)) + 
    geom_histogram() +
    geom_vline(xintercept = obs_diff, color = "blue") +
    geom_vline(xintercept = -obs_diff, color = "blue")
```

This can cause problems as students experiment to find a good binwidth: the simulations are regenerated every time the code chunk is run, so the null distribution and the calculated p-value may change slightly from run to run.

My suggestion is to train students to generate their simulations in a separate code chunk and only then plot the null distribution:

```{r}
sims <- do(5000) * diffprop(smoke ~ shuffle(gender), data = smoke_gender)
```

```{r}
ggplot(sims, aes(x=diffprop)) + 
    geom_histogram(binwidth= 0.01) +
    geom_vline(xintercept = obs_diff, color = "blue") +
    geom_vline(xintercept = -obs_diff, color = "blue")
```

This way the simulations will at least be stable through multiple redrawings of the ggplot. It will also improve runtime and decrease memory load on the server.
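
As a rough sketch of why the split helps: once the simulations are stored, anything computed from them in a later chunk (such as a two-sided p-value) stays fixed no matter how often that later chunk is rerun. The example below uses only base R so it can run on its own; `toy_data` and the helper `diff_in_props` are made up for illustration and are not the module's `smoke_gender` data or mosaic's `diffprop`.

```{r}
# Hypothetical toy data standing in for smoke_gender (not the real data set).
toy_data <- data.frame(
  smoke  = rep(c("yes", "no", "yes", "no"), times = c(30, 70, 20, 80)),
  gender = rep(c("F", "M"), each = 100)
)

# Hypothetical helper: difference in the proportion of "yes" between the two groups.
diff_in_props <- function(outcome, group) {
  props <- tapply(outcome == "yes", group, mean)
  unname(props[1] - props[2])
}

obs_diff <- diff_in_props(toy_data$smoke, toy_data$gender)

# "Simulation chunk": run once and store the shuffled differences.
sims <- data.frame(
  diffprop = replicate(
    5000,
    diff_in_props(toy_data$smoke, sample(toy_data$gender))
  )
)

# "Analysis chunk": rerunning only this part reuses the stored sims,
# so the two-sided p-value is the same on every run.
p_value_first  <- mean(abs(sims$diffprop) >= abs(obs_diff))
p_value_second <- mean(abs(sims$diffprop) >= abs(obs_diff))
identical(p_value_first, p_value_second)
```

The p-value still differs between students (or after the simulation chunk is rerun) unless a seed is also set, but it no longer drifts while someone is only adjusting the plot.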

Other possible solutions (that I don't really like):

  • have students "restart R and run all chunks" every time they want to execute any chunk, but this seems like overkill.
  • have students reset the seed every time they generate a simulation, but this would be onerous in the modules where rflip and shuffle are introduced and run multiple times.
    • Or we could say, just reset the seed before generating a big dataframe of simulations, but then I'd be worried about weird errors resulting from running the module in a different order.
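
To make the ordering worry in the last bullet concrete, here is a small base R sketch. The two functions are hypothetical stand-ins for notebook chunks (the names and the seed value are made up): the big simulation resets the seed, the small coin-flip demo does not.

```{r}
# Hypothetical stand-ins for two notebook chunks (names are made up).
big_chunk <- function() {
  set.seed(42)  # seed reset only before the big simulation
  replicate(1000, mean(sample(0:1, size = 50, replace = TRUE)))
}
small_chunk <- function() {
  # No seed here, like a quick coin-flip demo that gets run several times.
  sample(c("H", "T"), size = 100, replace = TRUE)
}

# Student A runs the big chunk, then the small chunk once.
sims_a  <- big_chunk()
flips_a <- small_chunk()

# Student B runs the small chunk an extra time along the way.
sims_b  <- big_chunk()
extra   <- small_chunk()
flips_b <- small_chunk()

identical(sims_a, sims_b)    # the seeded simulation matches for both students
identical(flips_a, flips_b)  # the unseeded chunk depends on what ran before it
```

The seeded simulation is stable, but any unseeded chunk still depends on how many random draws happened since the last seed reset.
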
@VectorPosse
Owner

VectorPosse commented Oct 3, 2018 via email

@VectorPosse
Owner

Related to the secondary point about setting seeds, I've decided to set a seed in every code chunk in which a random process appears. It is a bit onerous, but it's the best thing to ensure reproducibility in a notebook that allows for code chunks to be run in any order and/or multiple times.
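
A minimal base R sketch of what that buys, assuming each chunk sets its own seed immediately before its random step. The two functions below are hypothetical stand-ins for chunks, and the seed values are arbitrary.

```{r}
# Two hypothetical chunks, each setting its own seed before its random step.
flip_chunk <- function() {
  set.seed(1)
  sample(c("H", "T"), size = 100, replace = TRUE)
}
shuffle_chunk <- function() {
  set.seed(2)
  sample(1:100)
}

# Run in one order...
flips_a   <- flip_chunk()
shuffle_a <- shuffle_chunk()

# ...then in the opposite order, with one chunk run twice.
shuffle_b <- shuffle_chunk()
shuffle_b <- shuffle_chunk()
flips_b   <- flip_chunk()

identical(flips_a, flips_b)      # TRUE
identical(shuffle_a, shuffle_b)  # TRUE
```

Because every chunk resets the generator itself, its output no longer depends on which chunks ran before it or how many times.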
