calculating statistic from *random samples* drawn from a population #44

Closed
hardin47 opened this issue Jul 23, 2017 · 4 comments

Comments

@hardin47

@andrewpbray @mine-cetinkaya-rundel @beanumber @ismayc @nicksolomon

Another thing I do in my course is to create a sampling distribution of slopes taken from a huge population. I do this in chapter 1 before I've done any inference. The idea is just to visualize how the slopes change from sample to sample outside the context of making specific hypotheses.

Again, would appreciate your vote on which of these options seems best / most consistent with what we are doing.

First:

explanatory <- rnorm(1000, 0, 5)
response <- 40 + explanatory*2 + rnorm(1000,0,10)
popdata <- data.frame(explanatory,response)
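As a quick sanity check on the simulated population, fitting the regression to all 1000 rows should give coefficients close to the generating values (intercept 40, slope 2). This is only a sketch: the seed and the names `pop_fit` and `pop_coefs` are assumptions added for illustration and are not part of the original post.

```r
# Assumption: a fixed seed (not in the original post) so the sketch is reproducible.
set.seed(47)

# Rebuild the simulated population as in the post.
explanatory <- rnorm(1000, 0, 5)
response <- 40 + explanatory * 2 + rnorm(1000, 0, 10)
popdata <- data.frame(explanatory, response)

# Fit the regression on the whole population; the coefficients should land
# near the generating values (intercept 40, slope 2), up to simulation noise.
pop_fit <- lm(response ~ explanatory, data = popdata)
pop_coefs <- coef(pop_fit)
print(pop_coefs)
```

The population slope from this fit is the value that the sample slopes below should scatter around.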

(A)
Uses rep_sample_n to get many samples. Two plots here: one is superimposed lines on a scatterplot, one is a histogram.

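library(dplyr)    # %>%, group_by(), do(), filter()
library(infer)    # rep_sample_n()
library(ggplot2)
library(broom)    # tidy()

# 100 samples of 50 rows each; rep_sample_n() adds a replicate column
manysamples <- popdata %>%
  rep_sample_n(size = 50, reps = 100)

# one fitted line per sample, superimposed on the sampled points
ggplot(manysamples, aes(x = explanatory, y = response, group = replicate)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# slope estimate from each sample
manylms <- manysamples %>%
  group_by(replicate) %>%
  do(tidy(lm(response ~ explanatory, data = .))) %>%
  filter(term == "explanatory")

ggplot(manylms, aes(x = estimate)) + geom_histogram()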

(B)
This code doesn't exist yet. We could add another option to generate() in infer, maybe type = "sample"?

manylms <- popdata %>%
  specify(response ~ explanatory) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 100, type = "sample", size=50) %>%
  calculate(stat = "slope") 

# I don't know how to use the above to make the slope plot in ggplot (which I believe to be very powerful).  
# I'd need both the slope and the intercept to draw the lines... so maybe this is a moot point??

# calculate() returns the statistic in a column named stat
ggplot(manylms, aes(x = stat)) + geom_histogram()
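On the point about needing both the slope and the intercept to draw the lines: one possible workaround, sketched here in base R rather than with any infer function, is to keep both coefficients from each sample's fit and pass them to `geom_abline()`. The seed and the names `manylines`, `line_plot`, and `slope_hist` are assumptions for illustration only; the sampling loop stands in for `rep_sample_n()` and is not how infer implements it.

```r
# Assumption: a fixed seed (not in the original post) for reproducibility.
set.seed(47)
explanatory <- rnorm(1000, 0, 5)
response <- 40 + explanatory * 2 + rnorm(1000, 0, 10)
popdata <- data.frame(explanatory, response)

# Draw `reps` samples of `size` rows each (without replacement within a
# sample) and keep both coefficients from every fit.
reps <- 100
size <- 50
coef_list <- lapply(seq_len(reps), function(i) {
  rows <- sample(nrow(popdata), size)
  fit <- lm(response ~ explanatory, data = popdata[rows, ])
  data.frame(replicate = i,
             intercept = coef(fit)[["(Intercept)"]],
             slope = coef(fit)[["explanatory"]])
})
manylines <- do.call(rbind, coef_list)

# Build the two plots only if ggplot2 is installed.
# Use print(line_plot) or print(slope_hist) to display them.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # One fitted line per sample, drawn over the population scatterplot.
  line_plot <- ggplot(popdata, aes(x = explanatory, y = response)) +
    geom_point(alpha = 0.2) +
    geom_abline(data = manylines,
                aes(intercept = intercept, slope = slope),
                alpha = 0.3)
  # Sampling distribution of the slope.
  slope_hist <- ggplot(manylines, aes(x = slope)) + geom_histogram(bins = 30)
}
```

Because `manylines` has one row per sample, the same data frame feeds both the line plot and the histogram.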
@andrewpbray
Collaborator

Seems like this might be a good motivation for the bootstrap interval on the slope - is that how you're using it?

Am I right in thinking that with the current code, it's possible that the same observation is used in multiple samples? How will you be explaining that? I think we could build out generate() so that it has all the components to simulate from any generative model reps times, outputting a grouped data frame for the subsequent calculate() step. I'm not sure we want to get there yet though, because that seems like a pretty niche use case.
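The overlap question can be checked with a small base R sketch. It assumes each replicate is drawn without replacement within the replicate and independently of the others, which is my understanding of the `rep_sample_n()` default; `sample()` is used here as a stand-in, and the seed and variable names are illustrative assumptions. With 100 replicates of 50 rows from a population of 1000, there are 5000 draws in total, so rows must recur across replicates.

```r
# Assumption: a fixed seed for reproducibility; sample() stands in for
# rep_sample_n() with sampling without replacement inside each replicate.
set.seed(47)
pop_size <- 1000
size <- 50
reps <- 100

# Each column holds the row indices drawn for one replicate.
draws <- replicate(reps, sample(pop_size, size))

# No row appears twice inside any single replicate...
within_dupes <- any(apply(draws, 2, anyDuplicated) > 0)

# ...but the same row is reused across replicates: 5000 draws over 1000 rows
# means at least one row is used five or more times.
times_used <- tabulate(as.vector(draws), nbins = pop_size)
print(within_dupes)
print(summary(times_used))
```

So under these assumptions each sample is a valid simple random sample on its own, while the collection of samples shares observations.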

Are you planning on showing them the rnorm() code? An alternative would be to take a data set where you have all of the data (like housing in Ames), and draw samples from that, as if you are sending out separate teams of surveyors.

In any event, I'd lean towards option A.

@mine-cetinkaya-rundel
Collaborator

Agreed, I'd lean towards (A) as well.

@hardin47
Author

Right, I think @beanumber and I are talking about the same thing. The idea is to motivate the bigger concept of samples from a population (not specifically permutation tests or bootstrapping, but just the simpler idea that different samples give different statistics).

I do create the big population using rnorm, but no, I don't tell/show them that. I just say we have a huge hypothetical population.

I think for my situation, rep_sample_n seems like the best solution for now.

@github-actions

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

github-actions bot locked and limited conversation to collaborators on Mar 10, 2021