calculating statistic from random samples drawn from a population #44

hardin47 · 2017-07-23T22:00:17Z

@andrewpbray @mine-cetinkaya-rundel @beanumber @ismayc @nicksolomon

Another thing I do in my course is to create a sampling distribution of slopes taken from a huge population. I do this in chapter 1 before I've done any inference. The idea is just to visualize how the slopes change from sample to sample outside the context of making specific hypotheses.

Again, would appreciate your vote on which of these options seems best / most consistent with what we are doing.

First:

explanatory <- rnorm(1000, 0, 5)
response <- 40 + explanatory*2 + rnorm(1000,0,10)
popdata <- data.frame(explanatory,response)

(A)
Uses rep_sample_n to get many samples. Two plots here: one is superimposed lines on a scatterplot, one is a histogram.

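# Assumes an infer version that exports rep_sample_n() (see #82)
library(dplyr)    # %>%, group_by(), do(), filter()
library(ggplot2)  # plotting
library(broom)    # tidy()
library(infer)    # rep_sample_n()

# 100 samples of 50 rows each, labelled by a replicate column
manysamples <- popdata %>%
  rep_sample_n(size = 50, reps = 100)

# One fitted line per sample, superimposed on the sampled points
ggplot(manysamples, aes(x = explanatory, y = response, group = replicate)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Slope estimate from each sample's regression
manylms <- manysamples %>%
  group_by(replicate) %>%
  do(tidy(lm(response ~ explanatory, data = .))) %>%
  filter(term == "explanatory")

# Sampling distribution of the slope
ggplot(manylms, aes(x = estimate)) + geom_histogram()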

(B)
This code doesn't exist yet... we could write an additional option to generate in infer... maybe use the argument sample?

manylms <- popdata %>%
  specify(response ~ explanatory) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 100, type = "sample", size=50) %>%
  calculate(stat = "slope") 

# I don't know how to use the above to make the slope plot in ggplot (which I believe to be very powerful).  
# I'd need both the slope and the intercept to draw the lines... so maybe this is a moot point??

# calculate() returns the statistic in a column named stat, not estimate
ggplot(manylms, aes(x = stat)) + geom_histogram()
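On the question of needing both the slope and the intercept to draw the lines: one possible route, sketched here in base R so it does not depend on any particular version of infer, is to keep both coefficients for each sample and hand them to `geom_abline()`. The names `fit_one_sample` and `manylines` and the seed are made up for illustration, and the plotting step only runs if ggplot2 is installed.

```r
set.seed(44)  # arbitrary seed, only so the sketch is reproducible

# Same hypothetical population as in the post above
explanatory <- rnorm(1000, 0, 5)
response <- 40 + explanatory * 2 + rnorm(1000, 0, 10)
popdata <- data.frame(explanatory, response)

# Fit a regression to one random sample of 50 rows and
# return its intercept and slope as a one-row data frame
fit_one_sample <- function(i) {
  rows <- sample(nrow(popdata), size = 50)  # no repeats within a sample
  coefs <- coef(lm(response ~ explanatory, data = popdata[rows, ]))
  data.frame(replicate = i,
             intercept = coefs[["(Intercept)"]],
             slope = coefs[["explanatory"]])
}

# 100 samples, one row of coefficients per sample
manylines <- do.call(rbind, lapply(1:100, fit_one_sample))

# Draw every fitted line over the population scatterplot
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot() +
    ggplot2::geom_point(data = popdata,
                        ggplot2::aes(x = explanatory, y = response),
                        alpha = 0.2) +
    ggplot2::geom_abline(data = manylines,
                         ggplot2::aes(intercept = intercept, slope = slope),
                         alpha = 0.3)
  print(p)
}
```

The same `manylines` data frame also feeds the histogram via `aes(x = slope)`, so a single summary table can serve both plots.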


andrewpbray · 2017-07-24T19:42:10Z

Seems like this might be a good motivation for the bootstrap interval on the slope - is that how you're using it?

Am I right in thinking that with the current code, the same observation can end up in multiple samples? How will you be explaining that? I think we could build out generate() so that it has all the components to simulate from any generative model reps number of times, outputting a grouped data frame for the subsequent calculate() step. I'm not sure we want to go there yet, though, because that seems like a pretty niche use case.
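The overlap question can be checked directly. The sketch below mimics the sampling scheme in base R, assuming (as with `rep_sample_n()`'s default of sampling without replacement) that each sample contains no repeated rows and that samples are drawn independently of one another. The variable names and the seed are illustrative only.

```r
set.seed(44)  # arbitrary seed, only so the sketch is reproducible

n_pop <- 1000  # population size used in the post
size  <- 50    # rows per sample
reps  <- 100   # number of samples

# Each column holds the row indices of one sample (a 50 x 100 matrix)
samples <- replicate(reps, sample(n_pop, size = size))

# anyDuplicated() returns 0 when a sample has no repeated row
dup_within <- apply(samples, 2, anyDuplicated)

# How many of the 100 samples each population row ended up in
times_drawn <- tabulate(samples, nbins = n_pop)

all(dup_within == 0)   # TRUE: no row repeats within a single sample
mean(times_drawn)      # 5: size * reps / n_pop draws per row on average
sum(times_drawn > 1)   # number of rows that appear in more than one sample
```

With 5,000 draws spread over 1,000 rows, reuse across samples is unavoidable, which matches the "separate teams of surveyors" framing: each team samples the same population independently.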

Are you planning on showing them the rnorm() code? An alternative would be to take a data set where you have all of the data (like housing in Ames), and draw samples from that, as if you are sending out separate teams of surveyors.

In any event, I'd lean towards option A.

mine-cetinkaya-rundel · 2017-07-25T00:21:26Z

Agreed, I'd lean towards (A) as well.

hardin47 · 2017-07-26T21:17:31Z

Right, I think @beanumber and I are talking about the same thing. The idea is to motivate the bigger concept of samples from a population (not specifically permutation tests or bootstrapping, but just the simpler idea that different samples give different statistics).

I do create the big population using rnorm, but no, I don't tell/show them that. I just say we have a huge hypothetical population.

I think for my situation, rep_sample_n seems like the best solution for now.

github-actions · 2021-03-10T00:10:28Z

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

beanumber mentioned this issue Jul 25, 2017: Error in sample.int(length(x), size, replace, prob) : NA in probability vector #45 (Closed)

hardin47 closed this as completed Jul 26, 2017

rudeboybert mentioned this issue Jan 11, 2018: Export rep_sample_n() #82 (Merged)

github-actions bot locked and limited conversation to collaborators Mar 10, 2021
