Do NAs get resampled? #52
Comments
Right now the "original sample" is just the number of rows in the data frame, `NA`s included, as this example shows:

```r
library(nycflights13)
suppressPackageStartupMessages(library(dplyr))
library(stringr)
library(infer)
set.seed(2017)
fli_small <- flights %>%
  sample_n(size = 500) %>%
  mutate(half_year = case_when(
    between(month, 1, 6) ~ "h1",
    between(month, 7, 12) ~ "h2"
  )) %>%
  mutate(day_hour = case_when(
    between(hour, 1, 12) ~ "morning",
    between(hour, 13, 24) ~ "not morning"
  )) %>%
  select(arr_delay, dep_delay, half_year,
         day_hour, origin, carrier)

# Determine number of missing arrival delay values
sum(is.na(fli_small$arr_delay))
#> [1] 15

# Bootstrap uses similar code to oilabs::rep_sample_n()
boots <- fli_small %>%
  specify(response = arr_delay) %>%
  generate(reps = 100, type = "bootstrap")

# Count the missing arrival delays in each bootstrap replicate
boots %>%
  group_by(replicate) %>%
  summarize(num_na = sum(is.na(arr_delay)))
#> # A tibble: 100 x 2
#>    replicate num_na
#>        <int>  <int>
#>  1         1     17
#>  2         2     26
#>  3         3     16
#>  4         4     14
#>  5         5     19
#>  6         6     12
#>  7         7     10
#>  8         8     14
#>  9         9      8
#> 10        10     12
#> # ... with 90 more rows
```

If users take care of the |
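I completely agree that users should be taking care of their NAs and not us, as there may be different decisions that need to be made. So an error or warning is warranted for sure! That being said, I think the bootstrap sample size should be the number of complete cases in that column that is being resampled (or the two columns if specify has two variables) as opposed to the nrow of the data frame, what do you think? Though I guess if we give an error if there are NAs in the column to be resampled, we don't have to make this decision in infer. |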
I’m inclined to just give an error so that we don’t have to deal with this in multiple scenarios. Maybe even do this at the |
@ismayc that sounds good to me! this might make the |
Sounds good! I’ll tag the commit here when I have this implemented. |
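The "just give an error" idea above could look roughly like the following base R sketch. `check_no_missing()` is a hypothetical helper written for illustration only; it is not part of infer, and the thread does not say how or where the real check ended up being implemented.

```r
# Hypothetical helper (not an infer function): stop early if the column
# that is about to be resampled contains missing values.
check_no_missing <- function(data, column) {
  n_missing <- sum(is.na(data[[column]]))
  if (n_missing > 0) {
    stop(
      "Column `", column, "` has ", n_missing,
      " missing value(s); handle them before resampling.",
      call. = FALSE
    )
  }
  invisible(data)
}

# Toy data standing in for the response column
toy <- data.frame(arr_delay = c(5, NA, -3, 12, NA))

# The error is caught here only so that the example runs to completion.
result <- tryCatch(
  check_no_missing(toy, "arr_delay"),
  error = function(e) conditionMessage(e)
)
print(result)
```

Failing fast like this leaves the decision about how to treat the `NA`s (drop, impute, and so on) with the user, which is the point made in the comments above.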
I agree that this is a problematic area, since one probably wants to condition on the observed sample size. Throwing a warning at the least or potentially an error would make sense to me.
On Oct 29, 2017, at 9:27 AM, Mine Cetinkaya-Rundel wrote:
I completely agree that users should be taking care of their NAs and not us, as there may be different decisions that need to be made. So an error or warning is warranted for sure! That being said, I think the bootstrap sample size should be the number of complete cases in that column that is being resampled (or the two columns if specify has two variables) as opposed to the nrow of the data frame, what do you think? Though I guess if we give an error if there are NAs in the column to be resampled, we don't have to make this decision in infer.
Nicholas Horton
Professor of Statistics
Department of Mathematics and Statistics, Amherst College
PO Box 5000, AC #2239
Amherst, MA 01002-5000
|
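The complete-case proposal quoted above (bootstrap sample size equal to the number of non-missing values rather than `nrow` of the data frame) can be sketched in base R as follows. The vector is made-up toy data, and this is only an illustration of the idea, not how infer's `generate()` works.

```r
set.seed(2017)

# Toy stand-in for the response column: 10 values, 3 of them missing
arr_delay <- c(4, NA, -2, 15, 7, NA, 0, 22, NA, -5)

# Drop the NAs first, so the "original sample" is the complete cases
complete <- arr_delay[!is.na(arr_delay)]
n <- length(complete)

# Each bootstrap replicate resamples n observed values with replacement
boot_means <- replicate(
  100,
  mean(sample(complete, size = n, replace = TRUE))
)

length(boot_means)  # one mean per replicate
anyNA(boot_means)   # no replicate can contain an NA
```

With this approach every replicate conditions on the observed sample size, so no `na.rm` handling is needed when the statistic is calculated.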
Just saw the still pondering label on this. Pondering on implementation or whether the change should be made? If the latter, the answer is yes. Let me know if I can help to implement this change. |
@mine-cetinkaya-rundel Just pondering on implementation. If you have ideas, please do go for it! |
@mine-cetinkaya-rundel I don't believe we implemented this yet, did we? |
No, I didn't. I just got a chance to start looking at some of the to-dos here, so I can work on it in the next couple of days, but feel free to go ahead if you have ideas now. |
Looks like there are a few other |
Agreed. I will take a look at summarizing the issues into one common issue this afternoon. |
This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue. |
The `calculate` function takes the `na.rm` argument for the statistic it's calculating. But does this mean at the generate stage, say if we're doing a bootstrap interval, `NA`s from the original sample get resampled? If so, I believe this is not a good idea. Because chances are when we look at the sample size for the original sample those NAs don't factor into it. And a bootstrap sample should be the same size as the original sample. If indeed the `NA`s from the original sample don't make it into the bootstrap samples, then isn't the `na.rm` argument in calculate unnecessary?
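To make the sample-size concern concrete, here is a small base R sketch on made-up toy data (no infer functions involved). It resamples every entry, `NA`s included, and counts how many non-missing values a statistic computed with `na.rm = TRUE` would actually use in each replicate.

```r
set.seed(2017)

# Toy sample: 10 entries, 3 of them missing
x <- c(4, NA, -2, 15, 7, NA, 0, 22, NA, -5)

# Resample all 10 entries with replacement, then count the non-missing
# values that mean(..., na.rm = TRUE) would average over.
effective_n <- replicate(100, {
  resample <- sample(x, size = length(x), replace = TRUE)
  sum(!is.na(resample))
})

range(effective_n)
```

The effective sample size changes from replicate to replicate instead of staying fixed at the 7 observed values, which is why resampling the `NA`s and then relying on `na.rm` is not equivalent to bootstrapping the complete cases.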