Do NAs get resampled? #52

mine-cetinkaya-rundel · 2017-10-28T17:41:04Z

The calculate function takes the na.rm argument for the statistic it's calculating. But does this mean that at the generate stage, say if we're doing a bootstrap interval, NAs from the original sample get resampled? If so, I believe this is not a good idea, because chances are that when we look at the sample size for the original sample, those NAs don't factor into it, and a bootstrap sample should be the same size as the original sample. If instead the NAs from the original sample don't make it into the bootstrap samples, then isn't the na.rm argument in calculate unnecessary?

ismayc · 2017-10-29T12:09:01Z

The NAs from the original sample are resampled in generate(). I think it's better for students/users to handle NAs in their original data first rather than having this package handle them.

Right now, the "original sample" size is just the number of rows in the data frame resulting from specify(), which may include columns with complete data as well as columns like arr_delay below that are not complete. The na.rm argument in calculate() is necessary to remove the NAs that have been carried forward by a generate() call like the one below.

library(nycflights13)
suppressPackageStartupMessages(library(dplyr))
library(stringr)
library(infer)
set.seed(2017)
fli_small <- flights %>% 
  sample_n(size = 500) %>% 
  mutate(half_year = case_when(
    between(month, 1, 6) ~ "h1",
    between(month, 7, 12) ~ "h2"
  )) %>% 
  mutate(day_hour = case_when(
    between(hour, 1, 12) ~ "morning",
    between(hour, 13, 24) ~ "not morning"
  )) %>% 
  select(arr_delay, dep_delay, half_year, 
         day_hour, origin, carrier)

# Determine number of missing arrival delay values
sum(is.na(fli_small$arr_delay))
#> [1] 15

# Bootstrap uses similar code to oilabs::rep_sample_n()
boots <- fli_small %>% 
  specify(response = arr_delay) %>% 
  generate(reps = 100, type = "bootstrap")
boots %>% 
  group_by(replicate) %>% 
  summarize(num_na = sum(is.na(arr_delay)))
#> # A tibble: 100 x 2
#>    replicate num_na
#>        <int>  <int>
#>  1         1     17
#>  2         2     26
#>  3         3     16
#>  4         4     14
#>  5         5     19
#>  6         6     12
#>  7         7     10
#>  8         8     14
#>  9         9      8
#> 10        10     12
#> # ... with 90 more rows

If users take care of the NAs at the data creation stage, before entering the infer pipeline, this shouldn't be a problem. We should probably at the very least add a warning message that NAs are potentially resampled though, right? What other suggestions do you have for dealing with this? Maybe error out if NAs are present in the column being used, asking the user to go back and handle them before getting into the infer pipeline?
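
As a rough illustration of handling the NAs up front, the base R sketch below drops missing values before resampling, so every bootstrap replicate has the number of observed values as its size and contains no NAs. The vector `delays` and the helper `boot_na_counts()` are made up for this example and are not part of infer.

```r
set.seed(2017)

# Hypothetical data: 20 values, 3 of them missing
delays <- c(rnorm(17, mean = 5, sd = 10), NA, NA, NA)

# Hypothetical helper: count NAs in each of `reps` bootstrap resamples of `x`
boot_na_counts <- function(x, reps = 100) {
  vapply(
    seq_len(reps),
    function(i) sum(is.na(sample(x, size = length(x), replace = TRUE))),
    integer(1)
  )
}

# Resampling the raw vector carries NAs into the replicates
na_raw <- boot_na_counts(delays)

# Dropping NAs first: each replicate has size 17 and no NAs
delays_complete <- delays[!is.na(delays)]
na_clean <- boot_na_counts(delays_complete)

range(na_raw)
range(na_clean)
```

With the NAs removed first, na.rm is no longer needed when computing the statistic on each replicate.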

mine-cetinkaya-rundel · 2017-10-29T13:27:54Z

I completely agree that users should be taking care of their NAs and not us, as there may be different decisions that need to be made. So an error or warning is warranted for sure! That being said, I think the bootstrap sample size should be the number of complete cases in the column that is being resampled (or in the two columns if specify has two variables), as opposed to the nrow of the data frame. What do you think? Though I guess if we give an error when there are NAs in the column to be resampled, we don't have to make this decision in infer.
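
To make the sample-size distinction concrete, here is a small base R sketch comparing nrow with the number of complete cases for one and for two variables. The data frame `flights_toy` is invented for illustration, and this is only one way the count could be computed, not how infer does it.

```r
# Hypothetical data frame with missing values in both columns
flights_toy <- data.frame(
  arr_delay = c(12, NA, -3, 40, NA, 7),
  dep_delay = c(10, 5, NA, 35, NA, 2)
)

# One specified variable: count non-missing values in that column
n_one <- sum(!is.na(flights_toy$arr_delay))

# Two specified variables: count rows that are complete in both columns
keep <- complete.cases(flights_toy[, c("arr_delay", "dep_delay")])
n_two <- sum(keep)

# Bootstrap by resampling row indices among the complete rows only
set.seed(2017)
idx <- sample(which(keep), size = n_two, replace = TRUE)
boot <- flights_toy[idx, ]

c(nrow = nrow(flights_toy), n_one = n_one, n_two = n_two)
```

Here the data frame has 6 rows, but only 4 observed arr_delay values and only 3 rows where both variables are observed, so the three candidate sample sizes all differ.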

ismayc · 2017-10-29T13:37:43Z

I’m inclined to just give an error so that we don’t have to deal with this in multiple scenarios. Maybe even do this at the specify() stage so that we can avoid other problems going forward?
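
A minimal sketch of what such an early check might look like, in base R. The function name `stop_if_na()` and its arguments are hypothetical, chosen for this example; this is not the check that infer ended up implementing.

```r
# Hypothetical guard: stop if any of the named columns contains NAs
stop_if_na <- function(data, vars) {
  na_counts <- vapply(data[vars], function(col) sum(is.na(col)), integer(1))
  has_na <- na_counts > 0
  if (any(has_na)) {
    stop(
      "Missing values found in: ",
      paste0(vars[has_na], " (", na_counts[has_na], ")", collapse = ", "),
      ". Please handle NAs before resampling.",
      call. = FALSE
    )
  }
  invisible(data)
}

toy <- data.frame(arr_delay = c(1, NA, 3), origin = c("EWR", "JFK", "LGA"))

# Errors because arr_delay has one NA; tryCatch captures the message
msg <- tryCatch(stop_if_na(toy, "arr_delay"), error = conditionMessage)

# Passes silently for a complete column and returns the data unchanged
ok <- stop_if_na(toy, "origin")
```

Checking only the specified columns means a data frame with NAs in unrelated columns would still pass, which matches the idea of failing early at the specify() stage.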

mine-cetinkaya-rundel · 2017-10-29T14:44:18Z

@ismayc that sounds good to me! this might make the na.rm option in calculate obsolete, but i don't see a harm in leaving the ... in there. especially if we'll have true function recognition in there down the line and ... could be doing so much more than just NA handling.

ismayc · 2017-10-29T14:52:25Z

Sounds good! I’ll tag the commit here when I have this implemented.

nicholasjhorton · 2017-11-01T00:15:39Z

I agree that this is a problematic area, since one probably wants to condition on the observed sample size. Throwing a warning at the least or potentially an error would make sense to me.

mine-cetinkaya-rundel · 2018-01-14T20:03:45Z

Just saw the still pondering label on this. Pondering on implementation or whether the change should be made? If the latter, the answer is yes. Let me know if I can help to implement this change.

ismayc · 2018-01-14T20:05:28Z

@mine-cetinkaya-rundel Just pondering on implementation. If you have ideas, please do go for it!

ismayc · 2018-03-07T19:13:25Z

@mine-cetinkaya-rundel I don't believe we implemented this yet, did we?

mine-cetinkaya-rundel · 2018-03-07T19:16:48Z

No, I didn't. I just got a chance to start looking at some of the to-dos here, so I can work on it in the next couple of days, but feel free to go ahead if you have ideas now.

mine-cetinkaya-rundel · 2018-03-07T19:18:05Z

Looks like there are a few other NA-related decisions to make. It would be good to make them consistently.

ismayc · 2018-03-07T19:20:40Z

Agreed. I will take a look at summarizing the issues into one common issue this afternoon.

github-actions · 2021-03-10T00:10:46Z

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

ismayc self-assigned this Jan 14, 2018

ismayc added the still pondering label Jan 14, 2018

ismayc assigned mine-cetinkaya-rundel and unassigned ismayc Jan 14, 2018

ismayc added enhancement and removed still pondering labels Jan 14, 2018

mine-cetinkaya-rundel mentioned this issue Mar 8, 2018

Handling NAs (Meta Issue) #114

Closed

ismayc closed this as completed May 13, 2018

github-actions bot locked and limited conversation to collaborators Mar 10, 2021
