Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do NAs get resampled? #52

Closed
mine-cetinkaya-rundel opened this issue Oct 28, 2017 · 13 comments
Closed

Do NAs get resampled? #52

mine-cetinkaya-rundel opened this issue Oct 28, 2017 · 13 comments
Assignees

Comments

@mine-cetinkaya-rundel
Copy link
Collaborator

The calculate function takes the na.rm argument for the statistic it's calculating. But does this mean at the generate stage, say if we're doing a bootstrap interval, NAs from the original sample get resampled? If so, I believe this is not a good idea. Because chances are when we look at the sample size for the original sample those NAs don't factor into it. And a bootstrap sample should be the same size as the original sample. If indeed the NAs from the original sample don't make it into the bootstrap samples, then isn't the na.rm argument in calculate unnecessary?

@ismayc
Copy link
Collaborator

ismayc commented Oct 29, 2017

The NAs from the original sample are resampled in generate(). I think it's better for the students/users to handle NAs in their original data first instead of this package handling it for them instead.

When you say "original sample" it right now is just using the number of rows in the data frame resulting from specify() which may include columns with complete data and also columns like arr_delay below that are not complete. The na.rm argument in calculate() is necessary to remove the NAs that have been brought forward in a generate() like that below.

library(nycflights13)
suppressPackageStartupMessages(library(dplyr))
library(stringr)
library(infer)
set.seed(2017)
fli_small <- flights %>% 
  sample_n(size = 500) %>% 
  mutate(half_year = case_when(
    between(month, 1, 6) ~ "h1",
    between(month, 7, 12) ~ "h2"
  )) %>% 
  mutate(day_hour = case_when(
    between(hour, 1, 12) ~ "morning",
    between(hour, 13, 24) ~ "not morning"
  )) %>% 
  select(arr_delay, dep_delay, half_year, 
         day_hour, origin, carrier)

# Determine number of missing arrival delay values
sum(is.na(fli_small$arr_delay))
#> [1] 15

# Bootstrap uses similar code to oilabs::rep_sample_n()
boots <- fli_small %>% 
  specify(response = arr_delay) %>% 
  generate(reps = 100, type = "bootstrap")
boots %>% 
  group_by(replicate) %>% 
  summarize(num_na = sum(is.na(arr_delay)))
#> # A tibble: 100 x 2
#>    replicate num_na
#>        <int>  <int>
#>  1         1     17
#>  2         2     26
#>  3         3     16
#>  4         4     14
#>  5         5     19
#>  6         6     12
#>  7         7     10
#>  8         8     14
#>  9         9      8
#> 10        10     12
#> # ... with 90 more rows

If users take care of the NA's at the data creation stage before entering into the infer pipeline this shouldn't be a problem. We should probably at the very least add a warning message that NAs are potentially resampled though, right? What other suggestions do you have for dealing with this? Maybe error out if NAs are present in the column being used asking the user to go back to handle them instead before getting into the infer pipeline?

@mine-cetinkaya-rundel
Copy link
Collaborator Author

I completely agree that users should be taking care of their NAs and not us, as there may be different decisions that need to be made. So an error or warning is warranted for sure! That being said, I think the bootstrap sample size should be the number of complete cases in that column that is being resampled (or the two columns if specify has two variables) as opposed to the nrow of the data frame, what do you think? Though I guess if we give an error if there are NAs in the column to be resampled, we don't have to make this decision in infer.

@ismayc
Copy link
Collaborator

ismayc commented Oct 29, 2017

I’m inclined to just give an error so that we don’t have to deal with this in multiple scenarios. Maybe even do this at the specify() stage so that we can avoid other problems going forward?

@mine-cetinkaya-rundel
Copy link
Collaborator Author

@ismayc that sounds good to me! this might make the na.rm option in calculate obsolete, but i don't see a harm in leaving the ... in there. especially if we'll have true function recognition in there down the line and ... could be doing so much more than just NA handling.

@ismayc
Copy link
Collaborator

ismayc commented Oct 29, 2017

Sounds good! I’ll tag the commit here when I have this implemented.

@nicholasjhorton
Copy link

nicholasjhorton commented Nov 1, 2017 via email

@ismayc ismayc self-assigned this Jan 14, 2018
@mine-cetinkaya-rundel
Copy link
Collaborator Author

Just saw the still pondering label on this. Pondering on implementation or whether the change should be made? If the latter, the answer is yes. Let me know if I can help to implement this change.

@ismayc
Copy link
Collaborator

ismayc commented Jan 14, 2018

@mine-cetinkaya-rundel Just pondering on implementation. If you have ideas, please do go for it!

@ismayc
Copy link
Collaborator

ismayc commented Mar 7, 2018

@mine-cetinkaya-rundel I don't believe we implemented this yet, did we?

@mine-cetinkaya-rundel
Copy link
Collaborator Author

No I didn't. I just got a chance to start looking at some of the to dos here so I can work on in the next couple days, but feel free to go ahead if you have ideas now.

@mine-cetinkaya-rundel
Copy link
Collaborator Author

Looks like there are a few other NA related decisions, would be good to make consistent decisions about them?

@ismayc
Copy link
Collaborator

ismayc commented Mar 7, 2018

Agreed. I will take a look at summarizing the issues into one common issue this afternoon.

@github-actions
Copy link

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Mar 10, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants