Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Success should always be defined #57

Closed
mine-cetinkaya-rundel opened this issue Dec 26, 2017 · 10 comments
Closed

Success should always be defined #57

mine-cetinkaya-rundel opened this issue Dec 26, 2017 · 10 comments

Comments

@mine-cetinkaya-rundel
Copy link
Collaborator

In working on a solution for #53 I realized that if success isn't defined by the user we make an assumption and use the first level as success. I don't think this is a good approach because it makes the function "just work" without the student properly formulating the inference chain. I propose removing https://github.com/andrewpbray/infer/blob/f034a8629dd42c83d4fbb33dd1b80324262dbb12/R/calculate.R#L60 and https://github.com/andrewpbray/infer/blob/f034a8629dd42c83d4fbb33dd1b80324262dbb12/R/calculate.R#L61 and instead throwing an error saying success must be defined when the response variable is categorical. I can put that in the same PR as the order solution.

Also, as far as I can tell currently the "diff in props" example isn't working since success functionality isn't built into the difference in proportions case (only the single proportion).

@andrewpbray
Copy link
Collaborator

Yeah, changing around those default makes sense to me.

One thing that's frustrating is that in the one sample case, we have a different and maybe more compact way of specifying this sort of thing higher up in the chain:

hypothesize(null = "point", p = ("boat" = .3))

I guess we're addressing another issue here, but another option for the one sample case where the parameter is fully specified in hypothesize() would be to use something like calculate(stat = "point est").

@mine-cetinkaya-rundel
Copy link
Collaborator Author

I was just about to comment on this @andrewpbray. That it looks odd to specify in both places -- in hypothesize and in calculate. But I don't think we want to specify in hypothesize fully because then it looks much more different to do a confidence interval. So I think we do need to do with (what appears to be) the redundant route. It would look something like this for the one prop case:

mtcars %>%
  specify(response = am) %>% # alt: am ~ NULL (or am ~ 1)
  hypothesize(null = "point", p = c("1" = 0.25)) %>% 
  generate(reps = 100, type = "simulate") %>% 
  calculate(stat = "prop", success = 1)

Two comments here:

  1. success is not a string. I think it makes more sense for it to match the type of the variable.
  2. Now it looks weird that success and p look different (one is string the other is numerical).

We could say we need to be consistent and make both strings because I can't imagine p = c(1 = 0.25)) would read well (and I'm not even sure if it would work). But I do somewhat feel strongly about type of success matching the input type. Thoughts?

@andrewpbray
Copy link
Collaborator

Yep, I agree with your assessment. p = c(1 = 0.25) breaks my brain, so I say we hold our noses and go with strings in both places.

Such are the compromises when trying to make onetruetest =/

@mine-cetinkaya-rundel
Copy link
Collaborator Author

mine-cetinkaya-rundel commented Dec 26, 2017

Maybe the right thing to do is to "specify" the success in the specify? I don't know if this is too big a change, but when I'm thinking about formulating a problem I'd think about what the success is at the beginning, not at the end. Like if I want to do a confidence interval for the proportion of Democrats, I would think "I'm working with the party variable, and I'm interested in the Democrat level", not do the whole thing and then at the end think "what, so which level am I interested in?"

@andrewpbray
Copy link
Collaborator

Hunh, interesting idea. I had thought of specify() has a narrowly disguised select(), but this does seem like a very natural way to solve this problem. So let's see, we could have this:

# one sample
mtcars %>%
  specify(response = am, success = 1) %>%
  hypothesize(null = "point", p = 0.25) %>% 
  generate(reps = 100, type = "simulate") %>% 
  calculate(stat = "prop")

# two sample
mtcars %>%
  specify(am ~ vs, success = 1) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 100, type = "permute") %>%
  calculate(stat = "diff in props")

# GoF
mtcars %>%
  specify(cyl ~ NULL) %>%
  hypothesize(null = "point", p = c("4" = .5, "6" = .25, "8" = .25)) %>%
  generate(reps = 100, type = "simulate") %>%
  calculate(stat = "Chisq")

That way we'd only need to use the named vector thing for the GoF case.

I'm for it. @ismayc ?

@ismayc
Copy link
Collaborator

ismayc commented Dec 26, 2017

I like adding the success argument to specify().

@andrewpbray
Copy link
Collaborator

Hopefully this is addressed in the current PR.

@mine-cetinkaya-rundel
Copy link
Collaborator Author

@andrewpbray I didn't realize you were working on it. I was independently too. I'm going to wait for this PR to be merged to rework the order functionality then.

@andrewpbray
Copy link
Collaborator

Closed by fd5bbf7.

@github-actions
Copy link

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Mar 10, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants