
generate vocabulary #233

Closed
mine-cetinkaya-rundel opened this issue Apr 8, 2019 · 42 comments


@mine-cetinkaya-rundel
Collaborator

Currently the package does three types of generate(): bootstrap, permute, or simulate. Every time I teach this I run into difficulties around this vocabulary. bootstrap is not problematic, but permute and simulate are hard for students to distinguish.

One other issue is that much of Statistics Education literature refers to these methods as Simulation Based Inference, so having simulate as a type of simulation seems odd to me.

A third consideration is that a good portion of the consumers of this package are students in intro stat / data science courses where probability is not a requirement. So the difference between permutation / combination may not be clear to them, nor is it in the learning goals for the course they're taking. Hence, introducing that term makes clearly teaching this material difficult, in my experience.

I would like to propose a potentially radical change:

  • generate() becomes simulate()
  • "simulate" becomes "draw" (or "flip", but I think "draw" is more general, applying to both numerical and categorical data, whereas "flip" only makes sense for categorical data to me)
  • "permute" becomes "shuffle"

Discuss 😄

Obviously we would do this in a non-breaking way via aliasing as opposed to renaming and breaking old code. But I can see the vignettes and teaching materials reading A LOT smoother with these changes.

cc @rudeboybert @mcconvil (when I discussed this with @andrewpbray a while back -- sorry for the delay!! -- he recommended tagging you two as you might have thoughts and feelings on this issue).
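To make the distinction the proposal is after concrete, here is a hedged sketch in base R (no infer code; all names are made up for illustration): a "draw" generates new outcomes under an assumed null model, while a "shuffle" permutes the observed values, breaking any link to another variable.

```r
set.seed(42)

# "draw": generate categorical outcomes under a null model, e.g. p = 0.5,
# which is what generate(type = "simulate") currently does
draw_one <- sample(c("heads", "tails"), size = 30, replace = TRUE,
                   prob = c(0.5, 0.5))

# "shuffle": permute the observed values without replacement,
# which is what generate(type = "permute") currently does
response <- c("yes", "yes", "no", "no", "yes", "no")
shuffle_one <- sample(response)
```

A shuffle always keeps exactly the same values in a new order; a draw can produce any mix of outcomes under the null probabilities.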

@echasnovski
Collaborator

This seems like a continuation of #54.

I am not a native English speaker, but to my ear "simulate" is more associated with "imitate" and "pretend". So using it as one of the main verbs is a little bit confusing for me. For the same reason I don't really like it as one of the generate() options either.

"draw" looks to be a suitable alternative.

I have nothing against "shuffle" but I like "permute" better.

@mine-cetinkaya-rundel
Collaborator Author

Indeed it is a continuation of #54, which I think is further evidence that the current vocabulary is confusing -- I've had a chance to teach it 3 more times after that with not much improvement on the student/teaching experience.

Simulation is a concept students are exposed to in high school, at least in the US and I believe in the UK. And the term used is simulation. And a search for "simulation based inference" on Google Scholar yields many relevant results (and more results than "randomization based inference"). That being said, I think the more important issue has been "simulate" being a type of simulation/generation. So as long as we have an alternative for that and we can say things like

  • generate() "bootstrap" samples
  • generate() random "draw"s
  • generate() random samples via "shuffle"ing

I am ok with keeping the word generate(). I do feel strongly about removing (or at this point providing an alternative for) "simulate" as a type because one could consider generation of bootstrap samples and permutations also a type of simulation. It would be good for those three types of generation to be mutually exclusive. Currently, with the use of a generic term like simulate, I don't think they are.

The reason for the suggestion for aliasing generate() and simulate() is as follows: It would be nice to be able to say the words "When doing theoretical inference, this is how you set it up. When doing simulation based inference, you add a simulate() step, the rest of the pipeline stays the same." I could replace both occurrences of the word "simulate" with "generate", but there is no such thing in the literature (that I am aware of at least) called "generation based inference". So it's not that the word generate is wrong, but that the word simulate would fit so nicely into a sentence like this, which is a very concise way of highlighting the strength of the infer package.

As for "permute" vs. "shuffle" the reasoning is about students taking an intro stat course (no math/probability prereqs) not being familiar with what permutation means. Also, usually we do tactile simulation to demonstrate how the process works, and often during that process we use the term "shuffle" (shuffling playing cards, index cards, etc.). It would be great to have parity between the physical demonstration and the code.

@ismayc
Collaborator

ismayc commented May 1, 2019

This is a fantastic discussion and I hope it's OK for me to poke my head back in here.

I remember using the mosaic package in the classroom. Students didn't quite get that "shuffle" corresponds to both shuffling the cards AND also randomly assigning the shuffled deck to the other variable. They weren't necessarily breaking the ties between the two variables in the observations, assuming the null is true. So when they did a tactile simulation after we had reinforced their understanding using R with mosaic, they'd say "I shuffled" but I'd ask "OK, but have you also completed the permutation?". They often couldn't quite get over that hump because of how "shuffle" is used in English. It's a nuance that might not be worth worrying about, but it could also confuse things some.

@mine-cetinkaya-rundel
Collaborator Author

I'd love to get a sense of where others are on this issue so that we might have time to implement changes before the semester starts. I feel pretty strongly that the language could use a revisit, but it seems worthwhile to agree on a pathway for change (if any) here before anyone starts (re)development around it.

@rudeboybert
Contributor

Quick takes

  • I think "permute" vs. "shuffle" is a toss-up: as @mine-cetinkaya-rundel says, more people understand "shuffle" colloquially, and as @ismayc says, "permute" is closer to the language of "permutation test." Either way, I've found that as long as you illustrate the act first, either by hand with cards or with a simple 5-row example, students remember either.
  • I'm fine with renaming type = "simulate" to something else. But to be frank I had no idea this option existed, and the ?generate help file wasn't much help. From #54 (Student/teaching experience with generate type), I figured out it was for binomial sampling, but I haven't ever done this in class.
  • On that note, when this discussion settles feel free to tag me in issues to beef up specific help files. Since I use infer but wasn't knee-deep in its development, I might be well placed to give an outsider's perspective.

Meaning of word simulation

Allow me to further muddy the waters of what "simulate" means. My sense is when stat ed people hear the term "simulation-based inference", they immediately think "Don't do CLT, rather do bootstrapping and permutation tests instead" i.e. "inference via resampling from a single sample, either with or without replacement".

However, say you are constructing a sampling distribution by sampling from a population, using pennies or M&M's say. This is also a form of simulation in the colloquial sense of the word, as you are "imitating/mimicking the act" of sampling several times.

All this to say, simulation can be used for "sampling from a population," not just "resampling from a sample."

Alternative verb/function name for generate

This is a tricky one. When I first started using infer, my mnemonic pathway has always been to start with @andrewpbray's wonderful diagram and go from there. The verb I had the most trouble remembering at first was generate().

One way to frame this discussion is to find the best_verb() name that most intuitively extends the concept of computing the observed point estimate/test statistic to the concept of constructing the appropriate resampling distribution by repeating this computation under the assumed model:

# Compute point estimate:
specify() %>%                 calculate()
# Construct bootstrap distribution of point estimate:
specify() %>% best_verb() %>% calculate()

# Calculate observed test statistic:
specify() %>% hypothesize() %>%                 calculate()
# Construct null distribution of test statistic:
specify() %>% hypothesize() %>% best_verb() %>% calculate()

So yes, IMO simulate() > generate(). This is what Allen Downey uses in his "There is only one test" blog post, so we'd at least be consistent with him.

However, another verb that I prefer is resample() as this emphasizes that we are resampling from a single sample as one would do in practice (and not repeatedly sample from a population in the theoretical exercise of constructing the sampling distribution). Alas, this sounds like it would conflict with @mine-cetinkaya-rundel's use of binomial sampling via generate(type = "simulate").

@beanumber
Contributor

At the risk of throwing a monkey wrench, I still like the idea of generate(type = "theoretical").

The point is that using an approximation-based approach is still generating a null distribution. It's just that that null distribution is defined by a closed-form function, rather than generated from data. But we could realize that null distribution as a collection of outputs from that smooth function. So generate(type = "theoretical") would return a data frame with (x, f(x)) pairs. You could still plot a density curve from this. I think that skipping the generate() phase in a CLT framework glosses over the fact that we have made a choice to construct the null distribution using theory. In my mind this supports the "there is only one test" interpretation. It also means that students can explain to their lab advisers that, "look, it works the same way, we just generated the null distribution in a different way."
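As a rough illustration of the (x, f(x)) idea (base R only; a sketch under the assumption that generate(type = "theoretical") would evaluate the closed-form density on an evenly spaced grid -- this is not implemented in infer):

```r
# Realize a theoretical null distribution as (x, f(x)) pairs:
# here a t distribution with 30 degrees of freedom on a regular grid.
x <- seq(-4, 4, length.out = 201)
null_grid <- data.frame(x = x, f_x = dt(x, df = 30))

# Plotting null_grid as a line gives the usual smooth density curve:
# plot(null_grid, type = "l")
```

The result is an ordinary data frame, so downstream steps like visualize() could in principle treat it like any other generated object.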

I also think @mine-cetinkaya-rundel's ideas for shorthand are good. So

  • draw() = generate(type = "simulate"): resample from a categorical variable
  • draw() = generate(type = "bootstrap"): resample from a numerical variable
  • shuffle() = generate(type = "permute"): permute the relationship between two variables. I actually think permute() would be just as good and maybe more specific.

and then my proposal:

  • draw() = generate(type = "theoretical"): "sample" from a known probability distribution over an evenly spaced set of points.

One downside to shuffle() and resample() is that the mosaic package already uses those functions for much more intuitive operations on vectors (they also have deal() instead of draw()). So what if we prefix our functions with gen_ or something like that?

Using only draw() and shuffle() would have the advantage of couching everything in the register of cards, which is pretty intuitive for most people. But are these really the only two operations?

@rudeboybert
Contributor

Ah, gotcha! generate(type = "simulate") is to resample a categorical variable based on empirical probabilities from the data frame, rather than sample based on probabilities p. Thanks @beanumber!

As for generate(type = "theoretical"), is it worth trying to bridge the gap between simulation-based and theoretical inferential methods at this step, when you can already do this via visualize(method = "both")?

library(tidyverse)
library(infer)
mtcars %>% 
  mutate(am = as.factor(am)) %>% 
  specify(formula = mpg ~ am) %>% 
  hypothesize(null = "independence") %>% 
  generate(reps = 1000) %>% 
  calculate(stat = "t", order = c("0", "1")) %>% 
  visualize(method = "both")

Also, please disregard my earlier suggestion of using resample(). It just occurred to me that there might be confusion over the fact that permutation tests are actually also a class of "resampling methods." For example, I instinctively think of resampling as applying only to bootstrap methods.

If we want to clearly distinguish resampling with vs without replacement in the pipeline, then perhaps draw() and shuffle() are indeed the best bets for aliases to generate().

@andrewpbray
Collaborator

Naming

The phrase Simulation Based Inference is a good reason to take care in how we use the word simulate, and my first thought was to agree whole hog with @mine-cetinkaya-rundel's initial post. However, subsequent comments have made me realize that there are several complications that we'll want to be sure we've thought through.

One function or many?

Ignoring for a moment the call to change some of these function names, there seem to be two modes for structuring this step of the infer pipeline. Currently, we have one function, generate(), that does different things based on the type argument. Alternatively, we could have multiple functions (say, simulate(), permute(), and bootstrap()) that would replace generate() and have reps as their only argument. We could also do as @beanumber suggests and change them to gen_simulate(), gen_permute(), and gen_bootstrap().

I don't have a strong argument for why one method would be better than the other for elucidating for users what's going on under the hood. My initial favoring of the current implementation was due to wanting to emphasize the conceptual commonality of simulation, permutation, and bootstrapping as ways to generate more data. It's possible that obscures the actual mechanism of generation, though. Fwiw, the current tidyverse style would likely favor many functions.

SBI vs theory

I had forgotten about the suggestion that @beanumber had made regarding what a theoretical pipeline would look like. The theoretical distributions have certainly been a neglected part of infer.

Right now, here is a comparison of SBI vs theory.

library(tidyverse)
library(infer)

# SBI
mtcars %>% 
  mutate(am = as.factor(am)) %>% 
  specify(formula = mpg ~ am) %>% 
  hypothesize(null = "independence") %>% 
  generate(reps = 1000) %>% 
  calculate(stat = "t", order = c("0", "1")) %>% 
  visualize()

# Theory
mtcars %>% 
  mutate(am = as.factor(am)) %>% 
  specify(formula = mpg ~ am) %>% 
  hypothesize(null = "independence") %>% 
  #generate(reps = 1000) %>% 
  #calculate(stat = "t", order = c("0", "1")) %>% 
  visualize(method = "theoretical")

The rationale here was that, going back to the diagram, once you've specified the null hypothesis and are comfortable with your assumptions, you can just jump directly to the null distribution.

Looking at it now, there is at least one pretty clear weakness. The information in specify() and hypothesize() is insufficient to describe which sampling distribution we want. I mean, we can infer it plenty well on the back end, because there tends to be only one statistic for each setting that has a known/commonly used asymptotic distribution. But the code that the user writes is opaque.

If we were to return to Ben's suggestion, it would be:

# Alt Theory
mtcars %>% 
  mutate(am = as.factor(am)) %>% 
  specify(formula = mpg ~ am) %>% 
  hypothesize(null = "independence") %>% 
  generate(type = "theoretical") %>% 
  calculate(stat = "t", order = c("0", "1")) %>% 
  visualize()

I see the appeal of this, but I'm unclear on what the output would be at the generate() and calculate() steps. @beanumber, are you picturing those (x, f(x)) pairs to be the output of generate(), with the calculate() step then skipped? An alternative would be for calculate() to return draws from the null distribution, drawn on a regular grid from the cdf (I think that's right?). Then there's the sense that this is still kinda a computational approach, but using a closed-form distribution.

Thoughts? I don't see any silver bullet here. I don't think we'll be able to come up with a syntax to cover all formulations in a fully cogent way. I like @rudeboybert's advice to make decisions based primarily on the computational approach, the real strength of the package, and figure that folks can just do method = "both" if they want the comparison.

@hardin47

To weigh in (better late than never?) ... I really prefer generate() because for me the power of the infer structure is the connection between SBI and mathematical inference (i.e., the unified structure of hypothesis testing). That is, I always go back to @andrewpbray's image, and I imagine doing the middle step with rectangles when I do theory. I think that imagining the rectangles is really helpful for students, so I'd like to keep the associated word / idea as generic as possible. To me, generate() works because it fits in with sampling, permuting, bootstrapping, and theoretical inference.

@mine-cetinkaya-rundel
Collaborator Author

I agree that generate() also works with theoretical inference, but this is the step (the line of code) we skip when we do theoretical inference with infer.

@hardin47 do you have thoughts on the other suggestions from the original post up top, i.e. the arguments for generate()? Even if we stuck with generate(), I think the current names for the arguments are either not specific enough (simulate) or potentially not in the curriculum for an intro stat course that doesn't focus on probability (permute).

@hardin47

I actually like the idea of using generate() in the theoretical commands. With the idea that the model is being generated through a theoretical construct (e.g., CLT).

and I agree that the rest of the words aren't ideal. my inclination is to stick with permute / simulate because those words have meaning beyond the course. but i can see the advantage of shuffle / draw which are much more intuitive ideas. i don't have super strong feelings about the words.

my bigger point is that i like one word (here: generate()) to identify the creation of a sampling distribution (in whatever way that distribution is created).

@mine-cetinkaya-rundel
Collaborator Author

@hardin47 I agree with this argument if we were using the generate() step in the theoretical setting as well, but we don't. I think this is an interesting suggestion, though implementation-wise I don't know what would happen at that stage.

As for the argument names -- I agree that permute is specific enough (though, the way I tend to teach it, we don't define it formally outside of this context so it's not as familiar a word to the students as it is to me). I would disagree that simulate is specific enough, and I think a lot of my difficulty around this stems from the usage of the word simulate here. When I'm teaching, I keep saying things like "we'll use simulation to do this inference" to mean something much broader than the argument name simulate. I never say "we'll use generation to do this inference" even though the name of the step that is different for doing non-theoretical inference in this framework is generate(). The original idea behind this issue was to match the function names to what we actually say (in English) in class. If generate() stays AND if the argument simulate also stays I think I have to be careful to not say broad things like "we'll use simulation to do this inference" because it gets confusing. If others don't run into this confusion, I'd love pedagogical tips on it!

@hardin47

hardin47 commented Jan 9, 2020

but we do say "now we need to generate a sampling distribution.... how should we do that? maybe we could simulate it using the data at hand as a population model." or at least I say that.

@mine-cetinkaya-rundel
Collaborator Author

So then, is the story the following?

Now we need to generate a sampling distribution... how should we do that?

  • should we simulate it
  • should we permute it
  • or should we bootstrap it?

I like the first sentence, but I get stuck in the options because the options don't seem like apples to apples to me. I feel like simulate is an umbrella term, and permute and bootstrap are different types of simulation. And rides at amusement parks are another type of simulation. This is the example I usually give -- we use simulation to represent reality, and there are different ways of simulating depending on the reality you want to represent. Then when we hit the argument names in the generate() step it gets confusing so either I need to change my story or the argument names. I do feel strongly that simulate is a different beast than permute and bootstrap (a bigger one).

Also, many students have used the term simulate before in other settings as well, for example, in AP stats.

@hardin47

hardin47 commented Jan 9, 2020

what is the word you use to describe the function rnorm? That's the word I'm looking for when I say simulate above. Would love it if there was another word. I just don't know of a different or better word, so I don't know that making one up or trying to fight this battle is worth it. I mean, are we going to try to redefine the word "normal" and all the zillion different things it can mean?

I'd also like to add to your story:

Now we need to generate a sampling distribution... how should we do that?

  • should we simulate it
  • should we permute it
  • should we bootstrap it
  • or should we use theoretical mathematics to approximate it?

@beanumber
Contributor

beanumber commented Jan 9, 2020

Isn't the theoretical mathematical distribution a simulation as well?

It's not reality, it's an idealized representation of reality, right?

@neilhatfield

Apologies for dropping into this conversation at this point in time, but I just recently found the infer package.

I'm rather fond of the verb generate for the sampling distribution. When I work with students we discuss three major ways to do this: we could replicate the study over and over, we could simulate, or we could take a shortcut. Thus, I agree with @mine-cetinkaya-rundel that simulate is an umbrella term. Under that umbrella, we discuss permutation (we don't get technical, but build from the students' current meanings), bootstrapping, and Monte Carlo simulations. Shortcuts refer to parametric and non-parametric theory based methods.

I'm going to offer a revision to the story:

We need to generate a sampling distribution and we don't have the resources for replication. How should we do that?

  • We could permute the data and then re-calculate the test statistic (how many times? All, only some?)
  • We could bootstrap new samples and then re-calculate the test statistic (how many times?)
  • We could monte_carlo new samples and then re-calculate the test statistic (which distribution? what parameters? how many times?)
  • We could take_shortcut and use theory to approximate a sampling distribution (which shortcut?)

The questions in parentheses point to what additional arguments might need to be specified. I will note that I'm not overly committed to monte_carlo, but I'm trying to stay away from both "simulate" and "random". (I also couldn't think of a verb.) I also opted for take_shortcut to have a verb rather than an adjective like theoretical; I suppose use_theory might work.

@davidhodge931

davidhodge931 commented Apr 19, 2021

Hi All,

I've just found this package, and think it is fantastic!

Some thoughts on this thread:

  1. I find the current output of the theoretical null distribution of a single stat very confusing. I would also prefer code to be more explicit in how the null distribution is being created when it is via theoretical methods.
library(infer)

null_distn_theoretical <- gss %>%
  specify(finrela ~ sex) %>%
  hypothesize(null = "independence") %>% 
  calculate(stat = "Chisq")

null_distn_theoretical


If possible, I think outputting the theoretical null distribution as a dataset of quantiles and densities would be better.

An additional function to use instead of calculate() for theoretical distributions might make the code more explicit about the fact that we have used a theoretical method to get a null distribution, e.g. theoretical().

library(infer)

null_theoretical <- gss %>%
  specify(finrela ~ sex) %>%
  hypothesize(null = "independence") %>% 
  theoretical(stat = "Chisq") # or other function name to express this

# return something like this
tibble(quantile = 0:100, density = dchisq(0:100, df = 1)) 


  2. I agree type = "simulate" is really confusing, given the others are a type of simulation too (and the title of the graph is 'Simulation-based Null distribution' for the non-theoretical approaches). Not sure what the best name is, but it would be great to change it.

@simonpcouch
Collaborator

Will be making some moves related to this issue in the coming weeks, so wanted to nudge this conversation a bit and see if we can stumble on vocabulary that works well for folks! Some thoughts from reading above comments and chatting with @mine-cetinkaya-rundel and @topepo this last week…

I think a strength of the {infer} framework is the emphasis on the statistical intuition arising from the juxtaposition of

  1. the observed statistic
  2. a distribution reflecting the null hypothesis

For example, in the randomization-based side of the package

# calculate observed mean
obs_stat <- gss %>%
  specify(response = hours) %>%
  hypothesize(null = "point", mu = 40) %>%
  calculate(stat = "mean")

# generate a null distribution of means
null_dist <- gss %>%
  specify(response = hours) %>%
  hypothesize(null = "point", mu = 40) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")

# juxtapose them visually
visualize(null_dist) + shade_p_value(obs_stat, direction = "both")

# juxtapose them to calculate a p-value
get_p_value(x = null_dist, obs_stat = obs_stat, direction = "both")
## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1   0.038

As the package stands, there is no thing in {infer} that is a theoretical null distribution. E.g., there is no "theoretical analogue" to null_dist, say, for a T distribution. You could change stat = "mean" to stat = "t", but that distribution of values is not a T distribution, but a sample of values distributed according to the T distribution. Implicitly, those distributions do exist inside of hypothesize()d infer objects as the params and theory_type attributes, but they aren't user-facing, and just get passed to visualize() and other auxiliary functions as part of larger data objects (either observed statistics or simulation-based null distributions).

I think whatever solution we come up with to this grammatical problem should result in a thing that a theorized statistic can be juxtaposed with. One suggested approach was a new "theoretical" generate() type that would be used as such, I think:

# calculate observed t
obs_stat <- gss %>%
  specify(response = hours) %>%
  hypothesize(null = "point", mu = 40) %>%
  calculate(stat = "t")

# generate a null t distribution
null_dist <- gss %>%
  specify(response = hours) %>%
  hypothesize(null = "point", mu = 40) %>%
  generate(type = "theoretical") %>%
  calculate(stat = "t")

# juxtapose them visually
visualize(null_dist) + shade_p_value(obs_stat, direction = "both")

# juxtapose them to calculate a p-value
get_p_value(x = null_dist, obs_stat = obs_stat, direction = "both")

The lines used to generate null_dist, though:

  1. Extend the meaning of generate in the package–there’s no (re)sampling being done.
  2. Make use of calculate only to specify the statistic of interest–there’s no actual calculate()ing being carried out in the “collapsing” sense elicited in other uses of the function.
  3. Hide the existence/calculation of the distribution parameters (here, degrees of freedom) from the student.

One approach that Mine and I have tossed around (and that I think also borrows from @davidhodge931's suggestion) would interface like this:

# calculate observed t
obs_stat <- gss %>%
  specify(response = hours) %>%
  hypothesize(null = "point", mu = 40) %>%
  calculate(stat = "t")

# generate a null T distribution
null_dist <- gss %>%
  specify(response = hours) %>%
  hypothesize(null = "point", mu = 40) %>%
  theorize(distribution = "T", df = nrow(gss) - 1)

# juxtapose them visually
visualize(null_dist) + shade_p_value(obs_stat, direction = "both")

# juxtapose them to calculate a p-value
get_p_value(x = null_dist, obs_stat = obs_stat, direction = "both")

…where null_dist is some new {infer} class that implements smooth distributions. This would actually just be a wrapper around pt, or p*, generally, with a short print method: "A T distribution with 499 degrees of freedom.". Besides the distribution argument, theorize() could take in distributional parameters (like df above), optionally passed à la generate()'s type argument--we know what the value should be, but would allow the user to supply it to check understanding.
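A minimal sketch of what such a smooth-distribution object could look like (base R; the class name, constructor, and p-value helper are all hypothetical placeholders, not infer's implementation):

```r
# Hypothetical constructor for a smooth theoretical null distribution.
theoretical_t <- function(df) {
  structure(list(dist = "T", df = df), class = "hypothetical_theoretical_dist")
}

# Short print method, along the lines described above.
print.hypothetical_theoretical_dist <- function(x, ...) {
  cat(sprintf("A T distribution with %d degrees of freedom.\n", x$df))
}

# A p-value is then just a thin wrapper around pt().
p_value_from_dist <- function(dist, obs_stat, direction = "both") {
  p <- pt(abs(obs_stat), df = dist$df, lower.tail = FALSE)
  if (direction == "both") 2 * p else p
}
```

For example, p_value_from_dist(theoretical_t(499), 0) returns 1, since half the distribution's mass lies on each side of zero.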

This approach has a few benefits:

  1. Existing verbs in the package will continue to have the same meaning
  2. theorize() provides the student a place to specify distributional information to check understanding
  3. null_dist is a thing that
    • is explicitly a different type of thing that arises from generate()ing resamples and
    • exists independently of the observed statistic or empirical null distribution

Some drawbacks:

  1. Max brought up here that some learners may have difficulty distinguishing between hypothesize() and theorize()
  2. specify() and hypothesize() pass along more information than is really needed to check that the supplied distribution and its parameters are appropriate

A spec of how this could look once implemented, and how it compares to existing approaches, here.

Would love to hear folks’ thoughts here. :-)

Oh, and I'm on board for a switch of the simulate argument terminology to "draw" or something else!

@ismayc
Collaborator

ismayc commented Jun 2, 2021

I agree that theorize() and hypothesize() seem too similar. Maybe we could use set_theory() (though I know that also opens another can of worms, in that "set theory" is a whole different field). My thinking is that we have get_p_value(), and in some ways theorize() (or whatever name is decided on) goes beyond what the package was originally intended to do: randomization and a quick visualize().

I will admit that naming is incredibly hard so I am open to suggestions here on what to call things.

@ismayc
Collaborator

ismayc commented Jun 2, 2021

I'm also fine with changing the "simulate" option too!

@hardin47

hardin47 commented Jun 2, 2021

there is a lot going on in this thread, i hope i don't make things worse...

  1. simulate() is really hard to think about / reconsider. I think we've been round and round on that one. I have nothing to offer.

  2. what about generate(type = "math_model") or generate(type = "math_theory")? @mine-cetinkaya-rundel and I have been using "mathematical model" vs "computational method" to juxtapose all of the methods in our textbook.

  3. I do not like set_theory() for the reasons implied.

  4. I like the theoretical option inside the generate. I understand @simonpcouch's point about no "thing", but I sort of do think there is a "thing." That is, it is a thing that tells you how unusual the observed statistic is under some sort of structure (the null hypothesis). So in one case the thing is the set of statistics (e.g., after randomizing) and in the other case it's a smooth mathematical model. I'm okay comparing those two items.

  5. Yes, for sure, naming things (well) is really really really hard.

@mine-cetinkaya-rundel
Collaborator Author

Here is my summary of the discussion so far:

1. What (if anything) should we do about the vagueness of type = "simulate"?

I think (and I hope I'm not summarising this with a bias towards my preference) that there doesn't seem to be a huge objection to offering an alternative to type = "simulate". I think we should go ahead and offer type = "draw"; not deprecate type = "simulate", in order to not break existing code, but replace all existing references in the documentation with type = "draw" and add a note about the change to the generate() documentation.

2. What should an infer pipeline for theoretical inference look like?

There's a worry that the words theorize and hypothesize are too similar, which seems risky. It also seems like the notion of using generate() to define the mathematical model doesn't seem too jarring to people. So how about the following? I've marked the lines that are different between the theoretical and simulation-based approaches.

## theoretical

# calculate observed t
obs_stat <- gss %>%
  specify(response = hours) %>%
  hypothesize(null = "point", mu = 40) %>%  # <<
  calculate(stat = "t")

# generate a null T distribution
null_dist <- gss %>%
  specify(response = hours) %>%
  hypothesize(null = "point", mu = 40) %>%
  generate(type = "theoretical", distribution = "T", df = nrow(gss) - 1) # <<

# juxtapose them visually
visualize(null_dist) + shade_p_value(obs_stat, direction = "both")

# juxtapose them to calculate a p-value
get_p_value(x = null_dist, obs_stat = obs_stat, direction = "both")
## simulation-based (for reference)

# calculate observed mean
obs_stat <- gss %>%
  specify(response = hours) %>%
  calculate(stat = "mean")

# generate a null distribution of means
null_dist <- gss %>%
  specify(response = hours) %>%
  hypothesize(null = "point", mu = 40) %>%
  generate(reps = 1000, type = "bootstrap") %>% # <<
  calculate(stat = "mean") # <<

# juxtapose them visually
visualize(null_dist) + shade_p_value(obs_stat, direction = "both")

# juxtapose them to calculate a p-value
get_p_value(x = null_dist, obs_stat = obs_stat, direction = "both")

If we were to go with both of these changes:

  • generate() would stay
  • generate() would gain two new types: "draw" and "theoretical"
  • generate() would gain additional parameters to be used when type = "theoretical"

I don't love that other types are verbs (permute, bootstrap, draw) and theoretical is a noun, but I think we can live with that as theorize is not a great option and I'm not sure what else is. Other than that, I think this summarises the opinions expressed here but chime in if you think I've overlooked something and/or if you don't like the suggestion for the theoretical pipeline.
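If "draw" were adopted as an alias, the call would presumably look identical to today's type = "simulate" pipeline. A sketch for a single proportion using the package's gss data (an assumption on my part about how the alias would slot in, not an implemented API at the time of writing):

```r
library(infer)

# null distribution for a single proportion, with the proposed
# type = "draw" in place of today's type = "simulate"
null_dist <- gss %>%
  specify(response = sex, success = "female") %>%
  hypothesize(null = "point", p = 0.5) %>%
  generate(reps = 1000, type = "draw") %>%
  calculate(stat = "prop")
```

Everything upstream and downstream of generate() would be unchanged, which is the appeal of an alias over a new verb here.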

@davidhodge931
Copy link

I like using a function different from generate() for the theoretical distribution.

I think if type = "theoretical" is put within generate(), then the meaning of this function gets a little weird, in that it is either generating data or generating a mathematical distribution.

I think a separate function would work better for building intuition that you can have a null distribution either through generating data and calculating statistics, or via a theoretical distribution.

I think theoretical() is slightly better than theorise(), because it is less likely to be confused with hypothesising. If exact approaches in the future were supported, then you could then use an equivalent exact() function.

@beanumber
Copy link
Contributor

This is such a great discussion!

Like @hardin47 and @mine-cetinkaya-rundel I like it better with theoretical being a sub-type of generate(). To me this encodes the big insight of the "there is only one test" philosophy.

If you do choose to go with a separate verb, how about assume() instead of theorize()? In addition to being similar to hypothesize(), you are also not the person doing the theorizing. The theory has already been written, you are just the one choosing to apply it. You feel comfortable doing that because of various assumptions. [But maybe this opens up a whole 'nother can of worms...]

@mine-cetinkaya-rundel
Copy link
Collaborator Author

mine-cetinkaya-rundel commented Jun 2, 2021

For the record, I actually don't think theoretical should be a sub-type of generate() but I was trying to warm up to that idea. My reasoning is that we don't actually generate theoretical sampling distribution. As @beanumber said, we assume it. So, I worry that using generate() in that context might give the wrong impression to learners about something that is already quite difficult for them to wrap their head around (the Central Limit Theorem) while trying to drive home the point about "one test".

I agree that assume() is closer to the truth than generate() in the theoretical context but I'm also wary of a can of worms being opened. In particular, we don't want the takeaway message to be "we don't have to assume anything when doing simulation-based inference (i.e. there are no conditions) but we do when doing theoretical inference". It's true that we don't have to make distributional assumptions for simulation-based inference, but we do often need to assume independence of observations based on what we know about the sampling/randomization scheme.

So, I think the two options we have are:

  1. "theoretical" should be a sub-type of generate()

OR

  2. generate() should not make an appearance in the theoretical pipeline and we should find a better verb for how the user expresses what distribution inference will use

I think we have converged, at least, on the conclusion that there are no other viable options on the table (which I'm fine with).

@simonpcouch
Copy link
Collaborator

Very much in agreement with Mine's comment above and really appreciating the conversation generally.

If "theoretical" is a type of generate(), I'm on board for an interface like Mine's suggestion two comments above. I think 2. should be the goal, though.

In tossing around other options for that verb, I think assume might put us in a good spot. The arguments would just be the distribution and its parameters, which might avoid that challenge for teaching that Mine mentioned. Others in the thread would be better positioned to comment on the pedagogical perspective, though.

@andrewpbray
Copy link
Collaborator

@simonpcouch , thanks for bringing this discussion back around. When you first laid out a pipeline to make a theoretical object that includes generate() and calculate(), my brain errored out for the reasons you identified. The pipeline laid out by @mine-cetinkaya-rundel, without calculate(), is less jarring but still muddies to me the mechanism behind generate(). I like to talk about (certain) null hypotheses as data generating machines but that in certain situations we can take a mathematical shortcut around that generation to learn the shape of the null distribution.

So I guess that puts me in the 2. camp but I also have some reservations about theorize() and assume(). In addition to the reasons laid out by folks above, it's really in the hypothesize() step that the analyst is putting down a theory for the way the world works. The step we're trying to capture is when the analyst decides to map their specific hypothesis and data set to a particular result from mathematics. But.... that doesn't leave me with any better ideas for a verb I'm afraid.

I agree with @mine-cetinkaya-rundel about the can of worms if assumptions are only associated with mathematical approximations and not computational ones. I'm wondering, though, if this might be the least bad downside of any of our options. It could also be an opportunity to discuss in the documentation what we assume about the process when we do permutation and bootstrapping.

@hardin47
Copy link

hardin47 commented Jun 2, 2021

not to add to the confusion (when it looks like possibly there is convergence?), but i want to point out two different issues here:

  1. like others have said, the point here is to have students see the parallel structure of all the tests. and to that end, i think generate() is something that can be done with a mathematical model. for example, you might see a HW problem in an algebra book that says "Generate the plot of f(x) = x^2." of course i can see that generating a functional relationship is different than generating a histogram of randomized statistics, but from the perspective of students who are looking to understand how likely it is to see observed statistics under a null model, it doesn't seem fundamentally different. so i'm still in camp 1.

  2. the other issue is that the words are really hard to get right. words matter. a lot. and i really understand why the word generate feels uncomfortable to describe the theoretical model. but it is the step that produces the thing which can be compared to the observed statistic. i wonder if there is a word other than generate that might encapsulate both possible outcomes:
    (a) the histogram of null statistics,
    (b) a null sampling distribution given by a function

As seen in the parallel conversation we are having about the word simulate (I don't love the word draw either, but i'm not going to complain because i don't have better suggestions), maybe we won't be able to come up with a word that does everything in every situation. if that is the case, then i'm fine going forward with @mine-cetinkaya-rundel 's option 2. i continue to prefer option 1, but i'm not going to die on this hill.

@mine-cetinkaya-rundel
Copy link
Collaborator Author

@hardin47 I'll address the bit about "draw" since I've already said a bunch on the generate() issue. My proposal for that discussion is:

  1. Review and merge #390 ("supersede type = "simulate" in favor of type = "draw""), which brings in "draw" as a preferred alternative to "simulate"
  2. Then open a new issue (to disentangle from this thread) where we can, for a bit longer, brainstorm better words in place of "draw"
  3. If we happen upon a better word, we can easily replace "draw" with that. If not, we'll close that issue and call it a day.

We have at least reached an agreement that "simulate" is not ideal so offering an alternative seems desirable. And disentangling the threads will be the cherry on top.

@andrewpbray
Copy link
Collaborator

@mine-cetinkaya-rundel your proposal for "draw" makes sense to me!

@hardin47 One of the reasons, I think, why option 1. is tough for me is because I've never thought of generate() as making a histogram of statistics under the null. I've thought about it as generating data under the null. In fact, I first just pipe generated data into faceted ggplots to show the sorts of plots that you'd see under the null (a la Di Cook's work). Then I introduce calculate() as a way to turn the informal structure that we see in those plots into single numbers that can be more precisely compared.
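
A sketch of the workflow Andrew describes — generated null data piped straight into faceted plots, before any calculate() — assuming ggplot2 is loaded and using the package's gss data (the particular variables and plot geometry here are my illustration, not from the thread):

```r
library(infer)
library(ggplot2)

# nine permuted datasets under the null of independence,
# plotted directly, with no calculate() step
gss %>%
  specify(hours ~ college) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 9, type = "permute") %>%
  ggplot(aes(x = college, y = hours)) +
  geom_boxplot() +
  facet_wrap(~ replicate)
```

This works because generate() returns a tibble with a replicate column, so each facet shows one dataset that could have arisen under the null.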

That's one of my favorite functionalities of {infer}: the ability to quickly generate data through permutation or bootstrapping. But I'm also not going to die on this hill :). That functionality will persist even if option 1. is implemented.

@davidhodge931
Copy link

assume() makes sense to me. You can either get a null distribution by generating data and calculating statistics, or by assuming a theoretical distribution

@beanumber
Copy link
Contributor

I think I'm still with @hardin47 in camp 1. But I'm not going to die on that hill either.

Could we do both and just have assume() be a wrapper?

I do worry about assume() implying that there are no assumptions under the resampling techniques. This is already a danger and this syntax would exacerbate it.

What about situate()? Or contextualize()?

@simonpcouch
Copy link
Collaborator

I really, really like how those two verbs read in the sense of situating/contextualizing the observed statistic in the null distribution, though that phrasing would imply that the inputs to situate()/contextualize() were the observed statistic and/or null distribution rather than the observed data (i.e. the outputs of specify() %>% hypothesize()). Let me know if I'm misreading the idiom here. assume may not be our silver bullet, and I'm game for continuing to brainstorm alternatives. I don't feel especially confident about either of these, but distribute() and parameterize() are two more that have come to mind.

I'd prefer not to do both as we will continue maintaining the current techniques to generate theoretical distributions for a good while (if not indefinitely) as well.

@hardin47
Copy link

hardin47 commented Jun 3, 2021

i really like distribute() and i'm not a fan of assume(). no strong feelings about the others.

@mine-cetinkaya-rundel
Copy link
Collaborator Author

I hesitate to introduce vocabulary that is otherwise not used in this context. I've never said "situate" in place of "assume" in the sentence "we can assume the sampling distribution of the sample statistic is nearly normal".

And I'll take a step back from my earlier worry re: assume(). If this step was called condition() and we risked misinterpretation that no conditions are present in the simulation-based approach, perhaps that would be worse. But it is true that there are no distributional assumptions in the simulation-based framework, hence no line that has words like assume(distribution = "t", df = 20).

I also continue to feel uneasy about "we can generate the sampling distribution of the sample statistic to be nearly normal". The word "generate" (defined as "produce or create" or "produce (a set or sequence of items) by performing specified mathematical or logical operations on an initial set") says we are creating something, which is true when we resample and create a randomization or a bootstrap distribution but not true when we use existing theory and assume a defined distribution (with a given parameter).

So my proposal here @simonpcouch would be to implement assume() so we can see how it looks, what the output looks like in each of the steps, and see if during that process we come up with a better alternative word.
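
One way the proposed verb might slot into the theoretical pipeline. This is a hypothetical sketch — the argument names are placeholders, since assume() was not an implemented API at this point in the discussion:

```r
library(infer)

# hypothetical: assume() replaces generate() in the theoretical pipeline,
# declaring a distribution rather than producing resampled data
null_dist <- gss %>%
  specify(response = hours) %>%
  hypothesize(null = "point", mu = 40) %>%
  assume(distribution = "t", df = nrow(gss) - 1)

# downstream steps would stay the same as in the simulation-based pipeline
visualize(null_dist) + shade_p_value(obs_stat, direction = "both")
```

The design appeal is that only one line differs from the simulation-based pipeline, preserving the "one test" parallel structure.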

@hardin47
Copy link

hardin47 commented Jun 3, 2021

i'm totally fine with you all ignoring me... but i like the action-ness of the verbs, and i like that generate feels like we are producing something. what about any of the following:

produce(distribution = "t", df = 20)
supply(distribution = "t", df = 20)
compose(distribution = "t", df = 20)
establish(distribution = "t", df = 20)
implement(distribution = "t", df = 20)

@hardin47
Copy link

hardin47 commented Jun 3, 2021

set(distribution = "t", df = 20) ????

@beanumber
Copy link
Contributor

construct(distribution = "t", df = 20)
build(distribution = "t", df = 20)
invoke(distribution = "t", df = 20)

@davidhodge931
Copy link

If people are worried that assume might suggest no assumptions for the computational approach, would assume_theory help?

assume_theory(distribution = "t", df = 20)

@mine-cetinkaya-rundel
Copy link
Collaborator Author

mine-cetinkaya-rundel commented Jun 22, 2021

A new type for generate() (type = "draw") that can be used in place of "simulate" is now implemented (in #390) and we seem to have come to somewhat of a consensus that theoretical inference shouldn't make use of generate() but use a different verb. The discussion on what and how of the implementation of a new verb is now on #399 and so I'll close this issue that is specifically on generate() vocabulary. Please chime in on #399 if you have thoughts on the framework for theoretical inference with infer!

@github-actions
Copy link

github-actions bot commented Jul 7, 2021

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Jul 7, 2021

10 participants