Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Residuals and expected cell counts for chi-square tests #462

Closed
VectorPosse opened this issue Nov 14, 2022 · 7 comments
Closed

Residuals and expected cell counts for chi-square tests #462

VectorPosse opened this issue Nov 14, 2022 · 7 comments

Comments

@VectorPosse
Copy link

Feature

Any thoughts about providing residuals or expected cell counts for chi-square tests? I can appreciate that post hoc testing is likely outside the scope of the infer package, but as it stands now, I can't see any way to pipe infer output to any other tool that would calculate them. If I want to teach my students about post-hoc testing, I have to basically ignore everything I just got finished teaching them with infer tools and go back and run chisq.test from scratch.

@simonpcouch
Copy link
Collaborator

Thanks for the issue!

I'm going to write w.r.t. Chi-square tests of independence for the purposes of brevity, but the analogous machinery applies for goodness of fit tests.

I can't see any way to pipe infer output to any other tool that would calculate them.

The current state of infer’s support for calculating randomization-based expected cell counts and residuals would rely on generateing samples under a null hypothesis of independence with the usual infer machinery and then tableing away.

# use infer to generate a distribution of values under the null
library(infer)

reps <- 1e4

null_dist <- gss %>%
  specify(finrela ~ sex) %>%
  hypothesize(null = "independence") %>% 
  generate(reps = reps, type = "permute")

null_dist
#> Response: finrela (factor)
#> Explanatory: sex (factor)
#> Null Hypothesis: independence
#> # A tibble: 5,000,000 × 3
#> # Groups:   replicate [10,000]
#>    finrela           sex    replicate
#>    <fct>             <fct>      <int>
#>  1 average           male           1
#>  2 DK                female         1
#>  3 far below average male           1
#>  4 average           male           1
#>  5 average           male           1
#>  6 average           female         1
#>  7 average           female         1
#>  8 below average     female         1
#>  9 above average     female         1
#> 10 far below average female         1
#> # … with 4,999,990 more rows

# use `table` to create the observed
observed_counts <- table(gss$finrela, gss$sex)

observed_counts
#>                    
#>                     male female
#>   far below average    8     17
#>   below average       55     60
#>   average            127    112
#>   above average       65     41
#>   far above average    6      4
#>   DK                   2      3

# use the same function to form the null distribution
expected_counts <- table(null_dist$finrela, null_dist$sex) / reps

expected_counts
#>                    
#>                         male   female
#>   far below average  13.1783  11.8217
#>   below average      60.4669  54.5331
#>   average           125.7173 113.2827
#>   above average      55.7701  50.2299
#>   far above average   5.2532   4.7468
#>   DK                  2.6142   2.3858

# take residuals
residuals <- observed_counts - expected_counts
  
residuals
#>                    
#>                        male  female
#>   far below average -5.1783  5.1783
#>   below average     -5.4669  5.4669
#>   average            1.2827 -1.2827
#>   above average      9.2299 -9.2299
#>   far above average  0.7468 -0.7468
#>   DK                -0.6142  0.6142

# calculate chi^2
sum(residuals^2 / expected_counts)
#> [1] 9.122616

This doesn't feel too verbose to me, especially if facility with residuals and expected cell counts were a pedagogical goal. If infer is a known tool, as well, then the code to generate null_dist is more straightforward than other randomization-based routes to the same output.

For juxaposition, the same values with chisq.test():

table(gss$finrela, gss$sex)
#>                    
#>                     male female
#>   far below average    8     17
#>   below average       55     60
#>   average            127    112
#>   above average       65     41
#>   far above average    6      4
#>   DK                   2      3
chisq.test(gss$finrela, gss$sex)$expected
#> Warning in chisq.test(gss$finrela, gss$sex): Chi-squared approximation may be
#> incorrect
#>                    gss$sex
#> gss$finrela            male  female
#>   far below average  13.150  11.850
#>   below average      60.490  54.510
#>   average           125.714 113.286
#>   above average      55.756  50.244
#>   far above average   5.260   4.740
#>   DK                  2.630   2.370
chisq.test(gss$finrela, gss$sex)$residuals
#> Warning in chisq.test(gss$finrela, gss$sex): Chi-squared approximation may be
#> incorrect
#>                    gss$sex
#> gss$finrela               male     female
#>   far below average -1.4201831  1.4960567
#>   below average     -0.7058795  0.7435912
#>   average            0.1146962 -0.1208239
#>   above average      1.2379814 -1.3041208
#>   far above average  0.3226553 -0.3398933
#>   DK                -0.3884746  0.4092290
chisq.test(gss$finrela, gss$sex)$statistic
#> Warning in chisq.test(gss$finrela, gss$sex): Chi-squared approximation may be
#> incorrect
#> X-squared 
#>  9.105397

Note that the $residuals slot of chisq.test() output contains Pearson residuals, so we could take residuals / sqrt(expected_counts) to match that output if we wanted to match that output.

For another approach, the book "Introduction to Modern Statistics" has a chapter with applications in infer, including a learnr tutorial working through a Chi-square test of independence with infer. That lesson leans heavily on motivating the test statistic visually before shifting into infer code, but doesn't surface the code that pushes around any tables. They write:

Computing these expected counts while respecting the marginal distributions is a bit tedious, so we’ll be relying upon [base] R for this calculation.

...which is perhaps a symptom of lack of infer support.

Created on 2022-11-17 with reprex v2.0.2

@rhinopotamus
Copy link

I was curious about this too, and I went a-digging. Line 419 or so of calculate.R seems to be the operative part, and it looks like it just calls stats::chisq.test, which under the hood does produce a table of expected counts. However, on line 428, we're just pulling "statistic" out of the output of stats::chisq.test and, it looks like, throwing the rest of the output away.

Maybe if there was a way to keep the rest of the output of stats::chisq.test around, then a function called get_expected_counts() or get_residuals() could exist to just report the $expected or $residuals slots.

(I am also going to have this same question about anova -- I'd love to be able to do post-hoc stuff with the infer pipeline.)

ooh -- simonpcouch wrote the comment above while I was writing this one -- but I think this comment is maybe still relevant.

@VectorPosse
Copy link
Author

I'll also point out that there is tidyverse precedent for such a request. For example, tidymodels (parsnip, specifically) uses the extract_fit_engine() function to access the underlying object that can then be passed to other functions.

Another important use case would be anyone wanting to use broom to tidy results. AFAIK, broom does not know what to do with output from infer, so some of this is about also playing nice with other packages within the tidyverse.

Thanks @simonpcouch for your consideration. Your sample code gives me some options to think about, but I would encourage y'all to continue the discussion to see if being able to extract the base R model is something you're willing to support, even if @rhinopotamus's suggestion about getting specific residuals is too specific (i.e. not generalizable to all infer workflows).

@simonpcouch
Copy link
Collaborator

However, on line 428, we're just pulling "statistic" out of the output of stats::chisq.test and, it looks like, throwing the rest of the output away.

That's correct.🙂 There have been some issues/PRs over the years on this repo about whether to use the stats backend or not in this case, and the discussion on those issues might be illuminating. #108, #248, etc.

Maybe if there was a way to keep the rest of the output of stats::chisq.test around, then a function called get_expected_counts() or get_residuals() could exist to just report the $expected or $residuals slots.

This is certainly possible! For internal consistency, and w.r.t. scope creep, I'd be hesitant to consider this addition.

For example, tidymodels (parsnip, specifically) uses the extract_fit_engine() function to access the underlying object that can then be passed to other functions.

parsnip, as a principle, doesn't implement its own computational engines, and thus relies on engines to fit models. The analogy with chisq.test is quite loose here, as most other statistics in infer are implemented "in-house," and an engine isn't well-defined in those cases.

Another important use case would be anyone wanting to use broom to tidy results. AFAIK, broom does not know what to do with output from infer, so some of this is about also playing nice with other packages within the tidyverse.

infer outputs tidy data frames, by construction, so there's no work for broom to do here.


Will leave this issue open in case folks want to chime in. :)

@VectorPosse
Copy link
Author

VectorPosse commented Nov 18, 2022 via email

@rhinopotamus
Copy link

@simonpcouch I appreciate the concern about scope creep!

Here's a potential argument for including interfaces for extracting residuals etc., which you should please feel free to adopt or ignore: vignette("infer") is a really lovely articulation of the philosophical approach this package takes. The vignette includes this comment: "infer also offers several utilities to extract the meaning out of summary statistics and distributions". I think residuals could neatly fall into this category of work: imo, with a chi-square test for independence, the meaning of the test is in the residuals.

So, as an example, I'm going to follow along with what's in vignette("chi_squared"): is there some kind of association between people's financial status and whether they have a college education?
gss %>% specify(college ~ finrela) %>% hypothesize(null = "independence") %>% calculate(stat="chisq") reports that the chi-squared statistic is 30.7, which is large.
Accordingly, the pipeline gss %>% specify(college ~ finrela) %>% assume(distribution = "chisq") %>% get_p_value(obs_stat = 30.7, direction = "right") produces a p-value of 0.0000107.
So we're going to reject the null hypothesis, and conclude that there is an association between financial status and college education.

However, importantly, we don't know what that association is!
It's reasonable to hypothesize that people with a college degree feel more financially secure, but unless we have some way to quantify the deviations from expected, we'd only be guessing.

(This example feels a little obvious, but if we get into something more subtle like the smoking dataset from the openintro library, we can use the chi-squared test to see that there is an association between marital status and smoking status, but infer pipelines give us no way to see who actually smokes more.)

@github-actions
Copy link

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Dec 16, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants