Patch for bug report 17770 #139

Open
wants to merge 5 commits into
base: main

@paocorrales paocorrales commented Aug 31, 2023

The initial report was addressed by Martin Maechler. The following is related to comment 3.

I went through the code for xtabs to understand the behavior noted by Thomas Soeiro. I believe there is no bug in the code, but the documentation could include a clarification to cover this case.

When the example is executed:

x <- data.frame(A = c("Y", "Y", "Z", "Z"),
                B = c(NA, TRUE, FALSE, TRUE),
                C = c(TRUE, TRUE, NA, FALSE))

xtabs(formula = cbind(B, C) ~ A,
      data = x,
      na.action = na.omit)

what is passed to the model.frame() function inside xtabs() is

stats::model.frame(formula = cbind(B, C) ~ A, data = x, na.action = na.omit)

with data being

  A     B     C
1 Y    NA  TRUE
2 Y  TRUE  TRUE
3 Z FALSE    NA
4 Z  TRUE FALSE

na.omit removes every row that contains an NA, so the values of both B and C in that row are dropped, not just the missing one. This gives the output shown in comment 3:

> na.omit(data)
  A    B     C
2 Y TRUE  TRUE
4 Z TRUE FALSE
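
The row-wise dropping can be checked directly in base R. This is a minimal sketch that rebuilds the same x and calls model.frame() with the same arguments; the object name mf is only illustrative.

```r
x <- data.frame(A = c("Y", "Y", "Z", "Z"),
                B = c(NA, TRUE, FALSE, TRUE),
                C = c(TRUE, TRUE, NA, FALSE))

# Same arguments as the model.frame() call shown above.
mf <- stats::model.frame(cbind(B, C) ~ A, data = x, na.action = na.omit)

# Rows 1 and 3 each hold an NA in one column of cbind(B, C),
# so na.omit drops both rows entirely.
rownames(mf)
nrow(mf)
```

Only rows 2 and 4 should remain, which matches the na.omit(data) output above.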

To avoid this, a user should not use cbind(B, C) on the left-hand side. Instead, the data can be reshaped to long format first, for example:

long_df <- tidyr::pivot_longer(x, cols = B:C)

xtabs(value ~ A + name, data = long_df)

With this approach, the table that is passed to model.frame() and has na.omit applied to it is

data
# A tibble: 8 × 3
  A     name  value
  <chr> <chr> <lgl>
1 Y     B     NA   
2 Y     C     TRUE 
3 Y     B     TRUE 
4 Y     C     TRUE 
5 Z     B     FALSE
6 Z     C     NA   
7 Z     B     TRUE 
8 Z     C     FALSE

and then

  A     name  value
  <chr> <chr> <lgl>
1 Y     C     TRUE 
2 Y     B     TRUE 
3 Y     C     TRUE 
4 Z     B     FALSE
5 Z     B     TRUE 
6 Z     C     FALSE

This way, only the specific A-B or A-C combination that has an NA is filtered out.
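
The two layouts can be compared side by side. The sketch below builds the long table in base R instead of with tidyr (an assumption made here to avoid the extra dependency); the column names name and value mirror the pivot_longer() defaults, and the objects wide and long are only illustrative.

```r
x <- data.frame(A = c("Y", "Y", "Z", "Z"),
                B = c(NA, TRUE, FALSE, TRUE),
                C = c(TRUE, TRUE, NA, FALSE))

# Base R stand-in for tidyr::pivot_longer(x, cols = B:C).
# Row order differs from pivot_longer(), which does not matter for xtabs().
long_df <- data.frame(A     = rep(x$A, times = 2),
                      name  = rep(c("B", "C"), each = nrow(x)),
                      value = c(x$B, x$C))

# Wide layout: a row is dropped if either B or C is NA.
wide <- xtabs(cbind(B, C) ~ A, data = x, na.action = na.omit)

# Long layout: only the individual NA observations are dropped.
long <- xtabs(value ~ A + name, data = long_df, na.action = na.omit)

wide["Y", "C"]  # row 1 was dropped because B was NA there
long["Y", "C"]  # both TRUE values of C for A == "Y" are counted
```

For A == "Y", the wide layout counts one TRUE in C while the long layout counts two, because row 1 is no longer discarded on account of its missing B.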

In summary, na.action is applied to data (the original data frame) rather than to the result of model.frame(). This argument also affects how NAs are treated inside sum(): if na.action = na.pass, then na.rm = FALSE inside sum(); otherwise it is TRUE.
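
The effect of na.rm on the totals can be illustrated with sum() on its own. This is a sketch rather than the xtabs() internals; the vector b_for_Y is a hypothetical name for the B values with A == "Y" in the example above.

```r
# B values for A == "Y" in the example data frame.
b_for_Y <- c(NA, TRUE)

# What a total looks like when NAs are passed through (na.rm = FALSE).
sum(b_for_Y)                 # NA

# What it looks like when NAs are removed before summing.
sum(b_for_Y, na.rm = TRUE)   # 1

# The corresponding xtabs() call with NAs passed through. How an NA total
# is shown in the resulting table may depend on the R version, so no
# output is given here.
x <- data.frame(A = c("Y", "Y", "Z", "Z"),
                B = c(NA, TRUE, FALSE, TRUE),
                C = c(TRUE, TRUE, NA, FALSE))
pass_tab <- xtabs(cbind(B, C) ~ A, data = x, na.action = na.pass)
```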

I propose a patch that adds a sentence to the details section to make this behavior clearer.

Argument section

  \item{na.action}{a \code{\link{function}} which indicates what should happen when
    \code{data} contain \code{\link{NA}}s.  If unspecified, and
    \code{addNA} is true, this is set to \code{\link{na.pass}}.
    \code{na.action} also affects how \code{NA}s are treated inside
    \code{\link{sum}()}.  If \code{na.action = na.pass} and \code{formula}
    has a left hand side (with counts), \code{\link{sum}(*)} is used; if it
    is \code{NULL}, \code{getOption("na.action", default = na.omit)} is
    used; otherwise \code{\link{sum}(*, na.rm = TRUE)} is used.}

Description

Also note that \code{na.action} is applied to \code{data}, which may result in the loss of counts, because complete rows are omitted if an \code{NA} is present in any column.

src/library/stats/man/xtabs.Rd (review comment: outdated, resolved)