dplyr can't summarize this variable #2919

Closed
jabranham opened this issue Jun 27, 2017 · 26 comments

@jabranham jabranham commented Jun 27, 2017

I'm working with a data.frame, and dplyr returns NA for every summary of this variable.
Here's the data (from the General Social Survey). Sorry for the zip file; GitHub won't let me upload the file directly.

test2.zip

And here's the R code. Note that you can change the summarize statement to anything (e.g. summarize(m = mean(russia, na.rm = TRUE))) and it'll still return NA:

library(dplyr)
test2 <- readRDS("test2.rds")

## Returns NA
test2 %>%
  summarize_at("russia", funs(m = mean(., na.rm = TRUE)))

## Returns 5.5ish
mean(test2$russia, na.rm = TRUE)

The data aren't crazy (and not all the values for "russia" are missing):

> str(test2)
'data.frame':	62466 obs. of  2 variables:
 $ year  : num  1972 1972 1972 1972 1972 ...
 $ russia: num  NA NA NA NA NA NA NA NA NA NA ...
> summary(test2$russia)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.00    4.00    5.00    5.52    9.00    9.00   46935 

Am I missing something really simple here?

@SivanMehta SivanMehta commented Jul 7, 2017

While this may not directly answer your question, you can simply omit the NA values with a filter:

test2 %>% 
  filter(!is.na(russia)) %>% # takes out all potentially problematic entries
  summarise(m = mean(russia))

@jabranham jabranham commented Jul 7, 2017

Thanks. Yes, that works. But it still doesn't explain why dplyr can't summarize that variable.
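
For context on why the workaround is not an explanation: in base R, dropping the NA values first and passing na.rm = TRUE are expected to give the same mean, so the two dplyr calls disagreeing points at a bug rather than at the data. A minimal sketch with made-up values (an assumption for illustration, not the GSS data):

```r
# Synthetic stand-in for the `russia` column: mostly NA, a few real values.
x <- c(NA, NA, 4, 5, NA, 9, 0, NA)

# Two routes to the same number in base R.
m_narm   <- mean(x, na.rm = TRUE)   # let mean() skip the NAs
m_filter <- mean(x[!is.na(x)])      # drop the NAs first, then average

# Both routes must agree; a mismatch would indicate a bug, not bad data.
stopifnot(isTRUE(all.equal(m_narm, m_filter)))
m_narm
#> [1] 4.5
```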

@hadley hadley commented Jul 13, 2017

Could you please rework your reproducible example to use the reprex package? That makes it easier to see both the input and the output, formatted in such a way that I can easily re-run it in a local session.

@hadley hadley added the reprex label Jul 13, 2017

@jabranham jabranham commented Jul 13, 2017

Here's a full example using reprex, including downloading from that zip file I posted with the rio package.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(rio)
test2 <- import("https://github.com/tidyverse/dplyr/files/1105096/test2.zip")

## Returns NA
test2 %>%
  summarize_at("russia", funs(m = mean(., na.rm = TRUE)))
#>    m
#> 1 NA

## Returns 5.5ish
mean(test2$russia, na.rm = TRUE)
#> [1] 5.516387

@jabranham jabranham commented Sep 14, 2017

Just FYI - I updated dplyr (to 0.7.3, latest CRAN release) and this issue is still there.

@batpigandme batpigandme commented Dec 28, 2017

Just adding this reprex; it seems that it works if you filter out the NAs beforehand, but not with na.rm = TRUE.

library(dplyr)
library(rio)
test2 <- import("https://github.com/tidyverse/dplyr/files/1105096/test2.zip")

test2 %>%
  summarise_at("russia", mean, na.rm = TRUE)
#>   russia
#> 1     NA

test2 %>%
  filter(!is.na(russia)) %>%
  summarise_at("russia", mean, na.rm = TRUE)
#>     russia
#> 1 5.516387

test3 <- test2 %>%
  filter(!is.na(russia))

test3 %>%
  summarise_at("russia", mean)
#>     russia
#> 1 5.516387

test2 %>%
  filter(!is.na(russia)) %>%
  summarize_at("russia", funs(m = mean(., na.rm = TRUE)))
#>          m
#> 1 5.516387

Created on 2017-12-28 by the reprex package (v0.1.1.9000).

@krlmlr krlmlr commented Jan 19, 2018

Thanks for the reprex. This appears to work now with the CRAN versions of dplyr and rlang, though I don't know why. Can you confirm?

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(rio)
test2 <- import("https://github.com/tidyverse/dplyr/files/1105096/test2.zip")

test2 %>%
  summarise_at("russia", mean, na.rm = TRUE)
#>     russia
#> 1 5.516387

Created on 2018-01-19 by the reprex package (v0.1.1.9000).

Session info
devtools::session_info()
#> Session info -------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.4.3 (2017-11-30)
#>  system   x86_64, linux-gnu           
#>  ui       X11                         
#>  language en_US                       
#>  collate  en_US.UTF-8                 
#>  tz       Europe/Busingen             
#>  date     2018-01-19
#> Packages -----------------------------------------------------------------
#>  package    * version     date       source                          
#>  assertthat   0.2.0       2017-04-11 CRAN (R 3.4.1)                  
#>  backports    1.1.2       2017-12-13 cran (@1.1.2)                   
#>  base       * 3.4.3       2017-12-01 local                           
#>  bindr        0.1         2017-06-15 local                           
#>  bindrcpp     0.2         2017-06-18 local (krlmlr/bindrcpp@dfce02c) 
#>  cellranger   1.1.0       2016-07-27 CRAN (R 3.4.0)                  
#>  compiler     3.4.3       2017-12-01 local                           
#>  curl         3.1         2017-12-12 CRAN (R 3.4.3)                  
#>  data.table   1.10.4-3    2017-10-27 CRAN (R 3.4.3)                  
#>  datasets   * 3.4.3       2017-12-01 local                           
#>  devtools     1.13.4      2017-11-09 CRAN (R 3.4.2)                  
#>  digest       0.6.13      2017-12-14 CRAN (R 3.4.3)                  
#>  dplyr      * 0.7.4       2017-09-28 CRAN (R 3.4.3)                  
#>  evaluate     0.10.1      2017-06-24 CRAN (R 3.4.1)                  
#>  forcats      0.2.0.9000  2017-09-27 local                           
#>  foreign      0.8-69      2017-06-21 CRAN (R 3.4.1)                  
#>  glue         1.2.0.9000  2017-11-22 Github (tidyverse/glue@752458e) 
#>  graphics   * 3.4.3       2017-12-01 local                           
#>  grDevices  * 3.4.3       2017-12-01 local                           
#>  haven        1.1.0       2017-07-09 CRAN (R 3.4.1)                  
#>  htmltools    0.3.6       2017-04-28 CRAN (R 3.4.1)                  
#>  knitr        1.18        2017-12-27 CRAN (R 3.4.3)                  
#>  magrittr     1.5         2014-11-22 CRAN (R 3.4.3)                  
#>  memoise      1.1.0       2017-08-07 Github (hadley/memoise@d63ae9c) 
#>  methods    * 3.4.3       2017-12-01 local                           
#>  openxlsx     4.0.29      2017-11-21 local                           
#>  pillar       1.0.99.9001 2018-01-14 local (r-lib/pillar@9d96835)    
#>  pkgconfig    2.0.1       2017-03-21 CRAN (R 3.4.1)                  
#>  R6           2.2.2       2017-06-17 CRAN (R 3.4.1)                  
#>  Rcpp         0.12.14.5   2018-01-11 local                           
#>  readxl       1.0.0       2017-04-18 CRAN (R 3.4.3)                  
#>  rio        * 0.5.5       2017-06-18 CRAN (R 3.4.3)                  
#>  rlang        0.1.6       2017-12-21 CRAN (R 3.4.3)                  
#>  rmarkdown    1.8         2017-11-17 CRAN (R 3.4.3)                  
#>  rprojroot    1.3-2       2018-01-03 local (krlmlr/rprojroot@851d293)
#>  stats      * 3.4.3       2017-12-01 local                           
#>  stringi      1.1.6       2017-11-17 CRAN (R 3.4.3)                  
#>  stringr      1.2.0       2017-02-18 CRAN (R 3.4.1)                  
#>  tibble       1.4.1.9000  2018-01-15 local                           
#>  tools        3.4.3       2017-12-01 local                           
#>  utils      * 3.4.3       2017-12-01 local                           
#>  withr        2.1.1.9000  2017-12-30 Github (r-lib/withr@df18523)    
#>  yaml         2.1.16      2017-12-12 CRAN (R 3.4.3)                  
#>  zip          1.0.0       2017-04-25 CRAN (R 3.4.2)

@jabranham jabranham commented Jan 19, 2018

I still experience this problem. sessioninfo:

Session info -------------------------------------------------------------------------
 setting  value                       
 version  R version 3.4.3 (2017-11-30)
 system   x86_64, linux-gnu           
 ui       X11                         
 language (EN)                        
 collate  en_US.UTF-8                 
 tz       America/Chicago             
 date     2018-01-19                  

Packages -----------------------------------------------------------------------------
 package    * version  date       source        
 assertthat   0.2.0    2017-04-11 CRAN (R 3.4.2)
 base       * 3.4.3    2017-11-30 local         
 bindr        0.1      2016-11-13 CRAN (R 3.4.2)
 bindrcpp     0.2      2017-06-17 CRAN (R 3.4.2)
 cellranger   1.1.0    2016-07-27 CRAN (R 3.4.2)
 compiler     3.4.3    2017-11-30 local         
 curl         3.1      2017-12-12 CRAN (R 3.4.3)
 data.table   1.10.4-3 2017-10-27 CRAN (R 3.4.2)
 datasets   * 3.4.3    2017-11-30 local         
 devtools     1.13.4   2017-11-09 CRAN (R 3.4.3)
 digest       0.6.14   2018-01-14 CRAN (R 3.4.3)
 dplyr      * 0.7.4    2017-09-28 CRAN (R 3.4.2)
 forcats      0.2.0    2017-01-23 CRAN (R 3.4.2)
 foreign      0.8-69   2017-06-22 CRAN (R 3.4.3)
 glue         1.2.0    2017-10-29 CRAN (R 3.4.2)
 graphics   * 3.4.3    2017-11-30 local         
 grDevices  * 3.4.3    2017-11-30 local         
 haven        1.1.1    2018-01-18 CRAN (R 3.4.3)
 magrittr     1.5      2014-11-22 CRAN (R 3.4.2)
 memoise      1.1.0    2017-04-21 CRAN (R 3.4.3)
 methods    * 3.4.3    2017-11-30 local         
 openxlsx     4.0.17   2017-03-23 CRAN (R 3.4.2)
 parallel     3.4.3    2017-11-30 local         
 pillar       1.1.0    2018-01-14 CRAN (R 3.4.3)
 pkgconfig    2.0.1    2017-03-21 CRAN (R 3.4.2)
 R6           2.2.2    2017-06-17 CRAN (R 3.4.2)
 Rcpp         0.12.14  2017-11-23 CRAN (R 3.4.3)
 readxl       1.0.0    2017-04-18 CRAN (R 3.4.2)
 rio        * 0.5.5    2017-06-18 CRAN (R 3.4.2)
 rlang        0.1.6    2017-12-21 CRAN (R 3.4.3)
 stats      * 3.4.3    2017-11-30 local         
 tibble       1.4.1    2017-12-25 CRAN (R 3.4.3)
 tools        3.4.3    2017-11-30 local         
 utils      * 3.4.3    2017-11-30 local         
 withr        2.1.1    2017-12-19 CRAN (R 3.4.3)

@krlmlr krlmlr commented Jan 19, 2018

Just to double-check, can you please run the following code and paste the results from the clipboard:

reprex::reprex(si = TRUE, {
  library(dplyr)
  library(rio)
  test2 <- import("https://github.com/tidyverse/dplyr/files/1105096/test2.zip")
  
  test2 %>%
    summarize_at("russia", mean, na.rm = TRUE)
})

@jabranham jabranham commented Jan 19, 2018

@krlmlr krlmlr commented Jan 19, 2018

Thanks. I'm completely at a loss here. Need to compare package versions to check if they are different.

@jabranham jabranham commented Jan 19, 2018

Here is that comparison:

diff -u /tmp/his.txt /tmp/mine.txt
--- /tmp/his.txt	2018-01-19 10:50:38.575592738 -0600
+++ /tmp/mine.txt	2018-01-19 10:50:35.832291330 -0600
@@ -4,9 +4,9 @@
 #>  version  R version 3.4.3 (2017-11-30)
 #>  system   x86_64, linux-gnu
 #>  ui       X11
-#>  language en_US
+#>  language (EN)
 #>  collate  en_US.UTF-8
-#>  tz       Europe/Busingen
+#>  tz       America/Chicago
 #>  date     2018-01-19
 #> Packages -----------------------------------------------------------------
 #>  package    * version
@@ -21,25 +21,25 @@
 #>  data.table   1.10.4-3
 #>  datasets   * 3.4.3
 #>  devtools     1.13.4
-#>  digest       0.6.13
+#>  digest       0.6.14
 #>  dplyr      * 0.7.4
 #>  evaluate     0.10.1
-#>  forcats      0.2.0.9000
+#>  forcats      0.2.0
 #>  foreign      0.8-69
-#>  glue         1.2.0.9000
+#>  glue         1.2.0
 #>  graphics   * 3.4.3
 #>  grDevices  * 3.4.3
-#>  haven        1.1.0
+#>  haven        1.1.1
 #>  htmltools    0.3.6
 #>  knitr        1.18
 #>  magrittr     1.5
 #>  memoise      1.1.0
 #>  methods    * 3.4.3
-#>  openxlsx     4.0.29
-#>  pillar       1.0.99.9001
+#>  openxlsx     4.0.17
+#>  pillar       1.1.0
 #>  pkgconfig    2.0.1
 #>  R6           2.2.2
-#>  Rcpp         0.12.14.5
+#>  Rcpp         0.12.14
 #>  readxl       1.0.0
 #>  rio        * 0.5.5
 #>  rlang        0.1.6
@@ -48,9 +48,8 @@
 #>  stats      * 3.4.3
 #>  stringi      1.1.6
 #>  stringr      1.2.0
-#>  tibble       1.4.1.9000
+#>  tibble       1.4.1
 #>  tools        3.4.3
 #>  utils      * 3.4.3
-#>  withr        2.1.1.9000
+#>  withr        2.1.1
 #>  yaml         2.1.16
-#>  zip          1.0.0

Diff finished.  Fri Jan 19 10:50:43 2018
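
The same comparison can be done inside R instead of diffing text files. The following is only a sketch: `diff_versions` is a hypothetical helper name, and the two small vectors copy a few package versions from the session listings above rather than the full libraries.

```r
# Hypothetical helper: report packages whose versions differ between two
# sessions, given named character vectors of the form c(pkg = "version").
diff_versions <- function(a, b) {
  pkgs <- sort(union(names(a), names(b)))
  out <- data.frame(
    package = pkgs,
    a = unname(a[pkgs]),  # NA where the package is missing from `a`
    b = unname(b[pkgs]),  # NA where the package is missing from `b`
    stringsAsFactors = FALSE
  )
  # Keep rows where a package is missing on one side or the versions differ.
  differs <- is.na(out$a) | is.na(out$b) | out$a != out$b
  out[differs, , drop = FALSE]
}

his  <- c(digest = "0.6.13", dplyr = "0.7.4", Rcpp = "0.12.14.5", zip = "1.0.0")
mine <- c(digest = "0.6.14", dplyr = "0.7.4", Rcpp = "0.12.14")
res <- diff_versions(his, mine)
res
```

On a real system, the input vectors could be taken from installed.packages()[, "Version"], which is named by package.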

@jabranham jabranham commented Jan 19, 2018

I just re-ran this with the latest development version of every package where our versions differed, and got the same results. The exceptions were Rcpp (I couldn't find 0.12.14.5), digest and haven (where my version was more recent than yours), and openxlsx (it's not an Excel file).

I was also able to reproduce this on a macOS computer with up-to-date packages from CRAN.

@krlmlr krlmlr commented Jan 25, 2018

Very strange indeed. This is what I just tried, with r-lib/withr#66, in a vanilla session:

withr::with_temp_libpaths(action = "replace", {
  install.packages(c("dplyr", "rio"))
  library(dplyr)
  library(rio)
  test2 <- import("https://github.com/tidyverse/dplyr/files/1105096/test2.zip")

  test2 %>%
    summarize_at("russia", mean, na.rm = TRUE)
})

I got:

    russia
1 5.516387

@jabranham jabranham commented Jan 25, 2018

OK, if I do that I also get 5.516.... Progress, I guess.

But I still get NA in a "normal" R session, even after making sure all R packages are up-to-date and deleting my Rprofile file.

@krlmlr krlmlr commented Jan 25, 2018

Maybe a hidden install-time dependency?

Can you post a snapshot of your library, and your OS and version? I wonder if I can replicate the problem on my machine.

The only other option I see would be undefined behavior.

@jabranham jabranham commented Jan 25, 2018

> Can you post a snapshot of your library, and your OS and version?

How do you suggest I do that? I tried using packrat but I can't reproduce the problem in packrat. Which would make me think that this is some weird issue on this computer, except I can reproduce this on several computers with different OSes.

@krlmlr krlmlr commented Jan 25, 2018

Just zip your ~/R directory?

@jabranham jabranham commented Jan 25, 2018

D'oh!

Warning, large download (~56M):

https://www.dropbox.com/s/oh2awr5fiby3759/r.tar.gz?dl=0

R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Arch Linux

Matrix products: default
BLAS: /usr/lib/libblas.so.3.8.0
LAPACK: /usr/lib/liblapack.so.3.8.0

@krlmlr krlmlr commented Jan 25, 2018

Mounted your library into a Docker container built with the following Dockerfile:

FROM base/archlinux

RUN pacman -Syy

RUN pacman --noconfirm -S r gcc-fortran

# MOUNT is not a Dockerfile instruction; COPY the library into the image instead
COPY ./library /library

RUN R_LIBS_USER=/library R -q -e 'library(dplyr); library(rio); test2 <- import("https://github.com/tidyverse/dplyr/files/1105096/test2.zip"); test2 %>% summarise_at("russia", mean, na.rm = TRUE)'

Got NA as result. Bingo!

So:

  • your package library, and the one created by a clean install on OS X, both exhibit the problematic behavior
  • a clean package library installed on Ubuntu works as expected

I'm going to reinstall all packages already present in your library, one by one, and check which package fixes the problem. Will adapt the Dockerfile to take a copy from your precious library first.

@krlmlr krlmlr commented Jan 25, 2018

So, reinstalling dplyr appears to resolve the problem.

Can you please post the output of dplyr::dr_dplyr() on your system?

@krlmlr krlmlr commented Jan 25, 2018

And also on the OS X system, if you can?

@jabranham jabranham commented Jan 25, 2018

dplyr::dr_dplyr() tells me:

Warning message:
Installed Rcpp (0.12.15) different from Rcpp used to build dplyr (0.12.14).
Please reinstall dplyr to avoid random crashes or undefined behavior.

And indeed reinstalling dplyr fixes the issue!

I don't have access to the mac right now, but I'll take a look later.

Thanks for looking through this super particular bug!
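
dplyr::dr_dplyr() only checks dplyr itself, but other packages compiled against Rcpp could be in the same state after an Rcpp upgrade. As a rough sketch, the installed packages that declare Rcpp in their LinkingTo field can be listed as candidates for reinstalling. `rcpp_dependents` is a hypothetical helper name, and this is a heuristic: it does not detect a version mismatch by itself.

```r
# Hypothetical helper: names of installed packages that compile against Rcpp,
# judged by the LinkingTo field of their DESCRIPTION.
rcpp_dependents <- function(pkgs = utils::installed.packages()) {
  linking <- pkgs[, "LinkingTo"]
  # Match "Rcpp" as a whole word so "RcppArmadillo" alone does not count.
  hits <- !is.na(linking) & grepl("\\bRcpp\\b", linking, perl = TRUE)
  unname(pkgs[hits, "Package"])
}

# Candidates for reinstalling after Rcpp has been upgraded.
rcpp_dependents()
```

The resulting character vector could then be passed to install.packages() to rebuild those packages against the currently installed Rcpp.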

@krlmlr krlmlr commented Jan 25, 2018

Thank you for furnishing me with the input I asked for!

Leaving it open for now, will file an issue with Rcpp that points here.

@krlmlr krlmlr commented Jan 25, 2018

This problem disappears if dplyr is compiled against Rcpp >= 0.12.15. Just installing Rcpp 0.12.15 is not enough. In the following Dockerfile, the first two runs of test.R fail; the second run is after installing Rcpp 0.12.15 with dplyr still compiled against Rcpp 0.12.14. Only the third run, after dplyr has been reinstalled against Rcpp 0.12.15, succeeds:

FROM base/archlinux

RUN pacman -Syy

RUN pacman --noconfirm -S r gcc-fortran

RUN pacman --noconfirm -S base-devel

RUN echo 'suppressPackageStartupMessages(library(dplyr)); library(rio); test2 <- import("https://github.com/tidyverse/dplyr/files/1105096/test2.zip"); test2 %>% summarise_at("russia", mean, na.rm = TRUE)' > test.R


RUN echo "MAKEFLAGS=-j 8" >> ~/.Renviron

RUN echo 'options(repos = "https://cloud.r-project.org")' > ~/.Rprofile

RUN R -q -e 'install.packages(c("Rcpp", "rio"))'

RUN R -q -e 'install.packages("dplyr")'

# baseline

RUN R -q -e 'install.packages("https://cran.r-project.org/src/contrib/Archive/Rcpp/Rcpp_0.12.14.tar.gz", repos = NULL)'

RUN nice R -q -e 'install.packages("dplyr")'

RUN R -q -f test.R

RUN nice R -q -e 'install.packages("Rcpp")'

RUN R -q -f test.R

RUN nice R -q -e 'install.packages("dplyr")'

RUN R -q -f test.R

The problem here is fixed with RcppCore/Rcpp#790, which is included in Rcpp 0.12.15.
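
A quick way to see whether an installed Rcpp release is new enough is a version comparison in R. This is a sketch only: `rcpp_has_fix` is a hypothetical helper, it is meant for CRAN release numbers, and a TRUE result is necessary but not sufficient, since dplyr still has to be reinstalled so that it is compiled against that Rcpp.

```r
# Hypothetical helper: is the given Rcpp release at least 0.12.15?
# The default looks up the installed Rcpp and is only evaluated
# when no argument is supplied.
rcpp_has_fix <- function(version = utils::packageVersion("Rcpp")) {
  package_version(version) >= package_version("0.12.15")
}

rcpp_has_fix("0.12.14")  # the release the failing dplyr was built against
#> [1] FALSE
rcpp_has_fix("0.12.15")  # the release that includes the fix
#> [1] TRUE
```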

@krlmlr krlmlr closed this in 9298431 Jan 25, 2018

@lock lock bot commented Jul 24, 2018

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Jul 24, 2018