New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

segfault when tabulating data incorrectly guessed as type "logical" #385

Closed
nealrichardson opened this Issue Sep 10, 2017 · 5 comments

Comments

Projects
None yet
2 participants
@nealrichardson

nealrichardson commented Sep 10, 2017

After reading the data in this file, I get a segmentation fault whenever I try to aggregate most columns in the resulting tbl_df:

> library(readxl)
> df <- read_xlsx("City Survey MASTER Data 1996-2017.xlsx", sheet=2)
> table(df$parkvis)

 *** caught segfault ***
address 0x211ebd268, cause 'memory not mapped'

Traceback:
 1: unique.default(x, nmax = nmax)
 2: unique(x, nmax = nmax)
 3: factor(a, exclude = exclude)
 4: table(df$parkvis)

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection: 1
R is aborting now ...
Segmentation fault: 11

The problem is probably related to the fact that, because there are so many missing values at the top of the dataset, this column (and a majority of the others in the dataset) were guessed as type "logical":

> class(df$parkvis)
[1] "logical"
> head(df$parkvis)
[1] NA NA NA NA NA NA
> tail(df$parkvis)
[1] TRUE TRUE TRUE TRUE TRUE TRUE

It turns out that if you specify the col_types correctly as "numeric", the data are perfectly readable without segfaulting:

> df <- read_xlsx("City Survey MASTER Data 1996-2017.xlsx", sheet=2, col_types="numeric")
> table(df$parkvis, useNA="always")

    1     2     3     4     5     6  <NA> 
 2027  2863  4931  6400  9325   828 11598 
> sessionInfo()
R Under development (unstable) (2017-09-07 r73219)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] readxl_1.0.0

loaded via a namespace (and not attached):
[1] compiler_3.5.0   tools_3.5.0      tibble_1.3.4     Rcpp_0.12.12    
[5] cellranger_1.1.0 rlang_0.1.2.9000
@jennybc

This comment has been minimized.

Member

jennybc commented Sep 10, 2017

Can you provide "City Survey MASTER Data 1996-2017.xlsx"?

@nealrichardson

This comment has been minimized.

nealrichardson commented Sep 10, 2017

@jennybc

This comment has been minimized.

Member

jennybc commented Sep 10, 2017

Oh sorry, I didn't notice the blue text 😬

@jennybc

This comment has been minimized.

Member

jennybc commented Sep 11, 2017

I can reproduce it. It very weird, because the data frame appears normal in most respects. Thanks for this specimen.

Two things I noticed in a quick examination:

  • dplyr::count() seems able to tabulate df$parkvis without segfaulting.
  • Instead of typing all the variables, another workaround is to increase guess_max to something high enough to consult all the data.
library(readxl)
library(tidyverse)
df <- read_xlsx("~/Downloads/City Survey MASTER Data 1996-2017.xlsx", sheet = 2)
df %>% 
  count(parkvis)
#> # A tibble: 7 x 2
#>   parkvis     n
#>     <lgl> <int>
#> 1    TRUE  2027
#> 2    TRUE  2863
#> 3    TRUE  4931
#> 4    TRUE  6400
#> 5    TRUE  9325
#> 6    TRUE   828
#> 7      NA 11598

df2 <- read_xlsx("~/Downloads/City Survey MASTER Data 1996-2017.xlsx", sheet = 2, guess_max = 40000)
table(df2$parkvis)
#> 
#>    1    2    3    4    5    6 
#> 2027 2863 4931 6400 9325  828
@nealrichardson

This comment has been minimized.

nealrichardson commented Sep 11, 2017

Yeah, I can get to the right parsing of the data a few ways, and the version that segfaults isn't useful, so this doesn't block the analysis I was trying to do. But I figured you'd want to know about a segfault :)

Odd that dplyr::count actually tabulates the data correctly: even though all of the values show as TRUE, somewhere in there the distinct values are there.

base::table and stats::xtabs both trigger the segfault when they call unique (unique.default, to be precise). Indeed, calling unique(df$parkvis) directly also segfaults.

nacnudus added a commit to nacnudus/readxl that referenced this issue Oct 19, 2017

nacnudus added a commit to nacnudus/readxl that referenced this issue Oct 19, 2017

Transform int to 0/1 before coercion to logical
Fixes tidyverse#385

Weirdly, the test passes even without this fix when run with `testthat::test()`.  To get the test to fail, checkout the previous commit, install, and do

```r
library(readxl)
library(testthat)
source("./tests/testthat/helper.R")
source("./tests/testthat/test-col-types.R")
```

from within the project directory.

nacnudus added a commit to nacnudus/readxl that referenced this issue Oct 19, 2017

Transform int to 0/1 before coercion to logical
Fixes tidyverse#385

Weirdly, the test passes even without this fix when run with `testthat::test()`.  To get the test to fail, checkout the previous commit, install, and do

```r
library(readxl)
library(testthat)
source("./tests/testthat/helper.R")
source("./tests/testthat/test-col-types.R")
```

from within the project directory.  It will segfault.

@jennybc jennybc closed this in #398 Oct 20, 2017

jennybc added a commit that referenced this issue Oct 20, 2017

Transform int to 0/1 before coercion to logical (#398)
Fixes #385

Weirdly, the test passes even without this fix when run with `testthat::test()`.  To get the test to fail, checkout the previous commit, install, and do

```r
library(readxl)
library(testthat)
source("./tests/testthat/helper.R")
source("./tests/testthat/test-col-types.R")
```

from within the project directory.  It will segfault.

nacnudus added a commit to nacnudus/readxl that referenced this issue Oct 20, 2017

jennybc added a commit that referenced this issue Oct 21, 2017

Add bugfix of #385 to NEWS (#399)
* Transform int to 0/1 before coercion to logical

Fixes #385

Weirdly, the test passes even without this fix when run with `testthat::test()`.  To get the test to fail, checkout the previous commit, install, and do

```r
library(readxl)
library(testthat)
source("./tests/testthat/helper.R")
source("./tests/testthat/test-col-types.R")
```

from within the project directory.  It will segfault.

* Add bugfix of #385 to NEWS
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment