Loading file with .sas7bcat specified (still) results in crash #116

Closed
gergness opened this issue Oct 9, 2015 · 14 comments

Comments

@gergness
Contributor

gergness commented Oct 9, 2015

Thanks for the great package! I see that @evanmiller did some updates for sas7bcat files, and thought I'd send one of mine which I've never gotten to work (even before this most recent update). I can't share the full dataset, but even this minimal one with 2 observations and 3 variables causes a crash.

library(haven)

# First works okay, but the second causes a crash
data <- read_sas("https://dl.dropboxusercontent.com/u/2019891/haven_bug/subset.sas7bdat")

data2 <- read_sas("https://dl.dropboxusercontent.com/u/2019891/haven_bug/subset.sas7bdat",
                  "https://dl.dropboxusercontent.com/u/2019891/haven_bug/formats.sas7bcat")

Here's my session info in case that is part of the problem again. Please let me know if there's anything else I can do to help.

Session info ---------------------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.2.2 (2015-08-14)
 system   x86_64, mingw32             
 ui       RStudio (0.99.484)          
 language (EN)                        
 collate  English_United States.1252  
 tz       America/Chicago             
 date     2015-10-09                  

Packages -------------------------------------------------------------------------------------------------
 package    * version    date       source                       
 devtools     1.9.1      2015-09-11 CRAN (R 3.2.2)               
 digest       0.6.8      2014-12-31 CRAN (R 3.2.0)               
 haven      * 0.2.0.9000 2015-10-09 Github (hadley/haven@2923140)
 memoise      0.2.1      2014-04-22 CRAN (R 3.2.0)               
 Rcpp         0.12.1     2015-09-10 CRAN (R 3.2.2)               
 rstudioapi   0.3.1      2015-04-07 CRAN (R 3.2.0)
@evanmiller
Collaborator

What is the crash message?

@gergness
Contributor Author

gergness commented Oct 9, 2015

"R encountered a fatal error. The session was terminated."

I'm not aware of a way to get more information when R crashes this way; as far as I know, nothing is retained.
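
One common way to keep some information when a call takes down the whole session is to run it in a child process and capture that process's output and exit status. The sketch below assumes `Rscript` is available next to the running R binary; `run_isolated` is a hypothetical helper, not part of haven, and what the child reports on a native crash will vary by platform.

```r
# Hypothetical helper (not part of haven): evaluate R code in a separate
# Rscript process so that a native crash ends the child, not this session.
run_isolated <- function(expr_text) {
  rscript <- file.path(R.home("bin"), "Rscript")
  out <- suppressWarnings(
    system2(rscript, c("-e", shQuote(expr_text)), stdout = TRUE, stderr = TRUE)
  )
  # system2() attaches a "status" attribute only when the exit code is non-zero
  status <- attr(out, "status")
  list(
    output = as.character(out),
    status = if (is.null(status)) 0L else as.integer(status)
  )
}

# Harmless demonstration: the child prints a line and exits with code 3
res <- run_isolated('cat("child ran\\n"); quit(status = 3)')
print(res$status)
print(res$output)
```

To apply this to the report above, the string passed to `run_isolated` would contain the failing `read_sas()` call; a non-zero `status` together with whatever was written to `output` is then at least something to attach to an issue.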

@gergness
Contributor Author

gergness commented Oct 9, 2015

(Would be happy to be proven wrong if someone here knows more)

@evanmiller
Collaborator

Thanks -- I am able to reproduce the crash on my machine, so I'll take it from here.

@gergness
Contributor Author

gergness commented Oct 9, 2015

Great, thanks!

@evanmiller
Collaborator

Just some notes for myself:

Version string: 8.0202M0WIN_ASRV

The page size is 32256 bytes, which (curiously) is 512 bytes short of 2^15 bytes.

The crash is occurring on the second block of the first "real" catalog page (i=3). The first block gives its size as 2040 (=(1+7)*255) but the second block appears to begin at 2015. Should add checks to prevent a crash, but the core issue is why there's a 25-byte discrepancy between the purported and actual block size.

Actually many of the blocks appear to be 2015 bytes in length, despite 1) each purporting to be 2040 bytes and 2) using only a fraction of their block size.

The second-to-last block in the page purports to be 2040 bytes but appears to be 2009 bytes.

The last block in the page purports to be 510 bytes but appears to be truncated at 22 bytes. (Also possible this is unscrubbed memory?)
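
The kind of check mentioned above (never trusting a block's declared size past the end of the page) can be sketched as follows. This is an illustration only: the real parser is C code in ReadStat, `walk_blocks` is a hypothetical helper, and the layout assumed here (blocks laid back to back, each with a declared size) is a simplification of whatever the catalog format actually does.

```r
# Hypothetical helper: walk the blocks of one page using their declared
# sizes, clamping each block to the bytes that really remain in the page.
walk_blocks <- function(page_size, first_offset, declared_sizes) {
  offset <- first_offset
  rows <- list()
  for (size in declared_sizes) {
    if (offset >= page_size) break          # nothing left to read in this page
    available <- page_size - offset
    rows[[length(rows) + 1L]] <- data.frame(
      offset    = offset,
      declared  = size,
      usable    = min(size, available),     # never read past the page end
      truncated = size > available
    )
    offset <- offset + size
  }
  do.call(rbind, rows)
}

# Small made-up page: 100 bytes, blocks start at byte 10, each claims 40 bytes
demo <- walk_blocks(page_size = 100, first_offset = 10, declared_sizes = c(40, 40, 40))
print(demo)
```

Called with the figures from these notes, for example `walk_blocks(32256, 22, rep(2040, 16))`, the same guard flags any block whose purported size would run off the end of the page instead of reading past it.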

Worth noting that 2015 * 16 = 32240 so the missing 25 bytes in each block might be a technique to pack an extra block into the page. Since the blocks start at byte 22, that might also explain why the second-to-last block is coming up 6 bytes short (32256 - 32240 = 16 = 22 - 6).
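
The arithmetic in these notes can be checked directly. The snippet below only re-derives the numbers quoted above; the variable names are made up for illustration, and it does not confirm the packing hypothesis itself.

```r
page_size      <- 32256L              # page size from the notes
claimed_block  <- (1L + 7L) * 255L    # size each block purports to have
observed_block <- 2015L               # spacing actually seen between blocks
first_block_at <- 22L                 # offset of the first block in the page

page_shortfall  <- 2^15 - page_size               # bytes short of 2^15
block_shortfall <- claimed_block - observed_block # purported minus actual
packed_bytes    <- 16L * observed_block           # sixteen blocks at 2015 bytes
leftover        <- page_size - packed_bytes       # page bytes beyond those blocks
overshoot       <- first_block_at - leftover      # overrun caused by the 22-byte start

print(c(page_shortfall = page_shortfall, block_shortfall = block_shortfall,
        packed_bytes = packed_bytes, leftover = leftover, overshoot = overshoot))
```

The last figure is the 6-byte shortfall mentioned above, and `observed_block - overshoot` gives 2009, the apparent length of the second-to-last block.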

@evanmiller
Collaborator

I think I am going to need more example files in order to debug this. This file has short pages (512 bytes smaller than usual) and short blocks (25 bytes less than expected) -- if I can find other catalog files with similar properties (possibly ones that crash in the same way), I'll try to find a pattern.

@evanmiller
Collaborator

Just so you know where I am -- I think I've tracked down the crash and have a fix ready. However, the file you provided seems to have surfaced a few other bugs in the software library. In particular, it's choking on strings such as:

MT: ADULT (>= 20)
HI: OAHU,OTH RACE

Are these value labels, or another kind of data stored in the catalog file?

@evanmiller
Collaborator

Ok, I've pushed some patches which ought to fix this issue. Copy this into the dev version of haven:

readstat_sas.c

(Or see the instructions in the README for using the latest ReadStat code.)

Please close the issue if the fix works for you. Thanks!

@gergness
Contributor Author

Thanks! The crash was fixed, but a lot of the labels are scrambled. I totally understand if you want to wait to see if other format libraries are structured this way, but here's what I've figured out about the problem.

The two variables I've included in the subset have the same formatting, which I believe SAS calls "SMOKE5F". They should be labelled as:

.B  =   DK/BLANK
.S  =   SKIP
1   =   41 CIGARETTES OR MORE
2   =   21 TO 40 CIGARETTES
3   =   11 TO 20 CIGARETTES
4   =   6    10 CIGARETTES
5   =   1 TO  5 CIGARETTES
6   =   LESS THAN 1 CIGARETTE
7   =   NONE (0 CIGARETTES)

But haven's label shows

Labels:
          value                 label is_na
 -3.639647e-310 41 CIGARETTES OR MORE FALSE
 -4.563140e-310   21 TO 40 CIGARETTES FALSE
   1.000000e+00   11 TO 20 CIGARETTES FALSE
   2.000000e+00    6    10 CIGARETTES FALSE
   3.000000e+00    1 TO  5 CIGARETTES FALSE
   4.000000e+00 LESS THAN 1 CIGARETTE FALSE
   5.000000e+00   NONE (0 CIGARETTES) FALSE
   6.000000e+00              DK/BLANK FALSE
   7.000000e+00                  SKIP FALSE

In the full dataset, I think the pattern that determines whether the labels are scrambled is whether the format has "multiple missings" such as .B and .S above. For example, $STATEF, STRAT1IA, QX_TYPE, and LANGUAGE all have the correct labels.
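
The scrambling shown above has a regular shape: the labels look like the expected ones rotated by the number of missing codes. The snippet below only restates that observation for SMOKE5F using the values quoted in this comment; it is not a diagnosis of the underlying bug, and the variable names are made up for illustration.

```r
# Expected mapping for SMOKE5F, in the order listed above
expected <- c(
  ".B" = "DK/BLANK",
  ".S" = "SKIP",
  "1"  = "41 CIGARETTES OR MORE",
  "2"  = "21 TO 40 CIGARETTES",
  "3"  = "11 TO 20 CIGARETTES",
  "4"  = "6    10 CIGARETTES",
  "5"  = "1 TO  5 CIGARETTES",
  "6"  = "LESS THAN 1 CIGARETTE",
  "7"  = "NONE (0 CIGARETTES)"
)

# Labels in the order haven printed them
observed_labels <- c(
  "41 CIGARETTES OR MORE", "21 TO 40 CIGARETTES", "11 TO 20 CIGARETTES",
  "6    10 CIGARETTES", "1 TO  5 CIGARETTES", "LESS THAN 1 CIGARETTE",
  "NONE (0 CIGARETTES)", "DK/BLANK", "SKIP"
)

# Move the labels of the two missing codes (.B, .S) to the end
n_missing <- 2
rotated <- c(expected[-seq_len(n_missing)], expected[seq_len(n_missing)])

matches_rotation <- identical(unname(rotated), observed_labels)
print(matches_rotation)
```

If the same rotation, with `n_missing` set to each format's own count of missing codes, reproduces the scrambling in the other affected formats, that would support the "multiple missings" explanation.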

@gergness
Contributor Author

For the strings you mentioned above, do any start with "IA"? (I'm in Iowa, and it looks like those are state specific notes)

@evanmiller
Collaborator

Thanks for the information about the scrambled labels -- it appears to be related to the issue brought up over here: WizardMac/ReadStat#38 (comment)

The fix I provided there won't work with multiple missings. I'll need to dig around the file you provided some more.

And yes -- some of the strings start with IA, and many strings appear to start with state postal codes. How does SAS interpret these notes, any idea?

@gergness
Contributor Author

Are these the IA ones?

535 =   IA: NON-HISPANIC BLACK
536 =   IA: NON-HISPANIC WHITE/OTHER/UNKNOWN
537 =   IA: HISPANIC

This is from the label STRATCF, which combines all participating states' strata into a single variable. Haven does not have a label for the variable that was supposed to be encoded by this label.

The only difference I can think of is that this one will be much longer than other variable labels.

@evanmiller
Collaborator

Interesting, thanks -- it's probably encoded as a long label somehow. I didn't know catalog files could contain variable labels. I've opened a separate issue against ReadStat about this: WizardMac/ReadStat#39

Since the crash is fixed, please close this issue and let's continue the discussion about jumbled value labels here: WizardMac/ReadStat#40

lock bot locked and limited conversation to collaborators on Jun 27, 2018