SPSS value labels longer than 255 bytes cause crash #262

Closed
rogerjdeangelis opened this issue Jan 13, 2017 · 29 comments
Labels: bug (an unexpected problem or unintended behavior), readstat

Comments

@rogerjdeangelis

Haven cannot create an SPSS file with a string longer than 255 bytes, but Python can.
Vanilla SAS cannot either, but it splits the string into 200-byte chunks across multiple variables.

Suggestion for improving write_sav and write_sas in Haven.

  • This runs from R term and from SAS if the string is shorter than about 255 bytes,
    but crashes with longer strings:

%utl_submit_r64('
source("C:/Program Files/R/R-3.3.2/etc/Rprofile.site", echo=T);
library(haven);
str<-as.data.frame(paste(replicate(100, "roger"), collapse = ""));
colnames(str)<-"String";
str;
write_sav(str,"d:/sav/str.sav");
fro<-read_sav("d:/sav/str.sav");
fro;
');

hangs
visual studio 5 exception
unhandled win 32 exception

  • I use Python to create SPSS files; it seems to handle labels and long strings better.
    Maybe you can look at the code?

I am a statistician and drop down to WPS, Stattransfer, R, Perl and Python from SAS.
Using the functionality

  • This works with long strings;
    CREATE SPSS dataset using Python;

  • create a SPSS file where var1 is 3000 bytes;

  • Python seems better than R for SPSS;

  • it seems to work better when you overspecify the string length metadata;

  • PYTHON;
    %utl_submit_py64('
    import savReaderWriter as sav;
    savFileName = "d:/rio/mtcars.sav";
    newstring = "a"*300;
    print(newstring);
    records = [[newstring, 1, 1], [newstring, 2, 1]];
    varNames = ["var1", "v2", "v3"];
    varTypes = {"var1": 500, "v2": 0, "v3": 0};
    with sav.SavWriter(savFileName, varNames, varTypes) as writer:;
    . for record in records:;
    . writer.writerow(record);
    ');

  • using the free express version of WPS (proc r) I can
    create a SAS dataset from the Python output.

  • Haven can input the long strings;

  • Stattransfer can handle the long string (in and out);

  • create a SAS dataset from sav file;
    %utl_submit_wps64('
    options set=R_HOME "C:/Program Files/R/R-3.3.2";
    libname wrk "%sysfunc(pathname(work))";
    proc r;
    submit;
    source("C:/Program Files/R/R-3.3.2/etc/Rprofile.site", echo=T);
    library(haven);
    strsas<-read_sav("d:/rio/mtcars.sav");
    strsas;
    endsubmit;
    import r=strsas data=wrk.strsas;
    run;quit;
    ');

/* SAS dataset strsas */
Variables in Creation Order

Variable Type Len

1 VAR1 Char 300
2 V2 Num 8
3 V3 Num 8

@hadley
Member

hadley commented Jan 24, 2017

Can you please create a minimal reproducible example, and format your issue nicely so it's easier for me to read?

@rubenarslan
Contributor

I think I ran into the same problem. Here's a minimal reproducible example:

write_sav(data.frame(long = paste(rep("a", times= 252), collapse = "")), path = "test.sav") # works
write_sav(data.frame(long = paste(rep("a", times= 253), collapse = "")), path = "test.sav") # >Error # 1405 in column 8.  

@hadley
Member

hadley commented Jan 25, 2017

@rubenarslan perfect - thanks!

@evanmiller
Collaborator

Possibly fixed by: WizardMac/ReadStat@e4b4c1b

@hadley
Member

hadley commented Jan 25, 2017

@evanmiller with the latest readstat, both examples now crash:

    frame #7: 0x0000000108b941f3 haven.so`sav_begin_data + 147 at readstat_sav_write.c:418 [opt]
    frame #8: 0x0000000108b94160 haven.so`sav_begin_data(writer_ctx=<unavailable>) + 3216 at readstat_sav_write.c:932 [opt]
    frame #9: 0x0000000108b80ef5 haven.so`readstat_begin_row [inlined] readstat_begin_writing_data(writer=<unavailable>) + 75 at readstat_writer.c:110 [opt]
    frame #10: 0x0000000108b80eaa haven.so`readstat_begin_row(writer=<unavailable>) + 90 at readstat_writer.c:471 [opt]
    frame #11: 0x0000000108baa785 haven.so`Writer::write(this=0x00007fff5fbfceb0) + 1269 at DfWriter.cpp:110 [opt]
    frame #12: 0x0000000108baa213 haven.so`write_sav_(data=<unavailable>, path=<unavailable>) + 67 at DfWriter.cpp:320 [opt]

@hadley hadley added the bug (an unexpected problem or unintended behavior) and readstat labels Jan 25, 2017
@hadley hadley changed the title from "Have write_spss does not handle strings longer than 255 bytes but Python does?" to "Support SPSS strings longer than 255 bytes" Jan 25, 2017
@evanmiller
Collaborator

@hadley That is strange, since the 252- and 253-byte strings are now under test coverage... do other strings shorter than 252 or longer than 255 bytes crash too?

@hadley
Member

hadley commented Jan 25, 2017

@evanmiller yeah, I've only run a few test cases, but length 100 is ok and 200 is not. I can try to narrow it down more if that would help.
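
Narrowing the failure down between a length that works and one that does not can be done by bisection. The sketch below is only an illustration: `first_failing_length` and `fake_write_ok` are hypothetical names, the threshold in the stand-in predicate is arbitrary, and the search assumes that once a length fails, every longer length fails too.

```python
from typing import Callable


def first_failing_length(write_ok: Callable[[int], bool], lo: int, hi: int) -> int:
    """Return the smallest n in (lo, hi] for which write_ok(n) is False.

    Assumes write_ok(lo) is True, write_ok(hi) is False, and that
    failures are monotone in n.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if write_ok(mid):
            lo = mid  # mid still works, so the boundary is above it
        else:
            hi = mid  # mid fails, so the boundary is at or below it
    return hi


# Stand-in predicate with an arbitrary threshold, for illustration only.
# A real one would write a file with a label of length n and report
# whether the write succeeded.
def fake_write_ok(n: int) -> bool:
    return n <= 137


print(first_failing_length(fake_write_ok, 100, 200))
```

With the bounds from the comment above (100 works, 200 fails), this needs at most seven writes.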

@hadley hadley changed the title from "Support SPSS strings longer than 255 bytes" to "SPSS value labels longer than 255 bytes cause crash" Jan 25, 2017
@hadley
Member

hadley commented Jan 25, 2017

@evanmiller btw the problem isn't strings at all; it's value labels.

@hadley
Member

hadley commented Jan 25, 2017

Slightly simpler reprex:

n <- 100
df <- data.frame(long = paste(rep("a", n), collapse = ""))
write_sav(df, path = tempfile())

@evanmiller
Collaborator

@hadley Thanks, that makes sense, will investigate.

@evanmiller
Collaborator

Welp: https://github.com/WizardMac/ReadStat/blob/master/src/spss/readstat_sav_write.c#L22

The file format limits value labels to 120 chars, so ReadStat attempts to truncate to 120. Likely a buffer overrun or something for 120+ bytes.
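
As a rough illustration of truncating a label to that limit without overrunning a buffer or splitting a character, here is a minimal sketch. It is not ReadStat's code; `truncate_label` is a hypothetical helper, and it assumes the limit is counted in bytes of UTF-8 text.

```python
MAX_LABEL_BYTES = 120  # limit quoted above; assumed here to be in bytes


def truncate_label(label: str, limit: int = MAX_LABEL_BYTES) -> bytes:
    """Encode label as UTF-8 and cut it to at most `limit` bytes
    without splitting a multi-byte character."""
    raw = label.encode("utf-8")
    if len(raw) <= limit:
        return raw
    # errors="ignore" drops a trailing partial character if the cut left one.
    return raw[:limit].decode("utf-8", errors="ignore").encode("utf-8")


print(len(truncate_label("a" * 300)))
```

The output buffer never needs more than `limit` bytes, whatever the input length.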

@evanmiller
Collaborator

If these are value labels, what is the underlying data type and storage size? (The SAV writer has separate code paths for string values longer than 8 bytes vs 8 bytes or shorter.)

@evanmiller
Collaborator

May or may not fix things: WizardMac/ReadStat@7e2965d

@hadley hadley closed this as completed in d8b1f71 Jan 25, 2017
@hadley
Member

hadley commented Jan 25, 2017

Looks good - thanks @evanmiller !

@evanmiller
Collaborator

FYI the news note is misleading as value labels are still truncated to 120 chars.

@hadley
Member

hadley commented Jan 25, 2017

Ooops, not only that but I guess I forgot to test it with a long enough label :/

@hadley hadley reopened this Jan 25, 2017
@hadley
Member

hadley commented Jan 25, 2017

I've added an R-side check for now. If you figure out the root problem, I can change the error to a warning.
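
The idea of checking label lengths before handing data to the writer can be sketched as follows. This is not haven's actual check (which lives on the R side); `check_value_labels` is a hypothetical name, and the 120-byte limit is taken from the discussion above.

```python
MAX_LABEL_BYTES = 120  # limit discussed in this thread


def check_value_labels(labels: dict, limit: int = MAX_LABEL_BYTES) -> None:
    """Raise ValueError if any label text exceeds `limit` bytes in UTF-8.

    `labels` maps each value to its label text.
    """
    too_long = [
        value for value, text in labels.items()
        if len(text.encode("utf-8")) > limit
    ]
    if too_long:
        raise ValueError(
            f"value labels longer than {limit} bytes for values: {too_long}"
        )


check_value_labels({1: "short", 2: "b" * 120})  # passes silently
```

Swapping the `raise` for `warnings.warn` would give the warning variant mentioned in the comment.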

@rubenarslan
Contributor

@hadley I think the root problem is SPSS; I also cannot make value labels longer than 120 characters in the software itself. But long strings work fine now, thanks a lot!

@rubenarslan
Contributor

So strings longer than 255 still don't work with haven installed just now from GitHub. Sorry, I think I misunderstood some of the above. Is this now a problem with value labels? But strings don't have value labels in SPSS, right? Value labels have an actual hard limit, while strings longer than 255 just get special treatment under the hood?

n <- 256
df <- data.frame(long = paste(rep("a", n), collapse = ""))
write_sav(df, path = tempfile())

@hadley
Member

hadley commented Jan 26, 2017

@rubenarslan you did not create a string there. data.frame() turns character vectors into factors by default.
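
To make the distinction concrete: a factor stores integer codes plus a table of level names, so the long text becomes a label rather than data. The sketch below only mimics that encoding; `as_factor` is a hypothetical helper, and mapping levels to SPSS value labels is an assumption inferred from this thread.

```python
def as_factor(values):
    """Encode strings as 1-based integer codes plus a sorted list of levels."""
    levels = sorted(set(values))
    index = {level: i + 1 for i, level in enumerate(levels)}
    codes = [index[v] for v in values]
    return codes, levels


long_text = "a" * 256
codes, levels = as_factor([long_text])

# The stored data is just the code; the long text sits in the label table.
value_labels = dict(zip(range(1, len(levels) + 1), levels))
print(codes)
print(len(value_labels[1]))
```

So a 256-character "string" column that was silently converted to a factor runs into the value-label limit, not the string-width limit.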

@rubenarslan
Contributor

rubenarslan commented Jan 26, 2017

Sorry, I have stringsAsFactors = FALSE in my .Rprofile, forgot that this would be relevant.
The error occurs with strings (and the variable shows up as string in SPSS, when it's short enough).

n <- 256
df <- data.frame(long = paste(rep("a", n), collapse = ""), stringsAsFactors = FALSE)
write_sav(df, path = "test.sav")

@evanmiller
Collaborator

@rubenarslan Are you using the latest code? Your example works fine for me.

> devtools::install_github("tidyverse/haven")
> library(haven)
> n <- 256
> df <- data.frame(long = paste(rep("a", n), collapse = ""), stringsAsFactors = FALSE)
> write_sav(df, path = "test.sav")
> df1 <- read_sav("test.sav")
> df1
# A tibble: 1 × 1
                                                                                                                                                             long
                                                                                                                                                            <chr>
1 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
> 

@rubenarslan
Contributor

I am. SPSS refuses to open the file, haven has no problem.

Error. Command name: GET FILE
Invalid SPSS Statistics data file (DATA1204)
Execution of this command stops.

Error # 1405 in column 8. Text:
Error when attempting to get a data file.
DATASET NAME DataSet1 WINDOW=FRONT.

@rubenarslan
Contributor

I assume you don't have SPSS to test? Maybe this helps, I made three test files.
In all, I wrote a long string consisting of "a" with one "b" at the end.
One is 255 characters, written in haven, it opens in SPSS.
One is 256 characters, written in haven, it doesn't open in SPSS.
One is the first file, with the string width increased from 255 to 256 in SPSS (added one a). This one opens fine too.

test_files.zip
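
For context on why 255 vs 256 might matter: as far as I recall from PSPP's description of the system file format, strings wider than 255 bytes are stored as several segments, one per 252 bytes of width, with every segment except the last declared 255 bytes wide. The sketch below encodes that recollection only; it has not been checked against ReadStat or SPSS, and `segment_widths` is a hypothetical name.

```python
def segment_widths(width: int) -> list:
    """Declared widths of the segments used to store a string of `width` bytes.

    Assumption (from memory of the PSPP format notes): strings of up to
    255 bytes use one segment; wider ones use ceil(width / 252) segments,
    all 255 wide except the last.
    """
    if width <= 255:
        return [width]
    n = (width + 251) // 252
    return [255] * (n - 1) + [width - (n - 1) * 252]


for w in (255, 256, 3000):
    print(w, segment_widths(w))
```

If that layout is right, 256 is the first width where a writer has to emit a second segment, which would fit the observation that the 255-character file opens and the 256-character one does not.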

@evanmiller
Collaborator

@rubenarslan Please open a new issue if the issue is SPSS compatibility rather than a crash.

@rubenarslan
Contributor

Sorry, hadley changed the title of this issue from "support" to "crash" after I started commenting.

I assumed we were talking about the same thing (I mentioned the SPSS error only in the off-screen comment of my original reply), but didn't read the title of the original issue carefully enough.

@wibrt

wibrt commented May 17, 2018

I can confirm this issue still seems to persist.
In this case a column called 'end_ofquestions' causes the problem,
with the same error message in SPSS (none in R/haven though).

@rubenarslan
Contributor

@wibrt that is now issue #266 which is still open.

@lock

lock bot commented Nov 13, 2018

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Nov 13, 2018