New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
dataframe2sasdata fails with newline character in data #251
Comments
@abbottkr I have to say, thank you for such a thorough bug report! That is excellent! I've been playing around with this a bit and yes, embedded newlines just seem to wreak havoc trying to get them into a data set. I wasn't able to get them via control-enter in excel, but I just put them in via python and saw the problems. I tried having that column be quoted and setting keep_outer_quotes=True on df2sd() but that didn't help at all. I tried exporting the data frame as csv and reading it w/ SAS and that didn't work right either. The data step I generate for the reverse of this, allows for specifying TERMSTR= where I can change from LF to some other byte, but the data step I generate for this doesn't support that; it uses the newline to means end of input row. Here's my musings:
I expect that I can come up with a way to get this to work, but I'll have to investigate that some more. At least I can easily reproduce this, so that's good. To speak to your Additional context statements (in positional order):
I may not have this solved today, but I will investigate this further and see what I can do to address it. Thanks again for entering this, and for such a thorough write up. I do appreciate that! |
So, seeing that the escaped (\\n instead of \n) appeared to work, I tweaked 2 lines of code to replace \n with \\n when transferring the data to SAS and that get this to work:
Now, this is a quick tweak, so I'm not about to push this out until I investigate further and identify if this is really a good way to do this, or if this is still only part of a larger issue that need to be addressed as well. Bullet item 1: -I have not tested any other escape sequences. But, for a Friday afternoon, I'm happy to have a potential easy fix for this. I'll continue on this and have something next week. Thanks! |
No, can't believe I fell for that. This isn't as it looks. That's just a trick of the display formatting. escaping didn't actually make it work, it changed it from a line feed (x'10') to 2 characters: a backslashes and the letter 'n', duh. Can't trust what's displayed on the screen when looking at \n.
My bad (long week), |
I've been exploring variations on data step input (https://go.documentation.sas.com/?docsetId=lrcon&docsetTarget=n1w749t788cgi2n1txpuccsuqtro.htm&docsetVersion=9.4&locale=en) to see if there was a different way I could code this which would allow for embedded LF in a column. But, I haven't been successful. In every case, the newline triggers the next row. Tom |
Sorry I haven't been responding to these ... I've been in the backcountry for the last week. I'm not familiar enough with SAS to suggest anything regarding your attempts but there was one other potential clue that I had meant to mention (which is related to your comment about trying to play around with enclosing quotes. if you copy cells from char_var to notepad, most are not enclosed in double quotes. The cell with the newline that was taken directly into sas is, however, enclosed in double quotes which presumably is why it works. the same cell going through saspy (the cell that is truncated) is not enclosed in double quotes. I doubt this helps given that you've already played around with this. Thanks again for putting effort on this. |
@tomweber-sas the best option to use is TERMSTR= which could be set to CRLF for example and in turn allow embedded LF in the column. |
@abbottkr no problem, that sounds like fun. I'll look at this some more as I have some time too. @FriedEgg TERMSTR isn't available using the infile and datalines like it is in the reverse (sd2df) where I have file and put. |
@tomweber-sas, I guess I should've looked at the code... I had assumed it was reading the data over socket. There are still data step solutions, made easier if the values are quoted. |
Hey @FriedEgg no problem. Yeah, for this method on all of my access methods, I just stream the data across stdin via datalines, so I don't need to stage the data in a file (IOM and HTTP) then read it w/ the data step (where I could use termstr), or for STDIO, use a socket (which also can use terstr). But I can find no way to make a newline not mean next row for the data step via stdin. So, I was about to look at adding a method to write to a file then read that w/ termstr, when I finally figured it out :) I had mentioned before that I could convert these to blanks, so they loaded cleanly at least. Well, I can convert them to some other char and then convert them back in the data step which won't trip up the data step parser and that will work just fine. I basically have this in the reverse, sd2df, where I use hex 1, 2 by default for column and row separators so I can get newlines across. I use hex 3 in sd2df already as a column sep, so embedded commas or blanks, can be in columns; I'm using delimited variable length columns for df2sd, instead of bloating it to fixed length padded columns. So, I've made a change to make CR and LF get converted to hex 1,2 and then in the data step convert them back, if you specify embedded_newlines=True on df2sd. If not, False is the default (same as now) behavior. @abbottkr can you give this a try and verify it works for your case, which I couldn't exactly reproduce? This code is in a branch called newlines. Can you pull that branch and try it out? Here's the case I was using. You can see below with and without the flag:
|
Hey @tomweber-sas. Unfortunately, I don't think I can test it as I'm working in an environment where I have to put in requests to be able to install packages and I don't think they allow anything that isn't a formal package on pypi. That said, I think the only differences between our cases was whether or not the df was read in from excel so this looks promising. Is there any other way I can help test it? |
Yeah, no problem, I understand. I went ahead and built a new release with this. I'm pretty confident that this will address your case as well as 'it works for me' :) I like to have it verified for the original case, especially when I can't exactly reproduce it, but this is pretty straight forward, so I believe it's good. Version 3.1.3, which is now out on PyPI has this, so you can install that to try this out. I also added overrides for the surrogate character I use for LF and CR to get them to SAS, but you shouldn't need those. You just need to add embedded_newlines=True as a parm to df2ds() and it ought to just work. I almost left off embedded_newlines and just did it by default as it will generally fail without doing this, but as this isn't a usual case didn't want to impose the extra (though fairly minimal) processing for all other cases. You're the first to bring up this issue, so it seemed to be the unusual case. Easy to add a boolean parm when needed. Let me know what you find when you get a chance. Thanks! |
Hey @tomweber-sas, sorry for the slow response. There is even a delay getting published PyPi packages. Unfortunately, I'm getting the same flawed output. For some reason, the version our repository manager uploaded was 3.1.4, not the 3.1.3 I requested. Does that make a difference? For reference, the script I was using (just focusing on the inputs that generate flawed outputs) was import pandas as pd For reference, the full output to the console was: Using SAS Config named: autogen_winlocal 3 The SAS System 14:46 Tuesday, August 27, 2019 21 NOTE: Copyright (c) 2002-2012 by SAS Institute Inc., Cary, NC, USA. NOTE: Updated analytical products:
NOTE: Additional host information: X64_7PRO WIN 6.1.7601 Service Pack 1 Workstation NOTE: SAS Initialization used (Total process time): 1 ;';";/; 7 ;';";/; 21 ;';";/; 27 ;';";*/; LIBREFSSTART 40 45 data OUT.via_sas_test_prob; 6 The SAS System 14:46 Tuesday, August 27, 2019 NOTE: Invalid data for 'numeric_var'n in line 57 1-14. NOTE: The SAS System stopped processing this step because of errors. 63 64 ;';";/; 70 ;';";*/; TABLE_EXISTS= 1 75 SAS Connection terminated. Subprocess id was 9468 |
Hey, no problem, I know how it goes. No, 3.1.4 is fine - I just published that an hour or two ago :)
and see that that get's the data over correctly? Thanks! |
ach, sorry I missed that addition. Yup, that works perfectly. Well done. do I close the issue or do you? Thanks for your work on this. |
Great! yeah, no problem. I'll go ahead and close this, since you're good with it and it's working. If you have anything else, just open another ticket! Glad to get this fixed! Thanks! |
Describe the bug
when using dataframe2sasdata, there is a failure when there are cells with newline characters. The output sas7bdat file only contains records up to the newline character (i.e., the cell with the newline character appears but only the part of the cell before the newline character).
To Reproduce
I will show some workflows that start with excel because that is where I discovered the issue (I do not have excel in my workflow, or newlines in my data, by choice). Note that the issue does not seem to be an excel issue, this is just a way to create a reproducible sequence.
1.a. problematic_data.xlsx looks like (note there are newline sequences after 'are', 'putting', & 'characters' created usinf ctrl+enter in excel):
1.b. no_prob_data.xlsx looks like:
libname test "path\to\output";
PROC IMPORT OUT = test.direct_data
DATAFILE = "path\to\problematic_data.xlsx"
DBMS = XLSX REPLACE;
GETNAMES=YES;
RUN;
import pandas as pd
import saspy
path_no_prob = r'path/to/no_prob_data.xlsx'
no_prob_data = pd.read_excel(path_no_prob,sheet_name='Sheet1')
path_problematic = r'path/to/problematic_data.xlsx'
problematic_data = pd.read_excel(path_problematic,sheet_name='Sheet1')
sas = saspy.SASsession()
sas.saslib('OUT','"path\to\output"')
sas.dataframe2sasdata(df=no_prob_data,table='via_sas_test_no_prob',libref='OUT')
sas.dataframe2sasdata(df=problematic_data,table='via_sas_test_prob',libref='OUT')
print(sas.saslog())
sas._endsas()
Expected behavior
sas can directly import data with newline sequences (from To Reproduce 2.):
pandas can import data without newline sequences ...:
.... as well as data with newline sequences (both from to Reproduce 3.):
sas7bdat from saspy without newline sequences is fine ...:
... but sas7bdat from saspy with newline sequences is truncated, both in terms of rows and in terms of what is is the cell that had the newline sequences (both from to Reproduce 3.):
full log
3 The SAS System 08:39 Friday, August 2, 2019
19
20 libname OUT "H:/Projects and scripts/saspy bug" ;
NOTE: Libref OUT was successfully assigned as follows:
Engine: V9
Physical Name: H:\Projects and scripts\saspy bug
21
22
1 The SAS System 08:39 Friday, August 2, 2019
NOTE: Copyright (c) 2002-2012 by SAS Institute Inc., Cary, NC, USA.
NOTE: SAS (r) Proprietary Software 9.4 (TS1M3)
Licensed to STATISTICS CANADA - PC/SERVER, Site 70186445.
NOTE: This session is executing on the X64_7PRO platform.
NOTE: Updated analytical products:
NOTE: Additional host information:
X64_7PRO WIN 6.1.7601 Service Pack 1 Workstation
NOTE: SAS Initialization used (Total process time):
real time 0.03 seconds
cpu time 0.01 seconds
1 ;';";/;
2 options svgtitle='svgtitle'; options validvarname=any pagesize=max linesize=max nosyntaxcheck; ods graphics on;
3
4 ;';";/;
5 %put E3969440A681A2408885998500000001;
E3969440A681A2408885998500000001
6
2 The SAS System 08:39 Friday, August 2, 2019
7 ;';";/;
8
9 %put WORKPATH=%sysfunc(pathname(work));
WORKPATH=C:\Users\abbokev\AppData\Local\Temp\SAS Temporary Files_TD6312_C232160_\Prc2
10 %put ENCODING=&SYSENCODING;
ENCODING=wlatin1
11 %put SYSVLONG=&SYSVLONG4;
SYSVLONG=9.04.01M3P06242015
12 %put SYSJOBID=&SYSJOBID;
SYSJOBID=6312
13 %put SYSSCP=&SYSSCP;
SYSSCP=WIN
14
15
16 ;';";/;
17 %put E3969440A681A2408885998500000002;
E3969440A681A2408885998500000002
18
3 The SAS System 08:39 Friday, August 2, 2019
19 ;';";/;
20 libname OUT "H:/Projects and scripts/saspy bug" ;
NOTE: Libref OUT was successfully assigned as follows:
Engine: V9
Physical Name: H:\Projects and scripts\saspy bug
21
22 ;';";/;
23 %put E3969440A681A2408885998500000003;
E3969440A681A2408885998500000003
24
4 The SAS System 08:39 Friday, August 2, 2019
25 ;';";*/;
26
27 data null; retain libref; retain cobs 1;
28 set sashelp.vlibnam end=last;
29 if cobs EQ 1 then
30 put "LIBREFSSTART";
31 cobs = 2;
32 if libref NE libname then
33 put "LIBREF=" libname;
34 libref = libname;
35 if last then
36 put "LIBREFSEND";
37 run;
LIBREFSSTART
LIBREF=WORK
LIBREF=OUT
LIBREF=SASHELP
LIBREF=MAPS
LIBREF=MAPSSAS
LIBREF=MAPSGFK
LIBREF=SASUSER
LIBREFSEND
NOTE: There were 24 observations read from the data set SASHELP.VLIBNAM.
NOTE: DATA statement used (Total process time):
real time 0.07 seconds
cpu time 0.07 seconds
38
39
40 ;';";*/;
41 %put E3969440A681A2408885998500000004;
E3969440A681A2408885998500000004
42
5 The SAS System 08:39 Friday, August 2, 2019
43 data OUT.via_sas_test_no_prob;
44 length 'numeric_var'n 8 'char_var'n $65;
45 infile datalines delimiter='03'x DSD STOPOVER;
46 input @;
47 if infile = '' then delete;
48 input 'numeric_var'n 'char_var'n ;
49 datalines4;
6 The SAS System 08:39 Friday, August 2, 2019
NOTE: DATA statement used (Total process time):
real time 0.10 seconds
cpu time 0.03 seconds
NOTE: The data set OUT.VIA_SAS_TEST_NO_PROB has 5 observations and 2 variables.
56 ;;;;
57
7 The SAS System 08:39 Friday, August 2, 2019
58 ;';";/;
59 run;
60
61 ;';";/;
62 %put E3969440A681A2408885998500000005;
E3969440A681A2408885998500000005
63
8 The SAS System 08:39 Friday, August 2, 2019
64 ;';";*/;
65 data null; e = exist('OUT.via_sas_test_no_prob');
66 v = exist('OUT.via_sas_test_no_prob', 'VIEW');
67 if e or v then e = 1;
68 te='TABLE_EXISTS='; put te e;run;
TABLE_EXISTS= 1
NOTE: DATA statement used (Total process time):
real time 0.04 seconds
cpu time 0.03 seconds
69
70
71 ;';";*/;
72 %put E3969440A681A2408885998500000006;
E3969440A681A2408885998500000006
73
9 The SAS System 08:39 Friday, August 2, 2019
74 ;';";*/;
75
76 data null; retain libref; retain cobs 1;
77 set sashelp.vlibnam end=last;
78 if cobs EQ 1 then
79 put "LIBREFSSTART";
80 cobs = 2;
81 if libref NE libname then
82 put "LIBREF=" libname;
83 libref = libname;
84 if last then
85 put "LIBREFSEND";
86 run;
LIBREFSSTART
LIBREF=WORK
LIBREF=OUT
LIBREF=SASHELP
LIBREF=MAPS
LIBREF=MAPSSAS
LIBREF=MAPSGFK
LIBREF=SASUSER
LIBREFSEND
NOTE: There were 24 observations read from the data set SASHELP.VLIBNAM.
NOTE: DATA statement used (Total process time):
real time 0.04 seconds
cpu time 0.04 seconds
87
88
89 ;';";*/;
90 %put E3969440A681A2408885998500000007;
E3969440A681A2408885998500000007
91
10 The SAS System 08:39 Friday, August 2, 2019
92 data OUT.via_sas_test_prob;
93 length 'numeric_var'n 8 'char_var'n $68;
94 infile datalines delimiter='03'x DSD STOPOVER;
95 input @;
96 if infile = '' then delete;
97 input 'numeric_var'n 'char_var'n ;
98 datalines4;
11 The SAS System 08:39 Friday, August 2, 2019
NOTE: Invalid data for 'numeric_var'n in line 103 1-14.
ERROR: INPUT statement exceeded record length. INFILE CARDS OPTION STOPOVER specified.
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
103 people putting
'numeric_var'n=. 'char_var'n= ERROR=1 INFILE=people putting N=5
NOTE: DATA statement used (Total process time):
real time 0.18 seconds
cpu time 0.04 seconds
NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set OUT.VIA_SAS_TEST_PROB may be incomplete. When this step was stopped there were 3 observations and 2 variables.
108 ;;;;
109
12 The SAS System 08:39 Friday, August 2, 2019
110 ;';";/;
111 run;
112
113 ;';";/;
114 %put E3969440A681A2408885998500000008;
E3969440A681A2408885998500000008
115
13 The SAS System 08:39 Friday, August 2, 2019
116 ;';";*/;
117 data null; e = exist('OUT.via_sas_test_prob');
118 v = exist('OUT.via_sas_test_prob', 'VIEW');
119 if e or v then e = 1;
120 te='TABLE_EXISTS='; put te e;run;
TABLE_EXISTS= 1
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
cpu time 0.03 seconds
121
122
123 ;';";*/;
124 %put E3969440A681A2408885998500000009;
E3969440A681A2408885998500000009
125
SAS Connection terminated. Subprocess id was 1376
Desktop (please complete the following information):
Additional context
-I have not tested any other escape sequences.
-It may be significant that the newline sequence is in the last column of the dataset.
-I have had cases where no data was transferred to a sas table (rather than just some of the data being transferred). I cannot reproduce this reliably but it seems to be related to the fact that the code above will overwrite an existing sas dataset when there are no newline sequences but not when there are (but it will create a new sas dataset).
-The examples above relate to cases where the newline sequence is introduced in excel (by using ctrl+enter. I get the same behaviour by explicitly entering \n. I don't get the behaviour if I use double backslash.
-I am not submitting this because I need it fixed right now (I have dealt with the issue in my current workflow by sanitizing my inputs, which I should have been doing anyways), but because it ssems like the long-term goal of saspy is that it should be able to reproduce what can be done using sas, which is not the case here.
-The log seems to suggest what throws the error is the fact that the 2nd line of the problematic cell is being treated as the first column of the following row, which is supposed to be numeric. The exact outcome might depend on the layout of the file as a whole.
The text was updated successfully, but these errors were encountered: