Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dataframe2sasdata fails with newline character in data #251

Closed
abbottkr opened this issue Aug 2, 2019 · 15 comments
Closed

dataframe2sasdata fails with newline character in data #251

abbottkr opened this issue Aug 2, 2019 · 15 comments

Comments

@abbottkr
Copy link

abbottkr commented Aug 2, 2019

Describe the bug
when using dataframe2sasdata, there is a failure when there are cells with newline characters. The output sas7bdat file only contains records up to the newline character (i.e., the cell with the newline character appears but only the part of the cell before the newline character).

To Reproduce
I will show some workflows that start with excel because that is where I discovered the issue (I do not have excel in my workflow, or newlines in my data, by choice). Note that the issue does not seem to be an excel issue, this is just a way to create a reproducible sequence.

  1. Start with two excel files as below.
    1.a. problematic_data.xlsx looks like (note there are newline sequences after 'are', 'putting', & 'characters' created usinf ctrl+enter in excel):
    problematic data

1.b. no_prob_data.xlsx looks like:
no_prob data

  1. confirm that sas can import the problematic data by running (the output is fine as shown in Expected Behaviour 1):

libname test "path\to\output";
PROC IMPORT OUT = test.direct_data
DATAFILE = "path\to\problematic_data.xlsx"
DBMS = XLSX REPLACE;
GETNAMES=YES;
RUN;

  1. run the following python script to draw problematic and non_problematic data into a pandas df and then send to sas7bdat files via saspy (dfs are shown in Expected Behaviour 2 and 3 to demonstrate that the data makes it to pandas, sas7bdat files are shown in Expected Behaviour 4 and 5 to show that the output without newline sequence is fine but the data with newlines is not, the log is given in Expected Behaviour 6):

import pandas as pd
import saspy
path_no_prob = r'path/to/no_prob_data.xlsx'
no_prob_data = pd.read_excel(path_no_prob,sheet_name='Sheet1')
path_problematic = r'path/to/problematic_data.xlsx'
problematic_data = pd.read_excel(path_problematic,sheet_name='Sheet1')
sas = saspy.SASsession()
sas.saslib('OUT','"path\to\output"')
sas.dataframe2sasdata(df=no_prob_data,table='via_sas_test_no_prob',libref='OUT')
sas.dataframe2sasdata(df=problematic_data,table='via_sas_test_prob',libref='OUT')
print(sas.saslog())
sas._endsas()

Expected behavior

  1. sas can directly import data with newline sequences (from To Reproduce 2.):
    EB1

  2. pandas can import data without newline sequences ...:
    EB2

  3. .... as well as data with newline sequences (both from to Reproduce 3.):
    EB3

  4. sas7bdat from saspy without newline sequences is fine ...:
    EB4

  5. ... but sas7bdat from saspy with newline sequences is truncated, both in terms of rows and in terms of what is is the cell that had the newline sequences (both from to Reproduce 3.):
    EB5

  6. full log
    3 The SAS System 08:39 Friday, August 2, 2019

19
20 libname OUT "H:/Projects and scripts/saspy bug" ;
NOTE: Libref OUT was successfully assigned as follows:
Engine: V9
Physical Name: H:\Projects and scripts\saspy bug
21
22
1 The SAS System 08:39 Friday, August 2, 2019

NOTE: Copyright (c) 2002-2012 by SAS Institute Inc., Cary, NC, USA.
NOTE: SAS (r) Proprietary Software 9.4 (TS1M3)
Licensed to STATISTICS CANADA - PC/SERVER, Site 70186445.
NOTE: This session is executing on the X64_7PRO platform.

NOTE: Updated analytical products:

  SAS/STAT 14.1
  SAS/ETS 14.1
  SAS/OR 14.1
  SAS/IML 14.1
  SAS/QC 14.1

NOTE: Additional host information:

X64_7PRO WIN 6.1.7601 Service Pack 1 Workstation

NOTE: SAS Initialization used (Total process time):
real time 0.03 seconds
cpu time 0.01 seconds

1 ;';";/;
2 options svgtitle='svgtitle'; options validvarname=any pagesize=max linesize=max nosyntaxcheck; ods graphics on;
3
4 ;
';";/;
5 %put E3969440A681A2408885998500000001;
E3969440A681A2408885998500000001
6
2 The SAS System 08:39 Friday, August 2, 2019

7 ;';";/;
8
9 %put WORKPATH=%sysfunc(pathname(work));
WORKPATH=C:\Users\abbokev\AppData\Local\Temp\SAS Temporary Files_TD6312_C232160_\Prc2
10 %put ENCODING=&SYSENCODING;
ENCODING=wlatin1
11 %put SYSVLONG=&SYSVLONG4;
SYSVLONG=9.04.01M3P06242015
12 %put SYSJOBID=&SYSJOBID;
SYSJOBID=6312
13 %put SYSSCP=&SYSSCP;
SYSSCP=WIN
14
15
16 ;
';";/;
17 %put E3969440A681A2408885998500000002;
E3969440A681A2408885998500000002
18
3 The SAS System 08:39 Friday, August 2, 2019

19 ;';";/;
20 libname OUT "H:/Projects and scripts/saspy bug" ;
NOTE: Libref OUT was successfully assigned as follows:
Engine: V9
Physical Name: H:\Projects and scripts\saspy bug
21
22 ;
';";/;
23 %put E3969440A681A2408885998500000003;
E3969440A681A2408885998500000003
24
4 The SAS System 08:39 Friday, August 2, 2019

25 ;';";*/;
26
27 data null; retain libref; retain cobs 1;
28 set sashelp.vlibnam end=last;
29 if cobs EQ 1 then
30 put "LIBREFSSTART";
31 cobs = 2;
32 if libref NE libname then
33 put "LIBREF=" libname;
34 libref = libname;
35 if last then
36 put "LIBREFSEND";
37 run;

LIBREFSSTART
LIBREF=WORK
LIBREF=OUT
LIBREF=SASHELP
LIBREF=MAPS
LIBREF=MAPSSAS
LIBREF=MAPSGFK
LIBREF=SASUSER
LIBREFSEND
NOTE: There were 24 observations read from the data set SASHELP.VLIBNAM.
NOTE: DATA statement used (Total process time):
real time 0.07 seconds
cpu time 0.07 seconds

38
39
40 ;';";*/;
41 %put E3969440A681A2408885998500000004;
E3969440A681A2408885998500000004
42
5 The SAS System 08:39 Friday, August 2, 2019

43 data OUT.via_sas_test_no_prob;
44 length 'numeric_var'n 8 'char_var'n $65;
45 infile datalines delimiter='03'x DSD STOPOVER;
46 input @;
47 if infile = '' then delete;
48 input 'numeric_var'n 'char_var'n ;
49 datalines4;

6 The SAS System 08:39 Friday, August 2, 2019

NOTE: DATA statement used (Total process time):
real time 0.10 seconds
cpu time 0.03 seconds

NOTE: The data set OUT.VIA_SAS_TEST_NO_PROB has 5 observations and 2 variables.

56 ;;;;
57
7 The SAS System 08:39 Friday, August 2, 2019

58 ;';";/;
59 run;
60
61 ;
';";/;
62 %put E3969440A681A2408885998500000005;
E3969440A681A2408885998500000005
63
8 The SAS System 08:39 Friday, August 2, 2019

64 ;';";*/;
65 data null; e = exist('OUT.via_sas_test_no_prob');
66 v = exist('OUT.via_sas_test_no_prob', 'VIEW');
67 if e or v then e = 1;
68 te='TABLE_EXISTS='; put te e;run;

TABLE_EXISTS= 1
NOTE: DATA statement used (Total process time):
real time 0.04 seconds
cpu time 0.03 seconds

69
70
71 ;';";*/;
72 %put E3969440A681A2408885998500000006;
E3969440A681A2408885998500000006
73
9 The SAS System 08:39 Friday, August 2, 2019

74 ;';";*/;
75
76 data null; retain libref; retain cobs 1;
77 set sashelp.vlibnam end=last;
78 if cobs EQ 1 then
79 put "LIBREFSSTART";
80 cobs = 2;
81 if libref NE libname then
82 put "LIBREF=" libname;
83 libref = libname;
84 if last then
85 put "LIBREFSEND";
86 run;

LIBREFSSTART
LIBREF=WORK
LIBREF=OUT
LIBREF=SASHELP
LIBREF=MAPS
LIBREF=MAPSSAS
LIBREF=MAPSGFK
LIBREF=SASUSER
LIBREFSEND
NOTE: There were 24 observations read from the data set SASHELP.VLIBNAM.
NOTE: DATA statement used (Total process time):
real time 0.04 seconds
cpu time 0.04 seconds

87
88
89 ;';";*/;
90 %put E3969440A681A2408885998500000007;
E3969440A681A2408885998500000007
91
10 The SAS System 08:39 Friday, August 2, 2019

92 data OUT.via_sas_test_prob;
93 length 'numeric_var'n 8 'char_var'n $68;
94 infile datalines delimiter='03'x DSD STOPOVER;
95 input @;
96 if infile = '' then delete;
97 input 'numeric_var'n 'char_var'n ;
98 datalines4;

11 The SAS System 08:39 Friday, August 2, 2019

NOTE: Invalid data for 'numeric_var'n in line 103 1-14.
ERROR: INPUT statement exceeded record length. INFILE CARDS OPTION STOPOVER specified.
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
103 people putting
'numeric_var'n=. 'char_var'n= ERROR=1 INFILE=people putting N=5
NOTE: DATA statement used (Total process time):
real time 0.18 seconds
cpu time 0.04 seconds

NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set OUT.VIA_SAS_TEST_PROB may be incomplete. When this step was stopped there were 3 observations and 2 variables.
108 ;;;;

109
12 The SAS System 08:39 Friday, August 2, 2019

110 ;';";/;
111 run;
112
113 ;
';";/;
114 %put E3969440A681A2408885998500000008;
E3969440A681A2408885998500000008
115
13 The SAS System 08:39 Friday, August 2, 2019

116 ;';";*/;
117 data null; e = exist('OUT.via_sas_test_prob');
118 v = exist('OUT.via_sas_test_prob', 'VIEW');
119 if e or v then e = 1;
120 te='TABLE_EXISTS='; put te e;run;

TABLE_EXISTS= 1
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
cpu time 0.03 seconds

121
122
123 ;';";*/;
124 %put E3969440A681A2408885998500000009;
E3969440A681A2408885998500000009
125

SAS Connection terminated. Subprocess id was 1376

Desktop (please complete the following information):

  • OS: Windows 7
  • SAS Version: 9.4
  • SASPy Version: 2.4.3

Additional context
-I have not tested any other escape sequences.
-It may be significant that the newline sequence is in the last column of the dataset.
-I have had cases where no data was transferred to a sas table (rather than just some of the data being transferred). I cannot reproduce this reliably but it seems to be related to the fact that the code above will overwrite an existing sas dataset when there are no newline sequences but not when there are (but it will create a new sas dataset).
-The examples above relate to cases where the newline sequence is introduced in excel (by using ctrl+enter. I get the same behaviour by explicitly entering \n. I don't get the behaviour if I use double backslash.
-I am not submitting this because I need it fixed right now (I have dealt with the issue in my current workflow by sanitizing my inputs, which I should have been doing anyways), but because it ssems like the long-term goal of saspy is that it should be able to reproduce what can be done using sas, which is not the case here.
-The log seems to suggest what throws the error is the fact that the 2nd line of the problematic cell is being treated as the first column of the following row, which is supposed to be numeric. The exact outcome might depend on the layout of the file as a whole.

@tomweber-sas
Copy link
Contributor

@abbottkr I have to say, thank you for such a thorough bug report! That is excellent!

I've been playing around with this a bit and yes, embedded newlines just seem to wreak havoc trying to get them into a data set. I wasn't able to get them via control-enter in excel, but I just put them in via python and saw the problems.

I tried having that column be quoted and setting keep_outer_quotes=True on df2sd() but that didn't help at all. I tried exporting the data frame as csv and reading it w/ SAS and that didn't work right either. The data step I generate for the reverse of this, allows for specifying TERMSTR= where I can change from LF to some other byte, but the data step I generate for this doesn't support that; it uses the newline to means end of input row.

Here's my musings:

>>> x = [['boo'], ['voo'], ['goo\nfoo\nzoo'], ['who']]
>>> df = pd.DataFrame.from_records(x)
>>> df
               0
0            boo
1            voo
2  goo\nfoo\nzoo
3            who
>>>
>>>
>>>
>>> sd = sas.df2sd(df)
>>> sd.head()
     0
0  boo
1  voo
2  goo
3  foo
4  zoo
>>> df
               0
0            boo
1            voo
2  goo\nfoo\nzoo
3            who
>>> z = [['boo'], ['voo'], ["'goo\nfoo\nzoo'"], ['who']]
>>>
>>> df2 = pd.DataFrame.from_records(z)
>>> sd2 = sas.df2sd(df2, keep_outer_quotes=False)
>>> sd2.head(10)
      0
0   boo
1   voo
2  'goo
3   foo
4  zoo'
5   who
>>> sd2 = sas.df2sd(df2, keep_outer_quotes=True)
>>> sd2.head(10)
      0
0   boo
1   voo
2  'goo
3   foo
4  zoo'
5   who

>>> df2
                 0
0              boo
1              voo
2  'goo\nfoo\nzoo'
3              who
>>> df
               0
0            boo
1            voo
2  goo\nfoo\nzoo
3            who
>>> df.to_csv('df1')
>>> df2.to_csv('df2')
>>>
>>> sas1 = sas.read_csv('df1')
>>> sas1.head(10)
   VAR1     0
0     0   boo
1     1   voo
2     2  "goo
3   foo   NaN
4  zoo"   NaN
5     3   who
>>> sas2 = sas.read_csv('df2')
>>> sas2.head(10)
    VAR1      0
0      0    boo
1      1    voo
2      2  "'goo
3    foo    NaN
4  zoo'"    NaN
5      3    who
>>> fd = open('df1','r')
>>> fd.read()
',0\n0,boo\n1,voo\n2,"goo\nfoo\nzoo"\n3,who\n'
>>> fd.close()
>>> fd = open('df2','r')
>>> fd.read()
',0\n0,boo\n1,voo\n2,"\'goo\nfoo\nzoo\'"\n3,who\n'
>>> fd.close()
>>> df2
                 0
0              boo
1              voo
2  'goo\nfoo\nzoo'
3              who
>>> df
               0
0            boo
1            voo
2  goo\nfoo\nzoo
3            who
>>> pd.read_csv('df1')
   Unnamed: 0              0
0           0            boo
1           1            voo
2           2  goo\nfoo\nzoo
3           3            who
>>> pd.read_csv('df2')
   Unnamed: 0                0
0           0              boo
1           1              voo
2           2  'goo\nfoo\nzoo'
3           3              who

I expect that I can come up with a way to get this to work, but I'll have to investigate that some more. At least I can easily reproduce this, so that's good.

To speak to your Additional context statements (in positional order):

  • not sure, but LF is certainly the worst case, I expect
  • I believe the LF anywhere causes a problem; probably presents different depending upon where and how many columns
  • yeah, I suppose that's due to the data step getting an error, instead of happening to succeed with bad data
  • hey, I see that work in my case too, hmmm:
>>> y = [['boo'], ['voo'], ["goo\\nfoo\\nzoo"], ['who']]
>>> dfy = pd.DataFrame.from_records(y)
>>> dfy
               0
0            boo
1            voo
2  goo\nfoo\nzoo
3            who
>>> sasy = sas.df2sd(dfy)
>>> sasy.head(10)
               0
0            boo
1            voo
2  goo\nfoo\nzoo
3            who
>>>
  • I appreciate that you are cleansing you data and this isn't a show stopper for you. Thanks. I will look into this further though, as I take these things as a personal challenge - it's just my way :) And, regarding the goal of saspy, yes, it is really the same of that of SAS itself, which is to help customers succeed in what they are trying to accomplish, which isn't just using saspy or SAS for the sake of using it. You use it as a tool to do something bigger.
  • yes, like the third one, above, this will vary depending upon the layout of the data ..., as these newlines are throwing off the parsing, so it depends upon where and when that happens.

I may not have this solved today, but I will investigate this further and see what I can do to address it.

Thanks again for entering this, and for such a thorough write up. I do appreciate that!
Tom

@tomweber-sas
Copy link
Contributor

So, seeing that the escaped (\\n instead of \n) appeared to work, I tweaked 2 lines of code to replace \n with \\n when transferring the data to SAS and that get this to work:

tom64-5> python3.5
Python 3.5.6 (default, Nov 16 2018, 15:50:39)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-23)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import saspy
>>> sas = saspy.SASsession(cfgname='sdssas')
SAS Connection established. Subprocess id is 30631

No encoding value provided. Will try to determine the correct encoding.
Setting encoding to iso8859_15 based upon the SAS session encoding value of latin9.
>>> import pandas as pd
>>> x = [['boo'], ['voo'], ['goo\nfoo\nzoo'], ['who']]
>>> df = pd.DataFrame.from_records(x)
>>> df
               0
0            boo
1            voo
2  goo\nfoo\nzoo
3            who
>>> sd = sas.df2sd(df)
>>> sd.head(10)
               0
0            boo
1            voo
2  goo\nfoo\nzoo
3            who
>>>

Now, this is a quick tweak, so I'm not about to push this out until I investigate further and identify if this is really a good way to do this, or if this is still only part of a larger issue that need to be addressed as well. Bullet item 1: -I have not tested any other escape sequences.

But, for a Friday afternoon, I'm happy to have a potential easy fix for this. I'll continue on this and have something next week.

Thanks!
Tom

@tomweber-sas
Copy link
Contributor

No, can't believe I fell for that. This isn't as it looks. That's just a trick of the display formatting. escaping didn't actually make it work, it changed it from a line feed (x'10') to 2 characters: a backslashes and the letter 'n', duh. Can't trust what's displayed on the screen when looking at \n.
So, I have to guess that's what you actually got too.
I'll dig deeper next week.

>>> sd.to_df().at[2,'0']
'goo\\nfoo\\nzoo'
>>> sd.to_df().at[2,'0'].encode().hex()
'676f6f5c6e666f6f5c6e7a6f6f'
>>>
>>> b'\\n'.hex()
'5c6e'
>>> b'\n'.hex()
'0a'

My bad (long week),
Tom

@tomweber-sas
Copy link
Contributor

I've been exploring variations on data step input (https://go.documentation.sas.com/?docsetId=lrcon&docsetTarget=n1w749t788cgi2n1txpuccsuqtro.htm&docsetVersion=9.4&locale=en) to see if there was a different way I could code this which would allow for embedded LF in a column. But, I haven't been successful. In every case, the newline triggers the next row.
I could add an option to convert embedded LF to Blanks, so that you wouldn't have to do this ahead of time, and the data set would be valid. But I'm not seeing a way to keep them embedded in the column.

Tom

@abbottkr
Copy link
Author

Sorry I haven't been responding to these ... I've been in the backcountry for the last week. I'm not familiar enough with SAS to suggest anything regarding your attempts but there was one other potential clue that I had meant to mention (which is related to your comment about trying to play around with enclosing quotes.

if you copy cells from char_var to notepad, most are not enclosed in double quotes. The cell with the newline that was taken directly into sas is, however, enclosed in double quotes which presumably is why it works. the same cell going through saspy (the cell that is truncated) is not enclosed in double quotes.

I doubt this helps given that you've already played around with this.

Thanks again for putting effort on this.

@FriedEgg
Copy link
Contributor

@tomweber-sas the best option to use is TERMSTR= which could be set to CRLF for example and in turn allow embedded LF in the column.

@tomweber-sas
Copy link
Contributor

@abbottkr no problem, that sounds like fun. I'll look at this some more as I have some time too. @FriedEgg TERMSTR isn't available using the infile and datalines like it is in the reverse (sd2df) where I have file and put.
I'll check with some folks around here that are better data step programmers than I am too!
Tom

@FriedEgg
Copy link
Contributor

@tomweber-sas, I guess I should've looked at the code... I had assumed it was reading the data over socket. There are still data step solutions, made easier if the values are quoted.

@tomweber-sas
Copy link
Contributor

Hey @FriedEgg no problem. Yeah, for this method on all of my access methods, I just stream the data across stdin via datalines, so I don't need to stage the data in a file (IOM and HTTP) then read it w/ the data step (where I could use termstr), or for STDIO, use a socket (which also can use terstr). But I can find no way to make a newline not mean next row for the data step via stdin.

So, I was about to look at adding a method to write to a file then read that w/ termstr, when I finally figured it out :)

I had mentioned before that I could convert these to blanks, so they loaded cleanly at least. Well, I can convert them to some other char and then convert them back in the data step which won't trip up the data step parser and that will work just fine. I basically have this in the reverse, sd2df, where I use hex 1, 2 by default for column and row separators so I can get newlines across. I use hex 3 in sd2df already as a column sep, so embedded commas or blanks, can be in columns; I'm using delimited variable length columns for df2sd, instead of bloating it to fixed length padded columns.

So, I've made a change to make CR and LF get converted to hex 1,2 and then in the data step convert them back, if you specify embedded_newlines=True on df2sd. If not, False is the default (same as now) behavior.

@abbottkr can you give this a try and verify it works for your case, which I couldn't exactly reproduce?

This code is in a branch called newlines. Can you pull that branch and try it out?

Here's the case I was using. You can see below with and without the flag:
z = sas.df2sd(df)
z = sas.df2sd(df, embedded_newlines=True)

tom64-5> python3.5
Python 3.5.6 (default, Nov 16 2018, 15:50:39)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-23)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>>
>>> x = [['boo', 'voo', 'goo\nfoo\rzoo', 'who\r\ntoo'],
...      ['boo', 'voo', 'goo\nfoo\rzoo', 'who\r\ntoo'],
...      ['boo', 'voo', 'goo\nfoo\rzoo', 'who\r\ntoo'],
...      ['boo', 'voo', 'goo\nfoo\rzoo', 'who\r\ntoo'],
...     ]
>>>
>>> import pandas as pd
>>> df = pd.DataFrame.from_records(x, columns=['a','b','c','d'])
>>> df
     a    b              c           d
0  boo  voo  goo\nfoo\rzoo  who\r\ntoo
1  boo  voo  goo\nfoo\rzoo  who\r\ntoo
2  boo  voo  goo\nfoo\rzoo  who\r\ntoo
3  boo  voo  goo\nfoo\rzoo  who\r\ntoo
>>>
>>> import saspy

>>> sas = saspy.SASsession()
Please enter the name of the SAS Config you wish to run. Available Configs are: ['default', 'SASgrid', 'http', 'httptest', 'ssh', 'sshtun', 'sshrtun', 'httpfred', 'grid', 'tdi', 'tdilat', 'iomj', 'iomc', 'iomjwin', 'winiomj', 'winiomjwin', 'winlocal', 'gridiom', 'wingridiom', 'zos', 'zos2', 'winzos', 'winzos2', 'sshtest', 'sshloc', 'sdssas', 'saskr', 'vb010', 'vb015', 'pune', 'notpune', 'iomkr', 'httpviya', 'kevin', 'issue176', 'httpjason', 'itviya', 'carrie1', 'iomcom', 'iomcom_loc'] iomj
SAS Connection established. Subprocess id is 11077

No encoding value provided. Will try to determine the correct encoding.
Setting encoding to utf_8 based upon the SAS session encoding value of utf-8.
>>> z = sas.df2sd(df)
>>> z.head()
Empty DataFrame
Columns: [a, b, c, d]
Index: []
>>>
>>> s = sas.sasdata('_df', results='text')
>>> s.head()

>>>
>>>
>>> z = sas.df2sd(df, embedded_newlines=True)
>>> z.head()
     a    b              c           d
0  boo  voo  goo\nfoo\rzoo  who\r\ntoo
1  boo  voo  goo\nfoo\rzoo  who\r\ntoo
2  boo  voo  goo\nfoo\rzoo  who\r\ntoo
3  boo  voo  goo\nfoo\rzoo  who\r\ntoo
>>>
>>> s = sas.sasdata('_df', results='text')
>>> s.head()

                                                           The SAS System                          13:53 Friday, August 16, 2019   1

                                            Obs     a      b          c            d

                                             1     boo    voo    goo
zoo    who
too
                                             2     boo    voo    goo
zoo    who
too
                                             3     boo    voo    goo
zoo    who
too
                                             4     boo    voo    goo
zoo    who
too

>>>

@abbottkr
Copy link
Author

Hey @tomweber-sas. Unfortunately, I don't think I can test it as I'm working in an environment where I have to put in requests to be able to install packages and I don't think they allow anything that isn't a formal package on pypi. That said, I think the only differences between our cases was whether or not the df was read in from excel so this looks promising. Is there any other way I can help test it?

@tomweber-sas
Copy link
Contributor

Yeah, no problem, I understand. I went ahead and built a new release with this. I'm pretty confident that this will address your case as well as 'it works for me' :) I like to have it verified for the original case, especially when I can't exactly reproduce it, but this is pretty straight forward, so I believe it's good.

Version 3.1.3, which is now out on PyPI has this, so you can install that to try this out.

I also added overrides for the surrogate character I use for LF and CR to get them to SAS, but you shouldn't need those. You just need to add embedded_newlines=True as a parm to df2ds() and it ought to just work.

I almost left off embedded_newlines and just did it by default as it will generally fail without doing this, but as this isn't a usual case didn't want to impose the extra (though fairly minimal) processing for all other cases. You're the first to bring up this issue, so it seemed to be the unusual case. Easy to add a boolean parm when needed.

Let me know what you find when you get a chance.

Thanks!
Tom

@abbottkr
Copy link
Author

Hey @tomweber-sas, sorry for the slow response. There is even a delay getting published PyPi packages. Unfortunately, I'm getting the same flawed output. For some reason, the version our repository manager uploaded was 3.1.4, not the 3.1.3 I requested. Does that make a difference?

For reference, the script I was using (just focusing on the inputs that generate flawed outputs) was

import pandas as pd
import saspy
path_problematic = r'H:/Projects and scripts/saspy bug/problematic_data.xlsx'
problematic_data = pd.read_excel(path_problematic,sheet_name='Sheet1')
sas = saspy.SASsession()
sas.saslib('OUT','"H:/Projects and scripts/saspy bug"')
sas.dataframe2sasdata(df=problematic_data,table='via_sas_test_prob',libref='OUT')
print(sas.saslog())
sas._endsas()

For reference, the full output to the console was:

Using SAS Config named: autogen_winlocal
SAS Connection established. Subprocess id is 9468

3 The SAS System 14:46 Tuesday, August 27, 2019

21
22 libname OUT "H:/Projects and scripts/saspy bug" ;
NOTE: Libref OUT was successfully assigned as follows:
Engine: V9
Physical Name: H:\Projects and scripts\saspy bug
23
24
1 The SAS System 14:46 Tuesday, August 27, 2019

NOTE: Copyright (c) 2002-2012 by SAS Institute Inc., Cary, NC, USA.
NOTE: SAS (r) Proprietary Software 9.4 (TS1M3)
Licensed to STATISTICS CANADA - PC/SERVER, Site 70186445.
NOTE: This session is executing on the X64_7PRO platform.

NOTE: Updated analytical products:

  SAS/STAT 14.1
  SAS/ETS 14.1
  SAS/OR 14.1
  SAS/IML 14.1
  SAS/QC 14.1

NOTE: Additional host information:

X64_7PRO WIN 6.1.7601 Service Pack 1 Workstation

NOTE: SAS Initialization used (Total process time):
real time 0.03 seconds
cpu time 0.01 seconds

1 ;';";/;
2 options svgtitle='svgtitle'; options validvarname=any pagesize=max nosyntaxcheck; ods graphics on;
3
4 ;
';";/;
5 %put E3969440A681A2408885998500000001;
E3969440A681A2408885998500000001
6
2 The SAS System 14:46 Tuesday, August 27, 2019

7 ;';";/;
8 options nosource;
WORKPATH=C:\Users\abbokev\AppData\Local\Temp\SAS Temporary Files_TD8196_C232160_\Prc2
ENDWORKPATH=
ENCODING=wlatin1
SYSVLONG=9.04.01M3P06242015
SYSJOBID=8196
SYSSCP=WIN
16
17
18 ;
';";/;
19 %put E3969440A681A2408885998500000002;
E3969440A681A2408885998500000002
20
3 The SAS System 14:46 Tuesday, August 27, 2019

21 ;';";/;
22 libname OUT "H:/Projects and scripts/saspy bug" ;
NOTE: Libref OUT was successfully assigned as follows:
Engine: V9
Physical Name: H:\Projects and scripts\saspy bug
23
24 ;
';";/;
25 %put E3969440A681A2408885998500000003;
E3969440A681A2408885998500000003
26
4 The SAS System 14:46 Tuesday, August 27, 2019

27 ;';";*/;
28
29 data null; retain libref; retain cobs 1;
30 set sashelp.vlibnam end=last;
31 if cobs EQ 1 then
32 put "LIBREFSSTART";
33 cobs = 2;
34 if libref NE libname then
35 put "LIBREF=" libname;
36 libref = libname;
37 if last then
38 put "LIBREFSEND";
39 run;

LIBREFSSTART
LIBREF=WORK
LIBREF=OUT
LIBREF=SASHELP
LIBREF=MAPS
LIBREF=MAPSSAS
LIBREF=MAPSGFK
LIBREF=SASUSER
LIBREFSEND
NOTE: There were 24 observations read from the data set SASHELP.VLIBNAM.
NOTE: DATA statement used (Total process time):
real time 0.09 seconds
cpu time 0.09 seconds

40
41
42 ;';";*/;
43 %put E3969440A681A2408885998500000004;
E3969440A681A2408885998500000004
44
5 The SAS System 14:46 Tuesday, August 27, 2019

45 data OUT.via_sas_test_prob;
46 length 'numeric_var'n 8 'char_var'n $68;
47 infile datalines delimiter='03'x DSD STOPOVER;
48 input @;
49 if infile = '' then delete;
50 input 'numeric_var'n 'char_var'n ;
51 ;
52 datalines4;

6 The SAS System 14:46 Tuesday, August 27, 2019

NOTE: Invalid data for 'numeric_var'n in line 57 1-14.
ERROR: INPUT statement exceeded record length. INFILE CARDS OPTION STOPOVER specified.
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
57 people putting
'numeric_var'n=. 'char_var'n= ERROR=1 INFILE=people putting N=5
NOTE: DATA statement used (Total process time):
real time 0.15 seconds
cpu time 0.06 seconds

NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set OUT.VIA_SAS_TEST_PROB may be incomplete. When this step was stopped there were 3 observations and 2
variables.
62 ;;;;

63
7 The SAS System 14:46 Tuesday, August 27, 2019

64 ;';";/;
65 run;
66
67 ;
';";/;
68 %put E3969440A681A2408885998500000005;
E3969440A681A2408885998500000005
69
8 The SAS System 14:46 Tuesday, August 27, 2019

70 ;';";*/;
71 data null; e = exist('OUT.via_sas_test_prob');
72 v = exist('OUT.via_sas_test_prob', 'VIEW');
73 if e or v then e = 1;
74 te='TABLE_EXISTS='; put te e;run;

TABLE_EXISTS= 1
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds

75
76
77 ;';";*/;
78 %put E3969440A681A2408885998500000006;
E3969440A681A2408885998500000006
79

SAS Connection terminated. Subprocess id was 9468

@tomweber-sas
Copy link
Contributor

Hey, no problem, I know how it goes. No, 3.1.4 is fine - I just published that an hour or two ago :)
So, that's good. You just need to add the option to say to generate the extra code to get the newlines transferred over. Can you submit this:

sas.dataframe2sasdata(df=problematic_data,table='via_sas_test_prob',libref='OUT', embedded_newlines=True)

and see that that get's the data over correctly?

Thanks!
Tom

@abbottkr
Copy link
Author

ach, sorry I missed that addition. Yup, that works perfectly. Well done. do I close the issue or do you?

Thanks for your work on this.

@tomweber-sas
Copy link
Contributor

Great! yeah, no problem. I'll go ahead and close this, since you're good with it and it's working. If you have anything else, just open another ticket! Glad to get this fixed!

Thanks!
Tom

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants