Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exporting SAS data to CSV? #128

Closed
moshekaplan opened this issue Apr 17, 2018 · 49 comments
Closed

Exporting SAS data to CSV? #128

moshekaplan opened this issue Apr 17, 2018 · 49 comments

Comments

@moshekaplan
Copy link
Contributor

moshekaplan commented Apr 17, 2018

Is there a recommended way to export a SAS data to a CSV file on the system running Python? The only relevant method appeared to be SASdata.to_csv, which writes it to the SAS system's disk.

@jasonphillips
Copy link
Contributor

Unless I'm misreading your question, I believe you'd simply want to convert the SAS data set into a (pandas) DataFrame and write to csv that way.

Eg:

frame = SASdata.to_df() # data is now local to Python
frame.to_csv(...) # with your path, options 

@moshekaplan
Copy link
Contributor Author

moshekaplan commented Apr 17, 2018

I attempted to run the following commands:

session=saspy.SASsession(..) # IOM connection
data = session.sasdata(table_name, lib_name) 
frame = data.to_df()

When it tries to execute frame = data.to_df() it raises an error:
AssertionError: 37 columns passed, passed data had 40 columns

@tomweber-sas
Copy link
Contributor

@moshekaplan what version of saspy do you have? You can submit your SASsession object and that will provide the info. And, I'd like to see what the data looks like. Can you submit the following:

session
data.contents()

Thanks!
Tom

@moshekaplan
Copy link
Contributor Author

moshekaplan commented Apr 17, 2018

(Pdb) session

Access Method         = IOM
SAS Config name       = MY_CONFIG
WORK Path             = S:\SANITIZED_PATH\Prc2\
SAS Version           = 9.04.01M2P07232014
SASPy Version         = 2.2.4
Teach me SAS          = False
Batch                 = False
Results               = Pandas
SAS Session Encoding  = WLATIN1
Python Encoding value = latin1

(Pdb) data.contents()

{'Sortedby':                             Member         Label1  \
0  SASUSER.MOSHE_TEMP_TABLE_TEST_2       Sortedby
1  SASUSER.MOSHE_TEMP_TABLE_TEST_2      Validated
2  SASUSER.MOSHE_TEMP_TABLE_TEST_2  Character Set

                           cValue1  nValue1
0  Event_Date Event_Time Event_Seq      NaN
1                              YES      NaN
2                             ANSI      NaN  , 'Attributes':                             Member               Label1  \
0  SASUSER.MOSHE_TEMP_TABLE_TEST_2        Data Set Name
1  SASUSER.MOSHE_TEMP_TABLE_TEST_2          Member Type
2  SASUSER.MOSHE_TEMP_TABLE_TEST_2               Engine
3  SASUSER.MOSHE_TEMP_TABLE_TEST_2              Created
4  SASUSER.MOSHE_TEMP_TABLE_TEST_2        Last Modified
5  SASUSER.MOSHE_TEMP_TABLE_TEST_2           Protection
6  SASUSER.MOSHE_TEMP_TABLE_TEST_2        Data Set Type
7  SASUSER.MOSHE_TEMP_TABLE_TEST_2                Label
8  SASUSER.MOSHE_TEMP_TABLE_TEST_2  Data Representation
9  SASUSER.MOSHE_TEMP_TABLE_TEST_2             Encoding

                           cValue1       nValue1                 Label2  \
0  SASUSER.MOSHE_TEMP_TABLE_TEST_2           NaN           Observations
1                             DATA           NaN              Variables
2                               V9           NaN                Indexes
3              04/17/2018 15:41:57  1.839599e+09     Observation Length
4              04/17/2018 15:41:57  1.839599e+09   Deleted Observations
5                                            NaN             Compressed
6                                            NaN            Reuse Space
7                                            NaN  Point to Observations
8                       WINDOWS_64           NaN                 Sorted
9       wlatin1  Western (Windows)           NaN

  cValue2  nValue2
0    8328   8328.0
1      37     37.0
2       0      0.0
3    3522   3522.0
4       0      0.0
5    CHAR      NaN
6      NO      NaN
7     YES      NaN
8     YES      NaN
9              0.0  , 'Enginehost':                             Member                      Label1  \
0  SASUSER.MOSHE_TEMP_TABLE_TEST_2          Data Set Page Size
1  SASUSER.MOSHE_TEMP_TABLE_TEST_2    Number of Data Set Pages
2  SASUSER.MOSHE_TEMP_TABLE_TEST_2  Number of Data Set Repairs
3  SASUSER.MOSHE_TEMP_TABLE_TEST_2            ExtendObsCounter
4  SASUSER.MOSHE_TEMP_TABLE_TEST_2                    Filename
5  SASUSER.MOSHE_TEMP_TABLE_TEST_2             Release Created
6  SASUSER.MOSHE_TEMP_TABLE_TEST_2                Host Created

                                             cValue1  nValue1
0                                              16384  16384.0
1                                                257    257.0
2                                                  0      0.0
3                                                YES      NaN
4  S:\SANITIZED_PATH\moshe_temp_table_te...      NaN
5                                           9.0401M2      NaN
6                                         X64_ES08R2      NaN  , 'Variables':                              Member  Num            Variable  Type   Len  \
0   SASUSER.MOSHE_TEMP_TABLE_TEST_2   19        SANITIZED_00   Num     6
1   SASUSER.MOSHE_TEMP_TABLE_TEST_2   21        SANITIZED_21  Char    75
2   SASUSER.MOSHE_TEMP_TABLE_TEST_2   22        SANITIZED_22  Char    75
3   SASUSER.MOSHE_TEMP_TABLE_TEST_2    7        SANITIZED_07  Char   100
4   SASUSER.MOSHE_TEMP_TABLE_TEST_2    1        SANITIZED_01  Char    10
5   SASUSER.MOSHE_TEMP_TABLE_TEST_2    3        SANITIZED_03   Num     8
6   SASUSER.MOSHE_TEMP_TABLE_TEST_2    2        SANITIZED_02  Char     8
7   SASUSER.MOSHE_TEMP_TABLE_TEST_2    6        SANITIZED_06  Char    39
8   SASUSER.MOSHE_TEMP_TABLE_TEST_2    9        SANITIZED_09  Char    10
9   SASUSER.MOSHE_TEMP_TABLE_TEST_2    5        SANITIZED_05  Char    39
10  SASUSER.MOSHE_TEMP_TABLE_TEST_2   10        SANITIZED_10  Char    39
11  SASUSER.MOSHE_TEMP_TABLE_TEST_2   15        SANITIZED_15  Char   100
12  SASUSER.MOSHE_TEMP_TABLE_TEST_2   37        SANITIZED_37  Char  1000
13  SASUSER.MOSHE_TEMP_TABLE_TEST_2   20        SANITIZED_20  Char     8
14  SASUSER.MOSHE_TEMP_TABLE_TEST_2   17        SANITIZED_17  Char     5
15  SASUSER.MOSHE_TEMP_TABLE_TEST_2   16        SANITIZED_16  Char    75
16  SASUSER.MOSHE_TEMP_TABLE_TEST_2   18        SANITIZED_18 Char    75
17  SASUSER.MOSHE_TEMP_TABLE_TEST_2   25        SANITIZED_25 Char  1000
18  SASUSER.MOSHE_TEMP_TABLE_TEST_2    8        SANITIZED_08 Char    50
19  SASUSER.MOSHE_TEMP_TABLE_TEST_2   24        SANITIZED_24 Char     6
20  SASUSER.MOSHE_TEMP_TABLE_TEST_2   23        SANITIZED_23 Char   500
21  SASUSER.MOSHE_TEMP_TABLE_TEST_2    4        SANITIZED_04 Char    20
22  SASUSER.MOSHE_TEMP_TABLE_TEST_2   14        SANITIZED_14 Char    20
23  SASUSER.MOSHE_TEMP_TABLE_TEST_2   12        SANITIZED_12  Num     8
24  SASUSER.MOSHE_TEMP_TABLE_TEST_2   32        SANITIZED_32 Char    39
25  SASUSER.MOSHE_TEMP_TABLE_TEST_2   33        SANITIZED_33  Num     8
26  SASUSER.MOSHE_TEMP_TABLE_TEST_2   34        SANITIZED_34  Num     4
27  SASUSER.MOSHE_TEMP_TABLE_TEST_2   35        SANITIZED_35  Num     6
28  SASUSER.MOSHE_TEMP_TABLE_TEST_2   36        SANITIZED_36  Num     6
29  SASUSER.MOSHE_TEMP_TABLE_TEST_2   27        SANITIZED_27 Char    39
30  SASUSER.MOSHE_TEMP_TABLE_TEST_2   28        SANITIZED_28  Num     8
31  SASUSER.MOSHE_TEMP_TABLE_TEST_2   29        SANITIZED_29  Num     4
32  SASUSER.MOSHE_TEMP_TABLE_TEST_2   30        SANITIZED_30  Num     6
33  SASUSER.MOSHE_TEMP_TABLE_TEST_2   31        SANITIZED_31  Num     6
34  SASUSER.MOSHE_TEMP_TABLE_TEST_2   13        SANITIZED_13 Char    25
35  SASUSER.MOSHE_TEMP_TABLE_TEST_2   11        SANITIZED_11 Char    20
36  SASUSER.MOSHE_TEMP_TABLE_TEST_2   26        SANITIZED_26 Char    75

     Pos       Format               Label
0     40     COMMA10.        SANITIZED_00
1    713     $CHAR75.        SANITIZED_01
2    788     $CHAR75.        SANITIZED_02
3    186    $CHAR100.        SANITIZED_03
4     70     $CHAR10.        SANITIZED_04
5      0         Z15.        SANITIZED_05
6     80      $CHAR8.        SANITIZED_06
7    147     $CHAR39.        SANITIZED_07
8    336     $CHAR10.        SANITIZED_08
9    108     $CHAR39.        SANITIZED_09
10   346     $CHAR39.        SANITIZED_10
11   450    $CHAR100.        SANITIZED_11
12  2522   $CHAR1000.        SANITIZED_12
13   705      $CHAR8.        SANITIZED_13
14   625      $CHAR5.        SANITIZED_14
15   550     $CHAR75.        SANITIZED_15
16   630     $CHAR75.        SANITIZED_16
17  1369   $CHAR1000.        SANITIZED_17
18   286     $CHAR50.        SANITIZED_18
19  1363      $CHAR6.        SANITIZED_19
20   863    $CHAR500.        SANITIZED_20
21    88     $CHAR20.        SANITIZED_21
22   430     $CHAR20.        SANITIZED_22
23     8  DATETIME24.        SANITIZED_23
24  2483     $CHAR39.        SANITIZED_24
25    24       HEX12.        SANITIZED_25
26    36        HEX4.        SANITIZED_26
27    58        HEX8.        SANITIZED_27
28    64        HEX8.        SANITIZED_28
29  2444     $CHAR39.        SANITIZED_29
30    16       HEX12.        SANITIZED_30
31    32        HEX4.        SANITIZED_31
32    46        HEX8.        SANITIZED_32
33    52        HEX8.        SANITIZED_33
34   405     $CHAR25.        SANITIZED_34
35   385     $CHAR20.        SANITIZED_35
36  2369     $CHAR75.        SANITIZED_36  }

@tomweber-sas
Copy link
Contributor

Thanks!, I wonder if this could be due to the encodings not matching up. It looks like SAS is running with Windows Latin 1, which is a different code page than Latin1.
In your 'bear' configuration definition, can you can you change (or add in) the encoding key and set it to use windows latin1:

            'encoding'  : 'cp1252',

Let's see if that fixes it. I don't recognize the error that you were getting, so it's possible it's a transcoding issue that's corrupting the data and then having an odd downstream effect causing that error.

Thanks!
Tom

@moshekaplan
Copy link
Contributor Author

Same issue manifests when using 'encoding' : 'cp1252'

@tomweber-sas
Copy link
Contributor

Hmm, do you mind trying this, just to see if it's specific to this data or something more pervasive.

cars = session.sasdata('cars', 'sashelp')
df = cars.to_df()
df.head()
cars.head()

Thanks,
Tom

@moshekaplan
Copy link
Contributor Author

Seems to run without issue:

(Pdb) cars = session.sasdata('cars', 'sashelp')
(Pdb) df = cars.to_df()
(Pdb) df.head()

    Make           Model   Type Origin DriveTrain   MSRP  Invoice  EngineSize  \
0  Acura             MDX    SUV   Asia        All  36945    33337         3.5
1  Acura  RSX Type S 2dr  Sedan   Asia      Front  23820    21761         2.0
2  Acura         TSX 4dr  Sedan   Asia      Front  26990    24647         2.4
3  Acura          TL 4dr  Sedan   Asia      Front  33195    30299         3.2
4  Acura      3.5 RL 4dr  Sedan   Asia      Front  43755    39014         3.5

   Cylinders  Horsepower  MPG_City  MPG_Highway  Weight  Wheelbase  Length
0        6.0         265        17           23    4451        106     189
1        4.0         200        24           31    2778        101     172
2        4.0         200        22           29    3230        105     183
3        6.0         270        20           28    3575        108     186
4        6.0         225        18           24    3880        115     197

(Pdb) cars.head()

    Make           Model   Type Origin DriveTrain   MSRP  Invoice  EngineSize  \
0  Acura             MDX    SUV   Asia        All  36945    33337         3.5
1  Acura  RSX Type S 2dr  Sedan   Asia      Front  23820    21761         2.0
2  Acura         TSX 4dr  Sedan   Asia      Front  26990    24647         2.4
3  Acura          TL 4dr  Sedan   Asia      Front  33195    30299         3.2
4  Acura      3.5 RL 4dr  Sedan   Asia      Front  43755    39014         3.5

   Cylinders  Horsepower  MPG_City  MPG_Highway  Weight  Wheelbase  Length
0          6         265        17           23    4451        106     189
1          4         200        24           31    2778        101     172
2          4         200        22           29    3230        105     183
3          6         270        20           28    3575        108     186
4          6         225        18           24    3880        115     197

@tomweber-sas
Copy link
Contributor

Ok, so that's good. It would appear to be specific to that data.
Looks like it too you some time to cleanse the output of the proc contents, so I expect you're not wanting to have this data here for all to see.
If it's ok for me to see a little more about this data and what the problem might be, we can continue with direct email. I can also do a web-ex if that's ok with you to try to see what is happening.

I'm curious about how this data was created. If you have any problems accessing it w/in SAS:

data = session.sasdata(table_name, lib_name, results='HTML')
data.head()

The html output will remove the pandas creation step and see (you don't have to paste the output here) if the data looks right or if there's any issues with SAS processing it.

Also, 2 other things to check. If you submit

print(session.saslog())

We can see the log to see if there are any errors. Submit that after the to_df() fails.

The other thing worth trying, is to try the data.to_df_CSV() method and see if anything different happens.

Tom

@moshekaplan
Copy link
Contributor Author

moshekaplan commented Apr 17, 2018

Seems to attempt to print data without issue (although not actually printable):

(Pdb) data = session.sasdata(table_name, lib_name, results='HTML')
(Pdb) data.head()
<IPython.core.display.HTML object>

I also tried data.to_df_CSV(), but that too failed.

I'll see about emailing you the saslog() directly. I'll likely have to get management approval for that.

@tomweber-sas
Copy link
Contributor

That isn't right, it looks like you're in the debugger -> the (Pdb) part.
Is this running in a notebook? Or are you running this in a shell? Sorry if so; HTML won't render in that mode. You can just assign it to a variable then write that variable to a file and then open it in a browser if you're in line mode.

x = data.head()
fd = open('filesystempath\data.html', 'wb')
fd.write(x)
fd.close()

Then you should be able to click it and it should open in your browser (windows) or open it with whatever viewer in linux.

@moshekaplan
Copy link
Contributor Author

data.head() returns None, so it's not possible to write the HTML to the disk.

@moshekaplan
Copy link
Contributor Author

moshekaplan commented Apr 17, 2018

Ah, it's because of

if not self.sas.batch:

I set it to batch and was able to retrieve the HTML content.

If I can make a recommendation: Functions should always return the value, and batch mode should only control whether or not the output is printed.

@tomweber-sas
Copy link
Contributor

Good job, yes, batch is what that's for!
So, how are you running this? What is pdb; I know that as the python debugger. Are you running line mode?

@moshekaplan
Copy link
Contributor Author

My plan is to use saspy within a larger script - not interactively. I'm using pdb (python debugger) because I wanted to execute some of the existing code and then enter a debugging session where I could try various commands.

What is "line mode"?

@tomweber-sas
Copy link
Contributor

That's all good. In fact that's what 'batch' mode if for, so you can run as a script and still get graphs and plots and such, as html and write them to files, just like you did.
By line mode I just mean running python in a shell, from a command line, as opposed to in a jupyter notebook.
Just wanted to be sure I understood the environment.
Did you see any errors in the log after running the to_df() method?

@moshekaplan
Copy link
Contributor Author

moshekaplan commented Apr 17, 2018

By line mode I just mean running python in a shell, from a command line, as opposed to in a jupyter notebook.

Yes, I'm running in line mode.

Did you see any errors in the log after running the to_df() method?

I did not see any errors in the log, but to_df() still did not work. VARNUMS= 37 and there were 37 variables.

@tomweber-sas
Copy link
Contributor

Ok, kinda back to the beginning. I assume that Assert error is coming from dataframe constructor after bringing the data over. As there are 37 variables, and everyone seems to agree on that, but seemingly I'm passing 40 variable records to the data frame constructor (assuming that what this is), then I suspect something off parsing the data I'm streaming over.
I generate a data step to stream the data over to python. I specify column and row seperators so I can split up the variables on the python side. The defaults for these are hex 01 and hex 02. Those are pretty safe for most SAS data as it SAS only has doubles and character data, and the data come over as character; the doubles are formatted out in text, so there shouldn't be any binary 01/02 in there.
So, I wonder if your data has any columns that have binary data in their character columns, such that I'm ending up splitting it into multiple columns of data - says there are 40, not 37.

W/out having a better idea of the data, I can't really tell. But these values are options you can change:

   def sasdata2dataframe(self, table: str, libref: str ='', dsopts: dict ={}, rowsep: str = '\x01', colsep: str = '\x02', **kwargs) -> '<Pandas Data Frame object>':

colsep and rowsep. They can only be one byte. If you can tell if you might have some char variables that have either binary 0x01 or 0x02 in them. you could try changing these to something not in them.

I will dig in to this deeper tomorrow. Sorry for the trouble, but we'll get this fixed and working for you.

Thanks,
Tom

@moshekaplan
Copy link
Contributor Author

You got it - that was the exact issue - one of the fields had binary data in it. Here's how I came up with valid seperator:

data = open('TEXT_OUTPUT_MOSHE_1.txt', 'rb').read()
s = set(data)
b = set([chr(i) for i in range(256)])
x = []
for i in range(256):
  try:
    x.append(chr(i).encode('cp1252'))
  except:
    pass
x = set(x)
print ((b-s) & x)

I determined that '[' and ']' were both not in my data and safe to use as separators. But it would be better to design and use a technique that didn't rely on 'magic' bytes.

@tomweber-sas
Copy link
Contributor

Ok, good deal. Well, the _CSV method doesn't rely on this, as it exports the data set to a csv file then streams it over, no parsing needed (not by me anyway). So, I'm curious why that failed. It can't really be the (exact) same problem, though it could still have something to do with the data. I'm curious as to what exactly is the failure with sd2cf_CSV.
If you're up for it, there a branch in the repo right now called sd2df_CSV which has a change to allow you to specify a local file to write the csv data to for importing to the dataframe. W/out that, I just use a temporary file which is cleaned up. With the ability to specify the file yourself, it doesn't get cleaned up. You could try that, which would keep the CSV file there on your client to see what it looks like. And I'd be curious to what the error or failure actually is for that case.
Thanks!
Tom

@moshekaplan
Copy link
Contributor Author

I'm game to switch branches. Would it be possible to give the exact sequence of commands?

For example:

session=saspy.SASsession(..) # IOM connection
data = session.sasdata(table_name, lib_name)
...

@tomweber-sas
Copy link
Contributor

Yes, there's a tempfile= now on the _CVS methods. I'm in the process of adding tempkeep= also so you can control whether the file gets cleaned up after it's used or not. So you should be able to just do the following:

session=saspy.SASsession(..) # IOM connection
data = session.sasdata(table_name, lib_name)

df = data.to_df_CSV(tempfile='os path to the file you want to use', tempkeep=True)

That should allow you to persist the CSV file on your client side. Also, this code should show the proc export in the log too, so if there's an issue there, we'll be able to see it: print(session.saslog())

Thanks!
Tom

@moshekaplan
Copy link
Contributor Author

Strange. It appears that the job ran without issue, but I didn't actually get any data:

Code:

session=saspy.SASsession(..) # IOM connection
data = session.sasdata(table_name, lib_name)
df = data.to_df_CSV(tempfile='another_attempt.csv', tempkeep=True

Output:

*** pandas.errors.EmptyDataError: No columns to parse from file

Log contents:

NOTE: The file _TOMODS1 is:
      Filename=S:\SANITIZED\Prc2\tomods1,
      RECFM=V,LRECL=131068,File Size (bytes)=0,
      Last Modified=19Apr2018:17:21:51,
      Create Time=19Apr2018:17:21:51

NOTE: 8329 records were written to the file _TOMODS1.
      The minimum record length was 396.
      The maximum record length was 978.
NOTE: View WORK.SASDATA2DATAFRAME.VIEW used (Total process time):
      real time           0.51 seconds
      cpu time            0.51 seconds
      
NOTE: There were 8328 observations read from the data set MY_LIB.MY_TABLE.
NOTE: There were 8328 observations read from the data set WORK.SASDATA2DATAFRAME.
NOTE: DATA statement used (Total process time):
      real time           0.53 seconds
      cpu time            0.53 seconds

8328 records created in _TOMODS1 from SASDATA2DATAFRAME.

@tomweber-sas
Copy link
Contributor

Hmm, so what's the contents of 'another_attempt.csv'?

@moshekaplan
Copy link
Contributor Author

Just a \n. No other contents.

@tomweber-sas
Copy link
Contributor

I'm going to have to mock up some data w/ 0x01, 0x02 and see what happens with that. It seems like it's failing to transfer across IOM, given what you show: SAS exported it out but nothing ended up on the client. Was there a traceback, or just the pandas empyt data error?
I'll try to track this down on my side. Anything else unusual about your data other than having binary data in char columns? So I can try to simulate what you're seeing.

Thanks,
Tom

@tomweber-sas
Copy link
Contributor

@moshekaplan I've been able to reproduce some of this. I am getting the failure you observed trying to create a dataframe when transferring binary data containing 0x01 and 0x02. I've also see it work when I change my row/col separator characters. I don't see a problem however when using the _CSV method. It transfers the data and creates the dataframe. I've also used the tempfile/keep to see the contents.

I've merged in to master the sd2df_CSV track, and it's possible there were some changed mad in there after you switch to trying it. I don't know of anything specifically, but can you switch to master, maybe a fresh pull?, and see if you are seeing different results than I am.

I'll attach an HTML (you'll have to remove the extra .txt off it; can't attach .html), and a screen shot of the csv file. If you are still getting no data in your local csv, even though the remote export worked, then I'll have to guess there another data specific thing that I don't have beyond just the 01/02 row/col separator .
issue128_binary.htm.txt
image

Thanks!
Tom

@moshekaplan
Copy link
Contributor Author

moshekaplan commented Apr 20, 2018

Was there a traceback, or just the pandas empyt data error?

just the pandas empyt data error

I'll try to track this down on my side. Anything else unusual about your data other than having binary data in char columns? So I can try to simulate what you're seeing.

Not that I'm aware of.

I'll test again with the new master. Thanks again for all the time and effort you're putting into resolving this issue.

@moshekaplan
Copy link
Contributor Author

moshekaplan commented Apr 20, 2018

OK, so a little further digging: It definitely seems to be an issue caused by one of the data values. If I create a temporary table with a subset of the columns, I can retrieve the data without issue.

I'm going to see if I can narrow down the exact table and its values, so I can create and share a minimal test case.

@moshekaplan
Copy link
Contributor Author

moshekaplan commented Apr 20, 2018

OK, I'm not sure if a single-column table is handled differently, but when I used the code below, it didn't even print out the error message once, indicating that there were no exceptions (although some CSVs were blank):

    columns = "COLUMN_1, COLUMN_12, ..., COLUMN_37".split(", ")
    
    for c in columns:
        job = "proc sql; CREATE TABLE MYLIB.%s_TABLE_%s AS SELECT %s from MYLIB.%s ; quit;" % (table_name, c, c, table_name)
        print (job)
        sql_results = session.submit(job, results="Pandas")
        open('%s.log' % c, 'wb').write(sql_results['LOG'].encode('utf-8'))
        open('%s.txt' % c, 'wb').write(sql_results['LST'].encode('utf-8'))    
        
        data = session.sasdata( "%s_TABLE_%s" % (table_name, c), "MYTABLE", results='Text')
        try:
            data.to_df_CSV(tempfile="mycsv_%s_%s.csv" % (table_name, c), tempkeep=True)
        except:
            print("Column %s failed!" % c)

As an aside, my entire business need was to select only a fraction of the columns, and so I've now accomplished that using the code below:

fname = "mycsv.csv"
job = "proc sql; CREATE TABLE MYLIB.%s_2 AS SELECT ... from MYLIB.%s_1;" % (table_name, table_name)
session.submit(job, results="Text")
data = session.sasdata( "%s_2" % table_name, "MYLIB", results='Text')
data.to_df_CSV(tempfile=fname, tempkeep=True)

However, if you'd like to debug this further, I'm happy to help.

@tomweber-sas
Copy link
Contributor

Well, I don't like to not understand a problem. It does appear to be data specific with your data set. The constraints we're running under for this is as follows:

  1. the file on the SAS side that the csv data is written to is utf-8. So it's expected that SAS transcodes the data from your SAS Session encoding (WLATIN1), to UTF writting it out.
  2. the java IOM client tells IOM that file is binary, so IOM doesn't try to do any transcoding.
  3. reading that binary stream into java, it's read into, effectively, a byte array and then sent to saspy. So no transcoding there either.
  4. saspy expects utf-8 data, which it should be getting.

What I believe you've seen is that the csv file on the SAS side is written out: log showing proc export.
You've gotten an empty csv file on the saspy side; using tempfile/keep=
There's no traceback or any kind of failure? Is this all correct?

I've been looking into this more, regarding the 4 things listed above. I'm seeing that 2 and 4 aren't consistent depending upon whether it's local or remote (particularly with _CSV). For local, the proc export is written to disk and not streamed over IOM, as it is for remote. The csv file isn't the same, regarding the encoding. In both cases the filename stmt states encoding=utf8, but I'm not getting a utf8 byte stream like I'm expecting when reading over IOM, but it is when writing straight to disk. I think it might be that SAS reads the file back in to give to IOM to send over. In that case, SAS would be transcoding it back from utf8 on disk, to session encoding (wlatin1) which IOM then streams across as binary. That would account for the difference I'm seeing.

I've made some changes in the access method to account for this. This still wouldn't account for what you're seeing (if what I stated above about what you're seeing is right).

Since everything is currently up to date as master, you could try this again from master and see if you still see the same thing.

Thanks,
Tom

@moshekaplan
Copy link
Contributor Author

moshekaplan commented Apr 26, 2018

Using commit 00b0880 :

sas = saspy.SASsession(cfgname=mycfg, omruser=config['user'], omrpw=config['pass'])
if sas is None:
    sys.exit(1)
data = sas.sasdata(tablename, libname)
df = data.to_df_CSV(tempfile='another_attempt.csv', tempkeep=True)

Generates the following output:

Traceback (most recent call last):
  File "test_saspy", line 334, in <module>
    main()
  File "test_saspy", line 308, in main
    run_saspy_test()
  File "test_saspy", line 183, in run_saspy_test
    df = data.to_df_CSV(tempfile='another_attempt.csv', tempkeep=True)
  File "saspy_github/saspy/sasbase.py", line 1911, in to_df_CSV
    return self.to_df(method='CSV', tempfile=tempfile, tempkeep=tempkeep, **kwargs)
  File "saspy_github/saspy/sasbase.py", line 1899, in to_df
    return self.sas.sasdata2dataframe(self.table, self.libref, self.dsopts, method, **kwargs)
  File "saspy_github/saspy/sasbase.py", line 755, in sasdata2dataframe
    return self._io.sasdata2dataframe(table, libref, dsopts, method=method, **kwargs)
  File "saspy_github/saspy/sasioiom.py", line 1178, in sasdata2dataframe
    return self.sasdata2dataframeCSV(table, libref, dsopts, **kwargs)
  File "saspy_github/saspy/sasioiom.py", line 1550, in sasdata2dataframeCSV
    df = pd.read_csv(tmpcsv, index_col=False, engine='c', dtype=dts, **kwargs)
  File "/usr/local/lib/python3.5/site-packages/pandas-0.22.0-py3.5-linux-x86_64.egg/pandas/io/parsers.py", line 709, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.5/site-packages/pandas-0.22.0-py3.5-linux-x86_64.egg/pandas/io/parsers.py", line 449, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/usr/local/lib/python3.5/site-packages/pandas-0.22.0-py3.5-linux-x86_64.egg/pandas/io/parsers.py", line 818, in __init__
    self._make_engine(self.engine)
  File "/usr/local/lib/python3.5/site-packages/pandas-0.22.0-py3.5-linux-x86_64.egg/pandas/io/parsers.py", line 1049, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/usr/local/lib/python3.5/site-packages/pandas-0.22.0-py3.5-linux-x86_64.egg/pandas/io/parsers.py", line 1695, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 565, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file

another_attempt.csv is empty, aside from a newline character.
I also confirmed that the table contains data.

@tomweber-sas
Copy link
Contributor

and the local .csv file is empty? And there's nothing in the saslog showing a problem?

@moshekaplan
Copy link
Contributor Author

The local CSV is empty. Here's the output from the saslog():

114        ;*';*";*/;
115        data sasdata2dataframe / view=sasdata2dataframe; set SANITIZED_LIB.SANITIZED_TABLE;
116        format 'SANITIZED best32. ;
117         run;

NOTE: DATA STEP view saved on file WORK.SASDATA2DATAFRAME.
NOTE: A stored DATA STEP view cannot run under a different operating system.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds
      

118        
119        
120        ;*';*";*/;
121        %put E3969440A681A2408885998500000011;
E3969440A681A2408885998500000011
122        
123        proc export data=sasdata2dataframe outfile=_tomods1 dbms=csv replace;
123      !                                                                       run
124        ;

NOTE: Unable to open parameter catalog: SANITIZED_LIB.PARMS.PARMS.SLIST in update mode. Temporary parameter values will be saved to 
WORK.PARMS.PARMS.SLIST.
NOTE: View WORK.SASDATA2DATAFRAME.VIEW used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds
      
NOTE: View WORK.SASDATA2DATAFRAME.VIEW used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds
      
NOTE: There were 18 observations read from the data set SANITIZED_LIB.SANITIZED_TABLE.
NOTE: View WORK.SASDATA2DATAFRAME.VIEW used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds
      
NOTE: There were 18 observations read from the data set SANITIZED_LIB.SANITIZED_TABLE.
125         /**********************************************************************
126         *   PRODUCT:   SAS
127         *   VERSION:   9.4
128         *   CREATOR:   External File Interface
129         *   DATE:      26APR18
130         *   DESC:      Generated SAS Datastep Code
131         *   TEMPLATE SOURCE:  (None Specified.)
132         ***********************************************************************/
133            data _null_;
134            %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
135            %let _EFIREC_ = 0;     /* clear export record count macro variable */
136            file _TOMODS1 delimiter=',' DSD DROPOVER ;
137            if _n_ = 1 then        /* write column names or labels */
138             do;
139               put
140                  "SANITIZED_VAR_01"...
213               ;
214             end;
215           set  SASDATA2DATAFRAME   end=EFIEOD;
216               format SANITIZED_VAR_01 $char10. ;
217               ...
252               format SANITIZED_VAR_37 $char1000. ;
253             do;
254               EFIOUT + 1;
255               put SANITIZED_VAR_01 $ @;
                  ...
291               put SANITIZED_VAR_37 $ ;
292               ;
293             end;
294            if _ERROR_ then call symputx('_EFIERR_',1);  /* set ERROR detection macro variable */
295            if EFIEOD then call symputx('_EFIREC_',EFIOUT);
296            run;

NOTE: The file _TOMODS1 is:
      Filename=S:\SANITIZED_PATH\Prc2\tomods1,
      RECFM=V,LRECL=131068,File Size (bytes)=0,
      Last Modified=26Apr2018:19:55:23,
      Create Time=26Apr2018:19:55:18

NOTE: 8329 records were written to the file _TOMODS1.
      The minimum record length was 396.
      The maximum record length was 978.
NOTE: View WORK.SASDATA2DATAFRAME.VIEW used (Total process time):
      real time           0.49 seconds
      cpu time            0.51 seconds
      
NOTE: There were 8328 observations read from the data set SANITIZED_LIB.SANITIZED_TABLE.
NOTE: There were 8328 observations read from the data set WORK.SASDATA2DATAFRAME.
NOTE: DATA statement used (Total process time):
      real time           0.53 seconds
      cpu time            0.54 seconds
      

8328 records created in _TOMODS1 from SASDATA2DATAFRAME.
  
  
NOTE: "_TOMODS1" file was successfully created.
NOTE: PROCEDURE EXPORT used (Total process time):
      real time           0.73 seconds
      cpu time            0.76 seconds
      

297        
12                                                         The SAS System                             19:55 Thursday, April 26, 2018

298        
299        %put E3969440A681A2408885998500000008;
E3969440A681A2408885998500000008
300        

@tomweber-sas
Copy link
Contributor

That looks just like my log too, when I run this. I just can't account for what you're getting. I can't really see a path that would do this. I'm gonna have to let this stir in the back of my head a bit.

@tomweber-sas
Copy link
Contributor

Well, I can think of one reason this might be happening. If there's an issue transcoding the csv file on the SAS server side, when IOM is trying to read it to send it to me, I can imagine that we might not see anything in the log, and that I might not get an error along the way. Just not get the data. So, I have 2 things we can try.
The first it to just put a print statement in the code where I'm reading this and writing it to the local csv file. Just to see what's coming across. We would expect nothing, but let's see.
If you have the current code from master (the line numbers will match, else you can still see where to put it), you can just add 'print(data)' to sasioiom.py at line 1515:

                    try:
                       data = self.stdout[0].recv(4096).decode(self.sascfg.encoding, errors='replace')
                    except (BlockingIOError):
                       data = b''
                    print(data)   # line 1515, let's see what we are or are not getting
                    if len(data) > 0:
                       datar += data
                       data   = datar.rpartition('\n')
                       datap  = data[0]+data[1]
                       datar  = data[2] 

The other thing, to see if my speculation might be right, is to do the proc export, and then try to read the file back in explicitly so we will see if there is any kind of error in the saslog. Run these in separate cells, if in a notebook.

filename = sas.workpath+"test.csv"
ll = sas.submit("filename x '"+filename+"' encoding='utf-8';proc export data=sashelp.cars outfile=x dbms=csv replace;run;")
print(ll['LOG'])

ll = sas.submit("data _null_; infile x; input @; put _infile_;run;")
print(ll['LOG'])

That's all I've got currently. And, just double checking, you are running the current master? Or an earlier version?

Thanks,
Tom

@moshekaplan
Copy link
Contributor Author

Should be the newest version of master:

$ git log -1
commit 00b0880f0b90967d49895653aab946e6b5b6a98f (HEAD -> master, origin/master, origin/HEAD)
Author: Tom Weber <Tom.Weber@sas.com>
Date:   Thu Apr 26 13:30:21 2018 -0400

    set appname again on reconnect

I made changes similar to what you described:

try:
    data = self.stdout[0].recv(4096).decode(self.sascfg.encoding, errors='replace')
 except (BlockingIOError) as e:
    data = b''
    print (e)
    print (data)   # line 1515, let's see what we are or are not getting

My output was the following repeated many times:

[Errno 11] Resource temporarily unavailable
b''

I then attempted to debug it manually:

> saspy/sasioiom.py(1514)sasdata2dataframeCSV()
-> data = self.stdout[0].recv(4096).decode(self.sascfg.encoding, errors='replace')
(Pdb) import select
(Pdb) data = self.stdout[0].recv(4096)
(Pdb) len(data)
33
(Pdb) p data
b'\nE3969440A681A2408885998500000008'
(Pdb) c
[Errno 11] Resource temporarily unavailable
b''
> saspy/sasioiom.py(1513)sasdata2dataframeCSV()
-> pdb.set_trace()
(Pdb) data = self.stdout[0].recv(4096)
*** BlockingIOError: [Errno 11] Resource temporarily unavailable
(Pdb) select.select([self.stdout[0]], [] ,[self.stdout[0]])


Written out:
I broke into the Python debugger, right before the call to 
```python3
data = self.stdout[0].recv(4096).decode(self.sascfg.encoding, errors='replace')

As you can see below, the first time it calls self.stdout[0].recv(), it's returning 33 bytes of data: b'\nE3969440A681A2408885998500000008'.

The second call raises errno 11 (EAGAIN or EWOULDBLOCK), as is described in recv's manpage:

       value  -1  is  returned and the external variable errno is set to EAGAIN or EWOULDBLOCK.  The receive calls normally return any data available, up to the requested
       amount, rather than waiting for receipt of the full amount requested.

To test if it wasn't waiting long enough for data to come back, I called select.select() to wait until there's data available. The call to select.select() waited without receiving data for more than ten minutes. I then killed it. So it seems that aside from those first 33 bytes, there is no output in self.stdout().

@tomweber-sas
Copy link
Contributor

You've verified that we are getting nothing back from the SAS server for that CSV file.
The b'\nE3969440A681A2408885998500000008', which was the only thing you got, is the 'termination' string I send to saspy from java to identify the transmission is complete.
You are getting none of the CSV file sent back over.

So, it may very well be something like I'm speculating. Can you run the proc export and the data step and see if there's any errors in the saslog after the data step? This is progress :)

thanks!
Tom

@moshekaplan
Copy link
Contributor Author

With the exception of NOTE: Unable to open parameter catalog: SASUSER.PARMS.PARMS.SLIST in update mode. Temporary parameter values will be saved to WORK.PARMS.PARMS.SLIST., it seemed to have no issues and printed out the listing of cars:

5                                                          The SAS System                               17:46 Friday, April 27, 2018

31         ods listing close;ods html5 (id=saspy_internal) file=_tomods1 options(bitmap_mode='inline') device=svg style=HTMLBlue;
31       ! ods graphics on / outputfmt=png;
NOTE: Writing HTML5(SASPY_INTERNAL) Body file: _TOMODS1
32         
33         filename x 'S:\SANITIZED\test.csv' encoding='utf-8';proc export data=sashelp.cars outfile=x
33       ! dbms=csv replace;run;

NOTE: Unable to open parameter catalog: SASUSER.PARMS.PARMS.SLIST in update mode. Temporary parameter values will be saved to 
WORK.PARMS.PARMS.SLIST.
34          /**********************************************************************
35          *   PRODUCT:   SAS
36          *   VERSION:   9.4
37          *   CREATOR:   External File Interface
38          *   DATE:      27APR18
39          *   DESC:      Generated SAS Datastep Code
40          *   TEMPLATE SOURCE:  (None Specified.)
41          ***********************************************************************/
42             data _null_;
43             %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
44             %let _EFIREC_ = 0;     /* clear export record count macro variable */
45             file X delimiter=',' DSD DROPOVER ;
46             if _n_ = 1 then        /* write column names or labels */
47              do;
48                put
49                   "Make"
50                ','
51                   "Model"
52                ','
53                   "Type"
54                ','
55                   "Origin"
56                ','
57                   "DriveTrain"
58                ','
59                   "MSRP"
60                ','
61                   "Invoice"
62                ','
63                   "EngineSize"
64                ','
65                   "Cylinders"
66                ','
67                   "Horsepower"
68                ','
69                   "MPG_City"
70                ','
71                   "MPG_Highway"
72                ','
73                   "Weight"
74                ','
75                   "Wheelbase"
76                ','
77                   "Length"
78                ;
79              end;
80            set  SASHELP.CARS   end=EFIEOD;
81                format Make $13. ;
82                format Model $40. ;
83                format Type $8. ;
84                format Origin $6. ;
85                format DriveTrain $5. ;
86                format MSRP dollar8. ;
87                format Invoice dollar8. ;
88                format EngineSize best12. ;
89                format Cylinders best12. ;
90                format Horsepower best12. ;
91                format MPG_City best12. ;
92                format MPG_Highway best12. ;
93                format Weight best12. ;
94                format Wheelbase best12. ;
95                format Length best12. ;
96              do;
97                EFIOUT + 1;
98                put Make $ @;
99                put Model $ @;
100               put Type $ @;
101               put Origin $ @;
102               put DriveTrain $ @;
103               put MSRP @;
104               put Invoice @;
105               put EngineSize @;
106               put Cylinders @;
107               put Horsepower @;
108               put MPG_City @;
109               put MPG_Highway @;
110               put Weight @;
111               put Wheelbase @;
112               put Length ;
113               ;
114             end;
115            if _ERROR_ then call symputx('_EFIERR_',1);  /* set ERROR detection macro variable */
116            if EFIEOD then call symputx('_EFIREC_',EFIOUT);
117            run;

NOTE: The file X is:
      Filename=S:\SANITIZED_PATH\test.csv,
      RECFM=V,LRECL=131068,File Size (bytes)=0,
      Last Modified=27Apr2018:17:46:51,
      Create Time=27Apr2018:17:46:51

NOTE: 429 records were written to the file X.
      The minimum record length was 68.
      The maximum record length was 123.
NOTE: There were 428 observations read from the data set SASHELP.CARS.
NOTE: DATA statement used (Total process time):
      real time           0.04 seconds
      cpu time            0.04 seconds
      

428 records created in X from SASHELP.CARS.
  
  
NOTE: "X" file was successfully created.
NOTE: PROCEDURE EXPORT used (Total process time):
      real time           0.26 seconds
      cpu time            0.23 seconds
      

118        
119        ods html5 (id=saspy_internal) close;ods listing;
120        
********************************************************************************
6                                                          The SAS System                               17:46 Friday, April 27, 2018

123        ods listing close;ods html5 (id=saspy_internal) file=_tomods1 options(bitmap_mode='inline') device=svg style=HTMLBlue;
123      ! ods graphics on / outputfmt=png;
NOTE: Writing HTML5(SASPY_INTERNAL) Body file: _TOMODS1
124        
125        data _null_; infile x; input @; put _infile_;run;

NOTE: The infile X is:
      Filename=S:\SANITIZED_PATH\test.csv,
      RECFM=V,LRECL=131068,File Size (bytes)=38145,
      Last Modified=27Apr2018:17:46:51,
      Create Time=27Apr2018:17:46:51

Make,Model,Type,Origin,DriveTrain,MSRP,Invoice,EngineSize,Cylinders,Horsepower,MPG_City,MPG_Highway,Weight,Wheelbase,Length
Acura,MDX,SUV,Asia,All,"$36,945","$33,337",3.5,6,265,17,23,4451,106,189
...
Volvo,XC70,Wagon,Europe,All,"$35,145","$33,112",2.5,5,208,20,27,3823,109,186
NOTE: 429 records were read from the infile X.
      The minimum record length was 68.
      The maximum record length was 123.
NOTE: DATA statement used (Total process time):
      real time           0.09 seconds
      cpu time            0.11 seconds
      

126        
127        ods html5 (id=saspy_internal) close;ods listing;
128        

@tomweber-sas
Copy link
Contributor

I'm sorry I meant your data set that doesn't work. Not cars. I just did that for the example.

@moshekaplan
Copy link
Contributor Author

moshekaplan commented Apr 30, 2018

Got it.
I ran the following:

filename = sas.workpath+"test2.csv"
ll = sas.submit("filename myfname '"+filename+"' encoding='utf-8';proc export data=MYLIB.MYTABLE outfile=myfname dbms=csv replace;run;")
print(ll['LOG'])

print ("*"*80)
ll = sas.submit("data _null_; infile myfname; input @; put _infile_;run;")
print(ll['LOG'])

The log output included the data from all rows and columns.

Additionally, I then ran the following to generate a list of uncommon bytes:

import string
data =  set(open("job4_data.txt").read())
print (sorted(data - set(string.ascii_letters + string.digits)))

My output was the following:

['\x00', '\x01', '\x02', '\x03', '\x04', '\x05', '\x06', '\x07', '\x08', '\t', '\n', '\x0c', 
'\x10', '\x11', '\x12', '\x13', '\x14', '\x15', '\x16', '\x17', '\x18', '\x19', ' ', '!', 
'"', '#', '$', '%', '&', "'", '(', ')', '+', ',', '-', '.', '/', ':', ';', '=', '?', 
'@', '_', '`', '\x80', '\x81', '\x82', '\x84', '\x86', '\x90', '\x92', '\x93', 
'\x94', '\x98', '\x99', '\x9a', '\x9c', '\x9d', '\x9e', '\xa0', '\xa1', '\xa2', 
'\xa6', '\xac', '\xb0', '\xbd', '\xbf', '\xc2', '\xc6', '\xcb', '\xe2', '\xef']

@tomweber-sas
Copy link
Contributor

Well, since you got all the data written to the log, with no errors or issues, I don't know how the data is not making it back to the python side. When you try the to_df_CSV() method and it get's no data in the local file, and then gets the pandas error, is your sas session still active and functioning? The Java IOM process hasn't terminated? You can still run other things?
I have a data set I created with every hex value from 0x00 to 0xff and I can transfer it w/ the _CSV() method without problem. I'm just not seeing how this is getting nothing over to the local file.

@moshekaplan
Copy link
Contributor Author

I ran the following code:

data = sas.sasdata(tablename, libname)
df = data.to_df_CSV(tempfile='mytempcsv.csv', tempkeep=True)
print(session.saslog())

As before, the CSV file was empty and it raised a pandas.errors.EmptyDataError: No columns to parse from file.

@tomweber-sas
Copy link
Contributor

could you still run other methods after that? Was your session still working?

@moshekaplan
Copy link
Contributor Author

Yes, the session was still working and I was able to retrieve other data without issue.

Is it possible that the byte/encoding issue only manifests when returning a table with multiple columns and we've both been only testing the nonprintable characters on single-column tables?

@tomweber-sas
Copy link
Contributor

Thought I had replied to this, but I don't see it.
I've tried to get this to fail for me. I've created a multi column table containing all 256 wlatin1 codepoints; semi randomly scrambled in the various columns. I still can't get this to fail. I am trying remote IOM to a workspace server running wlatin1 (cp1252). With the _CSV methods I can keep the temporary local file, and I see it's got the data in utf-8 encoding, which is then imported successfully into a dataframe.
I've tried it with local IOM too, and get the same results.
nls1.html.txt

Any chance you can see if the workspace server logs show any issues on the IOM server side?
I can't see why there would be no errors or anything, yet the data won't show up on the client side.

Maybe you can try this example and see if it works or has issues for you?

Tom

BTW, just remove the .txt off the file name. Can't upload .html to this.

@tomweber-sas
Copy link
Contributor

@moshekaplan I've found some edge cases where I was getting bad transcoding with both to_df and to_df_CSV. I've got fixes in a new branch called nls2. Are you able to try out this branch with your failing cases and see if they work correctly now?

Thanks!
Tom

@moshekaplan
Copy link
Contributor Author

Sorry for the delayed response. I tested the newest version of master (which includes your three nls2 commits) and it appears to have solved the issue.

Thank you!
Moshe

@tomweber-sas
Copy link
Contributor

@moshekaplan That's great new, thanks for validating that for me!
Yes, I was reading the stream of data, in chncks and concatenating them, but I was converting each chunk to character instead of concatenating as binary then converting to char after. So, it was only causing a problem when a multibyte character was split between chunks I was getting. So it was 'data specific', although completely a logic error on my part. I am glad this is what you were hitting and that this fixes it!

Thanks again for all the help looking into this!
Thanks,
Tom

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants