Exporting SAS data to CSV? #128
Unless I'm misreading your question, I believe you'd simply want to convert the SAS data set into a (pandas) DataFrame and write it to CSV that way. E.g.:

```python
frame = SASdata.to_df()  # data is now local to Python
frame.to_csv(...)        # with your path, options
```
I attempted to run the following commands:
When it tries to execute:
@moshekaplan what version of saspy do you have? You can submit your SASsession object and that will provide the info. And, I'd like to see what the data looks like. Can you submit the following:
Thanks!
```
(Pdb) session
(Pdb) data.contents()
```
Thanks! I wonder if this could be due to the encodings not matching up. It looks like SAS is running with Windows Latin-1, which is a different code page than Latin-1.
Let's see if that fixes it. I don't recognize the error you were getting, so it's possible a transcoding issue is corrupting the data and causing an odd downstream effect. Thanks!
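The difference matters in practice. Here's a minimal, hypothetical Python illustration (unrelated to saspy's internals) of how Windows Latin-1 (windows-1252) and ISO Latin-1 disagree:

```python
# Windows Latin-1 (cp1252) assigns printable characters to
# 0x80-0x9F, where ISO Latin-1 has control characters, so
# decoding bytes with the wrong codec silently changes them.
raw = b"\x93quoted\x94"  # curly quotes in windows-1252

as_windows = raw.decode("windows-1252")  # curly quotes
as_iso = raw.decode("latin-1")           # C1 control chars instead

print(as_windows)
print(as_windows == as_iso)  # False: same bytes, different text
```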
The same issue manifests when using:
Hmm, do you mind trying this, just to see if it's specific to this data or something more pervasive.
Thanks,
Seems to run without issue:

```
(Pdb) cars = session.sasdata('cars', 'sashelp')
(Pdb) cars.head()
```
Ok, so that's good. It would appear to be specific to that data. I'm curious how this data was created, and whether you have any problems accessing it within SAS:
The HTML output will skip the pandas creation step, so you can see (you don't have to paste the output here) whether the data looks right or whether there are any issues with SAS processing it. Also, two other things to check. If you submit
we can see the log to check for any errors. Submit that after the to_df() fails. The other thing worth trying is the data.to_df_CSV() method, to see if anything different happens. Tom
Seems to attempt to print data without issue (although not actually printable):
I also tried it. I'll see about emailing you the
That isn't right; it looks like you're in the debugger (the (Pdb) part).
Then you should be able to click it and it should open in your browser (Windows), or open it with whatever viewer on Linux.
Ah, it's because of Line 1130 in d28a017
I set it to batch and was able to retrieve the HTML content. If I can make a recommendation: functions should always return the value, and batch mode should only control whether or not the output is printed.
Good job, yes, batch is what that's for!
My plan is to use saspy within a larger script, not interactively. I'm using pdb (the Python debugger) because I wanted to execute some of the existing code and then enter a debugging session where I could try various commands. What is "line mode"?
That's all good. In fact, that's what 'batch' mode is for: so you can run as a script and still get graphs and plots and such, as HTML, and write them to files, just like you did.
Yes, I'm running in line mode.
I did not see any errors in the log, but to_df() still did not work.
Ok, kinda back to the beginning. I assume that assertion error is coming from the DataFrame constructor after bringing the data over. There are 37 variables, and everyone seems to agree on that, but seemingly I'm passing 40-variable records to the DataFrame constructor (assuming that's what this is), so I suspect something is off in parsing the data I'm streaming over. Without a better idea of the data, I can't really tell. But these values are options you can change:
colsep and rowsep. They can only be one byte. If you can tell whether some char variables contain binary 0x01 or 0x02, you could try changing these to something not in the data. I will dig into this deeper tomorrow. Sorry for the trouble, but we'll get this fixed and working for you. Thanks,
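The suspected failure can be sketched in plain Python. This is only an illustration of the mechanism (saspy's real streaming parser is more involved), with made-up column values:

```python
# A record of 3 columns streamed with 0x01 as the column
# separator. If a character value itself contains a raw 0x01
# byte, naive splitting yields an extra column, and the
# DataFrame constructor sees more fields than variables.
COLSEP = b"\x01"

clean = b"alpha" + COLSEP + b"beta" + COLSEP + b"gamma"
dirty = b"alpha" + COLSEP + b"be\x01ta" + COLSEP + b"gamma"

print(len(clean.split(COLSEP)))  # 3 columns, as expected
print(len(dirty.split(COLSEP)))  # 4 columns: one too many
```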
You got it: that was the exact issue. One of the fields had binary data in it. Here's how I came up with a valid separator:
I determined that '[' and ']' were both not in my data and safe to use as separators. But it would be better to design and use a technique that doesn't rely on 'magic' bytes.
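That check can be automated. Here's a hedged sketch (the sample data and candidate list are made up for illustration):

```python
# Find one-byte separator candidates that never occur in a
# data dump. The candidates and sample data are hypothetical.
def safe_separators(data: bytes, candidates: bytes = b"[]|~^\x01\x02"):
    present = set(data)
    return [bytes([c]) for c in candidates if c not in present]

sample = b"alpha,beta\x01gamma"  # contains a raw 0x01 byte
print(safe_separators(sample))   # 0x01 is excluded; '[' and ']' survive
```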
Ok, good deal. Well, the _CSV method doesn't rely on this, as it exports the data set to a CSV file and then streams it over; no parsing needed (not by me, anyway). So I'm curious why that failed. It can't really be the (exact) same problem, though it could still have something to do with the data. I'm curious what exactly the failure with sd2df_CSV is.
I'm game to switch branches. Would it be possible to give the exact sequence of commands? For example:
Yes, there's a tempfile= now on the _CSV methods. I'm in the process of adding tempkeep= as well, so you can control whether the file gets cleaned up after it's used. So you should be able to just do the following:
That should allow you to persist the CSV file on your client side. Also, this code should show the proc export in the log, so if there's an issue there, we'll be able to see it:

```python
print(session.saslog())
```

Thanks!
Strange. It appears that the job ran without issue, but I didn't actually get any data: Code:
Output:
Log contents:
Hmm, so what's the contents of 'another_attempt.csv'?
Just a
I'm going to have to mock up some data with 0x01, 0x02 and see what happens with that. It seems like it's failing to transfer across IOM, given what you show: SAS exported it out, but nothing ended up on the client. Was there a traceback, or just the pandas empty-data error? Thanks,
@moshekaplan I've been able to reproduce some of this. I am getting the failure you observed trying to create a DataFrame when transferring binary data containing 0x01 and 0x02. I've also seen it work when I change my row/col separator characters. I don't see a problem, however, when using the _CSV method: it transfers the data and creates the DataFrame. I've also used tempfile/tempkeep to see the contents. I've merged the sd2df_CSV track into master, and it's possible some changes were made in there after you switched to trying it. I don't know of anything specific, but can you switch to master (maybe a fresh pull?) and see if you are seeing different results than I am. I'll attach an HTML file (you'll have to remove the extra .txt off it; can't attach .html) and a screenshot of the CSV file. If you are still getting no data in your local CSV, even though the remote export worked, then I'll have to guess there's another data-specific issue beyond just the 0x01/0x02 row/col separators. Thanks!
Just the pandas empty-data error.
Not that I'm aware of. I'll test again with the new master. Thanks again for all the time and effort you're putting into resolving this issue.
OK, so a little further digging: it definitely seems to be an issue caused by one of the data values. If I create a temporary table with a subset of the columns, I can retrieve the data without issue. I'm going to see if I can narrow down the exact table and its values, so I can create and share a minimal test case.
OK, I'm not sure if a single-column table is handled differently, but when I used the code below, it didn't even print the error message once, indicating that there were no exceptions (although some CSVs were blank):

```python
columns = "COLUMN_1, COLUMN_12, ..., COLUMN_37".split(", ")
for c in columns:
    job = "proc sql; CREATE TABLE MYLIB.%s_TABLE_%s AS SELECT %s from MYLIB.%s ; quit;" % (table_name, c, c, table_name)
    print(job)
    sql_results = session.submit(job, results="Pandas")
    open('%s.log' % c, 'wb').write(sql_results['LOG'].encode('utf-8'))
    open('%s.txt' % c, 'wb').write(sql_results['LST'].encode('utf-8'))
    data = session.sasdata("%s_TABLE_%s" % (table_name, c), "MYTABLE", results='Text')
    try:
        data.to_df_CSV(tempfile="mycsv_%s_%s.csv" % (table_name, c), tempkeep=True)
    except:
        print("Column %s failed!" % c)
```

As an aside, my entire business need was to select only a fraction of the columns, and I've now accomplished that using the code below:

```python
fname = "mycsv.csv"
job = "proc sql; CREATE TABLE MYLIB.%s_2 AS SELECT ... from MYLIB.%s_1;" % (table_name, table_name)
session.submit(job, results="Text")
data = session.sasdata("%s_2" % table_name, "MYLIB", results='Text')
data.to_df_CSV(tempfile=fname, tempkeep=True)
```

However, if you'd like to debug this further, I'm happy to help.
Well, I don't like not understanding a problem. It does appear to be specific to your data set. The constraints we're running under are as follows:
What I believe you've seen is that the CSV file on the SAS side is written out: the log shows the proc export. I've been looking into this more, regarding the 4 things listed above. I'm seeing that 2 and 4 aren't consistent depending upon whether it's local or remote (particularly with _CSV). For local, the proc export is written to disk and not streamed over IOM, as it is for remote. And the CSV file isn't the same, regarding the encoding. In both cases the filename statement states encoding=utf8, but I'm not getting a UTF-8 byte stream like I'm expecting when reading over IOM, though I am when writing straight to disk. I think it might be that SAS reads the file back in to give to IOM to send over. In that case, SAS would be transcoding it back from UTF-8 on disk to the session encoding (wlatin1), which IOM then streams across as binary. That would account for the difference I'm seeing. I've made some changes in the access method to account for this. This still wouldn't account for what you're seeing (if what I stated above about what you're seeing is right). Since everything is currently up to date in master, you could try this again from master and see if you still see the same thing. Thanks,
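The suspected round trip can be mimicked in plain Python (an illustration of the transcoding mechanism only, not of SAS or IOM internals):

```python
# A non-ASCII character written to disk as UTF-8 occupies two
# bytes; if the server re-reads those bytes as wlatin1
# (windows-1252) before streaming them over, the client
# receives mojibake instead of the original text.
original = "caf\u00e9"                     # 'café'
on_disk = original.encode("utf-8")         # b'caf\xc3\xa9'
restreamed = on_disk.decode("windows-1252")

print(restreamed)                # corrupted: two chars where one was
print(restreamed == original)    # False
```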
Using commit 00b0880:

```python
sas = saspy.SASsession(cfgname=mycfg, omruser=config['user'], omrpw=config['pass'])
if sas is None:
    sys.exit(1)
data = sas.sasdata(tablename, libname)
df = data.to_df_CSV(tempfile='another_attempt.csv', tempkeep=True)
```

Generates the following output:

```
Traceback (most recent call last):
  File "test_saspy", line 334, in <module>
    main()
  File "test_saspy", line 308, in main
    run_saspy_test()
  File "test_saspy", line 183, in run_saspy_test
    df = data.to_df_CSV(tempfile='another_attempt.csv', tempkeep=True)
  File "saspy_github/saspy/sasbase.py", line 1911, in to_df_CSV
    return self.to_df(method='CSV', tempfile=tempfile, tempkeep=tempkeep, **kwargs)
  File "saspy_github/saspy/sasbase.py", line 1899, in to_df
    return self.sas.sasdata2dataframe(self.table, self.libref, self.dsopts, method, **kwargs)
  File "saspy_github/saspy/sasbase.py", line 755, in sasdata2dataframe
    return self._io.sasdata2dataframe(table, libref, dsopts, method=method, **kwargs)
  File "saspy_github/saspy/sasioiom.py", line 1178, in sasdata2dataframe
    return self.sasdata2dataframeCSV(table, libref, dsopts, **kwargs)
  File "saspy_github/saspy/sasioiom.py", line 1550, in sasdata2dataframeCSV
    df = pd.read_csv(tmpcsv, index_col=False, engine='c', dtype=dts, **kwargs)
  File "/usr/local/lib/python3.5/site-packages/pandas-0.22.0-py3.5-linux-x86_64.egg/pandas/io/parsers.py", line 709, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.5/site-packages/pandas-0.22.0-py3.5-linux-x86_64.egg/pandas/io/parsers.py", line 449, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/usr/local/lib/python3.5/site-packages/pandas-0.22.0-py3.5-linux-x86_64.egg/pandas/io/parsers.py", line 818, in __init__
    self._make_engine(self.engine)
  File "/usr/local/lib/python3.5/site-packages/pandas-0.22.0-py3.5-linux-x86_64.egg/pandas/io/parsers.py", line 1049, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/usr/local/lib/python3.5/site-packages/pandas-0.22.0-py3.5-linux-x86_64.egg/pandas/io/parsers.py", line 1695, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 565, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file
```
And the local .csv file is empty? And there's nothing in the saslog showing a problem?
The local CSV is empty. Here's the output from the
That looks just like my log, too, when I run this. I just can't account for what you're getting; I can't really see a path that would do this. I'm gonna have to let this stir in the back of my head a bit.
Well, I can think of one reason this might be happening. If there's an issue transcoding the csv file on the SAS server side, when IOM is trying to read it to send it to me, I can imagine that we might not see anything in the log, and that I might not get an error along the way. Just not get the data. So, I have 2 things we can try.
The other thing, to see if my speculation might be right, is to do the proc export, and then try to read the file back in explicitly so we will see if there is any kind of error in the saslog. Run these in separate cells, if in a notebook.
That's all I've got currently. And, just double-checking: are you running the current master, or an earlier version? Thanks,
Should be the newest version of master:
I made changes similar to what you described:

```python
try:
    data = self.stdout[0].recv(4096).decode(self.sascfg.encoding, errors='replace')
except (BlockingIOError) as e:
    data = b''
    print(e)
print(data)  # line 1515, let's see what we are or are not getting
```

My output was the following, repeated many times:
I then attempted to debug it manually:

```
> saspy/sasioiom.py(1514)sasdata2dataframeCSV()
-> data = self.stdout[0].recv(4096).decode(self.sascfg.encoding, errors='replace')
(Pdb) import select
(Pdb) data = self.stdout[0].recv(4096)
(Pdb) len(data)
33
(Pdb) p data
b'\nE3969440A681A2408885998500000008'
(Pdb) c
[Errno 11] Resource temporarily unavailable
b''
> saspy/sasioiom.py(1513)sasdata2dataframeCSV()
-> pdb.set_trace()
(Pdb) data = self.stdout[0].recv(4096)
*** BlockingIOError: [Errno 11] Resource temporarily unavailable
(Pdb) select.select([self.stdout[0]], [] ,[self.stdout[0]])
```

Written out:
I broke into the Python debugger, right before the call to

```python3
data = self.stdout[0].recv(4096).decode(self.sascfg.encoding, errors='replace')
```

As you can see in the transcript above, the first call to recv() returns 33 bytes. The second call raises errno 11 (Resource temporarily unavailable). To test whether it simply wasn't waiting long enough for data to come back, I called select.select, as shown above.
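For reference, the errno 11 behavior in the transcript can be reproduced with an ordinary non-blocking socket pair; this sketch is unrelated to saspy's actual IOM connection:

```python
import select
import socket

# A non-blocking reader: the first recv() succeeds because data
# is already buffered; the second raises BlockingIOError (EAGAIN,
# errno 11 on Linux) because nothing else has arrived.
a, b = socket.socketpair()
b.setblocking(False)

a.send(b"x" * 33)
first = b.recv(4096)
print(len(first))  # 33 bytes, like the transcript

try:
    b.recv(4096)
except BlockingIOError as e:
    print("no data ready, errno", e.errno)

# select() with a timeout distinguishes "no data yet" from
# "not waiting long enough": an empty readable list means the
# peer genuinely sent nothing more.
readable, _, _ = select.select([b], [], [], 0.1)
print(readable == [])

a.close()
b.close()
```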
You've verified that we are getting nothing back from the SAS server for that CSV file. So it may very well be something like what I'm speculating. Can you run the proc export and the data step and see if there are any errors in the saslog after the data step? This is progress :) Thanks!
With the exception of
I'm sorry, I meant your data set that doesn't work, not cars. I just used that for the example.
Got it.
The log output included the data from all rows and columns. Additionally, I then ran the following to generate a list of uncommon bytes:

```python
import string

data = set(open("job4_data.txt").read())
print(sorted(data - set(string.ascii_letters + string.digits)))
```

My output was the following:
Well, since you got all the data written to the log with no errors or issues, I don't know how the data is not making it back to the Python side. When you try the to_df_CSV() method and it gets no data in the local file, and then gets the pandas error, is your SAS session still active and functioning? The Java IOM process hasn't terminated? You can still run other things?
I ran the following code:

```python
data = sas.sasdata(tablename, libname)
df = data.to_df_CSV(tempfile='mytempcsv.csv', tempkeep=True)
print(session.saslog())
```

As before, the CSV file was empty and it raised a pandas.errors.EmptyDataError.
Could you still run other methods after that? Was your session still working?
Yes, the session was still working and I was able to retrieve other data without issue. Is it possible that the byte/encoding issue only manifests when returning a table with multiple columns, and we've both been only testing the nonprintable characters on single-column tables?
Thought I had replied to this, but I don't see it. Any chance you can check whether the workspace server logs show any issues on the IOM server side? Maybe you can try this example and see if it works or has issues for you? Tom. BTW, just remove the .txt off the file name; I can't upload .html to this.
@moshekaplan I've found some edge cases where I was getting bad transcoding with both to_df and to_df_CSV. I've got fixes in a new branch called nls2. Are you able to try out this branch with your failing cases and see if they work correctly now? Thanks!
Sorry for the delayed response. I tested the newest version of master (which includes your three nls2 commits) and it appears to have solved the issue. Thank you!
@moshekaplan That's great news, thanks for validating that for me! And thanks again for all the help looking into this!
Is there a recommended way to export SAS data to a CSV file on the system running Python? The only relevant method appeared to be SASdata.to_csv, which writes it to the SAS system's disk.