Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with
or
.
Download ZIP

Loading…

ENH: read_sas, to_sas #4052

Open
benjello opened this Issue · 72 comments
@benjello

It would be really convenient to be able to at least import SAS tables into pandas dataframe. Is this planned ? Are they insurmountable issues ?

Thanks

@jreback
Owner

can you post a link to the format? and see if any converters have been writtenin python?

obviously the idea would be to read the native format file

@jreback
Owner

so looks like simple binary read/write stuff....could be done....only other question is there any license issue with doing this?

@cpcloud
Collaborator

not sure. @benjello want to contact SAS and ask them?

@miketkelly

That technote is describing the XPORT format, which isn't the native binary format. I've never seen a published layout of the native format. Some people have partially reverse engineered it, but I've never seen a solution that could handle data sets with compression.

https://github.com/BioStatMatt/sas7bdat

@cpcloud
Collaborator

@mtkni thanks had no idea.

@jreback
Owner

aside from using SAS to actually export (e.g. csv or whatever), is there aformat that one could save that provides some interoperbility (and is openish)?

@cpcloud
Collaborator

a 10 minute search suggests no but maybe someone else knows more.

@benjello

I am not a specialist much more a potential user in heavy need of such a tool.
For now I have used alternatively StatTransfer which is not a free software
or when importing to R I used one of the method exposed here but you need to have sas installed.
I am sorry for not being able to provide ypu for more information than above.

@cpcloud
Collaborator

probably could do

  1. call to sas if exists
  2. if doesn't exist can only read xport so try to read that (someone would need to implement the xport reader)
  3. if not 1 or 2 then fail saying u need sas to read sas binary files.
@jreback
Owner

I don't think 1) is a good idea

export data in stata to xport format or csv format

@cpcloud
Collaborator

ok. just throwing it out there. i don't like calling out to other programs either, but this seems like it's going to be tough. i can implement the R code above...if u think that's a good idea...but it basically forces users to use that particular version of the format and if it ever changes we won't know until it breaks.

@cpcloud
Collaborator

i wouldn't be able to test to_sas though since i don't have sas

@benjello

I would be glad to test everything that would do the job. I have sas.

@miketkelly

I've spent some time on this in the past. These are my thoughts:

1) Reverse engineering the binary data set format is a difficult task and not a good priority for this project. Any solution, like the R solution, that doesn't deal with compressed data sets won't do me any good. I can't speak for others on that. I also worry about the licensing issues.
2) Given that, the only option for read_sas may be to use the sas executable to first export SAS to a format that can be read by Python.
3) The SAS transport format, as described in the tech note, is an option but it has some quirks. In particular, the native XPORT engine doesn't handle long variable names. It really hasn't evolved since SAS 6 (we're on SAS 9 now). The SAS workaround for long variable names is a set of macros that not everyone has installed.
4) I'm not convinced the XPORT format is a better option than just CSV. If it's dramatically faster it might be. I'll study that next week.
5) If the mechanism for reading SAS data sets turns out to be export to CSV and then using read_csv, then there's not much code to write that isn't SAS environment dependent. It just doesn't make for a good shared module.

As much as I wish there was a good solution to this, and as much as I'd be willing to help build it, I just don't think there is. I've built read_sas using CSV as an intermediate format. Obviously, this requires a SAS license. It takes very few lines to implement, but most of those lines are specific to our SAS environment and are not well-portable.

I will study the performance of XPORT vs CSV next week. If it's dramatically faster, then it may be worth the effort to implement. Even then, I'm not sure it's worth taking that on as part of the Pandas project. I would be interested in comments from other SAS users on that.

Just my two cents.

@cpcloud
Collaborator

nice to hear from someone who tried to do this. FWIW i think it might be tough to beat CSV for speed, most of it is written C/Cython.

@miketkelly

Agreed. The new, fast CSV was a game changer.

@jreback
Owner

can sas export to HDF5?

@miketkelly

No, it can't.

@jreback
Owner

export in STATA format?

@miketkelly

No, and if it did it would be an expensive add-on module. SAS is pretty good at reading from databases (http://www.sas.com/resources/factsheet/sas-access-factsheet.pdf), although each database platform is a separate license. I haven't found it good at all at writing to databases (it can, but it's slow). Other than that, interoperability doesn't seem to be part of their business model.

@miketkelly

Oh wait, I may have spoken to soon. Apparently I can export to a stata file: http://support.sas.com/documentation/cdl/en/acpcref/63184/HTML/default/viewer.htm#a003102702.htm

Is STATA supported in Pandas? It would still requires a SAS license, but I can benchmark that versus CSV.

@benjello

There is a read_stata that will be available in te coming version but already available on github

@benjello

BTW, @mtkni I would be happy to look at the read_sas you implemented if you would share it ...

@miketkelly

Just to close the loop on this, exporting to STATA requires an add-on for which I'm not licensed, so I can't benchmark it.

@dramage1

FYI, the XPT or transport format is a non-proprietary format that has no licensing issues and is the only format currently accepted by the Food and Drug Administration (FDA) for clinical trial data. Most pharmaceutical companies submit XPT format to the FDA. It would be nice to have a way to read these files in just like a csv file.

@jtratner
Collaborator

@dramage1 Is "XPT" the same as "XPORT" above?

@jtratner
Collaborator

Heyo - there's at least one Python package for reading XPT files - https://pypi.python.org/pypi/xport/0.1.0

@dramage1

Just what I needed, much appreciated.

@jtratner
Collaborator

@dramage1 if you use this enough to want to write up a pandas wrapper for it, that could be a useful addition to pandas (depending on the stability of xport)

@selik

@jtratner I worked on the xport library before. I can refactor xport to give a better API for use in a pandas read_xpt or borrow some code to include directly in pandas.

@selik

@dramage1 let me know if the xport library is confusing or broken. I'll try to improve the docs and/or code.

@jtratner
Collaborator
@jreback
Owner

@benjello @selik any action on this?

@selik

Not yet... check back in a couple weeks :-0

@benjello

I am sorry but I won't be qualified enough but I am willing to test any code

@selik

@benjello Could you give me a few test cases? I don't have SAS available to me. I'd like to get just a few tiny test files to make some unit tests.

@jreback
Owner

@selik you are going to use the xport soln?

@jreback
Owner

@jseabold you have thoughts on this?

@jseabold
Collaborator

Not really anything to add beyond what's here. I'm sure it will be useful if XPT format is used places (yikes that's a terrible data policy re: FDA). I've been lucky enough to avoid SAS beyond coursework which required it.

@selik

@jreback It makes sense to refactor the xport library to make it friendly as a dependency for a pandas.read_xpt().

@jreback
Owner

@selik yes....prob best to simply incorporate it directly (with the licensening references / included) - see what we did with msgpack. Then you can modify and not introduce a dep.

I am not a license expert...but I think that the MIT license is compat with pandas BDS 3 clause

(you basically just copy the LICENSE to the LICENSES dir) and are good 2 go

@dramage1

@jseabold Regarding the FDA data format. They are beginning to realize that it is time to move forwar dand have a XML format pilot project proposal http://goo.gl/1xNiv8. The SAS transport (XPT) format is not going away anytime soon - the FDA moves at a snails pace implementing changes, so I glad to see you are working on this. Unfortunately, I am a complete newbie at python and can't help much. I could provide some sample data in XPT format if someone can explain how to upload it to GIT.

@selik

@dramage1 You can email me the files if you'd like. I'm looking for a Rosetta stone for XPT and CSV. Or XPT and some other plain-text format. I think my email address is in my profile.

@dramage1

@selik Mike, I tried mike@selik.org and got an undeliverable message.

@selik

@dramage1 That's not good. I wonder who else is having trouble emailing me. Mind sharing your email in your profile?

@dramage1

I updated my profile

@spearsem

Even though it would be slower, would it be worthwhile to add pyodbc-based support for SAS?

@selik

@spearsem It wouldn't necessarily be slower if SAS has some secret awesome algorithm for reading XPT files. That's how R reads XPT. But if you already have SAS, the best thing to do is read the file in SAS and save as CSV, not to read it directly from Python.

@selik

BTW, I'm slowly moving along with xport. I think I'll have code ready for inclusion in pandas by end of April.

@spearsem

I'm thinking specifically when you don't already have SAS, just someone's old data files. It would be interesting to be able to connect to the data via some within-pandas wrapper on pyodbc for the SAS drivers and then perform some queries on it into pandas.

@selik

I don't follow. How would you have SAS drivers without SAS?

@spearsem

Well, you may be a person who knows absolutely nothing about SAS, but who can solve the problem very quickly in Pandas. This happened to me before with Stata. My company had plenty of Stata licenses, but no one who knew Stata had time to help explain to me what was going on with the script that generated some data. In the interim, I found a statsmodels function that would read .dta files (pandas didn't have that ability yet) and solved the problem with the data very quickly. It would have taken much longer for me to do it with Stata.

@selik

Ah, I see. Unfortunately, I can't help with that, as I don't have SAS.

@benjello

Some progress on the front of reading SAS sas7bdat format in R: http://cran.r-project.org/web/packages/sas7bdat/index.html and it seems that a pythonista got inspired by it http://git.pyhacker.com/sas7bdat

@mrocklin

Can anyone vouch for the R solution?

@gdementen

FWIW, there is apparently also support for reading compressed files in the Java-based "parso" open-source library, http://search.maven.org/remotecontent?filepath=com/ggasoftware/parso/1.2.1/parso-1.2.1-sources.jar

@jreback jreback referenced this issue in ContinuumIO/blaze
Closed

backend: SAS support #763

@jaredhobbs

the sas7bdat Python library now supports compressed files too: https://pypi.python.org/pypi/sas7bdat

@gdementen

@jaredhobbs yeah! Thanks a lot for this! If only it supported Python3, it would be even more awesome :)

@jaredhobbs

@gdementen After a few long nights, Python3 support has landed: https://pypi.python.org/pypi/sas7bdat/2.0.1

I also fixed some bugs I came across.

@dramage1
@talumbau talumbau referenced this issue in ContinuumIO/odo
Closed

SAS7BDAT backend #41

@TomAugspurger
Collaborator

Hadley just release haven for reading SAS, Spss, and Stata files. It wraps a C library (ReadStat)[https://github.com/WizardMac/ReadStat]. It has an MIT license.

@tyler-abbot

Did anyone ever solve the read_sas issue? Or is reading fixed width ascii files with accompanying dictionary still an issue?

@jreback
Owner

see #9711 - going to be merged shortly -

@tyler-abbot

But does that only deal with '.xpt' file types? Glancing at the code it doesn't seem to deal with the other two part SAS file format.

@jreback
Owner

@tyler-abbot yes that is for xport type files. It would be straightforward to wrap the library mentioned above to extend this to the sas binary format. just need a volunteer - interested?

@tyler-abbot

@jreback I'm actually working on transcribing the SAScii (http://cran.r-project.org/web/packages/SAScii/index.html) package from R to Python. I would be happy to share the results if it is ok with the author of that package. I haven't done much development, though, so don't know much about sop's. I'm also not sure how compatible it would be with the library you mentioned. Perhaps just an add on with the option of ...format="sas"... or something along those lines.

@jreback
Owner

just saw this. https://pypi.python.org/pypi/sas7bdat/2.0.1. Even if this is pure-python (slower), that is ok to start. Better to have it able to read than not.

cc @kshedden

@benjello

@jreback @kshedden I use extensively https://pypi.python.org/pypi/sas7bdat/2.0.1 It is slow but works well.
Since usually I transfer my data to HDF format once at the beginning of a study, it was worth it.

@tyler-abbot

So, I don't think the sas7bdat package can read the type of sas files I'm talking about. I have finished writing a function to do it, but am going out of town for a few months. It is all contained in this package:

https://pypi.python.org/pypi/psid_py

I have a few days during which I could work on incorporating the read_sas() function into pandas. I'm going to read through the documentation and do some more testing, but if anyone has suggestions that will help me move more quickly it would be greatly appreciated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.