Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Fix for #1512, added StataReader and StataWriter to pandas.io.parsers #3270

Merged
merged 6 commits into from May 16, 2013

Conversation

PKEuS
Copy link
Contributor

@PKEuS PKEuS commented Apr 7, 2013

This pull request aims at fixing ticket #1512 and contains both a reader and a writer for Stata .dta files.

The code basically comes from th statsmodels project, however, I adapted it to the needs of pandas and implemented support for reading out stata value labels. The writer does not write those labels back.

@jseabold
Copy link
Contributor

jseabold commented Apr 7, 2013

FYI, I have a student working on re-writing StataReader and StataWriter in Cython due to reports of their being slow. We might want to hold off on merging this until the rewrite. I don't know though up to pandas devs.

@ghost
Copy link

ghost commented Apr 7, 2013

@jseabold, Is there a timeline for that, it would help make a decision.
@PKEuS, how different is the API here compared to statsmodels? if it's faithful, probably no harm
in merging in the current version soon, and moving to a faster version when it's available.

I know there's an agreement for not having reverse deps from pandas on statsmodels,
maybe having a pandas export from statsmodels for stata files would make more sense
then stata import in pandas? less duplication, and once statsmodels becomes faster,
this'll work too.

right now, the supported pandas IO file formats are all "vanilla" - csv, xls (well...), fixed width.

@PKEuS
Copy link
Contributor Author

PKEuS commented Apr 7, 2013

how different is the API here compared to statsmodels?

The API is different for several reasons:

  • Support for parsing value labels required extending old API
  • Integration into Pandas API (see from_dta method in DataFrame) and removing non-pandas code required some adaptions
  • Stylistic reasons. I tried to create an overall more consistent interface providing methods to get all important metadata from the stata file.

The way the parser works, however, is basically the same.

I don't know if the performance of this parser somehow improved compared to old statsmodels parser. An improvement, however, is imaginable due to some small changes I did concerning the way data is written into a DataFrame.

@jseabold
Copy link
Contributor

jseabold commented Apr 7, 2013

Should be done within the month. That's when the projects are due at least... This is a learning experience, so there might be some unforeseen stumbling blocks. FWIW, I plan to deprecate this from statsmodels given that it will be available in pandas, so we had planned on the Cython code making its way into pandas as well. Shouldn't be any need for a dependency.

@hmgaudecker
Copy link

I should think this would belongs into pandas rather than statsmodels -- pure data I/O, nothing to do with statistics/econometrics except for the fact that handles the data format that is the near standard in econometrics. @y-p: I don't get the point about "vanilla" formats: sql, hdf5, ... Then there is the ability to read R dataframes, just seems to be in a different place.

I cannot say too much about the statsmodels fork of the StataReader as I always worked with the original one (at some point, the statsmodels version converted everything into floats, which didn't work for me). The current version should keep the original data formats and adds some goodies in terms of carrying metadata along (I should add that PKEuS is my student).

@jseabold: Happy to join forces here once you have produced something that gets rid of bottlenecks. I would guess that looping through the dataset would be the most crucial one.

@ghost
Copy link

ghost commented Apr 7, 2013

Stata files are a vendor-specific format for a proprietary stats package, I Don't think
HDF5/sql fall into the same category. R does, but is a prominent open-source solution.
Admitedly it's a blurry line, and I don't have an objection to this, just points to consider,

@jseabold
Copy link
Contributor

jseabold commented Apr 7, 2013

Foreign I/O seems very much to be a useful feature for pandas. It doesn't require any dependencies, and I'd find this very useful. I am using Stata -> pandas in many projects, and every time I wish it was in pandas. Cf. scipy.io for Matlab files. And in R

http://cran.r-project.org/web/packages/foreign/index.html

@ghost
Copy link

ghost commented Apr 7, 2013

If the same functionality is coming at the same time from two different sources aimed
at pandas, I think that's a good enough indicator there's a need.

Would help if the two developers could reach a concensus re the two branches.

@jseabold
Copy link
Contributor

jseabold commented Apr 7, 2013

My vote is merge this given it looks ok on your end. It doesn't look like the StataReader code is really much different than the statsmodels version, except the encoding, labels handling, and the use of Categorical (is this a correct reading of the changes?), so updating the optimization based on this won't be difficult. I can go ahead and deprecate the statsmodels version for our next release when I need to. It might be polite to acknowledge the original author Joe's work and my changes as well, though I'm not overly concerned about this.

@PKEuS I noticed you removed the format checks. Did you test this on data created by early stata versions? I was never able to hunt down an old printed manual to see how the spec was different with ds_format < 113, but I was never able to read these datasets properly.

@PKEuS
Copy link
Contributor Author

PKEuS commented Apr 7, 2013

I noticed you removed the format checks.

No, they are still there. It will print a message if format is not 113, 114 or 115. See method StataReader._read_header() (line 2559).

It might be polite to acknowledge the original author Joe's work and my changes as well, though I'm not overly concerned about this.

Sure. Should I add this as a comment above the parser implementation?

@jseabold
Copy link
Contributor

jseabold commented Apr 7, 2013

Ah, ok. I see it now. Wishful thinking. FWIW, I've been told that the old spec is published if you ever come across an old Stata manual lying around somewhere, so be on the lookout. There are a lot of old textbook example datasets that have not been updated.

I usually see an

Authors
----------

section in the module docstring, but I don't know if it's worth it. I usually do it as a courtesy. I don't think there is anything in the license that says you have to. FWIW, looking back at my e-mail Joe changed the license of his original code to MIT, so we could include it in statsmodels.

@hmgaudecker
Copy link

@jseabold: Do you have examples of old datasets floating around that don't work currently? At least for 113-115, it seems that only fields/formats were added, so it might be close to trivial to get previous ones to work. Will check whether we have old references at the University somewhere; in a first shot I could only get my hands on Stata 9 docs.

@jseabold
Copy link
Contributor

jseabold commented Apr 8, 2013

I recall coming across them when I was going through textbook examples years ago, so some of the old stata press books or maybe Cameron and Travedi. You might find one here

http://stata-press.com/data/glmext3.html

@PKEuS
Copy link
Contributor Author

PKEuS commented Apr 17, 2013

I am working on implementing support for Stata formats 104, 105 and 108 (that are the formats I found .dta files for in the Cameron and Travedi files). Does the license of this datasets allow using them as unit tests?

@jseabold
Copy link
Contributor

@PKEuS Great! Re: inclusion, that's the rub with all these datasets. It's all very gray. I was never able to get Pravin Trivedi to respond to my requests. I don't see anywhere that I tried Colin Cameron. Maybe including them as test cases would be okay and not distributing them as examples/ data used anywhere else in the code, though IANAL. It also depends whether they originally compiled the data or if it came from another source. I usually try to track them back to an original author / agency and get express written permission. All data from US Government agencies is public domain.

@PKEuS
Copy link
Contributor Author

PKEuS commented Apr 18, 2013

Reading old stata files works now, however, I didn't add unit tests for them so far. I could add tests without providing the files (just providing a link where to get them) and flag the tests as skip per default. What do you think?

@ghost
Copy link

ghost commented Apr 20, 2013

@PKEuS , that page's header seems to imply those files are there for use with stata.
best avoid them entirely.

If any old stata file can be used as pass/fail but you have no compatibly-license test files,
go ahead and leave in a skipped test.

Is this merge-ready as far as you're concerned?

@PKEuS
Copy link
Contributor Author

PKEuS commented Apr 20, 2013

Besides from missing unit tests for old formats, which I could add as SKIP-tests in an additional commits, I'd consider this to be merge ready. However, there might be small glitches for python2 compatibility which I don't know about, since I was only able to test this in a python3 environment.

@jreback
Copy link
Contributor

jreback commented Apr 20, 2013

setup for Travis testing

http://about.travis-ci.org/docs/user/getting-started/

step 2

then submit a new commit (or rebase)
and watch testing on a slew of different configs

@ghost
Copy link

ghost commented Apr 20, 2013

or git commit --amend -C HEAD to give HEAD a new hash.

@jreback
Copy link
Contributor

jreback commented Apr 20, 2013

I would pull all of this out of io.parsers and make a io.stata module

(you can then have classes like Parser rather than StataParser), but that's optional

@jreback
Copy link
Contributor

jreback commented Apr 20, 2013

@y-p I don't think there is a 'how to setup Travis' anywhere iin contributing/developers page ?

@PKEuS
Copy link
Contributor Author

PKEuS commented Apr 20, 2013

I've enabled Travis now (I just enabled it in the Settings of my fork. yml file is there, so it probably just works now). I can commit the SKIP-test tomorrow evening, and then I'll see what travis says.

I would pull all of this out of io.parsers and make a io.stata module

(you can then have classes like Parser rather than StataParser), but that's optional

Creating a separate io.stata module wouldn't fit to the implementation of the other parsers, would it? Thus, I guess that such a refactorization should be done separately - there are other parsers like excel where it might also be a good idea to move it. But if you intend to shrink io.parsers, I can of course put the StataParsers into a separate file directly.

@ghost
Copy link

ghost commented Apr 20, 2013

@jreback , fixed.

@jreback
Copy link
Contributor

jreback commented Apr 20, 2013

@PKEuS your have to turn it on it Travis too (login there)
then do a new commit (or reset hash as @y-p suggests

all of the various parsers have their own module
excel,ga,hdf
parsers is csv, and fixed width too (which share the same parser)
but it looks your module looks pretty big
I would do it before merging
as I said u don't have to change the names, just where it lives

@PKEuS
Copy link
Contributor Author

PKEuS commented Apr 20, 2013

I'll change the module. But the Excel parser seems to live in io.parsers - maybe a candidate to be moved out as well.

your have to turn it on it Travis too (login there)

Thanks, I had forgotten that.

@jreback
Copy link
Contributor

jreback commented Apr 20, 2013

@PKEuS 2 minor points

  1. for PY3 determination, use this (instead of your definition of PY3)
from pandas.util.py3compat import PY3

if PY3:
....
  1. your testing data should go in io/tests/data and access like:
import pandas.util.testing as tm

self.dirpath = tm.get_data_path()

@jreback
Copy link
Contributor

jreback commented Apr 20, 2013

@PKEuS you are right about the parsers, I was thinking about the test files.....yes, I think excel should be moved out too...(different issue), actually its not so huge, you can leave if you want....I just think that different type of reader/writers should live in separate modules, but that's my 2c

@PKEuS
Copy link
Contributor Author

PKEuS commented Apr 20, 2013

I just think that different type of reader/writers should live in separate modules, but that's my 2c

I agree. I just thought that all parsers live in io.parsers when I started porting the parser - but that was wrong assumption.

  1. for PY3 determination, use this (instead of your definition of PY3)
  2. your testing data should go in io/tests/data and access like:

Yes, I'll fix that tomorrow

@jreback
Copy link
Contributor

jreback commented Apr 20, 2013

@PKEuS great, start a trend!

think excel should be refactored out too, will make an issue for that (its just more manageable esp when have reader /writers - no namespace collisions)

@PKEuS
Copy link
Contributor Author

PKEuS commented Apr 22, 2013

Done

@ghost
Copy link

ghost commented Apr 22, 2013

@jreback , are you planning to merge this for 0.11?

@jreback
Copy link
Contributor

jreback commented Apr 22, 2013

@y-p looks ok to me, @jseabold ?

@jseabold
Copy link
Contributor

Ok by me. Is there a read_dta or read_stata in the main pandas namespace? Might consider it even though it's a bit of a niche function. Things like this and the DataReader stuff tend to get underutilized IMO because of a lack of visibility.

@PKEuS
Copy link
Contributor Author

PKEuS commented Apr 22, 2013

There is a method DataFrame.from_dta()

@jreback
Copy link
Contributor

jreback commented Apr 22, 2013

@jseabold that's #3411, namespaces for io functions

(and will probably just attach the read_xxx to the apropriate classes)

@ghost
Copy link

ghost commented Apr 22, 2013

I'm -1 on adding new stuff after the RC and half an hour before the final release,
sort of makes the whole thing pointless. No matter. just tag it "experimental" in RELEASE.rst?

@jreback
Copy link
Contributor

jreback commented Apr 22, 2013

I am not sure we are doing a 0.11.1......so....just tag experimental (its not imported by default anywhere right?)
@PKEuS (just in DataFrame.from_stata)?

@PKEuS
Copy link
Contributor Author

PKEuS commented Apr 22, 2013

It is just imported in DataFrame.from_stata(), if that answers your question.

@ghost
Copy link

ghost commented Apr 27, 2013

we are doing a 0.11.1, and I think this'll wait until then before merged into master.

@jreback
Copy link
Contributor

jreback commented May 15, 2013

@PKEuS can you rebase this on top of master....then can put it in 0.11.1, unless any objections (as I think this is pretty independent)

@jreback
Copy link
Contributor

jreback commented May 15, 2013

also change release notes/whatsnew to v0.11.1

@PKEuS
Copy link
Contributor Author

PKEuS commented May 15, 2013

Done

@jreback
Copy link
Contributor

jreback commented May 15, 2013

anyone have any issues with merging in 0.11.1
seems pretty independent to me

@y-p @wesm @jseabold

did anyone want a read_dta in the main namespace? could wait for 0.12 for that in any event

@jreback
Copy link
Contributor

jreback commented May 15, 2013

@PKEuS pls add a section in io.rst showing reading and writing (take a frame, write it, and read back in), to show how to use

also can you rebase to a few number of commits (you can squash some), if possible

@jseabold
Copy link
Contributor

Did you have any commits in mind for squashing? FWIW, I don't find the number of commits to be all that problematic at 9. It's clear to me what each one does in case I want to backport any changes to statsmodels while we're working on it.

@jreback
Copy link
Contributor

jreback commented May 15, 2013

@jseabold oh...nothing specific, just try to squash whenver possible, but you have a valid point

PKEuS added 6 commits May 16, 2013 07:07
- Moved unit test data to tests/data
- Added unit testing for old Stata formats (SKIP per default)
- Use py3compat.PY3 instead of own implementation
- Don't call encode/decode on python2
- Added .dta type to setup.py
- Fixed null byte
@PKEuS
Copy link
Contributor Author

PKEuS commented May 16, 2013

I squashed several bugfixing commits. io.rst is updated.

@jreback jreback merged commit 4f60da9 into pandas-dev:master May 16, 2013
@jreback
Copy link
Contributor

jreback commented May 16, 2013

thanks! this is great

@jreback
Copy link
Contributor

jreback commented May 17, 2013

@PKEuS I made a small edit to the docs; discovered that on read-back the frame has a numeric index, and the index field, is that correct?

take at look at the dev docs after 5pm today as well: http://pandas.pydata.org/pandas-docs/dev/io.html

@jreback
Copy link
Contributor

jreback commented May 22, 2013

@PKEuS you want to join on #3641 ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants