ENH: Fix for #1512, added StataReader and StataWriter to pandas.io.parsers #3270

PKEuS · 2013-04-07T19:57:24Z

This pull request aims at fixing ticket #1512 and contains both a reader and a writer for Stata .dta files.

The code basically comes from the statsmodels project; however, I adapted it to the needs of pandas and implemented support for reading Stata value labels. The writer does not write those labels back.
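For readers unfamiliar with value labels: a .dta file stores integer codes in the data and a separate table that maps those codes to text. The following is only a minimal sketch of that idea, not the code in this pull request; the function name `apply_value_labels` and the sample label table are hypothetical.

```python
def apply_value_labels(codes, label_table):
    """Replace integer codes with their text labels, leaving unlabeled codes as-is."""
    return [label_table.get(code, code) for code in codes]


# Hypothetical label table for a variable such as "sex".
sex_labels = {1: "male", 2: "female"}

# The code 9 has no label, so it is passed through unchanged.
labeled = apply_value_labels([1, 2, 2, 9], sex_labels)
print(labeled)
```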

jseabold · 2013-04-07T20:00:24Z

FYI, I have a student working on rewriting StataReader and StataWriter in Cython due to reports of their being slow. We might want to hold off on merging this until the rewrite. I don't know, though; it's up to the pandas devs.

ghost · 2013-04-07T20:09:21Z

@jseabold, is there a timeline for that? It would help make a decision.
@PKEuS, how different is the API here compared to statsmodels? If it's faithful, there's probably no harm
in merging the current version soon and moving to a faster version when it's available.

I know there's an agreement not to have reverse deps from pandas on statsmodels;
maybe having a pandas export from statsmodels for Stata files would make more sense
than a Stata import in pandas? Less duplication, and once statsmodels becomes faster,
this'll work too.

right now, the supported pandas IO file formats are all "vanilla" - csv, xls (well...), fixed width.

PKEuS · 2013-04-07T20:27:27Z

how different is the API here compared to statsmodels?

The API is different for several reasons:

- Support for parsing value labels required extending the old API.
- Integration into the pandas API (see the from_dta method in DataFrame) and removing non-pandas code required some adaptations.
- Stylistic reasons: I tried to create an overall more consistent interface, providing methods to get all important metadata from the Stata file.

The way the parser works, however, is basically the same.

I don't know whether the performance of this parser improved compared to the old statsmodels parser. An improvement is conceivable, however, due to some small changes I made to the way data is written into a DataFrame.

jseabold · 2013-04-07T20:34:21Z

Should be done within the month. That's when the projects are due at least... This is a learning experience, so there might be some unforeseen stumbling blocks. FWIW, I plan to deprecate this from statsmodels given that it will be available in pandas, so we had planned on the Cython code making its way into pandas as well. Shouldn't be any need for a dependency.

hmgaudecker · 2013-04-07T20:38:44Z

I should think this belongs in pandas rather than statsmodels -- pure data I/O, nothing to do with statistics/econometrics except for the fact that it handles the data format that is the near standard in econometrics. @y-p: I don't get the point about "vanilla" formats: sql, hdf5, ... Then there is the ability to read R dataframes, which just seems to be in a different place.

I cannot say too much about the statsmodels fork of the StataReader as I always worked with the original one (at some point, the statsmodels version converted everything into floats, which didn't work for me). The current version should keep the original data formats and adds some goodies in terms of carrying metadata along (I should add that PKEuS is my student).

@jseabold: Happy to join forces here once you have produced something that gets rid of bottlenecks. I would guess that looping through the dataset would be the most crucial one.
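To make the "looping through the dataset" point concrete, here is a small sketch of unpacking fixed-width binary records with the standard `struct` module. It is not the StataReader code; the record layout (one int32 followed by one float64, little-endian) is a made-up assumption chosen only for illustration.

```python
import struct

# Hypothetical fixed-width record: one little-endian int32 followed by one float64.
ROW = struct.Struct("<id")

# Build three records in memory to stand in for the data section of a file.
payload = b"".join(ROW.pack(i, i * 0.5) for i in range(3))

# Row-by-row unpacking: one Python-level call per observation.
rows_loop = [
    ROW.unpack_from(payload, offset)
    for offset in range(0, len(payload), ROW.size)
]

# The same records produced by a single iterator over the whole buffer.
rows_iter = list(ROW.iter_unpack(payload))
```

Both variants yield the same tuples; the difference is only in how much per-row work happens in Python, which is the kind of overhead a Cython rewrite would target.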

ghost · 2013-04-07T20:53:33Z

Stata files are a vendor-specific format for a proprietary stats package; I don't think
HDF5/sql fall into the same category. R does, but is a prominent open-source solution.
Admittedly it's a blurry line, and I don't have an objection to this; just points to consider.

jseabold · 2013-04-07T21:00:02Z

Foreign I/O seems very much to be a useful feature for pandas. It doesn't require any dependencies, and I'd find this very useful. I am using Stata -> pandas in many projects, and every time I wish it was in pandas. Cf. scipy.io for Matlab files. And in R

http://cran.r-project.org/web/packages/foreign/index.html

ghost · 2013-04-07T21:02:17Z

If the same functionality is coming at the same time from two different sources aimed
at pandas, I think that's a good enough indicator there's a need.

Would help if the two developers could reach a consensus re the two branches.

jseabold · 2013-04-07T21:18:21Z

My vote is merge this given it looks ok on your end. It doesn't look like the StataReader code is really much different than the statsmodels version, except the encoding, labels handling, and the use of Categorical (is this a correct reading of the changes?), so updating the optimization based on this won't be difficult. I can go ahead and deprecate the statsmodels version for our next release when I need to. It might be polite to acknowledge the original author Joe's work and my changes as well, though I'm not overly concerned about this.

@PKEuS I noticed you removed the format checks. Did you test this on data created by early stata versions? I was never able to hunt down an old printed manual to see how the spec was different with ds_format < 113, but I was never able to read these datasets properly.

PKEuS · 2013-04-07T21:25:25Z

I noticed you removed the format checks.

No, they are still there. It will print a message if format is not 113, 114 or 115. See method StataReader._read_header() (line 2559).
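A check of this kind can be sketched as follows. This is not the body of `StataReader._read_header()`; it assumes that in these older .dta formats the first byte of the file holds the format version, and it uses `warnings.warn` where the pull request prints a message. The names `SUPPORTED_FORMATS` and `read_format_version` are hypothetical.

```python
import io
import warnings

# Format versions named in the comment above.
SUPPORTED_FORMATS = (113, 114, 115)


def read_format_version(fileobj):
    """Return the format byte of a .dta stream, warning if it is not a supported version."""
    first = fileobj.read(1)
    if len(first) != 1:
        raise ValueError("file is empty")
    version = first[0]  # assumption: byte 0 is the format version
    if version not in SUPPORTED_FORMATS:
        warnings.warn("untested .dta format version: %d" % version)
    return version


# In-memory stand-in for a file whose first byte is 114.
version = read_format_version(io.BytesIO(bytes([114, 2, 1, 0])))
```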

It might be polite to acknowledge the original author Joe's work and my changes as well, though I'm not overly concerned about this.

Sure. Should I add this as a comment above the parser implementation?

jseabold · 2013-04-07T21:39:52Z

Ah, ok. I see it now. Wishful thinking. FWIW, I've been told that the old spec is published if you ever come across an old Stata manual lying around somewhere, so be on the lookout. There are a lot of old textbook example datasets that have not been updated.

I usually see an

Authors
----------

section in the module docstring, but I don't know if it's worth it. I usually do it as a courtesy. I don't think there is anything in the license that says you have to. FWIW, looking back at my e-mail Joe changed the license of his original code to MIT, so we could include it in statsmodels.

hmgaudecker · 2013-04-08T08:37:59Z

@jseabold: Do you have examples of old datasets floating around that don't work currently? At least for 113-115, it seems that only fields/formats were added, so it might be close to trivial to get previous ones to work. Will check whether we have old references at the University somewhere; in a first shot I could only get my hands on Stata 9 docs.

jseabold · 2013-04-08T13:02:13Z

I recall coming across them when I was going through textbook examples years ago, so some of the old Stata Press books or maybe Cameron and Trivedi. You might find one here

http://stata-press.com/data/glmext3.html

PKEuS · 2013-04-17T16:29:04Z

I am working on implementing support for Stata formats 104, 105 and 108 (those are the formats I found .dta files for in the Cameron and Trivedi files). Does the license of these datasets allow using them as unit tests?

jseabold · 2013-04-17T16:43:03Z

@PKEuS Great! Re: inclusion, that's the rub with all these datasets. It's all very gray. I was never able to get Pravin Trivedi to respond to my requests. I don't see anywhere that I tried Colin Cameron. Maybe including them as test cases would be okay and not distributing them as examples/ data used anywhere else in the code, though IANAL. It also depends whether they originally compiled the data or if it came from another source. I usually try to track them back to an original author / agency and get express written permission. All data from US Government agencies is public domain.

PKEuS · 2013-04-18T09:40:33Z

Reading old Stata files works now; however, I haven't added unit tests for them so far. I could add tests without providing the files (just providing a link to where to get them) and flag the tests as skipped by default. What do you think?

ghost · 2013-04-20T16:32:38Z

@PKEuS, that page's header seems to imply those files are there for use with Stata.
Best avoid them entirely.

If any old Stata file can be used as pass/fail but you have no compatibly licensed test files,
go ahead and leave in a skipped test.
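One way to leave in a test that is skipped unless someone supplies the file locally is sketched below with the standard `unittest` module. The path `OLD_DTA_PATH` and the test class are hypothetical, and the assertion assumes the first byte of such a file is its format version; the tests actually added in this pull request may look different.

```python
import os
import unittest

# Hypothetical location of an old-format file that is not shipped with the tests.
OLD_DTA_PATH = os.path.join("data", "old_format_104.dta")


class TestOldStataFormats(unittest.TestCase):
    @unittest.skipUnless(os.path.exists(OLD_DTA_PATH),
                         "old-format .dta file not available")
    def test_read_format_104(self):
        with open(OLD_DTA_PATH, "rb") as handle:
            # Assumption: the first byte of these files is the format version.
            self.assertEqual(handle.read(1)[0], 104)


# Run the suite; without the file the test is reported as skipped, not failed.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestOldStataFormats)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```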

Is this merge-ready as far as you're concerned?

PKEuS · 2013-04-20T19:08:29Z

Apart from the missing unit tests for old formats, which I could add as SKIP tests in an additional commit, I'd consider this to be merge-ready. However, there might be small glitches in Python 2 compatibility that I don't know about, since I was only able to test this in a Python 3 environment.

jreback · 2013-04-20T19:12:37Z

setup for Travis testing

http://about.travis-ci.org/docs/user/getting-started/

step 2

then submit a new commit (or rebase)
and watch testing on a slew of different configs

ghost · 2013-04-20T19:17:06Z

or git commit --amend -C HEAD to give HEAD a new hash.

jreback · 2013-04-20T19:17:14Z

I would pull all of this out of io.parsers and make an io.stata module

(you can then have classes like Parser rather than StataParser), but that's optional

jreback · 2013-04-20T19:29:41Z

@y-p I don't think there is a 'how to set up Travis' anywhere in the contributing/developers page?

PKEuS · 2013-04-20T19:33:02Z

I've enabled Travis now (I just enabled it in the Settings of my fork. yml file is there, so it probably just works now). I can commit the SKIP-test tomorrow evening, and then I'll see what travis says.

I would pull all of this out of io.parsers and make an io.stata module

(you can then have classes like Parser rather than StataParser), but that's optional

Creating a separate io.stata module wouldn't fit the implementation of the other parsers, would it? Thus, I guess that such a refactoring should be done separately - there are other parsers like excel where it might also be a good idea to move them. But if you intend to shrink io.parsers, I can of course put the StataParsers into a separate file directly.

ghost · 2013-04-20T19:37:45Z

@jreback , fixed.

jreback · 2013-04-20T19:43:24Z

@PKEuS you have to turn it on in Travis too (log in there)
then do a new commit (or reset the hash as @y-p suggests)

all of the various parsers have their own module
excel,ga,hdf
parsers is csv, and fixed width too (which share the same parser)
but it looks like your module is pretty big
I would do it before merging
as I said u don't have to change the names, just where it lives

PKEuS · 2013-04-20T19:54:55Z

I'll change the module. But the Excel parser seems to live in io.parsers - maybe a candidate to be moved out as well.

you have to turn it on in Travis too (log in there)

Thanks, I had forgotten that.

jreback · 2013-04-20T19:55:11Z

@PKEuS 2 minor points

for PY3 determination, use this (instead of your definition of PY3)

from pandas.util.py3compat import PY3

if PY3:
    ...

your testing data should go in io/tests/data and access like:

import pandas.util.testing as tm

self.dirpath = tm.get_data_path()
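For context on the first point, a shared flag like `PY3` is usually a one-line version check that other code branches on. The sketch below is a stand-in and not the contents of `pandas.util.py3compat`; the helper `decode_label` and its default encoding are hypothetical and only illustrate skipping the decode step on Python 2.

```python
import sys

# Stand-in for the shared flag; the real one is imported from pandas.util.py3compat.
PY3 = sys.version_info[0] >= 3


def decode_label(raw, encoding="latin-1"):
    """Cut a null-padded byte string at the first null and decode it on Python 3 only."""
    raw = raw.partition(b"\x00")[0]
    if PY3:
        return raw.decode(encoding)
    return raw  # on Python 2 the byte string is already a str


label = decode_label(b"price\x00\x00\x00")
```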

jreback · 2013-04-20T19:57:08Z

@PKEuS you are right about the parsers; I was thinking about the test files... Yes, I think excel should be moved out too (different issue). Actually it's not so huge, so you can leave it if you want. I just think that different types of readers/writers should live in separate modules, but that's my 2c

PKEuS · 2013-04-20T20:05:22Z

I just think that different types of readers/writers should live in separate modules, but that's my 2c

I agree. I just thought that all parsers lived in io.parsers when I started porting the parser - but that was a wrong assumption.

for PY3 determination, use this (instead of your definition of PY3)

your testing data should go in io/tests/data and access like:

Yes, I'll fix that tomorrow

jreback · 2013-04-20T20:08:42Z

@PKEuS great, start a trend!

I think excel should be refactored out too; will make an issue for that (it's just more manageable, esp. when you have readers/writers - no namespace collisions)

PKEuS · 2013-04-22T19:42:53Z

Done

ghost · 2013-04-22T19:49:28Z

@jreback , are you planning to merge this for 0.11?

jreback · 2013-04-22T19:50:30Z

@y-p looks ok to me, @jseabold ?

jseabold · 2013-04-22T19:53:21Z

Ok by me. Is there a read_dta or read_stata in the main pandas namespace? Might consider it even though it's a bit of a niche function. Things like this and the DataReader stuff tend to get underutilized IMO because of a lack of visibility.

PKEuS · 2013-04-22T19:55:23Z

There is a method DataFrame.from_dta()

jreback · 2013-04-22T19:56:47Z

@jseabold that's #3411, namespaces for io functions

(and will probably just attach the read_xxx to the appropriate classes)

ghost · 2013-04-22T19:59:09Z

I'm -1 on adding new stuff after the RC and half an hour before the final release,
sort of makes the whole thing pointless. No matter. just tag it "experimental" in RELEASE.rst?

jreback · 2013-04-22T20:02:07Z

I am not sure we are doing a 0.11.1... so just tag it experimental (it's not imported by default anywhere, right?)
@PKEuS (just in DataFrame.from_stata)?

PKEuS · 2013-04-22T20:06:18Z

It is just imported in DataFrame.from_stata(), if that answers your question.

ghost · 2013-04-27T20:21:08Z

we are doing a 0.11.1, and I think this'll wait until then before being merged into master.

jreback · 2013-05-15T18:26:14Z

@PKEuS can you rebase this on top of master....then can put it in 0.11.1, unless any objections (as I think this is pretty independent)

jreback · 2013-05-15T18:39:15Z

also change release notes/whatsnew to v0.11.1

PKEuS · 2013-05-15T21:51:59Z

Done

jreback · 2013-05-15T22:31:26Z

anyone have any issues with merging in 0.11.1
seems pretty independent to me

@y-p @wesm @jseabold

did anyone want a read_dta in the main namespace? could wait for 0.12 for that in any event

jreback · 2013-05-15T22:36:40Z

@PKEuS pls add a section in io.rst showing reading and writing (take a frame, write it, and read back in), to show how to use

also, can you rebase to a smaller number of commits (you can squash some), if possible
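The write-then-read-back round trip requested for io.rst could look roughly like the sketch below. It uses `DataFrame.to_stata` and `pandas.read_stata` as they exist in current pandas releases, which is an assumption on my part and may not match the method names discussed in this thread; the file name and sample frame are made up.

```python
import os
import tempfile

import pandas as pd

# A small frame to write out and read back.
df = pd.DataFrame({"x": [1.5, 2.5, 3.5], "y": [10, 20, 30]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.dta")
    # write_index=False keeps the index out of the file.
    df.to_stata(path, write_index=False)
    roundtrip = pd.read_stata(path)

print(roundtrip)
```

The values survive the round trip, though integer columns may come back with a narrower dtype than they started with, so comparing values rather than dtypes is the safer check.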

jseabold · 2013-05-15T22:40:20Z

Did you have any commits in mind for squashing? FWIW, I don't find the number of commits to be all that problematic at 9. It's clear to me what each one does in case I want to backport any changes to statsmodels while we're working on it.

jreback · 2013-05-15T22:48:05Z

@jseabold oh...nothing specific, just try to squash whenever possible, but you have a valid point

PKEuS · 2013-05-16T05:47:49Z

I squashed several bugfixing commits. io.rst is updated.

jreback · 2013-05-16T11:34:26Z

thanks! this is great

jreback · 2013-05-17T19:51:35Z

@PKEuS I made a small edit to the docs; I discovered that on read-back the frame has both a numeric index and an index field. Is that correct?

take a look at the dev docs after 5pm today as well: http://pandas.pydata.org/pandas-docs/dev/io.html

jreback · 2013-05-22T15:59:45Z

@PKEuS you want to join on #3641 ?

PKEuS added 6 commits May 16, 2013 07:07

ENH: Added StataReader and StataWriter (pandas-dev#1512)

9a71737

[ENH] Added support for reading Stata formats 104, 105 and 108

703834c

Improvements to StataParser:

a5961bd

- Moved unit test data to tests/data - Added unit testing for old Stata formats (SKIP per default) - Use py3compat.PY3 instead of own implementation

Moved StataParser into new module pandas.io.stata

119ee97

Fixed several problems in StataParser with Travis and Python2.

70fed82

- Don't call encode/decode on python2 - Added .dta type to setup.py - Fixed null byte

Added StataParser to release notes and updated io.rst

4f60da9

jreback merged commit 4f60da9 into pandas-dev:master May 16, 2013

jreback mentioned this pull request May 16, 2013

add ability to read Stata files #1512

Closed

jreback mentioned this pull request May 18, 2013

BUG: is StataReader supposed to assign the index? #3641

Closed
