ENH: Fix for #1512, added StataReader and StataWriter to pandas.io.parsers #3270

Merged
merged 6 commits into from

6 participants

PKEuS Skipper Seabold y-p Hans-Martin jreback Wes McKinney
PKEuS

This pull request aims at fixing ticket #1512 and contains both a reader and a writer for Stata .dta files.

The code basically comes from the statsmodels project; however, I adapted it to the needs of pandas and implemented support for reading out Stata value labels. The writer does not write those labels back.

Skipper Seabold
Collaborator

FYI, I have a student working on re-writing StataReader and StataWriter in Cython due to reports of their being slow. We might want to hold off on merging this until the rewrite. I don't know, though; it's up to the pandas devs.

y-p
y-p commented

@jseabold, is there a timeline for that? It would help make a decision.
@PKEuS, how different is the API here compared to statsmodels? If it's faithful, there is probably no harm
in merging the current version soon and moving to a faster version when it's available.

I know there's an agreement on not having reverse deps from pandas on statsmodels;
maybe having a pandas export from statsmodels for Stata files would make more sense
than a Stata import in pandas? Less duplication, and once statsmodels becomes faster,
this'll work too.

right now, the supported pandas IO file formats are all "vanilla" - csv, xls (well...), fixed width.

PKEuS

how different is the API here compared to statsmodels?

The API is different for several reasons:

  • Support for parsing value labels required extending the old API
  • Integration into the pandas API (see the from_dta method on DataFrame) and removing non-pandas code required some adaptations
  • Stylistic reasons: I tried to create an overall more consistent interface providing methods to get all important metadata from the Stata file.

The way the parser works, however, is basically the same.

I don't know whether the performance of this parser has improved compared to the old statsmodels parser. An improvement is conceivable, though, due to some small changes I made to the way data is written into the DataFrame.
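For reference, here is a rough sketch of how the reader on this branch is meant to be used. The names StataReader and DataFrame.from_dta are taken from this thread; the exact call signatures and the value-label accessor shown here are illustrative and may differ from the actual code.

# Illustrative sketch only; names from this thread, signatures may differ.
from pandas import DataFrame
from pandas.io.parsers import StataReader  # moved to pandas.io.stata later in this PR

reader = StataReader('example.dta')
df = reader.data()              # parse the observations into a DataFrame
labels = reader.value_labels()  # illustrative accessor for the value-label metadata

# or via the convenience classmethod added to DataFrame
df = DataFrame.from_dta('example.dta')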

Skipper Seabold
Collaborator

Should be done within the month. That's when the projects are due at least... This is a learning experience, so there might be some unforeseen stumbling blocks. FWIW, I plan to deprecate this from statsmodels given that it will be available in pandas, so we had planned on the Cython code making its way into pandas as well. Shouldn't be any need for a dependency.

Hans-Martin

I should think this belongs in pandas rather than statsmodels -- pure data I/O, nothing to do with statistics/econometrics except for the fact that it handles the data format that is the near-standard in econometrics. @y-p: I don't get the point about "vanilla" formats: there are already sql, hdf5, ... Then there is the ability to read R dataframes, which just seems to live in a different place.

I cannot say too much about the statsmodels fork of the StataReader as I always worked with the original one (at some point, the statsmodels version converted everything into floats, which didn't work for me). The current version keeps the original data formats and adds some goodies in terms of carrying metadata along (I should add that PKEuS is my student).

@jseabold: Happy to join forces here once you have produced something that gets rid of bottlenecks. I would guess that looping through the dataset would be the most crucial one.

y-p
y-p commented

Stata files are a vendor-specific format for a proprietary stats package; I don't think
HDF5/sql fall into the same category. R does, but it is a prominent open-source solution.
Admittedly it's a blurry line, and I don't have an objection to this, just points to consider.

Skipper Seabold
Collaborator

Foreign I/O seems very much to be a useful feature for pandas. It doesn't require any dependencies, and I'd find this very useful. I am using Stata -> pandas in many projects, and every time I wish it were in pandas. Cf. scipy.io for Matlab files, and in R:

http://cran.r-project.org/web/packages/foreign/index.html

y-p
y-p commented

If the same functionality is coming at the same time from two different sources aimed
at pandas, I think that's a good enough indicator there's a need.

It would help if the two developers could reach a consensus regarding the two branches.

Skipper Seabold
Collaborator

My vote is to merge this, given it looks ok on your end. It doesn't look like the StataReader code is really much different from the statsmodels version, except for the encoding, the labels handling, and the use of Categorical (is this a correct reading of the changes?), so updating the optimization based on this won't be difficult. I can go ahead and deprecate the statsmodels version for our next release when I need to. It might be polite to acknowledge the original author Joe's work and my changes as well, though I'm not overly concerned about this.

@PKEuS I noticed you removed the format checks. Did you test this on data created by early Stata versions? I was never able to hunt down an old printed manual to see how the spec differed for ds_format < 113, and I was never able to read those datasets properly.

PKEuS

I noticed you removed the format checks.

No, they are still there. A message is printed if the format is not 113, 114, or 115; see the method StataReader._read_header() (line 2559) and the sketch below.

It might be polite to acknowledge the original author Joe's work and my changes as well, though I'm not overly concerned about this.

Sure. Should I add this as a comment above the parser implementation?
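(Regarding the format check: it amounts to roughly the following. This is a sketch, not the actual pandas source; in the older binary .dta layouts the first byte of the file encodes the format/release number.)

# Sketch of the version check described above (not the actual implementation).
import struct

with open('example.dta', 'rb') as f:
    ds_format = struct.unpack('b', f.read(1))[0]
if ds_format not in (113, 114, 115):
    print("Unsupported or untested .dta format: %d" % ds_format)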

Skipper Seabold
Collaborator

Ah, ok. I see it now. Wishful thinking. FWIW, I've been told that the old spec is published, so if you ever come across an old Stata manual lying around somewhere, be on the lookout. There are a lot of old textbook example datasets that have not been updated.

I usually see an

Authors
----------

section in the module docstring, but I don't know if it's worth it. I usually do it as a courtesy. I don't think there is anything in the license that says you have to. FWIW, looking back at my e-mail Joe changed the license of his original code to MIT, so we could include it in statsmodels.

Hans-Martin

@jseabold: Do you have examples of old datasets floating around that don't work currently? At least for 113-115, it seems that only fields/formats were added, so it might be close to trivial to get the previous ones to work. I will check whether we have old references at the university somewhere; in a first pass I could only get my hands on the Stata 9 docs.

Skipper Seabold
Collaborator

I recall coming across them when I was going through textbook examples years ago, so try some of the old Stata Press books or maybe Cameron and Trivedi. You might find one here:

http://stata-press.com/data/glmext3.html

PKEuS

I am working on implementing support for Stata formats 104, 105 and 108 (those are the formats I found .dta files for among the Cameron and Trivedi files). Does the license of these datasets allow using them as unit tests?

Skipper Seabold
Collaborator

@PKEuS Great! Re: inclusion, that's the rub with all these datasets. It's all very gray. I was never able to get Pravin Trivedi to respond to my requests, and I don't see anywhere that I tried Colin Cameron. Maybe including them as test cases would be okay, as long as they are not distributed as examples or data used anywhere else in the code, though IANAL. It also depends on whether they originally compiled the data or whether it came from another source. I usually try to track them back to an original author / agency and get express written permission. All data from US Government agencies is public domain.

PKEuS

Reading old Stata files works now; however, I haven't added unit tests for them so far. I could add tests without providing the files (just a link to where to get them) and flag the tests as skipped by default. What do you think?

y-p

@PKEuS, that page's header seems to imply those files are there for use with Stata;
best avoid them entirely.

If any old Stata file can be used as a pass/fail check but you have no compatibly-licensed test files,
go ahead and leave in a skipped test.

Is this merge-ready as far as you're concerned?

PKEuS

Aside from the missing unit tests for old formats, which I could add as SKIP tests in an additional commit, I'd consider this merge-ready. However, there might be small glitches in Python 2 compatibility that I don't know about, since I was only able to test this in a Python 3 environment.
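For reference, a skipped-by-default test could look roughly like the following. This is a sketch assuming the nose-based test suite pandas used at the time; the file name and path are placeholders, and the sample file itself would not be distributed with pandas.

# Sketch of a skipped-by-default test for an old Stata format; the sample
# file is not shipped, only a hint on where to obtain it is given.
import os
import unittest
import nose

class TestOldStataFormats(unittest.TestCase):

    def test_read_format_105(self):
        path = os.path.join(os.path.dirname(__file__), 'data', 'old_105.dta')
        if not os.path.exists(path):
            raise nose.SkipTest('old-format sample file not available; '
                                'see the linked page for where to obtain one')
        # parse the file here and compare against known values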

jreback
Owner

setup for Travis testing

http://about.travis-ci.org/docs/user/getting-started/

step 2

then submit a new commit (or rebase)
and watch testing on a slew of different configs

y-p

or git commit --amend -C HEAD to give HEAD a new hash.

jreback
Owner

I would pull all of this out of io.parsers and make an io.stata module

(you can then have classes like Parser rather than StataParser), but that's optional

jreback
Owner

@y-p I don't think there is a 'how to set up Travis' anywhere in the contributing/developers page?

PKEuS

I've enabled Travis now (I just enabled it in the settings of my fork; the .yml file is there, so it probably just works now). I can commit the SKIP tests tomorrow evening, and then I'll see what Travis says.

I would pull all of this out of io.parsers and make a io.stata module

(you can then have classes like Parser rather than StataParser), but that's optional

Creating a separate io.stata module wouldn't fit with the implementation of the other parsers, would it? So I guess such a refactoring should be done separately - there are other parsers, like Excel, that it might also be a good idea to move out. But if you intend to shrink io.parsers, I can of course put the Stata parsers into a separate file directly.

y-p

@jreback , fixed.

jreback
Owner

@PKEuS you have to turn it on in Travis too (log in there),
then do a new commit (or reset the hash as @y-p suggests).

All of the various parsers have their own module:
excel, ga, hdf.
parsers is csv, and fixed width too (which share the same parser),
but your module looks pretty big.
I would do it before merging;
as I said, you don't have to change the names, just where it lives.

PKEuS

I'll change the module. But the Excel parser seems to live in io.parsers - maybe a candidate to be moved out as well.

you have to turn it on in Travis too (log in there)

Thanks, I had forgotten that.

jreback
Owner

@PKEuS 2 minor points

1) for PY3 determination, use this (instead of your definition of PY3)

from pandas.util.py3compat import PY3

if PY3:
....

2) your testing data should go in io/tests/data and be accessed like:

import pandas.util.testing as tm

self.dirpath = tm.get_data_path()
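In context, that would look roughly like the following in the test module (a sketch; the class name and file names are illustrative):

# Sketch of a test case picking up its files from io/tests/data.
import os
import unittest
import pandas.util.testing as tm

class TestStata(unittest.TestCase):

    def setUp(self):
        self.dirpath = tm.get_data_path()
        self.dta1 = os.path.join(self.dirpath, 'stata1.dta')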
jreback
Owner

@PKEuS you are right about the parsers, I was thinking about the test files. Yes, I think excel should be moved out too (different issue). Actually it's not so huge, you can leave it if you want. I just think that different types of readers/writers should live in separate modules, but that's my 2c.

PKEuS

I just think that different types of readers/writers should live in separate modules, but that's my 2c

I agree. I just thought that all parsers live in io.parsers when I started porting the parser - but that was a wrong assumption.

1) for PY3 determination, use this (instead of your definition of PY3)
2) your testing data should go in io/tests/data and access like:

Yes, I'll fix that tomorrow

jreback
Owner

@PKEuS great, start a trend!

I think excel should be refactored out too; I will make an issue for that (it's just more manageable, especially when you have readers/writers - no namespace collisions)

jreback
Owner

@PKEuS also add a mention in v0.11.0.txt and RELEASE.rst

Wes McKinney
Owner

+1 on moving things to pandas/io/stata.py and pandas/io/excel.py. I'm also weakly -1 on adding an additional class method to DataFrame, but "what's another method, anyway". Are we rolling this into 0.11 then?

jreback
Owner

How about an io.api,

where the read_xxx functions live (and are then exported to the main pandas namespace)?
We still need instance methods for to_xxx in any event (some of which live in core.generic but most in core.frame).
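As a sketch of what such a layout could look like (a hypothetical io/api.py; read_excel and read_stata here are illustrative wrappers, not functions this PR adds):

# Hypothetical pandas/io/api.py collecting the read_* entry points so they
# can be re-exported from the top-level pandas namespace.
from pandas.io.parsers import read_csv, read_table, read_fwf
from pandas.io.excel import read_excel   # assumes excel gets its own module
from pandas.io.stata import read_stata   # illustrative wrapper around StataReader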

jreback
Owner

@PKEuS how's this going?

jreback
Owner

you need to take this commit out: 598b1a566760f7994aace5cf58a6fa4d70609308, as it's a merge from master (meaning it will pollute the history of the tree);
just rebase your tree and take it out, then force push

PKEuS

Travis is running now, and most of the failures are caused by the test datasets not being found.

IOError: [Errno 2] No such file or directory: '/home/travis/virtualenv/python3.2_with_system_site_packages/lib/python3.2/site-packages/pandas-0.11.0rc1-py3.2-linux-x86_64.egg/pandas/io/tests/data/stata4.dta'

Same for stata1-stata3, and on all Python versions. Does anybody have an idea what the reason for this problem could be? The datasets are in the repository in pandas/io/tests/data/, as they should be.

jreback
Owner

is your master up to date?

e.g. git pull --rebase

if this doesn't work, then it's not a tracking branch

git br yourbranch --set-upstream master

git checkout master
git pull
git checkout yourbranch
git pull --rebase
Skipper Seabold
Collaborator

Do you need to add them in setup.py so that the data files get installed?

jreback
Owner

that should be done (in prior commits); it adds all the io/tests/data dirs

jreback
Owner

@PKEuS I just cloned your branch; you need to update to master and all will be well.

git log should show commits from today (yours shows from April 7)

Skipper Seabold
Collaborator

Not with the *.dta endings IIRC.

jreback
Owner

@jseabold you are right, my bad; we do list specific files, I see.

@PKEuS just edit setup.py as @jseabold suggests (search for pandas.io; it will be clear what to do)
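The relevant change would be along these lines (an illustrative fragment; the actual setup.py entry already lists a number of other extensions):

# Illustrative fragment: make sure *.dta test files get installed with the
# pandas.io.tests package (the real entry lists more extensions).
package_data = {'pandas.io.tests': ['data/*.csv',
                                    'data/*.xls',
                                    'data/*.dta']}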

PKEuS

Rebased, and applied some changes aimed at fixing the Travis problems.

PKEuS

The latest results are much better: Python 3.3 testing now succeeds, as it does on my machine. Python 2-specific problems seem to be gone, too, since the output is pretty much the same as for Python 3.2.
The test failure in test_option_mpl_style also seems to happen in the official repository.

I'll try to find out the reason for the Python 2.6-3.2 failures now by trying to reproduce them with Python 3.2.

PKEuS

It might be fixed now.

y-p

master is fixed now, rebase and try again.

PKEuS

Done. Should pass now.

y-p

it ain't easy being green.

jreback
Owner

@PKEuS can you add a mention in RELEASE & v0.11.0.txt ? (one-liner is fine)

PKEuS

Done

y-p

@jreback , are you planning to merge this for 0.11?

jreback
Owner

@y-p looks ok to me, @jseabold ?

Skipper Seabold
Collaborator

Ok by me. Is there a read_dta or read_stata in the main pandas namespace? Might consider it even though it's a bit of a niche function. Things like this and the DataReader stuff tend to get underutilized IMO because of a lack of visibility.

PKEuS

There is a method DataFrame.from_dta()

jreback
Owner

@jseabold that's #3411, namespaces for io functions

(and we will probably just attach the read_xxx to the appropriate classes)

y-p

I'm -1 on adding new stuff after the RC and half an hour before the final release;
it sort of makes the whole thing pointless. No matter; just tag it "experimental" in RELEASE.rst?

jreback
Owner

I am not sure we are doing a 0.11.1, so just tag it experimental (it's not imported by default anywhere, right?)
@PKEuS (just in DataFrame.from_stata)?

PKEuS

It is just imported in DataFrame.from_stata(), if that answers your question.

y-p

we are doing a 0.11.1, and I think this'll wait until then before being merged into master.

jreback
Owner

@PKEuS can you rebase this on top of master? Then we can put it in 0.11.1, unless there are any objections (I think this is pretty independent)

jreback
Owner

also change release notes/whatsnew to v0.11.1

PKEuS

Done

jreback
Owner

Anyone have any issues with merging this in 0.11.1?
Seems pretty independent to me.

@y-p @wesm @jseabold

Did anyone want a read_dta in the main namespace? That could wait for 0.12 in any event.

jreback
Owner

@PKEuS please add a section in io.rst showing reading and writing (take a frame, write it, and read it back in), to show how to use it - something along the lines of the sketch below.

Also, can you rebase to a smaller number of commits (you can squash some), if possible?
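A minimal round trip along the lines of what the io.rst section could show; this is a sketch using the names from this PR, and the writer signature (StataWriter(path, frame).write_file()) is an assumption that may differ from the actual code.

# Rough round-trip sketch for the docs: write a frame out, read it back.
# Names follow this PR; exact signatures may differ.
import numpy as np
from pandas import DataFrame
from pandas.io.stata import StataWriter

df = DataFrame(np.random.randn(10, 2), columns=list('AB'))
StataWriter('example.dta', df).write_file()

df2 = DataFrame.from_dta('example.dta')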

Skipper Seabold
Collaborator

Did you have any commits in mind for squashing? FWIW, I don't find the number of commits to be all that problematic at 9. It's clear to me what each one does in case I want to backport any changes to statsmodels while we're working on it.

jreback
Owner

@jseabold oh, nothing specific, just try to squash whenever possible, but you have a valid point

PKEuS added some commits
PKEuS  ENH: Added StataReader and StataWriter (#1512)  9a71737
PKEuS  [ENH] Added support for reading Stata formats 104, 105 and 108  703834c
PKEuS  Improvements to StataParser:  a5961bd
       - Moved unit test data to tests/data
       - Added unit testing for old Stata formats (SKIP per default)
       - Use py3compat.PY3 instead of own implementation
PKEuS  Moved StataParser into new module pandas.io.stata  119ee97
PKEuS  Fixed several problems in StataParser with Travis and Python2.  70fed82
       - Don't call encode/decode on python2
       - Added .dta type to setup.py
       - Fixed null byte
PKEuS  Added StataParser to release notes and updated io.rst  4f60da9
PKEuS

I squashed several bugfixing commits. io.rst is updated.

jreback merged commit 4f60da9 into from
jreback
Owner

thanks! this is great

jreback
Owner

@PKEuS I made a small edit to the docs; I discovered that on read-back the frame has a numeric index as well as the index field - is that correct?

take a look at the dev docs after 5pm today as well: http://pandas.pydata.org/pandas-docs/dev/io.html

jreback
Owner

@PKEuS you want to join on #3641 ?
