
ENH Read Stata dta file incrementally #9493

Merged: 1 commit merged into pandas-dev:master on Mar 5, 2015

Conversation

kshedden
Contributor

This PR adds a get_chunk method to StataReader to allow files to be read incrementally. This is quite useful when processing large files that can barely fit into memory.

The interface is modeled after the analogous method in TextFileReader.

There are some limitations when incrementally converting categoricals in pre-117 format files. I have a workaround that involves reading through the file once to get the label information, then re-reading and converting using a new function called convert_categoricals.

In a fair amount of testing, I always get identical results using get_chunk and reading the file completely.

I left the existing data(...) method for reading the complete file intact. Potentially it could be changed to simply call get_chunk.

A few additional issues:

  1. The PR fixes an apparent bug in _stata_elapsed_date_to_datetime_vec, in all three of the nested convert_xxx functions. In some execution paths, a new Series is created without using the index of the Series passed as an argument. This can cause incorrect NA detection when the result is indexed using bad_locs.
  2. A modest performance improvement (~40% for me) is obtained by using partition inside _null_terminate in place of index/slice operations.
  3. Different date object types may be returned during date conversions, depending on the execution path. These can represent identical dates but test as different under assert_frame_equal, so a bit of extra care is required in some of the tests.
  4. It would be easy to add an iterator to read the chunks, but I didn't do that. You need to call get_chunk directly.
  5. Some dtypes may change from chunk to chunk. For example, I believe pandas promotes integers to floats when there are missing values, so a chunk with no missing values would have an int dtype, but the same column would become float in a chunk that has missing values.
  6. A comment separate from this PR: the slowest step for reading the big data files I work with is string decoding and truncation (millions of calls to _null_terminate). Other than the modest gain from partition, I tried and failed to find a good optimization for this.
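A minimal sketch of the incremental-read pattern described above, using the chunksize form of the interface (the file path and data here exist only for the demo):

```python
import os
import tempfile

import pandas as pd

# Write a small .dta file so the demo is self-contained.
path = os.path.join(tempfile.mkdtemp(), "demo.dta")
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": list("wxyz")})
df.to_stata(path, write_index=False)

# Read it back two rows at a time instead of all at once.
with pd.read_stata(path, chunksize=2) as reader:
    chunks = [chunk for chunk in reader]

result = pd.concat(chunks, ignore_index=True)
```

Concatenating the chunks should reproduce the full-file read, which is what the testing above verifies.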
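The index issue in item 1 can be shown in miniature (the variable names here are illustrative, not the actual pandas internals):

```python
import numpy as np
import pandas as pd

# A series with a non-default index, as can arise inside a larger frame.
dates = pd.Series([100.0, np.nan, 200.0], index=[7, 8, 9])
bad_locs = np.isnan(dates)  # boolean Series indexed [7, 8, 9]

# Bug pattern: rebuilding from .values drops the original index, so the
# new Series gets a fresh RangeIndex and no longer aligns with bad_locs.
misaligned = pd.Series(dates.values)  # index [0, 1, 2]

# Fix pattern: carry the caller's index through the conversion, so that
# boolean indexing with bad_locs hits the intended rows.
converted = pd.Series(dates.values, index=dates.index)
converted[bad_locs] = -1.0
```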
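Item 2's partition trick, in simplified form (the real _null_terminate also handles decoding; this only shows the byte-string handling):

```python
def null_terminate_find(s: bytes) -> bytes:
    # Original approach: locate the null byte, then slice.
    pos = s.find(b"\0")
    return s if pos == -1 else s[:pos]

def null_terminate_partition(s: bytes) -> bytes:
    # partition does the search and the split in one call; over millions
    # of strings the saved Python-level work is noticeable.
    return s.partition(b"\0")[0]
```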

@jreback jreback added the IO Stata read_stata, to_stata label Feb 15, 2015
@jreback jreback added this to the 0.16.0 milestone Feb 15, 2015
@jreback jreback mentioned this pull request Feb 15, 2015
@jreback
Contributor

jreback commented Feb 15, 2015

@kshedden this is a good idea. See the related issue #9496. We want to follow a common pattern for the invocations of this. The other 3 io libs that do this already have a pretty well-defined interface (e.g. they take chunksize/iterator arguments, and provide an iterator back to the user), so this is pretty transparent.

So if you'd take a look and follow this common pattern, that would be great (a separate issue is actually to consolidate this into a base class). You are welcome to start there if it's easier. We can then build off your effort.

lmk and we can then examine your proposal in more detail.

cc @bashtage

raise ValueError("For format versions < 117, `convert_categoricals` must be set to False. See docstring for details.")

first_chunk = False
if not hasattr(self, '_not_first_chunk'):

Adding attributes as a substitute for a bool doesn't seem like a good idea. Why not just set _first_chunk=True in __init__? If the iterator is not used, it will just be ignored.

@bashtage
Contributor

I left the existing data(...) method for reading the complete file intact. Potentially it could be changed to simply call get_chunk.

It makes sense to replace this method, just to avoid quite a bit of repeated code. Maybe `chunksize=None` would indicate reading the entire file?

@jreback
Contributor

jreback commented Feb 15, 2015

typically we have chunksize=None, iterator=False as the default arguments

then it's easy to provide an iterator if requested, or simply return the entire file if not
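The pattern jreback describes can be sketched generically (ChunkReader and read_all are toy names, not the pandas classes):

```python
class ChunkReader:
    """Toy reader illustrating the chunksize=None / iterator=False pattern."""

    def __init__(self, data, chunksize=None):
        self.data = data
        self.chunksize = chunksize if chunksize is not None else len(data) or 1
        self._pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        chunk = self.get_chunk()
        if not chunk:
            raise StopIteration
        return chunk

    def get_chunk(self, size=None):
        size = size if size is not None else self.chunksize
        chunk = self.data[self._pos:self._pos + size]
        self._pos += size
        return chunk

def read_all(data, iterator=False, chunksize=None):
    # With the default arguments the whole "file" comes back in one piece;
    # asking for either iterator=True or a chunksize returns the reader.
    reader = ChunkReader(data, chunksize)
    if iterator or chunksize is not None:
        return reader
    return reader.get_chunk(len(data))
```

The point of the dispatch is that one entry point serves both styles: callers who pass nothing get the full result, while callers who opt in get a transparent iterator.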

@kshedden
Contributor Author

Thanks for the review. I think I have addressed most of the comments. The interfaces to read_stata and StataReader should now match read_table and TextFileReader, at least in terms of features that are common to both file types.

Comments in the current StataReader source state that it is not possible to read the value labels from the end of the dta file in a pre-117 format file without first reading through all the data. It's not clear to me that this is the case. Using the record size, it is possible to calculate the total size of the data in bytes, and jump from the data location to the value labels location. I did this and all the tests pass. However, I don't know all the ins and outs of Stata files, so it is possible that I am missing something here.

I still need to work on docstrings.
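The offset arithmetic in the comment above can be sketched with a toy fixed-width block (the record layout and section marker here are illustrative, not the actual dta layout):

```python
import io
import struct

# Three fixed-width records of an int32 plus an int16 each.
nobs = 3
record_fmt = "<ih"
record_length = struct.calcsize(record_fmt)  # 6 bytes per observation

buf = io.BytesIO()
data_start = buf.tell()
for i in range(nobs):
    buf.write(struct.pack(record_fmt, i, i * 10))
buf.write(b"<value_labels>")  # whatever section follows the data

# Seek straight past the data block without reading it.
buf.seek(data_start + nobs * record_length)
trailer = buf.read()
```

Because every record has the same width, the section after the data begins exactly nobs * record_length bytes past the start of the data block; variable-length strings would break this invariant, which is the caveat raised below.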

@bashtage
Contributor

Using the record size, it is possible to calculate the total size of the data in bytes, and jump from the data location to the value labels location. I did this and all the tests pass. However I don't know all the ins and outs of Stata files. so it is possible that I am missing something here.

Correction: I think it is possible that this won't work if the file has variable-length strings. That said, we should raise on variable-length strings since these aren't supported. In fact, a test that verifies the error would be a good idea.

http://www.stata.com/help.cgi?dta

5.10 and 5.11

I think correctly reading the value labels would require scanning backward for the start of the <value_labels> block if one wanted to handle the case with variable-length string data.

@kshedden
Contributor Author

There was already a function called _read_strls that appears to try to read the variable-length strings. However, the test suite never seems to enter this function. I'll try to deal with this.

strls are not mentioned in the Stata format 115 docs (http://www.stata.com/help.cgi?dta_115), so I'm guessing that they were introduced in format 117. We don't have a problem seeking to the value labels section in 117 files; the comments in the code that mentioned this limitation were explicit about it being an issue only for format <= 115.

So putting this all together, it looks like we might be safe jumping to the value labels based on the size of the data table (the part with fixed-length records) in format <= 115; in format 117 the position of the value labels is given explicitly in the header.

@kshedden
Contributor Author

Follow-up: the strl reader seems not to work and there are checks elsewhere that raise if strls are present. Do we still want to keep _read_strls?

@bashtage
Contributor

I would remove the code if it both can't be hit and is broken.

@kshedden
Contributor Author

I got read_strls working, and cleaned up the docstrings.

I don't have anything else on my to-do list for this, so anyone who was waiting to review this more carefully can now proceed.

@jreback
Contributor

jreback commented Feb 20, 2015

  • can you give a run with vbench, see here
    you can add -r stata to run only those benchmarks (btw, if you feel we need some more, e.g. long strings / mixed strings, whatever, feel free to add). IIRC you did speed up some string handling, so this should be nice
  • pls add a release note (reference this PR number, as there is no associated issue)
  • pls update io.rst in the stata section with an example of this usage (see the HDF5 or read_csv sections for examples of usage).


# legacy
@Appender(_data_method_doc)

Should this be marked for eventual deprecation? data() as a method doesn't make much sense, and read() is far clearer. (data as a property would be more reasonable.)

The docs for the two could then hint at the deprecation.

@kshedden
Contributor Author

The vbench output is below. Am I supposed to commit something for this?

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
packers_write_stata                          |  44.4117 |  92.9410 |   0.4778 |
packers_read_stata                           | 286.1663 | 348.4666 |   0.8212 |
packers_write_stata_with_validation          | 243.5517 | 252.5190 |   0.9645 |
packers_read_stata_with_validation           | 459.6000 | 391.4090 |   1.1742 |
-------------------------------------------------------------------------------

Ratio < 1.0 means the target commit is faster than the baseline.
Seed used: 1234

Target [84b683f] : Add release note
Base   [1d94dd2] : Merge pull request #9501 from jreback/series


reader = pd.read_stata('stata.dta', chunksize=1000)
for df in reader:
    do_something(df)

make this a code-block as this will actually execute (and error)

@jreback
Contributor

jreback commented Feb 25, 2015

@kshedden vbench is good. Really just a check to see how the optimization worked (or if there's an issue). Good on that.

Just some minor stylistic points.

@kshedden
Contributor Author

Does anyone know about Stata date formats like "%tdD_m_Y"? We are not currently converting these to dates (they remain as int32) because the check for date formats needs to match "%td" exactly. It appears to convert correctly if we treat the format as if it were simply "%td".

@kshedden
Contributor Author

It appears that %tdD_m_Y should be treated the same as %td. The additional characters are not relevant for reading the file, and are confusing our date conversion routines. So I am now stripping them out.
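The stripping described above might look something like this (a simplified sketch covering only the %td case; the real code handles the other %t types as well):

```python
def strip_date_display_fmt(fmt: str) -> str:
    # "%tdD_m_Y" only adds display information after the base "%td";
    # for conversion purposes the base type is all that matters.
    if fmt.startswith("%td"):
        return "%td"
    return fmt
```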

@jreback
Contributor

jreback commented Mar 1, 2015

@kshedden can you squash this down to a single commit. I think ok to merge after that.

@@ -55,6 +55,8 @@ performance improvements along with a large number of bug fixes.

Highlights include:

- Allow Stata files to be read incrementally, support for long strings in Stata files (:issue:`9493`)


pls add a ref link back to the docs

Remove testing code

Use partition in null_terminate

Manage warnings better in test

Further warning management in testing; add skip_data argument

Major refactoring to address code review

Fix strl reading, templatize docstrings

Fix bug in attaching docstring

Add new test file

Add release note

Call read instead of data when calling pandas.read_stata

various small issues following code review

Improve performance of %td processing

Docs edit (minor)
jreback added a commit that referenced this pull request Mar 5, 2015
ENH Read Stata dta file incrementally
@jreback jreback merged commit 2d368bc into pandas-dev:master Mar 5, 2015
@jreback
Contributor

jreback commented Mar 5, 2015

@kshedden thanks for this!

pls review the built docs (takes 1-2 hours) here: http://pandas-docs.github.io/pandas-docs-travis/

for any corrections. pls submit a follow-up PR if necessary.
