
ENH Read Stata dta file incrementally #9493

Merged: 1 commit merged into pandas-dev:master on Mar 5, 2015

Conversation

kshedden
Contributor

This PR adds a get_chunk method to StataReader to allow files to be read incrementally. This is quite useful when processing large files that can barely fit into memory.

The interface is modeled after the analogous method in TextFileReader.

There are some limitations when incrementally converting categoricals in pre-117 format files. I have a workaround that involves reading through the file once to get the label information, then re-reading and converting using a new function called convert_categoricals.

In a fair amount of testing, I always get identical results using get_chunk and reading the file completely.

I left the existing data(...) method for reading the complete file intact. Potentially it could be changed to simply call get_chunk.

A few additional issues:

  1. The PR fixes an apparent bug in _stata_elapsed_date_to_datetime_vec, in all three of the nested convert_xxx functions. In some execution paths, a new Series is created without using the index of the Series passed as an argument. This can cause incorrect NA detection when the result is indexed using bad_locs.
  2. A modest performance improvement (~40% for me) is obtained by using partition inside _null_terminate in place of index/slice operations.
  3. Different date object types may be returned during date conversions, depending on the execution path. These can represent identical dates but test as different under assert_frame_equal, so a bit of extra care is required in some of the tests.
  4. It would be easy to add an iterator to read the chunks, but I didn't do that. You need to call get_chunk directly.
  5. Some dtypes may change from chunk to chunk. For example, I believe pandas promotes integers to floats when there are missing values, so a chunk with no missing values would have an int dtype, but the same column would become float in a chunk that has missing values.
  6. A comment separate from this PR: the slowest step for reading the big data files I work with is string decoding and truncation (millions of calls to _null_terminate). Other than the modest gain from partition, I tried and failed to find a good optimization for this.
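A minimal sketch of the incremental-read pattern described above, using the chunksize form of the interface (the file path and data here exist only for the demo):

```python
import os
import tempfile

import pandas as pd

# Write a small .dta file so the demo is self-contained.
path = os.path.join(tempfile.mkdtemp(), "demo.dta")
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": list("wxyz")})
df.to_stata(path, write_index=False)

# Read it back two rows at a time instead of all at once.
with pd.read_stata(path, chunksize=2) as reader:
    chunks = [chunk for chunk in reader]

result = pd.concat(chunks, ignore_index=True)
```

Concatenating the chunks should reproduce the full-file read, which is what the testing above verifies.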
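The index issue in item 1 can be shown in miniature (the variable names here are illustrative, not the actual pandas internals):

```python
import numpy as np
import pandas as pd

# A series with a non-default index, as can arise inside a larger frame.
dates = pd.Series([100.0, np.nan, 200.0], index=[7, 8, 9])
bad_locs = np.isnan(dates)  # boolean Series indexed [7, 8, 9]

# Bug pattern: rebuilding from .values drops the original index, so the
# new Series gets a fresh RangeIndex and no longer aligns with bad_locs.
misaligned = pd.Series(dates.values)  # index [0, 1, 2]

# Fix pattern: carry the caller's index through the conversion, so that
# boolean indexing with bad_locs hits the intended rows.
converted = pd.Series(dates.values, index=dates.index)
converted[bad_locs] = -1.0
```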
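Item 2's partition trick, in simplified form (the real _null_terminate also handles decoding; this only shows the byte-string handling):

```python
def null_terminate_find(s: bytes) -> bytes:
    # Original approach: locate the null byte, then slice.
    pos = s.find(b"\0")
    return s if pos == -1 else s[:pos]

def null_terminate_partition(s: bytes) -> bytes:
    # partition does the search and the split in one call; over millions
    # of strings the saved Python-level work is noticeable.
    return s.partition(b"\0")[0]
```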

@jreback jreback added the IO Stata read_stata, to_stata label Feb 15, 2015
@jreback jreback added this to the 0.16.0 milestone Feb 15, 2015
@jreback jreback mentioned this pull request Feb 15, 2015
@jreback
Contributor

jreback commented Feb 15, 2015

@kshedden this is a good idea. See the related issue #9496. We want to follow a common pattern for the invocations of this. The other 3 io libs that do this already have a pretty well-defined interface (e.g. they take chunksize/iterator arguments, and provide an iterator back to the user), so this is pretty transparent.

So if you'd take a look and follow this common pattern, that would be great (a separate issue is actually to consolidate this into a base class). You are welcome to start there if it's easier. We can then build off your effort.

lmk and we can then examine your proposal in more detail.

cc @bashtage

raise ValueError("For format versions < 117, `convert_categoricals` must be set to False. See docstring for details.")

first_chunk = False
if not hasattr(self, '_not_first_chunk'):

Adding attributes as a substitute for a bool doesn't seem like a good idea. Why not just set _first_chunk=True in __init__? If the iterator is not used, it will just be ignored.

@bashtage
Contributor

I left the existing data(...) method for reading the complete file intact. Potentially it could be changed to simply call get_chunk.

It makes sense to replace this method, just to avoid quite a bit of repeated code. Maybe `chunksize=None` would indicate reading the entire file?

@jreback
Contributor

jreback commented Feb 15, 2015

typically we have chunksize=None, iterator=False as the default arguments

then it's easy to provide an iterator if requested, or simply return the entire file if not
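The pattern jreback describes can be sketched generically (ChunkReader and read_all are toy names, not the pandas classes):

```python
class ChunkReader:
    """Toy reader illustrating the chunksize=None / iterator=False pattern."""

    def __init__(self, data, chunksize=None):
        self.data = data
        self.chunksize = chunksize if chunksize is not None else len(data) or 1
        self._pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        chunk = self.get_chunk()
        if not chunk:
            raise StopIteration
        return chunk

    def get_chunk(self, size=None):
        size = size if size is not None else self.chunksize
        chunk = self.data[self._pos:self._pos + size]
        self._pos += size
        return chunk

def read_all(data, iterator=False, chunksize=None):
    # With the default arguments the whole "file" comes back in one piece;
    # asking for either iterator=True or a chunksize returns the reader.
    reader = ChunkReader(data, chunksize)
    if iterator or chunksize is not None:
        return reader
    return reader.get_chunk(len(data))
```

The point of the dispatch is that one entry point serves both styles: callers who pass nothing get the full result, while callers who opt in get a transparent iterator.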

@kshedden
Contributor Author

Thanks for the review. I think I have addressed most of the comments. The interfaces to read_stata and StataReader should now match read_table and TextFileReader, at least in terms of features that are common to both file types.

Comments in the current StataReader source state that it is not possible to read the value labels from the end of the dta file in a pre-117 format file without first reading through all the data. It's not clear to me that this is the case. Using the record size, it is possible to calculate the total size of the data in bytes, and jump from the data location to the value labels location. I did this and all the tests pass. However, I don't know all the ins and outs of Stata files, so it is possible that I am missing something here.

I still need to work on docstrings.
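The offset arithmetic in the comment above can be sketched with a toy fixed-width block (the record layout and section marker here are illustrative, not the actual dta layout):

```python
import io
import struct

# Three fixed-width records of an int32 plus an int16 each.
nobs = 3
record_fmt = "<ih"
record_length = struct.calcsize(record_fmt)  # 6 bytes per observation

buf = io.BytesIO()
data_start = buf.tell()
for i in range(nobs):
    buf.write(struct.pack(record_fmt, i, i * 10))
buf.write(b"<value_labels>")  # whatever section follows the data

# Seek straight past the data block without reading it.
buf.seek(data_start + nobs * record_length)
trailer = buf.read()
```

Because every record has the same width, the section after the data begins exactly nobs * record_length bytes past the start of the data block; variable-length strings would break this invariant, which is the caveat raised below.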

@bashtage
Contributor

Using the record size, it is possible to calculate the total size of the data in bytes, and jump from the data location to the value labels location. I did this and all the tests pass. However I don't know all the ins and outs of Stata files. so it is possible that I am missing something here.

Correction: I think it is possible that this won't work if the file has variable-length strings. That said, we should raise on variable-length strings since these aren't supported. In fact, a test that verifies the error would be a good idea.

http://www.stata.com/help.cgi?dta

5.10 and 5.11

I think correctly reading the value labels would require scanning backward for the start of the <value_labels> block if one wanted to handle the case with variable-length string data.

@kshedden
Contributor Author

There was already a function called _read_strls that appears to try to read the variable-length strings. However, the test suite never seems to enter this function. I'll try to deal with this.

strls are not mentioned in the Stata format 115 docs (http://www.stata.com/help.cgi?dta_115), so I'm guessing that they were introduced in format 117. We don't have a problem seeking to the value labels section in 117 files; the comments in the code that mentioned this limitation were explicit about it being an issue only for format <= 115.

So putting this all together, it looks like we might be safe jumping to the value labels based on the size of the data table (the part with fixed-length records) in format <= 115; in format 117 the position of the value labels is given explicitly in the header.

@kshedden
Contributor Author

Follow-up: the strl reader seems not to work and there are checks elsewhere that raise if strls are present. Do we still want to keep _read_strls?

@bashtage
Contributor

I would remove the code if it both can't be hit and is broken.

@kshedden
Contributor Author

I got read_strls working, and cleaned up the docstrings.

I don't have anything else on my to-do list for this, so anyone who was waiting to review this more carefully can now proceed.

@jreback
Contributor

jreback commented Feb 20, 2015

  • can you give a run with vbench, see here
    you can add -r stata to run only those benchmarks (btw, if you feel we need some more, e.g. long strings / mixed strings, whatever, feel free to add). IIRC you did speed up some string handling, so this should be nice
  • pls add a release note (reference this PR number, as there is no associated issue)
  • pls update io.rst in the stata section with an example of this usage (see the HDF5 or read_csv sections for examples of usage).


# legacy
@Appender(_data_method_doc)

Should this be marked for eventual deprecation? data() as a method doesn't make much sense, and read() is far clearer. (data as a property would be more reasonable.)

The docs for the two could then hint at the deprecation.

@kshedden
Contributor Author

The vbench output is below. Am I supposed to commit something for this?

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
packers_write_stata                          |  44.4117 |  92.9410 |   0.4778 |
packers_read_stata                           | 286.1663 | 348.4666 |   0.8212 |
packers_write_stata_with_validation          | 243.5517 | 252.5190 |   0.9645 |
packers_read_stata_with_validation           | 459.6000 | 391.4090 |   1.1742 |
-------------------------------------------------------------------------------

Ratio < 1.0 means the target commit is faster than the baseline.
Seed used: 1234

Target [84b683f] : Add release note
Base   [1d94dd2] : Merge pull request #9501 from jreback/series


reader = pd.read_stata('stata.dta', chunksize=1000)
for df in reader:
    do_something(df)

make this a code-block as this will actually execute (and error)

@jreback
Contributor

jreback commented Feb 25, 2015

@kshedden vbench is good. Really just a check to see how the optimization worked (or if there's an issue). Good on that.

Just some minor stylistic points.

@kshedden
Contributor Author

Does anyone know about Stata date formats like "%tdD_m_Y"? We are not currently converting these to dates (they remain as int32) because the check for date formats needs to match "%td" exactly. It appears to convert correctly if we treat the format as if it were simply "%td".

@kshedden
Contributor Author

It appears that %tdD_m_Y should be treated the same as %td. The additional characters are not relevant for reading the file, and are confusing our date conversion routines. So I am now stripping them out.
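The stripping described above might look something like this (a simplified sketch covering only the %td case; the real code handles the other %t types as well):

```python
def strip_date_display_fmt(fmt: str) -> str:
    # "%tdD_m_Y" only adds display information after the base "%td";
    # for conversion purposes the base type is all that matters.
    if fmt.startswith("%td"):
        return "%td"
    return fmt
```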

@jreback
Contributor

jreback commented Mar 1, 2015

@kshedden can you squash this down to a single commit. I think ok to merge after that.

@@ -55,6 +55,8 @@ performance improvements along with a large number of bug fixes.

Highlights include:

- Allow Stata files to be read incrementally, support for long strings in Stata files (:issue:`9493`)


pls add a ref link back to the docs

Remove testing code

Use partition in null_terminate

Manage warnings better in test

Further warning management in testing; add skip_data argument

Major refactoring to address code review

Fix strl reading, templatize docstrings

Fix bug in attaching docstring

Add new test file

Add release note

Call read instead of data when calling pandas.read_stata

various small issues following code review

Improve performance of %td processing

Docs edit (minor)
jreback added a commit that referenced this pull request Mar 5, 2015
ENH Read Stata dta file incrementally
@jreback jreback merged commit 2d368bc into pandas-dev:master Mar 5, 2015
@jreback
Contributor

jreback commented Mar 5, 2015

@kshedden thanks for this!

pls review the built docs (takes 1-2 hours) here: http://pandas-docs.github.io/pandas-docs-travis/

for any corrections. pls submit a follow-up PR if necessary.
