MAINT: Transpose arrays in fx artifact for better compression #2646
@peterhbromley a couple of comments.
zipline/data/fx/hdf5.py
Outdated
  class HDF5FXRateWriter(object):
      """Writer class for HDF5 files consumed by HDF5FXRateReader.
      """
-     def __init__(self, group):
+     def __init__(self, group, date_chunk_size):
I think it would probably be reasonable to give this a default value so that zipline users don't need to do the same tuning work that we did on this.
I added the default as a global defined at the beginning of the file.
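For readers following along, a minimal sketch of what a module-level default could look like. The constant name `DEFAULT_DATE_CHUNK_SIZE` and its value are hypothetical (the thread does not show the name or value actually chosen), and the constructor body is reduced to storing its arguments.

```python
# Hypothetical name and value, used for illustration only; the value
# chosen in the PR is not shown in this thread.
DEFAULT_DATE_CHUNK_SIZE = 512


class HDF5FXRateWriter(object):
    """Writer class for HDF5 files consumed by HDF5FXRateReader.
    """
    def __init__(self, group, date_chunk_size=DEFAULT_DATE_CHUNK_SIZE):
        # Callers that have not tuned the chunk size get the default.
        self._group = group
        self._date_chunk_size = date_chunk_size
```

With a default in place, existing callers that pass only `group` keep working, and tuned values can still be passed explicitly.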
zipline/data/fx/hdf5.py
Outdated
-     buf[:-1],
-     np.s_[slice_begin:slice_end],
- )
+ buf[:, :-1] = dataset[:, slice_begin:slice_end]
If we're not using `read_direct` anymore, we might as well allocate both an extra row and an extra column and use the extra row/column trick for handling both cases rather than using it for just one of them.
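A rough sketch of the extra row/column trick, for anyone unfamiliar with it. The helper name `gather_with_missing` and its arguments are made up for illustration and are not the code in this PR; the idea is that an index of -1 (meaning "not present") lands on a padding row or column that only ever holds NaN.

```python
import numpy as np


def gather_with_missing(data, row_ix, col_ix):
    """Select rows/columns of ``data``, yielding NaN wherever an index is -1.

    Hypothetical helper for illustration; -1 is assumed to mark a label
    that does not exist in ``data``.
    """
    n_rows, n_cols = data.shape
    # Allocate one extra row and one extra column, pre-filled with NaN.
    buf = np.full((n_rows + 1, n_cols + 1), np.nan)
    buf[:-1, :-1] = data
    # A -1 index selects the last row/column, which is all NaN, so
    # missing rows and missing columns need no special-casing.
    return buf[np.ix_(row_ix, col_ix)]
```

Because both axes are padded, the same indexing expression covers missing rows and missing columns.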
👍
zipline/data/fx/hdf5.py
Outdated
# row. When we then apply the row index to permute the raw data into
# the correct order, any rows with values of -1 will pull from the
# extra row, which will always contain NaN>
# column. When we then apply the column index to permute the raw data
I think this comment got a bit scrambled? (1) still refers to the possibility of nonexistent rows, but then we talk about columns here.
Now that we use the extra row/column trick for both cases, I reworded the comment.
zipline/data/fx/hdf5.py
Outdated
mapping its column label's currency to ``quote_currency``. The
arrays that are actually written to the HDF5 file will be
transposed to have shape ``(len(currencies), len(dts))`` so that
similar values are in C-contiguous order, which improves overall
compression.
This is an implementation detail of the file format. I'm not sure it makes sense to include here.
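For context on the layout that docstring describes, here is a small sketch of the transposition. The helper names and the synthetic data are assumptions made for illustration, and `zlib` is only a stand-in for whatever compression filter the HDF5 file uses. The sizes it prints depend on the data, so it does not demonstrate the gain measured for this PR.

```python
import zlib

import numpy as np


def to_storage_layout(rates):
    """Turn a (len(dts), len(currencies)) array into a C-contiguous
    (len(currencies), len(dts)) array. Hypothetical helper name."""
    return np.ascontiguousarray(rates.T)


def from_storage_layout(stored):
    """Inverse of ``to_storage_layout``."""
    return np.ascontiguousarray(stored.T)


# Synthetic rates: each currency drifts slowly around its own level.
rng = np.random.default_rng(0)
n_dts, n_currencies = 500, 8
base = np.array([1.0, 0.009, 1.3, 0.75, 110.0, 0.15, 7.8, 0.013])
steps = rng.standard_normal((n_dts, n_currencies)).cumsum(axis=0)
rates = base * (1.0 + 0.001 * steps)

stored = to_storage_layout(rates)

# In ``stored``, each currency's full time series is one contiguous run
# of bytes, so values of similar magnitude sit next to each other.
print("date-major bytes compressed:    ", len(zlib.compress(rates.tobytes())))
print("currency-major bytes compressed:", len(zlib.compress(stored.tobytes())))
```

The round trip through `from_storage_layout` returns the original array, so the transposition can remain a detail of the file format rather than something callers of `write` need to know about.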
LGTM. The only outstanding comment is that the note about transposing the inputs to `write` feels a little out of place to me. I'd probably either cut it or move it to a Notes section.
No description provided.