-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Add get_rates_columnar method to FXRatesReader. #2617
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had one question about whether the fact that we get a bunch of data that we don't end up using in the columnar case matters at all but otherwise this LGTM
rates_2d = self.get_rates( | ||
rate, | ||
quote, | ||
unique_bases, | ||
pd.DatetimeIndex(unique_dts, tz='utc') | ||
) | ||
return rates_2d[dts_ix, bases_ix] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could be wrong about this, but it seems like in the columnar case we might load a bunch more data than we need since get_rates
computes the cartesian product but the column is only a subset of that. I assume this doesn't really matter, but was curious if that is a concern at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it seems like in the columnar case we might load a bunch more data than we need since get_rates computes the cartesian product but the column is only a subset of that.
Yup, this is definitely a possibility. The worst case scenario here would be something like a columnar query where every day has a different currency in it. We'd end up loading all the currencies for every day, even though we only needed one day of each currency. I'm not too worried about this because:
- In our expected workloads, I expect that if we need a currency on day N, we'll probably need it for nearby days. In particular, most datasets will have static currencies, which means the most common case will be that if we ever need a currency we'll need it every day.
- Even in the pathological case, it's not clear that we can do much better than what we're doing here, because we're going to want to use compression and chunking in the HDF5 files. What that means in practice is that the data arrays get broken up into regular sized pieces, and each piece gets compressed and stored separately. If you load any value out of the array, you have to decompress and load the entire chunk holding the value, so what dominates performance isn't so much the number of values read: it's the number of distinct chunks read. So, most likely, even if we tried to more precise reads here, we'd end up doing the same amount of IO, and we'd likely have higher overheads from trying to read smaller chunks of the file.
Add a new method,
get_rates_columnar
toFXRateReader
, with a default implementation given in terms ofget_rates
.The behavior of
get_rates_columnar
should be logically equivalent to: