Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Add get_rates_columnar method to FXRatesReader. #2617

Merged
merged 1 commit into from
Jan 16, 2020

Conversation

ssanderson
Copy link
Contributor

@ssanderson ssanderson commented Jan 15, 2020

Add a new method, get_rates_columnar to FXRateReader, with a default implementation given in terms of get_rates.

The behavior of get_rates_columnar should be logically equivalent to:

def get_rates_columnar(self, rate, quote, bases, dts):
    out = []
    for base, dt in zip(bases, dts):
        out.append(self.get_rate_scalar(rate, quote, base, dt)
    return np.array(out)

@coveralls
Copy link

Coverage Status

Coverage increased (+0.002%) to 88.033% when pulling c6bb976 on columnar-fx-rates into b0b20b0 on master.

Copy link
Contributor

@peterhbromley peterhbromley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had one question about whether the fact that we get a bunch of data that we don't end up using in the columnar case matters at all but otherwise this LGTM

Comment on lines +140 to +146
rates_2d = self.get_rates(
rate,
quote,
unique_bases,
pd.DatetimeIndex(unique_dts, tz='utc')
)
return rates_2d[dts_ix, bases_ix]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could be wrong about this, but it seems like in the columnar case we might load a bunch more data than we need since get_rates computes the cartesian product but the column is only a subset of that. I assume this doesn't really matter, but was curious if that is a concern at all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems like in the columnar case we might load a bunch more data than we need since get_rates computes the cartesian product but the column is only a subset of that.

Yup, this is definitely a possibility. The worst case scenario here would be something like a columnar query where every day has a different currency in it. We'd end up loading all the currencies for every day, even though we only needed one day of each currency. I'm not too worried about this because:

  1. In our expected workloads, I expect that if we need a currency on day N, we'll probably need it for nearby days. In particular, most datasets will have static currencies, which means the most common case will be that if we ever need a currency we'll need it every day.
  2. Even in the pathological case, it's not clear that we can do much better than what we're doing here, because we're going to want to use compression and chunking in the HDF5 files. What that means in practice is that the data arrays get broken up into regular sized pieces, and each piece gets compressed and stored separately. If you load any value out of the array, you have to decompress and load the entire chunk holding the value, so what dominates performance isn't so much the number of values read: it's the number of distinct chunks read. So, most likely, even if we tried to more precise reads here, we'd end up doing the same amount of IO, and we'd likely have higher overheads from trying to read smaller chunks of the file.

@ssanderson ssanderson merged commit c825927 into master Jan 16, 2020
@ssanderson ssanderson deleted the columnar-fx-rates branch January 16, 2020 14:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants