Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose the flag to disable Ømq copy buffers #595

Closed
dmcguire81 opened this issue Sep 10, 2020 · 13 comments
Closed

Expose the flag to disable Ømq copy buffers #595

dmcguire81 opened this issue Sep 10, 2020 · 13 comments

Comments

@dmcguire81
Copy link
Contributor

One of our engineers found an optimization involving disabling ZeroMQ copy buffers in the ProcessWorker, but this is not exposed in the top-level factory methods, make_reader and make_batch_reader. It's useful, and probably should be.

@selitvin
Copy link
Collaborator

Yep. Agreed. Can you please elaborate what is the issue?

@dmcguire81
Copy link
Contributor Author

There's nothing new to report: better throughput, in exchange for less predictable memory consumption, is a good trade-off for us, and probably for others, so it should be available in the factory methods.

@dmcguire81
Copy link
Contributor Author

While we're talking, though, it seems like the factory methods only work seamlessly if on whatever storage Uber happens to use (guessing HDFS, given libhdfs is the only config given consistently elevated visibility). The experience on S3 is pretty bad.

@dmcguire81
Copy link
Contributor Author

We're considering dropping down to the petastorm.reader.Reader interface to deal with having to patch s3fs and pyarrow for the respective shortcomings (or this code base to work around them), but wanted to get your take on a refactoring to the factory methods, themselves, since it seems like a losing battle to offer all the flexibility users will need and still all do the instantiation yourself.

@selitvin
Copy link
Collaborator

Your diagnosis is correct. We did not use petastorm with s3 internally, hence this is not a polished feature. I can help if you can point me to the issues you run into. I did verify basic functionality but not more then that.

@selitvin
Copy link
Collaborator

I saw that you created #594. If this is not sufficient (curious what else would we need), perhaps we can have make_reader/make_batch_reader take an instance of already constructed filesystem object?

@dmcguire81
Copy link
Contributor Author

dmcguire81 commented Sep 10, 2020

We built up a bit of a backlog while waiting on legal to approve submitting PRs back, but they're all in there, now, so you see everything that's been needed, so far (except for s3fs==0.5.0 throwing TypeError: 'coroutine' object is not iterable from make_batch_reader, which I just found today, by accident, trying to build you a repro case).

I think passing the pyarrow.Filesystem is a great option, since the rest of the code that's hidden (e.g., instances of petastorm.worker_pool.*) is under your direct control, so won't be as much of a maintenance burden to keep "under the hood".

@dmcguire81
Copy link
Contributor Author

#590 was actually the real killer, but that addresses every problem we've seen, so far, except this one, so we could get it all done in 2 PRs, probably.

@dmcguire81
Copy link
Contributor Author

For your part, having it be parameterized instead of discovered, makes it so that you can route defects to those project (pyarrow and s3fs) where they belong. Without making s3fs at least an optional extension, it probably shouldn't be referenced in the code, at all, just advertised as "compatible".

@selitvin
Copy link
Collaborator

So if I understand correctly if we land all PRs you have put up we would solve your outstanding problems, correct?

@dmcguire81
Copy link
Contributor Author

Yes, but we have a private fork we're working off of, so are not blocked. I'm really interested in the future-proofing idea of adding a filesystem parameter, but wanted to document the evidence for it with illustrative PRs.

If you want to merge these and then discuss a refactoring, great, but I'm assuming you don't want to add parameters to the factory methods and then remove them.

@selitvin
Copy link
Collaborator

I would image that s3_config_kwargs would solve 90% of user problems (speculating, since did not work with s3 myself).
If this is the case, then it may be a good idea to land it even that we would have a full-blown filesystem object argument later for other usecases (say filesystems that are currently not supported by petastorm).

@dmcguire81
Copy link
Contributor Author

Makes sense. I'll update that PR with release notes for you to merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants