Adding optional args to load() function. #413
Changes Unknown when pulling 9586e23 on load_options into * on master*. |
Suggestion from the meeting: |
Actually syntax should be Here is a working example:
Further discussion of this syntax is welcome. |
Changes Unknown when pulling ab24b3f on load_options into * on master*. |
Note for myself: I need to make sure that the syntax highlighting works with the changes. |
Some datasets will be partitioned such that there are files with the same name with different content on different workers. The current syntax assumes this, AFAICT. (Right?) Some shared datasets may not have this property. We may want to process data in an S3 bucket full of files, where each file has a different name and will be assigned to a different worker. I'm wondering if we can hallucinate a glob syntax in MyriaL that would make this work. For example:
Can we think up a way to make this meaningful, such that all csv files in the given bucket are scanned as one big relation in parallel? This is not totally obvious, because there might be 5 files, or there might be 500. In the 500 case, what should the json plan look like? Is there a "load_many" operator that can scan from many files, each in turn? Can we simulate load_many using one sequence operator per worker, with many file_scan operators as children? |
A glob syntax would indeed be useful. We decided to drop the feature that reads multiple files from the local
More specifically, the plan is as follows:
Step 1)
Step 2)
Step 3)
Magdalena Balazinska
On Tue, May 5, 2015 at 9:12 PM, billhowe notifications@github.com wrote:
|
Sounds perfect! Step 2 is what I'm blathering about.
I'm wondering if the following plan will work with no new operators needed. Given k workers and files f_1, f_2, ..., f_{n*k}, we build a plan that
Worker 0 runs:
Worker i runs:
(I don't know if that should be a union or a sequence)
If this works, we won't have to care how many files are in the bucket.
To build this plan, the optimizer will need to resolve the glob pattern,
This is not totally critical, perhaps -- we could make people just write a
On Tue, May 5, 2015 at 9:20 PM, Magdalena Balazinska <
Bill Howe |
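For readers following along, the idea above (resolve the glob once, then give each worker its own list of files to scan) can be sketched roughly as below. This is an illustration only, not Myria's planner or its JSON plan format: `assign_files`, `worker_plan`, the `"file_scan"` and `"union"` labels, and the dictionary layout are all hypothetical names chosen for this sketch, and the union-versus-sequence question raised above is left as a parameter.

```python
from fnmatch import fnmatchcase


def assign_files(all_files, pattern, num_workers):
    """Resolve a glob pattern, then deal the matches round-robin to workers."""
    matched = sorted(f for f in all_files if fnmatchcase(f, pattern))
    assignment = {w: [] for w in range(num_workers)}
    for i, name in enumerate(matched):
        assignment[i % num_workers].append(name)
    return assignment


def worker_plan(files, combiner="union"):
    """Build a hypothetical per-worker plan: one scan per file under one parent.

    `combiner` stands in for the open "union or sequence" choice; the
    operator names here are placeholders, not real Myria operators.
    """
    return {
        "op": combiner,
        "children": [{"op": "file_scan", "file": f} for f in files],
    }


# Hypothetical bucket listing; the non-CSV file is filtered out by the glob.
bucket = ["a.csv", "b.csv", "c.csv", "d.csv", "e.csv", "notes.txt"]
assignment = assign_files(bucket, "*.csv", num_workers=2)
plans = {w: worker_plan(files) for w, files in assignment.items()}
```

With this shape the number of files in the bucket only changes how many children each worker's plan has, which is the property the comment above is after. The file count does not need to be a multiple of the worker count; some workers simply get one file fewer.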
I think that this is a critical next step!
Magdalena Balazinska
On Tue, May 5, 2015 at 9:46 PM, billhowe notifications@github.com wrote:
|
Surfacing this in MyriaL will allow me to remove the ad hoc JSON-plan-creation cruft in Myria-Python, which is a Good Thing. |
In Hive the LOAD DATA statement takes a file or directory path, not a glob pattern, and I suspect many potential Myria users are accustomed to this syntax for loading multiple files into a single table. We should consider whether the flexibility of supporting glob patterns is worth the added complexity (it's not if it's only used for prefix matches). We could also consider whether we want users to be able to implicitly indicate range partitioning by subdirectory structure (ala EMR's ALTER TABLE RECOVER PARTITIONS extension to Hive, which I understand is now natively supported in Hive). I'm not necessarily saying this is a good idea, just something to consider... |
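To make the trade-off in the comment above concrete, here is a small sketch comparing directory-style selection (a plain prefix match) with glob selection over the same key listing. The key names and both helper functions are made up for illustration; nothing here reflects how Hive or Myria actually resolves paths.

```python
from fnmatch import fnmatchcase


def match_directory(keys, directory):
    """Directory-style selection: every key under the given path prefix."""
    prefix = directory.rstrip("/") + "/"
    return [k for k in keys if k.startswith(prefix)]


def match_glob(keys, pattern):
    """Glob-style selection. Note that fnmatch's '*' also matches '/'."""
    return [k for k in keys if fnmatchcase(k, pattern)]


# Hypothetical bucket contents.
keys = [
    "logs/2015/a.csv",
    "logs/2015/b.csv",
    "logs/2015/readme.txt",
    "other/c.csv",
]

by_directory = match_directory(keys, "logs/2015")
by_glob = match_glob(keys, "logs/2015/*.csv")
by_prefix_glob = match_glob(keys, "logs/2015/*")
```

In this listing the glob `logs/2015/*` selects exactly what the directory path selects, which is the "only used for prefix matches" case where globs add nothing. The glob earns its complexity only when it filters within the directory, as `*.csv` does here by skipping `readme.txt`.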
Specifying a directory will cover a lot of the cases for loading multiple As for partitioning, it might suffice to do that at the level of relations
Magdalena Balazinska On Wed, May 6, 2015 at 10:26 AM, senderista notifications@github.com
|
Allows user to specify comma-separated key-value pairs (with = delimiter), separated from any preceding varargs list by semicolon.
Tested in local myria-web with the following query: