Adding optional args to load() function. #413

Merged: 9 commits from load_options into master, May 5, 2015

Conversation

senderista
Contributor

Allows the user to specify comma-separated key-value pairs (delimited by =), separated from any preceding varargs list by a semicolon.
Tested in local myria-web with the following query:

t = load("https://s3-us-west-2.amazonaws.com/myria/public-adhoc-TwitterK.csv", column0:int, column1:int; skip="1");
store(t, TwitterK2);

@coveralls

Coverage Status: changes unknown when pulling 9586e23 on load_options into master.

@domoritz
Member

domoritz commented May 4, 2015

Suggestion from the meeting: load('filename.txt', csv(args...));

@senderista
Contributor Author

Per further discussion with @mbalazin, @jortiz16, and @domoritz, I will update the syntax to load('filename.txt', csv(schema=(a:int,...), opt1=val1,...));

@senderista
Contributor Author

Actually, the syntax should be load('file:///Users/tdbaker/filename.txt', csv(schema(a:int,...), opt1=val1,...));

Here is a working example:

t = load("https://s3-us-west-2.amazonaws.com/myria/public-adhoc-TwitterK.csv", csv(schema(column0:int, column1:int), skip=1));
store(t, TwitterK2);

Further discussion of this syntax is welcome.

@coveralls

Coverage Status: changes unknown when pulling ab24b3f on load_options into master.

@senderista senderista assigned domoritz and unassigned billhowe May 5, 2015
@domoritz
Member

domoritz commented May 5, 2015

Note for myself: I need to make sure that the syntax highlighting works with the changes.

domoritz added a commit that referenced this pull request May 5, 2015
Adding optional args to load() function.
@domoritz domoritz merged commit a96efce into master May 5, 2015
@domoritz domoritz deleted the load_options branch May 5, 2015 22:21
@billhowe
Contributor

billhowe commented May 6, 2015

Some datasets will be partitioned such that there are files with the same name with different content on different workers. The current syntax assumes this, AFAICT. (Right?)

Some shared datasets may not have this property. We may want to process data in an S3 bucket full of files, where each file has a different name and will be assigned to a different worker.

I'm wondering if we can hallucinate a glob syntax in MyriaL that would make this work.

For example:

load("https://s3-us-west-2.amazonaws.com/someapp/*.csv", column0:int, column1:int; skip="1");

store(t, S3Data);

Can we think up a way to make this meaningful, such that all csv files in the given bucket are scanned as one big relation in parallel?

This is not totally obvious, because there might be 5 files, or there might be 500. In the 500 case, what should the json plan look like? Is there a "load_many" operator that can scan from many files, each in turn? Can we simulate load_many using one sequence operator per worker, with many file_scan operators as children?
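
To make the glob question concrete, here is a minimal Python sketch (illustrative only, not part of this PR) of how a pattern like the one above might be resolved into a concrete file list before any plan is built. The function name and the use of boto3 and fnmatch are assumptions for illustration, not what Myria actually does:

# Hypothetical sketch: turn a bucket plus "*.csv" into a concrete key list that
# the optimizer could hand to per-worker scan operators.
import fnmatch
import boto3

def resolve_s3_glob(bucket, pattern):
    """Return all keys in `bucket` whose names match the glob `pattern`."""
    s3 = boto3.client("s3")
    keys = []
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            if fnmatch.fnmatch(obj["Key"], pattern):
                keys.append(obj["Key"])
    return keys

# resolve_s3_glob("someapp", "*.csv") might return 5 keys or 500; the plan
# below would have to cope with either.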

@mbalazin
Contributor

mbalazin commented May 6, 2015

A glob syntax would indeed be useful.

We decided to drop the feature that reads multiple files from the local filesystem. The reason is that modern engines run on top of cluster resource managers such as YARN, so there is no concept of one worker per physical machine. The workers can be scheduled arbitrarily. So the way to do it is to stash the data into HDFS and then read it in parallel from there.

More specifically, the plan is as follows:

Step 1)
--> Support serial read from S3 through MyriaL: that's the latest feature that got added.
--> Support serial read from HDFS through MyriaL: should be supported. Need to test.
--> Support serial read from the local filesystem through MyriaL: works. This is for the case where someone is just playing with MyriaX locally on their laptop.

Step 2)
--> Support parallel read of multiple files from either S3 or HDFS, through either MyriaL or Python. The Python side already works when the number of files equals the number of workers. The next step is to generalize.

Step 3)
--> Support parallel read of big files from S3 or HDFS: to do.


@billhowe
Contributor

billhowe commented May 6, 2015

Sounds perfect!

Step 2 is what I'm blathering about.

I'm wondering if the following plan will work with no new operators needed.

Given k workers and files f_1, f_2, ..., f_{n*k}, we build a plan that looks like this:

Worker 0 runs:
Union(FileScan(f_1), FileScan(f_2), ..., FileScan(f_n))

Worker i runs:
Union(FileScan(f_{i*n+1}), FileScan(f_{i*n+2}), ..., FileScan(f_{i*n+n}))

(I don't know if that should be a union or a sequence.)

If this works, we won't have to care how many files are in the bucket.

To build this plan, the optimizer will need to resolve the glob pattern, though, which may be backend-specific. For S3, we could make it work directly.

This is not totally critical, perhaps -- we could make people just write a super-long, painful MyriaL program that explicitly references every file. But it comes up often enough that we should consider it.
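
For reference, a minimal Python sketch of the plan construction described above, assuming the glob has already been resolved to a file list. The operator names mirror the discussion rather than Myria's actual JSON plan format, and the dict shape is purely illustrative:

# Hypothetical sketch: assign files round-robin across k workers and build one
# Union(FileScan, ...) fragment per worker, so the plan shape is the same
# whether the bucket holds 5 files or 500.
def build_per_worker_fragments(files, num_workers):
    assignments = [[] for _ in range(num_workers)]
    for i, f in enumerate(files):
        assignments[i % num_workers].append(f)
    return [
        {"worker": w,
         "op": "Union",  # or "Sequence", per the open question above
         "children": [{"op": "FileScan", "file": f} for f in fs]}
        for w, fs in enumerate(assignments)
    ]

# With 500 files and 20 workers each fragment unions 25 FileScans; with 5 files
# and 20 workers most fragments simply have no children.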


@mbalazin
Contributor

mbalazin commented May 6, 2015

I think that this is a critical next step!
magda


@BrandonHaynes
Member

Surfacing this in MyriaL will allow me to remove the ad hoc JSON-plan-creation cruft in Myria-Python, which is a Good Thing.

@senderista
Contributor Author

In Hive the LOAD DATA statement takes a file or directory path, not a glob pattern, and I suspect many potential Myria users are accustomed to this syntax for loading multiple files into a single table. We should consider whether the flexibility of supporting glob patterns is worth the added complexity (it's not if it's only used for prefix matches).

We could also consider whether we want users to be able to implicitly indicate range partitioning by subdirectory structure (à la EMR's ALTER TABLE RECOVER PARTITIONS extension to Hive, which I understand is now natively supported in Hive). I'm not necessarily saying this is a good idea, just something to consider...

@mbalazin
Contributor

mbalazin commented May 6, 2015

Specifying a directory will cover a lot of the cases for loading multiple files. I agree.

As for partitioning, it might suffice to do that at the level of relations rather than files:

  • Round robin by default.
  • Hash partition using some attribute values [and optionally a hash function].
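
As a point of reference, a minimal Python sketch of those two relation-level policies, independent of Myria's internals; rows are plain dicts, workers are numbered 0..k-1, and all names are illustrative:

# Hypothetical sketch of the two partitioning policies: round robin by default,
# or hash partitioning on a chosen attribute with an optional hash function.
import itertools

def round_robin_partition(rows, num_workers):
    """Default policy: cycle rows across workers."""
    workers = itertools.cycle(range(num_workers))
    return [(next(workers), row) for row in rows]

def hash_partition(rows, num_workers, attribute, hash_fn=hash):
    """Send each row to the worker chosen by hashing one attribute value."""
    return [(hash_fn(row[attribute]) % num_workers, row) for row in rows]

# hash_partition([{"id": 7}, {"id": 8}], 4, "id") sends rows with equal ids to
# the same worker, which is what a downstream join or group-by would want.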


