Adding optional args to load() function. #413

Merged: 9 commits from load_options into master, May 5, 2015

Conversation

senderista
Contributor

Allows the user to specify comma-separated key-value pairs (delimited by =), separated from any preceding varargs list by a semicolon.
Tested in local myria-web with the following query:

t = load("https://s3-us-west-2.amazonaws.com/myria/public-adhoc-TwitterK.csv", column0:int, column1:int; skip="1");
store(t, TwitterK2);

@coveralls

Coverage Status: changes unknown when pulling 9586e23 on load_options into master.

@domoritz
Member

domoritz commented May 4, 2015

Suggestion from the meeting: load('filename.txt', csv(args...));

@senderista
Contributor Author

Per further discussion with @mbalazin, @jortiz16, and @domoritz, I will update the syntax to load('filename.txt', csv(schema=(a:int,...), opt1=val1,...));

@senderista
Contributor Author

Actually, the syntax should be load('file:///Users/tdbaker/filename.txt', csv(schema(a:int,...), opt1=val1,...));

Here is a working example:

t = load("https://s3-us-west-2.amazonaws.com/myria/public-adhoc-TwitterK.csv", csv(schema(column0:int, column1:int), skip=1));
store(t, TwitterK2);

Further discussion of this syntax is welcome.

@coveralls

Coverage Status: changes unknown when pulling ab24b3f on load_options into master.

@senderista senderista assigned domoritz and unassigned billhowe May 5, 2015
@domoritz
Member

domoritz commented May 5, 2015

Note for myself: I need to make sure that the syntax highlighting works with the changes.

domoritz added a commit that referenced this pull request May 5, 2015
Adding optional args to load() function.
@domoritz domoritz merged commit a96efce into master May 5, 2015
@domoritz domoritz deleted the load_options branch May 5, 2015 22:21
@billhowe
Contributor

billhowe commented May 6, 2015

Some datasets will be partitioned such that there are files with the same name with different content on different workers. The current syntax assumes this, AFAICT. (Right?)

Some shared datasets may not have this property. We may want to process data in an S3 bucket full of files, where each file has a different name and will be assigned to a different worker.

I'm wondering if we can hallucinate a glob syntax in MyriaL that would make this work.

For example:

load("https://s3-us-west-2.amazonaws.com/someapp/*.csv", column0:int, column1:int; skip="1");

store(t, S3Data);

Can we think up a way to make this meaningful, such that all csv files in the given bucket are scanned as one big relation in parallel?

This is not totally obvious, because there might be 5 files, or there might be 500. In the 500 case, what should the json plan look like? Is there a "load_many" operator that can scan from many files, each in turn? Can we simulate load_many using one sequence operator per worker, with many file_scan operators as children?
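
To make the glob question concrete, here is a minimal Python sketch (illustrative only, not part of this PR) of how a pattern like the one above might be resolved into a concrete file list before any plan is built. The function name and the use of boto3 and fnmatch are assumptions for illustration, not what Myria actually does:

# Hypothetical sketch: turn a bucket plus "*.csv" into a concrete key list that
# the optimizer could hand to per-worker scan operators.
import fnmatch
import boto3

def resolve_s3_glob(bucket, pattern):
    """Return all keys in `bucket` whose names match the glob `pattern`."""
    s3 = boto3.client("s3")
    keys = []
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            if fnmatch.fnmatch(obj["Key"], pattern):
                keys.append(obj["Key"])
    return keys

# resolve_s3_glob("someapp", "*.csv") might return 5 keys or 500; the plan
# below would have to cope with either.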

@mbalazin
Contributor

mbalazin commented May 6, 2015

A glob syntax would indeed be useful.

We decided to drop the feature that reads multiple files from the local filesystem. The reason is that modern engines run on top of cluster resource managers such as YARN, so there is no concept of one worker per physical machine. The workers can be scheduled arbitrarily. So the way to do it is to stash the data into HDFS and then read it in parallel from there.

More specifically, the plan is as follows:

Step 1)
--> Support serial read from S3 through MyriaL: that's the latest feature that got added.
--> Support serial read from HDFS through MyriaL: should be supported. Need to test.
--> Support serial read from the local filesystem through MyriaL: works. This is for the case where someone is just playing with MyriaX locally on their laptop.

Step 2)
--> Support parallel read of multiple files from either S3 or HDFS, through either MyriaL or Python. The Python side already works when the number of files equals the number of workers. The next step is to generalize.

Step 3)
--> Support parallel read of big files from S3 or HDFS: to do.


@billhowe
Contributor

billhowe commented May 6, 2015

Sounds perfect!

Step 2 is what I'm blathering about.

I'm wondering if the following plan will work with no new operators needed.

Given k workers and files f_1, f_2, ..., f_{n*k}, we build a plan that looks like this:

Worker 0 runs:
Union(FileScan(f_1), FileScan(f_2), ..., FileScan(f_n))

Worker i runs:
Union(FileScan(f_{i*n+1}), FileScan(f_{i*n+2}), ..., FileScan(f_{i*n+n}))

(I don't know if that should be a union or a sequence.)

If this works, we won't have to care how many files are in the bucket.

To build this plan, the optimizer will need to resolve the glob pattern, though, which may be backend-specific. For S3, we could make it work directly.

This is not totally critical, perhaps -- we could make people just write a super-long, painful MyriaL program that explicitly references every file. But it comes up often enough that we should consider it.
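
For reference, a minimal Python sketch of the plan construction described above, assuming the glob has already been resolved to a file list. The operator names mirror the discussion rather than Myria's actual JSON plan format, and the dict shape is purely illustrative:

# Hypothetical sketch: assign files round-robin across k workers and build one
# Union(FileScan, ...) fragment per worker, so the plan shape is the same
# whether the bucket holds 5 files or 500.
def build_per_worker_fragments(files, num_workers):
    assignments = [[] for _ in range(num_workers)]
    for i, f in enumerate(files):
        assignments[i % num_workers].append(f)
    return [
        {"worker": w,
         "op": "Union",  # or "Sequence", per the open question above
         "children": [{"op": "FileScan", "file": f} for f in fs]}
        for w, fs in enumerate(assignments)
    ]

# With 500 files and 20 workers each fragment unions 25 FileScans; with 5 files
# and 20 workers most fragments simply have no children.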


@mbalazin
Contributor

mbalazin commented May 6, 2015

I think that this is a critical next step!
magda


@BrandonHaynes
Member

Surfacing this in MyriaL will allow me to remove the ad hoc JSON-plan-creation cruft in Myria-Python, which is a Good Thing.

@senderista
Contributor Author

In Hive the LOAD DATA statement takes a file or directory path, not a glob pattern, and I suspect many potential Myria users are accustomed to this syntax for loading multiple files into a single table. We should consider whether the flexibility of supporting glob patterns is worth the added complexity (it's not if it's only used for prefix matches).

We could also consider whether we want users to be able to implicitly indicate range partitioning by subdirectory structure (à la EMR's ALTER TABLE RECOVER PARTITIONS extension to Hive, which I understand is now natively supported in Hive). I'm not necessarily saying this is a good idea, just something to consider...

@mbalazin
Contributor

mbalazin commented May 6, 2015

Specifying a directory will cover a lot of the cases for loading multiple files. I agree.

As for partitioning, it might suffice to do that at the level of relations rather than files:

  • Round robin by default.
  • Hash partition using some attribute values [and optionally a hash function].
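
As a point of reference, a minimal Python sketch of those two relation-level policies, independent of Myria's internals; rows are plain dicts, workers are numbered 0..k-1, and all names are illustrative:

# Hypothetical sketch of the two partitioning policies: round robin by default,
# or hash partitioning on a chosen attribute with an optional hash function.
import itertools

def round_robin_partition(rows, num_workers):
    """Default policy: cycle rows across workers."""
    workers = itertools.cycle(range(num_workers))
    return [(next(workers), row) for row in rows]

def hash_partition(rows, num_workers, attribute, hash_fn=hash):
    """Send each row to the worker chosen by hashing one attribute value."""
    return [(hash_fn(row[attribute]) % num_workers, row) for row in rows]

# hash_partition([{"id": 7}, {"id": 8}], 4, "id") sends rows with equal ids to
# the same worker, which is what a downstream join or group-by would want.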


