Support for variables wrapped around RDD #79

Closed
ghislainfourny opened this issue Oct 1, 2018 · 5 comments

Comments

@ghislainfourny
Member

In the medium to long term, we will want to bind a variable not to a materialized sequence of items but to an "RDD wrapper" that acts as a proxy to Spark within local expressions. This requires adapting the code of the dynamic context.

Example:

let $a := json-text("hdfs://.../file.json")
let $b := json-text("hdfs://.../file2.json")
return { a: count($a), b: count($b) }

The above FLWOR expression is local: the let clauses are executed locally, but each one wraps the RDD returned by json-text as if it were an opaque local value. Support for local FLWORs is therefore a prerequisite.

Note that this feature will be incompatible with FLWORs running on Spark: because Spark forbids nesting RDDs, only "materialized" dynamic contexts can be used in that case.
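
To make the idea concrete, here is a minimal sketch of a dynamic context whose variables can be bound either to a materialized sequence or to a deferred wrapper. It is an illustration only, not the actual Rumble code: every name in it (`RddWrapperSketch`, `Binding`, `MaterializedBinding`, `DeferredBinding`, `DynamicContext`) is hypothetical, and the "RDD" is stood in for by a plain `Supplier` so the example runs without Spark.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch: none of these names come from the Rumble code base.
final class RddWrapperSketch {

    /** What a variable can be bound to in the dynamic context. */
    interface Binding {
        long count();
        boolean isMaterialized();
    }

    /** A sequence of items held in local memory. */
    static final class MaterializedBinding implements Binding {
        private final List<String> items;

        MaterializedBinding(List<String> items) {
            this.items = new ArrayList<>(items); // defensive copy
        }

        public long count() { return items.size(); }
        public boolean isMaterialized() { return true; }
    }

    /** Stand-in for an RDD wrapper: holds a handle and delegates on demand. */
    static final class DeferredBinding implements Binding {
        private final Supplier<Long> remoteCount; // assumption: would call into Spark

        DeferredBinding(Supplier<Long> remoteCount) {
            this.remoteCount = remoteCount;
        }

        public long count() { return remoteCount.get(); }
        public boolean isMaterialized() { return false; }
    }

    /** Maps variable names to bindings without caring which kind they are. */
    static final class DynamicContext {
        private final Map<String, Binding> variables = new HashMap<>();

        void bind(String name, Binding binding) {
            variables.put(name, binding);
        }

        long count(String name) {
            Binding binding = variables.get(name);
            if (binding == null) {
                throw new IllegalArgumentException("Unbound variable: $" + name);
            }
            return binding.count();
        }
    }

    public static void main(String[] args) {
        DynamicContext ctx = new DynamicContext();
        ctx.bind("a", new MaterializedBinding(Arrays.asList("{}", "{}", "{}")));
        ctx.bind("b", new DeferredBinding(() -> 5L)); // pretend this count ran remotely
        System.out.println("{ \"a\" : " + ctx.count("a") + ", \"b\" : " + ctx.count("b") + " }");
    }
}
```

The point of the sketch is that the local expression calling `count` never needs to know whether the items were materialized; `isMaterialized` is the kind of check a Spark-side FLWOR could use to reject deferred bindings.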

@ghislainfourny
Member Author

This is being addressed.

@CanBerker
Collaborator

I have just debugged the following nearly identical query (the file locations are updated and the json-file function is used instead of json-text):
let $a := json-file("./src/main/resources/queries/conf-ex.json")
let $b := json-file("./src/main/resources/queries/conf-ex.json")
return { a: count($a), b: count($b) }

In the current state on master, this query is evaluated completely locally.

Is anything else needed to fully address this issue?

@ghislainfourny
Member Author

ghislainfourny commented Apr 7, 2020

@CanBerker what is the status: are the two counts now evaluated in parallel? Thanks!

@CanBerker
Collaborator

The current result:

rumble$ let $a := json-file("./src/main/resources/queries/conf-ex.json")
>>> let $b := json-file("./src/main/resources/queries/conf-ex.json")
>>> return { a: count($a), b: count($b) }
>>>
>>>
{ "a" : 5, "b" : 5 }
The query took 2289 milliseconds to execute.

I remember testing this at the time and verifying that the local API was used to evaluate this query, since the let clause is capable of storing RDDs as variables. Unless I'm missing something, I think this issue is addressed.

@ghislainfourny
Member Author

Thanks! :)
