Confusing error when processing more than 32 dynamic tables within a single query #240

Closed
zlobober opened this issue Dec 11, 2023 · 1 comment
Labels
SPYT SPYT related

Comments

@zlobober
Collaborator

zlobober commented Dec 11, 2023

https://gist.github.com/zlobober/c5f782152cc19f10232730401b4b436f

This error occurs when more than 32 dynamic tables are processed simultaneously; 32 is the default value of spark.sql.sources.parallelPartitionDiscovery.threshold.
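To illustrate the threshold mentioned above, here is a minimal sketch of how the listing location depends on the number of input paths. It is a simplified model, not SPYT or Spark code: the function name `listing_mode` is hypothetical, and the only facts taken from Spark are the option name and its default of 32.

```python
# Default of spark.sql.sources.parallelPartitionDiscovery.threshold in Spark.
DEFAULT_THRESHOLD = 32


def listing_mode(num_paths: int, threshold: int = DEFAULT_THRESHOLD) -> str:
    """Simplified model: where the input paths get listed.

    Up to `threshold` paths are listed on the driver; above it, the listing
    is distributed to executors. `listing_mode` is a hypothetical helper
    used only for illustration.
    """
    return "driver" if num_paths <= threshold else "executors"


print(listing_mode(32))  # at the threshold: still listed on the driver
print(listing_mode(33))  # above the threshold: listed on executors
```

In a Spark session the option can be changed with `spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.threshold", "64")`; whether raising it avoids the error in a given SPYT version is an assumption, not something this thread confirms.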

First of all, the error is very confusing; if any kind of safety limit prevents a query from executing, it should be clear which limit was exceeded and (ideally) which parameter controls that limit.

Second, it is not clear why dynamic tables are any different from static tables here.

Third, it is unclear how this issue relates to the (opt-in) feature of using the partition_tables call for slicing both static and dynamic tables. Since this option erases any distinction between static and dynamic tables, it looks like enabling it should either fix both the static and dynamic cases or break them both.

@Alexvsalexvsalex Alexvsalexvsalex added the SPYT SPYT related label Dec 11, 2023
robot-piglet pushed a commit that referenced this issue Dec 20, 2023
This commit adds the ability to list dyntables (determine chunks/pivot keys) on executors. This happens when the number of tables to partition exceeds 32 (by default).
Listing tables on executors requires serializing paths. First, raw paths are sent to the executors; then, after processing, the results are returned to the driver. The result must contain information about the reading range, but the API is limited here: it expects communication via Path objects (roughly equivalent to a plain string).
For static tables we need only two integers, the begin row and the row count, which convert to a string trivially. Dyntables require more complex serialization because keys have a flexible representation.
In this commit, range information is packed into a string as base64-encoded YSON and can be used on both the driver and the executor side.
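The packing idea described in the commit message can be sketched roughly as follows. This is not the actual SPYT implementation: JSON stands in for YSON, the `@` separator and URL-safe base64 alphabet are assumptions, and the function names `pack_key_range` and `unpack_key_range` are hypothetical.

```python
import base64
import json


def pack_key_range(path: str, lower_key: list, upper_key: list) -> str:
    """Pack a key range into a single path-like string.

    JSON is a stand-in for YSON here; the '@' separator is an assumption.
    """
    payload = json.dumps({"lower": lower_key, "upper": upper_key}).encode("utf-8")
    # URL-safe base64 never emits '/' or '@', so the token cannot be
    # confused with path components or with the separator.
    token = base64.urlsafe_b64encode(payload).decode("ascii")
    return f"{path}@{token}"


def unpack_key_range(packed: str):
    """Recover the path and key range from a packed string."""
    path, token = packed.rsplit("@", 1)
    payload = json.loads(base64.urlsafe_b64decode(token))
    return path, payload["lower"], payload["upper"]


# Keys may mix types, which is why a plain "begin_count" string is not enough.
packed = pack_key_range("//home/table", ["a/b", 42], ["z", None])
print(unpack_key_range(packed))
```

The same string can be produced on an executor and decoded on the driver (or the other way round), which is the property the commit relies on when only a string-like Path can cross that boundary.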
@Alexvsalexvsalex
Collaborator

The issue was fixed in commit f8c6833.
Details are in the commit description.
In brief: the default was increased and partitioning support was added.
