Feat/spark sql side effects #104
Conversation
@@ -189,6 +189,10 @@ def newRddId(self):
         Context.__last_rdd_id += 1
         return Context.__last_rdd_id
+
+    @property
+    def defaultParallelism(self):
+        return 1
This property is later used by Spark SQL to determine, in some cases, how many partitions a DataFrame should have.
With multiprocessing it would be nice to return the size of the Context's pool, but I didn't find a public method on ProcessPoolExecutor, multiprocessing.Pool, or ThreadPoolExecutor to retrieve this information.
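For reference, CPython does store the worker count on these classes, but only in private attributes (`_max_workers` on the executors, `_processes` on `multiprocessing.Pool`). A best-effort sketch relying on those implementation details could look like this; it is not a stable, public API:

```python
import multiprocessing
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def pool_size(pool, default=1):
    """Best-effort lookup of a pool's worker count.

    Relies on CPython implementation details (private attributes),
    so it falls back to `default` when neither attribute exists.
    """
    for attr in ('_max_workers', '_processes'):
        size = getattr(pool, attr, None)
        if size is not None:
            return size
    return default


if __name__ == '__main__':
    print(pool_size(ThreadPoolExecutor(max_workers=4)))   # 4
    print(pool_size(ProcessPoolExecutor(max_workers=2)))  # 2
    print(pool_size(multiprocessing.Pool(3)))             # 3
```

Whether relying on private attributes is acceptable here is debatable; returning 1 is the safe default.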
Travis is green: https://travis-ci.org/github/svenkreiss/pysparkling/builds/685014035
Sorry for being so slow.
This looks really great! I haven't been able to look over everything in detail, but we should move forward.
I strongly disagree that you should be sorry: it's normal that we all have other occupations, and reviewing this PR was a big piece of work. And it's not the most interesting type of work!
You have created a project that is extremely useful to me; it's already more than one would expect, and I am really grateful for it. No one should ever think that you have to dedicate more time or effort to it. I may not have realized that before creating the first, huge PR, and I am sorry about that.
I am fully aware of how hard it can be to dedicate time to such a project; I struggle with it too. I believe that with more time I could have structured these changes to make them easier to merge, and that's on me too.
Besides, openpifpaf is astonishing, and I guess time-consuming ;)
Thank you for your kind words. The same is true for all your efforts, though, and I am grateful for all the work you put in, and also for the extra work you put in to make this easy to merge.
You are not wrong about openpifpaf being time-consuming :)
Thank YOU for all the work you do. It is much appreciated.
This PR contains all the modifications required by the Spark SQL implementation (#92) outside of pysparkling.sql.
12 files are affected by this PR.
As it mostly contains interfaces with Spark SQL, it sometimes refers to code that is not part of this PR; such references are commented in this PR.
The biggest chunks of code are:
pysparkling/stat_counter.py, as this PR adds stat counters similar to the existing StatCounter but for Columns and Rows (see the sketch after this list). Those counters compute the following stats:
pysparkling/utils.py, as it introduces many utility functions
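As an illustration of the StatCounter idea applied to a single column (hypothetical names, not the PR's actual code), such a counter could fold values in one pass using Welford's method:

```python
# Hypothetical sketch of a StatCounter-like aggregator for one column;
# the class and method names are illustrative, not this PR's actual code.
class ColumnStatCounter:
    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0            # sum of squared deviations, for variance
        self.min = float('inf')
        self.max = float('-inf')

    def merge(self, value):
        """Fold one column value into the running stats (Welford's method)."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)
        self.min = min(self.min, value)
        self.max = max(self.max, value)
        return self

    def variance(self):
        return self.m2 / self.count if self.count else float('nan')


counter = ColumnStatCounter()
for v in [1.0, 2.0, 4.0]:
    counter.merge(v)
print(counter.count, counter.mean, counter.min, counter.max)  # 3 2.33... 1.0 4.0
```

Merging two such counters pairwise (as PySpark's StatCounter does with mergeStats) follows the same pattern, which is presumably what allows the stats to be computed per partition and then reduced.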