
Feat/spark sql side effects #104

Merged
merged 74 commits into svenkreiss:master on May 28, 2020

Conversation

@tools4origins (Collaborator) commented May 9, 2020

This PR contains all the modifications required by the Spark SQL implementation (#92) outside of pysparkling.sql.

12 files are affected by this PR:

.
├── pysparkling
│   ├── sql
│   │   ├── internal_utils
│   │   │   └── joins.py
│   │   └── types.py
│   ├── tests
│   │   ├── test_stat_counter.py
│   │   └── test_streaming_files
│   ├── __init__.py
│   ├── context.py
│   ├── rdd.py
│   ├── stat_counter.py
│   ├── storagelevel.py
│   └── utils.py
├── LICENSE
└── setup.py

As it mostly contains interfaces with Spark SQL, it sometimes refers to code that is not part of this PR; such references are commented in this PR.

The biggest chunks of code are:

pysparkling/stat_counter.py, as this PR adds stat counters similar to the existing StatCounter but for Columns and Rows (a sketch of the moment computation follows the list below). Those counters compute the following stats:

  • mean
  • variance_pop
  • variance_samp
  • variance
  • stddev_pop
  • stddev_samp
  • stddev
  • min
  • max
  • sum
  • skewness
  • kurtosis
  • covar_samp
  • covar_pop
  • pearson_correlation
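
As a minimal sketch of the idea (hypothetical names, not necessarily the PR's code): every per-column stat above except the covariance/correlation ones can be derived in a single pass by maintaining the count, mean, min, max, sum, and the second to fourth central moments, using the standard online (Welford/Pébay) update:

import math

class ColumnStatSketch:
    """Single-pass stat counter for one column (hypothetical sketch)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of (x - mean)**2; m3, m4 are the 3rd/4th moments
        self.m3 = 0.0
        self.m4 = 0.0
        self.min = float("inf")
        self.max = float("-inf")
        self.sum = 0.0

    def merge(self, value):
        # Online update of the central moments; order matters:
        # m4 uses the old m2/m3, m3 uses the old m2, m2 is updated last.
        n1 = self.count
        self.count = n = n1 + 1
        delta = value - self.mean
        delta_n = delta / n
        delta_n2 = delta_n * delta_n
        term1 = delta * delta_n * n1
        self.mean += delta_n
        self.m4 += (term1 * delta_n2 * (n * n - 3 * n + 3)
                    + 6 * delta_n2 * self.m2 - 4 * delta_n * self.m3)
        self.m3 += term1 * delta_n * (n - 2) - 3 * delta_n * self.m2
        self.m2 += term1
        self.sum += value
        self.min = min(self.min, value)
        self.max = max(self.max, value)
        return self

    @property
    def variance_pop(self):
        return self.m2 / self.count

    @property
    def variance_samp(self):
        return self.m2 / (self.count - 1) if self.count > 1 else float("nan")

    @property
    def stddev_pop(self):
        return math.sqrt(self.variance_pop)

    @property
    def stddev_samp(self):
        return math.sqrt(self.variance_samp)

    @property
    def skewness(self):
        if self.m2 == 0:
            return float("nan")
        return math.sqrt(self.count) * self.m3 / self.m2 ** 1.5

    @property
    def kurtosis(self):
        # Excess kurtosis, as Spark reports it (normal data -> 0).
        if self.m2 == 0:
            return float("nan")
        return self.count * self.m4 / (self.m2 * self.m2) - 3

covar_pop, covar_samp and pearson_correlation would come from a two-column counter that additionally tracks the co-moment Σ(x - x̄)(y - ȳ) in the same incremental style; as in Spark SQL, variance and stddev are aliases for the sample variants.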

pysparkling/utils.py, as it introduces many utility functions.

@@ -189,6 +189,10 @@ def newRddId(self):
         Context.__last_rdd_id += 1
         return Context.__last_rdd_id

+    @property
+    def defaultParallelism(self):
+        return 1
@tools4origins (Collaborator, Author) commented:

This method is later used by Spark SQL to determine, in some cases, how many partitions a DataFrame should have.

In the multiprocessing case it would be nice to return the size of the Context's pool, but I didn't find a clear method on ProcessPoolExecutor/multiprocessing.Pool/ThreadPoolExecutor to retrieve this information.
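A possible workaround (a sketch only; this hypothetical constructor is not pysparkling's actual API): the executors store their size only in private attributes (ProcessPoolExecutor._max_workers, multiprocessing.Pool._processes), so a Context could instead record the worker count it was constructed with:

import os
from concurrent.futures import ProcessPoolExecutor

class Context:
    def __init__(self, max_workers=None):
        # Remember the requested size ourselves instead of reading
        # the executor's private _max_workers attribute.
        self._parallelism = max_workers or os.cpu_count() or 1
        self._pool = ProcessPoolExecutor(max_workers=self._parallelism)

    @property
    def defaultParallelism(self):
        return self._parallelism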

@svenkreiss (Owner) left a comment:

Sorry for being so slow.

This looks really great! I haven't been able to look over everything in detail, but we should move forward.

@svenkreiss merged commit f1a1975 into svenkreiss:master on May 28, 2020
@tools4origins (Collaborator, Author) commented:

> Sorry for being so slow.

I strongly disagree that you should be sorry: it's normal that we all have other occupations, and reviewing this PR was a big piece of work. And it's not the most interesting type of work!

You have created a project that is extremely useful to me; that's already more than one could expect, and I am really grateful for it. At no time should anyone think that you have to dedicate more time or effort to it. I may not have realized that before creating the first, huge PR. I am sorry about that.

I am fully aware of how hard it can be to dedicate time to such a project; I struggle with it too. I believe that with more time I could have structured these changes to make them easier to merge, and that's on me too.

Besides, openpifpaf is astonishing, and I guess time-consuming ;)

@svenkreiss (Owner) commented May 31, 2020 via email
