
Feat/spark sql side effects #104

Merged
merged 74 commits into svenkreiss:master on May 28, 2020

Conversation

@tools4origins (Collaborator) commented May 9, 2020

This PR contains all the modifications required by the Spark SQL implementation (#92) outside of pysparkling.sql.

12 files are affected by this PR:

.
├── pysparkling
│   ├── sql
│   │   ├── internal_utils
│   │   │   └── joins.py
│   │   └── types.py
│   ├── tests
│   │   ├── test_stat_counter.py
│   │   └── test_streaming_files
│   ├── __init__.py
│   ├── context.py
│   ├── rdd.py
│   ├── stat_counter.py
│   ├── storagelevel.py
│   └── utils.py
├── LICENSE
└── setup.py

As it mostly contains interfaces with Spark SQL, it sometimes refers to code that is not part of this PR; such references are commented in this PR.

The biggest chunks of code are:

pysparkling/stat_counter.py, as this PR adds stat counters similar to the existing StatCounter but for Columns and Rows (a sketch of the moment computation follows the list below). Those counters compute the following stats:

  • mean
  • variance_pop
  • variance_samp
  • variance
  • stddev_pop
  • stddev_samp
  • stddev
  • min
  • max
  • sum
  • skewness
  • kurtosis
  • covar_samp
  • covar_pop
  • pearson_correlation
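
As a minimal sketch of the idea (hypothetical names, not necessarily the PR's code): every per-column stat above except the covariance/correlation ones can be derived in a single pass by maintaining the count, mean, min, max, sum, and the second to fourth central moments, using the standard online (Welford/Pébay) update:

import math

class ColumnStatSketch:
    """Single-pass stat counter for one column (hypothetical sketch)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of (x - mean)**2; m3, m4 are the 3rd/4th moments
        self.m3 = 0.0
        self.m4 = 0.0
        self.min = float("inf")
        self.max = float("-inf")
        self.sum = 0.0

    def merge(self, value):
        # Online update of the central moments; order matters:
        # m4 uses the old m2/m3, m3 uses the old m2, m2 is updated last.
        n1 = self.count
        self.count = n = n1 + 1
        delta = value - self.mean
        delta_n = delta / n
        delta_n2 = delta_n * delta_n
        term1 = delta * delta_n * n1
        self.mean += delta_n
        self.m4 += (term1 * delta_n2 * (n * n - 3 * n + 3)
                    + 6 * delta_n2 * self.m2 - 4 * delta_n * self.m3)
        self.m3 += term1 * delta_n * (n - 2) - 3 * delta_n * self.m2
        self.m2 += term1
        self.sum += value
        self.min = min(self.min, value)
        self.max = max(self.max, value)
        return self

    @property
    def variance_pop(self):
        return self.m2 / self.count

    @property
    def variance_samp(self):
        return self.m2 / (self.count - 1) if self.count > 1 else float("nan")

    @property
    def stddev_pop(self):
        return math.sqrt(self.variance_pop)

    @property
    def stddev_samp(self):
        return math.sqrt(self.variance_samp)

    @property
    def skewness(self):
        if self.m2 == 0:
            return float("nan")
        return math.sqrt(self.count) * self.m3 / self.m2 ** 1.5

    @property
    def kurtosis(self):
        # Excess kurtosis, as Spark reports it (normal data -> 0).
        if self.m2 == 0:
            return float("nan")
        return self.count * self.m4 / (self.m2 * self.m2) - 3

covar_pop, covar_samp and pearson_correlation would come from a two-column counter that additionally tracks the co-moment Σ(x - x̄)(y - ȳ) in the same incremental style; as in Spark SQL, variance and stddev are aliases for the sample variants.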

pysparkling/utils.py, as it introduces many utility functions.

@@ -189,6 +189,10 @@ def newRddId(self):
         Context.__last_rdd_id += 1
         return Context.__last_rdd_id

+    @property
+    def defaultParallelism(self):
+        return 1
@tools4origins (Collaborator, Author) commented:

This method is later used by Spark SQL to determine, in some cases, how many partitions a DataFrame should have.

In the multiprocessing case it would be nice to return the size of the Context's pool, but I didn't find a clear method on ProcessPoolExecutor/multiprocessing.Pool/ThreadPoolExecutor to retrieve this information.
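A possible workaround (a sketch only; this hypothetical constructor is not pysparkling's actual API): the executors store their size only in private attributes (ProcessPoolExecutor._max_workers, multiprocessing.Pool._processes), so a Context could instead record the worker count it was constructed with:

import os
from concurrent.futures import ProcessPoolExecutor

class Context:
    def __init__(self, max_workers=None):
        # Remember the requested size ourselves instead of reading
        # the executor's private _max_workers attribute.
        self._parallelism = max_workers or os.cpu_count() or 1
        self._pool = ProcessPoolExecutor(max_workers=self._parallelism)

    @property
    def defaultParallelism(self):
        return self._parallelism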

@svenkreiss (Owner) left a comment:

Sorry for being so slow.

This looks really great! I haven't been able to look over everything in detail, but we should move forward.

@svenkreiss merged commit f1a1975 into svenkreiss:master on May 28, 2020
@tools4origins (Collaborator, Author) commented:

> Sorry for being so slow.

I strongly disagree that you should be sorry: it's normal that we all have other occupations, and reviewing this PR was a big piece of work. And it's not the most interesting type of work!

You have created a project that is extremely useful to me; that's already more than one could expect, and I am really grateful for it. At no time should anyone think that you have to dedicate more time or effort to it. I may not have realized that before creating the first, huge PR. I am sorry about that.

I am fully aware of how hard it can be to dedicate time to such a project; I struggle with it too. I believe that with more time I could have structured these changes to make them easier to merge, and that's on me too.

Besides, openpifpaf is astonishing, and I guess time-consuming ;)

@svenkreiss (Owner) commented May 31, 2020 via email
