Feat/spark sql #92

tools4origins · 2019-11-03T18:54:26Z

This PR is related to #47: it implements a large part of the DataFrame-related APIs in pure Python.

It is already huge in terms of code, and I'm sorry for that, as there is a lot to review.
On the other hand, it adds a lot of features and supports some of Spark's nicest features :).

NB: I'm opening this PR as-is for 2 main reasons:

- To see what happens with the pysparkling test suite (I haven't tested much with python2, even though a lot of effort went into compatibility with it).
- To discuss whether there is a way to make it easier to ingest. One suggestion would be to split it, but there are still a lot of connected components that are codependent (mainly DataFrame and Column).

What this PR is about

It introduces the DataFrame object, a data structure that contains Rows of data. A Row is quite similar to a namedtuple.

DataFrame's coolest feature is that you describe operations based on the schema of your Rows rather than on their values (by manipulating Columns).
DataFrame operations are supported by the existing RDD code. As with PySpark's DataFrame, most of the logic is not directly in DataFrame but in another object (in PySpark it is in the Scala counterpart of DataFrame; here it is in a DataFrameInternal object written in Python).
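
To illustrate the idea of describing operations on the schema rather than on the values, here is a minimal sketch in plain Python. It is not the code of this PR: `Column`, `col` and `select` are hypothetical stand-ins written for this example, and a plain list of namedtuples stands in for a DataFrame.

```python
from collections import namedtuple

# A Row is close to a namedtuple: fields are accessed by name.
Row = namedtuple("Row", ["name", "age"])


class Column:
    """Hypothetical stand-in: an expression evaluated against a row later on."""

    def __init__(self, name, evaluate):
        self.name = name
        self.evaluate = evaluate  # callable taking a row and returning a value

    def __add__(self, other):
        # Build a new expression instead of computing a value right away.
        return Column(
            "({} + {})".format(self.name, other),
            lambda row: self.evaluate(row) + other,
        )


def col(field):
    """Refer to a field by its name in the schema, without touching any data."""
    return Column(field, lambda row: getattr(row, field))


def select(rows, *columns):
    """Apply each column expression to every row and build the resulting rows."""
    # rename=True replaces names that are not valid identifiers, e.g. "(age + 1)".
    out_row = namedtuple("Row", [c.name for c in columns], rename=True)
    return [out_row(*(c.evaluate(row) for c in columns)) for row in rows]


rows = [Row("Alice", 30), Row("Bob", 25)]
age_next_year = col("age") + 1  # no row has been read at this point
result = select(rows, col("name"), age_next_year)
```

The expression `col("age") + 1` only records what to compute; the values are read when `select` walks over the rows. That is roughly the split between a Column (the description) and the object that actually runs the operation.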

What this PR includes:

- pysparkling.sql module, including:
  - SparkSession and SQLContext, which allow DataFrame creation and management
  - DataFrame and GroupedData
  - Column
  - Types
- DataFrameReader, with partial support of JSON and CSV
- Some missing methods of RDD
- An implementation of most of the PySpark SQL functions, both classic expressions and aggregations
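
As a rough illustration of the difference between a classic expression and an aggregation, here is a small sketch in plain Python. The `Upper` and `Count` classes and their `eval`/`merge` methods are hypothetical names chosen for this example, not the ones used in this PR.

```python
class Upper:
    """Row-wise expression (hypothetical): yields one value per input row."""

    def __init__(self, field):
        self.field = field

    def eval(self, row):
        return row[self.field].upper()


class Count:
    """Aggregation (hypothetical): consumes many rows, yields a single value."""

    def __init__(self):
        self.total = 0

    def merge(self, row):
        # Fold one more row into the running state.
        self.total += 1

    def eval(self):
        return self.total


rows = [{"name": "alice"}, {"name": "bob"}, {"name": "carol"}]

# A classic expression is evaluated once per row.
upper_names = [Upper("name").eval(row) for row in rows]

# An aggregation keeps state across rows and is evaluated once at the end.
counter = Count()
for row in rows:
    counter.merge(row)
row_count = counter.eval()
```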

What it does not include and that should be addressed in another PR:

Raw SQL string parsing, both for schema description and for query creation:

This does not work:

spark.sql("select count(1) from swimmers").show()

This works:

df = spark.read.csv("swimmer")
df.select(count(1)).show()

Window functions
Catalog related features
Streaming related features

I'm available for any questions and for as much of a walk-through of the code as you want 😃
(twitter: https://twitter.com/geekowan if you want to send DMs)

svenkreiss · 2019-11-22T14:10:12Z

Uff. 362 commits. 65 changed files.
This is an amazing effort and I want to support it in any way I can.

How do we move forward? We need to somehow break it down. Can you split the sql module into a separate PR? This way I can review all your new RDD functions etc first (and then merge them) before going into anything sql related. Is that possible and fair?

tools4origins · 2019-11-25T22:51:48Z

Tests are green! \o/

I will split as suggested, starting with one related to #93 as they are not green with py27

anna-hope · 2020-01-15T19:56:42Z

@svenkreiss @tools4origins I appreciate all your work. Is there any timeline to get this reviewed and merged?

tools4origins · 2020-01-21T19:17:56Z

Hi @notnami, there is a merged version of this code available here: https://github.com/tools4origins/pysparkling, please feel free to use it and report any issues you encounter! :)

On my side I created an alternative PR to start the merge: #99

ketgo · 2020-03-03T07:52:26Z

Hi @tools4origins @svenkreiss This looks like a really good feature to have as lots of applications nowadays use Spark SQL. Anything I can contribute on getting this PR merged? Thanks!

tools4origins · 2020-03-04T21:09:02Z

> Hi @tools4origins @svenkreiss This looks like a really good feature to have as lots of applications nowadays use Spark SQL. Anything I can contribute on getting this PR merged? Thanks!

Hi @ketgo and thank you for the feedback!

If you have use cases where this support could be useful, I believe the best way to help is to try it!
The master branch of https://github.com/tools4origins/pysparkling/ contains this PR merged; feedback and defect reports would be highly valuable!

Furthermore, as the support of Spark SQL is still only partial, it would clarify which main components are missing :)

stale · 2020-05-03T21:22:26Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale · 2020-07-27T05:04:46Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

MattFenner · 2020-07-27T05:32:24Z

This seems like it would be very useful

tools4origins · 2020-08-17T19:52:35Z

Created #105 to #108, which contain part of this PR and make it easier to digest.

The next main PRs should contain:

Expression definitions and Operators
Some expressions and their associated functions
Columns
DataFrame
Quite a lot of SQL functions (probably in several PRs)
DataFrameReaders

And after that, I don't know, as both repos would be synced again :)

tools4origins · 2020-08-23T17:25:56Z

Columns are now part of the #113 PR! 🎉

tools4origins · 2020-10-04T09:30:24Z

A basic implementation of DataFrame is now part of #117, and #121 implements most of the DataFrame methods.

tools4origins · 2020-10-14T18:25:22Z

What a ride it has been! The features proposed by this PR are now integrated in this repository, but with a heavy rewrite of the history. See PRs #93 to #129!

tools4origins force-pushed the feat/sparkSQL branch from ca1802a to efcc982 Compare November 13, 2019 19:07

tools4origins added a commit to tools4origins/pysparkling that referenced this pull request Nov 25, 2019

Revert to feat/sparkSql state (svenkreiss#92)

ec3e630

tools4origins added a commit to tools4origins/pysparkling that referenced this pull request Nov 25, 2019

Revert to feat/sparkSql state (svenkreiss#92, feat/sparkSQL/1ca7a2a64…

bbbf2b2

…d7f53ff44d02b033e571914d032b60a)

tools4origins mentioned this pull request Nov 25, 2019

Travis python versions #94

Closed

tools4origins added a commit to tools4origins/pysparkling that referenced this pull request Nov 25, 2019

Revert to feat/sparkSql state (svenkreiss#92 aka feat/sparkSQL aka 1c…

1df07fe

…a7a2a)

tools4origins added a commit to tools4origins/pysparkling that referenced this pull request Nov 26, 2019

Revert to feat/sparkSql state (svenkreiss#92 aka feat/sparkSQL aka 1c…

4224251

…a7a2a)

tools4origins added a commit to tools4origins/pysparkling that referenced this pull request Nov 26, 2019

Revert to feat/sparkSql state (svenkreiss#92 aka feat/sparkSQL aka 1c…

8284334

…a7a2a)

tools4origins added a commit to tools4origins/pysparkling that referenced this pull request Nov 26, 2019

Revert to feat/sparkSql state (svenkreiss#92 aka feat/sparkSQL aka 1c…

61d1496

…a7a2a)

tools4origins added a commit to tools4origins/pysparkling that referenced this pull request Nov 26, 2019

Revert to feat/sparkSql state (svenkreiss#92 aka 1ca7a2a)

4102c32

tools4origins added a commit to tools4origins/pysparkling that referenced this pull request Nov 26, 2019

Revert to feat/sparkSql state (svenkreiss#92 aka 1ca7a2a)

4c178b2

tools4origins added a commit to tools4origins/pysparkling that referenced this pull request Nov 26, 2019

Revert to feat/sparkSql state (svenkreiss#92 aka 1ca7a2a)

7d87181

tools4origins added a commit to tools4origins/pysparkling that referenced this pull request Jan 26, 2020

Revert to feat/sparkSql state (svenkreiss#92 aka 1ca7a2a)

7cc5183

stale bot added the stale label May 3, 2020

svenkreiss removed the stale label May 4, 2020

tools4origins force-pushed the feat/sparkSQL branch from 9920572 to bdd1ec1 Compare May 9, 2020 08:49

tools4origins added 8 commits May 9, 2020 11:30

Remove no longer used parameter

82b8b5f

Define all join types

39367f5

Add tests on coalesce

dcd19e4

Implement all join types for join on values

e990fa3

Fix aggregation detection

b124392

Remove no longer relevant comment

dedcbbe

Move arguments checks from InternalDataFrame to DataFrame

ed82500

Fix parsing of operator applied to Columns

ad879ed

tools4origins added 19 commits May 9, 2020 11:40

Split CaseWhen and Otherwise expression types

ac6b344

Disable pylint error on a todo

860684e

Enhance code readability

7a250c0

Fix error in csv reader when header option is not set

456a061

Fix error in regexp_extract when there is no match

5b436eb

Clarify data source reader limitations on schema format

1e75d69

Implement input_file_name()

239b05e

Add sql packages to setup.py

ca3057f

Add missing argument parsing in some functions

f58150d

Fix Column.contains handling of strings

6ecc10d

Fix handling of None values in stats aggregations

4b37e21

Fix Greatest and Least behaviour

edf2f61

Implement pivot

4a2177a

Align the MapColumn init with that of other Expressions

f0578f2

Fix hasattr behaviour

c516cd8

Remove non-project-related ignored folder

01a2ffd

Align import style with project

4eef760

Pylint-related style modifications

331feee

Remove code duplicate

bcc0176

tools4origins force-pushed the feat/sparkSQL branch from cbf96ec to bcc0176 Compare May 9, 2020 09:43

tools4origins mentioned this pull request May 9, 2020

Feat/spark sql side effects #104

Merged

stale bot added stale and removed stale labels Jul 27, 2020

tools4origins closed this Oct 14, 2020
