Feat/spark sql #92

Closed
wants to merge 379 commits into from

Conversation

tools4origins
Collaborator

This PR is related to #47: it implements a large part of the DataFrame-related APIs in pure Python.

It is already huge in terms of code, and for that I'm sorry, as there's a lot to review.
On the other hand, it adds a lot of features and supports some of Spark's nicest features :).

NB: I'm opening this PR as-is for 2 main reasons:

  • See what happens with the test suite of pysparkling (I haven't tested much with Python 2, even though a lot of effort went into compatibility with it)
  • Discuss whether there is a way to make it easier to ingest. One suggestion would be to split it, but many of the components are codependent (mainly DataFrame and Column).

What this PR is about

It introduces the DataFrame object, a data structure that contains Rows of data. Rows are quite similar to namedtuples.
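
To make the namedtuple comparison concrete, here is a minimal sketch using only the standard library. The `Swimmer` type and its fields are made up for illustration; this is not pysparkling's `Row` class, only the analogy the description draws.

```python
from collections import namedtuple

# Hypothetical record type standing in for a Row; field names are illustrative.
Swimmer = namedtuple("Swimmer", ["name", "age"])

rows = [Swimmer("Katie", 19), Swimmer("Michael", 22)]

# Fields are reachable by name or by position.
first = rows[0]
name_by_attr = first.name
name_by_index = first[0]

# The field names act as a very small schema shared by every record.
schema = Swimmer._fields
as_dict = first._asdict()
```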

DataFrame's coolest feature is that you describe operations based on the schema of your Rows rather than on their values (by manipulating Columns).
DataFrame operations are backed by the existing RDD code. As in PySpark, most of the logic lives not in DataFrame itself but in another object: in PySpark that is the Scala counterpart of DataFrame, here it is a DataFrameInternal object written in Python.
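
The idea of describing operations on the schema and delegating the work to an internal object can be sketched roughly as follows. Every name here (`ToyColumn`, `ToyInternal`, `ToyDataFrame`, `col`) is hypothetical and invented for this sketch; it is not the code of this PR, and a plain list stands in for the RDD.

```python
from collections import namedtuple


class ToyColumn:
    """Hypothetical stand-in for a Column: an expression evaluated per row."""

    def __init__(self, evaluate, name):
        self.evaluate = evaluate  # function taking a row and returning a value
        self.name = name

    def __add__(self, other):
        # Build a new expression; nothing is computed at this point.
        def evaluate(row):
            right = other.evaluate(row) if isinstance(other, ToyColumn) else other
            return self.evaluate(row) + right

        return ToyColumn(evaluate, "sum_of_" + self.name)

    def alias(self, name):
        return ToyColumn(self.evaluate, name)


def col(field):
    # Refers to a field of the schema, not to any particular value.
    return ToyColumn(lambda row: getattr(row, field), field)


class ToyInternal:
    """Hypothetical internal object holding the rows and doing the work."""

    def __init__(self, rows):
        self.rows = list(rows)

    def select(self, columns):
        out_type = namedtuple("Row", [c.name for c in columns])
        return ToyInternal(
            out_type(*(c.evaluate(row) for c in columns)) for row in self.rows
        )


class ToyDataFrame:
    """Thin public facade: each method delegates to the internal object."""

    def __init__(self, internal):
        self._internal = internal

    def select(self, *columns):
        return ToyDataFrame(self._internal.select(columns))

    def collect(self):
        return list(self._internal.rows)


Swimmer = namedtuple("Swimmer", ["name", "age"])
df = ToyDataFrame(ToyInternal([Swimmer("Katie", 19), Swimmer("Michael", 22)]))
result = df.select(col("name"), (col("age") + 1).alias("age_next_year")).collect()
```

The expression `col("age") + 1` is built before any row is read, which is what lets the same description apply to every Row that matches the schema.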

What this PR includes:

  • pysparkling.sql module, including:
    • SparkSession and SQLContext that allow DataFrame creation and management
    • DataFrame and GroupedData
    • Column
    • Types
  • DataFrameReader partial support of JSON and CSV
  • Some missing methods of RDD
  • An implementation of most of PySpark SQL functions, both classic expression and aggregations
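
As a rough illustration of the difference between a classic (per-row) expression and an aggregation, here is a pure-Python sketch. The `Swimmer` records and the helpers `upper_name` and `count_by` are assumptions made up for this example and are not functions from this PR.

```python
from collections import namedtuple

# Hypothetical records; the field names are illustrative only.
Swimmer = namedtuple("Swimmer", ["name", "team", "age"])

rows = [
    Swimmer("Katie", "blue", 19),
    Swimmer("Michael", "red", 22),
    Swimmer("Simone", "blue", 21),
]


def upper_name(row):
    # Classic expression: one output value for each input row.
    return row.name.upper()


def count_by(rows, field):
    # Aggregation: many input rows collapse into one value per group.
    counts = {}
    for row in rows:
        key = getattr(row, field)
        counts[key] = counts.get(key, 0) + 1
    return counts


upper_names = [upper_name(row) for row in rows]
team_counts = count_by(rows, "team")
```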

What it does not include and should be addressed in another PR:

  • Raw SQL strings parsing, both for schema description and for query creation:

    This does not work:

    spark.sql("select count(1) from swimmers").show()
    

    This works:

    df = spark.read.csv("swimmer")
    df.select(count(1)).show()
  • Window functions

  • Catalog related features

  • Streaming related features

I'm available for any questions and for as much of a walk-through of the code as you want 😃
(twitter: https://twitter.com/geekowan if you want to send DMs)

@svenkreiss
Owner

Uff. 362 commits. 65 changed files.
This is an amazing effort and I want to support it in any way I can.

How do we move forward? We need to somehow break it down. Can you split the sql module into a separate PR? This way I can review all your new RDD functions etc first (and then merge them) before going into anything sql related. Is that possible and fair?

@tools4origins
Collaborator Author

tools4origins commented Nov 25, 2019

Tests are green! \o/

I will split as suggested, starting with one related to #93 as they are not green with py27

tools4origins added 3 commits to tools4origins/pysparkling that referenced this pull request Nov 25, 2019
tools4origins added 6 commits to tools4origins/pysparkling that referenced this pull request Nov 26, 2019
@anna-hope

@svenkreiss @tools4origins I appreciate all your work. Is there any timeline to get this reviewed and merged?

@tools4origins
Collaborator Author

Hi @notnami, there is a merged version of this code available here: https://github.com/tools4origins/pysparkling. Please feel free to use it and report any issues you encounter! :)

On my side I created an alternative PR to start the merge: #99

tools4origins added a commit to tools4origins/pysparkling that referenced this pull request Jan 26, 2020
@ketgo

ketgo commented Mar 3, 2020

Hi @tools4origins @svenkreiss This looks like a really good feature to have as lots of applications nowadays use Spark SQL. Anything I can contribute on getting this PR merged? Thanks!

@tools4origins
Collaborator Author

Hi @ketgo and thank you for the feedback!

If you have use cases where this support could be useful, I believe the best way to help is to try it!
The master branch of https://github.com/tools4origins/pysparkling/ contains this PR merged; feedback and defect reports would be highly valuable!

Furthermore, as support for Spark SQL is still only partial, it would clarify which main components are missing :)

@stale

stale bot commented May 3, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale

stale bot commented Jul 27, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added stale and removed stale labels Jul 27, 2020
@MattFenner

This seems like it would be very useful

@tools4origins
Collaborator Author

Created #105 to #108, which contain part of this PR and make it easier to digest.

The next main PRs should contain:

  1. Expression definitions and Operators
  2. Some expressions and their associated functions
  3. Columns
  4. DataFrame
  5. Quite a lot of SQL functions (probably in several PRs)
  6. DataFrameReaders

And after that, I don't know, as both repos would be synced again :)

@tools4origins
Collaborator Author

Columns are now part of the #113 PR! 🎉

@tools4origins
Collaborator Author

A basic implementation of DataFrame is now part of #117, and #121 implements most of the DataFrame methods.

@tools4origins
Collaborator Author

What a ride it has been! The features proposed by this PR are now integrated in this repository, though with a heavy rewrite of the history: see PRs #93 to #129!

5 participants