Feat/spark sql #92
Conversation
Uff. 362 commits. 65 changed files. How do we move forward? We need to break it down somehow. Can you split the sql module into a separate PR? That way I can review all your new RDD functions etc. first (and then merge them) before going into anything sql related. Is that possible and fair?
Tests are green! \o/ I will split as suggested, starting with one related to #93, as they are not green with py27.
@svenkreiss @tools4origins I appreciate all your work. Is there any timeline to get this reviewed and merged?
Hi @notnami, there is a merged version of this code available here: https://github.com/tools4origins/pysparkling, please feel free to use it and report any issues you encounter! :) On my side, I created an alternative PR to start the merge: #99
Hi @tools4origins @svenkreiss This looks like a really good feature to have, as lots of applications nowadays use Spark SQL. Anything I can contribute to getting this PR merged? Thanks!
Hi @ketgo, and thank you for the feedback! If you have use cases where this support could be useful, I believe the best way to help is to try it! Furthermore, as the support of Spark SQL is still only partial, that would clarify what the main missing components are :)
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This seems like it would be very useful.
Created #105 to #108, which contain a part of this PR and make it easier to digest. The next main PRs should contain:
And after that, I don't know, as both repos would be synced again :) |
Columns are now part of the #113 PR! 🎉 |
This PR is related to #47: it implements a big part of the DataFrame-related APIs using pure Python.
It is already huge in terms of code, and for that I'm sorry, as there's a lot to review.
On the other hand, it adds a lot of functionality and supports some of Spark's nicest features :).
NB: I'm opening this PR as-is for 2 main reasons:
What this PR is about
It introduces the `DataFrame` object, a data structure that contains `Row`s of data. `Row`s are quite similar to `namedtuple`s. `DataFrame`'s coolest feature is that you describe operations based on the schema of your `Row`s, not on their values (by manipulating `Column`s). `DataFrame` operations are backed by the existing `RDD` code; as in PySpark's DataFrame, most of the logic is not directly in `DataFrame` but in another object (in PySpark it is the Scala counterpart of `DataFrame`; here it is a `DataFrameInternal` object written in Python).

What this PR includes:
- The `pysparkling.sql` module, including:
  - `SparkSession` and `SQLContext`, which allow `DataFrame` creation and management
  - `DataFrame` and `GroupedData`
  - `Column`
  - Types
  - `DataFrameReader`, with partial support of `JSON` and `CSV`
  - `RDD`
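To make the ideas above concrete (`Row`s behaving like `namedtuple`s, `Column`-style expressions described against the schema rather than the values, and what CSV reading amounts to conceptually), here is a stdlib-only sketch; the `col` helper and the parsing are made-up stand-ins for illustration, not the pysparkling API:

```python
import csv
import io
from collections import namedtuple

# What a CSV reader conceptually does: the header line becomes the schema,
# the remaining lines become Rows (which behave much like namedtuples).
raw = "name,age\nAlice,29\nBob,41\n"
parsed = csv.DictReader(io.StringIO(raw))
Row = namedtuple("Row", parsed.fieldnames)
rows = [Row(d["name"], int(d["age"])) for d in parsed]

# A column expression refers to a field of the schema, not to concrete
# values; it is only applied to data when an operation actually runs.
def col(field):
    return lambda row: getattr(row, field)

age = col("age")
adults = [r for r in rows if age(r) >= 30]
print(adults)  # [Row(name='Bob', age=41)]
```

The point of the last three lines is the schema-vs-values distinction: `age` is built from the field name alone and could be reused against any collection of rows with that schema.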
What it does not include and what should be addressed in another PR:
Raw SQL string parsing, both for schema description and for query creation:
This does not work:
This works:
Window functions
Catalog related features
Streaming related features
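To illustrate what the first missing item means: a DDL-style schema string such as `"name STRING, age INT"` has to be parsed into structure before it can drive anything, whereas a programmatic schema is already structured, which is why the programmatic APIs can be supported without a SQL/DDL parser. A toy, stdlib-only illustration (not the pysparkling design):

```python
# A DDL-style schema string must first be parsed into structure...
ddl = "name STRING, age INT"
parsed_schema = [tuple(field.split()) for field in ddl.split(", ")]
print(parsed_schema)  # [('name', 'STRING'), ('age', 'INT')]

# ...whereas a programmatic schema is structured from the start,
# so no string parser is needed to consume it.
programmatic_schema = [("name", "STRING"), ("age", "INT")]
assert parsed_schema == programmatic_schema
```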
I'm available for any questions/as much walk-through on the code as you want 😃
(twitter: https://twitter.com/geekowan if you want to send DMs)