
Cleanup Documentation - Spark Core Classes #18

Closed
sjrusso8 opened this issue Apr 16, 2024 · 11 comments · Fixed by #37
Labels
good first issue Good for newcomers

Comments

@sjrusso8
Owner

sjrusso8 commented Apr 16, 2024

Description

The overall documentation needs to be reviewed and matched against the Spark Core Classes and Functions. For instance, the README should accurately reflect which functions and methods are currently implemented compared to the existing Spark API.

However, there are probably a few misses: items that are currently implemented but marked as open, or that were accidentally excluded. We might also consider adding sections for other classes like StreamingQueryManager, DataFrameNaFunctions, DataFrameStatFunctions, etc.

@sjrusso8 sjrusso8 added the good first issue Good for newcomers label Apr 16, 2024
@abrassel
Contributor

abrassel commented May 5, 2024

Hi @sjrusso8, can I take this issue? I'd like more details about what "reviewing and matching" the documentation entails. For example, let's take DataFrame. I see in PySpark here that DataFrame has a set of existing docs. Do you want the Rust docs to be word-for-word the same?

@sjrusso8
Owner Author

sjrusso8 commented May 6, 2024

Thanks for reaching out! You can take this on if you want.

Here is what I was thinking; it's mostly just two parts. First, update the existing README and Rust docs so that they roughly match the Spark docs you linked. Like you mentioned for the DataFrame object, each of the DataFrame methods should have documentation similar to the existing Spark docs. For widely used methods like select, filter, sort, etc., we should probably provide a small example as well (see the sketch below). Updating the README would just be marking open or closed correctly for the various sections, and even adding new sections if you think it makes sense.
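
To make that concrete, here is a minimal sketch of the kind of rustdoc comment I had in mind for a commonly used method. The struct, method name, and signature here are placeholders for illustration, not the crate's actual API:

```rust
// Illustrative only: a placeholder DataFrame so this sketch compiles on its own;
// it is not the crate's real struct.
pub struct DataFrame;

impl DataFrame {
    /// Projects a set of columns and returns a new [`DataFrame`].
    ///
    /// Mirrors the description in the PySpark `DataFrame.select` docs.
    ///
    /// # Example
    ///
    /// `df.select(vec!["name", "age"])`
    pub fn select(self, _cols: Vec<&str>) -> DataFrame {
        // a real implementation would build the corresponding
        // Spark Connect `Project` relation here
        self
    }
}
```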

Second would be to identify the missing parts of the core classes, and then we can make a longer issue tracker to start building out those areas. For instance, DataFrame currently does not implement methods like approxQuantile, checkpoint, observe, etc. Classes like DataFrameNaFunctions and DataFrameStatFunctions are also not implemented (a rough sketch of what that could look like is below).
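
For the second part, here is a very rough sketch of what a DataFrameNaFunctions wrapper could look like. Everything in it (the accessor, method names, and signatures) mirrors the PySpark API but is hypothetical and does not exist in the crate yet:

```rust
// Hypothetical shape for a DataFrameNaFunctions wrapper, mirroring PySpark's
// `df.na` accessor. The DataFrame type is a placeholder, not the crate's struct.
pub struct DataFrame;

pub struct DataFrameNaFunctions {
    df: DataFrame,
}

impl DataFrame {
    /// Accessor mirroring PySpark's `DataFrame.na`.
    pub fn na(self) -> DataFrameNaFunctions {
        DataFrameNaFunctions { df: self }
    }
}

impl DataFrameNaFunctions {
    /// Drop rows containing nulls, like `df.na.drop(how="any")` in PySpark.
    pub fn drop(self, _how: &str, _subset: Option<Vec<String>>) -> DataFrame {
        // would build the corresponding Spark Connect drop-nulls relation here
        self.df
    }

    /// Replace nulls with a value, like `df.na.fill(0)` in PySpark.
    pub fn fill(self, _value: i64, _subset: Option<Vec<String>>) -> DataFrame {
        // would build the corresponding Spark Connect fill-nulls relation here
        self.df
    }
}
```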

This issue will be a way to create a cleaner roadmap of the work to be completed :) I can help with anything on this as well; I have slowly been making a list of gaps myself.

@abrassel
Contributor

Great! Thanks for the thorough explanation. I'll doubtless have more questions for you as I work on this.

@abrassel
Contributor

starting now!

@abrassel
Contributor

@sjrusso8 do you want me to create new issue trackers for every PySpark core class?
[screenshot: list of PySpark core classes]

@abrassel
Contributor

Follow-up question: I am seeing unresolved paths in rust-analyzer, such as spark::relation::RelType. Looking into the spark subdirectory, those paths do indeed seem to be missing. How do I find them?

@sjrusso8
Owner Author

Let's update the README with the "done/open" status and add any missing sections to it as well. Then let's create one issue per core class that should be implemented and can be implemented.

There are some things, like UDFs, that would not be feasible because of how the remote Spark cluster deserializes and runs them. Things like "toPandas" might become "toPolars" instead.

I'm still on the fence about translating an Arrow RecordBatch into the Spark "Row" representation when using collect (a rough sketch of what that could look like is below). I'm open to suggestions!
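
For reference, here is a minimal sketch of flattening an Arrow RecordBatch into per-row values using the arrow crate's downcasting APIs. The output type (a Vec of optional strings per row) is only a stand-in for whatever Row type we settle on, and only two column types are handled:

```rust
use arrow::array::{Array, Int64Array, StringArray};
use arrow::datatypes::DataType;
use arrow::record_batch::RecordBatch;

/// Flatten a RecordBatch into row-oriented values. `Vec<Vec<Option<String>>>`
/// is a placeholder for a real "Row" representation.
fn record_batch_to_rows(batch: &RecordBatch) -> Vec<Vec<Option<String>>> {
    let mut rows = vec![vec![None; batch.num_columns()]; batch.num_rows()];

    for (col_idx, column) in batch.columns().iter().enumerate() {
        for row_idx in 0..batch.num_rows() {
            if column.is_null(row_idx) {
                continue;
            }
            // Only Int64 and Utf8 are handled here; a real version would
            // cover the full Arrow type surface.
            let value = match column.data_type() {
                DataType::Int64 => column
                    .as_any()
                    .downcast_ref::<Int64Array>()
                    .map(|a| a.value(row_idx).to_string()),
                DataType::Utf8 => column
                    .as_any()
                    .downcast_ref::<StringArray>()
                    .map(|a| a.value(row_idx).to_string()),
                other => Some(format!("<unsupported: {other:?}>")),
            };
            rows[row_idx][col_idx] = value;
        }
    }

    rows
}
```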

@sjrusso8
Owner Author

> Follow-up question: I am seeing unresolved paths in rust-analyzer, such as spark::relation::RelType. Looking into the spark subdirectory, those paths do indeed seem to be missing. How do I find them?

Did you refresh the git submodule? That's probably because of the build step under "/core" that points to the Spark Connect protobuf located in the submodule. Make sure the submodule is checked out to the tag for 3.5.1 and not "main". (A rough sketch of that kind of build step is below.)
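
For context, the build step is roughly of this shape. This is a generic tonic-build sketch; the proto paths and builder options are assumptions for illustration and may not match the actual /core/build.rs:

```rust
// build.rs (sketch): compile the Spark Connect protobuf definitions that live
// in the git submodule. Paths below are assumptions, not the real layout.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    tonic_build::configure()
        .build_server(false)
        .compile(
            &["spark/connector/connect/common/src/main/protobuf/spark/connect/base.proto"],
            &["spark/connector/connect/common/src/main/protobuf"],
        )?;
    Ok(())
}
```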

@abrassel
Contributor

I checked out version 3.5.1 and the git submodule is up to date. Taking a look at the build step now.

@abrassel
Contributor

yep, works now!

@abrassel
Contributor

quick bump here @sjrusso8
