-
Notifications
You must be signed in to change notification settings - Fork 167
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[611] RFC: Module Restructuring #612
base: main
Are you sure you want to change the base?
[611] RFC: Module Restructuring #612
Conversation
This will allow the user to easily add jars and the required dependencies when executing the standalone RunSync command as well as avoid issues with version differences when adding these jars to an existing class-path that may contain table format related dependencies already. | ||
In addition to this, the jars will still be published to allow users to compose their own shaded jar if they have specific needs or customizations that they want. | ||
|
||
We can also create these modules with a suffix indicating which version of the format is supported to allow for different implementations or versions of dependencies for the same format. For example, have an xtable-iceberg-v2 and xtable-iceberg-v3 module. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ashvina this is based off of our conversation, can you include some of the more practical items for why you would want this split?
One open question, should the existing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the proposal of adding the version number in Iceberg module. I think it may soon apply to Delta as well. Do you think it would be consistent to add the version proactively to all modules, or do you think it makes it verbose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am less concerned with how verbose it may be but more concerned around maintenance and how we will test these conflicting dependency versions in the future.
These conflicts may only surface as runtime errors when there is difference in method signature for example. | ||
This can make the experience brittle for the end user by requiring them to closely follow any upgrades within the XTable repo as well. | ||
In this scenario the user will also have dependencies on all 3 table formats by default. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition to the footprint of the fat jar and the class conflicts, the root causes of most issues in production, another reason is the performance of XTable. Currently, bootstrapping XTable is slow. One of the reasons is the service providers that load all table formats even if they are not needed. Splitting the modules will also split the service providers, which in turn would skip loading all formats. Currently, they service providers are hardcoded, resulting in the unnecessary class loading and checks.
This can make the experience brittle for the end user by requiring them to closely follow any upgrades within the XTable repo as well. | ||
In this scenario the user will also have dependencies on all 3 table formats by default. | ||
|
||
## Implementation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current implementation description outlines what users will need to do once the modules are created (thanks for adding it).
However, I anticipate the process of splitting the modules to be a bit involved task and won't be as simple as moving classes to new packages. There could be several sub-tasks, such as resolving test dependencies, moving any format classes in the API module, and more. It might be helpful to list these tasks here and create trackable sub-tasks that can be distributed among contributors.
Additionally, could you share how you imagine this change to take effect? For instance, clarify whether this change will happen all at once in one PR or incrementally in multiple PRs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the current draft: #618
It can be split up to pull out the test module into its own smaller change and then each format can also have its own PR after that.
What is the purpose of the pull request
Issue: #611
This is a proposal for how the project can be restructured to allow the user to only include the dependencies that they desire. It also contains a proposal for allowing the project to support multiple versions of a spec, like Iceberg v2 and v3.
Brief change log
Verify this pull request
NA