
Add webpage for Generic Table support #1889

Merged
merged 7 commits into from
Jun 18, 2025

Contributor

@gh-yzou gh-yzou commented Jun 12, 2025

add a webpage for generic table support

fixes #1881

Contributor

@dimas-b dimas-b left a comment

Thanks for writing this doc, @gh-yzou ! I think it is very valuable for end users (even though the feature itself is still "beta").

## Working with Generic Table

There are two ways to work with Polaris Generic Tables today:
1) Directly communicate with Polaris through REST API calls using curl. Details will be described in the later section.
Contributor

Using curl seems to be an example rather than a requirement here, right? I suppose any tool capable of making HTTP requests will work just as well.

Contributor Author

Good point, I updated the wording to "using tools such as curl".
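
To illustrate the point that any HTTP-capable tool works, here is a minimal Python sketch that builds, but does not send, a create request using only the standard library. The host, base path, catalog name, URL layout, and bearer-token header are assumptions made for illustration, and the JSON field names follow the wording of the doc under review; all of them should be checked against the Catalog API spec before use.

```python
import json
import urllib.request

# Assumed values for illustration only; adjust to your deployment.
POLARIS_BASE_URL = "http://localhost:8181/api/catalog"  # hypothetical host and base path
CATALOG = "delta_catalog"                                # hypothetical catalog name


def build_create_generic_table_request(namespace, name, table_format, token,
                                       base_location=None, properties=None):
    """Build (but do not send) an HTTP request that creates a generic table."""
    payload = {"name": name, "format": table_format}
    # base-location and properties are optional, so only include them when given.
    if base_location is not None:
        payload["base-location"] = base_location
    if properties:
        payload["properties"] = properties
    # The path layout below is an assumption; verify it against the API spec.
    url = (f"{POLARIS_BASE_URL}/polaris/v1/{CATALOG}"
           f"/namespaces/{namespace}/generic-tables")
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {token}"},
        method="POST",
    )


request = build_create_generic_table_request(
    "delta_ns", "delta_table", "delta", token="<token>",
    base_location="s3://<my-bucket>/path/to/table",
)
```

Sending the request is left to the caller, for example with `urllib.request.urlopen(request)`; the same request could equally be issued with curl or any other HTTP client.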

Generic Table provides a different set of APIs to operate on the Generic Table entities while Iceberg APIs operates on
the Iceberg Table entities.

| Operations | **Generic Table API** | **Iceberg Table API** |
Contributor

@dimas-b dimas-b Jun 13, 2025

What is the Operation for making changes to the data in a Generic Table?

Contributor

The client does that, not the server

Contributor

Do you mean that the server does not "know" what such a table change is? If so, it certainly deserves a dedicated paragraph 😅 As for me, I tend to view the Generic Tables API as something similar to the Iceberg REST Catalog API, which does control commits and, by extension, conflict resolution on the server side.

Member

This is much closer to the Spark "HMS" catalog integration. The catalog itself is unaware of anything about the underlying table except for some loosely defined metadata about it. It's up to the engine (and plugins in that engine) to determine exactly how loading data or committing data actually occurs based on that metadata.

You could imagine examples of use cases being something like a CSV based table, or a JDBC Table. When these are stored in the HMS by Spark, the HMS doesn't know how to actually interact with the metadata.

Contributor

I do not challenge this mode of operation from a technical POV. I mean that it was a surprise for me and might be a surprise to other people coming from the Iceberg REST Catalog side. I'd appreciate it if this aspect were discussed in more detail in this doc.

Contributor Author

I will add more description in the limitation section.

Contributor

Thx for the update, @gh-yzou !


### API Reference

For the complete and up-to-date API specification, see the [generic-tables-api.yaml](https://github.com/apache/polaris/blob/main/spec/polaris-catalog-apis/generic-tables-api.yaml).
Contributor

nit: maybe use a swagger.io reference as in #1879 ?

Contributor

Ideally this should point to the same YAML version as the version of the doc (e.g. 1.0.0 vs. main).... not sure how to do it, though 🤔

Contributor Author

Updated the link to swagger.io, but it seems we only have the catalog bundle YAML, so I also updated the text to "Catalog API Spec".

- The table base location is a location that includes all files for the table
- A table with multiple disjoint locations (i.e. containing files that are outside the configured base location) is not compliant with the current generic table support in Polaris.
- If no location is provided, clients or users are responsible for managing the location.
- **properties** (optional): Properties for the generic table passed on creation
Contributor

Does the Spark plugin use any specific property names? If yes, it would be good to add a section for them:

  • as an illustration,
  • to avoid name clashes with other use cases.

Contributor Author

Do you mean: does the Spark plugin take specific properties it receives from Spark and convert them to reserved key names? Today, we do not do this. All properties are defined by Spark, and Spark has the right to update any of the property keys, so I don't think it is a good idea to document them here for Generic Tables.

Contributor

@dimas-b dimas-b Jun 14, 2025

I'm afraid, I'm a bit lost here 😅

I mean: do we know what specific properties are currently set/retrieved from this properties list by Polaris code on the client or server side?

Is the `properties` property used by any code now (apologies that I did not review the Spark Client end-to-end)?

Contributor

I think the answer to your question is no, it's not used. Actually per #1785 we don't propagate the properties from client to server, which seems incorrect to me. But it does mean there's not some special property to call out here.

Contributor Author

So today, Polaris doesn't set or retrieve properties on the server side. On the client side, the Spark client does retrieve the "provider" and "location" properties and then translates them to the format and location. However, this is more of a contract between Spark and the Spark client, which may not be suitable to mention on the generic table webpage. I can mention that our Polaris Spark client today looks into the table properties and translates the provider and location into our catalog format and location, but I don't think we want to make it sound like the properties have to contain fields like "provider" or "location" from Spark.

Contributor

@dimas-b dimas-b Jun 16, 2025

Suppose a Generic Table named A exists and has some properties usable from Spark via our new Spark Client. Now, if another client wants to query table A via the Generic Tables API, how should that (new) client interpret existing table properties?

The new client may not be aware of the Spark Client, but the properties are observable via API. I believe this aspect of the Generic Tables API ought to be clarified in this doc.

Contributor

Isn't it the same for Iceberg tables? There may be some property on a table that my Spark job knows how to interpret, but yours doesn't. The REST catalog itself does not take any actions based on generic table properties currently.

Contributor

True, but the REST catalog spec is not owned / defined by Polaris, while this one is.

I do not insist on a complete enumeration of possibilities. I think it should be sufficient to make a broad statement, for example: at this time, the contents of the properties map are not strictly defined and their interpretation is delegated to client / engine implementations, including interoperability concerns.

In fact, this is what happens with Iceberg tables too, IMHO, but I believe it is valuable to be explicit in specs.

Contributor Author

Sounds good! I added the following description:

  - Currently, there is no reserved property key defined.
  - The definition and interpretation is delegated to client or engine implementations.

Contributor

thx - sgtm
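
As an illustration of the client-side translation described earlier in this thread, the sketch below derives a format and a base location from the "provider" and "location" keys that Spark supplies. The function name and the decision to raise an error when "provider" is missing are hypothetical; this is not the Polaris Spark client's actual code, and other clients remain free to interpret the properties map differently.

```python
def translate_spark_properties(spark_properties):
    """Derive the catalog-level format and base location from Spark table properties.

    Mirrors the behaviour described in this thread; hypothetical helper,
    not the actual Polaris Spark client implementation.
    """
    table_format = spark_properties.get("provider")
    # Location is optional in the Generic Tables API, so None is allowed here.
    base_location = spark_properties.get("location")
    if table_format is None:
        # Assumption: a table without a provider cannot be mapped to a format.
        raise ValueError("Spark did not supply a 'provider' property")
    return {"format": table_format, "base-location": base_location}


example = translate_spark_properties(
    {"provider": "delta", "location": "s3://<my-bucket>/path/to/table", "owner": "etl"}
)
```

Keys such as `owner` above are ignored by this helper, which matches the agreed wording that no property key is reserved and interpretation is left to the client or engine.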

Member

@snazy snazy left a comment

This is a good start.

I've phrased some of my comments as questions as a reader of the doc.

Generally, the phrasing needs to be more precise and terms need to be explained before being used/referred to.

What I'm missing is the reason for "generic", because the only implemented use case is very specific. I.e. the text is missing an explanation of how other formats would/could be represented. The reference to "structured" implies an explicit exclusion of "semi/unstructured" data.

The doc should IMO also describe the various interactions, edge cases and failure scenarios from a client integration's view.

I was not aware of "No commit coordination or update capability provided at the catalog service level." This is IMHO a very serious issue, because it means that there is absolutely no guarantee that the state is consistent. Older changes can overwrite newer changes (ordering of request executions). This means (data) consistency issues.

weight: 435
---

The Generic Table in Apache Polaris provides basic management support for non-Iceberg tables.
Member

What is a "generic table"?
What does "basic" and "management" mean here?

Contributor Author

I just meant the operations listed below. I removed the unclear wording and now just point to the list of operations.

Comment on lines +43 to +41
- **base-location** (optional): Table base location in URI format. For example: s3://<my-bucket>/path/to/table
- The table base location is a location that includes all files for the table
- A table with multiple disjoint locations (i.e. containing files that are outside the configured base location) is not compliant with the current generic table support in Polaris.
Member

This prevents leveraging "object store friendly paths", no?

Contributor Author

Do you mean volume usage? How volumes are going to be supported in Polaris has not been discussed yet. Since this is a beta feature, if we decide to support use cases with multiple locations, we can evolve quickly to support this.
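
The base-location rule quoted above (all files live under one location, no disjoint locations) can be pictured with a small client-side check. This is only a sketch of the rule as worded in the doc: the helper name is hypothetical, and nothing in this thread says that Polaris performs such a check itself.

```python
from urllib.parse import urlparse


def is_under_base_location(base_location, file_uri):
    """Return True if file_uri lies inside the table's base location."""
    base = urlparse(base_location)
    candidate = urlparse(file_uri)
    # Scheme and bucket/authority must match exactly.
    if (base.scheme, base.netloc) != (candidate.scheme, candidate.netloc):
        return False
    # Compare on a trailing slash so ".../table_v2" is not treated as inside ".../table".
    base_path = base.path.rstrip("/") + "/"
    return candidate.path.startswith(base_path)


base = "s3://my-bucket/path/to/table"
inside = is_under_base_location(base, "s3://my-bucket/path/to/table/part-0.parquet")
sibling = is_under_base_location(base, "s3://my-bucket/path/to/table_v2/part-0.parquet")
other_bucket = is_under_base_location(base, "s3://other-bucket/path/to/table/part-0.parquet")
```

A layout that spreads a table's files across several prefixes, as raised in the question above, would fail this check for every file outside the single base location.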


The support for cross engine sharing of Generic Table is very limited:
1) Limited spec information. Currently, there is no spec for information like Schema, Partition etc.
2) No commit coordination or update capability provided at the catalog service level.
Member

To me this is a very serious issue. No way to coordinate changes means there will be consistency issues.

Member

Not all formats even have a way to do transactional commits. The basic premise here is to behave like the Spark Catalog with HMS (or Unity), which has these same guarantees for any source.

For example, registering a Cassandra source would work, but there is nothing in the Polaris world that would (or could) manage commits for a C* source.

Another example would be Delta Lake, which only can optionally (in 4.0) use a Catalog based commit coordinator and usually does not even consider the catalog when making commits.

Polaris is only guaranteeing a consistent view of the metadata about the entity, not any guarantees about the underlying data.

Member

> Not all formats even have a way to do transactional commits.

Delta has, no?

> Polaris is only guaranteeing a consistent view of the metadata about the entity, not any guarantees about the underlying data.

How is a consistent view on metadata ensured?

Member

I mean, this blog post mentions: "A data catalog serves as the central registry for a table’s metadata. It manages transactions and table state, as well as access controls and read/write interoperability."

Member

@RussellSpitzer RussellSpitzer Jun 16, 2025

Because the metadata only exists in Polaris. Only this set of properties.

> > Not all formats even have a way to do transactional commits.
>
> Delta has, no?

Delta does so without using the catalog, and has an optional "commit coordinator" which uses another API that is not provided here. So, like users of HMS for a Delta table, they would need to use a third-party commit coordinator if they wanted to use the optional "commit coordinator".

> > Polaris is only guaranteeing a consistent view of the metadata about the entity, not any guarantees about the underlying data.
>
> How is a consistent view on metadata ensured?

I think you may want to check out the original design docs here. The "metadata" we are talking about here refers to what the user puts in their Create statement that talks to Polaris. That is the only thing Polaris knows about the table and is the part that will not change. Again, this is similar to how the HMS works with Spark or how Iceberg originally worked with the HMS Catalog implementation. The Catalog is essentially just holding a bag of properties which we will maintain. Changes to these properties (not yet allowed) would be atomic, but they are essentially disconnected from the underlying format.

Contributor

@eric-maynard eric-maynard Jun 16, 2025

> So if basically only the table name is there, what's a user's benefit for it?

I'm confused -- are we cross-examining the feature here or documenting it?

The benefit is as is documented here; you can use Delta and other non-Iceberg tables in Spark using the Spark connector. The doc walks you through how that works. If that benefit is unclear in the doc, let's fix that.

Contributor Author

> So if basically only the table name is there, what's a user's benefit for it?

As @RussellSpitzer mentioned, the milestone Polaris accomplishes today is enabling Polaris as a centralized catalog service for Spark Catalog. Furthermore, for Delta, the state is inside the Delta log; as long as the client is able to load the Delta log, "base-location" is the only information needed to enable access to the Delta table. I can try to make this clearer in the doc.

Member

What I really do not like is implying promises ("A data catalog serves as the central registry for a table’s metadata. It manages transactions and table state, as well as access controls and read/write interoperability.") which just do not hold true: there is no way to manage transactions and table state, control access, etc.

WRT what's in Polaris - it's incomplete (I suspect there's no doubt there).

Member

I don't actually think it's incomplete in that context? I wouldn't imagine we would ever support those things for this endpoint and I don't think it would be a surprise to any user of Spark who uses this?

Contributor

I agree with @snazy 's point that the blog post (link above) is not really aligned with the Generic Tables API as far as Catalog as a "central registry" for table's metadata is concerned. The Generic Table API actually diminishes the role of the catalog as metadata registry by delegating most of the metadata loading to the client (even location is optional in the API).

That said, as far as this PR is concerned, I believe it should be sufficient to describe actual behaviour of Polaris in this respect. There is certainly room for improvements in terms of clarity and precision in this doc, but I think the current state of this PR is probably acceptable for 1.0.
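
The "bag of properties" behaviour discussed in this thread can be pictured with a toy in-memory model. The class and method names are hypothetical and this is not Polaris code; it only shows that the catalog returns what was registered at creation time, never inspects the underlying table files, and offers no update operation, so a mistaken registration is corrected by dropping and re-creating the table.

```python
class InMemoryGenericTableCatalog:
    """Toy stand-in for the catalog: it stores what create was given and nothing more."""

    def __init__(self):
        self._tables = {}

    def create(self, namespace, name, table_format, base_location=None, properties=None):
        key = (namespace, name)
        if key in self._tables:
            raise KeyError(f"table already exists: {namespace}.{name}")
        self._tables[key] = {
            "name": name,
            "format": table_format,
            "base-location": base_location,
            "properties": dict(properties or {}),
        }

    def load(self, namespace, name):
        # Returns a copy of the registered metadata; the table's files are never read.
        entry = self._tables[(namespace, name)]
        return {**entry, "properties": dict(entry["properties"])}

    def drop(self, namespace, name):
        del self._tables[(namespace, name)]


catalog = InMemoryGenericTableCatalog()
catalog.create("delta_ns", "delta_table", "delta",
               base_location="s3://<my-bucket>/wrong/path")
# There is no update operation, so the mistake is fixed by drop + re-create.
catalog.drop("delta_ns", "delta_table")
catalog.create("delta_ns", "delta_table", "delta",
               base_location="s3://<my-bucket>/path/to/table")
loaded = catalog.load("delta_ns", "delta_table")
```

Commits to the data itself, for example appending to a Delta log, happen entirely outside this model, which is the point made above about the engine rather than the catalog owning data changes.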

Contributor

adutra commented Jun 13, 2025

> What I'm missing is the reason for "generic", because the only implemented use case is very specific

There is some prior discussion around this naming choice in the dev ML:

https://lists.apache.org/thread/9jcx656ybkn132qw94g5wh8n5nmkg1d9

Member

snazy commented Jun 13, 2025

> There is some prior discussion around this naming choice in the dev ML:

yup - but we cannot expect readers to know the whole dev-ML history

@gh-yzou gh-yzou force-pushed the yzou-generic-table-webpage branch from 12722fc to 28d0b23 Compare June 13, 2025 20:41

Here is an example to create a generic table with name `delta_table` and format as `delta` under a namespace `delta_ns`
Contributor

Since there's no "update" API, what happens if there's a mistake in the initial create request? Will the client have to delete and re-create the table?

I do not mean to cause API changes at this point, just trying to clarify things for potential non-Polaris readers.

Contributor

@eric-maynard eric-maynard Jun 16, 2025

Currently, yes. I think adding update is a good idea but the intent here should be to document the existing behavior.

IIRC the motivation to not have update in v0 was due to a potential lack of clarity around what responsibilities the catalog takes on for updates (i.e. it's not the same as an Iceberg update where the catalog writes metadata).

Contributor Author

Sorry for the late reply! As we don't have update capability today, a rename will require re-creation. I added some description in the section "Generic Table API Vs. Iceberg Table API".

Contributor

dimas-b commented Jun 14, 2025

Thanks again for making this doc page, @gh-yzou ! I think I'm done with comments from my side. I'd be fine with merging this PR. Not approving only to allow other reviewers to have another round of comments.


## Limitations

The Generic Table support today is very limited:
Contributor

@flyrain flyrain Jun 16, 2025

Nit: can we reword a bit like this?

Suggested change: replace "The Generic Table support today is very limited:" with "Current limitations of Generic Table support:"

Contributor Author

updated

flyrain previously approved these changes Jun 16, 2025
@github-project-automation github-project-automation bot moved this from PRs In Progress to Ready to merge in Basic Kanban Board Jun 16, 2025
Contributor Author

gh-yzou commented Jun 17, 2025

@snazy When we came up with the name "generic table", the intention was to evolve the support across different table formats. I added some more description at the top to help make the naming clearer. The whole feature is currently marked as beta, which indicates that things are still evolving.

Contributor

@dimas-b dimas-b left a comment

Thanks for making this doc page, @gh-yzou !

I think it is sufficient to inform users about the Generic Tables API in 1.0.


The Generic Table in Apache Polaris is designed to provide support for non-Iceberg tables across different table formats includes delta, csv etc. It currently provides the following capabilities:
- Create a generic table under a namespace
- Load a generic table
Contributor

nit: the term "load", given other conversations under this PR, is still a bit confusing, IMHO, because it resonates with Iceberg's loadTable, which provides the table's metadata... However, this call is more like "get properties". All in all, I think it's ok since subsequent doc sections provide more clarity.

Contributor

@eric-maynard eric-maynard Jun 18, 2025

The API itself is actually named loadGenericTable, so I don't think it's exactly misleading. It does load the generic table's metadata, which is whatever metadata was registered in the catalog during createGenericTable. This is very similar to the behavior of, say, the HMS's getTable. Actually cracking open the metadata.json and returning its contents in the IRC is the exception, not the rule.

Member

This is a confusing one but to match behavior of other systems with similar functionality I think load or get is probably correct

Contributor Author

Yeah, I was mainly trying to match the terms other systems use ("load" or "get"), and I don't think "load" has to be specific to Iceberg metadata.

Contributor

dimas-b commented Jun 18, 2025

This PR probably needs rebasing to catch up with CI changes (even though it's only a doc change)

@gh-yzou gh-yzou force-pushed the yzou-generic-table-webpage branch from a263c68 to 9cd06da Compare June 18, 2025 17:43
Contributor Author

gh-yzou commented Jun 18, 2025

@dimas-b Thanks! I just rebased.

@gh-yzou gh-yzou merged commit 48e7e88 into apache:main Jun 18, 2025
12 checks passed
@github-project-automation github-project-automation bot moved this from Ready to merge to Done in Basic Kanban Board Jun 18, 2025
flyrain pushed a commit that referenced this pull request Jun 18, 2025
* add change

* add comment

* address feedback

* update limitations

* update docs

* update doc

* address feedback
Successfully merging this pull request may close these issues.

Add webpage for Generic Table support
7 participants