Add webpage for Generic Table support #1889

gh-yzou · 2025-06-12T23:38:44Z

add a webpage for generic table support

dimas-b

Thanks for writing this doc, @gh-yzou ! I think it is very valuable for end users (even though the feature itself is still "beta").

dimas-b · 2025-06-13T03:06:53Z

site/content/in-dev/unreleased/generic-table.md

+## Working with Generic Table
+
+There are two ways to work with Polaris Generic Tables today:
+1) Directly communicate with Polaris through REST API calls using curl. Details will be described in the later section.


Using curl seems to be an example rather than a requirement here, right? I suppose any tool capable of making HTTP requests will work just as well.

Good point, I updated the wording to "using tools such as curl".
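To illustrate the point that any HTTP-capable tool works, here is a minimal Python sketch that only builds (and does not send) a create request for the `delta_table` example from the doc. The endpoint path `/polaris/v1/{catalog}/namespaces/{namespace}/generic-tables`, the bearer-token header, and the helper name `build_create_generic_table_request` are assumptions for illustration; check the Catalog API spec for the authoritative shape.

```python
import json
import urllib.request


def build_create_generic_table_request(base_url, catalog, namespace, token,
                                       name, table_format,
                                       base_location=None, properties=None):
    """Build (but do not send) a POST request that creates a generic table.

    The URL layout and field names are assumptions for this sketch.
    """
    body = {"name": name, "format": table_format}
    # base-location and properties are optional in the doc under review.
    if base_location is not None:
        body["base-location"] = base_location
    if properties:
        body["properties"] = properties

    url = f"{base_url}/polaris/v1/{catalog}/namespaces/{namespace}/generic-tables"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; the same body could equally be posted with curl or any other HTTP client.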

dimas-b · 2025-06-13T03:12:33Z

site/content/in-dev/unreleased/generic-table.md

+Generic Table provides a different set of APIs to operate on the Generic Table entities while Iceberg APIs operates on
+the Iceberg Table entities.
+
+| Operations   | **Generic Table API**                                       | **Iceberg Table API**                                       |


What is the Operation for making changes to the data in a Generic Table?

The client does that, not the server

Do you mean that the server does not "know" about such a table change? If so, it certainly deserves a dedicated paragraph 😅 As for me, I tend to view the Generic Tables API as something similar to the Iceberg REST Catalog API, which does control commits and by extension conflict resolution on the server side.

This is much closer to the Spark "HMS" catalog integration. The catalog itself is unaware of anything about the underlying table except for some loosely defined metadata about it. It's up to the engine (and plugins in that engine) to determine exactly how loading data or committing data actually occurs based on that metadata.

You could imagine examples of use cases being something like a CSV based table, or a JDBC Table. When these are stored in the HMS by Spark, the HMS doesn't know how to actually interact with the metadata.

I do not challenge this mode of operation from a technical POV. I mean that it was a surprise for me and might be a surprise to other people coming from the Iceberg REST Catalog side. I'd appreciate it if this aspect were discussed in more detail in this doc.

I will add more description in the limitation section.

Thx for the update, @gh-yzou !

dimas-b · 2025-06-13T03:15:04Z

site/content/in-dev/unreleased/generic-table.md

+
+### API Reference
+
+For the complete and up-to-date API specification, see the [generic-tables-api.yaml](https://github.com/apache/polaris/blob/main/spec/polaris-catalog-apis/generic-tables-api.yaml).


nit: maybe use a swagger.io reference as in #1879 ?

Ideally this should point to the same YAML version as the version of the doc (e.g. 1.0.0 vs. main).... not sure how to do it, though 🤔

Updated the link to swagger.io, but it seems we only have the catalog bundle YAML, so I also updated the text to "Catalog API Spec".

dimas-b · 2025-06-13T03:18:48Z

site/content/in-dev/unreleased/generic-table.md

+  - The table base location is a location that includes all files for the table
+  - A table with multiple disjoint locations (i.e. containing files that are outside the configured base location) is not compliant with the current generic table support in Polaris.
+  - If no location is provided, clients or users are responsible for managing the location.
+- **properties** (optional): Properties for the generic table passed on creation


Does the Spark plugin use any specific property names? If yes, it would be good to add a section for them:

as an illustration,

to avoid name clashes with other use cases.

Do you mean: does the Spark plugin take specific properties it receives from Spark and convert them to reserved key names? Today, we do not do this. All properties are defined by Spark, and Spark has the right to update any of the property keys, so I don't think it is a good idea to document them here for Generic Tables.

I'm afraid, I'm a bit lost here 😅

I mean: do we know what specific properties are currently set/retrieved from this properties list by Polaris code on the client or server side?

Is the properties property used by any code now (apologies that I did not review the Spark Client end-to-end)?

I think the answer to your question is no, it's not used. Actually per #1785 we don't propagate the properties from client to server, which seems incorrect to me. But it does mean there's not some special property to call out here.

So today, Polaris doesn't set or retrieve properties on the server side. On the client side, the Spark client does retrieve the "provider" and "location" properties and translates them into the format and location. However, this is more of a contract between Spark and the Spark client, which may not be suitable to mention on the generic table webpage. I can mention that our Polaris Spark client today looks into the table properties and translates the provider and location into our catalog format and location, but I don't think we want to make it look like the properties have to contain fields like "provider" or "location" from Spark.

Suppose a Generic Table named A exists and has some properties usable from Spark via our new Spark Client. Now, if another client wants to query table A via the Generic Tables API, how should that (new) client interpret existing table properties?

The new client may not be aware of the Spark Client, but the properties are observable via API. I believe this aspect of the Generic Tables API ought to be clarified in this doc.

Isn't it the same for Iceberg tables? There may be some property on a table that my Spark job knows how to interpret, but yours doesn't. The REST catalog itself does not take any actions based on generic table properties currently.

True, but the REST catalog spec is not owned / defined by Polaris, while this one is.

I do not insist on a complete enumeration of possibilities. I think it should be sufficient to make a broad statement, for example: at this time, the contents of the properties map are not strictly defined and their interpretation is delegated to client / engine implementations, including interoperability concerns.

In fact, this is what happens with Iceberg tables too, IMHO, but I believe it is valuable to be explicit in specs.

sounds good! i added the following description

- Currently, there is no reserved property key defined.
- The definition and interpretation is delegated to client or engine implementations.
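As an illustration of what "delegated to client or engine implementations" can mean in practice, the sketch below mimics the translation described earlier in this thread, where the Spark client reads the `provider` and `location` properties and maps them to the generic table's format and location. The function name `to_generic_table_fields`, the decision to pass the remaining properties through unchanged, and the choice to reject a missing `provider` are assumptions, not the actual Spark client code.

```python
def to_generic_table_fields(spark_properties):
    """Map Spark-style table properties to generic table fields.

    Hypothetical helper: only the 'provider' -> format and
    'location' -> base-location mapping comes from the discussion;
    everything else here is an assumption.
    """
    props = dict(spark_properties)  # copy, do not mutate the caller's map
    table_format = props.get("provider")
    if table_format is None:
        raise ValueError("'provider' is needed to derive the table format")
    return {
        "format": table_format,
        # Location is optional; None means the client manages it.
        "base-location": props.get("location"),
        # No key is reserved, so all properties are passed through as-is.
        "properties": props,
    }
```

A different client reading the same table would see the same property map but is free to interpret it differently, which is the interoperability caveat raised above.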

snazy

This is a good start.

I've phrased some of my comments as questions as a reader of the doc.

Generally, the phrasing needs to be more precise and terms need to be explained before being used/referred to.

What I'm missing is the reason for "generic", because the only implemented use case is very specific. I.e. the text is missing an explanation how other formats would/could be represented. The reference of "structured" implies an explicit exclusion of "semi/unstructured" data.

The doc should IMO also describe the various interactions, edge cases and failure scenarios from a client integration's view.

I was not aware of "No commit coordination or update capability provided at the catalog service level". This is IMHO a very serious issue, because it means that there is absolutely no guarantee that the state is consistent. Older changes can overwrite newer changes (ordering of request executions). This means (data) consistency issues.

snazy · 2025-06-13T09:51:25Z

site/content/in-dev/unreleased/generic-table.md

+weight: 435
+---
+
+The Generic Table in Apache Polaris provides basic management support for non-Iceberg tables. 


What is a "generic table"?
What does "basic" and "management" mean here?

I just meant the operations listed below. I removed the unclear wording and now just point to the list of operations.

snazy · 2025-06-13T09:56:45Z

site/content/in-dev/unreleased/generic-table.md

+- **base-location** (optional): Table base location in URI format. For example: s3://<my-bucket>/path/to/table
+  - The table base location is a location that includes all files for the table
+  - A table with multiple disjoint locations (i.e. containing files that are outside the configured base location) is not compliant with the current generic table support in Polaris.


This prevents leveraging "object store friendly paths", no?

Do you mean volume usage? I think how volumes are going to be supported in Polaris has not been discussed yet. Since this is a beta feature, if we decide to support use cases with multiple locations, we can evolve quickly to support this.
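To make the "disjoint locations" restriction concrete, here is a small sketch of a client-side check that every file URI sits under the table's base location. Nothing in this thread says Polaris performs such a check; the helper name `is_under_base_location` and the plain string-prefix comparison are assumptions for illustration.

```python
def is_under_base_location(base_location, file_uri):
    """Return True if file_uri lies inside base_location.

    A trailing slash is enforced on the base so that
    's3://b/path/to/table2/...' is not mistaken for a child of
    's3://b/path/to/table'.
    """
    base = base_location.rstrip("/") + "/"
    return file_uri.startswith(base)


def find_outside_files(base_location, file_uris):
    """List the files that would make the table non-compliant."""
    return [u for u in file_uris if not is_under_base_location(base_location, u)]
```

Object-store-friendly layouts that spread files across prefixes would produce a non-empty result from `find_outside_files`, which is the concern raised in the question above.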

snazy · 2025-06-13T10:02:12Z

site/content/in-dev/unreleased/generic-table.md

+
+The support for cross engine sharing of Generic Table is very limited:
+1) Limited spec information. Currently, there is no spec for information like Schema, Partition etc. 
+2) No commit coordination or update capability provided at the catalog service level.


To me this is a very serious issue. No way to coordinate changes means there will be consistency issues.

Not all formats even have a way to do transactional commits. The basic premise here is to behave like the Spark Catalog with HMS (or Unity), which has these same guarantees for any source.

For example registering a Cassandra would work but there is nothing in the Polaris world that would (or could) manage commits for a C* source.

Another example would be Delta Lake, which only can optionally (in 4.0) use a Catalog based commit coordinator and usually does not even consider the catalog when making commits.

Polaris is only guaranteeing a consistent view of the metadata about the entity, not any guarantees about the underlying data.

> Not all formats even have a way to do transactional commits.

Delta has, no?

> Polaris is only guaranteeing a consistent view of the metadata about the entity, not any guarantees about the underlying data.

How is a consistent view on metadata ensured?

I mean, this blog post mentions: "A data catalog serves as the central registry for a table’s metadata. It manages transactions and table state, as well as access controls and read/write interoperability."

Because the metadata only exists in polaris. Only this set of properties.

> > Not all formats even have a way to do transactional commits.
>
> Delta has, no?

Delta does without using the catalog, and has an optional "commit coordinator" which uses another API which is not provided here. So, like users of HMS for a Delta table, they would need to use a third-party commit coordinator if they wanted to use the optional "commit coordinator".

> > Polaris is only guaranteeing a consistent view of the metadata about the entity, not any guarantees about the underlying data.
>
> How is a consistent view on metadata ensured?

I think you may want to check out the original design docs here. The "metadata" we are talking about here refers to what the user puts in their Create statement that talks to Polaris. That is the only thing Polaris knows about the table and is the part that will not change. Again, this is similar to how the HMS works with Spark or how Iceberg originally worked with the HMS Catalog implementation. The Catalog is essentially just holding a bag of properties that we will maintain. Changes to these properties (not yet allowed) would be atomic, but they are essentially disconnected from the underlying format.

So if basically only the table name is there, what's a user's benefit for it?

I'm confused -- are we cross-examining the feature here or documenting it?

The benefit is as is documented here; you can use Delta and other non-Iceberg tables in Spark using the Spark connector. The doc walks you through how that works. If that benefit is unclear in the doc, let's fix that.

> So if basically only the table name is there, what's a user's benefit for it?

As @RussellSpitzer mentioned, the milestone Polaris accomplishes today is to enable Polaris as a centralized catalog service for the Spark Catalog. Furthermore, for Delta, the state is inside the Delta log; as long as the client is able to load the Delta log, "base-location" is the only information needed to enable access to the Delta table. I can try to make it clearer in the doc.

What I really do not like is implying promises (A data catalog serves as the central registry for a table’s metadata. It manages transactions and table state, as well as access controls and read/write interoperability.) which just do not hold true: there is no way to manage transactions and table state, control access, etc.

WRT to what's in Polaris - it's incomplete (I suspect there's no doubt there).

I don't actually think it's incomplete in that context? I wouldn't imagine we would ever support those things for this endpoint and I don't think it would be a surprise to any user of Spark who uses this?

I agree with @snazy 's point that the blog post (link above) is not really aligned with the Generic Tables API as far as Catalog as a "central registry" for table's metadata is concerned. The Generic Table API actually diminishes the role of the catalog as metadata registry by delegating most of the metadata loading to the client (even location is optional in the API).

That said, as far as this PR is concerned, I believe it should be sufficient to describe actual behaviour of Polaris in this respect. There is certainly room for improvements in terms of clarity and precision in this doc, but I think the current state of this PR is probably acceptable for 1.0.

adutra · 2025-06-13T11:04:36Z

> What I'm missing is the reason for "generic", because the only implemented use case is very specific

There is some prior discussion around this naming choice in the dev ML:

https://lists.apache.org/thread/9jcx656ybkn132qw94g5wh8n5nmkg1d9

snazy · 2025-06-13T11:07:32Z

> There is some prior discussion around this naming choice in the dev ML:

yup - but we cannot expect readers to know the whole dev-ML history

dimas-b · 2025-06-14T01:56:09Z

site/content/in-dev/unreleased/generic-table.md

+}
+```
+
+Here is an example to create a generic table with name `delta_table` and format as `delta` under a namespace `delta_ns` 


Since there's no "update" API, what happens if there's a mistake in the initial create request? Will the client have to delete and re-create the table?

I do not mean to cause API changes at this point, just trying to clarify things for potential non-Polaris readers.

Currently, yes. I think adding update is a good idea but the intent here should be to document the existing behavior.

IIRC the motivation to not have update in v0 was due to a potential lack of clarity around what responsibilities the catalog takes on for updates (i.e. it's not the same as an Iceberg update where the catalog writes metadata).

Sorry for the late reply! As we don't have update capability today, a rename will require a re-creation. I added some description in the section "Generic Table API Vs. Iceberg Table API".
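The delete-and-re-create flow confirmed above can be sketched as follows. The `InMemoryGenericTableCatalog` class is a hypothetical stand-in for the catalog (it is not a Polaris client), used only so the example is self-contained; the method names are assumptions for illustration.

```python
class InMemoryGenericTableCatalog:
    """Hypothetical stand-in for a catalog with create/load/drop only."""

    def __init__(self):
        self._tables = {}

    def create(self, namespace, name, table_format,
               base_location=None, properties=None):
        key = (namespace, name)
        if key in self._tables:
            raise ValueError(f"table already exists: {namespace}.{name}")
        self._tables[key] = {
            "name": name,
            "format": table_format,
            "base-location": base_location,
            "properties": dict(properties or {}),
        }

    def load(self, namespace, name):
        # Returns only what was registered at creation time.
        return dict(self._tables[(namespace, name)])

    def drop(self, namespace, name):
        del self._tables[(namespace, name)]


def recreate(catalog, namespace, name, **corrected_fields):
    """Fix a bad create request: with no update API, drop then create.

    The two calls are not atomic, so another client may briefly
    observe the table as missing.
    """
    catalog.drop(namespace, name)
    catalog.create(namespace, name, **corrected_fields)
```

For example, a table created with the wrong format can be corrected by calling `recreate(catalog, "delta_ns", "delta_table", table_format="delta", base_location=...)`; a rename follows the same pattern with a new name passed to `create`.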

dimas-b · 2025-06-14T02:07:33Z

Thanks again for making this doc page, @gh-yzou ! I think I'm done with comments from my side. I'd be fine with merging this PR. Not approving only to allow other reviewers to have another round of comments.

flyrain · 2025-06-16T23:49:10Z

site/content/in-dev/unreleased/generic-table.md

+
+## Limitations
+
+The Generic Table support today is very limited:


Nit: can we reword a bit like this?

Suggested change

The Generic Table support today is very limited:

Current limitations of Generic Table support:

gh-yzou · 2025-06-17T20:11:01Z

@snazy When we came up with the name "generic table", the intention was to evolve the support across different table formats. I added some more description at the top to help make the naming clearer. The whole feature is currently marked as beta, which indicates that things are still evolving.

dimas-b

Thanks for making this doc page, @gh-yzou !

I think it is sufficient to inform users about the Generic Tables API in 1.0

dimas-b · 2025-06-17T20:38:34Z

site/content/in-dev/unreleased/generic-table.md

+
+The Generic Table in Apache Polaris is designed to provide support for non-Iceberg tables across different table formats includes delta, csv etc. It currently provides the following capabilities:
+- Create a generic table under a namespace
+- Load a generic table 


nit: the term "load", given other conversations under this PR, is still a bit confusing, IMHO, because it resonates with Iceberg's loadTable, which provides the table's metadata... However, this call is more like "get properties". All in all, I think it's ok since subsequent doc sections provide more clarity.

The API itself is actually named loadGenericTable, so I don't think it's exactly misleading. It does load the generic table's metadata, which is whatever metadata was registered in the catalog during createGenericTable. This is very similar to the behavior of, say, the HMS's getTable. Actually cracking open the metadata.json and returning its contents in the IRC is the exception, not the rule.

This is a confusing one but to match behavior of other systems with similar functionality I think load or get is probably correct

Yeah, I was mainly trying to match the terminology of other systems by using "load or get", and I don't think "load" has to refer specifically to the Iceberg metadata.

dimas-b · 2025-06-18T17:33:55Z

This PR probably needs rebasing to catch up with CI changes (even though it's only a doc change)

gh-yzou · 2025-06-18T17:48:31Z

@dimas-b Thanks! I just rebased

* add change * add comment * address feedback * update limitations * update docs * update doc * address feedback

github-project-automation bot added this to Basic Kanban Board Jun 12, 2025

gh-yzou requested review from adutra, ashvina, dennishuo, dimas-b, eric-maynard, jackye1995, jbonofre, vvcephei, collado-mike and snazy as code owners June 12, 2025 23:38

github-project-automation bot moved this to PRs In Progress in Basic Kanban Board Jun 12, 2025

gh-yzou requested review from RussellSpitzer, takidau, MonkeyCanCode, flyrain, ebyhr, ajantha-bhat, HonahX, singhpk234 and pingtimeout as code owners June 12, 2025 23:38

dimas-b added the 1.0-blocker label Jun 13, 2025

dimas-b reviewed Jun 13, 2025

eric-maynard reviewed Jun 13, 2025

eric-maynard reviewed Jun 13, 2025

eric-maynard reviewed Jun 13, 2025

snazy reviewed Jun 13, 2025

gh-yzou force-pushed the yzou-generic-table-webpage branch from 12722fc to 28d0b23 Compare June 13, 2025 20:41

dimas-b reviewed Jun 14, 2025

flyrain reviewed Jun 16, 2025

flyrain previously approved these changes Jun 16, 2025

github-project-automation bot moved this from PRs In Progress to Ready to merge in Basic Kanban Board Jun 16, 2025

gh-yzou dismissed flyrain’s stale review via a263c68 June 17, 2025 20:11

gh-yzou force-pushed the yzou-generic-table-webpage branch from 07319e7 to a263c68 Compare June 17, 2025 20:11

dimas-b approved these changes Jun 17, 2025

gh-yzou added 7 commits June 18, 2025 10:43

add change

6abf791

add comment

b661df0

address feedback

d751085

update limitations

f3ac0d9

update docs

d3e61aa

update doc

ed474dd

address feedback

9cd06da

gh-yzou force-pushed the yzou-generic-table-webpage branch from a263c68 to 9cd06da Compare June 18, 2025 17:43

flyrain approved these changes Jun 18, 2025

gh-yzou merged commit 48e7e88 into apache:main Jun 18, 2025
12 checks passed

github-project-automation bot moved this from Ready to merge to Done in Basic Kanban Board Jun 18, 2025

flyrain pushed a commit that referenced this pull request Jun 18, 2025

Add webpage for Generic Table support (#1889)

6439bde

* add change * add comment * address feedback * update limitations * update docs * update doc * address feedback

