Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ingest unstructured json records or capture unrecognized fields #12207

Open
lmatz opened this issue Sep 11, 2023 · 4 comments
Open

ingest unstructured json records or capture unrecognized fields #12207

lmatz opened this issue Sep 11, 2023 · 4 comments

Comments

@lmatz
Copy link
Contributor

lmatz commented Sep 11, 2023

In a discussion with @fuyufjh that is inspired by a user's question

When a user has some rows in JSON format, he would like to:

  1. Ingest some/all the columns by defining a concrete data type for each column.
  2. Ingest the entire row as a JSONB column.

Right now, when (1), if a JSON field is in the row data but is not defined in the table schema, it will not be parsed and ingested into Risingwave.

For (2), right now, the user has to wrap the entire row by another field in JSON, e.g. 'data': {the original data}. Therefore, from time to time, it requires the user to do another transformation before ingesting data into Risingwave. However, the user may not have control of the data format in the source as the source data is collected by some other data team.
By enabling users to do so, they can do more ETL workload all in RW instead of bringing up another system.

If we give the option that users can group all the JSON fields undefined in the table schema as one field, (2) is naturally solved.

Also, the user often wants to take the primary key out as a single column to be the primary key to duplicate the source stream of the table but keep everything else in a huge data JSONB column.

Welcome more observations and counter-examples

@github-actions github-actions bot added this to the release-1.3 milestone Sep 11, 2023
@fuyufjh fuyufjh removed this from the release-1.3 milestone Oct 10, 2023
@fuyufjh
Copy link
Member

fuyufjh commented Oct 10, 2023

Will be tracked on User-requested issues (Notion)

@BugenZhao
Copy link
Member

Also, the user often wants to take the primary key out as a single column to be the primary key to duplicate the source stream of the table but keep everything else in a huge data JSONB column.

IIUC, this can be done by defining a generated column accessing that JSONB.

Copy link
Contributor

This issue has been open for 60 days with no activity. Could you please update the status? Feel free to continue discussion or close as not planned.

@BugenZhao
Copy link
Member

FYI, this is somehow similar to #[serde(flatten)]:

#[derive(Serialize, Deserialize)]
struct S {
    a: u32,
    b: String,
    #[serde(flatten)]
    other: Map<String, Value>,
}

from serde-rs/serde#941 (comment)

We need to find a way to mark the column.

@BugenZhao BugenZhao changed the title ingest a row in JSON with many fields on the top level as one single jsonb column ingest unstructured json records into a single column, or capture unrecognized fields Jun 3, 2024
@BugenZhao BugenZhao changed the title ingest unstructured json records into a single column, or capture unrecognized fields ingest unstructured json records or capture unrecognized fields Jun 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants