ingest unstructured json records or capture unrecognized fields #12207

lmatz · 2023-09-11T09:49:48Z

This came out of a discussion with @fuyufjh, prompted by a user's question.

When a user has rows in JSON format, they would like to either:

1. Ingest some or all of the columns by defining a concrete data type for each column.
2. Ingest the entire row as a single JSONB column.

Right now, for (1), if a JSON field is present in the row data but not defined in the table schema, it is not parsed or ingested into RisingWave.

For (2), the user currently has to wrap the entire row in another JSON field, e.g. 'data': {the original data}. This means the user sometimes needs an extra transformation step before ingesting data into RisingWave. However, the user may not control the data format at the source, since the source data may be collected by another data team.
Supporting both cases directly would let users run more of their ETL workload in RW instead of bringing up another system.

If we give users the option to group all JSON fields not defined in the table schema into one field, (2) is naturally solved.
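
As a rough illustration of this idea, the sketch below splits one already-parsed row into the columns declared in the table schema and a catch-all map holding everything else. It uses only the Rust standard library; `Value`, `Row`, `split_row`, and the column names are hypothetical and are not RisingWave APIs or syntax.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a parsed JSON value; a real parser would also
/// cover booleans, null, arrays, and nested objects.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Num(f64),
    Str(String),
}

/// One top-level JSON object, keyed by field name (assumed representation).
type Row = BTreeMap<String, Value>;

/// Splits `row` into (declared columns, unrecognized fields).
/// `declared` lists the column names taken from the table schema.
fn split_row(declared: &[&str], row: Row) -> (Row, Row) {
    row.into_iter()
        .partition(|(name, _)| declared.contains(&name.as_str()))
}
```

With `declared = ["id", "name"]`, any other top-level field lands in the second map. With an empty `declared` list the whole row lands there, which corresponds to case (2).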

Also, users often want to pull the primary key out into its own column, so that it can serve as the table's primary key and deduplicate the source stream, while keeping everything else in one large JSONB data column.

More observations and counter-examples are welcome.


fuyufjh · 2023-10-10T09:22:33Z

Will be tracked on User-requested issues (Notion)

BugenZhao · 2023-10-10T09:25:04Z

> Also, the user often wants to take the primary key out as a single column to be the primary key to duplicate the source stream of the table but keep everything else in a huge data JSONB column.

IIUC, this can be done by defining a generated column accessing that JSONB.
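
A minimal sketch of that idea, under the assumption that the whole row is stored as one payload and the key column is derived from it. `KeyedRow` and `with_generated_key` are hypothetical names, field values are simplified to strings, and this does not reflect how RisingWave implements generated columns.

```rust
use std::collections::BTreeMap;

/// Hypothetical table row: the key is copied out of the payload, and the
/// payload itself is stored unchanged (it still contains the key field).
#[derive(Debug, PartialEq)]
struct KeyedRow {
    key: String,
    data: BTreeMap<String, String>,
}

/// Builds a `KeyedRow` from a payload, or returns `None` when the key
/// field is missing from the payload.
fn with_generated_key(key_field: &str, data: BTreeMap<String, String>) -> Option<KeyedRow> {
    let key = data.get(key_field)?.clone();
    Some(KeyedRow { key, data })
}
```

Note that in this variant the payload keeps the key field as well, whereas the original request describes keeping only "everything else" in the data column.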

github-actions · 2023-12-11T01:50:39Z

This issue has been open for 60 days with no activity. Could you please update the status? Feel free to continue discussion or close as not planned.

BugenZhao · 2023-12-11T05:41:24Z

FYI, this is somewhat similar to #[serde(flatten)]:

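use serde::{Deserialize, Serialize};
use serde_json::{Map, Value};

#[derive(Serialize, Deserialize)]
struct S {
    a: u32,
    b: String,
    // Collects every field not named above into one map.
    #[serde(flatten)]
    other: Map<String, Value>,
}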

from serde-rs/serde#941 (comment)

We need to find a way to mark the column.

github-actions bot added this to the release-1.3 milestone Sep 11, 2023

lmatz added type/feature needs-discussion labels Sep 11, 2023

fuyufjh removed this from the release-1.3 milestone Oct 10, 2023

github-actions bot added the no-issue-activity label Dec 11, 2023

github-actions bot removed the no-issue-activity label Dec 12, 2023

BugenZhao changed the title ~~ingest a row in JSON with many fields on the top level as one single jsonb column~~ ingest unstructured json records into a single column, or capture unrecognized fields Jun 3, 2024

BugenZhao changed the title ~~ingest unstructured json records into a single column, or capture unrecognized fields~~ ingest unstructured json records or capture unrecognized fields Jun 3, 2024
