
prototyping Nebula Ingestion DDL #62

Closed
chenqin opened this issue Dec 16, 2020 · 6 comments

chenqin commented Dec 16, 2020

Nebula Ingestion DDL

YAML is a powerful way to express configuration: it is easy for people to understand and change. At the same time, remembering all the different configurations and concepts imposes a high tax once we start supporting functions and preprocessing, indexing, or consistent hashing (a possible concept for expanding/shrinking the storage cluster). This may lead to inventing a new set of configurations and concepts that only experts can remember.

Moreover, an OLAP system works as part of the big data ecosystem; being able to transform and pre-process data at ingestion time would give Nebula an edge over other OLAP engines when users consider adoption.

Consider an inspiring example that is not yet supported by Nebula.

A user has a hive table and a kafka stream to ingest into Nebula. The hive table has hourly partitions keeping the last 60 days of the moving average of business spend per account; the kafka stream contains each account's business transactions in foreign currencies. The user wants to investigate account spending status in near real time in a home currency (e.g. USD).

The complexity of this use case is threefold:

  • the hive table may need to read data and eventually shard it on a per-account basis
  • the kafka stream may need an RPC call to convert each transaction amount into USD
  • the kafka stream may need a stream/table join with the hive table on a per-account basis before landing results for slice-and-dice queries
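The three steps above can be sketched in plain Python to make the intended semantics concrete. This is a hypothetical illustration, not Nebula code: the hive side is modeled as an in-memory dict of per-account moving averages, the kafka side as a list of transaction records, and `to_usd` stands in for the currency-conversion RPC with assumed exchange rates.

```python
# Assumed exchange rates; in the real use case this would be an RPC lookup.
FX_RATES = {"EUR": 1.09, "JPY": 0.0070, "USD": 1.0}

def to_usd(amount: float, currency: str) -> float:
    """Stand-in for the currency-conversion RPC."""
    return amount * FX_RATES[currency]

# Hive side: accountid -> 60-day moving average of spend (sharded per account).
hive_avg_spend = {1001: 250.0, 1002: 75.5}

# Kafka side: business transactions in foreign currency.
kafka_stream = [
    {"transactionid": 1, "accountid": 1001, "amount": 100.0, "currency": "EUR"},
    {"transactionid": 2, "accountid": 1002, "amount": 5000.0, "currency": "JPY"},
]

def join_stream(stream, table):
    # RIGHT JOIN semantics: every stream record is emitted, enriched with
    # the table side when the account matches.
    for txn in stream:
        yield {
            "accountid": txn["accountid"],
            "avg_spend": table.get(txn["accountid"]),
            "transactionid": txn["transactionid"],
            "transaction_amount_in_usd": to_usd(txn["amount"], txn["currency"]),
        }

rows = list(join_stream(kafka_stream, hive_avg_spend))
```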

If the user writes this as an RDBMS query, it could look like:

OPTION 1 Materialized View with schema as part of config

```sql
create view nebula.transaction_analytic as (
  select h.accountid, avg(h.spend), k.transactionid, TO_USD(k.transaction_amount)
  from hive h
  right join kafka k on h.accountid = k.accountid
  where <all configs on hive, kafka>
);
```

Alternatively, we can support a two-statement flow:

OPTION 2 Full Table with schema inference

DDL
```sql
-- mapping of a hive table synced to a nebula table
create table hive.account (
  accountid bigint PRIMARY KEY,
  spend double,
  dt varchar(20) UNIQUE NOT NULL
) with ();

create table kafka.transaction (
  transactionid bigint PRIMARY KEY,
  accountid bigint NOT NULL,
  transaction_amount double,
  _time timestamp
) with ();

create table transaction_analytic (
  accountid bigint PRIMARY KEY,
  avg_transaction double,
  transaction_amount_in_usd double,
  _time timestamp
) with ();
```

DML
```sql
insert into transaction_analytic
select t.accountid, avg(h.spend), TO_USD(t.transaction_amount), t._time
from hive.account h
right join kafka.transaction t on h.accountid = t.accountid;
```
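To show the two-statement flow end to end, here is a toy run using sqlite3 as a stand-in engine. This is only a sketch under several assumptions: the Nebula-specific `with ()` clauses are dropped, `transaction` is renamed `txn` (a reserved word in SQLite), the RIGHT JOIN is rewritten as a LEFT JOIN with sides swapped for portability, `TO_USD` is registered as a user-defined function with an assumed fixed rate, and a GROUP BY is added since `avg()` is an aggregate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# TO_USD stand-in: assumed flat EUR->USD rate of 1.09.
conn.create_function("TO_USD", 1, lambda amount: amount * 1.09)

# DDL step plus some sample data.
conn.executescript("""
create table account (accountid bigint primary key, spend double, dt varchar(20));
create table txn (transactionid bigint primary key, accountid bigint,
                  transaction_amount double);
create table transaction_analytic (accountid bigint, avg_transaction double,
                                   transaction_amount_in_usd double);
insert into account values (1001, 250.0, '2020-12-16');
insert into txn values (1, 1001, 100.0);
""")

# DML step: RIGHT JOIN emulated by LEFT JOIN with the stream side first.
conn.execute("""
insert into transaction_analytic
select t.accountid, avg(a.spend), TO_USD(t.transaction_amount)
from txn t left join account a on a.accountid = t.accountid
group by t.accountid, t.transactionid, t.transaction_amount
""")

rows = conn.execute("select * from transaction_analytic").fetchall()
```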

@chenqin chenqin created this issue from a note in Nebula SQL (In Progress) Dec 16, 2020

chenqin commented Dec 17, 2020

Support DDL-based table ingestion declaration: #63

shawncao (Collaborator) commented:
FYI. This is interesting, we should talk about it later - https://materialize.com/docs/get-started/

chenqin commented Dec 26, 2020

> FYI. This is interesting, we should talk about it later - https://materialize.com/docs/get-started/

Indeed an interesting project, especially the declarative MV syntax.

Here is the foundational dataflow framework it runs on:
https://github.com/TimelyDataflow

A related, industry-oriented thought: a robust, full-featured stream-stream join is tricky and complicated at scale. Instead, we could:

  • start with common use cases, like a stream to materialized-table join, which users can run out of the box with minimal learning

  • leverage the built-in columnar storage to filter efficiently for a one-pass hash join

  • use OpenMP to take advantage of parallel performance gains on heterogeneous hardware architectures
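The one-pass hash join in the second bullet can be sketched as follows. This is an illustrative outline, not Nebula code: build a hash table over the smaller, materialized side in a single pass, then stream the probe side through it, applying a column filter before probing so non-qualifying rows never touch the hash table.

```python
def one_pass_hash_join(build_rows, probe_rows, build_key, probe_key, probe_filter):
    # Build phase: single pass over the materialized (table) side.
    hash_table = {}
    for row in build_rows:
        hash_table.setdefault(row[build_key], []).append(row)
    # Probe phase: single pass over the stream side; filter early so
    # filtered-out rows never reach the hash table.
    for row in probe_rows:
        if not probe_filter(row):
            continue
        for match in hash_table.get(row[probe_key], []):
            yield {**match, **row}

# Hypothetical data: materialized account averages vs. a transaction stream.
accounts = [{"accountid": 1, "avg_spend": 250.0}]
txns = [
    {"accountid": 1, "amount": 100.0},
    {"accountid": 2, "amount": -5.0},  # filtered out before probing
]
joined = list(one_pass_hash_join(accounts, txns,
                                 "accountid", "accountid",
                                 lambda r: r["amount"] > 0))
```

The columnar-storage angle is that `probe_filter` can be evaluated against a single column without materializing whole rows; the per-partition probe loops are also what an OpenMP parallel-for would parallelize in a C++ engine.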


chenqin commented Dec 27, 2020

A somewhat related topic: from an adoption point of view, having a drag-and-drop UI would be significantly easier for end users than SQL, and easier by orders of magnitude than even Python.

My old team open-sourced a UI flow-builder framework:
https://github.com/chenqin/react-digraph


chenqin commented Dec 30, 2020


Something related from Splunk:
https://twitter.com/esammer/status/1343675579850113024


chenqin commented Jan 2, 2021

I worked a bit more on the ingestion SQL during vacation. A good amount of prototyping and experimental work is needed before proposing a complete solution. I plan to take a long shot and write a standalone prototype.

@chenqin chenqin closed this as completed Jan 2, 2021
Nebula SQL automation moved this from In Progress to Done Jan 2, 2021