
prototyping Nebula Ingestion DDL #62

Closed
chenqin opened this issue Dec 16, 2020 · 6 comments

chenqin commented Dec 16, 2020

Nebula Ingestion DDL

YAML is a powerful way to express configuration: it is easy for people to understand and change. At the same time, remembering all the different configurations and concepts imposes a high tax once we start supporting functions and preprocessing, indexing, or consistent hashing (a possible concept for expanding/shrinking the storage cluster). This may lead to inventing a new set of configurations and concepts that only experts can remember.

Moreover, an OLAP system works as part of the big data ecosystem; being able to transform and pre-process data at ingestion time would give Nebula an edge over other OLAP engines when users consider adoption.

Consider an inspiring example that is not yet supported by Nebula.

A user has a hive table and a kafka stream to ingest into Nebula. The hive table has hourly partitions keeping the last 60 days of the moving average of business spend per account; the kafka stream contains each account's business transactions in foreign currencies. The user wants to investigate account spending status in near real time in a home currency (e.g. USD).

The complexity of this use case is threefold:

  • the hive table may need to read data and eventually shard it on a per-account basis
  • the kafka stream may need an RPC call to convert each transaction amount into USD
  • the kafka stream may need a stream/table join with the hive table on a per-account basis before landing results for slice-and-dice queries
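The three steps above can be sketched in plain Python to make the intended semantics concrete. This is a hypothetical illustration, not Nebula code: the hive side is modeled as an in-memory dict of per-account moving averages, the kafka side as a list of transaction records, and `to_usd` stands in for the currency-conversion RPC with assumed exchange rates.

```python
# Assumed exchange rates; in the real use case this would be an RPC lookup.
FX_RATES = {"EUR": 1.09, "JPY": 0.0070, "USD": 1.0}

def to_usd(amount: float, currency: str) -> float:
    """Stand-in for the currency-conversion RPC."""
    return amount * FX_RATES[currency]

# Hive side: accountid -> 60-day moving average of spend (sharded per account).
hive_avg_spend = {1001: 250.0, 1002: 75.5}

# Kafka side: business transactions in foreign currency.
kafka_stream = [
    {"transactionid": 1, "accountid": 1001, "amount": 100.0, "currency": "EUR"},
    {"transactionid": 2, "accountid": 1002, "amount": 5000.0, "currency": "JPY"},
]

def join_stream(stream, table):
    # RIGHT JOIN semantics: every stream record is emitted, enriched with
    # the table side when the account matches.
    for txn in stream:
        yield {
            "accountid": txn["accountid"],
            "avg_spend": table.get(txn["accountid"]),
            "transactionid": txn["transactionid"],
            "transaction_amount_in_usd": to_usd(txn["amount"], txn["currency"]),
        }

rows = list(join_stream(kafka_stream, hive_avg_spend))
```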

If the user writes this as an RDBMS query, it could look like:

OPTION 1 Materialized View with schema as part of config

```sql
create view nebula.transaction_analytic as (
  select h.accountid, avg(h.spend), k.transactionid, TO_USD(k.transaction_amount)
  from hive h
  right join kafka k on h.accountid = k.accountid
  where <all configs on hive, kafka>
);
```

Alternatively, we can support a two-statement flow:

OPTION 2 Full Table with schema inference

DDL
```sql
-- mapping of a hive table synced to a nebula table
create table hive.account (
  accountid bigint PRIMARY KEY,
  spend double,
  dt varchar(20) UNIQUE NOT NULL
) with ();

create table kafka.transaction (
  transactionid bigint PRIMARY KEY,
  accountid bigint NOT NULL,
  transaction_amount double,
  _time timestamp
) with ();

create table transaction_analytic (
  accountid bigint PRIMARY KEY,
  avg_transaction double,
  transaction_amount_in_usd double,
  _time timestamp
) with ();
```

DML
```sql
insert into transaction_analytic
select t.accountid, avg(h.spend), TO_USD(t.transaction_amount), t._time
from hive.account h
right join kafka.transaction t on h.accountid = t.accountid;
```
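To show the two-statement flow end to end, here is a toy run using sqlite3 as a stand-in engine. This is only a sketch under several assumptions: the Nebula-specific `with ()` clauses are dropped, `transaction` is renamed `txn` (a reserved word in SQLite), the RIGHT JOIN is rewritten as a LEFT JOIN with sides swapped for portability, `TO_USD` is registered as a user-defined function with an assumed fixed rate, and a GROUP BY is added since `avg()` is an aggregate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# TO_USD stand-in: assumed flat EUR->USD rate of 1.09.
conn.create_function("TO_USD", 1, lambda amount: amount * 1.09)

# DDL step plus some sample data.
conn.executescript("""
create table account (accountid bigint primary key, spend double, dt varchar(20));
create table txn (transactionid bigint primary key, accountid bigint,
                  transaction_amount double);
create table transaction_analytic (accountid bigint, avg_transaction double,
                                   transaction_amount_in_usd double);
insert into account values (1001, 250.0, '2020-12-16');
insert into txn values (1, 1001, 100.0);
""")

# DML step: RIGHT JOIN emulated by LEFT JOIN with the stream side first.
conn.execute("""
insert into transaction_analytic
select t.accountid, avg(a.spend), TO_USD(t.transaction_amount)
from txn t left join account a on a.accountid = t.accountid
group by t.accountid, t.transactionid, t.transaction_amount
""")

rows = conn.execute("select * from transaction_analytic").fetchall()
```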

@chenqin chenqin created this issue from a note in Nebula SQL (In Progress) Dec 16, 2020

chenqin commented Dec 17, 2020

Support DDL-based table ingestion declaration: #63

shawncao (Collaborator) commented:
FYI. This is interesting, we should talk about it later - https://materialize.com/docs/get-started/

chenqin commented Dec 26, 2020

> FYI. This is interesting, we should talk about it later - https://materialize.com/docs/get-started/

Indeed an interesting project, especially the declarative MV syntax.

Here is the foundational dataflow framework it runs on:
https://github.com/TimelyDataflow

A related, industry-oriented thought: a robust, full-featured stream-stream join is tricky and complicated at scale. Instead, we could:

  • start with common use cases, like a stream to materialized-table join, which users can run out of the box with minimal learning

  • leverage the built-in columnar storage to filter efficiently for a one-pass hash join

  • use OpenMP to take advantage of parallel performance gains on heterogeneous hardware architectures
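The one-pass hash join in the second bullet can be sketched as follows. This is an illustrative outline, not Nebula code: build a hash table over the smaller, materialized side in a single pass, then stream the probe side through it, applying a column filter before probing so non-qualifying rows never touch the hash table.

```python
def one_pass_hash_join(build_rows, probe_rows, build_key, probe_key, probe_filter):
    # Build phase: single pass over the materialized (table) side.
    hash_table = {}
    for row in build_rows:
        hash_table.setdefault(row[build_key], []).append(row)
    # Probe phase: single pass over the stream side; filter early so
    # filtered-out rows never reach the hash table.
    for row in probe_rows:
        if not probe_filter(row):
            continue
        for match in hash_table.get(row[probe_key], []):
            yield {**match, **row}

# Hypothetical data: materialized account averages vs. a transaction stream.
accounts = [{"accountid": 1, "avg_spend": 250.0}]
txns = [
    {"accountid": 1, "amount": 100.0},
    {"accountid": 2, "amount": -5.0},  # filtered out before probing
]
joined = list(one_pass_hash_join(accounts, txns,
                                 "accountid", "accountid",
                                 lambda r: r["amount"] > 0))
```

The columnar-storage angle is that `probe_filter` can be evaluated against a single column without materializing whole rows; the per-partition probe loops are also what an OpenMP parallel-for would parallelize in a C++ engine.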


chenqin commented Dec 27, 2020

A somewhat related topic: from an adoption point of view, having a drag-and-drop UI would be significantly easier for end users than SQL, and easier by orders of magnitude than even Python.

My old team open-sourced a UI flow-builder framework:
https://github.com/chenqin/react-digraph


chenqin commented Dec 30, 2020


Something related from Splunk:
https://twitter.com/esammer/status/1343675579850113024


chenqin commented Jan 2, 2021

I worked a bit more on the ingestion SQL during vacation. A good amount of prototyping and experimental work is needed before proposing a complete solution. I plan to take a long shot and write a standalone prototype.

@chenqin chenqin closed this as completed Jan 2, 2021
Nebula SQL automation moved this from In Progress to Done Jan 2, 2021