prototyping Nebula Ingestion DDL #62
support ddl based table ingestion declaration #63
FYI. This is interesting, we should talk about it later: https://materialize.com/docs/get-started/
Indeed an interesting project, especially the declarative MV syntax. Here is the foundational programming diagram it runs upon. A related industry-oriented thought: a robust, full-featured stream-stream join is tricky and complicated at scale.
A somewhat related topic: from an adoption point of view, having a drag-and-drop UI would be significantly easier for end users compared to SQL, and easier by orders of magnitude compared to even Python. My old team open-sourced a flow-builder UI framework.
Something related from Splunk.
I put a bit more thought into ingestion SQL during vacation. There is a good amount of prototyping and experimental work needed before proposing a complete solution. I plan to take a long shot and write a standalone prototype.
Nebula Ingestion DDL
YAML is a powerful way to express configuration: it is easy for people to understand and change. At the same time, remembering all the different configurations and concepts can impose a high tax once we start supporting functions and pre-processing, indexing, or consistent hashing (a possible concept for expanding/shrinking the storage cluster). This may lead to inventing a new set of configurations and concepts that only experts can remember.
Moreover, an OLAP system works as part of a big data ecosystem; being able to transform and pre-process data at ingestion time would give Nebula an edge over other OLAP engines in user adoption.
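The kind of ingestion-time pre-processing meant here can be sketched in a few lines of Python. The `preprocess` helper, the column names, and the transforms are illustrative assumptions, not Nebula APIs:

```python
# A minimal sketch of ingestion-time pre-processing: rows are transformed
# before they reach storage. Column names and transforms are illustrative.

def preprocess(row, transforms):
    """Apply per-column transform functions to a row during ingestion."""
    return {col: transforms.get(col, lambda v: v)(v) for col, v in row.items()}

# Example: normalize tags to lowercase and clamp negative spend to zero.
transforms = {
    "tag": str.lower,
    "spend": lambda v: max(v, 0.0),
}

raw = {"accountid": 1, "tag": "PROMO", "spend": -3.5}
clean = preprocess(raw, transforms)  # columns without a transform pass through
```

Declaring such transforms in DDL (e.g., as expressions in a select list) keeps them in one place instead of spreading them across YAML keys.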
Consider a motivating example that is not yet supported by Nebula.
A user has a Hive table and a Kafka stream ingesting into Nebula. The Hive table has hourly partitions keeping the last 60 days of moving-average business spend per account; the Kafka stream contains each account's business transactions in foreign currency. The user wants to investigate account spending status in near real time, in the home currency (e.g., USD).
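As a plain-Python sketch, the desired computation looks roughly like this; the table contents, the FX rates, and the `to_usd` function (standing in for a hypothetical `TO_USD` UDF) are all illustrative:

```python
# Sketch of the desired stream-table join in plain Python.
# Data and FX rates are made up; to_usd stands in for a hypothetical TO_USD UDF.

FX_TO_USD = {"EUR": 1.25, "JPY": 0.008, "USD": 1.0}  # assumed static rates

def to_usd(amount, currency):
    """Hypothetical TO_USD: convert a foreign-currency amount to USD."""
    return amount * FX_TO_USD[currency]

# hive.account: last-60-day moving average of spend per account (hourly partitions elided)
hive_account = {1: 120.0, 2: 35.5}  # accountid -> avg spend

# kafka.transaction: live transactions in foreign currency
kafka_transactions = [
    {"transactionid": 100, "accountid": 1, "amount": 50.0, "currency": "EUR"},
    {"transactionid": 101, "accountid": 2, "amount": 1000.0, "currency": "JPY"},
]

def enrich(txn):
    """Join a streamed transaction against the account's historical average
    and attach the USD-converted amount."""
    return {
        "accountid": txn["accountid"],
        "avg_spend": hive_account.get(txn["accountid"]),
        "amount_usd": to_usd(txn["amount"], txn["currency"]),
    }

rows = [enrich(t) for t in kafka_transactions]
```

A real implementation would additionally have to handle late data, FX-rate refreshes, and re-partitioning of the Hive side, which is where the at-scale difficulty lives.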
The complexity of this use case is threefold.
If the user wrote this as an RDBMS query, it could look like:
OPTION 1: Materialized View with schema as part of config
create view nebula.transaction_analytic as (select accountid, avg(spend), transactionid, TO_USD(transaction_amount) from hive right join kafka on hive.accountid = kafka.accountid where <all configs on hive, kafka>)
Alternatively, we can support a two-statement flow like:
OPTION 2: Full Table with schema inference
DDL
```sql
-- mapping of hive table sync to nebula table
create table hive.account (
    accountid bigint PRIMARY KEY,
    spend double,
    dt varchar(20) UNIQUE NOT NULL
) with ();

create table kafka.transaction (
    transactionid bigint PRIMARY KEY,
    accountid bigint NOT NULL,
    transaction_amount double,
    _time timestamp
) with ();

create table transaction_analytic (
    accountid bigint PRIMARY KEY,
    avg_transaction double,
    transaction_amount_in_usd double,
    _time timestamp
) with ();
```
DML
insert into transaction_analytic select account.accountid, avg(spend), TO_USD(transaction_amount), _time from hive.account right join kafka.transaction on account.accountid = transaction.accountid group by account.accountid;
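The two-statement flow can be sanity-checked against a stock SQL engine. The sketch below replays roughly equivalent DDL and DML in SQLite: `TO_USD` is stubbed as a Python UDF with a fixed rate, the `with ()` connector clauses are dropped (SQLite has no notion of them), `kafka.transaction` is renamed `txn` (TRANSACTION is a reserved word), and a plain JOIN stands in for RIGHT JOIN, which older SQLite versions lack:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stub for the hypothetical TO_USD UDF; fixed 1.25 rate for illustration.
conn.create_function("TO_USD", 1, lambda amount: amount * 1.25)

conn.executescript("""
CREATE TABLE account (            -- stands in for hive.account
    accountid INTEGER PRIMARY KEY,
    spend REAL,
    dt VARCHAR(20) UNIQUE NOT NULL
);
CREATE TABLE txn (                -- stands in for kafka.transaction
    transactionid INTEGER PRIMARY KEY,
    accountid INTEGER NOT NULL,
    transaction_amount REAL,
    _time TIMESTAMP
);
CREATE TABLE transaction_analytic (
    accountid INTEGER PRIMARY KEY,
    avg_transaction REAL,
    transaction_amount_in_usd REAL,
    _time TIMESTAMP
);
INSERT INTO account VALUES (1, 120.0, '2020-01-01-00');
INSERT INTO txn VALUES (100, 1, 50.0, '2020-01-01 00:05:00');
""")

conn.execute("""
    INSERT INTO transaction_analytic
    SELECT a.accountid, AVG(a.spend), TO_USD(t.transaction_amount), t._time
    FROM account a JOIN txn t ON a.accountid = t.accountid
    GROUP BY a.accountid
""")
rows = conn.execute("SELECT * FROM transaction_analytic").fetchall()
```

This only validates the shape of the statements; the interesting prototyping work is in mapping the `with ()` clauses to Hive/Kafka connector configs and keeping the result fresh as the stream advances.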