
[RFC]: Architecture of StoneDB V2.0 #436

Open
RingsC opened this issue Aug 23, 2022 · 4 comments
Labels: A-feature feature with good idea · prio: high High priority · RFC


RingsC commented Aug 23, 2022

Related RFC Subtasks Discussions:

Before You Start

Note: our in-memory column store is also called Tianmu_rapid.

After the release of StoneDB V1.0 on GitHub, we have received lots of feedback from our community members and users. As shown in the TPC-H test reports, StoneDB V1.0 provides 10 to 100 times higher performance than MySQL, with only one tenth the total cost of ownership (TCO).

Though StoneDB V1.0 delivers inspiring performance, it still has some drawbacks rooted in its design, including the long execution time of TPC-H query 16, unsatisfying performance on transactional workloads, and comparatively long latency in propagating the latest changes to analytics workloads.

Therefore, we are redesigning StoneDB to eliminate these problems in our next version — StoneDB V2.0. StoneDB V2.0 will be based on MySQL 8.0. It aims to achieve at least 10 times higher query performance on TPC-H than that of V1.0, with a more concise architecture and a query optimizer that can intelligently offload query workloads to corresponding storage engines.

The design philosophy of StoneDB V2.0 is modularity and performance-cost balance. The following outlines the new features that will be implemented in StoneDB V2.0. To learn details about each feature, see the relevant chapter.

StoneDB V2.0 will remain an open source project, positioned as a counterpart to the closed-source service MySQL Heatwave.

First, an in-memory column-based engine (IMCE) will be used. Second, a cost-based query engine will be developed to automatically offload transactional and analytics workloads. Third, StoneDB V2.0 will provide a vectorized execution engine and support massive parallel processing. In this way, the execution performance of StoneDB V2.0 will be at least ten times that of StoneDB V1.0.

StoneDB will load the data into memory from InnoDB, just the same as MySQL Heatwave.

“MySQL Analytics is an in-memory processing engine, data is only persisted in MySQL InnoDB storage engine.” 

This sentence serves as the basic rule and guideline for designing StoneDB V2.0. This design document introduces the main changes to be made and gives you an overview of the architecture of StoneDB V2.0.

The main design goals of StoneDB V2.0 will include:
1: Large Scale.
2: Real Time.
3: Highly Fresh Data Changes.
4: Strong Data Consistency.
5: Query Performance Capability.
6: Single System Interface.
7: Performance Isolation.

1 Overview of StoneDB V2.0

StoneDB V2.0 is an integrated HTAP database that adopts hybrid row-column store and in-memory computing. It is fully compatible with MySQL.

[Figure: Overview of the StoneDB V2.0 architecture]

In StoneDB V2.0, the update of the column-based storage engine Tianmu will be a qualitative leap. Compared with Tianmu V1.0, Tianmu V2.0 provides brand new functionality. Tianmu plays the same role in MySQL as InnoDB. In StoneDB V2.0, Tianmu is used as the secondary engine. Support for a secondary engine is a new feature introduced in MySQL 8.0.

With the secondary engine enabled, MySQL 8.0 can intelligently route each workload to the primary or secondary engine based on the workload type.


2 SQL Syntaxes

Firstly, the SQL syntaxes must be defined. All these syntaxes are the basis of all future works. This chapter introduces syntaxes of the following operations:

  • Create an in-memory Tianmu table.

  • Load data.

  • Process a query.

  • Monitor the system status.

To determine which SQL syntaxes to define, we must first figure out why we want to develop StoneDB V2.0: we want users to be able to port their workloads seamlessly from MySQL Heatwave to StoneDB. Therefore, we try to adopt all SQL syntax that MySQL Heatwave uses.

In addition, relevant changes will be implemented in the MySQL server layer. Following are examples showing some SQL syntaxes supported by StoneDB V2.0.

Certain SQL grammars must be added in sql/sql_yacc.yy. The following uses the SELECT statement as an example:

select_stmt:
          query_expression
          {
            $$ = NEW_PTN PT_select_stmt($1);
          }
        | query_expression locking_clause_list
          {
            $$ = NEW_PTN PT_select_stmt(NEW_PTN PT_locking($1, $2),
                                        nullptr, true);
          }
        | query_expression_parens
          {
            $$ = NEW_PTN PT_select_stmt($1);
          }
        | select_stmt_with_into
        ;

After SQL syntaxes are added, new SQL items are created in yacc. These items will be processed in the MySQL server layer during query optimization.

Create a Tianmu table:

CREATE TABLE orders (id INT) SECONDARY_ENGINE = Tianmu;
ALTER TABLE orders SECONDARY_ENGINE = Tianmu;

Compared with the table creation syntax used in StoneDB V1.0, StoneDB V2.0 will support a new keyword, SECONDARY_ENGINE, adopted from MySQL 8.0.

Original CREATE statement syntax used in MySQL:

create_table_stmt:
          CREATE opt_temporary TABLE_SYM opt_if_not_exists table_ident
          '(' table_element_list ')' opt_create_table_options_etc
          {
            $$= NEW_PTN PT_create_table_stmt(YYMEM_ROOT, $2, $4, $5,
                                             $7,
                                             $9.opt_create_table_options,
                                             $9.opt_partitioning,
                                             $9.on_duplicate,
                                             $9.opt_query_expression);
          }

opt_create_table_options_etc:
          create_table_options
          opt_create_partitioning_etc
          {
            $$= $2;
            $$.opt_create_table_options= $1;
          }
        | opt_create_partitioning_etc
        ;

create_table_option:
          ENGINE_SYM opt_equal ident_or_text
          {
            $$= NEW_PTN PT_create_table_engine_option(to_lex_cstring($3));
          }
        | SECONDARY_ENGINE_SYM opt_equal NULL_SYM
          {
            $$= NEW_PTN PT_create_table_secondary_engine_option();
          }
        | SECONDARY_ENGINE_SYM opt_equal ident_or_text
          {
            $$= NEW_PTN PT_create_table_secondary_engine_option(to_lex_cstring($3));
          }

As shown in the definition above, SECONDARY_ENGINE_SYM is already defined in create_table_option and should also be handled in class PT_create_table_stmt. For more information about how SQL syntax support will be designed, see issue #423.

3 Massive Parallel Processing

Generally, an analytical processing (AP) engine deals with GB- or PB-scale data. However, the workload processing capability of a single SQL engine instance is limited. As the complexity of business grows, the data also grows exponentially, and the traditional centralized computing architecture can no longer keep up. Therefore, the massive parallel processing (MPP) architecture is introduced, and more and more computing nodes are employed to process large amounts of data.

As mentioned in the chapter “Before You Start”, StoneDB V2.0 employs the MPP framework for AP, which gives the AP engine scale-out capability. The MPP framework is widely used in analytical databases such as Greenplum and Presto. To adapt to the MPP framework, we also need to develop the query engine and the execution engine. With multi-node computing capabilities, StoneDB V2.0 will distribute sub-jobs for query optimization and query plan execution to the corresponding nodes.

3.1 Data Distribution Approaches

Firstly, the data distribution rules must be defined. Hash distribution is widely used in distributed architectures. Hash distribution allows you to specify a hash key when creating a table, so that the hash key can determine on which node data is stored. Other distribution approaches include distribution function and random distribution.

3.2 Redistribution Approaches

Another challenge is how to implement high availability. If no measure is taken to ensure high availability after MPP is enabled and a node fails, the data on that node will no longer be available. Therefore, it is important to ensure the system is fault-tolerant.

Replication is a mature approach to ensuring high availability. Each piece of data on a node has at least two additional replicas stored on different nodes. That is, every piece of data has three copies, each stored on a different node, and thus high availability is ensured. However, keeping three copies of every piece of data consumes more disk capacity.

3.3 Metadata Management

Metadata in MPP includes:

  • System architecture information, such as node information, IP address, and port number
  • Data distribution rules
  • Replica information
    For more information, see issue #xxxx.

4 Query Engine

After StoneDB V2.0 achieves MPP, the way that StoneDB processes SQL statements will be different from centralized systems. A distributed query plan will be generated after query optimization is complete.

After all new SQL syntaxes are enabled, the server will understand all the SQL statements. When the server receives an SQL statement, the SQL string is turned into SQL classes, such as PT_create_table_stmt, after lexical and grammatical processing. We will not discuss how distributed query plans in MPP are generated in this document. Instead, we focus on ONE node and explain what happens on it when an SQL statement is processed.

In MySQL 8.0, when the cost of a query plan on the primary engine is greater than the threshold defined by the new system variable (secondary_engine_cost_threshold), the query optimization engine will offload this workload to the secondary engine, ensuring optimal processing efficiency.

In the last phase of query optimization, the query engine will call optimize_secondary_engine to determine to which engine the workload is routed for execution, by performing the following three steps:

  1. Use the original processing way: unit->optimize().
  2. Estimate the cost spent by each engine to process the query: current_query_cost and accumulate_current_query_cost.
  3. If current_query_cost is greater than secondary_engine_cost_threshold, forward the workload to optimize_secondary_engine.
if (current_query_cost < variables.secondary_engine_cost_threshold)
    return false;  // cheap enough: keep the query on the primary engine
// Otherwise, hand the query over to optimize_secondary_engine().


If the workload is routed to the secondary engine (the Tianmu in-memory engine), StoneDB will re-process this query to create a distributed query plan according to the metadata in MPP.

Based on the analysis above, a new query optimization module and a cost module need to be built. For more information, see issue #493.

5 Execution Engine

As for the execution engine, a vectorized execution engine will be used in Tianmu. A column-based AP system is a natural fit for vectorized execution, which has become the standard approach to improving the performance of AP systems. Database systems such as ClickHouse also use vectorized execution engines. Two ways are available to achieve vectorized execution: 1. Use SIMD (single instruction, multiple data) instructions and rewrite execution plans so that multiple tuples are fetched per iteration, rather than one tuple per iteration. 2. Use GCC to generate vectorized code.

In addition to the SIMD mechanism, parallel execution is another property of MPP. Some functions such as count(), sum(), and avg() can be executed in parallel. After a query plan is dispatched to a data node through the management node, the execution engine divides the job into sub-jobs and executes them simultaneously on multiple threads. The framework of parallel execution is discussed in issue #xxxx.

The following picture depicts an example of parallel aggregation.

Parallel execution depends on many modules in the system, including the query optimization module, as well as the hardware.

6 In-Memory Column-Based Engine

The StoneDB Database In-Memory feature set includes the in-memory column store, advanced query optimization, and high availability solutions. These features accelerate analytic queries by orders of magnitude without sacrificing OLTP performance or availability.

The in-memory column store — Tianmu — maintains copies of tables, partitions, and individual columns in a compressed columnar format that is optimized for rapid scans.

6.1 Memory Organization and Columnar Format

The Tianmu engine will live in the buffer pool. When the system starts, part of the memory in the buffer pool will be reserved for the in-memory engine. The following figure shows how this part of memory is organized.
[Figure: Memory organization of the in-memory engine]

The layouts of data used in Tianmu include:

  • In-memory column-based Data Pack (IMCDP), also known as chunk.
  • Snapshot Meta Unit (SMU)

6.2 Load Data

After the data layouts and organization are defined, the next challenge is how to load data from the primary engine (InnoDB is used in the following as an example) to the secondary engine.

MySQL 8.0 provides an interface to load data from InnoDB to the secondary engine.

  1. Prepare for loading data: ha_prepare_load_table.
 DBUG_ASSERT(thd->mdl_context.owns_equal_or_stronger_lock(
      MDL_key::TABLE, table.s->db.str, table.s->table_name.str, MDL_EXCLUSIVE));
  DBUG_ASSERT(table.s->has_secondary_engine());

  // Load at least one column to the secondary engine.
  if (bitmap_bits_set(table.read_set) == 0) {
    my_error(ER_SECONDARY_ENGINE, MYF(0),
             "All columns marked as NOT SECONDARY");
    return true;
  }
  2. Load data. The InnoDB parallel index-tree scan will be used to retrieve the data.
// InnoDB parallel scan context
Parallel_reader_adapter

struct handler {
// Parallel scan interface, which can be used to speed up the offloading process
int parallel_scan_init(void *&scan_ctx, size_t &num_threads);
int parallel_scan(void *scan_ctx, void **thread_ctxs,
                  Load_init_cbk init_fn, Load_cbk load_fn, Load_end_cbk end_fn);
void parallel_scan_end(void *scan_ctx);

// Allow concurrent DML in the offload process.
int ha_prepare_load_table(const TABLE &table);
int ha_load_table(const TABLE &table);
int ha_unload_table(const char *db_name, const char *table_name,
                    bool error_if_not_loaded);

void ha_set_primary_handler(handler *primary_handler);
};

6.3 Populate the Changes

This is the most important feature of StoneDB, and it makes StoneDB a true HTAP database. Changes made by transactional workloads will be automatically populated to the AP engine, Tianmu.


If any of the following criteria is met, the populate operation will be triggered:

  • The buffer is filled up by TP changes.
  • The populate timer expires.
  • An analytics workload reads a column that has been loaded to Tianmu.

6.4 Purge the Inactive Data

Work in progress.

For more information, see issue #424.

7 Backup & Restore

8 Crash recovery

9 Replication

10 Data Importing and Exporting

11 Data Security

All data in StoneDB is encrypted.

12 Cloud-native

cjkbjhb commented Aug 31, 2022

It's necessary to describe the cloud native feature of version 2.0 architecture. DO NOT miss it.

hustjieke (Collaborator) commented:

It's necessary to describe the cloud native feature of version 2.0 architecture. DO NOT miss it.

ACK.
Cloud-native is an awesome feature!

yuqi1129 commented:

  1. Does the secondary engine Tianmu store the full set of data that lies in the primary engine? As far as I know, two engines mean two copies of the data, and the storage cost can't be ignored. Is there any method to reduce the storage volume? This seems a little like TiDB, which introduces TiFlash to support AP workloads.

  2. Do the AP and TP execution engines lie in the same instance? How are these two kinds of workloads isolated if they run in one process or on the same machine?


RingsC commented Sep 23, 2022

  1. Does the secondary engine Tianmu store the full set of data that lies in the primary engine? As far as I know, two engines mean two copies of the data, and the storage cost can't be ignored. Is there any method to reduce the storage volume? This seems a little like TiDB, which introduces TiFlash to support AP workloads.
  2. Do the AP and TP execution engines lie in the same instance? How are these two kinds of workloads isolated if they run in one process or on the same machine?

1: Yes, you are right, there are two copies, but it is not a copy of the whole data, just a copy of the data that is loaded into the secondary engine. When we do analytical processing, we only process the specific data; for example, in a 'sale' table, we only care about the sold volume, total profit, etc. Therefore, we think two copies are acceptable in production. The PAX data format can achieve 'one copy', but PAX would degrade performance, and it is a complex engineering issue. For more information about HTAP issues, you can refer to the articles published on our official WeChat account. Thanks.

2: Yes, an integrated architecture does face this problem: how to isolate AP workloads from TP workloads. We will set thresholds for AP workloads to make sure they do not interfere with TP workloads, and more approaches are under investigation. If you have any good ideas, please let us know. Thanks.
