Restructure Storage engines chapter and write the memtx engine overview #2243
Merged
11 commits
fcfe80c Restructure Storage engines chapter. Add initial structure and topics… (veod32)
4f0354a Update description and doc structure (veod32)
c711557 Return excluded vynil.rst back (veod32)
07da442 Update memtx overview page and storage enging chapter index page (veod32)
13ff640 Correct content after review (veod32)
f38954d Corrections after review (veod32)
ee7c916 Update doc/book/box/engines/memtx.rst (patiencedaur)
fcef810 Corrections after review (veod32)
6cc698e Update translations (ainoneko)
929f6cf Merge branch 'latest' into veod32/gh-1632-memtx-overview (patiencedaur)
3be7390 Update translations (patiencedaur)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,161 @@ | ||
| .. _engines-memtx: | ||
|
|
||
| Storing data with memtx | ||
| ======================= | ||
|
|
||
| The ``memtx`` storage engine is used in Tarantool by default. It keeps all data in random-access memory (RAM), and therefore has very low read latency. | ||
|
|
||
| The obvious question here is: | ||
| if all the data is stored in memory, how can you prevent data loss in an emergency such as an outage or a Tarantool instance failure? | |
|
|
||
| First of all, Tarantool persists all data changes by writing requests to the write-ahead log (WAL) that is stored on disk. | ||
| Read more about that in the :ref:`memtx-persist` section. | ||
| In case of a distributed application, there is an option of synchronous replication that ensures keeping the data consistent on a quorum of replicas. | ||
| Although replication is not directly a storage engine topic, it is a part of the answer regarding data safety. Read more in the :ref:`memtx-replication` section. | ||
|
|
||
| In this chapter, the following topics are discussed in brief, with references to other chapters that explain the subject matter in detail. | |
|
|
||
| .. contents:: | ||
| :local: | ||
| :depth: 1 | ||
|
|
||
| .. _memtx-memory: | ||
|
|
||
| Memory model | ||
| ------------ | ||
|
|
||
| There is a fixed number of independent :ref:`execution threads <atomic-threads_fibers_yields>`. | ||
| The threads don't share state. Instead they exchange data using low-overhead message queues. | ||
| While this approach limits the number of cores that the instance uses, | ||
| it removes competition for the memory bus and ensures peak scalability of memory access and network throughput. | ||
|
|
||
| Only one thread, namely the **transaction processor thread** (hereafter, **TX thread**), | |
| can access the database, and there is only one TX thread for each Tarantool instance. | ||
| In this thread, transactions are executed in a strictly consecutive order. | ||
| Multi-statement transactions exist to provide isolation: | ||
| each transaction sees a consistent database state and commits all its changes atomically. | ||
| At commit time, a yield happens and all transaction changes are written to :ref:`WAL <internals-wal>` in a single batch. | ||
| In case of errors during transaction execution, the transaction is rolled back completely. | |
| Read more in the following sections: :ref:`atomic-transactions`, :ref:`atomic-transactional-manager`. | ||
|
|
||
| Within the TX thread, there is a memory area allocated for Tarantool to store data. It's called **Arena**. | ||
|
|
||
| .. image:: memtx/arena2.svg | ||
|
|
||
| Data is stored in :term:`spaces <space>`. Spaces contain database records—:term:`tuples <tuple>`. | ||
| To access and manipulate the data stored in spaces and tuples, Tarantool builds :doc:`indexes </book/box/indexes>`. | ||
|
|
||
| Special `allocators <https://github.com/tarantool/small>`__ manage memory allocations for spaces, tuples, and indexes within the Arena. | ||
| The slab allocator is the main allocator used to store tuples. | ||
| Tarantool has a built-in module called ``box.slab`` which provides the slab allocator statistics | ||
| that can be used to monitor the total memory usage and memory fragmentation. | ||
| For details, see the ``box.slab`` module :doc:`reference </reference/reference_lua/box_slab>`. | ||
|
|
||
| .. image:: memtx/spaces_indexes.svg | ||
|
|
||
| Also inside the TX thread, there is an event loop. Within the event loop, there are a number of :ref:`fibers <fiber-fibers>`. | ||
| Fibers are cooperative primitives that allow interaction with spaces, that is, reading and writing the data. | |
| Fibers can interact with the event loop and between each other directly or by using special primitives called channels. | ||
| Due to the usage of fibers and :ref:`cooperative multitasking <atomic-cooperative_multitasking>`, the ``memtx`` engine is lock-free in typical situations. | ||
|
|
||
| .. image:: memtx/fibers-channels.svg | ||
|
|
||
| To interact with external users, there is a separate :ref:`network thread <atomic-threads_fibers_yields>` also called the **iproto thread**. | ||
| The iproto thread receives a request from the network, parses and checks the statement, | ||
| and transforms it into a special structure—a message containing an executable statement and its options. | ||
| Then the iproto thread ships this message to the TX thread, where the user's request runs in a separate fiber. | |
|
|
||
| .. image:: memtx/iproto.svg | ||
|
|
||
| .. _memtx-persist: | ||
|
|
||
| Data persistence | ||
| ---------------- | ||
|
|
||
| To ensure :ref:`data persistence <index-box_persistence>`, Tarantool does two things. | ||
|
|
||
| * After executing data change requests in memory, Tarantool writes each such request to the :ref:`write-ahead log (WAL) <internals-wal>` files (``.xlog``) | ||
| that are stored on disk. Tarantool does this via a separate thread called the **WAL thread**. | ||
|
|
||
| .. image:: memtx/wal.svg | ||
|
|
||
| * Tarantool periodically takes the entire :doc:`database snapshot </reference/reference_lua/box_snapshot>` and saves it on disk. | ||
| This is necessary to speed up the instance's restart: when there are too many WAL files, reading all of them makes the restart slow. | |
|
|
||
| To save a snapshot, there is a special fiber called the **snapshot daemon**. | ||
| It reads the consistent content of the entire Arena and writes it on disk into a snapshot file (``.snap``). | ||
| Because of cooperative multitasking, Tarantool cannot write directly to disk, since that is a blocking operation. | |
| That is why Tarantool interacts with disk via a separate pool of threads from the :doc:`fio </reference/reference_lua/fio>` library. | ||
|
|
||
| .. image:: memtx/snapshot03.svg | ||
|
|
||
| So, even in emergency situations such as an outage or a Tarantool instance failure, | ||
| when the in-memory database is lost, the data can be restored fully during Tarantool restart. | ||
|
|
||
| What happens during the restart: | ||
|
|
||
| 1. Tarantool finds the latest snapshot file and reads it. | ||
| 2. Tarantool finds all the WAL files created after that snapshot and reads them as well. | ||
| 3. When the snapshot and WAL files have been read, there is a fully recovered in-memory data set | ||
| corresponding to the state when the Tarantool instance stopped. | ||
| 4. While reading the snapshot and WAL files, Tarantool builds the primary indexes. | |
| 5. When all the data is in memory again, Tarantool builds the secondary indexes. | |
| 6. Tarantool runs the application. | ||
|
|
||
| .. _memtx-indexes: | ||
|
|
||
| Accessing data | ||
| -------------- | ||
|
|
||
| To access and manipulate the data stored in memory, Tarantool builds indexes. | ||
| Indexes are also stored in memory within the Arena. | ||
|
|
||
| Tarantool supports a number of :ref:`index types <index-types>` intended for different usage scenarios. | ||
| The possible types are TREE, HASH, BITSET, and RTREE. | ||
|
|
||
| Select queries are possible against secondary index keys as well as primary keys. | |
| Indexes can have multi-part keys. | ||
|
|
||
| For detailed information about indexes, refer to the :doc:`/book/box/indexes` page. | ||
|
|
||
| .. _memtx-replication: | ||
|
|
||
| Replicating data | ||
| ---------------- | ||
|
|
||
| Although this topic is not directly related to the ``memtx`` engine, it completes the overall picture of how Tarantool works in case of a distributed application. | ||
|
|
||
| Replication allows multiple Tarantool instances to work on copies of the same database. | ||
| The copies are kept in sync because each instance can communicate its changes to all the other instances. | ||
| It is implemented via WAL replication. | ||
|
|
||
| To send data to a replica, Tarantool runs another thread called **relay**. | ||
| Its purpose is to read the WAL files and send them to replicas. | ||
| On a replica, a fiber called **applier** runs. It receives the changes from a remote node and applies them to the replica's Arena. | |
| All the changes are written to WAL files via the replica's WAL thread as if they were made locally. | |
|
|
||
| .. image:: memtx/replica-xlogs.svg | ||
|
|
||
| By default, :ref:`replication <replication-architecture>` in Tarantool is asynchronous: if a transaction | ||
| is committed locally on a master node, it does not mean it is replicated onto any | ||
| replicas. | ||
|
|
||
| :ref:`Synchronous replication <repl_sync>` exists to solve this problem. Synchronous transactions | ||
| are not considered committed, and no response is sent to the client, until they are | |
| replicated onto some number of replicas. | ||
|
|
||
| For more information on replication, refer to the :doc:`corresponding chapter </book/replication/index>`. | ||
|
|
||
| .. _memtx-summary: | ||
|
|
||
| Summary | ||
| -------- | ||
|
|
||
| The key points describing how the in-memory storage engine works can be summarized as follows: | |
|
|
||
| * All data is in RAM. | ||
| * Data is accessed from a single thread. | |
| * Tarantool writes all data change requests to the WAL. | |
| * Data snapshots are taken periodically. | |
| * Indexes are built to access the data. | |
| * WAL can be replicated. |
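
The restart sequence described in the "Data persistence" section (load the latest snapshot, then read the WAL records written after it) can be illustrated with a small model. This is only a sketch in plain Python, not Tarantool code: the `recover` function, the `(lsn, op, key, value)` record layout, and the use of a log sequence number to decide which records are newer than the snapshot are all assumptions made for illustration.

```python
def recover(snapshot, wal_records):
    """Rebuild an in-memory data set from a snapshot plus newer WAL records.

    snapshot    -- (snapshot_lsn, {key: value}); hypothetical stand-in for a .snap file
    wal_records -- iterable of (lsn, op, key, value); hypothetical stand-in for .xlog rows
    Returns the recovered dict and the last sequence number that was applied.
    """
    snapshot_lsn, rows = snapshot
    data = dict(rows)              # start from the state saved in the snapshot
    last_lsn = snapshot_lsn

    # Replay the log in order, skipping records the snapshot already contains.
    for lsn, op, key, value in sorted(wal_records, key=lambda record: record[0]):
        if lsn <= snapshot_lsn:
            continue
        if op == "replace":
            data[key] = value
        elif op == "delete":
            data.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op!r}")
        last_lsn = lsn

    return data, last_lsn


if __name__ == "__main__":
    snapshot = (2, {1: "a", 2: "b"})
    wal = [
        (1, "replace", 1, "a"),
        (2, "replace", 2, "b"),
        (3, "replace", 1, "A"),
        (4, "delete", 2, None),
    ]
    print(recover(snapshot, wal))
```

The model also shows why periodic snapshots shorten a restart: the more recent the snapshot, the fewer log records remain to be replayed.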
The short description does not answer a newcomer's main question: will my data be lost if the power suddenly goes out? And if the data is persisted, how much of it will I lose when the machine goes down? Can I make sure, from code, that no data is lost at all? This calls for a couple of lines about synchronous replication and about transaction support in the engine.
Perhaps the answer here should be along the lines of: at a write load of 10000 RPS, we would lose roughly this much data, and here is an article describing how not to lose any at all. @alyapunov is the expert on a short and concise answer here.
This is not really a short description: this paragraph was already there before the comparison table of the two engines, and I kept it. I have addressed the newcomer's question about data loss directly in the memtx section, in the file memtx.rst. Please take a look.