MVCC and transactional design shootout :gun: :bomb: :explosionbomb: :extrabigexplosion: #232

Closed
spacejam opened this Issue Feb 8, 2018 · 5 comments

@spacejam
Owner

spacejam commented Feb 8, 2018

Time for a literature shootout to determine the initial architecture for sled 0.16's transactions and snapshots! If anyone has particular insights or opinions on these, please jump in here!

MVCC bundles a number of separate concerns that we should be careful not to conflate (taxonomy taken from the paper "An Empirical Evaluation of In-Memory Multi-Version Concurrency Control" below):

  • concurrency control protocols - this is the primary thing under consideration in this issue!
  • version storage
    • append-only storage - all versions exist in the same storage space. a lock-free linked list points either from oldest to newest (O2N), which suffers from cache pollution because reads must always scan past old versions, or from newest to oldest (N2O), which requires an indirection table (which we have anyway). off-page storage of BLOBs can be complicated and may need reference counts.
    • time-travel storage - older versions are stored in a separate table. some systems store the newest version in the master table; others store the oldest there to facilitate cheap snapshot isolation (SAP HANA does this). for our trade-offs, we may be fine keeping old versions in a separate, memory-only space after a transaction completes and invalidates them, and then using an epoch system to handle GC. this does not scale for long snapshots, but that is acceptable for the chosen trade-offs of this system. time-travel storage can suffer from the same off-page storage issues as append-only storage.
    • delta-storage - used by MySQL, Oracle, and HyPer. store only the main version in primary storage, and record deltas for the other versions. can incur read performance hits when older versions must be reconstructed.
    • general considerations: balancing write contention and cache pollution are the major trade-offs here
  • GC
    • vacuum (doesn't scale, but is simple)
    • cooperative cleaning - when a page is read and we notice it has stale versions, tell a global GC mechanism about it. vulnerable to "dusty corners", so hekaton uses this but also occasionally vacuums. only compatible with append-only storage (so we're good). requires that the mechanism that cleans up data reads the oldest entries; we can implement this by having relocation calls read chains in full, while normal reads access only the new part of the chain they require, keeping cache pollution down. right now sled always reads the entire page, which pollutes cache - something to flag for later.
    • transaction-level gc, track read+write sets of each transaction per epoch, when epochs end, we can clear all state generated by transactions in that epoch. simple but requires tracking the epoch mapping
  • index management of secondary indices (2i). we are NOT going to be implementing 2i in 0.16, but we should be mindful not to make the path toward implementing them in the future needlessly complex
    • logical pointer to primary key - grows in size with primary key length
    • logical pointer to tuple ID - use a separate lock-free lookup table, just store 64 bit tuple ID
    • physical pointer to storage - only usable with append-only storage, so we're good. used by hekaton and memsql. when we update the current version in storage, we update all 2i's to point to the new physical location.
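To make the N2O version-storage trade-off concrete, here is a minimal single-threaded sketch of a newest-to-oldest version chain behind an indirection entry. All type and function names here are hypothetical (not sled's actual API), and a real implementation would use lock-free atomic pointers rather than `Box` links:

```rust
// Hypothetical sketch of a newest-to-oldest (N2O) version chain.

struct Version {
    begin_ts: u64,               // commit timestamp that made this version visible
    value: Vec<u8>,              // tuple payload for this version
    older: Option<Box<Version>>, // link to the next-oldest version
}

struct VersionChain {
    newest: Option<Box<Version>>, // the indirection entry points at the newest version
}

impl VersionChain {
    fn new() -> Self {
        VersionChain { newest: None }
    }

    // Install a new version at the head of the chain (N2O ordering).
    fn install(&mut self, commit_ts: u64, value: Vec<u8>) {
        let old = self.newest.take();
        self.newest = Some(Box::new(Version {
            begin_ts: commit_ts,
            value,
            older: old,
        }));
    }

    // A reader at snapshot `read_ts` walks from newest to oldest and stops at
    // the first version it is allowed to see, so fresh reads touch only the
    // head of the chain (less cache pollution than O2N's full scans).
    fn read(&self, read_ts: u64) -> Option<&[u8]> {
        let mut cur = self.newest.as_deref();
        while let Some(v) = cur {
            if v.begin_ts <= read_ts {
                return Some(&v.value);
            }
            cur = v.older.as_deref();
        }
        None
    }
}
```

Note how an old snapshot pays for the walk while a fresh read does not, which matches the point-read-optimized trade-off above.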

The ideas in silo and tictoc particularly appeal to me due to their focus on reducing scalability barriers at high core counts, which is where we'll be headed in the coming years. A concern with major impact on the implementation is how snapshots are handled. Sled is optimized for point reads and short-to-medium scans. It is acceptable for long scans to pay a higher GC penalty, especially if that reduces implementation complexity. We value reliable worst-case latency far more than blazing p0–p50 latencies.

High-Performance Concurrency Control Mechanisms for Main-Memory Databases (2012)

  • Hekaton MVCC paper. They went with the optimistic choice here.

Speedy Transactions in Multicore In-Memory Databases (2013)
(silo)

  • interesting techniques for supporting many-core, decentralized systems
  • epoch-based
it avoids all centralized contention points, including that of centralized transaction ID assignment. Silo's key contribution is a commit protocol based on optimistic concurrency control that provides serializability while avoiding all shared-memory writes for records that were only read. Though this might seem to complicate the enforcement of a serial order, correct logging and recovery is provided by linking periodically-updated epochs with the commit protocol. Silo provides the same guarantees as any serializable database without unnecessary scalability bottlenecks or much additional latency.
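A highly simplified, single-threaded sketch of the Silo-style commit validation described above (readers record observed TIDs and write no shared memory; at commit, lock the write set, re-check the read set, then install). All names are invented; the real protocol packs epochs into TIDs, sorts the write set to avoid deadlock, and uses per-record spinlocks:

```rust
// Illustrative sketch of Silo-style OCC read-set validation.

use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq)]
struct Tid(u64); // the real protocol packs (epoch, sequence) in here

struct Record {
    tid: Tid,
    locked: bool,
    value: u64,
}

struct Txn {
    read_set: Vec<(usize, Tid)>,  // (record id, TID observed at read time)
    write_set: Vec<(usize, u64)>, // (record id, new value)
}

fn try_commit(txn: &Txn, db: &mut HashMap<usize, Record>, commit_tid: Tid) -> bool {
    // Phase 1: lock the write set.
    for (id, _) in &txn.write_set {
        db.get_mut(id).unwrap().locked = true;
    }
    // Phase 2: validate the read set; abort if any record's TID changed or
    // the record is locked by someone other than us.
    for (id, seen) in &txn.read_set {
        let rec = &db[id];
        let in_our_writes = txn.write_set.iter().any(|(w, _)| w == id);
        if rec.tid != *seen || (rec.locked && !in_our_writes) {
            for (wid, _) in &txn.write_set {
                db.get_mut(wid).unwrap().locked = false; // release and abort
            }
            return false;
        }
    }
    // Phase 3: install writes with the new TID and unlock.
    for (id, val) in &txn.write_set {
        let rec = db.get_mut(id).unwrap();
        rec.value = *val;
        rec.tid = commit_tid;
        rec.locked = false;
    }
    true
}
```

The key property this preserves is that a read-only record is never written during validation, so read-mostly workloads generate no cache-line invalidation traffic.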

Fast Serializable Multi-Version Concurrency Control for Main-Memory Database Systems (2015)

  • keep old versions in undo buffers
We present a novel MVCC implementation for main-memory database systems that has very little overhead compared to serial execution with single-version concurrency control, even when maintaining serializability guarantees. Updating data in-place and storing versions as before-image deltas in undo buffers not only allows us to retain the high scan performance of single-version systems but also forms the basis of our cheap and fine-grained serializability validation mechanism. The novel idea is based on an adaptation of precision locking and verifies that the (extensional) writes of recently committed transactions do not intersect with the (intensional) read predicate space of a committing transaction. We experimentally show that our MVCC model allows very fast processing of transactions with point accesses as well as read-heavy transactions and that there is little need to prefer SI over full serializability any longer.
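A minimal sketch of the in-place-update-plus-undo-buffer idea from the abstract, with hypothetical names and none of HyPer's version visibility or precision-locking machinery; it only shows how before-image deltas make abort a cheap reverse replay:

```rust
// Illustrative sketch of in-place updates with before-image deltas
// recorded in a per-transaction undo buffer.

struct UndoEntry {
    key: usize,
    before: u64, // before-image of the overwritten value
}

struct UndoBuffer {
    entries: Vec<UndoEntry>,
}

impl UndoBuffer {
    fn new() -> Self {
        UndoBuffer { entries: Vec::new() }
    }

    // Update in place, recording the before-image so older snapshots
    // (or an abort) can reconstruct the prior version from the delta.
    fn update(&mut self, table: &mut Vec<u64>, key: usize, new: u64) {
        self.entries.push(UndoEntry { key, before: table[key] });
        table[key] = new;
    }

    // On abort, roll the deltas back in reverse order.
    fn rollback(self, table: &mut Vec<u64>) {
        for e in self.entries.into_iter().rev() {
            table[e.key] = e.before;
        }
    }
}
```

Because the main table always holds the newest version, scans run at single-version speed; only readers needing an older snapshot chase the deltas.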

High performance transactions via early write visibility (2017)

  • restrict transaction abort conditions to avoid delayed write visibility
    delayed write visibility stems from the fact that database systems can arbitrarily abort transactions at any point during their execution. Accordingly, we make the case for database systems which only abort transactions under a restricted set of conditions, thereby enabling a new recoverability mechanism, early write visibility, which safely makes transactions' writes visible prior to the end of their execution. We design a new serializable concurrency control protocol, piece-wise visibility (PWV), with the explicit goal of enabling early write visibility. We evaluate PWV against state-of-the-art serializable protocols and a highly optimized implementation of read committed, and find that PWV can outperform serializable protocols by an order of magnitude and read committed by 3X on high contention workloads.

Efficiently making (almost) any concurrency control mechanism serializable (2017)

  • certifier approach
We propose the serial safety net (SSN), a serializability-enforcing certifier which can be applied on top of various CC schemes that offer higher performance but admit anomalies, such as snapshot isolation and read committed. The underlying CC mechanism retains control of scheduling and transactional accesses, while SSN tracks the resulting dependencies. At commit time, SSN performs a validation test by examining only direct dependencies of the committing transaction to determine whether it can commit safely or must abort to avoid a potential dependency cycle.

An Empirical Evaluation of In-Memory Multi-Version Concurrency Control (2017)

  • a survey of trade-offs, mentions a few of the previous contestants here
we conduct an extensive study of the scheme's four key design decisions: concurrency control protocol, version storage, garbage collection, and index management. We implemented state-of-the-art variants of all of these in an in-memory DBMS and evaluated them using OLTP workloads. Our analysis identifies the fundamental bottlenecks of each design choice.

Transaction Healing: Scaling Optimistic Concurrency Control on Multicores (2016)

we propose a new concurrency-control mechanism, called transaction healing, that exploits program semantics to scale the conventional OCC towards dozens of cores even under highly contended workloads. Transaction healing captures the dependencies across operations within a transaction prior to its execution. Instead of blindly rejecting a transaction once its validation fails, the proposed mechanism judiciously restores any non-serializable operation and heals inconsistent transaction states as well as query results according to the extracted dependencies. Transaction healing can partially update the membership of read/write sets when processing dependent transactions. Such overhead, however, is largely reduced by carefully avoiding false aborts and rearranging validation orders. We implemented the idea of transaction healing in TheDB, a main-memory database prototype that provides full ACID guarantee with a scalable commit protocol. By evaluating TheDB on a 48-core machine with two widely-used benchmarks, we confirm that transaction healing can scale near-linearly, yielding significantly higher transaction rate than the state-of-the-art OCC implementations.

TicToc: Time Traveling Optimistic Concurrency Control (2016)

  • avoid centralized timestamp ordering
Previous research has shown that timestamp management is the key scalability bottleneck in concurrency control algorithms. This prevents the system from scaling to large numbers of cores. In this paper we present TicToc, a new optimistic concurrency control algorithm that avoids the scalability and concurrency bottlenecks of prior T/O schemes. TicToc relies on a novel and provably correct data-driven timestamp management protocol. Instead of assigning timestamps to transactions, this protocol assigns read and write timestamps to data items and uses them to lazily compute a valid commit timestamp for each transaction. TicToc removes the need for centralized timestamp allocation, and commits transactions that would be aborted by conventional T/O schemes.
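The data-driven timestamp idea can be sketched roughly as follows. This is an illustrative simplification with invented names (no locking, no atomic rts words): each item carries a write timestamp (wts) and a read timestamp (rts), and the commit timestamp is computed lazily from the items a transaction touched rather than handed out centrally:

```rust
// Simplified sketch of TicToc's data-driven commit timestamp computation.

#[derive(Clone, Copy)]
struct TsWord {
    wts: u64, // timestamp of the version's last write
    rts: u64, // highest timestamp at which this version was read
}

// A read of item i is valid for any commit_ts >= wts_i (after rts
// extension), and a write must be ordered after every prior read of the
// item, i.e. commit_ts > rts of each written item.
fn commit_ts(reads: &[TsWord], writes: &[TsWord]) -> u64 {
    let read_lb = reads.iter().map(|t| t.wts).max().unwrap_or(0);
    let write_lb = writes.iter().map(|t| t.rts + 1).max().unwrap_or(0);
    read_lb.max(write_lb)
}

// A read remains valid at commit time only if the version is unchanged
// (same wts as observed); its rts is then extended to cover commit_ts.
fn validate_read(current: &mut TsWord, observed_wts: u64, ts: u64) -> bool {
    if current.wts != observed_wts {
        return false; // a newer version was installed; abort
    }
    current.rts = current.rts.max(ts); // extend the read timestamp
    true
}
```

Because timestamps live on the data rather than on a central allocator, two transactions touching disjoint items never contend on timestamp generation.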

@spacejam
Owner

spacejam commented Feb 9, 2018

Contention-Aware Lock Scheduling for Transactional Databases

  • innodb recently added CATS (originally VATS) support, for scheduling lock acquisition based on a dependency graph and picking transactions with the most downstream dependencies.
  • not sure how this approach benefits a system targeted from day-1 for modern architectures, but it seems to be a very powerful incremental improvement for the innodb engine
we present the concept of contention-aware scheduling, show the hardness of the problem, and propose novel lock scheduling algorithms (LDSF and bLDSF), which guarantee a constant factor approximation of the best scheduling. We conduct extensive experiments using a popular database on both TPC-C and a microbenchmark. Compared to FIFO—the default scheduler in most database systems—our bLDSF algorithm yields up to 300x speedup in overall transaction latency. On the other hand, our LDSF algorithm, which is simpler and achieves comparable performance to bLDSF, has already been adopted by open-source community, and chosen as default scheduling strategy in MySQL 8.0.3+.

@spacejam
Owner

spacejam commented Feb 9, 2018

This nice slidedeck from Andy Pavlo covers the high-level implementations of hekaton, hyper, and cicada.

@spacejam
Owner

spacejam commented Feb 9, 2018

Staring into the Abyss: An Evaluation of Concurrency Control with One Thousand Cores (2014)

To better understand just how unprepared current DBMSs are for future CPU architectures, we performed an evaluation of concurrency control for on-line transaction processing (OLTP) workloads on many-core chips. We implemented seven concurrency control algorithms on a main-memory DBMS and using computer simulations scaled our system to 1024 cores. Our analysis shows that all algorithms fail to scale to this magnitude but for different reasons. In each case, we identify fundamental bottlenecks that are independent of the particular database implementation and argue that even state-of-the-art DBMSs suffer from these limitations. We conclude that rather than pursuing incremental solutions, many-core chips may require a completely redesigned DBMS architecture that is built from ground up and is tightly coupled with the hardware.

Another cool Pavlo deck about the work from "staring into the abyss"

@spacejam
Owner

spacejam commented Feb 11, 2018

Cicada: Dependably Fast Multi-Core In-Memory Transactions

Cicada is a single-node multi-core in-memory transactional database with serializability. To provide high performance under diverse workloads, Cicada reduces overhead and contention at several levels of the system by leveraging optimistic and multi-version concurrency control schemes and multiple loosely synchronized clocks while mitigating their drawbacks. On the TPC-C and YCSB benchmarks, Cicada outperforms Silo, TicToc, FOEDUS, MOCC, two-phase locking, Hekaton, and ERMIA in most scenarios, achieving up to 3X higher throughput than the next fastest design.

Seems like the current contender.

@spacejam
Owner

spacejam commented Feb 20, 2018

I'm aiming toward a cicada-like architecture, but with simplified timestamp generation initially just using an atomic fetch_add. We can also support causal consistency with zero centralized timestamp contention by just using a higher timestamp than what we find in our initial reads and tracking a thread-local max timestamp encountered, plus a per-thread identifier in the low bits for guaranteeing timestamp uniqueness. We have a LOT of tuning to go before this becomes a bottleneck.
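A rough sketch of that scheme, with illustrative names (not sled's actual API): the centralized path is a plain fetch_add, and the decentralized causal variant advances past the maximum observed timestamp while packing a per-thread id into the low bits so two threads can never mint the same timestamp:

```rust
// Sketch of the timestamp scheme described above: a global atomic counter
// for now, plus a causal variant using a thread-local max of observed
// timestamps and a per-thread id in the low bits for uniqueness.

use std::sync::atomic::{AtomicU64, Ordering};

const THREAD_BITS: u32 = 8; // low bits reserved for a per-thread id

struct TsGen {
    global: AtomicU64,
}

impl TsGen {
    fn new() -> Self {
        TsGen { global: AtomicU64::new(0) }
    }

    // Simple centralized generation: one fetch_add per transaction.
    fn next_centralized(&self) -> u64 {
        self.global.fetch_add(1, Ordering::SeqCst)
    }
}

// Decentralized causal variant: issue a timestamp strictly larger than
// both the thread's own previous timestamp and the max timestamp observed
// during the transaction's initial reads. No shared-memory contention.
fn next_causal(local_max: &mut u64, observed: u64, thread_id: u64) -> u64 {
    debug_assert!(thread_id < (1 << THREAD_BITS));
    let base = (*local_max).max(observed) >> THREAD_BITS;
    let ts = ((base + 1) << THREAD_BITS) | thread_id;
    *local_max = ts;
    ts
}
```

The causal variant preserves the ordering readers can observe (anything you read happened before your timestamp) without any centralized counter, at the cost of timestamps no longer being a total wall-clock-like order across threads.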

spacejam closed this Feb 20, 2018
