Skip to content
This repository has been archived by the owner on Jul 15, 2019. It is now read-only.
dgomezferro edited this page Nov 6, 2011 · 50 revisions

Welcome to the omid wiki! Omid, which stands for Optimistically transaction Management In Data stores, is an open-source project started at Yahoo!. Here, we walk you through the architecture of CrSO in omid project and explain how you could use it.

What is CrSO?

CrSO adds lock-free transactional support on top of HBase. CrSO benefits from a centralized scheme in which a single server, called status oracle, monitors the modified rows by transactions and use that to detect write-write conflicts. HBase clients in CrSO maintain a read-only copy of transaction commit times to reduce the load on the status oracle, making it scalable up to 50,000 transactions per second (TPS).

Omid's architecture

Why transactions?

A transaction comprises a unit of work against a database, which must either entirely complete (i.e., commit) or have no effect (i.e., abort). In other words, partial executions of the transaction are not defined. Without the support for transactions, the developers are burdened with ensuring atomic execution of a multi-row transaction despite failures as well as concurrent accesses to the database by other transactions. Data stores such as HBase, BigTable, PNUT, and Cassandra, usually lack this precious feature.

Why CrSO?

  1. CrSO is lock-free. In lock-based approaches, the locks that are held by the incomplete transactions of a failed client prevent others from making progress. In CrSO, if a client is slow or faulty, it does not slow down the other clients.
  2. CrSO does not require any modification into HBase code. All the transactional logic is implemented in the status oracle and the clients.
  3. CrSO does not require any change into HBase table schema. The only change into the data is that the version of an inserted value is assigned to the transaction start timestamp, to enable the transactions to read from a snapshot.
  4. Contrary to previous approaches the commit timestamps are not persisted into data servers. CrSO, therefore, brings transaction support to distributed data stores with a negligible overhead on the HBase servers. CrSO, therefore, enables transactions for many applications running on top of HBase with no perceptible impact on performance.

What is snapshot isolation?

Snapshot Isolation Example

Snapshot isolation guarantees that all reads of a transaction are performed on a snapshot of the database that corresponds to a valid database state with no concurrent transaction. To implement snapshot isolation, the database maintains multiple versions of the data in some data servers, and transactions, running by clients, observe different versions of the data depending on their start time. For example, transaction txn_n_ reads the modifications made by the transaction txn_o_, but not the ones made by the concurrent transaction txn_c_ because txn_c_ is not committed when txn_n_ starts. Implementations of snapshot isolation have the advantage that writes of a transaction do not block the reads of others. Two concurrent transactions still conflict if they write into the same data item, say a database row.

API usage

To use HBase transactional support the relevant interfaces are TransactionManager and TransactionalTable, both in com.yahoo.omid.client. The common use case is to start a new transaction with TransactionState txn1 = transactionManager.beginTransaction(), then use this transaction to perform different HBase operations, like transactionalTable.put(txn1, putOperation) and finally commit it transactionManager.tryCommit(txn1).

     Configuration conf = HBaseConfiguration.create();
     TransactionManager tm = new TransactionManager(conf);
     TransactionalTable tt = new TransactionalTable(conf, TEST_TABLE);
     
     TransactionState t1 = tm.beginTransaction();

     Put put = new Put(row);
     putt.add(fam, col, data);
     tt.put(t1, p);
     
     ResultScanner rs = tt.getScanner(t1, new Scan().setStartRow(startrow).setStopRow(stoprow));
     Result r = rs.next();
     while (r != null) {
        ...
        r = rs.next();
     }
     tm.tryCommit(t1);

What the status oracle does?

In the centralized implementation of snapshot isolation, a single server, i.e., the status oracle, receives the commit requests accompanied by the set of the identifiers (id) of modified rows, R. Since the status oracle has observed the modified rows by the previous commit requests, it has enough information to check if there is temporal overlap for each modified row. The timestamps are obtained from a timestamp oracle integrated into the status oracle and the uncommitted data of transactions are stored on the same data tables.

For each transaction, the status oracle server sends/receives the following main messages:

  • Timestamp Request/Timestamp Response,
  • isCommitted Query/isCommitted Response,
  • Commit Request/Commit Response,
  • Abort Cleaned-up.

Status Oracle in action

Since the timestamp oracle is integrated into the status oracle, the client obtains the start timestamp from the status oracle. The following list details the steps of transactions:

  • Transaction start Before performing any read or write, the client obtains a start timestamp from the status oracle.
  • Single-row write. A write operation by transaction txn_w_ is performed by simply writing the new data with a version equal to the transaction start timestamp, Ts(txn_w_).
  • Transaction commit. After a client has written its values on the rows, it tries to commit them by submitting to the status oracle a commit request, which consists of the start timestamp Ts(txn_w_) as well as the list of all the modified rows, R. If the status oracle aborts the transaction, the client must clean up the modified rows.
  • Single-row cleanup. After a transaction aborts, it should clean up all the modified rows. To clean up each row after an abort, the transaction deletes its written versions. This is an extra overhead on data servers, which occurs rarely, only after aborts.
  • Single-row Read. Each read in transaction txnr must observe the last committed data before Ts(txn_r_). To do so, starting with the latest version (assuming that the versions are sorted by timestamp in ascending order), it looks for the first value with commit timestamp δ, where δ < Ts(txn_r_). To verify, the transaction inquires the txn_w_ commit time of status oracle or its local, read-only replica on the client side.

Comparison with other projects

hbase-trx is an ongoing project that attempts to extend HBase with transactional support. Similar to Percolator, hbase-trx runs a 2PC algorithm to detect write-write conflicts. In contrast to Percolator, hbase-trx generates a transaction id locally (by generating a random integer) rather than acquiring one from a global oracle. During the commit preparation phase, hbase-trx detects write-write conflicts and caches the write operations in a server-side state object. On commit, the data server (i.e., RegionServer) applies the write operations to its regions. Each data server considers the commit preparation and applies the commit in isolation. There is no global knowledge of the commit status of transactions.

In the case of a client failure after a commit preparation has been sent to a data server, the transaction will eventually be applied optimistically after a timeout, regardless of the correct status of the transaction. This could lead to inconsistency in the database. To resolve this issue, hbase-trx would require a global transaction status oracle similar to that presented in this project. hbase-trx does not use the timestamp attribute of HBase fields; transaction ids are randomly generated integers. Consequently, hbase-trx is unable to offer snapshot isolation, as there is no fixed order in which transactions are written to the database.

CrSO Architecture

The core architecture of the software is described in more detail here: Technical Details. Some experimental results could also be found here: Experimental Results.

Contributors

  • Daniel Gómez Ferro (dgomezferro@yahoo.com)
  • Flavio Junqueira (fpj@apache.org)
  • Ivan B. Kelly (... at yahoo-inc.com)
  • Benjamin Reed (... at yahoo-inc.com)
  • Maysam Yabandeh (firstname at yahoo-inc.com)