Commit
[#6476] Set local_limit and read restart time in a better way to reduce the number of read restarts

Summary:

Background
==========

"Read restart" is a mechanism that we use to make sure that a snapshot isolation read request returns all the records that the application "knows" have been written to the database, e.g. records that were written or read earlier by the same client, or by a different client that notified the client sending the current read request.

It works as follows. For each read request, we select an MVCC timestamp to read at (also known as read hybrid time, read_ht, or read_time), and we try to return all records with commit timestamp <= read_time. However, sometimes we see records with commit timestamp slightly higher than read_time, and if we cannot rule out the possibility that such a record was written prior to the beginning of our snapshot isolation transaction, we have no choice but to restart the entire transaction with a read time that is >= this commit timestamp. Read restarts are detrimental to performance and we want to minimize them.

Each read request from the YQL engine to the DocDB layer of the tablet server is parameterized with multiple components in addition to read_time, including the following:

- global_limit is the upper bound on physical time (and, therefore, on hybrid time) in the cluster prior to the beginning of the transaction, computed as current time on the YQL node that received the read request + max clock skew.
- local_limit is a value maintained for each tablet, separately for each transaction. It starts off as global_limit for the first request to a tablet as part of the transaction, but then, for the second and later requests to that tablet, it is set to the safe time on that tablet returned to the YQL engine in the response to the first request. Once it is set to a tablet's safe time, it is not supposed to change until the end of the transaction.
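The way global_limit and local_limit evolve over the life of a transaction can be sketched as follows. This is only an illustration of the description above, not the YugabyteDB implementation: the `ReadLimits` class, its method names, and the use of a plain integer for hybrid time are all hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a hybrid timestamp; not the real YugabyteDB type.
using HybridTime = uint64_t;

// Hypothetical per-transaction bookkeeping kept by the YQL engine.
class ReadLimits {
 public:
  // global_limit = current time on the YQL node + max clock skew.
  ReadLimits(HybridTime now, HybridTime max_clock_skew)
      : global_limit_(now + max_clock_skew) {
    // Guard against wrap-around in this toy integer representation.
    assert(global_limit_ >= now);
  }

  HybridTime global_limit() const { return global_limit_; }

  // local_limit to send with the next request to the given tablet.
  // Before the first response from that tablet it equals global_limit.
  HybridTime LocalLimitFor(const std::string& tablet_id) const {
    auto it = local_limits_.find(tablet_id);
    return it == local_limits_.end() ? global_limit_ : it->second;
  }

  // Called with the safe time carried by a tablet's response.
  // emplace() leaves an existing entry untouched, so only the first
  // response fixes local_limit for the rest of the transaction.
  void OnTabletResponse(const std::string& tablet_id, HybridTime safe_time) {
    local_limits_.emplace(tablet_id, safe_time);
  }

 private:
  HybridTime global_limit_;
  std::unordered_map<std::string, HybridTime> local_limits_;
};
```

In this sketch, a second call to `OnTabletResponse` for the same tablet is a no-op, which mirrors the statement that local_limit is not supposed to change once it has been set to the tablet's safe time.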
Both local_limit and global_limit help us "prove" that a particular record we see in RocksDB could not have been committed prior to the beginning of our transaction. For the intents read path, the logic for deciding when to trigger read restarts is implemented in IntentAwareIterator::ProcessIntent. For regular RocksDB records, the corresponding logic is in the IntentAwareIterator::SkipFutureRecords function, which skips all records for which it can "prove" that they could not have been committed prior to the beginning of our transaction and, independently, are not visible as of read_time. This function operates on encoded hybrid times and comparisons are inverted, which is unintuitive. Also, it looks at intent hybrid times stored in values in regular RocksDB, which are lower than the commit hybrid times that are part of the key.

Also note that in any case, records committed after the beginning of the transaction, and therefore not causing a read restart, could still have commit time <= read_time due to clock skew and therefore have to be included in the read result.

Changes in this diff
====================

Prior to this diff, the value of local_limit that the YQL engine sends to a tablet server would actually be set to max(read_time, local_limit_for_tablet), where local_limit_for_tablet is the current value of the local limit for this tablet maintained by the YQL engine for the transaction. This was needed due to how the logic in IntentAwareIterator::SkipFutureRecords was set up, in order to capture all records with commit time <= read_time, which are required for consistency.

However, due to the artificially increased effective value of local_limit, this approach was causing us to miss some opportunities to avoid read restarts. E.g. when we are trying to read a key that is constantly being overwritten, the read operation would keep getting read restarts, but local_limit would always be set to the new read_time, and it is likely that new intents would have been written with intent hybrid time lower than that read_time, not allowing us to ignore them. This would result in many unnecessary read restarts.

With this diff, we are no longer updating local_limit to be greater than or equal to read_time in `ConsistentReadPoint::GetReadTime`, so we need to take extra care to ensure we include all records with commit time <= read_time elsewhere in the read path. The new version of the logic can be summarized as follows:

- If intent_ht > local_limit, meaning the intent was certainly written after the read operation started: include the record if commit_ht <= read_time. No read restart is possible.
- If intent_ht <= local_limit, meaning the intent could have been written before the read operation started: include the record if commit_ht <= global_limit. Read restart is still possible if commit_ht > read_time.

Another important change in this diff is how we now set read restart time:

min(max(restart_time, safe_ht_to_read), global_limit)

Here, restart_time is the time that the tablet returned in response to the YQL engine's request, and safe_ht_to_read is the safe time on that tablet. By restarting at safe_ht_to_read we would avoid doing more than one read restart per transaction per tablet. However, we still need to cap the read restart time at global_limit for slow transactions, because global_limit is the latest MVCC timestamp we ever have to read at to avoid stale results.

Backward compatibility
======================

The old local_limit_ht protobuf field is renamed to DEPRECATED_max_of_read_time_and_local_limit_ht, and is set to what its name implies, keeping the old logic. The new local_limit_ht field is set to the real value of local limit, which could be smaller than read_time.
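To make the two-case rule, the restart-time formula, and the two field values concrete, here is a minimal sketch. It is one reading of the summary above, not the code in IntentAwareIterator: `ReadParams`, `RecordAction`, `ClassifyRecord`, `ReadRestartTime` and `DeprecatedLocalLimit` are hypothetical names, hybrid times are plain integers, and the sketch assumes that an included record with commit_ht > read_time is what triggers a restart.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a hybrid timestamp; not the real YugabyteDB type.
using HybridTime = uint64_t;

// Hypothetical bundle of the per-request parameters named in the text.
struct ReadParams {
  HybridTime read_time;
  HybridTime local_limit;
  HybridTime global_limit;
};

enum class RecordAction { kVisible, kSkip, kRestartRequired };

// Applies the two-case rule to one record, given its intent and commit times.
RecordAction ClassifyRecord(HybridTime intent_ht, HybridTime commit_ht,
                            const ReadParams& p) {
  // Restart time is capped at global_limit, so read_time never exceeds it.
  assert(p.read_time <= p.global_limit);
  // Records committed at or before read_time are included in both cases.
  if (commit_ht <= p.read_time) return RecordAction::kVisible;
  // Intent certainly written after the read started: safe to ignore.
  if (intent_ht > p.local_limit) return RecordAction::kSkip;
  // Intent may predate the read: a commit within global_limit forces a restart.
  return commit_ht <= p.global_limit ? RecordAction::kRestartRequired
                                     : RecordAction::kSkip;
}

// min(max(restart_time, safe_ht_to_read), global_limit)
HybridTime ReadRestartTime(HybridTime restart_time, HybridTime safe_ht_to_read,
                           HybridTime global_limit) {
  return std::min(std::max(restart_time, safe_ht_to_read), global_limit);
}

// Value carried by the deprecated field: the old, inflated local limit.
HybridTime DeprecatedLocalLimit(HybridTime read_time, HybridTime local_limit) {
  return std::max(read_time, local_limit);
}
```

Under these assumptions, a transaction that has restarted at read_time 150 with a real local_limit of 120 can skip a record with intent_ht 140 and commit_ht 160, whereas feeding the same function the old value max(150, 120) = 150 as local_limit classifies that record as requiring another restart. That is the kind of unnecessary restart the commit message describes.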
The deprecated field can be removed after all YugabyteDB clusters are upgraded to the new version.

Related work
============

Please also see https://phabricator.dev.yugabyte.com/D8510 (26260e0), which introduced cleanup of intent hybrid times from records in regular RocksDB that were created by applying transactions committed before the history cutoff time.

Test Plan: ybd release --gtest_filter PgMiniTest.ReadRestartSnapshot -n 8

Reviewers: mbautin

Reviewed By: mbautin

Subscribers: kannan, ybase, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D10047