Skip to content
This repository

HTTPS clone URL

Subversion checkout URL

You can clone with HTTPS or Subversion.

Download ZIP
Browse code

Fix a critical bug in the META cache.

A custom comparator is needed to maintain a local cache of META.
A META entry is essentially made of 3 comma-separated fields:
  [table],[start-key],[start-code]
The comparator needs to essentially do 3 memcmp, to first sort
by table name, then by start-key, and finally by start-code for
breaking ties.

The comparator had a bug such that if the key being looked up contained
a comma, and if the key being looked up was a prefix of a start-key,
and if the byte following the comma was less than '1', then the META
cache would incorrectly sort the key being looked up and find the wrong
region.

Here's an example:
  Key being looked up: "foo,\002"
  Region hosting that key: "table,foo,1234567890"
  META cache lookup key:   "table,foo,\002,:"
Then in this case the code was incorrectly comparing '1' with 0x02,
and because 0x02 < '1' it would incorrectly sort the search key to
be before that region.

The flaw in the logic came from the fact that the code was trying to
compare both the start keys and, if necessary, the start codes in the
same loop.  But we need to make sure that we actually do the equivalent
of 3 memcmp, so that we can check for possible prefix-only matches for
each of the 3 fields (table name, start key, start code).

This bug doesn't happen frequently due to the 3 conditions that are
required to trigger it, but when it does get triggered, its visible
consequence would be that some RPCs are failing without even being
sent because the client retries to do META lookups without finding
the right region to send the RPC to.  In particular note that the RPC
does not get sent to the wire at all.  The RPC would simply quickly
fail with a "NonRecoverableException: Too many attempts".

This closes #27.

Change-Id: I4994cb3535181397e185d5a8d8a2fa5beb16bff9
  • Loading branch information...
commit 07065fdfd044c13aa85a64c7bb4270ed52a92193 1 parent 5b033d9
Benoit Sigoure authored
40  src/RegionInfo.java
@@ -275,28 +275,32 @@ public int compare(final byte[] a, final byte[] b) {
275 275
       // If either `a' or `b' is followed immediately by another comma, then
276 276
       // they are the first region (it's the empty start key).
277 277
       i++;   // No need to check against `length', there MUST be more bytes.
278  
-      if (a_comma != b_comma) {
279  
-        if (a_comma == i) {
280  
-          return -1002;  // a < b  because a is the start key, b is not.
281  
-        } else if (b_comma == i) {
282  
-          return 1002;   // a > b  because b is the start key, a is not.
  278
+
  279
+      // Compare keys.
  280
+      final int first_comma = Math.min(a_comma, b_comma);
  281
+      for (/*nothing*/; i < first_comma; i++) {
  282
+        final byte ai = a[i];
  283
+        final byte bi = b[i];
  284
+        if (ai != bi) {  // The keys differ.
  285
+          return (ai & 0xFF) - (bi & 0xFF);  // "promote" to unsigned.
283 286
         }
284 287
       }
  288
+      if (a_comma < b_comma) {
  289
+        return -1002;  // `a' has a shorter key.  a < b
  290
+      } else if (b_comma < a_comma) {
  291
+        return 1002;  // `b' has a shorter key.  a > b
  292
+      }
285 293
 
286  
-      // Now either both are the start key (in which case we need to keep
287  
-      // comparing the "start codes" to pick up the most recent region)
288  
-      // or neither is and we simply compare the start keys.
289  
-      do {
290  
-        if (a[i] != b[i]) {
291  
-          if (i == a_comma) {
292  
-            return -1003;  // `a' has a smaller start key.  a < b
293  
-          } else if (i == b_comma) {
294  
-            return 1003;   // `b' has a smaller start key.  a > b
295  
-          }
296  
-          return (a[i] & 0xFF) - (b[i] & 0xFF);  // The start keys differ.
  294
+      // Keys have the same length and have compared identical.  Compare the
  295
+      // rest, which essentially means: use start code as a tie breaker.
  296
+      for (/*nothing*/; i < length; i++) {
  297
+        final byte ai = a[i];
  298
+        final byte bi = b[i];
  299
+        if (ai != bi) {  // The start codes differ.
  300
+          return (ai & 0xFF) - (bi & 0xFF);  // "promote" to unsigned.
297 301
         }
298  
-        i++;
299  
-      } while (i < length);
  302
+      }
  303
+
300 304
       return a.length - b.length;
301 305
     }
302 306
 
4  test/TestMETALookup.java
@@ -58,10 +58,14 @@ public void testRegionCmp() {
58 58
     assertGreater("table,a,,c,1234567890", "table,a,,b,1234567890");
59 59
     // If keys are equal, then start code should break the tie.
60 60
     assertGreater("table,foo,1234567891", "table,foo,1234567890");
  61
+    // Make sure that a start code being a prefix of another is handled.
  62
+    assertGreater("table,foo,1234567890", "table,foo,123456789");
61 63
     // If both are start keys, then start code should break the tie.
62 64
     assertGreater("table,,1234567891", "table,,1234567890");
63 65
     // The value `:' is always greater than any start code.
64 66
     assertGreater("table,foo,:", "table,foo,9999999999");
  67
+    // Issue 27: searching for key "8,\001" and region key is "8".
  68
+    assertGreater("table,8,\001,:", "table,8,1339667458224");
65 69
   }
66 70
 
67 71
   /** Ensures that {@code sa > sb} in the META cache.  */

0 notes on commit 07065fd

Please sign in to comment.
Something went wrong with that request. Please try again.