
Topology awareness capability


Zones

A lot of NoSQL systems have network-topology-specific replication. For example, Cassandra has pluggable modules for rack-aware and datacenter-aware routing. We intend to achieve something similar, but instead we want to abstract out the network by forming something we call zones. A zone is a group of nodes very close to each other in the network; for example, it could be all the nodes in a rack or even all the nodes in a datacenter.

Now, due to the possibly high latency between these zones, we want the routing strategy in the client to optimize for the closest zones. In order to help the routing strategy make these decisions we need to add more information to cluster.xml.
i) First and foremost we need to define a zone. A zone (defined with a <zone> tag) has two parameters: a unique id (<zone-id>) and a proximity list (<proximity-list>), i.e. a list of other zone ids in increasing order of distance.

For example, in the snippet below, zone 1 is closer to zone 0 than zone 2 is.


<zone>
   <zone-id>0</zone-id>
   <proximity-list>1,2</proximity-list>
</zone>

ii) We also need to assign every server to one zone. This is achieved by adding a new <zone-id> tag inside the existing <server> tag. If no zone is mentioned, the server is assigned to the default zone id 0.


<server>
  <id>0</id>
  <host>zoo.rack0.node1</host>
  <zone-id>0</zone-id>
</server>

Now, since these zones may be geographically distributed, we can expect network partitions to be fairly common. Hence we would like every store to have some amount of redundancy in every zone. This would allow the system to keep serving (i.e. stay available) even in the case of a network partition. We can set this zone-specific replication factor in the existing stores.xml file by using the <zone-replication-factor> tag shown below. In the example below we set the replication factor for zone 0 to 2, and that of zone 1 and zone 2 to 1; note that the per-zone factors (2 + 1 + 1) add up to the global replication factor of 4.

We also require a change in the store semantics to be able to select the number of zones we want to wait for before returning success. For reads we use the <zone-count-reads> tag, with a high value improving the reliability of the data read at some cost to read availability. For writes we use the <zone-count-writes> tag, with a high value indicating a decrease in the availability of writes. A missing value for <zone-count-reads> / <zone-count-writes> defaults to 0, which means we don't care about zones and fall back to the zone-unaware strategy.


<stores>
    <store>
       <name>test</name>
       <!-- global replication factor -->
       <replication-factor>4</replication-factor>
       <!-- zone specific replication factors -->
       <zone-replication-factor>
           <replication-factor zone-id="0">2</replication-factor>
           <replication-factor zone-id="1">1</replication-factor>
           <replication-factor zone-id="2">1</replication-factor>
       </zone-replication-factor>
       <zone-count-reads>2</zone-count-reads>
       <zone-count-writes>2</zone-count-writes>
    </store>
</stores>

Now let us see how we can use the concept of zones to get rack-aware and datacenter-aware routing. The best way to explain this is with an example.
Imagine a scenario with two datacenters, zoo and jungle. zoo has one rack (zoo.rack0) while jungle has two (jungle.rack0, jungle.rack1). The racks contain the following nodes: zoo.rack0 has zoo.rack0.node0 and zoo.rack0.node1, jungle.rack0 has jungle.rack0.node0, and jungle.rack1 has jungle.rack1.node0.

i) Rack aware: If we intend our routing strategy to do replication at the rack level, each rack maps to its own zone (zoo.rack0 becomes zone 0, jungle.rack0 becomes zone 1, and jungle.rack1 becomes zone 2), and the cluster.xml file would be as follows:


<cluster>
  <name>mycluster</name>
  <zone>
    <zone-id>0</zone-id>
    <!-- Indicating the closest zone in the start -->
    <proximity-list>1,2</proximity-list>
  </zone>
  <zone>
    <zone-id>1</zone-id>
    <proximity-list>0,2</proximity-list>
  </zone>
  <zone>
    <zone-id>2</zone-id>
    <proximity-list>1,0</proximity-list>
  </zone>
  <server>
    <id>0</id>
    <host>zoo.rack0.node0</host>
    <!-- zone-id should be a value from among the ids mentioned above -->
    <zone-id>0</zone-id>
    <partitions>0,1,2</partitions>
  </server>
  <server>
    <id>1</id>
    <host>zoo.rack0.node1</host>
    <zone-id>0</zone-id>
    <partitions>3,4</partitions>
  </server>
  <server>
    <id>2</id>
    <host>jungle.rack0.node0</host>
    <zone-id>1</zone-id>
    <partitions>5</partitions>
  </server>
  <server>
    <id>3</id>
    <host>jungle.rack1.node0</host>
    <zone-id>2</zone-id>
    <partitions>6,7,8</partitions>
  </server>
</cluster>

The stores.xml would be as follows:


<stores>
    <store>
       <name>test</name>
       <replication-factor>4</replication-factor>
       <zone-replication-factor>
           <replication-factor zone-id="0">2</replication-factor>
           <replication-factor zone-id="1">1</replication-factor>
           <replication-factor zone-id="2">1</replication-factor>
       </zone-replication-factor>
       <zone-count-reads>2</zone-count-reads>
       <zone-count-writes>2</zone-count-writes>
    </store>
</stores>

ii) Datacenter aware: To make the routing datacenter aware, each datacenter instead maps to a single zone (zoo becomes zone 0 and jungle becomes zone 1), so we simply change the zone definitions and the servers' zone-ids as follows:


<cluster>
  <name>mycluster</name>
  <zone>
    <zone-id>0</zone-id>
    <proximity-list>1</proximity-list>
  </zone>
  <zone>
    <zone-id>1</zone-id>
    <proximity-list>0</proximity-list>
  </zone>
  <server>
    <id>0</id>
    <host>zoo.rack0.node0</host>
    <zone-id>0</zone-id>
    <partitions>0,1,2</partitions>
  </server>
  <server>
    <id>1</id>
    <host>zoo.rack0.node1</host>
    <zone-id>0</zone-id>
    <partitions>3,4</partitions>
  </server>
  <server>
    <id>2</id>
    <host>jungle.rack0.node0</host>
    <zone-id>1</zone-id>
    <partitions>5</partitions>
  </server>
  <server>
    <id>3</id>
    <host>jungle.rack1.node0</host>
    <zone-id>1</zone-id>
    <partitions>6,7,8</partitions>
  </server>
</cluster>

Other Changes

a) Addition of a Zone class containing proximity list information
b) Additional zone semantics added to the Node class
c) The StoreClient should get client zone information, i.e. the zone to which the client is closest.
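
As a rough sketch of item (a), the Zone class could be little more than an id plus its proximity list. The field and accessor names below are illustrative assumptions, not a committed API:

import java.io.Serializable;
import java.util.Collections;
import java.util.List;

// Illustrative sketch only: names are assumptions, not the final Voldemort API.
public class Zone implements Serializable {

    public static final int DEFAULT_ZONE_ID = 0;

    private final int zoneId;
    private final List<Integer> proximityList; // other zone ids, closest first

    public Zone(int zoneId, List<Integer> proximityList) {
        this.zoneId = zoneId;
        this.proximityList = proximityList;
    }

    public int getId() {
        return zoneId;
    }

    public List<Integer> getProximityList() {
        return Collections.unmodifiableList(proximityList);
    }
}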

Routing Logic

Preference lists should be constructed by hopping around the ring and picking up nodes from each zone such that we satisfy the per-zone replication-factor criteria (constraint: a zone's replication factor must be <= the number of nodes in that zone). For example, if we use the rack-aware configuration file from above:
- hash(key) = partition 4, our preference list would contain [ 4,5,6,0 ] ( order may vary depending on read / write )
- hash(key) = partition 7, our preference list would contain [ 7,0,3,5 ]
- hash(key) = partition 0, our preference list would contain [ 0,3,5,6 ]

As another example, take N = 3 (i.e. a replication-factor of 3) with the following stores.xml:


<stores>
    <store>
       <name>test</name>
       <replication-factor>3</replication-factor>
       <zone-replication-factor>
           <replication-factor zone-id="0">1</replication-factor>
           <replication-factor zone-id="1">1</replication-factor>
           <replication-factor zone-id="2">1</replication-factor>
       </zone-replication-factor>
    </store>
</stores>

Then the preference lists would be as follows:
- hash(key) = partition 4, our preference list would contain [ 4,5,6 ]
- hash(key) = partition 7, our preference list would contain [ 7,0,5 ] ( compared to [ 7,0,3 ] from zone unaware hashing )
- hash(key) = partition 0, our preference list would contain [ 0,5,6 ] ( compared to [ 0,3,5 ] )
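
As a hedged sketch of the ring walk described at the start of this section (the maps and method name are illustrative assumptions, not the actual RoutingStrategy code), a zone-aware preference list can be built by accepting a partition only if its zone still needs replicas and its node has not been picked yet:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ZoneAwareRingWalk {

    // Sketch of the zone-aware ring walk; all inputs are illustrative assumptions.
    public static List<Integer> preferenceList(int masterPartition,
                                               List<Integer> ring,                       // partition ids in ring order
                                               Map<Integer, Integer> partitionToNode,    // partition id -> node id
                                               Map<Integer, Integer> nodeToZone,         // node id -> zone id
                                               Map<Integer, Integer> zoneReplication) {  // zone id -> replicas wanted
        Map<Integer, Integer> remaining = new HashMap<Integer, Integer>(zoneReplication);
        Set<Integer> usedNodes = new HashSet<Integer>();
        List<Integer> preferenceList = new ArrayList<Integer>();

        int start = ring.indexOf(masterPartition);
        for (int i = 0; i < ring.size() && !remaining.isEmpty(); i++) {
            int partition = ring.get((start + i) % ring.size());
            int node = partitionToNode.get(partition);
            int zone = nodeToZone.get(node);
            Integer needed = remaining.get(zone);
            // accept only if this zone still needs replicas and the node is new
            if (needed != null && needed > 0 && usedNodes.add(node)) {
                preferenceList.add(partition);
                if (needed == 1)
                    remaining.remove(zone);
                else
                    remaining.put(zone, needed - 1);
            }
        }
        return preferenceList;
    }
}

With the rack-aware cluster above, a master partition of 7 and the 2/1/1 zone replication factors, this walk accepts 7 (zone 2), skips 8 (same node), accepts 0 and 3 (zone 0) while skipping the other partitions of those nodes, and accepts 5 (zone 1), giving [ 7,0,3,5 ] as listed above.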

Writes

The partitions in the preference list should be sorted in the order of the proximity list of the client zone. The write currently does multiple serial puts till it finds a master node. By having the preference list sorted by proximity we hope to find our master node within the client zone, or at least in a closer zone.
Once the master is selected we pick the remaining nodes and do parallel writes. We now block till we satisfy W and have successful responses from zone-count-writes zones. In other words, latency = max (time to get W successful responses, time to get at least one response from each of zone-count-writes zones).

Some examples, for the rack-aware configuration given above:
- hash(key) = partition 4, client zone = 0, preference list = ( 0,4,5,6 ) ( as per proximity list – zone0(0,4), zone1(5), zone2(6) )
- hash(key) = partition 4, client zone = 1, preference list = ( 5,0,4,6 ) ( as per proximity list – zone1(5), zone0(0,4), zone2(6) )
- hash(key) = partition 4, client zone = 2, preference list = ( 6,5,0,4 ) ( as per proximity list – zone2(6), zone1(5), zone0(0,4) )
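
A minimal sketch of the write-completion check implied above (not the actual pipeline code, and assuming we track the zone id of every successful response):

import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ZoneAwareWriteCheck {

    // Returns true once we have W successful responses AND responses
    // from at least zone-count-writes distinct zones.
    public static boolean satisfied(List<Integer> successfulZoneIds, // zone id of each successful put
                                    int requiredWrites,              // W
                                    int zoneCountWrites) {
        Set<Integer> distinctZones = new HashSet<Integer>(successfulZoneIds);
        return successfulZoneIds.size() >= requiredWrites
               && distinctZones.size() >= zoneCountWrites;
    }
}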

Reads / Deletes

The current implementation of get first does attempts = min ( N, preferred-reads ) parallel requests. We would now want our parallel requests to include nodes from zone-count-reads zones. Two things need to change. First, attempts = max ( preferred-reads, zone-count-reads ). Second, we reorder the preference list so as to satisfy the zone-count-reads requirement: we first place nodes from zone-count-reads zones and then place the rest of the nodes sorted by the proximity list of the client zone. The following examples (using the same rack-aware configuration as above) explain the scheme we plan to adopt:

- hash(key) = partition 7, client zone = 0, zone-count-reads = 0, preference list = ( (0,3),(5),(7) )
- hash(key) = partition 7, client zone = 0, zone-count-reads = 1, preference list = ( (5),(0,3),(7) )
- hash(key) = partition 7, client zone = 0, zone-count-reads = 2, preference list = ( (5),(7),(0,3) )
- hash(key) = partition 7, client zone = 1, zone-count-reads = 0, preference list = ( (5),(0,3),(7) )
- hash(key) = partition 7, client zone = 1, zone-count-reads = 1, preference list = ( (0),(5),(3),(7) )
- hash(key) = partition 7, client zone = 1, zone-count-reads = 2, preference list = ( (0),(7),(5),(3) )
- hash(key) = partition 7, client zone = 2, zone-count-reads = 0, preference list = ( (7),(5),(0,3) )
- hash(key) = partition 7, client zone = 2, zone-count-reads = 1, preference list = ( (5),(7),(0,3) )
- hash(key) = partition 7, client zone = 2, zone-count-reads = 2, preference list = ( (5),(0),(7),(3) )

Once again we block till we satisfy R and have successful responses from zone-count-reads zones, i.e. latency = max (time to get R successful responses, time to get at least one response from each of zone-count-reads zones).
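
The reordering in the examples above can be reproduced by the following hedged sketch, which assumes the up-front slots are filled from the first zone-count-reads entries of the client zone's proximity list; all names are illustrative:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ZoneAwareReadOrder {

    // Reorders the entries of a preference list, grouped by zone, for a read.
    // nodesByZone: zone id -> that zone's entries from the preference list
    // (shown as partition ids in the examples above), in ring order.
    public static List<Integer> reorder(Map<Integer, List<Integer>> nodesByZone,
                                        int clientZoneId,
                                        List<Integer> proximityList, // client zone's proximity list
                                        int zoneCountReads) {
        Map<Integer, List<Integer>> pending = new HashMap<Integer, List<Integer>>();
        for (Map.Entry<Integer, List<Integer>> entry : nodesByZone.entrySet())
            pending.put(entry.getKey(), new ArrayList<Integer>(entry.getValue()));

        List<Integer> ordered = new ArrayList<Integer>();

        // 1. one entry from each of the zone-count-reads closest zones in the proximity list
        for (int i = 0; i < zoneCountReads && i < proximityList.size(); i++) {
            List<Integer> nodes = pending.get(proximityList.get(i));
            if (nodes != null && !nodes.isEmpty())
                ordered.add(nodes.remove(0));
        }

        // 2. everything else, grouped by zone: client zone first, then the proximity list order
        List<Integer> zoneOrder = new ArrayList<Integer>();
        zoneOrder.add(clientZoneId);
        zoneOrder.addAll(proximityList);
        for (int zoneId : zoneOrder) {
            List<Integer> nodes = pending.get(zoneId);
            if (nodes != null)
                ordered.addAll(nodes);
        }
        return ordered;
    }
}

For instance, with partition 7, client zone 1 and zone-count-reads = 2, nodesByZone would be { 0 : [0,3], 1 : [5], 2 : [7] } and the sketch returns [ 0,7,5,3 ], matching the corresponding list above.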