What is a token
A token is the hash value of a partition key. This hash value is produced by the
partitioner, a hash algorithm, and is used to determine which node or nodes
the partition is stored on. Scylla's partitioner uses MurmurHash3 and hashes each
partition key into a signed 64-bit integer, so tokens are in the range
-2**63 .. 2**63 - 1. Multiple partition keys may
share the same token, because a theoretically unlimited number of partition keys
has to be mapped onto a limited number of token values.
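The mapping from partition key to token can be sketched in a few lines. This is only an illustration: MurmurHash3 is not part of the Python standard library, so the sketch assumes BLAKE2b as a stand-in hash, and `token_for` is a hypothetical helper name. Only the shape of the mapping (bytes in, signed 64-bit integer out) matches the description above; the actual token values will differ from Scylla's.

```python
import hashlib

def token_for(partition_key: bytes) -> int:
    """Map a partition key to a signed 64-bit token.

    Assumption: BLAKE2b stands in for MurmurHash3 here, so the values
    are NOT the tokens Scylla would compute for the same key.
    """
    digest = hashlib.blake2b(partition_key, digest_size=8).digest()
    # Interpret the 8 bytes as a signed integer: -2**63 .. 2**63 - 1.
    return int.from_bytes(digest, byteorder="big", signed=True)

# The same key always maps to the same token.
print(token_for(b"user:42"))
print(token_for(b"user:42") == token_for(b"user:42"))
```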
How are tokens distributed among nodes
Tokens split the ring into multiple ranges. Each Cassandra node is responsible for one range if vnodes are disabled, or for multiple ranges if vnodes are enabled. Vnodes are explained shortly.
For example, assume we have a ring of size 16 whose tokens are [-8, -7, ..., 0, 1, ..., 7]. There are two nodes in the cluster: Node A has token -8 and Node B has token 0. In this case, Node A is responsible for the range [-8, 0) and Node B is responsible for the range [0, -8). The range [-8, 0) is easy to understand: it runs from -8 to -1. What about the range [0, -8)? It runs from 0 to 7, wrapping around the end of the ring. If you draw a circle and place the numbers evenly on it, the ranges are much easier to see.
So a token range is defined as starting from the token assigned to a node and walking the ring clockwise until the next token, assigned to another node, is reached.
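The lookup described above can be sketched with a sorted list and a binary search. This is a minimal sketch that follows the convention of this page (a range starts at a node's own token and runs clockwise to the next node's token); the names `node_tokens` and `owner` are made up for the illustration.

```python
from bisect import bisect_right

# Ring from the example: tokens -8 .. 7, Node A at -8, Node B at 0.
node_tokens = {-8: "A", 0: "B"}
sorted_tokens = sorted(node_tokens)

def owner(token: int) -> str:
    """Return the node whose range contains `token`.

    Convention taken from the text: a range starts at a node's token
    (inclusive) and ends at the next node's token (exclusive).
    """
    # Index of the last node token <= token. If there is none, the index
    # becomes -1, which wraps around to the highest token on the ring.
    i = bisect_right(sorted_tokens, token) - 1
    return node_tokens[sorted_tokens[i]]

for t in (-8, -1, 0, 7):
    print(t, owner(t))
```

For the two-node ring this reports A for -8 and -1, and B for 0 and 7, matching the ranges [-8, 0) and [0, -8).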
How are tokens assigned to a node
Before vnodes were introduced, the user had to assign a token to each node manually, through the
initial_token config option in cassandra.yaml.
With vnodes, the user is not required to assign tokens manually. Instead, the user specifies the number of tokens a node will be assigned through the
num_tokens config option in cassandra.yaml.
But what does
num_tokens actually mean? How does Cassandra automatically choose the tokens so that each node holds token ranges of similar total length and the workload is distributed evenly across nodes? What if a new node is more powerful than the others and we want to give it more of the workload?
It seems like magic, but the idea is actually pretty simple.
- First, the number of tokens assigned to a node is configured manually.
If you want a node to take more of the workload, assign more tokens to it. For example, by default 256 tokens are assigned, so you can assign 512 to a more powerful node.
- Second, each token is selected randomly.
This means there is no central place that decides which node gets which tokens. But how can tokens be chosen randomly while each node still ends up with token ranges of similar total length? Let's say
num_tokens equals 2 and the ring size is 2^64, that is, [-2^63, 2^63).
Node A: t1, t2
Node B: t3, t4
So these 4 tokens (points) split the ring (circle) into 4 ranges, and each node has 2 ranges in total. Since t1 to t4 are chosen randomly, there is a good chance that Node A and Node B end up responsible for ranges of very different total size.
However, if we increase
num_tokens, the probability of the ring being split evenly between the nodes increases! Don't make your
num_tokens too small.
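A small simulation makes this effect visible. The sketch below is not how Cassandra or Scylla computes ownership; it simply picks random tokens on a ring of size 2^64, uses the range convention from this page, and measures how far a node's share is from the ideal. The names `ownership` and `mean_imbalance` are hypothetical, and token collisions are ignored because they are extremely unlikely in a space this large.

```python
import random
from statistics import mean

RING_MIN, RING_SIZE = -2**63, 2**64

def ownership(num_tokens, rng):
    """Pick random tokens per node and return each node's share of the ring.

    `num_tokens` maps a node name to its number of tokens.
    """
    owner_of = {}
    for node, count in num_tokens.items():
        for _ in range(count):
            # Assumption: collisions are ignored (negligible in 2**64 values).
            owner_of[rng.randrange(RING_MIN, RING_MIN + RING_SIZE)] = node
    tokens = sorted(owner_of)
    share = {node: 0.0 for node in num_tokens}
    for i, t in enumerate(tokens):
        # A range runs from this token clockwise to the next one, wrapping.
        nxt = tokens[(i + 1) % len(tokens)]
        length = (nxt - t) % RING_SIZE or RING_SIZE  # single token owns it all
        share[owner_of[t]] += length / RING_SIZE
    return share

def mean_imbalance(tokens_per_node, trials=100, seed=1):
    """Average distance of Node A's share from 0.5 in a two-node ring."""
    rng = random.Random(seed)
    config = {"A": tokens_per_node, "B": tokens_per_node}
    return mean(abs(ownership(config, rng)["A"] - 0.5) for _ in range(trials))

for k in (2, 16, 256):
    print(k, round(mean_imbalance(k), 4))

# Giving one node twice as many tokens gives it roughly twice the share.
print(ownership({"A": 256, "B": 512}, random.Random(1)))
```

The printed imbalance should shrink as the number of tokens per node grows, and in the weighted case Node B should end up with roughly two thirds of the ring.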
We have one more issue to address. Each node chooses its tokens randomly and independently, so what if two nodes happen to choose the same token? This is avoided because a node joining the ring first learns the tokens of the other nodes. When it generates its own tokens randomly, it skips any token already used by another node.
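The skip-on-collision idea can be sketched as follows. This is an illustration under assumptions, not the actual bootstrap code: `choose_tokens` is a hypothetical helper, and `used_tokens` stands for whatever the joining node has learned about the tokens of the other nodes.

```python
import random

def choose_tokens(num_tokens, used_tokens,
                  ring_min=-2**63, ring_max=2**63 - 1, rng=None):
    """Pick `num_tokens` random tokens that no other node already uses."""
    rng = rng or random.Random()
    taken = set(used_tokens)
    free = (ring_max - ring_min + 1) - len(taken)
    if num_tokens > free:
        raise ValueError("not enough unused tokens left on the ring")
    chosen = []
    while len(chosen) < num_tokens:
        candidate = rng.randint(ring_min, ring_max)
        if candidate in taken:
            continue  # used by another node, or already picked above: retry
        taken.add(candidate)
        chosen.append(candidate)
    return chosen

# Tiny ring of size 16 where only tokens -5 and 3 are still unused.
used = set(range(-8, 8)) - {-5, 3}
print(sorted(choose_tokens(2, used, ring_min=-8, ring_max=7)))
```

On the tiny ring the joining node can only end up with -5 and 3, no matter how many candidates it has to skip. On the real ring of 2^64 values a retry is almost never needed.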
How token information is shared in a cluster
Nodes in a cluster learn the mapping between node tokens and node IP addresses through Gossip, using the application state application_state::TOKENS. The local node injects its own token information into Gossip and learns the token information of the other nodes from it.