# Unique ID generation

*Leetcode reference*: [link](https://leetcode.com/discuss/interview-question/system-design/3092436/Design-Unique-ID-Generator)

Unique IDs are crucial for tracking, managing resources, and ensuring that every entry or operation can be uniquely identified across the system.

Design a distributed system to provide a unique ID with each request.

``` python
def generator(k):
    """
    Args:
      k (int): length of the ID to generate
    Returns:
      str: a random string of k characters matching ^[0-9a-zA-Z]{k}$
    """
    pass
```

Example 1:
```
Input: k = 5
Output: a1B2c
Explanation: ...
```

<style>
/* CSS to change font size of code blocks */
pre {
    font-size: 12px;
}
code {
    font-size: small;
}
</style> 

## Personal Notes

My first question would be: How many unique IDs do we estimate we need to generate? This will help us determine the necessary number of characters in each ID.

For example, if we know we won't generate more than 100 IDs, we can use two-digit IDs, such as: 11, 32, 94, ...
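
To make that estimate concrete, here is a minimal sketch that computes the shortest ID length covering a given number of IDs. The helper name `min_id_length` is hypothetical, and the default alphabet of 62 symbols assumes digits plus lowercase and uppercase letters.

```python
ALPHABET_SIZE = 10 + 26 + 26  # digits + lowercase + uppercase letters


def min_id_length(num_ids, alphabet_size=ALPHABET_SIZE):
    """Return the smallest k such that alphabet_size ** k >= num_ids."""
    k = 1
    # Integer arithmetic avoids floating-point rounding issues with logarithms.
    while alphabet_size ** k < num_ids:
        k += 1
    return k


print(min_id_length(100, alphabet_size=10))  # two digits cover 100 IDs (00 to 99)
print(min_id_length(10**9))                  # length needed for a billion alphanumeric IDs
```

This only guarantees that enough distinct IDs exist. Randomly generated IDs need a much larger space to keep collisions unlikely.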

The following code is a simple implementation that generates a 5-character ID made up of digits, lowercase letters, and uppercase letters.

In [11]:
import string
import random

digits = string.digits
lower_case_chars = string.ascii_lowercase
upper_case_chars = string.ascii_uppercase

def generator(k=5):
    """Build a k-character ID by picking a character class, then a character from it."""
    chars = []
    for _ in range(k):
        # Choose one of the three classes with equal probability,
        # then choose one character uniformly from that class.
        char_class = random.choice((digits, lower_case_chars, upper_case_chars))
        chars.append(random.choice(char_class))
    return ''.join(chars)

print(generator())
print(generator())

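# Each character class is chosen with probability 1/3, so characters are not uniformly
# distributed: a given digit has probability 1/30, a given letter 1/78.
# The probability that two independently generated IDs are equal is therefore
# (10/30^2 + 52/78^2)^k = (23/1170)^k, slightly higher than the (1/62)^k of a uniform draw.
# Time complexity: O(k)
# Space complexity: O(k) for the returned string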

G5n1u
MGyQQ


A more efficient implementation, which also draws every character uniformly from all 62 symbols:

In [12]:
import random
import string

def generator(k=5):
    all_chars = string.digits + string.ascii_lowercase + string.ascii_uppercase
    return ''.join(random.choices(all_chars, k=k))

print(generator(5))
print(generator(10))

kvlrr
omsmMIV5wU
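
Random IDs can repeat, so it helps to estimate how likely a duplicate is once many IDs have been issued. The sketch below uses the standard birthday approximation and assumes every ID is drawn uniformly and independently, as with `random.choices` above. The function name `collision_probability` is hypothetical.

```python
import math


def collision_probability(n, k, alphabet_size=62):
    """Approximate probability that n uniformly random k-character IDs contain a duplicate."""
    space = alphabet_size ** k  # number of distinct IDs of length k
    # Birthday approximation: p ~ 1 - exp(-n(n-1) / (2 * space))
    return 1.0 - math.exp(-n * (n - 1) / (2 * space))


for n in (1_000, 100_000):
    print(n, collision_probability(n, k=5))
```

With `k = 5` a duplicate becomes very likely well before the ID space is exhausted, which is why purely random short IDs need either a longer length or an explicit uniqueness check.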


## Exploring the Distributed Systems Aspect in Detail

*Generated with ChatGPT's assistance*

### System Requirements

- **Uniqueness**: No two IDs should be the same, even if requests are processed concurrently or by different nodes in the distributed system.
- **Scalability**: The system should be able to handle a large number of requests and scale horizontally.
- **Availability**: It should provide IDs even in the event of partial system failures or network partitions.
- **Performance**: The system should be capable of generating IDs quickly, with minimal latency, to meet real-time or near-real-time processing requirements.
- **Consistency**: In the event of failures or network partitions, the system should ensure that IDs remain unique and that the system can recover gracefully.

### Design Considerations

- **ID Format**: Define the format and structure of the IDs (e.g., numeric, alphanumeric, or a combination). <br>Ensure that the format is suitable for the system's needs and supports the expected volume of IDs.
- **Coordination Mechanisms**: Implement mechanisms to coordinate between distributed nodes to avoid conflicts and ensure uniqueness. This could involve centralized coordination or decentralized approaches, depending on the system's architecture.
- **Redundancy and Fault Tolerance**: Design the system to handle node failures gracefully, ensuring that ID generation continues without disruption. Implement redundancy and fault-tolerant mechanisms to maintain high availability.
- **Performance Optimization**: Optimize the system for performance, considering factors such as load balancing, caching, and efficient network communication to minimize latency in ID generation.
- **Scalability Strategy**: Develop a strategy to scale the system both vertically (adding resources to individual nodes) and horizontally (adding more nodes) to handle increasing demand.
- **Security**: Consider potential security implications, such as ensuring that IDs cannot be easily predicted or forged, and implement appropriate measures to protect the integrity of the ID generation process.
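
On the security point: Python's `random` module is not designed for security purposes, so its output can be predicted. A minimal sketch of a less predictable variant, assuming the same 62-character alphabet, uses the standard-library `secrets` module instead. The name `secure_generator` is hypothetical.

```python
import secrets
import string

ALPHABET = string.digits + string.ascii_letters  # 0-9, a-z, A-Z


def secure_generator(k=5):
    """Return k characters drawn from the operating system's secure random source."""
    return ''.join(secrets.choice(ALPHABET) for _ in range(k))


print(secure_generator(16))
```

This makes IDs hard to guess but does not by itself make them unique; collisions are still possible for short lengths.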

### Example Solution Approaches

- **Centralized ID Generation Service**: A single service responsible for generating unique IDs, with other components interacting with this service to obtain IDs. This approach simplifies uniqueness management but can become a bottleneck.
- **Distributed ID Generation with Time-based or Sequential Strategies**: Nodes generate unique IDs based on a combination of timestamps and sequence numbers, with mechanisms to avoid collisions and ensure uniqueness.
- **UUIDs (Universally Unique Identifiers)**: Use UUIDs, which are designed to be globally unique. UUID generation can be distributed without requiring central coordination, making it a suitable option for many distributed systems.
- **Combination Approaches**: Combine different strategies, such as using a central authority for ID prefixing and distributed nodes for suffix generation, to balance coordination and scalability.
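
As an illustration of the UUID approach, the sketch below generates random (version 4) UUIDs with Python's standard `uuid` module. Each node can run it independently, with no coordination. The wrapper name `new_uuid` is hypothetical.

```python
import uuid


def new_uuid():
    """Return a random version-4 UUID as a 36-character string."""
    return str(uuid.uuid4())


print(new_uuid())
print(new_uuid())
```

The trade-off is size and ordering: a UUID is 128 bits and version 4 values are not sortable by creation time, unlike the timestamp-based scheme.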

### Distributed Unique ID Generator Implementation

*Generated with ChatGPT's assistance*

This solution uses a combination of `timestamp`, `machine_id`, and `sequence_number`. <br>
This approach is inspired by the <span style="color:yellow">Snowflake ID</span> generator by Twitter, which ensures unique IDs across distributed systems.

#### Role of Epoch

An `epoch` is a specific point in time used as a reference for measuring time intervals. <br>
In the context of a unique ID generator, the epoch is the starting point from which timestamps are counted. <br>
Counting from a recent, fixed epoch keeps the timestamp small enough to fit in its bit field and keeps the generated IDs time-ordered. <br>

By subtracting the epoch from the current time, you get a relative timestamp that can be used in generating unique IDs.
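
A small sketch of that subtraction, using the default `epoch_start` value that appears in the generator class in this notebook (1288834974657 ms, which is 4 November 2010 UTC). The 41-bit timestamp width is an assumption taken from the usual Snowflake layout, and the helper name is hypothetical.

```python
import time

EPOCH_START_MS = 1288834974657  # custom epoch in milliseconds (4 Nov 2010, UTC)
TIMESTAMP_BITS = 41             # assumption: Snowflake-style 41-bit timestamp field


def relative_timestamp_ms(now_ms=None, epoch_ms=EPOCH_START_MS):
    """Milliseconds elapsed since the custom epoch."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return now_ms - epoch_ms


# How long a 41-bit millisecond counter lasts before it overflows.
MS_PER_YEAR = 1000 * 60 * 60 * 24 * 365.25
years_available = (2 ** TIMESTAMP_BITS) / MS_PER_YEAR

print(relative_timestamp_ms())
print(round(years_available, 1))  # roughly 69.7 years
```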


#### Threading 

Threading is a technique in Python that lets you run multiple threads (smaller units of a process) concurrently within a single process. In standard CPython builds, the global interpreter lock (GIL) allows only one thread to execute Python bytecode at a time, so threads mostly overlap waiting periods rather than run computations in parallel. <br>
*Example usage*: Threads are particularly useful for I/O-bound tasks, such as reading from or writing to a file or performing network operations, because one thread can do useful work while another waits.

For synchronization, a `lock` is used to control access to a shared resource by multiple threads, ensuring that only one thread can execute the critical section of code at a time and preventing race conditions. <br>
Without a lock, multiple threads might read and modify shared state (`sequence_number` and `last_timestamp`) concurrently, leading to inconsistencies and potentially non-unique IDs.
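
The effect of the lock can be seen in isolation with a minimal sketch: several threads draw numbers from one shared counter, and the lock guarantees that no number is handed out twice. The class `SequenceCounter` and the helper `draw_many` are hypothetical names used only for this illustration.

```python
import threading


class SequenceCounter:
    """A shared counter whose increment is protected by a lock."""

    def __init__(self):
        self.value = 0
        self.lock = threading.Lock()

    def next(self):
        with self.lock:  # only one thread at a time runs the read-modify-write
            self.value += 1
            return self.value


def draw_many(counter, n, out):
    """Draw n numbers from the counter into this thread's own list."""
    for _ in range(n):
        out.append(counter.next())


counter = SequenceCounter()
per_thread = [[] for _ in range(8)]
threads = [threading.Thread(target=draw_many, args=(counter, 1000, out)) for out in per_thread]
for t in threads:
    t.start()
for t in threads:
    t.join()

drawn = [n for out in per_thread for n in out]
print(len(drawn), len(set(drawn)))  # 8000 8000: every number was issued exactly once
```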

In [3]:
import time
import threading

class Unique_id_generator:
    def __init__(self, machine_id, epoch_start=1288834974657):
        self.machine_id = machine_id # A unique identifier for each machine or instance generating IDs, ensuring uniqueness across different machines
        self.epoch_start = epoch_start # Epoch reference time
        self.sequence_number = 0 # A sequence number that increments with each ID generated within the same millisecond, avoiding collisions
        self.last_timestamp = -1 # Timestamp of the most recent ID, used to detect same-millisecond calls and a clock that moved backwards
        self.lock = threading.Lock() # This creates a lock object. The lock is initially unlocked. 

    def _get_current_timestamp(self):
        return int(time.time() * 1000) # in milliseconds

    def _wait_for_next_millis(self, last_timestamp):
        # Busy-wait until the clock advances past last_timestamp
        timestamp = self._get_current_timestamp()
        while timestamp <= last_timestamp:
            timestamp = self._get_current_timestamp()
        return timestamp

    def generate_id(self):
        with self.lock: # Acquire the lock before entering the critical section

            timestamp = self._get_current_timestamp() # ensuring IDs are time-ordered and avoiding conflicts
            
            if timestamp < self.last_timestamp:
                raise Exception("Clock moved backwards. Refusing to generate id")
            elif timestamp == self.last_timestamp: # Handling IDs generated in the same millisecond
                # Increment the sequence number and keep only its lower 12 bits, so each ID within this millisecond is unique.
                self.sequence_number = (self.sequence_number + 1) & 0xFFF
                if self.sequence_number == 0:
                    # If the sequence number wraps around to 0, the limit of 4096 unique IDs for that millisecond has been reached, so wait for the next millisecond.
                    timestamp = self._wait_for_next_millis(self.last_timestamp)
            else:
                self.sequence_number = 0 # New millisecond: restart the sequence

            self.last_timestamp = timestamp

            # Bit shifting to construct the unique ID
            # We are going to generate the unique id by combining different components into a single integer:
            #   `timestamp - self.epoch_start`: calculates the number of milliseconds that have passed since the epoch
            #   `<<22`: Shifting by 22 bits left. Moving timestamp component to the higher-order bits of the unique ID, making room for other components
            #   `|`: Bitwise OR, combining the shifted components into a single integer. 
            unique_id = ((timestamp - self.epoch_start) << 22) | (self.machine_id << 12) | self.sequence_number
        
        # Releasing the lock automatically when exiting the block
        return unique_id
        

generator = Unique_id_generator(machine_id=1)
for _ in range(5):
    unique_id = generator.generate_id()
    print(unique_id)

1822771922056777728
1822771922060972032
1822771922065166336
1822771922069360640
1822771922073554944
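
Because the ID is built by bit shifting, it can also be split back into its parts. The sketch below reverses the layout used in `generate_id` (timestamp shifted by 22 bits, machine ID shifted by 12 bits, sequence number in the low 12 bits). The 10-bit machine ID width is inferred from those shift amounts, and `decode_id` is a hypothetical helper name.

```python
EPOCH_START_MS = 1288834974657  # same default epoch as the generator class
MACHINE_ID_BITS = 10            # assumption: bits 12..21, implied by the shifts above
SEQUENCE_BITS = 12              # low 12 bits, matching the 0xFFF mask


def decode_id(unique_id, epoch_ms=EPOCH_START_MS):
    """Split an ID into (timestamp in ms since the Unix epoch, machine_id, sequence_number)."""
    sequence = unique_id & ((1 << SEQUENCE_BITS) - 1)
    machine_id = (unique_id >> SEQUENCE_BITS) & ((1 << MACHINE_ID_BITS) - 1)
    timestamp_ms = (unique_id >> (SEQUENCE_BITS + MACHINE_ID_BITS)) + epoch_ms
    return timestamp_ms, machine_id, sequence


# Decode the first ID printed above.
print(decode_id(1822771922056777728))
```

Decoding is handy for debugging: the machine ID shows which node issued an ID, and the timestamp shows when.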


#### Parallelism vs Synchronization?

**Parallelism**: refers to the simultaneous *execution* of multiple tasks or threads. <br>
The primary goal of parallelism is to improve performance and efficiency by running multiple operations at the same time.

**Synchronization**: refers to the *coordination* of concurrent threads or processes to ensure that they operate in a controlled and predictable manner. <br>
The main goal of synchronization is to manage access to shared resources and avoid conflicts or inconsistencies.

Example Scenario: <br>
Imagine a web server handling multiple client requests (parallelism) while ensuring that all requests are handled correctly and without interfering with each other’s data (synchronization). The server may process requests in parallel, but synchronization mechanisms (like locks or atomic operations) ensure that shared resources, such as a database, are accessed safely and consistently. <br>
In essence, parallelism and synchronization work together: parallelism aims to maximize efficiency by running multiple tasks at once, while synchronization ensures that these tasks do not disrupt each other and operate correctly.
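
The web server scenario can be sketched with a thread pool: worker threads handle requests concurrently, while a lock guards the shared state. The names below (`visits`, `handle_request`, `incoming`) are hypothetical, and the dictionary merely stands in for a database.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

visits = {}                     # shared resource, standing in for a database
visits_lock = threading.Lock()


def handle_request(user):
    """Handle one request: update the shared visit count, then build a response."""
    with visits_lock:           # synchronization: one writer at a time
        visits[user] = visits.get(user, 0) + 1
    return f"hello {user}"


incoming = ["ana", "bob", "ana", "cy", "bob", "ana"] * 100

# Up to 4 worker threads process the requests concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    responses = list(pool.map(handle_request, incoming))

print(len(responses), visits["ana"], visits["bob"], visits["cy"])
```

`pool.map` returns responses in request order, and the lock keeps the per-user counts exact no matter how the workers interleave.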
