# Chapter 6: User Identity, Diversion, and Segmentation

This notebook demonstrates the key code concepts from Chapter 6:

1. **Hash-based assignment with MD5** - Standard industry practice for stable, uniform randomization
2. **MurmurHash alternative** - High-performance option, reported as roughly 3-6x faster than MD5 (the actual speedup depends on the environment)
3. **Layer-based system** - Concurrent experiment management with mutual exclusion

## Setup: Install Required Packages

The examples use the standard library's `hashlib` for MD5 hashing and the third-party `mmh3` package for MurmurHash.
`mmh3` is optional: it is needed only for the MurmurHash and performance comparison examples.

In [None]:
# Install mmh3 if needed (uncomment to run)
# !pip install mmh3

## Import Libraries

In [None]:
import hashlib
import time

# Try to import mmh3 for performance comparison
try:
    import mmh3
    HAS_MMH3 = True
except ImportError:
    HAS_MMH3 = False
    print("⚠ mmh3 not installed. Install with: pip install mmh3")
    print("MurmurHash examples will be skipped.")

---

## Section 2.3: Hash-Based Assignment with MD5

This section demonstrates the standard industry practice for experimental assignment:
using a salted hash (MD5) to achieve both uniform distribution and stability.

**Key Concept:** `hash(user_id + experiment_id)` ensures:
- Same user gets same assignment for same experiment (stability)
- Same user gets an independent assignment in each experiment, so a bucket in one experiment says nothing about the bucket in another (independence)
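
The two properties above can be checked empirically. The following is a minimal sketch, not the chapter's implementation: `salted_bucket` is a hypothetical helper, and the user IDs and experiment names are synthetic. It measures bucket sizes for uniformity and, for independence, the share of users who land in the same bucket under two different salts, which should be close to 1/100 rather than 0 or 1.

```python
import hashlib
from collections import Counter


def salted_bucket(unit_id: str, salt: str, num_buckets: int = 100) -> int:
    """Hypothetical helper: bucket from the first 32 bits of MD5(unit_id:salt)."""
    digest = hashlib.md5(f"{unit_id}:{salt}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % num_buckets


users = [f"user-{i}" for i in range(20_000)]  # synthetic IDs, for illustration only

# Stability: hashing is deterministic, so repeated calls agree.
stable = all(salted_bucket(u, "exp_a") == salted_bucket(u, "exp_a") for u in users[:100])

# Uniformity: each of the 100 buckets should hold roughly 1% of users.
counts = Counter(salted_bucket(u, "exp_a") for u in users)
smallest, largest = min(counts.values()), max(counts.values())

# Independence: the share of users in the same bucket under two different
# salts should be close to 1/100.
same_bucket_share = sum(
    salted_bucket(u, "exp_a") == salted_bucket(u, "exp_b") for u in users
) / len(users)

print(f"stable={stable}, bucket sizes {smallest}-{largest}, "
      f"same-bucket share={same_bucket_share:.4f}")
```

A small share of users will still coincide across experiments by chance; independence only means that the coincidence rate matches what random assignment would produce.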

In [None]:
def get_experiment_bucket(unit_id: str, experiment_id: str, num_buckets: int = 100) -> int:
    """
    Assigns a unit to a bucket for a given experiment in a stable and uniform way.

    Args:
        unit_id: The identifier for the unit of diversion (e.g., user_id, device_id).
        experiment_id: A unique identifier for the experiment (the "salt").
        num_buckets: The total number of buckets to distribute units into.

    Returns:
        An integer representing the assigned bucket (e.g., 0-99).
    """
    # 1. Concatenate the unit ID and the experiment ID salt.
    combined_string = f"{unit_id}:{experiment_id}"

    # 2. Hash the combined string. MD5 is used for its uniform output, not for security.
    hash_object = hashlib.md5(combined_string.encode('utf-8'))
    hex_digest = hash_object.hexdigest()

    # 3. Convert the first part of the hash to an integer.
    # We use the first 8 characters for a 32-bit integer.
    hash_integer = int(hex_digest[:8], 16)

    # 4. Apply the modulo operator to get a bucket assignment.
    bucket = hash_integer % num_buckets

    return bucket

### Test MD5 Assignment

In [None]:
# --- Example Usage ---
user_id = "user-12345"
exp_id_button_color = "exp_button_color_2025"
exp_id_recs_algo = "exp_recs_algo_2025"

# The user gets a stable assignment for each experiment.
bucket1 = get_experiment_bucket(user_id, exp_id_button_color)
print(f"User {user_id} assigned to bucket {bucket1} for button color experiment")

# For a different experiment, the same user is re-randomized.
bucket2 = get_experiment_bucket(user_id, exp_id_recs_algo)
print(f"User {user_id} assigned to bucket {bucket2} for recs algo experiment")

# Verify stability - calling again returns same bucket
bucket1_verify = get_experiment_bucket(user_id, exp_id_button_color)
assert bucket1 == bucket1_verify, "Assignment is not stable!"
print("\n✅ Stability verified: same user + same experiment = same bucket")

---

## Section 2.4: MurmurHash Implementation (High-Performance Alternative)

For high-throughput systems handling millions of requests per second, MurmurHash
provides significantly better performance than MD5 while maintaining excellent
distribution properties.
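
One detail matters when several services compute buckets from a 32-bit hash: they must agree on whether the hash value is treated as signed or unsigned. The sketch below uses plain integers, so it runs without `mmh3`; `to_unsigned32` is a hypothetical helper and the sample value is arbitrary. It shows that the same 32-bit pattern can map to different buckets under the two interpretations.

```python
def to_unsigned32(value: int) -> int:
    """Reinterpret a signed 32-bit integer as unsigned (two's complement)."""
    return value & 0xFFFFFFFF


signed_example = -123456789  # arbitrary stand-in for a signed 32-bit hash value
unsigned_example = to_unsigned32(signed_example)

# Python's % returns a non-negative result for a positive modulus, but the
# signed and unsigned readings of the same bits still give different buckets.
bucket_from_signed = signed_example % 100
bucket_from_unsigned = unsigned_example % 100

print(unsigned_example)      # 4171510507
print(bucket_from_signed)    # 11
print(bucket_from_unsigned)  # 7
```

Either convention can work, as long as every component that assigns or analyzes traffic uses the same one.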

In [None]:
def get_experiment_bucket_murmur(unit_id: str, experiment_id: str, num_buckets: int = 100) -> int:
    """
    Assigns a unit to a bucket using MurmurHash for optimal performance.
    
    This implementation is preferred for high-throughput systems where hashing
    speed is critical (e.g., millions of assignments per second).

    Args:
        unit_id: The identifier for the unit of diversion (e.g., user_id, device_id).
        experiment_id: A unique identifier for the experiment (the "salt").
        num_buckets: The total number of buckets to distribute units into.

    Returns:
        An integer representing the assigned bucket (e.g., 0-99).
    """
    if not HAS_MMH3:
        raise ImportError("mmh3 library required. Install with: pip install mmh3")
    
    # 1. Concatenate the unit ID and the experiment ID salt.
    combined_string = f"{unit_id}:{experiment_id}"
    
    # 2. Hash using MurmurHash3 (32-bit variant).
    # mmh3.hash() returns a signed integer by default; signed=False requests
    # the unsigned 32-bit value instead, so no absolute value is needed.
    hash_value = mmh3.hash(combined_string, signed=False)
    
    # 3. Apply modulo to get bucket assignment.
    bucket = hash_value % num_buckets
    
    return bucket

### Performance Comparison: MD5 vs MurmurHash

In [None]:
if HAS_MMH3:
    # --- Performance Comparison ---
    user_id = "user-12345"
    exp_id = "exp_button_color_2025"

    # Test MD5 performance
    start = time.perf_counter()
    for _ in range(100000):
        get_experiment_bucket(user_id, exp_id)
    md5_time = time.perf_counter() - start

    # Test MurmurHash performance
    start = time.perf_counter()
    for _ in range(100000):
        get_experiment_bucket_murmur(user_id, exp_id)
    murmur_time = time.perf_counter() - start

    print(f"MD5: {md5_time:.3f}s for 100k assignments")
    print(f"MurmurHash: {murmur_time:.3f}s for 100k assignments")
    print(f"MurmurHash is ~{md5_time/murmur_time:.1f}x faster")

    # Show the bucket each hash function assigns to the same user
    print(f"\nMD5 bucket: {get_experiment_bucket(user_id, exp_id)}")
    print(f"MurmurHash bucket: {get_experiment_bucket_murmur(user_id, exp_id)}")
    print("Note: The two buckets usually differ, but each function is stable and uniform on its own")
else:
    print("⚠ Skipping performance comparison - mmh3 not installed")

---

## Section 3.3: Layer-Based Concurrent Experiment Management

This section demonstrates how to implement a layering system for running multiple
mutually exclusive experiments concurrently.

**Key Concepts:**
- **Layers** partition traffic to prevent experiment interference
- Users can be in experiments on different layers simultaneously
- Users can only be in ONE experiment within the same layer
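
The concepts above can be sketched in isolation before looking at a full configuration. This is an illustrative example under stated assumptions: `layer_bucket` and `experiments_for` are hypothetical helpers, and the layer names, experiment names, and bucket ranges are made up. Two experiments share one layer with disjoint bucket ranges, and a third owns all of a second layer.

```python
import hashlib


def layer_bucket(unit_id: str, layer_id: str, num_buckets: int = 1000) -> int:
    """Hypothetical helper: stable bucket for a unit within one layer."""
    digest = hashlib.md5(f"{unit_id}:{layer_id}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % num_buckets


# Made-up allocation: disjoint ranges inside layer_x, full range in layer_y.
ranges = {
    "layer_x": {"exp_1": range(0, 200), "exp_2": range(200, 500)},
    "layer_y": {"exp_3": range(0, 1000)},
}


def experiments_for(unit_id: str) -> dict:
    """Map each layer to the experiments whose range contains the unit's bucket."""
    result = {}
    for layer_id, experiments in ranges.items():
        bucket = layer_bucket(unit_id, layer_id)
        result[layer_id] = [name for name, r in experiments.items() if bucket in r]
    return result


users = [f"user-{i}" for i in range(5_000)]  # synthetic IDs
memberships = [experiments_for(u) for u in users]

# No user should ever match more than one experiment within a layer.
max_per_layer = max(len(exps) for m in memberships for exps in m.values())
# Roughly half of users should be in an experiment on both layers at once.
in_two_layers = sum(1 for m in memberships if m["layer_x"] and m["layer_y"])

print(f"max experiments per layer: {max_per_layer}; users in both layers: {in_two_layers}")
```

Mutual exclusion within a layer follows from the ranges being disjoint, while salting the hash with the layer ID keeps a user's position in one layer independent of their position in another.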

In [None]:
# The platform's configuration, likely stored in a database.
# We start with two layers, each with 1000 buckets, and no experiments.
# The 'allocated_buckets' tracks how many buckets are in use.
LAYER_CONFIG = {
    "layer_1": {
        "total_buckets": 1000,
        "allocated_buckets": 0,
        "experiments": {}
    },
    "layer_2": {
        "total_buckets": 1000,
        "allocated_buckets": 0,
        "experiments": {}
    }
}

### Allocation Logic (Control Plane)

This function finds a layer with enough capacity and allocates traffic to a new experiment.
In production, this would be run by an experimenter via a UI or API call.
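
Converting a traffic percentage into a whole number of buckets is a step where small requests can silently vanish: 0.01% of 1000 buckets is 0.1 of a bucket, which truncates to zero. A minimal sketch of a stricter conversion follows; `buckets_for_percentage` is a hypothetical helper and its validation rules are assumptions, not part of the chapter's platform.

```python
def buckets_for_percentage(total_buckets: int, traffic_percentage: float) -> int:
    """Hypothetical helper: convert a traffic percentage into whole buckets."""
    if not 0 < traffic_percentage <= 100:
        raise ValueError(
            f"traffic_percentage must be in (0, 100], got {traffic_percentage}"
        )
    # round() guards against losing a bucket to floating-point error.
    num_buckets = round(total_buckets * traffic_percentage / 100)
    if num_buckets == 0:
        raise ValueError(
            f"{traffic_percentage}% of {total_buckets} buckets rounds to zero; "
            "use more buckets or a larger percentage"
        )
    return num_buckets


print(buckets_for_percentage(1000, 20))   # 200
print(buckets_for_percentage(1000, 2.5))  # 25
```

Failing loudly on a zero-bucket request is one possible design choice; a platform could instead raise the layer's bucket count to support finer-grained allocations.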

In [None]:
def allocate_experiment_to_layer(experiment_id: str, traffic_percentage: float):
    """
    Finds a layer with enough capacity and allocates traffic to a new experiment.
    This would be run by an experimenter via a UI or API call.
    """
    # Reject duplicate IDs: re-allocating would overwrite the stored range
    # while still consuming additional buckets.
    if any(experiment_id in layer["experiments"] for layer in LAYER_CONFIG.values()):
        print(f"FAILURE: Experiment '{experiment_id}' is already allocated.")
        return False

    for layer_id, layer_data in LAYER_CONFIG.items():
        # Size the request against this layer's own bucket count instead of
        # assuming every layer has the same total as layer_1.
        num_buckets_needed = int(layer_data["total_buckets"] * (traffic_percentage / 100.0))
        available_buckets = layer_data["total_buckets"] - layer_data["allocated_buckets"]

        if 0 < num_buckets_needed <= available_buckets:
            # Found a suitable layer. Allocate the next block of buckets.
            start_bucket = layer_data["allocated_buckets"]
            end_bucket = start_bucket + num_buckets_needed - 1

            layer_data["experiments"][experiment_id] = {"start": start_bucket, "end": end_bucket}
            layer_data["allocated_buckets"] += num_buckets_needed

            print(f"SUCCESS: Experiment '{experiment_id}' allocated to '{layer_id}' in buckets {start_bucket}-{end_bucket}.")
            return True

    print(f"FAILURE: Could not allocate '{experiment_id}'. No layer has capacity for {traffic_percentage}% of traffic.")
    return False

### Assignment Logic (Data Plane)

These functions determine which variant a user sees at runtime.
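
At runtime, finding an experiment's layer by scanning every layer costs time proportional to the number of layers. One way to make that lookup O(1) is a reverse index from experiment ID to layer ID. The sketch below assumes a configuration shaped like a per-layer experiment map; `build_experiment_index` and the sample data are hypothetical.

```python
# Illustrative configuration: each layer maps experiment IDs to bucket ranges.
layer_config = {
    "layer_1": {"experiments": {"exp_A": {"start": 0, "end": 199},
                                "exp_B": {"start": 200, "end": 499}}},
    "layer_2": {"experiments": {"exp_C": {"start": 0, "end": 999}}},
}


def build_experiment_index(config: dict) -> dict:
    """Hypothetical helper: map experiment_id -> layer_id for O(1) lookups."""
    return {
        experiment_id: layer_id
        for layer_id, layer_data in config.items()
        for experiment_id in layer_data["experiments"]
    }


experiment_to_layer = build_experiment_index(layer_config)

print(experiment_to_layer["exp_B"])            # layer_1
print(experiment_to_layer.get("exp_missing"))  # None
```

The index is derived data, so it would need to be rebuilt or updated whenever an experiment is allocated or removed.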

In [None]:
def get_layer_bucket(unit_id: str, layer_id: str, num_buckets: int = 1000) -> int:
    """Assigns a unit to a stable bucket for a given layer."""
    combined_string = f"{unit_id}:{layer_id}"
    hash_object = hashlib.md5(combined_string.encode('utf-8'))
    hex_digest = hash_object.hexdigest()
    hash_integer = int(hex_digest[:8], 16)
    return hash_integer % num_buckets

def get_variant_for_experiment(unit_id: str, experiment_id: str):
    """
    Determines the variant for a user in a specific experiment at runtime.
    Returns None if the user is not in the experiment, or if the experiment
    has not been allocated to any layer.
    """
    # Find which layer the experiment lives in (in production, this would be a fast O(1) lookup).
    layer_id = None
    for l_id, data in LAYER_CONFIG.items():
        if experiment_id in data["experiments"]:
            layer_id = l_id
            break
    if layer_id is None:
        # Unknown experiment: return None so callers never mistake an error
        # message for a variant name.
        return None

    # Get the user's stable bucket for that layer.
    user_bucket = get_layer_bucket(unit_id, layer_id)

    # Check if the user's bucket falls within this experiment's allocated range.
    exp_config = LAYER_CONFIG[layer_id]["experiments"][experiment_id]
    if exp_config["start"] <= user_bucket <= exp_config["end"]:
        # User is in the experiment. Assign to a variant (e.g., 50/50 split).
        total_exp_buckets = exp_config["end"] - exp_config["start"] + 1
        midpoint = exp_config["start"] + total_exp_buckets / 2
        if user_bucket < midpoint:
            return "control"
        else:
            return "treatment"
    
    # User's bucket is not in this experiment's range.
    return None

### Test Layer System

In [None]:
print("--- Allocating Experiments ---")
allocate_experiment_to_layer("exp_A_button_color", 20)  # Needs 200 buckets
allocate_experiment_to_layer("exp_B_headline", 30)      # Needs 300 buckets
allocate_experiment_to_layer("exp_C_recs_algo", 100)    # Needs 1000 buckets

print("\n--- Current Layer Configuration ---")
for layer_id, layer_data in LAYER_CONFIG.items():
    print(f"\n{layer_id}:")
    print(f"  Total buckets: {layer_data['total_buckets']}")
    print(f"  Allocated: {layer_data['allocated_buckets']}")
    print(f"  Available: {layer_data['total_buckets'] - layer_data['allocated_buckets']}")
    print(f"  Experiments: {list(layer_data['experiments'].keys())}")

In [None]:
print("\n--- Getting User Assignments ---")
user_id = "user-12345"

# Check what bucket this user is in for each layer
layer1_bucket = get_layer_bucket(user_id, "layer_1")
layer2_bucket = get_layer_bucket(user_id, "layer_2")
print(f"User {user_id} is in bucket {layer1_bucket} for layer_1")
print(f"User {user_id} is in bucket {layer2_bucket} for layer_2")

# Check assignments for each experiment
print(f"\nButton Color Exp: {get_variant_for_experiment(user_id, 'exp_A_button_color')}")
print(f"Headline Exp: {get_variant_for_experiment(user_id, 'exp_B_headline')}")
print(f"Recs Algo Exp: {get_variant_for_experiment(user_id, 'exp_C_recs_algo')}")

print("\n✅ Properties illustrated by the layer system:")
print("   - User can be in experiments on different layers simultaneously")
print("   - User is in at most ONE experiment per layer")