# Python Sets for Data Engineering

Sets are **unordered collections of unique elements**. They are extremely useful in Data Engineering for:
- **Deduplication** - Removing duplicate records
- **Membership testing** - Fast lookups, O(1) on average
- **Set operations** - Finding common, different, or combined elements
- **Data validation** - Checking allowed values

**Key Properties:**
- No duplicate values
- Unordered (no index access)
- Mutable (can add/remove elements)
- Elements must be hashable (for example str, int, or a tuple whose items are all hashable; lists, dicts, and sets are not)
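
The hashability rule is easiest to see by trying it. The following is a minimal sketch with made-up coordinate values: it adds tuples to a set, then shows that a list, or a tuple that contains a list, is rejected with a `TypeError`.

```python
# Hashable elements (str, int, tuple of hashables) are accepted.
coords = {(40.7, -74.0), (34.1, -118.2)}
coords.add((40.7, -74.0))  # Already present, so the set is unchanged

# A list is mutable and therefore unhashable.
try:
    coords.add([51.5, -0.1])
    list_accepted = True
except TypeError as exc:
    list_accepted = False
    print(f"TypeError: {exc}")

# A tuple that contains a list is also unhashable.
try:
    hash(("R001", ["tag1", "tag2"]))
    nested_hashable = True
except TypeError:
    nested_hashable = False

print(f"Coordinates: {coords}")
print(f"List accepted: {list_accepted}, tuple-with-list hashable: {nested_hashable}")
```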

---
# Section 1: Creating Sets

Different ways to create sets in Python

## 1.1 Creating Sets - Syntax

```python
# Method 1: Using curly braces
my_set = {1, 2, 3}

# Method 2: Using set() constructor
my_set = set([1, 2, 3])

# IMPORTANT: Empty set MUST use set(), not {}
empty_set = set()    # Correct
empty_dict = {}      # This creates an empty DICT, not set!
```

In [None]:
# Creating sets

# Method 1: Curly braces
status_codes = {200, 201, 400, 404, 500}
print(f"Status codes: {status_codes}")
print(f"Type: {type(status_codes)}")
print()

# Method 2: From a list (automatically removes duplicates!)
raw_ids = ["A001", "A002", "A001", "A003", "A002", "A001"]
unique_ids = set(raw_ids)
print(f"Raw list:   {raw_ids} (length: {len(raw_ids)})")
print(f"Unique set: {unique_ids} (length: {len(unique_ids)})")

In [None]:
# COMMON MISTAKE: Empty set vs empty dict

empty_dict = {}
empty_set = set()

print(f"{{}} creates: {type(empty_dict)}")
print(f"set() creates: {type(empty_set)}")
print()
print("Always use set() to create an empty set!")
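
A set can also be built with a set comprehension, which is handy when values need cleaning before deduplication. This is a small sketch using hypothetical email strings (the `raw_emails` data is invented for illustration):

```python
# Hypothetical sample data: email addresses with inconsistent casing and whitespace
raw_emails = ["Ann@Example.com", " bob@example.com", "ann@example.com ", "BOB@example.com"]

# Set comprehension: normalize each value and keep only the distinct results
unique_emails = {email.strip().lower() for email in raw_emails}

print(f"Raw count:    {len(raw_emails)}")
print(f"Unique count: {len(unique_emails)}")
print(f"Unique:       {sorted(unique_emails)}")
```

Wrapping the result in `sorted()` gives a stable display order, since the set itself has none.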

---
# Section 2: Adding & Removing Elements

Modifying set contents

## 2.1 `add()` - Add Single Element

**Syntax:**
```
set.add(element)
```

**Behavior:** Adds the element if it is not present; does nothing if it already exists

**DE Use Case:** Building unique value sets during data processing

In [None]:
# add() - Collecting unique values during processing
# Simulating reading records from a data stream

transactions = [
    {"customer_id": "C001", "amount": 100},
    {"customer_id": "C002", "amount": 200},
    {"customer_id": "C001", "amount": 150},  # Duplicate customer
    {"customer_id": "C003", "amount": 300},
    {"customer_id": "C002", "amount": 250},  # Duplicate customer
]

# Collect unique customers
unique_customers = set()

for txn in transactions:
    unique_customers.add(txn["customer_id"])
    print(f"After processing {txn['customer_id']}: {unique_customers}")

print(f"\nTotal transactions: {len(transactions)}")
print(f"Unique customers: {len(unique_customers)}")

## 2.2 `update()` - Add Multiple Elements

**Syntax:**
```
set.update(iterable)
set.update(iterable1, iterable2, ...)  # Can pass multiple
```

**DE Use Case:** Merging unique values from multiple sources

In [None]:
# update() - Merging data from multiple sources

# Customer IDs from different data sources
customers_db1 = ["C001", "C002", "C003"]
customers_db2 = ["C003", "C004", "C005"]
customers_api = ["C001", "C005", "C006"]

# Combine all unique customers
all_customers = set()
all_customers.update(customers_db1)
print(f"After DB1:  {all_customers}")

all_customers.update(customers_db2)
print(f"After DB2:  {all_customers}")

all_customers.update(customers_api)
print(f"After API:  {all_customers}")

print(f"\nTotal unique customers across all sources: {len(all_customers)}")

In [None]:
# update() with multiple iterables at once
all_at_once = set()
all_at_once.update(customers_db1, customers_db2, customers_api)

print(f"All at once: {all_at_once}")
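
`update()` also has an operator form, `|=`. The sketch below (with made-up batch data) illustrates one difference worth knowing: the method accepts any iterable, while the operator expects a set on the right-hand side.

```python
batch_1 = ["C001", "C002"]   # a list
batch_2 = {"C002", "C003"}   # a set

seen = set()
seen.update(batch_1)   # Method form accepts any iterable, including a list
seen |= batch_2        # Operator form requires the right-hand side to be a set

try:
    seen |= ["C004"]   # A list on the right-hand side is rejected
except TypeError as exc:
    print(f"TypeError: {exc}")

print(f"Seen: {sorted(seen)}")
```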

## 2.3 `remove()` vs `discard()` - Remove Elements

**Syntax:**
```
set.remove(element)   # Raises KeyError if not found
set.discard(element)  # Does nothing if not found (safer)
```

**DE Use Case:** Removing known bad values, filtering out specific items

In [None]:
# remove() vs discard()

valid_statuses = {"active", "pending", "completed", "unknown"}
print(f"Original: {valid_statuses}")

# remove() - Element must exist
valid_statuses.remove("unknown")  # Remove invalid status
print(f"After remove('unknown'): {valid_statuses}")

# discard() - Safe even if element doesn't exist
valid_statuses.discard("invalid")  # Doesn't exist, but no error
print(f"After discard('invalid'): {valid_statuses}")

print()
print("Use discard() when you're not sure if element exists")
print("Use remove() when element MUST exist (fail-fast)")

In [None]:
# remove() raises KeyError if not found
test_set = {"a", "b", "c"}

try:
    test_set.remove("x")  # Doesn't exist
except KeyError as e:
    print(f"KeyError: {e} - Element not found in set")

## 2.4 `pop()` and `clear()` - Other Removal Methods

**Syntax:**
```
set.pop()    # Remove and return arbitrary element
set.clear()  # Remove all elements
```

In [None]:
# pop() - Remove and return an arbitrary element (the order is not guaranteed)
# Useful for draining a set when the processing order does not matter

pending_jobs = {"job_001", "job_002", "job_003"}
print(f"Pending jobs: {pending_jobs}")

# Process jobs one by one
while pending_jobs:
    job = pending_jobs.pop()
    print(f"Processing: {job}, Remaining: {pending_jobs}")

In [None]:
# clear() - Remove all elements
cache = {"key1", "key2", "key3"}
print(f"Before clear: {cache}")

cache.clear()
print(f"After clear:  {cache}")

---
# Section 3: Set Operations (Most Important for DE!)

These operations are what make sets powerful for data engineering tasks

## 3.1 Union - Combine All Elements

**What it does:** Returns all elements from both sets (no duplicates)

**Syntax:**
```
set1.union(set2)       # Method
set1 | set2            # Operator
set1.union(set2, set3) # Multiple sets
```

**DE Use Case:** Combining records from multiple data sources
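
When the number of sources is not known in advance, `union()` can take them all at once through argument unpacking. A minimal sketch, assuming a hypothetical `sources` dictionary that maps a source name to a list of IDs:

```python
# Hypothetical ID lists, one per source
sources = {
    "crm": ["C001", "C002"],
    "billing": ["C002", "C003"],
    "support": ["C003", "C004", "C001"],
}

# union() accepts any number of iterables, so the values can be unpacked directly.
# Starting from an empty set also works when there are no sources at all.
all_ids = set().union(*sources.values())

print(f"All IDs: {sorted(all_ids)}")
```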

In [None]:
# Union - Combining customer data from multiple regions

customers_east = {"C001", "C002", "C003", "C004"}
customers_west = {"C003", "C004", "C005", "C006"}

print(f"East region: {customers_east}")
print(f"West region: {customers_west}")
print()

# Get all unique customers (both methods work the same)
all_customers_method = customers_east.union(customers_west)
all_customers_operator = customers_east | customers_west

print(f"Union (method):   {all_customers_method}")
print(f"Union (operator): {all_customers_operator}")
print(f"\nTotal unique customers: {len(all_customers_method)}")

In [None]:
# Visual representation
print("""
UNION (A | B) - All elements from both sets

    A = {1, 2, 3}      B = {2, 3, 4}

   ┌───────────┐
   │  A        │
   │  1   ┌────┼──────┐
   │      │ 2 3│      │
   └──────┼────┘   4  │
          │        B  │
          └───────────┘

   Result: {1, 2, 3, 4}
""")

## 3.2 Intersection - Common Elements Only

**What it does:** Returns only elements that exist in BOTH sets

**Syntax:**
```
set1.intersection(set2)  # Method
set1 & set2              # Operator
```

**DE Use Cases:**
- Finding customers who exist in multiple systems
- Identifying common products between stores
- Data reconciliation
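
Intersection extends to more than two sets. As a sketch of the "customers who exist in multiple systems" case, the example below uses three invented ID sets and finds the IDs present in every one of them:

```python
systems = [
    {"C001", "C002", "C003", "C004"},  # hypothetical CRM IDs
    {"C002", "C003", "C005"},          # hypothetical billing IDs
    {"C003", "C002", "C006"},          # hypothetical support IDs
]

# set.intersection is called on the first set with the rest unpacked as arguments.
# Guard against an empty list, since there would be no first set to start from.
in_all_systems = set.intersection(*systems) if systems else set()

print(f"In every system: {sorted(in_all_systems)}")
```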

In [None]:
# Intersection - Find customers in BOTH regions

customers_east = {"C001", "C002", "C003", "C004"}
customers_west = {"C003", "C004", "C005", "C006"}

print(f"East region: {customers_east}")
print(f"West region: {customers_west}")
print()

# Customers active in BOTH regions
common_customers = customers_east.intersection(customers_west)
# OR: common_customers = customers_east & customers_west

print(f"Customers in BOTH regions: {common_customers}")
print(f"Count: {len(common_customers)}")

In [None]:
# DE Use Case: Data reconciliation
# Find records that exist in both source and target after ETL

source_ids = {"R001", "R002", "R003", "R004", "R005"}
target_ids = {"R001", "R002", "R003", "R006"}  # Some missing, some extra

# Successfully migrated records
migrated = source_ids & target_ids

print(f"Source records: {source_ids}")
print(f"Target records: {target_ids}")
print(f"Successfully migrated: {migrated}")
print(f"Migration rate: {len(migrated)}/{len(source_ids)} = {len(migrated)/len(source_ids)*100:.1f}%")

## 3.3 Difference - Elements in One but Not Other

**What it does:** Returns elements in first set but NOT in second set

**Syntax:**
```
set1.difference(set2)  # Elements in set1 but not in set2
set1 - set2            # Operator
```

**Important:** Order matters! `A - B` ≠ `B - A`

**DE Use Cases:**
- Finding missing records after migration
- Identifying new/removed items between snapshots
- Data quality checks

In [None]:
# Difference - Find missing records after ETL migration

source_ids = {"R001", "R002", "R003", "R004", "R005"}
target_ids = {"R001", "R002", "R003", "R006"}

# Records in source but NOT in target (failed to migrate)
missing_in_target = source_ids - target_ids

# Records in target but NOT in source (unexpected/orphan records)
unexpected_in_target = target_ids - source_ids

print(f"Source: {source_ids}")
print(f"Target: {target_ids}")
print()
print(f"Missing in target (failed migration): {missing_in_target}")
print(f"Unexpected in target (orphan records): {unexpected_in_target}")

In [None]:
# DE Use Case: Detecting changes between daily snapshots

yesterday_products = {"P001", "P002", "P003", "P004", "P005"}
today_products = {"P001", "P002", "P004", "P006", "P007"}

# What changed?
removed_products = yesterday_products - today_products
new_products = today_products - yesterday_products

print("Daily Change Report:")
print("=" * 40)
print(f"Products removed: {removed_products}")
print(f"Products added:   {new_products}")
print(f"Unchanged:        {yesterday_products & today_products}")

## 3.4 Symmetric Difference - Elements in Either but Not Both

**What it does:** Returns elements that are in ONE set or the OTHER, but NOT in both

**Syntax:**
```
set1.symmetric_difference(set2)
set1 ^ set2  # Operator
```

**DE Use Case:** Finding all discrepancies between two datasets
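
The symmetric difference reports that a record is mismatched but not which side it came from. One way to keep that information is to compute the two one-sided differences separately; their union is the symmetric difference. A short sketch with sample record IDs:

```python
system_a = {"R001", "R002", "R003", "R004"}
system_b = {"R002", "R003", "R005", "R006"}

only_in_a = system_a - system_b
only_in_b = system_b - system_a

# The symmetric difference equals the union of the two one-sided differences
assert system_a ^ system_b == only_in_a | only_in_b

print(f"Only in A: {sorted(only_in_a)}")
print(f"Only in B: {sorted(only_in_b)}")
```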

In [None]:
# Symmetric Difference - Find ALL discrepancies

system_a_records = {"R001", "R002", "R003", "R004"}
system_b_records = {"R002", "R003", "R005", "R006"}

# Records that don't match between systems
discrepancies = system_a_records ^ system_b_records

print(f"System A: {system_a_records}")
print(f"System B: {system_b_records}")
print(f"\nDiscrepancies (in one but not both): {discrepancies}")
print("\nThese records need investigation!")

In [None]:
# Visual comparison of all set operations
a = {1, 2, 3, 4}
b = {3, 4, 5, 6}

print(f"Set A: {a}")
print(f"Set B: {b}")
print()
print(f"Union (A | B):                 {a | b}  ← All elements")
print(f"Intersection (A & B):          {a & b}        ← Common elements")
print(f"Difference (A - B):            {a - b}        ← In A but not B")
print(f"Difference (B - A):            {b - a}        ← In B but not A")
print(f"Symmetric Difference (A ^ B):  {a ^ b}     ← In either, not both")

---
# Section 4: Membership & Comparison

Fast lookups and set comparisons

## 4.1 Membership Testing (`in` operator)

**Why sets are fast:** Sets are backed by a hash table, so a lookup is O(1) on average; a list must be scanned element by element, which is O(n)

**DE Use Case:** Validating values against allowed lists, filtering records

In [None]:
# Membership testing - Data validation

# Valid status codes for orders
valid_statuses = {"pending", "processing", "shipped", "delivered", "cancelled"}

# Incoming records to validate
orders = [
    {"order_id": "O001", "status": "pending"},
    {"order_id": "O002", "status": "SHIPPED"},    # Wrong case
    {"order_id": "O003", "status": "delivered"},
    {"order_id": "O004", "status": "unknown"},    # Invalid
    {"order_id": "O005", "status": "processing"},
]

print("Validating order statuses:")
print("-" * 50)
for order in orders:
    status = order["status"].lower()  # Normalize case
    is_valid = status in valid_statuses  # O(1) lookup!
    print(f"{order['order_id']}: '{order['status']}' → Valid: {is_valid}")

In [None]:
# Performance: Set vs List for lookups
import time

# Create large collections
large_list = list(range(100000))
large_set = set(range(100000))

# Test value (worst case for list - at the end)
test_value = 99999

# List lookup (perf_counter is a high-resolution clock intended for timing)
start = time.perf_counter()
for _ in range(1000):
    _ = test_value in large_list
list_time = time.perf_counter() - start

# Set lookup
start = time.perf_counter()
for _ in range(1000):
    _ = test_value in large_set
set_time = time.perf_counter() - start

print(f"List lookup (1000x): {list_time:.4f} seconds")
print(f"Set lookup (1000x):  {set_time:.4f} seconds")
print(f"\nSet is roughly {list_time/set_time:.0f}x faster on this run")

## 4.2 Subset and Superset Checks

**Syntax:**
```
set1.issubset(set2)    # Is set1 contained in set2?
set1 <= set2           # Operator

set1.issuperset(set2)  # Does set1 contain all of set2?
set1 >= set2           # Operator
```

**DE Use Case:** Validating required fields, checking permissions
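
The `<=` and `>=` operators treat equal sets as subsets of each other. The strict forms `<` and `>` test for a proper subset or superset, which is useful when a schema check must detect extra columns. A small sketch with hypothetical column names:

```python
expected_columns = {"id", "name", "email"}
same_columns = {"email", "id", "name"}
extra_columns = {"id", "name", "email", "created_at"}

print(expected_columns <= same_columns)   # True: equal sets are subsets of each other
print(expected_columns < same_columns)    # False: not a proper subset, nothing extra
print(expected_columns < extra_columns)   # True: proper subset, one extra column
print(expected_columns == same_columns)   # True: element order is irrelevant
```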

In [None]:
# issubset() - Validate required fields exist

required_fields = {"customer_id", "order_date", "amount"}

# Incoming records with different field sets
records = [
    {"customer_id": "C001", "order_date": "2024-01-15", "amount": 100, "notes": "Express"},
    {"customer_id": "C002", "amount": 200},  # Missing order_date
    {"customer_id": "C003", "order_date": "2024-01-16", "amount": 300},
]

print(f"Required fields: {required_fields}")
print()
print("Validating records:")
print("-" * 50)

for i, record in enumerate(records):
    record_fields = set(record.keys())
    has_required = required_fields.issubset(record_fields)
    # OR: has_required = required_fields <= record_fields
    
    if has_required:
        print(f"Record {i+1}: ✓ Valid")
    else:
        missing = required_fields - record_fields
        print(f"Record {i+1}: ✗ Missing: {missing}")

In [None]:
# issuperset() - Check if user has required permissions

user_permissions = {"read", "write", "delete", "admin"}
required_for_action = {"write", "delete"}

can_perform = user_permissions.issuperset(required_for_action)
# OR: can_perform = user_permissions >= required_for_action

print(f"User has: {user_permissions}")
print(f"Action requires: {required_for_action}")
print(f"Can perform action: {can_perform}")

## 4.3 `isdisjoint()` - Check for No Common Elements

**Syntax:**
```
set1.isdisjoint(set2)  # True if no common elements
```

**DE Use Case:** Ensuring no overlap between groups, validation

In [None]:
# isdisjoint() - Ensure no overlap in data partitions

partition_1_ids = {"R001", "R002", "R003"}
partition_2_ids = {"R004", "R005", "R006"}
partition_3_ids = {"R003", "R007", "R008"}  # Overlaps with partition_1!

print("Checking partition integrity:")
print("-" * 40)

check_1_2 = partition_1_ids.isdisjoint(partition_2_ids)
check_1_3 = partition_1_ids.isdisjoint(partition_3_ids)

print(f"Partition 1 & 2 disjoint: {check_1_2} {'✓' if check_1_2 else '✗'}")
print(f"Partition 1 & 3 disjoint: {check_1_3} {'✓' if check_1_3 else '✗ OVERLAP!'}")

if not check_1_3:
    overlap = partition_1_ids & partition_3_ids
    print(f"  → Overlapping records: {overlap}")
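
With more than a few partitions, checking each pair by hand does not scale. The sketch below is one possible generalization, using `itertools.combinations` to test every pair; the `partitions` dictionary and its names are illustrative.

```python
from itertools import combinations

partitions = {
    "partition_1": {"R001", "R002", "R003"},
    "partition_2": {"R004", "R005", "R006"},
    "partition_3": {"R003", "R007", "R008"},
}

# Map each overlapping pair of partition names to the IDs they share
overlaps = {}
for (name_a, ids_a), (name_b, ids_b) in combinations(partitions.items(), 2):
    if not ids_a.isdisjoint(ids_b):
        overlaps[(name_a, name_b)] = ids_a & ids_b

for pair, shared in overlaps.items():
    print(f"{pair[0]} and {pair[1]} share: {sorted(shared)}")
```

The number of pairs grows quadratically with the number of partitions, so this approach suits a modest partition count.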

---
# Section 5: Practical DE Examples

Real-world data engineering scenarios using sets

In [None]:
# Example 1: Deduplication in ETL Pipeline

# Raw data with duplicates
raw_events = [
    {"event_id": "E001", "user": "U1", "action": "click"},
    {"event_id": "E002", "user": "U2", "action": "view"},
    {"event_id": "E001", "user": "U1", "action": "click"},  # Duplicate
    {"event_id": "E003", "user": "U1", "action": "purchase"},
    {"event_id": "E002", "user": "U2", "action": "view"},  # Duplicate
]

# Deduplicate using set to track seen IDs
seen_ids = set()
deduplicated = []

for event in raw_events:
    if event["event_id"] not in seen_ids:
        seen_ids.add(event["event_id"])
        deduplicated.append(event)

print(f"Raw events: {len(raw_events)}")
print(f"After dedup: {len(deduplicated)}")
print(f"Duplicates removed: {len(raw_events) - len(deduplicated)}")
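
The same pattern works when records have no single ID column: a tuple of several fields can serve as the key, because a tuple of hashable values is itself hashable. A minimal sketch, assuming invented events where `user`, `action`, and `ts` together identify a record:

```python
events_no_id = [
    {"user": "U1", "action": "click", "ts": "2024-01-15T10:00:00"},
    {"user": "U2", "action": "view",  "ts": "2024-01-15T10:00:05"},
    {"user": "U1", "action": "click", "ts": "2024-01-15T10:00:00"},  # Duplicate
    {"user": "U1", "action": "click", "ts": "2024-01-15T10:01:00"},  # Same user/action, later time
]

seen_keys = set()
unique_events = []

for event in events_no_id:
    key = (event["user"], event["action"], event["ts"])  # Composite key as a tuple
    if key not in seen_keys:
        seen_keys.add(key)
        unique_events.append(event)  # First occurrence wins; input order is kept

print(f"Raw events: {len(events_no_id)}")
print(f"After dedup: {len(unique_events)}")
```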

In [None]:
# Example 2: Data Quality Report

def data_quality_report(source_ids, target_ids):
    """Generate a data quality report comparing source and target."""
    
    source_set = set(source_ids)
    target_set = set(target_ids)
    
    matched = source_set & target_set
    missing = source_set - target_set
    extra = target_set - source_set
    # Avoid ZeroDivisionError when the source is empty
    match_rate = len(matched) / len(source_set) * 100 if source_set else 0.0
    
    print("=" * 50)
    print("DATA QUALITY REPORT")
    print("=" * 50)
    print(f"Source records:     {len(source_set)}")
    print(f"Target records:     {len(target_set)}")
    print(f"Matched:            {len(matched)} ({match_rate:.1f}%)")
    print(f"Missing in target:  {len(missing)}")
    print(f"Extra in target:    {len(extra)}")
    print()
    
    if missing:
        print(f"Missing IDs: {missing}")
    if extra:
        print(f"Extra IDs: {extra}")

# Test the report
source = ["ID001", "ID002", "ID003", "ID004", "ID005"]
target = ["ID001", "ID002", "ID003", "ID006"]

data_quality_report(source, target)

In [None]:
# Example 3: Filter records by allowed values

# Allowed countries for processing
allowed_countries = {"US", "CA", "UK", "DE", "FR"}

# Incoming orders
orders = [
    {"order_id": "O001", "country": "US", "amount": 100},
    {"order_id": "O002", "country": "JP", "amount": 200},  # Not allowed
    {"order_id": "O003", "country": "UK", "amount": 150},
    {"order_id": "O004", "country": "BR", "amount": 300},  # Not allowed
    {"order_id": "O005", "country": "CA", "amount": 250},
]

# Filter using set membership (fast!)
valid_orders = [o for o in orders if o["country"] in allowed_countries]
rejected_orders = [o for o in orders if o["country"] not in allowed_countries]

print(f"Total orders: {len(orders)}")
print(f"Valid orders: {len(valid_orders)}")
print(f"Rejected: {len(rejected_orders)}")
print()
print("Rejected orders:")
for order in rejected_orders:
    print(f"  {order['order_id']}: Country '{order['country']}' not in allowed list")

---
# Quick Reference: Set Methods

| Method | Description | DE Use Case |
|--------|-------------|-------------|
| `add(x)` | Add single element | Collect unique values |
| `update(iter)` | Add multiple elements | Merge from sources |
| `remove(x)` | Remove (error if missing) | Remove known items |
| `discard(x)` | Remove (safe) | Remove if exists |
| `union()` / `\|` | All from both | Combine datasets |
| `intersection()` / `&` | Common only | Find matches |
| `difference()` / `-` | In first not second | Find missing |
| `symmetric_difference()` / `^` | In either not both | Find discrepancies |
| `issubset()` | Is contained in? | Validate required fields |
| `issuperset()` | Contains all? | Check permissions |
| `isdisjoint()` | No overlap? | Validate partitions |

---
# Frozen Sets (Immutable)

`frozenset` is an immutable, hashable version of `set`, so it can be used as a dictionary key or as an element of another set

In [None]:
# frozenset - When you need immutable sets

# Can be used as dictionary keys
region_mapping = {
    frozenset({"NY", "NJ", "CT"}): "Northeast",
    frozenset({"CA", "OR", "WA"}): "West Coast",
    frozenset({"TX", "OK", "LA"}): "South Central",
}

# Lookup
states = frozenset({"CA", "OR", "WA"})
print(f"Region for {set(states)}: {region_mapping[states]}")
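
Because a `frozenset` is hashable, it can also be stored inside a regular set. One use is deduplicating unordered pairs, where `("NYC", "BOS")` and `("BOS", "NYC")` should count as the same record. The route data below is hypothetical:

```python
# Hypothetical route records where direction does not matter
routes = [("NYC", "BOS"), ("BOS", "NYC"), ("LAX", "SFO"), ("NYC", "BOS")]

# Each pair becomes a frozenset, so both directions collapse to one element
unique_routes = {frozenset(pair) for pair in routes}

print(f"Raw routes:    {len(routes)}")
print(f"Unique routes: {len(unique_routes)}")

# frozenset has no add() or remove(); it cannot be changed after creation
print(f"Has add(): {hasattr(frozenset(), 'add')}")
```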