# Week 1: Advanced Python for Data Engineers
## Topic: Generators & Yield

**Roadmap Goal:** Process large datasets (logs, telemetry) memory-efficiently, just like you did with batch processing in your legacy C++ roles.

### Why Generators?
A normal function runs to completion, returns a value, and discards its local state. A generator function `yield`s a value and *pauses*, keeping its local state until the next value is requested. This means you can iterate over a 1 TB file one line at a time without loading it all into RAM.
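
A minimal sketch of the file-streaming idea, assuming a plain-text UTF-8 file. The `read_lines` helper and the throwaway temp file are hypothetical stand-ins, not part of the roadmap material.

```python
import os
import tempfile

def read_lines(path):
    """Yield one line at a time; only the current line is held in memory."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:  # file objects are lazy iterators themselves
            yield line.rstrip("\n")

# Small throwaway file standing in for a huge log (hypothetical data).
with tempfile.NamedTemporaryFile(
    "w", suffix=".log", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("first\nsecond\nthird\n")
    demo_path = tmp.name

lines = read_lines(demo_path)
first_line = next(lines)   # runs up to the first yield, then pauses
second_line = next(lines)  # resumes exactly where it left off
remaining = list(lines)    # drains the rest and closes the file
os.remove(demo_path)
```

Each `next()` call pulls exactly one line, so peak memory depends on the longest line rather than on the file size.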

---

In [1]:
# Example 1: The 'Old Way' (Eager Loading)
# This builds the entire list in memory before returning anything.
def get_large_list():
    result = []
    for i in range(1000000):
        result.append(i)
    return result

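# print(get_large_list())  # Don't run this: it prints a million numbers.
# The list itself takes tens of MB on 64-bit CPython, and that cost grows linearly with the range.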

In [2]:
# Example 2: The 'Generator Way' (Lazy Loading)
# This yields one item at a time. Memory use stays small and constant, no matter how large the range.
def get_large_generator():
    for i in range(1000000):
        yield i

gen = get_large_generator()
print(next(gen))
print(next(gen))
print(next(gen))

0
1
2
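
To make the memory difference concrete, here is a rough sketch that compares container sizes with `sys.getsizeof` and shows what happens when a generator runs out of values. The `count_up` helper is a hypothetical stand-in for `get_large_generator`, and the exact byte counts vary by Python version and platform.

```python
import sys

def count_up(n):
    """Hypothetical helper: lazily yield 0..n-1."""
    for i in range(n):
        yield i

eager = list(range(1_000_000))
lazy = count_up(1_000_000)

# getsizeof measures the container only, not the int objects it references,
# so the real gap is larger than these two numbers suggest.
list_bytes = sys.getsizeof(eager)
gen_bytes = sys.getsizeof(lazy)
print(f"list: {list_bytes:,} bytes, generator: {gen_bytes:,} bytes")

# A generator can be consumed only once. Passing a default to next()
# avoids the StopIteration raised when it is exhausted.
small = count_up(2)
values = [next(small), next(small)]
exhausted = next(small, None)
```

The trade-off is that a generator cannot be indexed or rewound; call the generator function again to get a fresh iterator.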


### Exercise: Streaming Log Processor
Simulate processing a massive HorizonScale log file.

In [3]:
# Mock data stream (stands in for reading a file line by line)
log_data = [
    "2025-01-01 ERROR: Clamp detected on server A",
    "2025-01-01 INFO: Health check pass",
    "2025-01-01 ERROR: Memory overflow server B",
    "2025-01-02 INFO: Scaling up"
]

def stream_errors(logs):
    """Generator that yields only the lines containing 'ERROR'."""
    for line in logs:
        if "ERROR" in line:
            yield line

# Consume the generator; lines are filtered lazily as the loop requests them
error_stream = stream_errors(log_data)

for error in error_stream:
    print(f"Alerting on: {error}")

Alerting on: 2025-01-01 ERROR: Clamp detected on server A
Alerting on: 2025-01-01 ERROR: Memory overflow server B
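
Generators compose into multi-stage pipelines, where each stage pulls one item at a time from the stage before it. The sketch below is one possible extension of the exercise: the `only_errors` and `parse` names are hypothetical, and the `DATE LEVEL: message` layout is assumed from the mock data rather than from any real HorizonScale format.

```python
def only_errors(lines):
    """Stage 1: pass through lines that contain 'ERROR'."""
    for line in lines:
        if "ERROR" in line:
            yield line

def parse(lines):
    """Stage 2: split 'DATE LEVEL: message' into a dict (assumed format)."""
    for line in lines:
        date, rest = line.split(" ", 1)
        level, message = rest.split(": ", 1)
        yield {"date": date, "level": level, "message": message}

sample = [
    "2025-01-01 ERROR: Memory overflow server B",
    "2025-01-02 INFO: Scaling up",
]

# Nothing runs until the pipeline is iterated; list() drives both stages.
pipeline = parse(only_errors(sample))
records = list(pipeline)
```

Because each stage holds only the item it is currently working on, adding more stages does not increase memory use in proportion to the input size.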
