# Exploring Dask TaskState for Per-Worker Chunk Tracking

This notebook explores what data is available from Dask's internal `TaskState` objects to understand if we can use them for per-worker chunk byte tracking.

## Goal

Determine if we can extract:
- Bytes per chunk (input and/or output)
- Which worker processed which chunk
- Mapping from task keys to user-visible chunks (files, datasets)

## Setup

In [1]:
import awkward as ak
import pandas as pd
import skhep_testdata
from coffea import processor
from coffea.nanoevents import NanoAODSchema
from dask.distributed import Client, LocalCluster

## Create a Simple Processor

In [None]:
class SimpleProcessor(processor.ProcessorABC):
    """Simple processor for testing."""

    def process(self, events):
        # Do some computation
        jets = events.Jet[events.Jet.pt > 30]

        return {
            "nevents": len(events),
            "njets": ak.sum(ak.num(jets)),
            "dataset": events.metadata.get("dataset", "unknown"),
            "filename": events.metadata.get("filename", "unknown"),
        }

    def postprocess(self, accumulator):
        return accumulator

## Start Dask Cluster

We'll use 2 workers to see task distribution.

In [3]:
cluster = LocalCluster(n_workers=2, threads_per_worker=1, processes=True)
client = Client(cluster)

print(f"Dashboard: {client.dashboard_link}")
print(f"Workers: {len(client.scheduler_info()['workers'])}")

Perhaps you already have a cluster running?
Hosting the HTTP server on port 55712 instead


Dashboard: http://127.0.0.1:55712/status
Workers: 2


## Run Coffea Processing

In [4]:
# Get test file
test_file = skhep_testdata.data_path("nanoAOD_2015_CMS_Open_Data_ttbar.root")

# Create fileset
fileset = {
    "ttbar": {
        "files": {test_file: "Events"},
    },
}

# Run processor
proc = SimpleProcessor()
executor = processor.DaskExecutor(client=client)
runner = processor.Runner(
    executor=executor,
    savemetrics=True,
    schema=NanoAODSchema,
)

output, report = runner(
    fileset,
    treename="Events",
    processor_instance=proc,
)

print(f"\nProcessed {report['entries']} events in {report['chunks']} chunks")
print(f"Total bytes read: {report['bytesread'] / 1e6:.2f} MB")


Processed 200 events in 1 chunks
Total bytes read: 0.34 MB


## Part 1: Access Scheduler and Worker State

Let's explore what's available in the Dask scheduler.

In [5]:
# Get scheduler from client
scheduler = client.cluster.scheduler

print("=== Scheduler Info ===")
print(f"Scheduler type: {type(scheduler)}")
print(f"Number of workers: {len(scheduler.workers)}")
print(f"Number of tasks: {len(scheduler.tasks)}")
print(f"\nWorker IDs: {list(scheduler.workers.keys())}")

=== Scheduler Info ===
Scheduler type: <class 'distributed.scheduler.Scheduler'>
Number of workers: 2
Number of tasks: 0

Worker IDs: ['tcp://127.0.0.1:55720', 'tcp://127.0.0.1:55721']


## Part 2: Explore Worker State

Let's see what data is available for each worker.

In [None]:
print("=== Worker State Details ===")
for worker_id, worker_state in scheduler.workers.items():
    print(f"\nWorker: {worker_id}")
    print(f"  Address: {worker_state.address}")
    print(f"  Threads: {worker_state.nthreads}")
    print(f"  Memory limit: {worker_state.memory_limit / 1e9:.2f} GB")
    print(f"  Memory used: {worker_state.memory.managed / 1e9:.2f} GB")
    print(f"  Total nbytes: {worker_state.nbytes / 1e6:.2f} MB")
    # WorkerState has no `tasks` attribute (accessing it raises AttributeError);
    # `has_what` holds the tasks whose results sit in this worker's memory
    print(f"  Tasks in memory: {len(worker_state.has_what)}")
    print(f"  Currently processing: {len(worker_state.processing)}")

    # Show attributes available
    attrs = [a for a in dir(worker_state) if not a.startswith("_")]
    print(f"  Available attributes: {', '.join(attrs[:10])}...")

=== Worker State Details ===

Worker: tcp://127.0.0.1:55720
  Address: tcp://127.0.0.1:55720
  Threads: 1
  Memory limit: 17.18 GB
  Memory used: 0.00 GB
  Total nbytes: 0.00 MB

## Part 3: Explore TaskState Objects

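This is the critical part: what do individual task states contain? Note that `scheduler.tasks` only holds tasks the scheduler is still tracking. The scheduler info above reported 0 tasks after the run finished, so the cells below only return data if they run while the futures are still alive.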

In [None]:
print("=== TaskState Details ===")

# Get all tasks from scheduler
all_tasks = list(scheduler.tasks.values())
print(f"Total tasks in scheduler: {len(all_tasks)}")

# Look at first few tasks
for i, task in enumerate(all_tasks[:5]):
    print(f"\n--- Task {i + 1} ---")
    print(f"Key: {task.key}")
    print(f"State: {task.state}")
    print(f"Worker: {task.who_has if hasattr(task, 'who_has') else 'N/A'}")
    # nbytes stays at -1 until the task has finished, so test for >= 0
    print(f"nbytes: {task.nbytes / 1e3:.2f} KB" if task.nbytes >= 0 else "nbytes: N/A")
    print(f"Type: {task.type if hasattr(task, 'type') else 'N/A'}")

    # Show available attributes
    attrs = [a for a in dir(task) if not a.startswith("_")]
    print(f"Attributes: {', '.join(attrs[:15])}...")
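
Because the scheduler drops a task once nothing references it any more, a snapshot has to be taken while the futures are still alive. The sketch below is one possible way to collect the fields inspected above into plain records. `snapshot_tasks` is a hypothetical helper, not part of Dask, and it assumes only the `key`, `state`, `nbytes` and `who_has` attributes already used in this notebook.

```python
def snapshot_tasks(dask_scheduler):
    """Return plain-dict records for every task the scheduler currently tracks.

    Hypothetical helper: it only reads the TaskState attributes used in this
    notebook (key, state, nbytes, who_has).
    """
    records = []
    for task in dask_scheduler.tasks.values():
        # who_has may be None or empty when no worker holds the result
        holders = sorted(ws.address for ws in (task.who_has or ()))
        records.append(
            {
                "key": str(task.key),
                "state": task.state,
                # nbytes is -1 until the result size is known
                "nbytes": task.nbytes if task.nbytes >= 0 else None,
                "workers": holders,
            }
        )
    return records
```

With a live client this could be called as `client.run_on_scheduler(snapshot_tasks)` while the futures are still referenced; `run_on_scheduler` passes the scheduler in through the `dask_scheduler` argument. Returning plain dicts keeps the result easy to serialize and to load into a DataFrame.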

## Part 4: Find Coffea-Related Tasks

Let's filter for tasks related to our processor.

In [None]:
print("=== Coffea/Processor Tasks ===")

# Find tasks with 'process' or processor name in key
processor_tasks = [
    task
    for task in all_tasks
    if "SimpleProcessor" in str(task.key) or "process" in str(task.key).lower()
]

print(f"Found {len(processor_tasks)} processor-related tasks\n")

# Show details for processor tasks
for i, task in enumerate(processor_tasks[:10]):
    print(f"\nTask {i + 1}:")
    print(f"  Key: {task.key}")
    print(f"  State: {task.state}")

    # who_has lists the workers holding the result, normally the one that ran the task
    if task.who_has:
        worker_addr = next(iter(task.who_has)).address
        print(f"  Worker: {worker_addr}")

    # nbytes stays at -1 until the task has finished
    if task.nbytes >= 0:
        print(f"  Result size: {task.nbytes / 1e3:.2f} KB")

    # Check for any metadata
    if hasattr(task, "annotations"):
        print(f"  Annotations: {task.annotations}")

## Part 5: Per-Worker Task Breakdown

Let's see how many task results each worker holds and their total size in bytes.

In [None]:
print("=== Per-Worker Task Distribution ===")

worker_stats = {}

for worker_id, worker_state in scheduler.workers.items():
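    # WorkerState has no `tasks` attribute; `has_what` holds the tasks whose
    # results are in this worker's memory
    worker_tasks = list(worker_state.has_what)

    # Sum result sizes, skipping tasks whose size is unknown (nbytes == -1)
    total_bytes = sum(task.nbytes for task in worker_tasks if task.nbytes >= 0)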

    processor_related = [
        task
        for task in worker_tasks
        if "SimpleProcessor" in str(task.key) or "process" in str(task.key).lower()
    ]

    worker_stats[worker_id] = {
        "total_tasks": len(worker_tasks),
        "processor_tasks": len(processor_related),
        "total_bytes": total_bytes,
    }

    print(f"\nWorker: {worker_id}")
    print(f"  Total tasks: {len(worker_tasks)}")
    print(f"  Processor tasks: {len(processor_related)}")
    print(f"  Total result bytes: {total_bytes / 1e6:.2f} MB")

    # Show sample task keys
    if processor_related:
        print("  Sample task keys:")
        for task in processor_related[:3]:
            # nbytes is -1 when the result size is unknown
            size = f"{task.nbytes / 1e3:.1f} KB" if task.nbytes >= 0 else "N/A"
            print(f"    {task.key}: {size}")

## Part 6: Can We Map Task Keys to Chunks?

Let's see if task keys contain any information about files or datasets.

In [None]:
print("=== Task Key Analysis ===")

# Examine task key structure
print("\nTask key examples:")
for i, task in enumerate(processor_tasks[:10]):
    key = task.key
    print(f"\n{i + 1}. {key}")
    print(f"   Type: {type(key)}")

    if isinstance(key, tuple):
        print(f"   Length: {len(key)}")
        print(f"   Elements: {key}")

        # Check if any element contains file/dataset info
        for j, elem in enumerate(key):
            if isinstance(elem, str):
                if "ttbar" in elem or "root" in elem or "nanoAOD" in elem:
                    print(f"   -> Element {j} might contain file/dataset info: {elem}")

print("\n=== Conclusion ===")
print("Task keys are typically strings like 'name-<token>' or tuples like ('name-<token>', index)")
print("They generally do NOT contain human-readable file/dataset information.")
print("Coffea's internal structure may have this mapping, but it's not in task keys.")
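
Even though keys carry no file or dataset information, the name part in front of the token is still useful for grouping tasks by the function that produced them. The helper below is a minimal sketch of that idea. `key_prefix` is a hypothetical name, and the key layout it assumes (a `'name-<token>'` string, optionally wrapped in a tuple) is a convention rather than a guarantee. Dask ships its own, more careful `dask.utils.key_split` for the same purpose.

```python
def key_prefix(key):
    """Best-effort, human-readable prefix of a Dask task key.

    Hypothetical helper, assuming keys are either strings of the form
    'name-<token>' or tuples whose first element is such a string.
    """
    # Collection keys are tuples such as ('name-<token>', 0); use the name part
    if isinstance(key, tuple):
        key = key[0]
    key = str(key)
    # Drop the trailing token, if the key has one
    name, sep, _token = key.rpartition("-")
    return name if sep else key
```

Replacing the substring match in Part 4 with a comparison on this prefix would avoid accidental hits on tokens that happen to contain the word "process".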

## Part 7: Create Summary DataFrame

In [None]:
# Build a DataFrame of all processor tasks
task_data = []

for task in processor_tasks:
    # First worker holding the result, if any
    worker = next(iter(task.who_has)).address if task.who_has else None

    # Truncate long keys for display; treat unknown sizes (nbytes == -1) as 0
    key = str(task.key)
    nbytes = task.nbytes if task.nbytes >= 0 else 0

    task_data.append(
        {
            "task_key": key[:50] + "..." if len(key) > 50 else key,
            "worker": worker,
            "state": task.state,
            "nbytes": nbytes,
            "nbytes_kb": nbytes / 1e3,
        }
    )

df = pd.DataFrame(task_data)

print("=== Task Summary DataFrame ===")
print(df.head(10))

print("\n=== Summary Statistics ===")
print(df[["nbytes_kb"]].describe())

print("\n=== Per-Worker Summary ===")
print(df.groupby("worker")["nbytes_kb"].agg(["count", "sum", "mean", "std"]))

## Findings and Conclusions

### What We CAN Get from TaskState:

1. ✅ **Result size per task** (`TaskState.nbytes`)
2. ✅ **Worker attribution** (which worker holds each result via `TaskState.who_has`, normally the worker that processed the task)
3. ✅ **Task state** (waiting, processing, memory, erred, etc.)
4. ✅ **Number of tasks per worker**

### What We CANNOT Get:

1. ❌ **Input bytes read** - only output/result size available
2. ❌ **File/dataset mapping** - task keys are opaque hashes
3. ❌ **Chunk identification** - no way to map task to user-visible "chunk"
4. ❌ **Event counts per task** - not in TaskState

### Limitations:

- `nbytes` is the **output** size (accumulator result), not input bytes read from file
- Task keys don't contain human-readable information (filename, dataset, etc.)
- Would need to maintain separate mapping from task keys to chunk metadata
- Snapshot overhead: a large run means thousands of `TaskState` objects to walk, and only while the scheduler still tracks them
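
The separate mapping mentioned in the list above could be as simple as a dict filled in at submission time and joined with the task records afterwards. The sketch below shows only the join. `attach_chunks`, `task_records` and `chunk_by_key` are hypothetical names, and it assumes the caller is able to record the task key for each chunk, which Coffea's executor does not obviously expose.

```python
def attach_chunks(task_records, chunk_by_key):
    """Merge chunk metadata into task records that share the same key.

    Both arguments are assumptions for illustration: `task_records` is a list
    of dicts with a "key" entry, `chunk_by_key` maps task keys to metadata
    recorded by the caller when the work was submitted.
    """
    merged = []
    for record in task_records:
        chunk = chunk_by_key.get(record["key"])
        if chunk is None:
            # Task without a known chunk (e.g. a reduction step): skip it
            continue
        merged.append({**record, **chunk})
    return merged
```

Tasks that have no entry in the mapping are dropped rather than reported with empty metadata, so the result only contains rows that can be attributed to a user-visible chunk.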

### Recommendation:

**TaskState tracking alone is insufficient** for the requirements (per-chunk with worker attribution, bytes per event, throughput per chunk).

**Better approach**: Use a `@track_metrics` decorator that:
- Captures input metadata (filename, dataset, event count)
- Measures processing time directly
- Gets worker ID from `get_worker()`
- Can estimate bytes if needed
- Provides clean, user-visible chunk attribution

TaskState could supplement decorator data but cannot replace it.
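
To make the recommendation concrete, here is a minimal sketch of what such a decorator might look like. It is not an existing implementation: the `chunk_metrics` result key and the `_worker_address` helper are hypothetical, and the sketch assumes that `events` supports `len()`, exposes a dict-like `metadata` attribute, and that `process` returns a dict. Byte estimation is left out.

```python
import functools
import time


def _worker_address():
    """Address of the current Dask worker, or 'unknown' outside a worker."""
    try:
        from distributed import get_worker

        return get_worker().address
    except (ImportError, ValueError):
        # distributed is missing, or the code is not running inside a worker
        return "unknown"


def track_metrics(process):
    """Wrap a `process(self, events)` method and attach per-chunk metrics.

    Sketch only: assumes `events` supports len() and has a dict-like
    `metadata` attribute, and that `process` returns a dict.
    """

    @functools.wraps(process)
    def wrapper(self, events):
        start = time.perf_counter()
        result = process(self, events)
        elapsed = time.perf_counter() - start
        metadata = getattr(events, "metadata", {})
        # One-element list per chunk, under a hypothetical result key
        result["chunk_metrics"] = [
            {
                "dataset": metadata.get("dataset", "unknown"),
                "filename": metadata.get("filename", "unknown"),
                "nevents": len(events),
                "seconds": elapsed,
                "worker": _worker_address(),
            }
        ]
        return result

    return wrapper
```

Applied as `@track_metrics` on `SimpleProcessor.process`, each chunk would contribute one record. The records are stored in a list on the assumption that the accumulator merges lists by concatenation, which is worth verifying against the Coffea version in use.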

## Cleanup

In [None]:
client.close()
cluster.close()
print("Cluster closed")