# Dask Delayed for Parallel File Processing

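This notebook demonstrates how to use `dask.delayed` to parallelize the processing of multiple files. `dask.delayed` wraps ordinary Python functions so that calling them records a task in a graph instead of running the function immediately. Dask then executes the graph and runs independent tasks in parallel, which makes it a good fit for custom algorithms that do not map onto Dask's array or dataframe collections.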
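
To make the lazy behaviour concrete, here is a minimal sketch. The function `double` and the `calls` list are hypothetical names introduced only for illustration; they are not part of the notebook's workflow.

```python
import dask

calls = []  # records when the function body actually runs


@dask.delayed
def double(x):
    # Hypothetical example function: note the side effect so we can see when it executes.
    calls.append(x)
    return 2 * x


lazy = double(21)          # builds a task; the body has not run yet
ran_before = list(calls)   # snapshot taken before computing (still empty)
result = lazy.compute()    # executes the task graph

print(type(lazy).__name__, ran_before, result, calls)
```

Calling `double(21)` only returns a `Delayed` object. The function body runs once `compute()` is called.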

In [None]:
import dask
import time
import os

## 1. Create Dummy Data Files

In [None]:
os.makedirs('temp_data', exist_ok=True)
for i in range(5):
    with open(f'temp_data/file_{i}.txt', 'w') as f:
        # The trailing newline keeps repeated sentences from fusing into one word ("Python.dask")
        f.write('dask is a flexible library for parallel computing in Python.\n' * (i + 1))

## 2. Define a Function to Process a Single File

In [None]:
def process_file(filename):
    """Reads a file, counts the words, and simulates some work by sleeping."""
    print(f"Processing {filename}...")
    with open(filename, 'r') as f:
        content = f.read()
    word_count = len(content.split())
    time.sleep(1)  # Simulate I/O or CPU intensive work
    print(f"Finished processing {filename}.")
    return word_count

## 3. Use `dask.delayed` to Parallelize

In [None]:
filenames = [f'temp_data/file_{i}.txt' for i in range(5)]

# Create a list of delayed objects
delayed_results = [dask.delayed(process_file)(fn) for fn in filenames]

# Compute all results in parallel; dask.compute returns a tuple with one word count per file
word_counts = dask.compute(*delayed_results)

In [None]:
print(f"Total word count: {sum(word_counts)}")
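
To check that the delayed version really runs tasks concurrently, one option is to time it against a plain loop. The sketch below is an assumption-laden illustration: `slow_square` is a hypothetical stand-in for `process_file` with a shorter sleep, and the threaded scheduler is requested explicitly through the `scheduler` keyword of `dask.compute`.

```python
import time

import dask


def slow_square(x):
    # Hypothetical stand-in for process_file: sleep briefly, then return a value.
    time.sleep(0.2)
    return x * x


inputs = list(range(5))

# Sequential baseline: each call waits for the previous one to finish.
start = time.perf_counter()
sequential = [slow_square(x) for x in inputs]
sequential_seconds = time.perf_counter() - start

# Delayed version: build the tasks first, then run them on the threaded scheduler.
tasks = [dask.delayed(slow_square)(x) for x in inputs]
start = time.perf_counter()
parallel = dask.compute(*tasks, scheduler="threads")
parallel_seconds = time.perf_counter() - start

print(f"sequential: {sequential_seconds:.2f}s, parallel: {parallel_seconds:.2f}s")
```

Threads help here because `time.sleep` and file I/O release the GIL. For CPU-bound pure-Python work, `scheduler="processes"` may be the better choice.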

## 4. Visualize the Task Graph

In [None]:
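# Rendering the graph requires both the Graphviz system library and the
# `graphviz` Python package. With conda, this command installs both:
#     conda install python-graphviz
# Combine the per-file counts into one delayed sum so the graph ends in a single node
total = dask.delayed(sum)(delayed_results)
total.visualize()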
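
If Graphviz is not available, the graph can still be inspected as a mapping of task keys. This is a rough sketch rather than a replacement for `visualize()`: `word_count` and `texts` are hypothetical in-memory stand-ins for the file-based tasks, and `__dask_graph__()` is the method from Dask's collection protocol that returns the underlying graph.

```python
import dask


def word_count(text):
    # Hypothetical helper: count whitespace-separated words in a string.
    return len(text.split())


texts = ["dask is lazy", "graphs describe tasks"]  # stand-ins for file contents
parts = [dask.delayed(word_count)(t) for t in texts]
total = dask.delayed(sum)(parts)

# The graph is a mapping from task keys to task definitions; collect its keys.
graph_keys = set(total.__dask_graph__())
for key in sorted(graph_keys, key=str):
    print(key)

result = total.compute()
```

Each per-item task and the final `sum` task appear as separate keys, which mirrors the nodes that `visualize()` would draw.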

## 5. Clean Up the Dummy Data

In [None]:
import shutil
shutil.rmtree('temp_data')
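
As an alternative to removing the directory by hand, the standard library's `tempfile.TemporaryDirectory` deletes the directory automatically when its `with` block exits. The sketch below assumes a simplified version of the dummy data; the file contents and the count of three files are illustrative only.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    # Write a few small dummy files inside the temporary directory.
    for i in range(3):
        (data_dir / f"file_{i}.txt").write_text("dask " * (i + 1))
    created = sorted(p.name for p in data_dir.iterdir())

# The directory and its files are removed once the with block ends.
still_exists = data_dir.exists()
print(created, still_exists)
```

With this approach, any Dask computation that reads the files has to run inside the `with` block, before the directory is removed.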