# Tutorial 03: Devices and Constraints

So far, all the examples have run on the default CPU device only.

In general, Parla can manage tasks across heterogeneous hardware (CPUs and GPUs).

**Note: This tutorial requires at least one available GPU.**

Here we will cover:
- Task Device Placement 
- Task Constraints (memory, compute resources)
- Function Variants

In [1]:
from typing import Callable, Optional
from time import perf_counter, sleep
import os

def set_numpy_threads(threads: int = 1):
    """
    Numpy can be configured to use multiple threads for linear algebra operations.
    The backend used by numpy can vary by installation.
    This function attempts to set the number of threads for the most common backends.
    MUST BE CALLED BEFORE IMPORTING NUMPY.

    Args:
        threads (int, optional): The number of threads to use. Defaults to 1.
    """

    os.environ["NUMEXPR_NUM_THREADS"] = str(threads)
    os.environ["OMP_NUM_THREADS"] = str(threads)
    os.environ["OPENBLAS_NUM_THREADS"] = str(threads)
    os.environ["VECLIB_MAXIMUM_THREADS"] = str(threads)
    os.environ["MKL_NUM_THREADS"] = str(threads)

    try:
        # If the mkl / mkl-service modules are installed, setting the thread count
        # through them is the preferred way to control the MKL backend.
        import mkl

        mkl.set_num_threads(threads)
    except ImportError:
        pass
    
set_numpy_threads(1)

# Handle for Parla runtime
from parla import Parla

# Spawning task API
from parla.tasks import (
    spawn,
    TaskSpace,
    get_current_context,
    get_current_task,
)

# Device Architectures for placement
from parla.devices import cpu, gpu


def run(function: Callable[[], Optional[TaskSpace]]):
    assert callable(function), "The function argument must be callable."

    # Start the Parla runtime.
    with Parla():
        # Create a top-level task to kick off the computation
        @spawn(placement=cpu, vcus=0)
        async def top_level_task():
            # Note that unless function returns a TaskSpace, this will NOT be blocking.
            # If you want to wait on the completion of the tasks launched by function, you must return a TaskSpace that contains their terminal tasks.
            await function()

    # The runtime exits at the end of the context manager.
    # All tasks are guaranteed to be complete at this point

USE_PYTHON_RUNAHEAD:  True
CUPY_ENABLED:  True
PREINIT_THREADS:  True
DEFAULT SYNC:  0


## Task Placement

The placement argument to `@spawn()` determines where the task will be placed.

It can be a single device type or a list of possible placement options. 

At runtime, Parla's mapping policy picks one of the listed options, typically based on available resources and load balancing.

When running this tutorial, you may see different devices being used for the "either" task.

(Note: the runtime may prefer scheduling on your GPU.)

#### Example: `tutorial/11_task_placement.py`


In [2]:
async def device_task():
    T = TaskSpace("Device")

    for i in range(2):
        @spawn(T["cpu", i], placement=cpu)
        def cpu_task():
            # Runs on a CPU device
            print(f"Hello from {get_current_task()}, running on {get_current_context()}")

        @spawn(T["gpu", i], placement=gpu)
        def gpu_task():
            # Runs on a GPU device 
            print(f"Hello from {get_current_task()}, running on {get_current_context()}")
            
        @spawn(T["either", i], placement=[cpu, gpu])
        def either_task():
            # Runs on either a CPU or GPU device
            print(f"Hello from {get_current_task()}, running on {get_current_context()}")
            
        
run(device_task)

Hello from Task(Device_cpu_0), running on CPUEnvironment(CPU:0)
Hello from Task(Device_gpu_0), running on GPUEnvironment(GPU:0)
Hello from Task(Device_cpu_1), running on CPUEnvironment(CPU:0)
Hello from Task(Device_either_0), running on GPUEnvironment(GPU:0)
Hello from Task(Device_gpu_1), running on GPUEnvironment(GPU:0)
Hello from Task(Device_either_1), running on GPUEnvironment(GPU:0)


In the above example, each placement names a device architecture, which lets the runtime schedule the task on any device of that type.

Specific devices can be specified by listing an index. For example, `placement=[gpu(i)]` means the task must be placed on the i-th GPU. 
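
To make the difference between an architecture and a specific device concrete, here is a minimal pure-Python sketch of how a placement list could be expanded into candidate devices. It does not use Parla: the `expand_placement` helper, the `(architecture, index)` tuples, and the machine layout are hypothetical stand-ins chosen for illustration, and the real mapping policy also weighs load and resources.

```python
from typing import List, Optional, Tuple

# A device is modelled as (architecture, index). This is NOT Parla's representation.
Device = Tuple[str, int]
# A placement entry is (architecture, index or None). None means "any instance".
Entry = Tuple[str, Optional[int]]


def expand_placement(placement: List[Entry], machine: List[Device]) -> List[Device]:
    """Return every device on `machine` that satisfies at least one placement entry."""
    candidates = []
    for device in machine:
        arch, index = device
        for want_arch, want_index in placement:
            if arch == want_arch and (want_index is None or want_index == index):
                candidates.append(device)
                break  # avoid listing the same device twice
    return candidates


# Hypothetical machine: one CPU device and two GPUs.
machine = [("cpu", 0), ("gpu", 0), ("gpu", 1)]

any_gpu = expand_placement([("gpu", None)], machine)                 # like placement=gpu
second_gpu = expand_placement([("gpu", 1)], machine)                 # like placement=[gpu(1)]
either = expand_placement([("cpu", None), ("gpu", None)], machine)   # like placement=[cpu, gpu]

print(any_gpu)
print(second_gpu)
print(either)
```

An architecture entry leaves the choice among matching devices to the runtime, while an indexed entry narrows the candidates to exactly one device.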

### GPU Tasks 

GPU tasks are still hosted on the CPU. The task is assigned to a host thread and its Python code block is executed there.
A GPU task does not need to be "pure" and can still execute CPU functions.

Using a GPU task has the following benefits:
- The active CuPy device context is set to the chosen GPU device 
- A CUDA/HIP Stream is pulled from the device's stream pool and set as the active stream
- An event for the task's kernel completion is created on the device's stream

## Task Constraints

The `@spawn()` decorator can also be given constraints on the resources required by the task.
There are two main resource types:
- Memory, the size (bytes) of non-persistent workspace the task is expected to use during its lifetime
- Virtual Compute Units (VCUs), the expected fraction of the device the task will use

They are passed to `@spawn()` as the `memory` and `vcus` arguments, respectively:
`@spawn(placement=[gpu], memory=1e9, vcus=0.5)`

### Virtual Compute Units

VCUs limit the parallelism on each device. For example, `vcus=0.5` means the task will use half of the device's compute resources and that two tasks of this weight could be scheduled onto that device concurrently.

Each task that runs concurrently on the same GPU gets its own CUDA stream.
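
The bookkeeping implied by the `memory` and `vcus` constraints can be sketched in plain Python. The `DevicePool` class and the 16 GB capacity below are hypothetical, and Parla's actual scheduler is more involved. The sketch only shows how the two requests bound how many tasks may occupy a device at once.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass
class DevicePool:
    """Toy resource tracker for one device (hypothetical, not Parla's scheduler)."""

    memory: int                   # free bytes
    vcus: Fraction = Fraction(1)  # free fraction of the device

    def try_acquire(self, memory: int, vcus: Fraction) -> bool:
        """Reserve resources for a task if they fit; return whether it was admitted."""
        if memory <= self.memory and vcus <= self.vcus:
            self.memory -= memory
            self.vcus -= vcus
            return True
        return False

    def release(self, memory: int, vcus: Fraction) -> None:
        """Return a finished task's resources to the pool."""
        self.memory += memory
        self.vcus += vcus


# Assumed capacity: a GPU with 16 GB of free memory.
gpu0 = DevicePool(memory=16 * 10**9)

# Mirrors memory=1e9, vcus=0.5 from the spawn example above.
request = dict(memory=10**9, vcus=Fraction(1, 2))

# Three tasks ask for the device back to back.
admitted = [gpu0.try_acquire(**request) for _ in range(3)]
print(admitted)

# Once one task finishes, the waiting task fits.
gpu0.release(**request)
admitted_after_release = gpu0.try_acquire(**request)
print(admitted_after_release)
```

In this model the third request is refused because the VCUs are exhausted, even though plenty of memory remains. Either resource can be the limiting one.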



#### Example: `tutorial/12_task_constraints.py`

In [7]:
async def vcu_example():
    
    T = TaskSpace("T")
    import numpy as np
    N = 10000
    n_tasks = 5
    
    # Try changing the cost to increase parallelism
    cost = 1 # Serial
    # cost = 1/2 # 2 Active CPU Threads
    # cost = 1/4 # 4 Active CPU Threads
    
    start_t = perf_counter()
    vectors = [np.random.rand(N) for _ in range(n_tasks)]
    matrices = [np.random.rand(N, N) for _ in range(n_tasks)]
    
    for i in range(n_tasks):
        @spawn(T[i], placement=cpu, vcus=cost)
        def task():
            print("Starting:", T[i], flush=True)
            v = vectors[i]
            M = matrices[i]
            for _ in range(1):
                v = M @ v
            print("Completed: ", T[i], flush=True)
    
    @spawn(T["sum"], [T[:n_tasks]], placement=cpu, vcus=cost)
    def sum_task():
        print("Starting sum", flush=True)
        vectors[0] = sum(vectors)
        print("Finished sum", flush=True)
        
    await T
    end_t = perf_counter()
    
    print("Elapsed Time: ", end_t - start_t)
    
run(vcu_example)

Starting: Task(T_0)
Completed:  Task(T_0)
Starting: Task(T_1)
Completed:  Task(T_1)
Starting: Task(T_2)
Completed:  Task(T_2)
Starting: Task(T_3)
Completed:  Task(T_3)
Starting: Task(T_4)
Completed:  Task(T_4)
Starting sum
Finished sum
Elapsed Time:  6.829967883008067
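
With `cost = 1` each task claims the whole CPU device, which is why the output above is strictly serial. A back-of-the-envelope estimate of how `vcus` changes the number of scheduling "waves" is sketched below. It is a simplification that assumes equal-length tasks and ignores memory limits and the final sum task. The helper names `max_concurrent` and `waves` are illustrative and not part of Parla.

```python
from fractions import Fraction
from math import ceil, floor


def max_concurrent(cost: Fraction) -> int:
    """Number of tasks of weight `cost` that fit on one device at the same time."""
    if not 0 < cost <= 1:
        raise ValueError("cost must be in the interval (0, 1]")
    return floor(1 / cost)


def waves(n_tasks: int, cost: Fraction) -> int:
    """Rounds of execution needed if every task takes the same amount of time."""
    return ceil(Fraction(n_tasks, max_concurrent(cost)))


n_tasks = 5  # same task count as the example above
costs = (Fraction(1), Fraction(1, 2), Fraction(1, 4))

# Map each cost to (tasks running at once, number of waves).
estimates = {cost: (max_concurrent(cost), waves(n_tasks, cost)) for cost in costs}

for cost, (concurrent, rounds) in estimates.items():
    print(f"vcus={cost}: {concurrent} at a time, {rounds} wave(s)")
```

Under these assumptions, five tasks need five waves at `vcus=1`, three at `vcus=1/2`, and two at `vcus=1/4`. Measured speedups will usually be smaller, since concurrent tasks share the same cores and memory bandwidth.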


## Function Variants

Because tasks can run on either CPUs or GPUs, it can be advantageous to provide a different implementation of the same function for each device type.

Parla provides annotations to overload functions and dispatch them based on the current task's device context.

#### Example: `tutorial/13_function_variants.py`

In [16]:
from parla.tasks import specialize
from parla.array import clone_here
import numpy as np
import cupy as cp

In [17]:
@specialize
def function(A: np.ndarray):
    print("Running Default Implementation", flush=True)
    return np.linalg.eigh(A)
    
    
@function.variant(gpu)
def function_gpu(A: cp.ndarray):
    print("Running GPU Implementation", flush=True)
    return cp.linalg.eigh(A)

In [21]:
def specialization_example():
    A = np.random.rand(1000, 1000)
    B = np.copy(A)
    T = TaskSpace("T")
    
    @spawn(T[0], placement=cpu)
    def t1():
        print("Running CPU Task", flush=True)
        A_local = clone_here(A)
        C = function(A_local)
        print("Completed CPU Task", flush=True)
        
    @spawn(T[1], [T[0]], placement=gpu)
    def t2():
        print("Running GPU Task", flush=True)
        B_local = clone_here(B)
        C = function(B_local)
        print("Completed GPU Task", flush=True)
        
    return T
    
run(specialization_example)

Running CPU Task
Running Default Implementation
Completed CPU Task
Running GPU Task
Running GPU Implementation
Completed GPU Task
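
The dispatch pattern seen above can be pictured with a small pure-Python model. This is a toy sketch, not Parla's implementation: the `Specialized` class, the `current_arch` context variable, and the string architecture names are all assumptions made for illustration. In Parla, the architecture comes from the device context of the running task.

```python
from contextvars import ContextVar
from typing import Callable, Dict

# Stand-in for "the architecture of the task that is currently running".
current_arch: ContextVar[str] = ContextVar("current_arch", default="cpu")


class Specialized:
    """Toy version of a specializable function (hypothetical, not Parla's code)."""

    def __init__(self, default: Callable):
        self._default = default
        self._variants: Dict[str, Callable] = {}

    def variant(self, arch: str):
        """Return a decorator that registers an implementation for `arch`."""
        def register(impl: Callable) -> Callable:
            self._variants[arch] = impl
            return impl
        return register

    def __call__(self, *args, **kwargs):
        # Fall back to the default implementation when no variant matches.
        impl = self._variants.get(current_arch.get(), self._default)
        return impl(*args, **kwargs)


@Specialized
def describe(x):
    return f"default({x})"


@describe.variant("gpu")
def describe_gpu(x):
    return f"gpu({x})"


on_cpu = describe(1)               # no variant registered for "cpu"
token = current_arch.set("gpu")    # pretend we are now inside a GPU task
on_gpu = describe(1)
current_arch.reset(token)          # leave the pretend GPU task
back_on_cpu = describe(1)

print(on_cpu, on_gpu, back_on_cpu)
```

Callers always invoke the same name, and the lookup at call time decides which body runs. This is what lets one task body be reused across `placement=[cpu, gpu]`.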
