# Tutorial 01: Introduction to Parla

This tutorial introduces basic concepts in task-based parallel programming using Parla.
We will cover:
- What is Parla?
- Installation
- Running Tasks


## What is Parla?

Parla is a task-parallel programming library for Python. Parla targets the orchestration of heterogeneous (CPU+GPU) workloads on a single shared-memory machine. We provide features for resource management, task variants, and automated scheduling of data movement between devices.

We design for gradual adoption, allowing users to port sequential code to parallel execution with little effort.

The Parla runtime is multi-threaded but single-process so that all tasks share one address space. In practice, this means that the main workload within each task *must* release CPython's Global Interpreter Lock (GIL) to allow tasks to gain parallel speedup. Workloads built on many common numerical libraries, such as NumPy, SciPy, Numba, and PyKokkos, can do this.

## Installation
We provide a preconfigured Docker container for running Parla.
This is the easiest way to get started with Parla, and is the recommended way to run the tutorial.
Please refer to the [README](../../../README.md) for instructions on using the provided container.

Briefly, to install Parla on a local machine you will need to clone the repository and install it using pip.

```bash
python -m pip install numpy scipy
git clone https://github.com/ut-parla/parla-experimental.git
cd parla-experimental
git submodule update --init --recursive
pip install .
```


## Running Tasks
Parallel programming in Parla is centered around ***tasks***.
Tasks are annotated blocks of Python code that may be executed asynchronously and concurrently with each other.

Parla tasks can be annotated with various constraints and properties that allow the runtime to make intelligent scheduling decisions about where and when they should execute. We'll cover these in more detail in later tutorials.

For the moment we'll look at the simplest possible tasks and how to run them.

### Example 1: `tutorial/01_hello.py`

First, we'll import the Parla runtime and some of its utilities.



```python
from typing import Callable, Optional
from time import perf_counter, sleep

# Handle for the Parla runtime
from parla import Parla

# Spawning tasks
from parla.tasks import spawn, TaskSpace
from parla.devices import cpu
```

The scope of the Parla runtime is defined by a `Parla()` context manager.
This context manager is used to initialize the runtime and ensure that all tasks have completed before exiting the program.

```python
with Parla():
    # Parla runtime is initialized here
    # Tasks may be submitted and executed
# Parla runtime is finalized here
```
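The "wait for everything on exit" behavior has a close relative in the standard library, which may help build intuition. The sketch below is only an analogy built on `concurrent.futures.ThreadPoolExecutor`; it is not how Parla is implemented, and `run_and_wait` and `record` are illustrative names.

```python
from concurrent.futures import ThreadPoolExecutor
from time import sleep


def run_and_wait(n_jobs: int = 4) -> list:
    results = []

    def record(i: int) -> None:
        sleep(0.05)          # stand-in for real work
        results.append(i)    # list.append is atomic in CPython

    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        for i in range(n_jobs):
            pool.submit(record, i)
        # Work may still be in flight at this point.

    # Leaving the `with` block calls pool.shutdown(wait=True),
    # so every submitted job has finished by the time we get here.
    return sorted(results)


finished = run_and_wait()
print(finished)
```

As with `Parla()`, the end of the `with` block acts as a synchronization point for everything submitted inside it.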

Although this can be done in the global namespace of a Python program, doing so is typically bad practice.
Within a task, global variables are not captured by value and may change before the task is executed.

To avoid this, in this tutorial we provide a wrapper function `run` that will execute a program within a `Parla()` context manager. This helps ensure that tasks are declared locally and provides a top-level task to orchestrate the execution of the program.

Don't worry about the details of this function for now; we'll cover them in later tutorials. Just know that you should wrap your program in a call to `run` to ensure that it executes correctly.

```python
def run(function: Callable[[], Optional[TaskSpace]]):
    assert callable(function), "The function argument must be callable."

    # Start the Parla runtime.
    with Parla():
        # Create a top-level task to kick off the computation
        @spawn(placement=cpu, vcus=0)
        async def top_level_task():
            await function()

    # The runtime exits at the end of the context manager.
    # All tasks are guaranteed to be complete at this point.
```

Now, let's look at spawning a task.

In Parla, tasks are defined and launched using the `@spawn` decorator. 
The `@spawn` decorator captures the code block and submits it to the runtime for asynchronous execution.
*As soon as the task is submitted, the runtime is free to schedule it.*


```python
# A simple task that prints a message
async def my_first_task():
    @spawn()
    def hello():
        print("Hello from the task!")

print("Running my_first_task example...")
run(my_first_task)
```

```
Running my_first_task example...
Hello from the task!
```
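To make the "decorator that launches work" idea concrete, here is a toy analogue written with the standard library only. It is a sketch under stated assumptions: `toy_spawn` is a hypothetical name, and the real `@spawn` additionally handles scheduling, placement, and the other properties mentioned above.

```python
from concurrent.futures import Future, ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=2)


def toy_spawn():
    """Hypothetical decorator factory: submit the decorated function at once."""
    def decorator(fn) -> Future:
        # The function body is handed to the pool at definition time.
        # The decorated name is rebound to the Future that tracks it.
        return _pool.submit(fn)
    return decorator


@toy_spawn()
def hello():
    return "Hello from the task!"


# `hello` is now a Future, not a function: the work was already submitted.
message = hello.result()  # block until the work has finished
print(message)
_pool.shutdown()
```

The point to notice is that nothing ever calls `hello()` explicitly; decorating the function is what launches it.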


### Example 2: `tutorial/02_independent_tasks.py`

Tasks run concurrently with respect to each other and the main program.
Below, we spawn 4 embarrassingly parallel tasks and print a message from each one. 
While the tasks are more likely to execute in the order they are spawned, the runtime is free to schedule them in any order. 
This means that the order of the printed messages may vary between runs.


```python
async def independent_tasks():
    n_tasks = 4
    for i in range(n_tasks):
        @spawn()
        def task():
            # Local variables are captured by a shallow copy of everything in the closure of `task()`
            # i is captured by value and is not shared between tasks
            print(f"Hello from Task {i}! \n", flush=True)
            # flush=True is needed to ensure that the print statement is not buffered.

print("Running independent_tasks example...")
run(independent_tasks)
```

```
Running independent_tasks example...
Hello from Task 0! 

Hello from Task 2! 

Hello from Task 1! 

Hello from Task 3! 

```


Variables are passed into tasks by being captured in the annotated function's "closure".
All variables in the local scope where a task is spawned are captured by *shallow copy*. 

Value types (like integers, floats, strings, and tuples) are copied by value, but compound objects with internal reference types (like lists, dictionaries, and NumPy arrays) share their underlying memory between tasks.

**Note that this is different from the default behavior of Python functions, which capture all variables by reference in the closure.**
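The difference can be seen in plain Python, without Parla. The sketch below first shows the default late-binding behavior of closures, then a common idiom for snapshotting a value at definition time, and finally what a shallow copy does and does not share. All function and variable names here are illustrative.

```python
import copy


def late_binding() -> list:
    # Plain closures look `i` up when they are *called*, not when defined.
    fns = []
    for i in range(4):
        def f():
            return i
        fns.append(f)
    return [fn() for fn in fns]


def snapshot_binding() -> list:
    # A default argument is evaluated at definition time, so each
    # function keeps the value `i` had when it was created.
    fns = []
    for i in range(4):
        def g(i=i):
            return i
        fns.append(g)
    return [fn() for fn in fns]


def shallow_copy_demo() -> dict:
    scope = {"count": 1, "items": [1, 2]}
    snapshot = copy.copy(scope)   # new dict, same inner objects
    scope["count"] = 99           # rebinding: not seen by the snapshot
    scope["items"].append(3)      # mutation: seen, the list is shared
    return snapshot


print(late_binding())       # [3, 3, 3, 3]
print(snapshot_binding())   # [0, 1, 2, 3]
print(shallow_copy_demo())  # {'count': 1, 'items': [1, 2, 3]}
```

The shallow-copy case mirrors the rule above: the integer behaves as if copied, while the list is still shared with the original scope.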

### Example 3: `tutorial/03_shared_memory.py`

```python
async def independent_tasks_dictionary():
    n_tasks = 4
    shared_dict = {}
    for i in range(n_tasks):
        @spawn()
        def task():
            # Local variables are captured by a shallow copy of everything in the closure of `task()`
            # A single dict item assignment is atomic in CPython (serialized by the GIL)
            shared_dict[i] = i**2

    # For now, we need to sleep to ensure that the tasks have completed
    # Later, we'll discuss barriers, returns, and general control flow
    sleep(0.1)
    print("Shared Dictionary: ", shared_dict)

print("Running independent_tasks_dictionary example...")
run(independent_tasks_dictionary)
```

```
Running independent_tasks_dictionary example...
Shared Dictionary:  {0: 0, 1: 1, 2: 4, 3: 9}
```


Passing output through shared objects is the most common way to communicate between tasks in Parla.
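The same shared-object pattern can be sketched with plain `threading`, which also shows why an explicit wait is more reliable than a fixed `sleep`. This is a standard-library analogy, not Parla's API; Parla's own synchronization tools are left to the later tutorials, and `squares_via_shared_dict` is an illustrative name.

```python
import threading


def squares_via_shared_dict(n_tasks: int = 4) -> dict:
    shared = {}

    def worker(i: int) -> None:
        # Each worker writes to its own key, so no two writes conflict.
        shared[i] = i ** 2

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for completion explicitly instead of sleeping
    return shared


result = squares_via_shared_dict()
print("Shared Dictionary: ", dict(sorted(result.items())))
```

Because every thread is joined before the dictionary is read, the result does not depend on how long the workers happen to take.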

## General Advice for Writing Effective Parla Tasks 

Unlike many Python tasking systems, Parla runs tasks in a thread-based environment. All tasks execute within the same process and, unfortunately, share the same Python interpreter (if run with CPython). A task must acquire the Python Global Interpreter Lock (GIL) to execute any line of native Python code. This means pure Python code executes serially and shows no parallel speedup.

Tasks only achieve true parallelism when they call out to compiled libraries and external code that releases the GIL, such as NumPy, CuPy, or JIT-compiled Numba kernels. Parla is well suited to parallelism in compute-heavy domains, but less suited to workloads that need to execute many routines from libraries implemented in native Python (like SymPy).
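The effect of releasing the GIL can be observed with threads alone. In the sketch below, `time.sleep` stands in for a compiled kernel, since it releases the GIL while it waits; the helper names and the 0.2 s duration are arbitrary choices for illustration, and exact timings will vary by machine.

```python
from concurrent.futures import ThreadPoolExecutor
from time import perf_counter, sleep


def blocking_work(seconds: float) -> float:
    # time.sleep releases the GIL while waiting, much like a compiled
    # routine that releases it for the duration of its computation.
    sleep(seconds)
    return seconds


def timed_parallel(n_tasks: int = 4, seconds: float = 0.2) -> float:
    start = perf_counter()
    with ThreadPoolExecutor(max_workers=n_tasks) as pool:
        list(pool.map(blocking_work, [seconds] * n_tasks))
    return perf_counter() - start


elapsed = timed_parallel()
print(f"4 x 0.2 s of GIL-free work took {elapsed:.2f} s")
```

Run back to back, the four calls would need at least 0.8 s; with the GIL released they overlap and finish in roughly the time of one. A pure-Python loop in place of `sleep` would not overlap this way under CPython's GIL.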

To write code that performs well in Parla, tasks should hold the GIL as little as possible. For a 50 ms task, the GIL should be held for less than 5% of the total task time (about 2.5 ms) to avoid noticeable overhead.

Launching tasks with threads, however, does give us some advantages. Tasks share the same address space, allowing zero-copy operations on any memory buffer. We do not need to worry about managing or importing separate module lists in different persistent Python processes, and any JIT compilation by Numba or other external libraries is automatically reused by subsequent tasks.