# Tutorial 04: Data Movement

This tutorial shows how to move data between devices in two ways:
- Manual data movement with `clone_here` and `copy`
- Prefetched data movement with Parla Managed Arrays (PArrays)


In [2]:
from typing import Callable, Optional
from time import perf_counter, sleep
import os

def set_numpy_threads(threads: int = 1):
    """
    Numpy can be configured to use multiple threads for linear algebra operations.
    The backend used by numpy can vary by installation.
    This function attempts to set the number of threads for the most common backends.
    MUST BE CALLED BEFORE IMPORTING NUMPY.

    Args:
        threads (int, optional): The number of threads to use. Defaults to 1.
    """

    os.environ["NUMEXPR_NUM_THREADS"] = str(threads)
    os.environ["OMP_NUM_THREADS"] = str(threads)
    os.environ["OPENBLAS_NUM_THREADS"] = str(threads)
    os.environ["VECLIB_MAXIMUM_THREADS"] = str(threads)
    os.environ["MKL_NUM_THREADS"] = str(threads)

    try:
        # If the mkl module (from the mkl-service package) is installed, use it to set the thread count.
        # This is the preferred method for controlling the MKL backend.
        import mkl

        mkl.set_num_threads(threads)
    except ImportError:
        pass
    
set_numpy_threads(1)
import numpy as np 
import cupy as cp 

# Handle for Parla runtime
from parla import Parla

# Spawning task API
from parla.tasks import (
    spawn,
    TaskSpace,
    get_current_context,
    get_current_task,
)

# Device Architectures for placement
from parla.devices import cpu, gpu


def run(function: Callable[[], Optional[TaskSpace]]):
    assert callable(function), "The function argument must be callable."

    # Start the Parla runtime.
    with Parla():
        # Create a top-level task to kick off the computation
        @spawn(placement=cpu, vcus=0)
        async def top_level_task():
            # Note that unless function returns a TaskSpace, this will NOT be blocking.
            # If you want to wait on the completion of the tasks launched by function, you must return a TaskSpace that contains their terminal tasks.
            await function()

    # The runtime exits at the end of the context manager.
    # All tasks are guaranteed to be complete at this point.

## Manual Data Movement Annotations - `clone_here` and `copy`

Parla provides two functions for moving data between devices within a task: `clone_here` and `copy`.
- `clone_here` copies the data to the device where the task is running and returns the new copy.
- `copy` writes into an existing data buffer from any source.
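
The difference between the two calls is whether a new buffer is allocated. The following is a minimal sketch in plain Python, not Parla's implementation: `ToyBuffer`, `toy_clone_here`, and `toy_copy` are hypothetical names that model a device as a string label and an array as a list.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyBuffer:
    """Hypothetical stand-in for an array that lives on one device."""

    device: str
    data: List[float]


def toy_clone_here(source: ToyBuffer, here: str) -> ToyBuffer:
    """Model of clone_here: allocate a NEW buffer on the current device."""
    return ToyBuffer(device=here, data=list(source.data))


def toy_copy(destination: ToyBuffer, source: ToyBuffer) -> None:
    """Model of copy: overwrite an EXISTING buffer; nothing is allocated."""
    if len(destination.data) != len(source.data):
        raise ValueError("destination and source must have the same length")
    destination.data[:] = source.data


host = ToyBuffer("cpu", [1.0, 2.0, 3.0])
on_gpu = toy_clone_here(host, here="gpu:0")  # new buffer on the "current" device
on_gpu.data[0] = 10.0                        # the host buffer is unaffected
toy_copy(host, on_gpu)                       # write the result back into host
```

In this toy, `host` keeps its identity and device label after `toy_copy`; only its contents change.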

In [3]:
from parla.array import clone_here, copy

#### Example: `tutorial/14_manual_data_movement.py`

In [4]:
def clone_here_copy_example_wrapper():
    import numpy as np
    import cupy as cp
    M = 5
    N = 5
    A = np.random.rand(M)
    B = cp.arange(N)
    
    def clone_here_copy_example():
        T = TaskSpace("T")
        
        @spawn(placement=[cpu if np.random.rand() < 0.5 else gpu])
        def task():
            print(get_current_task(), " running on ", get_current_context())
            C = clone_here(A)
            print("C is a", "Numpy Array" if isinstance(C, np.ndarray) else f"Cupy Array on GPU[{C.device}]", flush=True)
            
        return T
    
    run(clone_here_copy_example)
        
clone_here_copy_example_wrapper()

Task(global_1)  running on  GPUEnvironment(GPU:0)
C is a Cupy Array on GPU[<CUDA Device 0>]


## Prefetched Data Movement - via Parla Managed Arrays (PArrays)

Parla provides a data structure for data movement and coherence across devices: `PArrays`. 
PArrays are a lightweight wrapper around CuPy and NumPy ndarrays that:
- Allow the runtime to track the location of the data
- Allow the runtime to prefetch data to the device where a task will run
- Maintain a coherence protocol across multiple valid distributed copies of the same data.

PArrays can be created directly from NumPy or CuPy ndarrays. 

```python 
import numpy as np
from parla.array import asarray

A = np.random.rand(1000, 1000)
A_wrapped = asarray(A)
```

Once an array has been wrapped, it can be passed in the dataflow arguments of `@spawn`.
The runtime will then automatically prefetch the data to the device where the task will run.
- `input` moves the data to the device and makes it available for reading.
- `inout` moves the data to the device and makes it available for reading and writing. This invalidates any other copies of the data.

```python
@spawn(placement=gpu, inout=[A_wrapped])
def compute_A():
    # Do something with A_wrapped on the GPU, e.g. update it in place.
    A_wrapped[:] = A_wrapped + 1
```
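
The effect of `input` and `inout` on the copies of an array can be pictured as a small state machine. The sketch below is a toy model, not Parla's coherence implementation: the class `ToyCoherence` and its methods are hypothetical, and the state names are borrowed from the `print_overview()` output in this tutorial's examples. Marking the initial copy as `MODIFIED` is an assumption of the toy.

```python
from typing import Dict

INVALID, SHARED, MODIFIED = "INVALID", "SHARED", "MODIFIED"


class ToyCoherence:
    """Hypothetical per-array table of copy states, one entry per device."""

    def __init__(self, owner: str) -> None:
        # Assumption of this toy: the only copy starts out as MODIFIED.
        self.states: Dict[str, str] = {owner: MODIFIED}

    def read(self, device: str) -> None:
        """Model of `input`: make a valid copy available on `device`."""
        if self.states.get(device, INVALID) != INVALID:
            return  # the local copy is already valid, nothing to move
        # Fetch from a valid copy; every valid copy becomes SHARED.
        for other in list(self.states):
            if self.states[other] != INVALID:
                self.states[other] = SHARED
        self.states[device] = SHARED

    def write(self, device: str) -> None:
        """Model of `inout`: `device` ends up with the only valid copy."""
        self.read(device)  # a write first needs valid data locally
        for other in list(self.states):
            self.states[other] = INVALID
        self.states[device] = MODIFIED


table = ToyCoherence(owner="cpu")
table.read("gpu:0")
after_input = dict(table.states)  # both copies are valid
table.write("gpu:0")
after_inout = dict(table.states)  # the CPU copy has been invalidated
```

Under these assumptions, a read leaves every valid copy `SHARED`, while a write leaves exactly one `MODIFIED` copy and marks the rest `INVALID`.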



#### Example: `tutorial/15_prefetched_data_movement.py`

In [5]:
from parla.array import asarray as parla_asarray


async def parray_example():
    A = np.random.rand(5)
    A = parla_asarray(A)

    @spawn(placement=[cpu if np.random.rand() < 0.5 else gpu], input=[A])
    def task():
        print(get_current_task(), " running on ", get_current_context())
        print(
            "A is a",
            "Numpy Array"
            if isinstance(A.array, np.ndarray)
            else f"Cupy Array on GPU[{A.array.device}]",
            flush=True,
        )
        A.print_overview()
        # If the task ran on the GPU, there is now a valid copy of A on both devices.


run(parray_example)

Task(global_1)  running on  GPUEnvironment(GPU:0)
A is a Cupy Array on GPU[<CUDA Device 0>]
---Overview of PArray
ID: 137737518180896, Name: NA, Parent_ID: None, Slice: None, Bytes: 40, Owner: GPU 0
At GPU 0: state: SHARED
At CPU: state: SHARED
---End of Overview


#### Example: `tutorial/16_write_invalidation.py`

In [6]:
async def parray_example():
    T = TaskSpace("T")

    A = np.ones(5)
    A = parla_asarray(A)

    @spawn(T[0], placement=[cpu if np.random.rand() < 0.5 else gpu], input=[A])
    def task():
        print(get_current_task(), " running on ", get_current_context())
        print(
            "A is a",
            "Numpy Array"
            if isinstance(A.array, np.ndarray)
            else f"Cupy Array on GPU[{A.array.device}]",
            flush=True,
        )
        A.print_overview()
        print(A.array)
        print("\n")

    @spawn(T[1], [T[0]], placement=[cpu if np.random.rand() < 0.5 else gpu], inout=[A])
    def task():
        print(get_current_task(), " running on ", get_current_context())
        print(
            "A is a",
            "Numpy Array"
            if isinstance(A.array, np.ndarray)
            else f"Cupy Array on GPU[{A.array.device}]",
            flush=True,
        )
        A.print_overview()
        A[:] = A + 1
        print("\n")

    @spawn(
        T[2],
        [T[0], T[1]],
        placement=[cpu if np.random.rand() < 0.5 else gpu],
        inout=[A],
    )
    def task():
        print(get_current_task(), " running on ", get_current_context())
        print(
            "A is a",
            "Numpy Array"
            if isinstance(A.array, np.ndarray)
            else f"Cupy Array on GPU[{A.array.device}]",
            flush=True,
        )
        A.print_overview()
        print(A.array)
        print("\n")


run(parray_example)

Task(T_0)  running on  GPUEnvironment(GPU:0)
A is a Cupy Array on GPU[<CUDA Device 0>]
---Overview of PArray
ID: 137737518206128, Name: NA, Parent_ID: None, Slice: None, Bytes: 40, Owner: GPU 0
At GPU 0: state: SHARED
At CPU: state: SHARED
---End of Overview
[1. 1. 1. 1. 1.]


Task(T_1)  running on  GPUEnvironment(GPU:0)
A is a Cupy Array on GPU[<CUDA Device 0>]
---Overview of PArray
ID: 137737518206128, Name: NA, Parent_ID: None, Slice: None, Bytes: 40, Owner: GPU 0
At GPU 0: state: MODIFIED
At CPU: state: INVALID
---End of Overview


Task(T_2)  running on  GPUEnvironment(GPU:0)
A is a Cupy Array on GPU[<CUDA Device 0>]
---Overview of PArray
ID: 137737518206128, Name: NA, Parent_ID: None, Slice: None, Bytes: 40, Owner: GPU 0
At GPU 0: state: MODIFIED
At CPU: state: INVALID
---End of Overview
[2. 2. 2. 2. 2.]
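
The tasks print in the order `T_0`, `T_1`, `T_2` because of the dependency lists passed to `@spawn`: `T[1]` waits on `T[0]`, and `T[2]` waits on both. As a sketch of that ordering only, the standard-library `graphlib` module can sort the same graph. The string names below are illustrative labels, not Parla objects, and the snippet does not model how Parla schedules tasks.

```python
from graphlib import TopologicalSorter

# Each key maps to the set of tasks it depends on, mirroring the
# dependency lists given to @spawn in the example above.
dependencies = {
    "T[0]": set(),
    "T[1]": {"T[0]"},
    "T[2]": {"T[0]", "T[1]"},
}

# static_order() yields each task only after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Because the three tasks form a chain, only one valid order exists, so the write in `T[1]` is always visible to `T[2]`.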


