
Memory manager design (for eviction and garbage collection) #102

Open
nicelhc13 opened this issue May 11, 2023 · 8 comments
@nicelhc13 (Contributor):

  • Terminology:

    • Garbage collection:
      • A PArray instance on a device is implicitly deallocated by the Python interpreter when its reference count drops to zero.
      • This happens only if the instance's coherency state is INVALID, or if it is VALID and has been written back to the CPU.
        • So, no task bodies refer to this instance.
        • When the coherency state is INVALID, or it is VALID and has been written back to the CPU, the PArray runtime also releases its reference.
    • Eviction:
      • A PArray instance on a device may be deallocated if no mapped (including running) tasks are referencing it NOW, and it is not the only valid instance.
      • This PArray instance could still be needed by not-yet-spawned tasks later.
        • But in that case, the instance will be recreated through those tasks' corresponding data-move tasks.
  • Garbage collection:

    • We can deallocate instances by relying on the Parla/PArray runtime:
      • Removing references from the function body and arguments (to drop closure references)
      • Removing references from the PArray runtime
  • Eviction:

    • For this, we need a memory manager (called mm). For this doc, I am assuming an LRU-based mm.
    • Mechanism:
      • The C++ mapper reserves PArrays if a target task uses PArrays:
        • The mm manages a PArray reference count table.
        • Create an entry for a PArray if it is used for the first time, or find its entry and increase its reference count.
      • When a C++ worker thread finishes a task execution and cleans up its resource usage, it decreases the reference counts of the task's PArrays in the mm.
      • If a C++ worker thread notices that a PArray's reference count becomes 0, it stores the PArray's address in the mm's zero-referenced candidate vector.
        • This vector is guarded by a mutex. The PArrays stored in the vector are not necessarily still zero-referenced, since new tasks that use them can be spawned/mapped while a worker thread updates the vector.
      • The C++ mapper checks the vector, and if the PArrays in it are still zero-referenced, moves them to its zero-referenced LRU list.
    • When memory pressure is high, the mm (called at the mapping phase) evicts the zero-referenced PArrays one by one.
      • Call evict() to remove the reference in the PArray runtime <-- This is problematic since we can only call evict() from Python.
      • Remove the PArray reference from the list.
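The reference-count table and zero-referenced candidate vector described above could be sketched roughly as follows. All class and method names (`MemoryManager`, `acquire`, `release`, `drain_confirmed_zero`) are hypothetical, not Parla's actual API; this only illustrates the mapper/worker protocol.

```cpp
#include <cstdint>
#include <mutex>
#include <unordered_map>
#include <vector>

// Hypothetical sketch of the mm's reference-count table.
class MemoryManager {
 public:
  // Mapper side: create an entry on first use, otherwise bump the count.
  void acquire(std::uintptr_t parray) {
    std::lock_guard<std::mutex> lock(mtx_);
    ++refcount_[parray];  // operator[] value-initializes to 0 on first use
  }

  // Worker side: called during task cleanup. Zero-referenced PArrays
  // become *candidates* only -- newly mapped tasks may still revive them.
  void release(std::uintptr_t parray) {
    std::lock_guard<std::mutex> lock(mtx_);
    if (--refcount_[parray] == 0) {
      zero_candidates_.push_back(parray);
    }
  }

  // Mapper side: re-check candidates under the lock and hand back the
  // ones that are still zero-referenced (destined for the LRU list).
  std::vector<std::uintptr_t> drain_confirmed_zero() {
    std::lock_guard<std::mutex> lock(mtx_);
    std::vector<std::uintptr_t> confirmed;
    for (auto p : zero_candidates_) {
      if (refcount_[p] == 0) confirmed.push_back(p);
    }
    zero_candidates_.clear();
    return confirmed;
  }

 private:
  std::mutex mtx_;
  std::unordered_map<std::uintptr_t, int> refcount_;
  std::vector<std::uintptr_t> zero_candidates_;
};
```

The re-check in `drain_confirmed_zero()` is what tolerates the race noted above: a PArray can be acquired again after landing in the candidate vector, and it is simply filtered out here.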
@wlruys (Contributor) commented May 11, 2023:

This happens only if the instance's coherency state is INVALID, or is VALID and is written back to CPU. So, none of the task bodies refer to this instance. When the coherency state is INVALID, or is VALID and it is written back to CPU, the PArray runtime also releases its reference.

This can also happen at any internal state. It just means that the Python reference count has dropped to zero.

A PArray instance on a device could be deallocated if none of mapped (including running) tasks are referencing it NOW, and it is not the only valid instance.

The number of reserved tasks that reference it is the stronger necessary condition for safety, given the proposed model without the ability to change scheduling decisions or add dependencies to in-flight tasks.

This PArray instance could be necessary for non-spawned tasks later. But, in this case, this instance will be created through their corresponding data move tasks.

I'm not sure what you mean by instance here. The PArray object will be alive at the scope that is spawning the tasks since it needs to be listed in IN/OUT/INOUT.

When memory overhead is high, the mm (called at the mapping phase) evicts the zero-referenced PArrays one by one

Let's only call evict when the next task wouldn't be able to fit.

Mechanism

The classical hashmap -> linked list LRU implementation for each device should be sufficient.
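A minimal sketch of that classical hashmap -> doubly-linked-list LRU, one instance per device. PArrays are assumed to be identified by address; the names are illustrative, not an actual implementation.

```cpp
#include <cstdint>
#include <list>
#include <unordered_map>

// Per-device LRU over zero-referenced PArrays (identified by address).
class LRUList {
 public:
  // Mark a PArray as most recently used (insert or refresh).
  void touch(std::uintptr_t parray) {
    auto it = index_.find(parray);
    if (it != index_.end()) order_.erase(it->second);
    order_.push_front(parray);
    index_[parray] = order_.begin();
  }

  // Remove a PArray that got referenced again before eviction.
  void remove(std::uintptr_t parray) {
    auto it = index_.find(parray);
    if (it == index_.end()) return;
    order_.erase(it->second);
    index_.erase(it);
  }

  // Pop the least recently used PArray, or 0 if the list is empty.
  std::uintptr_t evict_lru() {
    if (order_.empty()) return 0;
    std::uintptr_t victim = order_.back();
    order_.pop_back();
    index_.erase(victim);
    return victim;
  }

 private:
  std::list<std::uintptr_t> order_;  // MRU at front, LRU at back
  std::unordered_map<std::uintptr_t,
                     std::list<std::uintptr_t>::iterator>
      index_;  // O(1) lookup into the list
};
```

`std::list` iterators stay valid across other insertions/erasures, which is what makes the O(1) `remove`/`touch` via the hashmap safe.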

@yinengy (Contributor) commented May 11, 2023:

When memory overhead is high, the mm (called at the mapping phase) evicts the zero-referenced PArrays one by one

Maybe here you could check SHARED PArrays first, since those are usually the replicated copies.

@nicelhc13 (Contributor, Author):

This can also happen at any internal state. It just means that the Python reference count has dropped to zero.

The reason I mentioned the state is that the PArray runtime holds one Python reference count. The runtime releases that reference only when the coherency state becomes INVALID. (There are some corner cases, such as slicing.)

The number of reserved tasks that reference it is the stronger necessary condition for safety, given the proposed model without the ability to change scheduling decisions or add dependencies to in-flight tasks.

Agreed. I had the same idea in mind but explained it incorrectly. So, it should be tasks in the phases after the mapping phase.

I'm not sure what you mean by instance here. The PArray object will be alive at the scope that is spawning the tasks since it needs to be listed in IN/OUT/INOUT.

We should differentiate an instance of parray::InternalPArray from its internal data instances. The former declares and manages the internal data instances through its coherency protocol; the latter is the actual internal data. I meant the latter.
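The distinction might be pictured as follows. The names here (`DataInstance`, the `std::map` layout, the state enum) are hypothetical and only illustrate the relationship, not Parla's actual parray internals.

```cpp
#include <map>
#include <vector>

// Coherency state of one per-device data instance (illustrative).
enum class CoherencyState { INVALID, SHARED, MODIFIED };

// "Internal data": the actual buffer resident on one device.
struct DataInstance {
  std::vector<char> buffer;
  CoherencyState state = CoherencyState::INVALID;
};

// The managing object: alive as long as Python holds a reference,
// since it must be listed in IN/OUT/INOUT at spawn time.
struct InternalPArray {
  std::map<int, DataInstance> instances;  // device id -> data instance

  // Eviction frees a single device's data instance,
  // not the InternalPArray object itself.
  void evict(int device) { instances.erase(device); }
};
```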

Let's only call evict when the next task wouldn't be able to fit.

Cool

The classical hashmap -> linked list LRU implementation for each device should be sufficient.

Cool

Call evict() to remove reference in the PArray runtime <-- This is problematic since we can only call this in Python

Do you have any idea on this? I am stuck with this.

@wlruys (Contributor) commented May 11, 2023:

We should differentiate an instance of parray::InternalPArray from its internal data instances. The former declares and manages the internal data instances through its coherency protocol; the latter is the actual internal data. I meant the latter.

Gotcha. Thanks! I find this terminology confusing. To me the whole distributed object is the PArray and the internal data are just cupy/numpy buffers.

Call evict() to remove reference in the PArray runtime

I don't know. Eviction tasks seem expensive, but they are a solution, probably the easiest and most flexible one. My other suggestions from our chat were: (1) the thread that is running the mm (probably the scheduler thread?) calls into Python via a callback and performs the move. This has the disadvantage of blocking scheduler execution during the copy call (and two GIL accesses). (2) Any worker that wakes up to run a task checks an eviction workqueue before/after its execution and performs the update.
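Suggestion (2) could be sketched as a small mutex-guarded workqueue; the names (`EvictionQueue`, `enqueue`, `drain`) are hypothetical, and the actual Python-side `evict()` calls on the drained batch are left out.

```cpp
#include <cstdint>
#include <deque>
#include <mutex>
#include <vector>

// Shared eviction workqueue. The mm enqueues victims; any worker that
// wakes up to run a task drains the queue before/after its execution
// and performs the Python-side evict() for each drained PArray (the
// worker already acquires the GIL around task execution anyway).
class EvictionQueue {
 public:
  void enqueue(std::uintptr_t parray) {
    std::lock_guard<std::mutex> lock(mtx_);
    pending_.push_back(parray);
  }

  // Take the whole pending batch at once, leaving the queue empty.
  std::vector<std::uintptr_t> drain() {
    std::lock_guard<std::mutex> lock(mtx_);
    std::vector<std::uintptr_t> batch(pending_.begin(), pending_.end());
    pending_.clear();
    return batch;
  }

 private:
  std::mutex mtx_;
  std::deque<std::uintptr_t> pending_;
};
```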

Maybe here you could check SHARED PArrays first, since those are usually the replicated copies.

This is a good policy suggestion!

@nicelhc13 (Contributor, Author):

I don't know. Eviction tasks seem expensive, but it is a solution. Probably the easiest and most flexible. My other suggestions from our chat were

I think the problem was that I didn't know how to call evict() from the C++ side. I discussed this with Yineng and we designed an implementation. I will write it up in the next comment.

@nicelhc13 (Contributor, Author):

  • Eviction:
    • For this, we need a memory manager (called mm). For this doc, I am assuming an LRU-based mm.
    • Mechanism:
      • The C++ mapper reserves PArrays if a target task uses PArrays:
        • The mm manages a PArray reference count table.
        • Create an entry for a PArray if it is used for the first time, or find its entry and increase its reference count.
      • When a C++ worker thread finishes a task execution and cleans up its resource usage, it decreases the reference counts of the task's PArrays in the mm.
      • If a C++ worker thread notices that a PArray's reference count becomes 0, it stores the PArray's address in the mm's zero-referenced candidate vector.
        • This vector is guarded by a mutex. The PArrays stored in the vector are not necessarily still zero-referenced, since new tasks that use them can be spawned/mapped while a worker thread updates the vector.
      • The C++ mapper checks the vector, and if the PArrays in it are still zero-referenced, moves them to its zero-referenced LRU list.
    • When memory pressure is high, a Python worker thread that is assigned a data-move task and is about to call auto_move() performs eviction.
      • In this case, the important point is synchronization between the Python worker thread and the C++ scheduler thread.
      • The C++ scheduler thread should not map tasks that use PArrays, to avoid the case where those PArrays become invalid/deallocated by the mm.
        • Use a semaphore in the C++ mm; the worker thread raises this semaphore, and the C++ scheduler thread checks it before starting task mapping. In this case, all the scheduling phases skip the tasks that use PArrays.
        • Lower the semaphore when the worker thread has cleared memory.
        • Then the scheduler resumes the phases for the tasks that use PArrays.
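The semaphore handshake above can be sketched with an atomic counter (a counting semaphore would work equally well); the names (`EvictionGate`, `raise`, `lower`, `parray_mapping_blocked`) are hypothetical.

```cpp
#include <atomic>

// While any worker is evicting, the scheduler skips mapping tasks that
// use PArrays, so the mm cannot invalidate data a new mapping relies on.
class EvictionGate {
 public:
  // Worker thread: raise before performing eviction in auto_move(),
  // lower after memory has been cleared.
  void raise() { evictions_.fetch_add(1, std::memory_order_acq_rel); }
  void lower() { evictions_.fetch_sub(1, std::memory_order_acq_rel); }

  // Scheduler thread: checked before starting the mapping phase.
  // If true, every scheduling phase skips tasks that use PArrays.
  bool parray_mapping_blocked() const {
    return evictions_.load(std::memory_order_acquire) > 0;
  }

 private:
  std::atomic<int> evictions_{0};  // number of in-flight evictions
};
```

Using a counter rather than a boolean flag keeps the handshake correct if more than one worker ever performs eviction at the same time.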

@yinengy (Contributor) commented May 12, 2023:

LGTM, thanks!

@nicelhc13 (Contributor, Author):

This is implemented in #100.
It needs more robust testing, but it has not crashed so far.
