
Memory manager design (for eviction and garbage collection) #102

Open
nicelhc13 opened this issue May 11, 2023 · 8 comments
@nicelhc13 (Contributor):

  • Terminology:

    • Garbage collection:
      • A PArray instance on a device is implicitly deallocated by the Python interpreter when its reference count drops to zero.
      • This happens only if the instance's coherency state is INVALID, or if it is VALID and has been written back to the CPU.
        • So, no task bodies refer to this instance.
        • When the coherency state is INVALID, or it is VALID and has been written back to the CPU, the PArray runtime also releases its reference.
    • Eviction:
      • A PArray instance on a device may be deallocated if no mapped (including running) tasks are referencing it NOW, and it is not the only valid instance.
      • This PArray instance could still be needed by not-yet-spawned tasks later.
        • But in that case, the instance will be recreated through those tasks' corresponding data-move tasks.
  • Garbage collection:

    • We can deallocate instances by relying on the Parla/PArray runtime:
      • Removing references from the function body and arguments (to drop closure references)
      • Removing references from the PArray runtime
  • Eviction:

    • For this, we need a memory manager (called mm). For this doc, I am assuming an LRU-based mm.
    • Mechanism:
      • The C++ mapper reserves PArrays if a target task uses PArrays:
        • The mm manages a PArray reference count table.
        • Create an entry for a PArray if it is used for the first time, or find its entry and increase its reference count.
      • When a C++ worker thread finishes a task execution and cleans up its resource usage, it decreases the reference counts of the task's PArrays in the mm.
      • If a C++ worker thread notices that a PArray's reference count becomes 0, it stores the PArray's address in the mm's zero-referenced candidate vector.
        • This vector is guarded by a mutex. The PArrays stored in the vector are not necessarily still zero-referenced, since new tasks that use them can be spawned/mapped while a worker thread updates the vector.
      • The C++ mapper checks the vector, and if the PArrays in it are still zero-referenced, moves them to its zero-referenced LRU list.
    • When memory pressure is high, the mm (called at the mapping phase) evicts the zero-referenced PArrays one by one.
      • Call evict() to remove the reference in the PArray runtime <-- This is problematic since we can only call evict() from Python.
      • Remove the PArray reference from the list.
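The reference-count table and zero-referenced candidate vector described above could be sketched roughly as follows. All class and method names (`MemoryManager`, `acquire`, `release`, `drain_confirmed_zero`) are hypothetical, not Parla's actual API; this only illustrates the mapper/worker protocol.

```cpp
#include <cstdint>
#include <mutex>
#include <unordered_map>
#include <vector>

// Hypothetical sketch of the mm's reference-count table.
class MemoryManager {
 public:
  // Mapper side: create an entry on first use, otherwise bump the count.
  void acquire(std::uintptr_t parray) {
    std::lock_guard<std::mutex> lock(mtx_);
    ++refcount_[parray];  // operator[] value-initializes to 0 on first use
  }

  // Worker side: called during task cleanup. Zero-referenced PArrays
  // become *candidates* only -- newly mapped tasks may still revive them.
  void release(std::uintptr_t parray) {
    std::lock_guard<std::mutex> lock(mtx_);
    if (--refcount_[parray] == 0) {
      zero_candidates_.push_back(parray);
    }
  }

  // Mapper side: re-check candidates under the lock and hand back the
  // ones that are still zero-referenced (destined for the LRU list).
  std::vector<std::uintptr_t> drain_confirmed_zero() {
    std::lock_guard<std::mutex> lock(mtx_);
    std::vector<std::uintptr_t> confirmed;
    for (auto p : zero_candidates_) {
      if (refcount_[p] == 0) confirmed.push_back(p);
    }
    zero_candidates_.clear();
    return confirmed;
  }

 private:
  std::mutex mtx_;
  std::unordered_map<std::uintptr_t, int> refcount_;
  std::vector<std::uintptr_t> zero_candidates_;
};
```

The re-check in `drain_confirmed_zero()` is what tolerates the race noted above: a PArray can be acquired again after landing in the candidate vector, and it is simply filtered out here.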
@wlruys (Contributor) commented May 11, 2023:

This happens only if the instance's coherency state is INVALID, or is VALID and is written back to CPU. So, none of the task bodies refer to this instance. When the coherency state is INVALID, or is VALID and it is written back to CPU, the PArray runtime also releases its reference.

This can also happen at any internal state. It just means that the Python reference count has dropped to zero.

A PArray instance on a device could be deallocated if none of mapped (including running) tasks are referencing it NOW, and it is not the only valid instance.

The number of reserved tasks that reference it is the stronger necessary condition for safety, given the proposed model without the ability to change scheduling decisions or add dependencies to in-flight tasks.

This PArray instance could be necessary for non-spawned tasks later. But, in this case, this instance will be created through their corresponding data move tasks.

I'm not sure what you mean by instance here. The PArray object will be alive at the scope that is spawning the tasks since it needs to be listed in IN/OUT/INOUT.

When memory overhead is high, the mm (called at the mapping phase) evicts the zero-referenced PArrays one by one

Let's only call evict when the next task wouldn't be able to fit.

Mechanism

The classical hashmap -> linked list LRU implementation for each device should be sufficient.
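A minimal sketch of that classical hashmap -> doubly-linked-list LRU, one instance per device. PArrays are assumed to be identified by address; the names are illustrative, not an actual implementation.

```cpp
#include <cstdint>
#include <list>
#include <unordered_map>

// Per-device LRU over zero-referenced PArrays (identified by address).
class LRUList {
 public:
  // Mark a PArray as most recently used (insert or refresh).
  void touch(std::uintptr_t parray) {
    auto it = index_.find(parray);
    if (it != index_.end()) order_.erase(it->second);
    order_.push_front(parray);
    index_[parray] = order_.begin();
  }

  // Remove a PArray that got referenced again before eviction.
  void remove(std::uintptr_t parray) {
    auto it = index_.find(parray);
    if (it == index_.end()) return;
    order_.erase(it->second);
    index_.erase(it);
  }

  // Pop the least recently used PArray, or 0 if the list is empty.
  std::uintptr_t evict_lru() {
    if (order_.empty()) return 0;
    std::uintptr_t victim = order_.back();
    order_.pop_back();
    index_.erase(victim);
    return victim;
  }

 private:
  std::list<std::uintptr_t> order_;  // MRU at front, LRU at back
  std::unordered_map<std::uintptr_t,
                     std::list<std::uintptr_t>::iterator>
      index_;  // O(1) lookup into the list
};
```

`std::list` iterators stay valid across other insertions/erasures, which is what makes the O(1) `remove`/`touch` via the hashmap safe.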

@yinengy (Contributor) commented May 11, 2023:

When memory overhead is high, the mm (called at the mapping phase) evicts the zero-referenced PArrays one by one

Maybe here you could check SHARED PArrays first, since those are usually the replicated copies.

@nicelhc13 (Contributor, Author):

This can also happen at any internal state. It just means that the Python reference count has dropped to zero.

The reason I mentioned the state is that the PArray runtime holds one Python reference count. The runtime releases that reference only when the coherency state becomes INVALID. (There are some corner cases, such as slicing.)

The number of reserved tasks that reference it is the stronger necessary condition for safety, given the proposed model without the ability to change scheduling decisions or add dependencies to in-flight tasks.

Agreed. I had the same idea in mind but explained it incorrectly. So, it should be tasks in the phases after the mapping phase.

I'm not sure what you mean by instance here. The PArray object will be alive at the scope that is spawning the tasks since it needs to be listed in IN/OUT/INOUT.

We should differentiate an instance of parray::InternalPArray from its internal data instances. The former declares and manages the internal data instances through its coherency protocol; the latter is the actual internal data. I meant the latter.
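The distinction might be pictured as follows. The names here (`DataInstance`, the `std::map` layout, the state enum) are hypothetical and only illustrate the relationship, not Parla's actual parray internals.

```cpp
#include <map>
#include <vector>

// Coherency state of one per-device data instance (illustrative).
enum class CoherencyState { INVALID, SHARED, MODIFIED };

// "Internal data": the actual buffer resident on one device.
struct DataInstance {
  std::vector<char> buffer;
  CoherencyState state = CoherencyState::INVALID;
};

// The managing object: alive as long as Python holds a reference,
// since it must be listed in IN/OUT/INOUT at spawn time.
struct InternalPArray {
  std::map<int, DataInstance> instances;  // device id -> data instance

  // Eviction frees a single device's data instance,
  // not the InternalPArray object itself.
  void evict(int device) { instances.erase(device); }
};
```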

Let's only call evict when the next task wouldn't be able to fit.

Cool

The classical hashmap -> linked list LRU implementation for each device should be sufficient.

Cool

Call evict() to remove reference in the PArray runtime <-- This is problematic since we can only call this in Python

Do you have any idea on this? I am stuck with this.

@wlruys (Contributor) commented May 11, 2023:

We should differentiate an instance of parray::InternalPArray from its internal data instances. The former declares and manages the internal data instances through its coherency protocol; the latter is the actual internal data. I meant the latter.

Gotcha. Thanks! I find this terminology confusing. To me the whole distributed object is the PArray and the internal data are just cupy/numpy buffers.

Call evict() to remove reference in the PArray runtime

I don't know. Eviction tasks seem expensive, but they are a solution, probably the easiest and most flexible one. My other suggestions from our chat were: (1) the thread that is running the mm (probably the scheduler thread?) calls into Python via a callback and performs the move. This has the disadvantage of blocking scheduler execution during the copy call (and two GIL accesses). (2) Any worker that wakes up to run a task checks an eviction workqueue before/after its execution and performs the update.
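Suggestion (2) could be sketched as a small mutex-guarded workqueue; the names (`EvictionQueue`, `enqueue`, `drain`) are hypothetical, and the actual Python-side `evict()` calls on the drained batch are left out.

```cpp
#include <cstdint>
#include <deque>
#include <mutex>
#include <vector>

// Shared eviction workqueue. The mm enqueues victims; any worker that
// wakes up to run a task drains the queue before/after its execution
// and performs the Python-side evict() for each drained PArray (the
// worker already acquires the GIL around task execution anyway).
class EvictionQueue {
 public:
  void enqueue(std::uintptr_t parray) {
    std::lock_guard<std::mutex> lock(mtx_);
    pending_.push_back(parray);
  }

  // Take the whole pending batch at once, leaving the queue empty.
  std::vector<std::uintptr_t> drain() {
    std::lock_guard<std::mutex> lock(mtx_);
    std::vector<std::uintptr_t> batch(pending_.begin(), pending_.end());
    pending_.clear();
    return batch;
  }

 private:
  std::mutex mtx_;
  std::deque<std::uintptr_t> pending_;
};
```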

Maybe here you could check SHARED PArrays first, since those are usually the replicated copies.

This is a good policy suggestion!

@nicelhc13 (Contributor, Author):

I don't know. Eviction tasks seem expensive, but it is a solution. Probably the easiest and most flexible. My other suggestions from our chat were

I think the problem was that I didn't know how to call evict() from the C++ side. I discussed this with Yineng and we designed an implementation. I will write it up in the next comment.

@nicelhc13 (Contributor, Author):

  • Eviction:
    • For this, we need a memory manager (called mm). For this doc, I am assuming an LRU-based mm.
    • Mechanism:
      • The C++ mapper reserves PArrays if a target task uses PArrays:
        • The mm manages a PArray reference count table.
        • Create an entry for a PArray if it is used for the first time, or find its entry and increase its reference count.
      • When a C++ worker thread finishes a task execution and cleans up its resource usage, it decreases the reference counts of the task's PArrays in the mm.
      • If a C++ worker thread notices that a PArray's reference count becomes 0, it stores the PArray's address in the mm's zero-referenced candidate vector.
        • This vector is guarded by a mutex. The PArrays stored in the vector are not necessarily still zero-referenced, since new tasks that use them can be spawned/mapped while a worker thread updates the vector.
      • The C++ mapper checks the vector, and if the PArrays in it are still zero-referenced, moves them to its zero-referenced LRU list.
    • When memory pressure is high, a Python worker thread that is assigned a data-move task and is about to call auto_move() performs eviction.
      • In this case, the important point is synchronization between the Python worker thread and the C++ scheduler thread.
      • The C++ scheduler thread should not map tasks that use PArrays, to avoid the case where those PArrays become invalid/deallocated by the mm.
        • Use a semaphore in the C++ mm; the worker thread raises this semaphore, and the C++ scheduler thread checks it before starting task mapping. In this case, all the scheduling phases skip the tasks that use PArrays.
        • Lower the semaphore when the worker thread has cleared memory.
        • Then the scheduler resumes the phases for the tasks that use PArrays.
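The semaphore handshake above can be sketched with an atomic counter (a counting semaphore would work equally well); the names (`EvictionGate`, `raise`, `lower`, `parray_mapping_blocked`) are hypothetical.

```cpp
#include <atomic>

// While any worker is evicting, the scheduler skips mapping tasks that
// use PArrays, so the mm cannot invalidate data a new mapping relies on.
class EvictionGate {
 public:
  // Worker thread: raise before performing eviction in auto_move(),
  // lower after memory has been cleared.
  void raise() { evictions_.fetch_add(1, std::memory_order_acq_rel); }
  void lower() { evictions_.fetch_sub(1, std::memory_order_acq_rel); }

  // Scheduler thread: checked before starting the mapping phase.
  // If true, every scheduling phase skips tasks that use PArrays.
  bool parray_mapping_blocked() const {
    return evictions_.load(std::memory_order_acquire) > 0;
  }

 private:
  std::atomic<int> evictions_{0};  // number of in-flight evictions
};
```

Using a counter rather than a boolean flag keeps the handshake correct if more than one worker ever performs eviction at the same time.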

@yinengy (Contributor) commented May 12, 2023:

LGTM, thanks!

@nicelhc13 (Contributor, Author):

This is implemented in #100.
It needs more robust testing, but it has not crashed so far.
