tracemalloc: add C API to manually track/untrack memory allocations #70717
The API of the Python memory allocators was extended for numpy: calloc() was added (issue bpo-21233). It looks like that's not enough, because aligned allocations are also required for vector operations (SIMD). For Python itself there is issue bpo-18835, but the benefit for Python code is unclear or negligible. Instead of extending the API each time, it would be simpler to add an API allowing third-party memory allocators to track/untrack memory in tracemalloc. I guess that the API should assume that the GIL is not held, and so try to acquire the GIL (if it's not already held). Related numpy issue: numpy/numpy#4663 See also: https://mail.scipy.org/pipermail/numpy-discussion/2015-January/072068.html
Patch without test. I chose to require the GIL to be held, for efficiency. IMHO the common case is that the GIL is held. Example of calling the function when the GIL is not held:

int res;
PyGILState_STATE gil_state;
gil_state = PyGILState_Ensure();
res = _PyTraceMalloc_Track(ptr, size);
PyGILState_Release(gil_state);
return res;

See also my numpy pull request about PyMem_Malloc() being called without the GIL held:
This may indeed be useful, but:
Antoine Pitrou:
Done
Ok, done. The new patch has a safer API:
I also added unit tests.
Hum, there is an issue in the patch version 2: tracemalloc_add_trace() relies on the assumption that the pointer is not already tracked. Since we give the user control over the traces, this assumption may become wrong (if the API is badly used). This issue is now fixed in the patch version 3: tracemalloc_add_trace() now works if the pointer is already tracked. It allows implementing a micro-optimization on realloc() if the new pointer is equal to the old pointer (the memory block didn't move). By the way, to track realloc(): _PyTraceMalloc_Untrack() must be called with the old pointer and _PyTraceMalloc_Track() must be called with the new pointer. I don't think that it's worth adding a helper function just for two calls. (The old_ptr == new_ptr optimization is really a micro-optimization; don't use tracemalloc if you care about performance, tracemalloc simply kills performance ;-))
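The patch-v3 semantics described above can be sketched as a toy pure-Python model (the names and the dict-of-sizes layout are illustrative only; the real implementation is the C hashtable in _tracemalloc):

```python
# Toy model of patch v3: track() overwrites an existing trace instead of
# failing, so a badly used API cannot corrupt the table.
traces = {}  # pointer -> size (real tracemalloc stores (size, traceback))

def track(ptr, size):
    traces[ptr] = size  # works even if ptr is already tracked

def untrack(ptr):
    traces.pop(ptr, None)

def track_realloc(old_ptr, new_ptr, new_size):
    # The two calls needed to follow a realloc():
    if old_ptr == new_ptr:
        track(new_ptr, new_size)  # micro-optimization: block didn't move
    else:
        untrack(old_ptr)
        track(new_ptr, new_size)
```

Calling track() twice on the same pointer simply updates the trace, which is exactly what the realloc() case needs.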
Hi Victor, This is really great, thanks for working on it! What do you think about the richer API I proposed here?
""" msg232221: PyMem_RecordAlloc(const char* domain, void* tag, size_t quantity,
PyMem_RecordRealloc(const char* domain, void* old_tag, void* new_tag, size_t new_quantity)
PyMem_RecordFree(const char* domain, void* tag)
""" I don't plan to store new information in tracemalloc. tracemalloc design is "simple": it stores a trace for each memory block. A memory block is identified by its address (void* pointer) and a trace is a tuple (size, traceback). Extending a trace would increase the memory footprint overhead of tracemalloc. Note: I don't understand what are domain, tag (address of the memory block?) or quantity (size of the memory block?). Storing the C filename and C line number was proposed but rejected in the design of the PEP-445: I'm open to extending tracemalloc, but it requires to discuss the API and how changes will impact the code (tracemalloc, but also users of the tracemalloc API). |
Yes, the idea is basically the same as now, except with an extra const char * specifying the "domain", so that one could keep separate track of different kinds of allocations. So PyMem_Malloc would just call PyMem_RecordAlloc("heap", ptr, size) (or act equivalently to something that called that, etc.), but something like PyCuda might do PyMem_RecordAlloc("gpu", ptr, size) to track allocations in GPU memory. All the tracing stuff in tracemalloc would be awesome to have for GPU allocations, and it would hardly require any changes to enable it. Ditto for other leakable resources like file descriptors or shmem segments.
I think the extra footprint would be tiny? Logically, you'd index traces by (domain, pointer) instead of (pointer), but since we expect that the number of distinct domains is small (like... 1, in the common case) then the natural implementation strategy would be to replace the current {pointer: trace} mapping with a mapping {domain: {pointer: trace}}. So the total overhead would be basically the overhead of allocating a single additional dict. That said, I think having the domain argument *in the public hookable API* is useful even if the tracemalloc hooks themselves decide to ignore it or just ignore all calls that don't refer to the "heap" domain -- because it's extremely cheap to put this in the API now while we're adding it, and it leaves the door open for someone else to come along later and make an awesome gpu heap tracker or whatever without having to first argue with python-dev and wait for the cpython release cycle.
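The {domain: {pointer: trace}} idea can be sketched in a few lines of Python (hypothetical track/untrack names; traces reduced to bare sizes):

```python
# Sketch of the proposed indexing: the same address can be tracked
# independently in different domains, and the inner dict for a domain is
# only created when that domain is first used.
from collections import defaultdict

traces = defaultdict(dict)  # domain -> {pointer -> size}

def track(domain, ptr, size):
    traces[domain][ptr] = size

def untrack(domain, ptr):
    traces[domain].pop(ptr, None)
```

With this layout, untrack("heap", p) cannot disturb a trace recorded for ("gpu", p), and the common single-domain case pays only for one extra dict.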
Nathaniel Smith:
If I change tracemalloc, it's not to fulfill numpy's requirements; it must remain very generic. *If* we add something, I see 3 choices:
int uses 2x less memory than char* or void* on 64-bit systems. The problem with void* is finding a way to expose it in Python. An option is to ignore it in the Python API, and only provide a C API to retrieve traces with the extra info. void* would allow implementing the rejected option of also storing the C filename and C line number. When I designed PEP-445 (malloc API) and PEP-454 (tracemalloc), I recall that it was proposed to add "colors" (red, blue, etc.) to memory allocations. It sounds similar to your "heap" and "gpu" use case. It's just that you use an integer rather than a string. Anyway, extending tracemalloc is non-trivial; we have to investigate use cases to design the new API. I would prefer to move step by step, and begin by exposing the existing API. What do you think?
FYI I opened an issue to use tracemalloc when logging ResourceWarning:
In my experience, tracemalloc's footprint is large: basically, it doubles the total memory footprint. So adding 4 or 8 bytes to a trace which currently takes 16 bytes is not negligible! Maybe we can be smart and use a compact trace when the extra info is not stored (the current trace_t) and switch to an "extended" trace (trace_t + int/char*/void*) when the extended API is used? It requires converting all existing traces from the compact to the extended format. It doesn't look too complex to support two formats, especially if the extended format is based on the compact format (trace_t structure used in an extended_trace_t structure).
It's not how tracemalloc is designed. _tracemalloc has a simple design: it's a simple hashtable, pointer => trace. The Python module tracemalloc.py is responsible for grouping traces. The design is to have a simple and efficient _tracemalloc module, and off-load statistics later. It allows capturing traces on a small and slow device, and then analyzing the data on a fast computer with more memory (e.g. transfer the data over the network). The idea is also to limit the overhead of using _tracemalloc. Moreover, your proposed structure looks specific to your use case. I'm not sure that you always want to group by domain. If the domain is a C traceback (filename, line number), you probably don't want to group by traceback, but rather by C filename, for example.
I think we're talking past each other :-).
Nothing about what I'm saying is relevant to numpy -- the patch attached to this bug report is already plenty for what numpy needs. (Well, it still needs a public API like PyMem_Track/Untrack or something, but never mind for now.) The only reason I'm speaking up now is that if you're adding a manual track/untrack API, then a relatively trivial addition now makes tracemalloc vastly more powerful, so I don't want to miss this opportunity.
And if you want to attach some extra metadata to traces, then that's an interesting idea that I can potentially imagine various use cases for. But it's not the idea I'm talking about :-). (FWIW, I think the biggest challenge for your idea will be how the allocation sites -- which might be in arbitrary user code -- are supposed to figure out what kind of metadata they should be attaching. And if it's information that tracemalloc can compute itself -- like C backtraces -- then there's no reason for it to be in the public API, which is the thing I'm concerned about here.)

What I'm talking about is different: I think it should be possible to re-use the tracemalloc infrastructure to track other resources besides "heap allocations". So for my use case, it's crucial that we index by (domain, pointer), because the address 0xdeadbeef on the heap is different from the address 0xdeadbeef on the GPU. We'll never want to group by pointer alone without the domain, because that would cause us to actually misinterpret the data (if I do PyMem_Track("gpu", 0xdeadbeef); PyMem_Untrack("heap", 0xdeadbeef), then this should not cause tracemalloc to forget about the gpu allocation! I think this is very different from your C backtrace example).

And, it's always obvious to callers what kind of thing to pass here, because they know perfectly well whether they just allocated memory on the heap or on the GPU, so the public API is an appropriate place for this information. And, it's immediately obvious that for this use case, there will only be a few different domains in use at one time, so it's very inefficient to literally store (domain, pointer) pairs -- replacing the current pointer => trace design with a domain => (pointer => trace) design would indeed require changing tracemalloc's design a bit, but not, I think, in any fundamental way?
Instead of having to change the API for tracking GPU memory, I suggest to [...]

If the minimum is 2 bytes, good. Use the lowest bit as a "GPU" flag!

Tracemalloc doesn't give access to pointers in its API, so it shouldn't [...]
On 18/03/2016 04:57, Nathaniel Smith wrote:
FWIW, LLVM calls the "domain" an "address space", and simply uses an integer for it.
On 18/03/2016 08:09, STINNER Victor wrote:
So you guarantee tracemalloc itself won't use such a hack for other purposes?
As the de facto maintainer of the module, yes, I can guarantee that.
I understood that we have the CPU and GPU address spaces. Do you know of other address spaces? I only care about address spaces where two equal pointers point to two different memory blocks.
If you consider that using the least significant bits to store an [...]

--

malloc in the POSIX standard: [...]

The Linux manual page has a simpler definition: "The malloc() and calloc() [...]"

malloc of the GNU libc: [...]

Random Google link for GPUs: "cudaMalloc() returns memory which is aligned at 256 bytes." (8 bits)

"Minimum alignment (bytes) for any datatype: 128"
Hum, maybe we can make the "hack" official: add two C macros to store and retrieve the domain from a pointer. The macros can validate that the domain is smaller than or equal to 8 using an assertion (only in debug mode?).
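In Python, such macros could be modelled like this (made-up names, assuming allocations are at least 8-byte aligned so the 3 low bits are free, i.e. domains 0-7):

```python
DOMAIN_BITS = 3                      # supports domains 0..7
DOMAIN_MASK = (1 << DOMAIN_BITS) - 1

def pack_domain(ptr, domain):
    # model of the suggested "store" macro; the C version could check
    # these conditions with an assertion in debug mode
    assert ptr & DOMAIN_MASK == 0, "pointer must be 8-byte aligned"
    assert 0 <= domain <= DOMAIN_MASK
    return ptr | domain

def unpack_domain(key):
    # model of the suggested "retrieve" macro
    return key & ~DOMAIN_MASK, key & DOMAIN_MASK
```

The scheme only works while every tracked allocator guarantees at least 2**DOMAIN_BITS alignment, which is the crux of the alignment discussion above.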
An n-GPU system will have n+1 address spaces. Such systems exist in high-performance computing. You can also probably find weird systems (embedded, mostly, I'd say) with more than two address spaces.
There are other leakable resources besides heap and GPU memory -- shmem segments and file descriptors are two that I thought of earlier, but there are probably others too. (Note that not all file descriptors are associated with a python file object.) I guess one could hack most of these things into pointer bit tricks somehow, but I don't really see the appeal myself :-). It's fine for an internal implementation trick, but a very pythonic public API...
This is actually a problem with this scheme... One would like to be able to get separate statistical reports for different resources. Also, tracemalloc is awesome (really), but there are potentially other consumers of the hookable allocation scheme too that might choose to expose more or less information; we shouldn't forget about them entirely :-)
Nathaniel Smith:
Ok ok, you convinced me. I opened issue bpo-26588: "_tracemalloc: add support for multiple address spaces (domains)". Hum, the idea of tracking file descriptors is appealing. In practice, I suggest to simply use the Py_uintptr_t type in the new C API rather than void*, so it's more obvious that tracemalloc doesn't care about the value; it only requires the value to be unique. FYI on Windows, handles and file descriptors are also different namespaces: you can get a handle and a file descriptor which have the same value but identify different objects. I don't know shmem.

--

Another idea would be to attach a "color" to a memory allocation. For example, group memory only used internally versus memory used to store user data. For a network application, you can imagine an identifier per request and then group all allocations of this request. It sounds very similar to the domain idea, but there is a very important difference: the API proposed in issue bpo-26588 requires passing the domain to the track *and* untrack functions. The problem with coloring allocations is that the color is known when the memory is allocated, but the function releasing the memory can be very far away, and it can simply be the garbage collector. So the color must be stored in the hashtable *value* (the trace), whereas the domain is stored in the hashtable *key* (a tuple (pointer, domain)). Since no user directly asked for this feature, I propose to defer it and focus on the concrete issue bpo-26588 to support tracking GPU memory.
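A toy model of that difference (hypothetical names; bpo-26588 is the real proposal):

```python
# The domain is part of the *key* because untrack() must supply it again;
# the color is part of the *value* because the code releasing the block
# (possibly the garbage collector) doesn't know the color.
traces = {}  # (domain, ptr) -> (size, color)

def track(domain, ptr, size, color=None):
    traces[(domain, ptr)] = (size, color)

def untrack(domain, ptr):  # note: no color argument needed
    traces.pop((domain, ptr), None)
```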
New changeset 60655e543d8a by Victor Stinner in branch 'default':
Ok, I added the following C functions:

int _PyTraceMalloc_Track(_PyTraceMalloc_domain_t domain, Py_uintptr_t ptr, size_t size);
int _PyTraceMalloc_Untrack(_PyTraceMalloc_domain_t domain, Py_uintptr_t ptr);

Antoine, Nathaniel: please play with it, I'm waiting for your feedback on the API. _PyTraceMalloc_Track() acquires the GIL for you if it was released. I suggest not calling it from a C thread; if you want to use it from a C thread, it's better to manually initialize the Python thread state on that thread (see issue bpo-20891) before using _PyTraceMalloc_Track().

--

_PyTraceMalloc_domain_t is an unsigned int. The type name is annoying (too long); maybe "unsigned int" would be more readable :-) What do you think? Maybe an unsigned short is enough? _tracemalloc.c tries to use a packed structure to limit the memory footprint.

--

I chose to use the Py_uintptr_t type for the pointer instead of void*, but there is a problem with that: it gives a false hint on the expected type. In fact, the hashtable is really optimized for pointers; the hash function uses _Py_HashPointer():
It means that tracking file descriptors (int fd) may create a lot of hash collisions. Well, if you handle a few file descriptors (fewer than 100?), it's maybe ok. If you handle tons of file descriptors, we should maybe make the hash function more configurable. Maybe even per domain? Do you think that void* would be less of a lie? :-) What do you prefer?

--

I also added a function to get the traceback where a memory block was allocated:

PyObject* _PyTraceMalloc_GetTraceback(_PyTraceMalloc_domain_t domain, Py_uintptr_t ptr);

But you should not use it; it's written for unit tests. You have to wrap the result into a tracemalloc.Traceback object:

def get_traceback(self):
frames = _testcapi.tracemalloc_get_traceback(self.domain, self.ptr)
if frames is not None:
return tracemalloc.Traceback(frames)
else:
        return None

If this feature is useful, it should be added to the official Python API, the tracemalloc.py module. Currently, there is a tracemalloc.get_object_traceback(obj) function which is restricted to domain 0 and expects a Python object, not a raw pointer.
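The collision concern can be illustrated with a Python model of a _Py_HashPointer-style hash, which rotates the value right by 4 bits so that the always-zero alignment bits of real pointers don't all select the same bucket. With a 16-bucket table, small file descriptors all collide while aligned pointers spread out:

```python
BITS = 64  # assume 64-bit pointers

def pointer_hash(value):
    # rotate right by 4 bits, as _Py_HashPointer does for void* keys
    value &= (1 << BITS) - 1
    return ((value >> 4) | (value << (BITS - 4))) & ((1 << BITS) - 1)

# File descriptors 0..15: the rotation moves all their significant bits
# to the top, so modulo a power-of-two table size they share one bucket.
fd_buckets = {pointer_hash(fd) % 16 for fd in range(16)}

# 16-byte aligned heap pointers land in 16 distinct buckets.
ptr_buckets = {pointer_hash(0x1000 + 16 * i) % 16 for i in range(16)}
```

Here fd_buckets collapses to a single bucket while ptr_buckets has 16, which is why a per-domain hash function might be worth considering.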
tracemalloc_track_fd.py: proof of concept to track file descriptors using monkey-patching of os.open() and os.close(). It's just to show how the API can be used in Python; I don't think that it's very useful to track where file descriptors are allocated.
Nathaniel, Antoine: ping?
Nathaniel, Antoine: ping again? No reaction to the shiny new C API of tracemalloc that *you* requested? :-p
@nathaniel, Antoine: last ping before timeout :-p
Is the timeout part of the API? :-)
I don't want to make the API public before you have validated that it is usable for your use case or to track numpy memory usage. I guess that the first beta release of Python 3.6 is the deadline for this issue; otherwise, we will have to wait ~2 years for Python 3.7.
I may be a bit confused, but the "domain" integer you added in bpo-26588 doesn't seem to be part of this API... Is it deliberate?
They are part of this API. msg262180:

"""Ok, I added the following C functions:

int _PyTraceMalloc_Track(_PyTraceMalloc_domain_t domain, Py_uintptr_t ptr, size_t size);
int _PyTraceMalloc_Untrack(_PyTraceMalloc_domain_t domain, Py_uintptr_t ptr); (...)"""

The domain 0 is used to track Python memory allocations.
Which patch are we talking about? In tracemalloc_track-3.patch, I see:

+ If memory block is already tracked, update the existing trace. */
I'm talking about this change, which was already merged into the default branch of Python in March 2016:

New changeset 60655e543d8a by Victor Stinner in branch 'default':
Uh... ok, I thought you wanted some feedback on the posted patches, since I didn't know you had committed a new version of them. I haven't tried to use the new API (for various reasons it's a bit cumbersome to use a self-compiled Python for Numba), but I'm sure it will be good enough.
So, the API is implemented, but I left it private because nobody tried it, whereas I was waiting for feedback from numpy at least. If you want a public API in Python 3.7, please tell me whether the API fits your use case and whether the implementation works.
The API looks good to me. Works fine in numpy.
Julian Taylor: "The API looks good to me. Works fine in numpy." Cool! Should it be made public in that case? (Remove the _ prefix and document it.)
I don't see any reason why not to.
"I don't see any reason why not to.": Ok, I created issue bpo-30054.