
Add support for the wrapping of heap allocation and deallocation functions #916

Closed
nnethercote opened this Issue Feb 27, 2015 · 14 comments

nnethercote commented Feb 27, 2015

(An earlier version of this proposal was posted at http://internals.rust-lang.org/t/add-support-for-the-wrapping-of-heap-allocation-and-deallocation-functions/1618/6, where it didn't get much attention.)

It can be very useful for user programs to be able to take arbitrary actions when heap allocation and deallocation occurs.

My motivation is that I want to build a heap profiler for Servo, one that's similar to Firefox's DMD. This profiler would record a stack trace on each allocation, and use that to emit data about which parts of the code are performing many allocations (contributing to heap churn) and which ones are responsible for allocating live blocks (contributing to peak memory usage). This may sound like a generic tool that could be built into Rust itself, but it's likely to end up with Servo-specific parts, so flexible building blocks would be better than a canned solution.

There are lots of other potential uses for this, and this kind of facility is common in other systems. E.g. glibc provides one for malloc/realloc/free, and there's also the more general LD_PRELOAD (on Linux) and DYLD_INSERT_LIBRARIES (on Mac) which allow you to hook any library functions.

What this needs:

  • A program needs to be able to specify that it wants to opt in to this feature, and a way to specify the wrapping functions (one each for std::rt::heap::allocate, reallocate, reallocate_inplace, deallocate, and possibly usable_size and stats_print). The opting-in could be at compile-time or runtime; the latter is probably preferable because it's more flexible, but this is not a strong preference.
  • The allocation/deallocation functions need to call any provided wrappers.
  • A way for wrappers to temporarily disable wrapping while they are running, so that we don't get infinite recursion if the wrapper itself triggers a call to the function that it wraps.

I have a basic, ugly, proof-of-concept implementation. It adds the following code to src/liballoc/heap.rs, which defines a struct for holding the wrapper function and a function for setting the wrapper. (I've only shown the code for allocate; the other functions are handled similarly.)

pub type AllocateFn = unsafe fn(usize, usize) -> *mut u8;

pub type AllocateFnWrapper = fn(AllocateFn, usize, usize) -> *mut u8;

struct AllocatorWrappers {
    allocate: AllocateFnWrapper,
}

static mut wrappers: Option<AllocatorWrappers> = Option::None;

pub fn set_allocator_wrappers(allocate: AllocateFnWrapper) {
    let new_wrappers = AllocatorWrappers {
        allocate: allocate,
    };
    unsafe { wrappers = Option::Some(new_wrappers) };
}

It also modifies allocate like so:

 #[inline]
 pub unsafe fn allocate(size: usize, align: usize) -> *mut u8 {
-    imp::allocate(size, align)
+    match wrappers {
+        Option::None        => imp::allocate(size, align),
+        Option::Some(ref h) => (h.allocate)(imp::allocate, size, align),
+    }
 }

In the normal case this adds a single, perfectly predictable branch to the alloc/dealloc path, which is hopefully small in relation to the cost of an alloc/dealloc.

And here is a sample program that uses it.

#![feature(alloc)]
#![feature(core)]
#![feature(std_misc)]
#![feature(unmarked_api)]
#![feature(unsafe_destructor)]

// This demonstrates how allocation wrapping could be made to work in Rust
// using std::rt::heap::set_allocator_wrappers().

use std::rt::heap::*;
use std::cell::Cell;
use std::sync::{MutexGuard,StaticMutex,MUTEX_INIT};

//---------------------------------------------------------------------------

// A global table counting how many times each operation is called. This is
// representative of more complex global state likely to be used by any kind of
// heap profiler. It requires explicit locking via a `StaticMutex`, which is
// best done by using `AutoDisableWrappersAndLockCounts`.

struct Counts {
    allocate: u32,
    reallocate: u32,
    reallocate_inplace: u32,
    deallocate: u32,
}

static mut COUNTS: Counts = Counts {
    allocate: 0,
    reallocate: 0,
    reallocate_inplace: 0,
    deallocate: 0,
};

static COUNTS_LOCK: StaticMutex = MUTEX_INIT;

//---------------------------------------------------------------------------

// If the `allocate` wrapper itself called `allocate`, we'd get infinite
// recursion, and likewise for the other wrappers and the other functions
// dealing with the analysis state. Therefore, within each wrapper, we need a
// way to temporarily disable wrapping, which we do on a per-thread basis.
//
// One complication with this is that it can interact badly with other TLS.
// For example, if we call println! within `deallocate_wrapper`, we get a
// "cannot access TLS after it's destroyed" panic. This is because the module
// implementing println! (and related printing functions) has its own TLS,
// called `LOCAL_STDOUT`, in stdio.rs. If `LOCAL_STDOUT` is destroyed before
// `WRAPPERS_ENABLED`, during its destruction it deallocates a
// LineBufferedOutput struct allocated by stdout(), which triggers a call to
// `deallocate_wrapper`, and the println! triggers the panic because
// `LOCAL_STDOUT` is in the middle of being destroyed.
//
// Therefore, calling any code that uses a droppable TLS value is potentially a
// problem in `deallocate_wrapper`. Hmm.

thread_local!(static WRAPPERS_ENABLED: Cell<bool> = Cell::new(true));

// `AutoDisableWrappersAndLockCounts` is an RAII struct for (a) the scoped disabling of
// wrapping, and (b) the scoped locking of `Counts`. It is not just convenient,
// but crucial; for (a) if we do `set(false)` and then `set(true)` manually
// then any destructors that run will happen after the latter, which will lead
// to infinite recursion.

struct AutoDisableWrappersAndLockCounts<'a> {
    wrappers_enabled: &'a Cell<bool>,
    _lock: MutexGuard<'a, ()>,
}

impl<'a> AutoDisableWrappersAndLockCounts<'a> {
    fn new(w: &Cell<bool>) -> AutoDisableWrappersAndLockCounts {
        w.set(false);
        AutoDisableWrappersAndLockCounts {
            wrappers_enabled: w,
            _lock: COUNTS_LOCK.lock().unwrap(),
        }
    }
}

#[unsafe_destructor]
impl<'a> Drop for AutoDisableWrappersAndLockCounts<'a> {
    fn drop(&mut self) {
        self.wrappers_enabled.set(true);
    }
}

//---------------------------------------------------------------------------

// The wrapper functions themselves. If wrapping is disabled when they are
// called, they just do a vanilla alloc/realloc/dealloc. Otherwise, they update
// COUNTS appropriately as well.

fn allocate_wrapper(real_allocate: AllocateFn, size: usize, align: usize) -> *mut u8 {
    WRAPPERS_ENABLED.with(|wrappers_enabled| {
        if wrappers_enabled.get() {
            let _a = AutoDisableWrappersAndLockCounts::new(wrappers_enabled);
            unsafe { COUNTS.allocate += 1; }
        }
        unsafe { real_allocate(size, align) }
    })
}

fn reallocate_wrapper(real_reallocate: ReallocateFn, ptr: *mut u8, old_size: usize, size: usize,
                      align: usize) -> *mut u8 {
    WRAPPERS_ENABLED.with(|wrappers_enabled| {
        if wrappers_enabled.get() {
            let _a = AutoDisableWrappersAndLockCounts::new(wrappers_enabled);
            unsafe { COUNTS.reallocate += 1; }
        }
        unsafe { real_reallocate(ptr, old_size, size, align) }
    })
}

fn reallocate_inplace_wrapper(real_reallocate_inplace: ReallocateInplaceFn, ptr: *mut u8,
                              old_size: usize, size: usize, align: usize) -> usize {
    WRAPPERS_ENABLED.with(|wrappers_enabled| {
        if wrappers_enabled.get() {
            let _a = AutoDisableWrappersAndLockCounts::new(wrappers_enabled);
            unsafe { COUNTS.reallocate_inplace += 1; }
        }
        unsafe { real_reallocate_inplace(ptr, old_size, size, align) }
    })
}

fn deallocate_wrapper(real_deallocate: DeallocateFn, ptr: *mut u8, old_size: usize, align: usize) {
    WRAPPERS_ENABLED.with(|wrappers_enabled| {
        if wrappers_enabled.get() {
            let _a = AutoDisableWrappersAndLockCounts::new(wrappers_enabled);
            unsafe { COUNTS.deallocate += 1; }
        }
        unsafe { real_deallocate(ptr, old_size, align) }
    })
}

//---------------------------------------------------------------------------

// Prints the current measurements in COUNTS.
fn snapshot() {
    WRAPPERS_ENABLED.with(|wrappers_enabled| {
        let _a = AutoDisableWrappersAndLockCounts::new(wrappers_enabled);
        unsafe {
            println!("--------");
            println!("counts:");
            println!("- allocate:           {}", COUNTS.allocate);
            println!("- reallocate:         {}", COUNTS.reallocate);
            println!("- reallocate_inplace: {}", COUNTS.reallocate_inplace);
            println!("- deallocate:         {}", COUNTS.deallocate);
            println!("--------");
        }
    });
}

//---------------------------------------------------------------------------

fn main() {
    // Want to do this as early as possible. (Really, it would be better to
    // have a way to specify these wrapper functions that is visible and takes
    // effect before we even reach main().)
    set_allocator_wrappers(allocate_wrapper, reallocate_wrapper, reallocate_inplace_wrapper,
                           deallocate_wrapper);

    // Just some sample code showing a variety of allocations on two threads.

    snapshot();

    for _ in range(0, 5) {
        let _x = Box::new(0u64);
    }

    snapshot();

    let guard = std::thread::Thread::scoped(|| {
        println!("start new thread!");
        for _ in range(0, 5) {
            let _x = Box::new(0u32);
        }
        println!("finish new thread!");
    });
    let _result = guard.join();

    snapshot();
}

Like I said, it's ugly.

  • set_allocator_wrappers should be called as early as possible, so that it doesn't miss any allocations. It's also entirely non-thread-safe, which is probably ok if you do call it right at the start of main, but is still nasty. Putting the wrappers table inside an RwLock or something might be possible, but it would be a shame, performance-wise, to do that for a data structure that's written once and then read zillions of times. I figure there must be a more Rust-y way of doing this. Could a #[feature("...")] annotation work? We really want these wrappers to be enabled by the language runtime at start-up, rather than the program itself having to do it.
  • The thread-local storage to avoid infinite recursion isn't so bad, though it would be nice if this could somehow be handled within the Rust implementation so that each individual program doesn't have to reimplement it. The issues with wrappers calling code that uses TLS are also annoying.
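As an aside on the first bullet: in today's Rust, a write-once, read-many wrapper table can be expressed without an RwLock by using `std::sync::OnceLock`, whose read path is a single atomic load after initialization. The following is a hypothetical, self-contained sketch of that shape (the `real_allocate` stand-in deliberately leaks and is not the real `imp::allocate`):

```rust
use std::sync::OnceLock;
use std::sync::atomic::{AtomicU32, Ordering};

// Signatures mirroring the prototype's AllocateFn / AllocateFnWrapper.
type AllocateFn = unsafe fn(usize, usize) -> *mut u8;
type AllocateFnWrapper = fn(AllocateFn, usize, usize) -> *mut u8;

// Write-once wrapper slot: set() races are resolved by "first caller wins",
// and get() is lock-free after initialization.
static WRAPPER: OnceLock<AllocateFnWrapper> = OnceLock::new();
static CALLS: AtomicU32 = AtomicU32::new(0);

fn set_allocator_wrapper(w: AllocateFnWrapper) {
    let _ = WRAPPER.set(w); // later calls are no-ops
}

// Stand-in for imp::allocate; leaks on purpose to keep the sketch tiny.
unsafe fn real_allocate(size: usize, _align: usize) -> *mut u8 {
    Box::into_raw(vec![0u8; size].into_boxed_slice()) as *mut u8
}

fn allocate(size: usize, align: usize) -> *mut u8 {
    match WRAPPER.get() {
        None => unsafe { real_allocate(size, align) },
        Some(&w) => w(real_allocate, size, align),
    }
}

fn counting_wrapper(real: AllocateFn, size: usize, align: usize) -> *mut u8 {
    // Atomic counter instead of a StaticMutex-guarded table: no lock, no TLS,
    // so there is no reentrancy hazard in this particular wrapper.
    CALLS.fetch_add(1, Ordering::Relaxed);
    unsafe { real(size, align) }
}

fn main() {
    set_allocator_wrapper(counting_wrapper);
    let _p = allocate(16, 8);
    let _q = allocate(32, 8);
    println!("wrapper saw {} allocations", CALLS.load(Ordering::Relaxed));
}
```

Note this only sidesteps the reentrancy problem because the wrapper touches nothing but an atomic; a wrapper that itself allocates still needs the per-thread disable flag described above.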

It was really just a prototype to see if I could get something working. And it does.

So... I'm wondering if (a) a feature like this seems like something that might be accepted into Rust, and (b) if there are better ways of implementing it. I hope so, because my ideas for measuring Servo's memory usage are dead in the water without this.

This idea has a small amount of overlap with this old RFC -- I want a custom allocator for the entire program, basically -- but is much smaller.

Thank you for reading.

steveklabnik commented Feb 27, 2015

pnkfelix commented Feb 27, 2015

@nnethercote @nikomatsakis is there any reason we would need to address this before 1.0? AFAICT we can add support for this backwards compatibly.

(I agree that something like this is desirable. I just want to know if it is something that we need to rush to work on, or if we can wait until the 1.0-related distractions are behind us...)

nnethercote commented Feb 27, 2015

I'm looking at this stuff now because I have a Q1 goal about fine-grained memory profiling in Servo.

Currently the only change to Rust is a single new function in std::rt::heap, which is an unstable library. And it was just suggested on the Rust Internals Forum that this could be solved instead by providing an option to provide an alternative allocator at link time, which may well be good enough and wouldn't require any language-level changes.

So there's no particular reason this has to be pre-1.0, as far as I can tell. It would be helpful to me to have feedback on whether any particular approach is likely to be accepted, though, so I don't waste time going down blind alleys.

nnethercote commented Mar 2, 2015

src/liballoc/heap.rs has a heap::imp module that gets used if the external_funcs feature is defined. It calls rust_allocate et al., which are functions not defined in the Rust implementation. This was added by rust-lang/rust@8723568.

So it looks like this functionality is already present, but it's really hard to use -- you have to recompile liballoc with external_funcs, and then compile your user program, also with external_funcs. (And if you don't provide rust_allocate et al. in your user program you'll get a link error, presumably.)

How hard would it be to improve the ergonomics here? Would it be possible to build two versions of liballoc by default, one with wrapping enabled and one without, and then you could choose which one to use in your user program at compile-time?

I'm not sure what to do about allocating/deallocating from within rust_allocate et al., but maybe you could just call the jemalloc functions (e.g. je_mallocx) directly.

@scialex added this feature, perhaps he/she could comment.

pnkfelix commented Mar 3, 2015

@nnethercote By the way, thank you for filling in the details here. In particular, knowing that this is a goal for Servo means that we could provide an unstable API for your use. I think that will be much easier for all of us to swallow (though of course there is still the question of whether we should put this in as a global default for all clients of liballoc).

nnethercote commented Mar 5, 2015

So it looks like this functionality is already present, but it's really hard to use -- you have to recompile liballoc with external_funcs, and then compile your user program, also with external_funcs.

And it looks like rust_allocate et al. must be C functions rather than Rust functions, which is a shame.

pnkfelix commented Mar 5, 2015

And it looks like rust_allocate et al. must be C functions rather than Rust functions, which is a shame.

Are you saying that one is forced to implement such functions in C and not Rust? (I do not see why that would be the case.)

Or are you saying that one must write extern "C" fn rust_allocate(..) { ... } in one's Rust program, and that is a shame compared to writing without the extern "C" ? (In which case, I can see that is a wart, but it doesn't seem like that big a deal...)

nnethercote commented Mar 5, 2015

Are you saying that one is forced to implement such functions in C and not Rust?

I was saying that, but I was mistaken. Thanks for clarifying.

As implemented, the feature is still very difficult to use. I don't want rustc itself to use the rust_allocate functions, I just want my Rust program to use them, and I tried and failed to achieve this. (And that was after fixing a bunch of compile errors that have crept into the external_funcs module due to bitrot.)

pnkfelix commented Mar 5, 2015

@nnethercote I agree that the current workflow is too awkward. I want to look over your code again and review some other material related to allocators; hopefully we can put something together for you soon. (Again: It's a big help that this need not be a stable API.)

nnethercote commented Mar 5, 2015

@pnkfelix: I'm not sure it even needs an API within the language. I guess it depends if the external_funcs feature (or similar) counts as an API. Opting-in via a compile-time feature definitely feels better than having to call a function from std::rt::heap.

nikomatsakis commented Mar 6, 2015

I think this is very related to the desire to have executables be able to switch between jemalloc and other allocators easily. See this blog post for some thoughts on that. I'm not sure if the scheme I described there works exactly, but it seems like if we could get SOME scheme that works, we could use that same scheme to handle the feature request here, no?

nnethercote commented Mar 7, 2015

@nikomatsakis: Thanks for the link. I had only thought about trying to wrap jemalloc, not system malloc. I've added instrumentation to Servo to measure both, on Linux at least (jemalloc via je_mallctl, and system malloc via mallinfo), but jemalloc tends to dominate for the workloads I've measured. Having to think about both definitely complicates things.

I agree that if we have a world where there are multiple versions of liballoc, that is definitely likely to make it easier to facilitate wrapping. E.g. you might have a malloc version, a jemalloc version, and a wrapped-jemalloc version. Or something.

BTW, it might be useful for you to talk to Mike Hommey (a.k.a. glandium). He's the expert on how Firefox uses jemalloc, including all the tricks necessary on all major platforms. Indeed, just a couple of weeks ago he fixed a bug that was causing some libraries used by Firefox to (unintentionally) use system malloc.

nikomatsakis commented Mar 10, 2015

On Sat, Mar 07, 2015 at 03:53:18PM -0800, Nicholas Nethercote wrote:

@nikomatsakis: Thanks for the link. I had only thought about trying to wrap jemalloc, not system malloc. I've added instrumentation to Servo to measure both, on Linux at least (jemalloc via je_mallctl, and system malloc via mallinfo), but jemalloc tends to dominate for the workloads I've measured. Having to think about both definitely complicates things.

I'm not sure I follow. The goal is to have only one allocator, and to allow that to be chosen when linking the executable, which seems like what you want?

BTW, it might be useful for you to talk to Mike Hommey (a.k.a. glandium). He's the expert on how Firefox uses jemalloc, including all the tricks necessary on all major platforms. Indeed, just a couple of weeks ago he fixed a bug that was causing some libraries used by Firefox to (unintentionally) use system malloc.

Thanks for the suggestion!

pnkfelix commented Oct 15, 2015

We have now provided this, via RFC 1183, which describes the capability to swap in a different low-level default allocator.
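For readers arriving later: on stable Rust the user-facing way to interpose on every heap allocation (the counting-profiler use case this issue describes) eventually became the `GlobalAlloc` trait with the `#[global_allocator]` attribute, a later mechanism than the RFC 1183 plumbing referenced here. A minimal hedged sketch of a counting allocator in modern Rust, not the API discussed in this thread:

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

// Wraps the system allocator and counts every allocation and deallocation.
struct Counting;

static ALLOCS: AtomicUsize = AtomicUsize::new(0);
static DEALLOCS: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for Counting {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        // Plain atomics: no locks or TLS, so the reentrancy problem
        // discussed earlier in this thread does not arise here.
        ALLOCS.fetch_add(1, Ordering::Relaxed);
        System.alloc(layout)
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        DEALLOCS.fetch_add(1, Ordering::Relaxed);
        System.dealloc(ptr, layout)
    }
}

#[global_allocator]
static A: Counting = Counting;

fn main() {
    let before = ALLOCS.load(Ordering::Relaxed);
    {
        let _v = vec![0u8; 1024]; // forces at least one heap allocation
    }
    assert!(ALLOCS.load(Ordering::Relaxed) > before);
    println!("allocations observed: {}", ALLOCS.load(Ordering::Relaxed));
}
```

A wrapper like this that itself allocates (e.g. to record stack traces) would still need the per-thread disable flag from the prototype above.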

Closing.
