
fallible collection allocation 1.0 #2116

Merged (2 commits, Feb 7, 2018)

Conversation

@Gankro (Contributor) commented Aug 18, 2017

Add minimal support for fallible allocations to the standard collection APIs. This is done in two ways:

  • For users with unwinding, an oom=panic configuration is added to make global allocators panic on oom.
  • For users without unwinding, a try_reserve() -> Result<(), CollectionAllocErr> method is added.

The former is sufficient for unwinding users, but the latter is insufficient for the others (although it is a decent 80/20 solution). Completing the no-unwinding story is left for future work.
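
As a minimal sketch of the proposed fallible path (assuming the RFC's try_reserve signature and CollectionAllocErr in std::collections, both unstable at the time):

use std::collections::CollectionAllocErr; // as proposed by this RFC (unstable)

// Append a batch, reporting allocation failure instead of panicking/aborting.
fn try_append(buf: &mut Vec<u8>, data: &[u8]) -> Result<(), CollectionAllocErr> {
    buf.try_reserve(data.len())?; // the only step that can allocate
    buf.extend_from_slice(data);  // cannot allocate: capacity was reserved
    Ok(())
}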

Rendered


Updated link:

Rendered

@Gankro Gankro changed the title from fallible allocation 1.0 to fallible collection allocation 1.0 Aug 18, 2017

@Gankro (Contributor) commented Aug 18, 2017

I'm really sorry this isn't perfect. I am just deeply exhausted with working on this problem right now, and need to push out what I have just to get it out there and focus on something else for a bit.

I'm not 100% convinced by all my "don't rock the boat" rationales for CollectionAllocErr, and could probably be very easily convinced to change that. It's just that my default stance on this kinda stuff is "don't touch anything, because the Servo team is probably relying on it in 17 different ways that will make me sad".


@jethrogb (Contributor) commented Aug 18, 2017

I think the plan for unwinding is terrible. I could support the try_reserve part of this RFC separately as a stop-gap measure on the way to full support.

> Similar to the embedded case, handling allocation failure at the granularity of tasks is ideal for quality-of-implementation purposes. However, unlike embedded development, it isn't considered practical (in terms of cost) to properly take control of everything and ensure allocation failure is handled robustly.

There is no evidence of this not being considered practical.

More in general, the server use case seems a little thin. I've already mentioned in the internals thread that there's a lot more to be considered there: servers come in different shapes and sizes. Considering one of Rust's 2017 goals is “Rust should be well-equipped for writing robust, high-scale servers” I think this use case (or, I'd like to argue, use cases) should be explored in more detail.

Depending on unwinding for error handling is a terrible idea and entirely contrary to Rust best practices. This by itself should be listed under the “drawbacks” section. Besides being counteridiomatic, recovering from unwinding doesn't work well in at least three cases, two of which are not currently considered by the RFC:

  1. Platforms without unwinding support. Not really any need to go into detail here, as the RFC describes it pretty well.
  2. FFI. Unwinding is not supported across FFI boundaries. Allocation errors currently result in a relatively clean abort; with this RFC, unwinding from allocation errors through FFI can result in weird/non-deterministic/undefined behavior.
  3. Synchronization primitives. You can't use any of the standard synchronization primitives such as Once, Mutex, and RwLock if you expect your code to unwind, because of the possibility of lock poisoning. This was also already mentioned in the internals thread.
@rpjohnst commented Aug 18, 2017

Using unwinding to contain errors at task granularity is completely idiomatic. It's why Rust bothers to have unwinding at all. Allowing OOMs to panic in addition to their current behavior is totally in line with this. It's not a full solution, but it is a necessary part of one.


@pnkfelix (Member) commented Aug 18, 2017

Update: The suggestion was followed. No need to read the rest of this comment (which I have left below the line).


I suggest that the filename for this RFC be changed to something that isn't quite so subtle.

(The current filename, "alloc-me-like-one-of-your-french-girls.md", is a meme/quote from the movie "Titanic"; I infer that reference is meant to bring to mind "fallibility", but I needed some help along the way.)


@jethrogb (Contributor) commented Aug 18, 2017

> Using unwinding to contain errors at task granularity is completely idiomatic.

Only as a last resort, so that one assertion failure doesn't accidentally take down your whole process. Unwinding should not be used for errors that are more or less expected and that you know how to deal with: https://doc.rust-lang.org/stable/book/second-edition/ch09-03-to-panic-or-not-to-panic.html


@rpjohnst commented Aug 18, 2017

Yes, precisely. In many situations, allocation failure is unexpected and has no meaningful response at a granularity smaller than a task. This is a reason to support oom=panic rather than just abort.


@mark-i-m (Contributor) commented Aug 18, 2017

@Gankro Thanks for all the hard work!

However, I also don't like unwinding as error handling. Unwinding should only happen when I have signaled that I don't have anything to do to help; IMHO, it is damage control rather than error handling.

Have there been any proposals to add some sort of set_oom_handler to the allocator interface? I am imagining something like the following:

enum OOMOutcome {
    Resolved(*const u8), // OOM was resolved and here is the allocation

    // Could not resolve the OOM, so do what you have to
    #[cfg(oom = "panic")]
    Panic,
    #[cfg(oom = "abort")]
    Abort,
}

fn set_oom_handler<H>(handler: H)
    where H: Fn(/* failed operation args... */) -> OOMOutcome;

You would then call set_oom_handler with your handler at the beginning of your program. Your handler can then choose what it wants to happen, including triggering a GC or whatever...

The benefit of this approach is that the existing interface doesn't have to change at all, and applications don't have to choose between try_reserve and the existing methods.

An alternate approach would be to make the oom_handler a language item or something (but that seems more suitable for the global allocator than collection allocators).

Yet another approach would be to make the OOM handler a type that implements a trait. Allocators would then be generic over their OOMHandler type.
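
A minimal usage sketch of this hypothetical API (set_oom_handler and OOMOutcome are the proposed items above, not anything in std today):

fn main() {
    // Hypothetical: register the handler once at startup.
    set_oom_handler(|/* failed operation args */| {
        // Try to free something (drop caches, trigger a GC, ...).
        // If nothing can be reclaimed, fall through to the configured policy.
        OOMOutcome::Abort
    });

    // The rest of the program keeps using the existing infallible APIs.
    let mut v = Vec::new();
    v.push(42);
}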


@elahn commented Aug 19, 2017

Great idea, @mark-i-m. In a server app, I'd grab a chunk of memory on startup, then in my OOM handler:

  • set max_inflight_requests = inflight_requests - 1
  • send back-pressure/notify the monitoring, traffic and instance manager
  • resolve the allocation using some of the chunk, so the request can succeed

On operating systems that error on allocation as opposed to access, this would remove the need to manually tune the app for request load as a function of memory consumption.

Thanks @Gankro for your work on this RFC.


@whitequark commented Aug 19, 2017

@Gankro I have mixed feelings about this RFC. I think this is a workable design: it provides me with the ability to write a VecAllocExt trait that gives me try_push and friends, ergonomic enough to provide a short path to eventual stability, and these are the parts that I like. It's useful! But it's also very imperfect, and (without going into details) I would like to see most of the internals completely redone before it's stabilized.

Still, it's much better than the current situation, and it doesn't have any drawbacks I can see as long as the guts are unstable (taking into account that application/firmware code will be written against this), so I'm in favor of merging this even as-is.

Thanks for writing this!


@mark-i-m (Contributor) commented Aug 20, 2017

Perhaps this could be an experimental RFC until we get experience with what doesn't work so well? That seems to have worked well in the past for the allocator interfaces...


@gnzlbg (Contributor) commented Aug 21, 2017

Given the constraints, I'd say this is good enough.

I have only one question: what are the semantics of try_reserve on a HashMap? That is, if I want to insert an element into a HashMap and I do something like hash_map.try_reserve(hash_map.len() + 1), am I guaranteed that on insertion the HashMap won't try to allocate?


@aturon aturon self-assigned this Aug 22, 2017

@arthurprs commented Aug 23, 2017

@gnzlbg that's an implementation detail. The stdlib version, for example, is able to guarantee room for any combination of items in advance. Some variants, though, can't 100% guarantee room without knowing the dataset; hopscotch and cuckoo hash tables come to mind as examples.


@gnzlbg (Contributor) commented Aug 23, 2017

@arthurprs IIUC the whole point of try_reserve is to let you use std::collections without panic!ing on OOM, with the guarantee that "if try_reserve(N) succeeds, the collection can grow up to size N without allocating new memory", so that OOMs cannot happen. Avoiding panics is neither easy, nor ergonomic, nor reliable, but if you are careful, doable.

Without this guarantee, one cannot avoid OOM at all, so why would I actually call try_reserve instead of reserve if I can get an OOM panic! anyway? E.g. try_reserve(N) succeeds, OK, now what? I can still get an OOM panic. Is there anything useful that I can do with this information if I cannot use it to avoid OOM panics?

So... I must be missing something, because as I understand it, this guarantee is not an implementation detail of try_reserve, but its reason for existing. Without this guarantee, I really don't see how the method can be used to do anything useful (*).

(*) unless we add try_... variants of other insertion methods.
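
For concreteness, the pattern this guarantee would enable might look like the following (a sketch assuming the RFC's try_reserve and CollectionAllocErr, both unstable at the time):

use std::collections::{CollectionAllocErr, HashMap};

// Insert without any panic path: the only allocating step is fallible.
fn try_insert(map: &mut HashMap<String, u32>, k: String, v: u32) -> Result<(), CollectionAllocErr> {
    map.try_reserve(1)?; // may fail, but reports instead of panicking
    map.insert(k, v);    // under the guarantee above, cannot allocate
    Ok(())
}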


@arthurprs commented Aug 23, 2017

I don't disagree with your overall idea, I'm just saying that it can't be done in ALL cases.


@gnzlbg (Contributor) commented Aug 23, 2017

> I don't disagree with your overall idea, I'm just saying that it can't be done for ALL data structures.

@arthurprs This is why I was asking (and why I chose HashMap as an example). I agree with you on this: this cannot be guaranteed for all collections.


@jethrogb (Contributor) commented Aug 23, 2017

I'm going to answer @RalfJung's question on i.r-l.o here to keep the discussion in one place:

@RalfJung wrote:

> > @jethrogb wrote: Unwinding is not a panacea. Depending on unwinding for continuity makes it so that you can’t use std::sync::{Mutex, Once, RwLock} anywhere in your code because of std::sync::PoisonError.
>
> Could you elaborate? Poisoning actually helps dealing with unwinding; if it wasn’t for poisoning, unwinding would be much more likely to introduce subtle bugs into programs. So it actually seems to me like especially when you do unwinding should you use concurrency primitives that do poisoning.
>
> If you don’t do unwinding, things will never be poisoned anyway.

(Sorry, this turned into quite a long post. Please bear with me)

The problem with unwinding is that "handler" level is not the right place to handle errors. Throughout the rest of this post, keep in mind that, just as it's possible to write correct code in C++, I understand it's possible to write correct code in Rust using unwinding. It's just not a good way, in the sense that it's a giant footgun.

Here's a simple implementation of a synchronous stack:

struct SyncStack<T> {
    vec: Mutex<Vec<T>>
}

impl<T> SyncStack<T> {
    fn push(&self, t: T) {
        match self.vec.lock() {
            Ok(ref mut vec) => vec.push(t),
            Err(poison) => (/* what do I do here? is `inner` even valid? */)
        }
    }
}

Now I'm sure you (Ralf) have a proof somewhere that if any code in the Vec implementation panics, its internal invariants have all been upheld. That's great! But what about user's invariants?

struct Worker { /* ... */ }
struct Pool {
    // always `Some` when `Mutex` is not locked
    worker: Mutex<Option<Worker>>
}
impl Pool {
    /// N.B. `F` must not panic
    fn do_work<F: FnOnce(Worker) -> Worker>(&self, work: F) {
        let mut worker = self.worker.lock().unwrap();
        *worker = Some(work(worker.take().unwrap(/* oops, could have been `None`! */)));
    }
}

If my work contains any kind of allocation, and failed allocations panic, this would now panic where it wouldn't before. Even if this example were written to check for None in the PoisonError case, where would the Pool get a new Worker from? The previous one was dropped in the unwind. Moreover, even if we used catch_unwind here, the Worker was moved into the closure and we're never getting it back! If we used idiomatic error handling instead of unwinding, you would return (Result<...>, Worker) or something like that, and there wouldn't be a problem.
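
A sketch of that Result-based alternative (generic over an error type E; no unwinding is involved, so the Worker always comes back and the pool's invariant is restored on both success and failure):

impl Pool {
    fn do_work_fallible<F, E>(&self, work: F) -> Result<(), E>
    where
        F: FnOnce(Worker) -> (Result<(), E>, Worker),
    {
        let mut slot = self.worker.lock().unwrap();
        let worker = slot.take().expect("invariant: Some while unlocked");
        // The closure hands the worker back alongside its result, so the
        // invariant is re-established whether or not the work failed.
        let (result, worker) = work(worker);
        *slot = Some(worker);
        result
    }
}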

Let's look at Once. Well, we're looking at some other construct that uses Once internally for extra non-obviousness:

struct Certificate { /*...*/ }

fn get_certificates() -> Box<Iterator<Item=Certificate>> {
    unimplemented!()
}

lazy_static! {
    static ref ROOT_CERTS: Vec<Certificate> = get_certificates().collect();
}

If the collect runs out of memory, there is no way to recover from this situation. The lazy_static initializer runs inside of a Once::call_once closure, and once that panics, every future call to call_once also panics, and every time you're trying to dereference ROOT_CERTS, your thread will panic.

As I said at the beginning, most of these situations (but not all, see the Worker example) could be "handled" by sprinkling catch_unwind all over your code, but I surely hope no one is advocating that. The typical use case people apparently have in mind for catch_unwind is task-level isolation. However, as I've hopefully shown here, trying to handle poison errors at that level is basically impossible. Yes, it's certainly no worse than just aborting on OOM as is done now, but it doesn't really get us closer to graceful handling of OOM either.


@kornelski (Contributor) commented Aug 28, 2017

The reserve API sounds like a great idea. The examples that reserve just one element, though, seem overly conservative; I imagine this being used with a total upper-bound estimate at the beginning of a function (or internally in collections when they double their buffer).


@kornelski (Contributor) commented Aug 28, 2017

I'm super happy about oom=panic being included! 💯

(edit: I wondered what to do about .collect(). I've found it can be replaced with .extend(), so such code can be adapted to work with try_reserve, too.)
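
A sketch of that adaptation (assuming the RFC's try_reserve and CollectionAllocErr, and an exact-size iterator so the bound is known up front):

use std::collections::CollectionAllocErr;

// Fallible replacement for `iter.collect::<Vec<_>>()`.
fn collect_fallible<I: ExactSizeIterator>(iter: I) -> Result<Vec<I::Item>, CollectionAllocErr> {
    let mut v = Vec::new();
    v.try_reserve(iter.len())?; // fail early instead of panicking on OOM
    v.extend(iter);             // no allocation: capacity already reserved
    Ok(v)
}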


@badboy referenced this pull request Aug 28, 2017: September 2017 #37 (closed)
@RalfJung (Member) commented Aug 29, 2017

@jethrogb

> But what about user's invariants?
> [...]
> However, as I've hopefully shown here, trying to handle poison errors at that level is basically impossible.

Thanks for this detailed answer! Certainly, if a stack internally uses a lock, it has to guarantee that if push returns without panicking, the element has indeed been pushed. This will ensure that user invariants hold; the fact that poisoning happened must not be observable. This may not be an easy guarantee to uphold, but it is not made harder by lock poisoning. If in doubt, you can always fall back to mutex.lock().unwrap() and "propagate" the failure if you cannot handle it.

Remember, my initial question is about your statement that I read as "poisoning (specifically std::sync::PoisonError) makes using unwinding harder". I still do not understand how that should be the case.

What I do agree with is that locks (no matter whether they have a mechanism such as std::sync::PoisonError or not) make using unwinding for task isolation harder. If a task fails while it holds the lock, that lock is now "infected" by the failure, and if unwrap() is used as suggested above, other tasks accessing the lock will now also fail. This is an inherent problem with shared mutable state; if you want your tasks to be isolated also in terms of failure, you must only share read-only state between them. (As usual, there are exceptions, like transparent caching/memoization or so.) And maybe that is what you wanted to say and I just misunderstood you?

If locks didn't have poisoning, the situation would not be any better -- it would indeed be worse! You would implicitly rely on invariants being upheld in the middle of a critical section, and you might not even notice that you are doing this. Lock poisoning forces you to think about this case, and that can only improve the result.
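
For concreteness, both reactions to poisoning can be expressed with the existing std API (a sketch; PoisonError::into_inner recovers the guard when you know the data's invariants still hold, as with Vec here):

use std::sync::{Mutex, PoisonError};

fn len_even_if_poisoned(stack: &Mutex<Vec<u32>>) -> usize {
    // `lock()` surfaces poisoning as an Err. One can propagate it by
    // panicking (`.unwrap()`), or -- when the inner data is known to be
    // consistent -- recover the guard and continue:
    let guard = stack.lock().unwrap_or_else(PoisonError::into_inner);
    guard.len()
}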


@jethrogb (Contributor) commented Aug 29, 2017

> What I do agree with is that locks (no matter whether they have a mechanism such as std::sync::PoisonError or not) make using unwinding for task isolation harder.

I think this is pretty close to the point I'm trying to make. Indeed, the mechanism doesn't really matter. When encountering a poison error, I'd say it's probably too late to do anything useful; you need to handle these errors where they come up. So unless you want to combine every use of lock with catch_unwind (which I think is a terrible idea), unwinding for error handling is not a good idea.


@arielb1 (Contributor) commented Aug 29, 2017

@jethrogb

But it's not like using Result instead of panics for all these random failure modes will make failure handling easier. Shared mutable state + random returns = not fun.

Though I suppose you could have a way to "turn off" panicking in a lock guard for some operation. Maybe have it take the lock guard by &Self and require its contents to be AssertUnwindSafe?


@jethrogb (Contributor) commented Aug 29, 2017

> But it's not like using Result instead of panics for all these random failure modes will make failure handling easier

What do you mean? Rust's ease of doing error handling soundly and accurately through the Result system is one of the key strengths of the language.


@rpjohnst commented Aug 29, 2017

Unwinding has always been in Rust explicitly for task-granularity error recovery, and it's not going away. If that doesn't work for your use case, great: use the try_ functions and Result; that's the primary purpose of this RFC.

But if you a) don't use (a lot of) shared state, or b) are okay with an even coarser granularity of failure (e.g. the set of threads that use a Mutex), then unwinding on OOM is useful. There is no reason to force everyone to use either oom=abort or fallible allocation everywhere, when oom=panic is perfectly serviceable.


@kornelski (Contributor) commented Aug 29, 2017

Lock poisoning is a red herring. It is a difficulty, but it's not a new problem, and it's not even specific to OOM handling: in Rust, every vec[x] can already do the same, and Rust already provides multiple ways of dealing with it, from graceful checks to aborting.


@stepancheg commented Jan 4, 2018

@whitequark

> What? That makes no sense.

I'm sorry I couldn't explain it.

> The event that can happen between the time of check and time of use is any other process making a large allocation. You can't share a mutex with literally everyone else on the system.

As I said, a mutex is only needed when you explicitly allocate certain large objects, like a "pixmap".

For regular allocations (like memory to store a URL) you simply allocate memory and don't worry about memory consumption. Even if the amount of allocated memory ends up exceeding 25% or 27% of system memory, it's not an issue.


@stepancheg commented Jan 4, 2018

@whitequark

> Citation needed. (Memory compression in particular is faster than memory fetches, so it is unlikely to be an issue here.)

Citation about what? Compressed memory is slower than plain memory: you need to decompress it to work with it, and you need to compress currently unused memory. So when total program memory consumption is close to 100%, the system starts spending time compressing/decompressing memory (and swapping).

> If we talk about Linux with overcommit, then Servo can and should take advantage of cgroups to limit its impact on other components of the system. In fact, this already happens on Linux to a degree with task groups automatically created per pty, since 2010!

The problem is that you suggest that there's a hard limit which shouldn't be exceeded. And that's not true.

When you allocate large objects (like images), you shouldn't exceed, say, 25%. But when allocating small objects (like memory to display a menu), it's OK to exceed 25%, because it won't affect overall system performance.

When you have a single "one size fits all" allocator limit, you get into a situation where you don't exceed the limits, but random parts of the program (like context menus) won't work, because they get random allocation failures.


@whitequark commented Jan 4, 2018

> When you allocate large objects (like images), you shouldn't exceed, say, 25%. But when allocating small objects (like memory to display a menu), it's OK to exceed 25%, because it won't affect overall system performance.

Have you tried to open Twitter lately and keep the tab open for a few hours? It is trivially possible to make your system unusable by allocating a lot of small objects.

In any case, I fail to understand how all this discussion of OS performance degradation is relevant to Rust. First, if your OS cannot manage its disk caches properly, fix your OS, not Rust. Second, even if everything you said matched reality (it doesn't on Windows, as @retep998 mentioned), it would make Rust collections useless for hard real-time systems. Why are you arguing in favor of that?


@gnzlbg (Contributor) commented Jan 4, 2018

@retep998

> I have noticed this link has not been shared yet so, unrelated to my previous message, here is a good overview of why overcommit is bad: https://www.etalabs.net/overcommit.html

That's extremely off-topic. Even though I agree that overcommit by default is bad, we can't stop Linux, MacOS, FreeBSD, and others from having overcommit enabled by default, so whatever solution is implemented in std:: must work for both users with overcommit by default and users without it (or be part of std::os::).


> Then please explain what is e.g. Servo supposed to do when allocating storage for a potentially very large pixmap, or any other application that has similar needs. This is a use case that is not possible to support without fallible allocations.
>
> When answering my question, consider that (a) Rust supports custom allocators, and (b) mmap can be told to allocate backing storage immediately even on systems with overcommit.

First, Vec with the global allocator (be it jemalloc or the system allocator) on Linux distros with overcommit would have a broken try_reserve. The code would appear to do something useful, but it would be setting up a time bomb.

Second, the custom allocator interface doesn't differentiate between reserving memory and committing memory, so you would need to use a custom allocator for those vectors that bypasses the global allocator and uses mmap directly. Given the current Alloc interface, to work properly it would need to commit memory on alloc, so you wouldn't want to use this allocator for all vectors, only for those that can potentially OOM.

At this point, Servo might as well do what high-performance apps do and have its own vector implementation for these particular cases. It already uses SmallVec instead of Vec in some situations, so there is precedent. This custom vector type would not be parametrized by an Allocator type, but would directly use VirtualAlloc on Windows and mmap on POSIX platforms to reserve memory in reserve/with_capacity/... and commit memory in push/insert/resize/extend/... Adding try_reserve/try_push/try_insert/... methods to it that work correctly is trivial (you just check VirtualAlloc and mmap for errors on commit), at least compared to trying to add try_reserve to std:: such that it works correctly on all platforms. And this vector type has some other nice properties, like an extremely simple growth policy, or making it very easy to achieve zero reallocations without wasting memory.
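
A rough sketch of that reserve/commit split on the POSIX side (assuming the libc crate; illustrative only, since whether a successful commit truly guarantees backing storage still depends on the platform's overcommit policy):

use std::ptr;

// Reserve address space without committing it: the pages are inaccessible
// and no backing storage is charged yet.
unsafe fn reserve(len: usize) -> *mut u8 {
    let p = libc::mmap(
        ptr::null_mut(),
        len,
        libc::PROT_NONE,
        libc::MAP_PRIVATE | libc::MAP_ANONYMOUS | libc::MAP_NORESERVE,
        -1,
        0,
    );
    if p == libc::MAP_FAILED { ptr::null_mut() } else { p as *mut u8 }
}

// Commit a previously reserved range; a failure here can be surfaced as a
// fallible-allocation error instead of a later OOM kill.
unsafe fn commit(p: *mut u8, len: usize) -> bool {
    libc::mprotect(p as *mut libc::c_void, len, libc::PROT_READ | libc::PROT_WRITE) == 0
}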


> Just because we don't have a complete win doesn't mean we have to instead suffer a total loss. Can't we at least have a partial win?

A partial win is oom=panic. It is still useful on systems with overcommit and it isn't very controversial, but the whole try_reserve discussion is delaying progress on it. Hence why I've repeatedly asked to split this RFC: it doesn't make sense to delay uncontroversial, useful features for other features that still need more work (I would really like to have try_reserve or something similar, but whatever it is, it must work).


@stepancheg commented Jan 4, 2018

@whitequark

> Have you tried to open Twitter lately and keep the tab open for a few hours? It is trivially possible to make your system unusable by allocating a lot of small objects.

I'm not an expert in browser development, but probably each tab should have a counter of memory allocated for the tab, and the tab should e.g. be restarted when it allocates too much memory.

I don't see how fallible allocations will help with the case of the Twitter tab. OK, the program won't exceed the memory limit and won't crash, but it won't be able to properly recover either (e.g. it may start killing the wrong tabs instead of the busiest Twitter tab).

> if your OS cannot manage its disk caches properly, fix your OS, not Rust

First, suggesting to fix the OS is a bit non-constructive.

Second, the OS is doing the best it can: it gives apps as much memory as they require, and uses all remaining memory to store disk caches. The result is that when an app starts to request too much memory, the system begins swapping and disk caches stop working well. But this is a proper strategy.

> Second, even if everything you said matched reality, it would make Rust collections useless for hard real-time systems.

I don't know why fallible allocation is a requirement for hard real-time systems. I think "hard real-time systems" should simply never reach OOM.


@rpjohnst commented Jan 4, 2018

> I don't know why fallible allocation is a requirement for hard real-time systems.

Then go learn why before you derail this thread any further.


@eternaleye commented Jan 4, 2018

@stepancheg:

> long before failing allocation, system will become unusable because of swapping (or compressed memory on macOS) and disk caches

Not all systems have swap. My phone, with 6G of RAM, has no swap. Servo is explicitly intended to support mobile. As a result, your suggestion is incompatible with Servo's roadmap.

> I don't know why fallible allocation is a requirement for hard real-time systems. I think "hard real-time systems" should simply never reach OOM.

Let's walk through a thought experiment.

  1. My process needs to allocate memory.
  2. My process checks how much memory is available, and it's plenty.
  3. Some other process allocates memory.
  4. My process allocates memory, exceeds what's available, and dies.

This race condition is unavoidable as long as (2) and (4) are separate steps. What does it look like when they're a single step that can't be broken by being split in half? It looks like a fallible allocation: it either successfully allocates, or returns an error safely.

Fallible allocations are how a hard real-time system "simply never reaches OOM" - the allocation failing tells it that there's a shortage, and it can do some tidying up and try again.
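
A sketch of that single-step pattern (assuming the RFC's try_reserve; tidy_up is a hypothetical application hook that frees reclaimable memory):

fn tidy_up() {
    // Application-specific: drop caches, compact pools, etc.
}

// Check-and-allocate as one step: the failure itself is the signal that
// memory is short, with no racy "check available memory" beforehand.
fn grow_or_recover<T>(buf: &mut Vec<T>, additional: usize) -> bool {
    for _ in 0..3 {
        if buf.try_reserve(additional).is_ok() {
            return true;
        }
        tidy_up(); // reclaim something, then retry
    }
    false
}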


@stepancheg commented Jan 5, 2018

> Not all systems have swap. My phone, with 6G of RAM, has no swap. Servo is explicitly intended to support mobile. As a result, your suggestion is incompatible with Servo's roadmap.

It still has a kind of swapping: when the current process consumes too much memory, other background processes are terminated instead of being kept hot for fast switching.

So even if the phone has 6G of memory, a good browser shouldn't use all of it.

And even if the browser wants to use all of it, it should still properly divide memory between open windows (instead of killing random windows on OOM).

So a browser requires smarter memory management than simple "fallible allocations".

> I don't know why fallible allocation is a requirement for hard real-time systems. I think "hard real-time systems" should simply never reach OOM.
>
> This race condition is unavoidable as long as (2) and (4) are separate steps ...

Thanks for the clarification.

I understand that a "hard real time system" may require an operation like try_allocate (which is a good thing, and currently available in Rust, just not yet stabilized).

But I doubt that a "hard real time system" is OK with lots of functions (like Rc::new) panicking on allocation.

stepancheg commented Jan 5, 2018

Not all systems have swap. My phone, with 6G of RAM, has no swap. Servo is explicitly intended to support mobile. As a result, your suggestion is incompatible with Servo's roadmap.

It still has a kind of swapping: when the current process consumes too much memory, other background processes are terminated instead of being kept hot for fast switching.

So even if the phone has 6G of memory, a good browser shouldn't use all of it.

And even if the browser wants to use all of it, it should still properly divide memory between open windows (instead of killing random windows on OOM).

So a browser requires smarter memory management than simple "fallible allocations".

I don't know why fallible allocation is a requirement for hard real-time systems. I think "hard real-time systems" should simply never reach oom.

This race condition is unavoidable as long as (2) and (4) are separate steps ...

Thanks for the clarification.

I understand that a "hard real time system" may require an operation like try_allocate (which is a good thing, and currently available in Rust, just not yet stabilized).

But I doubt that a "hard real time system" is OK with lots of functions (like Rc::new) panicking on allocation failure.

Contributor

Ericson2314 commented Jan 5, 2018

try_allocate isn't even that good. There's still a panic path in the code, and nothing but hard-to-verify logic spanning the collection implementation and the collection consumer (so very non-local) prevents it from being hit.

Result<Whatever, AllocErr> on operations, as I've done (as mentioned above the revival of this thread in #2116 (comment)), is much better because no panic is emitted in the first place. The types trivially ensure the code is total, including being race-free.
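
As a rough illustration of that style, a hypothetical FallibleVec wrapper whose growing operations return Result (built here on the RFC's try_reserve for brevity; Ericson2314's actual branch threads the error through the allocator type instead):

    struct FallibleVec<T>(Vec<T>);

    #[derive(Debug)]
    struct AllocErr;

    impl<T> FallibleVec<T> {
        // No panicking branch is emitted: the only way to grow is fallible.
        fn push(&mut self, value: T) -> Result<(), AllocErr> {
            self.0.try_reserve(1).map_err(|_| AllocErr)?;
            self.0.push(value); // capacity already reserved: cannot allocate
            Ok(())
        }
    }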

eternaleye commented Jan 5, 2018

@stepancheg:

It still has a kind of swapping: when the current process consumes too much memory, other background processes are terminated instead of being kept hot for fast switching.

This is not swapping by any definition; it is neither demand paging (by which individual pages of memory are moved to disk and back) nor process swapping (where an entire process is offloaded or restored at once). The core reason is that once this happens, you have lost data.

It is a consequence of overcommit, and is deeply problematic for many use cases.

So even if the phone has 6G of memory, a good browser shouldn't use all of it.

This is fallacious. Just because 6G of memory is in use does not mean the browser has used it. And even then, it does not mean the browser is wrong to use it.

  • The browser may be running in a situation where most of the memory has been used by something else.
  • There may just not be much memory - my phone has more than most by a large margin.
  • The browser may have no choice in how much is used because the user may be using a web-based video editor
  • Or running Linux in a VM compiled to WebAssembly
  • Or any number of other things.

And even if the browser wants to use all of it, it should still properly divide memory between open windows (instead of killing random windows on OOM).

Where, exactly, did I say that the browser should "kill random windows on OOM"? It's much more likely that it would discard decoded images on background tabs, or drop prerendered content meant to accelerate scrolling, at the cost of needing to recompute those if they became foreground tabs.

In addition, "properly dividing memory between open windows" is a nonsense phrase. The browser does not have control over how much memory a web page needs. It can't just decide that a video on Youtube gets the same amount of memory as a Rust crate's documentation.

So a browser requires better memory management than simple "fallible allocations".

Sure, I'll agree with that. The important point is that it's impossible to build that without fallible allocations. Those are the fundamental building block on which abstractions can be constructed. The alternative to "fallible allocations" isn't "smart memory management"; it's "being suddenly killed by the kernel with no opportunity to prevent it or save what the user was typing."

But I doubt that a "hard real time system" is OK with lots of functions (like Rc::new) panicking on allocation failure.

There are a number of ways to answer this:

  • Generic associated types will make it possible to be generic over smart pointers, allowing hard real-time systems to use arena-based ones in place of Rc
  • A try variant of Rc::new is entirely doable
  • Hard real-time systems usually don't use reference-counting, because deallocating data can cause an unbounded delay as anything it referred to is possibly deallocated

I'd suggest spending more time learning about hard real-time systems before trying to claim you know what they need. So far, you've been consistently incorrect.

stepancheg commented Jan 5, 2018

@eternaleye

I'd suggest spending more time learning

Well, if you're so sure that you know things and I don't, there's no point in arguing; you won't hear my arguments.

comex commented Jan 5, 2018

@eternaleye Assuming you're talking about Android, doesn't that have overcommit enabled by default?

But in any case, there are certainly environments where overcommit is not enabled and handling OOM is desirable. We don't have to go as far as hard realtime systems (which have a lot of other unusual requirements) – how about your favorite OS kernel? Even on Linux, kernel code is expected to gracefully handle out-of-memory conditions… at least some of the time. There have long been proof-of-concepts of writing Linux kernel modules in Rust, and in the future, custom allocator support should make it possible to use standard-library collections there.

eternaleye commented Jan 5, 2018

@comex:

Assuming you're talking about Android, doesn't that have overcommit enabled by default?

Yes, but I was pointing out that the performance degradation warning sign would be absent. Instead, this is a system where the race condition I described instantly results in data loss.

stepancheg commented Jan 5, 2018

@comex

how about your favorite OS kernel?
Even on Linux, kernel code is expected to gracefully handle out-of-memory conditions… at least some of the time

For "some" allocations, calling the alloc crate explicitly and/or implementing a "FallibleVec" outside of the stdlib would probably be enough.

whitequark commented Jan 5, 2018

@eternaleye

Hard real-time systems usually don't use reference-counting, because deallocating data can cause an unbounded delay as anything it referred to is possibly deallocated

That's actually not entirely true. So long as there are no loops where a type T stores an Rc<T> directly or indirectly, the number of elementary deallocation operations is bounded by the depth of the tree rooted in the Rc<T>. In practice, T would often not store any pointers at all; consider, e.g., network buffers that are reference-counted. (You can see that in lwip.) You would also need pool allocators, as opposed to free-list allocators or something similar, to avoid the overhead of merging and splitting free blocks, so that the delay is also bounded in time.

In other words, an Rc::try_new() that uses a pool allocator is a perfectly reasonable thing to have on a hard real-time system and there are indeed examples of it today in wide use.
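
An illustrative example of that acyclic shape (names hypothetical): Packet stores no further Rc pointers, so dropping the last Rc<Packet> performs exactly one deallocation, a bound that a pool allocator can make constant-time:

    use std::rc::Rc;

    struct Packet {
        payload: [u8; 1500], // e.g. a reference-counted network buffer, as in lwip
    }

    fn share(p: Packet) -> (Rc<Packet>, Rc<Packet>) {
        let a = Rc::new(p); // with a pool allocator, this would be a fallible try_new
        let b = Rc::clone(&a);
        (a, b)
    }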

whitequark commented Jan 5, 2018

@stepancheg

For "some" allocations, calling the alloc crate explicitly and/or implementing a "FallibleVec" outside of the stdlib would probably be enough.

And why exactly is this better than having try_reserve? You've just split the crates that work with collections into two incompatible universes, in exchange for... what? A minor ideological point?

Contributor

Ericson2314 commented Jan 5, 2018

The try_reserve way has dead branches that panic, versus no panics at all; a minor disadvantage. But we don't have to pay the price of a second crate and a duplicated implementation to have fallible collections!

stepancheg commented Jan 5, 2018

@whitequark

And why exactly is this better than having try_reserve? You've just split the crates that work with collections into two incompatible universes, in exchange for... what? A minor ideological point?

I think it's too expensive to duplicate all or a large part of the Rust stdlib with try_ functions. try_reserve on vectors is not enough; you also need:

  • Rc::try_new
  • String::try_push
  • BufWriter::try_new
  • ToString::try_to_string
  • Mutex::try_new
  • Channel::try_enqueue_with_oom
  • thread-local get, and so on.

Even worse, since try_ functions are not required to be called, it would be hard to know whether some particular library (or just a module inside a program) doesn't accidentally use a function which panics/crashes on OOM instead of returning Result.

(And I agree with @gnzlbg that the RFC should at least be split into the part which panics on OOM and the part which adds try_ functions.)

About the "minor ideological point": I'd like to insert the famous quote: "Simple things should be simple, complex things should be possible."

Most programs (and libraries) won't ever need or support fallible allocations; crashing on OOM is the best strategy for them, and having to deal with fallible allocations would be too much of a burden.

And for the specific situations where you need to gracefully handle OOM, a low-level API to allocate memory should be enough, and kernel-like developers could create specific libraries that use that API.

Contributor

Ericson2314 commented Jan 5, 2018

And I agree with @gnzlbg that the RFC should at least be split into the part which panics on OOM and the part which adds try_ functions

This is reasonable. In general, I disagree vehemently with catching unwinding as an error-handling strategy, but it's silly that OOM is somehow a different type of failure. It's even sillier that currently we also abort on invalid layout because of the laxity of Alloc::oom's type. Invalid use of the allocator API has nothing to do with OOM.

I think it's too expensive to duplicate all or a large part of the Rust stdlib with try_ functions

You're arguing against your own point? It is now demonstrated that this is not expensive at all: eddyb/rust@future-box...QuiltOS:allocator-error

it would be hard to know whether some particular library (or just a module inside a program) doesn't accidentally use a function which panics/crashes on OOM instead of returning Result.

Totally agreed! This is why I use the allocator type to enforce that this won't happen. (I do want to add a nicer way to zero-cost cast between the A and AbortAlloc<A> variants, but by virtue of it still being an extra method call, accidental usage is far less likely.)
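
A minimal sketch of that AbortAlloc idea, assuming an illustrative Alloc trait (not the std one) and the nightly never type: wrapping an allocator sets its error type to !, so the fallible paths vanish at the type level rather than relying on discipline:

    #![feature(never_type)]

    trait Alloc {
        type Err;
        fn alloc(&mut self, size: usize) -> Result<*mut u8, Self::Err>;
    }

    struct AbortAlloc<A>(A);

    impl<A: Alloc> Alloc for AbortAlloc<A> {
        type Err = !; // "cannot fail": on failure we abort instead of returning
        fn alloc(&mut self, size: usize) -> Result<*mut u8, !> {
            // abort() returns `!`, which coerces into any Result error type.
            self.0.alloc(size).or_else(|_| std::process::abort())
        }
    }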

whitequark commented Jan 5, 2018

Even worse, since try_ functions are not required to be called, it would be hard to know whether some particular library (or just a module inside a program) doesn't accidentally use a function which panics/crashes on OOM instead of returning Result.

No, this is solved by having a lint, and then using #[deny(infallible_allocation)] (until collections are parameterized with allocators) or by having an allocator type with an associated error type specified as ! (once collections are parameterized with allocators, as @Ericson2314 correctly suggests).

Most programs (and libraries) won't ever need or support fallible allocations

Every #![no_std] library that uses liballoc today can and should support fallible allocations, to allow using it in embedded contexts. If implemented as you propose, all these libraries will have to migrate to FallibleVec for this to happen, which means that using them in hosted contexts with panic-on-OOM becomes unnecessarily unwieldy.

And for the specific situations where you need to gracefully handle OOM, a low-level API to allocate memory should be enough, and kernel-like developers could create specific libraries that use that API.

You are not a developer working in an embedded, RTOS, or OS kernel context. Please stop talking for us, because you do not have any knowledge or understanding of what needs we have by your very own admission, and you have shown a complete lack of empathy or interest in understanding use cases other than "hosted Linux with overcommit".

Member

Manishearth commented Jan 5, 2018

Mod note Please calm down, everyone. Phrases like "Instead of bragging" are not constructive.

In general, talk about proposals, not people. It helps nobody to nitpick on who has the authority to talk about what. Instead, make your point, and explain the context as to why it is relevant. If someone makes a point that is inapplicable to a kind of system, talk about why it is inapplicable; don't focus on the credentials of the people involved.

@rust-lang rust-lang deleted a comment from stepancheg Jan 5, 2018

@rust-lang rust-lang deleted a comment from whitequark Jan 5, 2018

Contributor

mbrubeck commented Jan 5, 2018

further mod note: Two comments deleted. If you have complaints about how someone is engaging in a discussion, please talk to the mods and we can address it privately, rather than bringing them into the thread itself.

Member

aturon commented Feb 1, 2018

Today I finally girded myself to wade back into this thread :-)

To be honest, I think the key problem with this RFC is its title. Fallible allocation as a general topic has a huge set of stakeholders with divergent needs, and this RFC makes very clear that it is not trying to solve the general problem -- but its title perhaps suggests otherwise.

The fact of the matter is that try_reserve solves some problems for some people. This is indisputable, and should not be litigated further on this thread.

However, I think part of the frustration on the thread is that others feel they can see their way to a fully general solution that obviates the need for try_reserve, and serves a larger set of use-cases.

This is a classic situation in the Rust world. The danger, though, is that by always looking to the "more perfect" solution we never ship.

The classic way we handle this in Rust is by considering "forward compatibility". In this case: to what extent does try_reserve preclude "more perfect" solutions in the future? And the answer is clearly "it does not". At worst, we may eventually want to deprecate it in favor of something better.

Now, another concern is that try_reserve is the tip of the iceberg, and soon we will have try_XXX functions cropping up all over std. I agree that this would be an unfortunate outcome, but that goes back to the goals of this RFC. If you see the proposal here as specifically handling a sort of "best effort" situation for particular applications, we really only need this for a couple of core data structures (as proposed). @Gankro took great pains to limit the amount of API bloat needed.

By all means, let's discuss a more perfect solution -- on a separate RFC proposing it in detail. But in the meantime, let's ship a useful iteration that is forward-compatible, with the shared understanding that it's an imperfect approach that we don't want to apply universally across std.

Contributor

gnzlbg commented Feb 2, 2018

@aturon

By all means, let's discuss a more perfect solution -- on a separate RFC proposing it in detail. But in the meantime, let's ship a useful iteration that is forward-compatible, with the shared understanding that it's an imperfect approach that we don't want to apply universally across std.

We agreed before that, for try_push to be useful, it would need to correctly report an error on OOM, and for try_reserve to be useful, it would need to guarantee that subsequent calls to push that do not increase the capacity cannot fail. You wrote:

all of the collections try_reserve would be added to support a strong guarantee for its behavior

But this is not true. Vec has no control over this at all: only Vec's allocator has a say on whether this can or can't work.

In my opinion, the best platform-agnostic guarantees that we can provide for Vec::try_{push,reserve} are something like this:

  • try_reserve: iff Vec's allocator commits memory on allocation then a successful try_reserve guarantees that subsequent calls to push that do not increase the Vec's capacity cannot fail; otherwise, the behavior of try_reserve is undefined.

  • try_push: iff Vec's allocator commits memory on allocation then try_push returns error if the allocation fails; otherwise, the behavior of try_push is undefined.

In particular, given that Vec cannot query whether its allocator overcommits memory or not (this is a problem that we might be able to fix though), these functions might need to be unsafe.

Also, revisiting the systems on which the System allocator guarantees that these functions work, I can only find one answer: Windows. On Linux, MacOSX, and *BSD, they will generally not work unless the user changes the system's default settings to disable overcommit (a terrible idea) or uses a Linux/*BSD distro tailored for being used with overcommit disabled.

On embedded and Linux/MacOSX/*BSDs the user can provide allocators that go both ways. As @whitequark correctly points out, in embedded it makes little sense in general for users to provide allocators that overcommit, but it is still something that can be done.

I've repeated many times that we should split try_xxx into its own RFC so that we can make progress on it without delaying progress on oom_panic, but since it seems that it is "all or nothing", in my opinion the quickest ways to stabilize this are:

    1. make the Vec::try_xxx methods unsafe (we can always remove the unsafe keyword later in a backwards-compatible way)
    2. expose them safely only behind a feature flag in the standard library that is enabled by default on Windows and disabled otherwise (this allows users using xargo on embedded to enable these as well)
    3. make them safe, but add an unstable method to the Allocator trait that returns true if the allocator overcommits, and make the Vec::try_xxx methods unconditionally panic if this method returns true. We could set this method to return true by default and then make it return false (no overcommit) on Windows. Embedded users writing their own Allocators can specify whether their allocator overcommits or not. This can be useful for those on MacOSX/Linux/*BSD who write their own allocator that does not overcommit as well.
    4. have a NonOvercommitingAlloc (for lack of a better name) trait that refines Alloc, and implement these methods on Vec for NonOvercommitingAllocators only. That way Windows and embedded users can mark their allocators as NonOvercommitingAlloc by just adding an impl.

I don't know. Hopefully others have better ideas but 3 and 4 don't look that bad to me.
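
For option 4, a sketch of the marker-trait shape; all names here are hypothetical, taken from the comment above, and the Alloc trait is illustrative:

    trait Alloc {
        fn alloc(&mut self, size: usize) -> Result<*mut u8, ()>;
    }

    /// Marker: this allocator commits memory at allocation time (no
    /// overcommit), so a successful reservation really does guarantee that
    /// subsequent in-capacity writes succeed.
    trait NonOvercommitingAlloc: Alloc {}

    // Collections parameterized over an allocator could then offer the
    // try_ methods only where `A: NonOvercommitingAlloc` holds.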

Member

Manishearth commented Feb 2, 2018

otherwise, the behavior of try_reserve is undefined.

This seems extreme. Adding UB to this case would not be helpful for the use cases mentioned in the original RFC. Firefox, for example, is okay with not being notified of OOMs that were delayed by overcommit, but wants to catch as many OOMs as possible.

Like, 32 bit systems exist, and even with overcommit there can be OOM-on-allocation on those pretty easily.

Hobbling an API completely just because it isn't perfect for some platforms seems extreme to me.

Rather, it makes more sense to define it as "try_reserve is defined to reserve X memory. If the OS commits memory on allocation, try_reserve is guaranteed to produce an error when it is unable to allocate the memory. If not, it is guaranteed to produce an error in case the allocation operation somehow failed, but may succeed when there isn't actually enough memory to satisfy it"

Contributor

gnzlbg commented Feb 2, 2018

@Manishearth makes sense, what about:

  • try_reserve: iff Vec's allocator commits memory on allocation, then a successful try_reserve guarantees that subsequent calls to push that do not increase the Vec's capacity cannot fail. Otherwise, try_reserve might fail or succeed; if it succeeds, subsequent calls to push and try_push are not guaranteed to succeed.

Do you know how to remove the undefined behavior for try_push? I think we actually should add it to push as well. Without this, the API is still pretty broken, because the only reason to call try_reserve is to insert something into the Vec afterwards :/

Member

Manishearth commented Feb 2, 2018

(We discussed this in IRC, @gnzlbg was misusing "undefined behavior")

Reasoning about the OOM killer is out of scope for Rust's safety model; the OOM killer can kill a Rust program even when the Rust program wasn't allocating.

One could call this "implementation-defined behavior" but it's not even that; the OOM killer is out of scope for Rust, as is the robustness of Rust programs in the presence of an attached debugger mutating the process, or /proc/self/mem, or kill -9.

Member

aturon commented Feb 2, 2018

@gnzlbg Quick clarification: the RFC does not propose to add try_push to std, it only notes that you can define it externally.

The specification you and @Manishearth are hashing out seems just fine to me; this entire enterprise is about a "best effort" API anyway.
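
Such an external try_push is a thin layer over try_reserve. A sketch, assuming the RFC's CollectionAllocErr as the error type and its std::collections path (unstable at the time of this thread):

    use std::collections::CollectionAllocErr;

    fn try_push<T>(v: &mut Vec<T>, value: T) -> Result<(), CollectionAllocErr> {
        v.try_reserve(1)?; // the only fallible step
        v.push(value);     // within reserved capacity: cannot allocate
        Ok(())
    }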

Contributor

Ericson2314 commented Feb 2, 2018

@aturon I totally agree with the principle, but two things. First, the "more perfect solution"'s difficulty is vastly overestimated. As soon as someone fixes the Windows errors in rust-lang/rust#47043, I am confident I can make most collections allocator- and fallibility-polymorphic in 2 weeks, tops, seeing that I already did a couple in 2-3 days in https://github.com/QuiltOS/rust/commits/allocator-error.

Second, at the very least, try_reserve should be deprecated as soon as a better solution is available. It's the equivalent of if and unwrap vs. pattern matching, but for allocation: unergonomic and unreasonably difficult to write correct code with, yet easy to understand and so unreasonably attractive to newcomers. So yes, don't let the perfect be the enemy of the good, but also keep in mind that inferior APIs, by their mere existence, can harm pedagogy and ergonomics.

@aturon aturon merged commit 2b1d50f into rust-lang:master Feb 7, 2018

Member

aturon commented Feb 7, 2018

This RFC has been merged; the tracking issue is here. See the summary here.
