Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IResult VS Result: making Incomplete part of errors? #356

Closed
Geal opened this issue Nov 1, 2016 · 26 comments
Closed

IResult VS Result: making Incomplete part of errors? #356

Geal opened this issue Nov 1, 2016 · 26 comments
Milestone

Comments

@Geal
Copy link
Collaborator

Geal commented Nov 1, 2016

There have been a lot of demands that I change nom's basic type from IResult to std::result::Result. IResult has the following definition:

pub enum IResult<I,O,E=u32> {
  Done(I,O),
  Error(Err<I,E>),
  Incomplete(Needed)
}

This was originally inspired from attoparsec's IResult in which the Partial branch contained a closure to be called when more data is available. For various reasons, I was not able to make the closure idea work (note that Rust was very far from 1.0 at the time), so I chose to show how much data was needed to ask the user to parse again.

I open this issue to study what a change from IResult to Result would entail. I make no promise to do that change, and I will not put the issue to a vote. I will however take into account the responses I get.

The proposal is to make the Incomplete branch part of the Error branch, which would allow employing Result. I do not know yet what the end type would look like. I see two possibilities for the Err type: containing either a Needed or the other error branches, or flattening Needed at the same level.

So, to detail the arguments:

pro:

  • can reuse all of the Result methods and the code relying on it
  • less surprising for developers
  • nom users can easily ignore all of the Incomplete usage, parsers are easier to write
  • it might make the macros simpler, because in pattern matching, most combinators would handle 2 cases instead of 4: Done and Error. In some combinators, there will be only three of them: Done, Error or Incomplete(Needed::Unknown), Incomplete::Needed::Size(sz), because the calculation of needed data must still happen
  • a lot of nom parsers are stuck on an old version and likely will never update (because it does the job as is), so less code to rewrite, maybe?
  • a lot of people find Incomplete confusing, since they work on complete data (like a file completely read in memory)
  • I could add in there a "non backtrackable error" that can contain the same thing as a normal error, but would make everything return instead of testing other branches
  • it might make compilation faster (compilation time is the reason for the nom fork in syn)

con:

  • it means a nearly complete rewrite of nom. It's not necessarily an issue, I'm willing to make the time for it if needed
  • there are still a lot of nom users relying on Incomplete, and this is a big breaking change for them
  • even parsers that do not use Incomplete would need to be updated to use Result
  • it might make some combinators confusing. All of the backtracking combinators like alt! currently return on Incomplete instead of testing the next branch. So, instead of alt! and alt_complete!, do a alt! and alt_incomplete! ?
  • nom was designed with streaming in mind, I am worried hiding Incomplete will hide this benefit of the library

also, I am not sure about the timeline here. I am doing nom 2.0 very soon and it introduces some breaking changes, but I'm worried this change might be too big and drive users away. On the other hand, a 3.0 would likely happen far in the future, and there would be even more code (in nom and in code relying on nom) to update.

so, what do people think?

@Geal Geal mentioned this issue Nov 1, 2016
59 tasks
@GuillaumeGomez
Copy link
Contributor

From my point of view, it would remove a difficulty from nom usage. Now I have no idea what will be the result but it seems very interesting!

@vcsjones
Copy link

vcsjones commented Nov 1, 2016

it means a nearly complete rewrite of nom. It's not necessarily an issue, I'm willing to make the time for it if needed

Without knowing more details about what the final form of this would take, it would also mean a significant engineering effort for some of my own projects. For my own selfish reasons, I would be opposed to the change unless there is a material benefit for me as well. I do use Incomplete since much of the data that comes in is buffered from a socket.

nom was designed with streaming in mind, I am worried hiding Incomplete will hide this benefit of the library

Yes. Many, but not all, of my usage of nom is with sockets.

@keeperofdakeys
Copy link
Contributor

keeperofdakeys commented Nov 1, 2016

It's worth noting that std::io::ErrorKind has some variants that represent recoverable kinds of errors, like WouldBlock and TimedOut. So such a thing isn't inconsistent with the rest of rust.

Also I don't see how hiding Incomplete behind an error kind would make the streaming story that much harder, besides being mentioned in the documentation, I'd imagine most people would use the built-in nom Producer/Consumer construction (IE: They'd never have to deal with the Incomplete error kind directly).

@Geal
Copy link
Collaborator Author

Geal commented Nov 1, 2016

@vcsjones how much code have you written using nom? If I decide to do the change, it might be something I can help with (I'd try to make the changes sed-able too).

@Geal
Copy link
Collaborator Author

Geal commented Nov 1, 2016

@keeperofdakeys the producers and consumers are awkward to use, and when you implement streaming protocols, you like to manage your sockets and buffers yourself, so that's why Incomplete is exposed

@dtolnay
Copy link

dtolnay commented Nov 1, 2016

I don't have enough understanding of other nom use cases to make a call here, but as you mentioned, this would probably allow us to use real nom rather than a fork in syn.

@vcsjones
Copy link

vcsjones commented Nov 1, 2016

@Geal

how much code have you written using nom?

A few to several thousand lines directly using Nom, much of which is not open :-(. The gist of it is an IPSec and SSH implementation.

I'd try to make the changes sed-able too

That's good. I suppose if you do go through with the change, I would hope to be able to automate as much as I could.

@eternaleye
Copy link

eternaleye commented Nov 3, 2016

To be honest, I'd actually suggest that futures-rs has a good type that would be worth stealing, here:

pub type Poll<T, E> = Result<Async<T>, E>;

/// Return type of future, indicating whether a value is ready or not.
#[derive(Copy, Clone, Debug, PartialEq)]
pub enum Async<T> {
    /// Represents that a value is immediately ready.
    Ready(T),

    /// Represents that a value is not ready yet, but may be so later.
    NotReady,
}

Rather than putting NotReady in the error (for similar reasons to IResult), it puts it in the Ok. In addition, it allows seamless interoperation with the (growing) futures ecosystem, and trivial implementations of Future.

A user could simply do an and_then on whatever future-based IO they are doing to get their input, and call their nom-based parser inside.

@fflorent
Copy link
Contributor

Hi there,

As told in the IRC channel, I would like to work on the issue (though I promise nothing, I just investigate a bit for now).

Currently, I am wondering whether we want to keep the IResult object or just take advantage of Result only. With the latter option, I would think of this kind of transcription:

IResult::Done(i, o) -> Result::Ok((i, o)) // Result encapsulates a tuple of input / output

#[cfg(not(feature = "verbose-errors"))]
IResult::Error(ErrorKind<E>) -> Result::Err(ErrorKind<E>)
#[cfg(feature = "verbose-errors")]
IResult::Error(i, ErrorKind<E>) -> Result::Err((i, ErrorKind<E>))

IResult::Incomplete(Needed::Unknown) -> Result::Err(ErrorKind::Needed::Unknown)
IResult::Incomplete(Needed::Size(sz)) -> Result::Err(ErrorKind::Needed::Size(sz))

Thoughts?

To be honest, I'd actually suggest that futures-rs has a good type that would be worth stealing, here:

I wonder if this couldn't be a separate crate, as (I guess) streaming is a specific need. What do you think?

Florent

@eternaleye
Copy link

eternaleye commented Jul 16, 2017

@fflorent Well, part of my question is "Does Needed need to be in the Incomplete result itself?"

In particular, I feel that scrutinizing the Needed value is (at most) an optimization, and one that is sometimes unavailable anyway (the Unknown variant). As a result, it might be better to provide it as a method (fn needed(&self) -> Option<usize>). At that point, the Poll<T, E> type of the futures crate could replace IResult without loss of generality.

Even better, this could be done compatibly once such a method is added - IResult could implement Into<Poll<T, E>>, and vice versa.

@fflorent
Copy link
Contributor

At that point, the Poll<T, E> type of the futures crate could replace IResult without loss of generality.

If I am not wrong, one of the main point of this thread is the difficulty of using IResult, especially since not everyone has to support incomplete data. Another point is that nom takes longer to compile with IResult (using Result would shorten the time for it).

And I think @Geal is willing to separate some part of nom in different crate (please @Geal correct me if I am wrong :)).

… At least, of course, when that's feasible (I have to admit that I don't know how to handle streaming in a project using nom, if anyone can point me to some I would be grateful :)).

As a result, it might be better to provide it as a method (fn needed(&self) -> Option<usize>)

This sounds reasonable to me to add this in nom too.

@Geal
Copy link
Collaborator Author

Geal commented Jul 23, 2017

mmmh, so, a few things here:

  • a lot of people think they do not need to handle incomplete data, especially for textual formats and languages. But it's in fact very common to end up with large JSON, XML, or even C or PHP files that you would not want to load entirely in memory. And then, trying to retrofit a parser in streaming mode is painful
  • I don't think the compilation time gets this large because of IResult vs Result, but mostly for the calculations of how much data is needed. Right now I'm not sure we always need to know how much data we need when a parser returns Incomplete, especially with network protocols. Maybe Needed could in fact hold the input slice and a needed number of bytes, so we only need to calculate once (offset from base slice + needed bytes) and move the calculation in a separate function. That would reduce the code size greatly
  • I'm open to moving some stuff out of nom in separate crates, but the core types will stay in nom. Streaming is a core feature of nom and it will stay there. There's already a compilation feature to activate or not verbose error handling (with position information and such), and it's painful to maintain. I'm not sure adding another compilation feature for streaming would be a good idea. Especially when nom can be included multiple times in a project from various libraries.
  • as for where Needed should be, I'm torn between both arguments, but I lean a bit more on putting it in Ok:
    • putting it in Err: it's the common approach, but mainly because of historical reasons. C functions returning integer fell into the "0 is success, everything else an error" pattern that caused so many bugs. We have better types in Rust now. The ability to ignore Incomplete is great though.
    • putting it in Ok: Incomplete is not an error. It's a very common case. I think that pattern matching on Ok(Done(i,o)) and putting Ok(Incomplete(_)) and Err in the same case (maybe with a helper function?) would not be that hard.

@fflorent
Copy link
Contributor

Thanks for your answer Geal!

putting it in Ok: Incomplete is not an error. It's a very common case. I think that pattern matching on Ok(Done(i,o)) and putting Ok(Incomplete(_)) and Err in the same case (maybe with a helper function?) would not be that hard.

Sounds wise.

a lot of people think they do not need to handle incomplete data, especially for textual formats and languages. But it's in fact very common to end up with large JSON, XML, or even C or PHP files that you would not want to load entirely in memory. And then, trying to retrofit a parser in streaming mode is painful

So if I get your point correctly, you state that handling incomplete data doesn't mean we do streaming (but that we just may need to load in memory some partial data)?

Florent

@fflorent
Copy link
Contributor

fflorent commented Sep 2, 2017

I have been thinking about this proposal:

putting it in Ok: Incomplete is not an error. It's a very common case. I think that pattern matching on Ok(Done(i,o)) and putting Ok(Incomplete(_)) and Err in the same case (maybe with a helper function?) would not be that hard.

I fear that having Ok(Done(i, o)) and Ok(Incomplete(_)) could make the use of nom rather complex (especially for beginners). The user has to unwrap two layers of enums (Result then Done/Incomplete) to get the value when the parsing is successfully.

Also is this worth to introduce such a breaking change?

With my understanding (I emphasize these three previous words :)), I wonder if that wouldn't be wiser to keep IResult unchanged.

Also if I am missing key points of the idea proposed here, I would be happy to get explanations or to let someone else do the work and see its virtues :).

Cheers,
Florent

@Geal Geal added this to the 4.0 milestone Sep 6, 2017
@Geal
Copy link
Collaborator Author

Geal commented Sep 7, 2017

So I'm currently testing this, and pushed the code there: 6a15807

I'm putting Incomplete on the error side for now, since it will contain more stuff

@Hywan
Copy link
Contributor

Hywan commented Sep 8, 2017

I agree with @eternaleye (#356 (comment)): Put Incomplete in Ok. An incomplete state is not an error.

Using Result with Ok produces easier to read examples for developers that are not used to nom's conventions. It reduces the cognitive effort, and I bet it can increase the adoption of nom (“It looks less difficult, I understand all the examples”).

For Incomplete, few parsers need to be rewritten. However, the “nom spirit” is to use macros, it's just a couple of macros to update. I count 10 macros max. to update on my projects, which is just fine enough (+ test cases + doctests, approximately 50, which is OK for me).

For IResult however, we will have to update all the test cases that uses IResult. For my main project using nom, it's 550+ test cases… I hope I will be able to automate that somehow.

This RFC might break the public API too of the software using nom. It's not a good idea to expose IResult, I concede, but it might happen. Worst, the public API can have C bindings. We can assume it is an error from the implementer, but a note somewhere in the documentation should mention that risk.

Despite the negative feedbacks about the amount of work to update the code, I am willing to go in this direction. If you fear that the user base will not be glad with this second BC (this one is much more important), publish a blog post explaining why it is important, apologise, and go through. People will understand.

@Hywan
Copy link
Contributor

Hywan commented Sep 8, 2017

I am trying to compare the old IResult with a Result embedding Incomplete. The current size of IResult<&[u8], &[u8]> is 40 bytes. With Result + Incomplete, I reach 48 bytes.

// common

pub enum ErrorKind<E=u32> {
    Custom(E),
    Tag,
    MapRes
}

pub enum Needed {
    Unknown,
    Size(usize)
}

// new

pub enum State<I, O> {
    Ok(I, O),
    Incomplete(Needed)
}

type Result_new<I, O, E=u32> = Result<State<I, O>, ErrorKind<E>>;

// old

enum Result_old<I, O, E=u32> {
    Done(I, O),
    Error(ErrorKind<E>),
    Incomplete(Needed)
}

fn main() {
    println!("{}", std::mem::size_of::<Result_new<&[u8], &[u8]>>()); // 48
    println!("{}", std::mem::size_of::<Result_old<&[u8], &[u8]>>()); // 40
}

This is due to the State enum.

You might find a better way to represent it than I did, but the size of Result is important: Parsers generate a lot of them. And here, it increases the Result memory footprint of 16%.

@Geal
Copy link
Collaborator Author

Geal commented Sep 8, 2017

From what I have seen, moving from IResult to Result can be done easily enough with regexps (I had a lot of unit tests to change in nom too).

About the enum size, here's an update version of your code example, witht he pattern I'm currently testing (Incomplete on the error side):

pub enum ErrorKind<E=u32> {
    Custom(E),
    Tag,
    MapRes
}

pub enum Needed {
    Unknown,
    Size(usize)
}

// new

pub enum State<I, O> {
    Ok(I, O),
    Incomplete(Needed)
}

type Result_new<I, O, E=u32> = Result<State<I, O>, ErrorKind<E>>;

// old

enum Result_old<I, O, E=u32> {
    Done(I, O),
    Error(ErrorKind<E>),
    Incomplete(Needed)
}


// Incomplete in error

type Result_err<I, O, E=u32> = Result<(I, O), Err<E>>;

pub enum Err<E=u32> {
  Error(ErrorKind<E>),
  Incomplete(Needed),
}

fn main() {
    println!("IResult: {}", std::mem::size_of::<Result_old<&[u8], &[u8]>>()); // 40
    println!("Incomplete in Ok: {}", std::mem::size_of::<Result_new<&[u8], &[u8]>>()); // 48
    println!("Incomplete in Err: {}", std::mem::size_of::<Result_err<&[u8], &[u8]>>()); // 40
}

So it stays at 40 bytes. There's another reason to put it on the error side: I'll add an unrecoverable error, so it would be a bit like this (I'm not decided on the name yet):

pub enum Err<E=u32> {
  Error(ErrorKind<E>),
  UnrecoverableError(ErrorKind<E>),
  Incomplete(Needed),
}

This is for errors indicating that we should not backtrack and try other branches, but instead just bubble up the error. It does not increase the enum size and will make some parsers easier to manage.

Afterwards, transform a nom result to the futures Poll type could be done with an Into implementation.

@Hywan
Copy link
Contributor

Hywan commented Sep 8, 2017

Yes, we have the same result when keeping Incomplete on the Error side.

We can probably rename Err into something else in this case. I don't have the correct wording yet, but something to mean “the transition has not been applied because (Error | UnrecoverableError | Incomplete)”. Stopped maybe? Or Paused, or Blocking? I don't know, but Err no longer feel appropriated for me (even if it is used as the error branch of Result).

@Geal
Copy link
Collaborator Author

Geal commented Sep 8, 2017

Yes I don't like the name Err yet (because I have Err(Err::Error(_)) everywhere). I could just name the enum Erroror ParseError, but I don't like Err(Error::Error(_)) or Err(ParseError::Error(_)) either :p

@Geal
Copy link
Collaborator Author

Geal commented Sep 8, 2017

another interesting thing, while you got me worrying about struct size, I have a way to make simple_errors and verbose_errors compatible, by using the same enum in simple_errors, but with only one element. But things get more interesting: I can add more context in simple_errors (adding the current slice) without growing the structure, and make the verbose version smaller than in nom 3.0.

Basically, instead of:

type Result_err<I, O, E=u32> = Result<(I, O), Err<E>>;

pub enum Err<E=u32> {
  Error(ErrorKind<E>),
  Incomplete(Needed),
}

I would have:

type Result_err<I, O, E=u32> = Result<(I, O), Err<I,E>>;

pub enum Err<I,E=u32> {
  Error(Context<I,E>),
  Incomplete(Needed),
}

pub enum Context<I,E> {
  Code(I,ErrorKind<E>),
}

and in verbose mode:

pub enum Context<I,E> {
  Code(I,ErrorKind<E>),
  List<Vec(I,ErrorKind<E>)>),
}

That way, the verbose_errors feature would be additive, and as you can see in this playground, it's smalller than in nom 3:

IResult: 40
IResult verbose: 64
Incomplete in Ok: 48
Incomplete in Err: 40
Incomplete in Err extended: 48

@Hywan
Copy link
Contributor

Hywan commented Sep 8, 2017

Sounds great!

@Geal
Copy link
Collaborator Author

Geal commented Sep 8, 2017

it's getting interesting: https://twitter.com/gcouprie/status/906186641706012672

@Geal
Copy link
Collaborator Author

Geal commented Oct 3, 2017

FYI this is now done, with an additional element in the Err structure: Failure(ErrorKind<E>) to prevent a parser from trying other branches and make combinators like many* fail instead of returning a value

@Hywan
Copy link
Contributor

Hywan commented Oct 3, 2017

Any notable impacts on benchmarks?

@Geal
Copy link
Collaborator Author

Geal commented Oct 3, 2017

adding Failure made some parsers slightly slower than without, but they're still much faster than nom 3. And that is even before taking advantage of that failure feature (because it can optimize some behaviours).

@Geal Geal closed this as completed May 14, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants