Stream + ParseBuffer lifetimes are awkward to work with #12

luser · 2017-08-30T19:34:18Z

I keep running into this when hacking on this crate. I think the root of the problem is that SourceView::as_slice returns a slice whose lifetime matches the lifetime of the reference to the SourceView, not the contained lifetime 's, and then Stream::parse_buffer returns a ParseBuffer with a lifetime derived from a slice returned from SourceView::as_slice. This means that in practice you have to keep both the Stream and ParseBuffer alive to do anything where you want to hand out references to data borrowed from the stream. This seems to mean that you always need an intermediate type in this situation to own the ParseBuffer.

I understand why you've written things this way--you want to potentially support zero-copy operation by having a SourceView that points into a memory-mapped file. However, since ReadView owns its data, it can't actually hand out a reference that outlives itself, so you can't just change SourceView::as_slice to make the slice lifetime match the SourceView lifetime.

I don't know if there's a straightforward way to fix this, but it's frustrated me many times so I figured I'd at least write it down to get it out of my head.

The text was updated successfully, but these errors were encountered:

willglynn · 2017-08-30T20:25:08Z

Yep, you've got it.

Stream represents a possibly-expensive copy of a PDB stream. Think of it as a Vec<u8> (and note that it is a Vec<u8> unless you wrote your own Source). The Stream owns the data for a PDB stream, either by owning a copy of the data or by owning whatever resources are needed to maintain access to the data. The Stream therefore needs to live at least as long as any references to things inside the stream.

A ParseBuffer is basically a &[u8] with convenience functions. (Some of those convenience functions are sensitive to the offset within the stream, so it's actually a (&[u8], usize).) It owns nothing; it's a slice borrowed from a Stream. You can get &[u8]s from a ParseBuffer -- take(), parse_cstring(), etc. These &[u8]s can outlive the ParseBuffer, but they can't outlive the Stream, since they're borrowed from the Stream exactly how a &[u8] is borrowed from a Vec<u8>.

So far this has worked out okay, with the pattern for the existing features being:

// create a PDB from a File
let mut pdb = PDB::open(…)?;
// type_information() reads the TPI stream in its entirety, and TypeInformation owns that copy
// note also that reading things can mutate them, hence why `pdb` is `mut`
let type_information = pdb.type_information()?;
// do things with type_information -- parsing its Stream and whatnot -- but without mutating type_information

I have also struggled with this at times. For example, I started by assuming that TypeFinder's functions should be part of TypeInformation, but attempting to do that was painful. Eventually I stopped fighting and just split them apart. I later found that keeping them separate actually has meaningful benefits; I've since written client code that loads a single TypeInformation and does different things with it in multiple threads. One of the threads builds a partial TypeFinder for one purpose, while another builds a full TypeFinder for another purpose which later processing stages borrow and use in parallel. Even if I had managed to satisfy the borrow checker with my initial combined approach, my mut TypeInformation would not have permitted this kind of use without loads of synchronization, while the &TypeInformation -> mut TypeFinder approach accidentally supported parallel processing right out of the box. ¯\_(ツ)_/¯

For me, I think the biggest mental hurdle is the Stream isn't movable. I feel like I should be able to hold references to e.g. strings within the stream as long as the Stream is alive -- I know for a fact these references live behind a pointer, so moving the Stream has no effect on my &[u8]s -- but AFAIK there's no way to describe that in a way that the borrow checker understands.

luser · 2020-03-18T13:29:08Z

Just ran across this again and wanted to drop a note about what I did to solve a similar issue in my minidump crate. I made the Minidump type generic over the file data storage by declaring it as:

pub struct Minidump<'a, T>
    where T: Deref<Target=[u8]> + 'a,
{
    data: T,
    // ...
}

That way it's possible to create a Minidump<&[u8]>, Minidump<Vec<u8>> or Minidump<Mmap> and all the lifetimes work sensibly. I implemented a convenience Minidump::read_path method on a separate impl<'a> Minidump<'a, Mmap> block to handle the common case of reading from a memory mapped file. Internally everything just winds up calling self.data.deref() to get a &'a [u8] and things fall out from there.

nkaretnikov · 2021-01-09T09:47:29Z

@willglynn

(I don't want to create a new issue since mine is pretty similar to what you wrote above.)

I want to populate the type finder (or a custom structure: hashmap/whatever) in a constructor such that I can use it for lookup later in a method. Right now it doesn't seem to be possible.

Basically, I have something similar in my new constructor. I also maintain a separate mapping for type name -> type index lookup.

I see no way to store the updated type finder information as part of the structure, to be used later from the method I mentioned.

You can store TypeInformation, but it doesn't get updated when you update the type finder, so it's useless for this purpose.

Trying to store the type finder itself will lead you nowhere because it depends on local variables, which will get destroyed when the constructor exits, so Rust doesn't allow that, which is fine.
However, there doesn't seem to be a way to break this dependency by copying data.

I've also tried storing TypeData in a hashmap, similar to this. Again, doesn't seem to be possible due to a dependency on locals.

Doing something like Boxing a value or calling clone on it doesn't help either.

The only way to achieve this at the moment (that I can think of) is to copy each enum variant manually, but this is not good. Not interested in changing my APIs either.

Am I missing something?

In addition, zero-copy is great, however, lifetimes is a pretty complex topic and this complicates things (especially for beginners). Maybe there should be a way to copy everything (you can write giant warnings in docs for such methods, but at least provide them in case people need that).

jan-auer · 2021-01-09T14:29:48Z

You're not missing anything, this is indeed this complicated. Your issue is slightly different from the one in the issue description. The complexity comes from two facts:

The PDB -- more precisely, an MSF file -- is not stored in order. Instead, it consists of reordered pages and one constantly has to go through a mapping to resolve actual data. The PDB create solves this by reading everything into buffers called Stream. Since this is a rather expensive process, this is only done on demand for the very streams you read.
Secondly, there is no straight-forward way to create self-referential types in Rust. There are libraries out there helping with this, such as rental, but they cannot not be used with the way we have to read data. This means that you as a user receive the burden to hold on to all data owners.

You can go about this in multiple ways:

The easiest is to make sure the PDB or the streams are held somewhere higher up your stack, so you don't have to worry about this.
Re-expose all ownership in your own API. Assuming that you want to create higher-level functionality wrapping PDB, you can ensure the API is built so that is has to be put on the stack again.

To give an example for TypeFinder, you can have a look at how symbolic deals with this here and here. The relevant bits of the code are below:

// ItemMap borrows an item iterator and finder, and on-demand advances the 
// iterator to populate the finder. It requires the caller to hold on to `TypeInformation`.
struct ItemMap<'s, I: ItemIndex> {
    iter: pdb::ItemIter<'s, I>,
    finder: pdb::ItemFinder<'s, I>,
}

impl<'s, I> ItemMap<'s, I>
where
    I: ItemIndex,
{
    pub fn try_get(&mut self, index: I) -> Result<pdb::Item<'s, I>, PdbError> {
        if index <= self.finder.max_index() {
            return Ok(self.finder.find(index)?);
        }

        while let Some(item) = self.iter.next()? {
            self.finder.update(&self.iter);
            match item.index().partial_cmp(&index) {
                Some(Ordering::Equal) => return Ok(item),
                Some(Ordering::Greater) => break,
                _ => continue,
            }
        }

        Err(pdb::Error::TypeNotFound(index.into()).into())
    }
}

// This holds on to all owned data and must be put somewhere on the stack
// while someone is using TypeMap.
struct MyPdbUtil<'d> {
    type_info: pdb::TypeInformation<'d>,
    // ...
}

impl<'d> MyPdbUtil<'d> {
    fn from_pdb(pdb: &mut PDB<'d>) -> Result<Self, PdbError> {
        Ok(Self {
            type_info: pdb.type_information()?,
            // ...
        })
    }

    fn type_map(&self) -> TypeMap<'_> {
        ItemMap {
            iter: self.type_info.iter(),
            finder: self.type_info.finder(),
        }
    }
}

Apart from that, the clean solution to achieve true zero-copy and solve the awkward lifetimes is by moving the paging logic. Right now, we read and assemble pages for the entire stream into a Vec. Instead, we could resolve the physical address of RVAs at access time, and would just have to ensure that we never read across page boundaries (so potentially return Cows).

nkaretnikov · 2021-01-09T15:34:14Z

@jan-auer

Thanks for the quick and detailed reply!

FWIW, I "solved" it for now by creating a copy of TypeData that uses Strings in variants with a lifetime parameter (this way I don't have to use any lifetimes at all). I use a hashmap instead of the type finder for lookups, which I manage myself too. Not good, but at least it compiles now.

I'll take a closer look at your suggestions sometime later. I've already spent too much time on this and need to focus on something else. But I'll let you know if there are any issues!

Thanks again!

mstange · 2021-09-23T23:59:21Z

I've also run into this problem in pdb-addr2line. I've ended up propagating the awkwardness to the consumer of my API; rather than just creating a Context they need to create a ContextPdbData and then a Context from that.

nkaretnikov mentioned this issue Jan 12, 2021

Document how primitive types are found #99

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Stream + ParseBuffer lifetimes are awkward to work with #12

Stream + ParseBuffer lifetimes are awkward to work with #12

luser commented Aug 30, 2017

willglynn commented Aug 30, 2017

luser commented Mar 18, 2020

nkaretnikov commented Jan 9, 2021

jan-auer commented Jan 9, 2021

nkaretnikov commented Jan 9, 2021

mstange commented Sep 23, 2021

Stream + ParseBuffer lifetimes are awkward to work with #12

Stream + ParseBuffer lifetimes are awkward to work with #12

Comments

luser commented Aug 30, 2017

willglynn commented Aug 30, 2017

luser commented Mar 18, 2020

nkaretnikov commented Jan 9, 2021

jan-auer commented Jan 9, 2021

nkaretnikov commented Jan 9, 2021

mstange commented Sep 23, 2021