Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stream + ParseBuffer lifetimes are awkward to work with #12

Open
luser opened this issue Aug 30, 2017 · 6 comments
Open

Stream + ParseBuffer lifetimes are awkward to work with #12

luser opened this issue Aug 30, 2017 · 6 comments

Comments

@luser
Copy link
Contributor

luser commented Aug 30, 2017

I keep running into this when hacking on this crate. I think the root of the problem is that SourceView::as_slice returns a slice whose lifetime matches the lifetime of the reference to the SourceView, not the contained lifetime 's, and then Stream::parse_buffer returns a ParseBuffer with a lifetime derived from a slice returned from SourceView::as_slice. This means that in practice you have to keep both the Stream and ParseBuffer alive to do anything where you want to hand out references to data borrowed from the stream. This seems to mean that you always need an intermediate type in this situation to own the ParseBuffer.

I understand why you've written things this way--you want to potentially support zero-copy operation by having a SourceView that points into a memory-mapped file. However, since ReadView owns its data, it can't actually hand out a reference that outlives itself, so you can't just change SourceView::as_slice to make the slice lifetime match the SourceView lifetime.

I don't know if there's a straightforward way to fix this, but it's frustrated me many times so I figured I'd at least write it down to get it out of my head.

@willglynn
Copy link
Collaborator

Yep, you've got it.

Stream represents a possibly-expensive copy of a PDB stream. Think of it as a Vec<u8> (and note that it is a Vec<u8> unless you wrote your own Source). The Stream owns the data for a PDB stream, either by owning a copy of the data or by owning whatever resources are needed to maintain access to the data. The Stream therefore needs to live at least as long as any references to things inside the stream.

A ParseBuffer is basically a &[u8] with convenience functions. (Some of those convenience functions are sensitive to the offset within the stream, so it's actually a (&[u8], usize).) It owns nothing; it's a slice borrowed from a Stream. You can get &[u8]s from a ParseBuffer -- take(), parse_cstring(), etc. These &[u8]s can outlive the ParseBuffer, but they can't outlive the Stream, since they're borrowed from the Stream exactly how a &[u8] is borrowed from a Vec<u8>.

So far this has worked out okay, with the pattern for the existing features being:

// create a PDB from a File
let mut pdb = PDB::open()?;
// type_information() reads the TPI stream in its entirety, and TypeInformation owns that copy
// note also that reading things can mutate them, hence why `pdb` is `mut`
let type_information = pdb.type_information()?;
// do things with type_information -- parsing its Stream and whatnot -- but without mutating type_information

I have also struggled with this at times. For example, I started by assuming that TypeFinder's functions should be part of TypeInformation, but attempting to do that was painful. Eventually I stopped fighting and just split them apart. I later found that keeping them separate actually has meaningful benefits; I've since written client code that loads a single TypeInformation and does different things with it in multiple threads. One of the threads builds a partial TypeFinder for one purpose, while another builds a full TypeFinder for another purpose which later processing stages borrow and use in parallel. Even if I had managed to satisfy the borrow checker with my initial combined approach, my mut TypeInformation would not have permitted this kind of use without loads of synchronization, while the &TypeInformation -> mut TypeFinder approach accidentally supported parallel processing right out of the box. ¯\_(ツ)_/¯

For me, I think the biggest mental hurdle is the Stream isn't movable. I feel like I should be able to hold references to e.g. strings within the stream as long as the Stream is alive -- I know for a fact these references live behind a pointer, so moving the Stream has no effect on my &[u8]s -- but AFAIK there's no way to describe that in a way that the borrow checker understands.

@luser
Copy link
Contributor Author

luser commented Mar 18, 2020

Just ran across this again and wanted to drop a note about what I did to solve a similar issue in my minidump crate. I made the Minidump type generic over the file data storage by declaring it as:

pub struct Minidump<'a, T>
    where T: Deref<Target=[u8]> + 'a,
{
    data: T,
    // ...
}

That way it's possible to create a Minidump<&[u8]>, Minidump<Vec<u8>> or Minidump<Mmap> and all the lifetimes work sensibly. I implemented a convenience Minidump::read_path method on a separate impl<'a> Minidump<'a, Mmap> block to handle the common case of reading from a memory mapped file. Internally everything just winds up calling self.data.deref() to get a &'a [u8] and things fall out from there.

@nkaretnikov
Copy link

@willglynn

(I don't want to create a new issue since mine is pretty similar to what you wrote above.)

I want to populate the type finder (or a custom structure: hashmap/whatever) in a constructor such that I can use it for lookup later in a method. Right now it doesn't seem to be possible.

Basically, I have something similar in my new constructor. I also maintain a separate mapping for type name -> type index lookup.

I see no way to store the updated type finder information as part of the structure, to be used later from the method I mentioned.

You can store TypeInformation, but it doesn't get updated when you update the type finder, so it's useless for this purpose.

Trying to store the type finder itself will lead you nowhere because it depends on local variables, which will get destroyed when the constructor exits, so Rust doesn't allow that, which is fine.
However, there doesn't seem to be a way to break this dependency by copying data.

I've also tried storing TypeData in a hashmap, similar to this. Again, doesn't seem to be possible due to a dependency on locals.

Doing something like Boxing a value or calling clone on it doesn't help either.

The only way to achieve this at the moment (that I can think of) is to copy each enum variant manually, but this is not good. Not interested in changing my APIs either.

Am I missing something?

In addition, zero-copy is great, however, lifetimes is a pretty complex topic and this complicates things (especially for beginners). Maybe there should be a way to copy everything (you can write giant warnings in docs for such methods, but at least provide them in case people need that).

@jan-auer
Copy link
Member

jan-auer commented Jan 9, 2021

You're not missing anything, this is indeed this complicated. Your issue is slightly different from the one in the issue description. The complexity comes from two facts:

  • The PDB -- more precisely, an MSF file -- is not stored in order. Instead, it consists of reordered pages and one constantly has to go through a mapping to resolve actual data. The PDB create solves this by reading everything into buffers called Stream. Since this is a rather expensive process, this is only done on demand for the very streams you read.
  • Secondly, there is no straight-forward way to create self-referential types in Rust. There are libraries out there helping with this, such as rental, but they cannot not be used with the way we have to read data. This means that you as a user receive the burden to hold on to all data owners.

You can go about this in multiple ways:

  1. The easiest is to make sure the PDB or the streams are held somewhere higher up your stack, so you don't have to worry about this.
  2. Re-expose all ownership in your own API. Assuming that you want to create higher-level functionality wrapping PDB, you can ensure the API is built so that is has to be put on the stack again.

To give an example for TypeFinder, you can have a look at how symbolic deals with this here and here. The relevant bits of the code are below:

// ItemMap borrows an item iterator and finder, and on-demand advances the 
// iterator to populate the finder. It requires the caller to hold on to `TypeInformation`.
struct ItemMap<'s, I: ItemIndex> {
    iter: pdb::ItemIter<'s, I>,
    finder: pdb::ItemFinder<'s, I>,
}

impl<'s, I> ItemMap<'s, I>
where
    I: ItemIndex,
{
    pub fn try_get(&mut self, index: I) -> Result<pdb::Item<'s, I>, PdbError> {
        if index <= self.finder.max_index() {
            return Ok(self.finder.find(index)?);
        }

        while let Some(item) = self.iter.next()? {
            self.finder.update(&self.iter);
            match item.index().partial_cmp(&index) {
                Some(Ordering::Equal) => return Ok(item),
                Some(Ordering::Greater) => break,
                _ => continue,
            }
        }

        Err(pdb::Error::TypeNotFound(index.into()).into())
    }
}

// This holds on to all owned data and must be put somewhere on the stack
// while someone is using TypeMap.
struct MyPdbUtil<'d> {
    type_info: pdb::TypeInformation<'d>,
    // ...
}

impl<'d> MyPdbUtil<'d> {
    fn from_pdb(pdb: &mut PDB<'d>) -> Result<Self, PdbError> {
        Ok(Self {
            type_info: pdb.type_information()?,
            // ...
        })
    }

    fn type_map(&self) -> TypeMap<'_> {
        ItemMap {
            iter: self.type_info.iter(),
            finder: self.type_info.finder(),
        }
    }
}

Apart from that, the clean solution to achieve true zero-copy and solve the awkward lifetimes is by moving the paging logic. Right now, we read and assemble pages for the entire stream into a Vec. Instead, we could resolve the physical address of RVAs at access time, and would just have to ensure that we never read across page boundaries (so potentially return Cows).

@nkaretnikov
Copy link

@jan-auer

Thanks for the quick and detailed reply!

FWIW, I "solved" it for now by creating a copy of TypeData that uses Strings in variants with a lifetime parameter (this way I don't have to use any lifetimes at all). I use a hashmap instead of the type finder for lookups, which I manage myself too. Not good, but at least it compiles now.

I'll take a closer look at your suggestions sometime later. I've already spent too much time on this and need to focus on something else. But I'll let you know if there are any issues!

Thanks again!

@mstange
Copy link
Collaborator

mstange commented Sep 23, 2021

I've also run into this problem in pdb-addr2line. I've ended up propagating the awkwardness to the consumer of my API; rather than just creating a Context they need to create a ContextPdbData and then a Context from that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants