Allow remapping source path prefixes in debug output #38322

Open
jmesmon opened this Issue Dec 12, 2016 · 65 comments

@jmesmon
Contributor
jmesmon commented Dec 12, 2016 edited

In gcc & clang, the flag -fdebug-prefix-map=old=new allows changing the prefix of source files referred to in debug information.

Related to #34902, this allows one to avoid having the particular source directory that a file was built in affect the output object/executable/library contents.

It also allows source-level debugging to work in cases where the source code is installed by the debug packages to a location that differs from where it was built (debian & OpenEmbedded, at least, take advantage of this).

As an alternative, the program debugedit allows modifying the files after generation to adjust the paths (as far as I can tell, Fedora uses this, and potentially other rpm based distros).

@michaelwoerister
Contributor

Thanks for writing up the issue, @jmesmon!

@rust-lang/tools: Can any of you think of a reason not to add something like this as a -C flag?

@alexcrichton
Member

Sounds good to me!

@infinity0
Contributor

A -C flag is good, but sometimes we have seen that certain buildsystems or individual projects like to save compiler flags (i.e. -fdebug-prefix-map=<PATH>=.) into other parts of the overall build output, thereby making it depend on the build path, even if rustc itself supports this -C flag. But it's reasonable to assume that compiler flags will affect the output, and save this somewhere else for auditing.

Because of this, we recently submitted some patches to GCC to also support the same behaviour as debug-prefix-map, except via an environment variable that explicitly should not be saved to any build output. The patches are still pending, but I haven't yet received any significant negative comments on it, and I'll be pinging GCC again soon about it. It would be good if rustc could also support this same environment variable in the future.

Actually, I didn't know about debugedit before, thanks for bringing that up! It might allow us to normalise files outside of GCC or independently of any other compiler, I will have to look into that.

@sanxiyn
Member
sanxiyn commented Dec 13, 2016 edited

I implemented this and it seems to work. I tested by building src/test/run-make/reproducible-build in two different directories. With -C debug-prefix-map=`pwd`=. the output is reproducible; without it, it isn't.

(reproducible-build test itself only tests stable symbol naming, not bit-for-bit output. In the past Rust could produce different symbol names between runs(!): see #30330.)

@jmesmon
Contributor
jmesmon commented Dec 13, 2016

Unclear if it's important for rust, but in gcc/clang multiple mappings are supported. This ends up being important in C/C++ due to #include pulling in files from different directories. If something similar can happen in rust allowing multiple maps would be useful there too.

Also, it could be a good idea to avoid splitting on = altogether, rather than copying gcc's interface, so that = can be included in the old path (though including = in a path is unlikely, it'd be a good idea to avoid leaving behind that landmine if possible).

@sanxiyn
Member
sanxiyn commented Dec 13, 2016

How do multiple mappings work? Are they applied in command line order? Then is the order significant? I think it must be, since a/b=c a=d applied to a/b results in c, but a=d a/b=c applied to a/b results in d/b.

@jmesmon
Contributor
jmesmon commented Dec 13, 2016

In gcc, the handling of multiple debug_prefix_maps is to search the mappings last-on-cmdline to first-on-cmdline, applying the first prefix that matches. The last-to-first strategy is common in gcc (and other command line tools) as it is intended that later options are able to override earlier ones.
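The last-to-first search described above can be sketched as follows. This is a minimal illustration on plain strings, not gcc's or rustc's code; the function name `remap_last_to_first` and the rule list type are assumptions made for the example.

```rust
/// Remap `path` using `(old, new)` rules given in command-line order.
/// Rules are searched from the last one to the first, and the first
/// matching prefix wins, so later options override earlier ones.
/// (Hypothetical helper, plain string-prefix matching.)
fn remap_last_to_first(path: &str, rules: &[(&str, &str)]) -> String {
    for &(old, new) in rules.iter().rev() {
        if let Some(rest) = path.strip_prefix(old) {
            return format!("{}{}", new, rest);
        }
    }
    // No rule matched: the path is left untouched.
    path.to_string()
}
```

With this order, the rules `a/b=c a=d` applied to `a/b` give `d/b`, because `a=d` comes later on the command line and is tried first; swapping the two rules gives `c`.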

@infinity0
Contributor
infinity0 commented Dec 13, 2016 edited

My GCC patches linked above include modifying the existing GCC behaviour to split on the final = instead of the initial one. I think that is better as well: I can imagine someone wanting to map a path that contains a =, but it is less likely that someone would map something to such a path.

(edit: previously mentioned a space character, that was for some other thing that I got confused with)
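To make the difference between the two splitting rules concrete, here is a small sketch. The helper names `split_on_final` and `split_on_initial` are made up for the illustration; only `str::rsplit_once` and `str::split_once` are standard library calls.

```rust
/// Split an `OLD=NEW` argument on the final `=`, so OLD may contain `=`.
fn split_on_final(arg: &str) -> Option<(&str, &str)> {
    arg.rsplit_once('=')
}

/// Split on the initial `=` instead, so only NEW may contain `=`.
fn split_on_initial(arg: &str) -> Option<(&str, &str)> {
    arg.split_once('=')
}
```

For the argument `/build/a=b=/usr/src`, splitting on the final `=` yields the old prefix `/build/a=b`, while splitting on the initial one yields `/build/a`. Either way one side can never contain `=`, which is the limitation raised earlier in the thread.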

@sanxiyn
Member
sanxiyn commented Dec 14, 2016

Thanks for answers! I will implement multiple mappings, last-to-first order, and splitting on the final = now.

@jmesmon
Contributor
jmesmon commented Dec 14, 2016

I'd recommend avoiding splitting on = entirely. Is the cmdline interface of rustc flexible enough to handle either having an argument take 2 parameters or allow 2 flags to work together to implement the same thing? (for example, -C debug_prefix_old=foo -C debug_prefix_new=bar and enforcing ordering + pairing).

It would be a really good idea to keep all of the escaping/special characters in callers of rustc (shells, etc) just to avoid funny limitations like this (= being special).

@michaelwoerister
Contributor

How about requiring the mapping information to be provided in a file (similar to ld's --version-script for example)? Would that be too clunky?
But it seems to me that this is a feature that's only used in specialized settings, so that would seem fine to me.

@jmesmon
Contributor
jmesmon commented Dec 14, 2016

I'd prefer to avoid needing (temporary?) external files to configure this feature. I say temporary because debug src mappings aren't something like a target specification, where there is a fixed, predetermined value for all platforms: these depend on the build directory and on where the source is being mapped to, which in the debug source packaging case is typically a path under /usr/src/package-name-version, and version is potentially adjusted quite a bit.

And one would need to know the escaping in that file format, so it doesn't simplify things wrt allowing arbitrary paths, it just moves them somewhere else.

@michaelwoerister
Contributor
michaelwoerister commented Jan 18, 2017 edited

So, it seems that discussion on this issue has stalled (here and over in #38348) for two reasons:

  1. It's not clear what the command interface should look like. Passing a map of paths without introducing additional escaping rules is harder than it sounds.
  2. The initial implementation in #38348 revealed that the semantics of remapping are more complicated than it might seem at first. Is remapping based on strings or logical paths? E.g. if I replace /abc/def with xyz, what happens when I encounter /abc/./def or /abc/../abc/def/? What about relative paths?

I want to move forward with this so I propose the following solutions:

  1. If no one has a clear, practical reason otherwise, I say we use the CLI as proposed by @infinity0 and myself: have pairs of -Zdebug-prefix-map-from=<...> and -Zdebug-prefix-map-to=<...>, that are matched up nth from to nth to. It is a bit verbose but doesn't require any additional escaping and can handle paths on all platforms.

  2. The semantics are a bit more complicated but I propose that debuginfo paths are generally normalized to not contain . or .. components and that remapping works on absolute versions of these normalized paths. This gives predictable results. UPDATE: Prefix matching works at directory name level, not at the path-string content (see example below).
    Some examples:

map: /abc/def -> /xyz

Absolute paths containing the prefix:
/abc/def/file1.rs -> /xyz/file1.rs
/abc/def/build/../file1.rs -> /xyz/file1.rs
/abc/def/./file1.rs -> /xyz/file1.rs
/abc/./def/file1.rs -> /xyz/file1.rs  // would not match with gcc
/abc/def/mod1/file1.rs -> /xyz/mod1/file1.rs 
/abc/def/mod1/./file1.rs -> /xyz/mod1/file1.rs 

Absolute paths not containing the prefix:
/std/file1.rs -> /std/file1.rs // no change
/std/./file1.rs -> /std/file1.rs // normalization
/std/build/../file1.rs -> /std/file1.rs // normalization

Relative paths containing the prefix:
(DW_AT_comp_dir=/abc/def/build, path=../file1.rs) => /xyz/file1.rs
(DW_AT_comp_dir=/abc, path=./def/file1.rs) => /xyz/file1.rs  // would not match with gcc
(DW_AT_comp_dir=/std, path=./mod1/file1.rs) => /std/mod1/file1.rs

Mapping happens at the directory name level, no partial names allowed:
/abc/def-2/file1.rs -> /abc/def-2/file1.rs
/abc/def.rs -> /abc/def.rs

The formula that produces these results is:

use std::path::{Path, PathBuf};

fn debuginfo_path(p: &Path, map: &[(PathBuf, PathBuf)]) -> PathBuf {
    // `make_absolute` and `normalize` stand for the steps described above.
    let p = normalize(&make_absolute(p));
    // Exit on the *first* match. `map` is in commandline order and is searched
    // last to first, i.e. later CLI options overrule earlier ones.
    for (from, to) in map.iter().rev() {
        // `Path::strip_prefix` matches whole directory names, not partial ones.
        if let Ok(rest) = p.strip_prefix(from) {
            return to.join(rest);
        }
    }
    // No remapping done, but still normalized and absolute now
    p
}
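A self-contained version of the formula above, for anyone who wants to try the examples. This is only a sketch under a few assumptions: Unix-style paths, an explicit `cwd` parameter standing in for DW_AT_comp_dir, and a purely lexical `normalize` that never looks at the file system (so symlinks are ignored). The helper bodies are illustrative, not what rustc does.

```rust
use std::path::{Component, Path, PathBuf};

/// Join a relative path onto `cwd`; absolute paths are returned unchanged.
fn make_absolute(p: &Path, cwd: &Path) -> PathBuf {
    if p.is_absolute() { p.to_path_buf() } else { cwd.join(p) }
}

/// Lexically drop `.` and resolve `..` components (symlinks are not consulted).
fn normalize(p: &Path) -> PathBuf {
    let mut out = PathBuf::new();
    for c in p.components() {
        match c {
            Component::CurDir => {}
            Component::ParentDir => { out.pop(); }
            other => out.push(other.as_os_str()),
        }
    }
    out
}

/// `map` is in command-line order and is searched last to first.
fn debuginfo_path(p: &Path, cwd: &Path, map: &[(PathBuf, PathBuf)]) -> PathBuf {
    let p = normalize(&make_absolute(p, cwd));
    for (from, to) in map.iter().rev() {
        // `strip_prefix` compares whole components, so `/abc/def` does not
        // match `/abc/def-2` or `/abc/def.rs`.
        if let Ok(rest) = p.strip_prefix(from) {
            return if rest.as_os_str().is_empty() { to.clone() } else { to.join(rest) };
        }
    }
    p
}
```

With the single mapping `/abc/def -> /xyz`, this reproduces the example table: `/abc/./def/file1.rs` becomes `/xyz/file1.rs`, and `/abc/def-2/file1.rs` is left alone.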

Note that whether paths are later stored as relative to their DW_AT_comp_dir again is an independent question that I don't want to discuss here.

Thoughts? @jmesmon @infinity0 @sanxiyn @jsgf @rust-lang/tools

@jmesmon
Contributor
jmesmon commented Jan 18, 2017

I don't see an issue with that so long as the "commandline option order" is defined as last-to-first. Doing this bit differently from other command line utilities doesn't buy us anything (unlike having 2 separate args for from & to).

It also looks like the examples decide to fix the mapping at directory name level, but that doesn't appear in the English description. I don't have anything in mind that would break that, but given that allowing matching partials would allow appending '/' to the end to get matching-full-path-elements, I'm not sure such a restriction is a good idea.

@michaelwoerister
Contributor

so long as the "commandline option order" is defined as last-to-first

If there's precedent for that I'm fine with going last-to-first.

@michaelwoerister
Contributor

I don't have anything in mind that would break that, but given that allowing matching partials would allow appending '/' to the end to get matching-full-path-elements, I'm not sure such a restriction is a good idea.

I think it's just simpler to use. You don't have to worry if you need to append a / to avoid accidental renamings.

@jmesmon
Contributor
jmesmon commented Jan 18, 2017

If there's precedent for that I'm fine with going last-to-first.

This is the order used by gcc & clang for all of their "more than 1 & pick 1" options (ignoring special cases): debug-maps (as discussed earlier in this thread), optimization levels, debug info levels, include directories, etc.

@jmesmon
Contributor
jmesmon commented Jan 18, 2017

I think it's just simpler to use. You don't have to worry if you need to append a / to avoid accidental renamings.

Sure, it's simpler. The issue is that it's also less flexible, and it's trivial to get the match-full-paths behavior from match-anything, but going the other direction is impossible. Again, I don't have a use case that would want partial matches, but the ease of supporting both should be considered.

@michaelwoerister
Contributor

This is the order used by gcc & clang

I updated the description above to reflect this.

@michaelwoerister
Contributor

Sure, it's simpler. The issue is that it's also less flexible, and it's trivial to get the match-full-paths behavior from match-anything, but going the other direction is impossible.

Yes, I know that string-based matching is more powerful. My argument is that it gives you more subtle ways to get it wrong without a clear benefit. Is it from=/abc/ to=/xyz/? from=/abc/ to=/xyz? from=/abc to=/xyz? But I don't have a strong preference. If someone says they absolutely want this, I'm fine with implementing it.

@jsgf
Contributor
jsgf commented Jan 18, 2017 edited

Several points:

I think splitting the option into two is very unpretty, and somewhat ambiguous - if they're just related by being adjacent, it seems like it raises a lot of questions:

  • What if from and to get separated by other options?
  • What if there isn't the same number of from and to options?
  • Can the from and to be reordered (ie, what if they appear as to from)?

In particular it means that any tool that's parsing/processing the commandline needs to know about this special case in order to avoid breaking it.

Normalizing paths by eliminating .. is dangerous if the path contains symlinks: foo/bar/../blat/lib.rs is not the same as foo/blat/lib.rs if bar is a symlink. I'm happy with matching at the directory component level (though string matching is strictly more general), but I think going beyond that is a bad idea.

Edit: I can't think of a problem with eliminating . though. Perhaps that would be useful.

Proposal:

Retain the -Zdebug-prefix-map=OLD=NEW syntax, but also add -Zdebug-prefix-map-separator=X such that subsequent (left to right on the command line) OLD and NEW mappings can be separated by X. This allows the mappings to contain any character (but not every character), and the tool generating the command line can select a separator that doesn't break the path.
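A rough sketch of how such a separator option could be parsed. Both flag spellings are taken from the proposal above and are not existing rustc options; the function name, the error strings, and the choice to split on the first occurrence of the separator are assumptions for the illustration.

```rust
/// Collect `(old, new)` pairs from the arguments, left to right.
/// A `-Zdebug-prefix-map-separator=X` argument changes the separator used
/// by every `-Zdebug-prefix-map=` argument that follows it (default `=`).
fn parse_with_separator(args: &[&str]) -> Result<Vec<(String, String)>, String> {
    let mut sep = String::from("=");
    let mut maps = Vec::new();
    for arg in args {
        if let Some(s) = arg.strip_prefix("-Zdebug-prefix-map-separator=") {
            if s.is_empty() {
                return Err("empty separator".to_string());
            }
            sep = s.to_string();
        } else if let Some(m) = arg.strip_prefix("-Zdebug-prefix-map=") {
            match m.split_once(sep.as_str()) {
                Some((old, new)) => maps.push((old.to_string(), new.to_string())),
                None => return Err(format!("missing separator in {}", m)),
            }
        }
        // Any other argument is ignored by this sketch.
    }
    Ok(maps)
}
```

The caller still has to pick a separator that occurs in neither path, which is the inspection step objected to later in the thread.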

@infinity0
Contributor
infinity0 commented Jan 18, 2017 edited

I think @michaelwoerister's suggestions are cleanest; tools that want to add this value to a rustc flag probably don't want to look inside the value to search it and then select a separator.

I agree that matching only full path components is best and less prone to error. I am a little concerned that GCC differs a bit from this, but the rustc code example is indeed very simple and it might be possible to ask GCC to adopt a similar approach - I have to send them a patch anyways, I may add this as well. Even if they don't adopt it, we (in the interests of standardising this behaviour) could define a standard that defines a "minimal" mapping behaviour but leave it open saying "the tool might perform further additional mappings".

I am less sure about normalisation, because it has the potential to mess with the semantics of various fields. For example this:

(DW_AT_comp_dir=/abc, path=./def/file1.rs) => /xyz/file1.rs  // would not match with gcc

what would you map DW_AT_comp_dir to? / or .? It's unclear, and it messes with the semantics, which is supposed to be "the working directory of the build command". There could be other fields that depend on the original meaning.

There are two cases:

  • We have a mapping for /abc, in which case both fields are reproducible, no need to normalise.
  • We don't have a mapping for /abc, in which case I'd say that there is an issue (could even call it a bug) with the tool that sets these mapping flags - it knowingly set the PWD but gave a mapping only for a specific child of the PWD. In this case, I think it's better to leave the situation unaltered so that it's at least detectable by reproducers, rather than trying to do something fancy.

(edit: "other mappings" -> "further additional mappings")

@infinity0
Contributor

(To expand on the flags points, I think it's fine to allow from and to to be separated by other options and appear in a non-pairwise order, and in practise no build tool would actually do this but it makes the logic simpler to analyse and write code for; and if they have different numbers of arguments then just fail the compile with an error.)

@michaelwoerister
Contributor

@jsgf I agree with @infinity0 regarding the CLI options:

What if from and to get separated by other options?

It makes no difference to the semantics.

What if there isn't the same number of from and to options?

Compilation fails with an error.

Can the from and to be reordered (ie, what if they appear as to from)?

It makes no difference. The n-th to is matched up with the n-th from.
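The pairing rules answered above (interleaving with other options is allowed, order between from and to does not matter, unequal counts are an error) could look roughly like this. The flag spellings come from the proposal in this thread; the function name and error text are invented for the sketch.

```rust
/// Match the n-th `-Zdebug-prefix-map-from=` with the n-th
/// `-Zdebug-prefix-map-to=`, regardless of what appears in between.
fn pair_mappings(args: &[&str]) -> Result<Vec<(String, String)>, String> {
    let mut from = Vec::new();
    let mut to = Vec::new();
    for arg in args {
        if let Some(v) = arg.strip_prefix("-Zdebug-prefix-map-from=") {
            from.push(v.to_string());
        } else if let Some(v) = arg.strip_prefix("-Zdebug-prefix-map-to=") {
            to.push(v.to_string());
        }
    }
    // Unequal counts: fail instead of guessing.
    if from.len() != to.len() {
        return Err(format!("{} from options but {} to options", from.len(), to.len()));
    }
    Ok(from.into_iter().zip(to).collect())
}
```

Because each value is taken verbatim after the flag name, a path containing `=` needs no escaping.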

Normalizing paths by eliminating .. is dangerous if the path contains symlinks [...]

You're right, that's a problem. An alternative would be to replace normalize(make_absolute(p)) with std::fs::canonicalize(path), which also resolves symlinks. You'd have to know where your symlinks lead to be able to write mappings but I think it's still easier to reason about than what we have today, where things are unpredictable to a large degree.

@michaelwoerister
Contributor

@infinity0

I am less sure about normalisation, because it has the potential to mess with the semantics of various fields. For example this:

(DW_AT_comp_dir=/abc, path=./def/file1.rs) => /xyz/file1.rs // would not match with gcc

what would you map DW_AT_comp_dir to? / or .? It's unclear, and it messes with the semantics, which is supposed to be "the working directory of the build command". There could be other fields that depend on the original meaning.

DW_AT_comp_dir would stay unchanged as /abc and the path being mapped in this example would be absolute and independent of the corresponding DW_AT_comp_dir. Having a mapping like this would probably be a bug in the build system. The output would be predictable though.

@jsgf
Contributor
jsgf commented Jan 18, 2017

Canonicalizing the path might make things worse. If you're using symlinks to normalize the namespace across multiple machines (a distributed build farm, for example), then canonicalizing the path will denormalize it. I think it would be a mistake to be too clever with the paths.

@michaelwoerister
Contributor

If you're using symlinks to normalize the namespace across multiple machines (a distributed build farm, for example), then canonicalizing the path will denormalize it.

Yes, that makes sense. No canonicalizing then.

@jsgf
Contributor
jsgf commented Jan 18, 2017 edited

Thinking about this more, I really think this is getting vastly overcomplicated.

I think the debug prefix map should be considered purely an operation on strings: the literal strings passed on the command-line to refer to the input sources (and - I guess - the current directory either from $PWD or getcwd()) with no attempt at processing, normalizing or even taking path elements into account.

This implies that the remapping should be applied before rustc makes them absolute, and rustc should only make them absolute if the output of the remapping is not already absolute (since remapping ./foo/ to /an/absolute/path should be perfectly reasonable).

I think that's a much simpler model to reason about and use. If I think about how I want to use this feature, any path processing on rustc's part makes things more complex to reason about and doesn't solve any of my problems.

Edit: By which I mean, any path processing can be done by tooling outside rustc - if I want normalized paths, I can normalize them. If I want them relative, absolute, etc, etc, I can do that outside, so long as I know that rustc isn't going to do anything complex/clever with them. The more complexity rustc applies, the fewer options I have.
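The pure-string model described in this comment can be sketched as below: remap the literal argument first, then make the result absolute only if it is not already. The function name is hypothetical, Unix-style paths are assumed, and the last-to-first rule order is carried over from earlier in the thread rather than stated in this comment.

```rust
use std::path::{Path, PathBuf};

/// Apply string-prefix rules to the literal argument, with no normalization,
/// then join with `cwd` only if the remapped result is still relative.
fn remap_then_absolutize(arg: &str, cwd: &Path, rules: &[(&str, &str)]) -> PathBuf {
    let mut mapped = arg.to_string();
    for &(old, new) in rules.iter().rev() {
        if let Some(rest) = arg.strip_prefix(old) {
            mapped = format!("{}{}", new, rest);
            break;
        }
    }
    let mapped = PathBuf::from(mapped);
    if mapped.is_absolute() { mapped } else { cwd.join(mapped) }
}
```

Mapping `./foo/` to `/an/absolute/path/` turns the argument `./foo/lib.rs` into `/an/absolute/path/lib.rs` without `cwd` ever being involved.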

@michaelwoerister
Contributor
michaelwoerister commented Jan 18, 2017 edited

I agree that predictability is the most important thing here. I thought that the scheme I proposed above would provide that best, but with corner cases cropping up, I don't think that anymore.

Let's try to come up with a different rule set:

  1. All paths provided by the user are preserved the way they are passed in (as @jsgf suggests).
  2. Paths derived from a user-provided path also keep the same format, e.g., if I do rustc ./main.rs, I'll get ./mod1/sub.rs. If I do rustc main.rs, I get mod1/sub.rs.
  3. If a path isn't already absolute, its absolute variant is constructed as getcwd() + user-provided-path. So ../src/main.rs will become /home/foo/project/build/../src/main.rs, for example.
  4. At least in the beginning, the compiler will emit absolute paths everywhere (as it does now, see #34187. This might become optional once remapping is stable).
  5. Remapping is the last thing that is done before storing a path in debuginfo, so you get DW_AT_comp_dir=remap(pwd), and DW_AT_decl_file=remap(abs_path).
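Rules 3 and 5 of the list above can be sketched together as follows. This assumes Unix-style paths, an explicit `cwd` argument instead of a real getcwd() call, and component-level prefix matching; the function names and the returned pair (standing for DW_AT_comp_dir and DW_AT_decl_file) are illustrative only.

```rust
use std::path::{Path, PathBuf};

/// Rule 3: the absolute variant is cwd + user-provided path, with no normalization.
fn absolute_variant(cwd: &Path, user_path: &str) -> PathBuf {
    let p = Path::new(user_path);
    if p.is_absolute() { p.to_path_buf() } else { cwd.join(p) }
}

/// Prefix remapping, rules searched last to first (hypothetical helper).
fn remap(path: &Path, rules: &[(PathBuf, PathBuf)]) -> PathBuf {
    for (from, to) in rules.iter().rev() {
        if let Ok(rest) = path.strip_prefix(from) {
            return if rest.as_os_str().is_empty() { to.clone() } else { to.join(rest) };
        }
    }
    path.to_path_buf()
}

/// Rule 5: remapping is the last step before the paths are stored.
/// Returns (comp_dir, decl_file).
fn debuginfo_fields(cwd: &Path, user_path: &str, rules: &[(PathBuf, PathBuf)]) -> (PathBuf, PathBuf) {
    (remap(cwd, rules), remap(&absolute_variant(cwd, user_path), rules))
}
```

Note that the `..` in `../src/main.rs` survives untouched; only the prefix is replaced.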

How does that sound?

(cc @luser, who might also be interested in this whole topic)

@infinity0
Contributor

I think that the mappings should be applied last, after any other processing such as converting to absolute paths. I'd imagine that this would be more predictable, at least from an outsider that is merely observing what rustc does without reading its source code - in other words, one could build both with and without the mappings, and the output would be related in a way that is only based on the mappings and the algorithm, and not on anything else rustc might do now or in the future.

In (3), making all paths absolute, would mean the output can only be reproduced if rebuilders build it in the same path. Is this really necessary? I'd prefer to keep relative paths relative (as GCC does), even if they traverse above cwd. The parent tool that calls rustc could easily add extra ../src=xxx mappings if it feels that this is necessary. The one situation this wouldn't be appropriate, is if these relative paths are generated by rustc itself and are unknown to the parent tool - but does this situation actually occur?

@michaelwoerister
Contributor

I think that the mappings should be applied last, after any other processing such as converting to absolute paths.

That's what I suggest.

In (3), making all paths absolute, would mean the output can only be reproduced if rebuilders build it in the same path.

Or if they set up their prefix-mapping to result in the same path, right?

Is this really necessary?

It is at the moment because of items instantiated from other crates. Combining their relative paths with the current working directory yields invalid results. This problem could be solved differently though, maybe.

@jsgf
Contributor
jsgf commented Jan 18, 2017 edited

The issue is that the tool that's building the commandline may not know what the current directory is at the time the command is invoked, so it won't know what absolute path applies at that point, and therefore can't do path remapping in those terms. It does know what paths it's putting on the commandline though, so it can generate remappings in those terms.

@michaelwoerister Good point about derived names; I'd overlooked those. (Mostly because I assume they'd have a common prefix from the perspective of mapping.)

@michaelwoerister
Contributor

@jsgf

The issue is that the tool that's building the commandline may not know what the current directory is at the time the command is invoked

How do you construct your prefix mapping then?

@michaelwoerister
Contributor
michaelwoerister commented Jan 18, 2017 edited

How do you construct your prefix mapping then?

To make it more concrete: DW_AT_comp_dir always contains the current directory. If you want to map that to something stable, you have to know its value, right?

@jsgf
Contributor
jsgf commented Jan 18, 2017

All relative. I'm not that interested in remapping comp_dir, since there is actually no one canonical path that makes sense in my environment. I'm more interested in remapping the source paths to a relative canonical name within the source tree.

To be specific:

  • In my environment there's a large build farm
  • It's using Buck for all building
  • Buck creates a symlink tree containing all the sources listed as dependencies for a given target, and nothing else so that compilation fails if any sources are referenced that aren't listed as dependencies
  • The build objects built in the farm are stored in a distributed cache, and reused if you're redoing the same build

The net result is that not only is the current directory some random name, but the path to the sources is prefixed by a buck-generated name. This means that debugging with a cached object will result in meaningless source paths. I want to use prefix remapping to map the path to the source from the symlink tree back to the canonical location in the source tree.

Reproducible builds are of secondary interest because they'd have better caching properties; in that case remapping comp_dir to some fixed (but essentially meaningless) string would help - but only if the rest of the object were bit-for-bit identical.

(Added bonus, I'd like to use the remapped name for error messages, but we can get to that later.)

@michaelwoerister
Contributor

@jsgf Can you give a small example of what your mapping would look like?

@luser
Contributor
luser commented Jan 18, 2017

It is at the moment because of items instantiated from other crates. Combining their relative paths with the current working directory yields invalid results. This problem could be solved differently though, maybe.

If you're invoking rustc directly then this shouldn't be as much of a problem, since presumably you're not using cargo. If you're using cargo and still need reliable paths you could set CARGO_HOME to a known path and then remap that to a fixed path. We don't have this problem in Gecko (since I fixed the debug info for generics from external crates) because we've vendored all our crates into our source repo, so their source paths are always inside our top source directory.

@jsgf
Contributor
jsgf commented Jan 18, 2017

Something like:

rustc -Zdebug-prefix-map-from=./buck-out/gen/my/build-target#pic,rlib/ -Zdebug-prefix-map-to=./ --other --options ./buck-out/gen/my/build-target#rlib,pic/my/target/src/lib.rs

ie, remapping the prefix to ./, with the expectation that this will result in DW_AT_comp_dir being <some random path>, but the DW_AT_name for each source file would be ./some/path.rs.

@jsgf
Contributor
jsgf commented Jan 18, 2017

BTW, -Z's behaviour of always emitting the "warning: the option Z is unstable and should only be used on the nightly compiler, but it is currently accepted for backwards compatibility; this will soon change, see issue #31847 for more details" warning makes it a non-starter for this.

@michaelwoerister
Contributor

@jsgf Couldn't you use something like the following?

rustc -Zdebug-prefix-map-from=`pwd`/buck-out/gen/my/build-target#pic,rlib/
      -Zdebug-prefix-map-to=/something/
      `pwd`/buck-out/gen/my/build-target#rlib,pic/my/target/src/lib.rs

@jsgf
Contributor
jsgf commented Jan 18, 2017

@michaelwoerister There's no shell involved, so no backtick substitution.

@michaelwoerister
Contributor

BTW, -Z's behaviour of always emitting the [...] warning ...

The plan is to move this to a stable -C flag once we are sure we want to stabilize it. In the beginning it will start out as unstable though (like all new features).

@jsgf
Contributor
jsgf commented Jan 18, 2017

Still really annoying message. Is there -Z yes-I-know?

@michaelwoerister
Contributor

Is there -Z yes-I-know?

Not that I know of, unfortunately.

@michaelwoerister
Contributor

There's no shell involved, so no backtick substitution.

The problem is that functions from other crates can be instantiated in the local crate, so we cannot store their paths as relative (because they are relative to something different for each crate). This is different from C/C++ where the source for templates is always available when they are instantiated.

@jsgf
Contributor
jsgf commented Jan 18, 2017

The problem is that functions from other crates can be instantiated in the local crate, so we cannot store their paths as relative

Well, in this case they're either relative to the same path (ie, from the same sourcebase), or their path is meaningless (from crates.io, which are all prebuilt).

@infinity0
Contributor

The issue is that the tool that's building the commandline may not know what the current directory is at the time the command is invoked, so it won't know what absolute path applies at that point [..]

The net result is that not only is the current directory some random name, but the path to the sources is prefixed by a buck-generated name. [..]

If I understand correctly, this is an argument in favour of applying the mappings first, before any other processing? I think this is better fixed in Buck itself, because it is the one that sets up these symlinks. It should know what cwd is used for each invocation of Rust, so it should be able to construct the example maps suggested by @michaelwoerister that contain pwd even without a shell.

Also if one applies the mappings first, it likely would result in a non-existent path. Then it's unclear how you do "other processing" on this. (If there is no other processing, then "first" and "last" are the same, and we're in agreement.)

Tools that run after rustc has emitted the debuginfo have no control over what rustc does or how this might change over time. So it is more important to keep the expectations here simple. GCC also applies the remapping last (see gcc/dwarf2out.c in dwarf2out_early_finish).


I think that the mappings should be applied last, after any other processing such as converting to absolute paths.

That's what I suggest.

OK, I think I just got confused by the below point:

Is [making all paths absolute] really necessary?

It is at the moment because of items instantiated from other crates. Combining their relative paths with the current working directory yields invalid results. This problem could be solved differently though, maybe.

Could you explain this in some more detail so I/others could think of how to "solve it differently"? For example this:

The problem is that functions from other crates can be instantiated in the local crate, so we cannot store their paths as relative

Well, in this case they're either relative to the same path (ie, from the same sourcebase), or their path is meaningless (from crates.io, which are all prebuilt).

this sort of makes sense to me, but I'm not sure what the details are.

@michaelwoerister
Contributor

@infinity0 Regarding relative paths versus extern crates, let's say you have the following setup:

/libfoo/src      <-- contains the source of libfoo
/libfoo/build    <-- this is where you build libfoo & cwd of rustc while doing so   

If you have a generic function func in libfoo, its source location would be recorded as ../src/lib.rs.

Now, let's say you compile another library, libbar, that references libfoo and has a similar setup:

/libbar/src      <-- contains the source of libbar
/libbar/build    <-- this is where you build libbar & cwd of rustc while doing so   

If you use foo::func in libbar you end up with a debuginfo entry that tells you that the source code of foo::func can be found in ../src/lib.rs, but we are relative to /libbar/build now, so the debugger would open /libbar/src/lib.rs and show you some unrelated source code. So, at least for items from external crates, in the general case we have to emit absolute paths.

@jsgf
Contributor
jsgf commented Jan 19, 2017

@infinity0 In principle if all the building is happening on one machine, then Buck could control everything. But if the build is being distributed then Buck on machine A could set up an environment that's consistent with relative paths, but be in a different absolute directory on machine B. The problem is that paths produced from getcwd() may be absolute, but they are not canonical. (However, if Buck specified everything as absolute paths then they could be made canonical so that the same path works in all environments via the use of symlinks or similar - so I guess it could generate debug-prefix-path options to remap the abs build-time paths to relative.)

For the same reason, I think using absolute paths for the problem @michaelwoerister mentions above is also wrong. We want canonical paths for those source files, not absolute ones. The conflation between absolute and canonical is where I see problems.

Specifically, if you're trying to reconcile the paths for a libfoo build in the context of a libbar build, then I think it's more correct to use the libfoo's comp_dir and relative source paths to construct paths relative to libbar's comp_dir than to use absolute paths.

@luser
Contributor
luser commented Jan 20, 2017

Specifically, if you're trying to reconcile the paths for a libfoo build in the context of a libbar build, then I think it's more correct to use the libfoo's comp_dir and relative source paths to construct paths relative to libbar's comp_dir than to use absolute paths.

The weirdness here is for generics--they're not actually compiled until you instantiate them with a concrete type, so if you're using Foo<T> from libfoo it's not compiled until it gets used in libbar. What I implemented a while back was to put absolute paths in the metadata with the bytecode instead of the relative paths (which couldn't be resolved later). All it does is join the relative path with the comp_dir, so it should be functionally the same except for not having it in two separate fields.

@infinity0
Contributor
infinity0 commented Jan 20, 2017 edited

If you have a generic function func in libfoo, its source location would be recorded as ../src/lib.rs. [and the cwd as /libfoo/build].

The weirdness here is for generics--they're not actually compiled until you instantiate them with a concrete type, so if you're using Foo<T> from libfoo it's not compiled until it gets used in libbar.

These two comments sound inconsistent - in the first one, where is the source location recorded? But the second comment suggests this is not available?

Anyway, using the example with libfoo and libbar directly above again, there are two cases:

  • libfoo::x is compiled with cwd=/libfoo/build and name=../src/x. Then we compile libbar. In this case what @jsgf suggested seems sensible to me, i.e. store the name of x in libbar as relpath("/libfoo/build" + "../src/x", "/libbar/build") which would be ../../libfoo/src/x.

  • libfoo::x is not compiled at first. Then we compile libbar. In this case, libbar has to find x.rs somehow (or some other intermediate file, I don't know) in which case it can still use a relative path as its name?

In both cases we are assuming that /libfoo and /libbar's relative positions to each other are fixed across both builds (even if their absolute paths change). If this is not the case, then at the very least libbar has to "find" libfoo somehow. Whichever directory this "find" algorithm returns, (it could be /libfoo or /libfoo/build, I don't know), call this directory d, then you could additionally store relpath(d, cwd) somewhere in the rust metadata when building libfoo, so that the later libbar could still work with relative paths. It would be slightly more complex, but still easily achievable IMO. (And none of what I described involves canonicalising symlinks, which would mess with Buck.) I think this addresses the distributed scenarios that @jsgf described too, but I'm not familiar with the details so perhaps he could confirm that.

(edit: explain why I think the two comments sound inconsistent)
(edit: to clarify, by relpath(x, y) I mean the path from y to x, like python's os.path.relpath)
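A rough Python sketch of the first case, using `posixpath.relpath` with the same argument order as described in the edit note. The function name `cross_crate_path` and the `/home/ci/work` root are assumptions for illustration only:

```python
import posixpath

def cross_crate_path(dep_comp_dir, dep_rel_name, own_comp_dir):
    """Path to a dependency's source file, relative to our own compilation directory."""
    dep_abs = posixpath.normpath(posixpath.join(dep_comp_dir, dep_rel_name))
    return posixpath.relpath(dep_abs, own_comp_dir)

# Layout from the example: libfoo::x built in /libfoo/build, libbar in /libbar/build.
original = cross_crate_path("/libfoo/build", "../src/x", "/libbar/build")

# The same two trees checked out under a different (hypothetical) root.
moved = cross_crate_path("/home/ci/work/libfoo/build", "../src/x",
                         "/home/ci/work/libbar/build")

print(original)
print(moved)
```

Both calls give the same relative path, which is the property that matters here: the result depends only on where the two trees sit relative to each other, not on their absolute location.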

@luser
Contributor
luser commented Jan 20, 2017

libfoo::x is not compiled at first. Then we compile libbar. In this case, libbar has to find x.rs somehow (or some other intermediate file, I don't know) in which case it can still use a relative path as its name?

Sorry, I should have been clearer! The generic code gets converted into some sort of bytecode (I'm sure someone else knows the specifics here), and that's stored in the generated rlib along with some metadata (which I think is handled by librustc_metadata). If you list the contents of an rlib with ar t foo.rlib you'll see that it contains a .o file (the actually compiled bits of the crate), a rust.metadata.bin file and a .bytecode.deflate file. The compiler gets the filename from the metadata for the items it's instantiating from bytecode.

Regardless, given that it has a full path in the metadata, I don't see why getting a relative path to the libfoo comp_dir and then joining the relative path would be better than just taking a relative path from the libbar comp_dir to the full path from libfoo. In your example, we'd have libfoo::x being compiled with relpath("/libfoo/src/x", "/libbar/build") which would still be ../../libfoo/src/x.

@michaelwoerister
Contributor

In both cases we are assuming that /libfoo and /libbar's relative positions to each other are fixed across both builds.

This seems to be the main problem with this approach. I would be wary of positing that (1) the rule set must support situations where the compilation directory is not known, but (2) it can be assumed that relative positions never change. That seems too tailored to this specific situation to make for a good general rule. Also if there is no common prefix between two paths (as in D:\foo and X:\bar), there is no relative path between them and you have to know the original compilation directory of your upstream crate again to set up a mapping.
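The drive-letter case can be checked with Python's `ntpath` module, which implements Windows path rules on any platform. This is only a sketch; `try_relpath` is a hypothetical wrapper, and the `D:\libfoo` and `D:\libbar` paths are made up:

```python
import ntpath

def try_relpath(target, start):
    """Return the relative path from start to target, or None if there is none."""
    try:
        return ntpath.relpath(target, start)
    except ValueError:
        # ntpath.relpath raises ValueError when the paths are on different drives.
        return None

same_drive = try_relpath(r"D:\libfoo\src", r"D:\libbar\build")
cross_drive = try_relpath(r"D:\foo", r"X:\bar")

print(same_drive)
print(cross_drive)
```

On the same drive a relative path exists; across drives there is nothing to compute, so some other piece of information (such as the upstream compilation directory) is needed.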


@jsgf How about if we introduce a variable __RUSTC_CWD to the mapping syntax? As in:

rustc -Zdebug-prefix-map-from=__RUSTC_CWD/./buck-out/gen/my/build-target#pic,rlib/ 
      -Zdebug-prefix-map-to=/something/
      ./buck-out/gen/my/build-target#rlib,pic/my/target/src/lib.rs

A variable like this would be guaranteed to match the prefix of any absolute path the compiler emits.
Items from upstream crates would have their paths mapped with the mapping given when that upstream crate was originally compiled.

@infinity0
Contributor
infinity0 commented Jan 20, 2017 edited

Sure, using the full path to lib.rs would also work. I didn't realise that was available; I was only following the earlier constraint "its source location would be recorded as ../src/lib.rs". As long as we do get the correct final relative paths, how it's calculated is not so important to me.

However, my overall motivation is to avoid having absolute paths anywhere in the output, not only in the debugging output but also in the rlibs, since they can be installed onto end-user systems. That's where the second part of my suggestion comes in, which relates to the "find" algorithm by which other crates are located. Suppose we call d crate_dir instead, for clarity; then here's a concrete example:

First we build libfoo with prefix map /path/to/libfoo=/usr/src/rust/libfoo

libfoo/build/foo.rlib#libfoo.o:
  comp_dir = /usr/src/rust/libfoo/build, mapped from /path/to/libfoo/build
  name = ../src/lib.rs
libfoo/build/foo.rlib#rust.metadata.bin:
  crate_dir = /usr/src/rust/libfoo, mapped from /path/to/libfoo
  fn run_foo_generically:
    path = ../src/lib.rs

This is reproducible regardless of the rebuilder's build directory, as long as they set the right prefix-map.

Later on a different build machine, someone else builds libbar, using libfoo from /my/crates/libfoo, with no prefix map because they don't care about reproducibility, but they do care about correct debugging:

libbar/build/bar.rlib#libbar.o:
  comp_dir = /my/own/libbar/build
  name = ../src/lib.rs
  fn run_foo_generically:
    path = /my/crates/libfoo/src/lib.rs
  ^ probably this is not exactly how things work, but hopefully "similar enough" that you get the idea

And this last value /my/crates/libfoo/src/lib.rs would be calculated via:

  • joinpath(
    • relpath(
      • foo's comp_dir,
      • foo's crate_dir),
    • run_foo_generically's (rel)path (from rust.metadata.bin) )

(This assumes that relpath(comp_dir, crate_dir) is reproducible and not something like ../../../lol/trolls/gonna/troll/libfoo/build.)
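Spelled out as a Python sketch, with the values from the example. The function name `source_within_crate` is hypothetical, and the final join with the local directory is my assumption about how the libbar build would turn the crate-relative path into an absolute one:

```python
import posixpath

def source_within_crate(comp_dir, crate_dir, recorded_path):
    """Express a recorded source path relative to the crate's top-level directory."""
    build_subdir = posixpath.relpath(comp_dir, crate_dir)
    return posixpath.normpath(posixpath.join(build_subdir, recorded_path))

# Values stored in foo.rlib (already remapped by the prefix map).
in_crate = source_within_crate(
    comp_dir="/usr/src/rust/libfoo/build",
    crate_dir="/usr/src/rust/libfoo",
    recorded_path="../src/lib.rs",
)

# Assumption: the libbar build located libfoo under this local directory.
local_crate_dir = "/my/crates/libfoo"
final_path = posixpath.join(local_crate_dir, in_crate)

print(in_crate)
print(final_path)
```

Note that the remapped prefix /usr/src/rust/libfoo cancels out of the calculation, so the crate-relative result does not depend on where libfoo was originally built.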

@infinity0
Contributor

Actually, even in the case that someone does want to build libfoo somewhere completely random, they could build it using absolute source path names (which might get remapped). Later, anyone else such as libbar can still recover these source paths relative to the crate_dir, which is all that is needed to correctly resolve the paths on libbar's side.

@jsgf
Contributor
jsgf commented Jan 20, 2017

@michaelwoerister:

How about if we introduce a variable __RUSTC_CWD to the mapping syntax?

After going to the effort of adding the -from/-to variants to the command-line syntax to avoid needing to parse the string for a separator character, I don't think adding in metasyntactic variables is very consistent with that.

But more generally, I think there's a somewhat irreconcilable problem: sometimes absolute paths are the right thing to use, and sometimes relative is, on a purely case-by-case basis.

(My background summary for my own benefit)

For code that is compiled directly to a pure object file, the answer is pretty clear: each object has a corresponding set of files with a DW_AT_comp_dir and source paths relative to that. Tools can try to construct their own abspath by combining them, but also construct relative paths purely from the source names, or by having their own source_path to apply to each filename.

The problem arises from code that isn't completely generated into object code at its own build time, but defers it to later specialization in the context of some other module.

C/C++ has no problem with this, because this is always performed in source terms; the second compilation always needs to refer to the source of the first compilation, so we know they're at least being compiled in the same namespace, and the DW_AT_comp_dir+relative path will resolve to something meaningful.

In Rust it's trickier because there's always a compilation to a form of object file (either .rlib or .so), so the original sources are never needed, and the second compilation could be in a completely different filesystem namespace, making the concept of "path to source" for the first compilation potentially meaningless. Or they could be in the same namespace, but with no meaningful relative relationship. Or they could be in the same source tree, but the absolute position of that tree might be different from build to build (or between builder and editor).

So, given that, the questions that occur to me are:

Does Dwarf have a way of expressing what we want here?

  • That is, "code from comp_dir=.../libfoo src path ../src/thing.rs inlined and specialized into comp_dir=.../libbar src path some/other/path.rs"

How does C/C++ handle this with

  • precompiled headers?
  • C++ modules?
  • Link-time whole-program optimization?
@jsgf
Contributor
jsgf commented Jan 20, 2017 edited

Hm, on closer inspection, it looks to me like DW_AT_comp_dir and DW_AT_name for the source are a red herring; rustc only seems to generate an entry for the top-level lib.rs, and the rest of the sources don't appear there.

The real action is happening in the "Directory Table" (in readelf output, include_directories in the DWARF spec) and the "File Name Table" (file_names). It seems that rustc always generates bare names in the file name table (lib.rs), and generates a new directory entry for each path within the crate, as full paths (/my/source/is/here/libfoo/src/submodule).

The DWARF spec says the file names are relative to either DW_AT_comp_dir or a specific entry in the directory table.

So, for example, libfutures.rlib has:

  The Directory Table (offset 0x1b):
  1     /my/full/path/to/futures/src
  2     /my/full/path/to/futures/src/future
  3     /my/full/path/to/futures/src/stream
  4     /my/full/path/to/futures/src/sink
  5     /my/full/path/to/futures/src/task_impl
  6     /my/full/path/to/futures/src/sync
  7     /my/full/path/to/futures/src/sync/mpsc
[...]

 The File Name Table (offset 0x5f1):
  Entry Dir     Time    Size    Name
  1     1       0       0       lib.rs
  2     2       0       0       mod.rs
  3     2       0       0       lazy.rs
  4     0       0       0       <std macros>
  5     2       0       0       flatten_stream.rs
  6     2       0       0       join.rs
  7     2       0       0       select.rs
  8     2       0       0       chain.rs
  9     2       0       0       join_all.rs
  10    2       0       0       select_all.rs
  11    2       0       0       select_ok.rs
[...]

This means that lib.rs is relative to /my/full/path/to/futures/src, etc. I think rustc is wrong here. It should only have one path to this crate's sources, and then each name in the filenames list should be relative to that. For example:

Assuming DW_AT_comp_dir is /my/full/path/to/futures/src:

  The Directory Table (offset 0x1b):
  1     /my/full/path/to/futures/src
[...]

 The File Name Table (offset 0x5f1):
  Entry Dir     Time    Size    Name
  1     1       0       0       lib.rs
  2     1       0       0       future/mod.rs
  3     1       0       0       future/lazy.rs
  4     0       0       0       <std macros>
  5     1       0       0       future/flatten_stream.rs
  6     1       0       0       future/join.rs
  7     1       0       0       future/select.rs
  8     1       0       0       future/chain.rs
  9     1       0       0       future/join_all.rs
  10    1       0       0       future/select_all.rs
  11    1       0       0       future/select_ok.rs
[...]

so that all the names are also sensible relative to DW_AT_comp_dir as well as having an absolute path. (This is post remapping if the actual build happened in a separate build dir.)
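As a sanity check that the two layouts describe the same files, here is a small Python sketch of the lookup rule (file name relative to its directory entry, directory entry relative to the compilation directory, index 0 meaning the compilation directory itself). The function `resolve_files` and the dictionary representation are illustrative only, not how any real DWARF reader stores the tables:

```python
import posixpath

def resolve_files(comp_dir, directories, files):
    """Resolve (dir_index, name) entries to full paths.

    directories maps a 1-based index to a directory; index 0 means comp_dir.
    """
    resolved = []
    for dir_index, name in files:
        if dir_index == 0:
            base = comp_dir
        else:
            # join() keeps an absolute directory entry as-is.
            base = posixpath.join(comp_dir, directories[dir_index])
        resolved.append(posixpath.normpath(posixpath.join(base, name)))
    return resolved

comp_dir = "/my/full/path/to/futures/src"

# Current layout: one directory entry per module directory, bare file names.
current_dirs = {1: "/my/full/path/to/futures/src",
                2: "/my/full/path/to/futures/src/future"}
current_files = [(1, "lib.rs"), (2, "mod.rs"), (2, "lazy.rs")]

# Proposed layout: a single directory entry, names relative to it.
proposed_dirs = {1: "/my/full/path/to/futures/src"}
proposed_files = [(1, "lib.rs"), (1, "future/mod.rs"), (1, "future/lazy.rs")]

current = resolve_files(comp_dir, current_dirs, current_files)
proposed = resolve_files(comp_dir, proposed_dirs, proposed_files)
print(current == proposed)
```

The difference is only in what survives when the directory entry is dropped or remapped: in the proposed layout the names alone still identify each file within the crate.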

What's more, code from other modules can use different directory entries:

 The Directory Table (offset 0x1b):
[...]
  8     /my/path/to/rust/1.14/src/rust/src
[...]

 The File Name Table (offset 0x5f1):
  Entry Dir     Time    Size    Name
[...]
  41    8       0       0       liballoc/arc.rs
[...]

So I think this goes back to @michaelwoerister's comment above about how rustc should construct the pathnames to emit based on the top-level source passed to rustc, and make sure that it propagates that all the way through to the DWARF info without rewriting them as absolute dir + filename.

@michaelwoerister
Contributor

@jsgf The other files appear in DW_AT_decl_file attributes. This is how DWARF encodes paths. Paths in the file name table are relative to their directory table entry, paths in the directory table are relative to the DW_AT_comp_dir of the compilation unit. It's done in a way that maps well to C include directories but there is no clear rule how this should be used (which doesn't mean that the way rustc does it now is a good one).

@jsgf
Contributor
jsgf commented Jan 20, 2017

@michaelwoerister DW_AT_decl_file computes the file index for a piece of the code, but the ultimate filename comes from the file name and directory tables.

@michaelwoerister
Contributor

@infinity0 In your example from above, when compiling libbar, I don't think the compiler would be able to know that the source of libfoo could be found under /my/crates/libfoo. It only knows where libfoo.rlib is, which is independent of libfoo's source location.

@michaelwoerister
Contributor

DW_AT_decl_file computes the file index for a piece of the code, but the ultimate filename comes from the file name and directory tables.

Yes, sorry, that didn't make too much sense. The important thing is that the file-name and directory tables are just an arbitrary encoding. It seems that GCC for example does the same as rustc (maybe because it takes the least space).

@infinity0
Contributor

@michaelwoerister I see, OK. This situation is indeed not handled well by DWARF. But I think it's possible to make it work, going back to the crate_dir concept. Instead of recording paths to libfoo relative to libbar, you could record these paths as a special dummy path like crate://libfoo/$relative_path, where $relative_path is as I suggested above, calculated using crate_dir and other information in libfoo.rlib.

I think this sort of solution is inevitable given the differences from C/C++ that @jsgf just talked about in detail. If I build libbar against libfoo.rlib without necessarily having the latter's source code, but I want my users to be able to obtain that source code to debug my library, then the only theoretically possible thing to do is to refer to a relative path, one that is relative to some "abstract idea" of where the source code of libfoo might be.

Anyway, this is going slightly away from the topic of prefix maps. To summarise, I think we should separate the concerns here:

  • to support debugging source code from other libraries libfoo (whose source code might not be available at build time, but debugging symbols are), this can be achieved by recording crate_dir and using virtual paths (e.g. crate://libfoo/path/to/file/containing/a/generic/symbol)

  • to support reproducible builds, this can be achieved by using prefix-maps to transform paths as usual, e.g. crate_dir, comp_dir, name, those things in the Directory Table, etc.

Actually, if it's possible/suitable for rustc to auto-detect the top-level crate_dir of any given file that it's compiling, these prefix-maps aren't even necessary at all - all paths can simply be made relative to this directory, and debuggers can recreate these paths on their side no problem. Prefix-maps are only really needed when the build tool does not know the "top-level" directory of the source code and needs a parent buildsystem to pass it in, which is the case for traditional C compilers but not usually the case for more modern languages.
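To illustrate the virtual-path idea from the debugger's side, a minimal Python sketch. Everything here is hypothetical: the crate:// scheme is only a suggestion in this thread, and `resolve_virtual` and the `crate_sources` table are invented names for whatever configuration a debugger would offer:

```python
import posixpath

SCHEME = "crate://"

def resolve_virtual(path, crate_sources):
    """Map 'crate://<name>/<relative path>' to a local file.

    Returns None if the crate's source location is not known locally.
    """
    if not path.startswith(SCHEME):
        return path  # ordinary path, leave it untouched
    crate_name, _, relative = path[len(SCHEME):].partition("/")
    root = crate_sources.get(crate_name)
    if root is None:
        return None
    return posixpath.join(root, relative)

# Hypothetical debugger-side table: where each crate's source lives locally.
crate_sources = {"libfoo": "/my/crates/libfoo"}

found = resolve_virtual("crate://libfoo/src/lib.rs", crate_sources)
missing = resolve_virtual("crate://libbaz/src/lib.rs", crate_sources)

print(found)
print(missing)
```

The recorded path carries no build-machine directory at all; the only machine-specific information is the table on the debugging side.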

@jmesmon
Contributor
jmesmon commented Jan 22, 2017 edited

Perhaps I missed this, but is there a reason that we can't remap the paths when they are being stored into an rlib, to avoid the need to map them later on? ie: do the remap at the point where the source location is known.

By doing this, non-cargo builders have the control they need, and we'd just need some higher level support for path-remapping in cargo to handle its multiple rustc invocations.

@michaelwoerister
Contributor

@infinity0 Your idea of "abstract source" matches up with what I had in mind as the primary motivation for prefix mapping and can also be implemented via regular prefix mapping if we do what @jmesmon suggests:

When you invoke the compiler with something like rustc ../src/lib.rs, you can have a mapping ../src/ -> mylib@v0.1.0/ (if you are using relative paths) or /absolute/path/to/your/src/ -> mylib@v0.1.0/ where mylib@v0.1.0 is a kind of abstract name for the given source code. In a later step the consumer (like GDB) can have its own mapping for that or you remap it again with debugedit. For this to work, we need to do what @jmesmon says: remap paths as they are stored or -- equivalently -- store the mapping with the crate and apply it to anything that comes from that crate. This way, things coming from other crates potentially are already in the known "abstract source space" and you know what to do with them.
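A minimal Python sketch of the two mapping steps, assuming plain string-prefix replacement. The function `remap_prefix`, the file `util/io.rs` and the `/home/dev/checkouts/mylib/src/` location are hypothetical and only serve the illustration:

```python
def remap_prefix(path, mappings):
    """Apply the first matching (old_prefix, new_prefix) pair to path."""
    for old, new in mappings:
        if path.startswith(old):
            return new + path[len(old):]
    return path  # no mapping applies

# Build side: map the real source directory to an abstract name.
build_map = [("/absolute/path/to/your/src/", "mylib@v0.1.0/")]
stored = remap_prefix("/absolute/path/to/your/src/util/io.rs", build_map)

# Consumer side (hypothetical): map the abstract name to a local checkout.
debugger_map = [("mylib@v0.1.0/", "/home/dev/checkouts/mylib/src/")]
local = remap_prefix(stored, debugger_map)

print(stored)
print(local)
```

Because the abstract name is what gets stored with the crate, a downstream crate that instantiates code from it never sees the original build directory.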
