Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: make Cargo embed dependency versions in the compiled binary #2801

Open
wants to merge 7 commits into
base: master
from

Conversation

@Shnatsel
Copy link

Shnatsel commented Nov 3, 2019

Rendered

This is the first step towards a proper security update story for Rust. You might also want to check out the comments on a prototype implementation that informed this RFC.

Shnatsel added 4 commits Sep 8, 2019
@igor-petruk

This comment has been minimized.

Copy link

igor-petruk commented Nov 3, 2019

One idea is to wrap Cargo.lock into another trivial format: yaml, json, custom text with two keys: ContentType and Body and you can version your format if Cargo.lock becomes cumbersome (for some yet unknown reasons) and will have to be replaced with something else.

@Alex-Addy

This comment has been minimized.

Copy link

Alex-Addy commented Nov 3, 2019

I would request that this is not enabled by default. As noted by RustDragon on reddit, this could leak potentially sensitive information through the embedded paths.

@Shnatsel

This comment has been minimized.

Copy link
Author

Shnatsel commented Nov 3, 2019

As described in the RFC, not having this enabled by default pretty much defeats the point.

Would disclosing that the dependency originated from a local copy, but not disclosing the directory path be an acceptable solution? This would protect information such as usernames and paths.

- Tooling for extracting information from binaries (such as ELF sections) is already readily available. Tooling for parsing `Cargo.lock` also exists.
- This enables third parties such as cloud providers to scan your binaries for you. Google Cloud [already provides such a service](https://cloud.google.com/container-registry/docs/get-image-vulnerabilities), Amazon has [an open-source project you can deploy](https://aws.amazon.com/blogs/publicsector/detect-vulnerabilities-in-the-docker-images-in-your-applications/) while Azure [integrates several partner solutions](https://docs.microsoft.com/en-us/azure/security-center/security-center-vulnerability-assessment-recommendations).

Alternatives:

This comment has been minimized.

Copy link
@joshlf

joshlf Nov 3, 2019

Contributor

Do we actually need Cargo support for this? Would it be possible to have an empty crate with a build.rs script that performs the appropriate analysis and emits the appropriate linker flags? That would certainly defeat the goal of being on-by-default, but it could be a useful first step before Cargo support is ratified and deployed.

This comment has been minimized.

Copy link
@Shnatsel

Shnatsel Nov 3, 2019

Author

AFAIK this can't be fit into a crate because build.rs is invoked before the rest of the code is compiled. But this almost certainly can be implemented as a Cargo wrapper - just call objdump or the platform equivalent after Cargo is done. An example for Linux is included in the RFC.

Also I have a working prototype but with a different implementation approach - &'static str with start/stop markers.

This comment has been minimized.

Copy link
@joshlf

joshlf Nov 3, 2019

Contributor

When running build.rs, Cargo provides the CARGO_MANIFEST_DIR environment variable. I believe that you could use this to locate the Cargo.lock file, and use its location to emit the appropriate linker flags.

This comment has been minimized.

Copy link
@Shnatsel

Shnatsel Nov 3, 2019

Author

I have zero knowledge about using linkers directly (or configuring them), but I would welcome any attempts at prototyping this.

This comment has been minimized.

Copy link
@joshlf

joshlf Nov 3, 2019

Contributor

I have no time, but I'm posting this to This Week in Rust's CFP :)

@Alex-Addy

This comment has been minimized.

Copy link

Alex-Addy commented Nov 3, 2019

I'm not sure that not having it enabled by default would defeat the point. Having this information available to be enabled would still be immensely valuable to production users. My main concern is users accidentally having their information leaked just by handing out a binary.

@Alex-Addy

This comment has been minimized.

Copy link

Alex-Addy commented Nov 3, 2019

Would disclosing that the dependency originated from a local copy, but not disclosing the directory path be an acceptable solution? This would protect information such as usernames and paths.

Having thought about this a bit more I think that this would be fine. It might be desirable to be able to turn path embedding back on (for full auditing), with it being disabled by default (for safety).

@Lokathor

This comment has been minimized.

Copy link

Lokathor commented Nov 3, 2019

Yeah this should definitely be off by default, even if the info is scrubbed of paths and stuff.

@Shnatsel

This comment has been minimized.

Copy link
Author

Shnatsel commented Nov 3, 2019

@Lokathor could you provide a rationale? I think the RFC makes a pretty strong case for the on-by-default approach. Prior art from Go and (as I've just learned) Ruby supports this as well.

@Shnatsel

This comment has been minimized.

Copy link
Author

Shnatsel commented Nov 3, 2019

It has also been pointed out that this may interact with reproducible builds in interesting ways - I'm not sure if Cargo.lock is guaranteed to have a stable sorting order.

@Diggsey

This comment has been minimized.

Copy link
Contributor

Diggsey commented Nov 3, 2019

I don't necessarily think this should be off by default if the size increase is indeed as minimal as claimed: even in release mode some debug information is included already. I do think it should be removed when the binary is stripped though.

The RFC claims a sub 1% increase in size: I would like to see some more evidence to back that up; the "hello world" example is not enough. The number of dependencies increases the lock file size linearly, but it does not increase the size of the binary linearly due to unused code being removed at link time, so the relative size could go up. Furthermore, platform-specific dependencies are tracked in the lock file but are not compiled into the binary at all on other platforms.

Regarding reproducible builds: as long as the information in the Cargo.lock is encoded in a deterministic way, it shouldn't be a problem. If cargo picks a dependency version non-deterministically then your deterministic build is broken regardless of whether you incorporate the lock file directly.

One thing not covered by this RFC is what crate types will embed this information. Should the "cdylib" type embed the information? On Windows, DLLs are just executables with a different extension so you could reasonably argue that the same version information should be included.

@Shnatsel

This comment has been minimized.

Copy link
Author

Shnatsel commented Nov 3, 2019

The RFC claims a sub 1% increase in size: I would like to see some more evidence to back that up; the "hello world" example is not enough.

Good point, will do.

Regarding reproducible builds: as long as the information in the Cargo.lock is encoded in a deterministic way, it shouldn't be a problem.

I don't think Cargo.lock has any guarantees regarding stable sorting order for dependencies, for example.

One thing not covered by this RFC is what crate types will embed this information.

It's under "unresolved questions" since I don't have a strong stance on this, but I am leaning towards "yes" for the same reasons you've described.

@Lokathor

This comment has been minimized.

Copy link

Lokathor commented Nov 3, 2019

Bonus data point: A minimal hello world on Windows is ~3.5k, and the Cargo.lock for that is ~1.9k.

The ratio generally gets better as your program grows. The amount of code that usage of a dependency puts into your binary is usually a lot more than the line or two it takes to list the dependency.

Dependencies that hurt the ratio (adding a dependency entries but not adding code into the final binary) would likely be things like proc-macro crates, macro_rules!-only crates (eg: cfg-if), and bindings crates (eg: winapi).

That said, I still don't want this in my programs because I just don't want extra stuff going in my files.

@Shnatsel

This comment has been minimized.

Copy link
Author

Shnatsel commented Nov 3, 2019

Size comparisons on binaries of varying sizes, as measured on Linux with ThinLTO:

exa is 2.6Mb stripped, 1.5Mb unstripped, 25Kb Cargo.lock. <1% vs unstripped, <2% vs stripped
cargo-audit is 6.5Mb unstripped, 3.6Mb stripped, Cargo.lock is 41Kb. <1% vs unstripped, <2% vs stripped
cargo-geiger is 17Mb unstripped, 12Mb stripped, Cargo.lock is 95Kb. <1% even vs stripped
ripgrep is 32Mb unstripped, 4.5Mb stripped. Its Cargo.lock is 24Kb, <1% even vs stripped

We can do a Crater run to be sure.

@gmorenz

This comment has been minimized.

Copy link

gmorenz commented Nov 3, 2019

reproducible builds

It seems like we should aim to be reproducible even if the crates came from different sources. I.e. downloading something from a registry vs using a local path shouldn't change things.

Maybe we could use a format this is just a sorted sequence of (crate-name, version, hash) instead of directly embedding Cargo.toml? This would also resolve almost all of the information leak concerns.

@Shnatsel

This comment has been minimized.

Copy link
Author

Shnatsel commented Nov 3, 2019

I fear omitting the registry where a certain dependency came from would mean losing too much information. This defeats use cases such as only allowing dependencies from a private registry, for example.

@tarcieri

This comment has been minimized.

Copy link

tarcieri commented Nov 3, 2019

reproducible builds

You can't reproduce a build without Cargo.lock, so if anything, including it in the binary is a helpful forensic artifact for reproducing its build.

@Ixrec

This comment has been minimized.

Copy link
Contributor

Ixrec commented Nov 3, 2019

Since build system and package management system are usually decoupled, most other languages did not have the opportunity to implement anything like this.

For prior art and motivation completeness: Where I work at Bloomberg, our C++ codebase has enough control over such things that we rolled our own version embedding tricks many years ago. On top of the security benefits, this has been utterly invaluable for diagnosing many bugs in production, and ensuring that fixes have been fully released. The specific format it uses doesn't make much sense for an open source ecosystem in a language with a proper module/package system, but I thought it was worth mentioning this sort of thing has precedent even in C++.

@rotty

This comment has been minimized.

Copy link

rotty commented Nov 3, 2019

As suggested by @Shnatsel, I'll drop a mention of my cargo-audit-tags Python script. It allows a similar check based on the source code, checking the historical Cargo.lock files in git, based on git tags. This allows checking released artifacts based on those tags for known vulnerabilities in dependencies. At work, we have this running as part of a daily CI job, which will send an email if vulnerabilities are detected.

Of course, this does not cover the use case of the RFC, but it is a related tool that has some overlap with the goals of the RFC.

@joshtriplett

This comment has been minimized.

Copy link
Member

joshtriplett commented Nov 4, 2019

First of all: thank you for working on the problems of dependency tracking and security analysis! Please don't take any of the feedback below as suggesting that I disagree with the goals.

I agree with the various other posts giving a rationale of not wanting to disclose excessive information, especially about local paths and registries. You need to scrub all of the potentially sensitive information, such as local registries, paths, and repository URLs.

A quick check on my system turns up Cargo.lock files upwards of 100k-140k.

I'm also concerned about exposing the potentially evolving Cargo.lock format (as mentioned in your unresolved questions section); that format has changed substantially over time.

I think we could reduce that format substantially to serve this particular purpose. But at the very least, it seems worth compressing it before embedding it. A quick check suggests that basic zlib compression reduces the size of Cargo.lock files by 75-85%.

You also need to consider the size of the delta from one version of a binary to another, as many distribution mechanisms want to distribute deltas. I can easily imagine the Cargo.lock delta taking up more space than the delta to the code itself. (Note that deltas interact with the issue of compression above; tools to compute deltas need to know about use of compression.)

You should discuss the issue of format versioning. You could theoretically evolve this format by using a new section name if the version changes.

Given that this uses a completely Rust-specific (and cargo-specific) format, the section name should not use something generic like .dep-list.

A "Hello World" on x86 Linux compiles into a ~1Mb file in the best case (recent Rust without jemalloc, LTO enabled).

That's a problem that people are trying to fix. And that's with std. It's possible to build binaries less than 1kB today.

WASM, asm.js and embedded platforms are excempt from this mechanism since they have very strict code size requirements.

(Typo: s/excempt/exempt/)

People care about code size on other platforms, too, and the same target may get used for large applications and embedded development. You can't reliably classify every target into "embedded" and "not embedded".

Also, this isn't just about code size; some use cases need to carefully construct binaries and care about any "unusual" sections that get generated in the binary. Some binaries will get built with linker scripts.

You don't specify any mechanism to enable/disable this. You need to provide and specify appropriate configuration for that.

For a first pass, I would propose addressing the issues above, and then proposing a version that's initially disabled by default. This will let people experiment with it, optimize it, audit the information it includes, and so on. We can always revisit questions of defaults and their interaction with various use cases in the future.

@vincentdephily

This comment has been minimized.

Copy link

vincentdephily commented Nov 4, 2019

Regarding the "leaking sensitive information" issue, beyond the obvious "strip local paths and private urls" mitigation, it's useful to think in terms of threat model. How does embedding version info affect the balance between attacker and defender ?

  • The version info is only beneficial if the system is dilligently updated.
  • A a system that can't be or isn't regularly updated could instead argue for security by obscurity.
  • An attacker typically has more practical ways to test for a vuln than reading the binary.
  • When dealing with zero-day vulns, the version info only benefits the attacker.

For most cases I think the version info is a net gain for the defender (and therefore it should eventually be on by default once it matures), but there'll always be cases when it isn't.

It's not just about security: as a user I may want to recompile my binaries when a performance fix is out for example.

@tarcieri

This comment has been minimized.

Copy link

tarcieri commented Nov 4, 2019

When I worked on an infosec team at a large payments company, we built a system based on Ruby's Gemfile.lock, which I think is quite similar to the goals of this proposal (and also as mentioned in the proposal).

It scanned filesystems for these files, audited them against a similar vulnerability database (RubySec), opened tickets for vulnerable apps, and auto-closed them when they were updated to non-vulnerable versions.

It seems there's a lot of concerns about potential leakage of internal repositories. However for the above feature to work, we needed that information (which we also used to file internal vulnerabilities against internal packages). Far from that information being a concern, it was something we actually leveraged in an enterprise context.

For enterprises deploying in-house applications, I don't think leaking this information is any concern whatsoever.

For crates published on crates.io, it also isn't a concern as these crates, by definition, can't refer to internal repositories.

So the real information leakage concern is: publicly published binaries (e.g. through a vendor's web site or an app store) leaking this information accidentally because the publishers did not opt out of removing it. To me this is a somewhat valid concern: it's information exposure and a potential loss in defense-in-depth.

But I do just want to make clear that I think, with that acknowledged, this is also a feature that enterprise users will want for internal binaries, to allow internal auditing of deployed application binaries.

@Shnatsel

This comment has been minimized.

Copy link
Author

Shnatsel commented Nov 4, 2019

@tarcieri could you clarify whether specifying the origin of the package is required?

I am struggling to come up with use cases for it. I've considered using it to track down all binaries that depended on a registry that was found to be compromised, but crates from the compromised registry could execute arbitrary code through build.rs to alter the origin information embedded in the binary.

@Shnatsel

This comment has been minimized.

Copy link
Author

Shnatsel commented Nov 4, 2019

Also, regarding path disclosure concerns: it has been pointed out that the panic messages embedded in the binary already disclose source code location for all crates that may panic. Stripping debugging symbols is not sufficient to remove this information either.

@tarcieri

This comment has been minimized.

Copy link

tarcieri commented Nov 4, 2019

could you clarify whether specifying the origin of the package is required?

cargo audit/RustSec presently don't consult the source of Cargo.lock whatsoever, so the keys in question could be redacted as those projects work today.

It would break the usefulness of using those Cargo.lock files for reproducing binaries. An easier solution there is just to check Cargo.lock into git along with the source code, however.

The main place we used that sort of information in an enterprise context was deciding whether to audit a RubyGem listed in Gemfile.lock against the public RubySec database, or our internal vulnerability database.

cargo audit doesn't support this now (i.e. maintaining an internal version of something like RustSec/advisory-db for an internal crate repository) but I think it would be an interesting feature to add.

@Shnatsel

This comment has been minimized.

Copy link
Author

Shnatsel commented Nov 4, 2019

I imagine the private registry and the public registry have non-overlapping sets of packages, so I don't think there is any ambiguity there - you can simply check against both vulnerability databases.

@tarcieri

This comment has been minimized.

Copy link

tarcieri commented Nov 4, 2019

The information could also be sourced from Cargo.toml, particularly if you knew the git commit the binary was built from, but right now cargo audit works entirely on Cargo.lock alone.

That said, there have been requests to include Cargo.toml in the analysis for other reasons.

@notriddle

This comment has been minimized.

Copy link
Contributor

notriddle commented Nov 5, 2019

I'm not convinced that this is worthwhile.

From work I've done with enterprise PHP apps, I've never needed the information to actually be in the "binaries" (actually tarballs and Docker images). Actually mining the data from a file would require logging into the server and poking at the files, which only some of the developers are actually allowed to do, and while the server could have provided an API endpoint to get at the dependency version info, I've never actually needed that.

All I actually need is the git hash of the repository it was built from, which is both sufficient for version audits as long as you have access to the VCS, and useful in cases other than just that one. And unlike embedding all of Cargo.lock, there's basically negligible space overhead.

This reasoning (that the dependency information is pointless without access to the VCS) is even more clearly applicable to Rust, since actually doing the version update requires the source code to rebuild the binary.

@kornelski

This comment has been minimized.

Copy link
Contributor

kornelski commented Nov 5, 2019

Cargo.lock contains more information than necessary, like the whole dependency structure. I suggest trimming it down to minimum, such as just a flattened list of crates used, with only crate name and version pair per line. Or maybe even just the crate checksum, assuming that it'll be enough to match known-bad crates.

It should probably also contain Rustc/libstd version.

@Shnatsel

This comment has been minimized.

Copy link
Author

Shnatsel commented Nov 5, 2019

@kornelski sounds good to me. I will update the RFC with this proposal.

It should probably also contain Rust/libstd version.

This is already encoded today, along with LLVM version. Try strings your_executable | grep 'rustc version'

@tarcieri

This comment has been minimized.

Copy link

tarcieri commented Nov 5, 2019

@notriddle

Actually mining the data from a file would require logging into the server and poking at the files, which only some of the developers are actually allowed to do

Please see my post which describes a real-world example of a production deployment of an infosec system which uses a similar feature. Rather than "logging into the server and poking at the files", our system leveraged endpoint agents (already a requirement anywhere FIM is required for compliance).

@tarcieri

This comment has been minimized.

Copy link

tarcieri commented Nov 5, 2019

@kornelski

Cargo.lock contains more information than necessary, like the whole dependency structure. I suggest trimming it down to minimum, such as just a flattened list of crates used, with only crate name and version pair per line.

cargo-audit, which is likely the main intended consumer of this information, leverages all of that information in order to display dependency trees of how impacted crates are included in a project. Including it would be helpful in leveraging that functionality using this method.

As mentioned earlier though, it doesn't leverage source, which could potentially be redacted with no effect on cargo-audit

@rotty

This comment has been minimized.

Copy link

rotty commented Nov 6, 2019

@kornelski

Cargo.lock contains more information than necessary, like the whole dependency structure. I suggest trimming it down to minimum, such as just a flattened list of crates used, with only crate name and version pair per line.

cargo-audit, which is likely the main intended consumer of this information, leverages all of that information in order to display dependency trees of how impacted crates are included in a project. Including it would be helpful in leveraging that functionality using this method.

Is this information really that useful when auditing binaries? To remedy the situation, you need to rebuild, so you need access to the source. When you have access to the corresponding source, you also have full Cargo.lock information.

@vincentdephily

This comment has been minimized.

Copy link

vincentdephily commented Nov 6, 2019

I don't like to cop out of an important decision by turning everything into an option, but it sounds like we'll need different levels of Cargo.lock redaction for different deployment contexts ?

@tarcieri

This comment has been minimized.

Copy link

tarcieri commented Nov 6, 2019

Is this information really that useful when auditing binaries?

Yes, if someone were able to build a tool similar to the one I described here, it would allow for the full functionality of cargo-audit to be present based on endpoint agent-based collection of Cargo.lock files alone.

Contrarily, would degrading cargo-audit's capabilities by removing this information really save enough space to be beneficial in any way?

I don't like to cop out of an important decision by turning everything into an option, but it sounds like we'll need different levels of Cargo.lock redaction for different deployment contexts ?

It sounds like everyone is on board with redacting source, which seems to address most people's concerns.

@rotty

This comment has been minimized.

Copy link

rotty commented Nov 6, 2019

Is this information really that useful when auditing binaries?

Yes, if someone were able to build a tool similar to the one I described here, it would allow for the full functionality of cargo-audit to be present based on endpoint agent-based collection of Cargo.lock files alone.

Contrarily, would degrading cargo-audit's capabilities by removing this information really save enough space to be beneficial in any way?

I'm not sure about that, I just want to understand what information needs to be present to actually cover the scenario you describe; quoting from your comment referenced above:

It seems there's a lot of concerns about potential leakage of internal repositories. However for the above feature to work, we needed that information (which we also used to file internal vulnerabilities against internal packages)

If you redact just the source fields, as you agree is sensible (see quote below), wouldn't that be exactly be redacting the information you need, according to the above quote? If not, what's missing from the condensed format proposed by @kornelski (i.e. just include a flattened out list of dependencies, including their versions)? Quoting @Shnatsel:

I imagine the private registry and the public registry have non-overlapping sets of packages, so I don't think there is any ambiguity there - you can simply check against both vulnerability databases.

So I just don't understand why the minimalistic format is not sufficient to be able to implement a tool as you described. Note that I'm not arguing for going with the minimalistic format, I just want to understand its limitations beyond being inconvenient, as it cannot be fed into cargo audit.

I don't like to cop out of an important decision by turning everything into an option, but it sounds like we'll need different levels of Cargo.lock redaction for different deployment contexts ?

It sounds like everyone is on board with redacting source, which seems to address most people's concerns.

I assume that would also include redacting the crate URLs included in the entries in the dependencies, for example, turning

[[package]]
name = "arrayvec"
version = "0.4.12"
source = "registry+https://github.com/rust-lang/crates.io-index"
dependencies = [
 "nodrop 0.1.14 (registry+https://github.com/rust-lang/crates.io-index)",
]

into

[[package]]
name = "arrayvec"
version = "0.4.12"
dependencies = [
 "nodrop 0.1.14",
]

Is that assumption correct?

@Lokathor

This comment has been minimized.

Copy link

Lokathor commented Nov 6, 2019

in this case, the registry of a crate should be part of its "full name", similar to how the version is as well.

crates.io/crates/arrayvec-0.1.14 or something like that.

@tarcieri

This comment has been minimized.

Copy link

tarcieri commented Nov 6, 2019

@rotty

I'm not sure about that, I just want to understand what information needs to be present to actually cover the scenario you describe

There seems to be quite a bit of confusion in this post, some of which I'm responsible for, mea culpa.

Here is what cargo-audit presently uses in Cargo.lock:

[[package]]
name = "..."
version = "..."
dependencies = ["..."]

It does not presently use:

[[package]]
source = "..."

[metadata]
"..." = "..."

re: source

Wouldn't that be exactly be redacting the information you need, according to the above quote?

If you read the response from @Shnatsel here and my follow-up here, I changed my mind on that.

To further expound on that point: so long as the vulnerability databases contain packages with non-overlapping names, we can load both into cargo-audit's internal vulnerability database and it would not need to consult source.

However, separately from that, it has features for analyzing the dependency relationships between packages in order to display information about how those packages are included in projects.

I think it would very much be nice if that capability weren't lost and cargo-audit needed to function in a degraded mode when sourcing Cargo.lock files from binaries.

I also think the information in dependencies should compress very well w\ e.g. gzip, since it's repetitions of the same names (and sometimes version numbers/source info i.e. registry+https://github.com/rust-lang/crates.io-index, at least in the v1 format)

I think the space savings of redacting this information, especially post-compression, would be negligible.

I think the functionality lost from redacting this information is not negligible, and something we plan on leveraging more of in the future.

Of all of the things to redact, [metadata] seems like it'd be the biggest win for space-savings, since the SHA-256 hashes are uniformly random and therefore incompressible.

I don't imagine there will be much space savings on package.dependencies or package.source since these elements are highly repetitive. However, I'd still be in favor of redacting source due to the info leakage concerns.

@Shnatsel

This comment has been minimized.

Copy link
Author

Shnatsel commented Nov 6, 2019

It appears that the minimalistic format of (name, version, hash) serves the cargo-audit use case okay. So far no compelling use cases have been demonstrated for leveraging the crate origin information, so it will be redacted for privacy concerns.

Dependency tree information is not strictly required, but it does provide actionable insights in some cases. For example, clap currently depends on a version of yaml that is vulnerable to a DoS attack, but it only ever reads it from a trusted source and only at compile time; it may be important to distinguish between yaml included through clap (in which case no action is required) and yaml used directly (in which case the binary is probably vulnerable).

I will update the RFC to include this as soon as I have the time.

@tarcieri

This comment has been minimized.

Copy link

tarcieri commented Nov 6, 2019

It appears that the minimalistic format of (name, version, hash) serves the cargo-audit use case okay.

This comes at the burden of cargo-audit having two input formats for data, which is something I'd really like to avoid unless there is a very compelling case for it.

I think if we redact the elements I just mentioned, after compression the space savings will be negligible.

@Shnatsel

This comment has been minimized.

Copy link
Author

Shnatsel commented Nov 6, 2019

I'm afraid having two different formats is unavoidable because there are no stability guarantees for Cargo.lock format.

Also, this metadata needs to be easy to consume by readily available tooling, and TOML is not nearly as widespread as e.g. JSON. I'd go for a trivial, safe-to-parse subset of JSON - something along the lines of [{name: "crate1",version:"0.1.2",git:"deadb33f",sha256:"123456789098765432123456789"}] where the strings cannot contain the " character, each crate data section is limited to 1Mb and the entire thing to 64Mb. Possibly nesting to incorporate dependency chains, although that's tricky to make DoS-resilient. This would allow easily implementing both a naive parser in any language (just deserialize it as JSON) and a DoS-resilient parser that doesn't know about the complexity of JSON at all and just reads until the next marker or until read limit is reached.

@tarcieri

This comment has been minimized.

Copy link

tarcieri commented Nov 6, 2019

I'm afraid having two different formats is unavoidable because there are no stability guarantees for Cargo.lock format.

I'm already supporting V1 and V2 of the Cargo.lock format. So this would be a third format...

@Shnatsel

This comment has been minimized.

Copy link
Author

Shnatsel commented Nov 6, 2019

Sadly, yes. But at least the entire rest of the tooling that uses this would not have to add v3, v4 and v5. This is especially pertinent due to the fact that this metadata could be an exploitation vector when scanning third-party binaries.

@tarcieri

This comment has been minimized.

Copy link

tarcieri commented Nov 6, 2019

I'd suggest considering the Cargo.lock V2 format, with source and checksum redacted. It already moves the source information out-of-band of package.dependencies, making the dependencies list nicely minimal (it only includes the package name by default, and adds version information only in the case that it's necessary to disambiguate).

Re: JSON vs TOML, so long as the JSON is isomorphic to the TOML I really don't care. serde makes that easy.

Re:

the entire rest of the tooling that uses this would not have to add v3, v4 and v5.

That argument feels a little bit https://xkcd.com/927/ to me. We could always transcode to the Cargo.lock V2 format. In fact the cargo-lock crate used by cargo-audit has already been used to implement a V2 to V1 Cargo.lock converter.

@kornelski

This comment has been minimized.

Copy link
Contributor

kornelski commented Nov 6, 2019

A flat list could be transformed back into a tree by performing dependency resolution using the crates-io index. I think in practice it should be reliable and straightforward, given that Cargo has a rule of one version of crate per major version, and that considerably limits options, so even a naive greedy algorithm should do (it's a simpler problem than Cargo's own dependency resolution against version ranges).

@joshtriplett

This comment has been minimized.

Copy link
Member

joshtriplett commented Nov 6, 2019

@Shnatsel

Also, regarding path disclosure concerns: it has been pointed out that the panic messages embedded in the binary already disclose source code location for all crates that may panic. Stripping debugging symbols is not sufficient to remove this information either.

I'd consider that a bug, and it shouldn't serve as precedent for embedding more such information.

@tarcieri

This comment has been minimized.

Copy link

tarcieri commented Nov 6, 2019

@kornelski

A flat list could be transformed back into a tree by performing dependency resolution using the crates-io index

Wouldn't that depend on knowing which cargo features are activated for each of the crates?

While there are ways to potentially recover this information, it also doesn't seem like that much overhead, especially after compression, to include it. I'd be happy to do some real-world measurements.

@Shnatsel

This comment has been minimized.

Copy link
Author

Shnatsel commented Nov 6, 2019

Regarding configuration: are there any use cases on toggling this on a per-profile basis? For example, enabling it for debug but not for release, or vice versa? Assuming the crate origin info is redacted, I cannot think of any, so I'm inclined to have a global way to toggle this.

@tarcieri

This comment has been minimized.

Copy link

tarcieri commented Nov 6, 2019

Seems like something that isn't profile-dependent to me. I'd want it on in the release profile, and for the other profiles I generally wouldn't care whether it's on or off.

@gmorenz

This comment has been minimized.

Copy link

gmorenz commented Nov 9, 2019

Is there a reason not to make it configurable per profile? I can't think of any great reasons but I can think of some ok reasons. Generally speaking that's where I would expect to find settings that effect how the binary is built.

Ok reasons: Saving the few extra bytes in release builds. Not sending along information that makes it easier to spot vulnerabilities in release binaries that you know you will never be able to update.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
You can’t perform that action at this time.