
RFC: Out-of-tree test suite #3557

Conversation

@kirtchev-adacore commented Jan 10, 2024

This PR proposes the creation of an external stand-alone test suite and associated tools for verifying the Rust toolchain that also fits into existing Rust development workflows.

Rendered

@ehuss ehuss added the T-compiler Relevant to the compiler team, which will review and decide on the RFC. label Jan 10, 2024
@tgross35 (Contributor)

I think that having some form of standalone end-to-end testing is a great idea. Also nice for other implementations, e.g. gccrs seems to be interested: https://gcc-rust.zulipchat.com/#narrow/stream/266897-general/topic/RFC.3A.20Out-of-tree.20test.20suite.

A few questions/notes:

  1. What kind of tests should live in this suite? Codegen tests make sense but UI tests are trickier, since the stderr may vary by implementation. Maybe the runner could accept a --ui-output-dir that can override files of the same name.
  2. Assuming there will still be a tests directory in rust-lang/rust that isn't part of this submodule, how should a user decide where to place tests?
  3. When would this test suite get run in CI?

A downside of tests in a submodule is that the workflow for adding tests with a feature gets less ergonomic: two places to commit and potentially get out of sync. Perhaps the main test suite source could still live in rust-lang/rust, but certain directories get automatically synced to a rust-lang/rust-test-suite repository that has the OOT runner and additional tests.

llvm-test-suite provides a good reference.

@Nilstrieb (Member) left a comment

I think such a big RFC is the wrong way to tackle this. As far as I understand, the problem being solved here is that you want to be able to test the distributed toolchains.

I think there are simpler ways to solve this problem, for example by just shipping the current testsuite as-is with some changes to tooling that would allow testing distributed toolchains.

Instead of bikeshedding about this RFC, I think it would be more productive if you came to Zulip t-compiler with a precise problem description and we could then brainstorm a nice solution from scratch, considering all the potential downsides.


Users and automation would install the OOTTS and invoke its test driver to verify tools and libraries of their choosing.

Rust developers would use the OOTTS in-tree as a git submodule of `rust-lang/rust`. The existing Rust infrastructure would delegate the building, testing, and packaging of the OOTTS to the OOTTS infrastructure and test driver.
A Member left a comment:

It needs to be a subtree, not a submodule, so that people can add and modify tests in their Rust PRs.

@workingjubilee (Contributor) left a comment

By your description of the problem, I am slightly worried that in general you have a different idea of how the build process goes down than what happens in actuality, both for toolchains and actual executions of rustc involving cargo.

text/3557-out-of-tree-test-suite.md (outdated)

Industrial use of language toolchains requires assurance that the toolchain that is distributed has been rigorously tested and is guaranteed - even warranted - to work correctly. To achieve this level of assurance, the toolchain is built and packaged for distribution, installed on a completely clean machine, and then tested. This separation of concerns, which mimics the way a user will interact with the toolchain, ensures that the artifacts distributed are the artifacts tested.

Currently, the Rust infrastructure does not support this behavior. Instead, testing validates intermediate states of the Rust toolchain as they exist in the build tree. An intermediate state may contain leftover artifacts from a previous build that could influence the validation. The Rust infrastructure may be directed to produce a malformed or incomplete tool (e.g., `rustc` with a missing `libstd`), and yet report that the tool has successfully passed validation.
A Contributor left a comment:

The "intermediate states" you describe often have more integrity than the actual release binary: we have had several bugs from combining, e.g., BOLT, strip (yes, binutils strip!), and profile-guided optimization. I believe we added a final test run at some point to catch bugs in these multiply-optimized binaries (@Kobzol would know better, perhaps?), but it has generally not been the case that the final release binary deserves enormously more confidence than just ./x.py clean && ./x.py build --stage 2 (with the dist configuration set in config.toml, I suppose? My understanding is that the argument in favor of release binaries has been their better UX, not their trustworthiness).

Of course, that only adds to the importance of making running the tests with it easy, but still, it feels like a misframing.

The Author replied:

That confidence, however, relies heavily on the correctness of post-test processing. If there is no post-install verification, then there is no way to detect a post-test bug in the infrastructure, including pilot errors (e.g., forgetting to add rust-std to a ./x dist call).

@Nilstrieb (Member) commented Jan 16, 2024

The problem of "obvious" post-install errors is easily solved with a few dozen lines of bash running all tools on hello world. For less obvious post-install errors, like a miscompilation, running the normal test suite on the in-tree code (like the current "post-optimization tests") is already done today.
Basically, the thing this OOTTS would be testing that the current test suite isn't testing is ./x dist. It would be nice to have more tests for it, but writing a test suite for that does not involve moving all UI tests etc.; it instead means creating something (a lot smaller) specifically for that, which just does some basic tests. We're not concerned about the compiler subtly breaking from the tar -c && tar -x; we're more concerned about components missing altogether or being completely wrong in a way that's detected with a simple test.

A Member left a comment:

Such a small "install sanity check" test suite can still be in-tree just fine, optionally shipped as a component if you want. That wouldn't require nearly as much work as this RFC and could probably just be a compiler MCP (after talking to the compiler team about it).

The Author replied:

I checked with our Certification team, and a workflow of build -> test -> package -> install -> sanity check is OK from the point of view of tool qualification.

Given that a dedicated sanity-check test suite is considerably smaller in scope and effort, this RFC should probably be retired.

@Nilstrieb how should we proceed at this point?

@workingjubilee (Contributor) commented Jan 12, 2024

Because of the issues I mentioned re: "rustc is a cross compiler and host and target remain build parameters", I am slightly concerned that a test driver might have to be compiled, for each host, for each target, in a combinatorially explosive fashion, as is the case (as far as I understand it) with libstd's artifacts? I suppose if we could get it down to one per target, that might be nice, but then we can't use rustup to distribute this "run the test suite" artifact and actually support all the targets that can run tests, because some targets don't support host toolchains, thus rustup on them is... nonsensical, if it even works... but do support running rather a lot of the tests, which have to be compiled on some host and then run on the target.

I realize "run tests for a random target that can't host the toolchain but would support most of the test suite via the remote-test-client and remote-test-server" isn't exactly trivial, so improving the experience of doing that is certainly something I can get behind. Unless this proposal isn't actually concerned about such targets except as a "future possibility", and is only truly worried about hosts? In which case, it should probably say that, because there are more bins than "we ship a host Rust toolchain" and "this target doesn't support std". People are happy to accept "this isn't implemented for this target" and are disappointed to see "this was supposed to work for every target! ...but actually doesn't".

@oli-obk (Contributor) commented Jan 12, 2024

I think it would be more productive if you came to Zulip t-compiler with a precise problem description and we could then brainstorm a nice solution from scratch, considering all the potential downsides.

my personal recommendation would be to make this RFC a living document e.g. on hackmd and then discuss this on zulip to figure out the details and potential smaller uncontroversial steps that you can implement immediately.

@kirtchev-adacore (Author)

  1. What kind of tests should live in this suite? Codegen tests make sense but UI tests are trickier, since the stderr may vary by implementation. Maybe the runner could accept a --ui-output-dir that can override files of the same name.

Ideally, that would be all tests, including codegen, UI, cargo, clippy, etc.

For UI, I see several possibilities:

  • If the testable is rustc, then the driver uses the existing .stderr files.
  • If the testable is some other conforming tool, then the driver accepts a --ui-output-dir path with a hierarchical structure similar to the UI suite's, containing custom .stderr files tailored to the tool. When checking for the presence of an error, the driver could ignore everything after an ERROR annotation and check for a corresponding .stderr with an error diagnostic at that location.
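A rough sketch of that override lookup, in Python for illustration (the `--ui-output-dir` layout mirroring the UI suite and the `resolve_expected_stderr` helper are assumptions for this sketch, not existing compiletest behavior):

```python
from pathlib import Path
from typing import Optional

def resolve_expected_stderr(test_path: Path, ui_root: Path,
                            override_dir: Optional[Path] = None) -> Optional[Path]:
    """Locate the expected .stderr file for a UI test.

    A file of the same relative name under override_dir (the hypothetical
    --ui-output-dir) takes precedence over the stock .stderr, so a
    non-rustc tool can supply output tailored to its own diagnostics.
    """
    rel = test_path.relative_to(ui_root).with_suffix(".stderr")
    if override_dir is not None:
        candidate = override_dir / rel
        if candidate.exists():
            return candidate
    default = ui_root / rel
    return default if default.exists() else None
```

Because the override directory mirrors the suite's hierarchy, only the files that actually differ per tool need to be duplicated.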
  2. Assuming there will still be a tests directory in rust-lang/rust that isn't part of this submodule, how should a user decide where to place tests?

Given that the OOTTS is supposed to be a complete testing replacement, there will no longer be a tests directory in rust-lang/rust.

  3. When would this test suite get run in CI?

The OOTTS would run after the toolchain is built, packaged, and installed in a clean environment, like so:

  1. Apply PR (if there is one)
  2. ./x build ...
  3. ./x dist ...
  4. ./x install ... in a clean environment
  5. Run the OOTTS.

@kirtchev-adacore (Author)

I think such a big RFC is the wrong way to tackle this. As far as I understand, the problem being solved here is that you want to be able to test the distributed toolchains.

The problem being solved is to ensure that the post-test processing currently performed does not alter the validity of a toolchain. This basically amounts to testing an off-the-shelf toolchain.

I think there are simpler ways to solve this problem, for example by just shipping the current testsuite as-is with some changes to tooling that would allow testing distributed toolchains.

If there are simpler solutions, then that would be great!

Instead of bikeshedding about this RFC, I think it would be more productive if you came to Zulip t-compiler with a precise problem description and we could then brainstorm a nice solution from scratch, considering all the potential downsides.

I will follow up on this.

@kirtchev-adacore (Author)

By your description of the problem, I am slightly worried that in general you have a different idea of how the build process goes down than what happens in actuality, both for toolchains and actual executions of rustc involving cargo.

My understanding, which is derived from poking around sources and looking at very verbose logs, and which could be very wrong, is that ultimately the testing portion of the infrastructure constructs a call to cargo, passing various details such as which rustc to use. The call itself could be nested in higher-level tools such as compiletest.

Is this close to reality?

Note that even if the "right" rustc is being called and so forth, post-test processing can still alter the state of a toolchain, thus invalidating the testing that was performed.

@kirtchev-adacore (Author)

Because of the issues I mentioned re: "rustc is a cross compiler and host and target remain build parameters", I am slightly concerned that a test driver might have to be compiled, for each host, for each target, in a combinatorially explosive fashion, as is the case (as far as I understand it) with libstd's artifacts?

Unless I am misunderstanding, I do not think that this will be the case. The test driver would be compiled for each host. Since a testable cross Rust toolchain comes with its own cargo, rustc, host libstd, and target libstd, the test driver's job is to execute a compile command which indicates the host and target. cargo or rustc will then link the correct libstd and produce a target executable.
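The point can be illustrated with a sketch (in Python; `compile_command` is a hypothetical driver helper, not an existing tool): the driver is built once per host, and the target enters only as a runtime argument that the toolchain's own cargo resolves against its bundled libstd.

```python
from typing import List, Optional

def compile_command(manifest_path: str,
                    target: Optional[str] = None) -> List[str]:
    """Build the cargo invocation for one test.

    The installed toolchain's cargo/rustc select the matching host and
    target libstd, so the driver binary itself only varies per host,
    not per (host, target) pair.
    """
    cmd = ["cargo", "build", "--manifest-path", manifest_path]
    if target is not None:
        # Cross compilation: the target is a runtime parameter of the
        # invocation, not a build-time parameter of the driver.
        cmd += ["--target", target]
    return cmd
```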

I realize "run tests for a random target that can't host the toolchain but would support most of the test suite via the remote-test-client and remote-test-server" isn't exactly trivial, so improving the experience of doing that is certainly something I can get behind. Unless this proposal isn't actually concerned about such targets except as a "future possibility", and is only truly worried about hosts? In which case, it should probably say that, because there are more bins than "we ship a host Rust toolchain" and "this target doesn't support std". People are happy to accept "this isn't implemented for this target" and are disappointed to see "this was supposed to work for every target! ...but actually doesn't".

The proposal does take into account native and cross compilation, though it does not mention how to perform execution on the target (possibly via remote-test-client/-server).

@oli-obk (Contributor) commented Jan 17, 2024

Just to be sure: are you planning to do all the work that needs to be done for whatever solution we end up accepting?

@the8472 (Member) commented Jan 17, 2024

I think such a big RFC is the wrong way to tackle this. As far as I understand, the problem being solved here is that you want to be able to test the distributed toolchains.

I think there are simpler ways to solve this problem, for example by just shipping the current testsuite as-is with some changes to tooling that would allow testing distributed toolchains.

We have had a few post-PGO/BOLT miscompilations which tests don't catch, since those optimizations only happen on the dist runners but not on the test ones. So it would also be useful for the rust repo itself if we could run the test suite against dist artifacts too.

@Nilstrieb (Member)

This has been fixed, we now run tests on post PGO artifacts. The optimized artifacts are tested before they are shipped.

@kirtchev-adacore (Author) commented Jan 18, 2024

Just to be sure: are you planning to do all the work that needs to be done for whatever solution we end up accepting?

@oli-obk No, AdaCore will not be doing all the work, as we were hoping that this effort would be community-led.

@kirtchev-adacore (Author) commented Jan 18, 2024

This has been fixed, we now run tests on post PGO artifacts. The optimized artifacts are tested before they are shipped.

@Nilstrieb Just to make sure I understand, the sanity checks you proposed are now performed?

@m-ou-se (Member) commented Jan 18, 2024

No, AdaCore will not be doing all the work, as we were hoping that this effort would be community-led.

What does that mean?

To me, "community-led" means that someone from the community, like yourself or AdaCore, is doing the work.

@oli-obk (Contributor) commented Jan 18, 2024

No, AdaCore will not be doing all the work, as we were hoping that this effort would be community-led.

While I do like the motivation, I don't think it is very high priority for volunteer contributors. To state this more controversially: Who other than AdaCore benefits from this work?

I would recommend trying to find an existing contributor that you can contract to do this work. I do fear otherwise this RFC may, even if accepted, become nothing more than that. We've had various RFCs and MCPs before that were well motivated, but ended up lacking a dedicated implementor to actually do the work.

@kirtchev-adacore (Author)

Who other than AdaCore benefits from this work?

@oli-obk I suspect that Ferrous Systems (hello @skade, @pietroalbini !) would benefit from this work, as Ferrous Systems has and will continue to certify the Ferrocene toolchain for ISO 26262, and I assume they would care about the "Testing the artifacts to be distributed" and "Certification" motivations.

I would recommend trying to find an existing contributor that you can contract to do this work. I do fear otherwise this RFC may, even if accepted, become nothing more than that. We've had various RFCs and MCPs before that were well motivated, but ended up lacking a dedicated implementor to actually do the work.

Thanks for the suggestion, I will pass this along.

@kirtchev-adacore (Author)

@Nilstrieb Out of curiosity, what kind of test would detect an improper packaging of a tool or a library?

@kirtchev-adacore (Author)

@m-ou-se @Nilstrieb @oli-obk @the8472 Assuming that AdaCore does all the implementation work on the RFC, what would it take to get the RFC accepted?

@Nilstrieb (Member)

@Nilstrieb Out of curiosity, what kind of test would detect an improper packaging of a tool or a library?

Compile hello world after extracting the tarball or something like that.
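Such a post-extraction smoke test might look roughly like this (a sketch, in Python for illustration; `install_dir`, the tool list, and the helper names are assumptions, not an existing rust-lang script):

```python
import subprocess
import tempfile
from pathlib import Path
from typing import List

def sanity_checks(install_dir: Path) -> List[List[str]]:
    """Version checks whose success shows the main components exist."""
    bin_dir = install_dir / "bin"
    return [[str(bin_dir / tool), "--version"]
            for tool in ("rustc", "cargo", "rustdoc")]

def smoke_test(install_dir: Path) -> None:
    """Compile and run hello world with the freshly installed toolchain,
    failing loudly if a component is missing or completely broken."""
    for cmd in sanity_checks(install_dir):
        subprocess.run(cmd, check=True)
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "hello.rs"
        src.write_text('fn main() { println!("hello"); }\n')
        exe = Path(tmp) / "hello"
        rustc = install_dir / "bin" / "rustc"
        subprocess.run([str(rustc), str(src), "-o", str(exe)], check=True)
        out = subprocess.run([str(exe)], capture_output=True, text=True,
                             check=True)
        assert out.stdout.strip() == "hello"
```

This is the "few dozen lines" scale of check discussed above: it would catch a missing rust-std or a broken rustc wrapper, not subtle miscompilations.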

@Nilstrieb (Member)

@m-ou-se @Nilstrieb @oli-obk @the8472 Assuming that AdaCore does all the implementation work on the RFC, what would it take to get the RFC accepted?

We'd want the best solution to your problems, which from what I've gathered so far, is not this RFC. If you list all the specific problems you're having, we may be able to come up with something better for each.

@kirtchev-adacore (Author)

@Nilstrieb To understand the problem we are trying to solve with this RFC, we will need to switch our viewpoint to the safety-critical software paradigm. In this world, human lives depend on the correct operation of software. In a worst-case scenario, a software error could lead to falling airplanes, exploding nuclear plants, etc.

To prevent this from happening, compilers and their libraries and runtimes are certified against a software safety standard, such as DO-178C. The certification aims to build a high degree of confidence that the compiler will not generate wrong code under various conditions. To prove this, the compiler and the generated code are subjected to rigorous testing, where each test is linked back to requirements. Since nothing is without fault, known compiler issues are described and supplemented with detection, workaround, and mitigation strategies.

One way to achieve a high degree of testing confidence is to test the delivered version of the compiler. This method ensures that the compiler being certified is the exact same compiler used to compile safety-critical software. Our RFC is essentially trying to modify the existing testing process to achieve this methodology while maintaining developer friendliness.

@kirtchev-adacore (Author)

Given the complexity and limited value to the community at large, I am retracting this RFC.
AdaCore plans to employ a sanity-check test suite, consisting of selected relocated cargo, clippy, core, and std tests along with new tests, that exercises the toolchain after installation.
Thank you for your time, consideration, and reviews!
