Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Eliminate unnecessary uses of functions that may panic #447

Open
briansmith opened this issue Jan 30, 2021 · 34 comments
Open

Eliminate unnecessary uses of functions that may panic #447

briansmith opened this issue Jan 30, 2021 · 34 comments

Comments

@briansmith
Copy link
Contributor

In the past, we've seen many issues filed, and many PRs merged, attempting to minimize the potential for panics. However, there has been no concerted effort to do so before now. We should make a concerted effort to make Rustls panic-free, tracked by this issue.

We should prefer to refactor the code so that, when practical (and sometimes maybe even when not so practical) the use of functions that panic are not used. This includes panic!(), Option::unwrap(), Result::unwrap(), etc.

In the past, we've seen that we were often able to rewrite code so that no failure was possible in the new version. I.e. the old code used some function that returned a Result() indicating it might fail, but the new code uses only functions that always succeed. This is what we should prefer to do when we can.

In some cases, we are certain that a panic cannot occur, but we must use a packing function like Result::unwrap() because the underlying APIs and/or the Rust type system provide us no alternative. In these cases we should try to encapsulate the thinking we're using to decide that the panic is truly impossible into a reusable and clearly-documented function and/or type, and use that type to eliminate "bare" uses of unwrap(). We should prefer doing this over bubbling truly impossible errors up to the caller.

When we must use a function that truly is fallible, we must avoid using unwrap() and similar on the Result (or Option or whatever) it returns. If practical, we should handle the failure in some way that allows us to keep making forward progress. Otherwise, we must bubble the error up to the caller so that the application can decide what to do with it.

We should have some static analysis in CI that gates the introduction of potentially-panicking code, especially hard-to-see ones like array indexing using the [] operator.

In some cases, a combination of the above probably needs to be done. For example, we should refactor the code so that Mutexes are held only for the bare minimum number of operations, and we must ensure that none of those operations panic. We should create an abstraction for taking a mutex that we're certain will never be poisoned into a reusable API, and replace all uses of unwrap() in mutex acquisition with that abstraction. Such an abstraction should have a clear name that allows us to easily find uses of it, so that people can regularly find and audit them, asynchronously and independently from the development of the project.

In some cases, we should improve the underlying libraries. FWIW, I am open and eager to improving ring and webpki to help Rustls achieve certainty that it is statically panic-free using the Rust type system.

@briansmith
Copy link
Contributor Author

There are a lot of uses of unwrap() and the like in the code base now. However, many of them are low priority because they are in tests and/or examples. Within the Rustls library proper, many of them are mutex acquisition so they can be eliminated right away. We should categorize the rest and suggest a rough plan of action so that we can potentially parallelize the effort of fixing them among many individual contributors. IME these kinds of fixes are often fixed right away once guidance for the fix is provided in the issue tracker.

@briansmith
Copy link
Contributor Author

briansmith commented Feb 2, 2021

This depends on issue #453, which is about replacing the x.is_none() + x.unwrap() pattern with more idiomatic and statically-panic-free alternatives.

@briansmith
Copy link
Contributor Author

PR #454 demonstrates a generalization of the is_none() + unwrap() pattern, where some check other than an is_none() has an equivalent effect to is_none() ahead of the unwraps().

@briansmith
Copy link
Contributor Author

I'm going to step away from this issue to give room for somebody else to contribute, as this is one of the easiest ways to get started learning the Rustls codebase and improving it. If nobody takes it over in the next week or so then I'll work the rest of this into my schedule over February.

@briansmith
Copy link
Contributor Author

PR #462 is a refactoring that eliminates some use of the slice[] indexing operator where it wasn't obvious whether the slicing would succeed. It changes the public API of Rustls. We should prioritize other fixes that would benefit from improving the public API so that the API can re-stabilize quickly.

@briansmith
Copy link
Contributor Author

briansmith commented Feb 3, 2021

Issue #461 is about refactoring some of the state held during the handshake. Such a refactoring will be needed to complete this project, as the reason we need so many unwrap()s is because the handshake state machine isn't strongly typed enough. Besides the specific refactoring in #461, additional refactorings would be needed. Examples:

  1. The function SessionCommon::get_suite_assert does not need to exist at all, AFAICT, but it is used ubiquitously. Most likely SessionCommon needs to be refactored down to something smaller in a similar way that PR Factor out HandshakeDetails into the explicit state machine #461 proposes HandshakeDetails be refactored away.

  2. Even after PR Reduce unwrap()ing in HandshakeHash. #451, HandshakeHash::rollup_for_hrr can only be safely called in certain places in the handshake, but this isn't enforced by the type system. So, we have self.ctx.take().unwrap() in there that cannot be removed without refactoring HandshakeHash.

@djc
Copy link
Member

djc commented Feb 3, 2021

I'm midway in a follow-up that splits up HandshakeHash (working names TranscriptBuffer and TranscriptContext).

@briansmith
Copy link
Contributor Author

Here's my assessment of this project so far:

We have removed a lot of unwraps and other potential panic points. However, having removed so many, we can now see that there a lot more unwrap()s or equivalent (e.g. slice indexing) than we expected. We have a lot of things going on that deserve urgency and I think a single "remove every panic" project is counterproductive to getting important things done with proper priority. So I'd like to classify panics into categories that admit a natural prioritization:

  1. Panics that can be triggered by an attacker. E.g. panics that occur during the handshake before it has been fully authenticated. These should be the highest priority.
  2. Panics that can only happen when a local failure occurs, that can't be induced by an external party. Panicking when the system PRNG fails is an example of this.
  3. Panics that can only happen if Rustls is misconfigured. Example: Enabling TLS 1.2 for a QUIC connection.
  4. Panics that almost definitely cannot actually happen, but where refactoring the code could make this statically (using the type system) a non-issue.

I feel like our efforts on improving the code for the fourth class of panics, the least urgent "impossible" ones, is taking too much time from the first two classes, which are urgent. The main benefit to fixing the fourth class is that it makes it easier to grep for "unwrap" and "[" with less noise, so it is easier to find the more important classes of issues. I think, though, we're now probably better off just sorting through the noise and explicitly prioritizing the remaining ones we can find.

@briansmith
Copy link
Contributor Author

I started triaging all the remaining uses of unwrap() specifically. I filed a few PRs and a few issues. I also have started maintaining a spreadsheet to keep track of the triage; I have shared it with @djc and @ctz.

@djc
Copy link
Member

djc commented Apr 16, 2021

I think your categories are very useful, thanks for writing that up. I've already started triaging some of the remaining panics from your sheet and will continue with that work.

I do think the changes that have been made don't only focus on fixing panics, but also arguably make the code easier to navigate and reason about. As such, while the benefits may not be immediate, I do think the work done is valuable.

For panics in category 3+, we should eventually annotate the invariants/causes in the source code rather than in a separate GSheet, so that we can keep track of the invariants more easily when assessing future changes.

@briansmith
Copy link
Contributor Author

In the spreadsheet, we still have ~20 untriaged unwrap()s. It would be great if @ctz could look at the ones that we haven't been able to prioritize and assign a priority 1-4 based on the criteria in my comment above.

@briansmith
Copy link
Contributor Author

I think your categories are very useful, thanks for writing that up. I've already started triaging some of the remaining panics from your sheet and will continue with that work.

Thanks!

I do think the changes that have been made don't only focus on fixing panics, but also arguably make the code easier to navigate and reason about. As such, while the benefits may not be immediate, I do think the work done is valuable.

I agree. I am not judging the importance of the work. The refactoring work is very valuable. The Rustls codebase is in much better shape already, and I'm excited about the improvements that continue. I'm just trying to get us to reschedule the more critical work ahead of the continued cleanup efforts.

For panics in category 3+, we should eventually annotate the invariants/causes in the source code rather than in a separate GSheet, so that we can keep track of the invariants more easily when assessing future changes.

I agree that once the most urgent work is done, any array indexing and/or unwrap() or other panic-inducing code should be annotated in some way to indicate why we think the panic is nothing to worry about.

I also created a priority #0 in the spreadsheet to mark changes that seem to require public API changes.

@briansmith
Copy link
Contributor Author

I also created a priority #0 in the spreadsheet to mark changes that seem to require public API changes.

Specifically, I'm hoping we can do something like this:

  1. Triage the remaining known panic locations to give each a priority 1-4, where fixes for items that would be priority 1 or 2 that require public API changes should bump the priority to 0.
  2. Fix all the priority 0 issues.
  3. Do a release.
  4. Continue the work on priority Fix typo in README. #1 and couldn't compile #2 issues, then Client authentication #3 and Make ClientConfig Sync #4, and keep making (minor version) releases on a weekly or biweekly or at least monthly basis.

@briansmith
Copy link
Contributor Author

Based on what's left untriaged in the spreadsheet, these are the unwraps that we're still unsure about:

rustls/src/client/tls12.rs:905:    .take_received_plaintext(m.take_opaque_payload().unwrap());
rustls/src/client/tls13.rs:232:    early_key_schedule.unwrap().into_handshake(&shared.shared_secret)
rustls/src/client/tls13.rs:1123:   conn.common.quic.params.as_ref().unwrap(),
rustls/src/client/tls13.rs:1193:   .take_received_plaintext(m.take_opaque_payload().unwrap());

rustls/src/msgs/handshake.rs:1030: let last_extension = self.extensions.last_mut().unwrap();
rustls/src/msgs/handshake.rs:2301: let offer = ch.get_psk().unwrap();

rustls/src/msgs/hsjoiner.rs:57:    let payload = msg.take_opaque_payload().unwrap();

rustls/src/msgs/message.rs:187:    self.into_opaque().take_opaque_payload().unwrap()

rustls/src/msgs/mod.rs:43:         let mut m = Message::read(&mut r).unwrap();

rustls/src/server/tls12.rs:879:    .take_received_plaintext(m.take_opaque_payload().unwrap());
rustls/src/server/tls13.rs:1046:   .take_received_plaintext(m.take_opaque_payload().unwrap());

I used rg --no-heading -n --sort path "unwrap[^_]" rustls/src | grep -v "_test.rs:" to find unwrap()s. In the future I hope to use semgrep and/or Clippy rules to be more precise about this.

@briansmith
Copy link
Contributor Author

briansmith commented Apr 20, 2021

@repi gave me a list of clippy lints that are useful for this project:

  • clippy::expect_used
  • clippy::unwrap_used
  • clippy::ok_expect
  • clippy::integer_division
  • clippy::index_slicing

Especially clippy::index_slicing seems useful since slicing is hard to find with grep.

It seems like if we deny these in lib.rs and then allow then in test modules, that would make keeping track of each occurance much easuer. The #[alllow(clippy::XXXX)]s would be the place where we document the reasoning for why the panic won't be hit.

@RazrFalcon
Copy link

RazrFalcon commented Apr 20, 2021

@briansmith Hi. There is also https://github.com/philipc/findpanics It's Linux-only (ELF to be more precise) and works only on binaries, but can also be very useful.

I'm also interested in some kind of static verification, but there are no good, user-friendly tools yet. And Rust's std has a lot of panics. I mean, even Iterator::enumerate can panic. So there will be a lot of false-positives.

@djc
Copy link
Member

djc commented Apr 21, 2021

I have a plan to deal with the take_opaque_payload() and related issues, see #667.

@briansmith
Copy link
Contributor Author

briansmith commented Apr 21, 2021

I have a plan to deal with the take_opaque_payload() and related issues, see #667.

Awesome! That would leave just:

rustls/src/client/tls13.rs:232:    early_key_schedule.unwrap().into_handshake(&shared.shared_secret)
rustls/src/client/tls13.rs:1123:   conn.common.quic.params.as_ref().unwrap(),

rustls/src/msgs/handshake.rs:1030: let last_extension = self.extensions.last_mut().unwrap();
rustls/src/msgs/handshake.rs:2301: let offer = ch.get_psk().unwrap();

rustls/src/msgs/mod.rs:43:         let mut m = Message::read(&mut r).unwrap();

@djc
Copy link
Member

djc commented Apr 21, 2021

(The last one is probably also in scope of #667.)

@briansmith
Copy link
Contributor Author

OK, great, then we'd be down to just:

rustls/src/client/tls13.rs:232:    early_key_schedule.unwrap().into_handshake(&shared.shared_secret)
rustls/src/client/tls13.rs:1123:   conn.common.quic.params.as_ref().unwrap(),

rustls/src/msgs/handshake.rs:1030: let last_extension = self.extensions.last_mut().unwrap();
rustls/src/msgs/handshake.rs:2301: let offer = ch.get_psk().unwrap();

@djc
Copy link
Member

djc commented Apr 22, 2021

(The unwrap in msgs/mod.rs is actually in test code, so we can just leave it there.)

@ctz ctz added this to In progress in rustls 0.20 release Apr 22, 2021
@Ericson2314
Copy link

Ericson2314 commented Apr 22, 2021

Seeing #283, this issue, https://www.abetterinternet.org/post/preparing-rustls-for-wider-adoption/ (which led me here), and https://daniel.haxx.se/blog/2020/10/09/rust-in-curl-with-hyper/ discussing OOM failures in particular, I am wondering if rust-lang/rust#84266, to make it easier to audit allocation failure paths, would help with this?

@briansmith
Copy link
Contributor Author

Another class of panics or worse, undefined behavior, is arithmetic underflows/overflows. Normally only debug builds panic on overflow/underflow but release builds can be configured to also panic in those situations.

In ring I try to use checked arithmetic (checked_add/checked_sub/etc.) whenever arithmetic inputs provided by the caller are used. We should consider doing similar.

@briansmith
Copy link
Contributor Author

@ctz I see that this is the last remaining issue in the 0.20 release milestone. However, this issue is very broad. Could we get more specific about which parts of this work are remaining for the 0.20 milestone?

@briansmith
Copy link
Contributor Author

Here's what I suggest for this project:

  1. Identify any high-priority work where panicking code seems to be reachable.
  2. Prioritize that work into two categories: Changes that would change the public API of Rustls, and those that wouldn't.
  3. Prioritize the fixes for any high-priority (reachable) panics that would change the public API.
  4. Do a release with a major version bump.
  5. Keep working on the other reachable panics and do minor version releases.

I am investigating a potential reachable panic in ring that would warrant a new version of Rustls that enforces a higher minimum version of ring is used.

@djc
Copy link
Member

djc commented May 14, 2021

Would the higher minimum version of ring still be 0.16.x, or do you mean requiring ring 0.17?

@mjgarton
Copy link

mjgarton commented Sep 5, 2021

Is there a publicly accessible list of remaining issues that need tackling for this?

@djc
Copy link
Member

djc commented Sep 5, 2021

There is not. What is your interest?

@mjgarton
Copy link

mjgarton commented Sep 6, 2021

I was initially just curious about the plan for a 0.2.0 release and I came across this issue. Depending on what needs tackling, I may have some time to work on some smaller items.

I realise though that publishing this list of issues is essentially publishing a list of unfixed exploits, so I understand why it's not public.

@djc
Copy link
Member

djc commented Sep 6, 2021

At this point I don't think this issue is blocking the 0.20 release. If you'd like to work on something, it might be useful to eliminate the different forms of panicking (for example, assert_eq!()) in client::client_conn::EarlyData state transitions. These should be statically unreachable, it would be nice to make that more obvious. I think the right approach would be to "split" the state between some state in ClientConnectionData which is only used to propagate data outwards towards the caller and state used internally, which can live in the client's state machine states (that is, the State impls).

(This is similar to how we have a suite both in the state machine states and as an Option in CommonState.)

Would be great if you can help out with this; let me know if you have further questions!

@mjgarton
Copy link

mjgarton commented Sep 7, 2021

Can you elaborate on where you mean by client's state machine states (that is, the State impls). please? I am still unfamiliar with the code base at the moment, so apologies if this should be obvious.

@djc
Copy link
Member

djc commented Sep 7, 2021

In the conn module, there is the State trait. Connections (either ClientConnection or ServerConnection, which both pretty much wrap ConnectionCommon go through a number of States. For the client, the States are defined in client::hs, client::tls12 and client::tls13. You can search for impl State to find all the state machine implementations:

djc-2019 tweaks rustls $ rg "impl State"
rustls/src/client/tls12.rs
209:impl State<ClientConnectionData> for ExpectCertificate {
270:impl State<ClientConnectionData> for ExpectCertificateStatusOrServerKx {
332:impl State<ClientConnectionData> for ExpectCertificateStatus {
385:impl State<ClientConnectionData> for ExpectServerKx {
563:impl State<ClientConnectionData> for ExpectServerDoneOrCertReq {
622:impl State<ClientConnectionData> for ExpectCertificateRequest {
700:impl State<ClientConnectionData> for ExpectServerDone {
874:impl State<ClientConnectionData> for ExpectNewTicket {
919:impl State<ClientConnectionData> for ExpectCcs {
1023:impl State<ClientConnectionData> for ExpectFinished {
1077:impl State<ClientConnectionData> for ExpectTraffic {

rustls/src/client/tls13.rs
386:impl State<ClientConnectionData> for ExpectEncryptedExtensions {
484:impl State<ClientConnectionData> for ExpectCertificateOrCertReq {
534:impl State<ClientConnectionData> for ExpectCertificateRequest {
622:impl State<ClientConnectionData> for ExpectCertificate {
693:impl State<ClientConnectionData> for ExpectCertificateVerify {
866:impl State<ClientConnectionData> for ExpectFinished {
1090:impl State<ClientConnectionData> for ExpectTraffic {
1150:impl State<ClientConnectionData> for ExpectQuicTraffic {

rustls/src/client/hs.rs
447:impl State<ClientConnectionData> for ExpectServerHello {
755:impl State<ClientConnectionData> for ExpectServerHelloOrHelloRetryRequest {

rustls/src/server/tls12.rs
502:impl State<ServerConnectionData> for ExpectCertificate {
577:impl State<ServerConnectionData> for ExpectClientKx {
646:impl State<ServerConnectionData> for ExpectCertificateVerify {
708:impl State<ServerConnectionData> for ExpectCcs {
827:impl State<ServerConnectionData> for ExpectFinished {
896:impl State<ServerConnectionData> for ExpectTraffic {

rustls/src/server/hs.rs
276:impl State<ServerConnectionData> for ExpectClientHello {

rustls/src/server/tls13.rs
733:impl State<ServerConnectionData> for ExpectCertificate {
813:impl State<ServerConnectionData> for ExpectCertificateVerify {
944:impl State<ServerConnectionData> for ExpectFinished {
1057:impl State<ServerConnectionData> for ExpectTraffic {
1117:impl State<ServerConnectionData> for ExpectQuicTraffic {

@briansmith
Copy link
Contributor Author

It seems like the next step in this is enabling the clippy and/or other static verification in CI that we're avoiding panicking APIs, with the appropriate #[allow(...)] around cases where the unwrap() or similar is what we intend to do (because we have verified it is impossible and the code would be worse if we tried to avoid the panicking function). I think once we have that then we can do a final audit of the codebase looking for those allow(...) annotations and then close this?

@djc
Copy link
Member

djc commented Jan 18, 2022

Last I checked there were still some obvious offenders, notably the client-side early data implementation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

No branches or pull requests

5 participants