Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CPP-435] Calls to memset and ZeroMemory may be deleted by the compiler #1933

Draft
wants to merge 17 commits into
base: main
Choose a base branch
from

Conversation

zlaski-semmle
Copy link
Contributor

No description provided.

Copy link
Contributor

@geoffw0 geoffw0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Results on 75 projects here: https://lgtm.com/query/2089939943018653184/

I think we're seeing a lot of false positives on this [early] version of the query:

(1) memset(&variable, ..., ...), where &variable doesn't flow to anywhere but nevertheless variable continues to be used.

(2) similarly memset(buf+offset, ..., ...) where there's no flow from buf+offset but buf continues to be used.

(3) memset to a pointer which, though it isn't used again, points to the same buffer as another pointer that is. i.e.

x = y;
memset(x, 0, sizeof(*x));
... use y

In all of these cases I don't think an optimizing compiler can remove the calls to memset.

* by the compiler if said buffer is not subsequently used. This is not desirable
* behavior if the buffer contains sensitive data that could be exploited by an attacker.
* The workaround is to use `memset_s` or `SecureZeroMemory`, use the `-fno-builtin-memset` compiler flag, or
* to write one's own buffer-clearing routine. See also
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or to write one's own buffer-clearing routine

There's a theoretical risk that a sufficiently smart optimizing compiler might optimize out a call even to a user-defined buffer clearing routine. Can anybody comment about whether this is a real concern with current compilers? Should we be recommending this approach?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See https://jira.semmle.com/browse/CPP-435. We should be recommending the buffer-clearing approaches in https://www.usenix.org/conference/usenixsecurity17/technical-sessions/presentation/yang. @zlaski-semmle I recommend reading that paper. I think the recommendations in the CWE-14 entry are simplistic and not very helpful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The paper summarizes several scrubbing approaches: separate compilation or weak linking, volatile function or data accesses, memory barriers, assembly-language implementation, disabled use of __builtin_memset. But the technique of choice is to use the secure scrubbing functions provided by each platform.

The most sensible strategy (in my opinion) is the one adopted by Tor: use SecureZeroMemory (Win) if available, then RtlSecureZeroMemory (Win) if available, then BSD’s explicit_bzero if available, then C11's memset_s if available, and then fall back on assembly language implementations and then finally the volatile function pointer technique.

But what do we tell our users? To read the paper? I still think we should point them to the platform-supplied functions, and then perhaps refer them to secure_memzero implemented in https://compsec.sysnet.ucsd.edu/secure_memzero.h.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First, we shouldn't write anything about mitigation in @description. The @description should be rewritten to follow https://github.com/Semmle/ql/blob/master/docs/query-metadata-style-guide.md#query-descriptions-description. That means we don't need to boil our mitigation advice down to a single sentence.

I suggest we tell users to prefer using a platform-specific library function if one is available. We can list them by name, but I don't see a reason we should recommend some of them over others. If they are on a platform without such a library function, we can recommend using https://compsec.sysnet.ucsd.edu/secure_memzero.h. If they want to write their own function, we can refer them to Section 3 of the paper. I don't think we should take it upon us to explain the volatile-based techniques directly. There are too many caveats, and it's too easy to get it wrong.


from FunctionCall memset
where
memset.getTarget().getName() = "memset" or memset.getTarget().getName() = "ZeroMemory" and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or wmemset. But in the long run, I'd like to add something to the models library to cover this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't forget bzero.



<li>MITRE
<a href="https://cwe.mitre.org/data/definitions/14.html">CWE-14</a>.</li>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the CWE reference will be added automatically as the query has a cwe tag. Check what similar queries do.

Copy link
Contributor Author

@zlaski-semmle zlaski-semmle Sep 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just generated the markdown from the .qlhelp, and CWE appears only once (the one I added). Or is there a different toolchain I should be using?

@geoffw0 geoffw0 added the C++ label Sep 16, 2019
@jbj
Copy link
Contributor

jbj commented Sep 16, 2019

Overall, I don't think this query should use data flow or taint tracking as it will give results that are hard to explain. It should instead mimic what a reasonable compiler does.

@zlaski-semmle
Copy link
Contributor Author

Overall, I don't think this query should use data flow or taint tracking as it will give results that are hard to explain. It should instead mimic what a reasonable compiler does.

Let's discuss it during our next meeting. I'm not sure how to make QL act as a "reasonable compiler", nor do I understand why taint tracking is not a proper solution.

@@ -17,12 +13,24 @@

import semmle.code.cpp.dataflow.TaintTracking
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want data flow here, not taint tracking. Taint tracking extends data flow with some heuristic rules about how data content may influence other data content, but the correctness of this query does not need to depend on such heuristics.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I just ran some tests. With DataFlow::localFlow, torvalds/linux produces 599 alerts. With TaintTracking::localTaint, that number goes down to 510 alerts. Intuitively, this makes sense, as modifying the first argument to memset taints more subsequent statements/expressions, and hence more alerts are suppressed.

But what puzzles me is that we are still left with numerous false positives, way too many for this query to be usable. I've been extracting C/C++ code triggering the alert, but then haven't been able to reproduce the alert.

) and
not exists(Parameter parm |
TaintTracking::localTaint(DataFlow::parameterNode(parm),
DataFlow::exprNode(arg))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This checking of flow from Parameter is just one case out of the many ways that arg can be aliased. It could be a whack-a-mole game to enumerate them all. I suggest that we make the query the other way around: give an alert only if there's provably no way that the memory could be read after being cleared. That means, to begin with, that we only support stack-allocated arrays.

The easy way to check whether a variable may escape is to use https://github.com/Semmle/ql/blob/master/cpp/ql/src/semmle/code/cpp/dataflow/EscapesTree.qll#L254. Unfortunately it's very conservative, so it would be better to use the IR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest that we make the query the other way around: give an alert only if there's provably no way that the memory could be read after being cleared.

How would one obtain such a proof? I'm in the dark here.

Copy link
Contributor Author

@zlaski-semmle zlaski-semmle Sep 24, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The easy way to check whether a variable may escape is to use https://github.com/Semmle/ql/blob/master/cpp/ql/src/semmle/code/cpp/dataflow/EscapesTree.qll#L254.

I'm still wrapping my head around this one. So does the VariableAccess parameter correspond to the pointer/array variable that is the first argument to memset? What about the Expr parameter? The description in the sources is underwhelming. I'm looking at ReturnStackAllocatedMemory.ql and it seems that the second parameter could correspond to the value of the return statement, but not always.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately it's very conservative, so it would be better to use the IR.

So perhaps it's time for me to start learning about the IR.

@zlaski-semmle
Copy link
Contributor Author

So I've played with TaintTracking::localTaint tracking some more, and discovered some inconsistencies. For example, in

	__builtin_memset(&pw1a[3], 0, PW_SIZE); // GOOD
	return pw1a[4];

the taint from the memset is correctly propagated to the return, whereas in

	__builtin_memset(pw1a + 3, 0, PW_SIZE); // GOOD [FALSE POSITIVE]
	return pw1a[4];

the taint is not propagated, leading to a spurious "memset may be deleted" alert.

@zlaski-semmle
Copy link
Contributor Author

Results on 75 projects here: https://lgtm.com/query/2089939943018653184/

I think we're seeing a lot of false positives on this [early] version of the query:

Indeed, we still are, even after the version bump.

@jbj
Copy link
Contributor

jbj commented Sep 24, 2019

I've put this on the agenda for today's team meeting so we can discuss the big picture before diving further into any of these details.

@zlaski-semmle
Copy link
Contributor Author

I have committed an initial version of the Memset.qll model. (I've also committed some other changes but they are not exciting.) I based my work on the existing Memcpy.qll model, but obviously could have gone wrong somewhere. For the time being, I did not add Memset.qll to the import list in Models.qll, since I don't know what that would entail.

Copy link
Contributor

@jbj jbj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move the memset modelling to a separate PR. It can be merged independently and likely much sooner than the full query. Otherwise this PR will end up with 100+ comments on it and will become impossible to follow.

(
output.isOutParameterPointer(0) or
output.isOutReturnPointer()
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see the value in any of the four flow combinations defined by this predicate. I can't think of a practical query where we'd want such flow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what do you think is the correct taint model here? In my way of thinking, both the initializer value and the length affect the output memory buffer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's technically true, but traditionally we haven't had much luck with taint through low-bandwidth channels like strlen. It has been the source of many false positives and was recently disabled from the security.TaintTracking library.

I suggest that we don't give a taint model for memset at all unless we have an example of a query and a snapshot where it would be beneficial.

(
this.hasName("memset") or
this.hasName("__builtin_memset") or
this.hasName("FillMemory")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like FillMemory is a macro.

Apparently ZeroMemory and RtlZeroMemory are macros too, but bzero (used on BSD and Mac) is a real function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. They both resolve to... memset.

this instanceof TopLevelFunction and
(
this.hasName("memset") or
this.hasName("__builtin_memset") or
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The modern way to match these function names is to use hasGlobalName and the multi-argument hasQualifiedName predicates. The even more modern way (#1585) is not merged yet, unfortunately, so you'll have to add hasQualifiedName("std", "memset") to make sure you also match std::memset.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in #2027.

@zlaski-semmle
Copy link
Contributor Author

To continue working on this, I would like the following to be merged in first:
#2027.

@zlaski-semmle
Copy link
Contributor Author

I've committed my first stab at an IR-based query. Presently it is quite simple, checking for the presence of a LoadInstruction dominated by the MemsetCallInstruction. This is quite primitive, but amazingly the test results are not that bad.

I've tried tightening the restrictions on the LoadInstruction, e.g., by attempting to match a call argument to the MemsetCallFunction with the address operand of the LoadInstruction, but that proved to be overly restrictive -- (presumably) no LoadInstruction satisfying these criteria could be found, and so the query would flag every single memset call.

@zlaski-semmle
Copy link
Contributor Author

Thanks for the new test cases, @geoffw0 ! They will be extremely helpful.

It belongs in [zlaski/pointer-overflow-check] branch.

This reverts commit 9d6e8a5.
@jbj
Copy link
Contributor

jbj commented Oct 29, 2019

@zlaski-semmle I think the Jira ticket you're looking for is CPP-438: Query for pointer address wrapping.

@zlaski
Copy link
Contributor

zlaski commented Oct 29, 2019

@zlaski-semmle I think the Jira ticket you're looking for is CPP-438: Query for pointer address wrapping.

Yes, indeed. At any rate, I moved the bits to zlaski/pointer-overflow-check.

@zlaski
Copy link
Contributor

zlaski commented Nov 8, 2019

@dbartol This PR may contain bits suitable for yours (#2207).

@adityasharad adityasharad changed the base branch from master to main August 14, 2020 18:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants