-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CPP-435] Calls to memset
and ZeroMemory
may be deleted by the compiler
#1933
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Results on 75 projects here: https://lgtm.com/query/2089939943018653184/
I think we're seeing a lot of false positives on this [early] version of the query:
(1) memset(&variable, ..., ...)
, where &variable
doesn't flow to anywhere but nevertheless variable
continues to be used.
(2) similarly memset(buf+offset, ..., ...)
where there's no flow from buf+offset
but buf
continues to be used.
(3) memset
to a pointer which, though it isn't used again, points to the same buffer as another pointer that is. i.e.
x = y;
memset(x, 0, sizeof(*x));
... use y
In all of these cases I don't think an optimizing compiler can remove the calls to memset
.
* by the compiler if said buffer is not subsequently used. This is not desirable | ||
* behavior if the buffer contains sensitive data that could be exploited by an attacker. | ||
* The workaround is to use `memset_s` or `SecureZeroMemory`, use the `-fno-builtin-memset` compiler flag, or | ||
* to write one's own buffer-clearing routine. See also |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or to write one's own buffer-clearing routine
There's a theoretical risk that a sufficiently smart optimizing compiler might optimize out a call even to a user-defined buffer clearing routine. Can anybody comment about whether this is a real concern with current compilers? Should we be recommending this approach?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See https://jira.semmle.com/browse/CPP-435. We should be recommending the buffer-clearing approaches in https://www.usenix.org/conference/usenixsecurity17/technical-sessions/presentation/yang. @zlaski-semmle I recommend reading that paper. I think the recommendations in the CWE-14 entry are simplistic and not very helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The paper summarizes several scrubbing approaches: separate compilation or weak linking, volatile function or data accesses, memory barriers, assembly-language implementation, disabled use of __builtin_memset
. But the technique of choice is to use the secure scrubbing functions provided by each platform.
The most sensible strategy (in my opinion) is the one adopted by Tor: use SecureZeroMemory
(Win) if available, then RtlSecureZeroMemory
(Win) if available, then BSD’s explicit_bzero
if available, then C11's memset_s
if available, and then fall back on assembly language implementations and then finally the volatile function pointer technique.
But what do we tell our users? To read the paper? I still think we should point them to the platform-supplied functions, and then perhaps refer them to secure_memzero
implemented in https://compsec.sysnet.ucsd.edu/secure_memzero.h.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First, we shouldn't write anything about mitigation in @description
. The @description
should be rewritten to follow https://github.com/Semmle/ql/blob/master/docs/query-metadata-style-guide.md#query-descriptions-description. That means we don't need to boil our mitigation advice down to a single sentence.
I suggest we tell users to prefer using a platform-specific library function if one is available. We can list them by name, but I don't see a reason we should recommend some of them over others. If they are on a platform without such a library function, we can recommend using https://compsec.sysnet.ucsd.edu/secure_memzero.h. If they want to write their own function, we can refer them to Section 3 of the paper. I don't think we should take it upon us to explain the volatile
-based techniques directly. There are too many caveats, and it's too easy to get it wrong.
|
||
from FunctionCall memset | ||
where | ||
memset.getTarget().getName() = "memset" or memset.getTarget().getName() = "ZeroMemory" and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or wmemset
. But in the long run, I'd like to add something to the models library to cover this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't forget bzero
.
|
||
|
||
<li>MITRE | ||
<a href="https://cwe.mitre.org/data/definitions/14.html">CWE-14</a>.</li> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the CWE reference will be added automatically as the query has a cwe tag. Check what similar queries do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just generated the markdown from the .qlhelp
, and CWE appears only once (the one I added). Or is there a different toolchain I should be using?
Overall, I don't think this query should use data flow or taint tracking as it will give results that are hard to explain. It should instead mimic what a reasonable compiler does. |
Let's discuss it during our next meeting. I'm not sure how to make QL act as a "reasonable compiler", nor do I understand why taint tracking is not a proper solution. |
@@ -17,12 +13,24 @@ | |||
|
|||
import semmle.code.cpp.dataflow.TaintTracking |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we want data flow here, not taint tracking. Taint tracking extends data flow with some heuristic rules about how data content may influence other data content, but the correctness of this query does not need to depend on such heuristics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I just ran some tests. With DataFlow::localFlow
, torvalds/linux
produces 599 alerts. With TaintTracking::localTaint
, that number goes down to 510 alerts. Intuitively, this makes sense, as modifying the first argument to memset
taints more subsequent statements/expressions, and hence more alerts are suppressed.
But what puzzles me is that we are still left with numerous false positives, way too many for this query to be usable. I've been extracting C/C++ code triggering the alert, but then haven't been able to reproduce the alert.
) and | ||
not exists(Parameter parm | | ||
TaintTracking::localTaint(DataFlow::parameterNode(parm), | ||
DataFlow::exprNode(arg)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This checking of flow from Parameter
is just one case out of the many ways that arg
can be aliased. It could be a whack-a-mole game to enumerate them all. I suggest that we make the query the other way around: give an alert only if there's provably no way that the memory could be read after being cleared. That means, to begin with, that we only support stack-allocated arrays.
The easy way to check whether a variable may escape is to use https://github.com/Semmle/ql/blob/master/cpp/ql/src/semmle/code/cpp/dataflow/EscapesTree.qll#L254. Unfortunately it's very conservative, so it would be better to use the IR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest that we make the query the other way around: give an alert only if there's provably no way that the memory could be read after being cleared.
How would one obtain such a proof? I'm in the dark here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The easy way to check whether a variable may escape is to use https://github.com/Semmle/ql/blob/master/cpp/ql/src/semmle/code/cpp/dataflow/EscapesTree.qll#L254.
I'm still wrapping my head around this one. So does the VariableAccess
parameter correspond to the pointer/array variable that is the first argument to memset
? What about the Expr
parameter? The description in the sources is underwhelming. I'm looking at ReturnStackAllocatedMemory.ql
and it seems that the second parameter could correspond to the value of the return
statement, but not always.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately it's very conservative, so it would be better to use the IR.
So perhaps it's time for me to start learning about the IR.
So I've played with
the taint from the
the taint is not propagated, leading to a spurious "memset may be deleted" alert. |
Indeed, we still are, even after the version bump. |
I've put this on the agenda for today's team meeting so we can discuss the big picture before diving further into any of these details. |
I have committed an initial version of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please move the memset modelling to a separate PR. It can be merged independently and likely much sooner than the full query. Otherwise this PR will end up with 100+ comments on it and will become impossible to follow.
( | ||
output.isOutParameterPointer(0) or | ||
output.isOutReturnPointer() | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see the value in any of the four flow combinations defined by this predicate. I can't think of a practical query where we'd want such flow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So what do you think is the correct taint model here? In my way of thinking, both the initializer value and the length affect the output memory buffer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's technically true, but traditionally we haven't had much luck with taint through low-bandwidth channels like strlen
. It has been the source of many false positives and was recently disabled from the security.TaintTracking
library.
I suggest that we don't give a taint model for memset at all unless we have an example of a query and a snapshot where it would be beneficial.
( | ||
this.hasName("memset") or | ||
this.hasName("__builtin_memset") or | ||
this.hasName("FillMemory") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like FillMemory
is a macro.
Apparently ZeroMemory
and RtlZeroMemory
are macros too, but bzero
(used on BSD and Mac) is a real function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. They both resolve to... memset
.
this instanceof TopLevelFunction and | ||
( | ||
this.hasName("memset") or | ||
this.hasName("__builtin_memset") or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The modern way to match these function names is to use hasGlobalName
and the multi-argument hasQualifiedName
predicates. The even more modern way (#1585) is not merged yet, unfortunately, so you'll have to add hasQualifiedName("std", "memset")
to make sure you also match std::memset
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in #2027.
To continue working on this, I would like the following to be merged in first: |
f921979
to
81add37
Compare
I've committed my first stab at an IR-based query. Presently it is quite simple, checking for the presence of a I've tried tightening the restrictions on the |
cpp/ql/test/query-tests/Likely Bugs/Memory Management/MemsetMayBeDeleted/MemsetMayBeDeleted.cpp
Show resolved
Hide resolved
Thanks for the new test cases, @geoffw0 ! They will be extremely helpful. |
ff62d60
to
304f869
Compare
304f869
to
078bb9a
Compare
… as expected, but there are still false positives in, e.g., the Linux kernel.
078bb9a
to
c6e18fe
Compare
It belongs in [zlaski/pointer-overflow-check] branch. This reverts commit 9d6e8a5.
@zlaski-semmle I think the Jira ticket you're looking for is CPP-438: Query for pointer address wrapping. |
Yes, indeed. At any rate, I moved the bits to zlaski/pointer-overflow-check. |
No description provided.