-
Notifications
You must be signed in to change notification settings - Fork 1.7k
C/C++: how to optimize function pointer tracing? #13198
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Hi @iiins0mn1a, Unless I'm missing something I you're overthinking the configuration needed for your dataflow analysis. For example, I don't see why you need the Instead, I think you'd get much better performance by just tracing flow from all Based on the above points, I've slightly modified your query to get rid of the complexity added by keeping the /**
* @kind problem
*/
import cpp
import semmle.code.cpp.dataflow.new.DataFlow
predicate isSourceImpl(DataFlow::Node node, FunctionAccess fa) {
fa = node.asExpr() and
not any(ExprCall call).getExpr() = fa
}
predicate isSinkImpl(DataFlow::Node node, ExprCall call) {
node.asExpr() = call.getExpr() and
not exists(call.getTarget())
}
module FunctionPointerConfig implements DataFlow::ConfigSig {
predicate isSource(DataFlow::Node node) { isSourceImpl(node, _) }
predicate isSink(DataFlow::Node node) { isSinkImpl(node, _) }
}
module MyDataFlow = DataFlow::Global<FunctionPointerConfig>;
Function traceFunctionPointer(Call call) {
result = call.getTarget()
or
exists(DataFlow::Node source, DataFlow::Node sink, FunctionAccess fa |
MyDataFlow::flow(source, sink) and
isSourceImpl(source, fa) and
isSinkImpl(sink, call) and
result = fa.getTarget()
)
}
from Function func, Call call
where func = traceFunctionPointer(call)
select call, "This calls $@", func, func.toString() I ran the above query on a large database with about 3 million lines of code, and it seemed to work fine on my machine. Obviously, it would be even faster if you could somehow If you want to make the analysis more precise by checking whether the number of arguments matches the numer of parameters (and whether the type of arguments matches the type of the parameters) you can do so in It's probably possible to do this checking earlier in the analysis by using a I hope that helps! |
Hi, @MathiasVP thank you for your replying~ The reason I invloved the Thanks for your modernized coding style lol, looks good to me, I will follow. And I would try this query on my machine later.
But I am still confusing about the example I raised before. Once we got some prior knowledge like the relationship between source and sink, is there any methods to let the dataflow procedure aware of this? Theoretically it could improve the performance dramatically. And does my "theory" make any sense? Really appreciate for your suggestion about variadic functions, it does help.
Unfortunately XD, my database is particularly large(I've mensioned that it's allmodconfig kernel....), and I am trying everything to make query runs better, could you tell me the usage of Again, thank you, it's been my pleasure to talk with you. |
I tried it on my kernel database, and it lasts about 2 hours and exited with code 137, same as before. I've read the issue #11676 so I set here is the analyzing log:
evaluator log:
|
Yes, that does look like an OOM.
It's difficult to get a lot from that evaluator log snippet, really. The predicates listed in the above table don't really have anything to do with your query. Instead, this is all part of the underlying libraries that will drive the dataflow analysis in the end (i.e., control-flow, alias analysis, dataflow, etc.). If you upload the full log somewhere I can take a deeper look at it. Here is a go at the "FlowState"-based version I hinted at: /**
* @kind problem
*/
import cpp
import semmle.code.cpp.dataflow.new.DataFlow
predicate isSourceImpl(DataFlow::Node node, TypeList sig, FunctionAccess fa) {
fa = node.asExpr() and
sig = getTypeListFromAccess(fa, 0) and
not any(ExprCall call).getExpr() = fa
}
predicate isSinkImpl(DataFlow::Node node, TypeList sig, ExprCall call) {
not exists(call.getTarget()) and
exists(Expr e |
e = call.getExpr() and
node.asExpr() = e and
sig = getTypeListFromType(e.getUnspecifiedType(), 0)
)
}
TypeList getTypeListFromType(FunctionPointerType type, int start) {
start = type.getNumberOfParameters() and
result = MkNil(type.getReturnType().getUnspecifiedType())
or
exists(Type head, TypeList tail |
getTypeListFromType(type, start, head, tail) and
result = MkCons(head, tail)
)
}
private newtype TypeList =
MkNil(Type returnType) {
returnType = any(FunctionAccess fa).getTarget().getUnspecifiedType()
or
returnType =
any(ExprCall call)
.getExpr()
.getUnspecifiedType()
.(FunctionPointerType)
.getReturnType()
.getUnspecifiedType()
} or
MkCons(Type head, TypeList tail) {
getTypeListFromAccess(_, _, head, tail)
or
getTypeListFromType(_, _, head, tail)
}
private TypeList getTypeListFromAccess(FunctionAccess fa, int start) {
start = fa.getTarget().getNumberOfParameters() and
result = MkNil(fa.getTarget().getUnspecifiedType())
or
exists(Type head, TypeList tail |
getTypeListFromAccess(fa, start, head, tail) and
result = MkCons(head, tail)
)
}
private predicate getTypeListFromAccess(FunctionAccess fa, int start, Type head, TypeList tail) {
head = fa.getTarget().getParameter(start).getUnspecifiedType() and
tail = getTypeListFromAccess(fa, start + 1)
}
private predicate getTypeListFromType(FunctionPointerType type, int start, Type head, TypeList tail) {
head = type.getParameterType(start).getUnspecifiedType() and
tail = getTypeListFromType(type, start + 1)
}
module FunctionPointerConfig implements DataFlow::StateConfigSig {
class FlowState = TypeList;
predicate isSource(DataFlow::Node node, FlowState sig) { isSourceImpl(node, sig, _) }
predicate isSink(DataFlow::Node node, FlowState sig) { isSinkImpl(node, sig, _) }
predicate isBarrier(DataFlow::Node node, FlowState signature) { none() }
predicate isAdditionalFlowStep(
DataFlow::Node node1, FlowState signature1, DataFlow::Node node2, FlowState signature2
) {
none()
}
}
module MyDataFlow = DataFlow::GlobalWithState<FunctionPointerConfig>;
Function traceFunctionPointer(Call call) {
result = call.getTarget()
or
exists(DataFlow::Node source, DataFlow::Node sink, FunctionAccess fa |
MyDataFlow::flow(source, sink) and
isSourceImpl(source, _, fa) and
isSinkImpl(sink, _, call) and
result = fa.getTarget()
)
}
from Function func, Call call
where func = traceFunctionPointer(call)
select call, "This calls $@", func, func.toString() It's the most efficient version I could come up with without any more knowledge of what specifically you're looking for. This implementation uses a "FlowState" to carry the type of the parameters and return type from the function access to the function call to ensure that the types match. I didn't see a super big speedup in my local experiment on a 3M LOC database, but maybe it'll have a bigger impact on your DB. In the end, you'll probably get a much more memory-efficient version if you can somehow restrict the set of Note that there are several ways this could make the analysis miss some results. For instance, if the function pointer is cast to a function pointer of a different type (for example, casting it to a function pointer with another return type). |
Hi @MathiasVP , happy to see your reply~
Does this mean that maybe this work is not that suitable for codeql...? I will upload my whole logs in my private repo later, and you will be able to access it. Thank you, sincerely. For the code, ok, I‘m almost crying(just an exaggerated expression QAQ), really appreciate. I will try this query later and diligently study the techniques within these codes and try to absorb them. |
I don't think so, no. It means that it's simply not feasible to do a precise interprocedural dataflow analysis to resolve all calls through function pointers in such a large database (without an even larger machine that what you have available). So your options are:
Sounds good. Thank you!
Awesome. Let me know if you have any questions 👍. |
I took a brief look at the log file you sent now, and it seems like we're OOM'ing before getting to any of the query-specific code. It would be fantastic if you're able to rerun the analysis and provide the full evaluator log. I also note that it looks like you're OOM'ing in an area I attempted (and I think succeeded) to fix an OOM in just yesterday. So if you wouldn't mind, it would be super helpful if you could _also attempt to rerun en evaluation on top of #13207 as it may help with your OOM issue. |
Ok, I will do it right now. @MathiasVP I only have one machine so I will rerun for OOM first and then try the latest version, maybe 14 hours later, I would update things in my repo. |
@MathiasVP logs are uploaded in repo, still failed with code 137 when ql changes to branch MathiasVP:use-equiv-class-in-getInstruction |
That's a shame. To make matters worse, I now see that the evaluator log is missing a crucial piece of information. Would it be possible to run it again, but with the additional flags |
hi, @MathiasVP, I've uploaded logs with your additional flags in my repo, check it out. edit: NOT DONE yet, too big for github repo, I will find a way later. |
You can create a release and add the log as an artifact. |
Thx for your sugguestion, I've managed to do so. If you do have some interests about this issue, I can also share logs with you. |
Ok, sorry for the delaying, you can find logs in release now... |
I am trying to extract functions
VariableCall
s may want to call, in my kernel database... And this is my query:As you can see, I tried to reduce the set of Source/Sink via some restriction, but its performance is still too bad to work normally.
For example, there is 100
VariableCall
s in my database, and for each VariableCall, it has 3 related canditate functions.If I want to figure out these 100
VariableCall
's related functions within one DataFlow configuration, like my code posted above, there will be 300 sources and 100 sinks, due to bottom-up fashion of codeql, dataflow procedure will have to be repeated for 300 * 100 times to get the results.I already know that one sink node can only relates to 3 source nodes, and this infomation can be fetched with
relatedFunctionApproximate()
, then I can just cut this whole process into 100 little parts, one Dataflow config for one specific sink node, and its corresponding source nodes count 3. In this situation, the total dataflow procedure will only have to be repeated for 3 * 100 times.But obviously, coding 100 dataflow configs is just not practical, what should I do to optimize this query? how can I let the dataflow procedure understand the relationship between source and sink?Is there any useful feature I can use to solve this problem?
Please help me T-T
The text was updated successfully, but these errors were encountered: