C/C++: how to optimize function pointer tracing? #13198

Open
iiins0mn1a opened this issue May 17, 2023 · 14 comments
Labels: question (Further information is requested)

Comments

@iiins0mn1a

I am trying to extract the functions that each VariableCall may call in my kernel database... And this is my query:

predicate relatedFunctionApproximate(VariableCall vc, Function func) {
    count(vc.getAnArgument()) = count(func.getAParameter()) and
    forall(int index | exists(vc.getArgument(index)) |
        sameType(vc.getArgument(index), func.getParameter(index))
    )
}
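
/*
 * Note: sameType is a small helper not shown here; roughly, it just checks
 * that the argument's type matches the parameter's type (the real helper may
 * be stricter than this sketch).
 */
predicate sameType(Expr arg, Parameter param) {
    arg.getUnspecifiedType() = param.getUnspecifiedType()
}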

class FunctionPointerFlow extends DataFlow4::Configuration {
    VariableCall vc;
    FunctionPointerFlow() { this = "FunctionPointerFlow" }
    
    override predicate isSource(DataFlow::Node node) {
        node.asExpr() instanceof FunctionAccess and
        relatedFunctionApproximate(vc, node.asExpr().(FunctionAccess).getTarget())
    }
    
    override predicate isSink(DataFlow::Node node) {
        node.asExpr().(VariableAccess).getType() instanceof FunctionPointerType and 
        vc.getExpr() = node.asExpr()
    }
    //...
}

class FunctionPointerVariableAccess extends VariableAccess {
    FunctionPointerVariableAccess() {
        this.getType() instanceof FunctionPointerType
    }
}

Function traceFunctionPointer(FunctionPointerVariableAccess fp) {
    exists(
        DataFlow::Node src, DataFlow::Node sk, FunctionAccess source, FunctionPointerFlow config |
        src.asExpr() = source and sk.asExpr() = fp and
        result = source.getTarget() and
        config.hasFlow(src, sk)
        )
}

predicate relatedFunction(VariableCall vc, Function func) {
    func = traceFunctionPointer(vc.getExpr())
}

As you can see, I tried to reduce the set of sources/sinks with some restrictions, but the performance is still too bad to be usable.

For example, suppose there are 100 VariableCalls in my database, and each VariableCall has 3 related candidate functions.

If I want to figure out the related functions for these 100 VariableCalls within one DataFlow configuration, like the code posted above, there will be 300 sources and 100 sinks; due to the bottom-up fashion of CodeQL, the dataflow procedure will have to be repeated 300 * 100 times to get the results.

I already know that each sink node relates to only 3 source nodes, and this information can be fetched with relatedFunctionApproximate(). So I could cut the whole process into 100 little parts, one dataflow config for one specific sink node, each with its 3 corresponding source nodes. In that situation, the dataflow procedure would only have to be repeated 3 * 100 times.

But obviously, writing 100 dataflow configs is just not practical. What should I do to optimize this query? How can I let the dataflow procedure understand the relationship between source and sink? Is there any useful feature I can use to solve this problem?

Please help me T-T

@iiins0mn1a added the question (Further information is requested) label on May 17, 2023
@MathiasVP
Contributor

Hi @iiins0mn1a,

Unless I'm missing something, I think you're overthinking the configuration needed for your dataflow analysis. For example, I don't see why you need the VariableCall member variable in your FunctionPointerFlow class.

Instead, I think you'd get much better performance by just tracing flow from all FunctionAccesses that aren't used directly in a function call (those we can easily resolve using Call::getTarget). Your traceFunctionPointer predicate can then be split into two cases: either we can resolve the call trivially (and no dataflow is needed), or we need to perform a dataflow analysis to find the use of the function pointer.

Based on the above points, I've slightly modified your query to get rid of the complexity added by keeping the VariableCall in the configuration. I also "modernized" your query slightly to make use of the new API that uses a module for dataflow instead of a configuration. This isn't strictly necessary, but I just really like the new API 😂

/**
 * @kind problem
 */

import cpp
import semmle.code.cpp.dataflow.new.DataFlow

predicate isSourceImpl(DataFlow::Node node, FunctionAccess fa) {
  fa = node.asExpr() and
  not any(ExprCall call).getExpr() = fa
}

predicate isSinkImpl(DataFlow::Node node, ExprCall call) {
  node.asExpr() = call.getExpr() and
  not exists(call.getTarget())
}

module FunctionPointerConfig implements DataFlow::ConfigSig {
  predicate isSource(DataFlow::Node node) { isSourceImpl(node, _) }

  predicate isSink(DataFlow::Node node) { isSinkImpl(node, _) }
}

module MyDataFlow = DataFlow::Global<FunctionPointerConfig>;

Function traceFunctionPointer(Call call) {
  result = call.getTarget()
  or
  exists(DataFlow::Node source, DataFlow::Node sink, FunctionAccess fa |
    MyDataFlow::flow(source, sink) and
    isSourceImpl(source, fa) and
    isSinkImpl(sink, call) and
    result = fa.getTarget()
  )
}

from Function func, Call call
where func = traceFunctionPointer(call)
select call, "This calls $@", func, func.toString()

I ran the above query on a large database with about 3 million lines of code, and it seemed to work fine on my machine. Obviously, it would be even faster if you could somehow restrict the set of FunctionAccesses in isSourceImpl even further (maybe you're only interested in accesses to a function with a particular name or in a particular compilation unit?).
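
For instance (purely illustrative; the name and path patterns below are placeholders, not something taken from your code), the extra restriction could look roughly like this:

predicate isSourceImpl(DataFlow::Node node, FunctionAccess fa) {
  fa = node.asExpr() and
  not any(ExprCall call).getExpr() = fa and
  // Placeholder restrictions: only keep accesses to functions whose name or
  // location you actually care about.
  (
    fa.getTarget().getName().matches("my_prefix_%")
    or
    fa.getTarget().getFile().getAbsolutePath().matches("%/drivers/%")
  )
}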

If you want to make the analysis more precise by checking whether the number of arguments matches the number of parameters (and whether the types of the arguments match the types of the parameters), you can do so in traceFunctionPointer after the dataflow analysis has finished (just be mindful of variadic functions, though!).
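
A sketch of what that post-hoc filtering could look like (just one way to do it; traceFunctionPointerChecked is a made-up name, and the variadic handling is only approximate):

Function traceFunctionPointerChecked(ExprCall call) {
  result = traceFunctionPointer(call) and
  // Arity check: exact for non-variadic candidates, "at least the fixed
  // parameters" for variadic ones.
  (
    not result.isVarargs() and
    result.getNumberOfParameters() = call.getNumberOfArguments()
    or
    result.isVarargs() and
    result.getNumberOfParameters() <= call.getNumberOfArguments()
  ) and
  // Type check: every argument that has a corresponding fixed parameter must
  // have a matching unspecified type.
  forall(int i | exists(result.getParameter(i)) and exists(call.getArgument(i)) |
    call.getArgument(i).getUnspecifiedType() = result.getParameter(i).getUnspecifiedType()
  )
}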

It's probably possible to do this checking earlier in the analysis by using a FlowState, but unless your database is particularly large I don't think it's worth the effort.

I hope that helps!

@iiins0mn1a
Author

Hi @MathiasVP, thank you for your reply~

The reason I involved the VariableCall vc in the configuration is that I am trying to build a relationship between source and sink, which can reduce the number of source nodes and maybe get better performance QAQ. More specifically, a FunctionAccess to a function whose pattern (like the number of args and their types) doesn't match any VariableCall among the sink nodes should not be in the set of source nodes.
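
Roughly, what I have in mind is something like restricting the sources with that approximation (just a sketch, reusing relatedFunctionApproximate from my first post):

predicate isSourceImpl(DataFlow::Node node, FunctionAccess fa) {
    fa = node.asExpr() and
    // Only keep accesses to functions that match at least one candidate call
    // site; this prunes the sources, but it still doesn't tell the dataflow
    // procedure which source belongs to which sink.
    exists(VariableCall vc | relatedFunctionApproximate(vc, fa.getTarget()))
}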

Thanks for the modernized coding style lol, it looks good to me and I will follow it. I will try this query on my machine later.

For example, suppose there are 100 VariableCalls in my database, and each VariableCall has 3 related candidate functions.

If I want to figure out the related functions for these 100 VariableCalls within one DataFlow configuration, like the code posted above, there will be 300 sources and 100 sinks; due to the bottom-up fashion of CodeQL, the dataflow procedure will have to be repeated 300 * 100 times to get the results.

I already know that each sink node relates to only 3 source nodes, and this information can be fetched with relatedFunctionApproximate(). So I could cut the whole process into 100 little parts, one dataflow config for one specific sink node, each with its 3 corresponding source nodes. In that situation, the dataflow procedure would only have to be repeated 3 * 100 times.

But obviously, writing 100 dataflow configs is just not practical. What should I do to optimize this query? How can I let the dataflow procedure understand the relationship between source and sink? Is there any useful feature I can use to solve this problem?

But I am still confused about the example I raised before. Once we have some prior knowledge, like the relationship between source and sink, is there any method to make the dataflow procedure aware of it? Theoretically it could improve the performance dramatically. And does my "theory" make any sense?

I really appreciate your suggestion about variadic functions, it does help.

It's probably possible to do this checking earlier in the analysis by using a FlowState, but unless your database is particularly large I don't think it's worth the effort.

Unfortunately XD, my database is particularly large (I've mentioned that it's an allmodconfig kernel...), and I am trying everything to make the query run better. Could you tell me how to use FlowState, or give me some related link?

Again, thank you, it's been my pleasure to talk with you.

@iiins0mn1a
Author

iiins0mn1a commented May 18, 2023

Thanks for the modernized coding style lol, it looks good to me and I will follow it. I will try this query on my machine later.

I tried it on my kernel database; it ran for about 2 hours and exited with code 137, same as before. I've read issue #11676, so I set --ram to a smaller number and left about 16GB for other tasks on my 96GB RAM machine. But it still fails with code 137, maybe an OOM. Please help me.

Here is the analysis log:

[2023-05-18 03:16:53] Calling plumbing command: codeql resolve ram --ram=81920 --format=json
[2023-05-18 03:16:53] Plumbing command codeql resolve ram completed:
                      [
                        "-J-Xmx40960M",
                        "--off-heap-ram=40960"
                      ]
[2023-05-18 03:16:53] Spawning plumbing command: execute queries -J-Xmx40960M --off-heap-ram=40960 --verbosity=progress --logdir=/home/insomnia/codeql-linux-v6.0/database/log --save-cache --thread-multiplication=1 --evaluator-log=/home/insomnia/codeql-linux-v6.0/database/log/evaluator-log/evaluator.log --evaluator-log-level=5 --threads=24 --warnings=show --qlconfig-file=/home/insomnia/codeql-linux-v6.0/qlconfig.yml --rerun --output=/home/insomnia/codeql-linux-v6.0/database/results -- /home/insomnia/codeql-linux-v6.0/database/db-cpp path:/home/insomnia/query/new-starter/codeql-custom-queries-cpp/vp.ql
[2023-05-18 05:24:41] [ERROR] Spawned process exited abnormally (code 137; tried to run: [/home/insomnia/codeql/tools/linux64/java/bin/java, -Xmx40960M, -Djava.library.path=/usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib, -cp, /home/insomnia/codeql/tools/codeql.jar, com.semmle.cli2.CodeQL, env_argv, execute, queries, -J-Xmx40960M, _, _, _, --save-cache, _, _, _, _, _, _, --rerun, _, --, /home/insomnia/codeql-linux-v6.0/database/db-cpp, _])
[2023-05-18 05:24:41] Plumbing command codeql execute queries terminated with status 137.
[2023-05-18 05:24:41] Plumbing command codeql database run-queries completed with status 137.
[2023-05-18 05:24:41] Exiting with code 137

evaluator log:

Most expensive predicates for unfinished query vp.ql:

        time         | evals |  max @ iter | predicate
        -------------|-------|-------------|----------
               7m36s |       |             | SSAConstruction#2b11997e::DefUse::hasNonPhiDefinition#4#ffff@6ffd0bh6
               5m39s |       |             | shortestDistances@IRBlock#896e97af::startsBasicBlock#1#1@IRBlock#896e97af::adjacentInBlock#2#2#fff@9eaa9609
               5m30s |       |             | project#SSAConstruction#2b11997e::DefUse::hasDefinition#4#ffff@e852a8s0
               4m57s |       |             | SSAConstruction#2b11997e::Cached::getInstructionSuccessor#2#fff@6700d98s
               4m47s |       |             | SSAConstruction#2b11997e::DefUse::hasUse#4#ffff@42a844lp
               4m21s |       |             | SSAConstruction#19829a6f::Cached::getInstructionSuccessor#2#fff@3c3464gq
                4m6s |       |             | SSAConstruction#2b11997e::DefUse::hasDefinition#4#ffff@d898e1oi
               3m42s |       |             | SSAConstruction#2b11997e::PhiInsertion::definitionHasRedefinition#3#fff@99230frc
               .......

@MathiasVP
Contributor

I tried it on my kernel database; it ran for about 2 hours and exited with code 137, same as before. I've read issue #11676, so I set --ram to a smaller number and left about 16GB for other tasks on my 96GB RAM machine. But it still fails with code 137, maybe an OOM. Please help me.

Here is the analysis log:

[2023-05-18 03:16:53] Calling plumbing command: codeql resolve ram --ram=81920 --format=json
[2023-05-18 03:16:53] Plumbing command codeql resolve ram completed:
                      [
                        "-J-Xmx40960M",
                        "--off-heap-ram=40960"
                      ]
[2023-05-18 03:16:53] Spawning plumbing command: execute queries -J-Xmx40960M --off-heap-ram=40960 --verbosity=progress --logdir=/home/insomnia/codeql-linux-v6.0/database/log --save-cache --thread-multiplication=1 --evaluator-log=/home/insomnia/codeql-linux-v6.0/database/log/evaluator-log/evaluator.log --evaluator-log-level=5 --threads=24 --warnings=show --qlconfig-file=/home/insomnia/codeql-linux-v6.0/qlconfig.yml --rerun --output=/home/insomnia/codeql-linux-v6.0/database/results -- /home/insomnia/codeql-linux-v6.0/database/db-cpp path:/home/insomnia/query/new-starter/codeql-custom-queries-cpp/vp.ql
[2023-05-18 05:24:41] [ERROR] Spawned process exited abnormally (code 137; tried to run: [/home/insomnia/codeql/tools/linux64/java/bin/java, -Xmx40960M, -Djava.library.path=/usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib, -cp, /home/insomnia/codeql/tools/codeql.jar, com.semmle.cli2.CodeQL, env_argv, execute, queries, -J-Xmx40960M, _, _, _, --save-cache, _, _, _, _, _, _, --rerun, _, --, /home/insomnia/codeql-linux-v6.0/database/db-cpp, _])
[2023-05-18 05:24:41] Plumbing command codeql execute queries terminated with status 137.
[2023-05-18 05:24:41] Plumbing command codeql database run-queries completed with status 137.
[2023-05-18 05:24:41] Exiting with code 137

Yes, that does look like an OOM.

evaluator log:

Most expensive predicates for unfinished query vp.ql:

        time         | evals |  max @ iter | predicate
        -------------|-------|-------------|----------
               7m36s |       |             | SSAConstruction#2b11997e::DefUse::hasNonPhiDefinition#4#ffff@6ffd0bh6
               5m39s |       |             | shortestDistances@IRBlock#896e97af::startsBasicBlock#1#1@IRBlock#896e97af::adjacentInBlock#2#2#fff@9eaa9609
               5m30s |       |             | project#SSAConstruction#2b11997e::DefUse::hasDefinition#4#ffff@e852a8s0
               4m57s |       |             | SSAConstruction#2b11997e::Cached::getInstructionSuccessor#2#fff@6700d98s
               4m47s |       |             | SSAConstruction#2b11997e::DefUse::hasUse#4#ffff@42a844lp
               4m21s |       |             | SSAConstruction#19829a6f::Cached::getInstructionSuccessor#2#fff@3c3464gq
                4m6s |       |             | SSAConstruction#2b11997e::DefUse::hasDefinition#4#ffff@d898e1oi
               3m42s |       |             | SSAConstruction#2b11997e::PhiInsertion::definitionHasRedefinition#3#fff@99230frc
               .......

It's difficult to get a lot from that evaluator log snippet, really. The predicates listed in the above table don't really have anything to do with your query. Instead, this is all part of the underlying libraries that will drive the dataflow analysis in the end (i.e., control-flow, alias analysis, dataflow, etc.). If you upload the full log somewhere I can take a deeper look at it.

Here is a go at the "FlowState"-based version I hinted at:

/**
 * @kind problem
 */

import cpp
import semmle.code.cpp.dataflow.new.DataFlow

predicate isSourceImpl(DataFlow::Node node, TypeList sig, FunctionAccess fa) {
  fa = node.asExpr() and
  sig = getTypeListFromAccess(fa, 0) and
  not any(ExprCall call).getExpr() = fa
}

predicate isSinkImpl(DataFlow::Node node, TypeList sig, ExprCall call) {
  not exists(call.getTarget()) and
  exists(Expr e |
    e = call.getExpr() and
    node.asExpr() = e and
    sig = getTypeListFromType(e.getUnspecifiedType(), 0)
  )
}

TypeList getTypeListFromType(FunctionPointerType type, int start) {
  start = type.getNumberOfParameters() and
  result = MkNil(type.getReturnType().getUnspecifiedType())
  or
  exists(Type head, TypeList tail |
    getTypeListFromType(type, start, head, tail) and
    result = MkCons(head, tail)
  )
}

private newtype TypeList =
  MkNil(Type returnType) {
    returnType = any(FunctionAccess fa).getTarget().getUnspecifiedType()
    or
    returnType =
      any(ExprCall call)
          .getExpr()
          .getUnspecifiedType()
          .(FunctionPointerType)
          .getReturnType()
          .getUnspecifiedType()
  } or
  MkCons(Type head, TypeList tail) {
    getTypeListFromAccess(_, _, head, tail)
    or
    getTypeListFromType(_, _, head, tail)
  }

private TypeList getTypeListFromAccess(FunctionAccess fa, int start) {
  start = fa.getTarget().getNumberOfParameters() and
  result = MkNil(fa.getTarget().getUnspecifiedType())
  or
  exists(Type head, TypeList tail |
    getTypeListFromAccess(fa, start, head, tail) and
    result = MkCons(head, tail)
  )
}

private predicate getTypeListFromAccess(FunctionAccess fa, int start, Type head, TypeList tail) {
  head = fa.getTarget().getParameter(start).getUnspecifiedType() and
  tail = getTypeListFromAccess(fa, start + 1)
}

private predicate getTypeListFromType(FunctionPointerType type, int start, Type head, TypeList tail) {
  head = type.getParameterType(start).getUnspecifiedType() and
  tail = getTypeListFromType(type, start + 1)
}

module FunctionPointerConfig implements DataFlow::StateConfigSig {
  class FlowState = TypeList;

  predicate isSource(DataFlow::Node node, FlowState sig) { isSourceImpl(node, sig, _) }

  predicate isSink(DataFlow::Node node, FlowState sig) { isSinkImpl(node, sig, _) }

  predicate isBarrier(DataFlow::Node node, FlowState signature) { none() }

  predicate isAdditionalFlowStep(
    DataFlow::Node node1, FlowState signature1, DataFlow::Node node2, FlowState signature2
  ) {
    none()
  }
}

module MyDataFlow = DataFlow::GlobalWithState<FunctionPointerConfig>;

Function traceFunctionPointer(Call call) {
  result = call.getTarget()
  or
  exists(DataFlow::Node source, DataFlow::Node sink, FunctionAccess fa |
    MyDataFlow::flow(source, sink) and
    isSourceImpl(source, _, fa) and
    isSinkImpl(sink, _, call) and
    result = fa.getTarget()
  )
}

from Function func, Call call
where func = traceFunctionPointer(call)
select call, "This calls $@", func, func.toString()

It's the most efficient version I could come up with without any more knowledge of what specifically you're looking for. This implementation uses a "FlowState" to carry the parameter types and the return type from the function access to the function call, to ensure that the types match. I didn't see a super big speedup in my local experiment on a 3M LOC database, but maybe it'll have a bigger impact on your DB.

In the end, you'll probably get a much more memory-efficient version if you can somehow restrict the set of FunctionAccesses.

Note that there are several ways this could make the analysis miss some results. For instance, if the function pointer is cast to a function pointer of a different type (for example, casting it to a function pointer with another return type).

@iiins0mn1a
Author

Hi @MathiasVP, happy to see your reply~

It's difficult to get a lot from that evaluator log snippet, really. The predicates listed in the above table don't really have anything to do with your query. Instead, this is all part of the underlying libraries that will drive the dataflow analysis in the end (i.e., control-flow, alias analysis, dataflow, etc.). If you upload the full log somewhere I can take a deeper look at it.

Does this mean that maybe this task is not that suitable for CodeQL...? I will upload the whole log to my private repo later, and you will be able to access it. Thank you, sincerely.

As for the code, ok, I'm almost crying (just an exaggerated expression QAQ), I really appreciate it. I will try this query later, diligently study the techniques in it, and try to absorb them.

@MathiasVP
Contributor

MathiasVP commented May 18, 2023

Does this mean that maybe this task is not that suitable for CodeQL...?

I don't think so, no. It means that it's simply not feasible to do a precise interprocedural dataflow analysis to resolve all calls through function pointers in such a large database (without an even larger machine than what you have available). So your options are:

  • Restrict the set of function pointers you're interested in. This could be restrictions on which accesses you care about - in which case you can restrict your set of sources (maybe you only care about calls that take at least 1 non-constant pointer as an argument?) - or you could restrict the set of calls you care about - in which case you can restrict your set of sinks (maybe you only care about calls that aren't guarded by a bounds check on one of the arguments?). Either of these (or a combination of them!) will increase your likelihood of success.
  • Give up on interprocedurality and use a local flow analysis (see https://codeql.github.com/docs/codeql-language-guides/analyzing-data-flow-in-cpp-new/#local-data-flow); a minimal sketch is included after this list. I would guess that you can do a local flow analysis without restricting your sources and sinks, but the analysis might still take a long time on such a big database.
  • Finally (and I don't really recommend this approach): you could get a larger machine 😉.
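
For the second option, a minimal local-flow sketch (just an assumption of what it could look like; same source/sink idea as before, but using DataFlow::localFlow so flow is only tracked within a single function):

/**
 * @kind problem
 */

import cpp
import semmle.code.cpp.dataflow.new.DataFlow

from FunctionAccess fa, ExprCall call, DataFlow::Node source, DataFlow::Node sink
where
  source.asExpr() = fa and
  not any(ExprCall c).getExpr() = fa and
  sink.asExpr() = call.getExpr() and
  not exists(call.getTarget()) and
  // Local flow only: much cheaper than global flow, but it misses function
  // pointers that are passed between functions or stored in data structures.
  DataFlow::localFlow(source, sink)
select call, "This may call $@", fa.getTarget(), fa.getTarget().toString()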

I will upload the whole log to my private repo later, and you will be able to access it. Thank you, sincerely.

Sounds good. Thank you!

As for the code, ok, I'm almost crying (just an exaggerated expression QAQ), I really appreciate it. I will try this query later, diligently study the techniques in it, and try to absorb them.

Awesome. Let me know if you have any questions 👍.

@MathiasVP
Contributor

MathiasVP commented May 18, 2023

I took a brief look at the log file you just sent, and it seems like we're OOM'ing before getting to any of the query-specific code. It would be fantastic if you could rerun the analysis and provide the full evaluator log.

I also note that it looks like you're OOM'ing in an area where I attempted (and I think succeeded) to fix an OOM just yesterday. So if you wouldn't mind, it would be super helpful if you could also attempt to rerun the evaluation on top of #13207, as it may help with your OOM issue.

@iiins0mn1a
Author

iiins0mn1a commented May 18, 2023

Ok, I will do it right now.

@MathiasVP I only have one machine, so I will rerun for the OOM first and then try the latest version; I will update things in my repo in maybe 14 hours.

@iiins0mn1a
Author

@MathiasVP the logs are uploaded to the repo; it still failed with code 137 after switching the QL checkout to the branch MathiasVP:use-equiv-class-in-getInstruction.

@MathiasVP
Contributor

That's a shame. To make matters worse, I now see that the evaluator log is missing a crucial piece of information. Would it be possible to run it again, but with the additional flags --tuple-counting and --debug?

@iiins0mn1a
Author

iiins0mn1a commented May 20, 2023

Hi @MathiasVP, I've uploaded logs with your additional flags to my repo, check it out.

Edit: NOT DONE yet, it's too big for a GitHub repo, I will find a way later.

@intrigus-lgtm
Contributor

You can create a release and add the log as an artifact.

@iiins0mn1a
Author

You can create a release and add the log as an artifact.

Thanks for your suggestion, I've managed to do so. If you're interested in this issue, I can also share the logs with you.

@iiins0mn1a
Author

That's a shame. To make matters worse, I now see that the evaluator log is missing a crucial piece of information. Would it be possible to run it again, but with the additional flags --tuple-counting and --debug?

Ok, sorry for the delay, you can find the logs in the release now...
