General issue - Using this taint-tracking script can only show the source node and sink node? #5104

yikesoftware · 2021-02-05T15:26:16Z

When I used the script that was found in https://github.com/github/codeql/blob/main/python/ql/src/Security/CWE-078/CommandInjection.ql to do a command-injection taint tracking, I found that it can only show the source node and the sink node but not give me the node between them so that I can't build a full path to record the way that taint data passed to vulnerable function.

File: CommandInjection.ql

/**
 * @name Uncontrolled command line
 * @description Using externally controlled strings in a command line may allow a malicious
 *              user to change the meaning of the command.
 * @kind path-problem
 * @problem.severity error
 * @sub-severity high
 * @precision high
 * @id py/command-line-injection
 * @tags correctness
 *       security
 *       external/owasp/owasp-a1
 *       external/cwe/cwe-078
 *       external/cwe/cwe-088
 */

import python
import semmle.python.security.dataflow.CommandInjection
import DataFlow::PathGraph

from CommandInjectionConfiguration config, DataFlow::PathNode source, DataFlow::PathNode sink
where config.hasFlowPath(source, sink)
select sink.getNode(), source, sink, "This command depends on $@.", source.getNode(),
  "a user-provided value"

Example vuln code:

from flask import Flask, request
import subprocess
 
app = Flask(__name__)

@app.route('/run_command')
def get_run_command():
    binary = request.args.get("bin") #source1
    args = request.args.get("args") #source2
    if len(binary) > 0: #?
        if len(args) < 10: #?
            subprocess.Popen([binary, args]) #sink
            return "success"
    else:        
        return "fail"

def main():
    app.run(debug=True)

In my impression, the old version of CommandInjection example ql-script that does not encapsulate class “CommandInjectionConfiguration” can achieve the functions I mentioned above.

hmakholm · 2021-02-05T18:04:00Z

How do you view the results of your query? Only some of the several ways to do so will show you the paths -- for example, doing codeql query run from the command line will only show the rows of your explicit select statement, but executing the query in the Visual Studio Code extension ought to show the offending data path (or a representative sample of paths if there are many of them).

yoff · 2021-02-05T22:24:37Z

We have recently rewritten the data-flow analysis, which has led to slightly different path explanations. It is true that the previous version includes two more steps for the current example. I am curious as to which of those you find helpful? (I personally find the source and sink adequate in this case, but I am very interested in our users views here.)

yikesoftware · 2021-02-06T07:32:25Z

We have recently rewritten the data-flow analysis, which has led to slightly different path explanations. It is true that the previous version includes two more steps for the current example. I am curious as to which of those you find helpful? (I personally find the source and sink adequate in this case, but I am very interested in our users views here.)

Thank for your reply. In fact, My teacher and I are doing some research on stain analysis recently (Especially for web applications written in Python). It's very important for us to analyze all the nodes of the data stream to reconstruct the data path. I want to konw how I need to modify it to achieve this. (●'◡'●)

By the way, for some more complex programs, I think that getting a complete data path is conducive to the establishment of constraints, so as to solve the constraints and get the specific constraints from source to sink.

I don't know if my expression is accurate. Do you have any good suggestions?

yoff · 2021-02-09T07:59:22Z

That sounds interesting, I am always curious about research :-) However, I am not entirely convinced that your question is well-defined, in that "all the nodes" is a concept specific to the particular data-flow analysis chosen.

Our data-flow implementation goes to some trouble to only report a useful set of nodes, both to leave out "trivial steps", but also to leave out various synthetic nodes or steps that are not strongly tied to the syntax and thus may be harder to interpret. We aim for a "path-explanation", a way of convincing a reader that data does indeed flow from the source to the sink.

That being said, if you are doing this for research and you are feeling adventurous and happy to make customisations, I think you could change CastNode to any() and nodeIsHidden to none() (both are none() at the moment, but nodeIsHidden will probably soon change). That should make it show all the nodes we considered when constructing the path.

yikesoftware · 2021-02-09T10:36:26Z

That sounds interesting, I am always curious about research :-) However, I am not entirely convinced that your question is well-defined, in that "all the nodes" is a concept specific to the particular data-flow analysis chosen.

Our data-flow implementation goes to some trouble to only report a useful set of nodes, both to leave out "trivial steps", but also to leave out various synthetic nodes or steps that are not strongly tied to the syntax and thus may be harder to interpret. We aim for a "path-explanation", a way of convincing a reader that data does indeed flow from the source to the sink.

That being said, if you are doing this for research and you are feeling adventurous and happy to make customisations, I think you could change CastNode to any() and nodeIsHidden to none() (both are none() at the moment, but nodeIsHidden will probably soon change). That should make it show all the nodes we considered when constructing the path.

For accurate description, I took a screenshot of my query results in vscode.

The arrow points to the node displayed in the query result, and the underline identifies some nodes in the data stream transmission in the actual running process (just those I want to get), because through these nodes, we can know the constraints of the data stream in the transmission process, such as if len(code) < 10 and if "abcdefg" not in code:.

If I can get the coordinates of these nodes in the query results, I can directly extract the constraints, so as to avoid the symbol execution in the global scope and improve the efficiency.

I'm happy to make some changes to codeql for my research, but the project is very subtle and huge, and it really takes a lot of effort.

yoff · 2021-02-10T08:59:11Z

Ah, I see. When you refer to the data stream, you are talking about the messages sent by the application, not the nodes on the data-flow path. It seems like you would be better served with a custom query written to collect constraints?

yikesoftware · 2021-02-10T16:10:29Z

Ah, I see. When you refer to the data stream, you are talking about the messages sent by the application, not the nodes on the data-flow path. It seems like you would be better served with a custom query written to collect constraints?

Yes, I thought about it at first. But I'm not sure codeql supports such operations as extracting constraints. If I can, are there any templates for me to learn?

yoff · 2021-02-10T19:09:10Z

CodeQL is a fairly rich language, I would it expect you could get something useful. I am not sure if we have good templates for this, I did not readily find a query that looked promising. However, we do have some utility classes such as CompareNode and IfExprNode that could be helpful.

If command injection is your interest, you could take that taint tracking configuration as your starting point. Then you could define "relevant constraints" as those that have flow into them from a source and which also have flow out of them to a sink (you may have to fiddle with the setup to get "constraints per sink/source-pair" which should be your path constraints).

github-actions · 2021-04-16T01:52:14Z

This issue is stale because it has been open 14 days with no activity. Comment or remove the stale label in order to avoid having this issue closed in 7 days.

yikesoftware added the question label Feb 5, 2021

hmakholm added the Python label Feb 5, 2021

aeisenberg mentioned this issue Apr 12, 2021

Actions: Add workflow for marking stale questions #5659

Merged

github-actions bot added the Stale label Apr 16, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

General issue - Using this taint-tracking script can only show the source node and sink node? #5104

General issue - Using this taint-tracking script can only show the source node and sink node? #5104

yikesoftware commented Feb 5, 2021

hmakholm commented Feb 5, 2021

yoff commented Feb 5, 2021

yikesoftware commented Feb 6, 2021

yoff commented Feb 9, 2021

yikesoftware commented Feb 9, 2021

yoff commented Feb 10, 2021

yikesoftware commented Feb 10, 2021

yoff commented Feb 10, 2021

github-actions bot commented Apr 16, 2021

General issue - Using this taint-tracking script can only show the source node and sink node? #5104

General issue - Using this taint-tracking script can only show the source node and sink node? #5104

Comments

yikesoftware commented Feb 5, 2021

hmakholm commented Feb 5, 2021

yoff commented Feb 5, 2021

yikesoftware commented Feb 6, 2021

yoff commented Feb 9, 2021

yikesoftware commented Feb 9, 2021

yoff commented Feb 10, 2021

yikesoftware commented Feb 10, 2021

yoff commented Feb 10, 2021

github-actions bot commented Apr 16, 2021