Skip to content

General issue - Using this taint-tracking script can only show the source node and sink node? #5104

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
yikesoftware opened this issue Feb 5, 2021 · 9 comments
Labels
Python question Further information is requested Stale

Comments

@yikesoftware
Copy link

When I used the script that was found in https://github.com/github/codeql/blob/main/python/ql/src/Security/CWE-078/CommandInjection.ql to do a command-injection taint tracking, I found that it can only show the source node and the sink node but not give me the node between them so that I can't build a full path to record the way that taint data passed to vulnerable function.

File: CommandInjection.ql

/**
 * @name Uncontrolled command line
 * @description Using externally controlled strings in a command line may allow a malicious
 *              user to change the meaning of the command.
 * @kind path-problem
 * @problem.severity error
 * @sub-severity high
 * @precision high
 * @id py/command-line-injection
 * @tags correctness
 *       security
 *       external/owasp/owasp-a1
 *       external/cwe/cwe-078
 *       external/cwe/cwe-088
 */

import python
import semmle.python.security.dataflow.CommandInjection
import DataFlow::PathGraph

from CommandInjectionConfiguration config, DataFlow::PathNode source, DataFlow::PathNode sink
where config.hasFlowPath(source, sink)
select sink.getNode(), source, sink, "This command depends on $@.", source.getNode(),
  "a user-provided value"

Example vuln code:

from flask import Flask, request
import subprocess
 
app = Flask(__name__)

@app.route('/run_command')
def get_run_command():
    binary = request.args.get("bin") #source1
    args = request.args.get("args") #source2
    if len(binary) > 0: #?
        if len(args) < 10: #?
            subprocess.Popen([binary, args]) #sink
            return "success"
    else:        
        return "fail"

def main():
    app.run(debug=True)

In my impression, the old version of CommandInjection example ql-script that does not encapsulate class “CommandInjectionConfiguration” can achieve the functions I mentioned above.

@yikesoftware yikesoftware added the question Further information is requested label Feb 5, 2021
@hmakholm hmakholm added the Python label Feb 5, 2021
@hmakholm
Copy link
Contributor

hmakholm commented Feb 5, 2021

How do you view the results of your query? Only some of the several ways to do so will show you the paths -- for example, doing codeql query run from the command line will only show the rows of your explicit select statement, but executing the query in the Visual Studio Code extension ought to show the offending data path (or a representative sample of paths if there are many of them).

@yoff
Copy link
Contributor

yoff commented Feb 5, 2021

We have recently rewritten the data-flow analysis, which has led to slightly different path explanations. It is true that the previous version includes two more steps for the current example. I am curious as to which of those you find helpful? (I personally find the source and sink adequate in this case, but I am very interested in our users views here.)

@yikesoftware
Copy link
Author

We have recently rewritten the data-flow analysis, which has led to slightly different path explanations. It is true that the previous version includes two more steps for the current example. I am curious as to which of those you find helpful? (I personally find the source and sink adequate in this case, but I am very interested in our users views here.)

Thank for your reply. In fact, My teacher and I are doing some research on stain analysis recently (Especially for web applications written in Python). It's very important for us to analyze all the nodes of the data stream to reconstruct the data path. I want to konw how I need to modify it to achieve this. (●'◡'●)

By the way, for some more complex programs, I think that getting a complete data path is conducive to the establishment of constraints, so as to solve the constraints and get the specific constraints from source to sink.

I don't know if my expression is accurate. Do you have any good suggestions?

@yoff
Copy link
Contributor

yoff commented Feb 9, 2021

That sounds interesting, I am always curious about research :-) However, I am not entirely convinced that your question is well-defined, in that "all the nodes" is a concept specific to the particular data-flow analysis chosen.

Our data-flow implementation goes to some trouble to only report a useful set of nodes, both to leave out "trivial steps", but also to leave out various synthetic nodes or steps that are not strongly tied to the syntax and thus may be harder to interpret. We aim for a "path-explanation", a way of convincing a reader that data does indeed flow from the source to the sink.

That being said, if you are doing this for research and you are feeling adventurous and happy to make customisations, I think you could change CastNode to any() and nodeIsHidden to none() (both are none() at the moment, but nodeIsHidden will probably soon change). That should make it show all the nodes we considered when constructing the path.

@yikesoftware
Copy link
Author

That sounds interesting, I am always curious about research :-) However, I am not entirely convinced that your question is well-defined, in that "all the nodes" is a concept specific to the particular data-flow analysis chosen.

Our data-flow implementation goes to some trouble to only report a useful set of nodes, both to leave out "trivial steps", but also to leave out various synthetic nodes or steps that are not strongly tied to the syntax and thus may be harder to interpret. We aim for a "path-explanation", a way of convincing a reader that data does indeed flow from the source to the sink.

That being said, if you are doing this for research and you are feeling adventurous and happy to make customisations, I think you could change CastNode to any() and nodeIsHidden to none() (both are none() at the moment, but nodeIsHidden will probably soon change). That should make it show all the nodes we considered when constructing the path.

For accurate description, I took a screenshot of my query results in vscode.

q1

The arrow points to the node displayed in the query result, and the underline identifies some nodes in the data stream transmission in the actual running process (just those I want to get), because through these nodes, we can know the constraints of the data stream in the transmission process, such as if len(code) < 10 and if "abcdefg" not in code:.

If I can get the coordinates of these nodes in the query results, I can directly extract the constraints, so as to avoid the symbol execution in the global scope and improve the efficiency.

I'm happy to make some changes to codeql for my research, but the project is very subtle and huge, and it really takes a lot of effort.

@yoff
Copy link
Contributor

yoff commented Feb 10, 2021

Ah, I see. When you refer to the data stream, you are talking about the messages sent by the application, not the nodes on the data-flow path. It seems like you would be better served with a custom query written to collect constraints?

@yikesoftware
Copy link
Author

Ah, I see. When you refer to the data stream, you are talking about the messages sent by the application, not the nodes on the data-flow path. It seems like you would be better served with a custom query written to collect constraints?

Yes, I thought about it at first. But I'm not sure codeql supports such operations as extracting constraints. If I can, are there any templates for me to learn?

@yoff
Copy link
Contributor

yoff commented Feb 10, 2021

CodeQL is a fairly rich language, I would it expect you could get something useful. I am not sure if we have good templates for this, I did not readily find a query that looked promising. However, we do have some utility classes such as CompareNode and IfExprNode that could be helpful.

If command injection is your interest, you could take that taint tracking configuration as your starting point. Then you could define "relevant constraints" as those that have flow into them from a source and which also have flow out of them to a sink (you may have to fiddle with the setup to get "constraints per sink/source-pair" which should be your path constraints).

@github-actions
Copy link
Contributor

This issue is stale because it has been open 14 days with no activity. Comment or remove the stale label in order to avoid having this issue closed in 7 days.

@github-actions github-actions bot added the Stale label Apr 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Python question Further information is requested Stale
Projects
None yet
Development

No branches or pull requests

3 participants