Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(Not an issue right now) Handle multiple returns #53

Closed
KevinHock opened this issue Jun 22, 2017 · 7 comments
Closed

(Not an issue right now) Handle multiple returns #53

KevinHock opened this issue Jun 22, 2017 · 7 comments
Assignees

Comments

@KevinHock
Copy link
Collaborator

KevinHock commented Jun 22, 2017

I'll try to work on this relatively soon, but to think out loud..

In interprocedural_cfg.py, we have

def return_handler(self, node, function_nodes):
    """Handle the return from a function during a function call."""
    call_node = None
    for n in function_nodes:
        if isinstance(n, ConnectToExitNode):
            LHS = CALL_IDENTIFIER + 'call_' + str(self.function_index)
            previous_node = self.nodes[-1]
            if not call_node:
                RHS = 'ret_' + get_call_names_as_string(node.func)
                r = RestoreNode(LHS + ' = ' + RHS, LHS, [RHS],
                                line_number=node.lineno,
                                path=self.filenames[-1])
                call_node = self.append_node(r)
                previous_node.connect(call_node)
        else:
            # lave rigtig kobling
            pass

which cleaned is

def return_handler(self, call_node, function_nodes):
    """Handle the return from a function during a function call.

    Args:
        call_node(ast.Call) : The node that calls the definition.
        function_nodes(list[Node]): List of nodes of the function being called.
    """
    for node in function_nodes:
        # Only Return's and Raise's can be of type ConnectToExitNode
        if isinstance(node, ConnectToExitNode):                
            # Create e.g. ¤call_1 = ret_func_foo RestoreNode
            LHS = CALL_IDENTIFIER + 'call_' + str(self.function_call_index)
            RHS = 'ret_' + get_call_names_as_string(call_node.func)
            return_node = RestoreNode(LHS + ' = ' + RHS,
                                      LHS,
                                      [RHS],
                                      line_number=call_node.lineno,
                                      path=self.filenames[-1])
            self.nodes[-1].connect(return_node)
            self.nodes.append(return_node)
            return 

Firstly, the for loop and the if statement seem to just serve the purpose of "Is there a node of type Return or Raise in the function?" But I think all functions should have at least one return node, right? I'm not sure if I understand the original intention that well e.g. what was going to be in the else?

Secondly, here is an example to illustrate the problem/need to handle multiple returns:

TODO

@KevinHock KevinHock self-assigned this Jun 22, 2017
@KevinHock
Copy link
Collaborator Author

KevinHock commented Jun 22, 2017

Okay, so, I was going to suggest we do something to sort of OR all the different return values, such that if one was tainted then we would say it returned a tainted value. (Maybe this could be accomplished with e.g. ret_foo = first_ret + second_ret + third_ret.)

But, it turns out there's another issue that we would need to fix for this to even be an issue, because we currently do the ORing above even when we should not, on all variables e.g.

def maybe_return_the_arg(foo):
    whatup = foo
    whatup = 'no vuln'
    return whatup

says the return value is tainted.

Where it should only say it's tainted when e.g.

def maybe_return_the_arg(foo):
    if foo.startswith('h'): # Ignore the actual condition, we can statically SMT solve a year from now.
        return 'huh' + foo
    else:
        return 'apple'

if we decide to fix the 1st example in this comment, then I'm pretty sure we'll introduce a false-negative in the 2nd example, because it will see ret_maybe_return_the_arg = 'apple' and think the tainted value (assigned in ret_maybe_return_the_arg = 'huh' + foo was overwritten.)

Overall, finding out that the 2nd example works right now, and thinking the 1st example is contrived and people don't normally write code like that, I care a lot less about any of this. But I'm glad I doc'd it ;)

If we were going to fix the first example, it would probably be in https://github.com/python-security/pyt/blob/master/pyt/reaching_definitions_taint.py but I don't plan on working on this any time soon.

@KevinHock KevinHock changed the title Handle multiple returns (Not an issue right now) Handle multiple returns Jun 22, 2017
@KevinHock
Copy link
Collaborator Author

To do the ORing ( e.g. ret_foo = first_ret + second_ret + third_ret) you could easily do it in return_connection_handler.

@adrianbn
Copy link
Contributor

Disclaimer: I may be saying things here that are known by the contributors to pyt. I'm sorry if this is just too obvious at this point.

I think this is an example of flow/path insensitivity, right? The first example says that pyt is flow insensitive, and the second one that pyt is path insensitive. Most commercial analyzers are flow sensitive, but not path sensitive due to the exponential explosion of path constraints involved.

I just started looking into pyt so I'm not quite sure yet how it performs taint analysis, but it looks like it assigns constraints to the nodes / qualifiers to the variables and uses reaching definitions + fixed point to solve them to get a result? I would have to look more into reaching definitions to see if the following applies. If it does, we're on a good place to solve at least the first issue described here.

In the example:

def maybe_return_the_arg(foo):
    whatup = foo
    whatup = 'no vuln'
    return whatup

we would need to assign each assignment to whatup a different qualifier, whatup1, and whatup2. Then the value that reaches the return is whatup2 which is not tainted. This will result in a correct analysis. This is equivalent to rewriting the program in Static Single Assignment (SSA).

Although I agree with @KevinHock that code like the above is rare, this issue can materialize in different ways that are more plausible.

value = request.form.get("value")
while(some_criteria(value)):
    r = sink(value)
    value = r
name = request.form.get('name')
name = sanitize(name)
sink(name)

I'd argue those examples, specially the second one warrant trying to address this issue.

On the other hand, in the example:

def maybe_return_the_arg(foo):
    if foo.startswith('h'): # Ignore the actual condition, we can statically SMT solve a year from now.
        return 'huh' + foo
    else:
        return 'apple'

we would need to add a path condition to each constraint. That will allow pyt to solve for those constraints that have compatible path conditions (e.g. foo.startswith('h')). In this particular case, I don't think pyt can solve foo.startswith('h') if foo is a source, so it should probably report an issue with a lower confidence level. In certain circumstances it may be possible to tell whether the path condition is true and compatible with other paths along the code. There's a video from University of Maryland that covers this very concisely: https://www.youtube.com/watch?v=UhY8_Wb40B0

Again sorry if I'm being captain obvious or too naive. I'm just trying to understand how pyt works and finding areas I can contribute to and I got to this open issue.

@KevinHock
Copy link
Collaborator Author

KevinHock commented Nov 22, 2018

You're definitely right about everything you just said @adrianbn! Thank you for contributing to and being interested in PyT :)

It turns out that PyT currently is flow-sensitive, and I didn't update this GitHub issue, my bad! 🤐 Here is where the vulnerability finding happens, I initially thought I fixed this when I added def-use chains, in this PR, but I put a print statement after if tainted_node_in_sink_arg: and it never got executed, so it must have been fixed before that. I might have fixed some bug in the reaching definitions code a long time ago and forgot exactly what.

Repro: I re-read what I had written in this issue and thought, "That seems weird, we should kill the previous definition? Maybe it is specific to return values of functions." I went to that part of the code, and figured I'd run it on an example to sanity check.

Before changing, we run:

(code) Kevins-MacBook-Pro:pyt kevin$ python -m pyt examples/vulnerable_code/command_injection.py 
1 vulnerability found:
Vulnerability 1:
File: examples/vulnerable_code/command_injection.py
 > User input at line 15, source "form[":
	 param = request.form['suggestion']
Reassigned in:
	File: examples/vulnerable_code/command_injection.py
	 > Line 16: command = 'echo ' + param + ' >> ' + 'menu.txt'
File: examples/vulnerable_code/command_injection.py
 > reaches line 18, sink "subprocess.call(":
	~call_1 = ret_subprocess.call(command, shell=True)

After changing, let's run:

(code) Kevins-MacBook-Pro:pyt kevin$ git diff
diff --git a/examples/vulnerable_code/command_injection.py b/examples/vulnerable_code/command_injection.py
index 44c45d9..8025f84 100644
--- a/examples/vulnerable_code/command_injection.py
+++ b/examples/vulnerable_code/command_injection.py
@@ -13,6 +13,7 @@ def index():
 @app.route('/menu', methods=['POST'])
 def menu():
     param = request.form['suggestion']
+    param = 'kill that def!'
     command = 'echo ' + param + ' >> ' + 'menu.txt'
 
     subprocess.call(command, shell=True)
(code) Kevins-MacBook-Pro:pyt kevin$ python -m pyt examples/vulnerable_code/command_injection.py 
No vulnerabilities found.

So that's good, maybe this is specific to return values, let me try my example.

(code) Kevins-MacBook-Pro:pyt kevin$ git diff
diff --git a/examples/vulnerable_code/command_injection.py b/examples/vulnerable_code/command_injection.py
index 44c45d9..7745b0a 100644
--- a/examples/vulnerable_code/command_injection.py
+++ b/examples/vulnerable_code/command_injection.py
@@ -10,10 +10,16 @@ def index():
 
     return render_template('command_injection.html', menu=menu)
 
+def is_it_just_return_values(tainted_param):
+    ret_me = tainted_param
+    ret_me = 'no vuln'
+    return ret_me
+
 @app.route('/menu', methods=['POST'])
 def menu():
     param = request.form['suggestion']
-    command = 'echo ' + param + ' >> ' + 'menu.txt'
+    hey = is_it_just_return_values(param)
+    command = 'echo ' + hey + ' >> ' + 'menu.txt'
 
     subprocess.call(command, shell=True)
 
(code) Kevins-MacBook-Pro:pyt kevin$ python -m pyt examples/vulnerable_code/command_injection.py 
No vulnerabilities found.

Okay, so that's good, I should have updated this Github issue.

Comment out the overwriting assignment:

(code) Kevins-MacBook-Pro:pyt kevin$ clear;git diff

diff --git a/examples/vulnerable_code/command_injection.py b/examples/vulnerable_code/command_injection.py
index 44c45d9..02d8ba8 100644
--- a/examples/vulnerable_code/command_injection.py
+++ b/examples/vulnerable_code/command_injection.py
@@ -10,10 +10,16 @@ def index():
 
     return render_template('command_injection.html', menu=menu)
 
+def is_it_just_return_values(tainted_param):
+    ret_me = tainted_param
+    # ret_me = 'no vuln'
+    return ret_me
+
 @app.route('/menu', methods=['POST'])
 def menu():
     param = request.form['suggestion']
-    command = 'echo ' + param + ' >> ' + 'menu.txt'
+    hey = is_it_just_return_values(param)
+    command = 'echo ' + hey + ' >> ' + 'menu.txt'
 
     subprocess.call(command, shell=True)

and we get

(code) Kevins-MacBook-Pro:pyt kevin$ python -m pyt examples/vulnerable_code/command_injection.py 
1 vulnerability found:
Vulnerability 1:
File: examples/vulnerable_code/command_injection.py
 > User input at line 20, source "form[":
	 param = request.form['suggestion']
Reassigned in:
	File: examples/vulnerable_code/command_injection.py
	 > Line 13: save_1_param = param
	File: examples/vulnerable_code/command_injection.py
	 > Line 21: temp_1_tainted_param = param
	File: examples/vulnerable_code/command_injection.py
	 > Line 13: tainted_param = temp_1_tainted_param
	File: examples/vulnerable_code/command_injection.py
	 > Line 14: ret_me = tainted_param
	File: examples/vulnerable_code/command_injection.py
	 > Line 16: ret_is_it_just_return_values = ret_me
	File: examples/vulnerable_code/command_injection.py
	 > Line 21: ~call_1 = ret_is_it_just_return_values
	File: examples/vulnerable_code/command_injection.py
	 > Line 21: hey = ~call_1
	File: examples/vulnerable_code/command_injection.py
	 > Line 22: command = 'echo ' + hey + ' >> ' + 'menu.txt'
File: examples/vulnerable_code/command_injection.py
 > reaches line 24, sink "subprocess.call(":
	~call_2 = ret_subprocess.call(command, shell=True)

💥

@KevinHock
Copy link
Collaborator Author

I'm not quite sure yet how it performs taint analysis, but it looks like it assigns constraints to the nodes / qualifiers to the variables and uses reaching definitions + fixed point to solve them to get a result?

I'm not certain what constraints/qualifiers mean, but the we mostly do reaching definitions + fixed point. I wrote some documentation in this readme, but let me know if you'd like me to clarify anything :D

In this particular case, I don't think pyt can solve if foo.startswith('h'): if foo is a source, so it should probably report an issue with a lower confidence level.

We sort of kind of do that with the following:

elif isinstance(cfg_node, IfNode):
potential_sanitiser = cfg_node

Note: potential_sanitiser is the only hack here, it is because we do not take p-use's into account yet.
e.g. we can only say potentially instead of definitely sanitised in the path_traversal_sanitised_2.py test.

This vulnerability is potentially sanitised by: Label: if '..' in image_name:

Again sorry if I'm being captain obvious or too naive. I'm just trying to understand how pyt works and finding areas I can contribute to and I got to this open issue.

No worries! I'm glad you are interested @adrianbn 😁 One area you could contribute to, although it's rather difficult, is that we currently use inlining instead of summaries, for inter-procedural analysis, which makes PyT slower than it needs to be.

Here are some videos, specifically the last one, explains function summaries well:
#57 Call Graphs
#58 Interprocedural Data Flow Analysis
#59 Procedure Summaries in Data Flow Analysis

Feel free to ping me with any questions. My current open PR tries to address certain expressions, so we can handle BoolOp's and IfExp's etc., I think you've motivated me to work on it / clean it up 👍 (It sort of became like a big chapter in a book I need a long plane ride to finish or something.)

@KevinHock
Copy link
Collaborator Author

We could try to turn Python code into SSA, and prune it via liveness, and add predicates so it's PSSA, and try to solve those predicates with e.g. z3str3 etc. or something similar. But handling all node types, being performant, and being battle tested (i.e. no recursion bugs or crashes), is higher priority to me, b/c I want it to be usable and run-able on tons of codebases.

@KevinHock
Copy link
Collaborator Author

I'm gonna go ahead and close this issue 👍 b/c it was solved before.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants