Python: Better path explanations #5057

yoff · 2021-01-29T20:07:22Z

Purpose

This PR obtains nice path explanations by removing certain default taint-steps.
To compensate, it adds the notion of recursive content and a simulated flow summary for json.loads.
It is here so that we can test correctness and performance, and so we can discuss its fate.

Questions

Do we want all this in one PR or do we prefer 3 (recursive content, simulated flow steps, remove default taint)?
Do we need simulated flow summaries for other methods?
Should we actually wait for flow summaries being implemented rather than simulate them now?

Caveats

The main caveat is that some taint-tracking tests fail. In particular tests of the form

   l = [TAINTED_STRING]
   ensure_tainted(l)

and

   l = TAINTED_LIST
   ensure_tainted(l[0])

The former fails because l itself is no longer considered tainted, only its content is (and the difference is now noted).
The latter fails because we have not assigned content to TAINTED_LIST. We will have to evaluate if this is a show-stopper.

A secondary caveat is that no investigation has been made into what flow summaries are currently needed. A dist-compare should reveal this.

Tests

Differences job to test correctness and performance (showing no concerns).

Differences job on slugs from https://github.com/github/codeql-python-team/issues/288.

Dist-compare not yet run.

in terms of store steps with abstracted content

subscript and for.

RasmusWL

jsonLoadsStoreStep was a bit hard to read 😄 I read it initially as json load & store step, and not as json.loads store step 😄

Caveats

I think we need to be able to handle both caveats.

sink(some_tainted_list) where sink is a call to an external library that executes commands. (we have an internal issue for that here)
A list being tainted without a store step. This is for example the case with self.path_args in a tornado request handler class. This is a list containing tainted elements, but there is no explicit store step.

As far as I recall, the problem with the 2 above things is not so much that some of our current tests will now fail, but that the setup that Tom made doesn't support these instances out of the box. Right? -- I seem to recall him talking about it being possible to synthesize nodes to make it work though.

Conclusion

I've been postponing flow-summaries for a bit to get things working (in general). If really investing in flow-summaries is what we need to get proper path explanations, I guess we'll just have to bite the bullet.

Do you have an example at hand that illustrates how much better the path explanation becomes when making this change?

python/ql/src/semmle/python/dataflow/new/internal/DataFlowPrivate.qll

RasmusWL · 2021-02-10T10:08:28Z

python/ql/src/semmle/python/dataflow/new/internal/DataFlowPrivate.qll

+  or
+  jsonLoadsStoreStep(nodeFrom, c, nodeTo)


Like with additional taint steps, I think we should use the same a mechanism to allow defining these in each framework implementation file (a minor nitpick I think).

I completely agree that it cannot be in the core library like this. I believe that is what flow-summaries are for and that is the reason I have brought them up in this context.

RasmusWL · 2021-02-10T10:16:33Z

python/ql/src/semmle/python/dataflow/new/internal/DataFlowPrivate.qll

+predicate jsonLoadsStoreStep(CfgNode nodeFrom, Content c, Node nodeTo) {
+  exists(CallNode call, ControlFlowNode json |
+    json.isAttribute() and
+    json.getNode().(Attribute).getName() = "loads" and
+    json.getNode().(Attribute).getObject().(Name).getId() = "json" and
+    call.getFunction() = json and
+    nodeFrom.asCfgNode() = call.getArg(0) and
+    nodeTo.asCfgNode() = call and
+    c instanceof RecursiveElementContent
+  )
+}


Recently we discussed whether type-tracking and data-flow would end up recursively depending on each other, or if we could have a strictly layered approach (type-tracking depends on data-flow, but data-flow can't depend on type-tracking)...

My assumption (without thinking about performance implications) has always been that we need type-tracking to be able to define some data-flow things. However, the general assumption seemed to be that we could avoid this.

I think this is a prime example of needing type-tracking/API graphs to track json.loads (like we currently do), so we can define this store step properly.

I guess a discussion in person is better suited to figure out what to do about this

Indeed, as I understand it, flow summaries gives you a hook to define load and store steps. As I see it, we would use type trackers to select the function, so it all becomes recursive in the end.

So, here's my view on this. Type trackers need two things in order to function: local data flow and call graphs. The call graph is implemented using type tracking, so there's recursion here.

Type trackers cannot depend on global data flow. That would be catastrophic, as any change in flow (e.g. addition of flow steps) would then force a recomputation of everything that depends on type tracking, including the call graph.

So, if we decide to use API graphs to specify things like json.loads, then it must be the case that this does not end up affecting type tracking itself (and hence this cannot affect local flow either).

This suggests that at the very least, content flow steps of various kinds must be kept out of the "simple local flow step" core that type tracking depends on. We may need to split up our definition of local flow for this, having one for "local flow for type trackers" and one (most likely a superset) for "local flow for global flow".

Totally agree that we cannot let type-trackers depend on global data-flow. However, I don't see the connection between that and the conclusion then it must be the case that this does not end up affecting type tracking itself (and hence this cannot affect local flow either).

(I don't actually know if modeling with flow-summaries ends up affecting local flow or not, to be totally honest)

Totally agree that we cannot let type-trackers depend on global data-flow. However, I don't see the connection between that and the conclusion then it must be the case that this does not end up affecting type tracking itself (and hence this cannot affect local flow either).

I think you may have misread what I wrote. I'm not concluding anything -- I'm stating a prerequisite that must be satisfied if we want to avoid a huge performance impact.

Perhaps this is best explained by considering a few cases. My observation basically boils down to the following question: "What is affected by adding a flow summary?"

Case 1: Only global data flow is affected. The set of local flow steps remains unchanged.

In this case, there's no issue, and we can freely use API graphs to specify flow summaries.

Case 2: Adding a flow summary affects local (and by extension global) data flow, but can be done "universally" across all queries.

In this case using API graphs to specify flow summaries will introduce recursion between type tracking and the definition of the flow summaries (since the flow summaries use API graphs in their specification, but in turn affect the local flow basis of type tracking).

This could still be okay, since we're only paying the penalty once, but I don't know how realistic the "universality" is.

Case 3: Adding a flow summary affects local (and by extension global) data flow, and different queries require different flow summary extensions.

This one is especially bad, since we'll now end up recalculating type trackers and call graphs for each query that extends the meaning of local flow.

To be honest, I don't like case 2 much either. I think we'll have a much more well-behaved system if we can stick to case 1.

github-actions bot added the Python label Jan 29, 2021

yoff added 5 commits February 4, 2021 17:11

Python: More taint propagation when formatting.

ae97a02

Python: simulate flow summary for json.loads

c56fb33

Python: experiment with rewriting taint steps

f3b63e3

in terms of store steps with abstracted content

Python: Add recursive content

0ecc8c6

Python: Turn off default taint steps for

695e593

subscript and for.

yoff force-pushed the python-dataflow-unpacking-assigment-experiments branch from 01f0d05 to 695e593 Compare February 4, 2021 16:16

RasmusWL reviewed Feb 10, 2021

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Python: Better path explanations #5057

Python: Better path explanations #5057

Uh oh!

yoff commented Jan 29, 2021 •

edited

Loading

Uh oh!

RasmusWL left a comment •

edited

Loading

Uh oh!

Uh oh!

RasmusWL Feb 10, 2021

Uh oh!

yoff Feb 10, 2021

Uh oh!

RasmusWL Feb 10, 2021

Uh oh!

yoff Feb 10, 2021

Uh oh!

tausbn Feb 16, 2021

Uh oh!

RasmusWL Feb 16, 2021

Uh oh!

tausbn Feb 16, 2021

Uh oh!

Uh oh!

Python: Better path explanations #5057

Are you sure you want to change the base?

Python: Better path explanations #5057

Uh oh!

Conversation

yoff commented Jan 29, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

Questions

Caveats

Tests

Uh oh!

RasmusWL left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Caveats

Conclusion

Uh oh!

Uh oh!

RasmusWL Feb 10, 2021

Choose a reason for hiding this comment

Uh oh!

yoff Feb 10, 2021

Choose a reason for hiding this comment

Uh oh!

RasmusWL Feb 10, 2021

Choose a reason for hiding this comment

Uh oh!

yoff Feb 10, 2021

Choose a reason for hiding this comment

Uh oh!

tausbn Feb 16, 2021

Choose a reason for hiding this comment

Uh oh!

RasmusWL Feb 16, 2021

Choose a reason for hiding this comment

Uh oh!

tausbn Feb 16, 2021

Choose a reason for hiding this comment

Case 1: Only global data flow is affected. The set of local flow steps remains unchanged.

Case 2: Adding a flow summary affects local (and by extension global) data flow, but can be done "universally" across all queries.

Case 3: Adding a flow summary affects local (and by extension global) data flow, and different queries require different flow summary extensions.

Uh oh!

Uh oh!

yoff commented Jan 29, 2021 •

edited

Loading

RasmusWL left a comment •

edited

Loading