Skip to content

Python: Better path explanations #5057

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

yoff
Copy link
Contributor

@yoff yoff commented Jan 29, 2021

Purpose

This PR obtains nice path explanations by removing certain default taint-steps.
To compensate, it adds the notion of recursive content and a simulated flow summary for json.loads.
It is here so that we can test correctness and performance, and so we can discuss its fate.

Questions

  • Do we want all this in one PR or do we prefer 3 (recursive content, simulated flow steps, remove default taint)?
  • Do we need simulated flow summaries for other methods?
  • Should we actually wait for flow summaries being implemented rather than simulate them now?

Caveats

The main caveat is that some taint-tracking tests fail. In particular tests of the form

   l = [TAINTED_STRING]
   ensure_tainted(l)

and

   l = TAINTED_LIST
   ensure_tainted(l[0])

The former fails because l itself is no longer considered tainted, only its content is (and the difference is now noted).
The latter fails because we have not assigned content to TAINTED_LIST. We will have to evaluate if this is a show-stopper.

A secondary caveat is that no investigation has been made into what flow summaries are currently needed. A dist-compare should reveal this.

Tests

Differences job to test correctness and performance (showing no concerns).

Differences job on slugs from https://github.com/github/codeql-python-team/issues/288.

Dist-compare not yet run.

@yoff yoff force-pushed the python-dataflow-unpacking-assigment-experiments branch from 01f0d05 to 695e593 Compare February 4, 2021 16:16
Copy link
Member

@RasmusWL RasmusWL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

jsonLoadsStoreStep was a bit hard to read 😄 I read it initially as json load & store step, and not as json.loads store step 😄

Caveats

I think we need to be able to handle both caveats.

  1. sink(some_tainted_list) where sink is a call to an external library that executes commands. (we have an internal issue for that here)
  2. A list being tainted without a store step. This is for example the case with self.path_args in a tornado request handler class. This is a list containing tainted elements, but there is no explicit store step.

As far as I recall, the problem with the 2 above things is not so much that some of our current tests will now fail, but that the setup that Tom made doesn't support these instances out of the box. Right? -- I seem to recall him talking about it being possible to synthesize nodes to make it work though.

Conclusion

I've been postponing flow-summaries for a bit to get things working (in general). If really investing in flow-summaries is what we need to get proper path explanations, I guess we'll just have to bite the bullet.

Do you have an example at hand that illustrates how much better the path explanation becomes when making this change?

Comment on lines +908 to +909
or
jsonLoadsStoreStep(nodeFrom, c, nodeTo)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like with additional taint steps, I think we should use the same a mechanism to allow defining these in each framework implementation file (a minor nitpick I think).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I completely agree that it cannot be in the core library like this. I believe that is what flow-summaries are for and that is the reason I have brought them up in this context.

Comment on lines +1044 to +1054
predicate jsonLoadsStoreStep(CfgNode nodeFrom, Content c, Node nodeTo) {
exists(CallNode call, ControlFlowNode json |
json.isAttribute() and
json.getNode().(Attribute).getName() = "loads" and
json.getNode().(Attribute).getObject().(Name).getId() = "json" and
call.getFunction() = json and
nodeFrom.asCfgNode() = call.getArg(0) and
nodeTo.asCfgNode() = call and
c instanceof RecursiveElementContent
)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recently we discussed whether type-tracking and data-flow would end up recursively depending on each other, or if we could have a strictly layered approach (type-tracking depends on data-flow, but data-flow can't depend on type-tracking)...

My assumption (without thinking about performance implications) has always been that we need type-tracking to be able to define some data-flow things. However, the general assumption seemed to be that we could avoid this.

I think this is a prime example of needing type-tracking/API graphs to track json.loads (like we currently do), so we can define this store step properly.

I guess a discussion in person is better suited to figure out what to do about this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, as I understand it, flow summaries gives you a hook to define load and store steps. As I see it, we would use type trackers to select the function, so it all becomes recursive in the end.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, here's my view on this. Type trackers need two things in order to function: local data flow and call graphs. The call graph is implemented using type tracking, so there's recursion here.

Type trackers cannot depend on global data flow. That would be catastrophic, as any change in flow (e.g. addition of flow steps) would then force a recomputation of everything that depends on type tracking, including the call graph.

So, if we decide to use API graphs to specify things like json.loads, then it must be the case that this does not end up affecting type tracking itself (and hence this cannot affect local flow either).

This suggests that at the very least, content flow steps of various kinds must be kept out of the "simple local flow step" core that type tracking depends on. We may need to split up our definition of local flow for this, having one for "local flow for type trackers" and one (most likely a superset) for "local flow for global flow".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally agree that we cannot let type-trackers depend on global data-flow. However, I don't see the connection between that and the conclusion then it must be the case that this does not end up affecting type tracking itself (and hence this cannot affect local flow either).

(I don't actually know if modeling with flow-summaries ends up affecting local flow or not, to be totally honest)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally agree that we cannot let type-trackers depend on global data-flow. However, I don't see the connection between that and the conclusion then it must be the case that this does not end up affecting type tracking itself (and hence this cannot affect local flow either).

I think you may have misread what I wrote. I'm not concluding anything -- I'm stating a prerequisite that must be satisfied if we want to avoid a huge performance impact.

Perhaps this is best explained by considering a few cases. My observation basically boils down to the following question: "What is affected by adding a flow summary?"

Case 1: Only global data flow is affected. The set of local flow steps remains unchanged.

In this case, there's no issue, and we can freely use API graphs to specify flow summaries.

Case 2: Adding a flow summary affects local (and by extension global) data flow, but can be done "universally" across all queries.

In this case using API graphs to specify flow summaries will introduce recursion between type tracking and the definition of the flow summaries (since the flow summaries use API graphs in their specification, but in turn affect the local flow basis of type tracking).

This could still be okay, since we're only paying the penalty once, but I don't know how realistic the "universality" is.

Case 3: Adding a flow summary affects local (and by extension global) data flow, and different queries require different flow summary extensions.

This one is especially bad, since we'll now end up recalculating type trackers and call graphs for each query that extends the meaning of local flow.

To be honest, I don't like case 2 much either. I think we'll have a much more well-behaved system if we can stick to case 1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants