-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Python: Better path explanations #5057
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Python: Better path explanations #5057
Conversation
in terms of store steps with abstracted content
subscript and for.
01f0d05
to
695e593
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
jsonLoadsStoreStep
was a bit hard to read 😄 I read it initially as json load & store step, and not as json.loads
store step 😄
Caveats
I think we need to be able to handle both caveats.
sink(some_tainted_list)
wheresink
is a call to an external library that executes commands. (we have an internal issue for that here)- A list being tainted without a store step. This is for example the case with
self.path_args
in a tornado request handler class. This is a list containing tainted elements, but there is no explicit store step.
As far as I recall, the problem with the 2 above things is not so much that some of our current tests will now fail, but that the setup that Tom made doesn't support these instances out of the box. Right? -- I seem to recall him talking about it being possible to synthesize nodes to make it work though.
Conclusion
I've been postponing flow-summaries for a bit to get things working (in general). If really investing in flow-summaries is what we need to get proper path explanations, I guess we'll just have to bite the bullet.
Do you have an example at hand that illustrates how much better the path explanation becomes when making this change?
or | ||
jsonLoadsStoreStep(nodeFrom, c, nodeTo) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like with additional taint steps, I think we should use the same a mechanism to allow defining these in each framework implementation file (a minor nitpick I think).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I completely agree that it cannot be in the core library like this. I believe that is what flow-summaries are for and that is the reason I have brought them up in this context.
predicate jsonLoadsStoreStep(CfgNode nodeFrom, Content c, Node nodeTo) { | ||
exists(CallNode call, ControlFlowNode json | | ||
json.isAttribute() and | ||
json.getNode().(Attribute).getName() = "loads" and | ||
json.getNode().(Attribute).getObject().(Name).getId() = "json" and | ||
call.getFunction() = json and | ||
nodeFrom.asCfgNode() = call.getArg(0) and | ||
nodeTo.asCfgNode() = call and | ||
c instanceof RecursiveElementContent | ||
) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Recently we discussed whether type-tracking and data-flow would end up recursively depending on each other, or if we could have a strictly layered approach (type-tracking depends on data-flow, but data-flow can't depend on type-tracking)...
My assumption (without thinking about performance implications) has always been that we need type-tracking to be able to define some data-flow things. However, the general assumption seemed to be that we could avoid this.
I think this is a prime example of needing type-tracking/API graphs to track json.loads
(like we currently do), so we can define this store step properly.
I guess a discussion in person is better suited to figure out what to do about this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, as I understand it, flow summaries gives you a hook to define load and store steps. As I see it, we would use type trackers to select the function, so it all becomes recursive in the end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, here's my view on this. Type trackers need two things in order to function: local data flow and call graphs. The call graph is implemented using type tracking, so there's recursion here.
Type trackers cannot depend on global data flow. That would be catastrophic, as any change in flow (e.g. addition of flow steps) would then force a recomputation of everything that depends on type tracking, including the call graph.
So, if we decide to use API graphs to specify things like json.loads
, then it must be the case that this does not end up affecting type tracking itself (and hence this cannot affect local flow either).
This suggests that at the very least, content flow steps of various kinds must be kept out of the "simple local flow step" core that type tracking depends on. We may need to split up our definition of local flow for this, having one for "local flow for type trackers" and one (most likely a superset) for "local flow for global flow".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Totally agree that we cannot let type-trackers depend on global data-flow. However, I don't see the connection between that and the conclusion then it must be the case that this does not end up affecting type tracking itself (and hence this cannot affect local flow either).
(I don't actually know if modeling with flow-summaries ends up affecting local flow or not, to be totally honest)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Totally agree that we cannot let type-trackers depend on global data-flow. However, I don't see the connection between that and the conclusion then it must be the case that this does not end up affecting type tracking itself (and hence this cannot affect local flow either).
I think you may have misread what I wrote. I'm not concluding anything -- I'm stating a prerequisite that must be satisfied if we want to avoid a huge performance impact.
Perhaps this is best explained by considering a few cases. My observation basically boils down to the following question: "What is affected by adding a flow summary?"
Case 1: Only global data flow is affected. The set of local flow steps remains unchanged.
In this case, there's no issue, and we can freely use API graphs to specify flow summaries.
Case 2: Adding a flow summary affects local (and by extension global) data flow, but can be done "universally" across all queries.
In this case using API graphs to specify flow summaries will introduce recursion between type tracking and the definition of the flow summaries (since the flow summaries use API graphs in their specification, but in turn affect the local flow basis of type tracking).
This could still be okay, since we're only paying the penalty once, but I don't know how realistic the "universality" is.
Case 3: Adding a flow summary affects local (and by extension global) data flow, and different queries require different flow summary extensions.
This one is especially bad, since we'll now end up recalculating type trackers and call graphs for each query that extends the meaning of local flow.
To be honest, I don't like case 2 much either. I think we'll have a much more well-behaved system if we can stick to case 1.
Purpose
This PR obtains nice path explanations by removing certain default taint-steps.
To compensate, it adds the notion of recursive content and a simulated flow summary for
json.loads
.It is here so that we can test correctness and performance, and so we can discuss its fate.
Questions
Caveats
The main caveat is that some taint-tracking tests fail. In particular tests of the form
and
The former fails because
l
itself is no longer considered tainted, only its content is (and the difference is now noted).The latter fails because we have not assigned content to
TAINTED_LIST
. We will have to evaluate if this is a show-stopper.A secondary caveat is that no investigation has been made into what flow summaries are currently needed. A dist-compare should reveal this.
Tests
Differences job to test correctness and performance (showing no concerns).
Differences job on slugs from https://github.com/github/codeql-python-team/issues/288.
Dist-compare not yet run.