Skip to content

Shared: Generate more value-preserving flow summaries #19443

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
Merged
Changes from 1 commit
Commits
Show all changes
17 commits
Select commit Hold shift + click to select a range
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -345,12 +345,14 @@ module MakeModelGeneratorFactory<
/**
* Gets the summary model of `api`, if it follows the `fluent` programming pattern (returns `this`).
*/
private string captureQualifierFlow(DataFlowSummaryTargetApi api) {
private string captureQualifierFlow(DataFlowSummaryTargetApi api, string input, string output) {
exists(ReturnNodeExt ret |
api = returnNodeEnclosingCallable(ret) and
isOwnInstanceAccessNode(ret)
) and
result = ModelPrintingSummary::asLiftedValueModel(api, qualifierString(), "ReturnValue")
input = qualifierString() and
output = "ReturnValue" and
result = ModelPrintingSummary::asLiftedValueModel(api, input, output)
}

private int accessPathLimit0() { result = 2 }
Expand Down Expand Up @@ -430,7 +432,7 @@ module MakeModelGeneratorFactory<
predicate isSink(DataFlow::Node sink) {
sink instanceof ReturnNodeExt and
not isOwnInstanceAccessNode(sink) and
not exists(captureQualifierFlow(getAsExprEnclosingCallable(sink)))
not exists(captureQualifierFlow(getAsExprEnclosingCallable(sink), _, _))
}

predicate isAdditionalFlowStep = PropagateFlowConfigInput::isAdditionalFlowStep/4;
Expand Down Expand Up @@ -559,11 +561,24 @@ module MakeModelGeneratorFactory<
*
* `preservesValue` is `true` if the summary is value-preserving, and `false` otherwise.
*/
private string captureThroughFlow(DataFlowSummaryTargetApi api, boolean preservesValue) {
exists(string input, string output |
preservesValue = max(boolean b | captureThroughFlow0(api, _, input, _, output, b)) and
result = ModelPrintingSummary::asLiftedModel(api, input, output, preservesValue)
)
private string captureThroughFlow(
DataFlowSummaryTargetApi api, string input, string output, boolean preservesValue
) {
preservesValue = max(boolean b | captureThroughFlow0(api, _, input, _, output, b)) and
result = ModelPrintingSummary::asLiftedModel(api, input, output, preservesValue)
}

/**
* Gets the summary model(s) of `api`, if there is flow `input` to
* `output`. `preservesValue` is `true` if the summary is value-
* preserving, and `false` otherwise.
*/
string captureFlow(
DataFlowSummaryTargetApi api, string input, string output, boolean preservesValue
) {
result = captureQualifierFlow(api, input, output) and preservesValue = true
or
result = captureThroughFlow(api, input, output, preservesValue)
}

/**
Expand All @@ -573,9 +588,7 @@ module MakeModelGeneratorFactory<
* `preservesValue` is `true` if the summary is value-preserving, and `false` otherwise.
*/
string captureFlow(DataFlowSummaryTargetApi api, boolean preservesValue) {
result = captureQualifierFlow(api) and preservesValue = true
or
result = captureThroughFlow(api, preservesValue)
result = captureFlow(api, _, _, preservesValue)
}

/**
Expand Down Expand Up @@ -947,19 +960,20 @@ module MakeModelGeneratorFactory<
}

/**
* Gets the content based summary model(s) of the API `api` (if there is flow from a parameter to
* the return value or a parameter). `lift` is true, if the model should be lifted, otherwise false.
* Gets the content based summary model(s) of the API `api` (if there is flow from `input` to
* `output`). `lift` is true, if the model should be lifted, otherwise false.
* `preservesValue` is `true` if the summary is value-preserving, and `false` otherwise.
*
* Models are lifted to the best type in case the read and store access paths do not
* contain a field or synthetic field access.
*/
string captureFlow(ContentDataFlowSummaryTargetApi api, boolean lift, boolean preservesValue) {
exists(string input, string output |
captureFlow0(api, input, output, _, lift) and
preservesValue = max(boolean p | captureFlow0(api, input, output, p, lift)) and
result = ContentModelPrinting::asModel(api, input, output, preservesValue, lift)
)
string captureFlow(
ContentDataFlowSummaryTargetApi api, string input, string output, boolean lift,
boolean preservesValue
) {
captureFlow0(api, input, output, _, lift) and
preservesValue = max(boolean p | captureFlow0(api, input, output, p, lift)) and
result = ContentModelPrinting::asModel(api, input, output, preservesValue, lift)
}
}

Expand All @@ -972,23 +986,30 @@ module MakeModelGeneratorFactory<
* generate flow summaries using the heuristic based summary generator.
*/
string captureFlow(DataFlowSummaryTargetApi api, boolean lift) {
exists(boolean preservesValue |
result = ContentSensitive::captureFlow(api, lift, preservesValue)
or
not exists(DataFlowSummaryTargetApi api0 |
// If the heuristic summary is value-preserving then we keep both
// summaries. However, if we can generate any content-sensitive
// summary (value-preserving or not) then we don't include any taint-
// based heuristic summary.
preservesValue = false
result = ContentSensitive::captureFlow(api, _, _, lift, _)
or
exists(boolean preservesValue, string input, string output |
not exists(
DataFlowSummaryTargetApi api0, string input0, string output0, boolean preservesValue0
|
// If the heuristic summary is taint-based, and we can generate a content-sensitive
// summary that is value-preserving then we omit generating any heuristic summary.
preservesValue = false and
preservesValue0 = true
or
// However, if they're both value-preserving (or both taint-based) then we only
// generate a heuristic summary if we didn't generate a content-sensitive summary.
preservesValue = preservesValue0 and
input0 = input and
output0 = output
|
(api0 = api or api.lift() = api0) and
exists(ContentSensitive::captureFlow(api0, false, _))
exists(ContentSensitive::captureFlow(api0, input0, output0, false, preservesValue0))
or
api0.lift() = api.lift() and
exists(ContentSensitive::captureFlow(api0, true, _))
exists(ContentSensitive::captureFlow(api0, input0, output0, true, preservesValue0))
) and
result = Heuristic::captureFlow(api, preservesValue) and
result = Heuristic::captureFlow(api, input, output, preservesValue) and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change does a bit more than de-duplicating the generated summaries.
Prior to the change, a heuristic summary (value or taint) wouldn't be generated (at all) in a case there existed any content sensitive summary for the API. With this change a heuristic summary is generated on an input/output pair in case there doesn't exist a content sensitive summary for the same input/output pair for the given API.
My original comment merely stated that we should try and remove duplicate entries (where the only difference was the provenance df-generated/dfc-generated).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, that's a good point. I think the behavior you want is what we get if we get rid of this disjunct:

// If the heuristic summary is taint-based, and we can generate a content-sensitive
// summary that is value-preserving then we omit generating any heuristic summary.
preservesValue = false and
preservesValue0 = true

right? Then we'd only generate a heuristic summary if we fail to generate a content-sensitive with exactly the same (input, output, kind) tuple.

I'm happy to do that if you'd rather want this behavior?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realised that my previous comment is not entirely correct.
The issue only pertains to taint like summaries and the cause is the other disjunkt:

preservesValue = preservesValue0 and
input0 = input and
output0 = output

That is, we only exclude a summary df-generated taint summary in case there doesn't exist a dfc-generated generated taint summary on the same input/output (string) pair. However, this causes a problem as the df-generated taint summaries truncates field/property accesses in the input/output strings (but the dataflow configuration allows a "limited" set of field/property reads/stores - which means we start producing field conflation issues with taint summaries).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh! Hm, yeah I can see how that's a problem. Should we delay deduplication until we see whether it's actually a problem in practice? I'd really like to somehow get this working as the taint summaries we get for OpenSSL isn't sufficient for what we need them for 😭

Copy link
Contributor

@michaelnebel michaelnebel May 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried re-generating the .NET runtime models - and we do get some models that we don't want.
Also, I noticed that the printing of heuristic models might need to change for the exact value preserving models.
Is it ok if I experiment a bit and make some changes to the PR?
Considering to make the filtering like this:

// If the heuristic summary is taint-based, and we can generate a content-sensitive
// summary then we omit generating the heuristic summary.
preservesValue = false
or
// If they're both value-preserving then we only generate a heuristic summary if
// we didn't generate a content-sensitive summary on the same input/output pair.
preservesValue = true and
preservesValue0 = true and
input0 = input and
output0 = output

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for regenerating the .NET runtime models 🙇

Yeah, feel free to experiment on the PR all you want. Let me know if I can do something to help out!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should also add: I think the filtering you propose there makes a lot of sense, and I also just checked that it still gives us the models we care about 🥳

Copy link
Contributor

@michaelnebel michaelnebel May 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! 😄
I couldn't push the changes to this branch - could you cherry pick the last six commits from (when CI is green): #19481
The commits does the following:

  • Filter summaries based on the idea proposed above, which means that
    • For value based summaries we keep both heuristic and content-based (but de-duplicating based on provenance and keep the dfc-generated).
    • For taint based heuristic summaries we only keep these in case there doesn't exist any (value or taint) content-based summary for the same api.
  • Adjust the printing of value heuristic models. Taint steps allow (at least for C#) reading from a collection (as a taint step), which means that we used the type of the input to "guess" whether it was likely that we just read an element of a collection. However, this doesn't apply for value models.
  • Fix an unrelated issue of the printing of source models (as these don't support .Element etc).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fantastic, Michael! I've pushed your commits now 🙇

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent - and thank you very much for contributing to the model generator @MathiasVP - I really appreciate it!
I will start some DCA runs to check whether this impacted performance.

lift = true
)
}
Expand Down