
Conversation

@martinthomson (Member) commented on Jun 20, 2025

Prior to this, the algorithm for deducting privacy budget used the L1 norm of the histogram, which the analysis shows is only possible when impressions from a single epoch are involved.

Again, this is pretty suboptimal if implemented directly, which is a recurring pattern. This iterates over all impressions twice. I didn't bother to cache the impressions that were selected. I didn't even bother to break out of the loop when the count hits two. Those would just make the "code" harder to read than it already is.

This is contrary to what is implied in #78.
Still, this closes #78, just in the opposite way to what was envisaged.
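
Roughly, the logic being described amounts to something like the following sketch. All identifiers (`Impression`, `match_value`, and so on) are illustrative, not spec-defined:

```python
from dataclasses import dataclass

@dataclass
class Impression:
    epoch: int
    match_value: str

def count_epochs_with_matching_impressions(impressions, match_value):
    """Count distinct epochs that contain at least one matching impression.

    A naive sketch of the intent described above; a real implementation could
    cache the matching impressions and stop as soon as the count reaches two.
    """
    epochs = set()
    for impression in impressions:
        if impression.match_value == match_value:
            epochs.add(impression.epoch)
    return len(epochs)

# The PR, as written, would fall back to single-epoch accounting whenever at
# most one epoch has matching impressions.
impressions = [Impression(epoch=3, match_value="ad-campaign"),
               Impression(epoch=4, match_value="other")]
single_epoch = count_epochs_with_matching_impressions(impressions, "ad-campaign") <= 1
```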


@martinthomson requested a review from @csharrison on Jun 20, 2025 at 05:42
@csharrison (Collaborator) commented:

This doesn't look correct to me, and I think the existing spec is fine. The sensitivity parameter passed to the deduct privacy budget algorithm is the maximum value the attributed histogram contribution could hold, regardless of the attribution algorithm. As such, it is always OK to deduct budget proportional to sensitivity * epsilon / globalSensitivity.

The L1 norm of the contribution vector is not used in the current spec (nor in this PR). If we supported scaling noise based on this, we would need to output the actual L1 norm from the attribution algorithm and deduct privacy budget after the histogram is filled (e.g. after 4.4.1.6). In this case, for some attribution algorithms, the L1 norm would be smaller than sensitivity.
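
As a rough illustration of the deduction being described, here is a minimal sketch. Parameter names and the overflow behaviour are assumptions for this example, not the spec's:

```python
def deduct_privacy_budget(remaining_epsilon, epsilon, sensitivity, global_sensitivity):
    """Deduct budget proportional to the worst-case contribution.

    `sensitivity` bounds the attributed histogram contribution regardless of
    the attribution algorithm, so charging sensitivity / global_sensitivity
    of epsilon is always safe. Illustrative only; not taken from the spec.
    """
    charge = (sensitivity / global_sensitivity) * epsilon
    if charge > remaining_epsilon:
        raise RuntimeError("insufficient privacy budget for this epoch")
    return remaining_epsilon - charge

# Example: a report worth at most 3 out of a global cap of 12, under epsilon = 1.0.
remaining = deduct_privacy_budget(remaining_epsilon=1.0, epsilon=1.0,
                                  sensitivity=3, global_sensitivity=12)
```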

@tholop commented on Jun 24, 2025

I agree with @csharrison. Moreover, if I understand correctly, the change proposed in this PR enables or disables multi-epoch accounting based on epochCount, which is obtained by counting how many epochs have matching impressions. This does not seem right, because the single-epoch optimization says that we can use the L1 norm instead of the global sensitivity only if the attribution window contains a single epoch, independently of any impression data. For instance, if the attribution window contains 2 epochs but one epoch has no matching impressions (e.g., wrong topLevelSite or matchValue), then we should still do multi-epoch accounting.
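
Concretely, the check would key off the attribution window alone, along these lines (epoch numbering and names are assumptions made for this sketch):

```python
def single_epoch_accounting_allowed(window_start_epoch, window_end_epoch):
    """True only when the attribution window covers a single epoch.

    The decision depends solely on the window, never on which epochs happen
    to contain matching impressions. Inclusive integer epoch indices are an
    assumption for this illustration.
    """
    return window_start_epoch == window_end_epoch

# A two-epoch window needs multi-epoch accounting even if one of the epochs
# has no matching impressions.
assert single_epoch_accounting_allowed(10, 10) is True
assert single_epoch_accounting_allowed(9, 10) is False
```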

In case you'd like extra context, here is our reference implementation for Cookie Monster. There is a bit of extra code irrelevant for PPA Level 1 (epoch-source losses and two-phase commit are for Big Bird).

Also, one question: are you certain that your report global sensitivity is just value, and not 2 * value? For the family of histogram queries we considered in Cookie Monster, we obtained a factor 2 in the general case (https://github.com/columbia/pdslib/blob/19102c494009cd22045eb4971750505fc2c98b2b/src/queries/histogram.rs#L140), because removing all the impressions from an epoch could result in a histogram with the same attributed value allocated to completely different buckets. The factor 2 disappears if we have only 1 epoch or only 1 bucket.
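
A toy example of where the factor of 2 comes from (hypothetical values and bucket names, not from the spec):

```python
# With two epochs, removing all impressions from epoch 1 can cause attribution
# to fall back to an impression from epoch 2, moving the whole attributed value
# to a different bucket. The L1 distance between the neighbouring histograms is
# then 2 * value, not value.
value = 5
with_epoch_1 = {"bucket_a": value}     # epoch 1's impression wins attribution
without_epoch_1 = {"bucket_b": value}  # epoch 2's impression wins instead

buckets = set(with_epoch_1) | set(without_epoch_1)
l1_distance = sum(abs(with_epoch_1.get(b, 0) - without_epoch_1.get(b, 0))
                  for b in buckets)
assert l1_distance == 2 * value  # collapses to `value` with one epoch or one bucket
```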

@csharrison (Collaborator) commented:

> Moreover, if I understand correctly, the change proposed in this PR enables or disables multi-epoch accounting based on epochCount, which is obtained by counting how many epochs have matching impressions. This does not seem right, because the single-epoch optimization says that we can use the L1 norm instead of the global sensitivity only if the attribution window contains a single epoch, independently of any impression data.

Yes, I agree with this too.

> Also, one question: are you certain that your report global sensitivity is just value, and not 2 * value?

Let me file an issue to clarify this. Our sensitivity field tracks the maximum L1 norm of a contribution vector, not the sensitivity induced by a DP neighbor; the spec is not clear about this. I agree that if we consider the DP neighbor we are missing a 2x factor (although it could be accounted for elsewhere, e.g. in the aggregation layer).

@martinthomson (Member, Author) commented:

Ah, I see where this caused problems. Nevermind that then.

I'm not following the 2x question. I get the point about the effect of removing an impression from the database, but I don't think that our privacy unit is the impression, but the browser instance. So I think that the difference is accounted for. @csharrison, if you think we need something more concrete, maybe open a separate issue to track that.

@csharrison (Collaborator) commented:

> Ah, I see where this caused problems. Nevermind that then.
>
> I'm not following the 2x question. I get the point about the effect of removing an impression from the database, but I don't think that our privacy unit is the impression, but the browser instance. So I think that the difference is accounted for. @csharrison, if you think we need something more concrete, maybe open a separate issue to track that.

Let's move the discussion to #212. I can also draft up an example.



Development

Successfully merging this pull request may close the following issue: Spec the single epoch optimization
