
Conversation

@martinthomson (Member) commented on Jun 20, 2025

Prior to this, the algorithm for deducting privacy budget used the L1 norm of the histogram, which the analysis shows is only possible when impressions from a single epoch are involved.

Again, this is pretty suboptimal if implemented directly, which is a recurring pattern. This iterates over all impressions twice. I didn't bother to cache the impressions that were selected. I didn't even bother to break out of the loop when the count hits two. Those would just make the "code" harder to read than it already is.

This is contrary to what is implied in #78.
Still, this closes #78, just in the opposite way to what was envisaged.
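
Roughly, the logic being described amounts to something like the following sketch. All identifiers (`Impression`, `match_value`, and so on) are illustrative, not spec-defined:

```python
from dataclasses import dataclass

@dataclass
class Impression:
    epoch: int
    match_value: str

def count_epochs_with_matching_impressions(impressions, match_value):
    """Count distinct epochs that contain at least one matching impression.

    A naive sketch of the intent described above; a real implementation could
    cache the matching impressions and stop as soon as the count reaches two.
    """
    epochs = set()
    for impression in impressions:
        if impression.match_value == match_value:
            epochs.add(impression.epoch)
    return len(epochs)

# The PR, as written, would fall back to single-epoch accounting whenever at
# most one epoch has matching impressions.
impressions = [Impression(epoch=3, match_value="ad-campaign"),
               Impression(epoch=4, match_value="other")]
single_epoch = count_epochs_with_matching_impressions(impressions, "ad-campaign") <= 1
```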


@martinthomson requested a review from @csharrison on Jun 20, 2025 at 05:42
@csharrison (Collaborator) commented:

This doesn't look correct to me, and I think the existing spec is fine. The sensitivity parameter passed to the deduct privacy budget algorithm is the maximum value the attributed histogram contribution could hold, regardless of the attribution algorithm. As such, it is always OK to deduct budget proportional to sensitivity * epsilon / globalSensitivity.

The L1 norm of the contribution vector is not used in the current spec (nor in this PR). If we supported scaling noise based on this, we would need to output the actual L1 norm from the attribution algorithm and deduct privacy budget after the histogram is filled (e.g. after 4.4.1.6). In this case, for some attribution algorithms, the L1 norm would be smaller than sensitivity.
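
As a rough illustration of the deduction being described, here is a minimal sketch. Parameter names and the overflow behaviour are assumptions for this example, not the spec's:

```python
def deduct_privacy_budget(remaining_epsilon, epsilon, sensitivity, global_sensitivity):
    """Deduct budget proportional to the worst-case contribution.

    `sensitivity` bounds the attributed histogram contribution regardless of
    the attribution algorithm, so charging sensitivity / global_sensitivity
    of epsilon is always safe. Illustrative only; not taken from the spec.
    """
    charge = (sensitivity / global_sensitivity) * epsilon
    if charge > remaining_epsilon:
        raise RuntimeError("insufficient privacy budget for this epoch")
    return remaining_epsilon - charge

# Example: a report worth at most 3 out of a global cap of 12, under epsilon = 1.0.
remaining = deduct_privacy_budget(remaining_epsilon=1.0, epsilon=1.0,
                                  sensitivity=3, global_sensitivity=12)
```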

@tholop commented on Jun 24, 2025

I agree with @csharrison. Moreover, if I understand correctly, the change proposed in this PR enables or disables multi-epoch accounting based on epochCount, which is obtained by counting how many epochs have matching impressions. This does not seem right, because the single-epoch optimization says that we can use the L1 norm instead of the global sensitivity only if the attribution window contains a single epoch, independently of any impression data. For instance, if the attribution window contains 2 epochs but one epoch has no matching impressions (e.g., wrong topLevelSite or matchValue), then we should still do multi-epoch accounting.
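
Concretely, the check would key off the attribution window alone, along these lines (epoch numbering and names are assumptions made for this sketch):

```python
def single_epoch_accounting_allowed(window_start_epoch, window_end_epoch):
    """True only when the attribution window covers a single epoch.

    The decision depends solely on the window, never on which epochs happen
    to contain matching impressions. Inclusive integer epoch indices are an
    assumption for this illustration.
    """
    return window_start_epoch == window_end_epoch

# A two-epoch window needs multi-epoch accounting even if one of the epochs
# has no matching impressions.
assert single_epoch_accounting_allowed(10, 10) is True
assert single_epoch_accounting_allowed(9, 10) is False
```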

In case you'd like extra context, here is our reference implementation for Cookie Monster. There is a bit of extra code irrelevant for PPA Level 1 (epoch-source losses and two-phase commit are for Big Bird).

Also, one question: are you certain that your report global sensitivity is just value, and not 2 * value? For the family of histogram queries we considered in Cookie Monster, we obtained a factor 2 in the general case (https://github.com/columbia/pdslib/blob/19102c494009cd22045eb4971750505fc2c98b2b/src/queries/histogram.rs#L140), because removing all the impressions from an epoch could result in a histogram with the same attributed value allocated to completely different buckets. The factor 2 disappears if we have only 1 epoch or only 1 bucket.
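
A toy example of where the factor of 2 comes from (hypothetical values and bucket names, not from the spec):

```python
# With two epochs, removing all impressions from epoch 1 can cause attribution
# to fall back to an impression from epoch 2, moving the whole attributed value
# to a different bucket. The L1 distance between the neighbouring histograms is
# then 2 * value, not value.
value = 5
with_epoch_1 = {"bucket_a": value}     # epoch 1's impression wins attribution
without_epoch_1 = {"bucket_b": value}  # epoch 2's impression wins instead

buckets = set(with_epoch_1) | set(without_epoch_1)
l1_distance = sum(abs(with_epoch_1.get(b, 0) - without_epoch_1.get(b, 0))
                  for b in buckets)
assert l1_distance == 2 * value  # collapses to `value` with one epoch or one bucket
```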

@csharrison (Collaborator) commented:

> Moreover, if I understand correctly, the change proposed in this PR enables or disables multi-epoch accounting based on epochCount, which is obtained by counting how many epochs have matching impressions. This does not seem right, because the single-epoch optimization says that we can use the L1 norm instead of the global sensitivity only if the attribution window contains a single epoch, independently of any impression data.

Yes, I agree with this too.

> Also, one question: are you certain that your report global sensitivity is just value, and not 2 * value?

Let me file an issue to clarify this. Our sensitivity field tracks the maximum L1 norm of a contribution vector, not the sensitivity induced by a DP neighbor; the spec is not clear about this. I agree that if we consider the DP neighbor we are missing a 2x factor (although it could be accounted for elsewhere, e.g. in the aggregation layer).

@martinthomson (Member, Author) commented:

Ah, I see where this caused problems. Nevermind that then.

I'm not following the 2x question. I get the point about the effect of removing an impression from the database, but I don't think that our privacy unit is the impression, but the browser instance. So I think that the difference is accounted for. @csharrison, if you think we need something more concrete, maybe open a separate issue to track that.

@csharrison (Collaborator) commented:

> Ah, I see where this caused problems. Nevermind that then.
>
> I'm not following the 2x question. I get the point about the effect of removing an impression from the database, but I don't think that our privacy unit is the impression, but the browser instance. So I think that the difference is accounted for. @csharrison, if you think we need something more concrete, maybe open a separate issue to track that.

Let's move the discussion to #212. I can also draft up an example.



Development

Successfully merging this pull request may close the following issue: Spec the single epoch optimization
