Simulator is not actually random, leading to extremely inaccurate results #1

Closed
jkarlin opened this issue Jul 19, 2024 · 2 comments

Comments

@jkarlin

jkarlin commented Jul 19, 2024

Hi. I’m an engineer on the Topics API for Chrome. I took a brief look at your code after seeing rather surprising results in the related paper, and I want to point out an issue I came across, as it has a significant impact on the simulation’s (and therefore the paper’s) results.

You’re using a worker pool to create the topics for each user on sites A and B, but you’re not reseeding the random number generator on each worker (which is forked off the original process). The result is that each worker creates the same stream of random numbers!
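The effect can be reproduced in isolation. The following is a minimal sketch, not the simulator's code: it assumes a POSIX system where the `fork` start method is available, it draws from NumPy's global legacy generator via `np.random.randint`, and the helper names `_child` and `draws_per_forked_child` are hypothetical. It uses bare processes rather than a pool so that every child draws exactly once from the state it inherited.

```python
import multiprocessing as mp

import numpy as np


def _child(conn, n):
    # The global NumPy generator state was copied from the parent by fork(),
    # so every child starts from the same position in the same stream.
    conn.send(np.random.randint(0, 1_000_000, size=n).tolist())
    conn.close()


def draws_per_forked_child(n_children=3, n=5):
    """Fork n_children processes and collect n random integers from each."""
    ctx = mp.get_context("fork")  # assumption: POSIX, fork is available
    results = []
    for _ in range(n_children):
        recv_end, send_end = ctx.Pipe(duplex=False)
        proc = ctx.Process(target=_child, args=(send_end, n))
        proc.start()
        send_end.close()  # the parent only reads from this pipe
        results.append(recv_end.recv())
        proc.join()
    return results


if __name__ == "__main__":
    for row in draws_per_forked_child():
        print(row)
```

Because the parent never draws between forks, every child should report the same list of numbers.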

This means that in your simulator, sites A and B are getting the same Topics for the same user, rather than chosen at random.

This is a significant problem with your published work. For example, fixing this bug in your code reduces the 5-epoch reidentification rate from ~57% to ~3% with params[1] provided in the README.

An easy fix is to add os.register_at_fork(after_in_child=np.random.seed) before creating your worker pool.
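As a hedged illustration of that one-line fix, the sketch below registers `np.random.seed` as an after-fork hook and then forks a few children. It assumes a POSIX system with the `fork` start method; the helper names `_child` and `draws_with_reseed` are hypothetical and do not come from the simulator.

```python
import multiprocessing as mp
import os

import numpy as np

# Runs in every child right after fork(). Called with no arguments,
# np.random.seed() reseeds the global generator from fresh OS entropy.
os.register_at_fork(after_in_child=np.random.seed)


def _child(conn, n):
    # The after-fork hook has already reseeded the global generator here.
    conn.send(np.random.randint(0, 2**31 - 1, size=n).tolist())
    conn.close()


def draws_with_reseed(n_children=3, n=5):
    """Fork n_children processes and collect n random integers from each."""
    ctx = mp.get_context("fork")  # assumption: POSIX, fork is available
    results = []
    for _ in range(n_children):
        recv_end, send_end = ctx.Pipe(duplex=False)
        proc = ctx.Process(target=_child, args=(send_end, n))
        proc.start()
        send_end.close()  # the parent only reads from this pipe
        results.append(recv_end.recv())
        proc.join()
    return results


if __name__ == "__main__":
    for row in draws_with_reseed():
        print(row)
```

The hook has to be registered before the workers are forked, and it stays in place for the lifetime of the process. With it, each child should print a different list.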

Josh

[1] python3 topics_simulator.py data/web_data/users_topics_5_weeks.tsv 5 topics_classifier/chrome4/config.json data/crux/crux_202401_chrome4_topics-api.tsv 10 1 data/reidentification_exp/5_weeks_10_unobserved

@yohhaan
Owner

yohhaan commented Jul 20, 2024

Hello Josh,

Thanks for reaching out and bringing this to our attention!

We looked into this subtle bug regarding the initialization of the random number generator seed across these forked processes. We confirm that numpy preserves the random state across forks and that the proposed solution fixes it by forcing an auto-seed for each new fork. Thus, we re-ran our simulation on this real dataset of browsing histories.
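Auto-seeding each fork from OS entropy makes runs non-reproducible. An alternative, which is not what this thread or the simulator uses, is to derive one independent generator per worker from a single root seed with NumPy's `SeedSequence.spawn`. The sketch below is only an illustration under that assumption; the function name `per_worker_draws` and the root seed value are hypothetical.

```python
import numpy as np


def per_worker_draws(root_seed, n_workers=3, n=5):
    """Return n integers per worker, each from its own independent stream."""
    # spawn() derives statistically independent child seeds from the root,
    # so the same root seed always reproduces the same per-worker streams.
    child_seeds = np.random.SeedSequence(root_seed).spawn(n_workers)
    draws = []
    for seed in child_seeds:
        rng = np.random.default_rng(seed)  # one Generator per worker
        draws.append(rng.integers(0, 1_000_000, size=n).tolist())
    return draws


if __name__ == "__main__":
    for row in per_worker_draws(root_seed=12345):
        print(row)
```

In a real pool, each child seed (or the `Generator` built from it) would be passed to a worker as a task argument instead of relying on global state.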

While the results that we now obtain have changed quantitatively (2.3%, 2.9%, and 4.1% of these users are uniquely re-identified after 1, 2, and 3 observations of their topics, respectively), our findings do not change qualitatively: real users can be fingerprinted by the Topics API, and the information leakage worsens over time as more users get uniquely re-identified.

Here is our plan: we will modify the simulator code (https://github.com/yohhaan/topics_api_analysis), update the corresponding metrics in the paper, and push a new version to arXiv (https://arxiv.org/abs/2403.19577) in which we will acknowledge your contribution and the help you provided. Thanks again!

Best,

Yohan

@yohhaan
Owner

yohhaan commented Aug 7, 2024

Hello,

Corrections to the code have been made in commit b20e193 and revisions are posted here.

Thanks again!

@yohhaan yohhaan closed this as completed Aug 7, 2024