This repository lists the prototype algorithms our group developed for data collection/analysis under local differential privacy.
Frequency Oracle (primitive to estimate the histograms)
Density Oracle (for numerical/ordinal values)
Clarification: Citation 33 should be Ning Wang et al. Collecting and Analyzing Multidimensional Data with Local Differential Privacy. ICDE 2019.
Frequent Itemset Mining under LDP
Related Paper: Locally Differentially Private Frequent Itemset Mining
Errata: In Equation (10) of Section V, there are three terms, two of them misses the coefficient $\ell$.
Clarification: To find top-k itemsets, we also consider singleton estimates from SVIM (the method for singleton mining).
A list of Post-Porcess Methods for LDP
Multi-Dimensional Range Query under LDP
Code available at this repository.
I am slowly cleaning code for the protocols below:
Heavy Hitter Identification
Related Paper: Locally Differentially Private Heavy Hitter Identification
Errata: For the AOL dataset, there are 0.12M, instead of 0.2M unique queries. This is a typo that does not change any result.
Clarification: For the QUANTCAST data, I downloaded the data for one month (which contains 10 billion clicks), and sample for 5min (i.e., divide the # clicks by 30 * 24 * 5 * 60). The dataset is available upon request.
The source code is not opened yet, but the similar code (plus a data synthesizing component) for the central DP setting is opened at DPSyn by @Zhangzhk0819 (related info at nist challenge 1 and nist challenge 2).
I mainly used Python 3 with numpy. Although not tested, the code should be compatible with any recent version. I also use a special package called xxhash for high hashing throughput. It can be changed to the builtin hash function. In case your local environment does not work, this is the package versions I used from my local side.
pip install -r requirements.txt pytest