Hadamard Response: Estimating Distributions Privately, Efficiently, and with Little Communication
This package implements Hadamard response for locally private discrete distribution estimation, proposed in "Communication Efficient, Sample Optimal, Linear Time Locally Private Discrete Distribution Estimation".
Also included are implementations of the k-Randomized Response (RR), k-RAPPOR, and Subset Selection (SS) schemes.
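As background on the simplest of these schemes, k-RR keeps the true symbol with probability e^eps / (e^eps + k - 1) and otherwise outputs one of the other k - 1 symbols uniformly at random. Below is a minimal self-contained sketch of that mechanism and its unbiased frequency estimator; the function names are illustrative and are not the package's API:

```python
import math
import random

def krr_encode(x, k, eps):
    """k-ary randomized response: keep x w.p. e^eps / (e^eps + k - 1),
    otherwise output one of the other k - 1 symbols uniformly."""
    p_keep = math.exp(eps) / (math.exp(eps) + k - 1)
    if random.random() < p_keep:
        return x
    y = random.randrange(k - 1)
    return y if y < x else y + 1  # skip x so the noise symbol differs from x

def krr_estimate(counts, n, k, eps):
    """Invert the privatization channel: since the output frequency is
    q_i = (p_i * (e^eps - 1) + 1) / (e^eps + k - 1), solve for p_i."""
    e = math.exp(eps)
    return [((c / n) * (e + k - 1) - 1) / (e - 1) for c in counts]
```

The estimates from `krr_estimate` can be negative, which is why the normalization options described later (clip-and-normalize or simplex projection) are needed in practice.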
We provide the packages as .py files, and we also provide the .ipynb notebooks we originally wrote for testing in the Jupyter_Notebooks folder.
The comments in the code files should be enough to understand how to use the functions, but we also provide some examples here which will hopefully be helpful. The four schemes are implemented as Python classes. Before using a scheme, first import the packages and then instantiate the corresponding class with the alphabet size (k) and the required privacy level (eps).
```python
import k2k_hadamard
import Subsetselection
import RR_RAPPOR

subset = Subsetselection.Subsetselection(k, eps)  # class for subset selection algorithm
rappor = RR_RAPPOR.RAPPOR(k, eps)  # class for RAPPOR
rr = RR_RAPPOR.Randomized_Response(k, eps)  # class for randomized response
hr = k2k_hadamard.Hadamard_Rand_general(k, eps)  # initialize Hadamard response
```
When you simulate Hadamard response on a single computer, we provide the option of accelerating encoding by storing the Hadamard matrix, at the expense of a large memory cost. If you want to accelerate the encoding process, set the variable encode_acc to one when initializing Hadamard response.
```python
import k2k_hadamard

hr = k2k_hadamard.Hadamard_Rand_general(k, eps, encode_acc = 1)  # initialize Hadamard response with fast encoding
```
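This memory/speed trade-off exists because entries of the Sylvester-ordered Hadamard matrix can also be computed on demand with a bit trick, storing nothing: entry (i, j) is (-1) raised to the parity of popcount(i AND j). A sketch of that identity (illustrative, not the package's internal code):

```python
def hadamard_entry(i, j):
    """Entry H[i][j] of a 2^m x 2^m Sylvester Hadamard matrix:
    +1 if popcount(i & j) is even, -1 if it is odd."""
    return 1 - 2 * (bin(i & j).count("1") & 1)
```

Precomputing the full matrix trades the per-entry bit operations for a single table lookup, which is the acceleration `encode_acc = 1` enables.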
We provide multiple classes for the Hadamard response scheme. If you are focusing on the high-privacy regime, i.e., the case when the privacy parameter is small (a small single digit), please use the class Hadamard_Rand_high_priv. For other cases, please use Hadamard_Rand_general, which is an improved version of Hadamard_Rand_general_original. The latter is the version we present in the paper and is easier to analyze, so we provide it in the package as well.
The following script encodes the input string into its randomized version and then learns the distribution based on the output string:
```python
out_string = hr.encode_string(in_list)
prob_est = hr.decode_string(out_string)  # estimate the original underlying distribution
```
If you only want to encode a single symbol, please use:
```python
out_symbol = hr.encode_symbol(in_symbol)
```
When you decode the output string, you can also choose whether to use fast decoding and how to normalize the output. In general, fast decoding is preferred. For normalization, we provide two options: clip and normalize (the default setting), and simplex projection (see "Projection onto the probability simplex: An efficient algorithm with a simple proof, and an application").
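For reference, the simplex-projection option corresponds to the Euclidean projection of the raw estimate onto the probability simplex, which the cited paper computes with a sort and a threshold. A short sketch of that algorithm (illustrative, not the package's implementation):

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {p : p >= 0, sum(p) = 1}."""
    u = np.sort(v)[::-1]                        # sort descending
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u * idx > css - 1)[0][-1]  # last index kept positive
    theta = (css[rho] - 1) / (rho + 1)          # shift that makes the result sum to 1
    return np.maximum(v - theta, 0.0)
```

Unlike clip-and-normalize, this returns the closest valid distribution in l_2 distance, which can matter when the raw estimate has many small negative entries.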
Randomized Response
The following script first encodes the input string into its randomized version and then learns the distribution based on the output string:
```python
out_string = rr.encode_string(in_list)
prob_est = rr.decode_string(out_string)  # estimate the original underlying distribution
```
The same normalization options are also provided for randomized response.
Subset Selection and RAPPOR
The implementations of these two schemes are quite similar, so we use Subset Selection as the example here.
We provide three encoding modes for Subsetselection: standard, compress, and light. In standard mode, the encoding function encodes the input string into a 0/1 matrix of size (n, k) (n is the number of inputs and k is the alphabet size), where each row is the codeword corresponding to one input. To decode, simply pass this matrix to the decode_string function:
```python
out_string = subset.encode_string_fast(in_list)
prob_est = subset.decode_string(out_string, n)  # estimate the original underlying distribution
```
The drawback of standard mode is its huge memory cost, so we also provide light mode, which encodes the input string into a length-k vector giving the total number of 1's at each location. We also return the time taken by the counting step, since it should be counted toward the decoding time. For decoding, use decode_counts:
```python
counts, time = subset.encode_string_light(in_list)
prob_est = subset.decode_counts(counts, n)  # estimate the original underlying distribution
```
In compress mode, we encode each input symbol into a string of the locations of its 1's and then concatenate these into one longer string. This can reduce the communication cost when the privacy parameter is relatively large. When decoding, we first build the histogram and then use decode_counts:
```python
out_list = subset.encode_string_compress(in_list)  # subset selection
counts, temp = np.histogram(out_list, range(k + 1))  # k is the alphabet size
prob_est = subset.decode_counts(counts, n)  # estimate the original underlying distribution
```
For simulation, we provide functions to generate geometric, uniform, two-step, Zipf, and Dirichlet distributions.
```python
import functions

dist = functions.generate_geometric_distribution(k, lbd)
```
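To make the parameter lbd concrete, a truncated geometric distribution over k symbols can be built by normalizing the weights lbd**i. The sketch below is illustrative; the package's own generate_geometric_distribution may differ in details:

```python
import numpy as np

def geometric_dist(k, lbd):
    """Truncated geometric distribution over {0, ..., k-1}:
    p[i] proportional to lbd**i, normalized to sum to 1 (0 < lbd < 1)."""
    p = lbd ** np.arange(k)
    return p / p.sum()
```

Smaller lbd concentrates the mass on the first few symbols, while lbd close to 1 approaches the uniform distribution.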
In the file test.ipynb (also available as test.py), we provide a function to compare the performance of the four schemes in terms of their l_1 and l_2 errors:
```python
data = test(k, eps, rep, point_num, step_sz, init, dist, encode_acc = 1, encode_mode = 0)
# Args:
#   k: alphabet size
#   eps: privacy level
#   rep: number of repetitions used to compute each point
#   point_num: number of points to compute
#   step_sz: distance between two sample sizes
#   init: initial sample size = init * step_sz
#   dist: underlying data distribution, chosen from 'Uniform', 'Two_steps', 'Zipf', 'Dirchlet', 'Geometric'
#   encode_acc: whether to use fast encoding for Hadamard response
#   encode_mode: encoding method for RAPPOR and subset selection
```
You can customize the testing process by setting these parameters. The returned data file contains the errors, time stamps, parameter settings, and other related information. Plots comparing the l_1 error, l_2 error, and decoding time will be generated, and a .mat file with a timestamp will be saved when the testing process finishes.