Disclosure Avoidance Repository
The Census Bureau is required by law to keep its survey responses confidential, and it is beginning the transition from “ad-hoc” privacy techniques toward a formally private framework known as differential privacy. All public data releases must go through the Disclosure Review Board (DRB), whose newest policy states that any data release at the sub-state level or lower must be protected with noise injection techniques.
External researchers using restricted census data at Federal Statistical Research Data Centers (FSRDCs) are among the first affected by these policies, but all census data products will eventually require these methods. Researchers generally do not have a background in formal privacy, so they face a roadblock if they want to publish sub-state results. This repository aims to deliver the tools and documentation needed to address this problem. Content here is a work in progress, and all releases that use this library still require official approval from the DRB.
Differentially Private Computations
Differential privacy requires that any information-related risk to a person should not change significantly as a result of that person's information being included, or not, in the analysis. It provides provable privacy guarantees with respect to the cumulative risk from successive data releases, tracked with a privacy "budget." Algorithms maintain differential privacy by introducing carefully crafted random noise into the computation. Types of computations that can be made differentially private include:
- descriptive statistics
- supervised and unsupervised ML tasks
- generation of synthetic data
Introductory reading on differential privacy:
- Differential Privacy: An Introduction For Statistical Agencies, Page et al.
- Differential Privacy: A Primer for a Non-technical Audience, Wood et al.
- A Firm Foundation for Private Data Analysis, Dwork.
- The Algorithmic Foundations of Differential Privacy, Dwork & Roth.
- Introductory Readings in Formal Privacy for Economists, Abowd, Schmutte, Sexton, & Vilhuber.
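To make the ideas of noise injection and a privacy "budget" more concrete, here is a minimal sketch of a noisy count using Laplace noise, with a tracker that adds up epsilon across releases (basic sequential composition). The names `PrivacyBudget`, `laplace_noise`, and `noisy_count` are hypothetical and are not part of this library. The example uses only Python's standard library and a seeded, non-secure random number generator, so it is an illustration rather than a release-ready implementation.

```python
import random


class PrivacyBudget:
    """Hypothetical tracker: adds up epsilon under basic sequential composition."""

    def __init__(self, total_epsilon):
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def spend(self, epsilon):
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        # Small tolerance so floating-point rounding does not reject a valid spend.
        if self.spent + epsilon > self.total_epsilon + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon


def laplace_noise(scale, rng):
    # The difference of two independent exponentials with mean `scale`
    # follows a Laplace distribution centered at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(records, predicate, epsilon, budget, rng):
    """Release a count with Laplace noise, charging epsilon to the budget."""
    budget.spend(epsilon)
    true_count = sum(1 for r in records if predicate(r))
    sensitivity = 1.0  # adding or removing one person changes a count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)  # seeded only so the demo is repeatable; NOT secure
budget = PrivacyBudget(total_epsilon=1.0)
ages = [23, 35, 41, 67, 70, 18, 52]  # made-up toy data

release_1 = noisy_count(ages, lambda a: a >= 65, 0.5, budget, rng)
release_2 = noisy_count(ages, lambda a: a < 30, 0.5, budget, rng)
print(release_1, release_2, budget.spent)
```

After these two releases the whole budget of 1.0 has been spent, so a third call to `noisy_count` would raise an error instead of releasing another statistic. Smaller values of epsilon give stronger privacy but noisier answers.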
notebooks/ folder contains tutorials for some of the main workflows researchers practice when releasing sub-state data analyses. These tutorials can be viewed statically in the browser, or run locally using Jupyter. See below for how to install and run a Jupyter notebook locally.
census_dp/ folder contains implementations of common noise injection algorithms and error metrics. NOTE: These algorithms are not necessarily "formally" private. One reason for this is that many of our implementations currently use Python's NumPy library, whose random number generator is not cryptographically secure. Read more in dp-future.
tests/ folder contains unit tests for the implementations in census_dp/ using the pytest library. See instructions for running these tests below.
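Regarding the note above about random number generators: one possible direction, shown here only as a sketch, is to draw Laplace noise from the operating system's entropy source through Python's standard `secrets.SystemRandom`. The function name `secure_laplace` is hypothetical and is not part of census_dp. A secure generator addresses only one of the gaps; floating-point sampling of Laplace noise has its own known weaknesses, so this alone does not make a release formally private.

```python
import secrets

# SystemRandom draws from the operating system's entropy source
# rather than a seedable pseudo-random generator.
_secure_rng = secrets.SystemRandom()


def secure_laplace(scale):
    """Hypothetical helper: one Laplace(0, scale) draw from the OS entropy source."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    rate = 1.0 / scale
    # The difference of two independent exponentials with the same rate
    # is Laplace-distributed with location 0 and scale 1 / rate.
    return _secure_rng.expovariate(rate) - _secure_rng.expovariate(rate)


samples = [secure_laplace(2.0) for _ in range(5)]
print(samples)
```

Because `SystemRandom` cannot be seeded, results are not reproducible, which is usually what you want for a real release but makes testing harder.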
How to install & run a notebook
Install Anaconda (Census employees must submit a Remedy ticket).
Open an Anaconda prompt
Install git by typing the following into your Anaconda prompt and pressing enter.
conda install git
Navigate to the directory where you would like to download this repository.
- Clone this repository
git clone https://github.com/umadesai/census-dp.git
- Navigate to the notebooks folder.
- Run Jupyter Notebook.
jupyter notebook
This command should launch Jupyter Notebook locally in your browser. If it does not, open your browser and navigate to the localhost address that is provided in your Anaconda prompt.
Click on the IPython Notebook you would like to open. We recommend starting with dp-count.
Reference this sheet for help using Jupyter Notebook.
Setting up your conda environment
- Create the environment from the env.yml file:
conda env create -f env.yml
- Activate the new environment:
conda activate env
- Once you've finished your work in this environment, you can deactivate it using:
conda deactivate
Importing a module from the library
If you want to use a module or algorithm from the library in your own Python script, you can follow the structure of the example below.
from census_dp import laplace
my_laplace = laplace.laplace_mech(mu=0, epsilon=1, sensitivity=1)
There are tests for each of the library modules, implemented with pytest. To run all the tests at once, run pytest from the base directory of the project.
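As an illustration of the kind of test pytest collects, here is a self-contained sketch. The `laplace_mech` defined in it is a hypothetical stand-in that borrows the parameter names from the import example above and assumes the function returns `mu` plus Laplace noise; the library's real function and the real tests in tests/ may behave differently.

```python
import random
import statistics


def laplace_mech(mu, epsilon, sensitivity, rng=random):
    """Hypothetical stand-in; the library's real function may differ."""
    scale = sensitivity / epsilon
    # Difference of two exponentials with mean `scale` gives Laplace noise.
    return mu + rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def test_laplace_mech_is_centered_on_mu():
    rng = random.Random(12345)  # fixed seed keeps the test deterministic
    draws = [laplace_mech(10, 1, 1, rng) for _ in range(20000)]
    assert abs(statistics.mean(draws) - 10) < 0.1


def test_smaller_epsilon_means_more_noise():
    rng = random.Random(12345)
    tight = [laplace_mech(0, 10, 1, rng) for _ in range(2000)]
    loose = [laplace_mech(0, 0.1, 1, rng) for _ in range(2000)]
    assert statistics.pstdev(loose) > statistics.pstdev(tight)
```

pytest discovers functions whose names start with `test_` in files named `test_*.py`, so saving a file like this under tests/ and running pytest from the base directory would execute both checks.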
This project is the work of members of the CED-Disclosure Avoidance team at the US Census Bureau.
- Uma Desai (umadesai)
- Sophie Song (sophiesong)
- Rolando Rodríguez (rrod515)
- Amy Lauger (amydlauger)
- Caleb Floyd (calebfloyd)
- Michael Freiman (mfreiman)
Hear more about the repository at the Annual Conference of the Federal Statistical Research Data Centers on September 5, 2019 at the Pyle Center, University of Wisconsin–Madison.
Thank you to those who have been researching differential privacy at the Census Bureau and academic institutions for their incredible contributions, specifically:
- Philip Leclerc, US Census Bureau
- Simson Garfinkel, US Census Bureau
- John Abowd, US Census Bureau
- Ashwin Machanavajjhala, Duke University
- Michael Hay, Colgate University
- Gerome Miklau, University of Massachusetts Amherst
- Daniel Kifer, Penn State University
- Cynthia Dwork, Harvard University