Disclosure Avoidance Repository
Latest commit a98b4bc Aug 9, 2019


Motivation

The Census Bureau is required by law to keep its survey responses confidential, and it is beginning the transition from “ad-hoc” privacy techniques to a formally private framework known as differential privacy. All public data releases must go through the Disclosure Review Board (DRB), whose newest policy states that any data release at the sub-state level or lower must be protected with noise injection techniques.

External researchers using restricted census data at Federal Statistical Research Data Centers (FSRDCs) are among the first affected by these policies, but all census data products will eventually require these methods. Researchers generally do not have a background in formal privacy, so they face a roadblock if they want to publish sub-state results. This repository aims to provide the tools and documentation needed to address this problem. Content here is a work in progress, and all releases that use this library still require official approval from the DRB.

Differentially Private Computations

Differential privacy states that any information-related risk to a person should not change significantly as a result of that person's information being included, or not, in the analysis. It provides provable privacy guarantees with respect to the cumulative risk from successive data releases using a privacy "budget." Algorithms maintain differential privacy by introducing carefully crafted random noise into the computation. Types of computations that can be made differentially private include:

  • descriptive statistics
  • supervised and unsupervised ML tasks
  • generation of synthetic data
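As a rough illustration of noise injection and a privacy budget, the sketch below adds Laplace noise to a count and tracks the cumulative epsilon spent across releases using basic sequential composition. The names `noisy_count` and `PrivacyBudget` are hypothetical and are not part of this repository, the data are made up, and the code relies on NumPy's random number generator, so it should be read as a teaching sketch rather than a formally private implementation.

```python
import numpy as np


def noisy_count(values, epsilon, rng):
    """Return a count with Laplace noise of scale sensitivity / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    sensitivity = 1.0  # adding or removing one person changes a count by at most 1
    scale = sensitivity / epsilon
    return len(values) + rng.laplace(loc=0.0, scale=scale)


class PrivacyBudget:
    """Hypothetical helper: track cumulative epsilon under sequential composition."""

    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0

    def spend(self, epsilon):
        # Refuse any release that would push the total spend past the budget.
        if self.spent + epsilon > self.total + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon
        return epsilon


rng = np.random.default_rng(seed=0)  # seeded only to make the sketch repeatable
budget = PrivacyBudget(total_epsilon=1.0)
records = list(range(1000))  # made-up data: 1,000 respondents

release_1 = noisy_count(records, budget.spend(0.5), rng)
release_2 = noisy_count(records, budget.spend(0.5), rng)
# A third call to budget.spend() would raise, because the budget is used up.
```

A smaller epsilon gives a larger noise scale, so each release is more private but less accurate; splitting a fixed budget across more releases makes each one noisier.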

Read more

Getting started

Repository Overview

The notebooks/ folder contains tutorials for some of the main workflows researchers practice when releasing sub-state data analyses. These tutorials can be viewed statically in the browser, or run locally using Jupyter. See below for how to install and run a Jupyter notebook locally.

The census_dp/ folder contains implementations of common noise injection algorithms and error metrics. NOTE: These algorithms are not necessarily "formally" private. One reason is that many of our implementations currently use Python's NumPy library, whose random number generator is not cryptographically secure. Read more in dp-future.
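To make the random number generator concern concrete, here is a minimal sketch of drawing Laplace noise from the operating system's entropy source through Python's standard `secrets` module instead of NumPy. The function name `laplace_noise_os_entropy` is hypothetical and does not exist in census_dp/. The sketch only swaps the source of randomness; it still samples in floating point, so it should not be taken as a formally private implementation either.

```python
import secrets

# SystemRandom draws from the operating system's entropy source (os.urandom).
_rng = secrets.SystemRandom()


def laplace_noise_os_entropy(scale):
    """Sample Laplace(0, scale) as the scaled difference of two Exp(1) draws."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    # The difference of two independent Exp(1) variables is Laplace(0, 1).
    return scale * (_rng.expovariate(1.0) - _rng.expovariate(1.0))


sample = laplace_noise_os_entropy(scale=2.0)
```

Because `SystemRandom` cannot be seeded, results are not reproducible, which is the intended trade-off when the noise must be unpredictable.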

The tests/ folder contains unit tests for the implementations in census_dp/ using the pytest library. See instructions for running these tests below.

How to install & run a notebook

  1. Install Anaconda (Census employees must submit a Remedy ticket).

  2. Open an Anaconda prompt.

  3. Install git by typing the following into your Anaconda prompt and pressing enter.

         conda install git

  4. Navigate to the directory you would like to download this repository in. For example:

         cd Downloads/privacy/

  5. Clone this repository.

         git clone https://github.com/umadesai/census-dp.git

  6. Navigate to the notebooks folder.

         cd census-dp/notebooks

  7. Run Jupyter Notebook.

         jupyter notebook

     This command should launch Jupyter Notebook locally in your browser. If it does not, open your browser and navigate to the localhost address that is provided in your Anaconda prompt.

  8. Click on the IPython Notebook you would like to open. We recommend starting with dp-count.

  9. Reference this sheet for help using Jupyter Notebook.

Setting up your conda environment

  1. Create the environment from the env.yml file:

         conda env create -f env.yml

  2. Activate the new environment:

         conda activate env

  3. Once you've finished your work in this environment, you can deactivate the environment using:

         conda deactivate

Importing a module from the library

If you want to use a module or algorithm from the library in your own Python script, you can follow the structure of the example below.

from census_dp import laplace

my_laplace = laplace.laplace_mech(mu=0, epsilon=1, sensitivity=1)

Running tests

There are tests for each of the library modules, implemented with pytest. To run all the tests at once, run pytest from the base directory of the project.

pytest
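For readers new to pytest, the sketch below shows the general shape of a statistical unit test for a noise function. It is not taken from the tests/ folder: `laplace_noise` is a hypothetical stand-in defined inline so the example is self-contained, and the tolerance is an illustrative choice. pytest collects any function whose name starts with `test_` and reports a failure when an `assert` does not hold.

```python
import numpy as np


def laplace_noise(epsilon, sensitivity, size, rng):
    """Hypothetical stand-in for a library function under test."""
    return rng.laplace(loc=0.0, scale=sensitivity / epsilon, size=size)


def test_noise_is_centered_on_zero():
    rng = np.random.default_rng(seed=1)  # fixed seed keeps the test deterministic
    draws = laplace_noise(epsilon=1.0, sensitivity=1.0, size=100_000, rng=rng)
    # With 100,000 draws the sample mean should sit very close to 0.
    assert abs(draws.mean()) < 0.05


def test_smaller_epsilon_gives_more_noise():
    rng = np.random.default_rng(seed=1)
    loose = laplace_noise(epsilon=2.0, sensitivity=1.0, size=100_000, rng=rng)
    tight = laplace_noise(epsilon=0.5, sensitivity=1.0, size=100_000, rng=rng)
    # Noise scale is sensitivity / epsilon, so a smaller epsilon spreads wider.
    assert tight.std() > loose.std()
```

Saving this in a file whose name starts with `test_` and running pytest from the same directory would collect and run both functions.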

Contributors

This project is the work of members of the CED-Disclosure Avoidance team at the US Census Bureau.

Hear more about the repository at the Annual Conference of the Federal Statistical Research Data Centers on September 5, 2019 at the Pyle Center, University of Wisconsin–Madison.

Acknowledgements

Thank you to those who have been researching differential privacy at the Census Bureau and academic institutions for their incredible contributions, specifically:

  • Philip Leclerc, US Census Bureau
  • Simson Garfinkel, US Census Bureau
  • John Abowd, US Census Bureau
  • Ashwin Machanavajjhala, Duke University
  • Michael Hay, Colgate University
  • Gerome Miklau, University of Massachusetts Amherst
  • Daniel Kifer, Penn State University
  • Cynthia Dwork, Harvard University