Initial commit.
Ted Lawless committed Jun 18, 2018
0 parents commit 0c2a162
Showing 11 changed files with 916 additions and 0 deletions.
121 changes: 121 additions & 0 deletions .gitignore
@@ -0,0 +1,121 @@
*dblite
raw/*
build/*

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/

20 changes: 20 additions & 0 deletions LICENSE.txt
@@ -0,0 +1,20 @@
Copyright 2018, Brown University, Providence, RI.

All Rights Reserved

Permission to use, copy, modify, and distribute this software and its
documentation for any purpose other than its incorporation into a
commercial product is hereby granted without fee, provided that the
above copyright notice appear in all copies and that both that
copyright notice and this permission notice appear in supporting
documentation, and that the name of Brown University not be used in
advertising or publicity pertaining to distribution of the software
without specific, written prior permission.

BROWN UNIVERSITY DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE,
INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR ANY
PARTICULAR PURPOSE. IN NO EVENT SHALL BROWN UNIVERSITY BE LIABLE FOR
ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
144 changes: 144 additions & 0 deletions README.md
@@ -0,0 +1,144 @@
# Worked example of Secure Infrastructure for Research with Administrative Data (SIRAD)

`sirad` is an integration framework for data from administrative systems. It
deidentifies administrative data by removing and replacing personally
identifiable information (PII) with a global anonymized identifier, allowing
researchers to securely join data on an individual from multiple tables without
knowing the individual's identity.

This is a simplified demonstration of how `sirad` works on simulated data; for
more details on how it is used in practice with real administrative data,
please see our manuscript preprint:

> J.S. Hastings, M. Howison, T. Lawless, J. Ucles, P. White. 2018.
> Integrating Administrative Data for Policy Insights.
> (link to arXiv preprint)

In this worked example, we simulate two administrative data sets:

1. **IRS 1040 tax returns**, identified by social security number (SSN),
   first/last name, and date of birth (DOB)
2. **Credit history**, identified by first/last name and date of birth (DOB)

`sirad` uses a deterministic matching algorithm to match records across the two
data sets corresponding to the same individual. It then assigns an anonymized
identifier (the `sirad_id`) to each matched individual, and creates a
deidentified table for each data set where the SSNs, names, and DOBs have been
replaced with the `sirad_id`. Finally, we demonstrate an analysis that uses the
`sirad_id` to join adjusted gross income from the tax returns table to credit
scores in the credit history table.

**Note**: the data are simulated by the `simulate.py` script using
[Faker](https://github.com/joke2k/faker), which creates realistic PII that does
not represent actual individuals. Any data in this example that looks
personally identifiable is not!
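
As a rough illustration of how such simulated PII might look, `Faker` can be
used along these lines (the field names here are illustrative, not necessarily
the ones used by `simulate.py`):

```python
from faker import Faker

fake = Faker()
fake.seed_instance(42)  # make the fake records reproducible

# Each call returns realistic-looking but entirely fictitious values.
person = {
    "first_name": fake.first_name(),
    "last_name": fake.last_name(),
    "ssn": fake.ssn(),
    "dob": fake.date_of_birth(minimum_age=18, maximum_age=90),
}
print(person)
```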

## Installing dependencies

Requires Python 3.6 or later. There are several options for installing the
dependencies (listed in `requirements.txt`).

You can use **pip** to install them globally with
`pip install -r requirements.txt`.

If you do not have write access to install globally, you can install into your
home directory with
`pip install --user -r requirements.txt`.

If you have Anaconda Python, you can use **conda** to install them in your
root environment with
`conda install -c riipl-org --file requirements.txt`.

Or, if you would prefer to create a named conda environment, use
`conda create -c riipl-org -n sirad-example --file requirements.txt`
and activate it with
`source activate sirad-example`.

## Running the example

### Step 1: Simulate data

Command: `python simulate.py`

### Step 2: Process the raw data into separate PII, data, and link files

Command: `sirad process`

`sirad` processes a set of **raw** data files specified by a set of **layout
files**. In this example, there are two simulated raw data files generated in
Step 1: tax records (`raw/tax.txt`) and credit history
(`raw/credit_scores.txt`). Their layouts are `layouts/tax.yaml` and
`layouts/credit_scores.yaml`. The layouts are YAML files that describe the
column layout and field types in the raw data files.

The processing step uses the `pii` properties in the layout to split the PII
fields from the data fields in each row of the raw files. It randomly shuffles
the order of the PII rows when writing to the PII file. The data file has the
same row order as the raw data file. The link file provides a lookup table
that re-links the shuffled PII rows to the data rows.
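
As a rough sketch of the idea (not the `sirad` implementation; the file names
and columns below are made up for illustration), each raw row is split into a
data part and a PII part, the PII rows are written out in shuffled order, and
the link file records which shuffled PII row belongs to which data row:

```python
import csv
import random

# Toy raw rows: (ssn, name, credit_score) -- values are illustrative only.
raw = [
    ("111-11-1111", "Alice Smith", 700),
    ("222-22-2222", "Bob Jones", 650),
    ("333-33-3333", "Carol Lee", 720),
]

data_rows = [(i, score) for i, (_, _, score) in enumerate(raw)]      # record_id, data
pii_rows = [(i, ssn, name) for i, (ssn, name, _) in enumerate(raw)]  # record_id, pii

# Shuffle the PII rows; their new position becomes the pii_id.
random.shuffle(pii_rows)
link_rows = [(pii_id, record_id) for pii_id, (record_id, _, _) in enumerate(pii_rows)]

with open("data.csv", "w", newline="") as f:
    csv.writer(f).writerows(data_rows)        # record_id, credit_score (raw order)
with open("pii.csv", "w", newline="") as f:
    csv.writer(f).writerows(
        (pii_id, ssn, name) for pii_id, (_, ssn, name) in enumerate(pii_rows)
    )                                         # pii_id, ssn, name (shuffled order)
with open("link.csv", "w", newline="") as f:
    csv.writer(f).writerows(link_rows)        # pii_id, record_id
```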

### Step 3: Stage the processed files in a database

Command: `sirad stage`

This step stages the PII, data, and link files in a relational database.
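
As a minimal sketch of what staging could look like (the paths, table name,
and use of pandas are assumptions for illustration, not the `sirad stage`
implementation):

```python
import sqlite3

import pandas as pd

# Load one pipe-delimited processed data file into a staging SQLite database.
df = pd.read_csv("build/processed/data/tax.txt", sep="|")  # illustrative path
with sqlite3.connect("build/db/data.db") as cxn:
    df.to_sql("tax", cxn, if_exists="replace", index=False)
```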

### Step 4: Create a versioned research database

Command: `sirad research --version 1`

This step uses the PII database to construct a global anonymized identifier
(the `sirad_id`), then uses the link files to attach it to each data table in
the database. The result is a **research** database which contains no PII, but
in which individual-level data in different tables can be joined by the
anonymized identifier. Research databases are versioned to support reproducible
analysis.
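
A toy sketch of the underlying idea (the real matching rules and identifier
construction in `sirad` are more involved; the records and keys below are made
up): PII records from different sources that agree exactly on shared
identifiers are assigned the same anonymized `sirad_id`.

```python
import itertools

# Toy PII rows from the two sources -- values are illustrative only.
tax_pii = [{"record_id": 0, "first": "Alice", "last": "Smith", "dob": "01-02-1980"}]
credit_pii = [{"record_id": 0, "first": "Alice", "last": "Smith", "dob": "01-02-1980"}]

next_id = itertools.count(1)
sirad_ids = {}

def assign(row):
    # Deterministic key on name + DOB; the real tool also uses (hashed) SSNs
    # where they are available.
    key = (row["first"].lower(), row["last"].lower(), row["dob"])
    if key not in sirad_ids:
        sirad_ids[key] = next(next_id)
    return sirad_ids[key]

tax_links = {row["record_id"]: assign(row) for row in tax_pii}
credit_links = {row["record_id"]: assign(row) for row in credit_pii}
print(tax_links, credit_links)  # both record 0s receive sirad_id 1
```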

## Resulting database

After the build finishes, an SQLite database called `research_v1.db` will be
created in the `build` directory. This database has two tables created from
the simulated data:

### tax

sirad_id | record_id | job | file_date | adjusted_gross_income | import_dt
-|-|-|-|-|-

### credit_scores

sirad_id | record_id | credit_score | import_dt
-|-|-|-

Notes:
* `sirad_id` is an anonymized identifier created from the PII.
* `record_id` is a primary key for the research/data records, and `pii_id` is a
shuffled primary key for the PII records.
* `import_dt` is a timestamp for when the raw data were processed.
* All PII fields (SSN, first/last name, DOB) have been removed from the research database.

The results are organized in the following directory structure:
* `raw/`: the simulated raw data files
* `build/processed`: processed data files (organized by `data`, `pii`, and `link`)
* `build/db`: the staging databases for the processed files
* `build/research_v1.db`: the final research database

In a real-world application, only the `research_v1.db` database would be
accessible to researchers. The `raw`, `processed`, and `db` directories should
be stored in a restricted location that is inaccessible to any individual
researcher, for example by using encryption with a multi-party key or
passphrase, auditing, real-time alerting, and/or other security controls that
ensure no individual researcher can access build files containing PII.

## Example analysis

`scatterplot.py` demonstrates an analysis that uses the `sirad_id` to
anonymously join records about individuals. It selects adjusted gross income
from the `tax` table joined to the corresponding credit score from the
`credit_scores` table, then generates this scatter plot:

![scatterplot](scatterplot.png)

**Note:** these variables are correlated by construction, and were drawn from a
joint distribution (with added noise) in the simulation.
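
For reference, correlated-by-construction variables can be drawn along these
lines (a sketch under assumed parameters, not the actual `simulate.py` logic):

```python
import numpy as np

rng = np.random.default_rng(0)

# Income from a lognormal; credit score as a noisy increasing function of
# income, clipped to the conventional 300-850 range. Parameters are invented.
income = rng.lognormal(mean=10.5, sigma=0.5, size=1000)
score = np.clip(
    550 + 30 * np.log(income / income.mean()) + rng.normal(0, 40, size=1000),
    300,
    850,
)
```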
14 changes: 14 additions & 0 deletions layouts/credit_scores.yaml
@@ -0,0 +1,14 @@
source: credit_scores.txt
type: csv
delimiter: "|"
fields:
- first:
    pii: first_name
- last_name:
    pii: last_name
- birth_date:
    pii: dob
    type: date
    format: "%m-%d-%Y"
- credit_score:
    type: int
22 changes: 22 additions & 0 deletions layouts/tax.yaml
@@ -0,0 +1,22 @@
source: tax.txt
type: csv
delimiter: "|"
fields:
- first_name:
    pii: first_name
- last_name:
    pii: last_name
- ssn:
    hash: true
    pii: ssn
    ssn: true
- job
- birth_date:
    pii: dob
    type: date
    format: "%m-%d-%Y"
- file_date:
    type: date
    format: "%m/%d/%Y"
- adjusted_gross_income:
    type: int
3 changes: 3 additions & 0 deletions requirements.txt
@@ -0,0 +1,3 @@
faker
numpy
sirad
Binary file added scatterplot.png
24 changes: 24 additions & 0 deletions scatterplot.py
@@ -0,0 +1,24 @@
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import sqlite3

# Join each individual's adjusted gross income to their credit score
# through the anonymized sirad_id.
sql = """
SELECT t.sirad_id,
       adjusted_gross_income,
       credit_score
FROM tax t
INNER JOIN credit_scores cs
ON t.sirad_id = cs.sirad_id
"""

with sqlite3.connect("build/research_v1.db") as cxn:
    df = pd.read_sql(sql, cxn, index_col="sirad_id")

print(df.head())

sns.set_style("whitegrid")
sns.regplot(x="adjusted_gross_income", y="credit_score", data=df, truncate=True)
plt.tight_layout()
plt.savefig("scatterplot.png")
