Commit 0c2a162
Ted Lawless committed on Jun 18, 2018 (0 parents)
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Showing 11 changed files with 916 additions and 0 deletions.
**`.gitignore`**

```gitignore
*dblite
raw/*
build/*

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
```
**`LICENSE`**

```
Copyright 2018, Brown University, Providence, RI.

All Rights Reserved

Permission to use, copy, modify, and distribute this software and its
documentation for any purpose other than its incorporation into a
commercial product is hereby granted without fee, provided that the
above copyright notice appear in all copies and that both that
copyright notice and this permission notice appear in supporting
documentation, and that the name of Brown University not be used in
advertising or publicity pertaining to distribution of the software
without specific, written prior permission.

BROWN UNIVERSITY DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE,
INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR ANY
PARTICULAR PURPOSE. IN NO EVENT SHALL BROWN UNIVERSITY BE LIABLE FOR
ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
```
**`README.md`**
# Worked example of Secure Infrastructure for Research with Administrative Data (SIRAD)

`sirad` is an integration framework for data from administrative systems. It
deidentifies administrative data by removing and replacing personally
identifiable information (PII) with a global anonymized identifier, allowing
researchers to securely join data on an individual from multiple tables without
knowing the individual's identity.

This is a simplified demonstration of how `sirad` works on simulated data; for
more details on how it is used in practice with real administrative data,
please see our manuscript preprint:

> J.S. Hastings, M. Howison, T. Lawless, J. Ucles, P. White. 2018.
> Integrating Administrative Data for Policy Insights.
> (link to arXiv preprint)

In this worked example, we simulate two administrative data sets:

1. **IRS 1040 tax returns**, identified by social security number (SSN),
   first/last name, and date of birth (DOB)
2. **Credit history**, identified by first/last name and date of birth (DOB)

`sirad` uses a deterministic matching algorithm to match records across the two
data sets that correspond to the same individual. It then assigns an anonymized
identifier (the `sirad_id`) to each matched individual and creates a
deidentified table for each data set in which the SSNs, names, and DOBs have
been replaced with the `sirad_id`. Finally, we demonstrate an analysis that
uses the `sirad_id` to join adjusted gross income from the tax returns table to
credit scores in the credit history table.

**Note**: the data are simulated by the `simulate.py` script using
[Faker](https://github.com/joke2k/faker), which creates realistic PII that does
not represent actual individuals. Any data in this example that looks
personally identifiable is not!

## Installing dependencies

Requires Python 3.6 or later. There are several options for installing the
dependencies (listed in `requirements.txt`).

You can use **pip** to install them globally with
`pip install -r requirements.txt`.

If you do not have write access to install globally, you can install into your
home directory with
`pip install --user -r requirements.txt`.

If you have Anaconda Python, you can use **conda** to install them in your
root environment with
`conda install -c riipl-org --file requirements.txt`.

Or, if you would prefer a named conda environment, create one with
`conda create -c riipl-org -n sirad-example --file requirements.txt`
and activate it with
`source activate sirad-example`.

## Running the example

### Step 1: Simulate data

Command: `python simulate.py`

### Step 2: Process the raw data into separate PII, data, and link files

Command: `sirad process`

`sirad` processes a set of **raw** data files described by a set of **layout
files**. In this example, there are two simulated raw data files generated in
Step 1: tax records (`raw/tax.txt`) and credit history
(`raw/credit_scores.txt`). Their layouts are `layouts/tax.yaml` and
`layouts/credit_scores.yaml`. The layouts are YAML files that describe the
column layout and field types in the raw data files.

The processing step uses the `pii` properties in the layout to split the PII
fields from the data fields in each row of the raw files. It randomly shuffles
the order of the PII rows when writing the PII file. The data file keeps the
same row order as the raw data file. The link file provides a lookup table
that re-links the shuffled PII rows to the data rows.
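
The split, shuffle, and link steps can be sketched in a few lines of plain Python (a simplification with made-up field names, not `sirad`'s actual implementation):

```python
# Sketch: split each raw row into PII and data, shuffle the PII,
# and keep a link table that can rejoin them. Field names are illustrative.
import random

random.seed(42)  # deterministic shuffle for this demonstration

raw = [
    {"ssn": "000-00-0001", "credit_score": 700},
    {"ssn": "000-00-0002", "credit_score": 650},
    {"ssn": "000-00-0003", "credit_score": 720},
]

pii_rows, data_rows = [], []
for record_id, row in enumerate(raw):
    pii_rows.append({"record_id": record_id, "ssn": row["ssn"]})
    data_rows.append({"record_id": record_id, "credit_score": row["credit_score"]})

# Shuffle the PII rows so their file order reveals nothing about the data file.
random.shuffle(pii_rows)

link = []  # lookup table that re-links shuffled PII rows to data rows
for pii_id, row in enumerate(pii_rows):
    link.append({"pii_id": pii_id, "record_id": row.pop("record_id")})
    row["pii_id"] = pii_id

# data_rows keeps the raw row order; only `link` can reconnect PII to data.
```

The point of the shuffle is that neither output file alone leaks the pairing; an attacker needs the link table as well.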

### Step 3: Stage the processed files in a database

Command: `sirad stage`

This step stages the PII, data, and link files in a relational database.

### Step 4: Create a versioned research database

Command: `sirad research --version 1`

This step uses the PII database to construct a global anonymized identifier
(the `sirad_id`), then uses the link files to attach it to each data table in
the database. The result is a **research** database that contains no PII, but
in which individual-level data in different tables can be joined on the
anonymized identifier. Research databases are versioned to support reproducible
analysis.
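
One way to build such a global identifier is a keyed hash over normalized PII, so the same individual maps to the same opaque value in every table. This is a sketch under assumptions (the source does not specify how `sirad` constructs the id), and `SECRET_KEY` is hypothetical:

```python
# Sketch: derive an opaque, deterministic identifier from matched PII.
# This is NOT sirad's documented construction; it illustrates the idea only.
import hashlib
import hmac

SECRET_KEY = b"hypothetical-key-kept-away-from-researchers"

def sirad_id_for(ssn: str, first: str, last: str, dob: str) -> str:
    """Map one individual's PII to a stable opaque identifier."""
    # Normalize case so "Ada" and "ADA" match the same person.
    msg = "|".join([ssn, first.lower(), last.lower(), dob]).encode()
    return hmac.new(SECRET_KEY, msg, hashlib.sha256).hexdigest()[:16]

a = sirad_id_for("000-00-0001", "Ada", "Lovelace", "12-10-1815")
b = sirad_id_for("000-00-0001", "ADA", "LOVELACE", "12-10-1815")
assert a == b  # the same person in two tables gets the same id
```

Using a keyed hash (rather than a plain hash) matters: without the key, anyone holding a candidate SSN could recompute the id and re-identify rows.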

## Resulting database

After the build finishes, a SQLite database called `research_v1.db` is created
in the `build` directory. This database has two tables created from the
simulated data:

### tax

sirad_id | record_id | job | file_date | adjusted_gross_income | import_dt
-|-|-|-|-|-

### credit_scores

sirad_id | record_id | credit_score | import_dt
-|-|-|-

Notes:

* `sirad_id` is an anonymized identifier created from the PII.
* `record_id` is a primary key for the research/data records, and `pii_id` is a
  shuffled primary key for the PII records.
* `import_dt` is a timestamp for when the raw data were processed.
* All PII fields (SSN, first/last name, DOB) have been removed from the
  research database.

The results are organized in the following directory structure:

* `raw/`: the simulated raw data files
* `build/processed`: processed data files (organized by `data`, `pii`, and `link`)
* `build/db`: the staging databases for the processed files
* `build/research_v1.db`: the final research database

In a real-world application, only the `research_v1.db` database would be
accessible to researchers. The `raw`, `processed`, and `db` directories should
be stored in a restricted location that is inaccessible to any individual
researcher, for example by using encryption with a multi-party key or
passphrase, auditing, real-time alerting, and/or other appropriate security
controls that ensure an individual researcher cannot access build files that
contain PII.

## Example analysis

`scatterplot.py` demonstrates an analysis that uses the `sirad_id` to
anonymously join records about individuals. It selects adjusted gross income
from the `tax` table joined to the corresponding credit score from the
`credit_scores` table, then generates this scatter plot:

![scatterplot](scatterplot.png)

**Note:** these variables are correlated by construction; they were drawn from
a joint distribution (with added noise) in the simulation.
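
A minimal numpy sketch of how such correlated draws with noise might be produced (the parameter values are assumptions, not the ones `simulate.py` uses):

```python
# Sketch: draw income, then derive a credit score that rises with income
# plus Gaussian noise, yielding correlated variables by construction.
import numpy as np

rng = np.random.default_rng(0)
n = 500

income = rng.normal(50_000, 15_000, size=n).clip(min=0)
score = (500 + 0.003 * income + rng.normal(0, 25, size=n)).clip(300, 850)

# The noise weakens but does not destroy the built-in correlation.
r = np.corrcoef(income, score)[0, 1]
print(f"correlation: {r:.2f}")
```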
**`layouts/credit_scores.yaml`**

```yaml
source: credit_scores.txt
type: csv
delimiter: "|"
fields:
  - first:
      pii: first_name
  - last_name:
      pii: last_name
  - birth_date:
      pii: dob
      type: date
      format: "%m-%d-%Y"
  - credit_score:
      type: int
```
**`layouts/tax.yaml`**

```yaml
source: tax.txt
type: csv
delimiter: "|"
fields:
  - first_name:
      pii: first_name
  - last_name:
      pii: last_name
  - ssn:
      hash: true
      pii: ssn
      ssn: true
  - job
  - birth_date:
      pii: dob
      type: date
      format: "%m-%d-%Y"
  - file_date:
      type: date
      format: "%m/%d/%Y"
  - adjusted_gross_income:
      type: int
```
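
A layout in this format can be consumed generically. The sketch below is an assumption about usage, not `sirad`'s actual loader; it uses PyYAML to separate PII columns from data columns in a trimmed-down layout:

```python
# Sketch: parse a (shortened, illustrative) layout and classify its fields.
import yaml

layout_text = """
source: tax.txt
type: csv
delimiter: "|"
fields:
  - ssn:
      hash: true
      pii: ssn
  - job
  - adjusted_gross_income:
      type: int
"""

layout = yaml.safe_load(layout_text)

pii_fields, data_fields = [], []
for field in layout["fields"]:
    if isinstance(field, str):      # bare entry like "job" has no options
        data_fields.append(field)
        continue
    (name, opts), = field.items()   # each mapping entry holds one field
    (pii_fields if opts and "pii" in opts else data_fields).append(name)

print(pii_fields)   # fields to route to the PII file
print(data_fields)  # fields to route to the data file
```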
**`requirements.txt`**

```
faker
numpy
sirad
```
*(one file could not be displayed)*
**`scatterplot.py`**

```python
import sqlite3

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Join adjusted gross income to credit score through the anonymized sirad_id.
sql = """
SELECT t.sirad_id,
       adjusted_gross_income,
       credit_score
FROM tax t
INNER JOIN credit_scores cs
  ON t.sirad_id = cs.sirad_id
"""

with sqlite3.connect("build/research_v1.db") as cxn:
    df = pd.read_sql(sql, cxn, index_col="sirad_id")

print(df.head())

sns.set_style("whitegrid")
sns.regplot(x="adjusted_gross_income", y="credit_score", data=df, truncate=True)
plt.tight_layout()
plt.savefig("scatterplot.png")
```