Initial commit.
Ted Lawless committed Jun 18, 2018
0 parents commit 0c2a162
Showing 11 changed files with 916 additions and 0 deletions.
121 changes: 121 additions & 0 deletions .gitignore
@@ -0,0 +1,121 @@
*dblite
raw/*
build/*

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/

20 changes: 20 additions & 0 deletions LICENSE.txt
@@ -0,0 +1,20 @@
Copyright 2018, Brown University, Providence, RI.

All Rights Reserved

Permission to use, copy, modify, and distribute this software and its
documentation for any purpose other than its incorporation into a
commercial product is hereby granted without fee, provided that the
above copyright notice appear in all copies and that both that
copyright notice and this permission notice appear in supporting
documentation, and that the name of Brown University not be used in
advertising or publicity pertaining to distribution of the software
without specific, written prior permission.

BROWN UNIVERSITY DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE,
INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR ANY
PARTICULAR PURPOSE. IN NO EVENT SHALL BROWN UNIVERSITY BE LIABLE FOR
ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
144 changes: 144 additions & 0 deletions README.md
@@ -0,0 +1,144 @@
# Worked example of Secure Infrastructure for Research with Administrative Data (SIRAD)

`sirad` is an integration framework for data from administrative systems. It
deidentifies administrative data by removing and replacing personally
identifiable information (PII) with a global anonymized identifier, allowing
researchers to securely join data on an individual from multiple tables without
knowing the individual's identity.

This is a simplified demonstration of how `sirad` works on simulated data; for
more details on how it is used in practice with real administrative data,
please see our manuscript preprint:

> J.S. Hastings, M. Howison, T. Lawless, J. Ucles, P. White. 2018.
> Integrating Administrative Data for Policy Insights.
> (link to arXiv preprint)

In this worked example, we simulate two administrative data sets:

1. **IRS 1040 tax returns**, identified by social security number (SSN),
   first/last name, and date of birth (DOB)
2. **Credit history**, identified by first/last name and date of birth (DOB)

`sirad` uses a deterministic matching algorithm to match records across the two
data sets corresponding to the same individual. It then assigns an anonymized
identifier (the `sirad_id`) to each matched individual, and creates a
deidentified table for each data set where the SSNs, names, and DOBs have been
replaced with the `sirad_id`. Finally, we demonstrate an analysis that uses the
`sirad_id` to join adjusted gross income from the tax returns table to credit
scores in the credit history table.

**Note**: the data are simulated by the `simulate.py` script using
[Faker](https://github.com/joke2k/faker), which creates realistic PII that does
not represent actual individuals. Any data in this example that looks
personally identifiable is not!
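
As a rough illustration of how such simulated PII might look, `Faker` can be
used along these lines (the field names here are illustrative, not necessarily
the ones used by `simulate.py`):

```python
from faker import Faker

fake = Faker()
fake.seed_instance(42)  # make the fake records reproducible

# Each call returns realistic-looking but entirely fictitious values.
person = {
    "first_name": fake.first_name(),
    "last_name": fake.last_name(),
    "ssn": fake.ssn(),
    "dob": fake.date_of_birth(minimum_age=18, maximum_age=90),
}
print(person)
```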

## Installing dependencies

Requires Python 3.6 or later. There are several options for installing the
dependencies (listed in `requirements.txt`).

You can use **pip** to install them globally with
`pip install -r requirements.txt`.

If you do not have write access to install globally, you can install into your
home directory with
`pip install --user -r requirements.txt`.

If you have Anaconda Python, you can use **conda** to install them in your
root environment with
`conda install -c riipl-org --file requirements.txt`.

Or, if you would prefer to create a named conda environment, use
`conda create -c riipl-org -n sirad-example --file requirements.txt`
and activate it with
`source activate sirad-example`.

## Running the example

### Step 1: Simulate data

Command: `python simulate.py`

### Step 2: Process the raw data into separate PII, data, and link files

Command: `sirad process`

`sirad` processes a set of **raw** data files specified by a set of **layout
files**. In this example, there are two simulated raw data files generated in
Step 1: tax records (`raw/tax.txt`) and credit history
(`raw/credit_scores.txt`). Their layouts are `layouts/tax.yaml` and
`layouts/credit_scores.yaml`. The layouts are YAML files that describe the
column layout and field types in the raw data files.

The processing step uses the `pii` properties in the layout to split the PII
fields from the data fields in each row of the raw files. It randomly shuffles
the order of the PII rows when writing to the PII file. The data file has the
same row order as the raw data file. The link file provides a lookup table
that re-links the shuffled PII rows to the data rows.
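
As a rough sketch of the idea (not the `sirad` implementation; the file names
and columns below are made up for illustration), each raw row is split into a
data part and a PII part, the PII rows are written out in shuffled order, and
the link file records which shuffled PII row belongs to which data row:

```python
import csv
import random

# Toy raw rows: (ssn, name, credit_score) -- values are illustrative only.
raw = [
    ("111-11-1111", "Alice Smith", 700),
    ("222-22-2222", "Bob Jones", 650),
    ("333-33-3333", "Carol Lee", 720),
]

data_rows = [(i, score) for i, (_, _, score) in enumerate(raw)]      # record_id, data
pii_rows = [(i, ssn, name) for i, (ssn, name, _) in enumerate(raw)]  # record_id, pii

# Shuffle the PII rows; their new position becomes the pii_id.
random.shuffle(pii_rows)
link_rows = [(pii_id, record_id) for pii_id, (record_id, _, _) in enumerate(pii_rows)]

with open("data.csv", "w", newline="") as f:
    csv.writer(f).writerows(data_rows)        # record_id, credit_score (raw order)
with open("pii.csv", "w", newline="") as f:
    csv.writer(f).writerows(
        (pii_id, ssn, name) for pii_id, (_, ssn, name) in enumerate(pii_rows)
    )                                         # pii_id, ssn, name (shuffled order)
with open("link.csv", "w", newline="") as f:
    csv.writer(f).writerows(link_rows)        # pii_id, record_id
```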

### Step 3: Stage the processed files in a database

Command: `sirad stage`

This step stages the PII, data, and link files in a relational database.
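
As a minimal sketch of what staging could look like (the paths, table name,
and use of pandas are assumptions for illustration, not the `sirad stage`
implementation):

```python
import sqlite3

import pandas as pd

# Load one pipe-delimited processed data file into a staging SQLite database.
df = pd.read_csv("build/processed/data/tax.txt", sep="|")  # illustrative path
with sqlite3.connect("build/db/data.db") as cxn:
    df.to_sql("tax", cxn, if_exists="replace", index=False)
```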

### Step 4: Create a versioned research database

Command: `sirad research --version 1`

This step uses the PII database to construct a global anonymized identifier
(the `sirad_id`), then uses the link files to attach it to each data table in
the database. The result is a **research** database which contains no PII, but
in which individual-level data in different tables can be joined by the
anonymized identifier. Research databases are versioned to support reproducible
analysis.
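
A toy sketch of the underlying idea (the real matching rules and identifier
construction in `sirad` are more involved; the records and keys below are made
up): PII records from different sources that agree exactly on shared
identifiers are assigned the same anonymized `sirad_id`.

```python
import itertools

# Toy PII rows from the two sources -- values are illustrative only.
tax_pii = [{"record_id": 0, "first": "Alice", "last": "Smith", "dob": "01-02-1980"}]
credit_pii = [{"record_id": 0, "first": "Alice", "last": "Smith", "dob": "01-02-1980"}]

next_id = itertools.count(1)
sirad_ids = {}

def assign(row):
    # Deterministic key on name + DOB; the real tool also uses (hashed) SSNs
    # where they are available.
    key = (row["first"].lower(), row["last"].lower(), row["dob"])
    if key not in sirad_ids:
        sirad_ids[key] = next(next_id)
    return sirad_ids[key]

tax_links = {row["record_id"]: assign(row) for row in tax_pii}
credit_links = {row["record_id"]: assign(row) for row in credit_pii}
print(tax_links, credit_links)  # both record 0s receive sirad_id 1
```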

## Resulting database

After the build finishes, an SQLite database called `research_v1.db` will be
created in the `build` directory. This database has two tables created from
the simulated data:

### tax

sirad_id | record_id | job | file_date | adjusted_gross_income | import_dt
-|-|-|-|-|-

### credit_scores

sirad_id | record_id | credit_score | import_dt
-|-|-|-

Notes:
* `sirad_id` is an anonymized identifier created from the PII.
* `record_id` is a primary key for the research/data records, and `pii_id` is a
shuffled primary key for the PII records.
* `import_dt` is a timestamp for when the raw data were processed.
* All PII fields (SSN, first/last name, DOB) have been removed from the research database.

The results are organized in the following directory structure:
* `raw/`: the simulated raw data files
* `build/processed`: processed data files (organized by `data`, `pii`, and `link`)
* `build/db`: the staging databases for the processed files
* `build/research_v1.db`: the final research database

In a real-world application, only the `research_v1.db` database would be
accessible to researchers. The `raw`, `processed`, and `db` directories should
be stored in a restricted location that is inaccessible to any individual
researcher, for example by using encryption with a multi-party key or
passphrase, auditing, real-time alerting, and/or other security controls that
ensure no individual researcher can access build files containing PII.

## Example analysis

`scatterplot.py` demonstrates an analysis that uses the `sirad_id` to
anonymously join records about individuals. It selects adjusted gross income
from the `tax` table joined to the corresponding credit score from the
`credit_scores` table, then generates this scatter plot:

![scatterplot](scatterplot.png)

**Note:** these variables are correlated by construction, and were drawn from a
joint distribution (with added noise) in the simulation.
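
For reference, correlated-by-construction variables can be drawn along these
lines (a sketch under assumed parameters, not the actual `simulate.py` logic):

```python
import numpy as np

rng = np.random.default_rng(0)

# Income from a lognormal; credit score as a noisy increasing function of
# income, clipped to the conventional 300-850 range. Parameters are invented.
income = rng.lognormal(mean=10.5, sigma=0.5, size=1000)
score = np.clip(
    550 + 30 * np.log(income / income.mean()) + rng.normal(0, 40, size=1000),
    300,
    850,
)
```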
14 changes: 14 additions & 0 deletions layouts/credit_scores.yaml
@@ -0,0 +1,14 @@
source: credit_scores.txt
type: csv
delimiter: "|"
fields:
- first:
    pii: first_name
- last_name:
    pii: last_name
- birth_date:
    pii: dob
    type: date
    format: "%m-%d-%Y"
- credit_score:
    type: int
22 changes: 22 additions & 0 deletions layouts/tax.yaml
@@ -0,0 +1,22 @@
source: tax.txt
type: csv
delimiter: "|"
fields:
- first_name:
    pii: first_name
- last_name:
    pii: last_name
- ssn:
    hash: true
    pii: ssn
    ssn: true
- job
- birth_date:
    pii: dob
    type: date
    format: "%m-%d-%Y"
- file_date:
    type: date
    format: "%m/%d/%Y"
- adjusted_gross_income:
    type: int
3 changes: 3 additions & 0 deletions requirements.txt
@@ -0,0 +1,3 @@
faker
numpy
sirad
Binary file added scatterplot.png
24 changes: 24 additions & 0 deletions scatterplot.py
@@ -0,0 +1,24 @@
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import sqlite3

# Join each individual's adjusted gross income to their credit score
# through the anonymized sirad_id.
sql = """
SELECT t.sirad_id,
       adjusted_gross_income,
       credit_score
FROM tax t
INNER JOIN credit_scores cs
ON t.sirad_id = cs.sirad_id
"""

with sqlite3.connect("build/research_v1.db") as cxn:
    df = pd.read_sql(sql, cxn, index_col="sirad_id")

print(df.head())

sns.set_style("whitegrid")
sns.regplot(x="adjusted_gross_income", y="credit_score", data=df, truncate=True)
plt.tight_layout()
plt.savefig("scatterplot.png")
