Mark Howison edited this page Nov 18, 2021 · 11 revisions

Documentation

Configuration file

To set configuration options, create a file called sirad_config.py and place it either in the directory where you execute the sirad command or somewhere else on your Python path. See _options in config.py for a complete list of possible options and default values.

For an example of a configuration file, see sirad_config.py from the SIRAD worked example repo.

The following options are available:

  • DATA_SALT: secret salt used for hashing data values. This shouldn't be shared. A warning will be issued if it is not set. Defaults to None.

  • PII_SALT: secret salt used for hashing PII values. This shouldn't be shared. A warning will be issued if it is not set. Defaults to None.

  • LAYOUTS: directory that contains layout files. Defaults to layouts/.

  • RAW_DIR, DATA_DIR, PII_DIR, LINK_DIR, RESEARCH_DIR: paths to the directories that hold the original (raw) data, the processed files (data, PII, and link), and the research files.

  • VERSION: the current version number of the processed and research files.
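
As an illustration of the options listed above, a minimal sirad_config.py might look like the sketch below. Every value is a hypothetical placeholder except LAYOUTS, which uses the documented default; the directory names and the type of VERSION are assumptions, so check _options in config.py for the authoritative defaults.

```python
# sirad_config.py: illustrative values only (all hypothetical unless noted).

# Secret salts: keep these out of version control and do not share them.
DATA_SALT = "replace-with-a-long-random-secret"
PII_SALT = "replace-with-a-different-random-secret"

# Directory containing the YAML layout files (the documented default).
LAYOUTS = "layouts/"

# Hypothetical directory names for each stage of processing.
RAW_DIR = "raw/"
DATA_DIR = "build/data/"
PII_DIR = "build/pii/"
LINK_DIR = "build/link/"
RESEARCH_DIR = "research/"

# Version of the processed and research files (value and type are assumptions).
VERSION = 1
```

Using two different salts means that the same value hashes differently in the PII and data outputs.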

YAML layout format

sirad uses YAML files to define the layout, or structure, of raw data files. These YAML files define each column in the incoming data and how it should be processed.

For an example of a YAML layout file, see tax.yaml from the SIRAD worked example repo.

The following properties can be specified in a YAML layout file:

source (required)

The path to the source file, relative to RAW_DIR.

type (default: csv)

The following values for file type are supported:

  • csv - delimited text file, defaulting to comma-delimited (see delimiter below)
  • fixed - fixed width format, which requires the specification of a width property for each field (see fields below)
  • xlsx - Excel .xlsx file (note: .xls is not currently supported)

delimiter (default: ',')

For the csv file type, this specifies the delimiter to use. Common alternatives to comma-delimited include tab-delimited ('\t') and pipe-delimited ('|').
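
To show what the delimiter setting controls, here is a small sketch that parses tab-delimited text with Python's standard csv module. This is not sirad's reader; the function name read_rows and the sample data are hypothetical.

```python
import csv
import io

# Hypothetical tab-delimited input with a header row.
raw = "id\tname\n1\tAda\n2\tGrace\n"

def read_rows(text, delimiter=","):
    """Parse delimited text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))

rows = read_rows(raw, delimiter="\t")
```

Parsing the same text with the default comma delimiter would leave each line as a single unsplit column, which is the typical symptom of a wrong delimiter value.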

encoding (default: ascii)

The file encoding to use when opening a source file of type csv or fixed. If you do not know the encoding ahead of time, you can detect the encoding by running the Unix file command on the source file. Line endings (LF or CRLF) are detected automatically by the file parser. Non-ASCII characters are automatically transliterated to ASCII according to the character mapping found in readers.py.
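
The character mapping in readers.py is not reproduced here. As a stand-in, the sketch below approximates transliteration with Unicode NFKD decomposition from the standard library, which is an assumption and may differ from sirad's mapping for some characters. The function name to_ascii is hypothetical.

```python
import unicodedata

def to_ascii(text):
    """Approximate non-ASCII characters with ASCII by dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

# Decode the raw bytes with the encoding declared in the layout,
# then transliterate. The sample name is "Jose Nunez" with accents.
raw_bytes = "Jos\u00e9 N\u00fa\u00f1ez".encode("latin-1")
clean = to_ascii(raw_bytes.decode("latin-1"))
```

Characters with no ASCII decomposition are simply dropped by this approach, which is one reason an explicit mapping table can be preferable.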

header (default: True)

Whether to read the first line of the file as the column headers.

fields (required)

A list specifying the header name and type of each field in the source file.

For a fixed-width source file, or when setting header=False:

  • The fields list must be in the same order as the contents of the source file.

For a csv or xlsx file:

  • You can specify a different order, which will be used as the order in the output.
  • Every field that appears in the fields list must also appear with the same name in the source file header.
  • If a field exists in the source file header, but not in the fields list, it will be skipped in the output.
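
The rules above for csv and xlsx files amount to projecting each source row onto the fields list. A minimal sketch, assuming each row has already been read into a dict keyed by header name (the function name project_row is hypothetical, not part of sirad):

```python
def project_row(row, fields):
    """Keep only the listed fields, in the listed order; fail on missing names."""
    missing = [name for name in fields if name not in row]
    if missing:
        raise KeyError(f"fields not found in source header: {missing}")
    return [row[name] for name in fields]

# The source header order is (id, name, notes); the layout asks for (name, id).
source_row = {"id": "1", "name": "Ada", "notes": "unused"}
projected = project_row(source_row, ["name", "id"])
```

Here the notes column is skipped because it is absent from the fields list, and the output follows the layout order rather than the source order.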

Each field consists of a name, optionally followed by a dictionary of the following field properties:

type (default: varchar)

Specify date if you wish to interpret the value as a date and convert it to a standardized YYYYMMDD format during processing.

pii

Marks the field as a type of personally identifiable information (PII). The field will be included in the PII_DIR output and not in the DATA_DIR output. The named PII fields used in calculating the sirad_id are:

  • first_name
  • last_name
  • dob

The named PII fields used in censuscoding addresses have one of the following prefixes for address type (additional types can be added by editing research.py):

  • home
  • mailing
  • employer
  • employer1
  • employer2
  • employer3

and one of the following suffixes for the address element:

  • _address: a field containing the entire street address including street number, e.g. 3 Main St
  • _street: a field containing only the street name
  • _street_num: a field containing only the street number
  • _city
  • _zip5: the five digit zip code
  • _zip9: a nine digit zip code

hash

Replaces the value with an irreversible SHA-1 hash of the value, using the salt in PII_SALT for PII_DIR output or the DATA_SALT for DATA_DIR output. Commonly used in conjunction with ssn or with sensitive identifiers that will be included in DATA_DIR output.
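
A minimal sketch of salted SHA-1 hashing with the standard hashlib module follows. How sirad combines the salt with the value is not stated here, so the simple concatenation below is an assumption, and the function name salted_sha1 is hypothetical.

```python
import hashlib

def salted_sha1(value, salt):
    """Return the hex SHA-1 digest of salt + value.

    Prepending the salt is an assumption made for illustration only.
    """
    return hashlib.sha1((salt + value).encode("utf-8")).hexdigest()

# The same value hashed with two different (hypothetical) salts.
pii_hash = salted_sha1("123456789", "pii-salt-example")
data_hash = salted_sha1("123456789", "data-salt-example")
```

Because the two salts differ, the hash in the PII_DIR output cannot be matched directly against the hash in the DATA_DIR output.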

ssn

Marks the field as containing a Social Security Number, which will be validated according to the rules found in dataset.py. A field with _invalid appended will be added to the output with the result of the validation.
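
The actual validation rules live in dataset.py and are not reproduced here. The sketch below applies only the widely published structural rules for SSNs (nine digits; area not 000, 666, or 900 to 999; group not 00; serial not 0000), and the 1/0 encoding of the indicator is an assumption. The function name ssn_invalid is hypothetical.

```python
import re

def ssn_invalid(value):
    """Return 1 if the value fails basic SSN structure checks, else 0."""
    digits = re.sub(r"\D", "", value or "")
    if len(digits) != 9:
        return 1
    area, group, serial = digits[:3], digits[3:5], digits[5:]
    # Area numbers 000, 666, and 900-999 are never assigned.
    if area in ("000", "666") or area >= "900":
        return 1
    # Group 00 and serial 0000 are never assigned.
    if group == "00" or serial == "0000":
        return 1
    return 0
```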

format

Specifies the date format in strftime notation for a field of date type. You can specify multiple formats separated by '|' in the case where the input data does not have a consistent format, and each format will be attempted in order after splitting on the '|' separator.
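
The try-each-format behavior can be sketched with datetime.strptime from the standard library. This is an illustration rather than sirad's own code; the function name normalize_date is hypothetical, and returning None for unparseable values is an assumption that mirrors the null replacement described under Transformations.

```python
from datetime import datetime

def normalize_date(value, formats):
    """Try each '|'-separated format in order; return YYYYMMDD or None."""
    for fmt in formats.split("|"):
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%Y%m%d")
        except ValueError:
            continue  # this format did not match; try the next one
    return None

# A column that mixes US-style and ISO-style dates.
formats = "%m/%d/%Y|%Y-%m-%d"
us_style = normalize_date("03/15/2020", formats)
iso_style = normalize_date("2020-03-15", formats)
```

Order matters when a value could match more than one format: the first format that parses successfully wins.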

width

For a fixed-width file, this specifies the number of characters that will be read for this field.
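
Reading a fixed-width line comes down to slicing consecutive character ranges. A minimal sketch, assuming surrounding whitespace padding should be stripped from each value (the function name split_fixed and the sample record are hypothetical):

```python
def split_fixed(line, widths):
    """Slice a line into consecutive fields of the given character widths."""
    values, start = [], 0
    for width in widths:
        values.append(line[start:start + width].strip())
        start += width
    return values

# Build a sample record with widths 10, 10, and 8.
line = "Ada".ljust(10) + "Lovelace".ljust(10) + "18151210"
fields = split_fixed(line, [10, 10, 8])
```

This is why the fields list must follow the order of the source file for fixed-width input: each field's position is determined by the widths of the fields before it.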

skip

Skip the field in all output. This is equivalent to omitting the field from the fields list for a csv or xlsx file, but can be useful if you want to document the existence of the field in the layout file.

data

Includes the field in the data output. Used to force a field marked pii to be included in both the PII_DIR and DATA_DIR outputs. This is useful in the case where a field is needed for calculating the sirad_id or for censuscoding, but is not actually considered PII. Examples might include dob for sirad_id (date of birth may not be classified as PII in a data sharing agreement) or a city or zip code field for censuscoding.

Output

All output from SIRAD is in pipe-delimited CSV files, and the pipe character is stripped from all field values.
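
The output convention can be sketched as follows. Whether sirad removes the pipe character outright or substitutes another character is not stated, so plain removal is an assumption here, and the function name format_row is hypothetical.

```python
def format_row(values, delimiter="|"):
    """Join values with the delimiter after removing it from each value."""
    cleaned = [str(v).replace(delimiter, "") for v in values]
    return delimiter.join(cleaned)

line = format_row(["a|b", "c", 42])
```

Stripping the delimiter from values guarantees that every output line splits into the expected number of fields without needing quoting.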

Staging during processing

The sirad process command stages output files in the following output directories (which can be deleted after a successful run of the sirad research command):

DATA_DIR

Contains an output CSV file corresponding to each layout file, using the basename of the layout file. The field record_id is prepended, and is the row number from the source file (1-based indexing). Only fields that were not marked as pii (except those with data=True) are included, in the order provided in the fields list.

PII_DIR

Contains an output CSV file corresponding to each layout file, using the basename of the layout file. The row order is randomly shuffled relative to the source file, so that the PII files cannot be directly joined to the data files. The field record_id is prepended, which is the row number after random shuffling (1-based indexing). Only fields marked as pii are included, and they are renamed according to the PII name. Additionally, each field marked as ssn has a corresponding _invalid field, holding the SSN validation indicator, appended after the other fields.

LINK_DIR

Contains an output CSV file corresponding to each layout file, using the basename of the layout file. This file contains a record_id field which corresponds to the record_id in the data file, and a pii_id field which corresponds to the record_id in the PII file. This mapping provides a link between the randomly-shuffled PII rows and the data rows.
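
The relationship between the three staging outputs can be sketched in a few lines. This is an illustration of the idea rather than sirad's implementation: the function name stage, the in-memory row dicts, and the seed parameter are all assumptions.

```python
import random

def stage(rows, pii_fields, seed=None):
    """Split rows into data, PII, and link tables, shuffling the PII order."""
    rng = random.Random(seed)
    order = list(range(len(rows)))
    rng.shuffle(order)  # order[k] is the source index placed in PII row k + 1

    # Data rows keep the source order and drop the PII fields.
    data = [{"record_id": i + 1,
             **{k: v for k, v in row.items() if k not in pii_fields}}
            for i, row in enumerate(rows)]
    # PII rows are numbered by their position after shuffling.
    pii = [{"record_id": k + 1, **{f: rows[src][f] for f in pii_fields}}
           for k, src in enumerate(order)]
    # The link table maps each data record_id to its shuffled pii_id.
    link = sorted(({"record_id": src + 1, "pii_id": k + 1}
                   for k, src in enumerate(order)),
                  key=lambda r: r["record_id"])
    return data, pii, link

rows = [
    {"first_name": "Ada", "amount": "10"},
    {"first_name": "Grace", "amount": "20"},
    {"first_name": "Edsger", "amount": "30"},
]
data, pii, link = stage(rows, ["first_name"], seed=0)
```

Without the link table, the PII rows cannot be matched back to the data rows, which is why the staging directories can be deleted once the research release has been built.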

Versioned release

The sirad research command generates a final, versioned release of de-identified data that can be used in research. It uses the PII_DIR files to construct the sirad_id and perform censuscoding, and the LINK_DIR to map and prepend any fields constructed from the PII to each of the DATA_DIR files.

An output CSV file corresponding to each layout file is written to RESEARCH_DIR, using the basename of the layout file. If the source file contained PII sufficient to construct a sirad_id (first name, last name, DOB) then a sirad_id field is prepended. For each type of address (home, mailing, employer, employer1, employer2, employer3), if the source file contained PII sufficient for censuscoding (address/zip or street/street_num/zip), then a corresponding triplet of anonymous geolocations (_city, _zip, _blkgrp) is prepended for that address type.

Transformations

As described above, the following transformations are applied in the final output:

record_id

A row identifier, called record_id, is added to every output file.

Dates

Fields marked as type=date are interpreted according to the format value (which can be a pipe-delimited list of formats), and then transformed to a normalized YYYYMMDD format in the output. Values that cannot be interpreted according to the format string are replaced with nulls, and a warning is printed when the --debug option is used.

PII

All PII fields are removed from the output, unless they are explicitly marked with data=true.

sirad_id

The sirad_id field is added to the output for any file that contains sufficient PII to construct it.

SSNs

Each field marked as ssn=true has a corresponding _invalid field with the indicator for SSN validation added to the output.

Censuscoding

For each set of address PII fields that can be censuscoded, a triplet of (_city, _zip, _blkgrp) fields is added to the output. Even though the original _city and _zip PII fields are dropped from the output (as per the transformation on PII described above), the censuscoder adds normalized versions of these fields back into the output. To normalize _city, characters are converted to upper case and only letter and space characters are retained. To normalize _zip, only digit characters are retained.
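
The two normalization rules can be sketched with regular expressions. The function names are hypothetical, and the sketch assumes values have already been transliterated to ASCII, so only the letters A to Z count as letter characters.

```python
import re

def normalize_city(value):
    """Upper-case the value and keep only letters and spaces."""
    return re.sub(r"[^A-Z ]", "", value.upper())

def normalize_zip(value):
    """Keep only digit characters."""
    return re.sub(r"[^0-9]", "", value)

city = normalize_city("St. Paul")
zip_code = normalize_zip("02912-1234")
```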