Tools to organize, document, and validate the variables of interest in scientific studies
Command line usage
masterfile --help will list all the subcommands.
masterfile create masterfile_path out_file
masterfile join masterfile_path out_file
masterfile extract [-s|--skip ROWS] [--index_column COL] masterfile_path csv_file out_file
asterfile validate masterfile_path [file [file ...]]
Draft API usage example
import masterfile # Load all of the .csv files from /path, and the dictionary files in # /path/dictionaries. Takes settings info from a 'settings.json' file in # /path. # joins the .csv files on 'participant_id', which will be used as the index # There will be warnings if the data look bad in some way mf = masterfile.load('/path') # Get the pandas dataframe associated df = mf.dataframe # aliased as mf.df # All the variable stuff is less important, people can go look in data dicts # So we'll write that stuff later. v = mf.lookup('sr_t1_panas_pa') v.contacts # list_of_names v.measure.contact # Someone v.modality # Component("self-report")
CSV file format
CSV files should be comma-separated (no surprise there) and have DOS line endings (CRLF). They should not have the stupid UTF-8 signature at the start. UTF-8 characters are fine. Missing data is indicated by an empty cell. Quoting should be like Excel does.
Basically, you want Excel-for-Windows-style CSV files with no UTF-8 signature.
- CSV format
- Has AT LEAST two columns: component, short_name
- Those are the indexes
- There shouldn't be any repeats in the index
- The settings.json file should contain a "components" thing that says what should exist in the component column
- Things with blank component are ignored (TODO: Maybe?)
- CSV format
- Live in exclusions/
- One row per ppt, one column per value
- Has index column, same as data file
- Blanks mean "Use this value," nonblanks mean "exclude this value"
- Things in the cells may be codes; these codes may be defined in settings.json
- If data is excluded for more than one reason, separate codes with ","
- Not all rows / columns in masterfiles need to be included in exclusion files. Missing rows / columns are treated like blank values.
Here are some (all?) of the things to do to verify you have semantically reasonable data:
- Variable parts not in dictionaries
- Missing participant_id column
- Repeated paticipant_id column
- Blanks in participant_id column
- Duplicate columns
- Column names not matching format
Getting started for development
Create a virtualenv:
virtualenv ~/env/masterfile source ~/env/masterfile/bin/activate
Install the requirements and this module for development:
pip install -r requirements_dev.txt pip install -e .
Run tests across all supported Python versions:
To run in a specific python version:
tox -e py37
schema is copyright (c) 2012 Vladimir Keleshev, firstname.lastname@example.org
attrs is copyright (c) 2015 Hynek Schlawack