When we're scoring lots of questionnaires (say, 30 scoresheets across 4 timepoints), total runtime is quite long. Doing a dev-test-check cycle with a "score everything" script is frustrating.
One solution would be to use a dependency-tracking build tool like `make`, and another would be to just say "don't do that, test one scoresheet at a time." But the first is a challenge unless you're real techy, and the second is kinda tedious too.
I've done some very preliminary profiling, and it looks like the bulk of the time when running `score_data` is in startup: a small part is spawning the interpreter, but the larger part is doing imports. (`time python -c "import numpy"` takes more than 300ms on my example system.)
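For anyone who wants to reproduce that measurement from Python rather than the shell, here's a rough sketch. `startup_time` is a throwaway helper, not part of scorify, and the second call assumes numpy is installed:

```python
import subprocess
import sys
import time

def startup_time(code: str) -> float:
    """Wall-clock seconds for a fresh interpreter to launch and run `code`."""
    start = time.perf_counter()
    subprocess.run([sys.executable, "-c", code])
    return time.perf_counter() - start

# Interpreter spawn alone vs. spawn plus a heavy import:
print(f"bare interpreter: {startup_time('pass'):.3f}s")
print(f"with numpy import: {startup_time('import numpy'):.3f}s")
```

The gap between the two numbers is the import cost we'd be amortizing.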
So! One way to approach this would be to add a `multiscore` (or maybe `score_multi`) program, which would take, say, a list of scoresheets, a data file, and an output path. Maybe add options for doing a simple string replacement (for subbing in things like timepoint) and an output filename pattern.
A more complex approach could take a CSV file that would specify scoring invocations, but I suspect that would be too weighty to set up and use.
Then, in the above case, instead of 120 invocations of `score_data`, we'd have four invocations of `multiscore`, and instead of ~40 seconds of startup overhead we'd be looking at ~1.3 seconds.
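The speedup is just amortizing startup: pay for the interpreter launch and the imports once, then loop over scoresheets in-process. A minimal sketch of that shape (where `score_one` is a hypothetical stand-in for the real scoring logic, and the data handling is simplified):

```python
import csv

def score_one(scoresheet_path, data_rows):
    # Hypothetical stand-in for scorify's scoring of a single sheet;
    # by the time we get here, all imports have been paid exactly once.
    return (scoresheet_path, len(data_rows))

def multiscore(datafile, output_dir, scoresheets):
    # Read the data once, then score every sheet inside the same process,
    # so N sheets cost one interpreter launch instead of N.
    with open(datafile, newline="") as f:
        data_rows = list(csv.DictReader(f))
    return [score_one(sheet, data_rows) for sheet in scoresheets]
```

With 30 scoresheets per call, the per-sheet startup cost effectively disappears.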
```
Score data with a set of scoresheets.

Usage:
  score_multi [options] <datafile> <output_dir> <scoresheet>...
  score_multi -h | --help

Options:
  --output_pattern=<pat>      Pattern for building output filenames, e.g. "scored_%s.csv"
  --column-replace=<replace>  Something like TP:t1 to replace {{TP}} with "t1"
  --ignore-missing            Don't raise an error on missing column names (still log to stderr?)
  ...other options like score_data uses
```
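To make the `--column-replace` idea concrete, here's one possible semantics for the `{{KEY}}` expansion. All of this is hypothetical; nothing here exists in scorify yet:

```python
def apply_replacements(text, specs):
    """Expand {{KEY}} placeholders using specs like "TP:t1".

    Hypothetical semantics for the --column-replace option sketched
    above: each spec is KEY:value, and {{KEY}} becomes value.
    """
    for spec in specs:
        key, _, value = spec.partition(":")
        text = text.replace("{{" + key + "}}", value)
    return text
```

The same expansion could apply to both column names in the scoresheet and the output filename pattern, so one scoresheet works across all four timepoints.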
I'm actually thinking the CSV file option is the way to go, despite being maybe a little gross. This way, it's easier for more "IT-ish" folk to set up a script, and for "data-ish" folk to modify scoresheets.
And anyhow, scorify's entire schtick is basically "do dumb, gross stuff with CSV files" so maybe it's fine?
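If the CSV route wins, the spec file could be as dumb as one row per invocation. A sketch of reading it, with made-up column names (`scoresheet`, `datafile`, `output`) purely for illustration:

```python
import csv
import io

def read_invocations(spec_csv):
    """Parse a hypothetical spec file: one scoring invocation per row."""
    return [(row["scoresheet"], row["datafile"], row["output"])
            for row in csv.DictReader(spec_csv)]

# The kind of spec a study coordinator might maintain in a spreadsheet:
spec = io.StringIO(
    "scoresheet,datafile,output\n"
    "bdi_sheet.csv,t1_data.csv,scored/t1_bdi.csv\n"
    "bdi_sheet.csv,t2_data.csv,scored/t2_bdi.csv\n"
)
```

Since it's just a CSV, the "IT-ish" folks can generate it with a script and the "data-ish" folks can edit it in Excel.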