
Speed things up — maybe with a multiscore program? #32

Closed
njvack opened this issue Feb 29, 2024 · 2 comments

Comments


njvack commented Feb 29, 2024

When we're scoring lots of questionnaires (say, 30 scoresheets across 4 timepoints), total runtime is quite long. Doing a dev-test-check cycle with a "score everything" script is frustrating.

One solution would be to use a proper dependency manager like make, and another would be to just say "don't do that, test one scoresheet at a time." But the first is a challenge unless you're fairly technical, and the second is pretty tedious too.

I've done some very preliminary profiling, and it looks like the bulk of the time when running score_data is spent in startup: to a small extent spawning the interpreter, but to a larger extent doing imports. (time python -c "import numpy" takes more than 300 ms on my system.)
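For reference, here's a minimal way to reproduce that kind of measurement from Python itself. This sketch times a full interpreter spawn plus one import; it uses the stdlib json module rather than numpy so it runs anywhere, so the absolute number will be smaller than numpy's.

```python
import subprocess
import sys
import time

# Time a full interpreter spawn plus one import -- the overhead that
# score_data pays on every invocation. json is stdlib, so this runs
# without numpy installed (numpy's import cost is larger).
start = time.perf_counter()
proc = subprocess.run([sys.executable, "-c", "import json"], check=True)
elapsed = time.perf_counter() - start
print(f"interpreter spawn + import took {elapsed:.3f}s")
```

Python 3.7+ also has `python -X importtime`, which breaks the import cost down per module.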

So! One way to approach this would be to add a multiscore (or maybe score_multi) program, which would take, say, a list of scoresheets, a data file, and an output path. Maybe add options for doing a simple string replacement (for subbing in things like timepoint) and output filename pattern.

A more complex approach could take a CSV file that would specify scoring invocations, but I suspect that would be too weighty to set up and use.

Then, in the above case, instead of 120 invocations of score_data we'd have four invocations of multiscore, and instead of ~40 seconds of startup overhead we'd be looking at ~1.3 seconds.
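The amortization idea can be sketched roughly like this. Everything here is illustrative (function and column names are not scorify's actual API): the point is just that the interpreter spawn, the imports, and the data-file parse happen once per data file rather than once per scoresheet.

```python
import csv
import io

def score_one(scoresheet_name, rows):
    # Stand-in for the real per-scoresheet scoring logic; here we just
    # record how many data rows each scoresheet would see.
    return {"scoresheet": scoresheet_name, "n_rows": len(rows)}

def score_multi(data_csv_text, scoresheets):
    # Parse the data file once, then score it against every scoresheet
    # in the same process, so startup and imports are paid once instead
    # of once per scoresheet.
    rows = list(csv.DictReader(io.StringIO(data_csv_text)))
    return [score_one(name, rows) for name in scoresheets]

results = score_multi("ppt,q1\n001,3\n002,4\n", ["mood.csv", "sleep.csv"])
```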


njvack commented Feb 29, 2024

Potential docopt string:

Score data with a set of scoresheets.

Usage:
  score_multi [options] <datafile> <output_dir> <scoresheet>...
  score_multi -h | --help

Options:
  --output_pattern=<pat>      Some kind of thing that lets you put stuff in the filenames, e.g. "scored_%s.csv"
  --column-replace=<replace>  Something like TP:t1 to replace {{TP}} with "t1"
  --ignore-missing            Don't raise an error on missing column names (still does log to stderr?)
  ...other options like score_data uses
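The --column-replace and --output_pattern options above might behave something like this. This is a hypothetical sketch of semantics that are only proposed in this issue; the helper names and the stem handling are invented for illustration.

```python
import os

def apply_replacements(text, replacements):
    # Substitute {{KEY}} tokens: --column-replace=TP:t1 would turn
    # "score_{{TP}}" into "score_t1".
    for key, value in replacements.items():
        text = text.replace("{{" + key + "}}", value)
    return text

def output_name(pattern, scoresheet_path):
    # Apply an --output_pattern like "scored_%s.csv" to a scoresheet's
    # base filename (illustrative; real naming rules may differ).
    stem = os.path.splitext(os.path.basename(scoresheet_path))[0]
    return pattern % stem
```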


njvack commented Apr 10, 2024

I'm actually thinking the CSV file option is the way to go, despite being maybe a little gross. This way, it's easier for more "IT-ish" folk to set up a script, and for "data-ish" folk to modify scoresheets.

And anyhow, scorify's entire schtick is basically "do dumb, gross stuff with CSV files" so maybe it's fine?
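A scoring-plan CSV might be read like this. The column names here are invented for illustration; no actual format was settled in this thread.

```python
import csv
import io

def read_plan(plan_csv_text):
    # One scoring invocation per row: which scoresheet, which data
    # file, and where the scored output goes.
    return list(csv.DictReader(io.StringIO(plan_csv_text)))

plan = read_plan(
    "scoresheet,datafile,output\n"
    "mood.csv,t1_data.csv,scored_mood_t1.csv\n"
    "sleep.csv,t1_data.csv,scored_sleep_t1.csv\n"
)
```

A driver would then loop over the rows, scoring each (scoresheet, datafile) pair in one long-lived process.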

@njvack njvack closed this as completed in 1282f48 May 15, 2024