
Speed things up — maybe with a multiscore program? #32

Closed
njvack opened this issue Feb 29, 2024 · 2 comments

Comments


njvack commented Feb 29, 2024

When we're scoring lots of questionnaires (say, 30 scoresheets across 4 timepoints), total runtime is quite long. Doing a dev-test-check cycle with a "score everything" script is frustrating.

One solution would be to use a proper dependency manager like make, and another would be to just say "don't do that, test one scoresheet at a time." But the first is a challenge unless you're fairly technical, and the second is pretty tedious too.

I've done some very preliminary profiling, and it looks like the bulk of the time when running score_data is spent in startup: to a small extent spawning the interpreter, but to a larger extent doing imports. (time python -c "import numpy" takes more than 300 ms on my system.)
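For reference, here's a minimal way to reproduce that kind of measurement from Python itself. This sketch times a full interpreter spawn plus one import; it uses the stdlib json module rather than numpy so it runs anywhere, so the absolute number will be smaller than numpy's.

```python
import subprocess
import sys
import time

# Time a full interpreter spawn plus one import -- the overhead that
# score_data pays on every invocation. json is stdlib, so this runs
# without numpy installed (numpy's import cost is larger).
start = time.perf_counter()
proc = subprocess.run([sys.executable, "-c", "import json"], check=True)
elapsed = time.perf_counter() - start
print(f"interpreter spawn + import took {elapsed:.3f}s")
```

Python 3.7+ also has `python -X importtime`, which breaks the import cost down per module.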

So! One way to approach this would be to add a multiscore (or maybe score_multi) program, which would take, say, a list of scoresheets, a data file, and an output path. Maybe add options for doing a simple string replacement (for subbing in things like timepoint) and output filename pattern.

A more complex approach could take a CSV file that would specify scoring invocations, but I suspect that would be too weighty to set up and use.

Then, in the above case, instead of 120 invocations of score_data we'd have four invocations of multiscore, and instead of ~40 seconds of startup overhead we'd be looking at ~1.3 seconds.
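The amortization idea can be sketched roughly like this. Everything here is illustrative (function and column names are not scorify's actual API): the point is just that the interpreter spawn, the imports, and the data-file parse happen once per data file rather than once per scoresheet.

```python
import csv
import io

def score_one(scoresheet_name, rows):
    # Stand-in for the real per-scoresheet scoring logic; here we just
    # record how many data rows each scoresheet would see.
    return {"scoresheet": scoresheet_name, "n_rows": len(rows)}

def score_multi(data_csv_text, scoresheets):
    # Parse the data file once, then score it against every scoresheet
    # in the same process, so startup and imports are paid once instead
    # of once per scoresheet.
    rows = list(csv.DictReader(io.StringIO(data_csv_text)))
    return [score_one(name, rows) for name in scoresheets]

results = score_multi("ppt,q1\n001,3\n002,4\n", ["mood.csv", "sleep.csv"])
```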


njvack commented Feb 29, 2024

Potential docopt string:

Score data with a set of scoresheets.

Usage:
  score_multi [options] <datafile> <output_dir> <scoresheet>...
  score_multi -h | --help

Options:
  --output_pattern=<pat>      Some kind of thing that lets you put stuff in the filenames, e.g. "scored_%s.csv"
  --column-replace=<replace>  Something like TP:t1 to replace {{TP}} with "t1"
  --ignore-missing            Don't raise an error on missing column names (still does log to stderr?)
  ...other options like score_data uses
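The --column-replace and --output_pattern options above might behave something like this. This is a hypothetical sketch of semantics that are only proposed in this issue; the helper names and the stem handling are invented for illustration.

```python
import os

def apply_replacements(text, replacements):
    # Substitute {{KEY}} tokens: --column-replace=TP:t1 would turn
    # "score_{{TP}}" into "score_t1".
    for key, value in replacements.items():
        text = text.replace("{{" + key + "}}", value)
    return text

def output_name(pattern, scoresheet_path):
    # Apply an --output_pattern like "scored_%s.csv" to a scoresheet's
    # base filename (illustrative; real naming rules may differ).
    stem = os.path.splitext(os.path.basename(scoresheet_path))[0]
    return pattern % stem
```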


njvack commented Apr 10, 2024

I'm actually thinking the CSV file option is the way to go, despite being maybe a little gross. This way, it's easier for more "IT-ish" folk to set up a script, and for "data-ish" folk to modify scoresheets.

And anyhow, scorify's entire schtick is basically "do dumb, gross stuff with CSV files" so maybe it's fine?
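A scoring-plan CSV might be read like this. The column names here are invented for illustration; no actual format was settled in this thread.

```python
import csv
import io

def read_plan(plan_csv_text):
    # One scoring invocation per row: which scoresheet, which data
    # file, and where the scored output goes.
    return list(csv.DictReader(io.StringIO(plan_csv_text)))

plan = read_plan(
    "scoresheet,datafile,output\n"
    "mood.csv,t1_data.csv,scored_mood_t1.csv\n"
    "sleep.csv,t1_data.csv,scored_sleep_t1.csv\n"
)
```

A driver would then loop over the rows, scoring each (scoresheet, datafile) pair in one long-lived process.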

@njvack njvack closed this as completed in 1282f48 May 15, 2024