I have written a Stata program, gnu_parallelize, that makes it quite easy to split parallelizable code across cores by leveraging the GNU parallel tool.
This program reduces the process to three simple components:
- Write the code that you want to parallelize as a program that can be called repeatedly, specifying the arguments you need. You can put this program in any do-file you want; there’s an option (extract_prog) that means you can put the program in the same do file as a ton of other code without any problems.
- Call the gnu_parallelize program. If there is only a single option to your program (e.g. looping over states, and all you need is the state name) you can pass that input directly into gnu_parallelize with the prep_input_file() option.
- OPTIONAL: If your program requires more than one input (e.g., a state and an urban/rural dummy, or as many other inputs as you need) then gnu_parallelize needs an input .txt file to feed the necessary program options in parallel. Use file write to create a text file in which each line holds all of the program arguments required for a single parallelized run, in left-to-right order.
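The first and third components can be sketched as follows. This is a minimal illustration rather than code from the package: the program name process_state, its options state() and urban(), the state names, and the input filename are all hypothetical, and separating the values on each line with a space is an assumption about the expected file format.

```stata
* Hypothetical program to be parallelized: one call per state/urban combination.
capture program drop process_state
program define process_state
    syntax , state(string) urban(integer)
    display "Processing state `state' with urban = `urban'"
end
/* *********** END program process_state ******************************* */

* Write one line per parallel run, with values in left-to-right order
* (state first, then the urban dummy). Space-separated values are an assumption.
file open fh using "$tmp/process_state_inputs.txt", write text replace
foreach state in bihar kerala punjab {
    forvalues urban = 0/1 {
        file write fh "`state' `urban'" _n
    }
}
file close fh
```

The nested loops produce six lines, so six runs would be queued, with at most max_jobs of them running at once.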
A couple other things to note. The program will call GNU’s parallel
command from the shell, with various arguments to be described in the
following section – using Stata in batch mode. This means that any
globals / locals / whatever will not be transferred from your original
Stata session, and need to be written out to your text file OR read
in as part of your program (e.g. your program could have a preamble
that runs a …/settings.do file that contains your various specs).
Also note: currently the program requires a $tmp global marking your tempfile directory. Set this global before use.
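As a rough sketch of both points, the snippet below sets $tmp before the call and re-creates a setting inside the program itself, since the batch-mode sessions start clean. The directory paths, the outdir global, and the program name are placeholders chosen for illustration.

```stata
* Set the tempfile directory global before calling gnu_parallelize.
* The path is a placeholder; use any writable scratch directory.
global tmp "/scratch/my_project/tmp"

* Hypothetical program whose preamble re-creates the settings it needs,
* because nothing from the calling session is inherited in batch mode.
capture program drop process_one_state
program define process_one_state
    syntax , state(string)
    global outdir "/scratch/my_project/out"   // placeholder setting
    display "Writing results for `state' to $outdir"
end
```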
gnu_parallelize
syntax , MAX_jobs(real) PROGram(string) [INput_txt(string) options(string) progloc(string) pre_comma rmtxt maxvar DIAGnostics trace tracedepth(real 2) manual_input static_options(string) extract_prog prep_input_file(string)]
- max_jobs(real): Important! Ceiling for number of jobs to be run simultaneously. CHECK SERVER LOAD USING ‘TOP’ BEFORE USING THIS PROGRAM! I would not recommend exceeding 10 (15 max); use fewer if you can wait.
- program(string): Name of the program that contains your code to be parallelized.
- input_txt(string): Location and filename of your .txt file with your program arguments. Location of this file shouldn’t be necessary unless you’re changing your pwd in your code…
- options(string): The names of your program options, in sequential order. NOTE: if your options don’t change over the program runs, you can add them using “static_options” instead of putting them into your input .txt file.
- progloc(string): If you are writing your parallelized code program in your do-file, then the location of the do-file goes here and the “extract_prog” option needs to be called. Also, if your program is external to the default Stata load path, specify it here.
- pre_comma: If your program has a primary argument (e.g. namelist) before the options, turn on pre_comma.
- rmtxt: Deletes your .txt file.
- maxvar: Enlarges Stata’s maximum number of variables in memory to 30,000, which is useful for very large dataset operations.
- diagnostics: If engaged, two temporary files with diagnostic value will be saved instead of wiped: (1) your log file, which will be saved in ‘parallelizing_dofile.log’; and (2) the do-file itself, ‘parallelizing_dofile.log’. This do-file is autogenerated by the gnu_parallelize program, and is the mechanism we use to call your parallelized program a number of times simultaneously. NOTE: the log file is written by all jobs running simultaneously, not in order!
- trace: Traces output to your log. Default tracedepth is 2.
- tracedepth(real 2): Customizes the tracedepth as desired.
- extract_prog: Since we are running in parallel in batch mode, we need to have the parallelized code program in a stand-alone do-file without any other unrelated code. This option cuts out the program, by name, using a shell script into a temporary file. This allows you to write your parallelizing program and call gnu_parallelize all in the same do-file. IMPORTANT! You must include the standard program footer for this to work, e.g.:
/* *********** END program compile_chirps_step2 ******************************* */
- prep_input_file(string): As all the parallelizing program arguments must be passed in batch mode, these are most easily read from an external .txt file with all the necessary arguments in a single line for each instance to be run in parallel (e.g. if you only need to loop over states, then we need a .txt file with a single state on each line). If you have only one argument to pass, then prep_input_file will do it for you: just pass in a list or a macro.
- manual_input: This is for any instance where you would like to manually write the exact program call for each of your runs to be done in parallel. You write each program call on a separate line of your input .txt file.
- static_options(string): Will add the string (e.g. a string of however many options to your program do not change across the program calls run in parallel) to the end of each program call.
A simple example of the program in action can be found in gnu_parallel_example.do.
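For orientation, two calls might look roughly like the sketch below. It uses only option names from the syntax line above, but the program names, file paths, and argument values are hypothetical, and whether options() is also needed alongside prep_input_file() is an assumption.

```stata
* Multi-argument case: values come from the input file, names from options().
gnu_parallelize, max_jobs(4) program(process_state) ///
    input_txt("$tmp/process_state_inputs.txt") options(state urban) ///
    progloc("/path/to/analysis.do") extract_prog diagnostics

* Single-argument case: pass the list directly through prep_input_file().
* Supplying options(state) here is an assumption, not documented behavior.
gnu_parallelize, max_jobs(4) program(process_one_state) ///
    prep_input_file(bihar kerala punjab) options(state) ///
    progloc("/path/to/analysis.do") extract_prog
```

Keeping max_jobs low in a first run, with diagnostics switched on, makes it easier to check the autogenerated do-file and log before scaling up.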
NOTE: do not set max_jobs any higher than 10 or 15 at most! It is easy
to overload your server, depending on your capabilities. max_jobs is the
ceiling for the number of instances to be run at a single time, at
least as a rule of thumb. Even a simple wget parallelized with
gnu_parallel can crash your server if you run too many of them at
once. If using gnu_parallelize, actively monitor server activity to
make sure you aren't overloading your system.
I wrote the first draft of these functions while working for Sam Asher and Paul Novosad, who were gracious enough to encourage me to continue to develop it into a utility and make it open source. Thanks a ton.