The GNU_PARALLELIZE program for Stata

I have written a Stata program that makes it easy to split parallelizable code across cores, leveraging the GNU parallel tool.

This program reduces the process to three simple components:

1. Write the code that you want to parallelize as a program that can be called repeatedly, specifying the arguments you need. You can put this program in any do-file you want; the extract_prog option lets you keep it in the same do-file as other, unrelated code.
2. Call the gnu_parallelize program. If your program takes only a single argument (e.g. you are looping over states, and all you need is the state name), you can pass the values directly to gnu_parallelize with the prep_input_file() option.
3. OPTIONAL: If your program requires more than one input (e.g. a state and an urban/rural dummy, or as many other inputs as you need), then gnu_parallelize needs an input .txt file to feed the program arguments to each parallel run. Use file write to create a text file in which each line holds all of the program arguments for a single parallelized run, in left-to-right order.
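The steps above can be sketched roughly as follows. The program name my_prog, its options state() and urban(), the file name, and the $tmp path are all hypothetical. The sketch also assumes that the arguments on each line are separated by spaces, which is not stated above, so check this against gnu_parallel_example.do before relying on it.

```stata
* Hypothetical sketch: my_prog, state(), urban(), and all paths are assumptions.
global tmp "/path/to/tmp"   // tempfile directory required by gnu_parallelize

* Write one line per parallel run, arguments in left-to-right order
* (space-separated here, which is an assumption).
capture file close args
file open args using "$tmp/parallel_args.txt", write text replace
foreach state in alabama alaska arizona {
    foreach urban in 0 1 {
        file write args "`state' `urban'" _n
    }
}
file close args

* Run my_prog once per line of the input file, at most 4 jobs at a time.
gnu_parallelize, max_jobs(4) program(my_prog) ///
    input_txt("$tmp/parallel_args.txt") options(state urban)
```

This writes six lines (three states times two urban values), so six runs are queued and at most four execute at once.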

A couple of other things to note. The program calls GNU’s parallel command from the shell, with the arguments described in the following section, and runs Stata in batch mode. This means that globals, locals, and other session state are not transferred from your original Stata session; they need to be written out to your text file OR read in as part of your program (e.g. your program could have a preamble that runs a …/settings.do file that contains your various specs).
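As a minimal sketch of such a preamble, the program below takes the path of a settings do-file as an option and runs it before doing any work. The names my_prog, state(), and settings() are hypothetical, and passing the settings path as an option is just one possible design.

```stata
* Hypothetical program: my_prog, state(), and settings() are illustrative names.
capture program drop my_prog
program define my_prog
    syntax, state(string) settings(string)

    * Preamble: the batch session starts without the globals or locals of the
    * calling session, so reload them from a settings do-file.
    do "`settings'"

    display "Running for state: `state'"
end
```

Since the settings path is the same for every run, it could plausibly be supplied once through static_options rather than repeated on every line of the input file.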

Also note: currently the program requires a $tmp global marking your tempfile directory. Set this global before use.

The program

gnu_parallelize

syntax , MAX_jobs(real) PROGram(string) [INput_txt(string) options(string) progloc(string) pre_comma rmtxt maxvar DIAGnostics trace tracedepth(real 2) manual_input static_options(string) extract_prog prep_input_file(string)]

Required inputs

max_jobs ** Important! Ceiling for number of jobs to be run simultaneously. CHECK SERVER LOAD USING ‘TOP’ BEFORE USING THIS PROGRAM! I would not recommend exceeding 10 (15 max); use fewer if you can wait.
program(string) ** Name of the program that contains your code to be parallelized.

Optional inputs

input_txt(string) Location and filename of your .txt file with your program arguments. The directory part of the path should only be needed if your code changes the working directory.
options(string) The names of your program options, in sequential order. NOTE: if your options don’t change over the program runs, you can add them using “static_options” instead of putting them into your input .txt file.
progloc(string) If you define the program to be parallelized inside your own do-file, put the location of that do-file here and also specify the “extract_prog” option. Also use this option if your program lives outside the default Stata load path.
pre_comma If your program has a primary argument (e.g. namelist) before the options, turn on pre_comma.
rmtxt Deletes your input .txt file
maxvar Enlarges Stata’s maximum number of variables in memory to 30,000 – useful for very large dataset operations
diagnostics If engaged, two temporary files with diagnostic value will be saved instead of wiped: (1) your log file, which will be saved in ‘parallelizing_dofile.log’; and (2) the do-file itself, ‘parallelizing_dofile.do’. This do-file is autogenerated by the gnu_parallelize program, and is the mechanism we use to call your parallelized program a number of times simultaneously. NOTE: all jobs write to the log file simultaneously, so its contents are not in run order!
trace Traces output to your log. Default tracedepth is 2.
tracedepth(real 2) Customizes the tracedepth as desired.
extract_prog Since we are running in parallel in batch mode, we need to have the parallelized code program in a stand-alone do file without any other unrelated code. This option cuts out the program, by name, using a shell script into a temporary file. This allows you to write your parallelizing program and call gnu_parallelize all in the same do file. IMPORTANT! You must include the standard program footer for this to work, e.g.:

/* *********** END program compile_chirps_step2 ******************************* */

prep_input_file(string) ** Because all program arguments must be passed in batch mode, they are most easily read from an external .txt file that holds the arguments for each parallel run on its own line (e.g. if you only need to loop over states, the file has one state per line). If you have only one argument to pass, prep_input_file will build this file for you: just pass in a list or a macro.
manual_input ** This is for any instance where you would like to manually write the exact program call for each of your runs to be done in parallel – you write each program call on a separate line of your input .txt file.
static_options(string) ** Adds the string (e.g. any options to your program that do not change across the program calls run in parallel) to the end of each program call.
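To illustrate how extract_prog, progloc, and static_options might fit together, here is a sketch of a single do-file that both defines the program and calls gnu_parallelize. The file name analysis.do, the program my_prog, its options, and all paths are hypothetical. It also assumes that progloc takes the full path to the do-file and that the footer goes directly after the program's end line; neither is confirmed above.

```stata
* Hypothetical do-file analysis.do: names, paths, and options are assumptions.

capture program drop my_prog
program define my_prog
    syntax, state(string) year(integer)
    display "Processing `state' for `year'"
end
/* *********** END program my_prog ******************************* */

* Other, unrelated code can sit in the same do-file.

global tmp "/path/to/tmp"   // tempfile directory required by gnu_parallelize

* One state per line: the only argument that varies across runs.
capture file close args
file open args using "$tmp/states.txt", write text replace
foreach state in alabama alaska arizona {
    file write args "`state'" _n
}
file close args

* year(2010) is constant across runs, so it goes in static_options.
gnu_parallelize, max_jobs(3) program(my_prog) ///
    input_txt("$tmp/states.txt") options(state) ///
    static_options(year(2010)) ///
    extract_prog progloc("/path/to/analysis.do")
```

Keeping the constant option in static_options means the input file only lists what changes from run to run.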

An example:

A simple example of the program in action can be found in gnu_parallel_example.do.
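That file is not reproduced here. As a rough sketch of the single-argument case only, a call might look like the following. The program my_prog and its state() option are hypothetical. Whether options() is still needed alongside prep_input_file() is also an assumption, so treat gnu_parallel_example.do as the authoritative reference.

```stata
* Hypothetical single-argument run: my_prog and state() are illustrative names.
global tmp "/path/to/tmp"   // tempfile directory required by gnu_parallelize

local states alabama alaska arizona

* prep_input_file() builds the one-value-per-line input file from the list.
gnu_parallelize, max_jobs(3) program(my_prog) ///
    options(state) prep_input_file(`states')
```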

NOTE: do not set max_jobs any higher than 10 or 15 at most! It is easy to overload your server, depending on your capabilities. As a rule of thumb, max_jobs is the ceiling for the number of instances running at any one time. Even a simple wget parallelized with GNU parallel can crash your server if you run too many of them at once. If using gnu_parallelize, actively monitor server activity to make sure you aren't overloading your system.

Acknowledgements

I wrote the first draft of these functions while working for Sam Asher and Paul Novosad, who were gracious enough to encourage me to continue to develop it into a utility and make it open source. Thanks a ton.
