This branch contains the code and documentation for the system used in the OOPSLA 2019 paper "AutoPandas: Neural-Backed Generators for Program Synthesis". The paper can be accessed here (open-access).
- The system has been tested against Python 3.6.5 and tensorflow-gpu==1.9.0.
- Do a development install:
pip install -e .
- Compile the generators:
autopandas_v2 generators compile
autopandas_v2 generators compile-randomized
- Switch to the snapshot at https://github.com/rbavishi/atlas/tree/oopsla19-snapshot
- Download the pre-trained models here and extract the zip files. There should be two: (1) model_pandas_generators and (2) model_pandas_functions. Note that a GPU is necessary to use these models; we have observed our models returning NaNs when performing inference on the CPU.
- Run the following to reproduce the results in Table 2 of the paper. Note that execution times may differ across runs; we have observed non-trivial deviations across different hardware due to differing model predictions, which have a cascading effect. However, the set of benchmarks solved within the time limit and the number of programs explored should be similar.
TF_CPP_MIN_LOG_LEVEL=3 autopandas_v2 evaluate synthesis "PandasBenchmarks.*" model_pandas_generators model_pandas_functions pandas_synthesis_results.csv --top-k-args 1000 --use-old-featurization --timeout 1200
- Note that the --use-old-featurization option is important only for the snapshot. If the models have been retrained, you should skip this option (after switching to the latest commit, of course).
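Once the run finishes, the results are written to pandas_synthesis_results.csv. For a quick look at the file, a minimal sketch (the exact column layout of the CSV is not documented here, so the snippet only inspects it rather than assuming specific names):

import pandas as pd

# Quick inspection of the synthesis results produced above; the column
# layout is whatever autopandas_v2 emitted, so we print it instead of
# assuming specific names.
results = pd.read_csv("pandas_synthesis_results.csv")
print(results.columns.tolist())
print(results.head())
print(len(results), "benchmark rows")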
Raw data consists of inputs, programs and their outputs along with choices made by the generators.
Basic usage is as follows. The viable_sequences.pkl file should contain a set of tuples representing valid combinations of functions for which raw data should be generated.
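For reference, a minimal sketch of how such a file could be produced. The df.pivot-style function names below are illustrative (they follow the naming used by the --sequences option described later); the actual names must match those known to the compiled generators:

import pickle

# A set of tuples of function names; each tuple is one allowed sequence.
viable_sequences = {
    ("df.pivot",),
    ("df.pivot", "df.index"),
    ("df.columns", "df.T"),
}

with open("viable_sequences.pkl", "wb") as f:
    pickle.dump(viable_sequences, f)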
- Generate 1 million data-points with sequences from viable_sequences.pkl, using 32 processes, with the minimum and maximum allowed sequence lengths being 1 and 3, and save the data to raw_data.pkl:
autopandas_v2 generators training-data raw raw_data.pkl --sequences viable_sequences.pkl --processes 32 --min-depth 1 --max-depth 3 --num-training-points 1000000
- Generate another 1 million points with the same constraints, but append them to the existing data in raw_data.pkl:
autopandas_v2 generators training-data raw raw_data.pkl --append --sequences viable_sequences.pkl --processes 32 --min-depth 1 --max-depth 3 --num-training-points 1000000
Full Usage -
usage: autopandas_v2 generators training-data raw [-h] [--debug]
[--processes PROCESSES]
[--chunksize CHUNKSIZE]
[--task-timeout TASK_TIMEOUT]
[--max-exploration MAX_EXPLORATION]
[--max-arg-trials MAX_ARG_TRIALS]
[--max-seq-trials MAX_SEQ_TRIALS]
[--blacklist-threshold BLACKLIST_THRESHOLD]
[--min-depth MIN_DEPTH]
[--max-depth MAX_DEPTH]
[--num-training-points NUM_TRAINING_POINTS]
--sequences SEQUENCES
[--no-repeat] [--append]
outfile
positional arguments:
outfile Path to output file
optional arguments:
-h, --help show this help message and exit
--debug Debug-level logging
--processes PROCESSES
Number of processes to use
--chunksize CHUNKSIZE
Pebble Chunk Size. Only touch this if you understand
the source
--task-timeout TASK_TIMEOUT
Timeout for a datapoint generation task (for
multiprocessing). Useful for avoiding enumeration-
gone-wrong cases, where something is taking a long
time or is consuming too many resources
--max-exploration MAX_EXPLORATION
Maximum number of arg combinations to explore before
moving on
--max-arg-trials MAX_ARG_TRIALS
Maximum number of argument trials to actually execute
--max-seq-trials MAX_SEQ_TRIALS
Maximum number of trials to generate data for a single
sequence
--blacklist-threshold BLACKLIST_THRESHOLD
Maximum number of trials for a sequence before giving
up forever. Use -1 to have no threshold
--min-depth MIN_DEPTH
Minimum length of sequences allowed
--max-depth MAX_DEPTH
Maximum length of sequences allowed
--num-training-points NUM_TRAINING_POINTS
Number of training examples to generate
--sequences SEQUENCES
Path to pickle file containing sequences that the
generator can stick to while generating data. Helps in
generating random data that mimics actual usage of the
API in the wild. Can also be a comma plus colon-
separated string containing functions to use. For
example - df.pivot:df.index,df.columns:df.T allows the
sequences (df.pivot, df.index) and (df.columns, df.T)
--no-repeat Produce only 1 training example for each sequence
--append Whether to append to an already existing dataset
For retraining the models from scratch, it is advisable to generate raw data for different lengths independently, and also to split them into training and validation sets. The following commands separately create, for each sequence length from 1 to 3, training and validation sets of sizes 1000000 and 10000 respectively.
Note that this process can take a long time and may need babysitting. For example, the process may not exit even after the required number of data-points has been generated, due to threads not terminating. In this case, manual intervention using SIGINT (Ctrl-C) is required.
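If hand-babysitting becomes tedious, one option is to wrap each generation command in a small driver that delivers the SIGINT itself after a wall-clock budget. This is only a sketch of that idea, not part of the tool; the command and budget below are placeholders to adjust for your setup:

import signal
import subprocess

def run_with_sigint(cmd, budget):
    # Run one generation command; if it has not exited after `budget`
    # seconds (e.g. due to lingering threads), send SIGINT, mimicking
    # a manual Ctrl-C.
    proc = subprocess.Popen(cmd)
    try:
        proc.wait(timeout=budget)
    except subprocess.TimeoutExpired:
        proc.send_signal(signal.SIGINT)
        proc.wait()

run_with_sigint(
    ["autopandas_v2", "generators", "training-data", "raw", "training_raw_data.pkl",
     "--sequences", "pandas_mined_seqs.pkl", "--processes", "32",
     "--min-depth", "1", "--max-depth", "1", "--num-training-points", "1000000"],
    budget=6 * 3600,  # illustrative 6-hour budget
)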
The pandas_mined_seqs.pkl file can be obtained here.
autopandas_v2 generators training-data raw training_raw_data.pkl --sequences pandas_mined_seqs.pkl --processes 32 --min-depth 1 --max-depth 1 --num-training-points 1000000
autopandas_v2 generators training-data raw training_raw_data.pkl --append --sequences pandas_mined_seqs.pkl --processes 32 --min-depth 2 --max-depth 2 --num-training-points 1000000
autopandas_v2 generators training-data raw training_raw_data.pkl --append --sequences pandas_mined_seqs.pkl --processes 32 --min-depth 3 --max-depth 3 --num-training-points 1000000
autopandas_v2 generators training-data raw validation_raw_data.pkl --sequences pandas_mined_seqs.pkl --processes 32 --min-depth 1 --max-depth 1 --num-training-points 10000
autopandas_v2 generators training-data raw validation_raw_data.pkl --append --sequences pandas_mined_seqs.pkl --processes 32 --min-depth 2 --max-depth 2 --num-training-points 10000
autopandas_v2 generators training-data raw validation_raw_data.pkl --append --sequences pandas_mined_seqs.pkl --processes 32 --min-depth 3 --max-depth 3 --num-training-points 10000
The raw data used in the artifact can be found here. Each file is of the form <split_name>_depth<depth>_<set_no>.pkl, where <split_name> is one of training and validation, <depth> is the number of functions in the sequence, and <set_no> is one of 1 and 2. The set number simply results from parallelization.
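A small sketch for sanity-checking a download against this naming scheme; it only parses the filenames and does not assume anything about the pickles' contents:

import glob
import re

# Group the downloaded files by split and depth so that missing pieces
# stand out.
pattern = re.compile(r"(training|validation)_depth(\d+)_(\d+)\.pkl$")
found = {}
for path in glob.glob("*_depth*_*.pkl"):
    match = pattern.search(path)
    if match:
        split, depth, set_no = match.group(1), int(match.group(2)), int(match.group(3))
        found.setdefault((split, depth), []).append(set_no)

for (split, depth), sets in sorted(found.items()):
    print(split, "depth", depth, "-> sets", sorted(sets))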
The raw data has to be converted into graphs for training. The basic command structure is as follows. Currently, the implementation differentiates between the models for operators inside generators and the model used for predicting function sequences to explore. Hence we generate two sets of structured data.
- Generate structured data from the raw data contained in raw_data.pkl and store it in a directory named struct_data_generators, using 32 processes:
autopandas_v2 generators training-data generators raw_data.pkl struct_data_generators --processes 32
- Generate structured data from the raw data contained in other_data.pkl, but append it to the already existing data in struct_data_generators, again using 32 processes:
autopandas_v2 generators training-data generators other_data.pkl struct_data_generators --append-arg-level --processes 32
Tip - Appending is useful for combining data of different depths, which is a prerequisite for training. However, this does not mean you should not parallelize; the final combination of the data can happen during training.
Full Usage -
usage: autopandas_v2 generators training-data generators [-h] [--debug] [-f]
[--append-arg-level]
[--processes PROCESSES]
[--chunksize CHUNKSIZE]
[--task-timeout TASK_TIMEOUT]
raw_data_path outdir
positional arguments:
raw_data_path Path to pkl containing the raw I/O example data
outdir Path to output directory where the generated data is
to be stored
optional arguments:
-h, --help show this help message and exit
--debug Debug-level logging
-f, --force Force recreation of outdir if it exists
--append-arg-level Append training-data at argument-operator level
instead of overwriting by default
--processes PROCESSES
Number of processes to use
--chunksize CHUNKSIZE
Pebble Chunk Size. Only touch this if you understand
the source
--task-timeout TASK_TIMEOUT
Timeout for a datapoint generation task (for
multiprocessing). Useful for avoiding enumeration-
gone-wrong cases, where something is taking a long
time or is consuming too many resources
- Generate structured data from the raw data contained in raw_data.pkl and store it in a file named struct_data_functions.pkl, using 32 processes:
autopandas_v2 generators training-data function-seq raw_data.pkl struct_data_functions.pkl --processes 32
- Generate structured data from the raw data contained in other_data.pkl, but append it to the already existing data in struct_data_functions.pkl, again using 32 processes:
autopandas_v2 generators training-data function-seq other_data.pkl struct_data_functions.pkl --append --processes 32
Tip - Appending is useful for combining data of different depths, which is a prerequisite for training.
Full Usage -
usage: autopandas_v2 generators training-data function-seq [-h] [--debug]
[--append]
[--processes PROCESSES]
[--chunksize CHUNKSIZE]
[--task-timeout TASK_TIMEOUT]
raw_data_path
outfile
positional arguments:
raw_data_path Path to pkl containing the raw I/O example data
outfile Path to output file where the generated data is to be
stored
optional arguments:
-h, --help show this help message and exit
--debug Debug-level logging
--append Append training-data to the existing dataset
represented by outfile instead of overwriting by
default
--processes PROCESSES
Number of processes to use
--chunksize CHUNKSIZE
Pebble Chunk Size. Only touch this if you understand
the source
--task-timeout TASK_TIMEOUT
Timeout for a datapoint generation task (for
multiprocessing). Useful for avoiding enumeration-
gone-wrong cases, where something is taking a long
time or is consuming too many resources
If you used the commands in the previous section to generate training and validation data for each depth separately, you can use the following set of commands to convert that raw data into structured data.
autopandas_v2 generators training-data generators training_raw_data.pkl training_struct_data_generators --processes 32
autopandas_v2 generators training-data generators validation_raw_data.pkl validation_struct_data_generators --processes 32
autopandas_v2 generators training-data function-seq training_raw_data.pkl training_struct_data_functions.pkl --processes 32
autopandas_v2 generators training-data function-seq validation_raw_data.pkl validation_struct_data_functions.pkl --processes 32
Now we are ready to train the models.
Basic usage is as follows. The command trains a model for up to 100 epochs, with early stopping after 25 epochs of no improvement in validation accuracy.
autopandas_v2 generators training train-functions model_functions --train training_struct.pkl --valid validation_struct.pkl --use-disk --config-str '{"batch_size": 50000}' --patience 25 --num-epochs 100
Full usage is the following.
usage: autopandas_v2 generators training train-functions [-h]
[--device DEVICE]
[--config CONFIG]
[--config-str CONFIG_STR]
[--use-memory]
[--use-disk] --train
TRAIN --valid VALID
[--restore-file RESTORE_FILE]
[--restore-params RESTORE_PARAMS]
[--freeze-graph-model]
[--load-shuffle]
[--num-epochs NUM_EPOCHS]
[--patience PATIENCE]
[--num-training-points NUM_TRAINING_POINTS]
modeldir
positional arguments:
modeldir Path to the directory to save the model in
optional arguments:
-h, --help show this help message and exit
--device DEVICE ID of Device (GPU) to use
--config CONFIG File containing hyper-parameter configuration (JSON
format)
--config-str CONFIG_STR
String containing hyper-parameter configuration (JSON
format)
--use-memory Store all processed graphs in memory. Fastest
processing, but can easily run out of memory
--use-disk Use disk for storing processed graphs as opposed to
computing them every time. Speeds things up a lot but
can take a lot of space
--train TRAIN Path to train file
--valid VALID Path to validation file
--restore-file RESTORE_FILE
File to restore weights from
--restore-params RESTORE_PARAMS
File to restore params from (pkl)
--freeze-graph-model Freeze graph model components
--load-shuffle Shuffle data when loading. Useful when passing num-
training-points
--num-epochs NUM_EPOCHS
Maximum number of epochs to run training for
--patience PATIENCE Maximum number of epochs to wait for validation
accuracy to increase
--num-training-points NUM_TRAINING_POINTS
Number of training points to use. Default : -1 (all)
Basic usage is as follows. The command trains a model for up to 100 epochs, with early stopping after 25 epochs of no improvement in validation accuracy.
autopandas_v2 generators training train-generators --train training_struct_data --valid validation_struct_data --use-disk --num-epochs 100 --patience 25 --config-str '{"layer_timesteps": [1,1,1], "batch_size": 50000}' --ignore-if-exists model_generators
Full usage is as follows.
usage: autopandas_v2 generators training train-generators [-h]
[--device DEVICE]
[--config CONFIG]
[--config-str CONFIG_STR]
[--use-memory]
[--use-disk] --train
TRAIN --valid VALID
[--restore-file RESTORE_FILE]
[--restore-params RESTORE_PARAMS]
[--freeze-graph-model]
[--load-shuffle]
[--num-epochs NUM_EPOCHS]
[--patience PATIENCE]
[--num-training-points NUM_TRAINING_POINTS]
[--include INCLUDE [INCLUDE ...]]
[--restore-if-exists]
[--ignore-if-exists]
modeldir
positional arguments:
modeldir Path to the directory to save the model(s) in
optional arguments:
-h, --help show this help message and exit
--device DEVICE ID of Device (GPU) to use
--config CONFIG File containing hyper-parameter configuration (JSON
format)
--config-str CONFIG_STR
String containing hyper-parameter configuration (JSON
format)
--use-memory Store all processed graphs in memory. Fastest
processing, but can easily run out of memory
--use-disk Use disk for storing processed graphs as opposed to
computing them every time. Speeds things up a lot but
can take a lot of space
--train TRAIN Path to train file
--valid VALID Path to validation file
--restore-file RESTORE_FILE
File to restore weights from
--restore-params RESTORE_PARAMS
File to restore params from (pkl)
--freeze-graph-model Freeze graph model components
--load-shuffle Shuffle data when loading. Useful when passing num-
training-points
--num-epochs NUM_EPOCHS
Maximum number of epochs to run training for
--patience PATIENCE Maximum number of epochs to wait for validation
accuracy to increase
--num-training-points NUM_TRAINING_POINTS
Number of training points to use. Default : -1 (all)
--include INCLUDE [INCLUDE ...]
fn:identifier tuples to include in training list
--restore-if-exists If a model already exists, pick up training from there
--ignore-if-exists If the model exists, skip.
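Both training commands accept hyper-parameters either inline via --config-str or from a JSON file via --config. A minimal sketch of the file-based variant, using only the keys that actually appear in the commands in this document (batch_size and, for the generator models, layer_timesteps); any other keys are left to the tool's defaults:

import json

# Write the same hyper-parameters used inline above to a reusable file;
# pass it as: ... train-generators --config config.json ...
config = {
    "layer_timesteps": [1, 1, 1],
    "batch_size": 50000,
}
with open("config.json", "w") as f:
    json.dump(config, f, indent=2)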
If you used the commands from the previous section to generate data for each depth separately, the following set of commands can be used to train the models. Again, babysitting may be required to restart the training in case of crashes.
autopandas_v2 generators training train-functions model_functions --train training_struct_data_functions.pkl --valid validation_struct_data_functions.pkl --use-disk --config-str '{"batch_size": 50000}' --patience 25 --num-epochs 100
autopandas_v2 generators training train-generators --train training_struct_data_generators --valid validation_struct_data_generators --use-disk --num-epochs 100 --patience 25 --config-str '{"layer_timesteps": [1,1,1], "batch_size": 50000}' --ignore-if-exists model_generators
For questions, contact rbavishi@cs.berkeley.edu.