Data Simulator

Generates simulated datasets based on a data dictionary.


Simulated data is sometimes needed when real data is impractical to obtain. Simulation makes it possible to build models or run services when the real datasets contain protected information or are unavailable for legal reasons. The functions in this simulation suite allow a user to:

  • Simulate and validate data
  • Organize simulated data by nodes in a data model and export it to JSON for easy upload

Basic Functionality

Dictionary Validation

Validate a dictionary before using it to simulate data

data-simulator validate --url

Required arguments:

  • url: S3 link to the dictionary

Simulating data

Simulate data using a dictionary

data-simulator simulate --url --path ./data-simulator/sample_test_data --program DEV --project test

Required arguments:

  • url: S3 link to the dictionary
  • path: path to save files to
  • program
  • project

Optional arguments:

  • max_samples: maximum number of instances for each node. Default is 1
  • required_only: only simulate required properties
  • random: randomly generate the number of instances for each node (up to max_samples). If this argument is not used, every node gets max_samples instances
  • node_num_instances_file ./file.json: generate the number of instances specified in the JSON file. The file should map each node name to the number of instances (an integer) to generate, for example: {"submitted_unaligned_reads": 100}. Nodes that are not listed in the file get max_samples instances
  • consent_codes: whether to generate random consent codes
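
As a minimal sketch of the node_num_instances_file format described above, the following Python snippet writes such a file and illustrates the documented fallback to max_samples for unlisted nodes. The helper names (write_instance_counts, instances_for) and the "case" node lookup are hypothetical and only for illustration; they are not part of data-simulator.

```python
import json
import tempfile
from pathlib import Path


def write_instance_counts(counts, path):
    """Write a {node_name: instance_count} mapping to a JSON file.

    Hypothetical helper: checks that every count is a non-negative
    integer, since the file is documented to hold integers.
    """
    for node, n in counts.items():
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(n, bool) or not isinstance(n, int) or n < 0:
            raise ValueError(f"count for {node!r} must be a non-negative integer")
    path = Path(path)
    path.write_text(json.dumps(counts, indent=2))
    return path


def instances_for(node, counts, max_samples=1):
    """Illustrate the documented rule: unlisted nodes get max_samples."""
    return counts.get(node, max_samples)


out_dir = Path(tempfile.mkdtemp())
counts_file = write_instance_counts(
    {"submitted_unaligned_reads": 100}, out_dir / "file.json"
)
loaded = json.loads(counts_file.read_text())

listed = instances_for("submitted_unaligned_reads", loaded, max_samples=5)
unlisted = instances_for("case", loaded, max_samples=5)
print(listed, unlisted)
```

The resulting file can then be passed to the simulate command through the node_num_instances_file argument.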

Submission Order

Generate a submission order given a node name and a dictionary

data-simulator submission_order --url --node_name case --path ./data-simulator/sample_test_data

Required arguments:

  • url: S3 link to the dictionary
  • path: path to save file to

Optional arguments:

  • node_name: node to generate the submission order for. By default, the command selects a random data file node
  • skip: do not raise an exception if an error occurs
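
The idea of a submission order can be sketched as a depth-first traversal, under the assumption that a node can only be submitted after the nodes it links to (its parents). This is an illustration of that assumption, not the tool's actual implementation; the parents mapping and the "sample" node below are hypothetical stand-ins for the links defined in a real dictionary.

```python
def submission_order(node_name, parents):
    """Return node_name and all of its ancestors, parents first.

    `parents` maps each node to the nodes it links to (hypothetical input).
    """
    order = []
    visited = set()
    in_progress = set()

    def visit(node):
        if node in visited:
            return
        if node in in_progress:
            raise ValueError(f"cycle detected at {node!r}")
        in_progress.add(node)
        for parent in parents.get(node, []):
            visit(parent)  # parents are emitted before the node itself
        in_progress.discard(node)
        visited.add(node)
        order.append(node)

    visit(node_name)
    return order


# Hypothetical, simplified data model.
parents = {
    "program": [],
    "project": ["program"],
    "case": ["project"],
    "sample": ["case"],
}

order = submission_order("case", parents)
print(order)  # ['program', 'project', 'case']
```

Nodes that are not ancestors of the requested node ("sample" here) do not appear in the result.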

Submitting Data

Submit the data via the Sheepdog API

data-simulator submitting_data --host --project DEV/test --dir ./data-simulator/sample_test_data --access_token_file ./token --chunk_size 10

Required arguments:

  • dir: path containing data
  • host
  • project: program name and project code separated by a forward slash
  • access_token_file

Optional arguments:

  • chunk_size: default is 1
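
Two of the arguments above can be illustrated with a short sketch: splitting the project value into a program name and a project code, and grouping records into batches. It assumes that chunk_size means the number of records sent per request, which the text above does not state. The function names and the record fields are hypothetical and are not taken from data-simulator.

```python
def split_project(project):
    """Split 'PROGRAM/PROJECT' into (program, project_code)."""
    program, sep, code = project.partition("/")
    if not sep or not program or not code or "/" in code:
        raise ValueError(f"expected 'program/project', got {project!r}")
    return program, code


def chunked(records, chunk_size=1):
    """Yield successive lists of at most chunk_size records."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(records), chunk_size):
        yield records[start:start + chunk_size]


# Hypothetical records standing in for simulated node instances.
records = [{"submitter_id": f"case_{i}"} for i in range(25)]

program, code = split_project("DEV/test")
batch_sizes = [len(batch) for batch in chunked(records, chunk_size=10)]
print(program, code)  # DEV test
print(batch_sizes)    # [10, 10, 5]
```

With the default chunk_size of 1, the same 25 records would be grouped into 25 single-record batches.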


Poetry must be installed before installing Data Simulator. Please follow the Poetry installation instructions first.

To install Data Simulator, run the following command:

poetry install -vv

Running tests locally

poetry install -vv
poetry run pytest -vv ./tests