Data Simulator

Generates simulated graph data based on a Gen3 data dictionary.

Motivation

It is sometimes necessary to create simulated data when obtaining real data is impractical. Simulated data is useful for building models or running services over datasets that contain protected information or are unavailable for legal reasons. The functions in this simulation suite allow a user to:

Simulate and validate data
Organize simulated data by nodes in a data model and export it to JSON for easy upload

Basic Functionality

Data simulator contains various commands to help simulate, test, and validate data dictionaries. These commands are generally accessed via data-simulator. However, if you are not managing your own virtual environment externally, you may need to prepend poetry run to your commands, as described in the Poetry documentation. Additionally, use data-simulator with the most recent release of our services to ensure expected behavior. The examples below use bhcdictionary, which, at the time of writing, is on release 3.1.1.

Dictionary Validation

Validate a data dictionary

data-simulator validate --url https://s3.amazonaws.com/dictionary-artifacts/bhcdictionary/<release_version>/schema.json

Required arguments:

url: S3 link to the dictionary
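
The url argument in the examples follows a fixed pattern with a <release_version> placeholder. As a minimal sketch, the placeholder could be filled in with a small helper. The function name dictionary_url and the template constant are hypothetical and not part of data-simulator, and the helper does not check whether a given release is actually published.

```python
# Hypothetical helper (not part of data-simulator): fills the
# <release_version> placeholder of the URL pattern used in the examples.
DICTIONARY_URL_TEMPLATE = (
    "https://s3.amazonaws.com/dictionary-artifacts/"
    "bhcdictionary/{release_version}/schema.json"
)


def dictionary_url(release_version: str) -> str:
    """Return the schema URL for one bhcdictionary release."""
    # Reject empty values and values that would change the URL path.
    if not release_version or "/" in release_version:
        raise ValueError(f"invalid release version: {release_version!r}")
    return DICTIONARY_URL_TEMPLATE.format(release_version=release_version)


if __name__ == "__main__":
    # 3.1.1 is the release mentioned in this README at the time of writing.
    print(dictionary_url("3.1.1"))
```

The resulting string can then be passed as the --url value of the commands in this README.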

Simulating data

Simulate data using a dictionary

data-simulator simulate --url https://s3.amazonaws.com/dictionary-artifacts/bhcdictionary/<release_version>/schema.json --path ./tests/TestData --program DEV --project test

Required arguments:

url: S3 link to the dictionary
path: path to save files to
program: program name
project: project code

Optional arguments:

max_samples: maximum number of instances for each node. Default is 1
required_only: only simulate required properties
random: randomly generate the number of instances for each node (up to max_samples). If this argument is not used, every node has max_samples instances
node_num_instances_file ./file.json: generate the number of instances specified in the JSON file for each node. The file should map each node name to the number of instances (integer) to generate, for example: {"submitted_unaligned_reads": 100}. max_samples instances are generated for nodes that are not specified in the file.
consent_codes: whether to include generation of random consent codes
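
The file passed to node_num_instances_file is a JSON object mapping node names to integer counts. A minimal sketch of producing such a file from Python follows. The helper names node_counts_json and write_node_counts are hypothetical, and rejecting negative counts is an assumption made for this example, not documented behavior of data-simulator.

```python
import json
from pathlib import Path


def node_counts_json(counts: dict) -> str:
    """Serialize a node-name -> instance-count mapping to JSON text."""
    for node, count in counts.items():
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(count, bool) or not isinstance(count, int):
            raise ValueError(f"count for {node!r} must be an integer")
        # Assumption for this sketch: negative counts make no sense.
        if count < 0:
            raise ValueError(f"count for {node!r} must not be negative")
    return json.dumps(counts, indent=2)


def write_node_counts(path: str, counts: dict) -> Path:
    """Write the mapping to a file usable as node_num_instances_file."""
    target = Path(path)
    target.write_text(node_counts_json(counts))
    return target


# Example call (not executed here):
# write_node_counts("./file.json", {"submitted_unaligned_reads": 100})
```

Nodes left out of the mapping fall back to max_samples instances, as described in the list above.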

Submission Order

Generate a submission order given a node name and a dictionary

data-simulator submission_order --url https://s3.amazonaws.com/dictionary-artifacts/bhcdictionary/<release_version>/schema.json --node_name case --path ./data-simulator/sample_test_data

Required arguments:

url: s3 dictionary link
path: path to save file to

Optional arguments:

node_name: node to generate the submission order for. By default, the command selects a random data file node
skip: do not raise an exception if an error occurs
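
To illustrate the idea of a submission order, here is a minimal sketch that assumes a node can only be submitted after the nodes it links to. This is not the implementation used by data-simulator. The function name, the parents mapping, and the links between program, project, and case are hypothetical and do not come from any real dictionary.

```python
def submission_order(node_name: str, parents: dict) -> list:
    """Return node_name and all of its ancestors, parents first."""
    order = []        # nodes in the order they should be submitted
    visiting = set()  # nodes on the current path, used to detect cycles

    def visit(node: str) -> None:
        if node in order:
            return
        if node in visiting:
            raise ValueError(f"cycle detected at {node!r}")
        visiting.add(node)
        # Every parent must be placed before the node itself.
        for parent in parents.get(node, []):
            visit(parent)
        visiting.discard(node)
        order.append(node)

    visit(node_name)
    return order


# Hypothetical links for illustration: each node lists the nodes it links to.
example_links = {
    "project": ["program"],
    "case": ["project"],
}

if __name__ == "__main__":
    print(submission_order("case", example_links))
    # prints ['program', 'project', 'case']
```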

Submitting Data

Submit the data via the Sheepdog API

data-simulator submitting_data --host http://devplanet.planx-pla.net --project DEV/test --dir ./data-simulator/sample_test_data --access_token_file ./token --chunk_size 10

Required arguments:

dir: path containing data
host
project: program name and project code separated by a forward slash
access_token_file

Optional arguments:

chunk_size: default is 1
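
The project argument combines the program name and project code in a single value such as DEV/test. The sketch below shows one way to split that value, plus a generic batching helper. Both function names are hypothetical. The batching helper assumes chunk_size means the number of records sent per request, which this README does not state.

```python
def split_project(project: str) -> tuple:
    """Split 'PROGRAM/PROJECT' into (program name, project code)."""
    program, sep, code = project.partition("/")
    # Require exactly one forward slash with text on both sides.
    if not sep or not program or not code or "/" in code:
        raise ValueError(f"expected '<program>/<project>', got {project!r}")
    return program, code


def chunks(records: list, chunk_size: int = 1):
    """Yield consecutive slices of at most chunk_size records."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(records), chunk_size):
        yield records[start:start + chunk_size]


if __name__ == "__main__":
    print(split_project("DEV/test"))
    print(list(chunks([1, 2, 3, 4, 5], chunk_size=2)))
```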

Setup

Poetry must be installed before installing data simulator. Follow the instructions at https://python-poetry.org/docs/#installation to install Poetry.

To install data simulator, run the following command:

poetry install -vv

Running tests locally

poetry install -vv
poetry run pytest -vv ./tests
