ngs generate

The ngs generate subcommand generates NGS files. The tool is still highly experimental, and as such, its command line interface is constantly changing. It packages one or more SequenceProviders, which generate random sequences for inclusion in the generated file. Currently, the only supported provider is the ReferenceGenomeSequenceProvider, which generates reads from a provided FASTA file, but you can imagine other providers, such as a VariantProvider.

Running `ngs generate`

You can call the tool by using the following command. Both regular and gzipped FASTQ files are supported as output.

ngs generate \
    --reads-one-file <READ_ONE_PATH> \
    --reads-two-file <READ_TWO_PATH> \
    --reference-provider \
        GenomeOne.fa:100000:150:20:150:50 \
        GenomeTwo.fa:100000:150:20:150:50 \
    --num-reads 1000000

Reference Genome Sequence Provider

Reference genome sequence providers are specified on the command line as a string of six fields delimited by colons (:). Using the example GenomeOne.fa:100000:150:20:150:50, these fields, in order, are:

GenomeOne.fa: Reference FASTA. Path to the reference FASTA file upon which this reference genome sequence provider is based.
100000: Error Rate. A sequencing error is simulated once in every N nucleobases generated, where N is this value (here, 1 in every 100,000).
150: Inner Distance Mean. The inner distance for generated reads is drawn from a normal distribution. This field is the mean of that distribution.
20: Inner Distance Standard Deviation. The inner distance for generated reads is drawn from a normal distribution. This field is the standard deviation of that distribution.
150: Read Length. The length of the reads generated by this reference genome sequence provider.
50: Relative Weight. Used together with the relative weights of the other reference genome sequence providers to determine how often reads are drawn from this one. For instance, with two providers, one with a relative weight of 90 and another with a relative weight of 10, the first is drawn from 90% of the time and the second 10% of the time. The weights are not required to sum to 100, because the sum of all relative weights provided is used as the denominator.
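
The field layout and the weight arithmetic above can be sketched in Python. This is an illustration only, not the tool's actual parser: `ProviderSpec`, `parse_provider`, and `selection_probabilities` are hypothetical names, and the numeric types chosen for each field are assumptions based on the example string.

```python
from dataclasses import dataclass


@dataclass
class ProviderSpec:
    """Hypothetical container for the six colon-delimited fields."""
    reference_fasta: str
    error_rate: int              # one simulated error in every N bases
    inner_distance_mean: float
    inner_distance_std: float
    read_length: int
    relative_weight: float


def parse_provider(spec: str) -> ProviderSpec:
    # Sketch choice: split from the right so that a colon inside the
    # FASTA path does not shift the five numeric fields.
    parts = spec.rsplit(":", 5)
    if len(parts) != 6:
        raise ValueError(f"expected 6 colon-delimited fields, got {len(parts)}")
    path, error_rate, mean, std, read_length, weight = parts
    return ProviderSpec(
        path, int(error_rate), float(mean), float(std),
        int(read_length), float(weight),
    )


def selection_probabilities(specs):
    # The sum of all relative weights is the denominator, so the
    # weights do not need to add up to 100.
    total = sum(s.relative_weight for s in specs)
    return [s.relative_weight / total for s in specs]


specs = [
    parse_provider("GenomeOne.fa:100000:150:20:150:50"),
    parse_provider("GenomeTwo.fa:100000:150:20:150:50"),
]
print(selection_probabilities(specs))  # [0.5, 0.5]
```

With the two providers from the example command, each has a weight of 50 out of a total of 100, so each supplies half of the reads.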

Tips

Make sure you set a sensible combination of read length and inner distance mean and standard deviation, or else you will run into issues where the fragment being generated is too short to support the read length. Inner distances are clamped at mean +/- (3 * std), so you can make sure that you will never run into this case by ensuring read length >= (-mean + 3 * std), that is, read length + (mean - 3 * std) >= 0.
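
As a quick way to check a parameter combination before running the tool, the inequality from this tip can be written as a small Python helper. The function names are hypothetical and are not part of ngs; the code only restates the inequality given above.

```python
def min_inner_distance(mean, std):
    # Lower clamp bound for the inner distance: mean - (3 * std).
    return mean - 3 * std


def reads_always_fit(read_length, mean, std):
    # The tip's condition: read length >= (-mean + 3 * std).
    return read_length >= -mean + 3 * std


# Values from the example provider string GenomeOne.fa:100000:150:20:150:50
print(min_inner_distance(150, 20))     # 90
print(reads_always_fit(150, 150, 20))  # True
```

A combination such as a read length of 50 with an inner distance mean of -100 and a standard deviation of 20 fails the check, because -mean + 3 * std is 160.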

Limitations

Inner distances drawn from the specified normal distribution are clamped at mean - (3 * std) and mean + (3 * std). This improves the tool's predictability when generating fragment lengths. Without this clamping, rare but extreme values drawn from the inner distance distribution would routinely generate fragments too short to support the specified read length. With the clamping in place, you can ensure this won't happen by following the tip on this topic above.
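
The clamping described here can be sketched in Python as follows. This is a minimal illustration under assumptions, not the tool's implementation: the function name is hypothetical, and whether ngs rounds the sampled value to an integer is not stated on this page.

```python
import random


def sample_inner_distance(mean, std, rng):
    # Draw from the normal distribution, then clamp the value to
    # the range [mean - 3 * std, mean + 3 * std].
    low = mean - 3 * std
    high = mean + 3 * std
    value = rng.gauss(mean, std)
    return min(max(value, low), high)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [sample_inner_distance(150, 20, rng) for _ in range(100_000)]
print(min(samples) >= 90, max(samples) <= 210)  # True True
```

Because every sample is forced into the clamped range, the smallest possible inner distance is known in advance, which is what makes the read length check in the tip sufficient.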
