ngs generate

Clay McLeod edited this page Sep 17, 2022 · 4 revisions

The ngs generate subcommand generates NGS files. The tool is still highly experimental, and as such, its command line interface is constantly changing. It packages one or more SequenceProviders, which generate random sequences for inclusion in the generated file. Currently, the only supported provider is the ReferenceGenomeSequenceProvider, which generates reads from a provided FASTA file, but other providers (for example, a VariantProvider) are conceivable.

Running ngs generate

You can call the tool by using the following command. Both regular and gzipped FASTQ files are supported as output.

ngs generate \
    --reads-one-file <READ_ONE_PATH> \
    --reads-two-file <READ_TWO_PATH> \
    --reference-provider \
        GenomeOne.fa:100000:150:20:150:50 \
        GenomeTwo.fa:100000:150:20:150:50 \
    --num-reads 1000000

Reference Genome Sequence Provider

Reference genome sequence providers are specified on the command line as a string of six fields delimited by colons (:). Using the example GenomeOne.fa:100000:150:20:150:50, these fields, in order, are:

  • GenomeOne.fa: Reference FASTA. Path to the reference FASTA file upon which this reference genome sequence provider is based.
  • 100000: Error Rate. A sequencing error is simulated 1 in every N nucleobases generated, where N is the value of this field.
  • 150: Inner Distance Mean. The inner distance for generated reads is drawn from a normal distribution. This field is the mean of that distribution.
  • 20: Inner Distance Standard Deviation. The inner distance for generated reads is drawn from a normal distribution. This field is the standard deviation of that distribution.
  • 150: Read Length. The length of the reads generated by this reference genome sequence provider.
  • 50: Relative Weight. This field is used together with the relative weights of the other reference genome sequence providers to determine how often reads are pulled from this provider. For instance, if you have two reference genome sequence providers, one with a relative weight of 90 and another with a relative weight of 10, then the first is pulled from 90% of the time and the second 10% of the time. The weights are not required to sum to 100, because the sum of all relative weights provided is used as the denominator.
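To make the field order and the weight arithmetic concrete, here is a minimal Python sketch. It is not code from the ngs tool: the `ProviderSpec` class and the `parse_provider` and `selection_probabilities` functions are hypothetical names, and the numeric types chosen for each field (and splitting from the right so a colon in the path survives) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ProviderSpec:
    """Hypothetical container for the six fields; not a type from the ngs tool."""
    fasta: str
    error_rate: int              # one error per `error_rate` nucleobases
    inner_distance_mean: float
    inner_distance_std: float
    read_length: int
    weight: float


def parse_provider(spec: str) -> ProviderSpec:
    # Split from the right so that only the last five colons act as delimiters
    # (an assumption; the real tool may parse the string differently).
    parts = spec.rsplit(":", 5)
    if len(parts) != 6:
        raise ValueError(f"expected six colon-delimited fields, got {len(parts)}: {spec!r}")
    fasta, error_rate, mean, std, read_length, weight = parts
    return ProviderSpec(
        fasta, int(error_rate), float(mean), float(std), int(read_length), float(weight)
    )


def selection_probabilities(specs: List[ProviderSpec]) -> List[float]:
    # The sum of all relative weights is the denominator, so the weights
    # do not need to add up to 100.
    total = sum(s.weight for s in specs)
    return [s.weight / total for s in specs]


if __name__ == "__main__":
    specs = [
        parse_provider("GenomeOne.fa:100000:150:20:150:50"),
        parse_provider("GenomeTwo.fa:100000:150:20:150:50"),
    ]
    print(selection_probabilities(specs))
```

With the two providers from the example command, both weights are 50, so each provider would supply half of the reads under this reading of the weights.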

Tips

  • Make sure you set a sensible combination of read length and inner distance mean and standard deviation, or else you will run into issues where the fragment being generated is too short to support the read length. Inner distances are clamped at mean +/- (3 * std), so you can rule this case out by ensuring read length >= (-mean + 3 * std). For the example provider above, -150 + 3 * 20 = -90, which a read length of 150 comfortably satisfies.
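The inequality in the tip can be checked before launching a run. The following Python sketch only restates that inequality; `read_length_is_safe` is a hypothetical helper name, not part of the ngs command line interface.

```python
def read_length_is_safe(read_length: float, mean: float, std: float) -> bool:
    # Direct restatement of the tip: read length >= (-mean + 3 * std).
    # Equivalently, the smallest clamped inner distance (mean - 3 * std)
    # must be no less than -read_length.
    return read_length >= (-mean + 3 * std)


if __name__ == "__main__":
    # Values from the example provider string GenomeOne.fa:100000:150:20:150:50.
    print(read_length_is_safe(read_length=150, mean=150, std=20))
```

A strongly negative mean inner distance (heavily overlapping reads) is the kind of setting that fails this check, for instance a read length of 50 with a mean of -150 and a standard deviation of 20.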

Limitations

  • Inner distances drawn from the specified normal distribution are clamped at mean - (3 * std) and mean + (3 * std). This makes the tool more predictable when generating fragment lengths. Without this clamping, rare but extreme values drawn from the inner distance distribution would, over a large number of reads, generate fragments too short to support the specified read length. With clamping, you can rule this out by following the tip on this topic above.
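The clamping described here can be sketched in a few lines of Python. This is an illustration of the technique only, not the tool's implementation: `sample_inner_distance` is a hypothetical name, and the real tool may round or otherwise post-process the value.

```python
import random


def sample_inner_distance(mean: float, std: float, rng: random.Random) -> float:
    # Draw from a normal distribution, then clamp to [mean - 3*std, mean + 3*std].
    value = rng.gauss(mean, std)
    lower = mean - 3 * std
    upper = mean + 3 * std
    return max(lower, min(upper, value))


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so repeated runs give the same draws
    samples = [sample_inner_distance(150, 20, rng) for _ in range(100_000)]
    # With mean 150 and std 20, every sample lies within [90, 210].
    print(min(samples), max(samples))
```

Clamping moves out-of-range draws onto the boundary rather than redrawing them, so a small share of samples lands exactly on mean - 3 * std or mean + 3 * std.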