Troubleshooting

Lisa Meed edited this page Aug 10, 2021 · 2 revisions

The processing time needed to create BioGraph files varies with your dataset and the hardware processing it. This section explains how and why these variations occur and suggests ways to troubleshoot slow run times.

Known factors that affect BioGraph's run time

The overall runtime increases approximately linearly with the number of bases. If there are more reads (or a similar number of reads with longer sequences per read), the time to create a BioGraph will increase. For example, running a 60x dataset will take almost twice as long as running a 30x dataset. Conversely, fewer reads typically run faster.
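
As a rough illustration of the linear relationship, the sketch below scales a known runtime by the ratio of coverage depths. The function name `estimate_hours` and the 10-hour baseline are hypothetical, and the result is only an approximation; as noted above, doubling coverage takes almost, not exactly, twice as long.

```shell
# Estimate runtime at a new coverage depth, assuming runtime scales
# linearly with the number of bases (an approximation, not a guarantee).
# Usage: estimate_hours <known_hours> <known_coverage> <new_coverage>
estimate_hours() {
    awk -v h="$1" -v c="$2" -v n="$3" 'BEGIN { printf "%.1f\n", h * n / c }'
}

# Hypothetical baseline: a 30x dataset that took 10 hours on this hardware.
estimate_hours 10 30 60   # prints 20.0
```

Treat the number as an upper-range guide for planning on the same hardware, not as a prediction.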

Gauging progress

Although BioGraph reports progress to indicate the approximate work remaining, keep in mind that the indicators are not perfectly linear with time for every step. In particular, I/O bottlenecks can make progress seem to freeze from time to time, especially when writing outputs to disk.

Unexpectedly long runtimes

If the time to create a BioGraph appears to be significantly longer than expected, there are a number of hardware considerations to investigate:

  • Is the temporary storage used attached to a local device? Writing over a network to shared storage is typically much slower than using a physically attached SSD or disk.

  • Does temporary storage use a RAID system? Many RAID configurations optimize for safety, not for speed. Unless it is using a RAID scheme optimized for performance (striping, mirroring, etc.), this configuration may not allow data to be read from the disk at a high enough rate to keep the CPUs busy. This can be observed as CPUs sitting idle in the "disk sleep" state while waiting for I/O.

  • Are there enough cores? The creation of a BioGraph file is highly parallelized, so additional cores will improve performance (assuming optimized disk I/O and sufficient memory). Conversely, fewer cores will lead to longer processing time. We recommend 32 or more physical cores for production systems. Keep in mind that virtual cores are not as performant as physical cores.

  • Does your system have at least 64GB of memory? If not, BioGraph creation will need to create more temporary files, which will place a burden on a system’s I/O and slow down processing time. On memory-constrained systems, file system performance becomes even more critical.

  • Are system resources shared with other processes? On a cluster installation, BioGraph should be the only process running on the worker node. On a cloud installation, other virtual instances may be running on the same physical hardware and can impact I/O, CPU cache hit rates, available memory bandwidth, and overall performance. For optimal, predictable performance, use dedicated or "bare metal" instances whenever possible.

  • Are you achieving maximum parallelization? See Optimizing Performance for tips on how to do this.

  • Are you processing more than you need? Decoy assemblies, unused alternate assemblies, and centromere regions can require significant processing time. If you do not intend to use some regions for your analysis, you can exclude them using a BED file at the discovery step. This will reduce the overall processing workload.

    The public references at s3://spiral-public/references/ include a regions.bed file that contains only the autosomes and sex chromosomes. It specifically excludes alt contigs, decoys, mitochondria, telomeres, centromeres, and known regions of heterochromatin.

    $ biograph full_pipeline --discovery "--bed regions.bed" ...
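
Several of the hardware questions above can be checked quickly from a shell. The following is a minimal sketch for Linux using standard tools (`df`, `ps`, `nproc`, `/proc/meminfo`); it is not part of BioGraph. `TMP_DIR` is a placeholder for whatever temporary storage your run uses, and the list of network filesystem types is illustrative rather than exhaustive.

```shell
#!/bin/sh
# Quick Linux checks for the hardware questions above.
# TMP_DIR is a placeholder: set it to the temporary storage your run uses.
TMP_DIR="${TMP_DIR:-${TMPDIR:-/tmp}}"

# Print the filesystem type backing a directory (-T adds the type column).
fs_type() {
    df -PT "$1" | awk 'NR == 2 { print $2 }'
}

# Succeed if the type is a common network or cluster filesystem.
# This list is illustrative, not exhaustive.
is_network_fs() {
    case "$1" in
        nfs*|cifs|smb*|lustre|gpfs|ceph|beegfs|fuse.glusterfs|fuse.sshfs) return 0 ;;
        *) return 1 ;;
    esac
}

# Count processes currently in uninterruptible "disk sleep" (state D).
count_disk_sleep() {
    ps -eo state= | awk '$1 ~ /^D/ { n++ } END { print n + 0 }'
}

tmp_fs=$(fs_type "$TMP_DIR")
if is_network_fs "$tmp_fs"; then
    echo "temp storage $TMP_DIR: $tmp_fs (network filesystem, likely slow)"
else
    echo "temp storage $TMP_DIR: $tmp_fs"
fi

# nproc counts logical CPUs, so hyperthreads are included in this number.
echo "logical CPUs: $(nproc)"

# MemTotal is reported in kB and is slightly below the installed RAM.
awk '/^MemTotal:/ { printf "memory: %.0f GiB\n", $2 / 1048576 }' /proc/meminfo

echo "processes in disk sleep right now: $(count_disk_sleep)"
```

A disk sleep count that stays high while a BioGraph run is in progress suggests the CPUs are waiting on I/O. A single sample is not conclusive, so repeat the check a few times during the run.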

Common Error Messages

$ biograph
Traceback:
...
ImportError: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by /path/to/bg7/lib/python3.6/site-packages/biograph/_capi_38.so)

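This error can occur when attempting to run BioGraph on an unsupported operating system. In the example above, the system's C library (glibc) is older than the GLIBC_2.14 version that the BioGraph library requires. See System Requirements for supported operating systems.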
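
To see whether a system's glibc is older than the version named in the error, you can compare the two directly. This is a sketch that assumes a Linux system with `getconf` and GNU `sort` available. The `2.14` value is taken from the error message above, and `version_ge` is a helper defined here, not a BioGraph command. Passing this check does not by itself mean the operating system is supported.

```shell
#!/bin/sh
# Compare the installed glibc with the version named in the ImportError.
REQUIRED="2.14"

# Succeed if version $1 >= version $2 (uses GNU sort -V for version ordering,
# so that 2.9 correctly sorts before 2.14).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# getconf prints e.g. "glibc 2.17"; it prints nothing on non-glibc systems.
installed=$(getconf GNU_LIBC_VERSION 2>/dev/null | awk '{ print $2 }')

if [ -z "$installed" ]; then
    echo "could not determine glibc version (not a glibc system?)"
elif version_ge "$installed" "$REQUIRED"; then
    echo "glibc $installed satisfies GLIBC_$REQUIRED"
else
    echo "glibc $installed is older than GLIBC_$REQUIRED"
fi
```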

Getting more help

Run logs are automatically created in the --out BioGraph directory under qc/create_log.txt and qc/variants_log.txt. The console also logs valuable information and can be saved to a file by piping biograph through tee.

$ biograph full_pipeline ... 2>&1 | tee out.log
[INFO] Running biograph full_pipeline ...
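
When asking for help, it can be convenient to gather these logs into a single archive. The helper below is a hypothetical sketch, not a BioGraph feature: `bundle_logs` and its arguments are names chosen for illustration, and only the `qc/create_log.txt` and `qc/variants_log.txt` paths come from the description above.

```shell
# Collect the run logs named above into one compressed archive.
# Usage: bundle_logs <out_dir> <console_log> <archive.tar.gz>
#   out_dir      the directory you passed to --out
#   console_log  the file you saved with tee (e.g. out.log)
bundle_logs() {
    bl_out=$1
    bl_console=$2
    bl_archive=$3
    set --   # reuse the positional parameters as the list of files found
    for f in "$bl_out/qc/create_log.txt" "$bl_out/qc/variants_log.txt" "$bl_console"; do
        if [ -f "$f" ]; then
            set -- "$@" "$f"
        fi
    done
    if [ "$#" -eq 0 ]; then
        echo "no log files found" >&2
        return 1
    fi
    tar -czf "$bl_archive" "$@"
}

# Example (paths are placeholders):
#   bundle_logs my_sample.bg out.log biograph_logs.tar.gz
```

Missing files are skipped rather than treated as errors, so the archive still gets created if, for example, the variants step never ran.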

For further information, contact us at Spiral Genetics. System performance tuning is a complex subject, and we will be happy to help you attain optimal performance for your dataset and hardware.