
Implement genomics pipeline #37

Closed
dlebauer opened this issue Dec 4, 2015 · 10 comments


dlebauer commented Dec 4, 2015

Overview

  • The TERRA Reference team will coordinate implementation of a basic genome sequencing pipeline (e.g. Process Reads —> Map & Assemble Reads —> Call Variants —> Annotate Variants), described in "Proposed formats and databases for genomics data" reference-data#19 and summarized in the figure below
  • Most of our 400 lines will be resequenced, but ~40 lines will be sequenced for de novo assembly, so the pipeline(s) will need to accommodate both paths.
  • Mike Gore will lead the system architecture

[figure: genomics pipeline overview]

Questions to address:

Resources?

  • What computing resources are required?
  • What are the data sizes?
    • sequencing coverage
    • number of samples, libraries, lanes
    • expected rate of data production over time
  • Do we have sample datasets so that we can set up the pipeline prior to receiving data? (Perhaps re-create a maize pipeline.)
  • What software needs to be installed?

Division of Labor

  • Who does what among HPCBio, NCSA, Danforth, Cornell, and other teams?
  • What will be done where?
  • How will the data move from one location to another? At what stage in their processing?
  • To what extent do the workflows need to be automated?
  • The pipeline has been implemented on several systems at NCSA; code is available on GitHub: https://github.com/HPCBio/BW_VariantCalling (Documentation)
    • Can TERRA use this workflow? What modifications will be necessary? Or would it be worthwhile to start from scratch?

robalba1 commented Dec 8, 2015

Sample datasets: You might want to use tomato or rice, instead of maize, because these two genomes are much closer to the size of the sorghum genome. The maize genome is almost three times the size of the sorghum genome. I would suggest tomato, which is a bit larger at 950 Mb. The benefit of using rice, which is a bit smaller than the sorghum genome, is that it is also a monocot like sorghum. Tomato is a dicot.
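The size comparison above can be sanity-checked with a quick calculation. The genome sizes below are round approximations (sorghum ~730 Mb, rice ~430 Mb, maize ~2,300 Mb, and the 950 Mb tomato figure from the comment), so treat this as a rough sketch rather than authoritative numbers:

```python
# Approximate assembled-genome sizes in megabases.
# Round numbers, not authoritative figures -- assumptions for the comparison.
GENOME_MB = {"sorghum": 730, "rice": 430, "tomato": 950, "maize": 2300}

def size_ratio(species, reference="sorghum"):
    """Genome size of `species` relative to the reference (sorghum by default)."""
    return GENOME_MB[species] / GENOME_MB[reference]

for sp in ("maize", "tomato", "rice"):
    print(f"{sp}: {size_ratio(sp):.2f}x the sorghum genome")
```

With these figures, maize comes out around 3x the sorghum genome, tomato slightly larger than sorghum, and rice somewhat smaller, which is the ordering behind the tomato/rice suggestion.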


robalba1 commented Dec 8, 2015

Data sizes: 200 sorghum genomes during the first six months of Year 1 and 200 genomes during the first six months of Year 2. Most will be resequenced, so 20-30x coverage. ~40 will be de novo sequenced, which will be more like 75-100x coverage. We can probably get 50x-70x coverage of a sorghum genome per lane (lots of assumptions here, so this is a crude estimate). For the re-sequencing, we will probably only need 1-2 libraries per genome. For the de novo sequencing, we should probably use a suite of overlapping libraries, which could mean 5-7 libraries per genome.
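A back-of-envelope estimate of the raw sequence volume these numbers imply. Everything here is an assumption layered on the figures above: a ~730 Mb sorghum genome, one 200-genome batch split as ~160 resequenced lines at the 25x midpoint and ~40 de novo lines at an 85x midpoint:

```python
SORGHUM_MB = 730  # approximate sorghum genome size in megabases (assumption)

def raw_gigabases(n_genomes, coverage):
    """Total raw sequence for n genomes at a given fold-coverage, in gigabases."""
    return n_genomes * SORGHUM_MB * coverage / 1000.0

# One 200-genome batch: ~160 resequenced (25x midpoint), ~40 de novo (85x midpoint)
reseq = raw_gigabases(160, 25)
denovo = raw_gigabases(40, 85)
print(f"resequencing: {reseq:,.0f} Gb; de novo: {denovo:,.0f} Gb; "
      f"batch total: {reseq + denovo:,.0f} Gb")
```

That puts one batch in the low thousands of gigabases of raw sequence, before FASTQ overhead (quality strings roughly double the on-disk size) and before any derived files (BAMs, VCFs).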


ghost commented Dec 10, 2015

I guess we can discuss during the next conference call but I think we don't need 75-100x per genome. If we target 40x (which is already considered high coverage) for the de novo sequenced lines, we are good.
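Taking the ~50-70x-per-lane estimate from the previous comment at its 60x midpoint (an assumption), the lane count per genome for the coverage targets being debated is:

```python
import math

PER_LANE_X = 60  # assumed midpoint of the 50-70x-per-lane estimate above

def lanes_needed(target_coverage, per_lane=PER_LANE_X):
    """Whole sequencing lanes required to reach a target fold-coverage
    for one genome, given an assumed per-lane yield."""
    return math.ceil(target_coverage / per_lane)

print(lanes_needed(40))   # 40x de novo target
print(lanes_needed(100))  # 100x upper estimate
print(lanes_needed(25))   # 20-30x resequencing midpoint
```

So under these assumptions a 40x target fits in a single lane per genome, while the 75-100x proposal would need two, which is part of the cost argument for 40x.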


ghost commented Dec 10, 2015

Computer resources:
[attachment: computer_resources]


ghost commented Dec 10, 2015

All the software in the overview is open source. It all runs on Unix systems, and installation is very straightforward. I don't think any of it requires special libraries or dependencies beyond what comes native on most systems.


ghost commented Dec 10, 2015

Pipeline implementation: Again, we can discuss during the conference call, but I think we don't want to re-use existing pipelines as-is. The set of parameter values to use for read alignment depends on the genome characteristics of each species. The same goes for SNP and genotype calling: the best calling is achieved differently for different species (expected diversity), and even within a species it varies with read length, depth of coverage, etc. GATK offers great flexibility for the user to choose the best set of parameter values, but this has to be decided based on the characteristics of the dataset, which requires human expertise (e.g. https://www.broadinstitute.org/gatk/guide/bp_step.php?p=3). In summary, I don't think we can fully set up the pipeline before we receive the data, except for the raw read filtering step (this is very standard). But once we settle on our best set of values based on a subset of samples, we can automate it for the entire sample set and even for additional sequences incorporated later.

@nfahlgren

Even if settings for individual programs need to change, we can still set up a complete pipeline with test data (from another species like Rob suggested). The pipeline aspect is just how programs are linked together so that raw data can be input on one end and results can be output on the other. The pipeline can of course be tweaked later when we have sorghum data.
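That "linked together" structure can be sketched as an ordered list of stages where each stage consumes the previous stage's output. The command names below (`trim_reads`, `align_reads`, etc.) are hypothetical placeholders, not tool choices the project has made; the code only assembles the command strings, it does not run anything:

```python
def resequencing_commands(sample, ref="sorghum_ref.fa"):
    """Return the ordered (stage, command) list for one resequenced sample.
    Command names are illustrative placeholders; the real tools and their
    parameters can be swapped in per stage without changing the linkage."""
    return [
        ("process_reads", f"trim_reads {sample}_R1.fastq.gz {sample}_R2.fastq.gz"),
        ("map_reads",     f"align_reads {ref} {sample} > {sample}.bam"),
        ("call_variants", f"call_variants {ref} {sample}.bam > {sample}.vcf"),
        ("annotate",      f"annotate_variants {sample}.vcf > {sample}.annotated.vcf"),
    ]

for stage, cmd in resequencing_commands("test_sample"):
    print(f"{stage:14s} {cmd}")
```

The point is that the stage ordering and file hand-offs can be fixed and tested on non-sorghum data now, while per-stage parameters stay configurable for later tuning.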

@dlebauer

@robalba1 can you point to any example data? Is there any Sorghum data that we can start with?


ghost commented Dec 10, 2015

Can we download reads from SequenceReadArchive (SRA) in NCBI? There are some Sorghum, whole genome sequencing with paired-end Illumina HiSeq2000 deposited. For example, http://www.ncbi.nlm.nih.gov/sra/SRX974517%5Baccn%5D

@dlebauer dlebauer added this to the vα y1 q3 Initial data products and pipeline accessibility milestone Apr 27, 2016
@dlebauer

Closing this now.

Sample data is on iPlant; genomics pipeline will use CoGe as described in #41
