
Implement genomics pipeline #37

Closed
dlebauer opened this issue Dec 4, 2015 · 10 comments


dlebauer commented Dec 4, 2015

Overview

  • The TERRA Reference team will coordinate implementation of a basic genome sequencing pipeline (e.g. Process Reads —> Map & Assemble Reads —> Call Variants —> Annotate Variants), described in "Proposed formats and databases for genomics data" reference-data#19 and summarized in the figure below
  • Most of our 400 lines will be resequenced, but ~40 lines will be sequenced for de novo assembly, so the pipeline(s) will need to accommodate both paths.
  • Mike Gore will lead the system architecture

[figure: genomics pipeline overview]

Questions to address:

Resources?

  • What computing resources are required?
  • What are the data sizes?
    • sequencing coverage
    • number of samples, libraries, lanes
    • expected rate of data production over time
  • Do we have sample datasets so that we can set up the pipeline prior to receiving data? (Perhaps re-create a maize pipeline.)
  • What software needs to be installed?

Division of Labor

  • Who does what among HPCBio, NCSA, Danforth, Cornell, and other teams?
  • What will be done where?
  • How will the data move from one location to another? At what stage in their processing?
  • To what extent do the workflows need to be automated?
  • The pipeline has been implemented on several systems at NCSA; code is available on GitHub: https://github.com/HPCBio/BW_VariantCalling (Documentation)
    • Can TERRA use this workflow? What modifications will be necessary? Or would it be worthwhile to start from scratch?

robalba1 commented Dec 8, 2015

Sample datasets: You might want to use tomato or rice, instead of maize, because these two genomes are much closer to the size of the sorghum genome. The maize genome is almost three times the size of the sorghum genome. I would suggest tomato, which is a bit larger at 950 Mb. The benefit of using rice, which is a bit smaller than the sorghum genome, is that it is also a monocot like sorghum. Tomato is a dicot.
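The size comparison above can be sanity-checked with a quick calculation. The genome sizes below are round approximations (sorghum ~730 Mb, rice ~430 Mb, maize ~2,300 Mb, and the 950 Mb tomato figure from the comment), so treat this as a rough sketch rather than authoritative numbers:

```python
# Approximate assembled-genome sizes in megabases.
# Round numbers, not authoritative figures -- assumptions for the comparison.
GENOME_MB = {"sorghum": 730, "rice": 430, "tomato": 950, "maize": 2300}

def size_ratio(species, reference="sorghum"):
    """Genome size of `species` relative to the reference (sorghum by default)."""
    return GENOME_MB[species] / GENOME_MB[reference]

for sp in ("maize", "tomato", "rice"):
    print(f"{sp}: {size_ratio(sp):.2f}x the sorghum genome")
```

With these figures, maize comes out around 3x the sorghum genome, tomato slightly larger than sorghum, and rice somewhat smaller, which is the ordering behind the tomato/rice suggestion.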


robalba1 commented Dec 8, 2015

Data sizes: 200 sorghum genomes during the first six months of Year 1 and 200 genomes during the first six months of Year 2. Most will be resequenced, so 20-30x coverage. ~40 will be de novo sequenced, which will be more like 75-100x coverage. We can probably get 50x-70x coverage of a sorghum genome per lane (lots of assumptions here, so this is a crude estimate). For the re-sequencing, we will probably only need 1-2 libraries per genome. For the de novo sequencing, we should probably use a suite of overlapping libraries, which could mean 5-7 libraries per genome.
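A back-of-envelope estimate of the raw sequence volume these numbers imply. Everything here is an assumption layered on the figures above: a ~730 Mb sorghum genome, one 200-genome batch split as ~160 resequenced lines at the 25x midpoint and ~40 de novo lines at an 85x midpoint:

```python
SORGHUM_MB = 730  # approximate sorghum genome size in megabases (assumption)

def raw_gigabases(n_genomes, coverage):
    """Total raw sequence for n genomes at a given fold-coverage, in gigabases."""
    return n_genomes * SORGHUM_MB * coverage / 1000.0

# One 200-genome batch: ~160 resequenced (25x midpoint), ~40 de novo (85x midpoint)
reseq = raw_gigabases(160, 25)
denovo = raw_gigabases(40, 85)
print(f"resequencing: {reseq:,.0f} Gb; de novo: {denovo:,.0f} Gb; "
      f"batch total: {reseq + denovo:,.0f} Gb")
```

That puts one batch in the low thousands of gigabases of raw sequence, before FASTQ overhead (quality strings roughly double the on-disk size) and before any derived files (BAMs, VCFs).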


ghost commented Dec 10, 2015

I guess we can discuss during the next conference call but I think we don't need 75-100x per genome. If we target 40x (which is already considered high coverage) for the de novo sequenced lines, we are good.
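Taking the ~50-70x-per-lane estimate from the previous comment at its 60x midpoint (an assumption), the lane count per genome for the coverage targets being debated is:

```python
import math

PER_LANE_X = 60  # assumed midpoint of the 50-70x-per-lane estimate above

def lanes_needed(target_coverage, per_lane=PER_LANE_X):
    """Whole sequencing lanes required to reach a target fold-coverage
    for one genome, given an assumed per-lane yield."""
    return math.ceil(target_coverage / per_lane)

print(lanes_needed(40))   # 40x de novo target
print(lanes_needed(100))  # 100x upper estimate
print(lanes_needed(25))   # 20-30x resequencing midpoint
```

So under these assumptions a 40x target fits in a single lane per genome, while the 75-100x proposal would need two, which is part of the cost argument for 40x.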


ghost commented Dec 10, 2015

Computer resources:
[attachment: computer_resources]


ghost commented Dec 10, 2015

All the software in the overview is open source. It all runs on Unix systems, and installation is very straightforward. I don't think any of it requires special libraries or dependencies beyond what comes native on most systems.


ghost commented Dec 10, 2015

Pipeline implementation: Again, we can discuss during the conference call, but I think we don't want to re-use existing pipelines as-is. The set of parameter values to use for read alignment depends on the genome characteristics of each species. The same goes for SNP and genotype calling: the best calling is achieved differently for different species (expected diversity), and even within a species it varies with read length, depth of coverage, etc. GATK offers great flexibility for the user to choose the best set of parameter values, but this has to be decided based on the characteristics of the dataset, which requires human expertise (e.g. https://www.broadinstitute.org/gatk/guide/bp_step.php?p=3). In summary, I don't think we can fully set up the pipeline before we receive the data, except for the raw read filtering step (this is very standard). But once we settle on our best set of values based on a subset of samples, we can automate it for the entire sample set and even for additional sequences incorporated later.

@nfahlgren

Even if settings for individual programs need to change, we can still set up a complete pipeline with test data (from another species like Rob suggested). The pipeline aspect is just how programs are linked together so that raw data can be input on one end and results can be output on the other. The pipeline can of course be tweaked later when we have sorghum data.
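That "linked together" structure can be sketched as an ordered list of stages where each stage consumes the previous stage's output. The command names below (`trim_reads`, `align_reads`, etc.) are hypothetical placeholders, not tool choices the project has made; the code only assembles the command strings, it does not run anything:

```python
def resequencing_commands(sample, ref="sorghum_ref.fa"):
    """Return the ordered (stage, command) list for one resequenced sample.
    Command names are illustrative placeholders; the real tools and their
    parameters can be swapped in per stage without changing the linkage."""
    return [
        ("process_reads", f"trim_reads {sample}_R1.fastq.gz {sample}_R2.fastq.gz"),
        ("map_reads",     f"align_reads {ref} {sample} > {sample}.bam"),
        ("call_variants", f"call_variants {ref} {sample}.bam > {sample}.vcf"),
        ("annotate",      f"annotate_variants {sample}.vcf > {sample}.annotated.vcf"),
    ]

for stage, cmd in resequencing_commands("test_sample"):
    print(f"{stage:14s} {cmd}")
```

The point is that the stage ordering and file hand-offs can be fixed and tested on non-sorghum data now, while per-stage parameters stay configurable for later tuning.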

@dlebauer

@robalba1 can you point to any example data? Is there any Sorghum data that we can start with?


ghost commented Dec 10, 2015

Can we download reads from SequenceReadArchive (SRA) in NCBI? There are some Sorghum, whole genome sequencing with paired-end Illumina HiSeq2000 deposited. For example, http://www.ncbi.nlm.nih.gov/sra/SRX974517%5Baccn%5D

@dlebauer dlebauer added this to the vα y1 q3 Initial data products and pipeline accessibility milestone Apr 27, 2016
@dlebauer

Closing this now.

Sample data is on iPlant; genomics pipeline will use CoGe as described in #41
