Implement genomics pipeline #37
Comments
Sample datasets: You might want to use tomato or rice instead of maize, because those two genomes are much closer in size to the sorghum genome; the maize genome is almost three times the size of the sorghum genome. I would suggest tomato, which is a bit larger at ~950 Mb. The benefit of using rice, which is a bit smaller than the sorghum genome, is that it is also a monocot like sorghum; tomato is a dicot.
Data sizes: 200 sorghum genomes during the first six months of Year 1 and 200 genomes during the first six months of Year 2. Most will be resequenced at 20-30x coverage; ~40 will be de novo sequenced, which will be more like 75-100x coverage. We can probably get 50-70x coverage of a sorghum genome per lane (lots of assumptions here, so this is a crude estimate). For the resequencing, we will probably need only 1-2 libraries per genome. For the de novo sequencing, we should probably use a suite of overlapping libraries, which could mean 5-7 libraries per genome.
I guess we can discuss during the next conference call, but I don't think we need 75-100x per genome. If we target 40x (which is already considered high coverage) for the de novo sequenced lines, we are good.
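To make the throughput discussion concrete, here is a back-of-envelope lane estimate using the numbers above: 200 genomes per year of which ~40 are de novo, 30x for resequencing, the revised 40x target for de novo lines, and the conservative 50x-per-lane end of the crude per-lane estimate. All of these figures are assumptions taken from this thread, not a sequencing plan.

```python
def lanes_needed(n_genomes, target_coverage, per_lane_coverage=50):
    """Lanes required if one lane yields `per_lane_coverage`x of one genome."""
    total_coverage = n_genomes * target_coverage
    # round up: a partial lane still costs a full lane
    return -(-total_coverage // per_lane_coverage)

# Year 1: 160 resequenced genomes at 30x, 40 de novo genomes at 40x,
# assuming 50x of one sorghum genome per lane (the low end of the guess)
reseq_lanes = lanes_needed(160, 30)    # 4800x total -> 96 lanes
denovo_lanes = lanes_needed(40, 40)    # 1600x total -> 32 lanes
print(reseq_lanes, denovo_lanes)
```

At the 70x-per-lane end of the estimate the totals drop by roughly a third, so the per-lane yield assumption dominates the budget.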
All the software in the overview is open source and runs on Unix systems. Installation is very straightforward; I don't think any of it requires special libraries or dependencies beyond what comes native on most systems.
Pipeline implementation: Again, we can discuss during the conference call, but I don't think we want to re-use existing pipelines as is. The set of parameter values to use for read alignment depends on the genome characteristics of each species. The same goes for SNP and genotype calling: the best calling is achieved differently for different species (expected diversity), and even within a species it varies with read length, depth of coverage, etc. GATK offers great flexibility for the user to choose the best set of parameter values, but this has to be decided based on the characteristics of the dataset, which requires human expertise (e.g. https://www.broadinstitute.org/gatk/guide/bp_step.php?p=3). In summary, I don't think we can fully set up the pipeline before we receive the data, except for the raw read filtering step (which is very standard). But once we have our best set of values based on a subset of samples, we can automate it for the entire sample set and even for additional sequences incorporated later.
Even if settings for individual programs need to change, we can still set up a complete pipeline with test data (from another species like Rob suggested). The pipeline aspect is just how programs are linked together so that raw data can be input on one end and results can be output on the other. The pipeline can of course be tweaked later when we have sorghum data. |
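The "how programs are linked together" point can be sketched as a dry run that only builds the per-sample commands. Tool names (Trimmomatic, bwa, samtools, GATK) are the usual suspects for this kind of pipeline, but the exact flags and file names here are schematic placeholders; as discussed above, the real parameter values get chosen once we see sorghum data.

```python
def build_pipeline(sample, ref="sorghum_ref.fa"):
    """Return the ordered shell commands for one sample, raw reads -> VCF.

    Dry-run sketch only: nothing is executed, and every flag shown here
    is a placeholder to be tuned on a subset of real samples.
    """
    r1, r2 = f"{sample}_R1.fastq", f"{sample}_R2.fastq"
    t1, t2 = f"{sample}_R1.trimmed.fastq", f"{sample}_R2.trimmed.fastq"
    bam = f"{sample}.sorted.bam"
    return [
        # 1. raw read filtering -- the "very standard" step
        f"trimmomatic PE {r1} {r2} {t1} {sample}_R1.unpaired.fastq "
        f"{t2} {sample}_R2.unpaired.fastq",
        # 2. alignment; parameters depend on genome characteristics
        f"bwa mem {ref} {t1} {t2} | samtools sort -o {bam}",
        # 3. SNP/genotype calling; GATK settings decided per dataset
        f"gatk HaplotypeCaller -R {ref} -I {bam} -O {sample}.vcf",
    ]

for cmd in build_pipeline("sample01"):
    print(cmd)
```

Because each step only consumes the previous step's output files, swapping parameter values later doesn't change the linkage, which is the point being made above.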
@robalba1 can you point to any example data? Is there any Sorghum data that we can start with? |
Can we download reads from the Sequence Read Archive (SRA) at NCBI? There are some sorghum whole-genome sequencing runs with paired-end Illumina HiSeq 2000 reads deposited there. For example, http://www.ncbi.nlm.nih.gov/sra/SRX974517%5Baccn%5D
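Fetching such a run would look roughly like the following, assuming the NCBI sra-tools (`prefetch` and `fastq-dump`) are installed. The run accession here is a made-up placeholder; the actual SRR run IDs under the SRX974517 experiment would need to be looked up on the SRA page. This sketch only builds the commands rather than executing them.

```python
def sra_download_cmds(run_accession, outdir="reads"):
    """Commands to fetch one SRA run and convert it to paired FASTQ files."""
    return [
        ["prefetch", run_accession],                     # fetch the .sra file
        ["fastq-dump", "--split-files",                  # write R1/R2 FASTQs
         "--outdir", outdir, run_accession],
    ]

for cmd in sra_download_cmds("SRR0000000"):  # placeholder run ID
    print(" ".join(cmd))
```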
Closing this now. Sample data is on iPlant; genomics pipeline will use CoGe as described in #41 |