WHdenovo

A cost-effective approach to diploid assembly for single samples and trios. It consists of three steps: constructing a sequence graph from Illumina reads, aligning long reads to the graph, and partitioning those long reads into two haplotypes.

Installation

Conda installation

conda install -c shilpagarg whdenovo

Data Simulation

For simulating Illumina data:

whdenovo simulate illumina N-bp.fasta <het> out/illumina/

The command reports the files you will need for PacBio data simulation and for testing WHdenovo.

For simulating PacBio data:

whdenovo simulate pacbio sample.fastq <coverage> out/pacbio/ \
                       <mom_hap1.fasta> <mom_hap2.fasta> \
                       <dad_hap1.fasta> <dad_hap2.fasta> \
                       <child_hap1.fasta> <child_hap2.fasta>

Assembly

Trio case:

whdenovo partition --illumina1 <illumina_child_1.fq> --illumina2 <illumina_child_2.fq> \
                       --pacbio <pacbio_mom.fasta> <pacbio_dad.fasta> <pacbio_child.fasta> \
                       -p ped [-t <thread>] [-o out/path] [--lowc INT] [--highc INT]

e.g.

whdenovo partition --illumina1 16513_illumina_10k_1.5/child.het1.5.cov30_1.fq --illumina2 16513_illumina_10k_1.5/child.het1.5.cov30_2.fq \
                       --pacbio 16513_pacbio_10k_1.5_20/pacbio_mom.fasta 16513_pacbio_10k_1.5_20/pacbio_dad.fasta 16513_pacbio_10k_1.5_20/pacbio_child.fasta \
                       -p ped -t 24 -o test.simu --lowc 5 --highc 60
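The trio command takes a pedigree file through -p, but its contents are not shown here. As a minimal sketch, assuming WHdenovo accepts the conventional six-column PED layout (family ID, individual ID, paternal ID, maternal ID, sex, phenotype), a file for one trio could be written as follows. The helper names and the sample IDs `FAM1`, `child`, `dad` and `mom` are hypothetical, and how the IDs are matched to the input files is not covered by this README.

```python
# Sketch: write a one-trio pedigree file in the conventional six-column PED layout.
# Assumption: WHdenovo accepts this layout; all sample IDs below are hypothetical.


def ped_line(family, individual, father, mother, sex=0, phenotype=0):
    """Return one whitespace-separated PED record.

    Columns: family ID, individual ID, paternal ID, maternal ID,
    sex (1 = male, 2 = female, 0 = unknown), phenotype (0 = missing).
    """
    fields = (family, individual, father, mother, sex, phenotype)
    return " ".join(str(field) for field in fields)


def write_trio_ped(path, family="FAM1", child="child", father="dad", mother="mom"):
    """Write a single record describing the child and its two parents."""
    with open(path, "w") as handle:
        handle.write(ped_line(family, child, father, mother) + "\n")


if __name__ == "__main__":
    # Writes a file named "ped", matching the "-p ped" argument used above.
    write_trio_ped("ped")
```

Running the script creates a file named ped in the current directory, which can then be passed as -p ped.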

Individual case:

whdenovo partition --illumina1 <illumina_who_1.fq> --illumina2 <illumina_who_2.fq> \
                       --pacbio <pacbio_who.fasta> [-t <thread>] [-o out/path]
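When partitioning is driven from a script, the two invocations above can be assembled programmatically. The sketch below only builds the argument list from the flags shown in this README; the function name `build_partition_cmd` is hypothetical and the file names in the example are placeholders. It does not run WHdenovo itself.

```python
import shlex


def build_partition_cmd(illumina1, illumina2, pacbio, ped=None,
                        threads=None, out=None, lowc=None, highc=None):
    """Build the argv list for `whdenovo partition`.

    pacbio is a list: [mom, dad, child] for a trio, or one file for an
    individual. Only flags that appear in this README are used. The README
    shows --lowc/--highc for the trio case only.
    """
    cmd = ["whdenovo", "partition",
           "--illumina1", illumina1,
           "--illumina2", illumina2,
           "--pacbio", *pacbio]
    optional = [("-p", ped), ("-t", threads), ("-o", out),
                ("--lowc", lowc), ("--highc", highc)]
    for flag, value in optional:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


if __name__ == "__main__":
    trio = build_partition_cmd(
        "child_1.fq", "child_2.fq",
        ["pacbio_mom.fasta", "pacbio_dad.fasta", "pacbio_child.fasta"],
        ped="ped", threads=24, out="test.simu", lowc=5, highc=60)
    # Print the shell-quoted command; to execute it, pass the list to
    # subprocess.run(trio, check=True) in an environment with whdenovo installed.
    print(shlex.join(trio))
```

Building a list rather than a single string avoids shell quoting problems with paths that contain spaces.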

For assembling the genome from the partitioned reads produced by the partition step:

conda activate flye
whdenovo assemble -f son.inputreads.fa \
                      -0 path/to/output/HP0.reads -1 path/to/output/HP1.reads \
                      --assemble -s 15k -t 40 <--pacbio|--nano>

To use a different assembler on the partitioned reads, omit the --assemble argument; in that case you may not need to activate the flye conda environment.

Result Validation

For validating the partitioning of simulated data:

whdenovo validate -p out/path -f <pacbio_who.fasta>

For validating the partitioning of real data when you have ground truth classification:

whdenovo validate -p out/path -t <tagged.reads.txt>

An example of tagged.reads.txt is at test/haplotagged.reads.txt; the file should include the HP and PS tags from a haplotagged BAM file.
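One way to collect the HP and PS tags is to read the alignments as SAM text and pull out the optional fields. The sketch below is an illustration only: the function name `hp_ps_from_sam_line` is hypothetical, and the tab-separated output (read name, HP, PS) is an assumed layout, so compare it with test/haplotagged.reads.txt before using it with whdenovo validate.

```python
import sys


def hp_ps_from_sam_line(line):
    """Return (read_name, HP, PS) for one SAM alignment line.

    Returns None for header lines, malformed lines, and reads that
    lack either the HP or the PS tag.
    """
    if line.startswith("@"):
        return None  # SAM header line
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        return None  # fewer than the 11 mandatory SAM columns
    tags = {}
    for field in fields[11:]:
        parts = field.split(":", 2)  # optional fields look like TAG:TYPE:VALUE
        if len(parts) == 3:
            tags[parts[0]] = parts[2]
    if "HP" not in tags or "PS" not in tags:
        return None
    return fields[0], tags["HP"], tags["PS"]


if __name__ == "__main__":
    # Reads SAM text from standard input and prints one line per tagged read.
    for sam_line in sys.stdin:
        record = hp_ps_from_sam_line(sam_line)
        if record is not None:
            print("\t".join(record))
```

The script expects SAM text on standard input, for example the output of samtools view on the haplotagged BAM file.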

We acknowledge the support of dependencies such as bfc, SPAdes, vg and GraphAligner.

Citations

A graph-based approach to diploid genome assembly.
A haplotype-aware de novo assembly of related individuals using pedigree sequence graph.
