Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Understanding which reads contribute to contigs #3

Closed
dportik opened this issue Jan 28, 2021 · 4 comments
Closed

Understanding which reads contribute to contigs #3

dportik opened this issue Jan 28, 2021 · 4 comments

Comments

@dportik
Copy link

dportik commented Jan 28, 2021

Hi Xiaowen,
I am wondering if it is possible to obtain a list of reads that contribute to each contig in the assembly?

This seems like it would be highly valuable for metagenomics, as it can identify all reads associated with specific bacterial genomes. In addition, it would be extremely valuable for a more specific use-case I describe below.

I am working on a problem where I am trying to assemble an endosymbiont bacteria from a larger HiFi dataset focused on the host organism. The assembly of the full dataset with hifiasm did not produce a complete bacteria contig, it was present as several smaller contigs. I attempted to re-assemble and improve the quality of these results. To accomplish this, I have:

  1. Mapped contigs from a hifiasm assembly of the full dataset to a reference of the target bacteria, to identify and extract relevant bacteria contigs.
  2. Mapped reads to those putative bacteria contigs to identify reads that are most likely target bacteria, and extract them.
  3. Performed assembly with this subset of putative bacteria reads using hifiasm-meta.

This resulted in a complete, circular genome for the target bacteria, along with a few small tangled contigs, suggesting the approach worked pretty well. The small contigs in the new assembly are likely some combination of host reads and perhaps strain variation.

The genome has a few frameshifts and I would like to try polishing it using only the reads that were used to build the complete bacteria contig. I have used minimap2 to align the subset of reads to this contig, and there are several short regions in which some proportion of reads map poorly (alignments are <1000 bp and they are being hard clipped >3000 bp on each side). I think these are potentially host reads. I can filter these out using samclip, but it would be helpful to know whether or not they were used to construct this contig, and therefore deserve to be excluded.

Given metagenomic assemblies often result in several complete genomes, I think the same topic will come up. Polishing would also be desirable, but problematic read alignments would be more prevalent due to more species, shared repeats, etc. Having the ability to assign reads to particular contigs would be a tremendous help here too.

Any advice would be greatly appreciated!

Thanks,
Dan

@xfengnefx
Copy link
Owner

xfengnefx commented Jan 29, 2021

Hi Dan, I'm not sure if this is helpful as I haven't tried to re-assemble or polish. Please correct me if i've misunderstood your question:

In the ideal scenario, i.e. moderate coverage & no read containment, all reads involved in a contig are described the 'A' lines in the gfa file. However, none of two requirements hold true for metagenome datasets.

For containment reads: the shorter read of a containment pair is discarded before transitive reduction in hifiasm(-meta), but they are available along with other types of overlaps in the paf dump by specifying --write-paf (to only write paf and do nothing else, use --write-paf -e -o input_name). Currently hifiasm-meta writes two paf files, one contains overlaps between the same haplotype or phasing unknown, the other contains overlaps between haplotypes; you probably would want to only use the former one to start with. Basically, get the list of reads from 'A' lines, generate paf dump, and fish the containments from the paf dump.

For high coverage scenario: hifiasm(-meta) caps the total number of targets for a read before heading to the graph part. It drops the shorter overlaps first if has to. This cap can be specified by switch -N (INT, default: 100). Hifiasm-meta is aware of this cap but does not dump the discarded overlaps. For now you might have to either set -N higher, or enable the debug lines at https://github.com/xfengnefx/hifiasm-meta/blob/meta/anchor.cpp#L249 to get the info. I think the later route is only an alternative to alignment if input coverage is moderate, since most discarded reads aren't used in the read/unitig/contig graph at all, and the shorter overlaps could be no better than what you can get from read to contig alignment. Maybe changing -N when needed is sufficient.

I think this way you should be able to retrieve most of the relevant reads, but in practice the outcome depends on how you define "contribute".

By the way, this repo is still under development and we are actively seeking more datasets for dev purpose. Every dataset we got was vastly different. If you happen to have a shareable dataset and is happy to do so, we sincerely appreciate it :)

edit:typo

@dportik
Copy link
Author

dportik commented Feb 8, 2021

Hi Xiaowen,
Thanks for your detailed response. It sounds like the paf dump would provide the information I am interested in. I will go ahead and try this out and let you know if I have other questions.

We are working on getting sequence data for two empirical samples which will be made publicly available. As soon as they are ready, I will let you know!

@xfengnefx
Copy link
Owner

@dportik Great, paf files should contain most of the relevant info. I've implemented the dump for the high coverage scenario case. I happen to have other small bug fixes alongside, so they will be committed together soon. Will let you know in this thread when it's pushed.

Looking forward to the data release, thank you!

@xfengnefx
Copy link
Owner

@dportik I have pushed r030, which introduced --dump-all-ovlp. It is effectively https://github.com/xfengnefx/hifiasm-meta/blob/meta/anchor.cpp#L249 reformatted, please refer to the https://github.com/xfengnefx/hifiasm-meta/blob/meta/anchor.cpp#L260 block.

This switch will write all overlaps ever calculated during the final overlapping to a tab delimited file named "output_name.all_ovlp_pairs". You need a custom script and a p_ctg contig graph to fish out the relevant ones. It's a last resort to paf dump and alignment, i'm not sure if it's actually helpful in your case.

Note: 1) the switch has no effect if bin files exist (i.e. you've done an assembly run in the folder or the folder specified by -B under the same name specified by -o), since this info is only available during the final overlapping. 2) if read selection is enabled and takes effect, this dump will only contain overlaps between the kept reads.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants