GFA file size issue #2

Closed
alienzj opened this issue Dec 30, 2020 · 14 comments

alienzj commented Dec 30, 2020

Hi, I used hifiasm-meta to assemble urogenital tract metagenomic data from CAMI.

The data were simulated with CAMISIM (average read length: 3,000 bp; read-length s.d.: 1,000 bp).

Run log:

$ hifiasm_meta -o cami_0.hifiasm_meta.out -t 32 /database/openstack.cebitec.uni-bielefeld.de/swift/v1/CAMI_Urogenital_tract/pacbio/2018.01.23_14.08.31_sample_0/reads/anonymous_reads.fq.gz

[M::hamt_assemble] Skipped read selection.
[M::ha_analyze_count] lowest: count[16383] = 0
[M::hamt_ft_gen::278.101*4.89@16.432GB] ==> filtered out 0 k-mers occurring 750 or more times
[M::hamt_assemble] Generated flt tab.
alloc 1666925 uint16_t
[M::ha_pt_gen::398.464*4.70] ==> counted 131777689 distinct minimizer k-mers
[M::ha_pt_gen] count[16383] = 0 (for sanity check)
[M::ha_analyze_count] lowest: count[16383] = 0
tot_cnt=59765
tot_pos=59765
[M::ha_pt_gen::431.595*5.13] ==> indexed 59765 positions
[M::hamt_assemble::439.470*5.61@16.432GB] ==> corrected reads for round 1
[M::hamt_assemble] # bases: 4957619989; # corrected bases: 0; # recorrected bases: 0
[M::hamt_assemble] size of buffer: 0.132GB
[M::ha_pt_gen::470.852*6.04] ==> counted 131777979 distinct minimizer k-mers
[M::ha_pt_gen] count[16383] = 0 (for sanity check)
[M::ha_analyze_count] lowest: count[16383] = 0
tot_cnt=59765
tot_pos=59765
[M::ha_pt_gen::506.590*6.28] ==> indexed 59765 positions
[M::hamt_assemble::514.866*6.69@16.432GB] ==> corrected reads for round 2
[M::hamt_assemble] # bases: 4957619989; # corrected bases: 0; # recorrected bases: 0
[M::hamt_assemble] size of buffer: 0.132GB
[M::ha_pt_gen::559.852*6.81] ==> counted 131777979 distinct minimizer k-mers
[M::ha_pt_gen] count[16383] = 0 (for sanity check)
[M::ha_analyze_count] lowest: count[16383] = 0
tot_cnt=59765
tot_pos=59765
[M::ha_pt_gen::597.090*6.98] ==> indexed 59765 positions
[M::hamt_assemble::606.630*7.37@16.432GB] ==> corrected reads for round 3
[M::hamt_assemble] # bases: 4957619989; # corrected bases: 0; # recorrected bases: 0
[M::hamt_assemble] size of buffer: 0.132GB
[M::ha_pt_gen::643.258*7.55] ==> counted 131777979 distinct minimizer k-mers
[M::ha_pt_gen] count[16383] = 0 (for sanity check)
[M::ha_analyze_count] lowest: count[16383] = 0
tot_cnt=59765
tot_pos=59765
[M::ha_pt_gen::674.827*7.68] ==> indexed 59765 positions
[M::hamt_assemble::683.525*7.98@16.432GB] ==> found overlaps for the final round
[M::ha_print_ovlp_stat] # overlaps: 0
[M::ha_print_ovlp_stat] # strong overlaps: 0
[M::ha_print_ovlp_stat] # weak overlaps: 0
[M::ha_print_ovlp_stat] # exact overlaps: 0
[M::ha_print_ovlp_stat] # inexact overlaps: 0
[M::ha_print_ovlp_stat] # overlaps without large indels: 0
[M::ha_print_ovlp_stat] # reverse overlaps: 0
[M::hist_readlength] <1.0k:
[M::hist_readlength] 1.0k: ]]]]]]]]
[M::hist_readlength] 1.5k: ]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]
[M::hist_readlength] 2.0k: ]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]
[M::hist_readlength] 2.5k: ]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]
[M::hist_readlength] 3.0k: ]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]
[M::hist_readlength] 3.5k: ]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]
[M::hist_readlength] 4.0k: ]]]]]]]]]]]]]]]]]]]]]]]
[M::hist_readlength] 4.5k: ]]]]]]]]]]]]]
[M::hist_readlength] 5.0k: ]]]]]]]
[M::hist_readlength] 5.5k: ]]]]
[M::hist_readlength] 6.0k: ]]
[M::hist_readlength] 6.5k: ]
[M::hist_readlength] 7.0k: ]
[M::hist_readlength] 7.5k: ]
[M::hist_readlength] 8.0k: ]
[M::hist_readlength] 8.5k: ]
[M::hist_readlength] 9.0k: ]
[M::hist_readlength] 9.5k: ]
[M::hist_readlength] 10.0k: ]
[M::hist_readlength] 10.5k: ]
[M::hist_readlength] 11.0k: ]
[M::hist_readlength] 11.5k: ]
[M::hist_readlength] >50.0k: 0
Writing reads to disk...
wrote cmd of length 323: version=0.13-r308, CMD= hifiasm_meta -o cami_0.hifiasm_meta.out -t 32 /database/openstack.cebitec.uni-bielefeld.de/swift/v1/CAMI_Urogenital_tract/pacbio/2018.01.23_14.08.31_sample_0/reads/anonymous_reads.fq.gz
Bin file was created on Wed Dec 30 15:31:02 2020
Hifiasm_meta 0.1-r022 (hifiasm code base 0.13-r308).
Reads has been written.
[hamt::write_All_reads] Writing per-read coverage info...
[hamt::write_All_reads] Finished writing.
Writing ma_hit_ts to disk...
ma_hit_ts has been written.
Writing ma_hit_ts to disk...
ma_hit_ts has been written.
bin files have been written.
Writing raw unitig GFA to disk...
[M::hamt_output_unitig_graph_advance] Writing GFA...
[M::hamt_output_unitig_graph_advance] Writing GFA...
[M::hamt_output_unitig_graph_advance] Writing GFA...
Inconsistency threshold for low-quality regions in BED files: 70%
Writing debug asg to disk...
[M::write_debug_assembly_graph] took 0.02s

[M::main] Hifiasm code base version: 0.13-r308
[M::main] Hifiasm_meta version: 0.1-r022
[M::main] CMD: hifiasm_meta -o cami_0.hifiasm_meta.out -t 32 /database/openstack.cebitec.uni-bielefeld.de/swift/v1/CAMI_Urogenital_tract/pacbio/2018.01.23_14.08.31_sample_0/reads/anonymous_reads.fq.gz
[M::main] Real time: 691.048 sec; CPU: 5463.747 sec; Peak RSS: 16.432 GB

Output:

$ ll
.rw-r--r-- zhujie 2782    0 B  Wed Dec 30 15:31:07 2020 cami_0.hifiasm_meta.out.a_ctg.gfa
.rw-r--r-- zhujie 2782    0 B  Wed Dec 30 15:31:07 2020 cami_0.hifiasm_meta.out.a_ctg.noseq.gfa
.rw-r--r-- zhujie 2782    0 B  Wed Dec 30 15:31:07 2020 cami_0.hifiasm_meta.out.dbg_asg
.rw-r--r-- zhujie 2782  1.2 GB Wed Dec 30 15:31:04 2020 cami_0.hifiasm_meta.out.ec.bin
.rw-r--r-- zhujie 2782 38.2 MB Wed Dec 30 15:31:04 2020 cami_0.hifiasm_meta.out.ec.mt.bin
.rw-r--r-- zhujie 2782  6.7 MB Wed Dec 30 15:31:00 2020 cami_0.hifiasm_meta.out.ovecinfo.bin
.rw-r--r-- zhujie 2782  9.5 MB Wed Dec 30 15:31:04 2020 cami_0.hifiasm_meta.out.ovlp.reverse.bin
.rw-r--r-- zhujie 2782  9.5 MB Wed Dec 30 15:31:04 2020 cami_0.hifiasm_meta.out.ovlp.source.bin
.rw-r--r-- zhujie 2782    0 B  Wed Dec 30 15:31:07 2020 cami_0.hifiasm_meta.out.p_ctg.gfa
.rw-r--r-- zhujie 2782    0 B  Wed Dec 30 15:31:07 2020 cami_0.hifiasm_meta.out.p_ctg.noseq.gfa
.rw-r--r-- zhujie 2782    0 B  Wed Dec 30 15:31:07 2020 cami_0.hifiasm_meta.out.p_utg.gfa
.rw-r--r-- zhujie 2782    0 B  Wed Dec 30 15:31:07 2020 cami_0.hifiasm_meta.out.p_utg.noseq.gfa
.rw-r--r-- zhujie 2782    0 B  Wed Dec 30 15:31:07 2020 cami_0.hifiasm_meta.out.r_utg.gfa
.rw-r--r-- zhujie 2782    0 B  Wed Dec 30 15:31:07 2020 cami_0.hifiasm_meta.out.r_utg.lowQ.bed
.rw-r--r-- zhujie 2782    0 B  Wed Dec 30 15:31:07 2020 cami_0.hifiasm_meta.out.r_utg.noseq.gfa

All of the GFA files have size zero.

Any help? Thanks ~


simroux commented Dec 30, 2020

I see the same behavior with a "real" (i.e. non-simulated) metagenomic PacBio library. All the "*.bin" files seem to be correctly written, but all the other files are empty. I also see the same absence of overlaps:

[M::hamt_assemble::4567.208*11.17@17.077GB] ==> found overlaps for the final round
[M::ha_print_ovlp_stat] # overlaps: 0
[M::ha_print_ovlp_stat] # strong overlaps: 0
[M::ha_print_ovlp_stat] # weak overlaps: 0
[M::ha_print_ovlp_stat] # exact overlaps: 0
[M::ha_print_ovlp_stat] # inexact overlaps: 0
[M::ha_print_ovlp_stat] # overlaps without large indels: 0
[M::ha_print_ovlp_stat] # reverse overlaps: 0

But other assemblers are able to generate contigs from the same library, so I would expect some overlaps to be found?

xfengnefx (Owner) commented Jan 3, 2021

@alienzj Sorry for the late reply. It's weird that you didn't get "ha_hist_line" lines (I just did a fresh clone + test run, which was sane). This step should be identical to the stable hifiasm (since no read selection was involved), and local test runs were all sane, so I'm not sure where it went wrong. Could you try assembling with the stable hifiasm and see if the overlap stats look right?

xfengnefx (Owner) commented:

@simroux Sorry for the late reply. What's the full STDERR output? Are there "ha_hist_line" lines? If there are none, could you try assembling with the stable hifiasm? If hifiasm prints ha_hist_line lines and reports overlaps, then it's my bug. I think the issue you encountered is similar to @alienzj's, but I'm not sure why yet, since local test runs have been alright...


simroux commented Jan 3, 2021

@xfengnefx : I do see the "ha_hist_line" lines, but I'm not sure whether they reflect overlaps or not. I attached the full log; hopefully it will help clarify what happens!
asm.log

xfengnefx (Owner) commented:

@simroux Thanks for the log; then it's probably a different problem from the one in the main post of this thread. The read-length distribution looks unusual. Is this a PacBio HiFi dataset, or HiFi mixed with other libraries? I think we usually expect HiFi reads to fall within the 5 kb~20 kb range; 1 kb seems very short.
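For reference, a read-length distribution like the hist_readlength block above can also be checked directly from the FASTQ, independently of the assembler. The following is a minimal sketch, assuming standard 4-line FASTQ records (optionally gzipped); the helper name and binning are illustrative, not part of hifiasm:

import gzip
from collections import Counter

def read_length_hist(fq_path, bin_bp=500):
    """Print a read-length histogram binned every bin_bp bases."""
    opener = gzip.open if fq_path.endswith(".gz") else open
    bins = Counter()
    with opener(fq_path, "rt") as fh:
        for i, line in enumerate(fh):
            if i % 4 == 1:  # the sequence line of each 4-line record
                bins[len(line.strip()) // bin_bp * bin_bp] += 1
    for start in sorted(bins):
        print(f"{start / 1000:.1f}k: {bins[start]}")

read_length_hist("anonymous_reads.fq.gz")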


simroux commented Jan 4, 2021

@xfengnefx Pretty sure these are HiFi reads (I will get confirmation tomorrow). We typically aim for the ~5-20 kb range, but real environmental samples being what they are, we often end up with a fair number of (very) short reads.

xfengnefx (Owner) commented:

@simroux Thank you, I appreciate it!

Anyway, it's weird to see no overlaps at all. What's the file size of *ovecinfo.bin? That file logs all overlaps ever calculated, regardless of whether they were thrown away later on. We've seen hifiasm_meta fail on two real private samples, but that produced fragmented contigs, not zero overlaps. And another assembler also failed on those samples.

I'm not sure why; let me discuss this with others tomorrow and see what they think.


simroux commented Jan 4, 2021

@xfengnefx bin files seem relatively small to me:
1.3G Dec 29 19:55 asm.ec.bin
41M Dec 29 19:55 asm.ec.mt.bin
21M Dec 29 19:55 asm.ovecinfo.bin
11M Dec 29 19:55 asm.ovlp.reverse.bin
11M Dec 29 19:55 asm.ovlp.source.bin

We're not in a rush, but we're curious to know whether this is an issue with our input file.

alienzj closed this as completed Jan 4, 2021
alienzj reopened this Jan 4, 2021

alienzj (Author) commented Jan 4, 2021

> @alienzj Sorry for the late reply. It's weird that you didn't get "ha_hist_line" lines (I just did a fresh clone + test run, which was sane). This step should be identical to the stable hifiasm (since no read selection was involved), and local test runs were all sane, so I'm not sure where it went wrong. Could you try assembling with the stable hifiasm and see if the overlap stats look right?

Hi, I tested the CAMI data (not HiFi) with stable hifiasm (not hifiasm_meta).

$ hifiasm -o test_hifiasm/cami_0.hifiasm.out -t 32 /zfssz3/pub/database/openstack.cebitec.uni-bielefeld.de/swift/v1/CAMI_Urogenital_tract/pacbio/2018.01.23_14.08.31_sample_0/reads/anonymous_reads.fq.gz

[M::ha_analyze_count] lowest: count[4095] = 0
[M::ha_ft_gen::483.745*3.69@16.369GB] ==> filtered out 1384687 k-mers occurring -5 or more times
[M::ha_opt_update_cov] updated max_n_chain to 100
[M::ha_pt_gen::662.355*4.53] ==> counted 131818714 distinct minimizer k-mers
[M::ha_pt_gen] count[4095] = 0 (for sanity check)
[M::ha_analyze_count] lowest: count[4095] = 0
[M::ha_pt_gen::713.792*5.53] ==> indexed 0 positions
[M::ha_assemble::742.572*6.55@16.369GB] ==> corrected reads for round 1
[M::ha_assemble] # bases: 5000000696; # corrected bases: 0; # recorrected bases: 0
[M::ha_assemble] size of buffer: 0.131GB
[M::ha_pt_gen::797.425*7.35] ==> counted 131818714 distinct minimizer k-mers
[M::ha_pt_gen] count[4095] = 0 (for sanity check)
[M::ha_analyze_count] lowest: count[4095] = 0
[M::ha_pt_gen::849.905*8.01] ==> indexed 0 positions
[M::ha_assemble::880.619*8.84@16.369GB] ==> corrected reads for round 2
[M::ha_assemble] # bases: 5000000696; # corrected bases: 0; # recorrected bases: 0
[M::ha_assemble] size of buffer: 0.131GB
[M::ha_pt_gen::934.191*9.41] ==> counted 131818714 distinct minimizer k-mers
[M::ha_pt_gen] count[4095] = 0 (for sanity check)
[M::ha_analyze_count] lowest: count[4095] = 0
[M::ha_pt_gen::984.778*9.89] ==> indexed 0 positions
[M::ha_assemble::1012.330*10.48@16.369GB] ==> corrected reads for round 3
[M::ha_assemble] # bases: 5000000696; # corrected bases: 0; # recorrected bases: 0
[M::ha_assemble] size of buffer: 0.131GB
[M::ha_pt_gen::1066.756*10.89] ==> counted 131818714 distinct minimizer k-mers
[M::ha_pt_gen] count[4095] = 0 (for sanity check)
[M::ha_analyze_count] lowest: count[4095] = 0
[M::ha_pt_gen::1117.322*11.24] ==> indexed 0 positions
[M::ha_assemble::1143.542*11.71@16.369GB] ==> found overlaps for the final round
[M::ha_print_ovlp_stat] # overlaps: 0
[M::ha_print_ovlp_stat] # strong overlaps: 0
[M::ha_print_ovlp_stat] # weak overlaps: 0
[M::ha_print_ovlp_stat] # exact overlaps: 0
[M::ha_print_ovlp_stat] # inexact overlaps: 0
[M::ha_print_ovlp_stat] # overlaps without large indels: 0
[M::ha_print_ovlp_stat] # reverse overlaps: 0
Writing reads to disk...
Reads has been written.
Writing ma_hit_ts to disk...
ma_hit_ts has been written.
Writing ma_hit_ts to disk...
ma_hit_ts has been written.
bin files have been written.
Writing raw unitig GFA to disk...
Writing processed unitig GFA to disk...
[M::purge_dups] purge duplication coverage threshold: -1
[M::purge_dups] purge duplication coverage threshold: -1
[M::adjust_utg_by_primary] primary contig coverage range: [0, infinity]
Writing primary contig GFA to disk...
Writing alternate contig GFA to disk...
[M::main] Version: 0.13-r307
[M::main] CMD: hifiasm -o test_hifiasm/cami_0.hifiasm.out -t 32 /zfssz3/pub/database/openstack.cebitec.uni-bielefeld.de/swift/v1/CAMI_Urogenital_tract/pacbio/2018.01.23_14.08.31_sample_0/reads/anonymous_reads.fq.gz
[M::main] Real time: 1154.208 sec; CPU: 13399.291 sec; Peak RSS: 16.369 GB

There are no "ha_hist_line" lines that I can see.
Maybe non-HiFi reads do not meet the requirements of hifiasm.

Thanks.
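
For what it's worth, one quick way to tell HiFi from non-HiFi input is per-read base quality: HiFi/ccs reads are typically around Q20 or higher on average, while simulated CLR-like reads (e.g. from CAMISIM's PacBio model) are much lower. A minimal sketch, assuming Phred+33 quality strings in a 4-line FASTQ (the helper and file name are illustrative):

import gzip

def mean_base_quality(fq_path, max_reads=1000):
    """Average Phred+33 base quality over the first max_reads records."""
    opener = gzip.open if fq_path.endswith(".gz") else open
    total = bases = reads = 0
    with opener(fq_path, "rt") as fh:
        for i, line in enumerate(fh):
            if i % 4 == 3:  # the quality line of each 4-line record
                qual = line.strip()
                total += sum(ord(c) - 33 for c in qual)
                bases += len(qual)
                reads += 1
                if reads >= max_reads:
                    break
    return total / bases  # HiFi data should average roughly Q20 or above

print(mean_base_quality("anonymous_reads.fq.gz"))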

xfengnefx (Owner) commented:

@alienzj Thanks for the confirmation. Looks like it's hifiasm's behavior.


simroux commented Jan 5, 2021

@xfengnefx : Quick update: after double-checking, it turned out that my previous input file was a mix of ccs and non-ccs reads. I have now re-run hifiasm-meta on the same data with only the ccs reads, and this gave me a nice assembly (no complete circular chromosomes, sadly, but definitely on par with or better than our previous attempt with the same reads). So this was a user problem, i.e. I should have read more carefully that the input really *must* be ccs reads, not just any PacBio reads :-)
Thanks for the quick replies and the help with troubleshooting my mistake!
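
In case anyone else hits this with mixed input: PacBio ccs reads conventionally have names ending in "/ccs", so a crude pre-filter can be done on read names. Below is a minimal sketch under that naming assumption (file names are illustrative; extracting ccs reads upstream with PacBio's own tooling is the more robust route):

import gzip

def keep_ccs_reads(fq_in, fq_out):
    """Copy only 4-line FASTQ records whose names end in '/ccs'."""
    opener = gzip.open if fq_in.endswith(".gz") else open
    with opener(fq_in, "rt") as fin, open(fq_out, "w") as fout:
        while True:
            record = [fin.readline() for _ in range(4)]
            if not record[0]:  # end of file
                break
            if record[0].split()[0].endswith("/ccs"):
                fout.writelines(record)

keep_ccs_reads("mixed_reads.fq.gz", "ccs_only.fq")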

xfengnefx (Owner) commented:

@simroux Great, glad to know it's not an unknown bug on my side; thank you for the confirmation! I guess that if non-ccs reads are longer than ccs reads, hifiasm's containment-removal step might happen to favor the longer non-ccs ones (whose overlaps are then discarded because of quality). The meta fork has heuristics for HiFi low-coverage + het corner cases, but probably can't help with non-ccs reads. That's just my guess, though...

Regarding assembly performance, do you expect low coverage in this dataset? Is it mostly plasmids or very short circular genomes? We saw one case of low coverage and one case of shared sequence causing trouble in the mock datasets, and I'm currently working on a fix for plasmids. The heuristics need more test data to improve, however. We would definitely appreciate it if you are happy to share data for development. Thanks!


simroux commented Jan 5, 2021

@xfengnefx I think it's mostly low coverage and some strain variation (Bandage shows a large blob of ~50 Mb that looks like bacterial genomes, but coverage is "only" ~15x, so not super great). The data are currently unpublished, so I don't think we can share them right now, but I'll follow up as soon as that becomes possible. We are also moving forward with further tests, given the promising results on this first dataset.

xfengnefx (Owner) commented:

@simroux I've seen the blob thing in one private dataset we have access to (the README only shows the sheep dataset because the private one was shared for internal development only). It's probably strains plus horizontal gene transfer. Among a few datasets we don't have direct access to, some did much better than others, and the library preparation approach also seems to affect the outcome. There's a work-in-progress fix; I'll push it to the repo if it works out.

Closing this issue for now, but please feel free to follow up or drop a new one anytime. Thank you!
