GFA file size issue #2
Comments
I see the same behavior with a "real" (i.e. non-simulated) metagenomics PacBio library. All the "*.bin" files seem to be correctly written, but all the other files are empty. I also see the same absence of overlaps:
But other assemblers are able to generate contigs from the same library, so I would expect some overlaps to be found?
@alienzj Sorry for the late reply. It's weird that you didn't get "ha_hist_line" lines (I just did a fresh clone + test run, which was sane). This step should be identical to the stable hifiasm (since no read selection was involved), and local test runs were all sane, so I'm not sure where it went wrong. Could you try assembling with the stable hifiasm and see if the overlap states look right?
@simroux Sorry for the late reply. What's the full STDERR output? Are there "ha_hist_line" lines? If there are none, could you try assembling with the stable hifiasm? If hifiasm prints out ha_hist_line lines and reports overlaps, then it's my bug. I think the issue you encountered is similar to @alienzj's, but I'm not sure why yet, since local test runs have been alright...
@xfengnefx: I do see the "ha_hist_line" lines, but I'm not sure if they reflect overlaps or not. I attached the full log, hopefully it will help clarify what happens!
@simroux Thanks for the log. In that case it's probably a different problem from the one in the main post of this thread. The read length distribution looks unusual. Is this a PacBio HiFi dataset, or HiFi mixed with other libraries? I think we usually expect HiFi reads to fall within the 5kb~20kb range; 1kb seems very short.
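For anyone wanting to check their own input against the 5kb~20kb expectation mentioned above, here is a minimal sketch of a read length summary. It assumes a plain 4-line FASTQ file (optionally gzipped); the function names and the 1 kb bin size are illustrative choices, not part of hifiasm or hifiasm-meta.

```python
import gzip
from collections import Counter


def read_lengths(path):
    """Yield the sequence length of every record in a 4-line FASTQ file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:  # second line of each record is the sequence
                yield len(line.strip())


def bin_lengths(lengths, bin_size=1000):
    """Count reads per length bin; each key is the lower bound of its bin."""
    return Counter((length // bin_size) * bin_size for length in lengths)


def fraction_in_range(lengths, low=5000, high=20000):
    """Fraction of reads whose length lies within [low, high]."""
    lengths = list(lengths)
    if not lengths:
        return 0.0
    return sum(low <= n <= high for n in lengths) / len(lengths)
```

A large share of reads far below the lower bound, or a second mode at much longer lengths, would be a hint that the file is not purely HiFi.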
@xfengnefx Pretty sure these are HiFi reads (will get confirmation tomorrow). We typically aim for the ~5-20 kb range, but real environmental samples being what they are, we often end up with a fair number of (very) short reads.
@simroux Thank you, I appreciate it! Anyway, it's weird to see no overlaps at all. What's the file size of *ovecinfo.bin? This file logs all overlaps ever calculated, regardless of whether they were thrown away later on. We've seen hifiasm_meta fail on two real private samples, but that was fragmented contigs, not a complete lack of overlaps. Another assembler also failed on those samples. I'm not sure why; let me discuss this with others tomorrow and see what they think.
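A quick way to report the sizes asked about here, and to spot empty GFA files, could look like the sketch below. The output prefix (for example `asm`) is hypothetical, and the helper simply lists whatever files share that prefix rather than assuming any particular set of output names.

```python
from pathlib import Path


def output_sizes(prefix):
    """Map each file whose name starts with `prefix` to its size in bytes."""
    prefix = Path(prefix)
    return {
        path.name: path.stat().st_size
        for path in sorted(prefix.parent.glob(prefix.name + "*"))
        if path.is_file()
    }


def empty_outputs(sizes, suffix=".gfa"):
    """Return the names of files with the given suffix that are zero bytes."""
    return [name for name, size in sizes.items() if name.endswith(suffix) and size == 0]
```

For example, `empty_outputs(output_sizes("results/asm"))` would list the zero-byte GFA files written under the hypothetical `results/asm` prefix.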
@xfengnefx bin files seem relatively small to me: We're not in a rush, but curious to know if this is an issue with our input file. |
Hi, I tested the CAMI data (not HiFi) with the stable hifiasm (not hifiasm_meta).
There are no "ha_hist_line" lines that I can see. Thanks.
@alienzj Thanks for the confirmation. Looks like it's hifiasm's behavior.
@xfengnefx: Quick update: after I double checked, it turned out that my previous input file was a mix of ccs and non-ccs reads. I have now re-run hifiasm-meta on the same data but with only the ccs reads, and this gave me a nice assembly (no complete circular chromosomes, sadly, but definitely on par with or better than our previous attempt with the same reads). So this was a user problem, i.e. I should have read more carefully that the input really *must* be ccs reads, not just any PacBio reads :-)
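One way to catch this kind of mixed input before running the assembler is to count reads by name. The sketch below assumes the common PacBio naming convention in which CCS read names end in `/ccs` while subread names end in a `start_end` coordinate pair; that convention should be verified against your own files, and the function names are illustrative only.

```python
from itertools import islice


def is_ccs_name(header):
    """True if a FASTQ header's read name ends in '/ccs' (assumed convention)."""
    fields = header.lstrip("@").split()
    return bool(fields) and fields[0].endswith("/ccs")


def fastq_records(handle):
    """Yield (header, sequence, plus, quality) tuples from a 4-line FASTQ handle."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if len(record) < 4:
            return
        yield tuple(record)


def count_by_type(handle):
    """Count records whose names look like CCS reads versus everything else."""
    counts = {"ccs": 0, "other": 0}
    for header, _seq, _plus, _qual in fastq_records(handle):
        counts["ccs" if is_ccs_name(header) else "other"] += 1
    return counts
```

A non-zero `other` count would suggest that the file contains reads other than CCS and should be filtered before assembly.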
@simroux Great, glad to know it's not an unknown bug on my side, thank you for the confirmation! I guess if non-ccs reads are longer than ccs reads, hifiasm's containment removal step might happen to favor the longer non-ccs ones (and overlaps between them are then discarded because of quality). The meta fork has heuristics for HiFi low coverage + het corner cases, but probably can't help with non-ccs reads. That's just my guess... As for the assembly performance, is the dataset expected to have low coverage? Is it mostly plasmids or very short circular genomes? We saw one case of low coverage and one case of shared sequence causing trouble in the mock datasets, and I'm currently fixing plasmids. The heuristic needs more test data to improve, however. We would definitely appreciate it if you are happy to share data for development. Thanks!
@xfengnefx I think it's mostly low coverage and some strain variation (Bandage shows a large blob of ~50Mb which looks like bacterial genomes, but coverage is "only" ~15x, so not super great). The data is currently unpublished so I don't think we can share it right now, but I'll follow up as soon as this becomes possible, and we are also moving forward with further tests given the promising results on this first dataset.
@simroux I've seen the blob thing in one private dataset we have access to (the readme only showed the sheep because the private one was shared for internal dev only). It's probably strains + horizontal gene transfer. Among a few datasets we don't have direct access to, some did much better than others, and the library preparation approach also seems to affect the outcome. There's a work-in-progress fix; I'll push it to the repo if it works out. Closing this issue for now, but please feel free to follow up or open a new one anytime. Thank you!
Hi, I used hifiasm-meta to assemble urogenital tract metagenomics data from CAMI.
This data was simulated by CAMISIM, with an average read length of 3,000 bp and a read length s.d. of 1,000 bp.
Run log:
Output:
All GFA files have a size of zero.
Any help? Thanks ~