Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segfault on fastq data #9

Closed
KGerhardt opened this issue Sep 13, 2020 · 9 comments
Closed

Segfault on fastq data #9

KGerhardt opened this issue Sep 13, 2020 · 9 comments

Comments

@KGerhardt
Copy link

I've encountered a segfault occurring with some data I was trimming. Falco works fine on the untrimmed data, but fails on the post-trim results.

I was trying to narrow down the break to a single read, but I've encountered a strange result - the data attached is a collection of 16 reads, broken into two groupings of 8. Oddly, Falco works on either grouping of 8, but not on the concatenated 16. Even more strangely, it works on all 16 reads when I change the ordering of the reads in the file.

segfault_fq.zip

@guilhermesena1
Copy link
Collaborator

guilhermesena1 commented Sep 14, 2020

Hello,

Thank you so much for reporting the bug and taking the time to create a small test set that allows us to reproduce the segfault. With that I was able to find a problem in datasets with uneven read length distribution: When a longer read is processed, sometimes the tile matrix was not long enough to store the tile quality data for the new read. Because this is a very serious problem, I have decided to add a new release for falco that addresses this fix + some of the fixes brought up by previous issues. If at all possible, I'd truly appreciate it if you are able to test the new release and follow up on whether you are able to run it in your full test dataset.

The new release can be found in the releases page.

Thank you in advance!

@KGerhardt
Copy link
Author

Happy to have helped diagnose a problem. I will be testing it on the dataset that originally produced the error I uploaded and many more. I will follow up with results.

Thank you for being extremely punctual in addressing this issue.

@KGerhardt
Copy link
Author

I've tested the new release on the same dataset. It completes the QC for the data. However, there is an issue with one of the HTML reports. The data output text file looks correct to me, but the HTML file doesn't display the visual component of most of the reports, only the grade. I am attaching the data and the report generated from it.

broken_html.zip

As an aside - there are a few quality of life issues I've noticed.

  1. Supplying multiple samples is accepted, but they all share the same output name. Consequently, each samples' report is generated in the output directory and is subsequently overwritten by the next report so that only the last file remains.

  2. The threads argument doesn't appear functional. It neither increases the speed of processing per file, nor simultaneously processes multiple files in the manner that fastqc does. Is this argument simply here to allow falco to be a drop-in replacement for fastqc?

Thank you,

Kenji

@guilhermesena1
Copy link
Collaborator

Hi Kenji,

My sincerest apologies for not reaching out sooner. I was unable to reproduce the error on several fastqs, but I know from the report what is breaking it: Your fastq has varying read lengths, and some tiles don't have reads that reach a certain length, which was causing a division by zero that broke the html. I pushed a fix to the main repository if you are still interested, and I'll try to create some examples that may help me guarantee the issue has been addressed.

Thank you as well for bringing the other two issues to my attention. The threads argument is a dummy argument for compatibility with fastqc, which is currently not being used. Other users brought this up and I understand the confusion, so I will update the default message to emphasize that this flag is currently being ignored, and create another issue for the multiple output case, which indeed is not faithfully replicating fastqc behavior.

@KGerhardt KGerhardt reopened this Oct 20, 2020
@KGerhardt
Copy link
Author

Oh, no problem. I was going to open another issue to specifically address the html report issue, but a fix ahead of time works, too. And I have the datasets that were generating the problem here, too, so I'll attach that,
1.trimmed_GEM1001-T0-VM1.1.fq.gz
2.trimmed_GEM1001-T0-VM1.2.fq.gz

Thank you for addressing these, I appreciate it.

@guilhermesena1
Copy link
Collaborator

Perfect! Using this I think I can confirm that the commit I referenced above should displays the html and the output is consistent with fastqc, with empty tiles showing "average" (0 deviations from the mean) tile quality. I also pushed fixes to the outputs overriding each other when multiple files are provided and a temporary warning on the readme that the threads flag is currently ignored. Thank you once again for the very detailed feedback on the problems and for providing the data that made debugging so much simpler.

@KGerhardt
Copy link
Author

I'm more than happy to help. I'm with the Konstantinidis lab at GT, and we're in the process of replacing fastqc in one our software developments with Falco, so we've been testing it on a smorgasbord of data and situations.

I do have a quick question - I've written a wrapper script around Falco for one of my own projects, and it relies on the naming structure that Falco previously had. If a user only submits one file to Falco, will the resulting names be the same as in previous versions? Or is there a new naming convention regardless of how many files are given to Falco at once?

@guilhermesena1
Copy link
Collaborator

guilhermesena1 commented Oct 22, 2020

Yes, when a single file is provided the names will be the same as before, so if you run

falco -o foo bar.fq

it will generate foo/fastqc_data.txt, foo/fastqc_report.html and foo/summary.txt

If the user runs

falco -o foo bar1.fq bar2.fq

it will generate foo/bar1.fq_fastqc_data.txt and foo/bar2.fq_fastqc_data.txt, and the same goes for the other two outputs (summary.txt and fastqc_report.html).

@KGerhardt
Copy link
Author

Great! Then my script should continue working with the new version.

Thank you for your work and responsiveness on all of this. I really appreciate it. Looking forward to testing the new version when it's ready for release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants