Segfault on fastq data #9

KGerhardt · 2020-09-13T17:40:02Z

I've encountered a segfault occurring with some data I was trimming. Falco works fine on the untrimmed data, but fails on the post-trim results.

I was trying to narrow down the break to a single read, but I've encountered a strange result - the data attached is a collection of 16 reads, broken into two groupings of 8. Oddly, Falco works on either grouping of 8, but not on the concatenated 16. Even more strangely, it works on all 16 reads when I change the ordering of the reads in the file.

segfault_fq.zip

guilhermesena1 · 2020-09-14T02:33:27Z

Hello,

Thank you so much for reporting the bug and taking the time to create a small test set that allows us to reproduce the segfault. With that I was able to find a problem in datasets with uneven read length distribution: When a longer read is processed, sometimes the tile matrix was not long enough to store the tile quality data for the new read. Because this is a very serious problem, I have decided to add a new release for falco that addresses this fix + some of the fixes brought up by previous issues. If at all possible, I'd truly appreciate it if you are able to test the new release and follow up on whether you are able to run it in your full test dataset.

The new release can be found in the releases page.

Thank you in advance!

KGerhardt · 2020-09-14T02:57:31Z

Happy to have helped diagnose a problem. I will be testing it on the dataset that originally produced the error I uploaded and many more. I will follow up with results.

Thank you for being extremely punctual in addressing this issue.

KGerhardt · 2020-09-14T16:48:16Z

I've tested the new release on the same dataset. It completes the QC for the data. However, there is an issue with one of the HTML reports. The data output text file looks correct to me, but the HTML file doesn't display the visual component of most of the reports, only the grade. I am attaching the data and the report generated from it.

broken_html.zip

As an aside - there are a few quality of life issues I've noticed.

Supplying multiple samples is accepted, but they all share the same output name. Consequently, each samples' report is generated in the output directory and is subsequently overwritten by the next report so that only the last file remains.
The threads argument doesn't appear functional. It neither increases the speed of processing per file, nor simultaneously processes multiple files in the manner that fastqc does. Is this argument simply here to allow falco to be a drop-in replacement for fastqc?

Thank you,

Kenji

guilhermesena1 · 2020-10-20T05:11:13Z

Hi Kenji,

My sincerest apologies for not reaching out sooner. I was unable to reproduce the error on several fastqs, but I know from the report what is breaking it: Your fastq has varying read lengths, and some tiles don't have reads that reach a certain length, which was causing a division by zero that broke the html. I pushed a fix to the main repository if you are still interested, and I'll try to create some examples that may help me guarantee the issue has been addressed.

Thank you as well for bringing the other two issues to my attention. The threads argument is a dummy argument for compatibility with fastqc, which is currently not being used. Other users brought this up and I understand the confusion, so I will update the default message to emphasize that this flag is currently being ignored, and create another issue for the multiple output case, which indeed is not faithfully replicating fastqc behavior.

KGerhardt · 2020-10-20T13:23:23Z

Oh, no problem. I was going to open another issue to specifically address the html report issue, but a fix ahead of time works, too. And I have the datasets that were generating the problem here, too, so I'll attach that,
1.trimmed_GEM1001-T0-VM1.1.fq.gz
2.trimmed_GEM1001-T0-VM1.2.fq.gz

Thank you for addressing these, I appreciate it.

guilhermesena1 · 2020-10-22T18:05:02Z

Perfect! Using this I think I can confirm that the commit I referenced above should displays the html and the output is consistent with fastqc, with empty tiles showing "average" (0 deviations from the mean) tile quality. I also pushed fixes to the outputs overriding each other when multiple files are provided and a temporary warning on the readme that the threads flag is currently ignored. Thank you once again for the very detailed feedback on the problems and for providing the data that made debugging so much simpler.

KGerhardt · 2020-10-22T18:33:56Z

I'm more than happy to help. I'm with the Konstantinidis lab at GT, and we're in the process of replacing fastqc in one our software developments with Falco, so we've been testing it on a smorgasbord of data and situations.

I do have a quick question - I've written a wrapper script around Falco for one of my own projects, and it relies on the naming structure that Falco previously had. If a user only submits one file to Falco, will the resulting names be the same as in previous versions? Or is there a new naming convention regardless of how many files are given to Falco at once?

guilhermesena1 · 2020-10-22T18:40:23Z

Yes, when a single file is provided the names will be the same as before, so if you run

falco -o foo bar.fq

it will generate foo/fastqc_data.txt, foo/fastqc_report.html and foo/summary.txt

If the user runs

falco -o foo bar1.fq bar2.fq

it will generate foo/bar1.fq_fastqc_data.txt and foo/bar2.fq_fastqc_data.txt, and the same goes for the other two outputs (summary.txt and fastqc_report.html).

KGerhardt · 2020-10-22T18:45:28Z

Great! Then my script should continue working with the new version.

Thank you for your work and responsiveness on all of this. I really appreciate it. Looking forward to testing the new version when it's ready for release.

KGerhardt closed this as completed Oct 19, 2020

KGerhardt reopened this Oct 20, 2020

KGerhardt closed this as completed Oct 22, 2020

guilhermesena1 mentioned this issue Dec 22, 2020

Segmentation fault for FASTQ (merged reads from NovaSeq) #12

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Segfault on fastq data #9

Segfault on fastq data #9

KGerhardt commented Sep 13, 2020

guilhermesena1 commented Sep 14, 2020 •

edited

KGerhardt commented Sep 14, 2020

KGerhardt commented Sep 14, 2020

guilhermesena1 commented Oct 20, 2020

KGerhardt commented Oct 20, 2020

guilhermesena1 commented Oct 22, 2020

KGerhardt commented Oct 22, 2020

guilhermesena1 commented Oct 22, 2020 •

edited

KGerhardt commented Oct 22, 2020

Segfault on fastq data #9

Segfault on fastq data #9

Comments

KGerhardt commented Sep 13, 2020

guilhermesena1 commented Sep 14, 2020 • edited

KGerhardt commented Sep 14, 2020

KGerhardt commented Sep 14, 2020

guilhermesena1 commented Oct 20, 2020

KGerhardt commented Oct 20, 2020

guilhermesena1 commented Oct 22, 2020

KGerhardt commented Oct 22, 2020

guilhermesena1 commented Oct 22, 2020 • edited

KGerhardt commented Oct 22, 2020

guilhermesena1 commented Sep 14, 2020 •

edited

guilhermesena1 commented Oct 22, 2020 •

edited