tagseq processing not recognizing fastq headers #3
hi laura - easily solvable! can you please poke me tomorrow if i forget to reply? in short, "header" is not the read title, but the lead 5' portion of the read used for de-duplication. do you have those in your quant-seq?
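To illustrate the distinction, here is a minimal sketch of that kind of 5' leader check. It assumes a hypothetical leader of four degenerate bases followed by GG; the actual pattern tagseq_clipper.pl looks for may differ, so treat this only as a conceptual illustration of how a read can be counted as "header-less" even though its fastq title line is fine:

```python
import re

# Hypothetical leader: 4 degenerate bases followed by GG (an assumption
# for illustration -- check your own library prep scheme).
LEADER = re.compile(r"^([ACGTN]{4}GG)")

def split_leader(seq):
    """Return (leader, insert) if the read starts with the expected 5'
    leader, else (None, seq) -- such reads would be the ones reported
    as having 'no header'."""
    m = LEADER.match(seq)
    if m:
        return m.group(1), seq[m.end():]
    return None, seq

reads = ["ACGTGGTTTCAGGAC",   # starts with a valid leader
         "TTTTTTTTTTTTTTT"]   # no leader -> counted as header-less
print([split_leader(r) for r in reads])
```

Reads whose first bases fail the leader check are the ones the clipper reports as lacking a "header", regardless of what the @-title line in the fastq file says.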
… On 28 Apr 2020, at 22:01, Laura H Spencer ***@***.***> wrote:
I'm following your tagSeq_processing_README.txt protocol to trim and filter reads generated from QuantSeq libraries run on an Illumina NovaSeq platform this month. The output indicates that a very large portion of my reads do not have headers:
Upon inspection, the fastq files don't appear to lack headers, so I'm wondering whether the tagseq_clipper.pl script is looking for a different header format? My headers are in the following format:
Here are abbreviated versions of an untrimmed file and of the trimmed file showing reads that passed the tagseq_clipper.pl script:
example_files.zip
I'm admittedly unfamiliar with Perl scripts, so any help would be great.
Ah, good to know! The QuantSeq manual/FAQ doesn't indicate whether or not deduplication is necessary (below is a screenshot of their recommended trimming), but my data is single-read without UMIs, and from a couple of things I've read online, deduplication isn't recommended (or possible?) for this type of data. Let me know if you think otherwise! Recommended trimming according to QuantSeq's FAQ:
Hi Laura - my position is that deduplication is always needed, because otherwise your counts-based stats (like DESeq2) are not valid; plus it removes noise due to over-dispersion of amplified counts. That said, if you don't have the means to deduplicate, you have no choice. Fortunately, it is still OK to publish stuff based on non-deduped data!
So why do you want to use the tagseq pipeline, if I may ask? There is really nothing special to it, except maybe deduplication :) What is the reference you are going to map to?
cheers
Misha
Hi Misha-
I used your pipeline back in fall 2018 on some pilot QuantSeq data, at the suggestion of a colleague. It worked well then, but I don't think you had incorporated deduplication yet (?). I will probably depart from your process a bit, now that I more fully understand what your pipeline is intended for. I will align data to the Olympia oyster (Ostrea lurida) genome, which my lab <https://faculty.washington.edu/sr320/> developed.
Regarding deduplication, that's interesting to know, and I'll definitely have to do more reading on the matter. I'm now wondering if there is a tool I can use to identify duplicates based on the read sequences themselves (i.e. identical sequences), despite not having paired data or molecular identifiers... if you know of any, please let me know! Thanks for all your help!
Hi Laura - I see!
If you map to a genome, my pipeline is really not too useful. Just use any mapper of your choice and then featureCounts to extract counts (you might wish to adjust your genome's GFF file to extend gene regions 1-2 kb towards the 3' end; otherwise, gene annotations often miss the non-coding 3' regions where our reads map).
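That GFF adjustment could look something like this (a rough, strand-aware sketch with hypothetical coordinates; it assumes a standard 9-column GFF and does not clip extensions at chromosome ends):

```python
def extend_gene_3prime(gff_line, ext=2000):
    """Extend a GFF 'gene' feature towards its 3' end, strand-aware.
    On the '+' strand the 3' end is the feature end; on '-' it is the start."""
    f = gff_line.rstrip("\n").split("\t")
    if f[2] != "gene":                       # only touch gene features
        return gff_line
    start, end, strand = int(f[3]), int(f[4]), f[6]
    if strand == "-":
        f[3] = str(max(1, start - ext))      # never run past position 1
    else:
        f[4] = str(end + ext)
    return "\t".join(f) + "\n"

line = "chr1\t.\tgene\t5000\t8000\t.\t+\t.\tID=gene1\n"
print(extend_gene_3prime(line), end="")  # '+' strand: end moves 8000 -> 10000
```

Applied over a whole GFF (one call per line), this shifts each gene's 3' boundary so that 3'-biased QuantSeq reads landing just past the annotated gene end still get counted by featureCounts.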
Yes, you can mark duplicates just based on reads, using the Picard tool. Still, since in quant-seq your reads will pile up in a relatively narrow region near the 3' end, there is a danger of over-deduplication (i.e., some reads might legitimately map to the same place because there is not much choice of where they could map). Check in the IGV viewer how your read pile-ups look.
(Both IGV and Picard are tools from the Broad Institute.)
cheers
Misha
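For intuition, collapsing exact-duplicate sequences (the simplest read-based deduplication, far cruder than what Picard MarkDuplicates does with mapped positions) can be sketched as:

```python
from collections import Counter

def dedup_reads(seqs):
    """Collapse identical read sequences; return (unique_seqs, n_duplicates).
    Toy illustration only -- real tools such as Picard MarkDuplicates work
    on mapped positions and tolerate sequencing errors, which this ignores."""
    counts = Counter(seqs)
    uniques = list(counts)                       # one representative each
    dups = sum(n - 1 for n in counts.values())   # extra copies flagged
    return uniques, dups

reads = ["ACGT", "ACGT", "ACGT", "TTGC"]
print(dedup_reads(reads))  # 2 unique sequences, 2 reads flagged as duplicates
```

As noted above, in QuantSeq data identical sequences can also come from genuinely distinct molecules piled up near the 3' end, so exact-match collapsing like this risks over-deduplication.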