Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

incorrectly detect non-extendible sequenced as adapter #2

Closed
zhaoc1 opened this issue Aug 3, 2020 · 10 comments
Closed

incorrectly detect non-extendible sequenced as adapter #2

zhaoc1 opened this issue Aug 3, 2020 · 10 comments

Comments

@zhaoc1
Copy link

zhaoc1 commented Aug 3, 2020

Hi,

I am trying out reseq with one wgs sequencing data (truseq) and got the following error message (using suggested command in the Readme):

>>> 03-08-20 18:02:28: info:  Detected non-extendible sequence 'GCTCTTCCG' as adapter for read 2
!!! T& reseq::utilities::at(seqan::String<TString>&, size_t) [with T = seqan::SimpleType<unsigned char, seqan::Dna_>; size_t = long unsigned int] [~/path/to/ReSeq/reseq/utilities.hpp:132]: error: Called position 9. String size is only 9.
terminate called after throwing an instance of 'std::out_of_range'
  what():  Accessing String after last position
Aborted (core dumped)

(1) the reads used to generate the bam file are already adapter-trimmed.
(2) GCTCTTCCG is just part of the reads, not adapter (as shown in the following screen shot)
image

Any idea about cause this error message? Thank you!

Best
Chunyu

@schmeing
Copy link
Owner

schmeing commented Aug 6, 2020

The automatic adapter detection is the problem, because it is already adapter trimmed. If possible, I strongly advise to use untrimmed datasets for training. It shouldn't produce the out_of_range error and I'll have a look if I spot the reason for it, but without adapters you need to turn off the automatic detection by specifying decoy adapters: --adapterFile TruSeq_single

@schmeing
Copy link
Owner

schmeing commented Oct 1, 2020

I could not reproduce the crash, but looking through the code I found a minor bug, that might in rare cases lead to this error. I pushed the bugfix to the devel branch. In case the crash persists, please reopen this issue providing a dataset that leads to it, so that I can reproduce and fix it.

@schmeing schmeing closed this as completed Oct 1, 2020
@cilliannolan
Copy link

Hi there,
I am facing this issue also.
If I turn off the automatic adapter detection using the --adapterFile flag, must the provided list contain the true adapters used? Or does it simply function to turn off that step?

@schmeing
Copy link
Owner

ReSeq uses the adapters for two things (independent whether you specified them with --adapterFile or they were automatically detected): 1. It searches for adapters in unmapped reads to continue the fragment length distribution in the region below the read length. 2. It uses the adapters to continue the simulated sequences if the fragment length is shorter than the read length.

So if you have adapter trimmed reads (which are not recommended but work) or large fragment length so that the adapter content is basically non-existent, you can specify whatever you want as long as it is not part of the reference. I suggested --adapterFile TruSeq_single, because those adapters are implemented in ReSeq and you don't have to create the .fa and .mat files yourself.

If you do have adapters and they are not detected by the automatic detection it is recommended to specify the correct ones. In this case I would be happy if you could share the dataset, so I could check why the automatic detection fails.

Please let me know if you have further questions or something is unclear.

@schmeing schmeing reopened this Oct 27, 2020
@cilliannolan
Copy link

Thank you very much for offering to take a look at the dataset. Here you can find the files with which I am having issues:

https://www.dropbox.com/sh/7w5mpsxvcs4ukkg/AADroiOBFN3WiOYNbuROHUYla?dl=0

And this is the command used:

reseq illuminaPE -b NA12878.sorted.cleaned.subsampled.bam -j 16 -r ucsc.hg19-GATKBundle.fa -1 reads_out_1 -2 reads_out_2 -c 40 --vcfIn HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf.gz --refBiasFile [REFBIASFILE] --noInDelErrors --noSubstitutionErrors

@schmeing
Copy link
Owner

schmeing commented Nov 3, 2020

Looking at the read length distribution with
samtools view -F 2304 NA12878.sorted.cleaned.subsampled.bam | awk '{print length($10)}' | sort | uniq -c | sort -k2,2nr
I see reads ranging from 151 down to 35 bases, so I strongly assume it is adapter trimmed.

Checking the unmapped reads for adapters (normally you observe adapters that ligate directly to each other) with the following command:
samtools view -F 2304 -f 4 NA12878.sorted.cleaned.subsampled.bam | awk '{print substr($10,1,30)}' | sort | uniq -c | sort -k1,1nr | head -n 30
gives 871 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNN as the top hit, which proves my adapter trimming hypothesis.

Finally, I checked the read parts that exceed the mapped fragment (assuming the forward-reverse order of reads):
samtools view -F 2316 NA12878.sorted.cleaned.subsampled.bam | awk '($7 == "="){rev=int($2%32/16);mrev=int($2%64/32); if(rev && !mrev){alen=$8-$4}; if(!rev && mrev){alen=$4-$8}; if(alen > 0){print substr($10,length($10)-alen+1,length($10))}}' | sort | uniq -c | sort -k1,1nr | head -n 30
I get a fairly random distribution of nucleotides, so mapping errors obscure all adapter remnants that might still be in the data.

This mean two things:

  1. There is no way for the automatic detection to work, because there are simply no adapters to be found
  2. You can savely use --adapterFile TruSeq_single to avoid the crash. You will have a minimal adapter presence of TruSeq adapters in you simulated data due to the mapping errors, but you can filter or remove those as explained in the Null Sequencing Error issue.

Finally a few remarks about your command:
-1 reads_out_1 -2 reads_out_2: The names are take as they are and no file endings are attached, so you might want to add .fq or similar to clarify that they are in fastq format.
Using only --vcfIn means you are not introducing any variants (except if you do something in your [REFBIASFILE]). For simulating variants you would need --vcfSim, but given that you also don't simulate any errors, no variants might be exactly what you want.
Finally, in your version you need to specify --refBias file so that --refBiasFile [REFBIASFILE] works. Since this is an unnecessary complication I just pushed a fix for this to the devel branch.

@schmeing
Copy link
Owner

schmeing commented Nov 3, 2020

Since the reference file was missing in your dropbox, I could not run ReSeq on the data. Does it crash (Aborted (core dumped)) or simply show the >>> 03-08-20 18:02:28: info: Detected non-extendible sequence 'GCTCTTCCG' as adapter for read 2 and exits?

@cilliannolan
Copy link

Thank you very much for examining the dataset.

When I run Reseq I get the second error info: Detected non-extendible sequence etc.. and the program exits.

"Using only --vcfIn means you are not introducing any variants (except if you do something in your [REFBIASFILE]). For simulating variants you would need --vcfSim, but given that you also don't simulate any errors, no variants might be exactly what you want." - You are correct no simulated variants is exactly what I want in this case.

I will rerun Reseq as you've specified and let you know if I run into any more issues.
Thanks again for all the assistance.

@cilliannolan
Copy link

cilliannolan commented Nov 3, 2020

Quick question, neither of the files present in Reseq/adapters/ (TruSeq_single.fa and TruSeq_single.mat) are named simply TruSeq_single, so is the command --adapterFile TruSeq_single enough? Or should I point it to one of these files specifically.

@schmeing
Copy link
Owner

schmeing commented Nov 3, 2020

Ok good, so it does not crash and exits as it is supposed to, when the automatic detection fails.
--adapterFile TruSeq_single is a shortcut for the adapters included in ReSeq. Alternatively you can specify the full path /full/path/Reseq/adapters/TruSeq_single.fa.
The parameter to define the .mat file is --adapterMatrix, but if they are named equally, as in this case, ReSeq automatically detects it.

@schmeing schmeing closed this as completed Dec 1, 2020
@schmeing schmeing mentioned this issue Dec 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants