MicrobeCensus crashes, "ValueError: Sequence and quality captions differ." #4

Closed
taltman opened this issue Apr 27, 2015 · 17 comments

@taltman

taltman commented Apr 27, 2015

I let MC decide the file type and the FASTQ quality score encoding, so not sure how this happened.

Any help in figuring this out would be great. Thanks!

taltman1@corn02:/dev/shm/taltman1_tmp/MicrobeCensus$ time run_microbe_census.py -n 40711 -l 500 -t 16 my.fastq test.out
Traceback (most recent call last):
  File "/afs/ir/users/t/a/taltman1/farmshare/third-party/bin/MicrobeCensus/MicrobeCensus-1.0.3/scripts/run_microbe_census.py", line 48, in <module>
    est_ags, args = microbe_census.run_pipeline(args)
  File "/afs/ir/users/t/a/taltman1/farmshare/third-party/bin/MicrobeCensus/MicrobeCensus-1.0.3/microbe_census/microbe_census.py", line 480, in run_pipeline
    process_seqfile(args, paths)
  File "/afs/ir/users/t/a/taltman1/farmshare/third-party/bin/MicrobeCensus/MicrobeCensus-1.0.3/microbe_census/microbe_census.py", line 273, in process_seqfile
    for rec in parse(open_file(args['seqfile']), args['fastq_format'] if args['file_type'] == 'fastq' else 'fasta'):
  File "/usr/lib/python2.7/dist-packages/Bio/SeqIO/__init__.py", line 582, in parse
    for r in i:
  File "/usr/lib/python2.7/dist-packages/Bio/SeqIO/QualityIO.py", line 1033, in FastqPhredIterator
    for title_line, seq_string, quality_string in FastqGeneralIterator(handle):
  File "/usr/lib/python2.7/dist-packages/Bio/SeqIO/QualityIO.py", line 922, in FastqGeneralIterator
    raise ValueError("Sequence and quality captions differ.")
ValueError: Sequence and quality captions differ.

real 0m9.063s
user 0m8.930s
sys 0m0.169s

@snayfach
Owner

Could you send me a test input file that generates this error?

Thanks!

@snayfach
Owner

Here is the offending sequence in your dataset:

@SRR172902.422002
NCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
+SRR172902.422002 ltrim=1
#.4455543555555555555654554455545554445555334346344554445555555556555555555

The quality and sequence headers must be the same; otherwise, BioPython throws an error.
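
One way to locate records like this before running the pipeline is to scan the file for '+' lines whose caption is non-empty and differs from the '@' line. The sketch below is only an illustration, not MicrobeCensus or BioPython code: the function name `find_caption_mismatches` is hypothetical, and it assumes strict four-line FASTQ records with no wrapped sequences and no blank lines.

```python
from typing import Iterable, Iterator, Tuple


def find_caption_mismatches(lines: Iterable[str]) -> Iterator[Tuple[int, str, str]]:
    """Yield (line_number, seq_caption, qual_caption) for each mismatched record.

    Assumes every record is exactly four lines (hypothetical helper).
    """
    record = []
    for line_number, line in enumerate(lines, start=1):
        record.append(line.rstrip("\r\n"))
        if len(record) == 4:
            title, _seq, plus, _qual = record
            seq_caption = title[1:]   # text after '@'
            qual_caption = plus[1:]   # text after '+', may be empty
            # A bare '+' line is fine; only a non-empty, different caption is flagged.
            if qual_caption and qual_caption != seq_caption:
                # The '+' line is the third line of the record just completed.
                yield line_number - 1, seq_caption, qual_caption
            record = []


# Shortened version of the record quoted above.
example = [
    "@SRR172902.422002\n",
    "NCCCC\n",
    "+SRR172902.422002 ltrim=1\n",
    "#.445\n",
]
mismatches = list(find_caption_mismatches(example))
print(mismatches)
```

For a real file, passing an open file handle in place of `example` should work the same way, since the function only iterates over lines.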

@taltman
Author

taltman commented Apr 27, 2015

Well, the same, modulo the first char, right? :-)

Thanks for catching this. I will pass along the error to the SPAdes team, as I used their read corrector for trimming. Though, based on the looks of that read, I'm having my doubts...

@taltman taltman closed this as completed Apr 27, 2015
@taltman
Author

taltman commented Apr 27, 2015

The odd thing is that I ran this file through DIAMOND as well, without any complaints. I guess the BioPython parser is strict.

@taltman taltman reopened this Apr 27, 2015
@snayfach
Owner

I've modified the code so that this should no longer be an issue. Could you try pulling the latest code?

@taltman
Author

taltman commented Apr 27, 2015

Not that Wikipedia is authoritative, but:

https://en.wikipedia.org/wiki/FASTQ_format

"Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again."

It indeed looks like the BioPython parser is needlessly strict.
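
A lenient reader in the spirit of that description treats everything after the '+' on line 3 as optional and ignores it. The following is a minimal sketch under stated assumptions, not the parser that MicrobeCensus later adopted: `parse_fastq_lenient` is a hypothetical name, and it assumes four-line records with unwrapped sequence and quality strings.

```python
from typing import Iterable, Iterator, Tuple


def parse_fastq_lenient(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, sequence, quality), ignoring any text after '+'.

    Hypothetical helper; assumes four-line FASTQ records.
    """
    it = iter(lines)
    for title in it:
        title = title.rstrip("\r\n")
        if not title:
            continue  # tolerate blank lines between records
        seq = next(it, "").rstrip("\r\n")
        plus = next(it, "").rstrip("\r\n")
        qual = next(it, "").rstrip("\r\n")
        if not title.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"Malformed FASTQ record near {title!r}")
        if len(seq) != len(qual):
            raise ValueError(f"Sequence and quality lengths differ for {title!r}")
        # The caption on the '+' line is deliberately not compared to the title.
        yield title[1:], seq, qual


records = list(parse_fastq_lenient([
    "@SRR172902.422002\n",
    "NCCCC\n",
    "+SRR172902.422002 ltrim=1\n",
    "#.445\n",
]))
print(records)
```

The structural checks (leading '@' and '+', equal sequence and quality lengths) are kept so that truly malformed input still fails loudly.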

@taltman
Author

taltman commented Apr 27, 2015

I installed from the tarball rather than cloning. I'll try cloning now.

@taltman
Author

taltman commented Apr 27, 2015

Now I get this error, with exit status 1, and no output:

Warning: sequence record could not be parsed from input file. Skipping...
Error! No reads remaining after filtering!

@snayfach
Owner

What command did you use to run it? Did you use defaults?

@taltman
Author

taltman commented Apr 27, 2015

Exact same as in original post. No changes.

@taltman
Author

taltman commented Apr 27, 2015

Does it run to completion for you?

@snayfach
Owner

You've specified 500 bp reads (-l 500), but the input file contains only short reads. If you remove -l 500, and let MicrobeCensus pick the read length to use, it should work.

Also, you specified 40,711 reads, but in general you will need more reads than this to get an accurate estimate of AGS. I'd suggest at least 500,000. But I can understand using fewer reads just for testing.

If you try running the program again using default parameters (at least removing -l 500) it should run to completion.

@taltman
Author

taltman commented Apr 27, 2015

I specified -l 500, because my reads have already been trimmed, and I'd rather not have MicrobeCensus re-trim my trimmed reads. I read the option documentation as meaning: any reads longer than 500 will be trimmed to 500. Is there a different way to achieve this?

As for the low # of reads, that was a mistake. Sorry to bother you.

@taltman
Author

taltman commented Apr 27, 2015

I can confirm that the program now works. Excellent!

I did get this line in the terminal, though:
Warning: sequence record could not be parsed from input file. Skipping...
Not exactly sure what that means. Might be helpful to specify the line number for the offending input, along with the FAST{A|Q} identifier.

@snayfach
Owner

MicrobeCensus trims reads to a uniform length, because it uses read-length specific parameters when estimating AGS. The documentation should read: all reads are trimmed to this length, and reads shorter than this length are discarded.

You can use the verbose flag (-v) to get a better sense of what the software is actually doing at each step. It might help things make more sense.
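
The trimming rule described above (trim every read to one length, discard reads shorter than it) can be sketched as follows. This is an illustration of the stated behavior only, not MicrobeCensus's actual code; the function name `trim_to_uniform_length` and the toy reads are assumptions made for the example.

```python
from typing import Iterable, Iterator


def trim_to_uniform_length(sequences: Iterable[str], read_length: int) -> Iterator[str]:
    """Yield reads cut to read_length, dropping reads shorter than read_length.

    Hypothetical helper illustrating the rule described in the thread.
    """
    for seq in sequences:
        if len(seq) < read_length:
            continue  # too short: discarded rather than padded
        yield seq[:read_length]


reads = ["A" * 120, "C" * 75, "G" * 100]  # toy reads of 120, 75 and 100 bp

kept_100 = [len(r) for r in trim_to_uniform_length(reads, 100)]
kept_500 = list(trim_to_uniform_length(reads, 500))
print(kept_100)  # [100, 100]
print(kept_500)  # []
```

With a cutoff of 500 every short read is dropped, which is consistent with the "No reads remaining after filtering!" message reported earlier in the thread.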

@snayfach
Owner

Thanks for the advice! I'll add that.

@snayfach snayfach closed this as completed Oct 1, 2015
snayfach pushed a commit that referenced this issue Dec 20, 2016
Incremented version to 1.1.0

New sequence parser
-avoids using BioPython, which was slow and threw errors
-solves Issue #4
-improved detection of quality score encoding

New command line options
-added option '-r' to specify external RAPsearch v2.15 binary
-removed options '-f' and '-c'; file formats and quality encodings are now always auto-detected
-fixed option '-e' for just estimating AGS
@snayfach
Owner

I've finally fixed this issue in MicrobeCensus (v1.1.0). The program should no longer crash when sequence and quality captions differ.
