MSG: The sequence does not appear to be FASTA format (lacks a descriptor line '>') #346

Closed
semiramisCJ opened this issue Sep 2, 2017 · 5 comments

Comments

@semiramisCJ

Roary runs without problems on GFF3 files from Prokka, but it dies when we use GFF3 files produced by other converters (described at the end). All of these GFF3 files have the nucleotide sequence at the end of the file, include the optional '##FASTA' line, and have the FASTA headers.

Roary gives the following message:

2017/09/01 20:32:01 Extracting proteins from GFF files

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: The sequence does not appear to be FASTA format (lacks a descriptor line '>')
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/local/share/perl5/Bio/Root/Root.pm:472
STACK: Bio::SeqIO::fasta::next_seq /usr/local/share/perl5/Bio/SeqIO/fasta.pm:126
STACK: Bio::Roary::FilterUnknownsFromFasta::_filter_fasta_sequences_and_return_new_file /usr/local/share/perl5/Bio/Roary/FilterUnknownsFromFasta.pm:58
STACK: Bio::Roary::FilterUnknownsFromFasta::filtered_fasta_files /usr/local/share/perl5/Bio/Roary/FilterUnknownsFromFasta.pm:28
STACK: Bio::Roary::PrepareInputFiles::_input_fasta_files_filtered /usr/local/share/perl5/Bio/Roary/PrepareInputFiles.pm:58
STACK: Bio::Roary::PrepareInputFiles::fasta_files /usr/local/share/perl5/Bio/Roary/PrepareInputFiles.pm:82
STACK: Bio::Roary::CommandLine::Roary::run /usr/local/share/perl5/Bio/Roary/CommandLine/Roary.pm:277
STACK: /usr/local/bin/roary:14
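
The exception is raised by BioPerl's FASTA reader, so one way to narrow the problem down is to look at the sequence part of each GFF3 file directly. The sketch below is only a diagnostic aid: it is not how Roary parses its input, and the function name `check_fasta_section` and the specific checks are assumptions chosen for illustration. It reports whether a `##FASTA` directive exists, whether the first non-empty line after it is a `>` header, and whether every seqid used in column 1 has a matching FASTA record.

```python
def check_fasta_section(gff_text):
    """Return a list of problems found in the ##FASTA part of a GFF3 string.

    Hypothetical helper for illustration; not part of Roary or BioPerl.
    """
    lines = gff_text.splitlines()
    problems = []

    # Locate the ##FASTA directive that separates annotation from sequence.
    fasta_idx = None
    for i, line in enumerate(lines):
        if line.strip() == "##FASTA":
            fasta_idx = i
            break
    if fasta_idx is None:
        return ["no ##FASTA directive found"]

    # Sequence IDs used in column 1 of the feature lines.
    seqids = {
        line.split("\t")[0]
        for line in lines[:fasta_idx]
        if line.strip() and not line.startswith("#")
    }

    # Non-empty lines that follow the directive.
    body = [line for line in lines[fasta_idx + 1:] if line.strip()]
    if not body:
        return ["##FASTA section is empty"]
    if not body[0].startswith(">"):
        problems.append("first line after ##FASTA is not a '>' header")

    # First word of every header line, compared with the seqids above.
    headers = {
        line[1:].split()[0]
        for line in body
        if line.startswith(">") and line[1:].split()
    }
    missing = sorted(seqids - headers)
    if missing:
        problems.append("no FASTA record for seqid(s): " + ", ".join(missing))
    return problems
```

An empty list means none of these particular checks failed; it does not guarantee that Roary will accept the file.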

I converted the GBK files to GFF3 with:
a) the seqret module, plus Python to move all the FASTA records to the end of the file [seqret_* files]
b) GFF from BCBio and SeqIO in Python 2.7, then SeqIO again to add the nucleotide sequence at the end [py_* files]
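
For readers who want to reproduce the "sequence at the end" layout, here is a minimal standard-library sketch. It is not the script used in this issue: the function name `append_fasta` and the `(seqid, sequence)` input shape are assumptions. It keeps the annotation lines, drops any earlier `##FASTA` section, and writes every sequence into a single trailing section.

```python
def append_fasta(gff_text, records):
    """Return GFF3 text with all sequences in one trailing ##FASTA section.

    records is a list of (seqid, sequence) pairs (hypothetical input shape).
    """
    # Keep annotation lines only; stop at any existing ##FASTA directive.
    annotation = []
    for line in gff_text.splitlines():
        if line.strip() == "##FASTA":
            break
        if line.strip():
            annotation.append(line)

    out = annotation + ["##FASTA"]
    for seqid, seq in records:
        out.append(">" + seqid)
        # Wrap the sequence at 60 columns, a common FASTA convention.
        out.extend(seq[i:i + 60] for i in range(0, len(seq), 60))
    return "\n".join(out) + "\n"
```

The seqid passed in each pair should match the value in column 1 of the feature lines for that record.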

py_A_sp_B1.gff3.txt
py_A_denitrificans_K601.gff3.txt
py_A_denitrificans_BC.gff3.txt
seqret_A_sp_B1.gbk.gff3.txt
seqret_A_denitrificans_K601.gbk.gff3.txt
seqret_A_denitrificans_BC.gbk.gff3.txt

Could somebody help us to find out how to solve this issue?
Thanks in advance & best regards.

P.S.- roary -a gives the following details

Please cite Roary if you use any of the results it produces:
Andrew J. Page, Carla A. Cummins, Martin Hunt, Vanessa K. Wong, Sandra Reuter, Matthew T. G. Holden, Maria Fookes, Daniel Falush, Jacqueline A. Keane, Julian Parkhill,
"Roary: Rapid large-scale prokaryote pan genome analysis", Bioinformatics, 2015 Nov 15;31(22):3691-3693
doi: http://doi.org/10.1093/bioinformatics/btv421
Pubmed: 26198102

2017/09/01 20:18:49 Looking for 'Rscript' - found /usr/local/R/bin/Rscript
2017/09/01 20:18:49 Determined Rscript version is 3.2
2017/09/01 20:18:49 Looking for 'awk' - found /bin/awk
2017/09/01 20:18:49 Looking for 'bedtools' - found /usr/local/bedtools/bin/bedtools
2017/09/01 20:18:49 Determined bedtools version is 2.25
2017/09/01 20:18:49 Looking for 'blastp' - found /usr/local/blast+/bin/blastp
2017/09/01 20:18:49 Determined blastp version is 2.2.28
2017/09/01 20:18:49 Looking for 'grep' - found /bin/grep
2017/09/01 20:18:49 Optional tool 'kraken' not found in your $PATH
2017/09/01 20:18:49 Optional tool 'kraken-report' not found in your $PATH
2017/09/01 20:18:49 Looking for 'mafft' - found /usr/bin/mafft
Use of uninitialized value in concatenation (.) or string at /usr/local/share/perl5/Bio/Roary/External/CheckTools.pm line 129.
2017/09/01 20:18:49 Determined mafft version is
2017/09/01 20:18:49 Looking for 'makeblastdb' - found /usr/local/blast+/bin/makeblastdb
2017/09/01 20:18:49 Determined makeblastdb version is 2.2.28
2017/09/01 20:18:49 Looking for 'mcl' - found /usr/local/bin/mcl
2017/09/01 20:18:49 Determined mcl version is 12-068
2017/09/01 20:18:49 Looking for 'parallel' - found /usr/local/masurca/bin/parallel
2017/09/01 20:18:49 Determined parallel version is 20120822
2017/09/01 20:18:49 Roary needs parallel 20130422 or higher. Please upgrade and try again.
2017/09/01 20:18:49 Looking for 'prank' - found /usr/local/bin/prank
2017/09/01 20:18:49 Looking for 'sed' - found /bin/sed
2017/09/01 20:18:49 Looking for 'cd-hit' - found /usr/local/cd-hit/cd-hit
Use of uninitialized value in concatenation (.) or string at /usr/local/share/perl5/Bio/Roary/External/CheckTools.pm line 129.
2017/09/01 20:18:49 Determined cd-hit version is
Use of uninitialized value in numeric lt (<) at /usr/local/share/perl5/Bio/Roary/External/CheckTools.pm line 130.
2017/09/01 20:18:49 Roary needs cd-hit 4.6 or higher. Please upgrade and try again.
2017/09/01 20:18:49 Looking for 'FastTree' - found /usr/local/bin/FastTree
2017/09/01 20:18:50 Determined FastTree version is 2.1
2017/09/01 20:18:50 Roary version 3.7.0

@tseemann
Contributor

tseemann commented Sep 2, 2017

The pyA GFF3 file is a bit unusual. It has the ##sequence-region lines scattered throughout rather than at the top.

I loaded your pyA file into http://genometools.org/cgi-bin/gff3validator.cgi and got this error

Validation unsuccessful!

GenomeTools error: attribute "pseudo=" on line 5 in file "/var/www/servers/genometools.org/htdocs/cgi-bin/gff3/py_A_sp_B1.gff3.txt" has no value

The converter you used has a problem. The /pseudo tag in GenBank is a value-less key, but GFF3 does not support value-less keys, so the converter writes pseudo= into the file. This is wrong.

You could try sed -e 's/;pseudo=//g' < old.gff > new.gff and see if that works.
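
Note that the sed expression removes the text `;pseudo=` wherever it occurs, so a line containing `;pseudo=true` would end up with `true` glued onto the previous attribute. Since the files here are already being processed with Python, a slightly more careful alternative might look like the sketch below. It is an assumption-laden illustration (the function name `drop_empty_attributes` is made up): it edits only column 9 of feature lines and drops any attribute that has no value, including a bare key with no `=` at all.

```python
def drop_empty_attributes(gff_line):
    """Remove attributes without a value from column 9 of one GFF3 line.

    Hypothetical helper for illustration. Comment lines, directives and
    FASTA lines are returned unchanged.
    """
    newline = "\n" if gff_line.endswith("\n") else ""
    body = gff_line.rstrip("\n")
    if body.startswith("#") or not body.strip():
        return gff_line

    cols = body.split("\t")
    if len(cols) != 9:
        return gff_line  # not a feature line, leave it untouched

    kept = []
    for attr in cols[8].split(";"):
        key, sep, value = attr.partition("=")
        # Keep only well-formed key=value pairs with a non-empty value.
        if key and sep and value:
            kept.append(attr)

    # GFF3 uses "." for an empty column.
    cols[8] = ";".join(kept) or "."
    return "\t".join(cols) + newline
```

Applied line by line, this leaves `pseudo=true` intact while removing `pseudo=`.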

@semiramisCJ
Author

semiramisCJ commented Sep 5, 2017

Thank you very much for your quick reply!

I fixed the py_* files to resolve all the issues reported by the GFF3 validator, and I moved the ##sequence-region lines to the top. However, Roary still dies with the same message, even though the online GFF3 validator now reports a successful validation for each of the files:

A_sp_B1.gff3.txt
A_denitrificans_K601.gff3.txt
A_denitrificans_BC.gff3.txt
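
The reordering described above (moving the ##sequence-region lines to the top) can be sketched as follows. This is not the script that produced the attached files; the function name `hoist_sequence_regions` is hypothetical, and keeping only the first `##gff-version` line and de-duplicating region lines are design choices made for this example.

```python
def hoist_sequence_regions(gff_text):
    """Move every ##sequence-region directive to just below ##gff-version.

    Hypothetical helper for illustration. Everything from ##FASTA onward
    is copied through unchanged.
    """
    version, regions, rest = [], [], []
    in_fasta = False
    for line in gff_text.splitlines():
        if line.strip() == "##FASTA":
            in_fasta = True
        if in_fasta:
            rest.append(line)
        elif line.startswith("##gff-version"):
            # Keep only the first version directive.
            if not version:
                version.append(line)
        elif line.startswith("##sequence-region"):
            # Collect region directives in order, skipping exact duplicates.
            if line not in regions:
                regions.append(line)
        else:
            rest.append(line)
    return "\n".join(version + regions + rest) + "\n"
```

Feature lines keep their original relative order; only the directives move.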

Could you please help us to find what else is wrong with the converted files?
Thank you very much in advance and best regards.

2017/09/04 20:14:31 Fixing input GFF files
2017/09/04 20:14:42 Extracting proteins from GFF files

MSG: The sequence does not appear to be FASTA format (lacks a descriptor line '>')
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/local/share/perl5/Bio/Root/Root.pm:472
STACK: Bio::SeqIO::fasta::next_seq /usr/local/share/perl5/Bio/SeqIO/fasta.pm:126
STACK: Bio::Roary::FilterUnknownsFromFasta::_filter_fasta_sequences_and_return_new_file /usr/local/share/perl5/Bio/Roary/FilterUnknownsFromFasta.pm:58
STACK: Bio::Roary::FilterUnknownsFromFasta::filtered_fasta_files /usr/local/share/perl5/Bio/Roary/FilterUnknownsFromFasta.pm:28
STACK: Bio::Roary::PrepareInputFiles::_input_fasta_files_filtered /usr/local/share/perl5/Bio/Roary/PrepareInputFiles.pm:58
STACK: Bio::Roary::PrepareInputFiles::fasta_files /usr/local/share/perl5/Bio/Roary/PrepareInputFiles.pm:82
STACK: Bio::Roary::CommandLine::Roary::run /usr/local/share/perl5/Bio/Roary/CommandLine/Roary.pm:277
STACK: /usr/local/bin/roary:14

@andrewjpage
Member

andrewjpage commented Sep 5, 2017 via email

@andrewjpage
Member

Additionally I have updated Roary to capture this case and fix it on the fly.

@tseemann
Contributor

tseemann commented Sep 8, 2017

@semiramisCJ to use the auto-detect you will need to upgrade via CPAN.
