Could not obtain pan_genome_sequences #426

Closed
pedrorvc opened this issue Oct 17, 2018 · 14 comments

Comments

@pedrorvc

Hello, I am using Roary to generate a pan-genome with 274 .gff files obtained from Prokka.
The Roary command I'm using is roary -e -f roary_intermediates -n -p 4 -z *.gff, in order to obtain the nucleotide sequences of the results.
Roary completes the job but the following warning is given:

Attribute (fasta_file) does not pass the type constraint because: Validation failed for 'Str' with value undef at reader Bio::Roary::Output::GroupsMultifastaNucleotide::fasta_file (defined at /usr/local/share/perl/5.18.2/Bio/Roary/Output/GroupsMultifastaNucleotide.pm line 29) line 15
	Bio::Roary::Output::GroupsMultifastaNucleotide::fasta_file('Bio::Roary::Output::GroupsMultifastaNucleotide=HASH(0x17b208700)') called at /usr/local/share/perl/5.18.2/Bio/Roary/Output/GroupsMultifastaNucleotide.pm line 43
	Bio::Roary::Output::GroupsMultifastaNucleotide::_build__input_seqio('Bio::Roary::Output::GroupsMultifastaNucleotide=HASH(0x17b208700)') called at reader Bio::Roary::Output::GroupsMultifastaNucleotide::_input_seqio (defined at /usr/local/share/perl/5.18.2/Bio/Roary/Output/GroupsMultifastaNucleotide.pm line 30) line 7
	Bio::Roary::Output::GroupsMultifastaNucleotide::_input_seqio('Bio::Roary::Output::GroupsMultifastaNucleotide=HASH(0x17b208700)') called at /usr/local/share/perl/5.18.2/Bio/Roary/Output/GroupsMultifastaNucleotide.pm line 53
	Bio::Roary::Output::GroupsMultifastaNucleotide::populate_files('Bio::Roary::Output::GroupsMultifastaNucleotide=HASH(0x17b208700)') called at /usr/local/share/perl/5.18.2/Bio/Roary/Output/GroupsMultifastasNucleotide.pm line 65
	Bio::Roary::Output::GroupsMultifastasNucleotide::create_files('Bio::Roary::Output::GroupsMultifastasNucleotide=HASH(0x17abef948)') called at /usr/local/share/perl/5.18.2/Bio/Roary/PostAnalysis.pm line 131
	Bio::Roary::PostAnalysis::run('Bio::Roary::PostAnalysis=HASH(0x2312168)') called at /usr/local/share/perl/5.18.2/Bio/Roary/CommandLine/RoaryPostAnalysis.pm line 128
	Bio::Roary::CommandLine::RoaryPostAnalysis::run('Bio::Roary::CommandLine::RoaryPostAnalysis=HASH(0x22ebc20)') called at /usr/local/bin/pan_genome_post_analysis line 14

When I checked the results, the pan_genome_sequences output was empty.
Does this warning mean that some of my .gff files do not contain the FASTA sequence?

P.S.: Here is the output of roary -a and one of my .gff files (renamed to .txt only so that I could upload it here).

9_Escherichia_coli_FAP1_CP009578.1_Netherlands.txt

2018/10/17 15:27:49 Looking for 'Rscript' - found /usr/bin/Rscript
2018/10/17 15:27:49 Determined Rscript version is 3.4
2018/10/17 15:27:49 Looking for 'awk' - found /usr/bin/awk
2018/10/17 15:27:49 Looking for 'bedtools' - found /usr/bin/bedtools
2018/10/17 15:27:49 Determined bedtools version is 2.17
2018/10/17 15:27:49 Looking for 'blastp' - found /home/geo1/SW/ncbi-blast-2.7.1+/bin/blastp
2018/10/17 15:27:49 Determined blastp version is 2.7.1
2018/10/17 15:27:49 Looking for 'grep' - found /bin/grep
2018/10/17 15:27:49 Optional tool 'kraken' not found in your $PATH
2018/10/17 15:27:49 Optional tool 'kraken-report' not found in your $PATH
2018/10/17 15:27:49 Looking for 'mafft' - found /usr/bin/mafft
2018/10/17 15:27:49 Determined mafft version is 7.392
2018/10/17 15:27:49 Looking for 'makeblastdb' - found /home/geo1/SW/ncbi-blast-2.7.1+/bin/makeblastdb
2018/10/17 15:27:49 Determined makeblastdb version is 2.7.1
2018/10/17 15:27:49 Looking for 'mcl' - found /usr/bin/mcl
2018/10/17 15:27:49 Determined mcl version is 12-135
2018/10/17 15:27:49 Looking for 'parallel' - found /usr/bin/parallel
2018/10/17 15:27:49 Determined parallel version is 20130922
2018/10/17 15:27:49 Looking for 'prank' - found /usr/bin/prank
2018/10/17 15:27:49 Determined prank version is 140110
2018/10/17 15:27:49 Looking for 'sed' - found /bin/sed
2018/10/17 15:27:49 Looking for 'cdhit' - found /usr/bin/cdhit
2018/10/17 15:27:49 Determined cdhit version is 4.6
2018/10/17 15:27:49 Looking for 'fasttree' - found /usr/bin/fasttree
2018/10/17 15:27:49 Determined fasttree version is 2.1
2018/10/17 15:27:49 Roary version 3.12.0

@tseemann
Contributor

@pedrorvc type tail -n 1 *.gff and see if they all have DNA sequence on last line. If not, you can't use them.
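
The check suggested above can be scripted so that only the problem files are listed. This is a minimal sketch, not part of Roary: the helper name list_gffs_without_sequence is made up for illustration, and it assumes that a usable file ends with a line consisting only of IUPAC nucleotide codes (so a final line ending in NN still counts as sequence).

```shell
#!/usr/bin/env bash
# Print the name of every file whose last line does not look like
# nucleotide sequence.
# Assumption: a usable GFF3 file ends with its FASTA section, so the last
# line contains only IUPAC nucleotide codes (upper or lower case).
# list_gffs_without_sequence is a hypothetical helper name.
list_gffs_without_sequence() {
    local f
    for f in "$@"; do
        if ! tail -n 1 "$f" | grep -Eqi '^[ACGTUNRYSWKMBDHV]+$'; then
            printf '%s\n' "$f"
        fi
    done
}
```

Running list_gffs_without_sequence *.gff in the directory that holds the annotation files prints one file name per line; no output means every file ends with sequence.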

@pedrorvc
Author

> @pedrorvc type tail -n 1 *.gff and see if they all have DNA sequence on last line. If not, you can't use them.

@tseemann Thank you very much!
I followed your suggestion and verified that 8 files have sequences ending with NN, one of which is the file I attached.
Would this trigger the warning?

@pedrorvc
Author

I have been trying to understand what causes this error and, after various attempts, the problem seems to lie in the number of .gff files given to Roary.
One of the things I tried was running the same command on batches of 50 files (and afterwards 100 files). I was able to obtain the pan_genome_sequences successfully, which led me to believe that some sort of file limit could be causing the warning.
As of this post, the largest run for which I have successfully obtained the pan_genome_sequences used 235 .gff files.
If a file limit exists, can it trigger this warning?

pedrorvc reopened this Oct 20, 2018
@pedrorvc
Author

Apologies for closing the issue due to a misclick.

@tseemann
Contributor

Maybe you are running out of RAM?

@pedrorvc
Author

I don't think so. I monitored the RAM usage during my most recent failed run and still had about 6 GB of RAM available.

@maesaar

maesaar commented Oct 21, 2018

@pedrorvc I have used Roary with more than 1000 files and have not had any issues.

@tseemann
Contributor

What does ulimit -a say for your account?

@pedrorvc
Author

For my account ulimit -a says this:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 64019
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 64019
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

@andrewjpage
Member

Your allowable number of open files is too low. For example, on my system it's:
open files (-n) 1048576

I would recommend you increase it (your system administrator can assist if you don't know how to do it).
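
To see how the limit compares with the size of a run, something like the following can help. It is only a sketch: check_open_file_limit is a hypothetical helper, and the default of 4 descriptors per input file is an assumption chosen because it fits the numbers reported earlier in this thread (235 files worked and 274 failed with a limit of 1024). It is not a figure documented by Roary.

```shell
#!/usr/bin/env bash
# Compare a rough estimate of the descriptors a run needs with a limit.
# Arguments:
#   $1  number of input files
#   $2  descriptors per input file (default 4, an assumed value)
#   $3  limit to test against (default: current soft open-files limit)
# check_open_file_limit is a hypothetical helper name.
check_open_file_limit() {
    local n_files=$1
    local per_file=${2:-4}
    local limit=${3:-$(ulimit -Sn)}
    local needed=$((n_files * per_file))

    if [ "$limit" = "unlimited" ] || [ "$limit" -ge "$needed" ]; then
        echo "ok: limit $limit, estimated need $needed"
        return 0
    fi
    echo "too low: limit $limit, estimated need $needed"
    return 1
}
```

In the directory with the inputs, files=(*.gff); check_open_file_limit "${#files[@]}" tests the current shell's soft limit against the number of .gff files.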

@andrewjpage
Member

Additionally, you appear to be running quite old versions of software, which would indicate you're running a very old version of Linux?

@pedrorvc
Author

@andrewjpage I increased the number of open files and it solved the problem, thank you very much!
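
For anyone landing here with the same error, raising the limit for a single session might look like the sketch below. raise_open_file_limit is a hypothetical helper, and 65536 in the usage note is an arbitrary example value. The soft limit can be raised up to the hard limit without root; raising the hard limit itself needs an administrator (on many Linux systems through /etc/security/limits.conf).

```shell
#!/usr/bin/env bash
# Raise the soft open-files limit of the current shell to the requested
# value, capped at the hard limit. Does nothing if the soft limit is
# already high enough. Call it in the shell that will start Roary (not in
# a subshell), because child processes inherit the limit from that shell.
# raise_open_file_limit is a hypothetical helper name.
raise_open_file_limit() {
    local wanted=$1
    local soft hard
    soft=$(ulimit -Sn)
    hard=$(ulimit -Hn)

    if [ "$soft" = "unlimited" ] || [ "$soft" -ge "$wanted" ]; then
        return 0
    fi
    if [ "$hard" != "unlimited" ] && [ "$wanted" -gt "$hard" ]; then
        echo "requested $wanted exceeds hard limit $hard; using $hard" >&2
        wanted=$hard
    fi
    ulimit -Sn "$wanted"
}
```

Usage would then be along the lines of raise_open_file_limit 65536 && roary -e -f roary_intermediates -n -p 4 -z *.gff, with the Roary command taken from the first post.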

@JinxiangChenHome

> Your allowable number of open files is too low. For example, on my system it's:
> open files (-n) 1048576
>
> I would recommend you increase it (your system administrator can assist if you don't know how to do it).

Hi andrewjpage,

I changed the value of open files to 1048576, but the problem is still not solved.

I used 1800 .gff files to produce the pan-genome. My machine information: 16 vCPUs, 64 GB RAM, Instances 2 5T

I don't know how to solve this problem. I get the same error message as he did.

My error output is as follows:


please cite:

O. Tange (2011): GNU Parallel - The Command-Line Power Tool,
;login: The USENIX Magazine, February 2011:42-47.

This helps funding further development; AND IT WON'T COST YOU A CENT.
If you pay 10000 EUR you should feel free to use GNU Parallel without citing.

To silence this citation notice: run 'parallel --citation'.


Attribute (fasta_file) does not pass the type constraint because: Validation failed for 'Str' with value undef at reader Bio::Roary::Output::GroupsMultifastaNucleotide::fasta_file (defined at /usr/local/share/perl/5.22.1/Bio/Roary/Output/GroupsMultifastaNucleotide.pm line 29) line 15
Bio::Roary::Output::GroupsMultifastaNucleotide::fasta_file('Bio::Roary::Output::GroupsMultifastaNucleotide=HASH(0x96345d5d8)') called at /usr/local/share/perl/5.22.1/Bio/Roary/Output/GroupsMultifastaNucleotide.pm line 43
Bio::Roary::Output::GroupsMultifastaNucleotide::_build__input_seqio('Bio::Roary::Output::GroupsMultifastaNucleotide=HASH(0x96345d5d8)') called at reader Bio::Roary::Output::GroupsMultifastaNucleotide::_input_seqio (defined at /usr/local/share/perl/5.22.1/Bio/Roary/Output/GroupsMultifastaNucleotide.pm line 30) line 8
Bio::Roary::Output::GroupsMultifastaNucleotide::_input_seqio('Bio::Roary::Output::GroupsMultifastaNucleotide=HASH(0x96345d5d8)') called at /usr/local/share/perl/5.22.1/Bio/Roary/Output/GroupsMultifastaNucleotide.pm line 53
Bio::Roary::Output::GroupsMultifastaNucleotide::populate_files('Bio::Roary::Output::GroupsMultifastaNucleotide=HASH(0x96345d5d8)') called at /usr/local/share/perl/5.22.1/Bio/Roary/Output/GroupsMultifastasNucleotide.pm line 65
Bio::Roary::Output::GroupsMultifastasNucleotide::create_files('Bio::Roary::Output::GroupsMultifastasNucleotide=HASH(0x900572c78)') called at /usr/local/share/perl/5.22.1/Bio/Roary/PostAnalysis.pm line 131
Bio::Roary::PostAnalysis::run('Bio::Roary::PostAnalysis=HASH(0x4df69a8)') called at /usr/local/share/perl/5.22.1/Bio/Roary/CommandLine/RoaryPostAnalysis.pm line 128
Bio::Roary::CommandLine::RoaryPostAnalysis::run('Bio::Roary::CommandLine::RoaryPostAnalysis=HASH(0x10fa1f0)') called at /usr/local/bin/pan_genome_post_analysis line 14


Looking forward to your reply. Thank you!

Best regards,

Jinxiang
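
One thing that may be worth ruling out in a case like this is that the raised limit never reached the process that runs Roary, for example because the change was made in a different shell or only takes effect after logging in again. On Linux, the limit a running process actually has can be read from /proc. A sketch, with effective_open_file_limit as a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Print the soft "Max open files" limit of a running process (Linux only).
# Each row of /proc/<pid>/limits holds: limit name, soft limit, hard limit,
# units; the name "Max open files" is three words, so the soft limit is
# the fourth whitespace-separated field.
# effective_open_file_limit is a hypothetical helper name.
effective_open_file_limit() {
    local pid=$1
    awk '/^Max open files/ { print $4 }' "/proc/$pid/limits"
}
```

Calling effective_open_file_limit $$ in the shell that is about to launch Roary shows the value its child processes will inherit.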

@xianggx01

Hi, I used 2472 .gff files to produce a pan-genome and encountered this problem too, although the software ran successfully with 232 .gff files. This is my error output:

"Please cite Roary if you use any of the results it produces:
Andrew J. Page, Carla A. Cummins, Martin Hunt, Vanessa K. Wong, Sandra Reuter, Matthew T. G. Holden, Maria Fookes, Daniel Falush, Jacqueline A. Keane, Julian Parkhill,
"Roary: Rapid large-scale prokaryote pan genome analysis", Bioinformatics, 2015 Nov 15;31(22):3691-3693
doi: http://doi.org/10.1093/bioinformatics/btv421
Pubmed: 26198102

Use of uninitialized value in require at /usr/local/lib64/perl5/Moose/Meta/TypeConstraint.pm line 60.
Attribute (fasta_file) does not pass the type constraint because: Validation failed for 'Str' with value undef at reader Bio::Roary::Output::GroupsMultifastaNucleotide::fasta_file (defined at /usr/local/share/perl5/Bio/Roary/Output/GroupsMultifastaNucleotide.pm line 29) line 15
Bio::Roary::Output::GroupsMultifastaNucleotide::fasta_file('Bio::Roary::Output::GroupsMultifastaNucleotide=HASH(0x10ac17f350)') called at /usr/local/share/perl5/Bio/Roary/Output/GroupsMultifastaNucleotide.pm line 43
Bio::Roary::Output::GroupsMultifastaNucleotide::_build__input_seqio('Bio::Roary::Output::GroupsMultifastaNucleotide=HASH(0x10ac17f350)') called at reader Bio::Roary::Output::GroupsMultifastaNucleotide::_input_seqio (defined at /usr/local/share/perl5/Bio/Roary/Output/GroupsMultifastaNucleotide.pm line 30) line 7
Bio::Roary::Output::GroupsMultifastaNucleotide::_input_seqio('Bio::Roary::Output::GroupsMultifastaNucleotide=HASH(0x10ac17f350)') called at /usr/local/share/perl5/Bio/Roary/Output/GroupsMultifastaNucleotide.pm line 53
Bio::Roary::Output::GroupsMultifastaNucleotide::populate_files('Bio::Roary::Output::GroupsMultifastaNucleotide=HASH(0x10ac17f350)') called at /usr/local/share/perl5/Bio/Roary/Output/GroupsMultifastasNucleotide.pm line 65
Bio::Roary::Output::GroupsMultifastasNucleotide::create_files('Bio::Roary::Output::GroupsMultifastasNucleotide=HASH(0xfd12db478)') called at /usr/local/share/perl5/Bio/Roary/PostAnalysis.pm line 131
Bio::Roary::PostAnalysis::run('Bio::Roary::PostAnalysis=HASH(0x4ba5ec8)') called at /usr/local/share/perl5/Bio/Roary/CommandLine/RoaryPostAnalysis.pm line 128
Bio::Roary::CommandLine::RoaryPostAnalysis::run('Bio::Roary::CommandLine::RoaryPostAnalysis=HASH(0x1d99188)') called at /usr/local/bin/pan_genome_post_analysis line 14"

Looking forward to any reply. Thank you!

Best regards,
Guoxiu Xiang
