roary_plots.py generating flawed plots #221

Closed
swlong opened this issue Jan 6, 2016 · 21 comments
@swlong

swlong commented Jan 6, 2016

I am running roary_plots.py after successfully running Roary and FastTree, following the instructions provided, and the following occurs:

  1. A warning is generated:
    FutureWarning: order is deprecated. use sort_values(...)
    idx = roary.sum(axis=1).order(ascending=False).index

  2. The three plots are generated, but each is erroneous in one way or another:

  • pangenome_matrix.png appears to have a valid matrix, but the tree is nonexistent, just a straight line. If I open the Newick file in Dendroscope, it appears valid (i.e. it exists).
  • pangenome_pie.png is essentially empty: all that is present is a straight line with zeros and labels superimposed.
  • pangenome_frequency.png lists tens of thousands of genomes on the X axis (instead of the 51 that were used), and the distribution shown appears to be incorrect, or at least does not represent the distribution of genes among the 51 genomes in the pangenome.
  3. Unsure if it is related, but I am finding the following error generated apparently during the MAFFT step (apologies if this is a completely separate issue, I'll branch to a different issue report):
    ------------- EXCEPTION: Bio::Root::Exception -------------
    MSG: Could not open pan_genome_sequences/group_16429.fa.aln: No such file or directory
    STACK: Error::throw
    STACK: Bio::Root::Root::throw /usr/share/perl5/Bio/Root/Root.pm:486
    STACK: Bio::Root::IO::_initialize_io /usr/share/perl5/Bio/Root/IO.pm:351
    STACK: Bio::SeqIO::_initialize /usr/share/perl5/Bio/SeqIO.pm:491
    STACK: Bio::SeqIO::fasta::_initialize /usr/share/perl5/Bio/SeqIO/fasta.pm:87
    STACK: Bio::SeqIO::new /usr/share/perl5/Bio/SeqIO.pm:372
    STACK: Bio::SeqIO::new /usr/share/perl5/Bio/SeqIO.pm:413
    STACK: Bio::Roary::SortFasta::_input_seqio /usr/local/share/perl/5.18.2/Bio/Roary/SortFasta.pm:27
    STACK: Bio::Roary::SortFasta::sort_fasta /usr/local/share/perl/5.18.2/Bio/Roary/SortFasta.pm:68
    STACK: Bio::Roary::CommandLine::GeneAlignmentFromNucleotides::run /usr/local/share/perl/5.18.2/Bio/Roary/CommandLine/GeneAlignmentFromNucleotides.pm:107
    STACK: /usr/local/bin/protein_alignment_from_nucleotides:14

This seems to be happening for some (but not all) clusters... yet core_gene_alignment.aln is still being generated and contains data.

Has anyone else seen this problem? I will attach example data momentarily. I am running Roary on a Biolinux 8 box.

Best,
S. W. Long

@andrewjpage
Member

Something seems to have gone quite wrong indeed, sorry about that. How many core genes do you get in your summary statistics file? Where did the input files come from (PROKKA?) and does each one have a unique prefix so that the IDs of each gene are unique to the set?
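
The uniqueness question above can be checked mechanically. The following is a minimal sketch, not part of Roary: the helper names `feature_ids` and `shared_ids` are hypothetical, and it assumes standard GFF3 input where column 9 carries an `ID=` attribute and an optional `##FASTA` section ends the feature lines.

```python
import re
from collections import defaultdict

# Matches the ID attribute in GFF3 column 9, e.g. "ID=ABC_00001;product=..."
ID_RE = re.compile(r"(?:^|;)ID=([^;]+)")


def feature_ids(gff_lines):
    """Yield the ID attribute of each feature line in one GFF3 file."""
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence section: no feature lines after this
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments, and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a well-formed feature line
        match = ID_RE.search(cols[8])
        if match:
            yield match.group(1)


def shared_ids(files):
    """Map each ID found in more than one file to the files containing it.

    `files` maps a file name to an iterable of that file's lines.
    """
    seen = defaultdict(set)
    for name, lines in files.items():
        for fid in feature_ids(lines):
            seen[fid].add(name)
    return {fid: sorted(names) for fid, names in seen.items() if len(names) > 1}


# Tiny in-memory demonstration with made-up feature lines.
a = ["##gff-version 3\n", "c1\tsrc\tCDS\t1\t90\t.\t+\t0\tID=A_00001;product=p\n"]
b = ["c2\tsrc\tCDS\t1\t90\t.\t+\t0\tID=A_00001\n",
     "c2\tsrc\tCDS\t100\t190\t.\t+\t0\tID=B_00002\n"]
print(shared_ids({"a.gff": a, "b.gff": b}))  # {'A_00001': ['a.gff', 'b.gff']}
```

For real files, passing `{path: open(path) for path in paths}` should work, since open file objects iterate line by line. An empty result means no ID is shared between input files.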

@swlong
Author

swlong commented Jan 7, 2016

Summary stats file:
Core genes (99% <= strains <= 100%) 835
Soft core genes (95% <= strains < 99%) 2547
Shell genes (15% <= strains < 95%) 2145
Cloud genes (0% <= strains < 15%) 10903
Total genes (0% <= strains <= 100%) 16430

This was run with default settings for blastp. Input files were downloaded from GenBank as gb files with full sequence, then converted to GFF3 using the bp_genbank2gff3.pl script. Input files have unique prefixes; a few input files were "fixed" by Roary for having duplicate gene IDs.

I was going to upload a smaller sample run to see if I could replicate problems but IT forced a reboot on my system overnight and killed my run... hopefully later today or tomorrow I should have some actual datafiles to share.

Additional oddity: pangenome_matrix.png reports 51 strains in tree even though only 48 strains were used to generate the dataset.
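
One way to investigate a strain-count mismatch like this is to compare the tip labels in the tree with the sample columns in gene_presence_absence.csv. The sketch below is an illustration only and is not taken from roary_plots.py. The function names are hypothetical, the Newick parsing is a simple regular expression that ignores quoted labels, and the number of leading metadata columns (`n_meta`) is a parameter because it differs between Roary versions.

```python
import csv
import io
import re


def tree_tips(newick):
    """Return the leaf labels of a Newick string (unquoted labels only).

    A leaf label directly follows "(" or ","; internal node labels
    follow ")" and are therefore not captured.
    """
    return set(re.findall(r"[(,]\s*([^():,;\s]+)", newick))


def csv_samples(csv_text, n_meta):
    """Return the header names after the first n_meta metadata columns."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return set(header[n_meta:])


def compare(newick, csv_text, n_meta):
    """Return (tips missing from the CSV, CSV samples missing from the tree)."""
    tips = tree_tips(newick)
    samples = csv_samples(csv_text, n_meta)
    return sorted(tips - samples), sorted(samples - tips)


# Made-up example data: three strains, two metadata columns.
newick = "((s1:0.1,s2:0.2)0.9:0.05,s3:0.3);"
table = "Gene,Annotation,s1,s2,s3\ngroup_1,hypothetical protein,a,b,c\n"
print(compare(newick, table, n_meta=2))  # ([], [])
print(compare(newick, table, n_meta=1))  # ([], ['Annotation'])
```

The second call shows how an undercounted `n_meta` makes a metadata column look like an extra strain, which is one possible way for a reported strain count to exceed the real one.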

@swlong
Author

swlong commented Jan 11, 2016

I have replicated the issues with a smaller dataset (5 Cdiff genomes) and the issues appear to be the same. The tar.gz of the directory is a bit too large to upload directly here so I created a repository to allow for easy access. Hoping to get to the bottom of this, as Roary is a very useful tool.

Directory containing all files and output can be found here:
https://github.com/swlong/SampleData.git

In short, here was my workflow:

  1. Downloaded 5 Cdiff complete genomes from GenBank.
  2. Converted .gb to .gff using bp_genbank2gff3.pl.
  3. Ran Roary with "roary -p 12 -e --mafft -r -v *.gff"
  4. Made a tree with "fasttreeMP -nt -gtr core_gene_alignment.aln > CdiffTree.newick"
  5. Ran roary_plots with "roary_plots.py CdiffTree.newick gene_presence_absence.csv"
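
Steps 3 to 5 of the workflow above can be collected into a small driver script. This is a sketch under stated assumptions: the command lines are copied from the steps as quoted, the helper names `build_steps` and `run_steps` are hypothetical, and the script defaults to a dry run that only prints what it would execute.

```python
import glob
import shlex
import subprocess


def build_steps(gff_files, threads=12, tree="CdiffTree.newick"):
    """Return (argv, stdout_path) pairs mirroring the three quoted commands."""
    return [
        (["roary", "-p", str(threads), "-e", "--mafft", "-r", "-v", *gff_files], None),
        # FastTree writes the tree to stdout, so it is redirected to a file.
        (["fasttreeMP", "-nt", "-gtr", "core_gene_alignment.aln"], tree),
        (["roary_plots.py", tree, "gene_presence_absence.csv"], None),
    ]


def run_steps(steps, dry_run=True):
    """Print each command; execute it too unless dry_run is set."""
    for argv, stdout_path in steps:
        print(shlex.join(argv) + (f" > {stdout_path}" if stdout_path else ""))
        if dry_run:
            continue
        if stdout_path:
            with open(stdout_path, "w") as out:
                subprocess.run(argv, stdout=out, check=True)
        else:
            subprocess.run(argv, check=True)  # raises if the step fails


if __name__ == "__main__":
    # The shell normally expands *.gff; here glob does the same job.
    run_steps(build_steps(sorted(glob.glob("*.gff"))))
```

With `check=True`, a failing step stops the run instead of passing incomplete output on to the next tool.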

The issues, including the Bio::Root::Exception error during the post-analysis step and the appearance of the roary_plots output, remain the same. I'm hoping that providing this data helps find a solution. Let me know if I can be of any further service.

Best,
S. Wesley Long

P.S. Summary stats for the Cdiff dataset (only 5 genomes):
Core genes (99% <= strains <= 100%) 2629
Soft core genes (95% <= strains < 99%) 0
Shell genes (15% <= strains < 95%) 2635
Cloud genes (0% <= strains < 15%) 0
Total genes (0% <= strains <= 100%) 5264
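
The zero counts for soft core and cloud genes follow from the thresholds themselves when there are only 5 genomes. The sketch below applies the percentage bands printed in the summary to a gene's presence count. The function name `classify` is hypothetical, and any rounding Roary applies internally is not modelled here.

```python
def classify(n_present, n_strains):
    """Assign a gene to a category from the fraction of strains carrying it.

    Bands follow the summary statistics labels:
    core >= 99%, soft core 95% to < 99%, shell 15% to < 95%, cloud < 15%.
    """
    frac = n_present / n_strains
    if frac >= 0.99:
        return "core"
    if frac >= 0.95:
        return "soft core"
    if frac >= 0.15:
        return "shell"
    return "cloud"


# With 5 genomes the possible fractions are 20%, 40%, 60%, 80%, and 100%.
print([classify(n, 5) for n in range(1, 6)])
# ['shell', 'shell', 'shell', 'shell', 'core']
```

No presence count out of 5 falls into the soft core or cloud bands, which is consistent with the zeros reported above.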

@andrewjpage
Member

Thanks for the data. It looks like Roary ran to completion. I reran the roary_plots script and it produced a proper tree, so I suspect there's an issue with the versions of the Python dependencies for this script (Phylo). We'll take a look to see if we can track it down.

JScandy should be able to show you the same information (it's an experimental interactive viewer) if you want to give it a shot. Once you load up your data you can change the viewing mode by clicking on the JScandy logo.
http://jameshadfield.github.io/JScandy/

There's also the roary2svg.pl script, which gives a similar view (but not against a tree).

@swlong
Author

swlong commented Jan 13, 2016

Andrew,

Thanks for the help. I do agree that it looks like Roary is running appropriately. I'm not sure what to make of the downstream Bio::Root::Exception. Thanks for the JScandy suggestion; it appears to generate the matrix plot in a similar manner and may have other uses as well.

Best,
Wesley

@mgalardini
Contributor

Hi Wesley,

I believe that the latest version of the script fixes all the problems you've seen:

  1. The deprecation warning is gone.
  2. The straight-line tree is due to a bug in the Bio.Phylo package, solved by upgrading Biopython; the other errors are probably due to the change in format of the gene_presence_absence.csv file, which is now handled properly.

Plus, now there's a new option "--labels" to add sample names to the tree.

I also agree that JScandy is very cool and useful.

Marco

@alichenari2018

Hi everybody,
I used the tutorial scripts:
roary -f -e -n -v *.gff
It finished successfully. But when I ran "python roary_plots.py core_gene_alignment.nwk gene_presence_absence.csv" I got an error:
python: can't open file 'roary_plots.py': [Errno 2] No such file or directory
Also, there is no core_gene_alignment.nwk among my output files and folders.
Please help me.
Regards

@mgalardini
Contributor

Hi there,

From the look of the error, it seems that the script is not in the directory you are calling it from. Please download it from here and place it in your working directory.

Also, please keep in mind that the script doesn't expect the input files to be named in a particular way; just run the command above, changing the arguments to match your input files:

python roary_plots.py YOUR_TREE.nwk YOUR_ROARY_OUTPUT.csv

Hope this helps,
Marco

@alichenari2018

alichenari2018 commented Feb 27, 2018 via email

@mgalardini
Contributor

Hi,

I'm not sure I understood your last message: if you look again at my previous reply you'll see a link to the roary_plots.py script. At any rate, you can download it like this:

wget -O roary_plots.py "https://raw.githubusercontent.com/sanger-pathogens/Roary/master/contrib/roary_plots/roary_plots.py"

Hope this helps,
Marco

@vappiah

vappiah commented Jun 17, 2020

Hi All,

I ran roary_plots.py for 12 GFF files and the trees were drawn all right, but no labels were shown.

@mgalardini
Contributor

Hi, did you add the --labels option to it?

@vappiah

vappiah commented Jun 17, 2020

Hi @mgalardini. I was confused about the --labels option, so I did not include it. Below is the command I used:
./roary_plots.py roaryresult2/mytree.newick roaryresult2/gene_presence_absence.csv

@mgalardini
Contributor

I see; retry it with the --labels option added and see if that solves your problem

@vappiah

vappiah commented Jun 17, 2020

Thanks @mgalardini, the --labels option worked. I noticed that some of my labels were truncated.
The labels with 4 characters were okay, but longer ones (such as mycobacterium_ulcerans_strain) were truncated to Mycobacter.
Is there a way to show the full names?

@mgalardini
Contributor

mgalardini commented Jun 17, 2020 via email

@vappiah

vappiah commented Jun 19, 2020

Thanks @mgalardini, I added the --format svg option. I can now edit the output using Inkscape.

@mgalardini
Contributor

Great, glad it worked!

@Julio92-C

Hi @mgalardini, is there a way to change the color of the output graph?

@mgalardini
Contributor

Yes, see this line and change the plt.cm.Blues part to have a different color for the heatmap.

@Julio92-C

Julio92-C commented Jul 13, 2021 via email
