All numeric input file base names prevents labeling #178

louellette · 2020-02-10T21:50:36Z

Summary:

All numeric input file base names prevent labeling

Description:

If all the .fna files in the input directory have numeric base names, then the labels.txt file is not being used. It appears as if the file base names become floating-point. See attached pdf output.

Reproducible Steps:

average_nucleotide_identity.py -i -o pyani --labels labels.txt -g --gmethod mpl --gformat pdf --workers 16 -l pyani.log -m ANIb

Let me know if you need more details.

Current Output:

All looks good on terminal output.

Faecalibacterium_prausnitzii_ANIb_percentage_identity.pdf

Expected Output:

Labeled output.

pyani Version:

pyani version: 0.2.9

installed dependencies

If you are running a version of pyani v0.3 or later, then please run the command pyani listdeps at the command line, and enter the output below.

Python Version:

Python 3.7.4

Operating System:

ubuntu 16.04

widdowquinn · 2020-02-11T07:56:52Z

Hi @louellette,

Thanks for your interest in pyani.

I think I see what the problem might be. It looks like IDs are being used that have the format of floating-point numbers, and I'm not catching that somewhere. That might be in the FASTA input, or the labels.txt file (or both).

Would it please be possible for you to attach a minimal failing example (only needs a couple of genomes, and the labels.txt and classes.txt files) so I can test and confirm a fix?

L.

louellette · 2020-02-11T13:44:12Z

Thanks, I emailed the files but it looks like they did not go through due to the size of the fasta files. I'm attaching the labels.txt file here. I suppose you could just rename any old fasta file to the names in my labels.txt to test. Thank you, Lisa
labels.txt

widdowquinn · 2020-02-11T16:28:54Z

Thanks, Lisa. I think you're correct about naming the FASTA files. I'll get onto that.

L.

widdowquinn · 2020-02-11T18:48:13Z

Hi Lisa,

I've been investigating this, and I have a few comments.

The problem is not with the labels.txt file. It is actually with the input filenames. You've not listed what these are (though I can guess from the PDF), and I can only reproduce your issue if I rename all input files to have numbers (with decimal points) for names. Having (what can be interpreted as) floating point numbers for sequence filenames isn't something I've come across very much.

The analysis results are represented in a pandas dataframe, and the Index for this dataframe is taken to be made up of the input filenames. pandas tries to be helpful (it isn't helpful, in this case) and if all input filenames are interpretable as floating point numbers, it will - silently - create a Float64Index. I think I missed this in testing because it's not something I expected pandas to do, and I wasn't expecting all the input files to have names that were strings that could be interpreted as floating point numbers. This is quite a corner case!

Simply remapping the index to str doesn't work properly, because the string representation of some of those float64s isn't exact.
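The index behaviour described above can be reproduced with a small sketch. The table values and the names `853.10` and `853.157` are made up for illustration, and a tab-separated layout with an empty top-left header cell is assumed; this is not pyani's own code.

```python
import io

import pandas as pd

# Hypothetical two-genome identity table. Row and column labels are
# file base names that happen to look like floating-point numbers.
tsv = "\t853.10\t853.157\n853.10\t1.0\t0.98\n853.157\t0.98\t1.0\n"

df = pd.read_csv(io.StringIO(tsv), sep="\t", index_col=0)

# Row labels were inferred as numbers, so the index is float64
print(df.index.dtype)

# Header labels are untouched and remain strings
print(list(df.columns))

# Converting back to str does not recover the original text:
# the trailing zero of "853.10" is gone
print(list(df.index.astype(str)))
```

Because the row labels and the column labels no longer agree, any lookup keyed on the original file base name can miss on one axis.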

I'm coming to the end of my train journey, so will have to pause for a while. The root of the problem is pandas interpreting your filenames as floating point numbers when it reads in the tabular output. The quickest solution would be to rename (at least one of) your sequence files. I will fix the bug, but if you need to make progress now, this is the quickest option.

L.

This was a knotty problem. The user had input files whose names were all interpretable as floating-point numbers. When loading the output CSV files, the file names were interpreted as float64 datatypes, but only in the dataframe index, not the headers. This meant that the label/class files were not being used, and the sequence names in the graphical output had extra digits where the floats did not have an exact representation. The solution was to read the CSV file without the `index_col=0` argument, but specifying that column zero should be the str datatype. Once loaded, this column could be specified as a new index for the dataframe. This fix forms the basis for release 0.2.10.
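A minimal sketch of that approach, assuming a tab-separated table whose top-left header cell is empty (pandas names such a column `Unnamed: 0`). The table values and the `labels` mapping are hypothetical stand-ins, and the code is an illustration rather than the actual change made in pyani.

```python
import io

import pandas as pd

# Hypothetical identity table with float-like file base names
tsv = "\t853.10\t853.157\n853.10\t1.0\t0.98\n853.157\t0.98\t1.0\n"

# Read without index_col=0, forcing the first column to be kept as text
df = pd.read_csv(io.StringIO(tsv), sep="\t", dtype={"Unnamed: 0": str})

# Promote the text column to the index once loading is done
df = df.set_index("Unnamed: 0")
df.index.name = None

# Row and column labels now match, so one mapping applies to both axes
labels = {"853.10": "strain_A", "853.157": "strain_B"}  # hypothetical labels
labelled = df.rename(index=labels, columns=labels)
print(labelled)
```

Keeping the names as text from the start avoids any float-to-string round trip, so the labels are exactly what appeared in the file.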

louellette · 2020-02-11T19:29:07Z

Thank you. I have implemented the workaround which simply involved prefixing my filenames with a letter and using that file name in the labels.txt file. Works fine.

just fyi, the file names I was using are PATRIC genome ids.
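For anyone needing the same workaround, the renaming could be scripted roughly as follows. This is only a sketch: the function name `prefix_numeric_names`, the prefix `g`, and the `.fna` suffix are assumptions for illustration, and the keys in labels.txt would still need updating to the new names.

```python
from pathlib import Path


def prefix_numeric_names(directory, prefix="g", suffix=".fna"):
    """Rename files whose base name parses as a float by adding a prefix.

    Returns a dict mapping each old base name to its new base name.
    """
    renamed = {}
    for path in sorted(Path(directory).glob(f"*{suffix}")):
        try:
            float(path.stem)  # e.g. "853.157" parses, "ecoli" does not
        except ValueError:
            continue  # not float-like, leave the file alone
        new_path = path.with_name(prefix + path.name)
        path.rename(new_path)
        renamed[path.stem] = new_path.stem
    return renamed


# Example call (hypothetical directory name):
# mapping = prefix_numeric_names("genomes")
```

The returned mapping can be used to rewrite the first column of labels.txt so that it matches the renamed files.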

widdowquinn · 2020-02-12T08:28:25Z

Hi Lisa,

Release v0.2.10 should have fixed this bug - my test input/output is in the attached files. I hope it works for you.

labels.txt
ANIm_percentage_identity.pdf

I see why you ended up with filenames in that form, now. I'm surprised this hasn't come up before!

There should be bioconda and pypi releases along soon. I'll close the issue once they're up (assuming this fixes your problem).

L.

widdowquinn · 2020-02-13T08:38:19Z

Hi Lisa,

That's v0.2.10 up at:

pypi: https://pypi.org/project/pyani/0.2.10/
bioconda: https://anaconda.org/bioconda/pyani

I'll close this issue now, but please do raise any other problems you find, and thank you so much for finding this bug - I don't think I'd ever have noticed it.

L.

widdowquinn self-assigned this Feb 11, 2020

widdowquinn added the bug (something isn't working how it should) label Feb 11, 2020

widdowquinn added a commit that referenced this issue Feb 11, 2020

update CHANGES.md for issue #178 fix

d61ce10

widdowquinn added a commit that referenced this issue Feb 12, 2020

Merge branch 'issue_178' into version_0_2

3d06ecd

Applies changes that fix issue #178 and tidy surrounding code.

widdowquinn closed this as completed Feb 13, 2020
