All numeric input file base names prevents labeling #178

Closed
louellette opened this issue Feb 10, 2020 · 7 comments

@louellette

Summary:

All numeric input file base names prevent labeling

Description:

If all the .fna files in the input directory have numeric base names, then the labels.txt file is not being used. It appears as if the file base names become floating-point. See attached pdf output.

Reproducible Steps:

average_nucleotide_identity.py -i -o pyani --labels labels.txt -g --gmethod mpl --gformat pdf --workers 16 -l pyani.log -m ANIb

Let me know if you need more details.

Current Output:

All looks good on terminal output.

Faecalibacterium_prausnitzii_ANIb_percentage_identity.pdf

Expected Output:

Labeled output.

pyani Version:

pyani version: 0.2.9

installed dependencies

If you are running a version of pyani v0.3 or later, then please run the command pyani listdeps at the command line, and enter the output below.

Python Version:

Python 3.7.4

Operating System:

ubuntu 16.04

@widdowquinn
Owner

Hi @louellette,

Thanks for your interest in pyani.

I think I see what the problem might be. It looks like IDs are being used that have the format of floating-point numbers, and I'm not catching that somewhere. That might be in the FASTA input, or the labels.txt file (or both).

Would it please be possible for you to attach a minimal failing example (only needs a couple of genomes, and the labels.txt and classes.txt files) so I can test and confirm a fix?

L.

@widdowquinn widdowquinn self-assigned this Feb 11, 2020
@widdowquinn widdowquinn added the bug something isn't working how it should label Feb 11, 2020
@louellette
Author

Thanks, I emailed the files but it looks like they did not go through due to the size of the fasta files. I'm attaching the labels.txt file here. I suppose you could just rename any old fasta file to the names in my labels.txt to test. Thank you, Lisa
labels.txt

@widdowquinn
Owner

Thanks, Lisa. I think you're correct about naming the FASTA files. I'll get onto that.

L.

@widdowquinn
Owner

widdowquinn commented Feb 11, 2020

Hi Lisa,

I've been investigating this, and I have a few comments.

The problem is not with the labels.txt file; it is with the input filenames. You've not listed what these are (though I can guess from the PDF), and I can only reproduce your issue if I rename all input files so that their names are numbers with decimal points. Sequence filenames that can be interpreted as floating-point numbers aren't something I've come across very much.

The analysis results are represented in a pandas dataframe, and the Index for this dataframe is taken to be made up of the input filenames. pandas tries to be helpful (it isn't helpful, in this case) and if all input filenames are interpretable as floating point numbers, it will - silently - create a Float64Index. I think I missed this in testing because it's not something I expected pandas to do, and I wasn't expecting all the input files to have names that were strings that could be interpreted as floating point numbers. This is quite a corner case!
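The behaviour described above can be reproduced in isolation. The following is a minimal sketch, not pyani's own code: the table contents and the names 853.166 and 853.170 are made up for illustration, and the tab separator is an assumption about the output format. It shows `pd.read_csv` with `index_col=0` inferring a floating-point index while leaving the column headers as strings.

```python
import io

import pandas as pd

# Hypothetical tab-separated identity table; row and column names mimic
# input filenames that all look like floating-point numbers.
table = "\t853.166\t853.170\n853.166\t1.0\t0.98\n853.170\t0.98\t1.0\n"

# Reading with index_col=0 lets pandas infer the dtype of the index column.
df = pd.read_csv(io.StringIO(table), sep="\t", index_col=0)

# The index has been inferred as float, but the headers are still strings,
# so row names and column names no longer match each other (or labels.txt).
index_is_float = pd.api.types.is_float_dtype(df.index)
columns_are_str = all(isinstance(c, str) for c in df.columns)

print("index dtype:", df.index.dtype)
print("index is float:", index_is_float)
print("columns are str:", columns_are_str)
```

Because label lookups use the original filename strings, a float-typed index gives them nothing to match against.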

Simply remapping the index to str doesn't work properly, because the string representation of some of those float64s isn't exact.
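As a hedged illustration of why converting back to str cannot be relied on: once a name has been parsed as a float, its original text is gone. The sketch below uses plain Python and hypothetical names. It shows the trailing-zero case, which is a different symptom from the extra digits reported in this thread and may not be the exact failure seen in pyani.

```python
# Hypothetical filename stems that happen to look like decimal numbers.
names = ["1234.10", "1234.5"]

# Parse each name as a float, then convert it back to a string.
round_tripped = [str(float(name)) for name in names]

# Collect the names whose text did not survive the round trip.
changed = [
    (before, after)
    for before, after in zip(names, round_tripped)
    if before != after
]

print(changed)
```

Any name that changes in this way can no longer be matched against the corresponding entry in labels.txt.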

I'm coming to the end of my train journey, so will have to pause for a while. The root of the problem is pandas interpreting your filenames as floating point numbers when it reads in the tabular output. The quickest solution would be to rename (at least one of) your sequence files. I will fix the bug, but if you need to make progress now, this is the quickest option.

L.

widdowquinn added a commit that referenced this issue Feb 11, 2020
This was a knotty problem. The user had input files whose names were
all interpretable as floating-point numbers. When loading the output
CSV files, the names of files were interpreted as float64 datatypes,
but only in the dataframe index, not the headers. This meant that
the label/class files were not being used, and the sequence names
in the graphical output had extra digits where the floats did not
have an exact representation.

The solution was to read the CSV file without the `index_col=0`
argument, but specifying that column zero should be the str
datatype. Once loaded, this column could be specified as a new
index for the dataframe.

This fix forms the basis for release 0.2.10
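
The approach in the commit message above can be sketched as follows. This is an illustration under stated assumptions rather than the actual pyani change: the table contents are hypothetical, the tab separator is assumed, and the column name "Unnamed: 0" is simply what pandas assigns when the first header cell is empty.

```python
import io

import pandas as pd

# Hypothetical tab-separated table whose names all look like floats.
table = "\t1234.10\t1234.5\n1234.10\t1.0\t0.98\n1234.5\t0.98\t1.0\n"

# Read without index_col=0, forcing the first column to be read as str.
# With an empty first header cell, pandas names that column "Unnamed: 0".
df = pd.read_csv(io.StringIO(table), sep="\t", dtype={"Unnamed: 0": str})

# Promote the string column to the index once the data is loaded.
df = df.set_index("Unnamed: 0")
df.index.name = None

print(list(df.index))
print(list(df.columns))
```

With both the index and the headers held as strings, the row names keep their original text, including any trailing zeros.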
widdowquinn added a commit that referenced this issue Feb 11, 2020
@louellette
Author

Thank you. I have implemented the workaround which simply involved prefixing my filenames with a letter and using that file name in the labels.txt file. Works fine.
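
For anyone applying the same workaround to many files, a rough sketch of the renaming step is shown below. The function names, the default prefix "g" and the ".fna" suffix are illustrative assumptions, and the matching entries in labels.txt would still need the same prefix added separately.

```python
from pathlib import Path


def looks_numeric(stem: str) -> bool:
    """Return True if the filename stem can be parsed as a float."""
    try:
        float(stem)
    except ValueError:
        return False
    return True


def prefix_numeric_names(directory, prefix="g", suffix=".fna"):
    """Rename numeric-looking files in place; return {old_stem: new_stem}."""
    renamed = {}
    # sorted() materialises the listing before any file is renamed.
    for path in sorted(Path(directory).glob(f"*{suffix}")):
        if looks_numeric(path.stem):
            new_path = path.with_name(prefix + path.name)
            path.rename(new_path)
            renamed[path.stem] = new_path.stem
    return renamed
```

Calling `prefix_numeric_names("genomes")` on a hypothetical directory returns the old-to-new stem mapping, which can then be used to update labels.txt.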

just fyi, the file names I was using are PATRIC genome ids.

widdowquinn added a commit that referenced this issue Feb 12, 2020
Applies changes that fix issue #178 and tidy surrounding code.
@widdowquinn
Owner

widdowquinn commented Feb 12, 2020

Hi Lisa,

Release v0.2.10 should have fixed this bug - my test input/output is in the attached files. I hope it works for you.

labels.txt
ANIm_percentage_identity.pdf

I see why you ended up with filenames in that form, now. I'm surprised this hasn't come up before!

There should be bioconda and pypi releases along soon. I'll close the issue once they're up (assuming this fixes your problem).

L.

@widdowquinn
Owner

Hi Lisa,

That's v0.2.10 up at:

I'll close this issue now, but please do raise any other problems you find, and thank you so much for finding this bug - I don't think I'd ever have noticed it.

L.
