Issues with HeaderlessTSVTaxonomyFormat #127

Closed
JTFouquier opened this issue Jun 24, 2017 · 10 comments

@JTFouquier

JTFouquier commented Jun 24, 2017

@thermokarst I think the issue I was having with the HeaderlessTSVTaxonomyFormat was possibly related to the wrong base class being used? I'm not sure but it is obviously not like the rest.
https://github.com/qiime2/q2-types/blob/master/q2_types/feature_data/_format.py#L69

I noticed this because I was getting this error and I didn't know why it was trying to do a transformation to a HeaderlessTSVTaxonomyFormat when I am positive I already gave it a HeaderlessTSVTaxonomyFormat.
[Screenshot: 2017-06-23 at 5:26:55 pm]

Well, it's also not defined yet, so I'll just make my own for now. Thanks for all your help!

@jairideout
Member

@JTFouquier we intentionally didn't create a transformer to turn files with headers into headerless files, in order to discourage using/generating these types of files (it's generally considered a bad practice in data science). Thus, we support importing headerless taxonomy files in order to be compatible with popular reference databases (Greengenes, for example, doesn't have headers in its taxonomy files), but we don't support exporting into a headerless format. The feature classifier tutorial has an example of importing a Greengenes (headerless) file into an artifact. Internally, the artifact (.qza file) stores the data with a header (i.e. it turns the HeaderlessTSVTaxonomyFormat into a TSVTaxonomyDirectoryFormat).

If you'd like to get your taxonomy file back into a headerless format, you can export the data from your .qza (e.g. using qiime tools export) and delete the header line from the file.
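
To make the header handling concrete, here is a minimal sketch of both directions: prepending a header (roughly what happens on import) and deleting it again after export. The header names `Feature ID` and `Taxon` are an assumption about the headered TSV layout, and `add_header` / `strip_header` are hypothetical helpers, not part of QIIME 2.

```python
# Assumed column names for the headered TSV taxonomy layout; verify against
# the file that `qiime tools export` actually writes for your version.
HEADER = ["Feature ID", "Taxon"]


def add_header(headerless_text):
    """Prepend a tab-separated header row to headerless taxonomy text."""
    return "\t".join(HEADER) + "\n" + headerless_text


def strip_header(headered_text):
    """Drop the first line (the header) and return the remaining rows."""
    _, _, rest = headered_text.partition("\n")
    return rest


headerless = (
    "SH1\tk__Fungi; p__Ascomycota\n"
    "SH2\tk__Fungi; p__Basidiomycota\n"
)
headered = add_header(headerless)
restored = strip_header(headered)
```

Stripping only the first line is enough here because the header is assumed to be a single row.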

Does that solve your issue? I might be misunderstanding what you're trying to do -- if you could provide more details about your use-case in q2-ghost-tree that would be helpful.

@JTFouquier
Author

Thanks @jairideout. Yes, the UNITE ITS database's taxonomy files are headerless, so I originally selected the HeaderlessTSVTaxonomy as the type.

This command:

qiime tools import --input-path minitaxonomy.txt --type FeatureData[Taxonomy] --output-path minitaxonomy_headerless.qza --source-format HeaderlessTSVTaxonomyFormat

produces this:

[Screenshot: 2017-06-29 at 10:09:28 am]

I did not realize that my .qza was not actually in HeaderlessTSVTaxonomyFormat because I chose that as the --source-format when I imported it to a .qza. I totally should have used qiime tools peek... I forgot.

And when I used qiime tools import with "TaxonomyGT" & "TaxonomyGTDirectoryFormat", I get this:
[Screenshot: 2017-06-29 at 10:09:34 am]

I understand what's going on and why you guys chose to convert to a TSVTaxonomyDirectoryFormat file, but it's not intuitive since I knew I was working with a headerless file. Does that make sense? This was just the only format I had trouble with. Perhaps there could be an alert in conversion cases like this? I'm sure it's documented somewhere but I missed it. :)

Thanks!

@ebolyen
Member

ebolyen commented Jun 29, 2017

@JTFouquier that misunderstanding makes complete sense. I'm not sure we'd want to alert/say anything when it happens as it's typically going to be the rule that the data is converted (and I don't want the user to mistake the alert as something they need to worry about). It's only in the case where you happen to already have it in the canonical format that it will leave it alone.

We definitely need some better education/docs on this feature. We're still kind of feeling out how we teach these ideas.

@JTFouquier
Author

Yeah I understand.... a warning would only be nice for developers and would be confusing for users.

One more thing related to this. Ghost-tree does have stricter requirements in the taxonomy file than a typical taxonomy file. Would I want to just use the existing checks in ghost-tree itself or do I want to proceed with my TaxonomyGTFormatDirectory... because at this point I can use the TSVTaxonomyDirectoryFormat now that I see what happened.

@ebolyen
Member

ebolyen commented Jun 29, 2017

@JTFouquier you may need a custom semantic type (potentially), could you remind us of the restrictions that ghost-tree has on the format/data?

@JTFouquier
Author

JTFouquier commented Jun 29, 2017

It has to contain genera, so 'g__' ....

I will be adding a feature soon to use other graft points (as an option) so 'f__' or 'o__' etc, .... so maybe it wouldn't make sense to have a special semantic type.

I would like to not have to require the 'g__' like format, but I don't always trust taxonomy files without that designation since taxonomy varies....
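
A check like the one described could be sketched as follows. This is only an illustration, assuming a two-column, tab-separated file with `;`-separated ranks, and assuming that an empty `g__` should count as missing. `missing_rank` is a hypothetical helper, not ghost-tree's actual validation.

```python
def missing_rank(taxonomy_lines, prefix="g__"):
    """Return IDs of features whose taxonomy has no named rank for `prefix`."""
    bad = []
    for line in taxonomy_lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        feature_id, _, taxon = line.partition("\t")
        ranks = [r.strip() for r in taxon.split(";")]
        # A rank counts only if a name follows the prefix:
        # "g__Mucor" passes, a bare "g__" does not (assumption).
        named = any(r.startswith(prefix) and len(r) > len(prefix) for r in ranks)
        if not named:
            bad.append(feature_id)
    return bad


lines = [
    "A\tk__Fungi; g__Mucor; s__x",
    "B\tk__Fungi; g__",
    "C\tk__Fungi; f__Mucoraceae",
]
no_genus = missing_rank(lines)
no_family = missing_rank(lines, prefix="f__")
```

Passing the prefix as a parameter means the same check would cover other graft points such as `f__` or `o__`.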

@ebolyen
Member

ebolyen commented Jul 6, 2017

Thinking about this a bit, I think you should probably organize it so that you take a given taxonomic level (a number, like in q2-feature-classifier). Then you aren't tied to the prefix, and it is the user's responsibility to identify at what level they would like the engraftment.

Barring that, you'll need some kind of validation to assert that the format has the prefixes you expect because our format and semantic type do not have an opinion (this is also why the level instead of prefix is probably a better idea). If you were to keep the prefix you might want to require a property on your type, e.g.:

FeatureData[Taxonomy % Properties('ghost-tree')] 

This fits because you expect a specific ontology for the taxonomy (the Greengenes ontology).
What this will do is force the user to import their data with this property, which will prevent users from passing FeatureData[Taxonomy] generally.

In short, all FeatureData[Taxonomy % Properties('ghost-tree')] are FeatureData[Taxonomy], but not all FeatureData[Taxonomy] are FeatureData[Taxonomy % Properties('ghost-tree')], which sounds like your current situation. But I think a taxonomic level as input is probably still the better way to go because then you don't need any of this.
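
As a rough sketch of the level-based idea, the position of a rank in the `;`-separated string can stand in for its prefix. The function name and the choice to return `None` for shallow classifications are assumptions for illustration, not how q2-feature-classifier does it.

```python
def taxon_at_level(taxon, level):
    """Return the first `level` ranks of a ';'-separated taxonomy string.

    Returns None when the taxonomy is not classified that deeply.
    """
    if level < 1:
        raise ValueError("level must be >= 1")
    # Split on ';' and ignore empty pieces, so prefixes such as "g__"
    # are never inspected; only the position of each rank matters.
    ranks = [r.strip() for r in taxon.split(";") if r.strip()]
    if len(ranks) < level:
        return None
    return "; ".join(ranks[:level])


with_prefixes = taxon_at_level("k__Fungi; p__Ascomycota; c__X", 2)
without_prefixes = taxon_at_level("Fungi;Ascomycota", 2)
too_shallow = taxon_at_level("k__Fungi", 3)
```

With this approach the same call works for files with and without Greengenes-style prefixes, and the user decides which level to graft at.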

@jairideout
Member

Closing as it looks like the original questions about HeaderlessTSVTaxonomyFormat were resolved.

@JTFouquier
Author

@ebolyen, coming back to this, can you please point me to where in q2-feature-classifier is a good example of how to use the taxonomic level instead of the prefix? I'm not really sure how to apply that. I have something like --graft-level=o as a click option, but the code still relies on the prefix. I'm happy changing it, I'm just confused about how to tell ghost-tree to use other taxonomic levels if I don't give it the prefix. Thanks! (issue is still closed).

@jairideout
Member

q2-taxa's collapse method may be helpful -- it accepts a taxonomic level parameter to collapse at (similar to QIIME 1's summarize_taxa.py script).
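
For illustration only, collapsing at a level can be sketched as grouping feature IDs by their taxonomy truncated to that level. This is not q2-taxa's implementation: `group_by_level` is a hypothetical helper, and skipping features that are not classified deeply enough is an assumption that may differ from what `collapse` does.

```python
from collections import defaultdict


def group_by_level(taxonomy, level):
    """Group feature IDs by their taxonomy truncated to `level` ranks.

    `taxonomy` maps feature ID -> ';'-separated taxonomy string.
    """
    groups = defaultdict(list)
    for feature_id, taxon in taxonomy.items():
        ranks = [r.strip() for r in taxon.split(";") if r.strip()]
        if len(ranks) < level:
            continue  # not classified deeply enough; skipped in this sketch
        groups["; ".join(ranks[:level])].append(feature_id)
    return dict(groups)


taxonomy = {
    "A": "k__Fungi; p__Asco; c__X",
    "B": "k__Fungi; p__Asco; c__Y",
    "C": "k__Fungi; p__Basidio",
    "D": "k__Fungi",
}
grouped = group_by_level(taxonomy, 2)
```

A `--graft-level` option could then take a number and pass it straight through as `level`, with no prefix lookup involved.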
