Sequence names starting with "Sp_" #65

Closed
snayfach opened this issue Sep 19, 2022 · 5 comments

@snayfach

I built a taxdump using a custom taxonomy with the command:
taxonkit create-taxdump genome_taxonomy.tsv -A 1 -O out --force

A few of the accessions in genome_taxonomy.tsv start with "Sp_", and I noticed that this prefix was removed in the taxid.map output file, which caused some issues.

I'll find a workaround, but thought you might want to know.

@shenwei356
Owner

shenwei356 commented Sep 19, 2022

Thank you, Stephen.

There was a bug with the command you used: the column name of the accession column was treated as one of the ranks, which messed up all the ranks. I've fixed it but haven't released it yet. Please use the binary here:

- fix bug of handling non-GTDB data when using `-A/--field-accession` and no rank names given.

But that doesn't seem to be the issue you ran into. Can you paste some data to reproduce it?

@shenwei356
Owner

I figured out what happened. Please wait a few minutes.

@shenwei356
Owner

shenwei356 commented Sep 19, 2022

Fixed. The old default regular expression `^\w\w_(.+)$` wrongly removed the Sp_ prefix. It was meant to strip the GB_ or RS_ prefix from accessions such as GB_GCA_001941065.1 in GTDB taxonomy data. The default has now been changed:

--field-accession-re string       regular expression to extract assembly accession (default "^(.+)$")

Also fixed the command for creating a taxdump from MGV data.
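To see why the old default dropped the Sp_ prefix, here is a small Python sketch that applies the two patterns quoted above to a few accessions. It is only an illustration with Python's `re` module, not taxonkit's own code. The helper `extract_accession`, its fallback of returning the raw value when the pattern does not match, and the accession `Sp_genome_0001` are assumptions made up for this example.

```python
import re

OLD_DEFAULT = r"^\w\w_(.+)$"  # old default of --field-accession-re, as quoted above
NEW_DEFAULT = r"^(.+)$"       # new default, as quoted above


def extract_accession(raw: str, pattern: str) -> str:
    """Return capture group 1 if the pattern matches.

    Assumption for this sketch: if the pattern does not match,
    the raw value is returned unchanged.
    """
    match = re.search(pattern, raw)
    return match.group(1) if match else raw


# "Sp_genome_0001" is a hypothetical accession used only for illustration.
examples = ["GB_GCA_001941065.1", "Sp_genome_0001", "GCF_001027105.1"]

for acc in examples:
    old = extract_accession(acc, OLD_DEFAULT)
    new = extract_accession(acc, NEW_DEFAULT)
    print(f"{acc}\told: {old}\tnew: {new}")
```

Any accession whose first two word characters are followed by an underscore matches the old pattern, so `Sp_` is stripped just like `GB_` or `RS_`. With the new default the whole string is captured, so GTDB-style input would presumably need the old pattern passed explicitly via `--field-accession-re`.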

@snayfach
Author

snayfach commented Sep 19, 2022

Fixed! ... An unrelated question I was hoping you could answer: how should I format the input file for sequences that are unclassified at a given rank? Can I use "unclassified" or an empty string "" or do I need to include the parent taxon e.g. "unclassified_proteobacteria"?

@shenwei356
Owner

shenwei356 commented Sep 19, 2022

Just leave it blank (empty string ""). In taxid.map, the accession will then point to the closest node above the blank rank:

$ cat taxonomy.tsv  | csvtk pretty -t
id                superkingdom   phylum       class     order        family              genus            species
---------------   ------------   ----------   -------   ----------   -----------------   --------------   ---------------------
GCF_001027105.1   Bacteria       Firmicutes   Bacilli   Bacillales   Staphylococcaceae   Staphylococcus   Staphylococcus aureus
test              Bacteria       Firmicutes   Bacilli   Bacillales   Staphylococcaceae   Staphylococcus   

$ taxonkit create-taxdump -A 1 taxonomy.tsv -O t --force

$ cat t/taxid.map  | taxonkit lineage --data-dir t/ -i 2 -t  | csvtk pretty -Ht
GCF_001027105.1   1569132721   Bacteria;Firmicutes;Bacilli;Bacillales;Staphylococcaceae;Staphylococcus;Staphylococcus aureus   609216830;3642462009;1845768359;813944714;1997712377;1824050977;1569132721
test              1824050977   Bacteria;Firmicutes;Bacilli;Bacillales;Staphylococcaceae;Staphylococcus                         609216830;3642462009;1845768359;813944714;1997712377;1824050977
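The mapping rule shown in the output above (an accession with a blank species ends up at its genus node) can be sketched in Python as follows. This is a rough illustration of the idea, not how taxonkit implements it. The function name `deepest_assigned` is hypothetical, and the sketch only covers the case from the example, where the blank ranks are at the end of the lineage.

```python
# Rank columns in the same order as the header of taxonomy.tsv above.
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def deepest_assigned(row: dict):
    """Return (rank, name) for the lowest rank with a non-empty name.

    Returns None if every rank is blank. Hypothetical helper for illustration.
    """
    for rank in reversed(RANKS):
        name = row.get(rank, "").strip()
        if name:
            return rank, name
    return None


# The two rows from the example table above.
rows = {
    "GCF_001027105.1": dict(zip(RANKS, [
        "Bacteria", "Firmicutes", "Bacilli", "Bacillales",
        "Staphylococcaceae", "Staphylococcus", "Staphylococcus aureus",
    ])),
    "test": dict(zip(RANKS, [
        "Bacteria", "Firmicutes", "Bacilli", "Bacillales",
        "Staphylococcaceae", "Staphylococcus", "",
    ])),
}

for accession, row in rows.items():
    print(accession, deepest_assigned(row))
```

Here `test` resolves to the genus Staphylococcus, which matches the taxid.map output above, where `test` has the same TaxId (1824050977) as the genus entry in the lineage of GCF_001027105.1.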
