Sequence names starting with "Sp_" #65

snayfach · 2022-09-19T14:04:12Z

I built a taxdump using a custom taxonomy with the command:
taxonkit create-taxdump genome_taxonomy.tsv -A 1 -O out --force

A few of the accessions in genome_taxonomy.tsv start with "Sp_", and I noticed that this prefix was removed in the taxid.map output file, which caused some issues.

I'll find a workaround, but thought you might want to know.

shenwei356 · 2022-09-19T14:59:09Z

Thank you, Stephen.

There's a bug with the command you used: the column name of the accession column is treated as one of the ranks, which messes up all the ranks. I've fixed it but haven't released it yet. Please use the binary here:

- fix bug of handling non-GTDB data when using `-A/--field-accession` and no rank names given.

But that doesn't seem to be the issue you ran into. Can you paste some data to reproduce it?

shenwei356 · 2022-09-19T15:16:13Z

I figured out what happened. Please wait a few minutes.

shenwei356 · 2022-09-19T15:47:16Z

Fixed. The old default regular expression ^\w\w_(.+)$ was meant to strip the GB_ or RS_ prefix from accessions such as GB_GCA_001941065.1 in GTDB taxonomy data, but it also wrongly removed the Sp_ prefix. The default has now been changed:

--field-accession-re string       regular expression to extract assembly accession (default "^(.+)$")

I also fixed the command for creating a taxdump from MGV data.
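
To make the effect of the two defaults concrete, here is a small Python sketch that applies both patterns to a few names. It is only an illustration, not TaxonKit's code: TaxonKit is written in Go, `re.ASCII` is used so that `\w` behaves like Go's ASCII-only `\w`, the helper `extract_accession` and its fall-back to the unchanged name on a non-match are assumptions, and `Sp_0001` and `RS_GCF_001027105.1` are made-up example names.

```python
import re

# Old and new default values of --field-accession-re, as quoted above.
OLD_DEFAULT = re.compile(r"^\w\w_(.+)$", re.ASCII)
NEW_DEFAULT = re.compile(r"^(.+)$", re.ASCII)


def extract_accession(pattern, name):
    """Return the first capture group; fall back to the name itself
    when the pattern does not match (assumed behaviour, for illustration)."""
    match = pattern.match(name)
    return match.group(1) if match else name


# The first name is from the thread; the others are hypothetical examples.
names = ["GB_GCA_001941065.1", "RS_GCF_001027105.1", "Sp_0001", "GCF_001027105.1"]

for name in names:
    old = extract_accession(OLD_DEFAULT, name)
    new = extract_accession(NEW_DEFAULT, name)
    print(name, old, new, sep="\t")
```

The old pattern strips any two word characters followed by an underscore, so it cannot tell a GTDB source prefix from a user-chosen one such as "Sp_". The new pattern captures the whole field, so prefix stripping only happens when a pattern is passed explicitly via `--field-accession-re`.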

snayfach · 2022-09-19T15:52:41Z

Fixed! ... An unrelated question I was hoping you could answer: how should I format the input file for sequences that are unclassified at a given rank? Can I use "unclassified" or an empty string "" or do I need to include the parent taxon e.g. "unclassified_proteobacteria"?

shenwei356 · 2022-09-19T16:00:12Z

Just leave it blank (empty string ""). In taxid.map, the accession will then point to the closest classified node above the blank rank.

$ cat taxonomy.tsv  | csvtk pretty -t
id                superkingdom   phylum       class     order        family              genus            species
---------------   ------------   ----------   -------   ----------   -----------------   --------------   ---------------------
GCF_001027105.1   Bacteria       Firmicutes   Bacilli   Bacillales   Staphylococcaceae   Staphylococcus   Staphylococcus aureus
test              Bacteria       Firmicutes   Bacilli   Bacillales   Staphylococcaceae   Staphylococcus   

$ taxonkit create-taxdump -A 1 taxonomy.tsv -O t --force

$ cat t/taxid.map  | taxonkit lineage --data-dir t/ -i 2 -t  | csvtk pretty -Ht
GCF_001027105.1   1569132721   Bacteria;Firmicutes;Bacilli;Bacillales;Staphylococcaceae;Staphylococcus;Staphylococcus aureus   609216830;3642462009;1845768359;813944714;1997712377;1824050977;1569132721
test              1824050977   Bacteria;Firmicutes;Bacilli;Bacillales;Staphylococcaceae;Staphylococcus                         609216830;3642462009;1845768359;813944714;1997712377;1824050977
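
The rule shown in this output (a blank rank makes the accession resolve to the nearest classified rank above it) can be sketched in Python as follows. This is a hypothetical illustration of the rule only, not TaxonKit's implementation; the function name `deepest_classified` and the dict-based row format are assumptions, and rows with a blank rank in the middle of the lineage are not considered.

```python
# Rank columns in the same order as the taxonomy.tsv example above.
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def deepest_classified(row):
    """Return (rank, name) for the lowest rank with a non-empty name,
    or None if every rank is blank."""
    for rank in reversed(RANKS):
        name = row.get(rank, "").strip()
        if name:
            return rank, name
    return None


# The two rows from the example table; "test" has a blank species field.
full = dict(zip(RANKS, ["Bacteria", "Firmicutes", "Bacilli", "Bacillales",
                        "Staphylococcaceae", "Staphylococcus",
                        "Staphylococcus aureus"]))
partial = dict(full, species="")

print(deepest_classified(full))
print(deepest_classified(partial))
```

For the "test" row, this picks the genus Staphylococcus, which matches the taxid.map output above, where "test" maps to the same TaxId as the genus node (1824050977).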

shenwei356 added a commit that referenced this issue Sep 19, 2022

fix the default option value of --field-accession-re, #65

a93923a

snayfach closed this as completed Sep 19, 2022

shenwei356 mentioned this issue Sep 22, 2022

Update TaxonKit to v0.13.0 bioconda/bioconda-recipes#37081

Merged
