[Feature Request] Fuzzy name searching with name2taxid #88

jolespin · 2023-11-01T15:24:13Z

Prerequisites

make sure you are using the latest version (check with taxonkit version)
read the usage

Describe your issue

describe the problem

Similar to https://github.com/etetoolkit/ete/blob/1582ea2aa0d28065f4757b8b5af74367f6abe19f/ete4/ncbi_taxonomy/ncbiquery.py#L112C30-L112C30

    def get_fuzzy_name_translation(self, name, sim=0.9):
        """Return taxid, species name and match score from the NCBI database.

        The results are for the best match for name in the NCBI
        database of taxa names, with a word similarity >= `sim`.

        :param name: Species name (does not need to be exact).
        :param 0.9 sim: Min word similarity to report a match (from 0 to 1).
        """

For example, EukZoo has an annotation from source organism ID AddRef0031, labeled as species Paramecium tetraurelia and strain Stock d4-2. A manual search turns up https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=412030, but the name there lacks the keyword "Stock", even though the two are certainly the same organism.

Would it be possible to include this type of searching?
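
To make the request concrete, here is a minimal sketch of a threshold-based fuzzy lookup in the spirit of the `sim` parameter quoted above. It uses Python's standard `difflib.SequenceMatcher` as a stand-in similarity measure; it is not how ete or taxonkit compute similarity. The function name `fuzzy_name2taxid` and the tiny `NAMES` table are hypothetical; a real lookup would load names.dmp.

```python
from difflib import SequenceMatcher

# Hypothetical miniature name table (name -> taxid), for illustration only.
# A real tool would read these pairs from NCBI's names.dmp.
NAMES = {
    "Paramecium tetraurelia strain d4-2": 412030,
    "Homo sapiens": 9606,
    "Escherichia coli": 562,
}


def fuzzy_name2taxid(query, names, sim=0.9):
    """Return (taxid, matched_name, score) for the best match, or None.

    A match is reported only if its similarity ratio is >= sim.
    The ratio comes from difflib, an assumption made for this sketch.
    """
    best = None
    for name, taxid in names.items():
        # Compare case-insensitively; ratio() is in the range [0, 1].
        score = SequenceMatcher(None, query.lower(), name.lower()).ratio()
        if score >= sim and (best is None or score > best[2]):
            best = (taxid, name, score)
    return best


if __name__ == "__main__":
    print(fuzzy_name2taxid("Paramecium tetraurelia strain Stock d4-2", NAMES))
```

With this table, the query containing the extra word "Stock" still resolves to the d4-2 strain entry, while an unrelated query returns None. A linear scan like this would be far too slow for millions of names, which is why real implementations rely on an index.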


fgvieira · 2024-03-15T13:38:35Z

This would be quite helpful!

shenwei356 · 2024-03-15T18:31:01Z

Oh, I missed this issue before. There are some existing packages I can use.

shenwei356 · 2024-06-15T10:56:26Z

Implemented with https://github.com/suggest-go/suggest/ . It supports writing the index to a file, but I haven't implemented that. So right now the index is held in memory only, and it is slow because it has to be rebuilt on every run.

Fuzzy match:

memusg -t -s "echo Paramecium tetraurelia strain Stock d4-2 | taxonkit name2taxid -f  --verbose | taxonkit lineage -L -nr -i 2"
11:52:09.824 [INFO] parsing names file: /home/shenwei/.taxonkit/names.dmp
11:52:13.027 [INFO] 3942782 names parsed
11:52:13.027 [INFO] creating indexing for name searching ...
11:52:47.166 [INFO] indexing finished
Paramecium tetraurelia strain Stock d4-2        412030  Paramecium tetraurelia strain d4-2      strain

elapsed time: 37.530s
peak rss: 3.6 GB

Exact match:

memusg -t -s "echo Paramecium tetraurelia strain Stock d4-2 | ./taxonkit name2taxid  --verbose | taxonkit lineage -L -nr -i 2"
11:51:34.730 [INFO] parsing names file: /home/shenwei/.taxonkit/names.dmp
11:51:37.907 [INFO] 3942782 names parsed
Paramecium tetraurelia strain Stock d4-2

elapsed time: 3.328s
peak rss: 1.73 GB

Try it:

  -f, --fuzzy             allow fuzzy match
  -n, --fuzzy-top-n int   choose top n matches in fuzzy search (default 1)
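
As a rough illustration of what "top n matches" means, the sketch below ranks candidate names by similarity and keeps the best n. It uses Python's standard `difflib.get_close_matches`, not the suggest-go index that taxonkit uses, so scores and ranking details may differ. The helper name `top_n_matches`, the candidate list, and the cutoff of 0.6 are assumptions made for this example.

```python
from difflib import get_close_matches

# Hypothetical candidate names, for illustration only.
CANDIDATES = [
    "Homo sapiens",
    "Paramecium tetraurelia",
    "Paramecium tetraurelia strain d4-2",
]


def top_n_matches(query, candidates, n=1, cutoff=0.6):
    """Return up to n candidates, most similar first.

    Candidates scoring below cutoff are dropped. The comparison is
    case-sensitive, as implemented by difflib.
    """
    return get_close_matches(query, candidates, n=n, cutoff=cutoff)


if __name__ == "__main__":
    query = "Paramecium tetraurelia strain Stock d4-2"
    print(top_n_matches(query, CANDIDATES, n=1))
    print(top_n_matches(query, CANDIDATES, n=2))
```

With n=1 only the closest name is returned; raising n also lists the next-best candidates, which helps when the top hit is not the intended taxon.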

shenwei356 added the new feature label Mar 15, 2024

shenwei356 added a commit that referenced this issue Jun 15, 2024

name2taxid: Add support of fuzzy match. #88

6547781

shenwei356 mentioned this issue Jun 15, 2024

[Feature] taxonkit list from taxon name #93

Open

2 tasks

shenwei356 mentioned this issue Jul 3, 2024

Update TaxonKit to 0.17.0 bioconda/bioconda-recipes#48917

Merged
