[Feature Request] Fuzzy name searching with name2taxid #88

Open
3 tasks done
jolespin opened this issue Nov 1, 2023 · 3 comments

Comments

@jolespin

jolespin commented Nov 1, 2023

Prerequisites

  • make sure you are using the latest version by running taxonkit version
  • read the usage

Describe your issue

  • describe the problem

Similar to https://github.com/etetoolkit/ete/blob/1582ea2aa0d28065f4757b8b5af74367f6abe19f/ete4/ncbi_taxonomy/ncbiquery.py#L112C30-L112C30

    def get_fuzzy_name_translation(self, name, sim=0.9):
        """Return taxid, species name and match score from the NCBI database.

        The results are for the best match for name in the NCBI
        database of taxa names, with a word similarity >= `sim`.

        :param name: Species name (does not need to be exact).
        :param 0.9 sim: Min word similarity to report a match (from 0 to 1).
        """

For example, EukZoo has an annotation from source organism ID AddRef0031, labeled as species Paramecium tetraurelia and strain Stock d4-2. A manual search leads to https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=412030, but the name of that entry lacks the keyword "Stock", even though the two are certainly the same organism.

Would it be possible to include this type of searching?
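
To make the request concrete, here is a minimal sketch of the kind of lookup meant above. It uses Python's standard `difflib` for the similarity score, which is an assumption for illustration and not necessarily how ete or taxonkit compute similarity. The name table is a toy one: only the 412030 entry comes from the NCBI link above, and the negative taxids are placeholders.

```python
from difflib import SequenceMatcher

# Toy name table. Only the 412030 entry is taken from the NCBI page linked
# above; the negative taxids are placeholders, not real NCBI identifiers.
NAMES = {
    "Paramecium tetraurelia strain d4-2": 412030,
    "Paramecium tetraurelia": -1,
    "Paramecium caudatum": -2,
}


def fuzzy_name2taxid(name, names=NAMES, sim=0.9):
    """Return (taxid, matched_name, score) for the best match, or None.

    A candidate is only considered if its similarity to `name` is >= `sim`.
    """
    best = None
    for candidate, taxid in names.items():
        # Case-insensitive character-level similarity in [0, 1].
        score = SequenceMatcher(None, name.lower(), candidate.lower()).ratio()
        if score >= sim and (best is None or score > best[2]):
            best = (taxid, candidate, score)
    return best
```

With this toy table, the query "Paramecium tetraurelia strain Stock d4-2" resolves to the d4-2 strain entry, while an unrelated name returns `None`. A linear scan like this would be far too slow for the full names.dmp, so a real implementation would need an index.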

@fgvieira

This would be quite helpful!

@shenwei356
Owner

shenwei356 commented Mar 15, 2024

Oh, I missed this issue before. There are some existing packages I can use.

@shenwei356
Owner

Implemented with https://github.com/suggest-go/suggest/. The library supports writing the index to a file, but I haven't implemented that. So right now the index is in-memory and has to be rebuilt on every run, which is slow.
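
For readers wondering what persisting the index would buy, here is a rough sketch of the idea: build an index once, write it to disk, and reload it on later runs. It is not the suggest-go index format or API. The trigram index, the `pickle` cache, and every function name here are assumptions made for illustration.

```python
import os
import pickle
from collections import defaultdict


def trigrams(name):
    """Split a lowercased, padded name into its set of 3-character grams."""
    padded = f"  {name.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def build_index(names):
    """Map each trigram to the positions of the names that contain it."""
    index = defaultdict(set)
    for position, name in enumerate(names):
        for gram in trigrams(name):
            index[gram].add(position)
    return dict(index)


def load_or_build_index(names, cache_path):
    """Reuse a cached index if one exists, otherwise build and save it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as handle:
            return pickle.load(handle)
    index = build_index(names)
    with open(cache_path, "wb") as handle:
        pickle.dump(index, handle)
    return index
```

The trade-off is cache invalidation: this sketch never checks whether the names file changed after the cache was written, so a real version would also have to compare a timestamp or checksum of names.dmp.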

Fuzzy match:

memusg -t -s "echo Paramecium tetraurelia strain Stock d4-2 | taxonkit name2taxid -f  --verbose | taxonkit lineage -L -nr -i 2"
11:52:09.824 [INFO] parsing names file: /home/shenwei/.taxonkit/names.dmp
11:52:13.027 [INFO] 3942782 names parsed
11:52:13.027 [INFO] creating indexing for name searching ...
11:52:47.166 [INFO] indexing finished
Paramecium tetraurelia strain Stock d4-2        412030  Paramecium tetraurelia strain d4-2      strain

elapsed time: 37.530s
peak rss: 3.6 GB

Exact match:

memusg -t -s "echo Paramecium tetraurelia strain Stock d4-2 | ./taxonkit name2taxid  --verbose | taxonkit lineage -L -nr -i 2"
11:51:34.730 [INFO] parsing names file: /home/shenwei/.taxonkit/names.dmp
11:51:37.907 [INFO] 3942782 names parsed
Paramecium tetraurelia strain Stock d4-2

elapsed time: 3.328s
peak rss: 1.73 GB

Try it:

  -f, --fuzzy             allow fuzzy match
  -n, --fuzzy-top-n int   choose top n matches in fuzzy search (default 1)
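
If you want to call this from a script, a small wrapper could look like the sketch below. The `--fuzzy` and `--fuzzy-top-n` flags are the ones listed above, and names are passed on stdin as in the commands earlier in this thread. The helper names are hypothetical, and the parser assumes tab-separated output with the query name in the first column and the taxid in the second (empty when nothing matched).

```python
import subprocess


def build_command(top_n=1, taxonkit="taxonkit"):
    """Build the argument list for a fuzzy name2taxid call."""
    return [taxonkit, "name2taxid", "--fuzzy", "--fuzzy-top-n", str(top_n)]


def parse_output(text):
    """Parse assumed 'name<TAB>taxid' lines into (name, taxid or None) pairs."""
    hits = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        taxid = fields[1] if len(fields) > 1 and fields[1] else None
        hits.append((fields[0], taxid))
    return hits


def run_fuzzy_lookup(names, top_n=1):
    """Pipe names to taxonkit on stdin and return the parsed hits."""
    result = subprocess.run(
        build_command(top_n),
        input="\n".join(names) + "\n",
        capture_output=True,
        text=True,
        check=True,  # raise if taxonkit exits with an error
    )
    return parse_output(result.stdout)
```

Since the index is rebuilt on every run, passing all names to one `run_fuzzy_lookup` call is likely much cheaper than calling it once per name.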
