New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ways to improve fuzzy search speed on larger data sets? #607
Comments
Update: I've created a table of only unique names. This reduces the search space from over 16 million, to just about 640,000. Interestingly, it takes less than 2 seconds to create this table using Python. Performing the same search that we did earlier for Any ideas for slashing the search speed nearly 10 fold? |
Ultimately, I'm needing to serve searches like this to multiple users (at times concurrently). Given the size of the database I'm working with, can anyone comment as to whether I should be storing this in something like MySQL or Postgres rather than SQLite. I know there's been much defense of sqlite being performant but I wonder if those arguments break down as the database size increases. For example, if I scroll to the bottom of that linked page, where it says Checklist For Choosing The Right Database Engine, here's how I answer those questions:
So is sqlite still a good idea here? |
UPDATE:
So indexing reduced query time for an exact match to "Musk Elon" from almost |
.Hi @zeluspudding You're running your search queries using the "contains" filter, which uses a SQL Instead, you should take a look at SQLite's FTS - full text indexing feature. You can build a FTS index against a column and dramatically speed up searches for words within that column. This documentation should help get you started: https://datasette.readthedocs.io/en/stable/full_text_search.html |
Hi Simon, thanks for the pointer! Feeling good that I came to your conclusion a few days ago. I did hit a snag with figuring out how to compile a special version of sqlite for my windows machine (which I only realized I needed to do after running your command I'll try to solve that problem next week and report back here with my findings (if you know of a good tutorial for compiling on windows, I'm all ears). Either way, I'll try to close this issue out in the next two weeks. Thanks again! |
I just got FTS5 working and it is incredible! The lookup time for returning all rows where company name contains "Musk" from my table of 16,428,090 rows has dropped from So cool! Thanks again for the pointers and awesome datasette! |
I have an sqlite table with 16 million rows in it. Having read @simonw article "Fast Autocomplete Search for Your Website" I was curious to try datasette to see what kind of query performance I could get out of it. In truth I don't need to do full text search since all I would like to do is give my users a way to search for the names of investors such as "Warren Buffet", or "Tim Cook" (who's names are in a single column).
On the first search, Datasette takes over 20 seconds to return all records associated with
elon musk
:If I rerun the same search, it then takes almost 9 seconds:
That's far to slow to implement an autocomplete feature. I could reduce the latency by making a special table of only unique investor names, thereby reducing the search space to less than a million rows (then I'd need to implement a way to add only new investor names to the table as I received new data.. about 4,000 rows a day). If I did that, I'm still concerned the new table wouldn't be lean enough to lookup investor names quickly. Plus, even if I can implement the autocomplete feature, I would still finally have to lookup records for that investors which would take between 8 - 20 seconds.
Are there any tricks for speeding this up?
Here's my hardware:
The text was updated successfully, but these errors were encountered: