
Ways to improve fuzzy search speed on larger data sets? #607

Closed
zeluspudding opened this issue Oct 27, 2019 · 6 comments
zeluspudding commented Oct 27, 2019

I have an SQLite table with 16 million rows in it. Having read @simonw's article "Fast Autocomplete Search for Your Website", I was curious to try Datasette to see what kind of query performance I could get out of it. In truth I don't need full text search, since all I'd like to do is give my users a way to search for the names of investors such as "Warren Buffett" or "Tim Cook" (whose names are in a single column).

On the first search, Datasette takes over 20 seconds to return all records associated with "elon musk":

[screenshots]

If I rerun the same search, it then takes almost 9 seconds:

[screenshot]

That's far too slow for an autocomplete feature. I could reduce the latency by making a special table of only unique investor names, thereby reducing the search space to less than a million rows (then I'd need a way to add only new investor names to that table as I receive new data, about 4,000 rows a day). Even if I did that, I'm concerned the new table wouldn't be lean enough to look up investor names quickly. And even with autocomplete working, I would still have to look up the records for that investor, which would take between 8 and 20 seconds.

Are there any tricks for speeding this up?

Here's my hardware:

[screenshot: hardware specs]


zeluspudding commented Oct 27, 2019

Update: I've created a table of only unique names. This reduces the search space from over 16 million rows to about 640,000. Interestingly, it takes less than 2 seconds to create this table using Python. Performing the same search we ran earlier for "elon musk" takes nearly a second: much faster than before, but still not fast enough for an autocomplete feature (which usually needs to return results within 100ms to feel "real time").

Any ideas for slashing the search speed nearly 10 fold?

[screenshot]
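For reference, the deduplicated lookup table can be built and kept current entirely in SQL. A minimal sketch, using the `edgar_idx`/`company` names from the benchmarks later in this thread but with made-up in-memory data (the `investors` table and index names are my own, hypothetical):

```python
import sqlite3

# Toy in-memory stand-in for the real 16M-row database; the table and
# column names ("edgar_idx", "company") come from this thread, but the
# data here is made up.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE edgar_idx ("company" TEXT)')
conn.executemany("INSERT INTO edgar_idx VALUES (?)",
                 [("Musk Elon",), ("Musk Elon",), ("Cook Tim",)])

# One-off build of the unique-names table.
conn.execute('CREATE TABLE investors AS SELECT DISTINCT "company" AS name FROM edgar_idx')
conn.execute("CREATE UNIQUE INDEX idx_investors_name ON investors(name)")

# Daily incremental load: INSERT OR IGNORE skips names that already
# exist, so only genuinely new investors are added.
conn.execute('INSERT OR IGNORE INTO investors SELECT DISTINCT "company" FROM edgar_idx')

unique_count = conn.execute("SELECT COUNT(*) FROM investors").fetchone()[0]
print(unique_count)  # 2
```

The `INSERT OR IGNORE` against a unique index would cover the ~4,000-rows-a-day incremental case without rebuilding the whole table.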

@zeluspudding

Ultimately, I need to serve searches like this to multiple users (at times concurrently). Given the size of the database I'm working with, can anyone comment on whether I should be storing this in something like MySQL or Postgres rather than SQLite? I know SQLite's performance has been defended at length, but I wonder if those arguments break down as the database size increases.

For example, if I scroll to the bottom of that linked page, where it says Checklist For Choosing The Right Database Engine, here's how I answer those questions:

  • Is the data separated from the application by a network? → choose client/server
    Yes
  • Many concurrent writers? → choose client/server
    Not exactly. I may have many concurrent readers but almost no concurrent writers.
  • Big data? → choose client/server
    No, my database is less than 40 GB and won't approach a terabyte in the next decade.

So is sqlite still a good idea here?

@zeluspudding

UPDATE:
According to tips suggested in Squeezing Performance from SQLite: Indexes? Indexes!, I have added an index to my large table and benchmarked query speeds for three cases: returning all rows, rows exactly equal to 'Musk Elon', and rows like 'musk'. Indexing reduced query time for each of those measures, and dramatically so for the exact match on 'Musk Elon', as shown below:

```
table: edgar_idx
rows: 16,428,090

indexed: False

  Return all rows where company name exactly equals "Musk Elon"
  query: select rowid, * from edgar_idx where "company" = :p0 order by rowid limit 101
  query time: 21821.031ms

  Return all rows where company name contains "Musk"
  query: select rowid, * from edgar_idx where "company" like :p0 order by rowid limit 101
  query time: 20505.029ms

  Return everything
  query: select rowid, * from edgar_idx order by rowid limit 101
  query time: 7985.011ms

indexed: True

  Return all rows where company name exactly equals "Musk Elon"
  query: select rowid, * from edgar_idx where "company" = :p0 order by rowid limit 101
  query time: 30.0ms

  Return all rows where company name contains "Musk"
  query: select rowid, * from edgar_idx where "company" like :p0 order by rowid limit 101
  query time: 13340.019ms

  Return everything
  query: select rowid, * from edgar_idx order by rowid limit 101
  query time: 2190.003ms
```

So indexing reduced query time for an exact match to "Musk Elon" from almost 22 seconds to 30.0ms. That's amazing and truly promising! However, an autocomplete feature relies on fuzzy / incomplete matching, which is more similar to the contains 'musk' query... Unfortunately, that takes 13 seconds even after indexing. So the hunt for a fast fuzzy / autocomplete search capability persists.
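The split in the numbers above is what a plain B-tree index buys you. A small sketch with the same table and column names but toy data, showing via `EXPLAIN QUERY PLAN` that SQLite uses the index for equality but falls back to a full scan for a contains-style LIKE:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE edgar_idx ("company" TEXT)')
conn.executemany("INSERT INTO edgar_idx VALUES (?)",
                 [("Musk Elon",), ("Cook Tim",), ("Buffett Warren",)])
conn.execute('CREATE INDEX idx_company ON edgar_idx("company")')

# Equality can use the index: the plan reports a SEARCH via idx_company.
eq_plan = conn.execute(
    'EXPLAIN QUERY PLAN SELECT rowid, * FROM edgar_idx WHERE "company" = ?',
    ("Musk Elon",)).fetchone()[3]
print(eq_plan)

# A contains-style LIKE ('%musk%') cannot use the index: the leading
# wildcard forces a full table SCAN, which is why it stays slow.
like_plan = conn.execute(
    'EXPLAIN QUERY PLAN SELECT rowid, * FROM edgar_idx WHERE "company" LIKE ?',
    ("%musk%",)).fetchone()[3]
print(like_plan)
```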

@zeluspudding zeluspudding changed the title Ways to improve search speed on larger data sets? Ways to improve fuzzy search speed on larger data sets? Oct 28, 2019

simonw commented Oct 30, 2019

Hi @zeluspudding,

You're running your search queries using the "contains" filter, which uses a LIKE query under the hood.

SQL LIKE queries are generally slow because they force a full table scan. You can add an index on the column, but it will only speed up prefix queries such as `... where name like 'apple%'`; an index won't help if you are searching for text further along the string.
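One subtlety worth noting about the prefix case: SQLite only rewrites `LIKE 'apple%'` into an index range scan when LIKE is case-sensitive (or the index uses the NOCASE collation), since LIKE is case-insensitive by default. A sketch with hypothetical toy data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT)")
conn.executemany("INSERT INTO items VALUES (?)",
                 [("apple pie",), ("applesauce",), ("banana",)])
conn.execute("CREATE INDEX idx_name ON items(name)")

# Without this pragma (or a NOCASE collation on the index), SQLite's
# default case-insensitive LIKE cannot be served by the B-tree index.
conn.execute("PRAGMA case_sensitive_like = ON")

rows = conn.execute(
    "SELECT name FROM items WHERE name LIKE ? ORDER BY name",
    ("apple%",)).fetchall()
print(rows)  # [('apple pie',), ('applesauce',)]
```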

Instead, you should take a look at SQLite's FTS full-text indexing feature. You can build an FTS index against a column and dramatically speed up searches for words within that column.

This documentation should help get you started: https://datasette.readthedocs.io/en/stable/full_text_search.html
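A minimal sketch of what the FTS approach looks like with `sqlite3` directly, using toy data and the column names from this thread (`sqlite-utils enable-fts` automates roughly this setup, and this assumes your SQLite build includes the FTS5 extension):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE edgar_idx ("company" TEXT)')
conn.executemany("INSERT INTO edgar_idx VALUES (?)",
                 [("Musk Elon",), ("Cook Tim",)])

# Build an FTS5 index over the company column, mirroring rowids so
# results can be joined back to the original table.
conn.execute("CREATE VIRTUAL TABLE edgar_fts USING fts5(company)")
conn.execute('INSERT INTO edgar_fts(rowid, company) SELECT rowid, "company" FROM edgar_idx')

# MATCH finds whole words anywhere in the string (not just prefixes),
# served by the inverted index rather than a table scan.
rows = conn.execute(
    "SELECT rowid, company FROM edgar_fts WHERE edgar_fts MATCH ?",
    ("musk",)).fetchall()
print(rows)  # [(1, 'Musk Elon')]
```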

@zeluspudding

Hi Simon, thanks for the pointer! Feeling good that I came to the same conclusion a few days ago. I did hit a snag figuring out how to compile a special version of SQLite for my Windows machine (which I only realized I needed to do after running your command `sqlite-utils enable-fts mydatabase.db items name description`).

I'll try to solve that problem next week and report back here with my findings (if you know of a good tutorial for compiling on Windows, I'm all ears). Either way, I'll try to close this issue out in the next two weeks. Thanks again!

@zeluspudding

I just got FTS5 working and it is incredible! The lookup time for returning all rows where company name contains "Musk" from my table of 16,428,090 rows has dropped from 13,340.019ms to 15.6ms. That's well below the 100ms budget for the "real time" autocomplete feel (though it doesn't yet include the HTTP call).

So cool! Thanks again for the pointers and awesome datasette!
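For anyone landing here later: the autocomplete-style query that makes this work is an FTS5 prefix match, where a trailing `*` marks the last token as a prefix while the user is still typing. A toy sketch (hypothetical `investors` table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE investors USING fts5(name)")
conn.executemany("INSERT INTO investors(name) VALUES (?)",
                 [("Musk Elon",), ("Cook Tim",), ("Buffett Warren",)])

# "mus*" matches any row containing a token starting with "mus",
# which is exactly what an autocomplete box needs after 3 keystrokes.
rows = conn.execute(
    "SELECT name FROM investors WHERE investors MATCH ? ORDER BY rank",
    ("mus*",)).fetchall()
print(rows)  # [('Musk Elon',)]
```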
