
Implement a method to estimate the number of rows in the index #769

Open

hyperion-cs opened this issue Oct 3, 2022 · 6 comments

@hyperion-cs
Contributor

ZomboDB version: 3000.0.12
Postgres version: 14.x
Elasticsearch version: 8.3

I already mentioned this issue in a neighboring one (by my mistake). To restate:

So, I need an extremely fast mechanism (much faster than the zdb.count(...) function) to estimate whether a ZDB query will find more than 10'000 rows or not (don't be scared of this constant; it's inherent to ES, and I'll get to it below).
Directly in ES this is quite simple: you send a _search query with size equal to 0, like this:

curl -X GET "localhost:9200/16881.2200.527888473.530970621/_search?pretty" -H 'Content-Type: application/json' -d'
{
    "size" : 0,
    "query":
    {
        "match":
        {
            "some_field":
            {
                "query": "el paso 5",
                "fuzzy_transpositions": false,
                "auto_generate_synonyms_phrase_query": false,
                "operator": "and"
            }
        }
    }
}
'

Example response (insignificant data removed):

{
  "hits" : {
    "total" : {
      "value" : 2653
    }
  }
}

In short, in this case ES returns either the exact count in hits.total.value, or the constant 10'000 if the number of rows found is >= 10'000. Essentially it's like _count (aka zdb.count(...)), except that counting stops once 10'000 rows have been found.
The 10'000 constant is actually the default value of the track_total_hits parameter (described here).
Thus, the main difference from _count is that this way of estimating the row count returns immediately, regardless of how many rows are in the store.
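
For illustration, here is the same query with the cap raised explicitly; this is standard Elasticsearch _search syntax, not anything ZDB-specific (the index name and field are just the ones from the example above). Note that the full response also carries hits.total.relation, which is "eq" when the value is exact and "gte" when it was capped:

curl -X GET "localhost:9200/16881.2200.527888473.530970621/_search?pretty" -H 'Content-Type: application/json' -d'
{
    "size" : 0,
    "track_total_hits" : 50000,
    "query":
    {
        "match":
        {
            "some_field":
            {
                "query": "el paso 5",
                "operator": "and"
            }
        }
    }
}
'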

At the same time, if I try to make a query in ZDB with size equal to 0, I run into this behavior:

Some(limit) if limit == 0 => {
    // with a limit of zero, we can avoid going to Elasticsearch at all
    // and just return a (mostly) None'd response
    return Ok(ElasticsearchSearchResponse {
        elasticsearch: None,
        limit: Some(0),
        offset: None,
        track_scores,
        should_sort_hits,
        scroll_id: None,
        shards: None,
        hits: None,
        fast_terms: None,
    });
}

Consequently, the questions:

  1. How can I query ES via ZDB with size 0?
  2. How do I get the value of hits.total.value from the result?
  3. How can I control the track_total_hits parameter (to change the 10'000 default to some other value)? I couldn't find anything about it in ZDB.

By the way, this whole problem could be solved as follows: perhaps a function that returns the estimated number of rows for a query (using the method I described) should be added to the aggregate functions?

It seems to me that a dedicated function for this would be logical and useful to many, especially those who work with very large data collections. The definition could be:

FUNCTION zdb.estimate_count(
	index regclass,
	query zdbquery,
	track_total_hits integer DEFAULT 10000
) RETURNS integer
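
For example, a call could look like this (purely hypothetical, since the function doesn't exist yet; idx_my_table stands in for a real ZomboDB index):

SELECT zdb.estimate_count('idx_my_table', 'some_field:"el paso 5"');
-- returns the exact hit count if it is below 10'000, and 10000 otherwise

SELECT zdb.estimate_count('idx_my_table', 'some_field:"el paso 5"', 50000);
-- same, but with the cap raised to 50'000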
@eeeebbbbrrrr
Collaborator

I can add a function like this. I'm just dubious as to how much faster it'll be than zdb.count().

@hyperion-cs
Contributor Author

hyperion-cs commented Oct 13, 2022

@eeeebbbbrrrr, I wouldn't ask for it without a real need... See for yourself:

  1. _count => 3102 ms (its wrapper zdb.count() takes about the same);
  2. Proposed method with zero size => 13 ms.

That's a difference of roughly 238x!

Obviously, the point is that _count may have to count a really large number of documents if your store is huge. In the case above, that number is 134'701'809.
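
For illustration, this is the kind of request zdb.count() ends up making (the thread above notes it wraps _count); the _count endpoint is standard Elasticsearch and has no early-termination cap, so it must visit every matching document (index and field names as in the earlier examples):

curl -X GET "localhost:9200/16881.2200.527888473.530970621/_count?pretty" -H 'Content-Type: application/json' -d'
{
    "query":
    {
        "match":
        {
            "some_field":
            {
                "query": "el paso 5",
                "operator": "and"
            }
        }
    }
}
'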

Mostly, I need this feature to detect in advance that a user is about to run a very broad query, i.e. one with extremely low selectivity (so I can turn off result scoring in that case, hehe). There are other uses as well.
I believe adding this functionality will allow ZomboDB to work even better with "big data".

@hyperion-cs
Contributor Author

@eeeebbbbrrrr, hello! It would be great if we could get back to this issue.

@mwieczorkiewicz
Contributor

@eeeebbbbrrrr - I am willing to take this up over the weekend/next week if that's fine, as this would be pretty useful in our case as well. I'll try to align it with the other functions that are part of ZomboDB.

@eeeebbbbrrrr
Collaborator

I'd be happy to review and merge it.

@hyperion-cs
Contributor Author

@mwieczorkiewicz, hello! Any updates?
