
Implement a method to estimate the number of rows in the index #769

Open

hyperion-cs opened this issue Oct 3, 2022 · 6 comments

@hyperion-cs
Contributor

ZomboDB version: 3000.0.12
Postgres version: 14.x
Elasticsearch version: 8.3

I already mentioned this issue in a neighboring one (by my mistake). To restate:

So, I need an extremely fast mechanism (much faster than the zdb.count(...) function) to estimate whether a ZDB query will find more than 10'000 rows or not (don't be scared of this constant; it's inherent to ES, and I'll get to it below).
Directly in ES this is quite simple: you send a _search query with size equal to 0, like this:

curl -X GET "localhost:9200/16881.2200.527888473.530970621/_search?pretty" -H 'Content-Type: application/json' -d'
{
    "size" : 0,
    "query":
    {
        "match":
        {
            "some_field":
            {
                "query": "el paso 5",
                "fuzzy_transpositions": false,
                "auto_generate_synonyms_phrase_query": false,
                "operator": "and"
            }
        }
    }
}
'

Example response (insignificant data removed):

{
  "hits" : {
    "total" : {
      "value" : 2653
    }
  }
}

In short, in this case ES returns either the exact count in hits.total.value, or the constant 10'000 if the number of rows found is >= 10'000. Essentially it's like _count (aka zdb.count(...)), except that counting stops once 10'000 rows have been found.
The 10'000 constant is actually the default value of the track_total_hits parameter (described here).
Thus, the main difference from _count is that this way of estimating the row count returns immediately, regardless of how many rows are in the store.
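
For illustration, here is the same query with the cap raised explicitly; this is standard Elasticsearch _search syntax, not anything ZDB-specific (the index name and field are just the ones from the example above). Note that the full response also carries hits.total.relation, which is "eq" when the value is exact and "gte" when it was capped:

curl -X GET "localhost:9200/16881.2200.527888473.530970621/_search?pretty" -H 'Content-Type: application/json' -d'
{
    "size" : 0,
    "track_total_hits" : 50000,
    "query":
    {
        "match":
        {
            "some_field":
            {
                "query": "el paso 5",
                "operator": "and"
            }
        }
    }
}
'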

At the same time, if I try to make a query in ZDB with size equal to 0, I run into this behavior:

Some(limit) if limit == 0 => {
    // with a limit of zero, we can avoid going to Elasticsearch at all
    // and just return a (mostly) None'd response
    return Ok(ElasticsearchSearchResponse {
        elasticsearch: None,
        limit: Some(0),
        offset: None,
        track_scores,
        should_sort_hits,
        scroll_id: None,
        shards: None,
        hits: None,
        fast_terms: None,
    });
}

Consequently, the questions:

  1. How can I query ES via ZDB with size 0?
  2. How do I get the value of hits.total.value from the result?
  3. How can I control the track_total_hits parameter (to change the 10'000 default to some other value)? I couldn't find anything about it in ZDB.

By the way, this whole problem could be solved as follows: perhaps a function that returns the estimated number of rows for a query (using the method I described) should be added to the aggregate functions?

It seems to me that a dedicated function for this would be logical and useful to many, especially those who work with very large data collections. The definition could be:

FUNCTION zdb.estimate_count(
	index regclass,
	query zdbquery,
	track_total_hits integer DEFAULT 10000
) RETURNS integer
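
For example, a call could look like this (purely hypothetical, since the function doesn't exist yet; idx_my_table stands in for a real ZomboDB index):

SELECT zdb.estimate_count('idx_my_table', 'some_field:"el paso 5"');
-- returns the exact hit count if it is below 10'000, and 10000 otherwise

SELECT zdb.estimate_count('idx_my_table', 'some_field:"el paso 5"', 50000);
-- same, but with the cap raised to 50'000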
@eeeebbbbrrrr
Collaborator

I can add a function like this. I'm just dubious as to how much faster it'll be than zdb.count().

@hyperion-cs
Contributor Author

hyperion-cs commented Oct 13, 2022

@eeeebbbbrrrr, I wouldn't ask for it without a real need... See for yourself:

  1. _count => 3102 ms (its wrapper zdb.count() takes about the same);
  2. Proposed method with zero size => 13 ms.

That's a difference of roughly 238x!

Obviously, the point is that _count may have to count a really large number of documents if your store is huge. In the case above, that number is 134'701'809.
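
For illustration, this is the kind of request zdb.count() ends up making (the thread above notes it wraps _count); the _count endpoint is standard Elasticsearch and has no early-termination cap, so it must visit every matching document (index and field names as in the earlier examples):

curl -X GET "localhost:9200/16881.2200.527888473.530970621/_count?pretty" -H 'Content-Type: application/json' -d'
{
    "query":
    {
        "match":
        {
            "some_field":
            {
                "query": "el paso 5",
                "operator": "and"
            }
        }
    }
}
'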

Mostly, I need this feature to detect in advance that a user is about to run a very broad query, i.e. one with extremely low selectivity (so I can turn off result scoring in that case, hehe). There are other uses as well.
I believe adding this functionality will allow ZomboDB to work even better with "big data".

@hyperion-cs
Contributor Author

@eeeebbbbrrrr, hello! It would be great if we could get back to this issue.

@mwieczorkiewicz
Contributor

@eeeebbbbrrrr - I am willing to take this up over the weekend/next week if that's fine, as this would be pretty useful in our case as well. I'll try to align it with the other functions that are part of ZomboDB.

@eeeebbbbrrrr
Collaborator

I'd be happy to review and merge it.

@hyperion-cs
Contributor Author

@mwieczorkiewicz, hello! Any updates?
