Allow filtering grouped values by prefix or regex #23108

Open
angelf opened this issue Jun 15, 2022 · 1 comment

angelf commented Jun 15, 2022

Is your feature request related to a problem? Please describe.

Consider a classification scheme in which each document can belong to multiple categories, and the categories form a large hierarchy. As an example, say the classification schema contains:

  genre
     genre/poetry
     genre/biographies
     genre/fiction
     
  styles
     styles/symbolism
     styles/realism
          styles/realism/neorealism
     styles/fantasy
          styles/fantasy/high-fantasy          
          styles/fantasy/medieval

Documents may be classified in multiple categories, possibly at the same level:

  doc1: categories={genre/poetry styles/symbolism}
  doc2: categories={genre/fiction styles/fantasy/high-fantasy styles/fantasy/medieval}

When browsing a certain category (styles/fantasy), we want to group ("facet") the search results but show only the categories under the current path. It is also important that the result set is complete.

Describe the solution you'd like

The preferred option would be to add a grouping function that filters out values that do not start with a given prefix. For example:

all( group(filter_prefix(category, "styles/fantasy/")) each(output(count())) )

This would discard all groups whose value does not start with "styles/fantasy/". As with other expressions, the computation would occur on each node, so network bandwidth would be greatly reduced.

filter_prefix might omit the group completely, or replace the value with an empty string (either would solve the problem) or with a string selected by the user. For example:

all( group(if_starts(category, "styles/fantasy", category, "alternative")) each(output(count())) )
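
To make the intended semantics concrete, here is a minimal Python sketch of what each node would compute. It is not Vespa code: `group_counts_with_prefix` and its `alternative` parameter are hypothetical names that model the proposed filter_prefix (omit non-matching values) and if_starts (replace them with a fixed string) behaviours. It also assumes a document is counted once per resulting group.

```python
from collections import Counter

# Sample documents mirroring the example above.
DOCS = {
    "doc1": ["genre/poetry", "styles/symbolism"],
    "doc2": ["genre/fiction", "styles/fantasy/high-fantasy", "styles/fantasy/medieval"],
}


def group_counts_with_prefix(docs, prefix, alternative=None):
    """Count documents per category value, keeping only values under `prefix`.

    Hypothetical model of the proposal:
    - alternative=None  -> non-matching values are dropped (filter_prefix).
    - alternative="..." -> non-matching values are mapped to that string (if_starts).
    Assumption: each document contributes at most 1 to any given group.
    """
    counts = Counter()
    for categories in docs.values():
        groups = set()
        for value in categories:
            if value.startswith(prefix):
                groups.add(value)
            elif alternative is not None:
                groups.add(alternative)
        counts.update(groups)
    return dict(counts)


only_fantasy = group_counts_with_prefix(DOCS, "styles/fantasy/")
with_fallback = group_counts_with_prefix(DOCS, "styles/fantasy/", alternative="alternative")
```

With the omit behaviour only the two styles/fantasy/ groups are produced; with the replacement behaviour every other value collapses into a single "alternative" group, so at most one extra group per node has to be returned.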

Describe alternatives you've considered

  1. A first approach is to group by all values (all( group(category) each(output(count())) )) and then filter out the ones that don't belong to the current context. But this may require a very large maxHits to ensure that the values of interest are actually included in the results, and it will be inefficient. On large taxonomies it will be hard to guarantee that the result set is complete.

  2. Creating one field for each level ("category1", "category2", "category3") mitigates but does not solve the problem, since documents can be in multiple categories at different points in the hierarchy; we are still at risk of not providing the complete result set.

  3. A more general but maybe less efficient approach would be to allow regex filtering:

    all( group(if_regex_matches(category, "styles/fantasy", category, "alternative")) each(output(count())) )

  4. A new expression syntax rather than a function may be more natural, but probably requires more invasive changes. For example:

    all( group(category) if_prefix("styles/fantasy") each(output(count())) )
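
The completeness concern in alternative 1 can be illustrated with a small Python sketch. It is a toy model, not Vespa behaviour: `top_groups`, `filter_after`, `filter_before` and the `max_groups` cap are hypothetical names, and the cap stands in for whatever limit bounds the number of groups returned.

```python
from collections import Counter

# Toy category values as they might be seen across many documents.
VALUES = (
    ["genre/poetry"] * 5
    + ["genre/fiction"] * 4
    + ["styles/symbolism"] * 3
    + ["styles/fantasy/medieval"] * 2
    + ["styles/fantasy/high-fantasy"] * 1
)


def top_groups(values, max_groups):
    """Return the max_groups most frequent values, as a capped grouping would."""
    return Counter(values).most_common(max_groups)


def filter_after(values, prefix, max_groups):
    """Alternative 1: group everything, cap, then filter on the client."""
    return [(v, c) for v, c in top_groups(values, max_groups) if v.startswith(prefix)]


def filter_before(values, prefix, max_groups):
    """Proposed behaviour: filter by prefix first, then group and cap."""
    return top_groups([v for v in values if v.startswith(prefix)], max_groups)


client_side = filter_after(VALUES, "styles/fantasy/", max_groups=3)
server_side = filter_before(VALUES, "styles/fantasy/", max_groups=3)
```

In this toy data the three most frequent values all lie outside styles/fantasy/, so filtering after the cap returns nothing, while filtering before the cap returns both fantasy groups.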

Additional context
See originating discussion on: https://vespatalk.slack.com/archives/C01QNBPPNT1/p1654876998447789

@johans1 johans1 added this to the later milestone Jun 15, 2022
@jobergum
Member

Thanks for the nice writeup @angelf, this request also relates to #15658.

Labels
None yet
Projects
Search and content
Awaiting triage
Development

No branches or pull requests

3 participants