Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add __and__ and __or__ to Query #287

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

PSU3D0
Copy link

@PSU3D0 PSU3D0 commented May 18, 2024

Very new to Tantivy and Rust, so if I missed something apologies!

It would be excellent to combine queries using syntax like &, |.

I maintain a library, Docprompt, that uses Tantivy internally, and having these operators available would make my usage easier, as well as other developers that are trying to compose queries declarativly using python syntax.

I understand that are a lot of query types in Tantivy. If there are other areas/directions that need to be implemented before we can expose the python API to developers, let me know and I'm happy to implement. Cheers!

PS:

I think a natural followup would be a python-space QueryParser that allows Django/SqlAlchemy like ORM usage for constructing queries. To cite the unit tests:

Old:

query = index.parse_query("title:men AND body:winter", ["title", "body"])

New:

query = index.create_query(title="men") & index.create_query(body="winter")

This PR would be the very beginnings of a python API in that direction!

@@ -65,6 +65,26 @@ impl Query {
Ok(format!("Query({:?})", self.get()))
}

pub(crate) fn __and__(&self, other: Query) -> Query {
let inner = tv::query::BooleanQuery::from(vec![
(tv::query::Occur::Must, self.inner.box_clone()),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we check if self or other already is a BooleanQuery of all-Must and apend ourselves to that instead of building trees of BooleanQuery?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason why would you want this feature? Complicated syntaxes like (query1 | query2) & query3 can be very troublesome to implement. Its also common to nest queries in Lucene.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fear for the reduced efficiency compared to having a flat list of subqueries. AFAIU, Tantivy will not optimize that query tree and its execution will directly mirror the structure of the input query.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we need to distinguish the presence of a boolean request, I afraid we need to setup dedicated Query struct but it will lead to big change in the API.

I was rather thinking about downcasting here. I also think we can keep this internal as an optimization instead of changing the API, i.e. downcast to see if we already have a boolean query with matching modifiers and append a clause instead of nesting.

Copy link
Contributor

@alex-au-922 alex-au-922 May 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I think the idea is fine, the optimization cannot be too deep due to the Occur implementation.

Currently we have 3 variants of Occur: Should, Must, MustNot. The optimization can be done when we have a tuple of (Occur, BooleanQuery) where the boolean query might contain one or more groups of Should, Must, MustNot.

Their mathematics goes like the symmetric matrix (Take the header as the index 0 Occur of the (Occur, BooleanQuery) tuple, while the rows Occur as the groups inside the BooleanQuery):

Should Must MustNot
Should Should (Must, [Should, ...]) (MustNot, [Should, ...])
Must (Should, [Must, ...]) Must MustNot
MustNot (Should, [MustNot, ...]) MustNot Must

This means that we can only flatten 1 layer those 5 types of (Occur, InnerOccurGroup), and for the flattening we need to extract the boolean query's inner occur group, am I right?

P.S. Why I remain (Must, [Should, ...]) and (MustNot, [Should, ...]) though they can be deduced by DeMorgan's rule is that when transforming them, the resultant query still have the same degree of nesting and there is no significant benefits so I just remain them as it is.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My comment was only aimed at keeping the status quo, i.e. chaining clauses using & and | should result in a single boolean query as if the boolean_query method is used instead of a unbalanced tree.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think we can keep this internal as an optimization instead of changing the API.

So do you mean you expect whenever __and__ and __or__ are used, the above optimization evaluation rule applies while the original Query.boolean_query remains the status quo?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So do you mean you expect whenever and and or are used, the above optimization evaluation rule applies while the original Query.boolean_query remains the status quo?

No, I just want a & b & c and boolean_query([(Must, a), (Must, b), (Must, c)]) to produce the same internal representation using a single boolean query instead of the first form using two boolean queries.

Copy link
Contributor

@alex-au-922 alex-au-922 May 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I think the & and | are binary operations that they must evaluate to something first, no matter the order? Say for a & b & c then we must either evaluate to d & c where d = a & b or vice versa, and there must be nesting.

In case we optimistically flatten the queries with the & or | syntax, say a & b & c, while any of the queries a, b, c are boolean, the resultant query will be different from a boolean_query([(Must, a), (Must, b), (Must, c)]).

If we really want to adapt this syntax while producing the same internal representation, the Query.boolean_query method should also be flattened.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants