-
Notifications
You must be signed in to change notification settings - Fork 63
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add __and__ and __or__ to Query #287
base: master
Are you sure you want to change the base?
Conversation
@@ -65,6 +65,26 @@ impl Query { | |||
Ok(format!("Query({:?})", self.get())) | |||
} | |||
|
|||
pub(crate) fn __and__(&self, other: Query) -> Query { | |||
let inner = tv::query::BooleanQuery::from(vec![ | |||
(tv::query::Occur::Must, self.inner.box_clone()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we check if self
or other
already is a BooleanQuery
of all-Must
and apend ourselves to that instead of building trees of BooleanQuery
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any reason why would you want this feature? Complicated syntaxes like (query1 | query2) & query3
can be very troublesome to implement. Its also common to nest queries in Lucene.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fear for the reduced efficiency compared to having a flat list of subqueries. AFAIU, Tantivy will not optimize that query tree and its execution will directly mirror the structure of the input query.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we need to distinguish the presence of a boolean request, I afraid we need to setup dedicated Query struct but it will lead to big change in the API.
I was rather thinking about downcasting here. I also think we can keep this internal as an optimization instead of changing the API, i.e. downcast to see if we already have a boolean query with matching modifiers and append a clause instead of nesting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I think the idea is fine, the optimization cannot be too deep due to the Occur
implementation.
Currently we have 3 variants of Occur
: Should
, Must
, MustNot
. The optimization can be done when we have a tuple of (Occur, BooleanQuery)
where the boolean query might contain one or more groups of Should
, Must
, MustNot
.
Their mathematics goes like the symmetric matrix (Take the header as the index 0 Occur
of the (Occur, BooleanQuery)
tuple, while the rows Occur
as the groups inside the BooleanQuery
):
Should |
Must |
MustNot |
|
---|---|---|---|
Should |
Should |
(Must, [Should, ...]) |
(MustNot, [Should, ...]) |
Must |
(Should, [Must, ...]) |
Must |
MustNot |
MustNot |
(Should, [MustNot, ...]) |
MustNot |
Must |
This means that we can only flatten 1 layer those 5 types of (Occur, InnerOccurGroup)
, and for the flattening we need to extract the boolean query's inner occur group, am I right?
P.S. Why I remain (Must, [Should, ...])
and (MustNot, [Should, ...])
though they can be deduced by DeMorgan's rule is that when transforming them, the resultant query still have the same degree of nesting and there is no significant benefits so I just remain them as it is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My comment was only aimed at keeping the status quo, i.e. chaining clauses using &
and |
should result in a single boolean query as if the boolean_query
method is used instead of a unbalanced tree.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also think we can keep this internal as an optimization instead of changing the API.
So do you mean you expect whenever __and__
and __or__
are used, the above optimization evaluation rule applies while the original Query.boolean_query
remains the status quo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So do you mean you expect whenever and and or are used, the above optimization evaluation rule applies while the original Query.boolean_query remains the status quo?
No, I just want a & b & c
and boolean_query([(Must, a), (Must, b), (Must, c)])
to produce the same internal representation using a single boolean query instead of the first form using two boolean queries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I think the &
and |
are binary operations that they must evaluate to something first, no matter the order? Say for a & b & c
then we must either evaluate to d & c
where d = a & b
or vice versa, and there must be nesting.
In case we optimistically flatten the queries with the &
or |
syntax, say a & b & c
, while any of the queries a
, b
, c
are boolean, the resultant query will be different from a boolean_query([(Must, a), (Must, b), (Must, c)])
.
If we really want to adapt this syntax while producing the same internal representation, the Query.boolean_query
method should also be flattened.
Very new to Tantivy and Rust, so if I missed something apologies!
It would be excellent to combine queries using syntax like
&
,|
.I maintain a library, Docprompt, that uses Tantivy internally, and having these operators available would make my usage easier, as well as other developers that are trying to compose queries declarativly using python syntax.
I understand that are a lot of query types in Tantivy. If there are other areas/directions that need to be implemented before we can expose the python API to developers, let me know and I'm happy to implement. Cheers!
PS:
I think a natural followup would be a python-space
QueryParser
that allows Django/SqlAlchemy like ORM usage for constructing queries. To cite the unit tests:Old:
New:
This PR would be the very beginnings of a python API in that direction!