Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Boolean Index query result is incorrect when inverted+resultset #28

Closed
flipmcf opened this issue Oct 2, 2017 · 12 comments
Closed

Boolean Index query result is incorrect when inverted+resultset #28

flipmcf opened this issue Oct 2, 2017 · 12 comments

Comments

@flipmcf
Copy link

flipmcf commented Oct 2, 2017

This might not be an issue, but a lack of understanding.
I do know however that one of our queries is misbehaving:

query_request={ 'path': '/somewhere/below/something',
                               'boolIndex': True
                         }

The indexed value is "False", so it's an inverted query.

What is returned:
difference(resultset, index)

And I read that as:
"Subtract the indexed values from the result set (all objects satisfying the path query) and return it"
Which is not correct.

Shouldn't this be:
"Invert the index" difference(self._unindex, index) And then intersect that with the result set intersection( <that>, resultset)

Also, we have an inverted index and have not modified it, so eventually _invert_index() will run and the error goes away - running a different line of code in _apply_index() (or query_index()) However, I'm not convinced this happens ONLY on inverted indexes. Queries for the unindexed value could suffer the same error.

This code is more than 7 years old and I can't possibly be the one to discover this error, so I'm really hesitant to raise the "Bug Flag".

@hannosch
Copy link
Contributor

hannosch commented Oct 3, 2017

The query you gave says 'boolIndex': True, and you said the actually indexed value is False. So the index set contains all the document ids, for which the indexed value is False. For a query looking for the opposite (restrict to True) it therefor needs to remove all of these from the resultset, which is what difference(resultset, index) accomplishes.

The logic has four different cases, the first two share a code path:

  1. The query value matches the indexed value
    1.a. A resultset is given, so return the intersection(index, resultset)
    1.b. No resultset is given. Same as before as intersection(index, None) simply returns index unchanged.
  2. The query value is not the indexed value
    2.a. A resultset is given, so it only needs to remove the non-matching document ids from the resultset. Since the indexed value is the opposite of the query value, it can simply remove those non-matching indexed values from the resultset (difference(resultset, index))
    2.b. No resultset is given and the indexed value doesn't match the query. In this case it needs to reconstruct the opposite document ids based on the unindex difference(self._unindex, index). It than casts the result into an IISet via union(..., IISet([])), which ensures an empty difference returns an empty IISet and not None.

@hannosch
Copy link
Contributor

hannosch commented Oct 3, 2017

I've added some more tests to ensure this really works as it should and updated some of the comments on the index. They were outdated and still referred to the early days when the _index_value was hardcoded as False.

@flipmcf
Copy link
Author

flipmcf commented Oct 3, 2017

Thanks for this work!

We can agree we're focused on 2a - Resultset and not-indexed value.

So the index set contains all the document ids, for which the indexed value is False

Exactly. I think the issue is aggravated when the resultset contains document ids that do not have an entry in the boolean index at all. Neither True nor False.

All tests assume that all objects are indexed in the boolean index, and I'm not sure that assumption is correct.

AFAIK - and I could be wrong, when catalog.reindexObject(obj) is called, and it does not have a field corresponding to the boolean index, the index is never touched. Please confirm this.

The test case I'm thinking of looks like this:

    obj = Dummy(1, True)
    index._index_object(obj.id, obj, attr='truth')
    obj = Dummy(2, False)
    index._index_object(obj.id, obj, attr='truth')
    obj = Dummy(3, True)
    index._index_object(obj.id, obj, attr='truth')
    # obj = Dummy(4, None)  ##Don't call index._index_object
    # obj = Dummy(5, None)  ##Don't call index._index_object

    ##False is the indexed value.
    res, idx = index._apply_index({'truth': True},  ##Inverted search
                                  resultset=IISet([1, 2, 3, 4, 5]))
    self.assertEqual(list(res), [1,3])
    ##Fails - list(res) is [1,3,4,5]

So the question is, what should we do with these objects in resultset that aren't in self._unindex?

Path 1a says remove objects that are not in self._unindex intersection(index, resultset)
Path 2a says keep objects that are not in self._unindex difference(resultset, index)

doing a difference removes only those objects that have an entry in the boolean index, leaving the other objects in the result, where as doing an intersection on the inversion will remove the objects that don't have an entry in self.unindex

And I argue that "remove them" is the correct answer for path 2a:
intersection(difference(self._unindex, index), resultset)

Additionally, I believe this test will pass, which exercises 1a with additional unindexed objects:

obj = Dummy(1, True)
index._index_object(obj.id, obj, attr='truth')
obj = Dummy(2, False)
index._index_object(obj.id, obj, attr='truth')
obj = Dummy(3, True)
index._index_object(obj.id, obj, attr='truth')
# obj = Dummy(4, None)  ##Don't call index._index_object
# obj = Dummy(5, None)  ##Don't call index._index_object

##False is the indexed value.
res, idx = index._apply_index({'truth': False},   ##query for indexed value.
                              resultset=IISet([1, 2, 3, 4, 5]))
self.assertEqual(list(res), [2])  #note that [4, 5] are not in result.

Other solutions include applying boolean indexes before all other indexes in catalog.search, or making sure that all objects in the catalog are included in the _unindex of boolean indexes. And honestly, I'm not qualified to say those are not better options. But I am very hesitant to say that they are.


Edit: updated tests, swapping false and true. My logic was backwards.

@hannosch
Copy link
Contributor

hannosch commented Oct 4, 2017

Ah, yes. The issue is indeed about documents without entries in the specific index.

In the other places where we use the resultset (like the Field/UnIndex) we use the equivalent of intersection(resultset, index), so that should be the correct approach. I also remember the same logic applying to not queries, where documents without entries in each index are filtered out. If one wants a different behavior, the recommendation is to index some kind of dummy value for these documents.

I've created a PR #29 to fix the issue. I've used a slightly different approach, by using an intersection with the unindex (intersection(difference(resultset, index), self._unindex).

The passed in resultset is in-memory and usually small. The index, holding document ids for the less common value is also a small treeset. The difference call on those two is reasonably fast and doesn't require loading a lot of objects (buckets) from the database. An intersection between such a small result and the self._unindex is reasonably fast again.

On the other hand difference(self._unindex, index) has the worst-case performance characteristics, as the entire index and unindex have to be loaded and thus result in lots of database object loads. This worst-case is what we need to pay, as a consequence of not storing this inverted set (like a Un/FieldIndex would do). But we shouldn't introduce it into the more common codepath where we get a resultset passed in. Or rather the catalog query plan will figure out that taking this codepath on a boolean index with a query for the inverted value is slow, so it will favor execution plans where some other index is queried first.

@flipmcf
Copy link
Author

flipmcf commented Oct 4, 2017

the optimization you present makes sense.

Would you like me to do anything? I mean, if one complains, one should be willing to help fix.

@hannosch
Copy link
Contributor

hannosch commented Oct 4, 2017

Would you like me to do anything?

Can you verify that this change actually fixes the problem you were having?

I've backported the change to all older branches, but would like some confirmation before I make all those releases.

@flipmcf
Copy link
Author

flipmcf commented Oct 4, 2017

Sure. No problem.

@viaYoung
Copy link

viaYoung commented Oct 4, 2017

For now, I only ran this fix on my local, and it looks good. Thank you for fixing the bug.

@flipmcf
Copy link
Author

flipmcf commented Oct 4, 2017

We are now in the process of updating our Plone 4 to get Products.ZCatalog at 2.13.28 directly from github. I am confident this will fix our issue, but I won't close thist ticket until I get independent confirmation.

@hannosch
Copy link
Contributor

I've gone ahead and cut new releases with this fix for all four branches. There are now 2.13.28, 3.0.3, 3.2.1 and 4.0.1 releases up on PyPi.

@esteele You might want to grab these new releases and update Plone to include the right maintenance release for each supported Plone version. This issue covers quite a nasty bug, which makes the catalog return wrong results in some hard to track cases.

@tseaver
Copy link
Member

tseaver commented Oct 10, 2017

Many thanks to @flipmcf for tracking down and reporting this ugly bug!

@tseaver tseaver closed this as completed Oct 10, 2017
@flipmcf
Copy link
Author

flipmcf commented Oct 10, 2017

@tseaver Thanks for shout out. Our Navigation Items Adapter is much simpler now.

esteele added a commit to plone/buildout.coredev that referenced this issue Nov 24, 2017
esteele added a commit to plone/buildout.coredev that referenced this issue Nov 24, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants