Boolean Index query result is incorrect when inverted+resultset #28

flipmcf · 2017-10-02T17:10:56Z

This might not be an issue, but a lack of understanding.
I do know however that one of our queries is misbehaving:

query_request={ 'path': '/somewhere/below/something',
                               'boolIndex': True
                         }

The indexed value is "False", so it's an inverted query.

What is returned:
difference(resultset, index)

And I read that as:
"Subtract the indexed values from the result set (all objects satisfying the path query) and return it"
Which is not correct.

Shouldn't this be:
"Invert the index" difference(self._unindex, index) And then intersect that with the result set intersection( <that>, resultset)

Also, we have an inverted index and have not modified it, so eventually _invert_index() will run and the error goes away - running a different line of code in _apply_index() (or query_index()) However, I'm not convinced this happens ONLY on inverted indexes. Queries for the unindexed value could suffer the same error.

This code is more than 7 years old and I can't possibly be the one to discover this error, so I'm really hesitant to raise the "Bug Flag".

The text was updated successfully, but these errors were encountered:

hannosch · 2017-10-03T01:35:26Z

The query you gave says 'boolIndex': True, and you said the actually indexed value is False. So the index set contains all the document ids, for which the indexed value is False. For a query looking for the opposite (restrict to True) it therefor needs to remove all of these from the resultset, which is what difference(resultset, index) accomplishes.

The logic has four different cases, the first two share a code path:

The query value matches the indexed value
1.a. A resultset is given, so return the intersection(index, resultset)
1.b. No resultset is given. Same as before as intersection(index, None) simply returns index unchanged.
The query value is not the indexed value
2.a. A resultset is given, so it only needs to remove the non-matching document ids from the resultset. Since the indexed value is the opposite of the query value, it can simply remove those non-matching indexed values from the resultset (difference(resultset, index))
2.b. No resultset is given and the indexed value doesn't match the query. In this case it needs to reconstruct the opposite document ids based on the unindex difference(self._unindex, index). It than casts the result into an IISet via union(..., IISet([])), which ensures an empty difference returns an empty IISet and not None.

hannosch · 2017-10-03T01:37:14Z

I've added some more tests to ensure this really works as it should and updated some of the comments on the index. They were outdated and still referred to the early days when the _index_value was hardcoded as False.

flipmcf · 2017-10-03T20:07:11Z

Thanks for this work!

We can agree we're focused on 2a - Resultset and not-indexed value.

So the index set contains all the document ids, for which the indexed value is False

Exactly. I think the issue is aggravated when the resultset contains document ids that do not have an entry in the boolean index at all. Neither True nor False.

All tests assume that all objects are indexed in the boolean index, and I'm not sure that assumption is correct.

AFAIK - and I could be wrong, when catalog.reindexObject(obj) is called, and it does not have a field corresponding to the boolean index, the index is never touched. Please confirm this.

The test case I'm thinking of looks like this:

    obj = Dummy(1, True)
    index._index_object(obj.id, obj, attr='truth')
    obj = Dummy(2, False)
    index._index_object(obj.id, obj, attr='truth')
    obj = Dummy(3, True)
    index._index_object(obj.id, obj, attr='truth')
    # obj = Dummy(4, None)  ##Don't call index._index_object
    # obj = Dummy(5, None)  ##Don't call index._index_object

    ##False is the indexed value.
    res, idx = index._apply_index({'truth': True},  ##Inverted search
                                  resultset=IISet([1, 2, 3, 4, 5]))
    self.assertEqual(list(res), [1,3])
    ##Fails - list(res) is [1,3,4,5]

So the question is, what should we do with these objects in resultset that aren't in self._unindex?

Path 1a says remove objects that are not in self._unindex intersection(index, resultset)
Path 2a says keep objects that are not in self._unindex difference(resultset, index)

doing a difference removes only those objects that have an entry in the boolean index, leaving the other objects in the result, where as doing an intersection on the inversion will remove the objects that don't have an entry in self.unindex

And I argue that "remove them" is the correct answer for path 2a:
intersection(difference(self._unindex, index), resultset)

Additionally, I believe this test will pass, which exercises 1a with additional unindexed objects:

obj = Dummy(1, True)
index._index_object(obj.id, obj, attr='truth')
obj = Dummy(2, False)
index._index_object(obj.id, obj, attr='truth')
obj = Dummy(3, True)
index._index_object(obj.id, obj, attr='truth')
# obj = Dummy(4, None)  ##Don't call index._index_object
# obj = Dummy(5, None)  ##Don't call index._index_object

##False is the indexed value.
res, idx = index._apply_index({'truth': False},   ##query for indexed value.
                              resultset=IISet([1, 2, 3, 4, 5]))
self.assertEqual(list(res), [2])  #note that [4, 5] are not in result.

Other solutions include applying boolean indexes before all other indexes in catalog.search, or making sure that all objects in the catalog are included in the _unindex of boolean indexes. And honestly, I'm not qualified to say those are not better options. But I am very hesitant to say that they are.

Edit: updated tests, swapping false and true. My logic was backwards.

hannosch · 2017-10-04T00:38:18Z

Ah, yes. The issue is indeed about documents without entries in the specific index.

In the other places where we use the resultset (like the Field/UnIndex) we use the equivalent of intersection(resultset, index), so that should be the correct approach. I also remember the same logic applying to not queries, where documents without entries in each index are filtered out. If one wants a different behavior, the recommendation is to index some kind of dummy value for these documents.

I've created a PR #29 to fix the issue. I've used a slightly different approach, by using an intersection with the unindex (intersection(difference(resultset, index), self._unindex).

The passed in resultset is in-memory and usually small. The index, holding document ids for the less common value is also a small treeset. The difference call on those two is reasonably fast and doesn't require loading a lot of objects (buckets) from the database. An intersection between such a small result and the self._unindex is reasonably fast again.

On the other hand difference(self._unindex, index) has the worst-case performance characteristics, as the entire index and unindex have to be loaded and thus result in lots of database object loads. This worst-case is what we need to pay, as a consequence of not storing this inverted set (like a Un/FieldIndex would do). But we shouldn't introduce it into the more common codepath where we get a resultset passed in. Or rather the catalog query plan will figure out that taking this codepath on a boolean index with a query for the inverted value is slow, so it will favor execution plans where some other index is queried first.

flipmcf · 2017-10-04T00:46:56Z

the optimization you present makes sense.

Would you like me to do anything? I mean, if one complains, one should be willing to help fix.

hannosch · 2017-10-04T09:19:41Z

Would you like me to do anything?

Can you verify that this change actually fixes the problem you were having?

I've backported the change to all older branches, but would like some confirmation before I make all those releases.

flipmcf · 2017-10-04T14:44:39Z

Sure. No problem.

viaYoung · 2017-10-04T16:16:28Z

For now, I only ran this fix on my local, and it looks good. Thank you for fixing the bug.

flipmcf · 2017-10-04T19:47:24Z

We are now in the process of updating our Plone 4 to get Products.ZCatalog at 2.13.28 directly from github. I am confident this will fix our issue, but I won't close thist ticket until I get independent confirmation.

hannosch · 2017-10-10T10:11:31Z

I've gone ahead and cut new releases with this fix for all four branches. There are now 2.13.28, 3.0.3, 3.2.1 and 4.0.1 releases up on PyPi.

@esteele You might want to grab these new releases and update Plone to include the right maintenance release for each supported Plone version. This issue covers quite a nasty bug, which makes the catalog return wrong results in some hard to track cases.

tseaver · 2017-10-10T16:03:21Z

Many thanks to @flipmcf for tracking down and reporting this ugly bug!

flipmcf · 2017-10-10T17:50:24Z

@tseaver Thanks for shout out. Our Navigation Items Adapter is much simpler now.

hannosch added a commit that referenced this issue Oct 3, 2017

Extend BooleanIndex test coverage, refs #28.

e774dbb

hannosch added a commit that referenced this issue Oct 4, 2017

Fix BooleanIndex to filter out missing documents, refs #28.

4c465ab

hannosch added a commit that referenced this issue Oct 4, 2017

Fix BooleanIndex to filter out missing documents, refs #28. (#29)

ba3ff27

hannosch added a commit that referenced this issue Oct 4, 2017

Fix BooleanIndex to filter out missing documents, refs #28.

5926300

hannosch added a commit that referenced this issue Oct 4, 2017

Fix BooleanIndex to filter out missing documents, refs #28.

374131c

hannosch added a commit that referenced this issue Oct 4, 2017

Fix BooleanIndex to filter out missing documents, refs #28.

7d246f2

tseaver closed this as completed Oct 10, 2017

esteele added a commit to plone/buildout.coredev that referenced this issue Nov 24, 2017

Bump ZCatalog to get fix in zopefoundation/Products.ZCatalog#28

1898e0f

esteele added a commit to plone/buildout.coredev that referenced this issue Nov 24, 2017

Bump ZCatalog for fix in zopefoundation/Products.ZCatalog#28

b32b185

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Boolean Index query result is incorrect when inverted+resultset #28

Boolean Index query result is incorrect when inverted+resultset #28

flipmcf commented Oct 2, 2017

hannosch commented Oct 3, 2017

hannosch commented Oct 3, 2017 •

edited

Loading

flipmcf commented Oct 3, 2017 •

edited

Loading

hannosch commented Oct 4, 2017 •

edited

Loading

flipmcf commented Oct 4, 2017

hannosch commented Oct 4, 2017

flipmcf commented Oct 4, 2017

viaYoung commented Oct 4, 2017

flipmcf commented Oct 4, 2017

hannosch commented Oct 10, 2017

tseaver commented Oct 10, 2017

flipmcf commented Oct 10, 2017

Boolean Index query result is incorrect when inverted+resultset #28

Boolean Index query result is incorrect when inverted+resultset #28

Comments

flipmcf commented Oct 2, 2017

hannosch commented Oct 3, 2017

hannosch commented Oct 3, 2017 • edited Loading

flipmcf commented Oct 3, 2017 • edited Loading

hannosch commented Oct 4, 2017 • edited Loading

flipmcf commented Oct 4, 2017

hannosch commented Oct 4, 2017

flipmcf commented Oct 4, 2017

viaYoung commented Oct 4, 2017

flipmcf commented Oct 4, 2017

hannosch commented Oct 10, 2017

tseaver commented Oct 10, 2017

flipmcf commented Oct 10, 2017

hannosch commented Oct 3, 2017 •

edited

Loading

flipmcf commented Oct 3, 2017 •

edited

Loading

hannosch commented Oct 4, 2017 •

edited

Loading