Not indexed value support (MissingValue, EmptyValue) #74

Open
wants to merge 49 commits into
base: master

andbag
Member

@andbag andbag commented May 8, 2019

As discussed in issue #35, UnIndex can now support queries on MissingValue and EmptyValue. KeywordIndex currently implements the new feature. I hope for active feedback.

@andbag andbag requested a review from d-maurer May 8, 2019 13:13
@andbag
Member Author

andbag commented May 8, 2019

@icemac unfortunately, CI for python3.8-dev is broken.

@dataflake
Member

That's clearly an issue with the build environment, not with your code...

Contributor

@d-maurer d-maurer left a comment

KeywordIndex line 76: insertNotIndexed should not have the argument newKeywords or should have a different name (such as insertSpecialIndexEntry).

KeywordIndex line 75: you may need to remove not indexed entries if oldkeywords is None.

KeywordIndex lines 81, 88: you want oldkeywords to be an OOSet but you store it as a list.

KeywordIndex line 84: I expect an insertNotIndexed(...) somewhere in this else block (as there was one in the then block). In addition, it might be necessary to remove a potential MissingValue from the special index.

KeywordIndex line 120: not sure that an exception from the call should be silently swallowed. I suggest to at least log an entry.

KeywordIndex lines 149, 73: inconsistent check for missing _unindex entry (_marker versus None).

KeywordIndex: I suggest to rename ...NotIndexed to ...SpeciallyIndexed (or something similar) - as the document is indexed, just not in the "normal" way.

KeywordIndex, line 154: maybe you do not need this line (in case _unindex is set to [] for an empty value, as is the case for the old KeywordIndex).

The names MissingValue and EmptyValue are in "CamelCase" which indicates a class. Maybe, we should avoid "CamelCasing" for them to be more conformant with PEP8.

interfaces line 288: NotIndexedValue should not be used by the application; maybe, we want to indicate this by prefixing the name with _.

KeywordIndex - "pure not": I have not seen a "pure not" special handling for KeywordIndex. However, if we include MissingValue in a "pure not" for "UnIndex", we must as well include EmptyValue for KeywordIndex -- this could be done in UnIndex (to avoid code duplication in KeywordIndex).

UnIndex, line 565: the "pure not" should likely be implemented via _unindex rather than an enumeration of all keys (which may cause a huge multiunion to be executed). In addition: KeywordIndex might need that documents indexed under EmptyValue are included (by default) in a "pure not" result.

@andbag
Member Author

andbag commented May 8, 2019

@d-maurer thanks for the helpful comments. However, the old implementation of KeywordIndex does not keep empty values like '()' in _unindex.

>>> from Products.PluginIndexes.KeywordIndex.KeywordIndex import KeywordIndex
>>> index = KeywordIndex('foo')
>>> class Dummy: pass
>>> obj1 = Dummy()
>>> obj1.foo = ('a','b')
>>> index.index_object(1,obj1)
True
>>> tuple(index._unindex.keys())
(1,)
>>> obj2 = Dummy()
>>> obj2.foo = ()
>>> index.index_object(2,obj2)
True
>>> tuple(index._unindex.keys())# expect (1, 2, ) but
(1,)

If we want to use _unindex for 'pure not' queries, then _unindex should also collect the special values.
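
The point above can be sketched with a toy model. This is only an illustration under stated assumptions: plain dicts and sets stand in for the BTrees structures, and ToyKeywordIndex, EMPTY, record_empty and pure_not are hypothetical names that do not exist in Products.PluginIndexes. It shows why a "pure not" computed from _unindex misses documents with an empty value unless _unindex also records them.

```python
EMPTY = object()  # hypothetical sentinel standing in for the "empty value"


class ToyKeywordIndex:
    """Toy model: plain dicts replace the BTrees used by the real index."""

    def __init__(self, record_empty):
        self.record_empty = record_empty
        self._index = {}    # keyword -> set of document ids
        self._unindex = {}  # document id -> set of keywords, or EMPTY

    def index_object(self, doc_id, keywords):
        keywords = set(keywords)
        if not keywords:
            # Old behaviour (record_empty=False): nothing is stored at all.
            if self.record_empty:
                self._unindex[doc_id] = EMPTY
            return
        self._unindex[doc_id] = keywords
        for kw in keywords:
            self._index.setdefault(kw, set()).add(doc_id)

    def pure_not(self, keyword):
        """Documents known to the index that do not carry `keyword`."""
        return set(self._unindex) - self._index.get(keyword, set())


old_style = ToyKeywordIndex(record_empty=False)
new_style = ToyKeywordIndex(record_empty=True)
for idx in (old_style, new_style):
    idx.index_object(1, ("a", "b"))
    idx.index_object(2, ())

not_a_old = old_style.pure_not("a")  # document 2 is invisible here
not_a_new = new_style.pure_not("a")  # document 2 is found here
```

In this model the reverse mapping is the only place that knows about every indexed document, which is why the special values have to be collected there as well.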

@d-maurer
Contributor

d-maurer commented May 9, 2019 via email

@d-maurer
Contributor

d-maurer commented May 11, 2019

@andbag
ATTENTION: Now that we index MissingValue, we may run into trouble with the strange behaviour described in #64, i.e. when the index indexes more than a single attribute. We need at least tests for this case.
The current behaviour is to iterate over all "indexed attributes" and give each attribute a chance to modify the index according to its value. Effectively, this means that the last attribute with a value wins. If we index even a "missing value", then the last attribute will effectively always win, whether it has a value or not. I am quite sure that this would be unexpected.

As I wrote in #64, I believe that the current behaviour is not what was really intended: it would be much more natural if the first rather than the last attribute with a value wins. Maybe we can use the opportunity to document what it should mean that an index indexes several attributes, and maybe change the order in the process.

Whether or not we do something about the documentation for the "several indexed attributes" case (or even change the order), we must ensure that an attribute with a value has precedence over one without a value. We can distinguish both cases by checking the return value of _index_object.

I am unsure how to handle the case "empty value" (in contrast to "missing value"): should an attribute with a non empty value have precedence over one with an empty value? This question is relevant only for KeywordIndex like indexes. Should we say that "empty value" must be handled differently from "missing value", then potentially, we must change _index_object as well to differentiate both cases.
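
To make the precedence question concrete, here is a minimal sketch, assuming a simplified model in which an index just picks one value per object. MISSING, pick_last and pick_first_with_value are hypothetical helpers, not code from the PR. The "empty value" case is deliberately left out, since the comment above leaves that question open.

```python
MISSING = object()  # hypothetical stand-in for the "missing value"


def pick_last(obj, attrs):
    """Every attribute overwrites the result, even one without a value."""
    value = MISSING
    for attr in attrs:
        value = getattr(obj, attr, MISSING)
        if value is None:
            value = MISSING
    return value


def pick_first_with_value(obj, attrs):
    """An attribute with a value takes precedence over one without."""
    for attr in attrs:
        value = getattr(obj, attr, None)
        if value is not None:
            return value
    return MISSING


class Doc:
    foo = "NO"
    bar = None


doc = Doc()
last = pick_last(doc, ["foo", "bar"])
first = pick_first_with_value(doc, ["foo", "bar"])
```

With these definitions, `last` ends up as MISSING although `foo` has a value, while `first` keeps the value of `foo`.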

@andbag
Member Author

andbag commented May 14, 2019

Unfortunately, there are no tests yet that check the current behavior for multiple indexed attributes. I will submit a new PR for these tests so that we don't lose track if we change the current behavior.

@andbag
Member Author

andbag commented May 14, 2019

@d-maurer My observations show that the last attribute always wins, regardless of whether its value is set or not. The same applies when the last indexed attribute does not exist at all. The following test is based on the code of the master branch:

>>> from Products.PluginIndexes.KeywordIndex.tests import TestKeywordIndex
>>> test=TestKeywordIndex()
>>> index = test._makeOne('foobar', extra={'indexed_attrs': 'foo, bar'})
>>> class DummyContent(object):
...    def __init__(self, **kw):
...       for k in kw.keys():
...          setattr(self, k, kw.get(k))
... 
>>> index.index_object(0, DummyContent(foo=['NO']))
True
>>> index.index_object(1, DummyContent(foo=['NO'], bar=None))
True
>>> index.index_object(2, DummyContent(foo=['NO'], bar=''))
True
>>> tuple(index._index)
()
>>> tuple(index._unindex)
()

If the last attribute has a value, it is stored in the index.

>>> index.index_object(3, DummyContent(foo=['NO'], bar='YES'))
True
>>> tuple(index._index)
('YES',)
>>> tuple(index._unindex)
(3,)

In this regard, the option "indexed attributes" has no effect :(. That's why I don't think anyone's using the feature.

@d-maurer
Contributor

d-maurer commented May 14, 2019 via email

@andbag
Member Author

andbag commented May 15, 2019

@d-maurer

You might be right. I see two options:

  • we raise an exception when more than a single attribute is indexed
  • we document the feature "indexed attributes" and ensure that the implementation follows the documentation -- at least for "our own" indexes.

I prefer option one, because I don't want to implement features that apparently nobody needs. This feature can be implemented much better using a method that is executed by calling the single "indexed attribute". Which error fits best: TypeError or NotImplementedError?

@d-maurer
Contributor

d-maurer commented May 15, 2019 via email

@d-maurer
Contributor

d-maurer commented May 27, 2019 via email

@andbag
Member Author

andbag commented May 29, 2019

@d-maurer I've corrected the code and generalized it a bit. Before I improve the code further, it would be nice if you could have a look at my changes. In particular, the mapping to special values can now be configured, and its purpose is documented in interfaces.py.

special_values = Attribute('A dict which maps not regularly indexable '
                           'values or errors on value calculation to '
                           'a special value')

The implementation looks like this:

special_values = {TypeError: missing,
                  AttributeError: missing,
                  None: missing,
                  (): empty}

@d-maurer
Contributor

@andbag

@d-maurer ... Before I improve the code, it would be nice if you could have a look at my changes.

I find the idea good but have a few suggestions:

  • special_values could get a better name and description. Let's start with "description":
    A dict mapping "exceptional" object values to a special value.
    When the index indexes an object, it derives an index specific value from the object, the so called "object value" (relative to this index). This process can result in an exception or produce a value which the index cannot index in the normal way.
    The attribute controls what should happen in such a case. It maps exceptions or values to one of the special values. If an exception not mapped occurs, it is reraised; if an object value is not mapped, it is indexed normally.
    A name like map_to_special_value would fit quite well with this description.
  • KeywordIndex maps () to empty. However, a KeywordIndex related object value could be any sequence, not just tuple. It might be better to replace the dict by methods (e.g. map_value ("map value to a special value, if necessary") and map_exception_to_special_value).
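
The method-based variant suggested above could look roughly like the following. This is a sketch under assumptions, not the PR's implementation: `missing` and `empty` are plain sentinels here, SpecialValueMixin, KeywordLikeIndex and object_value are hypothetical names, and treating a string as a single keyword is an assumption of this example.

```python
missing = object()  # hypothetical stand-ins for the special values
empty = object()


class SpecialValueMixin:
    # Exceptions during value determination that are mapped to `missing`.
    exceptions_treated_as_missing = (AttributeError, TypeError)

    def map_value(self, value):
        """Map the object value to the value that should get indexed."""
        if value is None:
            return missing
        return value

    def map_exception_to_special_value(self, exc):
        """Map a known exception to a special value; reraise all others."""
        if isinstance(exc, self.exceptions_treated_as_missing):
            return missing
        raise exc

    def object_value(self, obj, attr):
        try:
            value = getattr(obj, attr)
            if callable(value):
                value = value()
        except Exception as exc:
            return self.map_exception_to_special_value(exc)
        return self.map_value(value)


class KeywordLikeIndex(SpecialValueMixin):
    def map_value(self, value):
        value = super().map_value(value)
        if value is missing or isinstance(value, str):
            return value
        try:
            is_empty = len(value) == 0  # any empty sequence, not just ()
        except TypeError:
            is_empty = False
        return empty if is_empty else value


class Doc:
    pass


index = KeywordLikeIndex()
doc = Doc()
doc.tags = []
empty_result = index.object_value(doc, "tags")
missing_result = index.object_value(doc, "no_such_attribute")
```

Because the emptiness check lives in a method, a subclass can recognise empty lists, tuples or sets alike, which a dict keyed on `()` cannot do.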

try:
    self.insertForwardIndexEntry(kw, documentId)
    keys.append(kw)
except TypeError:
Contributor

I doubt that this exception handling is right: it does not index the object if one key cannot be indexed - and the problem is only reported via a log entry.
In my opinion, other alternatives would be better:

  • log then ignore keys not indexable
  • do not catch the exception (and let the whole operation fail)
  • handle this TypeError in the same way as if it had occurred during the object value determination (e.g. map to missing).

In any case, the logic is at the wrong place. One would need similar logic for "update existing index info" and it should not be duplicated.
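
As an illustration of the first alternative ("log then ignore keys not indexable"), kept in a single helper so that both the "new entry" and the "update" path could share it, consider the sketch below. It rests on assumptions: a sorted Python list searched with `bisect` stands in for the ordered forward index, and insert_indexable and the logger name are hypothetical.

```python
import bisect
import logging

logger = logging.getLogger("toy.index")  # hypothetical logger name


def insert_indexable(sorted_keys, keywords, doc_id):
    """Insert each keyword; log and skip those that cannot be ordered.

    Returns the keywords that were accepted, so the caller can store
    exactly those in the reverse mapping.
    """
    accepted = []
    for kw in keywords:
        try:
            pos = bisect.bisect_left(sorted_keys, kw)
        except TypeError:
            # Under Python 3, comparing e.g. str with int raises TypeError.
            logger.warning("document %r: keyword %r is not indexable",
                           doc_id, kw)
            continue
        if pos == len(sorted_keys) or sorted_keys[pos] != kw:
            sorted_keys.insert(pos, kw)
        accepted.append(kw)
    return accepted


keys = []
accepted = insert_indexable(keys, ["a", 1, "b"], 42)
```

Here the keyword `1` is skipped and logged, while "a" and "b" are both indexed, so similar keywords are treated alike.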

Member Author

In the original version there was a bug that could lead to inconsistencies in the index. Also, the problem was not logged. In order not to have to abolish the old behavior completely, I would prefer the first variant. Consequently, _unindex is only allowed to store indexable keywords.

Member Author

Correction: Since the type OOSet is forced for keywords in the meantime, a TypeError can also be raised under python3 e.g. in the method map_value. For consistency reasons, TypeError is now always handled in the same way when determining the attribute value.

Member Author

@d-maurer I'm beginning to wonder if it wouldn't be more sensible to escalate TypeError when a value in the keyword list is incompatible with the already indexed values. Otherwise the new values would have to be pre-validated before being indexed.


newKeywords = OOSet(newKeywords)

self._unindex[documentId] = newKeywords
Contributor

The _unindex update could be done together with the "update existing index info" case.

Member Author

Okay


# normalize datum
if isinstance(newKeywords, basestring):
    newKeywords = (newKeywords,)
else:
    try:
        # unique
        newKeywords = set(newKeywords)
Contributor

At another place, the keywords are collected in an OOSet. Using different set types increases the constraints placed on the usable types for keywords: OOSet requires orderability (as the BTrees package as a whole); set requires hashability. I recommend to use OOSet uniformly (and avoid the tuple recasting).
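
The two requirements named above are independent of each other, which a few built-in types can demonstrate. The sketch does not import BTrees; is_hashable and is_orderable_with are hypothetical probes that only test what a hash-based set and an order-based set would need from a keyword.

```python
def is_hashable(value):
    """True if `value` could be stored in a built-in set."""
    try:
        hash(value)
    except TypeError:
        return False
    return True


def is_orderable_with(value, other):
    """True if `value < other` is defined, as an ordered set requires."""
    try:
        value < other
    except TypeError:
        return False
    return True


# A list is orderable but not hashable.
list_hashable = is_hashable([1, 2])
list_orderable = is_orderable_with([1], [2])

# A complex number is hashable but not orderable.
complex_hashable = is_hashable(1j)
complex_orderable = is_orderable_with(1j, 2j)

# Under Python 3, str and int are not orderable with each other.
mixed_orderable = is_orderable_with("a", 1)
```

Mixing both set types therefore means a keyword has to satisfy both constraints, whereas using one type throughout needs only one of them.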

Member Author

Okay

@andbag
Member Author

andbag commented May 31, 2019

  • KeywordIndex maps () to empty. However, a KeywordIndex related
    object value could be any sequence, not just tuple. It might be better to
    replace the dict by methods (e.g. map_value ("map value to a special value,
    if necessary") and map_exception_to_special_value).

@d-maurer I just can't imagine how I can implement such a method generically. In the end, the _get_object_datum method in combination with map_exception_to_special_value already serves the purpose, doesn't it?
I've thought about it again. I'll program a proposal. However, the methods could take shorter names and look better in camel case notation (e.g. mapValue and mapException).

@d-maurer
Contributor

d-maurer commented May 31, 2019 via email

@icemac
Member

icemac commented Jun 7, 2019

What a pity that Python 3.8 segfaults when starting the test. I cleaned the caches and tried to restart the Python 3.8 job.

@icemac
Member

icemac commented Jun 7, 2019

Cool, cleaning the TravisCI cache seems to do the trick.

@andbag andbag requested a review from d-maurer June 11, 2019 14:08
from Products.PluginIndexes.KeywordIndex.KeywordIndex import KeywordIndex
from Products.PluginIndexes.unindex import _marker
from Products.ZCatalog.query import IndexQuery

try:
Contributor

Having used similar code for Python 2/3 compatibility, I have been directed to use six instead. Consistently using six for Python 2/3 compatibility will facilitate code cleanup once Python 2 support is dropped.

return tuple(pkl)
return OOSet(pkl)

def _get_component_datum(self, obj, attr):
Contributor

This almost looks like get_object_datum. Are you sure you need this special definition?

else:
    try:
        self.index_objectKeywords(documentId, newKeywords)
    except self.exceptions_treated_as_missing:
Contributor

The logic here does not yet seem correct: assume newKeywords is a special value - but not one we want to support. It then goes into index_objectKeywords (which will fail because it is a special value).

index=self.id))
if self.providesSpecialIndex(missing):
newKeywords = missing
self.insertSpecialIndexEntry(missing, documentId)
Contributor

The logic here does not yet seem correct: assume the keywords "a", 1, "b". index_objectKeywords will fail after it has indexed "a", and the document is then added to the missing index as well. First, although the keywords "a" and "b" are similar, they are not treated alike; second, it may be surprising to find an object both in a "normal" index and in the "missing" index.

Despite appearances, the logic could be right: you may already have ensured elsewhere that newKeywords contains only keywords of the same type. In this case, if index_objectKeywords fails at all, it will fail with the first keyword. I suggest adding a corresponding comment in this case.

doc_id=documentId,
index=self.id))

if self.providesSpecialIndex(missing):
Contributor

I have already seen this quite complex logic before. I would centralize it (maybe in a locally defined function) to have it in a single place.

return value

def index_objectKeywords(self, documentId, keywords):
""" carefully index keywords of object with integer id 'documentId'
Contributor

You are no longer "careful" here. Likely, there is no longer any need because you have ensured elsewhere that keywords is homogeneous and the indexing will fail with the first element if it fails at all.

newSet = newKeywords = OOSet(newKeywords)

try:
    fdiff = difference(oldSet, newSet)
Contributor

This will fail if the keywords change type, and will leave your index in a strange state. Assume "oldSet" to be OOSet(['a', 'b']) and "newSet" to be OOSet([1, 2]). Then, under Python 3, the difference will result in a TypeError, which lets your document remain indexed under oldSet and get newly indexed under missing.

Under Python 3, all indexed values must have a common type. Changing the keyword type will therefore not work (apart from constructed cases). Therefore, you should let an exception from the difference calls propagate (maybe log and provide a more specific error message) and not turn it into missing.
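
One way to follow this advice is to compute both differences before touching the index, and to reraise the error with a more specific message. The sketch below assumes a simplified model: `sorted()` is used only to provoke the same ordering TypeError an ordered set would raise, and keyword_changes and the logger name are hypothetical.

```python
import logging

logger = logging.getLogger("toy.index")  # hypothetical logger name


def keyword_changes(old_keywords, new_keywords, doc_id, index_id):
    """Return (to_remove, to_add) without modifying anything.

    Raises TypeError with a specific message if old and new keywords
    are not mutually orderable, instead of mapping the document to a
    special value.
    """
    try:
        # sorted() needs mutually orderable items, like an ordered set.
        sorted(list(old_keywords) + list(new_keywords))
    except TypeError as exc:
        message = ("index %r: keywords of document %r changed to an "
                   "incompatible type" % (index_id, doc_id))
        logger.error(message)
        raise TypeError(message) from exc
    old, new = set(old_keywords), set(new_keywords)
    return old - new, new - old


to_remove, to_add = keyword_changes(["a", "b"], ["b", "c"], 1, "foo")
```

Since nothing is mutated before the check succeeds, a failing update leaves the document indexed exactly as it was.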


def unindex_objectKeywords(self, documentId, keywords):
""" carefully unindex the object with integer id 'documentId'"""
""" carefully unindex keywords of object with integer id 'documentId'
Contributor

You are not "careful" here (drop the word).

special value query term."""

def map_value(value):
""" Map value, which is typically not generically indexable,
Contributor

The "is typically not" is wrong.

I recommend:

def map_value(value):
  """Map (original) value to the value that should get indexed.

  The (original) value obtained from the object might not be indexable in the normal way.
  `map_value` gives you the chance to map it to a different, usually a special value in this case.
  """

@d-maurer
Contributor

d-maurer commented Jun 26, 2019 via email

4 participants