tests: Factor "Open another ClientStorage to the same server" into generic _new_storage_client() #170

navytux · 2021-04-16T17:01:27Z

This allows ZODB tests to recognize ZEO as client-server storage and so
make "load vs external invalidate" test added in
zopefoundation/ZODB#345 to reproduce data
corruption problem reported at
#155.

For the reference: that problem should be fixed by
#169.

We drop

# It's hard to find the actual address.
# The rpc mgr addr attribute is a list.  Each element in the
# list is a socket domain (AF_INET, AF_UNIX, etc.) and an
# address.

note because at the time it was added (81f586c) it came with

addr = self._storage._rpc_mgr.addr[0][1]

but nowdays after 0386718 getting to server address is just by ClientStorage._addr.

…neric _new_storage_client() This allows ZODB tests to recognize ZEO as client-server storage and so make "load vs external invalidate" test added in zopefoundation/ZODB#345 to reproduce data corruption problem reported at zopefoundation#155. For the reference: that problem should be fixed by zopefoundation#169. We drop # It's hard to find the actual address. # The rpc mgr addr attribute is a list. Each element in the # list is a socket domain (AF_INET, AF_UNIX, etc.) and an # address. note because at the time it was added (81f586c) it came with addr = self._storage._rpc_mgr.addr[0][1] but nowdays after 0386718 getting to server address is just by ClientStorage._addr.

@d-maurer

For ZEO this data corruption bug was reported at zopefoundation/ZEO#155 and fixed at zopefoundation/ZEO#169. Without that fix the failure shows e.g. as follows when running ZEO test suite: Failure in test check_race_load_vs_external_invalidate (ZEO.tests.testZEO.BlobAdaptedFileStorageTests) Traceback (most recent call last): File "/usr/lib/python2.7/unittest/case.py", line 329, in run testMethod() File "/home/kirr/src/wendelin/z/ZODB/src/ZODB/tests/BasicStorage.py", line 621, in check_race_load_vs_external_invalidate self.fail([_ for _ in failure if _]) File "/usr/lib/python2.7/unittest/case.py", line 410, in fail raise self.failureException(msg) AssertionError: ['T1: obj1.value (7) != obj2.value (8)'] Even if added test is somewhat similar to check_race_loadopen_vs_local_invalidate, it is added anew without trying to unify code. The reason here is that the probability to catch load vs external invalidation race is significantly reduced when there are only 1 modify and 1 verify workers. The unification with preserving both tests semantic would make test for "load vs local invalidate" harder to follow. Sometimes a little copying is better than trying to unify too much. For the test to work, test infrastructure is amended with ._new_storage_client() method that complements ._storage attribute: client-server storages like ZEO, NEO and RelStorage allow several storage clients to be connected to single storage server. For client-server storages test subclasses should implement _new_storage_client to return new storage client that is connected to the same storage server self._storage is connected to. For ZEO ._new_storage_client() is added by zopefoundation/ZEO#170 Other client-server storages can follow to implement ._new_storage_client() and this way automatically activate this "load vs external invalidation" test when their testsuite is run. Contrary to test for "load vs local invalidate" N is set to lower value (100), because with 8 workers the bug is usually reproduced at not-so-high iteration number (5-10-20). /cc @d-maurer, @jamadden, @jmuchemb

navytux · 2021-04-16T17:23:47Z

CI: besides usual "uvloop requires Python 3.7" there is zeo-fan-out.test failure on pypy3, that I belive is unrelated to my patch. Other 12 jobs pass ok.

Add entry to changelog. Note: the fix now has corresponding test, that should be coming in through zopefoundation#170 and zopefoundation/ZODB#345

d-maurer · 2021-04-20T13:15:38Z

src/ZEO/tests/testZEO.py

+            self._storage._addr, wait=1, **self._client_options())
+        client = self._wrap_client(client)
+        client.registerDB(DummyDB())
+        return client


Should the storage really already register with a DB? In some situation, we might want an MVCCAdapter to be registered.

@d-maurer, thanks for feedback. I've thought about that too, and my logic here is:

the original code being factored-out had that registerDB(DummyDB()). There is also the same registerDB(DummyDB()) in GenericTestBase.setUp which prepares ._storage client for practically most of the tests:

ZEO/src/ZEO/tests/testZEO.py

Lines 236 to 263 in 707316c

class GenericTestBase(

# Base class for all ZODB tests

StorageTestBase.StorageTestBase):

shared_blob_dir = False

blob_cache_dir = None

server_debug = False

def setUp(self):

StorageTestBase.StorageTestBase.setUp(self)

logger.info("setUp() %s", self.id())

zport, stop = forker.start_zeo_server(

self.getConfig(), self.getZEOConfig(), debug=self.server_debug)

self._servers = [stop]

if not self.blob_cache_dir:

# This is the blob cache for ClientStorage

self.blob_cache_dir = tempfile.mkdtemp(

'blob_cache',

dir=os.path.abspath(os.getcwd()))

self._storage = self._wrap_client(

ClientStorage(

zport, '1', cache_size=20000000,

min_disconnect_poll=0.5, wait=1,

wait_timeout=60, blob_dir=self.blob_cache_dir,

shared_blob_dir=self.shared_blob_dir,

**self._client_options()),

)

self._storage.registerDB(DummyDB())

the DB.registerDB, when called for the second time, forgets what was registered first. This way when/if MVCCAdapter registers into ClientStorage, the registration of DummyDB is dropped.

The code in _new_storage_client is this OK, and is in line with the way ClientStorage for main ._storage is prepared.

navytux · 2021-04-20T15:28:28Z

Thanks, @d-maurer

@jmuchemb

Currently loadBefore and prefetch spawn async protocol.load_before task, and, after waking up on its completion, populate the cache with received data. But in between the time when protocol.load_before handler is run and the time when protocol.load_before caller wakes up, there is a window in which event loop might be running some other code, including the code that handles invalidateTransaction messages from the server. This means that cache updates and cache invalidations can be processed on the client not in the order that server sent it. And such difference in the order can lead to data corruption if e.g server sent <- loadBefore oid serial=tid1 next_serial=None <- invalidateTransaction tid2 oid and client processed it as invalidateTransaction tid2 oid cache.store(oid, tid1, next_serial=None) because here the end effect is that invalidation for oid@tid2 is not applied to the cache. The fix is simple: perform cache updates right after loadBefore reply is received. Fixes: zopefoundation#155 The fix is based on analysis and initial patch by @jmuchemb: zopefoundation#155 (comment) A tests corresponding to the fix is coming coming through zopefoundation#170 and zopefoundation/ZODB#345 For the reference, my original ZEO-specific data corruption reproducer is here: zopefoundation#155 (comment) https://lab.nexedi.com/kirr/wendelin.core/blob/ecd0e7f0/zloadrace5.py /cc @jamadden, @dataflake, @jimfulton /reviewed-by @jmuchemb, @d-maurer /reviewed-on zopefoundation#169

@jmuchemb

Currently loadBefore and prefetch spawn async protocol.load_before task, and, after waking up on its completion, populate the cache with received data. But in between the time when protocol.load_before handler is run and the time when protocol.load_before caller wakes up, there is a window in which event loop might be running some other code, including the code that handles invalidateTransaction messages from the server. This means that cache updates and cache invalidations can be processed on the client not in the order that server sent it. And such difference in the order can lead to data corruption if e.g server sent <- loadBefore oid serial=tid1 next_serial=None <- invalidateTransaction tid2 oid and client processed it as invalidateTransaction tid2 oid cache.store(oid, tid1, next_serial=None) because here the end effect is that invalidation for oid@tid2 is not applied to the cache. The fix is simple: perform cache updates right after loadBefore reply is received. Fixes: #155 The fix is based on analysis and initial patch by @jmuchemb: #155 (comment) A tests corresponding to the fix is coming coming through #170 and zopefoundation/ZODB#345 For the reference, my original ZEO-specific data corruption reproducer is here: #155 (comment) https://lab.nexedi.com/kirr/wendelin.core/blob/ecd0e7f0/zloadrace5.py /cc @jamadden, @dataflake, @jimfulton /reviewed-by @jmuchemb, @d-maurer /reviewed-on #169

@d-maurer

For ZEO this data corruption bug was reported at zopefoundation/ZEO#155 and fixed at zopefoundation/ZEO#169. Without that fix the failure shows e.g. as follows when running ZEO test suite: Failure in test check_race_load_vs_external_invalidate (ZEO.tests.testZEO.BlobAdaptedFileStorageTests) Traceback (most recent call last): File "/usr/lib/python2.7/unittest/case.py", line 329, in run testMethod() File "/home/kirr/src/wendelin/z/ZODB/src/ZODB/tests/BasicStorage.py", line 621, in check_race_load_vs_external_invalidate self.fail([_ for _ in failure if _]) File "/usr/lib/python2.7/unittest/case.py", line 410, in fail raise self.failureException(msg) AssertionError: ['T1: obj1.value (7) != obj2.value (8)'] Even if added test is somewhat similar to check_race_loadopen_vs_local_invalidate, it is added anew without trying to unify code. The reason here is that the probability to catch load vs external invalidation race is significantly reduced when there are only 1 modify and 1 verify workers. The unification with preserving both tests semantic would make test for "load vs local invalidate" harder to follow. Sometimes a little copying is better than trying to unify too much. For the test to work, test infrastructure is amended with ._new_storage_client() method that complements ._storage attribute: client-server storages like ZEO, NEO and RelStorage allow several storage clients to be connected to single storage server. For client-server storages test subclasses should implement _new_storage_client to return new storage client that is connected to the same storage server self._storage is connected to. For ZEO ._new_storage_client() is added by zopefoundation/ZEO#170 Other client-server storages can follow to implement ._new_storage_client() and this way automatically activate this "load vs external invalidation" test when their testsuite is run. Contrary to test for "load vs local invalidate" N is set to lower value (100), because with 8 workers the bug is usually reproduced at not-so-high iteration number (5-10-20). /cc @d-maurer, @jamadden, @jmuchemb /reviewed-on zopefoundation#345

@d-maurer

For ZEO this data corruption bug was reported at zopefoundation/ZEO#155 and fixed at zopefoundation/ZEO#169. Without that fix the failure shows e.g. as follows when running ZEO test suite: Failure in test check_race_load_vs_external_invalidate (ZEO.tests.testZEO.BlobAdaptedFileStorageTests) Traceback (most recent call last): File "/usr/lib/python2.7/unittest/case.py", line 329, in run testMethod() File "/home/kirr/src/wendelin/z/ZODB/src/ZODB/tests/BasicStorage.py", line 621, in check_race_load_vs_external_invalidate self.fail([_ for _ in failure if _]) File "/usr/lib/python2.7/unittest/case.py", line 410, in fail raise self.failureException(msg) AssertionError: ['T1: obj1.value (7) != obj2.value (8)'] Even if added test is somewhat similar to check_race_loadopen_vs_local_invalidate, it is added anew without trying to unify code. The reason here is that the probability to catch load vs external invalidation race is significantly reduced when there are only 1 modify and 1 verify workers. The unification with preserving both tests semantic would make test for "load vs local invalidate" harder to follow. Sometimes a little copying is better than trying to unify too much. For the test to work, test infrastructure is amended with ._new_storage_client() method that complements ._storage attribute: client-server storages like ZEO, NEO and RelStorage allow several storage clients to be connected to single storage server. For client-server storages test subclasses should implement _new_storage_client to return new storage client that is connected to the same storage server self._storage is connected to. For ZEO ._new_storage_client() is added by zopefoundation/ZEO#170 Other client-server storages can follow to implement ._new_storage_client() and this way automatically activate this "load vs external invalidation" test when their testsuite is run. Contrary to test for "load vs local invalidate" N is set to lower value (100), because with 8 workers the bug is usually reproduced at not-so-high iteration number (5-10-20). /cc @d-maurer, @jamadden, @jmuchemb /reviewed-on #345

This was referenced Apr 16, 2021

tests: Add test for open vs invalidation race zopefoundation/ZODB#345

Merged

client: Fix cache corruption on loadBefore and prefetch #169

Merged

d-maurer reviewed Apr 20, 2021

View reviewed changes

d-maurer approved these changes Apr 20, 2021

View reviewed changes

navytux merged commit 415b1ff into zopefoundation:master Apr 20, 2021

navytux deleted the y/t-new_storage_client branch April 20, 2021 15:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

tests: Factor "Open another ClientStorage to the same server" into generic _new_storage_client() #170

tests: Factor "Open another ClientStorage to the same server" into generic _new_storage_client() #170

navytux commented Apr 16, 2021

navytux commented Apr 16, 2021

d-maurer Apr 20, 2021

navytux Apr 20, 2021

navytux commented Apr 20, 2021

	class GenericTestBase(
	# Base class for all ZODB tests
	StorageTestBase.StorageTestBase):

	shared_blob_dir = False
	blob_cache_dir = None
	server_debug = False

	def setUp(self):
	StorageTestBase.StorageTestBase.setUp(self)
	logger.info("setUp() %s", self.id())
	zport, stop = forker.start_zeo_server(
	self.getConfig(), self.getZEOConfig(), debug=self.server_debug)
	self._servers = [stop]
	if not self.blob_cache_dir:
	# This is the blob cache for ClientStorage
	self.blob_cache_dir = tempfile.mkdtemp(
	'blob_cache',
	dir=os.path.abspath(os.getcwd()))
	self._storage = self._wrap_client(
	ClientStorage(
	zport, '1', cache_size=20000000,
	min_disconnect_poll=0.5, wait=1,
	wait_timeout=60, blob_dir=self.blob_cache_dir,
	shared_blob_dir=self.shared_blob_dir,
	**self._client_options()),
	)
	self._storage.registerDB(DummyDB())

tests: Factor "Open another ClientStorage to the same server" into generic _new_storage_client() #170

tests: Factor "Open another ClientStorage to the same server" into generic _new_storage_client() #170

Conversation

navytux commented Apr 16, 2021

navytux commented Apr 16, 2021

d-maurer Apr 20, 2021

Choose a reason for hiding this comment

navytux Apr 20, 2021

Choose a reason for hiding this comment

navytux commented Apr 20, 2021