Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Corruption of data structures in storage under higher load #8

Open
icemac opened this issue Nov 28, 2018 · 13 comments

Comments

@icemac
Copy link
Member

commented Nov 28, 2018

When putting tempstorage under higher load it can happen that it corrupts its data structures and runs into POSKeyErrors.

"higher load" means in the concrete example of a customer application: rendering a page containing 5 frames on a single machine in a single browser on the first try to render the page. In most times two of the frames cannot be rendered sometimes it is only one. (It was not happening in the same application running on Zope 2.13 with Python 2.)

When trying to reproduce the error outside that application I get tracebacks like:

2018-11-28 15:59:14 - ERROR [ZODB.Connection] Couldn't load state for Products.Transience.Transience.Increaser 0x04cb
Traceback (most recent call last):
  File ".../lib/python3.6/site-packages/ZODB/Connection.py", line 791, in setstate
    p, serial = self._storage.load(oid)
  File ".../lib/python3.6/site-packages/ZODB/mvccadapter.py", line 143, in load
    r = self._storage.loadBefore(oid, self._start)
  File ".../lib/python3.6/site-packages/tempstorage/TemporaryStorage.py", line 187, in loadBefore
    raise POSException.POSKeyError(oid)
ZODB.POSException.POSKeyError: 0x04cb
2018-11-28 15:59:14 - ERROR [waitress] Exception when serving /temp_folder/session_data/manage_container
Traceback (most recent call last):
  File ".../lib/python3.6/site-packages/waitress/channel.py", line 338, in service
    task.service()
  File ".../lib/python3.6/site-packages/waitress/task.py", line 169, in service
    self.execute()
  File ".../lib/python3.6/site-packages/waitress/task.py", line 399, in execute
    app_iter = self.channel.server.application(env, start_response)
  File ".../lib/python3.6/site-packages/ZPublisher/httpexceptions.py", line 30, in __call__
    return self.application(environ, start_response)
  File ".../lib/python3.6/site-packages/ZPublisher/WSGIPublisher.py", line 268, in publish_module
    response = _publish(request, new_mod_info)
  File ".../lib/python3.6/site-packages/ZPublisher/WSGIPublisher.py", line 210, in publish
    bind=1)
  File ".../lib/python3.6/site-packages/ZPublisher/mapply.py", line 85, in mapply
    return debug(object, args, context)
  File ".../lib/python3.6/site-packages/ZPublisher/WSGIPublisher.py", line 57, in call_object
    return obj(*args)
  File ".../lib/python3.6/site-packages/App/special_dtml.py", line 94, in __call__
    *(self,) + args[1:], **kw)
  File ".../lib/python3.6/site-packages/DocumentTemplate/DT_String.py", line 519, in __call__
    result = render_blocks(self._v_blocks, md)
  File ".../lib/python3.6/site-packages/DocumentTemplate/_DocumentTemplate.py", line 148, in render_blocks
    render_blocks_(blocks, rendered, md)
  File ".../lib/python3.6/site-packages/DocumentTemplate/_DocumentTemplate.py", line 218, in render_blocks_
    cond = md[cond]
  File ".../lib/python3.6/site-packages/DocumentTemplate/_DocumentTemplate.py", line 379, in __getitem__
    return self.getitem(name, call=1)
  File ".../lib/python3.6/site-packages/DocumentTemplate/_DocumentTemplate.py", line 403, in getitem
    return e()
  File ".../lib/python3.6/site-packages/Products/Transience/Transience.py", line 983, in housekeep
    self._housekeep(getCurrentTimeslice(self._period))
  File ".../lib/python3.6/site-packages/Products/Transience/Transience.py", line 987, in _housekeep
    self._replentish(now)
  File ".../lib/python3.6/site-packages/Products/Transience/Transience.py", line 670, in _replentish
    required = self._in_emergency_bucket_shortage(now)
  File ".../lib/python3.6/site-packages/Products/Transience/Transience.py", line 556, in _in_emergency_bucket_shortage
    max_ts = self._max_timeslice()
  File ".../lib/python3.6/site-packages/Products/Transience/Transience.py", line 1137, in __call__
    return self.value
  File ".../lib/python3.6/site-packages/ZODB/Connection.py", line 791, in setstate
    p, serial = self._storage.load(oid)
  File ".../lib/python3.6/site-packages/ZODB/mvccadapter.py", line 143, in load
    r = self._storage.loadBefore(oid, self._start)
  File ".../lib/python3.6/site-packages/tempstorage/TemporaryStorage.py", line 187, in loadBefore
    raise POSException.POSKeyError(oid)
ZODB.POSException.POSKeyError: 0x04cb

This one happens when trying to access /temp_folder/session_data/manage_container. It persists until Zope is restarted. When trying to access REQUEST['SESSION'] the traceback looks quite the same.

Interestingly the exception happens in the very same line which was changed in #6. Without this change the error seems to happen less often. But this observation might be wrong.

I was using Zope 4.0b7 on Python 3.6 which uses ZODB = 5.5.1 and tempstorage = 4.0.1 resp. 4.0.

Anyone who has seen this problem, too?
Any ideas how to fix it?

@icemac icemac added the bug label Nov 28, 2018

@icemac icemac added this to To do in Zope 4 final release via automation Nov 28, 2018

@icemac icemac added the help wanted label Nov 28, 2018

@mauritsvanrees

This comment has been minimized.

Copy link
Member

commented Nov 29, 2018

Earlier this year, in zopefoundation/Products.Sessions#4 I saw this with tempstorage 4.0. I then tried 4.0.1 but ran into the same problems. The change #6 between those versions may influence the behavior, or give a slightly different traceback, but it does not cause or solve the problem.

@icemac

This comment has been minimized.

Copy link
Member Author

commented May 21, 2019

See zopefoundation/Zope#637 for another variant of the traceback which seems to be the same problem.

@mgedmin

This comment has been minimized.

Copy link
Member

commented May 21, 2019

I decided to take a look at the source, see if maybe something jumps out that might be different on Python 2 vs Python 3.

Shouldn't this zero OID be a byte string?

if oid == '\0\0\0\0\0\0\0\0':

also

rootl = ['\0\0\0\0\0\0\0\0']

(Better yet, shouldn't they be using the z64 constant from ZODB.utils?)

Also this loop is modifying a dictionary it is iterating over:

for oid in self._index.keys():
if oid not in rindex:
self._takeOutGarbage(oid)

(self._takeOutGarbage() contains a del self._index[oid])

@dwt

This comment has been minimized.

Copy link
Contributor

commented Jul 9, 2019

This seems to be a stable (though quite noughty) reproduction / reduction of this

#!/usr/bin/env python

from contextlib import contextmanager
import time
from random import random
from datetime import datetime

import ZODB, ZODB.FileStorage
from tempstorage.TemporaryStorage import TemporaryStorage
from persistent.mapping import PersistentMapping
from BTrees.OOBTree import OOBTree
import transaction

from Testing.makerequest import makerequest

import tempstorage.TemporaryStorage
tempstorage.TemporaryStorage.CONFLICT_CACHE_MAXAGE = 0
tempstorage.TemporaryStorage.CONFLICT_CACHE_GCEVERY = 0

db = ZODB.DB(None)
original_connection = connection = db.open()
root = connection.root

from OFS.Application import Application
root.Application = Application()
transaction.savepoint(optimistic=True)

def _populate(app):
    from OFS.DTMLMethod import DTMLMethod
    from Products.Sessions.BrowserIdManager import BrowserIdManager
    from Products.Sessions.SessionDataManager import SessionDataManager
    from Products.TemporaryFolder.TemporaryFolder import MountedTemporaryFolder
    from Products.Transience.Transience import TransientObjectContainer
    import transaction
    bidmgr = BrowserIdManager('browser_id_manager')
    tf = MountedTemporaryFolder('temp_folder', title="Temporary Folder")
    toc = TransientObjectContainer(
        'temp_transient_container',
        title='Temporary '
        'Transient Object Container',
        timeout_mins=20
    )
    session_data_manager = SessionDataManager(
        id='session_data_manager',
        path='/' + 'temp_folder' + '/' + 'temp_transient_container',
        title='Session Data Manager',
        requestName='SESSION'
    )
    
    app._setObject('browser_id_manager', bidmgr)
    app._setObject('session_data_manager', session_data_manager)
    app._setObject('temp_folder', tf)
    
    transaction.commit()
    
    app.temp_folder._setObject('temp_transient_container', toc)
    transaction.commit()


_populate(root.Application)
print(connection.root)

container = makerequest(root.Application)
session = container.session_data_manager.getSessionData()
cookies = container.REQUEST.response.cookies
cookies = {'_ZopeId': cookies['_ZopeId']['value']}
transaction.savepoint(optimistic=True)
transaction.commit()

from threading import Thread

class FirstThread(Thread):

    def run(self):
        with db.transaction() as connection:
            container = makerequest(connection.root.Application)
            container.REQUEST.cookies = cookies
            session = container.session_data_manager.getSessionData(create=False)
            print('oid', session._p_oid, 'serial', session._p_serial)
            
            time.sleep(1)
            session['last_access'] = datetime.now().isoformat()

class SecondThread(Thread):
    def run(self):
        with db.transaction() as connection:
            container = makerequest(root.Application)
            container.REQUEST.cookies = cookies
            session = container.session_data_manager.getSessionData(create=False)
            print('oid', session._p_oid, 'serial', session._p_serial)
            
            time.sleep(1)
            session['last_access'] = datetime.now().isoformat()
            time.sleep(1)
            session = container.session_data_manager.getSessionData(create=False)
            print('oid', session._p_oid, 'serial', session._p_serial)
            session['last_access'] = datetime.now().isoformat()

print('\n'*10)

threads = [FirstThread(), SecondThread()]
for thread in threads:thread.start()
for thread in threads: thread.join()

print(session)
print(dict(session))

Not quite sure yet what to make of this, but it seems that two requests arriving at the same time of, which one triggers the garbage collection will trigger this error reliably.

In practice this means (I think), that if you have pages with multiple frames (like the admin panel) or pages that also do Ajax requests will almost certainly run into this issue. Still working on getting this to reproduce without having to modify tempstorage.TemporaryStorage.CONFLICT_CACHE_MAXAGE and tempstorage.TemporaryStorage.CONFLICT_CACHE_GCEVERY but I am pretty sure that this is the reason for this bug as well as the bugs I'm seeing in our app. Also I don't understand why this is so much more likely to trigger on python3. Any advice would be welcome!

Not sure how to best involve #12, advice would be welcome.

@dwt

This comment has been minimized.

Copy link
Contributor

commented Jul 9, 2019

This seems to be a little more faithful reproduction:

#!/usr/bin/env python

from contextlib import contextmanager
import time
from random import random
from datetime import datetime

import ZODB, ZODB.FileStorage
from tempstorage.TemporaryStorage import TemporaryStorage
from persistent.mapping import PersistentMapping
from BTrees.OOBTree import OOBTree
import transaction

from Testing.makerequest import makerequest

import tempstorage.TemporaryStorage
# This is far simpler to reproduce if these are set to 1 or even 0
# which we might want to to turn this into a test
# however the point of this script is to demonstrate that this is triggerable
# even without that modification
# tempstorage.TemporaryStorage.CONFLICT_CACHE_MAXAGE = 1
# tempstorage.TemporaryStorage.CONFLICT_CACHE_GCEVERY = 1

db = ZODB.DB(None)
original_connection = connection = db.open()
root = connection.root

from OFS.Application import Application
root.Application = Application()
transaction.savepoint(optimistic=True)

def _populate(app):
    from OFS.DTMLMethod import DTMLMethod
    from Products.Sessions.BrowserIdManager import BrowserIdManager
    from Products.Sessions.SessionDataManager import SessionDataManager
    from Products.TemporaryFolder.TemporaryFolder import MountedTemporaryFolder
    from Products.Transience.Transience import TransientObjectContainer
    import transaction
    bidmgr = BrowserIdManager('browser_id_manager')
    tf = MountedTemporaryFolder('temp_folder', title="Temporary Folder")
    toc = TransientObjectContainer(
        'temp_transient_container',
        title='Temporary '
        'Transient Object Container',
        timeout_mins=20
    )
    session_data_manager = SessionDataManager(
        id='session_data_manager',
        path='/' + 'temp_folder' + '/' + 'temp_transient_container',
        title='Session Data Manager',
        requestName='SESSION'
    )
    
    app._setObject('browser_id_manager', bidmgr)
    app._setObject('session_data_manager', session_data_manager)
    app._setObject('temp_folder', tf)
    
    transaction.commit()
    
    app.temp_folder._setObject('temp_transient_container', toc)
    transaction.commit()


_populate(root.Application)
print(connection.root)

cookies = None
with db.transaction() as connection:
    container = makerequest(root.Application)
    session = container.session_data_manager.getSessionData()
    session['last_access'] = datetime.now().isoformat()
    cookies = {'_ZopeId': container.REQUEST.response.cookies['_ZopeId']['value']}

from threading import Thread

class RequestThread(Thread):

    def run(self):
        with db.transaction() as connection:
            container = makerequest(connection.root.Application)
            container.REQUEST.cookies = cookies
            session = container.session_data_manager.getSessionData(create=False)

print('waiting for garbage collection time')

time.sleep(70)

# trigger garbage collect
print('trigger garbage collect')
transaction.commit()

print('time to explode')

threads = [RequestThread()]
for thread in threads:thread.start()
for thread in threads: thread.join()
@dwt

This comment has been minimized.

Copy link
Contributor

commented Jul 9, 2019

I'd love some feedback if this is an unreasonable way to simplify this / or maybe if there is a way to better express what is happening here.

@dwt

This comment has been minimized.

Copy link
Contributor

commented Jul 10, 2019

Looking at the code, the method _clear_temp() seems particularly ... interesting? The idea to remove (basically random) transactions by timeout instead of when they are no longer used by dependent transactions seems wrong / dangerous?

@dwt

This comment has been minimized.

Copy link
Contributor

commented Jul 15, 2019

Hey there maintainers! I'd like some feedback on this reproduction. What do you think about it? Is the bug described here real, or should it be caught by some other system in Zope?

If it is real, is this something that is fixable / you guys want to fix?

If not, what do we make from this?

@dwt

This comment has been minimized.

Copy link
Contributor

commented Jul 16, 2019

@hannosch @jimfulton Maybe you guys can give some input?

@dwt

This comment has been minimized.

Copy link
Contributor

commented Jul 24, 2019

I'm not sure how to best escalate this bug - this seems to be a pretty major bug in tempstorage that should either get fixed (may be hard?) or we should start some form of deprecation. But this would need advice from some more senior developers. @maintainers: any input?

@hannosch

This comment has been minimized.

Copy link
Member

commented Jul 24, 2019

I’m not much help here. I remember we ran into problems with tempstorage and sessions in Plone and removed all code that used those way back in 2004. To my understanding these were considered broken ever since in the Plone community and you either used cookies, memcached or later Redis to store session state.

@dwt

This comment has been minimized.

Copy link
Contributor

commented Jul 29, 2019

So I guess that I can count this as a vote to deprecate use of tempstorage in zope proper too?

@icemac

This comment has been minimized.

Copy link
Member Author

commented Aug 15, 2019

@d-maurer suggested in zopefoundation/Zope#684 (comment):

A long time ago, I made a review of Zope's session machinery.
When I remember right, then somewhere (I forgot where
precisely) a special measure (I forgot what precisely)
was used to allow it to not
follow the ZODB's typical MVCC (i.e. see state as it was at the transaction
start) but to see more recent state. Something like this is important
for a storage with severely limited history (such as TemporaryStorage).

I suspect this no longer works (with modern ZODB versions)
or has been removed in the session machinery (because its importance
has not been recognized).

A way to restore such a possibility would be to let
the loadBefore of TemporaryStorage ignore its tid
parameter and always load the most current (committed) state.
I expect that this will make TemporaryStorage again as
usable as it was earlier.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.