
Commit

Merge 9d76c53 into cb8af9c
jamadden committed Jun 8, 2021
2 parents cb8af9c + 9d76c53 commit 4a0d886
Showing 14 changed files with 292 additions and 101 deletions.
23 changes: 23 additions & 0 deletions CHANGES.rst
@@ -7,6 +7,29 @@

- Stop closing RDBMS connections when ``tpc_vote`` raises a
semi-expected ``TransientError`` such as a ``ConflictError``.
- PostgreSQL: Now uses advisory locks instead of row-level locks
during the commit process. This benchmarks substantially faster and
reduces the potential for table bloat.

For environments that process many large, concurrent transactions,
or deploy many RelStorage instances to the same database server, it
might be necessary to increase the PostgreSQL configuration value
``max_locks_per_transaction``. The default value of 64 is multiplied
by the default value of ``max_connections`` (100) to allow for 6,400
total objects to be locked across the entire database server. See
`the PostgreSQL documentation
<https://www.postgresql.org/docs/13/runtime-config-locks.html>`_ for
more information.
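
  One way to gauge how much of that capacity a deployment actually
  uses is to watch ``pg_locks`` while commits are in flight. This is
  only an inspection sketch; advisory locks appear with a
  ``locktype`` of ``advisory``::

      SELECT locktype, mode, count(*) AS held
        FROM pg_locks
       GROUP BY locktype, mode
       ORDER BY held DESC;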

.. caution:: Be careful deploying this version while older versions
are executing. There could be a small window of time
where the locking strategies are different, leading to
database corruption.

.. note:: Deploying multiple RelStorage instances to separate
schemas in the same PostgreSQL database (e.g., the default
of "public" plus another) has never been supported. It is
even less supported now.


3.5.0a3 (2021-05-26)
1 change: 1 addition & 0 deletions docs/internals.rst
@@ -50,6 +50,7 @@ Internal Details
relstorage.adapters.postgresql.schema
relstorage.adapters.postgresql.stats
relstorage.adapters.postgresql.txncontrol
relstorage.adapters.postgresql.util
relstorage.adapters.replica
relstorage.adapters.schema
relstorage.adapters.scriptrunner
6 changes: 3 additions & 3 deletions docs/postgresql/index.rst
@@ -11,9 +11,9 @@

.. tip::

Using ZODB's ``readCurrent(ob)`` method will result in taking
shared locks (``SELECT FOR SHARE``) in PostgreSQL for the row
holding the data for *ob*.
Prior to version 3.5.0a4, using ZODB's ``readCurrent(ob)`` method
resulted in taking shared locks (``SELECT FOR SHARE``) in
PostgreSQL for the row holding the data for *ob*.

This operation performs disk I/O, and consequently has an
associated cost. We recommend using this method judiciously.
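
For illustration only (these are not the literal statements
RelStorage issues, and ``12345`` is a made-up oid), the two locking
styles differ roughly like this::

    -- Prior to 3.5.0a4: a shared row-level lock on the object's row.
    SELECT zoid FROM object_state WHERE zoid = 12345 FOR SHARE;

    -- 3.5.0a4 and later use advisory locks keyed on the oid, which
    -- never touch the table's rows at all.
    SELECT pg_advisory_xact_lock_shared(12345);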
144 changes: 132 additions & 12 deletions docs/postgresql/setup.rst
@@ -4,6 +4,11 @@

.. highlight:: shell

.. important::

RelStorage can only be installed into a single schema within a
database. This is usually the default "public" schema. It may be
possible to use other schemas, but this is not supported or tested.

If you installed PostgreSQL from a binary package, you probably have a
user account named ``postgres``. Since PostgreSQL respects the name of
@@ -40,18 +45,133 @@ configuration file::
Configuration
=============

.. tip::
The default PostgreSQL server configuration will work fine for most
users. However, some configuration changes may yield increased performance.

Defaults and Background
-----------------------

This section is current for PostgreSQL 13 and earlier versions.

``max_connections`` (100) gives the number of worker processes that
could possibly be active at a time. Each worker consumes (at most)
``work_mem`` (4MB) + ``temp_buffers`` (8MB) = 12MB (plus a tiny bit
of overhead).

``shared_buffers`` is the amount of memory that PostgreSQL allocates
to keeping database data in memory. It is perhaps the single most
important tunable; larger values are better. If the data is not in
this cache, a worker has to go to the operating system with an I/O
request (or two). The default is a measly 128MB.

``max_wal_size`` determines how often data must be taken from the
write-ahead log and placed into the main tables. Reasons to keep this
small are (a) limited disk space; (b) reduced crash recovery time;
(c) keeping online replicas more up-to-date if you're doing WAL-based
replication.

``random_page_cost`` (4.0) is relative to ``seq_page_cost`` (1.0) and
expresses how expensive random I/O is compared to large blocks of
sequential I/O. This in turn influences whether the planner will use
an index or not. For solid-state drives, ``random_page_cost`` should
generally be lowered.
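
To check what a given server is actually using for the parameters
mentioned above, query the ``pg_settings`` view (a sketch; add or
remove names as needed)::

    SELECT name, setting, unit
      FROM pg_settings
     WHERE name IN ('max_connections', 'work_mem', 'temp_buffers',
                    'shared_buffers', 'max_wal_size',
                    'random_page_cost', 'seq_page_cost');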


General
-------

Many PostgreSQL configuration defaults are conservative on modern
machines. Without knowing the resources available to any particular
installation, some general tips are listed below; a consolidated
sketch of these settings follows the list.

.. important:: Be sure you understand the consequences before changing
any settings. Some of those listed here trade durability
or recovery time for speed and may be too risky for your
environment.

* Increase ``temp_buffers``. This prevents having to use disk tables for
temporary storage. RelStorage does a lot with temp tables. In my
benchmarks, I use 32MB.

* Increasing ``work_mem`` improves sorting, hashing and the like.
RelStorage doesn't do much of that *except* when you do a native GC,
and then it can make a big difference. Because this is a ceiling that's
not allocated unless needed, it should be safe to increase. In my
benchmarks, I leave this alone.

* Increase ``shared_buffers`` as much as you are able. When I
benchmark, on my 16GB laptop, I use 2GB. The rule of thumb for
dedicated servers is 25% of available RAM.

* If deploying on SSDs, the cost of random page access
(``random_page_cost``) can probably be lowered some more. This holds
even for older SSDs, because the cost is relative to sequential
access, not absolute. It is probably not important, though, unless
you're experiencing issues accessing blobs (the only thing doing
sequential scans).

* If you are not doing replication, setting ``wal_level = minimal``
will improve write speed and reduce disk usage. Similarly, setting
``wal_compression = on`` will reduce disk IO for writes (at a tiny
CPU cost). I benchmark with both those settings.

* If you're not doing replication and can stand somewhat longer recovery
times, increasing ``max_wal_size`` (I use 10GB) benefits heavy
writes. Even if you are doing replication, increasing
``checkpoint_timeout`` (I use 30 minutes, up from 5),
``checkpoint_completion_target`` (I use 0.9, up from 0.5) and either
increasing or disabling ``checkpoint_flush_after`` (I disable it; the
default is a skimpy 256KB) also help. This especially helps on
spinning rust and for very "bursty" workloads.

* If your I/O bandwidth is constrained and you can't increase
``shared_buffers`` enough to compensate, disabling the background
writer can help too: set ``bgwriter_lru_maxpages = 0`` and
``bgwriter_flush_after = 0``. I set these when I benchmark using
spinning rust.

* Setting ``synchronous_commit = off`` makes for faster turnaround
time on ``COMMIT`` calls. This is safe in the sense that it can
never corrupt the database in the event of a crash, but it might
leave the application *thinking* something was saved when it really
wasn't. Since the whole site will go down in the event of a database
crash anyway, you might consider setting this to off if you're
struggling with database performance. I benchmark with it off.
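
The consolidated sketch mentioned above, using the benchmark values
from this list (they are not universal recommendations, and the
``random_page_cost`` value is only an example since the text leaves
the exact number open). ``shared_buffers``, ``wal_level`` and
``max_wal_senders`` only take effect after a restart; the rest apply
on reload::

    ALTER SYSTEM SET temp_buffers = '32MB';
    ALTER SYSTEM SET shared_buffers = '2GB';               -- restart required
    ALTER SYSTEM SET random_page_cost = 1.1;               -- example value for SSDs
    ALTER SYSTEM SET wal_level = 'minimal';                -- no replication; restart required
    ALTER SYSTEM SET max_wal_senders = 0;                  -- needed when wal_level = minimal
    ALTER SYSTEM SET wal_compression = 'on';
    ALTER SYSTEM SET max_wal_size = '10GB';
    ALTER SYSTEM SET checkpoint_timeout = '30min';
    ALTER SYSTEM SET checkpoint_completion_target = 0.9;
    ALTER SYSTEM SET checkpoint_flush_after = 0;           -- disabled
    ALTER SYSTEM SET bgwriter_lru_maxpages = 0;            -- only if I/O-constrained
    ALTER SYSTEM SET bgwriter_flush_after = 0;             -- only if I/O-constrained
    ALTER SYSTEM SET synchronous_commit = 'off';           -- see the caveat above
    SELECT pg_reload_conf();                               -- apply the reloadable settings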


Large Sites
-----------

* Beginning with RelStorage 3.5.0a4, very large sites processing many
large or concurrent transactions, or deploying many RelStorage
instances to a single database server, may need to increase
``max_locks_per_transaction``. The default value (64) allows about
6,400 objects to be locked because it is multiplied by the value of
``max_connections`` (which defaults to 100). Large sites may have
already increased this second value. (See the sketch after this
list.)

* For systems with very high write levels, setting
``wal_writer_flush_after = 10MB`` (or something higher than the
default of 1MB) and ``wal_writer_delay = 10s`` will improve write
speed without any appreciable safety loss (because your write volume
is so high already). I run write benchmarks this way.

* Likewise for high writes, I increase ``autovacuum_max_workers`` from
the default of 3 to 8 so they can keep up. Similarly, consider
lowering ``autovacuum_vacuum_scale_factor`` from its default of 20%
to 10% or even 1%. You might also raise
``autovacuum_vacuum_cost_limit`` from its default of 200 to 1000
or 2000.
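
The corresponding sketch for these write-heavy settings, using the
values given above (``max_locks_per_transaction`` and
``autovacuum_max_workers`` require a restart, and 128 is only an
example value)::

    ALTER SYSTEM SET max_locks_per_transaction = 128;       -- restart required; example value
    ALTER SYSTEM SET wal_writer_flush_after = '10MB';
    ALTER SYSTEM SET wal_writer_delay = '10s';
    ALTER SYSTEM SET autovacuum_max_workers = 8;             -- restart required
    ALTER SYSTEM SET autovacuum_vacuum_scale_factor = 0.1;   -- 10%; default is 0.2
    ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 1000;
    SELECT pg_reload_conf();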

For packing large databases, a larger value of the PostgreSQL
configuration paramater ``work_mem`` is likely to yield improved
performance. The default is 4MB; try 16MB if packing performance is
unacceptable.
Packing
-------

.. tip::
* For packing large databases, a larger value of the PostgreSQL
configuration parameter ``work_mem`` is likely to yield improved
performance. The default is 4MB; try 16MB if packing performance is
unacceptable. (See the sketch after this list.)

For packing large databases, setting the ``pack_object``,
``object_ref`` and ``object_refs_added`` tables to `UNLOGGED
<https://www.postgresql.org/docs/12/sql-createtable.html#SQL-CREATETABLE-UNLOGGED>`_
can provide a performance boost (if replication doesn't matter and
you don't care about the contents of these tables). This can be
done after the schema is created with ``ALTER TABLE table SET UNLOGGED``.
* For packing large databases, setting the ``pack_object``,
``object_ref`` and ``object_refs_added`` tables to `UNLOGGED
<https://www.postgresql.org/docs/12/sql-createtable.html#SQL-CREATETABLE-UNLOGGED>`_
can provide a performance boost (if replication doesn't matter and
you don't care about the contents of these tables). This can be done
after the schema is created with ``ALTER TABLE table SET UNLOGGED``.
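
A sketch of preparing for a large pack, combining the two tips above.
It assumes you can issue the ``work_mem`` change on the connection
doing the pack (otherwise set it server-wide); both changes are
reversible::

    -- Give the session doing the pack more sort/hash memory.
    SET work_mem = '16MB';

    -- Skip WAL for the pack bookkeeping tables; only do this if replication
    -- and the contents of these tables don't matter to you.
    ALTER TABLE pack_object SET UNLOGGED;
    ALTER TABLE object_ref SET UNLOGGED;
    ALTER TABLE object_refs_added SET UNLOGGED;

    -- Afterwards, restore normal WAL-logged behaviour.
    ALTER TABLE pack_object SET LOGGED;
    ALTER TABLE object_ref SET LOGGED;
    ALTER TABLE object_refs_added SET LOGGED;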
42 changes: 26 additions & 16 deletions src/relstorage/adapters/locker.py
@@ -118,18 +118,14 @@ class will call this method when :meth:`hold_commit_lock` is
("""
SELECT zoid
FROM current_object
WHERE zoid IN (
SELECT zoid
FROM temp_store
)
INNER JOIN temp_store USING (zoid)
WHERE temp_store.prev_tid <> 0
""", 'current_object'),
("""
SELECT zoid
FROM object_state
WHERE zoid IN (
SELECT zoid
FROM temp_store
)
INNER JOIN temp_store USING (zoid)
WHERE temp_store.prev_tid <> 0
""", 'object_state'),
)

@@ -194,24 +190,38 @@ def lock_current_objects(self, cursor, read_current_oid_ints, shared_locks_block
# possibly * N
self._lock_rows_being_modified(cursor)

def _lock_readCurrent_oids_for_share(self, cursor, current_oids, shared_locks_block):
_, table = self._get_current_objects_query
oids_to_lock = sorted(set(current_oids))
batcher = self.make_batcher(cursor)

locking_suffix = ' %s ' % (
def _lock_suffix_for_readCurrent(self, shared_locks_block):
return ' %s ' % (
self._lock_share_clause
if shared_locks_block
else
self._lock_share_clause_nowait
)

def _lock_column_name_for_readCurrent(self, shared_locks_block):
# subclasses use the argument
# pylint:disable=unused-argument
return 'zoid'

def _lock_consume_rows_for_readCurrent(self, rows, shared_locks_block):
# subclasses use the argument
# pylint:disable=unused-argument
consume(rows)

def _lock_readCurrent_oids_for_share(self, cursor, current_oids, shared_locks_block):
_, table = self._get_current_objects_query
oids_to_lock = sorted(set(current_oids))
batcher = self.make_batcher(cursor)

locking_suffix = self._lock_suffix_for_readCurrent(shared_locks_block)
lock_column = self._lock_column_name_for_readCurrent(shared_locks_block)
try:
rows = batcher.select_from(
('zoid',), table,
(lock_column,), table,
suffix=locking_suffix,
**{'zoid': oids_to_lock}
)
consume(rows)
self._lock_consume_rows_for_readCurrent(rows, shared_locks_block)
except self.illegal_operation_exceptions: # pragma: no cover
# Bug in our code
raise
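
In rough terms, the hooks above let a subclass change both the
selected column and the locking suffix used for ``readCurrent``
oids. The following is only an illustration of the shape of the
generated SQL, not the exact statements any adapter emits:

    -- Base behaviour: select the oid column with a row-share suffix
    -- (e.g. FOR SHARE NOWAIT on PostgreSQL).
    SELECT zoid FROM current_object WHERE zoid IN (1, 2, 3) FOR SHARE NOWAIT;

    -- A subclass could instead select a locking expression with no suffix,
    -- such as a shared advisory lock keyed on each oid.
    SELECT pg_advisory_xact_lock_shared(zoid) FROM current_object WHERE zoid IN (1, 2, 3);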
8 changes: 2 additions & 6 deletions src/relstorage/adapters/mysql/packundo.py
@@ -20,12 +20,8 @@
from ..packundo import HistoryPreservingPackUndo
from ..schema import Schema

class _LockStmt(object):
# 8.0 supports 'FOR SHARE' but before that we have
# this.
_lock_for_share = 'LOCK IN SHARE MODE'

class MySQLHistoryPreservingPackUndo(_LockStmt, HistoryPreservingPackUndo):
class MySQLHistoryPreservingPackUndo(HistoryPreservingPackUndo):

# Previously we needed to work around a MySQL performance bug by
# avoiding an expensive subquery.
@@ -112,5 +108,5 @@ class MySQLHistoryPreservingPackUndo(_LockStmt, HistoryPreservingPackUndo):
).limit(1000)


class MySQLHistoryFreePackUndo(_LockStmt, HistoryFreePackUndo):
class MySQLHistoryFreePackUndo(HistoryFreePackUndo):
pass
7 changes: 1 addition & 6 deletions src/relstorage/adapters/packundo.py
@@ -51,10 +51,6 @@ class PackUndo(DatabaseHelpersMixin):

_choose_pack_transaction_query = None


_lock_for_share = 'FOR SHARE'
_lock_for_update = 'FOR UPDATE'

driver = None
connmanager = None
runner = None
@@ -106,8 +102,7 @@ def with_options(self, options):
# (checkPackWhileReferringObjectChanges)
return self
result = self.__class__(self.driver, self.connmanager, self.runner, self.locker, options)
# Setting the MAX_TID is important for SQLite,
# as is the _lock_for_share.
# Setting the MAX_TID is important for SQLite.
# This should probably be handled directly in subclasses.
for k, v in vars(self).items():
if k != 'options' and getattr(result, k, None) is not v:
2 changes: 0 additions & 2 deletions src/relstorage/adapters/postgresql/adapter.py
@@ -150,8 +150,6 @@ def _create(self):
locker=self.locker,
options=options,
)
# TODO: Subclass for this.
self.packundo._lock_for_share = 'FOR KEY SHARE OF object_state'
self.dbiter = HistoryFreeDatabaseIterator(
driver,
)
