
Don't copy any data in case of 100% cache valid stages #54

Merged
23 commits merged from cache_valid_stages into main on Mar 31, 2023

Conversation

windiana42
Member

@windiana42 windiana42 commented Feb 9, 2023

Sounds like a no-brainer, but it is quite tricky to achieve since we only know halfway through stage execution whether the stage is fully cache-valid. So we start by setting just aliases/read views; when the first cache-invalid task is discovered, we replace them by copying the tables (in the background), and all aliases are replaced before stage commit.

Checklist

  • Added a CHANGELOG.rst entry
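The alias-first strategy described in the PR description can be sketched in a few lines. Everything here (the TableStore class, run_task, the in-memory "tables") is an illustrative toy, not pipedag's actual API; in pipedag the copying additionally happens in background threads rather than synchronously.

```python
# Hypothetical sketch of the alias-first strategy; names are illustrative.

class TableStore:
    """Toy store: 'aliases' point at cached tables, 'copies' hold real data."""

    def __init__(self, cache):
        self.cache = cache          # committed tables from the previous run
        self.aliases = {}           # table name -> cached data (cheap views)
        self.copies = {}            # table name -> materialized copy
        self.stage_cache_valid = True

    def run_task(self, name, cache_valid, compute):
        if cache_valid and self.stage_cache_valid:
            # Cheap path: just alias the cached table, copy nothing.
            self.aliases[name] = self.cache[name]
        else:
            if self.stage_cache_valid:
                # First cache-invalid task discovered: from now on we need
                # real copies, so replace all aliases placed so far.
                # (pipedag would do this copying in background threads.)
                self.stage_cache_valid = False
                for alias_name, data in self.aliases.items():
                    self.copies[alias_name] = list(data)  # "copy table"
                self.aliases.clear()
            self.copies[name] = compute()

    def commit(self):
        # 100% cache-valid stage: commit aliases, zero data copied.
        return self.aliases if self.stage_cache_valid else self.copies


store = TableStore(cache={"a": [1, 2], "b": [3]})
store.run_task("a", cache_valid=True, compute=lambda: [1, 2])
store.run_task("b", cache_valid=False, compute=lambda: [3, 4])
print(store.commit())  # {'a': [1, 2], 'b': [3, 4]}
```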

…chema of cache valid tasks [tests still fail]

This is quite an invasive change to existing concepts, but it should massively speed up the common case where most stages are 100% cache-valid.

There are still issues, at least with the DB2 tests.
This matters because the source tables in DB2 are aliases, and in the current schema we also keep aliases to the cache for cache-valid tables.
@windiana42
Member Author

As the test failures show, some more solidification work will be needed. For DB2 and mssql the technique is already working in general. For Postgres it might be a bit trickier since we can only place read views as aliases. However, for DB2/mssql we also resolve those aliases before working with them (sqlalchemy can't handle them very well). The same can be done for read views: they would just be placeholders functioning as a level of indirection.
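The "level of indirection" idea can be sketched briefly: aliases (or read views) are resolved down to the real table before sqlalchemy works with them. The resolve_alias helper and the catalog layout below are made up for illustration, not pipedag's actual code.

```python
# Hypothetical sketch of resolving alias/read-view indirection before use.

def resolve_alias(catalog, schema, name):
    """Follow alias indirection until a real (non-alias) table is reached."""
    seen = set()
    while (schema, name) in catalog["aliases"]:
        if (schema, name) in seen:
            raise ValueError(f"alias cycle at {schema}.{name}")
        seen.add((schema, name))
        schema, name = catalog["aliases"][(schema, name)]
    return schema, name


catalog = {
    "aliases": {
        # transaction-schema alias pointing into the cache schema
        ("transaction", "customers"): ("cache", "customers_v3"),
    }
}
print(resolve_alias(catalog, "transaction", "customers"))
# ('cache', 'customers_v3')
```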

@windiana42 windiana42 changed the base branch from main to speedup_db2 February 9, 2023 22:24
Base automatically changed from speedup_db2 to main March 16, 2023 14:50
3.9 caused strange errors with DB2 interface.
further adaptations were needed
This is an experiment to see whether CI runs get more reliable with that.
The ibm_db2 service is only started once for the whole matrix execution, with steps for different Python versions. Due to the artificially small tests, this leads to access patterns that DB2 considers a deadlock, even though zookeeper is used to ensure that any stage is only touched by exactly one process at any time.

It could be that using wildly different graphs on the same instance_id causes this, but this problem will not affect any realistic use cases of pipedag.
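The per-stage locking mentioned above can be illustrated with a small sketch. In the real setup a ZooKeeper lock serves as the mutex (e.g. via the kazoo library); here threading.Lock stands in so the example is self-contained, and all names are hypothetical.

```python
# Illustrative per-stage locking: one lock per (instance_id, stage), so any
# stage is touched by exactly one worker at a time. threading.Lock stands in
# for a distributed ZooKeeper lock here.
import threading
from collections import defaultdict


class StageLockManager:
    def __init__(self, lock_factory=threading.Lock):
        self._locks = defaultdict(lock_factory)
        self._guard = threading.Lock()  # protects the lock registry itself

    def stage_lock(self, instance_id, stage_name):
        with self._guard:
            return self._locks[(instance_id, stage_name)]


manager = StageLockManager()
results = []

def worker(i):
    # All workers contend for the same stage lock and run one at a time.
    with manager.stage_lock("pipeline_1", "stage_1"):
        results.append(i)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))  # [0, 1, 2, 3]
```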
@windiana42 windiana42 marked this pull request as ready for review March 30, 2023 10:00
@windiana42 windiana42 requested a review from a team as a code owner March 30, 2023 10:00
@windiana42 windiana42 requested a review from NMAC427 March 30, 2023 10:00
@windiana42
Member Author

@NMAC427 we already talked this change through, but I guess a review would be good for a refactoring of this size. I am about to get the tests green; the remaining failures seem to be just a problem of concurrent activity on the ibm_db2 service, which is started in a different way than our other docker-compose databases.

The number of 100 concurrent connections to zookeeper is arbitrary, but it is easier to change that option if we don't rely on the default.
I am also making stupid commits to get GitHub Actions to run something.
to trigger GitHub Actions
This will be undone once we understand how to avoid either the serialization or the database-based deadlocks.
@windiana42 windiana42 left a comment

Commented changes in PR

@@ -10,5 +10,5 @@ runs:
- name: Test
shell: bash
run: |
-        poetry run pytest tests -ra ${DEBUG:+-s} ${{ inputs.arguments }}
+        poetry run pytest tests -ra -v -s ${DEBUG:+-s} ${{ inputs.arguments }}
Member Author

For now, I suggest leaving the more verbose test output on in CI. We are still having trouble with database deadlocks despite the locking of stages.
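As an aside on the test command itself: the ${DEBUG:+-s} part is plain shell conditional expansion. It expands to -s only when DEBUG is set and non-empty, so pytest's output capture is disabled only in debug runs:

```shell
# ${VAR:+word} expands to "word" only if VAR is set and non-empty.
unset DEBUG
echo "flags:${DEBUG:+ -s}"   # prints "flags:"

DEBUG=1
echo "flags:${DEBUG:+ -s}"   # prints "flags: -s"
```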

@@ -65,15 +65,133 @@ jobs:
with:
arguments: --workers 2

-  full_test:
+  full_test39:
Member Author

This lengthy duplication is not nice; we will try to undo it. One idea is strategy: max-parallel: 1. However, this option did not work.
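For context, the attempted alternative would look roughly like the fragment below (an illustrative sketch, not the actual workflow file; the job name and Python versions are assumptions). max-parallel limits how many matrix jobs run concurrently, which was the hope for protecting the shared ibm_db2 service, but it did not work out here.

```yaml
# Illustrative sketch: one matrix job instead of duplicated per-version jobs,
# with matrix concurrency limited so the shared ibm_db2 service is only hit
# by one Python version at a time.
jobs:
  full_test:
    strategy:
      max-parallel: 1
      matrix:
        python-version: ["3.9", "3.10"]
```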

@@ -57,7 +57,7 @@ mssql = ["pyodbc", "pytsql"]
ibm_db2 = ["ibm-db", "ibm-db-sa"]
prefect = ["prefect", "docker-compose"]

[tool.poetry.group.dev.dependencies]
[tool.poetry.dev-dependencies]
Member Author

This seems to be the way for future poetry versions.

Tuple["Materializable", ...],
dict[str, "Materializable"],
list["Materializable"],
tuple["Materializable", ...],
Member Author

This is an automated pyupgrade change done by a pre-commit hook.

synonyms = self._get_mssql_sql_synonyms(transaction_schema.get())
for name in synonyms.keys():
    self.execute(
        DropAlias(name, transaction_schema, if_exists=True),
Member Author

For mssql, we use synonyms as the implementation of DropAlias/CreateAlias.
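A synonym-based alias on MSSQL boils down to CREATE SYNONYM / DROP SYNONYM statements. The helpers below just render that SQL as an illustration; the function names are hypothetical, and pipedag's real DropAlias/CreateAlias are DDL elements executed through sqlalchemy.

```python
# Hypothetical helpers rendering the MSSQL SQL behind CreateAlias/DropAlias.

def create_alias_sql(name, schema, target_name, target_schema):
    # A synonym is a named pointer to another object, here a table.
    return (
        f"CREATE SYNONYM [{schema}].[{name}] "
        f"FOR [{target_schema}].[{target_name}]"
    )

def drop_alias_sql(name, schema, if_exists=True):
    sql = f"DROP SYNONYM [{schema}].[{name}]"
    if if_exists:
        # Guard via the catalog; 'SN' is the sys.objects type for synonyms.
        sql = f"IF OBJECT_ID('[{schema}].[{name}]', 'SN') IS NOT NULL {sql}"
    return sql


print(create_alias_sql("customers", "transaction", "customers_v3", "cache"))
print(drop_alias_sql("customers", "transaction"))
```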

@@ -91,14 +95,24 @@ def __init__(self, flow: Flow):

        self.task_memo: defaultdict[Any, Any] = defaultdict(lambda: MemoState.NONE)

        # deferred table store operations
        self._thread_pool = ThreadPoolExecutor(
Member Author

We use multithreading here since all threads in the process spend most of their time waiting on I/O.
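A minimal sketch of why threads suffice for the deferred copies: each worker mostly waits on the database, so Python's GIL is not a bottleneck. copy_table and the table names below are illustrative stand-ins for the real table store operations.

```python
# Illustrative sketch: deferring I/O-bound table copies to a thread pool.
from concurrent.futures import ThreadPoolExecutor
import time

def copy_table(name):
    time.sleep(0.01)  # stands in for the database doing the actual copy
    return f"copied {name}"

with ThreadPoolExecutor(max_workers=4) as pool:
    # Threads (not processes) are enough: workers wait on I/O in parallel
    # while the main thread can keep scheduling tasks.
    futures = [pool.submit(copy_table, t) for t in ["a", "b", "c"]]
    results = [f.result() for f in futures]

print(results)  # ['copied a', 'copied b', 'copied c']
```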

@windiana42
Member Author

To get this change into the project, I am merging this PR without review.

@windiana42 windiana42 merged commit 5b44fc0 into main Mar 31, 2023
@windiana42 windiana42 deleted the cache_valid_stages branch March 31, 2023 15:17