
Don't copy any data in case of 100% cache valid stages #54

Merged
23 commits merged from cache_valid_stages into main on Mar 31, 2023

Conversation

windiana42
Member

@windiana42 windiana42 commented Feb 9, 2023

Sounds like a no-brainer, but it is quite tricky to achieve since we only know halfway through stage execution whether the stage is fully cache-valid. So we start by setting just aliases/read views; when the first cache-invalid task is discovered, we replace them by copying the tables (in the background), and all aliases are replaced before stage commit.

Checklist

  • Added a CHANGELOG.rst entry
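The alias-first strategy described in the PR description can be sketched in a few lines. Everything here (the TableStore class, run_task, the in-memory "tables") is an illustrative toy, not pipedag's actual API; in pipedag the copying additionally happens in background threads rather than synchronously.

```python
# Hypothetical sketch of the alias-first strategy; names are illustrative.

class TableStore:
    """Toy store: 'aliases' point at cached tables, 'copies' hold real data."""

    def __init__(self, cache):
        self.cache = cache          # committed tables from the previous run
        self.aliases = {}           # table name -> cached data (cheap views)
        self.copies = {}            # table name -> materialized copy
        self.stage_cache_valid = True

    def run_task(self, name, cache_valid, compute):
        if cache_valid and self.stage_cache_valid:
            # Cheap path: just alias the cached table, copy nothing.
            self.aliases[name] = self.cache[name]
        else:
            if self.stage_cache_valid:
                # First cache-invalid task discovered: from now on we need
                # real copies, so replace all aliases placed so far.
                # (pipedag would do this copying in background threads.)
                self.stage_cache_valid = False
                for alias_name, data in self.aliases.items():
                    self.copies[alias_name] = list(data)  # "copy table"
                self.aliases.clear()
            self.copies[name] = compute()

    def commit(self):
        # 100% cache-valid stage: commit aliases, zero data copied.
        return self.aliases if self.stage_cache_valid else self.copies


store = TableStore(cache={"a": [1, 2], "b": [3]})
store.run_task("a", cache_valid=True, compute=lambda: [1, 2])
store.run_task("b", cache_valid=False, compute=lambda: [3, 4])
print(store.commit())  # {'a': [1, 2], 'b': [3, 4]}
```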

…chema of cache valid tasks [tests still fail]

This is quite an invasive change to existing concepts, but it should massively speed up the common case where most stages are 100% cache-valid.

There are still issues, at least with the DB2 tests.
This matters because the source tables in DB2 are aliases, and in the current schema we also keep aliases to the cache for cache-valid tables.
@windiana42
Member Author

As the test failures show, some more solidification work will be needed. For DB2 and mssql the technique is already working in general. For Postgres it might be a bit trickier since we can only place read views as aliases. However, for DB2/mssql we also resolve those aliases before working with them (sqlalchemy can't handle them very well). The same can be done for read views: they would just be placeholders functioning as a level of indirection.
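The "level of indirection" idea can be sketched briefly: aliases (or read views) are resolved down to the real table before sqlalchemy works with them. The resolve_alias helper and the catalog layout below are made up for illustration, not pipedag's actual code.

```python
# Hypothetical sketch of resolving alias/read-view indirection before use.

def resolve_alias(catalog, schema, name):
    """Follow alias indirection until a real (non-alias) table is reached."""
    seen = set()
    while (schema, name) in catalog["aliases"]:
        if (schema, name) in seen:
            raise ValueError(f"alias cycle at {schema}.{name}")
        seen.add((schema, name))
        schema, name = catalog["aliases"][(schema, name)]
    return schema, name


catalog = {
    "aliases": {
        # transaction-schema alias pointing into the cache schema
        ("transaction", "customers"): ("cache", "customers_v3"),
    }
}
print(resolve_alias(catalog, "transaction", "customers"))
# ('cache', 'customers_v3')
```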

@windiana42 windiana42 changed the base branch from main to speedup_db2 February 9, 2023 22:24
Base automatically changed from speedup_db2 to main March 16, 2023 14:50
3.9 caused strange errors with DB2 interface.
further adaptations were needed
This is an experiment to see whether CI runs get more reliable with that.
The ibm_db2 service is only started once for the whole matrix execution, with steps for different Python versions. Due to the artificially small tests, this leads to access patterns that DB2 considers a deadlock, even though zookeeper is used to ensure that any stage is only touched by exactly one process at any time.

It could be that using wildly different graphs on the same instance_id causes this, but this problem will not affect any realistic use cases of pipedag.
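The per-stage locking mentioned above can be illustrated with a small sketch. In the real setup a ZooKeeper lock serves as the mutex (e.g. via the kazoo library); here threading.Lock stands in so the example is self-contained, and all names are hypothetical.

```python
# Illustrative per-stage locking: one lock per (instance_id, stage), so any
# stage is touched by exactly one worker at a time. threading.Lock stands in
# for a distributed ZooKeeper lock here.
import threading
from collections import defaultdict


class StageLockManager:
    def __init__(self, lock_factory=threading.Lock):
        self._locks = defaultdict(lock_factory)
        self._guard = threading.Lock()  # protects the lock registry itself

    def stage_lock(self, instance_id, stage_name):
        with self._guard:
            return self._locks[(instance_id, stage_name)]


manager = StageLockManager()
results = []

def worker(i):
    # All workers contend for the same stage lock and run one at a time.
    with manager.stage_lock("pipeline_1", "stage_1"):
        results.append(i)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))  # [0, 1, 2, 3]
```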
@windiana42 windiana42 marked this pull request as ready for review March 30, 2023 10:00
@windiana42 windiana42 requested a review from a team as a code owner March 30, 2023 10:00
@windiana42 windiana42 requested a review from NMAC427 March 30, 2023 10:00
@windiana42
Member Author

@NMAC427 we already talked this change through, but I guess a review would be good for a refactoring of this size. I am about to get the tests green; the remaining failures seem to be just a problem of concurrent activity on the ibm_db2 service, which is started in a different way than our other docker-compose databases.

The number of 100 concurrent connections to zookeeper is arbitrary, but it is easier to change that option if we don't rely on the default.
I am also making stupid commits to get GitHub Actions to run something.
to trigger GitHub Actions
This will be undone once we understand how to avoid either the serialization or the database-based deadlocks.
@windiana42 windiana42 left a comment

Commented changes in PR

@@ -10,5 +10,5 @@ runs:
- name: Test
shell: bash
run: |
-        poetry run pytest tests -ra ${DEBUG:+-s} ${{ inputs.arguments }}
+        poetry run pytest tests -ra -v -s ${DEBUG:+-s} ${{ inputs.arguments }}
Member Author

For now, I suggest leaving the more verbose test output on in CI. We are still having trouble with database deadlocks despite the locking of stages.
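As an aside on the test command itself: the ${DEBUG:+-s} part is plain shell conditional expansion. It expands to -s only when DEBUG is set and non-empty, so pytest's output capture is disabled only in debug runs:

```shell
# ${VAR:+word} expands to "word" only if VAR is set and non-empty.
unset DEBUG
echo "flags:${DEBUG:+ -s}"   # prints "flags:"

DEBUG=1
echo "flags:${DEBUG:+ -s}"   # prints "flags: -s"
```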

@@ -65,15 +65,133 @@ jobs:
with:
arguments: --workers 2

-  full_test:
+  full_test39:
Member Author

This lengthy duplication is not nice; we will try to undo it. One idea is strategy: max-parallel: 1. However, this option did not work.
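For context, the attempted alternative would look roughly like the fragment below (an illustrative sketch, not the actual workflow file; the job name and Python versions are assumptions). max-parallel limits how many matrix jobs run concurrently, which was the hope for protecting the shared ibm_db2 service, but it did not work out here.

```yaml
# Illustrative sketch: one matrix job instead of duplicated per-version jobs,
# with matrix concurrency limited so the shared ibm_db2 service is only hit
# by one Python version at a time.
jobs:
  full_test:
    strategy:
      max-parallel: 1
      matrix:
        python-version: ["3.9", "3.10"]
```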

@@ -57,7 +57,7 @@ mssql = ["pyodbc", "pytsql"]
ibm_db2 = ["ibm-db", "ibm-db-sa"]
prefect = ["prefect", "docker-compose"]

[tool.poetry.group.dev.dependencies]
[tool.poetry.dev-dependencies]
Member Author

This seems to be the way for future poetry versions.

Tuple["Materializable", ...],
dict[str, "Materializable"],
list["Materializable"],
tuple["Materializable", ...],
Member Author

This is an automated pyupgrade change done by a pre-commit hook.

synonyms = self._get_mssql_sql_synonyms(transaction_schema.get())
for name in synonyms.keys():
    self.execute(
        DropAlias(name, transaction_schema, if_exists=True),
Member Author

For mssql, we use synonyms as the implementation of DropAlias/CreateAlias.
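A synonym-based alias on MSSQL boils down to CREATE SYNONYM / DROP SYNONYM statements. The helpers below just render that SQL as an illustration; the function names are hypothetical, and pipedag's real DropAlias/CreateAlias are DDL elements executed through sqlalchemy.

```python
# Hypothetical helpers rendering the MSSQL SQL behind CreateAlias/DropAlias.

def create_alias_sql(name, schema, target_name, target_schema):
    # A synonym is a named pointer to another object, here a table.
    return (
        f"CREATE SYNONYM [{schema}].[{name}] "
        f"FOR [{target_schema}].[{target_name}]"
    )

def drop_alias_sql(name, schema, if_exists=True):
    sql = f"DROP SYNONYM [{schema}].[{name}]"
    if if_exists:
        # Guard via the catalog; 'SN' is the sys.objects type for synonyms.
        sql = f"IF OBJECT_ID('[{schema}].[{name}]', 'SN') IS NOT NULL {sql}"
    return sql


print(create_alias_sql("customers", "transaction", "customers_v3", "cache"))
print(drop_alias_sql("customers", "transaction"))
```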

@@ -91,14 +95,24 @@ def __init__(self, flow: Flow):

        self.task_memo: defaultdict[Any, Any] = defaultdict(lambda: MemoState.NONE)

        # deferred table store operations
        self._thread_pool = ThreadPoolExecutor(
Member Author

We use multithreading here since all threads in the process spend most of their time waiting on I/O.
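A minimal sketch of why threads suffice for the deferred copies: each worker mostly waits on the database, so Python's GIL is not a bottleneck. copy_table and the table names below are illustrative stand-ins for the real table store operations.

```python
# Illustrative sketch: deferring I/O-bound table copies to a thread pool.
from concurrent.futures import ThreadPoolExecutor
import time

def copy_table(name):
    time.sleep(0.01)  # stands in for the database doing the actual copy
    return f"copied {name}"

with ThreadPoolExecutor(max_workers=4) as pool:
    # Threads (not processes) are enough: workers wait on I/O in parallel
    # while the main thread can keep scheduling tasks.
    futures = [pool.submit(copy_table, t) for t in ["a", "b", "c"]]
    results = [f.result() for f in futures]

print(results)  # ['copied a', 'copied b', 'copied c']
```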

@windiana42
Member Author

To get this change into the project, I am merging this PR without review.

@windiana42 windiana42 merged commit 5b44fc0 into main Mar 31, 2023
@windiana42 windiana42 deleted the cache_valid_stages branch March 31, 2023 15:17