Don't copy any data in case of 100% cache valid stages #54
Conversation
…chema of cache valid tasks [tests still fail]
This is quite a hard change to existing concepts, but it should massively speed up the common case where most stages are 100% cache-valid. There are still open issues with the DB2 tests.
This is important since the source tables in DB2 are aliases, and in the current schema we also keep aliases into the cache for cache-valid tables.
…of cache valid tasks
As the test failures show, some more solidification work will be needed. For DB2 and mssql the technique already works in general. For Postgres it might be a bit more tricky, since we can only place read views as aliases. However, for DB2/mssql we also resolve those aliases before working with them (sqlalchemy can't work with them very well). The same can be done for read views; they would just be placeholders in this case, functioning as a level of indirection.
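The dialect-specific alias techniques described above can be sketched roughly as follows. This is a hedged illustration, not pipedag's actual API; the function name and SQL shapes are assumptions:

```python
def alias_statement(dialect: str, schema: str, name: str, src_schema: str) -> str:
    """Render a cheap placeholder for a cache-valid table, copying no data.

    Illustrative sketch: mssql uses synonyms, DB2 uses aliases, and
    Postgres, which has neither, uses a read view as the indirection.
    """
    qualified = f"{schema}.{name}"
    source = f"{src_schema}.{name}"
    if dialect == "mssql":
        return f"CREATE SYNONYM {qualified} FOR {source}"
    if dialect == "ibm_db2":
        return f"CREATE ALIAS {qualified} FOR {source}"
    # Postgres: a read view serves as the level of indirection
    return f"CREATE VIEW {qualified} AS SELECT * FROM {source}"
```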
Python 3.9 caused strange errors with the DB2 interface.
Further adaptations were needed.
This is an experiment to see whether CI runs get more reliable with that.
The ibm_db2 service is only started once for the whole matrix execution, with steps for the different Python versions. Due to the artificially small tests, this leads to actions that DB2 considers as deadlocks, even though zookeeper is used to ensure that any stage is only touched by exactly one process at any time. It could be that using wildly different graphs on the same instance_id causes this, but this problem will not affect any realistic use cases of pipedag.
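The "exactly one process per stage" guarantee mentioned above is provided cross-process by zookeeper; within a single process the same idea can be sketched with one lock per stage name. All names here are illustrative, not pipedag's API:

```python
from collections import defaultdict
from threading import Lock

# In-process stand-in for the cross-process zookeeper stage locks:
# each stage name maps to one lock, so at most one runner touches a
# given stage at any time.
_stage_locks: defaultdict[str, Lock] = defaultdict(Lock)

def run_stage_exclusive(stage: str, work):
    """Run `work` while holding the exclusive lock for `stage`."""
    with _stage_locks[stage]:
        return work()
```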
@NMAC427 we already talked this change through, but I guess a review would be good for a refactoring of this size. I am about to get the tests green; it seems to be just a problem of current activity on the ibm_db2 service, which is started in a different way than our other docker-compose databases.
The limit of 100 concurrent connections to zookeeper is arbitrary, but it is easier to change that option if we don't rely on the default.
I am also making dummy commits to get GitHub Actions to run something.
to trigger GitHub Actions
This will be undone once we understand how to avoid either the serialization or the database-based deadlocks.
Commented changes in PR
@@ -10,5 +10,5 @@ runs:
   - name: Test
     shell: bash
     run: |
-      poetry run pytest tests -ra ${DEBUG:+-s} ${{ inputs.arguments }}
+      poetry run pytest tests -ra -v -s ${DEBUG:+-s} ${{ inputs.arguments }}
For now, I suggest leaving the more verbose test output on CI. We are still having trouble with database deadlocks despite the locking of stages.
@@ -65,15 +65,133 @@ jobs:
     with:
       arguments: --workers 2

-  full_test:
+  full_test39:
This lengthy duplication is not nice; we will try to undo it. One idea is `max-parallel: 1` under `strategy:`. However, this option did not work.
@@ -57,7 +57,7 @@ mssql = ["pyodbc", "pytsql"]
 ibm_db2 = ["ibm-db", "ibm-db-sa"]
 prefect = ["prefect", "docker-compose"]

-[tool.poetry.group.dev.dependencies]
+[tool.poetry.dev-dependencies]
This seems to be the way for future Poetry versions.
-    Tuple["Materializable", ...],
     dict[str, "Materializable"],
     list["Materializable"],
+    tuple["Materializable", ...],
This is an automated pyupgrade change done by a pre-commit hook.
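For context: since Python 3.9 (PEP 585), the builtin container types are subscriptable, so pyupgrade rewrites `typing.Tuple`/`List`/`Dict` to their lowercase builtins. A minimal sketch of the resulting style; the exact union members of `Materializable` here are an illustrative guess, not the real pipedag definition:

```python
from typing import Union

# pyupgrade-style annotations: builtin generics instead of typing.Tuple etc.
# Forward references as strings make the recursive alias work.
Materializable = Union[
    int,
    str,
    dict[str, "Materializable"],   # previously Dict[str, "Materializable"]
    list["Materializable"],        # previously List["Materializable"]
    tuple["Materializable", ...],  # previously Tuple["Materializable", ...]
]
```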
        synonyms = self._get_mssql_sql_synonyms(transaction_schema.get())
        for name in synonyms.keys():
            self.execute(
                DropAlias(name, transaction_schema, if_exists=True),
For mssql, we use synonyms as the implementation of DropAlias/CreateAlias.
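The T-SQL that such alias operations boil down to can be sketched as below. This is a hedged illustration of the idea, not pipedag's real `CreateAlias`/`DropAlias` rendering; the function names are assumptions (note that `DROP SYNONYM IF EXISTS` requires SQL Server 2016+):

```python
def create_alias_sql(name: str, schema: str, src_schema: str) -> str:
    # A synonym acts as a zero-copy alias pointing at the cached table.
    return f"CREATE SYNONYM [{schema}].[{name}] FOR [{src_schema}].[{name}]"

def drop_alias_sql(name: str, schema: str, if_exists: bool = True) -> str:
    # Drop the synonym, optionally tolerating its absence.
    prefix = "DROP SYNONYM IF EXISTS" if if_exists else "DROP SYNONYM"
    return f"{prefix} [{schema}].[{name}]"
```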
@@ -91,14 +95,24 @@ def __init__(self, flow: Flow):

        self.task_memo: defaultdict[Any, Any] = defaultdict(lambda: MemoState.NONE)

        # deferred table store operations
        self._thread_pool = ThreadPoolExecutor(
We use multithreading here since all threads in the process spend most of their time waiting on I/O.
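The deferred-copy pattern can be sketched like this: copies are submitted to a thread pool and all of them are awaited before the stage is committed. All names are illustrative, not pipedag's actual classes:

```python
from concurrent.futures import Future, ThreadPoolExecutor

class DeferredCopies:
    """Dispatch I/O-bound table copies to a thread pool; await before commit."""

    def __init__(self, max_workers: int = 4):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._futures: list[Future] = []

    def submit_copy(self, fn, *args):
        # Queue a background copy operation.
        self._futures.append(self._pool.submit(fn, *args))

    def wait_all(self):
        # Block until every copy finished; re-raises the first copy error,
        # so a failed copy prevents the stage commit.
        for f in self._futures:
            f.result()
        self._futures.clear()
```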
To get this change into the project, I am merging this PR without review.
Sounds like a no-brainer, but it is quite tricky to achieve, since we only know halfway through stage execution whether this is the case. So we start by setting just aliases/read views and switch to copying tables (in the background) once the first cache-invalid task is discovered; the existing aliases are replaced by copies before stage commit.
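The switch from cheap aliases to real copies described above can be sketched as a small state machine. This is a hedged, simplified illustration; the class and method names are assumptions, not pipedag's implementation:

```python
class StageCache:
    """Track cache-valid tables: aliases first, real copies after invalidation."""

    def __init__(self):
        self.aliases: set[str] = set()  # cheap placeholders (synonyms/read views)
        self.copies: set[str] = set()   # tables actually copied into the stage
        self.invalid_seen = False

    def add_cached_table(self, name: str):
        if self.invalid_seen:
            # A cache-invalid task was already found: copy right away.
            self.copies.add(name)
        else:
            # Stage may still turn out 100% cache-valid: just alias it.
            self.aliases.add(name)

    def on_cache_invalid_task(self):
        # First cache-invalid task discovered: replace all existing
        # aliases by real copies (done before stage commit).
        if not self.invalid_seen:
            self.invalid_seen = True
            self.copies |= self.aliases
            self.aliases.clear()
```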
Checklist
- CHANGELOG.rst entry