dbt seed operator only runs successfully on first run after reboot of mwaa #29
Thanks for the report! Interesting, I haven't personally run into anything like this. Every execution of the operator should have its own temp directory created, and all files should be pulled there. Moreover, the mechanisms that handle file pulling and temp directory creation are the same across operators. Would you mind sharing a few more details? For example, the following may help us understand a bit more what's going on:
Of course, do remove any data that you deem sensitive. We'll work with what we get 😄 Thanks!
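The per-execution flow described above can be sketched roughly like this. This is a minimal illustration only; the function names and the seed file path are hypothetical, not the package's actual API:

```python
import tempfile
from pathlib import Path

def run_dbt_task(pull_files):
    # Each task execution gets its own fresh temporary directory;
    # nothing from a previous run should be visible inside it.
    with tempfile.TemporaryDirectory(prefix="airflowtmp") as tmp_dir:
        project_dir = Path(tmp_dir)
        pull_files(project_dir)  # e.g. download the dbt project from S3
        # dbt seed would then look for seeds relative to this directory:
        seed_path = project_dir / "data" / "METRICS_SEED.csv"
        return seed_path.exists()

# Stub "pull" step that writes the seed file where dbt expects it:
def fake_pull(dest):
    (dest / "data").mkdir(parents=True)
    (dest / "data" / "METRICS_SEED.csv").write_text("id,value\n1,a\n")

print(run_dbt_task(fake_pull))  # True: the seed is found in the fresh temp dir
```

If every run follows this pattern, a "file not found in a *different* temp directory" error suggests something is carrying state across runs.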
My current line of thought is the following: the error is happening either when attempting to download the files, or later when dbt tries to use them. If it's the latter, then your issue could be the same as #25, in which case I would also like to know if you are running the latest version of airflow-dbt-python.

```
❯ aws s3 ls 's3://MYBUCKET/test/'
2021-12-15 19:08:37          0
2021-12-15 19:08:55        225 requirements.txt
```

If it's the former, I can't really say much without looking at new data.
@tomasfarias thanks for the reply. Here's some more info that might be helpful; let me know if I can get you anything else.

- Does this issue happen with any other operators? No, it doesn't. Using all the same settings, dbt run and dbt test always work.
- airflow-dbt-python version: 0.11.0
- Files in the data folder on S3:
Some log lines that document the work related to this file:
DAG code:
As a follow-up: if you do think that this is related to this project, I don't mind taking a run at fixing it myself.
Going by the logs, it seems like there's a second dbt thread running (as the log message comes from here). Could you check your dbt threads configuration? Given that only one temporary directory is created per task instance, I don't see how multiple temporary directories would pop up in a single task instance. It almost appears as if there was another task instance running simultaneously. Given this, something you can try is setting the number of threads to 1.
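Since the operator runs dbt under the hood, the thread count comes from the dbt profile. Dropping it to 1 for a test run would look like the following; the profile, target, and adapter names here are placeholders, and connection fields are omitted:

```yaml
# profiles.yml -- names below are illustrative
my_profile:
  target: dev
  outputs:
    dev:
      type: redshift   # any adapter; connection settings omitted here
      threads: 1       # run single-threaded while isolating the issue
```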
I have been running with ~30 threads. We have about 100 models that get built, and running with a lot of threads dramatically cuts down on run time when running locally. I'll explore running with a single thread and see if that improves anything.
Just to clarify: we do want to support multi-threaded workflows. I suggested testing single-threaded just to try and isolate the issue so that it can be fixed, not as a permanent solution. Thanks for taking a look!
hey @samLozier! Good news! I'm working on a major refactoring of the way we handle dbt projects to support multiple backends besides S3... and I was able to (unintentionally) reproduce your original issue. I figured out the root cause when I noticed some of my tests were pushing back files that dbt had saved on a previous run.

The problem is that these old saved files take precedence over the new configuration! So dbt thinks we are still in the old project (with an old directory!), but since we use temporary directories, the old project's directory doesn't exist anymore. This is why, even though we are using a new project directory, it's just being ignored. Your particular error may not originate from this exact scenario, but I imagine that the root cause is the same: sharing of an old target directory (or rather, the parse state dbt saves in it, the partial_parse.msgpack file). In summary, the contents of the target directory aren't really meant to be moved around.

So, how do we fix this? The quickest way is for you to simply disable partial parsing in your dbt configuration. In the future, we may also remove this file, if we want to do anything with partial parsing, or allow users to decide what to pull and push.

Finally, just to clarify: this error is not related to your multi-threading workflows, so feel free to thread on!
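Disabling partial parsing, as suggested above, can be done through the `config` block of `profiles.yml`. This is standard dbt configuration rather than anything specific to this package:

```yaml
# profiles.yml
config:
  partial_parse: false   # stop dbt from reusing target/partial_parse.msgpack
```

With this set, dbt re-parses the project from scratch on every invocation instead of trusting stale parse state pulled in from a previous run's target directory.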
Thanks for this! I had to do a big multi-day rebuild today, and this cut the run time per dbt run by half.
Closed by #40.
I'm using this very helpful package to run dbt code on MWAA and have encountered a weird bug where the dbt seed operator only works on the first run after the MWAA environment has been created or updated. Subsequent runs fail with a "csv not found" error. Notably, dbt seed appears to be looking in the wrong temp directory on the second run.
An example of the failure:

```
[Errno 2] No such file or directory: '/tmp/airflowtmpeijb2yn7/data/METRICS_SEED.csv'
```
The temp directory, in this case airflowtmpeijb2yn7, does not match the directory that the "fetch from S3" step downloaded all the dbt files into. Looking at that log, I can see that the .csv file I wanted was downloaded, along with all the other dbt files, into a different subdirectory of /tmp.
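The mismatch can be reproduced in miniature: anything that holds on to an absolute path from an earlier temporary directory will fail once that directory is cleaned up. The paths and file contents below are made up for illustration:

```python
import tempfile
from pathlib import Path

# "First run": create a temp dir and remember an absolute path inside it,
# the way stale parse state would remember the old project directory.
with tempfile.TemporaryDirectory(prefix="airflowtmp") as first_run:
    cached = Path(first_run) / "data" / "METRICS_SEED.csv"
    cached.parent.mkdir(parents=True)
    cached.write_text("id,value\n1,a\n")

# "Second run": files land in a *new* temp dir, but anything reusing the
# cached absolute path from the first run fails.
with tempfile.TemporaryDirectory(prefix="airflowtmp") as second_run:
    fresh = Path(second_run) / "data" / "METRICS_SEED.csv"
    fresh.parent.mkdir(parents=True)
    fresh.write_text("id,value\n1,a\n")
    print(cached.exists())  # False: the first run's temp dir is gone
    print(fresh.exists())   # True: the file exists, just under a new path
```

This matches the symptom: the file was downloaded correctly, but dbt looked for it under a directory from a previous run.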
I'm not sure if this is even an issue with this package, or a bug in MWAA or dbt, but I thought I'd raise it here first.