
dbt seed operator only runs successfully on first run after reboot of mwaa #29

Closed
samLozier opened this issue Dec 31, 2021 · 10 comments

@samLozier

I'm using this very helpful package to run dbt code on mwaa and have encountered a weird bug where the dbt seed operator only works on the first run after the mwaa environment has been created or updated. Subsequent runs fail with a .csv not found error. Notably, dbt seed appears to be looking in the wrong temp directory on the second run.

An example of the failure: [Errno 2] No such file or directory: '/tmp/airflowtmpeijb2yn7/data/METRICS_SEED.csv'
The temp directory, in this case airflowtmpeijb2yn7, does not match the directory that the "fetch from s3" step downloaded all the dbt files into. Looking at that log, I can see that the .csv file I wanted was downloaded, along with all the other dbt files, into a different subdirectory of /tmp.

  • All dbt related files live in s3 buckets
  • dbt deps is run in CI so that it isn't called every time in Airflow

I'm not sure if this is even an issue with this package or a bug with mwaa or dbt, but I thought I'd raise it here first.

@tomasfarias
Owner

Thanks for using airflow-dbt-python and reporting an issue!

Interesting, I haven't personally run into anything like this and we have DbtSeedOperator running at the beginning of our pipeline with a pretty high frequency.

Every execution of the operator should have its own temp directory created and all files should be pulled there. Moreover, the mechanisms that handle file pulling and temp directory creation are the same for DbtSeedOperator as well as other operators like, say, DbtRunOperator. Have you encountered the issue with any other operator besides DbtSeedOperator?
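
To illustrate what I mean, here's a simplified sketch of that pattern (not the library's actual code, just the general shape): a fresh temporary directory is created for the execution, files are pulled into it, and dbt runs against that path.

import tempfile
from pathlib import Path

# Simplified sketch, not airflow-dbt-python's actual implementation:
# every operator execution gets its own scratch directory.
def execute_dbt_task(pull_project_files, run_dbt):
    with tempfile.TemporaryDirectory(prefix="airflowtmp") as tmp_dir:
        project_dir = Path(tmp_dir)
        pull_project_files(project_dir)   # e.g. download project files from S3
        run_dbt(project_dir=project_dir)  # dbt should only ever see this path
    # the directory is deleted on exit; nothing should reference it afterwards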

Would you mind sharing a few more details? For example, the following may help us understand a bit more what's going on:

  • A snippet of code showing your DAG/Operator.
  • More log lines: specifically the logs about files being pulled, what happens before you see the error.

Of course, do remove anything that you deem sensitive. We'll work with what we get 😄

Thanks!

@tomasfarias tomasfarias self-assigned this Dec 31, 2021
@tomasfarias tomasfarias added the "question" label Dec 31, 2021
@tomasfarias
Owner

My current line of thought is the following: the error happens when attempting to download METRICS_SEED.csv because either the airflowtmpeijb2yn7 directory or its data subdirectory does not exist.

If it's the latter, your issue could be the same as #25, in which case I would also like to know whether you are running the latest version of airflow-dbt-python, v0.11.0, which should have taken care of that bug (although it's hard to reproduce, so we may not have fixed it correctly). Moreover, it would be valuable to know if you have an empty, unnamed file inside your data subpath in S3. Such files can pop up when uploading files to S3 manually, and you can spot them with the aws cli:

❯ aws s3 ls 's3://MYBUCKET/test/'
2021-12-15 19:08:37          0 
2021-12-15 19:08:55        225 requirements.txt
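
If you'd rather check programmatically, here's a minimal boto3 sketch (bucket name and prefix are placeholders) that lists any zero-byte keys under a prefix:

import boto3

# Flag empty objects under a prefix; unnamed zero-byte keys can appear
# when files are uploaded to S3 through the console.
s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="MYBUCKET", Prefix="test/"):
    for obj in page.get("Contents", []):
        if obj["Size"] == 0:
            print(f"Empty key: {obj['Key']!r}")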

If it's the former, I can't really say much without looking at new data.

@samLozier
Author

samLozier commented Jan 3, 2022

@tomasfarias thanks for the reply. Here's some more info that might be helpful. Let me know if I can get you any other info.

Does this issue happen with any other operators? No, it doesn't. Using all the same settings, dbt run and dbt test always work.

airflow-dbt-python version = 0.11.0

files in the data folder on s3:

  • .gitkeep # possibly the issue? size is 0b
  • metrics_seed.csv
  • metrics_seed.yml

Some log lines that document the work related to this file:

[2022-01-03 00:00:23,995] {{s3.py:85}} INFO - Downloading dbt project files from: s3://<mybucket>/dbt/project/
[2022-01-03 00:00:27,583] {{s3.py:53}} INFO - Saving s3.Object(bucket_name='<mybucket>', key='dbt/project/data/METRICS_SEED.csv') file to: /tmp/airflowtmpxpofvfph/data/METRICS_SEED.csv
[2022-01-03 00:04:13,940] {{functions.py:248}} INFO - 00:04:13  1 of 1 START seed file transform.METRICS_SEED............................... [RUN]
[2022-01-03 00:04:14,005] {{functions.py:252}} ERROR - 00:04:14.005093 [error] [Thread-2  ]: Unhandled error while executing seed.project.METRICS_SEED
[Errno 2] No such file or directory: '/tmp/airflowtmpcm25eelz/data/METRICS_SEED.csv'

DAG code:

from datetime import timedelta

from airflow import DAG
from airflow.utils.dates import days_ago

from airflow_dbt_python.operators.dbt import DbtSeedOperator

# S3_BRANCH_ALIAS and args are defined elsewhere in the full DAG file.
PROFILES_DIR = f"s3://project-mwaa-{S3_BRANCH_ALIAS}/dbt/"
PROJECT_DIR = f"s3://project-mwaa-{S3_BRANCH_ALIAS}/dbt/project/"

BASE_PATH = "/usr/local/airflow/dbt/project/"
models_dir = BASE_PATH + "models"


with DAG(
    dag_id="project_default_run",
    default_args=args,
    schedule_interval="@daily",
    start_date=days_ago(2),
    dagrun_timeout=timedelta(minutes=60),
    tags=["project", "default"],
) as dag:

    dbt_seed = DbtSeedOperator(
        task_id="dbt_seed",
        project_dir=PROJECT_DIR,
        profiles_dir=PROFILES_DIR,
        full_refresh=True,  # have also tried False
    )

@samLozier
Author

As a follow up, if you do think that this is related to this project, I don't mind taking a run at fixing it myself.

@tomasfarias
Owner

Going by the logs, it seems like there's a second dbt thread running (the log message comes from here). Could you check your profiles.yml to see if you are running with threads: 2 (or some other value > 1)? Running the test suite with threads=2 in DbtSeedOperator doesn't cause any errors, so it may be a red herring, but perhaps you could tell me the results of running with threads=1?
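
For reference, threads is set per output in profiles.yml; here's a minimal sketch (the profile name and connection details are hypothetical placeholders):

default:
  target: dev
  outputs:
    dev:
      type: redshift                 # hypothetical adapter
      host: example.redshift.amazonaws.com
      user: dbt_user
      password: "{{ env_var('DBT_PASSWORD') }}"
      port: 5439
      dbname: analytics
      schema: transform
      threads: 1                     # try 1 here to isolate the issue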

I don't think .gitkeep would cause any issues for the same reason: adding a .gitkeep file to the tests doesn't cause them to fail. But perhaps you could try a test run without it?

Given that only one temporary directory is created per task instance, I don't see how multiple temporary directories would pop up in a single task instance. It almost appears as if there were another task instance running simultaneously. Given this, you can try setting AIRFLOW__CORE__DAG_CONCURRENCY to 1, although I'm a bit perplexed as to why something like this would be happening. If you are using MWAA you can see how to set Airflow configuration variables here: https://docs.aws.amazon.com/mwaa/latest/userguide/configuring-env-variables.html#configuring-env-variables-customizing.
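
If I recall correctly, MWAA exposes these as Airflow configuration options in the environment settings, using dot notation rather than the raw environment variable name; something like:

core.dag_concurrency = 1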

@samLozier
Author

I have been running with ~30 threads; we have about 100 models that get built, and running with a lot of threads dramatically cuts down on run time locally. I'll explore running with a single thread and see if that improves anything.

@tomasfarias
Owner

Just to clarify: We do want to support multi-threaded workflows. I suggested testing out single-threaded just to try and isolate the issue so that it can be fixed, not as a permanent solution. Thanks for taking a look!

@tomasfarias
Owner

tomasfarias commented Feb 5, 2022

hey @samLozier! Good news! I'm working on a major refactoring of the way we handle dbt projects to support multiple backends besides S3... and I was able to (unintentionally) reproduce your original issue.

I figured out the root cause when I noticed that some of my tests were pushing back the target directory of my dbt project, and this target directory was later pulled again by future tests. Those future tests then activate dbt's partial parsing, since target is not empty, and dbt reads the old saved files here: https://github.com/dbt-labs/dbt-core/blob/main/core/dbt/parser/manifest.py#L226

Problem is that these old saved files take precedence over the new configuration! So dbt thinks we are still in the old project (with an old directory!), but since we use temporary directories, the old project's directory doesn't exist anymore. This is why, even though we are using a new project directory, it's just being ignored. Your particular error may not originate in exactly this way, but I imagine the root cause is the same: sharing an old target directory (specifically the partial_parse.msgpack file in it).

In summary, the contents of the target directory aren't really meant to be moved around.

So, how do we fix this?

The quickest fix is for you to simply disable partial parsing by setting partial_parse=False. This will make dbt ignore anything in the target dir and rebuild everything. Since messing with the target dir is as unsupported as it gets, I'll also be disabling it by default in the next release (the PR that does that will close this issue).
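
Using the DAG you shared above, that would look like this (partial_parse is the flag I mentioned; everything else is unchanged):

    dbt_seed = DbtSeedOperator(
        task_id="dbt_seed",
        project_dir=PROJECT_DIR,
        profiles_dir=PROFILES_DIR,
        full_refresh=True,
        partial_parse=False,  # ignore any stale target/partial_parse.msgpack
    )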

In the future, we may also remove this file if we want to do anything with partial parsing or allow users to decide what to pull and push.

Finally, just to clarify, this error is not related to your multi-threading workflows, so feel free to thread on!

@tomasfarias tomasfarias added the "bug" label and removed the "question" label Feb 5, 2022
@samLozier
Author

Thanks for this. I had to do a big multi-day rebuild today, and this cut the run time per dbt run in half.

@tomasfarias
Owner

Closed by #40: partial_parse now defaults to False.
