Push to S3: malformed URL #46

Closed
sparkysean opened this issue Feb 21, 2022 · 4 comments

@sparkysean

Hi there,

I'm testing out airflow-dbt-python==0.12.0 on AWS MWAA. It's working brilliantly with push_dbt_project=False, but fails when trying to push things back to S3 using push_dbt_project=True. I definitely have an IAM policy associated with MWAA that gives the correct PutObject permission for the bucket.

Looking at my log output, I think there's something wrong with the arguments being passed into load_file_handle_replace_error.

The log message written by the operator shows the correct target URL:

[2022-02-21, 03:24:01 UTC] {{dbt.py:258}} INFO - Pushing dbt project back to S3: s3://mwaa-test-bucket/dbt/project/dbt.zip

But when the boto3 S3UploadFailedError occurs later, the error output shows a malformed URL:

boto3.exceptions.S3UploadFailedError: Failed to upload /tmp/airflowtmptr4_9_un/dbt_project.zip to mwaa-test-bucket/s3://mwaa-test-bucket/dbt/project/dbt.zip: An error occurred (AccessDenied) when calling the PutObject operation: Access Denied

I had a quick look at the S3 hook code, and I think it may have to do with this bit:

s3_key = f"s3://{bucket_name}/{key}{_file.relative_to(project_dir)}"
self.load_file_handle_replace_error(
    filename=_file,
    key=s3_key,
    bucket_name=bucket_name,
    replace=replace,
)

As I understand it, the full S3 URL is being passed in as the object key, when it should just be the object key itself (the underlying Airflow S3 hook should be able to construct the final S3 URL from the supplied bucket name and key).

This would also explain the malformed URL in the logs: it looks like the bucket name (mwaa-test-bucket/) is being concatenated with the full URL that was passed as the key (s3://mwaa-test-bucket/dbt/project/dbt.zip).
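
If that's right, I'd naively expect something along these lines to work instead (just a rough sketch based on the variable names in the snippet above, not something I've tested):

# Build only the object key; let the Airflow S3 hook combine it with bucket_name.
s3_key = f"{key}{_file.relative_to(project_dir)}"
self.load_file_handle_replace_error(
    filename=_file,
    key=s3_key,
    bucket_name=bucket_name,
    replace=replace,
)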

@tomasfarias
Owner

Hello @sparkysean, thanks for opening up an issue.

That definitely looks like some incorrect concatenation going on. That being said, I don't think that particular snippet is the one causing the issue: you are using an S3 URL that ends with .zip, which means we should be running the first branch of the if, since this condition is true:

if key.endswith(".zip"):

If airflow-dbt-python is indeed reaching line 174 onward, then the bug must be earlier: when using a zip file we shouldn't even enter the else branch that runs that concatenation in the first place.

I tested out the S3Hook.parse_s3_url function that splits up the full URL into bucket and key, in case your particular URL was causing any parsing issues, but I didn't find any problems.
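
For reference, this is roughly what I checked (assuming Airflow 2.x with the Amazon provider installed):

from airflow.providers.amazon.aws.hooks.s3 import S3Hook

# parse_s3_url splits a full S3 URL into its bucket and key parts.
bucket, key = S3Hook.parse_s3_url("s3://mwaa-test-bucket/dbt/project/dbt.zip")
print(bucket)  # mwaa-test-bucket
print(key)     # dbt/project/dbt.zip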

Could you please share some of your DAG code? I'm particularly interested in how the project_dir and profiles_dir parameters are being passed.

Also, it would be great if you could share which Airflow, Python, and Amazon Providers (if running Airflow >= 2.0) versions you are running, as the implementations of S3Hook.parse_s3_url and urllib.parse.urlparse may have changed, and that could be what's breaking the parsing.

@tomasfarias tomasfarias self-assigned this Feb 22, 2022
@tomasfarias tomasfarias added question Further information is requested bug Something isn't working labels Feb 22, 2022
@tomasfarias
Owner

tomasfarias commented Feb 23, 2022

I was able to reproduce the issue with Airflow 2.2.2: I see it interprets the entire S3 URL as the key.

I think this could be solved by calling S3Hook.load_file with the full URL as the key and bucket_name=None, letting S3Hook handle parsing and splitting into bucket and key. Both major Airflow versions should support this.
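
Roughly the change I have in mind, reusing the snippet from above (a sketch of the idea rather than the exact patch, and assuming load_file_handle_replace_error forwards its arguments to S3Hook.load_file):

# Keep the full s3:// URL as the key, but drop bucket_name so that
# S3Hook parses the URL and splits it into bucket and key itself.
s3_key = f"s3://{bucket_name}/{key}{_file.relative_to(project_dir)}"
self.load_file_handle_replace_error(
    filename=_file,
    key=s3_key,
    bucket_name=None,
    replace=replace,
)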

Unfortunately, our tests were hiding the issue, since we checked for the full URL as the key:

    key = hook.check_for_key(
        f"s3://{s3_bucket}/project/project.zip",
        s3_bucket,
    )

    assert key is True

Which passes, since with the bug the object really does end up uploaded under the full URL as its key.
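
Something closer to this is probably what the test should check instead (a sketch, reusing the same hook and s3_bucket fixture):

    # Check for the object key on its own, not the full s3:// URL.
    key = hook.check_for_key(
        "project/project.zip",
        s3_bucket,
    )

    assert key is True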

I'm fairly confident this was caused by the differences between Airflow 1.10 and Airflow 2.x, as I have tested 1.10 and it works fine (meaning no full path is uploaded). Our testing pipeline is much more comprehensive and should let us know if we break any of the Airflow versions we support, so I will push a patch to fix this. Thanks for reporting it!

@tomasfarias tomasfarias removed the question Further information is requested label Feb 23, 2022
@tomasfarias
Owner

Closed by #48. Expect patch in v0.13.1.

@sparkysean
Author

Hi @tomasfarias,

Thanks so much for getting back to me! It looks like you've already found a solution for this, but for completeness, here are the answers to your questions about my specific environment:

MWAA Airflow version: 2.2.2
Python version: 3.7
Amazon Provider version: 2.4.0

DAG code:

from datetime import datetime, timedelta

from airflow import DAG
from airflow_dbt_python.operators.dbt import DbtRunOperator

with DAG('example_dbt_run',
    description='testing out dbt run from MWAA',
    schedule_interval=timedelta(days=1),
    start_date=datetime(2022, 2, 16),
    catchup=False
) as dag:

    dbt_run = DbtRunOperator(
        task_id="dbt_run",
        project_dir="s3://mwaa-test-bucket/dbt/project/dbt.zip",
        profiles_dir="s3://mwaa-test-bucket/dbt/profile/",
        target="dev",
        profile="snowflake",
        push_dbt_project=True
    )

Thanks for a (ridiculously!) quick turnaround on this, as well as for developing these super useful Airflow hooks/operators in the first place - looking forward to testing this out in our environment in v0.13.1 🎉
