
Error downloading dbt project files from s3 #25

Closed
tpcarneiro opened this issue Dec 14, 2021 · 4 comments · Fixed by #27
Labels: bug Something isn't working

@tpcarneiro

When passing an S3 URL as the project_dir parameter, DbtS3Hook tries to download the root folder (prefix) into the temp dir, throwing an exception.

With project_dir="s3://MYBUCKET.com/dbt"

...
[2021-12-14 20:15:22,510] {{dbt.py:259}} INFO - Fetching dbt project from S3: s3://MYBUCKET.com/dbt/
[2021-12-14 20:15:22,511] {{s3.py:82}} INFO - Downloading dbt project files from: s3://MYBUCKET.com/dbt/
...
[2021-12-14 20:15:24,950] {{s3.py:58}} INFO - Saving s3.Object(bucket_name='MYBUCKET.com', key='dbt/') file to: /tmp/airflowtmpimiwzxx7
[2021-12-14 20:15:24,952] {{taskinstance.py:1482}} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1138, in _run_raw_task
    self._prepare_and_execute_task_with_callbacks(context, task)
  File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1311, in _prepare_and_execute_task_with_callbacks
    result = self._execute_task(context, task_copy)
  File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1341, in _execute_task
    result = task_copy.execute(context=context)
  File "/usr/local/airflow/.local/lib/python3.7/site-packages/airflow_dbt_python/operators/dbt.py", line 147, in execute
    with self.dbt_directory() as dbt_dir:  # type: str
  File "/usr/lib64/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/usr/local/airflow/.local/lib/python3.7/site-packages/airflow_dbt_python/operators/dbt.py", line 234, in dbt_directory
    self.prepare_directory(tmp_dir)
  File "/usr/local/airflow/.local/lib/python3.7/site-packages/airflow_dbt_python/operators/dbt.py", line 262, in prepare_directory
    tmp_dir,
  File "/usr/local/airflow/.local/lib/python3.7/site-packages/airflow_dbt_python/hooks/s3.py", line 112, in get_dbt_project
    bucket_name, s3_object_keys, local_project_dir, key_prefix
  File "/usr/local/airflow/.local/lib/python3.7/site-packages/airflow_dbt_python/hooks/s3.py", line 131, in download_many_s3_keys
    self.download_one_s3_object(local_project_file, s3_object)
  File "/usr/local/airflow/.local/lib/python3.7/site-packages/airflow_dbt_python/hooks/s3.py", line 60, in download_one_s3_object
    with open(target, "wb+") as f:
IsADirectoryError: [Errno 21] Is a directory: '/tmp/airflowtmpimiwzxx7'

I wasn't able to figure out why airflow_dbt_python/hooks/s3.py, line 98:

s3_object_keys = self.list_keys(bucket_name=bucket_name, prefix=f"{key_prefix}")

returns the prefix folder itself (dbt/) as one of the keys.

Then airflow_dbt_python/hooks/s3.py, line 55, fails to open the file as it is actually a folder:

with open(target, "wb+") as f:
    s3_object.download_fileobj(f)
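As an illustration of the failure mode (a sketch under assumed path logic, not the hook's actual code): when the listed key is the prefix itself, stripping the prefix leaves an empty relative path, so the local target collapses to the temp directory itself.

```python
import os

# local_target is a hypothetical helper illustrating how the download target
# is likely derived: strip the prefix from the key and join the remainder
# onto the temp dir.
def local_target(tmp_dir: str, key: str, prefix: str) -> str:
    relative = key[len(prefix):]  # empty string for the prefix key itself
    return os.path.join(tmp_dir, relative)

# For the folder-marker key "dbt/", the relative part is "" and the target
# is the temp directory itself, which open(target, "wb+") then rejects
# with IsADirectoryError.
```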

By adding the following check after line 112, I was able to work around the error.

if s3_object_key == prefix:
    continue
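A minimal sketch of that check, generalized slightly to also skip any other zero-byte "folder marker" key ending in "/" (filter_downloadable_keys is a hypothetical helper, not part of the hook):

```python
def filter_downloadable_keys(keys, prefix):
    """Return only keys that point at real objects, not folder markers."""
    downloadable = []
    for key in keys:
        if key == prefix or key.endswith("/"):
            continue  # folder markers cannot be opened as regular files
        downloadable.append(key)
    return downloadable
```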
@tomasfarias (Owner) commented Dec 15, 2021

I also ran into a similar issue a couple of times and realized it was caused by an empty file in S3: since the file doesn't have a name, the hook assumes when downloading that it has the same name as the root project directory (which already exists, and is a directory).

You can verify whether this is the case by running aws s3 ls on your prefix and checking for an empty file there. The file does not appear in the AWS console, so you'll have to use the CLI. I did manage to figure out that it only appears when I upload things via the UI: if I use the CLI to push my dbt files, no strange empty file appears.

Regardless, given that you are reporting this, it appears to be more common than I thought, and not just me messing up when using the UI. So I'll write a quick patch as you suggested and skip S3 objects that are equal to the prefix. Before I do anything, though: do you feel like writing the PR yourself? I don't want to take credit if the patch is essentially going to be what you already detailed. After we merge it, I can do a quick patch release to push it out.

Thanks for reporting this!

@tomasfarias (Owner) commented Dec 15, 2021

Look, I just reproduced the issue by doing the following:

  • Created a test directory in my bucket.
  • Uploaded a random requirements.txt file I had laying around inside that test directory using the AWS UI.
  • Ran aws s3 ls:
❯ aws s3 ls 's3://MYBUCKET/test/'
2021-12-15 19:08:37          0 
2021-12-15 19:08:55        225 requirements.txt

There it is!

@tomasfarias (Owner) commented Dec 15, 2021

Thinking a bit more about it, the patch may also have to account for the fact that the empty file could be inside a sub-directory, in which case its key will not be equal to the prefix. In the spirit of "it's better to ask forgiveness than permission", we could just add a try block and skip files that raise IsADirectoryError.
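A hedged sketch of that "ask forgiveness" approach: attempt every download and skip any target that turns out to be a directory. download_all and download_one are hypothetical stand-ins, not the hook's actual API.

```python
def download_all(targets_and_objects, download_one):
    """Download each (target path, s3 object) pair, skipping directories.

    download_one is any callable taking (s3_object, file_handle); folder
    markers resolve to existing directories and raise IsADirectoryError.
    """
    downloaded = []
    for target, s3_object in targets_and_objects:
        try:
            with open(target, "wb+") as f:
                download_one(s3_object, f)
        except IsADirectoryError:
            continue  # skip folder markers anywhere under the prefix
        downloaded.append(target)
    return downloaded
```

With a fake download_one that writes bytes, a target equal to an existing directory is skipped while regular file targets are written normally.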

@tpcarneiro (Author)
Thanks for replying so quickly, for being so open about the issue and your experience with it, and for offering to let me write the PR. Really appreciated!
After reading your answer, I checked my bucket using the CLI and I do have a 0-byte unnamed file just like you do. I hadn't noticed it before, great catch!

Awesome work on this repo btw!
