Cannot add file having name with substring of a folder as prefix in s3 #2871

@skshetry

Steps to reproduce

  1. Upload two files to S3: folder/data/data.csv and folder/datasets.md.
  2. Set up remotes and caches:
dvc remote add -f s3 s3://dvc-temp/folder
dvc remote add -f cache remote://s3/cache
dvc config cache.s3 cache
  3. Run dvc run -d remote://s3/data 'echo hello world'

Outcome

Running command:
        echo hello world
hello world
ERROR: unexpected error - '/folder/datasets.md' does not start with '/folder/data'
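
The wording of this error resembles the ValueError that pathlib's relative_to() raises on the Python 3.6 listed under Version. Assuming that is where it originates (an assumption, not confirmed against the DVC source), the sketch below shows why the same key passes a plain string-prefix test but fails a component-wise path test. The variable names are illustrative only.

```python
from pathlib import PurePosixPath

key = "/folder/datasets.md"
directory = "/folder/data"

# Plain string prefix match: "datasets.md" begins with "data", so this is True.
string_match = key.startswith(directory)

# Component-wise check: "datasets.md" is not the path component "data",
# so relative_to() raises ValueError.
try:
    PurePosixPath(key).relative_to(directory)
    component_match = True
except ValueError:
    component_match = False

print(string_match, component_match)  # True False
```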

Version

$ dvc version
Python version: 3.6.6
Platform: Linux-5.3.12-arch1-1-x86_64-with-arch
Binary: False
Package: None
Filesystem type (cache directory): ('ext4', '/dev/sda9')
Filesystem type (workspace): ('ext4', '/dev/sda9')

Script to reproduce

#! /usr/bin/env bash

export AWS_ACCESS_KEY_ID='testing'
export AWS_SECRET_ACCESS_KEY='testing'
export AWS_SECURITY_TOKEN='testing'
export AWS_SESSION_TOKEN='testing'

moto_server s3 &> /dev/null &

python -c '
import boto3

session = boto3.session.Session()
s3 = session.client("s3", endpoint_url="http://localhost:5000")
s3.create_bucket(Bucket="dvc-temp")

s3.put_object(Bucket="dvc-temp", Key="folder/data/data.csv")
s3.put_object(Bucket="dvc-temp", Key="folder/datasets.md", Body="### Datasets")
'

temp=$(mktemp -d)
cd "$temp"

dvc init --no-scm
dvc remote add -f s3 s3://dvc-temp/folder
dvc remote modify s3 endpointurl http://localhost:5000
dvc remote add -f cache remote://s3/cache
dvc config cache.s3 cache

dvc run -d remote://s3/data 'echo hello world'

Analysis:

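  1. This happens because the walk_files implementation in RemoteS3 lists keys by the bare prefix (folder/data) rather than by <prefix>/ (folder/data/), so sibling keys such as folder/datasets.md are returned as well. Either walk_files should be given a directory path that already ends in a slash, or it should append the slash itself.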

https://github.com/iterative/dvc/blob/0404a2324e497667a8b7d0ab0bd2b37db8c97e4c/dvc/remote/s3.py#L282

Or, I'd prefer it to be handled when collecting the directory.
https://github.com/iterative/dvc/blob/caa67c725e1e351ed122bdad17db0f29a8e73c39/dvc/remote/base.py#L196
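
A minimal sketch of the trailing-slash idea, using an in-memory list of keys in place of a real S3 listing. The helpers list_keys and dir_prefix are hypothetical names made up for this illustration and are not part of DVC or boto3.

```python
def list_keys(keys, prefix):
    """Stand-in for an S3 listing: return the keys that start with `prefix`."""
    return [k for k in keys if k.startswith(prefix)]


def dir_prefix(path):
    """Return `path` with exactly one trailing slash, so only children match."""
    return path.rstrip("/") + "/"


bucket_keys = ["folder/data/data.csv", "folder/datasets.md"]

# Bare prefix: also picks up the sibling file folder/datasets.md.
loose = list_keys(bucket_keys, "folder/data")

# Directory prefix: only keys under folder/data/ are returned.
strict = list_keys(bucket_keys, dir_prefix("folder/data"))

print(loose)   # ['folder/data/data.csv', 'folder/datasets.md']
print(strict)  # ['folder/data/data.csv']
```

Whether the slash is added by the caller or inside walk_files is the design choice discussed above; the sketch only shows the effect on the listing.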

  2. The logic of exists also looks flawed. Say you have the files data/subdir-file.txt and data/subdir/1. When adding data/subdir, the first listed result could be data/subdir-file.txt, which passes the startswith check, so exists() returns True because of an unrelated key rather than because data/subdir was actually found.
    So the function should check whether the path is a directory, and should loop through the results of _list_paths() until it finds an exact match (I'm not sure how expensive this would be).

https://github.com/iterative/dvc/blob/caa67c725e1e351ed122bdad17db0f29a8e73c39/dvc/remote/s3.py#L208-L211
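
One way the stricter check could look, again sketched over an in-memory key list rather than DVC's _list_paths(). Both function names are hypothetical, and scanning every key is a simplification; the cost concern raised above still applies to a real bucket.

```python
def exists_by_prefix(keys, path):
    """Loose check: any key sharing the string prefix counts as a hit."""
    return any(k.startswith(path) for k in keys)


def exists_exact(keys, path):
    """Strict check: the key itself, or at least one key under path + '/'."""
    path = path.rstrip("/")
    children_prefix = path + "/"
    return any(k == path or k.startswith(children_prefix) for k in keys)


only_sibling = ["data/subdir-file.txt"]
with_subdir = ["data/subdir-file.txt", "data/subdir/1"]

print(exists_by_prefix(only_sibling, "data/subdir"))  # True (false positive)
print(exists_exact(only_sibling, "data/subdir"))      # False
print(exists_exact(with_subdir, "data/subdir"))       # True
```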

Labels: bug (Did we break something?)
