-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Closed
Labels
bugDid we break something?Did we break something?
Description
Steps to reproduce
- Upload two files in s3:
folder/data/data.csvandfolder/datasets.md. - Setup remotes and caches.
dvc remote add -f s3 s3://dvc-temp/folder
dvc remote add -f cache remote://s3/cache
dvc config cache.s3 cachedvc run -d remote://s3/data 'echo hello world'
Outcome
Running command:
echo hello world
hello world
ERROR: unexpected error - '/folder/datasets.md' does not start with '/folder/data'
Version
$ dvc version
Python version: 3.6.6
Platform: Linux-5.3.12-arch1-1-x86_64-with-arch
Binary: False
Package: None
Filesystem type (cache directory): ('ext4', '/dev/sda9')
Filesystem type (workspace): ('ext4', '/dev/sda9')
Script to reproduce
#! /usr/bin/env bash
export AWS_ACCESS_KEY_ID='testing'
export AWS_SECRET_ACCESS_KEY='testing'
export AWS_SECURITY_TOKEN='testing'
export AWS_SESSION_TOKEN='testing'
moto_server s3 &> /dev/null &
python -c '
import boto3
session = boto3.session.Session()
s3 = session.client("s3", endpoint_url="http://localhost:5000")
s3.create_bucket(Bucket="dvc-temp")
s3.put_object(Bucket="dvc-temp", Key="folder/data/data.csv")
s3.put_object(Bucket="dvc-temp", Key="folder/datasets.md", Body="### Datasets")
'
temp=$(mktemp -d)
cd $temp
dvc init --no-scm
dvc remote add -f s3 s3://dvc-temp/folder
dvc remote modify s3 endpointurl http://localhost:5000
dvc remote add -f cache remote://s3/cache
dvc config cache.s3 cache
dvc run -d remote://s3/data 'echo hello world'Analysis:
- This is due to
walk_filesimplementation inRemoteS3looking via prefix instead of/<prefix> to walk files. Either,walk_filesshould get directory path or should just append it itself.
https://github.com/iterative/dvc/blob/0404a2324e497667a8b7d0ab0bd2b37db8c97e4c/dvc/remote/s3.py#L282
Or, I'd prefer it to be handled when collecting the directory.
https://github.com/iterative/dvc/blob/caa67c725e1e351ed122bdad17db0f29a8e73c39/dvc/remote/base.py#L196
- Again, the logic of
existslooks flawed. Say, you havedata/subdir-file.txtanddata/subdir/1files. When addingdata/subdir, the first result could besubdir-file.txtwhich matchesstartswith, therefore, theexists()will return True, but in reality,subdirdoes not exist.
So, the function should check if it's a directory, and should loop through all results of_list_paths()till it finds the exact match (not sure, how expensive this will be).
Metadata
Metadata
Assignees
Labels
bugDid we break something?Did we break something?