-
Notifications
You must be signed in to change notification settings - Fork 2
Allow use of cloud stores for intermediate zarr files #9
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good @tomwhite, but we're pretty light on testing here. There's a lot of corner cases that can go wrong. I think we need tests for
- Make sure that the
dir=file://
works as expected - Verify sensible behaviour when
dir
is a local file but doesn't exist - Check that well-formed URLs that are not local files get processes as expected (I'm guessing a mock is the right way to do this: we mock out fspec.filesystem() and check that it's (a) getting called with the right args, and (b) fs.mkdir is getting called with the right args later.
- Verify that name collisions are handled correctly. This should be possible to do by calling tempdir twice with random.seed() set to the same value immediately before calling.
@jeromekelleher thanks for enumerating those cases. I think most of them are straightforward. Name collisions are tricky for cloud object stores, since they don't have a first-class concept of directory. Indeed, the doc for fsspec mkdir states:
So we could easily get into race conditions and eventual consistency issues. Instead, I think we should use a UUID (rather than the 8 character random string I used in the first PR) to ensure uniqueness. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks @tomwhite. UUID is much better - although I wonder if it's better to go all the way with uuid4.
0f67185
to
a3cd98a
Compare
It turns out that this needed a bit more work to get it to handle non-local filesystems properly, after I tried using it on GCS.
As a result of these changes I discovered that |
(BTW the reason that |
I'm out of my lane here but is there a reason to manipulate URLs as strings rather than use urllib.parse? |
Should we file an issue upstream? |
Good question. Actually, urllib.parse is being used to extract the protocol from the URL string. However, urllib.parse.urljoin doesn't work for arbitrary protocols (like I wondered if we should use a URI library like uritools. But even that would require additional utility functions to extract a filename, and combine a parent directory with a child directory or file.
This turned out to be a bug in the test, and |
Hopefully final update - use yarl for URL parsing (simple API, more popular library). This is ready for review again now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM @tomwhite, I think we just need to do a bit more testing on build_url. There's a lot of different corner cases here, so we need to be reasonably thorough here I think.
|
||
def test_build_url(): | ||
assert build_url("http://host/path", "subpath") == "http://host/path/subpath" | ||
assert build_url("http://host/path/", "subpath") == "http://host/path/subpath" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Be good to test this with more awkard examples:
- Non http scheme (s3, say)
- No scheme, just a local file path like "relative_path/stuff" and "/absolute_path/stuff/"
- A URL with a query string
http://host/path?x=1
->http://host/path/subpath?x=
presumably? - A URL with some url encoded entities
http://host/a%20path/
, say - A URL that has some things in it that need to be URL encoded, say, "http://host/!#_%" - presumably, we should always return a properly URL encoded string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jeromekelleher! Fixed in e2de445
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, reapproving.
sgkit_vcf/utils.py
Outdated
prefix: Optional[str] = None, | ||
dir: Optional[PathType] = None, | ||
storage_options: Optional[Dict[str, str]] = None, | ||
) -> Generator[str, None, None]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of Generator[str, None, None]
you can say Iterator[str]
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Fixed in 5671652
sgkit_vcf/vcf_reader.py
Outdated
|
||
DEFAULT_ALT_NUMBER = 3 # see vcf_read.py in scikit_allel | ||
|
||
|
||
@contextmanager | ||
def open_vcf(path: PathType) -> VCF: | ||
def open_vcf(path: PathType) -> Generator[VCF, None, None]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to use Generator
, see above
yield tempdir | ||
finally: | ||
# Remove the temporary directory on exiting the context manager | ||
fs.rm(tempdir, recursive=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should allow to control whether temp files are deleted via a flag (default = True)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That could be useful. I've opened #10 for it.
suffix: Optional[str] = None, | ||
prefix: Optional[str] = None, | ||
dir: Optional[PathType] = None, | ||
storage_options: Optional[Dict[str, str]] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these storage options tested anywhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sgkit_vcf/utils.py
Outdated
|
||
# Construct a random directory name | ||
tempdir = build_url(dir, prefix + str(uuid.uuid4()) + suffix) | ||
fs.mkdir(tempdir) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this fs.mkdir
call should be inside the try/finally
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixes #7