Skip to content
This repository has been archived by the owner on Oct 15, 2020. It is now read-only.

Allow use of cloud stores for intermediate zarr files #9

Merged
merged 17 commits into from
Sep 14, 2020

Conversation

tomwhite
Copy link
Collaborator

@tomwhite tomwhite commented Sep 7, 2020

Fixes #7

Copy link

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good @tomwhite, but we're pretty light on testing here. There's a lot of corner cases that can go wrong. I think we need tests for

  • Make sure that the dir=file:// works as expected
  • Verify sensible behaviour when dir is a local file but doesn't exist
  • Check that well-formed URLs that are not local files get processes as expected (I'm guessing a mock is the right way to do this: we mock out fspec.filesystem() and check that it's (a) getting called with the right args, and (b) fs.mkdir is getting called with the right args later.
  • Verify that name collisions are handled correctly. This should be possible to do by calling tempdir twice with random.seed() set to the same value immediately before calling.

sgkit_vcf/tests/test_utils.py Outdated Show resolved Hide resolved
sgkit_vcf/utils.py Outdated Show resolved Hide resolved
sgkit_vcf/utils.py Outdated Show resolved Hide resolved
@tomwhite
Copy link
Collaborator Author

tomwhite commented Sep 7, 2020

@jeromekelleher thanks for enumerating those cases. I think most of them are straightforward.

Name collisions are tricky for cloud object stores, since they don't have a first-class concept of directory. Indeed, the doc for fsspec mkdir states:

For systems that don’t have true directories, may create an [entry?] for this instance only and not touch the real filesystem

So we could easily get into race conditions and eventual consistency issues.

Instead, I think we should use a UUID (rather than the 8 character random string I used in the first PR) to ensure uniqueness.

sgkit_vcf/utils.py Outdated Show resolved Hide resolved
Copy link

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @tomwhite. UUID is much better - although I wonder if it's better to go all the way with uuid4.

@tomwhite
Copy link
Collaborator Author

tomwhite commented Sep 9, 2020

It turns out that this needed a bit more work to get it to handle non-local filesystems properly, after I tried using it on GCS.

  • Pass storage_options through API calls.
  • Fix (and test) tempdir URL construction (using pytest-mock, and callee).

As a result of these changes I discovered that file: URLs didn't work, and on closer inspection it looks like fsspec doesn't support them. More accurately, fsspec will get a local filesystem for a file: URL, but it won't open/mkdir a file/dir for a for a file: URL. I propose that we don't support file: URLs - at least for the moment.

@tomwhite
Copy link
Collaborator Author

tomwhite commented Sep 9, 2020

(BTW the reason that file: URLs were working before was because I was converting them to strings using str(Path(url)). But this isn't safe for non-local URLs, e.g. s3:// or gs://. As a rule, I think we should do as little path manipulation as possible.)

@hammer
Copy link

hammer commented Sep 9, 2020

I'm out of my lane here but is there a reason to manipulate URLs as strings rather than use urllib.parse?

@hammer
Copy link

hammer commented Sep 9, 2020

I discovered that file: URLs didn't work, and on closer inspection it looks like fsspec doesn't support them.

Should we file an issue upstream?

@tomwhite
Copy link
Collaborator Author

I'm out of my lane here but is there a reason to manipulate URLs as strings rather than use urllib.parse?

Good question. Actually, urllib.parse is being used to extract the protocol from the URL string. However, urllib.parse.urljoin doesn't work for arbitrary protocols (like s3).

I wondered if we should use a URI library like uritools. But even that would require additional utility functions to extract a filename, and combine a parent directory with a child directory or file.

I discovered that file: URLs didn't work

This turned out to be a bug in the test, and file:/// URLs do work, so I've added a test for that case.

@tomwhite
Copy link
Collaborator Author

Hopefully final update - use yarl for URL parsing (simple API, more popular library). This is ready for review again now.

Copy link

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM @tomwhite, I think we just need to do a bit more testing on build_url. There's a lot of different corner cases here, so we need to be reasonably thorough here I think.


def test_build_url():
assert build_url("http://host/path", "subpath") == "http://host/path/subpath"
assert build_url("http://host/path/", "subpath") == "http://host/path/subpath"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Be good to test this with more awkard examples:

  • Non http scheme (s3, say)
  • No scheme, just a local file path like "relative_path/stuff" and "/absolute_path/stuff/"
  • A URL with a query string http://host/path?x=1 -> http://host/path/subpath?x= presumably?
  • A URL with some url encoded entities http://host/a%20path/, say
  • A URL that has some things in it that need to be URL encoded, say, "http://host/!#_%" - presumably, we should always return a properly URL encoded string?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jeromekelleher! Fixed in e2de445

Copy link

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, reapproving.

prefix: Optional[str] = None,
dir: Optional[PathType] = None,
storage_options: Optional[Dict[str, str]] = None,
) -> Generator[str, None, None]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of Generator[str, None, None] you can say Iterator[str].

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Fixed in 5671652


DEFAULT_ALT_NUMBER = 3 # see vcf_read.py in scikit_allel


@contextmanager
def open_vcf(path: PathType) -> VCF:
def open_vcf(path: PathType) -> Generator[VCF, None, None]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to use Generator, see above

yield tempdir
finally:
# Remove the temporary directory on exiting the context manager
fs.rm(tempdir, recursive=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should allow to control whether temp files are deleted via a flag (default = True)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That could be useful. I've opened #10 for it.

suffix: Optional[str] = None,
prefix: Optional[str] = None,
dir: Optional[PathType] = None,
storage_options: Optional[Dict[str, str]] = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these storage options tested anywhere?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


# Construct a random directory name
tempdir = build_url(dir, prefix + str(uuid.uuid4()) + suffix)
fs.mkdir(tempdir)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this fs.mkdir call should be inside the try/finally?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tomwhite tomwhite merged commit 8c8faaf into sgkit-dev:master Sep 14, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow use of cloud stores for intermediate zarr files
4 participants