
Add classes for fetching a BIDS-compliant study on S3 #290

Merged 34 commits into yeatmanlab:master on Aug 10, 2020

Conversation

@richford (Collaborator) commented Jul 2, 2020

This PR adds classes to AFQ.data that can query and download subjects from a BIDS-compliant study hosted on Amazon S3. It is related to #255 but is much narrower in scope and does not resolve that issue. The main use case is to create a subset BIDS study for use on distributed computing elements.

Here's some example usage of the HBN class, which inherits from S3BIDSStudy:

import AFQ.data as afqd
hbn = afqd.HBN(subjects=3, random_seed=42)

# List all subjects in the HBN study
print(hbn._all_subjects)

# List the three random subjects that we asked for. This is a list of `S3BIDSSubject` objects
print(hbn.subjects)

# Take a look at the first one
s0 = hbn.subjects[0]

# Print the s3 keys for this subject
print(s0.s3_keys)

# Download this subject
s0.download("./hbn-test", include_site=True, include_derivs=False)

# Print a dict which maps the s3 keys to local downloaded files
print(s0.files)

# Download the rest of the study all at once (using multithreading and skipping previously downloaded files for s0, unless `overwrite=True`)
hbn.download("./hbn-test", include_site=True, include_derivs=False)

# Maybe we want all of the mindboggle derivatives too, but not the freesurfer derivatives
# Unless `overwrite=True`, this will skip all the previously downloaded files
hbn.download("./hbn-test", include_site=True, include_derivs=["mindboggle"])

# Okay, maybe we want all the derivatives, after all
hbn.download("./hbn-test", include_site=True, include_derivs=True)

# Hmm, I'm concerned that we're missing some s3 keys that live at the higher levels of the directory structure and should be inherited; let's look at them
print(hbn.non_subject_s3_keys)

Remaining questions or todo items:

  • It'd be nice to give the user some way to easily download those "non-subject" s3 keys.
  • More sophisticated treatment of sites. Currently, if multisite=True in the S3BIDSStudy parameters, we assume that all of the top-level directories are Site IDs. But maybe there is other top-level data in there. It'd be nice to be able to differentiate between "Site-RU" and a top-level "participants.tsv," for example (a rough sketch of one approach is below).
  • Testing with moto
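
For the multisite point above, here is a minimal sketch of how site directories could be told apart from other top-level keys. The key names and the Site-* naming convention are illustrative assumptions, not the actual bucket layout:

import re

# Hypothetical top-level keys; the real HBN bucket contents may differ.
top_level = ["Site-RU/", "Site-CBIC/", "participants.tsv", "dataset_description.json"]

site_pattern = re.compile(r"^Site-[^/]+/?$")
sites = [k.rstrip("/") for k in top_level if site_pattern.match(k)]
other_keys = [k for k in top_level if not site_pattern.match(k)]

print(sites)       # ['Site-RU', 'Site-CBIC']
print(other_keys)  # ['participants.tsv', 'dataset_description.json']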

@pep8speaks commented Jul 2, 2020

Hello @richford! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 651:17: E128 continuation line under-indented for visual indent
Line 652:17: E128 continuation line under-indented for visual indent
Line 935:80: E501 line too long (81 > 79 characters)
Line 1058:80: E501 line too long (86 > 79 characters)
Line 1074:80: E501 line too long (125 > 79 characters)
Line 1100:80: E501 line too long (83 > 79 characters)
Line 1110:80: E501 line too long (81 > 79 characters)
Line 1114:80: E501 line too long (84 > 79 characters)
Line 1204:45: W291 trailing whitespace
Line 1287:80: E501 line too long (125 > 79 characters)

Line 65:80: E501 line too long (88 > 79 characters)
Line 116:31: E225 missing whitespace around operator
Line 126:31: E225 missing whitespace around operator
Line 136:31: E225 missing whitespace around operator
Line 150:1: W293 blank line contains whitespace

Comment last updated at 2020-08-09 19:59:14 UTC

@arokem (Collaborator) commented Jul 2, 2020

Looks great!

Regarding this

It'd be nice to give the user some way to easily download those "non-subject" s3 keys

Once you have these keys in hand, it's a call to fs.get away, right? But the user would have to sort through them, to see which ones they want to download?
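
(For concreteness, that call might look something like the following; the filesystem options and the key are illustrative, not taken from the PR.)

import s3fs

# Anonymous access works for public buckets; otherwise use credentials.
fs = s3fs.S3FileSystem(anon=True)

# Pick whichever non-subject key looks useful and fetch it; the key below
# is made up for illustration.
fs.get("some-bucket/participants.tsv", "./hbn-test/participants.tsv")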

@richford (Collaborator, Author) commented Jul 2, 2020

Agreed. So we probably don't need to include it in AFQ.data. They already have the list of s3 keys. So maybe I'll check that off the todo list and assume that we put the rest of that responsibility on the user.

@richford (Collaborator, Author) commented Jul 17, 2020

Still a WIP. I did some significant restructuring after realizing that HBN is not BIDS compliant. We now have an S3BIDSStudy class and a separate HBNSite class. I think the remaining things to do are:

  • Test with moto; this will require adding moto to the [dev] requirements in extras_require.
  • Rewrite the download method for the HBN site and subject classes so that it BIDS-ifies the output on the local disk (a rough sketch of the idea appears after this list)
    • For raw data
    • For derivative data
  • Add a convenience get_non_subject_files() method to download all of the non-subject files.
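
(To make the BIDS-ification item concrete, here is a rough sketch of the kind of key-to-path remapping the rewritten download method could do. The site prefix and key layout are hypothetical, not the actual HBN structure.)

import os.path as op

def bidsify_local_path(s3_key, local_root, site_prefix="Site-RU"):
    # Drop the (hypothetical) site prefix so that sub-*/... lands directly
    # under the local BIDS root.
    rel = s3_key.split(site_prefix + "/", 1)[-1]
    return op.join(local_root, rel)

print(bidsify_local_path("Site-RU/sub-0001/anat/sub-0001_T1w.nii.gz", "./hbn-test"))
# ./hbn-test/sub-0001/anat/sub-0001_T1w.nii.gz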

We could also consider creating an HBN object that contains all of the HBNSite objects. But why? As tidy as it would seem, I can't think of an authentic use case.

@richford (Collaborator, Author) commented:

@arokem Is the documentation build failing because the PR is coming from my fork and not from yeatmanlab?

@arokem (Collaborator) commented Jul 17, 2020

That looks like the return of numpy/numpydoc#268. Is this branch rebased on current master?

@richford (Collaborator, Author) commented:

Nope. Just rebased. Thanks for the tip on the 🧟 bug!

@richford (Collaborator, Author) commented:

Hmm, no such luck. Anyway, low priority since this is still a WIP. We can come back to it once I implement the other things.

@richford (Collaborator, Author) commented:

I tried updating requirements-dev.txt to require numpydoc==1.1.0 but that did not fix my local doc build. I doubt it will work on GitHub but we'll see.

@36000 mentioned this pull request Aug 4, 2020
@richford (Collaborator, Author) commented Aug 4, 2020

Thanks @36000 for fixing the docbuild issue!

@richford changed the title from "WIP: Add classes for fetching a BIDS-compliant study on S3" to "Add classes for fetching a BIDS-compliant study on S3" on Aug 4, 2020
@richford (Collaborator, Author) commented Aug 4, 2020

Added tests using moto. They pass locally. Let's see if they pass here. Then I think this is ready for review.
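
As a rough illustration of what a moto-backed test can look like (the bucket name, key, and fixture here are invented and are not necessarily what the PR's tests do):

import boto3
import pytest
from moto import mock_s3

@pytest.fixture
def mock_study_bucket():
    # Everything inside the mock_s3 context talks to an in-memory S3.
    with mock_s3():
        s3 = boto3.client("s3", region_name="us-east-1")
        s3.create_bucket(Bucket="mock-study")
        s3.put_object(Bucket="mock-study",
                      Key="sub-01/anat/sub-01_T1w.nii.gz",
                      Body=b"placeholder bytes")
        yield "mock-study"

def test_keys_are_listed(mock_study_bucket):
    s3 = boto3.client("s3", region_name="us-east-1")
    contents = s3.list_objects_v2(Bucket=mock_study_bucket)["Contents"]
    assert contents[0]["Key"].startswith("sub-01/")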

@arokem (Collaborator) left a comment

Overall looks good! I had some small comments/suggestions.

You might also want to rebase this, as there has been some restructuring of the library in the recently merged #322

AFQ/data.py (outdated):
study : AFQ.data.S3BIDSStudy
The S3BIDSStudy for which this subject was a participant

site : str, optional

@arokem (Collaborator):

For generality, might want to point out that this idiosyncrasy is designed to support the use of HBN.

@richford (Collaborator, Author):

I don't think this particular part is an HBN idiosyncrasy. Rather, I'm just trying to record some notion of the parent study object. On the other hand, the parameter _subject_class described on line 814 is an idiosyncrasy designed to support HBN, but I think I document that reasonably well there. What do you think?

AFQ/data.py:
pbar_idx=idx,
) for idx, sub in enumerate(self.subjects)]

compute(*results, scheduler='threads')

@arokem (Collaborator):

Wouldn't you rather use the dask progressbar construct here as well? Instead of the pbar input to sub.download?

@richford (Collaborator, Author):

I had trouble getting the nesting to work with the Dask progress bar since we download multiple subjects all at once. So I went with this instead. I admit it's weird to mix the dask progress bar with the tqdm progress bar but I didn't want to spend too much time on that part. But by all means, commits to this PR are welcome.
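
(For context, the two patterns being weighed look roughly like this; download_one and the subject list are stand-ins for the actual code in AFQ/data.py.)

from dask import delayed, compute
from dask.diagnostics import ProgressBar
from tqdm import tqdm

def download_one(sub, pbar_idx):
    # Stand-in for S3BIDSSubject.download: each subject drives its own
    # tqdm bar, kept on its own line via the position argument.
    for _ in tqdm(range(3), desc=sub, position=pbar_idx, leave=False):
        pass

subjects = ["sub-01", "sub-02", "sub-03"]
tasks = [delayed(download_one)(sub, idx) for idx, sub in enumerate(subjects)]

# The alternative: a single dask progress bar wrapped around the whole compute call.
with ProgressBar():
    compute(*tasks, scheduler="threads")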



@pytest.fixture
def temp_data_dir():

@richford (Collaborator, Author):

Oh, cool. I didn't know about nibabel.tmpdirs. The main difference I see is that nbtmp.InTemporaryDirectory() also changes the current working directory. Also, using it as a pytest.fixture with a yield statement will make sure that we delete the directory even if the test fails. I'm not sure what happens to the temporary directory if the test you linked in test_AFQ_data fails in the middle, but maybe it also deletes the directory.
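
For reference, a minimal sketch of the yield-based fixture pattern described here; the body is an assumption, not the PR's exact implementation:

import shutil
import tempfile

import pytest

@pytest.fixture
def temp_data_dir():
    # The finally block runs even if the test body raises, so the
    # directory is always cleaned up.
    tmp = tempfile.mkdtemp()
    try:
        yield tmp
    finally:
        shutil.rmtree(tmp, ignore_errors=True)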

@36000 (Collaborator) commented Aug 5, 2020

There could also be an upload function similar to the download function, which uploads derivatives generated by AFQ back into the same folder. So the cloudknot workflow would look like:

import AFQ.data as afqd
from AFQ import api

study = afqd.S3BIDSStudy()  # arguments omitted for brevity
study.download()

myafq = api.AFQ(virtual_frame_buffer=True)
myafq.export_all()

study.upload()  # proposed method; does not exist yet

Thereby pyAFQ would achieve its peak API potential

@richford (Collaborator, Author) commented Aug 8, 2020

Thanks @arokem and @36000 for the review! I rebased and responded to most of your suggestions, resolving the comments that I believe I responded to and leaving open the ones that we might want to keep talking about.

@richford (Collaborator, Author) commented Aug 8, 2020

@36000, regarding the upload function. This would presuppose that the s3 bucket where we got the input data is the same one where we want to put the results. But for public datasets, this might not be true. Two options: (1) we could assume that it will work and then throw an error if it doesn't (or rather, let s3fs throw the permissions error). (2) we could put the upload functionality inside of the myafq object in your example. In either case, I think we should delay the upload functionality to another issue/PR so that this one can be concerned only with download.
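
(If we went with option (1), a bare-bones sketch of the upload might look like this; the local path and destination bucket are invented, and s3fs would surface the permissions error if the bucket isn't writable.)

import s3fs

fs = s3fs.S3FileSystem()  # uses the caller's AWS credentials

# Hypothetical local derivatives directory and destination bucket/prefix.
fs.put("./hbn-test/derivatives/afq",
       "my-output-bucket/derivatives/afq",
       recursive=True)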

@arokem (Collaborator) commented Aug 8, 2020 via email

@arokem (Collaborator) commented Aug 9, 2020

Well. This is annoying. Not as elegant, perhaps, but you can still add a package_data input to the setup.py call to setup: https://github.com/yeatmanlab/pyAFQ/blob/master/setup.py#L25
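
(For reference, the package_data route would look roughly like this in setup.py; the glob pattern is a guess at which data files need to ship.)

from setuptools import setup, find_packages

setup(
    name="pyAFQ",
    packages=find_packages(),
    # Ship non-Python data files that live inside the AFQ package;
    # the pattern below is illustrative.
    package_data={"AFQ": ["data/*"]},
)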

@richford (Collaborator, Author) commented Aug 9, 2020

Well. This is annoying. Not as elegant, perhaps, but you can still add a package_data input to the setup.py call to setup: https://github.com/yeatmanlab/pyAFQ/blob/master/setup.py#L25

I went with the MANIFEST.in solution, instead.

@richford (Collaborator, Author) commented Aug 9, 2020

Yay! Tests finally pass. MANIFEST.in FTW.

@arokem (Collaborator) commented Aug 10, 2020

Great! I believe this is ready to go? Maybe we can merge this and follow up with issues/PRs on things that came up but were left unresolved?

@richford (Collaborator, Author) commented:

Agreed. I think it's ready to go. I believe the only outstanding issue is the upload functionality that @36000 mentioned. However, I may have missed something else in the back-and-forth.

@arokem mentioned this pull request Aug 10, 2020
@arokem merged commit 741beaf into yeatmanlab:master on Aug 10, 2020