
Add classes for fetching a BIDS-compliant study on S3 #290

Merged 34 commits into yeatmanlab:master on Aug 10, 2020

Conversation

@richford (Collaborator) commented Jul 2, 2020

This PR adds classes to AFQ.data that can query and download subjects from a BIDS-compliant study hosted on Amazon S3. It is related to #255 but is much narrower in scope and does not resolve that issue. The main use case is to create a subset BIDS study for use on distributed computing elements.

Here's some example usage of the HBN class, which inherits from S3BIDSStudy:

import AFQ.data as afqd
hbn = afqd.HBN(subjects=3, random_seed=42)

# List all subjects in the HBN study
print(hbn._all_subjects)

# List the three random subjects that we asked for. This is a list of `S3BIDSSubject` objects
print(hbn.subjects)

# Take a look at the first one
s0 = hbn.subjects[0]

# Print the s3 keys for this subject
print(s0.s3_keys)

# Download this subject
s0.download("./hbn-test", include_site=True, include_derivs=False)

# Print a dict which maps the s3 keys to local downloaded files
print(s0.files)

# Download the rest of the study all at once (using multithreading and skipping previously downloaded files for s0, unless `overwrite=True`)
hbn.download("./hbn-test", include_site=True, include_derivs=False)

# Maybe we want all of the mindboggle derivatives too, but not the freesurfer derivatives
# Unless `overwrite=True`, this will skip all the previously downloaded files
hbn.download("./hbn-test", include_site=True, include_derivs=["mindboggle"])

# Okay, maybe we want all the derivatives, after all
hbn.download("./hbn-test", include_site=True, include_derivs=True)

# Hmm, I'm concerned that we're missing some s3 keys that live at the higher levels of the directory structure and should be inherited; let's look at them
print(hbn.non_subject_s3_keys)

Remaining questions or todo items:

  • It'd be nice to give the user some way to easily download those "non-subject" s3 keys.
  • More sophisticated treatment of sites. Currently, if multisite=True in the S3BIDSStudy parameters, we assume that all of the top-level directories are Site IDs. But maybe there is other top-level data in there. It'd be nice to be able to differentiate between "Site-RU" and a top-level "participants.tsv," for example (a rough sketch of one approach is below).
  • Testing with moto
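
For the multisite point above, here is a minimal sketch of how site directories could be told apart from other top-level keys. The key names and the Site-* naming convention are illustrative assumptions, not the actual bucket layout:

import re

# Hypothetical top-level keys; the real HBN bucket contents may differ.
top_level = ["Site-RU/", "Site-CBIC/", "participants.tsv", "dataset_description.json"]

site_pattern = re.compile(r"^Site-[^/]+/?$")
sites = [k.rstrip("/") for k in top_level if site_pattern.match(k)]
other_keys = [k for k in top_level if not site_pattern.match(k)]

print(sites)       # ['Site-RU', 'Site-CBIC']
print(other_keys)  # ['participants.tsv', 'dataset_description.json']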

@pep8speaks commented Jul 2, 2020

Hello @richford! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 651:17: E128 continuation line under-indented for visual indent
Line 652:17: E128 continuation line under-indented for visual indent
Line 935:80: E501 line too long (81 > 79 characters)
Line 1058:80: E501 line too long (86 > 79 characters)
Line 1074:80: E501 line too long (125 > 79 characters)
Line 1100:80: E501 line too long (83 > 79 characters)
Line 1110:80: E501 line too long (81 > 79 characters)
Line 1114:80: E501 line too long (84 > 79 characters)
Line 1204:45: W291 trailing whitespace
Line 1287:80: E501 line too long (125 > 79 characters)

Line 65:80: E501 line too long (88 > 79 characters)
Line 116:31: E225 missing whitespace around operator
Line 126:31: E225 missing whitespace around operator
Line 136:31: E225 missing whitespace around operator
Line 150:1: W293 blank line contains whitespace

Comment last updated at 2020-08-09 19:59:14 UTC

@arokem (Collaborator) commented Jul 2, 2020

Looks great!

Regarding this

It'd be nice to give the user some way to easily download those "non-subject" s3 keys

Once you have these keys in hand, it's a call to fs.get away, right? But the user would have to sort through them, to see which ones they want to download?
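
(For concreteness, that call might look something like the following; the filesystem options and the key are illustrative, not taken from the PR.)

import s3fs

# Anonymous access works for public buckets; otherwise use credentials.
fs = s3fs.S3FileSystem(anon=True)

# Pick whichever non-subject key looks useful and fetch it; the key below
# is made up for illustration.
fs.get("some-bucket/participants.tsv", "./hbn-test/participants.tsv")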

@richford (Collaborator, Author) commented Jul 2, 2020

Agreed. So we probably don't need to include it in AFQ.data. They already have the list of s3 keys. So maybe I'll check that off the todo list and assume that we put the rest of that responsibility on the user.

@richford (Collaborator, Author) commented Jul 17, 2020

Still a WIP. I did some significant restructuring after realizing that HBN is not BIDS compliant. We now have an S3BIDSStudy class and a separate HBNSite class. I think the remaining things to do are:

  • Test with moto; this will require adding moto to the [dev] requirements in extras_require.
  • Rewrite the download method for the HBN site and subject classes so that it BIDS-ifies the output on the local disk (a rough sketch of the idea appears after this list)
    • For raw data
    • For derivative data
  • Add a convenience get_non_subject_files() method to download all of the non-subject files.
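
(To make the BIDS-ification item concrete, here is a rough sketch of the kind of key-to-path remapping the rewritten download method could do. The site prefix and key layout are hypothetical, not the actual HBN structure.)

import os.path as op

def bidsify_local_path(s3_key, local_root, site_prefix="Site-RU"):
    # Drop the (hypothetical) site prefix so that sub-*/... lands directly
    # under the local BIDS root.
    rel = s3_key.split(site_prefix + "/", 1)[-1]
    return op.join(local_root, rel)

print(bidsify_local_path("Site-RU/sub-0001/anat/sub-0001_T1w.nii.gz", "./hbn-test"))
# ./hbn-test/sub-0001/anat/sub-0001_T1w.nii.gz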

We could also consider creating an HBN object that contains all of the HBNSite objects. But why? As tidy as it would seem, I can't think of an authentic use case.

@richford (Collaborator, Author) commented:

@arokem Is the documentation build failing because the PR is coming from my fork and not from yeatmanlab?

@arokem (Collaborator) commented Jul 17, 2020

That looks like the return of numpy/numpydoc#268. Is this branch rebased on current master?

@richford (Collaborator, Author) commented:

Nope. Just rebased. Thanks for the tip on the 🧟 bug!

@richford (Collaborator, Author) commented:

Hmm, no such luck. Anyway, low priority since this is still a WIP. We can come back to it once I implement the other things.

@richford (Collaborator, Author) commented:

I tried updating requirements-dev.txt to require numpydoc==1.1.0 but that did not fix my local doc build. I doubt it will work on GitHub but we'll see.

@36000 mentioned this pull request Aug 4, 2020
@richford (Collaborator, Author) commented Aug 4, 2020

Thanks @36000 for fixing the docbuild issue!

@richford changed the title from "WIP: Add classes for fetching a BIDS-compliant study on S3" to "Add classes for fetching a BIDS-compliant study on S3" on Aug 4, 2020
@richford (Collaborator, Author) commented Aug 4, 2020

Added tests using moto. They pass locally. Let's see if they pass here. Then I think this is ready for review.
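
As a rough illustration of what a moto-backed test can look like (the bucket name, key, and fixture here are invented and are not necessarily what the PR's tests do):

import boto3
import pytest
from moto import mock_s3

@pytest.fixture
def mock_study_bucket():
    # Everything inside the mock_s3 context talks to an in-memory S3.
    with mock_s3():
        s3 = boto3.client("s3", region_name="us-east-1")
        s3.create_bucket(Bucket="mock-study")
        s3.put_object(Bucket="mock-study",
                      Key="sub-01/anat/sub-01_T1w.nii.gz",
                      Body=b"placeholder bytes")
        yield "mock-study"

def test_keys_are_listed(mock_study_bucket):
    s3 = boto3.client("s3", region_name="us-east-1")
    contents = s3.list_objects_v2(Bucket=mock_study_bucket)["Contents"]
    assert contents[0]["Key"].startswith("sub-01/")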

@arokem (Collaborator) left a comment

Overall looks good! I had some small comments/suggestions.

You might also want to rebase this, as there has been some restructuring of the library in the recently merged #322

AFQ/data.py (outdated):
study : AFQ.data.S3BIDSStudy
The S3BIDSStudy for which this subject was a participant

site : str, optional

@arokem (Collaborator):

For generality, might want to point out that this idiosyncrasy is designed to support the use of HBN.

@richford (Collaborator, Author):

I don't think this particular part is an HBN idiosyncrasy. Rather, I'm just trying to record some notion of the parent study object. On the other hand, the parameter _subject_class described on line 814 is an idiosyncrasy designed to support HBN, but I think I document that reasonably well there. What do you think?

AFQ/data.py:
pbar_idx=idx,
) for idx, sub in enumerate(self.subjects)]

compute(*results, scheduler='threads')

@arokem (Collaborator):

Wouldn't you rather use the dask progressbar construct here as well? Instead of the pbar input to sub.download?

@richford (Collaborator, Author):

I had trouble getting the nesting to work with the Dask progress bar since we download multiple subjects all at once. So I went with this instead. I admit it's weird to mix the dask progress bar with the tqdm progress bar but I didn't want to spend too much time on that part. But by all means, commits to this PR are welcome.
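
(For context, the two patterns being weighed look roughly like this; download_one and the subject list are stand-ins for the actual code in AFQ/data.py.)

from dask import delayed, compute
from dask.diagnostics import ProgressBar
from tqdm import tqdm

def download_one(sub, pbar_idx):
    # Stand-in for S3BIDSSubject.download: each subject drives its own
    # tqdm bar, kept on its own line via the position argument.
    for _ in tqdm(range(3), desc=sub, position=pbar_idx, leave=False):
        pass

subjects = ["sub-01", "sub-02", "sub-03"]
tasks = [delayed(download_one)(sub, idx) for idx, sub in enumerate(subjects)]

# The alternative: a single dask progress bar wrapped around the whole compute call.
with ProgressBar():
    compute(*tasks, scheduler="threads")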



@pytest.fixture
def temp_data_dir():

@richford (Collaborator, Author):

Oh, cool. I didn't know about nibabel.tmpdirs. The main difference I see is that nbtmp.InTemporaryDirectory() also changes the current working directory. Also, using it as a pytest.fixture with a yield statement will make sure that we delete the directory even if the test fails. I'm not sure what happens to the temporary directory if the test you linked in test_AFQ_data fails in the middle, but maybe it also deletes the directory.
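
For reference, a minimal sketch of the yield-based fixture pattern described here; the body is an assumption, not the PR's exact implementation:

import shutil
import tempfile

import pytest

@pytest.fixture
def temp_data_dir():
    # The finally block runs even if the test body raises, so the
    # directory is always cleaned up.
    tmp = tempfile.mkdtemp()
    try:
        yield tmp
    finally:
        shutil.rmtree(tmp, ignore_errors=True)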

@36000 (Collaborator) commented Aug 5, 2020

There could also be an upload function similar to the download function, which uploads derivatives generated by AFQ back into the same folder. So the cloudknot workflow would look like:

import AFQ.data as afqd
from AFQ import api

study = afqd.S3BIDSStudy()  # arguments omitted for brevity
study.download()

myafq = api.AFQ(virtual_frame_buffer=True)
myafq.export_all()

study.upload()  # proposed method; does not exist yet

Thereby pyAFQ would achieve its peak API potential

@richford (Collaborator, Author) commented Aug 8, 2020

Thanks @arokem and @36000 for the review! I rebased and responded to most of your suggestions, resolving the comments that I believe I responded to and leaving open the ones that we might want to keep talking about.

@richford (Collaborator, Author) commented Aug 8, 2020

@36000, regarding the upload function. This would presuppose that the s3 bucket where we got the input data is the same one where we want to put the results. But for public datasets, this might not be true. Two options: (1) we could assume that it will work and then throw an error if it doesn't (or rather, let s3fs throw the permissions error). (2) we could put the upload functionality inside of the myafq object in your example. In either case, I think we should delay the upload functionality to another issue/PR so that this one can be concerned only with download.
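
(If we went with option (1), a bare-bones sketch of the upload might look like this; the local path and destination bucket are invented, and s3fs would surface the permissions error if the bucket isn't writable.)

import s3fs

fs = s3fs.S3FileSystem()  # uses the caller's AWS credentials

# Hypothetical local derivatives directory and destination bucket/prefix.
fs.put("./hbn-test/derivatives/afq",
       "my-output-bucket/derivatives/afq",
       recursive=True)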

@arokem (Collaborator) commented Aug 8, 2020 via email

@arokem (Collaborator) commented Aug 9, 2020

Well. This is annoying. Not as elegant, perhaps, but you can still add a package_data input to the setup.py call to setup: https://github.com/yeatmanlab/pyAFQ/blob/master/setup.py#L25
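
(For reference, the package_data route would look roughly like this in setup.py; the glob pattern is a guess at which data files need to ship.)

from setuptools import setup, find_packages

setup(
    name="pyAFQ",
    packages=find_packages(),
    # Ship non-Python data files that live inside the AFQ package;
    # the pattern below is illustrative.
    package_data={"AFQ": ["data/*"]},
)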

@richford (Collaborator, Author) commented Aug 9, 2020

Well. This is annoying. Not as elegant, perhaps, but you can still add a package_data input to the setup.py call to setup: https://github.com/yeatmanlab/pyAFQ/blob/master/setup.py#L25

I went with the MANIFEST.in solution, instead.

@richford (Collaborator, Author) commented Aug 9, 2020

Yay! Tests finally pass. MANIFEST.in FTW.

@arokem (Collaborator) commented Aug 10, 2020

Great! I believe this is ready to go? Maybe we can merge this and follow up with issues/PRs on things that came up but were left unresolved?

@richford (Collaborator, Author) commented:

Agreed. I think it's ready to go. I believe the only outstanding issue is the upload functionality that @36000 mentioned. However, I may have missed something else in the back-and-forth.

@arokem mentioned this pull request Aug 10, 2020
@arokem merged commit 741beaf into yeatmanlab:master on Aug 10, 2020