Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Batch export sower job #29

Merged
merged 18 commits into from
Jan 6, 2022
Merged

Batch export sower job #29

merged 18 commits into from
Jan 6, 2022

Conversation

mcannalte
Copy link
Contributor

@mcannalte mcannalte commented Aug 12, 2021

New Features

  • Adds a sower job which accepts an input of
    { "study_ids": [ "acb-xyz" ] }
    and constructs a manifest of the files in these studies, downloads and compresses them, and uploads them to a user-specific location in s3 (which is overwritten each time a user initiates a download)

@mcannalte mcannalte changed the title Feat/zip Batch export sower job Aug 12, 2021
@mcannalte mcannalte marked this pull request as ready for review August 12, 2021 16:08
describe_access_to_files_in_workspace_manifest(
hostname, auth, MANIFEST_FILENAME
)
download_files_in_workspace_manifest(
Copy link
Contributor

@mfshao mfshao Aug 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI just making a note here (since the DRS download python SDK branch doesn't have a PR now): I think the way that SDK implement this currently is to gather all the access token from all related commons first, and then iterate through the manifest to download files.

There is a possiblilty that if the manifest is large or for whatever reasons some downloads took longer than 20 mins, then some access token will expire, resulting in failure in subsequent downloads.

Probably not a big deal for now becaus we have a relatively small file size limits

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes the pre-caching of WTS is a known issue and needs to be addressed. I will highlight this in the related feature doc.

@mcannalte
Copy link
Contributor Author

Note: this job replaces the (previously assumed) need for a separate file downloads API for the sake of HEAL. However, Sower's limitation on input data requires that input data fit in an environment variable, hence we send a list of study IDs rather than an entire manifest. If a manifest, DRS bundle, etc is preferable for another use case, this job is easily extensible to use a user-specific s3 file as a job input instead. Batch-export cloud automation is setup with a bucket for this case.

for study in study_metadata:
if not study["__manifest"]:
print(
f"Study {study['project_number']} is missing __manifest entry. Skipping."

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the key configurable? Might it be project_number for some entries, but study_id for others?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth removing this key altogether, since it's only ever needed for this informational log

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it might be safer to not assume the content/structure of the data since the MDS really allows for any shape. I think you're right that project_number is most common. Maybe best to just set up that log line so it wouldn't error/crash if that field was missing

# print(f"Retrieved metadata: {study_metadata}")
manifest = []
for study in study_metadata:
if not study["__manifest"]:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we make it configurable which field name we use? Is it __manifest for some and __files for others? Not saying it's good if it's configurable, but just checking. I wonder if we need to query the gitops info to see what the field name is

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For HEAL, I think it's always been __manifest : https://github.com/uc-cdis/cdis-manifest/blob/369b80452adb4a91fdef7fd1b6f2e283c12168e6/healdata.org/portal/gitops.json#L126
At least that's how we've been building manifests from the UI

But I can change this to be passed as an env var from sower, not sure if there's a way to avoid that being a duplicate configuration

Copy link
Contributor

@mfshao mfshao Aug 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use put some bash script in your cloud-auto PR to extract that from portal config and put it into a k8s configmap or secrets and mount them

Edit: oops didn't see you just merged that cloud-auto PR

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I search across our repos I only see us ever setting manifestFieldName to __manifest. I think the best thing to do would be to not have that field configurable, keep it hard-coded here, and take it out of configurations. I think making that field configurable just adds more headache especially if it's consistent everywhere

Copy link
Contributor

@mfshao mfshao Aug 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I remember I have mention to have a set of "pre-defined" field name rules for MDS data that serves as a convention enforced over all MDS

This way we can reduce this kind of configuration complexity by a lot

aws_secret_access_key=aws_secret_access_key,
)

export_key = f"{username}-export.zip"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a risk/possibility that an IDP may allow a user to be authenticated, but no username is set? Or could the username have special characters that don't play well with S3 or various file systems?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are both good questions, but I'm not sure what fields an IDP requires or what characters they can take on. Can you think of a more appropriate way to name exports? I'm brainstorming...

Copy link

@williamhaley williamhaley Aug 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A guid might be a nice way to simply make file names unique. That also allows us to not have to worry about storing names/identifying information in S3 or anywhere else

A timestamp + guid is also nice because it allows us to see how old exports are and expire them while enforcing uniqueness

ZakirG
ZakirG previously approved these changes Aug 13, 2021
Copy link
Contributor

@ZakirG ZakirG left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works well, good job ~

williamhaley
williamhaley previously approved these changes Aug 13, 2021
Copy link

@williamhaley williamhaley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple minor notes, but lgtm!

mfshao
mfshao previously approved these changes Aug 13, 2021
@mcannalte mcannalte dismissed stale reviews from mfshao, williamhaley, and ZakirG via 7b631b5 August 16, 2021 19:51
williamhaley
williamhaley previously approved these changes Aug 16, 2021
@mcannalte mcannalte merged commit 8798379 into master Jan 6, 2022
@mcannalte mcannalte deleted the feat/zip branch January 6, 2022 21:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants