Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(PXP:9493):(feat)Non-dbgap whitelist in usersync #1020

Merged
merged 23 commits into from
Jul 14, 2022
Merged

Conversation

BinamB
Copy link
Contributor

@BinamB BinamB commented Apr 14, 2022

New Features

  • Add non-dbgap whitelist support in usersync.

@github-actions
Copy link

The style in this PR agrees with black. ✔️

This formatting comment was generated automatically by a script in uc-cdis/wool.

@BinamB BinamB changed the title Feat/custom data access (PXP:9493):(feat)Non-dbgap whitelist in usersync Apr 14, 2022
@coveralls
Copy link

coveralls commented Apr 14, 2022

Pull Request Test Coverage Report for Build 12645

  • 27 of 30 (90.0%) changed or added relevant lines in 1 file are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage increased (+0.05%) to 73.764%

Changes Missing Coverage Covered Lines Changed/Added Lines %
fence/sync/sync_users.py 27 30 90.0%
Files with Coverage Reduction New Missed Lines %
fence/sync/sync_users.py 1 74.66%
Totals Coverage Status
Change from base Build 12642: 0.05%
Covered Lines: 6818
Relevant Lines: 9243

💛 - Coveralls

@@ -904,6 +904,8 @@ SERVICE_ACCOUNT_LIMIT: 6
GLOBAL_PARSE_VISAS_ON_LOGIN:
# Settings for usersync with visas
USERSYNC:
# allowed list of custom project ids that can be processed by usersync
allowed_custom_project_ids: ["CUSTOM-*"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this necessary? I don't see where this is used

@@ -509,7 +523,7 @@ def _parse_csv(self, file_dict, sess, dbgap_config={}, encrypted=True):
continue

phsid_privileges = {}
phsid = row.get("phsid", "").split(".")
phsid = row.get("phsid", row.get("project_id", "")).split(".")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why the change here? does that row exist in dbgap files? If so this is potentially a change in behavior

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh I see the new file might have a different row name. This should be configuration

@@ -341,6 +341,8 @@ dbGaP:
proxy_user: ''
protocol: 'sftp'
decrypt_key: ''
allow_non_dbGaP_whitelist: false
allowed_id_patterns: ['PROJECT-12345']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is somewhat confusingly named, since this is for file matching. I think this should be a full regex to match the filename and default it to ['authentication_file_phs(\d{6})', 'authentication_file_PROJECT-*']' or something. There's too much custom logic in the code for someone to know by looking at the configuration how to set this up or what it means. Probably deserves a comment about it as well

@@ -368,6 +370,8 @@ dbGaP:
'phs000178': ['/orgA/', '/orgB/', '/']
# study when parsing consent codes
'phs000178.c2': ['/orgA/', '/orgB/', '/']
# non dbgap study:
'PROJECT-12345': ['']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this is optional?

pattern += ".enc"
pattern += "$"
if re.match(pattern, os.path.basename(filepath)):
return True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this intended to return early on the first matched pattern? Or should this only return True if ALL patterns match that are passed in?

@@ -527,7 +543,11 @@ def _parse_csv(self, file_dict, sess, dbgap_config={}, encrypted=True):
continue

phsid_privileges = {}
phsid = row.get("phsid", "").split(".")
if self.non_dbGaP_whitelist:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should you check allow_non_dbgap_whitelist?

)
monkeypatch.setattr(syncer, "non_dbGaP_whitelist", allow_non_dbgap_whitelist)
# merge sftp server containing both dbgap and non-dbgap whitelists
monkeypatch.setattr(syncer, "is_sync_from_dbgap_server", allow_non_dbgap_whitelist)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this used?

@@ -633,7 +711,7 @@ def mock_merge(dbgap_servers, sess):

# this function will be called once for each sftp server
# the test config file has 2 dbgap sftp servers
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# the test config file has 2 dbgap sftp servers
# the test config file has 3 dbgap sftp servers

@@ -662,6 +676,10 @@ EXPIRED_AUTHZ_REMOVAL_JOB_FREQ_IN_SECONDS: 1
GLOBAL_PARSE_VISAS_ON_LOGIN:
# Settings for usersync with visas
USERSYNC:
non_dbgap_whitelist_enabled: false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why have both this and the per-sftp config allow_non_dbGaP_whitelist?

@@ -662,6 +676,9 @@ EXPIRED_AUTHZ_REMOVAL_JOB_FREQ_IN_SECONDS: 1
GLOBAL_PARSE_VISAS_ON_LOGIN:
# Settings for usersync with visas
USERSYNC:
sync_from_visas: false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these were removed since our merge of feat/passport, can you double check we actually need this and the one below?

@pytest.mark.parametrize("syncer", ["google", "cleversafe"], indirect=True)
@pytest.mark.parametrize("parse_consent_code_config", [False, True])
@pytest.mark.parametrize("parse_consent_code_config", [True, False])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this just to be consistently passing true then false or was there some real issue when it was ordered the other way?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh woops. I was messing around with it and reversed the order. I'll change it back.

pattern += "$"
pattern = pattern.encode().decode(
"unicode_escape"
) # when converting from yaml, python reads it as Python string literal. So "\" turns into "\\" which messes with the regex match
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you move this above the line, I generally try to avoid inline comments unless absolutely necessary

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, why is there YAML at all now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh should I keep the comment and instead of "yaml", call it fence-config or should I just completely get rid of the comment?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the comment is helpful, let's clarify that it's YAML from the fence-config

@@ -527,9 +542,13 @@ def _parse_csv(self, file_dict, sess, dbgap_config={}, encrypted=True):
continue

phsid_privileges = {}
phsid = row.get("phsid", "").split(".")
if dbgap_config.get("allow_non_dbGaP_whitelist", False):
phsid = row.get("phsid", row.get("project_id", "")).split(".")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be configurable? Or I guess we can just require that people either use phsid or project_id, but this feels like something that could be a cfg

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the way usersync is written right now makes ait a little difficult for us to have a per sftp server for picking phsid and project_id. We could include it in the next design for usersync that i want to do for fix-it-Friday though?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah that's fine

def _merge_multiple_local_csv_files(
self, dbgap_file_list, encrypted, dbgap_configs, session
):

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe a quick docstring here?

Avantol13
Avantol13 previously approved these changes Jul 12, 2022
@BinamB BinamB merged commit c720f21 into master Jul 14, 2022
@BinamB BinamB deleted the feat/custom-data-access branch July 14, 2022 18:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants