Separate plugin file system from BASA_DATA_DIR
#3480
This sounds cool. Broader filesystem question: should Qiita support S3-like object stores?
On Aug 20, 2025, at 05:46, Stefan Janssen ***@***.***> wrote:
When processing data via plugins, input and output files are located in ONE shared filesystem, i.e. BASE_DATA_DIR (and WORKING_DIR for temporary files). This works well as long as the plugin process and Qiita pet & DB operate on one machine, or on machines that share a network file system, as in Slurm grids.
We intend to host Qiita within a Kubernetes cloud environment. A plugin will become an independent Docker image running as one or multiple pods. It would therefore be advisable to separate the central file system (Qiita pet & DB) from the individual plugins, since coupling them would slow the boot-up of plugin pods AND we might later distribute plugin jobs across multiple clouds. In this use case, transferring the whole BASE_DATA_DIR is infeasible.
[Image: text236.png]
I understand the current flow as follows:
1. The user composes a processing / analysis workflow and hits "run". Qiita pet uses a launcher to "submit" a new job for the corresponding plugin.
2. The plugin command requests information about e.g. an artifact, the prep file, ... via qiita_client, which queries the Postgres DB and returns filepaths in BASE_DATA_DIR.
3. The plugin command directly accesses the content of the filepaths from step 2.
My suggested flow is designed to require minimal changes to plugin code: a plugin wraps each filepath it reads or writes in one of two additional functions, fetch_file_from_central or push_file_to_central. Depending on the configured mechanism, these either access the file directly on the shared filesystem (no change from what is done currently) or transfer its content, creating the file in the plugin-local filesystem (fetch) or in the central filesystem (push).
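The wrapping idea can be illustrated with a minimal sketch. It is not the code from jlab/qiita_client#1: the function names come from the description above, but the signatures, the `protocol` argument, and the injected `download` / `upload` callables (stand-ins for the real HTTPS requests) are assumptions made for illustration only.

```python
import os
import tempfile


def fetch_file_from_central(filepath, protocol="filesystem",
                            download=None, local_root=None):
    """Return a path at which the plugin can read `filepath`.

    Hypothetical signature: `download` is an assumed callable that takes the
    central filepath and returns its content as bytes.
    """
    if protocol == "filesystem":
        # Shared filesystem: the central path is directly readable.
        return filepath
    if protocol == "https":
        if download is None:
            raise ValueError("protocol 'https' needs a download callable")
        local_root = local_root or tempfile.mkdtemp(prefix="plugin_")
        # Mirror the central path below the plugin-local root.
        target = os.path.join(local_root, filepath.lstrip(os.sep))
        os.makedirs(os.path.dirname(target), exist_ok=True)
        with open(target, "wb") as handle:
            handle.write(download(filepath))
        return target
    raise ValueError("unknown protocol: %r" % protocol)


def push_file_to_central(filepath, protocol="filesystem", upload=None):
    """Make a plugin-written file available centrally.

    Hypothetical signature: `upload` is an assumed callable that takes the
    filepath and the file content as bytes.
    """
    if protocol == "filesystem":
        # Shared filesystem: the file is already where Qiita expects it.
        return filepath
    if protocol == "https":
        if upload is None:
            raise ValueError("protocol 'https' needs an upload callable")
        with open(filepath, "rb") as handle:
            upload(filepath, handle.read())
        return filepath
    raise ValueError("unknown protocol: %r" % protocol)
```

Because both branches return a usable path, plugin code that calls `open(fetch_file_from_central(path))` would not need to know which mechanism is active.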
This PR adds two endpoints/functions to Qiita pet to send https://github.com/jlab/qiita/blob/be38c1079fe8cb12e120d9b96c48e97b8b3cd062/qiita_pet/handlers/cloud_handlers/file_transfer_handlers.py#L15 and receive https://github.com/jlab/qiita/blob/be38c1079fe8cb12e120d9b96c48e97b8b3cd062/qiita_pet/handlers/cloud_handlers/file_transfer_handlers.py#L61 files to and from plugins.
Both endpoints are deactivated by default (to make Qiita behave as is) and can be activated by setting https://github.com/jlab/qiita/blob/be38c1079fe8cb12e120d9b96c48e97b8b3cd062/qiita_core/support_files/config_test.cfg#L53-L57 to True.
For higher performance, file content will be sent through nginx instead of plain python/tornado. I thus had to fix the /Users/username prefix in the nginx example configuration file to match the actual filepaths.
It's not yet clear to me, IF we need to add a mechanism to check if a plugin has permissions to access the requested file(s) from BASE_DATA_DIR. Currently, we "trust" the plugin via oauth credentials.
This PR is accompanied by jlab/qiita_client#1 to equip the Qiita Client (part of every plugin) to handle the corresponding requests. The client can select between the current central filesystem mechanism
https://github.com/jlab/qiita_client/blob/d9543c9575f0171620c17f4e87897ed5cf52a905/qiita_client/qiita_client.py#L783-L790
OR the novel file transfer through https
https://github.com/jlab/qiita_client/blob/d9543c9575f0171620c17f4e87897ed5cf52a905/qiita_client/qiita_client.py#L792-L811
Both mechanisms return the actual filepath of the requested (and potentially transferred) files. This allows individual plugins to use different mechanisms, i.e. we don't need to migrate all plugins at once.
As only the plugin functions know which files they request / send as artifact components, we need to "decorate" file access in their individual implementations. Here is an example PR for qp-deblur: jlab/qp-deblur#1
Commit Summary
aab4750 extended configuration manager with optional OIDC sections
49b0448 flake8
2840601 also provide a label for a speaking name of the identity provider
f1c9149 start implementing the OIDC dance
2eb6d08 modal not necessary, if only one provider was defined
48ca02a error handling of provider not in config file
dc4bd20 adding pycurl package to enable tornado curl_httpclients
e1f3c13 a new method to create a user, if information do not need to be entered manually, but are obtained from an external identity provider
48f09a5 full OIDC dance implemented
baf40df add an admin page to activate users which requested authorization through OIDC
670a55a flake8
091ffc6 adding menu entry for user authorization
1feefc0 do not expose traditional qiita internal user authentication, if OIDC is configured
29ce7dd use Qiita typical modal for OIDC login
2ca5bb8 wrong menu entrie affected
1b787cb always allow logout
88319b2 improved error handling
02d9af0 Merge branch 'dev' of https://github.com/qiita-spots/qiita into auth_oidc
b1e1b6b revert: let user change their profile, but not password - if provided through OIDC
a7d3b84 speaking button names + move into correct div to always get displayed
125835a use email from config + loop user_info from OIDC to fill DB
5f28092 use OIDC info to prefil user information
19b4d7b drop admin user authorization
33f2879 Merge branch 'dev' of https://github.com/qiita-spots/qiita into auth_oidc
c9d413a using the well-known json dict instead of manually providing multiple API endpoints through the config file
6bfafcb Merge branch 'dev' of https://github.com/qiita-spots/qiita into auth_oidc_wellknown
9a5e7cc flake8
b2fc279 flake8
5cc0896 add ability to display OIDC logos
949084d add OIDC logo
c3b040b revert to dev branch
d96bbae fixing config manager tests
a491870 Merge pull request #7 from jlab/auth_oidc_wellknown
b1baece Merge branch 'dev' of https://github.com/qiita-spots/qiita into auth_oidc
81fdcbf Merge branch 'dev' of https://github.com/qiita-spots/qiita into auth_oidc
e0c4002 add missing template
bb9c685 Merge branch 'add_admin_purge_template' of github.com:jlab/qiita into auth_oidc
79e794a Merge branch 'dev' of https://github.com/qiita-spots/qiita into auth_oidc
c9aacec Merge branch 'master' of github.com:qiita-spots/qiita into auth_oidc_merged
a5deb83 Merge pull request #10 from jlab/auth_oidc_merged
7693c5e extended configuration manager with optional OIDC sections
b4ab605 flake8
baa7230 also provide a label for a speaking name of the identity provider
52e57ca start implementing the OIDC dance
4061373 modal not necessary, if only one provider was defined
51307d1 error handling of provider not in config file
7a0ec9f adding pycurl package to enable tornado curl_httpclients
0c365a1 a new method to create a user, if information do not need to be entered manually, but are obtained from an external identity provider
e993a99 full OIDC dance implemented
ca5f7f6 add an admin page to activate users which requested authorization through OIDC
4d5c6a2 flake8
fd6d15e adding menu entry for user authorization
9c8b824 do not expose traditional qiita internal user authentication, if OIDC is configured
a654e48 use Qiita typical modal for OIDC login
27f6d35 always allow logout
85bf1fa improved error handling
8a504cc revert: let user change their profile, but not password - if provided through OIDC
ef05eed speaking button names + move into correct div to always get displayed
a5270a0 use email from config + loop user_info from OIDC to fill DB
2efb70f use OIDC info to prefil user information
c8b1198 drop admin user authorization
3d6f718 using the well-known json dict instead of manually providing multiple API endpoints through the config file
3957030 flake8
648f2f9 flake8
73f92b9 add ability to display OIDC logos
0dc243d add OIDC logo
9b81163 fixing config manager tests
bb03167 Merge branch 'auth_oidc' of github.com:jlab/qiita into auth_oidc
a76288b Merge branch 'master' of github.com:qiita-spots/qiita into auth_oidc
8cc718f Merge branch 'dev' of github.com:qiita-spots/qiita into auth_oidc
89cab41 Merge branch 'master' of github.com:qiita-spots/qiita into auth_oidc
5c3fa7b Merge branch 'dev' of github.com:jlab/qiita into dev
769e382 multiple validation jobs should be submitted as lead and dependent jobs, but the later must also made known by the DB
1a61930 Merge branch 'dev' of github.com:qiita-spots/qiita into auth_oidc
1b55c47 Merge branch 'dev' of github.com:qiita-spots/qiita into dev
fa26cff expose additional HTTP endpoints to send and retrieve data over http from/to plugin
d377393 add another configuration variable to decide if HTTPS filetransfer endpoint will be exposed
5e34bab aim to subtract OIDC changes
931bc4d remove image
b0de5c7 restore blank line
6c1fd0e Merge branch 'master' of github.com:jlab/qiita into uncouplePlugins
5499e04 Merge branch 'dev' of github.com:jlab/qiita into uncouplePlugins
44d4315 revert changes
2b70f7c revert
dd622c6 revert
85d30f4 return info as "reason"
9858b42 initial set of tests
2f43e89 codestyle
3f77f85 more codestyle
ad06f27 enable endpoints
28cbb98 tests pass locally
3edd07e adding in more functionallity for testing, i.e. operate on token based authorization
5bad769 avoid name collisions
d5f00d7 debug
e9028a8 more debug
7efc716 change debug
91bbbe4 continue debug
2342304 adapt filepaths to github runner
3fce312 remove debug and fix codestyle
be38c10 remove debug step
File Changes (9 files)
M .github/workflows/qiita-ci.yml (4)
M qiita_core/configuration_manager.py (8)
M qiita_core/support_files/config_test.cfg (6)
M qiita_core/tests/test_configuration_manager.py (7)
M qiita_db/handlers/tests/oauthbase.py (51)
A qiita_pet/handlers/cloud_handlers/__init__.py (9)
A qiita_pet/handlers/cloud_handlers/file_transfer_handlers.py (94)
A qiita_pet/handlers/cloud_handlers/tests/test_file_transfer_handlers.py (93)
M qiita_pet/webserver.py (7)
Patch Links:
https://github.com/qiita-spots/qiita/pull/3480.patch
https://github.com/qiita-spots/qiita/pull/3480.diff
Absolutely, and we are already planning in this direction with our effort on aruna. This PR is just the first step towards this goal :-)
Thank you @sjanssen2; this looks good.
Now, I'm wondering if plugin_coupling should be a general setting in the qiita config file; this would mean that all plugins are either filesystem or https. The nice thing about this is that no changes are required in the plugins themselves, only in qiita-client.
Another option is to use an environment variable to set this, which the qiita-client will use to define how the files are going to be moved. For example: here we are setting a variable that we later use here.
What do you think?
        data = {
            'client_id': '4MOBzUBHBtUmwhaC258H7PS0rBBLyGQrVxGPgc9g305bvVhf6h',
            'client_secret':
            ('rFb7jwAb3UmSUN57Bjlsi4DTl2owLwRpwCc0SggRN'
             'EVb2Ebae2p5Umnq20rNMhmqN'),
            'grant_type': 'client'}
This should be part of the configuration file, right?
These are the credentials that must be present in the Qiita DB to obtain a token ... ultimately on the plugin side. Here, however, they live on the DB side, just for testing.
For real plugins, client_id and client_secret get initialized with random values, and the "plugin registry" process is when Qiita admins "trust" this plugin and add its credentials to the DB.
So yes, these are values stored in the plugin configuration file.
Opened an issue about this: #3481
        # TODO: can we somehow check, if the requesting client (which should be
        #       one of the plugins) was started from a user that actually has
        #       access to the requested file?
In theory, a user can only start a job if they have access to that artifact. However, I don't see why this shouldn't be added.
I am also undecided at the moment. I am thinking of multiple "runner" locations for plugins, e.g. UCSD, JLU Gießen, .... Do we really want to trust these runners to be nice and access only those files of the central filesystem that are necessary for their job?!
What if one of those plugins iterates over all artifacts, even those from private studies?!
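One conceivable shape for such a check is sketched below. It is purely an illustration of the concern raised here, not part of this PR: the function name and the idea of passing in the filepaths that belong to the requesting job (`job_filepaths`) are assumptions, and how Qiita would look up those filepaths for a given token is left open.

```python
import os


def is_request_allowed(requested, base_data_dir, job_filepaths):
    """Return True only if `requested` is one of the job's own files.

    Hypothetical helper: `job_filepaths` is assumed to be the collection of
    paths (relative to `base_data_dir`) that the requesting job may access.
    """
    base = os.path.realpath(base_data_dir)
    target = os.path.realpath(os.path.join(base, requested))
    # Reject anything that escapes the data directory, e.g. via "../" or an
    # absolute path.
    if os.path.commonpath([base, target]) != base:
        return False
    allowed = {os.path.realpath(os.path.join(base, path))
               for path in job_filepaths}
    return target in allowed
```

A file transfer handler could call such a helper before serving a file and answer with an HTTP 403 otherwise; whether the extra lookup is worth the cost is exactly the open question in this thread.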
    @authenticate_oauth
    @coroutine
    @execute_as_transaction
    def post(self):
Wondering if this is a web blocking operation? If yes, I guess we should add a new block to the nginx/supervisord files so they have their own workers.
Hm, shouldn't the @coroutine decorator spawn a new "thread" / "worker"? But this might only be true for python/tornado, not for nginx itself. But doesn't nginx come with load balancing?
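As background to this question: a coroutine by itself does not start a thread or worker; it only yields control to the event loop at its await / yield points, so a blocking file read inside it stalls every other request handled by that process. The sketch below demonstrates this with the standard library's asyncio (which Tornado 5 and later run on). It is not the handler from this PR; `slow_read` and `demo` are made-up names, and `time.sleep` merely stands in for a slow disk read.

```python
import asyncio
import time


def slow_read(seconds):
    # Stand-in for a blocking file read (assumption for illustration).
    time.sleep(seconds)
    return b"content"


async def demo(offload):
    """Count how often a concurrent 'heartbeat' ran during the slow read."""
    ticks = 0
    done = False

    async def heartbeat():
        nonlocal ticks
        while not done:
            await asyncio.sleep(0.01)
            ticks += 1

    task = asyncio.ensure_future(heartbeat())
    await asyncio.sleep(0)  # let the heartbeat start
    if offload:
        # Run the blocking call in a worker thread; the loop stays free.
        loop = asyncio.get_running_loop()
        await loop.run_in_executor(None, slow_read, 0.2)
    else:
        # Blocking call inside the coroutine: the loop is stalled.
        slow_read(0.2)
    counted = ticks
    done = True
    await task
    return counted


if __name__ == "__main__":
    print("blocking :", asyncio.run(demo(offload=False)))
    print("offloaded:", asyncio.run(demo(offload=True)))
```

In the blocking variant the heartbeat never gets a turn while the read is in progress, whereas the offloaded variant keeps ticking. Handing the actual file delivery to nginx, as described in the PR text, avoids the issue for downloads in a different way.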
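thanks :-)
I assume that by "the qiita config" you are referring to the plugin configuration file, not the qiita_pet/db configuration, right?
I thought, but might be wrong here, that one should not need to explicitly set the parameter, as I defined filesystem as the default. Plugins that are not yet "decorated" with the new functions will ignore this setting altogether. Updated plugins can switch their behaviour with just one central setting. We might consider situations where e.g. UCSD communicates via "filesystem" with a deblur plugin OR via "https" with a deblur plugin hosted in our compute cloud. Thus, I thought it would not hurt to make this parameter configurable when setting up a plugin in general.
However, it looks like click enforces explicitly defining a value for plugin_coupling, which will break all non-migrated plugins :-( Is there a nice way to avoid this via click?
That's a possible way; however, I feel it makes this setting more obfuscated. I agree that if free text / names are allowed, as for something like a conda environment, this approach sounds reasonable. However, if you have to choose from a set of given values, it might be easier for novel users to go with the click approach.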
Is fsspec useful here? There are more protocols beyond https, and generalizing now may save pain later: https://filesystem-spec.readthedocs.io/en/latest/
On Aug 20, 2025, at 13:09, Stefan Janssen ***@***.***> wrote:
sjanssen2 left a comment (qiita-spots/qiita#3480)
Thank you @sjanssen2; this looks good.
thanks :-)
Now, I'm wondering if plugin_coupling should be a general setting in the qiita config file; this would mean that all plugins are either filesystem or https. The nice thing about this is that no changes are required in the plugins themselves, only in qiita-client.
I assume you are referring to the plugin configuration file with "the qiita config" not qiita_pet/db configuration, right?
I thought, but might be wrong here, that one should not need to explicitly set the parameter, as I defined filesystem as the default. Plugins that are not yet "decorated" with the new functions will ignore this setting altogether. Updated plugins can switch their behaviour with just one central setting. We might consider situations where e.g. UCSD communicates via "filesystem" with a deblur plugin OR via "https" with a deblur plugin hosted in our compute cloud. Thus, I thought it would not hurt to make this parameter configurable when setting up a plugin in general.
However, it looks like click enforces explicitly defining a value for plugin_coupling - which will break all non-migrated plugins :-( Is there a nice way to avoid this via click?
Another option is to use an environment variable to set this; which the qiita-client will use and define how the files are going to be moved. For example: here we are setting a variable that we later use here.
That's a possible way; however, I feel it makes this setting more obfuscated. I agree that if free text / names are allowed, as for something like a conda environment, this approach sounds reasonable. However, if you have to choose from a set of given values, it might be easier for novel users to go with the click approach.
What do you think?
I agree with @wasade about using something like […]
Now, @sjanssen2, sorry for not being clear: my suggestion was to add it in qiita_core/support_files/config_test.cfg, similar to […]
What do you think?
Hi again.
Exactly! And this is how I have implemented it :-) I agree that qiita_client should do the heavy lifting, as it is part of each plugin but a central piece of code. The two new functions […]
I am unsure to what degree use of […] and […] where the latter is a more complicated case, as it actually iterates through a set of fastq files. I foresee more potential for breaking existing plugins if we introduce these more fundamental changes. It might also make plugin development slightly more complicated. However, with my proposed change, I think it should be easy to extend functionally, i.e. […]
Yes, absolutely. First, to be able to configure different plugins differently depending on their run-time / file-transfer size; second, to avoid the need to refactor all plugins at the same time; and third, to allow for scenarios where we can have multiple "runners" for the very same plugin. Think of executing qp-deblur on your local barnacle2 cluster (protocol=filesystem) or offloading some compute to our infrastructure (protocol=https).
Here, I disagree. Your qp-deblur config file might hold […]
Addendum: Reconsidering your objection, I realize that we might have diverging assumptions. I think of plugins as separate Docker containers (each with its own file system that holds dependencies, and a potentially shared volume that is mounted for data exchange); you might consider a fully shared file system with individual conda environments for the plugins. In your case, one might be able to duplicate the generated configuration files, which are collected in […]
How about we set up a video conference to further discuss pros and cons? I am not 100% sure if I understand how you would like to see use of fsspec.
Forgot that there is a conceptual difference between […]
Reconsidering this, I added the ability to change the "protocol" via an environment variable: […] Thus, […]
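The variable name and the exact rules did not survive in this comment, so the following is only a sketch of how such an override could be resolved. The environment variable name `QIITA_PLUGIN_COUPLING`, the function name, and the precedence (explicit argument, then environment variable, then the "filesystem" default mentioned earlier in the thread) are all assumptions.

```python
import os

VALID_PROTOCOLS = ("filesystem", "https")
# Hypothetical variable name; the real one is not given in this thread.
ENV_VAR = "QIITA_PLUGIN_COUPLING"


def resolve_protocol(explicit=None, environ=None):
    """Pick the file transfer protocol for a plugin.

    Assumed precedence: explicit value > environment variable > default.
    """
    environ = os.environ if environ is None else environ
    value = explicit or environ.get(ENV_VAR) or "filesystem"
    value = value.strip().lower()
    if value not in VALID_PROTOCOLS:
        raise ValueError(
            "protocol must be one of %s, got %r" % (VALID_PROTOCOLS, value))
    return value
```

With a default in place, plugins that never set the variable keep the current filesystem behaviour, which matches the goal of not having to migrate all plugins at once.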
…meter for ENABLE_HTTPS_PLUGIN_FILETRANSFER is necessary, we simply expose these new endpoints, whatever
Should be good to go now, @antgonza. Again, thanks for your time!