Skip to content

Conversation

sjanssen2
Copy link
Contributor

When processing data via plugins, input and output files are located in ONE shared filesystem, i.e. BASE_DATA_DIR (and WORKING_DIR for temporary files). This works well as long as the plugin process and Qiita pet & DB are operating within one machine or machines equipped with network file systems like slurm grids.

We intend to host Qiita within a kubernets cloud environment. A plugin will become an independent docker image running as one or multiple pods. It would therefore be advisable to separate the central file system (Qiita pet & DB) from individual plugins, as this would slow boot up of plugin pods AND we might later distribute plugin jobs across multiple clouds. In this use case, transferring the whole BASE_DATA_DIR is infeasible.

text236

I understand the current flow as follows:

  1. the user composes a processing / analysis workflow and hits "run". Qiita pet uses a launcher to "submit" a new job for the according plugin
  2. the plugin command requests information about e.g. an artifact, the prep file, ... via qiita_client, which calls the postgress DB and returns filepaths in BASE_DATA_DIR.
  3. the plugin command directly accesses content of filepaths from 2.

My suggested flow is designed to require minimal changes for plugin codes, e.g. wrapping the filepath when reading / writing content with additional functions fetch_file_from_central and push_file_to_central to pull or push file content either directly from/to the filesystem (no change from what is done currently) or receive/push and create files in the plugin local or central filesystem, respectively.

This PR adds two endpoints/functions to Qiita pet to send https://github.com/jlab/qiita/blob/be38c1079fe8cb12e120d9b96c48e97b8b3cd062/qiita_pet/handlers/cloud_handlers/file_transfer_handlers.py#L15 and receive https://github.com/jlab/qiita/blob/be38c1079fe8cb12e120d9b96c48e97b8b3cd062/qiita_pet/handlers/cloud_handlers/file_transfer_handlers.py#L61 files to and from plugins.

Both endpoints are deactivated by default (to make Qiita behave as is) and can be activated by setting https://github.com/jlab/qiita/blob/be38c1079fe8cb12e120d9b96c48e97b8b3cd062/qiita_core/support_files/config_test.cfg#L53-L57 to True).

For higher performance, file content will be send through nginx instead of plain python/tornado. I thus had to fix the /Users/username prefix in the nginx example configuration file to match the actual filepaths.

It's not yet clear to me, IF we need to add a mechanism to check if a plugin has permissions to access the requested file(s) from BASE_DATA_DIR. Currently, we "trust" the plugin via oauth credentials.

This PR is accompanied by jlab/qiita_client#1 to equip the Qiita Client (part of every plugin) to handle according requests. The client can select between the current central filesystem mechanism
https://github.com/jlab/qiita_client/blob/d9543c9575f0171620c17f4e87897ed5cf52a905/qiita_client/qiita_client.py#L783-L790
OR the novel file transfer through https
https://github.com/jlab/qiita_client/blob/d9543c9575f0171620c17f4e87897ed5cf52a905/qiita_client/qiita_client.py#L792-L811
Both mechanisms return the actual filepath of the requested (and potentially transferred) files. This allows individual plugins to use different mechanisms, i.e. we don't need to migrate all plugins at once.

As only the plugin functions know which files they request / send as artifact components, we need to "decorate" file access in their individual implementations. Here is an example PR for qp-deblur: jlab/qp-deblur#1

…ed manually, but are obtained from an external identity provider
@sjanssen2 sjanssen2 changed the base branch from master to dev August 20, 2025 11:46
@wasade
Copy link
Contributor

wasade commented Aug 20, 2025 via email

@sjanssen2
Copy link
Contributor Author

sjanssen2 commented Aug 20, 2025

Broader filesystem question, should Qiita support S3-like object stores?

Absolutely and we are already planning in this direction with our effort on aruna. This PR is just the first step towards this goal :-)
P.S. can you re-start the rests - looks like coveralls was not available at that moment?

Copy link
Member

@antgonza antgonza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @sjanssen2; this looks good.

Now, I'm wondering if the plugin_coupling should be a general setting in the qiita config file - this will mean that all plugins are either filesystem or https. The nice thing about this, is that then there are no changes required in the plugins themself and only in qiita-client.

Another option is to use an environment variable to set this; which the qiita-client will use and define how the files are going to be moved. For example: here we are setting a variable that we later use here.

What do you think?

Comment on lines +39 to +44
data = {
'client_id': '4MOBzUBHBtUmwhaC258H7PS0rBBLyGQrVxGPgc9g305bvVhf6h',
'client_secret':
('rFb7jwAb3UmSUN57Bjlsi4DTl2owLwRpwCc0SggRN'
'EVb2Ebae2p5Umnq20rNMhmqN'),
'grant_type': 'client'}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be part of the configuration file, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are credentials necessary to be present in qiitaDB to obtain a token ... finally on the plugin side. But here it is within the db side, just for testing.

For real plugins, client_id and client_secret get initialized with random values and the "plugin registry" process is when qiita admins "trust" this plugin and add their credentials into the DB.

So yes, these are values stored in the plugin configuration file

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened an issue about this: #3481

Comment on lines +24 to +26
# TODO: can we somehow check, if the requesting client (which should be
# one of the plugins) was started from a user that actually has
# access to the requested file?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory, a user can only start a job if they have access to that artifact. However, I don't see why this shouldn't be added.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am also undecided at the moment. I think of multiple "runner" locations for plugins, e.g. UCSD, JLU Gießen, .... Do we really want to trust these runners to be nice and access only files of the central filesystem necessary for their job?!
What if one of those plugins iterate over all artifacts even from private studies?!

@authenticate_oauth
@coroutine
@execute_as_transaction
def post(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if this is a web blocking operation? If yes, I guess we should add a new block to the nginx/supervisord files so they have their own workers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, shouldn't the @coroutine decorator spawn a new "thread" / "worker"? But this might only be true for python/tornado, not for the nginx itself. But doesn't come nginx with load balancing?

@sjanssen2
Copy link
Contributor Author

Thank you @sjanssen2; this looks good.

thanks :-)

Now, I'm wondering if the plugin_coupling should be a general setting in the qiita config file - this will mean that all plugins are either filesystem or https. The nice thing about this, is that then there are no changes required in the plugins themself and only in qiita-client.

I assume you are referring to the plugin configuration file with "the qiita config" not qiita_pet/db configuration, right?

I though, but might be wrong here, that one should not need to explicitly set the parameter as I defined filesystem as the default. Plugins that are not yet "decorated" with the new functions will ignore this setting altogether. Updated plugins can switch their behaviour with just one central setting. We might consider situations where e.g. UCSD communicated via "filesystem" with a deblur plugin OR via "https" with a deblur plugin hosted in our compute cloud. Thus, I thought it would not hurt to make this parameter configurable when setting up a plugin in general.

However, it looks like click enforces explicitly defining a value for plugin_coupling - which will break all non-migrated plugins :-( Is there a nice way to avoid this via click?

Another option is to use an environment variable to set this; which the qiita-client will use and define how the files are going to be moved. For example: here we are setting a variable that we later use here.

That's a possible way, however, I feel it make this setting more obfuscated. I agree, if there are free text / names allowed for something like a conda environment, this approach sounds reasonable. However, if you have to choose from a set of given values, it might be easier for novel users to go with the click approach.

What do you think?

@wasade
Copy link
Contributor

wasade commented Aug 20, 2025 via email

@antgonza
Copy link
Member

I agree with @wasade about using something like fsspec. It would be fine to just raise an error if that method is not implemented so you only support https/filesystem right now.

Now, @sjanssen2, sorry for not being clear, my suggestion was to add it in qiita_core/support_files/config_test.cfg, similar to ENABLE_HTTPS_PLUGIN_FILETRANSFER. However, the more I think about this, the more I think the best would be to allow to use a [https, filesyste, s3, whatever] in parallel. In other words, a single system can use any or all of them. If we add it to the plugin or qiita's config file, we will be limiting allowing to run either of these in parallel - basically, only allowing one of these per system or plugin installation; which is not ideal. An idea to solve this is to make qiita-client to be responsible of this transfers vs. qiita, qiita will only pass a parameter of how the user (or system) wants to run any given command, then qiita-client "moves" the files to the right location for the plugin to work. For example, the parameter can be added to the .sh script that is written so the command starts running.

What do you think?

@sjanssen2
Copy link
Contributor Author

sjanssen2 commented Aug 25, 2025

Hi again.
Important goals for the changes I suggest here and in the qiita_client code are:

  1. Don't break the current system, as I don't want to be in a situation where I have to change ALL plugin repos at once.
  2. Don't introduce multiple places where input / output files have to be "registered" for working plugins.
  3. Start simple and extend functionality on demand.

An idea to solve this is to make qiita-client to be responsible of this transfers vs. qiita, qiita will only pass a parameter of how the user (or system) wants to run any given command, then qiita-client "moves" the files to the right location for the plugin to work.

Exactly! And this is how I have implemented it :-) I agree that qiita_client should do the heavy lifting - as it is part of each plugin, but a central piece of code. The two new functions fetch_file_from_central and push_file_to_central in the qiita_client code must take care for plugin transparent file transfer. To avoid breaks, the default would be doing nothing aka. "filesystem" - as the plugin has direct access to BASE_DATA_DIR. Thus, it doesn't matter in the default setup, if the function is called (need slight changes in concrete plugin code) or not (no need to refactor existing plugins).

I am unsure to what degree use of fsspec would aid here. In my understanding, it provides elaborate functions on multiple (also chained) protocols to return file handles. Current plugins code provide one or multiple filepaths and access them in their own way, e.g.

https://github.com/qiita-spots/qp-deblur/blob/80467b7c76b6177abb577e19c83266ee54dbcd79/qp_deblur/deblur.py#L431

and

https://github.com/qiita-spots/qp-deblur/blob/80467b7c76b6177abb577e19c83266ee54dbcd79/qp_deblur/deblur.py#L470

where the later is a more complicated case as it actually iterated through a set of fastq files. I foresee more potential of breaking existing plugins if we introduce these more fundamental changes. It might also make plugin development slightly more complicated.

However, with my proposed change, I think it should be easy to extend functionally, i.e. _plugincoupling "protocols", within the above two functions in ONE central place.
This PR here is for qiita_pet, which controls the BASE_DATA_DIR. Not sure if it is necessary to access the "local" files via fsspec internally to deliver them to a remote plugin?

However, the more I think about this, the more I think the best would be to allow to use a [https, filesyste, s3, whatever] in parallel. In other words, a single system can use any or all of them.

Yes, absolutely. Once for being able to configure different plugins differently depending on their run-time / file-transfer size, two for avoiding the need to refactor all plugins at the same time and three to allow for scenarios where we can have multiple "runners" for the very same plugin. Think of executing qp-deblur on your local barnacle2 cluster (protocol=filesystem) or you offload some compute to our infrastructure (protocol=https).

If we add it to the plugin or qiita's config file, we will be limiting allowing to run either of these in parallel - basically, only allowing one of these per system or plugin installation; which is not ideal.

Here, I disagree. Your qp-deblur config file might hold PLUGINCOUPLING = https: https://github.com/jlab/qiita_client/blob/b9ade44b1d1a59425c95771e069ab1441549d8fa/qiita_client/plugin.py#L429-L430 while your very same qiita system can either remain on the as-is version of plugin codes or explicitly define PLUGINCOUPLING = filesystem for the e.g. qtp-biom plugin. I am actually exercising this in a docker compose version of qiita (code is here: jlab/qiita-keycloak#18), which Anna is using as a foundation to port qiita into a kubernetes setup.

Addendum: Reconsidering your objection, I realize that we might have diverging assumptions. I think of plugins as separate docker container (each with it's own file system that hold dependencies and a potentially shared volume that is mounted for data exchange), you might consider a fully shared file system with individual conda environments for the plugin. In your case, one might be able to duplicate the generated configuration file which are collected in PLUGIN_DIR and used during plugin registration?!

How about we set-up a video conference to further discuss pros and cons? I am not 100% sure if I understand how you would like to see use of fsspec.

@sjanssen2
Copy link
Contributor Author

However, it looks like click enforces explicitly defining a value for plugin_coupling - which will break all non-migrated plugins :-( Is there a nice way to avoid this via click?

Forgot that there is a conceptional difference between click.option and click.argument. I changed to the later.

@sjanssen2
Copy link
Contributor Author

For example, the parameter can be added to the .sh script that is written so the command starts running.

Reconsidering this, I added the ability to change the "protocol" via an environment variable:

https://github.com/jlab/qiita_client/blob/e336fbc60d2beee74b04f24574d4bb2636ed4f12/qiita_client/plugin.py#L285-L301

Thus, export QIITA_PLUGINCOUPLING=foo; script.sh should replace the setting from the plugin configuration file and you should be free to have different instances of the same plugin behaving differently

sjanssen2 and others added 2 commits August 27, 2025 17:11
…meter for ENABLE_HTTPS_PLUGIN_FILETRANSFER is necessary, we simply expose these new endpoints, whatever
@coveralls
Copy link

coveralls commented Aug 27, 2025

Coverage Status

coverage: 92.094% (-0.1%) from 92.236%
when pulling 53adaf0 on jlab:uncouplePlugins
into a34dceb on qiita-spots:dev.

@sjanssen2
Copy link
Contributor Author

should be good to go now @antgonza Again, thanks for your time!

@antgonza antgonza merged commit 1b8b0c7 into qiita-spots:dev Aug 27, 2025
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants