Refgenie roadmap #283

nsheff · 2024-02-22T14:34:40Z

Refgenie roadmap

Introduction

This issue outlines a roadmap for the next steps in the refgenie project. I intend to move to Refgenie 1.0, a new, improved version that incorporates lessons learned from the earlier versions. The update is big enough to warrant a new major version release, with breaking changes. The major updates will include:

deployment of the system for custom asset classes and custom recipes;
support for storing configuration in either a file or a database;
integration with the new GA4GH refget sequence collections standard, and other standards.

In addition, there are other smaller improvements to be made. Here's a bit more detail for each of these:

1. Deployment of custom asset classes and recipes

This system is working in beta; we just need to polish and deploy it. It might be worth considering whether the tools we now have available could improve the implementation.

One concrete step is to deploy something like the schema.databio.org front-end as a user interface for the current GitHub-backed recipes/assets repository.

2. How could refgenie use a database backend?

Database-backed config #284

Goal: use a PostgreSQL database for the refgenieserver back-end, but still allow an individual user to store the information in a config file. Why? A database would be much easier for huge build processes to update than a giant YAML file. The challenge is that refgenie and refgenieserver both use the same config file.

We have already built pipestat, which can store pipeline results in either a YAML file or a PostgreSQL database. So, could we just use it to manage the refgenie config? If it works, that would be ideal: simpler code and less maintenance. Three possible levels of using pipestat:

All the way. Is the existing pipestat implementation flexible enough for the specifics needed by refgenie's data model?
Partial. If not, would it be possible to use pipestat for some aspects, and then build some custom functionality around it?
Inspiration. If not, can we use lessons learned from pipestat development to create our custom solution?
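To make the file-or-database idea concrete, here is a minimal sketch of one interface with two interchangeable stores. It does not use pipestat's actual API: `ConfigBackend`, `FileBackend`, `SqlBackend`, and `demo` are hypothetical names, JSON stands in for YAML, and SQLite stands in for PostgreSQL so that the example needs only the standard library.

```python
import json
import sqlite3
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Optional


class ConfigBackend(ABC):
    """Hypothetical interface: store and fetch one asset record per key."""

    @abstractmethod
    def set(self, key: str, record: dict) -> None: ...

    @abstractmethod
    def get(self, key: str) -> Optional[dict]: ...


class FileBackend(ConfigBackend):
    """Keeps every record in one file, rewritten in full on each update."""

    def __init__(self, path: Path):
        self.path = path

    def _load(self) -> dict:
        return json.loads(self.path.read_text()) if self.path.exists() else {}

    def set(self, key, record):
        data = self._load()
        data[key] = record
        self.path.write_text(json.dumps(data, indent=2))

    def get(self, key):
        return self._load().get(key)


class SqlBackend(ConfigBackend):
    """Keeps one row per record, so an update touches a single row."""

    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS assets (key TEXT PRIMARY KEY, record TEXT)"
        )

    def set(self, key, record):
        self.conn.execute(
            "INSERT OR REPLACE INTO assets VALUES (?, ?)", (key, json.dumps(record))
        )
        self.conn.commit()

    def get(self, key):
        row = self.conn.execute(
            "SELECT record FROM assets WHERE key = ?", (key,)
        ).fetchone()
        return json.loads(row[0]) if row else None


def demo() -> list:
    """Write the same record through both backends and read it back."""
    record = {"asset": "fasta", "tag": "default"}
    conn = sqlite3.connect(":memory:")
    with tempfile.TemporaryDirectory() as tmp:
        backends = [FileBackend(Path(tmp) / "config.json"), SqlBackend(conn)]
        for backend in backends:
            backend.set("hg38/fasta:default", record)
        results = [backend.get("hg38/fasta:default") for backend in backends]
    conn.close()
    return results
```

Calling code only ever sees `set` and `get`, so a single user could keep the file store while a server uses the database. Whether refgenie's data model fits such a flat key-to-record shape is exactly the open question in the three levels above.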

3. How should refgenie integrate GA4GH seqcol/refget?

Incompatibilities in current backends

SeqColAPI relies on a SeqColHenge object, which uses henge as its backend storage for sequence collections, but RefGenConf uses YAMLConfigManager as its backend for sequence collection (refgenie asset) info. It would be nice if these used the same backend, so that a refgenieserver had just one data store. Could both RefGenConf and SeqColHenge be updated to use pipestat? Then there would be a single interface, which would make it easier to manage the refgenieserver backend storage.

Ways they could interact

seqcol already has a load_from_refgenie function, so if you first put the asset into refgenie, we can just call that function to load it into the seqcol storage.
Add something to refgenie to provide a link to the specific zipped FASTA file?
refgenie should use the seqcol digest function when computing its digest. Maybe the digest could just be returned from the load_from_refgenie call.
refgenieserver will use the api_router from refget to attach endpoints for the seqcol API, allowing refgenieserver to implement the seqcol API. Basically, the seqcolapi routes need to be usable in two places: in the seqcolapi service itself, and in refgenieserver, which should just import them and mount them as a router. So, I moved the seqcolapi FastAPI code into a module in the refget package. You can now import the seqcol_router and attach it to a FastAPI app. Instructions are in refget/seqcol_router.py. I also have a dev environment that doesn't require installing refget in order to work on the seqcolapi endpoints.
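For the digest item in the list above, the following is a rough sketch of how a sequence collection digest is built up: each attribute array is canonicalized and digested, and then the object holding those digests is canonicalized and digested again. This is not the refget package's code. `canonical_json` and `collection_digest` are hypothetical helpers, `canonical_json` only approximates RFC 8785 canonicalization, and the sketch digests every attribute it is given, whereas the standard's schema decides which attributes count toward the top-level digest.

```python
import base64
import hashlib
import json


def sha512t24u(data: bytes) -> str:
    """GA4GH truncated digest: SHA-512, first 24 bytes, base64url-encoded."""
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")


def canonical_json(value) -> bytes:
    """Approximate canonical JSON: sorted keys, no extra whitespace.

    Assumption: good enough for plain strings and integers; a full
    RFC 8785 implementation also covers number and Unicode edge cases.
    """
    text = json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def collection_digest(collection: dict) -> str:
    """Digest each attribute array, then digest the object of those digests."""
    attribute_digests = {
        name: sha512t24u(canonical_json(values)) for name, values in collection.items()
    }
    return sha512t24u(canonical_json(attribute_digests))


# Toy collection; the sequences are refget-style identifiers of two short strings.
example = {
    "names": ["chr1", "chr2"],
    "lengths": [8, 4],
    "sequences": ["SQ." + sha512t24u(b"ACGTACGT"), "SQ." + sha512t24u(b"ACGT")],
}
print(collection_digest(example))
```

Because the arrays are digested in order, swapping two sequences yields a different digest, which is what lets the digest identify one specific collection.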

TODO items and next steps strategy

Exploratory items

Can I get seqcolapi embedded as a router in refget package? Yes, I did this.
Can I get pipestat operating as a henge backend for SeqColHenge? Yes, I did this.
Can I get pipestat operating as a backend for refgenconf? This is the hardest one.
Can I build a SeqColPipestatBackend that doesn't use henge, to avoid the complexities of henge? Probably... it's basically a rewrite of SeqColHenge.
Should we move the 3-package structure (refgenconf, refgenie, and refgenieserver) back into a single repository, with optional dependency blocks? Or maybe just integrate refgenconf into refgenie?
It would be nice to have a simple local web server to browse your own local genomes. It could even have a web interface for grabbing remote assets.
Would it be useful to have a refgenie init that provides a prompt-based wizard for creating a new config?
Should we develop a simple local web UI for curating local assets?

Easy tasks

digests should switch to the new refget sequence collections standard
endpoints for retrieving assets should change to implement DRS, instead of the ad-hoc remote structure (API update)
new recipe sharing system should be finished and launched
server should record download statistics (may correspond to switching to a database instead of a file)
content should be expanded to include all HPRC reference genomes
prep for yacman 1.0, eliminating attmap dependency
documentation should be merged into one location (isn't it already like this?); or re-deployed at new URL.
server should provide the new seqcol API
client should provide access to the new seqcol API (e.g. comparison of local genomes to remote ones)
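As a rough illustration of the last item, comparing a local genome to a remote one could look something like the sketch below. It is only loosely modeled on the idea of a collection comparison: `compare_collections` and its report keys are hypothetical and are not the response format defined by the seqcol standard, and both inputs are assumed to be plain dicts of attribute arrays.

```python
from collections import Counter


def compare_collections(local: dict, remote: dict) -> dict:
    """Report which attributes two collections share and how their arrays overlap."""
    shared = sorted(set(local) & set(remote))
    report = {
        "attributes": {
            "local_only": sorted(set(local) - set(remote)),
            "remote_only": sorted(set(remote) - set(local)),
            "shared": shared,
        },
        "elements": {},
    }
    for attr in shared:
        a, b = local[attr], remote[attr]
        # Multiset intersection: counts elements present in both arrays.
        overlap = sum((Counter(a) & Counter(b)).values())
        report["elements"][attr] = {
            "local_count": len(a),
            "remote_count": len(b),
            "shared_count": overlap,
            "identical": a == b,  # same elements in the same order
        }
    return report


local_genome = {"names": ["chr1", "chr2"], "lengths": [8, 4]}
remote_genome = {
    "names": ["chr2", "chr1"],
    "lengths": [4, 8],
    "topologies": ["linear", "linear"],
}
report = compare_collections(local_genome, remote_genome)
print(report["elements"]["names"])
```

In this toy case the two genomes hold the same names and lengths in a different order, so `shared_count` matches both counts while `identical` is false.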

Harder tasks

build system should be simplified and revamped, and be driven by a PEP on PEPhub
web interface could be rewritten in React (?)
backend should use pipestat for refgenconf, so you can use either a database or a file for configuration (might be too much trouble)

nsheff added the brainstorming label Feb 22, 2024

nsheff changed the title ~~Refgene roadmap~~ Refgenie roadmap Feb 22, 2024
