Refgenie roadmap
This issue outlines a roadmap for the next steps in the refgenie project. I'm intending to move to Refgenie 1.0, a new, improved version incorporating lessons learned from the earlier versions. This will be a big enough update that it will warrant a new major version release, with breaking changes. The major updates will include:
deployment of a system for custom asset classes and custom recipes;
allowing configuration to be stored in either a file or a database;
integration with the new refget sequence collections GA4GH standard, and other standards.
In addition, there are other smaller improvements to be made. Here's a bit more detail for each of these:
1. Deployment of custom asset classes and recipes
This system is working in beta; we just need to polish and deploy it. It might be worth thinking about whether there are ways to improve the implementation with the new tools we have available now.
One thing to do is just deploy something like the schema.databio.org front-end to make a nice user interface for the current GitHub-backed recipes/assets repository.
2. How could refgenie use a database backend?
Goal: use a postgres database for the refgenieserver back-end, but still allow an individual user to use a config file to store the information. Why? A database would be much easier for huge build processes to update than a giant YAML file. The challenge is that refgenie and refgenieserver both use the same config file.
We have already built pipestat, which allows using either a YAML file or a postgres database for pipeline results. So, could we just use this to manage the refgenie config? If we can, that would be ideal: it would simplify the code and reduce maintenance. Three possible levels of using pipestat:
All the way. Is the existing pipestat implementation flexible enough for the specifics needed by refgenie's data model?
Partial. If not, would it be possible to use pipestat for some aspects, and then build some custom functionality around it?
Inspiration. If not, can we use lessons learned from pipestat development to create our custom solution?
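To make the file-or-database goal concrete, here is a minimal sketch of what a shared interface could look like. Everything in it is an assumption for illustration: the `ConfigBackend`, `FileBackend`, and `DatabaseBackend` names are hypothetical, JSON stands in for YAML, and SQLite stands in for postgres so the sketch runs with the standard library only. It does not use or reflect the pipestat or refgenconf APIs.

```python
import json
import sqlite3
from pathlib import Path
from typing import Optional, Protocol


class ConfigBackend(Protocol):
    """Hypothetical interface; not part of refgenconf or pipestat."""

    def set_asset(self, genome: str, asset: str, record: dict) -> None: ...
    def get_asset(self, genome: str, asset: str) -> Optional[dict]: ...


class FileBackend:
    """Keeps every record in one file (JSON here, standing in for YAML)."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)

    def _load(self) -> dict:
        return json.loads(self.path.read_text()) if self.path.exists() else {}

    def set_asset(self, genome, asset, record):
        data = self._load()
        data.setdefault(genome, {})[asset] = record
        # The whole file is rewritten on every update, which is the cost
        # that motivates a database for huge build processes.
        self.path.write_text(json.dumps(data, indent=2))

    def get_asset(self, genome, asset):
        return self._load().get(genome, {}).get(asset)


class DatabaseBackend:
    """Keeps one row per asset (SQLite here, standing in for postgres)."""

    def __init__(self, dsn: str = ":memory:") -> None:
        self.conn = sqlite3.connect(dsn)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS assets ("
            "genome TEXT, asset TEXT, record TEXT, "
            "PRIMARY KEY (genome, asset))"
        )

    def set_asset(self, genome, asset, record):
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT OR REPLACE INTO assets VALUES (?, ?, ?)",
                (genome, asset, json.dumps(record)),
            )

    def get_asset(self, genome, asset):
        row = self.conn.execute(
            "SELECT record FROM assets WHERE genome = ? AND asset = ?",
            (genome, asset),
        ).fetchone()
        return json.loads(row[0]) if row else None


def register_build(backend: ConfigBackend, genome: str, asset: str, path: str) -> None:
    """Caller code is the same whichever backend is in use."""
    backend.set_asset(genome, asset, {"path": path})
```

The point of the sketch is that `register_build` never needs to know which backend it was given, so an individual user could keep a file while a server uses a database.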
3. How should refgenie integrate GA4GH seqcol/refget?
Incompatibilities in current backends
SeqColAPI relies on a SeqColHenge object, which uses henge as its backend storage for sequence collections, but RefGenConf uses YAMLConfigManager as its backend for sequence collection (refgenie asset) info. It would be nice if these used the same back-end, so there was just one data store for a refgenieserver. Could both RefGenConf and SeqColHenge be updated to use pipestat? Then there would be a single interface, which would make it easier to manage the refgenieserver back-end storage.
Ways they could interact
seqcol already has a load_from_refgenie function, so if you first put the asset into refgenie, we can just call that function to load it into the seqcol storage.
Add something to refgenie to provide a link to the specific zipped FASTA file?
refgenie should use the seqcol digest function when computing its digest. Maybe the digest could just be returned from the load_from_refgenie call.
refgenieserver will use the api_router from refget to attach endpoints for the seqcol API, allowing refgenieserver to implement the seqcol API. Basically, the seqcolapi routes need to be usable in two places: first in the seqcolapi service, and second in refgenieserver, which should just import them and mount them as a router so they can be reused. So, I moved the seqcolapi FastAPI code into a module in the refget package. You can now import the seqcol_router and attach it to a FastAPI app. Instructions are in refget/seqcol_router.py. I also have a dev environment that doesn't require installing refget in order to work on the seqcolapi endpoints.
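For the digest item in the list above, here is a rough sketch of how a sequence-collection-style digest can be computed. It is my reading of the GA4GH approach (SHA-512 truncated to 24 bytes and base64url-encoded, applied first to each attribute array and then to the object of those digests), not the refget package's actual code. The function names are hypothetical, the attribute set (names, lengths, sequences) is an assumption, and the JSON canonicalization only approximates RFC 8785.

```python
import base64
import hashlib
import json


def sha512t24u(data: bytes) -> str:
    """SHA-512, truncated to 24 bytes, base64url-encoded (32 characters)."""
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")


def canonical_json(obj) -> bytes:
    # Approximates RFC 8785 canonicalization; adequate for the plain
    # strings and integers used here, not a full implementation.
    text = json.dumps(obj, separators=(",", ":"), sort_keys=True, ensure_ascii=False)
    return text.encode("utf-8")


def sequence_digest(sequence: str) -> str:
    """refget-style identifier: 'SQ.' plus the digest of the uppercased sequence."""
    return "SQ." + sha512t24u(sequence.upper().encode("ascii"))


def seqcol_digest(names, sequences) -> str:
    """Digest each attribute array, then digest the object of those digests."""
    arrays = {
        "names": list(names),
        "lengths": [len(s) for s in sequences],
        "sequences": [sequence_digest(s) for s in sequences],
    }
    per_attribute = {attr: sha512t24u(canonical_json(values)) for attr, values in arrays.items()}
    return sha512t24u(canonical_json(per_attribute))
```

Because each attribute is digested separately before the top-level digest is taken, two collections that share sequences but differ in names would still share the `sequences` digest, which is what makes comparisons between collections cheap.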
TODO items and next steps strategy
Exploratory items
Can I get seqcolapi embedded as a router in refget package? Yes, I did this.
Can I get pipestat operating as a henge backend for SeqColHenge? Yes, I did this.
Can I get pipestat operating as a backend for refgenconf? This is the hardest one.
Can I build a SeqColPipestatBackend that doesn't use henge, to avoid the complexities of henge? Probably... it's a rewrite of SeqColHenge, basically.
Should we move the 3-package structure (refgenconf, refgenie, and refgenieserver) back into a single repository, with optional dependency blocks? Or maybe just integrate refgenconf into refgenie?
It would be nice to have a simple local web server to browse your own local genomes. It could even have a web interface for grabbing remote assets.
Would it be useful to have a refgenie init that provides a prompt-based wizard for creating a new config?
Should we develop a simple local web UI for curating local assets?
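As a rough idea of what the prompt-based wizard mentioned above could look like, here is a minimal sketch. The function names, prompts, config keys, and defaults are all assumptions for illustration and are not an existing refgenie command. The input function is passed in so the wizard can be driven by a script or a test as well as by a person at a terminal.

```python
from typing import Callable


def ask(prompt: str, default: str, read: Callable[[str], str] = input) -> str:
    """Show a prompt with its default; an empty reply accepts the default."""
    reply = read(f"{prompt} [{default}]: ").strip()
    return reply or default


def init_wizard(read: Callable[[str], str] = input) -> dict:
    """Collect a minimal config interactively and return it as a dict."""
    # Keys and defaults below are illustrative assumptions.
    folder = ask("Folder for genome assets", "./genomes", read)
    servers = ask("Asset servers (comma-separated)", "http://refgenomes.databio.org", read)
    return {
        "genome_folder": folder,
        "genome_servers": [s.strip() for s in servers.split(",") if s.strip()],
        "genomes": {},
    }
```

Calling `init_wizard()` with no arguments would prompt on the terminal; the returned dict could then be written out by whichever config backend is in use.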
Easy tasks
digests should switch to new refget sequence collections standard
endpoints for retrieving assets change to implement DRS, instead of ad-hoc remote structure (API update)
new recipe sharing system should be finished and launched
server should record download statistics (may correspond to switching to a database instead of a file)
content should be expanded to include all HPRC reference genomes
prep for yacman 1.0, eliminating attmap dependency
documentation should be merged into one location (isn't it already like this?); or re-deployed at new URL.
server should provide new seqcol API
client should provide new access to new seqcol API (e.g. comparison of local genomes to remote ones)
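The last item above, comparing local genomes to remote ones, could start from something like the following sketch. It is loosely modeled on the idea of comparing two collections attribute by attribute; the function name and the report fields are mine, not the seqcol API's, and it treats each array as a set, so duplicate elements are not counted separately.

```python
def compare_collections(local: dict, remote: dict) -> dict:
    """Summarize how two collections (attribute name -> list) differ."""
    shared = sorted(set(local) & set(remote))
    report = {
        "attributes": {
            "local_only": sorted(set(local) - set(remote)),
            "remote_only": sorted(set(remote) - set(local)),
            "shared": shared,
        },
        "elements": {},
    }
    for attr in shared:
        a, b = local[attr], remote[attr]
        common = set(a) & set(b)
        report["elements"][attr] = {
            "local_total": len(a),
            "remote_total": len(b),
            "in_both": len(common),
            # True when the shared elements appear in the same relative order.
            "same_order": [x for x in a if x in common] == [x for x in b if x in common],
        }
    return report
```

A client could run this on the arrays of a local genome and a remote one to tell, for example, whether the remote genome has the same chromosomes in a different order or is missing some entirely.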
Harder tasks
build system should be simplified and revamped, and be driven by a PEP on PEPhub
web interface could be rewritten in React (?)
Backend should use pipestat for refgenconf, so you can either use a database or a file for configuration. (Might be too much trouble).