Refgenie roadmap #283

Open
2 of 21 tasks
nsheff opened this issue Feb 22, 2024 · 0 comments

Refgenie roadmap

Introduction

This issue outlines a roadmap for the next steps in the refgenie project. I'm intending to move to Refgenie 1.0, a new, improved version incorporating lessons learned from the earlier versions. This will be a big enough update that it will warrant a new major version release, with breaking changes. The major updates will include:

  1. deployment of a system for custom asset classes and custom recipes;
  2. support for storing configuration in either a file or a database;
  3. integration with the new refget sequence collections GA4GH standard, and other standards.

In addition, there are other smaller improvements to be made. Here's a bit more detail for each of these:

1. Deployment of custom asset classes and recipes

This system is working in beta; we just need to polish and deploy it. It might be worth considering whether the implementation could be improved with the new tools we have available now.

One thing to do is deploy something like the schema.databio.org front-end to provide a nice user interface for the current GitHub-backed recipes/assets repository.

2. How could refgenie use a database backend?

Goal: use a postgres database for the refgenieserver back-end, but still allow an individual user to use a config file to store the information. Why? A database would be much easier for huge build processes to update than a giant yaml file. The challenge is that refgenie and refgenieserver both use the same config file.

We have already built pipestat, which allows using either a yaml file or a postgres database for pipeline results. So, could we just use this to manage the refgenie config? If we can, that would be ideal: it would simplify the code and reduce maintenance. Three possible levels of using pipestat:

  1. All the way. Is the existing pipestat implementation flexible enough for the specifics needed by refgenie's data model?
  2. Partial. If not, would it be possible to use pipestat for some aspects, and then build some custom functionality around it?
  3. Inspiration. If not, can we use lessons learned from pipestat development to create our custom solution?
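
To make the file-or-database goal concrete, here is a minimal sketch of one storage interface with two interchangeable backends. Everything in it is hypothetical: `ConfigBackend`, `FileBackend`, `SqliteBackend`, `register_asset`, and the `genome/asset:tag` key format are illustration names, not the pipestat or refgenconf API. JSON and SQLite from the standard library stand in for the yaml file and the postgres database.

```python
import json
import sqlite3
from pathlib import Path
from typing import Optional, Protocol


class ConfigBackend(Protocol):
    """Hypothetical interface: the only thing callers depend on."""

    def put(self, key: str, record: dict) -> None: ...
    def get(self, key: str) -> Optional[dict]: ...


class FileBackend:
    """Stores all records in one JSON file (stand-in for the yaml config)."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)

    def _load(self) -> dict:
        if self.path.exists():
            return json.loads(self.path.read_text())
        return {}

    def put(self, key: str, record: dict) -> None:
        data = self._load()
        data[key] = record
        # The whole file is rewritten on every update, which is the cost
        # that motivates a database for large build processes.
        self.path.write_text(json.dumps(data, indent=2))

    def get(self, key: str) -> Optional[dict]:
        return self._load().get(key)


class SqliteBackend:
    """Stores one row per record (stand-in for a postgres table)."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS assets (key TEXT PRIMARY KEY, record TEXT)"
        )

    def put(self, key: str, record: dict) -> None:
        # Only the affected row is touched.
        self.conn.execute(
            "INSERT OR REPLACE INTO assets (key, record) VALUES (?, ?)",
            (key, json.dumps(record)),
        )
        self.conn.commit()

    def get(self, key: str) -> Optional[dict]:
        row = self.conn.execute(
            "SELECT record FROM assets WHERE key = ?", (key,)
        ).fetchone()
        return json.loads(row[0]) if row else None


def register_asset(backend: ConfigBackend, genome: str, asset: str, tag: str, path: str) -> str:
    """Client and server code would call this without knowing the backend."""
    key = f"{genome}/{asset}:{tag}"
    backend.put(key, {"genome": genome, "asset": asset, "tag": tag, "path": path})
    return key
```

If refgenie's data model fits an interface this narrow, the "all the way" option looks plausible; the more refgenie-specific operations it needs beyond get and put, the more the "partial" option applies.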

3. How should refgenie integrate GA4GH seqcol/refget?

Incompatibilities in current backends

SeqColAPI relies on a SeqColHenge object, which uses henge as its backend storage for sequence collections, but RefGenConf uses YAMLConfigManager as its backend for sequence collection (refgenie asset) info. It would be nice if these used the same back-end, so there was just one data store for a refgenieserver. Could both RefGenConf and SeqColHenge update to use pipestat? Then there would be a single interface, which would make it easier to manage the refgenieserver back-end storage.

Ways they could interact

  • seqcol already has a load_from_refgenie function, so if you first stick the asset into refgenie, then we can just call that to load it into the seqcol storage.
  • add something to refgenie to provide a link to the specific zipped fasta file?
  • refgenie should use the seqcol digest function when computing its digest. Maybe the digest could just be returned from the load_from_refgenie call.
  • refgenieserver will use the api_router from refget to attach endpoints for the seqcol API, allowing refgenieserver to implement the seqcol API. Basically, the seqcolapi routes need to be usable in two places: first in the seqcolapi service, but I also want refgenieserver to import them and mount them as a router, so I can reuse them. So, I moved the seqcolapi FastAPI code into a module in the refget package. You can now import the seqcol_router and attach it to a FastAPI app. Instructions are in refget/seqcol_router.py. I also have a dev environment that doesn't require installing refget in order to work on the seqcolapi endpoints.
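
For the digest bullet above, here is a rough sketch of what a seqcol-style digest involves, so it is clear what refgenie would be adopting. This is not the refget package's implementation. `sha512t24u` is the GA4GH digest (SHA-512 truncated to 24 bytes, base64url-encoded). The `canonical_json` helper only approximates RFC 8785 canonicalization. The choice of inherent attributes and the `SQ.` prefix on sequence digests are assumptions passed in by the caller, since the schema in use defines them.

```python
import base64
import hashlib
import json


def sha512t24u(data: bytes) -> str:
    """GA4GH digest: SHA-512, truncated to 24 bytes, base64url-encoded."""
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")


def canonical_json(value) -> bytes:
    # Approximation of RFC 8785 (JCS): sorted keys, no whitespace, UTF-8.
    # Adequate for the strings and integers used here; not a full JCS implementation.
    return json.dumps(
        value, separators=(",", ":"), sort_keys=True, ensure_ascii=False
    ).encode("utf-8")


def seqcol_digest(collection: dict, inherent: tuple) -> str:
    """Two-level digest: digest each attribute array, then the object of digests.

    Which attributes are inherent is an assumption supplied by the caller.
    """
    level1 = {attr: sha512t24u(canonical_json(collection[attr])) for attr in inherent}
    return sha512t24u(canonical_json(level1))


# Toy collection; the sequences and the "SQ." prefix are illustrative assumptions.
collection = {
    "names": ["chr1", "chr2"],
    "lengths": [4, 3],
    "sequences": ["SQ." + sha512t24u(b"ACGT"), "SQ." + sha512t24u(b"TTT")],
}
digest = seqcol_digest(collection, inherent=("names", "sequences"))
print(digest)
```

Because each array is digested as an ordered list, reordering the sequences changes the top-level digest, which is why order-aware comparison matters when matching local genomes to remote ones.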

TODO items and next steps strategy

Exploratory items

  • Can I get seqcolapi embedded as a router in refget package? Yes, I did this.
  • Can I get pipestat operating as a henge backend for SeqColHenge? Yes, I did this.
  • Can I get pipestat operating as a backend for refgenconf? This is the hardest one.
  • Can I build a SeqColPipestatBackend that doesn't use henge, to avoid henge's complexities? Probably... it's basically a rewrite of SeqColHenge.
  • Should we move the 3-package structure (refgenconf, refgenie, and refgenieserver) back into a single repository, with optional dependency blocks? Or maybe just integrate refgenconf into refgenie?
  • It would be nice to have a simple local web server to browse your own local genomes. It could even have a web interface for grabbing remote assets.
  • Would it be useful to have a refgenie init that provides a prompt-based wizard for creating a new config?
  • Should we develop a simple local web UI for curating local assets?

Easy tasks

  • digests should switch to new refget sequence collections standard
  • endpoints for retrieving assets change to implement DRS, instead of ad-hoc remote structure (API update)
  • new recipe sharing system should be finished and launched
  • server should record download statistics (may correspond to switching to a database instead of a file)
  • content should be expanded to include all HPRC reference genomes
  • prep for yacman 1.0, eliminating attmap dependency
  • documentation should be merged into one location (isn't it already like this?); or re-deployed at new URL.
  • server should provide new seqcol API
  • client should provide access to the new seqcol API (e.g. comparison of local genomes to remote ones)
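
As an illustration of the local-versus-remote comparison mentioned in the last item, here is a simplified sketch. `compare_arrays` and its output keys are hypothetical; the seqcol specification defines its own comparison output, which this does not reproduce. The idea is just to report how two attribute arrays (for example, sequence digests) overlap and whether the shared elements appear in the same order.

```python
def compare_arrays(local: list, remote: list) -> dict:
    """Summarize how two attribute arrays overlap (counts are of unique elements)."""
    shared = set(local) & set(remote)
    # The order check considers only shared elements, in each array's own order.
    local_shared = [x for x in local if x in shared]
    remote_shared = [x for x in remote if x in shared]
    return {
        "local_only": len(set(local) - shared),
        "remote_only": len(set(remote) - shared),
        "shared": len(shared),
        "shared_same_order": local_shared == remote_shared,
    }


result = compare_arrays(["a", "b", "c"], ["b", "a", "d"])
print(result)
```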

Harder tasks

  • build system should be simplified and revamped, and be driven by a PEP on PEPhub
  • web interface could be rewritten in react (?)
  • Backend should use pipestat for refgenconf, so you can either use a database or a file for configuration. (Might be too much trouble).

nsheff changed the title from "Refgene roadmap" to "Refgenie roadmap" on Feb 22, 2024