Generate BioContainer images #2281

Open
Zethson opened this issue Jun 20, 2022 · 19 comments

Comments

@Zethson
Member

Zethson commented Jun 20, 2022

Hey,

we moved from Bioconda to conda-forge -> #1169

Unfortunately, this comes at a cost that is highly relevant for pipeline building. nf-core wants to add support for scverse data structures: nf-core/scrnaseq#68. The issue is that Bioconda autogenerates Docker & Singularity containers, which Nextflow pipelines always use to support all execution environments. conda-forge does not. The official Docker Hub image is stuck on an old version and, when used, lacks the procps package that Nextflow uses to track execution.

How serious are the Bioconda issues? Can we resolve them and move back? I'd like to avoid also having to manage our own container releases, and I love the automated container building by Bioconda.
I was also made aware by @apeltzer that bio-specific tools should live in bioconda. Choosing to put them in conda-forge is not really desired.

@ivirshup @flying-sheep

CC @drpatelh @fmalmeida @apeltzer @grst

Zethson added the Area – External (Dealing with external tools) and Installation labels Jun 20, 2022

@ivirshup
Member

People outside of biology have used anndata, so we don't want to tie it to bioconda. Could one point whatever generates the bioconda images at a conda-forge package?

@grst
Contributor

grst commented Jun 20, 2022

I'm afraid it's the bioconda CI that generates them.

Switching back to bioconda has another caveat, I think: the conda-forge channel is supposed to have a higher channel priority than the bioconda one. AFAIK, if a more recent version of scanpy was on bioconda, it would still be the older conda-forge version that gets installed (unless the newer version is requested explicitly).
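
To make the concern concrete, here is a toy Python model of "channel priority outranks version". It is not conda's actual solver: the `CHANNELS` table, its version numbers, and the `pick` helper are all made up for illustration. It only mirrors the behaviour described above, which as far as I know roughly matches conda's flexible channel priority; with strict priority an explicit request for a version that only exists on the lower-priority channel may simply fail.

```python
# Toy model of conda channel priority; NOT conda's real solver.
# Channel contents and version numbers are hypothetical.

CHANNELS = {
    # Listed in priority order: earlier entries have higher priority.
    "conda-forge": {"scanpy": ["1.8.2", "1.9.1"]},
    "bioconda": {"scanpy": ["1.9.1", "1.9.2"]},
}


def version_key(version):
    """Turn '1.9.1' into (1, 9, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def pick(package, version=None):
    """Return (channel, version) for a package.

    Without an explicit version, the highest-priority channel that carries
    the package wins, even if a lower-priority channel has a newer release.
    With an explicit version, the first channel carrying it is used.
    """
    for channel, packages in CHANNELS.items():
        versions = packages.get(package, [])
        if version is None and versions:
            return channel, max(versions, key=version_key)
        if version in versions:
            return channel, version
    raise LookupError(f"{package} not found")


print(pick("scanpy"))                   # higher-priority channel wins, older version
print(pick("scanpy", version="1.9.2"))  # explicit request reaches the other channel
```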

@ivirshup
Member

> I'm afraid it's the bioconda CI that generates them.

AFAICT it's using mulled, so it should be very straightforward. Galaxy also uses this, so there's probably an anndata container being generated there.

Also there is no mention of singularity in the bioconda docs.

@grst
Contributor

grst commented Jun 20, 2022

tbh, I don't know what generates the singularity containers, but as far as I can tell they are a mirror of the biocontainers: https://depot.galaxyproject.org/singularity/

The latest versions of scanpy and anndata there are the same as on bioconda.

@drpatelh

> it would still be the older conda-forge version that gets installed

This could be overcome by pinning the channel but I agree it could be an issue if not explicitly specified.

> Also there is no mention of singularity in the bioconda docs

This is the GitHub repo used for all of the automation and is hosted by the Galaxy project:
https://github.com/BioContainers/singularity-build-bot

As others have mentioned here, it would be awesome to have the latest versions of scanpy on Bioconda because it is the primary channel for most bioinformatics tools. This also allows other communities like nf-core to piggyback off their automation to make our lives easier when writing reproducible, standardised workflows.

@grst did come up with a couple of workarounds like adding a mulled container with scanpy but that adds a maintenance overhead to keep things up-to-date. How much work would it be to make this happen @ivirshup and would you be willing to help?

Thanks!

@ivirshup
Member

I'm pretty against moving back to bioconda.

I would be more up for adding a job somewhere that makes containers for various scverse tools. It looks like BioContainers would make sense as a place for this? In theory it could just be adding lines to packages.tsv in BioContainers/mulled.

Maybe we should ask someone over there how this could be done?

@ivirshup
Member

Btw

mulled-build build --singularity 'anndata=0.8.0,scanpy=1.9.1'

Seems to work fine
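
If this ends up being scripted, a minimal Python sketch for assembling that same command from a mapping of pins could look like the following. The helper name `mulled_build_command` is hypothetical, and the sketch only builds the argument list; actually running it assumes `mulled-build` is installed and on the PATH.

```python
import shlex


def mulled_build_command(pins, singularity=True):
    """Build the argv for `mulled-build build` from a {package: version} mapping."""
    # Sort by package name so the target string is deterministic.
    targets = ",".join(f"{name}={version}" for name, version in sorted(pins.items()))
    argv = ["mulled-build", "build"]
    if singularity:
        argv.append("--singularity")
    argv.append(targets)
    return argv


cmd = mulled_build_command({"scanpy": "1.9.1", "anndata": "0.8.0"})
print(shlex.join(cmd))
# To actually run it (requires mulled-build):
#   import subprocess; subprocess.run(cmd, check=True)
```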

@drpatelh

drpatelh commented Jun 20, 2022

If we need multiple tools in the same container the place to add it would be BioContainers/multi-package-containers as opposed to BioContainers/mulled. This is what I meant by adding a mulled container above and that would go some way to solving the problem because we would be able to get a Docker/Singularity Biocontainer. However, in the long term, it's always nicer if these containers come directly via recipe updates from the Bioconda community. The other alternative is that the package is eventually updated and maintained on Bioconda organically like with most other tools there but that's kinda out of our control unless we add it ourselves. Always makes sense to reach out to the tool developers first! 😎

> I'm pretty against moving back to bioconda.

Curious to know why and if it's something that can be overcome? I did see #1169

@ivirshup
Member

> If we need multiple tools in the same container the place to add it would be BioContainers/multi-package-containers

We do make heavy use of optional dependencies, so this might be the way to go regardless.

> Curious to know why and if it's something that can be overcome?

Practically

  • The documentation for bioconda has been incomplete and out of date for years.
  • conda-forge autoupdates recipes. When we make a pip release, a conda-forge release is automatically generated.
  • bioconda packages can depend on conda-forge packages, but not the other way around (last I checked at least). If we go on bioconda all our dependents do too – this could make it extremely painful to do a migration to bioconda.
  • All of our dependencies are on conda-forge
  • Fewer channels to search means easier, faster environment solving.

More philosophically

Why have separate package registries for biology vs everything else? Code for biology isn't particularly special, and much of the tooling/work here is duplicated effort. Why not just put all of bioconda onto conda-forge, but with a special tag saying they are bio packages? All the extra tooling/maintenance consortiums can be developed orthogonally to the registry.

I think there are very clear problems that come out of separate registries. It was a huge pain to install anything from BioJulia until they deprecated the BioJuliaRegistry. If bioconda didn't use its own build system there wouldn't be out-of-date docs for that build system.

It just seems like a lot of trouble to go through for unclear benefit.

I will admit, I think there were more benefits to this model ~a decade ago. But I think these benefits have been mitigated by significantly improved tooling for developing, building, and distributing packages.

@ivirshup
Member

> > If we need multiple tools in the same container the place to add it would be BioContainers/multi-package-containers
>
> We do make heavy use of optional dependencies, so this might be the way to go regardless.

Just saw there's already a pr for this!

@grst
Contributor

grst commented Jun 20, 2022

At this point I'm not a big fan of moving back to bioconda either.

  • anndata is not bio-specific and should go to conda-forge anyway
  • it's debatable if it was a mistake to move scanpy, but moving it back causes confusion and more harm than good IMO

> Why have separate package registries for biology vs everything else?

probably because bioconda predates conda-forge?

> Just saw there's already a pr for this!

BioContainers/multi-package-containers#2209

The only downside of this is that we need to update that file manually for every release of scanpy/anndata

@fmalmeida

fmalmeida commented Jun 20, 2022

> Just saw there's already a pr for this!

I opened it earlier today as a workaround, but I don't know if the file is properly defined.

@grst
Contributor

grst commented Jun 20, 2022

@bgruening, is there any recommended way to generate Biocontainers for packages that are only on conda-forge?

Would it be ok to hijack the multi-package-containers for that?

@ivirshup
Member

> > Why have separate package registries for biology vs everything else?
>
> probably because bioconda predates conda-forge?

That would make sense! I think things like bioconda and the bioconductor registry were good things to start and have been very important. I just think some of the initial design decisions are now outdated.

> The only downside of this is that we need to update that file manually for every release of scanpy/anndata

Seems github action-able?
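
As a rough sketch of the core step such an action might run, here is some Python that rewrites the version pins in a comma-separated target string like the one passed to `mulled-build` above. How new versions get discovered and the exact layout of the file in multi-package-containers are deliberately left out. `bump_targets` is a hypothetical helper and `1.9.2` is just a placeholder for "some newer release".

```python
def bump_targets(targets, new_versions):
    """Return `targets` with versions replaced for packages in `new_versions`.

    `targets` is a comma-separated list of `name=version` pins, e.g.
    'anndata=0.8.0,scanpy=1.9.1'. Packages missing from `new_versions`
    and entries without a version pin are kept unchanged.
    """
    bumped = []
    for pin in targets.split(","):
        name, sep, version = pin.partition("=")
        if not sep:
            # No version pin on this entry; leave it alone.
            bumped.append(pin)
            continue
        bumped.append(f"{name}={new_versions.get(name, version)}")
    return ",".join(bumped)


old = "anndata=0.8.0,scanpy=1.9.1"
new = bump_targets(old, {"scanpy": "1.9.2"})  # placeholder for a newer release
print(new)
if new != old:
    print("targets changed; a CI job could open a PR with the updated line")
```

A scheduled workflow could run something like this and only open a pull request when the string actually changes, so releases that touch neither package produce no noise.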

@bgruening

@grst yes it is ok to use https://github.com/BioContainers/multi-package-containers/ for conda-forge packages, many people do this.

There is some misinformation in this thread; let me know if there is interest in resolving it.

My short comment here, as a conda-forge and bioconda admin: use conda-forge whenever you (the community) can justify the extra effort. It has the cleaner but more "expensive" build system.

@ivirshup
Member

@bgruening I would be interested to hear your perspective on this, and hear any corrections.

Apologies if any comments were unfair. My comments were definitely coloured by painful memories of debugging via CI for a release that just went live – which is always an emotional experience 😅.

One thing I was wrong about: bioconda does have autoupdates. For whatever reason, it just looks like the scanpy recipe required a fair bit of manual intervention.

@ivirshup
Member

@Zethson, would it be fair to retitle this something like "Generate BioContainer images", or should that be a separate issue?

@Zethson
Member Author

Zethson commented Jun 20, 2022 via email

ivirshup changed the title from "Moving back to Bioconda (or both?)" to "Generate BioContainer images" Jun 20, 2022

@apeltzer

apeltzer commented Jun 21, 2022

From experience, I don't really feel that conda-forge is much more complicated to maintain; at least it didn't feel too bad when I added a few recipes there in the past. My concerns are mostly fueled by passively reading the bioconda channel for years and memorizing this rule of thumb regarding where to put recipes:

Anything bio-specific --> bioconda
Anything else --> conda-forge

If this does not hold true (anymore?), @bgruening, I believe one could still stay with conda-forge and instead try to maintain our own BioContainers (we would need to check with the folks there whether uploading would be fine for them, etc.).

> The documentation for bioconda has been incomplete and out of date for years.

It could be better, but most of the points are still valid and with some help from the community recipes are still created fine ;-)

> conda-forge autoupdates recipes. When we make a pip release, a conda-forge release is automatically generated.

Bioconda-bot does the same for you ;-)

> bioconda packages can depend on conda-forge packages, but not the other way around (last I checked at least). If we go on bioconda all our dependents do too – this could make it extremely painful to do a migration to bioconda.

That's not the case: e.g. when you move scanpy over, the libraries that are not bio-related can stay on conda-forge. That way, resolving will work. I am really not sure that the resolving will not take other channels into account, unless there are different versions of packages on various channels, e.g. a library on both conda-forge and bioconda, which would then be handled by channel priorities.

> All of our dependencies are on conda-forge

That's the case for the majority of bio tools - most rely on general-purpose tools ;-)

> Fewer channels to search means easier, faster environment solving.

mamba can help you here. At least for most of the conda recipes I have used (some have hundreds of dependencies in total, especially in multi-tool environments), I didn't notice much of a difference between using 1 and 2 channels ❓

And thanks all for the ongoing discussion, still learning things here and also getting new perspectives on the general topic here 👍🏻


7 participants