Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade to DADA2 >= 1.7.3 #85

Closed
thermokarst opened this issue Jan 25, 2018 · 21 comments · Fixed by #113
Closed

Upgrade to DADA2 >= 1.7.3 #85

thermokarst opened this issue Jan 25, 2018 · 21 comments · Fixed by #113

Comments

@thermokarst
Copy link
Contributor

As noted here, when that new release lands, we should update the pinned version here, as well as revert the SSE changes made in January 2018 in the wake of SSE-gate.

@thermokarst
Copy link
Contributor Author

@benjjneb, can you let us know which bioconductor release cycle you expect this to be available in, that way we can track it in the appropriate QIIME 2 release cycle. Thanks!

@benjjneb
Copy link
Collaborator

This won't get to release until the next Bioconductor release (BioC 3.7) which will be around April. Given that there will also be a delay in the BioC release propagating to bioconda, this would make sense for a June/July Q2 release.

@benjjneb
Copy link
Collaborator

benjjneb commented Jul 31, 2018

Still waiting on the May Bioconductor release to propagate to bioconda. I think progress will show up on this issue: bioconda/bioconda-recipes#8947

Once it migrates, I'm planning to update the R scripts, and to address several other outstanding issues at the same time including #87, #93, #97, #99

And note to self: Add an extra step catching and removing very low depth sequences before learning the error rates.

@ebolyen
Copy link
Member

ebolyen commented Aug 24, 2018

@benjjneb, in the new granular pipeline scheme, would it be possible to provide a way to handle sample pooling? This came up recently in the context of breakaway and alpha diversity estimation.

breakaway in particular requires singletons to exist, if there was a way to conditionally return FeatureTable[Frequency % Properties('singletons')] we would be able to enforce that assumption (when sample pooling is true or pseudo at least).

We're going to be trying to address TypeMap and similar typing related things in the next release cycle. If I need to implement a way to dependently type the output, that should be possible (especially with a real use-case now).

@benjjneb
Copy link
Collaborator

@benjjneb, in the new granular pipeline scheme, would it be possible to provide a way to handle sample pooling? This came up recently in the context of breakaway and alpha diversity estimation.

Yes. Although given the frustrating speed penalty that bioconda-dada2 has, pseudo-pooling might make more sense.

if there was a way to conditionally return FeatureTable[Frequency % Properties('singletons')] we would be able to enforce that assumption (when sample pooling is true or pseudo at least).

We're going to be trying to address TypeMap and similar typing related things in the next release cycle. If I need to implement a way to dependently type the output, that should be possible (especially with a real use-case now).

Have to admit, I don't really follow this. Basically option to a different return type if pool=TRUE (or pool=psuedo)?

@ebolyen
Copy link
Member

ebolyen commented Aug 25, 2018

Yep! Basically we'd just figure out when that property can be added, and tools that need it can require it as an input. Everything else will continue not caring like normal.

@epruesse
Copy link
Member

Bioconda is now at 1.10: http://bioconda.github.io/recipes/bioconductor-dada2/README.html

@benjjneb Is the speed penalty still there?

This blocks qiime2/q2-alignment#64

@benjjneb
Copy link
Collaborator

Bioconda is now at 1.10: http://bioconda.github.io/recipes/bioconductor-dada2/README.html

Interesting. It would actually be easier to just jump ahead to 1.10 as the edge case merging bugs were mostly worked out between 1.8 and 1.10.

@benjjneb Is the speed penalty still there?

It will be a little better than now, but as far as I coiuld tell the speed penalty is pretty hard to overcome completely, because it is a result of how bioconda compiles code (old version of gcc, low levels of optimization to handle generic hardware).

@thermokarst
Copy link
Contributor Author

old version of gcc

I wonder if the recent bump to gcc7 on conda-forge and bioconda will help us out here...

@epruesse
Copy link
Member

That's what I meant. If it's not -march dependent, then it should now all be OK.

Having -march=native compiled stuff won't work for Bioconda. We've had the discussion in a few places, but just picture a user with miniconda in $HOME and doing qsub on a cluster with a few generations of compute nodes added over time. Even on Travis you'll sometimes run into different CPUs.

The reason why I can't easily build a SINA version that matches the pinned libs is that we followed Conda-Forge in the move to GCC7. Since that brought the ABI change required to support C++11, all C++ libs except for libstdc++ are incompatible with the old ones.

@benjjneb
Copy link
Collaborator

I wonder if the recent bump to gcc7 on conda-forge and bioconda will help us out here...

Didn't realize this had happened! I will definitely re-profile the speed issue with the 1.10 bioconda builds. fingers crossed would be awesome if the performance delta narrows.

@epruesse
Copy link
Member

Didn't realize this had happened!

It's still happening I suppose. We are rebuilding more or less on an as-needed-basis because of resource constraints. Conda-Forge had been rebuilding for a while, and this January did "the label switch". That means all the GCC7 package previously in conda-forge/label/gcc7 became available through conda-forge. We tagged our pre-gcc7 state and started rebuilding. So far no major disasters have happened fingers crossed.

@benjjneb
Copy link
Collaborator

@epruesse A brief update, we have started getting segfault error reports for the dada2 package that appear to be coming from conda-installed versions of the 1.10 package, i.e. the new package version that is being built with GCC7. See for example: benjjneb/dada2#684

@benjjneb
Copy link
Collaborator

I've just run some initial tests on the bioconda version of DADA2 1.10 built with GCC7, and it appears that the speed penalty versus natively installed DADA2 is now essentially gone.

I will do some more detailed testing, but this would be a major quality-of-life improvement for folks using DADA2 via QIIME2 (up to a 10x speedup) and strongly suggests skipping the 1.8 package version and going straight to 1.10. Is that in the realm of possibility for the next Q2 release?

@apcamargo
Copy link

@benjjneb Did you solve the segfault error or it didn't happen with your samples?

@ebolyen
Copy link
Member

ebolyen commented Mar 13, 2019

Is that in the realm of possibility for the next Q2 release?

Yep! There's still a full month or so on the next release. The dates are also probably something we can flex if it comes down to it, the speed improvement would certainly be worth it.

@benjjneb
Copy link
Collaborator

@benjjneb Did you solve the segfault error or it didn't happen with your samples?

Works fine in my initial testing, but I haven't tried to reproduce the reported bioconda-specific segfault yet. Will try soon.

Yep! There's still a full month or so on the next release. The dates are also probably something we can flex if it comes down to it, the speed improvement would certainly be worth it.

OK, I am going to move forward with testing and updating the Q2 scripts based on package version 1.10 then. I'll need help from your side figuring out the Q2 conda recipes and any python updates to the plugin though.

@ebolyen
Copy link
Member

ebolyen commented Mar 13, 2019

I'll need help from your side figuring out the Q2 conda recipes and any python updates to the plugin though.

Sounds like a plan! Feel free to structure the R scripts into as many or few as you need. I can adapt the Python side to work correctly. I think we had talked about breaking up the denoise-* methods into multiple actions before (and maybe even including taxonomic assignment?). We have pipelines now, so I can re-compose the original denoise-* methods in Python if you think that's worth attempting with this.


Re segfault: Is it possible that it's just a blanket runtime conflict? e.g. glibc on CentOS 5 vs anything compiled in the last decade?

@benjjneb
Copy link
Collaborator

Just wanted to ping this thread with the DADA2 issue that is open on segfaults with bioconda-dada2-1.10: benjjneb/dada2#684

In testing on my local machine, the bioconda install of 1.10 seems fine, but clearly there is something going on here because there are multiple people with the same report.

@epruesse
Copy link
Member

It might be not just 1.10.

(segfault in 1.8) bioconda/bioconda-recipes#13847
(just references) bioconda/bioconda-recipes#13776

In 1.8, it appears to have been a libc++ issue? I can't tell. We need someone proficient in R for this.

@epruesse
Copy link
Member

Ok updates - my best guess is a concurrency bug somewhere in dada2. See the issue at dada2 repo.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants