Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Multiple values for 'n_jobs' or 'threads' #18

Merged
merged 20 commits into from
Jun 15, 2020

Conversation

ChrisKeefe
Copy link
Collaborator

Attempts to diagnose and fix a bug initially discovered here, in which beta diversity n_Jobs and threads parameters are being passed multiple values when called through the framework with 'auto'.

@ChrisKeefe ChrisKeefe marked this pull request as draft June 11, 2020 17:37
@thermokarst
Copy link
Contributor

I was doing a whole-repo review this morning, as promised, and this bit of code looks like it has issues, which are probably related to the ci error:

if cpus_requested == 'auto':
# remove 'auto' from args to prevent 'multiple values' TypeError...
argslist = list(args)
argslist.remove('auto')
return_args = tuple(argslist)
# ...then inject number of available cpus
return wrapped_function(*return_args, **kwargs, **{param_name: cpus})


Not sure if this is just-copy-and-paste error, but the original error you linked to above says:

Multiple values for argument thread

while the ci error is

TypeError: unweighted_unifrac() got multiple values for argument 'threads'

Please note the difference in error, in particular, the offending variable name (thread vs threads).

@thermokarst
Copy link
Contributor

Here is a candidate for an idiomatic solution: https://docs.python.org/3.6/library/inspect.html#inspect.Signature.replace

@thermokarst
Copy link
Contributor

Here is an draft implementation, in case you get stuck: ChrisKeefe/q2-diversity-lib@multiple_values...thermokarst:multiple_values_thermokarst

I wanted to get my thoughts in order, which wound up producing a functional result, figured I would share.

Co-authored by: thermokarst <matthewrdillon@gmail.com>
Co-authored by thermokarst <mathewrdillon@gmail.com>
Co-authored-by: thermokarst <matthewrdillon@gmail.com>
@ChrisKeefe
Copy link
Collaborator Author

Here is an draft implementation, in case you get stuck: ChrisKeefe/q2-diversity-lib@multiple_values...thermokarst:multiple_values_thermokarst

This is much prettier (and more direct) than whatever I was doing with argument order convention/keyword-only arguments. The docs, as usual, spell it out. Thanks!

q2_diversity_lib/_util.py Outdated Show resolved Hide resolved
q2_diversity_lib/_util.py Outdated Show resolved Hide resolved
q2_diversity_lib/_util.py Show resolved Hide resolved
q2_diversity_lib/_util.py Outdated Show resolved Hide resolved
q2_diversity_lib/_util.py Outdated Show resolved Hide resolved
q2_diversity_lib/tests/test_beta.py Outdated Show resolved Hide resolved
self.unweighted_unifrac_thru_framework(self.table_as_artifact,
self.tree_as_artifact,
threads=2)
self.unweighted_unifrac_thru_framework(self.table_as_artifact,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good tests! Do you have a similar test, for checking if n_jobs == auto works as expected?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hoo boy am I glad you asked. New commit with breaking test incoming. Something very messy is happening. I suspect it's related to the fact that everything is being unpacked into positional arguments in our return statement, but I'm really not sure.

Some notes:
jaccard fails with TypeError: jaccard() got multiple values for argument 'n_jobs'
unifrac methods don't trigger test failures, however...
when jaccard fails, a stderr message is exposed that comes from from unifrac that makes me think it's getting a memory address or something instead of the integer we're trying to deliver:

------------------------------------ Captured stderr call -------------------------------------
More threads were requested than stripes. Using 342021232 threads.

The number in the error message is a different large integer (positive or negative) every time.
These stderror lines appear only when there is a test that uses auto. Integer arguments are clean, which doesn't seem to jive well with my current hypothesis on why this is happening. 🤷‍♂️

Copy link
Contributor

@thermokarst thermokarst Jun 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you might've found a bug in the somewhere in the framework or test harness! First off, the main problem is in the test data itself:

        self.jaccard_thru_framework(self.table_as_artifact,
                                    self.tree_as_artifact,
                                    n_jobs=2)

Note, you are passing in a tree as the second param, even though this method doesn't take a tree. That appears to be your second n_jobs (which is what I think might be a bug in the framework).

While playing with this, my next thought was, "well, is this somehow related to the decorator signature munging?" - I removed the decorators and tested, the answer appears to be "no". Put another way, I don't see how this could be related to your decorator utils (yet). This test fails either way, decorated or not.

Applying the following patch to 8dfdb0b has a 100% passing test suite:

diff --git a/q2_diversity_lib/tests/test_util.py b/q2_diversity_lib/tests/test_util.py
index 13cacab..8db7a63 100644
--- a/q2_diversity_lib/tests/test_util.py
+++ b/q2_diversity_lib/tests/test_util.py
@@ -181,10 +181,8 @@ class ValidateRequestedCPUsTests(TestPluginBase):
                                                self.tree_as_artifact,
                                                threads='auto')
         self.jaccard_thru_framework(self.table_as_artifact,
-                                    self.tree_as_artifact,
                                     n_jobs=2)
-        # self.jaccard_thru_framework(self.table_as_artifact,
-        #                             self.tree_as_artifact,
-        #                             n_jobs='auto')
+        self.jaccard_thru_framework(self.table_as_artifact,
+                                    n_jobs='auto')
         # If we get here, then it ran without error
         self.assertTrue(True)

Copy link
Contributor

@thermokarst thermokarst Jun 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's a MWE:

import qiime2 
from qiime2.plugins import demux 
dmx = qiime2.Artifact.load('../data/moving-pictures/demux.qza')
demux.actions.summarize(dmx, dmx, dmx, dmx, n=dmx) 

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-87df907aed63> in <module>
      2 from qiime2.plugins import demux
      3 dmx = qiime2.Artifact.load('../data/moving-pictures/demux.qza')
----> 4 demux.actions.summarize(dmx, dmx, dmx, dmx, n=dmx)

TypeError: summarize() got multiple values for argument 'n'

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify my thinking RE a bug in the framework. I know that this "got multiple values" business is just python doing what python does, but I am thinking there is a guard or two that we might be missing, although I'm not sure.

Copy link
Collaborator Author

@ChrisKeefe ChrisKeefe Jun 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally agreed on this - seems like we should be doing a better job of checking argument counts somewhere. I'll go poking around with this in a little bit.

For now, I've figured out what's causing the stderr messages from unifrac, I'm just not sure why they display the numbers that they do. In every case where that stderror output appears, we were working with a small (3-sample) table, which unifrac can handle using a maximum of two threads. Using a larger table removes the issue, as does requesting a smaller number of threads. The 'auto' parameter had nothing to do with it, beyond that it was requesting 3 threads, one more than the max.

I'm wondering:

  • Why is unifrac reporting using 46530796 or -132493215 threads? ✔️
  • Should we be "catching" this error somehow, and passing a warning along to users? ❌
  • PR raised on unifrac ✔️

@ChrisKeefe
Copy link
Collaborator Author

Quick summary for posterity:
Based on this and this, @thermokarst hypothesized psutil might be picking up the host machine's CPU count, rather than the VM's CPU count. Some digging into the psutil issue tracker (1, 2) suggest that psutil is currently unable to generate accurate physical cpu counts on systems with multiple sockets. This may be fixed here, but for now there's not much we can do about it.

Sooooo, I'm patching psutil in the breaking tests temporarily to prevent failure. Issue incoming to keep this on the radar.

@ChrisKeefe ChrisKeefe marked this pull request as ready for review June 12, 2020 23:12
@thermokarst thermokarst merged commit 8f4e655 into qiime2:master Jun 15, 2020
@ChrisKeefe ChrisKeefe deleted the multiple_values branch November 30, 2020 00:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants