-
-
Notifications
You must be signed in to change notification settings - Fork 31k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix awkwardness of statistics.mode() for multimodal datasets #80073
Comments
The current code for mode() does a good deal of extra work to support its two error outcomes (empty input and multimodal input). That latter case is informative but doesn't provide any reasonable way to find just one of those modes, where any of the most popular would suffice. This arises in nearest neighbor algorithms for example. I suggest adding an option to the API: def mode(seq, *, first_tie=False):
if tie_goes_to_first:
# CHOOSE FIRST x ∈ S | ∄ y ∈ S : x ≠ y ∧ count(y) > count(x)
return return Counter(seq).most_common(1)[0][0]
... Use it like this:
With the current API, there is no reasonable way to get to 'A' from 'ABBAC'. Also, the new code path is much faster than the existing code path because it extracts only the 1 most common using min() rather than the n most common which has to sort the whole items() list. New path: O(n). Existing path: O(n log n). Note, the current API is somewhat awkward to use. In general, a user can't know in advance that the data only contains a single mode. Accordingly, every call to mode() has to be wrapped in a try-except. And if the user just wants one of those modal values, there is no way to get to it. See https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mode.html for comparison. There may be better names for the flag. "tie_goes_to_first_encountered" seemed a bit long though ;-) |
Could it may be an alternative to set the mode tie case in a form like: def mode(seq, *, case=CHOOSE_FIRST):
[...] (or TIE_CHOOSE_FIRST, ...) where CHOOSE_FIRST is an enum ? Thanks! |
Thanks Raymond for the interesting use-case. The original design of mode() was support only the basic form taught in I think it is time to consider a richer interface to support more uses, One reason I've been hesitant is that deciding what is the right
I like Francis' suggestion to use an enum to select the behavioural, er, |
I would stick with "first_tie=False". That caters to the common case, avoids API complications, and does something similar to what other stats packages are doing. ----- API Survey ---- Maple: """This function is only guaranteed to return one potential mode - in cases where multiple modes exist, only the first detected mode is guaranteed to be returned.""" -- https://www.maplesoft.com/support/help/maple/view.aspx?path=Statistics%2fMode R: 'R does not have a standard in-built function to calculate mode.' -- https://www.tutorialspoint.com/r/r_mean_median_mode.htm Matlab: "When there are multiple values occurring equally frequently, mode returns the smallest of those values. ... If A is an empty 0-by-0 matrix, mode(A) returns NaN." -- https://www.mathworks.com/help/matlab/ref/mode.html Mathematica: "The mode of a set of data is implemented in the Wolfram Language as Commonest[data]. ... When several elements occur with equal frequency, Commonest[list,…] picks first the ones that occur first in list." -- http://mathworld.wolfram.com/Mode.html and https://reference.wolfram.com/language/ref/Commonest.html SciPy: "If there is more than one such value, only the smallest is returned." -- https://docs.scipy.org/doc/scipy-0.19.1/reference/generated/scipy.stats.mode.html SAS: "Most frequent value (if not unique, the smallest mode)" -- https://documentation.sas.com/?docsetId=qcug&docsetTarget=qcug_capability_sect225.htm&docsetVersion=14.2&locale=en |
Good point Raymond! Only a minor observation on the packages API: [1] SciPy: scipy.stats.mode(a, axis=0, nan_policy='propagate') And use of different policies on parameters : Equivalent one could say 'multivalue_policy' [2] Matlab: Mode: "Most frequent **values** in array" ...returns the sample mode of A, which is the most frequently occurring *value* in A...." IMHO it seems inconsistent *values* vs. *value* (or a doc-bug ?). An a question: In that case the user has the chance to get the first, the last or just all, ... ---- |
IMHO, we don't need to add the option. We can return the smallest value from the **table** instead of the code below. if len(table) == 1:
return table[0][0] [1] https://github.com/python/cpython/blob/master/Lib/statistics.py#L502 |
Internally, that does too much work and then throws most of it away. The difference between Counter(data).most_common()[1] and Counter(data).most_common(1) is that the former materializes the entire items iterator into a list and does a full sort, while the latter is memory friendly and makes a single pass with min(). |
I've been thinking about this a bit more. ISTM that for now, it's best to get mode() working for everyday use, returning just the first mode encountered. This keeps the signature simple (Iterable -> Scalar). In the future, if needed, there is room to add separate functions with their own clean signatures and use cases. For example, having a separate multimode() that always returns a list would be clearer and easier to use than setting a flag on the current scalar valued mode() function. Another example would be mode_estimate() that returns a scalar float Categorical, binned, or ordinal data: mode(data: Iterable, *, first_tie=False) -> object Continuous data: mode(data: Iterable[Real]) -> Real |
I only tested stats.mode() from scipy, data = 'BBAAC' should also return 'A'. But in your code **return return Counter(seq).most_common(1)[0][0]** will return B. Did I miss something? |
Yes. It doesn't really matter which mode is returned as long as it is deterministically chosen. We're proposing to return the first mode rather than the smallest mode. Scipy returns the smallest mode because that is convenient given that the underlying operation is np.unique() which returns unique values in sorted order [1]. We want to return the first mode encountered because that is convenient given that the underlying operation is max() which returns the first maximum value it encounters. Another advantage of return-first rather than return-smallest is that our mode() would work for data values that are hashable but not orderable (i.e. frozensets). [1] https://github.com/scipy/scipy/blob/v0.19.1/scipy/stats/stats.py#L378-L443 |
The attraction to having a first_tie=False flag is that it is backwards compatible with the current API. However on further reflection, people would almost never want the existing behavior of raising an exception rather than returning a useful result. So, we could make first_tie the default behavior and not have a flag at all. We would also add a separate multimode() function with a clean signature, always returning a list. FWIW, MS Excel evolved to this solution as well (it has MODE.SGNL and MODE.MULT). This would give us a clean, fast API with no flags:
ISTM this is an all round win:
However, there would be an incompatibility in the API change. It is possible that a person really does want an exception instead of a value when there are multiple modes. In that case, this would break their code so that they have to switch to multimode() as a remediation. OTOH, it is also possible that people have existing code that uses mode() but their code has a latent bug because they weren't using a try/except and haven't yet encountered a multi-modal dataset. So here are some options:
For any of those options, we should still add a separate multimode() function. |
I think right now we can
as well as
In Python 3.9, we change the behavior. |
This seems reasonable to me due legacy (although I really thing that
What should return in that case: E.g.: mode([42.0, 42.0, 42.0, 1.0, 1.0, 1.0]) ? 42.0 ? or 1.0 ? or [42.0, 1.0] ? or do I have misunderstood something ? Thanks! |
Good options itemization!
LGTM upto 3.9, shouldn't be "mode" at some point be replaced by |
No. The signature is completely incompatible with the existing mode() function. Like MS Excel which has two functions, MODE.SGNL and MODE.MULT, we should also have two functions, each with a clean signature and with running speed that is optimal for the desired result: mode(data: Iterable) -> object multimode(data: Iterable) -> List[object] For the first one, I recommend skipping deprecation and just changing the behavior that usually raises an exception for multiple modes. In its current form, it likely to create bugs due to uncaught exceptions, and it doesn't provide any work-around if you do want only one of the modes. By analogy, consider what would happen if we had a max() function that raised an exception when there were duplicate maximum values. It would be almost usable. So, what the real max() actually does is return the first encountered maximum value: >>> max(3, 1, 3.0)
3
>>> max(3.0, 1, 3)
3.0 |
Executive summary:
Further comments in no particular order: I agree that in practice the current implementation of mode has turned out to be quite awkward. Unfortunately I feel that just chosing one mode and returning it as "the" mode is statistically dubious (despite it being commonplace!). I've checked with some statisticians, or at least people who claim to be statisticians on the Internet, and they agree. (The usual advice from statisticians is to always plot your data before talking about the mode or modes, as even local maxima can be meaningful.) Nevertheless, Raymond's use-case of the nearest neighbor algorithm is persuasive. We should change the mode function to return one of the modes. Which one? Again, people I've spoken to suggest that there's no statistical basis to picking any specific one (first found, last found, smallest, largest). I'm unhappy about documenting which mode is returned, as I expect it will be an implementation detail that could change in the future. Raymond, are you happy to leave the specific detail of which mode is returned unspecified? Or do you have an argument for why we should specify it, and can do so without limiting future implementations? I too strongly prefer a separate multimode function over a flag to the mode function. As for backwards-compatibility, I've asked around trying to find anyone who claims that changing the behaviour would break their code. I haven't found anyone. I guess that means that unless any of the other core devs object, I'm happy to go with Raymond's recommendation to just change the behaviour without a depreciation period. |
What do people think about leaving this as an "Easy First Issue" for the sprint? If others agree that it is sufficiently easy, we can assign the task to Cheryl. It should be fairly easy: mode calls an internal function _counts which is not public and not used anywhere else, so it can be changed completely or even discarded if needed. |
It's only easy if we clearly specify what we want to occur. Deciding what the right behavior should be is not a beginner skill. Proposed spec: TestCases: Implementation: Documentation:
When: |
Are you happy guaranteeing that it will always be the first mode
That should be 'b'.
And again 'b'.
If it is first encountered, that should be 'd'. |
Yes. All of the other implementations I looked at make some guarantee about which mode is returned. Maple, Matlab, and Excel all return the first encountered.¹ That is convenient for us because it is what Counter(data).most_common(1) already does and does cheaply (single pass, no auxiliary memory). It also matches what a number of our other tools do: >>> max(3, 3.0) # 3 is first encountered
3
>>> max(3.0, 3) # 3.0 is first encountered
3.0
>>> list(dict.fromkeys('aabbaacc'))[0] # 'a' is first encountered
'a'
>>> sorted([3, 3.0])[0] # 3 is first encountered (due to sort stability)
3
>>> sorted([3.0, 3])[0] # 3.0 is first encountered (due to sort stability)
3.0 ¹ Scipy returned the smallest value rather than first value but it algorithm was sorting based to accommodate running a parallel mode() computation on multiple columns of an array. For us, that approach would be much slow, would require more memory, and would require more bookkeeping. P.S. I'm no longer thinking this should be saved for Pycon sprints. That is right at the beta 1 feature freeze. We should aim to get this resolved well in advance of that. |
Attached a draft PR for discussion purposes. Let me know what you think (I'm not wedded to any part of it). |
Steven, are you okay with applying this PR so we can put this to rest, cleanly and permanently? |
Here's a text only link to the patch: https://patch-diff.githubusercontent.com/raw/python/cpython/pull/12089.patch |
Looks good to me, I'm happy to accept it. Thank you for your efforts Raymond, can I trouble you to do the merge yourself please, I'm still having issues using the Github website. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: