return list of modes for a multimodal distribution instead of raising a StatisticsError #73142

sria91 · 2016-12-13T04:21:42Z

BPO	28956
Nosy	@rhettinger, @terryjreedy, @stevendaprano, @wm75, @sria91, @scotchka
PRs	getpass: update docstring #49; [backport to 3.5] bpo-27122: Fix comment to point to correct issue number (#47) #50; bpo-28956: Return list of modes for a multimodal distribution. #5732

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = <Date 2019-03-11.11:01:24.182>
created_at = <Date 2016-12-13.04:21:41.772>
labels = ['3.7', 'type-bug', 'library']
title = 'return list of modes for a multimodal distribution instead of raising a StatisticsError'
updated_at = <Date 2019-03-11.11:01:24.182>
user = 'https://github.com/sria91'

bugs.python.org fields:

activity = <Date 2019-03-11.11:01:24.182>
actor = 'steven.daprano'
assignee = 'none'
closed = True
closed_date = <Date 2019-03-11.11:01:24.182>
closer = 'steven.daprano'
components = ['Library (Lib)']
creation = <Date 2016-12-13.04:21:41.772>
creator = 'sria91'
dependencies = []
files = []
hgrepos = []
issue_num = 28956
keywords = ['patch']
message_count = 15.0
messages = ['283071', '283085', '283089', '283090', '283091', '283092', '283154', '283155', '283453', '312303', '312305', '337620', '337622', '337625', '337656']
nosy_count = 6.0
nosy_names = ['rhettinger', 'terry.reedy', 'steven.daprano', 'wolma', 'sria91', 'scotchka']
pr_nums = ['49', '50', '5732']
priority = 'normal'
resolution = 'rejected'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue28956'
versions = ['Python 3.7']

sria91 · 2016-12-13T04:21:42Z

Return the minimum of the modes for a multimodal distribution instead of raising a StatisticsError.

wm75 · 2016-12-13T08:50:13Z

What's the justification for this proposed change? Isn't it better to report the fact that there isn't an unambiguous result instead of returning a rather arbitrary one?

sria91 · 2016-12-13T09:35:22Z

A better choice would be to return a tuple of values (sliced from the
table) and let the user decide which one to use.

Hope that's justifiable...

Thanks & Regards
Srikanth Anantharam

stevendaprano · 2016-12-13T09:54:04Z

On Tue, Dec 13, 2016 at 09:35:22AM +0000, Srikanth Anantharam wrote:

> A better choice would be to return a tuple of values (sliced from the
> table). And let the user decide which one to use.

The current mode() function is designed for a very basic use-case, where
you have an obvious single mode from discrete data.

The problem with dealing with multiple modes is that it's not easy to
tell the difference between a genuinely multi-modal sample and one which
just happens to have a few samples with the same value:

data = [1, 2, 3, 4, 4, 4, 5, 6, 7, 7, 8, 8, 8, 8, 8, 8, 8, 9, 9]

Assuming the sampling is fair, 8 is clearly the mode; but is it bimodal
with 4 the second mode? Or perhaps even four modes, 8, 4, 7 and 9?

I have plans for introducing a binning function to collect data into
bins and run statistics on the bins. That might be a better way to deal
with multi-modal samples: if you bin the data (for discrete data, use a
bin size of 1) and then look at the frequencies, you can decide how many
modes there are.
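
As a rough illustration of the frequency-table idea described above (this is not the planned binning function, which does not appear in this thread), the sample can be tallied with `collections.Counter`. With a bin size of 1, each distinct value is its own bin. The text histogram is only a convenience for eyeballing the peaks.

```python
from collections import Counter

# Sample taken from the comment above.
data = [1, 2, 3, 4, 4, 4, 5, 6, 7, 7, 8, 8, 8, 8, 8, 8, 8, 9, 9]

# With a bin size of 1, every distinct value is its own bin.
frequencies = Counter(data)

# Crude text histogram: one row per bin, in value order.
for value in sorted(frequencies):
    print(f"{value}: {'#' * frequencies[value]}")
```

The tally makes the question concrete: 8 occurs seven times, 4 three times, and 7 and 9 twice each. Whether any of the smaller peaks counts as a mode is still a judgement call that the counts alone do not settle.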

Thanks for the suggestion.

sria91 · 2016-12-13T10:08:10Z

Please see the updated pull request, PR 50, with the changes.

Thanks & Regards
Srikanth Anantharam

sria91 · 2016-12-13T10:17:21Z

@steven:

data = [1, 2, 3, 4, 4, 4, 5, 6, 7, 7, 8, 8, 8, 8, 8, 8, 8, 9, 9]
is clearly unimodal, with mode 8.

The data would have been bimodal if 4 were repeated exactly as many times (7) as 8, like this:
data = [1, 2, 3, 4, 4, 4, 4, 4, 4, 4, 5, 6, 7, 7, 8, 8, 8, 8, 8, 8, 8, 9, 9]

in which case the new patch in PR 50 would return the tuple
(4, 8)

Thanks & Regards
Srikanth Anantharam

stevendaprano · 2016-12-14T01:55:35Z

On Tue, Dec 13, 2016 at 10:08:10AM +0000, Srikanth Anantharam wrote:

> Please see the updated pull request PR 50, with the changes.

I'm rejecting that pull request. As I said, mode() intentionally
returns only the single, unique mode. I may add a more advanced API or a
second function for dealing with multi-modal samples, but even if I do,
your suggestion wouldn't be sufficient. Your pull request merely returns
the entire list of unique values:

    return tuple(value for value, frequency in table)

with no way for the caller to tell which values might be a mode and
which are not.
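
The distinction being drawn here can be sketched as follows. This is an illustration only: it assumes `table` holds a `(value, frequency)` pair for every distinct value, most common first, which is an assumption about the pull request's context rather than something shown in this thread. The names `every_value`, `max_frequency` and `tied_modes` are hypothetical.

```python
from collections import Counter

# Tied sample from the earlier comment: 4 and 8 both occur seven times.
data = [1, 2, 3, 4, 4, 4, 4, 4, 4, 4, 5, 6, 7, 7, 8, 8, 8, 8, 8, 8, 8, 9, 9]

# ASSUMPTION: `table` lists every distinct value with its frequency,
# most common first. The real table in the pull request may differ.
table = Counter(data).most_common()

# Unfiltered: every distinct value comes back, mode or not.
every_value = tuple(value for value, frequency in table)

# Filtered: keep only the values that share the highest frequency.
max_frequency = table[0][1]
tied_modes = tuple(
    value for value, frequency in table if frequency == max_frequency
)

print(sorted(every_value))
print(sorted(tied_modes))
```

Under that assumption, the unfiltered tuple contains all nine distinct values, while the filtered one contains only 4 and 8.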

(By the way, even if this function behaviour was acceptable, which I
stress it is not, this would not be sufficient for me to accept as a
patch. You should preferably update the documentation and the tests as
well. At the very least, you should update the function's docstring to
explain the changed return value.)

I'm sorry that I have to reject this; I am interested in having better
support for multiple modes. I'm not closing this issue just yet. If you
are interested in continuing the discussion, what would be *VERY*
valuable for me would be for you or some other volunteer to do some
research into numerical techniques for objectively determining the
number and value of modes, rather than just plotting a graph and
subjectively deciding whether a value is a peak or not.

Thanks for your interest.

stevendaprano · 2016-12-14T02:00:56Z

On Tue, Dec 13, 2016 at 10:17:21AM +0000, Srikanth Anantharam wrote:

> @steven:
>
> data = [1, 2, 3, 4, 4, 4, 5, 6, 7, 7, 8, 8, 8, 8, 8, 8, 8, 9, 9]
> is clearly unimodal with mode 8
>
> data would have been bimodal if 4 repeated exactly the same (7) number of
> times as 8, like this:
> data = [1, 2, 3, 4, 4, 4, 4, 4, 4, 4, 5, 6, 7, 7, 8, 8, 8, 8, 8, 8, 8, 9, 9]

Bimodal distributions do not require both modes to be exactly the same
height. And certainly when you have a sample from a bimodal
distribution, you should not expect exactly the same frequency for the
two modes. Just from random sampling error you will expect one or the
other to have a larger frequency.

You shouldn't take my example too literally. With such a small sample of
discrete values, it becomes a (hard) matter of personal judgement. The
point I was attempting to make was that identifying sample modes outside
of the simplest unimodal case is tricky and requires much thought.
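
The sampling-error point can be tried out with a small simulation. This is a sketch under stated assumptions: the population is a made-up mixture with two equally tall peaks at 4 and 8, and `sample_bimodal` is a hypothetical helper, not anything from the `statistics` module.

```python
import random
from collections import Counter


def sample_bimodal(n, seed):
    """Draw n values from a made-up population with equal peaks at 4 and 8."""
    rng = random.Random(seed)  # seeded so each run is repeatable
    samples = []
    for _ in range(n):
        peak = rng.choice((4, 8))            # both peaks equally likely
        offset = rng.choice((-1, 0, 0, 1))   # most of the mass sits on the peak
        samples.append(peak + offset)
    return samples


# Compare the observed heights of the two peaks across a few small samples.
for seed in range(3):
    counts = Counter(sample_bimodal(30, seed))
    print(f"seed={seed}: count of 4 = {counts[4]}, count of 8 = {counts[8]}")
```

Although the two peaks are equally tall in the population, nothing forces the two counts to match in any given sample, so a rule that requires exactly equal frequencies can miss a genuinely bimodal source.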

terryjreedy · 2016-12-17T00:22:30Z

Srikanth, when you reply by email, please remove the quotation of the previous message. On the web page, it is just noise. The only exception should be when you reply to a specific sentence and need to quote that sentence for context.

In my particular experience, mode() is usually reserved for crudely describing unordered categorical data, where the concept of 'minimum' does not apply.

Mode is useful for determining the winner of a vote (or other decision process), but in general, it is not a substitute for a more comprehensive look at a dataset.

Problems with possibly returning a tuple of data items instead of a data item include:

- The user then has to be prepared to handle a tuple instead of a data item. It would be better then to always return a tuple, even for 1 item.
- Data items can be tuples, making a tuple return ambiguous. Example use case: planar points with int coordinates.

>>> mode(((0,0), (0,0), (0,1)))
(0, 0)

So, while StatisticsError is a nuisance, so are the apparent alternatives. I think we should leave mode alone and close this.
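
The ambiguity described above can be made concrete with a short sketch. Both helpers are hypothetical and exist only for illustration; neither is the code from any of the pull requests.

```python
from collections import Counter


def mode_or_tuple(data):
    """HYPOTHETICAL: return the single mode, or a tuple of modes on a tie."""
    table = Counter(data).most_common()
    top = table[0][1]
    modes = tuple(value for value, count in table if count == top)
    return modes[0] if len(modes) == 1 else modes


def modes_as_list(data):
    """HYPOTHETICAL: always return a list, even for a single mode."""
    table = Counter(data).most_common()
    top = table[0][1]
    return [value for value, count in table if count == top]


points = [(0, 1), (0, 1), (0, 0)]  # planar points; the single mode is (0, 1)
numbers = [0, 0, 1, 1]             # plain ints; 0 and 1 are tied

# Two different situations produce the same return value.
ambiguous = mode_or_tuple(points) == mode_or_tuple(numbers)
print(ambiguous)

# Always returning a list keeps the two cases apart.
print(modes_as_list(points), modes_as_list(numbers))
```

With the mixed return type, the caller cannot tell a single tuple-valued mode from a tie between two ints. Always returning a list removes that ambiguity, at the cost of a different return type from the existing `mode()`. Python 3.8 later added `statistics.multimode`, which takes the always-a-list approach.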

stevendaprano · 2018-02-18T10:08:48Z

What makes the minimum mode better than the maximum?

sria91 · 2018-02-18T10:13:01Z

Please review the new PR with tests.
I'll update the documentation if the PR is acceptable.

scotchka · 2019-03-10T16:42:03Z

The problem remains that the function can return a number or a list for input that is a list of numbers. This means the user will need to handle both possibilities every time, which is a heavy burden for such a simple function.

SciPy's mode function does return the minimum mode when there is a tie, which as far as I can tell is an arbitrary choice. But in that context, since the input is almost always numerical, a minimum is at least well defined, which is not true for an input with a mix of types.

For the general use case, the current behavior - raising an exception - in case of tie conveys the most information.
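
A small sketch of why "return the minimum mode" is not well defined for mixed-type input. The `tied_modes` helper and the `votes` data are hypothetical; the only library behaviour relied on is that `min()` raises `TypeError` in Python 3 when its arguments cannot be ordered.

```python
from collections import Counter


def tied_modes(data):
    """HYPOTHETICAL helper: list every value that shares the top frequency."""
    table = Counter(data).most_common()
    top = table[0][1]
    return [value for value, count in table if count == top]


# Made-up categorical data mixing strings and ints.
votes = ["red", 3, "red", 3, None]
modes = tied_modes(votes)

try:
    smallest = min(modes)  # str and int cannot be ordered in Python 3
except TypeError as exc:
    smallest = None
    print(f"no minimum mode: {exc}")
```

For purely numerical input the minimum exists, which is why the choice is workable in a numerical library even if it is arbitrary. For general hashable data it is not available at all.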

scotchka · 2019-03-10T17:10:13Z

Yes, the mode function could ALWAYS return a list, but that breaks backward compatibility, as does the currently proposed change.

rhettinger · 2019-03-10T18:04:21Z

See the competing proposal and PR at https://bugs.python.org/issue35892 and #12089

stevendaprano · 2019-03-11T11:01:24Z

I'm closing this issue in favour of Raymond's bpo-35892. Thank you to everyone; even if your PRs didn't get used, I appreciate your efforts.

sria91 mannequin added the stdlib (Python modules in the Lib dir) and type-bug (An unexpected behavior, bug, or error) labels Dec 13, 2016

wm75 mannequin added the 3.7 (EOL) end of life label Dec 13, 2016

sria91 mannequin changed the title ~~return minimum of modes for a multimodal distribution instead of raising a StatisticsError~~ return list of modes for a multimodal distribution instead of raising a StatisticsError Feb 18, 2018

stevendaprano closed this as completed Mar 11, 2019

ezio-melotti transferred this issue from another repository Apr 10, 2022
