return list of modes for a multimodal distribution instead of raising a StatisticsError #73142

Closed
sria91 mannequin opened this issue Dec 13, 2016 · 15 comments
Labels
3.7 (EOL) end of life stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@sria91
Mannequin

sria91 mannequin commented Dec 13, 2016

BPO 28956
Nosy @rhettinger, @terryjreedy, @stevendaprano, @wm75, @sria91, @scotchka
PRs
  • getpass: update docstring #49
  • [backport to 3.5] bpo-27122: Fix comment to point to correct issue number (#47) #50
  • bpo-28956: Return list of modes for a multimodal distribution. #5732
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2019-03-11.11:01:24.182>
    created_at = <Date 2016-12-13.04:21:41.772>
    labels = ['3.7', 'type-bug', 'library']
    title = 'return list of modes for a multimodal distribution instead of raising a StatisticsError'
    updated_at = <Date 2019-03-11.11:01:24.182>
    user = 'https://github.com/sria91'

    bugs.python.org fields:

    activity = <Date 2019-03-11.11:01:24.182>
    actor = 'steven.daprano'
    assignee = 'none'
    closed = True
    closed_date = <Date 2019-03-11.11:01:24.182>
    closer = 'steven.daprano'
    components = ['Library (Lib)']
    creation = <Date 2016-12-13.04:21:41.772>
    creator = 'sria91'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 28956
    keywords = ['patch']
    message_count = 15.0
    messages = ['283071', '283085', '283089', '283090', '283091', '283092', '283154', '283155', '283453', '312303', '312305', '337620', '337622', '337625', '337656']
    nosy_count = 6.0
    nosy_names = ['rhettinger', 'terry.reedy', 'steven.daprano', 'wolma', 'sria91', 'scotchka']
    pr_nums = ['49', '50', '5732']
    priority = 'normal'
    resolution = 'rejected'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue28956'
    versions = ['Python 3.7']

    @sria91
    Mannequin Author

    sria91 mannequin commented Dec 13, 2016

    return minimum of modes for a multimodal distribution

    instead of raising a StatisticsError

    @sria91 sria91 mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Dec 13, 2016
    @wm75
    Mannequin

    wm75 mannequin commented Dec 13, 2016

    What's the justification for this proposed change? Isn't it better to report the fact that there isn't an unambiguous result instead of returning a rather arbitrary one?

    @wm75 wm75 mannequin added the 3.7 (EOL) end of life label Dec 13, 2016
    @sria91
    Mannequin Author

    sria91 mannequin commented Dec 13, 2016

    A better choice would be to return a tuple of values (sliced from the
    table). And let the user decide which one to use.

    Hope that's justifiable...

    @stevendaprano
    Member

    On Tue, Dec 13, 2016 at 09:35:22AM +0000, Srikanth Anantharam wrote:

    > A better choice would be to return a tuple of values (sliced from the
    > table). And let the user decide which one to use.

    The current mode() function is designed for a very basic use-case, where
    you have an obvious single mode from discrete data.

    The problem with dealing with multiple modes is that it's not easy to
    tell the difference between a genuinely multi-modal sample and one which
    just happens to have a few samples with the same value:

    data = [1, 2, 3, 4, 4, 4, 5, 6, 7, 7, 8, 8, 8, 8, 8, 8, 8, 9, 9]

    Assuming the sampling is fair, 8 is clearly the mode; but is it bimodal
    with 4 the second mode? Or perhaps even four modes, 8, 4, 7 and 9?

    I have plans for introducing a binning function to collect data into
    bins and run statistics on the bins. That might be a better way to deal
    with multi-modal samples: if you bin the data (for discrete data, use a
    bin size of 1) and then look at the frequencies, you can decide how many
    modes there are.

    Thanks for the suggestion.
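
A minimal sketch of the "look at the frequencies" idea for the sample above. It uses `collections.Counter` as a stand-in for a bin size of 1; the planned binning function is not shown in this thread, so nothing here should be read as its API.

```python
from collections import Counter

data = [1, 2, 3, 4, 4, 4, 5, 6, 7, 7, 8, 8, 8, 8, 8, 8, 8, 9, 9]

# Tally each distinct value; for discrete data this acts like a bin size of 1.
frequencies = Counter(data)

# List values from most to least frequent so candidate peaks stand out.
for value, count in frequencies.most_common():
    print(f"{value}: {'#' * count} ({count})")
```

The tally shows 8 occurring seven times, 4 three times, and 7 and 9 twice each. Whether 4 counts as a second mode is still a judgement call that the table alone does not settle.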

    @sria91
    Mannequin Author

    sria91 mannequin commented Dec 13, 2016

    Please see the updated pull request PR 50, with the changes.

    @sria91
    Mannequin Author

    sria91 mannequin commented Dec 13, 2016

    @steven:

    data = [1, 2, 3, 4, 4, 4, 5, 6, 7, 7, 8, 8, 8, 8, 8, 8, 8, 9, 9]
    is clearly unimodal with mode 8

    data would have been bimodal if 4 repeated exactly the same (7) number of
    times as 8, like this:
    data = [1, 2, 3, 4, 4, 4, 4, 4, 4, 4, 5, 6, 7, 7, 8, 8, 8, 8, 8, 8, 8, 9, 9]

    in which case the new patch in PR 50 would return a tuple
    (4, 8)

    @stevendaprano
    Member

    On Tue, Dec 13, 2016 at 10:08:10AM +0000, Srikanth Anantharam wrote:

    > Please see the updated pull request PR 50, with the changes.

    I'm rejecting that pull request. As I said, mode() intentionally
    returns only the single, unique mode. I may add a more advanced API or a
    second function for dealing with multi-modal samples, but even if I do,
    your suggestion wouldn't be sufficient. Your pull request merely returns
    the entire list of unique values:

        return tuple(value for value, frequency in table)

    with no way for the caller to tell which values might be a mode and
    which are not.
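
The difference can be sketched as follows. This assumes, as the comment above describes, that `table` holds every (value, frequency) pair; the real contents of `table` in the pull request are not shown in this thread. Both function names are hypothetical.

```python
from collections import Counter

def all_unique_values(data):
    # Assumption: table contains every (value, frequency) pair.
    # Frequency is ignored, so every distinct value comes back.
    table = Counter(data).most_common()
    return tuple(value for value, frequency in table)

def tied_modes(data):
    # Hypothetical alternative: keep only values whose count equals the top count.
    table = Counter(data).most_common()
    if not table:
        return ()
    top = table[0][1]
    return tuple(value for value, frequency in table if frequency == top)

data = [1, 2, 3, 4, 4, 4, 4, 4, 4, 4, 5, 6, 7, 7, 8, 8, 8, 8, 8, 8, 8, 9, 9]
print(all_unique_values(data))  # all nine distinct values
print(tied_modes(data))         # only the values tied for the highest count
```

Under that assumption the first function cannot distinguish a mode from any other value, while the second returns only 4 and 8 for this sample.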

    (By the way, even if this function behaviour was acceptable, which I
    stress it is not, this would not be sufficient for me to accept as a
    patch. You should preferably update the documentation and the tests as
    well. At the very least, you should update the function's docstring to
    explain the changed return value.)

    I'm sorry that I have to reject this; I am interested in having better
    support for multiple modes. I'm not closing this issue just yet. If you
    are interested in continuing the discussion, what would be *VERY*
    valuable for me would be for you or some other volunteer to do some
    research into numerical techniques for objectively determining the
    number and value of modes, rather than just plotting a graph and
    subjectively deciding whether a value is a peak or not.

    Thanks for your interest.

    @stevendaprano
    Member

    On Tue, Dec 13, 2016 at 10:17:21AM +0000, Srikanth Anantharam wrote:

    > data = [1, 2, 3, 4, 4, 4, 5, 6, 7, 7, 8, 8, 8, 8, 8, 8, 8, 9, 9]
    > is clearly unimodal with mode 8
    >
    > data would have been bimodal if 4 repeated exactly the same (7) number of
    > times as 8, like this:
    > data = [1, 2, 3, 4, 4, 4, 4, 4, 4, 4, 5, 6, 7, 7, 8, 8, 8, 8, 8, 8, 8, 9, 9]

    Bimodal distributions do not require both modes to be exactly the same
    height. And certainly when you have a sample from a bimodal
    distribution, you should not expect exactly the same frequency for the
    two modes. Just from random sampling error you will expect one or the
    other to have a larger frequency.

    You shouldn't take my example too literally. With such a small sample of
    discrete values, it becomes a (hard) matter of personal judgement. The
    point I was attempting to make was that identifying sample modes outside
    of the simplest unimodal case is tricky and requires much thought.
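
A small simulation can illustrate the sampling-error point. The population below is made up for illustration: it has two equally likely peaks at 4 and 8, yet repeated samples will usually show different counts for the two. The seed and weights are arbitrary assumptions.

```python
import random
from collections import Counter

rng = random.Random(28956)  # arbitrary seed, only for repeatability

# Hypothetical population: peaks at 4 and 8 carry equal weight.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9]
weights = [1, 1, 1, 6, 1, 1, 1, 6, 1]

for trial in range(5):
    sample = rng.choices(values, weights=weights, k=50)
    counts = Counter(sample)
    print(f"trial {trial}: count(4)={counts[4]}, count(8)={counts[8]}")
```

A test for exact equality of the top counts would report most such samples as unimodal even though the underlying distribution has two peaks.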

    @terryjreedy
    Member

    Srikanth, when you reply by email, please remove the quotation of the previous message. On the web page, it is just noise. The only exception should be when you reply to a specific sentence and need to quote that sentence for context.

    In my particular experience, mode() is usually reserved for crudely describing unordered categorical data, where the concept of 'minimum' does not apply.

    Mode is useful for determining the winner of a vote (or other decision process), but in general, it is not a substitute for a more comprehensive look at a dataset.

    Problems with possibly returning a tuple of data items instead of a data item include:

    1. The user then has to be prepared to handle a tuple instead of a data item. It would be better then to always return a tuple, even for 1 item.

    2. Data items can be tuples, making a tuple return ambiguous. Example use case: planar points with int coordinates.

    >>> mode(((0,0), (0,0), (0,1)))
    (0, 0)

    So, while StatisticsError is a nuisance, so are the apparent alternatives. I think we should leave mode alone and close this.
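
The ambiguity in point 2 can be sketched with a hypothetical `mode_or_tuple` helper that returns a bare item for a single mode and a tuple of items for a tie. It is not the code from any pull request in this thread.

```python
from collections import Counter

def mode_or_tuple(data):
    # Hypothetical variant: bare item for one mode, tuple of items for a tie.
    # Assumes data is non-empty.
    counts = Counter(data)
    top = max(counts.values())
    modes = tuple(value for value, count in counts.items() if count == top)
    return modes[0] if len(modes) == 1 else modes

# One mode: the planar point (0, 1).
single = mode_or_tuple([(0, 1), (0, 1), (0, 0)])

# Two tied modes: the integers 0 and 1.
tied = mode_or_tuple([0, 0, 1, 1])

print(single, tied, single == tied)
```

Both calls return the same tuple, so the caller cannot tell one point from two tied numbers without inspecting the input again.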

    @stevendaprano
    Member

    What makes the minimum mode better than the maximum?

    @sria91
    Mannequin Author

    sria91 mannequin commented Feb 18, 2018

    Please review the new PR with tests.
    I'll update the documentation if the PR is acceptable.

    @sria91 sria91 mannequin changed the title from "return minimum of modes for a multimodal distribution instead of raising a StatisticsError" to "return list of modes for a multimodal distribution instead of raising a StatisticsError" Feb 18, 2018
    @scotchka
    Mannequin

    scotchka mannequin commented Mar 10, 2019

    The problem remains that the function can return a number or a list for input that is a list of numbers. This means the user will need to handle both possibilities every time, which is a heavy burden for such a simple function.

    SciPy's mode function does return the minimum mode when there is a tie, which as far as I can tell is an arbitrary choice. But in that context, since the input is almost always numerical, a minimum is at least well defined, which is not true for an input with a mix of types.

    For the general use case, the current behavior (raising an exception in case of a tie) conveys the most information.
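
A pure-Python sketch of the "minimum mode" tie-break shows where it stops being well defined. `min_mode` is a hypothetical helper and does not call SciPy.

```python
from collections import Counter

def min_mode(data):
    # Hypothetical helper: break ties by taking the smallest tied value.
    # Assumes data is non-empty.
    counts = Counter(data)
    top = max(counts.values())
    return min(value for value, count in counts.items() if count == top)

# Numeric input: the tie between 4 and 8 resolves to the smaller value.
print(min_mode([4, 4, 8, 8]))

# Mixed types: str and int cannot be ordered, so no minimum exists.
try:
    min_mode(["red", "red", 3, 3])
except TypeError as exc:
    print("no minimum:", exc)
```

For numeric data the tie-break works, arbitrary as it is. For mixed or unordered categorical data it raises `TypeError`, so the caller must handle an exception either way.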

    @scotchka
    Mannequin

    scotchka mannequin commented Mar 10, 2019

    Yes, the mode function could ALWAYS return a list, but that breaks backward compatibility, as does the currently proposed change.

    @rhettinger
    Contributor

    See the competing proposal and PR at https://bugs.python.org/issue35892 and #12089

    @stevendaprano
    Member

    I'm closing this issue in favour of Raymond's bpo-35892. Thank you to everyone; even if your PRs didn't get used, I appreciate your efforts.
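
For reference, Python 3.8 and later provide `statistics.multimode()`, which always returns a list, and `statistics.mode()` there returns the first mode encountered instead of raising on a tie. The short sketch below assumes Python 3.8 or newer and reuses data from earlier in this thread.

```python
import statistics

data = [1, 2, 3, 4, 4, 4, 4, 4, 4, 4, 5, 6, 7, 7, 8, 8, 8, 8, 8, 8, 8, 9, 9]

# Always a list, in order of first appearance, even when there is one mode.
modes = statistics.multimode(data)

# On Python 3.8+, mode() returns the first of the tied values.
first = statistics.mode(data)

# Tuple data items stay unambiguous because the result is a list of items.
points = statistics.multimode([(0, 0), (0, 0), (0, 1)])

print(modes, first, points)
```

Returning a list in every case avoids both concerns raised above: the caller handles one return type, and tuple-valued items cannot be confused with a collection of modes.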

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022