Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a Normal Distribution class to the statistics module #80199

Closed
rhettinger opened this issue Feb 18, 2019 · 35 comments
Closed

Add a Normal Distribution class to the statistics module #80199

rhettinger opened this issue Feb 18, 2019 · 35 comments
Assignees
Labels
3.8 (EOL) end of life stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@rhettinger
Copy link
Contributor

BPO 36018
Nosy @tim-one, @rhettinger, @mdickinson, @stevendaprano, @applio, @selik, @miss-islington, @tirkarthi
PRs
  • bpo-36018: Add the NormalDist class to the statistics module #11973
  • bpo-36018: Make __pos__ return a distinct instance of NormDist #12009
  • bpo-36018: Add properties for mean and stdev #12022
  • bpo-36018: Add special value tests and make minor tweaks to the docs #12096
  • bpo-36018: Add documentation link to "random variable" #12114
  • bpo-36018: Make "seed" into a keyword only argument for NormalDist.samples() #12921
  • bpo-36018: Quantiles. Test idempotence. Test two methods against one-another. #13021
  • bpo-36018: Update example to show mean and stdev properties #13047
  • bpo-36018: Minor doc update: Add examples to elucidate the formulas #14898
  • bpo-36018: Address more reviewer feedback #15733
  • [3.8] bpo-36018: Address more reviewer feedback (GH-15733) #15734
  • bpo-36018: Add another example for NormalDist() #18191
  • [3.8] bpo-36018: Add another example for NormalDist() (GH-18191) #18192
  • bpo-36018: Minor fixes to the NormalDist() examples and recipes. #18226
  • [3.8] bpo-36018: Minor fixes to the NormalDist() examples and recipes. (GH-18226) #18227
  • Files
  • gauss.py: NormalDist class
  • gauss_demo.py: Examples
  • normdist_22feb2019.diff: Full diff as of 22 Feb 2019
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/stevendaprano'
    closed_at = <Date 2019-02-24.21:46:21.796>
    created_at = <Date 2019-02-18.00:00:59.337>
    labels = ['3.8', 'type-feature', 'library']
    title = 'Add a Normal Distribution class to the statistics module'
    updated_at = <Date 2020-01-28.03:40:18.056>
    user = 'https://github.com/rhettinger'

    bugs.python.org fields:

    activity = <Date 2020-01-28.03:40:18.056>
    actor = 'rhettinger'
    assignee = 'steven.daprano'
    closed = True
    closed_date = <Date 2019-02-24.21:46:21.796>
    closer = 'rhettinger'
    components = ['Library (Lib)']
    creation = <Date 2019-02-18.00:00:59.337>
    creator = 'rhettinger'
    dependencies = []
    files = ['48147', '48148', '48163']
    hgrepos = []
    issue_num = 36018
    keywords = ['patch']
    message_count = 35.0
    messages = ['335792', '335876', '336008', '336029', '336285', '336295', '336319', '336337', '336376', '336413', '336414', '336417', '336422', '336425', '336433', '336435', '336436', '336439', '336481', '336852', '336895', '337658', '337660', '337662', '337686', '340703', '341137', '341240', '348273', '351340', '351341', '360717', '360719', '360834', '360836']
    nosy_count = 8.0
    nosy_names = ['tim.peters', 'rhettinger', 'mark.dickinson', 'steven.daprano', 'davin', 'selik', 'miss-islington', 'xtreak']
    pr_nums = ['11973', '12009', '12022', '12096', '12114', '12921', '13021', '13047', '14898', '15733', '15734', '18191', '18192', '18226', '18227']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue36018'
    versions = ['Python 3.8']

    @rhettinger
    Copy link
    Contributor Author

    Attached is a class that I've found useful for doing practical statistics work with normal distributions. It provides a nice, high-level API that makes short-work of everyday statistical problems.

    ------ Examples --------

    # Simple scaling and translation
    temperature_february = NormalDist(5, 2.5)            # Celsius
    print(temperature_february * (9/5) + 32)             # Fahrenheit
    
    
    # Classic probability problems
    # https://blog.prepscholar.com/sat-standard-deviation
    # The mean score on a SAT exam is 1060 with a standard deviation of 195
    # What percentage of students score between 1100 and 1200?
    sat = NormalDist(1060, 195)
    fraction = sat.cdf(1200) - sat.cdf(1100)
    print(f'{fraction * 100 :.1f}% score between 1100 and 1200')
    
    
    # Combination of normal distributions by summing variances
    birth_weights = NormalDist.from_samples([2.5, 3.1, 2.1, 2.4, 2.7, 3.5])
    drug_effects = NormalDist(0.4, 0.15)
    print(birth_weights + drug_effects)
    
    
    # Statistical calculation estimates using simulations
    # Estimate the distribution of X * Y / Z
    n = 100_000
    X = NormalDist(350, 15).examples(n)
    Y = NormalDist(47, 17).examples(n)
    Z = NormalDist(62, 6).examples(n)
    print(NormalDist.from_samples(x * y / z for x, y, z in zip(X, Y, Z)))

    # Naive Bayesian Classifier
    # https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Sex_classification

    height_male = NormalDist.from_samples([6, 5.92, 5.58, 5.92])
    height_female = NormalDist.from_samples([5, 5.5, 5.42, 5.75])
    weight_male = NormalDist.from_samples([180, 190, 170, 165])
    weight_female = NormalDist.from_samples([100, 150, 130, 150])
    foot_size_male = NormalDist.from_samples([12, 11, 12, 10])
    foot_size_female = NormalDist.from_samples([6, 8, 7, 9])
    
    prior_male = 0.5
    prior_female = 0.5
    posterior_male = prior_male * height_male.pdf(6) * weight_male.pdf(130) * foot_size_male.pdf(8)
    posterior_female = prior_female * height_female.pdf(6) * weight_female.pdf(130) * foot_size_female.pdf(8)
    print('Predict', 'male' if posterior_male > posterior_female else 'female')

    @rhettinger rhettinger added the 3.8 (EOL) end of life label Feb 18, 2019
    @rhettinger rhettinger added stdlib Python modules in the Lib dir type-feature A feature request or enhancement labels Feb 18, 2019
    @stevendaprano
    Copy link
    Member

    I like this idea!

    Should the "examples" method be re-named "samples"? That's the word used in the docstring, and it matches the from_samples method.

    @selik
    Copy link
    Mannequin

    selik mannequin commented Feb 19, 2019

    +1, This would be useful for quick analyses, avoiding the overhead of installing scipy and looking through its documentation.

    Given that it's in the statistics namespace, I think the name can be simply Normal rather than NormalDist. Also, instead of .from_examples consider naming the classmethod .fit.

    @rhettinger
    Copy link
    Contributor Author

    I'll work up a PR for this.

    We can continue to tease-out the best method names. I've has success with "examples" and "from_samples" when developing this code in the classroom. Both names had the virtue of being easily understood and never being misunderstood.

    Intellectually, the name fit() makes sense because we are using data to create best fit model parameters. So, technically this is probably the most accurate terminology. However, it doesn't match how I think about the problem though -- that is more along the lines of "use sampling data to make a random variable with a normal distribution". Another minor issue is that class methods are typically (but not always) recognizable by their from-prefix (e.g. dict.fromkeys, datetime.fromtimestamp, etc).

    "NormalDist" seems more self explanatory to me that just "Normal". Also, the noun form seems "more complete" than a dangling adjective (reading "normal" immediately raises the question "normal what?"). FWIW, MS Excel also calls their variant NORM.DIST (formerly spelled without the dot).

    @rhettinger
    Copy link
    Contributor Author

    Okay the PR is ready.

    If you all are mostly comfortable with it, it would great to get this in for the second alpha so that people have a chance to work with it.

    @stevendaprano
    Copy link
    Member

    Thanks Raymond.

    Apologies for commenting here instead of at the PR.

    While I've been fighting with more intermittedly broken than usual
    internet access, Github has stopped supporting my browser. I can't
    upgrade the browser without upgrading the OS, and I can't upgrade the OS
    without new hardware, and that will take money I don't have at the moment.

    So the bottom line is that while I can read *part* of the diffs on
    Github, that's about all I can do. I can't comment there, I can't fork,
    I can't make push requests, half the pages don't load for me and the
    other half don't work properly when they do load. I can't even do a git
    clone.

    So right now, the only thing I can do is comment on your extensive
    documentation in statistics.rst. That's very nicely done.

    The only thing that strikes me as problematic is the default value for
    sigma, namely 0.0. The PDF for normal curve divides by sigma, so if
    that's zero, things are undefined. So I think that sigma ought to be
    strictly positive.

    I also think it would be nice to default to the standard normal curve,
    with mu=0.0 and sigma=1.0. That will make it easy to work with Z scores.

    Thanks again for this class, and my apologies for my inability to
    follow the preferred workflow.

    @rhettinger
    Copy link
    Contributor Author

    I've made both suggested changes, "examples"->"samples" and set the defaults to the standard normal distribution.

    To bypass Github, I've attached a diff to this tracker issue. Let me know what you think :-)

    @tirkarthi
    Copy link
    Member

    @steven.daprano Bit off topic but you can also append .patch in the PR URL to generate patch file with all the commits made in the PR up to latest commit and .diff provides the current diff against master. They are plain text and can be downloaded through wget and viewed with an editor in case if it helps.

    https://github.com/python/cpython/pull/11973.patch
    https://github.com/python/cpython/pull/11973.diff

    @rhettinger
    Copy link
    Contributor Author

    Thanks for all the positive feedback. If there are no objections, I would like to push this so it will be in the second alpha release so that it can get exercised. We can still make adjustments afterwards.

    @rhettinger
    Copy link
    Contributor Author

    New changeset 11c7953 by Raymond Hettinger in branch 'master':
    bpo-36018: Add the NormalDist class to the statistics module (GH-11973)
    11c7953

    @rhettinger
    Copy link
    Contributor Author

    Okay, it's in for the second alpha. Please continue to make API or implementation suggestions. Nothing is set in stone.

    @applio
    Copy link
    Member

    applio commented Feb 23, 2019

    There is an inconsistency worth paying attention to in the choice of names of the input parameters.

    Currently in the statistics module, pvariance() accepts a parameter named "mu" and pstdev() and variance() each accept a parameter named "xbar". The docs describe both "mu" and "xbar" as "it should be the mean of data". I suggest it is worth rationalizing the names used within the statistics module for consistency before reusing "mu" or "xbar" or anything else in NormalDist.

    Using the names of mathematical symbols that are commonly used to represent a concept is potentially confusing because those symbols are not always *universally* used. For example, students are often introduced to new concepts in introductory mathematics texts where concepts such as "mean" appear in formulas and equations not as "mu" but as "xbar" or simply "m" or other simple (and hopefully "friendly") names/symbols. As a mathematician, if I am told a variable is named, "mu", I still feel the need to ask what it represents. Sure, I can try guessing based upon context but I will usually have more than one guess that I could make.

    Rather than continue down a path of using various mathematical-symbols-written-out-in-English-spelling, one alternative would be to use less ambiguous, more informative variable names such as "mean". It might be worth considering a change to the parameter names of "mu" and "sigma" in NormalDist to names like "mean" and "stddev", respectively. Or perhaps "mean" and "standard_deviation". Or perhaps "mean" and "variance" would be easier still (recognizing that variance can be readily computed from standard deviation in this particular context). In terms of consistency with other packages that users are likely to also use, scipy.stats functions/objects commonly refer to these concepts as "mean" and "var".

    I like the idea of making NormalDist readily approachable for students as well as those more familiar with these concepts. The offerings in scipy.stats are excellent but they are not always the most approachable things for new students of statistics.

    @stevendaprano
    Copy link
    Member

    Karthikeyan: thanks for the hint about Github.

    Raymond: thanks for the diff. Some comments:

    Why use object.__setattr__(self, 'mu', mu) instead of self.mu = mu in the __init__ method?

    Should __pos__ return a copy rather than the instance itself?

    The rest looks good to me, and I look forward to using it.

    @stevendaprano
    Copy link
    Member

    Davin: the chice of using mu versus xbar was deliberate, as they represent different quantities: the population mean versus a sample mean. But reading over the docs with fresh eyes, I can now see that the distinction is not as clear as I intended.

    I think that changing the names now would be a breaking change, but even if it wasn't, I don't want to change the names. The distinction between population parameters (mu) and sample statistics (xbar) is important and I think the function parameters should reflect that.

    As for the new NormalDist class, we aren't limited by backwards compatibility, but I would still argue for the current names mu and sigma. As well as matching the population parameters of the distribution, they also matches the names used in calculators such as the TI Nspire and Casio Classpad (two very popular CAS calculators used by secondary school students).

    See bpo-36099. If you would like to suggest some doc changes, please feel free to do so.

    @applio
    Copy link
    Member

    applio commented Feb 24, 2019

    Steven: Your point about population versus sample makes sense and your point that altering their names would be a breaking change is especially important. I think that pretty well puts an end to my suggestion of alternative names and says the current pattern should be kept with NormalDist.

    I particularly like the idea of using the TI Nspire and Casio Classpad to guide or help confirm what symbols might be recognizable to secondary students or 1st year university students.

    Raymond: As an idea for examples demonstrating the code, what about an example where a plot of pdf is created, possibly for comparison with cdf? This would require something like matplotlib but would help to visually communicate the concepts of pdf, perhaps with different sigma values?

    @rhettinger
    Copy link
    Contributor Author

    Why use object.__setattr__(self, 'mu', mu) instead of
    self.mu = mu in the __init__ method?

    The idea was the instances should be immutable and hashable, but this added unnecessary complexity, so I took this out prior to the check in.

    Should __pos__ return a copy rather than the instance itself?

    Yes. I'll fix that straight-way.

    ^ The chice of using mu versus xbar was deliberate

    I concur with that choice and also prefer to stick with mu and sigma:

    1. It's too late to change it elsewhere in statistics and the random modules. 2) Having attribute names the same as function names in the same module is confusing. 3) I had already user tested this API in some Python courses. 4) The variable names match the various external sources I've linked to in the docs. 5) Python historically hasn't shied from greek letter names (math: pi tau gamma random: alpha, better, lambd, mu, sigma).

    @rhettinger
    Copy link
    Contributor Author

    Steven, Davin, Michael: Thanks for the encouragement and taking the time to review this code.

    @miss-islington
    Copy link
    Contributor

    New changeset 79fbcc5 by Miss Islington (bot) (Raymond Hettinger) in branch 'master':
    bpo-36018: Make __pos__ return a distinct instance of NormDist (GH-12009)
    79fbcc5

    @miss-islington
    Copy link
    Contributor

    New changeset 9e456bc by Miss Islington (bot) (Raymond Hettinger) in branch 'master':
    bpo-36018: Add properties for mean and stdev (GH-12022)
    9e456bc

    @miss-islington
    Copy link
    Contributor

    New changeset ef17fdb by Miss Islington (bot) (Raymond Hettinger) in branch 'master':
    bpo-36018: Add special value tests and make minor tweaks to the docs (GH-12096)
    ef17fdb

    @miss-islington
    Copy link
    Contributor

    New changeset 9add4b3 by Miss Islington (bot) (Raymond Hettinger) in branch 'master':
    bpo-36018: Add documentation link to "random variable" (GH-12114)
    9add4b3

    @stevendaprano
    Copy link
    Member

    I've done some spot checks of NormDist.pdf and .cdf and compared the results to those returned by my TI Nspire calculator.

    So far, the PDF has matched that of the Nspire to 12 decimal places (the limit the calculator will show), but the CDF differs on or about the 8th decimal place:

    py> x = statistics.NormalDist(2, 1.3)
    py> x.cdf(5.374)
    0.9952757439207682
    # Nspire normCdf(-∞, 5.372, 2, 1.3) returns 0.995275710979
    # difference of 3.294176820212158e-08

    py> x.cdf(-0.23)
    0.04313736707891003
    # Nspire normCdf(-∞, -0.23, 2, 1.3) returns 0.043137332077
    # difference of 3.500191003008579e-08

    Wolfram Alpha doesn't help me decide which is correct, as it doesn't show enough decimal places.

    https://www.wolframalpha.com/input/?i=CDF[+NormalDistribution[2,+1.3],+5.374+]

    https://www.wolframalpha.com/input/?i=CDF[+NormalDistribution[2,+1.3],+-0.23+]

    Do we care about this difference? Should I raise a new ticket for it?

    @mdickinson
    Copy link
    Member

    According to GP/Pari, the correctly value for the first result, to the first few dozen places, is:

    0.995275743920768157605659214368609706759611629000344854339231928536087783251913252354...

    I'm assuming you meant 5.374 rather than 5.372 in the first Nspire result.

    @mdickinson
    Copy link
    Member

    Below is the full transcript from Pari/GP: note that I converted the float inputs to exact Decimal equivalents, assuming IEEE 754 binary64. Summary: both Python results look fine; it's Nspire that's inaccurate here.

    mirzakhani:~ mdickinson$ /opt/local/bin/gp
    GP/PARI CALCULATOR Version 2.11.1 (released)
    i386 running darwin (x86-64/GMP-6.1.2 kernel) 64-bit version
    compiled: Jan 24 2019, Apple LLVM version 10.0.0 (clang-1000.11.45.5)
    threading engine: single
    (readline v8.0 enabled, extended help enabled)

                                               Copyright (C) 2000-2018 The PARI Group
    

    PARI/GP is free software, covered by the GNU General Public License, and comes WITHOUT ANY WARRANTY WHATSOEVER.

    Type ? for help, \q to quit.
    Type ?17 for how to get moral (and possibly technical) support.

    parisize = 8000000, primelimit = 500000
    ? \p 200
       realprecision = 211 significant digits (200 digits displayed)
    ? ncdf(x, mu, sig) = (2 - erfc((x - mu) / sig / sqrt(2))) / 2
    %1 = (x,mu,sig)->(2-erfc((x-mu)/sig/sqrt(2)))/2
    ? ncdf(5.37399999999999966604491419275291264057159423828125, 2, 1.3000000000000000444089209850062616169452667236328125)
    %2 = 0.99527574392076815760565921436860970675961162900034485433923192853608778325191325235412640687571628164064779657215907190523884572141701976336760387216713270956350229484865180142256611330976179584951493
    ? ncdf(-0.2300000000000000099920072216264088638126850128173828125, 2, 1.3000000000000000444089209850062616169452667236328125)
    %3 = 0.043137367078910025352120502108682523151629166877357644882244088336773338416883044522024586619860574718679715351558322591944140762629090301623352497457372937783778706411712862062109829239761761597057063

    @stevendaprano
    Copy link
    Member

    I'm assuming you meant 5.374 rather than 5.372 in the first Nspire result.

    Yes, that was a typo, sorry.

    Thanks for checking into the results.

    @rhettinger
    Copy link
    Contributor Author

    New changeset fb8c7d5 by Raymond Hettinger in branch 'master':
    bpo-36018: Make "seed" into a keyword only argument (GH-12921)
    fb8c7d5

    @rhettinger
    Copy link
    Contributor Author

    New changeset b0a2c0f by Raymond Hettinger in branch 'master':
    bpo-36018: Test idempotence. Test two methods against one-another. (GH-13021)
    b0a2c0f

    @rhettinger
    Copy link
    Contributor Author

    New changeset 671d782 by Raymond Hettinger in branch 'master':
    bpo-36018: Update example to show mean and stdev (GH-13047)
    671d782

    @stevendaprano
    Copy link
    Member

    I have a query about the documentation:

    The default *method* is "exclusive" and is used for data sampled
    from a population that can have more extreme values than found
    in the samples. ...
    Setting the *method* to "inclusive" is used for describing
    population data or for samples that include the extreme points.

    In all my reading about quantile calculation methods, this is the first time I've come across this recommendation. Do you have a source for it or a justification?

    Thanks.

    @rhettinger
    Copy link
    Contributor Author

    New changeset 4db25d5 by Raymond Hettinger in branch 'master':
    bpo-36018: Address more reviewer feedback (GH-15733)
    4db25d5

    @rhettinger
    Copy link
    Contributor Author

    New changeset cc1bdf9 by Raymond Hettinger in branch '3.8':
    [3.8] bpo-36018: Address more reviewer feedback (GH-15733) (GH-15734)
    cc1bdf9

    @rhettinger
    Copy link
    Contributor Author

    New changeset 10355ed by Raymond Hettinger in branch 'master':
    bpo-36018: Add another example for NormalDist() (bpo-18191)
    10355ed

    @rhettinger
    Copy link
    Contributor Author

    New changeset eebcff8 by Raymond Hettinger (Miss Islington (bot)) in branch '3.8':
    bpo-36018: Add another example for NormalDist() (GH-18191) (GH-18192)
    eebcff8

    @rhettinger
    Copy link
    Contributor Author

    New changeset 01bf219 by Raymond Hettinger in branch 'master':
    bpo-36018: Minor fixes to the NormalDist() examples and recipes. (GH-18226)
    01bf219

    @rhettinger
    Copy link
    Contributor Author

    New changeset 41f4dc3 by Raymond Hettinger (Miss Islington (bot)) in branch '3.8':
    bpo-36018: Minor fixes to the NormalDist() examples and recipes. (GH-18226) (GH-18227)
    41f4dc3

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.8 (EOL) end of life stdlib Python modules in the Lib dir type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    6 participants