Add a Normal Distribution class to the statistics module #80199

rhettinger · 2019-02-18T00:00:59Z

BPO	36018
Nosy	@tim-one, @rhettinger, @mdickinson, @stevendaprano, @applio, @selik, @miss-islington, @tirkarthi
PRs	bpo-36018: Add the NormalDist class to the statistics module #11973 bpo-36018: Make __pos__ return a distinct instance of NormDist #12009 bpo-36018: Add properties for mean and stdev #12022 bpo-36018: Add special value tests and make minor tweaks to the docs #12096 bpo-36018: Add documentation link to "random variable" #12114 bpo-36018: Make "seed" into a keyword only argument for NormalDist.samples() #12921 bpo-36018: Quantiles. Test idempotence. Test two methods against one-another. #13021 bpo-36018: Update example to show mean and stdev properties #13047 bpo-36018: Minor doc update: Add examples to elucidate the formulas #14898 bpo-36018: Address more reviewer feedback #15733 [3.8] bpo-36018: Address more reviewer feedback (GH-15733) #15734 bpo-36018: Add another example for NormalDist() #18191 [3.8] bpo-36018: Add another example for NormalDist() (GH-18191) #18192 bpo-36018: Minor fixes to the NormalDist() examples and recipes. #18226 [3.8] bpo-36018: Minor fixes to the NormalDist() examples and recipes. (GH-18226) #18227
Files	gauss.py: NormalDist class gauss_demo.py: Examples normdist_22feb2019.diff: Full diff as of 22 Feb 2019

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = 'https://github.com/stevendaprano'
closed_at = <Date 2019-02-24.21:46:21.796>
created_at = <Date 2019-02-18.00:00:59.337>
labels = ['3.8', 'type-feature', 'library']
title = 'Add a Normal Distribution class to the statistics module'
updated_at = <Date 2020-01-28.03:40:18.056>
user = 'https://github.com/rhettinger'

bugs.python.org fields:

activity = <Date 2020-01-28.03:40:18.056>
actor = 'rhettinger'
assignee = 'steven.daprano'
closed = True
closed_date = <Date 2019-02-24.21:46:21.796>
closer = 'rhettinger'
components = ['Library (Lib)']
creation = <Date 2019-02-18.00:00:59.337>
creator = 'rhettinger'
dependencies = []
files = ['48147', '48148', '48163']
hgrepos = []
issue_num = 36018
keywords = ['patch']
message_count = 35.0
messages = ['335792', '335876', '336008', '336029', '336285', '336295', '336319', '336337', '336376', '336413', '336414', '336417', '336422', '336425', '336433', '336435', '336436', '336439', '336481', '336852', '336895', '337658', '337660', '337662', '337686', '340703', '341137', '341240', '348273', '351340', '351341', '360717', '360719', '360834', '360836']
nosy_count = 8.0
nosy_names = ['tim.peters', 'rhettinger', 'mark.dickinson', 'steven.daprano', 'davin', 'selik', 'miss-islington', 'xtreak']
pr_nums = ['11973', '12009', '12022', '12096', '12114', '12921', '13021', '13047', '14898', '15733', '15734', '18191', '18192', '18226', '18227']
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue36018'
versions = ['Python 3.8']

rhettinger · 2019-02-18T00:00:59Z

Attached is a class that I've found useful for doing practical statistics work with normal distributions. It provides a nice, high-level API that makes short work of everyday statistical problems.

------ Examples --------

# Simple scaling and translation
temperature_february = NormalDist(5, 2.5)            # Celsius
print(temperature_february * (9/5) + 32)             # Fahrenheit


# Classic probability problems
# https://blog.prepscholar.com/sat-standard-deviation
# The mean score on a SAT exam is 1060 with a standard deviation of 195
# What percentage of students score between 1100 and 1200?
sat = NormalDist(1060, 195)
fraction = sat.cdf(1200) - sat.cdf(1100)
print(f'{fraction * 100 :.1f}% score between 1100 and 1200')


# Combination of normal distributions by summing variances
birth_weights = NormalDist.from_samples([2.5, 3.1, 2.1, 2.4, 2.7, 3.5])
drug_effects = NormalDist(0.4, 0.15)
print(birth_weights + drug_effects)


# Statistical calculation estimates using simulations
# Estimate the distribution of X * Y / Z
n = 100_000
X = NormalDist(350, 15).examples(n)
Y = NormalDist(47, 17).examples(n)
Z = NormalDist(62, 6).examples(n)
print(NormalDist.from_samples(x * y / z for x, y, z in zip(X, Y, Z)))

# Naive Bayesian Classifier
# https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Sex_classification

height_male = NormalDist.from_samples([6, 5.92, 5.58, 5.92])
height_female = NormalDist.from_samples([5, 5.5, 5.42, 5.75])
weight_male = NormalDist.from_samples([180, 190, 170, 165])
weight_female = NormalDist.from_samples([100, 150, 130, 150])
foot_size_male = NormalDist.from_samples([12, 11, 12, 10])
foot_size_female = NormalDist.from_samples([6, 8, 7, 9])

prior_male = 0.5
prior_female = 0.5
posterior_male = prior_male * height_male.pdf(6) * weight_male.pdf(130) * foot_size_male.pdf(8)
posterior_female = prior_female * height_female.pdf(6) * weight_female.pdf(130) * foot_size_female.pdf(8)
print('Predict', 'male' if posterior_male > posterior_female else 'female')
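For readers who want to run the first three examples today, here is a minimal sketch against the class as it later landed in the statistics module (Python 3.8+), where the "examples" method is spelled "samples". The seed values and the reduced sample count are arbitrary choices made for this illustration and are not part of the original attachment.

```python
from statistics import NormalDist

# Scaling and translation: Celsius to Fahrenheit
temperature_february = NormalDist(5, 2.5)
fahrenheit = temperature_february * (9 / 5) + 32
print(fahrenheit)

# Fraction of SAT scores between 1100 and 1200
sat = NormalDist(1060, 195)
fraction = sat.cdf(1200) - sat.cdf(1100)
print(f'{fraction * 100:.1f}% score between 1100 and 1200')

# Sum of independent normal variables: means add and variances add
birth_weights = NormalDist.from_samples([2.5, 3.1, 2.1, 2.4, 2.7, 3.5])
drug_effects = NormalDist(0.4, 0.15)
combined = birth_weights + drug_effects
print(combined)

# Simulation estimate of X * Y / Z (seeds are arbitrary, for repeatability)
n = 10_000
X = NormalDist(350, 15).samples(n, seed=1)
Y = NormalDist(47, 17).samples(n, seed=2)
Z = NormalDist(62, 6).samples(n, seed=3)
estimate = NormalDist.from_samples([x * y / z for x, y, z in zip(X, Y, Z)])
print(estimate)
```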

stevendaprano · 2019-02-18T23:42:51Z

I like this idea!

Should the "examples" method be re-named "samples"? That's the word used in the docstring, and it matches the from_samples method.

selik · 2019-02-19T19:38:32Z

+1, This would be useful for quick analyses, avoiding the overhead of installing scipy and looking through its documentation.

Given that it's in the statistics namespace, I think the name can be simply Normal rather than NormalDist. Also, instead of .from_samples consider naming the classmethod .fit.

rhettinger · 2019-02-20T00:33:27Z

I'll work up a PR for this.

We can continue to tease out the best method names. I've had success with "examples" and "from_samples" when developing this code in the classroom. Both names had the virtue of being easily understood and never being misunderstood.

Intellectually, the name fit() makes sense because we are using data to create best fit model parameters. So, technically this is probably the most accurate terminology. However, it doesn't match how I think about the problem, which is more along the lines of "use sampling data to make a random variable with a normal distribution". Another minor issue is that class methods are typically (but not always) recognizable by their from-prefix (e.g. dict.fromkeys, datetime.fromtimestamp, etc).

"NormalDist" seems more self-explanatory to me than just "Normal". Also, the noun form seems "more complete" than a dangling adjective (reading "normal" immediately raises the question "normal what?"). FWIW, MS Excel also calls their variant NORM.DIST (formerly spelled without the dot).

rhettinger · 2019-02-22T08:36:14Z

Okay the PR is ready.

If you all are mostly comfortable with it, it would be great to get this in for the second alpha so that people have a chance to work with it.

stevendaprano · 2019-02-22T11:39:50Z

Thanks Raymond.

Apologies for commenting here instead of at the PR.

While I've been fighting with more intermittently broken than usual
internet access, Github has stopped supporting my browser. I can't
upgrade the browser without upgrading the OS, and I can't upgrade the OS
without new hardware, and that will take money I don't have at the moment.

So the bottom line is that while I can read *part* of the diffs on
Github, that's about all I can do. I can't comment there, I can't fork,
I can't make push requests, half the pages don't load for me and the
other half don't work properly when they do load. I can't even do a git
clone.

So right now, the only thing I can do is comment on your extensive
documentation in statistics.rst. That's very nicely done.

The only thing that strikes me as problematic is the default value for
sigma, namely 0.0. The PDF for normal curve divides by sigma, so if
that's zero, things are undefined. So I think that sigma ought to be
strictly positive.

I also think it would be nice to default to the standard normal curve,
with mu=0.0 and sigma=1.0. That will make it easy to work with Z scores.

Thanks again for this class, and my apologies for my inability to
follow the preferred workflow.

rhettinger · 2019-02-22T14:57:27Z

I've made both suggested changes, "examples"->"samples" and set the defaults to the standard normal distribution.

To bypass Github, I've attached a diff to this tracker issue. Let me know what you think :-)
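A short sketch of both points, written against the released statistics.NormalDist. The behavior shown for sigma == 0 (construction succeeds, pdf() raises StatisticsError) is what I understand the shipped class to do; treat it as an assumption to verify on your Python version rather than as part of the diff discussed here.

```python
from statistics import NormalDist, StatisticsError

# With no arguments the class models the standard normal curve
z = NormalDist()
print(z.mean, z.stdev)
print(z.cdf(1.96) - z.cdf(-1.96))   # probability mass within +/- 1.96 sigma

# A zero sigma leaves the density undefined, so pdf() is expected to refuse
degenerate = NormalDist(10.0, 0.0)
try:
    degenerate.pdf(10.0)
except StatisticsError as exc:
    outcome = f'pdf rejected: {exc}'
else:
    outcome = 'pdf returned a value'
print(outcome)

# A negative sigma is expected to be rejected at construction time
try:
    NormalDist(0.0, -1.0)
except StatisticsError:
    negative_rejected = True
else:
    negative_rejected = False
print(negative_rejected)
```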

tirkarthi · 2019-02-22T17:44:10Z

@steven.daprano Bit off topic, but you can also append .patch to the PR URL to generate a patch file with all the commits made in the PR up to the latest commit, while .diff provides the current diff against master. They are plain text and can be downloaded through wget and viewed with an editor, in case it helps.

https://github.com/python/cpython/pull/11973.patch
https://github.com/python/cpython/pull/11973.diff

rhettinger · 2019-02-23T09:42:35Z

Thanks for all the positive feedback. If there are no objections, I would like to push this so it will be in the second alpha release so that it can get exercised. We can still make adjustments afterwards.

rhettinger · 2019-02-23T22:44:12Z

New changeset 11c7953 by Raymond Hettinger in branch 'master':
bpo-36018: Add the NormalDist class to the statistics module (GH-11973)
11c7953

rhettinger · 2019-02-23T22:46:56Z

Okay, it's in for the second alpha. Please continue to make API or implementation suggestions. Nothing is set in stone.

applio · 2019-02-23T23:37:55Z

There is an inconsistency worth paying attention to in the choice of names of the input parameters.

Currently in the statistics module, pvariance() accepts a parameter named "mu" and pstdev() and variance() each accept a parameter named "xbar". The docs describe both "mu" and "xbar" as "it should be the mean of data". I suggest it is worth rationalizing the names used within the statistics module for consistency before reusing "mu" or "xbar" or anything else in NormalDist.

Using the names of mathematical symbols that are commonly used to represent a concept is potentially confusing because those symbols are not always *universally* used. For example, students are often introduced to new concepts in introductory mathematics texts where concepts such as "mean" appear in formulas and equations not as "mu" but as "xbar" or simply "m" or other simple (and hopefully "friendly") names/symbols. As a mathematician, if I am told a variable is named, "mu", I still feel the need to ask what it represents. Sure, I can try guessing based upon context but I will usually have more than one guess that I could make.

Rather than continue down a path of using various mathematical-symbols-written-out-in-English-spelling, one alternative would be to use less ambiguous, more informative variable names such as "mean". It might be worth considering a change to the parameter names of "mu" and "sigma" in NormalDist to names like "mean" and "stddev", respectively. Or perhaps "mean" and "standard_deviation". Or perhaps "mean" and "variance" would be easier still (recognizing that variance can be readily computed from standard deviation in this particular context). In terms of consistency with other packages that users are likely to also use, scipy.stats functions/objects commonly refer to these concepts as "mean" and "var".

I like the idea of making NormalDist readily approachable for students as well as those more familiar with these concepts. The offerings in scipy.stats are excellent but they are not always the most approachable things for new students of statistics.

stevendaprano · 2019-02-24T03:17:37Z

Karthikeyan: thanks for the hint about Github.

Raymond: thanks for the diff. Some comments:

Why use object.__setattr__(self, 'mu', mu) instead of self.mu = mu in the __init__ method?

Should __pos__ return a copy rather than the instance itself?

The rest looks good to me, and I look forward to using it.

stevendaprano · 2019-02-24T03:41:08Z

Davin: the choice of using mu versus xbar was deliberate, as they represent different quantities: the population mean versus a sample mean. But reading over the docs with fresh eyes, I can now see that the distinction is not as clear as I intended.

I think that changing the names now would be a breaking change, but even if it wasn't, I don't want to change the names. The distinction between population parameters (mu) and sample statistics (xbar) is important and I think the function parameters should reflect that.

As for the new NormalDist class, we aren't limited by backwards compatibility, but I would still argue for the current names mu and sigma. As well as matching the population parameters of the distribution, they also match the names used in calculators such as the TI Nspire and Casio Classpad (two very popular CAS calculators used by secondary school students).

See bpo-36099. If you would like to suggest some doc changes, please feel free to do so.
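To make the population-versus-sample distinction concrete, here is a small sketch using the existing module functions. The data values are borrowed from the birth-weight example earlier in this issue purely for illustration.

```python
from math import isclose
from statistics import mean, pvariance, variance

data = [2.5, 3.1, 2.1, 2.4, 2.7, 3.5]
m = mean(data)
n = len(data)

# Treat the data as the whole population: pass the mean as mu, divide by n
pop_var = pvariance(data, mu=m)

# Treat the data as a sample: pass the mean as xbar, divide by n - 1
sample_var = variance(data, xbar=m)

print(pop_var, sample_var)

# The two results differ only by the n versus n - 1 divisor
print(isclose(pop_var * n, sample_var * (n - 1)))
```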

applio · 2019-02-24T05:26:14Z

Steven: Your point about population versus sample makes sense and your point that altering their names would be a breaking change is especially important. I think that pretty well puts an end to my suggestion of alternative names and says the current pattern should be kept with NormalDist.

I particularly like the idea of using the TI Nspire and Casio Classpad to guide or help confirm what symbols might be recognizable to secondary students or 1st year university students.

Raymond: As an idea for examples demonstrating the code, what about an example where a plot of pdf is created, possibly for comparison with cdf? This would require something like matplotlib but would help to visually communicate the concepts of pdf, perhaps with different sigma values?
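Short of a real plot, a text table already shows the effect of sigma on the curve. The sketch below is only an illustration of that idea: the grid of x values and the sigma values are arbitrary, and a matplotlib version would plot the same numbers.

```python
from statistics import NormalDist

xs = [-2, -1, 0, 1, 2]          # arbitrary evaluation points
sigmas = (0.5, 1.0, 2.0)        # arbitrary spreads to compare

table = {}
for sigma in sigmas:
    dist = NormalDist(0, sigma)
    table[sigma] = [(dist.pdf(x), dist.cdf(x)) for x in xs]

# One row per sigma: pdf values first, then cdf values
for sigma, cells in table.items():
    pdf_row = '  '.join(f'{p:.4f}' for p, _ in cells)
    cdf_row = '  '.join(f'{c:.4f}' for _, c in cells)
    print(f'sigma={sigma}  pdf: {pdf_row}  cdf: {cdf_row}')
```

A smaller sigma gives a taller, narrower pdf around the mean, while the cdf always passes through 0.5 at the mean.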

rhettinger · 2019-02-24T05:56:42Z

> Why use object.__setattr__(self, 'mu', mu) instead of
> self.mu = mu in the __init__ method?

The idea was the instances should be immutable and hashable, but this added unnecessary complexity, so I took this out prior to the check in.

> Should __pos__ return a copy rather than the instance itself?

Yes. I'll fix that straightaway.
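The intended behavior can be checked in a couple of lines against the released class: unary plus should give back an equal but distinct object.

```python
from statistics import NormalDist

d = NormalDist(100, 15)
e = +d

print(e == d)   # same parameters, so the two compare equal
print(e is d)   # but __pos__ is expected to build a new instance
```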

> The choice of using mu versus xbar was deliberate

I concur with that choice and also prefer to stick with mu and sigma:

1) It's too late to change it elsewhere in statistics and the random modules. 2) Having attribute names the same as function names in the same module is confusing. 3) I had already user tested this API in some Python courses. 4) The variable names match the various external sources I've linked to in the docs. 5) Python historically hasn't shied from Greek letter names (math: pi, tau, gamma; random: alpha, beta, lambd, mu, sigma).

rhettinger · 2019-02-24T06:00:41Z

Steven, Davin, Michael: Thanks for the encouragement and taking the time to review this code.

miss-islington · 2019-02-24T06:19:08Z

New changeset 79fbcc5 by Miss Islington (bot) (Raymond Hettinger) in branch 'master':
bpo-36018: Make __pos__ return a distinct instance of NormDist (GH-12009)
79fbcc5

miss-islington · 2019-02-24T19:45:00Z

New changeset 9e456bc by Miss Islington (bot) (Raymond Hettinger) in branch 'master':
bpo-36018: Add properties for mean and stdev (GH-12022)
9e456bc

miss-islington · 2019-02-28T17:16:28Z

New changeset ef17fdb by Miss Islington (bot) (Raymond Hettinger) in branch 'master':
bpo-36018: Add special value tests and make minor tweaks to the docs (GH-12096)
ef17fdb

miss-islington · 2019-03-01T05:47:29Z

New changeset 9add4b3 by Miss Islington (bot) (Raymond Hettinger) in branch 'master':
bpo-36018: Add documentation link to "random variable" (GH-12114)
9add4b3

stevendaprano · 2019-03-11T11:31:31Z

I've done some spot checks of NormalDist.pdf and .cdf and compared the results to those returned by my TI Nspire calculator.

So far, the PDF has matched that of the Nspire to 12 decimal places (the limit the calculator will show), but the CDF differs on or about the 8th decimal place:

py> x = statistics.NormalDist(2, 1.3)
py> x.cdf(5.374)
0.9952757439207682
# Nspire normCdf(-∞, 5.372, 2, 1.3) returns 0.995275710979
# difference of 3.294176820212158e-08

py> x.cdf(-0.23)
0.04313736707891003
# Nspire normCdf(-∞, -0.23, 2, 1.3) returns 0.043137332077
# difference of 3.500191003008579e-08

Wolfram Alpha doesn't help me decide which is correct, as it doesn't show enough decimal places.

https://www.wolframalpha.com/input/?i=CDF[+NormalDistribution[2,+1.3],+5.374+]

https://www.wolframalpha.com/input/?i=CDF[+NormalDistribution[2,+1.3],+-0.23+]

Do we care about this difference? Should I raise a new ticket for it?

mdickinson · 2019-03-11T12:12:47Z

According to GP/Pari, the correct value for the first result, to the first few dozen places, is:

0.995275743920768157605659214368609706759611629000344854339231928536087783251913252354...

I'm assuming you meant 5.374 rather than 5.372 in the first Nspire result.

mdickinson · 2019-03-11T12:44:52Z

Below is the full transcript from Pari/GP: note that I converted the float inputs to exact Decimal equivalents, assuming IEEE 754 binary64. Summary: both Python results look fine; it's Nspire that's inaccurate here.

mirzakhani:~ mdickinson$ /opt/local/bin/gp
GP/PARI CALCULATOR Version 2.11.1 (released)
i386 running darwin (x86-64/GMP-6.1.2 kernel) 64-bit version
compiled: Jan 24 2019, Apple LLVM version 10.0.0 (clang-1000.11.45.5)
threading engine: single
(readline v8.0 enabled, extended help enabled)

                                           Copyright (C) 2000-2018 The PARI Group

PARI/GP is free software, covered by the GNU General Public License, and comes WITHOUT ANY WARRANTY WHATSOEVER.

Type ? for help, \q to quit.
Type ?17 for how to get moral (and possibly technical) support.

parisize = 8000000, primelimit = 500000
? \p 200
   realprecision = 211 significant digits (200 digits displayed)
? ncdf(x, mu, sig) = (2 - erfc((x - mu) / sig / sqrt(2))) / 2
%1 = (x,mu,sig)->(2-erfc((x-mu)/sig/sqrt(2)))/2
? ncdf(5.37399999999999966604491419275291264057159423828125, 2, 1.3000000000000000444089209850062616169452667236328125)
%2 = 0.99527574392076815760565921436860970675961162900034485433923192853608778325191325235412640687571628164064779657215907190523884572141701976336760387216713270956350229484865180142256611330976179584951493
? ncdf(-0.2300000000000000099920072216264088638126850128173828125, 2, 1.3000000000000000444089209850062616169452667236328125)
%3 = 0.043137367078910025352120502108682523151629166877357644882244088336773338416883044522024586619860574718679715351558322591944140762629090301623352497457372937783778706411712862062109829239761761597057063
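The same cross-check can be repeated at double precision in Python. This sketch transcribes the ncdf definition from the GP session above using math.erfc and compares it with NormalDist.cdf. It only shows that the two float computations agree with each other and with the values quoted earlier in this issue; it is not a substitute for the 200-digit reference.

```python
from decimal import Decimal
from math import erfc, isclose, sqrt
from statistics import NormalDist

def ncdf(x, mu, sig):
    """Normal CDF written with erfc, mirroring the GP definition above."""
    return (2 - erfc((x - mu) / sig / sqrt(2))) / 2

dist = NormalDist(2, 1.3)
results = {}
for x in (5.374, -0.23):
    results[x] = (dist.cdf(x), ncdf(x, 2, 1.3))
    # Decimal(x) shows the exact binary64 value that was fed to GP
    print(Decimal(x))
    print(results[x])
```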

stevendaprano · 2019-03-11T16:36:23Z

> I'm assuming you meant 5.374 rather than 5.372 in the first Nspire result.

Yes, that was a typo, sorry.

Thanks for checking into the results.

rhettinger · 2019-04-23T08:46:25Z

New changeset fb8c7d5 by Raymond Hettinger in branch 'master':
bpo-36018: Make "seed" into a keyword only argument (GH-12921)
fb8c7d5
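A brief sketch of what the keyword-only change means for callers of NormalDist.samples(). The seed value is arbitrary.

```python
from statistics import NormalDist

dist = NormalDist(100, 15)

# The seed must be passed by name; equal seeds give repeatable draws
first = dist.samples(5, seed=12345)
second = dist.samples(5, seed=12345)
print(first == second)

# Passing the seed positionally is expected to fail with TypeError
try:
    dist.samples(5, 12345)
except TypeError:
    positional_ok = False
else:
    positional_ok = True
print(positional_ok)
```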

rhettinger · 2019-04-30T06:47:37Z

New changeset b0a2c0f by Raymond Hettinger in branch 'master':
bpo-36018: Test idempotence. Test two methods against one-another. (GH-13021)
b0a2c0f

rhettinger · 2019-05-02T00:49:16Z

New changeset 671d782 by Raymond Hettinger in branch 'master':
bpo-36018: Update example to show mean and stdev (GH-13047)
671d782

stevendaprano · 2019-07-22T00:23:44Z

I have a query about the documentation:

The default *method* is "exclusive" and is used for data sampled
from a population that can have more extreme values than found
in the samples. ...
Setting the *method* to "inclusive" is used for describing
population data or for samples that include the extreme points.

In all my reading about quantile calculation methods, this is the first time I've come across this recommendation. Do you have a source for it or a justification?

Thanks.
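For reference, the practical difference between the two methods can be seen on a small data set with statistics.quantiles(). The data below are reused from the birth-weight example earlier in this issue and are only illustrative. The sketch shows what the function computes and takes no position on the recommendation being questioned.

```python
from statistics import quantiles

data = [2.5, 3.1, 2.1, 2.4, 2.7, 3.5]

# Quartiles under each method (n=4 cut points -> three values)
exclusive = quantiles(data, n=4, method='exclusive')
inclusive = quantiles(data, n=4, method='inclusive')

print('exclusive:', exclusive)
print('inclusive:', inclusive)

# The exclusive method spreads the outer quartiles further apart,
# leaving room for values more extreme than those observed
print(exclusive[0] <= inclusive[0], exclusive[-1] >= inclusive[-1])
```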

rhettinger · 2019-09-08T23:58:06Z

New changeset 4db25d5 by Raymond Hettinger in branch 'master':
bpo-36018: Address more reviewer feedback (GH-15733)
4db25d5

rhettinger · 2019-09-09T01:40:09Z

New changeset cc1bdf9 by Raymond Hettinger in branch '3.8':
[3.8] bpo-36018: Address more reviewer feedback (GH-15733) (GH-15734)
cc1bdf9

rhettinger · 2020-01-26T04:21:22Z

New changeset 10355ed by Raymond Hettinger in branch 'master':
bpo-36018: Add another example for NormalDist() (GH-18191)
10355ed

rhettinger · 2020-01-26T05:24:18Z

New changeset eebcff8 by Raymond Hettinger (Miss Islington (bot)) in branch '3.8':
bpo-36018: Add another example for NormalDist() (GH-18191) (GH-18192)
eebcff8

rhettinger · 2020-01-28T02:31:56Z

New changeset 01bf219 by Raymond Hettinger in branch 'master':
bpo-36018: Minor fixes to the NormalDist() examples and recipes. (GH-18226)
01bf219

rhettinger · 2020-01-28T03:40:18Z

New changeset 41f4dc3 by Raymond Hettinger (Miss Islington (bot)) in branch '3.8':
bpo-36018: Minor fixes to the NormalDist() examples and recipes. (GH-18226) (GH-18227)
41f4dc3

rhettinger added the 3.8 (only security fixes) label Feb 18, 2019

rhettinger assigned stevendaprano Feb 18, 2019

rhettinger added the stdlib (Python modules in the Lib dir) and type-feature (A feature request or enhancement) labels Feb 18, 2019

rhettinger closed this as completed Feb 24, 2019

ezio-melotti transferred this issue from another repository Apr 10, 2022
