Add a Normal Distribution class to the statistics module #80199
Attached is a class that I've found useful for doing practical statistics work with normal distributions. It provides a nice, high-level API that makes short work of everyday statistical problems.

------ Examples --------

# Simple scaling and translation
temperature_february = NormalDist(5, 2.5) # Celsius
print(temperature_february * (9/5) + 32) # Fahrenheit
# Classic probability problems
# https://blog.prepscholar.com/sat-standard-deviation
# The mean score on a SAT exam is 1060 with a standard deviation of 195
# What percentage of students score between 1100 and 1200?
sat = NormalDist(1060, 195)
fraction = sat.cdf(1200) - sat.cdf(1100)
print(f'{fraction * 100 :.1f}% score between 1100 and 1200')
# Combination of normal distributions by summing variances
birth_weights = NormalDist.from_samples([2.5, 3.1, 2.1, 2.4, 2.7, 3.5])
drug_effects = NormalDist(0.4, 0.15)
print(birth_weights + drug_effects)
# Statistical calculation estimates using simulations
# Estimate the distribution of X * Y / Z
n = 100_000
X = NormalDist(350, 15).examples(n)
Y = NormalDist(47, 17).examples(n)
Z = NormalDist(62, 6).examples(n)
print(NormalDist.from_samples(x * y / z for x, y, z in zip(X, Y, Z)))

# Naive Bayesian Classifier
height_male = NormalDist.from_samples([6, 5.92, 5.58, 5.92])
height_female = NormalDist.from_samples([5, 5.5, 5.42, 5.75])
weight_male = NormalDist.from_samples([180, 190, 170, 165])
weight_female = NormalDist.from_samples([100, 150, 130, 150])
foot_size_male = NormalDist.from_samples([12, 11, 12, 10])
foot_size_female = NormalDist.from_samples([6, 8, 7, 9])
prior_male = 0.5
prior_female = 0.5
posterior_male = prior_male * height_male.pdf(6) * weight_male.pdf(130) * foot_size_male.pdf(8)
posterior_female = prior_female * height_female.pdf(6) * weight_female.pdf(130) * foot_size_female.pdf(8)
print('Predict', 'male' if posterior_male > posterior_female else 'female')
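If actual probabilities are wanted rather than the raw scores, the two posteriors can be normalized to sum to one. A minimal sketch continuing the example above (this step is not part of the attached class, just ordinary Bayes arithmetic):

```python
# Normalize the two unnormalized posterior scores into probabilities.
total = posterior_male + posterior_female
print(f'P(male)   = {posterior_male / total:.3f}')
print(f'P(female) = {posterior_female / total:.3f}')
```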
I like this idea! Should the "examples" method be re-named "samples"? That's the word used in the docstring, and it matches the from_samples method.
+1, This would be useful for quick analyses, avoiding the overhead of installing scipy and looking through its documentation. Given that it's in the statistics namespace, I think the name can be simply Normal.
I'll work up a PR for this. We can continue to tease out the best method names. I've had success with "examples" and "from_samples" when developing this code in the classroom. Both names had the virtue of being easily understood and never being misunderstood.

Intellectually, the name fit() makes sense because we are using data to create best-fit model parameters, so technically this is probably the most accurate terminology. However, it doesn't match how I think about the problem, which is more along the lines of "use sampling data to make a random variable with a normal distribution". Another minor issue is that class methods are typically (but not always) recognizable by their from- prefix (e.g. dict.fromkeys, datetime.fromtimestamp, etc).

"NormalDist" seems more self-explanatory to me than just "Normal". Also, the noun form seems "more complete" than a dangling adjective (reading "normal" immediately raises the question "normal what?"). FWIW, MS Excel also calls their variant NORM.DIST (formerly spelled without the dot).
Okay, the PR is ready. If you all are mostly comfortable with it, it would be great to get this in for the second alpha so that people have a chance to work with it.
Thanks Raymond. Apologies for commenting here instead of at the PR. I've been fighting with more intermittently broken than usual internet access, so the bottom line is that while I can read *part* of the diffs on the web, right now the only thing I can do is comment on your extensive docstrings and docs.

The only thing that strikes me as problematic is the default values for the parameters: I think it would be nice to default to the standard normal curve.

Thanks again for this class, and my apologies for my inability to review it more fully.
I've made both suggested changes: renamed "examples" to "samples" and set the defaults to the standard normal distribution. To bypass GitHub, I've attached a diff to this tracker issue. Let me know what you think :-)
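Assuming the behaviour described above (which matches the API that eventually shipped), the no-argument constructor gives the standard normal curve. A minimal sketch:

```python
from statistics import NormalDist  # available in Python 3.8+

std = NormalDist()    # defaults to mu=0.0, sigma=1.0
print(std.cdf(0))     # 0.5 -- half the probability mass lies below the mean
print(std.pdf(0))     # ~0.3989, i.e. 1/sqrt(2*pi)
```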
@steven.daprano Bit off topic, but you can also append .patch to the PR URL to generate a patch file with all the commits made in the PR up to the latest commit, and .diff provides the current diff against master. They are plain text and can be downloaded with wget and viewed in an editor, in case that helps. https://github.com/python/cpython/pull/11973.patch
Thanks for all the positive feedback. If there are no objections, I would like to push this so it will be in the second alpha release so that it can get exercised. We can still make adjustments afterwards.
Okay, it's in for the second alpha. Please continue to make API or implementation suggestions. Nothing is set in stone.
There is an inconsistency worth paying attention to in the choice of names of the input parameters. Currently in the statistics module, pvariance() accepts a parameter named "mu" and pstdev() and variance() each accept a parameter named "xbar". The docs describe both "mu" and "xbar" as "it should be the mean of data". I suggest it is worth rationalizing the names used within the statistics module for consistency before reusing "mu" or "xbar" or anything else in NormalDist.

Using the names of mathematical symbols that are commonly used to represent a concept is potentially confusing because those symbols are not always *universally* used. For example, students are often introduced to new concepts in introductory mathematics texts where concepts such as "mean" appear in formulas and equations not as "mu" but as "xbar" or simply "m" or other simple (and hopefully "friendly") names/symbols. As a mathematician, if I am told a variable is named "mu", I still feel the need to ask what it represents. Sure, I can try guessing based upon context, but I will usually have more than one guess that I could make.

Rather than continue down a path of using various mathematical-symbols-written-out-in-English-spelling, one alternative would be to use less ambiguous, more informative variable names such as "mean". It might be worth considering a change to the parameter names of "mu" and "sigma" in NormalDist to names like "mean" and "stddev", respectively. Or perhaps "mean" and "standard_deviation". Or perhaps "mean" and "variance" would be easier still (recognizing that variance can be readily computed from standard deviation in this particular context). In terms of consistency with other packages that users are likely to also use, scipy.stats functions/objects commonly refer to these concepts as "mean" and "var".

I like the idea of making NormalDist readily approachable for students as well as those more familiar with these concepts. The offerings in scipy.stats are excellent, but they are not always the most approachable things for new students of statistics.
Karthikeyan: thanks for the hint about Github.

Raymond: thanks for the diff. Some comments:

Why use object.__setattr__(self, 'mu', mu) instead of self.mu = mu in the __init__ method?

Should __pos__ return a copy rather than the instance itself?

The rest looks good to me, and I look forward to using it.
Davin: the choice of using mu versus xbar was deliberate, as they represent different quantities: the population mean versus a sample mean. But reading over the docs with fresh eyes, I can now see that the distinction is not as clear as I intended.

I think that changing the names now would be a breaking change, but even if it wasn't, I don't want to change the names. The distinction between population parameters (mu) and sample statistics (xbar) is important and I think the function parameters should reflect that.

As for the new NormalDist class, we aren't limited by backwards compatibility, but I would still argue for the current names mu and sigma. As well as matching the population parameters of the distribution, they also match the names used in calculators such as the TI Nspire and Casio Classpad (two very popular CAS calculators used by secondary school students). See bpo-36099. If you would like to suggest some doc changes, please feel free to do so.
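For readers unfamiliar with the distinction, a minimal sketch of how the existing functions already use the two parameter names (plain statistics module calls, not part of the patch):

```python
import statistics

data = [2.5, 3.1, 2.1, 2.4, 2.7, 3.5]
m = statistics.mean(data)

# Treat the data as the entire population: the known population mean is "mu".
print(statistics.pvariance(data, mu=m))

# Treat the data as a sample from a larger population: the sample mean is "xbar".
print(statistics.variance(data, xbar=m))
```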
Steven: Your point about population versus sample makes sense, and your point that altering their names would be a breaking change is especially important. I think that pretty well puts an end to my suggestion of alternative names and says the current pattern should be kept with NormalDist. I particularly like the idea of using the TI Nspire and Casio Classpad to guide or help confirm what symbols might be recognizable to secondary students or 1st year university students.

Raymond: As an idea for examples demonstrating the code, what about an example where a plot of pdf is created, possibly for comparison with cdf? This would require something like matplotlib but would help to visually communicate the concepts of pdf, perhaps with different sigma values?
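A minimal sketch of the kind of plot described above, assuming matplotlib is installed (illustrative only, not part of the proposed patch):

```python
import matplotlib.pyplot as plt
from statistics import NormalDist

xs = [x / 10 for x in range(-60, 61)]        # grid from -6.0 to 6.0
for sigma in (0.5, 1.0, 2.0):                # compare several spreads
    dist = NormalDist(0, sigma)
    plt.plot(xs, [dist.pdf(x) for x in xs], label=f'pdf, sigma={sigma}')
    plt.plot(xs, [dist.cdf(x) for x in xs], '--', label=f'cdf, sigma={sigma}')
plt.legend()
plt.show()
```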
The idea was that instances should be immutable and hashable, but this added unnecessary complexity, so I took it out prior to the check-in.
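For anyone curious why object.__setattr__ appeared in the diff at all, here is a minimal sketch of the immutable/hashable pattern being described (the name FrozenNormalDist is made up for illustration; the checked-in code dropped this approach):

```python
class FrozenNormalDist:
    'Illustrative only: read-only mu and sigma, written once via object.__setattr__.'

    __slots__ = ('mu', 'sigma')

    def __init__(self, mu=0.0, sigma=1.0):
        # Bypass the overridden __setattr__ so the slots can be written exactly once.
        object.__setattr__(self, 'mu', mu)
        object.__setattr__(self, 'sigma', sigma)

    def __setattr__(self, name, value):
        raise AttributeError(f'{type(self).__name__} instances are immutable')

    def __hash__(self):
        return hash((self.mu, self.sigma))

    def __eq__(self, other):
        if not isinstance(other, FrozenNormalDist):
            return NotImplemented
        return (self.mu, self.sigma) == (other.mu, other.sigma)
```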
Yes. I'll fix that straightaway.

> The choice of using mu versus xbar was deliberate

I concur with that choice and also prefer to stick with mu and sigma.
Steven, Davin, Michael: Thanks for the encouragement and taking the time to review this code. |
I've done some spot checks of NormalDist.pdf and .cdf and compared the results to those returned by my TI Nspire calculator. So far, the PDF has matched that of the Nspire to 12 decimal places (the limit the calculator will show), but the CDF differs on or about the 8th decimal place:

py> x = statistics.NormalDist(2, 1.3)
py> x.cdf(-0.23)

Wolfram Alpha doesn't help me decide which is correct, as it doesn't show enough decimal places.

https://www.wolframalpha.com/input/?i=CDF[+NormalDistribution[2,+1.3],+5.374+]
https://www.wolframalpha.com/input/?i=CDF[+NormalDistribution[2,+1.3],+-0.23+]

Do we care about this difference? Should I raise a new ticket for it?
According to GP/Pari, the correct value for the first result, to the first few dozen places, is:

0.995275743920768157605659214368609706759611629000344854339231928536087783251913252354...

I'm assuming you meant 5.374 rather than 5.372 in the first Nspire result.
Below is the full transcript from Pari/GP; note that I converted the float inputs to exact Decimal equivalents, assuming IEEE 754 binary64. Summary: both Python results look fine; it's Nspire that's inaccurate here.

mirzakhani:~ mdickinson$ /opt/local/bin/gp
PARI/GP is free software, covered by the GNU General Public License, and comes WITHOUT ANY WARRANTY WHATSOEVER. Type ? for help, \q to quit.
parisize = 8000000, primelimit = 500000
? \p 200
realprecision = 211 significant digits (200 digits displayed)
? ncdf(x, mu, sig) = (2 - erfc((x - mu) / sig / sqrt(2))) / 2
%1 = (x,mu,sig)->(2-erfc((x-mu)/sig/sqrt(2)))/2
? ncdf(5.37399999999999966604491419275291264057159423828125, 2, 1.3000000000000000444089209850062616169452667236328125)
%2 = 0.99527574392076815760565921436860970675961162900034485433923192853608778325191325235412640687571628164064779657215907190523884572141701976336760387216713270956350229484865180142256611330976179584951493
? ncdf(-0.2300000000000000099920072216264088638126850128173828125, 2, 1.3000000000000000444089209850062616169452667236328125)
%3 = 0.043137367078910025352120502108682523151629166877357644882244088336773338416883044522024586619860574718679715351558322591944140762629090301623352497457372937783778706411712862062109829239761761597057063
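The same cross-check is easy to reproduce directly in Python with math.erfc; a minimal sketch mirroring the ncdf() helper above (this is the closed-form CDF formula, not the module's internal implementation):

```python
from math import erfc, sqrt
from statistics import NormalDist

def ncdf(x, mu, sigma):
    'Closed-form normal CDF, same formula as the Pari/GP helper above.'
    return (2 - erfc((x - mu) / (sigma * sqrt(2)))) / 2

d = NormalDist(2, 1.3)
for x in (5.374, -0.23):
    print(x, d.cdf(x), ncdf(x, 2, 1.3))
```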
Yes, that was a typo, sorry. Thanks for checking into the results.
I have a query about the documentation:

> The default *method* is "exclusive" and is used for data sampled from a population that can have more extreme values than found in the samples.

In all my reading about quantile calculation methods, this is the first time I've come across this recommendation. Do you have a source for it or a justification? Thanks.
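The quoted sentence matches the description of statistics.quantiles(), whose two methods can be compared side by side; a minimal sketch (the data values here are made up for illustration):

```python
import statistics

data = [105, 129, 87, 86, 111, 111, 89, 81, 108, 92, 110, 100]

# Default "exclusive" method: treats the data as a sample from a larger
# population that may contain more extreme values than the sample itself.
print(statistics.quantiles(data, n=4))

# "inclusive" method: treats the data as the whole population (or as already
# including the most extreme values).
print(statistics.quantiles(data, n=4, method='inclusive'))
```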