Skip to content

Commit

Permalink
bpo-36018: Add the NormalDist class to the statistics module (GH-11973)
Browse files Browse the repository at this point in the history
  • Loading branch information
rhettinger committed Feb 23, 2019
1 parent 64d6cc8 commit 11c7953
Show file tree
Hide file tree
Showing 5 changed files with 556 additions and 1 deletion.
195 changes: 195 additions & 0 deletions Doc/library/statistics.rst
Original file line number Diff line number Diff line change
Expand Up @@ -467,6 +467,201 @@ A single exception is defined:

Subclass of :exc:`ValueError` for statistics-related exceptions.


:class:`NormalDist` objects
===========================

A :class:`NormalDist` is a a composite class that treats the mean and standard
deviation of data measurements as a single entity. It is a tool for creating
and manipulating normal distributions of a random variable.

Normal distributions arise from the `Central Limit Theorem
<https://en.wikipedia.org/wiki/Central_limit_theorem>`_ and have a wide range
of applications in statistics, including simulations and hypothesis testing.

.. class:: NormalDist(mu=0.0, sigma=1.0)

Returns a new *NormalDist* object where *mu* represents the `arithmetic
mean <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ of data and *sigma*
represents the `standard deviation
<https://en.wikipedia.org/wiki/Standard_deviation>`_ of the data.

If *sigma* is negative, raises :exc:`StatisticsError`.

.. attribute:: mu

The mean of a normal distribution.

.. attribute:: sigma

The standard deviation of a normal distribution.

.. attribute:: variance

A read-only property representing the `variance
<https://en.wikipedia.org/wiki/Variance>`_ of a normal
distribution. Equal to the square of the standard deviation.

.. classmethod:: NormalDist.from_samples(data)

Class method that makes a normal distribution instance
from sample data. The *data* can be any :term:`iterable`
and should consist of values that can be converted to type
:class:`float`.

If *data* does not contain at least two elements, raises
:exc:`StatisticsError` because it takes at least one point to estimate
a central value and at least two points to estimate dispersion.

.. method:: NormalDist.samples(n, seed=None)

Generates *n* random samples for a given mean and standard deviation.
Returns a :class:`list` of :class:`float` values.

If *seed* is given, creates a new instance of the underlying random
number generator. This is useful for creating reproducible results,
even in a multi-threading context.

.. method:: NormalDist.pdf(x)

Using a `probability density function (pdf)
<https://en.wikipedia.org/wiki/Probability_density_function>`_,
compute the relative likelihood that a random sample *X* will be near
the given value *x*. Mathematically, it is the ratio ``P(x <= X <
x+dx) / dx``.

Note the relative likelihood of *x* can be greater than `1.0`. The
probability for a specific point on a continuous distribution is `0.0`,
so the :func:`pdf` is used instead. It gives the probability of a
sample occurring in a narrow range around *x* and then dividing that
probability by the width of the range (hence the word "density").

.. method:: NormalDist.cdf(x)

Using a `cumulative distribution function (cdf)
<https://en.wikipedia.org/wiki/Cumulative_distribution_function>`_,
compute the probability that a random sample *X* will be less than or
equal to *x*. Mathematically, it is written ``P(X <= x)``.

Instances of :class:`NormalDist` support addition, subtraction,
multiplication and division by a constant. These operations
are used for translation and scaling. For example:

.. doctest::

>>> temperature_february = NormalDist(5, 2.5) # Celsius
>>> temperature_february * (9/5) + 32 # Fahrenheit
NormalDist(mu=41.0, sigma=4.5)

Dividing a constant by an instance of :class:`NormalDist` is not supported.

Since normal distributions arise from additive effects of independent
variables, it is possible to `add and subtract two normally distributed
random variables
<https://en.wikipedia.org/wiki/Sum_of_normally_distributed_random_variables>`_
represented as instances of :class:`NormalDist`. For example:

.. doctest::

>>> birth_weights = NormalDist.from_samples([2.5, 3.1, 2.1, 2.4, 2.7, 3.5])
>>> drug_effects = NormalDist(0.4, 0.15)
>>> combined = birth_weights + drug_effects
>>> f'mu={combined.mu :.1f} sigma={combined.sigma :.1f}'
'mu=3.1 sigma=0.5'

.. versionadded:: 3.8


:class:`NormalDist` Examples and Recipes
----------------------------------------

A :class:`NormalDist` readily solves classic probability problems.

For example, given `historical data for SAT exams
<https://blog.prepscholar.com/sat-standard-deviation>`_ showing that scores
are normally distributed with a mean of 1060 and standard deviation of 192,
determine the percentage of students with scores between 1100 and 1200:

.. doctest::

>>> sat = NormalDist(1060, 195)
>>> fraction = sat.cdf(1200) - sat.cdf(1100)
>>> f'{fraction * 100 :.1f}% score between 1100 and 1200'
'18.2% score between 1100 and 1200'

To estimate the distribution for a model than isn't easy to solve
analytically, :class:`NormalDist` can generate input samples for a `Monte
Carlo simulation <https://en.wikipedia.org/wiki/Monte_Carlo_method>`_ of the
model:

.. doctest::

>>> n = 100_000
>>> X = NormalDist(350, 15).samples(n)
>>> Y = NormalDist(47, 17).samples(n)
>>> Z = NormalDist(62, 6).samples(n)
>>> model_simulation = [x * y / z for x, y, z in zip(X, Y, Z)]
>>> NormalDist.from_samples(model_simulation) # doctest: +SKIP
NormalDist(mu=267.6516398754636, sigma=101.357284306067)

Normal distributions commonly arise in machine learning problems.

Wikipedia has a `nice example with a Naive Bayesian Classifier
<https://en.wikipedia.org/wiki/Naive_Bayes_classifier>`_. The challenge
is to guess a person's gender from measurements of normally distributed
features including height, weight, and foot size.

The `prior probability <https://en.wikipedia.org/wiki/Prior_probability>`_ of
being male or female is 50%:

.. doctest::

>>> prior_male = 0.5
>>> prior_female = 0.5

We also have a training dataset with measurements for eight people. These
measurements are assumed to be normally distributed, so we summarize the data
with :class:`NormalDist`:

.. doctest::

>>> height_male = NormalDist.from_samples([6, 5.92, 5.58, 5.92])
>>> height_female = NormalDist.from_samples([5, 5.5, 5.42, 5.75])
>>> weight_male = NormalDist.from_samples([180, 190, 170, 165])
>>> weight_female = NormalDist.from_samples([100, 150, 130, 150])
>>> foot_size_male = NormalDist.from_samples([12, 11, 12, 10])
>>> foot_size_female = NormalDist.from_samples([6, 8, 7, 9])

We observe a new person whose feature measurements are known but whose gender
is unknown:

.. doctest::

>>> ht = 6.0 # height
>>> wt = 130 # weight
>>> fs = 8 # foot size

The posterior is the product of the prior times each likelihood of a
feature measurement given the gender:

.. doctest::

>>> posterior_male = (prior_male * height_male.pdf(ht) *
... weight_male.pdf(wt) * foot_size_male.pdf(fs))

>>> posterior_female = (prior_female * height_female.pdf(ht) *
... weight_female.pdf(wt) * foot_size_female.pdf(fs))

The final prediction is awarded to the largest posterior -- this is known as
the `maximum a posteriori
<https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation>`_ or MAP:

.. doctest::

>>> 'male' if posterior_male > posterior_female else 'female'
'female'


..
# This modelines must appear within the last ten lines of the file.
kate: indent-width 3; remove-trailing-space on; replace-tabs on; encoding utf-8;
26 changes: 26 additions & 0 deletions Doc/whatsnew/3.8.rst
Original file line number Diff line number Diff line change
Expand Up @@ -278,6 +278,32 @@ Added :func:`statistics.fmean` as a faster, floating point variant of
:func:`statistics.mean()`. (Contributed by Raymond Hettinger and
Steven D'Aprano in :issue:`35904`.)

Added :class:`statistics.NormalDist`, a tool for creating
and manipulating normal distributions of a random variable.
(Contributed by Raymond Hettinger in :issue:`36018`.)

::

>>> temperature_feb = NormalDist.from_samples([4, 12, -3, 2, 7, 14])
>>> temperature_feb
NormalDist(mu=6.0, sigma=6.356099432828281)

>>> temperature_feb.cdf(3) # Chance of being under 3 degrees
0.3184678262814532
>>> # Relative chance of being 7 degrees versus 10 degrees
>>> temperature_feb.pdf(7) / temperature_feb.pdf(10)
1.2039930378537762

>>> el_nino = NormalDist(4, 2.5)
>>> temperature_feb += el_nino # Add in a climate effect
>>> temperature_feb
NormalDist(mu=10.0, sigma=6.830080526611674)

>>> temperature_feb * (9/5) + 32 # Convert to Fahrenheit
NormalDist(mu=50.0, sigma=12.294144947901014)
>>> temperature_feb.samples(3) # Generate random samples
[7.672102882379219, 12.000027119750287, 4.647488369766392]


tokenize
--------
Expand Down
Loading

0 comments on commit 11c7953

Please sign in to comment.