## Scipy Statistics Module

Although Numpy provides some basic statistical functions (e.g.,
`mean`, `std`, `median`), the main repository for statistical functions is
`scipy.stats`. It implements over eighty continuous probability distributions
and more than ten discrete distributions, along with many supplementary
statistical functions that we will draw on in what follows.

To get started with `scipy.stats`, you have to load the module and create
an object that has the distribution you're interested in. For example,

In [1]:
from __future__ import division
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [2]:
>>> import scipy.stats # might take a while
>>> n = scipy.stats.norm(0,10) # create normal distribution (loc=0, scale=10)

The `n` variable is an object that represents a
normally distributed random variable with mean zero and
standard deviation $\sigma=10$. Note that the more general terms
for these two parameters are *location* and *scale*, respectively. Now
that we have this defined, we can compute the mean, as in the following:

In [3]:
>>> n.mean() # we already know this from its definition!
0
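
As a minimal sketch of the location/scale convention, the same object can be built with explicit `loc` and `scale` keywords. The variable name `n_kw` is only illustrative and does not appear elsewhere in the text.

```python
import scipy.stats

# Same distribution as norm(0, 10), with the two parameters named explicitly
n_kw = scipy.stats.norm(loc=0, scale=10)

print(n_kw.mean())  # the location parameter
print(n_kw.std())   # the scale parameter
print(n_kw.var())   # the scale parameter squared
```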

We can also compute higher-order moments, as in the following:

In [4]:
>>> n.moment(4)
30000
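
For a zero-mean normal random variable, the fourth moment has the closed form $3\sigma^4$, which is $3 \cdot 10^4 = 30000$ here. The following sketch compares that value with the result of `moment`; the names `sigma` and `closed_form` are illustrative.

```python
import numpy as np
import scipy.stats

sigma = 10
n = scipy.stats.norm(0, sigma)

fourth = n.moment(4)          # non-central fourth moment from scipy
closed_form = 3 * sigma**4    # E[X^4] for a zero-mean normal variable

print(fourth, closed_form)
print(np.isclose(fourth, closed_form))
```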

The main public methods for continuous random variables are the following:

   * `rvs`: random variates

   * `pdf`: probability density function

   * `cdf`: cumulative distribution function

   * `sf`: survival function (1-CDF)

   * `ppf`: percent point function (inverse of CDF)

   * `isf`: inverse survival function (inverse of SF)

   * `stats`: mean, variance, (Fisher's) skew, or (Fisher's) kurtosis

   * `moment`: non-central moments of the distribution

For example, we can compute the value of the pdf at a specific point.

In [5]:
>>> n.pdf(0)
0.03989422804014327

We can also evaluate the `cdf` of the same random variable.

In [6]:
>>> n.cdf(0)
0.5
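
The remaining methods in the list above can be exercised in the same way. The following is a small sketch, not an exhaustive tour; the probability level `q` and the variable names are arbitrary choices made for illustration.

```python
import numpy as np
import scipy.stats

n = scipy.stats.norm(0, 10)

# The survival function is the complement of the CDF
print(n.sf(10), 1 - n.cdf(10))

# ppf inverts cdf, and isf inverts sf
q = 0.975
x_q = n.ppf(q)          # point below which a fraction q of the mass lies
print(x_q, n.cdf(x_q))  # cdf(ppf(q)) recovers q
print(n.isf(1 - q))     # same point, expressed through the survival function

# Mean, variance, skew, and (Fisher's) kurtosis in one call
mean, var, skew, kurt = n.stats(moments='mvsk')
print(mean, var, skew, kurt)
```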

You can also draw samples from this distribution, as in the following:

In [7]:
>>> n.rvs(10)
array([ -8.11311677,   1.48034316,   1.0824489 ,  -4.38642452,
        23.69872505, -22.19428082,  -7.19207387,  10.6447697 ,
         3.4549407 ,   1.67282213])
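
Because `rvs` draws random numbers, the values above will differ from run to run. A minimal sketch of making the draws repeatable uses the `random_state` argument; the seed value `1234` is an arbitrary assumption.

```python
import numpy as np
import scipy.stats

n = scipy.stats.norm(0, 10)

# Two draws with the same seed produce identical samples
a = n.rvs(size=5, random_state=1234)
b = n.rvs(size=5, random_state=1234)
print(np.array_equal(a, b))

# With many samples, the sample mean and standard deviation
# should land close to the location (0) and scale (10)
big = n.rvs(size=100000, random_state=1234)
print(big.mean(), big.std())
```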

Many common statistical tests are already built in. For example, the
Shapiro-Wilk test evaluates the null hypothesis that the data were drawn from a
normal distribution [^hypo], as in the following:

[^hypo]: We will explain the null hypothesis and the related concepts later.

In [8]:
>>> scipy.stats.shapiro(n.rvs(100))
(0.9914381504058838, 0.779195249080658)

The first value in the tuple is the test statistic and the second is the p-value.
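
To see how the p-value behaves, the sketch below applies the same test to a normal sample and to a clearly non-normal (exponential) sample. The seed, the sample size of 200, and the variable names are assumptions chosen only for illustration.

```python
import numpy as np
import scipy.stats

rng = np.random.RandomState(0)             # fixed seed, an arbitrary choice
normal_sample = rng.normal(0, 10, 200)     # consistent with the null hypothesis
skewed_sample = rng.exponential(10, 200)   # strongly skewed, not normal

stat_n, p_n = scipy.stats.shapiro(normal_sample)
stat_e, p_e = scipy.stats.shapiro(skewed_sample)

# A small p-value is evidence against the hypothesis of normality
print(stat_n, p_n)
print(stat_e, p_e)
```

For the exponential sample the p-value should be very small, whereas for the normal sample it is usually, though not always, well above common significance thresholds.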

## Sympy Statistics Module

Sympy has its own much smaller, but still extremely useful, statistics module
that enables symbolic manipulation of statistical quantities. For example,

In [9]:
>>> from sympy import stats
>>> X = stats.Normal('x',0,10) # create normal random variable

We can obtain the probability density function as follows:

In [10]:
>>> from sympy.abc import x
>>> stats.density(X)(x)

sqrt(2)*exp(-x**2/200)/(20*sqrt(pi))
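
One way to sanity-check the symbolic density is to convert it into a numerical function with `lambdify` and compare it against `scipy.stats` on a few points. This is a sketch under the assumption that both Sympy and Scipy are installed; the grid of test points is arbitrary.

```python
import numpy as np
import scipy.stats
from sympy import stats, lambdify
from sympy.abc import x

X = stats.Normal('x', 0, 10)
pdf_expr = stats.density(X)(x)            # symbolic density in the symbol x

# Turn the symbolic expression into a function that accepts Numpy arrays
pdf_num = lambdify(x, pdf_expr, 'numpy')

grid = np.linspace(-30, 30, 7)
print(np.allclose(pdf_num(grid), scipy.stats.norm(0, 10).pdf(grid)))
```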

We can evaluate the cumulative distribution function as follows:

In [11]:
>>> stats.cdf(X)(0)
1/2

 Note that you can evaluate this numerically by using the `evalf()`
method on the output. Sympy provides intuitive ways to consider standard
probability questions by using the `stats.P` function, as in the following:

In [12]:
>>> stats.P(X>0) # probability that X > 0
1/2

There is also a corresponding expectation function, `stats.E`, that
you can use to compute complicated expectations using Sympy's powerful
built-in integration machinery. For example, we can compute
$\mathbb{E}(\sqrt{\lvert X \rvert})$ as follows:

In [13]:
>>> stats.E(abs(X)**(1/2)).evalf()
2.59995815363879
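
Symbolic results like these can be cross-checked numerically. The sketch below compares a tail probability from `stats.P` with the `sf` method in `scipy.stats`, and checks the expectation above by numerical integration and by the closed form $\sigma^{p} 2^{p/2} \Gamma((p+1)/2)/\sqrt{\pi}$ for $\mathbb{E}(\lvert X \rvert^{p})$ of a zero-mean normal variable. The threshold of 10 and all variable names are illustrative assumptions.

```python
import numpy as np
import scipy.stats
from scipy.integrate import quad
from scipy.special import gamma
from sympy import stats

X = stats.Normal('x', 0, 10)
n = scipy.stats.norm(0, 10)

# Tail probability P(X > 10): symbolic result versus scipy's survival function
p_sym = float(stats.P(X > 10))
p_num = n.sf(10)
print(p_sym, p_num)

# E[sqrt(|X|)] by numerical integration, using symmetry about zero
e_num = 2 * quad(lambda t: np.sqrt(t) * n.pdf(t), 0, np.inf)[0]

# Closed form for E[|X|^p] with p = 1/2 and sigma = 10
p = 0.5
e_closed = 10**p * 2**(p / 2) * gamma((p + 1) / 2) / np.sqrt(np.pi)
print(e_num, e_closed)
```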

Unfortunately, there is very limited support for multivariate
distributions at the time of this writing.

## Other Python Modules for Statistics

There are many other important Python modules for statistical work. Two
important modules are Seaborn and Statsmodels. As we discussed earlier,
Seaborn is a library built on top of Matplotlib for very detailed and expressive
statistical visualizations, ideally suited for exploratory data analysis.
Statsmodels is designed to complement Scipy with descriptive
statistics, estimation, and inference for a large variety of statistical
models. Statsmodels includes (among many others) generalized linear models,
robust linear models, and methods for time-series analysis, with an emphasis on
econometric data and problems. Both of these modules are well supported, well
documented, and designed to integrate tightly with Matplotlib, Numpy,
Scipy, and the rest of the scientific Python stack. Because the focus of this
text is more conceptual than domain-specific, I have chosen not to
emphasize either of them, notwithstanding how powerful each is.