# Frequency distributions

We often summarize data rather than work with every observation. Although the complete list of observations contains more information, a summary that makes the data more comprehensible may be more useful. For instance, we often cite the mean of a data set instead of listing all of its values. This is particularly useful when we have a very large number of data points.

A frequency distribution consists of a set of intervals and the number of observations that fall into each. Talking about data in intervals often makes more sense than using specific values. For instance, if all 50 observations of a random variable have landed in the interval [99, 101], we expect the next observation to be approximately 100. We do not expect it to take any particular value, however, and would be more likely to bet that the next observation will be in the interval [99, 101] than that it will be exactly equal to 100.72.

When constructing a frequency distribution, we use some number $k$ of equally sized bins. Each of the $k$ intervals has length $(x_{\max} - x_{\min})/k$, where $x_{\max} - x_{\min}$ is the range of the data.
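As a minimal sketch of the construction described above, the following computes the bin width and edges by hand and counts the observations in each bin. The seed, the sample size, and the names `data`, `width`, `edges`, and `counts` are choices made for this illustration; the manual counts should agree with `np.histogram`, apart from possible floating-point effects for values lying exactly on a bin edge.

```python
import numpy as np

# Illustrative data; the seed is fixed only to make the sketch reproducible
rng = np.random.default_rng(0)
data = rng.standard_normal(150)

k = 15  # number of equally sized bins

# Each bin has length (range of data) / k
width = (data.max() - data.min()) / k
edges = data.min() + width * np.arange(k + 1)

# Map each observation to a bin index; the maximum goes into the last bin
idx = np.minimum(((data - data.min()) / width).astype(int), k - 1)
counts = np.bincount(idx, minlength=k)

# NumPy's own histogram, for comparison
np_counts, np_edges = np.histogram(data, bins=k)

print('Bin width:', width)
print('Manual counts:', counts)
print('NumPy counts: ', np_counts)
```

Every bin except the last is treated as half-open, $[a, b)$, while the last bin also includes its right endpoint so that the largest observation is counted.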

The usefulness of frequency distributions can be seen below. With 15 bins, the counts show that the observations are sampled from a normal distribution much more clearly than with 300 bins, where the bins are so narrow that most data points are marked individually.

In [93]:
import numpy as np

X = np.random.randn(150)

# Print the frequencies using 300 bins for 150 data points,
# so that most observations end up in a bin of their own
# The second argument is the number of bins
hist, _ = np.histogram(X, 300)
print('Data frequencies: ', hist)

Data frequencies:  [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 1 0 2 1 0 0 1 0
 2 1 0 0 0 1 0 1 0 1 0 2 1 0 0 1 0 1 1 1 1 1 0 2 3 1 1 0 2 1 1 0 0 0 1 0 1
 0 1 1 0 0 0 0 0 1 0 1 3 0 3 1 1 1 2 0 0 1 0 0 2 1 1 0 0 2 4 3 1 1 1 1 0 2
 3 1 1 1 1 1 0 1 1 1 0 1 1 0 0 1 1 0 2 0 2 2 0 2 1 1 0 0 0 1 0 0 0 1 1 1 2
 1 2 2 3 1 0 2 0 1 1 0 0 0 0 1 2 0 2 1 0 3 0 0 0 1 0 0 1 0 0 1 0 2 1 0 0 2
 0 0 0 1 0 0 0 1 2 0 0 1 1 1 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 1 0 0
 0 0 0 1]


In [94]:
# Print frequency of data per bin
hist, _ = np.histogram(X, 15)
print('Binned data frequencies: ', hist)

Binned data frequencies:  [ 1  1  0  1  4 11 19 11 23 18 15 21 12  8  5]


However, if the bins are too wide, the binned counts cease to be very informative:

In [95]:
# Print frequency of data per bin
hist, _ = np.histogram(X, 4)
print('Binned data frequencies: ', hist)

Binned data frequencies:  [ 3 39 70 38]
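The trade-off between the three bin counts used above can be made a little more concrete. The sketch below, which uses a freshly generated sample with an arbitrary fixed seed rather than the `X` from the cells above, reports the bin width and the number of empty bins for each choice of $k$. The `summary` dictionary is just a convenience for this illustration.

```python
import numpy as np

# Illustrative data; the seed is fixed only to make the sketch reproducible
rng = np.random.default_rng(0)
X = rng.standard_normal(150)

summary = {}
for k in (300, 15, 4):
    counts, edges = np.histogram(X, bins=k)
    width = edges[1] - edges[0]          # all bins have the same width
    empty = int(np.sum(counts == 0))     # bins containing no observations
    summary[k] = (width, empty)
    print('k = %3d  bin width = %.3f  empty bins = %d  largest count = %d'
          % (k, width, empty, counts.max()))
```

With 300 bins and only 150 observations, at least half of the bins must be empty, which is why the first output above is mostly zeros. With 4 bins, each bin covers a quarter of the range, so the counts say little about the shape of the distribution.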


Frequency distributions are often used in graphing because binned data are visually more comprehensible. For example, the commonly used graphing library `matplotlib.pyplot` has a built-in histogram function, `hist`, which accepts the number of bins $k$ through its `bins` parameter.
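A minimal plotting sketch, assuming `matplotlib` is installed, might look like the following. The seed, the choice of $k = 15$, and the axis labels are illustrative. The `Agg` backend is selected only so that the code runs without a display; in a notebook that line can be dropped.

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt

# Illustrative data; the seed is fixed only to make the sketch reproducible
rng = np.random.default_rng(0)
X = rng.standard_normal(150)

k = 15  # number of bins

fig, ax = plt.subplots()
# hist returns the bin counts, the bin edges, and the drawn bar objects
counts, edges, _ = ax.hist(X, bins=k)
ax.set_xlabel('Value')
ax.set_ylabel('Frequency')
ax.set_title('Histogram with k = 15 bins')

print('Binned data frequencies: ', counts)
```

The counts returned by `hist` are computed by the same equal-width binning as `np.histogram`, so the plot is a picture of the frequency distribution printed earlier.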