- [Chapter 4: Cumulative distribution functions](#chapter4)
    - [4.1 The limits of PMFs](#subchapter4.1)
    - [4.2 Percentiles](#subchapter4.2)
    - [4.3 CDFs](#subchapter4.3)
    - [4.4 Representing CDFs](#subchapter4.4)
    - [4.5 Comparing CDFs](#subchapter4.5)
    - [4.6 Percentile-based statistics](#subchapter4.6)
    - [4.7 Random numbers](#subchapter4.7)    
    - [4.8 Comparing percentile ranks](#subchapter4.8)
    - [4.9 Exercises](#subchapter4.9)
    - [4.10 Glossary](#subchapter4.10)

<a id='chapter4'></a>
# Chapter 4: Cumulative distribution functions

<a id='subchapter4.1'></a>
## 4.1 The limits of PMFs

PMFs work well if the number of values is small. But as the number increases, the probability associated with each value gets smaller and the effect of random noise increases. For example, this next figure shows the PMF of weight at birth of first babies and others.

![title](figs/chap04limitations.png)

Overall, these distributions resemble the bell shape of a normal distribution, with many values near the mean and a few values much higher and lower. But parts of this figure are hard to interpret. There are many spikes and valleys, and some apparent differences between the distributions. It is hard to tell which of these features are meaningful.

These problems can be mitigated by binning the data: dividing the range of values into non-overlapping intervals and counting the number of values in each bin. Binning can be useful, but it is tricky to get the bin size right: bins wide enough to smooth out noise may also smooth out useful information.
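A minimal sketch of the bin-size trade-off, assuming NumPy is available. The data here are synthetic stand-ins for birth weights, and the seed, mean, spread, and bin counts are illustrative assumptions rather than values from the NSFG data:

```python
import numpy as np

# Synthetic stand-in for birth weights in pounds (illustrative only).
rng = np.random.default_rng(17)
weights = rng.normal(loc=7.3, scale=1.2, size=1000)

# Bin the same data with two different bin counts.
coarse_counts, coarse_edges = np.histogram(weights, bins=5)
fine_counts, fine_edges = np.histogram(weights, bins=50)

# Every value lands in exactly one bin in both cases.
print(coarse_counts.sum(), fine_counts.sum())
print(coarse_counts)  # few bins: smooth, but little detail
print(fine_counts)    # many bins: more detail, but noisier counts
```

With few bins the shape is smooth but coarse; with many bins the counts per bin are small and the noise returns.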

An alternative that avoids these problems is the cumulative distribution function (CDF). We'll get to that after we talk about percentiles.

<a id='subchapter4.2'></a>
## 4.2 Percentiles 

Given a value and a list of values, we can compute the value's percentile rank, which is the percentage of values that are less than or equal to it:

In [1]:
def PercentileRank(scores, your_score):
    """Compute the percentile rank of your_score within scores.

    scores: sequence that includes all scores
    your_score: the score to rank
    """
    count = 0
    for score in scores:
        if score <= your_score:
            count += 1
    percentile_rank = 100.0 * count / len(scores)
    return percentile_rank

In [2]:
# For example, you score 88 on a test where the scores (including yours) are:
scores = [55,66,77,88,99]
your_score = 88
PercentileRank(scores, your_score)

80.0

Going the other way, from a percentile rank to the corresponding value, is trickier:

In [3]:
def Percentile(scores, percentile_rank):
    # iterate over a sorted copy so the caller's list is not modified
    for score in sorted(scores):
        if PercentileRank(scores, score) >= percentile_rank:
            return score

In [4]:
# let's take the same scores and look up percentile rank 80:
Percentile(scores, 80)

88

This implementation of Percentile is not efficient, because it calls PercentileRank, which scans the whole list, once for each score. A better approach is to use the percentile rank to compute the index of the corresponding percentile:

In [5]:
def Percentile2(scores, percentile_rank):
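    ordered = sorted(scores)  # sorted copy; the caller's list is unchanged
    # int() keeps the index usable when percentile_rank is a float
    index = int(percentile_rank * (len(ordered) - 1) // 100)
    return ordered[index]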

In [6]:
Percentile2(scores, 80)

88
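
The two approaches are not guaranteed to return the same value for every percentile rank, because one searches by percentile rank and the other rounds an index down. The check below is a small sketch that restates both functions under hypothetical lowercase names so that it is self-contained:

```python
def percentile_rank(scores, your_score):
    """Percentage of scores less than or equal to your_score."""
    count = sum(1 for score in scores if score <= your_score)
    return 100.0 * count / len(scores)

def percentile_search(scores, rank):
    """First sorted score whose percentile rank reaches rank (like Percentile)."""
    for score in sorted(scores):
        if percentile_rank(scores, score) >= rank:
            return score

def percentile_index(scores, rank):
    """Score at the index computed from rank (like Percentile2)."""
    ordered = sorted(scores)
    index = rank * (len(ordered) - 1) // 100
    return ordered[index]

scores = [55, 66, 77, 88, 99]
for rank in (30, 50, 80, 90):
    print(rank, percentile_search(scores, rank), percentile_index(scores, rank))
```

For this sample the two agree at ranks 30, 50 and 80 but differ at rank 90 (99 versus 88), so the choice of definition matters for small samples.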

<a id='subchapter4.3'></a>
## 4.3 CDFs 

The cumulative distribution function (CDF) is a function that maps from values to their cumulative probabilities. CDF(x) is the fraction of the sample less than or equal to x, where x is any number that might appear in the distribution. To evaluate CDF(x) for a particular value of x, we compute the fraction of values in the distribution less than or equal to x.


In [7]:
def EvalCdf(sample, x):
    count = 0.0
    for value in sample:
        if value <= x:
            count += 1
    prob = count / len(sample)
    return prob

The result is a probability in the range 0 to 1, rather than a percentile rank in the range 0 to 100.

For example:

In [8]:
sample = [1,2,2,3,5]
EvalCdf(sample, 0) # CDF(0)

0.0

In [9]:
EvalCdf(sample, 1) # CDF(1)

0.2

In [None]:
EvalCdf(sample, 5) # CDF(5)

1.0

We can evaluate the CDF for any value of x, not just values that appear in the sample. If x is less than the smallest value in the sample, CDF(x) is 0. If x is greater than or equal to the largest value, CDF(x) is 1. The CDF of a sample is a step function.
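
The step-function behavior can be seen by evaluating the CDF at points between the sample values. The sketch below assumes NumPy and uses a hypothetical helper, `eval_cdf_sorted`, which does the same job as EvalCdf but with a binary search on a sorted array:

```python
import numpy as np

def eval_cdf_sorted(sorted_sample, x):
    """Fraction of values <= x, found by binary search on a sorted array."""
    # side='right' gives the number of elements less than or equal to x.
    return np.searchsorted(sorted_sample, x, side='right') / len(sorted_sample)

sample = np.sort([1, 2, 2, 3, 5])

for x in (0, 1, 1.5, 2, 2.5, 4, 5, 10):
    print(x, eval_cdf_sorted(sample, x))
```

The result stays constant between sample values (for example, 0.2 at both 1 and 1.5) and jumps at each value that appears in the sample, with a larger jump at 2 because it appears twice.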

<a id='subchapter4.4'></a>
## 4.4 Representing CDFs 

thinkstats2 provides a class named Cdf that represents CDFs. Its fundamental methods are:

- Prob(x): Given a value x, computes the probability p = CDF(x). The bracket operator is equivalent to Prob. 

- Value(p): Given a probability p, computes the corresponding value, x; that is, the <b>inverse CDF</b> of p.
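
To make the relationship between Prob and Value concrete, here is a minimal sketch of a class with the same two methods. `SimpleCdf` is a hypothetical stand-in written for illustration; it is not the thinkstats2 implementation, and the real Cdf may handle ties and edge cases differently:

```python
import bisect

class SimpleCdf:
    """Illustrative stand-in for a Cdf; not the thinkstats2 implementation."""

    def __init__(self, values):
        values = list(values)
        n = len(values)
        self.xs = sorted(set(values))
        # Cumulative probability at each distinct value.
        self.ps = [sum(1 for v in values if v <= x) / n for x in self.xs]

    def Prob(self, x):
        """Return CDF(x), the fraction of values <= x."""
        index = bisect.bisect_right(self.xs, x)
        return 0.0 if index == 0 else self.ps[index - 1]

    def Value(self, p):
        """Return the smallest value whose cumulative probability is >= p."""
        if not 0 <= p <= 1:
            raise ValueError('p must be between 0 and 1')
        index = bisect.bisect_left(self.ps, p)
        return self.xs[index]

cdf_sketch = SimpleCdf([1, 2, 2, 3, 5])
print(cdf_sketch.Prob(2))     # cumulative probability of 2
print(cdf_sketch.Value(0.5))  # value at cumulative probability 0.5
```

Prob looks up a value to get a probability, and Value searches the cumulative probabilities to get back a value, which is why Value is the inverse of Prob.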

The Cdf constructor can take as an argument a list of values, a pandas Series, a Hist, a Pmf, or another Cdf. The following code makes a Cdf for the distribution of pregnancy length in the NSFG:

In [None]:
from __future__ import print_function

import math
import numpy as np
import pandas as pd

import nsfg
import first
import thinkstats2
import thinkplot

%matplotlib inline



In [None]:
live, firsts, others = first.MakeFrames()
cdf = thinkstats2.Cdf(live.prglngth, label='prglngth')

thinkplot provides a function named Cdf that plots Cdfs as lines:

In [None]:
thinkplot.Cdf(cdf)
thinkplot.Show(xlabel="weeks", ylabel="CDF")

One way to read a CDF is to look up percentiles. For example, it looks like about 10% of pregnancies are shorter than 36 weeks, and about 90% are shorter than 41 weeks. The CDF also provides a visual representation of the shape of the distribution. Common values appear as steep or vertical sections of the CDF; in this example, the mode at 39 weeks is apparent. There are few values below 30 weeks, so the CDF in this range is flat.

<a id='subchapter4.5'></a>
## 4.5 Comparing CDFs 

CDFs are especially useful for comparing distributions. For example, here is the code that plots the CDF of birth weight for first babies and others. 

In [None]:
first_cdf = thinkstats2.Cdf(firsts.totalwgt_lb, label='first')
others_cdf = thinkstats2.Cdf(others.totalwgt_lb, label='others')

thinkplot.PrePlot(2)
thinkplot.Cdfs([first_cdf, others_cdf])
thinkplot.Show(xlabel='weight (pounds)', ylabel='CDF')

Compared to the PMF figure in section 4.1, this plot makes the shape of the distributions, and the differences between them, much clearer. We can see that first babies are slightly lighter throughout the distribution, with a larger discrepancy above the mean.

<a id='subchapter4.6'></a>
## 4.6 Percentile-based statistics 

Once you have computed a CDF, it is easy to compute percentiles and percentile ranks. The Cdf class provides these two methods:

<b>PercentileRank(x)</b>: Given a value x, computes its percentile rank, 100*CDF(x).

<b>Percentile(p)</b>: Given a percentile rank p, computes the corresponding value, x. Equivalent to Value(p/100).

Percentile can be used to compute percentile-based summary statistics. For example, the 50th percentile is the median, and the interquartile range (IQR) is the difference between the 75th and 25th percentiles. More generally, the values at equally spaced percentile ranks (20, 40, 60, etc.) are called quantiles.
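
As a rough illustration of these statistics, the sketch below uses NumPy's `np.percentile` on a small made-up data set instead of the Cdf class. Note the assumption: `np.percentile` interpolates linearly between data points by default, so its results can differ slightly from those of Cdf.Percentile.

```python
import numpy as np

# Small made-up data set (illustrative only).
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# Median: the 50th percentile.
median = np.percentile(data, 50)

# Interquartile range: 75th percentile minus 25th percentile.
q25, q75 = np.percentile(data, [25, 75])
iqr = q75 - q25

# Quintiles: values at percentile ranks 20, 40, 60 and 80.
quintiles = np.percentile(data, [20, 40, 60, 80])

print(median, iqr)
print(quintiles)
```

For this data set the median is 5 and the IQR is 4 (from 3 to 7).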

<a id='subchapter4.7'></a>
## 4.7 Random numbers 

Suppose we choose a random sample from the population of live births and look up the percentile rank of each birth weight. Now suppose we compute the CDF of those percentile ranks. The resulting distribution is approximately uniform, so its CDF is close to a straight line.

In [None]:
weights = live.totalwgt_lb
cdf = thinkstats2.Cdf(weights, label = 'totalwgt_lb')

# compute the percentile rank of each value in a random sample
# sample: 100 birth weights chosen at random with replacement,
# meaning the same value can be chosen more than once
sample = np.random.choice(weights, 100, replace=True)
ranks = [cdf.PercentileRank(x) for x in sample]

# plot the CDF of the percentile ranks
rank_cdf = thinkstats2.Cdf(ranks)
thinkplot.Cdf(rank_cdf)
thinkplot.Show(xlabel='percentile rank', ylabel='CDF')

So regardless of the shape of the CDF, the distribution of percentile ranks is uniform. This property is useful, because it is the basis of a simple and efficient algorithm for generating random numbers with a given CDF. 
Here's how:

- Choose a percentile rank uniformly from the range 0-100.
- Use Cdf.Percentile to find the value in the distribution that corresponds to the percentile rank you chose.

Cdf provides an implementation of this algorithm, called Random:

    # excerpt from class Cdf; requires `import random`
    def Random(self):
        return self.Percentile(random.uniform(0, 100))

Cdf also provides Sample, which takes an integer, n, and returns a list of n values chosen at random from the Cdf.
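
The same two-step algorithm can be sketched without thinkstats2. The function below, `sample_from_empirical_cdf`, is a hypothetical helper that assumes NumPy and works with cumulative probabilities in the range 0 to 1 rather than percentile ranks in the range 0 to 100:

```python
import numpy as np

def sample_from_empirical_cdf(values, n, rng):
    """Draw n values by inverting the empirical CDF of values."""
    xs = np.sort(np.asarray(values))
    # Cumulative probability at each sorted value: 1/N, 2/N, ..., 1.
    ps = np.arange(1, len(xs) + 1) / len(xs)
    # Step 1: choose cumulative probabilities uniformly from [0, 1).
    us = rng.uniform(0, 1, size=n)
    # Step 2: find the first value whose cumulative probability is >= u.
    indices = np.searchsorted(ps, us, side='left')
    return xs[indices]

rng = np.random.default_rng(42)
values = [1, 2, 2, 3, 5]
draws = sample_from_empirical_cdf(values, 1000, rng)

# Only values from the original sample can be drawn.
print(np.unique(draws))
# The value 2 appears twice in the sample, so it should make up about 40% of draws.
print(np.mean(draws == 2))
```

Values that are more common in the original sample cover a wider range of cumulative probability, so they are drawn more often.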

<a id='subchapter4.8'></a>
## 4.8 Comparing percentile ranks

Percentile ranks are useful for comparing measurements across different groups. 

Given a finishing position and the field size (the number of competitors), we can compute the percentile rank:

In [None]:
def PositionPercentile(position, field_size):
    beat = field_size - position + 1
    percentile = 100.0 * beat/field_size
    return percentile

In [None]:
# a runner finishes 26th out of 256 runners.
# what is their percentile rank?
PositionPercentile(26, 256)

In [None]:
# what position corresponds to the 90th percentile rank in a field of 171 runners?
def PercentileToPosition(percentile, field_size):
    beat = percentile * field_size / 100.0
    position = field_size - beat + 1
    return position

In [None]:
PercentileToPosition(90, 171)
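
One way to use the two functions together is to convert a position in one group into a percentile rank, then ask what position the same rank corresponds to in a group of a different size. The sketch below restates both functions under hypothetical lowercase names so that it is self-contained:

```python
def position_to_percentile(position, field_size):
    """Percentile rank of a finishing position (1 = first place)."""
    beat = field_size - position + 1
    return 100.0 * beat / field_size

def percentile_to_position(percentile, field_size):
    """Finishing position that corresponds to a percentile rank."""
    beat = percentile * field_size / 100.0
    return field_size - beat + 1

# Percentile rank of 26th place in a field of 256.
rank = position_to_percentile(26, 256)
print(rank)

# Converting back in the same field recovers the original position.
print(percentile_to_position(rank, 256))

# The same percentile rank in a field of 171 runners.
print(percentile_to_position(rank, 171))
```

The percentile rank is about 90.2, which maps back to position 26 in the field of 256 and to a position of about 17.7 in the field of 171. The result is generally not a whole number, so it needs rounding to be read as a finishing place.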

<a id='subchapter4.9'></a>
## 4.9 Exercises

<b>Exercise 4.1</b>

[Exercise 1](chap04ex01.ipynb)

[Exercise 1 solution](chap04ex01soln.ipynb)

<a id='subchapter4.10'></a>
## 4.10 Glossary

<b>Percentile rank</b>: The percentage of values in a distribution that are less than or equal to a given value.

<b>Percentile</b>: The value associated with a given percentile rank.

<b>Cumulative distribution function (CDF)</b>: A function that maps from values to their cumulative probabilities. CDF(x) is the fraction of the sample less than or equal to x.

<b>Inverse CDF</b>: A function that maps from a cumulative probability, p, to the corresponding value.

<b>Median</b>: The 50th percentile, often used as a measure of central tendency.

<b>Interquartile range</b>: The difference between the 75th and 25th percentiles, used as a measure of spread.

<b>Quantile</b>: A sequence of values that correspond to equally spaced percentile ranks; for example, the quartiles of a distribution are the 25th, 50th and 75th percentiles.

<b>Replacement</b>: A property of a sampling process. “With replacement” means that the same value can be chosen more than once; “without replacement” means that once a value is chosen, it is removed from the population.

Next up: [Chapter 5: Modeling distributions](chap05.ipynb)