# Distributions

In the first chapter, having cleaned and validated your data, you began exploring it by using histograms to visualize distributions. In this chapter, you'll learn how to represent distributions using Probability Mass Functions (PMFs) and Cumulative Distribution Functions (CDFs). You'll learn when to use each of them, and why, while working with a new dataset obtained from the General Social Survey.

# 1. Probability mass functions


1.2 Make a PMF

The GSS dataset has been pre-loaded for you into a DataFrame called gss. You can explore it in the IPython Shell to get familiar with it.

In this exercise, you'll focus on one variable in this dataset, 'year', which represents the year each respondent was interviewed.

You can access the Pmf class via the empiricaldist library:
https://nbviewer.org/github/AllenDowney/empiricaldist/blob/master/empiricaldist/dist_demo.ipynb

and https://github.com/AllenDowney/ExploratoryDataAnalysis/blob/master/distribution.ipynb

The Pmf class

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
# from empiricaldist import Pmf  # not imported here: the Pmf class is defined below and would shadow it

def underride(d, **options):
    """Add key-value pairs to d only if key is not in d.

    d: dictionary
    options: keyword args to add to d
    """
    for key, val in options.items():
        d.setdefault(key, val)

    return d

class Pmf(pd.Series):
    
    def __init__(self, seq, name='Pmf', **options):
        """Make a PMF from a sequence.
        
        seq: sequence of values
        name: string
        sort: boolean, whether to sort the values, default True
        normalize: boolean, whether to normalize the Pmf, default True
        dropna: boolean, whether to drop NaN, default True
        """
        # get the sort flag
        sort = options.pop('sort', True)

        # normalize unless the caller said not to
        underride(options, normalize=True)
        
        # put the seq in a Series so we can use value_counts
        series = pd.Series(seq, copy=False)
        
        # make the counts
        # by default value_counts sorts by frequency, which
        # is not what we want
        options['sort'] = False
        counts = series.value_counts(**options)
        
        # sort by value
        if sort:
            counts.sort_index(inplace=True)
            
        # call Series.__init__
        super().__init__(counts, name=name)

    @property
    def qs(self):
        return self.index.values

    @property
    def ps(self):
        return self.values

    def __call__(self, qs):
        """Look up a value in the PMF."""
        return self.get(qs, 0)

    def normalize(self):
        """Normalize the PMF."""
        self /= self.sum()

    def bar(self, **options):
        """Plot the PMF as a bar plot."""
        underride(options, label=self.name)
        plt.bar(self.index, self.values, **options)

    def plot(self, **options):
        """Plot the PMF with lines."""
        underride(options, label=self.name)
        plt.plot(self.index, self.values, **options)

In [None]:
# from empiricaldist import Pmf
gss = pd.read_hdf('C:\\Users\\yazan\\Desktop\\Data_Analytics\\7-Exploratory Data Analysis in Python\\datasets\\gss.hdf5','gss')

# Compute the PMF for year and set normalize to False
pmf_year = Pmf(gss.year, normalize=False)

# Print the result
print(pmf_year)

In [None]:
# from empiricaldist import Pmf
gss = pd.read_hdf('C:\\Users\\yazan\\Desktop\\Data_Analytics\\7-Exploratory Data Analysis in Python\\datasets\\gss.hdf5','gss')

# Compute the PMF for year and set normalize to True
pmf_year = Pmf(gss.year, normalize=True)

# Print the result
print(pmf_year)
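
The two cells above differ only in the normalize flag: with normalize=False the PMF holds raw counts, and with normalize=True each count is divided by the total so the values sum to 1. A minimal sketch of that difference, using only the standard library and a small made-up sample (the `years` list is hypothetical and not taken from the GSS data):

```python
from collections import Counter

# Hypothetical toy sample of interview years (not from the GSS data)
years = [1972, 1972, 1973, 1974, 1974, 1974, 1975, 1975]

# Unnormalized PMF: how many times each value occurs, sorted by value
counts = dict(sorted(Counter(years).items()))

# Normalized PMF: divide each count by the total so the values sum to 1
total = sum(counts.values())
pmf = {year: n / total for year, n in counts.items()}

print(counts)
print(pmf)
```

The counts are useful for checking how many respondents fall in each year, while the normalized values can be read as probabilities and compared across samples of different sizes.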

1.3 Plot a PMF

Now let's plot a PMF for the age of the respondents in the GSS dataset. The variable 'age' contains respondents' age in years.

In [None]:
# Select the age column
age = gss['age']

# Make a PMF of age
pmf_age = Pmf(age)

# Plot the PMF
pmf_age.bar(label='age')

# Label the axes
plt.xlabel('Age')
plt.ylabel('PMF')
plt.show()

# 2. Cumulative distribution functions

2.2 Make a CDF

In this exercise, you'll make a CDF and use it to determine the fraction of respondents in the GSS dataset who are OLDER than 30.

The GSS dataset has been preloaded for you into a DataFrame called gss.

As with the Pmf class from the previous lesson, the Cdf class has been created, and you can access it via the empiricaldist library.

https://nbviewer.org/github/AllenDowney/empiricaldist/blob/master/empiricaldist/dist_demo.ipynb

and https://github.com/AllenDowney/ExploratoryDataAnalysis/blob/master/distribution.ipynb

The Cdf class

In [None]:
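from scipy.interpolate import interp1d
import numpy as np                 # needed for np.nan in Cdf.inverse
import pandas as pd
import matplotlib.pyplot as plt    # needed by Cdf.step and Cdf.plot

# Note: Cdf also relies on Pmf and underride, defined in the Pmf cell above.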

class Cdf(pd.Series):

    def __init__(self, seq, name='Cdf', **options):
        """Make a CDF from a sequence.
        
        seq: sequence of values
        name: string
        sort: boolean, whether to sort the values, default True
        normalize: boolean, whether to normalize the Cdf, default True
        dropna: boolean, whether to drop NaN, default True
        """
        # get the normalize option
        normalize = options.pop('normalize', True)
        
        # make the PMF and CDF
        pmf = Pmf(seq, normalize=False, **options)
        cdf = pmf.cumsum()
        
        # normalizing the CDF, rather than the PMF,
        # avoids floating-point errors and guarantees
        # that the last probability is 1.0
        if normalize:
            cdf /= cdf.values[-1]
        super().__init__(cdf, name=name, copy=False)
        
    @property
    def qs(self):
        return self.index.values

    @property
    def ps(self):
        return self.values

    @property
    def forward(self):
        return interp1d(self.qs, self.ps,
                        kind='previous',
                        assume_sorted=True,
                        bounds_error=False,
                        fill_value=(0,1))

    @property
    def inverse(self):
        return interp1d(self.ps, self.qs,
                        kind='next',
                        assume_sorted=True,
                        bounds_error=False,
                        fill_value=(self.qs[0], np.nan))

    def __call__(self, qs):
        return self.forward(qs)

    def percentile_rank(self, qs):
        return self.forward(qs) * 100

    def percentile(self, percentile_ranks):
        return self.inverse(percentile_ranks / 100)

    def step(self, **options):
        """Plot the CDF as a step function."""
        underride(options, label=self.name, where='post')
        plt.step(self.index, self.values, **options)

    def plot(self, **options):
        """Plot the CDF as a line."""
        underride(options, label=self.name)
        plt.plot(self.index, self.values, **options)
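
The forward and inverse properties above are step lookups in opposite directions: forward maps a quantity to its cumulative probability, and inverse maps a probability back to a quantity. The sketch below is meant to mirror that behaviour with the standard library instead of interp1d. The `values` sample and the helper names are hypothetical, and the edge-case handling follows my reading of the fill_value settings in the class rather than a tested comparison.

```python
from bisect import bisect_left, bisect_right
from collections import Counter
from itertools import accumulate

# Hypothetical toy sample (not from the GSS data)
values = [1, 2, 2, 3, 5]

# Sorted unique quantities and their cumulative probabilities
pairs = sorted(Counter(values).items())
qs = [q for q, _ in pairs]
cumulative = list(accumulate(n for _, n in pairs))
ps = [c / cumulative[-1] for c in cumulative]

def forward(q):
    """Cumulative probability at q: fraction of values <= q."""
    i = bisect_right(qs, q)
    return ps[i - 1] if i > 0 else 0.0

def inverse(p):
    """Smallest quantity whose cumulative probability is at least p."""
    i = bisect_left(ps, p)
    return qs[i] if i < len(qs) else float('nan')

print(forward(2))    # fraction of values <= 2
print(inverse(0.5))  # the median of the sample
```

Under these assumptions, percentile_rank is forward scaled by 100, and percentile is inverse applied to the rank divided by 100.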

In [None]:
# Select the age column
age = gss['age']

# Compute the CDF of age
cdf_age = Cdf(age)

# Evaluate the CDF at 30: the fraction of respondents aged 30 or younger
print(cdf_age(30))

Question:

What fraction of the respondents in the GSS dataset are OLDER than 30?

Possible Answers:

- Approximately 75% (True)
- Approximately 65%
- Approximately 45%
- Approximately 25%
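
The reasoning behind the marked answer: cdf_age(30) is the fraction of respondents aged 30 or younger, so the fraction older than 30 is its complement, 1 - cdf_age(30). An answer of approximately 75% therefore corresponds to cdf_age(30) being roughly 0.25. A minimal sketch of the complement step on a small made-up sample (the `ages` list and the `cdf` helper are hypothetical, not the GSS data or the Cdf class):

```python
from bisect import bisect_right

# Hypothetical toy sample of ages (not from the GSS data)
ages = sorted([22, 25, 30, 30, 34, 41, 47, 52, 60, 68])

def cdf(x):
    """Fraction of values less than or equal to x."""
    return bisect_right(ages, x) / len(ages)

fraction_at_most_30 = cdf(30)
fraction_older_than_30 = 1 - fraction_at_most_30

print(fraction_at_most_30)
print(fraction_older_than_30)
```

With the real data, the equivalent expression would be 1 - cdf_age(30).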