In [None]:
""" Data distribution, adapted from Elements of Data Science, Allen Downey """

In [None]:
"""`empiricaldist` provides an object type called `Pmf`, which stands for "probability mass function".
A `Pmf` object contains a set of possible outcomes and their probabilities.
For example, here's a `Pmf` that represents the outcome of rolling a six-sided die:
"""

from empiricaldist import Pmf

outcomes = [1,2,3,4,5,6]
die = Pmf(1/6, outcomes)
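
In [None]:
"""To make the idea concrete, here is a minimal sketch of the same distribution as a plain dictionary that maps each outcome to its probability.
It uses only the standard library and is not how `empiricaldist` is implemented; the name `die_pmf` is made up for this illustration.
"""

```python
from fractions import Fraction

outcomes = [1, 2, 3, 4, 5, 6]

# Map each outcome to the same probability, 1/6 (kept exact by Fraction).
die_pmf = {outcome: Fraction(1, 6) for outcome in outcomes}

# The probabilities in a PMF add up to 1.
total = sum(die_pmf.values())
print(die_pmf[3], total)
```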

In [None]:
import os
os.getcwd()

In [None]:
"""We'll use `Pmf` objects to represent distributions of values from a new dataset, the General Social Survey (GSS).
The GSS surveys a representative sample of adult residents of the U.S. and asks questions about demographics, personal history, and beliefs about social and political issues.
"""

data_file = 'Data/gss_extract_2022.hdf'
import pandas as pd

gss = pd.read_hdf(data_file, 'gss')
print(gss.shape)
print(gss.head())
## Distribution of Education

"""
To get started with this dataset, let's look at the `educ` column, which records the number of years of education for each respondent.
We can select this column from the `DataFrame` like this:
"""

educ = gss['educ']

"""To see what the distribution of the responses looks like, we can use the `hist` method to plot a histogram."""

import matplotlib.pyplot as plt
educ.hist()  # pass grid=False to hide the grid lines
plt.xlabel('Years of education')
plt.ylabel('Number of respondents')
plt.title('Histogram of education level');

In [None]:
"""
The function `Pmf.from_seq` takes any kind of sequence -- like a list, tuple, or Pandas `Series` -- and computes the distribution of the values.
"""

pmf_educ = Pmf.from_seq(educ, normalize=False)
type(pmf_educ)
# We can use the bracket operator to look up a value in `pmf_educ` and get the corresponding count.
print(pmf_educ[20])

In [None]:
"""When we make a `Pmf`, we usually want to know the *fraction* of respondents with each value, rather than the counts.
We can do that by setting `normalize=True`.
Then we get a **normalized** `Pmf`, which means that the fractions add up to 1.
"""

pmf_educ_norm = Pmf.from_seq(educ, normalize=True)
print(pmf_educ_norm.head())
"""Now if we use the bracket operator to look up a value, the result is a fraction rather than a count."""

print(pmf_educ_norm[20])
pmf_educ_norm.bar(label='educ')

plt.xlabel('Years of education')
plt.xticks(range(0, 21, 4))
plt.ylabel('PMF')
plt.title('Distribution of years of education')
plt.legend();
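
In [None]:
"""The difference between the two settings of `normalize` can be sketched with the standard library alone.
This is an illustration of the idea, not `empiricaldist`'s implementation; the toy list `years` and the names `counts` and `probs` are made up and do not come from the GSS.
"""

```python
from collections import Counter

# Hypothetical toy responses; these are NOT values from the GSS.
years = [12, 12, 16, 12, 14, 16, 20]

# Unnormalized: count how many times each value appears.
counts = Counter(years)

# Normalized: divide each count by the number of responses,
# so the results are fractions that add up to 1.
n = len(years)
probs = {value: count / n for value, count in sorted(counts.items())}

print(counts[12])  # a count
print(probs[12])   # a fraction
```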

In [None]:
"""
Let's look at the `year` column in the `DataFrame`, which represents the year each respondent was interviewed.
Make an unnormalized `Pmf` for `year` and plot the result as a bar chart.
"""
year = gss['year']
pmf_year = Pmf.from_seq(year, normalize=False)
pmf_year.bar(label='year')
print(pmf_year[2022])

In [None]:
"""
## Cumulative Distribution Functions
If we compute the cumulative sum of a PMF, the result is a cumulative distribution function (CDF).
"""
values = 1, 2, 2, 3, 5
pmf = Pmf.from_seq(values)
print(pmf)
cdf = pmf.make_cdf()
print(cdf)
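
In [None]:
"""As a rough sketch of what "cumulative sum of a PMF" means, the same five values can be processed by hand.
This uses only the standard library and exact fractions; it shows the arithmetic, not the internals of `make_cdf`, and the names `qs`, `ps`, and `cumulative` are chosen for this example.
"""

```python
from collections import Counter
from fractions import Fraction
from itertools import accumulate

values = 1, 2, 2, 3, 5

counts = Counter(values)
qs = sorted(counts)                                  # distinct quantities, in order
ps = [Fraction(counts[q], len(values)) for q in qs]  # PMF: probability of each quantity
cumulative = list(accumulate(ps))                    # CDF: running total of the PMF

for q, p, c in zip(qs, ps, cumulative):
    print(q, float(p), float(c))
```

"""The last cumulative value is 1 because every value is less than or equal to the largest one."""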

In [None]:
"""
## CDF of Age
"""
age = gss['age']

"""`empiricaldist` provides a `Cdf.from_seq` function that takes any kind of sequence and computes the CDF of the values."""

from empiricaldist import Cdf

cdf_age = Cdf.from_seq(age)

"""The result is a `Cdf` object, which provides a method called `plot` that plots the CDF as a line."""

cdf_age.plot()

plt.xlabel('Age (years)')
plt.ylabel('CDF')
plt.title('Distribution of age');

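In [None]:
"""A `Cdf` object can be called like a function: given a quantity `q`, it returns the cumulative probability `p`, the fraction of values less than or equal to `q`.
`q` stands for "quantity", which is another name for a value in a distribution.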
`p` stands for probability, which is the result.
In this example, the quantity is age 51, and the corresponding probability is about 0.62.
That means that about 62% of the respondents are age 51 or younger.
"""
q = 51
p = cdf_age(q)
print(p)
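def draw_line(p, q, x):
    """Draw a dotted guide from (q, 0) up to (q, p), then across to (x, p)."""
    xs = [q, q, x]
    ys = [0, p, p]
    plt.plot(xs, ys, ':', color='gray')

def draw_arrow_left(p, q, x):
    """Draw a left-pointing arrowhead with its tip at (x, p); q is unused."""
    dx = 3
    dy = 0.025
    xs = [x+dx, x, x+dx]
    ys = [p-dy, p, p+dy]
    plt.plot(xs, ys, ':', color='gray')

def draw_arrow_down(p, q, y):
    """Draw a downward-pointing arrowhead with its tip at (q, y); p is unused."""
    dx = 1.25
    dy = 0.045
    xs = [q-dx, q, q+dx]
    ys = [y+dy, y, y+dy]
    plt.plot(xs, ys, ':', color='gray')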

cdf_age.plot()
x = 17
draw_line(p, q, x)
draw_arrow_left(p, q, x)

plt.xlabel('Age (years)')
plt.xlim(x-1, 91)
plt.ylabel('CDF')
plt.title('Distribution of age');
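
In [None]:
"""Evaluating a CDF at `q` amounts to asking what fraction of the values are less than or equal to `q`.
Here is a minimal sketch of that lookup with the standard library; `eval_cdf` and the toy list `ages` are hypothetical names for this example, and the ages are not GSS data.
"""

```python
from bisect import bisect_right

def eval_cdf(sorted_values, q):
    """Return the fraction of values less than or equal to q."""
    # bisect_right counts how many sorted values are <= q.
    return bisect_right(sorted_values, q) / len(sorted_values)

# Hypothetical toy ages; these are NOT values from the GSS.
ages = sorted([23, 35, 35, 41, 51, 58, 64, 70])

print(eval_cdf(ages, 51))  # 5 of the 8 ages are 51 or younger
```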


In [None]:
"""The CDF is an invertible function, which means that if you have a probability, `p`,
you can look up the corresponding quantity, `q`.
The `Cdf` object provides a method called `inverse` that computes the inverse of the cumulative distribution function.
We look up the probability 0.25 and the result is 32.
That means that 25% of the respondents are age 32 or less.
Another way to say the same thing is "age 32 is the 25th percentile of this distribution".
"""

p1 = 0.25
q1 = cdf_age.inverse(p1)
print(q1)
cdf_age.plot()

p2 = 0.75
q2 = cdf_age.inverse(p2)
print(q2)

x = 17
draw_line(p1, q1, x)
draw_arrow_down(p1, q1, 0)

draw_line(p2, q2, x)
draw_arrow_down(p2, q2, 0)

plt.xlabel('Age (years)')
plt.xlim(x-1, 91)
plt.ylabel('CDF')
plt.title('Distribution of age');
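
In [None]:
"""One way to think about the inverse lookup is "find the smallest quantity `q` such that at least a fraction `p` of the values are less than or equal to `q`".
The sketch below implements that rule with the standard library.
It is an assumption that `Cdf.inverse` resolves in-between probabilities in exactly this way; `inverse_cdf` and the toy list `ages` are hypothetical, and the ages are not GSS data.
"""

```python
from math import ceil

def inverse_cdf(sorted_values, p):
    """Return the smallest value q such that at least a fraction p of the values are <= q."""
    n = len(sorted_values)
    # At least ceil(p * n) values must be at or below q.
    # Rounding first guards against float error such as 0.7 * 10 == 7.000000000000001.
    k = max(ceil(round(p * n, 9)), 1)
    return sorted_values[k - 1]

# Hypothetical toy ages; these are NOT values from the GSS.
ages = sorted([23, 35, 35, 41, 51, 58, 64, 70])

q1 = inverse_cdf(ages, 0.25)  # 25th percentile
q3 = inverse_cdf(ages, 0.75)  # 75th percentile
print(q1, q3, q3 - q1)        # the last number is the interquartile range
```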


In [None]:
"""## Comparing Distributions

So far we've seen two ways to represent distributions, PMFs and CDFs.
Now we'll use PMFs and CDFs to compare distributions, and we'll see the pros and cons of each.
One way to compare distributions is to plot multiple PMFs on the same axes.
For example, suppose we want to compare the distribution of age for male and female respondents.
First we'll create a Boolean `Series` that's true for male respondents and another that's true for female respondents.
"""

male = (gss['sex'] == 1)
female = (gss['sex'] == 2)

"""We can use these `Series` to select ages for male and female respondents."""

male_age = age[male]
female_age = age[female]

"""And plot a PMF for each."""

pmf_male_age = Pmf.from_seq(male_age)
pmf_male_age.plot(label='Male')

pmf_female_age = Pmf.from_seq(female_age)
pmf_female_age.plot(label='Female')

plt.xlabel('Age (years)')
plt.ylabel('PMF')
plt.title('Distribution of age by sex')
plt.legend();

In [None]:
"""A plot as variable as this is often described as **noisy**.
If we ignore the noise, it looks like the PMF is higher for men between ages 40 and 50,
and higher for women between ages 70 and 80.
But both of those differences might be due to randomness.

Now let's do the same thing with CDFs -- everything is the same except we replace `Pmf` with `Cdf`.
"""

cdf_male_age = Cdf.from_seq(male_age)
cdf_male_age.plot(label='Male')

cdf_female_age = Cdf.from_seq(female_age)
cdf_female_age.plot(label='Female')

plt.xlabel('Age (years)')
plt.ylabel('CDF')
plt.title('Distribution of age by sex')
plt.legend();

In [None]:
"""Because CDFs smooth out randomness, they provide a better view of real differences between distributions.
In this case, the lines are close together until age 40 -- after that, the CDF is higher for men than women.

So what does that mean?
One way to interpret the difference is that the fraction of men below a given age is generally more than the fraction of women below the same age.
For example, about 77% of men are 60 or less, compared to 75% of women.
"""

print(cdf_male_age(60), cdf_female_age(60))
"""Going the other way, we could also compare percentiles.
For example, the median age woman is older than the median age man, by about one year.
"""

print(cdf_male_age.inverse(0.5), cdf_female_age.inverse(0.5))

In [None]:
"""## Comparing Incomes

As another example, let's look at household income and compare the distribution before and after 1995 (I chose 1995 because it's roughly the midpoint of the survey).
We'll make two Boolean `Series` objects to select respondents interviewed before and after 1995.
"""

pre95 = (gss['year'] < 1995)
post95 = (gss['year'] >= 1995)

"""Now we can plot the PMFs of `realinc`, which records household income converted to 1986 dollars."""

realinc = gss['realinc']

Pmf.from_seq(realinc[pre95]).plot(label='Before 1995')
Pmf.from_seq(realinc[post95]).plot(label='After 1995')

plt.xlabel('Income (1986 USD)')
plt.ylabel('PMF')
plt.title('Distribution of income')
plt.legend();

In [None]:
"""There are a lot of unique values in this distribution, and none of them appear very often.
As a result, the PMF is so noisy that we can't really see the shape of the distribution.
It's also hard to compare the distributions.
It looks like there are more people with high incomes after 1995, but it's hard to tell.  We can get a clearer picture with a CDF.
"""

Cdf.from_seq(realinc[pre95]).plot(label='Before 1995')
Cdf.from_seq(realinc[post95]).plot(label='After 1995')

plt.xlabel('Income (1986 USD)')
plt.ylabel('CDF')
plt.title('Distribution of income')
plt.legend();

In [None]:
"""
Below 30K the CDFs are almost identical; above that, we can see that the post-1995 distribution is shifted to the right.
In other words, the fraction of people with high incomes is about the same, but the income of high earners has increased.
"""