In [1]:
from scipy.stats import entropy
import pandas as pd
import numpy as np

Load the results of part 1 into a pandas.DataFrame:

In [2]:
results1 = pd.read_csv('hw1-ii-1.csv')

Round each number to two decimal places:

In [3]:
results_rounded = {
    group_name: np.array(results).round(2)
    for group_name, results in results1.to_dict('list').items()
}

I will use the Kullback–Leibler divergence ($D_{\text{KL}}$) to measure how far each of the last 3 groups of results diverges from the original results. Note that $D_{\text{KL}}$ is not symmetric, so the order of its arguments matters. For discrete probability distributions $P$ and $Q$ defined on the same sample space, $\mathcal{X}$,
$$ D_{\text{KL}}(P \ || \ Q) = \sum_{x \in \mathcal{X}} P(x) \log \left( \frac{P(x)}{Q(x)} \right).$$
Let $D_{\text{O}}$, $D_{\min}$, $D_{26}$ and $D_{\max}$ denote the 4 adjacent datasets. We need to verify that
$$ D_{\text{KL}}( \text{A}(D_{\text{O}}) \ || \ \text{A}(D_{\text{X}}) ) < \epsilon, $$
for all $D_{\text{X}} \in \{ D_{\min}, D_{26}, D_{\max} \}$.
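
To make the definition concrete, here is a minimal sketch that evaluates the sum directly for two small, made-up distributions. The helper name `kl_divergence` and the vectors `p` and `q` are illustrative assumptions and are not part of the homework data.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions given as probability vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with P(x) = 0 contribute 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Hypothetical distributions over a two-element sample space.
p = [0.5, 0.5]
q = [0.25, 0.75]

print(kl_divergence(p, q))  # positive, since P differs from Q
print(kl_divergence(p, p))  # zero, since the distributions are identical
print(kl_divergence(q, p))  # differs from D_KL(P || Q): the divergence is asymmetric
```

The sketch uses the natural logarithm, so the result is in nats.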

I will use the function `scipy.stats.entropy` ([docs](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.entropy.html)) to calculate $D_{\text{KL}}$ per the definition above. When given two arrays, it normalizes each to sum to 1 and uses the natural logarithm by default:

In [4]:
def validate_group(group_name, epsilon):
    """Return True if D_KL(original || group_name) over the shared bins is below epsilon."""
    # Calculate frequency of each %.2f result (bin); np.unique returns sorted bins:
    orig_bins, orig_freq = np.unique(results_rounded['original'], return_counts=True)
    nebr_bins, nebr_freq = np.unique(results_rounded[group_name], return_counts=True)

    # Calculate vector of frequencies over common %.2f results (bins).
    # Bins that appear in only one of the two groups are dropped.
    orig_v = orig_freq[np.isin(orig_bins, nebr_bins)]  # per https://stackoverflow.com/a/2333682
    nebr_v = nebr_freq[np.isin(nebr_bins, orig_bins)]

    # entropy(pk, qk) normalizes both count vectors and returns D_KL(pk || qk):
    return entropy(orig_v, nebr_v) < epsilon
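
As a sanity check on the common-bin step in `validate_group`, the following sketch runs the same logic on toy data and compares `entropy` against the sum from the definition. The arrays `orig` and `nebr` are made up for illustration and do not come from `hw1-ii-1.csv`. The sketch selects the shared bins with `np.isin` boolean masks.

```python
import numpy as np
from scipy.stats import entropy

# Illustrative rounded results (assumed values, not from the homework data).
orig = np.array([0.10, 0.10, 0.20, 0.30])
nebr = np.array([0.10, 0.20, 0.20, 0.40])

orig_bins, orig_freq = np.unique(orig, return_counts=True)
nebr_bins, nebr_freq = np.unique(nebr, return_counts=True)

# Keep only the bins present in both samples (0.10 and 0.20 here).
orig_v = orig_freq[np.isin(orig_bins, nebr_bins)]
nebr_v = nebr_freq[np.isin(nebr_bins, orig_bins)]

# entropy() normalizes each count vector to sum to 1 before evaluating the sum.
p = orig_v / orig_v.sum()
q = nebr_v / nebr_v.sum()
manual = float(np.sum(p * np.log(p / q)))
kl = float(entropy(orig_v, nebr_v))

print(orig_v, nebr_v)
print(manual, kl)
```

Because the bins 0.30 and 0.40 each appear in only one sample, they are excluded before the divergence is computed. This keeps every $Q(x)$ nonzero, at the cost of ignoring any probability mass outside the shared bins.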

Run the check for all 3 groups with $\epsilon = 0.5$:

In [5]:
print({name: validate_group(name, 0.5) for name in ['sans_oldest', 'sans_26', 'sans_youngest']})

{'sans_oldest': True, 'sans_26': True, 'sans_youngest': True}
