# Day 26 notebook

The objectives of this notebook are to practice

* the bootstrap
* permutation testing
* computing confidence measures of network features

## Modules used for this assignment

In [None]:
# standard library modules
import random           # for seed and other randomizing functions
import collections      # for Counter

# course modules
import bayesian_network # for the BayesianNetwork class

## Sample data
As toy data for this notebook, we'll use some stats from the 2019 regular season games that the Green Bay Packers have played thus far.

In [None]:
# Green Bay Packers 2019 regular season games
# variables: opponent, home/away, packers score, opponent score, packers pass yards, packers rush yards
packers_data = [
    ('Bears',    'away', 10,  3, 166,  47),
    ('Vikings',  'home', 21, 16, 191, 144),
    ('Broncos',  'home', 27, 16, 235,  77),
    ('Eagles',   'home', 27, 34, 414,  77),
    ('Cowboys',  'away', 34, 24, 215, 120),
    ('Lions',    'home', 23, 22, 277, 170),
    ('Raiders',  'home', 42, 24, 421,  60),
    ('Chiefs',   'away', 31, 24, 256, 118),
    ('Chargers', 'away', 11, 26, 139,  45),
    ('Panthers', 'home', 24, 16, 225, 163),
    ('49ers',    'away',  8, 37,  81, 117),
    ('Giants',   'away', 31, 13, 243,  79),
    ('Redskins', 'home', 20, 15, 167, 174)]

## PROBLEM 1: The bootstrap (1 POINT)
Implement the `bootstrap` function below, which constructs a bootstrapped data set from a given data set.  Recall that a bootstrapped data set is generated by sampling, *with replacement*, observations from the original data set, so it has the same number of observations as the original.  The body of your function should simply be a **single function call** to the appropriate function in Python's [`random`](https://docs.python.org/library/random.html) module.
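
To make "sampling with replacement" concrete, here is a minimal sketch on a toy list. The helper name `resample_with_replacement` is hypothetical, and it deliberately uses an explicit loop over random indices rather than the single `random` call the problem asks for, so it illustrates the idea but is not expected to reproduce the outputs in the tests below.

```python
import random

def resample_with_replacement(observations, rng):
    """Illustration only: draw len(observations) items uniformly at random, allowing repeats."""
    n = len(observations)
    # each draw picks an index independently, so the same observation can appear more than once
    return [observations[rng.randrange(n)] for _ in range(n)]

rng = random.Random(0)  # a private generator, so the notebook's global seed is left untouched
toy = ['a', 'b', 'c', 'd', 'e']
sample = resample_with_replacement(toy, rng)
print(sample)
```

The resampled list always has the same length as the original, and some observations typically appear several times while others are left out.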

In [None]:
def bootstrap(data):
    """Returns a bootstrap sample from the data set.
    
    Args:
        data: a list of observations (tuples)
    Returns:
        A list of observations with the same number of observations as in data.
    """
    ###
    ### YOUR CODE HERE
    ###


In [None]:
# tests for bootstrap
random.seed(42)
assert bootstrap(packers_data) == [
    ('Chargers', 'away', 11, 26, 139,  45),
    ('Bears',    'away', 10,  3, 166,  47),
    ('Eagles',   'home', 27, 34, 414,  77),
    ('Broncos',  'home', 27, 16, 235,  77),
    ('Panthers', 'home', 24, 16, 225, 163),
    ('Chargers', 'away', 11, 26, 139,  45),
    ('Giants',   'away', 31, 13, 243,  79),
    ('Vikings',  'home', 21, 16, 191, 144),
    ('Lions',    'home', 23, 22, 277, 170),
    ('Bears',    'away', 10,  3, 166,  47),
    ('Broncos',  'home', 27, 16, 235,  77),
    ('Raiders',  'home', 42, 24, 421,  60),
    ('Bears',    'away', 10,  3, 166,  47)]

packers_points = [(obs[2],) for obs in packers_data]
random.seed(1)
assert bootstrap(packers_points) == [(21,), (31,), (24,), (27,), (42,), (23,), (11,), (8,), (21,), (10,), (8,), (23,), (24,)]

random.seed(2)
assert bootstrap([("single",)]) == [("single",)]
print("SUCCESS: bootstrap passed all tests!")

## PROBLEM 2: Permuting (randomizing) data (1 POINT)
Implement the `permute` function below which constructs a permuted version of the given data set.  Recall that permuting a data set involves shuffling the values within each variable (column in our tables).  In your implementation, you should shuffle the values of each variable (column) separately using the [`random.sample`](https://docs.python.org/library/random.html#random.sample) function (not the `random.shuffle` function!), with the first variable shuffled first, the second variable shuffled second, and so on.  Recall that `zip(*seq)` is useful for transposing a data matrix (`seq`) that is represented as a list of tuples/lists.
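
The two building blocks mentioned above can be tried on a small example first. This is only a sketch on made-up rows (`toy_rows` is a hypothetical name), showing how `zip(*seq)` transposes rows into columns and how `random.sample` differs from `random.shuffle`; it is not the full `permute` implementation.

```python
import random

toy_rows = [('x', 1), ('y', 2), ('z', 3)]

# zip(*rows) transposes rows into columns; zip returns an iterator, so wrap it in list() to reuse it
columns = list(zip(*toy_rows))
# columns == [('x', 'y', 'z'), (1, 2, 3)]

# random.sample with k equal to the column length returns a NEW list in random order
# and leaves its input untouched, whereas random.shuffle reorders a list in place
rng = random.Random(0)  # private generator, independent of the notebook's global seed
first_column = list(columns[0])
shuffled = rng.sample(first_column, len(first_column))

# applying zip(*...) to the columns transposes back to rows
round_trip = list(zip(*columns))
```

Because each column is shuffled independently, the values within a variable are preserved but any association between variables is broken.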

In [None]:
def permute(data):
    """Returns a permuted (randomized) version of the data set.
    
    Args:
        data: a list of observations (tuples)
    Returns:
        A list of observations with the same number of observations as in data.
    """
    by_variable = zip(*data)
    ###
    ### YOUR CODE HERE
    ###


In [None]:
# tests for permute
random.seed(42)
assert permute(packers_data) == [
    ('49ers',    'away', 20, 22, 225,  77),
    ('Vikings',  'home', 10, 15, 414, 170),
    ('Bears',    'home', 27, 16, 191,  79),
    ('Cowboys',  'away', 42, 24, 166,  77),
    ('Eagles',   'away', 23,  3, 243, 120),
    ('Chargers', 'home', 34, 24, 215, 144),
    ('Giants',   'away', 21, 37, 167,  45),
    ('Lions',    'away', 24, 26,  81, 117),
    ('Redskins', 'home',  8, 34, 421, 118),
    ('Panthers', 'home', 31, 13, 139,  60),
    ('Broncos',  'away', 27, 16, 277,  47),
    ('Raiders',  'home', 11, 16, 235, 163),
    ('Chiefs',   'home', 31, 24, 256, 174)]

packers_points_and_pass_yards = [(obs[2], obs[4]) for obs in packers_data]
random.seed(1)
assert permute(packers_points_and_pass_yards) == [
    (27, 166),
    (24, 421),
    (21, 243),
    (34, 225),
    ( 8, 167),
    (31, 256),
    (42, 235),
    (27, 277),
    (23, 191),
    (31, 139),
    (10, 215),
    (20, 414),
    (11,  81)]

print("SUCCESS: permute passed all tests!")

## Updates to the `BayesianNetwork` class

The `BayesianNetwork` class has now been filled out a bit more to include two key methods/functions:

1. `all_possible_networks`: An iterator over all possible Bayesian network structures for the given random variables
2. `BayesianNetwork.model_evidence`: Returns the log model evidence, $\log P(D)$, of a data set $D$ under the network

With these two functions, we can now define a function that does a brute force search over all possible networks and returns the network that maximizes the model evidence score:

In [None]:
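def best_net(data, possible_nets):
    """Returns the network in possible_nets (a list) with the highest model evidence for data."""
    # score every candidate, then pick the first one that attains the maximum score
    model_evidences = [net.model_evidence(data) for net in possible_nets]
    return possible_nets[model_evidences.index(max(model_evidences))]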
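
The selection step in `best_net` can be exercised without the course module. The sketch below repeats `best_net` for self-containment and uses a hypothetical stand-in class, `ScoredNet`, whose `model_evidence` simply returns a stored, made-up score; the real `BayesianNetwork` computes its score from the data.

```python
class ScoredNet:
    """Hypothetical stand-in for a network: model_evidence just returns a stored score."""
    def __init__(self, name, score):
        self.name = name
        self.score = score

    def model_evidence(self, data):
        return self.score  # the data argument is ignored in this toy stub

def best_net(data, possible_nets):
    model_evidences = [net.model_evidence(data) for net in possible_nets]
    return possible_nets[model_evidences.index(max(model_evidences))]

# made-up log scores; "chain" and "full" are tied for the maximum
candidates = [ScoredNet("empty", -210.5), ScoredNet("chain", -198.2), ScoredNet("full", -198.2)]
winner = best_net([], candidates)
print(winner.name)  # chain: list.index returns the first position of the maximum
```

Two things are worth noticing: ties are broken in favor of the network that comes first in the list, and `possible_nets` must support indexing, so an iterator of networks needs to be wrapped in `list(...)` before it is passed in.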

For example, we can use this function to attempt to reconstruct the flight/weather/airline model from a data set simulated from that model.  In this example, the best network turns out to be the true network!

In [None]:
flight_weather_network = bayesian_network.make_flight_weather_network()

all_possible_flight_weather_nets = list(bayesian_network.all_possible_networks(flight_weather_network.vertex_labels(),
                                                                               flight_weather_network.possible_values))

# Generate a data set from the model, as well as a permuted version of this data set
random.seed(1)
flight_weather_dataset = [flight_weather_network.sample() for _ in range(300)]
permuted_flight_weather_dataset = permute(flight_weather_dataset)

# Predict the network structure from this data set using the model evidence score
best_net(flight_weather_dataset, all_possible_flight_weather_nets).plot()

## PROBLEM 3: Feature confidence via the bootstrap (1 POINT)
You are to use the `best_net` function, along with your `bootstrap` function from Problem 1, to compute confidence levels for the features of a network learned from a given data set.  The confidence level of each feature should be the fraction of bootstrap data sets in which the feature is present in the learned network (the one output by `best_net`).  Your function should take as input a `features_func` function, which extracts the set of features of interest from a single learned network.  A couple of such feature extraction functions are provided below.  I recommend that you use [`collections.Counter`](https://docs.python.org/3.6/library/collections.html#collections.Counter) in your implementation.
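
As a hint for the counting step only, here is a sketch of how `collections.Counter` can turn a collection of feature sets into fractions. The feature sets below are made up for illustration (in the actual problem each set would come from a network learned on one bootstrapped data set), and the variable names are hypothetical.

```python
import collections

# hypothetical feature sets, one per learned network (made up for illustration)
feature_sets = [
    {('a', 'b'), ('b', 'c')},
    {('a', 'b')},
    {('a', 'b'), ('a', 'c')},
    {('b', 'c'), ('a', 'b')},
]

counts = collections.Counter()
for features in feature_sets:
    counts.update(features)  # each feature in a set is counted once per network

# fraction of networks in which each feature appears
fractions = {feature: count / len(feature_sets) for feature, count in counts.items()}
print(fractions)
```

Features that never appear in any learned network are simply absent from the resulting dictionary rather than being mapped to 0.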

In [None]:
def extract_edges(net):
    """Extracts the directed edges (with variable names) of the network."""
    return {(net.vertex_label(i), net.vertex_label(j)) for i, j in net.edges()}

def extract_undirected_edges(net):
    """Extracts the undirected edges (with variable names) of the network."""
    return {tuple(sorted(edge)) for edge in extract_edges(net)}

def compute_feature_confidences(data, possible_networks, features_func, num_bootstraps):
    """Computes the bootstrap confidence levels of features in networks learned from the data set.
    Args:
        data: a list of observations (tuples)
        possible_networks: a list of all possible BayesianNetworks
        features_func: a function that returns a set of features present in a learned network
        num_bootstraps: the number of bootstrapped data sets to use
    Returns:
        A dictionary mapping network features to confidence levels (fraction of bootstrap datasets 
        in which the feature is present in the learned network)
    """
    ###
    ### YOUR CODE HERE
    ###


In [None]:
# tests for compute_feature_confidences

# test with just one bootstrapped data set
random.seed(41)
assert compute_feature_confidences(flight_weather_dataset, 
                                   all_possible_flight_weather_nets,
                                   extract_undirected_edges,
                                   1) == {
    ('airline', 'flight_status'): 1.0, 
    ('flight_status', 'weather'): 1.0}

# test with 10 bootstrapped data sets
random.seed(42)
assert compute_feature_confidences(flight_weather_dataset, 
                                   all_possible_flight_weather_nets,
                                   extract_undirected_edges,
                                   10) == {
    ('flight_status', 'weather'): 1.0,
    ('airline', 'flight_status'): 0.6,
    ('airline',       'weather'): 0.5}

# test with 100 bootstrapped data sets
random.seed(42)
assert compute_feature_confidences(flight_weather_dataset, 
                                   all_possible_flight_weather_nets,
                                   extract_undirected_edges,
                                   100) == {
    ('flight_status', 'weather'): 1.0,
    ('airline', 'flight_status'): 0.71,
    ('airline',       'weather'): 0.41}

# test with a permuted data set, which shouldn't have any real associations
random.seed(42)
assert compute_feature_confidences(permuted_flight_weather_dataset, 
                                   all_possible_flight_weather_nets,
                                   extract_undirected_edges,
                                   100) == {
    ('airline', 'flight_status'): 0.14,
    ('flight_status', 'weather'): 0.67,
    ('airline',       'weather'): 0.15}

print("SUCCESS: compute_feature_confidences passed all tests!")

## BONUS ACTIVITIES

1. How much data is needed from the flight/weather/airline model in order for the airline->flight_status edge to be confidently learned?
2. Below is a toy four-variable model.  Simulate a data set from this model and then see if you can learn the edges of the model confidently.  There are many more possible four-variable networks than three-variable ones, so this will take some time to run.
3. Implement another feature extraction function, `markov_blanket_relations`, which outputs the set of pairs of variables (i, j) for which i is in the Markov blanket of j.  Try using it with the toy four-variable model.
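
As a starting point for the third activity, recall that the Markov blanket of a variable consists of its parents, its children, and the other parents of its children. The sketch below works on a plain set of directed (parent, child) pairs, such as the output of `extract_edges`, rather than on a `BayesianNetwork` object; the helper name `markov_blanket_pairs` is hypothetical, and adapting it into a `features_func` is left as part of the exercise.

```python
def markov_blanket_pairs(edges):
    """Sketch: returns {(i, j)} such that i is in the Markov blanket of j,
    given a set of directed (parent, child) edges."""
    parents = {}
    children = {}
    for parent, child in edges:
        parents.setdefault(child, set()).add(parent)
        children.setdefault(parent, set()).add(child)

    pairs = set()
    for j in set(parents) | set(children):
        # start with the parents and children of j
        blanket = set(parents.get(j, ())) | set(children.get(j, ()))
        # add the other parents of each child of j
        for child in children.get(j, ()):
            blanket |= parents[child]
        blanket.discard(j)  # j is a parent of its own children, so remove it again
        pairs.update((i, j) for i in blanket)
    return pairs

# chain x1 -> x2 -> x3 -> x4, the same structure as the toy four-variable model
chain = {("x1", "x2"), ("x2", "x3"), ("x3", "x4")}
chain_pairs = markov_blanket_pairs(chain)

# v-structure a -> c <- b: a and b are in each other's blankets through their shared child
collider_pairs = markov_blanket_pairs({("a", "c"), ("b", "c")})
```

In a chain, each variable's blanket contains only its immediate neighbors, while in the v-structure the two parents end up in each other's blankets even though no edge connects them.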

In [None]:
four_var_network = bayesian_network.BayesianNetwork(["x1", "x2", "x3", "x4"])
four_var_network.set_cpd("x1", [], [0, 1], 
                         {(): [0.75, 0.25]})
four_var_network.set_cpd("x2", ["x1"], [0, 1],
                         {(0,): [0.9, 0.1],
                          (1,): [0.1, 0.9]})
four_var_network.set_cpd("x3", ["x2"], [0, 1],
                         {(0,): [0.8, 0.2],
                          (1,): [0.2, 0.8]})
four_var_network.set_cpd("x4", ["x3"], [0, 1],
                         {(0,): [0.7, 0.3],
                          (1,): [0.3, 0.7]})
four_var_network.plot()

In [None]:
###
### YOUR CODE HERE
###
