# Bayesian Network from Conditional Probability Table
[//]: # (Summary)
In this tutorial, we create a Bayesian network (BN) for the Fix task of a kill chain, as shown in the figure below.

![alt text](images/Ship-Wake-BN-Fix.png)

We will walk through the components of a BN and then demonstrate constructing one by manually creating the Conditional Probability Distribution (CPD) tables that quantify the relationships along the edges of the graph. Finally, we demonstrate how the BN task is incorporated in mimik's `Killweb` module.

[//]: # (Objective)
By the end of this tutorial you will understand how a BN is created using a defined graph and manually created CPD tables, how to infer with that BN, and how to run the `Killweb` module with this Fix BN task.

This tutorial requires the Python package [pgmpy](https://pgmpy.org). Please ensure that pgmpy is installed by running the following command in a terminal with the mimik virtual environment active:
```
$ pip install pgmpy
```
Or, follow the [pgmpy installation instructions here.](https://pgmpy.org/started/base.html)

In [None]:
import json
import warnings
import seaborn as sns
from tasks.component_BN import ComponentBN
from pgmpy.factors.discrete import TabularCPD
from mimik.killweb import Killweb

warnings.filterwarnings('ignore')

## Creating the Network

### Edges

Edges define the nodes and their relationships. For example, if there is a connection from node $\alpha$ to node $\beta$, we specify it with the tuple ( $\alpha$ , $\beta$ ).

In this example, the success of the Fix task is affected by the time of day and the weather, so our edges are:
- (Time, Fix)
- (Weather, Fix)

Edges are set in an `edges` variable that is a list of tuples for all the edges in the network.

In [None]:
edges = [
    ('Time', 'Fix'),
    ('Weather', 'Fix'),
]
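
Because the edge list determines the graph structure, the node names and each node's parents can be recovered from it. The following is a minimal sketch in plain Python; the names `nodes` and `parents` are illustrative and are not part of mimik or pgmpy.

```python
# Illustrative only: derive nodes and parent lists from an edge list.
edges = [
    ('Time', 'Fix'),
    ('Weather', 'Fix'),
]

nodes = []    # unique node names in order of first appearance
parents = {}  # maps each node to the list of nodes that point into it
for source, target in edges:
    for name in (source, target):
        if name not in nodes:
            nodes.append(name)
            parents[name] = []
    parents[target].append(source)

print(nodes)
print(parents)
```

Nodes with an empty parent list (Time and Weather) need only an unconditional table, while Fix needs a table conditioned on both of its parents.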

### Conditional Probability Distribution (CPD) Table

The CPD tables contain the probability of outcomes for a node and, if relevant, how those probabilities are conditioned on inputs. For example, $\alpha$ will have a table that contains the probability of each of its outcomes, $P(\alpha_i)$, which must sum to 1:

| &alpha; | Probability |
| -------- | ----------- |
|    0     | P(&alpha;=0)=0.3 |
|    1     | P(&alpha;=1)=0.7 |

For the CPD table of $\beta$, probabilities are conditioned on the input $\alpha$, so each cell is $P(\beta_j\mid\alpha_i)$:

| &beta; | Probability | Probability |
| ------- | ----------- | ----------- |
|         |           &alpha;=0          |           &alpha;=1           |
|    0    | P(&beta;=0 &#124; &alpha;=0)=0.8 | P(&beta;=0 &#124; &alpha;=1)=0.05 |
|    1    | P(&beta;=1 &#124; &alpha;=0)=0.2 | P(&beta;=1 &#124; &alpha;=1)=0.95 |
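
To see how the two tables combine, the marginal distribution of $\beta$ follows from the law of total probability, $P(\beta_j)=\sum_i P(\beta_j\mid\alpha_i)\,P(\alpha_i)$. A small sketch in plain Python using the example values above (the variable names are illustrative):

```python
import math

# P(alpha): one probability per outcome of alpha
p_alpha = [0.3, 0.7]

# P(beta | alpha): rows are outcomes of beta, columns are outcomes of alpha
p_beta_given_alpha = [
    [0.80, 0.05],  # beta = 0
    [0.20, 0.95],  # beta = 1
]

# Each column of a CPD table must sum to 1
for i in range(len(p_alpha)):
    column_sum = sum(row[i] for row in p_beta_given_alpha)
    assert math.isclose(column_sum, 1.0)

# Law of total probability: weight each column by P(alpha_i) and add
p_beta = [
    sum(row[i] * p_alpha[i] for i in range(len(p_alpha)))
    for row in p_beta_given_alpha
]
print(p_beta)
```

The result is approximately 0.275 for $\beta=0$ and 0.725 for $\beta=1$, which again sums to 1.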


In the remainder of this notebook, we do not include the $P(\alpha_i)$ or $P(\beta_j\mid\alpha_i)$ in the table cells for simplicity. Continuing the Fix task example from above, we first define the table for "Time":

|  Time   | Probability |
| ------- | ----------- |
|   Day   |   0.70      |
|  Night  |   0.30      |

The Bayesian network (BN) requires this information as a `TabularCPD` object, which takes the probabilities of the table along with additional metadata. For a simple table like this, we specify the `variable` name (which must match the spelling of the associated node in `edges`, so Time), the number of possible outcomes (`variable_card`), the probability `values` in the table, and the names of the outcomes of the table ("Day" and "Night" in the dictionary `state_names` -- this format will make more sense when there are multiple inputs to a node, which we will see shortly). The cell below shows how to create the relevant CPD for the Time node.

In [None]:
time_cpd = TabularCPD(
    variable='Time',  # name of the variable node
    variable_card=2,  # number of possible outcomes
    values=[[0.7],    # conditional probability values
            [0.3]],
    state_names={    # dictionary of input and output names
        'Time': ["Day", "Night"]
    }
)

Next, we do the same with the "Weather" node and its table:

| Weather | Probability |
| ------- | ----------- |
|  Clear  |   0.90      |
|   Fog   |   0.10      |

In [None]:
weather_cpd = TabularCPD(
    variable='Weather',
    variable_card=2,
    values=[[0.9],
            [0.1]],
    state_names={
        'Weather': ["Clear", "Fog"]
    }
)

We must then condition the output of Fix ("Failure" or "Success") based on the inputs of Time (Day or Night) and Weather ("Clear" or "Fog"). The CPD for the Fix node is:

| Fix     | Probability | Probability | Probability | Probability |
| ------- | ----------- | ----------- | ----------- | ----------- |
|         |  Time=Day   |  Time=Day   |  Time=Night |  Time=Night |
|         |Weather=Clear| Weather=Fog |Weather=Clear| Weather=Fog |
| Failure |     0.01    |     0.60    |     0.10    |     0.80    |
| Success |     0.99    |     0.40    |     0.90    |     0.20    |

Remember, this table is read as $P(Fix_k\mid Time_j,Weather_i)$, so $P(Fix=Success\mid Time=Day,Weather=Clear)=0.99$. The values for the probability of outcomes can be defined as a list of lists for the CPD of this node:

In [None]:
values=[[0.01, 0.60, 0.10, 0.80],  # Failure
        [0.99, 0.40, 0.90, 0.20]]  # Success

In addition to the arguments above, for the Fix node we have to specify `evidence`, which holds the names of the input nodes (Time and Weather), and `evidence_card`, which holds the number of outcomes for each input node (in this case 2 for each). The `state_names` dictionary is also expanded to include the input nodes.

Note that the ordering in `values`, `evidence`, `evidence_card`, and the lists in `state_names` must all align. The first row of `values` is for the outcome listed first in the `state_names` for Fix, "Failure". Likewise, the first column corresponds to the first state listed in `state_names` for each input (Time=Day, Weather=Clear). Moving across the columns, the last item in `evidence` cycles through all of its outcomes before the preceding item advances. To help, we've added comments around the values to show the associated labels of the table above.

In [None]:
fix_cpd = TabularCPD(
    variable='Fix',
    variable_card=2,
    # Time   Day    Day  Night  Night
    # Wx    Clear   Fog  Clear  Fog
    values=[[0.01, 0.60, 0.10, 0.80],  # Failure
            [0.99, 0.40, 0.90, 0.20]],  # Success
    evidence=['Time', 'Weather'],
    evidence_card=[2, 2],
    state_names={
        "Time": ["Day", "Night"],
        "Weather": ["Clear", "Fog"],
        "Fix": ["Failure", "Success"]
    }
)
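
The column ordering described above can be reproduced with `itertools.product`, which also varies its last argument fastest. This is a sketch in plain Python that mirrors the ordering in the table; it does not call pgmpy, and the name `columns` is illustrative.

```python
from itertools import product

state_names = {
    "Time": ["Day", "Night"],
    "Weather": ["Clear", "Fog"],
}
evidence = ["Time", "Weather"]
values = [[0.01, 0.60, 0.10, 0.80],  # Failure
          [0.99, 0.40, 0.90, 0.20]]  # Success

# product() cycles the last evidence variable fastest, matching the table columns
columns = list(product(*(state_names[name] for name in evidence)))
for index, combo in enumerate(columns):
    print(index, combo, "P(Success) =", values[1][index])
```

Printing the columns this way before building a CPD is a quick check that each value sits under the intended combination of inputs.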

Below is a template for a node $\delta$ conditioned on three inputs, $\alpha$, $\beta$, and $\gamma$, to help illustrate the ordering for a more complex table.

<!-- | &delta; | Probability | Probability | Probability | Probability | Probability | Probability | Probability | Probability |
| -------  | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |
||&alpha;=0|&alpha;=0|&alpha;=0|&alpha;=0|&alpha;=1|&alpha;=1|&alpha;=1|&alpha;=1|
||&beta;=0|&beta;=0|&beta;=1|&beta;=1|&beta;=0|&beta;=0|&beta;=1|&beta;=1|
||&gamma;=0|&gamma;=1|&gamma;=0|&gamma;=1|&gamma;=0|&gamma;=1|&gamma;=0|&gamma;=1|
|0|P(&delta;=0 &#124; &alpha;=0, &beta;=0, &gamma;=0)|P(&delta;=0 &#124; &alpha;=0, &beta;=0, &gamma;=1)|P(&delta;=0 &#124; &alpha;=0, &beta;=1, &gamma;=0)|P(&delta;=0 &#124; &alpha;=0, &beta;=1, &gamma;=1)|P(&delta;=0 &#124; &alpha;=1, &beta;=0, &gamma;=0)|P(&delta;=0 &#124; &alpha;=1, &beta;=0, &gamma;=1)|P(&delta;=0 &#124; &alpha;=1, &beta;=1, &gamma;=0)|P(&delta;=0 &#124; &alpha;=1, &beta;=1, &gamma;=1)|
|1|P(&delta;=1 &#124; &alpha;=0, &beta;=0, &gamma;=0)|P(&delta;=1 &#124; &alpha;=0, &beta;=0, &gamma;=1)|P(&delta;=1 &#124; &alpha;=0, &beta;=1, &gamma;=0)|P(&delta;=1 &#124; &alpha;=0, &beta;=1, &gamma;=1)|P(&delta;=1 &#124; &alpha;=1, &beta;=0, &gamma;=0)|P(&delta;=1 &#124; &alpha;=1, &beta;=0, &gamma;=1)|P(&delta;=1 &#124; &alpha;=1, &beta;=1, &gamma;=0)|P(&delta;=1 &#124; &alpha;=1, &beta;=1, &gamma;=1)| -->

![](images/template-table.png)

The other metadata for the CPD would be:
- evidence = [&alpha;, &beta;, &gamma;]
- evidence_card = [2, 2, 2]
- state_names
    - &alpha; = [0, 1] 
    - &beta; = [0, 1] 
    - &gamma; = [0, 1]
    - &delta; = [0, 1] 
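
For larger tables it can help to compute which column a given combination of inputs lands in. The sketch below assumes the ordering shown in the template (last evidence variable cycling fastest); `column_index` is a hypothetical helper written for this illustration, not a mimik or pgmpy function.

```python
def column_index(state_indices, evidence_card):
    """Return the column for one combination of evidence state indices.

    The last evidence variable varies fastest (row-major order).
    """
    index = 0
    for state, card in zip(state_indices, evidence_card):
        if not 0 <= state < card:
            raise ValueError("state index out of range")
        index = index * card + state
    return index

evidence_card = [2, 2, 2]  # alpha, beta, gamma

# The number of columns is the product of the evidence cardinalities
n_columns = 1
for card in evidence_card:
    n_columns *= card

print(n_columns)
print(column_index([1, 0, 1], evidence_card))
```

With three binary inputs there are eight columns, and the combination $\alpha=1$, $\beta=0$, $\gamma=1$ falls in the sixth column (index 5 when counting from zero), in agreement with the template.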
  

### Constructing the BN

Finally, we create the BN using the `ComponentBN` class, which takes the edges and a list of the tabular CPDs as input to construct the BN.

In [None]:
fix_BN = ComponentBN(
    edges=edges,
    CPDs=[
        time_cpd,
        weather_cpd,
        fix_cpd
    ]
)

We can visualize the network by calling the `draw_network` method.

In [None]:
fix_BN.draw_network()

## Using the BN

### Inference Probability of Success

Given some evidence, say `Time=Day` and `Weather=Clear`, we can get the associated probability table for the Fix node by calling the `get_infer` method:

In [None]:
query = fix_BN.get_infer(
    evidence={
        "Time": "Day",
        "Weather": "Clear"
    },
    variables=["Fix"]
)
print(query)

Notice that this matches the probabilities from the table above. For simple cases like this, where a set of inputs goes directly to an outcome node, the inference is a lookup in the table. If a more complex graph is used, the inference method performs the math needed to compute the result. For example, if we had the following graph and evidence for $\alpha$, $\beta$, and $\gamma$, then we could run the same method to get the probability of outcomes for $\delta$.

![](images/example-graph.svg)
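
Even in the small Fix network, the answer stops being a single table column once some evidence is missing. As an illustration of the arithmetic involved (a hand calculation in plain Python, not the algorithm that `get_infer` or pgmpy uses internally), the sketch below computes $P(Fix=Success\mid Time=Day)$ by summing over Weather, and the prior $P(Fix=Success)$ by summing over both inputs.

```python
import math

p_time = {"Day": 0.70, "Night": 0.30}
p_weather = {"Clear": 0.90, "Fog": 0.10}

# P(Fix=Success | Time, Weather) from the Fix table
p_success = {
    ("Day", "Clear"): 0.99,
    ("Day", "Fog"): 0.40,
    ("Night", "Clear"): 0.90,
    ("Night", "Fog"): 0.20,
}

# Time is observed, Weather is not: marginalize over Weather.
# Time and Weather have no edge between them, so P(Weather | Time) = P(Weather).
p_success_day = sum(
    p_success[("Day", w)] * p_weather[w] for w in p_weather
)

# Nothing observed: marginalize over both parents
p_success_prior = sum(
    p_success[(t, w)] * p_time[t] * p_weather[w]
    for t in p_time
    for w in p_weather
)

print(round(p_success_day, 4), round(p_success_prior, 4))
```

The two results are approximately 0.931 and 0.9007.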

We can also get just the probability of success using the `get_prob_success` method and providing the evidence and desired outcome `Fix="Success"`: 

In [None]:
fix_BN.get_prob_success(
    evidence={
        "Time": "Day",
        "Weather": "Clear"
    },
    Fix="Success"
)

### Sampling

While the above inference provides the probability of success (or failure), the Monte Carlo simulation in mimik requires random sampling from the probability of outcomes for the Fix node. This is done through the `get_sample` method, which returns a pandas data frame from which we can read the outcome. Given the high probability of success and the sample size of one, the outcome is most likely a "Success".

In [None]:
%%capture --no-stderr
df = fix_BN.get_sample(
    evidence={
        "Time": "Day",
        "Weather": "Clear"
    },
    size=1
)
df

We can just grab the value of `Fix` for the sample generated above.

In [None]:
df["Fix"].values[0]

If we increase the sample `size` we can get enough results to see failures and can plot a histogram of the results to visualize the ratio of success to failure.

In [None]:
%%capture --no-stderr
results = fix_BN.get_sample(
    evidence={
        "Time": "Day",
        "Weather": "Clear"
    },
    size=100_000
)["Fix"]

In [None]:
print(results.value_counts())

sns.histplot(results.values);

If we change the evidence for weather to "Fog" we can see that the probability of success drops significantly. By constructing and using BNs in MIMIK we can account for changes in conditions like this when modeling Killwebs.

In [None]:
%%capture --no-stderr
results = fix_BN.get_sample(
    evidence={
        "Time": "Day",
        "Weather": "Fog"
    },
    size=100_000
)["Fix"]

In [None]:
print(results.value_counts())

sns.histplot(results.values);
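
The same comparison can be sketched without the BN by drawing Bernoulli samples directly from the two table entries. This uses only Python's standard `random` module and is an illustration of the sampling idea, not of how `get_sample` is implemented; `success_rate` is a hypothetical helper.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# P(Fix=Success | Time=Day, Weather) from the Fix table
p_success = {"Clear": 0.99, "Fog": 0.40}

def success_rate(probability, size):
    """Draw `size` Bernoulli outcomes and return the fraction of successes."""
    successes = sum(rng.random() < probability for _ in range(size))
    return successes / size

rates = {wx: success_rate(p, 100_000) for wx, p in p_success.items()}
print(rates)
```

With 100,000 draws, the empirical rates should land close to the table values of 0.99 and 0.40.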

## JSON Configs

This is the json configuration for this Bayesian network. Because it is constant for any similar Fix task node in the Killweb, we save this configuration in its own file in the `configs/` directory. This type of configuration is how users can create their own custom BNs for new Killwebs. Each type of BN node will have its own json config file, separate from the Killweb config file. More details on the configuration files are discussed in the next section. This section just demonstrates what a BN config looks like and how it compares to the construction we did in the previous section.

In [None]:
text = """
{
  "edges": [
    ["Time", "Fix"],
    ["Weather", "Fix"]
  ],
  "data": null,
  "CPDs":
  {
    "Time":
    { "variable": "Time",
      "variable_card": 2,
      "values": [
        [0.70],
        [0.30]
      ],
      "state_names": {
        "Time": ["Day", "Night"]
      }
    },
    "Weather":
    { "variable": "Weather",
      "variable_card": 2,
      "values": [
        [0.90],
        [0.10]
      ],
      "state_names": {
        "Weather": ["Clear", "Fog"]
      }
    },
    "Fix":
    { "variable": "Fix",
      "variable_card": 2,
      "values": [
        [0.01, 0.60, 0.10, 0.80],
        [0.99, 0.40, 0.90, 0.20]
      ],
      "evidence": [
        "Time",
        "Weather"
      ],
      "evidence_card": [2,2],
      "state_names": {
        "Time": ["Day", "Night"],
        "Weather": ["Clear", "Fog"],
        "Fix": ["Failure", "Success"]
      }
    }
  }
}
"""

We can load the text above and provide it to the `ComponentBN` class to create the Fix task node. Then we rerun the `draw_network` and `get_infer` methods to demonstrate the BN is the same as the one we created in the previous section.

In [None]:
data = json.loads(text)
CPD_list = []
CPD_cfg = data.pop("CPDs")
for name, config in CPD_cfg.items():
    CPD_list.append(TabularCPD(**config))

fix_BN = ComponentBN(
    # **data expands to edges=data["edges"] and data=data["data"]
    **data,
    CPDs=CPD_list
)

In [None]:
fix_BN.draw_network()

In [None]:
q = fix_BN.get_infer(
    evidence={
        "Time": "Day",
        "Weather": "Clear"
    },
    variables=["Fix"]
)
print(q)
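
When writing a custom BN config by hand, a quick sanity check of each CPD entry can catch ordering or typing mistakes before the network is built. The sketch below is one possible check in plain Python; `check_cpd` is a hypothetical helper (not part of mimik or pgmpy), and it is applied here to a copy of the Fix entry from the config above.

```python
import json
import math

config_text = """
{
  "variable": "Fix",
  "variable_card": 2,
  "values": [[0.01, 0.60, 0.10, 0.80],
             [0.99, 0.40, 0.90, 0.20]],
  "evidence": ["Time", "Weather"],
  "evidence_card": [2, 2],
  "state_names": {
    "Time": ["Day", "Night"],
    "Weather": ["Clear", "Fog"],
    "Fix": ["Failure", "Success"]
  }
}
"""

def check_cpd(cfg):
    """Return a list of problems found in one CPD config entry."""
    problems = []
    values = cfg["values"]
    # Expected column count; math.prod of an empty list is 1 (no evidence)
    n_columns = math.prod(cfg.get("evidence_card", []))
    if len(values) != cfg["variable_card"]:
        problems.append("row count does not match variable_card")
    if any(len(row) != n_columns for row in values):
        problems.append("column count does not match evidence_card")
    else:
        for col in range(n_columns):
            if not math.isclose(sum(row[col] for row in values), 1.0):
                problems.append(f"column {col} does not sum to 1")
    return problems

print(check_cpd(json.loads(config_text)))
```

An empty list means no problems were found in the entry.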

## Using the Killweb Module

All of the above setup and sampling of outcomes is automated by the `Killweb` module in MIMIK. All the user needs to provide are the json configs for the Killweb and the associated Bayesian network tasks. For this tutorial, the correct configurations are already set up in the `configs` subdirectory. All we have to do is specify this working directory (`.`) and the main config file `configs/bn_killchain.json` ("killchain" because this tutorial only has a single path). The `task_arguments` for this Fix task node must include the absolute path to the BN config file. We must also specify the name of the `outcome` node and the success `condition` so the `BN.get_sample` method can be called correctly during the Monte Carlo simulation. The remaining entries under `task_arguments` are the evidence used for the inference and sampling.

```json
"task_arguments": {
    "BN_config": "/absolute/path/to/mimik/ship_wake_fix_task/configs/fix_simple.json",
    "outcome": "Fix",
    "condition": "Success",
    "Time": "Day",
    "Weather": "Clear"
}
```
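
The split between the named settings and the evidence can be pictured as follows. This is only a sketch of the idea in plain Python; how `Killweb` actually parses `task_arguments` is not shown in this tutorial, and `RESERVED_KEYS` is a hypothetical name introduced for illustration.

```python
task_arguments = {
    "BN_config": "/absolute/path/to/mimik/ship_wake_fix_task/configs/fix_simple.json",
    "outcome": "Fix",
    "condition": "Success",
    "Time": "Day",
    "Weather": "Clear",
}

# Hypothetical: keys that configure the task rather than describe evidence
RESERVED_KEYS = ("BN_config", "outcome", "condition")

settings = {key: task_arguments[key] for key in RESERVED_KEYS}

# Everything else is treated as evidence for inference and sampling
evidence = {
    key: value for key, value in task_arguments.items()
    if key not in RESERVED_KEYS
}

print(settings["outcome"], settings["condition"])
print(evidence)
```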

The killweb module is called by specifying the working directory of this example case and the path from there to the killchain config file.

In [None]:
killweb = Killweb(
    working_dir=".",
    config_file="configs/bn_killchain.json",
)

The configs are read and confirmed to be in valid json format. Next, we can print all the paths through the killweb (there is only one path in this tutorial: "Radar_1, Sensor_1, Track Algorithm_1, Equation_1, Missle_1, Personnel_1"), and then run the Monte Carlo simulation with 1,000 iterations.

In [None]:
killweb.print_all_paths_in_killweb()

In [None]:
%%capture --no-stderr
killweb.monte_carlo_on_paths(1000)

Now that the Monte Carlo is complete, we can print the probabilities of success for the path in the killchain:

In [None]:
killweb.print_probabilities_of_paths(1)

In cases where there is more than one path in the Killweb, we can specify the path to test and get its probability of success as the proportion of iterations in which it successfully completes:

In [None]:
path_to_test = ["Radar_1", "Sensor_1", "Track Algorithm_1", "Equation_1", "Missle_1", "Personnel_1"]
killweb.print_proportion_complete(path_to_test)
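
The idea of a proportion of completed runs can be sketched independently of mimik. The example below assumes that a path completes only when every node on it succeeds in a given iteration, and it uses made-up per-node success probabilities for a shortened path; neither the probabilities nor `run_path_once` come from the mimik configs or API.

```python
import random

rng = random.Random(0)  # fixed seed for repeatability

# Hypothetical per-node success probabilities (not taken from the mimik configs)
node_probabilities = {
    "Radar_1": 0.95,
    "Sensor_1": 0.99,
    "Missle_1": 0.90,
}
path = ["Radar_1", "Sensor_1", "Missle_1"]

def run_path_once(path):
    """Return True if every node on the path succeeds in one trial."""
    return all(rng.random() < node_probabilities[node] for node in path)

iterations = 1000
completed = sum(run_path_once(path) for _ in range(iterations))
proportion = completed / iterations
print(proportion)
```

Under these assumptions the proportion should be close to the product of the node probabilities, about 0.85, with some run-to-run scatter at 1,000 iterations.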

For a specific path, we can also plot the distribution of outcomes for each node, either as a proportion of binary successes using the `killweb.plot_monte_carlo_distribution` method or as a histogram of the actual probabilities of success using the `killweb.plot_probability_distribution` method. Note that for `killweb.plot_probability_distribution`, because all the probabilities are high, it will look like outcomes are either 1.0 or 0.