# Examples of Flat Provenance Types and Feature Vectors

In [1]:
from itertools import chain
from typing import Dict

In [2]:
# requires the Python packages: requests, prov
import requests
from prov.model import ProvDocument

The code to produce flat provenance types is provided in [flatprovenancetypes.py](flatprovenancetypes.py). From it we import the function `calculate_flat_provenance_types`, together with the helpers `print_flat_type` and `count_fp_types`, for use here.

In [3]:
from flatprovenancetypes import calculate_flat_provenance_types, print_flat_type, count_fp_types

## Utility functions

In [4]:
def download_prov_json_document(url: str) -> ProvDocument:
    # try to download the provided url
    r = requests.get(url)
    r.raise_for_status()

    # no exception so far, so we have successfully downloaded it
    prov_doc = ProvDocument.deserialize(content=r.text)
    return prov_doc

## Producing flat provenance types

In this example, we use a public PROV document at https://openprovenance.org/store/documents/282.

In [5]:
# note the .json extension at the end of the URL.
prov_doc = download_prov_json_document("https://openprovenance.org/store/documents/282.json")

In the command below, we produce the flat provenance types for every node in the input `ProvDocument` object (a node in a provenance graph is either an entity, an agent, or an activity). We produce types up to level 2 and ignore all application-specific types.

In [6]:
fptypes = calculate_flat_provenance_types(
    prov_doc,  # the ProvDocument object to produce types from
    to_level=2,  # calculate types up to this given level (2)
    including_primitives_types=False,  # only consider PROV generic types, ignoring application-specific types
)

The returned `fptypes` structure contains the 0-types of the nodes in `fptypes[0]`, the 1-types in `fptypes[1]`, and so on, up to the level specified in the call above.

In [7]:
fptypes.keys()

dict_keys([0, 1, 2])

At each level, we have a map from each node to its type at that level. For example, the level-1 map is keyed by node identifiers, each mapped to that node's 1-type.

In [8]:
for node_identifier, fptype in fptypes[1].items():
    print(f"{node_identifier}: {print_flat_type(fptype)}")

instructions/InfrastructureDamage529-1761.2: [gen|spe]→[act|ent]
transporter272.2: [gen]→[act]
uav/target/9.2: [att|der|gen]→[act|agt|ent]
instructions/InfrastructureDamage529-1762.2: [gen|spe]→[act|ent]
medic273.2: [gen]→[act]
instructions/InfrastructureDamage529-1761: [der]→[ent]
instructions/InfrastructureDamage529-1762: [der]→[ent]
cs/target/9.0: [att|der|spe]→[agt|ent]
activity/AcceptInstruction1761: [usd]→[ent]
cs/target/9.1: [att|der|spe]→[agt|ent]
instructions/InfrastructureDamage529-1762.1: [spe]→[ent]
instructions/InfrastructureDamage529-1761.1: [spe]→[ent]
InfrastructureDamage529: [der]→[ent]
confirmed_plans/166: [mem]→[ent]
activity/uav_verification/1411560570.812: [usd|waw]→[agt|ent]
activity/AcceptInstruction1762: [usd]→[ent]
cs/report/43: [att]→[agt]
cs/report/64: [att]→[agt]
cs/report/2: [att]→[agt]
cs/report/16: [att]→[agt]
cs/report/33: [att]→[agt]


The `print_flat_type` function used above is a utility that prints flat provenance types in an easy-to-read representation, similar to how they are presented in [our paper](https://arxiv.org/abs/2010.10343).
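Since each level is a map from node identifier to type, it can be useful to invert it and see which nodes share the same type. The sketch below does this on a small hand-copied subset of the level-1 output above; `level1_types` and `group_nodes_by_type` are hypothetical names introduced for illustration, and the types are represented by their printed strings rather than the objects returned by `calculate_flat_provenance_types`.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical stand-in for fptypes[1]: node identifier -> printed form of its 1-type.
# The entries are copied from the level-1 output above.
level1_types: Dict[str, str] = {
    "transporter272.2": "[gen]→[act]",
    "medic273.2": "[gen]→[act]",
    "confirmed_plans/166": "[mem]→[ent]",
    "cs/report/43": "[att]→[agt]",
    "cs/report/64": "[att]→[agt]",
}


def group_nodes_by_type(node_types: Dict[str, str]) -> Dict[str, List[str]]:
    """Invert a node -> type map into a type -> list of nodes map."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for node_id, type_str in node_types.items():
        groups[type_str].append(node_id)
    return dict(groups)


nodes_by_type = group_nodes_by_type(level1_types)
for type_str, node_ids in nodes_by_type.items():
    print(f"{type_str}: {node_ids}")
```

The length of each list is the number of occurrences of that type, which is the quantity the feature vectors in the next section are built from.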

## Generating feature vectors

Given a provenance document, we calculate the types and then count the occurrences of each type we see. The numbers of occurrences of all flat provenance types in that document are then used as its feature vector.

Using the provenance graph above and the `fptypes` structure we already calculated from it, and assuming that we want to count only 0-types, we can generate the feature vector for level 0 as follows.

In [9]:
count_fp_types(fptypes[0].values())

{'[act]': 3, '[ent]': 19, '[agt]': 5}

Note that we only use the `.values()` of the map (a Python `dict` in this case) because we care only about the types of the nodes, not their identifiers (which are the "keys" of the map).

The result we see above is the sparse representation of the vector `(3, 19, 5)` for the features `[act]`, `[ent]`, and `[agt]`, respectively. Since we do not know beforehand how many different features a provenance document will yield, we cannot fix the dimension of the feature vector in advance, hence the need for the sparse representation.
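When several documents need to be compared, their sparse vectors can be aligned to a common feature order once all of them are known. The following is a minimal sketch of that idea, not part of `flatprovenancetypes.py`: `to_dense` is a hypothetical helper, `doc_a` reuses the counts shown above, and `doc_b` is made-up data for a second document.

```python
from typing import Dict, List, Sequence


def to_dense(sparse: Dict[str, int], feature_order: Sequence[str]) -> List[int]:
    """Expand a sparse feature vector into a dense list; missing features count as 0."""
    return [sparse.get(feature, 0) for feature in feature_order]


doc_a = {"[act]": 3, "[ent]": 19, "[agt]": 5}  # level-0 counts from the output above
doc_b = {"[ent]": 4, "[act]": 1}  # made-up counts for a hypothetical second document

# The dimension is only known after seeing every document: take the union of all features.
feature_order = sorted(set(doc_a) | set(doc_b))

print(feature_order)                   # ['[act]', '[agt]', '[ent]']
print(to_dense(doc_a, feature_order))  # [3, 5, 19]
print(to_dense(doc_b, feature_order))  # [1, 0, 4]
```

Note that the features are sorted alphabetically here, so the order differs from the `(3, 19, 5)` ordering used in the text; any fixed order works as long as it is shared by all documents.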

The statement above counts only 0-types. To produce a feature vector that contains all types up to, say, level 2, we can merge the types of the desired levels before counting.

In [10]:
count_fp_types(
    chain.from_iterable(fpt_level.values() for fpt_level in fptypes.values())
)

{'[act]': 3,
 '[ent]': 19,
 '[agt]': 5,
 '[gen|spe]→[act|ent]': 2,
 '[gen]→[act]': 2,
 '[att|der|gen]→[act|agt|ent]': 1,
 '[der]→[ent]': 3,
 '[att|der|spe]→[agt|ent]': 2,
 '[usd]→[ent]': 2,
 '[spe]→[ent]': 2,
 '[mem]→[ent]': 1,
 '[usd|waw]→[agt|ent]': 1,
 '[att]→[agt]': 5,
 '[mem]→[gen|spe]→[act|ent]': 1,
 '[der]→[att|der|gen]→[act|agt|ent]': 1,
 '[spe]→[der]→[ent]': 2,
 '[gen|spe]→[der|usd]→[ent]': 2,
 '[der]→[att|der|spe]→[agt|ent]': 1,
 '[gen]→[usd]→[ent]': 2,
 '[usd]→[att|der|spe]→[agt|ent]': 1,
 '[der|gen]→[att|der|spe|usd|waw]→[agt|ent]': 1,
 '[usd]→[spe]→[ent]': 2,
 '[der]→[der]→[ent]': 2,
 '[der]→[att]→[agt]': 1}
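To see the merging step in isolation, the sketch below applies the same `chain.from_iterable` pattern to a small made-up structure. It uses `collections.Counter` as a stand-in for `count_fp_types`, whose implementation is not shown here, and represents types by their printed strings; the node names `n1` to `n3` and the `levels` data are hypothetical.

```python
from collections import Counter
from itertools import chain

# Made-up stand-in for fptypes: level -> {node identifier -> printed type}.
levels = {
    0: {"n1": "[ent]", "n2": "[act]", "n3": "[ent]"},
    1: {"n1": "[gen]→[act]", "n2": "[usd]→[ent]", "n3": "[gen]→[act]"},
}

# Flatten the types of all levels into one stream, then count each distinct type.
all_types = chain.from_iterable(level.values() for level in levels.values())
counts = Counter(all_types)

print(dict(counts))
# {'[ent]': 2, '[act]': 1, '[gen]→[act]': 2, '[usd]→[ent]': 1}
```

Because types from different levels have different printed forms, counts from different levels never collide in the merged vector.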

Combining the steps above, the following function produces the (sparse) feature vector of a provenance document.

In [11]:
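def sparse_feature_vector(
    prov_doc: ProvDocument,
    to_level: int = 0,
    including_primitives_types: bool = True,
) -> Dict[str, int]:
    """Count the occurrences of every flat provenance type, from level 0 up to `to_level`."""
    # calculate the types of all nodes at every level up to `to_level`
    fptypes = calculate_flat_provenance_types(prov_doc, to_level, including_primitives_types)
    # merge the types of all levels, then count the occurrences of each type
    return count_fp_types(
        chain.from_iterable(fpt_level.values() for fpt_level in fptypes.values())
    )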

In [12]:
sparse_feature_vector(prov_doc, to_level=2, including_primitives_types=True)

{'[act]': 3,
 '[ent|ao:InfrastructureDamage]': 1,
 '[ent]': 15,
 '[ent|ao:Instruction]': 2,
 '[ent|ao:Plan]': 1,
 '[agt|ao:CrowdReporter|prov:Person]': 3,
 '[agt|ao:IBCCAlgo|prov:SoftwareAgent]': 1,
 '[agt|prov:Person]': 1,
 '[gen|spe]→[act|ent|ao:Instruction]': 2,
 '[gen]→[act]': 2,
 '[att|der|gen]→[act|agt|ent|prov:Person]': 1,
 '[der]→[ent|ao:InfrastructureDamage]': 2,
 '[att|der|spe]→[agt|ent|ao:IBCCAlgo|prov:SoftwareAgent]': 2,
 '[usd]→[ent]': 2,
 '[spe]→[ent|ao:Instruction]': 2,
 '[der]→[ent]': 1,
 '[mem]→[ent]': 1,
 '[usd|waw]→[agt|ent|prov:Person]': 1,
 '[att]→[agt|ao:CrowdReporter|prov:Person]': 5,
 '[mem]→[gen|spe]→[act|ent|ao:Instruction]': 1,
 '[der]→[att|der|gen]→[act|agt|ent|prov:Person]': 1,
 '[spe]→[der]→[ent|ao:InfrastructureDamage]': 2,
 '[gen|spe]→[der|usd]→[ent|ao:InfrastructureDamage]': 2,
 '[der]→[att|der|spe]→[agt|ent|ao:CrowdReporter|ao:IBCCAlgo|prov:Person|prov:SoftwareAgent]': 1,
 '[gen]→[usd]→[ent]': 2,
 '[usd]→[att|der|spe]→[agt|ent|ao:IBCCAlgo|prov:Software