# Tutorial: An introduction to property graph transformations

In this tutorial, you will learn the key concepts related to our property graph transformations:
* How to specify transformations of property graphs with this framework (i.e., how we use Skolem functions to construct new property graphs).
* How the rules in a single transformation interact with each other (i.e., how the content of output elements can be jointly specified by several rules).
* What *conflicts* are, and how to deal with them.
* How this framework integrates with openCypher (i.e., how the rules are compiled into openCypher scripts and in which context they are executed).
* What property graph transformations are capable of doing (i.e., the kind of constructs that can be expressed).

## Part 1: Preliminaries


By default this notebook is configured to connect to a local Neo4j instance running inside a Docker container. This [notebook](./Tutorial_Connecting_Neo4j_Docker.ipynb) will guide you through the process of setting up a local Docker container and connecting to it.

In [1]:
from dtgraph import Neo4jGraph, Rule, Transformation
hostname = "localhost"
password = ""
uri = f"bolt://{hostname}:7687"
graph = Neo4jGraph(uri, database="neo4j", username="", password=password)

For this tutorial, we will use a stripped-down version of the Movies dataset from Neo4j, which can be loaded using the following command.

In [2]:
from dtgraph.scenarios.movies import Movies
Movies.load(graph)

Flushed database: Deleted 296 nodes, deleted 721 relationships, completed after 2 ms.
Load scenario: Added 171 labels, created 171 nodes, set 564 properties, created 253 relationships, completed after 1 ms.


**PICTURE OF THE INPUT INSTANCE**

## Part 2: Transformation rules

This dataset contains information about movies (**Movie**) and the persons (**Person**) related to these movies.
Such persons could have **:ACTED_IN**, **:DIRECTED** or **:PRODUCED** a movie. With this schema, information about whether people are actors, directors or producers is not found in the nodes.

**PICTURE OF THE SCHEMA**

## Node rules

Let's build a new graph that makes this information explicit. We start by introducing the new label **Actor** to tag people who have acted in at least one movie.
We will do this with the following transformation rule:

In [3]:
generate_actors = Rule('''
MATCH (n:Person)-[:ACTED_IN]->(:Movie)
GENERATE 
(x = (n) : Actor {
    name = n.name,
    born = n.born,
    source = "Movies dataset"
})
''')

A rule typically consists of three parts:
- `MATCH (n:Person)-[:ACTED_IN]->(:Movie)`, an openCypher query that retrieves the relevant information from the input graph.
  This Cypher query should bind its exported variables only to graph elements such as nodes and relationships.
- `(x = (n) : Actor { name = n.name, born = n.born, source = "Movies dataset" })`, a node constructor composed of the following elements:
  - `(n)`, the list of arguments that identifies the new element in the output graph. `x` is an optional alias for cross-referencing a constructor inside the scope of a rule.
  - A set of labels (here there is only one label, `Actor`) for the new elements.
  - A list of properties `{ name = n.name, born = n.born, source = "Movies dataset" }` for the new elements. Values from the input graph can be retrieved using access keys such as `n.born`; fixed constants (e.g., `"Movies dataset"`) can also be specified.
- `=>` or `GENERATE`, which connects the two parts above.

In [4]:
my_transform = Transformation([generate_actors])
my_transform.apply_on(graph)

Index: Added 0 index, completed after 0 ms.
Rule: Added 204 labels, created 102 nodes, set 618 properties, created 0 relationships, completed after 5 ms.


When executing a transformation, each rule reports some *metadata*, as shown above: the completion time and the number of labels, properties, nodes and relationships created by applying the rule.

It is important to notice that the output of a transformation is a new property graph that is **completely independent** from the initial one. They do not share any common element. Now that we have created and executed the transformation, we can see the current output.

**PICTURE OF THE DATABASE**

We now introduce the **Director** label to tag people who have directed at least one movie. We will do this with a new transformation rule that we add to the current transformation:

In [5]:
generate_directors = Rule('''
MATCH (n:Person)-[:DIRECTED]->(:Movie)
GENERATE 
(x = (n) : Director {
    name = n.name,
    born = n.born,
    source = "Movies dataset"
})
''')
my_transform.add(generate_directors)

Rule: Added 51 labels, created 23 nodes, set 155 properties, created 0 relationships, completed after 2 ms.


Both rules use the same argument list, `(n)`, to identify new Directors and Actors (e.g., `x = (n) : Director`).
Hence, people who have both acted in and directed some movies should have both labels.

The following query, which you can execute in the [Neo4j browser](http://localhost:7474), confirms that in this case a single node carrying both labels is created in the output:

```
MATCH (n)
WHERE n:Actor and n:Director
RETURN n
```

This query should return the following output:
```
╒══════════════════════════════════════════════════════════════════════╕
│n                                                                     │
╞══════════════════════════════════════════════════════════════════════╡
│(:Actor:Director:_dummy {born: 1967,name: "James Marshall",_id: "(4:7f│
│732a8b-14ba-4846-8477-f326f7a1b5d0:2469)"})                           │
├──────────────────────────────────────────────────────────────────────┤
│(:Actor:Director:_dummy {born: 1956,name: "Tom Hanks",_id: "(4:7f732a8│
│b-14ba-4846-8477-f326f7a1b5d0:2515)"})                                │
├──────────────────────────────────────────────────────────────────────┤
│(:Actor:Director:_dummy {born: 1930,name: "Clint Eastwood",_id: "(4:7f│
│732a8b-14ba-4846-8477-f326f7a1b5d0:2543)"})                           │
├──────────────────────────────────────────────────────────────────────┤
│(:Actor:Director:_dummy {born: 1944,name: "Danny DeVito",_id: "(4:7f73│
│2a8b-14ba-4846-8477-f326f7a1b5d0:2586)"})                             │
├──────────────────────────────────────────────────────────────────────┤
│(:Actor:Director:_dummy {born: 1942,name: "Werner Herzog",_id: "(4:7f7│
│32a8b-14ba-4846-8477-f326f7a1b5d0:2503)"})                            │
└──────────────────────────────────────────────────────────────────────┘
```

Hence we are able to define the content of an element with multiple rules, which are independent of each other.
The mechanism that makes this possible is based on Skolem functions. We explain how we implement these Skolem functions in Cypher in Part 4.
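
To make the idea concrete, here is a minimal sketch in plain Python of how a shared argument list can make two rules contribute to the same output node. It is a toy model only: it does not use the dtgraph API, and the function `apply_node_rule` and the person identifiers are hypothetical names chosen for illustration.

```python
# Toy model of Skolem-based node identity; this is NOT the dtgraph API.
from collections import defaultdict


def apply_node_rule(output, matches, label):
    """Add `label` to the output node identified by each matched input id."""
    for input_id in matches:
        key = (input_id,)       # plays the role of the argument list (n)
        output[key].add(label)  # same key means same output node
    return output


output = defaultdict(set)
acted = ["tom", "keanu"]    # hypothetical ids of persons who acted
directed = ["tom", "lana"]  # hypothetical ids of persons who directed

apply_node_rule(output, acted, "Actor")
apply_node_rule(output, directed, "Director")

for key, labels in sorted(output.items()):
    print(key, sorted(labels))
```

Because both calls build the key from the same argument, the person matched by both rules ends up as a single entry carrying both labels, while the others get one label each.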

## Edge rules

We now describe how to use edge constructors to specify relationships in the output graph.
We add a new rule to the transformation that creates a relationship of type **:COLLEAGUE** between two persons whenever both acted in the same movie, or one acted in it and the other directed it:

In [6]:
generate_colleague = Rule('''
CALL {
    MATCH (n:Person)-[:ACTED_IN]->(m:Movie)<-[:ACTED_IN]-(o:Person)
    WHERE n.name < o.name
    RETURN n, m, o
    UNION 
    MATCH (n:Person)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(o:Person)
    WHERE n.name < o.name
    RETURN n, m, o
}
WITH n, m, o
GENERATE 
(x = (n) : )-[() : COLLEAGUE {
    movie = m.title
}]->(y = (o) : )
''')
my_transform.add(generate_colleague)

Rule: Added 0 labels, created 0 nodes, set 978 properties, created 468 relationships, completed after 32 ms.
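
The first branch of the rule above uses a symmetric pattern, so without the `WHERE n.name < o.name` filter every pair of co-actors would be matched twice, once in each orientation. The following plain-Python sketch illustrates the effect of such a filter on a small list of example names; it is only an illustration of the comparison and does not query the database.

```python
from itertools import product

# Example cast list used purely for illustration.
cast = ["Keanu Reeves", "Carrie-Anne Moss", "Hugo Weaving"]

# All ordered pairs of distinct persons, as a symmetric pattern would match them.
all_pairs = [(n, o) for n, o in product(cast, repeat=2) if n != o]

# Pairs kept by a filter equivalent to `WHERE n.name < o.name`.
kept = [(n, o) for n, o in all_pairs if n < o]

print(len(all_pairs), "ordered pairs,", len(kept), "kept")
for n, o in kept:
    print(n, "->", o)
```

Each unordered pair survives exactly once, with the alphabetically smaller name on the left-hand side.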


## Part 3: Conflicts

## Part 4: Internal overview of the system

Our transformation rules are written in our own *Domain Specific Language* described previously. It takes the form of a new terminal clause `GENERATE` for openCypher queries.
Internally, **DTGraph** analyzes and translates these rules into executable, efficient and portable openCypher scripts.
It then runs these scripts under an execution environment containing indexes and other metadata. 
The openCypher script corresponding to a rule can be probed and pretty-printed with the following code:

In [7]:
generate_actors._compile()
print(generate_actors._compiled)

MATCH (n:Person)-[:ACTED_IN]->(:Movie)
MERGE (x:_dummy {
    _id: "(" + elementID(n) + ")" 
})
ON CREATE
    SET x:Actor,
        x.name = n.name,
        x.born = n.born,
        x.source = "Movies dataset"
ON MATCH
    SET x:Actor,
        x.name = 
        CASE
            WHEN x.name <> n.name THEN
                "Conflict Detected!"
            ELSE
                n.name
        END,
        x.born = 
        CASE
            WHEN x.born <> n.born THEN
                "Conflict Detected!"
            ELSE
                n.born
        END,
        x.source = 
        CASE
            WHEN x.source <> "Movies dataset" THEN
                "Conflict Detected!"
            ELSE
                "Movies dataset"
        END



Each constructor, e.g. `(x = (n) : Actor { name = n.name, born = n.born, source = "Movies dataset" })`, is translated into a `MERGE` statement.
The Skolem function is implemented using string operations: the textual representation of the identifier obtained by applying the Skolem function to the list of arguments, e.g. `"(" + elementID(n) + ")"`, is stored in an internal `_id` attribute.

For each binding (i.e., a row) produced by the input Cypher query (i.e., the left-hand side of the rule), each `MERGE` statement checks whether the output property graph already contains an element with this identifier:
- if so, the label specified in the constructor is added to the list of labels of the existing node and each property is set depending on whether a conflict is detected or not. If a conflict is detected, the special value `Conflict Detected!` is stored in lieu of the specified value.
- if no element corresponds to such identifier, a new element is created with this specific value for `_id` and its content is set according to the constructor's specification.
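
The two cases above can be mimicked in a few lines of plain Python. This is a simplified sketch rather than what DTGraph actually executes: the output graph is modelled as a dictionary keyed by `_id`, the helper `merge_node` is a hypothetical name, and the second birth year is a made-up value chosen to trigger a conflict.

```python
CONFLICT = "Conflict Detected!"


def merge_node(output, node_id, label, props):
    """Mimic MERGE ... ON CREATE / ON MATCH for a single binding."""
    node = output.get(node_id)
    if node is None:
        # ON CREATE: a new element with this identifier and content.
        output[node_id] = {"labels": {label}, "props": dict(props)}
        return
    # ON MATCH: add the label, then set each property with a conflict check.
    node["labels"].add(label)
    for key, value in props.items():
        if key in node["props"] and node["props"][key] != value:
            node["props"][key] = CONFLICT
        else:
            node["props"][key] = value


output = {}
merge_node(output, "(p1)", "Actor", {"name": "Tom Hanks", "born": 1956})
# Made-up conflicting birth year, for illustration only.
merge_node(output, "(p1)", "Director", {"name": "Tom Hanks", "born": 1957})

print(sorted(output["(p1)"]["labels"]))
print(output["(p1)"]["props"])
```

The node keeps a single entry with both labels; `name` agrees across the two bindings and is preserved, whereas `born` is replaced by the conflict marker.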

Our framework consists of declarative rules and uses Skolem functions as a mechanism for identifying new elements; as such, it maintains the following invariants:
- The transformations are well-defined; i.e., for an input property graph and a set of rules, exactly one property graph corresponds to the output of the transformation.
  - Importantly, the order in which the rules are applied does not have any impact on the output of the transformation.
- There is a one-to-one correspondence between the values of the `_id` attribute and the internal identifiers in the output property graph (which can be accessed with the built-in *elementID* function in Neo4j and the *ID* function in Memgraph, respectively).
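
The order-independence invariant can be checked on a toy example. The sketch below is a simplified model, not DTGraph code: it replays a handful of property assignments in every possible order, using the same conflict marking as the compiled script, and verifies that the final state is always the same. The function `run` and the sample values (including the deliberately conflicting one) are assumptions made for illustration, and the check is not a proof.

```python
from itertools import permutations

CONFLICT = "Conflict Detected!"


def run(assignments):
    """Apply (node_id, key, value) assignments in order, marking conflicts."""
    props = {}
    for node_id, key, value in assignments:
        slot = (node_id, key)
        if slot in props and props[slot] != value:
            props[slot] = CONFLICT
        else:
            props[slot] = value
    return props


assignments = [
    ("(p1)", "born", 1956),
    ("(p1)", "born", 1956),
    ("(p1)", "born", 1957),  # deliberately conflicting, made-up value
    ("(p2)", "born", 1930),
]

results = [run(order) for order in permutations(assignments)]
same_everywhere = all(result == results[0] for result in results)
print("identical result for every order:", same_everywhere)
```

Once a property holds the conflict marker it differs from any later value, so it stays marked regardless of which assignment arrives first.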

## Part 5: Expressivity