## Materialization 101

In this notebook we will attempt to _materialize_ an RDFS graph with deep learning.

An RDFS graph is a directed, labeled multigraph, that is, a collection of nodes connected by labeled edges in which the same pair of nodes may be joined by more than one edge, together with a specific _semantics_.

To understand the semantics of RDFS, we should first look at what the data actually looks like.

The data is usually stored in files with the `.nt` extension (N-Triples), a very simple format: one triple per line.
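
As a rough sketch of what reading such a file involves, the snippet below parses N-Triples-style lines into `(subject, predicate, object)` tuples. The helper name `parse_nt_line` and the `example.org` IRIs are hypothetical (this is not the notebook's `read_triples`), and the parser assumes that the subject and predicate contain no whitespace, so it is not a complete N-Triples parser.

```python
def parse_nt_line(line):
    """Parse one N-Triples-style line into an (s, p, o) tuple.

    Hypothetical helper for illustration only. Returns None for blank
    lines and comment lines.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    # Each statement ends with a terminating "." that is not part of the object.
    if line.endswith("."):
        line = line[:-1].rstrip()
    # Split on the first two runs of whitespace; the object keeps any spaces it has.
    s, p, o = line.split(None, 2)
    return (s, p, o)


sample = [
    "<http://example.org/professor1> <http://example.org/lectures> <http://example.org/course1> .",
    "# a comment line",
    "",
]
triples = [t for t in map(parse_nt_line, sample) if t is not None]
print(triples)
```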

There are two parts of an RDFS graph.

The first is the _TBOX_, the ontology, that is, the set of triples that encodes the hierarchy of the graph:

```
(employee, rdf:type, Class)
(faculty, rdfs:subClassOf, employee)
(professor, rdfs:subClassOf, faculty)
(teaches, rdf:type, rdf:Property)
(lectures, rdfs:subPropertyOf, teaches)
(teaches, rdfs:domain, professor)
(course, rdf:type, Class)
(teaches, rdfs:range, course)
```

In this example graph, we state that *employee* is a _Class_, *faculty* is a _subClass_ of *employee*, *professor* is a _subClass_ of *faculty*, *teaches* is a _Property_, *lectures* is a _subProperty_ of *teaches*, the _domain_ of *teaches* is *professor*, *course* is a _Class_, and the _range_ of *teaches* is *course*.

Next we have the _ABOX_, where we make assertions about individuals using the classes and properties defined in the _TBOX_.

```
(professor1, lectures, course1)
```
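
To make the TBOX/ABOX distinction concrete, here is a minimal sketch that splits a mixed list of triples into the two parts. The function name `split_tbox_abox` and the rule it uses (schema predicates, plus `rdf:type` statements whose object is `Class` or `rdf:Property`, go to the TBOX) are assumptions made for this example, not something the notebook's data files rely on.

```python
# Assumption: these predicates and types mark a triple as schema (TBOX) knowledge.
SCHEMA_PREDICATES = {"rdfs:subClassOf", "rdfs:subPropertyOf", "rdfs:domain", "rdfs:range"}
SCHEMA_TYPES = {"Class", "rdf:Property"}


def split_tbox_abox(triples):
    """Split triples into (tbox, abox) sets using the heuristic above."""
    tbox, abox = set(), set()
    for s, p, o in triples:
        is_schema = p in SCHEMA_PREDICATES or (p == "rdf:type" and o in SCHEMA_TYPES)
        (tbox if is_schema else abox).add((s, p, o))
    return tbox, abox


mixed = [
    ("employee", "rdf:type", "Class"),
    ("lectures", "rdfs:subPropertyOf", "teaches"),
    ("professor1", "lectures", "course1"),
]
tbox_part, abox_part = split_tbox_abox(mixed)
print(sorted(tbox_part))
print(sorted(abox_part))
```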

Now we can talk about materialization.

RDFS has a set of _entailment_ rules which dictate its _semantics_.

Here they are (the ones that matter, for now):

```
:A(?y, rdf:type, ?x) :- :T(?a, rdfs:domain, ?x), :A(?y, ?a, ?z) . // 1
:A(?z, rdf:type, ?x) :- :T(?a, rdfs:range, ?x), :A(?y, ?a, ?z) . // 2
:T(?x, rdfs:subPropertyOf, ?z) :- :T(?x, rdfs:subPropertyOf, ?y), :T(?y, rdfs:subPropertyOf, ?z) . // 3
:T(?x, rdfs:subClassOf, ?z) :- :T(?x, rdfs:subClassOf, ?y), :T(?y, rdfs:subClassOf, ?z) . // 4
:A(?x, ?b, ?y) :- :T(?a, rdfs:subPropertyOf, ?b), :A(?x, ?a, ?y) . // 5
:A(?z, rdf:type, ?y) :- :T(?x, rdfs:subClassOf, ?y), :A(?z, rdf:type, ?x) . // 6
```

The way to read a rule is quite straightforward.

For instance, `:T(?x, rdfs:subClassOf, ?z) :- :T(?x, rdfs:subClassOf, ?y), :T(?y, rdfs:subClassOf, ?z) .` reads as: if the triples (?x, rdfs:subClassOf, ?y)
and (?y, rdfs:subClassOf, ?z) exist in the tbox, then (?x, rdfs:subClassOf, ?z) *must* exist in the tbox as well.
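
The same rule can be sketched in a few lines of Python. This is only an illustration of one pass of rule 4 over triples stored as a set of tuples; the function name `apply_subclass_transitivity` is hypothetical.

```python
def apply_subclass_transitivity(tbox):
    """Return the triples that one pass of rule 4 adds to `tbox` (a set of tuples)."""
    pairs = [(s, o) for (s, p, o) in tbox if p == "rdfs:subClassOf"]
    derived = set()
    for x, y in pairs:
        for y2, z in pairs:
            if y == y2:  # (?x subClassOf ?y) and (?y subClassOf ?z)
                derived.add((x, "rdfs:subClassOf", z))
    # Keep only triples that were not already present.
    return derived - tbox


example_tbox = {
    ("faculty", "rdfs:subClassOf", "employee"),
    ("professor", "rdfs:subClassOf", "faculty"),
}
new_triples = apply_subclass_transitivity(example_tbox)
print(new_triples)
# {('professor', 'rdfs:subClassOf', 'employee')}
```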

To _materialize_ an RDFS graph means to add all triples which *must* exist.

For instance, materializing the given _TBOX_ adds the following triples (the `rdf:type` ones follow from standard RDFS entailment rules beyond the six listed above):

```
(faculty, rdf:type, Class)
(professor, rdf:type, Class)
(professor, rdfs:subClassOf, employee)
(lectures, rdf:type, rdf:Property)
```

And now for the _ABOX_, we get:

```
(professor1, rdf:type, professor)
(course1, rdf:type, course)
(professor1, teaches, course1)
(professor1, rdf:type, faculty)
(professor1, rdf:type, employee)
```

At this point no more *knowledge* can be inferred: applying the rules again adds no new triples.
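
Reaching this point is a fixpoint computation, and a naive version can be sketched directly. The code below is a minimal, unoptimized illustration under the assumption that triples are plain Python tuples; the function name `materialize` is hypothetical. It implements only the six rules listed above, so on the TBOX side it derives just the `rdfs:subClassOf` triple.

```python
def materialize(tbox, abox):
    """Apply rules 1-6 repeatedly until no new triple appears (a fixpoint)."""
    tbox, abox = set(tbox), set(abox)
    while True:
        new_t, new_a = set(), set()
        # Rules 3 and 4: transitivity of subPropertyOf and subClassOf.
        for rel in ("rdfs:subPropertyOf", "rdfs:subClassOf"):
            pairs = [(s, o) for (s, p, o) in tbox if p == rel]
            for x, y in pairs:
                for y2, z in pairs:
                    if y == y2:
                        new_t.add((x, rel, z))
        # Rules 1, 2, 5 and 6: combine one TBOX triple with one ABOX triple.
        for a, tp, x in tbox:
            for s, p, o in abox:
                if tp == "rdfs:domain" and p == a:
                    new_a.add((s, "rdf:type", x))   # rule 1
                elif tp == "rdfs:range" and p == a:
                    new_a.add((o, "rdf:type", x))   # rule 2
                elif tp == "rdfs:subPropertyOf" and p == a:
                    new_a.add((s, x, o))            # rule 5
                elif tp == "rdfs:subClassOf" and p == "rdf:type" and o == a:
                    new_a.add((s, "rdf:type", x))   # rule 6
        # Stop once a full pass produces nothing new.
        if new_t <= tbox and new_a <= abox:
            return tbox, abox
        tbox |= new_t
        abox |= new_a


tbox = {
    ("employee", "rdf:type", "Class"),
    ("faculty", "rdfs:subClassOf", "employee"),
    ("professor", "rdfs:subClassOf", "faculty"),
    ("teaches", "rdf:type", "rdf:Property"),
    ("lectures", "rdfs:subPropertyOf", "teaches"),
    ("teaches", "rdfs:domain", "professor"),
    ("course", "rdf:type", "Class"),
    ("teaches", "rdfs:range", "course"),
}
abox = {("professor1", "lectures", "course1")}

full_tbox, full_abox = materialize(tbox, abox)
for triple in sorted(full_abox - abox):
    print(triple)
```

Each pass is quadratic in the number of triples, which is fine for the example graph but far too slow for real benchmark data.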

## Experiments

We have 6 files under the folder `data`.

`tiny_tbox.nt/ntenc` and `tiny_abox.nt/ntenc` are a small tbox and abox whose materialization can be verified by hand. The files with the `ntenc` extension
contain the same triples as those with `nt`, except that the terms are encoded as integers, which takes far less space. This is something to keep in mind when dealing with large amounts of data.
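
The idea behind integer encoding can be sketched as a dictionary encoding: every distinct term gets an integer ID, and each triple becomes three integers. The exact layout of the `ntenc` files is not shown here, so the function `encode_triples` below is a hypothetical illustration of the technique rather than a reader or writer for that format.

```python
def encode_triples(triples):
    """Map every distinct term to an integer ID and encode the triples.

    Returns the encoded triples and the term-to-ID dictionary.
    """
    term_to_id = {}
    encoded = []
    for triple in triples:
        # setdefault assigns the next free ID the first time a term is seen.
        ids = tuple(term_to_id.setdefault(term, len(term_to_id)) for term in triple)
        encoded.append(ids)
    return encoded, term_to_id


triples = [
    ("professor1", "lectures", "course1"),
    ("professor1", "teaches", "course1"),
]
encoded, term_to_id = encode_triples(triples)
print(encoded)
# [(0, 1, 2), (0, 3, 2)]

# Decoding only needs the inverse dictionary.
id_to_term = {i: term for term, i in term_to_id.items()}
decoded = [tuple(id_to_term[i] for i in ids) for ids in encoded]
```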

`real_abox.nt` and `real_tbox.nt` are actual data used to benchmark materialization engines. Hence, we can feed the same data into other
reasoners and compare their output with our materialization in order to verify that what we are doing is correct.

In [None]:
from triple_loader import *
import pandas as pd

In [None]:
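# Load the benchmark TBOX and ABOX triples from the N-Triples files.
raw_real_tbox = read_triples("./data/real_tbox.nt")
raw_real_abox = read_triples("./data/real_abox.nt")

# One row per triple: subject (s), predicate (p), object (o).
tbox = pd.DataFrame(data=raw_real_tbox, columns=['s', 'p', 'o'])
abox = pd.DataFrame(data=raw_real_abox, columns=['s', 'p', 'o'])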
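
With triples in `s`/`p`/`o` columns, a single entailment rule can be expressed as a join. The sketch below applies rule 5 (`rdfs:subPropertyOf`) once with `DataFrame.merge` on two toy frames. The names `toy_tbox`, `toy_abox`, `sub` and `super` are made up for this example, and this is one possible way to use the frames, not necessarily the approach taken later in the notebook.

```python
import pandas as pd

# Toy frames with the same column layout as the real ones (hypothetical data).
toy_tbox = pd.DataFrame([("lectures", "rdfs:subPropertyOf", "teaches")],
                        columns=["s", "p", "o"])
toy_abox = pd.DataFrame([("professor1", "lectures", "course1")],
                        columns=["s", "p", "o"])

# Keep only the subPropertyOf statements as (sub, super) pairs.
sub_props = toy_tbox.loc[toy_tbox["p"] == "rdfs:subPropertyOf", ["s", "o"]]
sub_props = sub_props.rename(columns={"s": "sub", "o": "super"})

# Rule 5: join ABOX predicates against sub-properties, then swap in the super-property.
joined = toy_abox.merge(sub_props, left_on="p", right_on="sub")
derived = joined[["s", "super", "o"]].rename(columns={"super": "p"})
print(derived)
```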