## Materialization 101

In this notebook we will attempt to _materialize_ an RDFS graph with deep learning.

An RDFS graph is a multigraph, that is, a collection of nodes with edges and labels, out of which there could be multiple outgoing edges from a single node, that has a specific _semantics_.

In order to understand the semantics of RDFS, we ought to first have a look at how the data really looks like.

Data-format wise, it is usually stored with the file extension `.nt` in a very simple format: one triple per line.

There are two parts of an RDFS graph.

The first is the _TBOX_, the ontology, that is, the set of nodes that encodes the hierarchy of the graph:

```
(employee, rdf:type, Class)
(faculty, rdfs:subClassOf, employee)
(professor, rdfs:subClassOf, faculty)
(teaches, rdf:type, rdf:Property)
(lectures, rdfs:subPropertyOf, teaches)
(teaches, rdfs:domain, professor)
(course, rdf:type, Class)
(teaches, rdfs:range, course)
```

In this exemplary graph, we define that *employee* is a _Class_, *faculty* is a _type_ of *employee*, *professor* is a _subClass_ of *faculty*, *teaches* is a _type_ of _property_, *lectures* is a _subProperty_ of *teaches*, and that *teaches* is in the _domain_ of *professor*, alongside with *course*, a _Class_, is in the _range_ of *teaches*.

Next up we have the _ABOX_, which is where we will make assertions about individuals with the rules we've defined on the _TBOX_.

```
(professor1, lectures, course1)
```

Now we can talk about materialization.

RDFS has a set of _entailment_ rules which dictate its _semantics_.

Here are they(the ones that matter, for now):

```
:A(?y, rdf:type, ?x) :- :T(?a, rdfs:domain, ?x), :A(?y, ?a, ?z) . // 1
:A(?z, rdf:type, ?x) :- :T(?a, rdfs:range, ?x), :A(?y, ?a, ?z) . // 2
:T(?x, rdfs:subPropertyOf, ?z) :- :T(?x, rdfs:subPropertyOf, ?y), :T(?y, rdfs:subPropertyOf, ?z) . // 3
:T(?x, rdfs:subClassOf, ?z) :- :T(?x, rdfs:subClassOf, ?y), :T(?y, rdfs:subClassOf, ?z) . // 4
:A(?x, ?b, ?y) :- :T(?a, rdfs:subPropertyOf, ?b), :A(?x, ?a, ?y) . // 5
:A(?z, rdf:type, ?y) :- :T(?x, rdfs:subClassOf, ?y), :A(?z, rdf:type, ?x) . // 6
```

The way to read a rule is quite straightforward.

For instance, `:T(?x, rdfs:subClassOf, ?z) :- :T(?x, rdfs:subClassOf, ?y), :T(?y, rdfs:subClassOf, ?z) .` is spelled as: If the tbox triples (?x, rdfs:subClassOf, ?y)
and (?y, rdfs:subClassOf, ?z) exist in the tbox, then (?x, subClassOf, ?z) *must* exist in the tbox as well.

To _materialize_ an RDFS graph, means adding all triples which *must* exist.

For instance, materializing the given _TBOX_ yields the following triples to be added:

```
(faculty, rdfs:type, Class)
(professor, rdf:type, Class)
(professor, rdfs:subClassOf, employee)
(lectures, rdf:type, rdf:Property)
```

And now for the _ABOX_, we get:

```
(professor1, rdf:type, professor)
(course1, rdf:type, course)
(professor1, teaches, course1)
(professor1, rdf:type, faculty)
(professor1, rdf:type, employee)
```

As it can be seen, there are is no more *knowledge* that can be inferred.

## Experiments

We have 6 files under the folder `data`.

`tiny_tbox.nt/ntenc` and `tiny_abox.nt/ntenc` are a small tbox and abox whose materialization could be verified by hand. the files with `ntenc` extension
are the same as those with `nt`, except they are encoded with integers, taking far less space. This is something to keep in mind when dealing with large amounts of data.

`real_abox.nt` and `real_tbox.nt` are actual data that are used to benchmark materialization engines, hence, we could fit the same data, after materialization, into other
 reasoners in order to verify that what we are doing is correct.

In [1]:
from triple_loader import read_encoded_triples, read_triples

import pandas as pd

In [2]:
raw_real_tbox = read_triples("./data/real_tbox.nt")
raw_real_abox = read_triples("./data/real_abox.nt")

tbox = pd.DataFrame(data=raw_real_tbox, columns=['s', 'p', 'o'])
abox = pd.DataFrame(data=raw_real_abox, columns=['s', 'p', 'o'])

In [3]:
tbox.head(10)

Unnamed: 0,s,p,o
0,<http://www.lehigh.edu/~zhp2/2004/0401/univ-be...,<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>,<http://www.w3.org/2002/07/owl#Ontology>
1,<http://www.lehigh.edu/~zhp2/2004/0401/univ-be...,<http://www.w3.org/2000/01/rdf-schema#comment>,An university ontology for benchmark tests
2,<http://www.lehigh.edu/~zhp2/2004/0401/univ-be...,<http://www.w3.org/2000/01/rdf-schema#label>,Univ-bench Ontology
3,<http://www.lehigh.edu/~zhp2/2004/0401/univ-be...,<http://www.w3.org/2002/07/owl#versionInfo>,"univ-bench-ontology-owl, ver April 1, 2004"
4,<http://www.lehigh.edu/~zhp2/2004/0401/univ-be...,<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>,<http://www.w3.org/2002/07/owl#Class>
5,<http://www.lehigh.edu/~zhp2/2004/0401/univ-be...,<http://www.w3.org/2000/01/rdf-schema#label>,administrative staff worker
6,<http://www.lehigh.edu/~zhp2/2004/0401/univ-be...,<http://www.w3.org/2000/01/rdf-schema#subClassOf>,<http://www.lehigh.edu/~zhp2/2004/0401/univ-be...
7,<http://www.lehigh.edu/~zhp2/2004/0401/univ-be...,<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>,<http://www.w3.org/2002/07/owl#Class>
8,<http://www.lehigh.edu/~zhp2/2004/0401/univ-be...,<http://www.w3.org/2000/01/rdf-schema#label>,article
9,<http://www.lehigh.edu/~zhp2/2004/0401/univ-be...,<http://www.w3.org/2000/01/rdf-schema#subClassOf>,<http://www.lehigh.edu/~zhp2/2004/0401/univ-be...


Instead of using strings we will use integers, because strings __suck__.

In [5]:
materialized_real_tbox = read_encoded_triples("./data/materialized_real_tbox.ntenc")
real_abox = read_encoded_triples("./data/real_abox.ntenc")
materialized_real_abox = read_encoded_triples("./data/materialized_real_abox.ntenc")

In [6]:
actual_abox_materialization = materialized_real_abox - real_abox

In [45]:
import random
shuffled_abox_materialization = [tuple(random.sample(list(triple), len(triple))) for triple in materialized_real_abox]

In [46]:
filtered_shuffled_abox_materialization = [triple for triple in shuffled_abox_materialization if triple not in materialized_real_abox]

In [47]:
filtered_shuffled_abox_materialization = random.sample(filtered_shuffled_abox_materialization, len(actual_abox_materialization))

In [48]:
base_abox_and_tbox = pd.DataFrame(materialized_real_tbox | real_abox, columns=['s', 'p', 'o'])
base_abox_and_tbox.sample(frac=1)
base_abox_and_tbox['real'] = 1
#base_abox_and_tbox['weight'] = 1
materialization = pd.DataFrame(actual_abox_materialization, columns=['s', 'p', 'o'])
materialization.sample(frac=1)
materialization['real'] = 1
#materialization['weight'] = 4
fake_data = pd.DataFrame(actual_abox_materialization, columns=['s', 'p', 'o']).sample(frac=1)
fake_data.sample(frac=1)
fake_data['real'] = 0
#fake_data['weight'] = 4

In [49]:
import numpy as np

def split_df(df):
    if len(df) % 2 != 0:
        df = df.iloc[:-1, :]
    df1, df2 =  np.array_split(df, 2)
    return df1, df2

train_mat, test_mat = split_df(materialization)
train_fake, test_fake = split_df(fake_data)

In [50]:
train = pd.concat([base_abox_and_tbox, train_mat, train_fake])
train_X = train.loc[:, train.columns != 'real']
train_Y = train['real']
test = pd.concat([test_mat, test_fake])
test_X = test.loc[:, train.columns != 'real']
test_Y = test['real']

In [51]:
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader
class Data(Dataset):
  def __init__(self, X_train, y_train):
    self.X = torch.from_numpy(X_train).type(torch.FloatTensor)
    self.y = torch.from_numpy(y_train).type(torch.FloatTensor)
    self.len = self.X.shape[0]

  def __getitem__(self, index):
    return self.X[index], self.y[index]
  def __len__(self):
    return self.len

In [52]:
traindata = Data(train_X.to_numpy(), train_Y.to_numpy())

In [53]:
batch_size = 1000
trainloader = DataLoader(traindata, batch_size=batch_size)
def binary_acc(y_pred, y_test):
    y_pred_tag = torch.round(torch.sigmoid(y_pred))

    correct_results_sum = (y_pred_tag == y_test).sum().float()
    acc = correct_results_sum/y_test.shape[0]
    acc = torch.round(acc * 100)

    return acc

In [63]:
import torch.nn as nn
# number of features (len of X cols)
input_dim = 3
# number of hidden layers
hidden_layers = 50
# number of classes (unique of y)
output_dim = 1
class Network(nn.Module):
  def __init__(self):
    super(Network, self).__init__()
    self.linear1 = nn.Linear(input_dim, hidden_layers)
    self.linear2 = nn.Linear(hidden_layers, output_dim)

    self.relu = nn.ReLU()
    self.dropout = nn.Dropout(p=0.1)
    self.batchnorm1 = nn.BatchNorm1d(64)
    self.batchnorm2 = nn.BatchNorm1d(64)
  def forward(self, x):
    x = self.linear1(x)
    x = self.relu(x)
    x = torch.sigmoid(self.linear2(x))
    return x

In [65]:
clf = Network()

In [66]:
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(clf.parameters(), lr=0.1)

In [67]:
epochs = 20
running_loss = 0.0
clf.train()
for epoch in range(epochs):
  for i, data in enumerate(trainloader, 0):
    inputs, labels = data
    # set optimizer to zero grad to remove previous epoch gradients
    optimizer.zero_grad()
    # forward propagation
    outputs = clf(inputs)
    #print(inputs)
    #print(outputs)
    loss = criterion(outputs, labels.unsqueeze(-1))
    acc = binary_acc(outputs, labels.unsqueeze(-1))
    # backward propagation
    loss.backward()
    # optimize
    optimizer.step()
    running_loss += loss.item()
  # display statistics
    print(f'[{epoch + 1}, {i + 1:5d}] loss: {running_loss / 2000:.5f} acc: {acc}')
  running_loss = 0.0

[1,     1] loss: 0.04831 acc: 0.0
[1,     2] loss: 0.04831 acc: 100.0
[1,     3] loss: 0.04831 acc: 100.0
[1,     4] loss: 0.04831 acc: 100.0
[1,     5] loss: 0.04831 acc: 100.0
[1,     6] loss: 0.04831 acc: 100.0
[1,     7] loss: 0.04831 acc: 100.0
[1,     8] loss: 0.04831 acc: 100.0
[1,     9] loss: 0.04831 acc: 100.0
[1,    10] loss: 0.04831 acc: 100.0
[1,    11] loss: 0.04831 acc: 100.0
[1,    12] loss: 0.04831 acc: 100.0
[1,    13] loss: 0.04831 acc: 100.0
[1,    14] loss: 0.04831 acc: 100.0
[1,    15] loss: 0.04831 acc: 100.0
[1,    16] loss: 0.04831 acc: 100.0
[1,    17] loss: 0.04831 acc: 100.0
[1,    18] loss: 0.04831 acc: 100.0
[1,    19] loss: 0.04831 acc: 100.0
[1,    20] loss: 0.04831 acc: 100.0
[1,    21] loss: 0.04831 acc: 100.0
[1,    22] loss: 0.04831 acc: 100.0
[1,    23] loss: 0.04831 acc: 100.0
[1,    24] loss: 0.04831 acc: 100.0
[1,    25] loss: 0.04831 acc: 100.0
[1,    26] loss: 0.04831 acc: 100.0
[1,    27] loss: 0.04831 acc: 100.0
[1,    28] loss: 0.04831 acc: 

tensor([1.], grad_fn=<SigmoidBackward0>)