#### The purpose of this notebook is to compare the performance of D-REPR with other methods such as KR2RML and R2RML

In [45]:
import re, numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm_notebook as tqdm

%matplotlib inline
plt.rcParams["figure.figsize"] = (10.0, 8.0) # set default size of plots
plt.rcParams["image.interpolation"] = "nearest"
plt.rcParams["image.cmap"] = "gray"

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [11]:
%reload_ext autoreload

In [61]:
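def read_exec_time(log_file: str, tag_str: str='>>> [DREPR]', print_exec_time: bool=True):
    """Read the execution time (in ms) from the first line of `log_file` that starts with `tag_str`"""
    with open(log_file, "r") as f:
        for line in f:
            if line.startswith(tag_str):
                # a number (integer or decimal) followed by an optional space and "ms"
                m = re.search(r"((?:\d+\.)?\d+) ?ms", line)
                if m is None:
                    # tagged line without a timing; keep scanning
                    continue
                exec_time = m.group(1)
                if print_exec_time:
                    print(line.strip(), "-- extract exec_time:", exec_time)
                return float(exec_time)
    raise Exception(f"Could not find any output message starting with {tag_str!r}")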
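
As a quick sanity check of the parsing step, the sketch below applies the same regular expression to the two log-line shapes that appear in this notebook: an integer millisecond count from Karma and a fractional one from the Rust binary. The helper name `parse_ms` is hypothetical and exists only for this illustration.

```python
import re

# Same pattern as in read_exec_time: an optional "<digits>." prefix,
# then digits, followed by an optional space and the "ms" suffix.
MS_PATTERN = re.compile(r"((?:\d+\.)?\d+) ?ms")

def parse_ms(line: str) -> float:
    """Hypothetical helper: return the millisecond value found in a log line."""
    m = MS_PATTERN.search(line)
    if m is None:
        raise ValueError(f"no '<number>ms' token in line: {line!r}")
    return float(m.group(1))

# Log lines copied from the outputs recorded in this notebook.
karma_line = ">>> [DREPR] Finish converting RDF after 5981ms"
drepr_line = ">>> [DREPR] runtime: 146.171066ms"

print(parse_ms(karma_line))
print(parse_ms(drepr_line))
```

Raising on a missing match, rather than returning `None`, makes a malformed log line fail loudly instead of silently skewing the averages.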

#### KR2RML

To set up KR2RML, first download Web-Karma-2.2, then modify the file `karma-offline/src/main/java/edu/isi/karma/rdf/OfflineRdfGenerator.java`: add the following code at line 184 so that the runtime is printed to stdout: `System.out.println(">>> [DREPR] Finish converting RDF after " + String.valueOf(System.currentTimeMillis() - l) + "ms");`

Then run `mvn install -Dmaven.test.skip=true` in the root directory to build the project and install its dependencies before converting data to RDF.

In [51]:
%cd /workspace/tools-evaluation/Web-Karma-2.2/karma-offline

DATA_FILE = "/workspace/drepr/drepr/rdrepr/data/insurance.csv"
MODEL_FILE = "/workspace/drepr/drepr/rdrepr/data/insurance.level-0.model.ttl"
OUTPUT_FILE = "/tmp/kr2rml_output.ttl"

karma_exec_times = []

for i in tqdm(range(3)):
    !mvn exec:java -Dexec.mainClass="edu.isi.karma.rdf.OfflineRdfGenerator" -Dexec.args=" \
        --sourcetype CSV \
        --filepath \"{DATA_FILE}\" \
        --modelfilepath \"{MODEL_FILE}\" \
        --sourcename test \
        --outputfile {OUTPUT_FILE}" -Dexec.classpathScope=compile > /tmp/karma_speed_comparison.log
    
    karma_exec_times.append(read_exec_time("/tmp/karma_speed_comparison.log"))
    !rm /tmp/karma_speed_comparison.log
        
print(f"run 3 times, average: {np.mean(karma_exec_times)}ms")

/workspace/Web-Karma-2.2/karma-offline



init: Bootstrapping class not in Py.BOOTSTRAP_TYPES[class=class org.python.core.PyStringMap]
>>> [DREPR] Finish converting RDF after 5981ms
 -- extract exec_time: 5981
init: Bootstrapping class not in Py.BOOTSTRAP_TYPES[class=class org.python.core.PyStringMap]
>>> [DREPR] Finish converting RDF after 6486ms
 -- extract exec_time: 6486
init: Bootstrapping class not in Py.BOOTSTRAP_TYPES[class=class org.python.core.PyStringMap]
>>> [DREPR] Finish converting RDF after 5922ms
 -- extract exec_time: 5922

run 3 times, average: 6129.666666666667ms


<hr />

Report the input and output sizes along with the throughput

In [29]:
with open(DATA_FILE, "r") as f:
    n_records = sum(1 for _ in f) - 1
    print("#records:", n_records, f"({round(n_records * 1000 / np.mean(karma_exec_times), 2)} records/s)")
with open(OUTPUT_FILE, "r") as f:
    n_triples = sum(1 for line in f if line.strip().endswith("."))
    print("#triples:", n_triples, f"({round(n_triples * 1000 / np.mean(karma_exec_times), 2)} triples/s)")

#records: 36634 (6147.68 records/s)
#triples: 256438 (43033.73 triples/s)


#### MorphRDB

This section assumes that you have followed the Morph-RDB [installation guide](https://github.com/oeg-upm/morph-rdb/wiki/Installation) and the [usage instructions for CSV files](https://github.com/oeg-upm/morph-rdb/wiki/Usage#csv-files). We then create R2RML mappings and invoke Morph-RDB to map the data to RDF.

In [1]:
%cd /workspace/tools-evaluation/morph-rdb/morph-examples

!java -cp .:morph-rdb-dist-3.9.17.jar:dependency/\* es.upm.fi.dia.oeg.morph.r2rml.rdb.engine.MorphCSVRunner /workspace/drepr/drepr/rdrepr/data insurance.level-0.morph.properties

/workspace/tools-evaluation/morph-rdb/morph-examples
[main] INFO es.upm.fi.dia.oeg.morph.r2rml.rdb.engine.MorphCSVProperties - reading configuration file : /workspace/drepr/drepr/rdrepr/data/insurance.level-0.morph.properties
[main] ERROR es.upm.fi.dia.oeg.morph.r2rml.rdb.engine.MorphCSVProperties - Configuration file not found: /workspace/drepr/drepr/rdrepr/data/insurance.level-0.morph.properties
java.io.FileNotFoundException: /workspace/drepr/drepr/rdrepr/data/insurance.level-0.morph.properties (No such file or directory)
	at java.io.FileInputStream.open0(Native Method)
	at java.io.FileInputStream.open(FileInputStream.java:195)
	at java.io.FileInputStream.<init>(FileInputStream.java:138)
	at java.io.FileInputStream.<init>(FileInputStream.java:93)
	at es.upm.fi.dia.oeg.morph.base.MorphProperties.readConfigurationFile(MorphProperties.scala:91)
	at es.upm.fi.dia.oeg.morph.r2rml.rdb.engine.MorphRDBProperties.readConfigurationFile(MorphRDBProperties.scala:18)
	at es.upm.fi.dia.oeg.morph.r
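
The run above stops because the properties file does not exist at the path passed to the runner. A minimal sketch of a pre-flight check that lists missing input files before launching a tool follows; `missing_inputs` is a hypothetical helper and not part of Morph-RDB.

```python
from pathlib import Path
from typing import Iterable, List

def missing_inputs(paths: Iterable[str]) -> List[str]:
    """Hypothetical helper: return the subset of `paths` that are not existing files."""
    return [p for p in paths if not Path(p).is_file()]

# Paths taken from the cells above; adjust them to your own layout.
required = [
    "/workspace/drepr/drepr/rdrepr/data/insurance.csv",
    "/workspace/drepr/drepr/rdrepr/data/insurance.level-0.morph.properties",
]
for path in missing_inputs(required):
    print("missing input file:", path)
```

Running this check before the `java` invocation surfaces a missing mapping or properties file without having to read through a JVM stack trace.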

#### D-REPR

In [62]:
%cd /workspace/drepr/drepr/rdrepr

DREPR_EXEC_LOG = "/tmp/drepr_exec_log.log"

!cargo run --release > {DREPR_EXEC_LOG}
drepr_exec_times = read_exec_time(DREPR_EXEC_LOG)
!rm {DREPR_EXEC_LOG}

/workspace/drepr/drepr/rdrepr
[0m[0m[1m[32m    Finished[0m release [optimized] target(s) in 0.18s
[0m[0m[1m[32m     Running[0m `target/release/drepr`
>>> [DREPR] runtime: 146.171066ms -- extract exec_time: 146.171066


In [63]:
with open("/tmp/drepr_output.ttl", "r") as f:
    n_triples = sum(1 for line in f if line.strip().endswith("."))
    print("#triples:", n_triples, f"({round(n_triples * 1000 / np.mean(drepr_exec_times), 2)} triples/s)")

#triples: 256438 (1754369.09 triples/s)
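
To put the two measurements side by side, the sketch below recomputes the throughput and the speedup from the numbers printed in this notebook. It is only indicative: the D-REPR figure comes from a single run while the KR2RML figure is the mean of the three timings shown above, so the KR2RML throughput may differ from the value printed by the earlier cell if that cell was executed against a different set of runs. The function name `triples_per_second` is introduced here for illustration.

```python
from statistics import mean

def triples_per_second(n_triples: int, exec_time_ms: float) -> float:
    """Convert a triple count and a runtime in milliseconds to triples per second."""
    return n_triples * 1000 / exec_time_ms

# Values copied from the outputs above.
n_triples = 256438
karma_ms = mean([5981, 6486, 5922])   # three KR2RML runs
drepr_ms = 146.171066                 # one D-REPR run

print(f"KR2RML: {triples_per_second(n_triples, karma_ms):.2f} triples/s")
print(f"D-REPR: {triples_per_second(n_triples, drepr_ms):.2f} triples/s")
print(f"speedup: {karma_ms / drepr_ms:.1f}x")
```

For a fairer comparison, the D-REPR binary could be timed over the same number of runs as KR2RML and averaged in the same way.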
