MetaLink is an identity meta-dataset that describes over 550M `owl:sameAs` statements, their error degrees, and their transitive closure. MetaLink can be used to look up the trustworthiness of a specific link, or to look up the links of a specific node. The error degree is a value between 0.0 and 1.0. Identity links with a high error degree (e.g. `>0.99`) have a high probability of being erroneous, so it is recommended to discard such links in your Linked Data application.
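The recommendation above can be sketched as a small filter. This is a minimal illustration, not part of MetaLink itself: the `(subject, object, error)` tuple layout, the `discard_erroneous` name, and the example IRIs are assumptions made for the sketch. Only the `0.99` cut-off comes from the text.

```python
from typing import Iterable, List, Tuple

# Hypothetical record layout: (subject IRI, object IRI, error degree).
Link = Tuple[str, str, float]


def discard_erroneous(links: Iterable[Link], threshold: float = 0.99) -> List[Link]:
    """Keep only the links whose error degree does not exceed the threshold."""
    kept = []
    for subject, obj, error in links:
        # Error degrees are defined on the closed interval [0.0, 1.0].
        if not 0.0 <= error <= 1.0:
            raise ValueError(f"error degree out of range: {error}")
        if error <= threshold:
            kept.append((subject, obj, error))
    return kept


# Illustrative data only; these IRIs are placeholders.
links = [
    ("http://example.org/a", "http://example.org/b", 0.0),
    ("http://example.org/a", "http://example.org/c", 0.995),
]
trusted = discard_erroneous(links)
```

A lower threshold trades recall for precision; the right value depends on how costly a wrong identity link is in your application.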
- Install SWI-Prolog.
- Install prolog_hdt and prolog_rocksdb.
- Clone this repository: `git clone https://github.com/wouterbeek/sameas_scripts`
First, extract the `owl:sameAs` pairs from a given HDT file, and store the results in a given TSV output file. If the output file name ends in `.gz`, the output is automatically compressed using GNU zip.

    swipl -s hdt2tsv.pl -g run -t halt --input="FILE" --output="FILE"
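To inspect the extracted pairs from Python, a reader along the following lines may help. It is a sketch under assumptions: the column layout (two tab-separated terms per line) is not stated above, and `read_pairs` is a hypothetical helper, not part of the repository. The `.gz` check mirrors the naming rule described above.

```python
import gzip
from typing import Iterator, Tuple


def read_pairs(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (subject, object) pairs from a plain or gzip-compressed TSV file."""
    # Files ending in .gz are treated as gzip-compressed, otherwise as plain text.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            # Assumed layout: two tab-separated terms per line.
            subject, obj = line.split("\t", 1)
            yield subject, obj
```

Because the function is a generator, it streams the file and never holds all pairs in memory, which matters for outputs of this size.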
Use GNU sort to sort the `owl:sameAs` pairs that were extracted in the previous step.
Each existing identity statement in the LOD-a-lot dataset (e.g. `<ns1:x, owl:sameAs, ns2:y>`) is reified as follows:

    PREFIX meta: <https://krr.triply.cc/krr/sameas-meta/>
    meta:id/link/<IDENTITY_STATEMENT_ID>, rdf:type, meta:def/IdentityStatement .
    meta:id/link/<IDENTITY_STATEMENT_ID>, rdfs:label, "Identity Statement <IDENTITY_STATEMENT_ID>" .
    meta:id/link/<IDENTITY_STATEMENT_ID>, rdf:subject, ns1:x .
    meta:id/link/<IDENTITY_STATEMENT_ID>, rdf:predicate, owl:sameAs .
    meta:id/link/<IDENTITY_STATEMENT_ID>, rdf:object, ns2:y .
    meta:id/link/<IDENTITY_STATEMENT_ID>, meta:def/error, "0.2"^^<http://www.w3.org/2001/XMLSchema#double> .
    meta:id/link/<IDENTITY_STATEMENT_ID>, meta:def/community, meta:id/comm/<EQUIVALENCE_SET_ID>/<COMMUNITY_ID> .
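The reification pattern above can be written out as plain tuples to make the seven triples per link explicit. This is an illustrative sketch, not the code used to build MetaLink: the `reify` function and its parameters are hypothetical, and it assumes that prefixed names expand by simple concatenation with the `meta:` namespace and the standard RDF, RDFS, and OWL namespaces.

```python
META = "https://krr.triply.cc/krr/sameas-meta/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
OWL = "http://www.w3.org/2002/07/owl#"
XSD_DOUBLE = "http://www.w3.org/2001/XMLSchema#double"


def reify(statement_id, subject, obj, error, equivalence_set_id, community_id):
    """Return the seven triples that describe one owl:sameAs statement."""
    # Assumption: the statement node IRI is the namespace plus the local path.
    node = f"{META}id/link/{statement_id}"
    community = f"{META}id/comm/{equivalence_set_id}/{community_id}"
    return [
        (node, RDF + "type", META + "def/IdentityStatement"),
        (node, RDFS + "label", f'"Identity Statement {statement_id}"'),
        (node, RDF + "subject", subject),
        (node, RDF + "predicate", OWL + "sameAs"),
        (node, RDF + "object", obj),
        (node, META + "def/error", f'"{error}"^^<{XSD_DOUBLE}>'),
        (node, META + "def/community", community),
    ]


# Placeholder identifiers and IRIs, chosen only for the example.
triples = reify(1, "http://example.org/x", "http://example.org/y", 0.2, 7, 3)
```

Reading the pattern this way shows why a lookup starts from `rdf:subject` or `rdf:object`: the original link is never stored as a direct triple, only as properties of its statement node.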
A. Download the MetaLink HDT file and its Index from:
https://doi.org/10.5281/zenodo.3227976
OPTIONAL. Download the LOD-a-lot dataset and its Index from:
http://lod-a-lot.lod.labs.vu.nl/
B. Install the HDT Python library:

    pip install hdt
C. Load the HDT file:

    from hdt import HDTDocument
    metalink = HDTDocument("data.hdt")
D. Get the number of sameAs links with an error degree of 0.0 (i.e. links that have a very high probability of being correct):

    # An empty string acts as a wildcard for that position of the triple pattern.
    (triples, cardinality) = metalink.search_triples('', '', '"0.0"^^<http://www.w3.org/2001/XMLSchema#double>')
    print("There are", "{:,}".format(cardinality), "sameAs links with an error degree of 0.0")
E. Retrieve the metadata of two arbitrary sameAs links with an error degree of 0.0:

    (all_triples_with_error_0, cardinality_triples_with_error_0) = metalink.search_triples('', '', '"0.0"^^<http://www.w3.org/2001/XMLSchema#double>')
    print("There are", "{:,}".format(cardinality_triples_with_error_0), "sameAs links with an error degree of 0.0")
    counter = 0
    for s, p, o in all_triples_with_error_0:
        counter += 1
        if counter > 2:
            break
        else:
            # s is the identity statement node; fetch every triple that describes it.
            (meta_triples, cardinality_meta_triples) = metalink.search_triples(s, '', '')
            for s1, p1, o1 in meta_triples:
                print(s1, p1, o1)
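Looking up the error degree of one specific link follows from the reification pattern: find the statement nodes whose `rdf:subject` is the first term, keep those whose `rdf:object` is the second, then read `meta:def/error`. The sketch below is one possible way to do this, not an official API. The `link_error_degrees` and `parse_double` helpers are hypothetical, the expansion of `meta:def/error` to a full IRI is assumed, and `document` can be any object that offers a `search_triples(s, p, o)` method returning an iterator and a cardinality, such as the `metalink` object loaded above.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
META = "https://krr.triply.cc/krr/sameas-meta/"


def parse_double(literal: str) -> float:
    """Extract the numeric value from a literal such as '"0.2"^^<...#double>'."""
    lexical = literal.split("^^", 1)[0]
    return float(lexical.strip('"'))


def link_error_degrees(document, subject: str, obj: str):
    """Return the error degrees of all reified statements linking subject to obj."""
    statements, _ = document.search_triples("", RDF + "subject", subject)
    # Materialise the candidate nodes first so that no two searches are interleaved.
    nodes = [node for node, _, _ in statements]
    errors = []
    for node in nodes:
        object_hits, _ = document.search_triples(node, RDF + "object", obj)
        if not any(True for _ in object_hits):
            continue  # this statement points at a different object
        error_triples, _ = document.search_triples(node, META + "def/error", "")
        errors.extend(parse_double(value) for _, _, value in error_triples)
    return errors
```

With the HDT file loaded as in step C, a call would look like `link_error_degrees(metalink, "http://example.org/x", "http://example.org/y")`, where both IRIs are placeholders. An empty list means the link is not described in the dataset.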