Skip to content

wouterbeek/sameas_script

Repository files navigation

sameas_script: Scripts for generating the owl:sameAs Cloud

MetaLink is an identity meta-dataset that describes over 550M `owl:sameAs` statements together with their error degree, and their transitive closure. MetaLink can be used to look up the trustworthiness of a specific link, or to look up links for a specific node. The error degree is a value between 0.0 and 1.0. Identity links with high error degree (e.g. `>0.99`) have high probability of being erroneous, hence it is recommended to discard such links in your Linked Data application.

Installation

  1. Install SWI-Prolog.
  2. Install prolog_hdt and prolog_rocksdb.
  3. Clone this repository: git clone https://github.com/wouterbeek/sameas_scripts.

Usage

hdt2tsv

First, extract the owl:sameAs pairs from a given HDT file, and store the results in a given TSV output file. If the output file name ends in .gz the output is automatically compressed using GNU zip.

swipl -s hdt2tsv.pl -g run -t halt --input="FILE" --output="FILE"

sort_tsv

Use GNU sort to sort the owl:sameAs pairs that were extracted in the previous step.

MetaLink Description

Each existing identity statement in the LOD-a-lot dataset (e.g. <ns1:x, owl:sameAs, ns2:y>) is reified as the following:

PREFIX meta: <https://krr.triply.cc/krr/sameas-meta/>

meta:id/link/<IDENTITY_STATEMENT_ID>, rdf:type, meta:def/IdentityStatement .
meta:id/link/<IDENTITY_STATEMENT_ID>, rdfs:label, "Identity Statement <IDENTITY_STATEMENT_ID>" .
meta:id/link/<IDENTITY_STATEMENT_ID>, rdf:subject, ns1:x .
meta:id/link/<IDENTITY_STATEMENT_ID>, rdf:predicate, owl:sameAs .
meta:id/link/<IDENTITY_STATEMENT_ID>, rdf:object, ns2:y .
meta:id/link/<IDENTITY_STATEMENT_ID>, meta:def/error, "0.2"^^<http://www.w3.org/2001/XMLSchema#double> .
meta:id/link/<IDENTITY_STATEMENT_ID>, meta:def/community, meta:id/comm/<EQUIVALENCE_SET_ID>/<COMMUNITY_ID> .

Read and Query MetaLink locally with Python

A. Download the MetaLink HDT file and its Index from:

https://doi.org/10.5281/zenodo.3227976

OPTIONAL. Download the LOD-a-lod dataset and its Index from:

http://lod-a-lot.lod.labs.vu.nl/

B. Install the HDT Python library

pip install hdt

C. Load the HDT file:

from hdt import HDTDocument
metalink = HDTDocument("data.hdt")

D. Get the number of sameAs links with an error degree of 0.0 (i.e. have very high probability of correctness)

(triples, cardinality) = metalink.search_triples('', '', '"0.0"^^<http://www.w3.org/2001/XMLSchema#double>')
print("There is","{:,}".format(cardinality), "sameAs links with an error degree of 0.0")

E. Retrieve the metadata of 2 random sameAs links with an error degree of 0.0

(all_triples_with_error_0, cardinality_triples_with_error_0) = metalink.search_triples('', '', '"0.0"^^<http://www.w3.org/2001/XMLSchema#double>')
print("There is","{:,}".format(cardinality_triples_with_error_0), "sameAs links with an error degree of 0.0")
counter = 0
for s,p,o in all_triples_with_error_0:
  counter+=1
    if counter > 2:
      break
    else:
      (meta_triples, cardinality_meta_triples) = metalink.search_triples(s, '', '')
      for s1,p1,o1 in meta_triples:
          print(s1,p1,o1)

About

Scripts for generating the owl:sameAs Cloud

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published