# Tracking a Term through the Collection & Thesaurus

Taking the term 'marron' (which refers to groups of people in the Americas; [Thesaurus link](https://hdl.handle.net/20.500.11840/termmaster3534). [Wikipedia link](https://nl.wikipedia.org/wiki/Marrons)) as an example, this notebook explores how a term can be tracked across both the collection and the thesaurus.

#### What information are we interested in? (*to be expanded*)

 - basic statistics on the term
 - statistics on related terms
 - shortest paths
 - placement in the hierarchies and facets (5 functional categories in the thesaurus)

---

#### Recipe:

 - read DB and thesaurus with rdflib
 - use queries to extract relevant triples and relevant parts of identifiers
 - construct table (perhaps pandas)
 - do statistics on table
 
 

In [1]:
import glob
from tqdm import tqdm

import numpy.random as rand

import rdflib
from rdflib import Graph
from rdflib import URIRef

def load_graph_from_dir(d, until=-1, file_ext="rdf", randomise=False):
    file_listing = glob.glob(f"{d}/*.{file_ext}")
    file_listing = rand.permutation(file_listing) if randomise else sorted(file_listing)
    file_listing = file_listing[:until] # there are 1570 files in /objects, loop below has 1.5 it/s so takes 15+min
        
    if len(file_listing) == 0:
        raise ValueError(f"taking {until} files from directory /{d}/ somehow not possible, listing empty!")
    
    graph = Graph()
    for path in tqdm(file_listing, 
                     desc=f"Parsing{' random' if randomise else ''} files from /{d}"): 
        graph.parse(path, format="xml")
    return graph

In [5]:
obj_graph = load_graph_from_dir("objects", until=10, randomise=True)
thesaurus = load_graph_from_dir("thesaurus", randomise=False)

Parsing random files from /objects: 100%|██████████| 10/10 [00:12<00:00,  1.21s/it]
Parsing files from /thesaurus: 100%|██████████| 43/43 [00:24<00:00,  1.72it/s]


In [6]:
marron = rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/termmaster3534')

marron_obj_graph = list(obj_graph.triples((marron, None, None))) + list(obj_graph.triples((None, None, marron)))
marron_thesaurus = list(thesaurus.triples((marron, None, None))) + list(thesaurus.triples((None, None, marron)))


In [7]:
print(f"'marron' occurs in\t{len(marron_obj_graph)}\trelations in the collection")
print(f"'marron' occurs in\t{len(marron_thesaurus)}\trelations in the thesaurus")

'marron' occurs in	11	relations in the collection
'marron' occurs in	15	relations in the thesaurus


In [8]:
marron_thesaurus

[(rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/termmaster3534'),
  rdflib.term.URIRef('http://www.w3.org/2004/02/skos/core#broader'),
  rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/termmaster21108')),
 (rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/termmaster3534'),
  rdflib.term.URIRef('http://www.w3.org/2004/02/skos/core#narrower'),
  rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/termmaster3535')),
 (rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/termmaster3534'),
  rdflib.term.URIRef('http://www.w3.org/2004/02/skos/core#altLabel'),
  rdflib.term.Literal('Bosnegers')),
 (rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/termmaster3534'),
  rdflib.term.URIRef('http://www.w3.org/2004/02/skos/core#notation'),
  rdflib.term.Literal('OVM.AAB.AAA.AAD.AAA.AAD.AAI.AAH.AAA')),
 (rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/termmaster3534'),
  rdflib.term.URIRef('http://www.w3.org/2004/02/skos/core#altLabel'),
  rdflib.term.Li

---
## Querying

In [40]:
q = """SELECT DISTINCT ?p
       WHERE {
          ?a ?p ?b .
       }"""

all_predicates = list(obj_graph.query(q))

type_preds = list(filter(lambda p: p[0].endswith("type"), all_predicates))

In [39]:
all_predicates

[(rdflib.term.URIRef('http://purl.org/dc/terms/medium')),
 (rdflib.term.URIRef('http://purl.org/dc/elements/1.1/identifier')),
 (rdflib.term.URIRef('http://www.europeana.eu/schemas/edm/object')),
 (rdflib.term.URIRef('http://purl.org/dc/elements/1.1/description')),
 (rdflib.term.URIRef('http://www.europeana.eu/schemas/edm/type')),
 (rdflib.term.URIRef('http://purl.org/dc/elements/1.1/type')),
 (rdflib.term.URIRef('http://purl.org/dc/elements/1.1/title')),
 (rdflib.term.URIRef('http://www.europeana.eu/schemas/edm/rights')),
 (rdflib.term.URIRef('http://www.europeana.eu/schemas/edm/isRelatedTo')),
 (rdflib.term.URIRef('http://purl.org/dc/elements/1.1/subject')),
 (rdflib.term.URIRef('http://purl.org/dc/terms/created')),
 (rdflib.term.URIRef('http://purl.org/dc/terms/spatial')),
 (rdflib.term.URIRef('file:///home/valentin/Desktop/SABIO/data/objects/exhibition')),
 (rdflib.term.URIRef('http://www.europeana.eu/schemas/edm/isShownAt')),
 (rdflib.term.URIRef('http://www.europeana.eu/schemas/e

In [35]:
all_predicates[0][0]

rdflib.term.URIRef('http://purl.org/dc/terms/medium')

In [17]:
list(obj_graph.predicates())

[rdflib.term.URIRef('http://purl.org/dc/terms/medium'),
 rdflib.term.URIRef('http://purl.org/dc/elements/1.1/identifier'),
 rdflib.term.URIRef('http://www.europeana.eu/schemas/edm/object'),
 rdflib.term.URIRef('http://purl.org/dc/elements/1.1/description'),
 rdflib.term.URIRef('http://www.europeana.eu/schemas/edm/type'),
 rdflib.term.URIRef('http://www.europeana.eu/schemas/edm/type'),
 rdflib.term.URIRef('http://purl.org/dc/elements/1.1/type'),
 rdflib.term.URIRef('http://purl.org/dc/elements/1.1/title'),
 rdflib.term.URIRef('http://www.europeana.eu/schemas/edm/rights'),
 rdflib.term.URIRef('http://www.europeana.eu/schemas/edm/isRelatedTo'),
 rdflib.term.URIRef('http://purl.org/dc/elements/1.1/subject'),
 rdflib.term.URIRef('http://purl.org/dc/terms/created'),
 rdflib.term.URIRef('http://purl.org/dc/elements/1.1/type'),
 rdflib.term.URIRef('http://www.europeana.eu/schemas/edm/isRelatedTo'),
 rdflib.term.URIRef('http://purl.org/dc/elements/1.1/description'),
 rdflib.term.URIRef('http://

In [20]:
{s for s, p, o in obj_graph if p == rdflib.term.URIRef('http://purl.org/dc/elements/1.1/type')}

{rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/883418'),
 rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/883047'),
 rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/587852'),
 rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/587558'),
 rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/1130864'),
 rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/506547'),
 rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/1130517'),
 rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/51447'),
 rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/1176597'),
 rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/898192'),
 rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/898407'),
 rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/506420'),
 rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/883216'),
 rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/1130931'),
 rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/85

# Legacy Code

In [None]:
granman_photo = rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/206868')

granman_triples = list(obj_graph.triples((granman_photo, None, None))) + list(obj_graph.triples((None, None, granman_photo)))