# First Exploration of the Wereldculturen RDF Dumps

What do we want to do here? *to be expanded*
 - (distributional) properties of objects & terms
   - distribution over objects, object properties 

 - explore the properties of the collection graph & thesaurus graph:
   - network structure (connectivity, etc)
   
 - explore the connections between collection & thesaurus (collection links into thesaurus)
   - intersections of entities
   - density of indexed terms in collection


### TODO

 - move the basic functions (such as loading the graph) into Python module





### Preparation

The collection data RDF dumps are too large to be uploaded to GitHub. You can get the necessary data for this notebook [here](https://collectie.wereldculturen.nl/thesaurus/#/query/89a9b00f-5f4b-4fef-bf00-32299ba16c85). Download both the collection dumps *and* the thesaurus dumps, and put the results of unzipping in folders 'objects' and 'thesaurus' respectively

This notebook uses the following packages:

In [81]:
import glob
from tqdm import tqdm

import numpy.random as rand

import rdflib
from rdflib import Graph
from rdflib import URIRef

# the predicates in the object graph and the thesaurus are from these namespaces (plus others)
from rdflib.namespace import RDF, DCTERMS, SKOS

## 1. load objects and thesaurus

In [41]:
def load_graph_from_dir(d, until=-1, file_ext="rdf", randomise=False):
    file_listing = glob.glob(f"{d}/*.{file_ext}")
    file_listing = rand.permutation(file_listing) if randomise else sorted(file_listing)
    file_listing = file_listing[:until] # there are 1570 files in /objects, loop below has 1.5 it/s so takes 15+min
    
    if len(file_listing) == 0:
        raise ValueError(f"taking {until} files from directory /{d}/ somehow not possible, listing empty!")
    
    graph = Graph()
    for path in tqdm(file_listing, 
                     desc=f"Parsing{' random' if randomise else ''} files from /{d}"): 
        graph.parse(path, format="xml")
    return graph

In [45]:
obj_graph = load_graph_from_dir("objects", until=10, randomise=False)
thesaurus = load_graph_from_dir("thesaurus", randomise=False)

Parsing files from /objects: 100%|██████████| 10/10 [00:06<00:00,  1.54it/s]
Parsing files from /thesaurus: 100%|██████████| 43/43 [00:15<00:00,  2.79it/s]


### 1.1 basic properties

In [96]:
n = lambda gen: len(set(gen))
info = [["", "object graph", "thesaurus"],
        ["len", len(obj_graph), len(thesaurus)],
        ["n_subjects", n(obj_graph.subjects()), n(thesaurus.subjects())],
        ["n_objects", n(obj_graph.objects()), n(thesaurus.objects())],
        ["n_preds", n(obj_graph.predicates()), n(thesaurus.predicates())]]
info = list(zip(*info))
print(tabulate(info[1:], headers=info[0]))

                 len    n_subjects    n_objects    n_preds
------------  ------  ------------  -----------  ---------
object graph   86842          5000        32242         20
thesaurus     198128         21248       116479          9


## 2. descriptive stats

In [118]:
headers = ["object subjs", "object preds", "object objs", "thes subjs", "thes preds", "thes objs"]
sets = [obj_graph.subjects(), obj_graph.predicates(), obj_graph.objects(), thesaurus.subjects(), thesaurus.predicates(), thesaurus.objects()]
sets = list(map(set, sets))

import sklearn.metrics

f = lambda s1, s2: len(s1&s2)
jacc = lambda s1, s2: len(s1&s2)/len(s1|s2)

intersections = [[h]+[jacc(s1, s2) if not (s1 is s2) else -1  for s2 in sets] for s1, h in zip(sets, headers)]
print(tabulate(intersections, headers=[""]+headers))

                object subjs    object preds    object objs    thes subjs    thes preds    thes objs
------------  --------------  --------------  -------------  ------------  ------------  -----------
object subjs              -1       0             0              0             0           0
object preds               0      -1             0              0             0.0357143   0
object objs                0       0            -1              0.0189348     0           0.00683086
thes subjs                 0       0             0.0189348     -1             0           0.177335
thes preds                 0       0.0357143     0              0            -1           0
thes objs                  0       0             0.00683086     0.177335      0          -1


In [131]:
marron = rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/206868')

list(obj_graph.triples((marron, None, None)))

[(rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/206868'),
  rdflib.term.URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'),
  rdflib.term.URIRef('http://purl.org/dc/terms/PhysicalResource')),
 (rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/206868'),
  rdflib.term.URIRef('http://purl.org/dc/terms/medium'),
  rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/termmaster26533')),
 (rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/206868'),
  rdflib.term.URIRef('http://www.europeana.eu/schemas/edm/isShownBy'),
  rdflib.term.Literal('http://collectie.wereldculturen.nl/lodimages/cc/imageproxy.ashx?filename=images/Images/TM//tm-60036342.jpg&cache=yes')),
 (rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/206868'),
  rdflib.term.URIRef('http://purl.org/dc/elements/1.1/creator'),
  rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/pi7863')),
 (rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/206868'),
  rdflib.term.URIRef('http://purl.o