# First Exploration of the Wereldculturen RDF Dumps

#### What do we want to do here? (*to be expanded*)
 - (distributional) properties of objects & terms
   - distribution over objects, object properties 

 - explore the properties of the collection graph & thesaurus graph:
   - network structure (connectivity, etc)
   
 - explore the connections between collection & thesaurus (collection links into thesaurus)
   - intersections of entities
   - density of indexed terms in collection
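
The third goal, measuring how the collection links into the thesaurus, could be approached roughly as follows. This is only a sketch under assumptions: `link_overlap` and the keys it returns are hypothetical names, and the toy triples are made up. It expects any iterable of (subject, predicate, object) triples, which is what iterating over an `rdflib.Graph` yields, but it uses plain tuples here so that it runs on its own.

```python
def link_overlap(collection_triples, thesaurus_triples):
    """Compare the objects of collection triples with the subjects of thesaurus triples."""
    collection_triples = list(collection_triples)
    # entities described in the thesaurus (they appear as subjects there)
    thesaurus_terms = {s for s, _, _ in thesaurus_triples}
    # collection triples whose object points into the thesaurus
    linking = [(s, p, o) for s, p, o in collection_triples if o in thesaurus_terms]
    linked_terms = {o for _, _, o in linking}
    return {
        "shared_entities": linked_terms,
        # share of collection triples that link into the thesaurus
        "density": len(linking) / len(collection_triples) if collection_triples else 0.0,
        # share of thesaurus terms that are used by the collection at all
        "thesaurus_coverage": len(linked_terms) / len(thesaurus_terms) if thesaurus_terms else 0.0,
    }


# toy data, invented for illustration only
collection = [
    ("obj:1", "dc:subject", "term:mask"),
    ("obj:1", "dc:title", "Mask"),
    ("obj:2", "dc:subject", "term:mask"),
    ("obj:2", "dcterms:spatial", "term:congo"),
]
thesaurus = [
    ("term:mask", "skos:prefLabel", "mask"),
    ("term:congo", "skos:prefLabel", "Congo"),
    ("term:drum", "skos:prefLabel", "drum"),
]
overlap = link_overlap(collection, thesaurus)
```

With the real data, the same function could be called as `link_overlap(obj_graph, thesaurus)`, assuming that thesaurus terms are referenced by URI in the object position of collection triples.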


#### TODO

 - move the basic functions (such as loading the graph) into a Python module





### Preparation

The collection RDF dumps are too large to be uploaded to GitHub. You can get the data needed for this notebook [here](https://collectie.wereldculturen.nl/thesaurus/#/query/89a9b00f-5f4b-4fef-bf00-32299ba16c85). Download both the collection dumps *and* the thesaurus dumps, unzip them, and put the results in the folders `objects` and `thesaurus`, respectively.
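
Before loading anything, it can help to verify that the two folders are in place. The following is a minimal sketch, not part of the original notebook: `count_dump_files` is a hypothetical helper, and it assumes the unzipped dumps carry the `.rdf` extension that the loading function in this notebook uses by default.

```python
from pathlib import Path


def count_dump_files(base_dir=".", folders=("objects", "thesaurus"), file_ext="rdf"):
    """Return {folder name: number of *.file_ext files}; raise if a folder is missing."""
    counts = {}
    for name in folders:
        folder = Path(base_dir) / name
        if not folder.is_dir():
            raise FileNotFoundError(f"expected folder '{folder}' with the unzipped dumps")
        # only count files with the expected extension, ignore anything else
        counts[name] = len(list(folder.glob(f"*.{file_ext}")))
    return counts
```

Calling `count_dump_files()` from the notebook's directory should then report a non-zero count for both folders.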

This notebook uses the following packages:

In [1]:
import glob
from tqdm import tqdm

from collections import Counter
import numpy.random as rand

import rdflib
from rdflib import Graph
from rdflib import URIRef

from tabulate import tabulate

## Loading objects and thesaurus

In [2]:
def load_graph_from_dir(d, until=None, file_ext="rdf", randomise=False):
    """Parse up to `until` RDF/XML files from directory `d` into a single graph."""
    file_listing = glob.glob(f"{d}/*.{file_ext}")
    file_listing = rand.permutation(file_listing) if randomise else sorted(file_listing)
    # until=None keeps all files (slicing with -1 would drop the last one);
    # there are 1570 files in /objects, loop below has 1.5 it/s so takes 15+min
    file_listing = file_listing[:until]
    
    if len(file_listing) == 0:
        raise ValueError(f"no *.{file_ext} files found in directory {d}/ (until={until})")
    
    graph = Graph()
    for path in tqdm(file_listing, 
                     desc=f"Parsing{' random' if randomise else ''} files from /{d}"): 
        # the dumps are RDF/XML, hence format="xml"
        graph.parse(str(path), format="xml")
    return graph

In [3]:
obj_graph = load_graph_from_dir("objects", until=10, randomise=True)
thesaurus = load_graph_from_dir("thesaurus")

Parsing random files from /objects: 100%|██████████| 10/10 [00:12<00:00,  1.28s/it]
Parsing files from /thesaurus: 100%|██████████| 43/43 [00:26<00:00,  1.62it/s]


---

## Dealing with Namespaces & Types

In [19]:
# the predicates in the object graph and the thesaurus are from these namespaces (plus others)
from rdflib.namespace import RDF, DC, DCTERMS, SKOS

# this lists all namespaces present in the graph
for ns in obj_graph.namespaces():
    print(ns)

('xml', rdflib.term.URIRef('http://www.w3.org/XML/1998/namespace'))
('rdf', rdflib.term.URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#'))
('rdfs', rdflib.term.URIRef('http://www.w3.org/2000/01/rdf-schema#'))
('xsd', rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#'))
('foaf', rdflib.term.URIRef('http://xmlns.com/foaf/0.1/'))
('edm', rdflib.term.URIRef('http://www.europeana.eu/schemas/edm/'))
('dcterms', rdflib.term.URIRef('http://purl.org/dc/terms/'))
('dc', rdflib.term.URIRef('http://purl.org/dc/elements/1.1/'))


In [13]:
found_ns, found_entity = rdflib.namespace.split_uri(rdflib.term.URIRef('http://purl.org/dc/terms/alternative'))

rdflib.term.URIRef(found_ns) 

rdflib.term.URIRef('http://purl.org/dc/terms/')

In [18]:
list(obj_graph.namespaces())

[('xml', rdflib.term.URIRef('http://www.w3.org/XML/1998/namespace')),
 ('rdf', rdflib.term.URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#')),
 ('rdfs', rdflib.term.URIRef('http://www.w3.org/2000/01/rdf-schema#')),
 ('xsd', rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#')),
 ('foaf', rdflib.term.URIRef('http://xmlns.com/foaf/0.1/')),
 ('edm', rdflib.term.URIRef('http://www.europeana.eu/schemas/edm/')),
 ('dcterms', rdflib.term.URIRef('http://purl.org/dc/terms/')),
 ('dc', rdflib.term.URIRef('http://purl.org/dc/elements/1.1/'))]

In [5]:
entities = list(type(e) for triple in obj_graph for e in triple)

# list({e for triple in obj_graph for e in triple})
# e = entities[2]

In [6]:
Counter(entities)

Counter({rdflib.term.URIRef: 228216, rdflib.term.Literal: 62214})

## Querying

In [35]:
# extract all predicates
q = """SELECT DISTINCT ?p
       WHERE {
          ?a ?p ?b .
       }"""

all_predicates = [row.get("p") for row in obj_graph.query(q)]


In [40]:
[rdflib.namespace.split_uri(p) for p in all_predicates]

[('http://www.europeana.eu/schemas/edm/', 'rights'),
 ('http://purl.org/dc/elements/1.1/', 'identifier'),
 ('http://www.europeana.eu/schemas/edm/', 'provider'),
 ('http://purl.org/dc/elements/1.1/', 'subject'),
 ('http://purl.org/dc/terms/', 'extent'),
 ('http://www.europeana.eu/schemas/edm/', 'isRelatedTo'),
 ('http://purl.org/dc/elements/1.1/', 'description'),
 ('http://purl.org/dc/elements/1.1/', 'title'),
 ('http://purl.org/dc/terms/', 'medium'),
 ('http://www.europeana.eu/schemas/edm/', 'type'),
 ('http://www.europeana.eu/schemas/edm/', 'isShownBy'),
 ('http://purl.org/dc/terms/', 'created'),
 ('http://www.europeana.eu/schemas/edm/', 'isShownAt'),
 ('http://purl.org/dc/elements/1.1/', 'type'),
 ('http://purl.org/dc/terms/', 'spatial'),
 ('http://www.w3.org/1999/02/22-rdf-syntax-ns#', 'type'),
 ('http://www.europeana.eu/schemas/edm/', 'object'),
 ('http://purl.org/dc/elements/1.1/', 'creator'),
 ('file:///home/valentin/Desktop/SABIO/REPO/data/objects/', 'exhibition'),
 ('http://pur

In [42]:
[obj_graph.namespace_manager.qname(p) for p in all_predicates]

['edm:rights',
 'dc:identifier',
 'edm:provider',
 'dc:subject',
 'dcterms:extent',
 'edm:isRelatedTo',
 'dc:description',
 'dc:title',
 'dcterms:medium',
 'edm:type',
 'edm:isShownBy',
 'dcterms:created',
 'edm:isShownAt',
 'dc:type',
 'dcterms:spatial',
 'rdf:type',
 'edm:object',
 'dc:creator',
 'ns1:exhibition',
 'dc:isPartOf',
 'dcterms:alternative']
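
The list above shows which predicates occur, but not how often. For the distributional properties mentioned at the top of this notebook, a frequency count per predicate is a natural next step. The following is a minimal sketch: `predicate_frequencies` is a hypothetical helper and the toy triples are invented. It accepts any iterable of (subject, predicate, object) triples, so it should also work when given `obj_graph` directly.

```python
from collections import Counter


def predicate_frequencies(triples):
    """Count how often each predicate occurs in an iterable of (s, p, o) triples."""
    return Counter(p for _, p, _ in triples)


# toy data, invented for illustration only
toy_triples = [
    ("obj:1", "dc:title", "Mask"),
    ("obj:1", "dc:subject", "term:mask"),
    ("obj:2", "dc:subject", "term:drum"),
    ("obj:2", "dc:description", "A drum."),
]
freqs = predicate_frequencies(toy_triples)
```

On the real graph, the keys would be `URIRef` predicates, and they could be shortened for display with `obj_graph.namespace_manager.qname`, as in the cell above.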

In [21]:
q = """SELECT ?a ?b
       WHERE {
          ?a dc:description ?b .
       }"""

descriptions = {obj: desc for obj, desc in obj_graph.query(q)}

In [43]:
descriptions

{rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/757753'): rdflib.term.Literal('Inv.kaart:N.O.Congo (Ma-NgBetu of Lendu), cf 3015-77 en no. 779-17'),
 rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/422921'): rdflib.term.Literal('Credits: Collectie Nationaal Museum van Wereldculturen - afkomstig uit de collectie van het Indisch Wetenschappelijk Instituut (IWI) / donated by the Indisch Wetenschappelijk Instituut (IWI)'),
 rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/423005'): rdflib.term.Literal('Credits: Collectie Nationaal Museum van Wereldculturen - afkomstig uit de collectie van het Indisch Wetenschappelijk Instituut (IWI) / donated by the Indisch Wetenschappelijk Instituut (IWI)'),
 rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/169331'): rdflib.term.Literal('Handtas die van boven dichtgetrokken kan worden met vleugelmotieven die refereren aan de Garuda, abstracte bloemmotieven, bergen en vogels. Garuda is het heilige rijdier van de indo-god Vish

In [45]:
obj_graph.namespace_manager.qname(rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/757753'))

'ns3:757753'

### 1. Extract descriptions

(keep links)

In [None]:
q = """SELECT ?a ?b
       WHERE {
          ?a dc:description ?b .
       }"""

# list of (object URI, description) rows, so the link to each object is kept
descriptions = list(obj_graph.query(q))

In [None]:
dir(descriptions[0][1])
descriptions[0][0], descriptions[0][1].normalize(), """
aant JH: film War Paint met Tim McCoy Arapaho Ind. achterop foto stempel met gegevens film: "Tim McCoy in War Paint with Pauline Starke & Karl Dane. directed by W.S.Vandyke. A metro-Goldwyn-Mayer Picture. Country of origin U.S.A."
"""

In [None]:
descriptions