# First Steps with the Wereldculturen RDF Dumps


This notebook is a playground for learning how to interact with the MvW data.

## Preparation

The collection data RDF dumps are too large to be uploaded to GitHub. You can get the necessary data for this notebook [here](https://collectie.wereldculturen.nl/thesaurus/#/query/89a9b00f-5f4b-4fef-bf00-32299ba16c85). Download both the collection dumps *and* the thesaurus dumps, and unzip them into the folders `objects` and `thesaurus`, respectively.

This notebook uses the following packages:

In [1]:
import glob
from tqdm import tqdm

from collections import Counter
import numpy.random as rand
import pandas as pd

import rdflib
from rdflib import Graph
from rdflib import URIRef

from tabulate import tabulate

Basic functions are defined in `utils.py`:

In [2]:
from utils import load_RDF_from_dir, graph_to_df

## Load objects and thesaurus

In [3]:
obj_graph = load_RDF_from_dir("objects", until=1, randomise=True)
thesaurus = load_RDF_from_dir("thesaurus", until=1, randomise=True)

Parsing random files from /objects: 100%|██████████| 1/1 [00:01<00:00,  1.29s/it]
Parsing random files from /thesaurus: 100%|██████████| 1/1 [00:00<00:00,  1.62it/s]


## Store graphs' triples in data frame

In [4]:
obj_df = graph_to_df(obj_graph, qname=obj_graph.namespace_manager.qname)
thes_df = graph_to_df(thesaurus, qname=thesaurus.namespace_manager.qname)

500it [00:00, 1289.02it/s]
500it [00:00, 4082.61it/s]


---

## Dealing with Namespaces & Types

In [5]:
# the predicates in the object graph and the thesaurus are from these namespaces (plus others)
from rdflib.namespace import RDF, DC, DCTERMS, SKOS

# this lists all namespaces present in the graph
for ns in obj_graph.namespaces():
    print(ns)

('xml', rdflib.term.URIRef('http://www.w3.org/XML/1998/namespace'))
('rdf', rdflib.term.URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#'))
('rdfs', rdflib.term.URIRef('http://www.w3.org/2000/01/rdf-schema#'))
('xsd', rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#'))
('foaf', rdflib.term.URIRef('http://xmlns.com/foaf/0.1/'))
('edm', rdflib.term.URIRef('http://www.europeana.eu/schemas/edm/'))
('dcterms', rdflib.term.URIRef('http://purl.org/dc/terms/'))
('dc', rdflib.term.URIRef('http://purl.org/dc/elements/1.1/'))
('ns1', rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/'))
('ns2', rdflib.term.URIRef('http://collectie.wereldculturen.nl/lodimages/cc/imageproxy.ashx?filename=images/Images/TM//tm-60022423.jpg&cache='))
('ns3', rdflib.term.URIRef('http://collectie.wereldculturen.nl/lodimages/cc/imageproxy.ashx?filename=images/Images/TM//tm-60022424.jpg&cache='))
('ns4', rdflib.term.URIRef('http://collectie.wereldculturen.nl/lodimages/cc/imageproxy.ashx?filename=images/Image

In [6]:
found_ns, found_entity = rdflib.namespace.split_uri(rdflib.term.URIRef('http://purl.org/dc/terms/alternative'))

rdflib.term.URIRef(found_ns) 

rdflib.term.URIRef('http://purl.org/dc/terms/')

In [7]:
list(obj_graph.namespaces())

[('xml', rdflib.term.URIRef('http://www.w3.org/XML/1998/namespace')),
 ('rdf', rdflib.term.URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#')),
 ('rdfs', rdflib.term.URIRef('http://www.w3.org/2000/01/rdf-schema#')),
 ('xsd', rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#')),
 ('foaf', rdflib.term.URIRef('http://xmlns.com/foaf/0.1/')),
 ('edm', rdflib.term.URIRef('http://www.europeana.eu/schemas/edm/')),
 ('dcterms', rdflib.term.URIRef('http://purl.org/dc/terms/')),
 ('dc', rdflib.term.URIRef('http://purl.org/dc/elements/1.1/')),
 ('ns1', rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/')),
 ('ns2',
  rdflib.term.URIRef('http://collectie.wereldculturen.nl/lodimages/cc/imageproxy.ashx?filename=images/Images/TM//tm-60022423.jpg&cache=')),
 ('ns3',
  rdflib.term.URIRef('http://collectie.wereldculturen.nl/lodimages/cc/imageproxy.ashx?filename=images/Images/TM//tm-60022424.jpg&cache=')),
 ('ns4',
  rdflib.term.URIRef('http://collectie.wereldculturen.nl/lodimages/cc/imagepro

In [8]:
entities = list(type(e) for triple in obj_graph for e in triple)

# list({e for triple in obj_graph for e in triple})
# e = entities[2]

In [9]:
Counter(entities)

Counter({rdflib.term.URIRef: 19550, rdflib.term.Literal: 6217})

## Querying

In [10]:
# extract all predicates
q = """SELECT DISTINCT ?p
       WHERE {
          ?a ?p ?b .
       }"""

all_predicates = [row.get("p") for row in obj_graph.query(q)]


In [11]:
[rdflib.namespace.split_uri(p) for p in all_predicates]

[('http://www.europeana.eu/schemas/edm/', 'rights'),
 ('http://purl.org/dc/elements/1.1/', 'description'),
 ('http://purl.org/dc/elements/1.1/', 'subject'),
 ('http://www.europeana.eu/schemas/edm/', 'type'),
 ('http://www.w3.org/1999/02/22-rdf-syntax-ns#', 'type'),
 ('http://purl.org/dc/terms/', 'extent'),
 ('http://purl.org/dc/elements/1.1/', 'identifier'),
 ('http://purl.org/dc/terms/', 'created'),
 ('http://www.europeana.eu/schemas/edm/', 'isRelatedTo'),
 ('http://www.europeana.eu/schemas/edm/', 'provider'),
 ('http://purl.org/dc/elements/1.1/', 'type'),
 ('http://www.europeana.eu/schemas/edm/', 'isShownAt'),
 ('http://www.europeana.eu/schemas/edm/', 'isShownBy'),
 ('http://purl.org/dc/terms/', 'medium'),
 ('http://purl.org/dc/terms/', 'spatial'),
 ('http://purl.org/dc/elements/1.1/', 'title'),
 ('http://www.europeana.eu/schemas/edm/', 'object')]

In [12]:
[obj_graph.namespace_manager.qname(p) for p in all_predicates]

['edm:rights',
 'dc:description',
 'dc:subject',
 'edm:type',
 'rdf:type',
 'dcterms:extent',
 'dc:identifier',
 'dcterms:created',
 'edm:isRelatedTo',
 'edm:provider',
 'dc:type',
 'edm:isShownAt',
 'edm:isShownBy',
 'dcterms:medium',
 'dcterms:spatial',
 'dc:title',
 'edm:object']

In [13]:
q = """SELECT ?a ?b
       WHERE {
          ?a dc:description ?b .
       }"""

descriptions = {obj: desc for obj, desc in obj_graph.query(q)}

In [14]:
descriptions

{rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/400168'): rdflib.term.Literal('Credits: Collectie Nationaal Museum van Wereldculturen - afkomstig uit de collectie van het Indisch Wetenschappelijk Instituut (IWI) / donated by the Indisch Wetenschappelijk Instituut (IWI)'),
 rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/400278'): rdflib.term.Literal('Credits: Collectie Nationaal Museum van Wereldculturen - afkomstig uit de collectie van het Indisch Wetenschappelijk Instituut (IWI) / donated by the Indisch Wetenschappelijk Instituut (IWI)'),
 rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/400379'): rdflib.term.Literal('Credits: Collectie Nationaal Museum van Wereldculturen - afkomstig uit de collectie van het Indisch Wetenschappelijk Instituut (IWI) / donated by the Indisch Wetenschappelijk Instituut (IWI)'),
 rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/400321'): rdflib.term.Literal('Credits: Collectie Nationaal Museum van Wereldculturen - afkomstig

In [15]:
obj_graph.namespace_manager.qname(rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/757753')),\
obj_graph.namespace_manager.qname(rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/termmaster26980'))

('ns1:757753', 'ns1:termmaster26980')

## Construct DataFrame from graph

The goal is one record per object, built from the triples by grouping them on the subject (`groupby`).

Note: several fields have multiple values per object, so these values are collected in a list.

In [16]:
obj_df = graph_to_df(obj_graph, qname=obj_graph.namespace_manager.qname)

500it [00:00, 2750.18it/s]


In [17]:
obj_df.columns

# obj_df["dc:title"]

obj_df["dc:description"].isna().sum()/obj_df.shape[0]

# obj_df["http://purl.org/dc/elements/1.1/description"]

0.096

In [18]:
from itertools import groupby

grouped = groupby(sorted(obj_graph), lambda triple: triple[0])

qname = lambda term: obj_graph.namespace_manager.qname(term)

In [19]:
for k, group in grouped:
    print(k)
    
    for triple in group:
        s, p, o = triple
        
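        if s != k:
            raise ValueError(f"subject {s} does not match group key {k}")
        p = qname(p)
        try:
            # shorten URIs to prefix:name
            o = qname(o)
        except ValueError:
            # literals cannot be shortened, so they are printed as-is
            pass
        print("\t", p, o)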
    break

https://hdl.handle.net/20.500.11840/29851
	 dc:identifier 29851
	 dc:identifier TM-60022423
	 dc:title Eenzelfde paar doch van jongere generatie
	 dc:type Foto
	 dcterms:extent 15,2 x 11cm (6 x 4 5/16in.)
	 dcterms:medium ns1:termmaster26983
	 edm:isRelatedTo ns1:termmaster1802
	 edm:isRelatedTo Audiovisuele collectie
	 edm:isShownAt ns1:29851
	 edm:isShownBy ns2:yes
	 edm:provider Stichting Nationaal Museum van Wereldculturen
	 edm:rights (not assigned)
	 edm:type IMAGE
	 rdf:type dcterms:PhysicalResource


In [20]:

records = [triples_to_record(k, group) for k, group in tqdm(grouped)]

0it [00:00, ?it/s]

NameError: name 'triples_to_record' is not defined
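
`triples_to_record` is not defined anywhere in this notebook, hence the `NameError`. A minimal sketch of what such a helper might look like follows. It is an assumption based on the notes above (one record per object, multiple values collected in a list), not the implementation in `utils.py`. The key name `obj_ref` is borrowed from the column list further down, and the sample triples are invented plain strings rather than rdflib terms.

```python
from collections import defaultdict
from itertools import groupby


def triples_to_record(subject, triples, shorten=str):
    """Collapse all triples of one subject into a single dict.

    Hypothetical helper: every predicate becomes a key and its objects are
    collected in a list, because a predicate may occur several times.
    `shorten` is applied to each predicate (e.g. a qname function).
    """
    values = defaultdict(list)
    for _, p, o in triples:
        values[shorten(p)].append(o)
    return {"obj_ref": subject, **values}


# Invented sample data standing in for sorted(obj_graph)
sample = [
    ("obj/1", "dc:identifier", "29851"),
    ("obj/1", "dc:identifier", "TM-60022423"),
    ("obj/1", "dc:type", "Foto"),
    ("obj/2", "dc:type", "Foto"),
]

# groupby only merges adjacent items, so the triples must be sorted by subject
grouped = groupby(sorted(sample), key=lambda triple: triple[0])
records = [triples_to_record(k, group) for k, group in grouped]
print(records)
```

Note that `grouped` is a one-shot iterator: after a loop that ends with `break`, the groups already visited are gone, so it would need to be recreated before building the full list of records.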

In [None]:
obj_df = pd.DataFrame.from_records(records)

In [None]:
obj_df.shape

In [None]:
cols = set(tuple(r.keys()) for r in records)
len(records), len(cols)

In [None]:
max_k = ('obj_ref',
  'ns11:exhibition',
  'dc:creator',
  'dc:description',
  'dc:identifier',
  'dc:subject',
  'dc:title',
  'dc:type',
  'dcterms:created',
  'dcterms:extent',
  'dcterms:medium',
  'dcterms:spatial',
  'edm:isRelatedTo',
  'edm:isShownAt',
  'edm:isShownBy',
  'edm:object',
  'edm:provider',
  'edm:rights',
  'edm:type',
  'rdf:type')

[tuple(r.keys()) == max_k for r in records if len(tuple(r.keys())) == 20]


{k for rec in records for k in rec.keys()} ^ set(max_k)

In [None]:
obj_df.mean()

### Extract descriptions

 - keep links
 - construct rich data structure (not just list of text)

In [21]:
q = """SELECT ?a ?b
       WHERE {
          ?a dc:description ?b .
       }"""

descriptions = {obj: desc.toPython() for obj, desc in obj_graph.query(q)}

In [22]:
descs = list(descriptions.values())

set(map(type, descs))

{str}

In [29]:
descs[100]

'Credits: Collectie Nationaal Museum van Wereldculturen - afkomstig uit de collectie van het Indisch Wetenschappelijk Instituut (IWI) / donated by the Indisch Wetenschappelijk Instituut (IWI)'

In [23]:
import numpy as np

df = pd.DataFrame(descs, columns=["descs"])

# NaN never compares equal to anything (not even itself), so use isna() to count missing values
df.descs.isna().sum()

0
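
A comparison with `==` never matches NaN, because NaN is not equal to anything, including itself. Such a check therefore reports 0 even when values are missing. A short sketch on invented data, using pandas' `isna()` for the count:

```python
import numpy as np
import pandas as pd

# Invented example column with two missing values
s = pd.Series(["a description", np.nan, None], name="descs")

n_equal_nan = int((s == np.nan).sum())  # equality with NaN is never True
n_missing = int(s.isna().sum())         # counts both NaN and None
print(n_equal_nan, n_missing)
```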

In [27]:
df.shape

(452, 1)

In [25]:
desc = next(iter(descriptions.values()))
desc, desc.normalize(), str(desc)
dir(desc)
desc.toPython() == str(desc)

AttributeError: 'str' object has no attribute 'normalize'

In [30]:
descriptions

{rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/400168'): 'Credits: Collectie Nationaal Museum van Wereldculturen - afkomstig uit de collectie van het Indisch Wetenschappelijk Instituut (IWI) / donated by the Indisch Wetenschappelijk Instituut (IWI)',
 rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/400278'): 'Credits: Collectie Nationaal Museum van Wereldculturen - afkomstig uit de collectie van het Indisch Wetenschappelijk Instituut (IWI) / donated by the Indisch Wetenschappelijk Instituut (IWI)',
 rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/400379'): 'Credits: Collectie Nationaal Museum van Wereldculturen - afkomstig uit de collectie van het Indisch Wetenschappelijk Instituut (IWI) / donated by the Indisch Wetenschappelijk Instituut (IWI)',
 rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/400321'): 'Credits: Collectie Nationaal Museum van Wereldculturen - afkomstig uit de collectie van het Indisch Wetenschappelijk Instituut (IWI) / donated by the

In [None]:
descs = [o.toPython() for s, p, o in obj_graph if p == rdflib.term.URIRef('http://purl.org/dc/elements/1.1/description')]

sorted(set(map(len, descs)))

set(d for d in descs if len(d) < 10)

In [None]:
set(obj_graph.predicates())