# First Exploration of the Wereldculturen RDF Dumps

#### What do we want to do here? (*to be expanded*)
 - (distributional) properties of objects & terms
   - distribution over objects, object properties 

 - explore the properties of the collection graph & thesaurus graph:
   - network structure (connectivity, etc)
   
 - explore the connections between collection & thesaurus (collection links into thesaurus)
   - intersections of entities
   - density of indexed terms in collection
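
The last two points can be made concrete with a small sketch. It assumes that every subject of the thesaurus graph counts as a term and that every subject of the collection graph counts as an object; `link_overlap` and the toy data are hypothetical names for illustration, not part of this repository. Plain tuples stand in for triples here; iterating an rdflib `Graph` also yields `(s, p, o)` triples, so the same function should accept the real graphs.

```python
def link_overlap(collection, thesaurus):
    """Return the thesaurus terms used by the collection and the share of linked objects."""
    collection = list(collection)
    # assumption: every subject in the thesaurus is a term
    terms = {s for s, _, _ in thesaurus}
    # assumption: every subject in the collection is an object record
    records = {s for s, _, _ in collection}
    # (object, term) pairs where a collection triple points into the thesaurus
    links = {(s, o) for s, _, o in collection if o in terms}
    shared_terms = {o for _, o in links}
    linked_records = {s for s, _ in links}
    density = len(linked_records) / len(records) if records else 0.0
    return shared_terms, density


toy_collection = [
    ("obj:1", "dc:subject", "term:a"),
    ("obj:1", "dc:title", "Mask"),
    ("obj:2", "dc:title", "Drum"),
]
toy_thesaurus = [
    ("term:a", "skos:prefLabel", "mask"),
    ("term:b", "skos:prefLabel", "drum"),
]

shared, density = link_overlap(toy_collection, toy_thesaurus)
print(shared, density)  # {'term:a'} 0.5
```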


#### TODO

 - move the basic functions (such as loading the graph) into a Python module





### Preparation

The RDF dumps of the collection data are too large to upload to GitHub. You can get the data needed for this notebook [here](https://collectie.wereldculturen.nl/thesaurus/#/query/89a9b00f-5f4b-4fef-bf00-32299ba16c85). Download both the collection dumps *and* the thesaurus dumps, and unzip them into the folders `objects` and `thesaurus`, respectively.

This notebook uses the following packages:

In [2]:
import glob
from tqdm import tqdm

from collections import Counter
import numpy.random as rand
import pandas as pd

import rdflib
from rdflib import Graph
from rdflib import URIRef

from tabulate import tabulate

In [3]:
from utils import load_RDF_from_dir

## Loading objects and thesaurus

In [4]:
obj_graph = load_RDF_from_dir("objects", until=1, randomise=True)
thesaurus = load_RDF_from_dir("thesaurus", until=1, randomise=True)

Parsing random files from /objects: 100%|██████████| 1/1 [00:01<00:00,  1.31s/it]
Parsing random files from /thesaurus: 100%|██████████| 1/1 [00:00<00:00,  1.96it/s]
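
`load_RDF_from_dir` lives in `utils` and is not shown in this notebook. The sketch below is a guess at what the `until` and `randomise` arguments might do, based only on the call and the progress output above: pick at most `until` files, optionally in random order, and parse them into one graph. The function names `select_files` and `load_graph` and the `seed` parameter are hypothetical.

```python
import glob
import os
import random


def select_files(directory, until=None, randomise=False, seed=None):
    """Pick the files to parse: all of them, or the first `until` after optional shuffling."""
    files = sorted(glob.glob(os.path.join(directory, "*")))
    if randomise:
        random.Random(seed).shuffle(files)
    return files[:until]  # until=None keeps every file


def load_graph(directory, until=None, randomise=False):
    """Parse the selected files into a single rdflib Graph."""
    from rdflib import Graph  # imported lazily so select_files works without rdflib

    graph = Graph()
    for path in select_files(directory, until=until, randomise=randomise):
        # with no explicit format, rdflib tries to guess it from the file extension
        graph.parse(path)
    return graph
```

Loading a single random file keeps the exploration fast; passing `until=None` would parse the whole dump.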


---

## Dealing with Namespaces & Types

In [None]:
# the predicates in the object graph and the thesaurus are from these namespaces (plus others)
from rdflib.namespace import RDF, DC, DCTERMS, SKOS

# this lists all namespaces present in the graph
for ns in obj_graph.namespaces():
    print(ns)

In [None]:
found_ns, found_entity = rdflib.namespace.split_uri(rdflib.term.URIRef('http://purl.org/dc/terms/alternative'))

rdflib.term.URIRef(found_ns) 

In [None]:
list(obj_graph.namespaces())

In [None]:
entities = list(type(e) for triple in obj_graph for e in triple)

# list({e for triple in obj_graph for e in triple})
# e = entities[2]

In [None]:
Counter(entities)

## Querying

In [None]:
# extract all predicates
q = """SELECT DISTINCT ?p
       WHERE {
          ?a ?p ?b .
       }"""

all_predicates = [row.get("p") for row in obj_graph.query(q)]


In [None]:
[rdflib.namespace.split_uri(p) for p in all_predicates]

In [None]:
[obj_graph.namespace_manager.qname(p) for p in all_predicates]
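
To see which vocabularies the predicates come from, they can be tallied per namespace. The sketch below uses a simplified splitter that cuts at the last `#` or `/`; this is an approximation for illustration and not the rule `rdflib.namespace.split_uri` applies. `split_namespace`, `namespace_counts` and the example list are hypothetical; in the notebook, `all_predicates` would take the place of `example_predicates`.

```python
from collections import Counter


def split_namespace(uri):
    """Split a URI at its last '#' or '/' into (namespace, local name)."""
    cut = max(uri.rfind("#"), uri.rfind("/")) + 1
    return uri[:cut], uri[cut:]


def namespace_counts(predicates):
    """Count how many predicates fall under each namespace."""
    return Counter(split_namespace(str(p))[0] for p in predicates)


example_predicates = [
    "http://purl.org/dc/terms/alternative",
    "http://purl.org/dc/terms/created",
    "http://purl.org/dc/elements/1.1/description",
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
]

counts = namespace_counts(example_predicates)
for namespace, n in counts.most_common():
    print(n, namespace)
```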

In [None]:
q = """SELECT ?a ?b
       WHERE {
          ?a dc:description ?b .
       }"""

descriptions = {obj: desc for obj, desc in obj_graph.query(q)}

In [None]:
descriptions

In [None]:
obj_graph.namespace_manager.qname(rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/757753')),\
obj_graph.namespace_manager.qname(rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/termmaster26980'))

### .1 Construct DataFrame from graph

The goal is one record per object.

Records are built from the triples by grouping them by subject (`groupby`).


Note: several fields have multiple values per object; these are collected into a list.
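
A minimal sketch of this grouping step, under the assumptions that a record is a plain dict keyed by predicate, that the subject is stored under `obj_ref`, and that only repeated predicates become lists. It is not the implementation of `graph_to_df` in `utils`; `graph_to_records` and the toy triples are hypothetical.

```python
from collections import defaultdict
from itertools import groupby


def triples_to_record(subject, triples, qname=str):
    """Collapse all triples of one subject into a single dict."""
    fields = defaultdict(list)
    for _, predicate, obj in triples:
        fields[qname(predicate)].append(obj)
    record = {"obj_ref": subject}
    for name, values in fields.items():
        # single values stay scalar; repeated predicates become lists
        record[name] = values[0] if len(values) == 1 else values
    return record


def graph_to_records(triples, qname=str):
    """Build one record per subject from an iterable of (s, p, o) triples."""
    ordered = sorted(triples)  # groupby only merges adjacent items
    grouped = groupby(ordered, key=lambda triple: triple[0])
    return [triples_to_record(s, group, qname) for s, group in grouped]


toy = [
    ("obj:1", "dc:title", "Mask"),
    ("obj:2", "dc:title", "Drum"),
    ("obj:1", "dc:subject", "term:a"),
    ("obj:1", "dc:subject", "term:b"),
]

records = graph_to_records(toy)
print(records[0])
```

With the real graph, `qname=obj_graph.namespace_manager.qname` would shorten predicate URIs to `prefix:name` form, and the list of dicts can be handed to `pd.DataFrame.from_records`.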

In [5]:
from utils import graph_to_df

In [6]:
obj_df = graph_to_df(obj_graph, qname=obj_graph.namespace_manager.qname)

500it [00:00, 1205.42it/s]


In [12]:
obj_df.columns

# obj_df["dc:title"]

obj_df["dc:description"].isna().sum()/obj_df.shape[0]

# obj_df["http://purl.org/dc/elements/1.1/description"]

0.428

In [None]:
from itertools import groupby

grouped = groupby(sorted(obj_graph), lambda triple: triple[0])

qname = lambda term: obj_graph.namespace_manager.qname(term)

In [None]:
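for k, group in grouped:
    print(k)

    for s, p, o in group:
        # every triple in a group must share the group's subject
        if s != k:
            raise ValueError(f"subject {s!r} does not match group key {k!r}")
        p = qname(p)
        try:
            # shorten URIs to prefix:name; terms without a qname are printed as-is
            o = qname(o)
        except ValueError:
            pass
        print("\t", p, o)
    break  # only inspect the first object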

In [None]:

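# `grouped` is a one-shot iterator that the preview loop above already advanced, so rebuild it
grouped = groupby(sorted(obj_graph), lambda triple: triple[0])

# triples_to_record is not defined in this notebook; presumably it lives in utils alongside graph_to_df
records = [triples_to_record(k, group) for k, group in tqdm(grouped)]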

In [None]:
obj_df = pd.DataFrame.from_records(records)

In [None]:
obj_df.shape

In [None]:
cols = set(tuple(r.keys()) for r in records)
len(records), len(cols)

In [None]:
max_k = ('obj_ref',
  'ns11:exhibition',
  'dc:creator',
  'dc:description',
  'dc:identifier',
  'dc:subject',
  'dc:title',
  'dc:type',
  'dcterms:created',
  'dcterms:extent',
  'dcterms:medium',
  'dcterms:spatial',
  'edm:isRelatedTo',
  'edm:isShownAt',
  'edm:isShownBy',
  'edm:object',
  'edm:provider',
  'edm:rights',
  'edm:type',
  'rdf:type')

[tuple(r.keys()) == max_k for r in records if len(tuple(r.keys())) == 20]


{k for rec in records for k in rec.keys()} ^ set(max_k)
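
The check above compares each record against the fullest key set. A complementary view is how often each field occurs at all. The sketch below assumes `records` is a list of dicts as built earlier; `field_coverage` and the toy records are hypothetical.

```python
from collections import Counter


def field_coverage(records):
    """Return, per field, the fraction of records that contain it."""
    total = len(records)
    if total == 0:
        return {}
    counts = Counter(key for rec in records for key in rec)
    return {key: n / total for key, n in counts.items()}


toy_records = [
    {"obj_ref": "obj:1", "dc:title": "Mask", "dc:description": "wooden mask"},
    {"obj_ref": "obj:2", "dc:title": "Drum"},
]

coverage = field_coverage(toy_records)
for key, share in sorted(coverage.items(), key=lambda item: -item[1]):
    print(f"{share:.2f}  {key}")
```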

In [None]:
obj_df.mean()

### .2 Extract descriptions

 - keep links
 - construct rich data structure (not just a list of text)
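
One possible shape for such a structure is sketched below. It reads "keep links" as keeping the object reference and the object's `dc:subject` links next to each description text; that reading, the `Description` class, `collect_descriptions` and the toy triples are all assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Description:
    obj_ref: str                    # the object the description belongs to
    text: str                       # the description itself
    language: Optional[str] = None  # language tag, if the literal carries one
    subjects: List[str] = field(default_factory=list)  # linked terms of the same object


def collect_descriptions(triples, desc_pred="dc:description", link_pred="dc:subject"):
    """Build one Description per description triple, carrying the object's links along."""
    triples = list(triples)
    links = {}
    for s, p, o in triples:
        if p == link_pred:
            links.setdefault(s, []).append(o)
    return [
        Description(
            obj_ref=s,
            text=str(o),
            # rdflib Literals expose .language; plain strings fall back to None
            language=getattr(o, "language", None),
            subjects=list(links.get(s, [])),
        )
        for s, p, o in triples
        if p == desc_pred
    ]


toy = [
    ("obj:1", "dc:description", "wooden mask"),
    ("obj:1", "dc:subject", "term:a"),
    ("obj:2", "dc:title", "Drum"),
]

rich = collect_descriptions(toy)
print(rich)
```

With the real graph, the two predicates would be passed as `DC.description` and `DC.subject` from `rdflib.namespace` instead of the string stand-ins.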

In [None]:
q = """SELECT ?a ?b
       WHERE {
          ?a dc:description ?b .
       }"""

descriptions = {obj: desc.toPython() for obj, desc in obj_graph.query(q)}

In [None]:
descs = list(descriptions.values())

set(map(type, descs))

In [None]:
import numpy as np

df = pd.DataFrame(descs, columns=["descs"])

# NaN never compares equal to itself, so `== np.nan` is always False; use isna() instead
df.descs.isna().sum()

In [None]:
desc = next(iter(descriptions.values()))
desc, desc.normalize(), str(desc)
dir(desc)
desc.toPython() == str(desc)

In [None]:
descriptions

In [None]:
descs = [o.toPython() for s, p, o in obj_graph if p == rdflib.term.URIRef('http://purl.org/dc/elements/1.1/description')]

sorted(set(map(len, descs)))

set(d for d in descs if len(d) < 10)

In [None]:
set(obj_graph.predicates())