# Tracking a Term through the Collection & Thesaurus

Taking the term 'marron' (which refers to groups of people in the Americas; [Thesaurus link](https://hdl.handle.net/20.500.11840/termmaster3534). [Wikipedia link](https://nl.wikipedia.org/wiki/Marrons)) as an example, this notebook explores how a term can be tracked across both the collection and the thesaurus.

#### What information are we interested in? (*to be expanded*)

 - basic statistics on the term
 - statistics on related terms
 - shortest paths
 - placement in the hierarchies and facets (5 functional categories in the thesaurus)

---

#### Recipe:

 - read DB and thesaurus with rdflib
 - use queries to extract relevant triples and relevant parts of identifiers
 - construct table (perhaps pandas)
 - do statistics on table
 
 

In [1]:
import glob
from tqdm import tqdm

import numpy.random as rand

import rdflib
from rdflib import Graph
from rdflib import URIRef

def load_graph_from_dir(d, until=-1, file_ext="rdf", randomise=False):
    file_listing = glob.glob(f"{d}/*.{file_ext}")
    file_listing = rand.permutation(file_listing) if randomise else sorted(file_listing)
    file_listing = file_listing[:until] # there are 1570 files in /objects, loop below has 1.5 it/s so takes 15+min
        
    if len(file_listing) == 0:
        raise ValueError(f"taking {until} files from directory /{d}/ somehow not possible, listing empty!")
    
    graph = Graph()
    for path in tqdm(file_listing, 
                     desc=f"Parsing{' random' if randomise else ''} files from /{d}"): 
        graph.parse(path, format="xml")
    return graph

In [21]:
obj_graph = load_graph_from_dir("objects", until=30, randomise=True)
thesaurus = load_graph_from_dir("thesaurus", randomise=False)

Parsing random files from /objects: 100%|██████████| 30/30 [00:35<00:00,  1.18s/it]
Parsing files from /thesaurus: 100%|██████████| 43/43 [00:26<00:00,  1.60it/s]


In [None]:
marron = rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/termmaster3534')
thule = rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/termmaster2960')


marron_obj_graph = list(obj_graph.triples((marron, None, None))) + list(obj_graph.triples((None, None, marron)))
marron_thesaurus = list(thesaurus.triples((marron, None, None))) + list(thesaurus.triples((None, None, marron)))


In [None]:
print(f"'marron' occurs in\t{len(marron_obj_graph)}\trelations in the collection")
print(f"'marron' occurs in\t{len(marron_thesaurus)}\trelations in the thesaurus")

In [None]:
import networkx as nx

In [None]:
G = nx.MultiDiGraph()

G.add_edges_from(list(obj_graph))

In [None]:
len(G)

In [None]:
G.

In [None]:
list(obj_graph)

---
## Querying

In [None]:
q = """SELECT DISTINCT ?p
       WHERE {
          ?a ?p ?b .
       }"""

all_predicates = list(obj_graph.query(q))

type_preds = list(filter(lambda p: p[0].endswith("type"), all_predicates))

In [None]:
list(obj_graph.namespaces())

In [None]:
q = """PREFIX : <http://graphtheory/node/>
PREFIX ns1: <https://hdl.handle.net/20.500.11840/>

ASK {ns1:termmaster2960 :hasNeighbor*  ns1:termmaster3534}"""


list(thesaurus.query(q))

In [None]:
list(thesaurus)

In [None]:
qname = obj_graph.namespace_manager.qname

# list(map(qname, all_predicates))

for p in all_predicates:
    print(qname(p[0]))

In [None]:
all_predicates[0][0]

In [None]:
list(obj_graph.predicates())

In [None]:
{s for s, p, o in obj_graph if p == rdflib.term.URIRef('http://purl.org/dc/elements/1.1/type')}

# Legacy Code

In [None]:
granman_photo = rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/206868')

granman_triples = list(obj_graph.triples((granman_photo, None, None))) + list(obj_graph.triples((None, None, granman_photo)))

# Terms from *Words Matter*

In [1]:
import pandas as pd

In [3]:
terms = pd.read_excel("../../Termen Words Matter (002).xlsx", engine='openpyxl')

In [8]:
terms.columns

Index(['nummer', 'woord NL', 'woord UK', 'paginanr NL', 'paginanr Engels',
       'toelichting NL', 'toelichting UK', 'voorkeurs alternatief NL',
       'voorkeursalternatief UK', 'relatie met', 'relatie met ',
       'relatie met .1', 'relatie met.1', 'relatie met.2', 'relatie met.3',
       'relatie met.4', 'relatie met.5', 'relatie met.6', 'relatie met.7'],
      dtype='object')

In [11]:
terms["woord UK"]

0            Aboriginal
1               Descent
2            Allochtoon
3             Barbarian
4               Servant
5                Berber
6                 Blank
7                Bombay
8            Bush Negro
9           Third World
10                Dwarf
11               Eskimo
12            Ethnicity
13               Exotic
14                  Gay
15              Colored
16           Half-blood
17             Disabled
18        Hermaphrodite
19           Homosexual
20            Hottentot
21           Inboorling
22               Indian
23              Indisch
24                 Indo
25           Indigenous
26                  NaN
27                  NaN
28         Jappenkampen
29               Kaffir
30            Caucasian
31               Coolie
32           Headhunter
33               Maroon
34         Medicine Man
35           Mohammedan
36            Mongoloid
37                 Moor
38              Mulatto
39               Native
40                Negro
41             D

In [9]:
terms.to_csv("../../words_matter.csv", index=False)

# Tracking Words in the Collections

In [34]:
from Levenshtein import distance as levenshtein

def nword(s):
    return "neger" in s.lower() or "nikker" in s.lower()

literals = [e for s, p, o in obj_graph for e in (s, o) if isinstance(e, rdflib.term.Literal)]

nwords = [s for s in map(str, literals) if nword(s) or "negr" in s.lower() or "nigger" in s.lower()]

In [31]:
sorted(set(nwords))

['De zeef is gevlochten door de ongeveer 80-jarige mandenmaker Guillaume Kodaman.<BR>    De wanden zijn gevlochten in de rechte éénslag met ijle, dubbele schering- en dubbele inslagrepen. De bodem is gevlochten in rechte tweeslag met ijle schering- en inslagrepen.<BR>    De stijl van de zeef is eerder indiaans en ongebruikelijk voor Bosnegers. Op de vraag waar hij de techniek had geleerd, werd geantwoord met een lachje.',
 'Ex-voto ofwel geloftegeschenk in de vorm van een polychroom beschilderde, vertind ijzeren plaat met zwart geverfde achterzijde. Links is de verschijning afgebeeld van de Virgen de la Candelaria in een wolkachtig ovaal met rechts een man, die uit een boom valt. Onderaan op een brede witte band staat de tekst: "Doy gracias a la Stma. Virgen de San Juan, por haver me salvado la vida, despues de haver caido de un arbol de 12 metros de altura. Miguel Reyes. Chiarillo, Jal. mayo, 26-1940" (Ik breng dank aan de allerheiligste Maagd van San Juan die mij het leven heeft gere

In [24]:
def to_qname(e):
    try:
        return thesaurus.namespace_manager.qname(e)
    except ValueError:
        return str(e)
    
thes_qnamed = [tuple(map(to_qname, tr)) for tr in tqdm(thesaurus)]
obj_qnamed = [tuple(map(to_qname, tr)) for tr in tqdm(obj_graph)]

100%|██████████| 198128/198128 [00:08<00:00, 23688.54it/s]
100%|██████████| 283919/283919 [02:55<00:00, 1616.90it/s]


In [33]:
ntriples = [tr for tr in obj_qnamed if any(map(nword, tr))]

TypeError: 'list' object is not callable

[('ns1:107416',
  'dc:description',
  'Zowel mannen als vrouwen roken pijp. De pijpen van vrouwen zijn over het algemeen minder gedecoreerd. De sociale status van de bezitter bepaalt de grootte en mate van versiering van de pijp.'),
 ('ns1:426744',
  'edm:provider',
  'Stichting Nationaal Museum van Wereldculturen'),
 ('ns1:77141', 'dc:identifier', '77141'),
 ('ns1:1000442', 'edm:object', 'ns1:termmaster1397'),
 ('ns1:1294702',
  'dcterms:extent',
  '2,4 × 3,6cm (Afbeelding)\n5 × 5cm (Drager)'),
 ('ns1:1295020', 'edm:isRelatedTo', 'Audiovisuele collectie'),
 ('ns1:338083',
  'edm:provider',
  'Stichting Nationaal Museum van Wereldculturen'),
 ('ns1:773220', 'edm:rights', 'CC-BY-SA 4.0'),
 ('ns1:77097', 'edm:rights', 'Copyright'),
 ('ns1:824050', 'dcterms:medium', 'ns1:termmaster26974')]