# Tracking a Term through the Collection & Thesaurus

Taking the term 'marron' (which refers to groups of people in the Americas; [Thesaurus link](https://hdl.handle.net/20.500.11840/termmaster3534). [Wikipedia link](https://nl.wikipedia.org/wiki/Marrons)) as an example, this notebook explores how a term can be tracked across both the collection and the thesaurus.

#### What information are we interested in? (*to be expanded*)

 - basic statistics on the term
 - statistics on related terms
 - shortest paths
 - placement in the hierarchies and facets (5 functional categories in the thesaurus)

---

#### Recipe:

 - read DB and thesaurus with rdflib
 - use queries to extract relevant triples and relevant parts of identifiers
 - construct table (perhaps pandas)
 - do statistics on table
 
 

In [1]:
import glob
from tqdm import tqdm

import numpy.random as rand

import rdflib
from rdflib import Graph
from rdflib import URIRef

def load_graph_from_dir(d, until=-1, file_ext="rdf", randomise=False):
    file_listing = glob.glob(f"{d}/*.{file_ext}")
    file_listing = rand.permutation(file_listing) if randomise else sorted(file_listing)
    file_listing = file_listing[:until] # there are 1570 files in /objects, loop below has 1.5 it/s so takes 15+min
        
    if len(file_listing) == 0:
        raise ValueError(f"taking {until} files from directory /{d}/ somehow not possible, listing empty!")
    
    graph = Graph()
    for path in tqdm(file_listing, 
                     desc=f"Parsing{' random' if randomise else ''} files from /{d}"): 
        graph.parse(path, format="xml")
    return graph

In [5]:
obj_graph = load_graph_from_dir("objects", until=10, randomise=True)
thesaurus = load_graph_from_dir("thesaurus", randomise=False)

Parsing random files from /objects: 100%|██████████| 10/10 [00:12<00:00,  1.21s/it]
Parsing files from /thesaurus: 100%|██████████| 43/43 [00:24<00:00,  1.72it/s]


In [6]:
marron = rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/termmaster3534')

marron_obj_graph = list(obj_graph.triples((marron, None, None))) + list(obj_graph.triples((None, None, marron)))
marron_thesaurus = list(thesaurus.triples((marron, None, None))) + list(thesaurus.triples((None, None, marron)))


In [7]:
print(f"'marron' occurs in\t{len(marron_obj_graph)}\trelations in the collection")
print(f"'marron' occurs in\t{len(marron_thesaurus)}\trelations in the thesaurus")

'marron' occurs in	11	relations in the collection
'marron' occurs in	15	relations in the thesaurus


In [8]:
marron_thesaurus

[(rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/termmaster3534'),
  rdflib.term.URIRef('http://www.w3.org/2004/02/skos/core#broader'),
  rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/termmaster21108')),
 (rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/termmaster3534'),
  rdflib.term.URIRef('http://www.w3.org/2004/02/skos/core#narrower'),
  rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/termmaster3535')),
 (rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/termmaster3534'),
  rdflib.term.URIRef('http://www.w3.org/2004/02/skos/core#altLabel'),
  rdflib.term.Literal('Bosnegers')),
 (rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/termmaster3534'),
  rdflib.term.URIRef('http://www.w3.org/2004/02/skos/core#notation'),
  rdflib.term.Literal('OVM.AAB.AAA.AAD.AAA.AAD.AAI.AAH.AAA')),
 (rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/termmaster3534'),
  rdflib.term.URIRef('http://www.w3.org/2004/02/skos/core#altLabel'),
  rdflib.term.Li

---
## Querying

In [40]:
q = """SELECT DISTINCT ?p
       WHERE {
          ?a ?p ?b .
       }"""

all_predicates = list(obj_graph.query(q))

type_preds = list(filter(lambda p: p[0].endswith("type"), all_predicates))

In [39]:
all_predicates

[(rdflib.term.URIRef('http://purl.org/dc/terms/medium')),
 (rdflib.term.URIRef('http://purl.org/dc/elements/1.1/identifier')),
 (rdflib.term.URIRef('http://www.europeana.eu/schemas/edm/object')),
 (rdflib.term.URIRef('http://purl.org/dc/elements/1.1/description')),
 (rdflib.term.URIRef('http://www.europeana.eu/schemas/edm/type')),
 (rdflib.term.URIRef('http://purl.org/dc/elements/1.1/type')),
 (rdflib.term.URIRef('http://purl.org/dc/elements/1.1/title')),
 (rdflib.term.URIRef('http://www.europeana.eu/schemas/edm/rights')),
 (rdflib.term.URIRef('http://www.europeana.eu/schemas/edm/isRelatedTo')),
 (rdflib.term.URIRef('http://purl.org/dc/elements/1.1/subject')),
 (rdflib.term.URIRef('http://purl.org/dc/terms/created')),
 (rdflib.term.URIRef('http://purl.org/dc/terms/spatial')),
 (rdflib.term.URIRef('file:///home/valentin/Desktop/SABIO/data/objects/exhibition')),
 (rdflib.term.URIRef('http://www.europeana.eu/schemas/edm/isShownAt')),
 (rdflib.term.URIRef('http://www.europeana.eu/schemas/e

In [35]:
all_predicates[0][0]

rdflib.term.URIRef('http://purl.org/dc/terms/medium')

In [17]:
list(obj_graph.predicates())

[rdflib.term.URIRef('http://purl.org/dc/terms/medium'),
 rdflib.term.URIRef('http://purl.org/dc/elements/1.1/identifier'),
 rdflib.term.URIRef('http://www.europeana.eu/schemas/edm/object'),
 rdflib.term.URIRef('http://purl.org/dc/elements/1.1/description'),
 rdflib.term.URIRef('http://www.europeana.eu/schemas/edm/type'),
 rdflib.term.URIRef('http://www.europeana.eu/schemas/edm/type'),
 rdflib.term.URIRef('http://purl.org/dc/elements/1.1/type'),
 rdflib.term.URIRef('http://purl.org/dc/elements/1.1/title'),
 rdflib.term.URIRef('http://www.europeana.eu/schemas/edm/rights'),
 rdflib.term.URIRef('http://www.europeana.eu/schemas/edm/isRelatedTo'),
 rdflib.term.URIRef('http://purl.org/dc/elements/1.1/subject'),
 rdflib.term.URIRef('http://purl.org/dc/terms/created'),
 rdflib.term.URIRef('http://purl.org/dc/elements/1.1/type'),
 rdflib.term.URIRef('http://www.europeana.eu/schemas/edm/isRelatedTo'),
 rdflib.term.URIRef('http://purl.org/dc/elements/1.1/description'),
 rdflib.term.URIRef('http://

In [20]:
{s for s, p, o in obj_graph if p == rdflib.term.URIRef('http://purl.org/dc/elements/1.1/type')}

{rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/883418'),
 rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/883047'),
 rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/587852'),
 rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/587558'),
 rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/1130864'),
 rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/506547'),
 rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/1130517'),
 rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/51447'),
 rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/1176597'),
 rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/898192'),
 rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/898407'),
 rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/506420'),
 rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/883216'),
 rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/1130931'),
 rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/85

# Legacy Code

In [None]:
granman_photo = rdflib.term.URIRef('https://hdl.handle.net/20.500.11840/206868')

granman_triples = list(obj_graph.triples((granman_photo, None, None))) + list(obj_graph.triples((None, None, granman_photo)))

# Terms from *Words Matter*

In [2]:
import pandas as pd

In [5]:
terms = pd.read_excel("Termen Words Matter (002).xlsx", engine='openpyxl')

In [10]:
terms["woord NL"]

0            Aboriginal
1               Afkomst
2            Allochtoon
3               Barbaar
4              Bediende
5                Berber
6                 Blank
7                Bombay
8              Bosneger
9     (De) Derde Wereld
10                Dwerg
11               Eskimo
12           Etniciteit
13             Exotisch
14                  Gay
15             Gekleurd
16            Halfbloed
17             Handicap
18         Hermafrodiet
19                 Homo
20            Hottentot
21           Inboorling
22              Indiaan
23              Indisch
24                 Indo
25              Inheems
26             Inlander
27             Islamiet
28         Jappenkampen
29               Kaffer
30           Kaukasisch
31               Koelie
32        Koppensneller
33               Marron
34          Medicijnman
35          Mohammedaan
36              Mongool
37                 Moor
38                Mulat
39                  NaN
40                Neger
41            On

In [12]:
terms[38:340]

Unnamed: 0,nummer,woord NL,woord UK,paginanr NL,paginanr Engels,toelichting NL,toelichting UK,voorkeurs alternatief NL,voorkeursalternatief UK,relatie met,relatie met.8,relatie met .1,relatie met.1,relatie met.2,relatie met.3,relatie met.4,relatie met.5,relatie met.6,relatie met.7
38,39,Mulat,Mulatto,132.0,127.0,Sinds de 17de eeuw verwijst ‘mulat’ naar de ee...,"Since the 17th century, “mulatto” refers to fi...",•\tDe term kan gebruikt worden in een historis...,•\tThe term “mulatto” can be used in a descrip...,17.0,25.0,,,,,,,,
39,40,,Native,,128.0,,The term “native” derives from the Latin word ...,,Should be used with caution,1.0,2.0,3.0,12.0,23.0,26.0,27.0,,,
40,41,Neger,Negro,133.0,129.0,"‘Neger’ komt van het Latijnse woord ‘niger’, d...","This term derives from the Latin word “niger,”...",•\tHet gebruik van deze term wordt afgeraden i...,•\tBlack \n•\tThis term is not recommended for...,9.0,16.0,30.0,34.0,48.0,57.0,,,,
41,42,Ontdekken,Discover,134.0,103.0,De term ‘ontdekken’ kan op een neutrale manier...,"“Discover” can be used in a neutral manner, fo...",•\tEen zinsconstructie als “was de eerste Euro...,•\tPhrases like “was the first European to rea...,,,,,,,,,,
42,43,Oriëntaals,Oriental,135.0,130.0,De term ‘Oriëntaals’ komt van het Latijnse woo...,This term derives from the Latin word “Oriënt”...,•\tAziatisch\n•\tHet is echter beter om de spe...,•\t“Asian”\n•\tThe use of more specific terms ...,14.0,54.0,,,,,,,,
43,44,Politionele actie,Politionele actie,136.0,131.0,Met ‘politionele acties’ worden de grootschali...,This phrase refers to the large-scale military...,•\tEr is geen consensus over alternatieve term...,•\tThere is no consensus on alternative terms....,,,,,,,,,,
44,45,Primitief,Primitieve,137.0,132.0,‘Primitief’ komt van het Latijnse woord primit...,Primitive derives from the Latin word primitiv...,•\tDe term kan gebruikt worden in een historis...,•\tThe term is not recommended for use.\n•\tTh...,4.0,21.0,22.0,33.0,51.0,,,,,
45,46,Pygmee,Pygmy,138.0,133.0,‘Pygmee’ wordt in de antropologie gebruikt voo...,“Pygmy” is a term used in anthropology to desc...,"•\t‘Pygmee’ is beledigend, en kan beter vermed...",•\t“Pygmy” is derogatory and should therefore ...,11.0,,,,,,,,,
46,47,Queer,Queer,139.0,134.0,Vooral sinds de jaren 1980 heeft ‘queer’ gedie...,"Particularly since the 1980s, “queer” has serv...",•LGBT ...,•\tLGBT\n- Use terminology and pronouns that a...,15.0,20.0,53.0,,,,,,,
47,48,Ras,Race,140.0,135.0,‘Ras’ is een veelbesproken term die verwijst n...,“Race” is a debated term that refers to the ca...,•\tEr is geen alternatief voor deze term. Door...,•\tThere is no easy alternative for this term....,7.0,13.0,16.0,17.0,25.0,27.0,31.0,41.0,55.0,57.0
