# Use the Getty ULAN SPARQL endpoint to correct painters' data
- Test run on 2023-06-16; code added on 2023-07-03 and 2024-05-20
- SPARQL (“SPARQL Protocol And RDF Query Language”) is a W3C standard for querying RDF and can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware
- SPARQLWrapper is a simple Python wrapper around a SPARQL service for remote query execution. Not only does it enable us to write more complex queries to extract information from RDF than those exposed through a library like rdflib, it can also convert query results into other formats like JSON and CSV!

## Literature
- https://rebeccabilbro.github.io/sparql-from-python/
- https://groups.google.com/g/gettyvocablod/c/mSnqx3rd8lM/m/LKPstWJyAwAJ
- https://sparqlwrapper.readthedocs.io/en/stable/main.html
- https://github.com/RDFLib/sparqlwrapper/blob/master/scripts/example.py


## CSV example
```
sparql.setReturnFormat(CSV)
results = sparql.query().convert()
print(results)
```
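A minimal sketch of how the CSV result could be turned into rows, assuming `results` is the undecoded UTF-8 bytes body of the response. The helper name `csv_bytes_to_rows` and the sample bytes are hypothetical and only illustrate the shape of the data.

```python
import csv
import io


def csv_bytes_to_rows(raw: bytes) -> list[dict[str, str]]:
    """Decode a SPARQL CSV response body into a list of row dicts."""
    text = raw.decode("utf-8")
    # newline="" leaves line endings untouched so the csv module handles them
    return list(csv.DictReader(io.StringIO(text, newline="")))


# Hypothetical sample shaped like a CSV response (not real endpoint output)
sample = b"item,itemLabel\r\nhttp://www.wikidata.org/entity/Q249114,salsa\r\n"
rows = csv_bytes_to_rows(sample)
print(rows[0]["itemLabel"])
```

The same bytes could also be handed to `pd.read_csv(io.BytesIO(raw))` if a DataFrame is preferred.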

# Import

## Import libraries

In [1]:
import pickle
import pandas as pd
from SPARQLWrapper import SPARQLWrapper, JSON, CSV

# Tests

## Test | SPARQLWrapper on Wikidata data

In [4]:
sparql = SPARQLWrapper("https://query.wikidata.org/sparql")

# Below we SELECT both the hot sauce items & their labels
# in the WHERE clause we specify that we want labels as well as items
sparql.setQuery("""
SELECT ?item ?itemLabel

WHERE {
  ?item wdt:P279 wd:Q522171.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
""")
# sparql.setReturnFormat(CSV)
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

results_df = pd.json_normalize(results['results']['bindings'])

results_df.head()

Unnamed: 0,item.type,item.value,itemLabel.xml:lang,itemLabel.type,itemLabel.value
0,uri,http://www.wikidata.org/entity/Q249114,en,literal,salsa
1,uri,http://www.wikidata.org/entity/Q335016,en,literal,Tabasco sauce
2,uri,http://www.wikidata.org/entity/Q360459,en,literal,Adobo
3,uri,http://www.wikidata.org/entity/Q460439,en,literal,Blair's 16 Million Reserve
4,uri,http://www.wikidata.org/entity/Q736782,en,literal,Llajua
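`pd.json_normalize` keeps every field of each binding (`.type`, `.xml:lang`, `.value`), which is why the table above has five columns for two variables. If only the values are needed, a small helper can flatten the bindings first. This is a sketch with a hypothetical helper name (`bindings_to_rows`) and a hand-written sample in the SPARQL JSON results layout.

```python
def bindings_to_rows(results: dict) -> list[dict]:
    """Keep only the 'value' of each bound variable; unbound variables become None."""
    variables = results["head"]["vars"]
    rows = []
    for binding in results["results"]["bindings"]:
        rows.append({var: binding.get(var, {}).get("value") for var in variables})
    return rows


# Hand-written sample in the SPARQL JSON results layout (illustrative only)
sample = {
    "head": {"vars": ["item", "itemLabel"]},
    "results": {"bindings": [
        {"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q249114"},
         "itemLabel": {"xml:lang": "en", "type": "literal", "value": "salsa"}},
        # second row has no label bound, as can happen with OPTIONAL patterns
        {"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q335016"}},
    ]},
}
rows = bindings_to_rows(sample)
```

`pd.DataFrame(rows)` then gives one column per query variable.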


## Test | SPARQLWrapper on ULAN endpoint, works
- Retrieves the URIs of all members of the Persons, Artists facet (ULAN).

In [5]:
%%time

sparql = SPARQLWrapper("http://vocab.getty.edu/sparql")

# Below we SELECT every member (?p) of ULAN facet 500000002 (Persons, Artists)
sparql.setQuery("""
SELECT * WHERE { ulan:500000002 skos:member ?p . }
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

CPU times: total: 1.17 s
Wall time: 15.1 s


In [6]:
results_df = pd.json_normalize(results['results']['bindings'])
results_df

Unnamed: 0,p.type,p.value
0,uri,http://vocab.getty.edu/ulan/500771934
1,uri,http://vocab.getty.edu/ulan/500557119
2,uri,http://vocab.getty.edu/ulan/500632715
3,uri,http://vocab.getty.edu/ulan/500546545
4,uri,http://vocab.getty.edu/ulan/500108834
...,...,...
286915,uri,http://vocab.getty.edu/ulan/500077242
286916,uri,http://vocab.getty.edu/ulan/500426555
286917,uri,http://vocab.getty.edu/ulan/500548610
286918,uri,http://vocab.getty.edu/ulan/500602361
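The `p.value` column holds full URIs. A sketch of how the numeric subject id could be extracted and turned into a link to the ULAN full-display page, reusing the URL pattern that appears elsewhere in this notebook. The helper names `ulan_id` and `ulan_page_url` are hypothetical.

```python
# URL pattern taken from the ULAN full-display link used in this notebook
VOW_URL = ("https://www.getty.edu/vow/ULANFullDisplay"
           "?find=&role=&nation=&page=1&subjectid={subject_id}")


def ulan_id(uri: str) -> str:
    """Return the numeric subject id at the end of a ULAN URI."""
    subject_id = uri.rstrip("/").rsplit("/", 1)[-1]
    if not subject_id.isdigit():
        raise ValueError(f"not a ULAN subject URI: {uri!r}")
    return subject_id


def ulan_page_url(uri: str) -> str:
    """Build the human-readable ULAN page URL for a subject URI."""
    return VOW_URL.format(subject_id=ulan_id(uri))


print(ulan_page_url("http://vocab.getty.edu/ulan/500771934"))
```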


# Getty ULAN SPARQL queries via SPARQLWrapper, all functional

## Getty, full query, returns all subjects within ULAN
- https://www.getty.edu/vow/ULANFullDisplay?find=&role=&nation=&page=1&subjectid=500000002

**Persons, Artists (ULAN facet)** Note: Records under this level represent information for individuals involved in the creation or production of works of fine art or architecture, for example painters, sculptors, printmakers, and architects. Included are individuals whose biographies are well known (e.g., Rembrandt van Rijn (Dutch painter and printmaker, 1606-1669)) as well as anonymous creators with identified oeuvres but whose names are unknown and whose biography is surmised (e.g., Master of Alkmaar (North Netherlandish painter, active ca. 1490-ca. 1510)). Craftsmen, artisans, engineers, and others who create visual works are included here, even if their works are not considered fine art per se. People whose primary life roles were other than "artist" or "architect," but who created or designed art or architecture in a professional or amateur capacity, are included here with a non-preferred relationship to this facet (e.g., Thomas Jefferson (American statesman, architect, and draftsman, 1743-1826)). Performance artists are included here. 

### Initial new SPARQL-query

In [None]:
%%time

# set sparql endpoint
sparql = SPARQLWrapper("http://vocab.getty.edu/sparql")

# query
sparql.setQuery("""
select ?x ?name ?bio ?nationality ?type ?ScopeNote 
{
  ?x gvp:broaderExtended ulan:500000002. # Persons, Artists
  optional {?x gvp:agentTypePreferred [gvp:prefLabelGVP [xl:literalForm ?type]]}
  optional {?x foaf:focus [gvp:nationalityPreferred [gvp:prefLabelGVP [xl:literalForm ?nationality]]]}
  optional {?x gvp:prefLabelGVP [xl:literalForm ?name]}
  optional {?x foaf:focus [gvp:biographyPreferred [schema:description ?bio]]}
  optional {?x skos:scopeNote [dct:language gvp_lang:en; rdf:value ?ScopeNote]}
  }
""")

# returns results as a json
sparql.setReturnFormat(JSON)

results = sparql.query().convert()

In [None]:
# traverses the json
results_df = pd.json_normalize(results['results']['bindings'])
results_df.head()

In [None]:
results_df.shape

In [None]:
%store results_df

In [21]:
# with open('result_json.pickle', 'wb') as handle:
#     pickle.dump(results, handle, protocol=pickle.HIGHEST_PROTOCOL)

# with open('result_json.pickle', 'rb') as handle:
#     b = pickle.load(handle)

with open('data_dumps/results_df_all_ulan.pickle', 'wb') as handle:
    pickle.dump(results_df, handle, protocol=pickle.HIGHEST_PROTOCOL)

# with open('data_dumps/results_df_all_ulan.pickle', 'rb') as handle:
#     df = pickle.load(handle)
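The `open(..., 'wb')` call above fails if the `data_dumps` folder does not exist yet. A small sketch of a save/load pair that creates the folder first; the helper names `dump_pickle` and `load_pickle` are hypothetical, and the example writes to a temporary directory so it does not touch the real dumps.

```python
import pickle
import tempfile
from pathlib import Path


def dump_pickle(obj, path) -> Path:
    """Pickle obj to path, creating missing parent folders first."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as handle:
        pickle.dump(obj, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return path


def load_pickle(path):
    """Read back an object written by dump_pickle."""
    with Path(path).open("rb") as handle:
        return pickle.load(handle)


# Round trip in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data_dumps" / "example.pickle"
    dump_pickle({"rows": 3}, target)
    restored = load_pickle(target)
```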

### Initial query

In [19]:
%%time

# set sparql endpoint
sparql = SPARQLWrapper("http://vocab.getty.edu/sparql")

# query
sparql.setQuery("""
PREFIX tgn: <http://vocab.getty.edu/tgn/>
PREFIX gvp: <http://vocab.getty.edu/ontology#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX skosxl: <http://www.w3.org/2008/05/skos-xl#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX iso: <http://purl.org/iso25964/skos-thes#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX schema: <http://schema.org/>
PREFIX aat: <http://vocab.getty.edu/aat/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

select ?p ?name ?birth ?death ?ScopeNote ?related ?rname ?rbirth ?rdeath ?relatedScopeNote
{ ulan:500000002 skos:member ?p .
optional {?p gvp:prefLabelGVP/xl:literalForm ?name;
     	foaf:focus/gvp:biographyPreferred [
       	schema:description ?bio;
       	gvp:estStart ?birth].}
optional { ?p gvp:prefLabelGVP/xl:literalForm ?name;
     	foaf:focus/gvp:biographyPreferred [
       	schema:description ?bio;
       	gvp:estEnd ?death]. }
optional {?p skos:related ?related . 
         	?related skos:scopeNote [dct:language gvp_lang:en; 
rdf:value ?relatedScopeNote]}  
optional {?related gvp:prefLabelGVP/xl:literalForm ?rname;
 	foaf:focus/gvp:biographyPreferred [
       	schema:description ?bio;
       	gvp:estStart ?rbirth].}
optional { ?related gvp:prefLabelGVP/xl:literalForm ?rname;
     	foaf:focus/gvp:biographyPreferred [
       	schema:description ?bio;
       	gvp:estEnd ?rdeath]. }
optional {?p skos:scopeNote [dct:language gvp_lang:en; rdf:value ?ScopeNote]}}
""")

# returns results as a json
sparql.setReturnFormat(JSON)

results = sparql.query().convert()

EndPointInternalError: EndPointInternalError: endpoint returned code 500 and response.
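The cause of the 500 response is not known from this run. One possible workaround, assuming the size of the result set is the problem, is to fetch the rows in pages with `LIMIT` and `OFFSET`. The sketch below keeps the paging logic separate from the endpoint: `run_query` is a hypothetical callable that takes a query string and returns a list of bindings, and `fake_run` stands in for the endpoint so the logic can be checked offline.

```python
import re
from typing import Callable, Iterator


def paged_queries(base_query: str, page_size: int) -> Iterator[str]:
    """Yield the base query with increasing LIMIT/OFFSET windows."""
    offset = 0
    while True:
        yield f"{base_query}\nLIMIT {page_size} OFFSET {offset}"
        offset += page_size


def fetch_all(base_query: str, run_query: Callable[[str], list], page_size: int = 10000) -> list:
    """Collect pages until a page comes back shorter than page_size."""
    rows = []
    for query in paged_queries(base_query, page_size):
        page = run_query(query)
        rows.extend(page)
        if len(page) < page_size:
            break
    return rows


# Offline stand-in for the endpoint: serves slices of 25 fake rows
DATA = list(range(25))
calls = []


def fake_run(query: str) -> list:
    calls.append(query)
    limit, offset = map(int, re.search(r"LIMIT (\d+) OFFSET (\d+)", query).groups())
    return DATA[offset:offset + limit]


all_rows = fetch_all("SELECT * WHERE { ulan:500000002 skos:member ?p . }", fake_run, page_size=10)
```

With SPARQLWrapper, `run_query` could set the query on `sparql`, call `sparql.query().convert()` and return `results['results']['bindings']`. Paging is only stable if the query has an `ORDER BY`, which the query above would still need.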

In [18]:
# traverses the json
results_df = pd.json_normalize(results['results']['bindings'])
results_df.head()

Unnamed: 0,p.type,p.value,bio.type,bio.value,name.xml:lang,name.type,name.value,birth.datatype,birth.type,birth.value,death.datatype,death.type,death.value
0,uri,http://vocab.getty.edu/ulan/500771934,literal,"North Netherlandish painter, 1743-1776",nl,literal,"Aa, Andries van der",http://www.w3.org/2001/XMLSchema#gYear,literal,1743,http://www.w3.org/2001/XMLSchema#gYear,literal,1776
1,uri,http://vocab.getty.edu/ulan/500557119,literal,"Moroccan graphic artist, 1958-",nl,literal,"Aabdaoui, Latif",http://www.w3.org/2001/XMLSchema#gYear,literal,1958,http://www.w3.org/2001/XMLSchema#gYear,literal,2035
2,uri,http://vocab.getty.edu/ulan/500632715,literal,"Dutch painter, born 1951",nl,literal,"Aa, Ben",http://www.w3.org/2001/XMLSchema#gYear,literal,1951,http://www.w3.org/2001/XMLSchema#gYear,literal,2080
3,uri,http://vocab.getty.edu/ulan/500546545,literal,"Danish ceramicist, 1939-",nl,literal,"Aaberg, Gunhild",http://www.w3.org/2001/XMLSchema#gYear,literal,1939,http://www.w3.org/2001/XMLSchema#gYear,literal,2080
4,uri,http://vocab.getty.edu/ulan/500108834,literal,"Norwegian architect, active mid-late 20th century",,literal,"Aabergh, Gösta",http://www.w3.org/2001/XMLSchema#gYear,literal,1920,http://www.w3.org/2001/XMLSchema#gYear,literal,2020
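In the output above, artists whose biography reads as still living (e.g. "1958-" or "born 1951") carry `death.value` years such as 2035 or 2080. This suggests that `gvp:estEnd` is an estimated end year rather than a recorded death year. Assuming that reading is correct, a sketch of a helper that treats end years in the future as unknown; the name `known_death_year` and the `today_year` parameter are hypothetical.

```python
from datetime import date
from typing import Optional


def known_death_year(est_end, today_year: Optional[int] = None) -> Optional[int]:
    """Return est_end as an int, or None when it lies in the future (assumed estimate)."""
    year = int(est_end)
    limit = date.today().year if today_year is None else today_year
    return year if year <= limit else None


print(known_death_year("1776", today_year=2024))
print(known_death_year("2080", today_year=2024))
```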


### Load and save file as a pickle

In [None]:
# with open('result_json.pickle', 'wb') as handle:
#     pickle.dump(results, handle, protocol=pickle.HIGHEST_PROTOCOL)

# with open('result_json.pickle', 'rb') as handle:
#     b = pickle.load(handle)

# with open('results_df_all_ulan.pickle', 'wb') as handle:
#     pickle.dump(results_df, handle, protocol=pickle.HIGHEST_PROTOCOL)

with open('data_dumps/results_df_all_ulan.pickle', 'rb') as handle:
    df = pickle.load(handle)

In [None]:
df.shape

In [None]:
df.head()

## ULAN on painters active in Rome between 1400 and 1800
**Rome (inhabited place)** 
Note: City positioned on 7 hills over the swampy Tiber river area; one of the oldest continuously occupied sites in Europe. Archaeological evidence attests to human occupation of the area from ca. 14,000 years ago, but the dense layer of later debris obscures Palaeolithic and Neolithic sites. Was an Etruscan city by 8th cen. BCE, their kings expelled and republic established by 500 BCE; soon ruled vast area and was center of Empire from 31 BCE; declined when capital moved to Constantinople in 330 CE; revived under popes.

### Initial SPARQL-query

In [None]:
# set sparql endpoint
sparql = SPARQLWrapper("http://vocab.getty.edu/sparql")

# query
sparql.setQuery("""
PREFIX tgn: <http://vocab.getty.edu/tgn/>
PREFIX gvp: <http://vocab.getty.edu/ontology#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX skosxl: <http://www.w3.org/2008/05/skos-xl#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX iso: <http://purl.org/iso25964/skos-thes#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX schema: <http://schema.org/>
PREFIX aat: <http://vocab.getty.edu/aat/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>


SELECT ?id ?name ?bio ?birth ?death {
{SELECT DISTINCT ?id
         {?id foaf:focus/bio:event/(schema:location|(schema:location/gvp:broaderExtended)) tgn:7000874-place}}
OPTIONAL { ?id gvp:prefLabelGVP/xl:literalForm ?name;
          foaf:focus/gvp:biographyPreferred [
          schema:description ?bio;
          gvp:estStart ?birth] . }
OPTIONAL { ?id gvp:prefLabelGVP/xl:literalForm ?name;
          foaf:focus/gvp:biographyPreferred [
		  schema:description ?bio;
          gvp:estEnd ?death] . }
FILTER ("1000"^^xsd:gYear <= ?birth && ?birth <= "1800"^^xsd:gYear)
}

""")

# returns results as a json
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

# traverses the json
results_df = pd.json_normalize(results['results']['bindings'])
results_df

In [None]:
results_df

### Load and save file as a pickle

In [None]:
# dumps df as a pickle file
with open('results_df_roman.pickle', 'wb') as handle:
    pickle.dump(results_df, handle, protocol=pickle.HIGHEST_PROTOCOL)
    
# reads pickle file
with open('results_df_roman.pickle', 'rb') as handle:
    df = pickle.load(handle)

In [None]:
df.head()

### Subselect data

In [None]:
# subset cols; .copy() so the assignments below modify an independent frame
df = df[['related.value', 'rname.value', 'rbirth.value', 'rdeath.value']].copy()

# change datatype
df['rbirth.value'] = df['rbirth.value'].astype(float)
df['rdeath.value'] = df['rdeath.value'].astype(float)

# drop rows with a missing birth or death year
df = df[(df['rdeath.value'].notnull()) &
        (df['rbirth.value'].notnull())]

# subselection on active painters
df = df[(df['rdeath.value'].astype(int) < 1775) &
        (df['rbirth.value'].astype(int) > 1400)]

# split the name string into its comma-separated parts
df[['last_name','first_name','addition','comment']] = df['rname.value'].str.split(', ', expand=True)

In [None]:
df
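The four-column assignment above only works when the split yields exactly four parts for the frame as a whole. A sketch of a more forgiving split that always returns four parts, padding short names with `None` and keeping any surplus commas in the last part. The helper name `split_name` is hypothetical.

```python
def split_name(name: str, parts: int = 4) -> list:
    """Split 'Last, First, ...' into exactly `parts` pieces, padding with None."""
    pieces = name.split(", ", parts - 1)
    return pieces + [None] * (parts - len(pieces))


print(split_name("Aa, Andries van der"))
```

Applied to the frame, this could look like `pd.DataFrame(df['rname.value'].map(split_name).tolist(), index=df.index, columns=['last_name', 'first_name', 'addition', 'comment'])`, assuming the column contains no missing values at that point.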

## ULAN on painters active in Europe between 1400 and 1800
**Europe (continent)** Note: Europe was an early home to Homo erectus, Neanderthal and modern Homo sapiens. Agricultural settlements arose in the 7th millennium BCE. Advanced civilizations developed here in the Mediterranean area with Asian and African influences. Modern European nations began to form after the fall of the Western Roman Empire in the fifth century CE. The late 20th and early 21st centuries have witnessed the formation and expansion of the European Union, a group of European nations allied for trade and some administrative purposes, such as a common currency. 

In [None]:
# set sparql endpoint
sparql = SPARQLWrapper("http://vocab.getty.edu/sparql")

# query
sparql.setQuery("""
PREFIX tgn: <http://vocab.getty.edu/tgn/>
PREFIX gvp: <http://vocab.getty.edu/ontology#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX skosxl: <http://www.w3.org/2008/05/skos-xl#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX iso: <http://purl.org/iso25964/skos-thes#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX schema: <http://schema.org/>
PREFIX aat: <http://vocab.getty.edu/aat/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>


SELECT ?id ?name ?bio ?birth ?death {
{SELECT DISTINCT ?id
         {?id foaf:focus/bio:event/(schema:location|(schema:location/gvp:broaderExtended)) tgn:1000003-place}}
OPTIONAL { ?id gvp:prefLabelGVP/xl:literalForm ?name;
          foaf:focus/gvp:biographyPreferred [
          schema:description ?bio;
          gvp:estStart ?birth] . }
OPTIONAL { ?id gvp:prefLabelGVP/xl:literalForm ?name;
          foaf:focus/gvp:biographyPreferred [
		  schema:description ?bio;
          gvp:estEnd ?death] . }
FILTER ("1400"^^xsd:gYear <= ?birth && ?birth <= "1800"^^xsd:gYear)}

""")

# returns results as a json
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

# traverses the json
results_df = pd.json_normalize(results['results']['bindings'])
results_df

## ULAN on artists active between 1400 and 1750

In [None]:
# set sparql endpoint
sparql = SPARQLWrapper("http://vocab.getty.edu/sparql")

# query
sparql.setQuery("""
PREFIX tgn: <http://vocab.getty.edu/tgn/>
PREFIX gvp: <http://vocab.getty.edu/ontology#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX skosxl: <http://www.w3.org/2008/05/skos-xl#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX iso: <http://purl.org/iso25964/skos-thes#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX schema: <http://schema.org/>
PREFIX aat: <http://vocab.getty.edu/aat/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>


SELECT ?id ?name ?bio ?place ?birth ?death {
{SELECT DISTINCT ?id ?place
         {?id foaf:focus/bio:event/(schema:location|(schema:location/gvp:broaderExtended)) ?place}}
OPTIONAL { ?id gvp:prefLabelGVP/xl:literalForm ?name;
          foaf:focus/gvp:biographyPreferred [
          schema:description ?bio;
          gvp:estStart ?birth] . }
OPTIONAL { ?id gvp:prefLabelGVP/xl:literalForm ?name;
          foaf:focus/gvp:biographyPreferred [
		  schema:description ?bio;
          gvp:estEnd ?death] . }
FILTER ("1400"^^xsd:gYear <= ?birth && ?birth <= "1750"^^xsd:gYear)}
""")

# returns results as a json
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

# traverses the json
results_df = pd.json_normalize(results['results']['bindings'])
results_df
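The place-based queries above differ only in the TGN place id and the birth-year bounds. A sketch of a builder that fills those three values into one template. The names `ACTIVE_QUERY` and `build_active_query` are hypothetical, and like the cells above the query relies on prefixes (`bio:`, `xl:`, `xsd:`) that the Getty endpoint resolves without a `PREFIX` declaration.

```python
from string import Template

# string.Template is used instead of str.format so the SPARQL braces need no escaping
ACTIVE_QUERY = Template("""
SELECT ?id ?name ?bio ?birth ?death {
  {SELECT DISTINCT ?id
     {?id foaf:focus/bio:event/(schema:location|(schema:location/gvp:broaderExtended)) tgn:${place_id}-place}}
  OPTIONAL { ?id gvp:prefLabelGVP/xl:literalForm ?name;
             foaf:focus/gvp:biographyPreferred [
               schema:description ?bio;
               gvp:estStart ?birth] . }
  OPTIONAL { ?id gvp:prefLabelGVP/xl:literalForm ?name;
             foaf:focus/gvp:biographyPreferred [
               schema:description ?bio;
               gvp:estEnd ?death] . }
  FILTER ("${start}"^^xsd:gYear <= ?birth && ?birth <= "${end}"^^xsd:gYear)
}
""")


def build_active_query(place_id: int, start: int, end: int) -> str:
    """Fill the TGN place id and the inclusive birth-year range into the template."""
    if start > end:
        raise ValueError("start year must not be after end year")
    return ACTIVE_QUERY.substitute(
        place_id=int(place_id),
        start=f"{int(start):04d}",  # xsd:gYear literals use four digits
        end=f"{int(end):04d}",
    )


rome_query = build_active_query(7000874, 1400, 1800)  # 7000874 is the Rome id used above
```

The resulting string can be passed to `sparql.setQuery(...)` in place of the hand-written query.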

## AAT on Associated Concepts Facet

In [2]:
%%time
# set sparql endpoint
sparql = SPARQLWrapper("http://vocab.getty.edu/sparql")

# query
sparql.setQuery("""
PREFIX tgn: <http://vocab.getty.edu/tgn/>
PREFIX gvp: <http://vocab.getty.edu/ontology#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX skosxl: <http://www.w3.org/2008/05/skos-xl#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX iso: <http://purl.org/iso25964/skos-thes#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX schema: <http://schema.org/>
PREFIX aat: <http://vocab.getty.edu/aat/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

select distinct ?Subject ?ScopeNote ?Term
{
  ?Subject gvp:broaderExtended aat:300264086. # Associated Concepts Facet
   optional {?Subject skos:scopeNote [dct:language gvp_lang:en; rdf:value ?ScopeNote]}
   optional {?Subject gvp:prefLabelGVP [xl:literalForm ?Term]}
   }
""")

# returns results as a json
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

CPU times: total: 172 ms
Wall time: 3.28 s


In [4]:
# traverses the json
results_df = pd.json_normalize(results['results']['bindings'])
results_df.head()

Unnamed: 0,Subject.type,Subject.value,ScopeNote.xml:lang,ScopeNote.type,ScopeNote.value,Term.xml:lang,Term.type,Term.value
0,uri,http://vocab.getty.edu/aat/300189559,en,literal,Referring to the sex that in reproduction norm...,en,literal,male
1,uri,http://vocab.getty.edu/aat/300189557,en,literal,Referring to the sex that normally produces eg...,en,literal,female
2,uri,http://vocab.getty.edu/aat/300451703,en,literal,Motility or other conditions that limit a pers...,en,literal,physical disabilities
3,uri,http://vocab.getty.edu/aat/300266528,en,literal,"Persian wool carpets made in Herat, characteri...",en,literal,Herat carpets
4,uri,http://vocab.getty.edu/aat/300056330,en,literal,Pictorial narrative device featuring two or mo...,en,literal,continuous narration


In [5]:
results_df.shape

(12983, 8)

In [11]:
%store results_df

Stored 'results_df' (DataFrame)


In [9]:
with open('data_dumps/results_df_aat_300264086.pickle', 'wb') as handle:
    pickle.dump(results_df, handle, protocol=pickle.HIGHEST_PROTOCOL)