# Use the Getty SPARQL endpoint to correct data
- Test run on 2023-06-16; code added on 2023-07-03 and from 2024-05-20 through 2024-06-03
- SPARQL (“SPARQL Protocol And RDF Query Language”) is a W3C standard for querying RDF and can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware
- SPARQLWrapper is a simple Python wrapper around a SPARQL service for remote query execution. Not only does it enable us to write more complex queries to extract information from RDF than those exposed through a library like rdflib, it can also convert query results into other formats like JSON and CSV!
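
The JSON that `convert()` returns follows the SPARQL 1.1 JSON results layout: `results.bindings` holds one entry per row, and each bound variable maps to a dict with `type` and `value`. A minimal sketch of flattening that structure without pandas, using a hand-written sample payload; the helper name `bindings_to_records` is hypothetical and not part of SPARQLWrapper:

```python
def bindings_to_records(results):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    variables = results["head"]["vars"]
    records = []
    for binding in results["results"]["bindings"]:
        # unbound variables (e.g. from OPTIONAL) are simply missing from the binding
        records.append({var: binding.get(var, {}).get("value") for var in variables})
    return records


# hand-written sample in the shape of a SPARQL JSON response
sample = {
    "head": {"vars": ["item", "itemLabel"]},
    "results": {"bindings": [
        {"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q249114"},
         "itemLabel": {"type": "literal", "xml:lang": "en", "value": "salsa"}},
        {"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q335016"}},
    ]},
}

records = bindings_to_records(sample)
print(records[0]["itemLabel"])  # salsa
print(records[1]["itemLabel"])  # None
```

The list of dicts can be handed to `pd.DataFrame(records)` when the per-variable `type` and `xml:lang` columns are not needed.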

## Literature
- https://rebeccabilbro.github.io/sparql-from-python/
- https://groups.google.com/g/gettyvocablod/c/mSnqx3rd8lM/m/LKPstWJyAwAJ
- https://sparqlwrapper.readthedocs.io/en/stable/main.html
- https://github.com/RDFLib/sparqlwrapper/blob/master/scripts/example.py


## CSV example
```python
sparql.setReturnFormat(CSV)
results = sparql.query().convert()
print(results)
```
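
With `CSV` as the return format, `convert()` hands back the raw response body rather than a parsed object (bytes in recent SPARQLWrapper versions, as far as I can tell; treat that as an assumption). A minimal sketch of turning such a payload into rows with the standard library; the sample bytes and the helper name `csv_to_rows` are made up for illustration:

```python
import csv
import io


def csv_to_rows(payload):
    """Parse a SPARQL CSV response (bytes or str) into a list of dicts."""
    text = payload.decode("utf-8") if isinstance(payload, bytes) else payload
    return list(csv.DictReader(io.StringIO(text)))


# hand-written sample standing in for sparql.query().convert()
sample = b"item,itemLabel\r\nhttp://www.wikidata.org/entity/Q249114,salsa\r\n"

rows = csv_to_rows(sample)
print(rows[0]["itemLabel"])  # salsa
```

If a DataFrame is wanted instead, the same bytes can be passed to `pd.read_csv(io.BytesIO(payload))`.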

# Import

## Import libraries

In [1]:
# progress bar
# import the notebook submodule explicitly; tqdm.notebook.tqdm is used further down
import tqdm.notebook

# save and dump
import pickle

# data wrangling
import pandas as pd

# use sparqlwrapper to access getty endpoints
from SPARQLWrapper import SPARQLWrapper, JSON, CSV

# suppress warnings raised by the code block that splits and compares lists
import warnings
warnings.filterwarnings('ignore')

# Tests

## Test | SPARQLWrapper on Wikidata

In [4]:
sparql = SPARQLWrapper("https://query.wikidata.org/sparql")

# Below we SELECT both the hot sauce items & their labels
# in the WHERE clause we specify that we want labels as well as items
sparql.setQuery("""
SELECT ?item ?itemLabel

WHERE {
  ?item wdt:P279 wd:Q522171.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
""")
# sparql.setReturnFormat(CSV)
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

results_df = pd.json_normalize(results['results']['bindings'])

results_df.head()

Unnamed: 0,item.type,item.value,itemLabel.xml:lang,itemLabel.type,itemLabel.value
0,uri,http://www.wikidata.org/entity/Q249114,en,literal,salsa
1,uri,http://www.wikidata.org/entity/Q335016,en,literal,Tabasco sauce
2,uri,http://www.wikidata.org/entity/Q360459,en,literal,Adobo
3,uri,http://www.wikidata.org/entity/Q460439,en,literal,Blair's 16 Million Reserve
4,uri,http://www.wikidata.org/entity/Q736782,en,literal,Llajua


## Test | SPARQLWrapper on ULAN endpoint, works
- Retrieves the URIs of all members of the Persons, Artists facet (ULAN).

In [5]:
%%time

sparql = SPARQLWrapper("http://vocab.getty.edu/sparql")

# Below we SELECT every member (?p) of ulan:500000002,
# the Persons, Artists facet
sparql.setQuery("""
SELECT * WHERE { ulan:500000002 skos:member ?p . }
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

CPU times: total: 1.17 s
Wall time: 15.1 s


In [6]:
results_df = pd.json_normalize(results['results']['bindings'])
results_df

Unnamed: 0,p.type,p.value
0,uri,http://vocab.getty.edu/ulan/500771934
1,uri,http://vocab.getty.edu/ulan/500557119
2,uri,http://vocab.getty.edu/ulan/500632715
3,uri,http://vocab.getty.edu/ulan/500546545
4,uri,http://vocab.getty.edu/ulan/500108834
...,...,...
286915,uri,http://vocab.getty.edu/ulan/500077242
286916,uri,http://vocab.getty.edu/ulan/500426555
286917,uri,http://vocab.getty.edu/ulan/500548610
286918,uri,http://vocab.getty.edu/ulan/500602361
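
`pd.json_normalize` produces one column per key of each binding, so every variable shows up as `var.type` and `var.value` (plus `var.xml:lang` or `var.datatype` where present). A minimal sketch of deriving a rename mapping that keeps only the values; `value_column_map` is a hypothetical helper and works on plain column names:

```python
def value_column_map(columns):
    """Map 'var.value' column names to 'var'; other columns are left out."""
    suffix = ".value"
    return {col: col[: -len(suffix)] for col in columns if col.endswith(suffix)}


columns = ["p.type", "p.value", "name.xml:lang", "name.type", "name.value"]
mapping = value_column_map(columns)
print(mapping)  # {'p.value': 'p', 'name.value': 'name'}
```

Applied to a result frame this would read `results_df[list(mapping)].rename(columns=mapping)`.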


# Getty SPARQL queries via SPARQLWrapper, all queries are successful

## ULAN | Full query, returns all subjects within ULAN
- https://www.getty.edu/vow/ULANFullDisplay?find=&role=&nation=&page=1&subjectid=500000002

**Persons, Artists (ULAN facet)** Note: Records under this level represent information for individuals involved in the creation or production of works of fine art or architecture, for example painters, sculptors, printmakers, and architects. Included are individuals whose biographies are well known (e.g., Rembrandt van Rijn (Dutch painter and printmaker, 1606-1669)) as well as anonymous creators with identified oeuvres but whose names are unknown and whose biography is surmised (e.g., Master of Alkmaar (North Netherlandish painter, active ca. 1490-ca. 1510)). Craftsmen, artisans, engineers, and others who create visual works are included here, even if their works are not considered fine art per se. People whose primary life roles were other than "artist" or "architect," but who created or designed art or architecture in a professional or amateur capacity, are included here with a non-preferred relationship to this facet (e.g., Thomas Jefferson (American statesman, architect, and draftsman, 1743-1826)). Performance artists are included here. 

### New SPARQL-query

In [12]:
%%time

# set sparql endpoint
sparql = SPARQLWrapper("http://vocab.getty.edu/sparql")

# query
sparql.setQuery("""
select ?x ?name ?bio ?nationality ?type ?ScopeNote 
{
  ?x gvp:broaderExtended ulan:500000002. # Persons, Artists
  optional {?x gvp:agentTypePreferred [gvp:prefLabelGVP [xl:literalForm ?type]]}
  optional {?x foaf:focus [gvp:nationalityPreferred [gvp:prefLabelGVP [xl:literalForm ?nationality]]]}
  optional {?x gvp:prefLabelGVP [xl:literalForm ?name]}
  optional {?x foaf:focus [gvp:biographyPreferred [schema:description ?bio]]}
  optional {?x skos:scopeNote [dct:language gvp_lang:en; rdf:value ?ScopeNote]}
  }
""")

# returns results as a json
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

# traverses the json
results_df = pd.json_normalize(results['results']['bindings'])
results_df.head()

CPU times: total: 9.17 s
Wall time: 1min 34s


Unnamed: 0,x.type,x.value,name.xml:lang,name.type,name.value,bio.type,bio.value,nationality.xml:lang,nationality.type,nationality.value,type.xml:lang,type.type,type.value,ScopeNote.xml:lang,ScopeNote.type,ScopeNote.value
0,uri,http://vocab.getty.edu/ulan/500771934,nl,literal,"Aa, Andries van der",literal,"North Netherlandish painter, 1743-1776",en,literal,North Netherlandish,en,literal,painters (artists),,,
1,uri,http://vocab.getty.edu/ulan/500557119,nl,literal,"Aabdaoui, Latif",literal,"Moroccan graphic artist, 1958-",en,literal,Moroccan,en,literal,graphic artists,,,
2,uri,http://vocab.getty.edu/ulan/500632715,nl,literal,"Aa, Ben",literal,"Dutch painter, born 1951",en,literal,Dutch (culture or style),en,literal,painters (artists),,,
3,uri,http://vocab.getty.edu/ulan/500546545,nl,literal,"Aaberg, Gunhild",literal,"Danish ceramicist, 1939-",en,literal,Danish (culture or style),en,literal,ceramicists,,,
4,uri,http://vocab.getty.edu/ulan/500108834,,literal,"Aabergh, Gösta",literal,"Norwegian architect, active mid-late 20th century",en,literal,Norwegian (culture),en,literal,architects,en,literal,Norwegian architect.


In [13]:
with open('data_dumps/results_df_ulan_all.pickle', 'wb') as handle:
    pickle.dump(results_df, handle, protocol=pickle.HIGHEST_PROTOCOL)
    
# with open('data_dumps/results_df_all_ulan.pickle', 'rb') as handle:
#     df = pickle.load(handle)

In [None]:
results_df.shape

In [None]:
%store results_df

### Initial query

In [19]:
%%time

# set sparql endpoint
sparql = SPARQLWrapper("http://vocab.getty.edu/sparql")

# query
sparql.setQuery("""
PREFIX tgn: <http://vocab.getty.edu/tgn/>
PREFIX gvp: <http://vocab.getty.edu/ontology#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX skosxl: <http://www.w3.org/2008/05/skos-xl#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX iso: <http://purl.org/iso25964/skos-thes#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX schema: <http://schema.org/>
PREFIX aat: <http://vocab.getty.edu/aat/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

select ?p ?name ?birth ?death ?ScopeNote ?related ?rname ?rbirth ?rdeath ?relatedScopeNote
{ ulan:500000002 skos:member ?p .
optional {?p gvp:prefLabelGVP/xl:literalForm ?name;
     	foaf:focus/gvp:biographyPreferred [
       	schema:description ?bio;
       	gvp:estStart ?birth].}
optional { ?p gvp:prefLabelGVP/xl:literalForm ?name;
     	foaf:focus/gvp:biographyPreferred [
       	schema:description ?bio;
       	gvp:estEnd ?death]. }
optional {?p skos:related ?related . 
         	?related skos:scopeNote [dct:language gvp_lang:en; 
rdf:value ?relatedScopeNote]}  
optional {?related gvp:prefLabelGVP/xl:literalForm ?rname;
 	foaf:focus/gvp:biographyPreferred [
       	schema:description ?bio;
       	gvp:estStart ?rbirth].}
optional { ?related gvp:prefLabelGVP/xl:literalForm ?rname;
     	foaf:focus/gvp:biographyPreferred [
       	schema:description ?bio;
       	gvp:estEnd ?rdeath]. }
optional {?p skos:scopeNote [dct:language gvp_lang:en; rdf:value ?ScopeNote]}}
""")

# returns results as a json
sparql.setReturnFormat(JSON)

results = sparql.query().convert()

EndPointInternalError: EndPointInternalError: endpoint returned code 500 and response.
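
The query above ended in a 500 response. One common workaround for heavy queries is to fetch the result in pages with `LIMIT` and `OFFSET`; whether the Getty endpoint accepts this for a given query, and which page size works, is an assumption that would need testing. A minimal sketch that only builds the paged query strings (`paged_queries` is a hypothetical helper):

```python
def paged_queries(base_query, page_size, pages):
    """Yield copies of base_query with LIMIT/OFFSET appended, one per page."""
    base = base_query.rstrip()
    for page in range(pages):
        yield f"{base}\nLIMIT {page_size} OFFSET {page * page_size}"


# ORDER BY keeps the pages stable between requests, at some extra cost
base = "SELECT ?p WHERE { ulan:500000002 skos:member ?p . } ORDER BY ?p"
queries = list(paged_queries(base, page_size=10000, pages=3))
print(queries[1])
```

Each string can be passed to `sparql.setQuery()` in turn; a page that returns fewer than `page_size` bindings marks the end of the result.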

In [18]:
# traverses the json
results_df = pd.json_normalize(results['results']['bindings'])
results_df.head()

Unnamed: 0,p.type,p.value,bio.type,bio.value,name.xml:lang,name.type,name.value,birth.datatype,birth.type,birth.value,death.datatype,death.type,death.value
0,uri,http://vocab.getty.edu/ulan/500771934,literal,"North Netherlandish painter, 1743-1776",nl,literal,"Aa, Andries van der",http://www.w3.org/2001/XMLSchema#gYear,literal,1743,http://www.w3.org/2001/XMLSchema#gYear,literal,1776
1,uri,http://vocab.getty.edu/ulan/500557119,literal,"Moroccan graphic artist, 1958-",nl,literal,"Aabdaoui, Latif",http://www.w3.org/2001/XMLSchema#gYear,literal,1958,http://www.w3.org/2001/XMLSchema#gYear,literal,2035
2,uri,http://vocab.getty.edu/ulan/500632715,literal,"Dutch painter, born 1951",nl,literal,"Aa, Ben",http://www.w3.org/2001/XMLSchema#gYear,literal,1951,http://www.w3.org/2001/XMLSchema#gYear,literal,2080
3,uri,http://vocab.getty.edu/ulan/500546545,literal,"Danish ceramicist, 1939-",nl,literal,"Aaberg, Gunhild",http://www.w3.org/2001/XMLSchema#gYear,literal,1939,http://www.w3.org/2001/XMLSchema#gYear,literal,2080
4,uri,http://vocab.getty.edu/ulan/500108834,literal,"Norwegian architect, active mid-late 20th century",,literal,"Aabergh, Gösta",http://www.w3.org/2001/XMLSchema#gYear,literal,1920,http://www.w3.org/2001/XMLSchema#gYear,literal,2020


### Load and save file as a pickle

In [None]:
# with open('result_json.pickle', 'wb') as handle:
#     pickle.dump(results, handle, protocol=pickle.HIGHEST_PROTOCOL)

# with open('result_json.pickle', 'rb') as handle:
#     b = pickle.load(handle)

# with open('results_df_all_ulan.pickle', 'wb') as handle:
#     pickle.dump(results_df, handle, protocol=pickle.HIGHEST_PROTOCOL)

with open('data_dumps/results_df_all_ulan.pickle', 'rb') as handle:
    df = pickle.load(handle)

In [None]:
df.shape

In [None]:
df.head()

## ULAN | Roman active painters between 1400 and 1800
**Rome (inhabited place)** 
Note: City positioned on 7 hills over the swampy Tiber river area; one of the oldest continuously occupied sites in Europe. Archaeological evidence attests to human occupation of the area from ca. 14,000 years ago, but the dense layer of later debris obscures Palaeolithic and Neolithic sites. Was an Etruscan city by 8th cen. BCE, their kings expelled and republic established by 500 BCE; soon ruled vast area and was center of Empire from 31 BCE; declined when capital moved to Constantinople in 330 CE; revived under popes.

In [16]:
# set sparql endpoint
sparql = SPARQLWrapper("http://vocab.getty.edu/sparql")

# query: the FILTER below keeps births from 1000 to 1800, a wider window than the 1400 in the heading
sparql.setQuery("""
PREFIX tgn: <http://vocab.getty.edu/tgn/>
PREFIX gvp: <http://vocab.getty.edu/ontology#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX skosxl: <http://www.w3.org/2008/05/skos-xl#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX iso: <http://purl.org/iso25964/skos-thes#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX schema: <http://schema.org/>
PREFIX aat: <http://vocab.getty.edu/aat/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>


SELECT ?id ?name ?bio ?birth ?death {
{SELECT DISTINCT ?id
         {?id foaf:focus/bio:event/(schema:location|(schema:location/gvp:broaderExtended)) tgn:7000874-place}}
OPTIONAL { ?id gvp:prefLabelGVP/xl:literalForm ?name;
          foaf:focus/gvp:biographyPreferred [
          schema:description ?bio;
          gvp:estStart ?birth] . }
OPTIONAL { ?id gvp:prefLabelGVP/xl:literalForm ?name;
          foaf:focus/gvp:biographyPreferred [
		  schema:description ?bio;
          gvp:estEnd ?death] . }
FILTER ("1000"^^xsd:gYear <= ?birth && ?birth <= "1800"^^xsd:gYear)
}

""")



# returns results as a json
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

# traverses the json
results_df = pd.json_normalize(results['results']['bindings'])
results_df.shape

(1348, 13)
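
The place-based queries in this notebook differ only in the TGN place ID and the year bounds, so the query text can be generated from a template. A minimal sketch using `string.Template` from the standard library; `PLACE_QUERY` and `build_place_query` are hypothetical names, and the query relies on the endpoint's predefined prefixes (`bio:`, `xsd:`, `xl:`) just as the cell above does:

```python
from string import Template

# same query shape as above, with the place and year bounds as placeholders
PLACE_QUERY = Template("""
SELECT ?id ?name ?bio ?birth ?death {
{SELECT DISTINCT ?id
         {?id foaf:focus/bio:event/(schema:location|(schema:location/gvp:broaderExtended)) tgn:${tgn_id}-place}}
OPTIONAL { ?id gvp:prefLabelGVP/xl:literalForm ?name;
          foaf:focus/gvp:biographyPreferred [
          schema:description ?bio;
          gvp:estStart ?birth] . }
OPTIONAL { ?id gvp:prefLabelGVP/xl:literalForm ?name;
          foaf:focus/gvp:biographyPreferred [
          schema:description ?bio;
          gvp:estEnd ?death] . }
FILTER ("${start}"^^xsd:gYear <= ?birth && ?birth <= "${end}"^^xsd:gYear)
}
""")


def build_place_query(tgn_id, start, end):
    """Return the query for one TGN place ID and a birth-year window."""
    # xsd:gYear literals are written with four digits
    return PLACE_QUERY.substitute(
        tgn_id=int(tgn_id), start=f"{start:04d}", end=f"{end:04d}"
    )


# Rome, with the bounds used in the cell above
query = build_place_query(7000874, 1000, 1800)
print(query)
```

The resulting string goes into `sparql.setQuery(query)` unchanged.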

In [None]:
# dumps df as a pickle file
with open('results_df_roman.pickle', 'wb') as handle:
    pickle.dump(results_df, handle, protocol=pickle.HIGHEST_PROTOCOL)
    
# # reads pickle file
# with open('results_df_roman.pickle', 'rb') as handle:
#     df = pickle.load(handle)

### Subselect data

In [None]:
# subset cols; .copy() avoids pandas' SettingWithCopyWarning on the assignments below
df = df[['related.value', 'rname.value', 'rbirth.value', 'rdeath.value']].copy()

# change datatype
df['rbirth.value'] = df['rbirth.value'].astype(float)
df['rdeath.value'] = df['rdeath.value'].astype(float)

# drop rows with a missing birth or death year
df = df[(df['rdeath.value'].notnull()) &
        (df['rbirth.value'].notnull())]

# subselection on active painters: born after 1400, died before 1775
df = df[(df['rdeath.value'].astype(int) < 1775) &
        (df['rbirth.value'].astype(int) > 1400)].copy()

# split names into at most four parts; reindex pads to four columns if fewer occur
df[['last_name','first_name','addition','comment']] = (
    df['rname.value'].str.split(', ', n=3, expand=True).reindex(columns=range(4)))
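
ULAN preferred names such as `Dunthorne, James, II` have a varying number of comma-separated parts, so a fixed four-column split needs padding. A minimal pure-Python sketch of that step; `split_name` is a hypothetical helper, and the four-slot layout simply mirrors the column names used above:

```python
def split_name(name, slots=4):
    """Split 'Last, First, ...' into exactly `slots` parts, padding with None."""
    # maxsplit keeps any surplus commas inside the last part
    parts = name.split(", ", slots - 1)
    return parts + [None] * (slots - len(parts))


print(split_name("Aa, Andries van der"))   # ['Aa', 'Andries van der', None, None]
print(split_name("Dunthorne, James, II"))  # ['Dunthorne', 'James', 'II', None]
```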

## ULAN | European active painters between 1400 and 1800
**Europe (continent)** Note: Europe was an early home to Homo erectus, Neanderthal and modern Homo sapiens. Agricultural settlements arose in the 7th millennium BCE. Advanced civilizations developed here in the Mediterranean area with Asian and African influences. Modern European nations began to form after the fall of the Western Roman Empire in the fifth century CE. The late 20th and early 21st centuries have witnessed the formation and expansion of the European Union, a group of European nations allied for trade and some administrative purposes, such as a common currency. 

In [17]:
# set sparql endpoint
sparql = SPARQLWrapper("http://vocab.getty.edu/sparql")

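# query: the place is tgn:7000874-place, the same ID as in the Rome query above,
# so this returns artists linked to Rome, now with births from 1400 to 1800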
sparql.setQuery("""
PREFIX tgn: <http://vocab.getty.edu/tgn/>
PREFIX gvp: <http://vocab.getty.edu/ontology#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX skosxl: <http://www.w3.org/2008/05/skos-xl#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX iso: <http://purl.org/iso25964/skos-thes#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX schema: <http://schema.org/>
PREFIX aat: <http://vocab.getty.edu/aat/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>


SELECT ?id ?name ?bio ?birth ?death {
{SELECT DISTINCT ?id
         {?id foaf:focus/bio:event/(schema:location|(schema:location/gvp:broaderExtended)) tgn:7000874-place}}
OPTIONAL { ?id gvp:prefLabelGVP/xl:literalForm ?name;
          foaf:focus/gvp:biographyPreferred [
          schema:description ?bio;
          gvp:estStart ?birth] . }
OPTIONAL { ?id gvp:prefLabelGVP/xl:literalForm ?name;
          foaf:focus/gvp:biographyPreferred [
		  schema:description ?bio;
          gvp:estEnd ?death] . }
FILTER ("1400"^^xsd:gYear <= ?birth && ?birth <= "1800"^^xsd:gYear)}

""")

# returns results as a json
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

# traverses the json
results_df = pd.json_normalize(results['results']['bindings'])
results_df.shape

(1284, 13)

In [None]:
with open('data_dumps/results_df_roman_painters_1400_1800.pickle', 'wb') as handle:
    pickle.dump(results_df, handle, protocol=pickle.HIGHEST_PROTOCOL)

## ULAN | Active between 1400 and 1750

In [11]:
# set sparql endpoint
sparql = SPARQLWrapper("http://vocab.getty.edu/sparql")

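# query: ?place is bound only inside the subselect, which projects just ?id,
# so no place column reaches the result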
sparql.setQuery("""
PREFIX tgn: <http://vocab.getty.edu/tgn/>
PREFIX gvp: <http://vocab.getty.edu/ontology#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX skosxl: <http://www.w3.org/2008/05/skos-xl#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX iso: <http://purl.org/iso25964/skos-thes#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX schema: <http://schema.org/>
PREFIX aat: <http://vocab.getty.edu/aat/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>


SELECT ?id ?name ?bio ?place ?birth ?death {
{SELECT DISTINCT ?id
         {?id foaf:focus/bio:event/(schema:location|(schema:location/gvp:broaderExtended)) ?place}}
OPTIONAL { ?id gvp:prefLabelGVP/xl:literalForm ?name;
          foaf:focus/gvp:biographyPreferred [
          schema:description ?bio;
          gvp:estStart ?birth] . }
OPTIONAL { ?id gvp:prefLabelGVP/xl:literalForm ?name;
          foaf:focus/gvp:biographyPreferred [
		  schema:description ?bio;
          gvp:estEnd ?death] . }
FILTER ("1400"^^xsd:gYear <= ?birth && ?birth <= "1750"^^xsd:gYear)}
""")

# returns results as a json
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

# traverses the json
results_df = pd.json_normalize(results['results']['bindings'])
results_df

Unnamed: 0,id.type,id.value,name.type,name.value,bio.type,bio.value,birth.datatype,birth.type,birth.value,death.datatype,death.type,death.value,name.xml:lang
0,uri,http://vocab.getty.edu/ulan/500003194,literal,"Dunthorne, James, II",literal,"English portraitist, miniaturist, and caricatu...",http://www.w3.org/2001/XMLSchema#gYear,literal,1748,http://www.w3.org/2001/XMLSchema#gYear,literal,1793,
1,uri,http://vocab.getty.edu/ulan/500000926,literal,"Maynard, Thomas",literal,"British painter, active 1777-1812",http://www.w3.org/2001/XMLSchema#gYear,literal,1747,http://www.w3.org/2001/XMLSchema#gYear,literal,1842,nl
2,uri,http://vocab.getty.edu/ulan/500003422,literal,"Summers, S. N.",literal,"British painter, active 1764-1806",http://www.w3.org/2001/XMLSchema#gYear,literal,1734,http://www.w3.org/2001/XMLSchema#gYear,literal,1836,
3,uri,http://vocab.getty.edu/ulan/500000545,literal,"Pye, John, I",literal,"British engraver and printmaker, 1745-after 1773",http://www.w3.org/2001/XMLSchema#gYear,literal,1745,http://www.w3.org/2001/XMLSchema#gYear,literal,1773,
4,uri,http://vocab.getty.edu/ulan/500000597,literal,"Chair, R. B. de",literal,"French miniaturist, active ca. 1785",http://www.w3.org/2001/XMLSchema#gYear,literal,1745,http://www.w3.org/2001/XMLSchema#gYear,literal,1845,en
...,...,...,...,...,...,...,...,...,...,...,...,...,...
9631,uri,http://vocab.getty.edu/ulan/500497944,literal,"Pfeifer, Rochus",literal,Austrian master builder,http://www.w3.org/2001/XMLSchema#gYear,literal,1500,http://www.w3.org/2001/XMLSchema#gYear,literal,1785,
9632,uri,http://vocab.getty.edu/ulan/500023060,literal,"Elmer, Stephen",literal,"British painter, ca.1714-1796",http://www.w3.org/2001/XMLSchema#gYear,literal,1704,http://www.w3.org/2001/XMLSchema#gYear,literal,1796,en
9633,uri,http://vocab.getty.edu/ulan/500778608,literal,"Patterson, James",literal,"American architect, died 1799",http://www.w3.org/2001/XMLSchema#gYear,literal,1700,http://www.w3.org/2001/XMLSchema#gYear,literal,1799,
9634,uri,http://vocab.getty.edu/ulan/500580454,literal,"Ernesti, Jordan",literal,"German painter, active ca. 1720s",http://www.w3.org/2001/XMLSchema#gYear,literal,1680,http://www.w3.org/2001/XMLSchema#gYear,literal,1780,nl


## AAT | Associated Concepts Facet
The AAT includes **generic terms, and associated dates, relationships, and other information about concepts related to or required to catalog, discover, and retrieve information about art, architecture, and other visual cultural heritage**, including related disciplines dealing with visual works, such as archaeology and conservation, where the works are of the type collected by art museums and repositories for visual cultural heritage, or that are architecture. It is our goal to be ever more inclusive of various cultures and their visual works. Also, in recognition of diverse collections found in art museums, the AAT contains terminology to describe objects and associated activities that are ceremonial or utilitarian in nature, but are not necessarily labeled as art according to traditional Western aesthetics. 

In [8]:
%%time
# set sparql endpoint
sparql = SPARQLWrapper("http://vocab.getty.edu/sparql")

# query
sparql.setQuery("""
PREFIX tgn: <http://vocab.getty.edu/tgn/>
PREFIX gvp: <http://vocab.getty.edu/ontology#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX skosxl: <http://www.w3.org/2008/05/skos-xl#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX iso: <http://purl.org/iso25964/skos-thes#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX schema: <http://schema.org/>
PREFIX aat: <http://vocab.getty.edu/aat/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

select distinct ?Subject ?ScopeNote ?Term
{
  ?Subject gvp:broaderExtended aat:300264086. # Associated Concepts Facet
   optional {?Subject skos:scopeNote [dct:language gvp_lang:en; rdf:value ?ScopeNote]}
   optional {?Subject gvp:prefLabelGVP [xl:literalForm ?Term]}
   }
""")

# returns results as a json
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

# traverses the json
results_df = pd.json_normalize(results['results']['bindings'])
results_df.head()

CPU times: total: 422 ms
Wall time: 3.62 s


In [5]:
results_df.shape

(12983, 8)

In [11]:
%store results_df

Stored 'results_df' (DataFrame)


In [10]:
with open('data_dumps/results_df_aat_300264086.pickle', 'wb') as handle:
    pickle.dump(results_df, handle, protocol=pickle.HIGHEST_PROTOCOL)

## TGN | All places, their names, longitude and latitude, regions, continents, etc.
 A minimum **TGN record** contains a numeric ID, a name, a place in the hierarchy, and a place type. However, the data model includes dates, relationships, and other rich data. Most records include coordinates. TGN focuses on the historical world and places necessary for cataloging and discovery of visual works. It is not intended to be comprehensive. Current areas of TGN development include 1) adding archaeological sites, lost sites, and other historical sites, particularly Pre-Columbian places and places in Asia, Middle East, Africa, and others, and 2) building hierarchies for historical nations and empires. 

In [125]:
%%time
# set sparql endpoint
sparql = SPARQLWrapper("http://vocab.getty.edu/sparql")

# query
sparql.setQuery("""
select ?tgn_id ?city_name ?lat ?long ?parentstring ?broader_parentstring {{
  ?tgn_id skos:inScheme tgn: ; }
OPTIONAL { ?tgn_id foaf:focus [wgs:lat ?lat; wgs:long ?long];
    gvp:prefLabelGVP [xl:literalForm ?city_name].}
OPTIONAL { ?tgn_id gvp:parentString ?parentstring. }
OPTIONAL { ?tgn_id gvp:broaderPartitive ?x .
           ?x gvp:parentString ?broader_parentstring}
} 
""")

print("query done.")

# returns results as a json
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

print("results converted.")

# traverses the json
results_df = pd.json_normalize(results['results']['bindings'])
results_df.shape

query done.
results converted.
CPU times: total: 25 s
Wall time: 3min 2s


(557429, 15)

In [126]:
# delete empty results
results_df = results_df[(results_df['city_name.value'].notnull())]

# subset results and rename cols
results_df = results_df[[
#     'tgn_id.type', 
    'tgn_id.value', 
    'city_name.value',
#     'parentstring.type', 
    'parentstring.value',
#     'broader_parentstring.type',
    'broader_parentstring.value',
#     'city_name.xml:lang', 
#     'city_name.type', 
#     'lat.datatype', 
#     'lat.type',       
    'lat.value', 
#     'long.datatype', 
#     'long.type', 
    'long.value'
           ]].rename(columns={'tgn_id.value' : 'tgn_id', 
                              'city_name.value' : 'city_name', 
                              'parentstring.value' : 'parentstring',
                              'broader_parentstring.value' : 'broader_parentstring',
                              'lat.value' : 'lat',
                              'long.value' : 'lon',
                             })

results_df.shape

(545714, 6)

In [137]:
%%time
# compare two cols with strings, converted to list

# initiate col
results_df['inferred_city_name'] = ''

# column position for .iat, which writes in place without chained assignment
col = results_df.columns.get_loc('inferred_city_name')

# iterate over df; the progress bar wraps the range so that it actually advances
for i in tqdm.notebook.tqdm(range(len(results_df))):
    parent = results_df['parentstring'].iloc[i]
    broader = results_df['broader_parentstring'].iloc[i]

    # check for empty values (missing strings arrive as NaN, a float)
    if isinstance(parent, str) and isinstance(broader, str):

        # split and create two lists to compare
        test = parent.split(', ')
        test2 = broader.split(', ')

        # iterate over list and keep the last item missing from the broader string
        for item in test:
            if item not in test2:
                results_df.iat[i, col] = item

  0%|          | 0/545714 [00:00<?, ?it/s]

CPU times: total: 2min 32s
Wall time: 2min 31s
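
The comparison in the loop above can also be written as a small function that works on one pair of strings at a time. A minimal sketch; `infer_name` is a hypothetical helper that keeps the loop's behaviour of returning the last part of the parent string that is absent from the broader parent string:

```python
def infer_name(parentstring, broader_parentstring):
    """Return the last part of parentstring that is absent from broader_parentstring."""
    # missing values arrive as NaN (a float), so anything that is not a str yields ''
    if not isinstance(parentstring, str) or not isinstance(broader_parentstring, str):
        return ""
    broader = set(broader_parentstring.split(", "))
    inferred = ""
    for part in parentstring.split(", "):
        if part not in broader:
            inferred = part
    return inferred


print(infer_name("Yemen, Asia, World", "Asia, World"))  # Yemen
```

Applied to the frame, this could replace the positional loop with `results_df['inferred_city_name'] = [infer_name(p, b) for p, b in zip(results_df['parentstring'], results_df['broader_parentstring'])]`.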


In [144]:
# reorder cols
results_df = results_df[['tgn_id', 'city_name', 'inferred_city_name', 'parentstring', 'broader_parentstring', 'lat',
       'lon']]

In [146]:
results_df.head()

Unnamed: 0,tgn_id,city_name,inferred_city_name,parentstring,broader_parentstring,lat,lon
3,http://vocab.getty.edu/tgn/7004367,Abyan,Yemen,"Yemen, Asia, World","Asia, World",13.786,46.141
4,http://vocab.getty.edu/tgn/7002913,Abū Ẓaby,Abu Dhabi,"Abu Dhabi, United Arab Emirates, Asia, World","United Arab Emirates, Asia, World",24.4667,54.3667
5,http://vocab.getty.edu/tgn/7002913,Abū Ẓaby,,"Abu Dhabi, United Arab Emirates, Asia, World","Abu Dhabi, United Arab Emirates, Asia, World",24.4667,54.3667
6,http://vocab.getty.edu/tgn/7002905,Ra's al-Khaymah,Ra's al Khaymah,"Ra's al Khaymah, United Arab Emirates, Asia, W...","United Arab Emirates, Asia, World",25.7833,55.95
7,http://vocab.getty.edu/tgn/7002909,Ash-Shāriqah,Sharjah,"Sharjah, United Arab Emirates, Asia, World","United Arab Emirates, Asia, World",25.35,55.3667


In [147]:
with open('data_dumps/results_df_tgn_country.pickle', 'wb') as handle:
    pickle.dump(results_df, handle, protocol=pickle.HIGHEST_PROTOCOL)