## Preliminaries

We have to get some packages. Then we can move on.

In [None]:
# in a python powered notebook, the python kernel knows what to do with !curl. 
# but in this R powered notebook, it does not. So instead, we define the 
# url we want to download, and the name of the file we'll give it once it has downloaded
# and THEN we can download the file and pass the target and destination as parameters:

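url <- "https://cran.r-project.org/src/contrib/Archive/SPARQL/SPARQL_1.16.tar.gz"
destination_file <- "SPARQL.tar.gz"
download.file(url, destination_file) 

In [None]:
options(warn = -1)  # Suppress all warnings
# SPARQL relies on the XML and RCurl packages; with repos = NULL they are not
# fetched automatically, so we install them first in case they are missing.
install.packages(c("XML", "RCurl"))
install.packages("SPARQL.tar.gz", repos = NULL, type = "source")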
install.packages("ggplot2")

# SPARQL and LOD

Part of the early dream of the World Wide Web was that data could be described semantically, so that when website A discussed, e.g., Roman coins, and website B also discussed Roman coins, the links themselves would carry the metadata that alerts us when we are talking about the same thing, or the same kind of thing. What if you could write some code to retrieve all of the information, everywhere, about Roman coins minted under Augustus at a particular mint, no matter what academic or museum or commercial database held relevant information? This was/is the promise of linked open data and of SPARQL, the query language for exploring those linkages.

When data is described using the Resource Description Framework (RDF), the resource (the 'thing') is described via a series of relationships, rather than as rows in a table or keys having values.

Information is in the relationships. It's a network. It's a graph. Thus, every 'thing' in this graph can have its own uniform resource identifier (URI) that lives as a location on the internet. Information can then be created by making statements that use these URIs, similarly to how English grammar creates meaning: subject verb object. Or, in RDF-speak, 'subject predicate object', also known as a triple. In this way, data in different places can be linked together by referencing the elements they have in common. This is Linked Open Data (LOD). The access point for interrogating LOD is called an 'endpoint'.

Finally, SPARQL is an acronym for SPARQL Protocol and RDF Query Language (yes, it's one of those kinds of acronyms).

In this notebook, we're using R to query an endpoint that provides access to numismatic evidence from the Roman world, and to manipulate the linked open data that comes back. For the sake of learning a bit of what one can do with SPARQL, though, this notebook keeps all of that ancillary code tucked away. The [followup notebook](Using%20R%20to%20Retrieve%20and%20Visualize%20Data%20from%20SPARQL.ipynb) to this one shows you how to use R to do some basic manipulations of the query results.

[Matthew Lincoln once wrote an excellent tutorial at the Programming Historian about SPARQL and LOD](https://programminghistorian.org/en/lessons/retired/graph-databases-and-SPARQL) but that tutorial depended on an endpoint maintained by the British Museum. The museum no longer maintains that endpoint, rendering Lincoln's tutorial obsolete in terms of its examples. But it _is_ an excellent intro to the key ideas, parts of which I am reproducing below.

Let's look at his example, which concerns the painting, 'The Nightwatch'.

```rdf
<The Nightwatch> <was created by> <Rembrandt van Rijn> .
```

This statement has three elements:

```
    the subject: <The Nightwatch>
    the predicate: <was created by>
    the object: <Rembrandt van Rijn>
```

Lincoln combines these, and other such statements, into a (pseudo-)RDF database like so:

```
<The Nightwatch> <was created by> <Rembrandt van Rijn> .
<The Nightwatch> <was created in> <1642> .
<The Nightwatch> <has medium> <oil on canvas> .
<Rembrandt van Rijn> <was born in> <1606> .
<Rembrandt van Rijn> <has nationality> <Dutch> .
<Johannes Vermeer> <has nationality> <Dutch> .
<Woman with a Balance> <was created by> <Johannes Vermeer> .
<Woman with a Balance> <has medium> <oil on canvas> .
```

Such RDF databases describe nodes and links, and so we can visualize them as a graph like so:

![](https://camo.githubusercontent.com/6c12a0f4f4c91fd87c787e790037151c2c3024849472e4740673d321370e073c/68747470733a2f2f70726f6772616d6d696e67686973746f7269616e2e6f72672f696d616765732f67726170682d6461746162617365732d616e642d53504152514c2f73706172716c30312e737667)

This is rather different than the tables and their intersections that we've discussed and read about so far this week! You can see that _adding_ information to this database is as simple (relatively) as adding a new relationship to the graph. This is one of the reasons why graph databases have become more popular in recent years, especially with the evolution of the web and social media.

Lincoln suggests that we think of SPARQL queries as a kind of 'mad lib', where there is a structure and we just fill in the blanks. We specify what kind of data we want in those blanks, and the query traverses the graph looking for data that correctly fills it in. Sticking with art history, he shows us this:

```
SELECT ?painting
WHERE {
  ?painting <has medium> <oil on canvas> .
}
```

The query goes off and looks for every painting that was made with oil on canvas. Visually, it looks a bit like this:

![](https://programminghistorian.org/images/graph-databases-and-SPARQL/sparql02.svg)

and we'd end up with a table with a column called 'painting' and the titles of each painting listed.
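
To make the 'mad lib' idea concrete, here is a minimal sketch in base R that stores Lincoln's pseudo-RDF statements as a three-column data frame and fills in the `?painting` blank by matching the fixed parts of the pattern. The helper `match_pattern()` is a made-up name for illustration, not part of the SPARQL package, and a real endpoint matches patterns against a graph rather than filtering a table.

```r
# The pseudo-RDF statements from above, stored one triple per row.
triples <- data.frame(
  subject = c("The Nightwatch", "The Nightwatch", "The Nightwatch",
              "Rembrandt van Rijn", "Rembrandt van Rijn", "Johannes Vermeer",
              "Woman with a Balance", "Woman with a Balance"),
  predicate = c("was created by", "was created in", "has medium",
                "was born in", "has nationality", "has nationality",
                "was created by", "has medium"),
  object = c("Rembrandt van Rijn", "1642", "oil on canvas",
             "1606", "Dutch", "Dutch",
             "Johannes Vermeer", "oil on canvas"),
  stringsAsFactors = FALSE
)

# match_pattern() is a hypothetical helper (an assumption for this sketch).
# It keeps the rows whose predicate and object equal the fixed parts of the
# pattern and returns the subjects that fill the ?painting blank.
match_pattern <- function(triples, predicate, object) {
  hits <- triples$predicate == predicate & triples$object == object
  data.frame(painting = triples$subject[hits], stringsAsFactors = FALSE)
}

# Equivalent in spirit to: SELECT ?painting WHERE { ?painting <has medium> <oil on canvas> . }
result <- match_pattern(triples, "has medium", "oil on canvas")
print(result)
```

The result is a one-column table named `painting`, which is the same shape a SPARQL endpoint would hand back for the query above.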

But there is a difference between the pseudo-RDF that Lincoln shows us, and what actual RDF might look like:

```
<http://data.rijksmuseum.nl/item/8909812347> <http://purl.org/dc/terms/creator> <http://dbpedia.org/resource/Rembrandt> .
```

The human-readable version requires more statements:
```
<http://data.rijksmuseum.nl/item/8909812347> <http://purl.org/dc/terms/title> "The Nightwatch" .

<http://purl.org/dc/terms/creator> <http://www.w3.org/1999/02/22-rdf-syntax-ns#label> "was created by" .

<http://dbpedia.org/resource/Rembrandt> <http://xmlns.com/foaf/0.1/name> "Rembrandt van Rijn" .
```
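
Full URIs like these are verbose, which is why the SPARQL queries in this notebook open with `PREFIX` lines: each one declares a short name that stands in for the start of a URI. As a rough sketch of what that abbreviation means, the base R snippet below expands a prefixed name such as `dcterms:creator` back into its full URI. The function `expand_prefixed()` is a hypothetical helper written for this illustration; the endpoint does this expansion for you.

```r
# Prefixes copied from the PREFIX declarations used in this notebook's queries.
prefixes <- c(
  dcterms = "http://purl.org/dc/terms/",
  foaf    = "http://xmlns.com/foaf/0.1/",
  nmo     = "http://nomisma.org/ontology#"
)

# expand_prefixed() is a hypothetical helper (an assumption for this sketch).
# It splits "prefix:local" at the colon and glues the local part onto the
# URI registered for that prefix.
expand_prefixed <- function(term, prefixes) {
  parts <- strsplit(term, ":", fixed = TRUE)[[1]]
  if (length(parts) != 2 || !(parts[1] %in% names(prefixes))) {
    stop("Unknown or malformed prefixed name: ", term)
  }
  paste0("<", prefixes[[parts[1]]], parts[2], ">")
}

creator_uri <- expand_prefixed("dcterms:creator", prefixes)
print(creator_uri)
```

The expanded value is the same predicate URI that appears in the 'actual RDF' example above.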

In the examples below, we're going to draw on the [Nomisma.org](https://nomisma.org) project which created and maintains an **ontology** or formalized descriptive framework for working with archaeological numismatic materials.

Lincoln suggests that when we first encounter a new RDF graph, we explore the network of relationships from an example object to understand what is going on in the database and to see what is available for querying. In the query below, s, p and o stand for 'subject', 'predicate' and 'object'. The results can give you the information you need to construct more complicated queries.


In [None]:
library(SPARQL)
library(ggplot2)

endpoint <- "http://nomisma.org/query"

# since this cell just defines the packages we're going to use and the variable 'endpoint' 
# that tells SPARQL where to send our query, when you run this block nothing much
# will seem to happen - the [ ] will change to [*] and then [3] (if this were the third block you ran).

In [None]:
# All those 'prefix' statements? Those tell sparql where to find the authoritative source
# for things we might be after. You can copy those URLS into a browser to find out more 
# about what they describe. In a subsequent code block we touch lightly on an example.

query <- "PREFIX crm:  <http://www.cidoc-crm.org/cidoc-crm/>
PREFIX dcterms:  <http://purl.org/dc/terms/>
PREFIX dcmitype:  <http://purl.org/dc/dcmitype/>
PREFIX foaf:  <http://xmlns.com/foaf/0.1/>
PREFIX geo:  <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX nm:  <http://nomisma.org/id/>
PREFIX nmo:  <http://nomisma.org/ontology#>
PREFIX org:  <http://www.w3.org/ns/org#>
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:  <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos:  <http://www.w3.org/2004/02/skos/core#>
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>

SELECT * WHERE {
  ?s ?p ?o
} LIMIT 100
"
qd <- SPARQL(endpoint,query)
qd

## Simple Queries

That's a lot of information to throw at a person, but it just goes to show that there's a lot of work that goes into representing data this way, and a lot of potential research that could be built on top of it! The work of data infrastructure is scholarly work and just as important as other forms of scholarly output.

But let's start with a simple exploration: can we get information about a single coin?

In [None]:
# Simple query to get one coin type
query <- "
PREFIX nmo: <http://nomisma.org/ontology#>
SELECT ?coin WHERE {
  ?coin a nmo:TypeSeriesItem .
} LIMIT 1
"
qd <- SPARQL(endpoint,query)
qd

Note that if you rerun the block above, you'll always get the same result, because of the way the index at Nomisma (or any similar database) is built. To get something random each time, run this modified code, which asks for the results to be ORDERed BY RANDom:
```
query <- "
PREFIX nmo: <http://nomisma.org/ontology#>
SELECT ?coin WHERE {
  ?coin a nmo:TypeSeriesItem .
} 
ORDER BY RAND()
LIMIT 1
"
```

Do that in the empty code block below. Remember you have to run the query with SPARQL. Notice that each time you do it, the query is passing over all of the linked data sources that it knows about. Sometimes you'll find a coin at the Fitzwilliam Museum in Cambridge. Sometimes you'll find one from somewhere else. 

This illustrates one of the potential attractions of organizing and publishing data in known linked open datastores, especially for archaeology. So much material is squirreled away and it's quite possible if we actually _knew_ what we had and where it all was, it'd transform our understanding.


## Graphing Some Data

This is only a very short introduction to SPARQL and what might be accomplished with it (but if you're interested, here's a good [tutorial for more depth](https://jena.apache.org/tutorials/sparql.html)). We'll finish off by grabbing some data and then trying to graph it, to see if there is anything interesting in the patterns we might spot.

In [None]:
# these examples are coming from Nomisma itself 
# https://www.nomisma.org/documentation/sparql/
# All those 'prefix' statements? Those tell sparql where to find the authoritative source
# for things we might be after. For instance, 'dcterms', if you followed that url, goes to
# the DublinCore metadata scheme, and you could look up what 'dcterms:source' precisely defines: eg,
# https://www.dublincore.org/specifications/dublin-core/dcmi-terms/elements11/source/ 
# Notice 'nm:ric'? That is saying 'the authoritative source for this information is RIC, Roman Imperial Coinage':
# https://nomisma.org/id/ric 

# so... just looking at this query, what do you think it's going to show you?

# create query statement
query <-
"PREFIX rdf:		<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dcterms:		<http://purl.org/dc/terms/>
PREFIX nm:		<http://nomisma.org/id/>
PREFIX nmo:		<http://nomisma.org/ontology#>
SELECT ?type ?weight WHERE {
  VALUES ?authority { nm:augustus nm:vespasian }
  ?type nmo:hasAuthority ?authority ;
    nmo:hasDenomination nm:denarius ;
    dcterms:source nm:ric.
  ?coin nmo:hasTypeSeriesItem ?type .
  ?coin nmo:hasWeight ?weight
}
"

# Our GOAL: we're going to query nomisma using SPARQL and retrieve data from several locations
# that we will then visualize.

In [None]:
# We use SPARQL to submit the query, and then we write those results into a dataframe.
# There's a lot of data, this might take a bit of time.
qd <- SPARQL(endpoint,query)
df <- qd$results


In [None]:
df <- qd$results

# now lets look at the result.
df

Every coin type has its own unique identifier (a URI)! Let's create a new column, 'authority', by parsing that URI for the name of the Emperor under whom the coin was minted.

In [None]:
# Extract authority FIRST while type is still character.
# Nested ifelse() is base R, so no extra package (such as dplyr) is needed.
df$authority <- ifelse(
  grepl("aug", df$type, ignore.case = TRUE), "Augustus",
  ifelse(grepl("ves", df$type, ignore.case = TRUE), "Vespasian", "Other")
)

# THEN convert only the weight column to numeric
df$weight <- as.numeric(df$weight)

# Check the results
table(df$authority)
str(df)

Great! Let's compare these coins by Emperor; such a comparison might tell us something about the Roman economy.

In [None]:
library(ggplot2)

# Create histogram with facets
ggplot(df, aes(x = weight, fill = authority)) +
  geom_histogram(bins = 20, alpha = 0.7, position = "identity") + # bins are the number of groups, by weight, we want
  facet_wrap(~authority) +
  labs(title = "Distribution of Denarius Weights by Authority",
       subtitle = "Coins from RIC",
       x = "Weight (grams)",
       y = "Count") +
  scale_fill_manual(values = c("Augustus" = "gold", "Vespasian" = "darkred")) +
  theme_minimal() +
  theme(legend.position = "none")  # Remove legend since facets show the grouping

It might not be apparent at first glance, but you've got two subgraphs or 'facets' in that result.
Try changing the number of 'bins' or weight categories. Sure looks like there's something different going on between the reign of Augustus and the reign of Vespasian, eh?
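
A histogram is easier to read alongside a couple of summary numbers. The sketch below shows one way to compute the mean and median weight per authority with base R's `tapply()`. The `toy` data frame and its weights are invented placeholders so that the snippet runs on its own; they are not results from Nomisma. To use it on the real data, swap `toy` for `df`.

```r
# Invented placeholder weights, NOT real measurements from the endpoint.
toy <- data.frame(
  authority = c("Augustus", "Augustus", "Augustus",
                "Vespasian", "Vespasian", "Vespasian"),
  weight = c(3.8, 3.7, 3.9, 3.2, 3.3, 3.4),
  stringsAsFactors = FALSE
)

# tapply() splits the weights by authority and applies the function to each group.
mean_weight   <- tapply(toy$weight, toy$authority, mean)
median_weight <- tapply(toy$weight, toy$authority, median)

# Collect the per-group numbers into one small table.
by_authority <- data.frame(
  authority     = names(mean_weight),
  mean_weight   = as.numeric(mean_weight),
  median_weight = as.numeric(median_weight),
  stringsAsFactors = FALSE
)
print(by_authority)
```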

Can you modify the query and the code to retrieve more data from other Emperors? Hint: you'll want to go back and start by modifying this line: ```VALUES ?authority { nm:augustus nm:vespasian }```
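
Once you add more emperors to the query, matching each name by hand against the type URI gets tedious. One alternative, sketched here under assumptions, is to pull the authority code out of the URI with a regular expression and look it up in a named vector. The helper `extract_authority_code()` is hypothetical. It assumes type URIs follow the shape of the OCRE URI used elsewhere in this notebook (`ric.1(2).aug.1A`). The Vespasian and Trajan example URIs and the `tr` code are assumptions for illustration, so check them against what the endpoint actually returns.

```r
# extract_authority_code() is a hypothetical helper (an assumption for this sketch).
# It assumes type URIs look like .../ric.<volume>.<authority>.<number>
# and returns the <authority> part, e.g. "aug".
extract_authority_code <- function(type_uri) {
  sub("^.*/ric\\.[^.]+\\.([^.]+)\\..*$", "\\1", type_uri)
}

# Lookup table from authority code to a readable name; extend it as you add emperors.
authority_names <- c(aug = "Augustus", ves = "Vespasian", tr = "Trajan")

# Example URIs: the first appears in this notebook, the others are assumed.
example_types <- c(
  "<http://numismatics.org/ocre/id/ric.1(2).aug.1A>",
  "<http://numismatics.org/ocre/id/ric.2_1(2).ves.2>",
  "<http://numismatics.org/ocre/id/ric.2.tr.1>"
)

codes <- extract_authority_code(example_types)
labels <- unname(authority_names[codes])
labels[is.na(labels)] <- "Other"  # anything not in the lookup table
print(data.frame(type = example_types, authority = labels))
```

With the real results you would pass `df$type` instead of `example_types` and assign the labels to `df$authority`.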

In [None]:
# There are some other sample queries at nomisma
# https://www.nomisma.org/documentation/sparql/
# Average Diameter of RIC Augustus 1A
# This query finds all coins connected to the OCRE URI for Augustus 1A and averages their diameters.

# I include it to show that you can do some math as part of the query language too.


query2 <- "
PREFIX rdf:		<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dcterms:		<http://purl.org/dc/terms/>
PREFIX nm:		<http://nomisma.org/id/>
PREFIX nmo:		<http://nomisma.org/ontology#>
PREFIX xsd:		<http://www.w3.org/2001/XMLSchema#>

SELECT (AVG(xsd:decimal(?diameter)) AS ?average) WHERE {
?object nmo:hasTypeSeriesItem <http://numismatics.org/ocre/id/ric.1(2).aug.1A> ;
  nmo:hasDiameter ?diameter
}

"

qd2 <- SPARQL(endpoint,query2)
df2 <- qd2$results


In [None]:
# Now let's see the results
df2