# Querying using SPARQL

SPARQL is a query language for searching RDF graphs. Python's `rdflib` package supports querying using SPARQL. In this Notebook you will learn how to write SPARQL queries using `rdflib`.

## The SPARQL SELECT query

Begin by importing the package `rdflib` and loading a graph from a specified file using the `parse` method.

In [None]:
import rdflib

# Set up a function to print out the first few triples of a graph 
def printtriples(agraph, limit): 
    n = 0
    for subj, pred, obj in agraph:
        print(subj)
        print(pred)
        print(obj)
        print('')
        if (limit > 0):
            n = n+1
            if n == limit:
                break

# Create a new empty graph in memory
geog = rdflib.Graph()

# Read the contents of a graph held in a file
geog.parse("data/European Geography.ttl", format="turtle")

# How many triples are there?
print("No of triples in graph:", len(geog))
print("")

# View a few triples
printtriples(geog, 10)

So, there is a large number of triples similar to those you met in Notebook 25.1.

One of the predicates in this graph is `hasCapital`, which relates a country to its capital city as in the following example:

    (
    http://www.example.org/geography/Germany
    http://www.example.org/hasCapital
    Berlin
    )

Suppose you wanted to find the capital of Moldova (assuming Moldova is a country in this graph). If such a relationship (triple) exists, it will be of the form:

    (
    http://www.example.org/geography/Moldova
    http://www.example.org/hasCapital
    ?capital
    )

where we've used `?capital` as the object to stand for the unknown name of the capital city.

A straightforward querying mechanism would be to search the graph for triples that match this pattern. 

In SPARQL, a **pattern** is simply a triple in which one or more of its elements (subject, predicate, object) is a name starting with '`?`'. A name that starts with a '`?`' is known as a **variable**.  

Therefore, if we search the graph and find a triple that matches this pattern, the variable `?capital` can be assigned the name of the capital city appearing in the found triple. When this happens we say that the value (the name of the capital city of Moldova) is **bound** to the variable `?capital`. Once a match for the variable has been found we can ask for the value bound to the variable to be returned as the result of the query.
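
The idea of matching a pattern and binding a variable can be sketched in plain Python, without `rdflib`. This is only an illustration of the concept, not how a real query engine is implemented: the helper names `is_variable` and `match` are hypothetical, the triples are toy data, and shortened names stand in for full URIs.

```python
# Toy triples: illustrative data only, with short names instead of full URIs
triples = [
    ("Germany", "hasCapital", "Berlin"),
    ("Moldova", "hasCapital", "Chisinau"),
    ("Germany", "hasBorder", "France"),
]

def is_variable(term):
    """A term is a variable if it is a string starting with '?'."""
    return isinstance(term, str) and term.startswith("?")

def match(pattern, triple):
    """Return a dict of bindings if the triple matches the pattern, else None."""
    bindings = {}
    for p_term, t_term in zip(pattern, triple):
        if is_variable(p_term):
            # A variable used twice in one pattern must take the same value
            if bindings.get(p_term, t_term) != t_term:
                return None
            bindings[p_term] = t_term
        elif p_term != t_term:
            # A fixed element of the pattern must be identical in the triple
            return None
    return bindings

pattern = ("Moldova", "hasCapital", "?capital")

results = []
for triple in triples:
    bindings = match(pattern, triple)
    if bindings is not None:
        results.append(bindings)

print(results)
```

Each dictionary in `results` records which value was bound to which variable for one matching triple.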

In SPARQL you construct a query in the form of one or more triple **patterns** specifying which triples you want to be retrieved from the graph and specify the variables whose values you want to be returned from the query. To do this you should use a SPARQL SELECT query (there are several other types of query in SPARQL) as follows.


In [None]:
# A SPARQL SELECT query
q1 = '''SELECT ?capital
        WHERE { 
            <http://www.example.org/geography/Moldova>
            <http://www.example.org/hasCapital>
            ?capital
        }'''

print(q1)

The query is wrapped up in a Python string and assigned to a variable (`q1`).

Immediately following the SELECT keyword is a SPARQL variable (it starts with `?`) the value of which will be returned when the query has been run.

In this example, the WHERE clause contains a single triple pattern. It is a triple because it contains three elements, the first two of which specify a subject (`Moldova`) and a predicate (`hasCapital`). The third element, the object, is a variable (`?capital`). 

There are three things to notice about the syntax of this query statement:

1. The query string is enclosed in three quotes; this allows you to include newlines in the string to help lay out the query in a more readable fashion (the newlines are ignored by the query engine).

2. Each URI is enclosed within angle brackets; this is essential SPARQL syntax. 

3. Following the WHERE keyword is a pair of braces (curly brackets) in which the triple pattern has been placed.


To run a query, call the graph's `query()` method, passing the query string as its argument.

In [None]:
# Run the query q1, and save the results in variable r1
r1 = geog.query(q1)

# Print out the results.
# Results are returned as a set of tuples since it is expected that in most
# queries there will be several results. Each tuple contains a value for 
# each variable mentioned after the SELECT keyword.
# So, in this example, print first element of each tuple in the results:

for row in r1:
    print (row[0])
    print()


### Activity 1

Devise and run a query that requests the capital of Georgia. Explain the result.

In [None]:
# Insert your solution here.

The solution is in the [`25.2solutions`](25.2solutions.ipynb) Notebook.

## What's in the graph?

At this stage, you don't really know what data is contained in the graph. This is usually the case whenever you meet a graph for the first time. Therefore, it would be useful to know what predicates occur in the graph.

At first you might be tempted to write the following query which finds all predicates associated with a country such as Switzerland. (Since we're only interested in the names of the predicates, that's the only variable after the SELECT keyword, and we've used `?object` as a variable that will match with any object.)

This query illustrates that there can be multiple matches for a given pattern.


In [None]:
# Find all predicates associated with 'Switzerland'
q1 = '''SELECT ?predicate
        WHERE { 
            <http://www.example.org/geography/Switzerland>
            ?predicate
            ?object
        }'''

# Run the query q1, and save the results in variable r1
r1 = geog.query(q1)

# Print the results
for row in r1:
    print (row[0])
    print()

Whenever the query engine (the code that performs the search) finds a triple that matches the pattern, it continues to search for further matches and returns all the matches as its result.

If you look closely at the output, you will see that several predicates occur multiple times. For example, `hasOfficialLanguage` occurs four times because Switzerland has four different official languages. We can make the output of this query more useful if we ask that each predicate is returned only once by using the keyword DISTINCT (placed after the SELECT keyword) as follows.

In [None]:
# Find all predicates associated with 'Switzerland'
q1 = '''SELECT DISTINCT ?predicate
        WHERE { 
            <http://www.example.org/geography/Switzerland>
            ?predicate
            ?object
        }'''

# Run the query q1, and save the results in variable r1
r1 = geog.query(q1)

# Print the results
for row in r1:
    print (row[0])
    print()

Of course, this list of predicates associated with Switzerland may not represent all the predicates used in the graph. So let's amend the query to match with every triple in the graph and return only the distinct predicate names.

In [None]:
# Find all predicates used in the graph
q1 = '''SELECT DISTINCT ?predicate
        WHERE { 
            ?subject
            ?predicate
            ?object
        }'''

# Run the query q1, and save the results in variable r1
r1 = geog.query(q1)

# Print the results
for row in r1:
    print (row[0])
    print()

It turns out that there are no other predicates in this graph.

### Activity 2

Find the names of all the different subjects contained in the graph.

In [None]:
# Insert your solution here.

The solution is in the [`25.2solutions`](25.2solutions.ipynb) Notebook.

There are a lot of results. When exploring a new graph you can often get into a situation in which the number of answers to your query is very large and the results are time consuming to produce (a waste of your time, too). Therefore, when experimenting like this it is a good idea to restrict the number of results returned using the LIMIT keyword placed after the closing brace, `}`, of the WHERE clause.

In [None]:
# Find the first 10 different subjects in the graph
q1 = '''SELECT DISTINCT ?subject
        WHERE { 
            ?subject
            ?predicate
            ?object
        }LIMIT 10
        '''

# Run the query q1, and save the results in variable r1
r1 = geog.query(q1)

# Print the results
for row in r1:
    print (row[0])
    print()

If you want the next 10 results, you can use the OFFSET keyword. The following example returns 10 results starting with the 11th result.


In [None]:
# Find the second set of 10 different subjects in the graph
q1 = '''SELECT DISTINCT ?subject
        WHERE { 
            ?subject
            ?predicate
            ?object
        }LIMIT 10 OFFSET 10
        '''

# Run the query q1, and save the results in variable r1
r1 = geog.query(q1)

# Print the results
for row in r1:
    print (row[0])
    print()

So, when querying a graph for the first time, it is always a good idea to limit the results both in number (use LIMIT) and uniqueness (use DISTINCT).
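
The combined effect of DISTINCT, LIMIT and OFFSET can be pictured as list operations in plain Python. This is a rough model only, not `rdflib` code: the helper names `distinct` and `limit_offset` are hypothetical and the rows are toy data.

```python
def distinct(rows):
    """Drop repeated rows, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

def limit_offset(rows, limit, offset=0):
    """Skip `offset` rows, then keep at most `limit` rows."""
    return rows[offset:offset + limit]

# Toy result rows: one-element tuples, as a SELECT with one variable would give
raw = [
    ("hasBorder",),
    ("hasOfficialLanguage",),
    ("hasOfficialLanguage",),
    ("hasCapital",),
    ("hasBorder",),
]

unique = distinct(raw)                               # DISTINCT is applied first
first_two = limit_offset(unique, limit=2)            # LIMIT 2
next_two = limit_offset(unique, limit=2, offset=2)   # LIMIT 2 OFFSET 2

print(unique)
print(first_two)
print(next_two)
```

In SPARQL itself the order of results is not guaranteed unless the query includes ORDER BY, so paging with LIMIT and OFFSET is only reliable when the results are ordered.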

### Activity 3

Find the names of all those countries that border Switzerland.
Hint: use the predicate `hasBorder`.

In [None]:
# Insert your solution here.

The solution is in the [`25.2solutions`](25.2solutions.ipynb) Notebook.

There is another useful mechanism for limiting the number of results called **filtering**. The keyword FILTER is used to pick out from a set of results those that satisfy a particular condition. For example, to find all those countries with a population greater than 50 million, first find the population of a country and then check whether its population is more than 50 million.

In [None]:
# Find the countries that have a population greater than 50 million
q1 = '''SELECT DISTINCT ?country ?population
        WHERE { 
            ?country <http://www.example.org/hasPopulation> ?population .
            FILTER (?population > 50000000) 
        }LIMIT 20 
        '''

# Run the query q1, and save the results in variable r1
r1 = geog.query(q1)

# Print the results
for row in r1:
    print (row[0], "has population", row[1])
    print()

In this example, the values of two variables have been returned: `?country` and `?population`.

The processing of the query takes place in the order of the statements within the WHERE clause.

First, a match is found for the pattern 

    ?country <http://www.example.org/hasPopulation> ?population .

This might be, for example, `?country = Latvia` with `?population = 2165165`. 

We say that the value '`Latvia`' has been **bound** to the variable `?country` and the value `2165165` has been **bound** to the variable `?population`.

The values of the bound variables are carried through to the second statement of the WHERE clause where the current  (i.e. bound) value of `?population` is compared with the value 50000000. Since the current (i.e. bound) value of `?population` is less than 50000000 this potential result is discarded and the processing returns to the first statement in the WHERE clause and a new match is found with the triple pattern (technically we say that the variables become unbound and new values are obtained and bound to them). The next match might result in the bindings `?country = Germany` and `?population = 80996685`. This time the value bound to `?population` is greater than 50 million and so this binding will be accepted as a result to be returned. 

Whenever the query engine reaches the end of the sequence of statements in a WHERE clause, it returns to the first statement and repeats the process to see whether further results can be found. If no results are found, processing stops.
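
The match-then-filter process described above can be traced in plain Python. This is a simplified model of the explanation, not the way `rdflib` evaluates queries internally: the function name `countries_over` is hypothetical, and the data consists of the two population figures quoted in the text.

```python
# Toy data: the two population figures quoted in the text above
populations = [
    ("Latvia", 2165165),
    ("Germany", 80996685),
]

def countries_over(threshold):
    """Yield a binding dict for each country whose population passes the filter."""
    for country, population in populations:
        # Step 1: the triple pattern matches, binding both variables
        binding = {"?country": country, "?population": population}
        # Step 2: the FILTER tests the bound value; failures are discarded
        if binding["?population"] > threshold:
            yield binding
        # Either way, the loop moves on to look for the next match

results = list(countries_over(50000000))
print(results)
```

Latvia is matched by the pattern but rejected by the filter, so only the Germany binding appears in `results`.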

Note about the syntax: there is a full stop at the end of the first statement inside the WHERE clause to separate it from the second statement.

If you want to put the results of the last query in alphabetical order of country name, you can use the ORDER BY keyword (which must be placed before the LIMIT keyword, if present):

In [None]:
# Find the countries that have a population greater than 50 million; place in alphabetical order
q1 = '''SELECT DISTINCT ?country ?population
        WHERE { 
            ?country <http://www.example.org/hasPopulation> ?population .
            FILTER (?population > 50000000) 
        }ORDER BY ?country LIMIT 20 
        '''

# Run the query q1, and save the results in variable r1
r1 = geog.query(q1)

# Print the results
for row in r1:
    print (row[0], "has population", row[1])
    print()

### Activity 4

Output the results of the last query in order of increasing size of population. 

In [None]:
# Insert your solution here.

The solution is in the [`25.2solutions`](25.2solutions.ipynb) Notebook.

To order items in reverse alphabetical order or decreasing numerical value, wrap the variable to sort on in `DESC()`:

    ORDER BY DESC(?variable)

## Multiple patterns in a query

So far we have used only one triple pattern in our queries. More complex queries require multiple patterns.

For example, suppose you want to find the names of the countries that border those countries with a population greater than 50 million. This can be achieved by adding a further triple pattern to the last query.


In [None]:
# Find the countries that border those countries with a population greater than 50 million
q1 = '''SELECT DISTINCT ?country ?borderCountry
        WHERE { 
            ?country <http://www.example.org/hasPopulation> ?population .
            FILTER (?population > 50000000) .
            ?country <http://www.example.org/hasBorder> ?borderCountry 
        }ORDER BY ?borderCountry
        '''

# Run the query q1, and save the results in variable r1
r1 = geog.query(q1)

# Print the results
for row in r1:
    print (row[1], "borders", row[0])
    print()

In this example, the bindings of `?country` and `?population` found in the first statement are carried through to the second and third statements. If any subsequent statement fails to find a match or returns `false` (as in the case of the FILTER statement) the processing reverts to the immediately preceding statement.

For example, if `?country` refers to a country with no borders (Iceland has no bordering countries), processing reverts or **backtracks** to the previous statement, in this case the FILTER statement. Since a FILTER statement doesn't make bindings, backtracking continues to the previous statement (i.e. the first statement in this example). The query engine will attempt to find new matches for the variables in the first pattern and, if it succeeds, the processing moves forward to the second statement as before. 

In this way, the query engine continually moves forwards and backwards through the sequence of statements in the WHERE clause trying to find new bindings (when moving forward) and discarding existing bindings (when moving backwards).
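
The forwards-and-backwards movement described above can be modelled with a short recursive search in plain Python. This is a sketch of the explanation only, and a real query engine may evaluate the query differently. The helper names `substitute`, `match` and `solve` are hypothetical, the triples are toy data with short names instead of URIs, and a FILTER is represented by a Python function.

```python
# Toy graph: populations are the figures quoted in the text; borders are illustrative
triples = [
    ("Latvia", "hasPopulation", 2165165),
    ("Germany", "hasPopulation", 80996685),
    ("Germany", "hasBorder", "France"),
    ("Germany", "hasBorder", "Poland"),
    ("Latvia", "hasBorder", "Estonia"),
]

def is_variable(term):
    return isinstance(term, str) and term.startswith("?")

def substitute(pattern, bindings):
    """Replace any already-bound variables in the pattern by their values."""
    return tuple(bindings.get(t, t) if is_variable(t) else t for t in pattern)

def match(pattern, triple, bindings):
    """Extend bindings so that the pattern equals the triple, or return None."""
    new = dict(bindings)
    for p_term, t_term in zip(substitute(pattern, bindings), triple):
        if is_variable(p_term):
            if new.get(p_term, t_term) != t_term:
                return None
            new[p_term] = t_term
        elif p_term != t_term:
            return None
    return new

def solve(steps, bindings=None):
    """Yield every set of bindings that satisfies all steps, taken in order."""
    bindings = {} if bindings is None else bindings
    if not steps:
        yield bindings              # end of the WHERE clause: record a result
        return
    first, rest = steps[0], steps[1:]
    if callable(first):             # a FILTER: it tests, but binds nothing
        if first(bindings):
            yield from solve(rest, bindings)
        return                      # returning here is the backtracking step
    for triple in triples:          # a triple pattern: try every triple
        extended = match(first, triple, bindings)
        if extended is not None:
            yield from solve(rest, extended)

steps = [
    ("?country", "hasPopulation", "?population"),
    lambda b: b["?population"] > 50000000,
    ("?country", "hasBorder", "?borderCountry"),
]

results = [(b["?country"], b["?borderCountry"]) for b in solve(steps)]
print(results)
```

Here the Latvia binding is discarded at the filter, so the search backtracks to the first pattern and tries Germany, which then yields one result per bordering country.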

Note the use of the full stop to separate the statements inside the WHERE clause.

### Activity 5

Find all information held about Germany.

In [None]:
# Insert your solution here.

The solution is in the [`25.2solutions`](25.2solutions.ipynb) Notebook.

### Activity 6

Find all countries that have German as an official language. (The term German is a literal value, so use `"German"` as the object in the pattern.)

In [None]:
# Insert your solution here.

The solution is in the [`25.2solutions`](25.2solutions.ipynb) Notebook.

### Activity 7

Find all countries that have German as an official language and which border Italy.

In [None]:
# Insert your solution here.

The solution is in the [`25.2solutions`](25.2solutions.ipynb) Notebook.

## Some shortcuts

Queries can become quite unreadable when they contain several patterns; there are several ways to reduce the size of queries and make them easier to read. 

As you have already seen in Part 24, triples in a graph can be made more readable by use of prefixes for URIs. The same is true in SPARQL although the syntax is slightly different. Here is the previous example rewritten using a PREFIX statement.

In [None]:
# Find the countries that border those countries with a population greater than 50 million

q1 = '''
        PREFIX eg: <http://www.example.org/>
        SELECT DISTINCT ?country ?borderCountry
        WHERE { 
            ?country eg:hasPopulation ?population .
            FILTER (?population > 50000000) .
            ?country eg:hasBorder ?borderCountry .
        } LIMIT 10
        '''

# Run the query q1, and save the results in variable r1
r1 = geog.query(q1)

# Print the results
for row in r1:
    print (row[1], "borders", row[0])
    print()

When there are patterns with the same subject, they can be combined into a single pattern using a semicolon to separate the predicate-object pairs. For example, suppose we wanted to find all countries that border Germany and have German as an official language. Here is one way to do this.

In [None]:
# Find countries that border Germany that have German as an official language
q1 = '''
        PREFIX eg: <http://www.example.org/>
        PREFIX geo: <http://www.example.org/geography/>
        SELECT DISTINCT ?country
        WHERE { 
            ?country eg:hasBorder geo:Germany .
            ?country eg:hasOfficialLanguage "German" .
        } LIMIT 10
        '''

# Run the query q1, and save the results in variable r1
r1 = geog.query(q1)

# Print the results
for row in r1:
    print (row[0])
    print()

As the two patterns have the same subject, `?country`, they can be combined into a single pattern using a semicolon as follows:

In [None]:
q1 = '''
        PREFIX eg: <http://www.example.org/>
        PREFIX geo: <http://www.example.org/geography/>
        SELECT DISTINCT ?country
        WHERE { 
            ?country eg:hasBorder geo:Germany ;
                     eg:hasOfficialLanguage "German" .
        } LIMIT 10
        '''

# Run the query q1, and save the results in variable r1
r1 = geog.query(q1)

# Print the results
for row in r1:
    print (row[0])
    print()

If two successive patterns have the same subject and predicate they can be combined using a comma. For example, to find countries that border both Germany and France you could write:

In [None]:
q1 = '''
        PREFIX eg: <http://www.example.org/>
        PREFIX geo: <http://www.example.org/geography/>
        SELECT DISTINCT ?country
        WHERE { 
            ?country eg:hasBorder geo:Germany ,
                                  geo:France .
        } LIMIT 10
        '''

# Run the query q1, and save the results in variable r1
r1 = geog.query(q1)

# Print the results
for row in r1:
    print (row[0])
    print()

## Summary

In this Notebook you have seen that a triple pattern is a triple in which one or more of its subject, predicate and object are variables. A variable is a name prefixed with `?`.

A query engine searches the graph for triples that match the pattern(s) in the WHERE clause of a SELECT query. 

The query engine returns the values bound to the variables specified after the SELECT keyword.

When searching, the query engine works through multiple patterns in the order in which they have been written down, matching the patterns to triples in the graph. Whenever a match is found, the variables in the pattern are bound to the values in the triple. The bound values are carried on to the next pattern and a further match is sought for any variables that have not yet been bound. Once all patterns have been tried and values found for the variables, the query engine records the result and returns to the first pattern to see whether further matches (results) can be found. 

If, at any stage, the query engine fails to match a pattern, it backtracks to the previous pattern, unbinds any variables that were bound in that previous pattern, and looks for new matches for the variables in that previous pattern. If it finds a new match it continues moving forward through the patterns.

The FILTER statement enables you to carry forward the value of a variable if it meets a given criterion. Otherwise, the query engine unbinds that variable and backtracks.

SPARQL contains a number of features for limiting the number of results returned (which can be extensive, especially when you are experimenting):

1. LIMIT constrains the number of results returned
2. DISTINCT ensures that a result is returned only once

To print out a set of results in order, the ORDER BY keyword can be used.

A variety of mechanisms are available to reduce the complexity of queries:

1. the PREFIX statement can be used to replace parts of a long URI by a short name
2. if successive patterns have the same subject but different predicates, the predicates (and associated objects) can be listed after the subject separated by semicolons
3. if successive patterns have the same subject and predicate, the objects can be listed after the predicate separated by commas.

## What next?

If you are working through this Notebook as part of an inline exercise, return to the module materials now.

If you are working through this set of Notebooks as a whole, move on to `25.3 Endpoints accessing real data`.