# Using multiple endpoints

Running queries against a dataset accessible from a single endpoint is a useful facility. Even more useful is the ability to extract and merge data from several datasets located at different endpoints. In this Notebook you will learn how to create a single query that extracts data from multiple endpoints and melds them together - a **federated query**.

There are several ideas that you need to become familiar with before examining a federated query in detail, so we shall proceed through a series of incremental examples.

Note that some queries may take several seconds to complete, depending on network speeds and the load on the processor at the endpoint.

To start with, import SPARQLWrapper and create the usual helper functions for creating queries and printing results.

In [None]:
# Import the necessary packages
from SPARQLWrapper import SPARQLWrapper, JSON

# Add some helper functions

# A function that will return the results of running a SPARQL query with
# a defined set of prefixes over a specified endpoint.
# It follows the usual five-step process apart from creating the query, 
# which is provided as an argument to the function.
def runQuery(endpoint, prefix, q):
    ''' Run a SPARQL query with a declared prefix over a specified endpoint'''
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(prefix+q) # concatenate the strings representing the prefixes and the query
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()

# Import pandas to provide facilities for creating a DataFrame to hold results
import pandas as pd

# Function to convert query results into a DataFrame
def dict2df(results):
    ''' A function to flatten the SPARQL query results and return the column values '''
    data = []
    for result in results["results"]["bindings"]:
        tmp = {}
        for el in result:
            tmp[el] = result[el]['value']
        data.append(tmp)  # append once per result row, after all its variables have been copied
            
    df = pd.DataFrame(data)
    return df

# Function to run a query and return results in a DataFrame
def dfResults(endpoint, prefix, q):
    ''' Generate a DataFrame containing the results of running
        a SPARQL query with a declared prefix over a specified endpoint '''
    return dict2df(runQuery(endpoint, prefix, q))

# Print a limited number of results of a query
def printQuery(results, limit=''):
    ''' Print the results from the SPARQL query '''
    resdata = results["results"]["bindings"]
    if limit != '':
        resdata = results["results"]["bindings"][:limit]
    for result in resdata:
        for ans in result:
            print('{0}: {1}'.format(ans, result[ans]['value']))
        print()
        
# Run a query and print out a limited number of results
def printRunQuery(endpoint, prefix, q, limit=''):
    ''' Run a SPARQL query over a specified endpoint and print a limited number of its results '''
    results = runQuery(endpoint, prefix, q)
    printQuery(results, limit)
    
print("Helper functions set up")    
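
To make the shape of the data concrete, the following sketch flattens a hand-written dictionary that mimics the SPARQL JSON results format. The sample bindings are invented for illustration (they are not output from any endpoint), and the function name `flatten_bindings` is hypothetical. It uses only plain Python, so it can be run without a network connection.

```python
# A hand-made stand-in for the dictionary returned by sparql.query().convert().
# The values are illustrative only, not real query output.
sample_results = {
    "head": {"vars": ["name", "district"]},
    "results": {
        "bindings": [
            {"name": {"type": "literal", "value": "Place A"},
             "district": {"type": "uri", "value": "http://example.org/district/1"}},
            {"name": {"type": "literal", "value": "Place B"},
             "district": {"type": "uri", "value": "http://example.org/district/2"}},
        ]
    },
}

def flatten_bindings(results):
    """Return one plain dict per result row, keeping only each variable's value."""
    rows = []
    for binding in results["results"]["bindings"]:
        rows.append({var: cell["value"] for var, cell in binding.items()})
    return rows

rows = flatten_bindings(sample_results)
for row in rows:
    print(row)
```

Each result row contributes exactly one dictionary, which is the form that `pd.DataFrame` expects when building a table with one row per result.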

## Example 1: The Environment Agency's Bathing Water Linked Data

Recall that the Environment Agency collects water quality data each year from May to September, to ensure that designated bathing water sites on the coast and inland are safe and clean for swimming and other activities.

The Environment Agency splits up the country into counties, and counties are divided into districts. The following query prints out some of the attributes of bathing water places, specifically their name and the district in which they are located.

There are three triple patterns in the query that do the following:

   1. find a bathing water place and bind the results to variable `?x`
   2. find the name of the bathing water place
   3. find the district of the bathing water place.

In [None]:
# Define the endpoint
endpoint_envAgency = 'http://environment.data.gov.uk/sparql/bwq/query'

# Set up prefixes
prefix = '''
    PREFIX bw: <http://environment.data.gov.uk/def/bathing-water/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX stats: <http://statistics.data.gov.uk/def/administrative-geography/>
'''

# Define the query
query = '''
    SELECT ?name ?district
    WHERE {
        ?x a bw:BathingWater . #find subjects that are of type (predicate a) bw:BathingWater
        ?x rdfs:label ?name .  #obtain name of the subject
        ?x stats:district ?district . #obtain district (number) of the subject
    }
'''

# Run query and print out a limited number of results
printRunQuery(endpoint_envAgency, prefix, query, limit=6)


For each name there are two results: one gives the district's location reference previously used by the Office for National Statistics (ONS), which is no longer active, and the other gives its Ordnance Survey (OS) identification number.
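
One way to check that each name really is paired with two district identifiers is to group the district URIs by name. The sketch below works on a small hand-written list of rows; the names and URIs are placeholders, not real results, and `group_districts` is a hypothetical helper.

```python
from collections import defaultdict

# Placeholder rows in the flattened form {variable: value}; not real results.
rows = [
    {"name": "Place A", "district": "http://example.org/ons/district-1"},
    {"name": "Place A", "district": "http://example.org/os/7000000000000001"},
    {"name": "Place B", "district": "http://example.org/ons/district-2"},
    {"name": "Place B", "district": "http://example.org/os/7000000000000002"},
]

def group_districts(rows):
    """Map each name to the list of district URIs reported for it."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["name"]].append(row["district"])
    return dict(grouped)

grouped = group_districts(rows)
for name, districts in grouped.items():
    print(name, districts)
```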

### Activity 1
Choose one of the districts and click on the URI for the OS location. This should show some data about the chosen district. What do you see?


### Discussion 1
You should see a map of the district in which the bathing area is located. For example, Ringstead Bay is located in the district of West Dorset.


### Activity 2

The Ordnance Survey endpoint provides a wealth of information. It uses the `rdfs:label` predicate to associate a label with a given object. Use this predicate to find the OS district identifier for East Sussex by completing the following query.

In [None]:
endpoint_os = 'http://data.ordnancesurvey.co.uk/datasets/os-linked-data/apis/sparql'

prefix = '''
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
'''

q = '''
    SELECT ?districtID
        WHERE {
            # Fill in the appropriate (single) triple pattern here
            
        }
'''

printRunQuery(endpoint_os, prefix, q)


The solution is in the [`26.1solutions`](26.1solutions.ipynb) Notebook.

## Example 2: Find information about bathing places in East Sussex

In this example we are going to send a query to the Environment Agency's endpoint. However, this query will include a SERVICE request to another endpoint (OS) for some data that it will then use to complete the original query.

Specifically, the SERVICE request will ask the OS endpoint for the names and identifiers of the districts within East Sussex. Then, using this data, the Environment Agency's endpoint will be asked to supply various data about the bathing water places in each district.


In [None]:
prefix = '''
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX admingeo: <http://statistics.data.gov.uk/def/administrative-geography/>
    PREFIX ossr: <http://data.ordnancesurvey.co.uk/ontology/spatialrelations/>
'''

q = '''
    SELECT ?districtname ?sedimentname ?location
    WHERE {
        # Ask OS endpoint to find a district, and its name, 
        # within the area East Sussex
        SERVICE <http://data.ordnancesurvey.co.uk/datasets/boundary-line/apis/sparql> {
            ?area rdfs:label "East Sussex".
            ?district ossr:within ?area.
            ?district rdfs:label ?districtname.
        }

        # Ask Environment Agency to find location and sediment type
        # of district.
        
        # Find location of district
        ?location <http://statistics.data.gov.uk/def/administrative-geography/district> ?district .
        #Find whether location is that of a bathing water place
        ?location a <http://environment.data.gov.uk/def/bathing-water/BathingWater> .

        #Find type of sediment at this bathing water place
        ?location <http://environment.data.gov.uk/def/bathing-water/sedimentTypesPresent> ?sediment .
        ?sediment rdfs:label ?sedimentname.

    }
    ORDER BY ?districtname
'''

printRunQuery(endpoint_envAgency, prefix, q, limit=30)

The query, `q`, is sent to the Environment Agency's endpoint:

    endpoint_envAgency = 'http://environment.data.gov.uk/sparql/bwq/query'

Part of that SELECT query is a request for a SERVICE from the OS endpoint:

    SERVICE <http://data.ordnancesurvey.co.uk/datasets/boundary-line/apis/sparql> { 
      ...
    }

The values returned from the OS endpoint, `?district` and `?districtname`, are available to the remainder of the query.
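
Conceptually, the bindings returned by the SERVICE block are joined with the bindings of the remaining patterns on the variables they share (here `?district`). The following is only a simplified sketch of that idea in plain Python, using invented bindings; real SPARQL engines are free to evaluate the join quite differently.

```python
def join_bindings(left, right):
    """Join two lists of bindings on the variables they share."""
    joined = []
    for l_row in left:
        for r_row in right:
            shared = set(l_row) & set(r_row)
            # Rows are compatible when every shared variable has the same value
            if all(l_row[v] == r_row[v] for v in shared):
                joined.append({**l_row, **r_row})
    return joined

# Invented bindings standing in for the SERVICE results ...
service_rows = [
    {"district": "d1", "districtname": "District One"},
    {"district": "d2", "districtname": "District Two"},
]
# ... and for the patterns evaluated at the main endpoint
local_rows = [
    {"district": "d1", "location": "loc-a", "sedimentname": "sand"},
    {"district": "d1", "location": "loc-a", "sedimentname": "shingle"},
    {"district": "d3", "location": "loc-b", "sedimentname": "mud"},
]

joined = join_bindings(service_rows, local_rows)
for row in joined:
    print(row["districtname"], row["location"], row["sedimentname"])
```

Only rows that agree on `district` survive the join, which is why a district with no matching bathing water place would not appear in the final results.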


### Activity 3

Click on one of the location URLs returned by the previous query to see what kind of information is available about a specific bathing place. Then answer the following:

1. What kind(s) of sediment are to be found at this place?
2. What are the different categories of sediment used by the Environment Agency?


### Discussion 3

1. We visited Eastbourne at http://environment.data.gov.uk/id/bathing-water/ukj2201-14500 and found that both sand and shingle are present.

2. Clicking on one of the sediment types, such as sand, reveals a definition of that sediment type including the link 'sediment type'. This downloads a Turtle file from which it is possible to discover the types of sediment used by the Environment Agency:

    sand, shingle, rock, marsh, mud and other


### Activity 4

Amend the query in Example 2 to find information about bathing places in another area (e.g. West Sussex, Dorset, Devon or another area (county) of your own choice).

The solution is in the [`26.1solutions`](26.1solutions.ipynb) Notebook.

## Example 3: Using a general-purpose endpoint and the FROM keyword

This example differs from the earlier examples in that it uses a general-purpose endpoint (SPARQLer).

Since SPARQLer is general purpose, it expects to receive queries accompanied by the URL of the dataset (graph) to be searched. In SPARQL, it is possible to include this URL within a query using the FROM keyword (which is placed immediately after the SELECT construct - as shown in the code below).

The following query returns the names of individuals in Tim Berners-Lee's FOAF dataset.


In [None]:
# Set the endpoint - here is a general purpose endpoint
endpoint = 'http://sparql.org/sparql'

# Set prefix
prefix = '''
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
'''

# Create query
q = '''
    SELECT ?name
        FROM <http://dig.csail.mit.edu/2008/webdav/timbl/foaf.rdf>
        WHERE {
           ?person a foaf:Person;
                   foaf:name ?name.
        } ORDER BY ?name
'''

printRunQuery(endpoint, prefix, q, limit=15)
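
The SPARQL protocol also allows the dataset to be named in the HTTP request itself, through the `default-graph-uri` parameter, rather than in the query text. The sketch below only builds such a request URL with the standard library; it does not send it, and whether a particular endpoint honours the parameter is an assumption you would need to check. The function name `build_request_url` is hypothetical.

```python
from urllib.parse import urlencode, urlparse, parse_qs

def build_request_url(endpoint, query, default_graph):
    """Build a GET URL that passes the dataset via default-graph-uri."""
    params = {"query": query, "default-graph-uri": default_graph}
    return endpoint + "?" + urlencode(params)

query = """
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?name
    WHERE { ?person a foaf:Person ; foaf:name ?name . }
    ORDER BY ?name
"""

url = build_request_url(
    "http://sparql.org/sparql",
    query,
    "http://dig.csail.mit.edu/2008/webdav/timbl/foaf.rdf",
)

# Decode the URL again to confirm what would be sent
sent = parse_qs(urlparse(url).query)
print(sent["default-graph-uri"][0])
```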

## Example 4: Linked Movie Database (LinkedMDB)

This example sends a query to an endpoint via a SERVICE request.

LinkedMDB publishes linked open data. It publishes an open semantic web database for movies, which includes a large number of interlinks to several datasets on the open data cloud and provides references to related webpages.

LinkedMDB uses numeric identifiers for movies: movie 675 is 'Star Trek: The Motion Picture'. The following query asks for the names of the actors in this movie.

Since the service provided by the LinkedMDB endpoint is specialised for movie data, there is no need to pass the URL of the dataset. However, the SPARQLer endpoint expects to receive a URL. Therefore, we have provided a genuine URL, but one that will be ignored by the LinkedMDB endpoint. The FROM clause is marked with the comment 'placeholder graph'.


In [None]:
endpoint = 'http://sparql.org/sparql'

prefix = '''
    PREFIX movie: <http://data.linkedmdb.org/resource/movie/>
    PREFIX dbpedia: <http://dbpedia.org/ontology/>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX film: <http://data.linkedmdb.org/resource/film/>
'''

q = '''
    SELECT ?actor_name
    FROM <http://dig.csail.mit.edu/2008/webdav/timbl/foaf.rdf> # placeholder graph
    WHERE {
        SERVICE <http://data.linkedmdb.org/sparql> {
              film:675 movie:actor ?actor .
              ?actor movie:actor_name ?actor_name
        }
    }
'''

printRunQuery(endpoint, prefix, q, limit=15)

The query sent to the LinkedMDB endpoint has two triple patterns. The first finds an actor in the film with ID 675, the second finds the actor's name. The actor's name is returned to the SPARQLer endpoint that is dealing with the overall query. The results of the query are returned to the program to be printed out.

## Example 5: Multiple SERVICE requests

Suppose that we now wish to find the birth dates of the actors in *Star Trek: The Motion Picture*. This information is not available from the LinkedMDB dataset but is available from the DBpedia data. Therefore, we could try to send an actor's name to DBpedia and use its `birthDate` property.

Here is the amended query which has a second SERVICE request. This request has three patterns. The first finds a subject (`?pers`) who is a member of the set `Person`. The second ensures that this person has the required name (`?actor_name_en`) and the third finds the birth date of this person.

Note that the first SERVICE request binds its result to the variable `?actor_name`. However, names in DBpedia carry a language tag (for example, names with the tag 'en' are in English), whereas the LinkedMDB names do not. Therefore, BIND has been used with `STRLANG(?actor_name, "en")` to add a language tag to the value in `?actor_name` and to bind that new value to a new variable, `?actor_name_en`. The latter is then used in the second SERVICE request.
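
The reason the tag matters is that, in RDF, a plain literal and a language-tagged literal with the same characters are different terms, so they do not match each other. The sketch below models this in plain Python; the `Literal` class and `strlang` function are illustrative stand-ins, not part of SPARQLWrapper or any RDF library.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Literal:
    """A toy RDF literal: a string plus an optional language tag."""
    value: str
    lang: Optional[str] = None

def strlang(literal, lang):
    """Mimic SPARQL's STRLANG: same characters, with the given language tag."""
    return Literal(literal.value, lang)

untagged = Literal("Some Actor")        # the form without a language tag
tagged = Literal("Some Actor", "en")    # the form with the tag 'en'

print(untagged == tagged)                   # compares the two terms as they stand
print(strlang(untagged, "en") == tagged)    # compares after adding the tag
```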


In [None]:
q = '''
    SELECT ?actor_name ?birth_date
    FROM <http://dig.csail.mit.edu/2008/webdav/timbl/foaf.rdf> # placeholder graph
    WHERE {
        {
            SERVICE <http://data.linkedmdb.org/sparql> {
                film:675 movie:actor ?actor .
                ?actor movie:actor_name ?actor_name
            }
            BIND(STRLANG(?actor_name, "en") AS ?actor_name_en)
        }
        SERVICE <http://dbpedia.org/sparql> {
            ?pers a foaf:Person ;
                foaf:name ?actor_name_en ;
                dbpedia:birthDate ?birth_date .
        }
    }
'''

printRunQuery(endpoint, prefix, q, limit=15)

You should observe a problem with the results of this amended query. Several people have the same name but different birth dates. This is simply because DBpedia knows a lot about different people with the same names. In these results there is no way to distinguish the actors in *Star Trek: The Motion Picture* from other people of the same name.

You might like to investigate the DBpedia ontology further to see whether there is a property that might pick out the actors.
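
A quick way to see how widespread the ambiguity is would be to collect the birth dates returned for each name and list the names with more than one. The rows below are invented placeholders in the flattened `{variable: value}` form, and `ambiguous_names` is a hypothetical helper.

```python
from collections import defaultdict

# Invented rows; real results would come from the federated query
rows = [
    {"actor_name": "Actor One", "birth_date": "1930-01-01"},
    {"actor_name": "Actor Two", "birth_date": "1920-01-02"},
    {"actor_name": "Actor Two", "birth_date": "1975-06-30"},
]

def ambiguous_names(rows):
    """Return names that are associated with more than one birth date."""
    dates = defaultdict(set)
    for row in rows:
        dates[row["actor_name"]].add(row["birth_date"])
    return {name: sorted(ds) for name, ds in dates.items() if len(ds) > 1}

ambiguous = ambiguous_names(rows)
print(ambiguous)
```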

This last query has illustrated that a query can be sent to one endpoint which then requests services from several other endpoints before returning results. Such a query is known as a **federated query**.

It also illustrates that different datasets can hold similar data but in different formats and some processing may be required to change the format of data obtained from one dataset before it can be used to query another dataset.


## A case study of a federated query

The next query involves extracting data from Open Data Communities, the Department for Communities and Local Government's (DCLG) linked data platform.

Since the 1970s, the Department for Communities and Local Government has calculated local measures of deprivation in England. The increasing availability of administrative data at local levels has driven developments in the definition and measurement of deprivation. For the 2010 investigation, seven distinct domains have been identified: Income Deprivation, Employment Deprivation, Health Deprivation and Disability, Education Skills and Training Deprivation, Barriers to Housing and Services, Living Environment Deprivation, and Crime. They have been combined, using appropriate weights, into a single overall Index of Multiple Deprivation which can be used to rank every small area in England according to the deprivation experienced by the people living there.

Further information about the 2010 data can be found at http://www.communities.gov.uk/documents/statistics/pdf/1871208.pdf.

Suppose that we want to discover data about deprivation on the Isle of Wight.

The UK is divided into administrative regions called unitary authorities. The first task is to find out the name of the unitary authority for the Isle of Wight. This can be done by searching the Open Data Communities dataset as follows.

In [None]:
endpoint_odc = 'http://opendatacommunities.org/sparql'

prefix = '''
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX osadmingeo: <http://data.ordnancesurvey.co.uk/ontology/admingeo/>
'''

q = '''
    SELECT ?uaname
        WHERE {

            # Find unitary authorities
            ?ua rdf:type osadmingeo:UnitaryAuthority .
            ?ua rdfs:label ?uaname .
        }
        LIMIT 60
'''

df = dfResults(endpoint_odc, prefix, q)
s = df.uaname
print(s)


A search of the results shows that the required unitary authority is named 'Isle of Wight'.
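
With up to 60 names printed, a small filter makes the search easier. The sketch below applies a case-insensitive substring match to a short hand-written list (in the Notebook the full list would come from `df.uaname`); `find_names` is a hypothetical helper.

```python
def find_names(names, fragment):
    """Return the names containing the fragment, ignoring case."""
    fragment = fragment.lower()
    return [name for name in names if fragment in name.lower()]

# A short hand-written sample; in the Notebook the names would come from df.uaname
names = ["Cornwall", "Hartlepool", "Isle of Wight", "Poole", "Windsor and Maidenhead"]

matches = find_names(names, "wight")
print(matches)
```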

The second step is to use the name 'Isle of Wight' to find the object whose label is 'Isle of Wight' and is a unitary authority. This can be done by querying the Ordnance Survey's administrative regions. Then, the districts within this unitary authority can be found.

The next query finds the URL for the website of the local authority for a specified region (Isle of Wight) and the council's IMD (Index of Multiple Deprivation) rank.

Note: The ranking data is for 2010. A revised index for 2015 was published in September 2015 (see  http://www.gov.uk/government/uploads/system/uploads/attachment_data/file/465791/English_Indices_of_Deprivation_2015_-_Statistical_Release.pdf).

In [None]:
#Open Data Communities endpoint
endpoint_odc = 'http://opendatacommunities.org/sparql'

#Define prefixes
prefix = '''
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX osadmingeo: <http://data.ordnancesurvey.co.uk/ontology/admingeo/>
    PREFIX ossr: <http://data.ordnancesurvey.co.uk/ontology/spatialrelations/>
'''

# Define query
q = '''
    SELECT ?councilwebsite ?imdrank ?authority ?authorityname
    WHERE {
    
        # Find IMD rank for IoW
        # 1. Find the unitary authority "Isle of Wight"
        ?iow rdfs:label "Isle of Wight" ;
            rdf:type osadmingeo:UnitaryAuthority .   
            
        # 2. Find the reference area for the unitary authority
        ?s <http://purl.org/linked-data/sdmx/2009/dimension#refArea> ?iow .
        
        # 3. Find the overall rank of the reference area
        ?s <http://opendatacommunities.org/def/IMD#IMD-rank> ?imdrank . 

        # Find council website and authority name for IoW
        ?authority <http://opendatacommunities.org/def/local-government/governs> ?iow .
        ?authority <http://xmlns.com/foaf/0.1/page> ?councilwebsite .
        ?authority rdfs:label ?authorityname.
    }
'''

printRunQuery(endpoint_odc, prefix, q, limit=15)

You might like to re-run this query with other unitary authorities (say 'Cornwall',  'Hartlepool', 'Poole' and 'Windsor and Maidenhead') and find their ranks. The lower the rank, the more deprived the area.
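
Rather than editing the query text by hand each time, the authority name can be substituted into a template. This is a minimal sketch using `string.Template`; the helper name `imd_query_for` is hypothetical, and only the rank-related patterns from the query above are included for brevity.

```python
from string import Template

# $name is replaced by the authority's label
query_template = Template('''
    SELECT ?imdrank
    WHERE {
        ?ua rdfs:label "$name" ;
            rdf:type osadmingeo:UnitaryAuthority .
        ?s <http://purl.org/linked-data/sdmx/2009/dimension#refArea> ?ua .
        ?s <http://opendatacommunities.org/def/IMD#IMD-rank> ?imdrank .
    }
''')

def imd_query_for(name):
    """Return the query text for one unitary authority."""
    # Escape backslashes and double quotes so the label stays a valid SPARQL string
    safe = name.replace('\\', '\\\\').replace('"', '\\"')
    return query_template.substitute(name=safe)

queries = {name: imd_query_for(name)
           for name in ['Cornwall', 'Hartlepool', 'Poole', 'Windsor and Maidenhead']}
print(queries['Poole'])
```

Each generated string could then be passed as the `q` argument of `printRunQuery`, together with the same endpoint and prefixes.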

We can combine the previous two queries into a single query, whose execution starts at one of the endpoints, in this case the Ordnance Survey endpoint. The SERVICE command then executes another query fragment on a remote endpoint, in this case the Open Data Communities endpoint.

In [None]:
q = '''
    SELECT ?districtname ?councilwebsite ?imdrank ?authority ?authorityname
    WHERE {

        # Find the object whose label is 'Isle of Wight' and is a
        # unitary authority
        ?iow rdfs:label "Isle of Wight" ;
            rdf:type osadmingeo:UnitaryAuthority .

        # Find name of districts within IoW unitary authority
        ?district ossr:within ?iow .
        ?district rdfs:label ?districtname.

        # Run a query at the opendatacommunities endpoint that
        # finds the IMD rank for IoW and the council website and authority name
        SERVICE <http://opendatacommunities.org/sparql> {

            ?s <http://purl.org/linked-data/sdmx/2009/dimension#refArea> ?iow .
            ?s <http://opendatacommunities.org/def/IMD#IMD-rank> ?imdrank . 


            ?authority <http://opendatacommunities.org/def/local-government/governs> ?iow .
            ?authority <http://xmlns.com/foaf/0.1/page> ?councilwebsite .
            ?authority rdfs:label ?authorityname.

        }
    }
'''

# Print out a few of the results
printRunQuery(endpoint_os, prefix, q, limit=5)

The aim of this case study was to illustrate another federated query in which a sub-query is sent by one endpoint to another endpoint.

The specific details of the triple patterns used in the queries are not important (although trying to find the appropriate properties to use in the queries takes some time as you have to become familiar with the ontologies adopted for the different datasets).


## Summary

The primary purpose of this Notebook has been to give you practical experience of federated queries in which a single query gives rise to multiple sub-queries that are sent to different endpoints to be actioned. The main query can then combine the several sets of results from the sub-queries to answer the primary question.

To send a query to a remote endpoint, you use SPARQL's SERVICE mechanism in which you specify the URL of the endpoint to be used to answer a specific sub-query. A remote endpoint will find suitable bindings for the variables mentioned in the sub-query which are then available to the main query to be  used in subsequent patterns (including further SERVICE requests).
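
The shape of such a query can be summarised with a small helper that wraps a group of triple patterns in a SERVICE block. This is an illustrative sketch only: `service_block` is a hypothetical function, and the patterns are copied from Example 4.

```python
def service_block(endpoint_url, patterns):
    """Wrap triple patterns in a SERVICE block addressed to endpoint_url."""
    body = "\n".join("        " + p for p in patterns)
    return "    SERVICE <{0}> {{\n{1}\n    }}".format(endpoint_url, body)

block = service_block(
    "http://data.linkedmdb.org/sparql",
    ["film:675 movie:actor ?actor .",
     "?actor movie:actor_name ?actor_name ."],
)

# Embed the block in a main query
q = "SELECT ?actor_name\nWHERE {\n" + block + "\n}"
print(q)
```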

You saw that when the results of one SERVICE request are required by another SERVICE request (or even by the main query itself), the format of the data may need some manipulation: graphs dealing with the same subject matter may not use the same format for similar data. You saw an example where names in one graph had a language tag but the same names in another graph did not. In this case BIND was used with STRLANG to bind the language-tagged value to a new variable.

## What next?

If you are working through this Notebook as part of an inline exercise, return to the module materials now.

If you are working through this set of Notebooks as a whole, move on to `26.2 The SPARQL CONSTRUCT query and inferencing`.