# Using multiple endpoints

Running queries against a dataset accessible from a single endpoint is a useful facility. Even more useful is the ability to extract and merge data from several datasets located at different endpoints. In this Notebook you will learn how to create a single query that extracts data from multiple endpoints and melds them together - a **federated query**.

There are several ideas that you need to become familiar with before examining a federated query in detail, so we shall proceed through a series of incremental examples.

Note that some queries may take several seconds to complete, depending on network speeds and the load on the processor at the endpoint.

To start with, import SPARQLWrapper and create the usual helper functions for creating queries and printing results.

In [None]:
# Import the necessary packages
from SPARQLWrapper import SPARQLWrapper, JSON

# Add some helper functions

# A function that will return the results of running a SPARQL query with
# a defined set of prefixes over a specified endpoint.
# It follows the usual five-step process apart from creating the query, 
# which is provided as an argument to the function.
def runQuery(endpoint, prefix, q):
    ''' Run a SPARQL query with a declared prefix over a specified endpoint'''
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(prefix+q) # concatenate the strings representing the prefixes and the query
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()

# Import pandas to provide facilities for creating a DataFrame to hold results
import pandas as pd

# Function to convert query results into a DataFrame
def dict2df(results):
    ''' A function to flatten the SPARQL query results and return the column values '''
    data = []
    for result in results["results"]["bindings"]:
        tmp = {}
        for el in result:
            tmp[el] = result[el]['value']
        data.append(tmp)  # append once per result row, after all its variables are copied
            
    df = pd.DataFrame(data)
    return df

# Function to run a query and return results in a DataFrame
def dfResults(endpoint, prefix, q):
    ''' Generate a DataFrame containing the results of running
        a SPARQL query with a declared prefix over a specified endpoint '''
    return dict2df(runQuery(endpoint, prefix, q))

# Print a limited number of results of a query
def printQuery(results, limit=''):
    ''' Print the results from the SPARQL query '''
    resdata = results["results"]["bindings"]
    if limit != '':
        resdata = results["results"]["bindings"][:limit]
    for result in resdata:
        for ans in result:
            print('{0}: {1}'.format(ans, result[ans]['value']))
        print()
        
# Run a query and print out a limited number of results
def printRunQuery(endpoint, prefix, q, limit=''):
    ''' Print the results from the SPARQL query '''
    results = runQuery(endpoint, prefix, q)
    printQuery(results, limit)
    
print("Helper functions set up")    
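
To make the shape of the data concrete, the following is a minimal sketch of the SPARQL JSON results structure that the helper functions expect. The sample is hand-written (the names and URIs are made up for illustration and are not real query output), and `flatten_bindings` is a hypothetical plain-Python stand-in that mirrors what `dict2df` does before the rows are handed to pandas.

```python
# Hand-written sample in the SPARQL JSON results layout (illustrative values only)
sample_results = {
    "head": {"vars": ["name", "district"]},
    "results": {
        "bindings": [
            {
                "name": {"type": "literal", "value": "Example Bay"},
                "district": {"type": "uri", "value": "http://example.org/district/1"},
            },
            {
                "name": {"type": "literal", "value": "Sample Sands"},
                "district": {"type": "uri", "value": "http://example.org/district/2"},
            },
        ]
    },
}

def flatten_bindings(results):
    """Return one plain dict per result row, keeping only each variable's value."""
    rows = []
    for binding in results["results"]["bindings"]:
        rows.append({var: cell["value"] for var, cell in binding.items()})
    return rows

rows = flatten_bindings(sample_results)
for row in rows:
    print(row)
```

Each result row becomes exactly one dictionary, keyed by the variable names in the SELECT clause, which is the form a DataFrame constructor can consume directly.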

## Example 1: The Environment Agency's Bathing Water Linked Data

Recall that the Environment Agency collects water quality data each year from May to September, to ensure that designated bathing water sites on the coast and inland are safe and clean for swimming and other activities.

The Environment Agency splits up the country into counties, and counties are divided into districts. The following query prints out some of the attributes of bathing water places, specifically the name of each place and the district in which it lies.

There are three triple patterns in the query that do the following:

   1. find a bathing water place and bind the result to variable `?x`
   2. find the name of the bathing water place
   3. find the district of the bathing water place.

In [None]:
# Define the endpoint
endpoint_envAgency = 'http://environment.data.gov.uk/sparql/bwq/query'

# Set up prefixes
prefix = '''
    PREFIX bw: <http://environment.data.gov.uk/def/bathing-water/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX stats: <http://statistics.data.gov.uk/def/administrative-geography/>
'''

# Define the query
query = '''
    SELECT ?name ?district
    WHERE {
        ?x a bw:BathingWater . #find subjects that are of type (predicate a) bw:BathingWater
        ?x rdfs:label ?name .  #obtain name of the subject
        ?x stats:district ?district . #obtain district (number) of the subject
    }
'''

# Run query and print out a limited number of results
printRunQuery(endpoint_envAgency, prefix, query, limit=6)


For each named bathing water place there are two results, each identifying its district in a different way: one gives the location reference previously used by the Office for National Statistics (ONS), which is no longer active, and the other gives an Ordnance Survey (OS) identification number.

### Activity 1
Choose one of the districts and click on the URI for the OS location. This should show some data about the chosen district. What do you see?


### Discussion 1
You should see a map of the district in which the bathing area is located. For example, Ringstead Bay is located in the district of West Dorset.


### Activity 2

The Ordnance Survey endpoint provides a wealth of information. It uses the `rdfs:label` predicate to associate a label with a given object. Use this predicate to find the OS district identifier for East Sussex by completing the following query.

In [None]:
endpoint_os = 'http://data.ordnancesurvey.co.uk/datasets/os-linked-data/apis/sparql'

prefix = '''
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
'''

q = '''
    SELECT ?districtID
        WHERE {
            # Fill in the appropriate (single) triple pattern here
            
        }
'''

printRunQuery(endpoint_os, prefix, q)


#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.


The required triple is:

    ?districtID rdfs:label "East Sussex".
    
The resulting query is:

In [None]:
endpoint_os = 'http://data.ordnancesurvey.co.uk/datasets/os-linked-data/apis/sparql'
prefix = '''
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
'''


q = '''
    SELECT ?districtID
        WHERE {
            # The required (single) triple pattern
            ?districtID rdfs:label "East Sussex".
        }
'''

printRunQuery(endpoint_os, prefix, q)

## Example 2: Find information about bathing places in East Sussex

In this example we are going to send a query to the Environment Agency's endpoint. However, this query will include a SERVICE request asking another endpoint (OS) for some data, which is then used to complete the original query.

Specifically, the SERVICE request will ask the OS endpoint for the name and identifier of a district in East Sussex. Then, using this data, the Environment Agency's endpoint will be asked to supply various data about the district.


In [None]:
prefix = '''
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX admingeo: <http://statistics.data.gov.uk/def/administrative-geography/>
    PREFIX ossr: <http://data.ordnancesurvey.co.uk/ontology/spatialrelations/>
'''

q = '''
    SELECT ?districtname ?sedimentname ?location
    WHERE {
        # Ask OS endpoint to find a district, and its name, 
        # within the area East Sussex
        SERVICE <http://data.ordnancesurvey.co.uk/datasets/boundary-line/apis/sparql> {
            ?area rdfs:label "East Sussex".
            ?district ossr:within ?area.
            ?district rdfs:label ?districtname.
        }

        # Ask Environment Agency to find location and sediment type
        # of district.
        
        # Find location of district
        ?location <http://statistics.data.gov.uk/def/administrative-geography/district> ?district .
        #Find whether location is that of a bathing water place
        ?location a <http://environment.data.gov.uk/def/bathing-water/BathingWater> .

        #Find type of sediment at this bathing water place
        ?location <http://environment.data.gov.uk/def/bathing-water/sedimentTypesPresent> ?sediment .
        ?sediment rdfs:label ?sedimentname.

    }
    ORDER BY ?districtname
'''

printRunQuery(endpoint_envAgency, prefix, q, limit=30)

The query, `q`, is sent to the Environment Agency's endpoint:

    endpoint_envAgency = 'http://environment.data.gov.uk/sparql/bwq/query'

Part of that SELECT query is a request for a SERVICE from the OS endpoint:

    SERVICE <http://data.ordnancesurvey.co.uk/datasets/boundary-line/apis/sparql> { 
      ...
    }

The values returned from the OS endpoint, `?district` and `?districtname`, are available to the remainder of the query.
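
Conceptually, the two parts of the query are joined on the shared variable `?district`. The following sketch imitates that join in plain Python using made-up bindings. The function `join_on` and all the URIs are hypothetical; this illustrates the idea only and is not how either endpoint actually evaluates the query.

```python
# Bindings as they might come back from the SERVICE block (made-up values)
remote_bindings = [
    {"district": "http://example.org/district/A", "districtname": "District A"},
    {"district": "http://example.org/district/B", "districtname": "District B"},
]

# Bindings matched by the local triple patterns (made-up values)
local_bindings = [
    {"location": "http://example.org/bw/1",
     "district": "http://example.org/district/A", "sedimentname": "sand"},
    {"location": "http://example.org/bw/2",
     "district": "http://example.org/district/C", "sedimentname": "mud"},
    {"location": "http://example.org/bw/3",
     "district": "http://example.org/district/B", "sedimentname": "shingle"},
]

def join_on(left, right, var):
    """Combine rows from left and right that agree on the value of var."""
    joined = []
    for l_row in left:
        for r_row in right:
            if l_row[var] == r_row[var]:
                joined.append({**l_row, **r_row})
    return joined

solutions = join_on(remote_bindings, local_bindings, "district")
for s in sorted(solutions, key=lambda r: r["districtname"]):
    print(s["districtname"], s["sedimentname"], s["location"])
```

The location in district C is dropped because the SERVICE block returned no binding for that district, which is why only bathing places in the requested area appear in the results.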


### Activity 3

Click on one of the location URLs returned by the previous query to see what kind of information is available about a specific bathing place. Then answer the following:

1. What kind(s) of sediment are to be found at this place?
2. What are the different categories of sediment used by the Environment Agency?


### Discussion 3

1. We visited Eastbourne at http://environment.data.gov.uk/id/bathing-water/ukj2201-14500 and found that both sand and shingle are present.

2. Clicking on one of the sediment types, such as sand, reveals a definition of that sediment type including the link 'sediment type'. This downloads a Turtle file from which it is possible to discover the types of sediment used by the Environment Agency:

    sand, shingle, rock, marsh, mud and other


### Activity 4

Amend the query in Example 2 to find information about bathing places in a different area (county), such as West Sussex, Dorset, Devon or another of your own choice.


#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.


In [None]:
endpoint_envAgency = 'http://environment.data.gov.uk/sparql/bwq/query'

prefix = '''
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX admingeo: <http://statistics.data.gov.uk/def/administrative-geography/>
    PREFIX ossr: <http://data.ordnancesurvey.co.uk/ontology/spatialrelations/>
'''

q = '''
    SELECT ?districtname ?sedimentname ?location
    WHERE {
        # Ask OS endpoint to find a district, and its name, 
        # within the chosen area (here, Devon)
        SERVICE <http://data.ordnancesurvey.co.uk/datasets/boundary-line/apis/sparql> {
            ?area rdfs:label "Devon". # *** CHANGE THIS TRIPLE ***
            ?district ossr:within ?area.
            ?district rdfs:label ?districtname.
        }

        # Ask Environment Agency to find location and sediment type 
        # of district.
        # Find location of district
        ?location <http://statistics.data.gov.uk/def/administrative-geography/district> ?district .
        
        # Find whether location is that of a bathing water place
        ?location a <http://environment.data.gov.uk/def/bathing-water/BathingWater> .

        # Find type of sediment at this bathing water place
        ?location <http://environment.data.gov.uk/def/bathing-water/sedimentTypesPresent> ?sediment .
        ?sediment rdfs:label ?sedimentname.

    }
    ORDER BY ?districtname
'''

printRunQuery(endpoint_envAgency, prefix, q, limit=30)
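
Since only the area label changes between Example 2 and this solution, the query text could be generated by a small function. The sketch below uses Python's `string.Template`, because the `$` placeholder does not clash with the braces or `?` variables in SPARQL. The names `AREA_QUERY` and `build_area_query` are hypothetical, and the sketch assumes the label contains no double quotes or backslashes.

```python
from string import Template

# Template for the Example 2 query; $area is the only part that varies
AREA_QUERY = Template('''
    SELECT ?districtname ?sedimentname ?location
    WHERE {
        SERVICE <http://data.ordnancesurvey.co.uk/datasets/boundary-line/apis/sparql> {
            ?area rdfs:label "$area".
            ?district ossr:within ?area.
            ?district rdfs:label ?districtname.
        }
        ?location <http://statistics.data.gov.uk/def/administrative-geography/district> ?district .
        ?location a <http://environment.data.gov.uk/def/bathing-water/BathingWater> .
        ?location <http://environment.data.gov.uk/def/bathing-water/sedimentTypesPresent> ?sediment .
        ?sediment rdfs:label ?sedimentname.
    }
    ORDER BY ?districtname
''')

def build_area_query(area_label):
    """Return the federated query text for the given area label (hypothetical helper)."""
    # Reject characters that would break out of the quoted SPARQL literal
    if '"' in area_label or '\\' in area_label:
        raise ValueError("area label must not contain quotes or backslashes")
    return AREA_QUERY.substitute(area=area_label)

q_dorset = build_area_query("Dorset")
print(q_dorset)
```

The resulting string could then be passed to `printRunQuery(endpoint_envAgency, prefix, q_dorset, limit=30)` in the same way as the hand-written queries.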


## Summary

The primary purpose of this Notebook has been to give you practical experience of federated queries in which a single query gives rise to multiple sub-queries that are sent to different endpoints to be actioned. The main query can then combine the several sets of results from the sub-queries to answer the primary question.

To send a query to a remote endpoint, you use SPARQL's SERVICE mechanism, in which you specify the URL of the endpoint to be used to answer a specific sub-query. The remote endpoint finds suitable bindings for the variables mentioned in the sub-query; these bindings are then available to the main query for use in subsequent patterns (including further SERVICE requests).

## What next?

If you are working through this Notebook as part of an inline exercise, return to the module materials now.

If you are working through this set of Notebooks as a whole, move on to `26.2 The SPARQL CONSTRUCT query and inferencing`.