# Endpoints: accessing real data

In this Notebook you will learn how to use SPARQLWrapper, which enables you to access endpoints via a Python application. In so doing you will visit a variety of endpoints and examine a wide variety of datasets. You will also see how to capture data from an RDF graph in a Python data structure ready for analysis. Further SPARQL filtering mechanisms are also introduced.

*As the datasets and endpoints are not under the control of the Open University, it may be the case that some have changed or are unavailable. Should you encounter such a situation, please read the discussion as there should be enough information there for you to understand the points being made in the activities.*

*We want to encourage you to experiment for yourself wherever possible. We suggest that you try the activities as given and then see what happens when you slightly alter the queries. When you amend a query it is very likely that you will make mistakes. Most query engines are good at providing error messages should you make mistakes in the syntax of a query. However, they vary in the helpfulness of error reporting should a website be unavailable. You may come across error messages such as 'Error 400: Failed to load URL http://dbpedia.org/' which probably means that one of the websites mentioned in a query is unavailable (perhaps because it is being updated). In such circumstances, move on to the next activity and return to the one that failed at a later time.*

The process for accessing data from an endpoint using SPARQLWrapper is quite straightforward, consisting of the following steps:

1. Import the features of SPARQLWrapper using a `from ... import` statement.

2. Create a reference to the required endpoint.

3. Create a query (in the same way that you saw in Notebook 25.2).

4. Choose the format of the results to be returned. Most endpoints offer a variety of formats: JSON is a common format to choose.

5. Send the query to the endpoint using the `query()` method (this time no arguments are needed, because the query has already been attached to the wrapper object).

6. Convert the results from the chosen format (such as JSON) to a Python dictionary using the `convert()` method.

It is usual to combine steps 5 and 6 as you will see.

## Example: accessing the DBpedia dataset

This example sends a query to the DBpedia endpoint asking for all the names by which Germany is known in different languages. It follows the six steps outlined above, with steps 5 and 6 combined.

Read the code and then run the cell.

In [None]:
# Step 1: Import the SPARQLWrapper class and the JSON return-format constant
from SPARQLWrapper import SPARQLWrapper, JSON

# Step 2: Create a SPARQL wrapper object that references an endpoint (dbpedia)
endpoint = "http://dbpedia.org/sparql"
sparql = SPARQLWrapper(endpoint)

# Step 3: Create the query and associate it with the wrapper object 
# using the setQuery() method
query = '''
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        PREFIX dbpedia: <http://dbpedia.org/resource/>
        
        SELECT ?label
        WHERE { 
            dbpedia:Germany rdfs:label ?label 
        }
     '''

sparql.setQuery(query)

# Step 4: Choose the format of the results to be returned by the query
sparql.setReturnFormat(JSON)

# Step 5: Obtain results and convert to Python dictionary based on
# JSON format
results = sparql.query().convert()

# Print out the raw results (a Python dictionary built from the JSON response)

print(results)
print()

# When the return format is set to JSON, the Python dictionary will have 
# the results indexed by 'results' and then 'bindings'. 
# In this example, the bindings are pairs indexed by 'label' and 'value'.
# The index 'label' is the name of the variable used in the SELECT query 
# above.

# A more readable output is as follows, which just lists the values
 
for result in results["results"]["bindings"]:
    print(result["label"]["value"]) 
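
The shape of the converted dictionary can be explored without contacting an endpoint. The following is a minimal sketch that uses a small hand-written dictionary laid out like the W3C SPARQL JSON results format; the two bindings are illustrative values, not output captured from DBpedia, and `labels_by_language` is a hypothetical helper name.

```python
# A hand-built stand-in for the dictionary returned by sparql.query().convert()
sample_results = {
    "head": {"vars": ["label"]},
    "results": {
        "bindings": [
            {"label": {"type": "literal", "xml:lang": "en", "value": "Germany"}},
            {"label": {"type": "literal", "xml:lang": "de", "value": "Deutschland"}},
        ]
    },
}


def labels_by_language(results, var="label"):
    """Map each language tag to the label value bound to the variable var."""
    labels = {}
    for binding in results["results"]["bindings"]:
        term = binding[var]
        # Plain literals carry no language tag, so fall back to an empty string
        lang = term.get("xml:lang", "")
        labels[lang] = term["value"]
    return labels


labels = labels_by_language(sample_results)
print(labels["de"])
```

Each binding is itself a dictionary keyed by variable name, which is why two levels of indexing (`result["label"]["value"]`) are needed to reach the text of a label.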

## A better approach to outputting the results

Printing the results in this way is not very helpful and you need to know the index terms in order to print the results in a more meaningful way. Here are some helper functions that make the process of running a query and printing the results much easier.

You will find that PREFIXes have been separated out from the SELECT part of the query (easy to do as both are strings that can be concatenated to form the complete query). This can be a useful time saver when you want to experiment with many queries that use the same prefixes.

Run the code in the following cell.

The details of the code need not concern you; simply be aware of the purpose of each function.

In [None]:
# Import the necessary packages
from SPARQLWrapper import SPARQLWrapper, JSON

# Add some helper functions

# A function that will return the results of running a SPARQL query with 
# a defined set of prefixes over a specified endpoint.
# It follows the same five-step process apart from creating the query, which 
# is provided as an argument to the function.
def runQuery(endpoint, prefix, q):
    ''' Run a SPARQL query with a declared prefix over a specified endpoint '''
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(prefix+q) # concatenate the strings representing the prefixes and the query
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()
    
# Import pandas to provide facilities for creating a DataFrame to hold results
import pandas as pd

# Function to convert query results into a DataFrame
# The results are assumed to be in JSON format and therefore the Python dictionary will have  
# the results indexed by 'results' and then 'bindings'. 
def dict2df(results):
    ''' A function to flatten the SPARQL query results and return the column values '''
    data = []
    for result in results["results"]["bindings"]:
        tmp = {}
        for el in result:
            tmp[el] = result[el]['value']
        data.append(tmp)

    df = pd.DataFrame(data)
    return df

# Function to run a query and return results in a DataFrame
def dfResults(endpoint, prefix, q):
    ''' Generate a data frame containing the results of running
        a SPARQL query with a declared prefix over a specified endpoint '''
    return dict2df(runQuery(endpoint, prefix, q))
        
# Print a limited number of results of a query
def printQuery(results, limit=''):
    ''' Print the results from the SPARQL query '''
    resdata = results["results"]["bindings"]
    if limit != '':
        resdata = results["results"]["bindings"][:limit]
    for result in resdata:
        for ans in result:
            print('{0}: {1}'.format(ans, result[ans]['value']))
        print()

# Run a query and print out a limited number of results
def printRunQuery(endpoint, prefix, q, limit=''):
    ''' Print the results from the SPARQL query '''
    results = runQuery(endpoint, prefix, q)
    printQuery(results, limit)
    
print("Helper functions set up")
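
Because endpoints are sometimes unavailable, it can be convenient to wrap a call so that a failure is reported in one line rather than as a long traceback. The sketch below is one possible approach, not part of SPARQLWrapper: `try_query` is a hypothetical helper that accepts any callable, and the two `fake_` functions are stand-ins so that the cell runs without a network connection.

```python
def try_query(run, *args, **kwargs):
    """Call run(*args, **kwargs); return its result, or None if it raises."""
    try:
        return run(*args, **kwargs)
    except Exception as err:  # deliberately broad: network, HTTP and parsing errors
        print("Query failed: {0}: {1}".format(type(err).__name__, err))
        return None


# Stand-in functions with the same argument list as runQuery
def fake_ok(endpoint, prefix, q):
    return {"results": {"bindings": []}}


def fake_down(endpoint, prefix, q):
    raise OSError("endpoint unreachable")


ok_result = try_query(fake_ok, "http://example.org/sparql", "", "SELECT ...")
failed_result = try_query(fake_down, "http://example.org/sparql", "", "SELECT ...")
```

With the helper functions above loaded, a call such as `try_query(runQuery, endpoint, prefix, query)` would return either the results dictionary or `None`.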

The previous query can now be written in a much briefer way as follows.

In [None]:
# Define the endpoint
endpoint ="http://dbpedia.org/sparql"

# Define any prefixes
prefix = '''
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX dbpedia: <http://dbpedia.org/resource/>
'''

# Define the query
query = '''
        SELECT ?label
            WHERE { 
                dbpedia:Germany rdfs:label ?label 
            }
        '''

# Carry out query against the defined endpoint and put results into a DataFrame
df = dfResults(endpoint, prefix, query)

# Output the first 5 results
df[:5]
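
Since the prefixes and the query are ordinary strings, the prefix block can also be generated from a dictionary of namespaces. This is a small sketch of that idea; `make_prefixes` is a hypothetical helper, not a SPARQLWrapper function.

```python
def make_prefixes(namespaces):
    """Build a block of PREFIX declarations from a {name: URI} dictionary."""
    lines = ["PREFIX {0}: <{1}>".format(name, uri)
             for name, uri in namespaces.items()]
    return "\n".join(lines) + "\n"


namespaces = {
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "dbpedia": "http://dbpedia.org/resource/",
}
select = "SELECT ?label WHERE { dbpedia:Germany rdfs:label ?label }"

# The complete query is simply the prefix block followed by the SELECT part
full_query = make_prefixes(namespaces) + select
print(full_query)
```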


## Example: accessing the British National Bibliography (BNB) endpoint

The British National Bibliography (BNB) lists the books and new journal titles published or distributed in the United Kingdom and Ireland since 1950.

The following query finds details of the book whose ISBN is '9781449371432' (this is a book that you may find useful in this module).

Note that there is only one pattern with one subject (`?book`) but there are three predicates.

In [None]:
# Get SPARQL wrapper
from SPARQLWrapper import SPARQLWrapper, JSON

# Declare the BNB endpoint
endpoint = "http://bnb.data.bl.uk/sparql"

# Define any prefixes
# The prefixes bibo and dct are related to the Dublin Core vocabulary (see commentary below)
# The blt prefix relates to British Library terms
prefix = '''
    PREFIX bibo: <http://purl.org/ontology/bibo/>
    PREFIX blt: <http://www.bl.uk/schemas/bibliographic/blterms#>
    PREFIX dct: <http://purl.org/dc/terms/>
'''

# Define the query
query = '''
        SELECT ?book ?author ?title 
        WHERE {
            #Match the book using the 13 character ISBN (International Standard Book Number)
            ?book bibo:isbn13 "9781449371432" ; 
    
            #bind the book's other attributes to variables
                dct:creator ?author;
                dct:title ?title. 
}'''

# Run query and print out a limited number of results
printRunQuery(endpoint, prefix, query, limit='')


Knowing what predicates and vocabulary are used in the BNB dataset requires some investigation. 

As a first step you might think of finding some of the predicates used, as follows:

In [None]:
prefix = '' # No prefixes required

# Set up query to find all predicates
query = '''
        SELECT DISTINCT ?p 
        WHERE {
            ?s ?p ?o
    
      } '''

# Run query and print out a limited number of results
printRunQuery(endpoint, prefix, query, limit=30)


Several terms are defined in the FOAF ontology and some are W3C terms. However, there are a number of terms which come from purl.org/dc and purl.org/ontology (these occur as prefixes in the earlier query). These come from The Dublin Core Metadata Initiative (DCMI) which supports shared innovation in metadata design and best practices across a broad range of purposes and business models. If you wish, you can find out more about the Dublin Core by visiting http://dublincore.org/about-us/.

You should click on some of the URLs given in the results of the last query to find out more about the terms being used. For example, click on one of the results involving the Dublin Core bibliographic ontology such as http://purl.org/ontology/bibo/isbn13.

Note that in the Dublin Core terms vocabulary there is no 'author' predicate: instead there is a generic `dct:creator` predicate.
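
One way to see which vocabularies a dataset draws on is to group the predicate URIs by namespace. The sketch below works on a short hand-picked list of predicate URIs rather than live query output; `namespace_of` is a hypothetical helper that assumes the namespace ends at the last '#' or '/'.

```python
from collections import Counter


def namespace_of(uri):
    """Return the part of a URI up to and including the last '#' or '/'."""
    cut = max(uri.rfind("#"), uri.rfind("/"))
    return uri[:cut + 1]


# Illustrative predicate URIs, standing in for the results of the query above
sample_predicates = [
    "http://www.w3.org/2000/01/rdf-schema#label",
    "http://purl.org/dc/terms/title",
    "http://purl.org/dc/terms/creator",
    "http://purl.org/ontology/bibo/isbn13",
    "http://xmlns.com/foaf/0.1/name",
]

vocabularies = Counter(namespace_of(p) for p in sample_predicates)
for namespace, count in vocabularies.most_common():
    print(count, namespace)
```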

### Activity 1

Create and run a query against the BNB endpoint which finds the titles of all novels written by Ian Rankin. 

In the BNB dataset an author (creator) is specified by writing their name in the form surname followed by given name as in:

<http://bnb.data.bl.uk/id/person/RankinIan>

In [None]:
# Insert your solution here.

The solution is in the [`25.3solutions`](25.3solutions.ipynb) Notebook.

If you want to restrict the results of the last query to titles containing the word 'Rebus' (the name of a character in many of Ian Rankin's novels), you can use the FILTER keyword together with a regular expression, as illustrated in the following example. 

The function `regex()`, short for regular expression, is a SPARQL function that returns `true` if the value of `?title`, a string, contains the substring 'Rebus'.

In [None]:
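# Define the prefixes used in the query
# (pers is the namespace of the BNB person URIs shown in Activity 1)
prefix = '''
    PREFIX dct: <http://purl.org/dc/terms/>
    PREFIX pers: <http://bnb.data.bl.uk/id/person/>
'''

# Set up query to find titles by Ian Rankin that contain 'Rebus'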
query = '''
        SELECT DISTINCT ?title
        WHERE {
            
            ?book dct:creator pers:RankinIan;
                  dct:title ?title
                  
            FILTER(regex(?title, 'Rebus'))
      } '''           
    
# Run query and print out a limited number of results
printRunQuery(endpoint, prefix, query, limit=30)
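
The effect of `regex()` can be approximated locally with Python's `re` module. This is only a rough analogue: SPARQL uses the XPath/XQuery regular expression syntax, which differs from Python's in some details, although a plain substring such as 'Rebus' behaves the same way in both. The titles are made-up sample strings, and `sparql_like_regex` is a hypothetical helper.

```python
import re

# Made-up sample strings, not titles returned by the BNB endpoint
sample_titles = [
    "Rebus: a sample title",
    "A sample title without the name",
    "a rebus puzzle",
]


def sparql_like_regex(text, pattern, flags=""):
    """Rough local analogue of SPARQL's regex(text, pattern, flags)."""
    re_flags = re.IGNORECASE if "i" in flags else 0
    # re.search succeeds if the pattern matches anywhere in the text
    return re.search(pattern, text, re_flags) is not None


case_sensitive = [t for t in sample_titles if sparql_like_regex(t, "Rebus")]
case_insensitive = [t for t in sample_titles if sparql_like_regex(t, "Rebus", "i")]
print(case_sensitive)
print(case_insensitive)
```

By default the match is case sensitive. In SPARQL, a third argument of `"i"`, as in `regex(?title, 'rebus', 'i')`, requests a case-insensitive match.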

## Example: The Environment Agency's Bathing Water Linked Data store

The Environment Agency collects water-quality data each year from May to September, to ensure that designated bathing water sites on the coast and inland are safe and clean for swimming and other activities. See http://environment.data.gov.uk/bwq/index.html.

The Environment Agency splits up the country into counties, and counties are divided into districts. The following query prints out some of the attributes of bathing water places, specifically the name of each bathing water and the district in which it lies. 

In [None]:
# Define the endpoint
endpoint = 'http://environment.data.gov.uk/sparql/bwq/query'

# Set up prefixes
prefix = '''
    PREFIX bw: <http://environment.data.gov.uk/def/bathing-water/> 
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX stats: <http://statistics.data.gov.uk/def/administrative-geography/>
    '''

# Define the query
query = '''
SELECT ?name ?district
WHERE {
    ?x a bw:BathingWater . #find subjects that are of type (predicate a) bw:BathingWater
    ?x rdfs:label ?name .  #obtain name of the subject
    ?x stats:district ?district . #obtain district (number) of the subject
    }
'''

# Run query and print out a limited number of results
printRunQuery(endpoint, prefix, query, limit=6)


The output provides two results for each district: one provides the location reference of the Office for National Statistics (ONS) and the other an Ordnance Survey (OS) location. The ONS references no longer exist (try one!) - the result of data owners changing the structure or contents of their datasets and other providers not keeping up with the changes.
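
When a query returns several rows for the same name, it can be easier to read the results after grouping them. The sketch below groups the district values by bathing water name; the bindings are invented placeholders (using `example.org` URIs) rather than real Environment Agency data, and `group_values` is a hypothetical helper.

```python
from collections import defaultdict

# Placeholder results laid out like the dictionary returned by runQuery
sample_results = {"results": {"bindings": [
    {"name": {"type": "literal", "value": "Sample Bay"},
     "district": {"type": "uri", "value": "http://example.org/ons/district/1"}},
    {"name": {"type": "literal", "value": "Sample Bay"},
     "district": {"type": "uri", "value": "http://example.org/os/district/1"}},
    {"name": {"type": "literal", "value": "Another Beach"},
     "district": {"type": "uri", "value": "http://example.org/os/district/2"}},
]}}


def group_values(results, key_var, value_var):
    """Collect every value of value_var seen for each value of key_var."""
    grouped = defaultdict(list)
    for binding in results["results"]["bindings"]:
        grouped[binding[key_var]["value"]].append(binding[value_var]["value"])
    return dict(grouped)


districts = group_values(sample_results, "name", "district")
for name, uris in districts.items():
    print(name, uris)
```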

### Activity 2

Choose one of the districts and click on the URI for the OS location. You should see a map of where the district is located.

### Discussion

You should see a map of the district in which the bathing area is located. For example, Ringstead Bay is located in the district of West Dorset.

## Example: Ordnance Survey (OS)

Ordnance Survey is Great Britain's national mapping agency, providing accurate and up-to-date geographic data. Details of their open Linked Data store can be found at http://data.ordnancesurvey.co.uk/datasets/os-linked-data.

1. Click on the above link and visit the OS website. Read the introduction.
2. Scroll down the page and click on the link labelled 'SPARQL' (in a box containing the label 'powered by SPARQL'). This will take you to a page entitled 'OS Linked Data SPARQL API' where you can experiment by running SPARQL queries against OS datasets.

You can either continue to experiment with the examples at the endpoint or run the following query (to find Roman antiquities - one of the examples at the endpoint).

In [None]:
endpoint = 'http://data.ordnancesurvey.co.uk/datasets/os-linked-data/apis/sparql'

prefix = '''
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX spatial: <http://data.ordnancesurvey.co.uk/ontology/spatialrelations/>
PREFIX gaz: <http://data.ordnancesurvey.co.uk/ontology/50kGazetteer/>
'''

query = '''
SELECT ?uri ?label ?easting ?northing ?one ?twenty ?map 
WHERE {
  ?uri 
    #filter on type
    gaz:featureType gaz:RomanAntiquity;

    #bind everything we want to return
    rdfs:label ?label;
    spatial:easting ?easting;
    spatial:northing ?northing;
    spatial:oneKMGridReference ?one;
    spatial:twentyKMGridReference ?twenty;
    gaz:mapReference ?map.    
}
'''

printRunQuery(endpoint, prefix, query, limit=5)


The results show the OS grid references (easting and northing) of various Roman antiquities. Also given are the areas relative to various maps (OS LandRanger, 1km grid square and 20km grid square).
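
In the JSON results every value arrives as a string, so eastings and northings need converting to numbers before any calculation. The following sketch converts two bindings and computes the straight-line distance between them on the grid (grid coordinates are in metres). The site names and coordinates are invented for illustration, and the function names are hypothetical.

```python
import math

# Invented bindings; real results would come from the query above
sample_bindings = [
    {"label": {"value": "Site A"},
     "easting": {"value": "400000"}, "northing": {"value": "300000"}},
    {"label": {"value": "Site B"},
     "easting": {"value": "403000"}, "northing": {"value": "304000"}},
]


def grid_point(binding):
    """Convert the string easting and northing of a binding to a pair of floats."""
    return (float(binding["easting"]["value"]),
            float(binding["northing"]["value"]))


def grid_distance_m(a, b):
    """Straight-line distance in metres between two (easting, northing) points."""
    return math.hypot(b[0] - a[0], b[1] - a[1])


distance = grid_distance_m(grid_point(sample_bindings[0]),
                           grid_point(sample_bindings[1]))
print(distance)  # 5000.0
```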

## Example: UK's Land Registry Linked Open Data

The Land Registry publishes the following public datasets:

* House Price Index background data – available from January 1995 and updated in full 
    each month.
* Price paid data – available from January 1995 and updated in full each month.
* Transaction data – available from December 2011.
    
At the top of the Land Registry's home page at http://landregistry.data.gov.uk, immediately below the title, are four links labelled 'House Price Index', 'Price Paid Data', 'Standard Reports' and 'SPARQL query'. Visit the Land Registry's home page and click on the 'House Price Index' link where you will find a form for obtaining the house price indices for regions within England and Wales. Enter an area, choose a date range, and choose the data items you are interested in. The result should be displayed on your screen immediately below the form.

You may like to view a tutorial prior to performing the search; if so, press the button labelled 'tutorial' at the top right-hand corner of the form.

If you do not obtain any results, that is, nothing appears below the form, it is probably because you have entered an invalid value for the area. We suggest that you click on the button displaying a map of England and Wales (it can be found to the right of the area selection box), select a region and then an area.

The following query returns the price paid data from the default graph for each transaction record having an address with the given postcode.
The postcode to query is set in the line `?addr lrcommon:postcode "PL6 8RU"^^xsd:string .`


In [None]:
endpoint = 'http://landregistry.data.gov.uk/landregistry/query'

prefix = '''
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX lrppi: <http://landregistry.data.gov.uk/def/ppi/>
PREFIX lrcommon: <http://landregistry.data.gov.uk/def/common/>
'''

query = '''
SELECT ?paon ?saon ?street ?town ?county ?postcode ?amount ?date
WHERE
{
  ?transx lrppi:pricePaid ?amount ;
          lrppi:transactionDate ?date ;
          lrppi:propertyAddress ?addr.
  
  ?addr lrcommon:postcode "PL6 8RU"^^xsd:string.
  ?addr lrcommon:postcode ?postcode.
  
  OPTIONAL {?addr lrcommon:county ?county}
  OPTIONAL {?addr lrcommon:paon ?paon}
  OPTIONAL {?addr lrcommon:saon ?saon}
  OPTIONAL {?addr lrcommon:street ?street}
  OPTIONAL {?addr lrcommon:town ?town}

}
ORDER BY ?amount

'''

printRunQuery(endpoint, prefix, query, limit=5)

The majority of index terms are self-explanatory, but there are two acronyms related to addresses. The first, `paon`, stands for 'primary addressable object name' and is either a number, name or description of the property (technically it is a 'basic land and property unit'). The second, `saon`, stands for 'secondary addressable object name' and is similar to a `paon`. A `saon` usually only exists where a building has been divided into sub-buildings such as flats. 

In this example, there are no saons in the results even though they are requested in the list of variables following the SELECT keyword. This is a result of the use of the OPTIONAL patterns at the end of the WHERE clause. The patterns (there can be more than one) inside an OPTIONAL construct are either matched, in which case new bindings are made, or are not matched, in which case no further bindings are made and the pattern is effectively ignored. 
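
In the JSON results format, a variable that was left unbound by an OPTIONAL pattern is simply absent from that row's dictionary, so indexing it directly would raise a `KeyError`. The sketch below shows one defensive way to read such values; the addresses are invented, and `value_or_default` is a hypothetical helper.

```python
# Invented bindings: the first address has no saon, the second has one
sample_bindings = [
    {"paon": {"value": "1"},
     "street": {"value": "SAMPLE ROAD"},
     "amount": {"value": "100000"}},
    {"paon": {"value": "SAMPLE HOUSE"},
     "saon": {"value": "FLAT 2"},
     "street": {"value": "SAMPLE ROAD"},
     "amount": {"value": "120000"}},
]


def value_or_default(binding, var, default=""):
    """Return the value bound to var, or default if var is unbound in this row."""
    term = binding.get(var)
    return term["value"] if term is not None else default


saons = [value_or_default(b, "saon", "(none)") for b in sample_bindings]
print(saons)
```

The helper functions `printQuery` and `dict2df` already cope with this, because they loop only over the variables that are present in each row.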

The following query returns the house price indices for Plymouth for the month of January 2015.

In [None]:
prefix = '''
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX lrhpi: <http://landregistry.data.gov.uk/def/hpi/>
'''

# Returns the house price indices for a particular region (Plymouth) for a particular month (Jan 2015)
query = '''
SELECT DISTINCT ?regionName ?yearmonth ?indexr ?region ?avgPriceAll ?avgDetached ?avgSemi \
?avgFlats ?avgTerraced ?annual ?volume
WHERE
{
  VALUES ?localAuthorityMonth {<http://landregistry.data.gov.uk/data/hpi/region/city-of-plymouth/month/2015-01>}

  ?localAuthorityMonth
    lrhpi:refRegion ?regionURI ;
    lrhpi:indicesSASM ?indexr ;
    lrhpi:refPeriod ?yearmonth ;
    lrhpi:averagePricesSASM ?avgPriceAll ;
    lrhpi:monthlyChange ?monthly ;
    lrhpi:averagePricesDetachedSASM ?avgDetached ;
    lrhpi:averagePricesSemiDetachedSASM ?avgSemi ;
    lrhpi:averagePricesFlatMaisonetteSASM ?avgFlats ;
    lrhpi:averagePricesTerracedSASM ?avgTerraced ;
    lrhpi:annualChange ?annual .

  OPTIONAL { ?localAuthorityMonth lrhpi:salesVolume ?volume }

  ?regionURI rdfs:label ?regionName .
  FILTER (langMatches( lang(?regionName), "EN") )
}
'''

printRunQuery(endpoint, prefix, query, limit=5)

This example uses another construct to restrict the number of results returned: the VALUES keyword. In general, the VALUES keyword is followed by one or more variable names which are followed by a set of values placed inside curly braces. There should be one value for each of the variables. The idea is that the query will only return results for which the variables named in the VALUES clause have the specific values specified in the braces. In this case, only results for the local authority Plymouth for the month of January 2015 have been requested.
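
A VALUES clause for a single variable can be assembled in Python when the same query is wanted for several values. This is a sketch only: `values_clause` is a hypothetical helper, and the second URI (for February 2015) is an assumption that the dataset follows the same URI pattern as the one used above.

```python
def values_clause(variable, uris):
    """Build a single-variable VALUES clause listing each URI in angle brackets."""
    terms = " ".join("<{0}>".format(uri) for uri in uris)
    return "VALUES ?{0} {{ {1} }}".format(variable, terms)


base = "http://landregistry.data.gov.uk/data/hpi/region/city-of-plymouth/month/"
months = ["2015-01", "2015-02"]  # the second month is an assumed example

clause = values_clause("localAuthorityMonth", [base + m for m in months])
print(clause)
```

The resulting string could replace the VALUES line in the query above, so that rows are returned for each of the listed months.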

## Example: Open University

The Open University (OU) has a collection of datasets (graphs) and provides a specialised endpoint for querying them: visit  http://data.open.ac.uk/ to read about them.

More specific information about the individual datasets can be found by visiting http://data.open.ac.uk/site/datasets.html.

The following query finds third-level modules in Computing and IT and returns their codes and titles. Note that `subj:subject` is a predicate in which 'subject' refers to a subject area offered by the OU and should not be confused with the use of the word 'subject' to refer to the first element of a triple.

In [None]:
endpoint = 'http://data.open.ac.uk/query'

prefix = '''
PREFIX subj: <http://purl.org/dc/terms/>
PREFIX top: <http://data.open.ac.uk/topic/>
PREFIX cl: <http://data.open.ac.uk/saou/ontology#>
PREFIX cw: <http://courseware.rkbexplorer.com/ontologies/courseware#>
PREFIX xmls: <http://www.w3.org/2001/XMLSchema#>
'''

query = '''
SELECT DISTINCT ?course ?courseTitle
    WHERE {
        ?course subj:subject top:computing_and_it;
            cl:OUCourseLevel "3"^^xmls:string;
            cw:has-title ?courseTitle.
}
'''

printRunQuery(endpoint, prefix, query, limit=10)

By default, the OU's endpoint searches all OU datasets. If you want it to search a specific dataset, you can specify it using the FROM keyword (placed after the SELECT clause and before the WHERE clause). For example, the following query should produce exactly the same results as the previous example (the URI of the graph <http://data.open.ac.uk/context/course> was discovered by visiting http://data.open.ac.uk/site/datasets.html).

In [None]:
endpoint = 'http://data.open.ac.uk/query'

prefix = '''
PREFIX subj: <http://purl.org/dc/terms/>
PREFIX top: <http://data.open.ac.uk/topic/>
PREFIX cl: <http://data.open.ac.uk/saou/ontology#>
PREFIX cw: <http://courseware.rkbexplorer.com/ontologies/courseware#>
PREFIX xmls: <http://www.w3.org/2001/XMLSchema#>
'''

query = '''
SELECT DISTINCT ?course ?courseTitle
    FROM <http://data.open.ac.uk/context/course>
    WHERE {
        ?course subj:subject top:computing_and_it;
            cl:OUCourseLevel "3"^^xmls:string;
            cw:has-title ?courseTitle.
}
'''

printRunQuery(endpoint, prefix, query, limit=10)
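
The position of the FROM clause can be made explicit by assembling the query from its parts. The sketch below builds the text of a query with or without a FROM clause; `select_query` is a hypothetical helper, and the WHERE pattern is shortened to a single triple for brevity.

```python
def select_query(variables, where, graph=None):
    """Assemble a SELECT DISTINCT query, adding a FROM clause when a graph is given."""
    lines = ["SELECT DISTINCT " + " ".join("?" + name for name in variables)]
    if graph is not None:
        # FROM sits between the SELECT clause and the WHERE clause
        lines.append("FROM <{0}>".format(graph))
    lines.append("WHERE {")
    lines.append(where)
    lines.append("}")
    return "\n".join(lines)


where = "    ?course cw:has-title ?courseTitle ."
all_graphs = select_query(["course", "courseTitle"], where)
one_graph = select_query(["course", "courseTitle"], where,
                         graph="http://data.open.ac.uk/context/course")
print(one_graph)
```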

### Activity 3

Modify the previous query to find all topics offered by the OU. In the previous query `computing_and_it` is a topic.

In [None]:
# Insert your solution here.

The solution is in the [`25.3solutions`](25.3solutions.ipynb) Notebook.

### Activity 4

Find all the predicates in the OU dataset.

In [None]:
# Insert your solution here.

The solution is in the [`25.3solutions`](25.3solutions.ipynb) Notebook.

## Summary

In this Notebook you have been introduced to a variety of real linked datasets including DBpedia, the British National Bibliography, the Environment Agency's Bathing Water Linked Data store, Ordnance Survey, the UK's Land Registry Linked Open Data and the Open University's datasets.

You have seen how to use the SPARQLWrapper library. To do so, you create a SPARQLWrapper object and associate it with: the URL of an endpoint, a query (as a Python string) and the output format (we used JSON). The results of a query are then converted to a Python dictionary data structure.

The queries in this Notebook are more complex than in previous Notebooks. To make it easier to create a query and output the results in a more useful way, we provided a number of helper functions.

Several new filtering mechanisms were introduced, including:

* `regex(string, pattern)`, a regular expression function used here to determine whether a string contains a specific substring
* `OPTIONAL {pattern}`, to make bindings only when the pattern is matched
* `VALUES variable list {values list}`, to restrict the named variables to the listed values.

You will have also observed that whenever you meet a new dataset for the first time, you need to devote some time to finding out and understanding the predicates used. We have shown you how a simple query can reveal all the predicates, but you then have to discover their meaning. A useful first step is to determine the domain and range of the predicates. Also, it is to be hoped that the names used for the predicates provide a clue as to their meaning.
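
As a starting point for finding the domain and range of a predicate, a query along the following lines could be tried. It is a sketch under an assumption: it only returns anything if the endpoint has loaded the schema triples (`rdfs:domain` and `rdfs:range`) for the vocabulary concerned, which many endpoints have not. `domain_range_query` is a hypothetical helper that only builds the query text.

```python
RDFS_PREFIX = "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"


def domain_range_query(predicate_uri):
    """Build a query asking for the declared domain and range of a predicate."""
    # Doubled braces are literal braces in a str.format template
    return RDFS_PREFIX + '''
SELECT ?domain ?range
WHERE {{
    OPTIONAL {{ <{0}> rdfs:domain ?domain }}
    OPTIONAL {{ <{0}> rdfs:range ?range }}
}}
'''.format(predicate_uri)


q = domain_range_query("http://purl.org/dc/terms/creator")
print(q)
```

Both patterns are OPTIONAL so that a predicate with a declared domain but no declared range (or the reverse) still produces a row.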

You should also have noticed that several ontologies and vocabularies have occurred frequently in our examples, such as the Dublin Core, FOAF, xmls, rdf and rdfs. It is well worth getting familiar with these (or at least knowing how to access them to discover their contents). Most developers of datasets will use well-accepted vocabularies and then add on new vocabularies suited to their particular domains.

## What next?

If you are working through this Notebook as part of an inline exercise, return to the module materials now.

If you are working through this set of Notebooks as a whole, you've completed the Part 25 Notebooks. It's time to move on to Part 26.