# Creating Collections as Data Using Federated Queries / BNE Query Building (1)
## About the Notebook
This notebook demonstrates how to use [SPARQL](https://www.w3.org/TR/sparql11-query/) to query Linked Open Data repositories. Specifically, it showcases how to perform federated queries by combining data from [Wikidata](https://www.wikidata.org) and the [Biblioteca Nacional de España (BNE)](https://datos.bne.es), which has published its catalog as Linked Open Data.

<!--🔖 **How to Cite**: [![DOI](https://zenodo.org/badge/DOI_NUMBER.svg)](https://doi.org/DOI_NUMBER) <!--fix once we get the CITATION.cff set up in the GitHub repo-->

---

## Getting Started
We start by importing the `SPARQLWrapper` library, which is used to interact with SPARQL endpoints.
We then set up the SPARQL endpoint for Wikidata and specify JSON as the return format for our query.

In [14]:
# Import necessary libraries
from SPARQLWrapper import SPARQLWrapper, JSON

# Set up the SPARQL endpoint for Wikidata
sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
sparql.setReturnFormat(JSON)

## Writing Our First SPARQL Query

In this section, we construct our first SPARQL query. We begin by retrieving the BNE identifier (`P950`) for a specific Wikidata entity, in this case the author Jorge Juan y Santacilia (`wd:Q2085725`). The identifier is appended to the BNE resource base URI (`https://datos.bne.es/resource/`) to form the author's BNE URI, which we then use at the BNE SPARQL endpoint to retrieve the works (`?work`) associated with this author.

The query fetches metadata for each work of the author, including the title (`?workLabel`), edition (`?edition`), place of production (`?placeOfProduction`), year of publication (`?yearOfPublication`), and language code (`?langCode`). The edition, place, year, and language are wrapped in `OPTIONAL` blocks, so they are only included when available.

What makes this query federated is the use of the `SERVICE` keyword, which allows us to access the BNE SPARQL endpoint and combine its data with Wikidata.

In [18]:
sparql.setQuery(
    """
PREFIX bne-def: <https://datos.bne.es/def/>
PREFIX dcterms: <http://purl.org/dc/terms/>

SELECT ?author ?work ?workLabel ?edition ?placeOfProduction ?yearOfPublication ?langCode
WHERE {
 wd:Q2085725 wdt:P950 ?id .
 wd:Q2085725 rdfs:label ?author.  FILTER(LANG(?author) = "en").            
 BIND(uri(concat("https://datos.bne.es/resource/", ?id)) as ?bneID)
 SERVICE <http://datos.bne.es/sparql> {
  ?bneID bne-def:OP5001 ?work .
  ?work rdfs:label ?workLabel .   
  OPTIONAL {?work bne-def:OP1002 ?m . ?m bne-def:OP2001 ?edition . ?edition bne-def:P3003 ?placeOfProduction}
  OPTIONAL {?work bne-def:OP1002 ?m . ?m bne-def:OP2001 ?edition . ?edition bne-def:P3006 ?yearOfPublication}
  OPTIONAL {?work bne-def:OP1002 ?m . ?m bne-def:OP2001 ?edition . ?edition dcterms:language ?langCode}
 }
}
LIMIT 1000
"""
)
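The query above hard-codes the Wikidata entity. As a minimal sketch, assuming one wanted to reuse it for other authors, the query text could be generated from a Wikidata QID. The names `QUERY_TEMPLATE` and `build_bne_works_query`, as well as the QID validation rule, are illustrative assumptions and not part of the original notebook; the query body mirrors the one above.

```python
import re
from string import Template

# Query text mirroring the cell above, with the entity ID and the limit
# turned into placeholders. string.Template is used because the SPARQL
# braces would clash with str.format().
QUERY_TEMPLATE = Template("""
PREFIX bne-def: <https://datos.bne.es/def/>
PREFIX dcterms: <http://purl.org/dc/terms/>

SELECT ?author ?work ?workLabel ?edition ?placeOfProduction ?yearOfPublication ?langCode
WHERE {
 wd:$qid wdt:P950 ?id .
 wd:$qid rdfs:label ?author . FILTER(LANG(?author) = "en") .
 BIND(uri(concat("https://datos.bne.es/resource/", ?id)) as ?bneID)
 SERVICE <http://datos.bne.es/sparql> {
  ?bneID bne-def:OP5001 ?work .
  ?work rdfs:label ?workLabel .
  OPTIONAL {?work bne-def:OP1002 ?m . ?m bne-def:OP2001 ?edition . ?edition bne-def:P3003 ?placeOfProduction}
  OPTIONAL {?work bne-def:OP1002 ?m . ?m bne-def:OP2001 ?edition . ?edition bne-def:P3006 ?yearOfPublication}
  OPTIONAL {?work bne-def:OP1002 ?m . ?m bne-def:OP2001 ?edition . ?edition dcterms:language ?langCode}
 }
}
LIMIT $limit
""")


def build_bne_works_query(qid: str, limit: int = 1000) -> str:
    """Return the federated query for one Wikidata entity (hypothetical helper)."""
    # Accept only plain QIDs such as "Q2085725" so that arbitrary text
    # cannot be spliced into the query.
    if not re.fullmatch(r"Q[1-9]\d*", qid):
        raise ValueError(f"Not a Wikidata QID: {qid!r}")
    return QUERY_TEMPLATE.substitute(qid=qid, limit=int(limit))


query = build_bne_works_query("Q2085725")
```

The resulting string can be passed to `sparql.setQuery(query)` exactly like the literal query above.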

Now we can test our query by running the code below to preview the first 7 results. We use a `try`/`except` block to handle any errors that may occur while the query is executed. If the query executes successfully, the results are printed; otherwise, an error message is displayed to help diagnose the issue.

The code also checks whether the query results contain any `bindings` before iterating. If they do not, no data was returned, and a message is printed to inform the user. Otherwise, the code prints the first 7 results from the query.

In [19]:
# Preview a limited number of results from the SPARQL query
try:
    ret = sparql.queryAndConvert()

    # Limit the number of results to preview
    preview_limit = 7
    # Fetch the bindings, falling back to an empty list if the key is absent
    bindings = ret["results"].get("bindings", [])
    if not bindings:
        print("No results found.")

    # Iterate through the first results and print relevant information
    for r in bindings[:preview_limit]:
        # Print only the most relevant fields
        print(f"Title: {r['workLabel']['value']}")
        if "yearOfPublication" in r:
            print(f"Year: {r['yearOfPublication']['value']}")
        if "placeOfProduction" in r:
            print(f"Place: {r['placeOfProduction']['value']}")
        if "work" in r:
            print(f"Work: {r['work']['value']}")
        if "edition" in r:
            print(f"Edition: {r['edition']['value']}")
        if "langCode" in r:
            print(f"Language: {r['langCode']['value']}")

        print("---")

except Exception as e:
    print("Exception:")
    print(e)

Title: Relación histórica del viaje a la América Meridional
Year: 1807
Place: London
Work: https://datos.bne.es/resource/XX3280526
Edition: https://datos.bne.es/resource/bima0000014850
Language: http://id.loc.gov/vocabulary/languages/eng
---
Title: Relación histórica del viaje a la América Meridional
Year: 1760
Place: London
Work: https://datos.bne.es/resource/XX3280526
Edition: https://datos.bne.es/resource/bima0000015581
Language: http://id.loc.gov/vocabulary/languages/eng
---
Title: Relación histórica del viaje a la América Meridional
Year: 1806]
Place: London
Work: https://datos.bne.es/resource/XX3280526
Edition: https://datos.bne.es/resource/bima0000016101
Language: http://id.loc.gov/vocabulary/languages/eng
---
Title: Noticias secretas de América
Year: D.L. 1982
Place: Madrid
Work: https://datos.bne.es/resource/XX2035522
Edition: https://datos.bne.es/resource/bimo0000147143
Language: http://id.loc.gov/vocabulary/languages/spa
---
Title: Compendio de navegacion para el uso de los 
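
The checks in the preview code follow from the shape of the JSON that a SPARQL endpoint returns for a `SELECT` query: a `results.bindings` list with one dictionary per row, where a variable that was not bound (for example, an unmatched `OPTIONAL`) is simply absent from that row. The sketch below uses a small hand-written sample instead of a live query; the second row, which lacks a year, is made up for illustration, and the `preview` function is a hypothetical helper.

```python
# Hand-written sample shaped like a SPARQL JSON SELECT result.
# It is not real endpoint output; the second row is illustrative.
sample = {
    "head": {"vars": ["work", "workLabel", "yearOfPublication"]},
    "results": {
        "bindings": [
            {
                "work": {"type": "uri", "value": "https://datos.bne.es/resource/XX3280526"},
                "workLabel": {"type": "literal", "value": "Relación histórica del viaje a la América Meridional"},
                "yearOfPublication": {"type": "literal", "value": "1807"},
            },
            {
                # No "yearOfPublication" key: the OPTIONAL pattern did not match
                "work": {"type": "uri", "value": "https://datos.bne.es/resource/XX2035522"},
                "workLabel": {"type": "literal", "value": "Noticias secretas de América"},
            },
        ]
    },
}


def preview(result, limit=7):
    """Return up to `limit` 'title (year)' strings from a SPARQL JSON result."""
    lines = []
    for row in result.get("results", {}).get("bindings", [])[:limit]:
        title = row["workLabel"]["value"]
        # Optional variables must be tested for presence before access
        year = row["yearOfPublication"]["value"] if "yearOfPublication" in row else "n/a"
        lines.append(f"{title} ({year})")
    return lines


lines = preview(sample)
for line in lines:
    print(line)
```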

## Storing the Query Results and Creating a DataFrame
In this section, we will create a dataframe from the results of our SPARQL query. We will use the `pandas` library to create a dataframe and populate it with the data retrieved from the BNE SPARQL endpoint.

If there are missing values in the data, the code handles them by assigning empty strings (`""`) to the corresponding fields. This ensures that the `data` list contains consistent dictionaries, even when some fields are not available in the SPARQL query results.

In [None]:
# Initialize an empty list to store the processed data
data = []

# Iterate through the results from the SPARQL query
for r in ret["results"]["bindings"]:
    placeOfProduction = ""
    if "placeOfProduction" in r:
        placeOfProduction = r["placeOfProduction"]["value"]
    yearOfPublication = ""
    if "yearOfPublication" in r:
        yearOfPublication = r["yearOfPublication"]["value"]
    edition = ""
    if "edition" in r:
        edition = r["edition"]["value"]
    langCode = ""
    if "langCode" in r:
        langCode = r["langCode"]["value"]

    # Append a dictionary containing the extracted data to the list
    data.append(
        {
            "author": r["author"]["value"],  # Author's name
            "work": r["work"]["value"],  # Work URI
            "title": r["workLabel"]["value"],  # Title of the work
            "langCode": langCode,  # Language code of the work
            "edition": edition,  # Edition information
            "yearOfPublication": yearOfPublication,  # Year of publication
            "placeOfProduction": placeOfProduction,  # Place of production
        }
    )

# Display the first 15 entries of the list to check the results
data[0:15]

[{'author': 'Jorge Juan y Santacilia',
  'work': 'https://datos.bne.es/resource/XX3280526',
  'title': 'Relación histórica del viaje a la América Meridional',
  'langCode': 'http://id.loc.gov/vocabulary/languages/eng',
  'edition': 'https://datos.bne.es/resource/bima0000014850',
  'yearOfPublication': '1807',
  'placeOfProduction': 'London'},
 {'author': 'Jorge Juan y Santacilia',
  'work': 'https://datos.bne.es/resource/XX3280526',
  'title': 'Relación histórica del viaje a la América Meridional',
  'langCode': 'http://id.loc.gov/vocabulary/languages/eng',
  'edition': 'https://datos.bne.es/resource/bima0000015581',
  'yearOfPublication': '1760',
  'placeOfProduction': 'London'},
 {'author': 'Jorge Juan y Santacilia',
  'work': 'https://datos.bne.es/resource/XX3280526',
  'title': 'Relación histórica del viaje a la América Meridional',
  'langCode': 'http://id.loc.gov/vocabulary/languages/eng',
  'edition': 'https://datos.bne.es/resource/bima0000016101',
  'yearOfPublication': '1806]'

To make this code more robust and handle missing data more concisely, we can use `.get()` with default values for the optional fields, e.g. "Unknown Place". This prevents errors and ensures that the data processing step works even when some fields are not available in the query results.

In [8]:
# Initialize an empty list to store the processed data
data = []

# Iterate through the results from the SPARQL query
for r in ret["results"]["bindings"]:
    # Use .get() to provide default values for optional fields
    author = r.get("author", {}).get("value", "Unknown Author")
    work = r.get("work", {}).get("value", "Unknown Work")
    title = r.get("workLabel", {}).get("value", "Unknown Title")
    placeOfProduction = r.get("placeOfProduction", {}).get("value", "Unknown Place")
    yearOfPublication = r.get("yearOfPublication", {}).get("value", "Unknown Year")
    langCode = r.get("langCode", {}).get("value", "Unknown Language")
    edition = r.get("edition", {}).get("value", "Unknown Edition")

    # Append a dictionary containing the extracted data to the list
    data.append(
        {
            "author": author,
            "work": work,
            "title": title,
            "placeOfProduction": placeOfProduction,
            "yearOfPublication": yearOfPublication,
            "langCode": langCode,
            "edition": edition,
        }
    )

# Display the first 15 entries of the list to check the results
data[:15]

[{'author': 'Jorge Juan y Santacilia',
  'work': 'https://datos.bne.es/resource/XX3280526',
  'title': 'Relación histórica del viaje a la América Meridional',
  'placeOfProduction': 'London',
  'yearOfPublication': '1807',
  'langCode': 'http://id.loc.gov/vocabulary/languages/eng',
  'edition': 'https://datos.bne.es/resource/bima0000014850'},
 {'author': 'Jorge Juan y Santacilia',
  'work': 'https://datos.bne.es/resource/XX3280526',
  'title': 'Relación histórica del viaje a la América Meridional',
  'placeOfProduction': 'London',
  'yearOfPublication': '1760',
  'langCode': 'http://id.loc.gov/vocabulary/languages/eng',
  'edition': 'https://datos.bne.es/resource/bima0000015581'},
 {'author': 'Jorge Juan y Santacilia',
  'work': 'https://datos.bne.es/resource/XX3280526',
  'title': 'Relación histórica del viaje a la América Meridional',
  'placeOfProduction': 'London',
  'yearOfPublication': '1806]',
  'langCode': 'http://id.loc.gov/vocabulary/languages/eng',
  'edition': 'https://dato
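
The chained `.get()` pattern above can also be factored into a small helper. The sketch below is one possible variant, not the notebook's own method: the names `COLUMNS` and `flatten_binding` are hypothetical, and it defaults to `None` instead of an "Unknown ..." string, on the assumption that missing values should later be recognised as missing (for example by `isna()` in pandas) rather than counted as ordinary text.

```python
# Mapping from output column name to SPARQL variable name
COLUMNS = {
    "author": "author",
    "work": "work",
    "title": "workLabel",
    "placeOfProduction": "placeOfProduction",
    "yearOfPublication": "yearOfPublication",
    "langCode": "langCode",
    "edition": "edition",
}


def flatten_binding(row, columns=COLUMNS, default=None):
    """Flatten one SPARQL JSON binding into {column: value}.

    Variables that are absent from the row get `default`.
    """
    return {
        column: row.get(variable, {}).get("value", default)
        for column, variable in columns.items()
    }


# Illustrative row with only some variables bound
sample_row = {
    "author": {"type": "literal", "value": "Jorge Juan y Santacilia"},
    "work": {"type": "uri", "value": "https://datos.bne.es/resource/XX3280526"},
    "workLabel": {"type": "literal", "value": "Relación histórica del viaje a la América Meridional"},
}

flat = flatten_binding(sample_row)
```

With query results in `ret`, the whole list could then be built as `data = [flatten_binding(r) for r in ret["results"]["bindings"]]`.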

Now let's convert our list of dictionaries into a pandas DataFrame. We use the `pd.DataFrame()` constructor for this: each dictionary in the list becomes a row in the DataFrame, and the dictionary keys become the column names.

In [9]:
# Load required libraries
import pandas as pd

# Create a DataFrame from the list of dictionaries
df = pd.DataFrame(data)

# Preview the first 10 rows
df.head(10)

| | author | work | title | placeOfProduction | yearOfPublication | langCode | edition |
|---|---|---|---|---|---|---|---|
| 0 | Jorge Juan y Santacilia | https://datos.bne.es/resource/XX3280526 | Relación histórica del viaje a la América Meri... | London | 1807 | http://id.loc.gov/vocabulary/languages/eng | https://datos.bne.es/resource/bima0000014850 |
| 1 | Jorge Juan y Santacilia | https://datos.bne.es/resource/XX3280526 | Relación histórica del viaje a la América Meri... | London | 1760 | http://id.loc.gov/vocabulary/languages/eng | https://datos.bne.es/resource/bima0000015581 |
| 2 | Jorge Juan y Santacilia | https://datos.bne.es/resource/XX3280526 | Relación histórica del viaje a la América Meri... | London | 1806] | http://id.loc.gov/vocabulary/languages/eng | https://datos.bne.es/resource/bima0000016101 |
| 3 | Jorge Juan y Santacilia | https://datos.bne.es/resource/XX2035522 | Noticias secretas de América | Madrid | D.L. 1982 | http://id.loc.gov/vocabulary/languages/spa | https://datos.bne.es/resource/bimo0000147143 |
| 4 | Jorge Juan y Santacilia | https://datos.bne.es/resource/XX3280515 | Compendio de navegacion para el uso de los cab... | [Boadilla del Monte] | D.L. 2012 | http://id.loc.gov/vocabulary/languages/spa | https://datos.bne.es/resource/a5299371 |
| 5 | Jorge Juan y Santacilia | https://datos.bne.es/resource/XX3280526 | Relación histórica del viaje a la América Meri... | Valladolid | 2013 | http://id.loc.gov/vocabulary/languages/spa | https://datos.bne.es/resource/a5535968 |
| 6 | Jorge Juan y Santacilia | https://datos.bne.es/resource/XX3280515 | Compendio de navegacion para el uso de los cab... | [Valencia] | D.L. 1996 | http://id.loc.gov/vocabulary/languages/spa | https://datos.bne.es/resource/bimo0000695964 |
| 7 | Jorge Juan y Santacilia | https://datos.bne.es/resource/XX3280526 | Relación histórica del viaje a la América Meri... | Te Goes | 1771-1772 | http://id.loc.gov/vocabulary/languages/dut | https://datos.bne.es/resource/a4554798 |
| 8 | Jorge Juan y Santacilia | https://datos.bne.es/resource/XX3280526 | Relación histórica del viaje a la América Meri... | Te Goes | 1771-1772 | http://id.loc.gov/vocabulary/languages/dut | https://datos.bne.es/resource/a4554798 |
| 9 | Jorge Juan y Santacilia | https://datos.bne.es/resource/XX3280527 | Dissertation historique et géographique sur le... | Paris | 1776 | http://id.loc.gov/vocabulary/languages/fre | https://datos.bne.es/resource/bima0000058106 |


In [21]:
# Some basic statistics about the DataFrame
df.describe()

| | author | work | title | placeOfProduction | yearOfPublication | langCode | edition |
|---|---|---|---|---|---|---|---|
| count | 39 | 39 | 39 | 39 | 39 | 39 | 39 |
| unique | 1 | 21 | 21 | 17 | 26 | 5 | 25 |
| top | Jorge Juan y Santacilia | https://datos.bne.es/resource/XX3280526 | Relación histórica del viaje a la América Meri... | Unknown Place | Unknown Year | http://id.loc.gov/vocabulary/languages/spa | Unknown Edition |
| freq | 39 | 11 | 11 | 13 | 13 | 17 | 13 |


In [11]:
# Check the data type of each column
df.dtypes

author               object
work                 object
title                object
placeOfProduction    object
yearOfPublication    object
langCode             object
edition              object
dtype: object

In [12]:
# Sort the DataFrame by yearOfPublication
# (the column holds strings, so the order is lexicographic, not numeric)
sorted_df = df.sort_values(by="yearOfPublication")

# Preview the first 10 rows
sorted_df.head(10)

| | author | work | title | placeOfProduction | yearOfPublication | langCode | edition |
|---|---|---|---|---|---|---|---|
| 12 | Jorge Juan y Santacilia | https://datos.bne.es/resource/XX3280515 | Compendio de navegacion para el uso de los cab... | En Cadiz | 1757 | http://id.loc.gov/vocabulary/languages/spa | https://datos.bne.es/resource/bima0000058092 |
| 1 | Jorge Juan y Santacilia | https://datos.bne.es/resource/XX3280526 | Relación histórica del viaje a la América Meri... | London | 1760 | http://id.loc.gov/vocabulary/languages/eng | https://datos.bne.es/resource/bima0000015581 |
| 7 | Jorge Juan y Santacilia | https://datos.bne.es/resource/XX3280526 | Relación histórica del viaje a la América Meri... | Te Goes | 1771-1772 | http://id.loc.gov/vocabulary/languages/dut | https://datos.bne.es/resource/a4554798 |
| 8 | Jorge Juan y Santacilia | https://datos.bne.es/resource/XX3280526 | Relación histórica del viaje a la América Meri... | Te Goes | 1771-1772 | http://id.loc.gov/vocabulary/languages/dut | https://datos.bne.es/resource/a4554798 |
| 13 | Jorge Juan y Santacilia | https://datos.bne.es/resource/XX3280518 | Estado de la Astronomia en Europa | En Madrid | 1774 | http://id.loc.gov/vocabulary/languages/spa | https://datos.bne.es/resource/bima0000058094 |
| 9 | Jorge Juan y Santacilia | https://datos.bne.es/resource/XX3280527 | Dissertation historique et géographique sur le... | Paris | 1776 | http://id.loc.gov/vocabulary/languages/fre | https://datos.bne.es/resource/bima0000058106 |
| 10 | Jorge Juan y Santacilia | https://datos.bne.es/resource/XX3280521 | Examen marítimo teórico-práctico | A Nantes | 1783 | http://id.loc.gov/vocabulary/languages/fre | https://datos.bne.es/resource/bima0000058107 |
| 2 | Jorge Juan y Santacilia | https://datos.bne.es/resource/XX3280526 | Relación histórica del viaje a la América Meri... | London | 1806] | http://id.loc.gov/vocabulary/languages/eng | https://datos.bne.es/resource/bima0000016101 |
| 0 | Jorge Juan y Santacilia | https://datos.bne.es/resource/XX3280526 | Relación histórica del viaje a la América Meri... | London | 1807 | http://id.loc.gov/vocabulary/languages/eng | https://datos.bne.es/resource/bima0000014850 |
| 16 | Jorge Juan y Santacilia | https://datos.bne.es/resource/XX2035522 | Noticias secretas de América | Madrid | 1918 | http://id.loc.gov/vocabulary/languages/spa | https://datos.bne.es/resource/bimo0001052452 |
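
Because `yearOfPublication` is stored as text (the `object` dtype shown earlier), the sort above compares strings, and catalogue values such as `1806]`, `1771-1772`, or `D.L. 1982` are not plain numbers. A minimal sketch of one way to obtain a sortable year is to take the first four-digit run from each value. The helper name `extract_year` and the "first four digits" rule are assumptions made for illustration; the rule will not suit every cataloguing convention (for a range like `1771-1772` it keeps only the start year).

```python
import re


def extract_year(raw):
    """Return the first four-digit run in `raw` as an int, or None if absent."""
    match = re.search(r"\d{4}", raw or "")
    return int(match.group()) if match else None


# Values taken from the outputs shown in this notebook
raw_years = ["1807", "1760", "1806]", "D.L. 1982", "1771-1772", "Unknown Year"]

parsed = [extract_year(y) for y in raw_years]

# Sort numerically, placing values without a recognisable year last
ordered = sorted(
    raw_years,
    key=lambda y: (extract_year(y) is None, extract_year(y) or 0),
)
print(ordered)
```

In the DataFrame, the same function could be applied with `df["year"] = df["yearOfPublication"].map(extract_year)`, after which `df.sort_values(by="year")` sorts on the numeric column.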
