# Creating Collections as Data Using Federated Queries / BVMC Query Building (3)
## About the Notebook
This notebook demonstrates how to use [SPARQL](https://www.w3.org/TR/sparql11-query/) to query Linked Open Data repositories. Specifically, it showcases how to perform federated queries by combining data from [Wikidata](https://www.wikidata.org) and the [Biblioteca Virtual Miguel de Cervantes (BVMC)](https://data.cervantesvirtual.com), which has published its catalog as Linked Open Data and makes use of [Resource, Description and Access (RDA)](http://www.rdaregistry.info/") as main vocabulary.

<!--🔖 **How to Cite**: [![DOI](https://zenodo.org/badge/DOI_NUMBER.svg)](https://doi.org/DOI_NUMBER) fix once we get the CITATION.cff set up in the GitHub repo-->

---

## Getting Started
First of all, we start by importing the `SPARQLWrapper` library, which is used to interact with SPARQL endpoints.
Then, the SPARQL endpoint for Wikidata is set up, and JSON is the return format specified for our query.

The <a href="https://data.cervantesvirtual.com"></a> published its catalogue as Linked Open data 

This example shows how to use <a href="https://www.w3.org/TR/sparql11-query/">SPARQL</a> as a query language in Linked Open Data repositories.

In [2]:
# Import necessary libraries
from SPARQLWrapper import SPARQLWrapper, JSON

# Set up the SPARQL endpoint for Wikidata
sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
sparql.setReturnFormat(JSON)

## Writing Our First SPARQL Query

In this section, we construct a federated SPARQL query that begins by retrieving the BVMC identifier (`P2799`) for a specific Wikidata entity, in this case, the author Jorge Juan y Santacilia (`wd:Q2085725`). What makes this query federated is the use of the `SERVICE` keyword, which allows us to access the BVMC SPARQL endpoint and combine its data with Wikidata.

Along with the RDA ontologies on use (`rdaw`, `rdam`, `rdae`), the  MADS/RDF ontology (`madsrdf`) is used to retrieve controlled vocabulary terms, such as language codes; last, the RDFS ontology (`rdfs`) is used to retrieve labels for resources, making the data more human-readable.

In [None]:
sparql.setQuery(
    """
PREFIX rdaw: <http://rdaregistry.info/Elements/w/>
PREFIX rdam: <http://rdaregistry.info/Elements/m/>
PREFIX rdae: <http://rdaregistry.info/Elements/e/>
PREFIX madsrdf: <http://www.loc.gov/mads/rdf/v1#>

SELECT ?author ?work ?workLabel ?placeOfProduction ?yearOfPublication ?langCode 
WHERE {
  wd:Q2085725 wdt:P2799 ?id .
  wd:Q2085725 rdfs:label ?author.  FILTER(LANG(?author) = "en").
  BIND(uri(concat("https://data.cervantesvirtual.com/person/", ?id)) as ?bvmcID)
  SERVICE <http://data.cervantesvirtual.com/openrdf-sesame/repositories/data> {
    ?work rdaw:author ?bvmcID .
    ?work rdfs:label ?workLabel .
    ?work rdaw:manifestationOfWork ?manifestation .  
    ?work rdaw:expressionOfWork ?expression .
    OPTIONAL {?expression rdae:languageOfExpression ?language . ?language madsrdf:code ?langCode .}
    OPTIONAL {?manifestation rdam:placeOfProduction ?placeOfProduction .}
    OPTIONAL {?manifestation rdam:dateOfPublication ?dateOfPublication . BIND(REPLACE(str(?dateOfPublication), "https://data.cervantesvirtual.com/date/", "", "i") AS ?yearOfPublication) .}
  }
}
LIMIT 1000
"""
)

## Storing the Query Results and Creating a DataFrame
In this section, we will create a dataframe from the results of our SPARQL query. We will use the `pandas` library to create a dataframe and populate it with the data retrieved from the BNF SPARQL endpoint.

If there are missing values in the data, the code handles them by assigning empty strings (`""`) to the corresponding fields. This ensures that the `data` list contains consistent dictionaries, even when some fields are not available in the SPARQL query results.

In [None]:
try:
    ret = sparql.queryAndConvert()

    for r in ret["results"]["bindings"]:
        print('Author: ', r['author']['value'], ' Work: ', r['work']['value'], ' Title: ', r['workLabel']['value'], ' placeOfProduction: ', r['placeOfProduction']['value'],
             ' Year publication: ', r['yearOfPublication']['value'])
except Exception as e:
    print(e)

In [6]:
# Preview a limited number of results from the SPARQL query
try:
    ret = sparql.queryAndConvert()

    # Limit the number of results to preview
    preview_limit = 7
    # Check if the results contain any bindings
    if "bindings" not in ret["results"]:
        print("No results found.")
        exit(1)
    count = 0

    # Iterate through the results and print the desired fields
    for r in ret["results"]["bindings"]:
        if count >= preview_limit:
            break

        # Print only the most relevant fields
        print(f"Title: {r['workLabel']['value']}")
        print(f"Author: {r['author']['value']}")
        if "yearOfPublication" in r:
            print(f"Year: {r['yearOfPublication']['value']}")
        if "placeOfProduction" in r:
            print(f"Place: {r['placeOfProduction']['value']}")
        if "work" in r:
            print(f"Work: {r['work']['value']}")
        if "edition" in r:
            print(f"Edition: {r['edition']['value']}")
        if "langCode" in r:
            print(f"Language: {r['langCode']['value']}")

        print("---")
        count += 1

except Exception as e:
    print("Exception:")
    print(e)

Title: Examen maritimo theorico practico, o tratado de mechanica aplicado a la construccion, conocimiento y manejo de los navios y demas embarcaciones
Author: Jorge Juan y Santacilia
Year: 1771
Place: en la Imprenta de D. Francisco Manuel de Mena, Calle de las Carretas, 1771
Work: https://data.cervantesvirtual.com/work/662222
Language: es
---
Title: Elogio de D. Jorge Juan, Comendador de Aliaga en la Orden de S. Juan... [Texto impreso]
Author: Jorge Juan y Santacilia
Year: 1750
Place: 1750
Work: https://data.cervantesvirtual.com/work/666509
Language: es
---
Title: Estado de la astronomia en Europa [Texto impreso] : Y juicio de los fundamentos sobre que se erigieron los Systemas del Mundo, para que sirva de guia al mètodo en que debe recibirlos la Nación ...
Author: Jorge Juan y Santacilia
Year: 1774
Place: en la Imprenta Real de la Gazeta, 1774
Work: https://data.cervantesvirtual.com/work/666530
Language: es
---
Title: Observaciones astronomicas y physicas ... en los reynos del Peru [T

## Storing the Query Results and Creating a DataFrame
In this section, we will create a dataframe from the results of our SPARQL query. We will use the `pandas` library to create a dataframe and populate it with the data retrieved from the BVMC SPARQL endpoint.

If there are missing values in the data, the code handles them by assigning empty strings (`""`) to the corresponding fields. This ensures that the `data` list contains consistent dictionaries, even when some fields are not available in the SPARQL query results.

In [7]:
# Initialize an empty list to store the processed data
data = []

# Iterate through the results from the SPARQL query
for r in ret["results"]["bindings"]:
    data.append(
        {
            "author": r.get("author", {}).get("value", ""),
            "work": r.get("work", {}).get("value", ""),
            "title": r.get("workLabel", {}).get("value", ""),
            "langCode": r.get("langCode", {}).get("value", ""),
            "edition": r.get("edition", {}).get("value", ""),
            "yearOfPublication": r.get("yearOfPublication", {}).get("value", ""),
            "placeOfProduction": r.get("placeOfProduction", {}).get("value", ""),
        }
    )

# Print the first 15 items
data[0:15]

[{'author': 'Jorge Juan y Santacilia',
  'work': 'https://data.cervantesvirtual.com/work/662222',
  'title': 'Examen maritimo theorico practico, o tratado de mechanica aplicado a la construccion, conocimiento y manejo de los navios y demas embarcaciones',
  'langCode': 'es',
  'edition': '',
  'yearOfPublication': '1771',
  'placeOfProduction': 'en la Imprenta de D. Francisco Manuel de Mena, Calle de las Carretas, 1771'},
 {'author': 'Jorge Juan y Santacilia',
  'work': 'https://data.cervantesvirtual.com/work/666509',
  'title': 'Elogio de D. Jorge Juan, Comendador de Aliaga en la Orden de S. Juan... [Texto impreso]',
  'langCode': 'es',
  'edition': '',
  'yearOfPublication': '1750',
  'placeOfProduction': '1750'},
 {'author': 'Jorge Juan y Santacilia',
  'work': 'https://data.cervantesvirtual.com/work/666530',
  'title': 'Estado de la astronomia en Europa [Texto impreso] : Y juicio de los fundamentos sobre que se erigieron los Systemas del Mundo, para que sirva de guia al mètodo en q

To make this code more robust and handling missing data more efficiently, we can use `.get()` and providing default values for optional fields, e.g. "Unknown xyz". This would prevent errors and ensure that the data processing step works even when some fields are not available in the query results.

In [8]:
# Initialize an empty list to store the processed data
data = []

# Iterate through the results from the SPARQL query
for r in ret["results"]["bindings"]:
    # Use .get() to provide default values for optional fields
    author = r.get("author", {}).get("value", "")
    work = r.get("work", {}).get("value", "Unknown Work")
    title = r.get("workLabel", {}).get("value", "Unknown Title")
    placeOfProduction = r.get("placeOfProduction", {}).get("value", "Unknown Place")
    yearOfPublication = r.get("yearOfPublication", {}).get("value", "Unknown Year")
    langCode = r.get("langCode", {}).get("value", "Unknown Language")
    edition = r.get("edition", {}).get("value", "Unknown Edition")

    # Append a dictionary containing the extracted data to the list
    data.append(
        {
            "author": author,
            "work": work,
            "title": work,
            "placeOfProduction": placeOfProduction,
            "yearOfPublication": yearOfPublication,
            "langCode": langCode,
            "edition": edition,
        }
    )

# Print the first 15 lines of the dictionary to check the results
data[:15]

[{'author': 'Jorge Juan y Santacilia',
  'work': 'https://data.cervantesvirtual.com/work/662222',
  'title': 'https://data.cervantesvirtual.com/work/662222',
  'placeOfProduction': 'en la Imprenta de D. Francisco Manuel de Mena, Calle de las Carretas, 1771',
  'yearOfPublication': '1771',
  'langCode': 'es',
  'edition': 'Unknown Edition'},
 {'author': 'Jorge Juan y Santacilia',
  'work': 'https://data.cervantesvirtual.com/work/666509',
  'title': 'https://data.cervantesvirtual.com/work/666509',
  'placeOfProduction': '1750',
  'yearOfPublication': '1750',
  'langCode': 'es',
  'edition': 'Unknown Edition'},
 {'author': 'Jorge Juan y Santacilia',
  'work': 'https://data.cervantesvirtual.com/work/666530',
  'title': 'https://data.cervantesvirtual.com/work/666530',
  'placeOfProduction': 'en la Imprenta Real de la Gazeta, 1774',
  'yearOfPublication': '1774',
  'langCode': 'es',
  'edition': 'Unknown Edition'},
 {'author': 'Jorge Juan y Santacilia',
  'work': 'https://data.cervantesvirtu

Now  let's convert our dictionary into a pandas DataFrame. We will use the `pd.DataFrame()` function to create a DataFrame from the list of dictionaries. Each dictionary in the list will represent a row in the DataFrame, and the keys of the dictionaries will become the column names.

In [9]:
# Load required libraries
import pandas as pd

# Create a DataFrame from the list of dictionaries
df = pd.DataFrame(data)

# Preview the first 10 rows
df.head(10)

Unnamed: 0,author,work,title,placeOfProduction,yearOfPublication,langCode,edition
0,Jorge Juan y Santacilia,https://data.cervantesvirtual.com/work/662222,https://data.cervantesvirtual.com/work/662222,"en la Imprenta de D. Francisco Manuel de Mena,...",1771,es,Unknown Edition
1,Jorge Juan y Santacilia,https://data.cervantesvirtual.com/work/666509,https://data.cervantesvirtual.com/work/666509,1750,1750,es,Unknown Edition
2,Jorge Juan y Santacilia,https://data.cervantesvirtual.com/work/666530,https://data.cervantesvirtual.com/work/666530,"en la Imprenta Real de la Gazeta, 1774",1774,es,Unknown Edition
3,Jorge Juan y Santacilia,https://data.cervantesvirtual.com/work/666560,https://data.cervantesvirtual.com/work/666560,"por Juan de Zuñiga, 1748",1748,es,Unknown Edition
4,Jorge Juan y Santacilia,https://data.cervantesvirtual.com/work/109096,https://data.cervantesvirtual.com/work/109096,"Madrid, en la Imprenta Real, 1809, pp. 143-151",1809,es,Unknown Edition
5,Jorge Juan y Santacilia,https://data.cervantesvirtual.com/work/109097,https://data.cervantesvirtual.com/work/109097,"Madrid, en la Imprenta Real, 1809, pp. 152-155",1809,es,Unknown Edition
6,Jorge Juan y Santacilia,https://data.cervantesvirtual.com/work/109098,https://data.cervantesvirtual.com/work/109098,"Madrid, en la Imprenta Real, 1809, pp. 160-163",1809,es,Unknown Edition
7,Jorge Juan y Santacilia,https://data.cervantesvirtual.com/work/109099,https://data.cervantesvirtual.com/work/109099,"Madrid, en la Imprenta Real, 1809, pp. 176-184",1809,es,Unknown Edition
8,Jorge Juan y Santacilia,https://data.cervantesvirtual.com/work/109100,https://data.cervantesvirtual.com/work/109100,"Madrid, en la Imprenta Real, 1809, pp. 253-320",1809,es,Unknown Edition
9,Jorge Juan y Santacilia,https://data.cervantesvirtual.com/work/97199,https://data.cervantesvirtual.com/work/97199,España. Cádiz,1757,es,Unknown Edition


In [10]:
# Some basic statistics about the DataFrame
df.describe()

Unnamed: 0,author,work,title,placeOfProduction,yearOfPublication,langCode,edition
count,38,38,38,38,38,38,38
unique,1,38,38,30,12,2,1
top,Jorge Juan y Santacilia,https://data.cervantesvirtual.com/work/662222,https://data.cervantesvirtual.com/work/662222,"En Madrid : por Antonio Marin, 1748.",1748,es,Unknown Edition
freq,38,1,1,5,7,35,38


In [11]:
# Checking extant data types
df.dtypes

author               object
work                 object
title                object
placeOfProduction    object
yearOfPublication    object
langCode             object
edition              object
dtype: object

In [12]:
# Sort the DataFrame by yearOfPublication
sorted_df = df.sort_values(by="yearOfPublication")

# Preview the first 10 rows
sorted_df.head(10)

Unnamed: 0,author,work,title,placeOfProduction,yearOfPublication,langCode,edition
19,Jorge Juan y Santacilia,https://data.cervantesvirtual.com/work/136736,https://data.cervantesvirtual.com/work/136736,"En Madrid : por Antonio Marin, 1748.",1748,es,Unknown Edition
3,Jorge Juan y Santacilia,https://data.cervantesvirtual.com/work/666560,https://data.cervantesvirtual.com/work/666560,"por Juan de Zuñiga, 1748",1748,es,Unknown Edition
15,Jorge Juan y Santacilia,https://data.cervantesvirtual.com/work/64371,https://data.cervantesvirtual.com/work/64371,"En Madrid : por Antonio Marin, 1748.",1748,es,Unknown Edition
14,Jorge Juan y Santacilia,https://data.cervantesvirtual.com/work/64370,https://data.cervantesvirtual.com/work/64370,"En Madrid : por Antonio Marin, 1748.",1748,es,Unknown Edition
13,Jorge Juan y Santacilia,https://data.cervantesvirtual.com/work/64369,https://data.cervantesvirtual.com/work/64369,"En Madrid : por Antonio Marin, 1748.",1748,es,Unknown Edition
12,Jorge Juan y Santacilia,https://data.cervantesvirtual.com/work/64368,https://data.cervantesvirtual.com/work/64368,"En Madrid : por Antonio Marin, 1748.",1748,es,Unknown Edition
27,Jorge Juan y Santacilia,https://data.cervantesvirtual.com/work/42393,https://data.cervantesvirtual.com/work/42393,"En Madrid : por Juan de Zuñiga, 1748",1748,es,Unknown Edition
18,Jorge Juan y Santacilia,https://data.cervantesvirtual.com/work/108123,https://data.cervantesvirtual.com/work/108123,"En Madrid : en la Imprenta de Antonio Marin, 1749",1749,es,Unknown Edition
16,Jorge Juan y Santacilia,https://data.cervantesvirtual.com/work/64383,https://data.cervantesvirtual.com/work/64383,"En Madrid : en la imprenta de Antonio Marin, 1...",1749,es,Unknown Edition
26,Jorge Juan y Santacilia,https://data.cervantesvirtual.com/work/1004811,https://data.cervantesvirtual.com/work/1004811,"En Madrid : en la Imprenta de Antonio Marin, 1749",1749,es,Unknown Edition
