# Creating Collections as Data Using Federated Queries / Multiple Authors' Works

## About the Notebook
This notebook demonstrates **a streamlined pipeline for executing federated queries across multiple library Linked Open Data (LOD) endpoints**. It simplifies the process of retrieving and consolidating data about literary works and authors into a single dataset. The query outputs are merged into a final dataset that is exported as a CSV file for analyses or reuse.

The Spanish Golden Age ([Q530936](https://www.wikidata.org/wiki/Q530936), 1492-1659) was a period of flourishing in the arts and literature in Spain, coinciding with the political rise and decline of the Spanish Habsburg dynasty. Its authors are the case study for the queries in this notebook.

🔖 **How to Cite**: [![DOI](https://zenodo.org/badge/DOI_NUMBER.svg)](https://doi.org/DOI_NUMBER) <!--fix once we get the CITATION.cff in the GitHub repo-->

---

### Core Modules in Use

- `SPARQLWrapper` is used to interact with SPARQL endpoints and execute queries to retrieve data.
- `pandas` is essential for data handling and manipulation tasks.
- `aiohttp` and `nest_asyncio` enable asynchronous programming, which makes data retrieval more efficient by letting multiple requests run concurrently. In particular, `nest_asyncio` allows `asyncio.run` to be called inside a Jupyter notebook, where an event loop is already running.
- `tqdm` is used to display progress bars for long-running tasks.


In [None]:
# Core modules
import asyncio
import nest_asyncio
from SPARQLWrapper import SPARQLWrapper, JSON

# Data handling and manipulation
import pandas as pd

# Network and HTTP requests
import aiohttp

# Utility and support
from tqdm import tqdm
from tqdm.asyncio import tqdm_asyncio
import time
import random

## Queries General Parameters Setup

We start by storing into `endpoint_url` the address of the Wikidata endpoint.

Then, a function called `execute_query` is defined: it takes a SPARQL query and the endpoint address as input. Next, it sends the query to the Wikidata endpoint, waits for a response (specifying a timeout in case of complex queries), and formats the received data into a list of dictionaries, which it then returns as the final output.

In [None]:
# Define the SPARQL endpoint in use (outside the function)
endpoint_url = "https://query.wikidata.org/sparql"

# Function to execute a query and return results as a list of dictionaries
def execute_query(query, endpoint_url):
    sparql = SPARQLWrapper(endpoint_url)
    sparql.setTimeout(180)
    sparql.setReturnFormat(JSON)
    sparql.setQuery(query)
    sparql.addCustomHttpHeader(
        "User-Agent", "LODCaD/1.0 (osti.giuli@gmail.com)"
    )  # Replace with your bot's details
    results = sparql.query().convert()
    return results["results"]["bindings"]

Before laying out our queries, we need to identify the Wikidata ids of the authors from the Spanish Golden Age (Wikidata ID: Q530936) who are represented to some extent across our selected repositories. To perform an environmental scan, we retrieve all authors related to the Spanish Golden Age using a SPARQL query stored in the `fetch_all_authors` variable, together with any identifiers they have in the following LOD repositories:

- `BNE`: Biblioteca Nacional de España (The Spanish National Library).
- `BNF`: Bibliothèque nationale de France (The National Library of France).
- `BVMC`: Biblioteca Virtual Miguel de Cervantes (The Miguel de Cervantes Virtual Library).

In [None]:
fetch_all_authors = """
SELECT DISTINCT ?author ?authorLabel ?bneID ?idBnF ?bvmcID WHERE {
  ?author wdt:P135 wd:Q530936;  # Spanish Golden Age
          rdfs:label ?authorLabel.
  OPTIONAL { ?author wdt:P950 ?bneID. }
  OPTIONAL { ?author wdt:P268 ?idBnF. }
  OPTIONAL { ?author wdt:P2799 ?bvmcID. }
  FILTER(LANG(?authorLabel) = "en")
}
LIMIT 200
"""

Once the query has been defined, the function `query_to_dataframe` is created to run it, process the results into a structured format, and store them in a pandas dataframe for further analysis.

In [None]:
def query_to_dataframe(query):
    # Fetch results from Wikidata
    results = execute_query(query, endpoint_url)

    # Map results to a list of dictionaries
    data = []
    for result in results:
        row = {
            "author": result.get("author", {}).get("value", ""),
            "authorLabel": result.get("authorLabel", {}).get("value", ""),
            "bneID": result.get("bneID", {}).get("value", ""),
            "idBnF": result.get("idBnF", {}).get("value", ""),
            "bvmcID": result.get("bvmcID", {}).get("value", ""),
        }
        data.append(row)

    # Convert to a dataframe
    return pd.DataFrame(data)
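
The row-mapping step above relies on the shape of the SPARQL JSON results format: each binding maps a variable name to a dictionary with a `value` key, and unbound `OPTIONAL` variables are simply absent. The following is a minimal sketch of that flattening on hand-written sample data. The helper name `bindings_to_rows` and the sample bindings are illustrative assumptions, not part of the notebook.

```python
def bindings_to_rows(bindings, columns, default=""):
    """Flatten SPARQL JSON bindings into plain dictionaries."""
    rows = []
    for binding in bindings:
        # Unbound OPTIONAL variables are missing from the binding entirely,
        # so fall back to the default value for them
        rows.append(
            {col: binding.get(col, {}).get("value", default) for col in columns}
        )
    return rows


# Hand-written sample mimicking the structure returned by execute_query
sample_bindings = [
    {
        "author": {"type": "uri", "value": "http://www.wikidata.org/entity/Q5682"},
        "authorLabel": {"type": "literal", "value": "Miguel de Cervantes"},
        "bneID": {"type": "literal", "value": "XX1718747"},
    },
    {
        "author": {"type": "uri", "value": "http://www.wikidata.org/entity/Q94802"},
        "authorLabel": {"type": "literal", "value": "The Burial of the Count of Orgaz"},
    },
]

rows = bindings_to_rows(sample_bindings, ["author", "authorLabel", "bneID"])
print(rows[1])
```

Passing the column list as an argument would let the same helper serve queries with different variables, at the cost of one more parameter to keep in sync with each `SELECT` clause.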

In [None]:
# Get a dataframe with all the authors
df_all_authors = query_to_dataframe(fetch_all_authors)

# Preview
df_all_authors

Unnamed: 0,author,authorLabel,bneID,idBnF,bvmcID
0,http://www.wikidata.org/entity/Q5682,Miguel de Cervantes,XX1718747,118957747,40.0
1,http://www.wikidata.org/entity/Q94802,The Burial of the Count of Orgaz,,,
2,http://www.wikidata.org/entity/Q201315,Francisco Quevedo,XX1066651,118873287,6.0
3,http://www.wikidata.org/entity/Q165257,Lope de Vega,XX1719671,11927819k,72.0
4,http://www.wikidata.org/entity/Q2450139,Boys Eating Grapes and Melon,,,
5,http://www.wikidata.org/entity/Q2905689,María de Zayas,XX913942,119878263,105.0
6,http://www.wikidata.org/entity/Q5551713,Jerónimo de Pasamonte,XX1183253,,6748.0
7,http://www.wikidata.org/entity/Q3893270,Alonso Jerónimo de Salas Barbadillo,XX1054399,120545809,895.0
8,http://www.wikidata.org/entity/Q5723718,Bautista Remiro de Navarra,XX1140785,144495928,
9,http://www.wikidata.org/entity/Q10304696,Jacinto Abad de Ayala,XX1171039,,


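Subsequently, we narrowed down the list to include **only those authors who have identifiers in all three chosen LOD repositories** (BNE, BNF, and BVMC). The preview above shows why: the scan also returns items that are not authors (paintings, for example) and authors missing one or more identifiers. To do so, we created a second SPARQL query, `fetch_authors`, and ran `query_to_dataframe` again.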

In [None]:
fetch_authors = """
SELECT DISTINCT ?author ?authorLabel ?bneID ?idBnF ?bvmcID WHERE {
  ?author wdt:P950 ?bneID.
  ?author wdt:P268 ?idBnF.
  ?author wdt:P2799 ?bvmcID.
  ?author wdt:P135 wd:Q530936;
    rdfs:label ?authorLabel.
  FILTER(LANG(?authorLabel) = "en")
}
LIMIT 200
"""

# Get a dataframe with all the authors
df_authors = query_to_dataframe(fetch_authors)

# Preview
df_authors

Unnamed: 0,author,authorLabel,bneID,idBnF,bvmcID
0,http://www.wikidata.org/entity/Q5682,Miguel de Cervantes,XX1718747,118957747,40
1,http://www.wikidata.org/entity/Q165257,Lope de Vega,XX1719671,11927819k,72
2,http://www.wikidata.org/entity/Q201315,Francisco Quevedo,XX1066651,118873287,6
3,http://www.wikidata.org/entity/Q2905689,María de Zayas,XX913942,119878263,105
4,http://www.wikidata.org/entity/Q3893270,Alonso Jerónimo de Salas Barbadillo,XX1054399,120545809,895
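
As a cross-check, the same narrowing can be reproduced client-side on the first result set by keeping only records where all three identifiers are non-empty. This is a sketch under the assumption that the records are plain dictionaries; the helper `has_all_ids` is hypothetical, and the sample rows are copied from the first preview.

```python
REQUIRED_IDS = ("bneID", "idBnF", "bvmcID")


def has_all_ids(record, required=REQUIRED_IDS):
    """Return True when every required identifier is present and non-empty."""
    return all(record.get(key) for key in required)


# A few rows copied from the first preview (empty string = identifier missing)
records = [
    {"authorLabel": "Miguel de Cervantes", "bneID": "XX1718747", "idBnF": "118957747", "bvmcID": "40"},
    {"authorLabel": "The Burial of the Count of Orgaz", "bneID": "", "idBnF": "", "bvmcID": ""},
    {"authorLabel": "Jerónimo de Pasamonte", "bneID": "XX1183253", "idBnF": "", "bvmcID": "6748"},
    {"authorLabel": "Bautista Remiro de Navarra", "bneID": "XX1140785", "idBnF": "144495928", "bvmcID": ""},
]

complete = [r["authorLabel"] for r in records if has_all_ids(r)]
print(complete)
```

Filtering in SPARQL, as the notebook does, avoids transferring rows that would be discarded anyway; the client-side version is mainly useful for verifying that both approaches agree.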


As a last step, we clean the Wikidata ids for our authors (keeping just the tail of the URLs) and store them in a new list called `wikidata_ids` for further processing.

Now we have all the required information to craft the queries.

In [None]:
# Get a list of raw Wikidata ids (URLs) from the author column
wikidata_raw_ids = df_authors["author"].tolist()

# Extract Wikidata ids from the URLs by splitting them at '/' and taking the last element
wikidata_ids = [url.split("/")[-1] for url in wikidata_raw_ids]

# Preview
wikidata_ids

['Q5682', 'Q165257', 'Q201315', 'Q2905689', 'Q3893270']
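
Because these ids are later substituted directly into query templates, it may be worth validating them at extraction time. The sketch below is one possible approach, not the notebook's method: `extract_qid` and `QID_PATTERN` are hypothetical names, and the check assumes item ids always have the form `Q` followed by a positive integer.

```python
import re

# Assumption: Wikidata item ids are "Q" followed by a positive integer
QID_PATTERN = re.compile(r"Q[1-9][0-9]*")


def extract_qid(entity_url):
    """Return the trailing Q-id of a Wikidata entity URL, or raise ValueError."""
    tail = entity_url.rstrip("/").rsplit("/", 1)[-1]
    if not QID_PATTERN.fullmatch(tail):
        raise ValueError(f"Not a Wikidata item URL: {entity_url!r}")
    return tail


urls = [
    "http://www.wikidata.org/entity/Q5682",
    "http://www.wikidata.org/entity/Q165257",
]
qids = [extract_qid(url) for url in urls]
print(qids)
```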

## Functions Design

This section focuses on creating reusable functions to efficiently query and retrieve data from LOD repositories. They work with the query templates defined later in the Queries Design section.

In short, `fetch_paginated_results_by_author` orchestrates the process, calling `fetch_results_for_author` for each author, which in turn calls `fetch_query` to retrieve the actual data. This chain keeps pagination, retries, and error handling in separate functions, so a failure for one author does not interrupt the others.

Additionally, we defined `track_progress`, **a function to provide a visual progress bar using the `tqdm` library**, and therefore support tracking the execution of the asynchronous tasks.

In [None]:
# Enable progress bar display with tqdm
async def track_progress(tasks, desc="Processing"):
    results = []
    for coro in tqdm_asyncio.as_completed(tasks, desc=desc, total=len(tasks)):
        result = await coro
        results.append(result)
    return results
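
One caveat with `as_completed`-style helpers such as `track_progress`: results arrive in completion order, not submission order, so they cannot be matched back to their inputs by position. A common workaround is to have each task return its own key. The sketch below illustrates this with the standard library's `asyncio.as_completed`; the names `tagged` and `fake_fetch` and the delays are assumptions made for the example.

```python
import asyncio


async def tagged(key, coro):
    """Await coro and return its result together with the key identifying it."""
    return key, await coro


async def fake_fetch(author_id, delay):
    # Stand-in for a network call; the delay simulates a slower response
    await asyncio.sleep(delay)
    return [f"record-for-{author_id}"]


async def main():
    delays = {"Q5682": 0.03, "Q165257": 0.01, "Q201315": 0.02}
    tasks = [tagged(qid, fake_fetch(qid, d)) for qid, d in delays.items()]
    results = {}
    # Tasks are yielded as they finish, so each result carries its own key
    for finished in asyncio.as_completed(tasks):
        key, value = await finished
        results[key] = value
    return results


results = asyncio.run(main())
print(results["Q5682"])
```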

Then, the function `fetch_query` is designed to send a SPARQL query to a specified endpoint and handle potential errors or rate limiting.

In [None]:
# Define a function to ship the query to a specified endpoint
async def fetch_query(session, query, endpoint_url, timeout, retries):
    headers = {"Accept": "application/sparql-results+json"}

    for attempt in range(retries):
        try:
            print(f"[Attempt {attempt + 1}/{retries}] Querying {endpoint_url}")
            print(f"→ Query Preview: {query[:200].strip()}...\n")

            async with session.post(
                endpoint_url,
                data={"query": query},
                headers=headers,
                # aiohttp documents ClientTimeout as the type for this parameter
                timeout=aiohttp.ClientTimeout(total=timeout),
            ) as response:
                if response.status == 200:
                    print(
                        "✅ Query successful.\n"
                    )  # added some emoji to enhance interpretability
                    return await response.json()
                elif response.status == 429:
                    wait_time = 2**attempt + random.uniform(0, 1)
                    print(
                        f"⚠️ 429 Too Many Requests — Retrying in {wait_time:.2f} seconds...\n"
                    )
                    await asyncio.sleep(wait_time)
                else:
                    print(f"❌ Unexpected status {response.status}. Raising...\n")
                    response.raise_for_status()

        except aiohttp.ClientError as e:
            print(f"🚨 Client error on attempt {attempt + 1}: {e}\n")
        except asyncio.TimeoutError:
            print(f"⏳ Timeout on attempt {attempt + 1}\n")
        except Exception as e:
            print(f"❗ General error on attempt {attempt + 1}: {e}\n")

        if attempt == retries - 1:
            print(f"❌ Failed after {retries} attempts. Giving up.\n")
            return None
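
The wait applied on a `429` response follows an exponential backoff with random jitter: the base delay doubles with each attempt, and up to one extra second is added so that concurrent tasks do not all retry at the same instant. Note that in `fetch_query` only the `429` branch waits; other errors are retried immediately. A minimal sketch of the delay schedule, where `backoff_delay` is a hypothetical helper and not part of the notebook:

```python
import random


def backoff_delay(attempt, base=2, max_jitter=1.0):
    """Delay in seconds before the retry that follows attempt `attempt` (0-based)."""
    return base**attempt + random.uniform(0, max_jitter)


# With jitter disabled, the underlying schedule is easy to see
delays = [backoff_delay(attempt, max_jitter=0.0) for attempt in range(5)]
print(delays)  # [1.0, 2.0, 4.0, 8.0, 16.0]

# With the default jitter, each delay falls within one second above its base
jittered = [backoff_delay(attempt) for attempt in range(5)]
```

With the default of five retries, the waits therefore add up to roughly half a minute per request in the worst case.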

To complement and refine `fetch_query`, **the function `fetch_results_for_author` is designed to retrieve data for a specific author from a given endpoint**. It is built to handle scenarios where the data is split across multiple pages (pagination) and potential network issues.

In [None]:
# Define a function to get results for a specified author
async def fetch_results_for_author(
    session, wikidata_id, query_template, endpoint_url, timeout, limit, max_retries
):
    print(f"\n🔍 Fetching results for author: {wikidata_id}")
    author_query = query_template.replace("<WIKIDATA_ID>", wikidata_id)
    offset = 0
    author_results = []

    while True:
        paginated_query = author_query.replace(
            "LIMIT <LIMIT>", f"LIMIT {limit} OFFSET {offset}"
        )
        print(f"  → OFFSET {offset}, LIMIT {limit}")

        result = await fetch_query(
            session, paginated_query, endpoint_url, timeout, max_retries
        )

        if result:
            bindings = result.get("results", {}).get("bindings", [])
            if not bindings:
                print(f"  ✅ No more results for {wikidata_id}. Done.\n")
                break
            author_results.extend(bindings)
            offset += limit
        else:
            print(f"  ❌ Failed to fetch data for author {wikidata_id}. Skipping.\n")
            break

    return author_results
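
The pagination logic can be studied in isolation from the network. The sketch below reproduces the same loop (advance `OFFSET` by `limit` until an empty page comes back) against a hypothetical in-memory "endpoint" of 250 rows; `paginate` and `fake_fetch_page` are names invented for this example.

```python
def paginate(fetch_page, limit):
    """Collect all rows by advancing the offset until an empty page comes back."""
    offset, rows = 0, []
    while True:
        page = fetch_page(limit, offset)
        if not page:
            break
        rows.extend(page)
        offset += limit
    return rows


# Hypothetical in-memory "endpoint" holding 250 rows
ALL_ROWS = list(range(250))
calls = []


def fake_fetch_page(limit, offset):
    calls.append(offset)  # record each requested offset
    return ALL_ROWS[offset:offset + limit]


rows = paginate(fake_fetch_page, limit=100)
print(len(rows), calls)  # 250 [0, 100, 200, 300]
```

The loop always issues one extra request to detect the end of the results. Also worth keeping in mind: SPARQL does not guarantee a stable result order without `ORDER BY`, so pages fetched with `LIMIT`/`OFFSET` alone may in principle overlap or miss rows, depending on the endpoint.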

Last, the function **`fetch_paginated_results_by_author` streamlines the process of retrieving data about multiple authors**, acting as a sort of wrapper for the previously specified functions.

In [None]:
# Define the wrapper function to handle data from multiple authors
async def fetch_paginated_results_by_author(
    wikidata_ids, query_template, endpoint_url, timeout=10, limit=100, max_retries=5
):
    print(f"🚀 Starting queries for {len(wikidata_ids)} authors...\n")

    async with aiohttp.ClientSession() as session:
        tasks = [
            fetch_results_for_author(
                session,
                wikidata_id,
                query_template,
                endpoint_url,
                timeout,
                limit,
                max_retries,
            )
            for wikidata_id in tqdm(wikidata_ids)
        ]

        results = await asyncio.gather(*tasks)

        results_dict = {
            wikidata_id: result for wikidata_id, result in zip(wikidata_ids, results)
        }
        print("\n📊 Summary of records fetched per author:")
        for author_id, records in results_dict.items():
            print(f"  {author_id}: {len(records)} records")
        print()

        return results_dict
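
The wrapper pairs ids with results through `zip`, which is safe because `asyncio.gather` returns results in the order the awaitables were passed, regardless of which finishes first. The sketch below checks this with dummy coroutines; `fake_author_fetch` and the delays are assumptions made for the example.

```python
import asyncio


async def fake_author_fetch(author_id, delay):
    # Stand-in for fetch_results_for_author; slower delays finish later
    await asyncio.sleep(delay)
    return [author_id]


async def main(ids_with_delays):
    tasks = [fake_author_fetch(qid, delay) for qid, delay in ids_with_delays]
    # gather preserves input order even though completion order differs
    results = await asyncio.gather(*tasks)
    ids = [qid for qid, _ in ids_with_delays]
    return dict(zip(ids, results))


# The first author is deliberately the slowest to respond
pairs = [("Q5682", 0.03), ("Q165257", 0.0), ("Q201315", 0.01)]
results_dict = asyncio.run(main(pairs))
print(results_dict)
```

Note also that in the wrapper `tqdm(wikidata_ids)` only wraps the creation of the coroutine objects, which is near-instant; the bar therefore completes before any query has actually returned.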

## Queries Design

Here the actual SPARQL queries are defined and executed to retrieve data from our chosen LOD repositories. The tasks are run using the previously defined functions to handle the query execution and data retrieval process.

In [None]:
nest_asyncio.apply()  # lets asyncio.run() work inside Jupyter, where an event loop is already running

**The queries are designed with placeholders to make them reusable for different authors and adaptable to different page sizes.** If you look closely at the structure of each SPARQL query, you will notice the use of two placeholders:

- `<WIKIDATA_ID>`. This placeholder is intended to be replaced with the actual Wikidata id of a specific author, specifically the ones we stored in the `wikidata_ids` list.  
- `<LIMIT>`. This placeholder controls the maximum number of results returned by the query. `<LIMIT>` is intended to be replaced with an actual numerical value, specifying the desired number of results to obtain during each iteration.
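
The substitution itself is plain string replacement, as performed inside `fetch_results_for_author`. The sketch below applies it to a toy template; `fill_template` and `toy_template` are illustrative names, not part of the notebook.

```python
def fill_template(template, wikidata_id, limit, offset=0):
    """Substitute both placeholders the way the fetch functions do."""
    query = template.replace("<WIKIDATA_ID>", wikidata_id)
    # The exact string "LIMIT <LIMIT>" is replaced, so OFFSET can be appended
    return query.replace("LIMIT <LIMIT>", f"LIMIT {limit} OFFSET {offset}")


toy_template = """
SELECT ?id WHERE {
  wd:<WIKIDATA_ID> wdt:P950 ?id .
}
LIMIT <LIMIT>
"""

query = fill_template(toy_template, "Q5682", limit=100, offset=200)
print(query)
```

Because the replacement targets the literal text `LIMIT <LIMIT>`, a template must spell it exactly that way (single space, same case); otherwise the placeholder is left in place and the endpoint will reject the query.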

### BNE

In [None]:
# Define BNE query
bne_query = """
PREFIX bne-def: <https://datos.bne.es/def/>
        PREFIX dcterms: <http://purl.org/dc/terms/>

        SELECT ?author ?work ?workLabel ?edition ?placeOfProduction ?yearOfPublication ?langCode
        WHERE {
            wd:<WIKIDATA_ID> wdt:P950 ?id .
            wd:<WIKIDATA_ID> rdfs:label ?author.  FILTER(LANG(?author) = "en").
            BIND(uri(concat("https://datos.bne.es/resource/", ?id)) as ?bneID)
            SERVICE <http://datos.bne.es/sparql> {
                ?bneID bne-def:OP5001 ?work .
                ?work rdfs:label ?workLabel .
                OPTIONAL {?work bne-def:OP1002 ?m . ?m bne-def:OP2001 ?edition . ?edition bne-def:P3003 ?placeOfProduction}
                OPTIONAL {?work bne-def:OP1002 ?m . ?m bne-def:OP2001 ?edition . ?edition bne-def:P3006 ?yearOfPublication}
                OPTIONAL {?work bne-def:OP1002 ?m . ?m bne-def:OP2001 ?edition . ?edition dcterms:language ?langCode}
            }
        }
        LIMIT <LIMIT>
"""

In [None]:
# Run the asynchronous function
bne_combined_results = asyncio.run(
    fetch_paginated_results_by_author(wikidata_ids, bne_query, endpoint_url)
)

# Flatten the results and convert to dataframe
bne_data = []
for author_id, results in bne_combined_results.items():
    for result in results:
        # Extract relevant fields from the SPARQL query result
        author = result.get("author", {}).get("value", None)
        # author_label = result.get('authorLabel', {}).get('value', None)
        work = result.get("work", {}).get("value", None)
        work_label = result.get("workLabel", {}).get("value", None)
        edition = result.get("edition", {}).get("value", None)
        place_of_production = result.get("placeOfProduction", {}).get("value", None)
        year_of_publication = result.get("yearOfPublication", {}).get("value", None)
        lang_code = result.get("langCode", {}).get("value", None)

        bne_data.append(
            {
                "Source": "BNE",  # specifies BNE as source
                "Author": author,
                #'Author Label': author_label,
                "Work": work,
                "Work Label": work_label,
                "Edition": edition,
                "Place of Production": place_of_production,
                "Year of Publication": year_of_publication,
                "Language Code": lang_code,
            }
        )

🚀 Starting queries for 5 authors...



100%|██████████| 5/5 [00:00<00:00, 27666.91it/s]


🔍 Fetching results for author: Q5682
  → OFFSET 0, LIMIT 100
[Attempt 1/5] Querying https://query.wikidata.org/sparql
→ Query Preview: PREFIX bne-def: <https://datos.bne.es/def/>
        PREFIX dcterms: <http://purl.org/dc/terms/>

        SELECT ?author ?work ?workLabel ?edition ?placeOfProduction ?yearOfPublication ?langCode...


🔍 Fetching results for author: Q165257
  → OFFSET 0, LIMIT 100
[Attempt 1/5] Querying https://query.wikidata.org/sparql
→ Query Preview: PREFIX bne-def: <https://datos.bne.es/def/>
        PREFIX dcterms: <http://purl.org/dc/terms/>

        SELECT ?author ?work ?workLabel ?edition ?placeOfProduction ?yearOfPublication ?langCode...


🔍 Fetching results for author: Q201315
  → OFFSET 0, LIMIT 100
[Attempt 1/5] Querying https://query.wikidata.org/sparql
→ Query Preview: PREFIX bne-def: <https://datos.bne.es/def/>
        PREFIX dcterms: <http://purl.org/dc/terms/>

        SELECT ?author ?work ?workLabel ?edition ?placeOfProduction ?yearOfPublication ?langCod




✅ Query successful.

  → OFFSET 100, LIMIT 100
[Attempt 1/5] Querying https://query.wikidata.org/sparql
→ Query Preview: PREFIX bne-def: <https://datos.bne.es/def/>
        PREFIX dcterms: <http://purl.org/dc/terms/>

        SELECT ?author ?work ?workLabel ?edition ?placeOfProduction ?yearOfPublication ?langCode...

✅ Query successful.

  → OFFSET 100, LIMIT 100
[Attempt 1/5] Querying https://query.wikidata.org/sparql
→ Query Preview: PREFIX bne-def: <https://datos.bne.es/def/>
        PREFIX dcterms: <http://purl.org/dc/terms/>

        SELECT ?author ?work ?workLabel ?edition ?placeOfProduction ?yearOfPublication ?langCode...

✅ Query successful.

  ✅ No more results for Q3893270. Done.

✅ Query successful.

  ✅ No more results for Q2905689. Done.

✅ Query successful.

  → OFFSET 100, LIMIT 100
[Attempt 1/5] Querying https://query.wikidata.org/sparql
→ Query Preview: PREFIX bne-def: <https://datos.bne.es/def/>
        PREFIX dcterms: <http://purl.org/dc/terms/>

        SELECT ?auth

In [None]:
# Render bne_data into a dataframe
df_bne = pd.DataFrame(bne_data)

# Preview
df_bne.head()

Unnamed: 0,Source,Author,Work,Work Label,Edition,Place of Production,Year of Publication,Language Code
0,BNE,Miguel de Cervantes,https://datos.bne.es/resource/XX3383764,Novelas ejemplares,https://datos.bne.es/resource/bima0000013178,Paris,1809,http://id.loc.gov/vocabulary/languages/fre
1,BNE,Miguel de Cervantes,https://datos.bne.es/resource/XX3383563,Don Quijote de la Mancha,https://datos.bne.es/resource/bimo0000398030,[Murcia,D.L. 1993,http://id.loc.gov/vocabulary/languages/spa
2,BNE,Miguel de Cervantes,https://datos.bne.es/resource/XX1924290,La cueva de Salamanca,https://datos.bne.es/resource/bimo0002046013,[Granada],2005,http://id.loc.gov/vocabulary/languages/spa
3,BNE,Miguel de Cervantes,https://datos.bne.es/resource/XX3383764,Novelas ejemplares,https://datos.bne.es/resource/Mimo0001661915,[Tokio,1993],http://id.loc.gov/vocabulary/languages/jpn
4,BNE,Miguel de Cervantes,https://datos.bne.es/resource/XX4894754,El celoso extremeño,https://datos.bne.es/resource/a5599310,[S.l],[1917?],http://id.loc.gov/vocabulary/languages/rus


In [None]:
df_bne.describe()

Unnamed: 0,Source,Author,Work,Work Label,Edition,Place of Production,Year of Publication,Language Code
count,3559,3559,3559,3559,3285,3164,3277,3261
unique,1,5,734,726,2985,454,810,25
top,BNE,Lope de Vega,https://datos.bne.es/resource/XX3383563,Don Quijote de la Mancha,https://datos.bne.es/resource/Mimo0000178061,Madrid,[2005],http://id.loc.gov/vocabulary/languages/spa
freq,3559,1862,769,769,6,893,51,3004


### BNF

In [None]:
bnf_query = """
  PREFIX dcterms: <http://purl.org/dc/terms/>
  PREFIX rdarelationships: <http://rdvocab.info/RDARelationshipsWEMI/>
  PREFIX rdagroup1elements: <http://rdvocab.info/Elements/>

        SELECT ?author ?authorLabel ?expression ?title ?edition ?placeOfPublication ?yearOfPublication ?langCode WHERE {
            wd:<WIKIDATA_ID> wdt:P268 ?id
            BIND(uri(concat(concat("http://data.bnf.fr/ark:/12148/cb", ?id),"#about")) as ?author)
            SERVICE <http://data.bnf.fr/sparql> {
                ?expression <http://id.loc.gov/vocabulary/relators/aut> ?author .
                OPTIONAL {?expression dcterms:language ?langCode .}
                OPTIONAL {?expression dcterms:publisher ?edition .}
                ?manifestation rdarelationships:expressionManifested ?expression .
                ?manifestation dcterms:title ?title .
                ?manifestation dcterms:date ?yearOfPublication .
                OPTIONAL{ ?manifestation rdagroup1elements:placeOfPublication ?placeOfPublication .}
            }
        }
        LIMIT <LIMIT>
"""

In [None]:
# Run the asynchronous function
bnf_combined_results = asyncio.run(
    fetch_paginated_results_by_author(wikidata_ids, bnf_query, endpoint_url, timeout=30)
)

# Flatten the results and convert to DataFrame
bnf_data = []
for author_id, results in bnf_combined_results.items():
    for result in results:
        # Extract relevant fields from the SPARQL query result
        author = result.get("author", {}).get("value", "")
        author_label = result.get("authorLabel", {}).get("value", "")
        work = result.get("expression", {}).get("value", "")
        work_label = result.get("title", {}).get("value", "")
        edition = result.get("edition", {}).get("value", "")
        # bnf_query binds ?placeOfPublication (not ?placeOfProduction)
        place_of_production = result.get("placeOfPublication", {}).get("value", "")
        year_of_publication = result.get("yearOfPublication", {}).get("value", "")
        lang_code = result.get("langCode", {}).get("value", "")

        bnf_data.append(
            {
                "Source": "BNF",  # specifies BNF as source
                "Author": author,
                "Work": work,
                "Work Label": work_label,
                "Edition": edition,
                "Place of Production": place_of_production,
                "Year of Publication": year_of_publication,
                "Language Code": lang_code,
            }
        )

🚀 Starting queries for 5 authors...



100%|██████████| 5/5 [00:00<00:00, 19654.66it/s]


🔍 Fetching results for author: Q5682
  → OFFSET 0, LIMIT 100
[Attempt 1/5] Querying https://query.wikidata.org/sparql
→ Query Preview: PREFIX dcterms: <http://purl.org/dc/terms/>
  PREFIX rdarelationships: <http://rdvocab.info/RDARelationshipsWEMI/>
  PREFIX rdagroup1elements: <http://rdvocab.info/Elements/>

        SELECT ?autho...


🔍 Fetching results for author: Q165257
  → OFFSET 0, LIMIT 100
[Attempt 1/5] Querying https://query.wikidata.org/sparql
→ Query Preview: PREFIX dcterms: <http://purl.org/dc/terms/>
  PREFIX rdarelationships: <http://rdvocab.info/RDARelationshipsWEMI/>
  PREFIX rdagroup1elements: <http://rdvocab.info/Elements/>

        SELECT ?autho...


🔍 Fetching results for author: Q201315
  → OFFSET 0, LIMIT 100
[Attempt 1/5] Querying https://query.wikidata.org/sparql
→ Query Preview: PREFIX dcterms: <http://purl.org/dc/terms/>
  PREFIX rdarelationships: <http://rdvocab.info/RDARelationshipsWEMI/>
  PREFIX rdagroup1elements: <http://rdvocab.info/Elements/>

        




✅ Query successful.

  → OFFSET 100, LIMIT 100
[Attempt 1/5] Querying https://query.wikidata.org/sparql
→ Query Preview: PREFIX dcterms: <http://purl.org/dc/terms/>
  PREFIX rdarelationships: <http://rdvocab.info/RDARelationshipsWEMI/>
  PREFIX rdagroup1elements: <http://rdvocab.info/Elements/>

        SELECT ?autho...

✅ Query successful.

  → OFFSET 100, LIMIT 100
[Attempt 1/5] Querying https://query.wikidata.org/sparql
→ Query Preview: PREFIX dcterms: <http://purl.org/dc/terms/>
  PREFIX rdarelationships: <http://rdvocab.info/RDARelationshipsWEMI/>
  PREFIX rdagroup1elements: <http://rdvocab.info/Elements/>

        SELECT ?autho...

✅ Query successful.

  ✅ No more results for Q3893270. Done.

❌ Unexpected status 500. Raising...

🚨 Client error on attempt 1: 500, message='Internal Server Error', url='https://query.wikidata.org/sparql'

[Attempt 2/5] Querying https://query.wikidata.org/sparql
→ Query Preview: PREFIX dcterms: <http://purl.org/dc/terms/>
  PREFIX rdarelationships: <h

In [None]:
# Render bnf_data into a dataframe
df_bnf = pd.DataFrame(bnf_data)

# Preview
df_bnf.head()

Unnamed: 0,Source,Author,Work,Work Label,Edition,Place of Production,Year of Publication,Language Code
0,BNF,http://data.bnf.fr/ark:/12148/cb118957747#about,http://data.bnf.fr/ark:/12148/cb36188456p#Expr...,L'ingénieux hidalgo Don Quichotte de la Manche,,,1997,http://id.loc.gov/vocabulary/iso639-2/fre
1,BNF,http://data.bnf.fr/ark:/12148/cb118957747#about,http://data.bnf.fr/ark:/12148/cb35986258k#Expr...,L'ingénieux hidalgo Don Quichotte de la Manche,,,1892,http://id.loc.gov/vocabulary/iso639-2/fre
2,BNF,http://data.bnf.fr/ark:/12148/cb118957747#about,http://data.bnf.fr/ark:/12148/cb38927398h#Expr...,Exemplary stories,,,1972,http://id.loc.gov/vocabulary/iso639-2/eng
3,BNF,http://data.bnf.fr/ark:/12148/cb118957747#about,http://data.bnf.fr/ark:/12148/cb30213279g#Expr...,Relatione di quanto è successo nella città di ...,,,1608,http://id.loc.gov/vocabulary/iso639-2/ita
4,BNF,http://data.bnf.fr/ark:/12148/cb118957747#about,http://data.bnf.fr/ark:/12148/cb321129229#Expr...,"Galatée, pastorale imitée de Cervantes par M. ...",,,1788,http://id.loc.gov/vocabulary/iso639-2/fre


In [None]:
# Summary statistics
df_bnf.describe()

Unnamed: 0,Source,Author,Work,Work Label,Edition,Place of Production,Year of Publication,Language Code
count,2082,2082,2082,2082,2082.0,2082.0,2082,2082
unique,1,5,2055,1523,1.0,1.0,464,25
top,BNF,http://data.bnf.fr/ark:/12148/cb118957747#about,http://data.bnf.fr/ark:/12148/cb427142507#Expr...,Don Quichotte,,,(s. d.),http://id.loc.gov/vocabulary/iso639-2/spa
freq,2082,1095,4,40,2082.0,2082.0,30,1062


### BVMC

In [None]:
bvmc_query = """
PREFIX rdaw: <http://rdaregistry.info/Elements/w/>
        PREFIX rdam: <http://rdaregistry.info/Elements/m/>
        PREFIX rdae: <http://rdaregistry.info/Elements/e/>
        PREFIX madsrdf: <http://www.loc.gov/mads/rdf/v1#>

        SELECT ?author ?work ?workLabel ?placeOfProduction ?yearOfPublication ?langCode
        WHERE {
            wd:<WIKIDATA_ID> wdt:P2799 ?id .
            wd:<WIKIDATA_ID> rdfs:label ?author.  FILTER(LANG(?author) = "en").
            BIND(uri(concat("https://data.cervantesvirtual.com/person/", ?id)) as ?bvmcID)
            SERVICE <http://data.cervantesvirtual.com/openrdf-sesame/repositories/data> {
                ?work rdaw:author ?bvmcID .
                ?work rdfs:label ?workLabel .
                ?work rdaw:manifestationOfWork ?manifestation .
                ?work rdaw:expressionOfWork ?expression .
                OPTIONAL {?expression rdae:languageOfExpression ?language . ?language madsrdf:code ?langCode .}
                OPTIONAL {?manifestation rdam:placeOfProduction ?placeOfProduction .}
                OPTIONAL {?manifestation rdam:dateOfPublication ?dateOfPublication . BIND(REPLACE(str(?dateOfPublication), "https://data.cervantesvirtual.com/date/", "", "i") AS ?yearOfPublication) .}
            }
        }
        LIMIT <LIMIT>
"""

In [None]:
# Run the asynchronous function
bvmc_combined_results = asyncio.run(
    fetch_paginated_results_by_author(wikidata_ids, bvmc_query, endpoint_url)
)

# Flatten the results and convert to DataFrame
bvmc_data = []
for author_id, results in bvmc_combined_results.items():
    for result in results:
        # Extract relevant fields from the SPARQL query result
        author = result.get("author", {}).get("value", "")
        work = result.get("work", {}).get("value", "")
        work_label = result.get("workLabel", {}).get("value", "")
        place_of_production = result.get("placeOfProduction", {}).get("value", "")
        year_of_publication = result.get("yearOfPublication", {}).get("value", "")
        lang_code = result.get("langCode", {}).get("value", "")

        bvmc_data.append(
            {
                "Source": "BVMC",
                "Author": author,
                "Work": work,
                "Work Label": work_label,
                "Place of Production": place_of_production,
                "Year of Publication": year_of_publication,
                "Language Code": lang_code,
            }
        )

🚀 Starting queries for 5 authors...



100%|██████████| 5/5 [00:00<00:00, 21312.52it/s]


🔍 Fetching results for author: Q5682
  → OFFSET 0, LIMIT 100
[Attempt 1/5] Querying https://query.wikidata.org/sparql
→ Query Preview: PREFIX rdaw: <http://rdaregistry.info/Elements/w/>
        PREFIX rdam: <http://rdaregistry.info/Elements/m/>
        PREFIX rdae: <http://rdaregistry.info/Elements/e/>
        PREFIX madsrdf: <http:...


🔍 Fetching results for author: Q165257
  → OFFSET 0, LIMIT 100
[Attempt 1/5] Querying https://query.wikidata.org/sparql
→ Query Preview: PREFIX rdaw: <http://rdaregistry.info/Elements/w/>
        PREFIX rdam: <http://rdaregistry.info/Elements/m/>
        PREFIX rdae: <http://rdaregistry.info/Elements/e/>
        PREFIX madsrdf: <http:...


🔍 Fetching results for author: Q201315
  → OFFSET 0, LIMIT 100
[Attempt 1/5] Querying https://query.wikidata.org/sparql
→ Query Preview: PREFIX rdaw: <http://rdaregistry.info/Elements/w/>
        PREFIX rdam: <http://rdaregistry.info/Elements/m/>
        PREFIX rdae: <http://rdaregistry.info/Elements/e/>
        PRE




✅ Query successful.

  → OFFSET 100, LIMIT 100
[Attempt 1/5] Querying https://query.wikidata.org/sparql
→ Query Preview: PREFIX rdaw: <http://rdaregistry.info/Elements/w/>
        PREFIX rdam: <http://rdaregistry.info/Elements/m/>
        PREFIX rdae: <http://rdaregistry.info/Elements/e/>
        PREFIX madsrdf: <http:...

✅ Query successful.

  → OFFSET 100, LIMIT 100
[Attempt 1/5] Querying https://query.wikidata.org/sparql
→ Query Preview: PREFIX rdaw: <http://rdaregistry.info/Elements/w/>
        PREFIX rdam: <http://rdaregistry.info/Elements/m/>
        PREFIX rdae: <http://rdaregistry.info/Elements/e/>
        PREFIX madsrdf: <http:...

✅ Query successful.

  → OFFSET 100, LIMIT 100
[Attempt 1/5] Querying https://query.wikidata.org/sparql
→ Query Preview: PREFIX rdaw: <http://rdaregistry.info/Elements/w/>
        PREFIX rdam: <http://rdaregistry.info/Elements/m/>
        PREFIX rdae: <http://rdaregistry.info/Elements/e/>
        PREFIX madsrdf: <http:...

✅ Query successful.

  → 

In [None]:
# Render bvmc_data into a dataframe
df_bvmc = pd.DataFrame(bvmc_data)

# Preview
df_bvmc.head()

Unnamed: 0,Source,Author,Work,Work Label,Place of Production,Year of Publication,Language Code
0,BVMC,Miguel de Cervantes,https://data.cervantesvirtual.com/work/1005253,Rinconete y Cortadillo,"Madrid : Revista de Archivos, Bibliotecas y Mu...",1920,es
1,BVMC,Miguel de Cervantes,https://data.cervantesvirtual.com/work/766798,Novelas ejemplares,"Barcelona : Salvador Ribas, 1881 (Establecimie...",1881,es
2,BVMC,Miguel de Cervantes,https://data.cervantesvirtual.com/work/770698,El ingenioso hidalgo Don Quijote de la Mancha,Barcelona : Centro editorial artístico de Migu...,1897,es
3,BVMC,Miguel de Cervantes,https://data.cervantesvirtual.com/work/21424,L'enginyós hidalgo Don Quixote de la Mancha,"Felanitx, Imprempta d'en Bartoméu Rèus, 1905-1906",1906,ca
4,BVMC,Miguel de Cervantes,https://data.cervantesvirtual.com/work/721923,El ingenioso hidalgo Don Quijote de la Mancha....,"Madrid, Ediciones de La Lectura, 1911-1913. Ca...",1911-1913,es


In [None]:
# Summary statistics
df_bvmc.describe()

Unnamed: 0,Source,Author,Work,Work Label,Place of Production,Year of Publication,Language Code
count,1839,1839,1839,1839,1839.0,1839.0,1839
unique,1,5,1506,1383,570.0,234.0,11
top,BVMC,Lope de Vega,https://data.cervantesvirtual.com/work/2904,El ingenioso hidalgo Don Quijote de la Mancha,,,es
freq,1839,1124,66,68,527.0,433.0,1802


## Cumulative Dataset Creation

This section combines the data retrieved from the BNE, BNF, and BVMC queries into a single, unified dataset, which is then exported as a CSV file.

In [None]:
# Concatenate the obtained dataframes
df_combined = pd.concat([df_bne, df_bnf, df_bvmc], ignore_index=True)

In [None]:
# Fields summary stats
df_combined.describe()

Unnamed: 0,Source,Author,Work,Work Label,Edition,Place of Production,Year of Publication,Language Code
count,7480,7480,7480,7480,5367.0,7085.0,7198.0,7182
unique,3,10,4295,3316,2986.0,1022.0,1035.0,60
top,BNE,Lope de Vega,https://datos.bne.es/resource/XX3383563,Don Quijote de la Mancha,,,,http://id.loc.gov/vocabulary/languages/spa
freq,3559,2986,769,786,2082.0,2609.0,433.0,3004


In [None]:
# Author fields check
df_combined['Author'].unique()

array(['Miguel de Cervantes', 'Lope de Vega', 'Francisco Quevedo',
       'María de Zayas', 'Alonso Jerónimo de Salas Barbadillo',
       'http://data.bnf.fr/ark:/12148/cb118957747#about',
       'http://data.bnf.fr/ark:/12148/cb11927819k#about',
       'http://data.bnf.fr/ark:/12148/cb118873287#about',
       'http://data.bnf.fr/ark:/12148/cb119878263#about',
       'http://data.bnf.fr/ark:/12148/cb120545809#about'], dtype=object)

Different LOD repositories can represent the same information in different ways. This creates inconsistencies, as we observe in the case of the BNF, whose results identify authors by URI instead of by human-readable name. The following code blocks normalise the authors' names.

In [None]:
# Dictionary mapping BNF author URIs to author names
author_uri_to_name = {
    "http://data.bnf.fr/ark:/12148/cb118957747#about": "Miguel de Cervantes",
    "http://data.bnf.fr/ark:/12148/cb11927819k#about": "Lope de Vega",
    "http://data.bnf.fr/ark:/12148/cb118873287#about": "Francisco Quevedo",
    "http://data.bnf.fr/ark:/12148/cb119878263#about": "María de Zayas",
    "http://data.bnf.fr/ark:/12148/cb120545809#about": "Alonso Jerónimo de Salas Barbadillo",
}
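
Rather than typing the mapping by hand, it could also be derived from the author list, since the BNF URIs follow the same pattern as the `BIND` in `bnf_query`. The sketch below assumes the authors are available as a list of dictionaries; `bnf_author_uri` and `build_uri_name_map` are hypothetical helpers, and the sample rows are copied from the `df_authors` preview.

```python
def bnf_author_uri(bnf_id):
    """Build the BNF author URI using the same pattern as the BIND in bnf_query."""
    return f"http://data.bnf.fr/ark:/12148/cb{bnf_id}#about"


def build_uri_name_map(authors):
    """Map each author's BNF URI to the author's English label."""
    return {bnf_author_uri(a["idBnF"]): a["authorLabel"] for a in authors}


# Rows copied from the df_authors preview
authors = [
    {"authorLabel": "Miguel de Cervantes", "idBnF": "118957747"},
    {"authorLabel": "Lope de Vega", "idBnF": "11927819k"},
]

uri_map = build_uri_name_map(authors)
print(uri_map)
```

In the notebook, the input list could presumably be obtained with `df_authors.to_dict("records")`, which would keep the mapping in sync if the author selection changes.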

In [None]:
# Define a reusable function to normalise author names
def normalise_authors(df, column, uri_name_map):
    return df.assign(**{column: df[column].replace(uri_name_map)})

In [None]:
# Normalise authors' names
df_combined = normalise_authors(df_combined, "Author", author_uri_to_name)

# Preview
df_combined.describe()

Unnamed: 0,Source,Author,Work,Work Label,Edition,Place of Production,Year of Publication,Language Code
count,7480,7480,7480,7480,5367.0,7085.0,7198.0,7182
unique,3,5,4295,3316,2986.0,1022.0,1035.0,60
top,BNE,Lope de Vega,https://datos.bne.es/resource/XX3383563,Don Quijote de la Mancha,,,,http://id.loc.gov/vocabulary/languages/spa
freq,3559,3631,769,786,2082.0,2609.0,433.0,3004


Now that the authors' names are normalised, we can export our dataset as a CSV file.

In [None]:
# Export the resulting dataframe in .CSV file format
df_combined.to_csv('df-golden-age.csv', index=False)