# OpenAIRE Community Data Dump Handling: Extraction, Tranformation and Enrichment

In this notebook we will start with one of the [OpenAIRE Community Subgraphs](https://graph.openaire.eu/docs/downloads/subgraphs) to enrich that informatino for further analysis.

This data process will extract an [OpenAIRE community data dump from Zenodo](https://doi.org/10.5281/zenodo.3974604), transforms it in to a portable file format .parquet (and updatable with changes for time seires analysis), that can be used to query with DuckDB, to enrich this with additional data  (also in .parquet, for join queries).

This additional data can be societal impact data from [Altmetric.com](https://details-page-api-docs.altmetric.com/) or [Overton.io](https://app.overton.io/swagger.php), Gender data using [genderize.io](https://genderize.io/documentation), sdg classification using [aurora-sdg](https://aurora-universities.eu/sdg-research/sdg-api/)

This script needs to be written in a way so that it can run every month using  the latest data.

## Processing steps

* the folder ./data/ is put in .gitignore to prevent that bulk datais sent to a code repository. So make sure that folder exists, and mkdir if not exists. 
* The script downloads the lastest Data Dump Tar file from one selected community. See https://doi.org/10.5281/zenodo.3974604 for the latest list. In our case the Aurora tar file. https://zenodo.org/records/14887484/files/aurora.tar?download=1
  * Use the json record of zenodo to get to the latest record, and fetch the download link of the aurora.tar file. for example : https://zenodo.org/records/14887484/export/json or https://zenodo.org/api/records/14887484/versions/latest 
  Make the tar filename a variable, so it can be used for multiple community dumps.
  Download the tar file in a target folder ./data/{publication_date}/01-downloaded/{filename} where a subfolder is created using the filename and the timestamp. Make this also as a  variable to use later on.
* Extract the tar file, to the compressed .json.gz files and put these in target folder ./data/{publication_date}/02-extracted/{filename}
* Transform the compressed .json.gz files into a single .parquet file in target folder ./data/{publication_date}/03-transformed/{filename}
Use instructions in sections "Processing JSON files with DuckDB" and "Full dataset, bit by bit" and "Splitting and Processing JSON Files in Batches" https://github.com/mosart/OpenAIRE-tools/blob/main/duckdb-querying.ipynb to start with. (be aware of error messages, and fix the issues to get all the data in)
* Extract the SQL schema (schema-datadump.sql) from the .parquet file and put it in target folder ./data/{filename+timestamp}/02-transformed/ This is needed for further processing of the records with DuckDB later on.
Use instructions in section "Extracting Schema from Parquet File" https://github.com/mosart/OpenAIRE-tools/blob/main/duckdb-querying.ipynb to start with.
* Query to get all identifiers: openaire id, doi, isbn, hdl, etc.
* **Get Altmetric data:**
* Extract the Altmetric data using the Identifiers. Transform keeping the record id in .parquet and put that in target folder ./data/{publication_date}/04-processed/{filename}/
* Extract the SQL schema (schema-altmetric.sql) from the .parquet file and put it in target folder ./data/{publication_date}/04-processed/{filename}/
* **Get Overton data:** Repeat the altmetric steps, bun than for Overton.
* **Get Gender data** query for the Author names and country codes, and run them over the gerderize api
* **Get SDG data** query for the abstracts, and run abstracs larger than 100 tokens over the aurora-SDG api.

## Testing Mode
Testign mode will reduce the number of records to process. Set to False if you want to go for the long haul.

In [14]:
testing_mode = True # Set to False for production
####

## Step 1 : Get the latest Community Dump File

In [15]:
import requests
import json

# Fetch the JSON data from the URL
url = "https://zenodo.org/api/records/14887484/versions/latest"
response = requests.get(url)
data = response.json()

# Extract the files information
files = data.get("files", [])

# Create a list of dictionaries for the .tar files
tar_files = []
for file in files:
    if file["key"].endswith(".tar"):
        tar_files.append({
            "filename": file["key"],
            "size": f"{file['size'] / (1024**3):.2f} GB",  # Convert bytes to GB
            "downloadlink": file["links"]["self"],
            "checksum": file["checksum"]
        })

# print the tar files
# If no tar files found, print a message
if not tar_files:
    print("No .tar files found in the dataset.")
else:
    print(f"Found {len(tar_files)} .tar files in the dataset.")
    print("Details of .tar files:")
    print(tar_files)

# get and print the publication date
publication_date = data.get("metadata", {}).get("publication_date", "Unknown")
print(f"Publication date: {publication_date}")
# get and print the DOI
doi = data.get("doi", "Unknown")
print(f"DOI: {doi}")
# get and print the title
title = data.get("title", "Unknown")
print(f"Title: {title}")

Found 37 .tar files in the dataset.
Details of .tar files:
[{'filename': 'energy-planning_1.tar', 'size': '6.99 GB', 'downloadlink': 'https://zenodo.org/api/records/14887484/files/energy-planning_1.tar/content', 'checksum': 'md5:0a2f551db46a9e629bb1d0a0098ae5cd'}, {'filename': 'edih-adria_1.tar', 'size': '5.86 GB', 'downloadlink': 'https://zenodo.org/api/records/14887484/files/edih-adria_1.tar/content', 'checksum': 'md5:23559bed5a9023398b431777bdc8a126'}, {'filename': 'uarctic_1.tar', 'size': '9.75 GB', 'downloadlink': 'https://zenodo.org/api/records/14887484/files/uarctic_1.tar/content', 'checksum': 'md5:302e3844ebd041c5f4ed94505eb9a285'}, {'filename': 'netherlands_1.tar', 'size': '3.91 GB', 'downloadlink': 'https://zenodo.org/api/records/14887484/files/netherlands_1.tar/content', 'checksum': 'md5:d1416c058b3961483aac340750ea8726'}, {'filename': 'knowmad_1.tar', 'size': '10.08 GB', 'downloadlink': 'https://zenodo.org/api/records/14887484/files/knowmad_1.tar/content', 'checksum': 'md5:

In [16]:
# Create a DataFrame to hold the tar files information for later use.

import pandas as pd

# Convert the list of dictionaries to a DataFrame
df_tar_files = pd.DataFrame(tar_files)

# Sort the DataFrame by filename alphabetically
df_tar_files = df_tar_files.sort_values(by='filename')

# Print the DataFrame
print(df_tar_files)

                      filename      size  \
5              argo-france.tar   0.00 GB   
8                   aurora.tar   1.73 GB   
22                  beopen.tar   0.20 GB   
6                   civica.tar   0.23 GB   
7                 covid-19.tar   2.03 GB   
23                  dariah.tar   0.02 GB   
9                    dh-ch.tar   1.16 GB   
11                     dth.tar   0.01 GB   
1             edih-adria_1.tar   5.86 GB   
12                  egrise.tar   0.02 GB   
25               elixir-gr.tar   0.01 GB   
0        energy-planning_1.tar   6.99 GB   
27                enermaps.tar   1.59 GB   
24              eu-conexus.tar   0.18 GB   
26                     eut.tar   0.21 GB   
15                 eutopia.tar   1.60 GB   
28                 forthem.tar   0.91 GB   
10        heritage-science.tar   0.03 GB   
29                   inria.tar   0.27 GB   
14               iperionhs.tar   0.00 GB   
4                knowmad_1.tar  10.08 GB   
19               knowmad_2.tar  

In [17]:
# Print a reindexed list of available tar files
print("Available tar files:")
print(df_tar_files[['filename', 'size']].reset_index())

import signal

# Function to handle timeout
def timeout_handler(signum, frame):
    raise TimeoutError

# Set the timeout handler for the input
signal.signal(signal.SIGALRM, timeout_handler)
signal.alarm(10)  # Set the timeout to 10 seconds

try:
    # Ask the user to select a tar file by its index
    selected_index = int(input("Enter the index of the tar file you want to download: "))
except TimeoutError:
    print("No response received. Defaulting to index 1.")
    selected_index = 1
finally:
    signal.alarm(0)  # Disable the alarm

# Get the selected tar file's download link and checksum
selected_file = df_tar_files.iloc[selected_index]
downloadlink = selected_file['downloadlink']
checksum = selected_file['checksum']

print(f"Selected file: {selected_file['filename']}")
print(f"Download link: {downloadlink}")
print(f"Checksum: {checksum}")

Available tar files:
    index                    filename      size
0       5             argo-france.tar   0.00 GB
1       8                  aurora.tar   1.73 GB
2      22                  beopen.tar   0.20 GB
3       6                  civica.tar   0.23 GB
4       7                covid-19.tar   2.03 GB
5      23                  dariah.tar   0.02 GB
6       9                   dh-ch.tar   1.16 GB
7      11                     dth.tar   0.01 GB
8       1            edih-adria_1.tar   5.86 GB
9      12                  egrise.tar   0.02 GB
10     25               elixir-gr.tar   0.01 GB
11      0       energy-planning_1.tar   6.99 GB
12     27                enermaps.tar   1.59 GB
13     24              eu-conexus.tar   0.18 GB
14     26                     eut.tar   0.21 GB
15     15                 eutopia.tar   1.60 GB
16     28                 forthem.tar   0.91 GB
17     10        heritage-science.tar   0.03 GB
18     29                   inria.tar   0.27 GB
19     14          

In [18]:
# Path Variables

# Extract the file name from the selected file
file_name = selected_file['filename']    

# Path to save the downloaded tar file using file_name variable
download_path = f"./data/{publication_date}/01_input/{file_name}"

# Create the folder name by removing the .tar extension
folder_name = selected_file['filename'].replace('.tar', '')

# Path to save the extracted files using the file_name variable without the .tar extension
extraction_path = f"./data/{publication_date}/02_extracted/{folder_name}"


print(f"File Name: {file_name}")
print(f"Download Path File: {download_path}")
print(f"Folder Name: {folder_name}")
print(f"Extraction Path Folder: {extraction_path}")

File Name: aurora.tar
Download Path File: ./data/2025-02-19/01_input/aurora.tar
Folder Name: aurora
Extraction Path Folder: ./data/2025-02-19/02_extracted/aurora


### Download the tar file

In [19]:
import os

# Ensure the directory for the download path exists
os.makedirs(os.path.dirname(download_path), exist_ok=True)

# Check if the file already exists
if not os.path.exists(download_path):
    # Get the file size in bytes
    file_size_bytes = float(selected_file['size'].split()[0]) * (1024**3)  # Convert GB to bytes
    print(f"Downloading file: {selected_file['filename']} ({selected_file['size']})")
    print(f"Download URL: {downloadlink}")
    
    # Estimate download duration assuming an average speed of 10 MB/s
    avg_speed = 10 * (1024**2)  # 10 MB/s in bytes
    estimated_duration = file_size_bytes / avg_speed
    print(f"Estimated download time: {estimated_duration:.2f} seconds")
    
    # Download the selected tar file
    response = requests.get(downloadlink, stream=True)
    with open(download_path, 'wb') as file:
        for chunk in response.iter_content(chunk_size=8192):
            file.write(chunk)
else:
    print(f"File already exists: {download_path}")
    print(f"Download URL: {downloadlink}")

print(f"Download complete: {download_path}")


Downloading file: aurora.tar (1.73 GB)
Download URL: https://zenodo.org/api/records/14887484/files/aurora.tar/content
Estimated download time: 177.15 seconds
Download complete: ./data/2025-02-19/01_input/aurora.tar


In [20]:
import hashlib

# Function to calculate the checksum of a file
def calculate_checksum(file_path, algorithm):
    hash_func = hashlib.new(algorithm)
    with open(file_path, 'rb') as f:
        while chunk := f.read(8192):
            hash_func.update(chunk)
    return hash_func.hexdigest()

# Extract the checksum algorithm and value
checksum_parts = checksum.split(':', 1)
checksum_algorithm = checksum_parts[0]
expected_checksum = checksum_parts[1]

# Calculate the checksum of the downloaded file
calculated_checksum = calculate_checksum(download_path, algorithm=checksum_algorithm)

# Compare the calculated checksum with the provided checksum
if calculated_checksum == expected_checksum:
    print("Checksum verification passed.")
else:
    print("Checksum verification failed.")
    print(f"Expected: {expected_checksum}")
    print(f"Calculated: {calculated_checksum}")

Checksum verification passed.


## Step 2: Extract the tar file

In [21]:
import os
import tarfile

In [22]:

# Check if the extraction directory already exists and contains files
if os.path.exists(extraction_path) and os.listdir(extraction_path):
    print("The tar file has already been extracted.")
else:
    # Create the directory if it doesn't exist
    os.makedirs(extraction_path, exist_ok=True)

    # Extract the tar file in the parent directory of the extraction_path - because the tar file contains a folder structure repeating the name of the tar file
    print(f"Extracting {download_path} to {extraction_path}...")
    parent_extraction_path = os.path.dirname(extraction_path)
    with tarfile.open(download_path, 'r') as tar:
        if testing_mode:
            # Extract only the first 10 files for testing
            members = tar.getmembers()[:10]
            tar.extractall(path=parent_extraction_path, members=members)
            print("Extracted only the first 10 files for testing mode.")
        else:
            tar.extractall(path=parent_extraction_path)

    print("Extraction complete.")
    print(f"Files extracted to: {extraction_path}")
    # print the number of files extracted
    extracted_files = os.listdir(extraction_path)
    print(f"Number of files extracted: {len(extracted_files)}")
    

Extracting ./data/2025-02-19/01_input/aurora.tar to ./data/2025-02-19/02_extracted/aurora...
Extracted only the first 10 files for testing mode.
Extraction complete.
Files extracted to: ./data/2025-02-19/02_extracted/aurora
Number of files extracted: 10


In [23]:
# List the extracted files
extracted_files = os.listdir(extraction_path)

# if testing_mode is True, limit the number of files to 10 for testing purposes
if testing_mode:
    extracted_files = extracted_files[:10]

# add the path to the extracted files
extracted_files_with_path = [os.path.join(extraction_path, file) for file in extracted_files]

# count the number of files in the extracted folder
num_files = len(extracted_files)
print(f"Number of files: {num_files}")

# print the first 5 files
print("First 5 files:")
for file in extracted_files[:5]:
    print(file) 

# make a DataFrame for the extracted files
df_extracted_files = pd.DataFrame(extracted_files, columns=['filename'])
# Sort the DataFrame by filename alphabetically
df_extracted_files = df_extracted_files.sort_values(by='filename')
# Print the DataFrame
print(df_extracted_files)

# print the dimensions of the DataFrame
print(f"DataFrame dimensions: {df_extracted_files.shape}")

# print a random 5 files, to be used for testing, and use in a variable for later use
import random
random_files = random.sample(extracted_files, 5)
random_files_with_path = [os.path.join(extraction_path, file) for file in random_files]
print("Randomly selected files with full paths for testing:")
for file in random_files_with_path:
    print(file)

# one random file for later use
random_file = random.choice(extracted_files)
print(f"Random file selected for later use: {random_file}")
# Define the path to the random file
random_file_path = os.path.join(extraction_path, random_file)
print(f"Path to the random file: {random_file_path}")
# Check if the random file exists
if os.path.exists(random_file_path):
    print(f"The random file exists: {random_file_path}")
else:
    print(f"The random file does not exist: {random_file_path}")




Number of files: 10
First 5 files:
part-00007-7a70885f-56f2-4cc2-b836-a5bd99ab23c3-c000.json.gz
part-00005-7a70885f-56f2-4cc2-b836-a5bd99ab23c3-c000.json.gz
part-00002-7a70885f-56f2-4cc2-b836-a5bd99ab23c3-c000.json.gz
part-00009-7a70885f-56f2-4cc2-b836-a5bd99ab23c3-c000.json.gz
part-00004-7a70885f-56f2-4cc2-b836-a5bd99ab23c3-c000.json.gz
                                            filename
8  part-00000-7a70885f-56f2-4cc2-b836-a5bd99ab23c...
5  part-00001-7a70885f-56f2-4cc2-b836-a5bd99ab23c...
2  part-00002-7a70885f-56f2-4cc2-b836-a5bd99ab23c...
6  part-00003-7a70885f-56f2-4cc2-b836-a5bd99ab23c...
4  part-00004-7a70885f-56f2-4cc2-b836-a5bd99ab23c...
1  part-00005-7a70885f-56f2-4cc2-b836-a5bd99ab23c...
7  part-00006-7a70885f-56f2-4cc2-b836-a5bd99ab23c...
0  part-00007-7a70885f-56f2-4cc2-b836-a5bd99ab23c...
9  part-00008-7a70885f-56f2-4cc2-b836-a5bd99ab23c...
3  part-00009-7a70885f-56f2-4cc2-b836-a5bd99ab23c...
DataFrame dimensions: (10, 1)
Randomly selected files with full paths for tes

## Step 3: Get a data sample to generate parquetfile and the SQL schema

In [24]:
import duckdb

transformation_folder_path = f"./data/{publication_date}/03_transformed/{folder_name}"

# Ensure the target directory exists
os.makedirs(transformation_folder_path, exist_ok=True)

# define and print the target output master file for all extracted files
master_file = f"{transformation_folder_path}/{folder_name}-master.parquet"
print(f"Master file path: {master_file}")


Master file path: ./data/2025-02-19/03_transformed/aurora/aurora-master.parquet


#### Parsing extracted files into one master parquet file

In [25]:
import os
import signal

def timeout_handler(signum, frame):
    raise TimeoutError

# Check if the master file already exists
if os.path.exists(master_file):
    print(f"Master file already exists: {master_file}")
    print("Do you want to overwrite it? (y/n) [Default: n, timeout 10s]:")
    signal.signal(signal.SIGALRM, timeout_handler)
    signal.alarm(10)
    try:
        user_input = input()
        overwrite = user_input.strip().lower() == 'y'
    except TimeoutError:
        print("No response received. Continuing with the existing master file.")
        overwrite = False
    finally:
        signal.alarm(0)
    if not overwrite:
        print("Using the existing master file.")
    else:
        # Overwrite: regenerate the master file
        con = duckdb.connect()
        file_paths = ','.join(f"'{file}'" for file in extracted_files_with_path)
        con.sql(f'''
            COPY (
                SELECT *
                FROM read_json([{file_paths}], sample_size=-1, union_by_name=true)
            )
            TO '{master_file}' (FORMAT parquet, COMPRESSION gzip)
        ''')
        print(f"Transformed data saved to: {master_file}")
        print(f"File size: {os.path.getsize(master_file) / (1024**2):.2f} MB")
        con.close()
else:
    # Master file does not exist, create it
    con = duckdb.connect()
    file_paths = ','.join(f"'{file}'" for file in extracted_files_with_path)
    con.sql(f'''
        COPY (
            SELECT *
            FROM read_json([{file_paths}], sample_size=-1, union_by_name=true)
        )
        TO '{master_file}' (FORMAT parquet, COMPRESSION gzip)
    ''')
    print(f"Transformed data saved to: {master_file}")
    print(f"File size: {os.path.getsize(master_file) / (1024**2):.2f} MB")
    con.close()

Transformed data saved to: ./data/2025-02-19/03_transformed/aurora/aurora-master.parquet
File size: 59.46 MB


In [28]:
# schema path
schema_file_path = f"{transformation_folder_path}/{folder_name}-schema.sql"

#print the schema file path
print(f"Schema file path: {schema_file_path}")

duckdb.sql(f'''
    COPY (
        SELECT *
        FROM (DESCRIBE '{master_file}')
    )
    TO '{schema_file_path}'
''')
# check if the schema file exists
if os.path.exists(schema_file_path):
    print(f"Schema file exists: {schema_file_path}")
else:
    print(f"Schema file does not exist: {schema_file_path}")
# Print the schema file content
with open(schema_file_path, 'r') as schema_file:
    schema_content = schema_file.read()
    print("Schema file content:")
    print(schema_content)


Schema file path: ./data/2025-02-19/03_transformed/aurora/aurora-schema.sql
Schema file exists: ./data/2025-02-19/03_transformed/aurora/aurora-schema.sql
Schema file content:
column_name,column_type,null,key,default,extra
authors,"STRUCT(fullName VARCHAR, ""name"" VARCHAR, pid STRUCT(id STRUCT(scheme VARCHAR, ""value"" VARCHAR), provenance STRUCT(provenance VARCHAR, trust VARCHAR)), rank BIGINT, surname VARCHAR)[]",YES,,,
bestAccessRight,"STRUCT(code VARCHAR, ""label"" VARCHAR, scheme VARCHAR)",YES,,,
collectedFrom,"STRUCT(""key"" VARCHAR, ""value"" VARCHAR)[]",YES,,,
communities,"STRUCT(code VARCHAR, ""label"" VARCHAR, provenance STRUCT(provenance VARCHAR, trust VARCHAR)[])[]",YES,,,
contributors,VARCHAR[],YES,,,
countries,"STRUCT(code VARCHAR, ""label"" VARCHAR, provenance STRUCT(provenance VARCHAR, trust VARCHAR))[]",YES,,,
coverages,VARCHAR[],YES,,,
dateOfCollection,VARCHAR,YES,,,
descriptions,VARCHAR[],YES,,,
formats,VARCHAR[],YES,,,
id,VARCHAR,YES,,,
indicators,"STRUCT(citationIm

In [29]:
# Connect to an in-memory DuckDB database
con = duckdb.connect()

# Count the number of records in the Parquet file
record_count = con.sql(f'''
    SELECT COUNT(*)
    FROM read_parquet('{master_file}')
''').fetchone()[0]

# Print the record count
print(f"Number of records in the Parquet file: {record_count}")

# Close the DuckDB connection
con.close()

Number of records in the Parquet file: 47483


In [30]:
import random

# Connect to an in-memory DuckDB database
con = duckdb.connect()

# Query all titles from the Parquet file
titles = con.sql(f'''
    SELECT mainTitle
    FROM read_parquet('{master_file}')
''').fetchall()

# Select 10 random titles
random_titles = random.sample([title[0] for title in titles if title[0]], min(10, len(titles)))

print("10 Random Titles in the Parquet file:")
for title in random_titles:
    print(title)

# print the number of unique titles
unique_titles = set(title[0] for title in titles if title[0])
print(f"Number of unique titles in the Parquet file: {len(unique_titles)}")

# Close the DuckDB connection
con.close()

10 Random Titles in the Parquet file:
The effect of individualized NUTritional counseling on muscle mass and treatment outcome in patients with metastatic COLOrectal cancer undergoing chemotherapy: a randomized controlled trial protocol
Genade zonder afzender
Italian external quality assessment program for cystic fibrosis sweat chloride test: a 2015 and 2016 results comparison.
Rethinking Subthreshold Effects in Regulatory Chemical Risk Assessments
TMEM16A (ANO1) as a therapeutic target in cystic fibrosis
Socially Responsible Resistance towards Consumption: Theoretical Legitimation and Implications on Marketing Practices
Conducteur geslagen: wegkijken of helpen?
Significados da utilização de plantas medicinais nas práticas de autoatenção à saúde
Achievements and Challenges in Sedimentary Basin Dynamics: A Review
VARIABILITY IN SOIL FOOD WEB STRUCTURE ACROSS TIME AND SPACE
Number of unique titles in the Parquet file: 47312


In [31]:
import random

# Connect to an in-memory DuckDB database
con = duckdb.connect()
# Query to extract DOIs from the pids column
dois = con.sql(f'''
    SELECT unnest.value AS doi
    FROM read_parquet('{master_file}')
    CROSS JOIN UNNEST(pids) AS unnest
    WHERE unnest.scheme = 'doi'
''').fetchall()

# Select 10 random DOIs
random_dois = random.sample([doi[0] for doi in dois if doi[0]], min(10, len(dois)))

# Print the 10 random DOIs
print("10 Random DOIs:")
for doi in random_dois:
    print(doi)

# print total number of DOIs
print(f"Total number of DOIs: {len(dois)}")

# print the number of unique DOIs
unique_dois = set(doi[0] for doi in dois if doi[0])
print(f"Number of unique DOIs: {len(unique_dois)}")

# Close the DuckDB connection
con.close()

10 Random DOIs:
10.1007/bf00263291
10.1155/2014/809741
10.1016/j.pce.2011.06.007
10.1088/0004-637x/769/2/151
10.1016/s1002-0160(18)60022-0
10.17863/cam.81832
10.1111/imm.12335
10.1016/j.mex.2023.102239
10.1111/jocd.12540
10.1016/s1040-8428(99)00052-9
Total number of DOIs: 34925
Number of unique DOIs: 34925


In [32]:
# Connect to an in-memory DuckDB database
con = duckdb.connect()

# Query to extract distinct PID schemes from the master file
pid_schemes = con.sql(f'''
    SELECT DISTINCT unnest.scheme AS scheme
    FROM read_parquet('{master_file}')
    CROSS JOIN UNNEST(pids) AS unnest
''').fetchall()

# Print the distinct PID schemes
print("Distinct PID schemes in the master table:")
for scheme in pid_schemes:
    print(scheme[0])

# Close the DuckDB connection
con.close()

Distinct PID schemes in the master table:
mag_id
doi
arXiv
handle
pmc
pmid


In [33]:
# Connect to an in-memory DuckDB database
con = duckdb.connect()

# Query to extract PIDs grouped by their schemes
pids_by_scheme = con.sql(f'''
    SELECT unnest.scheme AS scheme, LIST(unnest.value) AS pids
    FROM read_parquet('{master_file}')
    CROSS JOIN UNNEST(pids) AS unnest
    GROUP BY unnest.scheme
''').fetchall()

# dataframe to hold the PIDs grouped by their schemes
df_pids_by_scheme = pd.DataFrame(pids_by_scheme, columns=['scheme', 'pids'])
# Print the DataFrame of PIDs grouped by their schemes
print("PIDs grouped by schemes:")  
print(df_pids_by_scheme)
 
# Close the DuckDB connection
con.close()

PIDs grouped by schemes:
   scheme                                               pids
0  handle  [11588/164365, 11591/208934, 11588/633072, 115...
1     pmc  [PMC10478064, PMC8750824, PMC8970603, PMC10442...
2    pmid  [35349665, 38001043, 36322395, 37452799, 36174...
3     doi  [10.4337/9781848445987.00016, 10.6093/unina/fe...
4  mag_id  [2907305326, 2014694079, 1974512246, 232785792...
5   arXiv  [http://arxiv.org/abs/2112.14427, http://arxiv...


## Geting ready for further processing the master data

In [34]:
import os

# Create a folder for processed data
processing_folder_path = f"./data/{publication_date}/04_processed/{folder_name}"

# Ensure the target directory exists
os.makedirs(processing_folder_path, exist_ok=True)

### Step 4: Get the DOI's and other identifiers

In [35]:
# Connect to an in-memory DuckDB database
con = duckdb.connect()

# Query to combine the id with each pid
combined_data = con.sql(f'''
    SELECT 
        id AS record_id,
        unnest.scheme AS pid_scheme,
        unnest.value AS pid_value,
        CONCAT(id, '_', unnest.value) AS combined_id_pid
    FROM read_parquet('{master_file}')
    CROSS JOIN UNNEST(pids) AS unnest
''').fetchdf()

# Print the resulting DataFrame
print("Combined id and pid:")
print(combined_data)


# Save the combined data to a new Parquet file for later use
combined_file_path = f"{processing_folder_path}/{folder_name}-combined-id-pid.parquet"
combined_data.to_parquet(combined_file_path, index=False)
print(f"Combined data saved to: {combined_file_path}")
print(f"File size: {os.path.getsize(combined_file_path) / (1024**2):.2f} MB")

# Close the DuckDB connection
con.close()

Combined id and pid:
                                             record_id pid_scheme  \
0       dedup_wf_002::00a855101519be433c2a40000190d007     handle   
1       dedup_wf_002::00ec3f4e5f2f9b99df9f87875ab9a9e2     handle   
2       dedup_wf_002::00fab457e86aa3257212d739a7f83995     handle   
3       dedup_wf_002::014e7a41f66c80fbb2908c9477856715     handle   
4       dedup_wf_002::018d3a6f3d7674ba51cd90707cc74652     handle   
...                                                ...        ...   
123152  pmid_dedup__::281ac6f71b7a14a3b743d46e2cf9cb61       pmid   
123153  pmid_dedup__::281ac6f71b7a14a3b743d46e2cf9cb61        pmc   
123154  pmid_dedup__::281ac6f71b7a14a3b743d46e2cf9cb61       pmid   
123155  pmid_dedup__::281ac6f71b7a14a3b743d46e2cf9cb61        pmc   
123156  pmid_dedup__::281ac6f71b7a14a3b743d46e2cf9cb61       pmid   

           pid_value                                    combined_id_pid  
0       11588/164365  dedup_wf_002::00a855101519be433c2a40000190d007...  
1 

### Step 5: Get Altmetric data

a. use the PIDS (df_pids_by_scheme) along with the record id (to be used as primary keys, connecting the tables later on), 

b. get mention data by parsing the pids over the altmetric API,

c. save the outcomes in a separate parquet file.

In [36]:
import os
import time
import json
import shutil
import tempfile
import requests
import pandas as pd
import duckdb
from tqdm import tqdm

# Supported PID schemes (from script 1)
supported_schemes = [
    'dimensions_publication_id',
    'doi', 'pmid', 'handle', 'arxiv', 'ads',
    'ssrn', 'repec', 'isbn', 'id',
    'nct_id', 'urn'
]

# Map PID schemes to Altmetric endpoints (from script 2)
endpoint_map = {
    'doi': 'doi',
    'handle': 'handle',
    'pmid': 'pmid',
    'arxiv': 'arxiv',
    'ads': 'ads',
    'ssrn': 'ssrn',
    'repec': 'repec',
    'isbn': 'isbn',
    'id': 'id',
    'nct_id': 'nct_id',
    'urn': 'urn',
    'uri': 'uri'
}

def estimate_enrichment_time(n_items, rate_per_minute):
    secs_per = 60 / rate_per_minute
    total_secs = secs_per * n_items
    print(f"Estimated time for {n_items} items at {rate_per_minute}/min: {total_secs / 60:.1f} minutes")

def atomic_write_json(data, path):
    """Write JSON atomically to avoid corruption if interrupted."""
    dirpath = os.path.dirname(path)
    with tempfile.NamedTemporaryFile('w', delete=False, dir=dirpath) as tf:
        json.dump(data, tf)
        tempname = tf.name
    shutil.move(tempname, path)

def fetch_altmetric_data(combined_file_path,
                         processing_folder_path,
                         folder_name,
                         batch_size=100,
                         sleep_sec=1.0):
    """
    Hybrid function:
    - Loads combined parquet (with record_id, pid_scheme, pid_value).
    - Filters/sanitizes inputs (script 1).
    - Uses checkpointing & periodic saving (script 2).
    - Produces parquet + SQL schema.
    """

    # Folders
    extracted_folder = os.path.join(processing_folder_path, folder_name, "03-altmetric-extracted")
    transformed_folder = os.path.join(processing_folder_path, folder_name, "04-altmetric-transformed")
    os.makedirs(extracted_folder, exist_ok=True)
    os.makedirs(transformed_folder, exist_ok=True)

    json_path = os.path.join(extracted_folder, "altmetric_results.json")
    parquet_path = os.path.join(transformed_folder, "altmetric_results.parquet")
    schema_path = os.path.join(transformed_folder, "schema-altmetric.sql")

    # Load combined data
    df = pd.read_parquet(combined_file_path)
    df = df[df['pid_value'].notna() & (df['pid_value'] != '')]
    df['pid_scheme'] = df['pid_scheme'].str.lower()
    df['pid_value'] = df['pid_value'].str.lower()
    df = df[df['pid_scheme'].isin(supported_schemes)]
    print(f"Number of records to be processed: {len(df)}")

    # Resume checkpoint
    if os.path.exists(json_path):
        print(f"Resuming from saved JSON: {json_path}")
        with open(json_path, 'r') as f:
            results = json.load(f)
    else:
        results = []

    processed_keys = {(r.get('scheme'), r.get('value')) for r in results}
    df_to_process = df[~df.apply(lambda row: (row['pid_scheme'], row['pid_value']) in processed_keys, axis=1)]
    df_to_process = df_to_process.reset_index(drop=True)

    total = len(df_to_process)
    print(f"Remaining to process: {total}")
    if total > 0:
        estimate_enrichment_time(total, rate_per_minute=(60 / sleep_sec))

    start_time = time.time()

    for i, row in enumerate(tqdm(df_to_process.itertuples(index=False), total=total, desc="Fetching Altmetric data")):
        record_id = row.record_id
        scheme = row.pid_scheme
        value = row.pid_value

        if scheme in endpoint_map:
            endpoint = endpoint_map[scheme]
            url = f"https://api.altmetric.com/v1/{endpoint}/{value}"
            print(f"\nRequesting Altmetric data for {scheme}:{value} → {url}")

            try:
                response = requests.get(url)

                if response.status_code == 200:
                    data = response.json()
                    print(f"  ✔ 200 OK: Data received.")
                    if isinstance(data, dict):
                        altmetric_score = data.get('score')
                        if altmetric_score is not None:
                            print(f"    Altmetric score: {altmetric_score}")
                        else:
                            print("    Altmetric score not found.")
                    data['record_id'] = record_id
                    data['scheme'] = scheme
                    data['value'] = value
                    results.append(data)

                elif response.status_code == 403:
                    print("403 Forbidden: Not authorized (API key may be required).")
                elif response.status_code == 404:
                    print("404 Not Found: No Altmetric details available.")
                elif response.status_code == 429:
                    print("429 Too Many Requests: You are being rate limited.")
                elif response.status_code == 502:
                    print("502 Bad Gateway: Altmetric API maintenance.")
                else:
                    print(f"Error {response.status_code} for {scheme}:{value}")

            except Exception as e:
                print(f"Exception for {scheme}:{value}: {e}")

            time.sleep(sleep_sec)

        # Save every batch_size or at end
        if (i + 1) % batch_size == 0 or (i + 1) == total:
            atomic_write_json(results, json_path)

            altmetric_df = pd.json_normalize(results)
            altmetric_df = altmetric_df.astype(str)
            altmetric_df.to_parquet(parquet_path, index=False)

            elapsed = time.time() - start_time
            completed = i + 1
            remaining = total - completed
            avg_time_per = elapsed / completed if completed else 0
            eta_sec = avg_time_per * remaining
            eta_str = time.strftime("%H:%M:%S", time.gmtime(eta_sec))
            print(f"\n💾 Saved at {completed} items | ETA remaining: {eta_str}")

    print(f"\nAltmetric enrichment completed. Total records: {len(results)}")

    # Extract SQL schema
    con = duckdb.connect()
    con.execute(f"DESCRIBE SELECT * FROM parquet_scan('{parquet_path}')")
    schema_df = con.fetchdf()
    with open(schema_path, "w") as f:
        for _, row in schema_df.iterrows():
            f.write(f"{row['column_name']} {row['column_type']},\n")
    con.close()
    print(f"SQL schema written to: {schema_path}")

    return pd.json_normalize(results)


In [None]:
from datetime import datetime
import pandas as pd
import os

# Path to the combined PID parquet
combined_file_path = (
    f"./data/{publication_date}/04_processed/{folder_name}/{folder_name}-combined-id-pid.parquet"
)

# Altmetric output folder (05_altmetric)
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
altmetric_folder_name = f"{folder_name}_altmetric_{timestamp}"
processing_folder_path = f"./data/{publication_date}/05_altmetric"
os.makedirs(processing_folder_path, exist_ok=True)

if testing_mode:
    df = pd.read_parquet(combined_file_path).tail(100)
    test_path = os.path.join(processing_folder_path, "test_combined.parquet")
    df.to_parquet(test_path, index=False)
    results_df = fetch_altmetric_data(
        test_path,
        processing_folder_path,
        altmetric_folder_name
    )
else:
    results_df = fetch_altmetric_data(
        combined_file_path,
        processing_folder_path,
        altmetric_folder_name
    )

print(f"Altmetric enrichment completed. Results stored under: {processing_folder_path}/{altmetric_folder_name}")


Number of records to be processed: 79
Remaining to process: 79
Estimated time for 79 items at 60.0/min: 1.3 minutes


Fetching Altmetric data:   0%|          | 0/79 [00:00<?, ?it/s]


Requesting Altmetric data for doi:10.1002/9780470725207.ch6 → https://api.altmetric.com/v1/doi/10.1002/9780470725207.ch6
404 Not Found: No Altmetric details available.


Fetching Altmetric data:   1%|▏         | 1/79 [00:01<01:25,  1.10s/it]


Requesting Altmetric data for handle:20.500.11755/50623997-3854-4efe-879c-62a8a13fef72 → https://api.altmetric.com/v1/handle/20.500.11755/50623997-3854-4efe-879c-62a8a13fef72
404 Not Found: No Altmetric details available.


Fetching Altmetric data:   3%|▎         | 2/79 [00:02<01:27,  1.14s/it]


Requesting Altmetric data for doi:10.1023/a:1014403119794 → https://api.altmetric.com/v1/doi/10.1023/a:1014403119794
404 Not Found: No Altmetric details available.


Fetching Altmetric data:   4%|▍         | 3/79 [00:03<01:25,  1.13s/it]


Requesting Altmetric data for handle:11588/133016 → https://api.altmetric.com/v1/handle/11588/133016
404 Not Found: No Altmetric details available.


Fetching Altmetric data:   5%|▌         | 4/79 [00:04<01:23,  1.12s/it]


Requesting Altmetric data for handle:11588/489596 → https://api.altmetric.com/v1/handle/11588/489596
404 Not Found: No Altmetric details available.


Fetching Altmetric data:   6%|▋         | 5/79 [00:05<01:22,  1.11s/it]


Requesting Altmetric data for handle:11588/483779 → https://api.altmetric.com/v1/handle/11588/483779
404 Not Found: No Altmetric details available.


Fetching Altmetric data:   8%|▊         | 6/79 [00:06<01:20,  1.11s/it]


Requesting Altmetric data for doi:10.1023/a:1006090807678 → https://api.altmetric.com/v1/doi/10.1023/a:1006090807678
404 Not Found: No Altmetric details available.


Fetching Altmetric data:   9%|▉         | 7/79 [00:07<01:19,  1.11s/it]


Requesting Altmetric data for handle:11588/338523 → https://api.altmetric.com/v1/handle/11588/338523
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  10%|█         | 8/79 [00:08<01:18,  1.10s/it]


Requesting Altmetric data for doi:10.22037/uj.v0i0.5633 → https://api.altmetric.com/v1/doi/10.22037/uj.v0i0.5633
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  11%|█▏        | 9/79 [00:09<01:17,  1.10s/it]


Requesting Altmetric data for handle:11571/135571 → https://api.altmetric.com/v1/handle/11571/135571
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  13%|█▎        | 10/79 [00:11<01:16,  1.10s/it]


Requesting Altmetric data for handle:20.500.11769/23610 → https://api.altmetric.com/v1/handle/20.500.11769/23610
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  14%|█▍        | 11/79 [00:12<01:14,  1.10s/it]


Requesting Altmetric data for handle:11588/165668 → https://api.altmetric.com/v1/handle/11588/165668
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  15%|█▌        | 12/79 [00:13<01:13,  1.10s/it]


Requesting Altmetric data for doi:10.23812/21-86-l → https://api.altmetric.com/v1/doi/10.23812/21-86-l
  ✔ 200 OK: Data received.
    Altmetric score: 0.25


Fetching Altmetric data:  16%|█▋        | 13/79 [00:14<01:12,  1.10s/it]


Requesting Altmetric data for handle:11379/33267 → https://api.altmetric.com/v1/handle/11379/33267
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  18%|█▊        | 14/79 [00:15<01:11,  1.10s/it]


Requesting Altmetric data for pmid:21172342 → https://api.altmetric.com/v1/pmid/21172342
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  19%|█▉        | 15/79 [00:16<01:10,  1.10s/it]


Requesting Altmetric data for handle:11588/364415 → https://api.altmetric.com/v1/handle/11588/364415
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  20%|██        | 16/79 [00:17<01:09,  1.10s/it]


Requesting Altmetric data for handle:11591/235527 → https://api.altmetric.com/v1/handle/11591/235527
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  22%|██▏       | 17/79 [00:18<01:08,  1.10s/it]


Requesting Altmetric data for handle:11591/365500 → https://api.altmetric.com/v1/handle/11591/365500
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  23%|██▎       | 18/79 [00:19<01:07,  1.10s/it]


Requesting Altmetric data for pmid:31461927 → https://api.altmetric.com/v1/pmid/31461927
  ✔ 200 OK: Data received.
    Altmetric score: 1


Fetching Altmetric data:  24%|██▍       | 19/79 [00:20<01:06,  1.10s/it]


Requesting Altmetric data for pmid:6603090 → https://api.altmetric.com/v1/pmid/6603090
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  25%|██▌       | 20/79 [00:22<01:05,  1.10s/it]


Requesting Altmetric data for pmid:25277619 → https://api.altmetric.com/v1/pmid/25277619
  ✔ 200 OK: Data received.
    Altmetric score: 3


Fetching Altmetric data:  27%|██▋       | 21/79 [00:23<01:03,  1.10s/it]


Requesting Altmetric data for pmid:2291015 → https://api.altmetric.com/v1/pmid/2291015
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  28%|██▊       | 22/79 [00:24<01:05,  1.14s/it]


Requesting Altmetric data for handle:2108/293046 → https://api.altmetric.com/v1/handle/2108/293046
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  29%|██▉       | 23/79 [00:25<01:03,  1.13s/it]


Requesting Altmetric data for handle:11588/135169 → https://api.altmetric.com/v1/handle/11588/135169
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  30%|███       | 24/79 [00:26<01:02,  1.13s/it]


Requesting Altmetric data for handle:1871.1/b4bee7e0-d8f1-4d53-8bdd-e1a28b1e0b6d → https://api.altmetric.com/v1/handle/1871.1/b4bee7e0-d8f1-4d53-8bdd-e1a28b1e0b6d
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  32%|███▏      | 25/79 [00:27<01:01,  1.14s/it]


Requesting Altmetric data for handle:11591/227814 → https://api.altmetric.com/v1/handle/11591/227814
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  33%|███▎      | 26/79 [00:28<00:59,  1.13s/it]


Requesting Altmetric data for handle:11381/2833704 → https://api.altmetric.com/v1/handle/11381/2833704
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  34%|███▍      | 27/79 [00:30<00:58,  1.12s/it]


Requesting Altmetric data for handle:11588/356484 → https://api.altmetric.com/v1/handle/11588/356484
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  35%|███▌      | 28/79 [00:31<00:56,  1.12s/it]


Requesting Altmetric data for pmid:6103698 → https://api.altmetric.com/v1/pmid/6103698
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  37%|███▋      | 29/79 [00:32<00:55,  1.11s/it]


Requesting Altmetric data for handle:1871.1/ec3d7c98-0e74-4819-9609-4d923c6dd5a9 → https://api.altmetric.com/v1/handle/1871.1/ec3d7c98-0e74-4819-9609-4d923c6dd5a9
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  38%|███▊      | 30/79 [00:33<00:54,  1.11s/it]


Requesting Altmetric data for pmid:7938749 → https://api.altmetric.com/v1/pmid/7938749
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  39%|███▉      | 31/79 [00:34<00:53,  1.11s/it]


Requesting Altmetric data for pmid:6939959 → https://api.altmetric.com/v1/pmid/6939959
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  41%|████      | 32/79 [00:35<00:53,  1.13s/it]


Requesting Altmetric data for pmid:160180 → https://api.altmetric.com/v1/pmid/160180
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  42%|████▏     | 33/79 [00:36<00:51,  1.13s/it]


Requesting Altmetric data for pmid:7857601 → https://api.altmetric.com/v1/pmid/7857601
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  43%|████▎     | 34/79 [00:37<00:50,  1.12s/it]


Requesting Altmetric data for handle:11591/231669 → https://api.altmetric.com/v1/handle/11591/231669
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  44%|████▍     | 35/79 [00:38<00:48,  1.11s/it]


Requesting Altmetric data for handle:11591/217650 → https://api.altmetric.com/v1/handle/11591/217650
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  46%|████▌     | 36/79 [00:40<00:47,  1.11s/it]


Requesting Altmetric data for pmid:10726233 → https://api.altmetric.com/v1/pmid/10726233
  ✔ 200 OK: Data received.
    Altmetric score: 3


Fetching Altmetric data:  47%|████▋     | 37/79 [00:41<00:46,  1.11s/it]


Requesting Altmetric data for handle:11588/421561 → https://api.altmetric.com/v1/handle/11588/421561
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  48%|████▊     | 38/79 [00:42<00:45,  1.11s/it]


Requesting Altmetric data for pmid:19935540 → https://api.altmetric.com/v1/pmid/19935540
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  49%|████▉     | 39/79 [00:43<00:44,  1.11s/it]


Requesting Altmetric data for handle:11588/460489 → https://api.altmetric.com/v1/handle/11588/460489
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  51%|█████     | 40/79 [00:44<00:43,  1.11s/it]


Requesting Altmetric data for handle:11386/4582067 → https://api.altmetric.com/v1/handle/11386/4582067
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  52%|█████▏    | 41/79 [00:45<00:41,  1.10s/it]


Requesting Altmetric data for pmid:24334780 → https://api.altmetric.com/v1/pmid/24334780
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  53%|█████▎    | 42/79 [00:46<00:42,  1.15s/it]


Requesting Altmetric data for handle:20.500.11768/149682 → https://api.altmetric.com/v1/handle/20.500.11768/149682
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  54%|█████▍    | 43/79 [00:47<00:40,  1.13s/it]


Requesting Altmetric data for pmid:12063997 → https://api.altmetric.com/v1/pmid/12063997
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  56%|█████▌    | 44/79 [00:49<00:39,  1.13s/it]


Requesting Altmetric data for pmid:23515036 → https://api.altmetric.com/v1/pmid/23515036
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  57%|█████▋    | 45/79 [00:50<00:38,  1.13s/it]


Requesting Altmetric data for handle:11588/863623 → https://api.altmetric.com/v1/handle/11588/863623
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  58%|█████▊    | 46/79 [00:51<00:36,  1.12s/it]


Requesting Altmetric data for handle:11584/219628 → https://api.altmetric.com/v1/handle/11584/219628
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  59%|█████▉    | 47/79 [00:52<00:35,  1.11s/it]


Requesting Altmetric data for pmid:19845110 → https://api.altmetric.com/v1/pmid/19845110
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  61%|██████    | 48/79 [00:53<00:34,  1.11s/it]


Requesting Altmetric data for pmid:18074632 → https://api.altmetric.com/v1/pmid/18074632
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  62%|██████▏   | 49/79 [00:54<00:33,  1.11s/it]


Requesting Altmetric data for pmid:10541474 → https://api.altmetric.com/v1/pmid/10541474
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  63%|██████▎   | 50/79 [00:55<00:32,  1.11s/it]


Requesting Altmetric data for pmid:11998888 → https://api.altmetric.com/v1/pmid/11998888
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  65%|██████▍   | 51/79 [00:56<00:30,  1.10s/it]


Requesting Altmetric data for pmid:10066098 → https://api.altmetric.com/v1/pmid/10066098
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  66%|██████▌   | 52/79 [00:57<00:29,  1.10s/it]


Requesting Altmetric data for pmid:31587252 → https://api.altmetric.com/v1/pmid/31587252
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  67%|██████▋   | 53/79 [00:58<00:28,  1.10s/it]


Requesting Altmetric data for handle:11562/346885 → https://api.altmetric.com/v1/handle/11562/346885
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  68%|██████▊   | 54/79 [01:00<00:27,  1.10s/it]


Requesting Altmetric data for handle:11588/476041 → https://api.altmetric.com/v1/handle/11588/476041
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  70%|██████▉   | 55/79 [01:01<00:26,  1.10s/it]


Requesting Altmetric data for handle:11588/867678 → https://api.altmetric.com/v1/handle/11588/867678
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  71%|███████   | 56/79 [01:02<00:25,  1.10s/it]


Requesting Altmetric data for pmid:19771745 → https://api.altmetric.com/v1/pmid/19771745
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  72%|███████▏  | 57/79 [01:03<00:24,  1.10s/it]


Requesting Altmetric data for pmid:8532376 → https://api.altmetric.com/v1/pmid/8532376
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  73%|███████▎  | 58/79 [01:04<00:23,  1.10s/it]


Requesting Altmetric data for handle:11588/586221 → https://api.altmetric.com/v1/handle/11588/586221
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  75%|███████▍  | 59/79 [01:05<00:22,  1.10s/it]


Requesting Altmetric data for doi:10.26355/eurrev_202011_23627 → https://api.altmetric.com/v1/doi/10.26355/eurrev_202011_23627
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  76%|███████▌  | 60/79 [01:06<00:20,  1.10s/it]


Requesting Altmetric data for pmid:22610121 → https://api.altmetric.com/v1/pmid/22610121
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  77%|███████▋  | 61/79 [01:07<00:19,  1.10s/it]


Requesting Altmetric data for handle:11380/1142834 → https://api.altmetric.com/v1/handle/11380/1142834
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  78%|███████▊  | 62/79 [01:08<00:18,  1.10s/it]


Requesting Altmetric data for handle:11588/167247 → https://api.altmetric.com/v1/handle/11588/167247
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  80%|███████▉  | 63/79 [01:09<00:17,  1.10s/it]


Requesting Altmetric data for pmid:9177614 → https://api.altmetric.com/v1/pmid/9177614
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  81%|████████  | 64/79 [01:11<00:16,  1.10s/it]


Requesting Altmetric data for pmid:34121372 → https://api.altmetric.com/v1/pmid/34121372
  ✔ 200 OK: Data received.
    Altmetric score: 0.25


Fetching Altmetric data:  82%|████████▏ | 65/79 [01:12<00:15,  1.10s/it]


Requesting Altmetric data for pmid:24817301 → https://api.altmetric.com/v1/pmid/24817301
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  84%|████████▎ | 66/79 [01:13<00:14,  1.10s/it]


Requesting Altmetric data for handle:11588/169412 → https://api.altmetric.com/v1/handle/11588/169412
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  85%|████████▍ | 67/79 [01:14<00:13,  1.10s/it]


Requesting Altmetric data for handle:11588/949490 → https://api.altmetric.com/v1/handle/11588/949490
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  86%|████████▌ | 68/79 [01:15<00:12,  1.10s/it]


Requesting Altmetric data for pmid:12796363 → https://api.altmetric.com/v1/pmid/12796363
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  87%|████████▋ | 69/79 [01:16<00:10,  1.10s/it]


Requesting Altmetric data for pmid:33215456 → https://api.altmetric.com/v1/pmid/33215456
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  89%|████████▊ | 70/79 [01:17<00:09,  1.10s/it]


Requesting Altmetric data for doi:10.23750/abm.v88i3 → https://api.altmetric.com/v1/doi/10.23750/abm.v88i3
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  90%|████████▉ | 71/79 [01:18<00:08,  1.10s/it]


Requesting Altmetric data for handle:11588/686114 → https://api.altmetric.com/v1/handle/11588/686114
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  91%|█████████ | 72/79 [01:19<00:07,  1.10s/it]


Requesting Altmetric data for pmid:28752832 → https://api.altmetric.com/v1/pmid/28752832
  ✔ 200 OK: Data received.
    Altmetric score: 1


Fetching Altmetric data:  92%|█████████▏| 73/79 [01:20<00:06,  1.10s/it]


Requesting Altmetric data for pmid:28752831 → https://api.altmetric.com/v1/pmid/28752831
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  94%|█████████▎| 74/79 [01:22<00:05,  1.10s/it]


Requesting Altmetric data for pmid:28752828 → https://api.altmetric.com/v1/pmid/28752828
  ✔ 200 OK: Data received.
    Altmetric score: 0.25


Fetching Altmetric data:  95%|█████████▍| 75/79 [01:23<00:04,  1.10s/it]


Requesting Altmetric data for pmid:28752827 → https://api.altmetric.com/v1/pmid/28752827
  ✔ 200 OK: Data received.
    Altmetric score: 0.25


Fetching Altmetric data:  96%|█████████▌| 76/79 [01:24<00:03,  1.10s/it]


Requesting Altmetric data for pmid:28752830 → https://api.altmetric.com/v1/pmid/28752830
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  97%|█████████▋| 77/79 [01:25<00:02,  1.10s/it]


Requesting Altmetric data for pmid:28752835 → https://api.altmetric.com/v1/pmid/28752835
404 Not Found: No Altmetric details available.


Fetching Altmetric data:  99%|█████████▊| 78/79 [01:26<00:01,  1.10s/it]


Requesting Altmetric data for pmid:28752829 → https://api.altmetric.com/v1/pmid/28752829
404 Not Found: No Altmetric details available.


Fetching Altmetric data: 100%|██████████| 79/79 [01:27<00:00,  1.11s/it]


💾 Saved at 79 items | ETA remaining: 00:00:00

Altmetric enrichment completed. Total records: 8
SQL schema written to: ./data/2025-02-19/05_altmetric/aurora_altmetric_20250919_102721/04-altmetric-transformed/schema-altmetric.sql
Altmetric enrichment completed. Results stored under: ./data/2025-02-19/05_altmetric/aurora_altmetric_20250919_102721





### Step 6: Get Overton data

In [None]:
# Waiting for the Overton dump to be available

### Step 7: Get SDG classification labels

a. Query the abstracts first along with the id (to be used as primary keys, connecting the tables later on), 

b. get sdg data by parsing the abstracts with more than 100 tokens over an LLM API with system prompt, 

c. save the outcomes in a separate parquet file.

##### step 7a: Get the abstracts, including the record id and the number of tokens i nthe abstract

Number of tokens are important later on, less then 100 tokens in the abstract deliver low quality SDG classifications.

In [52]:

# Connect to an in-memory DuckDB database
con = duckdb.connect()
# Query to extract the ID, description, remove XML tags, and calculate the number of tokens in the description
description_data = con.sql(f'''
    SELECT 
        id AS record_id,
        regexp_replace(descriptions[1], '<[^>]+>', '') AS description,  -- Remove XML tags
        array_length(split(regexp_replace(descriptions[1], '<[^>]+>', ''), ' ')) AS token_count
    FROM read_parquet('{master_file}')
    WHERE descriptions IS NOT NULL AND array_length(descriptions) > 0
''').fetchdf()

# Print the resulting DataFrame
print("Descriptions with token counts:")
print(description_data)

# Save the data to a new Parquet file for later use
description_file_path = f"{processing_folder_path}/{folder_name}-descriptions-with-tokens.parquet"
description_data.to_parquet(description_file_path, index=False)
print(f"Description data saved to: {description_file_path}")
print(f"File size: {os.path.getsize(description_file_path) / (1024**2):.2f} MB")

# Close the DuckDB connection
con.close()

Descriptions with token counts:
                                            record_id  \
0      dedup_wf_002::00fa9a6806de20c970d77677b43220be   
1      dedup_wf_002::022b5a9d85b0c38cffef0fb145b3645a   
2      dedup_wf_002::04463b5f4ad9cc6e1863d76407a3e424   
3      dedup_wf_002::053440b6265dfce1da1a31deb1a6436b   
4      dedup_wf_002::062128904997f80c9ca88cec1e30d482   
...                                               ...   
31403  unidue___bib::f055fc26e50fb169b575160b700d473a   
31404  unidue___bib::f10f6cd926f9969659108b66fecd04f4   
31405  unidue___bib::f5be71256cb2f521cf37c442a38ea64e   
31406  unidue___bib::f688bcf16fc9154255868f8143a02275   
31407  unidue___bib::fb7a759f2e12b8c6cb2f7cfc98cb4ff2   

                                             description  token_count  
0      Dans quelle mesure l'accompagnement par un men...          153  
1      Etant membre de la communauté internationale, ...          197  
2      This paper examines the small but growing lite...           

##### Step 7b-1:  Aurora SDG Classifier
In this step we use the Aurora SDG classifier to classify all the abstracts.

First we set a test_mode parameter, so that the first 3 abstracts with more than 100 tokens are used. If testing mode is False, then use all abstracts with more than 100 tokens.


In [53]:
import requests
import json
import pandas as pd
import time

# Set the testing mode to True for limited processing
testing_mode = True

# define the models
model = "aurora-sdg"  # Use the multi-label model for SDG classification (faster, Aurora definition of SDG's, 104 languages)

# other available models:
# model = "aurora-sdg"  # Use the single-label model for classification of each SDG in the Aurora definition (slower, Aurora definition of SDG's, 104 languages)
# model = "elsevier-multi"  # Elsevier SDG multi-label mBERT model (fast, Elsevier definition of SDG's, 104 languages)
# model = "osdg"  # OSDG model (alternative, OSDG definition of SDG's, 15 languages)

# Set the base URL for the Aurora SDG classifier
base_url = "https://aurora-sdg.labs.vu.nl/classifier/classify/" + model

# Load the descriptions with token counts
description_df = pd.read_parquet(f"{processing_folder_path}/{folder_name}-descriptions-with-tokens.parquet")

# Filter abstracts with at least 100 tokens
description_df = description_df[description_df['token_count'] >= 100]

# Set testing mode to limit the number of abstracts
if testing_mode:
    description_df = description_df.head(10)  # Limit to 10 records for testing

# Prepare results list
sdg_results = []

# Rate limit settings
rate_limit = 5  # 5 requests per second
delay_between_requests = 1 / rate_limit

# Loop through each abstract
for idx, row in description_df.iterrows():
    record_id = row['record_id']
    abstract = row['description']

    # Prepare the payload for the API
    payload = json.dumps({"text": abstract})
    headers = {'Content-Type': 'application/json'}

    try:
        # Make the API call
        response = requests.post(base_url, headers=headers, data=payload)
        response.raise_for_status()

        # Parse the response
        result = response.json()
        predictions = result.get("predictions", [])

        # Extract SDG predictions
        sdgs = [
            {
                "goal_code": pred["sdg"]["code"],
                "goal_name": pred["sdg"]["name"],
                "prediction_score": pred["prediction"]
            }
            for pred in predictions
        ]

        # Append the result to the list
        sdg_results.append({
            "record_id": record_id,
            "abstract": abstract,
            "sdgs": sdgs
        })

        # Calculate and print the time taken to process the record
        start_time = time.time()
        print(f"Processed record_id: {record_id}, SDGs: {sdgs}")
        end_time = time.time()
        print(f"Time taken to process record_id {record_id}: {end_time - start_time:.2f} seconds")

    except requests.exceptions.RequestException as e:
        print(f"Error processing record_id {record_id}: {e}")

    # Add a delay to respect the rate limit
    time.sleep(delay_between_requests)

# Convert the results to a DataFrame
sdg_results_df = pd.DataFrame(sdg_results)

# calculate the 90th percentile of the prediction scores for each SDG
sdg_scores = []
for sdg in sdg_results_df['sdgs']:
    for prediction in sdg:
        sdg_scores.append(prediction['prediction_score'])  
# Calculate the 90th percentile
percentile_90 = pd.Series(sdg_scores).quantile(0.9)
# Filter the results and append a column top_predicted_sdgs, to include only SDGs (as list of goal_codes) with a prediction score above the 90th percentile
sdg_results_df['top_predicted_sdgs'] = sdg_results_df['sdgs'].apply(
    lambda x: [sdg['goal_code'] for sdg in x if sdg['prediction_score'] >= percentile_90 and sdg['prediction_score'] > 0.1]
)

# Print the DataFrame with SDG results
print("SDG classification results:")
print(sdg_results_df[['record_id', 'top_predicted_sdgs']])

# Save the results to a Parquet file including the top predicted SDGs
sdg_results_path = f"{processing_folder_path}/{folder_name}-sdg-results-{model}.parquet"
sdg_results_df.to_parquet(sdg_results_path, index=False)
print(f"SDG classification results saved to: {sdg_results_path}")


Processed record_id: dedup_wf_002::00fa9a6806de20c970d77677b43220be, SDGs: [{'goal_code': '1', 'goal_name': 'No poverty', 'prediction_score': 0.179525405}, {'goal_code': '2', 'goal_name': 'Zero hunger', 'prediction_score': 0.0224823952}, {'goal_code': '3', 'goal_name': 'Good health and well-being', 'prediction_score': 0.0757595897}, {'goal_code': '4', 'goal_name': 'Quality Education', 'prediction_score': 0.693393469}, {'goal_code': '5', 'goal_name': 'Gender equality', 'prediction_score': 0.740161419}, {'goal_code': '6', 'goal_name': 'Clean water and sanitation', 'prediction_score': 0.0137088597}, {'goal_code': '7', 'goal_name': 'Affordable and clean energy', 'prediction_score': 0.0231188238}, {'goal_code': '8', 'goal_name': 'Decent work and economic growth', 'prediction_score': 0.934973717}, {'goal_code': '9', 'goal_name': 'Industry, innovation and infrastructure', 'prediction_score': 0.75080055}, {'goal_code': '10', 'goal_name': 'Reduced inequalities', 'prediction_score': 0.670219898}

KeyboardInterrupt: 

##### step 7b Get the official definitions of the SDG's from https://metadata.un.org/sdg/ using the Accept header application/rdf+xml

First we get the links to the top level goals.

In [24]:
import requests

# URL for the SDG metadata
sdg_metadata_url = "https://metadata.un.org/sdg/"

# Set the headers to request RDF/XML format
headers = {
    "Accept": "application/rdf+xml"
}

# Send the GET request
response = requests.get(sdg_metadata_url, headers=headers)

# Check if the request was successful
if response.status_code == 200:
    # Save the RDF/XML content to a file
    rdf_file_path = f"{processing_folder_path}/sdg_definitions.rdf"
    with open(rdf_file_path, "wb") as rdf_file:
        rdf_file.write(response.content)
    print(f"SDG definitions saved to: {rdf_file_path}")
else:
    print(f"Failed to fetch SDG definitions. Status code: {response.status_code}")
    print(f"Response: {response.text}")

SDG definitions saved to: ./data/2025-02-19/04_processed/argo-france/sdg_definitions.rdf


In [25]:
import pandas as pd

import xml.etree.ElementTree as ET

# Parse the RDF/XML file
tree = ET.parse(rdf_file_path)
root = tree.getroot()

# Find all skos:hasTopConcept elements and extract their rdf:resource attribute
top_concept_urls = []
for elem in root.findall('.//{http://www.w3.org/2004/02/skos/core#}hasTopConcept'):
    url = elem.attrib.get('{http://www.w3.org/1999/02/22-rdf-syntax-ns#}resource')
    if url:
        top_concept_urls.append(url)

# sort the URLs based on the integer in the last part of the URL
top_concept_urls.sort(key=lambda x: int(x.split('/')[-1]))

print("Top concept URLs found in the RDF/XML:")
for url in top_concept_urls:
    print(url)



Top concept URLs found in the RDF/XML:
http://metadata.un.org/sdg/1
http://metadata.un.org/sdg/2
http://metadata.un.org/sdg/3
http://metadata.un.org/sdg/4
http://metadata.un.org/sdg/5
http://metadata.un.org/sdg/6
http://metadata.un.org/sdg/7
http://metadata.un.org/sdg/8
http://metadata.un.org/sdg/9
http://metadata.un.org/sdg/10
http://metadata.un.org/sdg/11
http://metadata.un.org/sdg/12
http://metadata.un.org/sdg/13
http://metadata.un.org/sdg/14
http://metadata.un.org/sdg/15
http://metadata.un.org/sdg/16
http://metadata.un.org/sdg/17


Next we get the goal number, goal name and goal description for each top level goal.

In [26]:
import requests
import pandas as pd

import xml.etree.ElementTree as ET

# Prepare lists to store the results
goal_codes = []
goal_names = []
goal_descriptions = []
goal_urls = []

# Loop through each top concept URL
for url in top_concept_urls:
    try:
        # Fetch the RDF/XML content
        resp = requests.get(url, headers={"Accept": "application/rdf+xml"})
        resp.raise_for_status()
        root = ET.fromstring(resp.content)
        # Find the main Description element
        desc = root.find('.//{http://www.w3.org/1999/02/22-rdf-syntax-ns#}Description')
        if desc is None:
            continue
        # Extract <skos:note xml:lang="en">Goal N</skos:note>
        goal_code = None
        for note in desc.findall('{http://www.w3.org/2004/02/skos/core#}note'):
            if note.attrib.get('{http://www.w3.org/XML/1998/namespace}lang') == 'en' and note.text and note.text.startswith('Goal'):
                goal_code = note.text.replace('Goal ', '').strip()
                break
        # Extract <skos:altLabel xml:lang="en">...</skos:altLabel>
        goal_name = None
        for alt in desc.findall('{http://www.w3.org/2004/02/skos/core#}altLabel'):
            if alt.attrib.get('{http://www.w3.org/XML/1998/namespace}lang') == 'en':
                goal_name = alt.text.strip()
                break
        # Extract <skos:prefLabel xml:lang="en">...</skos:prefLabel>
        goal_description = None
        for pref in desc.findall('{http://www.w3.org/2004/02/skos/core#}prefLabel'):
            if pref.attrib.get('{http://www.w3.org/XML/1998/namespace}lang') == 'en':
                goal_description = pref.text.strip()
                break
        # Store results
        goal_codes.append(goal_code)
        goal_names.append(goal_name)
        goal_descriptions.append(goal_description)
        goal_urls.append(url)
    except Exception as e:
        print(f"Error processing {url}: {e}")

# Create DataFrame
df_sdg_goals = pd.DataFrame({
    "goal_code": goal_codes,
    "goal_name": goal_names,
    "goal_description": goal_descriptions,
    "goal_url": goal_urls
})

print(df_sdg_goals)

# Save the DataFrame to a CSV file
sdg_goals_csv_path = f"{processing_folder_path}/sdg_goals.csv"
df_sdg_goals.to_csv(sdg_goals_csv_path, index=False)
print(f"SDG goals saved to: {sdg_goals_csv_path}")

   goal_code                                goal_name  \
0          1                               No poverty   
1          2                              Zero hunger   
2          3               Good health and well-being   
3          4                        Quality education   
4          5                          Gender equality   
5          6               Clean water and sanitation   
6          7              Affordable and clean energy   
7          8          Decent work and economic growth   
8          9  Industry, innovation and infrastructure   
9         10                     Reduced inequalities   
10        11       Sustainable cities and communities   
11        12   Responsible consumption and production   
12        13                           Climate action   
13        14                         Life below water   
14        15                             Life on land   
15        16   Peace, justice and strong institutions   
16        17               Part

##### Step 7c Here we prepare the System and User prompts to be used by an LLM.

In [27]:
# Define the text to classify
text = """
The United Nations Sustainable Development Goals (SDGs) are a universal call to action to end poverty, protect the planet, and ensure prosperity for all by 2030. They address global challenges such as inequality, climate change, environmental degradation, peace, and justice. The SDGs consist of 17 goals and 169 targets that aim to achieve a better and more sustainable future for all.
"""
# Print the text to classify
print("Text to classify:")
print(text)

# Define the expected output format, now including an explanation field
example_output_format = """
{
    "sdgs": [2, 6, 17],
    "explanation": "This text is related to SDG 2 (Zero hunger) because it discusses food security, SDG 6 (Clean water and sanitation) due to references to environmental protection, and SDG 17 (Partnerships for the goals) as it mentions global cooperation."
}
"""

# Print the example output format
print("Example Output Format:")
print(example_output_format)

# system_prompt
# Build SDG goal info string from df_sdg_goals
sdg_goal_info = "\n".join(
    f"{row.goal_code}: {row.goal_name} - {row.goal_description}"
    for _, row in df_sdg_goals.iterrows()
)

sdg_system_prompt = f"""
You are an intelligent multi-label classification system designed to map texts to their relevant Sustainable Development Goals.
Take the text delimited by triple quotation marks and return a JSON list of relevant SDGs. 
Example output format: {example_output_format}

Here are the SDG goals and their descriptions:
{sdg_goal_info}

"""
# Print the system prompt
print("System Prompt:")
print(sdg_system_prompt)
# user_prompt
sdg_user_prompt = f"""
"Classify the following text in terms of its relevance to the Sustainable Development Goals:",
Text: '''{text}'''
"""
# Print the user prompt
print("User Prompt:")
print(sdg_user_prompt)


Text to classify:

The United Nations Sustainable Development Goals (SDGs) are a universal call to action to end poverty, protect the planet, and ensure prosperity for all by 2030. They address global challenges such as inequality, climate change, environmental degradation, peace, and justice. The SDGs consist of 17 goals and 169 targets that aim to achieve a better and more sustainable future for all.

Example Output Format:

{
    "sdgs": [2, 6, 17],
    "explanation": "This text is related to SDG 2 (Zero hunger) because it discusses food security, SDG 6 (Clean water and sanitation) due to references to environmental protection, and SDG 17 (Partnerships for the goals) as it mentions global cooperation."
}

System Prompt:

You are an intelligent multi-label classification system designed to map texts to their relevant Sustainable Development Goals.
Take the text delimited by triple quotation marks and return a JSON list of relevant SDGs. 
Example output format: 
{
    "sdgs": [2, 6, 1

##### Step 7d: Get the LLM API prepared

In [28]:
# OpenWebUI API configuration
openwebui_base_url = "https://nebula.cs.vu.nl"  # Replace with your actual OpenWebUI API base URL
openwebui_api_key = "sk-5b5a024888c14a019c0e9b4857df9329"  # Replace with your actual API key

first get the models

In [29]:
# This script fetches the list of available models from the OpenWebUI API
# and prints their IDs, names, and parameter sizes.

import requests

# Use the existing variables openwebui_base_url and openwebui_api_key

headers = {
    "Authorization": f"Bearer {openwebui_api_key}"
}

# Ensure the base URL does not end with a slash
api_url = openwebui_base_url.rstrip('/') + "/api/models"

# print the request in curl
print(f"curl -X GET '{api_url}' -H 'Authorization: Bearer {openwebui_api_key}'")

response = requests.get(api_url, headers=headers)

if response.status_code == 200:
    models_json = response.json()
    models = models_json.get("data", [])
    print("Available models:")
    for model in models:
        print(f"- id: {model.get('id')}, name: {model.get('name')}, parameter_size: {model.get('ollama', {}).get('details', {}).get('parameter_size')}")
else:
    print(f"Failed to fetch models. Status code: {response.status_code}")
    print(f"Response: {response.text}")




curl -X GET 'https://nebula.cs.vu.nl/api/models' -H 'Authorization: Bearer sk-5b5a024888c14a019c0e9b4857df9329'


KeyboardInterrupt: 

Select the model to use, when no model is chosen, deepseek-r1:1.5b will be the default (faser & cheaper)

In [None]:
import signal

# Select the model to use, when no model is chosen, llama3.1:8b will be the default
model = "llama3.1:8b"  # Replace with your actual model name

def timeout_handler(signum, frame):
    raise TimeoutError

print("Available models:")
for i, m in enumerate(models):
    print(f"{i}: {m['id']}")

print("Select the model index to use (default: 2, llama3.1:8b) [timeout 10s]:")
signal.signal(signal.SIGALRM, timeout_handler)
signal.alarm(10)
try:
    user_input = input()
    if user_input.strip().isdigit():
        selected_model_index = int(user_input.strip())
        if 0 <= selected_model_index < len(models):
            model = models[selected_model_index]['id']
        else:
            print("Invalid index, using default model.")
            model = "llama3.1:8b"
    else:
        print("No valid input, using default model.")
        model = "llama3.1:8b"
except TimeoutError:
    print("No response received. Using default model.")
    model = "llama3.1:8b"
finally:
    signal.alarm(0)

print(f"Model selected: {model}")


Available models:


NameError: name 'models' is not defined

Finally, for each abstract, run the system and user prompt

In [None]:
import requests
import pandas as pd
import time

# Add a testing method to limit the number of abstracts
if testing_mode:
    # Load only the first 3 abstracts for testing
    description_df = pd.read_parquet(f"{processing_folder_path}/{folder_name}-descriptions-with-tokens.parquet").head(3)
else:
    # Load all abstracts for production
    description_df = pd.read_parquet(f"{processing_folder_path}/{folder_name}-descriptions-with-tokens.parquet")                                                             

# Filter abstracts with at least 100 tokens
description_df = description_df[description_df['token_count'] >= 100]

# Prepare results list
sdg_results = []

# Loop through each abstract
for idx, row in description_df.iterrows():
    record_id = row['record_id']
    abstract = row['description']

    # Prepare the messages for the API
    messages = [
        {"role": "system", "content": sdg_system_prompt},
        {"role": "user", "content": f"Classify the following text in terms of its relevance to the Sustainable Development Goals:\nText: '''{abstract}'''"}
    ]

    data = {
        "model": model,
        "messages": messages
    }

    # Print the data variable for debugging
    print(f"Data for record_id {record_id}: {data}")

    # Make the API call
    response = requests.post(
        openwebui_base_url.rstrip('/') + "/api/chat/completions",
        headers={"Authorization": f"Bearer {openwebui_api_key}", "Content-Type": "application/json"},
        json=data
    )

    # Check if the request was successful
    if response.status_code != 200:
        print(f"Error processing record_id {record_id}: {response.status_code} - {response.text}")
        continue

    # Print the response for debugging
    print(f"Response for record_id {record_id}: {response.json()}")
    
    # Parse the response
    try:
        result = response.json()
        # Try to extract the SDG list from the response
        content = result['choices'][0]['message']['content']
        # Try to parse the JSON from the model output
        try:
            sdg_json = eval(content) if isinstance(content, str) else content
            sdgs = sdg_json.get("sdgs", [])
            explanation = sdg_json.get("explanation", "")
        except Exception:
            sdgs = []
            explanation = ""
    except Exception:
        sdgs = []
        explanation = ""

    # Append to results, including the explanation if available
    sdg_results.append({
        "record_id": record_id,
        "abstract": abstract,
        "sdgs": sdgs,
        "explanation": explanation
    })

    # Optional: print progress
    print(f"Processed record_id: {record_id}, SDGs: {sdgs}")

    # Optional: delay to avoid rate limits
    time.sleep(1)

# Print the number of results
print(f"Number of SDG results collected: {len(sdg_results)}")

# Make the value of the model variable suitable for using in the file names
model_filename = model.replace(":", "-").replace(" ", "_")

# Save results to parquet
sdg_results_df = pd.DataFrame(sdg_results)
sdg_results_path = f"{processing_folder_path}/{folder_name}-sdg-results-{model_filename}.parquet"
sdg_results_df.to_parquet(sdg_results_path, index=False)
print(f"SDG LLM results saved to: {sdg_results_path}")
print(f"File size: {os.path.getsize(sdg_results_path) / (1024**2):.2f} MB")

Data for record_id doi_dedup___::9e973d60bf13b4e8b28c199e27dea4ef: {'model': 'llama3.1:8b', 'messages': [{'role': 'system', 'content': '\nYou are an intelligent multi-label classification system designed to map texts to their relevant Sustainable Development Goals.\nTake the text delimited by triple quotation marks and return a JSON list of relevant SDGs. \nExample output format: \n{\n    "sdgs": [2, 6, 17],\n    "explanation": "This text is related to SDG 2 (Zero hunger) because it discusses food security, SDG 6 (Clean water and sanitation) due to references to environmental protection, and SDG 17 (Partnerships for the goals) as it mentions global cooperation."\n}\n\n\nHere are the SDG goals and their descriptions:\n1: No poverty - End poverty in all its forms everywhere\n2: Zero hunger - End hunger, achieve food security and improved nutrition and promote sustainable agriculture\n3: Good health and well-being - Ensure healthy lives and promote well-being for all at all ages\n4: Quali

KeyboardInterrupt: 

### Step 8: Get Genderize data
a. First Query the authors with country of the affiliation along with the record id (to be used as primary keys, connecting the tables later on), 

b. get gender data by parsing the author names with country label over an API, 

c. save the outcomes in a separate parquet file.

In [51]:
import os
import pandas as pd
import regex as re
import unicodedata
import requests
import time
from datetime import datetime
import duckdb

# --- Settings ---
paid_subscription = False 
genderize_api_key = "da1a264b9bab63b46f27ac635dd7d2df"

# --- Paths ---
processing_folder_path = f"./data/{publication_date}/04_processed/{folder_name}"
combined_file_path = f"{processing_folder_path}/{folder_name}-combined-id-pid.parquet"

# Output folder
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
gender_output_folder = f"./data/{publication_date}/05_gender/{folder_name}_gender_{timestamp}"
os.makedirs(gender_output_folder, exist_ok=True)

# --- DuckDB query to extract authors ---
master_file = f"./data/{publication_date}/03_transformed/{folder_name}/{folder_name}-master.parquet"

con = duckdb.connect()
authors = con.sql(f"""
    SELECT 
        id AS record_id,
        unnest.fullName AS full_name,
        unnest.name AS first_name,
        unnest.surname AS last_name,
        unnest.pid.id.value AS orcid,
        countries[1].label AS country_name,
        countries[1].code AS country_code
    FROM read_parquet('{master_file}')
    CROSS JOIN UNNEST(authors) AS unnest
    WHERE countries IS NOT NULL AND array_length(countries) > 0
""").fetchdf()
con.close()

# Save raw authors data
authors_file_path = os.path.join(processing_folder_path, f"{folder_name}-authors.parquet")
authors.to_parquet(authors_file_path, index=False)

# --- Prepare unique authors ---
unique_authors = authors[['first_name', 'country_code']].copy()
unique_authors['first_name'] = unique_authors['first_name'].str.split().str[0]
unique_authors = unique_authors.dropna(subset=['first_name'])
unique_authors = unique_authors[unique_authors['first_name'].str.lower() != 'none']

# --- Advanced cleaning ---
JUNK_TOKENS = {'-', 'prof', 'prof.', 'professore', 'professor', 'dr', 'dr.', 'none'}
def is_garbage(name):
    if not isinstance(name, str):
        return True
    name_lower = name.strip().lower()
    return '.' in name or name_lower in JUNK_TOKENS or len(name.strip()) <= 1

unique_authors = unique_authors[~unique_authors['first_name'].apply(is_garbage)]

def clean_symbols(name: str):
    if not isinstance(name, str):
        return None
    name = unicodedata.normalize("NFKC", name)
    name = re.sub(r'\p{C}+', '', name)
    name = re.sub(r"^[\"'()\[\]{}<>]+|[\"'()\[\]{}<>]+$", "", name)
    name = re.sub(r"[^ \p{L}-]", "", name)
    name = re.sub(r"\s+", " ", name).strip()
    if name.startswith("-"):
        name = name.lstrip("-").strip()
    if re.fullmatch(r"(?:[A-Z]-)+[A-Z]", name, flags=re.I):
        return None
    return name.capitalize() if name else None

unique_authors['first_name'] = unique_authors['first_name'].apply(clean_symbols)
unique_authors = unique_authors.dropna(subset=['first_name'])
unique_authors = unique_authors[unique_authors['first_name'].str.len() > 1]

# --- Keep only unique first_name + country_code combinations ---
unique_combinations = unique_authors.drop_duplicates(subset=['first_name', 'country_code'])
print(f"Number of unique first_name+country combinations: {len(unique_combinations)}")

# --- Genderize API settings ---
rate_limit = 1000 if paid_subscription else 100
delay_between_requests = 0.5
base_url = "https://api.genderize.io"
request_count = 0
gender_results = []

# --- Iterate over unique name-country combinations ---
for _, row in unique_combinations.iterrows():
    if request_count >= rate_limit:
        print("Rate limit reached. Stopping for the day.")
        break

    first_name = row['first_name']
    country_code = row['country_code']

    params = {"name": first_name, "country_id": country_code}
    if paid_subscription:
        params["apikey"] = genderize_api_key

    try:
        response = requests.get(base_url, params=params)
        response.raise_for_status()
        data = response.json()
        gender_results.append({
            "first_name": first_name,
            "country_code": country_code,
            "gender": data.get("gender"),
            "probability": data.get("probability"),
            "count": data.get("count")
        })
        request_count += 1
        print(f"Processed: {first_name} ({country_code}) - Gender: {data.get('gender')}")
        time.sleep(delay_between_requests)
    except requests.exceptions.RequestException as e:
        print(f"Error processing {first_name} ({country_code}): {e}")
        request_count += 1
        time.sleep(1)

# --- Save results ---
gender_df = pd.DataFrame(gender_results)
gender_file_path = os.path.join(gender_output_folder, f"{folder_name}-gender-data.parquet")
gender_df.to_parquet(gender_file_path, index=False)
print(f"Gender data saved to: {gender_file_path}")


Number of unique first_name+country combinations: 49466
Processed: Cécile (FR) - Gender: female
Processed: Gabriele (IT) - Gender: male
Processed: Renata (IT) - Gender: female
Processed: Antonio (IT) - Gender: male
Processed: Aiman (FR) - Gender: male
Processed: Fabio (IT) - Gender: male
Processed: Emanuela (IT) - Gender: female


KeyboardInterrupt: 

### Step 9: Get Citizen Science classification labels

a. Query the abstracts first along with the id (to be used as primary keys, connecting the tables later on), 

b. get citizen science labels by parsing the abstract over an LLM API with system prompt, 

c. save the outcomes in a separate parquet file.

#### Step 10: Generate SQL schemas for all the parquet files in the processed folder.

In [None]:
import os
import duckdb

# List all .parquet files in the processing folder
parquet_dir = os.path.join(processing_folder_path)
parquet_files = [
    f for f in os.listdir(parquet_dir)
    if f.endswith('.parquet')
]

for parquet_file in parquet_files:
    parquet_path = os.path.join(processing_folder_path, parquet_file)
    schema_file_name = os.path.splitext(parquet_file)[0] + '.sql'
    schema_file_path = os.path.join(processing_folder_path, schema_file_name)
    
    print(f"Generating schema for: {parquet_path}")
    print(f"Schema file path: {schema_file_path}")

    duckdb.sql(f'''
        COPY (
            SELECT *
            FROM (DESCRIBE '{parquet_path}')
        )
        TO '{schema_file_path}'
    ''')

    if os.path.exists(schema_file_path):
        print(f"Schema file exists: {schema_file_path}")
        with open(schema_file_path, 'r') as schema_file:
            schema_content = schema_file.read()
            print("Schema file content:")
            print(schema_content)
    else:
        print(f"Schema file does not exist: {schema_file_path}")

Generating schema for: ./data/2025-02-19/04_processed/argo-france/argo-france-combined-id-pid.parquet
Schema file path: ./data/2025-02-19/04_processed/argo-france/argo-france-combined-id-pid.sql
Schema file exists: ./data/2025-02-19/04_processed/argo-france/argo-france-combined-id-pid.sql
Schema file content:
column_name,column_type,null,key,default,extra
record_id,VARCHAR,YES,,,
pid_scheme,VARCHAR,YES,,,
pid_value,VARCHAR,YES,,,
combined_id_pid,VARCHAR,YES,,,

Generating schema for: ./data/2025-02-19/04_processed/argo-france/argo-france-descriptions-with-tokens.parquet
Schema file path: ./data/2025-02-19/04_processed/argo-france/argo-france-descriptions-with-tokens.sql
Schema file exists: ./data/2025-02-19/04_processed/argo-france/argo-france-descriptions-with-tokens.sql
Schema file content:
column_name,column_type,null,key,default,extra
record_id,VARCHAR,YES,,,
description,VARCHAR,YES,,,
token_count,BIGINT,YES,,,

Generating schema for: ./data/2025-02-19/04_processed/argo-france/argo

InvalidInputException: Invalid Input Error: Failed to read Parquet file './data/2025-02-19/04_processed/argo-france/argo-france-sdg-llm-results.parquet': Need at least one non-root column in the file