# OpenAIRE Community Data Dump Handling: Extraction, Tranformation and Enrichment

In this notebook we will start with one of the [OpenAIRE Community Subgraphs](https://graph.openaire.eu/docs/downloads/subgraphs) to enrich that informatino for further analysis.

This data process will extract an [OpenAIRE community data dump from Zenodo](https://doi.org/10.5281/zenodo.3974604), transforms it in to a portable file format .parquet (and updatable with changes for time seires analysis), that can be used to query with DuckDB, to enrich this with additional data  (also in .parquet, for join queries).

This additional data can be societal impact data from [Altmetric.com](https://details-page-api-docs.altmetric.com/) or [Overton.io](https://app.overton.io/swagger.php), Gender data using [genderize.io](https://genderize.io/documentation), sdg classification using [aurora-sdg](https://aurora-universities.eu/sdg-research/sdg-api/)

This script needs to be written in a way so that it can run every month using  the latest data.

## Processing steps

* the folder ./data/ is put in .gitignore to prevent that bulk datais sent to a code repository. So make sure that folder exists, and mkdir if not exists. 
* The script downloads the lastest Data Dump Tar file from one selected community. See https://doi.org/10.5281/zenodo.3974604 for the latest list. In our case the Aurora tar file. https://zenodo.org/records/14887484/files/aurora.tar?download=1
  * Use the json record of zenodo to get to the latest record, and fetch the download link of the aurora.tar file. for example : https://zenodo.org/records/14887484/export/json or https://zenodo.org/api/records/14887484/versions/latest 
  Make the tar filename a variable, so it can be used for multiple community dumps.
  Download the tar file in a target folder ./data/{filename+timestamp}/ where a subfolder is created using the filename and the timestamp. Make this also as a  variable to use later on.
* Extract the tar file, to the compressed .json.gz files and put these in target folder ./data/{filename+timestamp}/01-extracted/
* Transform the compressed .json.gz files into a single .parquet file in target folder ./data/{filename+timestamp}/02-transformed/
Use instructions in sections "Processing JSON files with DuckDB" and "Full dataset, bit by bit" and "Splitting and Processing JSON Files in Batches" https://github.com/mosart/OpenAIRE-tools/blob/main/duckdb-querying.ipynb to start with. (be aware of error messages, and fix the issues to get all the data in)
* Extract the SQL schema (schema-datadump.sql) from the .parquet file and put it in target folder ./data/{filename+timestamp}/02-transformed/ This is needed for further processing of the records with DuckDB later on.
Use instructions in section "Extracting Schema from Parquet File" https://github.com/mosart/OpenAIRE-tools/blob/main/duckdb-querying.ipynb to start with.
* Query to get all identifiers: openaire id, doi, isbn, hdl, etc.
* **Get Altmetric data:**
* Extract the Altmetric data using the Identifiers. put that in target folder ./data/{filename+timestamp}/03-altmetric-extracted/
* Transform the Altmetric data to a single .parquet file, with the identifiers. put that in target folder ./data/{filename+timestamp}/04-altmetric-transformed/ This way duckDB can make a join when querying over multiple parquet files.
* Extract the SQL schema (schema-altmetric.sql) from the .parquet file and put it in target folder ./data/{filename+timestamp}/04-altmetric-transformed/
* **Get Overton data:** Repeat the altmetric steps, bun than for Overton.
* **Get Gender data** query for the Author names and country codes, and run them over the gerderize api
* **Get SDG data** query for the abstracts, and run abstracs larger than 100 tokens over the aurora-SDG api.

ss

## Step 1 : Get the latest Community Dump File

In [1]:
import requests
import json

# Fetch the JSON data from the URL
url = "https://zenodo.org/api/records/14887484/versions/latest"
response = requests.get(url)
data = response.json()

# Extract the files information
files = data.get("files", [])

# Create a list of dictionaries for the .tar files
tar_files = []
for file in files:
    if file["key"].endswith(".tar"):
        tar_files.append({
            "filename": file["key"],
            "size": f"{file['size'] / (1024**3):.2f} GB",  # Convert bytes to GB
            "downloadlink": file["links"]["self"],
            "checksum": file["checksum"]
        })

# print the tar files
# If no tar files found, print a message
if not tar_files:
    print("No .tar files found in the dataset.")
else:
    print(f"Found {len(tar_files)} .tar files in the dataset.")
    print("Details of .tar files:")
    print(tar_files)

# get and print the publication date
publication_date = data.get("metadata", {}).get("publication_date", "Unknown")
print(f"Publication date: {publication_date}")
# get and print the DOI
doi = data.get("doi", "Unknown")
print(f"DOI: {doi}")
# get and print the title
title = data.get("title", "Unknown")
print(f"Title: {title}")

Found 37 .tar files in the dataset.
Details of .tar files:
[{'filename': 'energy-planning_1.tar', 'size': '6.99 GB', 'downloadlink': 'https://zenodo.org/api/records/14887484/files/energy-planning_1.tar/content', 'checksum': 'md5:0a2f551db46a9e629bb1d0a0098ae5cd'}, {'filename': 'edih-adria_1.tar', 'size': '5.86 GB', 'downloadlink': 'https://zenodo.org/api/records/14887484/files/edih-adria_1.tar/content', 'checksum': 'md5:23559bed5a9023398b431777bdc8a126'}, {'filename': 'uarctic_1.tar', 'size': '9.75 GB', 'downloadlink': 'https://zenodo.org/api/records/14887484/files/uarctic_1.tar/content', 'checksum': 'md5:302e3844ebd041c5f4ed94505eb9a285'}, {'filename': 'netherlands_1.tar', 'size': '3.91 GB', 'downloadlink': 'https://zenodo.org/api/records/14887484/files/netherlands_1.tar/content', 'checksum': 'md5:d1416c058b3961483aac340750ea8726'}, {'filename': 'knowmad_1.tar', 'size': '10.08 GB', 'downloadlink': 'https://zenodo.org/api/records/14887484/files/knowmad_1.tar/content', 'checksum': 'md5:

In [2]:
# Create a DataFrame to hold the tar files information for later use.

import pandas as pd

# Convert the list of dictionaries to a DataFrame
df_tar_files = pd.DataFrame(tar_files)

# Sort the DataFrame by filename alphabetically
df_tar_files = df_tar_files.sort_values(by='filename')

# Print the DataFrame
print(df_tar_files)

                      filename      size  \
5              argo-france.tar   0.00 GB   
8                   aurora.tar   1.73 GB   
22                  beopen.tar   0.20 GB   
6                   civica.tar   0.23 GB   
7                 covid-19.tar   2.03 GB   
23                  dariah.tar   0.02 GB   
9                    dh-ch.tar   1.16 GB   
11                     dth.tar   0.01 GB   
1             edih-adria_1.tar   5.86 GB   
12                  egrise.tar   0.02 GB   
25               elixir-gr.tar   0.01 GB   
0        energy-planning_1.tar   6.99 GB   
27                enermaps.tar   1.59 GB   
24              eu-conexus.tar   0.18 GB   
26                     eut.tar   0.21 GB   
15                 eutopia.tar   1.60 GB   
28                 forthem.tar   0.91 GB   
10        heritage-science.tar   0.03 GB   
29                   inria.tar   0.27 GB   
14               iperionhs.tar   0.00 GB   
4                knowmad_1.tar  10.08 GB   
19               knowmad_2.tar  

In [3]:
# Print a reindexed list of available tar files
print("Available tar files:")
print(df_tar_files[['filename', 'size']].reset_index())

import signal

# Function to handle timeout
def timeout_handler(signum, frame):
    raise TimeoutError

# Set the timeout handler for the input
signal.signal(signal.SIGALRM, timeout_handler)
signal.alarm(10)  # Set the timeout to 10 seconds

try:
    # Ask the user to select a tar file by its index
    selected_index = int(input("Enter the index of the tar file you want to download: "))
except TimeoutError:
    print("No response received. Defaulting to index 1.")
    selected_index = 1
finally:
    signal.alarm(0)  # Disable the alarm

# Get the selected tar file's download link and checksum
selected_file = df_tar_files.iloc[selected_index]
downloadlink = selected_file['downloadlink']
checksum = selected_file['checksum']

print(f"Selected file: {selected_file['filename']}")
print(f"Download link: {downloadlink}")
print(f"Checksum: {checksum}")

Available tar files:
    index                    filename      size
0       5             argo-france.tar   0.00 GB
1       8                  aurora.tar   1.73 GB
2      22                  beopen.tar   0.20 GB
3       6                  civica.tar   0.23 GB
4       7                covid-19.tar   2.03 GB
5      23                  dariah.tar   0.02 GB
6       9                   dh-ch.tar   1.16 GB
7      11                     dth.tar   0.01 GB
8       1            edih-adria_1.tar   5.86 GB
9      12                  egrise.tar   0.02 GB
10     25               elixir-gr.tar   0.01 GB
11      0       energy-planning_1.tar   6.99 GB
12     27                enermaps.tar   1.59 GB
13     24              eu-conexus.tar   0.18 GB
14     26                     eut.tar   0.21 GB
15     15                 eutopia.tar   1.60 GB
16     28                 forthem.tar   0.91 GB
17     10        heritage-science.tar   0.03 GB
18     29                   inria.tar   0.27 GB
19     14          

In [4]:
# Path Variables

# Extract the file name from the selected file
file_name = selected_file['filename']    

# Path to save the downloaded tar file using file_name variable
download_path = f"./data/{publication_date}/01_input/{file_name}"

# Create the folder name by removing the .tar extension
folder_name = selected_file['filename'].replace('.tar', '')

# Path to save the extracted files using the file_name variable without the .tar extension
extraction_path = f"./data/{publication_date}/02_extracted/{folder_name}"


print(f"File Name: {file_name}")
print(f"Download Path File: {download_path}")
print(f"Folder Name: {folder_name}")
print(f"Extraction Path Folder: {extraction_path}")

File Name: argo-france.tar
Download Path File: ./data/2025-02-19/01_input/argo-france.tar
Folder Name: argo-france
Extraction Path Folder: ./data/2025-02-19/02_extracted/argo-france


### Download the tar file

In [5]:
import os

# Ensure the directory for the download path exists
os.makedirs(os.path.dirname(download_path), exist_ok=True)

# Check if the file already exists
if not os.path.exists(download_path):
    # Get the file size in bytes
    file_size_bytes = float(selected_file['size'].split()[0]) * (1024**3)  # Convert GB to bytes
    print(f"Downloading file: {selected_file['filename']} ({selected_file['size']})")
    print(f"Download URL: {downloadlink}")
    
    # Estimate download duration assuming an average speed of 10 MB/s
    avg_speed = 10 * (1024**2)  # 10 MB/s in bytes
    estimated_duration = file_size_bytes / avg_speed
    print(f"Estimated download time: {estimated_duration:.2f} seconds")
    
    # Download the selected tar file
    response = requests.get(downloadlink, stream=True)
    with open(download_path, 'wb') as file:
        for chunk in response.iter_content(chunk_size=8192):
            file.write(chunk)
else:
    print(f"File already exists: {download_path}")
    print(f"Download URL: {downloadlink}")

print(f"Download complete: {download_path}")


File already exists: ./data/2025-02-19/01_input/argo-france.tar
Download URL: https://zenodo.org/api/records/14887484/files/argo-france.tar/content
Download complete: ./data/2025-02-19/01_input/argo-france.tar


In [6]:
import hashlib

# Function to calculate the checksum of a file
def calculate_checksum(file_path, algorithm):
    hash_func = hashlib.new(algorithm)
    with open(file_path, 'rb') as f:
        while chunk := f.read(8192):
            hash_func.update(chunk)
    return hash_func.hexdigest()

# Extract the checksum algorithm and value
checksum_parts = checksum.split(':', 1)
checksum_algorithm = checksum_parts[0]
expected_checksum = checksum_parts[1]

# Calculate the checksum of the downloaded file
calculated_checksum = calculate_checksum(download_path, algorithm=checksum_algorithm)

# Compare the calculated checksum with the provided checksum
if calculated_checksum == expected_checksum:
    print("Checksum verification passed.")
else:
    print("Checksum verification failed.")
    print(f"Expected: {expected_checksum}")
    print(f"Calculated: {calculated_checksum}")

Checksum verification passed.


## Step 2: Extract the tar file

In [6]:
import os
import tarfile

In [7]:

# Check if the extraction directory already exists and contains files
if os.path.exists(extraction_path) and os.listdir(extraction_path):
    print("The tar file has already been extracted.")
else:
    # Create the directory if it doesn't exist
    os.makedirs(extraction_path, exist_ok=True)

    # Extract the tar file in the parent directory of the extraction_path - because the tar file contains a folder structure repeating the name of the tar file
    print(f"Extracting {download_path} to {extraction_path}...")
    parent_extraction_path = os.path.dirname(extraction_path)
    with tarfile.open(download_path, 'r') as tar:
        tar.extractall(path=parent_extraction_path)

    print("Extraction complete.")
    print(f"Files extracted to: {extraction_path}")
    

The tar file has already been extracted.


In [8]:
# List the extracted files
extracted_files = os.listdir(extraction_path)

# add the path to the extracted files
extracted_files_with_path = [os.path.join(extraction_path, file) for file in extracted_files]

# count the number of files in the extracted folder
num_files = len(extracted_files)
print(f"Number of files: {num_files}")

# print the first 5 files
print("First 5 files:")
for file in extracted_files[:5]:
    print(file) 

# make a DataFrame for the extracted files
df_extracted_files = pd.DataFrame(extracted_files, columns=['filename'])
# Sort the DataFrame by filename alphabetically
df_extracted_files = df_extracted_files.sort_values(by='filename')
# Print the DataFrame
print(df_extracted_files)

# print the dimensions of the DataFrame
print(f"DataFrame dimensions: {df_extracted_files.shape}")

# print a random 5 files, to be used for testing, and use in a variable for later use
import random
random_files = random.sample(extracted_files, 5)
random_files_with_path = [os.path.join(extraction_path, file) for file in random_files]
print("Randomly selected files with full paths for testing:")
for file in random_files_with_path:
    print(file)

# one random file for later use
random_file = random.choice(extracted_files)
print(f"Random file selected for later use: {random_file}")
# Define the path to the random file
random_file_path = os.path.join(extraction_path, random_file)
print(f"Path to the random file: {random_file_path}")
# Check if the random file exists
if os.path.exists(random_file_path):
    print(f"The random file exists: {random_file_path}")
else:
    print(f"The random file does not exist: {random_file_path}")




Number of files: 285
First 5 files:
part-00000-2c0de614-bb18-4931-bd6a-64f101a27baf-c000.json.gz
part-00001-2c0de614-bb18-4931-bd6a-64f101a27baf-c000.json.gz
part-00002-2c0de614-bb18-4931-bd6a-64f101a27baf-c000.json.gz
part-00003-2c0de614-bb18-4931-bd6a-64f101a27baf-c000.json.gz
part-00004-2c0de614-bb18-4931-bd6a-64f101a27baf-c000.json.gz
                                              filename
0    part-00000-2c0de614-bb18-4931-bd6a-64f101a27ba...
1    part-00001-2c0de614-bb18-4931-bd6a-64f101a27ba...
2    part-00002-2c0de614-bb18-4931-bd6a-64f101a27ba...
3    part-00003-2c0de614-bb18-4931-bd6a-64f101a27ba...
4    part-00004-2c0de614-bb18-4931-bd6a-64f101a27ba...
..                                                 ...
280  part-00580-2c0de614-bb18-4931-bd6a-64f101a27ba...
281  part-00583-2c0de614-bb18-4931-bd6a-64f101a27ba...
282  part-00592-2c0de614-bb18-4931-bd6a-64f101a27ba...
283  part-00618-2c0de614-bb18-4931-bd6a-64f101a27ba...
284  part-00736-2c0de614-bb18-4931-bd6a-64f101a27ba...

## Step 3: Get a data sample to generate parquetfile and the SQL schema
We do this before we process the bulk of the data.

In [9]:
import duckdb

transformation_folder_path = f"./data/{publication_date}/03_transformed/{folder_name}"

# Ensure the target directory exists
os.makedirs(transformation_folder_path, exist_ok=True)

# for testing: Define and print the target output sample file path
sample_file = f"{transformation_folder_path}/{folder_name}-sample.parquet"
print(f"Output file path: {sample_file}")

# for testing: define and print the target output sample file for the multiple selected random sample files
multiple_sample_file = f"{transformation_folder_path}/{folder_name}-multiple-sample.parquet"
print(f"Multiple sample file path: {multiple_sample_file}")

# for production: define and print the target output master file for all extracted files
master_file = f"{transformation_folder_path}/{folder_name}-master.parquet"
print(f"Master file path: {master_file}")


Output file path: ./data/2025-02-19/03_transformed/argo-france/argo-france-sample.parquet
Multiple sample file path: ./data/2025-02-19/03_transformed/argo-france/argo-france-multiple-sample.parquet
Master file path: ./data/2025-02-19/03_transformed/argo-france/argo-france-master.parquet


#### for testing: this part is for running on a single sample

In [11]:
# Connect to an in-memory DuckDB database
con = duckdb.connect()

# Use DuckDB to process the extracted JSON files and save a sample of all rows as a Parquet file
con.sql(f'''
    COPY (
        
        FROM (
            SELECT *
            FROM read_json('{random_file_path}', sample_size=-1, union_by_name=true)
        )
    )
    TO '{sample_file}' (FORMAT parquet, COMPRESSION gzip)
''')

print(f"Transformed data saved to: {sample_file}")
print(f"File size: {os.path.getsize(sample_file) / (1024**2):.2f} MB")

# Close the DuckDB connection
con.close()

Transformed data saved to: ./data/2025-02-19/03_transformed/argo-france/argo-france-sample.parquet
File size: 0.06 MB


In [12]:
# schema path
schema_file_path = f"{transformation_folder_path}/{folder_name}-schema.sql"

#print the schema file path
print(f"Schema file path: {schema_file_path}")

duckdb.sql(f'''
    COPY (
        SELECT *
        FROM (DESCRIBE '{sample_file}')
    )
    TO '{schema_file_path}'
''')
# check if the schema file exists
if os.path.exists(schema_file_path):
    print(f"Schema file exists: {schema_file_path}")
else:
    print(f"Schema file does not exist: {schema_file_path}")
# Print the schema file content
with open(schema_file_path, 'r') as schema_file:
    schema_content = schema_file.read()
    print("Schema file content:")
    print(schema_content)


Schema file path: ./data/2025-02-19/03_transformed/argo-france/argo-france-schema.sql
Schema file exists: ./data/2025-02-19/03_transformed/argo-france/argo-france-schema.sql
Schema file content:
column_name,column_type,null,key,default,extra
authors,"STRUCT(fullName VARCHAR, ""name"" VARCHAR, rank BIGINT, surname VARCHAR, pid STRUCT(id STRUCT(scheme VARCHAR, ""value"" VARCHAR), provenance STRUCT(provenance VARCHAR, trust VARCHAR)))[]",YES,,,
bestAccessRight,"STRUCT(code VARCHAR, ""label"" VARCHAR, scheme VARCHAR)",YES,,,
collectedFrom,"STRUCT(""key"" VARCHAR, ""value"" VARCHAR)[]",YES,,,
communities,"STRUCT(code VARCHAR, ""label"" VARCHAR, provenance STRUCT(provenance VARCHAR, trust VARCHAR)[])[]",YES,,,
container,"STRUCT(ep VARCHAR, issnOnline VARCHAR, issnPrinted VARCHAR, ""name"" VARCHAR, sp VARCHAR, vol VARCHAR)",YES,,,
contributors,VARCHAR[],YES,,,
countries,"STRUCT(code VARCHAR, ""label"" VARCHAR, provenance STRUCT(provenance VARCHAR, trust VARCHAR))[]",YES,,,
coverages,JSON[],YE

In [13]:
# Connect to an in-memory DuckDB database
con = duckdb.connect()

# Count the number of records in the Parquet file
record_count = con.sql(f'''
    SELECT COUNT(*)
    FROM read_parquet('{sample_file}')
''').fetchone()[0]

# Print the record count
print(f"Number of records in the Parquet file: {record_count}")

# Close the DuckDB connection
con.close()

Number of records in the Parquet file: 10


In [14]:
# Connect to an in-memory DuckDB database
con = duckdb.connect()

# Query the titles from the Parquet file
titles = con.sql(f'''
    SELECT mainTitle
    FROM read_parquet('{sample_file}')
''').fetchall()

# Print the titles
print("Titles in the Parquet file:")
for title in titles:
    print(title[0])

# Close the DuckDB connection
con.close()

Titles in the Parquet file:
Correction of profiles of in‐situ chlorophyll fluorometry for the contribution of fluorescence originating from non‐algal matter
Plankton Assemblage Estimated with BGC‐Argo Floats in the Southern Ocean: Implications for Seasonal Successions and Particle Export
Main processes of the Atlantic cold tongue interannual variability
A Simplified Model for the Baroclinic and Barotropic Ocean Response to Moving Tropical Cyclones: 2. Model and Simulations
Atmospherically Forced and Chaotic Interannual Variability of Regional Sea Level and Its Components Over 1993–2015
Advances in operational oceanography : expanding Europe's ocean observing and forecasting capacity
CDOM Spatiotemporal Variability in the Mediterranean Sea: A Modelling Study
Report on new products
A European strategy plan with regard to the Argo extension in WBC and other boundary regions
Recommendations to operate shallow coastal float in European Marginal Seas


#### for testing: this part is for running on a random sample of multiple files


In [15]:
# Connect to an in-memory DuckDB database
con = duckdb.connect()

# Use DuckDB to process the extracted JSON files and save a sample of all rows as a Parquet file
# Join the list of file paths into a comma-separated string
file_paths = ','.join(f"'{file}'" for file in random_files_with_path)

# Use DuckDB to process the extracted JSON files and save a sample of all rows as a Parquet file
con.sql(f'''
    COPY (
        SELECT *
        FROM read_json([{file_paths}], sample_size=-1, union_by_name=true)
    )
    TO '{multiple_sample_file}' (FORMAT parquet, COMPRESSION gzip)
''')

print(f"Transformed data saved to: {multiple_sample_file}")
print(f"File size: {os.path.getsize(multiple_sample_file) / (1024**2):.2f} MB")

# Close the DuckDB connection
con.close()

Transformed data saved to: ./data/2025-02-19/03_transformed/argo-france/argo-france-multiple-sample.parquet
File size: 0.05 MB


In [16]:
# schema path
schema_file_path = f"{transformation_folder_path}/{folder_name}-schema.sql"

#print the schema file path
print(f"Schema file path: {schema_file_path}")

duckdb.sql(f'''
    COPY (
        SELECT *
        FROM (DESCRIBE '{multiple_sample_file}')
    )
    TO '{schema_file_path}'
''')
# check if the schema file exists
if os.path.exists(schema_file_path):
    print(f"Schema file exists: {schema_file_path}")
else:
    print(f"Schema file does not exist: {schema_file_path}")
# Print the schema file content
with open(schema_file_path, 'r') as schema_file:
    schema_content = schema_file.read()
    print("Schema file content:")
    print(schema_content)


Schema file path: ./data/2025-02-19/03_transformed/argo-france/argo-france-schema.sql
Schema file exists: ./data/2025-02-19/03_transformed/argo-france/argo-france-schema.sql
Schema file content:
column_name,column_type,null,key,default,extra
authors,"STRUCT(fullName VARCHAR, ""name"" VARCHAR, pid STRUCT(id STRUCT(scheme VARCHAR, ""value"" VARCHAR), provenance STRUCT(provenance VARCHAR, trust VARCHAR)), rank BIGINT, surname VARCHAR)[]",YES,,,
bestAccessRight,"STRUCT(code VARCHAR, ""label"" VARCHAR, scheme VARCHAR)",YES,,,
collectedFrom,"STRUCT(""key"" VARCHAR, ""value"" VARCHAR)[]",YES,,,
communities,"STRUCT(code VARCHAR, ""label"" VARCHAR, provenance STRUCT(provenance VARCHAR, trust VARCHAR)[])[]",YES,,,
container,"STRUCT(issnOnline VARCHAR, ""name"" VARCHAR, vol VARCHAR, ep VARCHAR, issnPrinted VARCHAR, sp VARCHAR)",YES,,,
contributors,VARCHAR[],YES,,,
countries,"STRUCT(code VARCHAR, ""label"" VARCHAR, provenance STRUCT(provenance VARCHAR, trust VARCHAR))[]",YES,,,
coverages,JSON[],YE

In [17]:
# Connect to an in-memory DuckDB database
con = duckdb.connect()

# Count the number of records in the Parquet file
record_count = con.sql(f'''
    SELECT COUNT(*)
    FROM read_parquet('{multiple_sample_file}')
''').fetchone()[0]

# Print the record count
print(f"Number of records in the Parquet file: {record_count}")

# Close the DuckDB connection
con.close()

Number of records in the Parquet file: 8


In [18]:
# Connect to an in-memory DuckDB database
con = duckdb.connect()

# Query the titles from the Parquet file
titles = con.sql(f'''
    SELECT mainTitle
    FROM read_parquet('{multiple_sample_file}')
''').fetchall()

# Print the titles
print("Titles in the Parquet file:")
for title in titles:
    print(title[0])

# Close the DuckDB connection
con.close()

Titles in the Parquet file:
Biogeochemical Argo: The Test Case of the NAOS Mediterranean Array
Recommendations to increase the overall life expectancy of Argo floats, based on at-sea monitoring fleet behavior monitoring, assessment and report (including a review of metadata that impact life expectancy: specific float configurations, batteries)
Monitoring the Oceans and Climate Change with Argo. MOCCA project 5 – year achievements
Dissolved Organic Nitrogen Production and Export by Meridional Overturning in the Eastern Subpolar North Atlantic
A new record of Atlantic sea surface salinity from 1896 to 2013 reveals the signatures of climate variability and long‐term trends
How Deep Argo Will Improve the Deep Ocean in an Ocean Reanalysis
CORA-IBI, Coriolis Ocean Dataset for Reanalysis for the Ireland-Biscay-Iberia region
Spreading and Vertical Structure of the Persian Gulf and Red Sea Outflows in the Northwestern Indian Ocean


#### for production: parsing all extracted files into one master parquet file

In [11]:
import os
import signal

def timeout_handler(signum, frame):
    raise TimeoutError

# Check if the master file already exists
if os.path.exists(master_file):
    print(f"Master file already exists: {master_file}")
    print("Do you want to overwrite it? (y/n) [Default: n, timeout 10s]:")
    signal.signal(signal.SIGALRM, timeout_handler)
    signal.alarm(10)
    try:
        user_input = input()
        overwrite = user_input.strip().lower() == 'y'
    except TimeoutError:
        print("No response received. Continuing with the existing master file.")
        overwrite = False
    finally:
        signal.alarm(0)
    if not overwrite:
        print("Using the existing master file.")
    else:
        # Overwrite: regenerate the master file
        con = duckdb.connect()
        file_paths = ','.join(f"'{file}'" for file in extracted_files_with_path)
        con.sql(f'''
            COPY (
                SELECT *
                FROM read_json([{file_paths}], sample_size=-1, union_by_name=true)
            )
            TO '{master_file}' (FORMAT parquet, COMPRESSION gzip)
        ''')
        print(f"Transformed data saved to: {master_file}")
        print(f"File size: {os.path.getsize(master_file) / (1024**2):.2f} MB")
        con.close()
else:
    # Master file does not exist, create it
    con = duckdb.connect()
    file_paths = ','.join(f"'{file}'" for file in extracted_files_with_path)
    con.sql(f'''
        COPY (
            SELECT *
            FROM read_json([{file_paths}], sample_size=-1, union_by_name=true)
        )
        TO '{master_file}' (FORMAT parquet, COMPRESSION gzip)
    ''')
    print(f"Transformed data saved to: {master_file}")
    print(f"File size: {os.path.getsize(master_file) / (1024**2):.2f} MB")
    con.close()

Master file already exists: ./data/2025-02-19/03_transformed/argo-france/argo-france-master.parquet
Do you want to overwrite it? (y/n) [Default: n, timeout 10s]:
Using the existing master file.


In [12]:
# schema path
schema_file_path = f"{transformation_folder_path}/{folder_name}-schema.sql"

#print the schema file path
print(f"Schema file path: {schema_file_path}")

duckdb.sql(f'''
    COPY (
        SELECT *
        FROM (DESCRIBE '{master_file}')
    )
    TO '{schema_file_path}'
''')
# check if the schema file exists
if os.path.exists(schema_file_path):
    print(f"Schema file exists: {schema_file_path}")
else:
    print(f"Schema file does not exist: {schema_file_path}")
# Print the schema file content
with open(schema_file_path, 'r') as schema_file:
    schema_content = schema_file.read()
    print("Schema file content:")
    print(schema_content)


Schema file path: ./data/2025-02-19/03_transformed/argo-france/argo-france-schema.sql
Schema file exists: ./data/2025-02-19/03_transformed/argo-france/argo-france-schema.sql
Schema file content:
column_name,column_type,null,key,default,extra
authors,"STRUCT(fullName VARCHAR, ""name"" VARCHAR, pid STRUCT(id STRUCT(scheme VARCHAR, ""value"" VARCHAR), provenance STRUCT(provenance VARCHAR, trust VARCHAR)), rank BIGINT, surname VARCHAR)[]",YES,,,
bestAccessRight,"STRUCT(code VARCHAR, ""label"" VARCHAR, scheme VARCHAR)",YES,,,
collectedFrom,"STRUCT(""key"" VARCHAR, ""value"" VARCHAR)[]",YES,,,
communities,"STRUCT(code VARCHAR, ""label"" VARCHAR, provenance STRUCT(provenance VARCHAR, trust VARCHAR)[])[]",YES,,,
container,"STRUCT(ep VARCHAR, issnOnline VARCHAR, issnPrinted VARCHAR, ""name"" VARCHAR, sp VARCHAR, vol VARCHAR, edition VARCHAR, iss VARCHAR, issnLinking VARCHAR)",YES,,,
contributors,VARCHAR[],YES,,,
countries,"STRUCT(code VARCHAR, ""label"" VARCHAR, provenance STRUCT(provenance VAR

In [13]:
# Connect to an in-memory DuckDB database
con = duckdb.connect()

# Count the number of records in the Parquet file
record_count = con.sql(f'''
    SELECT COUNT(*)
    FROM read_parquet('{master_file}')
''').fetchone()[0]

# Print the record count
print(f"Number of records in the Parquet file: {record_count}")

# Close the DuckDB connection
con.close()

Number of records in the Parquet file: 674


In [18]:
import random

# Connect to an in-memory DuckDB database
con = duckdb.connect()

# Query all titles from the Parquet file
titles = con.sql(f'''
    SELECT mainTitle
    FROM read_parquet('{master_file}')
''').fetchall()

# Select 10 random titles
random_titles = random.sample([title[0] for title in titles if title[0]], min(10, len(titles)))

print("10 Random Titles in the Parquet file:")
for title in random_titles:
    print(title)

# print the number of unique titles
unique_titles = set(title[0] for title in titles if title[0])
print(f"Number of unique titles in the Parquet file: {len(unique_titles)}")

# Close the DuckDB connection
con.close()

10 Random Titles in the Parquet file:
Intercomparison and validation of the mixed layer depth fields of global ocean syntheses
Baltic Sea workshop report
3D Structure of the Ras Al Hadd Oceanic Dipole
QC Report. AtlantOS project
Deep mixed ocean volume in the Labrador Sea in HighResMIP models
CTD DATA - EUREC4A_OA Atalante Cruise
Budget of organic carbon in the <scp>N</scp>orth‐<scp>W</scp>estern <scp>M</scp>editerranean open sea over the period 2004–2008 using 3‐D coupled physical‐biogeochemical modeling
Applications and Challenges of GRACE and GRACE Follow-On Satellite Gravimetry
Recommendations to increase the overall life expectancy of Argo floats, based on at-sea monitoring fleet behavior monitoring, assessment and report (including a review of metadata that impact life expectancy: specific float configurations, batteries)
CERA‐20C: A Coupled Reanalysis of the Twentieth Century
Number of unique titles in the Parquet file: 666


In [19]:
import random

# Connect to an in-memory DuckDB database
con = duckdb.connect()
# Query to extract DOIs from the pids column
dois = con.sql(f'''
    SELECT unnest.value AS doi
    FROM read_parquet('{master_file}')
    CROSS JOIN UNNEST(pids) AS unnest
    WHERE unnest.scheme = 'doi'
''').fetchall()

# Select 10 random DOIs
random_dois = random.sample([doi[0] for doi in dois if doi[0]], min(10, len(dois)))

# Print the 10 random DOIs
print("10 Random DOIs:")
for doi in random_dois:
    print(doi)

# print total number of DOIs
print(f"Total number of DOIs: {len(dois)}")

# print the number of unique DOIs
unique_dois = set(doi[0] for doi in dois if doi[0])
print(f"Number of unique DOIs: {len(unique_dois)}")

# Close the DuckDB connection
con.close()

10 Random DOIs:
10.1038/s41467-020-14474-y
10.1016/j.ocemod.2018.11.005
10.5194/osd-10-1127-2013
10.1038/s41598-018-27407-z
10.3389/fmars.2023.1287867
10.5281/zenodo.7369190
10.5067/ghgoy-4fe01
10.3389/fmars.2019.00519
10.5194/os-18-129-2022
10.1029/2021jc017999
Total number of DOIs: 709
Number of unique DOIs: 709


In [20]:
# Connect to an in-memory DuckDB database
con = duckdb.connect()

# Query to extract distinct PID schemes from the master file
pid_schemes = con.sql(f'''
    SELECT DISTINCT unnest.scheme AS scheme
    FROM read_parquet('{master_file}')
    CROSS JOIN UNNEST(pids) AS unnest
''').fetchall()

# Print the distinct PID schemes
print("Distinct PID schemes in the master table:")
for scheme in pid_schemes:
    print(scheme[0])

# Close the DuckDB connection
con.close()

Distinct PID schemes in the master table:
doi
mag_id
arXiv
handle
pmid
pmc


In [21]:
# Connect to an in-memory DuckDB database
con = duckdb.connect()

# Query to extract PIDs grouped by their schemes
pids_by_scheme = con.sql(f'''
    SELECT unnest.scheme AS scheme, LIST(unnest.value) AS pids
    FROM read_parquet('{master_file}')
    CROSS JOIN UNNEST(pids) AS unnest
    GROUP BY unnest.scheme
''').fetchall()

# dataframe to hold the PIDs grouped by their schemes
df_pids_by_scheme = pd.DataFrame(pids_by_scheme, columns=['scheme', 'pids'])
# Print the DataFrame of PIDs grouped by their schemes
print("PIDs grouped by schemes:")  
print(df_pids_by_scheme)
 
# Close the DuckDB connection
con.close()

PIDs grouped by schemes:
   scheme                                               pids
0  handle  [20.500.14243/381862, 1871/48380, 1912/27589, ...
1    pmid  [31996687, 32978152, 35865129, 31875863, 30659...
2     pmc  [PMC6989661, PMC7518875, PMC9287098, PMC691659...
3     doi  [10.1175/jpo-d-16-0107.1, 10.1002/2016jc012629...
4  mag_id  [2497128190, 2601592524, 2332474861, 205755597...
5   arXiv                  [http://arxiv.org/abs/1607.08469]


## Geting ready for further processing the master data

In [22]:
import os

# Create a folder for processed data
processing_folder_path = f"./data/{publication_date}/04_processed/{folder_name}"

# Ensure the target directory exists
os.makedirs(processing_folder_path, exist_ok=True)

### Step 4: Get the DOI's and other identifiers

In [23]:
# Connect to an in-memory DuckDB database
con = duckdb.connect()

# Query to combine the id with each pid
combined_data = con.sql(f'''
    SELECT 
        id AS record_id,
        unnest.scheme AS pid_scheme,
        unnest.value AS pid_value,
        CONCAT(id, '_', unnest.value) AS combined_id_pid
    FROM read_parquet('{master_file}')
    CROSS JOIN UNNEST(pids) AS unnest
''').fetchdf()

# Print the resulting DataFrame
print("Combined id and pid:")
print(combined_data)


# Save the combined data to a new Parquet file for later use
combined_file_path = f"{processing_folder_path}/{folder_name}-combined-id-pid.parquet"
combined_data.to_parquet(combined_file_path, index=False)
print(f"Combined data saved to: {combined_file_path}")
print(f"File size: {os.path.getsize(combined_file_path) / (1024**2):.2f} MB")

# Close the DuckDB connection
con.close()

Combined id and pid:
                                           record_id pid_scheme  \
0     doi_dedup___::fe2c8a61b8ccaa6515d3c2996b2144c9        doi   
1     doi_dedup___::fe2c8a61b8ccaa6515d3c2996b2144c9     mag_id   
2     doi_dedup___::7bd738ce5851f7e450ebf6388ad51522        doi   
3     doi_dedup___::7bd738ce5851f7e450ebf6388ad51522     mag_id   
4     doi_dedup___::7bd738ce5851f7e450ebf6388ad51522     handle   
...                                              ...        ...   
1392  doi_dedup___::da9603d028d4bd4ad2d9c980917fd5ac        doi   
1393  doi_dedup___::3da738372eaf79174655e4ef24d74cd2        doi   
1394  doi_dedup___::3da738372eaf79174655e4ef24d74cd2        doi   
1395  doi_dedup___::3da738372eaf79174655e4ef24d74cd2        doi   
1396  doi_dedup___::3da738372eaf79174655e4ef24d74cd2        doi   

                    pid_value  \
0     10.1175/jpo-d-16-0107.1   
1                  2497128190   
2        10.1002/2016jc012629   
3                  2601592524   
4        

### Step 5: Get Altmetric data

a. use the PIDS (df_pids_by_scheme) along with the record id (to be used as primary keys, connecting the tables later on), 

b. get mention data by parsing the pids over the altmetric API,

c. save the outcomes in a separate parquet file.

### Step 6: Get Overton data

### Step 7: Get SDG classification labels

a. Query the abstracts first along with the id (to be used as primary keys, connecting the tables later on), 

b. get sdg data by parsing the abstracts with more than 100 tokens over an LLM API with system prompt, 

c. save the outcomes in a separate parquet file.

##### step 7a: Get the abstracts, including the record id and the number of tokens i nthe abstract

Number of tokens are important later on, less then 100 tokens in the abstract deliver low quality SDG classifications.

In [28]:

# Connect to an in-memory DuckDB database
con = duckdb.connect()
# Query to extract the ID, description, remove XML tags, and calculate the number of tokens in the description
description_data = con.sql(f'''
    SELECT 
        id AS record_id,
        regexp_replace(descriptions[1], '<[^>]+>', '') AS description,  -- Remove XML tags
        array_length(split(regexp_replace(descriptions[1], '<[^>]+>', ''), ' ')) AS token_count
    FROM read_parquet('{master_file}')
    WHERE descriptions IS NOT NULL AND array_length(descriptions) > 0
''').fetchdf()

# Print the resulting DataFrame
print("Descriptions with token counts:")
print(description_data)

# Save the data to a new Parquet file for later use
description_file_path = f"{processing_folder_path}/{folder_name}-descriptions-with-tokens.parquet"
description_data.to_parquet(description_file_path, index=False)
print(f"Description data saved to: {description_file_path}")
print(f"File size: {os.path.getsize(description_file_path) / (1024**2):.2f} MB")

# Close the DuckDB connection
con.close()

Descriptions with token counts:
                                          record_id  \
0    doi_dedup___::9e973d60bf13b4e8b28c199e27dea4ef   
1    doi_dedup___::e246801fc9ed25782358bac694517f8f   
2    doi_dedup___::fe2c8a61b8ccaa6515d3c2996b2144c9   
3    doi_dedup___::7bd738ce5851f7e450ebf6388ad51522   
4    doi_dedup___::bf1098713a38f89cb9c67a7f59401107   
..                                              ...   
650  doi_dedup___::3894f0d63b65411c0d289bf831716e48   
651  doi_dedup___::a96d857fd60b818f7bde9aa0c99bfc3f   
652  doi_dedup___::551f2ca097a75326cc2e7561f831d38b   
653  doi_dedup___::da9603d028d4bd4ad2d9c980917fd5ac   
654  doi_dedup___::3da738372eaf79174655e4ef24d74cd2   

                                           description  token_count  
0    Abstract</jats:title><jats:p>The Black Sea, th...          161  
1     The early twenty-first century’s warming tren...          247  
2    Abstract</jats:title><jats:p>The semienclosed ...          226  
3    Abstract</jats:title><

##### step 7b Get the official definitions of the SDG's from https://metadata.un.org/sdg/ using the Accept header application/rdf+xml

First we get the links to the top level goals.

In [25]:
import requests

# URL for the SDG metadata
sdg_metadata_url = "https://metadata.un.org/sdg/"

# Set the headers to request RDF/XML format
headers = {
    "Accept": "application/rdf+xml"
}

# Send the GET request
response = requests.get(sdg_metadata_url, headers=headers)

# Check if the request was successful
if response.status_code == 200:
    # Save the RDF/XML content to a file
    rdf_file_path = f"{processing_folder_path}/sdg_definitions.rdf"
    with open(rdf_file_path, "wb") as rdf_file:
        rdf_file.write(response.content)
    print(f"SDG definitions saved to: {rdf_file_path}")
else:
    print(f"Failed to fetch SDG definitions. Status code: {response.status_code}")
    print(f"Response: {response.text}")

SDG definitions saved to: ./data/2025-02-19/04_processed/argo-france/sdg_definitions.rdf


In [26]:
import pandas as pd

import xml.etree.ElementTree as ET

# Parse the RDF/XML file
tree = ET.parse(rdf_file_path)
root = tree.getroot()

# Find all skos:hasTopConcept elements and extract their rdf:resource attribute
top_concept_urls = []
for elem in root.findall('.//{http://www.w3.org/2004/02/skos/core#}hasTopConcept'):
    url = elem.attrib.get('{http://www.w3.org/1999/02/22-rdf-syntax-ns#}resource')
    if url:
        top_concept_urls.append(url)

# sort the URLs based on the integer in the last part of the URL
top_concept_urls.sort(key=lambda x: int(x.split('/')[-1]))

print("Top concept URLs found in the RDF/XML:")
for url in top_concept_urls:
    print(url)



Top concept URLs found in the RDF/XML:
http://metadata.un.org/sdg/1
http://metadata.un.org/sdg/2
http://metadata.un.org/sdg/3
http://metadata.un.org/sdg/4
http://metadata.un.org/sdg/5
http://metadata.un.org/sdg/6
http://metadata.un.org/sdg/7
http://metadata.un.org/sdg/8
http://metadata.un.org/sdg/9
http://metadata.un.org/sdg/10
http://metadata.un.org/sdg/11
http://metadata.un.org/sdg/12
http://metadata.un.org/sdg/13
http://metadata.un.org/sdg/14
http://metadata.un.org/sdg/15
http://metadata.un.org/sdg/16
http://metadata.un.org/sdg/17


Next we get the goal number, goal name and goal description for each top level goal.

In [31]:
import requests
import pandas as pd

import xml.etree.ElementTree as ET

# Prepare lists to store the results
goal_numbers = []
goal_titles = []
goal_descriptions = []
goal_urls = []

# Loop through each top concept URL
for url in top_concept_urls:
    try:
        # Fetch the RDF/XML content
        resp = requests.get(url, headers={"Accept": "application/rdf+xml"})
        resp.raise_for_status()
        root = ET.fromstring(resp.content)
        # Find the main Description element
        desc = root.find('.//{http://www.w3.org/1999/02/22-rdf-syntax-ns#}Description')
        if desc is None:
            continue
        # Extract <skos:note xml:lang="en">Goal N</skos:note>
        goal_number = None
        for note in desc.findall('{http://www.w3.org/2004/02/skos/core#}note'):
            if note.attrib.get('{http://www.w3.org/XML/1998/namespace}lang') == 'en' and note.text and note.text.startswith('Goal'):
                goal_number = note.text.replace('Goal ', '').strip()
                break
        # Extract <skos:altLabel xml:lang="en">...</skos:altLabel>
        goal_title = None
        for alt in desc.findall('{http://www.w3.org/2004/02/skos/core#}altLabel'):
            if alt.attrib.get('{http://www.w3.org/XML/1998/namespace}lang') == 'en':
                goal_title = alt.text.strip()
                break
        # Extract <skos:prefLabel xml:lang="en">...</skos:prefLabel>
        goal_description = None
        for pref in desc.findall('{http://www.w3.org/2004/02/skos/core#}prefLabel'):
            if pref.attrib.get('{http://www.w3.org/XML/1998/namespace}lang') == 'en':
                goal_description = pref.text.strip()
                break
        # Store results
        goal_numbers.append(goal_number)
        goal_titles.append(goal_title)
        goal_descriptions.append(goal_description)
        goal_urls.append(url)
    except Exception as e:
        print(f"Error processing {url}: {e}")

# Create DataFrame
df_sdg_goals = pd.DataFrame({
    "goal_number": goal_numbers,
    "goal_title": goal_titles,
    "goal_description": goal_descriptions,
    "goal_url": goal_urls
})

print(df_sdg_goals)

# Save the DataFrame to a CSV file
sdg_goals_csv_path = f"{processing_folder_path}/sdg_goals.csv"
df_sdg_goals.to_csv(sdg_goals_csv_path, index=False)
print(f"SDG goals saved to: {sdg_goals_csv_path}")

   goal_number                               goal_title  \
0            1                               No poverty   
1            2                              Zero hunger   
2            3               Good health and well-being   
3            4                        Quality education   
4            5                          Gender equality   
5            6               Clean water and sanitation   
6            7              Affordable and clean energy   
7            8          Decent work and economic growth   
8            9  Industry, innovation and infrastructure   
9           10                     Reduced inequalities   
10          11       Sustainable cities and communities   
11          12   Responsible consumption and production   
12          13                           Climate action   
13          14                         Life below water   
14          15                             Life on land   
15          16   Peace, justice and strong institutions 

##### Step 7c Here we prepare the System and User prompts to be used by an LLM.

In [49]:
# Define the text to classify
text = """
The United Nations Sustainable Development Goals (SDGs) are a universal call to action to end poverty, protect the planet, and ensure prosperity for all by 2030. They address global challenges such as inequality, climate change, environmental degradation, peace, and justice. The SDGs consist of 17 goals and 169 targets that aim to achieve a better and more sustainable future for all.
"""
# Print the text to classify
print("Text to classify:")
print(text)

# Define the expected output format, now including an explanation field
example_output_format = """
{
    "sdgs": [2, 6, 17],
    "explanation": "This text is related to SDG 2 (Zero hunger) because it discusses food security, SDG 6 (Clean water and sanitation) due to references to environmental protection, and SDG 17 (Partnerships for the goals) as it mentions global cooperation."
}
"""

# Print the example output format
print("Example Output Format:")
print(example_output_format)

# system_prompt
# Build SDG goal info string from df_sdg_goals
sdg_goal_info = "\n".join(
    f"{row.goal_number}: {row.goal_title} - {row.goal_description}"
    for _, row in df_sdg_goals.iterrows()
)

sdg_system_prompt = f"""
You are an intelligent multi-label classification system designed to map texts to their relevant Sustainable Development Goals.
Take the text delimited by triple quotation marks and return a JSON list of relevant SDGs. 
Example output format: {example_output_format}

Here are the SDG goals and their descriptions:
{sdg_goal_info}

"""
# Print the system prompt
print("System Prompt:")
print(sdg_system_prompt)
# user_prompt
sdg_user_prompt = f"""
"Classify the following text in terms of its relevance to the Sustainable Development Goals:",
Text: '''{text}'''
"""
# Print the user prompt
print("User Prompt:")
print(sdg_user_prompt)


Text to classify:

The United Nations Sustainable Development Goals (SDGs) are a universal call to action to end poverty, protect the planet, and ensure prosperity for all by 2030. They address global challenges such as inequality, climate change, environmental degradation, peace, and justice. The SDGs consist of 17 goals and 169 targets that aim to achieve a better and more sustainable future for all.

Example Output Format:

{
    "sdgs": [2, 6, 17],
    "explanation": "This text is related to SDG 2 (Zero hunger) because it discusses food security, SDG 6 (Clean water and sanitation) due to references to environmental protection, and SDG 17 (Partnerships for the goals) as it mentions global cooperation."
}

System Prompt:

You are an intelligent multi-label classification system designed to map texts to their relevant Sustainable Development Goals.
Take the text delimited by triple quotation marks and return a JSON list of relevant SDGs. 
Example output format: 
{
    "sdgs": [2, 6, 1

##### Step 7d: Get the LLM API prepared

In [41]:
# OpenWebUI API configuration
openwebui_base_url = "https://nebula.cs.vu.nl"  # Replace with your actual OpenWebUI API base URL
openwebui_api_key = "sk-5b5a024888c14a019c0e9b4857df9329"  # Replace with your actual API key

first get the models

In [None]:
# This script fetches the list of available models from the OpenWebUI API
# and prints their IDs, names, and parameter sizes.

import requests

# Use the existing variables openwebui_base_url and openwebui_api_key

headers = {
    "Authorization": f"Bearer {openwebui_api_key}"
}

# Ensure the base URL does not end with a slash
api_url = openwebui_base_url.rstrip('/') + "/api/models"

# print the request in curl
print(f"curl -X GET '{api_url}' -H 'Authorization: Bearer {openwebui_api_key}'")

response = requests.get(api_url, headers=headers)

if response.status_code == 200:
    models_json = response.json()
    models = models_json.get("data", [])
    print("Available models:")
    for model in models:
        print(f"- id: {model.get('id')}, name: {model.get('name')}, parameter_size: {model.get('ollama', {}).get('details', {}).get('parameter_size')}")
else:
    print(f"Failed to fetch models. Status code: {response.status_code}")
    print(f"Response: {response.text}")




Available models:
- id: deepseek-r1:1.5b, name: deepseek-r1:1.5b, parameter_size: 1.8B
- id: deepseek-r1:8b, name: deepseek-r1:8b, parameter_size: 8.0B
- id: llama3.1:8b, name: llama3.1:8b, parameter_size: 8.0B
- id: qwen2.5:1.5b, name: qwen2.5:1.5b, parameter_size: 1.5B
- id: qwen2.5:7b, name: qwen2.5:7b, parameter_size: 7.6B
curl -X GET 'https://nebula.cs.vu.nl/api/models' -H 'Authorization: Bearer sk-5b5a024888c14a019c0e9b4857df9329'


Select the model to use, when no model is chosen, deepseek-r1:1.5b will be the default (faser & cheaper)

In [46]:
import signal

# Select the model to use, when no model is chosen, deepseek-r1:1.5b will be the default
model = "deepseek-r1:1.5b"  # Replace with your actual model name

def timeout_handler(signum, frame):
    raise TimeoutError

print("Available models:")
for i, m in enumerate(models):
    print(f"{i}: {m['id']}")

print("Select the model index to use (default: 0, deepseek-r1:1.5b) [timeout 10s]:")
signal.signal(signal.SIGALRM, timeout_handler)
signal.alarm(10)
try:
    user_input = input()
    if user_input.strip().isdigit():
        selected_model_index = int(user_input.strip())
        if 0 <= selected_model_index < len(models):
            model = models[selected_model_index]['id']
        else:
            print("Invalid index, using default model.")
            model = "deepseek-r1:1.5b"
    else:
        print("No valid input, using default model.")
        model = "deepseek-r1:1.5b"
except TimeoutError:
    print("No response received. Using default model.")
    model = "deepseek-r1:1.5b"
finally:
    signal.alarm(0)

print(f"Model selected: {model}")


Available models:
0: deepseek-r1:1.5b
1: deepseek-r1:8b
2: llama3.1:8b
3: qwen2.5:1.5b
4: qwen2.5:7b
Select the model index to use (default: 0, deepseek-r1:1.5b) [timeout 10s]:
Model selected: llama3.1:8b


Finally, for each abstract, run the system and user prompt

In [None]:
import requests
import pandas as pd
import time

# Add a testing parameter to limit the number of abstracts
testing_mode = True  # Set to True to limit to 2 abstracts for testing

if testing_mode:
    # Load only the first 2 abstracts for testing
    description_df = pd.read_parquet(f"{processing_folder_path}/{folder_name}-descriptions-with-tokens.parquet").head(2)
else:
    # Load all abstracts for production
    description_df = pd.read_parquet(f"{processing_folder_path}/{folder_name}-descriptions-with-tokens.parquet")                                                             

# Filter abstracts with at least 100 tokens
description_df = description_df[description_df['token_count'] >= 100]

# Prepare results list
sdg_results = []

# Loop through each abstract
for idx, row in description_df.iterrows():
    record_id = row['record_id']
    abstract = row['description']

    # Prepare the messages for the API
    messages = [
        {"role": "system", "content": sdg_system_prompt},
        {"role": "user", "content": f"Classify the following text in terms of its relevance to the Sustainable Development Goals:\nText: '''{abstract}'''"}
    ]

    data = {
        "model": model,
        "messages": messages
    }

    # Print the data variable for debugging
    print(f"Data for record_id {record_id}: {data}")

    # Make the API call
    response = requests.post(
        openwebui_base_url.rstrip('/') + "/api/chat/completions",
        headers={"Authorization": f"Bearer {openwebui_api_key}", "Content-Type": "application/json"},
        json=data
    )

    # Check if the request was successful
    if response.status_code != 200:
        print(f"Error processing record_id {record_id}: {response.status_code} - {response.text}")
        continue

    # Print the response for debugging
    print(f"Response for record_id {record_id}: {response.json()}")
    
    # Parse the response
    try:
        result = response.json()
        # Try to extract the SDG list from the response
        content = result['choices'][0]['message']['content']
        # Try to parse the JSON from the model output
        try:
            sdg_json = eval(content) if isinstance(content, str) else content
            sdgs = sdg_json.get("sdgs", [])
            explanation = sdg_json.get("explanation", "")
        except Exception:
            sdgs = []
            explanation = ""
    except Exception:
        sdgs = []
        explanation = ""

    # Append to results, including the explanation if available
    sdg_results.append({
        "record_id": record_id,
        "abstract": abstract,
        "sdgs": sdgs,
        "explanation": explanation
    })

    # Optional: print progress
    print(f"Processed record_id: {record_id}, SDGs: {sdgs}")

    # Optional: delay to avoid rate limits
    time.sleep(1)

# Print the number of results
print(f"Number of SDG results collected: {len(sdg_results)}")

# Make the value of the model variable suitable for using in the file names
model_filename = model.replace(":", "-").replace(" ", "_")

# Save results to parquet
sdg_results_df = pd.DataFrame(sdg_results)
sdg_results_path = f"{processing_folder_path}/{folder_name}-sdg-results-{model_filename}.parquet"
sdg_results_df.to_parquet(sdg_results_path, index=False)
print(f"SDG LLM results saved to: {sdg_results_path}")
print(f"File size: {os.path.getsize(sdg_results_path) / (1024**2):.2f} MB")

Data for record_id doi_dedup___::9e973d60bf13b4e8b28c199e27dea4ef: {'model': 'llama3.1:8b', 'messages': [{'role': 'system', 'content': '\nYou are an intelligent multi-label classification system designed to map texts to their relevant Sustainable Development Goals.\nTake the text delimited by triple quotation marks and return a JSON list of relevant SDGs. \nExample output format: \n{\n    "sdgs": [2, 6, 17],\n    "explanation": "This text is related to SDG 2 (Zero hunger) because it discusses food security, SDG 6 (Clean water and sanitation) due to references to environmental protection, and SDG 17 (Partnerships for the goals) as it mentions global cooperation."\n}\n\n\nHere are the SDG goals and their descriptions:\n1: No poverty - End poverty in all its forms everywhere\n2: Zero hunger - End hunger, achieve food security and improved nutrition and promote sustainable agriculture\n3: Good health and well-being - Ensure healthy lives and promote well-being for all at all ages\n4: Quali

### Step 8: Get Genderize data
a. First Query the authors with country of the affiliation along with the record id (to be used as primary keys, connecting the tables later on), 

b. get gender data by parsing the author names with country label over an API, 

c. save the outcomes in a separate parquet file.

In [29]:
# Connect to an in-memory DuckDB database
con = duckdb.connect()

# Query to extract authors along with their full names and record IDs
authors = con.sql(f'''
    SELECT 
        id AS record_id,
        unnest.fullName AS full_name,
        unnest.name AS first_name,
        unnest.surname AS last_name,
        unnest.pid.id.value AS orcid,
        countries[1].label AS country_name,
        countries[1].code AS country_code
    FROM read_parquet('{master_file}')
    CROSS JOIN UNNEST(authors) AS unnest
    WHERE countries IS NOT NULL AND array_length(countries) > 0
''').fetchall()

# convert the result to a DataFrame
import pandas as pd
authors_df = pd.DataFrame(authors, columns=['record_id', 'full_name', 'first_name', 'last_name', 'orcid', 'country_name', 'country_code'])
# Print the authors DataFrame
print("Authors with full names and ORCID IDs:")
print(authors_df)  

# Save the authors data to a new Parquet file for later use
authors_file_path = f"{processing_folder_path}/{folder_name}-authors.parquet"
authors_df.to_parquet(authors_file_path, index=False)
print(f"Authors data saved to: {authors_file_path}")
print(f"File size: {os.path.getsize(authors_file_path) / (1024**2):.2f} MB")

# Close the DuckDB connection
con.close()

Authors with full names and ORCID IDs:
                                           record_id             full_name  \
0     doi_dedup___::9e973d60bf13b4e8b28c199e27dea4ef        Emil V. Stanev   
1     doi_dedup___::9e973d60bf13b4e8b28c199e27dea4ef  Pierre‐Marie Poulain   
2     doi_dedup___::9e973d60bf13b4e8b28c199e27dea4ef      Sebastian Grayek   
3     doi_dedup___::9e973d60bf13b4e8b28c199e27dea4ef    Kenneth S. Johnson   
4     doi_dedup___::9e973d60bf13b4e8b28c199e27dea4ef        Hervé Claustre   
...                                              ...                   ...   
5846  doi_dedup___::53e879864d1a55d055a8f2385005f5e0         Xiaogang Xing   
5847  doi_dedup___::53e879864d1a55d055a8f2385005f5e0        Antoine Poteau   
5848  doi_dedup___::53e879864d1a55d055a8f2385005f5e0     Giorgio Dall'Olmo   
5849  doi_dedup___::53e879864d1a55d055a8f2385005f5e0        Annick Bricaud   
5850  doi_dedup___::f872f5c02d3bce9bc6ae9962dbec5083                  Argo   

        first_name last_

In [30]:
# Filter the authors DataFrame to get unique names, countries, and record IDs
# Only use the first occurrence of each first name
unique_authors = authors_df[['record_id', 'first_name', 'country_code']].copy()
unique_authors['first_name'] = unique_authors['first_name'].str.split().str[0]  # Keep only the first word
# Remove one-letter names (e.g., "L.", "S.") that often end with a dot
unique_authors = unique_authors[~unique_authors['first_name'].str.match(r'^[A-Z]\.$', na=False)]
# Drop rows where 'first_name' is None or NaN
unique_authors = unique_authors.dropna(subset=['first_name'])
unique_authors = unique_authors[unique_authors['first_name'] != 'None']
unique_authors = unique_authors.drop_duplicates(subset=['first_name', 'record_id'], keep='first')

# Print unique authors with record IDs
print("Unique authors linked to record IDs:")
print(unique_authors)


Unique authors linked to record IDs:
                                           record_id    first_name  \
0     doi_dedup___::9e973d60bf13b4e8b28c199e27dea4ef          Emil   
1     doi_dedup___::9e973d60bf13b4e8b28c199e27dea4ef  Pierre‐Marie   
2     doi_dedup___::9e973d60bf13b4e8b28c199e27dea4ef     Sebastian   
3     doi_dedup___::9e973d60bf13b4e8b28c199e27dea4ef       Kenneth   
4     doi_dedup___::9e973d60bf13b4e8b28c199e27dea4ef         Hervé   
...                                              ...           ...   
5834  doi_dedup___::46a4623c5543eb6a03c0389effc47b8d       Nicolas   
5835  doi_dedup___::46a4623c5543eb6a03c0389effc47b8d         Sally   
5836  doi_dedup___::46a4623c5543eb6a03c0389effc47b8d       Thierry   
5837  doi_dedup___::46a4623c5543eb6a03c0389effc47b8d     Jean-Marc   
5838  doi_dedup___::46a4623c5543eb6a03c0389effc47b8d       Laurent   

     country_code  
0              FR  
1              FR  
2              FR  
3              FR  
4              FR  
..

In [40]:
# Adding variables to handle rate limiting and API key for Genderize API

# Check if the user has a paid subscription
paid_subscription = False  # Set this to True if you have a paid subscription

testing_mode = True  # Set to True for testing, False for production

# Set the rate limit based on the testing mode
if testing_mode:
    rate_limit = 10  # Reduced rate limit for testing
else:
    rate_limit = 1000 if paid_subscription else 100 # Adjust rate limit based on subscription, setting a default for free users

# delay between requests in seconds
delay_between_requests = 0.5  # Calculate delay based on rate limit

# Genderize API key
genderize_api_key= "da1a264b9bab63b46f27ac635dd7d2df"  # Replace with your actual API key

# Initialize request count
request_count = 0  # Initialize request count

# Base URL for Genderize API
base_url = "https://api.genderize.io"

# print all the above variables
print(f"Paid Subscription: {paid_subscription}")
print(f"Testing Mode: {testing_mode}")
print(f"Rate Limit: {rate_limit} requests per second")
print(f"Delay between requests: {delay_between_requests:.2f} seconds")

Paid Subscription: False
Testing Mode: True
Rate Limit: 10 requests per second
Delay between requests: 0.50 seconds


In [42]:
import requests
import time

# Initialize the list to store gender results
gender_results = []

# Iterate over the unique authors
for _, row in unique_authors.iterrows():
    if request_count >= rate_limit: # type: ignore
        print("Rate limit reached. Stopping for the day.")
        break

    first_name = row['first_name']
    country_code = row['country_code']
    record_id = row['record_id']  # Add the record ID

    # Skip if the first name is missing
    if pd.isna(first_name):
        continue

    # Prepare the API request
    params = {
        "name": first_name,
        "country_id": country_code
    }
    if paid_subscription:
        params["apikey"] = genderize_api_key

    try:
        # Send the request to the Genderize API
        response = requests.get(base_url, params=params)
        response.raise_for_status()
        data = response.json()

        # Append the result to the list
        gender_results.append({
            "first_name": first_name,
            "country_code": country_code,
            "gender": data.get("gender"),
            "probability": data.get("probability"),
            "count": data.get("count")
        })

        # Increment the request count
        request_count += 1

        # Print progress
        print(f"Processed: {first_name} ({country_code}) - Gender: {data.get('gender')}")

        # Add a delay between requests to avoid overwhelming the API
        time.sleep(delay_between_requests)

    except requests.exceptions.RequestException as e:
        print(f"Error processing {first_name} ({country_code}): {e}")

        # Increment the request count
        request_count += 1

        # Print progress
        print(f"Processed: {first_name} ({country_code}) - Gender: {data.get('gender')}")

        # Add a small delay to avoid overwhelming the API
        time.sleep(1)

    except requests.exceptions.RequestException as e:
        print(f"Error processing {first_name} ({country_code}): {e}")

# Convert the results to a DataFrame
gender_df = pd.DataFrame(gender_results)

# Save the results to a Parquet file
gender_file_path = f"{processing_folder_path}/{folder_name}-gender-data.parquet"
gender_df.to_parquet(gender_file_path, index=False)
print(f"Gender data saved to: {gender_file_path}")

Rate limit reached. Stopping for the day.
Gender data saved to: ./data/2025-02-19/04_processed/argo-france/argo-france-gender-data.parquet


### Step 9: Get Citizen Science classification labels

a. Query the abstracts first along with the id (to be used as primary keys, connecting the tables later on), 

b. get citizen science labels by parsing the abstract over an LLM API with system prompt, 

c. save the outcomes in a separate parquet file.