# OpenAIRE Community Data Dump Handling: Extraction, Tranformation and Enrichment

In this notebook we will start with one of the [OpenAIRE Community Subgraphs](https://graph.openaire.eu/docs/downloads/subgraphs) to enrich that informatino for further analysis.

This data process will extract an [OpenAIRE community data dump from Zenodo](https://doi.org/10.5281/zenodo.3974604), transforms it in to a portable file format .parquet (and updatable with changes for time seires analysis), that can be used to query with DuckDB, to enrich this with additional data  (also in .parquet, for join queries).

This additional data can be societal impact data from [Altmetric.com](https://details-page-api-docs.altmetric.com/) or [Overton.io](https://app.overton.io/swagger.php), Gender data using [genderize.io](https://genderize.io/documentation), sdg classification using [aurora-sdg](https://aurora-universities.eu/sdg-research/sdg-api/)

This script needs to be written in a way so that it can run every month using  the latest data.

## Processing steps

* the folder ./data/ is put in .gitignore to prevent that bulk datais sent to a code repository. So make sure that folder exists, and mkdir if not exists. 
* The script downloads the lastest Data Dump Tar file from one selected community. See https://doi.org/10.5281/zenodo.3974604 for the latest list. In our case the Aurora tar file. https://zenodo.org/records/14887484/files/aurora.tar?download=1
  * Use the json record of zenodo to get to the latest record, and fetch the download link of the aurora.tar file. for example : https://zenodo.org/records/14887484/export/json or https://zenodo.org/api/records/14887484/versions/latest 
  Make the tar filename a variable, so it can be used for multiple community dumps.
  Download the tar file in a target folder ./data/{filename+timestamp}/ where a subfolder is created using the filename and the timestamp. Make this also as a  variable to use later on.
* Extract the tar file, to the compressed .json.gz files and put these in target folder ./data/{filename+timestamp}/01-extracted/
* Transform the compressed .json.gz files into a single .parquet file in target folder ./data/{filename+timestamp}/02-transformed/
Use instructions in sections "Processing JSON files with DuckDB" and "Full dataset, bit by bit" and "Splitting and Processing JSON Files in Batches" https://github.com/mosart/OpenAIRE-tools/blob/main/duckdb-querying.ipynb to start with. (be aware of error messages, and fix the issues to get all the data in)
* Extract the SQL schema (schema-datadump.sql) from the .parquet file and put it in target folder ./data/{filename+timestamp}/02-transformed/ This is needed for further processing of the records with DuckDB later on.
Use instructions in section "Extracting Schema from Parquet File" https://github.com/mosart/OpenAIRE-tools/blob/main/duckdb-querying.ipynb to start with.
* Query to get all identifiers: openaire id, doi, isbn, hdl, etc.
* **Get Altmetric data:**
* Extract the Altmetric data using the Identifiers. put that in target folder ./data/{filename+timestamp}/03-altmetric-extracted/
* Transform the Altmetric data to a single .parquet file, with the identifiers. put that in target folder ./data/{filename+timestamp}/04-altmetric-transformed/ This way duckDB can make a join when querying over multiple parquet files.
* Extract the SQL schema (schema-altmetric.sql) from the .parquet file and put it in target folder ./data/{filename+timestamp}/04-altmetric-transformed/
* **Get Overton data:** Repeat the altmetric steps, bun than for Overton.
* **Get Gender data** query for the Author names and country codes, and run them over the gerderize api
* **Get SDG data** query for the abstracts, and run abstracs larger than 100 tokens over the aurora-SDG api.

ss

In [3]:
! pip install pandas

Defaulting to user installation because normal site-packages is not writeable
Collecting pandas
  Downloading pandas-2.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (91 kB)
Collecting numpy>=1.22.4 (from pandas)
  Downloading numpy-2.2.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (62 kB)
Collecting tzdata>=2022.7 (from pandas)
  Downloading tzdata-2025.2-py2.py3-none-any.whl.metadata (1.4 kB)
Downloading pandas-2.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.3 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.3/12.3 MB[0m [31m142.3 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading numpy-2.2.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.8 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m16.8/16.8 MB[0m [31m185.3 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading tzdata-2025.2-py2.py3-none-any.whl (347 kB)
Installing collected packages: tzdata, numpy, pandas
[2K 

In [5]:
import requests
import json

# Fetch the JSON data from the URL
url = "https://zenodo.org/api/records/14887484/versions/latest"
response = requests.get(url)
data = response.json()

# Extract the files information
files = data.get("files", [])

# Create a list of dictionaries for the .tar files
tar_files = []
for file in files:
    if file["key"].endswith(".tar"):
        tar_files.append({
            "File Name": file["key"],
            "Checksum": file["checksum"],
            "Size (Readable)": f"{file['size'] / (1024**3):.2f} GB",  # Convert bytes to GB
            "Download Link": file["links"]["self"]
        })

# print the tar files
# If no tar files found, print a message
if not tar_files:
    print("No .tar files found in the dataset.")
else:
    print(f"Found {len(tar_files)} .tar files in the dataset.")
    print("Details of .tar files:")
    print(tar_files)

Found 37 .tar files in the dataset.
Details of .tar files:
[{'File Name': 'energy-planning_1.tar', 'Checksum': 'md5:0a2f551db46a9e629bb1d0a0098ae5cd', 'Size (Readable)': '6.99 GB', 'Download Link': 'https://zenodo.org/api/records/14887484/files/energy-planning_1.tar/content'}, {'File Name': 'edih-adria_1.tar', 'Checksum': 'md5:23559bed5a9023398b431777bdc8a126', 'Size (Readable)': '5.86 GB', 'Download Link': 'https://zenodo.org/api/records/14887484/files/edih-adria_1.tar/content'}, {'File Name': 'uarctic_1.tar', 'Checksum': 'md5:302e3844ebd041c5f4ed94505eb9a285', 'Size (Readable)': '9.75 GB', 'Download Link': 'https://zenodo.org/api/records/14887484/files/uarctic_1.tar/content'}, {'File Name': 'netherlands_1.tar', 'Checksum': 'md5:d1416c058b3961483aac340750ea8726', 'Size (Readable)': '3.91 GB', 'Download Link': 'https://zenodo.org/api/records/14887484/files/netherlands_1.tar/content'}, {'File Name': 'knowmad_1.tar', 'Checksum': 'md5:a79573a02f2c9a9d65c33b3f3a2eaab9', 'Size (Readable)': 

In [7]:
# Create a DataFrame to hold the tar files information for later use.

import pandas as pd

# Convert the list of dictionaries to a DataFrame
df_tar_files = pd.DataFrame(tar_files)

# Print the DataFrame
print(df_tar_files)

                     File Name                              Checksum  \
0        energy-planning_1.tar  md5:0a2f551db46a9e629bb1d0a0098ae5cd   
1             edih-adria_1.tar  md5:23559bed5a9023398b431777bdc8a126   
2                uarctic_1.tar  md5:302e3844ebd041c5f4ed94505eb9a285   
3            netherlands_1.tar  md5:d1416c058b3961483aac340750ea8726   
4                knowmad_1.tar  md5:a79573a02f2c9a9d65c33b3f3a2eaab9   
5              argo-france.tar  md5:2ce6b0fcc6f876b600207759a0dc9758   
6                   civica.tar  md5:d2f24bbef06809a91d124f0b07cb1034   
7                 covid-19.tar  md5:3b741e8138f39932ca6c13ca106fe5d3   
8                   aurora.tar  md5:9b6a8f38cd6f0ce16a85dfc020c220bf   
9                    dh-ch.tar  md5:dbebdcc8ad7fd1dc7894fe03ebe2a978   
10        heritage-science.tar  md5:ffd2537b08c58d78eea4bc23a99b3c07   
11                     dth.tar  md5:643894810ac8bfce0f8273cf40d05a7a   
12                  egrise.tar  md5:2f52b49fa8bd983bcf6884d6c4f5

In [1]:
# Variables

# URL of the tar file
url = "https://zenodo.org/records/14887484/files/aurora.tar?download=1"

# extract the file name from the URL
file_name = url.split("/")[-1]

# remove everything after the question mark
file_name = file_name.split("?")[0]


# Path to save the downloaded tar file using file_name variable
download_path = f"./data/01_input/{file_name}"

# Path to save the extracted files
extraction_path = "./data/02_extracted"

print(f"URL: {url}")
print(f"File Name: {file_name}")
print(f"Download Path: {download_path}")
print(f"Extraction Path: {extraction_path}")

URL: https://zenodo.org/records/14887484/files/aurora.tar?download=1
File Name: aurora.tar
Download Path: ./data/01_input/aurora.tar
Extraction Path: ./data/02_extracted


Download the tar file

In [10]:
import requests
import os

# Create the directory if it doesn't exist
os.makedirs(os.path.dirname(download_path), exist_ok=True)

# Download the tar file
response = requests.get(url)
with open(download_path, 'wb') as file:
    file.write(response.content)

print("Download complete.")

KeyboardInterrupt: 

extract the tar file

In [None]:
import os
import tarfile

# Create the directory if it doesn't exist
os.makedirs(extraction_path, exist_ok=True)

# Extract the tar file
with tarfile.open(download_path, 'r') as tar:
    tar.extractall(path=extraction_path)

print("Extraction complete.")
    

In [16]:
# List the extracted files
extracted_files = os.listdir(extraction_path)

# count he number of files in the extracted folder
num_files = len(extracted_path)
print(f"Number of files: {num_files}")

# print the first 5 files
print("First 5 files:")
for file in extracted_files[:5]:
    print(file) 

# print the added subdirectories
subdirectories = [file for file in extracted_files if os.path.isdir(os.path.join(extraction_path, file))]
print("Subdirectories:")
for subdirectory in subdirectories:
    print(subdirectory)

# print the latest added subdirectory based on date modified
latest_subdirectory = sorted(subdirectories, key=lambda x: os.path.getmtime(os.path.join(extraction_path, x)))[-1]
print(f"Latest subdirectory: {latest_subdirectory}")

# make varable for the path to the latest subdirectory
latest_extraction_path = os.path.join(extraction_path, latest_subdirectory)

# print the path of the latest extraction path
print(f"Latest extraction path: {latest_extraction_path}")



Number of files: 19
First 5 files:
aurora
Subdirectories:
aurora
Latest subdirectory: aurora
Latest extraction path: ./data/02_extracted/aurora


In [None]:
import duckdb
import os

# Connect to an in-memory DuckDB database
con = duckdb.connect()

# Get the first 3 gzipped JSON files from the latest_extraction_path
gz_files = [file for file in os.listdir(latest_extraction_path) if file.endswith(".gz")][:3]

# Load the gzipped JSON files directly into DuckDB
for gz_file in gz_files:
    gz_file_path = os.path.join(latest_extraction_path, gz_file)
    con.execute("CREATE OR REPLACE TEMP TABLE temp_table AS SELECT * FROM read_json_auto(?, compression='gzip')", [gz_file_path])

# Generate the SQL schema from the temporary table
schema = con.execute("DESCRIBE temp_table").fetchall()

# Print the schema
print("SQL Schema:")
for column in schema:
    print(f"{column[0]}: {column[1]}")

# Close the DuckDB connection
con.close()

SQL Schema:
authors: STRUCT(fullName VARCHAR, "name" VARCHAR, rank BIGINT, surname VARCHAR, pid STRUCT(id STRUCT(scheme VARCHAR, "value" VARCHAR), provenance STRUCT(provenance VARCHAR, trust VARCHAR)))[]
collectedFrom: STRUCT("key" VARCHAR, "value" VARCHAR)[]
communities: STRUCT(code VARCHAR, "label" VARCHAR, provenance STRUCT(provenance VARCHAR, trust VARCHAR)[])[]
contributors: JSON[]
countries: JSON[]
coverages: JSON[]
dateOfCollection: VARCHAR
descriptions: VARCHAR[]
documentationUrls: JSON[]
formats: JSON[]
id: VARCHAR
indicators: STRUCT(citationImpact STRUCT(citationClass VARCHAR, citationCount DOUBLE, impulse DOUBLE, impulseClass VARCHAR, influence DOUBLE, influenceClass VARCHAR, popularity DOUBLE, popularityClass VARCHAR), usageCounts STRUCT(downloads BIGINT, "views" BIGINT))
instances: STRUCT(alternateIdentifiers STRUCT(scheme VARCHAR, "value" VARCHAR)[], collectedFrom STRUCT("key" VARCHAR, "value" VARCHAR), hostedBy STRUCT("key" VARCHAR, "value" VARCHAR), pids STRUCT(scheme V