# OpenAIRE Community Data Dump Handling: Extraction, Transformation and Enrichment

In this notebook we start with one of the [OpenAIRE Community Subgraphs](https://graph.openaire.eu/docs/downloads/subgraphs) and enrich that information for further analysis.

This process extracts an [OpenAIRE community data dump from Zenodo](https://doi.org/10.5281/zenodo.3974604), transforms it into a portable .parquet file (updatable with changes, for time series analysis) that can be queried with DuckDB, and enriches it with additional data (also in .parquet, for join queries).

This additional data can be societal impact data from [Altmetric.com](https://details-page-api-docs.altmetric.com/) or [Overton.io](https://app.overton.io/swagger.php), gender data from [genderize.io](https://genderize.io/documentation), or SDG classifications from [aurora-sdg](https://aurora-universities.eu/sdg-research/sdg-api/).

This script must be written so that it can run every month using the latest data.
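
For an unattended monthly run, the tar file could be picked by community name rather than by an interactive prompt. The sketch below is one possible approach, not part of the notebook: `find_community_tars` is a hypothetical helper, and it assumes the Zenodo file entries carry a `key` field (as used in Step 1) and that split dumps follow the `<community>_<n>.tar` naming seen in the file listing.

```python
import re


def find_community_tars(files, community):
    """Return file entries named '<community>.tar' or '<community>_<n>.tar', sorted by key.

    `files` is a list of dicts shaped like the "files" list of a Zenodo record.
    """
    pattern = re.compile(rf"^{re.escape(community)}(_\d+)?\.tar$")
    matches = [entry for entry in files if pattern.match(entry.get("key", ""))]
    return sorted(matches, key=lambda entry: entry["key"])


# Illustrative entries only; a real record also carries size, links and checksum.
sample_files = [
    {"key": "knowmad_2.tar"},
    {"key": "aurora.tar"},
    {"key": "knowmad_1.tar"},
    {"key": "eutopia.tar"},
]

aurora_parts = [entry["key"] for entry in find_community_tars(sample_files, "aurora")]
knowmad_parts = [entry["key"] for entry in find_community_tars(sample_files, "knowmad")]
print(aurora_parts)
print(knowmad_parts)
```

Returning a list rather than a single entry keeps multi-part dumps such as `knowmad_1.tar` and `knowmad_2.tar` together.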

## Processing steps

* The folder ./data/ is listed in .gitignore to prevent bulk data from being sent to a code repository. Make sure that folder exists, and create it (mkdir) if it does not.
* The script downloads the latest data dump tar file for one selected community. See https://doi.org/10.5281/zenodo.3974604 for the latest list. In our case this is the Aurora tar file: https://zenodo.org/records/14887484/files/aurora.tar?download=1
  * Use the JSON record from Zenodo to get to the latest record, and fetch the download link of the aurora.tar file. For example: https://zenodo.org/records/14887484/export/json or https://zenodo.org/api/records/14887484/versions/latest
  * Make the tar filename a variable, so it can be used for multiple community dumps.
  * Download the tar file to a target folder ./data/{filename+timestamp}/, where a subfolder is created using the filename and the timestamp. Make this a variable as well, to use later on.
* Extract the tar file to obtain the compressed .json.gz files, and put these in target folder ./data/{filename+timestamp}/01-extracted/
* Transform the compressed .json.gz files into a single .parquet file in target folder ./data/{filename+timestamp}/02-transformed/
  Use the instructions in the sections "Processing JSON files with DuckDB", "Full dataset, bit by bit" and "Splitting and Processing JSON Files in Batches" of https://github.com/mosart/OpenAIRE-tools/blob/main/duckdb-querying.ipynb as a starting point. (Be aware of error messages, and fix the issues to get all the data in.)
* Extract the SQL schema (schema-datadump.sql) from the .parquet file and put it in target folder ./data/{filename+timestamp}/02-transformed/. This is needed for further processing of the records with DuckDB later on.
  Use the instructions in the section "Extracting Schema from Parquet File" of https://github.com/mosart/OpenAIRE-tools/blob/main/duckdb-querying.ipynb as a starting point.
* Query to get all identifiers: openaire id, doi, isbn, hdl, etc.
* **Get Altmetric data:**
  * Extract the Altmetric data using the identifiers. Put that in target folder ./data/{filename+timestamp}/03-altmetric-extracted/
  * Transform the Altmetric data to a single .parquet file, with the identifiers. Put that in target folder ./data/{filename+timestamp}/04-altmetric-transformed/. This way DuckDB can make a join when querying over multiple parquet files.
  * Extract the SQL schema (schema-altmetric.sql) from the .parquet file and put it in target folder ./data/{filename+timestamp}/04-altmetric-transformed/
* **Get Overton data:** Repeat the Altmetric steps, but for Overton.
* **Get Gender data:** Query for the author names and country codes, and run them through the genderize API.
* **Get SDG data:** Query for the abstracts, and run abstracts longer than 100 tokens through the aurora-SDG API.
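
The folder layout described in the steps above can be sketched as a small helper that derives every stage folder from the tar filename and a timestamp. This is only an illustration under assumptions: `build_run_paths` and `STAGES` are hypothetical names, and joining filename and timestamp with a hyphen is a choice made here, since the steps only say `{filename+timestamp}`.

```python
from pathlib import Path

# Stage folder names taken from the processing steps above.
STAGES = [
    "01-extracted",
    "02-transformed",
    "03-altmetric-extracted",
    "04-altmetric-transformed",
]


def build_run_paths(tar_filename, timestamp, base="./data"):
    """Return a dict mapping 'run' and each stage name to its Path (nothing is created)."""
    stem = tar_filename[:-4] if tar_filename.endswith(".tar") else tar_filename
    run_dir = Path(base) / f"{stem}-{timestamp}"  # hyphen separator is an assumption
    paths = {"run": run_dir}
    for stage in STAGES:
        paths[stage] = run_dir / stage
    return paths


paths = build_run_paths("aurora.tar", "2025-02-19")
print(paths["02-transformed"].as_posix())

# To create the folders on disk, something like this could follow:
# for folder in paths.values():
#     folder.mkdir(parents=True, exist_ok=True)
```

Keeping all paths in one dict means later steps can refer to a stage by name instead of rebuilding path strings.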


## Step 1: Get the latest Community Dump File

In [14]:
import requests
import json

# Fetch the JSON data from the URL
url = "https://zenodo.org/api/records/14887484/versions/latest"
response = requests.get(url)
data = response.json()

# Extract the files information
files = data.get("files", [])

# Create a list of dictionaries for the .tar files
tar_files = []
for file in files:
    if file["key"].endswith(".tar"):
        tar_files.append({
            "filename": file["key"],
            "size": f"{file['size'] / (1024**3):.2f} GB",  # Convert bytes to GB
            "downloadlink": file["links"]["self"],
            "checksum": file["checksum"]
        })

# print the tar files
# If no tar files found, print a message
if not tar_files:
    print("No .tar files found in the dataset.")
else:
    print(f"Found {len(tar_files)} .tar files in the dataset.")
    print("Details of .tar files:")
    print(tar_files)

# get and print the publication date
publication_date = data.get("metadata", {}).get("publication_date", "Unknown")
print(f"Publication date: {publication_date}")
# get and print the DOI
doi = data.get("doi", "Unknown")
print(f"DOI: {doi}")
# get and print the title
title = data.get("title", "Unknown")
print(f"Title: {title}")

Found 37 .tar files in the dataset.
Details of .tar files:
[{'filename': 'energy-planning_1.tar', 'size': '6.99 GB', 'downloadlink': 'https://zenodo.org/api/records/14887484/files/energy-planning_1.tar/content', 'checksum': 'md5:0a2f551db46a9e629bb1d0a0098ae5cd'}, {'filename': 'edih-adria_1.tar', 'size': '5.86 GB', 'downloadlink': 'https://zenodo.org/api/records/14887484/files/edih-adria_1.tar/content', 'checksum': 'md5:23559bed5a9023398b431777bdc8a126'}, {'filename': 'uarctic_1.tar', 'size': '9.75 GB', 'downloadlink': 'https://zenodo.org/api/records/14887484/files/uarctic_1.tar/content', 'checksum': 'md5:302e3844ebd041c5f4ed94505eb9a285'}, {'filename': 'netherlands_1.tar', 'size': '3.91 GB', 'downloadlink': 'https://zenodo.org/api/records/14887484/files/netherlands_1.tar/content', 'checksum': 'md5:d1416c058b3961483aac340750ea8726'}, {'filename': 'knowmad_1.tar', 'size': '10.08 GB', 'downloadlink': 'https://zenodo.org/api/records/14887484/files/knowmad_1.tar/content', 'checksum': 'md5:

In [2]:
# Create a DataFrame to hold the tar files information for later use.

import pandas as pd

# Convert the list of dictionaries to a DataFrame
df_tar_files = pd.DataFrame(tar_files)

# Sort the DataFrame by filename alphabetically
df_tar_files = df_tar_files.sort_values(by='filename')

# Print the DataFrame
print(df_tar_files)

                      filename      size  \
5              argo-france.tar   0.00 GB   
8                   aurora.tar   1.73 GB   
22                  beopen.tar   0.20 GB   
6                   civica.tar   0.23 GB   
7                 covid-19.tar   2.03 GB   
23                  dariah.tar   0.02 GB   
9                    dh-ch.tar   1.16 GB   
11                     dth.tar   0.01 GB   
1             edih-adria_1.tar   5.86 GB   
12                  egrise.tar   0.02 GB   
25               elixir-gr.tar   0.01 GB   
0        energy-planning_1.tar   6.99 GB   
27                enermaps.tar   1.59 GB   
24              eu-conexus.tar   0.18 GB   
26                     eut.tar   0.21 GB   
15                 eutopia.tar   1.60 GB   
28                 forthem.tar   0.91 GB   
10        heritage-science.tar   0.03 GB   
29                   inria.tar   0.27 GB   
14               iperionhs.tar   0.00 GB   
4                knowmad_1.tar  10.08 GB   
19               knowmad_2.tar  

In [21]:
# Print a reindexed list of available tar files
print("Available tar files:")
print(df_tar_files[['filename', 'size']].reset_index())

import signal

# Function to handle timeout
def timeout_handler(signum, frame):
    raise TimeoutError

# Set the timeout handler for the input
# Note: signal.SIGALRM is only available on Unix-like systems
signal.signal(signal.SIGALRM, timeout_handler)
signal.alarm(10)  # Set the timeout to 10 seconds

try:
    # Ask the user to select a tar file by its index
    selected_index = int(input("Enter the index of the tar file you want to download: "))
except TimeoutError:
    print("No response received. Defaulting to index 1.")
    selected_index = 1
except ValueError:
    # Non-numeric input: fall back to the same default
    print("Invalid input. Defaulting to index 1.")
    selected_index = 1
finally:
    signal.alarm(0)  # Disable the alarm

# Get the selected tar file's download link and checksum
selected_file = df_tar_files.iloc[selected_index]
downloadlink = selected_file['downloadlink']
checksum = selected_file['checksum']

print(f"Selected file: {selected_file['filename']}")
print(f"Download link: {downloadlink}")
print(f"Checksum: {checksum}")

Available tar files:
    index                    filename      size
0       5             argo-france.tar   0.00 GB
1       8                  aurora.tar   1.73 GB
2      22                  beopen.tar   0.20 GB
3       6                  civica.tar   0.23 GB
4       7                covid-19.tar   2.03 GB
5      23                  dariah.tar   0.02 GB
6       9                   dh-ch.tar   1.16 GB
7      11                     dth.tar   0.01 GB
8       1            edih-adria_1.tar   5.86 GB
9      12                  egrise.tar   0.02 GB
10     25               elixir-gr.tar   0.01 GB
11      0       energy-planning_1.tar   6.99 GB
12     27                enermaps.tar   1.59 GB
13     24              eu-conexus.tar   0.18 GB
14     26                     eut.tar   0.21 GB
15     15                 eutopia.tar   1.60 GB
16     28                 forthem.tar   0.91 GB
17     10        heritage-science.tar   0.03 GB
18     29                   inria.tar   0.27 GB
19     14          

In [36]:
# Path Variables

# Extract the file name from the selected file
file_name = selected_file['filename']    

# Path to save the downloaded tar file using file_name variable
download_path = f"./data/{publication_date}/01_input/{file_name}"

# Create the folder name by removing the .tar extension
folder_name = selected_file['filename'].replace('.tar', '')

# Path to save the extracted files using the file_name variable without the .tar extension
extraction_path = f"./data/{publication_date}/02_extracted/{folder_name}"


print(f"File Name: {file_name}")
print(f"Download Path File: {download_path}")
print(f"Folder Name: {folder_name}")
print(f"Extraction Path Folder: {extraction_path}")

File Name: argo-france.tar
Download Path File: ./data/2025-02-19/01_input/argo-france.tar
Folder Name: argo-france
Extraction Path Folder: ./data/2025-02-19/02_extracted/argo-france


### Download the tar file

In [25]:
import os

# Ensure the directory for the download path exists
os.makedirs(os.path.dirname(download_path), exist_ok=True)

# Check if the file already exists
if not os.path.exists(download_path):
    # Get the file size in bytes
    file_size_bytes = float(selected_file['size'].split()[0]) * (1024**3)  # Convert GB to bytes
    print(f"Downloading file: {selected_file['filename']} ({selected_file['size']})")
    print(f"Download URL: {downloadlink}")
    
    # Estimate download duration assuming an average speed of 10 MB/s
    avg_speed = 10 * (1024**2)  # 10 MB/s in bytes
    estimated_duration = file_size_bytes / avg_speed
    print(f"Estimated download time: {estimated_duration:.2f} seconds")
    
    # Download the selected tar file in chunks; stop early on an HTTP error
    response = requests.get(downloadlink, stream=True)
    response.raise_for_status()
    with open(download_path, 'wb') as file:
        for chunk in response.iter_content(chunk_size=8192):
            file.write(chunk)
else:
    print(f"File already exists: {download_path}")
    print(f"Download URL: {downloadlink}")

print(f"Download complete: {download_path}")


File already exists: ./data/2025-02-19/01_input/argo-france.tar
Download URL: https://zenodo.org/api/records/14887484/files/argo-france.tar/content
Download complete: ./data/2025-02-19/01_input/argo-france.tar
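
The size column above is rounded to two decimals in GB, so small dumps such as argo-france show as 0.00 GB, and a download-time estimate parsed back from that string becomes zero. A minimal sketch of an alternative, assuming the raw byte count from the Zenodo `size` field is kept alongside the formatted string (`human_size` and `estimate_seconds` are hypothetical helpers):

```python
def human_size(num_bytes):
    """Format a byte count with a unit that keeps small files readable."""
    units = ["B", "KB", "MB", "GB", "TB"]
    size = float(num_bytes)
    for unit in units:
        # Stop at the first unit where the value drops below 1024, or at the largest unit.
        if size < 1024 or unit == units[-1]:
            return f"{size:.2f} {unit}"
        size /= 1024


def estimate_seconds(num_bytes, bytes_per_second=10 * 1024**2):
    """Estimate download time from the raw byte count (default: 10 MB/s, as in the cell above)."""
    return num_bytes / bytes_per_second


small = human_size(1536)
medium = human_size(3 * 1024**2)
seconds = estimate_seconds(20 * 1024**2)
print(small, medium, seconds)
```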


In [26]:
import hashlib

# Function to calculate the checksum of a file
def calculate_checksum(file_path, algorithm):
    hash_func = hashlib.new(algorithm)
    with open(file_path, 'rb') as f:
        while chunk := f.read(8192):
            hash_func.update(chunk)
    return hash_func.hexdigest()

# Extract the checksum algorithm and value
checksum_parts = checksum.split(':', 1)
checksum_algorithm = checksum_parts[0]
expected_checksum = checksum_parts[1]

# Calculate the checksum of the downloaded file
calculated_checksum = calculate_checksum(download_path, algorithm=checksum_algorithm)

# Compare the calculated checksum with the provided checksum
if calculated_checksum == expected_checksum:
    print("Checksum verification passed.")
else:
    print("Checksum verification failed.")
    print(f"Expected: {expected_checksum}")
    print(f"Calculated: {calculated_checksum}")

Checksum verification passed.


## Step 2: Extract the tar file

In [27]:
import os
import tarfile

In [31]:

# Check if the extraction directory already exists and contains files
if os.path.exists(extraction_path) and os.listdir(extraction_path):
    print("The tar file has already been extracted.")
else:
    # Create the directory if it doesn't exist
    os.makedirs(extraction_path, exist_ok=True)

    # Extract the tar file in the parent directory of the extraction_path - because the tar file contains a folder structure repeating the name of the tar file
    print(f"Extracting {download_path} to {extraction_path}...")
    parent_extraction_path = os.path.dirname(extraction_path)
    with tarfile.open(download_path, 'r') as tar:
        tar.extractall(path=parent_extraction_path)

    print("Extraction complete.")
    print(f"Files extracted to: {extraction_path}")
    

Extracting ./data/2025-02-19/01_input/argo-france.tar to ./data/2025-02-19/02_extracted/argo-france...
Extraction complete.
Files extracted to: ./data/2025-02-19/02_extracted/argo-france
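
The extraction cell relies on the tar file containing a single top-level folder named after the tar file. That assumption can be checked before extracting; the sketch below lists the top-level names inside a tar archive. `top_level_names` is a hypothetical helper, and the demo archive with its member names is built on the fly purely for illustration.

```python
import io
import os
import tarfile
import tempfile
from pathlib import PurePosixPath


def top_level_names(tar_path):
    """Return the sorted set of first path components found in the archive."""
    names = set()
    with tarfile.open(tar_path, "r") as tar:
        for member in tar.getmembers():
            parts = PurePosixPath(member.name).parts
            if parts:
                names.add(parts[0])
    return sorted(names)


# Build a tiny demo archive with two members under one folder.
with tempfile.TemporaryDirectory() as tmp:
    demo_tar = os.path.join(tmp, "demo.tar")
    with tarfile.open(demo_tar, "w") as tar:
        for name in ("demo/part-00000.json.gz", "demo/part-00001.json.gz"):
            payload = b"{}"
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    found = top_level_names(demo_tar)

print(found)
```

If the result is anything other than a single name equal to `folder_name`, extracting into the parent directory would not produce the expected `extraction_path`.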


In [78]:
# List the extracted files
extracted_files = os.listdir(extraction_path)

# add the path to the extracted files
extracted_files_with_path = [os.path.join(extraction_path, file) for file in extracted_files]

# count the number of files in the extracted folder
num_files = len(extracted_files)
print(f"Number of files: {num_files}")

# print the first 5 files
print("First 5 files:")
for file in extracted_files[:5]:
    print(file) 

# make a DataFrame for the extracted files
df_extracted_files = pd.DataFrame(extracted_files, columns=['filename'])
# Sort the DataFrame by filename alphabetically
df_extracted_files = df_extracted_files.sort_values(by='filename')
# Print the DataFrame
print(df_extracted_files)

# print the dimensions of the DataFrame
print(f"DataFrame dimensions: {df_extracted_files.shape}")

# Pick up to 5 random files for testing and keep their full paths in a variable for later use
import random
random_files = random.sample(extracted_files, min(5, len(extracted_files)))
random_files_with_path = [os.path.join(extraction_path, file) for file in random_files]
print("Randomly selected files with full paths for testing:")
for file in random_files_with_path:
    print(file)

# one random file for later use
random_file = random.choice(extracted_files)
print(f"Random file selected for later use: {random_file}")
# Define the path to the random file
random_file_path = os.path.join(extraction_path, random_file)
print(f"Path to the random file: {random_file_path}")
# Check if the random file exists
if os.path.exists(random_file_path):
    print(f"The random file exists: {random_file_path}")
else:
    print(f"The random file does not exist: {random_file_path}")




Number of files: 285
First 5 files:
part-00000-2c0de614-bb18-4931-bd6a-64f101a27baf-c000.json.gz
part-00001-2c0de614-bb18-4931-bd6a-64f101a27baf-c000.json.gz
part-00002-2c0de614-bb18-4931-bd6a-64f101a27baf-c000.json.gz
part-00003-2c0de614-bb18-4931-bd6a-64f101a27baf-c000.json.gz
part-00004-2c0de614-bb18-4931-bd6a-64f101a27baf-c000.json.gz
                                              filename
0    part-00000-2c0de614-bb18-4931-bd6a-64f101a27ba...
1    part-00001-2c0de614-bb18-4931-bd6a-64f101a27ba...
2    part-00002-2c0de614-bb18-4931-bd6a-64f101a27ba...
3    part-00003-2c0de614-bb18-4931-bd6a-64f101a27ba...
4    part-00004-2c0de614-bb18-4931-bd6a-64f101a27ba...
..                                                 ...
280  part-00580-2c0de614-bb18-4931-bd6a-64f101a27ba...
281  part-00583-2c0de614-bb18-4931-bd6a-64f101a27ba...
282  part-00592-2c0de614-bb18-4931-bd6a-64f101a27ba...
283  part-00618-2c0de614-bb18-4931-bd6a-64f101a27ba...
284  part-00736-2c0de614-bb18-4931-bd6a-64f101a27ba...

## Step 3: Get a data sample to generate a Parquet file and the SQL schema
We do this before we process the bulk of the data.
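
Before handing files to DuckDB, it can help to peek at the field names in a part file, since different part files may expose different columns. The sketch below does this with the standard library only. It assumes each .json.gz file holds one JSON object per line, which is not confirmed by this notebook; `peek_keys` is a hypothetical helper and the demo file is generated for illustration.

```python
import gzip
import json
import os
import tempfile


def peek_keys(path, max_records=100):
    """Collect the union of top-level keys from the first `max_records` JSON lines."""
    keys = set()
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for index, line in enumerate(handle):
            if index >= max_records:
                break
            line = line.strip()
            if line:
                keys.update(json.loads(line).keys())
    return sorted(keys)


# Demo file with two records that do not share all keys.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "part-demo.json.gz")
    with gzip.open(demo_path, "wt", encoding="utf-8") as handle:
        handle.write(json.dumps({"id": "a", "mainTitle": "x"}) + "\n")
        handle.write(json.dumps({"id": "b", "container": {"name": "y"}}) + "\n")
    found_keys = peek_keys(demo_path)

print(found_keys)
```

Records with differing keys are the reason the DuckDB calls in this step pass `union_by_name=true`.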

In [80]:
import duckdb

transformation_folder_path = f"./data/{publication_date}/03_transformed/{folder_name}"

# Ensure the target directory exists
os.makedirs(transformation_folder_path, exist_ok=True)

# for testing: Define and print the target output sample file path
sample_file = f"{transformation_folder_path}/{folder_name}-sample.parquet"
print(f"Output file path: {sample_file}")

# for testing: define and print the target output sample file for the multiple selected random sample files
multiple_sample_file = f"{transformation_folder_path}/{folder_name}-multiple-sample.parquet"
print(f"Multiple sample file path: {multiple_sample_file}")

# for production: define and print the target output master file for all extracted files
master_file = f"{transformation_folder_path}/{folder_name}-master.parquet"
print(f"Master file path: {master_file}")


Output file path: ./data/2025-02-19/03_transformed/argo-france/argo-france-sample.parquet
Multiple sample file path: ./data/2025-02-19/03_transformed/argo-france/argo-france-multiple-sample.parquet
Master file path: ./data/2025-02-19/03_transformed/argo-france/argo-france-master.parquet


#### for testing: this part is for running on a single sample

In [72]:
# Connect to an in-memory DuckDB database
con = duckdb.connect()

# Use DuckDB to read one extracted JSON file and save all of its rows as a Parquet file
con.sql(f'''
    COPY (
        SELECT *
        FROM read_json('{random_file_path}', sample_size=-1, union_by_name=true)
    )
    TO '{sample_file}' (FORMAT parquet, COMPRESSION gzip)
''')

print(f"Transformed data saved to: {sample_file}")
print(f"File size: {os.path.getsize(sample_file) / (1024**2):.2f} MB")

# Close the DuckDB connection
con.close()

Transformed data saved to: ./data/2025-02-19/03_transformed/argo-france/argo-france-sample.parquet
File size: 0.03 MB


In [None]:
# schema path
schema_file_path = f"{transformation_folder_path}/{folder_name}-schema.sql"

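# Print the schema file path
print(f"Schema file path: {schema_file_path}")

# Write the DESCRIBE output of the Parquet file to the schema file.
# Note: as the output below shows, this file contains CSV rows despite the .sql extension.
duckdb.sql(f'''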
    COPY (
        SELECT *
        FROM (DESCRIBE '{sample_file}')
    )
    TO '{schema_file_path}'
''')
# check if the schema file exists
if os.path.exists(schema_file_path):
    print(f"Schema file exists: {schema_file_path}")
else:
    print(f"Schema file does not exist: {schema_file_path}")
# Print the schema file content
with open(schema_file_path, 'r') as schema_file:
    schema_content = schema_file.read()
    print("Schema file content:")
    print(schema_content)


Schema file path: ./data/2025-02-19/03_transformed/argo-france/argo-france-schema.sql
Schema file exists: ./data/2025-02-19/03_transformed/argo-france/argo-france-schema.sql
Schema file content:
column_name,column_type,null,key,default,extra
authors,"STRUCT(fullName VARCHAR, ""name"" VARCHAR, rank BIGINT, surname VARCHAR, pid STRUCT(id STRUCT(scheme VARCHAR, ""value"" VARCHAR), provenance STRUCT(provenance VARCHAR, trust VARCHAR)))[]",YES,,,
bestAccessRight,"STRUCT(code VARCHAR, ""label"" VARCHAR, scheme VARCHAR)",YES,,,
collectedFrom,"STRUCT(""key"" VARCHAR, ""value"" VARCHAR)[]",YES,,,
communities,"STRUCT(code VARCHAR, ""label"" VARCHAR, provenance STRUCT(provenance VARCHAR, trust VARCHAR)[])[]",YES,,,
contributors,VARCHAR[],YES,,,
countries,"STRUCT(code VARCHAR, ""label"" VARCHAR, provenance STRUCT(provenance VARCHAR, trust VARCHAR))[]",YES,,,
coverages,JSON[],YES,,,
dateOfCollection,VARCHAR,YES,,,
descriptions,VARCHAR[],YES,,,
formats,VARCHAR[],YES,,,
id,VARCHAR,YES,,,
indicators,"
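
The schema file above holds the `DESCRIBE` result as CSV rather than SQL. If a `CREATE TABLE` statement is wanted, as the processing steps suggest, the CSV rows could be converted along these lines. This is a sketch only: `describe_csv_to_ddl` is a hypothetical helper, the table name is made up, and the generated statement should be checked against DuckDB before relying on it. The sample rows are copied from the output above.

```python
import csv
import io


def describe_csv_to_ddl(csv_text, table_name):
    """Turn DESCRIBE output (CSV with column_name and column_type) into a CREATE TABLE statement."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = [f'    "{row["column_name"]}" {row["column_type"]}' for row in reader]
    return f'CREATE TABLE "{table_name}" (\n' + ",\n".join(columns) + "\n);"


sample = '''column_name,column_type,null,key,default,extra
bestAccessRight,"STRUCT(code VARCHAR, ""label"" VARCHAR, scheme VARCHAR)",YES,,,
contributors,VARCHAR[],YES,,,
id,VARCHAR,YES,,,
'''

ddl = describe_csv_to_ddl(sample, "datadump")
print(ddl)
```

The `csv` module unescapes the doubled quotes, so nested field names such as `"label"` come out correctly quoted in the type definition.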

In [58]:
# Connect to an in-memory DuckDB database
con = duckdb.connect()

# Count the number of records in the Parquet file
record_count = con.sql(f'''
    SELECT COUNT(*)
    FROM read_parquet('{sample_file}')
''').fetchone()[0]

# Print the record count
print(f"Number of records in the Parquet file: {record_count}")

# Close the DuckDB connection
con.close()

Number of records in the Parquet file: 3


In [59]:
# Connect to an in-memory DuckDB database
con = duckdb.connect()

# Query the titles from the Parquet file
titles = con.sql(f'''
    SELECT mainTitle
    FROM read_parquet('{sample_file}')
''').fetchall()

# Print the titles
print("Titles in the Parquet file:")
for title in titles:
    print(title[0])

# Close the DuckDB connection
con.close()

Titles in the Parquet file:
World Ocean Database 2013. 
Arctic mid-winter phytoplankton growth revealed by autonomous profilers
A shift in the ocean circulation has warmed the subpolar North Atlantic Ocean since 2016


#### for testing: this part is for running on a random sample of multiple files


In [71]:
# Connect to an in-memory DuckDB database
con = duckdb.connect()

# Join the list of file paths into a comma-separated string of quoted paths
file_paths = ','.join(f"'{file}'" for file in random_files_with_path)

# Use DuckDB to read the randomly selected JSON files and save all their rows as one Parquet file
con.sql(f'''
    COPY (
        SELECT *
        FROM read_json([{file_paths}], sample_size=-1, union_by_name=true)
    )
    TO '{multiple_sample_file}' (FORMAT parquet, COMPRESSION gzip)
''')

print(f"Transformed data saved to: {multiple_sample_file}")
print(f"File size: {os.path.getsize(multiple_sample_file) / (1024**2):.2f} MB")

# Close the DuckDB connection
con.close()

Transformed data saved to: ./data/2025-02-19/03_transformed/argo-france/argo-france-multiple-sample.parquet
File size: 0.07 MB


In [74]:
# schema path
schema_file_path = f"{transformation_folder_path}/{folder_name}-schema.sql"

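# Print the schema file path
print(f"Schema file path: {schema_file_path}")

# Write the DESCRIBE output of the Parquet file to the schema file.
# Note: as the output below shows, this file contains CSV rows despite the .sql extension.
duckdb.sql(f'''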
    COPY (
        SELECT *
        FROM (DESCRIBE '{multiple_sample_file}')
    )
    TO '{schema_file_path}'
''')
# check if the schema file exists
if os.path.exists(schema_file_path):
    print(f"Schema file exists: {schema_file_path}")
else:
    print(f"Schema file does not exist: {schema_file_path}")
# Print the schema file content
with open(schema_file_path, 'r') as schema_file:
    schema_content = schema_file.read()
    print("Schema file content:")
    print(schema_content)


Schema file path: ./data/2025-02-19/03_transformed/argo-france/argo-france-schema.sql
Schema file exists: ./data/2025-02-19/03_transformed/argo-france/argo-france-schema.sql
Schema file content:
column_name,column_type,null,key,default,extra
authors,"STRUCT(fullName VARCHAR, ""name"" VARCHAR, pid STRUCT(id STRUCT(scheme VARCHAR, ""value"" VARCHAR), provenance STRUCT(provenance VARCHAR, trust VARCHAR)), rank BIGINT, surname VARCHAR)[]",YES,,,
bestAccessRight,"STRUCT(code VARCHAR, ""label"" VARCHAR, scheme VARCHAR)",YES,,,
collectedFrom,"STRUCT(""key"" VARCHAR, ""value"" VARCHAR)[]",YES,,,
communities,"STRUCT(code VARCHAR, ""label"" VARCHAR, provenance STRUCT(provenance VARCHAR, trust VARCHAR)[])[]",YES,,,
container,"STRUCT(issnOnline VARCHAR, ""name"" VARCHAR, sp VARCHAR, vol VARCHAR, issnPrinted VARCHAR, ep VARCHAR)",YES,,,
contributors,VARCHAR[],YES,,,
countries,"STRUCT(code VARCHAR, ""label"" VARCHAR, provenance STRUCT(provenance VARCHAR, trust VARCHAR))[]",YES,,,
coverages,JSON[],YE

In [76]:
# Connect to an in-memory DuckDB database
con = duckdb.connect()

# Count the number of records in the Parquet file
record_count = con.sql(f'''
    SELECT COUNT(*)
    FROM read_parquet('{multiple_sample_file}')
''').fetchone()[0]

# Print the record count
print(f"Number of records in the Parquet file: {record_count}")

# Close the DuckDB connection
con.close()

Number of records in the Parquet file: 13


In [77]:
# Connect to an in-memory DuckDB database
con = duckdb.connect()

# Query the titles from the Parquet file
titles = con.sql(f'''
    SELECT mainTitle
    FROM read_parquet('{multiple_sample_file}')
''').fetchall()

# Print the titles
print("Titles in the Parquet file:")
for title in titles:
    print(title[0])

# Close the DuckDB connection
con.close()

Titles in the Parquet file:
Climatic, Decadal, and Interannual Variability in the Upper Layer of the Mediterranean Sea Using Remotely Sensed and In-Situ Data
Achievements and Prospects of Global Broadband Seismographic Networks After 30 Years of Continuous Geophysical Observations
Vortex–wall interaction on the <i>β</i>-plane and the generation of deep submesoscale cyclones by internal Kelvin Waves–current interactions
A global probabilistic study of the ocean heat content low‐frequency variability: Atmospheric forcing versus oceanic chaos
Particulate concentration and seasonal dynamics in the mesopelagic ocean based on the backscattering coefficient measured with Biogeochemical‐Argo floats
Observing the full ocean volume using Deep Argo floats
Using climatological salinities for estimating the oxygen content in ARGO floats
Aliasing of the Indian Ocean externally-forced warming spatial pattern by internal climate variability
Multifrequency seismic detectability of seasonal thermoclines

#### for production: parsing all extracted files into one master parquet file

In [79]:
# Note: master_file is defined in the path-setup cell at the start of Step 3; run that cell first
# Connect to an in-memory DuckDB database
con = duckdb.connect()

# Join the list of file paths into a comma-separated string of quoted paths
file_paths = ','.join(f"'{file}'" for file in extracted_files_with_path)

# Use DuckDB to read all extracted JSON files and save all rows as one master Parquet file
con.sql(f'''
    COPY (
        SELECT *
        FROM read_json([{file_paths}], sample_size=-1, union_by_name=true)
    )
    TO '{master_file}' (FORMAT parquet, COMPRESSION gzip)
''')

print(f"Transformed data saved to: {master_file}")
print(f"File size: {os.path.getsize(master_file) / (1024**2):.2f} MB")

# Close the DuckDB connection
con.close()

NameError: name 'master_file' is not defined
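
For larger community dumps, reading every file in one `read_json` call may be too heavy, and the processing steps mention handling the files in batches. A minimal sketch of the batching part follows, assuming each batch would be written to its own Parquet file by a `COPY` statement like the one above. `chunked` and `sql_file_list` are hypothetical helpers and the file names are made up.

```python
def chunked(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[start:start + size] for start in range(0, len(items), size)]


def sql_file_list(paths):
    """Render paths as a SQL list literal, doubling any single quotes inside a path."""
    quoted = ["'" + path.replace("'", "''") + "'" for path in paths]
    return "[" + ", ".join(quoted) + "]"


# Illustrative file names only.
files = [f"part-{number:05d}.json.gz" for number in range(5)]
batches = chunked(files, 2)
first_list = sql_file_list(batches[0])

print(len(batches))
print(first_list)
```

Each rendered list can take the place of `[{file_paths}]` in the `read_json` call, with one output file per batch.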