# Accessing IODP Community Metadata using the Zenodo OAI-PMH API

See here for developer specification: https://developers.zenodo.org/#oai-pmh

I navigated to https://zenodo.org/oai2d and manually tracked down the URL endpoints listed here. The OAI-PMH data presented later in this notebook was retrieved from them using Sickle.

- All IODP record identifiers: https://zenodo.org/oai2d?verb=ListIdentifiers&metadataPrefix=oai_dc&set=user-iodp
- All IODP records: https://zenodo.org/oai2d?verb=ListRecords&metadataPrefix=oai_dc&set=user-iodp
- All OAI-PMH metadata formats: https://zenodo.org/oai2d?verb=ListMetadataFormats (use these as the `metadataPrefix` value to change the output format)
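
The three URLs above differ only in their query arguments, so they can be assembled rather than typed by hand. This is a minimal sketch using only the standard library; the helper name `oai_url` is made up for illustration, and only the verbs and arguments already shown in the list are used.

```python
from urllib.parse import urlencode

OAI_BASE = "https://zenodo.org/oai2d"  # endpoint named in the text above


def oai_url(verb, **params):
    """Return an OAI-PMH request URL for the given verb and arguments.

    `oai_url` is a hypothetical helper, not part of Sickle or Zenodo.
    """
    query = {"verb": verb, **params}
    return f"{OAI_BASE}?{urlencode(query)}"


identifiers_url = oai_url("ListIdentifiers", metadataPrefix="oai_dc", set="user-iodp")
records_url = oai_url("ListRecords", metadataPrefix="oai_dc", set="user-iodp")
formats_url = oai_url("ListMetadataFormats")
print(identifiers_url)
```

Swapping `metadataPrefix="oai_dc"` for another prefix reported by `ListMetadataFormats` should change the record format without touching the rest of the URL.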


Zenodo also exposes a records API that lists every record. I can't find this in the official documentation anywhere; I came across it while looking at network traffic. It is useful for getting the file URLs of all the data we have uploaded, along with the usage statistics.
- https://zenodo.org/api/records/?communities=iodp&size=10000
- (Alternative) https://zenodo.org/api/records/?sort=mostrecent&communities=iodp&page=1&size=10000
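
Requesting everything with `size=10000` relies on the server accepting a page that large, which I have not confirmed. A more defensive approach is to walk the `page` and `size` parameters until a short page comes back. The sketch below assumes the response keeps its records under `hits.hits` (as in the cells further down); `fetch_page` is a hypothetical callable, injected so the loop can be exercised offline with fake data.

```python
def iter_hits(fetch_page, size=100):
    """Yield every record from a paginated 'hits.hits' style response.

    fetch_page(page, size) is a hypothetical callable returning the parsed
    JSON for one page, e.g. a thin wrapper around requests.get(...).json().
    """
    page = 1
    while True:
        payload = fetch_page(page, size)
        hits = payload.get("hits", {}).get("hits", [])
        yield from hits
        if len(hits) < size:  # a short (or empty) page means we reached the end
            break
        page += 1


# Offline stand-in for the API: 5 fake records served 2 per page.
_FAKE = [{"id": i} for i in range(5)]


def fake_fetch(page, size):
    start = (page - 1) * size
    return {"hits": {"hits": _FAKE[start:start + size]}}


all_ids = [hit["id"] for hit in iter_hits(fake_fetch, size=2)]
print(all_ids)
```

Against the live API, `fetch_page` could wrap `requests.get('https://zenodo.org/api/records/', params={'communities': 'iodp', 'page': page, 'size': size}).json()`, using the same parameters as the URLs above.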

Example of an abridged JSON response from the api/records endpoint. I deleted some of the authors and files to shorten it.
```json
{
	"conceptdoi": "10.5281/zenodo.7705344",
	"conceptrecid": "7705344",
	"created": "2023-03-09T16:00:25.242639+00:00",
	"doi": "10.5281/zenodo.7705345",
	"files": [
		{
			"bucket": "87ac6767-6f95-4ab2-9f4d-1ff6282605c2",
			"checksum": "md5:019d7e58ae874345ae471acb08930319",
			"key": "385_SITE_LOCATIONS.HTML",
			"links": {
				"self": "https://zenodo.org/api/files/87ac6767-6f95-4ab2-9f4d-1ff6282605c2/385_SITE_LOCATIONS.HTML"
			},
			"size": 2694,
			"type": "html"
		},
		{
			"bucket": "87ac6767-6f95-4ab2-9f4d-1ff6282605c2",
			"checksum": "md5:0876d156a7bc532a816dbe3d9dcf06b7",
			"key": "data_by_hole.zip",
			"links": {
				"self": "https://zenodo.org/api/files/87ac6767-6f95-4ab2-9f4d-1ff6282605c2/data_by_hole.zip"
			},
			"size": 48913,
			"type": "zip"
		},
		{
			"bucket": "87ac6767-6f95-4ab2-9f4d-1ff6282605c2",
			"checksum": "md5:a98962eb88fd560f778f44f3385f1068",
			"key": "MICROIMG-README.txt",
			"links": {
				"self": "https://zenodo.org/api/files/87ac6767-6f95-4ab2-9f4d-1ff6282605c2/MICROIMG-README.txt"
			},
			"size": 2497,
			"type": "txt"
		},
		{
			"bucket": "87ac6767-6f95-4ab2-9f4d-1ff6282605c2",
			"checksum": "md5:e784b62591ca310253053c6566fc7481",
			"key": "U1545A.zip",
			"links": {
				"self": "https://zenodo.org/api/files/87ac6767-6f95-4ab2-9f4d-1ff6282605c2/U1545A.zip"
			},
			"size": 264609930,
			"type": "zip"
		},
		{
			"bucket": "87ac6767-6f95-4ab2-9f4d-1ff6282605c2",
			"checksum": "md5:b73133d039cba80072ee01fd7bd009d0",
			"key": "U1552B.zip",
			"links": {
				"self": "https://zenodo.org/api/files/87ac6767-6f95-4ab2-9f4d-1ff6282605c2/U1552B.zip"
			},
			"size": 628435265,
			"type": "zip"
		}
	],
	"id": 7705345,
	"links": {
		"badge": "https://zenodo.org/badge/doi/10.5281/zenodo.7705345.svg",
		"bucket": "https://zenodo.org/api/files/87ac6767-6f95-4ab2-9f4d-1ff6282605c2",
		"conceptbadge": "https://zenodo.org/badge/doi/10.5281/zenodo.7705344.svg",
		"conceptdoi": "https://doi.org/10.5281/zenodo.7705344",
		"doi": "https://doi.org/10.5281/zenodo.7705345",
		"html": "https://zenodo.org/record/7705345",
		"latest": "https://zenodo.org/api/records/7705345",
		"latest_html": "https://zenodo.org/record/7705345",
		"self": "https://zenodo.org/api/records/7705345"
	},
	"metadata": {
		"access_right": "open",
		"access_right_category": "success",
		"communities": [
			{
				"id": "iodp"
			}
		],
		"contributors": [
			{
				"name": "International Ocean Discovery Program",
				"type": "DataCollector"
			}
		],
		"creators": [
			{
				"name": "Teske, Andreas P.",
				"orcid": "0000-0003-3669-5425"
			},
			{
				"name": "Lizarralde, Daniel",
				"orcid": "0000-0001-6152-6039"
			},
			{
				"name": "Höfig, Tobias W.",
				"orcid": "0000-0002-9254-4528"
			},
			{
				"name": "Zhuang, Guangchao",
				"orcid": "0000-0002-6282-8415"
			}
		],
		"description": "<p>Microscopic images of discrete samples were acquired using stereo and upright light microscopes and captured on digital cameras. Image files were uploaded along with a brief description and a record of the microscopic and lighting conditions when the image was taken.</p>",
		"doi": "10.5281/zenodo.7705345",
		"keywords": [
			"International Ocean Discovery Program",
			"IODP",
			"JOIDES Resolution",
			"Expedition 385",
			"Site U1545",
			"Site U1546",
			"Site U1547",
			"Site U1548",
			"Site U1549",
			"Site U1550",
			"Site U1551",
			"Site U1552",
			"Guaymas Basin Tectonics and Biosphere",
			"deep biosphere",
			"Gulf of California",
			"Ringvent",
			"hydrothermal alteration",
			"heat flow",
			"sill emplacement",
			"carbon budget",
			"alteration"
		],
		"license": {
			"id": "CC0-1.0"
		},
		"publication_date": "2021-09-27",
		"related_identifiers": [
			{
				"identifier": "10.14379/iodp.proc.385.2021",
				"relation": "isDocumentedBy",
				"scheme": "doi"
			},
			{
				"identifier": "10.5281/zenodo.7705344",
				"relation": "isVersionOf",
				"scheme": "doi"
			}
		],
		"relations": {
			"version": [
				{
					"count": 1,
					"index": 0,
					"is_last": true,
					"last_child": {
						"pid_type": "recid",
						"pid_value": "7705345"
					},
					"parent": {
						"pid_type": "recid",
						"pid_value": "7705344"
					}
				}
			]
		},
		"resource_type": {
			"title": "Dataset",
			"type": "dataset"
		},
		"title": "IODP Expedition 385 Photomicrographs"
	},
	"owners": [
		88403
	],
	"revision": 2,
	"stats": {
		"downloads": 1,
		"unique_downloads": 1,
		"unique_views": 4,
		"version_downloads": 1,
		"version_unique_downloads": 1,
		"version_unique_views": 4,
		"version_views": 4,
		"version_volume": 48913,
		"views": 4,
		"volume": 48913
	},
	"updated": "2023-03-10T02:26:54.985841+00:00"
}
```
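
To make the structure above concrete, here is a small sketch that pulls the file links, total size, and download count out of one record using plain dictionary access. The `record` literal is a trimmed copy of the response above (two files only), and `summarize` is a made-up helper name.

```python
# Trimmed copy of the abridged response above: two files, a title, a few stats.
record = {
    "doi": "10.5281/zenodo.7705345",
    "files": [
        {"key": "385_SITE_LOCATIONS.HTML", "size": 2694, "type": "html",
         "links": {"self": "https://zenodo.org/api/files/87ac6767-6f95-4ab2-9f4d-1ff6282605c2/385_SITE_LOCATIONS.HTML"}},
        {"key": "data_by_hole.zip", "size": 48913, "type": "zip",
         "links": {"self": "https://zenodo.org/api/files/87ac6767-6f95-4ab2-9f4d-1ff6282605c2/data_by_hole.zip"}},
    ],
    "metadata": {"title": "IODP Expedition 385 Photomicrographs"},
    "stats": {"downloads": 1, "unique_views": 4},
}


def summarize(rec):
    """Collect the fields this notebook cares about from one record dict."""
    files = rec.get("files", [])
    return {
        "title": rec.get("metadata", {}).get("title"),
        "doi": rec.get("doi"),
        "file_urls": [f["links"]["self"] for f in files],
        "total_bytes": sum(f["size"] for f in files),
        "downloads": rec.get("stats", {}).get("downloads", 0),
    }


summary = summarize(record)
print(summary["title"], summary["total_bytes"])
```

The `.get(..., default)` calls are there because I am assuming that some records may lack `files` or `stats`; direct indexing would be fine if every record is known to carry them.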

In [54]:
from collections import Counter
from itertools import chain
from types import SimpleNamespace
import json
import re

import jmespath # See JMESPATH documentation here for walkthroughs: https://jmespath.org/examples.html, https://github.com/jmespath/jmespath.py
import numpy as np
import pandas as pd
import requests # for making http calls
from IPython.display import JSON
from sickle import Sickle

# Zenodo API access for data

These data originate from the backend API that Zenodo itself uses to present data on its website. There is no clear documentation on its usage. I have found that it contains more information than the standard API described in Zenodo's Developer Documentation, including the file bucket and file link for each file uploaded against a deposition.

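**Important**: When using query filters in jmespath, wrap numbers in the filter expressions in backticks (`) and not apostrophes ('). Backticks create a JSON literal, so the value is compared as a number; apostrophes create a string, which never matches a numeric field. This took me too long to figure out...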

In [55]:
# Exporting dataframes to this file
summary_file = './iodp_community_records.xlsx'
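
The export cells below open this workbook with `mode='a'`, which assumes the file already exists; on a first run, append mode fails because there is nothing to append to. As far as I know, pandas also accepts `if_sheet_exists` only in append mode. A minimal sketch of a helper that picks the arguments based on whether the file is present (`excel_writer_kwargs` is a made-up name, and the temporary directory is only there to demonstrate both branches):

```python
from pathlib import Path
import tempfile


def excel_writer_kwargs(path):
    """Pick pandas.ExcelWriter keyword arguments that suit a new or existing workbook.

    Hypothetical helper: append-and-replace when the file exists, plain write otherwise.
    """
    if Path(path).exists():
        return {"engine": "openpyxl", "mode": "a", "if_sheet_exists": "replace"}
    return {"engine": "openpyxl", "mode": "w"}


with tempfile.TemporaryDirectory() as tmp:
    workbook = Path(tmp) / "iodp_community_records.xlsx"
    kwargs_new = excel_writer_kwargs(workbook)       # file does not exist yet
    workbook.touch()                                 # simulate an existing workbook
    kwargs_existing = excel_writer_kwargs(workbook)  # file now exists

print(kwargs_new, kwargs_existing)
```

It would be used as `with pd.ExcelWriter(summary_file, **excel_writer_kwargs(summary_file)) as writer:` in place of the hard-coded arguments.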

In [56]:
zenodo_api = 'https://zenodo.org/api/records/?communities=iodp&size=10000'
response = requests.get(zenodo_api)
response = response.json()

# using jmespath to query the JSON
z = jmespath.search('hits.hits',response)
print(f'Example response {z[0]}')
JSON(z[0], expanded=True)

Example response {'conceptdoi': '10.5281/zenodo.7806045', 'conceptrecid': '7806045', 'created': '2023-04-07T16:31:06.233818+00:00', 'doi': '10.5281/zenodo.7806046', 'files': [{'bucket': '5018025f-3eb3-4d73-b9eb-d600b6027131', 'checksum': 'md5:b12e68211e8fb0c8dc9f9938c2add24a', 'key': '396_SITE_LOCATIONS.HTML', 'links': {'self': 'https://zenodo.org/api/files/5018025f-3eb3-4d73-b9eb-d600b6027131/396_SITE_LOCATIONS.HTML'}, 'size': 2208, 'type': 'html'}, {'bucket': '5018025f-3eb3-4d73-b9eb-d600b6027131', 'checksum': 'md5:54011a3690a9b3b5a0c88464af00a076', 'key': 'ALKALINITY-README.txt', 'links': {'self': 'https://zenodo.org/api/files/5018025f-3eb3-4d73-b9eb-d600b6027131/ALKALINITY-README.txt'}, 'size': 2127, 'type': 'txt'}, {'bucket': '5018025f-3eb3-4d73-b9eb-d600b6027131', 'checksum': 'md5:50d86233b221350a1c7de63f52a32193', 'key': 'ALKALINITY.zip', 'links': {'self': 'https://zenodo.org/api/files/5018025f-3eb3-4d73-b9eb-d600b6027131/ALKALINITY.zip'}, 'size': 204889, 'type': 'zip'}], 'id': 

<IPython.core.display.JSON object>

In [57]:
# The access statistics of all datasets

df = pd.DataFrame(jmespath.search('hits.hits[].[metadata.title, conceptdoi, conceptrecid, created, doi, id, owners[0], revision, stats, updated]',response))
col_names = dict(zip(range(0,10),['title','conceptdoi', 'conceptrecid', 'created', 'doi', 'id', 'owners', 'revision', 'stats', 'updated']))
df = df.rename(columns=col_names)
stats = df['stats'].apply(pd.Series)
df[['downloads', 'unique_downloads', 'unique_views', 'version_downloads',
       'version_unique_downloads', 'version_unique_views', 'version_views',
       'version_volume', 'views', 'volume']] = stats
df.pop('stats')
dfs = df
with pd.ExcelWriter(summary_file, engine='openpyxl', mode='a', if_sheet_exists="replace") as writer: # mode "a" is append
    dfs.to_excel(writer, sheet_name = 'dataset_statistics', index=False)
    
dfs

Unnamed: 0,title,conceptdoi,conceptrecid,created,doi,id,owners,revision,updated,downloads,unique_downloads,unique_views,version_downloads,version_unique_downloads,version_unique_views,version_views,version_volume,views,volume
0,IODP Expedition 396 Alkalinity and pH,10.5281/zenodo.7806045,7806045,2023-04-07T16:31:06.233818+00:00,10.5281/zenodo.7806046,7806046,88403,1,2023-04-07T16:31:07.281435+00:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000e+00,0.0,0.000000e+00
1,IODP Expedition 396 Carbonates composite report,10.5281/zenodo.7806079,7806079,2023-04-07T16:30:49.612945+00:00,10.5281/zenodo.7806080,7806080,88403,1,2023-04-07T16:30:50.520354+00:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000e+00,0.0,0.000000e+00
2,IODP Expedition 396 Elemental analysis (CHNS),10.5281/zenodo.7806081,7806081,2023-04-07T16:30:39.873079+00:00,10.5281/zenodo.7806082,7806082,88403,1,2023-04-07T16:30:40.806323+00:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000e+00,0.0,0.000000e+00
3,IODP Expedition 396 Closeup images,10.5281/zenodo.7806083,7806083,2023-04-07T16:30:28.847735+00:00,10.5281/zenodo.7806084,7806084,88403,1,2023-04-07T16:30:29.755885+00:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000e+00,0.0,0.000000e+00
4,IODP Expedition 396 Core composite images,10.5281/zenodo.7806106,7806106,2023-04-07T16:30:18.350708+00:00,10.5281/zenodo.7806107,7806107,88403,1,2023-04-07T16:30:19.227795+00:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000e+00,0.0,0.000000e+00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
517,IODP Expedition 361 X-ray diffraction (XRD),10.5281/zenodo.3633166,3633166,2020-01-31T17:26:42.282771+00:00,10.5281/zenodo.3633167,3633167,88403,2,2020-01-31T19:20:51.413284+00:00,19.0,18.0,141.0,19.0,18.0,141.0,158.0,4.950051e+06,158.0,4.950051e+06
518,IODP Expedition 361 Navigation,10.5281/zenodo.3629546,3629546,2020-01-28T15:32:24.370821+00:00,10.5281/zenodo.3629547,3629547,88403,2,2020-01-29T07:20:51.981765+00:00,6.0,6.0,122.0,6.0,6.0,122.0,131.0,1.302520e+08,131.0,1.302520e+08
519,IODP Expedition 361 Carbonates composite report,10.5281/zenodo.3629544,3629544,2020-01-28T15:22:47.429908+00:00,10.5281/zenodo.3629545,3629545,88403,4,2020-01-29T07:20:51.992551+00:00,14.0,14.0,154.0,14.0,14.0,154.0,173.0,4.454240e+05,173.0,4.454240e+05
520,IODP Expedition 361 Core composite images,10.5281/zenodo.3628852,3628852,2020-01-28T15:19:18.787807+00:00,10.5281/zenodo.3628853,3628853,88403,3,2020-01-31T16:57:13.343632+00:00,308.0,71.0,193.0,308.0,71.0,193.0,214.0,2.138586e+10,214.0,2.138586e+10


In [59]:
dfs[(dfs['title'].str.contains('Expedition 396'))].to_csv('./x396_DOI.csv', index=False)

In [47]:
# Views and downloads of all IODP Community datasets, grouped by owner.
agg_dict = {'title':['count'],
            'unique_views':['sum'],
            'downloads':['sum']}
# select the columns with a list (double brackets); tuple-style selection on a groupby is deprecated
df = dfs.groupby('owners')[['title', 'unique_views', 'downloads']].agg(agg_dict)
df.columns = ["_".join(x) for x in df.columns.to_flat_index()]
df = df.reset_index()

with pd.ExcelWriter(summary_file, engine='openpyxl', mode='a', if_sheet_exists="replace") as writer: # mode "a" is append
    df.to_excel(writer, sheet_name = 'stats_by_dataset_owner', index=False)
df



Unnamed: 0,owners,title_count,unique_views_sum,downloads_sum
0,88403,513,22239.0,3499.0
1,91826,2,296.0,81.0
2,155858,1,66.0,75.0
3,241727,1,64.0,6.0
4,257871,1,349.0,240.0
5,340510,1,13.0,0.0
6,341527,1,19.0,7.0
7,370226,1,20.0,2.0
8,389228,1,26.0,29.0


In [48]:
# Datasets tagged as part of the IODP Community but uploaded by outside researchers.
df = dfs[dfs['owners']!=88403]
df = df.reset_index()

with pd.ExcelWriter(summary_file, engine='openpyxl', mode='a', if_sheet_exists="replace") as writer: # mode "a" is append
    df.to_excel(writer, sheet_name = 'community_datasets', index=False)
df


Unnamed: 0,index,title,conceptdoi,conceptrecid,created,doi,id,owners,revision,updated,downloads,unique_downloads,unique_views,version_downloads,version_unique_downloads,version_unique_views,version_views,version_volume,views,volume
0,181,Greigite formation modulated by turbidites and...,10.5281/zenodo.6487555,6487555,2022-10-23T09:13:18.696977+00:00,10.5281/zenodo.7025598,7025598,341527,5,2022-11-16T15:12:40.169927+00:00,7.0,7.0,19.0,15.0,9.0,40.0,44.0,6372565.0,22.0,74759.0
1,187,"ODP Site 1249, ODP Site 1252, and IODP Site U1...",10.5281/zenodo.7079481,7079481,2022-09-15T20:26:52.452599+00:00,10.5281/zenodo.7079482,7079482,340510,4,2022-11-15T22:18:03.871512+00:00,0.0,0.0,13.0,0.0,0.0,13.0,14.0,0.0,14.0,0.0
2,267,"Cruise Report, Research Vessel Thomas G. Thomp...",10.5281/zenodo.6981373,6981373,2022-08-10T17:54:18.177282+00:00,10.5281/zenodo.6981374,6981374,389228,3,2022-08-11T02:26:25.980531+00:00,29.0,25.0,26.0,29.0,25.0,26.0,29.0,674412000.0,29.0,674412000.0
3,268,"IODP Expedition -358, Site C0024 - LWD Data",10.5281/zenodo.6909791,6909791,2022-07-27T00:50:29.785812+00:00,10.5281/zenodo.6909792,6909792,370226,3,2022-08-10T18:16:27.147999+00:00,2.0,2.0,20.0,2.0,2.0,20.0,20.0,254968600.0,20.0,254968600.0
4,348,IODP Exp 396 - Porous basalt,10.5281/zenodo.5706029,5706029,2021-11-18T15:13:56.636201+00:00,10.5281/zenodo.5706030,5706030,241727,3,2022-08-10T18:16:36.399976+00:00,6.0,6.0,64.0,6.0,6.0,64.0,76.0,183794100.0,76.0,183794100.0
5,349,"Expedition 379T Preliminary Report, Digging De...",10.5281/zenodo.5553427,5553427,2021-10-06T23:02:00.198000+00:00,10.5281/zenodo.5553428,5553428,257871,3,2021-10-11T14:28:51.234877+00:00,240.0,227.0,349.0,240.0,227.0,349.0,396.0,1195511000.0,396.0,1195511000.0
6,386,IODP Expedition 382: Supplementary Tables for ...,10.5281/zenodo.3776572,3776572,2020-12-22T19:20:11.930649+00:00,10.5281/zenodo.3776573,3776573,91826,3,2021-06-03T14:12:14.261325+00:00,37.0,33.0,166.0,37.0,33.0,166.0,182.0,340358700.0,182.0,340358700.0
7,387,Mechanical data of rotary shear fluid pressuri...,10.5281/zenodo.4268279,4268279,2020-11-11T15:37:29.510303+00:00,10.5281/zenodo.4268280,4268280,155858,3,2021-06-03T14:12:05.913375+00:00,75.0,29.0,66.0,75.0,29.0,66.0,69.0,2219359000.0,69.0,2219359000.0
8,479,Mid to late Pleistocene IODP Expedition 354 Be...,10.5281/zenodo.3676056,3676056,2020-03-30T19:25:50.881515+00:00,10.5281/zenodo.3676057,3676057,91826,4,2022-11-18T15:07:04.607313+00:00,44.0,24.0,130.0,44.0,24.0,130.0,147.0,81452150.0,147.0,81452150.0


In [49]:
# Authors and ORCIDs of IODP-owned datasets only. Note: the IODP account owner id is the number 88403, hence the backticks in the filter.
z = jmespath.search('hits.hits[?contains(owners,`88403`)][ metadata.creators[].[name,orcid]][][]',response) 
df = pd.DataFrame(z, columns=['author','orcid'])
df = df.groupby(['author','orcid'], dropna=False).size().to_frame().reset_index() # keep authors that do not have ORCIDs too
df = df.rename(columns={0:'count_publications'})

with pd.ExcelWriter(summary_file, engine='openpyxl', mode='a', if_sheet_exists="replace") as writer: # mode "a" is append
    df.to_excel(writer, sheet_name = 'authors', index=False)
    
df

Unnamed: 0,author,orcid,count_publications
0,"Agarwal, Amar",0000-0003-1011-4784,45
1,"Aiello, Ivano W.",0000-0001-9794-6852,50
2,"Albers, E.",,42
3,"Aljahdali, Mohammed H.",,46
4,"Almeev, Renat",0000-0003-0652-9469,45
...,...,...,...
353,"van Peer, Tim E.",0000-0003-3516-4198,41
354,"van de Flierdt, Tina",0000-0001-7176-9755,41
355,"van der Land, Cees",0000-0002-0301-6927,46
356,"van der Lubbe, J.J.L.",,43


In [50]:
# All the data records, links and their doi information
records = jmespath.search('hits.hits[*].[metadata.title, doi, links.[doi,latest,conceptdoi], files[*].[bucket,checksum,key,links.self,size,type]]',response)

dfs = []
for record in records:
    df = pd.DataFrame(record[3])
    df['title'] = record[0]
    df['doi'] = record[1]
    df[['link.doi','latest','conceptdoi']] = record[2]
    dfs.append(df)

dfs = pd.concat(dfs)
dfs = dfs.rename(columns={0:'bucket',1:'checksum',2:'key',3:'link', 4:'size', 5:'type'})
dfs.insert(0,'title',dfs.pop('title')) # fancy way of moving column to new index location
dfs = dfs.reset_index(drop=True)

with pd.ExcelWriter(summary_file, engine='openpyxl', mode='a', if_sheet_exists="replace") as writer: # mode "a" is append
    dfs.to_excel(writer, sheet_name = 'file_records', index=False)

dfs

Unnamed: 0,title,bucket,checksum,key,link,size,type,doi,link.doi,latest,conceptdoi
0,IODP Expedition 396 Alkalinity and pH,5018025f-3eb3-4d73-b9eb-d600b6027131,md5:b12e68211e8fb0c8dc9f9938c2add24a,396_SITE_LOCATIONS.HTML,https://zenodo.org/api/files/5018025f-3eb3-4d7...,2208,html,10.5281/zenodo.7806046,https://doi.org/10.5281/zenodo.7806046,https://zenodo.org/api/records/7806046,https://doi.org/10.5281/zenodo.7806045
1,IODP Expedition 396 Alkalinity and pH,5018025f-3eb3-4d73-b9eb-d600b6027131,md5:54011a3690a9b3b5a0c88464af00a076,ALKALINITY-README.txt,https://zenodo.org/api/files/5018025f-3eb3-4d7...,2127,txt,10.5281/zenodo.7806046,https://doi.org/10.5281/zenodo.7806046,https://zenodo.org/api/records/7806046,https://doi.org/10.5281/zenodo.7806045
2,IODP Expedition 396 Alkalinity and pH,5018025f-3eb3-4d73-b9eb-d600b6027131,md5:50d86233b221350a1c7de63f52a32193,ALKALINITY.zip,https://zenodo.org/api/files/5018025f-3eb3-4d7...,204889,zip,10.5281/zenodo.7806046,https://doi.org/10.5281/zenodo.7806046,https://zenodo.org/api/records/7806046,https://doi.org/10.5281/zenodo.7806045
3,IODP Expedition 396 Carbonates composite report,581ddad2-9e13-47e9-9175-22007259964d,md5:b12e68211e8fb0c8dc9f9938c2add24a,396_SITE_LOCATIONS.HTML,https://zenodo.org/api/files/581ddad2-9e13-47e...,2208,html,10.5281/zenodo.7806080,https://doi.org/10.5281/zenodo.7806080,https://zenodo.org/api/records/7806080,https://doi.org/10.5281/zenodo.7806079
4,IODP Expedition 396 Carbonates composite report,581ddad2-9e13-47e9-9175-22007259964d,md5:beee5e1379efa39cf0fff8fa62651ff6,CARB-README.txt,https://zenodo.org/api/files/581ddad2-9e13-47e...,2564,txt,10.5281/zenodo.7806080,https://doi.org/10.5281/zenodo.7806080,https://zenodo.org/api/records/7806080,https://doi.org/10.5281/zenodo.7806079
...,...,...,...,...,...,...,...,...,...,...,...
1917,IODP Expedition 361 Core composite images,ca138557-ae4b-40b0-ac71-a637c39ddaa1,md5:3e3561deec8cd4420e16e01b935502a0,361-U1479G-COREPHOTO.zip,https://zenodo.org/api/files/ca138557-ae4b-40b...,4604390,zip,10.5281/zenodo.3628853,https://doi.org/10.5281/zenodo.3628853,https://zenodo.org/api/records/3628853,https://doi.org/10.5281/zenodo.3628852
1918,IODP Expedition 361 Core composite images,ca138557-ae4b-40b0-ac71-a637c39ddaa1,md5:2d0dc469529966e80a70ef6b8ab133b3,361-U1479H-COREPHOTO.zip,https://zenodo.org/api/files/ca138557-ae4b-40b...,31970011,zip,10.5281/zenodo.3628853,https://doi.org/10.5281/zenodo.3628853,https://zenodo.org/api/records/3628853,https://doi.org/10.5281/zenodo.3628852
1919,IODP Expedition 361 Core composite images,ca138557-ae4b-40b0-ac71-a637c39ddaa1,md5:13476e92c9ab3bf822bceddf5a02f9d2,361-U1479I-COREPHOTO.zip,https://zenodo.org/api/files/ca138557-ae4b-40b...,9367910,zip,10.5281/zenodo.3628853,https://doi.org/10.5281/zenodo.3628853,https://zenodo.org/api/records/3628853,https://doi.org/10.5281/zenodo.3628852
1920,IODP Expedition 361 Core composite images,ca138557-ae4b-40b0-ac71-a637c39ddaa1,md5:4b830cdab7079bcb59b23570022042a5,README.txt,https://zenodo.org/api/files/ca138557-ae4b-40b...,1198,txt,10.5281/zenodo.3628853,https://doi.org/10.5281/zenodo.3628853,https://zenodo.org/api/records/3628853,https://doi.org/10.5281/zenodo.3628852
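
Each file row carries a `checksum` of the form `md5:<hexdigest>`, which can be used to verify a download. The sketch below assumes that format and hashes the data in chunks, since some of the zip files are hundreds of megabytes. The helper names are made up, and the in-memory `BytesIO` stands in for a downloaded file opened in binary mode.

```python
import hashlib
import io


def file_md5(fileobj, chunk_size=1 << 20):
    """Return the MD5 hex digest of a binary file object, read in 1 MiB chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: fileobj.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(fileobj, checksum):
    """Compare a file object against a 'md5:<hexdigest>' string (format assumed from the table)."""
    algorithm, _, expected = checksum.partition(":")
    if algorithm != "md5":
        raise ValueError(f"unsupported checksum algorithm: {algorithm!r}")
    return file_md5(fileobj) == expected.lower()


payload = b"example file contents"
recorded = "md5:" + hashlib.md5(payload).hexdigest()  # what the API row would hold

ok = matches_checksum(io.BytesIO(payload), recorded)
corrupted = matches_checksum(io.BytesIO(payload + b"!"), recorded)
print(ok, corrupted)
```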


In [42]:
# Total size of all uploads by file type

agg_dict = {'checksum':['count'], 'size':['sum']}
sizes = dfs.groupby('type').agg(agg_dict).sort_values(('size','sum'),ascending=False)
sizes = sizes.reset_index()

#sizes.columns = sizes.columns.to_flat_index()
#dat.columns = ["_".join(a) for a in dat.columns.to_flat_index()]

new_names = [f'{x[0]}_{x[1]}' if x[1] != '' else f'{x[0]}' for x in sizes.columns]
sizes.columns = new_names
sizes['size_sum_gb'] = sizes['size_sum'] / 1e9

with pd.ExcelWriter(summary_file,engine='openpyxl', mode='a', if_sheet_exists="replace") as writer:
    sizes.to_excel(writer, sheet_name = 'file_sizes', index=False)

#sizes.to_excel('./iodp_community_records.xlsx', sheet_name = 'file_sizes', index=False)
sizes

Unnamed: 0,type,checksum_count,size_sum,size_sum_gb
0,zip,1055,316420667880,316.420668
1,mp4,6,1563686665,1.563687
2,txt,345,289165287,0.289165
3,pdf,2,28236882,0.028237
4,xlsx,21,17619320,0.017619
5,xls,2,3358208,0.003358
6,html,335,448090,0.000448
7,,2,91595,9.2e-05
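
Reporting everything in gigabytes makes the small file types hard to read (for example `9.2e-05`). A possible alternative is to pick the unit per value. This sketch uses decimal units to stay consistent with the `/ 1e9` conversion above; `format_bytes` is a made-up helper name.

```python
def format_bytes(n):
    """Format a byte count with decimal (powers of 1000) units."""
    units = ["B", "kB", "MB", "GB", "TB"]
    value = float(n)
    for unit in units:
        # stop at the first unit where the value drops below 1000, or at the largest unit
        if value < 1000 or unit == units[-1]:
            return f"{value:.2f} {unit}"
        value /= 1000


# Values taken from the size_sum column above.
zip_total = format_bytes(316420667880)
csv_total = format_bytes(23108)
print(zip_total, csv_total)
```

Applied to the table, this could be `sizes['size_sum'].map(format_bytes)`.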
8,csv,5,23108,2.3e-05


# Using Sickle to Access Zenodo's OAI-PMH API

This is the official API for accessing Zenodo metadata. It exposes less record information than Zenodo's backend API. Some of the steps below are given in the Zenodo developer documentation.

In [None]:
base = 'https://zenodo.org/oai2d'
sickle = Sickle(base)
info = sickle.Identify()
# available metadata formats
list(sickle.ListMetadataFormats())

In [None]:
# accessing all iodp community records in specified metadata format
records = sickle.ListRecords(metadataPrefix='oai_dc',set='user-iodp')
records = list(records)

In [None]:
# Collect the unique metadata keys across all records. Note: not every record has every key, so check that a key exists before accessing it.
keys = [list(record.metadata.keys()) for record in records]
list(set(chain(*keys)))

['source',
 'relation',
 'type',
 'date',
 'subject',
 'identifier',
 'description',
 'contributor',
 'rights',
 'language',
 'creator',
 'title']
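
The keys above are Dublin Core element names, and each one maps to a list of strings because an element can repeat within a record. To show where that shape comes from, here is a rough stand-in for the conversion using the standard library on a hand-written `oai_dc` fragment. The fragment is illustrative rather than real API output, and `dc_to_dict` is a made-up helper, not Sickle's implementation.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hand-written oai_dc fragment for illustration (not copied from the API).
SAMPLE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                  xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>IODP Expedition 385 Hole summary</dc:title>
  <dc:creator>Teske, Andreas P.</dc:creator>
  <dc:creator>Lizarralde, Daniel</dc:creator>
  <dc:identifier>https://zenodo.org/record/7703126</dc:identifier>
</oai_dc:dc>"""


def dc_to_dict(xml_text):
    """Map each Dublin Core element name to the list of its text values."""
    metadata = defaultdict(list)
    for element in ET.fromstring(xml_text):
        name = element.tag.split("}", 1)[-1]  # strip the '{namespace}' prefix
        if element.text and element.text.strip():
            metadata[name].append(element.text.strip())
    return dict(metadata)


metadata = dc_to_dict(SAMPLE)
title = metadata.get("title", [None])[0]         # single value still lives in a list
contributors = metadata.get("contributor", [])   # key is absent in this sample
print(title, contributors)
```

Using `.get(key, [])` is one way to handle the records that lack a given key.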

In [None]:
# IODP-created records (first contributor is IODP) vs. community records (no contributor field)
iodp = [record.metadata for record in records
        if 'contributor' in record.metadata
        and record.metadata['contributor'][0] == 'International Ocean Discovery Program']
community = [record.metadata for record in records if 'contributor' not in record.metadata]

print(f"Count of IODP records: {len(iodp)}")
print(f"Count of community tagged records: {len(community)}")
print(f"Do iodp and community records add to the total record count? {(len(iodp)+len(community)) == len(records)}")

Count of IODP records: 479
Count of community tagged records: 10
Do iodp and community records add to the total record count? True
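
The two lists add up to the total in this run, but the filters do not cover a record that has a contributor other than IODP; such a record would silently fall into neither list. A sketch of a single-pass classification with an explicit third bucket, run on invented sample metadata (`classify` and the `"other"` label are my own naming):

```python
from collections import Counter

IODP_NAME = "International Ocean Discovery Program"


def classify(metadata):
    """Label a metadata dict as 'iodp', 'community', or 'other'."""
    contributors = metadata.get("contributor", [])
    if not contributors:
        return "community"  # no contributor field: tagged by an outside uploader
    if contributors[0] == IODP_NAME:
        return "iodp"
    return "other"          # has a contributor, but it is not IODP


# Invented sample records, one per bucket.
sample = [
    {"title": ["A"], "contributor": [IODP_NAME]},
    {"title": ["B"]},
    {"title": ["C"], "contributor": ["Some Lab"]},
]

buckets = Counter(classify(m) for m in sample)
print(buckets)
```

With the real data this would be `Counter(classify(record.metadata) for record in records)`; a non-zero `"other"` count would explain any mismatch in the totals.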


In [None]:
pd.DataFrame(iodp)

Unnamed: 0,contributor,creator,date,description,identifier,relation,rights,subject,title,type
0,[International Ocean Discovery Program],"[Teske, Andreas P., Lizarralde, Daniel, Höfig,...",[2021-09-27],"[Report lists hole data: location, water depth...","[https://zenodo.org/record/7703126, 10.5281/ze...","[doi:10.14379/iodp.proc.385.2021, doi:10.5281/...","[info:eu-repo/semantics/openAccess, https://cr...","[International Ocean Discovery Program, IODP, ...",[IODP Expedition 385 Hole summary],"[info:eu-repo/semantics/other, dataset]"
1,[International Ocean Discovery Program],"[McKay, Robert M., De Santis, Laura, Kulhanek,...",[2019-08-10],[Height profile data were measured on the Sect...,"[https://zenodo.org/record/6515831, 10.5281/ze...","[doi:10.14379/iodp.proc.374.2019, doi:10.5281/...","[info:eu-repo/semantics/openAccess, https://cr...","[International Ocean Discovery Program, IODP, ...",[IODP Expedition 374 Laser height profile (sec...,"[info:eu-repo/semantics/other, dataset]"
2,[International Ocean Discovery Program],"[Hall, I.R., Hemming, S.R., LeVay, L.J., Barke...",[2020-01-27],[Magnetic susceptibility was measured on whole...,"[https://zenodo.org/record/3641925, 10.5281/ze...","[doi:10.14379/iodp.proc.361.2017, doi:10.5281/...","[info:eu-repo/semantics/openAccess, https://cr...","[Expedition 361, JOIDES Resolution, South Afri...",[IODP Expedition 361 Magnetic susceptibility (...,"[info:eu-repo/semantics/other, dataset]"
3,[International Ocean Discovery Program],"[Arculus, Richard, Ishizuka, Osamu, Bogus, Kar...",[2015-08-25],[Shear strength was measured on section halves...,"[https://zenodo.org/record/7072339, 10.5281/ze...","[doi:10.14379/iodp.proc.351.2015, doi:10.5281/...","[info:eu-repo/semantics/openAccess, https://cr...","[International Ocean Discovery Program, IODP, ...",[IODP Expedition 351 Vane shear strength (AVS)],"[info:eu-repo/semantics/other, dataset]"
4,[International Ocean Discovery Program],"[de Ronde, Cornel E.J., Humphris, Susan E., H...",[2019-07-05],[Magnetic susceptibility was measured on whole...,"[https://zenodo.org/record/7504108, 10.5281/ze...","[doi:10.14379/iodp.proc.376.2019, doi:10.5281/...","[info:eu-repo/semantics/openAccess, https://cr...","[International Ocean Discovery Program, IODP, ...",[IODP Expedition 376 Magnetic susceptibility (...,"[info:eu-repo/semantics/other, dataset]"
...,...,...,...,...,...,...,...,...,...,...
474,[International Ocean Discovery Program],"[Childress, L.B., Alvarez Zarikian, C.A., Bria...",[2021-01-29],[Report includes detailed information about sa...,"[https://zenodo.org/record/4480291, 10.5281/ze...","[doi:10.14379/iodp.proc.367368.2018, doi:10.52...","[info:eu-repo/semantics/openAccess, https://cr...","[International Ocean Discovery Program, IODP, ...",[IODP Expedition 368X Sample report],"[info:eu-repo/semantics/other, dataset]"
475,[International Ocean Discovery Program],"[Fryer, P., Wheat, C.G., Williams, T., Albers,...",[2020-03-31],[Magnetic susceptibility was measured on secti...,"[https://zenodo.org/record/3801738, 10.5281/ze...","[doi:10.14379/iodp.proc.366.2018, doi:10.5281/...","[info:eu-repo/semantics/openAccess, https://cr...","[Expedition 366, Mariana, Site 1200, Site U149...",[IODP Expedition 366 Magnetic susceptibility (...,"[info:eu-repo/semantics/other, dataset]"
476,[International Ocean Discovery Program],"[McNeill, L.C., Dugan, B., Petronotis, K.E., B...",[2020-03-31],[Report includes detailed core data: drilling ...,"[https://zenodo.org/record/3752066, 10.5281/ze...","[doi:10.14379/iodp.proc.362.2017, doi:10.5281/...","[info:eu-repo/semantics/openAccess, https://cr...","[Expedition 362, Sumatra subduction zone, Site...",[IODP Expedition 362 Core summary],"[info:eu-repo/semantics/other, dataset]"
477,[International Ocean Discovery Program],"[Childress, L.B., Alvarez Zarikian, C.A., Bria...",[2021-01-29],[SRA provides safety monitoring and yields inf...,"[https://zenodo.org/record/4480297, 10.5281/ze...","[doi:10.14379/iodp.proc.367368.2018, doi:10.52...","[info:eu-repo/semantics/openAccess, https://cr...","[International Ocean Discovery Program, IODP, ...",[IODP Expedition 368X SRA (Source Rock Analysis)],"[info:eu-repo/semantics/other, dataset]"


In [139]:
# Gets all the authors and the number of records they authored

z = jmespath.search('[*].creator[]',iodp)
c = Counter(z)
keys, values = zip(*c.items())
df = pd.DataFrame({"authors":keys,"count":values})
print(f"Number of authors: {df.shape[0]}")
df.head()


In [None]:
# Get the dataset types and their counts

z = jmespath.search('[*].title[]',iodp)
print(z[0])
# capture the dataset name that follows the "IODP Expedition <number>" prefix
z = [re.search(r'IODP Expedition [0-9]{1,3}[a-zA-Z]? (.+)',x).group(1) for x in z]
c = Counter(z)
keys, values = zip(*c.items())
df = pd.DataFrame({"dataset":keys,"count":values})
df = df.sort_values(by='count',ascending=False).reset_index(drop=True)
print(f"Unique report types: {df.shape[0]}")
df

IODP Expedition 385 Hole summary
Unique report types: 66


Unnamed: 0,dataset,count
0,IW elemental analysis (ICP-AES),12
1,Core composite images,12
2,Section-half images,12
3,Closeup images,12
4,Visual core description,12
...,...,...
61,XRF Summary,1
62,ICP-AES elemental analysis (solids),1
63,Whole Round Section Images,1
64,Magnetic susceptibility (KappaBridge),1


In [None]:
# Get expeditions and their dataset counts

z = jmespath.search('[*].title[]',iodp)
# returns a list of tuples (expedition, dataset name)
data = [re.search(r'(IODP Expedition [0-9]{1,3}[a-zA-Z]?) (.+)',x).groups() for x in z]
df = pd.DataFrame(data, columns=['expedition','dataset'])
df = df.groupby('expedition').count().reset_index()
df = df.rename(columns={"dataset":"dataset_count"})
df = df.sort_values(by='dataset_count',ascending=False)
df = df.reset_index(drop=True)
df

Unnamed: 0,expedition,dataset_count
0,IODP Expedition 385,53
1,IODP Expedition 362,52
2,IODP Expedition 351,46
3,IODP Expedition 350,45
4,IODP Expedition 352,45
5,IODP Expedition 361,44
6,IODP Expedition 366,43
7,IODP Expedition 374,41
8,IODP Expedition 372A,37
9,IODP Expedition 376,37


In [None]:
iodp

[{'contributor': ['International Ocean Discovery Program'],
  'creator': ['Teske, Andreas P.',
   'Lizarralde, Daniel',
   'Höfig, Tobias W.',
   'Aiello, Ivano W.',
   'Ash, Jeanine  L.',
   'Bojanova, Diana P.',
   'Buatier, Martine D.',
   'Edgcomb, Virginia P.',
   'Galerne, Christophe Y.',
   'Gontharet, Swanne',
   'Heuer, Verena B.',
   'Jiang, Shijun',
   'Kars, Myriam  A.C.',
   'Kim, Ji-Hoon',
   'Koornneef, Louise M.T.',
   'Marsaglia, Kathleen M.',
   'Meyer, Nicolette R.',
   'Morono, Yuki',
   'Negrete-Aranda, Raquel',
   'Neumann, Florian',
   'Pastor, Lucie C.',
   'Peña-Salinas, Manet E.',
   'Perez-Cruz, Ligia',
   'Ran, Lihua',
   'Riboulleau, Armelle',
   'Sarao, John A.',
   'Schubert, Florian',
   'Singh, S. K.',
   'Stock, Joann M.',
   'Toffin, Laurent M.A.A.',
   'Xie, Wei',
   'Yamanaka, Toshiro',
   'Zhuang, Guangchao'],
  'date': ['2021-09-27'],
  'description': ['Report lists hole data: location, water depth, drilling, coring system, and core recovery.'],
 

In [None]:
jmespath.search('[*].title[]',iodp)

['IODP Expedition 385 Hole summary',
 'IODP Expedition 374 Laser height profile (section half)',
 'IODP Expedition 361 Magnetic susceptibility (whole round)',
 'IODP Expedition 351 Vane shear strength (AVS)',
 'IODP Expedition 376 Magnetic susceptibility (whole round)',
 'IODP Expedition 376 Magnetic susceptibility (point or contact system)',
 'IODP Expedition 376 Photomicrographs',
 'IODP Expedition 376 P-wave velocity logger (whole round)',
 'IODP Expedition 376 Carbonates composite report',
 'IODP Expedition 376 Hole summary',
 'IODP Expedition 376 Bulk Density (GRA)',
 'IODP Expedition 376 Visual core description',
 'IODP Expedition 376 Visual core description',
 'IODP Expedition 376 Core drilling summary',
 'IODP Expedition 352 Thin section images',
 'IODP Expedition 352 Laser height profile (section half)',
 'IODP Expedition 352 Salinity',
 'IODP Expedition 352 Bulk Density (GRA)',
 'IODP Expedition 350 Hole drilling summary',
 'IODP Expedition 350 Natural gamma radiation',
 'IOD

In [None]:
# Get the record URL, DOI, OAI identifier, and related proceedings DOI for all records
z = jmespath.search('[*].[title[0], identifier[0], identifier[1], identifier[2], relation[0]]',iodp)
df = pd.DataFrame(z, columns=['dataset', 'zenodo_recordid', 'doi','oai_doi', 'proceedings_doi'])
df = df.sort_values(by='dataset').reset_index(drop=True)
df

| | dataset | zenodo_recordid | doi | oai_doi | proceedings_doi |
|---|---|---|---|---|---|
| 0 | IODP Expedition 350 Alkalinity and pH | https://zenodo.org/record/7502011 | 10.5281/zenodo.7502011 | oai:zenodo.org:7502011 | doi:10.14379/iodp.proc.350.2015 |
| 1 | IODP Expedition 350 Bulk Density (GRA) | https://zenodo.org/record/7502133 | 10.5281/zenodo.7502133 | oai:zenodo.org:7502133 | doi:10.14379/iodp.proc.350.2015 |
| 2 | IODP Expedition 350 Carbonates composite report | https://zenodo.org/record/7502378 | 10.5281/zenodo.7502378 | oai:zenodo.org:7502378 | doi:10.14379/iodp.proc.350.2015 |
| 3 | IODP Expedition 350 Closeup images | https://zenodo.org/record/7502530 | 10.5281/zenodo.7502530 | oai:zenodo.org:7502530 | doi:10.14379/iodp.proc.350.2015 |
| 4 | IODP Expedition 350 Closeup images | https://zenodo.org/record/7502041 | 10.5281/zenodo.7502041 | oai:zenodo.org:7502041 | doi:10.14379/iodp.proc.350.2015 |
| ... | ... | ... | ... | ... | ... |
| 474 | IODP Expedition 385 Vane shear strength (Torvane) | https://zenodo.org/record/7708697 | 10.5281/zenodo.7708697 | oai:zenodo.org:7708697 | doi:10.14379/iodp.proc.385.2021 |
| 475 | IODP Expedition 385 Visual core description | https://zenodo.org/record/7708674 | 10.5281/zenodo.7708674 | oai:zenodo.org:7708674 | doi:10.14379/iodp.proc.385.2021 |
| 476 | IODP Expedition 385 Whole-round core section c... | https://zenodo.org/record/7706695 | 10.5281/zenodo.7706695 | oai:zenodo.org:7706695 | doi:10.14379/iodp.proc.385.2021 |
| 477 | IODP Expedition 385 Whole-round core section i... | https://zenodo.org/record/7706667 | 10.5281/zenodo.7706667 | oai:zenodo.org:7706667 | doi:10.14379/iodp.proc.385.2021 |
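
The multiselect expression `[*].[title[0], identifier[0], ...]` above builds one row per record and yields a null wherever a field or index is missing. A minimal plain-Python sketch of the same idea follows. The helper names `pick` and `to_row` are hypothetical, and the sample record is assembled from the first row of the table above.

```python
# Sample record assembled from row 0 of the table above.
sample = {
    "title": ["IODP Expedition 350 Alkalinity and pH"],
    "identifier": [
        "https://zenodo.org/record/7502011",
        "10.5281/zenodo.7502011",
        "oai:zenodo.org:7502011",
    ],
    "relation": ["doi:10.14379/iodp.proc.350.2015"],
}

def pick(record, field, index):
    """Return record[field][index], or None when the field or index is missing."""
    values = record.get(field, [])
    return values[index] if index < len(values) else None

def to_row(record):
    """Build [dataset, zenodo_recordid, doi, oai_doi, proceedings_doi] for one record."""
    return [
        pick(record, "title", 0),
        pick(record, "identifier", 0),
        pick(record, "identifier", 1),
        pick(record, "identifier", 2),
        pick(record, "relation", 0),
    ]

print(to_row(sample))
```

Note that this relies on the order of the `identifier` entries being the same for every record, which is what the table above suggests but which is an assumption rather than something the OAI-PMH response guarantees.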


In [None]:
# Datasets with multiple versions (the same title appears more than once).
# When several versions exist, use the most recent ('highest') DOI.
counts = Counter(df['dataset'])

# use most_common() to get an ordered list of the counts
# counts.most_common()

# datasets that have more than one instance
data = [item for item in counts.items() if item[1] > 1]
df = pd.DataFrame(data,columns=['dataset','count_versions'])
df = df.sort_values('count_versions',ascending=False).reset_index(drop=True)
df

| | dataset | count_versions |
|---|---|---|
| 0 | IODP Expedition 350 Closeup images | 2 |
| 1 | IODP Expedition 350 IW elemental analysis (ICP... | 2 |
| 2 | IODP Expedition 350 Section-half images | 2 |
| 3 | IODP Expedition 351 IW elemental analysis (ICP... | 2 |
| 4 | IODP Expedition 361 Magnetic remanence (SRM-lo... | 2 |
| 5 | IODP Expedition 362 Scanning electron microsco... | 2 |
| 6 | IODP Expedition 362 X-ray diffraction (XRD) | 2 |
| 7 | IODP Expedition 366 Interstitial water composi... | 2 |
| 8 | IODP Expedition 368X Carbonates composite report | 2 |
| 9 | IODP Expedition 376 Visual core description | 2 |
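
The cell above only counts duplicated titles; it does not yet pick the most recent ('highest') DOI. One possible way to do that is sketched below in plain Python. It assumes that a larger numeric suffix in the Zenodo DOI means a later upload and that records sharing a title really are versions of the same dataset. Both are assumptions, and `record_number` and `latest` are hypothetical names. The sample rows are taken from the DOI table earlier in this notebook.

```python
# (dataset title, DOI) pairs taken from the DOI table earlier in this notebook.
rows = [
    ("IODP Expedition 350 Closeup images", "10.5281/zenodo.7502530"),
    ("IODP Expedition 350 Closeup images", "10.5281/zenodo.7502041"),
    ("IODP Expedition 350 Alkalinity and pH", "10.5281/zenodo.7502011"),
]

def record_number(doi):
    """Numeric suffix of a Zenodo DOI, e.g. 7502530 for 10.5281/zenodo.7502530."""
    return int(doi.rsplit(".", 1)[1])

# Keep, for each title, the DOI with the largest record number.
latest = {}
for dataset, doi in rows:
    if dataset not in latest or record_number(doi) > record_number(latest[dataset]):
        latest[dataset] = doi

for dataset, doi in latest.items():
    print(dataset, doi)
```

If the upload date matters more than the identifier, comparing the `date` field of each record would be a safer criterion.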


In [None]:
sorted(jmespath.search('[*].description[]',iodp))

['A digital composite image (PNG) is made for each core comprising core sections scanned using a line-scan camera. The composite layout is equivalent to traditional core table photos. Top left is top of core; color and meter rule references are included.',
 'A digital composite image (PNG) is made for each core comprising core sections scanned using a line-scan camera. The composite layout is equivalent to traditional core table photos. Top left is top of core; color and meter rule references are included.',
 'A digital composite image (PNG) is made for each core comprising core sections scanned using a line-scan camera. The composite layout is equivalent to traditional core table photos. Top left is top of core; color and meter rule references are included.',
 'A digital composite image (PNG) is made for each core comprising core sections scanned using a line-scan camera. The composite layout is equivalent to traditional core table photos. Top left is top of core; color and meter rule

In [None]:
records.next().metadata

{'contributor': ['International Ocean Discovery Program'],
 'creator': ['Barnes, Philip M.',
  'Pecher, Ingo A.',
  'LeVay, Leah J.',
  'Bourlange, Sylvain',
  'Brunet, Morgane M.Y.',
  'Cardona, Sebastian',
  'Clennell, Michael B.',
  'Cook, Ann E.',
  'Crundwell, Martin P.',
  'Dugan, Brandon',
  'Elger, Judith',
  'Gamboa, Davide',
  'Georgiopoulou, Aggeliki',
  'Greve, Annika',
  'Han, Shuoshuo',
  'Heeschen, Katja U.',
  'Hu, Gaowei',
  'Kim, Gil Young',
  'Kitajima, Hiroko',
  'Koge, Hiroaki',
  'Li, Xuesen',
  'Machado, Karina S.',
  'McNamara, David D.',
  'Moore, Gregory F.',
  'Mountjoy, Joshu J.',
  'Nole, Michael A.',
  'Owari, Satoko',
  'Paganoni, Matteo',
  'Petronotis, Katerina',
  'Rose, Paula S.',
  'Screaton, Elizabeth J.',
  'Shankar, Uma',
  'Shepherd, Claire L.',
  'Torres, Marta E.',
  'Underwood, Michael B.',
  'Wang, Xiujuan',
  'Woodhouse, Adam D.',
  'Wu, Hung-Yu (Sonata)'],
 'date': ['2019-05-05'],
 'description': ['Interstitial water constitutents were meas

In [8]:
# Selective harvesting: only request records whose OAI-PMH datestamp is on or after 2023-01-01
records = sickle.ListRecords(**{
    'metadataPrefix': 'oai_dc',
    'set': 'user-iodp',
    'from': '2023-01-01',
})
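
Because `from` is a reserved word in Python, the cell above passes the arguments as an unpacked dict rather than as keyword arguments. The same request can also be written as a plain URL, which is handy for checking the response in a browser. The sketch below builds that URL with the standard library only. `list_records_url` is a hypothetical helper, and the optional `until` argument (also defined by OAI-PMH) is included for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "https://zenodo.org/oai2d"  # endpoint listed at the top of this notebook

def list_records_url(metadata_prefix, set_spec, from_date=None, until_date=None):
    """Build a ListRecords URL, adding the optional date bounds when given."""
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "set": set_spec,
    }
    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date
    return f"{BASE_URL}?{urlencode(params)}"

url = list_records_url("oai_dc", "user-iodp", from_date="2023-01-01")
print(url)
```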

In [63]:
data = records.next().metadata

In [68]:
data = json.dumps(data)

In [49]:
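# Note: if the cells are run top to bottom, `data` is already a JSON string here,
# so this call encodes it a second time; `g` is not used by the next cell.
g = json.dumps(data)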

In [71]:
# https://stackoverflow.com/questions/6578986/how-to-convert-json-data-into-a-python-object
# Parse JSON into an object with attributes corresponding to dict keys.
from types import SimpleNamespace  # needed if not already imported in an earlier cell

x = json.loads(data, object_hook=lambda d: SimpleNamespace(**d))
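
To show what the `object_hook` conversion gives you, here is a small self-contained sketch using two fields from the record printed earlier in this notebook. The names `sample`, `text`, and `record` are hypothetical. Each JSON object becomes a `SimpleNamespace`, so fields are read as attributes instead of dict keys.

```python
import json
from types import SimpleNamespace

# Two fields copied from the record printed earlier in this notebook.
sample = {
    "contributor": ["International Ocean Discovery Program"],
    "date": ["2019-05-05"],
}

text = json.dumps(sample)  # dict -> JSON string
record = json.loads(text, object_hook=lambda d: SimpleNamespace(**d))  # JSON -> object

print(record.date[0])
print(record.contributor[0])
```

Attribute access only works for keys that are valid Python identifiers; the Dublin Core field names seen in this notebook (`title`, `creator`, `date`, and so on) all qualify.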
