# Finding metadata files

In this analysis we'll explore our samples, looking for self-describing datasets and metadata files.

From the file listing exercise we found that the archive content includes several files with extensions like `*.json`, `*.xml`, `*.ttl` and `*.rdf`, indicating structured data, some of which potentially contain metadata. Here we look further into their content to find to what extent they can be called self-describing or considered prototypical research objects. The hypothesis is that such files, if present, would neighbour the files they describe within the same archive or record.

We will also look for files which are directly named "metadata" and similar variants.

In [5]:
!mkdir -p data/metadata-files/rdf data/metadata-files/json data/metadata-files/xml data/metadata-files/metadata

## RDF files

Although RDF was conceived as a mechanism for structured metadata on the web, it has long been used for holding any semantically structured data. Recently metadata has seen a resurgence thanks to [schema.org](http://schema.org/) and [JSON-LD](http://json-ld.org/). One hypothesis is that most of the RDF data in Zenodo are domain-specific data dumps, as metadata would mainly be useful on the web (to be indexed by search engines) rather than in static archives.

However, there is no clear distinction between data and metadata, so we should also inspect the RDF files we have sampled to see if they use metadata-like properties such as <http://purl.org/dc/terms/creator> or <http://www.w3.org/ns/prov#wasDerivedFrom>. Metadata files describing neighbouring files should in theory use relative paths to identify them, so we should also check for URIs that are not HTTP-based.
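A crude first pass over such files could be sketched as follows. This is only a sketch under stated assumptions: it pulls `<...>` IRIs out of N-Triples or Turtle text with a regular expression rather than a real RDF parser, so prefixed names such as `dcterms:creator` are not seen. The function name `summarise_iris` and the list of "metadata-like" namespaces are illustrative choices, not part of the original analysis.

```python
import re
from collections import Counter

# Namespaces treated as "metadata-like" here; an illustrative, non-exhaustive choice.
METADATA_NAMESPACES = (
    "http://purl.org/dc/terms/",
    "http://purl.org/dc/elements/1.1/",
    "http://www.w3.org/ns/prov#",
    "http://schema.org/",
)

IRI_PATTERN = re.compile(r"<([^<>\s]*)>")
SCHEME_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:")


def summarise_iris(text):
    """Count metadata-namespace IRIs and collect IRIs that are not http(s)."""
    metadata_hits = Counter()
    non_http = set()
    for iri in IRI_PATTERN.findall(text):
        if iri.startswith(METADATA_NAMESPACES):
            metadata_hits[iri] += 1
        if not iri.lower().startswith(("http://", "https://")):
            # Either a relative reference (no scheme) or a scheme such as file: or urn:
            kind = "other-scheme" if SCHEME_PATTERN.match(iri) else "relative"
            non_http.add((kind, iri))
    return metadata_hits, non_http


# Made-up triples, only to exercise the function.
sample = """
<data/table.csv> <http://purl.org/dc/terms/creator> <https://example.org/alice> .
<data/table.csv> <http://www.w3.org/ns/prov#wasDerivedFrom> <urn:uuid:1234> .
"""
hits, others = summarise_iris(sample)
print(hits)
print(sorted(others))
```

Relative references such as `data/table.csv` are the interesting case for the hypothesis, since they would point at neighbouring files in the same archive.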

First we'll explore what kind of filenames are used for RDF files, then we'll (re)download and extract them to look at the properties and URIs they use. While we look for the common RDF extensions, we exclude RDFa here as it would be embedded in HTML/XHTML/SVG. Such documents would typically describe themselves rather than the containing archive, but are worth a later look.

In [25]:
! egrep -ri '\.(ttl|rdf|jsonld|n3|nq|nt|trig)$' data/*/listing > data/metadata-files/rdf/listing.txt
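When matching on extensions it helps to require the dot explicitly, since a bare suffix such as `nt$` would also match a name like `document`. A minimal Python sketch of the same filter, where `is_rdf_filename` and the example names are hypothetical:

```python
import re

# Extensions taken from the egrep pattern above; the leading dot is required
# so that names merely ending in "nt" or "n3" (e.g. "document") do not match.
RDF_SUFFIX = re.compile(r"\.(ttl|rdf|jsonld|n3|nq|nt|trig)$", re.IGNORECASE)


def is_rdf_filename(path):
    """Return True if the path ends in one of the common RDF extensions."""
    return RDF_SUFFIX.search(path) is not None


# Made-up names, only to exercise the filter.
names = ["ontology.ttl", "graph.JSONLD", "dump.nt", "README.txt", "document", "paper.docx"]
rdf_names = [n for n in names if is_rdf_filename(n)]
print(rdf_names)
```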

By looking up the sample id we can find which corresponding ZIP files to download and extract.

**TODO**: Change below to a workflow

In [26]:
! cat data/metadata-files/rdf/listing.txt | cut -d ":" -f 1 | sort | uniq \
  | sed s/listing/sample/ | sed s/txt$/tsv/ | \
  xargs awk '{print $5}' > data/metadata-files/rdf/urls.txt
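The pipeline above maps each listing file to its sample TSV and prints the fifth field. The same lookup could be sketched in Python as below. The function names are hypothetical, and the only thing assumed about the TSV layout is what the `awk '{print $5}'` call implies: the download URL is the fifth whitespace-separated field. Unlike `sed s/listing/sample/`, which rewrites the first occurrence of "listing" in the path string, this sketch swaps the directory component explicitly.

```python
import tempfile
from pathlib import Path


def sample_path_for(listing_path):
    """Map data/<type>/listing/<id>.txt to data/<type>/sample/<id>.tsv."""
    p = Path(listing_path)
    return p.parent.parent / "sample" / (p.stem + ".tsv")


def urls_for_listing_hits(hit_lines, url_column=4):
    """Collect URLs for every listing file named in grep-style hit lines.

    Each hit looks like '<listing path>:<member name>'. The URL is assumed to
    be the fifth whitespace-separated field (index 4), as in awk '{print $5}'.
    """
    listings = sorted({line.split(":", 1)[0] for line in hit_lines if ":" in line})
    urls = []
    for listing in listings:
        for row in sample_path_for(listing).read_text().splitlines():
            fields = row.split()
            if len(fields) > url_column:
                urls.append(fields[url_column])
    return urls


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "data" / "dataset"
    (root / "sample").mkdir(parents=True)
    # Placeholder row: only the position of the URL mirrors the awk call.
    (root / "sample" / "abcd.tsv").write_text("f1\tf2\tf3\tf4\thttps://example.org/a.zip\n")
    demo_hits = [
        f"{root}/listing/abcd.txt:dir/one.ttl",
        f"{root}/listing/abcd.txt:dir/two.ttl",
    ]
    demo_urls = urls_for_listing_hits(demo_hits)
    print(demo_urls)
```

Two hits from the same listing yield a single URL, matching the `sort | uniq` step.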

In [28]:
!cat data/metadata-files/rdf/urls.txt

https://zenodo.org/api/files/f9911cdb-7923-4d1a-9693-9465ddd8dea3/Lang_1.zip
https://zenodo.org/api/files/54def73c-2a6b-4741-82c6-9d6bd1618184/PMC403.zip
https://zenodo.org/api/files/602ad012-6cad-4e1c-9e13-ffaf65b282b6/raptr-manuscript.zip
https://zenodo.org/api/files/5f4072d6-d4b9-4f95-afb4-4ca2da9366fd/moving-block-req-models.zip
https://zenodo.org/api/files/abab110d-78c0-4425-883a-f7ec576d7eb9/Math_29.zip
https://zenodo.org/api/files/54def73c-2a6b-4741-82c6-9d6bd1618184/PMC204.zip
https://zenodo.org/api/files/d642ee67-643b-4687-a5f8-27a56a465eb8/AE-AT-Opt-PACT18.zip
https://zenodo.org/api/files/978c426e-1be6-41c9-bf3f-017df44ce850/GC_QTOF_PP_samples.zip
https://zenodo.org/api/files/5b9631aa-2aaa-4d1f-9bc7-64b68a184ee7/BostonFingerprints2014_RawData_tc10.zip
https://zenodo.org/api/files/c7cd6c50-ca6a-4974-9073-ae4cdd77b515/Lang_57.zip
https://zenodo.org/api/files/755f6ed5-133f-4d9d-b43b-bff56d814821/Bohdan-Khomtchouk/ENCODE_histone_geneXtendeR_analysis-v1.0.zip
https://ze

## JSON files


In [32]:
! egrep -ri '\.(json|jsonld)$' data/*/listing > data/metadata-files/json/listing.txt

In [33]:
! head -n 1000 data/metadata-files/json/listing.txt

data/dataset/listing/zazo.txt:data/2017-05-18-145157833880/freeplay.poses.json
data/dataset/listing/zazo.txt:data/2017-05-18-145157833880/visual_tracking.poses.json
data/dataset/listing/zazo.txt:data/2017-05-18-152118060656/freeplay.poses.json
data/dataset/listing/zazo.txt:data/2017-05-18-152118060656/visual_tracking.poses.json
data/dataset/listing/zazo.txt:data/2017-06-01-102743523980/freeplay.poses.json
data/dataset/listing/zazo.txt:data/2017-06-01-102743523980/visual_tracking.poses.json
data/dataset/listing/zazo.txt:data/2017-06-06-145135235899/freeplay.poses.json
data/dataset/listing/zazo.txt:data/2017-06-06-145135235899/visual_tracking.poses.json
data/dataset/listing/zazo.txt:data/2017-06-06-150808383862/freeplay.poses.json
data/dataset/listing/zazo.txt:data/2017-06-06-150808383862/visual_tracking.poses.json
data/dataset/listing/zazo.txt:data/2017-06-07-101750079399/freeplay.poses.json
data/dataset/listing/zazo.txt:data/2017-06-07-101750079399/visual_tracking.poses.json

In [None]:
! grep jsonld data/metadata-files/json/listing.txt

In [None]:
! egrep -i 'meta.?data' data/metadata-files/json/listing.txt

In [None]:
! grep -i manifest data/metadata-files/json/listing.txt
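The two name searches above match anywhere in the listing line, including directory names. A stricter variant that looks only at the base filename could be sketched like this; `metadata_like` and the example lines are hypothetical.

```python
import re
from pathlib import PurePosixPath

# Same patterns as the grep calls above, combined into one expression.
NAME_PATTERN = re.compile(r"meta.?data|manifest", re.IGNORECASE)


def metadata_like(listing_lines):
    """Yield (listing, member) pairs whose base filename looks like metadata."""
    for line in listing_lines:
        listing, sep, member = line.rstrip("\n").partition(":")
        if sep and NAME_PATTERN.search(PurePosixPath(member).name):
            yield listing, member


# Made-up listing lines in the '<listing path>:<member name>' shape.
lines = [
    "data/dataset/listing/abcd.txt:pkg/metadata.json",
    "data/dataset/listing/abcd.txt:pkg/meta_data/values.json",
    "data/dataset/listing/abcd.txt:pkg/MANIFEST.json",
    "data/dataset/listing/abcd.txt:pkg/poses.json",
]
matches = [member for _, member in metadata_like(lines)]
print(matches)
```

Here `pkg/meta_data/values.json` is not reported because only its directory, not its filename, matches; whether that is desirable depends on how the archives are organised.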

In [35]:
! cat data/metadata-files/json/listing.txt | cut -d ":" -f 1 | sort | uniq \
  | sed s/listing/sample/ | sed s/txt$/tsv/ | \
  xargs awk '{print $5}' > data/metadata-files/json/urls.txt
!cat data/metadata-files/json/urls.txt

https://zenodo.org/api/files/602ad012-6cad-4e1c-9e13-ffaf65b282b6/raptr-manuscript.zip
https://zenodo.org/api/files/792b2464-eedf-427f-8166-883a85de8b5d/Fe-Example_input_files.zip
https://zenodo.org/api/files/ec40beec-f99c-4f95-914e-dcc329df3a84/weecology/PortalData-1.104.0.zip
https://zenodo.org/api/files/cf2590f4-e026-4400-89ba-bc83fe3c665c/rome.zip
https://zenodo.org/api/files/579d4580-d2b4-4441-ab11-3a9bdbf8aa89/CN-TU/nta-meta-analysis-2018.11.zip
https://zenodo.org/api/files/5d6ce4de-f954-4e1d-9285-4fce0ef0a146/Collocated%20Datesets%20for%20Terra%20and%20Aqua.zip
https://zenodo.org/api/files/5935dbec-ac23-4b32-8421-4b3bf07e3f75/LANL_Mustang_parquet.zip
https://zenodo.org/api/files/217beb43-3a3a-429b-86cf-a154ded2f6e5/graphs_json.zip
https://zenodo.org/api/files/a2fec3ee-69f7-467e-aeba-f9d8bab7853d/persons2.zip
https://zenodo.org/api/files/706a040e-d9be-41c7-8677-cd94d52a1fb8/data_metrics.zip
https://zenodo.org/api/files/8b54bee9-9932-4f9d-8cbb-61f83c5ccf56/collection-v1.

(Re)download all the ZIP archives that contain JSON files so that they can be unzipped.

In [None]:
!cat data/metadata-files/json/urls.txt | xargs wget --mirror --directory-prefix=data/metadata-files/json/

--2020-02-07 11:29:34--  https://zenodo.org/api/files/602ad012-6cad-4e1c-9e13-ffaf65b282b6/raptr-manuscript.zip
Resolving zenodo.org (zenodo.org)... 188.184.95.95
Connecting to zenodo.org (zenodo.org)|188.184.95.95|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 721714644 (688M) [application/octet-stream]
Saving to: 'data/metadata-files/json/zenodo.org/api/files/602ad012-6cad-4e1c-9e13-ffaf65b282b6/raptr-manuscript.zip'


2020-02-07 11:30:20 (15.3 MB/s) - 'data/metadata-files/json/zenodo.org/api/files/602ad012-6cad-4e1c-9e13-ffaf65b282b6/raptr-manuscript.zip' saved [721714644/721714644]

--2020-02-07 11:30:20--  https://zenodo.org/api/files/792b2464-eedf-427f-8166-883a85de8b5d/Fe-Example_input_files.zip
Reusing existing connection to zenodo.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 6098 (6.0K) [application/octet-stream]
Saving to: 'data/metadata-files/json/zenodo.org/api/files/792b2464-eedf-427f-8166-883a85de8b5d/Fe-Example_input_files.zi

HTTP request sent, awaiting response... 200 OK
Length: 27769190264 (26G) [application/octet-stream]
Saving to: 'data/metadata-files/json/zenodo.org/api/files/ed4de091-d86d-465a-bc3b-9d70e12a77b4/antiSMASH_Actinobacterial_BGCs.zip'


2020-02-07 12:02:49 (14.3 MB/s) - 'data/metadata-files/json/zenodo.org/api/files/ed4de091-d86d-465a-bc3b-9d70e12a77b4/antiSMASH_Actinobacterial_BGCs.zip' saved [27769190264/27769190264]

--2020-02-07 12:02:49--  https://zenodo.org/api/files/e0425375-1288-4105-b970-08ebd8c654a8/dataset_for_validation_experiments_mitmprobe.zip
Reusing existing connection to zenodo.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 15741512476 (15G) [application/octet-stream]
Saving to: 'data/metadata-files/json/zenodo.org/api/files/e0425375-1288-4105-b970-08ebd8c654a8/dataset_for_validation_experiments_mitmprobe.zip'


2020-02-07 12:20:14 (14.4 MB/s) - 'data/metadata-files/json/zenodo.org/api/files/e0425375-1288-4105-b970-08ebd8c654a8/dataset_for_validation_exper

HTTP request sent, awaiting response... 200 OK
Length: 316605 (309K) [application/octet-stream]
Saving to: 'data/metadata-files/json/zenodo.org/api/files/6cce1121-0f26-4516-88b7-17bb1663a0c8/survey.zip'


2020-02-07 12:25:42 (14.0 MB/s) - 'data/metadata-files/json/zenodo.org/api/files/6cce1121-0f26-4516-88b7-17bb1663a0c8/survey.zip' saved [316605/316605]

--2020-02-07 12:25:42--  https://zenodo.org/api/files/14466297-96dc-4445-83d0-4c3299eb42c4/weecology/PortalData-1.98.0.zip
Reusing existing connection to zenodo.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 26784196 (26M) [application/octet-stream]
Saving to: 'data/metadata-files/json/zenodo.org/api/files/14466297-96dc-4445-83d0-4c3299eb42c4/weecology/PortalData-1.98.0.zip'


2020-02-07 12:25:44 (18.4 MB/s) - 'data/metadata-files/json/zenodo.org/api/files/14466297-96dc-4445-83d0-4c3299eb42c4/weecology/PortalData-1.98.0.zip' saved [26784196/26784196]

--2020-02-07 12:25:44--  https://zenodo.org/api/files/a191b63a-9d66

HTTP request sent, awaiting response... 200 OK
Length: 3182849385 (3.0G) [application/octet-stream]
Saving to: 'data/metadata-files/json/zenodo.org/api/files/b37db781-e7a5-4c00-9750-6bd8c2ffbcd2/sub-04.zip'


2020-02-07 12:36:31 (17.0 MB/s) - 'data/metadata-files/json/zenodo.org/api/files/b37db781-e7a5-4c00-9750-6bd8c2ffbcd2/sub-04.zip' saved [3182849385/3182849385]

--2020-02-07 12:36:31--  https://zenodo.org/api/files/875d5f34-367a-4cab-b243-a20df1ead590/weecology/portalPredictions-2018-03-16.zip
Reusing existing connection to zenodo.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 67549249 (64M) [application/octet-stream]
Saving to: 'data/metadata-files/json/zenodo.org/api/files/875d5f34-367a-4cab-b243-a20df1ead590/weecology/portalPredictions-2018-03-16.zip'


2020-02-07 12:36:38 (10.3 MB/s) - 'data/metadata-files/json/zenodo.org/api/files/875d5f34-367a-4cab-b243-a20df1ead590/weecology/portalPredictions-2018-03-16.zip' saved [67549249/67549249]

--2020-02-07 12:36:38-

HTTP request sent, awaiting response... 200 OK
Length: 4711348 (4.5M) [application/octet-stream]
Saving to: 'data/metadata-files/json/zenodo.org/api/files/621c107b-e0e8-416d-be69-d6d0ce82bd9c/datacarpentry/R-ecology-lesson-v2017.04.3.zip'


2020-02-07 12:40:45 (13.5 MB/s) - 'data/metadata-files/json/zenodo.org/api/files/621c107b-e0e8-416d-be69-d6d0ce82bd9c/datacarpentry/R-ecology-lesson-v2017.04.3.zip' saved [4711348/4711348]

--2020-02-07 12:40:45--  https://zenodo.org/api/files/e5a55eb3-b97e-4591-99f9-063400e0d852/glo-d-i-um/glo-d-i-um.github.io-1.0.zip
Reusing existing connection to zenodo.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 2371693 (2.3M) [application/octet-stream]
Saving to: 'data/metadata-files/json/zenodo.org/api/files/e5a55eb3-b97e-4591-99f9-063400e0d852/glo-d-i-um/glo-d-i-um.github.io-1.0.zip'


2020-02-07 12:40:45 (12.7 MB/s) - 'data/metadata-files/json/zenodo.org/api/files/e5a55eb3-b97e-4591-99f9-063400e0d852/glo-d-i-um/glo-d-i-um.github.io-1.0.zi


2020-02-07 12:42:46 (12.7 MB/s) - 'data/metadata-files/json/zenodo.org/api/files/9d2e22e0-b077-4844-96ee-09f2556e1310/2019-05-23_HandsOn_GrainBoundary1.zip' saved [93656378/93656378]

--2020-02-07 12:42:46--  https://zenodo.org/api/files/aceb9a66-c991-4d19-a299-0c64061e1f19/larmarange/analyse-R-2018-12-08.zip
Reusing existing connection to zenodo.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 36770908 (35M) [application/octet-stream]
Saving to: 'data/metadata-files/json/zenodo.org/api/files/aceb9a66-c991-4d19-a299-0c64061e1f19/larmarange/analyse-R-2018-12-08.zip'


2020-02-07 12:42:49 (13.2 MB/s) - 'data/metadata-files/json/zenodo.org/api/files/aceb9a66-c991-4d19-a299-0c64061e1f19/larmarange/analyse-R-2018-12-08.zip' saved [36770908/36770908]

--2020-02-07 12:42:49--  https://zenodo.org/api/files/a991f7bc-f870-42ff-8ed1-9613fe3f9c46/o2r_project_website_and_blog_git-repository.zip
Reusing existing connection to zenodo.org:443.
HTTP request sent, awaiting response... 20

HTTP request sent, awaiting response... 200 OK
Length: 36254504 (35M) [application/octet-stream]
Saving to: 'data/metadata-files/json/zenodo.org/api/files/92d8577a-bf28-452c-847f-3a542087daeb/Open-Scholarship-Strategy/indexed-2.04.zip'


2020-02-07 12:44:07 (12.2 MB/s) - 'data/metadata-files/json/zenodo.org/api/files/92d8577a-bf28-452c-847f-3a542087daeb/Open-Scholarship-Strategy/indexed-2.04.zip' saved [36254504/36254504]

--2020-02-07 12:44:07--  https://zenodo.org/api/files/7968d3ec-1501-4bd2-8bd3-3d934dd492a2/datacarpentry/OpenRefine-ecology-lesson-v2017.04.0.zip
Reusing existing connection to zenodo.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 529473 (517K) [application/octet-stream]
Saving to: 'data/metadata-files/json/zenodo.org/api/files/7968d3ec-1501-4bd2-8bd3-3d934dd492a2/datacarpentry/OpenRefine-ecology-lesson-v2017.04.0.zip'


2020-02-07 12:44:08 (11.3 MB/s) - 'data/metadata-files/json/zenodo.org/api/files/7968d3ec-1501-4bd2-8bd3-3d934dd492a2/datacarpentry


2020-02-07 12:46:00 (15.3 MB/s) - 'data/metadata-files/json/zenodo.org/api/files/6707b75d-848d-46f5-8cbf-36f7047a6458/data.zip' saved [148931121/148931121]

--2020-02-07 12:46:00--  https://zenodo.org/api/files/94257e8d-bb76-45df-8448-68b04dfeb2bc/v.2018-02-28.zip
Reusing existing connection to zenodo.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 666510 (651K) [application/octet-stream]
Saving to: 'data/metadata-files/json/zenodo.org/api/files/94257e8d-bb76-45df-8448-68b04dfeb2bc/v.2018-02-28.zip'


2020-02-07 12:46:00 (9.50 MB/s) - 'data/metadata-files/json/zenodo.org/api/files/94257e8d-bb76-45df-8448-68b04dfeb2bc/v.2018-02-28.zip' saved [666510/666510]

--2020-02-07 12:46:00--  https://zenodo.org/api/files/cb9df964-0c4e-4829-a6a4-b9739ccf3c19/DMPs.zip
Reusing existing connection to zenodo.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 170620 (167K) [application/octet-stream]
Saving to: 'data/metadata-files/json/zenodo.org/api/files/cb9df964-0c4e-48

HTTP request sent, awaiting response... 200 OK
Length: 3602821 (3.4M) [application/octet-stream]
Saving to: 'data/metadata-files/json/zenodo.org/api/files/2d2652c6-c555-43e4-aa87-223d3da93b54/swcarpentry/make-novice-v2019.06.1.zip'


2020-02-07 12:46:36 (11.5 MB/s) - 'data/metadata-files/json/zenodo.org/api/files/2d2652c6-c555-43e4-aa87-223d3da93b54/swcarpentry/make-novice-v2019.06.1.zip' saved [3602821/3602821]

--2020-02-07 12:46:36--  https://zenodo.org/api/files/75e63414-b867-4a08-af77-69ae4c058c27/Coupette_Juristische_Netzwerkforschung_Online-Appendix.zip
Reusing existing connection to zenodo.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 129183992 (123M) [application/octet-stream]
Saving to: 'data/metadata-files/json/zenodo.org/api/files/75e63414-b867-4a08-af77-69ae4c058c27/Coupette_Juristische_Netzwerkforschung_Online-Appendix.zip'


2020-02-07 12:46:44 (15.7 MB/s) - 'data/metadata-files/json/zenodo.org/api/files/75e63414-b867-4a08-af77-69ae4c058c27/Coupette_Jur



In [None]:
! cd data/metadata-files/json; find . -name '*.zip' | while read -r f ; do DIR=$(echo "$f" | sed 's/\.zip$//') ; unzip -d "$DIR" "$f" '*.json' < /dev/null ; done
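The selective extraction can also be sketched with Python's standard `zipfile` module, which avoids shell quoting problems for archive names that contain spaces. The helper name `extract_members` is hypothetical, and unlike the `unzip` pattern the suffix comparison here is case-insensitive.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_members(zip_path, suffix=".json"):
    """Extract members ending in `suffix` into a directory named after the archive."""
    zip_path = Path(zip_path)
    target = zip_path.with_suffix("")  # e.g. foo-1.0.zip -> foo-1.0/
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if not info.is_dir() and info.filename.lower().endswith(suffix):
                archive.extract(info, path=target)
                extracted.append(info.filename)
    return target, extracted


with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny made-up archive to exercise the helper.
    archive_path = Path(tmp) / "example-1.0.zip"
    with zipfile.ZipFile(archive_path, "w") as archive:
        archive.writestr("pkg/metadata.json", '{"creator": "someone"}')
        archive.writestr("pkg/readme.txt", "not JSON")
    target_dir, extracted_names = extract_members(archive_path)
    found_files = sorted(p.name for p in target_dir.rglob("*") if p.is_file())
    print(target_dir.name, extracted_names, found_files)
```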

JSON files commonly have an object (`{"key": value}`) at the top level, so it is worth checking whether any top-level keys such as `creator` or `metadata` recur. Note that the counts here are per JSON file, and a single archive may contain many JSON files.

In [None]:
! find data/metadata-files/json -type f -name '*json' -print0 | xargs -0 jq -r 'objects | keys[]'  | sort | uniq -c > data/metadata-files/json/top-keys.txt
! sort -rn data/metadata-files/json/top-keys.txt | grep -v '^ *1 '
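For comparison, the per-file key count can be sketched in plain Python. This is a sketch under assumptions: `count_top_level_keys` is a hypothetical helper, it only looks at files named `*.json`, and it silently skips files that are unreadable, invalid, or whose top level is not an object.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def count_top_level_keys(root):
    """Count, per JSON file under `root`, the keys of the top-level object."""
    counts = Counter()
    for path in Path(root).rglob("*.json"):
        try:
            with open(path, encoding="utf-8") as handle:
                document = json.load(handle)
        except (OSError, ValueError):
            continue  # unreadable or not valid JSON; skipped in this sketch
        if isinstance(document, dict):
            counts.update(document.keys())
    return counts


with tempfile.TemporaryDirectory() as tmp:
    # Made-up files, only to exercise the counter.
    Path(tmp, "a.json").write_text('{"creator": "x", "title": "y"}')
    Path(tmp, "b.json").write_text('{"creator": "z"}')
    Path(tmp, "c.json").write_text("[1, 2, 3]")   # top-level array: ignored
    Path(tmp, "d.json").write_text("{not valid")  # parse error: skipped
    key_counts = count_top_level_keys(tmp)
    print(key_counts.most_common())
```

Keys that occur in only one file can then be dropped with something like `{k: n for k, n in key_counts.items() if n > 1}`.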

## XML files

In [29]:
! egrep -ri '\.xml$' data/*/listing > data/metadata-files/xml/listing.txt

In [30]:
!head -n 1000 data/metadata-files/xml/listing.txt

data/dataset/listing/zbgj.txt:Math_89/findbugs-exclude-filter.xml
data/dataset/listing/zbgj.txt:Math_89/build.xml
data/dataset/listing/zbgj.txt:Math_89/pom.xml
data/dataset/listing/zbgj.txt:Math_89/test-jar.xml
data/dataset/listing/zbgj.txt:Math_89/project.xml
data/dataset/listing/zbgj.txt:Math_89/checkstyle.xml
data/dataset/listing/zbgj.txt:Math_89/maven.xml
data/dataset/listing/zbgj.txt:Math_89/src/assembly/src.xml
data/dataset/listing/zbgj.txt:Math_89/src/assembly/bin.xml
data/dataset/listing/zbgj.txt:Math_89/src/site/xdoc/developers.xml
data/dataset/listing/zbgj.txt:Math_89/src/site/xdoc/index.xml
data/dataset/listing/zbgj.txt:Math_89/src/site/xdoc/tasks.xml
data/dataset/listing/zbgj.txt:Math_89/src/site/xdoc/userguide/optimization.xml
data/dataset/listing/zbgj.txt:Math_89/src/site/xdoc/userguide/index.xml
data/dataset/listing/zbgj.txt:Math_89/src/site/xdoc/userguide/transform.xml
data/dataset/listing/zbgj.txt:Math_89/src/site/xdoc/userguide/complex.xml
data/dataset

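As for RDF and JSON, the download below assumes that `data/metadata-files/xml/urls.txt` has first been derived from the listing with the same `cut`/`sed`/`awk` pipeline.

In [None]:
!cat data/metadata-files/xml/urls.txt | xargs wget --mirror --directory-prefix=data/metadata-files/xml/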

In [None]:
! cd data/metadata-files/xml; find . -name '*.zip' | while read -r f ; do DIR=$(echo "$f" | sed 's/\.zip$//') ; unzip -d "$DIR" "$f" '*.xml' < /dev/null ; done