# Querying RO-Crate as a knowledge graph

_This tutorial includes content adapted from Paul Houle's [gastrodon example](https://github.com/paulhoule/gastrodon/blob/master/notebooks/local/DBpedia_Schema_Queries.ipynb); see [LICENSE](/edit/LICENSE) for details._



## Setup

In [90]:
import sys
from collections import OrderedDict
from rdflib import Graph,URIRef
from rdflib.parser import URLInputSource
from gastrodon import LocalEndpoint,one,QName
import gzip
import pandas as pd
pd.set_option("display.width",100)
pd.set_option("display.max_colwidth",80)

## Loading the graph

In [91]:
g = Graph()
g.parse("https://w3id.org/ro/doi/10.5281/zenodo.5146227")
g.bind("s","http://schema.org/")

Because we are in a Jupyter Notebook, we'll use [gastrodon](https://github.com/paulhoule/gastrodon) to get nicer table rendering of the query results.

In [92]:
e = LocalEndpoint(g)

In [93]:

e.select("""
   SELECT ?p (COUNT(*) AS ?cnt) {
      ?s ?p ?o .
   } GROUP BY ?p ORDER BY DESC(?cnt)
   LIMIT 20
""")


| p | cnt |
|---|---|
| s:name | 196 |
| rdf:type | 192 |
| s:author | 137 |
| s:additionalType | 115 |
| s:contributor | 61 |
| s:member | 55 |
| s:url | 33 |
| s:description | 30 |
| s:hasPart | 25 |
| s:creativeWorkStatus | 24 |


In [94]:
e.select("""
   SELECT ?type (COUNT(?s) as ?cnt) {
      ?s a ?type 
   } GROUP BY ?type ORDER BY DESC(?cnt)
""")


| type | cnt |
|---|---|
| s:Person | 62 |
| s:DefinedTerm | 34 |
| s:Role | 30 |
| s:SoftwareApplication | 19 |
| s:Audience | 9 |
| s:Dataset | 7 |
| s:ScholarlyArticle | 7 |
| s:CreativeWork | 6 |
| s:MediaObject | 3 |
| s:Organization | 2 |


In [95]:
persons = e.select("""
SELECT ?person {
  ?person a s:Person
}
""")
persons

| | person |
|---|---|
| 0 | https://orcid.org/0000-0001-6022-9825 |
| 1 | https://orcid.org/0000-0001-6565-5145 |
| 2 | https://orcid.org/0000-0001-6960-357X |
| 3 | https://orcid.org/0000-0001-8131-2150 |
| 4 | https://orcid.org/0000-0001-8172-8981 |
| ... | ... |
| 57 | https://www.researchobject.org/2021-packaging-research-artefacts-with-ro-cra... |
| 58 | https://www.researchobject.org/2021-packaging-research-artefacts-with-ro-cra... |
| 59 | https://www.researchobject.org/2021-packaging-research-artefacts-with-ro-cra... |
| 60 | https://www.researchobject.org/2021-packaging-research-artefacts-with-ro-cra... |


In [109]:
orcids = e.select("""
SELECT ?person ?name {
  ?person a s:Person .
  ?person s:name ?name .
  FILTER(STRSTARTS(STR(?person), "https://orcid.org/"))
}
LIMIT 10
""")
orcids

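Exception: Unknown namespace prefix : s

This cell was re-run after the `g.parse` calls in the next cell (note the execution counts). Parsing the ORCID profiles may have changed the graph's prefix bindings, so `s` was no longer bound. Re-binding with `g.bind("s", "http://schema.org/")`, as done further below, resolves it.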

In [108]:
for orcid in orcids.person:
    print("Parsing", orcid)
    g.parse(str(orcid), format="json-ld")

Parsing https://orcid.org/0000-0001-6022-9825
Parsing https://orcid.org/0000-0001-6565-5145
Parsing https://orcid.org/0000-0001-6960-357X


https://doi.org/10.1007/978-3-642-16558-0\_22 does not look like a valid URI, trying to serialize this will break.


Parsing https://orcid.org/0000-0001-8131-2150
Parsing https://orcid.org/0000-0001-8172-8981
Parsing https://orcid.org/0000-0001-8420-5254
Parsing https://orcid.org/0000-0001-9842-9718
Parsing https://orcid.org/0000-0002-0048-3300
Parsing https://orcid.org/0000-0002-0309-604X
Parsing https://orcid.org/0000-0002-0337-8610


In [113]:
g.bind("s", "http://schema.org/")  # re-bind: the latest g.parse may have changed the prefixes
e.select("""
   SELECT ?work ?title {
      ?work s:creator ?person .
      ?work s:name ?title
   } 
""")

| | work | title |
|---|---|---|
| 0 | https://doi.org/10.1145/3486897 | Methods Included |
| 1 | https://doi.org/10.1371/journal.pcbi.1009823 | Ten simple rules for making a software tool workflow-ready |
| 2 | https://doi.org/10.12688/f1000research.54159.1 | Perspectives on automated composition of workflows in the life sciences [ver... |
| 3 | https://doi.org/10.5281/zenodo.5093125 | Towards a Common Standard for Data and Specimen Provenance in Life Sciences |
| 4 | https://doi.org/10.7717/peerj-cs.387 | Semantic micro-contributions with decentralized nanopublication services |
| ... | ... | ... |
| 1309 | Ndc6aba2a27ae464bbe736805430e9c8d | Quantifying Groundwater Fluctuations in the Southern High Plains with GIS an... |
| 1310 | Nfbdee1f111b5432eb04519475e594370 | Investigating Depletion of the Southern High Plains (Ogallala) Aquifer |
| 1311 | N0e988b178f76476cac94601db04e6ad4 | Integrating GPS and GIS Techniques in Training GIS Professionals: A Case Study |
| 1312 | Ndd74667de24b46619211d33d41d4d641 | Victory Drive Tree Inventory Data Creation and Assessment & ESRI Internship |


In [116]:
dois = e.select("""
   SELECT ?work {
      ?work s:creator ?person .
      FILTER(STRSTARTS(STR(?work), "https://doi.org/"))
   } 
   LIMIT 10
""")
dois

| | work |
|---|---|
| 0 | https://doi.org/10.1145/3486897 |
| 1 | https://doi.org/10.1371/journal.pcbi.1009823 |
| 2 | https://doi.org/10.12688/f1000research.54159.1 |
| 3 | https://doi.org/10.5281/zenodo.5093125 |
| 4 | https://doi.org/10.7717/peerj-cs.387 |
| 5 | https://doi.org/10.5281/zenodo.4541002 |
| 6 | https://doi.org/10.6084/m9.figshare.14453031 |
| 7 | https://doi.org/10.6084/m9.figshare.14453031.v1 |
| 8 | https://doi.org/10.1007/978-3-030-80960-7_16 |
| 9 | https://doi.org/10.5281/zenodo.3541888 |


In [128]:
for doi in dois.work:
    print("Parsing", doi)
    try:
        g.parse(str(doi))
    except Exception as ex:
        print("  Failed", repr(ex))

Parsing https://doi.org/10.1145/3486897
Parsing https://doi.org/10.1371/journal.pcbi.1009823
Parsing https://doi.org/10.12688/f1000research.54159.1
Parsing https://doi.org/10.5281/zenodo.5093125
  Failed <HTTPError 422: 'Unprocessable Entity'>
Parsing https://doi.org/10.7717/peerj-cs.387
Parsing https://doi.org/10.5281/zenodo.4541002
  Failed <HTTPError 422: 'Unprocessable Entity'>
Parsing https://doi.org/10.6084/m9.figshare.14453031
  Failed <HTTPError 422: 'Unprocessable Entity'>
Parsing https://doi.org/10.6084/m9.figshare.14453031.v1
  Failed <HTTPError 422: 'Unprocessable Entity'>
Parsing https://doi.org/10.1007/978-3-030-80960-7_16
Parsing https://doi.org/10.5281/zenodo.3541888
  Failed <HTTPError 422: 'Unprocessable Entity'>


In [129]:
g4 = Graph()
g4.parse("https://doi.org/10.1145/3486897")
e4 = LocalEndpoint(g4)
e4.select("""
SELECT ?s ?p ?o WHERE {
  ?s ?p ?o
}
""")

| | s | p | o |
|---|---|---|---|
| 0 | http://dx.doi.org/10.1145/3486897 | j.2:doi | 10.1145/3486897 |
| 1 | http://id.crossref.org/contributor/the-cwl-community-vltbu40m1soi | foaf:familyName | Community |
| 2 | http://id.crossref.org/issn/0001-0782 | j.2:issn | 0001-0782 |
| 3 | http://dx.doi.org/10.1145/3486897 | j.2:volume | 65 |
| 4 | http://id.crossref.org/contributor/peter-amstutz-vltbu40m1soi | foaf:givenName | Peter |
| ... | ... | ... | ... |
| 73 | http://dx.doi.org/10.1145/3486897 | dcterms:creator | http://id.crossref.org/contributor/peter-amstutz-vltbu40m1soi |
| 74 | http://dx.doi.org/10.1145/3486897 | dcterms:creator | http://id.crossref.org/contributor/nebojsa-tijanic-vltbu40m1soi |
| 75 | http://id.crossref.org/contributor/herve-menager-vltbu40m1soi | foaf:name | Hervé Ménager |
| 76 | http://id.crossref.org/contributor/alexandru-iosup-vltbu40m1soi | foaf:name | Alexandru Iosup |


We see that the DOI metadata uses a different vocabulary: the "classic" FOAF and DC Terms rather than schema.org. In addition, the URI for each work now begins with `http://dx.doi.org/` instead of `https://doi.org/` as in the ORCID and RO-Crate data.
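
The URI mismatch can be handled by rewriting the older resolver prefix before comparing identifiers. The following is a minimal sketch in plain Python; the helper name `normalize_doi_uri` and the list of legacy prefixes are assumptions for illustration, not part of rdflib or gastrodon.

```python
# Legacy resolver prefixes that should map to the https://doi.org/ form (assumed list)
LEGACY_DOI_PREFIXES = (
    "http://dx.doi.org/",
    "https://dx.doi.org/",
    "http://doi.org/",
)
CANONICAL_DOI_PREFIX = "https://doi.org/"


def normalize_doi_uri(uri: str) -> str:
    """Rewrite legacy DOI resolver URIs to https://doi.org/; leave other URIs untouched."""
    for prefix in LEGACY_DOI_PREFIXES:
        if uri.startswith(prefix):
            return CANONICAL_DOI_PREFIX + uri[len(prefix):]
    return uri


print(normalize_doi_uri("http://dx.doi.org/10.1145/3486897"))
# prints https://doi.org/10.1145/3486897
```

URIs that are not DOIs, such as the `http://id.crossref.org/` contributor identifiers, pass through unchanged.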

We also see that authors (`dcterms:creator` here) are identified not by ORCID but by an internal `http://id.crossref.org` identifier. This is similar to what we saw before: not every author will have an ORCID.

This means we have to do a bit more work to combine these data sources in a single knowledge graph.