# RANGES data to JSON-LD
Example of transforming some RANGES data to JSON-LD and then transforming it to RDF.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import pandas as pd
import numpy as np
from rdflib import Graph, ConjunctiveGraph
from pyld import jsonld
import json

## load ASUoccurrence.csv

In [3]:
df = pd.read_csv("ASUoccurrence.csv")
df.shape

(8962, 50)

### load test data
For testing, just take the first two rows and the first seven columns.  

In [4]:
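# .copy() gives an independent frame, so adding a column later does not warn about writing to a slice
test_df = df.head(2).iloc[:, 0:7].copy()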
test_df

Unnamed: 0,occurrenceID,institutionCode,catalogNumber,scientificName,order,family,genus
0,2f44e05e-4eec-4ae9-8fc4-3477dcc2182f,ASU,ASUMAC009037,"Xerospermophilus tereticaudus (Baird, 1858)",Rodentia,Sciuridae,Xerospermophilus
1,cf36733e-a21c-4d31-8eda-0419af164c77,ASU,ASUMAC009027,"Otospermophilus variegatus (Erxleben, 1777)",Rodentia,Sciuridae,Otospermophilus


### fetch NCBItaxon IDs
For the first two records, get the NCBItaxon IDs. For the full dataset, this will obviously need to be expanded.  
This is just to show how to associate an IRI with a column value. The final solution will be much more involved.

In [5]:
test_df['taxon_id'] = np.where(
    test_df.scientificName.str.contains("Otospermophilus variegatus"), 
    "ncbit:4572", 
    "ncbit:99860"
)
test_df

Unnamed: 0,occurrenceID,institutionCode,catalogNumber,scientificName,order,family,genus,taxon_id
0,2f44e05e-4eec-4ae9-8fc4-3477dcc2182f,ASU,ASUMAC009037,"Xerospermophilus tereticaudus (Baird, 1858)",Rodentia,Sciuridae,Xerospermophilus,ncbit:99860
1,cf36733e-a21c-4d31-8eda-0419af164c77,ASU,ASUMAC009027,"Otospermophilus variegatus (Erxleben, 1777)",Rodentia,Sciuridae,Otospermophilus,ncbit:4572
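
The `np.where` call above only distinguishes two names. For more records, one option is a lookup table keyed by the binomial (genus plus species) and applied to each record. The sketch below uses plain dictionaries shaped like `to_dict(orient="records")` output. `TAXON_IDS` and `lookup_taxon_id` are hypothetical names, and the two IDs are simply the ones used above, not a verified mapping.

```python
# Hypothetical lookup table: binomial name -> CURIE as used in this notebook.
TAXON_IDS = {
    "Xerospermophilus tereticaudus": "ncbit:99860",
    "Otospermophilus variegatus": "ncbit:4572",
}

def lookup_taxon_id(scientific_name, table=TAXON_IDS):
    """Return the CURIE for a name such as 'Genus species (Author, year)'.

    Only the first two words (the binomial) are used as the key, so the
    authorship suffix is ignored. Returns None when the name is unknown.
    """
    binomial = " ".join(scientific_name.split()[:2])
    return table.get(binomial)

records = [
    {"scientificName": "Xerospermophilus tereticaudus (Baird, 1858)"},
    {"scientificName": "Otospermophilus variegatus (Erxleben, 1777)"},
    {"scientificName": "Unknown species"},
]
for rec in records:
    rec["taxon_id"] = lookup_taxon_id(rec["scientificName"])
```

With a dataframe, the same function could be applied with `test_df.scientificName.map(lookup_taxon_id)`. Names missing from the table then show up as empty values instead of silently receiving a default ID.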


The dataframe is easily converted into JSON, which can then be converted to RDF using JSON-LD (see below).

In [6]:
print(json.dumps(test_df.to_dict(orient="records"), indent=4))

[
    {
        "occurrenceID": "2f44e05e-4eec-4ae9-8fc4-3477dcc2182f",
        "institutionCode": "ASU",
        "catalogNumber": "ASUMAC009037",
        "scientificName": "Xerospermophilus tereticaudus (Baird, 1858)",
        "order": "Rodentia",
        "family": "Sciuridae",
        "genus": "Xerospermophilus",
        "taxon_id": "ncbit:99860"
    },
    {
        "occurrenceID": "cf36733e-a21c-4d31-8eda-0419af164c77",
        "institutionCode": "ASU",
        "catalogNumber": "ASUMAC009027",
        "scientificName": "Otospermophilus variegatus (Erxleben, 1777)",
        "order": "Rodentia",
        "family": "Sciuridae",
        "genus": "Otospermophilus",
        "taxon_id": "ncbit:4572"
    }
]


## transform to JSON-LD

### define JSON-LD contexts
We need to define a context for JSON-LD, which will be used to convert to RDF. In the context, I have set the default namespace (`@vocab`) to `http://purl.obolibrary.org/obo/FOVT/data#`. This means that, unless otherwise specified, a field name is converted to an IRI by prepending `http://purl.obolibrary.org/obo/FOVT/data#`. E.g., `institutionCode: ASU` becomes a triple in RDF whose predicate is `http://purl.obolibrary.org/obo/FOVT/data#institutionCode` and whose value is `"ASU"`.
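
As a rough illustration of that rule, the sketch below mimics what `@vocab` does to a plain field name. `expand_term` is a hypothetical helper written for this note, not part of `rdflib` or `pyLD`, and it ignores everything else the real JSON-LD expansion algorithm handles.

```python
VOCAB = "http://purl.obolibrary.org/obo/FOVT/data#"

def expand_term(term, vocab=VOCAB):
    """Prepend the default vocabulary to a bare field name.

    Names that already look like absolute IRIs are returned unchanged.
    """
    if "://" in term:
        return term
    return vocab + term

expanded = expand_term("institutionCode")
print(expanded)
```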

There is a **super annoying** caveat to this. For some reason, JSON-LD 1.1 does not by default permit prefix base IRIs that end with an underscore (e.g. `http://purl.obolibrary.org/obo/NCBITaxon_`). To handle cases like this, you are supposed to add `"@prefix": true` to the term definition in the context. E.g.:
```
"@context": {
  "ncbit" : {
     "@id": "http://purl.obolibrary.org/obo/NCBITaxon_",
     "@prefix": true
   }
}
```
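
To make the rule concrete, here is a simplified sketch of how I read the JSON-LD 1.1 prefix rule: a term defined as a plain string can act as a prefix only if its IRI ends in one of the gen-delim characters (`:/?#[]@`), while a term defined as an object needs an explicit `"@prefix": true`. The function names are hypothetical, and real processors do considerably more.

```python
GEN_DELIMS = ":/?#[]@"

def usable_as_prefix(definition):
    """Return True if a term definition may be used as a compact-IRI prefix."""
    if isinstance(definition, str):
        # Simple term: allowed only when the IRI ends in a gen-delim character.
        return bool(definition) and definition[-1] in GEN_DELIMS
    # Expanded term definition: the flag must be set explicitly.
    return definition.get("@prefix") is True

def expand_curie(curie, context):
    """Expand 'prefix:suffix' when the prefix qualifies; otherwise return it unchanged."""
    prefix, sep, suffix = curie.partition(":")
    definition = context.get(prefix)
    if not sep or definition is None or not usable_as_prefix(definition):
        return curie
    base = definition if isinstance(definition, str) else definition["@id"]
    return base + suffix

obo = "http://purl.obolibrary.org/obo/NCBITaxon_"
without_flag = expand_curie("ncbit:4572", {"ncbit": obo})
with_flag = expand_curie("ncbit:4572", {"ncbit": {"@id": obo, "@prefix": True}})
print(without_flag)
print(with_flag)
```

In the first case the value stays `ncbit:4572`, which is why the underscore-terminated base needs the flag. In the second case it expands to the full OBO IRI.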

Sadly, `rdflib` does not seem to recognize the `"@prefix": true` flag. So, in the context below, the NCBItaxon base IRI is `https://www.ncbi.nlm.nih.gov/Taxonomy/txid#` instead of `http://purl.obolibrary.org/obo/NCBITaxon_`. Note that this substitute base IRI has to end with a `#`, which is another annoying detail. grrrr ...  

The `pyLD` library does recognize the `"@prefix": true` flag. An example that uses it, with a context named `obo_context`, appears further below.

In [7]:
context = {
    "@vocab": "http://purl.obolibrary.org/obo/FOVT/data#",
    "taxon_id": {"@type": "@id"},
    "ncbit": {
        "@id": "https://www.ncbi.nlm.nih.gov/Taxonomy/txid#",
        "@type": "@id"
    }
}

### transform dataframe to json and load into an rdflib graph  
Below, the output is in `ntriples` to show the conversion to IRIs, but the output can be in turtle too.  
The output is a bit difficult to read, but you can verify that `taxon_id` has an IRI as a value. See the line:
```
_:N16b203ad4b6d48b7809dff1fecea24ac <http://purl.obolibrary.org/obo/FOVT/data#taxon_id> <https://www.ncbi.nlm.nih.gov/Taxonomy/txid#4572> .
```
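
Rather than scanning the dump by eye, a few lines of plain Python can pick out the `taxon_id` triples and report whether each object is an IRI (wrapped in `<...>`) or a literal (wrapped in quotes). This is a minimal sketch that assumes one triple per line, as in N-Triples output. `taxon_objects` is a hypothetical helper, and the literal line in the sample is made up to show the contrast.

```python
TAXON_PREDICATE = "<http://purl.obolibrary.org/obo/FOVT/data#taxon_id>"

def taxon_objects(ntriples):
    """Yield (object, is_iri) for every taxon_id triple in an N-Triples string."""
    for line in ntriples.splitlines():
        # Drop the terminating " ." and split into subject, predicate, object.
        parts = line.strip().rstrip(".").strip().split(None, 2)
        if len(parts) == 3 and parts[1] == TAXON_PREDICATE:
            obj = parts[2]
            yield obj, obj.startswith("<") and obj.endswith(">")

sample = """\
_:b1 <http://purl.obolibrary.org/obo/FOVT/data#taxon_id> <https://www.ncbi.nlm.nih.gov/Taxonomy/txid#4572> .
_:b2 <http://purl.obolibrary.org/obo/FOVT/data#taxon_id> "ncbit:4572" .
_:b1 <http://purl.obolibrary.org/obo/FOVT/data#genus> "Otospermophilus" .
"""
results = list(taxon_objects(sample))
for obj, is_iri in results:
    print(obj, "IRI" if is_iri else "literal")
```

Assuming `g.serialize(format="nt")` returns a string, the same helper could be pointed at the graph output with `list(taxon_objects(g.serialize(format="nt")))`.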

In [8]:
g = Graph().parse(data=test_df.to_json(orient="records"), format="json-ld", context=context)
# print(g.serialize(format="ttl"))
print(g.serialize(format="nt"))

_:Na6634507c1dd4966ada6ddf5925b2a27 <http://purl.obolibrary.org/obo/FOVT/data#order> "Rodentia" .
_:N880c3127d9094ab18095c27f8f327bc6 <http://purl.obolibrary.org/obo/FOVT/data#family> "Sciuridae" .
_:N880c3127d9094ab18095c27f8f327bc6 <http://purl.obolibrary.org/obo/FOVT/data#scientificName> "Xerospermophilus tereticaudus (Baird, 1858)" .
_:N880c3127d9094ab18095c27f8f327bc6 <http://purl.obolibrary.org/obo/FOVT/data#occurrenceID> "2f44e05e-4eec-4ae9-8fc4-3477dcc2182f" .
_:N880c3127d9094ab18095c27f8f327bc6 <http://purl.obolibrary.org/obo/FOVT/data#catalogNumber> "ASUMAC009037" .
_:Na6634507c1dd4966ada6ddf5925b2a27 <http://purl.obolibrary.org/obo/FOVT/data#catalogNumber> "ASUMAC009027" .
_:Na6634507c1dd4966ada6ddf5925b2a27 <http://purl.obolibrary.org/obo/FOVT/data#occurrenceID> "cf36733e-a21c-4d31-8eda-0419af164c77" .
_:N880c3127d9094ab18095c27f8f327bc6 <http://purl.obolibrary.org/obo/FOVT/data#institutionCode> "ASU" .
_:Na6634507c1dd4966ada6ddf5925b2a27 <http://purl.obolibrary.org/obo/FOV

## transform using pyLD  
`pyLD` will recognize the `"@prefix": True` flag, so we can use OBO-style IRI bases (i.e., ones that end with an `_`).  
For example, see the line:
```
_:b1 <http://purl.obolibrary.org/obo/FOVT/data#taxon_id> <http://purl.obolibrary.org/obo/NCBITaxon_4572> .
```

In [9]:
obo_context = {
    "@context": {
        "@vocab": "http://purl.obolibrary.org/obo/FOVT/data#",
        "taxon_id": {"@type": "@id"},
        "ncbit": {
            "@id": "http://purl.obolibrary.org/obo/NCBITaxon_",
            "@prefix": True
        }
    }
}

In [10]:
rdf = jsonld.to_rdf(
    {"@context": obo_context, "@graph": test_df.to_dict(orient="records")},
    {'format': 'application/n-quads'}  # the format must be application/n-quads
)
print(rdf)

_:b0 <http://purl.obolibrary.org/obo/FOVT/data#catalogNumber> "ASUMAC009037" .
_:b0 <http://purl.obolibrary.org/obo/FOVT/data#family> "Sciuridae" .
_:b0 <http://purl.obolibrary.org/obo/FOVT/data#genus> "Xerospermophilus" .
_:b0 <http://purl.obolibrary.org/obo/FOVT/data#institutionCode> "ASU" .
_:b0 <http://purl.obolibrary.org/obo/FOVT/data#occurrenceID> "2f44e05e-4eec-4ae9-8fc4-3477dcc2182f" .
_:b0 <http://purl.obolibrary.org/obo/FOVT/data#order> "Rodentia" .
_:b0 <http://purl.obolibrary.org/obo/FOVT/data#scientificName> "Xerospermophilus tereticaudus (Baird, 1858)" .
_:b0 <http://purl.obolibrary.org/obo/FOVT/data#taxon_id> <http://purl.obolibrary.org/obo/NCBITaxon_99860> .
_:b1 <http://purl.obolibrary.org/obo/FOVT/data#catalogNumber> "ASUMAC009027" .
_:b1 <http://purl.obolibrary.org/obo/FOVT/data#family> "Sciuridae" .
_:b1 <http://purl.obolibrary.org/obo/FOVT/data#genus> "Otospermophilus" .
_:b1 <http://purl.obolibrary.org/obo/FOVT/data#institutionCode> "ASU" .
_:b1 <http://purl.oboli

If necessary, the `pyLD` RDF can be loaded into an `rdflib` graph. This might help for loading the RDF into a triple store.  
Note: The output is converted to turtle instead of nquads.

In [11]:
g = Graph().parse(data=rdf, format="nquads")
print(g.serialize(format="ttl"))

@prefix ns1: <http://purl.obolibrary.org/obo/FOVT/data#> .

[] ns1:catalogNumber "ASUMAC009037" ;
    ns1:family "Sciuridae" ;
    ns1:genus "Xerospermophilus" ;
    ns1:institutionCode "ASU" ;
    ns1:occurrenceID "2f44e05e-4eec-4ae9-8fc4-3477dcc2182f" ;
    ns1:order "Rodentia" ;
    ns1:scientificName "Xerospermophilus tereticaudus (Baird, 1858)" ;
    ns1:taxon_id <http://purl.obolibrary.org/obo/NCBITaxon_99860> .

[] ns1:catalogNumber "ASUMAC009027" ;
    ns1:family "Sciuridae" ;
    ns1:genus "Otospermophilus" ;
    ns1:institutionCode "ASU" ;
    ns1:occurrenceID "cf36733e-a21c-4d31-8eda-0419af164c77" ;
    ns1:order "Rodentia" ;
    ns1:scientificName "Otospermophilus variegatus (Erxleben, 1777)" ;
    ns1:taxon_id <http://purl.obolibrary.org/obo/NCBITaxon_4572> .


