<a href="https://colab.research.google.com/github/schemaorg/schemaorg/blob/main/scripts/Schema_org_Dashboard.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook is part of the Schema.org project codebase at https://github.com/schemaorg/schemaorg and licensed under the same terms.


The purpose of this notebook is to show how to work programmatically with schema.org's definitions. 

See also https://colab.research.google.com/drive/1GVQaP5t8G-NRLAmEvVSp8k5MnsrfttDP for another approach to this.

# SPARQL

How to query schema.org schemas using SPARQL

In [18]:
# run this once per session to bring in a required library

!pip --quiet install sparqlwrapper | grep -v 'already satisfied'

from SPARQLWrapper import SPARQLWrapper, JSON
import pandas as pd
import io
import requests

In [19]:
# This cell shows how to use SPARQLWrapper to query a REMOTE SPARQL endpoint

q1 = """PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?prop ?type1 ?type2 WHERE {
  ?type1 rdfs:subClassOf* <https://schema.org/Organization> . 
  ?type2 rdfs:subClassOf* <https://schema.org/Person> . 
  ?prop <https://schema.org/domainIncludes> ?type1 .
  ?prop <https://schema.org/domainIncludes> ?type2 .
}"""

pd.set_option('display.max_colwidth', None)

# SPARQL endpoints (sdo_endpoint is the default used below)
wd_endpoint = 'https://query.wikidata.org/sparql'
sdo_endpoint = "https://dydra.com/danbri/schema-org-v11/sparql"

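# utility function: run a SPARQL SELECT query and return its bindings as a DataFrame
def df_from_query(querystring=q1, endpoint=sdo_endpoint):
  sparql = SPARQLWrapper(endpoint)
  sparql.setQuery(querystring)
  sparql.setReturnFormat(JSON)
  results = sparql.query().convert()
  # one row per solution; each variable yields '<var>.type' and '<var>.value' columns
  return pd.json_normalize(results['results']['bindings'])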

In [25]:
# TODO: Need a function that loads https://webschemas.org/version/latest/schemaorg-current-https.nt into a named graph SPARQL store 

sdo_nt_url = "https://webschemas.org/version/latest/schemaorg-current-https.nt"

import rdflib
import json
from collections import Counter
from rdflib import Graph, plugin
from rdflib.serializer import Serializer

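# Parse the N-Triples dump into an in-memory graph; the format is stated explicitly rather than guessed from the URL
g = rdflib.Graph()
g.parse(sdo_nt_url, format="nt")
g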

<Graph identifier=N2d59080d25564e1f8cb93f4bb3aa1cf1 (<class 'rdflib.graph.Graph'>)>

In [20]:

x = df_from_query(q1)
x

index,prop.type,prop.value,type1.type,type1.value,type2.type,type2.value
0,uri,https://schema.org/email,uri,https://schema.org/Organization,uri,https://schema.org/Person
1,uri,https://schema.org/faxNumber,uri,https://schema.org/Organization,uri,https://schema.org/Person
2,uri,https://schema.org/award,uri,https://schema.org/Organization,uri,https://schema.org/Person
3,uri,https://schema.org/telephone,uri,https://schema.org/Organization,uri,https://schema.org/Person
4,uri,https://schema.org/memberOf,uri,https://schema.org/Organization,uri,https://schema.org/Person
5,uri,https://schema.org/sponsor,uri,https://schema.org/Organization,uri,https://schema.org/Person
6,uri,https://schema.org/knowsAbout,uri,https://schema.org/Organization,uri,https://schema.org/Person
7,uri,https://schema.org/gender,uri,https://schema.org/SportsTeam,uri,https://schema.org/Person
8,uri,https://schema.org/vatID,uri,https://schema.org/Organization,uri,https://schema.org/Person
9,uri,https://schema.org/brand,uri,https://schema.org/Organization,uri,https://schema.org/Person
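
The `.type` columns in this output are all `uri`, so a more compact view keeps only the value of each variable. The sketch below is an illustration rather than part of the original notebook: `simplify_bindings` is a hypothetical helper that works on the raw JSON results (the dictionary returned by `sparql.query().convert()`), and `sample` is a hand-written result in the same layout, reusing one row from the table above.

```python
def simplify_bindings(results):
    """Reduce SPARQL JSON results to one {variable: value} dict per solution."""
    rows = []
    for binding in results["results"]["bindings"]:
        # Each cell looks like {"type": "uri", "value": "..."}; keep only the value.
        rows.append({var: cell["value"] for var, cell in binding.items()})
    return rows


# Hand-written sample in the SPARQL JSON results layout (assumption: for illustration only).
sample = {
    "head": {"vars": ["prop", "type1", "type2"]},
    "results": {
        "bindings": [
            {
                "prop": {"type": "uri", "value": "https://schema.org/email"},
                "type1": {"type": "uri", "value": "https://schema.org/Organization"},
                "type2": {"type": "uri", "value": "https://schema.org/Person"},
            }
        ]
    },
}

rows = simplify_bindings(sample)
print(rows)
```

The resulting list of dictionaries can be passed to `pd.DataFrame(rows)` to get one column per query variable.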


# Examples

How to access schema.org examples

In [21]:
# First we clone the entire schema.org repo, then we collect up the examples from .txt files:

!git clone https://github.com/schemaorg/schemaorg

Cloning into 'schemaorg'...
remote: Enumerating objects: 132, done.
remote: Counting objects: 100% (132/132), done.
remote: Compressing objects: 100% (110/110), done.
remote: Total 23477 (delta 80), reused 50 (delta 21), pack-reused 23345
Receiving objects: 100% (23477/23477), 96.36 MiB | 28.09 MiB/s, done.
Resolving deltas: 100% (16700/16700), done.
Checking out files: 100% (1788/1788), done.


In [24]:
!find . -name \*example\*.txt -exec ls {} \;

./schemaorg/SchemaExamples/example-code/examples.txt
./schemaorg/data/sdo-bus-stop-examples.txt
./schemaorg/data/sdo-trip-examples.txt
./schemaorg/data/sdo-police-station-examples.txt
./schemaorg/data/sdo-airport-examples.txt
./schemaorg/data/sdo-train-station-examples.txt
./schemaorg/data/sdo-videogame-examples.txt
./schemaorg/data/sdo-book-series-examples.txt
./schemaorg/data/sdo-automobile-examples.txt
./schemaorg/data/sdo-invoice-examples.txt
./schemaorg/data/sdo-creativework-examples.txt
./schemaorg/data/sdo-itemlist-examples.txt
./schemaorg/data/sdo-offeredby-examples.txt
./schemaorg/data/sdo-digital-document-examples.txt
./schemaorg/data/examples.txt
./schemaorg/data/sdo-hotels-examples.txt
./schemaorg/data/ext/pending/issue-2490-examples.txt
./schemaorg/data/ext/pending/issue-1670-examples.txt
./schemaorg/data/ext/pending/issue-2192-examples.txt
./schemaorg/data/ext/pending/issue-894-examples.txt
./schemaorg/data/ext/pending/issue-2396-examples.txt
./schemaorg/data/ext/pending/
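
The same listing can be collected in Python, which makes it easier to feed the files into later processing. This is a minimal sketch under stated assumptions: `find_example_files` is a hypothetical helper, and treating every file below a `pending` directory as a pending example is an assumption based on the paths shown above, not a rule taken from the repository.

```python
from pathlib import Path


def find_example_files(root="."):
    """Return example .txt files under root, grouped into 'core' and 'pending'."""
    root = Path(root)
    groups = {"core": [], "pending": []}
    for path in sorted(root.rglob("*example*.txt")):
        # Assumption: a 'pending' directory anywhere below root marks pending examples.
        key = "pending" if "pending" in path.relative_to(root).parts else "core"
        groups[key].append(path)
    return groups
```

After the clone above, `find_example_files("schemaorg")` would return the two groups of paths, which could then be loaded into separate named graphs.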

TODOs:
 * Can we load all the examples into a multi-graph SPARQL store (in rdflib, not a remote endpoint), putting them into 'core' and 'pending' named graphs or similar?
 * Then load the triples from the latest webschemas release, https://webschemas.org/version/latest/schemaorg-current-https.jsonld, into a named graph.
 * Find triples in the 'core' examples that are not in the vocabulary (then do the same for 'pending').
