<a href="https://colab.research.google.com/github/schemaorg/schemaorg/blob/main/scripts/Schema_org_Dashboard.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook is part of the Schema.org project codebase at https://github.com/schemaorg/schemaorg and is licensed under the same terms.


The purpose of this notebook is to show how to work programmatically with schema.org's definitions. 

See also https://colab.research.google.com/drive/1GVQaP5t8G-NRLAmEvVSp8k5MnsrfttDP for another approach to this.

# SPARQL

How to query schema.org schemas using SPARQL

In [17]:
# run this once per session to bring in a required library

!pip --quiet install sparqlwrapper | grep -v 'already satisfied'

from SPARQLWrapper import SPARQLWrapper, JSON
import pandas as pd
import io
import requests

In [18]:
# This cell shows how to use SPARQLWrapper to query a REMOTE SPARQL endpoint

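# Properties whose domain includes both Organization (or a subtype) and Person (or a subtype).
# The rdfs: prefix is declared explicitly so the query does not depend on endpoint defaults.
q1 = """PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT distinct ?prop ?type1 ?type2 WHERE  {
  ?type1 rdfs:subClassOf* <https://schema.org/Organization> .
  ?type2 rdfs:subClassOf* <https://schema.org/Person> .
  ?prop <https://schema.org/domainIncludes> ?type1 .
  ?prop <https://schema.org/domainIncludes> ?type2 .
}"""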

pd.set_option('display.max_colwidth', None)

# data
wd_endpoint = 'https://query.wikidata.org/sparql'
sdo_endpoint = "https://dydra.com/danbri/schema-org-v11/sparql"

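# utility function: run a SPARQL SELECT query and return the result bindings as a DataFrame
def df_from_query(querystring=q1, endpoint=sdo_endpoint):
  sparql = SPARQLWrapper(endpoint)
  sparql.setQuery(querystring)
  sparql.setReturnFormat(JSON)
  results = sparql.query().convert()
  # each binding becomes one row; columns are named like 'prop.type' and 'prop.value'
  return pd.json_normalize(results['results']['bindings'])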
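The shape of the endpoint's response is worth seeing once. The following is a minimal, standard-library sketch of the SPARQL JSON results layout that `df_from_query` flattens. The `sample` data and the helper name `simplify_bindings` are illustrative assumptions, not part of the notebook, and nothing here contacts the endpoint.

```python
# Hypothetical sample in the SPARQL JSON results layout (hand-written, not fetched).
sample = {
    "head": {"vars": ["prop", "type1"]},
    "results": {
        "bindings": [
            {
                "prop": {"type": "uri", "value": "https://schema.org/email"},
                "type1": {"type": "uri", "value": "https://schema.org/Organization"},
            },
            {
                "prop": {"type": "uri", "value": "https://schema.org/gender"},
                "type1": {"type": "uri", "value": "https://schema.org/SportsTeam"},
            },
        ]
    },
}


def simplify_bindings(results):
    """Keep only the 'value' of each bound variable, one dict per result row."""
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in results["results"]["bindings"]
    ]


rows = simplify_bindings(sample)
for row in rows:
    print(row["prop"], row["type1"])
```

Dropping the `type` entries this way may be convenient when every binding is a URI, as in the query above; keep them if a query can also return literals.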

In [19]:
# This cell shows how to use rdflib to query a LOCAL SPARQL dataset
# TODO: Need a function that loads https://webschemas.org/version/latest/schemaorg-current-https.nt into a named graph SPARQL store 


import rdflib
import json
from collections import Counter
from rdflib import Graph, plugin, ConjunctiveGraph
from rdflib.serializer import Serializer

def toDF(result):
  return pd.DataFrame(result, columns=result.vars)

# Fetch Schema.org definitions

sdo_current_https_url = "https://webschemas.org/version/latest/schemaorg-current-https.nq"
sdo_all_https_url = "https://webschemas.org/version/latest/schemaorg-all-https.nq"

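# TODO: is loading both files the only way to figure out what is in the attic?
# (One complication: both files use the same named-graph URL.)
# store="default" selects rdflib's in-memory store (registered as "IOMemory" in rdflib 5 and "Memory" in rdflib 6+)
g = ConjunctiveGraph(store="default")
g.parse(sdo_all_https_url, format="nquads", publicID="https://schema.org/")
g.parse(sdo_current_https_url, format="nquads", publicID="https://schema.org/")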


<Graph identifier=https://schema.org/ (<class 'rdflib.graph.Graph'>)>

In [20]:
result = toDF( g.query("select * where { GRAPH ?g { ?article_type rdfs:subClassOf <https://schema.org/NewsArticle> ; rdfs:label ?label }}") )


In [21]:
result

                         g                              article_type                  label
0  https://schema.org/11.0   https://schema.org/AskPublicNewsArticle   AskPublicNewsArticle
1  https://schema.org/11.0  https://schema.org/BackgroundNewsArticle  BackgroundNewsArticle
2  https://schema.org/11.0     https://schema.org/OpinionNewsArticle     OpinionNewsArticle
3  https://schema.org/11.0    https://schema.org/AnalysisNewsArticle    AnalysisNewsArticle
4  https://schema.org/11.0   https://schema.org/ReportageNewsArticle   ReportageNewsArticle
5  https://schema.org/11.0      https://schema.org/ReviewNewsArticle      ReviewNewsArticle
5,https://schema.org/11.0,https://schema.org/ReviewNewsArticle,ReviewNewsArticle


In [30]:
result = toDF( g.query("select * where { ?attic_term <https://schema.org/isPartOf> <https://attic.schema.org> ; rdfs:label ?label }") )
print(result)

                                            attic_term                            label
0                 https://schema.org/variablesMeasured                variablesMeasured
1                 https://schema.org/productReturnDays                productReturnDays
2                        https://schema.org/StupidType                       StupidType
3          https://schema.org/ProductReturnUnspecified         ProductReturnUnspecified
4                 https://schema.org/productReturnLink                productReturnLink
5   https://schema.org/ProductReturnFiniteReturnWindow  ProductReturnFiniteReturnWindow
6            https://schema.org/hasProductReturnPolicy           hasProductReturnPolicy
7      https://schema.org/ProductReturnUnlimitedWindow     ProductReturnUnlimitedWindow
8         https://schema.org/ProductReturnNotPermitted        ProductReturnNotPermitted
9                    https://schema.org/stupidProperty                   stupidProperty
10              https://schema.o

In [27]:
# Run q1 against the remote endpoint and display the resulting DataFrame
x = df_from_query(q1)
x

   prop.type                     prop.value  type1.type                      type1.value  type2.type                type2.value
0        uri       https://schema.org/email         uri  https://schema.org/Organization         uri  https://schema.org/Person
1        uri   https://schema.org/faxNumber         uri  https://schema.org/Organization         uri  https://schema.org/Person
2        uri       https://schema.org/award         uri  https://schema.org/Organization         uri  https://schema.org/Person
3        uri   https://schema.org/telephone         uri  https://schema.org/Organization         uri  https://schema.org/Person
4        uri    https://schema.org/memberOf         uri  https://schema.org/Organization         uri  https://schema.org/Person
5        uri     https://schema.org/sponsor         uri  https://schema.org/Organization         uri  https://schema.org/Person
6        uri  https://schema.org/knowsAbout         uri  https://schema.org/Organization         uri  https://schema.org/Person
7        uri      https://schema.org/gender         uri    https://schema.org/SportsTeam         uri  https://schema.org/Person
8        uri       https://schema.org/vatID         uri  https://schema.org/Organization         uri  https://schema.org/Person
9        uri       https://schema.org/brand         uri  https://schema.org/Organization         uri  https://schema.org/Person
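
Row 7 reports `SportsTeam` rather than `Organization` for `type1` because `rdfs:subClassOf*` is a zero-or-more property path: it matches the class itself as well as every direct or indirect subclass. A minimal pure-Python sketch of that matching follows. The `SUBCLASS_OF` dict is a hand-written toy excerpt and the function name `is_subclass_star` is hypothetical; the real hierarchy lives in the endpoint.

```python
# Toy excerpt of a class hierarchy: child -> list of direct parents (hand-written for illustration).
SUBCLASS_OF = {
    "SportsTeam": ["SportsOrganization"],
    "SportsOrganization": ["Organization"],
    "Organization": ["Thing"],
    "Person": ["Thing"],
}


def is_subclass_star(cls, ancestor):
    """Mimic `cls rdfs:subClassOf* ancestor`: zero or more subClassOf steps."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current in seen:
            continue  # guard against cycles in the hierarchy
        seen.add(current)
        stack.extend(SUBCLASS_OF.get(current, []))
    return False


print(is_subclass_star("Organization", "Organization"))  # zero steps
print(is_subclass_star("SportsTeam", "Organization"))    # two steps
print(is_subclass_star("Person", "Organization"))        # no path
```

With `rdfs:subClassOf+` (one or more steps) the first call would not match, which is why the query uses `*` to include `Organization` and `Person` themselves.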


# Examples

How to access schema.org examples

In [23]:
# First we clone the entire schema.org repo, then we collect up the examples from .txt files:

!git clone https://github.com/schemaorg/schemaorg

fatal: destination path 'schemaorg' already exists and is not an empty directory.


In [24]:
!find . -name \*example\*.txt -exec ls {} \;

./SchemaExamples/example-code/examples.txt
./data/sdo-bus-stop-examples.txt
./data/sdo-trip-examples.txt
./data/sdo-police-station-examples.txt
./data/sdo-airport-examples.txt
./data/sdo-train-station-examples.txt
./data/sdo-videogame-examples.txt
./data/sdo-book-series-examples.txt
./data/sdo-automobile-examples.txt
./data/sdo-invoice-examples.txt
./data/sdo-creativework-examples.txt
./data/sdo-itemlist-examples.txt
./data/sdo-offeredby-examples.txt
./data/sdo-digital-document-examples.txt
./data/examples.txt
./data/sdo-hotels-examples.txt
./data/ext/pending/issue-2490-examples.txt
./data/ext/pending/issue-1670-examples.txt
./data/ext/pending/issue-2192-examples.txt
./data/ext/pending/issue-894-examples.txt
./data/ext/pending/issue-2396-examples.txt
./data/ext/pending/issue-1156-examples.txt
./data/ext/pending/issue-2384-examples.txt
./data/ext/pending/issue-1698-examples.txt
./data/ext/pending/issue-2294-examples.txt
./data/ext/pending/issue-2543-examples.txt
./data/ext/pending/issue
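
The same listing can be produced from Python, which makes it easier to feed the paths into later cells. This is a minimal sketch, assuming the directory layout shown in the output above: files below a `pending` directory are treated as 'pending' and everything else as 'core'. The function name `find_example_files` is hypothetical.

```python
from pathlib import Path


def find_example_files(root="."):
    """Return example files under root, split into 'core' and 'pending' lists."""
    groups = {"core": [], "pending": []}
    # rglob walks the tree recursively, like `find . -name '*example*.txt'`
    for path in sorted(Path(root).rglob("*example*.txt")):
        if not path.is_file():
            continue
        # Assumption: anything below a directory named 'pending' belongs to the pending area.
        parent_dirs = path.relative_to(root).parts[:-1]
        key = "pending" if "pending" in parent_dirs else "core"
        groups[key].append(path)
    return groups


groups = find_example_files(".")
print(len(groups["core"]), "core files;", len(groups["pending"]), "pending files")
```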

TODOs:
 * Can we load all the examples into a multi-graph SPARQL store (in rdflib, not a remote endpoint)? Put them into 'core' and 'pending' named graphs or similar.
 * Then load triples from the latest webschemas release, https://webschemas.org/version/latest/schemaorg-current-https.jsonld, into a named graph.
 * Find triples in 'core' examples that are not in the vocabulary (then do the same with 'pending').
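
As a starting point for the last item, here is a rough sketch that compares term names rather than full triples, so it is a simplification of what the TODO asks for. It assumes the JSON-LD part of an example has already been extracted into a string and that the vocabulary's term names are available as a set. `VOCAB`, `example_jsonld`, `terms_used`, and `unknown_terms` are all hypothetical names, and the data is made up for illustration.

```python
import json

# Hypothetical stand-in for the set of term names defined by the vocabulary.
VOCAB = {"Person", "Organization", "name", "email", "memberOf"}

# Hypothetical JSON-LD example; 'nickname' is deliberately not in VOCAB.
example_jsonld = """
{
  "@context": "https://schema.org",
  "@type": "Person",
  "name": "Jane Doe",
  "nickname": "JD",
  "memberOf": {"@type": "Organization", "name": "Example Org"}
}
"""


def terms_used(node):
    """Collect property names and @type values from a parsed JSON-LD structure."""
    found = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "@type":
                types = value if isinstance(value, list) else [value]
                found.update(types)
            elif not key.startswith("@"):
                found.add(key)
                found |= terms_used(value)
    elif isinstance(node, list):
        for item in node:
            found |= terms_used(item)
    return found


def unknown_terms(jsonld_text, vocab):
    """Return terms that appear in the example but not in the vocabulary."""
    return terms_used(json.loads(jsonld_text)) - vocab


print(sorted(unknown_terms(example_jsonld, VOCAB)))
```

A triple-level check would instead parse both the examples and the vocabulary into rdflib graphs and compare them there; the name-based version above only flags candidates worth a closer look.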
