<a href="https://colab.research.google.com/github/schemaorg/schemaorg/blob/main/scripts/Schema_org_Dashboard.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook is part of the Schema.org project codebase at https://github.com/schemaorg/schemaorg and licensed under the same terms. **bold text**


The purpose of this notebook is to show how to work programmatically with schema.org's definitions. 

See also https://colab.research.google.com/drive/1GVQaP5t8G-NRLAmEvVSp8k5MnsrfttDP for another approach to this, and this [2016 dashboard](https://github.com/schemaorg/schemaorg/blob/main/scripts/dashboard.ipynb) for some useful SPARQL queries to migrate here.

# SPARQL

How to query schema.org schemas using SPARQL

In [29]:
# run this once per session to bring in a required library

!pip --quiet install sparqlwrapper | grep -v 'already satisfied'

from SPARQLWrapper import SPARQLWrapper, JSON
import pandas as pd
import io
import requests

In [30]:
# This function shows how to use rdflib to query a REMOTE sparql dataset

q1 = """SELECT distinct ?prop ?type1 ?type2 WHERE  {
  ?type1 rdfs:subClassOf* <https://schema.org/Organization> . 
  ?type2 rdfs:subClassOf* <https://schema.org/Person> . 
  ?prop <https://schema.org/domainIncludes> ?type1 .
  ?prop <https://schema.org/domainIncludes> ?type2 .
}"""

pd.set_option('display.max_colwidth', None)

# data
wd_endpoint = 'https://query.wikidata.org/sparql'
sdo_endpoint = "https://dydra.com/danbri/schema-org-v11/sparql"

# utility function
def df_from_query(querystring=q1, endpoint=sdo_endpoint):
  sparql = SPARQLWrapper(endpoint)
  sparql.setQuery(querystring)
  sparql.setReturnFormat(JSON)
  results = sparql.query().convert()
  return( pd.json_normalize(results['results']['bindings']) )

In [46]:
# This shows how to use rdflib to query a LOCAL sparql dataset
# TODO: Need a function that loads https://webschemas.org/version/latest/schemaorg-current-https.nt into a named graph SPARQL store 


import rdflib
import json
from collections import Counter
from rdflib import Graph, plugin, ConjunctiveGraph
from rdflib.serializer import Serializer

def toDF(result):
  return pd.DataFrame(result, columns=result.vars)

# Fetch Schema.org definitions

sdo_current_https_url = "https://webschemas.org/version/latest/schemaorg-current-https.nq"
sdo_all_https_url = "https://webschemas.org/version/latest/schemaorg-all-https.nq"

# TODO - is this the only way to figure out what is in the attic? except both files use same NG URL
g = ConjunctiveGraph(store="IOMemory")
g.parse( sdo_all_https_url,    format="nquads",    publicID="https://schema.org/")
#g.parse( sdo_current_https_url,    format="nquads",    publicID="https://schema.org/")


<Graph identifier=https://schema.org/ (<class 'rdflib.graph.Graph'>)>

In [38]:
result = toDF( g.query("select * where { GRAPH ?g { ?article_type rdfs:subClassOf <https://schema.org/NewsArticle> ; rdfs:label ?label }}") )


In [39]:
result

Unnamed: 0,article_type,g,label
0,https://schema.org/OpinionNewsArticle,https://schema.org/11.0,OpinionNewsArticle
1,https://schema.org/ReportageNewsArticle,https://schema.org/11.0,ReportageNewsArticle
2,https://schema.org/ReviewNewsArticle,https://schema.org/11.0,ReviewNewsArticle
3,https://schema.org/AskPublicNewsArticle,https://schema.org/11.0,AskPublicNewsArticle
4,https://schema.org/BackgroundNewsArticle,https://schema.org/11.0,BackgroundNewsArticle
5,https://schema.org/AnalysisNewsArticle,https://schema.org/11.0,AnalysisNewsArticle


In [40]:
toDF( g.query("select * where { ?attic_term <https://schema.org/isPartOf> <https://attic.schema.org> ; rdfs:label ?label }") )

Unnamed: 0,label,attic_term
0,ProductReturnUnlimitedWindow,https://schema.org/ProductReturnUnlimitedWindow
1,ProductReturnFiniteReturnWindow,https://schema.org/ProductReturnFiniteReturnWindow
2,hasProductReturnPolicy,https://schema.org/hasProductReturnPolicy
3,variablesMeasured,https://schema.org/variablesMeasured
4,ProductReturnUnspecified,https://schema.org/ProductReturnUnspecified
5,stupidProperty,https://schema.org/stupidProperty
6,ProductReturnNotPermitted,https://schema.org/ProductReturnNotPermitted
7,ProductReturnEnumeration,https://schema.org/ProductReturnEnumeration
8,StupidType,https://schema.org/StupidType
9,productReturnLink,https://schema.org/productReturnLink


In [41]:
grandchild_count_query = """SELECT ?child (count(?grandchild) as ?nGrandchildren) where { ?child rdfs:subClassOf <https://schema.org/Thing> . OPTIONAL { ?grandchild rdfs:subClassOf ?child } } GROUP BY ?child order by desc(count(?grandchild))"""
res = g.query (grandchild_count_query)
mydf = toDF( res )
#mydf.plot(kind='bar')

In [47]:
mydf.columns

Index(['child', 'nGrandchildren'], dtype='object')

In [None]:
# https://www.shanelynn.ie/bar-plots-in-python-using-pandas-dataframes/

In [45]:
mydf['nGrandchildren']

KeyError: ignored

In [44]:
print(mydf)
mydf['nGrandchildren'].plot(kind='bar')

                              child nGrandchildren
0   https://schema.org/CreativeWork             71
1     https://schema.org/Intangible             59
2          https://schema.org/Event             22
3  https://schema.org/MedicalEntity             19
4         https://schema.org/Action             16
5   https://schema.org/Organization             15
6          https://schema.org/Place              9
7        https://schema.org/Product              6
8         https://schema.org/Person              1
9     https://schema.org/StupidType              0


KeyError: ignored

In [17]:
result

x = df_from_query(q1)
x

Unnamed: 0,prop.type,prop.value,type1.type,type1.value,type2.type,type2.value
0,uri,https://schema.org/email,uri,https://schema.org/Organization,uri,https://schema.org/Person
1,uri,https://schema.org/faxNumber,uri,https://schema.org/Organization,uri,https://schema.org/Person
2,uri,https://schema.org/award,uri,https://schema.org/Organization,uri,https://schema.org/Person
3,uri,https://schema.org/telephone,uri,https://schema.org/Organization,uri,https://schema.org/Person
4,uri,https://schema.org/memberOf,uri,https://schema.org/Organization,uri,https://schema.org/Person
5,uri,https://schema.org/sponsor,uri,https://schema.org/Organization,uri,https://schema.org/Person
6,uri,https://schema.org/knowsAbout,uri,https://schema.org/Organization,uri,https://schema.org/Person
7,uri,https://schema.org/gender,uri,https://schema.org/SportsTeam,uri,https://schema.org/Person
8,uri,https://schema.org/vatID,uri,https://schema.org/Organization,uri,https://schema.org/Person
9,uri,https://schema.org/brand,uri,https://schema.org/Organization,uri,https://schema.org/Person


# Examples

How to access schema.org examples

In [21]:
# First we clone the entire schema.org repo, then we collect up the examples from .txt files:

!git clone https://github.com/schemaorg/schemaorg

Cloning into 'schemaorg'...
remote: Enumerating objects: 132, done.[K
remote: Counting objects: 100% (132/132), done.[K
remote: Compressing objects: 100% (110/110), done.[K
remote: Total 23477 (delta 80), reused 50 (delta 21), pack-reused 23345[K
Receiving objects: 100% (23477/23477), 96.36 MiB | 28.09 MiB/s, done.
Resolving deltas: 100% (16700/16700), done.
Checking out files: 100% (1788/1788), done.


In [24]:
!find . -name \*example\*.txt -exec ls {} \;

./schemaorg/SchemaExamples/example-code/examples.txt
./schemaorg/data/sdo-bus-stop-examples.txt
./schemaorg/data/sdo-trip-examples.txt
./schemaorg/data/sdo-police-station-examples.txt
./schemaorg/data/sdo-airport-examples.txt
./schemaorg/data/sdo-train-station-examples.txt
./schemaorg/data/sdo-videogame-examples.txt
./schemaorg/data/sdo-book-series-examples.txt
./schemaorg/data/sdo-automobile-examples.txt
./schemaorg/data/sdo-invoice-examples.txt
./schemaorg/data/sdo-creativework-examples.txt
./schemaorg/data/sdo-itemlist-examples.txt
./schemaorg/data/sdo-offeredby-examples.txt
./schemaorg/data/sdo-digital-document-examples.txt
./schemaorg/data/examples.txt
./schemaorg/data/sdo-hotels-examples.txt
./schemaorg/data/ext/pending/issue-2490-examples.txt
./schemaorg/data/ext/pending/issue-1670-examples.txt
./schemaorg/data/ext/pending/issue-2192-examples.txt
./schemaorg/data/ext/pending/issue-894-examples.txt
./schemaorg/data/ext/pending/issue-2396-examples.txt
./schemaorg/data/ext/pending/

TODOs:
 * can we load all the examples into a multi-graph SPARQL store? (in rdflib not remote endpoint); put them into 'core' and 'pending' named graphs or similar.
  * then load triples from latest webschemas, https://webschemas.org/version/latest/schemaorg-current-https.jsonld into a named graph.
  * find triples in 'core' examples that are not in the vocabulary (then same with pending)
