# Introduction to the Semantic Web

## Brief History

* 1969: ARPANET delivers first message
* 1983: ARPANET adopts TCP/IP
* 1989: WWW invented, with HTTP proposed and implemented, by [Tim Berners-Lee](https://en.wikipedia.org/wiki/Tim_Berners-Lee)
* 1999: Term "Semantic Web" defined by [Tim Berners-Lee](https://en.wikipedia.org/wiki/Tim_Berners-Lee)
* 1999: RDF recommendation adopted by W3C
* 2004: RDF 1.0 spec published

## Key Definitions

[Semantic Web](https://www.w3.org/standards/semanticweb/)

In addition to the classic “Web of documents” W3C is helping to build a technology stack to support a “Web of data,” the sort of data you find in databases. The ultimate goal of the Web of data is to enable computers to do more useful work and to develop systems that can support trusted interactions over the network. The term “Semantic Web” refers to W3C’s vision of the Web of linked data. Semantic Web technologies enable people to create data stores on the Web, build vocabularies, and write rules for handling data. Linked data are empowered by technologies such as RDF, SPARQL, OWL, and SKOS.

[Linked Data](https://www.w3.org/standards/semanticweb/data.html)

It is important to have the huge amount of data on the Web available in a standard format, reachable and manageable by Semantic Web tools. Furthermore, not only does the Semantic Web need access to data, but relationships among data should be made available, too, to create a Web of Data (as opposed to a sheer collection of datasets). This collection of interrelated datasets on the Web can also be referred to as Linked Data.

[RDF (Resource Description Framework)](https://en.wikipedia.org/wiki/Resource_Description_Framework)

The RDF data model is similar to classical conceptual modeling approaches (such as entity–relationship or class diagrams). It is based on the idea of making statements about resources (in particular web resources) in expressions of the form subject–predicate–object, known as triples. The subject denotes the resource, and the predicate denotes traits or aspects of the resource, and expresses a relationship between the subject and the object.

[Ontology](https://www.w3.org/standards/semanticweb/ontology.html)

On the Semantic Web, vocabularies define the concepts and relationships (also referred to as “terms”) used to describe and represent an area of concern. Vocabularies are used to classify the terms that can be used in a particular application, characterize possible relationships, and define possible constraints on using those terms. In practice, vocabularies can be very complex (with several thousands of terms) or very simple (describing one or two concepts only).

There is no clear division between what is referred to as “vocabularies” and “ontologies”. The trend is to use the word “ontology” for more complex, and possibly quite formal collection of terms, whereas “vocabulary” is used when such strict formalism is not necessarily used or only in a very loose sense. Vocabularies are the basic building blocks for inference techniques on the Semantic Web.

## References and Resources

https://github.com/semantalytics/awesome-semantic-web

## What is actually in a URL?

In [1]:
```python
from urllib.parse import urlparse

urlparse("https://www.w3.org/People/Berners-Lee/card#i")
```

```text
ParseResult(scheme='https', netloc='www.w3.org', path='/People/Berners-Lee/card', params='', query='', fragment='i')
```

## Building the URI (Uniform Resource Identifier)

For most of our cases, the scheme will be `http` or `https`, but it can be any registered URI scheme (for example `ftp`, `mailto`, or `urn`).

The netloc, also known as the authority, names the host (optionally with a port and user information) that is responsible for the resource identified by the URI.

The path locates the resource within that authority.

The fragment identifies a secondary resource within the document found at that URL. In the example above, `#i` picks out one thing described in the `card` document. Note that a fragment is not a URN (Uniform Resource Name); a URN is a URI that uses the `urn:` scheme to name a resource without saying where it lives.

Lastly, from here on we will use the term IRI (Internationalized Resource Identifier), a generalization of the URI that allows Unicode characters beyond ASCII.
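As a small illustration of the components described above, the following sketch assembles the same URI from its parts with the standard library's `urlunparse` and then separates the fragment again with `urldefrag`. The variable names are chosen here for illustration only.

```python
from urllib.parse import urldefrag, urlunparse

# Components in the order urlparse reports them:
# (scheme, netloc, path, params, query, fragment)
parts = ("https", "www.w3.org", "/People/Berners-Lee/card", "", "", "i")

uri = urlunparse(parts)
print(uri)

# urldefrag splits off the fragment, leaving the URL of the document itself
document_url, fragment = urldefrag(uri)
print(document_url, fragment)
```

The document URL is what a client would actually request; the fragment is resolved afterwards, against whatever that request returns.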

## Working with RDF data

RDF models data as a directed, labeled graph. Not all RDF can be cleanly placed into a relational database, but all relational database data can be cleanly represented as a graph in which the Subject is an IRI identifying a row by its primary key, the Predicate is an IRI pointing to the definition of a column in the table, and the Object is either a literal value or an IRI identifying the row referenced by a foreign key.

Each term in an RDF statement is one of three types:

1. IRI
   * Identifies the resource being referenced. A serialization may write a relative IRI, which is resolved against a base IRI.
2. Literal
   * A basic value that is not an IRI, such as a string, an integer, or a date, optionally tagged with a datatype or language.
3. Blank Node
   * An anonymous reference to an object without an IRI. Usually used as a container to group a collection of RDF statements.

Each position in a statement accepts only certain term types:

* Subject
  * IRI
  * Blank Node
* Predicate
  * IRI
* Object
  * IRI (denotes a foreign key when describing relational database data)
  * Literal
  * Blank Node
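The position rules above can be sketched in plain Python. The `IRI`, `Literal`, and `BlankNode` classes below are hypothetical stand-ins written for this illustration; they are not rdflib's types (rdflib provides `URIRef`, `Literal`, and `BNode` for the same roles).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IRI:
    value: str

@dataclass(frozen=True)
class Literal:
    value: object

@dataclass(frozen=True)
class BlankNode:
    label: str

# Which term types are allowed in each position of a triple
ALLOWED = {
    "subject": (IRI, BlankNode),
    "predicate": (IRI,),
    "object": (IRI, Literal, BlankNode),
}

def is_valid_triple(subject, predicate, obj):
    """Return True if every term is of a type allowed in its position."""
    return (
        isinstance(subject, ALLOWED["subject"])
        and isinstance(predicate, ALLOWED["predicate"])
        and isinstance(obj, ALLOWED["object"])
    )

tory = IRI("http://example.org/person/0")
name = IRI("http://xmlns.com/foaf/0.1/name")

print(is_valid_triple(tory, name, Literal("Tory")))  # True: a literal may be an object
print(is_valid_triple(Literal("Tory"), name, tory))  # False: a literal may not be a subject
```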
  
Example:

*Table Name: People*

|ID|Name|Age|Type|Knows|
|--|----|---|----|------|
|0|Tory|32|Human||
|1|Clyde|13|Cat|0|


Using the above table, we could make the following RDF statements:

```text
Subject(People#0) -> Predicate(has_name) -> Object(Tory)
Subject(People#1) -> Predicate(knows) -> Object(People#0)
```
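The same mapping can be sketched for the whole People table in plain Python. The base IRI, the `FOREIGN_KEYS` set, and the `row_to_triples` helper are assumptions made for this illustration; a real mapping would normally reuse predicates from an existing vocabulary.

```python
BASE = "http://example.org/People"  # hypothetical base IRI for the table

rows = [
    {"ID": 0, "Name": "Tory", "Age": 32, "Type": "Human", "Knows": None},
    {"ID": 1, "Name": "Clyde", "Age": 13, "Type": "Cat", "Knows": 0},
]

FOREIGN_KEYS = {"Knows"}  # columns whose values point at another row

def row_to_triples(row):
    """Yield (subject, predicate, object) tuples for one table row."""
    subject = f"{BASE}#{row['ID']}"
    for column, value in row.items():
        if column == "ID" or value is None:
            continue  # the key is the subject itself; empty cells add no triple
        predicate = f"{BASE}/{column.lower()}"
        # A foreign key becomes an IRI pointing at the referenced row
        obj = f"{BASE}#{value}" if column in FOREIGN_KEYS else value
        yield (subject, predicate, obj)

triples = [t for row in rows for t in row_to_triples(row)]
for t in triples:
    print(t)
```

Note the `value is None` test: Clyde's `Knows` value is `0`, which is a valid key even though it is falsy in Python.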

In [2]:
```python
!pip install rdflib rdflib-jsonld
import rdflib
from rdflib.namespace import FOAF, XSD

# Create a Graph to store our RDF triples
graph = rdflib.Graph()

# Declare the namespace that will hold all of our resources
ns = rdflib.Namespace("http://example.org")

# Declare new types
Person = rdflib.URIRef(ns + "/person")
Human = rdflib.URIRef(ns + "/human")
Cat = rdflib.URIRef(ns + "/cat")

# Create our People
Tory = rdflib.URIRef(Person + "/0")
Clyde = rdflib.URIRef(Person + "/1")

# Start populating the graph
graph.add((Tory, FOAF.name, rdflib.Literal("Tory")))
graph.add((Tory, rdflib.RDF.type, Human))
graph.add((Tory, FOAF.age, rdflib.Literal(32)))

graph.add((Clyde, FOAF.name, rdflib.Literal("Clyde")))
graph.add((Clyde, rdflib.RDF.type, Cat))
graph.add((Clyde, FOAF.age, rdflib.Literal(13)))
graph.add((Clyde, FOAF.knows, Tory))
```


In [3]:
```python
for triple in graph:
    print(triple)
```

```text
(rdflib.term.URIRef('http://example.org/person/1'), rdflib.term.URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'), rdflib.term.URIRef('http://example.org/cat'))
(rdflib.term.URIRef('http://example.org/person/1'), rdflib.term.URIRef('http://xmlns.com/foaf/0.1/name'), rdflib.term.Literal('Clyde'))
(rdflib.term.URIRef('http://example.org/person/1'), rdflib.term.URIRef('http://xmlns.com/foaf/0.1/age'), rdflib.term.Literal('13', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#integer')))
(rdflib.term.URIRef('http://example.org/person/0'), rdflib.term.URIRef('http://xmlns.com/foaf/0.1/age'), rdflib.term.Literal('32', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#integer')))
(rdflib.term.URIRef('http://example.org/person/0'), rdflib.term.URIRef('http://xmlns.com/foaf/0.1/name'), rdflib.term.Literal('Tory'))
(rdflib.term.URIRef('http://example.org/person/1'), rdflib.term.URIRef('http://xmlns.com/foaf/0.1/knows'), rdflib.term.URIRef('http://example.org/pers
```

## RDF Serialization Formats

Some common serialization formats include:

* RDF/XML (RDF expressed in Extensible Markup Language)
* N3 (Notation3)
* Turtle (Terse RDF Triple Language, file extension `.ttl`)
* JSON-LD (JSON for Linked-Data)
* And MANY more!
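To make "serialization" concrete, here is a minimal hand-written sketch of N-Triples, one of the simplest formats (one statement per line, not listed above). The `IRI` marker class and both helper functions are hypothetical and handle only IRIs, integers, and plain strings; a real serializer covers far more cases.

```python
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"

class IRI(str):
    """Marks a string as an IRI rather than a plain literal (illustrative helper)."""

def term_to_nt(term):
    """Render one term in N-Triples syntax."""
    if isinstance(term, IRI):
        return f"<{term}>"
    if isinstance(term, int):
        return f'"{term}"^^<{XSD_INTEGER}>'
    # Plain string literal: escape backslashes, quotes, and newlines
    escaped = str(term).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'

def to_ntriples(triples):
    """Render each (subject, predicate, object) tuple as one N-Triples line."""
    return "\n".join(
        f"{term_to_nt(s)} {term_to_nt(p)} {term_to_nt(o)} ." for s, p, o in triples
    )

FOAF_NS = "http://xmlns.com/foaf/0.1/"
clyde = IRI("http://example.org/person/1")

triples = [
    (clyde, IRI(FOAF_NS + "name"), "Clyde"),
    (clyde, IRI(FOAF_NS + "age"), 13),
]
print(to_ntriples(triples))
```

Whatever the format, the triples are the same; only the text used to write them down changes.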

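In [4]:
```python
for fmt in ["xml", "n3", "ttl", "json-ld"]:
    print("=" * 20, fmt, "=" * 20)
    data = graph.serialize(format=fmt)
    # rdflib 5 returns bytes here, while rdflib 6 and later return str
    print(data.decode() if isinstance(data, bytes) else data)
```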

```text
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
   xmlns:ns1="http://xmlns.com/foaf/0.1/"
   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
>
  <rdf:Description rdf:about="http://example.org/person/1">
    <rdf:type rdf:resource="http://example.org/cat"/>
    <ns1:name>Clyde</ns1:name>
    <ns1:age rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">13</ns1:age>
    <ns1:knows rdf:resource="http://example.org/person/0"/>
  </rdf:Description>
  <rdf:Description rdf:about="http://example.org/person/0">
    <ns1:age rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">32</ns1:age>
    <ns1:name>Tory</ns1:name>
    <rdf:type rdf:resource="http://example.org/human"/>
  </rdf:Description>
</rdf:RDF>

@prefix ns1: <http://xmlns.com/foaf/0.1/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://example.org/person/1> a <http://example.org/cat> ;
    ns1:age 13 ;
    ns1:knows <http://example.org/person/0> ;
    ns1:name "Clyde" .

<http://example.org/person/0> a <http:
```

## Applying Semantics to Linked Data

The simplest version of the [data pyramid](https://en.wikipedia.org/wiki/DIKW_pyramid) goes:

Data -> Information -> Knowledge -> Wisdom

Data is a collection of observations stated as facts.

Information is derived from data by extracting the useful parts. This is normally done through some sort of ETL (Extract, Transform, Load) process.

Knowledge is information enriched with context, either by adding domain knowledge or by combining multiple sources of information that lack that context when standing alone.

Wisdom is the shared understanding of that knowledge: how to apply it to business objects, and why it is useful.

A single value or cell in a database represents a data point. Each triple in our graph represents a single piece of information. The graph as a whole represents our collection of information. The next step is to apply domain knowledge to lift the information graph to a knowledge graph.

The way we will encode and apply our domain knowledge is by using an ontology.

## Vocabularies, Ontologies, and Schemas

The two most common ways to encode domain knowledge are:

1. [RDFS (RDF Schema)](https://www.w3.org/TR/rdf-schema/)
2. [OWL (Web Ontology Language)](https://www.w3.org/TR/owl-overview/)

Vocabularies are themselves also just RDF graphs, but contain an encoding of domain knowledge and a set of constraints to validate or extend an existing knowledge graph.

Let's start by bringing in the ontology for FOAF (Friend of a Friend), the vocabulary for describing people and their social connections that we used in our previous example.

In [5]:
```python
ontology = rdflib.Graph()

url = "http://xmlns.com/foaf/spec/20140114.rdf"
namespace = rdflib.Namespace("http://xmlns.com/foaf/0.1/")

ontology.parse(url)

subjects = set(ontology.subjects())
predicates = set(ontology.predicates())
objects = set(ontology.objects())

print(f"Triples({len(ontology)}), Subjects({len(subjects)}), Predicates({len(predicates)}), Objects({len(objects)})")
```

```text
Triples(631), Subjects(86), Predicates(15), Objects(192)
```


From here, we are able to categorize our subjects into two groups:

1. Subjects that belong to the FOAF namespace
2. External subjects that enrich the FOAF namespace items

In [6]:
```python
internal_subjects = set(sub for sub in subjects if sub.startswith(namespace))
external_subjects = subjects - internal_subjects

print(f"Internal({len(internal_subjects)}), External({len(external_subjects)})")
```

```text
Internal(76), External(10)
```


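In [7]:
```python
from collections import namedtuple

# NOTE: the "urn" field holds the local name (the text after the last "#" or "/").
# It is not a URN in the strict sense; the field name is kept so the outputs below match.
IRIRef = namedtuple("IRIRef", ("iri", "delim", "urn"))

def split(item):
    """Split an IRI into (namespace, delimiter, local name); pass other terms through."""
    if not isinstance(item, rdflib.term.URIRef):
        return item
    iri, delim, urn = item.rpartition("#" if "#" in item else "/")
    return IRIRef(iri, delim, urn)
```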

In [8]:
```python
[split(subject)[2] for subject in internal_subjects]
```

```text
['name',
 'accountName',
 'tipjar',
 'depiction',
 'geekcode',
 'schoolHomepage',
 'interest',
 'based_near',
 '',
 'page',
 'gender',
 'family_name',
 'focus',
 'familyName',
 'OnlineEcommerceAccount',
 'holdsAccount',
 'pastProject',
 'age',
 'fundedBy',
 'givenName',
 'Agent',
 'skypeID',
 'OnlineAccount',
 'topic_interest',
 'currentProject',
 'member',
 'OnlineGamingAccount',
 'surname',
 'dnaChecksum',
 'Document',
 'Group',
 'mbox_sha1sum',
 'nick',
 'topic',
 'aimChatID',
 'plan',
 'Organization',
 'primaryTopic',
 'theme',
 'knows',
 'Person',
 'account',
 'icqChatID',
 'Project',
 'sha1',
 'homepage',
 'depicts',
 'birthday',
 'openid',
 'lastName',
 'jabberID',
 'OnlineChatAccount',
 'membershipClass',
 'maker',
 'LabelProperty',
 'accountServiceHomepage',
 'firstName',
 'workInfoHomepage',
 'phone',
 'workplaceHomepage',
 'isPrimaryTopicOf',
 'weblog',
 'myersBriggs',
 'Image',
 'thumbnail',
 'logo',
 'msnChatID',
 'publications',
 'status',
 'mbox',
 'PersonalProfileDocume
```

In [9]:
```python
[split(subject) for subject in external_subjects]
```

```text
[IRIRef(iri='http://www.w3.org/2003/01/geo/wgs84_pos', delim='#', urn='SpatialThing'),
 IRIRef(iri='http://purl.org/dc/elements/1.1', delim='/', urn='date'),
 IRIRef(iri='http://xmlns.com/wot/0.1', delim='/', urn='assurance'),
 IRIRef(iri='http://www.w3.org/2000/01/rdf-schema', delim='#', urn='Class'),
 IRIRef(iri='http://www.w3.org/2004/02/skos/core', delim='#', urn='Concept'),
 IRIRef(iri='http://xmlns.com/wot/0.1', delim='/', urn='src_assurance'),
 IRIRef(iri='http://www.w3.org/2002/07/owl', delim='#', urn='Thing'),
 IRIRef(iri='http://purl.org/dc/elements/1.1', delim='/', urn='title'),
 IRIRef(iri='http://www.w3.org/2003/06/sw-vocab-status/ns', delim='#', urn='term_status'),
 IRIRef(iri='http://purl.org/dc/elements/1.1', delim='/', urn='description')]
```

In our example, the predicates are where things get interesting. Let's start by looking only at the namespaces they come from.

In the example below, you can see that we are bringing in predicates not only from multiple vocabularies, but from multiple vocabulary standards, such as RDFS and OWL. When building RDF graphs, you are encouraged to include as much semantics as possible, and you aren't required to stick to a single namespace, or even a single domain, when describing your graph.

In [10]:
```python
set(split(predicate)[0] for predicate in predicates)
```

```text
{'http://purl.org/dc/elements/1.1',
 'http://www.w3.org/1999/02/22-rdf-syntax-ns',
 'http://www.w3.org/2000/01/rdf-schema',
 'http://www.w3.org/2002/07/owl',
 'http://www.w3.org/2003/06/sw-vocab-status/ns'}
```

In [11]:
```python
set(split(predicate) for predicate in predicates)
```

```text
{IRIRef(iri='http://purl.org/dc/elements/1.1', delim='/', urn='description'),
 IRIRef(iri='http://purl.org/dc/elements/1.1', delim='/', urn='title'),
 IRIRef(iri='http://www.w3.org/1999/02/22-rdf-syntax-ns', delim='#', urn='type'),
 IRIRef(iri='http://www.w3.org/2000/01/rdf-schema', delim='#', urn='comment'),
 IRIRef(iri='http://www.w3.org/2000/01/rdf-schema', delim='#', urn='domain'),
 IRIRef(iri='http://www.w3.org/2000/01/rdf-schema', delim='#', urn='isDefinedBy'),
 IRIRef(iri='http://www.w3.org/2000/01/rdf-schema', delim='#', urn='label'),
 IRIRef(iri='http://www.w3.org/2000/01/rdf-schema', delim='#', urn='range'),
 IRIRef(iri='http://www.w3.org/2000/01/rdf-schema', delim='#', urn='subClassOf'),
 IRIRef(iri='http://www.w3.org/2000/01/rdf-schema', delim='#', urn='subPropertyOf'),
 IRIRef(iri='http://www.w3.org/2002/07/owl', delim='#', urn='disjointWith'),
 IRIRef(iri='http://www.w3.org/2002/07/owl', delim='#', urn='equivalentClass'),
 IRIRef(iri='http://www.w3.org/2002/07/owl', delim
```

While most RDF graphs describe relationships in your data, a vocabulary describes the relationships between your TYPES of data. In this sense, a vocabulary is similar to a UML diagram that describes the relationships between Python classes and their subclasses, including base types (see "http://www.w3.org/1999/02/22-rdf-syntax-ns" and "http://purl.org/dc/elements/1.1" above).
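To make the class-diagram analogy concrete, here is a loose Python mirror of two FOAF classes. It is only an analogy: the dataclasses below are written for this illustration and are not generated from, or equivalent to, the FOAF vocabulary.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Agent:  # plays the role of foaf:Agent
    name: str

@dataclass
class Person(Agent):  # like foaf:Person rdfs:subClassOf foaf:Agent
    # like foaf:knows, which links one foaf:Person to another
    knows: List["Person"] = field(default_factory=list)

tory = Person("Tory")
clyde = Person("Clyde", knows=[tory])

print(issubclass(Person, Agent))  # True
print(isinstance(clyde, Agent))   # True: every Person is also an Agent
```

One difference matters in practice. A Python type checker would reject a value of the wrong type, whereas an RDFS reasoner uses a property's declared domain and range to infer additional types for the resources involved.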

Next, let's look at the predicates and objects that are IRI references (not literals or blank nodes) and never appear as a subject in the ontology itself. This will tell us how FOAF depends on external vocabularies.

In [12]:
```python
foreign_references = tuple(split(i) for i in predicates.union(objects).difference(subjects) if isinstance(i, rdflib.term.URIRef))
foreign_references
```

```text
(IRIRef(iri='http://schema.org', delim='/', urn='ImageObject'),
 IRIRef(iri='http://www.w3.org/2000/01/rdf-schema', delim='#', urn='subClassOf'),
 IRIRef(iri='http://www.w3.org/2000/01/rdf-schema', delim='#', urn='range'),
 IRIRef(iri='http://www.w3.org/2000/01/rdf-schema', delim='#', urn='comment'),
 IRIRef(iri='http://www.w3.org/1999/02/22-rdf-syntax-ns', delim='#', urn='Property'),
 IRIRef(iri='http://www.w3.org/2002/07/owl', delim='#', urn='AnnotationProperty'),
 IRIRef(iri='http://www.w3.org/2000/01/rdf-schema', delim='#', urn='Literal'),
 IRIRef(iri='http://www.w3.org/2002/07/owl', delim='#', urn='disjointWith'),
 IRIRef(iri='http://www.w3.org/2002/07/owl', delim='#', urn='Ontology'),
 IRIRef(iri='http://www.w3.org/1999/02/22-rdf-syntax-ns', delim='#', urn='type'),
 IRIRef(iri='http://www.w3.org/2002/07/owl', delim='#', urn='Class'),
 IRIRef(iri='http://www.w3.org/2002/07/owl', delim='#', urn='ObjectProperty'),
 IRIRef(iri='http://www.w3.org/2002/07/owl', delim='#', urn='equivale
```

In [13]:
```python
!pip install owlrl

import owlrl

# Merge the data graph and the ontology into one new graph
combined_graph = graph + ontology

# Expand the merged graph in place with the triples entailed by the RDFS and OWL RL rules
owlrl.DeductiveClosure(owlrl.CombinedClosure.RDFS_OWLRL_Semantics).expand(combined_graph)
print(f"The original graph contained {len(graph)} triples and the ontology contained {len(ontology)} triples. But after automated deductive reasoning it now contains {len(combined_graph)} triples!")
```

```text
The original graph contained 7 triples and the ontology contained 631 triples. But after automated deductive reasoning it now contains 2362 triples!
```


In [14]:
```python
# Search the combined graph for all triples where Clyde is the Subject

for s, p, o in combined_graph.triples((Clyde, None, None)):  # None is considered a wildcard for iteration purposes
    if (s, p, o) in graph:
        continue  # Skip items in original graph, we only want to see new data that was learned through deductive reasoning
    print(split(p), split(o))
```

```text
IRIRef(iri='http://www.w3.org/1999/02/22-rdf-syntax-ns', delim='#', urn='type') IRIRef(iri='http://www.w3.org/2000/10/swap/pim/contact', delim='#', urn='Person')
IRIRef(iri='http://www.w3.org/1999/02/22-rdf-syntax-ns', delim='#', urn='type') IRIRef(iri='http://schema.org', delim='/', urn='Person')
IRIRef(iri='http://www.w3.org/1999/02/22-rdf-syntax-ns', delim='#', urn='type') IRIRef(iri='http://www.w3.org/2003/01/geo/wgs84_pos', delim='#', urn='SpatialThing')
IRIRef(iri='http://www.w3.org/1999/02/22-rdf-syntax-ns', delim='#', urn='type') IRIRef(iri='http://xmlns.com/foaf/0.1', delim='/', urn='Person')
IRIRef(iri='http://www.w3.org/1999/02/22-rdf-syntax-ns', delim='#', urn='type') IRIRef(iri='http://xmlns.com/foaf/0.1', delim='/', urn='Agent')
IRIRef(iri='http://www.w3.org/1999/02/22-rdf-syntax-ns', delim='#', urn='type') IRIRef(iri='http://www.w3.org/2002/07/owl', delim='#', urn='Thing')
IRIRef(iri='http://www.w3.org/2000/01/rdf-schema', delim='#', urn='label') Clyde
IRIRef(iri='http:/
```
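Clyde was declared a cat, yet the reasoner now also types him as a `foaf:Person` and a `foaf:Agent`. A plausible explanation is that FOAF declares `foaf:Person` as the domain of `foaf:knows`, so anything that "knows" something is inferred to be a person. The sketch below shows just two RDFS entailment rules (rdfs2 for domains and rdfs9 for subclasses) in plain Python. The schema triples are assumptions chosen to be consistent with the output above, and the prefixed names are plain strings, not real IRIs. `owlrl` applies many more rules than this.

```python
RDF_TYPE, DOMAIN, SUBCLASS = "rdf:type", "rdfs:domain", "rdfs:subClassOf"

# Assumed schema triples, consistent with the inferred types printed above
schema = {
    ("foaf:knows", DOMAIN, "foaf:Person"),
    ("foaf:Person", SUBCLASS, "foaf:Agent"),
}
data = {
    ("ex:Clyde", RDF_TYPE, "ex:Cat"),
    ("ex:Clyde", "foaf:knows", "ex:Tory"),
}

def rdfs_closure(triples):
    """Apply the domain and subclass rules until no new triples appear."""
    inferred = set(triples)
    while True:
        new = set()
        for s, p, o in inferred:
            # rdfs2: using a property types the subject with the property's domain
            for p2, rel, cls in inferred:
                if rel == DOMAIN and p2 == p:
                    new.add((s, RDF_TYPE, cls))
            # rdfs9: an instance of a class is also an instance of its superclasses
            if p == RDF_TYPE:
                for sub, rel, sup in inferred:
                    if rel == SUBCLASS and sub == o:
                        new.add((s, RDF_TYPE, sup))
        if new <= inferred:
            return inferred  # fixed point reached
        inferred |= new

closure = rdfs_closure(schema | data)
for triple in sorted(closure - schema - data):
    print(triple)
```

Note that nothing rejects the "cat that knows someone" statement. Domains and ranges in RDFS add facts rather than validate them, which is why modelling choices in a vocabulary can produce surprising types.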

In [15]:
```python
# Search the combined graph for all triples where Clyde is the Object

for s, p, o in combined_graph.triples((None, None, Clyde)):  # None is considered a wildcard for iteration purposes
    if (s, p, o) in graph:
        continue  # Skip items in original graph, we only want to see new data that was learned through deductive reasoning
    print(f"{split(s)}\n\t{split(p)}\n\t{split(o)}")
```

```text
IRIRef(iri='http://example.org/person', delim='/', urn='1')
	IRIRef(iri='http://www.w3.org/2002/07/owl', delim='#', urn='sameAs')
	IRIRef(iri='http://example.org/person', delim='/', urn='1')
```
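The wildcard lookups used in the last two cells can be pictured as simple pattern matching over tuples. The `match` helper below is a hypothetical plain-Python sketch of that idea using a linear scan over short made-up names; it does not show how rdflib implements `Graph.triples`, and real triple stores typically rely on indexes instead.

```python
def match(triples, pattern):
    """Yield triples that agree with the pattern; None matches anything."""
    for triple in triples:
        if all(want is None or want == got for want, got in zip(pattern, triple)):
            yield triple

triples = [
    ("Clyde", "name", "Clyde the cat"),
    ("Clyde", "knows", "Tory"),
    ("Tory", "name", "Tory"),
]

# Everything said about Clyde (Clyde as the subject)
as_subject = list(match(triples, ("Clyde", None, None)))
print(as_subject)

# Everything that points at the value "Tory" (as the object)
as_object = list(match(triples, (None, None, "Tory")))
print(as_object)
```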
