<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1">Introduction</a></span></li><li><span><a href="#1.-UniProt-in-RDF/SPARQL" data-toc-modified-id="1.-UniProt-in-RDF/SPARQL-2">1. UniProt in RDF/SPARQL</a></span><ul class="toc-item"><li><span><a href="#1.1-UniProt-RDF" data-toc-modified-id="1.1-UniProt-RDF-2.1">1.1 UniProt RDF</a></span><ul class="toc-item"><li><span><a href="#Documentation" data-toc-modified-id="Documentation-2.1.1">Documentation</a></span></li><li><span><a href="#Distribution" data-toc-modified-id="Distribution-2.1.2">Distribution</a></span></li></ul></li><li><span><a href="#1.2-UniProt-SPARQL-endpoint" data-toc-modified-id="1.2-UniProt-SPARQL-endpoint-2.2">1.2 UniProt SPARQL endpoint</a></span></li></ul></li><li><span><a href="#2.-Required-Python-libraries" data-toc-modified-id="2.-Required-Python-libraries-3">2. Required Python libraries</a></span><ul class="toc-item"><li><span><a href="#2.1-RDFlib-package" data-toc-modified-id="2.1-RDFlib-package-3.1">2.1 RDFlib package</a></span><ul class="toc-item"><li><span><a href="#Read-a-UniProt-entry-and-save-it-as-a-Graph" data-toc-modified-id="Read-a-UniProt-entry-and-save-it-as-a-Graph-3.1.1">Read a UniProt entry and save it as a Graph</a></span><ul class="toc-item"><li><span><a href="#Print-the-number-of-&quot;triples&quot;-in-the-Graph" data-toc-modified-id="Print-the-number-of-&quot;triples&quot;-in-the-Graph-3.1.1.1">Print the number of "triples" in the Graph</a></span></li><li><span><a href="#Print-out-the-entire-Graph-in-the-RDF-Turtle-format" data-toc-modified-id="Print-out-the-entire-Graph-in-the-RDF-Turtle-format-3.1.1.2">Print out the entire Graph in the RDF Turtle format</a></span></li></ul></li><li><span><a href="#Contains-check" data-toc-modified-id="Contains-check-3.1.2">Contains check</a></span></li><li><span><a href="#Set-Operations-on-RDFLib-Graphs" data-toc-modified-id="Set-Operations-on-RDFLib-Graphs-3.1.3">Set Operations on 
RDFLib Graphs</a></span></li><li><span><a href="#Basic-triple-matching" data-toc-modified-id="Basic-triple-matching-3.1.4">Basic triple matching</a></span></li><li><span><a href="#Querying-with-SPARQL" data-toc-modified-id="Querying-with-SPARQL-3.1.5">Querying with SPARQL</a></span></li></ul></li><li><span><a href="#2.2-SPARQLWrapper-package" data-toc-modified-id="2.2-SPARQLWrapper-package-3.2">2.2 SPARQLWrapper package</a></span></li></ul></li></ul></div>

# Introduction

This series of notebooks shows how to work with the UniProt RDF data model and related resources, and how to run SPARQL queries against them.

**Date**: April 2021  
**Authors**: Swiss-Prot group  


| Notebook                    | Comment         |  
|-----------------------------|-----------------|  
| 00_introduction.ipynb       |*this notebook* |  
| 01_basic_information.ipynb  |accession, mnemonic, date, ...  |  
| 02_protein_name.ipynb       |protein names |  
| 03_replicon_gene.ipynb      |replicon and gene names |  
| 04_taxonomy.ipynb           |organism and taxonomy |  
| 05_sequence.ipynb           |protein sequence |  
| 06_annotation.ipynb         |UniProt annotation types |  
| 07_evidence.ipynb           |UniProt evidence tags |  
| 08_variant.ipynb            |variant annotation |   
| 09_metabolism.ipynb         |metabolism-related data: catalyzed reaction, enzyme classification, cofactor |   


# 1. UniProt in RDF/SPARQL  

## 1.1 UniProt RDF  

### Documentation  
Documentation about the data model is available [here](https://sparql.uniprot.org/.well-known/void). Where possible, we use standard, community-supported vocabularies ([Dublin Core](https://en.wikipedia.org/wiki/Dublin_Core), [SKOS](https://en.wikipedia.org/wiki/Simple_Knowledge_Organization_System), etc.) to extend our own [UniProt core vocabulary](https://www.uniprot.org/core/).

### Distribution
[ftp parent directory](https://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/)  
[README](https://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/README)  


## 1.2 UniProt SPARQL endpoint

The UniProt SPARQL endpoint [sparql.uniprot.org](https://sparql.uniprot.org) is free to use. It is updated in sync with the www.uniprot.org website and the FTP releases.  

**SPARQL** is a W3C-standardized query language for the Semantic Web. If you know SQL, it will look familiar, and you can write similar kinds of queries with it.  
SPARQL also lets you query and combine data from several SPARQL endpoints, which is a low-cost alternative to building your own data warehouse. You can combine UniProt data from [sparql.uniprot.org](https://sparql.uniprot.org) with data from other SPARQL endpoints (Rhea, Bgee, OMA, OrthoDB, neXtProt, etc.).
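
Under the SPARQL 1.1 Protocol, a query can be sent to an endpoint as an HTTP GET request that carries the query text in a `query` parameter. The sketch below only builds such a request with the Python standard library and does not send it. The helper name `build_sparql_request` is an illustration and not part of any UniProt tooling, and the endpoint path is assumed to be the same one used with SPARQLWrapper in section 2.2.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: same endpoint URL as the SPARQLWrapper example in section 2.2
ENDPOINT = "https://sparql.uniprot.org/sparql"

QUERY = """PREFIX up: <http://purl.uniprot.org/core/>
SELECT ?protein
WHERE {
  ?protein a up:Protein .
}
LIMIT 3"""


def build_sparql_request(endpoint, query):
    """Build (but do not send) an HTTP GET request carrying a SPARQL query."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for results in the standard SPARQL JSON results format
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(ENDPOINT, QUERY)
print(request.get_method())
print(request.full_url)
```

Passing `request` to `urllib.request.urlopen` would actually run the query; in practice a library such as SPARQLWrapper handles this step for you.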


# 2. Required Python libraries

In this series of notebooks, we will use two Python libraries:  

* **[RDFLib](https://rdflib.readthedocs.io/en/stable/index.html)** is a pure Python package for working with RDF. It provides parsers and serializers for common RDF formats, a Graph interface, and a SPARQL implementation.  
See [Getting started with RDFLib](https://rdflib.readthedocs.io/en/stable/gettingstarted.html)  

* **[SPARQLWrapper](https://pypi.org/project/SPARQLWrapper/)** is a simple Python wrapper around a SPARQL service for executing queries remotely. It helps build the query request and, optionally, converts the result into a more manageable format.  
See [SPARQLWrapper documentation](https://sparqlwrapper.readthedocs.io/_/downloads/en/latest/pdf/)  


## 2.1 RDFlib package

### Read a UniProt entry and save it as a Graph

You can access each UniProtKB entry in RDF/XML (.rdf) or Turtle (.ttl) format.  
Example: P0A877 entry  
[P0A877.rdf](https://www.uniprot.org/uniprot/P0A877.rdf)  
[P0A877.ttl](https://www.uniprot.org/uniprot/P0A877.ttl)  
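
The two links follow the same pattern: the accession followed by a format-specific extension. As a small convenience sketch, the hypothetical helper `entry_url` below builds such a URL from an accession and an RDFLib parser format name (`"xml"` or `"turtle"`). The helper and constant names are assumptions made for illustration; only the URL pattern comes from the links above.

```python
UNIPROT_ENTRY_BASE = "https://www.uniprot.org/uniprot"

# RDFLib parser format name -> file extension used in the links above
EXTENSION_BY_FORMAT = {"xml": "rdf", "turtle": "ttl"}


def entry_url(accession, rdf_format="turtle"):
    """Return the URL of one UniProtKB entry in the given RDF format."""
    extension = EXTENSION_BY_FORMAT[rdf_format]
    return "{}/{}.{}".format(UNIPROT_ENTRY_BASE, accession, extension)


print(entry_url("P0A877", "xml"))
print(entry_url("P0A877"))
```

The same format name can then be passed explicitly to `Graph.parse(url, format=...)` instead of relying on format detection.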

In [43]:
from rdflib import Graph
g = Graph()
# read P0A877 in RDF/XML format
g.parse("https://www.uniprot.org/uniprot/P0A877.rdf")

<Graph identifier=N10fe769ccde04b1bbccebed494665a5d (<class 'rdflib.graph.Graph'>)>

#### Print the number of "triples" in the Graph

In [42]:
print("Graph g has {} statements.".format(len(g)))

Graph g has 1282 statements.


#### Print out the entire Graph in the RDF Turtle format

In [27]:
# serialize() returns a str in RDFLib 6 and later; RDFLib 5 returned bytes, which required .decode("utf-8")
print(g.serialize(format="turtle"))

@prefix : <http://purl.uniprot.org/core/> .
@prefix ECO: <http://purl.obolibrary.org/obo/ECO_> .
@prefix annotation: <http://purl.uniprot.org/annotation/> .
@prefix citation: <http://purl.uniprot.org/citations/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix enzyme: <http://purl.uniprot.org/enzyme/> .
@prefix faldo: <http://biohackathon.org/resource/faldo#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix go: <http://purl.obolibrary.org/obo/GO_> .
@prefix isoform: <http://purl.uniprot.org/isoforms/> .
@prefix keyword: <http://purl.uniprot.org/keywords/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix position: <http://purl.uniprot.org/position/> .
@prefix pubmed: <http://purl.uniprot.org/pubmed/> .
@prefix range: <http://purl.uniprot.org/range/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix taxon: <http://purl.uniprot.org/taxo

### Contains check

In [29]:
from rdflib import URIRef
from rdflib.namespace import RDF

P0A877 = URIRef("http://purl.uniprot.org/uniprot/P0A877")
Protein = URIRef("http://purl.uniprot.org/core/Protein")
if (P0A877, RDF.type, Protein) in g:
    print("This graph knows that P0A877 is a Protein!")


This graph knows that P0A877 is a Protein!


### Set Operations on RDFLib Graphs

Addition, subtraction and other set-operations on Graphs:

| operation     | effect                                            |  
|---------------|---------------------------------------------------|   
| G1 + G2       | return new graph with union                       |  
| G1 += G2      | in place union / addition                         |
| G1 - G2       | return new graph with difference                  |
| G1 -= G2      | in place difference / subtraction                 |
| G1 & G2       | intersection (triples in both graphs)             |
| G1 ^ G2       | xor (triples in either G1 or G2, but not in both) |



In [46]:
G = Graph()
print('Parse P0A877 UniProt entry (turtle format)')
G.parse("https://www.uniprot.org/uniprot/P0A877.ttl")
print("-> Graph G has {} statements.".format(len(G)))
print()
print('Parse P0A879 UniProt entry and add it to graph G')
# Parse the second entry into its own graph, then merge it into G in place
G += Graph().parse("https://www.uniprot.org/uniprot/P0A879.ttl")

print("-> Graph G has {} statements.".format(len(G)))

Parse P0A877 UniProt entry (turtle format)
-> Graph G has 1282 statements.

Parse P0A879 UniProt entry and add it to graph G
-> Graph G has 2758 statements.
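
The operators in the table follow ordinary set semantics over triples. As a library-free illustration, the sketch below models two tiny graphs as Python sets of (subject, predicate, object) tuples. The triples are made-up placeholders built around the two accessions used above (`ex:note` is not a real UniProt predicate), and this is an analogy, not a description of how RDFLib stores graphs.

```python
# Toy triples as (subject, predicate, object) tuples; "ex:note" is a made-up predicate
ENTRY_A = "uniprot:P0A877"
ENTRY_B = "uniprot:P0A879"

g1 = {
    (ENTRY_A, "rdf:type", "up:Protein"),
    (ENTRY_A, "ex:note", "only in g1"),
}
g2 = {
    (ENTRY_A, "rdf:type", "up:Protein"),
    (ENTRY_B, "rdf:type", "up:Protein"),
}

union = g1 | g2         # RDFLib spells this G1 + G2
difference = g1 - g2    # triples in g1 that are not in g2
intersection = g1 & g2  # triples present in both
xor = g1 ^ g2           # triples present in exactly one of the two

print(len(union), len(difference), len(intersection), len(xor))
```

Because a triple shared by both graphs is counted once in the union, the size of a merged graph can be smaller than the sum of the sizes of its parts.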


### Basic triple matching

In [47]:
Protein = URIRef("http://purl.uniprot.org/core/Protein")
for s, p, o in G.triples((None, RDF.type, Protein)):
    print("{} is a Protein".format(s))

http://purl.uniprot.org/uniprot/P0A877 is a Protein
http://purl.uniprot.org/uniprot/P0A879 is a Protein
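
In `G.triples((None, RDF.type, Protein))`, each `None` acts as a wildcard and the other terms must match exactly. The toy function below sketches that idea over a plain list of tuples. The `match` helper and the sample triples are hypothetical and only illustrate the matching rule; RDFLib's own implementation uses indexed stores and is not shown here.

```python
def match(triples, pattern):
    """Yield every triple that agrees with the pattern; None matches anything."""
    for triple in triples:
        if all(want is None or want == have for want, have in zip(pattern, triple)):
            yield triple


# Made-up sample data; "ex:note" is a placeholder predicate
toy_triples = [
    ("uniprot:P0A877", "rdf:type", "up:Protein"),
    ("uniprot:P0A877", "ex:note", "placeholder value"),
    ("uniprot:P0A879", "rdf:type", "up:Protein"),
]

# Same shape as the query above: any subject, fixed predicate and object
proteins = [s for s, p, o in match(toy_triples, (None, "rdf:type", "up:Protein"))]
print(proteins)
```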


### Querying with SPARQL

In [49]:
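qres = G.query(
    """PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
       PREFIX up: <http://purl.uniprot.org/core/>
       SELECT ?protein
       WHERE {
          ?protein rdf:type up:Protein .
       }""")
# Each result row is a tuple with one value per selected variable
for row in qres:
    print(row)
print()
for row in qres:
    print("%s rdf:type up:Protein" % row.protein)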

(rdflib.term.URIRef('http://purl.uniprot.org/uniprot/P0A877'),)
(rdflib.term.URIRef('http://purl.uniprot.org/uniprot/P0A879'),)

http://purl.uniprot.org/uniprot/P0A877 rdf:type up:Protein
http://purl.uniprot.org/uniprot/P0A879 rdf:type up:Protein


In [51]:
qres = G.query(
    """PREFIX up: <http://purl.uniprot.org/core/> 
       SELECT ?protein
       WHERE {
          ?protein a up:Protein .
       }""")

for row in qres:
    print("%s a up:Protein" % row)

http://purl.uniprot.org/uniprot/P0A877 a up:Protein
http://purl.uniprot.org/uniprot/P0A879 a up:Protein


## 2.2 SPARQLWrapper package

In [54]:
from SPARQLWrapper import SPARQLWrapper, JSON

# Set the SPARQL endpoint (UniProt)
sparql = SPARQLWrapper("https://sparql.uniprot.org/sparql")

# Define the query
sparql.setQuery("""
PREFIX up: <http://purl.uniprot.org/core/> 
SELECT ?protein
WHERE {
  ?protein a up:Protein .
}
LIMIT 3
""")

# Set the output format as JSON
sparql.setReturnFormat(JSON)

# Run the SPARQL query and convert to the defined format
results = sparql.query().convert()

# Print the query result
for result in results["results"]["bindings"]:
    print(result["protein"]["value"])

http://purl.uniprot.org/uniprot/A0A024R563
http://purl.uniprot.org/uniprot/A0A024R564
http://purl.uniprot.org/uniprot/A0A024R565
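
With `JSON` as the return format, `convert()` yields a nested dictionary that follows the W3C SPARQL 1.1 Query Results JSON Format: variable names under `head.vars` and one binding object per row under `results.bindings`. The sketch below flattens that structure into one plain dictionary per row. The helper name `bindings_to_rows` is hypothetical, and the sample document is hand-written to mirror the three URIs printed above rather than captured from the endpoint.

```python
import json

# Hand-written sample in the SPARQL JSON results layout (not a captured response)
sample_json = """
{
  "head": {"vars": ["protein"]},
  "results": {"bindings": [
    {"protein": {"type": "uri", "value": "http://purl.uniprot.org/uniprot/A0A024R563"}},
    {"protein": {"type": "uri", "value": "http://purl.uniprot.org/uniprot/A0A024R564"}},
    {"protein": {"type": "uri", "value": "http://purl.uniprot.org/uniprot/A0A024R565"}}
  ]}
}
"""


def bindings_to_rows(sparql_json):
    """Flatten SPARQL JSON results into one {variable: value} dict per row."""
    variables = sparql_json["head"]["vars"]
    rows = []
    for binding in sparql_json["results"]["bindings"]:
        # A variable left unbound in a row is absent from its binding object
        rows.append({
            var: binding[var]["value"] if var in binding else None
            for var in variables
        })
    return rows


rows = bindings_to_rows(json.loads(sample_json))
for row in rows:
    print(row["protein"])
```

A list of flat dictionaries like `rows` is convenient to pass on to tabular tools, for example as input to a data frame constructor.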
