# Required Python library

First we import rdflib, a well-known Python library that adds support for RDF and its query language, SPARQL, to Python 2 and 3.

In [1]:
import sys
from rdflib import *
import pandas as pd
pd.options.display.width=120
pd.options.display.max_colwidth=100


from SPARQLWrapper import SPARQLWrapper, JSON

# Minimal UniProt entry in RDF

We have the absolute minimal information here.
We say that `Q96P20` is an `up:Protein` .
We also say that it is `up:reviewed` .
    
The RDF has five lines.
The first three declare abbreviations.

The last two are our data.

The first line, starting with `base` (also written `@base`), says that anything found between `<>` that is not an absolute IRI gets the base prepended. This base is the same for all entries in UniProtKB.
The second line says that anything starting with `up:` is in the UniProt core schema ontology, i.e. our custom terminology.
The third line declares the standard `rdf:` namespace.
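
To make the abbreviation rules concrete, here is a minimal sketch in plain Python of how the terms in the two data lines expand to full IRIs. The `expand` helper is hypothetical and only illustrates the idea; it is not how rdflib implements Turtle parsing.

```python
from urllib.parse import urljoin

# The base and the prefixes declared in the first three lines of the Turtle.
BASE = "http://purl.uniprot.org/uniprot/"
PREFIXES = {
    "up": "http://purl.uniprot.org/core/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}


def expand(term: str) -> str:
    """Expand a Turtle term to a full IRI (illustrative helper, not rdflib API)."""
    if term.startswith("<") and term.endswith(">"):
        # Relative IRIs are resolved against the base; absolute ones stay as-is.
        return urljoin(BASE, term[1:-1])
    prefix, _, local_name = term.partition(":")
    return PREFIXES[prefix] + local_name


for term in ("<Q96P20>", "rdf:type", "up:Protein", "up:reviewed"):
    print(term, "->", expand(term))
```

The expanded form of `<Q96P20>` is the same IRI that rdflib prints when it serializes the graph.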


In [2]:
Q96P20=Graph().parse(format='ttl',
                     data='base <http://purl.uniprot.org/uniprot/>  \
                     prefix up: <http://purl.uniprot.org/core/>  \
                     prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>  \
                     <Q96P20> rdf:type up:Protein ;  \
                       up:reviewed true .')

for subj, pred, obj in Q96P20:
   if (subj, pred, obj) not in Q96P20:
       raise Exception("It better be!")

s = Q96P20.serialize(format='ttl')
print(s)


@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix up: <http://purl.uniprot.org/core/> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://purl.uniprot.org/uniprot/Q96P20> a up:Protein ;
    up:reviewed true .




# Reviewed or unreviewed: that is the question

Let's ask whether our entry is a Swiss-Prot entry or a TrEMBL entry.

In [3]:
qres=Q96P20.query('prefix up: <http://purl.uniprot.org/core/> \
SELECT \
    ?protein \
    ?isEntryAnSwissProtEntry \
WHERE {\
  ?protein a up:Protein . \
  ?protein up:reviewed ?isEntryAnSwissProtEntry . \
}')

for row in qres:
    print("%s is an Swiss-Prot entry %s" % row)


http://purl.uniprot.org/uniprot/Q96P20 is an Swiss-Prot entry true


So what did we see here? The query again starts with a prefix, introducing the abbreviation for the UniProt schema. Then comes a `SELECT` clause.
We ask for two things here: `?protein` and `?isEntryAnSwissProtEntry`.
This is followed by a `WHERE` clause.
The `WHERE` clause contains a basic graph pattern (BGP) made of two triple patterns.
Triple patterns are basically triples in which the subject, predicate or object can be replaced by a variable.
The query engine fills in the values from the data in the database and returns them.
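
How a query engine "fills in the values" can be sketched in a few lines of plain Python. This is a toy matcher over the two triples of our minimal entry, not how rdflib or the UniProt endpoint evaluates queries. The `uniprot:` shorthand and both function names are assumptions made for the illustration.

```python
# Toy data: the two triples of the minimal entry, written with prefixed names.
TRIPLES = [
    ("uniprot:Q96P20", "rdf:type", "up:Protein"),
    ("uniprot:Q96P20", "up:reviewed", "true"),
]


def match_pattern(pattern, triple, bindings):
    """Match one triple pattern; return the extended bindings, or None on failure."""
    result = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable binds on first use and must agree with earlier bindings.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return result


def solve(patterns, triples):
    """Find all variable bindings that satisfy every pattern."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for bindings in solutions:
            for triple in triples:
                extended = match_pattern(pattern, triple, bindings)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions


rows = solve(
    [
        ("?protein", "rdf:type", "up:Protein"),
        ("?protein", "up:reviewed", "?isEntryAnSwissProtEntry"),
    ],
    TRIPLES,
)
print(rows)
```

Because `?protein` appears in both patterns, the second pattern only matches triples about the subject already bound by the first one.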

In [4]:
qres=Q96P20.query('prefix up: <http://purl.uniprot.org/core/> \
SELECT \
    ?protein \
    ?isEntryAnSwissProtEntry \
WHERE {\
  ?protein a up:Protein ; \
           up:reviewed ?isEntryAnSwissProtEntry \
}')

for row in qres:
    print("%s is an Swiss-Prot entry %s" % row)

http://purl.uniprot.org/uniprot/Q96P20 is an Swiss-Prot entry true


This is the same query; all we did was use the `;` abbreviation so that we did not repeat ourselves.
Not repeating yourself is a good way to keep silly typos from ruining your query!

# Names, names so many names

In [5]:
# There are lots of different kinds of names for proteins.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql=SPARQLWrapper('https://sparql.uniprot.org/sparql')
sparql.setQuery("""
  PREFIX up: <http://purl.uniprot.org/core/>
  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
  SELECT 
    ?kindOfName 
  WHERE {
    GRAPH<http://purl.uniprot.org/core/>{ 
        ?kindOfName rdfs:subPropertyOf up:structuredNameType
    }
  }""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

for result in results["results"]["bindings"]:
    print(result["kindOfName"]["value"])

http://purl.uniprot.org/core/allergenName
http://purl.uniprot.org/core/biotechName
http://purl.uniprot.org/core/cdAntigenName
http://purl.uniprot.org/core/ecName
http://purl.uniprot.org/core/fullName
http://purl.uniprot.org/core/internationalNonproprietaryName
http://purl.uniprot.org/core/shortName


There is a lot going on in this example. First, we use the full UniProt SPARQL endpoint: a real, live endpoint used by lots of people.
When using the real UniProt SPARQL endpoint, it is considerate towards your fellow users to write targeted queries.
The more targeted your query, the easier it is to answer and the less likely it is to run into our generous time limits or your HTTP connection timeouts.
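
One simple way to keep a request targeted is to cap the number of rows and to set a client-side timeout. The sketch below uses only the standard library to build, but not send, a SPARQL protocol GET request. The `build_request` helper and the choice to append `LIMIT` are assumptions for illustration, and the helper expects a query that has no `LIMIT` of its own.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "https://sparql.uniprot.org/sparql"

QUERY = """
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?kindOfName
WHERE {
  GRAPH <http://purl.uniprot.org/core/> {
    ?kindOfName rdfs:subPropertyOf up:structuredNameType
  }
}"""


def build_request(query: str, limit: int = 10) -> Request:
    """Build a GET request for the endpoint with a row cap (hypothetical helper)."""
    targeted = f"{query.rstrip()}\nLIMIT {limit}"
    url = f"{ENDPOINT}?{urlencode({'query': targeted})}"
    # Ask for the standard SPARQL JSON results format.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_request(QUERY, limit=5)
print(request.full_url)
# To actually send it, with a client-side timeout in seconds:
#   import urllib.request
#   with urllib.request.urlopen(request, timeout=30) as response:
#       body = response.read()
```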

In the query we use two prefixes, `up:` and `rdfs:`. The `up:` schema is of course ours, but you will find RDFS in almost every other dataset too.
It is a very generic little schema for relating concepts. Here we use `rdfs:subPropertyOf`, which basically says that one predicate is a more specific version of another.

There are biological (community) reasons for all these names.
However they are grouped in bunches that belong together.
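
The seven name properties returned by the endpoint can be combined into a single SPARQL property path, with `|` meaning "any of these". The sketch below shows one way to build that path string from the IRIs. The `property_path` helper is hypothetical, and it assumes every IRI lives in the `up:` namespace.

```python
UP = "http://purl.uniprot.org/core/"

# The sub-properties of up:structuredNameType, as printed by the previous cell.
NAME_PROPERTIES = [
    "http://purl.uniprot.org/core/allergenName",
    "http://purl.uniprot.org/core/biotechName",
    "http://purl.uniprot.org/core/cdAntigenName",
    "http://purl.uniprot.org/core/ecName",
    "http://purl.uniprot.org/core/fullName",
    "http://purl.uniprot.org/core/internationalNonproprietaryName",
    "http://purl.uniprot.org/core/shortName",
]


def property_path(iris, prefix="up", namespace=UP):
    """Join IRIs from one namespace into a '|' alternative path (illustrative)."""
    local_names = [iri[len(namespace):] for iri in iris if iri.startswith(namespace)]
    return "|".join(f"{prefix}:{name}" for name in local_names)


path = property_path(NAME_PROPERTIES)
print(f"?groupOfNames ({path}) ?name")
```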

In [6]:
turtle="""
base <http://purl.uniprot.org/uniprot/>  
prefix up: <http://purl.uniprot.org/core/>  
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>  
<Q96P20> rdf:type up:Protein ;  
  up:reviewed true ;
  up:recommendedName <Q96P20#SIPE0A17F4351205436> ;
  up:alternativeName <Q96P20#SIPA190CC955EA6E7E7> ,
    <Q96P20#SIPFE6783A277ACFCCC> ,
    <Q96P20#SIP5C26CFD46CD2AE93> ,
    <Q96P20#SIP67B63DF8BD1345C7> ,
    <Q96P20#SIP779F44FABBD9C4E2> .
<Q96P20#SIPE0A17F4351205436> rdf:type up:Structured_Name ;
  up:fullName "NACHT, LRR and PYD domains-containing protein 3" .
<Q96P20#SIPA190CC955EA6E7E7> rdf:type up:Structured_Name ;
  up:fullName "Angiotensin/vasopressin receptor AII/AVP-like" .
<Q96P20#SIPFE6783A277ACFCCC> rdf:type up:Structured_Name ;
  up:fullName "Caterpiller protein 1.1" ;
  up:shortName "CLR1.1" .
<Q96P20#SIP5C26CFD46CD2AE93> rdf:type up:Structured_Name ;
  up:fullName "Cold-induced autoinflammatory syndrome 1 protein" .
<Q96P20#SIP67B63DF8BD1345C7> rdf:type up:Structured_Name ;
  up:fullName "Cryopyrin" .
<Q96P20#SIP779F44FABBD9C4E2> rdf:type up:Structured_Name ;
  up:fullName "PYRIN-containing APAF1-like protein 1" ."""

Q96P20=Graph().parse(format='ttl',
                     data=turtle)
qres=Q96P20.query("""
PREFIX up: <http://purl.uniprot.org/core/> 
SELECT 
    ?groupOfNames 
    ?name 
WHERE {
  ?protein a up:Protein ; 
           up:recommendedName|up:alternativeName|up:submittedName ?groupOfNames .
  ?groupOfNames (up:allergenName|up:biotechName|up:cdAntigenName|up:ecName|up:fullName|up:internationalNonproprietaryName|up:shortName) ?name
}""")

for row in qres:
    print("%s is an structured name %s is one of it's names" % row)

http://purl.uniprot.org/uniprot/Q96P20#SIPE0A17F4351205436 is an structured name NACHT, LRR and PYD domains-containing protein 3 is one of it's names
http://purl.uniprot.org/uniprot/Q96P20#SIPA190CC955EA6E7E7 is an structured name Angiotensin/vasopressin receptor AII/AVP-like is one of it's names
http://purl.uniprot.org/uniprot/Q96P20#SIP5C26CFD46CD2AE93 is an structured name Cold-induced autoinflammatory syndrome 1 protein is one of it's names
http://purl.uniprot.org/uniprot/Q96P20#SIP779F44FABBD9C4E2 is an structured name PYRIN-containing APAF1-like protein 1 is one of it's names
http://purl.uniprot.org/uniprot/Q96P20#SIPFE6783A277ACFCCC is an structured name Caterpiller protein 1.1 is one of it's names
http://purl.uniprot.org/uniprot/Q96P20#SIPFE6783A277ACFCCC is an structured name CLR1.1 is one of it's names
http://purl.uniprot.org/uniprot/Q96P20#SIP67B63DF8BD1345C7 is an structured name Cryopyrin is one of it's names


In [7]:
turtle="""
base <http://purl.uniprot.org/uniprot/>  
prefix up: <http://purl.uniprot.org/core/>  
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>  
<Q96P20> rdf:type up:Protein ;  
  up:reviewed true ;
  up:recommendedName [ rdf:type up:Structured_Name ;
                       up:fullName "NACHT, LRR and PYD domains-containing protein 3"  ] ;
  up:alternativeName [ rdf:type up:Structured_Name ;
                       up:fullName "Angiotensin/vasopressin receptor AII/AVP-like" ] ,
                     [ rdf:type up:Structured_Name ;
                       up:fullName "Caterpiller protein 1.1" ;
                       up:shortName "CLR1.1"  ] ,
                     [ rdf:type up:Structured_Name ;
                       up:fullName "Cold-induced autoinflammatory syndrome 1 protein"  ] ,
                     [ rdf:type up:Structured_Name ;
                       up:fullName "Cryopyrin"  ] ,
                     [ rdf:type up:Structured_Name ;
                       up:fullName "PYRIN-containing APAF1-like protein 1"  ] .
"""

Q96P20=Graph().parse(format='ttl',
                     data=turtle)
qres=Q96P20.query("""
PREFIX up: <http://purl.uniprot.org/core/> 
SELECT 
    ?groupOfNames 
    ?name 
WHERE {
  ?protein a up:Protein ; 
           up:recommendedName|up:alternativeName|up:submittedName ?groupOfNames .
  ?groupOfNames (up:allergenName|up:biotechName|up:cdAntigenName|up:ecName|up:fullName|up:internationalNonproprietaryName|up:shortName) ?name
}""")

for row in qres:
    print("%s is an structured name %s is one of it's names" % row)

ub3bL6C22 is an structured name NACHT, LRR and PYD domains-containing protein 3 is one of it's names
ub3bL15C22 is an structured name Cryopyrin is one of it's names
ub3bL13C22 is an structured name Cold-induced autoinflammatory syndrome 1 protein is one of it's names
ub3bL8C22 is an structured name Angiotensin/vasopressin receptor AII/AVP-like is one of it's names
ub3bL10C22 is an structured name Caterpiller protein 1.1 is one of it's names
ub3bL10C22 is an structured name CLR1.1 is one of it's names
ub3bL17C22 is an structured name PYRIN-containing APAF1-like protein 1 is one of it's names


This example contains the 'same' data as before, except that our internal identifiers have been replaced by blank nodes. A blank node is the RDF equivalent of saying 'the second line from the top of this document': it points at something without giving it a stable name. We used to have a lot of them, but we are actively replacing them with stable or content-based identifiers.

# Values example

In [8]:
qres=Q96P20.query("""
PREFIX up: <http://purl.uniprot.org/core/> 
SELECT 
    ?groupOfNames 
    ?name 
WHERE {
  VALUES (?structuredNameTypes) {(up:recommendedName) (up:alternativeName) (up:submittedName) }
  ?protein a up:Protein ; 
           ?structuredNameTypes ?groupOfNames .
  ?groupOfNames (up:allergenName|up:biotechName|up:cdAntigenName|up:ecName|up:fullName|up:internationalNonproprietaryName|up:shortName) ?name
}""")

for row in qres:
    print("%s is an structured name %s is one of it's names" % row)

ub3bL6C22 is an structured name NACHT, LRR and PYD domains-containing protein 3 is one of it's names
ub3bL15C22 is an structured name Cryopyrin is one of it's names
ub3bL13C22 is an structured name Cold-induced autoinflammatory syndrome 1 protein is one of it's names
ub3bL8C22 is an structured name Angiotensin/vasopressin receptor AII/AVP-like is one of it's names
ub3bL10C22 is an structured name Caterpiller protein 1.1 is one of it's names
ub3bL10C22 is an structured name CLR1.1 is one of it's names
ub3bL17C22 is an structured name PYRIN-containing APAF1-like protein 1 is one of it's names


Again an equivalent query; we are just introducing the `VALUES` clause, which can be used to embed lists into a query.
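
When the list lives in Python, the `VALUES` clause can be generated rather than typed by hand. A minimal sketch follows. The `values_clause` helper is hypothetical, and it does no escaping, so it should only be fed trusted prefixed names such as the ones used in this notebook.

```python
def values_clause(variable, terms):
    """Format a single-variable VALUES clause (illustrative helper, no escaping)."""
    rows = " ".join(f"({term})" for term in terms)
    return f"VALUES (?{variable}) {{ {rows} }}"


clause = values_clause(
    "structuredNameTypes",
    ["up:recommendedName", "up:alternativeName", "up:submittedName"],
)
print(clause)
```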

Another way is to write this query using the [UNION clause](https://www.w3.org/TR/sparql11-query/#alternatives)

# UNION example

In [9]:
qres=Q96P20.query("""
PREFIX up: <http://purl.uniprot.org/core/> 
SELECT 
    ?groupOfNames 
    ?name 
WHERE {
  VALUES (?structuredNameTypes) {(up:recommendedName) (up:alternativeName) (up:submittedName) }
  ?protein a up:Protein ; 
           ?structuredNameTypes ?groupOfNames .
  { ?groupOfNames up:allergenName ?name }
  UNION
  { ?groupOfNames up:biotechName ?name } 
  UNION
  { ?groupOfNames up:cdAntigenName ?name }
  UNION
  { ?groupOfNames up:ecName ?name }
  UNION
  { ?groupOfNames up:fullName ?name }
  UNION
  { ?groupOfNames up:internationalNonproprietaryName ?name }
  UNION
  { ?groupOfNames up:shortName ?name }
}""")


for row in qres:
    print("%s is an structured name %s is one of it's names" % row)

# Genes

UniProt, being a protein database, has some information about genes, but it is not always well integrated with genome resources. This is something we have started to improve.

In [10]:
turtle="""
base <http://purl.uniprot.org/uniprot/>  
prefix up: <http://purl.uniprot.org/core/>  
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>  
prefix skos: <http://www.w3.org/2004/02/skos/core#>
<Q96P20> rdf:type up:Protein ;  
  up:reviewed true ;
  up:encodedBy <Q96P20#gene-MD57F23A94CA630E0FBCD9731572CF3E82A> .
<Q96P20#gene-MD57F23A94CA630E0FBCD9731572CF3E82A> rdf:type up:Gene ;
  skos:prefLabel "NLRP3" ;
  skos:altLabel "C1orf7" ,
    "CIAS1" ,
    "NALP3" ,
    "PYPAF1" .
"""


Q96P20=Graph().parse(format='ttl',
                     data=turtle)

qres=Q96P20.query("""
PREFIX up: <http://purl.uniprot.org/core/> 
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT 
    ?protein 
    ?gene 
    (COUNT(?geneName) AS ?countOfNames)
WHERE {
    ?protein up:encodedBy ?gene .
    ?gene skos:prefLabel|skos:altLabel ?geneName
} 
GROUP BY ?protein ?gene 
# keep only genes that have more than one name
HAVING (COUNT(?geneName) > 1)
ORDER BY DESC(?countOfNames)
""")
    
for row in qres:
    print("%s is a protein with a gene %s that has %s names" % row)

http://purl.uniprot.org/uniprot/Q96P20 is a protein with a gene http://purl.uniprot.org/uniprot/Q96P20#gene-MD57F23A94CA630E0FBCD9731572CF3E82A that has 5 names


# Annotation: what do we know about a Protein

In [11]:
turtle="""
base <http://purl.uniprot.org/uniprot/>  
prefix up: <http://purl.uniprot.org/core/>  
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>  
prefix skos: <http://www.w3.org/2004/02/skos/core#>
prefix isoform: <http://purl.uniprot.org/isoforms/> 
prefix annotation: <http://purl.uniprot.org/annotation/>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
<Q96P20> rdf:type up:Protein ;  
  up:reviewed true ;
  up:annotation annotation:PRO_0000213569 ,
                annotation:VSP_013037 .

isoform:Q9NT62-2 rdf:type up:Modified_Sequence ;
  up:name "2" ;
  up:basedOn isoform:Q9NT62-1 ;
  rdf:value "MQNVINTVKGKALEVAEYLTPVLKESKFKETGVITPEEFVAAGDHLVHHCPTWQWATGEELKVKAYLPTGKQFLVTKNVPCYKRCKQMEYSDELEAIIEEDDGDGGWVDTYHNTGITGITEAVKEITLENKDNIRLQDCSALCEEEEDEDEGEAADMEEYEESGLLETDEATLDTRKIVEACKAKTDAGGEDAILQTRTYDLYITYDKYYQTPRLWLFGYDEQRQPLTVEHMYEDISQDHVKKTVTIENHPHLPPPPMCSVHPCRHAEVMKKIIETVAEGGGELGVHMYPSLYVRLVAKWLLTIFFLRNLV" ;
  up:modification annotation:VSP_013037 .
  
annotation:PRO_0000213569 rdf:type up:Chain_Annotation ;
  rdfs:comment "Ubiquitin-like-conjugating enzyme ATG3" .
  
annotation:VSP_013037 rdf:type up:Alternative_Sequence_Annotation ;
  rdfs:comment "In isoform 2." ;
  up:substitution "PSLYVRLVAKWLLTIFFLRNLV" .
"""


Q96P20=Graph().parse(format='ttl',
                     data=turtle)



Here we start to introduce annotations, the most interesting data in UniProtKB.

A few annotation types have stable identifiers in all our formats: (Peptide) Chain and Alternative Sequence annotations do.

All our annotations have:
- `rdf:type` -> which kind of annotation it is.
- `rdfs:comment` -> the text entered by the curator in Swiss-Prot, or what was submitted or automatically added in TrEMBL.

Many have:
- `up:sequence` -> which of the isoforms/products this specific annotation applies to.

Here we also see:
- `up:substitution` -> the peptide that is inserted at a range to make the sequence. Used to describe variations in the protein sequence.

Roughly speaking, there are two main categories of annotations: features and comments, or in RDF speak `up:Sequence_Annotation` and plain `up:Annotation`.

`up:Sequence_Annotation` uses [FALDO](http://biohackathon.org/resource/faldo) to describe where a feature is. 

In [12]:
turtle="""
base <http://purl.uniprot.org/uniprot/>  
prefix up: <http://purl.uniprot.org/core/>  
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>  
prefix skos: <http://www.w3.org/2004/02/skos/core#>
prefix isoform: <http://purl.uniprot.org/isoforms/> 
prefix annotation: <http://purl.uniprot.org/annotation/>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix faldo: <http://biohackathon.org/resource/faldo#>
prefix position: <http://purl.uniprot.org/position/> 
prefix range: <http://purl.uniprot.org/range/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>

<Q96P20> rdf:type up:Protein ;  
  up:reviewed true ;
  up:annotation annotation:VSP_013037 ;
  up:sequence isoform:Q9NT62-1, isoform:Q9NT62-2 .
  
annotation:VSP_013037 rdf:type up:Alternative_Sequence_Annotation ;
  rdfs:comment "In isoform 2." ;
  up:substitution "PSLYVRLVAKWLLTIFFLRNLV" ;
  up:range range:22862481696633390tt290tt314 .
  
range:22862481696633390tt290tt314 rdf:type faldo:Region ;
  faldo:begin position:22862481696633390tt290 ;
  faldo:end position:22862481696633390tt314 .

position:22862481696633390tt290 rdf:type faldo:Position ,
    faldo:ExactPosition ;
  faldo:position 290 ;
  faldo:reference isoform:Q9NT62-1 .
  
position:22862481696633390tt314 rdf:type faldo:Position ,
    faldo:ExactPosition ;
  faldo:position 314 ;
  faldo:reference isoform:Q9NT62-1 .


isoform:Q9NT62-1 rdf:type up:Simple_Sequence ;
  up:modified "2000-10-01"^^xsd:date ;
  up:version 1 ;
  up:mass 35864 ;
  up:crc64Checksum "40EFE88DB5FE3EAB"^^xsd:token ;
  up:name "1" ;
  rdf:value "MQNVINTVKGKALEVAEYLTPVLKESKFKETGVITPEEFVAAGDHLVHHCPTWQWATGEELKVKAYLPTGKQFLVTKNVPCYKRCKQMEYSDELEAIIEEDDGDGGWVDTYHNTGITGITEAVKEITLENKDNIRLQDCSALCEEEEDEDEGEAADMEEYEESGLLETDEATLDTRKIVEACKAKTDAGGEDAILQTRTYDLYITYDKYYQTPRLWLFGYDEQRQPLTVEHMYEDISQDHVKKTVTIENHPHLPPPPMCSVHPCRHAEVMKKIIETVAEGGGELGVHMYLLIFLKFVQAVIPTIEYDYTRHFTM" .
  
isoform:Q9NT62-2 rdf:type up:Modified_Sequence ;
  up:name "2" ;
  up:basedOn isoform:Q9NT62-1 ;
  rdf:value "MQNVINTVKGKALEVAEYLTPVLKESKFKETGVITPEEFVAAGDHLVHHCPTWQWATGEELKVKAYLPTGKQFLVTKNVPCYKRCKQMEYSDELEAIIEEDDGDGGWVDTYHNTGITGITEAVKEITLENKDNIRLQDCSALCEEEEDEDEGEAADMEEYEESGLLETDEATLDTRKIVEACKAKTDAGGEDAILQTRTYDLYITYDKYYQTPRLWLFGYDEQRQPLTVEHMYEDISQDHVKKTVTIENHPHLPPPPMCSVHPCRHAEVMKKIIETVAEGGGELGVHMYPSLYVRLVAKWLLTIFFLRNLV" ;
  up:modification annotation:VSP_013037 .
"""


Q96P20=Graph().parse(format='ttl',
                     data=turtle)

qres=Q96P20.query("""
PREFIX up: <http://purl.uniprot.org/core/> 
PREFIX faldo: <http://biohackathon.org/resource/faldo#>
SELECT 
    ?protein 
    ?annotationType 
    ?start
    ?end
    ?sequence
WHERE {
    ?protein up:annotation ?annotation .
    ?annotation rdf:type ?annotationType .
    ?annotation up:range ?range .
    ?range     faldo:begin ?begin ;
               faldo:end ?stop .
    ?begin faldo:position ?start ;
               faldo:reference ?sequence .
    ?stop faldo:position ?end ;
               faldo:reference ?sequence .
} 
""")

for row in qres:
    print("%s is a protein with an annotation of type %s starting at %s and ending at %s on sequence %s" % row)


http://purl.uniprot.org/uniprot/Q96P20 is a protein with an annotation of type http://purl.uniprot.org/core/Alternative_Sequence_Annotation starting at 290 and ending at 314 on sequence http://purl.uniprot.org/isoforms/Q9NT62-1


Why the layers of indirection? Simple biology doesn't need them; unfortunately, lots of biology is not simple.

Things you might expect of genes, but that are not always true:
- they start before they end
- they have a definable length
- they are shorter than the chromosome they are on
- they start on the same chromosome that they end on
- their position is actually known
- they are not wholly within other genes
- and if they are, they don't share exons
- and if they do, mutations in the one don't make the other non-functional
      
[FALDO](https://pubmed.ncbi.nlm.nih.gov/27296299/) deals with the rare edge cases, and we pay for that with extra triples in the simple cases. We accept this because the rare cases are often interesting for biological reasons.
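
For the simple case in our example, the range and `up:substitution` are enough to derive isoform 2 from isoform 1. The sketch below does this in plain Python. It assumes that the FALDO positions are 1-based and inclusive, and the `apply_substitution` helper is hypothetical.

```python
# Sequence of isoform:Q9NT62-1, copied from the Turtle above in chunks of 60.
ISOFORM_1 = (
    "MQNVINTVKGKALEVAEYLTPVLKESKFKETGVITPEEFVAAGDHLVHHCPTWQWATGEE"
    "LKVKAYLPTGKQFLVTKNVPCYKRCKQMEYSDELEAIIEEDDGDGGWVDTYHNTGITGIT"
    "EAVKEITLENKDNIRLQDCSALCEEEEDEDEGEAADMEEYEESGLLETDEATLDTRKIVE"
    "ACKAKTDAGGEDAILQTRTYDLYITYDKYYQTPRLWLFGYDEQRQPLTVEHMYEDISQDH"
    "VKKTVTIENHPHLPPPPMCSVHPCRHAEVMKKIIETVAEGGGELGVHMYLLIFLKFVQAV"
    "IPTIEYDYTRHFTM"
)
SUBSTITUTION = "PSLYVRLVAKWLLTIFFLRNLV"  # up:substitution of VSP_013037


def apply_substitution(sequence, begin, end, substitution):
    """Replace the 1-based inclusive range [begin, end] with the substitution."""
    if not 1 <= begin <= end <= len(sequence):
        raise ValueError("range lies outside the reference sequence")
    return sequence[: begin - 1] + substitution + sequence[end:]


# faldo:begin is 290 and faldo:end is 314, both on isoform:Q9NT62-1.
isoform_2 = apply_substitution(ISOFORM_1, 290, 314, SUBSTITUTION)
print(len(ISOFORM_1), len(isoform_2))
```

The check on `begin` and `end` only works because this range is exact and lies on a single reference sequence, which is precisely the kind of assumption FALDO avoids making in general.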


In [13]:
qres=Q96P20.query("""
PREFIX up: <http://purl.uniprot.org/core/> 
PREFIX faldo: <http://biohackathon.org/resource/faldo#>
SELECT 
    ?protein 
    ?sequence
    ?sequenceLength
WHERE {
    ?protein up:sequence ?sequence .
    ?sequence rdf:value ?sequenceIUPACstring .
    BIND(STRLEN(?sequenceIUPACstring ) AS ?sequenceLength)
} 
""")

for row in qres:
    print("%s is a protein record with a sequence %s of length %s" % row)


http://purl.uniprot.org/uniprot/Q96P20 is a protein record with a sequence http://purl.uniprot.org/isoforms/Q9NT62-1 of length 314
http://purl.uniprot.org/uniprot/Q96P20 is a protein record with a sequence http://purl.uniprot.org/isoforms/Q9NT62-2 of length 311


# More annotation types!

In [14]:
sparql=SPARQLWrapper('https://sparql.uniprot.org/sparql')
sparql.setQuery("""
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#> 
PREFIX up:<http://purl.uniprot.org/core/> 
SELECT 
	?annotationType 
	?manual
FROM <http://purl.uniprot.org/core/>
WHERE
{
	?annotationType rdfs:subClassOf+ up:Annotation .
	OPTIONAL {
		?annotationType rdfs:seeAlso ?manual
	}
}
""")


sparql.setReturnFormat(JSON)
results = sparql.query().convert()

for result in results["results"]["bindings"]:
    print(result["annotationType"]["value"])
    if "manual" in result.keys():
        print(result["manual"]["value"])
    else:
        print("No manual link!")

http://purl.uniprot.org/core/Active_Site_Annotation
http://www.uniprot.org/manual/act_site
http://purl.uniprot.org/core/Allergen_Annotation
http://www.uniprot.org/manual/allergenic_properties
http://purl.uniprot.org/core/Alternative_Products_Annotation
http://www.uniprot.org/manual/alternative_products
http://purl.uniprot.org/core/Alternative_Sequence_Annotation
http://www.uniprot.org/manual/var_seq
http://purl.uniprot.org/core/Beta_Strand_Annotation
http://www.uniprot.org/manual/strand
http://purl.uniprot.org/core/Binding_Site_Annotation
http://www.uniprot.org/manual/binding
http://purl.uniprot.org/core/Biophysicochemical_Annotation
http://purl.uniprot.org/core/biophysicochemical_properties
http://purl.uniprot.org/core/Biotechnology_Annotation
http://www.uniprot.org/manual/biotechnological_use
http://purl.uniprot.org/core/Calcium_Binding_Annotation
http://www.uniprot.org/manual/ca_bind
http://purl.uniprot.org/core/Catalytic_Activity_Annotation
http://www.uniprot.org/manual/catalytic_a

Here we query our schema ontology for links to our manual pages, describing the biology beyond sets of annotations.

# Disease Annotation and related features

In [16]:
P06239ttl="""@base <http://purl.uniprot.org/uniprot/> .
@prefix annotation: <http://purl.uniprot.org/annotation/> .
@prefix bibo: <http://purl.org/ontology/bibo/> .
@prefix citation: <http://purl.uniprot.org/citations/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix disease: <http://purl.uniprot.org/diseases/> .
@prefix ECO: <http://purl.obolibrary.org/obo/ECO_> .
@prefix enzyme: <http://purl.uniprot.org/enzyme/> .
@prefix faldo: <http://biohackathon.org/resource/faldo#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix go: <http://purl.obolibrary.org/obo/GO_> .
@prefix isoform: <http://purl.uniprot.org/isoforms/> .
@prefix keyword: <http://purl.uniprot.org/keywords/> .
@prefix location: <http://purl.uniprot.org/locations/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix position: <http://purl.uniprot.org/position/> .
@prefix pubmed: <http://purl.uniprot.org/pubmed/> .
@prefix range: <http://purl.uniprot.org/range/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix taxon: <http://purl.uniprot.org/taxonomy/> .
@prefix tissue: <http://purl.uniprot.org/tissues/> .
@prefix up: <http://purl.uniprot.org/core/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
<P06239> rdf:type up:Protein ;
  up:reviewed true ;
  up:created "1988-01-01"^^xsd:date ;
  up:modified "2018-02-28"^^xsd:date ;
  up:version 239 ;
  up:mnemonic "LCK_HUMAN" ;
  up:annotation <P06239#SIP02543850A4A84B97> , annotation:VAR_071291 .
<P06239#SIP02543850A4A84B97> rdf:type up:Disease_Annotation ;
  rdfs:comment "The disease is caused by mutations affecting the gene represented in this entry." ;
  up:disease disease:4079 .
annotation:VAR_071291 rdf:type up:Natural_Variant_Annotation ;
  rdfs:comment "In IMD22." ;
  up:substitution "P" ;
  rdfs:seeAlso <http://purl.uniprot.org/dbsnp/rs587777335> ;
  skos:related disease:4079 ;
  up:range range:22571007465437486tt341tt341 . 
disease:4079 rdf:type up:Disease ;
  skos:prefLabel "Immunodeficiency 22" ;
  up:mnemonic "IMD22" ;
  rdfs:comment "A primary immunodeficiency characterized by T-cell dysfunction. Affected individuals present with lymphopenia, recurrent infections, severe diarrhea, and failure to thrive." ;
   rdfs:seeAlso <http://purl.uniprot.org/mim/615758> ,
    <http://purl.uniprot.org/medgen/CN186319> ,
    <http://id.nlm.nih.gov/mesh/D007153> .

range:22571007465437486tt341tt341 rdf:type faldo:Region ;
  faldo:begin position:22571007465437486tt341 ;
  faldo:end position:22571007465437486tt341 .
position:22571007465437486tt341 rdf:type faldo:Position ,
    faldo:ExactPosition ;
  faldo:position 341 .

  """

P06239=Graph().parse(format='ttl',
                     data=P06239ttl)

qres=P06239.query("""
PREFIX up: <http://purl.uniprot.org/core/> 
PREFIX faldo: <http://biohackathon.org/resource/faldo#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT 
    ?protein 
    ?disease
    ?diseaseAnnotation
    ?diseaseName    
WHERE {
    ?protein up:annotation ?diseaseAnnotation .
    ?diseaseAnnotation up:disease ?disease .
    ?disease skos:prefLabel ?diseaseName .
} 
""")

for row in qres:
    print("%s is a protein record related to %s via annotation %s named %s" % row)



http://purl.uniprot.org/uniprot/P06239 is a protein record related to http://purl.uniprot.org/diseases/4079 via annotation http://purl.uniprot.org/uniprot/P06239#SIP02543850A4A84B97 named Immunodeficiency 22


# Combining databases!

In [22]:
q="""
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX rh:<http://rdf.rhea-db.org/>
PREFIX ch:<http://purl.obolibrary.org/obo/>
PREFIX up:<http://purl.uniprot.org/core/>
PREFIX uniprotkb:<http://purl.uniprot.org/uniprot/>
SELECT ?chebi
       ?reaction
       ?equation
WHERE {
  SERVICE <http://sparql.uniprot.org/sparql> {
    	?protein up:reviewed true .
        ?protein up:annotation ?a .
        ?a a up:Cofactor_Annotation .
        ?a up:cofactor ?chebi .
        VALUES (?protein) {(uniprotkb:P15877)} .
   }
  ?reaction rdfs:subClassOf rh:Reaction .
  ?reaction rh:status rh:Approved .
  ?reaction rh:equation ?equation .
  ?reaction rh:side ?reactionSide .
  ?reactionSide rh:contains ?participant .
  ?participant rh:compound ?compound .
  ?compound rh:chebi ?chebi .
}
order by ?chebi
"""

sparql=SPARQLWrapper('http://sparql.rhea-db.org/sparql')
sparql.setQuery(q)
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

for result in results["results"]["bindings"]:
    print(result["chebi"]["value"] + result["reaction"]["value"] + result["equation"]["value"])

http://purl.obolibrary.org/obo/CHEBI_58442http://rdf.rhea-db.org/27962myo-inositol + pyrroloquinoline quinone = H(+) + pyrroloquinoline quinol + scyllo-inosose
http://purl.obolibrary.org/obo/CHEBI_58442http://rdf.rhea-db.org/31247H(+) + prop-2-ynal + pyrroloquinoline quinol = prop-2-yn-1-ol + pyrroloquinoline quinone
http://purl.obolibrary.org/obo/CHEBI_58442http://rdf.rhea-db.org/31263but-3-ynal + H(+) + pyrroloquinoline quinol = but-3-yn-1-ol + pyrroloquinoline quinone
http://purl.obolibrary.org/obo/CHEBI_58442http://rdf.rhea-db.org/106926-(2-amino-2-carboxyethyl)-7,8-dioxo-1,2,3,4,7,8-hexahydroquinoline-2,4-dicarboxylate + 3 O2 = H(+) + 2 H2O + 2 H2O2 + pyrroloquinoline quinone


The real value of SPARQL is not that you can query one database, but that you can query many databases as if they were one, on demand.
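
Conceptually, the federated query above joins rows from two databases on a shared variable, here `?chebi`. The sketch below imitates that join in plain Python. It reuses the first three Rhea rows from the output above and adds one row under `example.org` that is a made-up placeholder. The `join_on` helper is hypothetical and is not how either endpoint executes `SERVICE`.

```python
from collections import defaultdict

PQQ = "http://purl.obolibrary.org/obo/CHEBI_58442"

# What the UniProt side contributes: the cofactor of P15877.
uniprot_rows = [
    {"protein": "http://purl.uniprot.org/uniprot/P15877", "chebi": PQQ},
]

# What the Rhea side contributes (last row is a made-up placeholder).
rhea_rows = [
    {"chebi": PQQ, "reaction": "http://rdf.rhea-db.org/27962"},
    {"chebi": PQQ, "reaction": "http://rdf.rhea-db.org/31247"},
    {"chebi": PQQ, "reaction": "http://rdf.rhea-db.org/31263"},
    {"chebi": "http://example.org/other-compound",
     "reaction": "http://example.org/other-reaction"},
]


def join_on(left, right, key):
    """Join two lists of result rows on a shared variable (simple hash join)."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    return [{**l_row, **r_row} for l_row in left for r_row in index.get(l_row[key], [])]


joined = join_on(uniprot_rows, rhea_rows, "chebi")
for row in joined:
    print(row["protein"], row["chebi"], row["reaction"])
```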