<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Taxonomy" data-toc-modified-id="Taxonomy-1">Taxonomy</a></span></li><li><span><a href="#Import-Python-package" data-toc-modified-id="Import-Python-package-2">Import Python package</a></span><ul class="toc-item"><li><span><a href="#Organism-identifier" data-toc-modified-id="Organism-identifier-2.1">Organism identifier</a></span></li><li><span><a href="#Retrieve-the-taxon-(organism-id)-of-a-protein" data-toc-modified-id="Retrieve-the-taxon-(organism-id)-of-a-protein-2.2">Retrieve the taxon (organism id) of a protein</a></span></li><li><span><a href="#Taxonomy-data" data-toc-modified-id="Taxonomy-data-2.3">Taxonomy data</a></span><ul class="toc-item"><li><span><a href="#Retrieve-the-rank-and-the-scientific-name-of-the-organism" data-toc-modified-id="Retrieve-the-rank-and-the-scientific-name-of-the-organism-2.3.1">Retrieve the rank and the scientific name of the organism</a></span></li></ul></li><li><span><a href="#Taxonomy-hierarchy" data-toc-modified-id="Taxonomy-hierarchy-2.4">Taxonomy hierarchy</a></span></li><li><span><a href="#Host-organisms" data-toc-modified-id="Host-organisms-2.5">Host organisms</a></span></li></ul></li><li><span><a href="#How-to-retrieve-all-UniProt-entries-for-a-given-organism-?" data-toc-modified-id="How-to-retrieve-all-UniProt-entries-for-a-given-organism-?-3"><span style="color: red">How to retrieve all UniProt entries for a given organism ?</span></a></span></li><li><span><a href="#How-to-retrieve-the-lineage-of-an-organism-?" data-toc-modified-id="How-to-retrieve-the-lineage-of-an-organism-?-4"><span style="color: red">How to retrieve the lineage of an organism ?</span></a></span></li><li><span><a href="#How-to-retrieve-all-organisms-with-at-least-one-entry-in-UniProtKB/Swiss-Prot-?" 
data-toc-modified-id="How-to-retrieve-all-organisms-with-at-least-one-entry-in-UniProtKB/Swiss-Prot-?-5"><span style="color: red">How to retrieve all organisms with at least one entry in UniProtKB/Swiss-Prot ?</span></a></span></li></ul></div>

# Taxonomy

This notebook shows how taxonomy data is represented in UniProt.  

UniProtKB taxonomy data is manually curated (see details [here](https://www.uniprot.org/taxonomy/)).


The source organism of a protein sequence is identified by a unique identifier (often called _taxon_ or _taxid_) from the [NCBI taxonomy](https://www.ncbi.nlm.nih.gov/taxonomy) database.  
This identifier is the only taxonomy information stored in the RDF representation of a UniProtKB entry. However, the full NCBI taxonomy is also modelled in RDF and available.   

# Import Python package

First we import `rdflib`, a well-known Python library for working with RDF and its query language SPARQL, together with `SPARQLWrapper`, which is used later in this notebook to send queries to a remote SPARQL endpoint.  


In [12]:
from rdflib import Graph
from SPARQLWrapper import SPARQLWrapper, JSON

## Organism identifier

The organism identifier (taxon) is stored in the `organism` property of a UniProtKB entry.  

In [13]:
P05067ttl = """base <http://purl.uniprot.org/uniprot/>  
prefix up: <http://purl.uniprot.org/core/>
prefix taxon: <http://purl.uniprot.org/taxonomy/>

<P05067> a up:Protein ;
         up:organism taxon:9606 .
"""

P05067=Graph().parse(format='ttl', data=P05067ttl)

for subj, pred, obj in P05067:
    print(subj, pred, obj)


http://purl.uniprot.org/uniprot/P05067 http://purl.uniprot.org/core/organism http://purl.uniprot.org/taxonomy/9606
http://purl.uniprot.org/uniprot/P05067 http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://purl.uniprot.org/core/Protein


## Retrieve the taxon (organism id) of a protein

In [14]:
qres=P05067.query("""
PREFIX up: <http://purl.uniprot.org/core/> 
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?protein ?taxon
WHERE {
  ?protein a up:Protein ;
           up:organism ?taxon .
}""")

for row in qres:
    print("The taxon (organism id) of %s is %s" % row)

The taxon (organism id) of http://purl.uniprot.org/uniprot/P05067 is http://purl.uniprot.org/taxonomy/9606
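
The query returns the taxon as a full URI. When only the numeric NCBI identifier is needed, it can be cut out of that URI. The helper below is a minimal sketch: the name `taxid_from_uri` is hypothetical and not part of rdflib or of any UniProt tool, and it assumes the URI uses the `http://purl.uniprot.org/taxonomy/` namespace shown above.

```python
TAXONOMY_NS = "http://purl.uniprot.org/taxonomy/"


def taxid_from_uri(uri):
    """Return the numeric taxid encoded in a UniProt taxonomy URI.

    Hypothetical helper written for this notebook.
    """
    # str() also accepts rdflib URIRef objects, which are str subclasses
    uri = str(uri)
    if not uri.startswith(TAXONOMY_NS):
        raise ValueError("not a UniProt taxonomy URI: %s" % uri)
    # everything after the namespace is the identifier itself
    return int(uri[len(TAXONOMY_NS):])


print(taxid_from_uri("http://purl.uniprot.org/taxonomy/9606"))
```

Raising an error for URIs from another namespace avoids silently treating, for example, a protein URI as a taxon.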


## Taxonomy data

**Properties**:
- `rank`  
- `mnemonic`  
- `scientificName`  
- `commonName`  
- `otherName`  
- `seeAlso` (xref)  
- `subClassOf` (hierarchy)   
- <span style="color:red">narrowerTransitive</span>  
- <span style="color:red">partOfLineage</span>  

In [15]:
# Description of the taxon:9606 (Homo sapiens)
taxon=Graph().parse(format='ttl',
                 data="""
base <http://purl.uniprot.org/taxonomy/> 
prefix up: <http://purl.uniprot.org/core/> 
prefix foaf: <http://xmlns.com/foaf/0.1/> 
prefix owl: <http://www.w3.org/2002/07/owl#> 
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> 
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
prefix skos: <http://www.w3.org/2004/02/skos/core#> 
prefix xsd: <http://www.w3.org/2001/XMLSchema#> 

<9606> a up:Taxon ;
       up:rank up:Species ;
       up:mnemonic "HUMAN" ;
       up:scientificName "Homo sapiens" ;
       up:commonName "Human" ;
       up:otherName "Home sapiens" ,
                    "Homo sapiens Linnaeus, 1758" ,
                    "man" ;
       rdfs:seeAlso <http://animaldiversity.org/site/accounts/information/Homo_sapiens.html> ,
                    <http://archaeologyinfo.com/homo-sapiens/> ,
                    <http://www.ensembl.org/Homo_sapiens/Info/Index> ,
                    <https://www.sciencedaily.com/releases/2005/02/050223122209.htm> ;
       rdfs:subClassOf <9605> ;
       skos:narrowerTransitive <63221> ,
                               <741158> ;
       up:partOfLineage false .

<9605> a up:Taxon ;
       up:rank up:Genus ;
       up:scientificName "Homo" ;
       up:otherName "Homo Linnaeus, 1758" ,
                    "humans" ;
       rdfs:subClassOf <207598> ;
       skos:narrowerTransitive <9606> ,
                               <1425170> ,
                               <2665952> ;
       up:partOfLineage true .
""")

### Retrieve the rank and the scientific name of the organism

The `rank` and `scientificName` properties are among the most frequently queried properties of a taxon.

In [16]:
qres=taxon.query("""PREFIX up: <http://purl.uniprot.org/core/> 
SELECT ?taxon 
       ?rank
       ?scientificName
WHERE {
  ?taxon a up:Taxon ;
         up:rank ?rank ;
         up:scientificName ?scientificName .
}""")

for row in qres:
    print('Taxon "%s", rank = "%s", scientificName = "%s"' % row)

Taxon "http://purl.uniprot.org/taxonomy/9606", rank = "http://purl.uniprot.org/core/Species", scientificName = "Homo sapiens"
Taxon "http://purl.uniprot.org/taxonomy/9605", rank = "http://purl.uniprot.org/core/Genus", scientificName = "Homo"


## Taxonomy hierarchy

The taxonomic hierarchy can be queried with the `rdfs:subClassOf` property, which links a taxon to its parent.  
In the _taxon_ example shown previously:  
<9606> rdfs:subClassOf <9605>  
<9605> rdfs:subClassOf <207598>  

To simplify searching, the UniProt SPARQL endpoint materializes all transitive `rdfs:subClassOf` relationships. In other words, you do not need a SPARQL property path to query the taxonomy classification there.  
Note that on other endpoints you might need `rdfs:subClassOf+` to query by higher taxonomic levels.


In [17]:
qres=taxon.query("""
PREFIX up: <http://purl.uniprot.org/core/> 
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?species ?genus 
WHERE {
  ?species a up:Taxon ;
           up:rank up:Species ;
           rdfs:subClassOf ?genus .
  ?genus a up:Taxon ;
         up:rank up:Genus .
}""")

for row in qres:
    print("%s is part of the genus %s" % row)

http://purl.uniprot.org/taxonomy/9606 is part of the genus http://purl.uniprot.org/taxonomy/9605


## Host organisms

Sometimes an organism is known to be hosted by another one (_e.g._ a parasite, a symbiont, or an infectious agent).  
The `host` property links such an organism to its host.  

In [18]:
host=Graph().parse(format='ttl',
                 data="""
base <http://purl.uniprot.org/taxonomy/> 
prefix up: <http://purl.uniprot.org/core/> 
<1241371> a up:Taxon ;
          up:mnemonic "ABHV" ;
          up:host <6451> .
""")

In [19]:
qres=host.query("""
PREFIX up: <http://purl.uniprot.org/core/> 
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT 
    ?virus ?host 
WHERE {
    ?virus up:host ?host .
}""")

for row in qres:
    print("%s hosted by %s" % row)

http://purl.uniprot.org/taxonomy/1241371 hosted by http://purl.uniprot.org/taxonomy/6451


# <span style="color:red">How to retrieve all UniProt entries for a given organism ?</span>

# <span style="color:red">How to retrieve the lineage of an organism ?</span>

# <span style="color:red">How to retrieve all organisms with at least one entry in UniProtKB/Swiss-Prot ?</span>

In [23]:
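query3 = """PREFIX taxon:<http://purl.uniprot.org/taxonomy/>
PREFIX up:<http://purl.uniprot.org/core/> 
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>

SELECT distinct ?taxid
                ?scientificName
                ?domain
                ?domainName  # never bound in the WHERE clause, so always empty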
WHERE {
  # reviewed entries
  ?uniprot up:reviewed true .
  # taxid
  ?uniprot up:organism ?taxid . 
  ?taxid up:scientificName ?scientificName .
    
  VALUES ?domain { taxon:2 # bacteria
                   taxon:2157 # archaea
                   taxon:2759 # eukaryota
                   taxon:10239 #viruses
                 } .
  ?taxid rdfs:subClassOf ?domain .
}
LIMIT 3
""" 

# Set the SPARQL endpoint to use
sparql=SPARQLWrapper("https://sparql.uniprot.org/sparql/")
# Set the SPARQL query.
sparql.setQuery(query3)
# Set the output format to JSON
sparql.setReturnFormat(JSON)
# Run the query and save the result in res
res = sparql.query().convert()
# Print the result (a Python dict parsed from the JSON response)
print(res)

{'head': {'vars': ['taxid', 'scientificName', 'domain', 'domainName']}, 'results': {'bindings': [{'scientificName': {'type': 'literal', 'value': 'Neisseria meningitidis serogroup B (strain MC58)'}, 'taxid': {'type': 'uri', 'value': 'http://purl.uniprot.org/taxonomy/122586'}, 'domain': {'type': 'uri', 'value': 'http://purl.uniprot.org/taxonomy/2'}}, {'scientificName': {'type': 'literal', 'value': 'Streptococcus thermophilus (strain ATCC BAA-250 / LMG 18311)'}, 'taxid': {'type': 'uri', 'value': 'http://purl.uniprot.org/taxonomy/264199'}, 'domain': {'type': 'uri', 'value': 'http://purl.uniprot.org/taxonomy/2'}}, {'scientificName': {'type': 'literal', 'value': 'Streptococcus thermophilus (strain CNRZ 1066)'}, 'taxid': {'type': 'uri', 'value': 'http://purl.uniprot.org/taxonomy/299768'}, 'domain': {'type': 'uri', 'value': 'http://purl.uniprot.org/taxonomy/2'}}]}}
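
The raw dictionary is hard to read. In the SPARQL JSON results format, each solution is a mapping from variable name to a small object with a `type` and a `value`, and a variable that is unbound in a solution (here `domainName`) is simply absent. The sketch below flattens such a result into plain rows. The function name `bindings_to_rows` is hypothetical, and the sample input is the first binding of the output above, copied by hand so that the example runs without a network connection.

```python
# First binding of the printed result, copied by hand
res = {
    "head": {"vars": ["taxid", "scientificName", "domain", "domainName"]},
    "results": {"bindings": [
        {"scientificName": {"type": "literal",
                            "value": "Neisseria meningitidis serogroup B (strain MC58)"},
         "taxid": {"type": "uri",
                   "value": "http://purl.uniprot.org/taxonomy/122586"},
         "domain": {"type": "uri",
                    "value": "http://purl.uniprot.org/taxonomy/2"}},
    ]},
}


def bindings_to_rows(result):
    """Flatten a SPARQL JSON result into a list of dicts (hypothetical helper).

    Variables that are unbound in a solution are mapped to None.
    """
    variables = result["head"]["vars"]
    return [
        {var: (binding[var]["value"] if var in binding else None)
         for var in variables}
        for binding in result["results"]["bindings"]
    ]


rows = bindings_to_rows(res)
for row in rows:
    print(row["taxid"], row["scientificName"], row["domainName"])
```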


In [21]:
import pandas as pd
from pandas import json_normalize

def query_sparql(sparqlQuery, sparql_service_url):
    """
    Query a SPARQL endpoint with a given query string and return the results as a pandas DataFrame.
    """
    sparql=SPARQLWrapper(sparql_service_url)
    
    # set timeout to 45 minutes (SPARQLWrapper expects seconds)
    sparql.setTimeout(45 * 60)
    
    # set SPARQL query
    sparql.setQuery(sparqlQuery)
    
    # set return format as JSON
    sparql.setReturnFormat(JSON)
    
    # run the SPARQL query
    res = sparql.query().convert()
    # convert the JSON result into a pandas DataFrame
    res_sparql_df = json_normalize(res["results"]["bindings"])
    
    # distinguish .type and .value
    col_type = [c for c in res_sparql_df.columns.tolist() if ".type" in c]
    col_value = [c for c in res_sparql_df.columns.tolist() if ".value" in c]
    col_datatype = [c for c in res_sparql_df.columns.tolist() if ".datatype" in c]
    
    # Remove prefix part from URI
    #for i in range(0,len(col_type)):
    #    if 'uri' in res_sparql_df[col_type[i]].unique().tolist() :
    #        #tdf[col_value[i]] = tdf[col_value[i]].fillna("")
    #        res_sparql_df[col_value[i]] = res_sparql_df[col_value[i]].str.split(pat='/').str.get(-1)

    # Remove .type columns
    res_sparql_df.drop(col_type,axis=1,inplace=True)
    # Remove .datatype columns
    res_sparql_df.drop(col_datatype,axis=1,inplace=True)
    
    # Remove ".value" from column names
    res_sparql_df = res_sparql_df.rename(columns = lambda col: col.replace(".value", ""))
    
    return res_sparql_df

In [24]:
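query = """PREFIX taxon:<http://purl.uniprot.org/taxonomy/>
PREFIX up:<http://purl.uniprot.org/core/> 
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>

SELECT distinct ?taxid
                ?scientificName
                ?domain
                ?domainName  # never bound in the WHERE clause, so always empty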
WHERE {
  # reviewed entries
  ?uniprot up:reviewed true .
  # taxid
  ?uniprot up:organism ?taxid . 
  ?taxid up:scientificName ?scientificName .
    
  VALUES ?domain { taxon:2 # bacteria
                   taxon:2157 # archaea
                   taxon:2759 # eukaryota
                   taxon:10239 #viruses
                 } .
  ?taxid rdfs:subClassOf ?domain .
}
""" 

query_sparql(query,"https://sparql.uniprot.org/sparql/")

| | scientificName | taxid | domain |
|---|---|---|---|
| 0 | Neisseria meningitidis serogroup B (strain MC58) | http://purl.uniprot.org/taxonomy/122586 | http://purl.uniprot.org/taxonomy/2 |
| 1 | Streptococcus thermophilus (strain ATCC BAA-25... | http://purl.uniprot.org/taxonomy/264199 | http://purl.uniprot.org/taxonomy/2 |
| 2 | Streptococcus thermophilus (strain CNRZ 1066) | http://purl.uniprot.org/taxonomy/299768 | http://purl.uniprot.org/taxonomy/2 |
| 3 | Streptococcus thermophilus (strain ATCC BAA-49... | http://purl.uniprot.org/taxonomy/322159 | http://purl.uniprot.org/taxonomy/2 |
| 4 | Acholeplasma laidlawii | http://purl.uniprot.org/taxonomy/2148 | http://purl.uniprot.org/taxonomy/2 |
| ... | ... | ... | ... |
| 14009 | Hepatitis C virus (isolate TH) | http://purl.uniprot.org/taxonomy/11117 | http://purl.uniprot.org/taxonomy/10239 |
| 14010 | Influenza B virus (strain B/Leningrad/179/1986) | http://purl.uniprot.org/taxonomy/11536 | http://purl.uniprot.org/taxonomy/10239 |
| 14011 | Feline leukemia virus (isolate CFE-6) | http://purl.uniprot.org/taxonomy/11922 | http://purl.uniprot.org/taxonomy/10239 |
| 14012 | Tomato black ring virus (strain C) | http://purl.uniprot.org/taxonomy/12276 | http://purl.uniprot.org/taxonomy/10239 |
