<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Question4Jerven" data-toc-modified-id="Question4Jerven-1">Question4Jerven</a></span></li><li><span><a href="#Replicon-and-genes" data-toc-modified-id="Replicon-and-genes-2">Replicon and genes</a></span></li><li><span><a href="#Required-Python-libraries" data-toc-modified-id="Required-Python-libraries-3">Required Python libraries</a></span></li><li><span><a href="#Gene-Names" data-toc-modified-id="Gene-Names-4">Gene Names</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Selecting-encoding-genes" data-toc-modified-id="Selecting-encoding-genes-4.0.1">Selecting encoding genes</a></span></li><li><span><a href="#Selecting-the-recommended-gene-names" data-toc-modified-id="Selecting-the-recommended-gene-names-4.0.2">Selecting the recommended gene names</a></span></li><li><span><a href="#Selecting-alternative-gene-names" data-toc-modified-id="Selecting-alternative-gene-names-4.0.3">Selecting alternative gene names</a></span></li><li><span><a href="#Selecting-ordered-locus-names" data-toc-modified-id="Selecting-ordered-locus-names-4.0.4">Selecting ordered locus names</a></span></li><li><span><a href="#Selecting-ORF-names" data-toc-modified-id="Selecting-ORF-names-4.0.5">Selecting ORF names</a></span></li></ul></li></ul></li><li><span><a href="#Replicons" data-toc-modified-id="Replicons-5">Replicons</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Organelles-and-Plasmids" data-toc-modified-id="Organelles-and-Plasmids-5.0.1">Organelles and Plasmids</a></span></li></ul></li></ul></li></ul></div>

# Question4Jerven
*If a gene is located in an organelle other than the nucleus, and/or on a plasmid rather than a chromosome, the gene location is stored in `encodedIn` properties.*  


A gene is located on a replicon, either a chromosome or a plasmid.  

A replicon is located in the nucleus, a mitochondrion, or the cytosol (prokaryotes).  

**I don't understand our data model:**
- the chromosome is encoded in the `up:proteome` IRI  
- organelles and plasmids are represented by `up:encodedIn`  

# Replicon and genes

This notebook shows basic information about the **genes** that encode a protein and about the replicon (chromosome, plasmid, etc.) on which they are located.   

# Required Python libraries

If you are not familiar with **RDFlib** and **SPARQLWrapper** libraries, please read `00_introduction.ipynb` first. 

In [1]:
# Only Graph is needed from rdflib in this notebook
from rdflib import Graph
from SPARQLWrapper import SPARQLWrapper, JSON

# Gene Names

This section shows how the names of the **genes** that encode a protein are represented.   

Each gene that encodes the protein is linked to it by a separate `up:encodedBy` property.

There are four categories of gene names:  
- The primary (recommended) gene name is represented with a `skos:prefLabel` property.
- Synonyms are represented with `skos:altLabel` properties.
- Ordered locus names (OLN) are represented with `up:locusName` properties.
- ORF names are represented with `up:orfName` properties.

The resources representing a gene are members of the `up:Gene` class.

In [2]:
entry=Graph().parse(format='ttl',
                     data="""
base <http://purl.uniprot.org/uniprot/>  
prefix up: <http://purl.uniprot.org/core/>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> 
prefix isoform:<http://purl.uniprot.org/isoforms/>
prefix skos: <http://www.w3.org/2004/02/skos/core#>

<Q0JNS6>
  a up:Protein ;
  up:encodedBy
    <Q0JNS6#51304A4E53360019> ,
    <Q0JNS6#51304A4E5336001A> ,
    <Q0JNS6#51304A4E5336001B> .

<Q0JNS6#51304A4E53360019>
  rdf:type up:Gene ;
  skos:prefLabel "CAM1-1" ;
  skos:altLabel "CAM1" ;
  up:locusName "Os03g0319300" ,
    "LOC_Os03g20370" ;
  up:orfName "OsJ_010214" .

<Q0JNS6#51304A4E5336001A>
  rdf:type up:Gene ;
  skos:prefLabel "CAM1-2" ;
  skos:altLabel "CAM" ;
  up:locusName "Os07g0687200" ,
    "LOC_Os07g48780" ;
  up:orfName "OJ1150_E04.120-1" ,
    "OJ1200_C08.124-1" ,
    "OsJ_024630" .

<Q0JNS6#51304A4E5336001B>
  rdf:type up:Gene ;
  skos:prefLabel "CAM1-3" ;
  up:locusName "Os01g0267900" ,
    "LOC_Os01g16240" ;
  up:orfName "OsJ_001186" ,
    "P0011D01.22" .""")


### Selecting encoding genes

In [3]:
qres=entry.query("""
PREFIX up: <http://purl.uniprot.org/core/> 
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?protein
       ?gene 
WHERE {
  ?protein a up:Protein ;
           up:encodedBy ?gene .
}""")

for row in qres:
    print("%s is encoded by %s" % row)

http://purl.uniprot.org/uniprot/Q0JNS6 is encoded by http://purl.uniprot.org/uniprot/Q0JNS6#51304A4E5336001B
http://purl.uniprot.org/uniprot/Q0JNS6 is encoded by http://purl.uniprot.org/uniprot/Q0JNS6#51304A4E53360019
http://purl.uniprot.org/uniprot/Q0JNS6 is encoded by http://purl.uniprot.org/uniprot/Q0JNS6#51304A4E5336001A


### Selecting the recommended gene names

In [4]:
qres=entry.query("""
PREFIX up: <http://purl.uniprot.org/core/> 
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?gene 
       ?recommendedGeneName
WHERE {
  ?protein a up:Protein ;
           up:encodedBy ?gene .
  ?gene skos:prefLabel ?recommendedGeneName .
}""")

for row in qres:
    print("%s is called %s" % row)

http://purl.uniprot.org/uniprot/Q0JNS6#51304A4E5336001B is called CAM1-3
http://purl.uniprot.org/uniprot/Q0JNS6#51304A4E53360019 is called CAM1-1
http://purl.uniprot.org/uniprot/Q0JNS6#51304A4E5336001A is called CAM1-2


### Selecting alternative gene names

In [5]:
qres=entry.query("""
PREFIX up: <http://purl.uniprot.org/core/> 
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?gene 
       ?altGeneName
WHERE {
  ?protein a up:Protein ;
           up:encodedBy ?gene .
  ?gene skos:altLabel ?altGeneName .
}""")

for row in qres:
    print("%s is also known as %s" % row)

http://purl.uniprot.org/uniprot/Q0JNS6#51304A4E53360019 is also known as CAM1
http://purl.uniprot.org/uniprot/Q0JNS6#51304A4E5336001A is also known as CAM


### Selecting ordered locus names

In [6]:
qres=entry.query("""
PREFIX up: <http://purl.uniprot.org/core/> 
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?gene 
       ?oln
WHERE {
  ?protein a up:Protein ;
           up:encodedBy ?gene .
  ?gene up:locusName ?oln .
}""")

for row in qres:
    print("%s has an ordered locus name %s" % row)

http://purl.uniprot.org/uniprot/Q0JNS6#51304A4E5336001B has an ordered locus name Os01g0267900
http://purl.uniprot.org/uniprot/Q0JNS6#51304A4E5336001B has an ordered locus name LOC_Os01g16240
http://purl.uniprot.org/uniprot/Q0JNS6#51304A4E53360019 has an ordered locus name Os03g0319300
http://purl.uniprot.org/uniprot/Q0JNS6#51304A4E53360019 has an ordered locus name LOC_Os03g20370
http://purl.uniprot.org/uniprot/Q0JNS6#51304A4E5336001A has an ordered locus name LOC_Os07g48780
http://purl.uniprot.org/uniprot/Q0JNS6#51304A4E5336001A has an ordered locus name Os07g0687200


### Selecting ORF names

In [7]:
qres=entry.query("""
PREFIX up: <http://purl.uniprot.org/core/> 
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?gene 
       ?orfName
WHERE {
  ?protein a up:Protein ;
           up:encodedBy ?gene .
  ?gene up:orfName ?orfName .
}""")

for row in qres:
    print("%s has an open reading frame name %s" % row)

http://purl.uniprot.org/uniprot/Q0JNS6#51304A4E5336001B has an open reading frame name OsJ_001186
http://purl.uniprot.org/uniprot/Q0JNS6#51304A4E5336001B has an open reading frame name P0011D01.22
http://purl.uniprot.org/uniprot/Q0JNS6#51304A4E53360019 has an open reading frame name OsJ_010214
http://purl.uniprot.org/uniprot/Q0JNS6#51304A4E5336001A has an open reading frame name OJ1200_C08.124-1
http://purl.uniprot.org/uniprot/Q0JNS6#51304A4E5336001A has an open reading frame name OsJ_024630
http://purl.uniprot.org/uniprot/Q0JNS6#51304A4E5336001A has an open reading frame name OJ1150_E04.120-1


# Replicons

In [8]:
entry=Graph().parse(format='ttl',
                     data="""
base <http://purl.uniprot.org/uniprot/> 
prefix annotation: <http://purl.uniprot.org/annotation/> 
prefix citation: <http://purl.uniprot.org/citations/> 
prefix dcterms: <http://purl.org/dc/terms/> 
prefix disease: <http://purl.uniprot.org/diseases/> 
prefix ECO: <http://purl.obolibrary.org/obo/ECO_> 
prefix enzyme: <http://purl.uniprot.org/enzyme/> 
prefix faldo: <http://biohackathon.org/resource/faldo#> 
prefix foaf: <http://xmlns.com/foaf/0.1/> 
prefix go: <http://purl.obolibrary.org/obo/GO_> 
prefix isoform: <http://purl.uniprot.org/isoforms/> 
prefix keyword: <http://purl.uniprot.org/keywords/> 
prefix location: <http://purl.uniprot.org/locations/> 
prefix owl: <http://www.w3.org/2002/07/owl#> 
prefix position: <http://purl.uniprot.org/position/> 
prefix pubmed: <http://purl.uniprot.org/pubmed/> 
prefix range: <http://purl.uniprot.org/range/> 
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> 
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
prefix skos: <http://www.w3.org/2004/02/skos/core#> 
prefix taxon: <http://purl.uniprot.org/taxonomy/> 
prefix tissue: <http://purl.uniprot.org/tissues/> 
prefix up: <http://purl.uniprot.org/core/> 
prefix xsd: <http://www.w3.org/2001/XMLSchema#> 

<Q71RH2> rdf:type up:Protein ;
  up:reviewed true ;
  up:created "2005-04-12"^^xsd:date ;
  up:modified "2021-04-07"^^xsd:date ;
  up:version 130 ;
  up:mnemonic "TLC3B_HUMAN" .


<Q71RH2> up:proteome <http://purl.uniprot.org/proteomes/UP000005640#Chromosome%2016> .""")

In [9]:
qres=entry.query("""
PREFIX up: <http://purl.uniprot.org/core/> 
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?protein
       ?replicon 
WHERE {
  ?protein a up:Protein ;
           up:proteome ?proteomeData .
  BIND( strafter( str(?proteomeData), "#" ) as ?replicon )
}""")

for row in qres:
    print("The gene coding for %s is located on %s" % row)

The gene coding for http://purl.uniprot.org/uniprot/Q71RH2 is located on Chromosome%2016
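
The replicon name printed above is still percent-encoded (`Chromosome%2016`), because `strafter` returns the raw IRI fragment. As a small sketch using only the Python standard library, the fragment can be decoded after the query; the helper name `split_proteome_iri` is hypothetical and introduced here only for illustration.

```python
from urllib.parse import unquote, urldefrag

def split_proteome_iri(iri):
    """Split a up:proteome IRI into (proteome IRI, decoded replicon name)."""
    # urldefrag separates the part before '#' from the fragment after it
    proteome, fragment = urldefrag(iri)
    # unquote turns percent-escapes such as %20 back into characters
    return proteome, unquote(fragment)

proteome, replicon = split_proteome_iri(
    "http://purl.uniprot.org/proteomes/UP000005640#Chromosome%2016"
)
print("proteome: %s" % proteome)
print("replicon: %s" % replicon)
```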


In [10]:
query = """
PREFIX up:<http://purl.uniprot.org/core/>
PREFIX keywords:<http://purl.uniprot.org/keywords/>
PREFIX taxon:<http://purl.uniprot.org/taxonomy/>
PREFIX skos:<http://www.w3.org/2004/02/skos/core#> 
PREFIX proteome:<http://purl.uniprot.org/proteomes/>

SELECT distinct ?proteomeData
WHERE {
  # reviewed entries (UniProtKB/Swiss-Prot)
  ?protein up:reviewed true . 
  # restricted to human (taxon 9606)
  ?protein up:organism taxon:9606 . 
  # reference proteome
  ?protein up:classifiedWith keywords:1185 .
  # the proteome IRI has the form <proteome>#<replicon>
  ?protein up:proteome ?proteomeData .
  BIND( strbefore( str(?proteomeData), "#" ) as ?proteome )
  BIND( strafter( str(?proteomeData), "#" ) as ?replicon )
}
LIMIT 3
"""

# Set the SPARQL endpoint (UniProt)
sparql = SPARQLWrapper("https://sparql.uniprot.org/sparql")

# Define the query
sparql.setQuery(query)

# Set the output format as JSON
sparql.setReturnFormat(JSON)

# Run the SPARQL query and convert to the defined format
results = sparql.query().convert()

# Print the query result
for result in results["results"]["bindings"]:
    print(result["proteomeData"]["value"])

http://purl.uniprot.org/proteomes/UP000005640#Chromosome%2011
http://purl.uniprot.org/proteomes/UP000005640#Chromosome%202
http://purl.uniprot.org/proteomes/UP000005640#Chromosome%203


### Organelles and Plasmids

If a gene is located in an organelle other than the nucleus, and/or on a plasmid rather than a chromosome, the gene location is stored in `up:encodedIn` properties. Note that if a plasmid has several names, they are listed as multiple `rdfs:label` properties.

In [16]:
entry2=Graph().parse(format='ttl',
                     data="""
base <http://purl.uniprot.org/uniprot/>  
prefix up: <http://purl.uniprot.org/core/>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> 
prefix isoform:<http://purl.uniprot.org/isoforms/>
prefix skos: <http://www.w3.org/2004/02/skos/core#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>

<Q01529>
  a up:Protein ;
  up:encodedIn up:Mitochondrion ,
               <Q01529#SIP29DF58> . 

<Q01529#SIP29DF58>
  rdf:type up:Plasmid ;
  rdfs:label "pAL2-1" .
  """)


qres=entry2.query("""
PREFIX up: <http://purl.uniprot.org/core/> 
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT 
    ?protein 
    ?plasmidOrOrganelle
    ?label
WHERE {
    ?protein a up:Protein ;
      up:encodedIn ?plasmidOrOrganelle .
    OPTIONAL {
        ?plasmidOrOrganelle rdfs:label ?label .
    }
}""")

for row in qres:
    #print("%s is encodedIn a '%s'" % row)
    print(row)

(rdflib.term.URIRef('http://purl.uniprot.org/uniprot/Q01529'), rdflib.term.URIRef('http://purl.uniprot.org/uniprot/Q01529#SIP29DF58'), rdflib.term.Literal('pAL2-1'))
(rdflib.term.URIRef('http://purl.uniprot.org/uniprot/Q01529'), rdflib.term.URIRef('http://purl.uniprot.org/core/Mitochondrion'), None)


Sometimes it is known that a gene is located on a plasmid, but the name of the plasmid is unknown. The example below shows how this is represented.

In [13]:
entry3=Graph().parse(format='ttl',
                     data="""
base <http://purl.uniprot.org/uniprot/>  
prefix up: <http://purl.uniprot.org/core/>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> 
prefix isoform:<http://purl.uniprot.org/isoforms/>
prefix skos: <http://www.w3.org/2004/02/skos/core#>
<Q7BS32>
  a up:Protein ;
  up:encodedIn 
    <Q7BS32#51374253333200E> .

<Q7BS32#51374253333200E>
  rdf:type up:Plasmid .""")

qres=entry3.query("""
PREFIX up: <http://purl.uniprot.org/core/> 
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT 
    ?protein 
    ?type
WHERE {
    ?protein a up:Protein ;
      up:encodedIn ?plasmidOrOrganelle .
    OPTIONAL {
        ?plasmidOrOrganelle a ?type .
    }
}""")

for row in qres:
    print("%s is encodedIn a '%s'" % row)

http://purl.uniprot.org/uniprot/Q7BS32 is encodedIn a 'http://purl.uniprot.org/core/Plasmid'
