**Source of the materials**: Biopython cookbook (adapted)
<font color='red'>Status: Draft</font>

Swiss-Prot and ExPASy {#chapter:swiss_prot}
=====================

Parsing Swiss-Prot files
------------------------

Swiss-Prot (<http://www.expasy.org/sprot>) is a hand-curated database of
protein sequences. Biopython can parse the “plain text” Swiss-Prot file
format, which is still used for the UniProt Knowledgebase, the resource that
combines Swiss-Prot, TrEMBL and PIR-PSD. We do not (yet) support the UniProtKB
XML file format.

### Parsing Swiss-Prot records

In Section \[sec:SeqIO\_ExPASy\_and\_SwissProt\], we described how to
extract the sequence of a Swiss-Prot record as a `SeqRecord` object.
Alternatively, you can store the Swiss-Prot record in a
`Bio.SwissProt.Record` object, which in fact stores the complete
information contained in the Swiss-Prot record. In this section, we
describe how to extract `Bio.SwissProt.Record` objects from a Swiss-Prot
file.

To parse a Swiss-Prot record, we first get a handle to a Swiss-Prot
record. There are several ways to do so, depending on where and how the
Swiss-Prot record is stored:

-   Open a Swiss-Prot file locally:
    `>>> handle = open("myswissprotfile.dat")`

-   Open a gzipped Swiss-Prot file:



In [1]:
from Bio import ExPASy
from Bio import SeqIO

with ExPASy.get_sprot_raw("O23729") as handle:
    seq_record = SeqIO.read(handle, "swiss")
print(seq_record.id)
print(seq_record.name)
print(seq_record.description)
print(repr(seq_record.seq))
print("Length %i" % len(seq_record))
print(seq_record.annotations["keywords"])

O23729
CHS3_BROFI
RecName: Full=Chalcone synthase 3; EC=2.3.1.74; AltName: Full=Naringenin-chalcone synthase 3;
Seq('MAPAMEEIRQAQRAEGPAAVLAIGTSTPPNALYQADYPDYYFRITKSEHLTELK...GAE')
Length 394
['Acyltransferase', 'Flavonoid biosynthesis', 'Transferase']


In [2]:
from Bio import SeqIO

orchid_dict = SeqIO.to_dict(SeqIO.parse("data/ls_orchid.gbk", "genbank"))

In [3]:
seq_record = orchid_dict["Z78475.1"]
print(seq_record.description)

P.supardii 5.8S rRNA gene and ITS1 and ITS2 DNA


In [5]:
## Uncomment if you have a local file
#import gzip
#handle = gzip.open("data/myswissprotfile.dat.gz")



-   Open a Swiss-Prot file over the internet:

    - [You can find a description of Swiss-Prot here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC102476/)
    - [And an extensive set of sample data here](https://www.expasy.org/)
    - [Here's a specific example data list](https://swissmodel.expasy.org/repository?query=blast)
    - [Ancestral genome of Lepidosauria](https://omabrowser.org/oma/ancestralgenome/Lepidosauria/info/)
      - [Ancient Lizard!](https://en.wikipedia.org/wiki/Lepidosauria)
      - [Many Late Cretaceous (approximately 75 mybp) fossil species are assignable to modern families and some Late Jurassic (135 + mybp) taxa are recognizable as varanoids related to living monitor lizards and snakes (Estes, 1983).](https://www.sciencedirect.com/topics/agricultural-and-biological-sciences/lepidosauria)  **NOTE**: Jurassic Park is not a popular film in Philadelphia right now (Sixers up 3-0, now in a 3-2 series against the Toronto Goldblum's.. er, Velociraptors... er, Raptors.)
      - [Let's get some Jurassic Lizard Data](https://www.sciencedirect.com/topics/agricultural-and-biological-sciences/lepidosauria)
      - [Short Analytical Video](https://www.youtube.com/watch?v=dQw4w9WgXcQ)
      - [Varying crosslinking motifs drive the mesoscale mechanics of actin-microtubule composites](https://www.nature.com/articles/s41598-019-49236-4) **Important Jurassic Lizard Sequence**
      - BLAST ID: A0A670JRM7
      
**HUMAN SEQUENCE FOUND** 
      - [Hemophilia](https://www.uniprot.org/uniprot/?query=keyword%3A%22Hemophilia+%5BKW-0355%5D%22+AND+reviewed%3Ayes&sort=length)





In [10]:
## Example
import requests
requests.__version__
# A few basic variables to make our query strings shorter
BASE = 'http://www.uniprot.org'
KB_ENDPOINT = '/uniprot/'
TOOL_ENDPOINT = '/uploadlists/'

fullURL = ('http://www.uniprot.org/uniprot/?'
                    'query=keyword%3A"Hemophilia+[KW-0355]"+AND+reviewed%3Ayes&sort=length&'
                    'format=list')


r = requests.get(fullURL, stream=True)

for line in r.iter_lines(decode_unicode=True):
    if line:
        print(line)
        # Remember the last accession listed; it is used in the cells below
        filerunner = line

result = requests.get(fullURL)

if result.ok:
    print(result.text)
else:
    print('Something went wrong ', result.status_code)

result.headers.get('content-type')

print(filerunner)

P00451
P00741
P00740
P19540
P00451
P00741
P00740
P19540

P19540


In [11]:
#import urllib.request
#filerunner = urllib.request.urlopen("https://www.uniprot.org/uniprot/P00451.fasta")



-   Open a Swiss-Prot file over the internet from the ExPASy database
    (see section \[subsec:expasy\_swissprot\]):



In [12]:
from Bio import ExPASy
from Bio import SeqIO

with ExPASy.get_sprot_raw(filerunner) as handle:
    seq_record = SeqIO.read(handle, "swiss")
print(seq_record.id)
print(seq_record.name)
print(seq_record.description)
print(repr(seq_record.seq))
print("Length %i" % len(seq_record))
print(seq_record.annotations["keywords"])
# Open a fresh handle for the Bio.SwissProt examples below (the one above is closed)
handle = ExPASy.get_sprot_raw(filerunner)


P19540
FA9_CANLF
RecName: Full=Coagulation factor IX; EC=3.4.21.22 {ECO:0000250|UniProtKB:P00740}; AltName: Full=Christmas factor; Contains: RecName: Full=Coagulation factor IXa light chain; Contains: RecName: Full=Coagulation factor IXa heavy chain; Flags: Precursor;
Seq('MAEASGLVTVCLLGYLLSAECAVFLDRENATKILSRPKRYNSGKLEEFVRGNLE...KLT')
Length 452
['Blood coagulation', 'Calcium', 'Cleavage on pair of basic residues', 'Disease variant', 'Disulfide bond', 'EGF-like domain', 'Gamma-carboxyglutamic acid', 'Glycoprotein', 'Hemophilia', 'Hemostasis', 'Hydrolase', 'Hydroxylation', 'Magnesium', 'Metal-binding', 'Phosphoprotein', 'Protease', 'Reference proteome', 'Repeat', 'Secreted', 'Serine protease', 'Signal', 'Sulfation', 'Zymogen']



The key point is that for the parser, it doesn’t matter how the handle
was created, as long as it points to data in the Swiss-Prot format.
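As a minimal illustration of this point, the sketch below writes a few lines of Swiss-Prot-style text to a temporary directory and reads them back through three different handles: a plain file, a gzipped file opened in text mode, and an in-memory `io.StringIO`. The file names and the two-line content are placeholders (not a complete, parseable record). All three handles yield the same lines, which is all a parser ever sees.

```python
import gzip
import io
import tempfile
from pathlib import Path

# Placeholder text in Swiss-Prot style (NOT a complete, parseable record).
text = "ID   EXAMPLE_ENTRY   Reviewed;   10 AA.\nAC   X00000;\n"

with tempfile.TemporaryDirectory() as tmp:
    plain_path = Path(tmp) / "example.dat"
    gz_path = Path(tmp) / "example.dat.gz"
    plain_path.write_text(text)
    with gzip.open(gz_path, "wt") as out_handle:
        out_handle.write(text)

    # Three ways to obtain a text handle to the same data.
    with open(plain_path) as handle:
        from_file = handle.readlines()
    with gzip.open(gz_path, "rt") as handle:  # "rt" yields str rather than bytes
        from_gzip = handle.readlines()
    with io.StringIO(text) as handle:
        from_memory = handle.readlines()

print(from_file == from_gzip == from_memory)
```

Opening the gzipped file in `"rt"` mode matters here, because the default mode of `gzip.open` returns bytes rather than text.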

We can use `Bio.SeqIO` as described in
Section \[sec:SeqIO\_ExPASy\_and\_SwissProt\] to get file format
agnostic `SeqRecord` objects. Alternatively, we can use `Bio.SwissProt`
to get `Bio.SwissProt.Record` objects, which are a much closer match to the
underlying file format.

To read one Swiss-Prot record from the handle, we use the function
`read()`:



In [13]:
from Bio import SwissProt
record = SwissProt.read(handle)



This function should be used if the handle points to exactly one
Swiss-Prot record. It raises a `ValueError` if no Swiss-Prot record was
found, and also if more than one record was found.
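The "exactly one record" contract can be sketched independently of Biopython. The helper `read_exactly_one` below is hypothetical and is not how `Bio.SwissProt.read` is implemented; it only mirrors the behaviour described above for an arbitrary iterable of records.

```python
def read_exactly_one(records):
    """Return the single item of an iterable, or raise ValueError otherwise."""
    iterator = iter(records)
    try:
        first = next(iterator)
    except StopIteration:
        raise ValueError("No record found") from None
    # Any further item means the input held more than one record.
    for _ in iterator:
        raise ValueError("More than one record found")
    return first


print(read_exactly_one(["only record"]))

for bad_input in ([], ["first", "second"]):
    try:
        read_exactly_one(bad_input)
    except ValueError as err:
        print("ValueError:", err)
```

With Biopython itself, `SwissProt.read(handle)` covers the single-record case, while `SwissProt.parse(handle)` is the function to use when a file may contain several records.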

We can now print out some information about this record:



In [14]:
print(record.description)

RecName: Full=Coagulation factor IX; EC=3.4.21.22 {ECO:0000250|UniProtKB:P00740}; AltName: Full=Christmas factor; Contains: RecName: Full=Coagulation factor IXa light chain; Contains: RecName: Full=Coagulation factor IXa heavy chain; Flags: Precursor;


In [15]:
for ref in record.references:
    print("authors:", ref.authors)
    print("title:", ref.title)

authors: Axelrod J.H., Read M.S., Brinkhous K.M., Verma I.M.
title: Phenotypic correction of factor IX deficiency in skin fibroblasts of hemophilic dogs.
authors: Evans J.P., Watzke H.H., Ware J.L., Stafford D.W., High K.A.
title: Molecular cloning of a cDNA encoding canine factor IX.
authors: Evans J.P., Brinkhous K.M., Brayer G.D., Reisner H.M., High K.A.
title: Canine hemophilia B resulting from a point mutation with unusual consequences.


In [16]:
print(record.organism_classification)

['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Euteleostomi', 'Mammalia', 'Eutheria', 'Laurasiatheria', 'Carnivora', 'Caniformia', 'Canidae', 'Canis']



However, uncompressing a large file takes time, and each time you open
the file for reading in this way, it has to be decompressed on the fly.
So, if you can spare the disk space you’ll save time in the long run if
you first decompress the file to disk, to get the `uniprot_sprot.dat`
file inside. Then you can open the file for reading as usual:




As of June 2009, the full Swiss-Prot database downloaded from ExPASy
contained 468851 Swiss-Prot records. One concise way to build up a list
of the record descriptions is with a list comprehension:




Or, using a for loop over the record iterator:
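Both approaches can be sketched as follows. To keep the example runnable without the full database, `fake_records` is a hypothetical stand-in for the record iterator, and its two descriptions are shortened placeholders. With Biopython and a decompressed `uniprot_sprot.dat`, the iterator would instead come from `SwissProt.parse(handle)`.

```python
from types import SimpleNamespace


def fake_records():
    """Hypothetical stand-in for SwissProt.parse(handle): yields record-like objects."""
    yield SimpleNamespace(description="RecName: Full=Chalcone synthase 3;")
    yield SimpleNamespace(description="RecName: Full=Coagulation factor IX;")


# List comprehension: builds the whole list in one expression.
descriptions = [record.description for record in fake_records()]

# Equivalent for loop over the record iterator.
descriptions_loop = []
for record in fake_records():
    descriptions_loop.append(record.description)

print(len(descriptions))
```

Because the records are produced one at a time, either form keeps only the descriptions in memory, not the full records.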



In [17]:
dir(record)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'accessions',
 'annotation_update',
 'comments',
 'created',
 'cross_references',
 'data_class',
 'description',
 'entry_name',
 'features',
 'gene_name',
 'host_organism',
 'host_taxonomy_id',
 'keywords',
 'molecule_type',
 'organelle',
 'organism',
 'organism_classification',
 'protein_existence',
 'references',
 'seqinfo',
 'sequence',
 'sequence_length',
 'sequence_update',
 'taxonomy_id']


### Parsing the Swiss-Prot keyword and category list

Swiss-Prot also distributes a file `keywlist.txt`, which lists the
keywords and categories used in Swiss-Prot. The file contains entries in
the following form:



```
ID   2Fe-2S.
AC   KW-0001
DE   Protein which contains at least one 2Fe-2S iron-sulfur cluster: 2 iron
DE   atoms complexed to 2 inorganic sulfides and 4 sulfur atoms of
DE   cysteines from the protein.
SY   Fe2S2; [2Fe-2S] cluster; [Fe2S2] cluster; Fe2/S2 (inorganic) cluster;
SY   Di-mu-sulfido-diiron; 2 iron, 2 sulfur cluster binding.
GO   GO:0051537; 2 iron, 2 sulfur cluster binding
HI   Ligand: Iron; Iron-sulfur; 2Fe-2S.
HI   Ligand: Metal-binding; 2Fe-2S.
CA   Ligand.
//
ID   3D-structure.
AC   KW-0002
DE   Protein, or part of a protein, whose three-dimensional structure has
DE   been resolved experimentally (for example by X-ray crystallography or
DE   NMR spectroscopy) and whose coordinates are available in the PDB
DE   database. Can also be used for theoretical models.
HI   Technical term: 3D-structure.
CA   Technical term.
//
ID   3Fe-4S.
...

```


The entries in this file can be parsed by the `parse` function in the
`Bio.SwissProt.KeyWList` module. Each entry is then stored as a
`Bio.SwissProt.KeyWList.Record`, which is a Python dictionary.



In [18]:
from Bio.SwissProt import KeyWList

# Assumes keywlist.txt has been downloaded to the working directory
with open("keywlist.txt") as handle:
    for record in KeyWList.parse(handle):
        print(record['ID'])
        print(record['DE'])


This prints



```
2Fe-2S.
Protein which contains at least one 2Fe-2S iron-sulfur cluster: 2 iron atoms
complexed to 2 inorganic sulfides and 4 sulfur atoms of cysteines from the
protein.
...

```
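To make the structure of these entries concrete, here is a simplified sketch, using only the standard library, that turns text in the `keywlist.txt` format into one dictionary per entry. It is not the implementation used by `Bio.SwissProt.KeyWList`, and the function name `parse_keywlist_text` is hypothetical. It assumes every line starts with a two-letter code followed by three spaces, joins repeated lines of the same code with a space, and treats `//` as the end of an entry. The sample is abbreviated from the listing above.

```python
def parse_keywlist_text(text):
    """Yield one dict per entry, mapping each two-letter code to its joined text."""
    entry = {}
    for line in text.splitlines():
        if line.startswith("//"):
            if entry:
                yield entry
            entry = {}
            continue
        code, value = line[:2], line[5:].strip()
        if not code.strip():
            continue  # skip blank lines
        # Continuation lines (e.g. several DE lines) are joined with a space.
        entry[code] = f"{entry[code]} {value}" if code in entry else value
    # An unterminated trailing entry (no closing //) is ignored.


sample = """\
ID   2Fe-2S.
AC   KW-0001
DE   Protein which contains at least one 2Fe-2S iron-sulfur cluster: 2 iron
DE   atoms complexed to 2 inorganic sulfides and 4 sulfur atoms of
DE   cysteines from the protein.
CA   Ligand.
//
ID   3D-structure.
AC   KW-0002
CA   Technical term.
//
"""

for entry in parse_keywlist_text(sample):
    print(entry["ID"], entry["AC"], entry["CA"])
```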


Parsing Prosite records
-----------------------

Prosite is a database containing protein domains, protein families,
functional sites, as well as the patterns and profiles to recognize
them. Prosite was developed in parallel with Swiss-Prot. In Biopython, a
Prosite record is represented by the `Bio.ExPASy.Prosite.Record` class,
whose members correspond to the different fields in a Prosite record.

In general, a Prosite file can contain more than one Prosite record.
For example, the full set of Prosite records, which can be downloaded as
a single file (`prosite.dat`) from the [ExPASy FTP
site](ftp://ftp.expasy.org/databases/prosite/prosite.dat), contains 2073
records (version 20.24 released on 4 December 2007). To parse such a
file, we again make use of an iterator:
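The iterator idea can be illustrated without Biopython. The generator below is a hypothetical, simplified sketch and not the `Bio.ExPASy.Prosite` parser: it yields the lines of one `//`-terminated record at a time, so only a single record is held in memory. The two records in `sample` are heavily abbreviated and serve only as illustration.

```python
import io


def iter_records(handle):
    """Lazily yield each '//'-terminated record as a list of lines."""
    lines = []
    for line in handle:
        line = line.rstrip("\n")
        if line == "//":
            if lines:
                yield lines
            lines = []
        elif line:
            lines.append(line)


sample = io.StringIO(
    "ID   ASN_GLYCOSYLATION; PATTERN.\n"
    "AC   PS00001;\n"
    "//\n"
    "ID   PKC_PHOSPHO_SITE; PATTERN.\n"
    "AC   PS00005;\n"
    "//\n"
)

records = iter_records(sample)
first = next(records)                # fetch just the first record
print(first[1])
remaining = sum(1 for _ in records)  # consume the rest one at a time
print(remaining)
```

With Biopython, `Prosite.parse(handle)` from `Bio.ExPASy.Prosite` plays the role of `iter_records` and yields `Record` objects instead of lists of lines.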




Accessing the ExPASy server
---------------------------

Swiss-Prot, Prosite, and Prosite documentation records can be downloaded
from the ExPASy web server at <http://www.expasy.org>. Six kinds of
queries are available from ExPASy:

get\_prodoc\_entry

:   To download a Prosite documentation record in HTML format

get\_prosite\_entry

:   To download a Prosite record in HTML format

get\_prosite\_raw

:   To download a Prosite or Prosite documentation record in raw format

get\_sprot\_raw

:   To download a Swiss-Prot record in raw format

sprot\_search\_ful

:   To search for a Swiss-Prot record in (nearly) all fields

sprot\_search\_de

:   To search for a Swiss-Prot record in the ID, DE, GN, OS and OG lines

To access this web server from a Python script, we use the `Bio.ExPASy`
module.

### Retrieving a Swiss-Prot record {#subsec:expasy_swissprot}

Let’s say we are looking at chalcone synthases for orchids (see
section \[sec:orchids\] for some justification for looking for
interesting things about orchids). Chalcone synthase is involved in
flavonoid biosynthesis in plants, and flavonoids make lots of cool
things like pigment colors and UV protectants.

If you do a search on Swiss-Prot, you can find three orchid proteins for
Chalcone Synthase, id numbers O23729, O23730, O23731. Now, let’s write a
script which grabs these, and parses out some interesting information.

First, we grab the records, using the `get_sprot_raw()` function of
`Bio.ExPASy`. This function is very nice since you can feed it an id and
get back a handle to a raw text record (no HTML to mess with!). We can
then use `Bio.SwissProt.read` to pull out the Swiss-Prot record, or
`Bio.SeqIO.read` to get a SeqRecord. The following code accomplishes
what I just wrote:



In [19]:
from Bio import ExPASy
from Bio import SwissProt

In [20]:
accessions = ["O23729", "O23730", "O23731"]
records = []

In [22]:
for accession in accessions:
    handle = ExPASy.get_sprot_raw(accession)
    record = SwissProt.read(handle)
    records.append(record)


If the accession number you provided to `ExPASy.get_sprot_raw` does not
exist, then `SwissProt.read(handle)` will raise a `ValueError`. You can
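catch `ValueError` exceptions to detect invalid accession numbers:



In [23]:
for accession in accessions:
    try:
        # Depending on the Biopython version, the ValueError may be raised
        # by either of these two calls, so both are inside the try block
        handle = ExPASy.get_sprot_raw(accession)
        record = SwissProt.read(handle)
    except ValueError:
        print("WARNING: Accession %s not found" % accession)
        continue  # do not append a stale record from a previous iteration
    records.append(record)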


### Searching Swiss-Prot

Now, you may remark that I knew the records’ accession numbers
beforehand. Indeed, `get_sprot_raw()` needs either the entry name or an
accession number. When you don’t have them handy, you can use one of the
`sprot_search_de()` or `sprot_search_ful()` functions.

`sprot_search_de()` searches in the ID, DE, GN, OS and OG lines;
`sprot_search_ful()` searches in (nearly) all the fields. They are
detailed on <http://www.expasy.org/cgi-bin/sprot-search-de> and
<http://www.expasy.org/cgi-bin/sprot-search-ful> respectively. Note that
they don’t search in TrEMBL by default (argument `trembl`). Note also
that they return HTML pages; however, accession numbers are quite easily
extractable:
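A minimal sketch of the extraction step, using only the standard library, might look like this. The HTML fragment and the link pattern `/uniprot/<accession>` are assumptions made for illustration; the markup actually returned by the server may differ, in which case the regular expression needs adjusting.

```python
import re

# Hypothetical fragment of a search results page (assumed markup).
html_results = """
<a href="/uniprot/O23729">O23729</a> CHS3_BROFI
<a href="/uniprot/O23730">O23730</a>
<a href="/uniprot/O23731">O23731</a>
"""

# Capture the word characters that follow /uniprot/ in each link target.
ids = re.findall(r'href="/uniprot/(\w+)"', html_results, flags=re.IGNORECASE)
print(ids)
```

The resulting list of accession numbers can then be passed, one at a time, to `ExPASy.get_sprot_raw()`.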




For non-existing accession numbers, `ExPASy.get_prosite_raw` returns a
handle to an empty string. When faced with an empty string,
`Prosite.read` and `Prodoc.read` will raise a ValueError. You can catch
these exceptions to detect invalid accession numbers.

The functions `get_prosite_entry()` and `get_prodoc_entry()` are used to
download Prosite and Prosite documentation records in HTML format. To
create a web page showing one Prosite record, you can use

