# Data access in the SWISS-MODEL Repository


The SWISS-MODEL Repository (SMR) currently provides models for the Swiss-Prot section of UniProtKB and for the reference proteomes of the following model organisms: H. sapiens, M. musculus, C. elegans, E. coli K12, A. thaliana, D. melanogaster, S. cerevisiae, Caulobacter crescentus, M. tuberculosis, P. aeruginosa, Staphylococcus aureus, and P. falciparum.

In [1]:
# Import dependencies
import pandas as pd
import json

## Two modes for bulk data access:
1. Download models for a reference proteome (13 model organisms)
2. Use API services to access data for a specific UniProtKB sequence



### 1. Download models for a reference proteome (13 model organisms)

This example uses the Homo sapiens reference proteome. The metadata file (covering both models and experimental structures) was downloaded from [SMR](https://swissmodel.expasy.org/repository/).

In [3]:
# Open the metadata file downloaded from SMR
with open('../../data/swiss-model/metadata/SWISS-MODEL_Repository/INDEX.json', 'r') as file:
    variable_name = json.load(file)

In [4]:
# Display the loaded JSON object
variable_name

{'index': [{'uniprot_ac': 'P31946',
   'iso_id': 'P31946-1',
   'uniprot_seq_length': 246,
   'uniprot_seq_md5': 'c82f2efd57f939ee3c4e571708dd31a8',
   'coordinate_id': '6041e7c2a2248c526cf471a1',
   'provider': 'SWISSMODEL',
   'from': 3,
   'to': 232,
   'template': '2c1j.1.A',
   'qmean': -0.6144063009,
   'qmean_norm': 0.7498794727,
   'seqid': 86.8852462769,
   'url': 'https://swissmodel.expasy.org/repository/uniprot/P31946.pdb?provider=swissmodel&from=3&to=232&template=2c1j.1.A&provider=swissmodel'},
  {'uniprot_ac': 'P31946-2',
   'iso_id': 'P31946-2',
   'uniprot_seq_length': 244,
   'uniprot_seq_md5': '0bd136ab4d2a4ee6a9caf7e4c0ffc256',
   'coordinate_id': '6041f0521cd80f52a4c27cf7',
   'provider': 'SWISSMODEL',
   'from': 1,
   'to': 232,
   'template': '1qjb.1.A',
   'qmean': -0.8827656344,
   'qmean_norm': 0.7416823776,
   'seqid': 87.2427978516,
   'url': 'https://swissmodel.expasy.org/repository/uniprot/P31946-2.pdb?from=1&to=232&template=1qjb.1.A&provider=swissmodel'},
 

In [5]:
index = variable_name['index']
len(index)

155644

In [6]:
# Convert to a dataframe
df = pd.DataFrame(index)
df

| | uniprot_ac | iso_id | uniprot_seq_length | uniprot_seq_md5 | coordinate_id | provider | from | to | template | qmean | qmean_norm | seqid | url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | P31946 | P31946-1 | 246 | c82f2efd57f939ee3c4e571708dd31a8 | 6041e7c2a2248c526cf471a1 | SWISSMODEL | 3 | 232 | 2c1j.1.A | -0.614406 | 0.749879 | 86.8852 | https://swissmodel.expasy.org/repository/unipr... |
| 1 | P31946-2 | P31946-2 | 244 | 0bd136ab4d2a4ee6a9caf7e4c0ffc256 | 6041f0521cd80f52a4c27cf7 | SWISSMODEL | 1 | 232 | 1qjb.1.A | -0.882766 | 0.741682 | 87.2428 | https://swissmodel.expasy.org/repository/unipr... |
| 2 | P31946 | P31946-1 | 246 | c82f2efd57f939ee3c4e571708dd31a8 | 5f3dae76676e7622ac974741 | PDB | 3 | 233 | 6gnj | | | | https://swissmodel.expasy.org/repository/unipr... |
| 3 | P31946 | P31946-1 | 246 | c82f2efd57f939ee3c4e571708dd31a8 | 5f3da464676e7622ac967bee | PDB | 3 | 232 | 6gnn | | | | https://swissmodel.expasy.org/repository/unipr... |
| 4 | P31946 | P31946-1 | 246 | c82f2efd57f939ee3c4e571708dd31a8 | 5f3d9c4a676e7622ac95d5f2 | PDB | 3 | 233 | 5n10 | | | | https://swissmodel.expasy.org/repository/unipr... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 155639 | A0A2R8Y5M8 | | 258 | 76b226b5f6ea1281fda8fba1fe4e4d69 | 6041b47edfe32bd1e86f0abc | SWISSMODEL | 43 | 258 | 4pyk.2.B | -1.481664 | 0.720044 | 37.963 | https://swissmodel.expasy.org/repository/unipr... |
| 155640 | A0A2R8Y5M8 | | 258 | 76b226b5f6ea1281fda8fba1fe4e4d69 | 6041b47edfe32bd1e86f0ac0 | SWISSMODEL | 85 | 254 | 4ymg.1.A | -1.943189 | 0.694864 | 26.3804 | https://swissmodel.expasy.org/repository/unipr... |
| 155641 | A0A2R8Y5M8 | | 258 | 76b226b5f6ea1281fda8fba1fe4e4d69 | 6041b47edfe32bd1e86f0ac4 | SWISSMODEL | 43 | 258 | 5lsa.1.A | -0.742443 | 0.749859 | 37.963 | https://swissmodel.expasy.org/repository/unipr... |
| 155642 | A0A1B0GVB3 | | 256 | e49cb9d6aa6a03c22790beecd0a40e3f | 601a4295acae65556a741528 | SWISSMODEL | 226 | 251 | 6lkf.1.A | 0.109521 | 0.774163 | 38.4615 | https://swissmodel.expasy.org/repository/unipr... |


### Metadata definition 

In [7]:
pd.DataFrame(df.columns)

| | 0 |
|---|---|
| 0 | uniprot_ac |
| 1 | iso_id |
| 2 | uniprot_seq_length |
| 3 | uniprot_seq_md5 |
| 4 | coordinate_id |
| 5 | provider |
| 6 | from |
| 7 | to |
| 8 | template |
| 9 | qmean |


In [8]:
# Check unique providers (SWISS-MODEL, PDB)
df["provider"].unique()

array(['SWISSMODEL', 'PDB'], dtype=object)

In [19]:
# Keep only the rows whose provider is SWISS-MODEL (drop PDB structures)
df_swissmodel = df.loc[df["provider"] == "SWISSMODEL"]
len(df_swissmodel)

91463

A total of 91,463 SWISS-MODEL models are available for Homo sapiens.

In [16]:
# Count how many models are available for each unique UniProtKB accession
df_swissmodel["uniprot_ac"].value_counts()

A6NN14      30
Q02388      30
Q02388-2    30
O43345-2    29
P58107      28
            ..
Q13621-3     1
Q9UBN4-6     1
O43711       1
Q8NGP2       1
Q15723-4     1
Name: uniprot_ac, Length: 36891, dtype: int64

Models are available for 36,891 unique UniProtKB accessions (isoform accessions such as Q02388-2 are counted separately).
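The per-accession counts can be extended with quality scores. The following is a minimal sketch, assuming the column names shown in the index (`uniprot_ac`, `template`, `qmean_norm`, `seqid`). The small `toy` frame is built from a few rows displayed earlier and stands in for `df_swissmodel`; the summary column names are made up for illustration.

```python
import pandas as pd

# Toy stand-in for df_swissmodel: a few rows copied from the index output.
toy = pd.DataFrame(
    [
        {"uniprot_ac": "P31946", "template": "2c1j.1.A", "qmean_norm": 0.749879, "seqid": 86.8852},
        {"uniprot_ac": "A6NN14", "template": "5v3m.1.C", "qmean_norm": 0.671339, "seqid": 50.974},
        {"uniprot_ac": "A6NN14", "template": "5v3m.1.C", "qmean_norm": 0.657585, "seqid": 51.6234},
        {"uniprot_ac": "A6NN14", "template": "5wjq.1.C", "qmean_norm": 0.654303, "seqid": 53.7367},
    ]
)

# One row per accession: model count, distinct templates, and score summaries.
summary = (
    toy.groupby("uniprot_ac")
    .agg(
        n_models=("template", "count"),
        n_templates=("template", "nunique"),
        best_qmean_norm=("qmean_norm", "max"),
        mean_seqid=("seqid", "mean"),
    )
    .sort_values("n_models", ascending=False)
)
print(summary)
```

On the real data, replacing `toy` with `df_swissmodel` should give the same kind of table for every accession.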

In [17]:
# Check the multiple entries available for a single sequence
uniprot = df.loc[df["uniprot_ac"] == "A6NN14"]
uniprot

| | uniprot_ac | iso_id | uniprot_seq_length | uniprot_seq_md5 | coordinate_id | provider | from | to | template | qmean | qmean_norm | seqid | url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 153170 | A6NN14 | | 1252 | ea25bb254fdb4e006a3c96382f788da4 | 6046a2ad57396bb9ff433ea1 | SWISSMODEL | 655 | 931 | 5v3m.1.C | -2.557852 | 0.671339 | 50.974 | https://swissmodel.expasy.org/repository/unipr... |
| 153171 | A6NN14 | | 1252 | ea25bb254fdb4e006a3c96382f788da4 | 6046a2ac57396bb9ff433e7d | SWISSMODEL | 851 | 1155 | 5v3m.1.C | -3.216336 | 0.657585 | 51.6234 | https://swissmodel.expasy.org/repository/unipr... |
| 153172 | A6NN14 | | 1252 | ea25bb254fdb4e006a3c96382f788da4 | 6046a2ad57396bb9ff433e95 | SWISSMODEL | 373 | 651 | 5wjq.1.C | -2.995696 | 0.654303 | 53.7367 | https://swissmodel.expasy.org/repository/unipr... |
| 153173 | A6NN14 | | 1252 | ea25bb254fdb4e006a3c96382f788da4 | 6046a2ad57396bb9ff433e8f | SWISSMODEL | 263 | 539 | 5v3m.1.C | -2.768441 | 0.662862 | 51.6234 | https://swissmodel.expasy.org/repository/unipr... |
| 153174 | A6NN14 | | 1252 | ea25bb254fdb4e006a3c96382f788da4 | 6046a2ad57396bb9ff433e9b | SWISSMODEL | 195 | 455 | 5v3m.1.C | -2.698588 | 0.664927 | 48.9726 | https://swissmodel.expasy.org/repository/unipr... |
| 153175 | A6NN14 | | 1252 | ea25bb254fdb4e006a3c96382f788da4 | 6046a2ac57396bb9ff433e6b | SWISSMODEL | 739 | 1015 | 5v3m.1.C | -2.902971 | 0.657447 | 51.2987 | https://swissmodel.expasy.org/repository/unipr... |
| 153176 | A6NN14 | | 1252 | ea25bb254fdb4e006a3c96382f788da4 | 6046a2ac57396bb9ff433e89 | SWISSMODEL | 317 | 595 | 5wjq.1.C | -2.622521 | 0.669244 | 53.0249 | https://swissmodel.expasy.org/repository/unipr... |
| 153177 | A6NN14 | | 1252 | ea25bb254fdb4e006a3c96382f788da4 | 6046a2ad57396bb9ff433ea7 | SWISSMODEL | 459 | 735 | 5v3m.1.C | -2.482402 | 0.674376 | 52.2727 | https://swissmodel.expasy.org/repository/unipr... |
| 153178 | A6NN14 | | 1252 | ea25bb254fdb4e006a3c96382f788da4 | 6046a2ac57396bb9ff433e4d | SWISSMODEL | 515 | 791 | 5v3m.1.C | -2.858754 | 0.659227 | 52.2727 | https://swissmodel.expasy.org/repository/unipr... |
| 153179 | A6NN14 | | 1252 | ea25bb254fdb4e006a3c96382f788da4 | 6046a2ad57396bb9ff433ead | SWISSMODEL | 627 | 903 | 5v3m.1.C | -2.453341 | 0.675546 | 52.2727 | https://swissmodel.expasy.org/repository/unipr... |


[Zinc finger protein](https://swissmodel.expasy.org/repository/uniprot/A6NN14?template=5v3m.1.C&range=319-595): an example of a protein sequence that is modelled with multiple templates, each covering a different region of the sequence.

<img src="../images/zinc_finger.png">





Observations:

- No isoform is recorded for this sequence (`iso_id` is empty).
- The same template is used for several overlapping sequence ranges.



In [20]:
uniprot["url"]

153170    https://swissmodel.expasy.org/repository/unipr...
153171    https://swissmodel.expasy.org/repository/unipr...
153172    https://swissmodel.expasy.org/repository/unipr...
153173    https://swissmodel.expasy.org/repository/unipr...
153174    https://swissmodel.expasy.org/repository/unipr...
153175    https://swissmodel.expasy.org/repository/unipr...
153176    https://swissmodel.expasy.org/repository/unipr...
153177    https://swissmodel.expasy.org/repository/unipr...
153178    https://swissmodel.expasy.org/repository/unipr...
153179    https://swissmodel.expasy.org/repository/unipr...
153180    https://swissmodel.expasy.org/repository/unipr...
153181    https://swissmodel.expasy.org/repository/unipr...
153182    https://swissmodel.expasy.org/repository/unipr...
153183    https://swissmodel.expasy.org/repository/unipr...
153184    https://swissmodel.expasy.org/repository/unipr...
153185    https://swissmodel.expasy.org/repository/unipr...
153186    https://swissmodel.expasy.org/

### 2. Use API services to access data for a specific UniProtKB sequence


Endpoint construction:

Base URL = `https://swissmodel.expasy.org/repository/uniprot/`

Parameters = `<AccessionID>.json?provider=swissmodel`
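The endpoint pattern can be wrapped in a small helper so that URLs for several accessions are built the same way. This is a sketch only: `smr_json_url` is a hypothetical name, and the accessions are taken from the index shown earlier. No request is sent here.

```python
BASE_URL = "https://swissmodel.expasy.org/repository/uniprot/"


def smr_json_url(accession, provider="swissmodel"):
    """Build the JSON endpoint URL for one UniProtKB accession."""
    return f"{BASE_URL}{accession}.json?provider={provider}"


# Accessions that appear in the index output.
for accession in ["A6NN14", "P31946"]:
    print(smr_json_url(accession))
```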


In [21]:
# API call returns list of models + experimental structures
import requests

url = "https://swissmodel.expasy.org/repository/uniprot/A6NN14.json?provider=swissmodel"
response = requests.get(url).json()
print(json.dumps(response, indent=4, sort_keys=True))

{
    "api_version": "2.0",
    "query": {
        "ac": "A6NN14",
        "provider": [
            "swissmodel"
        ]
    },
    "query_date": "2021-04-06T19:39:34.015Z",
    "result": {
        "crc64": "DA92ED8A0114DE8F",
        "md5": "ea25bb254fdb4e006a3c96382f788da4",
        "sequence": "MPGAPGSLEMGPLTFRDVTIEFSLEEWQCLDTVQQNLYRDVMLENYRNLVFLGMAVFKPDLITCLKQGKEPWNMKRHEMVTKPPVMRSHFTQDLWPDQSTKDSFQEVILRTYARCGHKNLRLRKDCKSANEGKMHKEGYNKLNQCRTATQRKIFQCNKHMKVFHKYSNRNKVRHTKKKTFKCIKCSKSFFMLSCLIRHKRIHIRQNIYKCEERGKAFKSFSTLTKHKIIHTEDKPYKYKKCGNAFKFSSTFTKHKRIHTGETPFRCEECGKAFNQSSNLTDHKRIHTGEKTYKCEECGKAFKGSSNFNAHKVIHTAEKPYKCEDCGKTFNHFSALRKHKIIHTGKKPYKREECGKAFSQSSTLRKHEIIHTGEKPYKCEECGKAFKWSSKLTVHKVVHTGEKPYKCEECGKAFSQFSTLKKHKIIHTGKKPYKCEECGKAFNSSSTLMKHKIIHTGEKPYKCEECGKAFRQSSHLTRHKAIHTGEKPYKCEECGKAFNHFSDLRRHKIIHTGKKPYKCEECGKAFSQSSTLRNHQIIHTGEKPYKCEECGKAFKWSSKLTVHKVIHTGEKPCKCEECGKAFKHFSALRKHKVIHTREKLYKCEECGKAFNNSSILAKHKIIHTGKKPYKCEECGKAFRQSSHLTRHKAIHTGEKPYKCEECGKAFSHFSALRRHKIIHTGKKPYKCEECGKAFSHFSA

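The `result` object carries both the sequence and an `md5` field, so a quick consistency check is possible. The sketch below assumes that `md5` is the hexadecimal MD5 digest of the plain amino-acid string; that assumption is not confirmed by the output above. The function names are hypothetical, and the toy `result` dictionary stands in for `response["result"]`.

```python
import hashlib


def sequence_md5(sequence):
    """Hex MD5 digest of a sequence (assumed: plain upper-case letters, no whitespace)."""
    return hashlib.md5(sequence.strip().upper().encode("ascii")).hexdigest()


def md5_matches(result):
    """Compare the digest of result['sequence'] with the recorded result['md5']."""
    return sequence_md5(result["sequence"]) == result["md5"].lower()


# Toy stand-in for response["result"]; the digest is computed here, not taken from SMR.
toy_result = {"sequence": "MPGAPGSLEM", "md5": sequence_md5("MPGAPGSLEM")}
print(md5_matches(toy_result))
```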
Other custom parameters can be appended to a URL such as `https://swissmodel.expasy.org/repository/uniprot/P0DTD1?template=6m71.1.A`:

- `range=1024-1194`
- `from=1024&to=1194`
- `csm=969F65FCC0BC86FD`: the CRC64 checksum of the protein sequence. This can be used to ensure that the sequence for the given UniProtKB entry matches the SMR record.
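These parameters can be combined into a query string with the standard library rather than by hand. The following is a minimal sketch; `smr_page_url` is a hypothetical helper, and the parameter values are the ones listed here. A dictionary is used because `from` is a reserved word in Python and cannot be passed as a keyword argument.

```python
from urllib.parse import urlencode

BASE_URL = "https://swissmodel.expasy.org/repository/uniprot/"


def smr_page_url(accession, params=None):
    """Append optional query parameters to the repository URL of an accession."""
    url = f"{BASE_URL}{accession}"
    return f"{url}?{urlencode(params)}" if params else url


# Select a template and a residue range with `range`.
print(smr_page_url("P0DTD1", {"template": "6m71.1.A", "range": "1024-1194"}))

# The same region expressed with `from` and `to`.
print(smr_page_url("P0DTD1", {"template": "6m71.1.A", "from": 1024, "to": 1194}))
```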