# Introduction

Given a protein sequence and a structural ensemble, how do we know whether the protein is fully disordered (an IDP), contains some intrinsically disordered regions (IDRs), or is fully structured with some loops?

Following @Necci2016, intrinsically disordered proteins can be classified by the number of consecutive disordered residues:

 1. Short IDR: $5$–$19$ residues

 2. Long IDR:  $\ge$ 20 residues

 3. Fully disordered protein:  $\ge$ 50 residues and $> 95\%$ ID content.  
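
The thresholds above can be turned into a small classifier. This is only a sketch under stated assumptions: `classify_disorder` is a hypothetical helper, regions are 1-based inclusive `[start, end]` pairs that do not overlap, the "$\ge$ 50 residues" criterion is read as applying to the longest disordered stretch, and anything with fewer than 5 consecutive disordered residues is labeled "structured".

```python
def classify_disorder(regions, seq_len):
    """Classify a protein from its disordered regions (hypothetical helper).

    regions : list of [start, end] pairs, 1-based, inclusive, non-overlapping
    seq_len : total length of the sequence
    """
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")

    lengths = [end - start + 1 for start, end in regions]
    longest = max(lengths, default=0)       # longest consecutive disordered stretch
    id_content = sum(lengths) / seq_len     # fraction of disordered residues

    # Assumption: ">= 50 residues" refers to the longest disordered stretch
    if longest >= 50 and id_content > 0.95:
        return "fully disordered"
    if longest >= 20:
        return "long IDR"
    if longest >= 5:
        return "short IDR"
    # Assumption: shorter stretches are treated as loops of a structured protein
    return "structured"
```

For example, a single disordered region `[430, 471]` (42 residues) in a 530-residue protein falls into the "long IDR" class.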

In this example, we map sequence features from DisProt and MobiDB (which parts of the sequence are structured and which are disordered) onto protein entries in PED (the Protein Ensemble Database).
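
Conceptually, this mapping is an intersection between each annotated region and the UniProt range covered by a PED fragment. The sketch below illustrates the idea with a hypothetical helper, `clip_regions`, which is not part of the notebook. The numbers are those of the LEDGF fragment of PED00431 shown later in this document (fragment at UniProt positions 345 to 442, curated disorder at 430 to 471).

```python
def clip_regions(regions, frag_start, frag_end):
    """Keep only the parts of each region that lie inside [frag_start, frag_end].

    All positions are 1-based and inclusive (UniProt numbering).
    """
    clipped = []
    for start, end in regions:
        lo, hi = max(start, frag_start), min(end, frag_end)
        if lo <= hi:  # the region overlaps the fragment
            clipped.append((lo, hi))
    return clipped


overlap = clip_regions([[430, 471]], 345, 442)
print(overlap)  # [(430, 442)]
```

Here only residues 430 to 442 of the annotated region are actually present in the fragment, so 13 disordered residues are mapped onto the ensemble.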


# Code implementation

In [10]:
# import library
import json
import pandas as pd
import requests

from colorama import Fore, Back, Style
from itertools import groupby
from operator import itemgetter

In [2]:
# function to get disorder annotations from DisProt/MobiDB
def get_mobidb_disordered_stats(uniprot):
    """
    Print the disordered regions that MobiDB reports for a UniProt accession.

    Sources are tried in order of trust: 'curated-disorder-disprot', which is
    curated by the team, comes first, then homology, and prediction last.
    Prediction is not accurate; this observation comes from various manual checks.
    """
    keywords = ['curated-disorder-disprot', 'curated-disorder-priority',
                'homology-disorder-priority', 'prediction-disorder-priority']
    url = 'https://mobidb.org/api/download?format=json&acc=' + uniprot
    res = requests.get(url)

    # a non-200 status or a body that is not valid JSON is treated as a missing ID
    if res.status_code != 200:
        print("[ID does not exist in Database]")
        return
    try:
        result = res.json()
    except ValueError:
        print("[ID does not exist in Database]")
        return

    # report the first (most trusted) source that is present
    for key in keywords:
        if key in result:
            print(f"{key}: {tuple(result[key]['regions'])}")
            return
    print("([No_DISPROT_INFO])")


In [3]:
# Function to read data from PED
def print_seq(sequence):
    """Print a sequence in space-separated blocks of 10 residues."""
    for i in range(0, len(sequence), 10):
        print(sequence[i:i+10], end=' ')
    print("")
    
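def get_ped_stats(PEDID):
    """Print entry, chain, and fragment information for a PED entry such as 'PED00431'."""
    url = "https://deposition.proteinensemble.org/api/v1/entries/" + PEDID
    response = requests.get(url)
    response.raise_for_status()  # stop early if the entry cannot be retrieved
    res = response.json()
    # entry-level information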
    print("********************************************************************************")
    print(Back.GREEN +'Entry ID:')
    print(Style.RESET_ALL)
    print(res['entry_id'])
    print(f"Title: {res['description'].get('title')}")
    
    construct_chains = res['construct_chains']
#     print('Construct Chains:')
    n_chains=len(construct_chains)
    print(f"Number of chains in this entry: {n_chains}")
    
    #     working with single chain in construct
    for chain in construct_chains:
        print("----")
        if n_chains ==1:
            print(f"Chain name: {chain['chain_name']}")
        else:
            print(f"Chain name: {res['entry_id']}_{chain['chain_name']}")
        
#         print(chain['alignment'])
        n_fragments = len(chain['fragments'])
        fragments = chain['fragments']
        print(f'There are {n_fragments} fragment(s) in this chain')
        # working with fragment
        print(Back.GREEN +"***INFORMATION ABOUT EACH FRAGMENT***")
        print(Style.RESET_ALL)
        # print(chain['alignment'])
        for fragment, fragment_stats in zip(fragments, chain['fragments_stats']):
            print(Back.YELLOW + "Protein name:",end='')
            print(Style.RESET_ALL,end='')
            print(f" {fragment['description']}")
            # print("Source_sequence (Full sequence from UniProt):")
            # print_seq(fragment['source_sequence'])
            print(f"POSITION OF FRAGMENT IN UNIPROT SEQUENCE: ([{fragment['start_position']}, {fragment['end_position']}])")

            print(f"uniprot_acc: {fragment['uniprot_acc']}")
            print("****")  
            
            if fragment_stats['uniprot'] is not None:
                               
                # Length, starting and ending residue in PED PDB file
                print(f"Length_total_PDB: {fragment_stats['length_total_pdb']}")
                # print(f"Start Position alig: {fragment_stats['start_position_alig']}")
                # print(f"End Position align: {fragment_stats['end_position_alig']}")
                print(f"Start Position PDB: {fragment_stats['start_position_pdb']}")
                print(f"End Position PDB: {fragment_stats['end_position_pdb']}")
                
                print(f"UniProt: {fragment_stats['uniprot']}")
                print(f"Length_total_UniProt: {fragment_stats['length_total_uniprot']}")
                print("Disordered region from Disprot:")
                get_mobidb_disordered_stats(fragment_stats['uniprot'])
                print("\n...")

To use these functions, run the following code block, where `ID` is the numeric part of the PED entry ID.

In [4]:
ID=431
PEDID='PED'+f'{ID:05d}'
# print(PEDID)
get_ped_stats(PEDID)

********************************************************************************
Entry ID:
PED00431
Title: Solution structure of the LEDGF/p75 IBD - IWS1 complex
Number of chains in this entry: 1
----
Chain name: A
There are 3 fragment(s) in this chain
***INFORMATION ABOUT EACH FRAGMENT***
Protein name: Expression tag
POSITION OF FRAGMENT IN UNIPROT SEQUENCE: ([1, 6])
uniprot_acc: None
****
Protein name: LEDGF
POSITION OF FRAGMENT IN UNIPROT SEQUENCE: ([345, 442])
uniprot_acc: O75475
****
Length_total_PDB: 206
Start Position PDB: 345
End Position PDB: 442
UniProt: O75475
Length_total_UniProt: 530
Disordered region from Disprot:
curated-disorder-disprot: ([430, 471],)

...
Protein name: IWS1
POSITION OF FRAGMENT IN UNIPROT SEQUENCE: ([447, 548])
uniprot_acc: Q96ST2
****
Length_total_PDB: 206
Start Position PDB: 447
End Position PDB: 548
UniProt: Q96ST2
Length_total_UniProt: 819
Disordered region from Disprot:
prediction-disorder-priority: (

To manually check the information from MobiDB: 

In [75]:
get_mobidb_disordered_stats("P0DP29")

([No_DISPROT_INFO])


Note that when the output is `([No_DISPROT_INFO])`, we need to check the MobiDB website to see whether the protein is fully structured or does not exist in the database.
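
For downstream processing it can be more convenient to return the selected regions than to print them. The following is a minimal sketch, assuming the decoded MobiDB JSON has the same layout used in `get_mobidb_disordered_stats` (`result[key]['regions']`). `pick_disorder_regions` and `DISORDER_KEYS` are hypothetical names, and the dictionary in the usage lines is made up for illustration.

```python
# Same priority order as in get_mobidb_disordered_stats
DISORDER_KEYS = (
    'curated-disorder-disprot',
    'curated-disorder-priority',
    'homology-disorder-priority',
    'prediction-disorder-priority',
)


def pick_disorder_regions(result, keys=DISORDER_KEYS):
    """Return (source_key, regions) for the first key found in `result`.

    Returns (None, []) when none of the keys is present, which corresponds
    to the `([No_DISPROT_INFO])` case.
    """
    for key in keys:
        if key in result:
            return key, result[key]['regions']
    return None, []


# Illustrative input only; a real `result` would come from `res.json()`
example = {
    'prediction-disorder-priority': {'regions': [[1, 5]]},
    'curated-disorder-disprot': {'regions': [[430, 471]]},
}
source, regions = pick_disorder_regions(example)
print(source, regions)  # curated-disorder-disprot [[430, 471]]
```

Returning `None` for the source makes the "no annotation found" case explicit, so the caller can decide whether to check the MobiDB website manually.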

# Other functions

The previous sections covered the core functions for extracting information. This section adds a helper for a situation that comes up often: combining disordered regions from different sources.

Curated annotations of disordered residues may differ from predictions. To treat a residue as disordered if it is either curated or predicted as disordered, the two lists of regions can be merged with the following code. Note that the code accomplishes the task but does not standardize the results yet.

In [3]:
# Example:
disorder_regions_1 = [[8, 12]]
disorder_regions_2 = [[1, 10], [15, 19]]
disorder_region = disorder_regions_1+disorder_regions_2

In [4]:
disorder_region

[[8, 12], [1, 10], [15, 19]]

As we can see, the combined list contains three regions. They are not sorted, and `[8, 12]` overlaps `[1, 10]`, so the list does not yet describe distinct stretches of consecutive residues.

In [5]:
# extract every residue in the disordered regions
final_list = []
for reg in disorder_region:
    final_list += list(range(reg[0], reg[1]+1))

In [6]:
# residues that are disordered
final_list

[8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 16, 17, 18, 19]

In [7]:
# remove duplicated values
final_list = list( dict.fromkeys(final_list) )

In [8]:
# sort by residue index so that consecutive residues can be grouped together later
final_list = sorted(final_list)

In [11]:
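ranges = []
# consecutive residues share the same (index - residue) difference,
# so groupby splits the sorted list into runs of consecutive residues
for k, g in groupby(enumerate(final_list), lambda x: x[0] - x[1]):
    group = list(map(itemgetter(1), g))
    ranges.append((group[0], group[-1]))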

In [12]:
ranges

[(1, 12), (15, 19)]

The two overlapping regions `[1, 10]` and `[8, 12]` have been merged into `[1, 12]`.

This code snippet harmonizes disordered-residue annotations from different sources so that they can be analyzed together. Depending on the use case, further steps may be necessary for a complete and standardized solution.
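
The cell-by-cell steps above can also be wrapped into a single function. The sketch below uses a different technique from the notebook cells: instead of expanding every region into individual residues, it sorts the intervals and merges them directly. `merge_regions` is a hypothetical name. Regions are assumed to be inclusive `[start, end]` pairs, and adjacent regions such as `[1, 5]` and `[6, 10]` are merged, which matches the behavior of the residue-based approach.

```python
def merge_regions(*region_lists):
    """Merge overlapping or adjacent [start, end] regions (inclusive indices)."""
    # flatten all input lists and sort the intervals by start position
    intervals = sorted(tuple(region) for regions in region_lists for region in regions)

    merged = []
    for start, end in intervals:
        if merged and start <= merged[-1][1] + 1:
            # overlaps or touches the previous region: extend it
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


print(merge_regions([[8, 12]], [[1, 10], [15, 19]]))  # [(1, 12), (15, 19)]
```

Working on intervals avoids building a list with one entry per residue, which keeps the code short and the cost independent of region length.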