<a href="https://colab.research.google.com/github/rcsb/rcsb-training-resources/blob/master/training-events/API-CrashCourse-2023/Preparing_a_dataset_for_ML-based_prediction_of_heterodimer_binding_sites.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Leveraging RCSB PDB APIs for Bioinformatics Analyses and Machine Learning
## Preparing a dataset for ML/AI-based prediction of heterodimer binding sites
This Colab notebook demonstrates how to use the RCSB PDB Search & Data APIs to assemble a dataset for training AI/ML models that predict protein-protein binding sites.

## Useful links


*   [RCSB.org Advanced Search](https://www.rcsb.org/search/advanced)

*   [Search API](https://search.rcsb.org)
  *   [Search API query-editor](https://search.rcsb.org/query-editor.html)

*   [Data API](https://data.rcsb.org/#data-api)
  *   [Data API GraphiQL](https://data.rcsb.org/graphql/index.html)




## Required libraries

In [None]:
!pip install python-graphql-client



## Observations


*   These simplifications apply **ONLY** to this talk!
*   PDB Instance ~ PDB chain
*   PDB Entity ~ PDB sequence
*   Document ~ Dictionary ~ Hash ~ Map ~ Record
  * `{key: value}` pairs
  * `{"name": "John", "age": 24}`
  * Values can be Documents
```
{
        "name": "John",
        "DOB": {
          "year": 1999,
          "month": 2,
          "day": 12
        }
}
```

  * `"DOB.year": 1999`
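
The dotted notation `"DOB.year"` can be read as a path through nested documents. A minimal sketch of that idea in plain Python follows; the helper name `get_by_path` is hypothetical and not part of any RCSB library.

```python
def get_by_path(document, dotted_key):
    """Follow a dotted key such as "DOB.year" through nested dictionaries."""
    value = document
    for key in dotted_key.split("."):
        value = value[key]
    return value


person = {"name": "John", "DOB": {"year": 1999, "month": 2, "day": 12}}
print(get_by_path(person, "DOB.year"))  # 1999
```

The Search API attribute names used later (for example `rcsb_assembly_info.polymer_composition`) follow the same dotted convention.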



## RCSB PDB API URLs

In [None]:
search_api_url = "https://search.rcsb.org/rcsbsearch/v2/query"
data_api_url = "https://data.rcsb.org/graphql"

# Compiling a dataset for protein-protein binding site prediction



*   Request heterodimeric assemblies
*   Strategies for dealing with redundancy
*   How to split the dataset into training and testing sets
*   Annotating binding site residues
*   General pipeline
  * Search API request
  * Data API enrichment



## Search API query

The following search request collects all heterodimeric assemblies in the structural archive that meet the conditions below ([query-editor](https://search.rcsb.org/query-editor.html?json=%7B%22query%22%3A%7B%22type%22%3A%22group%22%2C%22logical_operator%22%3A%22and%22%2C%22nodes%22%3A%5B%7B%22type%22%3A%22terminal%22%2C%22service%22%3A%22text%22%2C%22parameters%22%3A%7B%22attribute%22%3A%22rcsb_assembly_info.polymer_composition%22%2C%22operator%22%3A%22exact_match%22%2C%22negation%22%3Afalse%2C%22value%22%3A%22heteromeric%20protein%22%7D%7D%2C%7B%22type%22%3A%22terminal%22%2C%22service%22%3A%22text%22%2C%22parameters%22%3A%7B%22attribute%22%3A%22rcsb_assembly_info.polymer_entity_instance_count%22%2C%22operator%22%3A%22equals%22%2C%22negation%22%3Afalse%2C%22value%22%3A2%7D%7D%2C%7B%22type%22%3A%22terminal%22%2C%22service%22%3A%22text%22%2C%22parameters%22%3A%7B%22attribute%22%3A%22entity_poly.rcsb_sample_sequence_length%22%2C%22operator%22%3A%22greater%22%2C%22negation%22%3Afalse%2C%22value%22%3A30%7D%7D%5D%2C%22label%22%3A%22text%22%7D%2C%22return_type%22%3A%22assembly%22%2C%22request_options%22%3A%7B%22results_content_type%22%3A%5B%22experimental%22%5D%2C%22return_all_hits%22%3Atrue%7D%7D)):


*   Heteromeric assemblies
*   Dimeric
*   Sequence length greater than 30 residues





In [None]:
heterodimer_search = {
  "query": {
    "type": "group",
    "logical_operator": "and",
    "nodes": [
      {
        "type": "terminal",
        "service": "text",
        "parameters": {
          "attribute": "rcsb_assembly_info.polymer_composition",
          "operator": "exact_match",
          "negation": False,
          "value": "heteromeric protein"
        }
      },
      {
        "type": "terminal",
        "service": "text",
        "parameters": {
          "attribute": "rcsb_assembly_info.polymer_entity_instance_count",
          "operator": "equals",
          "negation": False,
          "value": 2
        }
      },
      {
        "type": "terminal",
        "service": "text",
        "parameters": {
          "attribute": "entity_poly.rcsb_sample_sequence_length",
          "operator": "greater",
          "negation": False,
          "value": 30
        }
      }
    ],
    "label": "text"
  },
  "return_type": "assembly",
  "request_options": {
    "results_content_type": [
      "experimental"
    ],
    "return_all_hits": True
  }
}

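**Important: We want to exclude Computed Structure Models (CSMs)**, so the query above lists only `"experimental"` in `results_content_type`. Adding `"computational"`, as in the snippet below, would include them.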

```
  "request_options": {
    "results_content_type": [
      "experimental", "computational"
    ]
  }
```

**Important: We are interested in assemblies**

```
  "return_type": "entry"|"assembly"|"polymer_instance"|"entity"| ...
```
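
The three terminal nodes of `heterodimer_search` share the same shape, so the request can also be assembled from small helpers. The sketch below is one possible way to do that; `terminal_node` and `build_search` are hypothetical helper names, and the attribute names, operators, and options are copied from the query above.

```python
def terminal_node(attribute, operator, value):
    """Build one text-service terminal node in the shape used by heterodimer_search."""
    return {
        "type": "terminal",
        "service": "text",
        "parameters": {
            "attribute": attribute,
            "operator": operator,
            "negation": False,
            "value": value,
        },
    }


def build_search(nodes, return_type="assembly", content_types=("experimental",)):
    """Combine terminal nodes with a logical AND and attach the request options."""
    return {
        "query": {
            "type": "group",
            "logical_operator": "and",
            "nodes": list(nodes),
            "label": "text",
        },
        "return_type": return_type,
        "request_options": {
            "results_content_type": list(content_types),
            "return_all_hits": True,
        },
    }


example_search = build_search([
    terminal_node("rcsb_assembly_info.polymer_composition", "exact_match", "heteromeric protein"),
    terminal_node("rcsb_assembly_info.polymer_entity_instance_count", "equals", 2),
    terminal_node("entity_poly.rcsb_sample_sequence_length", "greater", 30),
])
```

With the default arguments, `example_search` has the same content as the hand-written `heterodimer_search` dictionary; changing `return_type` or `content_types` switches the granularity of the hits or the inclusion of CSMs.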

## Collecting and parsing search results

In [None]:
import requests
import json  # used in later cells to pretty-print results

search_response = requests.post(search_api_url, json=heterodimer_search)
# Stop here on an HTTP error instead of trying to parse an error response
search_response.raise_for_status()

search_response_json = search_response.json()

print("Number of heterodimeric assemblies: %s\n" % search_response_json['total_count'])

# Each hit carries the assembly identifier in the form <entry_id>-<assembly_id>
assembly_id_list = [item['identifier'] for item in search_response_json['result_set']]

print("Assembly Ids: %s, ...\n" % ", ".join(assembly_id_list[0:10]))

Number of heterodimeric assemblies: 26190

Assembly Ids: 1A08-1, 1A08-2, 1A09-2, 1A09-3, 1A0N-1, 1A0O-1, 1A0O-2, 1A0O-3, 1A0O-4, 1A0Q-1, ...



## PDB Assembly, Instance and Entity Ids relationships

*   Identifiers attribute
  *   `rcsb_id` is always available
  *   `rcsb_*_container_identifiers`
  *   Simple [request](https://data.rcsb.org/graphql/index.html?variables=%7B%0A%20%20%22assembly_ids%22%3A%20%5B%0A%20%20%20%20%221A08-1%22%0A%20%20%5D%0A%7D&operationName=AssemblyInstances&query=query%20AssemblyInstances(%24assembly_ids%3A%20%5BString!%5D!)%20%7B%0A%20%20assemblies(assembly_ids%3A%24assembly_ids)%7B%0A%09%09rcsb_id%0A%20%20%20%20rcsb_assembly_container_identifiers%7B%0A%20%20%20%20%20%20entry_id%0A%20%20%20%20%20%20assembly_id%0A%20%20%20%20%7D%0A%20%20%7D%0A%7D)
*   GraphQL queries follow the PDB hierarchy
  *   General idea ([request](https://data.rcsb.org/graphql/index.html?variables=%7B%0A%20%20%22assembly_ids%22%3A%20%5B%0A%20%20%20%20%221A08-1%22%0A%20%20%5D%0A%7D&query=query%20%7B%0A%09entries(entry_ids%3A%5B%22101M%22%5D)%7B%0A%20%20%20%20rcsb_id%0A%20%20%7D%0A%7D))
  *   Assemblies are composed of different instances
  *   `polymer_entity_instances` request is accessible from `assembly` query
*   Data API ([request](https://data.rcsb.org/graphql/index.html?variables=%7B%0A%20%20%22assembly_ids%22%3A%20%5B%0A%20%20%20%20%221A08-1%22%0A%20%20%5D%0A%7D&operationName=AssemblyInstances&query=query%20AssemblyInstances(%24assembly_ids%3A%20%5BString!%5D!)%20%7B%0A%20%20assemblies(assembly_ids%3A%24assembly_ids)%7B%0A%09%09rcsb_id%0A%20%20%20%20rcsb_assembly_container_identifiers%7B%0A%20%20%20%20%20%20entry_id%0A%20%20%20%20%20%20assembly_id%0A%20%20%20%20%7D%0A%20%20%20%20polymer_entity_instances%7B%0A%20%20%20%20%20%20rcsb_polymer_entity_instance_container_identifiers%20%7B%0A%20%20%20%20%20%20%20%20entry_id%0A%20%20%20%20%20%20%20%20entity_id%0A%20%20%20%20%20%20%20%20asym_id%0A%20%20%20%20%20%20%7D%0A%20%20%20%20%7D%0A%20%20%7D%0A%7D))



In [None]:
from python_graphql_client import GraphqlClient

# Instantiate the client with an endpoint.
client = GraphqlClient(endpoint = data_api_url)

assembly_query = """
  query AssemblyInstances($assembly_ids: [String!]!) {
    assemblies(assembly_ids:$assembly_ids){
      rcsb_id
      rcsb_assembly_container_identifiers{
        entry_id
        assembly_id
      }
      polymer_entity_instances {
        rcsb_polymer_entity_instance_container_identifiers {
          entry_id
          entity_id
          asym_id
        }
      }
    }
  }
"""

# Data API request should be done in batches. Using the whole list of ids will result in a timeout error
assembly_query_variables = {"assembly_ids": assembly_id_list[0:100]}

# Synchronous request
assembly_data = client.execute(query=assembly_query, variables=assembly_query_variables)

assembly_instances = [
  [
      {
          "assembly_id": assembly['rcsb_assembly_container_identifiers']['assembly_id'],
          **polymer_entity_instances['rcsb_polymer_entity_instance_container_identifiers']
      }
      for polymer_entity_instances in assembly['polymer_entity_instances']
  ]
  for assembly in assembly_data['data']['assemblies']
]

print(json.dumps(assembly_instances[0:10], indent=2))



[
  [
    {
      "assembly_id": "1",
      "entry_id": "1A08",
      "entity_id": "1",
      "asym_id": "A"
    },
    {
      "assembly_id": "1",
      "entry_id": "1A08",
      "entity_id": "2",
      "asym_id": "B"
    }
  ],
  [
    {
      "assembly_id": "2",
      "entry_id": "1A08",
      "entity_id": "1",
      "asym_id": "C"
    },
    {
      "assembly_id": "2",
      "entry_id": "1A08",
      "entity_id": "2",
      "asym_id": "D"
    }
  ],
  [
    {
      "assembly_id": "2",
      "entry_id": "1A09",
      "entity_id": "1",
      "asym_id": "A"
    },
    {
      "assembly_id": "2",
      "entry_id": "1A09",
      "entity_id": "2",
      "asym_id": "B"
    }
  ],
  [
    {
      "assembly_id": "3",
      "entry_id": "1A09",
      "entity_id": "1",
      "asym_id": "C"
    },
    {
      "assembly_id": "3",
      "entry_id": "1A09",
      "entity_id": "2",
      "asym_id": "D"
    }
  ],
  [
    {
      "assembly_id": "1",
      "entry_id": "1A0N",
      "entity_id": "1",




*   Why is `assembly_instances` a list of lists?
  *      Each assembly might contain multiple instances (chains). Data API `assembly` requests return information for all of them

*   Are `assembly_instances` elements always of length 2?
  *      In this example, yes. Remember that our assemblies are heterodimers
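
The code cell above requests only the first 100 identifiers because sending the whole list at once times out. A minimal sketch of how the full list could be processed in batches follows. The helper names and the batch size of 100 are assumptions, and `execute` stands for any callable with the same keyword arguments as `client.execute`.

```python
def batched(ids, batch_size=100):
    """Yield consecutive slices of ids with at most batch_size elements."""
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]


def fetch_assemblies(execute, query, assembly_ids, batch_size=100):
    """Run the query once per batch and concatenate the returned assemblies."""
    assemblies = []
    for batch in batched(assembly_ids, batch_size):
        response = execute(query=query, variables={"assembly_ids": batch})
        assemblies.extend(response["data"]["assemblies"])
    return assemblies


# Offline demonstration with a stand-in for client.execute (no network access)
def fake_execute(query, variables):
    return {"data": {"assemblies": [{"rcsb_id": i} for i in variables["assembly_ids"]]}}


demo_ids = ["ID-%d" % n for n in range(250)]
demo = fetch_assemblies(fake_execute, "query { }", demo_ids)
print(len(demo))  # 250
```

In the notebook this would be called as `fetch_assemblies(client.execute, assembly_query, assembly_id_list)`; a suitable batch size depends on how much data each query requests.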




## Collecting PDB Entity information

### Adding sequence length

*   GraphQL queries follow the PDB hierarchy
  *   PDB Instance describes a particular occurrence of a PDB Entity
  *   `polymer_entity` request is accessible from `polymer_entity_instances` query


In [None]:
assembly_query = """
query AssemblyInstances($assembly_ids: [String!]!) {
  assemblies(assembly_ids:$assembly_ids){
    rcsb_assembly_container_identifiers{
        entry_id
        assembly_id
    }
    polymer_entity_instances{
      rcsb_polymer_entity_instance_container_identifiers {
        entry_id
        entity_id
        asym_id
      }
      polymer_entity {
        entity_poly{
          rcsb_sample_sequence_length
        }
      }
    }
  }
}
"""

assembly_data = client.execute(query=assembly_query, variables=assembly_query_variables)

assembly_instances = [
    [
      {
        "assembly_id": assembly['rcsb_assembly_container_identifiers']['assembly_id'],
        **polymer_entity_instances['rcsb_polymer_entity_instance_container_identifiers'],
        "length": polymer_entity_instances['polymer_entity']['entity_poly']['rcsb_sample_sequence_length']
      }
      for polymer_entity_instances in assembly['polymer_entity_instances']
    ]
    for assembly in assembly_data['data']['assemblies']]

print(json.dumps(assembly_instances[0], indent=2))



[
  {
    "assembly_id": "1",
    "entry_id": "1A08",
    "entity_id": "1",
    "asym_id": "A",
    "length": 107
  },
  {
    "assembly_id": "1",
    "entry_id": "1A08",
    "entity_id": "2",
    "asym_id": "B",
    "length": 4
  }
]




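*   Why is there a chain of length 4?
  *   This is a technical limitation of the search engine. In one-to-many relationships (one assembly, several entities) the behaviour has to be fixed as either for-all or for-any. Here the length condition is satisfied as soon as any entity of the assembly is longer than 30 residues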
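
The difference between the two behaviours can be illustrated with Python's built-in `any` and `all`. This is only a sketch of the logic, assuming the search matched assembly 1A08-1 because at least one of its entities passed the length condition.

```python
chain_lengths = [107, 4]  # lengths reported for assembly 1A08-1 in the output above

# for-any: one chain longer than 30 residues is enough to match
matches_for_any = any(length > 30 for length in chain_lengths)

# for-all: every chain must be longer than 30 residues
matches_for_all = all(length > 30 for length in chain_lengths)

print(matches_for_any)  # True
print(matches_for_all)  # False
```

The for-all condition therefore has to be enforced on the client side, after the Data API has returned the length of every chain.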



### Filtering by sequence length

We would like to build a dataset for training protein-protein binding site prediction, so we keep only the assemblies in which both chains are longer than 30 residues.

In [None]:
assembly_data = [
    assembly for assembly in assembly_data['data']['assemblies']
    if  assembly['polymer_entity_instances'][0]['polymer_entity']['entity_poly']['rcsb_sample_sequence_length'] > 30
    and assembly['polymer_entity_instances'][1]['polymer_entity']['entity_poly']['rcsb_sample_sequence_length'] > 30
  ]

assembly_instances = [
    [
      {
        "assembly_id": assembly['rcsb_assembly_container_identifiers']['assembly_id'],
        **polymer_entity_instances['rcsb_polymer_entity_instance_container_identifiers'],
        'length': polymer_entity_instances['polymer_entity']['entity_poly']['rcsb_sample_sequence_length']
      }
      for polymer_entity_instances in assembly['polymer_entity_instances']
    ]
    for assembly in assembly_data
  ]


print(json.dumps(assembly_instances[0], indent=2))


[
  {
    "assembly_id": "1",
    "entry_id": "1A0O",
    "entity_id": "1",
    "asym_id": "A",
    "length": 128
  },
  {
    "assembly_id": "1",
    "entry_id": "1A0O",
    "entity_id": "2",
    "asym_id": "B",
    "length": 134
  }
]


### Adding UniProt accession to PDB Entities


*   UniProt accessions can be used as a strategy to deal with redundancy
*   Goal: no two assemblies in the dataset share the same combination of UniProt Ids



In [None]:
assembly_query = """
query AssemblyInstances($assembly_ids: [String!]!) {
  assemblies(assembly_ids:$assembly_ids){
    rcsb_assembly_container_identifiers{
        entry_id
        assembly_id
    }
    polymer_entity_instances{
      rcsb_polymer_entity_instance_container_identifiers {
        entry_id
        entity_id
        asym_id
      }
      polymer_entity {
        rcsb_polymer_entity_container_identifiers {
          uniprot_ids
        }
        entity_poly{
          rcsb_sample_sequence_length
        }
      }
    }
  }
}
"""

assembly_data = client.execute(query=assembly_query, variables=assembly_query_variables)
assembly_instances = [
    [
      {
        "assembly_id": assembly['rcsb_assembly_container_identifiers']['assembly_id'],
        **polymer_entity_instances['rcsb_polymer_entity_instance_container_identifiers'],
        'uniprot_ids': polymer_entity_instances['polymer_entity']['rcsb_polymer_entity_container_identifiers']['uniprot_ids'],
        'length': polymer_entity_instances['polymer_entity']['entity_poly']['rcsb_sample_sequence_length']
      }
      for polymer_entity_instances in assembly['polymer_entity_instances']
    ]
    for assembly in assembly_data['data']['assemblies']
  ]

print(json.dumps(assembly_instances[0], indent=2))


[
  {
    "assembly_id": "1",
    "entry_id": "1A08",
    "entity_id": "1",
    "asym_id": "A",
    "uniprot_ids": [
      "P12931"
    ],
    "length": 107
  },
  {
    "assembly_id": "1",
    "entry_id": "1A08",
    "entity_id": "2",
    "asym_id": "B",
    "uniprot_ids": null,
    "length": 4
  }
]


### Putting all together


*   Filtering by sequence length
*   Adding UniProt ids



In [None]:
assembly_query = """
query AssemblyInstances($assembly_ids: [String!]!) {
  assemblies(assembly_ids:$assembly_ids){
    rcsb_assembly_container_identifiers{
        entry_id
        assembly_id
    }
    polymer_entity_instances{
      rcsb_polymer_entity_instance_container_identifiers {
        entry_id
        entity_id
        asym_id
      }
      polymer_entity {
        entity_poly{
          rcsb_sample_sequence_length
        }
        rcsb_polymer_entity_container_identifiers {
          uniprot_ids
        }
      }
    }
  }
}
"""
assembly_data = client.execute(query=assembly_query, variables=assembly_query_variables)
assembly_data = [
    assembly for assembly in assembly_data['data']['assemblies']
    if  assembly['polymer_entity_instances'][0]['polymer_entity']['entity_poly']['rcsb_sample_sequence_length'] > 30
    and assembly['polymer_entity_instances'][1]['polymer_entity']['entity_poly']['rcsb_sample_sequence_length'] > 30
  ]

assembly_instances = [
    [
        {
            "assembly_id": assembly['rcsb_assembly_container_identifiers']['assembly_id'],
            **polymer_entity_instances['rcsb_polymer_entity_instance_container_identifiers'],
            'length': polymer_entity_instances['polymer_entity']['entity_poly']['rcsb_sample_sequence_length'],
            'uniprot_ids': polymer_entity_instances['polymer_entity']['rcsb_polymer_entity_container_identifiers']['uniprot_ids']
        }
        for polymer_entity_instances in assembly['polymer_entity_instances']
    ]
    for assembly in assembly_data
  ]

print(json.dumps(assembly_instances[0], indent=2))


[
  {
    "assembly_id": "1",
    "entry_id": "1A0O",
    "entity_id": "1",
    "asym_id": "A",
    "length": 128,
    "uniprot_ids": [
      "P0AE67"
    ]
  },
  {
    "assembly_id": "1",
    "entry_id": "1A0O",
    "entity_id": "2",
    "asym_id": "B",
    "length": 134,
    "uniprot_ids": [
      "P07363"
    ]
  }
]




*   Length refers to the PDB Entity sequence, not the UniProt entry
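
The redundancy strategy based on UniProt accessions can be sketched as follows: keep the first assembly seen for every combination of UniProt Ids. The helper names are hypothetical, the entry `XXXX` is made-up sample data, and dropping assemblies that contain an unmapped chain is an assumption of this sketch rather than a rule from the talk.

```python
def uniprot_key(assembly):
    """Order-independent key built from the UniProt ids of every chain.

    Returns None when any chain has no UniProt mapping.
    """
    key = []
    for chain in assembly:
        ids = chain.get("uniprot_ids")
        if not ids:
            return None
        key.append(tuple(sorted(ids)))
    return tuple(sorted(key))


def deduplicate_by_uniprot(assemblies):
    """Keep the first assembly for each combination of UniProt ids."""
    seen = set()
    kept = []
    for assembly in assemblies:
        key = uniprot_key(assembly)
        if key is None or key in seen:
            continue
        seen.add(key)
        kept.append(assembly)
    return kept


sample = [
    [{"entry_id": "1A0O", "uniprot_ids": ["P0AE67"]}, {"entry_id": "1A0O", "uniprot_ids": ["P07363"]}],
    # Hypothetical second assembly with the same pair in the opposite chain order
    [{"entry_id": "XXXX", "uniprot_ids": ["P07363"]}, {"entry_id": "XXXX", "uniprot_ids": ["P0AE67"]}],
    # A chain without a UniProt mapping, as seen for 1A08 entity 2
    [{"entry_id": "1A08", "uniprot_ids": ["P12931"]}, {"entry_id": "1A08", "uniprot_ids": None}],
]
unique = deduplicate_by_uniprot(sample)
print([assembly[0]["entry_id"] for assembly in unique])  # ['1A0O']
```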



### Sequence identity cluster membership


*   Not all PDB entities map to a UniProt accession
*   We can use sequence clustering to remove redundancy
*   [Request](https://data.rcsb.org/graphql/index.html?query=query%20AssemblyInstances(%24assembly_ids%3A%20%5BString!%5D!)%20%7B%0A%20%20assemblies(assembly_ids%3A%24assembly_ids)%7B%0A%20%20%20%20polymer_entity_instances%7B%0A%20%20%20%20%20%20rcsb_polymer_entity_instance_container_identifiers%20%7B%0A%20%20%20%20%20%20%20%20entry_id%0A%20%20%20%20%20%20%20%20entity_id%0A%20%20%20%20%20%20%20%20asym_id%0A%20%20%20%20%20%20%7D%0A%20%20%20%20%20%20polymer_entity%20%7B%0A%20%20%20%20%20%20%20%20entity_poly%7B%0A%20%20%20%20%20%20%20%20%20%20rcsb_sample_sequence_length%0A%20%20%20%20%20%20%20%20%7D%0A%20%20%20%20%20%20%20%20rcsb_polymer_entity_container_identifiers%20%7B%0A%20%20%20%20%20%20%20%20%20%20uniprot_ids%0A%20%20%20%20%20%20%20%20%7D%0A%20%20%20%20%20%20%20%20rcsb_polymer_entity_group_membership%20%7B%0A%20%20%20%20%20%20%20%20%20%20similarity_cutoff%0A%20%20%20%20%20%20%20%20%20%20group_id%0A%20%20%20%20%20%20%20%20%7D%0A%20%20%20%20%20%20%7D%0A%20%20%20%20%7D%0A%20%20%7D%0A%7D&operationName=AssemblyInstances)




In [None]:
assembly_query = """
query AssemblyInstances($assembly_ids: [String!]!) {
  assemblies(assembly_ids:$assembly_ids){
    polymer_entity_instances{
      rcsb_polymer_entity_instance_container_identifiers {
        entry_id
        entity_id
        asym_id
      }
      polymer_entity {
        entity_poly{
          rcsb_sample_sequence_length
        }
        rcsb_polymer_entity_container_identifiers {
          uniprot_ids
        }
        rcsb_polymer_entity_group_membership {
          similarity_cutoff
          group_id
        }
      }
    }
  }
}
"""

assembly_data = client.execute(query=assembly_query, variables=assembly_query_variables)
assembly_data = [
    assembly for assembly in assembly_data['data']['assemblies']
    if  assembly['polymer_entity_instances'][0]['polymer_entity']['entity_poly']['rcsb_sample_sequence_length'] > 30
    and assembly['polymer_entity_instances'][1]['polymer_entity']['entity_poly']['rcsb_sample_sequence_length'] > 30
  ]

assembly_instances = [
    [
        {
            **polymer_entity_instances['rcsb_polymer_entity_instance_container_identifiers'],
            'length': polymer_entity_instances['polymer_entity']['entity_poly']['rcsb_sample_sequence_length'],
            'uniprot_ids': polymer_entity_instances['polymer_entity']['rcsb_polymer_entity_container_identifiers']['uniprot_ids'],
            'group_ids': polymer_entity_instances['polymer_entity']['rcsb_polymer_entity_group_membership']
        }
        for polymer_entity_instances in assembly['polymer_entity_instances']
    ]
    for assembly in assembly_data
]

print(json.dumps(assembly_instances[0], indent=2))




[
  {
    "entry_id": "1A0O",
    "entity_id": "1",
    "asym_id": "A",
    "length": 128,
    "uniprot_ids": [
      "P0AE67"
    ],
    "group_ids": [
      {
        "similarity_cutoff": 95.0,
        "group_id": "591_95"
      },
      {
        "similarity_cutoff": 50.0,
        "group_id": "1246_50"
      },
      {
        "similarity_cutoff": 70.0,
        "group_id": "974_70"
      },
      {
        "similarity_cutoff": 30.0,
        "group_id": "762_30"
      },
      {
        "similarity_cutoff": null,
        "group_id": "P0AE67"
      },
      {
        "similarity_cutoff": 100.0,
        "group_id": "2438_100"
      },
      {
        "similarity_cutoff": 90.0,
        "group_id": "632_90"
      }
    ]
  },
  {
    "entry_id": "1A0O",
    "entity_id": "2",
    "asym_id": "B",
    "length": 134,
    "uniprot_ids": [
      "P07363"
    ],
    "group_ids": [
      {
        "similarity_cutoff": 70.0,
        "group_id": "29865_70"
      },
      {
        "similarity_cuto



*   Why do entities have multiple `group_id` values?
  *   Sequence clustering is calculated independently for each identity threshold. Each threshold generates a different set of clusters
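
Because each threshold yields its own cluster, it can be convenient to index the membership list by cutoff. The sketch below uses a shortened copy of the output above; `groups_by_cutoff` is a hypothetical helper, and skipping the entry whose cutoff is `null` is a choice made for this example.

```python
def groups_by_cutoff(memberships):
    """Map each identity cutoff to its group id; entries without a cutoff are skipped."""
    return {
        int(membership["similarity_cutoff"]): membership["group_id"]
        for membership in memberships
        if membership["similarity_cutoff"] is not None
    }


memberships = [
    {"similarity_cutoff": 95.0, "group_id": "591_95"},
    {"similarity_cutoff": 50.0, "group_id": "1246_50"},
    {"similarity_cutoff": None, "group_id": "P0AE67"},
    {"similarity_cutoff": 100.0, "group_id": "2438_100"},
]
clusters = groups_by_cutoff(memberships)
print(clusters[100])     # 2438_100
print(clusters.get(40))  # None
```

Using `dict.get` avoids an exception when an entity has no cluster at the requested threshold.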



### Choosing 100% sequence identity



*   Removing redundancy based on sequence




In [None]:
assembly_instances = [
    [
        {
          **polymer_entity_instances['rcsb_polymer_entity_instance_container_identifiers'],
          'length': polymer_entity_instances['polymer_entity']['entity_poly']['rcsb_sample_sequence_length'],
          'group_id_at_100': [group for group in polymer_entity_instances['polymer_entity']['rcsb_polymer_entity_group_membership'] if group['similarity_cutoff'] == 100][0]['group_id'],
          'uniprot_ids': polymer_entity_instances['polymer_entity']['rcsb_polymer_entity_container_identifiers']['uniprot_ids']
        }
        for polymer_entity_instances in assembly['polymer_entity_instances']
    ]
    for assembly in assembly_data
  ]

print(json.dumps(assembly_instances[0], indent=2))


[
  {
    "entry_id": "1A0O",
    "entity_id": "1",
    "asym_id": "A",
    "length": 128,
    "group_id_at_100": "2438_100",
    "uniprot_ids": [
      "P0AE67"
    ]
  },
  {
    "entry_id": "1A0O",
    "entity_id": "2",
    "asym_id": "B",
    "length": 134,
    "group_id_at_100": "28313_100",
    "uniprot_ids": [
      "P07363"
    ]
  }
]
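
With a `group_id_at_100` on every chain, redundant dimers can be collected under the pair of clusters they belong to. A minimal sketch, assuming that one representative per cluster pair is wanted; the helper names are hypothetical and the entry `XXXX` is made-up sample data.

```python
from collections import defaultdict


def cluster_pair(assembly):
    """Order-independent pair of 100% identity cluster ids for a dimer."""
    return tuple(sorted(chain["group_id_at_100"] for chain in assembly))


def group_by_cluster_pair(assemblies):
    """Collect the assemblies that share the same pair of sequence clusters."""
    groups = defaultdict(list)
    for assembly in assemblies:
        groups[cluster_pair(assembly)].append(assembly)
    return dict(groups)


sample = [
    [{"entry_id": "1A0O", "asym_id": "A", "group_id_at_100": "2438_100"},
     {"entry_id": "1A0O", "asym_id": "B", "group_id_at_100": "28313_100"}],
    # Hypothetical redundant dimer: same clusters, chains listed in the opposite order
    [{"entry_id": "XXXX", "asym_id": "A", "group_id_at_100": "28313_100"},
     {"entry_id": "XXXX", "asym_id": "B", "group_id_at_100": "2438_100"}],
]
groups = group_by_cluster_pair(sample)
representatives = [members[0] for members in groups.values()]
print(len(groups))                        # 1
print(representatives[0][0]["entry_id"])  # 1A0O
```

Taking the first member is the simplest choice; another criterion could be applied to `members` instead.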


### Adding PDB Instance annotation

* `rcsb_*_annotation`
* [instance request](https://data.rcsb.org/graphql/index.html?variables=%7B%0A%20%20%22assembly_ids%22%3A%20%5B%0A%20%20%20%20%221A08-1%22%0A%20%20%5D%0A%7D&query=query%20%7B%0A%20%20polymer_entity_instances(instance_ids%3A%5B%222UZI.C%22%5D)%7B%0A%20%20%20%20rcsb_polymer_instance_annotation%7B%0A%20%20%20%20%20%20name%0A%20%20%20%20%20%20type%0A%20%20%20%20%20%20annotation_id%0A%20%20%20%20%7D%0A%20%20%7D%0A%7D)
* [entity request](https://data.rcsb.org/graphql/index.html?variables=%7B%0A%20%20%22assembly_ids%22%3A%20%5B%0A%20%20%20%20%221A08-1%22%0A%20%20%5D%0A%7D&query=query%20%7B%0A%20%20polymer_entities(entity_ids%3A%5B%222UZI_3%22%5D)%7B%0A%20%20%20%20rcsb_polymer_entity_annotation%7B%0A%20%20%20%20%20%20name%0A%20%20%20%20%20%20type%0A%20%20%20%20%20%20annotation_id%0A%20%20%20%20%7D%0A%20%20%7D%0A%7D)

In [None]:
assembly_query = """
query AssemblyInstances($assembly_ids: [String!]!) {
  assemblies(assembly_ids:$assembly_ids){
    polymer_entity_instances{
      rcsb_polymer_entity_instance_container_identifiers {
        entry_id
        entity_id
        asym_id
      }
      polymer_entity {
        entity_poly{
          rcsb_sample_sequence_length
        }
        rcsb_polymer_entity_container_identifiers {
          uniprot_ids
        }
        rcsb_polymer_entity_group_membership {
          similarity_cutoff
          group_id
        }
      }
      rcsb_polymer_instance_annotation{
        type
        name
        annotation_id
      }
    }
  }
}
"""

assembly_data = client.execute(query=assembly_query, variables=assembly_query_variables)
assembly_data = [
    assembly for assembly in assembly_data['data']['assemblies']
    if  assembly['polymer_entity_instances'][0]['polymer_entity']['entity_poly']['rcsb_sample_sequence_length'] > 30
    and assembly['polymer_entity_instances'][1]['polymer_entity']['entity_poly']['rcsb_sample_sequence_length'] > 30
  ]

assembly_instances = [
    [
        {
          **polymer_entity_instances['rcsb_polymer_entity_instance_container_identifiers'],
          'length': polymer_entity_instances['polymer_entity']['entity_poly']['rcsb_sample_sequence_length'],
          'uniprot_ids': polymer_entity_instances['polymer_entity']['rcsb_polymer_entity_container_identifiers']['uniprot_ids'],
          'group_id_at_100': [group for group in polymer_entity_instances['polymer_entity']['rcsb_polymer_entity_group_membership'] if group['similarity_cutoff'] == 100][0]['group_id'],
          'annotations': polymer_entity_instances['rcsb_polymer_instance_annotation']
        }
        for polymer_entity_instances in assembly['polymer_entity_instances']
    ]
    for assembly in assembly_data
  ]

print(json.dumps(assembly_instances[0], indent=2))


[
  {
    "entry_id": "1A0O",
    "entity_id": "1",
    "asym_id": "A",
    "length": 128,
    "uniprot_ids": [
      "P0AE67"
    ],
    "group_id_at_100": "2438_100",
    "annotations": [
      {
        "type": "CATH",
        "name": "Response regulator",
        "annotation_id": "3.40.50.2300"
      },
      {
        "type": "SCOP",
        "name": "CheY protein",
        "annotation_id": "d1a0oa_"
      },
      {
        "type": "SCOP2",
        "name": "CheY-like",
        "annotation_id": "8034311"
      },
      {
        "type": "ECOD",
        "name": "Response_reg",
        "annotation_id": "e1a0oA1"
      }
    ]
  },
  {
    "entry_id": "1A0O",
    "entity_id": "2",
    "asym_id": "B",
    "length": 134,
    "uniprot_ids": [
      "P07363"
    ],
    "group_id_at_100": "28313_100",
    "annotations": [
      {
        "type": "CATH",
        "name": "CheY-binding domain of CheA",
        "annotation_id": "3.30.70.400"
      },
      {
        "type": "SCOP",
        "na



*  We can use CATH domain annotations to split our dataset into training and testing sets
  *   None of the testing assemblies will share the same combination of domains with the training set



In [None]:
assembly_instances = [
    [
        {
          **polymer_entity_instances['rcsb_polymer_entity_instance_container_identifiers'],
          'length': polymer_entity_instances['polymer_entity']['entity_poly']['rcsb_sample_sequence_length'],
          'uniprot_ids': polymer_entity_instances['polymer_entity']['rcsb_polymer_entity_container_identifiers']['uniprot_ids'],
          'group_id_at_100': [group for group in polymer_entity_instances['polymer_entity']['rcsb_polymer_entity_group_membership'] if group['similarity_cutoff'] == 100][0]['group_id'],
          'cath_ids': [annotation['annotation_id'] for annotation in polymer_entity_instances['rcsb_polymer_instance_annotation'] if annotation['type'] == "CATH"]
        }
        for polymer_entity_instances in assembly['polymer_entity_instances']
    ]
    for assembly in assembly_data
  ]

print(json.dumps(assembly_instances[0], indent=2))

[
  {
    "entry_id": "1A0O",
    "entity_id": "1",
    "asym_id": "A",
    "length": 128,
    "uniprot_ids": [
      "P0AE67"
    ],
    "group_id_at_100": "2438_100",
    "cath_ids": [
      "3.40.50.2300"
    ]
  },
  {
    "entry_id": "1A0O",
    "entity_id": "2",
    "asym_id": "B",
    "length": 134,
    "uniprot_ids": [
      "P07363"
    ],
    "group_id_at_100": "28313_100",
    "cath_ids": [
      "3.30.70.400"
    ]
  }
]
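
The `cath_ids` of both chains can drive a split in which no domain combination appears in both sets. The sketch below shows one deterministic way to do it; the helper names and the `test_every` parameter are assumptions, and the sample assemblies are reduced to their `cath_ids` for illustration.

```python
def cath_key(assembly):
    """Order-independent combination of CATH ids over all chains of an assembly."""
    return tuple(sorted(tuple(sorted(chain["cath_ids"])) for chain in assembly))


def split_by_cath(assemblies, test_every=5):
    """Send every test_every-th domain combination to testing, the rest to training."""
    keys = sorted({cath_key(assembly) for assembly in assemblies})
    test_keys = {
        key for index, key in enumerate(keys)
        if index % test_every == test_every - 1
    }
    train = [a for a in assemblies if cath_key(a) not in test_keys]
    test = [a for a in assemblies if cath_key(a) in test_keys]
    return train, test


sample = [
    [{"cath_ids": ["3.40.50.2300"]}, {"cath_ids": ["3.30.70.400"]}],
    # Same combination with the chains in the opposite order
    [{"cath_ids": ["3.30.70.400"]}, {"cath_ids": ["3.40.50.2300"]}],
    # Illustrative assembly with a different combination
    [{"cath_ids": ["2.60.40.10"]}, {"cath_ids": ["2.60.40.10"]}],
]
train, test = split_by_cath(sample, test_every=2)
print(len(train), len(test))  # 1 2
```

Assemblies with the same combination always land in the same set. Chains without any CATH annotation all share an empty key, so they may need separate handling, and individual domains can still occur on both sides of the split.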


## Protein-Protein Interface data
### PDB Assembly and PDB Interface relationship
*   Interfaces are stored for each pair of PDB Instances in a PDB Assembly
*   GraphQL queries follow the PDB hierarchy
  *   Protein-protein interfaces are defined in PDB Assemblies
  *   `interfaces` request is accessible from `assemblies` query
  *   [Request](https://data.rcsb.org/graphql/index.html?query=query%20AssemblyInstances(%24assembly_ids%3A%20%5BString!%5D!)%20%7B%0A%20%20assemblies(assembly_ids%3A%24assembly_ids)%7B%0A%20%20%20%20rcsb_id%0A%20%20%7D%0A%7D&operationName=AssemblyInstances)

In [None]:
assembly_query = """
query AssemblyInstances($assembly_ids: [String!]!) {
  assemblies(assembly_ids:$assembly_ids){
    interfaces {
      rcsb_interface_container_identifiers {
        rcsb_id
        entry_id
      }
      rcsb_interface_partner{
        interface_partner_identifier{
          entity_id
          asym_id
        }
      }
    }
    polymer_entity_instances{
      rcsb_polymer_entity_instance_container_identifiers {
        entry_id
        entity_id
        asym_id
      }
    }
  }
}
"""

assembly_data = client.execute(query=assembly_query, variables=assembly_query_variables)

assembly_instances = [
    {
      'pdb_instances':[polymer_entity_instances['rcsb_polymer_entity_instance_container_identifiers'] for polymer_entity_instances in assembly['polymer_entity_instances']],
      'pdb_interfaces':[
          [
              {
                  'entry_id': interface['rcsb_interface_container_identifiers']['entry_id'],
                  **partner['interface_partner_identifier']
              }
              for partner in interface['rcsb_interface_partner']

          ]
          for interface in assembly['interfaces']
      ]
    }
    for assembly in assembly_data['data']['assemblies']
  ]

print("%s\n" % json.dumps(assembly_instances[0]['pdb_instances'], indent=2))
print("Number of interfaces: %s" % len(assembly_instances[0]['pdb_interfaces']))
print(json.dumps(assembly_instances[0]['pdb_interfaces'][0], indent=2))


[
  {
    "entry_id": "1A0Q",
    "entity_id": "1",
    "asym_id": "A"
  },
  {
    "entry_id": "1A0Q",
    "entity_id": "2",
    "asym_id": "B"
  }
]

Number of interfaces: 1
[
  {
    "entry_id": "1A0Q",
    "entity_id": "1",
    "asym_id": "A"
  },
  {
    "entry_id": "1A0Q",
    "entity_id": "2",
    "asym_id": "B"
  }
]


## PDB Interface features

* Features are defined on protein regions at Entity, Instance or Interface level

* `*_feature` attribute
  * [Entity features](https://data.rcsb.org/graphql/index.html?variables=%7B%0A%20%20%22assembly_ids%22%3A%20%5B%0A%20%20%20%20%221A08-1%22%0A%20%20%5D%0A%7D&query=query%20%7B%0A%09polymer_entities(entity_ids%3A%5B%22101M_1%22%5D)%7B%0A%20%20%09rcsb_polymer_entity_feature%7B%0A%20%20%20%20%20%20type%0A%20%20%20%20%20%20name%0A%20%20%20%20%20%20feature_positions%20%7B%0A%20%20%20%20%20%20%20%20beg_seq_id%0A%20%20%20%20%20%20%20%20end_seq_id%0A%20%20%20%20%20%20%20%20values%0A%20%20%20%20%20%20%7D%0A%20%20%20%20%7D%0A%09%7D%0A%7D)
  * [Instance features](https://data.rcsb.org/graphql/index.html?variables=%7B%0A%20%20%22assembly_ids%22%3A%20%5B%0A%20%20%20%20%221A08-1%22%0A%20%20%5D%0A%7D&query=query%20%7B%0A%09polymer_entity_instances(instance_ids%3A%5B%22101M.A%22%5D)%7B%0A%20%20%20%20rcsb_polymer_instance_feature%7B%0A%20%20%20%20%20%20type%0A%20%20%20%20%20%20name%0A%20%20%20%20%20%20feature_positions%20%7B%0A%20%20%20%20%20%20%20%20beg_seq_id%0A%20%20%20%20%20%20%20%20end_seq_id%0A%20%20%20%20%20%20%20%20values%0A%20%20%20%20%20%20%7D%0A%20%20%20%20%7D%0A%09%7D%0A%7D)
  * [Interface features](https://data.rcsb.org/graphql/index.html?variables=%7B%0A%20%20%22assembly_ids%22%3A%20%5B%0A%20%20%20%20%221A08-1%22%0A%20%20%5D%0A%7D&query=query%20%7B%0A%09interfaces(interface_ids%3A%5B%222UZI-1.1%22%5D)%7B%0A%20%20%20%20rcsb_interface_partner%7B%0A%20%20%20%20%20%20interface_partner_feature%7B%0A%20%20%20%20%20%20%20%20type%0A%20%20%20%20%20%20%20%20name%0A%20%20%20%20%20%20%20%20feature_positions%7B%0A%20%20%20%20%20%20%20%20%20%20beg_seq_id%0A%20%20%20%20%20%20%20%20%20%20values%0A%20%20%20%20%20%20%20%20%7D%0A%20%20%20%20%20%20%7D%0A%20%20%20%20%7D%0A%09%7D%0A%7D)

* Numerical features defined over residue positions

```
feature_positions {
  beg_seq_id
  values
}
```
* Categorical features defined over residue segments

```
feature_positions {
  beg_seq_id
  end_seq_id
}
```
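
Both layouts can be expanded to per-residue data. The sketch below assumes that `values` holds one number per consecutive residue starting at `beg_seq_id`, and that `end_seq_id` is inclusive; the helper names are hypothetical and the numbers are illustrative.

```python
def expand_numeric(feature_positions):
    """Map residue index -> value for blocks of the form {beg_seq_id, values}."""
    per_residue = {}
    for block in feature_positions:
        for offset, value in enumerate(block["values"]):
            per_residue[block["beg_seq_id"] + offset] = value
    return per_residue


def expand_segments(feature_positions):
    """List every residue index covered by {beg_seq_id, end_seq_id} segments."""
    residues = []
    for block in feature_positions:
        residues.extend(range(block["beg_seq_id"], block["end_seq_id"] + 1))
    return residues


numeric = [{"beg_seq_id": 2, "values": [48.5, 119.9, 16.1]}]
segments = [{"beg_seq_id": 5, "end_seq_id": 7}, {"beg_seq_id": 10, "end_seq_id": 10}]

print(expand_numeric(numeric))    # {2: 48.5, 3: 119.9, 4: 16.1}
print(expand_segments(segments))  # [5, 6, 7, 10]
```

A per-residue dictionary makes it straightforward to align feature values with sequence positions when labelling residues.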

In [None]:
assembly_query = """
query AssemblyInstances($assembly_ids: [String!]!) {
  assemblies(assembly_ids:$assembly_ids){
    interfaces {
      rcsb_interface_container_identifiers {
        rcsb_id
        entry_id
      }
      rcsb_interface_partner{
        interface_partner_identifier{
          entity_id
          asym_id
        }
        interface_partner_feature{
          type
          feature_positions{
            beg_seq_id
            values
          }
        }
      }
    }
    polymer_entity_instances{
      rcsb_polymer_entity_instance_container_identifiers {
        entry_id
        entity_id
        asym_id
      }
    }
  }
}
"""
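# Run the query with the variables defined earlier (`assembly_query_variables`)
assembly_data = client.execute(query=assembly_query, variables=assembly_query_variables)

# Nested list: assemblies -> interfaces -> partners (identifier and features)
assembly_interfaces = [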
    [
        [
            {
              'entry_id': interface['rcsb_interface_container_identifiers']['entry_id'],
              'partner_id': partner['interface_partner_identifier'],
              'partner_features': partner['interface_partner_feature']
            }
            for partner in interface['rcsb_interface_partner']
        ]
        for interface in assembly['interfaces']
    ]
    for assembly in assembly_data['data']['assemblies']
]

print(json.dumps(assembly_interfaces[0], indent=2))



[
  [
    {
      "entry_id": "1A0Q",
      "partner_id": {
        "entity_id": "1",
        "asym_id": "A"
      },
      "partner_features": [
        {
          "type": "ASA_UNBOUND",
          "feature_positions": [
            {
              "beg_seq_id": 2,
              "values": [
                48.57673189287011,
                119.90613442009894,
                16.167256574081424,
                89.54904021907043,
                22.87871133162074,
                57.05770630546273,
                43.536185966899104,
                88.57080481541033,
                71.43449759795084,
                50.67054559118288,
                46.18987671501579,
                7.659143827510029,
                56.16819701816239,
                104.74330156603793,
                45.43308966423311,
                21.141727266725066,
                131.5426049187291,
                5.64357966237581,
                65.50754250831005,
                0.0,
                6

In [None]:
# Keep only the bound and unbound ASA feature positions of each partner
assembly_interfaces = [
    [
        [
            {
              'entry_id': interface['rcsb_interface_container_identifiers']['entry_id'],
              'partner_id': partner['interface_partner_identifier'],
              'partner_asa_bound': [feature for feature in partner['interface_partner_feature'] if feature['type'] == "ASA_BOUND"][0]['feature_positions'],
              'partner_asa_unbound': [feature for feature in partner['interface_partner_feature'] if feature['type'] == "ASA_UNBOUND"][0]['feature_positions'],
            }
            for partner in interface['rcsb_interface_partner']
        ]
        for interface in assembly['interfaces']
    ]
    for assembly in assembly_data['data']['assemblies']
]

print(json.dumps(assembly_interfaces[0], indent=2))


[
  [
    {
      "entry_id": "1A0Q",
      "partner_id": {
        "entity_id": "1",
        "asym_id": "A"
      },
      "partner_asa_bound": [
        {
          "beg_seq_id": 2,
          "values": [
            44.95287335102774,
            111.72894573392314,
            16.167256574081424,
            89.54904021907043,
            22.87871133162074,
            57.05770630546273,
            43.536185966899104,
            75.18982550916063,
            71.43449759795084,
            50.67054559118288,
            46.18987671501579,
            7.659143827510029,
            56.16819701816239,
            104.74330156603793,
            45.43308966423311,
            21.141727266725066,
            131.5426049187291,
            5.64357966237581,
            65.50754250831005,
            0.0,
            61.710646299745015,
            2.239589880624645,
            93.25560181583707,
            0.0,
            40.75715698163774,
            101.1778816741006,
           

* Buried surface fraction per residue: `1 - bound_asa_value / unbound_asa_value` (set to 0 when the unbound ASA is 0)
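
As a quick worked example of this formula, the sketch below applies it to single residues. The function name `buried_fraction` is hypothetical; the ASA values are copied from the `ASA_BOUND` and `ASA_UNBOUND` outputs shown in this notebook for partner A of 1A0Q. A residue whose ASA does not change on binding gets a fraction of 0, and the guard avoids a division by zero for residues with no unbound ASA.

```python
def buried_fraction(bound_asa, unbound_asa):
    """Fraction of a residue's unbound ASA that is buried when the complex forms."""
    if unbound_asa <= 0:
        # Residue is not solvent-accessible even when unbound: nothing can be buried
        return 0.0
    return 1 - bound_asa / unbound_asa


# First residue (beg_seq_id = 2): ASA decreases on binding
first = buried_fraction(44.95287335102774, 48.57673189287011)
# Third residue: identical bound and unbound ASA
third = buried_fraction(16.167256574081424, 16.167256574081424)
# Residue with zero unbound ASA
zero = buried_fraction(0.0, 0.0)

print(first, third, zero)
```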

In [None]:
first_assembly = assembly_interfaces[0]
first_assembly_interface = first_assembly[0]

# Each interface holds two partners; take the first one
first_assembly_interface_partner_a = first_assembly_interface[0]

# Per-residue buried fraction: 1 - bound ASA / unbound ASA (0 when the unbound ASA is 0)
interface_buried_fraction_partner_a = [
    {
      'beg_seq_id': bound_region['beg_seq_id'],
      'values': [
          (1-bound_value/unbound_value) if unbound_value > 0 else 0
          for bound_value, unbound_value in zip(bound_region['values'], unbound_region['values'])
      ]
    }
    for bound_region, unbound_region in zip(
        first_assembly_interface_partner_a['partner_asa_bound'],
        first_assembly_interface_partner_a['partner_asa_unbound']
    )
]


print(json.dumps(interface_buried_fraction_partner_a, indent=2))


[
  {
    "beg_seq_id": 2,
    "values": [
      0.07460070697704269,
      0.06819658331680079,
      0.0,
      0.0,
      0.0,
      0.0,
      0.0,
      0.15107663675560912,
      0.0,
      0.0,
      0.0,
      0.0,
      0.0,
      0.0,
      0.0,
      0.0,
      0.0,
      0.0,
      0.0,
      0,
      0.0,
      0.0,
      0.0,
      0,
      0.0,
      0.0,
      0.0,
      0,
      0.0,
      0.0,
      0.0,
      0.0,
      0.0,
      0.0,
      0.7966823510674619,
      0.0,
      0.8057904994202726,
      0.0,
      0.0,
      0.0,
      0.0,
      0.04770574238762293,
      0.9218588292162768,
      0.008880709151148358,
      0.46253417577153555,
      0.0,
      0,
      0.0,
      0.0,
      0.0,
      0.0,
      0.0,
      0.0,
      0.0,
      0.0,
      0.0,
      0.0,
      0.0,
      0.0,
      0.0,
      0.0,
      0.0,
      0.0,
      0.0,
      0.0,
      0.0,
      0.0,
      0.0,
      0.0,
      0.0,
      0.0,
      0.0,
      0.0,
      0.0,
      0.0

In [None]:
# Interface residues are those with a non-zero buried fraction
interface_residues_partner_a = [
    region['beg_seq_id'] + idx
    for region in interface_buried_fraction_partner_a
    for idx, value in enumerate(region['values'])
    if value > 0
]

print(interface_residues_partner_a)

[2, 3, 9, 36, 38, 43, 44, 45, 46, 85, 87, 94, 95, 96, 97, 98, 99, 100, 102, 113, 115, 116, 117, 118, 119, 120, 122, 123, 126, 130, 132, 134, 136, 137, 157, 158, 159, 160, 161, 162, 163, 164, 166, 168, 173, 174, 175, 177, 179]
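
For model training, a list of interface residues like this one is often turned into one binary label per sequence position. The sketch below shows one plausible way to do that; the function name `binding_site_labels` is hypothetical, sequence positions are assumed to be 1-based, and the sequence length of 10 is a made-up value for illustration.

```python
def binding_site_labels(sequence_length, interface_residues):
    """Return a 0/1 label per residue, assuming 1-based sequence positions."""
    interface = set(interface_residues)
    return [1 if seq_id in interface else 0 for seq_id in range(1, sequence_length + 1)]


# First three interface residues from the list printed above;
# the sequence length (10) is invented for this example
labels = binding_site_labels(10, [2, 3, 9])
print(labels)
```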


In [None]:
def partner_interface_residues(assembly_interface_partner):
  """Return the residue indices of a partner whose ASA decreases upon binding."""
  interface_buried_fraction_partner = [{
    'beg_seq_id': bound_region['beg_seq_id'],
    'values': [(1-bound_value/unbound_value) if unbound_value > 0 else 0 for bound_value, unbound_value in zip(bound_region['values'],unbound_region['values'])]
  } for bound_region,unbound_region in zip(assembly_interface_partner['partner_asa_bound'],assembly_interface_partner['partner_asa_unbound'])]
  # Flatten all regions into one list of residue indices with a non-zero buried fraction
  return [
    region['beg_seq_id'] + idx
    for region in interface_buried_fraction_partner
    for idx, value in enumerate(region['values'])
    if value > 0
  ]

# Interface query as before, now also requesting `assembly_id`
assembly_query = """
query AssemblyInstances($assembly_ids: [String!]!) {
  assemblies(assembly_ids:$assembly_ids){
    interfaces {
      rcsb_interface_container_identifiers {
        rcsb_id
        entry_id
        assembly_id
      }
      rcsb_interface_partner{
        interface_partner_identifier{
          entity_id
          asym_id
        }
        interface_partner_feature{
          type
          feature_positions{
            beg_seq_id
            values
          }
        }
      }
    }
  }
}
"""
assembly_data = client.execute(query=assembly_query, variables=assembly_query_variables)

assembly_interfaces = [
    [
        [
            {
              'entry_id': interface['rcsb_interface_container_identifiers']['entry_id'],
              'assembly_id': interface['rcsb_interface_container_identifiers']['assembly_id'],
              'partner_id': partner['interface_partner_identifier'],
              'partner_asa_bound': [feature for feature in partner['interface_partner_feature'] if feature['type'] == "ASA_BOUND"][0]['feature_positions'],
              'partner_asa_unbound': [feature for feature in partner['interface_partner_feature'] if feature['type'] == "ASA_UNBOUND"][0]['feature_positions'],
            }
            for partner in interface['rcsb_interface_partner']
        ]
        for interface in assembly['interfaces']
    ]
    for assembly in assembly_data['data']['assemblies']
]

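# Collapse each interface (a pair of partners) into a single record with the interface residues of both
assembly_interfaces = [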
    [
        {
          'entry_id': interface[0]['entry_id'],
          'assembly_id': interface[0]['assembly_id'],
          'partner_a':{
            'entity_id': interface[0]['partner_id']['entity_id'],
            'asym_id': interface[0]['partner_id']['asym_id'],
            'interface_residues': partner_interface_residues(interface[0])
          },
          'partner_b':{
            'entity_id': interface[1]['partner_id']['entity_id'],
            'asym_id': interface[1]['partner_id']['asym_id'],
            'interface_residues': partner_interface_residues(interface[1])
          },
        }
        for interface in assembly
    ]
    for assembly in assembly_interfaces
  ]

print(json.dumps(assembly_interfaces[0], indent=2))


[
  {
    "entry_id": "1A0Q",
    "assembly_id": "1",
    "partner_a": {
      "entity_id": "1",
      "asym_id": "A",
      "interface_residues": [
        2,
        3,
        9,
        36,
        38,
        43,
        44,
        45,
        46,
        85,
        87,
        94,
        95,
        96,
        97,
        98,
        99,
        100,
        102,
        113,
        115,
        116,
        117,
        118,
        119,
        120,
        122,
        123,
        126,
        130,
        132,
        134,
        136,
        137,
        157,
        158,
        159,
        160,
        161,
        162,
        163,
        164,
        166,
        168,
        173,
        174,
        175,
        177,
        179
      ]
    },
    "partner_b": {
      "entity_id": "2",
      "asym_id": "B",
      "interface_residues": [
        35,
        37,
        39,
        40,
        42,
        43,
        44,
        45,
        46,
        47,
     