# Find Pairwise Interactions
This notebook demonstrates how to calculate pairwise intra- and inter-molecular interactions at specified levels of granularity within biological assemblies and asymmetric units.

In [1]:
from pyspark.sql import SparkSession
from mmtfPyspark.io import mmtfReader
from mmtfPyspark.utils import ColumnarStructure
from mmtfPyspark.interactions import InteractionExtractorPd

### Start a Spark Session

In [2]:
spark = SparkSession.builder.appName("Interactions").getOrCreate()

## Define Interaction Partners
Interactions are defined by specifying two subsets of atoms, named **query** and **target**. Once defined, interactions can be calculated between these two subsets.

### Use Pandas Dataframes to Create Subsets
The InteractionExtractorPd internally uses Pandas dataframe queries to create query and target atom sets. Any of the Pandas column names below can be used to create subsets.

The following cell shows a structure represented as a Pandas dataframe.

In [3]:
structures = mmtfReader.download_mmtf_files(["1OHR"]).cache()

# get first structure from Spark RDD (keys = PDB IDs, value = mmtf structures)
first_structure = structures.values().first()

# convert to a Pandas dataframe
df = ColumnarStructure(first_structure).to_pandas()
df.head(5)

Unnamed: 0,chain_name,chain_id,group_number,group_name,atom_name,altloc,x,y,z,o,b,element,polymer
0,A,A,1,PRO,N,,-3.477,7.714,33.890999,1.0,26.32,N,True
1,A,A,1,PRO,CA,,-2.582,6.722,34.505001,1.0,24.299999,C,True
2,A,A,1,PRO,C,,-1.168,6.908,34.015999,1.0,22.52,C,True
3,A,A,1,PRO,O,,-0.984,7.654,33.063,1.0,22.27,O,True
4,A,A,1,PRO,CB,,-3.083,5.331,34.122002,1.0,26.459999,C,True


### Create a subset of atoms using boolean expressions
The following query creates a subset of ligand (non-polymer) atoms that are not water (HOH) or heavy water (DOD).

In [4]:
query = "not polymer and (group_name not in ['HOH','DOD'])"
df_lig = df.query(query)
df_lig.head(5)

Unnamed: 0,chain_name,chain_id,group_number,group_name,atom_name,altloc,x,y,z,o,b,element,polymer
1755,A,C,201,1UN,C1,,2.645,-4.153,16.315001,1.0,24.549999,C,False
1756,A,C,201,1UN,C2,,3.461,-5.453,16.593,1.0,20.709999,C,False
1757,A,C,201,1UN,C3,,2.685,-6.686,16.166,1.0,19.440001,C,False
1758,A,C,201,1UN,C4,,2.272,-6.548,14.711,1.0,21.07,C,False
1759,A,C,201,1UN,C5,,1.434,-5.277,14.506,1.0,24.780001,C,False
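
The same boolean expression can be tried on a small hand-built dataframe. This is only an illustrative sketch: the first two rows are copied from the tables above, while the water row is made up for the example and is not taken from 1OHR.

```python
import pandas as pd

# Toy atom table: rows 0-1 are copied from the outputs above,
# row 2 is a made-up water atom added only for illustration.
toy = pd.DataFrame({
    "group_name": ["PRO", "1UN", "HOH"],
    "atom_name": ["N", "C1", "O"],
    "polymer": [True, False, False],
})

# Keep non-polymer atoms that are neither water nor heavy water.
ligand_atoms = toy.query("not polymer and (group_name not in ['HOH','DOD'])")
print(ligand_atoms["group_name"].tolist())
```

Only the 1UN row passes both conditions: PRO is rejected by `not polymer`, and the water row by the `not in` test.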


## Calculate Interactions
The following boolean expressions specify two subsets: ligands (query) and polymer groups (target). In this example, interactions within a distance cutoff of 4 Å are calculated.

In [5]:
query = "not polymer and (group_name not in ['HOH','DOD'])"
target = "polymer"
distance_cutoff = 4.0

# the result is a Spark dataframe
interactions = InteractionExtractorPd.get_interactions(structures, distance_cutoff,
                                                       query, target)

# get the first 5 rows of the Spark dataframe and display it as a Pandas dataframe
interactions.limit(5).toPandas()

Unnamed: 0,structure_chain_id,q_chain_name,q_trans,q_group_number,q_group_name,t_chain_name,t_trans,t_group_number,t_group_name
0,1OHR.A,A,0,201,1UN,A,0,82,VAL
1,1OHR.A,A,0,201,1UN,A,0,84,ILE
2,1OHR.A,A,0,201,1UN,A,0,28,ALA
3,1OHR.A,A,0,201,1UN,A,0,30,ASP
4,1OHR.A,A,0,201,1UN,A,0,81,PRO
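
Conceptually, a distance-cutoff search compares atoms of the query set with atoms of the target set. The plain-Python sketch below illustrates that idea with a brute-force loop over a handful of 1OHR atoms whose coordinates are copied from the coord-level output later in this notebook. It is not how InteractionExtractorPd is implemented (the notebook does not show that), and treating the cutoff as inclusive (`<=`) is an assumption.

```python
from math import dist

# Three ligand (query) atoms and three protein (target) atoms of 1OHR.
# Keys are (group_name, group_number, atom_name); values are x, y, z in Å.
query_atoms = {
    ("1UN", 201, "C4"): (2.272, -6.548, 14.711),
    ("1UN", 201, "C5"): (1.434, -5.277, 14.506),
    ("1UN", 201, "C6"): (2.293, -4.012, 14.826),
}
target_atoms = {
    ("THR", 80, "OG1"): (-2.493, -5.322, 13.752),
    ("PRO", 81, "CD"): (-0.749, -7.954, 12.945),
    ("ILE", 84, "CD1"): (-1.076, -2.606, 15.492),
}
distance_cutoff = 4.0

# Brute-force search: test every query atom against every target atom.
# The inclusive comparison (<=) is an assumption of this sketch.
pairs = [
    (q, t, dist(q_xyz, t_xyz))
    for q, q_xyz in query_atoms.items()
    for t, t_xyz in target_atoms.items()
    if dist(q_xyz, t_xyz) <= distance_cutoff
]

for q, t, d in pairs:
    print(q, t, round(d, 3))
```

Five of the nine possible atom pairs fall within the cutoff. A brute-force loop scales with the product of the two set sizes, so it is only practical for small examples like this one.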


## Calculate all interactions
If query and target are not specified, interactions between all atoms are calculated. By default, only intermolecular interactions are included.

In [6]:
interactions = InteractionExtractorPd.get_interactions(structures, distance_cutoff)
interactions.limit(5).toPandas()

Unnamed: 0,structure_chain_id,q_chain_name,q_trans,q_group_number,q_group_name,t_chain_name,t_trans,t_group_number,t_group_name
0,1OHR.B,A,0,96,THR,B,0,96,THR
1,1OHR.B,A,0,26,THR,B,0,24,LEU
2,1OHR.B,A,0,27,GLY,B,0,26,THR
3,1OHR.B,A,0,6,TRP,B,0,87,ARG
4,1OHR.B,A,0,23,LEU,B,0,27,GLY


## Aggregate Interactions at Different Levels of Granularity
Pairwise interactions can be listed at different levels of granularity by setting the **level**:
* **level='coord'**: pairwise atom interactions, distances, and coordinates
* **level='atom'**:  pairwise atom interactions and distances
* **level='group'**: pairwise atom interactions aggregated at the group (residue) level (default)
* **level='chain'**: pairwise atom interactions aggregated at the chain level

The next example lists the interactions at the **coord** level, the level of highest granularity. You need to scroll in the dataframe to see all columns.

In [7]:
interactions = InteractionExtractorPd.get_interactions(structures, distance_cutoff,
                                                       query, target, level='coord')
interactions.limit(5).toPandas()

Unnamed: 0,structure_chain_id,q_chain_name,q_trans,q_group_number,q_group_name,q_atom_name,t_chain_name,t_trans,t_group_number,t_group_name,t_atom_name,distance,q_x,q_y,q_z,t_x,t_y,t_z
0,1OHR.A,A,0,201,1UN,C5,A,0,80,THR,OG1,3.998984,1.434,-5.277,14.506,-2.493,-5.322,13.752
1,1OHR.A,A,0,201,1UN,C5,A,0,84,ILE,CD1,3.795594,1.434,-5.277,14.506,-1.076,-2.606,15.492
2,1OHR.A,A,0,201,1UN,C6,A,0,84,ILE,CD1,3.71087,2.293,-4.012,14.826,-1.076,-2.606,15.492
3,1OHR.A,A,0,201,1UN,C4,A,0,81,PRO,CD,3.771212,2.272,-6.548,14.711,-0.749,-7.954,12.945
4,1OHR.A,A,0,201,1UN,C5,A,0,81,PRO,CD,3.790586,1.434,-5.277,14.506,-0.749,-7.954,12.945
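
To make the relationship between levels concrete, the sketch below collapses the five atom pairs from the output above into group pairs. It only mimics the idea of aggregation and is not the library's code. Keeping the shortest distance per group pair is an illustrative choice made here; the group-level output shown earlier has no distance column.

```python
# Atom-level rows copied from the output above:
# (q_group_number, q_group_name, q_atom_name,
#  t_group_number, t_group_name, t_atom_name, distance)
atom_rows = [
    (201, "1UN", "C5", 80, "THR", "OG1", 3.998984),
    (201, "1UN", "C5", 84, "ILE", "CD1", 3.795594),
    (201, "1UN", "C6", 84, "ILE", "CD1", 3.710870),
    (201, "1UN", "C4", 81, "PRO", "CD", 3.771212),
    (201, "1UN", "C5", 81, "PRO", "CD", 3.790586),
]

# Collapse atom pairs into group pairs, keeping the shortest distance
# seen for each pair (an illustrative choice, see the text above).
group_pairs = {}
for q_num, q_name, _, t_num, t_name, _, d in atom_rows:
    key = (q_num, q_name, t_num, t_name)
    group_pairs[key] = min(d, group_pairs.get(key, d))

for key, d in sorted(group_pairs.items()):
    print(key, d)
```

The five atom-level rows reduce to three group-level pairs: the ligand 1UN 201 with THR 80, PRO 81, and ILE 84.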


## Calculate Inter- vs Intra-molecular Interactions
Inter- and intra-molecular interactions can be calculated by explicitly setting the **inter** and **intra** flags.
* **inter=True** (default)
* **intra=False** (default)

### Find intermolecular salt-bridges
This example uses the default settings (inter=True, intra=False), i.e., it finds intermolecular salt-bridges.

In [8]:
query = "polymer and (group_name in ['ASP', 'GLU']) and (atom_name in ['OD1', 'OD2', 'OE1', 'OE2'])"
target = "polymer and (group_name in ['ARG', 'LYS', 'HIS']) and (atom_name in ['NH1', 'NH2', 'NZ', 'ND1', 'NE2'])"
distance_cutoff = 3.5
    
interactions = InteractionExtractorPd.get_interactions(structures, distance_cutoff,
                                                       query, target, level='atom')
interactions.limit(5).toPandas()

Unnamed: 0,structure_chain_id,q_chain_name,q_trans,q_group_number,q_group_name,q_atom_name,t_chain_name,t_trans,t_group_number,t_group_name,t_atom_name,distance
0,1OHR.B,A,0,29,ASP,OD2,B,0,8,ARG,NH2,2.774146
1,1OHR.A,B,0,29,ASP,OD2,A,0,8,ARG,NH2,2.875426


### Find intramolecular hydrogen bonds
In this example, the inter and intra flags have been set to find intramolecular hydrogen bonds.

In [9]:
query = "polymer and element in ['N','O']"
target = "polymer and element in ['N','O']"
distance_cutoff = 3.5

interactions = InteractionExtractorPd.get_interactions(structures, distance_cutoff,
                                                       query, target, 
                                                       inter=False, intra=True,
                                                       level='atom')
interactions.limit(5).toPandas()

Unnamed: 0,structure_chain_id,q_chain_name,q_trans,q_group_number,q_group_name,q_atom_name,t_chain_name,t_trans,t_group_number,t_group_name,t_atom_name,distance
0,1OHR.A,A,0,43,LYS,O,A,0,44,PRO,N,2.223348
1,1OHR.A,A,0,56,VAL,O,A,0,45,LYS,N,2.845912
2,1OHR.A,A,0,57,ARG,NH1,A,0,42,TRP,NE1,3.002198
3,1OHR.A,A,0,44,PRO,N,A,0,43,LYS,O,2.223348
4,1OHR.A,A,0,58,GLN,N,A,0,43,LYS,O,2.71943
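
A distance cutoff between N and O atoms yields hydrogen-bond candidates rather than confirmed hydrogen bonds. In the output above, rows 0 and 3 list the same atom pair in both directions, and that pair (LYS 43 O and PRO 44 N) belongs to residues that are adjacent in sequence, so its short distance reflects the peptide linkage. A minimal post-processing sketch follows, assuming one wants to drop such rows. The adjacency rule and the deduplication are illustrative choices, not part of mmtfPyspark.

```python
# Rows copied from the output above (all atoms are in chain A):
# (q_group_number, q_group_name, q_atom_name,
#  t_group_number, t_group_name, t_atom_name, distance)
rows = [
    (43, "LYS", "O", 44, "PRO", "N", 2.223348),
    (56, "VAL", "O", 45, "LYS", "N", 2.845912),
    (57, "ARG", "NH1", 42, "TRP", "NE1", 3.002198),
    (44, "PRO", "N", 43, "LYS", "O", 2.223348),
    (58, "GLN", "N", 43, "LYS", "O", 2.719430),
]

seen = set()
candidates = []
for q_num, q_name, q_atom, t_num, t_name, t_atom, d in rows:
    # Skip pairs within one residue or between sequence neighbors
    # (illustrative rule; valid here because all rows share one chain).
    if abs(q_num - t_num) <= 1:
        continue
    # Treat (a, b) and (b, a) as the same pair.
    pair = frozenset([(q_num, q_atom), (t_num, t_atom)])
    if pair in seen:
        continue
    seen.add(pair)
    candidates.append((q_num, q_name, q_atom, t_num, t_name, t_atom, d))

for row in candidates:
    print(row)
```

Three of the five rows remain. Structures with several chains would need the chain name in the comparison as well.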


## Calculate Interactions in the Biological Assembly vs. Asymmetric Unit

In [10]:
structures = mmtfReader.download_mmtf_files(["1STP"]).cache()

By default, interactions in the first biological assembly are calculated. The **bio** parameter specifies the biological assembly number. Most PDB structures have only one biological assembly (bio=1); a few have more than one.
* **bio=1** use first biological assembly (default)
* **bio=2** use second biological assembly
* **bio=None** use the asymmetric unit

In [11]:
query = "not polymer and (group_name not in ['HOH','DOD'])"
target = "polymer"
distance_cutoff = 4.0

# The asymmetric unit is a monomer (1 ligand, 1 protein chain)
interactions = InteractionExtractorPd.get_interactions(structures, distance_cutoff,
                                                       query, target, bio=None)
print("Ligand interactions in asymmetric unit (monomer)        :", interactions.count())

# The first biological assembly is a tetramer (4 ligands, 4 protein chains)
interactions = InteractionExtractorPd.get_interactions(structures, distance_cutoff,
                                                       query, target, bio=1)
print("Ligand interactions in 1st bio assembly (tetramer)      :", interactions.count())

# There is no second biological assembly; in that case, zero interactions are returned
interactions = InteractionExtractorPd.get_interactions(structures, distance_cutoff,
                                                       query, target, bio=2)
print("Ligand interactions in 2nd bio assembly (does not exist):", interactions.count())

Ligand interactions in asymmetric unit (monomer)        : 16
Ligand interactions in 1st bio assembly (tetramer)      : 68
Ligand interactions in 2nd bio assembly (does not exist): 0


The first biological assembly contains 68 - 4 × 16 = 4 additional interactions that are not found in the asymmetric unit.
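
The arithmetic can be spelled out with the counts printed above. The variable names are introduced here for illustration only.

```python
asymmetric_unit = 16  # interactions with bio=None (1 ligand, 1 protein chain)
bio_assembly = 68     # interactions with bio=1 (tetramer)
copies = 4            # the tetramer holds 4 copies of the asymmetric unit

# Interactions in the assembly beyond those within each copy.
additional = bio_assembly - copies * asymmetric_unit
print(additional)  # 4
```

If each copy reproduces the 16 interactions of the asymmetric unit, the remaining 4 presumably connect a ligand with a protein chain from a different copy.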

## Stop Spark!

In [12]:
spark.stop()