# Generalized Network Analysis Tutorial - Step 1

The Network Analysis Tutorial is part of the work entitled **Generalized correlation-based dynamical network analysis: a new high-performance approach for identifying allosteric communications in molecular dynamics trajectories**, by Marcelo C. R. Melo, Rafael C. Bernardi, Cesar de la Fuente-Nunez, and Zaida Luthey-Schulten. For more information see http://faculty.scs.illinois.edu/schulten/. 

In this tutorial, we will use the Dynamic Network Analysis Python package to explore the interactions between the OMP decarboxylase enzyme and its substrate, identifying clusters of amino acid residues that form domains, and active site residues that are important for binding and enzymatic activity.

The tutorial is divided into two Jupyter notebooks. In this notebook, **Step 1**, we will analyze the MD trajectory for the OMP decarboxylase system and generate the network data used for analysis.

The trajectory files are approximately 500 MB in size, and must be downloaded [from this link](http://www.ks.uiuc.edu/~rcbernardi/NetworkAnalysis/DynamicNetworkAnalysis_MDdata.tar.gz) and placed in the *TutorialData* folder.

In the accompanying notebook, **Step 2**, we will load the data generated in Step 1, and create analysis plots and visualizations of the network.

In **Step 3**, which is outside of the jupyter notebook environment, we will produce high-quality renderings of the network using the popular visualization software VMD.

In [1]:
# Load the python package
import os
import dynetan
from dynetan.toolkit import *
from dynetan.viz import *
from dynetan.proctraj import *
from dynetan.gencor import *
from dynetan.contact import *
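import argparse
# networkx is used below as `nx`; import it explicitly rather than relying on the star imports.
import networkx as nx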
from operator import itemgetter

In [2]:
# Create the object that processes MD trajectories.
dnap = DNAproc()

In [3]:
bPythonExport = False

In [4]:
mapResidueNames={'ALA':'A','CYS':'C','ASP':'D','GLU':'E','PHE':'F',
                 'GLY':'G','HIS':'H','HSD':'H','HSE':'H','ILE':'I','LYS':'K','LEU':'L',
                 'MET':'M','ASN':'N','PRO':'P','GLN':'Q','ARG':'R',
                 'SER':'S','THR':'T','VAL':'V','TRP':'W','TYR':'Y',
                 'MG':'Mg','POPC':'Popc',
                 'ATP':'Atp','GTP':'Gtp',
                 'NA':'Sod','SOD':'Sod','CLA':'Cl','CL':'Cl','POT':'Pot','K':'Pot',
                 'SOL':'h2o','HOH':'h2o','WAT':'h2o','TIP':'h2o','H2O':'h2o',
                }
def name_node(dnap, node):
    #i=dnap.nodesAtmSel[node].index
    resname=dnap.nodesAtmSel[node].resname ; resid=dnap.nodesAtmSel[node].resid
    return "%s%s" % (mapResidueNames[resname], resid)

def clarify_duplicate_nodes(dictNames, dictSuffix):
    """
    From two dicts with the same keys, add the respective suffix to all keys in the former that possess duplicate values.
    """
    from itertools import chain
    dictRev = {}
    for k, v in dictNames.items():
        dictRev.setdefault(v, set()).add(k)
    setDuplicateKeys = set(chain.from_iterable( v for k, v in dictRev.items() if len(v) > 1))
    
    for k in setDuplicateKeys:
        dictNames[k] = dictNames[k]+"_"+dictSuffix[k]
    return dictNames

# Node Groups

In Network Analysis, each residue is represented by one or more "nodes", serving as proxies for groups of atoms from the original residue (Figure 1). This approach lowers computational cost and noise.

For our purposes, we treat this as a coarse-graining procedure such that large molecules like ATP and POPC will gain multiple nodes to better distinguish between connections.
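
As a minimal illustration of this idea (plain Python, not the dynetan API), the sketch below maps each atom of a residue to the node that represents it. The ligand `LIG`, its atom names, and its two-node split are hypothetical and chosen only for the example; the dictionary layout mirrors the `usrNodeGroups` structure defined later in this notebook.

```python
# Hypothetical node groups: residue name -> {node atom: set of atoms it represents}.
# Water uses one node (its oxygen); the made-up ligand "LIG" is split into two nodes.
node_groups = {
    "SOL": {"OW": {"OW", "HW1", "HW2"}},
    "LIG": {"C1": {"C1", "C2", "O1"}, "N5": {"N5", "C6", "C7"}},
}

def atom_to_node(resname, atom_name, groups):
    """Return the node atom that represents `atom_name` in residue `resname`."""
    for node_atom, members in groups[resname].items():
        if atom_name in members:
            return node_atom
    raise KeyError("Atom %s of %s is not assigned to any node" % (atom_name, resname))

print(atom_to_node("SOL", "HW2", node_groups))  # OW
print(atom_to_node("LIG", "C7", node_groups))   # N5
```

Every atom that should contribute to contacts and correlations has to belong to exactly one node group, which is why larger molecules need more than one entry.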

In [5]:
# = = = Default workflow control parameters.
bHasInterfaceDefinitions = False
bIncludeSolvent = False
segIDs = [] ; h2oName = []
numSampledFrames = 100
strideDCD = 0
cutoffDist = 4.5 ; contactPersistence = 0.75
numCoresAvailable=1

In [6]:
if bPythonExport:
    parser = argparse.ArgumentParser(description='Process trajectories for Dynamic Network Analysis. '
                                                 'Equivalent to Step 1 of the tutorial without using Jupyter.',
                                                 formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument('-o', '--output', type=str, dest='outputPrefix', default=None,
                        help='Prefix used for all output files.')    
    parser.add_argument('--top', type=str, dest='fileTopology', default='seg.psf',
                        help='The name of a suitable file for MDAnalysis to generate a topology.')
    parser.add_argument('fileTrajectories', type=str, metavar='N', nargs='+', default='sum.xtc',
                        help='One or more trajectory files for MDAnalysis to read in. All frames are combined together.')
    parser.add_argument('-n', '--num_windows', type=int, dest='numWinds', default=None,
                        help='[Opt.] Split the total trajectory frames into N windows. By default equal to the number of files read.')
    parser.add_argument('-f', '--num_frames', type=int, dest='numFrames', default=numSampledFrames,
                        help='Number of frames to read for each window. Total number of frames will then be nWind*nFrames. '
                       'This is computed by DNA as the total frames divided evenly.')
    parser.add_argument('--dcd_stride', type=int, dest='strideDCD', default=strideDCD,
                        help='The frame stride for the output dcd file; use a larger stride to save disk space. Give 0 to save only the first and last frames.')
#    = = = Very inconvenient to do within MDAnalysis. Outside design doc.
#    parser.add_argument('--no_span', dest='bNoSpan', action='store_true',
#                        help='When the number of frames is given, prevent windows from spanning across multiple trajectory files. '
#                       'This will cause remainder frames to be discarded.')
    parser.add_argument('--cpus', type=int, dest='nCPUs', default=numCoresAvailable,
                            help='Set number of processes to enable partial parallelisation.')
    #   =============== Analysis parameters
    parser.add_argument('--segIDs', type=str, dest='segIDs', default=None,
                        help='A list of comma-separated segment IDs of components that should be included in network analysis. '
                             'By default, will try to predict and include segIDs that are not water or named after simple ions. '
                             'Note: The water segment is defined in the next argument.')
    parser.add_argument('--segIDs_solv', type=str, dest='segIDs_solv', default=None,
                        help='Name of the water segment. A prediction will be attempted if not given '
                             'by searching for possible names like SOL, HOH, TIP, WAT, H2O.')
    parser.add_argument('--interfaceA', type=str, dest='interfaceA', default=None,
                        help='Definition of side-A of an interface to be analysed, using MDAnalysis selection syntax.')
    parser.add_argument('--interfaceB', type=str, dest='interfaceB', default=None,
                        help='Definition of side-B of an interface to be analysed, using MDAnalysis selection syntax.')
    parser.add_argument('--exclude_solvent', action='store_true', dest='bExcludeSolvent',
                        help='Boolean to exclude potential solvent contacts in the analysis.')
    parser.add_argument('--cutoff', type=float, dest='distCutoff', default=cutoffDist,
                        help='Distance in Angstroms to consider two nodes as being in contact. '
                             'Calculated as the nearest distance between all atoms of the node.')
    parser.add_argument('--persistence', type=float, dest='ratioContactPersistence', default=contactPersistence,
                        help='Minimum portion of frames for DNA to consider two nodes to be in sufficient contact, '
                             'such that it will include this node pair as an edge in the output graph.')
    args = parser.parse_args()

    numCoresAvailable=args.nCPUs

    # = = = Analysis Configurations = = =
    # Cutoff for contact map (In Angstroms)
    cutoffDist = args.distCutoff
    # Minimum contact persistence (In ratio of total trajectory frames)
    contactPersistence = args.ratioContactPersistence
    # Inclusion of solvent
    bIncludeSolvent = not args.bExcludeSolvent

    # = = = Optional interface definition. A dummy is otherwise calculated.
    if args.interfaceA is not None and args.interfaceB is not None:
        bHasInterfaceDefinitions=True
        seltextInterfaceA=args.interfaceA
        seltextInterfaceB=args.interfaceB
    
    # = = Define selections of segments that will be included. = = 
    # = = Following the NAMD/VMD system, DynaNet uses a segment syntax instead of chain syntax to classify atoms.
    #     Thus, segments will need to be defined in the topology file.
    if args.segIDs is not None:
        segIDs = args.segIDs.split(",")
    if args.segIDs_solv is not None:
        h2oName = args.segIDs_solv.split(",")
    
    topFile = args.fileTopology
    trjFiles = args.fileTrajectories

    if args.outputPrefix is None:
        pathToData = "./results"
        fileNameRoot = "analysis"
    else:
        pathToData, fileNameRoot = os.path.split(args.outputPrefix)
        
    fullPathRoot = os.path.join(pathToData, fileNameRoot)

    # Number of windows created from full simulation.
    if args.numWinds is not None:
        numWinds = args.numWinds
    else:
        numWinds = len(args.fileTrajectories)
        
    # Sampled frames per window
    numSampledFrames = args.numFrames
    
    # Output
    strideDCD = args.strideDCD

In [7]:
# = = = Change to the relevant work folder
if not bPythonExport:
    systemExample="caspase-1"
    if systemExample == "UbqCHARMM":
        %cd /home/zharmad/host/projects/Ubq-md
    elif systemExample == "UbqAthi":
        %cd /home/zharmad/host/shared-colleague/Ubq-2017
    elif systemExample == "periplasmic":
        %cd /home/zharmad/projects/periplasmic/leucine-binding_protein
    elif systemExample == "caspase-1":
        %cd /home/zharmad/host/projects/caspase-1        
    elif systemExample == "CFTR":
        %cd /home/zharmad/projects/cftr/DyNetAn
    else:
        %cd ..

/mnt/d/ubuntu/projects/caspase-1


In [8]:
# = = = Customised workflow control parameters for publication examples.
if not bPythonExport:
    numCoresAvailable = 6
    if systemExample == "UbqCHARMM":
        # apo1 apo2 apo3
        state="apo1" ; workDir = "./apo-md01"
        topFile = "./tops/prot-segids.pdb"
        trjFiles= os.path.join(workDir, "prot-masses.xtc" )
        numWinds = 5 ; numSampledFrames = 1000
        pathToData = "./dynetan"
        fileNameRoot = "%s_%ix%i" % (state, numWinds, numSampledFrames)
        fullPathRoot = os.path.join(pathToData, fileNameRoot)
        #segIDs = ["UBQ"] ; h2oName = ["SOL"]
    elif systemExample == "UbqAthi":
        # UbqI13V  UbqI23A  UbqI30A  UbqL43A  UbqL67A  UbqL69A  UbqV17A  UbqWT
        state="UbqV17A" ; workDir = "./%s" % state
        topFile = os.path.join(workDir, "reference.pdb" )
        trjFiles= os.path.join(workDir, "traj-1ns.xtc" )
        numWinds = 5 ; numSampledFrames = 200
        pathToData = "./dynetan"
        fileNameRoot = "%s_%ix%i" % (state, numWinds, numSampledFrames)
        fullPathRoot = os.path.join(pathToData, fileNameRoot)
        #segIDs = ["UBQ"] ; h2oName = ["SOL"]        
    elif systemExample == "periplasmic":
        # apo holo-leu
        state="apo" ; workDir = "./apo"
        topFile = "./apo/tops/prot-segids.pdb"
        trjFiles= os.path.join(workDir, "protcent.xtc" )
        numWinds = 5 ; numSampledFrames = 200
        pathToData = "./dynetan"
        fileNameRoot = "%s_%ix%i" % (state, numWinds, numSampledFrames)
        fullPathRoot = os.path.join(pathToData, fileNameRoot)
        #segIDs = ["LBP"] ; h2oName = ["SOL"]
        if state != "apo":
            bHasInterfaceDefinitions = True
            seltextInterfaceA="resid 15 18 77 78 79 80 100 101 102 103 118 121 124 150 202 226 227 276"
            seltextInterfaceB="resid 347"
    elif systemExample == "caspase-1":
        # off-state on-state
        state="off-state" ; workDir = "./off-state"
        topFile = "%s/apo-em.pdb" % workDir
        trjFiles= os.path.join(workDir, "prod/centered.xtc" )
        numWinds = 2 ; numSampledFrames = 5
        pathToData = "./dynetan"
        fileNameRoot = "%s_%ix%i" % (state, numWinds, numSampledFrames)
        fullPathRoot = os.path.join(pathToData, fileNameRoot)
        #segIDs = ["C1A","C1B","C2A","C2B"] ; h2oName = ["SOL"]
        bIncludeSolvent=True
    elif systemExample == "CFTR":
        # Define mutant file IO locations. wt, P67L, E56K, R75Q, dF508 , S945L
        allele="wt" ; temperature="310K" ; nRepl=6
        workDir = "./trajectories/%s/%s/" % (allele, temperature)
        topFile = os.path.join(workDir, "1/clustered_d3.5_r0.50.pdb")
        trjFiles=[]
        for i in range(1,nRepl+1):
            trjFiles.append(os.path.join(workDir, "%i/clustered_d3.5_r0.50.xtc" %i))
        pathToData = "./results/%s/%s/" % (allele, temperature)
        fileNameRoot = "1to%i" % nRepl
        fullPathRoot = os.path.join(pathToData, fileNameRoot)        
        numWinds = 5 ; numSampledFrames = 200
        #segIDs = ["LAS","TD1","ND1","RDO","TD2","ND2","CTR","LIP","ATP","POT","CLA"]
        segIDs = ["LAS","TD1","ND1","RDO","TD2","ND2","CTR","CRY"]
        h2oName = ["SOL"]
        bIncludeSolvent=True
    else:
        raise KeyboardInterrupt

In [9]:
if numCoresAvailable>1:
    strBackend="openmp"
else:
    strBackend="serial"

if not os.path.exists(pathToData):
    os.makedirs(pathToData)

In [10]:
# Number of sampled frames for automatic selection of solvent and ions.
# numAutoFrames = numSampledFrames*numWinds

# Network Analysis will make one node per protein residue (in the alpha carbon)
# For all other residues, the user must specify atom(s) that will represent a node.
# ...
# We also need to know the heavy atoms that compose each node group.

customResNodes = {}
usrNodeGroups = {}

# Cater to HSD and HSE in the CHARMM forcefield.
# Lip: Uses POPC as a reference. Node designations include:
#      ...zwitterion centers, ester centers, and two additional carbon centers for each tail capturing ~7 atoms.
#      ...heavy atom assignment based on charge groups in CHARMM36
#      ...a PSFGEN error means that the naming of carbons C216 -> 6C12.
#customResNodes["HSD"] = ["CA"]
#usrNodeGroups["HSD"] = {}
#usrNodeGroups["HSD"]["CA"] = set("N CA CB ND1 CG CE1 NE2 CD2 C O".split())
#customResNodes["HSE"] = ["CA"]
#usrNodeGroups["HSE"] = {}
#usrNodeGroups["HSE"]["CA"] = set("N CA CB ND1 CG CE1 NE2 CD2 C O".split())

customResNodes["K"] = ["K"]
usrNodeGroups["K"] = {}
usrNodeGroups["K"]["K"] = set("K".split())

customResNodes["CL"] = ["CL"]
usrNodeGroups["CL"] = {}
usrNodeGroups["CL"]["CL"] = set("CL".split())

customResNodes["SOL"] = ["OW"]
usrNodeGroups["SOL"] = {}
usrNodeGroups["SOL"]["OW"] = set("OW HW1 HW2".split())

In [11]:
#################################
### Load info to object

dnap.setNumWinds(numWinds)
dnap.setNumSampledFrames(numSampledFrames)
dnap.setCutoffDist(cutoffDist)
dnap.setContactPersistence(contactPersistence)
dnap.seth2oName(h2oName)
dnap.setSegIDs(segIDs)

dnap.setCustomResNodes(customResNodes)
dnap.setUsrNodeGroups(usrNodeGroups)

# Load the trajectory

Our Generalized Network Analysis leverages the MDAnalysis package to create a *universe* that contains all the trajectory and system information.

In [12]:
dnap.loadSystem(topFile,trjFiles)

In [13]:
# = = = First check if requests are sane.
if numWinds*numSampledFrames > dnap.workU.trajectory.n_frames:
    print("= = ERROR: You have requested more simulation frames to be analysed than are available!")
    if bPythonExport:
        import sys
        sys.exit()
    else:
        raise KeyboardInterrupt

In [14]:
# = = = Attempt to predict segment IDs to be analysed, if not given.
ionSegIDs   = ['ION','SOD','CLA','POT','HET','NA','K','CL']
waterSegIDs = ['HOH','SOL','WAT','TIP','H2O']
if segIDs == []:
    # = = = grab every segment that is not named after likely simple ion or water segments.
    soluteIDs = []
    for s in dnap.workU.segments.segids:
        if (not s in ionSegIDs) and (not s in waterSegIDs):
            soluteIDs.append( s )
    dnap.setSegIDs( soluteIDs )
    print("= = Automated solute segment prediction has assigned %s to be included for analysis." % (soluteIDs,))
if h2oName == []:
    # = = = Search the segments for a name that sounds like water
    for s in waterSegIDs:
        if s in dnap.workU.segments.segids:
            h2oName = [s]
            break
    if h2oName != []:
        dnap.seth2oName(h2oName)
    else:
        # = = = Is possibly a protein-only trajectory. Give a dummy argument.
        dnap.seth2oName(['HOH'])
    print("= = Automated water segment prediction has assigned %s to be the water segment ID." % (h2oName,))

### Checks segments and residue names

This check is important to find residues in the structure that we did not know of and that need to be addressed, so that network analysis can create nodes in all selected residues.

In [15]:
dnap.checkSystem()

Residue verification:

---> SegID  C1A : 20 unique residue types:
{'ARG', 'ALA', 'SER', 'TYR', 'MET', 'CYS', 'GLY', 'GLN', 'GLU', 'TRP', 'LEU', 'LYS', 'PRO', 'HIS', 'THR', 'VAL', 'ASP', 'ASN', 'ILE', 'PHE'}

---> SegID  [32mC1B [39m: 20 unique residue types:
{'ARG', 'ALA', 'SER', 'TYR', 'MET', 'CYS', 'GLY', 'GLN', 'GLU', 'TRP', 'LEU', 'LYS', 'PRO', 'HIS', 'THR', 'VAL', 'ASP', 'ASN', 'ILE', 'PHE'}

---> SegID  [32mC2A [39m: 20 unique residue types:
{'ARG', 'ALA', 'SER', 'TYR', 'MET', 'CYS', 'GLY', 'GLN', 'GLU', 'TRP', 'LEU', 'LYS', 'PRO', 'HIS', 'THR', 'VAL', 'ASP', 'ASN', 'ILE', 'PHE'}

---> SegID  [32mC2B [39m: 20 unique residue types:
{'ARG', 'ALA', 'SER', 'TYR', 'MET', 'CYS', 'GLY', 'GLN', 'GLU', 'TRP', 'LEU', 'LYS', 'PRO', 'HIS', 'THR', 'VAL', 'ASP', 'ASN', 'ILE', 'PHE'}

---> 20 total selected residue types:
{'ARG', 'ALA', 'SER', 'TYR', 'MET', 'CYS', 'GLY', 'GLN', 'GLU', 'TRP', 'LEU', 'LYS', 'PRO', 'HIS', 'THR', 'VAL', 'ASP', 'ASN', 'ILE', 'PHE'}

---> 3 

### Automatically identify crystallographic waters and ions.

In this section we identify all residues which will be checked for connectivity with selected segments.

- First, we define if all solvent molecules will be checked, or if just ions, ligands, lipids, and other molecules will be checked.

- Second, we sample a small set of frames from the trajectory to select likely residues, then we check the whole trajectory to see if they are closer than the cutoff distance for at least x% of the simulation (where x is the "contact persistence" fraction of the trajectory).

- Third, we load the trajectory of the relevant atoms to memory to speed up the analysis.
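
The persistence test in the second step can be sketched in plain Python as follows. This is only an illustration under assumptions: the distances are invented, and whether the package uses strict or non-strict comparisons for the cutoff and the persistence threshold is not stated here.

```python
def contact_persistence(distances, cutoff):
    """Fraction of frames in which the distance is at or below the cutoff."""
    in_contact = [d <= cutoff for d in distances]
    return sum(in_contact) / len(in_contact)

# Hypothetical nearest-atom distances (in Angstroms) between one water molecule
# and the selected segments, one value per sampled frame.
distances = [3.1, 4.0, 4.4, 5.2, 3.8, 4.1, 6.0, 4.3]
cutoff_dist = 4.5        # same value as cutoffDist in this notebook
min_persistence = 0.75   # same value as contactPersistence in this notebook

fraction = contact_persistence(distances, cutoff_dist)
keep_residue = fraction >= min_persistence
print(fraction, keep_residue)  # 0.75 True
```

A residue that stays within the cutoff in six of the eight frames meets the 75% threshold and would be kept under these assumptions.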

In [16]:
dnap.selectSystem(withSolvent=bIncludeSolvent)

Checking 10 frames (striding 1000)...



17 extra residues will be added to the system.
The initial universe had 71749 atoms.
The final universe has 4071 atoms.
Loading universe to memory...




### Prepare network representation of the system

Here we check that we know how to treat all types of residues in the final selection. Every residue will generate one or more nodes in the final network. Then we store the groups of atoms that define each node.

In [17]:
# dnap.getU().residues[300]
# dnap.getU().residues[300].resname
# 'C13' in set.union(*dnap.resNodeGroups['POPC'].values())

In [18]:
dnap.prepareNetwork()

Preparing nodes...



Nodes are ready for network analysis.


## Align the trajectory based on selected segments

We align the trajectory to its first frame using heavy atoms (non-hydrogen) from the selected segments. In the process, we also transfer the trajectory to the computer memory, so that future analysis and manipulations are completed faster.

In [19]:
# If your system is too large, you can turn off the "in memory" option, at a cost for performance.
dnap.alignTraj(inMemory=True)


### Select residues that are closer than 4.5 Å for more than 75% of the simulation

Creates an N-by-N matrix for all N nodes in the selected region, and automatically selected nodes (ions, solvent).

The following cell runs efficient functions to perform the analysis and create a contact matrix. We leverage both the MDAnalysis parallel contact detection tools and accelerated Numba and Cython functions. After creating the contact matrix, we remove any automatically selected nodes that have insufficient persistence, and filter the contacts by (optionally) removing contacts between nodes in consecutive residues in a protein or nucleic acid chain.

**Attention:** Each time you start this Jupyter notebook, the first execution of this function may take significantly longer (several seconds) to start. This is because we use *Cython* and *Numba* to compile functions "on demand", and a new compilation may be necessary after the notebook is restarted.

In [20]:
# To speed-up the contact matrix calculation, a larger stride can be selected, at a cost for precision.
dnap.findContacts(stride=1)


We found 1 nodes with no contacts.
We found 2387 contacting pairs out of 137550 total pairs of nodes.
(That's 1.7%, by the way)


### Removing contacts between nodes in the same residue.

The following function guarantees that there will be no "self-contacts" (contacts between a node and itself), and gives you the opportunity to remove contacts between nodes in consecutive residues (such as sequential amino acids in the same chain, removing backbone interactions).

The function also removes nodes that are never in contact with any other node in the system (such as the ends of flexible chains, or residues in flexible loops). This will automatically update the MDAnalysis universe and related network information, such as the number of nodes and atom-to-node mappings.
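
The filtering logic described above can be sketched on a toy contact matrix. This is a plain-Python illustration, not the package's implementation; the helper `filter_contacts`, the matrix, and the residue numbering are hypothetical.

```python
def filter_contacts(contacts, node_residue):
    """Zero out self and same-residue contacts, then list isolated nodes."""
    n = len(contacts)
    filtered = [row[:] for row in contacts]
    for i in range(n):
        for j in range(n):
            if i == j or node_residue[i] == node_residue[j]:
                filtered[i][j] = 0
    # A node is isolated when its row has no remaining contacts.
    isolated = [i for i in range(n) if sum(filtered[i]) == 0]
    return filtered, isolated

# Toy symmetric contact matrix for 4 nodes; nodes 0 and 1 belong to the same residue.
contacts = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 1],
]
node_residue = [10, 10, 11, 12]

filtered, isolated = filter_contacts(contacts, node_residue)
print(filtered)  # [[0, 0, 1, 0], [0, 0, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0]]
print(isolated)  # [1, 3]
```

Node 1 only touched another node of its own residue and node 3 only touched itself, so both end up isolated and would be dropped from the network.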

In [21]:
dnap.filterContacts(notSameRes=True, notConsecutiveRes=False, removeIsolatedNodes=True)


Window: 0
We found 2265 contacting pairs out of 137550 total pairs of nodes.
(That's 1.647%, by the way)
Window: 1
We found 2294 contacting pairs out of 137550 total pairs of nodes.
(That's 1.668%, by the way)

Removing isolated nodes...

We found 1 nodes with no contacts.

Isolated nodes removed. We now have 524 nodes in the system

Running new contact matrix sanity check...
We found 0 nodes with no contacts.
We found 2387 contacting pairs out of 137026 total pairs of nodes.
(That's 1.7%, by the way)

Updating Universe to reflect new node selection...
Updating atom-to-node mapping...



## Calculate Generalized Correlation with Python/Numba

In [22]:
# We can calculate generalized correlations in parallel using Python's multiprocessing package.
dnap.calcCor(ncores=numCoresAvailable)

Calculating correlations...

Using window length of 5000 simulation steps.
- > Using multi-core implementation with 6 threads.



## Calculate Cartesian distances between all nodes in the selected system.

Here, we will calculate the **shortest** distance between atoms in all pairs of nodes. It is similar to the contact matrix calculation, but we check all distances and keep the shortest one to use in our analysis.
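
A minimal sketch of this "shortest distance between two groups of atoms" idea for a single frame is shown below. It uses only the standard library (`math.dist`, Python 3.8+); the coordinates and the helper `min_group_distance` are hypothetical and not part of the package.

```python
import math

def min_group_distance(group_a, group_b):
    """Shortest Euclidean distance between any atom in group_a and any atom in group_b."""
    return min(math.dist(a, b) for a in group_a for b in group_b)

# Hypothetical atom coordinates (in Angstroms) for the atoms of two nodes.
node_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
node_b = [(4.0, 0.0, 0.0), (1.0, 3.0, 0.0)]

print(min_group_distance(node_a, node_b))  # 3.0
```

All four atom pairs are checked and only the smallest value is kept for the node pair.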

In [23]:
print( dnap.atomToNode.shape )

(4070,)


In [24]:
# We can leverage MDAnalysis parallelization options with backend="serial" or backend="openmp".
# For very small systems, the serial can be faster!
dnap.calcCartesian(backend=strBackend)

Sampling a total of 10 frames from 2 windows (5 per window)...



# Network Calculations
Create a graph from our correlation matrix. Different properties are calculated:

*Density* measures how connected the graph is compared to how connected it *could* be. It is the ratio between edges in the graph over all possible edges between all pairs of nodes.

*Transitivity* measures the triadic closure, comparing present triangles to possible triangles. In a triangle, if A is connected to B, and B is connected to C, then A is connected to C.

*Degree* measures the number of connections a node has.

[Reference][1]

[1]:https://programminghistorian.org/en/lessons/exploring-and-analyzing-network-data-with-python#advanced-networkx-community-detection-with-modularity
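
To make these three definitions concrete, the sketch below computes them by hand for a small toy graph, using only the standard library. The graph is hypothetical; in the notebook itself the same quantities come from NetworkX.

```python
from itertools import combinations

# Toy undirected graph: a triangle (0, 1, 2) plus a pendant node 3 attached to node 2.
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
nodes = sorted({n for edge in edges for n in edge})
neighbors = {n: set() for n in nodes}
for u, v in edges:
    neighbors[u].add(v)
    neighbors[v].add(u)

# Degree: number of connections of each node.
degree = {n: len(neighbors[n]) for n in nodes}

# Density: edges present divided by all possible node pairs.
n_nodes = len(nodes)
density = len(edges) / (n_nodes * (n_nodes - 1) / 2)

# Transitivity: closed triads divided by all triads
# (a triad is a pair of neighbors sharing a central node).
triads = sum(d * (d - 1) // 2 for d in degree.values())
closed = sum(1 for n in nodes
             for a, b in combinations(sorted(neighbors[n]), 2)
             if b in neighbors[a])
transitivity = closed / triads

print(degree)                  # {0: 2, 1: 2, 2: 3, 3: 1}
print(round(density, 4))       # 0.6667
print(round(transitivity, 4))  # 0.6
```

The single triangle closes three of the five triads, which gives a transitivity of 0.6, while four of the six possible edges are present.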

In [25]:
dnap.calcGraphInfo()

In [26]:
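# Basic information of the network as interpreted as a graph.
# Note: nx.info() was removed in NetworkX 3.0; with newer versions, print(dnap.nxGraphs[0]) gives a similar summary.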
print( nx.info(dnap.nxGraphs[0]))

Name: 
Type: Graph
Number of nodes: 524
Number of edges: 2265
Average degree:   8.6450


In [27]:
# Both density and transitivity are scaled from 0 to 1
for win in range(dnap.numWinds):
    print("----- Window {} -----".format(win))
    print("Density:", round( nx.density(dnap.nxGraphs[win]), 4) )
    print("Transitivity:", round( nx.transitivity(dnap.nxGraphs[win]), 4) )
    print()

----- Window 0 -----
Density: 0.0165
Transitivity: 0.4159

----- Window 1 -----
Density: 0.0167
Transitivity: 0.4139



In [28]:
from operator import itemgetter

# We can check the nodes that have the most connections in each window.
for win in range(dnap.numWinds):
    print("----- Window {} -----".format(win))
    
    sorted_degree = sorted(dnap.getDegreeDict(win).items(), key=itemgetter(1), reverse=True)
    
    print("Top 5 nodes by degree: [node --> degree : selection]")
    for n,d in sorted_degree[:5]:
        print("{0:>4} --> {1:>2} : {2}".format(n, d, getSelFromNode(n, dnap.nodesAtmSel)))
    
    print()

----- Window 0 -----
Top 5 nodes by degree: [node --> degree : selection]
 223 --> 18 : resname ARG and resid 374 and segid C1B
 477 --> 16 : resname ARG and resid 374 and segid C2B
  79 --> 15 : resname MET and resid 211 and segid C1A
 179 --> 15 : resname PHE and resid 330 and segid C1B
 220 --> 15 : resname ARG and resid 371 and segid C1B

----- Window 1 -----
Top 5 nodes by degree: [node --> degree : selection]
 223 --> 18 : resname ARG and resid 374 and segid C1B
 477 --> 17 : resname ARG and resid 374 and segid C2B
 219 --> 15 : resname PHE and resid 370 and segid C1B
 220 --> 15 : resname ARG and resid 371 and segid C1B
 333 --> 15 : resname MET and resid 211 and segid C2A



## Calculate optimal paths
We choose the Floyd-Warshall algorithm[1]. It uses the **correlations as weights** to calculate network distances and shortest paths.

[1]:https://networkx.github.io/documentation/stable/reference/algorithms/generated/networkx.algorithms.shortest_paths.dense.floyd_warshall.html?highlight=warshall#networkx.algorithms.shortest_paths.dense.floyd_warshall
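
The sketch below illustrates the Floyd-Warshall recursion on a three-node toy network in plain Python. The conversion of a correlation `c` into an edge length `-log(c)` is an assumption made for this example (it is a common choice in dynamical network analysis, so that strongly correlated pairs are "close"); the correlation values and the helper `floyd_warshall` are hypothetical.

```python
import math

def floyd_warshall(n_nodes, edge_weights):
    """All-pairs shortest path lengths for an undirected weighted graph."""
    dist = [[math.inf] * n_nodes for _ in range(n_nodes)]
    for i in range(n_nodes):
        dist[i][i] = 0.0
    for (u, v), w in edge_weights.items():
        dist[u][v] = dist[v][u] = w
    # Allow each node k in turn to act as an intermediate stop.
    for k in range(n_nodes):
        for i in range(n_nodes):
            for j in range(n_nodes):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

# Hypothetical correlations for the edges of a 3-node network.
correlations = {(0, 1): 0.9, (1, 2): 0.8, (0, 2): 0.3}
# Assumed conversion: highly correlated pairs get short network distances.
weights = {pair: -math.log(c) for pair, c in correlations.items()}

dist = floyd_warshall(3, weights)
print(round(dist[0][2], 3))  # 0.329
```

Under this assumed weighting, the optimal path from node 0 to node 2 goes through node 1 (two strongly correlated edges) rather than along the weakly correlated direct edge.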

In [29]:
dnap.calcOptPaths(ncores=numCoresAvailable)


## Calculate betweenness

We calculate both betweenness centrality[1] for edges and eigenvector centrality[2] for nodes.

[1]:https://networkx.github.io/documentation/stable/reference/algorithms/generated/networkx.algorithms.centrality.edge_betweenness_centrality.html?highlight=betweenness#networkx.algorithms.centrality.edge_betweenness_centrality
[2]:https://networkx.github.io/documentation/networkx-1.10/reference/generated/networkx.algorithms.centrality.eigenvector_centrality.html
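
As an illustration of the node measure, eigenvector centrality can be approximated with a power iteration. The sketch below is a plain-Python version on a hypothetical unweighted toy graph; it is not the routine used by the package, and the function name and parameters are chosen only for this example.

```python
import math

def eigenvector_centrality(neighbors, n_iter=200, tol=1e-10):
    """Power iteration on (A + I); returns a unit-length centrality vector."""
    x = {n: 1.0 for n in neighbors}
    for _ in range(n_iter):
        # Adding the node's own value (the "+ I" shift) avoids oscillation on bipartite graphs.
        new = {n: x[n] + sum(x[m] for m in neighbors[n]) for n in neighbors}
        norm = math.sqrt(sum(v * v for v in new.values()))
        new = {n: v / norm for n, v in new.items()}
        if sum(abs(new[n] - x[n]) for n in neighbors) < tol:
            return new
        x = new
    return x

# Toy graph: a triangle (0, 1, 2) plus a pendant node 3 attached to node 2.
neighbors = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
centrality = eigenvector_centrality(neighbors)
most_central = max(centrality, key=centrality.get)
print(most_central)  # 2
```

A node scores highly when its neighbors score highly, so node 2, which is connected to every other node, comes out on top and the pendant node 3 comes out last.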

In [30]:
dnap.calcBetween(ncores=numCoresAvailable)


In [31]:
from itertools import islice

# Pairs of nodes with highest betweenness values, compared to their correlation values (in Window 0)
for k,v in islice(dnap.btws[0].items(),5):
    print("Nodes {} have betweenness {} and correlation {}.".format(
        k, round(v,3), round(dnap.corrMatAll[0, k[0], k[1]], 3) ) )

Nodes (127, 362) have betweenness 0.043 and correlation 0.451.
Nodes (362, 364) have betweenness 0.033 and correlation 0.889.
Nodes (219, 220) have betweenness 0.028 and correlation 0.923.
Nodes (149, 219) have betweenness 0.028 and correlation 0.399.
Nodes (220, 500) have betweenness 0.026 and correlation 0.315.


Turn to node centrality instead of edge centrality:

In [32]:
dnap.calcEigenCentral()

### Calculate communities



We detect communities using the **Louvain heuristic**, a method that maximizes the modularity of the network.

http://iopscience.iop.org/article/10.1088/1742-5468/2008/10/P10008/meta
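
Modularity, the quantity that the Louvain heuristic tries to maximize, can be computed by hand for a given partition. The sketch below does this in plain Python for a hypothetical unweighted toy graph; it evaluates a partition but does not perform the Louvain search itself.

```python
def modularity(edges, community):
    """Newman modularity Q of a partition of an undirected, unweighted graph."""
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    q = 0.0
    for c in set(community.values()):
        # Edges with both ends inside community c.
        internal = sum(1 for u, v in edges if community[u] == c and community[v] == c)
        # Total degree of the nodes in community c.
        total_deg = sum(d for n, d in degree.items() if community[n] == c)
        q += internal / m - (total_deg / (2 * m)) ** 2
    return q

# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
good_split = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
one_block = {n: "A" for n in range(6)}

print(round(modularity(edges, good_split), 4))  # 0.3571
print(round(modularity(edges, one_block), 4))   # 0.0
```

Splitting the graph along the bridge gives a clearly positive modularity, while lumping all nodes into one community gives zero, which is why the first partition is preferred.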

In [33]:
dnap.calcCommunities()

In [34]:
# Sort communities based on number of nodes
for comIndx in dnap.nodesComm[0]["commOrderSize"]:
    print("Modularity Class {0:>2}: {1:>3} nodes.".format(comIndx, len(dnap.nodesComm[0]["commNodes"][comIndx])))

Modularity Class  2:  71 nodes.
Modularity Class 11:  51 nodes.
Modularity Class  4:  47 nodes.
Modularity Class  8:  47 nodes.
Modularity Class  1:  44 nodes.
Modularity Class  6:  42 nodes.
Modularity Class  9:  41 nodes.
Modularity Class 13:  40 nodes.
Modularity Class  0:  35 nodes.
Modularity Class  7:  24 nodes.
Modularity Class 12:  22 nodes.
Modularity Class  5:  21 nodes.
Modularity Class 10:  21 nodes.
Modularity Class  3:  18 nodes.


In [35]:
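# Sort communities based on the node with highest eigenvector centrality.
# Note: `win` below is not set in this cell; it keeps its value from the last loop that
# used it (the final window), while the communities are taken from window 0.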
for comIndx in dnap.nodesComm[0]["commOrderEigenCentr"]:
    print("Modularity Class {0} ({1} nodes) Sorted by Eigenvector Centrality:".format(
                                                                    comIndx, 
                                                                len(dnap.nodesComm[0]["commNodes"][comIndx])))
    for node in dnap.nodesComm[0]["commNodes"][comIndx][:5]:
        print("Name: {0:>4} | Degree: {1:>2} | Eigenvector Centrality: {2}".format(
            node, dnap.nxGraphs[win].nodes[node]['degree'], dnap.nxGraphs[win].nodes[node]['eigenvector']))
    print()

Modularity Class 2 (71 nodes) Sorted by Eigenvector Centrality:
Name:   79 | Degree: 14 | Eigenvector Centrality: 0.08072045586896259
Name:   74 | Degree: 12 | Eigenvector Centrality: 0.07309015923508207
Name:   38 | Degree: 14 | Eigenvector Centrality: 0.06356710683400515
Name:   82 | Degree: 10 | Eigenvector Centrality: 0.06902842190785805
Name:   76 | Degree: 10 | Eigenvector Centrality: 0.05655282279950122

Modularity Class 3 (18 nodes) Sorted by Eigenvector Centrality:
Name:   54 | Degree: 13 | Eigenvector Centrality: 0.06889422385735083
Name:   50 | Degree:  8 | Eigenvector Centrality: 0.04125924843110719
Name:   53 | Degree: 12 | Eigenvector Centrality: 0.0562058260561231
Name:   57 | Degree: 13 | Eigenvector Centrality: 0.06636348473837075
Name:   61 | Degree: 12 | Eigenvector Centrality: 0.05691158708423974

Modularity Class 1 (44 nodes) Sorted by Eigenvector Centrality:
Name:  129 | Degree: 12 | Eigenvector Centrality: 0.04691658821770893
Name:  132 | Degree: 12 | Eigenvector

# Process Interface Residues

We now find all nodes that are close to both selections chosen by the user. That may include amino acids in the interface, as well as ligands, waters and ions.

In [36]:
if bHasInterfaceDefinitions:
    dnap.interfaceAnalysis(selAstr=seltextInterfaceA, selBstr=seltextInterfaceB)
else:
    # Conduct analysis on dummy inputs since this is hard-coded.
    dnap.interfaceAnalysis(selAstr=getSelFromNode(0,dnap.nodesAtmSel), selBstr=getSelFromNode(1,dnap.nodesAtmSel) )


35 pairs of nodes connecting the two selections.
34 unique nodes in interface node pairs.


Compute additional graph properties here to save some time for the analysis in Step 2.

In [37]:
# Note: btws is the same as the results derived from nx.edge_betweenness_centrality()
for w in range(numWinds):
    G = dnap.nxGraphs[w]

    # Transfer edge-betweenness to graph.
    for u,v in dnap.btws[w]:
        G.edges[u,v]['btws']=dnap.btws[w][u,v]
    # Note: edge-betweenness weighted clustering coefficient.
    c = nx.clustering(G, weight='btws')
    for x in range(dnap.numNodes):
        G.nodes[x]['bwcc']=c[x]
    # Node betweenness centrality as an alternative to eigenvector centrality.
    c = nx.betweenness_centrality(dnap.nxGraphs[w])
    for x in range(dnap.numNodes):
        dnap.nxGraphs[w].nodes[x]['btws']=c[x]

    # Set the name of the node for future display. Append atom names to residues that have multiple nodes.
    # It's sometimes important to do things here in Step 1 when issues of duplicate nodes might arise.
    nodeNames={} ; nodeSegIDs={} ; nodeAtomNames={}
    for x in G.nodes():
        #i=dnap.nodesAtmSel[x].index
        nodeNames[x]  = name_node(dnap, x)
        nodeSegIDs[x] = dnap.nodesAtmSel[x].segid
        nodeAtomNames[x] = dnap.nodesAtmSel[x].name
    nodeNames = clarify_duplicate_nodes( nodeNames, nodeAtomNames )
    nx.set_node_attributes(G, nodeNames, "name")
    nx.set_node_attributes(G, nodeSegIDs, "segid")

## Save data and reduced trajectory for visualization

In [38]:
dnap.saveData(fullPathRoot)

In [39]:
# This function will save a reduced DCD trajectory with the heavy atoms used for network analysis
# A smaller trajectory can be created by choosing a "stride" that sub-samples the original trajectory.
# This function will also produce a PDB file so that information on atoms and residues can be loaded to
#    visualization software such as VMD.
if strideDCD == 0:
    strideDCD = dnap.workU.trajectory.n_frames-1
print("We will save {} heavy atoms and {} frames.".format(dnap.workU.atoms.n_atoms, 
                                                          len(dnap.workU.trajectory[::strideDCD]) ))

We will save 4070 heavy atoms and 2 frames.


In [40]:
dnap.saveReducedTraj(fullPathRoot, stride = strideDCD)




MDAnalysis may print warnings regarding missing data fields, such as altLocs, icodes, occupancies, or tempfactor, which provide information commonly found in PDB files.
The warnings are for your information and in the context of this tutorial they are expected and do not indicate a problem.

# Analysis

We have finished processing the trajectory and storing all related data. We can now move on to the analysis of the network properties calculated here.

**All analysis code was placed in a second tutorial notebook for clarity.**


# ---- The End ----