Hitting time!! So, guys, this is about a small project, or more like a job, I did recently. It is about studying signal pathways in proteins. We tried to reproduce results from a paper published in PLOS Computational Biology, titled "Signal propagation in proteins and relation to equilibrium fluctuations". At first sight the paper looks like routine biology, but it isn't. It is really a mathematical and computational paper, involving a bit of probability, coding and algorithms. When I first read it, I was amazed by its originality. It is pure application. So, let's get down to business.
We have a protein whose native structure is determined, and we have all the data about that structure, including the (x, y, z) co-ordinates of all the atoms (mainly C, N, O, S, H). Now, the protein structure determines the protein function, but the structure is not frozen. There are continuous motions in various parts of the protein, and these motions can be random. They may be local, or they may change the structure at a global level. In the present paper, the authors wanted to show how these local or global rearrangements help in studying the communication patterns in a native protein. They set up a discrete Markov chain model of information flow on the protein (which involves a bit of probability theory), devised an algorithm to solve for hitting times, and plotted the data for further analysis. Let's go through this as a step-by-step process.
STEP#1- Now we have the data about the protein structure, thanks to the Protein Data Bank (PDB). Say each residue is a node, and these residues interact with each other. The interactions are defined as per the Gaussian Network Model (GNM), where each edge is treated as a spring that vibrates when disturbed. We need to give each interaction a quantitative value, call it the affinity, which captures the strength of communication between the two nodes. The affinity is calculated from the number of atoms in residue/node v(i) that are in contact with atoms in residue/node v(j). This number of atom-atom contacts, say N(i)(j), is divided by a factor of sqrt( N(i)*N(j) ), where N(i) and N(j) are the numbers of large (heavy) atoms in residues v(i) and v(j). This is done because residues with many large atoms would otherwise get inflated affinity values even when only a few actual contacts pass through them, so we bring the affinity down by this size-dependent factor. Now we have defined the affinity between all 'n' residues/nodes, written a(i)(j) for nodes i and j. Notice that a(i)(j) = a(j)(i), so the matrix A = [a(i)(j)] is symmetric. Let's call A the affinity matrix. After that, we normalize the affinity matrix along its rows: add up all the elements in the i'th row (say they sum to d(i)) and divide each element of that row by d(i). Call this new normalized affinity matrix M = [m(i)(j)],
such that ∑ m(i)(j) = 1 for each 'i', with the sum running over j = 1 to n.
This matrix M is called the transition matrix, since each element m(i)(j) represents the probability of moving from the present state, node v(i), to node v(j). There is a kind of transition taking place from one node to another, hence the name. Once this matrix is obtained, end of step#1.
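Here is a minimal C++ sketch of how step#1 can be put together, assuming the atom-atom contact counts N(i)(j) and the per-residue atom counts N(i) have already been extracted from the PDB file (the function name and containers here are my own, not the paper's code):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Minimal sketch of step#1.
// Inputs: contacts[i][j] = number of atom-atom contacts N(i)(j) between
// residues i and j, and atomCounts[i] = number of (large) atoms N(i) in
// residue i, both assumed to be already extracted from the PDB file.
// Output: the row-normalized transition matrix M.
std::vector<std::vector<double>>
buildTransitionMatrix(const std::vector<std::vector<int>>& contacts,
                      const std::vector<int>& atomCounts)
{
    const std::size_t n = atomCounts.size();
    std::vector<std::vector<double>> M(n, std::vector<double>(n, 0.0));

    // Affinity a(i)(j) = N(i)(j) / sqrt( N(i)*N(j) ).
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            M[i][j] = contacts[i][j] /
                      std::sqrt(double(atomCounts[i]) * double(atomCounts[j]));

    // Row normalization: divide row i by its sum d(i), so each row becomes
    // a probability distribution over the next node.
    for (std::size_t i = 0; i < n; ++i) {
        double d = 0.0;
        for (std::size_t j = 0; j < n; ++j) d += M[i][j];
        if (d > 0.0)
            for (std::size_t j = 0; j < n; ++j) M[i][j] /= d;
    }
    return M;
}
```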
STEP#2- In this step, we introduce something called a Markov process. In the present problem, the process of passing information along the protein is modeled as a discrete Markov chain. A chain here means a sequence of steps or transitions as described in step#1. Now, a Markov chain is defined mathematically as given below-
Pr( X(n+1) = x(n+1) | X(n) = x(n), X(n-1) = x(n-1), X(n-2) = x(n-2), ... )
= Pr( X(n+1) = x(n+1) | X(n) = x(n) )
which means the probability of the (n+1)'th step depends only on the n'th step. To put it crudely, the past history does not affect the future; only the present matters. Now that we know what a Markov process is, the question is how (and why) it is used here. Suppose you are at some node v(i) initially, also called the initial state X(0) = v(i); then what is the probability that you are at node v(j) after taking one step, i.e. X(1) = v(j), where the X's are random variables? The initial state could be anything, so we define a vector p(0) whose dimension equals the number of nodes. The j'th component p(j)(0) of p(0) represents the probability of starting the process from node v(j). Once the process is started, we would like to know the probability of the next step ending at some node v(j); this is given by the probability vector p(1), with components p(j)(1) = ∑ m(i)(j)*p(i)(0), summed over i (in matrix notation, p(1) = p(0)*M if we treat p as a row vector, since our M is row-normalized). For example, if we are initially at v(i) with probability p(i)(0) and in the next step we jump to node v(j), the probability of that particular route is m(i)(j)*p(i)(0): the product of the probability of jumping from v(i) to v(j) and the probability of being at v(i) initially. As simple as that. Since this is a Markov process, p(1) serves as the initial condition for calculating p(2). After doing this for k steps, we get p(k) = p(0)*M^k, i.e. the transition matrix applied k times. Now, if we continue this process for infinitely many steps, we reach a distribution called the "stationary distribution". The stationary distribution is the long-run probability of finding the walk at each node, and it is independent of the initial condition: whatever the values of p(0) are (for a connected contact network like ours), we end up at the same stationary distribution. Done with this, end of step#2.
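To make this concrete, here is a small sketch (my own helper, not code from the paper) that pushes an initial distribution p(0) through the chain over and over until it stops changing, which gives a numerical estimate of the stationary distribution:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch: apply the transition matrix repeatedly to an initial distribution
// p(0). With our row-normalized M, one step is p(j) <- sum over i of
// p(i)*m(i)(j). When p stops changing, we have an estimate of the
// stationary distribution.
std::vector<double>
stationaryDistribution(const std::vector<std::vector<double>>& M,
                       std::vector<double> p,   // p(0), entries sum to 1
                       int maxSteps = 10000, double tol = 1e-12)
{
    const std::size_t n = p.size();
    for (int step = 0; step < maxSteps; ++step) {
        std::vector<double> next(n, 0.0);
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j)
                next[j] += p[i] * M[i][j];       // probability mass flowing i -> j

        double change = 0.0;                     // how much the distribution moved
        for (std::size_t j = 0; j < n; ++j)
            change += std::abs(next[j] - p[j]);
        p = next;
        if (change < tol) break;                 // converged: p is (numerically) stationary
    }
    return p;
}
```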
STEP#3- This step contains the major part of the work I did. Now we will define the hitting time and try to determine its value. The hitting time is defined as the average number of steps taken to go from state i to state j, i.e. from node v(i) to v(j), denoted H(j,i). To calculate it, we assume we start at node v(i) and consider the probability of reaching v(j) for the first time in exactly r steps, denote it P(j,i::r). This P(j,i::r) acts as a weight on the number of steps r, so the average looks like
P(j,i::1)*1 + P(j,i::2)*2 + P(j,i::3)*3 + P(j,i::4)*4+..
and so on. This series needs to converge if you want the hitting time between i and j to be finite, and evaluating it directly is mathematically tedious. Also, we have
H(j,i) != H(i,j).
So it is not the case that once you calculate one direction, the other direction comes for free. For these reasons, a simple recursive algorithm is used instead. It goes as follows-
H(j,i)= ∑ [ 1+H(j,k) ] *m(i)(k) ; for k= 1 to n
Now, the above equation is a recursive formula. It says that from v(i) you first move to some arbitrary node v(k), which takes "1" step, and from there you still need H(j,k) steps on average to reach the j'th node. So the contribution to the overall average is the product of 1+H(j,k) and m(i)(k) (which acts as the weight on the one-step move i-->k). We apply this formula recursively until we reach node j. A simple simplification can be done to the equation, which looks like-
H(j,i) = ∑ m(i)(k) + ∑ H(j,k)*m(i)(k),
for k=1 to n in first term and for k != j in second term
H(j,i) = 1 + ∑ H(j,k)*m(i)(k) for k != j
Notice that H(j,j) = 0. Now that we understand the algorithm, end of step#3.
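As a quick sanity check (my own addition, not part of the paper's procedure), hitting times can also be estimated by brute force: simulate many random walks starting at v(i) and average the number of steps until v(j) is first reached. For example, on a symmetric two-node chain with m(1)(1) = m(1)(2) = 0.5, the recursion gives H(2,1) = 1 + 0.5*H(2,1), so H(2,1) = 2, and a simulation like the sketch below should land near 2.

```cpp
#include <random>
#include <vector>

// Brute-force check: estimate H(j,i) by simulating many random walks that
// start at node i and averaging the number of steps until node j is reached
// for the first time.
double estimateHittingTime(const std::vector<std::vector<double>>& M,
                           int i, int j, int numWalks = 100000)
{
    std::mt19937 rng(42);                       // fixed seed for reproducibility

    // One sampler per row of M, so each step draws the next node according
    // to the transition probabilities m(node)(k).
    std::vector<std::discrete_distribution<int>> rowSampler;
    for (const auto& row : M)
        rowSampler.emplace_back(row.begin(), row.end());

    long long totalSteps = 0;
    for (int w = 0; w < numWalks; ++w) {
        int node = i;
        while (node != j) {
            node = rowSampler[node](rng);       // take one random step
            ++totalSteps;
        }
    }
    return double(totalSteps) / numWalks;
}
```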
STEP#4- Now we come to the implementation of the above algorithm. It is not as easy as it looks, because implementing the algorithm means looking at it from a new angle, one that is numerically solvable. So here is the approach. Expand the whole thing in H(j,i). In the expansion we get the undetermined terms H(j,k), where k runs over the other nodes in the system. Now expand each H(j,k) the same way. We get n-1 equations in total: the expansions of the n-2 other H(j,k)'s plus the one for H(j,i), hence n-1. At the same time we have exactly n-1 unknowns, so it is a simple system of linear equations with a unique solution. To solve it you put the equations in matrix form, and setting up that matrix and solving it (via the inverse of the coefficient matrix) is the major part of the problem. I did this part and coded it in C++. For a fixed target j, the solution vector holds the hitting times H(j,k) from every other node k; repeating this for each target node fills out the full hitting-time matrix, whose ij'th element represents the hitting time from node i to node j. While writing the code for this algorithm, the things to keep in mind are: memory minimization (save memory at every possible step), avoiding stack overflow, and the indexes of the matrix (keep track at every point of what is what).
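To show what the linear-system step can look like in code, here is a sketch (a reconstruction in the same spirit, with the function name and layout mine, and using Gaussian elimination with partial pivoting instead of explicitly forming the inverse). For a fixed target node j, every i != j satisfies H(j,i) - ∑ m(i)(k)*H(j,k) = 1 (sum over k != j), which in matrix form is (I - M')h = 1, where M' is M with row and column j removed.

```cpp
#include <cmath>
#include <stdexcept>
#include <vector>

// Solve (I - M')h = 1 for a fixed target node j, where M' is the transition
// matrix M with row and column j removed. The returned vector h holds the
// hitting times H(j, idx[r]) for every node other than j.
std::vector<double>
hittingTimesTo(const std::vector<std::vector<double>>& M, int j)
{
    const int n = static_cast<int>(M.size());
    std::vector<int> idx;                       // surviving node indices (all except j)
    for (int i = 0; i < n; ++i)
        if (i != j) idx.push_back(i);

    const int m = n - 1;
    // Augmented matrix [A | b] with A = I - M' and b = vector of ones
    // (the "+1 step" on the right-hand side of the recursion).
    std::vector<std::vector<double>> a(m, std::vector<double>(m + 1, 0.0));
    for (int r = 0; r < m; ++r) {
        for (int c = 0; c < m; ++c)
            a[r][c] = (r == c ? 1.0 : 0.0) - M[idx[r]][idx[c]];
        a[r][m] = 1.0;
    }

    // Gaussian elimination with partial pivoting.
    for (int col = 0; col < m; ++col) {
        int pivot = col;
        for (int r = col + 1; r < m; ++r)
            if (std::abs(a[r][col]) > std::abs(a[pivot][col])) pivot = r;
        std::swap(a[col], a[pivot]);
        if (std::abs(a[col][col]) < 1e-15)
            throw std::runtime_error("singular system: is the network connected?");
        for (int r = col + 1; r < m; ++r) {
            double f = a[r][col] / a[col][col];
            for (int c = col; c <= m; ++c) a[r][c] -= f * a[col][c];
        }
    }

    // Back substitution.
    std::vector<double> h(m, 0.0);
    for (int r = m - 1; r >= 0; --r) {
        double s = a[r][m];
        for (int c = r + 1; c < m; ++c) s -= a[r][c] * h[c];
        h[r] = s / a[r][r];
    }
    return h;                                   // h[r] = H(j, idx[r])
}
```

Looping this function over every target j assembles the full hitting-time matrix described above.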