### From Principal Component (PCA) to Direct Coupling Analysis (DCA) of Coevolution in Proteins

This notebook takes a look at a 2013 paper by Simona Cocco, Remi Monasson, and Martin Weigt titled
**From Principal Component to Direct Coupling Analysis of Coevolution in Proteins: Low-Eigenvalue Modes are Needed for Structure Prediction.** (2013Cocco)

Link: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003176

This paper looks at extracting functional and structural information from Multiple Sequence Alignments (MSA) of homologous proteins. First, a covariance matrix of the residues is computed from the MSA. The paper then connects two approaches:

1. PCA - which identifies correlated groups of residues
2. DCA - which identifies residue-residue contacts

It shows how these two methods are related in non-intuitive ways using sophisticated statistical-physics models. This connection allows one to perform a form of dimension reduction on DCA and to accurately predict residue-residue contacts with a smaller number of parameters. It also shows that the low-eigenvalue modes, which are discarded by PCA, are actually important for recovering contact information.
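To make the role of the low-eigenvalue modes concrete, here is a minimal sketch on a synthetic covariance matrix (the matrix and all variable names are illustrative and not taken from the paper). In mean-field DCA the couplings are derived from the inverse of the covariance matrix, and the inverse weights each eigenmode by 1/λ, so the modes that PCA would discard contribute the most to it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic symmetric positive-definite "covariance" matrix (illustrative only)
A = rng.normal(size=(6, 6))
C = A @ A.T + 0.1 * np.eye(6)

# eigh returns the eigenvalues of a symmetric matrix in ascending order
eigvals, eigvecs = np.linalg.eigh(C)

# Spectral form of the inverse: C^-1 = sum_k (1 / lambda_k) v_k v_k^T
C_inv = (eigvecs / eigvals) @ eigvecs.T
assert np.allclose(C_inv, np.linalg.inv(C))

# Share of each mode in trace(C) (PCA's view) and in trace(C^-1) (inverse-covariance view)
weight_in_C = eigvals / eigvals.sum()
weight_in_C_inv = (1.0 / eigvals) / (1.0 / eigvals).sum()

print("share of trace(C)    :", np.round(weight_in_C, 3))
print("share of trace(C^-1) :", np.round(weight_in_C_inv, 3))
```

The largest share of `trace(C)` sits in the last (largest) eigenvalue, while the largest share of `trace(C^-1)` sits in the first (smallest) one.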

We use a DHFR alignment as an example to try out the methods of the paper.

In [1]:
import os
import itertools
import numpy as np

In [2]:
datadir = "../data"

msa_file = os.path.join(datadir, "DHFR.aln")

# Read all the lines in the file into a 2D array of type S1
with open(msa_file) as fh:
    arr = np.array([[x for x in line.strip()] for line in fh], np.dtype("S1"))

print("shape =", arr.shape, ",dtype= ", arr.dtype)

shape = (56165, 186) ,dtype=  |S1


In [3]:
# M is the number of sequences
# L is the length
M, L = arr.shape

In [4]:
# the first sequence
arr[0, :].tobytes()  # tostring() is deprecated and removed in NumPy 2.0

b'VRPLNCIVAVSQNMGIGKNGDLPWPPLRNEFKYFQRMTTTSSVEGKQNLVIMGRKTWFSIPEKNRPLKDRINIVLSRELKEPPRGAHFLAKSLDDALRLIEQPELASKVDMVWIVGGSSVYQEAMNQPGHLRLFVTRIMQEFESDTFFPEIDLGKYKLLPEYPGVLSEVQEEKGIKYKFEVYEKKD'

In [5]:
# the second sequence
arr[1, :].tobytes()

b'----SIVVVMCKRFGIGRNGVLPWSPLQADMQRFRSITAG-------GGVIMGRTTFDSIPEEHRPLQGRLNVVLTTSADLMKNSNIIFVSSFDELDAIVGL----HDHLPWHVIGGVSVYQHFLEKSQVTSMYVTFVDGSLECDTFFPHQFLSHFEITRA---SALMSDTTSGMSYRFVDYTR--'

In [6]:
# We can order the amino acids any way we like
# Here is a sorting based on some amino acid properties. 
# https://proteinstructures.com/Structure/Structure/amino-acids.html
AMINO_ACIDS = np.array([aa for aa in "RKDEQNHSTCYWAILMFVPG-"], "S1")

### Compute the weights of each sequence

In [7]:
progress_bar = True
try:
    from IPython.display import clear_output
except ImportError:
    progress_bar = False


In [8]:
hamming_cutoff = 0.2 # This is x in equation 27 in the 2013 Cocco et al. paper

weights_file = os.path.join(datadir, "DHFR.weights.npy")

if os.path.isfile(weights_file):
    weights = np.load(weights_file)
    print("Loading weights from : ", weights_file)

else:
    weights = np.zeros(M)

    for i in range(M):
        weights[i] = 1. / (np.sum(np.sum(arr[i, :] != arr, axis=1) <= hamming_cutoff * L))
        if i % 100 == 0:
            if progress_bar:
                clear_output(wait=True)
            print("Processing sequence", i, "of", M)
    np.save(weights_file, weights)
    print("Finished computing sequence weights and saved to : ", weights_file)


Loading weights from :  ../data/DHFR.weights.npy
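The loop above compares one sequence at a time against the whole alignment. As a minimal sketch of the same reweighting rule, the version below processes several rows per step using broadcasting and checks the result on a toy alignment. The function name `sequence_weights` and the `block` parameter are hypothetical helpers, not part of the notebook.

```python
import numpy as np

def sequence_weights(msa, cutoff=0.2, block=16):
    """Weight of a sequence = 1 / (number of sequences within cutoff * L mismatches)."""
    n_seq, length = msa.shape
    weights = np.empty(n_seq)
    for start in range(0, n_seq, block):
        chunk = msa[start:start + block]                          # (b, L)
        # Hamming distances between the chunk and every sequence: (b, n_seq)
        dist = (chunk[:, None, :] != msa[None, :, :]).sum(axis=2)
        weights[start:start + block] = 1.0 / (dist <= cutoff * length).sum(axis=1)
    return weights

# Toy alignment: sequences 0 and 1 differ at 1 of 10 positions, sequence 2 is unrelated
toy = np.array(
    [list("ACDEFGHIKL"), list("ACDEFGHIKM"), list("WWWWWWWWWW")], dtype="S1"
)
toy_weights = sequence_weights(toy)
print(toy_weights)        # the two similar sequences share their weight
print(toy_weights.sum())  # effective number of sequences for the toy alignment
```

Each step allocates a boolean array of shape `(block, n_seq, L)`, so `block` should stay small for an alignment as large as the DHFR one.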


In [9]:
M_eff = sum(weights) # Equation 28 in 2013 Cocco et al. paper
print(int(round(M_eff)))

15238


In [10]:
q = 21
pseudo_count = round(M_eff)

In [11]:
arr_onehot = np.zeros(arr.shape + (q,), dtype=np.uint8)

for i, a in enumerate(AMINO_ACIDS):
    arr_onehot[..., i] = (arr == a)

In [12]:
arr_onehot.shape

(56165, 186, 21)

In [13]:
arr_onehot[:10, :20, 0]

array([[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]],
      dtype=uint8)

### Compute single- and double-site marginals

In [13]:
single_site_marginal_file = os.path.join(datadir, "DHFR.single.npy")
double_site_marginal_file = os.path.join(datadir, "DHFR.double.npy")

if os.path.isfile(double_site_marginal_file) and os.path.isfile(single_site_marginal_file):
    f_i_a = np.load(single_site_marginal_file)
    print("Loading single site marginals from ", single_site_marginal_file)

    f_i_j_a_b = np.load(double_site_marginal_file)
    print("Loading double site marginals from ", double_site_marginal_file)    
    
else:
    # single site marginals. Eqn 29 in 2013 Cocco et al. paper
    f_i_a = np.zeros((L, q))
    # double site marginals
    f_i_j_a_b = np.zeros((L*q, L*q))

    # This is a dictionary where the index (i, a) points to a
    # numpy array of integers with one entry per sequence: 1 if that
    # sequence has amino acid a at position (column) i, and 0 otherwise
    ia = dict()
    for i, a in itertools.product(range(L), range(q)):
        # np.int was removed in NumPy 1.24; the builtin int is equivalent here
        ia[(i, a)] = (arr[:, i] == AMINO_ACIDS[a]).astype(int)

    for i, a in itertools.product(range(L), range(q)):
        delta_i_a = ia[(i,a)]
        f_i_a[i, a] =  np.sum(weights * delta_i_a)
        for j, b in itertools.product(range(L), range(q)):
            f_i_j_a_b[a + q*i, b + q*j] = np.sum(weights * ia[(j,b)] * delta_i_a )
        if progress_bar:
            clear_output(wait=True)
        print("Finished processing i={}, a={}, AA={}".format(i, a, AMINO_ACIDS[a].decode()))

    del ia
    np.save(single_site_marginal_file, f_i_a)
    np.save(double_site_marginal_file, f_i_j_a_b)
    print("Finished computing single and double site marginals and saved to cache files")

Loading single site marginals from  ../data/DHFR.single.npy
Loading double site marginals from  ../data/DHFR.double.npy
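The weighted counts computed above are sums over sequences, so they can also be written as matrix products over a flattened one-hot encoding. The sketch below shows this on a small random alignment and cross-checks one entry against the explicit definition. The toy alphabet, the stand-in weights `w`, and the variable names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
alphabet = np.array(list("AC-"), dtype="S1")        # toy alphabet, q = 3
n_seq, length, q = 50, 4, len(alphabet)
toy = alphabet[rng.integers(0, q, size=(n_seq, length))]
w = rng.uniform(0.1, 1.0, size=n_seq)               # stand-in sequence weights

# One-hot encode, then flatten (site, state) into one axis with index a + q*i
onehot = (toy[:, :, None] == alphabet).astype(float)    # (n_seq, length, q)
flat = onehot.reshape(n_seq, length * q)

single = (w @ flat).reshape(length, q)      # sum_m w_m * delta(a, x_mi)
double = flat.T @ (w[:, None] * flat)       # sum_m w_m * delta(a, x_mi) * delta(b, x_mj)

# Cross-check one entry against the explicit definition
i, a, j, b = 1, 0, 3, 2
expected = np.sum(w * (toy[:, i] == alphabet[a]) * (toy[:, j] == alphabet[b]))
assert np.isclose(double[a + q * i, b + q * j], expected)
```

For the full DHFR alignment the flattened one-hot array is large, so in practice one would probably accumulate the product over blocks of sequences.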


In [14]:
# Add Pseudo count
f_i_a = 1. / (M_eff + pseudo_count) * (pseudo_count/q + f_i_a)
f_i_j_a_b = 1. / (M_eff + pseudo_count) * (pseudo_count/(q*q) + f_i_j_a_b)
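As a quick sanity check of the pseudo-count step, the sketch below applies the same single-site formula to a toy site where all the weight sits on one state. The numbers (`m_eff = 100`) are made up; only the formula mirrors the cell above.

```python
import numpy as np

q = 21
m_eff = 100.0
pseudo = 100.0   # same choice as above: pseudo count equal to M_eff

# Toy weighted counts at one site: all weight on the first state
counts = np.zeros(q)
counts[0] = m_eff

freq = (pseudo / q + counts) / (m_eff + pseudo)

print(freq.sum())        # frequencies stay normalized
print(freq[0], freq[1])  # observed state vs. a state that was never observed
```

With `pseudo == m_eff`, a state that never occurs still gets frequency `1 / (2 * q)`, which is `1/42` for `q = 21`.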

In [32]:
# Covariance Matrix

C_i_j_a_b = np.zeros(((q-1)*L, (q-1)*L), dtype=f_i_j_a_b.dtype)

for i, a in itertools.product(range(L), range(q-1)):
    for j, b in itertools.product(range(L), range(q-1)):
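        # f_i_j_a_b was filled with stride q (index a + q*i), while C_i_j_a_b
        # drops the last state (the gap) and therefore uses stride q-1
        C_i_j_a_b[a + i*(q-1), b + j*(q-1)] = f_i_j_a_b[a + q*i, b + q*j] - f_i_a[i, a] * f_i_a[j, b]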
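The same covariance matrix can be assembled without Python loops. The sketch below assumes the layout used when the marginals were computed, namely that the pair frequencies are indexed as `a + q*i`, and drops the last state at every site through a reshape. The function name is hypothetical and the inputs are random toy arrays.

```python
import numpy as np

def covariance_without_last_state(f1, f2):
    """f1: (L, q) single-site frequencies; f2: (L*q, L*q) pair frequencies, index a + q*i."""
    length, q = f1.shape
    full = f2 - np.outer(f1.ravel(), f1.ravel())        # (L*q, L*q)
    blocks = full.reshape(length, q, length, q)
    reduced = blocks[:, :q - 1, :, :q - 1]              # drop the last state at every site
    return reduced.reshape(length * (q - 1), length * (q - 1))

# Toy check against the explicit definition
rng = np.random.default_rng(2)
L_toy, q_toy = 3, 4
f1 = rng.random((L_toy, q_toy))
f2 = rng.random((L_toy * q_toy, L_toy * q_toy))
C_toy = covariance_without_last_state(f1, f2)

i, a, j, b = 2, 1, 0, 2
assert np.isclose(
    C_toy[a + (q_toy - 1) * i, b + (q_toy - 1) * j],
    f2[a + q_toy * i, b + q_toy * j] - f1[i, a] * f1[j, b],
)
```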

In [48]:
arr[:, 0].tobytes()

b'V------------------------------------------------------------...

In [67]:
c = C_i_j_a_b[:(q-1), :(q-1)]

In [70]:
c

array([[0.00056689, 0.00056689, 0.00056689, 0.00056689, 0.00056689,
        0.00056689, 0.00056689, 0.00056689, 0.00056689, 0.00056689,
        0.00056689, 0.00056689, 0.00056689, 0.00056689, 0.00056689,
        0.00056689, 0.00056689, 0.00056689, 0.00056689, 0.00056689],
       [0.00056689, 0.00056689, 0.00056689, 0.00056689, 0.00056689,
        0.00056689, 0.00056689, 0.00056689, 0.00056689, 0.00056689,
        0.00056689, 0.00056689, 0.00056689, 0.00056689, 0.00056689,
        0.00056689, 0.00056689, 0.00056689, 0.00056689, 0.00056689],
       [0.00056689, 0.00056689, 0.00056689, 0.00056689, 0.00056689,
        0.00056689, 0.00056689, 0.00056689, 0.00056689, 0.00056689,
        0.00056689, 0.00056689, 0.00056689, 0.00056689, 0.00056689,
        0.00056689, 0.00056689, 0.00056689, 0.00056689, 0.00056689],
       [0.00056689, 0.00056689, 0.00056689, 0.00056689, 0.00056689,
        0.00056689, 0.00056689, 0.00056689, 0.00056689, 0.00056689,
        0.00056689, 0.00056689, 0.00056689, 0

In [73]:
c[np.where(~np.isclose(c, c[0,0]))]

array([0.00056707])

In [56]:
C_i_j_a_b_trim = np.zeros(((q-1)* (L - 10), (q-1) * (L-10)), dtype=C_i_j_a_b.dtype)

In [61]:
for i, a in itertools.product(range(5, L - 5), range(q-1)):
    for j, b in itertools.product(range(5, L-5), range(q-1)):
        C_i_j_a_b_trim[a + (i - 5)*(q-1), b + (j-5)*(q-1)] = C_i_j_a_b[a + i * (q-1), b + j * (q-1)]

In [63]:
np.linalg.cond(c)

3.524265801905483e+17
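The constant block has a simple explanation. Column 0 is almost entirely gaps, so for the 20 non-gap states the weighted counts are essentially zero and only the pseudo count remains: each single-site frequency is about 1/(2q) = 1/42 and each same-site pair frequency is about 1/(2q²) = 1/882, which gives 1/882 - 1/42² = 1/1764 ≈ 0.00056689 in every entry. A matrix of identical entries has rank 1, hence the huge condition number. The sketch below reproduces this on idealized numbers and compares it with an alternative that is often used in mean-field DCA implementations, where the same-site block is taken as f_ii(a,b) = δ_ab f_i(a). That alternative is stated here as an assumption, not as what this notebook computes.

```python
import numpy as np

q = 21
# Frequencies of the 20 non-gap states at a fully gapped column when
# pseudo_count == M_eff: only the pseudo count contributes, so f = 1 / (2q)
f = np.full(q - 1, 1.0 / (2 * q))

# Uniform pseudo count on the same-site block, as in the cells above
f_pair_uniform = np.full((q - 1, q - 1), 1.0 / (2 * q * q))
block_uniform = f_pair_uniform - np.outer(f, f)

# Assumed alternative: same-site pair frequencies are diagonal,
# f_ii(a, b) = delta_ab * f_i(a)
block_diag = np.diag(f) - np.outer(f, f)

print(block_uniform[0, 0])                     # every entry equals 1/1764
print(np.linalg.matrix_rank(block_uniform))    # rank of the constant block
print(np.linalg.cond(block_diag))              # condition number of the alternative
```

Under the diagonal convention the block is diag(f) - f fᵀ, which is positive definite as long as the retained frequencies sum to less than 1.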

In [16]:
C_self = np.zeros_like(C_i_j_a_b)

In [17]:
for i, a in itertools.product(range(L), range(q-1)):
    for b in range(q-1):
        C_self[a + i*(q-1), b + i*(q-1)] = C_i_j_a_b[a + i*(q-1), b + i*(q-1)]
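Extracting the same-site blocks can also be done with a boolean mask instead of nested loops. This is a minimal sketch with a hypothetical helper name, checked on a small toy matrix.

```python
import numpy as np

def same_site_blocks(C, n_sites, n_states):
    """Keep only the (n_states x n_states) diagonal blocks of C, zero elsewhere."""
    mask = np.kron(np.eye(n_sites), np.ones((n_states, n_states))).astype(bool)
    return np.where(mask, C, 0.0)

# Toy matrix with 3 sites and 2 states per site
C_toy = np.arange(36, dtype=float).reshape(6, 6)
C_self_toy = same_site_blocks(C_toy, n_sites=3, n_states=2)
print(C_self_toy)
```

For the notebook's matrix the call would be `same_site_blocks(C_i_j_a_b, L, q - 1)`.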

In [18]:
import scipy.linalg

In [19]:
D = scipy.linalg.sqrtm(C_self)

Matrix is singular and may not have a square root.
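`sqrtm` warns because `C_self` is singular. For a symmetric positive semi-definite matrix, one alternative is to build the square root from the eigendecomposition and clip small negative eigenvalues caused by round-off to zero. The helper `psd_sqrt` below is a hypothetical sketch of that idea, not something the paper prescribes.

```python
import numpy as np

def psd_sqrt(mat, floor=0.0):
    """Symmetric square root of a symmetric positive semi-definite matrix."""
    eigvals, eigvecs = np.linalg.eigh(mat)
    eigvals = np.clip(eigvals, floor, None)     # remove tiny negative round-off
    return (eigvecs * np.sqrt(eigvals)) @ eigvecs.T

# Rank-1 (singular) example: a constant 3x3 matrix
S = np.full((3, 3), 2.0)
R = psd_sqrt(S)
assert np.allclose(R @ R, S)
```

An inverse square root would still need care for the zero eigenvalues, for example by using a strictly positive `floor` or by restricting to the non-singular subspace.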


In [20]:
l, v = np.linalg.eigh(C_i_j_a_b)

In [21]:
l.min()

-2.8410949583068415e-15

In [22]:
m = C_i_j_a_b[:(q-1), :(q-1)]

In [23]:
np.linalg.eigh(m)

(array([-2.28615051e-18, -8.52378583e-19, -2.01509782e-20, -9.81403051e-21,
        -1.86249946e-26, -8.76002263e-30, -2.12910979e-36, -1.01916114e-36,
        -6.27244354e-37,  1.82626432e-38,  3.59199208e-37,  1.43491997e-36,
         3.45193416e-26,  9.73826072e-22,  1.22307603e-21,  1.15373194e-20,
         4.21600444e-20,  3.35534087e-18,  1.77114728e-07,  1.13378689e-02]),
 array([[ 9.07206865e-01,  2.15027211e-01, -4.05752679e-03,
          1.11873476e-03,  4.93812973e-07,  1.80194042e-07,
         -7.86591673e-14,  6.38176929e-14,  2.24678636e-14,
         -1.83130335e-14, -3.99031262e-14,  1.40497337e-13,
          9.02424024e-08,  1.78111659e-02,  8.96798588e-03,
         -1.19973074e-03, -4.21733086e-03, -2.78698440e-01,
          5.12993374e-02, -2.23606701e-01],
        [ 2.93933286e-01, -4.00363448e-01,  7.52971782e-03,
         -3.05519304e-04,  8.92891275e-08,  6.63494475e-08,
         -2.89621145e-14,  2.34388489e-14,  8.28018063e-15,
         -6.66848980e-15, -1.46099