# Gremlin

Gremlin predicts residue-residue contacts from evolutionary covariation in a multiple sequence alignment (MSA). I use the C++ version.

## Installation
Clone the Gremlin C++ git repository and compile the source code:

```
git clone https://github.com/sokrypton/GREMLIN_CPP.git
cd GREMLIN_CPP
g++ -O3 -std=c++0x -o gremlin_cpp gremlin_cpp.cpp -fopenmp
```

## Running Gremlin
Run the binary on an MSA:

```
./gremlin_cpp -i <input_msa> -o <out_file>
```
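To run the binary over many alignments, a small wrapper can build the command for each protein. This is a minimal sketch, assuming one MSA file per UniProt ID. The directory layout, the `.msa` extension and both function names are hypothetical; only the `-i` and `-o` flags shown above are used.

```python
import subprocess
from pathlib import Path


def build_gremlin_command(msa_file, out_file, binary="./gremlin_cpp"):
    """Return the argument list for one Gremlin run (only -i and -o are used)."""
    return [str(binary), "-i", str(msa_file), "-o", str(out_file)]


def run_gremlin_batch(uniprot_ids, msa_dir, out_dir, binary="./gremlin_cpp"):
    """Run Gremlin for every ID that has an MSA and no output yet.

    The file naming (<id>.msa in, <id>.gremlin.ssv out) is an assumption.
    """
    msa_dir, out_dir = Path(msa_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for uniprot_id in uniprot_ids:
        msa_file = msa_dir / f"{uniprot_id}.msa"
        out_file = out_dir / f"{uniprot_id}.gremlin.ssv"
        if not msa_file.is_file() or out_file.is_file():
            continue  # no input, or output already present
        # check=True raises CalledProcessError if the binary exits with an error
        subprocess.run(build_gremlin_command(msa_file, out_file, binary), check=True)
```

Passing the command as a list avoids shell quoting problems with unusual file names.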

## How to get the MSA
For now I am using the same hhblits MSA that I used for everything else.
In the original paper they ran hhblits with the following parameters on the clustered UniProt database (UniRef?):

```
-nodiff -neffmax 20 -n 4 -maxfilt 100000
```

They then removed rows or columns with more than 25% gaps, and sequences with more than 90% identity to another sequence in the MSA.
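The filtering described above can be sketched in plain Python. This is only an illustration under stated assumptions: the order of the steps (rows, then columns, then redundancy), the identity definition (fraction of all alignment columns with the same character) and the greedy redundancy filter are my guesses, not necessarily what the paper did. All function names are hypothetical.

```python
def gap_fraction(chars):
    """Fraction of gap characters ('-') in a string or list of characters."""
    return sum(c == "-" for c in chars) / len(chars)


def drop_gappy_rows(msa, max_gap=0.25):
    """Keep sequences whose gap fraction is at most max_gap."""
    return [seq for seq in msa if gap_fraction(seq) <= max_gap]


def drop_gappy_columns(msa, max_gap=0.25):
    """Keep columns whose gap fraction is at most max_gap."""
    keep = [k for k in range(len(msa[0]))
            if gap_fraction([seq[k] for seq in msa]) <= max_gap]
    return ["".join(seq[k] for k in keep) for seq in msa]


def identity(seq_a, seq_b):
    """Fraction of alignment columns where both sequences have the same character."""
    return sum(a == b for a, b in zip(seq_a, seq_b)) / len(seq_a)


def drop_redundant(msa, max_identity=0.9):
    """Greedy filter: keep a sequence only if it is at most max_identity
    identical to every sequence kept so far (the first sequence always stays)."""
    kept = []
    for seq in msa:
        if all(identity(seq, other) <= max_identity for other in kept):
            kept.append(seq)
    return kept


msa = [
    "ACDEFGHIKL",   # query
    "ACDEFGHIKL",   # identical to the query, removed as redundant
    "ACD----IKL",   # 40% gaps, removed as a gappy row
    "MCDEYGHLKV",   # 60% identical to the query, kept
]
filtered = drop_redundant(drop_gappy_columns(drop_gappy_rows(msa)))
print(filtered)
```

Because the query is the first sequence and the redundancy filter is greedy, the query itself is never removed by the identity step.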

## Output
The output is a space-separated file with the following columns:

- `i` is the 0-indexed position of the first residue and `ii` is its 1-indexed position together with its identity
- `j` and `jj` are equivalent to `i` and `ii` respectively, but for the second residue
- `raw` is the raw score
- `apc` is the score after average product correction (APC)
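For reference, the average product correction subtracts from each raw pair score the product of the two positions' mean raw scores divided by the overall mean. Below is a minimal NumPy sketch of that standard formula. The function name is hypothetical, and taking the means over off-diagonal entries only is an assumption that may differ in detail from what the Gremlin binary does.

```python
import numpy as np


def apc_correct(raw):
    """Apply the average product correction to a symmetric L x L raw score matrix.

    Assumption: means are taken over off-diagonal entries only.
    """
    raw = np.asarray(raw, dtype=float)
    n = raw.shape[0]
    off_diag = ~np.eye(n, dtype=bool)
    scores = np.where(off_diag, raw, 0.0)       # ignore the diagonal
    row_mean = scores.sum(axis=1) / (n - 1)     # mean score of each position
    total_mean = scores.sum() / (n * (n - 1))   # mean score over all pairs
    corrected = raw - np.outer(row_mean, row_mean) / total_mean
    corrected[~off_diag] = np.nan               # self-pairs are not meaningful
    return corrected


raw = np.array([[0, 1, 2],
                [1, 0, 3],
                [2, 3, 0]])
print(apc_correct(raw))
```

The correction removes the background signal of positions that score highly with every other position, for example because of high entropy or phylogenetic bias.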

The score that I need is the APC score. For convenience I save it as a numpy array.

```python
import os

import joblib
import numpy as np
import pandas as pd
from Bio import SeqIO

with open('../processing/gray2018/uniprot_id_list.txt') as handle:
    for line in handle:
        uniprot_id = line.rstrip()
        gremlin_file = '../processing/gray2018/gremlin/'+uniprot_id+'.gremlin.ssv'
        out_numpy_file = '../processing/gray2018/gremlin/'+uniprot_id+'.gremlin_apc.joblib.xz'
        if not os.path.isfile(gremlin_file):
            print('Missing file:', uniprot_id)
            continue
        if os.path.isfile(out_numpy_file):
            print('Output already existing. Skipping:', uniprot_id)
            continue
        sequence = SeqIO.read('../processing/gray2018/uniprot_sequences/'+uniprot_id+'.fasta', 'fasta').seq
        # Raw string avoids an invalid-escape warning for the whitespace regex
        gremlin_df = pd.read_csv(gremlin_file, sep=r'\s+')
        # Symmetric L x L matrix; pairs without a Gremlin score stay NaN
        gremlin_vec = np.full((len(sequence), len(sequence)), np.nan)
        # Build the position sets once instead of on every loop iteration
        i_positions = set(gremlin_df.i)
        j_positions = set(gremlin_df.j)
        for i, res_i in enumerate(sequence):
            if i not in i_positions:
                continue
            for j in range(i+1, len(sequence)):
                if j not in j_positions:
                    continue
                curr_df = gremlin_df[(gremlin_df.i == i) & (gremlin_df.j == j)].reset_index(drop=True)
                # Exactly one row is expected for each (i, j) pair
                assert len(curr_df) == 1
                # Check that Gremlin's residue labels match the UniProt sequence
                assert curr_df.at[0, 'ii'] == res_i+str(i+1)
                assert curr_df.at[0, 'jj'] == sequence[j]+str(j+1)
                gremlin_vec[i, j] = curr_df.at[0, 'apc']
                gremlin_vec[j, i] = curr_df.at[0, 'apc']
        joblib.dump(gremlin_vec, out_numpy_file)
```

```
Output already existing. Skipping: P00552
Output already existing. Skipping: P02829
Output already existing. Skipping: P04147
Output already existing. Skipping: P06654
Output already existing. Skipping: P0CG48
Output already existing. Skipping: P0CG63
Output already existing. Skipping: P31016
Output already existing. Skipping: P46937
Output already existing. Skipping: P62593
```
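
Once saved, a matrix can be read back with `joblib.load` and the strongest predicted contacts ranked by APC score. This is a minimal sketch on a toy matrix. The function name and the default minimum sequence separation of 6 are illustrative assumptions, not part of Gremlin.

```python
import numpy as np


def top_contacts(apc, n_top=10, min_separation=6):
    """Return the n_top (i, j, score) tuples with the highest APC score.

    Indices are 0-based with i < j; NaN entries and pairs closer than
    min_separation along the sequence are skipped.
    """
    apc = np.asarray(apc, dtype=float)
    i_idx, j_idx = np.triu_indices(apc.shape[0], k=min_separation)
    scores = apc[i_idx, j_idx]
    valid = ~np.isnan(scores)
    i_idx, j_idx, scores = i_idx[valid], j_idx[valid], scores[valid]
    order = np.argsort(scores)[::-1][:n_top]   # highest score first
    return [(int(i_idx[k]), int(j_idx[k]), float(scores[k])) for k in order]


# Toy matrix standing in for an array loaded with joblib.load(...)
toy = np.full((8, 8), np.nan)
toy[0, 7] = toy[7, 0] = 0.9
toy[1, 7] = toy[7, 1] = 0.4
toy[0, 1] = toy[1, 0] = 2.0   # adjacent in sequence, ignored
print(top_contacts(toy, n_top=2))
```

Only the upper triangle is scanned because the saved matrix is symmetric.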
