# TrRosetta 

## Method

TrRosetta is a deep residual convolutional network that takes an MSA as input and outputs the relative distance and orientation of each residue pair.
The output includes the probability distribution over distances from 2 to 20 Å, split into 36 bins of 0.5 Å each (plus 1 bin for no contact), and similarly binned probabilities for the orientation angles.
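
As a minimal sketch of how this binning can be read, assuming the no-contact bin comes first (index 0) and is followed by the 36 distance bins in increasing order, the snippet below converts a distogram of shape (L, L, 37) into the most likely distance per residue pair and the probability of a contact below 8 Å. The array used here is random, and the names `bin_lower_edges`, `most_likely_distance` and `contact_probability` are illustrative, not part of trRosetta.

```python
import numpy as np

N_BINS = 37  # 1 no-contact bin + 36 distance bins of 0.5 A between 2 and 20 A

# Lower edge of each bin in A; the no-contact bin (index 0) has no distance.
bin_lower_edges = np.concatenate([[np.nan], 2.0 + 0.5 * np.arange(36)])


def most_likely_distance(distogram):
    """Return the lower edge of the most probable bin for each residue pair."""
    return bin_lower_edges[np.argmax(distogram, axis=-1)]


def contact_probability(distogram, cutoff=8.0):
    """Sum the probability of all bins that lie entirely below `cutoff` A."""
    upper_edges = bin_lower_edges + 0.5
    below = upper_edges <= cutoff  # NaN compares False, so bin 0 is excluded
    return distogram[..., below].sum(axis=-1)


# Toy distogram for a 5-residue protein: random values normalized over the bins.
rng = np.random.default_rng(0)
toy = rng.random((5, 5, N_BINS))
toy /= toy.sum(axis=-1, keepdims=True)

distances = most_likely_distance(toy)
contacts = contact_probability(toy)
```

Reporting the lower edge of the bin is only one convention; the bin centre (lower edge plus 0.25 Å) would be an equally reasonable choice.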

## Installation

The easiest way is to use Singularity with the TensorFlow 1 CPU container (it needs a lot of RAM):

```
singularity pull docker://tensorflow/tensorflow:1.15.5
```

Clone the repository and download the trained model.

```
git clone https://github.com/gjoni/trRosetta
cd trRosetta
wget https://files.ipd.uw.edu/pub/trRosetta/model2019_07.tar.bz2
tar xf model2019_07.tar.bz2
```


## How to run

Go to the root of the repository, start the container and run the prediction:

```
singularity run singularity_containers/tensorflow_1.15.5.sif
python ./network/predict.py -m ./model2019_07 <a3m MSA input> <npz file output>
```
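
To check that a run produced what is expected, the output file can be opened with NumPy. The sketch below uses a hypothetical helper, `describe_npz`, and writes a random stand-in file so that it runs without trRosetta. The only array name this document relies on is `dist`; any other arrays stored in a real output file would be listed in the same way.

```python
import os
import tempfile

import numpy as np


def describe_npz(path):
    """Return {array name: shape} for every array stored in an .npz file."""
    with np.load(path) as data:
        return {name: data[name].shape for name in data.files}


# Stand-in for a real prediction: a random (L, L, 37) array saved under the key "dist".
L = 10
fake_dist = np.random.default_rng(0).random((L, L, 37))
tmp_dir = tempfile.mkdtemp()
npz_path = os.path.join(tmp_dir, "example_trRosetta.npz")
np.savez_compressed(npz_path, dist=fake_dist)

shapes = describe_npz(npz_path)
print(shapes)
```

For a real prediction, the first two dimensions of `dist` should both equal the length of the query sequence in the input MSA.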

## Notes

In general, the trimmed MSAs seem to work better than the full-length ones.
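
This document does not record how the trimmed MSAs were produced. One possible way, sketched here under that caveat, is to cut an existing a3m alignment down to a range of query positions. The function name `trim_a3m` and the toy alignment are illustrative. The sketch assumes one line per sequence and a query (first sequence) without gaps, and it drops lowercase insertion characters before slicing so that every sequence has exactly one character per query column.

```python
import string

# Translation table that deletes lowercase letters (a3m insertion states).
_DROP_INSERTIONS = str.maketrans("", "", string.ascii_lowercase)


def trim_a3m(a3m_text, first, last):
    """Keep only query columns first..last (1-based, inclusive) of an a3m alignment."""
    trimmed = []
    for line in a3m_text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        if line.startswith(">"):
            trimmed.append(line)  # header lines are kept unchanged
        else:
            aligned = line.strip().translate(_DROP_INSERTIONS)
            trimmed.append(aligned[first - 1:last])
    return "\n".join(trimmed) + "\n"


toy_a3m = """>query
MKTAYIAK
>hit1
MKT-YIgsAK
>hit2
-KTAYI--
"""

print(trim_a3m(toy_a3m, 3, 6))
```

Sequences that consist only of gaps after trimming are kept by this sketch; filtering them out may be worthwhile before running the prediction.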

## Output check

I compare the trRosetta distogram with the one derived from the actual structure. Since for some proteins the crystal structure covers a different repeat of the same domain, I also show the unmapped PDB contact map.

```python
import pandas as pd

dms_datasets_df = pd.read_csv('/home/saul/master_thesis_work/dataset/dms/dms_datasets.csv')
dms_datasets_df
```

| Unnamed: 0 | uniprot_id | dms_id | pdb_id | pdb_chain | author_year | uniprot_first | uniprot_last | mutated_domain_uniprot_first | mutated_domain_uniprot_last | feature_basename | feature_basename_trrosetta | notes | trimming_notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | P62593 | beta-lactamase | 1btl | A | Firnberg2014 | 1 | -1 | 1 | -1 | P62593 | P62593 | | full sequence used |
| 1 | P46937 | WW_domain | 4rex | A | Fowler2010 | 1 | -1 | 167 | 207 | P46937 | P46937_167-207 | | equivalent (trimming makes results worse in XG... |
| 2 | P31016 | PSD95pdz3 | 1be9 | A | McLaughlin2012 | 1 | -1 | 303 | 426 | P31016 | P31016_303-426 | | equivalent |
| 3 | P00552 | kka2_1:2 | 1nd4 | A | Melnikov2014 | 1 | -1 | 1 | -1 | P00552 | P00552 | | full sequence used |
| 4 | P02829 | hsp90 | 1ah6 | A | Mishra2016 | 2 | 231 | 2 | 231 | P02829_2-231 | P02829_2-231 | | trimming works well |
| 5 | P0CG63 | Ubiquitin | 3olm | D | Roscoe2013 | 1 | 76 | 1 | 76 | P0CG63_1-76 | P0CG63_1-76 | pdb chain is B but author chain is D, sifts re... | trimming works well |
| 6 | P04147 | Pab1 | 6r5k | D | Melamed2013 | 1 | -1 | 126 | 203 | P04147 | P04147_126-203 | pdb chain is B but author chain is D, sifts re... | better not to trim |
| 7 | P0CG63 | E1_Ubiquitin | 3olm | D | Roscoe2014 | 1 | 76 | 1 | 76 | P0CG63_1-76 | P0CG63_1-76 | pdb chain is B but author chain is D, sifts re... | trimming works well |
| 8 | P06654 | gb1 | 2igd | A | Olson2014 | 1 | -1 | 226 | 283 | P06654 | P06654_226-283 | | better not to trim |


```python
import numpy as np
import joblib
import seaborn as sns; sns.set()
import matplotlib.pyplot as plt

base_dir = '/home/saul/master_thesis_work/processing/dms/'

# vectorized functions
# hide distances above 20 A, the upper limit of the trRosetta distogram
crop_distances = np.vectorize(lambda x : (x if x<=20 else np.nan))
# bin index -> lower edge of the bin in A (bin 0 is the no-contact bin)
distogram_bins_map = {i:(dist/10) for i,dist in enumerate([np.nan] + list(range(20,205,5)))}
map_to_dist = np.vectorize(lambda x : distogram_bins_map[x])

def show_heatmap(title, study, distances):
    print(title, study)
    plt.close()
    sns.heatmap(distances)
    plt.show()

assert dms_datasets_df.dms_id.is_unique
for row in dms_datasets_df.itertuples(index=False):
    study = row.dms_id
    # residue range of the mutated domain (0-based slice);
    # assumption: -1 in the table means "up to the last residue"
    last = None if row.mutated_domain_uniprot_last == -1 else row.mutated_domain_uniprot_last
    domain = slice(row.mutated_domain_uniprot_first - 1, last)
    # trRosetta distances for the trimmed protein
    try: # not all sequences have a trimmed version
        distogram = np.load(base_dir + 'tr_rosetta/' + row.feature_basename_trrosetta + '_trRosetta.npz')['dist']
        show_heatmap('trRosetta trimmed distogram', study, map_to_dist(np.argmax(distogram, axis=2)))
    except FileNotFoundError:
        print("Trimmed trRosetta output missing for", study)
    # trRosetta distances for the untrimmed protein
    try: # the trRosetta map does not exist for all the proteins
        distogram = np.load(base_dir + 'tr_rosetta/' + row.uniprot_id + '_trRosetta.npz')['dist']
        tr_rosetta_distances = map_to_dist(np.argmax(distogram, axis=2))
        show_heatmap('trRosetta full distogram', study, tr_rosetta_distances[domain, domain])
    except FileNotFoundError:
        print("Full trRosetta output missing for", study)
    # PDB distances (not mapped to UniProt numbering)
    pdb_distances = joblib.load(base_dir + 'structures/' + row.pdb_id + '_' + row.pdb_chain +
                                '.pdb_distance_matrix.joblib.xz')['distance_matrix']
    show_heatmap('PDB distogram', study, crop_distances(pdb_distances))
    # experimental distances mapped to UniProt numbering
    uniprot_distances = joblib.load(base_dir + 'structures/' + row.uniprot_id + '_mapped_' +
                                    row.pdb_id + '_' + row.pdb_chain +
                                    '.uniprot_distance_matrix.joblib.xz')['distance_matrix']
    show_heatmap('Uniprot mapped distogram', study, crop_distances(uniprot_distances[domain, domain]))
```
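
The comparison above is done by eye on heatmaps. As a complement, and purely as a sketch, the agreement could be summarized with a single number, for example the mean absolute difference between predicted and experimental distances over the residue pairs for which both are defined and the experimental distance is within 20 Å. The function `distance_mae` and the toy matrices are hypothetical; they assume two square matrices of the same size, indexed the same way, with NaN marking missing pairs.

```python
import numpy as np


def distance_mae(predicted, experimental, max_dist=20.0):
    """Mean absolute difference over pairs defined in both matrices and <= max_dist."""
    predicted = np.asarray(predicted, dtype=float)
    experimental = np.asarray(experimental, dtype=float)
    if predicted.shape != experimental.shape:
        raise ValueError("matrices must have the same shape")
    valid = (
        ~np.isnan(predicted)
        & ~np.isnan(experimental)
        & (experimental <= max_dist)
    )
    if not valid.any():
        return np.nan
    return float(np.mean(np.abs(predicted[valid] - experimental[valid])))


# Toy 3x3 example: NaN in the prediction means "no contact predicted".
pred = np.array([[np.nan, 4.0, 10.0],
                 [4.0, np.nan, np.nan],
                 [10.0, np.nan, np.nan]])
true = np.array([[0.0, 5.0, 12.0],
                 [5.0, 0.0, 30.0],
                 [12.0, 30.0, 0.0]])

print(distance_mae(pred, true))
```

Pairs predicted as "no contact" are ignored by this measure, so it says nothing about contacts that the network misses; counting those separately would give a more complete picture.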