# Clustering AVR-Pik
---

This notebook will guide you through the process of clustering 65 complexes of the AVR-Pik protein against a sample of rice proteins. All of the models are dimeric. The 65 models originate from a proteome-wide screen of AVR-Pik against the rice proteome (O. sativa subsp. japonica, 43,000+ initial models). To select this sample we first ran AlphaCRV on all the models, and the 65 structures are part of the best clusters that were identified. Running the pipeline on these models should be enough to reproduce the results.

# Prerequisites

- Install AlphaCRV in a conda environment and activate it
- Download the sequences, models and results for this example at [Zenodo](https://zenodo.org/records/10470744)

# 1. Cluster the models with the `alphacrv-cluster` command

Run the following command to cluster the models. Make sure to change the paths according to your system:

```bash
alphacrv-cluster \
  --bait ./examples/AVRPik/AVRPik.fasta \
  --binders ./examples/AVRPik/AVRPik_binders.fasta \
  --models_dir ./examples/AVRPik/AVRPik_vs_rice_models \
  --destination ./examples/AVRPik/AVRPik_vs_rice_clusters \
  --cpus 8
```

After collecting the quality scores from the models in `--models_dir`, the program counts how many models have an ipTM score at or above the threshold (0.75 by default) and prompts you to confirm or change it. It then proceeds with the clustering.

The full output is as follows:

```
(env) AlphaCRV$ alphacrv-cluster \
--bait ./examples/AVRPik/AVRPik.fasta \
--binders ./examples/AVRPik/AVRPik_binders.fasta \
--models_dir ./examples/AVRPik/AVRPik_vs_rice_models \
--destination ./examples/AVRPik/AVRPik_vs_rice_clusters \
--cpus 8
INFO:root:Getting quality scores for models in examples/AVRPik/AVRPik_vs_rice_models...
INFO:root:Found 65 model directories with quality scores.
Will select 65 models with iptm >= 0.75. Press enter to continue, or enter a new threshold: 
INFO:root:Trimming binder molecules to keep only regions with an average PAE against the bait of up to 10.0...
INFO:root:Processed 65 complexes.
INFO:root:Writing trimmed sequences to fasta file.
INFO:root:Running sequence clustering...
INFO:root:Processing output...
INFO:root:Running structural clustering...
INFO:root:Processing output...
INFO:root:Aligning all vs all members of each cluster...
INFO:root:Aligning cluster 1 of 4...
INFO:root:11 members.
INFO:root:Aligning cluster 2 of 4...
INFO:root:21 members.
INFO:root:Aligning cluster 3 of 4...
INFO:root:13 members.
INFO:root:Aligning cluster 4 of 4...
INFO:root:20 members.
INFO:root:Calculating median alignment scores...
INFO:root:Done!!
```

After this step you will have a directory with the following structure:

```
(env) AlphaCRV$ ll examples/AVRPik/AVRPik_vs_rice_clusters/
total 24K
-rw-r--r-- 1 example g-example 1.3K Jan  3 16:30 binders_regions.csv
drwxr-xr-x 1 example g-example    0 Jan  3 16:31 merged_clusters/
drwxr-xr-x 1 example g-example    0 Jan  3 16:30 pdbs_trimmed/
drwxr-xr-x 1 example g-example    0 Jan  3 16:30 seqclusters/
drwxr-xr-x 1 example g-example    0 Jan  3 16:30 strclusters/
-rw-r--r-- 1 example g-example  19K Jan  3 16:30 trimmed_binders.fasta
```
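Before reading the results, it can help to confirm that the run produced everything in the listing above. The following is a minimal sketch: `missing_outputs` is a hypothetical helper (not part of AlphaCRV), and the expected names are simply copied from the listing.

```python
from pathlib import Path

# Entry names copied from the directory listing above.
EXPECTED_OUTPUTS = [
    "binders_regions.csv",
    "merged_clusters",
    "pdbs_trimmed",
    "seqclusters",
    "strclusters",
    "trimmed_binders.fasta",
]


def missing_outputs(destination):
    """Return the expected entries that are absent from `destination`."""
    destination = Path(destination)
    return [name for name in EXPECTED_OUTPUTS if not (destination / name).exists()]


# Adjust the path to your own --destination directory.
print(missing_outputs("./examples/AVRPik/AVRPik_vs_rice_clusters"))
```

An empty list means that all six entries are present.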

Now let's look at some of the important files:

In [1]:
from pathlib import Path
import pandas as pd

In [2]:
results_dir = Path('./AVRPik/AVRPik_vs_rice_clusters/')

## See clusters

The `merged_clusters.csv` file contains the list of models with their corresponding sequence, structure and merged clusters. It also has the quality scores provided by AlphaFold.

In [3]:
clusters = pd.read_csv(results_dir / 'merged_clusters/merged_clusters.csv')

In [4]:
clusters

Unnamed: 0,complex,str_rep,seq_rep,merged_rep,member,iptm,iptm+ptm
0,6R8K-1_A0A0P0Y5A4-1,Q67VV7.pdb_B,Q6ZEZ7,Q0DEU2.pdb_B,A0A0P0Y5A4,0.954354,0.846711
1,6R8K-1_A0A0P0YB11-1,Q67VV7.pdb_B,Q6ZEZ7,Q0DEU2.pdb_B,A0A0P0YB11,0.916484,0.841312
2,6R8K-1_A0A0P0VEM6-1,Q67VV7.pdb_B,Q6ZEZ7,Q0DEU2.pdb_B,A0A0P0VEM6,0.900694,0.839205
3,6R8K-1_Q8L3T8-1,Q67VV7.pdb_B,Q6ZEZ7,Q0DEU2.pdb_B,Q8L3T8,0.888135,0.842678
4,6R8K-1_Q2RAL3-1,Q6K9R5.pdb_B,Q6ZBC3,Q6K9R5.pdb_B,Q2RAL3,0.862302,0.757913
...,...,...,...,...,...,...,...
60,6R8K-1_Q0DBF4-1,A3BDZ2.pdb_B,Q0DBF4,A3BDZ2.pdb_B,Q0DBF4,0.758313,0.778655
61,6R8K-1_A3ADD6-1,Q6K9R5.pdb_B,Q7G2B2,Q6K9R5.pdb_B,A3ADD6,0.756883,0.752243
62,6R8K-1_Q0JCK8-1,Q6K9R5.pdb_B,A0A0N7KFK3,Q6K9R5.pdb_B,Q0JCK8,0.756162,0.744660
63,6R8K-1_Q0JKW0-1,A0A0P0WKJ4.pdb_B,A0A0P0Y219,A0A0P0WKJ4.pdb_B,Q0JKW0,0.755506,0.758945


The columns of the `clusters` DataFrame are:
- `complex`: The name of the complex. This is the same name as the directory where the model is stored.
- `str_rep`: Name of the structure cluster representative
- `seq_rep`: Name of the sequence cluster representative
- `merged_rep`: Name of the merged cluster representative (sequence + structure)
- `member`: The ID of the binder protein
- `iptm`: The ipTM score of the model
- `iptm+ptm`: The ipTM+pTM score of the model (it is calculated by AlphaFold as 0.8*ipTM + 0.2*pTM)
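Since `iptm+ptm` is a weighted sum, the pTM of a model can be recovered from the two columns. Here is a small sketch of that arithmetic, using the first row of the table above; `ptm_from_scores` is a hypothetical helper name, not an AlphaCRV or AlphaFold function.

```python
def ptm_from_scores(iptm, iptm_plus_ptm):
    """Invert iptm+ptm = 0.8*ipTM + 0.2*pTM to recover pTM."""
    return (iptm_plus_ptm - 0.8 * iptm) / 0.2


# First row of the table above: iptm = 0.954354, iptm+ptm = 0.846711
ptm = ptm_from_scores(0.954354, 0.846711)
print(round(ptm, 3))  # 0.416
```

For this complex the interface score (ipTM) is much higher than the overall pTM, which is why the combined score is lower than the ipTM.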

See the number of different merged clusters:

In [5]:
clusters.merged_rep.unique().shape

(4,)

For this example, the models and sequences from the 65 binder proteins were summarized in 4 clusters. Far fewer structures to sort through!
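To see how the members are distributed across the merged clusters, you can group by `merged_rep`. The sketch below rebuilds only the nine rows displayed above so that it runs on its own; with the full file you would use the `clusters` DataFrame instead of `excerpt`.

```python
import pandas as pd

# Excerpt of merged_clusters.csv: only the nine rows displayed above.
excerpt = pd.DataFrame({
    "member": ["A0A0P0Y5A4", "A0A0P0YB11", "A0A0P0VEM6", "Q8L3T8", "Q2RAL3",
               "Q0DBF4", "A3ADD6", "Q0JCK8", "Q0JKW0"],
    "merged_rep": ["Q0DEU2.pdb_B", "Q0DEU2.pdb_B", "Q0DEU2.pdb_B", "Q0DEU2.pdb_B",
                   "Q6K9R5.pdb_B", "A3BDZ2.pdb_B", "Q6K9R5.pdb_B", "Q6K9R5.pdb_B",
                   "A0A0P0WKJ4.pdb_B"],
})

# Number of members per merged cluster, largest first.
sizes = excerpt.groupby("merged_rep").size().sort_values(ascending=False)
print(sizes)
```

The counts printed here describe the excerpt only, not the full clusters.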

## See alignment scores

Alignment scores are calculated for each cluster by aligning all vs all members of the cluster.

In [6]:
alignment_scores = pd.read_csv(results_dir / 'merged_clusters/alignment_scores.csv')

In [7]:
alignment_scores.head()

Unnamed: 0,cluster,ref,member,tmscore_ref,tmscore_m,aligned_length,rmsd
0,Q0DEU2.pdb_B,A0A0P0Y5A4,A0A0P0YB11,0.88555,0.97405,432,1.41
1,Q0DEU2.pdb_B,A0A0P0Y5A4,A0A0P0VEM6,0.86424,0.91255,440,3.06
2,Q0DEU2.pdb_B,A0A0P0Y5A4,Q8L3T8,0.85603,0.93629,430,2.58
3,Q0DEU2.pdb_B,A0A0P0Y5A4,B9EWP2,0.86538,0.95866,425,2.16
4,Q0DEU2.pdb_B,A0A0P0Y5A4,Q7XV05,0.37127,0.61033,271,13.11


The columns of the `alignment_scores` DataFrame are:
- `cluster`: The name of the cluster
- `ref`: Binder ID of the reference structure in the alignment
- `member`: Binder ID of the second structure in the alignment
- `tmscore_ref`: TM-score based on the reference structure
- `tmscore_m`: TM-score based on the second structure
- `aligned_length`: Length of the alignment
- `rmsd`: RMSD of the alignment

Based on these scores, median scores are calculated for each cluster member to find the best representative of the cluster (the member with the lowest median RMSD to the other members).
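The idea of a per-member median can be sketched with pandas on the five alignments shown above. This is an illustration only: it assumes that every structure appears as `ref` against all other members of its cluster, and the way AlphaCRV computes `median_scores.csv` internally may differ.

```python
import pandas as pd

# The five alignments displayed above (all share the same reference structure).
excerpt = pd.DataFrame({
    "cluster": ["Q0DEU2.pdb_B"] * 5,
    "ref": ["A0A0P0Y5A4"] * 5,
    "member": ["A0A0P0YB11", "A0A0P0VEM6", "Q8L3T8", "B9EWP2", "Q7XV05"],
    "tmscore_ref": [0.88555, 0.86424, 0.85603, 0.86538, 0.37127],
    "rmsd": [1.41, 3.06, 2.58, 2.16, 13.11],
})

# Median TM-score and RMSD of each reference against the other members.
medians = (
    excerpt.groupby(["cluster", "ref"])[["tmscore_ref", "rmsd"]]
    .median()
    .reset_index()
)
print(medians)
```

Using the median rather than the mean keeps a single poor alignment (here `Q7XV05`, with an RMSD of 13.11) from dominating the score of an otherwise well-aligned member.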

## Read median scores and find top clusters

Now we can rank the clusters based on the median alignment scores of the cluster representatives:

In [8]:
median_scores = pd.read_csv(results_dir / 'merged_clusters/median_scores.csv')

In [9]:
median_scores.shape

(65, 7)

The `median_scores` DataFrame contains the median alignment scores of each cluster member when aligned to all other members of the same cluster.

In [10]:
median_scores.head()

Unnamed: 0,cluster,member,tmscore,rmsd,aligned_length,cluster_size,fraction_binder
0,A0A0P0WKJ4.pdb_B,A0A0P0W913,0.43642,15.29,250.0,20.0,0.826316
1,A0A0P0WKJ4.pdb_B,A0A0P0WKJ4,0.31741,13.1,219.0,20.0,0.540773
2,A0A0P0WKJ4.pdb_B,A0A0P0WQF2,0.40264,12.55,226.0,20.0,0.76
3,A0A0P0WKJ4.pdb_B,A0A0P0XBZ1,0.26659,17.31,266.0,20.0,0.44473
4,A0A0P0WKJ4.pdb_B,A0A0P0XUC4,0.31355,16.23,268.0,20.0,0.486111


The columns of the `median_scores` DataFrame are:
- `cluster`: The name of the cluster
- `member`: ID of the cluster member (binder molecule)
- `tmscore`: Median TM-score of the complex against all other complexes in this cluster
- `rmsd`: Median RMSD of the complex against all other complexes in this cluster
- `aligned_length`: Median length of the alignment
- `cluster_size`: Number of members in the cluster
- `fraction_binder`: On average, how much of the binder molecule is included in the alignments of this complex against all other complexes (calculated as `(aligned_length - bait_length)/binder_length`). This is only meant to approximate how complete the alignments are for this cluster member.
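The `fraction_binder` formula can be written as a one-line function. In this sketch the aligned length of 250 is taken from the first row above, while the bait and binder lengths (93 and 190) are assumed values chosen for illustration; they are not stated in this notebook.

```python
def fraction_binder(aligned_length, bait_length, binder_length):
    """Approximate fraction of the binder covered by the alignment."""
    return (aligned_length - bait_length) / binder_length


# bait_length=93 and binder_length=190 are illustrative assumptions.
value = fraction_binder(aligned_length=250, bait_length=93, binder_length=190)
print(round(value, 3))  # 0.826
```

Subtracting the bait length assumes that the whole bait is part of the alignment, which is why the value is only an approximation.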

The next step is to select the cluster representatives. For this, we first need to filter out the cluster members with poor quality alignments, according to the following criteria:
- Small size
- Low median TM-score
- High median RMSD
- Low fraction of the binder aligned in the cluster representative

In [11]:
# Keep the cluster members that pass all of the following criteria:
select = (
    (median_scores.cluster_size >= 5)
    & (median_scores.tmscore >= 0.2)
    & (median_scores.fraction_binder >= 0.2)
    & (median_scores.rmsd <= 15)
)
median_scores_filtered = median_scores[select]

In [12]:
median_scores_filtered.shape

(51, 7)

See how many clusters are left after filtering:

In [13]:
median_scores_filtered.cluster.unique().shape

(4,)
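If you want to try several thresholds, the filter can be wrapped in a function. The sketch below uses a hypothetical helper, `filter_members`, with the same default cut-offs as the cell above, and applies it to the first five rows of `median_scores` shown earlier.

```python
import pandas as pd


def filter_members(scores, min_size=5, min_tmscore=0.2, min_fraction=0.2, max_rmsd=15):
    """Keep the cluster members that pass all four quality criteria."""
    keep = (
        (scores["cluster_size"] >= min_size)
        & (scores["tmscore"] >= min_tmscore)
        & (scores["fraction_binder"] >= min_fraction)
        & (scores["rmsd"] <= max_rmsd)
    )
    return scores[keep]


# First five rows of median_scores, as displayed above.
excerpt = pd.DataFrame({
    "member": ["A0A0P0W913", "A0A0P0WKJ4", "A0A0P0WQF2", "A0A0P0XBZ1", "A0A0P0XUC4"],
    "tmscore": [0.43642, 0.31741, 0.40264, 0.26659, 0.31355],
    "rmsd": [15.29, 13.1, 12.55, 17.31, 16.23],
    "cluster_size": [20.0] * 5,
    "fraction_binder": [0.826316, 0.540773, 0.76, 0.44473, 0.486111],
})

kept = filter_members(excerpt)
print(kept["member"].tolist())  # ['A0A0P0WKJ4', 'A0A0P0WQF2']
```

In this excerpt the other three members are removed only because their median RMSD is above 15.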

Function to format tables:

In [14]:
import seaborn as sns
cm_r = sns.color_palette("mako_r", as_cmap=True)
cm = sns.color_palette("mako", as_cmap=True)

In [15]:
def make_pretty(styler):
    styler.format(precision=2)
    styler.background_gradient(axis=0, cmap=cm_r, subset=pd.IndexSlice[:,"cluster_size"],vmin=5,vmax=15)
    styler.background_gradient(axis=0, cmap=cm_r, subset=pd.IndexSlice[:,"tmscore"],vmin=0.2,vmax=0.8)
    styler.background_gradient(axis=0, cmap=cm, subset=pd.IndexSlice[:,"rmsd"],vmin=2,vmax=10)
    styler.background_gradient(axis=0, cmap=cm_r, subset=pd.IndexSlice[:,"fraction_binder"],vmin=0.2,vmax=0.9)
    return styler

## RESULT 1: See clusters ranked by RMSD

Finally, we can rank the clusters and see which ones have a good combination of low RMSD and large cluster size. These clusters are the most likely to contain the true binder.

In [16]:
# Select the rows with the minimum RMSD for each cluster
select = median_scores_filtered.groupby('cluster').rmsd.idxmin()
columns = ['cluster', 'tmscore', 'rmsd', 'cluster_size', 'fraction_binder']
median_scores_filtered.loc[select,columns].sort_values(by='rmsd').style.pipe(make_pretty)

Unnamed: 0,cluster,tmscore,rmsd,cluster_size,fraction_binder
41,Q0DEU2.pdb_B,0.94,2.17,11.0,0.99
28,A3BDZ2.pdb_B,0.82,3.38,13.0,0.94
54,Q6K9R5.pdb_B,0.31,4.99,21.0,0.31
9,A0A0P0WKJ4.pdb_B,0.42,10.93,20.0,0.86


Here we can see that the cluster `Q0DEU2.pdb_B` has the lowest median RMSD and the highest median TM-score. However, it has a relatively small size with only 11 members.

## RESULT 2: See clusters ranked by size

In [17]:
# Select the rows with the minimum RMSD for each cluster
select = median_scores_filtered.groupby('cluster').rmsd.idxmin()
columns = ['cluster', 'tmscore', 'rmsd', 'cluster_size', 'fraction_binder']
median_scores_filtered.loc[select, columns].sort_values(by='cluster_size', ascending=False).style.pipe(make_pretty)

Unnamed: 0,cluster,tmscore,rmsd,cluster_size,fraction_binder
54,Q6K9R5.pdb_B,0.31,4.99,21.0,0.31
9,A0A0P0WKJ4.pdb_B,0.42,10.93,20.0,0.86
28,A3BDZ2.pdb_B,0.82,3.38,13.0,0.94
41,Q0DEU2.pdb_B,0.94,2.17,11.0,0.99


Here we can see that the cluster `Q6K9R5.pdb_B` is the largest, with 21 members. It also has a low RMSD of 4.99, although its median TM-score (0.31) is the lowest of the four. This cluster, along with `A3BDZ2.pdb_B` and `Q0DEU2.pdb_B`, would be a good candidate for further analysis. That leaves only three clusters with a total of 45 structures, from more than 43,000 starting models!

For this example we know that in our list of candidate binders there are 6 homologues of the true binder protein. We can find out which clusters contain these homologues:

In [19]:
homologues = ['Q7XJV3', 'Q6EPT2','A0A0N7KFK3', 'Q6YY31', 'Q6EPT4', 'Q0JCK8']

In [20]:
clusters[clusters.member.isin(homologues)][['complex','merged_rep','iptm','iptm+ptm']]

Unnamed: 0,complex,merged_rep,iptm,iptm+ptm
14,6R8K-1_Q7XJV3-1,Q6K9R5.pdb_B,0.828985,0.807492
33,6R8K-1_Q6EPT2-1,Q6K9R5.pdb_B,0.806888,0.76364
36,6R8K-1_A0A0N7KFK3-1,Q6K9R5.pdb_B,0.800443,0.782608
45,6R8K-1_Q6YY31-1,Q6K9R5.pdb_B,0.784981,0.775073
56,6R8K-1_Q6EPT4-1,Q6K9R5.pdb_B,0.766761,0.71059
62,6R8K-1_Q0JCK8-1,Q6K9R5.pdb_B,0.756162,0.74466


They are all in the largest cluster!
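The same check can be expressed as a count of homologues per merged cluster. This sketch rebuilds a small excerpt of the `clusters` table (the six homologue rows above plus four other rows shown earlier) so that it runs on its own.

```python
import pandas as pd

homologues = ["Q7XJV3", "Q6EPT2", "A0A0N7KFK3", "Q6YY31", "Q6EPT4", "Q0JCK8"]

# Excerpt of the clusters table: four non-homologue rows plus the six homologues.
excerpt = pd.DataFrame({
    "member": ["A0A0P0Y5A4", "Q2RAL3", "Q0DBF4", "Q0JKW0"] + homologues,
    "merged_rep": ["Q0DEU2.pdb_B", "Q6K9R5.pdb_B", "A3BDZ2.pdb_B", "A0A0P0WKJ4.pdb_B"]
                  + ["Q6K9R5.pdb_B"] * 6,
})

# Count how many homologues fall in each merged cluster.
counts = excerpt[excerpt.member.isin(homologues)].merged_rep.value_counts()
print(counts)
```

A single entry in `counts` means that all homologues ended up in the same merged cluster.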

# 2. Make PyMOL sessions for the top clusters with `alphacrv-rank`

Run the following command to select the top clusters that we saw above, make PyMOL sessions for them, and optionally run structural clustering on each cluster to find subclusters:

```bash
alphacrv-rank \
  --clusters_dir /path/to/destination/6R8K_clusters \
  --min_members 5 \
  --min_tmscore 0.2 \
  --max_rmsd 15 \
  --cluster_clusters
```

The program will show you the top clusters that will be used to make the PyMOL sessions. You can press `Enter` to continue, or exit the program with `Ctrl+C` to change the filtering parameters.

```
(env) AlphaCRV$ alphacrv-rank \
--clusters_dir ./examples/AVRPik/AVRPik_vs_rice_clusters \
--min_members 5 \
--min_tmscore 0.2 \
--max_rmsd 15 \
--cluster_clusters
INFO:root:Identified 4 top clusters.
INFO:root:Top clusters:

            cluster   tmscore    rmsd  cluster_size  fraction_binder
0      Q6K9R5.pdb_B  0.313905   4.985          21.0         0.305328
1  A0A0P0WKJ4.pdb_B  0.421400  10.930          20.0         0.856436
2      A3BDZ2.pdb_B  0.824705   3.385          13.0         0.944444
3      Q0DEU2.pdb_B  0.940945   2.165          11.0         0.987633

Press Enter to continue, or Ctrl+C to exit and select different filtering parameters: 
INFO:root:Copying pdbs from the top clusters...
INFO:root:Making Pymol sessions...
PyMOL>select chain B AND model A0A0P0WKJ4_repB
 Selector: selection "sele" defined with 3937 atoms.
PyMOL>bg white

...
PyMOL>select chain B AND model Q6K9R5_repB
 Selector: selection "sele" defined with 1397 atoms.
PyMOL>bg white
PyMOL>set ray_shadow, 0
 Setting: ray_shadow set to off.
PyMOL>color grey80
 Executive: Colored 61825 atoms.
PyMOL>select chain A
 Selector: selection "sele" defined with 31164 atoms.
PyMOL>color slate, sele
 Executive: Colored 31164 atoms.
PyMOL>delete all
INFO:root:Clustering clusters...
INFO:root:Clustering A0A0P0WKJ4.pdb_B
INFO:root:Running structural clustering...
INFO:root:Processing output...
INFO:root:Clustering A3BDZ2.pdb_B
INFO:root:Running structural clustering...
INFO:root:Processing output...
INFO:root:Clustering Q0DEU2.pdb_B
INFO:root:Running structural clustering...
INFO:root:Processing output...
INFO:root:Clustering Q6K9R5.pdb_B
INFO:root:Running structural clustering...
INFO:root:Processing output...
INFO:root:Done!!
```

This command should create the following files in the `merged_clusters` subdirectory of `--clusters_dir`:

- `clustered_clusters.csv`: Contains the subclusters for each of the top clusters.
- `cluster_<cluster_ID>/`: Contains the PDBs of each cluster, and a PyMOL session with the cluster members.
- `cluster_<cluster_ID>_clusters/`: Contains the results of the `foldseek easy-cluster` run on the cluster members.

## Read clustered clusters

The following DataFrame contains the subclusters for each of the top clusters:

In [21]:
clustered_clusters = pd.read_csv(results_dir / 'merged_clusters/clustered_clusters.csv')

In [22]:
clustered_clusters.head()

Unnamed: 0,subcluster_rep,member,cluster
0,A0A0P0W913.pdb_B,A0A0P0W913,A0A0P0WKJ4.pdb_B
1,A0A0P0W913.pdb_B,A0A0P0WQF2,A0A0P0WKJ4.pdb_B
2,A0A0P0WKJ4.pdb_B,A0A0P0WKJ4,A0A0P0WKJ4.pdb_B
3,A0A0P0WKJ4.pdb_B,Q0JKW0,A0A0P0WKJ4.pdb_B
4,A0A0P0XBZ1.pdb_B,A0A0P0XBZ1,A0A0P0WKJ4.pdb_B


Now we can look at the most interesting clusters and their subclusters in more detail:

## RESULT 1: Cluster Q6K9R5.pdb_B (contains true binder homologues, top cluster by size)

In [23]:
cluster = 'Q6K9R5.pdb_B'

See subclusters:

In [24]:
clustered_clusters[clustered_clusters.cluster==cluster]

Unnamed: 0,subcluster_rep,member,cluster
44,Q6K9R5.pdb_B,Q6K9R5,Q6K9R5.pdb_B
45,Q6K9R5.pdb_B,A0A0N7KFK3,Q6K9R5.pdb_B
46,Q6K9R5.pdb_B,B7E663,Q6K9R5.pdb_B
47,Q6K9R5.pdb_B,Q6EPT2,Q6K9R5.pdb_B
48,Q6K9R5.pdb_B,Q7XJV3,Q6K9R5.pdb_B
49,Q6K9R5.pdb_B,Q6YY31,Q6K9R5.pdb_B
50,Q6K9R5.pdb_B,Q7G2B2,Q6K9R5.pdb_B
51,Q6K9R5.pdb_B,A3ADD6,Q6K9R5.pdb_B
52,Q6K9R5.pdb_B,Q94CS5,Q6K9R5.pdb_B
53,Q6K9R5.pdb_B,Q0JCK8,Q6K9R5.pdb_B


All members share the same `subcluster_rep`, so this cluster contains only one subcluster.
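To compare clusters at a glance, you can count the distinct `subcluster_rep` values per cluster. The sketch below uses a short excerpt of `clustered_clusters` (the five rows from `head()` and the first two rows of this cluster); with the full file you would group the `clustered_clusters` DataFrame directly.

```python
import pandas as pd

# Excerpt of clustered_clusters.csv, copied from the tables above.
excerpt = pd.DataFrame({
    "subcluster_rep": ["A0A0P0W913.pdb_B", "A0A0P0W913.pdb_B", "A0A0P0WKJ4.pdb_B",
                       "A0A0P0WKJ4.pdb_B", "A0A0P0XBZ1.pdb_B",
                       "Q6K9R5.pdb_B", "Q6K9R5.pdb_B"],
    "member": ["A0A0P0W913", "A0A0P0WQF2", "A0A0P0WKJ4", "Q0JKW0", "A0A0P0XBZ1",
               "Q6K9R5", "A0A0N7KFK3"],
    "cluster": ["A0A0P0WKJ4.pdb_B"] * 5 + ["Q6K9R5.pdb_B"] * 2,
})

# Number of distinct subclusters found in each top cluster.
n_subclusters = excerpt.groupby("cluster")["subcluster_rep"].nunique()
print(n_subclusters)
```

The numbers for `A0A0P0WKJ4.pdb_B` reflect only the rows included in this excerpt.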

We can see in the `median_scores` DataFrame the model that has the best alignments to all other structures (best representative of the cluster).

In [25]:
median_scores[median_scores.cluster==cluster].sort_values(by='rmsd').head(10)

Unnamed: 0,cluster,member,tmscore,rmsd,aligned_length,cluster_size,fraction_binder
54,Q6K9R5.pdb_B,Q6EPT4,0.313905,4.985,167.5,21.0,0.305328
51,Q6K9R5.pdb_B,Q2RAL3,0.77289,5.37,163.0,21.0,0.972222
57,Q6K9R5.pdb_B,Q6ZBC3,0.591925,5.62,161.0,21.0,0.931507
61,Q6K9R5.pdb_B,Q84TB9,0.608545,6.68,162.0,21.0,0.971831
45,Q6K9R5.pdb_B,A0A0P0XYR1,0.548765,7.355,142.0,21.0,0.590361
50,Q6K9R5.pdb_B,Q2QZ01,0.545625,7.42,142.0,21.0,0.590361
53,Q6K9R5.pdb_B,Q6EPT2,0.575325,7.915,167.0,21.0,0.891566
58,Q6K9R5.pdb_B,Q7EY69,0.544055,8.06,132.0,21.0,0.46988
49,Q6K9R5.pdb_B,Q10N90,0.55301,8.125,139.0,21.0,0.554217
64,Q6K9R5.pdb_B,Q94CS5,0.60464,8.18,163.0,21.0,0.972222


We can also select the IDs of the binder proteins in this cluster, and run them through a tool such as DAVID to perform enrichment analysis.

In [26]:
members = clusters[clusters.merged_rep==cluster].member

In [27]:
for m in members:
    print(m)

Q2RAL3
B7E663
Q7XJV3
Q94CS5
Q6ZBC3
Q8LJL3
Q10N90
Q8L4I2
Q5JL91
Q7EY69
Q6EPT2
A0A0N7KFK3
Q84TB9
A0A0P0XYR1
Q2QZ01
Q6YY31
Q7G2B2
Q6EPT4
Q6K9R5
A3ADD6
Q0JCK8
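Instead of copying the printed IDs by hand, you can write them to a plain-text file with one ID per line, which is a format that most enrichment tools accept for upload or pasting. This is a minimal sketch: `write_ids` and the file name are hypothetical, and only the first three IDs from the output above are used so that the example stays short.

```python
import tempfile
from pathlib import Path


def write_ids(ids, path):
    """Write one protein ID per line and return the path of the file."""
    path = Path(path)
    path.write_text("\n".join(ids) + "\n")
    return path


# First three IDs printed above; in the notebook you would pass `members`.
example_members = ["Q2RAL3", "B7E663", "Q7XJV3"]
out_dir = Path(tempfile.mkdtemp())  # temporary directory, for illustration only
ids_file = write_ids(example_members, out_dir / "Q6K9R5_members.txt")
print(ids_file.read_text())
```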


Opening the `examples/AVRPik/AVRPik_vs_rice_clusters/merged_clusters/cluster_Q6K9R5.pdb_B/session.pse` file in PyMOL, we can visualize the cluster and produce a figure like this:

<img src='./pictures/AVRPik_cluster.png' height='600px'>

When aligning the HMA domains of the binder structures (grey), the different locations of the AVR-Pik protein raise the possibility of a more flexible binding across the entire surface of the HMA domain's beta strand.

Tip: In PyMOL, you can remove the residues with a low pLDDT score to make the figure cleaner, with the command:
    
```
select b < 50; remove sele
```