# Clustering SKP1
---

This notebook guides you through clustering 712 complexes of the SKP1 protein with a sample of rice proteins. All of the models are dimeric. The 712 models originate from a proteome-wide screen of SKP1 against the rice proteome (O. sativa subsp. japonica, 43,000+ initial models). To select this sample we first ran AlphaCRV on all the models; the 712 structures belong to some of the best clusters that were identified. This sample is much larger than in the other examples because the largest cluster here has around 700 structures, which might be due to the higher number of homologues of SKP1 compared to the two AVR proteins in the other examples. Running the pipeline on these models should be enough to reproduce the results and give you an idea of the workflow.

# Prerequisites

- Install AlphaCRV in a conda environment and activate it
- Download the models and sequences for this example from [Zenodo](https://zenodo.org/record/5525340/files/alphafold-multimer.tar.gz?download=1)

# 1. Cluster the models with the `alphacrv-cluster` command

Run the following command to cluster the models. Make sure to change the paths according to your system:

```bash
alphacrv-cluster \
  --bait ./examples/SKP1/SKP1.fasta \
  --binders ./examples/SKP1/SKP1_binders.fasta \
  --models_dir ./examples/SKP1/SKP1_vs_rice_models \
  --destination ./examples/SKP1/SKP1_vs_rice_clusters \
  --cpus 8
```

After collecting the quality scores from the models in `--models_dir`, the program counts how many models have an ipTM score higher than the threshold (0.75 by default) and prompts you to confirm or modify the threshold. It then proceeds with the clustering.
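The thresholding step can be pictured with a few lines of plain Python. This is only a sketch of the idea, not AlphaCRV's own code: the helper `count_above` and the entry `hypothetical_low_model` are made up for illustration, and the other three scores are copied from the `merged_clusters.csv` table shown later in this notebook.

```python
# ipTM scores keyed by complex name; the last entry is a made-up low-scoring model
iptm_scores = {
    "8IF6-1_Q6Z1A9-1": 0.903477,
    "8IF6-1_Q5VMP0-1": 0.899220,
    "8IF6-1_Q6ZKB8-1": 0.753729,
    "hypothetical_low_model": 0.41,
}


def count_above(scores, threshold=0.75):
    """Return the names of the models whose ipTM is strictly above the threshold."""
    return [name for name, iptm in scores.items() if iptm > threshold]


passing = count_above(iptm_scores)
print(f"{len(passing)} of {len(iptm_scores)} models have ipTM > 0.75")
```

Raising the threshold shrinks the set of models that go into clustering, which is why the program asks for confirmation before continuing.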

This run will take considerably longer than the two other examples, because of the large size of the first cluster. The full output is as follows:

After this step you will have a directory with the following structure:

Now let's look at some of the important files:

In [1]:
from pathlib import Path
import pandas as pd

In [2]:
results_dir = Path('./SKP1/SKP1_vs_rice_clusters/')

## See clusters

The `merged_clusters.csv` file contains the list of models with their corresponding sequence, structure and merged clusters. It also has the quality scores provided by AlphaFold.

In [3]:
clusters = pd.read_csv(results_dir / 'merged_clusters/merged_clusters.csv')

In [4]:
clusters

Unnamed: 0,complex,str_rep,seq_rep,merged_rep,member,iptm,iptm+ptm
0,8IF6-1_Q6Z1A9-1,A0A0P0Y2U6.pdb_B,A0A0N7KPE6,A0A0P0XHF4.pdb_B,Q6Z1A9,0.903477,0.868600
1,8IF6-1_Q7XSL8-1,A0A0P0WQD9.pdb_B,A0A0P0VJA5,A0A0P0XHF4.pdb_B,Q7XSL8,0.903106,0.899270
2,8IF6-1_Q5Z8K3-1,Q6ZDH1.pdb_B,Q5Z8K3,A0A0P0XHF4.pdb_B,Q5Z8K3,0.899284,0.857990
3,8IF6-1_Q5VMP0-1,A0A0P0WQD9.pdb_B,Q5VMP0,A0A0P0XHF4.pdb_B,Q5VMP0,0.899220,0.894956
4,8IF6-1_A0A0P0WG98-1,Q5Z7U2.pdb_B,C7J8M0,A0A0P0XHF4.pdb_B,A0A0P0WG98,0.899086,0.817529
...,...,...,...,...,...,...,...
707,8IF6-1_Q0JK63-1,Q0JK63.pdb_B,Q0JK63,A0A0P0XHF4.pdb_B,Q0JK63,0.755674,0.761966
708,8IF6-1_Q53WL8-1,Q6ZDH1.pdb_B,A0A0P0WPS2,A0A0P0XHF4.pdb_B,Q53WL8,0.755021,0.761609
709,8IF6-1_A3AQW3-1,A0A0N7KT42.pdb_B,A3AQW3,A0A0P0XHF4.pdb_B,A3AQW3,0.753747,0.705141
710,8IF6-1_Q6ZKB8-1,Q0DQG8.pdb_B,Q6ZKB8,A0A0P0XHF4.pdb_B,Q6ZKB8,0.753729,0.744348


The columns of the `clusters` DataFrame are:
- `complex`: The name of the complex. This is the same name as the directory where the model is stored.
- `str_rep`: Name of the structure cluster representative
- `seq_rep`: Name of the sequence cluster representative
- `merged_rep`: Name of the merged cluster representative (sequence + structure)
- `member`: The ID of the binder protein
- `iptm`: The ipTM score of the model
- `iptm+ptm`: The ipTM+pTM score of the model (it is calculated by AlphaFold as 0.8*ipTM + 0.2*pTM)
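Because the combined score is a fixed weighted sum, the pTM value that is not stored in the table can be recovered from the two columns that are. The sketch below rearranges the formula above; the function name `ptm_from_combined` is hypothetical, and the input values are taken from the first row of the table.

```python
def ptm_from_combined(iptm, combined, w_iptm=0.8, w_ptm=0.2):
    """Invert combined = w_iptm * ipTM + w_ptm * pTM to recover pTM."""
    return (combined - w_iptm * iptm) / w_ptm


# First row of the table: iptm = 0.903477, iptm+ptm = 0.868600
ptm = ptm_from_combined(0.903477, 0.868600)
print(round(ptm, 3))
```

With the DataFrame loaded, the same rearrangement applies column-wise: `(clusters['iptm+ptm'] - 0.8 * clusters['iptm']) / 0.2`.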

See the number of different merged clusters:

In [5]:
clusters.merged_rep.unique().shape

(4,)

For this example, the models and sequences of the 712 binder proteins were summarized in 4 clusters. Far fewer structures to sort through!

## See alignment scores

Alignment scores are calculated for each cluster by aligning every member of the cluster against every other member.

In [6]:
alignment_scores = pd.read_csv(results_dir / 'merged_clusters/alignment_scores.csv')

In [7]:
alignment_scores.head()

Unnamed: 0,cluster,ref,member,tmscore_ref,tmscore_m,aligned_length,rmsd
0,A0A0P0XHF4.pdb_B,Q6Z1A9,Q7XSL8,0.40236,0.25976,215,8.43
1,A0A0P0XHF4.pdb_B,Q6Z1A9,Q5Z8K3,0.38146,0.28447,185,10.48
2,A0A0P0XHF4.pdb_B,Q6Z1A9,Q5VMP0,0.42881,0.23565,419,16.0
3,A0A0P0XHF4.pdb_B,Q6Z1A9,A0A0P0WG98,0.38724,0.35182,235,16.33
4,A0A0P0XHF4.pdb_B,Q6Z1A9,Q67UX0,0.40441,0.30453,263,15.26


The columns of the `alignment_scores` DataFrame are:
- `cluster`: The name of the cluster
- `ref`: Binder ID of the reference structure in the alignment
- `member`: Binder ID of the second structure in the alignment
- `tmscore_ref`: TM-score based on the reference structure
- `tmscore_m`: TM-score based on the second structure
- `aligned_length`: Length of the alignment
- `rmsd`: RMSD of the alignment

Based on these scores, the median scores are calculated for each cluster member to find the best representative of the cluster (the one with the lowest median RMSD to the other members).
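The idea of picking the member with the lowest median RMSD can be sketched with pandas on a toy table. The values below are made up, and the sketch assumes that each member is scored through the rows where it appears in the `ref` column; the actual pipeline may aggregate the pairwise scores differently.

```python
import pandas as pd

# Toy pairwise table with the same column names as alignment_scores.csv (made-up values)
pairs = pd.DataFrame({
    "cluster": ["c1"] * 6,
    "ref": ["A", "A", "B", "B", "C", "C"],
    "member": ["B", "C", "A", "C", "A", "B"],
    "tmscore_ref": [0.9, 0.5, 0.8, 0.6, 0.4, 0.5],
    "rmsd": [2.0, 9.0, 2.0, 4.0, 9.0, 4.0],
})

# Median score of every member against all other members of its cluster
medians = (pairs.groupby(["cluster", "ref"])[["tmscore_ref", "rmsd"]]
           .median()
           .reset_index())

# Representative of each cluster: the member with the lowest median RMSD
best = medians.loc[medians.groupby("cluster").rmsd.idxmin()]
print(best)
```

The median is used here rather than the mean so that a few badly aligned pairs do not dominate the score of an otherwise central member.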

## Read median scores and find top clusters

Now we can rank the clusters based on the median alignment scores of the cluster representatives:

In [8]:
median_scores = pd.read_csv(results_dir / 'merged_clusters/median_scores.csv')

In [9]:
median_scores.shape

(712, 7)

The `median_scores` DataFrame contains the median alignment scores of each cluster member when aligned to all other members of the same cluster.

In [10]:
median_scores.head()

Unnamed: 0,cluster,member,tmscore,rmsd,aligned_length,cluster_size,fraction_binder
0,A0A0P0VR14.pdb_B,A0A0P0VR14,0.32871,18.64,208.0,6.0,0.111864
1,A0A0P0VR14.pdb_B,Q2R448,0.85991,6.04,445.0,6.0,0.934256
2,A0A0P0VR14.pdb_B,Q6K6K8,0.85162,4.42,445.0,6.0,0.90301
3,A0A0P0VR14.pdb_B,Q7XKU0,0.4911,17.19,405.0,6.0,0.795848
4,A0A0P0VR14.pdb_B,Q7XL60,0.82004,4.95,455.0,6.0,0.894569


The columns of the `median_scores` DataFrame are:
- `cluster`: The name of the cluster
- `member`: ID of the cluster member (binder molecule)
- `tmscore`: Median TM-score of the complex against all other complexes in this cluster
- `rmsd`: Median RMSD of the complex against all other complexes in this cluster
- `aligned_length`: Median length of the alignment
- `cluster_size`: Number of members in the cluster
- `fraction_binder`: On average, how much of the binder molecule is included in the alignments of this complex against all other complexes (calculated as `(aligned_length - bait_length)/binder_length`). This is just meant to be an approximation of how complete the alignments are for this cluster member.
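The `fraction_binder` formula is easy to check by hand. In the sketch below the bait and binder lengths are hypothetical, chosen only to illustrate the arithmetic; they are not the real lengths of SKP1 or of any binder in this dataset.

```python
def fraction_binder(aligned_length, bait_length, binder_length):
    """Approximate share of the binder that is covered by the alignment."""
    return (aligned_length - bait_length) / binder_length


# Hypothetical lengths (assumptions for illustration only)
bait_length = 160
full_cover = fraction_binder(210, bait_length, 50)      # short binder, fully aligned
partial_cover = fraction_binder(185, bait_length, 100)  # longer binder, partly aligned
print(full_cover, partial_cover)
```

Since the formula assumes that the whole bait is part of the alignment, the value is only an approximation and can even become negative when the aligned length is shorter than the bait.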

The next step is to select the cluster representatives. For this, we first need to filter out the cluster members with poor quality alignments, according to the following criteria:
- Small cluster size
- Low median TM-score
- High median RMSD
- Low fraction of the binder aligned in the cluster representative

In [11]:
# Keep the cluster members that meet all of the following criteria:
select = (
    (median_scores.cluster_size >= 5)
    & (median_scores.tmscore >= 0.2)
    & (median_scores.fraction_binder >= 0.2)
    & (median_scores.rmsd <= 15)
)
median_scores_filtered = median_scores[select]

In [12]:
median_scores_filtered.shape

(169, 7)

See how many clusters are left after filtering:

In [13]:
median_scores_filtered.cluster.unique().shape

(3,)

Function to format tables:

In [14]:
import seaborn as sns
cm_r = sns.color_palette("mako_r", as_cmap=True)
cm = sns.color_palette("mako", as_cmap=True)

In [15]:
def make_pretty(styler):
    styler.format(precision=2)
    styler.background_gradient(axis=0, cmap=cm_r, subset=pd.IndexSlice[:,"cluster_size"],vmin=5,vmax=15)
    styler.background_gradient(axis=0, cmap=cm_r, subset=pd.IndexSlice[:,"tmscore"],vmin=0.2,vmax=0.8)
    styler.background_gradient(axis=0, cmap=cm, subset=pd.IndexSlice[:,"rmsd"],vmin=2,vmax=10)
    styler.background_gradient(axis=0, cmap=cm_r, subset=pd.IndexSlice[:,"fraction_binder"],vmin=0.2,vmax=0.9)
    return styler

## RESULT 1: See clusters ranked by RMSD

Finally, we can rank the clusters and see which ones have a good combination of low RMSD and large cluster size. These are the most likely to contain the true binder.

In [16]:
# Select the rows with the minimum RMSD for each cluster
select = median_scores_filtered.groupby('cluster').rmsd.idxmin()
columns = ['cluster', 'tmscore', 'rmsd', 'cluster_size', 'fraction_binder']
median_scores_filtered.loc[select,columns].sort_values(by='rmsd').style.pipe(make_pretty)

Unnamed: 0,cluster,tmscore,rmsd,cluster_size,fraction_binder
19,A0A0P0XHF4.pdb_B,0.94,1.53,695.0,1.0
2,A0A0P0VR14.pdb_B,0.85,4.42,6.0,0.9
704,Q65XD1.pdb_B,0.7,13.05,9.0,0.93


Here we can see that the cluster `A0A0P0XHF4.pdb_B` has the lowest median RMSD and the highest median TM-score. It also has a very large size with 695 members!

## RESULT 2: See clusters ranked by size

In [17]:
# Select the rows with the minimum RMSD for each cluster
select = median_scores_filtered.groupby('cluster').rmsd.idxmin()
columns = ['cluster', 'tmscore', 'rmsd', 'cluster_size', 'fraction_binder']
median_scores_filtered.loc[select, columns].sort_values(by='cluster_size', ascending=False).style.pipe(make_pretty)

Unnamed: 0,cluster,tmscore,rmsd,cluster_size,fraction_binder
19,A0A0P0XHF4.pdb_B,0.94,1.53,695.0,1.0
704,Q65XD1.pdb_B,0.7,13.05,9.0,0.93
2,A0A0P0VR14.pdb_B,0.85,4.42,6.0,0.9


`A0A0P0XHF4.pdb_B` is also the largest cluster. So we managed to reduce more than 43,000 starting models to only one excellent cluster!

For this example we know that in our list of candidate binders there are 5 homologues of the true binder protein. We can find out which clusters contain these homologues:

In [18]:
homologues = ['Q5VMP0', 'A0A0N7KEW0', 'Q8RZQ3', 'Q69X07', 'A0A0P0Y6A8']

In [19]:
clusters[clusters.member.isin(homologues)][['complex','merged_rep','iptm','iptm+ptm']]

Unnamed: 0,complex,merged_rep,iptm,iptm+ptm
3,8IF6-1_Q5VMP0-1,A0A0P0XHF4.pdb_B,0.89922,0.894956
28,8IF6-1_A0A0P0Y6A8-1,A0A0P0XHF4.pdb_B,0.893004,0.886242
321,8IF6-1_Q8RZQ3-1,A0A0P0XHF4.pdb_B,0.867056,0.855604
372,8IF6-1_A0A0N7KEW0-1,A0A0P0XHF4.pdb_B,0.862492,0.844927
516,8IF6-1_Q69X07-1,A0A0P0XHF4.pdb_B,0.847566,0.808583


They are all in the top cluster!
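The same conclusion can be reached by tallying the homologues per merged cluster instead of reading the table by eye. The sketch below uses plain Python with the five `(member, merged_rep)` pairs copied from the output above; the dictionary name `homologue_clusters` is made up for this example.

```python
from collections import Counter

# Merged cluster of each homologue, copied from the table above
homologue_clusters = {
    "Q5VMP0": "A0A0P0XHF4.pdb_B",
    "A0A0P0Y6A8": "A0A0P0XHF4.pdb_B",
    "Q8RZQ3": "A0A0P0XHF4.pdb_B",
    "A0A0N7KEW0": "A0A0P0XHF4.pdb_B",
    "Q69X07": "A0A0P0XHF4.pdb_B",
}

# Number of homologues that fall into each merged cluster
per_cluster = Counter(homologue_clusters.values())
print(per_cluster.most_common())
```

On the full DataFrame, `clusters[clusters.member.isin(homologues)].merged_rep.value_counts()` gives the same tally.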

# 2. Make PyMOL sessions for the top clusters with `alphacrv-rank`

Run the following command to select the top clusters that we saw above, make PyMOL sessions of them, and optionally run structural clustering on each cluster to find subclusters:

```bash
alphacrv-rank \
  --clusters_dir ./examples/SKP1/SKP1_vs_rice_clusters \
  --min_members 5 \
  --min_tmscore 0.2 \
  --max_rmsd 15 \
  --cluster_clusters
```

The program will show you the top clusters that will be used to make the PyMOL sessions. You can press `Enter` to continue, or exit the program with `Ctrl+C` to change the filtering parameters.

This command should create the following files in the `./examples/SKP1/SKP1_vs_rice_clusters/merged_clusters/` directory:

- `clustered_clusters.csv`: Contains the subclusters for each of the top clusters.
- `cluster_<cluster_ID>/`: Contains the PDBs of each cluster, and a PyMOL session with the cluster members.
- `cluster_<cluster_ID>_clusters/`: Contains the results of the `foldseek easy-cluster` run on the cluster members.
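A quick way to confirm that the run produced this layout is to scan the directory with `pathlib`. The helper `summarize_rank_outputs` below is hypothetical and is not part of AlphaCRV; the demo runs on a throwaway directory that only mimics the names listed above.

```python
from pathlib import Path
import tempfile


def summarize_rank_outputs(merged_dir):
    """Report whether the CSV exists and list the cluster and subcluster directories."""
    merged_dir = Path(merged_dir)
    has_csv = (merged_dir / "clustered_clusters.csv").is_file()
    subcluster_dirs = sorted(
        p.name for p in merged_dir.glob("cluster_*_clusters") if p.is_dir()
    )
    cluster_dirs = sorted(
        p.name for p in merged_dir.glob("cluster_*")
        if p.is_dir() and not p.name.endswith("_clusters")
    )
    return has_csv, cluster_dirs, subcluster_dirs


# Demo on a temporary directory that imitates the expected layout
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "clustered_clusters.csv").touch()
    (root / "cluster_A0A0P0XHF4.pdb_B").mkdir()
    (root / "cluster_A0A0P0XHF4.pdb_B_clusters").mkdir()
    demo = summarize_rank_outputs(root)

print(demo)
```

To check a real run, call the function on `results_dir / 'merged_clusters'` instead of the temporary directory.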

## Read clustered clusters

The following DataFrame contains the subclusters for each of the top clusters:

In [20]:
clustered_clusters = pd.read_csv(results_dir / 'merged_clusters/clustered_clusters.csv')

In [21]:
clustered_clusters.head()

Unnamed: 0,subcluster_rep,member,cluster
0,A0A0P0VR14.pdb_B,A0A0P0VR14,A0A0P0VR14.pdb_B
1,A0A0P0VR14.pdb_B,Q7XKU0,A0A0P0VR14.pdb_B
2,A0A0P0VR14.pdb_B,Q7XL60,A0A0P0VR14.pdb_B
3,A0A0P0VR14.pdb_B,Q6K6K8,A0A0P0VR14.pdb_B
4,A0A0P0VR14.pdb_B,Q2R448,A0A0P0VR14.pdb_B


Now we can look at the most interesting clusters and their subclusters in more detail:

## RESULT 1: Cluster A0A0P0XHF4.pdb_B (contains true binder homologs, top cluster by size)

In [22]:
cluster = 'A0A0P0XHF4.pdb_B'

See subclusters:

In [23]:
clustered_clusters[clustered_clusters.cluster==cluster].head()

Unnamed: 0,subcluster_rep,member,cluster
6,A0A0P0VGE5.pdb_B,A0A0P0VGE5,A0A0P0XHF4.pdb_B
7,A0A0P0VGE5.pdb_B,Q67W96,A0A0P0XHF4.pdb_B
8,A0A0P0VHD9.pdb_B,A0A0P0VHD9,A0A0P0XHF4.pdb_B
9,A0A0P0VHP5.pdb_B,A0A0P0VHP5,A0A0P0XHF4.pdb_B
10,A0A0P0VHP5.pdb_B,A0A0P0UZC1,A0A0P0XHF4.pdb_B


Count how many subclusters:

In [24]:
clustered_clusters[clustered_clusters.cluster==cluster].subcluster_rep.unique().shape

(53,)

See the number of structures in each subcluster:

In [25]:
(clustered_clusters[clustered_clusters.cluster==cluster].groupby('subcluster_rep')
 .size().sort_values(ascending=False).head(15))

subcluster_rep
Q6ZDH1.pdb_B        223
A0A0P0WQD9.pdb_B    116
A0A0P0Y2U6.pdb_B     51
Q5Z7U2.pdb_B         35
Q0DQG8.pdb_B         34
Q75J50.pdb_B         27
Q0JPJ6.pdb_B         19
A0A0P0XHI5.pdb_B     14
A0A0P0X3V6.pdb_B     14
A0A0P0VHP5.pdb_B     12
Q7XAK4.pdb_B          9
Q7EY32.pdb_B          9
A0A0P0V3R7.pdb_B      9
Q8LJA9.pdb_B          8
A0A0P0X458.pdb_B      7
dtype: int64
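The subcluster sizes printed above are very uneven, so it can help to see how much of the cluster the largest subclusters already cover. The sketch below is plain arithmetic on the 15 sizes from that output and on the cluster size of 695 reported earlier; it does not read any files.

```python
from itertools import accumulate

# Sizes of the 15 largest subclusters, copied from the output above
sizes = [223, 116, 51, 35, 34, 27, 19, 14, 14, 12, 9, 9, 9, 8, 7]
cluster_size = 695  # total number of members in the cluster

# Running total of members covered by the N largest subclusters
cumulative = list(accumulate(sizes))
for n in (2, 5, 15):
    covered = cumulative[n - 1]
    print(f"top {n:>2} subclusters: {covered} members ({covered / cluster_size:.0%})")
```

The two largest subclusters together hold 339 of the 695 members, so a handful of representatives already describes most of the cluster.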

Here we have 53 subclusters! This cluster has 695 structures in total, which is too many to load into PyMOL at once, so we have to change our strategy. One way to decide whether the subclusters are similar to or different from each other is to look only at the subcluster representatives and classify them according to their topology. We will now try three different approaches to visualize this cluster:

## Take all members of this cluster together

We can just look at the structures with lowest median RMSD to have an idea of how this cluster looks:

In [26]:
median_scores[median_scores.cluster==cluster].sort_values(by='rmsd').head()

Unnamed: 0,cluster,member,tmscore,rmsd,aligned_length,cluster_size,fraction_binder
19,A0A0P0XHF4.pdb_B,A0A0N7KK85,0.94121,1.53,210.0,695.0,1.0
520,A0A0P0XHF4.pdb_B,Q6ASY4,0.93821,1.56,213.0,695.0,0.974359
401,A0A0P0XHF4.pdb_B,Q2QQH8,0.9387,1.6,215.0,695.0,0.97561
37,A0A0P0XHF4.pdb_B,A0A0N7KT42,0.935435,1.63,208.0,695.0,0.970588
256,A0A0P0XHF4.pdb_B,A0A0P0YAQ4,0.935185,1.645,217.0,695.0,0.976744


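We can print out the paths of the top 20 models to open them in PyMOL (through the terminal or with a command inside PyMOL).

In [27]:
# Base path that is prepended to `results_dir` below; it must end with a slash.
# This uses the notebook's working directory (the `examples` folder); replace it
# with an absolute path if you run the notebook from somewhere else.
path_cluster = f'{Path.cwd().resolve()}/'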

In [None]:
# Print paths to open models in pymol
membs = median_scores[median_scores.cluster==cluster].sort_values(by='rmsd').head(20).member
' '.join([(path_cluster + str(results_dir) + f'/merged_clusters/cluster_{cluster}/{m}.pdb') for m in membs])

<img src='./pictures/SKP1_cluster.png' height='600px'>
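Instead of pasting a long list of paths into the terminal, the same models can be loaded through a small PyMOL script. The sketch below writes one `load` command per PDB file; the helper `write_pml` and the file name `load_cluster.pml` are made up for this example, and the two relative paths are only illustrative.

```python
from pathlib import Path
import tempfile


def write_pml(pdb_paths, out_file):
    """Write one PyMOL `load` command per PDB file and return the written lines."""
    # The object name in PyMOL is the file name without the .pdb extension
    lines = [f"load {p}, {Path(p).stem}" for p in pdb_paths]
    Path(out_file).write_text("\n".join(lines) + "\n")
    return lines


# Illustrative relative paths for two members of the top cluster
example_paths = [
    "cluster_A0A0P0XHF4.pdb_B/A0A0N7KK85.pdb",
    "cluster_A0A0P0XHF4.pdb_B/Q6ASY4.pdb",
]

with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "load_cluster.pml"
    pml_lines = write_pml(example_paths, script)
    written = script.read_text()

print(written)
```

Inside PyMOL the script can be run with `@load_cluster.pml`, or it can be passed on the command line as `pymol load_cluster.pml`.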

## Look at all the subcluster representatives in PyMOL and group them by topology

In [28]:
subclusters = (clustered_clusters[clustered_clusters.cluster==cluster])
subclusters_ids = subclusters['subcluster_rep'].str.split('.pdb').str[0]

See the names of the subclusters:

In [29]:
subclusters.subcluster_rep.unique()

array(['A0A0P0VGE5.pdb_B', 'A0A0P0VHD9.pdb_B', 'A0A0P0VHP5.pdb_B',
       'A0A0P0VNF2.pdb_B', 'A0A0P0VNI4.pdb_B', 'A0A0P0VRN2.pdb_B',
       'A0A0P0VZH9.pdb_B', 'A0A0P0W419.pdb_B', 'A0A0P0X4D4.pdb_B',
       'A0A0P0XCY3.pdb_B', 'A0A0P0XE00.pdb_B', 'A0A0P0XE77.pdb_B',
       'A0A0P0XEW7.pdb_B', 'A0A0P0XFT5.pdb_B', 'A0A0P0XHF4.pdb_B',
       'A0A0P0XHI5.pdb_B', 'A0A0P0XNQ3.pdb_B', 'A0A0P0XSL8.pdb_B',
       'A0A0P0Y2U6.pdb_B', 'A0A0P0Y5S9.pdb_B', 'A0A0P0YA59.pdb_B',
       'A0A0P0YAA2.pdb_B', 'A0A0P0YC15.pdb_B', 'A2ZPC4.pdb_B',
       'B7ESQ3.pdb_B', 'B9G6M6.pdb_B', 'Q0DQG8.pdb_B', 'Q0JB10.pdb_B',
       'Q0JK63.pdb_B', 'Q0JPJ6.pdb_B', 'Q109D8.pdb_B', 'Q2QQH8.pdb_B',
       'Q6ZDH1.pdb_B', 'A0A0N7KD81.pdb_A', 'A0A0N7KK85.pdb_B',
       'A0A0N7KN01.pdb_B', 'A0A0N7KT42.pdb_B', 'A0A0P0V091.pdb_B',
       'A0A0P0V3R7.pdb_B', 'A0A0P0WDY8.pdb_B', 'A0A0P0WQD9.pdb_B',
       'A0A0P0X3V6.pdb_B', 'A0A0P0X458.pdb_B', 'Q5N762.pdb_B',
       'Q5Z7U2.pdb_B', 'Q69J07.pdb_B', 'Q69X07.pdb_B', 'Q75IQ5.pdb

Print the PATHs of the subcluster representatives to open them in PyMOL:

In [None]:
' '.join([path_cluster + str(results_dir) + f'/merged_clusters/cluster_{cluster}/{p}.pdb' for p in subclusters_ids.unique()])

Write down the subclusters in different categories:

## Subclusters for only the binding motif

In [30]:
subcluster_names = [
'A0A0P0VGE5.pdb_B',
'A0A0P0VHD9.pdb_B',
'A0A0P0VHP5.pdb_B',
'A0A0P0VNF2.pdb_B',
'A0A0P0VNI4.pdb_B',
'A0A0P0X4D4.pdb_B',
'A0A0P0XCY3.pdb_B',
'A0A0P0XE00.pdb_B',
'A0A0P0XE77.pdb_B',
'A0A0P0XEW7.pdb_B',
'A0A0P0XHF4.pdb_B',
'A0A0P0XHI5.pdb_B',
'A0A0P0XNQ3.pdb_B',
'A0A0P0XSL8.pdb_B',
'A0A0P0YA59.pdb_B',
'A0A0P0YAA2.pdb_B',
'B7ESQ3.pdb_B',
'B9G6M6.pdb_B',
'Q109D8.pdb_B',
'Q2QQH8.pdb_B',
'A0A0N7KK85.pdb_B',
'A0A0N7KT42.pdb_B',
'A0A0P0X458.pdb_B',
'Q5N762.pdb_B',
'Q69J07.pdb_B',
'Q69X07.pdb_B',
'Q75IQ5.pdb_B',
'Q7EY32.pdb_B',
'Q7XAK4.pdb_B'
]

See how many proteins are in these subclusters:

In [31]:
(clustered_clusters[clustered_clusters.subcluster_rep.isin(subcluster_names)]
 .subcluster_rep.value_counts().sum())

113

See the alignment scores of the top members in these subclusters:

In [32]:
# Get list of members for subcluster
members_sc = list(clustered_clusters[clustered_clusters.subcluster_rep.isin(subcluster_names)]
                  .member)

median_scores[median_scores.member.isin(members_sc)].sort_values(by='rmsd').head(20)

Unnamed: 0,cluster,member,tmscore,rmsd,aligned_length,cluster_size,fraction_binder
19,A0A0P0XHF4.pdb_B,A0A0N7KK85,0.94121,1.53,210.0,695.0,1.0
520,A0A0P0XHF4.pdb_B,Q6ASY4,0.93821,1.56,213.0,695.0,0.974359
401,A0A0P0XHF4.pdb_B,Q2QQH8,0.9387,1.6,215.0,695.0,0.97561
37,A0A0P0XHF4.pdb_B,A0A0N7KT42,0.935435,1.63,208.0,695.0,0.970588
256,A0A0P0XHF4.pdb_B,A0A0P0YAQ4,0.935185,1.645,217.0,695.0,0.976744
700,A0A0P0XHF4.pdb_B,Q9FWD4,0.92398,1.74,216.0,695.0,0.97619
418,A0A0P0XHF4.pdb_B,Q2R0S7,0.9346,1.76,217.0,695.0,0.976744
145,A0A0P0XHF4.pdb_B,A0A0P0X0M4,0.92131,1.78,215.0,695.0,0.930233
290,A0A0P0XHF4.pdb_B,B9FCI9,0.919485,1.83,219.0,695.0,0.93617
394,A0A0P0XHF4.pdb_B,Q2QMT0,0.918475,1.83,216.0,695.0,0.97619


See if any of the homologous proteins are in here:

In [33]:
[h in members_sc for h in homologues]

[False, False, False, True, False]

In [34]:
homologues

['Q5VMP0', 'A0A0N7KEW0', 'Q8RZQ3', 'Q69X07', 'A0A0P0Y6A8']

`Q69X07` is in this subcluster.

Print the paths of the top 20 PDB files to open them in PyMOL:

In [None]:
# Print paths to open models in pymol
membs = median_scores[median_scores.member.isin(members_sc)].sort_values(by='rmsd').head(20).member
' '.join([(path_cluster + str(results_dir) + f'/merged_clusters/cluster_{cluster}/{m}.pdb') for m in membs])

<img src='./pictures/SKP1_cluster_binding_motif.png' height='600px'>

This is very similar to the previous figure with all the subclusters together. The models with the lowest RMSD overall are those that contain only the minimal binding motif of the binder protein, since they have fewer residues to align.

## Subclusters for the binding motif + Leucine-rich region (horseshoe)

In [35]:
subcluster_names = [
'A0A0P0VRN2.pdb_B',
'A0A0P0XFT5.pdb_B',
'A0A0P0Y2U6.pdb_B',
'A0A0P0Y5S9.pdb_B',
'A0A0P0YC15.pdb_B',
'A0A0N7KD81.pdb_B',
'A0A0P0V091.pdb_B',
'A0A0P0WQD9.pdb_B',
'A0A0P0X3V6.pdb_B',
'Q5Z7U2.pdb_B',
'Q75J50.pdb_B',
'Q7FAH1.pdb_B'
]

See how many proteins are in these subclusters:

In [36]:
(clustered_clusters[clustered_clusters.subcluster_rep.isin(subcluster_names)]
 .subcluster_rep.value_counts().sum())

262

See the alignment scores of the top members in these subclusters:

In [37]:
# Get list of members for subcluster
members_sc = list(clustered_clusters[clustered_clusters.subcluster_rep.isin(subcluster_names)]
                  .member)

median_scores[median_scores.member.isin(members_sc)].sort_values(by='rmsd').head(20)

Unnamed: 0,cluster,member,tmscore,rmsd,aligned_length,cluster_size,fraction_binder
171,A0A0P0XHF4.pdb_B,A0A0P0XBZ2,0.687465,7.325,230.5,695.0,0.533654
149,A0A0P0XHF4.pdb_B,A0A0P0X2K4,0.4153,7.34,216.0,695.0,0.164659
82,A0A0P0XHF4.pdb_B,A0A0P0VUW0,0.711865,7.67,270.0,695.0,0.826087
413,A0A0P0XHF4.pdb_B,Q2QY97,0.4328,7.685,221.0,695.0,0.188525
81,A0A0P0XHF4.pdb_B,A0A0P0VRP6,0.596455,7.75,221.5,695.0,0.360465
310,A0A0P0XHF4.pdb_B,C7J8M0,0.6007,7.785,219.0,695.0,0.27673
118,A0A0P0XHF4.pdb_B,A0A0P0WF72,0.641515,7.825,238.0,695.0,0.516393
282,A0A0P0XHF4.pdb_B,B7EXZ6,0.514025,8.01,214.0,695.0,0.236364
249,A0A0P0XHF4.pdb_B,A0A0P0Y5S9,0.55108,8.015,209.0,695.0,0.244604
244,A0A0P0XHF4.pdb_B,A0A0P0Y2Y9,0.41734,8.035,218.5,695.0,0.173307


See if any of the homologous proteins are in here:

In [38]:
[h in members_sc for h in homologues]

[True, True, True, False, True]

In [39]:
homologues

['Q5VMP0', 'A0A0N7KEW0', 'Q8RZQ3', 'Q69X07', 'A0A0P0Y6A8']

All the remaining homologous proteins are in here!

Print the paths of the top 20 PDB files to open them in PyMOL:

In [None]:
# Print paths to open models in pymol
membs = median_scores[median_scores.member.isin(members_sc)].sort_values(by='rmsd').head(20).member
' '.join([(path_cluster + str(results_dir) + f'/merged_clusters/cluster_{cluster}/{m}.pdb') for m in membs])

<img src='./pictures/SKP1_cluster_llr.png' height='600px'>

These subclusters contain binders that, in addition to the binding motif, have leucine-rich repeat (LRR) domains that also interact with SKP1 in some way. The true binder of SKP1 and its homologues have a similar architecture.

## Subclusters for the binding motif + beta propeller

In [40]:
subcluster_names = [
'A0A0P0W419.pdb_B',
'Q0DQG8.pdb_B',
'Q0JK63.pdb_B',
'Q6ZDH1.pdb_B',
'A0A0P0V3R7.pdb_B'
]

See how many proteins are in these subclusters:

In [41]:
(clustered_clusters[clustered_clusters.subcluster_rep.isin(subcluster_names)]
 .subcluster_rep.value_counts().sum())

277

See the alignment scores of the top members in these subclusters:

In [42]:
# Get list of members for subcluster
members_sc = list(clustered_clusters[clustered_clusters.subcluster_rep.isin(subcluster_names)]
                  .member)

median_scores[median_scores.member.isin(members_sc)].sort_values(by='rmsd').head(20)

Unnamed: 0,cluster,member,tmscore,rmsd,aligned_length,cluster_size,fraction_binder
213,A0A0P0XHF4.pdb_B,A0A0P0XQK8,0.30251,7.94,201.5,695.0,0.065757
133,A0A0P0XHF4.pdb_B,A0A0P0WR08,0.309095,8.335,188.0,695.0,0.033679
422,A0A0P0XHF4.pdb_B,Q2R1S8,0.293405,8.74,194.0,695.0,0.045564
528,A0A0P0XHF4.pdb_B,Q6EQC5,0.35175,8.76,197.5,695.0,0.070755
575,A0A0P0XHF4.pdb_B,Q6Z4S1,0.31815,8.76,206.5,695.0,0.08445
691,A0A0P0XHF4.pdb_B,Q8VWI8,0.29592,8.865,194.0,695.0,0.046569
360,A0A0P0XHF4.pdb_B,Q0JFD5,0.27402,8.89,191.0,695.0,0.034858
692,A0A0P0XHF4.pdb_B,Q8W0I3,0.30853,8.955,195.0,695.0,0.05168
471,A0A0P0XHF4.pdb_B,Q53WL8,0.33339,9.06,197.0,695.0,0.063768
577,A0A0P0XHF4.pdb_B,Q6Z6Y9,0.307905,9.07,197.0,695.0,0.057143


Print the paths of the top 20 PDB files to open them in PyMOL:

In [None]:
# Print paths to open models in pymol
membs = median_scores[median_scores.member.isin(members_sc)].sort_values(by='rmsd').head(20).member
' '.join([(path_cluster + str(results_dir) + f'/merged_clusters/cluster_{cluster}/{m}.pdb') for m in membs])

<img src='./pictures/SKP1_cluster_beta_propeller.png' height='600px'>

The alignment of the beta propeller domains is much messier, as they are modeled in many different orientations. The screenshot above only contains a few of the complexes.
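To put the three topology groups side by side, the member counts reported in this notebook can be tallied against the size of the cluster. This is a sketch of the arithmetic only, and it assumes that the three counts refer to disjoint subsets of the 695-member cluster.

```python
# Member counts reported above for each topology group
groups = {
    "binding motif only": 113,
    "binding motif + leucine-rich region": 262,
    "binding motif + beta propeller": 277,
}
cluster_size = 695  # members of cluster A0A0P0XHF4.pdb_B

classified = sum(groups.values())
for name, n in groups.items():
    print(f"{name}: {n} members ({n / cluster_size:.0%})")

# Members whose subcluster was not assigned to any of the three groups
print(f"not in any listed group: {cluster_size - classified}")
```

Under that assumption, the three groups account for 652 of the 695 members, and the remainder belongs to subclusters that were not assigned to a group here.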