# Part 2: Clustering with ClusTCR

ClusTCR is a Python package developed for fast and accurate clustering of large TCR repertoires. ClusTCR uses a two-step method: it first divides the data into superclusters, and then clusters TCR sequences with high sequence similarity. Further information on ClusTCR and its options can be found here: https://svalkiers.github.io/clusTCR/
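
To make "high sequence similarity" more concrete, the sketch below links CDR3 sequences of equal length that differ at no more than one position and reports the resulting groups. It is a plain-Python illustration only, not ClusTCR's implementation: the function names `hamming_distance` and `group_similar`, the distance threshold of 1, and the grouping by connected components are all assumptions chosen for simplicity.

```python
from itertools import combinations


def hamming_distance(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def group_similar(sequences, max_distance=1):
    """Group sequences that are connected through pairs within max_distance."""
    seqs = list(dict.fromkeys(sequences))  # unique sequences, order preserved
    parent = {s: s for s in seqs}          # union-find forest

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path halving
            s = parent[s]
        return s

    # Compare every pair of sequences (quadratic, so only suitable for toy data)
    for a, b in combinations(seqs, 2):
        if len(a) == len(b) and hamming_distance(a, b) <= max_distance:
            parent[find(a)] = find(b)

    groups = {}
    for s in seqs:
        groups.setdefault(find(s), []).append(s)
    # Keep only groups with at least two members
    return [g for g in groups.values() if len(g) > 1]


example = [
    "CASSQEGELFF",
    "CASSQSGELFF",
    "CASSNPGELFF",
    "CASSFWDSNTGELFF",
    "CASSTWDSNTGELFF",
]
groups = group_similar(example)
print(groups)
```

Comparing every pair of sequences scales quadratically, which is impractical for repertoires with hundreds of thousands of TCRs. Dividing the data into superclusters first keeps the number of comparisons manageable.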

When you use this notebook in Google Colab, run the first few cells to install conda and the ClusTCR package.

If you want to use these notebooks on your local machine, just skip these first three cells and directly import pandas and the clustcr package (after they have been installed locally).
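
If you are unsure which of the two situations applies, a small check like the one below can help. It is a sketch under two assumptions: that the `google.colab` module is already loaded only inside a Colab runtime, and that conda, when installed, is available on the `PATH`.

```python
import shutil
import sys

# Assumption: the google.colab module is preloaded only inside a Colab runtime
in_colab = "google.colab" in sys.modules

# shutil.which returns None when no conda executable is found on the PATH
conda_path = shutil.which("conda")

if in_colab:
    print("Running in Google Colab: run the installation cells first.")
else:
    print("Running locally: skip the Colab installation cells.")
print("conda executable:", conda_path)
```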

### Start when running in Google Colab



In [1]:
# Check whether conda is already installed
!conda --version

conda 23.3.1


In [None]:
# If `!conda --version` returns no result, install conda with:
!pip install -q condacolab
import condacolab 
condacolab.install()

⏬ Downloading https://github.com/jaimergp/miniforge/releases/latest/download/Mambaforge-colab-Linux-x86_64.sh...
📦 Installing...
📌 Adjusting configuration...
🩹 Patching environment...
⏲ Done in 0:00:27
🔁 Restarting kernel...


In [None]:
# Install the clustcr package with conda (can take a little while)
!conda install clustcr -c svalkiers -c bioconda -c pytorch -c conda-forge

### Start when running on local machine

In [None]:
# clone the github repository and its data
!git clone https://github.com/vincentvandeuren/tcr_workshop_2023.git

# change your working directory
%cd tcr_workshop_2023

Cloning into 'tcr_workshop_2023'...
remote: Enumerating objects: 50, done.
remote: Counting objects: 100% (50/50), done.
remote: Compressing objects: 100% (42/42), done.
remote: Total 50 (delta 17), reused 28 (delta 5), pack-reused 0
Unpacking objects: 100% (50/50), 17.20 MiB | 3.54 MiB/s, done.
/content/tcr_workshop_2023


In [None]:
# Import packages
import pandas as pd
from clustcr import Clustering

In [None]:
#Determine the number of threads available
#!cat /proc/cpuinfo

In [None]:
# Initiate ClusTCR clustering object
clustering = Clustering(n_cpus=2) # change n_cpus to the number of threads on your machine
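
As an alternative to reading `/proc/cpuinfo`, the number of logical CPUs can be queried from Python's standard library. The sketch below only determines the value; the variable name `n_cpus` is chosen to mirror the `Clustering` argument used above.

```python
import os

# os.cpu_count() can return None when the count cannot be determined,
# so fall back to a single CPU in that case
n_cpus = os.cpu_count() or 1
print(f"Logical CPUs detected: {n_cpus}")

# The value could then be passed on, e.g. Clustering(n_cpus=n_cpus)
```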

In [None]:
# Load the parsed data from the previous step for clustering
p1_d0 = pd.read_csv('data/P1_0_parsed.tsv', sep='\t', index_col=[0])
p1_d15 = pd.read_csv('data/P1_15_parsed.tsv', sep='\t', index_col=[0])

In [None]:
# Look at the data
p1_d0

Unnamed: 0,junction_aa,v_call,j_call,Total_count,Total_frequency
0,CASSNSDRTYGDNEQFF,TRBV6-2,TRBJ2-1,33422.0,1.654679e-02
1,CATSSVLTQQETQYF,TRBV24-1,TRBJ2-5,24502.0,1.213062e-02
2,CASSSRGLANTQYF,TRBV12-3,TRBJ2-3,22361.0,1.107064e-02
3,CSVVGADTYEQYF,TRBV29-1,TRBJ2-7,20930.0,1.036217e-02
4,CASSLGTALNTEAFF,TRBV7-8,TRBJ1-1,20193.0,9.997287e-03
...,...,...,...,...,...
171917,CASSSVVTSTDTQYF,TRBV6-5,TRBJ2-3,1.0,4.950868e-07
171918,CASSLGEDRPYGYTF,TRBV5-4,TRBJ1-2,1.0,4.950868e-07
171920,CASSPGTSGGALETQYF,TRBV6-2,TRBJ2-5,1.0,4.950868e-07
171921,CASSAGGAGYNEQFF,TRBV9,TRBJ2-1,1.0,4.950868e-07


In [None]:
# Set a timepoint variable to differentiate between both samples
p1_d0["timepoint"] = "0"
p1_d15["timepoint"] = "15"

data_merged = pd.concat([p1_d0, p1_d15])
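
Note that `pd.concat` keeps the original row labels of each table, so the index of the merged table restarts at 0 for the second sample. The toy example below (made-up counts, hypothetical variable names `day0` and `day15`) shows this behaviour and how `ignore_index=True` produces a fresh, unique index if you need one.

```python
import pandas as pd

# Two small stand-ins for the parsed repertoires (counts are made up)
day0 = pd.DataFrame(
    {"junction_aa": ["CASSQEGELFF", "CASSQSGELFF"], "Total_count": [16, 5]}
)
day15 = pd.DataFrame(
    {"junction_aa": ["CASSNPGELFF", "CASSQEGELFF"], "Total_count": [8, 3]}
)

day0["timepoint"] = "0"
day15["timepoint"] = "15"

# Default behaviour: row labels of both frames are kept as they are
merged = pd.concat([day0, day15])
print(merged.index.tolist())        # [0, 1, 0, 1]

# ignore_index=True renumbers the rows from 0 to n-1
merged_reset = pd.concat([day0, day15], ignore_index=True)
print(merged_reset.index.tolist())  # [0, 1, 2, 3]
```

Duplicate row labels are harmless for the steps in this notebook, because the later merge and grouping operate on columns rather than on the index.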

In [None]:
# Look at the merged data
data_merged

Unnamed: 0,junction_aa,v_call,j_call,Total_count,Total_frequency,timepoint
0,CASSNSDRTYGDNEQFF,TRBV6-2,TRBJ2-1,33422.0,1.654679e-02,0
1,CATSSVLTQQETQYF,TRBV24-1,TRBJ2-5,24502.0,1.213062e-02,0
2,CASSSRGLANTQYF,TRBV12-3,TRBJ2-3,22361.0,1.107064e-02,0
3,CSVVGADTYEQYF,TRBV29-1,TRBJ2-7,20930.0,1.036217e-02,0
4,CASSLGTALNTEAFF,TRBV7-8,TRBJ1-1,20193.0,9.997287e-03,0
...,...,...,...,...,...,...
146322,CVSSLALAGASDTQYF,TRBV27,TRBJ2-3,1.0,7.138712e-07,15
146323,CASSPRRTSPAYEQYF,TRBV28,TRBJ2-7,1.0,7.138712e-07,15
146324,CASRGVVPSSYNEQFF,TRBV28,TRBJ2-1,1.0,7.138712e-07,15
146325,CASSSGGPYNEQFF,TRBV19,TRBJ2-1,1.0,7.138712e-07,15


In [None]:
# Fit data to the clustering object (takes about 5 min)
clustering_result = clustering.fit(data_merged['junction_aa'])

Clustering 306784 TCRs using two-step approach.
Total time to run ClusTCR: 344.770s


In [None]:
# Calculate several physicochemical features for each cluster (takes about 1.5 min)
# The explanation for each feature can be found at: https://svalkiers.github.io/clusTCR/docs/analyzing/features.html
feature_df = clustering_result.compute_features(compute_pgen=False)

In [None]:
# Show the features per cluster
feature_df

cluster,h,size,length,basicity_avg,hydrophobicity_avg,helicity_avg,mutation stability_avg,basicity_var,hydrophobicity_var,helicity_var,mutation stability_var
0,0.149922,280,9,209.835195,0.561101,1.093860,23.753896,1.600693,0.182807,0.000627,1.833525
1,0.171802,55,10,209.955909,0.388682,1.088242,22.774242,1.213307,0.136845,0.000502,1.659895
2,0.153084,180,10,211.054630,0.055361,1.056431,22.952315,1.422374,0.201584,0.000758,1.945378
3,0.151814,10,10,209.945833,0.891000,1.095000,24.391667,0.826470,0.035402,0.000119,2.064892
4,0.128854,7,10,208.988095,0.656429,1.099286,22.583333,0.083839,0.191393,0.000227,1.148148
...,...,...,...,...,...,...,...,...,...,...,...
11602,0.076923,2,13,211.316667,-0.266333,1.062667,20.166667,0.369800,0.007280,0.000200,0.500000
11603,0.076923,2,13,212.340000,-0.273333,1.038000,20.600000,0.980000,0.003872,0.000072,0.320000
11604,0.076923,2,13,210.346667,0.397667,1.113333,22.666667,0.096800,0.000018,0.000004,0.000000
11605,0.076923,2,13,214.020000,-0.562333,1.051000,22.900000,1.668356,0.126002,0.000242,0.642222


In [None]:
# Get a summary table depicting the cluster motif, cluster size, and all TCRs within the cluster
clustering_summary = clustering_result.summary()
clustering_summary['sequences'] = clustering_result.cluster_contents()

# Display summary data
clustering_summary

Unnamed: 0,size,motif,sequences
0,280,CAS...GELFF,"[CASSQEGELFF, CASSQSGELFF, CASSNPGELFF, CASSRP..."
1,55,CATS..tGELFF,"[CATSDPGGELFF, CATSDPSGELFF, CATSDDTGELFF, CAS..."
2,180,CASS...EKLFF,"[CASSLEGEKLFF, CASSLRGEKLFF, CASSQGDEKLFF, CAS..."
3,10,CSVE[GV].TGELFF,"[CSVEGYSGELFF, CSVEGYTGELFF, CSVEVAGGELFF, CSV..."
4,7,CA[SI]S[VG]DTGELFF,"[CASSVDTGELFF, CASSGDTGELFF, CAISGDTGELFF, CAI..."
...,...,...,...
11602,2,CATSDRTS[EG]TGELFF,"[CATSDRTSETGELFF, CATSDRTSGTGELFF]"
11603,2,CASSPREM[HG]TGELFF,"[CASSPREMGTGELFF, CASSPREMHTGELFF]"
11604,2,CASS[QS]RLAGVGELFF,"[CASSQRLAGVGELFF, CASSSRLAGVGELFF]"
11605,2,CASSGRDRG[LR]EKLFF,"[CASSGRDRGLEKLFF, CASSGRDRGREKLFF]"


In [None]:
# Display sequence information
clustering_clusters = clustering_result.clusters_df
clustering_clusters

Unnamed: 0,junction_aa,cluster,basicity,hydrophobicity,helicity,mutation stability
0,CASSQEGELFF,0,209.972727,0.240909,1.058182,22.181818
1,CASSQSGELFF,0,209.245455,0.118182,1.071818,22.000000
2,CASSNPGELFF,0,209.736364,-0.163636,1.030909,22.545455
3,CASSRPGELFF,0,211.936364,-0.070909,1.031818,23.545455
4,CASSAGGELFF,0,208.090909,0.341818,1.110909,23.818182
...,...,...,...,...,...,...
114348,CASSGRDRGLEKLFF,11605,213.106667,-0.311333,1.062000,23.466667
114349,CASSGRDRGREKLFF,11605,214.933333,-0.813333,1.040000,22.333333
114350,CASSFWDSNTGELFF,11606,209.920000,0.438667,1.072000,23.133333
114351,CASSTWDSNTGELFF,11606,209.893333,0.033333,1.060667,21.733333


## Comparing clusters between repertoires
The same clustering analysis can be performed for each repertoire separately, after which you can compare clusters, cluster sizes, and features between the two timepoints.

However, this comparison can be done much more efficiently on the merged clustering result, using the code below. Here, we split each cluster per sample (P1_0 and P1_15) and compare cluster size, clonal count and clonal frequency per cluster and per sample.

In [None]:
# Add the cluster numbers and features to the original data
data_merged = pd.merge(
    left = data_merged,
    right= clustering_clusters,
    on="junction_aa",
    how="right"
)

# Show the table
data_merged

Unnamed: 0,junction_aa,v_call,j_call,Total_count,Total_frequency,timepoint,cluster,basicity,hydrophobicity,helicity,mutation stability
0,CASSQEGELFF,TRBV4-1,TRBJ2-2,16.0,0.000008,0,0,209.972727,0.240909,1.058182,22.181818
1,CASSQSGELFF,TRBV4-1,TRBJ2-2,5.0,0.000002,0,0,209.245455,0.118182,1.071818,22.000000
2,CASSNPGELFF,TRBV12-4,TRBJ2-2,8.0,0.000006,15,0,209.736364,-0.163636,1.030909,22.545455
3,CASSRPGELFF,TRBV12-3,TRBJ2-2,4.0,0.000002,0,0,211.936364,-0.070909,1.031818,23.545455
4,CASSAGGELFF,TRBV10-1,TRBJ2-2,22.0,0.000011,0,0,208.090909,0.341818,1.110909,23.818182
...,...,...,...,...,...,...,...,...,...,...,...
124376,CASSGRDRGLEKLFF,TRBV7-9,TRBJ1-4,2.0,0.000001,15,11605,213.106667,-0.311333,1.062000,23.466667
124377,CASSGRDRGREKLFF,TRBV10-2,TRBJ1-4,4.0,0.000002,0,11605,214.933333,-0.813333,1.040000,22.333333
124378,CASSFWDSNTGELFF,TRBV12-3,TRBJ2-2,12.0,0.000009,15,11606,209.920000,0.438667,1.072000,23.133333
124379,CASSTWDSNTGELFF,TRBV2,TRBJ2-2,8.0,0.000006,15,11606,209.893333,0.033333,1.060667,21.733333
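
The merged table has more rows than `clustering_clusters`. With `how="right"`, every clustered sequence is kept and is repeated once for each matching row on the left, so a sequence that occurs in more than one row of the original data (for example at both timepoints) appears several times, while sequences that were not assigned to a cluster are dropped. The toy example below illustrates this with made-up data; the variable names `repertoire`, `clusters` and `annotated` are hypothetical.

```python
import pandas as pd

# Left table: one row per observed clonotype (made-up counts)
repertoire = pd.DataFrame(
    {
        "junction_aa": ["CASSQEGELFF", "CASSQEGELFF", "CASSQSGELFF", "CASSLGTALNTEAFF"],
        "timepoint": ["0", "15", "0", "0"],
        "Total_count": [16, 3, 5, 20193],
    }
)

# Right table: one row per clustered sequence
clusters = pd.DataFrame(
    {"junction_aa": ["CASSQEGELFF", "CASSQSGELFF"], "cluster": [0, 0]}
)

# Right join: keep every clustered sequence, repeat it per matching left row
annotated = pd.merge(left=repertoire, right=clusters, on="junction_aa", how="right")
print(annotated)
print(len(clusters), "clustered sequences ->", len(annotated), "rows after merging")
```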


In [None]:
# Compare cluster size, clonotype count and frequency per cluster and per timepoint

# step 1: group the data per cluster number and timepoint
# step 2: get the number of TCRs in each cluster with '"junction_aa": len'
# step 3: sum all individual clone counts and frequencies per cluster to get a cluster total
# step 4: rename the columns, move the timepoints into the columns and fill NaN fields
cluster_sizes = (
    data_merged.groupby(["cluster", "timepoint"])
    .agg({"junction_aa": len, "Total_count": "sum", "Total_frequency": "sum"})
    .rename({"junction_aa": "size"}, axis="columns")
    .unstack()
    .fillna(0)
)
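
To see what each step of this chain does, it can help to run it on a handful of rows. The sketch below uses made-up data and the string aggregation names `"count"` and `"sum"`; `"count"` gives the same result as `len` as long as `junction_aa` has no missing values.

```python
import pandas as pd

# Four clonotypes in two clusters (made-up counts)
toy = pd.DataFrame(
    {
        "junction_aa": ["CASSQEGELFF", "CASSQSGELFF", "CASSNPGELFF", "CASSFWDSNTGELFF"],
        "cluster": [0, 0, 0, 1],
        "timepoint": ["0", "0", "15", "15"],
        "Total_count": [16.0, 5.0, 8.0, 12.0],
    }
)

per_cluster = (
    toy.groupby(["cluster", "timepoint"])                          # one group per cluster and timepoint
    .agg({"junction_aa": "count", "Total_count": "sum"})           # number of TCRs and summed counts
    .rename({"junction_aa": "size"}, axis="columns")
    .unstack()                                                     # timepoints become a second column level
    .fillna(0)                                                     # cluster absent at a timepoint -> 0
)
print(per_cluster)
```

Cluster 1 has no sequences at timepoint 0 in this toy data, so `unstack()` leaves a missing value there, which `fillna(0)` turns into a size and count of 0.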


In [None]:
# Calculate the difference in cluster size between day 0 and day 15
cluster_sizes['size', 'delta'] = cluster_sizes['size', '15'] - cluster_sizes['size', '0']

In [None]:
# Show comparison data for cluster size (and change in size), clonal count and clonal frequency per cluster
cluster_sizes

,size,size,Total_count,Total_count,Total_frequency,Total_frequency,size
cluster / timepoint,0,15,0,15,0,15,delta
0,190.0,142.0,2242.0,1336.0,0.001110,0.000954,-48.0
1,35.0,22.0,277.0,114.0,0.000137,0.000081,-13.0
2,92.0,104.0,1032.0,714.0,0.000511,0.000510,12.0
3,5.0,6.0,45.0,32.0,0.000022,0.000023,1.0
4,5.0,4.0,43.0,37.0,0.000021,0.000026,-1.0
...,...,...,...,...,...,...,...
11602,2.0,1.0,189.0,55.0,0.000094,0.000039,-1.0
11603,1.0,1.0,4.0,9.0,0.000002,0.000006,0.0
11604,1.0,1.0,3.0,4.0,0.000001,0.000003,0.0
11605,1.0,1.0,4.0,2.0,0.000002,0.000001,0.0


In [None]:
# Sort the data to find which cluster sizes changed the most between both time points
cluster_sizes["size"].sort_values(by="delta")

cluster / timepoint,0,15,delta
3527,358.0,253.0,-105.0
3521,433.0,341.0,-92.0
11311,264.0,185.0,-79.0
11304,459.0,385.0,-74.0
7613,228.0,161.0,-67.0
...,...,...,...
8893,136.0,157.0,21.0
1302,178.0,200.0,22.0
8897,48.0,70.0,22.0
10738,36.0,59.0,23.0
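
Sorting by `delta` places the strongest decreases at the top and the strongest increases at the bottom. To rank clusters by the magnitude of the change regardless of direction, one option is to sort on the absolute value. The sketch below assumes pandas 1.1 or later (for the `key` argument of `sort_values`) and reuses three rows from the tables above.

```python
import pandas as pd

# Cluster sizes at day 0 and day 15 for three clusters
sizes = pd.DataFrame(
    {"0": [358.0, 36.0, 1.0], "15": [253.0, 59.0, 1.0]},
    index=pd.Index([3527, 10738, 11603], name="cluster"),
)
sizes["delta"] = sizes["15"] - sizes["0"]

# key=abs sorts on the absolute change; ascending=False puts the largest first
largest_changes = sizes.sort_values(by="delta", key=abs, ascending=False)
print(largest_changes)
```

On the full table, the equivalent call would be `cluster_sizes["size"].sort_values(by="delta", key=abs, ascending=False)`.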
