# Part 2: Clustering with ClusTCR

ClusTCR is a Python package developed for fast and accurate clustering of large TCR repertoires. ClusTCR uses a two-step method: it first divides the data into superclusters, before clustering TCR sequences with high sequence similarity. Further information on ClusTCR and all of its possibilities can be found here: https://svalkiers.github.io/clusTCR/
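
To make the two-step idea concrete, here is a toy sketch in plain Python. It is not ClusTCR's implementation: the helpers `hamming` and `toy_two_step` are hypothetical, bucketing by sequence length is only a stand-in for the superclustering step, and linking sequences that differ by at most one amino acid is an assumed notion of "high sequence similarity".

```python
from collections import defaultdict
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def toy_two_step(sequences, max_dist=1):
    """Toy two-step clustering (illustration only, not ClusTCR's algorithm)."""
    # Step 1 (stand-in for superclustering): bucket unique sequences by length.
    buckets = defaultdict(list)
    for seq in sorted(set(sequences)):
        buckets[len(seq)].append(seq)

    clusters = []
    for bucket in buckets.values():
        # Step 2: within a bucket, link sequences that differ by at most
        # `max_dist` positions and collect the connected groups (union-find).
        parent = {seq: seq for seq in bucket}

        def find(seq):
            while parent[seq] != seq:
                parent[seq] = parent[parent[seq]]
                seq = parent[seq]
            return seq

        for a, b in combinations(bucket, 2):
            if hamming(a, b) <= max_dist:
                parent[find(a)] = find(b)

        groups = defaultdict(list)
        for seq in bucket:
            groups[find(seq)].append(seq)
        # Keep only groups with at least two members.
        clusters.extend(sorted(g) for g in groups.values() if len(g) > 1)
    return sorted(clusters)


toy_seqs = [
    "CASSLGGGYEQYF",
    "CASSLGSGYEQYF",
    "CASSLARDLNTGELFF",
    "CASSWARDLNTGELFF",
    "CSVVGADTYEQYF",
]
toy_clusters = toy_two_step(toy_seqs)
print(toy_clusters)
```

The pairwise comparison inside each bucket is quadratic, which is exactly why a cheap first partitioning step matters for large repertoires.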

If you run this notebook in Google Colab, run the first few cells to install conda and the ClusTCR package.

If you want to use these notebooks on your local machine, just skip these first three cells and directly import pandas and the clustcr package (after they have been installed locally).

### Start when running in Google Colab

In [1]:
# Check whether conda is already installed
!conda --version

conda 23.3.1


In [None]:
# If `!conda --version` returns nothing, install conda with condacolab:
!pip install -q condacolab
import condacolab 
condacolab.install()

⏬ Downloading https://github.com/jaimergp/miniforge/releases/latest/download/Mambaforge-colab-Linux-x86_64.sh...
📦 Installing...
📌 Adjusting configuration...
🩹 Patching environment...
⏲ Done in 0:00:27
🔁 Restarting kernel...


In [None]:
# Install the clustcr package with conda (can take a little while)
!conda install -y clustcr -c svalkiers -c bioconda -c pytorch -c conda-forge

In [2]:
# clone the TCR workshop github repository and its data
!git clone https://github.com/vincentvandeuren/tcr_workshop_2023.git

# change your working directory
%cd tcr_workshop_2023

# Check if the tcr_workshop_2023 folder is available in the file menu, you might have to press refresh

/home/vincent/Documents/projects/tcr_workshop_2023


In [3]:
# Import packages
import pandas as pd
from clustcr import Clustering

In [21]:
#Determine the number of threads available
#!cat /proc/cpuinfo
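
A portable alternative to reading `/proc/cpuinfo` (which only exists on Linux) is Python's standard library. The sketch below uses `os.cpu_count()`; the variable name `n_threads` is just an illustrative choice.

```python
import os

# os.cpu_count() returns the number of logical CPUs,
# or None when it cannot be determined, so fall back to 1.
n_threads = os.cpu_count() or 1
print(f"Logical CPUs available: {n_threads}")
```

The resulting value can be passed as `n_cpus` when creating the `Clustering` object.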

In [6]:
# Initiate ClusTCR clustering object
clustering = Clustering(n_cpus=2) # change n_cpus to the number of threads on your machine

In [7]:
# Load the parsed data from the previous step for clustering
p1_d0 = pd.read_csv('data/P1_0_parsed.tsv', sep='\t', index_col=[0])
p1_d15 = pd.read_csv('data/P1_15_parsed.tsv', sep='\t', index_col=[0])

In [8]:
# Look at the data
p1_d0

Unnamed: 0,junction_aa,v_call,j_call,Total_count,Total_frequency,productive
0,CASSNSDRTYGDNEQFF,TRBV6-2,TRBJ2-1,33422.0,2.171360e-02,True
1,CATSSVLTQQETQYF,TRBV24-1,TRBJ2-5,24502.0,1.591845e-02,True
2,CASSSRGLANTQYF,TRBV12-3,TRBJ2-3,22361.0,1.452749e-02,True
3,CSVVGADTYEQYF,TRBV29-1,TRBJ2-7,20930.0,1.359780e-02,True
4,CASSLGTALNTEAFF,TRBV7-8,TRBJ1-1,20193.0,1.311898e-02,True
...,...,...,...,...,...,...
99711,CASSPRGDPSTDTQYF,TRBV28,TRBJ2-3,1.0,6.496797e-07,True
99712,CASSLSGTSYEQFF,TRBV27,TRBJ2-1,1.0,6.496797e-07,True
99713,CSATGFSYTEQFF,TRBV20-1,TRBJ2-1,1.0,6.496797e-07,True
99714,CASSVGGGQALWGETQYF,TRBV19,TRBJ2-5,1.0,6.496797e-07,True


In [9]:
# Set a timepoint variable to differentiate between both samples
p1_d0["timepoint"] = "0"
p1_d15["timepoint"] = "15"

data_merged = pd.concat([p1_d0, p1_d15])
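
Note that `pd.concat` keeps the original row labels of each table, so the merged table contains repeated index values. The toy example below (made-up two-row and one-row tables, not the workshop data) shows the difference with `ignore_index=True`, in case a unique index is preferred.

```python
import pandas as pd

# Two tiny stand-ins for the day 0 and day 15 repertoires
day0 = pd.DataFrame({"junction_aa": ["CASSLGGGYEQYF", "CSVVGADTYEQYF"]})
day15 = pd.DataFrame({"junction_aa": ["CASSLGGGYEQYF"]})
day0["timepoint"] = "0"
day15["timepoint"] = "15"

# Default behaviour: each table keeps its own index, so labels repeat
stacked = pd.concat([day0, day15])
print(list(stacked.index))

# With ignore_index=True the rows are renumbered from 0
reindexed = pd.concat([day0, day15], ignore_index=True)
print(list(reindexed.index))
```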

In [10]:
# Look at the merged format
data_merged

Unnamed: 0,junction_aa,v_call,j_call,Total_count,Total_frequency,productive,timepoint
0,CASSNSDRTYGDNEQFF,TRBV6-2,TRBJ2-1,33422.0,2.171360e-02,True,0
1,CATSSVLTQQETQYF,TRBV24-1,TRBJ2-5,24502.0,1.591845e-02,True,0
2,CASSSRGLANTQYF,TRBV12-3,TRBJ2-3,22361.0,1.452749e-02,True,0
3,CSVVGADTYEQYF,TRBV29-1,TRBJ2-7,20930.0,1.359780e-02,True,0
4,CASSLGTALNTEAFF,TRBV7-8,TRBJ1-1,20193.0,1.311898e-02,True,0
...,...,...,...,...,...,...,...
84864,CASSLGSGRSYNEQFF,TRBV10-2,TRBJ2-1,1.0,9.156553e-07,True,15
84865,CASSIDRLVQGLNQPQHF,TRBV19,TRBJ1-5,1.0,9.156553e-07,True,15
84866,CVTCRYPNTEAFF,TRBV6-1,TRBJ1-1,1.0,9.156553e-07,True,15
84867,CASNVVGRLQYF,TRBV28,TRBJ2-7,1.0,9.156553e-07,True,15


In [11]:
# Fit data to the clustering object (+- 5 min)
clustering_result = clustering.fit(data_merged['junction_aa'])

Clustering 179928 TCRs using two-step approach.
Total time to run ClusTCR: 30.694s


In [22]:
# Calculate several physicochemical features for each cluster (+- 1.5 min)
# The explanation for each feature can be found at: https://svalkiers.github.io/clusTCR/docs/analyzing/features.html
feature_df = clustering_result.compute_features(compute_pgen=False)

  avg = a.mean(axis, **keepdims_kw)
  ret = ret.dtype.type(ret / rcount)

In [13]:
# Show the features per cluster
feature_df

Unnamed: 0_level_0,h,size,length,basicity_avg,hydrophobicity_avg,helicity_avg,mutation stability_avg,basicity_var,hydrophobicity_var,helicity_var,mutation stability_var
cluster,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
0,0.136702,269,11,210.532800,-0.313017,1.043800,21.364026,1.315800,0.181254,0.001014,1.592704
1,0.144592,56,11,210.442170,-0.039602,1.065563,21.929945,1.388506,0.122167,0.000383,1.434379
2,0.164355,182,11,209.932587,-0.406665,1.047172,20.980981,1.101254,0.243649,0.001209,2.126381
3,0.129440,391,11,210.599882,0.027012,1.059154,20.574857,1.070285,0.149060,0.000521,1.598687
4,0.131179,253,11,210.868167,-0.137066,1.055096,20.608696,1.545378,0.130226,0.000540,1.438482
...,...,...,...,...,...,...,...,...,...,...,...
7138,0.071429,2,14,211.178125,0.407500,1.087187,22.406250,0.082520,0.000028,0.000086,0.861328
7139,0.066667,2,15,211.026471,-0.948235,1.021765,20.500000,0.002924,0.082848,0.000504,0.209343
7140,0.071429,2,14,208.153125,-0.473125,1.087813,22.281250,0.046895,0.000413,0.000044,0.564453
7141,0.066667,2,15,209.085294,-0.354412,1.087059,23.235294,1.107422,0.004941,0.000034,0.062284


In [14]:
# Get a summary table depicting the cluster motif, cluster size, and all TCRs within the cluster
clustering_summary = clustering_result.summary()
clustering_summary['sequences'] = clustering_result.cluster_contents()

# Display summary data
clustering_summary

Unnamed: 0,size,motif,sequences
0,269,CASS.G..YEQYF,"[CASSLGGGYEQYF, CASSLGSKYEQYF, CASSPPSSYEQYF, ..."
1,56,CASSL.G.yEQYF,"[CASSLGGRYEQYF, CASSLAGFYEQYF, CASSLAGRYEQYF, ..."
2,182,CASS[LP].gg[ND]EQFF,"[CASSLEGGNEQFF, CASSLAGGDEQYF, CASSLWGGDEQYF, ..."
3,391,CASSL...NEQFF,"[CASSLFGGNEQFF, CASSLGGANEQFF, CASSLGDYNEQFF, ..."
4,253,CASSL...ETQYF,"[CASSLGTGETQYF, CASSLGSQETQYF, CASSLQHQETQYF, ..."
...,...,...,...
7138,2,CASS[LW]ARDLNTGELFF,"[CASSLARDLNTGELFF, CASSWARDLNTGELFF]"
7139,2,CASS[PY]PSSGRNTGELFF,"[CASSPPSSGRNTGELFF, CASSYPSSGRNTGELFF]"
7140,2,CASSYST[SG]GGSGELFF,"[CASSYSTGGGSGELFF, CASSYSTSGGSGELFF]"
7141,2,CASSPGLAGG[RT]TGELFF,"[CASSPGLAGGRTGELFF, CASSPGLAGGTTGELFF]"


In [15]:
# Display sequence information
clustering_clusters = clustering_result.clusters_df
clustering_clusters

# Save the results to a file for use in part 4 of this tutorial
clustering_clusters.to_csv("data/clustcr_results/clustcr_results.tsv", sep="\t")

## Comparing clusters between repertoires
This same clustering analysis can be performed for both repertoires separately.
Then you can compare clusters, cluster sizes, features, ... between both timepoints.

However, comparing clusters and cluster features can be done much more efficiently using the code provided below. Here, we divide the clusters per sample (P1_0 and P1_15) and compare cluster size, clonal count and clonal frequency per cluster and per sample.

In [16]:
# Add the cluster numbers and features to the original data
data_merged = pd.merge(
    left = data_merged,
    right= clustering_clusters,
    on="junction_aa",
    how="right"
)

# Show the table
data_merged

Unnamed: 0,junction_aa,v_call,j_call,Total_count,Total_frequency,productive,timepoint,cluster,basicity,hydrophobicity,helicity,mutation stability
0,CASSLGGGYEQYF,TRBV7-9,TRBJ2-7,70.0,4.547758e-05,True,0,0,208.738462,-0.266923,1.080769,24.307692
1,CASSLGGGYEQYF,TRBV7-9,TRBJ2-7,21.0,1.922876e-05,True,15,0,208.738462,-0.266923,1.080769,24.307692
2,CASSLGSKYEQYF,TRBV5-1,TRBJ2-7,8.0,5.197438e-06,True,0,0,210.584615,-0.361538,1.048462,22.769231
3,CASSPPSSYEQYF,TRBV7-6,TRBJ2-7,11.0,7.146477e-06,True,0,0,210.761538,-1.064615,0.958462,20.076923
4,CASSPPSSYEQYF,TRBV3-1,TRBJ2-7,2.0,1.831311e-06,True,15,0,210.761538,-1.064615,0.958462,20.076923
...,...,...,...,...,...,...,...,...,...,...,...,...
65147,CASSYSTSGGSGELFF,TRBV6-5,TRBJ2-2,1.0,9.156553e-07,True,15,7140,208.306250,-0.458750,1.083125,21.750000
65148,CASSPGLAGGRTGELFF,TRBV7-3,TRBJ2-2,43.0,3.937318e-05,True,15,7141,209.829412,-0.404118,1.082941,23.411765
65149,CASSPGLAGGTTGELFF,TRBV5-1,TRBJ2-2,2.0,1.299359e-06,True,0,7141,208.341176,-0.304706,1.091176,23.058824
65150,CASSTRLAGGLTGELFF,TRBV9,TRBJ2-2,3.0,2.746966e-06,True,15,7142,210.076471,0.296471,1.121176,23.294118
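
Because the merge uses `how="right"`, only sequences present in the cluster table are kept, and a sequence that occurs at both timepoints yields one row per timepoint. The toy example below illustrates this with made-up tables (`toy_repertoire` and `toy_clusters` are hypothetical names).

```python
import pandas as pd

# Repertoire rows: one sequence seen at both timepoints, one unclustered sequence
toy_repertoire = pd.DataFrame({
    "junction_aa": ["CASSLGGGYEQYF", "CASSLGGGYEQYF", "CSVVGADTYEQYF"],
    "timepoint": ["0", "15", "0"],
})

# Cluster table: only the clustered sequence appears here
toy_clusters = pd.DataFrame({
    "junction_aa": ["CASSLGGGYEQYF"],
    "cluster": [0],
})

# Right join: keep every row of the cluster table, drop unclustered sequences
toy_merged = pd.merge(
    left=toy_repertoire,
    right=toy_clusters,
    on="junction_aa",
    how="right",
)
print(toy_merged)
```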


In [17]:
# Compare cluster size, clonotype count and frequency between different clusters

# step 1: group the data per cluster number and timepoint
# step 2: get the number of TCRs in each cluster with '"junction_aa":len'
# step 3: sum all individual clone counts and frequencies per cluster to get a cluster total
# step 4: rename the columns and fill NaN fields
cluster_sizes = (
    data_merged
    .groupby(["cluster", "timepoint"])
    .agg({"junction_aa":len, "Total_count":"sum", "Total_frequency":"sum"})
    .rename({"junction_aa":"size"}, axis="columns")
    .unstack()
    .fillna(0)
    )
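
To see what each step of this chain does, it can help to run it on a tiny table first. The sketch below uses a made-up four-row DataFrame (`toy`) with the same column names. Cluster 1 has no sequences at day 15, so `unstack()` produces NaN there, which `fillna(0)` then replaces.

```python
import pandas as pd

toy = pd.DataFrame({
    "cluster": [0, 0, 0, 1],
    "timepoint": ["0", "0", "15", "0"],
    "junction_aa": ["CASSA", "CASSB", "CASSA", "CASSC"],
    "Total_count": [5.0, 1.0, 2.0, 3.0],
})

toy_sizes = (
    toy
    .groupby(["cluster", "timepoint"])                   # one group per cluster and timepoint
    .agg({"junction_aa": len, "Total_count": "sum"})     # number of TCRs and summed counts
    .rename({"junction_aa": "size"}, axis="columns")
    .unstack()                                           # timepoints become a column level
    .fillna(0)                                           # absent cluster/timepoint pairs -> 0
)
print(toy_sizes)
```

The result has two column levels, so a single column is addressed with a tuple such as `toy_sizes["size", "15"]`.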


In [18]:
# Calculate the difference in cluster size between day 0 and day 15
cluster_sizes['size', 'delta'] = cluster_sizes['size', '15'] - cluster_sizes['size', '0']

In [19]:
# Show comparison data for cluster size (and change in size), clonal count and clonal frequency per cluster
cluster_sizes

Unnamed: 0_level_0,size,size,Total_count,Total_count,Total_frequency,Total_frequency,size
timepoint,0,15,0,15,0,15,delta
cluster,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
0,164.0,147.0,1842.0,2571.0,0.001197,2.354150e-03,-17.0
1,33.0,29.0,272.0,316.0,0.000177,2.893471e-04,-4.0
2,107.0,91.0,1039.0,804.0,0.000675,7.361869e-04,-16.0
3,242.0,215.0,3018.0,3047.0,0.001961,2.790002e-03,-27.0
4,164.0,130.0,1571.0,1017.0,0.001021,9.312215e-04,-34.0
...,...,...,...,...,...,...,...
7138,2.0,0.0,22.0,0.0,0.000014,0.000000e+00,-2.0
7139,1.0,1.0,3.0,169.0,0.000002,1.547457e-04,0.0
7140,1.0,1.0,3.0,1.0,0.000002,9.156553e-07,0.0
7141,1.0,1.0,2.0,43.0,0.000001,3.937318e-05,0.0


In [20]:
# Sort the data to find which cluster sizes changed the most between both time points
cluster_sizes["size"].sort_values(by="delta")

timepoint,0,15,delta
cluster,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
5606,348.0,260.0,-88.0
4461,377.0,290.0,-87.0
5871,199.0,127.0,-72.0
6817,219.0,149.0,-70.0
3633,204.0,140.0,-64.0
...,...,...,...
6823,15.0,28.0,13.0
1584,16.0,32.0,16.0
226,24.0,40.0,16.0
4600,70.0,95.0,25.0
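
Sorting by `delta` puts the strongest decreases at the top and the strongest increases at the bottom. If you prefer a single ranking by magnitude of change, one option is to sort on the absolute value. The sketch below rebuilds three rows from the table above by hand (the name `toy_size` is illustrative) and uses the `key` argument of `sort_values`.

```python
import pandas as pd

# Three rows copied from the output above
toy_size = pd.DataFrame(
    {"0": [348.0, 15.0, 70.0], "15": [260.0, 28.0, 95.0]},
    index=pd.Index([5606, 6823, 4600], name="cluster"),
)
toy_size["delta"] = toy_size["15"] - toy_size["0"]

# Rank clusters by the absolute change in size, largest change first
ranked = toy_size.sort_values(by="delta", key=lambda s: s.abs(), ascending=False)
print(ranked)
```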
