# **AIRR-C Meeting Software Demo: ClusTCR**

![](./workflow.png "ClusTCR workflow")

## 1. Installation
ClusTCR is a Python package that is available through the conda repository, so Anaconda must be installed before ClusTCR can be used. After installing Anaconda, ClusTCR can be installed by running the following command in the command prompt:
\
\
`$ conda install clustcr -c svalkiers -c bioconda -c pytorch -c conda-forge`

## 2. Importing data
For the purpose of this software demonstration, we will use an example repertoire consisting of **100,000 unique TCR clonotypes** (V + J + CDR3). In addition to `clustcr`, we will use the `pandas` library to handle the data.

In [25]:
import pandas as pd # We will use the pandas library for data handling

In [26]:
data = pd.read_csv("./data/repertoire.tsv", sep="\t")
data

| | vgene | cdr3 | jgene |
|---|---|---|---|
| 0 | TRBV7-9*00 | CASSSRGPETQYF | TRBJ2-5*00 |
| 1 | TRBV5-6*00 | CASSHDLQSSYEQYF | TRBJ2-7*00 |
| 2 | TRBV6-3*00 | CASSYSEVGELFF | TRBJ2-2*00 |
| 3 | TRBV7-2*00 | CASSPPNLYEQYF | TRBJ2-7*00 |
| 4 | TRBV7-2*00 | CASSLGLAGVRNF | TRBJ2-3*00 |
| ... | ... | ... | ... |
| 99995 | TRBV13*00 | CASRLAGKSYEQYF | TRBJ2-7*00 |
| 99996 | TRBV6-6*00 | CASRLASSYNEQFF | TRBJ2-1*00 |
| 99997 | TRBV6-2*00 | CASRLGGSHYEQYF | TRBJ2-7*00 |
| 99998 | TRBV3-1*00 | CASRLTQVTGELFF | TRBJ2-2*00 |
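
Because uniqueness is defined here on the full clonotype (V gene, CDR3 and J gene together), the same CDR3 string can in principle occur in more than one row. The minimal sketch below, which uses only the Python standard library, illustrates the distinction on three rows copied from the table above plus one hypothetical row that was added purely for illustration.

```python
# Three rows copied from the repertoire preview shown in this demo.
rows = [
    ("TRBV7-9*00", "CASSSRGPETQYF", "TRBJ2-5*00"),
    ("TRBV5-6*00", "CASSHDLQSSYEQYF", "TRBJ2-7*00"),
    ("TRBV6-3*00", "CASSYSEVGELFF", "TRBJ2-2*00"),
]
# Hypothetical row (not in the data): same CDR3 as the first row, other V gene.
rows.append(("TRBV7-2*00", "CASSSRGPETQYF", "TRBJ2-5*00"))

# A clonotype is the full (V gene, CDR3, J gene) triple.
unique_clonotypes = set(rows)
# The CDR3 alone can be shared between different clonotypes.
unique_cdr3 = {cdr3 for _vgene, cdr3, _jgene in rows}

print(f"{len(unique_clonotypes)} unique clonotypes")
print(f"{len(unique_cdr3)} unique CDR3 sequences")
```

On the DataFrame loaded above, `data.cdr3.nunique()` reports the number of distinct CDR3 strings in the same way.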


## 3. Applying the clustcr clustering library

In [27]:
from clustcr import Clustering # We will use the clustcr library for TCR clustering

ClusTCR is a clustering library specifically developed for fast and efficient grouping of TCR sequences based on their sequence similarity. ClusTCR was benchmarked for clustering accuracy and computational performance against other published TCR clustering methods, including TCRDist, GLIPH2, iSMART and the recent GIANA. In this benchmark, ClusTCR is considerably faster while its accuracy remains comparable to that of the other tools.
<center><img src="./fig/benchmark.png"/></center>

### 3.1. CDR3-based clustering (vanilla ClusTCR)
We start with an easy example, in which we use the **default parameters** of ClusTCR to cluster the **CDR3 sequences** in the input data. ClusTCR follows the `scikit-learn` conventions, so we start by creating a `Clustering` object, to which we then fit the data.

In [28]:
clustering_configuration_1 = Clustering()
result_1 = clustering_configuration_1.fit(data.cdr3)

Clustering 100000 TCRs using two-step approach.
Total time to run ClusTCR: 30.500s


Next, we can access the clustering results via the `.clusters_df` attribute.

In [29]:
clusters_1 = result_1.clusters_df
clusters_1

| | CDR3 | cluster |
|---|---|---|
| 0 | CASSLDHSGANVLTF | 0 |
| 1 | CASSVDHSGANVLTF | 0 |
| 2 | CASSATGSGANVLTF | 0 |
| 3 | CASSQTGSGANVLTF | 0 |
| 4 | CASSSGRPGANVLTF | 0 |
| ... | ... | ... |
| 29248 | CASSSTGVGTDTQYF | 4488 |
| 29249 | CASSRGGPGGYEQYF | 4489 |
| 29250 | CASSRGGPMGYEQYF | 4489 |
| 29251 | CASSSAGTAQETQYF | 4490 |
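
A quick way to inspect such a result is to count how many sequences each cluster contains. The sketch below does this with the standard library on the nine rows visible in the output above, so the counts cover only those rows and not the complete clusters. On the full result, `clusters_1.cluster.value_counts()` should give the equivalent summary with pandas.

```python
from collections import Counter, defaultdict

# (CDR3, cluster) pairs copied from the visible rows of the output above.
visible_rows = [
    ("CASSLDHSGANVLTF", 0),
    ("CASSVDHSGANVLTF", 0),
    ("CASSATGSGANVLTF", 0),
    ("CASSQTGSGANVLTF", 0),
    ("CASSSGRPGANVLTF", 0),
    ("CASSSTGVGTDTQYF", 4488),
    ("CASSRGGPGGYEQYF", 4489),
    ("CASSRGGPMGYEQYF", 4489),
    ("CASSSAGTAQETQYF", 4490),
]

# Number of visible sequences per cluster label.
cluster_sizes = Counter(cluster for _cdr3, cluster in visible_rows)

# Members of each cluster, keyed by cluster label.
members = defaultdict(list)
for cdr3, cluster in visible_rows:
    members[cluster].append(cdr3)

for cluster, size in sorted(cluster_sizes.items()):
    print(cluster, size, members[cluster])
```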


### 3.2. What's happening under the hood?
By default, ClusTCR uses a **two-step** clustering approach that combines a fast **preclustering** step with a more careful, slower **subclustering** step to identify TCR clusters. Both steps are implemented individually within ClusTCR, so each of them can also be run on its own.
#### 3.2.1. The first clustering step: computing superclusters
In the first clustering step, ClusTCR employs the rapid K-means implementation of the `faiss` library to find large clusters of *roughly* similar sequences.
![](./fig/step_1.png)

In [30]:
faiss_clustering = Clustering(method="faiss")
faiss_results = faiss_clustering.fit(data.cdr3)
superclusters = faiss_results.clusters_df

Clustering using faiss approach.
Total time to run ClusTCR: 2.127s
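
To give an intuition for this preclustering step, the sketch below runs a plain K-means (Lloyd's algorithm) in pure Python on a handful of CDR3 sequences from this demo. It is neither the faiss implementation nor ClusTCR's actual sequence encoding: the `encode` function (sequence length plus a count of hydrophobic residues) and all helper names are hypothetical choices, made only to keep the example short.

```python
def encode(seq):
    """Toy, hypothetical encoding: (length, number of hydrophobic residues)."""
    hydrophobic = set("AVILMFWY")
    return (float(len(seq)), float(sum(aa in hydrophobic for aa in seq)))


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centroids):
    """Index of the centroid closest to `point` (ties go to the lowest index)."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def kmeans(points, k, n_iter=20):
    """Plain Lloyd's algorithm with a deterministic start (first k points)."""
    centroids = list(points[:k])
    for _ in range(n_iter):
        labels = [nearest(p, centroids) for p in points]
        new_centroids = []
        for i in range(k):
            assigned = [p for p, lab in zip(points, labels) if lab == i]
            if assigned:
                # New centroid = coordinate-wise mean of the assigned points.
                new_centroids.append(
                    tuple(sum(c) / len(assigned) for c in zip(*assigned)))
            else:
                # Keep the old centroid if no point was assigned to it.
                new_centroids.append(centroids[i])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    labels = [nearest(p, centroids) for p in points]
    return labels, centroids


sequences = [
    "CASSSRGPETQYF", "CASSYSEVGELFF", "CASSPPNLYEQYF",
    "CASSHDLQSSYEQYF", "CASSLDHSGANVLTF", "CASSVDHSGANVLTF",
]
points = [encode(s) for s in sequences]
labels, centroids = kmeans(points, k=2)
for seq, label in zip(sequences, labels):
    print(label, seq)
```

Such a coarse grouping only needs to keep similar sequences together. The fine-grained decision about which sequences really belong to the same cluster is left to the second step.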


#### 3.2.2. The second clustering step: subclustering the superclusters
During the second clustering step, each individual supercluster is subclustered in detail. To do this, ClusTCR uses a network clustering approach (by default the Markov Clustering algorithm, MCL). This requires the construction of a similarity network on which the clustering is performed. ClusTCR uses a **Hamming distance of 1** as the cutoff to determine similarity between any two CDR3 sequences.
![](./fig/step_2.png)

In this example we manually apply MCL on each supercluster individually. The default **two-step** method of ClusTCR automates this process and performs this subclustering in parallel across multiple CPUs.

In [31]:
for cluster in superclusters.cluster.unique():
    subcluster = superclusters[superclusters.cluster == cluster].CDR3
    mcl_clustering = Clustering(method="mcl")
    mcl_results = mcl_clustering.fit(subcluster)

Clustering using MCL approach.
Total time to run ClusTCR: 4.866s
Clustering using MCL approach.
Total time to run ClusTCR: 0.742s
Clustering using MCL approach.
Total time to run ClusTCR: 1.575s
Clustering using MCL approach.
Total time to run ClusTCR: 3.656s
Clustering using MCL approach.
Total time to run ClusTCR: 3.095s
Clustering using MCL approach.
Total time to run ClusTCR: 2.254s
Clustering using MCL approach.
Total time to run ClusTCR: 0.461s
Clustering using MCL approach.
Total time to run ClusTCR: 0.189s
Clustering using MCL approach.
Total time to run ClusTCR: 1.874s
Clustering using MCL approach.
Total time to run ClusTCR: 0.193s
Clustering using MCL approach.
Total time to run ClusTCR: 0.246s
Clustering using MCL approach.
Total time to run ClusTCR: 1.178s
Clustering using MCL approach.
Total time to run ClusTCR: 0.857s
Clustering using MCL approach.
Total time to run ClusTCR: 0.055s
Clustering using MCL approach.
Total time to run ClusTCR: 2.147s
Clustering using MCL approach.
...
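
The similarity network used in this second step can be sketched in a few lines. The example below connects every pair of CDR3 sequences at a Hamming distance of exactly 1, using sequences copied from the clustering output shown earlier in this demo. Two things are assumptions made for illustration: the masked-position lookup is one possible way to find such pairs without comparing all sequences against each other, and plain connected components are used as a simple stand-in for MCL, which ClusTCR applies by default. The helper names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def hamming1_edges(sequences):
    """Return all pairs of distinct sequences at Hamming distance exactly 1."""
    buckets = defaultdict(set)
    for seq in set(sequences):
        for i in range(len(seq)):
            # Key: the masked position plus the sequence without that position.
            buckets[(i, seq[:i] + seq[i + 1:])].add(seq)
    edges = set()
    for members in buckets.values():
        # Sequences sharing a key differ only at the masked position.
        edges.update(combinations(sorted(members), 2))
    return sorted(edges)


def connected_components(nodes, edges):
    """Group nodes linked by edges (stand-in for MCL); singletons are dropped."""
    parent = {node: node for node in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)
    groups = defaultdict(list)
    for node in nodes:
        groups[find(node)].append(node)
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


cdr3s = [
    "CASSLDHSGANVLTF", "CASSVDHSGANVLTF", "CASSATGSGANVLTF",
    "CASSQTGSGANVLTF", "CASSSGRPGANVLTF", "CASSSTGVGTDTQYF",
    "CASSRGGPGGYEQYF", "CASSRGGPMGYEQYF",
]
edges = hamming1_edges(cdr3s)
clusters = connected_components(cdr3s, edges)
for cluster in clusters:
    print(cluster)
```

Because this toy input contains only a handful of sequences, the groups differ from those of the full run, where additional sequences can link these small groups together.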

### 3.3. Beyond the default settings
We can speed up the computation by tweaking a few of the parameters, such as `n_cpus` (the number of CPUs)...

In [32]:
clustering_configuration_2 = Clustering(n_cpus = 8)
result_2 = clustering_configuration_2.fit(data.cdr3)

Clustering 100000 TCRs using two-step approach.
Total time to run ClusTCR: 8.893s


... or `faiss_cluster_size`, the size of the superclusters formed by faiss during the first step of the algorithm. 

In [33]:
clustering_configuration_3 = Clustering(n_cpus = 8, faiss_cluster_size = 3000)
result_3 = clustering_configuration_3.fit(data.cdr3)

Clustering 100000 TCRs using two-step approach.
Total time to run ClusTCR: 7.685s


### 3.4. Including V gene information

In [34]:
vgene_clustering = Clustering(n_cpus = 8)
result = vgene_clustering.fit(data, # <- input data
                              include_vgene = True, # <- this parameter specifies that we want to include V gene information into the clustering
                              cdr3_col = "cdr3", # <- name of the column containing the CDR3 information
                              v_gene_col = "vgene") # <- name of the column containing the V gene information

Total time to run ClusTCR: 14.134s


### 3.5. Batch clustering
If the data set is too big to fit into RAM, ClusTCR can cluster the data in batches. With this approach, ClusTCR uses a (meaningful) subset of sequences to compute the cluster centroids for the preclustering step. Sequences are then assigned to these centroids in batches, which resolves potential memory issues.

In [35]:
import os # We will need the os library to scan the directory containing the different files

We'll start by identifying the names of the files we want to batch cluster. We will do this using the `os.listdir()` function.

In [36]:
files = os.listdir("./data/bigrep/")

Next, we use a representative sample of our large repertoire to initialize the clustering centres (centroids). In this dummy example, the size of our large repertoire is 100,000 sequences, so we will use a sample of approximately 20,000 to fit the centroids.

In [37]:
sample = pd.read_csv("./data/representative_sample.tsv", sep="\t")

clustering = Clustering(
    faiss_training_data=sample.cdr3,
    fitting_data_size=100000,
    max_sequence_size=25
    )

Now we precluster the data. During this step, the algorithm will assign the sequences within each batch to their closest centroid.

In [38]:
for file in files:
    filepath = os.path.join("./data/bigrep/", file)
    batch = pd.read_csv(filepath, sep="\t")
    clustering.batch_precluster(batch.cdr3)

Next, we perform the second clustering step, again in batches.

In [39]:
results = [clusters for clusters in clustering.batch_cluster()]

In the process, a temporary directory is created that includes a file for each precluster. To remove this at the end, the cleanup function can be called.

In [40]:
clustering.batch_cleanup()
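
The batch workflow above (fit centroids once, assign each batch to its closest centroid, keep one temporary file per precluster, clean up at the end) can be illustrated with a small standard-library sketch. This is not ClusTCR's implementation: the toy `encode` function, the fixed centroids, the file naming and the helper names `assign_batch` and `read_preclusters` are all hypothetical and serve only to show the idea.

```python
import os
import shutil
import tempfile


def encode(seq):
    """Toy, hypothetical encoding: (length, number of hydrophobic residues)."""
    hydrophobic = set("AVILMFWY")
    return (float(len(seq)), float(sum(aa in hydrophobic for aa in seq)))


def nearest_centroid(point, centroids):
    return min(range(len(centroids)),
               key=lambda i: sum((a - b) ** 2
                                 for a, b in zip(point, centroids[i])))


def assign_batch(batch, centroids, workdir):
    """Append every sequence of one batch to the file of its closest centroid."""
    for seq in batch:
        idx = nearest_centroid(encode(seq), centroids)
        path = os.path.join(workdir, f"precluster_{idx}.txt")
        with open(path, "a") as handle:
            handle.write(seq + "\n")


def read_preclusters(workdir):
    """Yield (file name, sequences) one precluster at a time."""
    for name in sorted(os.listdir(workdir)):
        with open(os.path.join(workdir, name)) as handle:
            yield name, [line.strip() for line in handle if line.strip()]


# Hypothetical centroids, as if fitted beforehand on a representative sample.
centroids = [(13.0, 4.0), (15.0, 6.0)]
# Two small batches standing in for the files of a large repertoire.
batches = [
    ["CASSSRGPETQYF", "CASSHDLQSSYEQYF"],
    ["CASSPPNLYEQYF", "CASSLDHSGANVLTF", "CASSVDHSGANVLTF"],
]

workdir = tempfile.mkdtemp(prefix="preclusters_")
try:
    for batch in batches:
        assign_batch(batch, centroids, workdir)
    preclusters = dict(read_preclusters(workdir))
finally:
    shutil.rmtree(workdir)  # remove the temporary precluster files

for name, seqs in preclusters.items():
    print(name, seqs)
```

Only one batch is held in memory at a time, which is the reason this pattern helps when the complete data set does not fit into RAM.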

## 4. References
The complete source code and data of this demo are available at [https://github.com/svalkiers/clustcr_demo](https://github.com/svalkiers/clustcr_demo).
\
\
For more information about the use of ClusTCR, you can visit the [documentation](https://svalkiers.github.io/clusTCR/), which describes the functionality of the package in detail. In addition, bugs and errors can be reported at the [issue section of the ClusTCR GitHub repository](https://github.com/svalkiers/clusTCR/issues).
\
\
To learn more about the underlying methodology, please read the [ClusTCR article in Bioinformatics](https://doi.org/10.1093/bioinformatics/btab446):
\
<sub>
Valkiers, S., Van Houcke, M., Laukens, K., & Meysman, P. (2021). **ClusTCR: a Python interface for rapid clustering of large sets of CDR3 sequences with unknown antigen specificity.** *Bioinformatics*, 37(24), 4865-4867.
</sub>
\
\
Finally, if you have any suggestions on features that could improve the ease-of-use of ClusTCR, feel free to report them to sebastiaan.valkiers@uantwerpen.be. 