# **PART 1:** Clustering TCR repertoires using ClusTCR

In [1]:
import os
# Set the working directory to the repository directory
os.chdir("/home/sebastiaan/PhD/Repositories/book_chapter/")

For this part of the tutorial, we will need the `pandas` library, which allows easy handling of our data. In addition, we will be using the `clustcr` package to cluster the data.

In [2]:
# pandas for data handling
import pandas as pd
# clustcr for data clustering
from clustcr import Clustering


In [3]:
# Import the data
data = pd.read_csv("./data/examples/P1_0.tsv", sep = "\t")
data.head()

Unnamed: 0,v_call,j_call,junction_aa,duplicate_count,frequency
0,TRBV6-2,TRBJ2-1,CASSNSDRTYGDNEQFF,33422.0,0.012504
1,TRBV24-1,TRBJ2-5,CATSSVLTQQETQYF,24502.0,0.009166
2,TRBV12-3,TRBJ2-3,CASSSRGLANTQYF,22361.0,0.008366
3,TRBV29-1,TRBJ2-7,CSVVGADTYEQYF,20930.0,0.00783
4,TRBV7-8,TRBJ1-1,CASSLGTALNTEAFF,20193.0,0.007554
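
Before clustering, it can be useful to check that the `junction_aa` column only contains well-formed amino acid sequences. The snippet below is a minimal sketch of such a check. The helper `is_valid_cdr3` is hypothetical (it is not part of ClusTCR or pandas), and it assumes that only the 20 standard amino acids should be accepted.

```python
# Hypothetical helper for illustration; not part of ClusTCR or pandas.
# Assumption: only the 20 standard amino acids are considered valid.
VALID_AA = set("ACDEFGHIKLMNPQRSTVWY")

def is_valid_cdr3(seq):
    """Return True if seq is a non-empty string of standard amino acids."""
    return isinstance(seq, str) and len(seq) > 0 and set(seq) <= VALID_AA

# A valid sequence, one with a stop codon symbol, an empty string, a missing value
examples = ["CASSNSDRTYGDNEQFF", "CASS*RGLANTQYF", "", None]
flags = [is_valid_cdr3(s) for s in examples]
print(flags)
```

On the data frame loaded above, the same function could be applied with `data["junction_aa"].map(is_valid_cdr3)` to obtain a boolean mask for filtering.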


The ClusTCR syntax is similar to that of `scikit-learn`. We start by configuring the clustering model. Here we use the default parameters, except for the number of CPUs (`n_cpus`), which we set to the number of CPUs available on the machine.

***HINT**: You can use the `multiprocessing` package to check the number of CPUs available on your device, via the `cpu_count` function. You can set the number of CPUs equal to the output of this function.*
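
On a shared machine you may prefer not to claim every CPU. As a small sketch of one possible choice (a suggestion, not a ClusTCR recommendation), the snippet below keeps one CPU free while never dropping below one:

```python
from multiprocessing import cpu_count

# Number of logical CPUs reported by the operating system
n_available = cpu_count()
# Keep one CPU free for other work, but never go below one
n_cpus = max(1, n_available - 1)
print(f"Using {n_cpus} of {n_available} CPUs")
```

The resulting value can then be passed to the model as `Clustering(n_cpus = n_cpus)`.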

We also want to include the V gene in the clustering procedure. To do so, we need to specify this when fitting the data, by setting `include_vgene = True` and passing the name of the V gene column.

In [4]:
from multiprocessing import cpu_count
# Initiate the Clustering object 
# Here we set n_cpus equal to the number of available CPUs
clustering = Clustering(n_cpus = cpu_count())
# Fit the data
results = clustering.fit(
    data = data,
    include_vgene = True,
    cdr3_col = "junction_aa",
    v_gene_col = "v_call"
    )

Total time to run ClusTCR: 84.050s


After running the algorithm, ClusTCR returns a `ClusteringResult` object, which holds several properties of the generated clusters. To see which TCRs belong to which cluster, you can access the `clusters_df` property.

In [6]:
# Access the clustering results
clusters = results.clusters_df
clusters.head()

Unnamed: 0,junction_aa,v_call,cluster
0,CASSEREANEQFF,TRBV6-4,26
1,CASSDRSGGADEQFF,TRBV6-4,1003
2,CASSYGAGANVLTF,TRBV6-5,823
3,CASSEDGNTEAFF,TRBV6-4,43
4,CASSEATGGANVLTF,TRBV6-4,1069


To get a more condensed overview of the clustering results, you can use the `summary()` function. It creates a consensus representation (motif) of each cluster and reports the cluster size. Check out the ClusTCR [documentation](https://svalkiers.github.io/clusTCR/docs/clustering/how-to-use.html) to learn what the different characters in the consensus notation mean. Note that this may take some time to run.

In [7]:
# Overview of the clustering output
summary = results.summary()
summary.head()

Unnamed: 0,size,motif
0,232,CASS.r.aGELFF
1,241,CASSpaSGGa[ND]EQFF
2,143,CASS[SY]GAGANVLTF
3,3,CASS..GNTEAFF
4,21,CASSEATggANVLTF
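
A consensus motif can also be used to search for matching sequences. The snippet below is a rough sketch that converts a motif into a regular expression. It rests on an assumed reading of the notation: uppercase letters are required residues, bracketed letters are alternatives, and both `.` and lowercase letters are treated as variable positions. Consult the ClusTCR documentation for the exact meaning of each character. The helper `motif_to_regex` is hypothetical and not part of ClusTCR.

```python
import re

def motif_to_regex(motif):
    """Convert a consensus motif into a compiled regular expression.

    Assumed reading of the notation (see lead-in): uppercase = fixed residue,
    [XY] = either residue, '.' or lowercase = any residue.
    """
    parts = []
    i = 0
    while i < len(motif):
        ch = motif[i]
        if ch == "[":
            # Copy the bracketed alternatives as a regex character class
            j = motif.index("]", i)
            parts.append("[" + motif[i + 1:j].upper() + "]")
            i = j + 1
        elif ch == "." or ch.islower():
            # Variable position: accept any single character
            parts.append(".")
            i += 1
        else:
            parts.append(re.escape(ch))
            i += 1
    return re.compile("".join(parts))

pattern = motif_to_regex("CASS[SY]GAGANVLTF")
print(bool(pattern.fullmatch("CASSYGAGANVLTF")))
print(bool(pattern.fullmatch("CASSAGAGANVLTF")))
```

Using `fullmatch` ensures that the whole sequence has to fit the motif, so sequences of a different length are rejected.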


Another useful method of the `ClusteringResult` object is `compute_features()`. It calculates several properties of each cluster, including physicochemical characteristics, cluster entropy, and (optionally) generation probability. Note that calculating the generation probability (`compute_pgen = True`) may take a while.

In [9]:
import warnings
# Compute more cluster features for downstream use
# Set warnings filter to prevent triggering of RuntimeWarning
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category = RuntimeWarning)
    features = results.compute_features(compute_pgen = True)

In [20]:
features.head()

cluster,h,size,length,basicity_avg,hydrophobicity_avg,helicity_avg,mutation stability_avg,basicity_var,hydrophobicity_var,helicity_var,mutation stability_var,pgen_avg,pgen_var
243,0.109465,15,13,209.132444,-0.729644,1.060311,19.137778,0.414596,0.05202,0.000167,0.433312,6.952818e-09,3.4413410000000003e-17
44,0.071429,2,14,213.959375,-0.62375,0.994375,19.09375,0.092988,0.039903,0.000345,0.048828,1.252533e-13,3.122301e-26
267,0.076923,2,13,208.403333,-0.807667,1.065333,20.833333,0.3042,0.00576,0.000748,0.108889,4.525561e-09,1.2943890000000001e-17
607,0.135843,189,11,210.373097,-0.086064,1.068384,19.561254,1.349025,0.127309,0.000558,1.744788,2.142351e-07,9.769091e-14
227,0.150012,146,13,209.674977,-0.666374,1.055721,20.208219,1.176936,0.101645,0.000484,1.814982,3.173853e-08,4.785767e-15
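
To build some intuition for what an entropy-like feature captures, the snippet below computes the mean Shannon entropy per position for a set of equal-length sequences. This is an illustrative sketch only: it is not necessarily the formula behind the `h` column, and the function name `positional_entropy` is hypothetical. It assumes that all sequences in a cluster have the same length, as the single `length` value per cluster in the table suggests.

```python
import math
from collections import Counter

def positional_entropy(seqs):
    """Mean Shannon entropy (in bits) per position for equal-length sequences.

    Illustrative only; not necessarily how ClusTCR computes its entropy feature.
    """
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    n = len(seqs)
    total = 0.0
    for pos in range(length):
        # Count how often each residue occurs at this position
        counts = Counter(s[pos] for s in seqs)
        total += -sum((c / n) * math.log2(c / n) for c in counts.values())
    return total / length

# Identical sequences: no variation at any position
identical = positional_entropy(["CASSF", "CASSF"])
# One of five positions varies between two equally frequent residues
one_variable = positional_entropy(["CASAF", "CASSF"])
print(identical, one_variable)
```

A cluster whose members differ at few positions yields a value close to zero, while more variable clusters yield higher values.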


In [10]:
# Save the cluster features and the cluster assignments as tab-separated files
features.to_csv("./results/clustcr/P1_15_cluster_features.tsv", sep = "\t", index = False)
clusters.to_csv("./results/clustcr/P1_0_clusters.tsv", sep = "\t", index = False)

In [11]:
# Repeat the clustering for a second sample (P1_15)
from multiprocessing import cpu_count

# Import the data
data = pd.read_csv("./data/examples/P1_15.tsv", sep = "\t")

# Initiate the Clustering object 
# Here we set n_cpus equal to the number of available CPUs
clustering = Clustering(n_cpus = cpu_count())
# Fit the data
results = clustering.fit(
    data = data,
    include_vgene = True,
    cdr3_col = "junction_aa",
    v_gene_col = "v_call"
    )

# Access the clustering results and save them as a tab-separated file
clusters = results.clusters_df
clusters.to_csv("./results/clustcr/P1_15_clusters.tsv", sep = "\t", index = False)

Total time to run ClusTCR: 68.556s
