# **PART 1:** Clustering TCR repertoires using ClusTCR

In [1]:
import os
# Set the working directory to the repository directory
os.chdir("/home/sebastiaan/PhD/Repositories/book_chapter/")

For this part of the tutorial, we need the `pandas` library, which makes it easy to handle our data. In addition, we will use the `clustcr` package to cluster the data.

In [2]:
# pandas for data handling
import pandas as pd
# clustcr for data clustering
from clustcr import Clustering


In [3]:
# Import the data
data = pd.read_csv("../data/examples/P1_0_clones.txt", sep = "\t")
data.head()

| | duplicate_count | frequency | v_call | j_call | junction_aa |
|---|---|---|---|---|---|
| 0 | 33422.0 | 0.011901 | TRBV6-2*00 | TRBJ2-1*00 | CASSNSDRTYGDNEQFF |
| 1 | 24502.0 | 0.008725 | TRBV24-1*00 | TRBJ2-5*00 | CATSSVLTQQETQYF |
| 2 | 22361.0 | 0.007962 | TRBV12-3*00 | TRBJ2-3*00 | CASSSRGLANTQYF |
| 3 | 20930.0 | 0.007453 | TRBV29-1*00 | TRBJ2-7*00 | CSVVGADTYEQYF |
| 4 | 20193.0 | 0.00719 | TRBV7-8*00 | TRBJ1-1*00 | CASSLGTALNTEAFF |


The ClusTCR syntax is similar to that of `scikit-learn`. We start by configuring the clustering model. In this case we use the default parameters, except for the number of CPUs, which we set to the number of CPUs available on the machine.

***HINT**: You can use the `multiprocessing` package to check the number of CPUs available on your device, via the `cpu_count` function. You can then pass the output of this function as `n_cpus`.*

We also want to include the V gene in the clustering procedure. To do so, we need to specify it when fitting the data.
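
Because the fit relies on the `junction_aa` and `v_call` columns, it can help to check the input first. The following is a minimal sketch and not part of ClusTCR: the helper `check_clustering_input` is hypothetical, the toy rows are illustrative, and dropping missing values and duplicate CDR3/V gene pairs is an assumption about what a clean input looks like.

```python
import pandas as pd


def check_clustering_input(df, cdr3_col="junction_aa", v_gene_col="v_call"):
    """Hypothetical helper: verify required columns and drop unusable rows."""
    missing = [col for col in (cdr3_col, v_gene_col) if col not in df.columns]
    if missing:
        raise KeyError(f"missing required column(s): {missing}")
    # Remove rows without a CDR3 sequence or V gene annotation
    cleaned = df.dropna(subset=[cdr3_col, v_gene_col])
    # Keep each CDR3/V gene combination only once
    cleaned = cleaned.drop_duplicates(subset=[cdr3_col, v_gene_col])
    return cleaned.reset_index(drop=True)


# Illustrative toy input: one row lacks a CDR3, one row is a duplicate
toy = pd.DataFrame({
    "junction_aa": ["CASSNSDRTYGDNEQFF", "CATSSVLTQQETQYF", None, "CATSSVLTQQETQYF"],
    "v_call": ["TRBV6-2*00", "TRBV24-1*00", "TRBV12-3*00", "TRBV24-1*00"],
})
cleaned = check_clustering_input(toy)
print(cleaned)
```

Whether ClusTCR performs similar checks internally is not covered here, so treat this as an optional precaution.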

In [4]:
from multiprocessing import cpu_count
# Initiate the Clustering object 
# Here we set n_cpus equal to the number of available CPUs
clustering = Clustering(n_cpus = cpu_count())
# Fit the data
results = clustering.fit(
    data = data,
    include_vgene = True,
    cdr3_col = "junction_aa",
    v_gene_col = "v_call"
    )

Total time to run ClusTCR: 69.250s


After running the algorithm, ClusTCR returns a `ClusteringResult` object, which holds different properties of the generated clusters. To see which TCRs belong to which cluster, you can access the `clusters_df` property.

In [5]:
# Access the clustering results
clusters = results.clusters_df
clusters.head()

| | junction_aa | v_call | cluster |
|---|---|---|---|
| 0 | CASSEREANEQFF | TRBV6-4*00 | 783 |
| 1 | CASSDRSGGADEQFF | TRBV6-4*00 | 483 |
| 2 | CASSYGAGANVLTF | TRBV6-5*00 | 1317 |
| 3 | CASSEDGNTEAFF | TRBV6-4*00 | 787 |
| 4 | CASSEATGGANVLTF | TRBV6-4*00 | 700 |
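
Since `clusters_df` is a regular `pandas` data frame, standard `pandas` operations apply to it. The sketch below counts the members of each cluster and lists the sequences of one cluster. It uses a small toy data frame with the same column names as the output above; the rows and cluster assignments are illustrative, not actual results.

```python
import pandas as pd

# Toy stand-in for results.clusters_df (illustrative rows, not real output)
clusters = pd.DataFrame({
    "junction_aa": ["CASSEREANEQFF", "CASSEKEANEQFF",
                    "CASSYGAGANVLTF", "CASSSGAGANVLTF", "CASSYGAGANILTF"],
    "v_call": ["TRBV6-4*00", "TRBV6-4*00",
               "TRBV6-5*00", "TRBV6-5*00", "TRBV6-5*00"],
    "cluster": [783, 783, 1317, 1317, 1317],
})

# Number of TCRs per cluster, largest cluster first
cluster_sizes = clusters.groupby("cluster").size().sort_values(ascending=False)
print(cluster_sizes)

# All CDR3 sequences assigned to one cluster of interest
members = clusters.loc[clusters["cluster"] == 1317, "junction_aa"].tolist()
print(members)
```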


To get a more condensed overview of the clustering results, you can use the `summary()` function. It creates a consensus representation (motif) of each cluster and reports the cluster size. Note that this may take some time to run.

In [6]:
# Overview of the clustering output
summary = results.summary()
summary.head()

| | size | motif |
|---|---|---|
| 0 | 33 | CASS.r.aGELFF |
| 1 | 69 | CASSpaSGGa[ND]EQFF |
| 2 | 177 | CASS[SY]GAGANVLTF |
| 3 | 2 | CASS..GNTEAFF |
| 4 | 3 | CASSEATggANVLTF |
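
The summary can be filtered like any other data frame, for instance to focus on the largest clusters. The following sketch rebuilds the five rows shown above as a toy data frame and keeps the clusters with at least `min_size` members; the threshold of 10 is an arbitrary, hypothetical choice.

```python
import pandas as pd

# Toy stand-in for results.summary(), using the five rows shown above
summary = pd.DataFrame({
    "size": [33, 69, 177, 2, 3],
    "motif": ["CASS.r.aGELFF", "CASSpaSGGa[ND]EQFF", "CASS[SY]GAGANVLTF",
              "CASS..GNTEAFF", "CASSEATggANVLTF"],
})

min_size = 10  # hypothetical threshold, adjust to your data

# Keep clusters with at least min_size members, largest first
large_clusters = summary[summary["size"] >= min_size].sort_values(
    "size", ascending=False
)
print(large_clusters)
```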


Another useful method of the `ClusteringResult` object is `compute_features()`. It calculates several properties of the clusters, including physicochemical characteristics, cluster entropy, and (optionally) generation probability. Note that calculating the generation probability may take a while.
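
To give an intuition for what a cluster entropy can capture, the sketch below computes the mean per-position Shannon entropy of a set of equal-length CDR3 sequences. This is an illustration only and not necessarily how ClusTCR computes its entropy feature; the function `positional_entropy` and the toy sequences are hypothetical.

```python
import math
from collections import Counter


def positional_entropy(sequences):
    """Mean per-position Shannon entropy (in bits) of equal-length sequences."""
    if len({len(seq) for seq in sequences}) != 1:
        raise ValueError("all sequences must have the same length")
    n = len(sequences)
    entropies = []
    # zip(*sequences) iterates over alignment columns (one per position)
    for column in zip(*sequences):
        counts = Counter(column)
        h = -sum((c / n) * math.log2(c / n) for c in counts.values())
        entropies.append(h)
    return sum(entropies) / len(entropies)


# Hypothetical cluster: three sequences that differ at a single position
toy_cluster = ["CASSEATGGANVLTF", "CASSEATAGANVLTF", "CASSEATGGANVLTF"]
print(positional_entropy(toy_cluster))
```

A value of 0 means that all sequences are identical, and the value grows as more positions vary between cluster members.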

In [12]:
# Compute more cluster features for downstream use
features = results.compute_features(compute_pgen = True)


In [14]:
features.head()

| cluster | h | size | length | basicity_avg | hydrophobicity_avg | helicity_avg | mutation stability_avg | basicity_var | hydrophobicity_var | helicity_var | mutation stability_var | pgen_avg | pgen_var |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 761 | 0.164009 | 32 | 11 | 210.768269 | 0.071899 | 1.072067 | 21.925481 | 1.927027 | 0.103435 | 0.000561 | 2.066992 | 3.736845e-08 | 2.489712e-15 |
| 445 | 0.125349 | 10 | 13 | 209.265333 | -0.941267 | 1.0442 | 19.186667 | 0.730013 | 0.126709 | 0.00048 | 0.574617 | 3.974605e-09 | 1.030996e-17 |
| 1218 | 0.083333 | 2 | 12 | 207.946429 | -0.1325 | 1.121786 | 20.392857 | 0.077168 | 0.060006 | 3.1e-05 | 1.125 | 1.633868e-07 | 1.670971e-14 |
| 862 | 0.145242 | 15 | 11 | 209.548718 | -0.231385 | 1.065538 | 18.8 | 0.210454 | 0.07294 | 0.000637 | 0.909383 | 2.144654e-07 | 8.638849e-14 |
| 509 | 0.089135 | 3 | 13 | 208.737778 | -0.18 | 1.104667 | 18.511111 | 0.028993 | 0.015787 | 6.5e-05 | 0.365926 | 5.917418e-09 | 3.853180e-17 |
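
The feature table is indexed by cluster, so it can be filtered and sorted for downstream use. As a minimal sketch, the code below rebuilds two of the columns shown above (`size` and `pgen_avg`) as a toy data frame, keeps clusters with at least `min_size` members, and orders them by average generation probability. The threshold is a hypothetical choice, and which features matter depends on your analysis.

```python
import pandas as pd

# Toy stand-in for the features table (two columns, values copied from above)
features = pd.DataFrame(
    {
        "size": [32, 10, 2, 15, 3],
        "pgen_avg": [3.736845e-08, 3.974605e-09, 1.633868e-07,
                     2.144654e-07, 5.917418e-09],
    },
    index=pd.Index([761, 445, 1218, 862, 509], name="cluster"),
)

min_size = 10  # hypothetical threshold, adjust to your data

# Keep sufficiently large clusters, lowest average generation probability first
candidates = features[features["size"] >= min_size].sort_values("pgen_avg")
print(candidates)
```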
