# Clustering TCR Sequences

After featurizing the TCRSeq data, users will often want to cluster the sequences to identify possible antigen-specific groups. To this end, we provide several ways to cluster your TCR sequences.

## Phenograph Clustering

The first method we will explore is a network-graph-based clustering algorithm called Phenograph (https://github.com/jacoblevine/PhenoGraph). This method automatically determines the number of clusters in the data by maximizing the modularity of the network graph assembled from the data. Of note, this algorithm is very fast and is useful when there are thousands to tens of thousands of sequences to cluster. However, clusters produced by this method tend to be quite large.
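
To make "maximizing modularity" concrete, the sketch below computes the modularity score Q of a given partition on a tiny unweighted graph. It is only an illustration of the quantity being optimized, not PhenoGraph's implementation; the function name `modularity` and the toy graph are assumptions made for this example.

```python
from collections import defaultdict

def modularity(edges, membership):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    membership: dict mapping node -> cluster label.
    """
    m = len(edges)
    internal = defaultdict(int)  # edges with both endpoints in the cluster
    degree = defaultdict(int)    # summed node degrees per cluster
    for u, v in edges:
        degree[membership[u]] += 1
        degree[membership[v]] += 1
        if membership[u] == membership[v]:
            internal[membership[u]] += 1
    # Q = sum over clusters of (fraction of edges inside the cluster)
    #     minus (expected fraction if edges were placed at random).
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

# Toy graph (illustrative): two triangles joined by one bridge edge (2-3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
two_clusters = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_cluster = {node: "a" for node in range(6)}

q_two = modularity(edges, two_clusters)
q_one = modularity(edges, one_cluster)
print(f"two triangles as clusters: Q = {q_two:.3f}")
print(f"everything in one cluster: Q = {q_one:.3f}")
```

Splitting the graph into its two triangles scores higher than lumping all nodes together, which is why a modularity-maximizing method can settle on a number of clusters without being told one.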

First, we will load data and train the VAE.

In [None]:
import sys
sys.path.append('../../')
from DeepTCR.DeepTCR_U import DeepTCR_U

# Instantiate training object
DTCRU = DeepTCR_U('Tutorial')

#Load Data from directories
DTCRU.Get_Data(directory='../../Data/Murine_Antigens',Load_Prev_Data=False,aggregate_by_aa=True,
               aa_column_beta=0,count_column=1,v_beta_column=2,j_beta_column=3)

#Train VAE
DTCRU.Train_VAE(Load_Prev_Data=False)


We will then run the clustering command.

In [None]:
DTCRU.Cluster(clustering_method='phenograph')

Following clustering, we can view the clustering solutions by looking at the object variable called Cluster_DFs.

In [None]:
DFs = DTCRU.Cluster_DFs
print(DFs[0])
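
As a quick way to get an overview of a clustering solution, the sketch below ranks clusters by size. It assumes that Cluster_DFs behaves like a list holding one table per cluster and that `len()` of each table gives its number of sequences; the helper name `summarize_clusters` is hypothetical and not part of DeepTCR.

```python
def summarize_clusters(cluster_tables, top=5):
    """Return (cluster_index, n_rows) pairs for the largest clusters."""
    sizes = [(i, len(table)) for i, table in enumerate(cluster_tables)]
    sizes.sort(key=lambda pair: pair[1], reverse=True)
    return sizes[:top]

# Stand-in data so the sketch runs on its own. With a fitted object you
# would pass DTCRU.Cluster_DFs instead (assumption: one table per cluster).
toy_tables = [["seq"] * 12, ["seq"] * 40, ["seq"] * 3]
largest = summarize_clusters(toy_tables, top=2)
print(largest)
```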

We can also save these results to a directory named after the object with '_Results' appended by setting the write_to_sheets parameter to True. There, we can find the proportions of every sample in each cluster, along with a csv file for every cluster detailing its sequences and associated information.

In [None]:
DTCRU.Cluster(clustering_method='phenograph',write_to_sheets=True)

We can also employ two other clustering algorithms (hierarchical clustering and DBSCAN). For these methods, we can either control the settings of the algorithm ourselves, such as the threshold parameter (t) and the criterion/linkage algorithm for hierarchical clustering, or we can let the method determine the optimal threshold parameter by maximizing the silhouette score of the clustering solution. First, we run hierarchical clustering and let the program determine the threshold parameter:

## Hierarchical Clustering

In [None]:
DTCRU.Cluster(clustering_method='hierarchical')

Or we can set the parameters ourselves.

In [None]:
DTCRU.Cluster(clustering_method='hierarchical',criterion='distance',t=1.0)
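
To illustrate what choosing a threshold by silhouette score means, here is a minimal sketch on toy 2-D points. It is not DeepTCR's implementation: it uses a simple single-linkage cut at a distance threshold, a hand-picked grid of candidate thresholds, and helper names (`cut_at_distance`, `silhouette`) that are all assumptions made for this example.

```python
import math

def cut_at_distance(points, t):
    """Single-linkage flat clusters: points within distance t get linked."""
    labels = list(range(len(points)))  # every point starts in its own cluster
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= t and labels[i] != labels[j]:
                old, new = labels[j], labels[i]
                labels = [new if lab == old else lab for lab in labels]
    return labels

def silhouette(points, labels):
    """Mean silhouette coefficient; -1.0 when the score is undefined."""
    clusters = set(labels)
    if len(clusters) < 2 or len(clusters) == len(points):
        return -1.0
    total = 0.0
    for i, p in enumerate(points):
        mean_dist = {}
        for c in clusters:
            members = [q for j, q in enumerate(points) if labels[j] == c and j != i]
            if members:
                mean_dist[c] = sum(math.dist(p, q) for q in members) / len(members)
        own = labels[i]
        if own not in mean_dist:
            continue  # a singleton cluster contributes a score of 0
        a = mean_dist[own]                                      # cohesion
        b = min(d for c, d in mean_dist.items() if c != own)    # separation
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

# Toy data (illustrative): three tight groups of points.
points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1)]
candidates = [0.5, 1.5, 13.0, 30.0]  # hypothetical grid of thresholds

scores = {t: silhouette(points, cut_at_distance(points, t)) for t in candidates}
best_t = max(candidates, key=lambda t: scores[t])
print("best threshold:", best_t)
print("clusters found:", len(set(cut_at_distance(points, best_t))))
```

Thresholds that are too small leave every point alone and thresholds that are too large merge everything, so the score peaks at the value that separates the three groups.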

## DBSCAN Clustering

To use DBSCAN instead:

In [None]:
DTCRU.Cluster(clustering_method='dbscan')

If there are too many sequences to cluster efficiently, one can downsample the data and then use a k-nearest neighbor (KNN) algorithm to classify the remaining sequences. Here, we will downsample 500 sequences for clustering and then assign the rest via KNN.

In [None]:
DTCRU.Cluster(clustering_method='phenograph',sample=500)
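
The downsample-then-assign idea can be sketched in plain Python as follows. This is an illustration under stated assumptions, not DeepTCR's code: the toy 2-D features, the strided subset, the trivial rule standing in for the clustering step, and the helper `knn_assign` are all hypothetical.

```python
import math
import random
from collections import Counter

def knn_assign(labeled_points, labels, query, k=3):
    """Label `query` by majority vote among its k nearest labeled points."""
    nearest = sorted(range(len(labeled_points)),
                     key=lambda i: math.dist(labeled_points[i], query))[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
# Toy stand-in for learned features: two well-separated 2-D blobs.
blob_a = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(50)]
blob_b = [(rng.gauss(8, 0.5), rng.gauss(8, 0.5)) for _ in range(50)]
features = blob_a + blob_b

# 1. Downsample. A strided subset keeps this example deterministic;
#    a random sample would be the more usual choice.
sample_idx = list(range(0, len(features), 5))
sample_points = [features[i] for i in sample_idx]

# 2. Cluster only the subset. A trivial rule stands in for the real
#    clustering algorithm here.
sample_labels = [0 if x + y < 8 else 1 for x, y in sample_points]

# 3. Assign every remaining row to a cluster via k-nearest neighbors.
assigned = dict(zip(sample_idx, sample_labels))
for i, point in enumerate(features):
    if i not in assigned:
        assigned[i] = knn_assign(sample_points, sample_labels, point, k=3)

print(Counter(assigned.values()))
```

Only the subset goes through the expensive clustering step; each remaining point costs one nearest-neighbor lookup against that subset.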

Finally, we can visualize the clustering results through a UMAP representation of the sequences.

In [None]:
DTCRU.UMAP_Plot(by_cluster=True)