# Structural Diversity Assessment

DeepTCR3 can be used to assess the structural diversity within a repertoire. One can think of structural diversity as a measure of the number of antigens or concepts represented within a repertoire. A repertoire with low diversity recognizes few antigens, while one with high diversity can recognize more. We will assess two measures of diversity. The first is the number of clusters/concepts of TCR sequences present in each sample/repertoire. The second is how entropic the distribution of sequence reads is across these clusters. For example, a sample can have 12 clusters, but if 90% of the repertoire is in one cluster, it is a low-entropy repertoire.
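To make the two measures concrete, here is a minimal sketch that counts occupied clusters and computes the entropy of read counts across them. It assumes Shannon entropy with the natural logarithm; the exact formula DeepTCR3 uses is not shown in this tutorial, and the function names `cluster_entropy` and `occupied_clusters` as well as the read counts are hypothetical.

```python
import math

def cluster_entropy(counts):
    """Shannon entropy (natural log, an assumption) of reads spread over clusters."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]  # skip empty clusters
    return -sum(p * math.log(p) for p in probs)

def occupied_clusters(counts):
    """Number of clusters that contain at least one read."""
    return sum(1 for c in counts if c > 0)

# Hypothetical read counts: 12 clusters, with 90% of reads in the first one.
skewed = [990] + [10] * 11
# Hypothetical read counts: 12 clusters, with reads spread evenly.
even = [100] * 12

print(occupied_clusters(skewed), cluster_entropy(skewed))
print(occupied_clusters(even), cluster_entropy(even))
```

Both samples have 12 clusters, but the skewed one has a much lower entropy than the even one, which is why the two measures are reported side by side.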

First, we will load the data and train the variational autoencoder (VAE).

In [1]:
%%capture
import sys
sys.path.append('../../')
from DeepTCR3.DeepTCR3 import DeepTCR3_U

# Instantiate training object
DTCRU = DeepTCR3_U('Tutorial')

# Load data from directories
DTCRU.Get_Data(directory='../../Data/Murine_Antigens',Load_Prev_Data=False,aggregate_by_aa=True,
               aa_column_beta=0,count_column=1,v_beta_column=2,j_beta_column=3)

# Train VAE
DTCRU.Train_VAE(Load_Prev_Data=False)

We will then execute the following command to generate diversity measurements.

In [2]:
DTCRU.Structural_Diversity()

Finding 30 nearest neighbors using minkowski metric and 'auto' algorithm
Neighbors computed in 2.7122957706451416 seconds
Jaccard graph constructed in 0.7806694507598877 seconds
Wrote graph to binary file in 0.2883594036102295 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.886948
Louvain completed 21 runs in 4.396822452545166 seconds
PhenoGraph complete in 8.189501523971558 seconds


Because the first step of this algorithm clusters the data with the PhenoGraph algorithm, we can also set the sample parameter to sub-sample the sequences for this initial clustering; a K-nearest neighbors algorithm then assigns the remaining sequences to clusters. This is helpful for very large TCRSeq files (such as those collected from tumor-infiltrating lymphocytes (TILs)).

In [3]:
DTCRU.Structural_Diversity(sample=500)

Finding 30 nearest neighbors using minkowski metric and 'auto' algorithm
Neighbors computed in 0.17600107192993164 seconds
Jaccard graph constructed in 0.16491436958312988 seconds
Wrote graph to binary file in 0.022684335708618164 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.652236
After 2 runs, maximum modularity is Q = 0.655081
After 11 runs, maximum modularity is Q = 0.657114
After 22 runs, maximum modularity is Q = 0.659137
Louvain completed 42 runs in 7.972766876220703 seconds
PhenoGraph complete in 8.344019889831543 seconds
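The sub-sample-then-assign idea described above can be sketched as follows. This is not DeepTCR3's implementation: the helper `assign_by_knn`, the toy two-blob features, and the threshold used as a stand-in for PhenoGraph clustering are all assumptions made for illustration.

```python
import numpy as np

def assign_by_knn(features, sub_idx, sub_labels, k=5):
    """Label every row by majority vote among its k nearest sub-sampled rows."""
    sub = features[sub_idx]
    labels = np.empty(len(features), dtype=int)
    for i, x in enumerate(features):
        dist = np.linalg.norm(sub - x, axis=1)   # Euclidean distance to each sub-sampled row
        nearest = np.argsort(dist)[:k]           # positions of the k closest sub-sampled rows
        labels[i] = np.bincount(sub_labels[nearest]).argmax()
    labels[sub_idx] = sub_labels                 # sub-sampled rows keep their clustered label
    return labels

# Toy stand-in for learned sequence features: two well-separated blobs.
rng = np.random.default_rng(0)
features = np.vstack([rng.normal(0.0, 0.1, (50, 2)),
                      rng.normal(5.0, 0.1, (50, 2))])

# Every fifth row stands in for a random sub-sample.
sub_idx = np.arange(0, len(features), 5)
# Stand-in for clustering the sub-sample (PhenoGraph in the tutorial).
sub_labels = (features[sub_idx, 0] > 2.5).astype(int)

labels = assign_by_knn(features, sub_idx, sub_labels, k=5)
print(np.bincount(labels))
```

Only the sub-sample goes through the expensive clustering step; every other row needs just a distance computation against the sub-sample, which is what makes the approach attractive for large files.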


We can then view the structural diversity metrics in the respective object variable.

In [4]:
print(DTCRU.Structural_Diversity_DF)

                  Sample    Class   Entropy  Num of Clusters
0              Db-F2.tsv    Db-F2  1.738316                7
1             Db-M45.tsv   Db-M45  1.894494                9
2              Db-NP.tsv    Db-NP  1.715958                9
3              Db-PA.tsv    Db-PA  1.603360                9
4             Db-PB1.tsv   Db-PB1  1.357835                9
5             Kb-M38.tsv   Kb-M38  0.947212                9
6       Kb-6_dLN_SIY.tsv   Kb-SIY  1.832507                9
7      Kb-5_TILS_SIY.tsv   Kb-SIY  1.739412                8
8     Kb-4_Sp_T_SIYI.tsv   Kb-SIY  1.913302                9
9    Kb-1_Sp_Con_SIY.tsv   Kb-SIY  1.849259                9
10  Kb-2_Sp_Con_TRP2.tsv  Kb-TRP2  1.966476                9
11    Kb-3_Sp_T_TRP2.tsv  Kb-TRP2  1.684725                9
12     Kb-7_dLN_TRP2.tsv  Kb-TRP2  1.923864                9
13           Kb-m139.tsv  Kb-m139  1.975833                9


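We can see the entropy and number of clusters for each sample. Note that the two measures carry different information: Kb-M38 has 9 clusters like most other samples, yet its entropy (0.947212) is the lowest in the table, which indicates that its reads are concentrated in fewer clusters.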
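As a sketch of how these metrics might be summarized further, the snippet below averages entropy and cluster count per class. It assumes `Structural_Diversity_DF` is a pandas DataFrame with the column names printed above; to stay self-contained, it rebuilds rows 6 to 12 of that output by hand instead of reading them from `DTCRU`.

```python
import pandas as pd

# Rows 6-12 of the printed output, re-entered by hand for illustration.
diversity = pd.DataFrame({
    "Sample": ["Kb-6_dLN_SIY.tsv", "Kb-5_TILS_SIY.tsv", "Kb-4_Sp_T_SIYI.tsv",
               "Kb-1_Sp_Con_SIY.tsv", "Kb-2_Sp_Con_TRP2.tsv",
               "Kb-3_Sp_T_TRP2.tsv", "Kb-7_dLN_TRP2.tsv"],
    "Class": ["Kb-SIY", "Kb-SIY", "Kb-SIY", "Kb-SIY",
              "Kb-TRP2", "Kb-TRP2", "Kb-TRP2"],
    "Entropy": [1.832507, 1.739412, 1.913302, 1.849259,
                1.966476, 1.684725, 1.923864],
    "Num of Clusters": [9, 8, 9, 9, 9, 9, 9],
})

# Mean entropy and mean cluster count for each class.
summary = diversity.groupby("Class")[["Entropy", "Num of Clusters"]].mean()
print(summary)

# Sample with the lowest entropy among these rows.
lowest = diversity.sort_values("Entropy").iloc[0]
print(lowest["Sample"], lowest["Entropy"])
```

With the real object, `diversity` would presumably be replaced by `DTCRU.Structural_Diversity_DF`, assuming the attribute behaves as a regular DataFrame.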