# Loading Data with HLA or MHC information

A TCR repertoire is best understood in the context of the HLA/MHC background in which it resides. The antigenic specificity of a TCR in a host is determined not only by the structure of the TCR but also by which HLA/MHC molecules are present, as these dictate the peptides that can be presented to and recognized by a particular TCR repertoire.

When comparing TCR repertoires between inbred mice, the repertoires are directly comparable because the animals share the same MHC background. A given TCR seen in any animal will be recognizing the same epitope.

However, when comparing human samples with HLA heterogeneity, the same TCR in two individuals does not necessarily recognize the same epitope, because what is presented is determined by each individual's HLA. Therefore, repertoires should be compared in the context of the HLA background in which each TCR was observed.

In this tutorial, we will demonstrate how to load HLA information along with TCR-Seq data.

## Preparing Data

We've created a toy dataset under Data called 'Human_HLA_Tutorial' to walk through how to incorporate HLA into your TCR-Seq analysis. In that folder, you should see a folder called 'Data' with 10 samples labeled 'Sample_N.tsv'. In that same folder, you will also see a file called 'HLA.csv', where the first column lists the names of the files in which the TCR-Seq data is stored, followed by 6 columns of HLA information. It is important that the HLA information is provided in this format in a CSV-formatted file.

In [1]:
import pandas as pd
import numpy as np

df_hla = pd.read_csv('../Data/Human_HLA_Tutorial/HLA.csv')

In [2]:
df_hla

| Unnamed: 0 | File | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|---|
| 0 | Sample_1.tsv | A1101 | A2501 | B1801 | B5201 | C1202 | C1203 |
| 1 | Sample_2.tsv | A1101 | A2501 | B1801 | B5201 | C1202 | C1203 |
| 2 | Sample_3.tsv | A1101 | A2501 | B1801 | B5201 | C1202 | C1203 |
| 3 | Sample_4.tsv | A1101 | A2501 | B1801 | B5201 | C1202 | C1203 |
| 4 | Sample_5.tsv | A0201 | A2902 | B3801 | B4501 | C0602 | C1203 |
| 5 | Sample_6.tsv | A0201 | A2902 | B3801 | B4501 | C0602 | C1203 |
| 6 | Sample_7.tsv | A0201 | A2902 | B3801 | B4501 | C0602 | C1203 |
| 7 | Sample_8.tsv | A0201 | A2902 | B3801 | B4501 | C0602 | C1203 |
| 8 | Sample_9.tsv | A0201 | A0206 | B2705 | B4001 | C0303 | C0304 |
| 9 | Sample_10.tsv | A0201 | A0206 | B2705 | B4001 | C0303 | C0304 |


These HLA values can be any categorical values, and HLA can be encoded in any way the user wants, as long as the encoding is consistent across all samples. DeepTCR3 collects the unique HLA values in this file and creates categories from them in order to encode this information into the model.

## Loading TCR and HLA

In [3]:
import sys
sys.path.append('../')
from DeepTCR3.DeepTCR3 import DeepTCR3_U

# Instantiate training object
DTCRU = DeepTCR3_U('HLA Tutorial')

#Load TCR Data from directories
DTCRU.Get_Data(directory='../Data/Human_HLA_Tutorial/Data',Load_Prev_Data=False,
               aa_column_beta=1,count_column=2,v_beta_column=7,d_beta_column=14,j_beta_column=21,
              hla='../Data/Human_HLA_Tutorial/HLA.csv')

Loading Data...
Embedding Sequences...
Data Loaded


As demonstrated, it is simple to load the TCR and HLA information once it has been prepared properly. Of note, if any TCR-Seq files do not have corresponding HLA information as provided in the HLA.csv file, those samples will be dropped from the analysis.
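Because samples without a matching row in HLA.csv are dropped, it can be worth comparing the file list against the HLA table before loading. The following is a minimal sketch in plain Python. `find_unmatched_samples` is a hypothetical helper, not part of DeepTCR3, and it assumes that file names in the HLA table must match the names in the data directory exactly.

```python
def find_unmatched_samples(sample_files, hla_files):
    """Return sample files that have no entry in the HLA table, sorted by name."""
    hla_lookup = set(hla_files)
    return sorted(f for f in sample_files if f not in hla_lookup)


# Illustrative file names. In practice these could come from os.listdir(...)
# on the data directory and from the 'File' column of the HLA table.
sample_files = ['Sample_1.tsv', 'Sample_2.tsv', 'Sample_3.tsv']
hla_files = ['Sample_1.tsv', 'Sample_3.tsv']

unmatched = find_unmatched_samples(sample_files, hla_files)
print(unmatched)  # ['Sample_2.tsv']
```

An empty result means every TCR-Seq file has HLA information and no sample should be lost for this reason.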

We can also inspect which HLA categories were parsed from the data.

In [4]:
DTCRU.lb_hla.classes_

array(['A0201', 'A0206', 'A1101', 'A2501', 'A2902', 'B1801', 'B2705',
       'B3801', 'B4001', 'B4501', 'B5201', 'C0303', 'C0304', 'C0602',
       'C1202', 'C1203'], dtype=object)

We can also see how the data is stored in a 'multi-hot' encoding, where each row marks the HLA categories that are present with a 1.

In [5]:
DTCRU.hla_data_seq_num

array([[1, 0, 0, ..., 1, 0, 1],
       [1, 0, 0, ..., 1, 0, 1],
       [1, 0, 0, ..., 1, 0, 1],
       ...,
       [1, 0, 0, ..., 1, 0, 1],
       [1, 0, 0, ..., 1, 0, 1],
       [1, 0, 0, ..., 1, 0, 1]])
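To make the idea of a multi-hot encoding concrete, here is a small sketch in plain Python that encodes the three distinct HLA profiles from the table above. It illustrates the encoding scheme only and is not DeepTCR3's implementation; `multi_hot_encode` is a hypothetical helper, and sorting the unique values to define the column order is an assumption chosen to mirror the ordering shown in `classes_`.

```python
def multi_hot_encode(sample_alleles):
    """Encode each sample's HLA values as a 0/1 vector over the sorted unique values."""
    classes = sorted({allele for alleles in sample_alleles for allele in alleles})
    index = {allele: i for i, allele in enumerate(classes)}
    encoded = []
    for alleles in sample_alleles:
        row = [0] * len(classes)
        for allele in alleles:
            row[index[allele]] = 1  # mark this HLA category as present
        encoded.append(row)
    return classes, encoded


# The three distinct HLA profiles found in HLA.csv.
profiles = [
    ['A1101', 'A2501', 'B1801', 'B5201', 'C1202', 'C1203'],
    ['A0201', 'A2902', 'B3801', 'B4501', 'C0602', 'C1203'],
    ['A0201', 'A0206', 'B2705', 'B4001', 'C0303', 'C0304'],
]
classes, encoded = multi_hot_encode(profiles)
print(len(classes))  # 16
print(encoded[1])    # [1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1]
```

Each row contains exactly six 1s, one per HLA value carried by that sample, in contrast to a one-hot encoding where only a single position is set.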

## HLA Supertypes

While the HLA loci are genetically very diverse, the idea of HLA supertypes is that there are biologically functional groupings of HLA based on how they bind antigen. In DeepTCR3, a user can choose to transform the HLA information from allele (e.g. A0101) to supertype (e.g. A01). To do this, it is as simple as setting the use_hla_supertype parameter to True. For this method to work, HLA MUST be provided in the following format: A0101. Formats such as A*0101 or HLA-A*0101 will not be recognized.
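If your HLA typing uses one of the unsupported notations, the values can be converted before writing HLA.csv. The sketch below is one possible way to do this with a regular expression. `normalize_hla` is a hypothetical helper, not part of DeepTCR3, and it assumes class I alleles (A, B, C) with at least two-field resolution.

```python
import re

# Optional 'HLA-' prefix, locus letter, optional '*', two digits,
# optional ':', two digits. Any further fields are rejected.
_HLA_PATTERN = re.compile(r"^(?:HLA-)?([ABC])\*?(\d{2}):?(\d{2})$", re.IGNORECASE)


def normalize_hla(name):
    """Convert notations such as 'HLA-A*01:01' or 'A*0101' to 'A0101'."""
    match = _HLA_PATTERN.match(name.strip())
    if match is None:
        raise ValueError(f"Unrecognized HLA allele format: {name!r}")
    locus, group, protein = match.groups()
    return f"{locus.upper()}{group}{protein}"


print(normalize_hla('HLA-A*01:01'))  # A0101
print(normalize_hla('B*2705'))       # B2705
print(normalize_hla('C0304'))        # C0304
```

Raising an error on anything unexpected, rather than passing it through, avoids silently creating categories that the supertype lookup would not recognize.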

In [6]:
# Instantiate training object
DTCRU = DeepTCR3_U('HLA Tutorial')

#Load TCR Data from directories
DTCRU.Get_Data(directory='../Data/Human_HLA_Tutorial/Data',Load_Prev_Data=False,
               aa_column_beta=1,count_column=2,v_beta_column=7,d_beta_column=14,j_beta_column=21,
              hla='../Data/Human_HLA_Tutorial/HLA.csv',use_hla_supertype=True)

Loading Data...
Embedding Sequences...
Data Loaded


Now if we look at the HLA categories...

In [7]:
DTCRU.lb_hla.classes_

array(['A01', 'A01 A24', 'A02', 'A03', 'B27', 'B44', 'B62'], dtype=object)

We can see that we've reduced the HLA dimensional space to supertypes. 
These supertype designations were taken from:

Sidney, J., Peters, B., Frahm, N., Brander, C., & Sette, A. (2008). HLA class I supertypes: a revised and updated classification. BMC Immunology, 9(1), 1.

As you can see, the C alleles were removed, as there are no well-understood supertype groupings for them as of yet (if this changes, we will update this in later versions of DeepTCR3). However, to include HLA alleles that do not fall into a group specified by the publication above, one can set another parameter (keep_non_supertype_alleles) to True.

In [8]:
# Instantiate training object
DTCRU = DeepTCR3_U('HLA Tutorial')

#Load TCR Data from directories
DTCRU.Get_Data(directory='../Data/Human_HLA_Tutorial/Data',Load_Prev_Data=False,
               aa_column_beta=1,count_column=2,v_beta_column=7,d_beta_column=14,j_beta_column=21,
              hla='../Data/Human_HLA_Tutorial/HLA.csv',use_hla_supertype=True,keep_non_supertype_alleles=True)

Loading Data...
Embedding Sequences...
Data Loaded


And now if we look at the HLA categories...

In [9]:
DTCRU.lb_hla.classes_

array(['A01', 'A01 A24', 'A02', 'A03', 'B27', 'B44', 'B62', 'C0303',
       'C0304', 'C0602', 'C1202', 'C1203'], dtype=object)

We have both supertypes and alleles present.
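The difference between the two settings can be sketched in a few lines of plain Python. This mimics the behavior described in this tutorial and is not DeepTCR3's implementation. `to_supertypes` is a hypothetical helper, and the lookup table is a tiny illustrative subset; the full assignments are in Sidney et al. (2008).

```python
# Tiny illustrative subset of allele-to-supertype assignments.
# C alleles have no supertype assignment.
SUPERTYPE = {'A0201': 'A02', 'A1101': 'A03', 'B2705': 'B27', 'B4001': 'B44'}


def to_supertypes(alleles, keep_non_supertype_alleles=False):
    """Map alleles to supertypes; unmapped alleles are dropped unless kept explicitly."""
    result = []
    for allele in alleles:
        if allele in SUPERTYPE:
            result.append(SUPERTYPE[allele])
        elif keep_non_supertype_alleles:
            result.append(allele)  # keep the original allele label as-is
    return result


alleles = ['A0201', 'B2705', 'B4001', 'C0303']
print(to_supertypes(alleles))
# ['A02', 'B27', 'B44']
print(to_supertypes(alleles, keep_non_supertype_alleles=True))
# ['A02', 'B27', 'B44', 'C0303']
```

With the flag left at False, the C allele disappears from the result; with it set to True, supertypes and unassigned alleles appear side by side, as in the output above.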