# Content

1. Introduction
2. TCR
    2.1. Database Queries
    2.2. Clustering and Distance
    2.3. Epitope Prediction
3. BCR
    3.1. Database Queries 

# 1. Introduction

B and T cells recognize their targets via their immune receptors (IRs): the B cell receptor (BCR) and the T cell receptor (TCR). The specificity of an IR is determined by its amino acid sequence. Previous studies provided evidence that the most influential factor in IR-target interaction is the CDR3 sequence of the beta chain and, to a lesser degree, the CDR3 sequence of the alpha chain.

In many studies, determining the specificity of individual cells is of key interest: researchers can then select the cells relevant to their study and observe their behaviour. Binding screening of T cells can be performed via ??? (Check whether similar for BCRs) (see ...). However, this adds another level of complexity to the study design and further increases cost.
Computational approaches for inferring specificity can provide this layer of information.

Here, we will introduce the following approaches:
- Database Query: IR sequences and their targets from various studies were collected in multiple databases. We can use these to find matches to our single-cell study.
- Clustering and distances: [] showed that IRs with similar sequences have common specificity. This property has been used in multiple approaches for comparing IRs with distance metrics and unsupervised clustering.
- Epitope prediction: Recently, several machine-learning methods were developed that directly predict binding between IRs and a target. In theory, these methods could be used to directly assign specificity to the IRs involved in single-cell studies.

However, all three approaches have major pitfalls. The samples in the public databases are severely biased towards diseases and use cases that are commonly researched. For example, the major public databases contain known bindings for only several hundred epitope sequences for TCRs. Further, the majority of samples in these databases do not provide the full IR sequence (VDJ genes and CDRs for alpha and beta chains), but rather focus on the CDR3, often reporting only the alpha or the beta sequence.

# 2. TCR

In the following, we will showcase these approaches for TCRs. In [ERGO2], ... et al. reported the importance of different information (elements of the TCR and MHC type) for training a sequence-based classifier:

...

We can assume that this importance is similar not only for prediction, but also for querying, clustering, and distance calculation. We can therefore use the ordering here as a guideline for which information to use when comparing TCRs.

In [1]:
import scanpy as sc
import scirpy as ir



Let's load the data from our previous notebooks.

In [2]:
path_data = 'data'
path_tcr = f'{path_data}/TCR_01_preprocessed_tcr.h5ad'
adata_tcr = sc.read(path_tcr)

In [3]:
adata_tcr.obs['patient_id'].value_counts()

CV0902         6835
AP6            5980
CV0904         5904
CV0137         4837
CV0074         4723
               ... 
IVLPS-2-10h     355
IVLPS-2-90m     344
IVLPS-6-90m     182
CV0198           44
IVLPS-6-10h      31
Name: patient_id, Length: 118, dtype: int64

In [4]:
adata_tcr = adata_tcr[adata_tcr.obs['patient_id'].isin(['CV0902', 'AP6'])]

## 2.1. Database Queries
Here, we will search for TCRs with specificity annotation from previous studies. Common large-scale databases with TCR-epitope pairs are:
- IEDB
- VDJdb
- McPAS-TCR
- PIRD
- immuneACCESS (COVID)

These databases are subject to severe bias regarding which epitopes are represented. Depending on your use case, databases or study data with a specific content might be better suited.

Overall, there is a trade-off between precision and recall that depends on the strictness of our query, i.e. on which information we use for defining our clonotypes (see clonotype definition). Unfortunately, there are no general guidelines regarding what information should be taken into account. We will demonstrate various degrees of strictness when querying our data against VDJdb.

This database can be conveniently accessed via Scirpy. Information on TCR, epitope, and experimental setup is stored in vdjdb.obs following the common scanpy/scirpy format.

In [5]:
vdjdb = ir.datasets.vdjdb() 
vdjdb.obs.head(5)

cell_id,multi_chain,species,mhc.a,mhc.b,mhc.class,antigen.epitope,antigen.gene,antigen.species,reference.id,method.identification,...,IR_VDJ_2_sequence_id,IR_VJ_1_v_call,IR_VJ_2_v_call,IR_VDJ_1_v_call,IR_VDJ_2_v_call,IR_VJ_1_v_cigar,IR_VJ_2_v_cigar,IR_VDJ_1_v_cigar,IR_VDJ_2_v_cigar,has_ir
0,False,HomoSapiens,HLA-B*08,B2M,MHCI,FLKEKGGL,Nef,HIV-1,PMID:15596521,tetramer-sort,...,,TRAV26-1*01,,TRBV13*01,,,,,,True
1,False,HomoSapiens,HLA-B*08,B2M,MHCI,FLKEKGGL,Nef,HIV-1,PMID:15596521,tetramer-sort,...,,,,TRBV13*01,,,,,,True
2,False,HomoSapiens,HLA-B*08,B2M,MHCI,FLKEKGGL,Nef,HIV-1,PMID:15596521,tetramer-sort,...,,TRAV20*01,,TRBV13*01,,,,,,True
3,False,HomoSapiens,HLA-B*08,B2M,MHCI,FLKEKGGL,Nef,HIV-1,PMID:15596521,tetramer-sort,...,,TRAV2*01,,TRBV13*01,,,,,,True
4,False,HomoSapiens,HLA-B*08,B2M,MHCI,FLKEKGGL,Nef,HIV-1,PMID:15596521,tetramer-sort,...,,TRAV38-2/DV8*01,,TRBV14*01,,,,,,True


Queries between two datasets are already integrated in Scirpy. Since they are not available in every toolkit, we will show an example of how to perform such a query manually. However, it is generally more convenient to convert the database into a toolkit's format and work with the toolkit's native functions.

In [6]:
# TODO
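
One way to perform such a query by hand is an exact join on the CDR3-beta amino-acid sequence. The following is a minimal sketch with pandas on toy tables. The column names `cell_id` and `cdr3_beta`, the sequences, and the sequence-to-antigen pairs are made up for illustration and do not correspond to the actual layout or content of `adata_tcr.obs` or `vdjdb.obs`.

```python
import pandas as pd

# Toy query table standing in for our single-cell data (hypothetical columns).
query = pd.DataFrame({
    "cell_id": ["cell_1", "cell_2", "cell_3"],
    "cdr3_beta": ["CASSAAAAF", "CASSGGGGF", "CASSLLLLF"],
})

# Toy reference table standing in for a database export (made-up pairs).
reference = pd.DataFrame({
    "cdr3_beta": ["CASSAAAAF", "CASSAAAAF", "CASSLLLLF", "CASSTTTTF"],
    "antigen.species": ["CMV", "CMV", "EBV", "HIV-1"],
})

# Drop duplicated database rows first so that each cell is not multiplied
# by identical reference entries, then keep every cell with a left join.
matches = query.merge(reference.drop_duplicates(), on="cdr3_beta", how="left")

# Cells without a database hit carry NaN in the annotation column.
n_matched = int(matches["antigen.species"].notna().sum())
print(matches)
print(f"{n_matched} of {len(matches)} cells have an exact CDR3-beta match")
```

A left join keeps unmatched cells in the result, which makes it easy to see how many cells remain without annotation.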

The same query can be performed via toolkits such as Scirpy. First, we calculate the overlap between the TCRs of the Query (our single-cell data) and the Atlas dataset (here: VDJdb). The resulting matrices for the VJ and VDJ receptor arms (n_ours x n_vdjdb) will contain a 1 where the TCRs match between both datasets and a 0 otherwise.

We provide the following additional information to the function call:
- metric='identity': only exact sequence matches are considered. For other metrics see (XXX).
- sequence='aa': TCR sequences are compared at the amino-acid level instead of the nucleotide level, since specificity depends on the protein structure.
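
To make the shape and content of such an overlap matrix concrete, here is a small sketch in plain Python. It only illustrates the 0/1 matrix described above and is not how Scirpy computes or stores it internally. The function name `identity_matrix` and the toy sequences are hypothetical.

```python
def identity_matrix(query_seqs, reference_seqs):
    """Return a len(query_seqs) x len(reference_seqs) matrix of 0/1 flags.

    Entry [i][j] is 1 if query sequence i is identical to reference
    sequence j, and 0 otherwise.
    """
    return [[int(q == r) for r in reference_seqs] for q in query_seqs]


# Toy amino-acid sequences (made up for illustration).
ours = ["CASSAAAAF", "CASSGGGGF"]
atlas = ["CASSGGGGF", "CASSLLLLF", "CASSAAAAF"]

overlap = identity_matrix(ours, atlas)
for row in overlap:
    print(row)
```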

In [7]:
ir.pp.ir_dist(adata_tcr, vdjdb, metric='identity', sequence='aa') 

Trying to set attribute `._uns` of view, copying.


Next, we will match all cells between the Query and the Atlas dataset based on the conditions we provide:
- metric='identity', sequence='aa': we need to provide the same values as used during distance calculation.
- receptor_arms='VDJ': compare TCRs based on CDR3-beta. Other options are 'VJ' (alpha chain), 'all' (alpha and beta chain), and 'any' (either alpha or beta chain).
- dual_ir='primary_only': as discussed in XXX, T cells can carry a secondary TCR. This parameter determines on which receptor the query is conducted. Other options are 'any' (primary or secondary receptor), and 'all' (primary and secondary receptor).
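
The effect of the `receptor_arms` options can be sketched as a combination of two per-chain match flags. This is a simplified reading of the option descriptions above, not Scirpy's implementation: it ignores `dual_ir` and missing chains, and the helper `combine_arms` is hypothetical.

```python
def combine_arms(vj_match, vdj_match, receptor_arms):
    """Combine alpha (VJ) and beta (VDJ) match flags for one cell pair."""
    if receptor_arms == "VJ":
        return vj_match                  # alpha chain only
    if receptor_arms == "VDJ":
        return vdj_match                 # beta chain only
    if receptor_arms == "all":
        return vj_match and vdj_match    # both chains must match
    if receptor_arms == "any":
        return vj_match or vdj_match     # one matching chain is enough
    raise ValueError(f"unknown receptor_arms: {receptor_arms!r}")


# A pair of cells whose alpha chains match but whose beta chains differ.
for option in ("VJ", "VDJ", "all", "any"):
    print(option, combine_arms(vj_match=True, vdj_match=False, receptor_arms=option))
```

The stricter the combination, the fewer pairs count as a match, which is the precision and recall trade-off discussed earlier.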

In [8]:
ir.tl.ir_query(adata_tcr, vdjdb, metric='identity', sequence='aa', receptor_arms='VDJ', dual_ir='primary_only') 

100%|█████████████████████████████████████████████████████████████████████████████| 6191/6191 [00:12<00:00, 503.06it/s]


Finally, we can annotate the TCRs in the single-cell data with the annotation of matching entries of the database.
- metric='identity', sequence='aa': use the same values as provided earlier.
- include_ref_cols: list of columns from the reference database that will be added to the query dataset. Here, we will add the species of the antigen and the epitope sequence.
- suffix: we mark the different queries we will conduct with a suffix.

In [11]:
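# Annotate with a suffix so that the beta-chain (VDJ) query can be told apart from later queries
ir.tl.ir_query_annotate(adata_tcr, vdjdb, metric='identity', sequence='aa', 
                        include_ref_cols=['antigen.species', 'antigen.epitope'], suffix='_VDJ')
adata_tcr.obs['antigen.species_VDJ'].value_counts()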

100%|████████████████████████████████████████████████████████████████████████████| 1690/1690 [00:00<00:00, 3236.13it/s]


CMV                 601
EBV                  83
HIV-1                70
ambiguous            55
SARS-CoV-2           12
HomoSapiens           8
InfluenzaA            7
HCV                   4
SIV                   1
YFV                   1
DENV2                 1
TriticumAestivum      1
MCMV                  1
Name: antigen.species_VDJ, dtype: int64

We observe several matches to different diseases. These annotations must be viewed with care, since they depend on the abundance of the disease in the databases, wrong annotations in the databases, false matches due to using incomplete information, and possible MHC restrictions of epitopes. However, from the great abundance of disease-specific TCRs, we could deduce that one patient has a latent infection with Cytomegalovirus (CMV) and Epstein–Barr virus (EBV), which are common [cite] in the population. However, the third most common disease association is with the less frequent HIV-1.

Next, we perform a similar query, but regarding only the alpha chain. We will indicate this assignment by using the suffix '_VJ'.

In [12]:
ir.tl.ir_query(adata_tcr, vdjdb, metric='identity', sequence='aa', receptor_arms='VJ', dual_ir='primary_only') 
ir.tl.ir_query_annotate(adata_tcr, vdjdb, metric='identity', sequence='aa', 
                        include_ref_cols=['antigen.species', 'antigen.epitope'], suffix='_VJ')
adata_tcr.obs['antigen.species_VJ'].value_counts()

100%|█████████████████████████████████████████████████████████████████████████████| 5209/5209 [00:10<00:00, 519.55it/s]
100%|████████████████████████████████████████████████████████████████████████████| 4064/4064 [00:01<00:00, 3250.00it/s]


ambiguous       927
CMV             466
InfluenzaA      349
MCMV            143
SARS-CoV-2       55
EBV              43
HomoSapiens      27
HCV              11
HIV-1             6
YFV               3
HSV-2             1
Homo sapiens      1
Name: antigen.species_VJ, dtype: int64

Note that we receive more matches for the alpha chain, since it is less variable than the beta chain. This results in many TCRs that cannot be uniquely assigned to a specific disease and are therefore 'ambiguous'. As expected, we can still observe many matches to CMV, reinforcing the likelihood of an infection. However, there is only limited indication of HIV-specific TCRs.
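
How several database hits for one cell might collapse into a single label can be sketched as follows. This is an assumption about the general idea behind the 'ambiguous' label, not a description of Scirpy's exact rules. The helper `summarise_matches` is hypothetical.

```python
def summarise_matches(labels):
    """Collapse all database labels matched by one cell into a single value.

    Returns None without any match, the label itself if all matches agree,
    and 'ambiguous' if the matches point to different antigens.
    """
    unique = set(labels)
    if not unique:
        return None
    if len(unique) == 1:
        return next(iter(unique))
    return "ambiguous"


print(summarise_matches([]))                      # no hit in the database
print(summarise_matches(["CMV", "CMV"]))          # consistent hits
print(summarise_matches(["CMV", "InfluenzaA"]))   # conflicting hits
```

Less variable chains match more database entries, so conflicting hits and therefore 'ambiguous' labels become more frequent.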

Finally, we query identical sequences on both receptors:

In [13]:
ir.tl.ir_query(adata_tcr, vdjdb, metric='identity', sequence='aa', receptor_arms='all', dual_ir='primary_only') 
ir.tl.ir_query_annotate(adata_tcr, vdjdb, metric='identity', sequence='aa', 
                        include_ref_cols=['antigen.species', 'antigen.epitope'], suffix='_fullIR')
adata_tcr.obs['antigen.species_fullIR'].value_counts()

100%|█████████████████████████████████████████████████████████████████████████████| 6722/6722 [00:14<00:00, 451.45it/s]
100%|██████████████████████████████████████████████████████████████████████████████| 328/328 [00:00<00:00, 3011.47it/s]


CMV            87
InfluenzaA     26
EBV            21
ambiguous      16
HomoSapiens     5
HIV-1           4
YFV             3
SARS-CoV-2      1
SIV             1
Name: antigen.species_fullIR, dtype: int64

Since this query is far more restrictive, we receive fewer matches. However, these matches are likely to be of higher quality.
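
To compare the queries side by side, one could count the annotated cells per suffix and check how often the strict annotation agrees with a looser one. The sketch below uses a toy table in place of `adata_tcr.obs`; the column names follow the suffixes seen in the outputs above, while the values are made up.

```python
import pandas as pd

# Toy stand-in for adata_tcr.obs after the three queries (made-up values).
obs = pd.DataFrame(
    {
        "antigen.species_VDJ": ["CMV", "EBV", None, "HIV-1"],
        "antigen.species_VJ": ["ambiguous", "CMV", "CMV", None],
        "antigen.species_fullIR": ["CMV", None, None, None],
    },
    index=["cell_1", "cell_2", "cell_3", "cell_4"],
)

# Number of annotated cells per query: stricter queries annotate fewer cells.
n_annotated = obs.notna().sum()
print(n_annotated)

# Among cells annotated by both the beta-only and the full-receptor query,
# compute the fraction on which the two annotations agree.
both = obs.dropna(subset=["antigen.species_VDJ", "antigen.species_fullIR"])
agreement = (both["antigen.species_VDJ"] == both["antigen.species_fullIR"]).mean()
print(f"agreement between VDJ and fullIR annotation: {agreement:.2f}")
```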

## 2.2. Clustering and Distances
Clustering does not directly infer specificity. However, it yields clusters of similar TCR sequences, which are likely to have similar specificities. If a few TCR-epitope pairs are known within the data, their specificity can be assigned to the rest of their cluster.
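
The idea of passing known labels on to the rest of a cluster can be sketched in plain Python. The clustering rule used here (connecting sequences of equal length that differ in at most one position) is an illustrative choice, not a method prescribed by this section. All function names and toy sequences are hypothetical.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of differing positions, or None for sequences of unequal length."""
    if len(a) != len(b):
        return None
    return sum(x != y for x, y in zip(a, b))


def cluster(seqs, max_dist=1):
    """Single-linkage clustering via union-find; returns one cluster id per sequence."""
    parent = list(range(len(seqs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            d = hamming(seqs[i], seqs[j])
            if d is not None and d <= max_dist:
                parent[find(i)] = find(j)
    return [find(i) for i in range(len(seqs))]


def propagate(seqs, known, max_dist=1):
    """Assign a label to every sequence whose cluster has exactly one known label."""
    roots = cluster(seqs, max_dist)
    labels_per_cluster = defaultdict(set)
    for seq, root in zip(seqs, roots):
        if seq in known:
            labels_per_cluster[root].add(known[seq])
    result = {}
    for seq, root in zip(seqs, roots):
        labels = labels_per_cluster[root]
        result[seq] = next(iter(labels)) if len(labels) == 1 else None
    return result


seqs = ["CASSAAAAF", "CASSAAAGF", "CASSAAGGF", "CASSLLLLLLF"]
known = {"CASSAAAAF": "CMV"}  # made-up sequence-to-antigen pair
print(propagate(seqs, known))
```

Clusters with conflicting known labels are left unassigned in this sketch, since propagating either label would be a guess.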


### Distance Metrics

If there are no direct hits, we can infer specificity by searching for similar TCRs within the databases. Several distance metrics exist, which generally follow three directions:
- Sequence alignment: measures how one sequence can be converted into the other. This is the most direct way to measure sequence similarity, but gaps might interrupt sequence motifs needed for TCR-epitope interaction. It results in a very local search.
- k-mer: measures similarity based on small motif segments of length k. This captures preserved motifs within TCR sequences, but overall less similar sequences can match due to random reordering. It may be a more global metric.
- Numeric embeddings: project the TCR sequence into an embedding space in which numeric distances can be calculated. The projection is algorithm dependent; deep learning approaches infer it from data, but are more black-box and less interpretable.
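
The first two directions can be illustrated with generic textbook measures: an edit distance as a stand-in for alignment-based comparison, and a Jaccard index over k-mer sets as a stand-in for k-mer-based comparison. These are sketches of the two families only and do not reproduce the scoring of any specific published tool. Function names and toy sequences are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance: minimal number of substitutions, insertions, and deletions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def kmer_set(seq, k=3):
    """All overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_jaccard(a, b, k=3):
    """Shared k-mers divided by all k-mers of both sequences (1.0 = identical sets)."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    if not ka and not kb:
        return 1.0
    return len(ka & kb) / len(ka | kb)


s1, s2 = "CASSLGF", "CASSGF"  # toy sequences differing by one deleted residue
print(levenshtein(s1, s2))
print(round(kmer_jaccard(s1, s2), 2))
```

Note how a single gap costs only one edit in the alignment view, but removes several shared k-mers in the k-mer view.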

Since no large benchmark between these metrics exists and tests showed generally similar performance, we will focus on one alignment-based and one k-mer-based metric with different requirements on the input data:
- TCRdist for paired data: one of the most commonly used metrics; requires alpha and beta chains plus V genes.
- TCRmatch for beta-chain data alone: a more recent method that is integrated into IEDB.

## 2.3. Epitope Prediction


This alternative approach does not directly rely on comparing TCRs to databases. Instead, binding between a TCR sequence and an epitope is predicted directly, without measuring similarity to the TCRs of a database. If a study provides a guess for the epitope (e.g. due to prior vaccination), we can try to predict binding.
Lately, many machine-learning models have been developed for predicting binding between TCR and epitope. Generally, these methods perform well for epitopes with training data, but often fail to generalize to unseen epitopes.
We therefore recommend investigating the training data of the tool in question to see whether the epitope of interest was contained in it to a large extent. If not, these tools can provide a hint, but their predictions should be treated with caution until the tools are further evaluated in a standardized benchmark for their individual strengths and weaknesses.
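
Such a check of the training data can be as simple as counting how often the epitope of interest occurs among the training examples. The sketch below assumes the training data of a tool is available as a plain list of epitope sequences, which may not hold for every tool. The helper `epitope_coverage` and the toy list are hypothetical.

```python
from collections import Counter


def epitope_coverage(training_epitopes, epitope):
    """Return (count, fraction) of training examples that carry the given epitope."""
    if not training_epitopes:
        return 0, 0.0
    n = Counter(training_epitopes)[epitope]
    return n, n / len(training_epitopes)


# Toy training set (made-up composition): one epitope per training example.
training = ["NLVPMVATV"] * 3 + ["GILGFVFTL"]

for epitope in ("NLVPMVATV", "FLKEKGGL"):
    count, fraction = epitope_coverage(training, epitope)
    print(f"{epitope}: {count} examples ({fraction:.0%} of training data)")
```

An epitope with no or very few training examples indicates that predictions for it rely on generalization to unseen epitopes.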

# 3. BCR
The approaches shown for TCRs apply similarly to BCRs.

## 3.1. Clustering

## 3.2. Databases

- CoV-AbDab: http://opig.stats.ox.ac.uk/webapps/covabdab/
- IEDB: https://www.iedb.org/

## 3.3. Distance Measurements

## 3.4. Prediction

# 4. Takeaways

# 5. Quiz