# Outlook: Multi-modal Approaches

1. Introduction
2. TESSA
3. CoNGA
4. mvTCR
5. Questions
6. Take Aways

## 1. Introduction

With the development of novel single-cell technologies, IR sequencing is often combined with other omics layers, especially transcriptomics (e.g. []). For B and T cells, this enables the analysis of various characteristics at the single-cell level: while the transcriptome provides insights into the current state of the cell, the IR is indicative of the cell's specificity and thereby helps explain the cell's fate upon infection or vaccination.

While studies often provide paired measurements for both modalities, these are usually analyzed individually, utilizing only limited shared information. Often the IR sequence serves mainly as a barcode to track the trajectory of cells over time [] or across activation states []. The transcriptomic layer is then used to identify the ... for clonotypes of interest. However, this underutilizes the additional information provided by paired data, since IRs and gene expression have been shown to be interlinked: adaptive immune cells recognizing the same epitopes will undergo a similar development upon activation. Accordingly, adaptive immune cells with identical or similar IRs have been shown to express a similar phenotype.

Only recently, three methods were developed to jointly utilize TCR sequences and the transcriptome, each aiming at different aspects of the analysis. Due to their novelty, these approaches have not been subjected to benchmark testing, and a detailed external evaluation of their strengths and weaknesses is therefore missing. The use of the following methods is therefore still marked as experimental for the purpose of this tutorial. Note that all methods were developed for TCRs in their original publications. However, due to the developmental similarity between B and T cells, the methods might also be applicable to B cells after careful evaluation.

### 1.1 Data Preparation

In [12]:
import os

In [1]:
import scanpy as sc
import pandas as pd
import numpy as np
import seaborn as sb

We will load the already preprocessed data provided by the authors using scanpy for data handling.

In [9]:
path_data = '../data/haniffa21.processed.h5ad'
adata_full = sc.read(path_data)

This dataset contains data from various immune cell types. All algorithms require paired IR-GEX data, so we will select all T cells. We do this by keeping only cells whose cell type annotation is a T-cell subtype.

In [10]:
tags_tcells = ['CD8', 'CD4', 'Treg', 'MAIT']
# Copy the subset so that later modifications do not act on a view of adata_full
adata_tcr = adata_full[adata_full.obs['initial_clustering'].isin(tags_tcells)].copy()
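
The selection above relies on a boolean mask built with pandas' `isin`. A minimal sketch of the same mechanism on a toy annotation table, assuming hypothetical cell barcodes and the non-T-cell labels `B_cell` and `NK` (chosen for illustration, not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for adata_full.obs; barcodes and the labels 'B_cell' and 'NK' are hypothetical.
obs = pd.DataFrame(
    {'initial_clustering': ['CD8', 'B_cell', 'CD4', 'NK', 'Treg']},
    index=['cell_1', 'cell_2', 'cell_3', 'cell_4', 'cell_5'],
)
tags_tcells = ['CD8', 'CD4', 'Treg', 'MAIT']

# True for every cell whose annotation is one of the T-cell subtypes
is_tcell = obs['initial_clustering'].isin(tags_tcells)
obs_tcr = obs[is_tcell]

print(f'{is_tcell.sum()} of {len(obs)} cells kept')
print(obs_tcr['initial_clustering'].value_counts())
```

Checking the per-subtype counts after filtering is a quick way to catch misspelled tags, since `isin` silently ignores labels that do not occur in the data (here `MAIT`).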

## 2. TESSA
... et al. developed TCR functional landscape estimation supervised with scRNA-seq analysis (TESSA), which aims at embedding and clustering T-cell clones based on their TCR sequence and transcriptome via Bayesian modelling. The CDR3beta sequence is first compressed to a 30-dimensional numeric representation using a pretrained autoencoder. Next, the dimensions are weighted so that the TCR representation correlates with the gene expression of groups of similar TCRs, thereby assigning each dimension an importance for explaining the cells' gene expression. In an iterative process, weights and groups are updated until convergence to reach a maximal alignment between both modalities.
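
To make the effect of per-dimension weights more concrete, the following toy example shows how reweighting changes which TCR embeddings count as close. It is a minimal sketch with made-up 3-dimensional vectors and a hypothetical helper `weighted_distance`; it does not reproduce TESSA's Bayesian model or its update rules.

```python
import math

def weighted_distance(emb_a, emb_b, weights):
    """Weighted Euclidean distance between two TCR embedding vectors."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(emb_a, emb_b, weights)))

# Hypothetical 3-dimensional embeddings (the text above describes 30 dimensions)
tcr_1 = [0.0, 1.0, 2.0]
tcr_2 = [0.0, 1.0, 5.0]
tcr_3 = [4.0, 1.0, 2.0]

uniform = [1.0, 1.0, 1.0]
# Assumption for illustration: the third dimension carries no information about expression
reweighted = [1.0, 1.0, 0.0]

print(weighted_distance(tcr_1, tcr_2, uniform))     # 3.0
print(weighted_distance(tcr_1, tcr_2, reweighted))  # 0.0
print(weighted_distance(tcr_1, tcr_3, reweighted))  # 4.0
```

With uniform weights, `tcr_1` is closer to `tcr_2` than to `tcr_3`; once the third dimension is switched off, `tcr_1` and `tcr_2` become indistinguishable, so they would end up in the same group.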

TESSA produced clusters of high purity when embedding T cells with known epitope specificity from [10x], surpassing the uni-modal model GLIPH [], which is commonly used for clustering TCR sequences. Further, cluster centrality was indicative of higher-avidity clones, as shown by clonal expansion and high ADT counts. Using TESSA on data from [Yost], the authors detected novel clusters of responder T cells in patients undergoing PD-1 blockade.

### 2.1 Data Preparation
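TESSA requires several files in a specific format. As the settings in Section 2.2 show, these are a CSV file with the TCRs (`tcr`), a CSV file with the gene expression (`exp`), the trained encoder (`model`), and the Atchley factors (`embeding_vectors`).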


### 2.2 Running the model
To run the model, we will need to provide different setting options, most of which specify input or output directories. We will summarize them in a dictionary first.

In [13]:
# __file__ is not defined inside a notebook, so use the working directory instead.
# This assumes the TESSA repository is located next to this notebook.
path_file = os.getcwd()
# dir_in and dir_out are not defined in this notebook yet;
# set them to your input and output locations before running this cell.
settings_full = {
    'tcr': f'{dir_in}_tcrs_atlas.csv',
    'model': f'{path_file}/TESSA/BriseisEncoder/TrainedEncoder.h5',
    'embeding_vectors': f'{path_file}/TESSA/BriseisEncoder/Atchley_factors.csv',
    'output_TCR': f'{dir_out}/tessa_tcr_embedding.csv',
    'output_log': f'{dir_out}/tessa_log.log',
    'exp': f'{dir_in}_scRNA_atlas.csv',
    'output_tessa': f'{dir_out}/res/',
    'within_sample_networks': 'FALSE',
}

In the next step, we create the command for running TESSA by appending each setting key and its value as a command-line option.

In [None]:
command_full = f'python {path_file}/TESSA/Tessa_main.py'
for key, value in settings_full.items():
    command_full += f' -{key} {value}'
command_full

Finally, we run the model on the specified settings by calling the command.

In [None]:
os.system(command_full)
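
String concatenation followed by `os.system` breaks when a path contains spaces, and it does not raise an error if the call fails. A possible alternative, sketched here with hypothetical paths and a hypothetical helper `build_command`, is to assemble the arguments as a list and hand them to `subprocess.run`:

```python
import shlex

def build_command(script, settings):
    """Return the command as an argument list: python <script> -key value ..."""
    command = ['python', script]
    for key, value in settings.items():
        command += [f'-{key}', str(value)]
    return command

# Hypothetical settings for illustration only; the space in the path is deliberate
settings_demo = {
    'tcr': 'my data/tcrs.csv',
    'within_sample_networks': 'FALSE',
}
command_demo = build_command('TESSA/Tessa_main.py', settings_demo)

# Show a shell-safe rendering of the command for inspection
print(shlex.join(command_demo))

# To execute the command (not run here):
# import subprocess
# subprocess.run(command_demo, check=True)  # raises CalledProcessError on a non-zero exit code
```

Because every argument is its own list element, no manual quoting is needed, and `check=True` surfaces a failed run immediately instead of leaving it hidden in the log file.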

### 2.3 Output

## 3. CoNGA
Clonotype neighbor graph analysis (CoNGA) builds similarity graphs at the clonotype level for both GEX and TCR; for the latter, the well-known distance metric TCRdist [] is used. Based on the graph neighborhoods, clusters are formed by shared GEX and TCR assignments. These so-called "CoNGA clusters" thereby share similar receptor sequences as well as similar gene expression.
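
The core idea, comparing each clonotype's neighborhood in the GEX graph with its neighborhood in the TCR graph, can be illustrated with a toy example. This is a minimal sketch with made-up distance matrices and hypothetical helper names (`knn`, `neighbor_overlap`); it only counts shared neighbors and does not reproduce CoNGA's actual scoring or clustering.

```python
def knn(dist, k):
    """Return the k nearest neighbors of every item, given a square distance matrix."""
    neighbors = {}
    for i, row in enumerate(dist):
        others = [j for j in range(len(row)) if j != i]
        neighbors[i] = set(sorted(others, key=lambda j: row[j])[:k])
    return neighbors

def neighbor_overlap(dist_gex, dist_tcr, k):
    """Count, per clonotype, how many neighbors are shared between both graphs."""
    nb_gex = knn(dist_gex, k)
    nb_tcr = knn(dist_tcr, k)
    return {i: len(nb_gex[i] & nb_tcr[i]) for i in nb_gex}

# Made-up pairwise distances between four clonotypes
dist_gex = [
    [0, 1, 5, 6],
    [1, 0, 5, 6],
    [5, 5, 0, 1],
    [6, 6, 1, 0],
]
dist_tcr = [
    [0, 1, 4, 9],
    [1, 0, 9, 4],
    [4, 9, 0, 8],
    [9, 4, 8, 0],
]

print(neighbor_overlap(dist_gex, dist_tcr, k=1))  # {0: 1, 1: 1, 2: 0, 3: 0}
```

Clonotypes 0 and 1 are nearest neighbors in both modalities, so they are candidates for a joint GEX-TCR cluster, whereas clonotypes 2 and 3 are similar in expression only.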

CoNGA clusters contained clonotypes with highly similar TCRs based on, e.g., length and physicochemical properties and were shown to capture specificity well on []. Among other things, CoNGA was used to identify clusters of MHC-independent HOBIT/HELIOS-type T cells in data from [].

Note that CoNGA offers support for BCR sequences, as stated on its GitHub page. However, this was not part of the evaluation in the original publication.

## 4. mvTCR
mvTCR by An et al. is a multiview variational autoencoder that compresses TCR sequence and gene expression into a lower-dimensional representation []. Two deep learning architectures, a Transformer and a multi-layer perceptron, extract information from TCR and GEX, respectively, before they are fused to derive the joint space. Afterwards, the trained models can be used to embed similar data.
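
Before a Transformer can process a TCR, the amino acid sequence has to be turned into a fixed-length numeric input, whereas gene expression already is a numeric vector. A minimal sketch of such an encoding, assuming a 20-letter amino acid vocabulary, a padding token of 0, and a hypothetical helper `tokenize_cdr3` (mvTCR's own preprocessing may differ):

```python
AMINO_ACIDS = 'ACDEFGHIKLMNPQRSTVWY'
PAD = 0
# Map every amino acid to an integer token; 0 is reserved for padding
TOKEN = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}

def tokenize_cdr3(seq, max_len):
    """Convert a CDR3 amino acid sequence to a list of integer tokens of length max_len."""
    tokens = [TOKEN[aa] for aa in seq[:max_len]]   # truncate overly long sequences
    return tokens + [PAD] * (max_len - len(tokens))  # pad short sequences on the right

# Made-up sequence fragment for illustration
print(tokenize_cdr3('CASSF', max_len=8))  # [2, 1, 16, 16, 5, 0, 0, 0]
```

Padding to a common length lets sequences of different lengths be stacked into one batch, which is what a sequence model typically expects as input.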

The authors showed that multi-modal models capture antigen specificity better than uni-modal embeddings on the data from [] for prediction and clustering. Additionally, they showed that cell type and cell functionality are preserved in the embedding space on a SARS-CoV-2 dataset from [].

## 5. Questions

Why could it be useful to integrate IR-sequence information with gene expression?
- GEX can be used to improve IR sequence reads.
+ Both modalities provide different insights into the cell, while still being interdependent.
- Since both modalities capture the same information, integrating them provides an additional quality check.

What information does the IR sequence provide that is not directly captured in GEX?
- A count matrix between cell and antibody-tagged epitope bindings.
+ The cell's clonotype, and thereby its ancestry, is defined by the IR sequence.
+ The IR sequence determines specificity and is therefore a barcode for recognizing the same epitope.

On what premise do the approaches above rely?
+ Cells with identical or similar IRs often have a similar phenotype.
- IR and GEX provide orthogonal information, since they are independent.
- Knowledge is transferred from large gene expression datasets, into which IR data can be mapped.

## 6. Take Aways

- Cell functionality (determined by IR) and cell state (observed via GEX) are interlinked. Cells with similar IR sequences share similar phenotypes [].
- Due to the inherent structural difference between count matrices (GEX) and amino acid sequences (IR), it is difficult to directly fuse both modalities.
- Several methods were developed to utilize paired GEX-IR data relying on different approaches such as Bayesian Models [], Graph Theory [], and Deep Learning [].
- Due to the novelty of all methods and the lack of standardized evaluation, they have not yet been independently benchmarked and are hard to compare.
- All methods were developed for TCRs. While they can in theory be applied to BCR data and partially offer a BCR interface (CoNGA), this was not part of the original publications and has not been evaluated yet.