# Tutorial

## Prepare Inputs

In [1]:
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

## Usage
### Prepare input data
CoMBCR integrates BCR sequences and gene expression profiles. It requires three input files: a BCR sequence file, a gene expression file, and a file containing BCR embeddings generated by a BCR encoder (e.g., AntiBERTa, ESM2).  
- Ensure each file includes an index column labeled "barcode", which serves as a unique identifier for each cell.   
- Verify that the cells are aligned in the same order across all three files.
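
The two requirements above can be checked programmatically before training. The sketch below is a minimal illustration with toy tables; the helper name `check_barcode_alignment` is hypothetical and not part of CoMBCR.

```python
import pandas as pd


def check_barcode_alignment(frames):
    """Raise ValueError unless every table is indexed by 'barcode' in the same order."""
    names = list(frames)
    reference = frames[names[0]].index
    for name, frame in frames.items():
        if frame.index.name != "barcode":
            raise ValueError(f"{name}: index column must be named 'barcode'")
        if not frame.index.is_unique:
            raise ValueError(f"{name}: barcodes must be unique")
        if not frame.index.equals(reference):
            raise ValueError(f"{name}: barcodes differ from {names[0]} in content or order")


# Toy tables standing in for the three input files (hypothetical values).
idx = pd.Index(["cell1", "cell2", "cell3"], name="barcode")
toy_bcr = pd.DataFrame({"IGL": ["AAA", "CCC", "DDD"], "IGH": ["EEE", "FFF", "GGG"]}, index=idx)
toy_rna = pd.DataFrame({"GENE1": [0.1, 0.2, 0.3]}, index=idx)
toy_emb = pd.DataFrame({"0": [1.0, 2.0, 3.0]}, index=idx)

# Passes silently when all three tables share the same barcodes in the same order.
check_barcode_alignment({"bcr": toy_bcr, "rna": toy_rna, "bcrori": toy_emb})
```

With real data, load each file with `pd.read_csv(path, index_col="barcode")` and pass the three tables in the same way.
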
#### BCR sequences file
- This CSV file should include an index column named "barcode" and columns labeled "IGL" and "IGH". 
- IGL/IGH is the concatenation of "fwr1", "cdr1", "fwr2", "cdr2", "fwr3", "cdr3", and "fwr4" of the light chain/heavy chain.
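
If your annotation tool reports each region in its own column, the full-length sequences can be assembled with pandas. This is a minimal sketch: the per-region column names (`fwr1_H`, `cdr1_L`, and so on) and the helper `join_regions` are assumptions for illustration, and the region order is an explicit argument that you should check against your own annotation.

```python
import pandas as pd

# Assumed region order (N- to C-terminus of the variable domain); adjust if needed.
REGIONS = ["fwr1", "cdr1", "fwr2", "cdr2", "fwr3", "cdr3", "fwr4"]


def join_regions(table, suffix, regions=REGIONS):
    """Concatenate per-region columns such as 'fwr1_H' into one sequence per cell.

    Assumes every region is present as a string (no missing values).
    """
    columns = [f"{region}{suffix}" for region in regions]
    return table[columns].agg("".join, axis=1)


# Toy annotation for a single cell (hypothetical fragments, not real regions).
heavy = dict(zip(REGIONS, ["QVQ", "GG", "WI", "YY", "RV", "CAR", "WGQ"]))
light = dict(zip(REGIONS, ["EIV", "RA", "WY", "GA", "GI", "CQQ", "FGQ"]))
annot = pd.DataFrame(
    {
        **{f"{r}_H": [s] for r, s in heavy.items()},
        **{f"{r}_L": [s] for r, s in light.items()},
    },
    index=pd.Index(["cell1"], name="barcode"),
)

bcr_toy = pd.DataFrame(
    {"IGL": join_regions(annot, "_L"), "IGH": join_regions(annot, "_H")}
)
```

The resulting table keeps the "barcode" index and can be written with `bcr_toy.to_csv(...)`.
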
#### Gene expression file
Normalization and log-transformation are recommended. Batch effect removal is advisable if applicable. We suggest using the top 5,000 highly variable genes, though you can select input genes according to your criteria.
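
As a rough illustration of these recommendations, the sketch below normalizes a cells-by-genes count table, log-transforms it, and keeps the most variable genes. It is a simplified stand-in: genes are ranked by plain variance rather than by a dedicated highly-variable-gene method, batch correction is omitted, and the helper name `preprocess_counts` and the `target_sum` value are assumptions.

```python
import numpy as np
import pandas as pd


def preprocess_counts(counts, n_top=5000, target_sum=1e4):
    """Library-size normalize, log1p-transform, and keep the n_top most variable genes.

    Assumes every cell (row) has at least one count.
    """
    totals = counts.sum(axis=1)
    normalized = counts.div(totals, axis=0) * target_sum
    logged = np.log1p(normalized)
    n_keep = min(n_top, logged.shape[1])
    top_genes = logged.var(axis=0).nlargest(n_keep).index
    return logged[top_genes]


# Toy count matrix: 6 cells x 10 genes (hypothetical data, all counts >= 1).
rng = np.random.default_rng(0)
counts = pd.DataFrame(
    rng.poisson(2.0, size=(6, 10)) + 1,
    index=pd.Index([f"cell{i}" for i in range(6)], name="barcode"),
    columns=[f"GENE{j}" for j in range(10)],
)

processed = preprocess_counts(counts, n_top=4)
```

The number of columns in the processed table is the value to pass as `encoderprofile_in_dim` when it differs from 5000.
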


In [2]:
# Load both tables with the cell barcode as the index.
bcr = pd.read_csv("example_pairdata/example_bcr.csv", index_col="barcode")
rna = pd.read_csv("example_pairdata/example_rna.csv", index_col="barcode")
# Cells must appear in the same order in both files.
assert bcr.index.equals(rna.index)

In [3]:
bcr.head()

barcode,cdr3L,IGL,cdr3H,IGH,raw_clonotype_id,sample
AAACCTGAGACTTGAA-1_10x_1,CQQYGTAPFTF,EIVLTQSPGTLSLSPGDRAALSCGASQAVNNNFLAWYQQKPGQAPR...,CARHLKYCTGGSCYSRMVFDSW,QVQLQESGPGLVKPSETLSLTCSVSGGSISSFYWSWIRQPPGRGLE...,clonotype2764,10x_1
AAACCTGGTAGGGACT-1_10x_1,CQYYGSSPSC,EIVLTQSPGTLSLSPGERATLSCRASQSVSSSLLAWYQQKPGQAPR...,CASRFGEFLAVCDFW,EVQLLESGGGLVQPGGSLRLSCAASGFTFSDHAVSWVRQAAGKGLE...,clonotype1232,10x_1
AAACCTGTCAGTCCCT-1_10x_1,CQLFGDSPMYTF,EGVLTQSPGTLSLSPGERATLSCRASQTLNSDFLIWYQLKPGQTPR...,CAHSRKGFCSGETCYSFLETSGYHWFDPW,QVTLKESGPELVKPTQTLTLTCTLSEFPLNSVGMGMGWIRQTPGKT...,clonotype211,10x_1
AAACGGGAGCGACGTA-1_10x_1,CNSYAGNNNYVLF,QSALTQPPSASGSPGQSVTISCTGTSSDVGDSNYVSWYQQHPGKAP...,CARGVKGRFDYW,QLQLQESGSGLVKPSQTLSLTCAVSGGSIGSSSYSWSWIRQPPGKG...,clonotype88,10x_1
AAACGGGAGGGATGGG-1_10x_1,CYSTDSSGNSLF,SYELTQPPSVSVSPGQTARITCSGDALPKKYAYWYQQKSGQAPVLV...,CARGAQYSSGWYDIDYW,QVQLVQSGAEVKKPGASVKVSCKASGYTFTSYGISWVRQAPGQGLE...,clonotype1765,10x_1


In [4]:
rna.head()

barcode,IGHV3.15,IGHV3.48,IGHV1.18,IGHV4.34,IGHV5.51,IGHV2.5,IGHV3.53,IGLV2.8,IGHV1.2,IGHV1.69D,...,RAD21.AS1,UBE2L5,AC005962.2,FAM167A,AC092428.1,CEACAM1,AC114980.1,AC019294.2,HSPA2,DENND5B.AS1
AAACCTGAGACTTGAA-1_10x_1,-0.376008,-0.065586,0.91429,0.021602,-0.099588,-0.13634,-0.234204,-0.149438,0.105756,0.006067,...,0.0,0.007155,0.0,0.0,-0.041768,0.0,0.0,0.0,-0.001868,-0.009476
AAACCTGGTAGGGACT-1_10x_1,0.658475,0.784789,-0.049671,-0.285894,0.017293,0.118015,0.530007,-0.09261,0.086478,0.224125,...,0.0,0.003563,0.0,0.000918,0.0,0.0,0.0,0.023573,0.0,0.003488
AAACCTGTCAGTCCCT-1_10x_1,-0.211308,0.152276,0.209336,0.028572,-0.246877,5.379848,0.053748,-0.031826,0.106044,0.667313,...,0.0,0.0,-0.039742,0.0,0.0,0.005113,0.0,0.0,0.0,0.001509
AAACGGGAGCGACGTA-1_10x_1,0.042634,0.478668,0.106011,0.324261,0.064261,-0.292733,0.874743,6.847327,0.020781,0.681784,...,-0.007744,0.065336,0.0,0.0,-0.001011,0.0,0.002812,-0.009169,0.0,0.008073
AAACGGGAGGGATGGG-1_10x_1,0.045003,0.126678,3.84737,-0.160274,-0.264143,0.090358,-0.008701,0.184167,0.298701,-0.029724,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.010457


### Generate original BCR embeddings

First, download the pre-trained BCR encoder:


In [5]:
from CoMBCR.utils import download_BCRencoder

download_BCRencoder()

Fetching 8 files: 100%|██████████| 8/8 [00:28<00:00,  3.57s/it]

Download Finished. Path /mnt/d/CoMBCR/CoMBCR/BCRencoder





Clone or download "runberta_pair.py" from this GitHub repository. This script encodes the BCRs; the resulting embeddings are used to measure the original distances between BCRs. We recommend our default pre-trained encoder, though any encoder can be used to encode BCRs.

```
python3 runberta_pair.py --datapath "example_pairdata/example_bcr.csv" --outdir "example_outdir" --outfilename "antiberta_embedding.csv"
```

Here we directly use the precomputed original BCR embeddings provided with the example data.

In [5]:
bcrori = pd.read_csv("example_pairdata/example_bcrori.csv", index_col="barcode")
bcrori.head()

barcode,0,1,2,3,4,5,6,7,8,9,...,1014,1015,1016,1017,1018,1019,1020,1021,1022,1023
AAACCTGAGACTTGAA-1_10x_1,-0.026743,-0.664474,-0.444477,0.982374,-0.0562,-0.090404,-0.181163,0.064309,0.139462,-0.009462,...,-0.152017,0.215193,0.473038,-0.206209,0.109365,-0.108639,-0.367183,0.30904,-0.357757,-0.589497
AAACCTGGTAGGGACT-1_10x_1,-0.514521,-0.598449,-0.404644,1.021118,-0.253018,0.093992,-0.366508,0.177612,-0.104526,-0.050579,...,0.010027,0.321669,0.113003,-0.126156,0.341789,-0.211246,-0.517015,0.235695,-0.102193,-0.722848
AAACCTGTCAGTCCCT-1_10x_1,0.082168,-0.43768,-0.455506,1.005682,-0.099462,0.034462,0.027507,-0.01612,-0.080524,-0.378358,...,0.43125,0.392155,0.551969,-0.131442,-0.19472,-0.522121,-0.149101,0.197037,-0.391534,-0.281698
AAACGGGAGCGACGTA-1_10x_1,0.137382,-0.86289,-0.425157,0.81395,-0.498311,0.072822,0.123834,0.367978,-0.162537,-0.148761,...,0.313735,0.489395,0.266964,-0.418712,0.24134,-0.30305,-0.24302,0.138295,-0.184176,-0.638734
AAACGGGAGGGATGGG-1_10x_1,0.161128,-0.334227,-0.600657,0.650591,-0.759796,0.213413,0.108718,0.077125,0.016716,-0.003692,...,0.614696,0.969117,0.398579,-0.450386,0.144531,-0.781121,-0.3271,-0.423524,-0.357415,0.369593


In [6]:
# The embeddings must follow the same cell order as the BCR file.
assert bcr.index.equals(bcrori.index)

## Run CoMBCR

### Parameters
`CoMBCR_main` accepts the following parameters:

**Required**
1. bcrpath: (Required) The path to the BCR sequences file.
2. rnapath: (Required) The path to the gene expression file.
3. bcroriginal: (Required) The path to the original BCR embedding file.
4. outdir: (Required) The directory where the best checkpoint file and the output embeddings will be stored.

**Configuration**
1. checkpoint: default is "best_network.pth". This parameter specifies the name of the file where the best model checkpoint will be saved.
2. lr: default is 1e-6. Learning rate.
3. lam: default is 1e-1. Intra-modal contrastive loss weight (α in paper).
4. batch_size: default 256.
5. epochs: default 200.
6. lr_step: default [40, 100]. These are the milestones for the MultiStepLR scheduler, which adjusts the learning rate at the specified epochs.
7. patience: default 15, the patience for early stopping.
8. save_epoch: default None. If specified (e.g., 150), saves the model at that epoch and exits training. By default, uses early stopping strategy.
9. encoderprofile_in_dim: default 5000. Adjust this parameter if the number of input genes differs from 5000.
10. separatebatch: The default is False. If set to True, BCRs from different samples will be treated as distinct BCRs. Ensure that your BCR input file contains a "sample" column if you choose to enable this option.
11. user_defined_cluster: default False. If set to True, the model uses the custom cluster labels in the 'cluster_label' column of the BCR input file for intra-modal contrastive learning; otherwise, it relies on BCR sequence identity.
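
Several of these options depend on the input tables, so a quick pre-flight check can catch mismatches before training starts. The sketch below is illustrative only: the helper `preflight` is hypothetical and not part of CoMBCR, and it only assembles keyword arguments from the column requirements listed above.

```python
import pandas as pd


def preflight(bcr, rna, separatebatch=False, user_defined_cluster=False):
    """Check required BCR columns and derive keyword arguments from the data."""
    required = ["IGL", "IGH"]
    if separatebatch:
        required.append("sample")  # needed when samples are kept distinct
    if user_defined_cluster:
        required.append("cluster_label")  # needed for custom cluster labels
    missing = [column for column in required if column not in bcr.columns]
    if missing:
        raise ValueError(f"BCR file is missing columns: {missing}")
    return {
        "separatebatch": separatebatch,
        "user_defined_cluster": user_defined_cluster,
        "encoderprofile_in_dim": rna.shape[1],  # number of input genes
    }


# Toy tables (hypothetical values) indexed by barcode.
idx = pd.Index(["cell1", "cell2"], name="barcode")
toy_bcr = pd.DataFrame(
    {"IGL": ["AAA", "CCC"], "IGH": ["EEE", "FFF"], "sample": ["s1", "s2"]}, index=idx
)
toy_rna = pd.DataFrame({"G1": [0.1, 0.2], "G2": [0.0, 0.3], "G3": [1.2, 0.4]}, index=idx)

kwargs = preflight(toy_bcr, toy_rna, separatebatch=True)
```

The returned dictionary can be unpacked into the training call together with the required paths.
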

### Example

The code below returns NumPy arrays for the BCR embeddings and the gene expression (GEX) embeddings. It also writes "bcrembeddings.csv" and "gexembeddings.csv" to the "Embeddings" folder under the outdir you designate.  

***Note***: If CUDA raises an error, the likely cause is a conflict with a previously loaded model. Restart the Jupyter notebook and run the cell below directly.

In [1]:
from CoMBCR.CoMBCRpair import CoMBCR_main

bcremb, gexemb = CoMBCR_main(bcrpath="example_pairdata/example_bcr.csv", 
            rnapath="example_pairdata/example_rna.csv", 
            bcroriginal="example_pairdata/example_bcrori.csv", 
            outdir="outputs",
            epochs=1,
            batch_size=16,
            encoderprofile_in_dim=5000)

learning rate is  1e-06
Adjusting learning rate of group 0 to 1.0000e-06.
Epoch:[0/1]	loss:5.66130	loss_cmc:2.765779	loss_p2p:2.894847	loss_b2b:0.000677
Adjusting learning rate of group 0 to 1.0000e-06.
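
The "Adjusting learning rate" lines in the log come from the learning-rate scheduler. To see how the `lr_step` milestones would change the rate over a longer run, the following sketch evaluates a MultiStepLR-style schedule in plain Python. The decay factor `gamma=0.1` is an assumption (it is PyTorch's default for `MultiStepLR`); this tutorial does not state which value CoMBCR uses.

```python
from bisect import bisect_right


def multistep_lr(base_lr, milestones, epoch, gamma=0.1):
    """Learning rate at a given epoch under a MultiStepLR-style schedule.

    The rate is multiplied by gamma once for every milestone already reached.
    """
    return base_lr * gamma ** bisect_right(sorted(milestones), epoch)


# Defaults from the parameter list: lr=1e-6, lr_step=[40, 100]; gamma is assumed.
schedule = {epoch: multistep_lr(1e-6, [40, 100], epoch) for epoch in (0, 39, 40, 99, 100)}
```

With `epochs=1`, as in the example run, no milestone is reached, which is consistent with the rate staying at 1e-06 in the log.
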


Read the output files "bcrembeddings.csv" and "gexembeddings.csv" from the "Embeddings" folder in the designated output directory. Do not specify an index column when reading these files. Read this way, as in the output below, each table shows a positional "Unnamed: 0" column and a "barcode" column ahead of the 256 embedding dimensions.

In [4]:
bcremb = pd.read_csv("outputs/Embeddings/bcrembeddings.csv")
gexemb = pd.read_csv("outputs/Embeddings/gexembeddings.csv")

In [7]:
bcremb.head()

Unnamed: 0,barcode,0,1,2,3,4,5,6,7,8,...,246,247,248,249,250,251,252,253,254,255
0,AAACCTGCAATAGCAA-1_06,0.032815,-0.063483,-0.024828,0.06312,-0.01404,0.054575,0.021257,-0.038083,-0.007649,...,-0.098875,-0.056584,-0.089965,-0.008764,-0.076704,-0.036625,0.003898,-0.055563,-0.042894,-0.037235
1,AAACCTGCACAACTGT-1_06,0.121508,0.039872,0.012696,0.147247,-0.005573,0.04757,0.058713,0.000451,-0.048308,...,-0.072532,0.021198,-0.013138,-0.014776,-0.049604,-0.017063,0.005151,-0.035945,-0.010601,0.034175
2,AAACCTGCAGCCTGTG-1_06,0.01996,-0.01545,-0.066601,0.129512,0.002069,0.057988,0.071437,0.08109,0.003226,...,-0.156038,0.059611,-0.05169,-0.055863,-0.036513,0.080933,-0.080171,-0.009381,0.027317,0.002608
3,AAACCTGCAGTCAGCC-1_06,0.074969,0.051351,-0.057629,0.03897,0.032714,-0.029748,0.11607,0.03656,-0.006593,...,-0.103539,0.031688,0.044632,-0.106911,0.014068,0.056049,0.008621,-0.016,-0.026184,0.049644
4,AAACCTGGTTCCTCCA-1_06,0.042099,-0.017863,-0.076448,0.095692,0.000738,0.046128,0.07502,0.013586,-0.030703,...,-0.119104,0.026512,-0.028498,-0.044768,-0.012293,0.035605,-0.093089,0.026811,-0.001844,-0.018129


In [8]:
gexemb.head()

Unnamed: 0,barcode,0,1,2,3,4,5,6,7,8,...,246,247,248,249,250,251,252,253,254,255
0,AAACCTGCAATAGCAA-1_06,-0.021588,-0.033296,0.004116,0.029712,-0.086893,-0.055068,0.098709,0.036289,-0.095629,...,0.022271,-0.031552,0.01609,-0.052672,-0.035306,-0.086517,-0.001197,-0.022149,0.00768,0.075515
1,AAACCTGCACAACTGT-1_06,0.036089,-0.047952,0.041654,-0.055746,-0.068391,-0.031658,0.104429,-0.023244,-0.064196,...,0.009589,-0.05498,0.034726,-0.074935,-0.035807,-0.112113,-0.015088,0.003449,-0.009554,0.079182
2,AAACCTGCAGCCTGTG-1_06,-0.01699,-0.051163,0.038526,0.019251,-0.053073,-0.059704,0.102014,0.013154,-0.100988,...,0.000217,-0.052953,0.044267,-0.042444,-0.045406,-0.068724,-0.03193,-0.046555,-0.014433,0.045848
3,AAACCTGCAGTCAGCC-1_06,0.006063,-0.073906,0.019743,-3.6e-05,-0.050769,-0.067077,0.069206,-0.040343,-0.055102,...,-0.035722,-0.093292,0.032075,-0.070056,-0.030867,-0.084757,-0.015546,-0.064913,-0.001539,0.047664
4,AAACCTGGTTCCTCCA-1_06,0.033936,-0.064823,-0.018153,0.00289,-0.088961,-0.029338,0.078806,0.010156,-0.060489,...,-0.026271,-0.091599,0.031854,-0.038834,-0.049545,-0.062961,0.015324,0.016053,-0.030665,0.034966
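
For downstream analysis it is often convenient to separate the numeric embedding matrix from the identifier columns seen in the tables above. The sketch below shows one way to do this; the helper `to_embedding_matrix` is hypothetical, and the toy table only mimics the column layout shown in the output.

```python
import pandas as pd


def to_embedding_matrix(table):
    """Return (float array of embedding dimensions, list of barcodes or None)."""
    # Drop the positional column that read_csv adds when no index column is set.
    table = table.drop(columns=["Unnamed: 0"], errors="ignore")
    barcodes = None
    if "barcode" in table.columns:
        barcodes = table["barcode"].tolist()
        table = table.drop(columns=["barcode"])
    return table.to_numpy(dtype=float), barcodes


# Toy table with the same layout as the output above (hypothetical values).
toy = pd.DataFrame(
    {
        "Unnamed: 0": [0, 1],
        "barcode": ["cellA", "cellB"],
        "0": [0.1, 0.2],
        "1": [0.3, 0.4],
    }
)

matrix, barcodes = to_embedding_matrix(toy)
```

Applied to `bcremb` and `gexemb`, this yields two arrays whose rows can be compared cell by cell, provided both barcode lists are identical.
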


## Visualization

The visualization steps are the same as in `tutorial.ipynb`.