# TCR Profiling Analysis for anti-GAD65 Study: Preprocessing

**Author**: Sumanta Barman


This notebook performs T-cell receptor (TCR) preprocessing and analysis for immune cell profiling in anti-GAD65 patients and controls.

## Overview
- **Objective**: Process 10x Genomics single-cell TCR sequencing data
- **Samples**: GAD65 patients and controls from PBMC and CSF compartments
- **Pipeline**: Dandelion-based TCR annotation, clonotype identification, and network analysis

## Requirements
Key packages:
- dandelion
- scanpy
- scirpy
- numpy, pandas, matplotlib, seaborn
- anndata, muon
- scrublet, polyleven

## Reference Databases
Ensure the following databases are available; each name below is the environment variable that points to the database location:
- IGDATA: IgBLAST database
- GERMLINE: Germline V(D)J sequences
- BLASTDB: BLAST database

---

## 1. Import Libraries

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Single-cell analysis
import scanpy as sc
import scirpy as ir
import dandelion as ddl
import anndata

# Additional dependencies
import scrublet as scr
from polyleven import levenshtein
import muon as mu

# Set display options
sc.settings.verbosity = 3
sc.settings.set_figure_params(dpi=80, facecolor='white')

print("Libraries imported successfully")

## 2. Configure Reference Databases

Set environment variables for IgBLAST and germline databases. Update paths according to your system configuration.
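
Once the variables are set, a quick check that each one points to an existing directory can catch path typos before IgBLAST runs. The following is a minimal sketch under that assumption; `missing_db_paths` is a hypothetical helper and not part of Dandelion.

```python
import os
from pathlib import Path

# Environment variables named in this notebook; the helper itself is hypothetical.
REQUIRED_DB_VARS = ("IGDATA", "GERMLINE", "BLASTDB")


def missing_db_paths(env=None, required=REQUIRED_DB_VARS):
    """Return the variables that are unset or do not point to an existing directory."""
    env = os.environ if env is None else env
    missing = []
    for name in required:
        value = env.get(name)
        if not value or not Path(value).is_dir():
            missing.append(name)
    return missing


# Demonstration with an explicit mapping instead of the real environment
example_env = {"IGDATA": os.getcwd(), "GERMLINE": "", "BLASTDB": "/no/such/dir"}
print(missing_db_paths(example_env))
```

Calling `missing_db_paths()` with no arguments inspects the real environment; an empty list means all three paths resolve to directories.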

In [None]:
# Configure database paths - UPDATE THESE PATHS FOR YOUR SYSTEM
os.environ["IGDATA"] = "/database/igblast/"
os.environ["GERMLINE"] = "/database/germlines/"
os.environ["BLASTDB"] = "/database/blast/"

print(f"IGDATA: {os.environ['IGDATA']}")
print(f"GERMLINE: {os.environ['GERMLINE']}")
print(f"BLASTDB: {os.environ['BLASTDB']}")

## 3. Data Preprocessing

### 3.1 Define Sample List

Samples include GAD65 patients and controls from both PBMC and CSF compartments.
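
The sample names used in this notebook follow the pattern `<condition>_<compartment>_<donor>_filtered`. A small sketch of how that pattern could be split into metadata fields is shown here; `parse_sample_name` is a hypothetical helper introduced only for illustration.

```python
def parse_sample_name(sample):
    """Split '<condition>_<compartment>_<donor>_filtered' into its parts."""
    parts = sample.split("_")
    if len(parts) != 4 or parts[3] != "filtered":
        raise ValueError(f"Unexpected sample name format: {sample}")
    condition, compartment, donor, _ = parts
    return {"condition": condition, "compartment": compartment, "donor": donor}


print(parse_sample_name("GAD_PBMC_9910_filtered"))
print(parse_sample_name("control_CSF_R21_filtered"))
```

Parsing the names this way makes it easy to confirm that every donor has both a PBMC and a CSF sample.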

In [None]:
# Set working directory - UPDATE THIS PATH
data_dir = '/Sample_data/TCR'
os.chdir(data_dir)
print(f"Working directory: {os.getcwd()}")

# Define sample identifiers
samples = [
    # GAD PBMC samples
    'GAD_PBMC_9910_filtered', 'GAD_PBMC_9961_filtered', 
    'GAD_PBMC_R3_filtered', 'GAD_PBMC_R14_filtered',
    # GAD CSF samples
    'GAD_CSF_9910_filtered', 'GAD_CSF_9961_filtered', 
    'GAD_CSF_R3_filtered', 'GAD_CSF_R14_filtered',
    # Control CSF samples
    'control_CSF_8360_filtered', 'control_CSF_8361_filtered', 
    'control_CSF_R21_filtered', 'control_CSF_R31_filtered',
    'control_CSF_R32_filtered', 'control_CSF_R34_filtered', 
    # Control PBMC samples
    'control_PBMC_8360_filtered', 'control_PBMC_8361_filtered',
    'control_PBMC_R21_filtered', 'control_PBMC_R31_filtered', 
    'control_PBMC_R32_filtered', 'control_PBMC_R34_filtered'
]

filename_prefixes = ['filtered'] * len(samples)  # one Cell Ranger file prefix per sample

print(f"Total samples: {len(samples)}")
print("Sample breakdown:")
print(f"  - GAD PBMC: {len([s for s in samples if 'GAD_PBMC' in s])}")
print(f"  - GAD CSF: {len([s for s in samples if 'GAD_CSF' in s])}")
print(f"  - Control PBMC: {len([s for s in samples if 'control_PBMC' in s])}")
print(f"  - Control CSF: {len([s for s in samples if 'control_CSF' in s])}")

### 3.2 Format FASTA Headers

Prepare Cell Ranger FASTA files for downstream processing with Dandelion.
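
Before formatting, it may be worth confirming that each sample folder contains the expected Cell Ranger outputs. This sketch assumes the files are named `<prefix>_contig.fasta` and `<prefix>_contig_annotations.csv` (with the prefix `filtered` used in this notebook); `samples_missing_inputs` is a hypothetical helper, and the demonstration runs in a temporary directory.

```python
import tempfile
from pathlib import Path


def samples_missing_inputs(sample_dirs, filename_prefix="filtered", root="."):
    """Map each sample folder to the expected input files it lacks."""
    # Assumed file names; adjust if your Cell Ranger outputs are named differently.
    expected = (
        f"{filename_prefix}_contig.fasta",
        f"{filename_prefix}_contig_annotations.csv",
    )
    missing = {}
    for sample in sample_dirs:
        absent = [name for name in expected
                  if not (Path(root) / sample / name).is_file()]
        if absent:
            missing[sample] = absent
    return missing


# Demonstration: a sample folder that has the FASTA but not the annotation CSV
with tempfile.TemporaryDirectory() as root:
    sample_dir = Path(root) / "GAD_PBMC_9910_filtered"
    sample_dir.mkdir()
    (sample_dir / "filtered_contig.fasta").touch()
    report = samples_missing_inputs(["GAD_PBMC_9910_filtered"], root=root)

print(report)
```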

In [None]:
# Format FASTA headers for compatibility with Dandelion
ddl.pp.format_fastas(samples, prefix=samples, filename_prefix=filename_prefixes)
print("FASTA formatting complete")

### 3.3 Reannotate V(D)J Genes

Use IgBLAST to reannotate V, D, and J gene segments for TCR sequences.

In [None]:
# Reannotate genes using IgBLAST (loci='tr' for T-cell receptors)
ddl.pp.reannotate_genes(
    samples, 
    loci="tr",  # T-cell receptor loci
    filename_prefix=filename_prefixes
)
print("Gene reannotation complete")

### 3.4 Load and Concatenate VDJ Data

Read Dandelion-annotated TCR data and combine all samples.

In [None]:
# Load VDJ data from all samples
vdj_list = []
for sample in samples:
    vdj_path = f"{sample}/dandelion/filtered_contig_dandelion.tsv"
    vdj = ddl.read_10x_airr(vdj_path)
    vdj_list.append(vdj)
    print(f"Loaded {sample}: {vdj.shape[0]} contigs")

# Concatenate all VDJ objects
vdj = ddl.concat(vdj_list)
print(f"\nTotal contigs after concatenation: {vdj.shape[0]}")

# Save combined VDJ object
vdj.write_h5ddl('Tcell_TCR_filtered_contig_VDJ.h5ddl', complib='bzip2')
print("VDJ data saved to Tcell_TCR_filtered_contig_VDJ.h5ddl")

## 4. Integrate with Transcriptomics Data

### 4.1 Load AnnData Object

Load the processed single-cell RNA-seq data (AnnData format).

In [None]:
# Load transcriptomics data - UPDATE THIS PATH
adata_path = "/seurat_object_combined_singlets_integrated.h5ad"
adata = sc.read(adata_path)
print(f"Loaded AnnData: {adata.shape[0]} cells × {adata.shape[1]} genes")
adata

### 4.2 Prepare Metadata

Standardize cell barcodes and sample identifiers for integration with VDJ data.

In [None]:
# Create sample_id column
adata.obs["sample_id"] = adata.obs["sample"]

# Ensure unique variable names
adata.var_names_make_unique()

print("Metadata prepared")
adata

### 4.3 Standardize Cell Barcodes

Clean cell barcodes and prepend sample identifiers to match VDJ data format.
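
The target format can be illustrated on plain strings. This sketch assumes the VDJ cell IDs have the form `<sample_id>_<barcode>` with no trailing suffix; `standardize_barcode` is a hypothetical helper. Unlike a split on the first hyphen, it strips only trailing numeric suffixes, which is a variant that stays safe if a barcode ever carries a hyphen earlier in the string.

```python
import re

# Trailing numeric suffixes such as '-1' or '-1-0'
_SUFFIX = re.compile(r"(-\d+)+$")


def standardize_barcode(barcode, sample_id):
    """Strip trailing '-<n>' suffixes and prepend '<sample_id>_'."""
    return f"{sample_id}_{_SUFFIX.sub('', barcode)}"


print(standardize_barcode("AAACCTGAGCGTAGTG-1-0", "GAD_PBMC_9910_filtered"))
print(standardize_barcode("AAACCTGAGCGTAGTG-1", "control_CSF_R21_filtered"))
```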

In [None]:
def clean_index(index):
    """Keep the text before the first hyphen, dropping suffixes such as '-1' or '-1-0'."""
    return index.split('-')[0]

# Clean barcodes
adata.obs.index = adata.obs.index.map(clean_index)

# Convert sample_id to string
adata.obs['sample_id'] = adata.obs['sample_id'].astype(str)

# Prepend sample_id to barcodes
adata.obs.index = adata.obs['sample_id'] + '_' + adata.obs.index

print("Cell barcodes standardized")
print(f"Example barcode: {adata.obs.index[0]}")

### 4.4 Link Transcriptomics and VDJ Data

Match TCR sequences to their corresponding cells in the transcriptomics data.
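
If the barcode formats do not agree, linking silently yields few or no matches, so a quick overlap count is a useful sanity check. The sketch below works on plain collections of strings; `barcode_overlap` is a hypothetical helper, and the toy IDs are made up for illustration.

```python
def barcode_overlap(rna_barcodes, vdj_cell_ids):
    """Count barcodes shared between, and unique to, the two modalities."""
    rna, vdj_ids = set(rna_barcodes), set(vdj_cell_ids)
    return {
        "shared": len(rna & vdj_ids),
        "rna_only": len(rna - vdj_ids),
        "vdj_only": len(vdj_ids - rna),
    }


# Toy example with made-up cell IDs
rna_example = ["s1_AAAC", "s1_AAAG", "s1_AAAT"]
vdj_example = ["s1_AAAC", "s1_AAAT", "s1_CCCC"]
overlap = barcode_overlap(rna_example, vdj_example)
print(overlap)
```

In this notebook the inputs would presumably be `adata.obs_names` and the index of `vdj.metadata`, assuming the latter holds one row per cell ID. A `shared` count near zero usually points to a prefix or suffix mismatch.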

In [None]:
# Check and link VDJ contigs with AnnData object
vdj, adata = ddl.pp.check_contigs(vdj, adata, library_type="tr-ab")

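# Compare has_contig as a string so the count works whether the flag is stored as bool or text
has_tcr = adata.obs['has_contig'].astype(str) == 'True'
print(f"Cells with TCR data: {has_tcr.sum()}")
print(f"Cells without TCR data: {(~has_tcr).sum()}")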

### 4.5 Subset T Cells with TCR

Create a filtered dataset containing only T cells with recovered TCR sequences.

In [None]:
# Subset cells with TCR contigs
adata_contig = adata[adata.obs['has_contig'].astype(str) == 'True'].copy()

print(f"T cells with TCR: {adata_contig.shape[0]} cells")
adata_contig

## 5. Clonotype Analysis

### 5.1 Identify Clonotypes

Define T-cell clonotypes based on identical CDR3 junction sequences.
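
The idea of exact-match clonotypes can be shown with a deliberately simplified toy: cells whose junction strings are identical fall into the same group. This is not Dandelion's algorithm, which applies its own grouping rules beyond junction identity; `group_by_exact_junction` and the toy sequences are hypothetical.

```python
from collections import defaultdict


def group_by_exact_junction(cells):
    """Group cell IDs by identical junction sequence.

    cells: iterable of (cell_id, junction) pairs.
    """
    groups = defaultdict(list)
    for cell_id, junction in cells:
        groups[junction].append(cell_id)
    return dict(groups)


# Toy nucleotide junctions, made up for illustration
toy_cells = [
    ("cell1", "TGTGCCAGCAGTTTC"),
    ("cell2", "TGTGCCAGCAGTTTC"),
    ("cell3", "TGTGCCAGCAGCTTC"),
]
toy_clones = group_by_exact_junction(toy_cells)
print(toy_clones)
```

Here `cell1` and `cell2` share a junction and form one group, while the single-nucleotide difference in `cell3` places it in its own group.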

In [None]:
# Find clones using exact junction sequence matching
# identity=1 means 100% sequence identity required
ddl.tl.find_clones(vdj, identity=1, key="junction")

print("Clonotype identification complete")
print(f"Unique clonotypes: {vdj.metadata['clone_id'].nunique()}")

### 5.2 Filter Chain Status

Retain cells with valid TCR chain configurations.

In [None]:
# Filter for valid chain configurations
valid_chains = [
    "Single pair", "Extra pair", "Extra pair-exception", 
    "Orphan VDJ", 'extra VDJ', 'extra VJ', 'two full chains'
]

print(f"Before filtering: {vdj.shape[0]} contigs")
vdj = vdj[vdj.metadata.chain_status.isin(valid_chains)].copy()
print(f"After filtering: {vdj.shape[0]} contigs")

### 5.3 Generate TCR Network

Build a network graph connecting related TCR sequences.
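
Sequence relatedness is commonly measured with Levenshtein (edit) distance, which is also what the `polyleven` import at the top of this notebook provides. The pure-Python sketch below illustrates only the distance itself, on the assumption that network edges reflect some such measure; it does not reproduce how `generate_network` builds or lays out the graph.

```python
def levenshtein_distance(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


# Two toy CDR3 amino-acid sequences differing at one position
print(levenshtein_distance("CASSLGQAYEQYF", "CASSLGQGYEQYF"))
```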

In [None]:
# Generate TCR similarity network
ddl.tl.generate_network(vdj)
print("TCR network generated")

### 5.4 Calculate Clone Size Categories

Categorize clonotypes by expansion level. Each `max_size` value caps the size categories, so clones at or above that size are grouped into a single top category.
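
The effect of a size cap can be sketched with a toy counter. `clone_size_labels` is a hypothetical helper, and the `">= N"` label format is an assumption made for illustration; the labels Dandelion writes may be formatted differently.

```python
from collections import Counter


def clone_size_labels(clone_ids, max_size=None):
    """Label each clone by its cell count, capping at max_size when given."""
    sizes = Counter(clone_ids)
    labels = {}
    for clone, n in sizes.items():
        if max_size is not None and n >= max_size:
            labels[clone] = f">= {max_size}"  # assumed label format
        else:
            labels[clone] = str(n)
    return labels


# One clone ID per cell: clone A has 4 cells, B has 2, C has 1
toy_clone_ids = ["A", "A", "A", "A", "B", "B", "C"]
capped = clone_size_labels(toy_clone_ids, max_size=3)
print(capped)
```

With `max_size=3`, clone `A` lands in the capped category while `B` and `C` keep their exact sizes. Running several caps, as the cell below does, gives progressively finer views of expansion.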

In [None]:
# Calculate clone sizes with different thresholds
ddl.tl.clone_size(vdj)
ddl.tl.clone_size(vdj, max_size=3)
ddl.tl.clone_size(vdj, max_size=5)
ddl.tl.clone_size(vdj, max_size=10)

print("Clone size categories calculated")

## 6. Transfer VDJ Data to AnnData

Integrate TCR information into the transcriptomics object for downstream analysis.

In [None]:
# Transfer VDJ annotations to AnnData object
ddl.tl.transfer(adata_contig, vdj)

print("VDJ data transferred to AnnData object")
print(f"New columns added: {[col for col in adata_contig.obs.columns if 'clone' in col.lower() or 'tcr' in col.lower()]}")

## 7. Save Processed Data

Export integrated datasets for downstream analysis and visualization.

In [None]:
# Define output filenames
adata_output = "seurat_object_combined_singlets_integrated_GAD_control_CSF_PBMC_Tcell_TCR_transfered_ADATA.h5ad"
vdj_output = "seurat_object_combined_singlets_integrated_GAD_control_CSF_PBMC_Tcell_TCR_transfered_VDJ.h5ddl"

# Save processed data
adata_contig.write(adata_output)
vdj.write_h5ddl(vdj_output)  # same writer used for the combined VDJ object in section 3.4

print("Analysis complete!")
print(f"  - AnnData saved: {adata_output}")
print(f"  - VDJ saved: {vdj_output}")
print(f"\nFinal dataset: {adata_contig.shape[0]} T cells with TCR data")

---

## Session Information

For reproducibility, document package versions and system information.
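
Versions of specific packages can also be captured with the standard library, which is handy for writing them to a log file. This is a minimal sketch; `package_versions` is a hypothetical helper, and the distribution names (for example `sc-dandelion` for Dandelion) are assumed to match how the packages were installed.

```python
from importlib.metadata import PackageNotFoundError, version


def package_versions(names):
    """Return {distribution name: version string or 'not installed'}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = "not installed"
    return found


versions = package_versions(["sc-dandelion", "scanpy", "scirpy", "numpy", "pandas"])
for name, ver in versions.items():
    print(f"{name}: {ver}")
```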

In [None]:
# Print session information
sc.logging.print_versions()