# Analyzing TCR data

From `dandelion>=1.3` onwards, you can analyze 10x single-cell TCR data with the existing setup, in both alpha-beta and gamma-delta formats. Currently, alpha-beta and gamma-delta data sets have to be analyzed separately.

We will download the various input formats of TCR files from 10x's [resource page](https://www.10xgenomics.com/resources/datasets) as part of this tutorial:

```bash
# bash
mkdir -p dandelion_tutorial/sc5p_v2_hs_PBMC_10k;
mkdir -p dandelion_tutorial/sc5p_v1p1_hs_melanoma_10k;

cd dandelion_tutorial/sc5p_v2_hs_PBMC_10k;
wget https://cf.10xgenomics.com/samples/cell-vdj/4.0.0/sc5p_v2_hs_PBMC_10k/sc5p_v2_hs_PBMC_10k_filtered_feature_bc_matrix.h5;
wget https://cf.10xgenomics.com/samples/cell-vdj/4.0.0/sc5p_v2_hs_PBMC_10k/sc5p_v2_hs_PBMC_10k_t_airr_rearrangement.tsv;
wget https://cf.10xgenomics.com/samples/cell-vdj/4.0.0/sc5p_v2_hs_PBMC_10k/sc5p_v2_hs_PBMC_10k_t_filtered_contig_annotations.csv;
wget https://cf.10xgenomics.com/samples/cell-vdj/4.0.0/sc5p_v2_hs_PBMC_10k/sc5p_v2_hs_PBMC_10k_t_filtered_contig.fasta;

cd ../sc5p_v1p1_hs_melanoma_10k;
wget https://cf.10xgenomics.com/samples/cell-vdj/4.0.0/sc5p_v1p1_hs_melanoma_10k/sc5p_v1p1_hs_melanoma_10k_filtered_feature_bc_matrix.h5;
wget https://cf.10xgenomics.com/samples/cell-vdj/4.0.0/sc5p_v1p1_hs_melanoma_10k/sc5p_v1p1_hs_melanoma_10k_t_airr_rearrangement.tsv;
wget https://cf.10xgenomics.com/samples/cell-vdj/4.0.0/sc5p_v1p1_hs_melanoma_10k/sc5p_v1p1_hs_melanoma_10k_t_filtered_contig_annotations.csv;
wget https://cf.10xgenomics.com/samples/cell-vdj/4.0.0/sc5p_v1p1_hs_melanoma_10k/sc5p_v1p1_hs_melanoma_10k_t_filtered_contig.fasta;
```

<b>Import dandelion module</b>

In [None]:
import os
import dandelion as ddl

# change directory to somewhere more workable
os.chdir(os.path.expanduser("~/Downloads/dandelion_tutorial/"))
ddl.logging.print_versions()

I'm showing two examples for reading in the data: with or without reannotation.

<b>Read in AIRR format</b>

In [None]:
# read in the airr_rearrangement.tsv file
file1 = "sc5p_v2_hs_PBMC_10k/sc5p_v2_hs_PBMC_10k_t_airr_rearrangement.tsv"
file2 = "sc5p_v1p1_hs_melanoma_10k/sc5p_v1p1_hs_melanoma_10k_t_airr_rearrangement.tsv"

In [None]:
vdj1 = ddl.read_10x_airr(file1)
vdj1

In [None]:
vdj2 = ddl.read_10x_airr(file2)
vdj2

In [None]:
# let's add the sample_id to each cell barcode and sequence id
# so that barcodes from the two samples don't clash later on
sample_id = "sc5p_v2_hs_PBMC_10k"
vdj1.data["sample_id"] = sample_id
vdj1.data["cell_id"] = [sample_id + "_" + c for c in vdj1.data["cell_id"]]
vdj1.data["sequence_id"] = [
    sample_id + "_" + s for s in vdj1.data["sequence_id"]
]

sample_id = "sc5p_v1p1_hs_melanoma_10k"
vdj2.data["sample_id"] = sample_id
vdj2.data["cell_id"] = [sample_id + "_" + c for c in vdj2.data["cell_id"]]
vdj2.data["sequence_id"] = [
    sample_id + "_" + s for s in vdj2.data["sequence_id"]
]

# combine into a singular object
vdj = ddl.concat([vdj1, vdj2])
vdj
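
To see why the prefix matters, here is a small stand-alone sketch. The barcodes are made up and `add_sample_prefix` is a hypothetical helper, not part of `dandelion`; it only mirrors the list comprehensions used above.

```python
# Toy illustration with made-up barcodes (not taken from the 10x files).
def add_sample_prefix(ids, sample_id):
    """Return a new list with `sample_id` and an underscore prepended to each id."""
    return [f"{sample_id}_{i}" for i in ids]


# The same raw barcode can turn up in two independent 10x runs.
barcodes_a = ["AAACCTGAGAAACCAT-1", "AAACCTGAGACTGTAA-1"]
barcodes_b = ["AAACCTGAGAAACCAT-1", "AAACGGGCAGTTCATG-1"]

# Barcodes shared between the two runs before prefixing.
shared_before = set(barcodes_a) & set(barcodes_b)

prefixed_a = add_sample_prefix(barcodes_a, "sc5p_v2_hs_PBMC_10k")
prefixed_b = add_sample_prefix(barcodes_b, "sc5p_v1p1_hs_melanoma_10k")

# After prefixing, nothing is shared any more.
shared_after = set(prefixed_a) & set(prefixed_b)

print(len(shared_before), len(shared_after))
```

The same prefix has to be applied to the transcriptome barcodes, otherwise the V(D)J and gene expression tables will not match up.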

<b>Read in with reannotation</b>

We specify the `filename_prefix` option because the two samples have different prefixes preceding `_contig.fasta` and `_contig_annotations.csv`.

In [None]:
samples = ["sc5p_v2_hs_PBMC_10k", "sc5p_v1p1_hs_melanoma_10k"]
filename_prefixes = [
    "sc5p_v2_hs_PBMC_10k_t_filtered",
    "sc5p_v1p1_hs_melanoma_10k_t_filtered",
]
ddl.pp.format_fastas(samples, prefix=samples, filename_prefix=filename_prefixes)
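
If a prefix is mistyped, the files will not be found. A quick check like the one below, run before `format_fastas`, can catch that early. It is only a sketch: `expected_inputs` and `missing_inputs` are hypothetical helpers, and I'm assuming the downloaded layout from the start of this tutorial, i.e. `<sample>/<filename_prefix>_contig.fasta` and `<sample>/<filename_prefix>_contig_annotations.csv`.

```python
from pathlib import Path


def expected_inputs(sample, filename_prefix, root="."):
    """Build the two input paths assumed to sit inside a sample folder."""
    folder = Path(root) / sample
    return [
        folder / f"{filename_prefix}_contig.fasta",
        folder / f"{filename_prefix}_contig_annotations.csv",
    ]


def missing_inputs(samples, filename_prefixes, root="."):
    """Return every expected input file that is not present on disk."""
    missing = []
    for sample, prefix in zip(samples, filename_prefixes):
        for path in expected_inputs(sample, prefix, root):
            if not path.is_file():
                missing.append(path)
    return missing


samples = ["sc5p_v2_hs_PBMC_10k", "sc5p_v1p1_hs_melanoma_10k"]
filename_prefixes = [
    "sc5p_v2_hs_PBMC_10k_t_filtered",
    "sc5p_v1p1_hs_melanoma_10k_t_filtered",
]

# Report anything that cannot be found relative to the current directory.
for path in missing_inputs(samples, filename_prefixes):
    print("missing:", path)
```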

Make sure to set `loci = 'tr'` for TCR data. I'm setting `reassign_dj = True` to try and force a reassignment of J genes (and D genes where possible) with stricter cutoffs.

In [None]:
ddl.pp.reannotate_genes(
    samples, loci="tr", reassign_dj=True, filename_prefix=filename_prefixes
)

There's no need to run the rest of the preprocessing steps.

We'll read in the reannotated files as follows:

In [None]:
import pandas as pd

tcr_files = []
for sample in samples:
    file_location = (
        sample + "/dandelion/" + sample + "_t_filtered_contig_dandelion.tsv"
    )
    tcr_files.append(pd.read_csv(file_location, sep="\t"))
# ignore_index=True already gives a fresh 0..n-1 index
tcr = pd.concat(tcr_files, ignore_index=True)
tcr

The reannotated file can be used with dandelion as per the BCR tutorial.

For the rest of the tutorial, I'm going to show how to proceed with 10x's AIRR format file instead as there are some minor differences.

<b>Import modules for use with scanpy</b>

In [None]:
import anndata as ad
import scanpy as sc

import warnings

warnings.filterwarnings("ignore")
sc.logging.print_header()

<b>Import the transcriptome data</b>

In [None]:
gex_files = {
    "sc5p_v2_hs_PBMC_10k": "sc5p_v2_hs_PBMC_10k/sc5p_v2_hs_PBMC_10k_filtered_feature_bc_matrix.h5",
    "sc5p_v1p1_hs_melanoma_10k": "sc5p_v1p1_hs_melanoma_10k/sc5p_v1p1_hs_melanoma_10k_filtered_feature_bc_matrix.h5",
}

In [None]:
adata_list = []
for f in gex_files:
    adata_tmp = sc.read_10x_h5(gex_files[f], gex_only=True)
    adata_tmp.obs["sample_id"] = f
    adata_tmp.obs_names = [f + "_" + x for x in adata_tmp.obs_names]
    adata_tmp.var_names_make_unique()
    adata_list.append(adata_tmp)
adata = ad.concat(adata_list)
adata

<b>Run QC on the transcriptome data.</b>

In [None]:
ddl.pp.recipe_scanpy_qc(adata)
adata

<b>Filtering TCR data.</b>

Note that I'm using the `Dandelion` object as input rather than the pandas dataframe (both types of input will work; in fact, a file path to the .tsv will work too).

In [None]:
adata.obs

In [None]:
# The function will return both objects.
vdj, adata = ddl.pp.check_contigs(vdj, adata, library_type="tr-ab")

<b>Check the output V(D)J table</b>

In [None]:
vdj

<b>Check the AnnData object as well</b>

In [None]:
adata

<b>The number of cells that actually have a matching TCR contig can be tabulated.</b>

In [None]:
pd.crosstab(adata.obs["has_contig"], adata.obs["chain_status"])

<b>Now actually filter the AnnData object and run through a standard workflow starting by filtering genes and normalizing the data</b>

Because the 'filtered' `AnnData` object was returned as a filtered but otherwise unprocessed object, we still need to normalize and run through the usual process here. The following is just a standard scanpy workflow.

In [None]:
# filter genes
sc.pp.filter_genes(adata, min_cells=3)
# Normalize the counts
sc.pp.normalize_total(adata, target_sum=1e4)
# Logarithmize the data
sc.pp.log1p(adata)
# Stash the normalised counts
adata.raw = adata
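
For intuition, the arithmetic behind the normalization and log steps can be written out by hand. This is a plain-Python sketch on made-up counts, not scanpy's implementation: each count is divided by the cell's total, multiplied by `target_sum`, and passed through `log(1 + x)`. `normalize_and_log` is a hypothetical helper.

```python
import math


def normalize_and_log(counts, target_sum=1e4):
    """Scale one cell's counts to `target_sum`, then apply the natural log(1 + x)."""
    total = sum(counts)
    return [math.log1p(c / total * target_sum) for c in counts]


# Two made-up cells with the same composition but a 10-fold difference in depth.
shallow_cell = [5, 0, 15, 80]  # 100 counts in total
deep_cell = [50, 0, 150, 800]  # 1000 counts in total

shallow_values = normalize_and_log(shallow_cell)
deep_values = normalize_and_log(deep_cell)

# After normalization the two cells are comparable gene by gene.
for a, b in zip(shallow_values, deep_values):
    print(round(a, 3), round(b, 3))
```

The point of the toy example is that sequencing depth drops out, so differences that remain reflect composition rather than how deeply a cell was sampled.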

<b>Identify highly-variable genes</b>

In [None]:
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
sc.pl.highly_variable_genes(adata)

<b>Filter the genes to only those marked as highly-variable</b>

In [None]:
# take a copy so the in-place steps that follow do not operate on a view
adata = adata[:, adata.var.highly_variable].copy()

<b>Regress out effects of total counts per cell and the percentage of mitochondrial genes expressed. Scale the data to unit variance.</b>

In [None]:
sc.pp.regress_out(adata, ["total_counts", "pct_counts_mt"])
sc.pp.scale(adata, max_value=10)

<b>Run PCA</b>

In [None]:
sc.tl.pca(adata, svd_solver="arpack")
sc.pl.pca_variance_ratio(adata, log=True, n_pcs=50)

<b>Computing the neighborhood graph, umap and clusters</b>

In [None]:
# Computing the neighborhood graph
sc.pp.neighbors(adata)
# Embedding the neighborhood graph
sc.tl.umap(adata)
# Clustering the neighborhood graph
sc.tl.leiden(adata)

<b>Visualizing the clusters and whether or not there's a corresponding contig.</b>

In [None]:
sc.pl.umap(adata, color=["leiden", "chain_status"])

<b>Visualizing some T cell genes.</b>

In [None]:
sc.pl.umap(adata, color=["CD3E", "CD8B"])

<b>Find clones.</b>

<div class="alert alert-info">

Note

Here we specify `identity = 1` so only cells with identical CDR3 nucleotide sequences (`key = 'junction'`) are grouped into clones/clonotypes.

</div>

In [None]:
ddl.tl.find_clones(vdj, identity=1, key="junction")
vdj
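
As a rough picture of what `identity = 1` means, the toy sketch below groups cells whose junction strings are exactly identical. It is deliberately simplified: the sequences are made up, `group_by_identical_junction` is a hypothetical helper, and it ignores everything else a real clone definition may take into account (for example gene usage or chain pairing).

```python
from collections import defaultdict


def group_by_identical_junction(cells):
    """Map each distinct junction sequence to the list of cell ids carrying it."""
    groups = defaultdict(list)
    for cell_id, junction in cells:
        groups[junction].append(cell_id)
    return dict(groups)


# Made-up (cell id, junction nucleotide sequence) pairs.
cells = [
    ("cell1", "TGTGCCAGCAGTTTC"),
    ("cell2", "TGTGCCAGCAGTTTC"),  # identical to cell1, so same group
    ("cell3", "TGTGCCAGCAGCTTC"),  # one mismatch, so a separate group
]

clones = group_by_identical_junction(cells)
for junction, members in clones.items():
    print(junction, members)
```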

<b>Generate TCR network.</b>

The 10x-provided AIRR file is missing columns like `sequence_alignment` and `sequence_alignment_aa`, so we will use the next best thing, which is `sequence` or `sequence_aa`. Note that these columns are not gapped.

Specify `key = 'sequence_aa'` to toggle this behavior. You can also try `junction` or `junction_aa` if you just want to visualise the CDR3 linkage.

In [None]:
# I'm removing the Orphan VJ cells (lacking a TRB chain, i.e. VDJ information).
vdj = vdj[
    vdj.metadata.chain_status.isin(
        ["Single pair", "Extra pair", "Extra pair-exception", "Orphan VDJ"]
    )
].copy()

In [None]:
ddl.tl.generate_network(vdj, key="sequence_aa")
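
Because `sequence_aa` is not gapped, sequences from different cells can differ in length, so a position-by-position comparison is not well defined. An edit distance copes with that. The sketch below is a plain-Python Levenshtein distance on made-up CDR3-like strings. I'm assuming this is the kind of metric behind the network, so check the `generate_network` documentation for what is actually computed.

```python
def levenshtein(a, b):
    """Edit distance between two strings, using a two-row dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


# Made-up amino acid sequences of different lengths.
seq_a = "CASSLGQGYEQYF"
seq_b = "CASSLGGYEQYF"  # seq_a with one residue removed

print(levenshtein(seq_a, seq_a))
print(levenshtein(seq_a, seq_b))
```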

In [None]:
vdj

<b>Plotting in scanpy.</b>

In [None]:
ddl.tl.transfer(
    adata, vdj
)  # this will include singletons. To show only expanded clones, specify expanded_only=True

In [None]:
sc.set_figure_params(figsize=[5, 5])
ddl.pl.clone_network(adata, color=["sample_id"], edges_width=1, size=15)

In [None]:
adata

In [None]:
sc.set_figure_params(figsize=[4.5, 5])
ddl.pl.clone_network(
    adata,
    color=[
        "chain_status",
        "rearrangement_status_VDJ",
        "rearrangement_status_VJ",
    ],
    ncols=1,
    legend_fontoutline=3,
    size=10,
    edges_width=1,
)

In [None]:
ddl.tl.transfer(adata, vdj, expanded_only=True)
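
The effect of keeping only expanded clones can be pictured with a toy example. I'm assuming here that "expanded" means a clone with more than one cell. The clone labels are made up and `expanded_clones` is a hypothetical helper, not a `dandelion` function.

```python
from collections import Counter


def expanded_clones(clone_ids, min_size=2):
    """Return the set of clone labels seen in at least `min_size` cells."""
    sizes = Counter(clone_ids)
    return {clone for clone, n in sizes.items() if n >= min_size}


# Made-up mapping of cell id to clone label.
cell_to_clone = {
    "cell1": "A",
    "cell2": "A",
    "cell3": "B",  # singleton
    "cell4": "C",
    "cell5": "C",
    "cell6": "C",
}

keep = expanded_clones(cell_to_clone.values())
expanded_cells = [cell for cell, clone in cell_to_clone.items() if clone in keep]
print(sorted(keep), expanded_cells)
```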

In [None]:
sc.set_figure_params(figsize=[5, 5])
ddl.pl.clone_network(adata, color=["sample_id"], edges_width=1, size=50)

In [None]:
sc.set_figure_params(figsize=[4.5, 5])
ddl.pl.clone_network(
    adata,
    color=[
        "locus_status",
        "rearrangement_status_VDJ",
        "rearrangement_status_VJ",
    ],
    ncols=1,
    legend_fontoutline=3,
    edges_width=1,
    size=50,
)

### Using `scirpy` to plot
You can also use `scirpy`'s functions to plot the network. 

A likely use case is if you have a lot of cells and you don't want to wait for `dandelion` to generate the layout because it's taking too long. Or you simply prefer scirpy's style of plotting.

You can run `ddl.tl.generate_network(..., compute_layout = False)`, which finishes very quickly because it skips the layout. After transferring to `scirpy`, use its plotting functions to visualise the networks. Generating the clone network itself is fast; it is the spring layout that takes quite a while to compute.

In [None]:
import scirpy as ir

ir.tl.clonotype_network(adata, min_cells=2)
ir.pl.clonotype_network(adata, color="clone_id", panel_size=(7, 7))

You can change the clonotype labels by transferring with a different `clone_key`, for example one numbered in order from the largest clone to the smallest.

In [None]:
ddl.tl.transfer(adata, vdj, clone_key="clone_id_by_size")
ir.tl.clonotype_network(adata, clonotype_key="clone_id_by_size", min_cells=2)
ir.pl.clonotype_network(adata, color="clone_id_by_size", panel_size=(7, 7))
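
The idea of labelling clones by size can be sketched in a few lines. This is a toy illustration on made-up labels: `rank_clones_by_size` is a hypothetical helper, and breaking ties alphabetically is my own assumption rather than documented behaviour.

```python
from collections import Counter


def rank_clones_by_size(clone_ids):
    """Number clones 1, 2, 3, ... from the largest to the smallest."""
    sizes = Counter(clone_ids)
    # Sort by descending size; ties are broken by label (an assumption).
    ordered = sorted(sizes, key=lambda clone: (-sizes[clone], clone))
    return {clone: rank for rank, clone in enumerate(ordered, start=1)}


# Made-up clone label per cell.
clone_per_cell = ["B", "A", "A", "C", "C", "C"]
ranks = rank_clones_by_size(clone_per_cell)
print(ranks)
```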

You can also transfer with the clones collapsed so that they are plotted as pie charts, the way `scirpy` does it.

In [None]:
ddl.tl.transfer(adata, vdj, clone_key="clone_id_by_size", collapse_nodes=True)
ir.tl.clonotype_network(adata, clonotype_key="clone_id_by_size", min_cells=2)
ir.pl.clonotype_network(adata, color="sample_id", panel_size=(7, 7))

<b>Finish.</b>

We can save the files.

In [None]:
adata.write("adata_tcr.h5ad", compression="gzip")

In [None]:
vdj.write_h5ddl("dandelion_results_tcr.h5ddl", compression="gzip")